本篇博文主要内容为 2025-11-13 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。

目录

概览 (2025-11-13)

今日共更新494篇论文,其中:

  • 自然语言处理73篇(Computation and Language (cs.CL))
  • 人工智能144篇(Artificial Intelligence (cs.AI))
  • 计算机视觉96篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习154篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] SynClaimEval: A Framework for Evaluating the Utility of Synthetic Data in Long-Context Claim Verification

【速读】: 该论文旨在解决长上下文场景下生成式AI(Generative AI)进行事实核查(claim verification)任务时,高质量标注数据稀缺导致模型训练与评估受限的问题。其解决方案的关键在于提出SynClaimEval框架,通过合成数据(synthetic data)的可控生成策略,在三个维度上系统评估合成数据的有效性:输入特征(如上下文长度和跨域泛化能力)、合成逻辑(如论断复杂度与错误类型多样性)以及解释质量(即模型推理过程是否提供与预测一致的证据)。实验表明,长上下文合成数据不仅能提升基础指令微调模型的事实核查性能,尤其在增强已有真人标注数据集时效果显著,还能独立改善模型解释的一致性,从而同时增强模型的准确性与可解释性。

链接: https://arxiv.org/abs/2511.09539
作者: Mohamed Elaraby,Jyoti Prakash Maheswari
机构: University of Pittsburgh (匹兹堡大学); Zillow Inc. (Zillow公司)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) with extended context windows promise direct reasoning over long documents, reducing the need for chunking or retrieval. Constructing annotated resources for training and evaluation, however, remains costly. Synthetic data offers a scalable alternative, and we introduce SynClaimEval, a framework for evaluating synthetic data utility in long-context claim verification – a task central to hallucination detection and fact-checking. Our framework examines three dimensions: (i) input characteristics, by varying context length and testing generalization to out-of-domain benchmarks; (ii) synthesis logic, by controlling claim complexity and error type variation; and (iii) explanation quality, measuring the degree to which model explanations provide evidence consistent with predictions. Experiments across benchmarks show that long-context synthesis can improve verification in base instruction-tuned models, particularly when augmenting existing human-written datasets. Moreover, synthesis enhances explanation quality, even when verification scores do not improve, underscoring its potential to strengthen both performance and explainability.
zh

[NLP-1] Readability Measures and Automatic Text Simplification: In the Search of a Construct

【速读】: 该论文旨在解决自动文本简化(Automatic Text Simplification, ATS)评估中缺乏统一标准的问题,特别是现有研究多关注自动评估指标与人工判断之间的相关性,而忽视了可读性度量(readability measures)在其中的作用。其关键解决方案在于系统性地考察可读性度量与人类判断及ATS自动评估指标之间的相关性,发现三者之间普遍存在较低的相关性,从而指出当前ATS评估体系中存在概念界定不清的问题,并呼吁建立更明确的简化效果评价构念(construct)以提升评估的科学性和一致性。

链接: https://arxiv.org/abs/2511.09536
作者: Rémi Cardon,A. Seza Doğruöz
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Readability is a key concept in the current era of abundant written information. To help making texts more readable and make information more accessible to everyone, a line of researched aims at making texts accessible for their target audience: automatic text simplification (ATS). Lately, there have been studies on the correlations between automatic evaluation metrics in ATS and human judgment. However, the correlations between those two aspects and commonly available readability measures (such as readability formulas or linguistic features) have not been the focus of as much attention. In this work, we investigate the place of readability measures in ATS by complementing the existing studies on evaluation metrics and human judgment, on English. We first discuss the relationship between ATS and research in readability, then we report a study on correlations between readability measures and human judgment, and between readability measures and ATS evaluation metrics. We identify that in general, readability measures do not correlate well with automatic metrics and human judgment. We argue that as the three different angles from which simplification can be assessed tend to exhibit rather low correlations with one another, there is a need for a clear definition of the construct in ATS.
zh

[NLP-2] AdaCuRL: Adaptive Curriculum Reinforcement Learning with Invalid Sample Mitigation and Historical Revisiting

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在强化学习(Reinforcement Learning, RL)训练过程中因直接使用混合难度样本所引发的梯度饥饿(Gradient Starvation)和策略退化(Policy Degradation)问题,同时克服现有方法依赖高成本链式思维(Chain-of-Thought, CoT)标注数据或受限于人工设计课程学习(Curriculum Learning)策略所带来的困难,如难度不匹配、灾难性遗忘(Catastrophic Forgetting)等。其解决方案的关键在于提出AdaCuRL框架,该框架通过粗到细的难度估计机制与自适应课程调度策略动态对齐数据难度与模型能力,并引入数据重访机制缓解灾难性遗忘;此外,采用自适应参考策略和稀疏KL散度约束防止策略退化,从而实现更稳定、高效的推理能力提升。

链接: https://arxiv.org/abs/2511.09478
作者: Renda Li,Hailang Huang,Fei Wei,Feng Xiong,Yong Wang,Xiangxiang Chu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has demonstrated considerable potential for enhancing reasoning in large language models (LLMs). However, existing methods suffer from Gradient Starvation and Policy Degradation when training directly on samples with mixed difficulty. To mitigate this, prior approaches leverage Chain-of-Thought (CoT) data, but the construction of high-quality CoT annotations remains labor-intensive. Alternatively, curriculum learning strategies have been explored but frequently encounter challenges, such as difficulty mismatch, reliance on manual curriculum design, and catastrophic forgetting. To address these issues, we propose AdaCuRL, a Adaptive Curriculum Reinforcement Learning framework that integrates coarse-to-fine difficulty estimation with adaptive curriculum scheduling. This approach dynamically aligns data difficulty with model capability and incorporates a data revisitation mechanism to mitigate catastrophic forgetting. Furthermore, AdaCuRL employs adaptive reference and sparse KL strategies to prevent Policy Degradation. Extensive experiments across diverse reasoning benchmarks demonstrate that AdaCuRL consistently achieves significant performance improvements on both LLMs and MLLMs.
zh

[NLP-3] BIG5-TPoT: Predicting BIG Five Personality Traits Facets and Items Through Targeted Preselection of Texts

【速读】: 该论文旨在解决从大量生成文本中预测个体人格特征(如大五人格特质、子维度或具体条目)的难题,尤其针对大语言模型在输入文本量限制下的性能瓶颈。其解决方案的关键在于提出了一种名为“目标文本预筛选”(Targeted Preselection of Texts, TPoT)的新策略,通过语义过滤机制选取与特定人格维度高度相关的文本作为深度学习模型(即BIG5-TPoT模型)的输入,从而在降低冗余信息干扰的同时提升预测精度,显著改善了在Stream of Consciousness Essays数据集上的平均绝对误差(Mean Absolute Error)和准确率指标。

链接: https://arxiv.org/abs/2511.09426
作者: Triet M. Le,Arjun Chandra,C. Anton Rytting,Valerie P. Karuzis,Vladimir Rife,William A. Simpson
机构: The University of Maryland Applied Research Laboratory for Intelligence and Security (ARLIS); The University of Maryland (UMD)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Predicting an individual’s personalities from their generated texts is a challenging task, especially when the text volume is large. In this paper, we introduce a straightforward yet effective novel strategy called targeted preselection of texts (TPoT). This method semantically filters the texts as input to a deep learning model, specifically designed to predict a Big Five personality trait, facet, or item, referred to as the BIG5-TPoT model. By selecting texts that are semantically relevant to a particular trait, facet, or item, this strategy not only addresses the issue of input text limits in large language models but also improves the Mean Absolute Error and accuracy metrics in predictions for the Stream of Consciousness Essays dataset.
zh

[NLP-4] GSAP-ERE: Fine-Grained Scholarly Entity and Relation Extraction Focused on Machine Learning AAAI2026

【速读】: 该论文旨在解决机器学习(Machine Learning, ML)相关研究中细粒度信息提取与连接的难题,以提升科研成果的理解性与可复现性。其核心挑战在于从海量科学文献中自动识别并关联如方法训练、数据使用等关键研究要素。解决方案的关键是构建了一个名为GSAP-ERE的高质量人工标注细粒度数据集,包含10类实体和18类语义关系类型,涵盖100篇ML论文中的6.3万实体提及与3.5万关系实例。该数据集不仅支持下游任务(如知识图谱构建和计算复现性监控),还作为测试套件验证了大语言模型(Large Language Models, LLM)在信息抽取(Information Extraction, IE)中的表现,揭示了监督式微调模型在命名实体识别(NER)和关系抽取(RE)上显著优于当前主流LLM提示策略的性能差距,凸显了此类结构化数据集对推动学术信息抽取领域发展的必要性。

链接: https://arxiv.org/abs/2511.09411
作者: Wolfgang Otto,Lu Gan,Sharmila Upadhyaya,Saurav Karmakar,Stefan Dietze
机构: 1. University of Hamburg (汉堡大学); 2. Indian Institute of Technology (印度理工学院); 3. German Research Center for Artificial Intelligence (德国人工智能研究中心)
类目: Computation and Language (cs.CL)
备注: Accepted at AAAI 2026

点击查看摘要

Abstract:Research in Machine Learning (ML) and AI evolves rapidly. Information Extraction (IE) from scientific publications enables to identify information about research concepts and resources on a large scale and therefore is a pathway to improve understanding and reproducibility of ML-related research. To extract and connect fine-grained information in ML-related research, e.g. method training and data usage, we introduce GSAP-ERE. It is a manually curated fine-grained dataset with 10 entity types and 18 semantically categorized relation types, containing mentions of 63K entities and 35K relations from the full text of 100 ML publications. We show that our dataset enables fine-tuned models to automatically extract information relevant for downstream tasks ranging from knowledge graph (KG) construction, to monitoring the computational reproducibility of AI research at scale. Additionally, we use our dataset as a test suite to explore prompting strategies for IE using Large Language Models (LLM). We observe that the performance of state-of-the-art LLM prompting methods is largely outperformed by our best fine-tuned baseline model (NER: 80.6%, RE: 54.0% for the fine-tuned model vs. NER: 44.4%, RE: 10.1% for the LLM). This disparity of performance between supervised models and unsupervised usage of LLMs suggests datasets like GSAP-ERE are needed to advance research in the domain of scholarly information extraction.
zh

[NLP-5] CARE-Bench: A Benchmark of Diverse Client Simulations Guided by Expert Principles for Evaluating LLM s in Psychological Counseling

【速读】: 该论文旨在解决当前心理咨询服务需求增长与专业服务供给有限之间的矛盾,以及现有评估大型语言模型(Large Language Models, LLMs)在心理咨询场景中能力的基准测试存在不足的问题。具体而言,现有研究受限于非专业的来访者模拟、静态问答评估形式及单一维度指标,难以全面衡量模型应对多样化和复杂客户情境的能力。解决方案的关键在于提出一个名为CARE-Bench的动态交互式自动化评估基准,其基于真实心理咨询案例构建多样化的来访者画像,并依据专家指导进行模拟,同时引入基于成熟心理学量表的多维性能评价体系,从而系统性地评估LLMs在心理咨询任务中的综合表现。

链接: https://arxiv.org/abs/2511.09407
作者: Bichen Wang,Yixin Sun,Junzhe Wang,Hao Yang,Xing Fu,Yanyan Zhao,Si Wei,Shijin Wang,Bing Qin
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The mismatch between the growing demand for psychological counseling and the limited availability of services has motivated research into the application of Large Language Models (LLMs) in this domain. Consequently, there is a need for a robust and unified benchmark to assess the counseling competence of various LLMs. Existing works, however, are limited by unprofessional client simulation, static question-and-answer evaluation formats, and unidimensional metrics. These limitations hinder their effectiveness in assessing a model’s comprehensive ability to handle diverse and complex clients. To address this gap, we introduce \textbfCARE-Bench, a dynamic and interactive automated benchmark. It is built upon diverse client profiles derived from real-world counseling cases and simulated according to expert guidelines. CARE-Bench provides a multidimensional performance evaluation grounded in established psychological scales. Using CARE-Bench, we evaluate several general-purpose LLMs and specialized counseling models, revealing their current limitations. In collaboration with psychologists, we conduct a detailed analysis of the reasons for LLMs’ failures when interacting with clients of different types, which provides directions for developing more comprehensive, universal, and effective counseling models.
zh

[NLP-6] Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque

【速读】: 该论文旨在解决低资源语言(以巴斯克语为例)在多模态大语言模型(Multimodal Large Language Models, MLLMs)中性能不足的问题。其解决方案的关键在于:首先,构建了专为巴斯克语设计的训练与评估图像-文本数据集;其次,通过实验发现,仅需约20%的巴斯克语多模态数据混合到主流模型中即可获得良好的基准表现;更重要的是,研究证实并不需要使用针对巴斯克语优化的指令微调语言模型作为基础架构(backbone),即可实现高性能的巴斯克语MLLM,这打破了传统认知并为其他低资源语言的多模态模型开发提供了可复用的方法论和开放资源。

链接: https://arxiv.org/abs/2511.09396
作者: Lukas Arana,Julen Etxaniz,Ander Salaberria,Gorka Azkune
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current Multimodal Large Language Models exhibit very strong performance for several demanding tasks. While commercial MLLMs deliver acceptable performance in low-resource languages, comparable results remain unattained within the open science community. In this paper, we aim to develop a strong MLLM for a low-resource language, namely Basque. For that purpose, we develop our own training and evaluation image-text datasets. Using two different Large Language Models as backbones, the Llama-3.1-Instruct model and a Basque-adapted variant called Latxa, we explore several data mixtures for training. We show that: i) low ratios of Basque multimodal data (around 20%) are already enough to obtain solid results on Basque benchmarks, and ii) contrary to expected, a Basque instructed backbone LLM is not required to obtain a strong MLLM in Basque. Our results pave the way to develop MLLMs for other low-resource languages by openly releasing our resources.
zh

[NLP-7] AMaPO: Adaptive Margin-attached Preference Optimization for Language Model Alignment AAAI2026

【速读】: 该论文旨在解决离线偏好优化(Offline Preference Optimization)中因排名准确率不足而导致模型对齐效果受限的问题,其核心挑战被作者定义为“过拟合-欠拟合困境”(Overfitting-Underfitting Dilemma):现有边际设计导致模型对正确排序样本施加过多冗余梯度(过拟合),同时对错误排序样本提供不足的修正信号(欠拟合)。解决方案的关键在于提出自适应边缘附着偏好优化(Adaptive Margin-attached Preference Optimization, AMaPO),通过实例级自适应边际机制——结合Z标准化与指数缩放策略——动态重分配学习资源:放大错误排序样本的梯度强度,抑制正确排序样本的梯度贡献,从而有效缓解上述两难问题。

链接: https://arxiv.org/abs/2511.09385
作者: Ruibo Deng,Duanyu Feng,Wenqiang Lei
机构: 未知
类目: Computation and Language (cs.CL)
备注: AAAI 2026 AIA track, our code is available at this https URL

点击查看摘要

Abstract:Offline preference optimization offers a simpler and more stable alternative to RLHF for aligning language models. However, their effectiveness is critically dependent on ranking accuracy, a metric where further gains are highly impactful. This limitation arises from a fundamental problem that we identify and formalize as the Overfitting-Underfitting Dilemma: current margin designs cause models to apply excessive, wasteful gradients to correctly ranked samples (overfitting) while providing insufficient corrective signals for misranked ones (underfitting). To resolve this dilemma, we propose Adaptive Margin-attached Preference Optimization (AMaPO), a simple yet principled algorithm. AMaPO employs an instance-wise adaptive margin, refined by Z-normalization and exponential scaling, which dynamically reallocates learning effort by amplifying gradients for misranked samples and suppressing them for correct ones. Extensive experiments on widely used benchmarks demonstrate that AMaPO not only achieves better ranking accuracy and superior downstream alignment performance, but targeted analysis also confirms that it successfully mitigates the core overfitting and underfitting issues.
zh

[NLP-8] Self-Correcting Large Language Models : Generation vs. Multiple Choice

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在不同任务结构下自校正机制(self-correction mechanism)表现差异的问题,特别是对比开放文本生成与多选题选择两种范式在纠错行为和性能提升上的本质区别。其解决方案的关键在于系统性地比较两类任务中模型的改进模式与失败模式,发现开放生成依赖于重解释和组合式精炼带来的灵活性优势,而多选任务则受限于选项边界清晰但可选空间有限的特性;由此提出,自校正机制的设计必须考虑任务结构与输出空间之间的交互关系,以适配知识密集型推理与决策导向型应用的不同需求。

链接: https://arxiv.org/abs/2511.09381
作者: Hossein A. Rahmani,Satyapriya Krishna,Xi Wang,Mohammadmehdi Naghiaei,Emine Yilmaz
机构: University College London (伦敦大学学院); Amazon AGI; University of Sheffield (谢菲尔德大学); University of Southern California (南加州大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages

点击查看摘要

Abstract:Large language models have recently demonstrated remarkable abilities to self-correct their responses through iterative refinement, often referred to as self-consistency or self-reflection. However, the dynamics of this self-correction mechanism may differ substantially depending on whether the model is tasked with open-ended text generation or with selecting the most appropriate response from multiple predefined options. In this paper, we conduct a systematic investigation of these two paradigms by comparing performance trends and error-correction behaviors across various natural language understanding and reasoning tasks, covering language models of different scales and families. Our experimental results reveal distinct patterns of improvement and failure modes: \textitWhile open-ended generation often benefits from the flexibility of re-interpretation and compositional refinement, multiple-choice selection can leverage clearer solution boundaries but may be limited by the provided options. This contrast also reflects the dual demands faced by emerging agentic LLM applications: effective agents must not only generate and refine open-ended plans or explanations, but also make reliable discrete choices when operating within constrained action spaces. Our findings, therefore, highlight that the design of self-correction mechanisms should take into account the interaction between task structure and output space, with implications for both knowledge-intensive reasoning and decision-oriented applications of LLMs. Comments: 20 pages Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.09381 [cs.CL] (or arXiv:2511.09381v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2511.09381 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-9] MTQ-Eval: Multilingual Text Quality Evaluation for Language Models AACL2025

【速读】: 该论文旨在解决多语言环境下文本质量评估的通用性问题,即如何将大语言模型(Large Language Models, LLMs)从特定任务评估扩展到跨语言、跨场景的文本质量判别能力。其关键解决方案是提出了一种名为MTQ-Eval的新框架,该框架通过自动构建高质量与低质量文本偏好数据集,并利用这些数据对开源基础LLM进行微调,使其内部表示能够有效对齐人类对文本质量的评分,从而实现对115种语言下文本质量的准确评估,且该能力可进一步提升下游任务表现。

链接: https://arxiv.org/abs/2511.09374
作者: Rhitabrat Pokharel,Ameeta Agrawal
机构: Portland State University (波特兰州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at IJCNLP-AACL 2025 Findings

点击查看摘要

Abstract:The use of large language models (LLMs) for evaluating outputs is becoming an increasingly effective and scalable approach. However, it remains uncertain whether this capability extends beyond task-specific evaluations to more general assessments of text quality, particularly in multilingual contexts. In this study, we introduce, MTQ-Eval, a novel framework for multilingual text quality evaluation that learns from examples of both high- and low-quality texts, adjusting its internal representations. To develop MTQ-Eval, we first automatically generate text quality preference data and then use it to train open-source base LLMs to align with ratings of high- and low-quality text. Our comprehensive evaluation across 115 languages demonstrates the improved performance of the proposed model. Upon further analysis, we find that this enhanced evaluation capability also leads to notable improvements in downstream tasks.
zh

[NLP-10] Routesplain: Towards Faithful and Intervenable Routing for Software-related Tasks

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理软件相关任务时性能差异显著的问题,尤其是在不同任务之间及同一任务内部的性能波动较大,导致响应质量不稳定且成本较高。为提升响应准确性和经济性,论文提出了一种名为Routesplain的新型LLM路由机制,其关键在于:首先从用户查询中提取人类可理解的概念(如任务类型、领域、推理复杂度等),仅基于这些语义概念进行路由决策,而非依赖黑箱模型;这一设计不仅提供了可解释、忠实的推理依据,还显著优于单一模型和现有黑箱基线方法,在8类软件相关任务上实现了更高的准确率与更低的成本。

链接: https://arxiv.org/abs/2511.09373
作者: Adam Štorek,Vikas Upadhyay,Marianne Menglin Liu,Daniel W. Peterson,Anshul Mittal,Sujeeth Bharadwaj,Fahad Shah,Dan Roth
机构: Columbia University (哥伦比亚大学); Oracle AI
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:LLMs now tackle a wide range of software-related tasks, yet we show that their performance varies markedly both across and within these tasks. Routing user queries to the appropriate LLMs can therefore help improve response quality while reducing cost. Prior work, however, has focused mainly on general-purpose LLM routing via black-box models. We introduce Routesplain, the first LLM router for software-related tasks, including multilingual code generation and repair, input/output prediction, and computer science QA. Unlike existing routing approaches, Routesplain first extracts human-interpretable concepts from each query (e.g., task, domain, reasoning complexity) and only routes based on these concepts, thereby providing intelligible, faithful rationales. We evaluate Routesplain on 16 state-of-the-art LLMs across eight software-related tasks; Routesplain outperforms individual models both in terms of accuracy and cost, and equals or surpasses all black-box baselines, with concept-level intervention highlighting avenues for further router improvements.
zh

[NLP-11] Spider4SSC S2CLite: A text-to-multi-query-language dataset using lightweight ontology-agnostic SPARQL to Cypher parser

【速读】: 该论文旨在解决跨图数据库查询语言(如SPARQL与Cypher)之间语义对齐和高效转换的问题,特别是在自然语言到查询(Text-to-Query)任务中缺乏统一、高精度的多语言查询数据集及轻量级解析工具。其解决方案的关键在于提出S2CLite——一个无依赖、纯规则驱动的轻量级解析器(ontology-agnostic parser),它无需RDF图或外部工具即可将SPARQL查询直接翻译为Cypher查询,从而实现本地化(in-situ)与大规模转换。该方法基于传统编程语言编译器的设计理念,显著提升了解析准确率,在Spider4SPARQL数据集上达到77.8%的总解析准确率,优于当前最优方案S2CTrans的44.2%,并在交集查询子集上实现96.6%的执行准确率,同时进一步构建了Spider4SSC这一包含SQL、SPARQL和Cypher三类等价查询的统一文本到查询数据集,为多语言图数据库研究提供了高质量基准资源。

链接: https://arxiv.org/abs/2511.09354
作者: Martin Vejvar,Yasutaka Fujimoto
机构: Yokohama National University (横滨国立大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present Spider4SSC dataset and S2CLite parsing tool. S2CLite is a lightweight, ontology-agnostic parser that translates SPARQL queries into Cypher queries, enabling both in-situ and large-scale SPARQL to Cypher translation. Unlike existing solutions, S2CLite is purely rule-based (inspired by traditional programming language compilers) and operates without requiring an RDF graph or external tools. Experiments conducted on the BSBM42 and Spider4SPARQL datasets show that S2CLite significantly reduces query parsing errors, achieving a total parsing accuracy of 77.8% on Spider4SPARQL compared to 44.2% by the state-of-the-art S2CTrans. Furthermore, S2CLite achieved a 96.6% execution accuracy on the intersecting subset of queries parsed by both parsers, outperforming S2CTrans by 7.3%. We further use S2CLite to parse Spider4SPARQL queries to Cypher and generate Spider4SSC, a unified Text-to-Query language (SQL, SPARQL, Cypher) dataset with 4525 unique questions and 3 equivalent sets of 2581 matching queries (SQL, SPARQL and Cypher). We open-source S2CLite for further development on GitHub (this http URL) and provide the clean Spider4SSC dataset for download.
zh

[NLP-12] Seer Self-Consistency: Advance Budget Estimation for Adaptive Test-Time Scaling

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在测试时缩放(test-time scaling)过程中面临的高计算成本与延迟问题。现有方法虽通过动态自一致性(dynamic self-consistency)降低token消耗,但仍受限于串行请求带来的高延迟。其解决方案的关键在于提出SeerSC框架,该框架融合系统1(System 1)和系统2(System 2)推理机制:利用快速的系统1估算查询的答案熵(answer entropy),以此评估样本进行缩放的潜力,从而在系统2中实现动态自一致性;借助系统1提供的提前且准确的估计,该方法可在显著减少token使用的同时,通过并行生成大幅降低推理延迟,最终实现token效率与延迟的双重优化。

链接: https://arxiv.org/abs/2511.09345
作者: Shiyu Ji,Yixuan Wang,Yijun Liu,Qingfu Zhu,Wanxiang Che
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Test-time scaling improves the inference performance of Large Language Models (LLMs) but also incurs substantial computational costs. Although recent studies have reduced token consumption through dynamic self-consistency, they remain constrained by the high latency of sequential requests. In this paper, we propose SeerSC, a dynamic self-consistency framework that simultaneously improves token efficiency and latency by integrating System 1 and System 2 reasoning. Specifically, we utilize the rapid System 1 to compute the answer entropy for given queries. This score is then used to evaluate the potential of samples for scaling, enabling dynamic self-consistency under System 2. Benefiting from the advance and accurate estimation provided by System 1, the proposed method can reduce token usage while simultaneously achieving a significant decrease in latency through parallel generation. It outperforms existing methods, achieving up to a 47% reduction in token consumption and a 43% reduction in inference latency without significant performance loss.
zh

[NLP-13] mmJEE-Eval: A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models AACL

【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在多模态推理基准测试中表现优异但难以区分真实科学推理能力与模式匹配的问题。为应对这一挑战,作者提出了mmJEE-Eval,一个涵盖印度JEE高级考试(2019–2025年)中物理、化学和数学领域的多模态双语(英文与印地语)基准,包含1,460道高质量题目。其关键创新在于:通过引入高难度、强推理需求的试题,系统性地揭示了前沿闭源模型(如GPT-5、Gemini 2.5 Pro)虽在标准测试中表现良好(准确率77–84%),但在元认知推理压力下(如错误修正能力仅5.2%)迅速失效;同时开放源代码和数据,使得不同训练策略与推理机制的差异得以清晰识别,从而有效区分真正具备深度科学推理能力的模型。

链接: https://arxiv.org/abs/2511.09339
作者: Arka Mukherjee,Shreya Ghosh
机构: Kalinga Institute of Industrial Technology (KIIT); Indian Institute of Technology (IIT), Bhubaneswar
类目: Computation and Language (cs.CL)
备注: Accepted to IJCNLP-AACL Findings 2025

点击查看摘要

Abstract:Contemporary vision-language models (VLMs) perform well on existing multimodal reasoning benchmarks (78-85% accuracy on MMMU, MathVista). Yet, these results fail to sufficiently distinguish true scientific reasoning articulation capabilities from pattern-matching. To address this gap, we introduce \textbfmmJEE-Eval, a multimodal bilingual (English and Hindi) benchmark comprising 1,460 questions from India’s JEE Advanced examination (2019-2025) spanning pre-college Physics, Chemistry, and Mathematics domains. Our evaluation of 17 state-of-the-art models reveals that while frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash) achieve 77-84% accuracy on held-out 2025 questions, open-source models plateau at 37-45% despite scaling to 400B parameters, a significant difference not observed on existing benchmarks. While closed frontiers from Google and OpenAI show high problem-solving accuracies (up to 100% pass@3 scores), they fully collapse when the reasoning load is increased meta-cognitively (GPT-5 fixes just 5.2% errors). Systematic ablations show mmJEE-Eval’s difficulty stems from complexity and reasoning depth rather than memorization. Effectively, our benchmark segregates superior training and reasoning methodologies where alternatives fail. We publicly release our code and data: this https URL
zh

[NLP-14] Not Everything That Counts Can Be Counted: A Case for Safe Qualitative AI

【速读】: 该论文试图解决的问题是:当前人工智能(AI)和大语言模型(LLM)在科学发现中主要推动了定量方法的自动化,但对定性研究方法的支持严重不足,导致从事质性研究的学者难以安全、可靠地使用AI工具进行访谈解读、数据标注和主题建模等任务,而现有通用工具如ChatGPT存在偏见、不透明、不可复现和隐私风险等问题,从而形成了定量与定性方法之间的重要断层。解决方案的关键在于开发专门面向解释性研究的定性AI系统,这些系统必须从底层设计出发,确保透明性(transparency)、可复现性(reproducibility)和隐私友好性(privacy-friendly),并通过整合稳健的定性能力来增强现有的自动化科学发现流程,从而促进跨学科和混合方法研究的发展。

链接: https://arxiv.org/abs/2511.09325
作者: Stine Beltoft,Lukas Galke
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at 3rd International Conference on Frontiers of Artificial Intelligence, Ethics, and Multidisciplinary Applications (FAIEMA 2025)

点击查看摘要

Abstract:Artificial intelligence (AI) and large language models (LLM) are reshaping science, with most recent advances culminating in fully-automated scientific discovery pipelines. But qualitative research has been left behind. Researchers in qualitative methods are hesitant about AI adoption. Yet when they are willing to use AI at all, they have little choice but to rely on general-purpose tools like ChatGPT to assist with interview interpretation, data annotation, and topic modeling - while simultaneously acknowledging these system’s well-known limitations of being biased, opaque, irreproducible, and privacy-compromising. This creates a critical gap: while AI has substantially advanced quantitative methods, the qualitative dimensions essential for meaning-making and comprehensive scientific understanding remain poorly integrated. We argue for developing dedicated qualitative AI systems built from the ground up for interpretive research. Such systems must be transparent, reproducible, and privacy-friendly. We review recent literature to show how existing automated discovery pipelines could be enhanced by robust qualitative capabilities, and identify key opportunities where safe qualitative AI could advance multidisciplinary and mixed-methods research.
zh

[NLP-15] AdaptDel: Adaptable Deletion Rate Randomized Smoothing for Certified Robustness NEURIPS2025

【速读】: 该论文旨在解决序列分类任务中针对编辑距离(edit distance)扰动的认证鲁棒性问题,尤其关注自然语言处理中输入序列长度不一带来的挑战。现有方法依赖固定删除率机制,在处理变长输入时表现不佳。其解决方案的关键在于提出AdaptDel方法,通过引入可自适应调整删除率的机制,使删除策略能够根据输入特性动态变化,并扩展随机平滑(randomized smoothing)理论框架以支持变率删除,从而在编辑距离下实现可靠的认证鲁棒性保障。实证结果表明,该方法在自然语言任务中显著提升认证区域的中位基数,相较于最先进方法提升达30个数量级。

链接: https://arxiv.org/abs/2511.09316
作者: Zhuoqun Huang,Neil G. Marchant,Olga Ohrimenko,Benjamin I. P. Rubinstein
机构: 未知
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 33 pages, 7 figures, camera ready version for NeurIPS 2025

点击查看摘要

Abstract:We consider the problem of certified robustness for sequence classification against edit distance perturbations. Naturally occurring inputs of varying lengths (e.g., sentences in natural language processing tasks) present a challenge to current methods that employ fixed-rate deletion mechanisms and lead to suboptimal performance. To this end, we introduce AdaptDel methods with adaptable deletion rates that dynamically adjust based on input properties. We extend the theoretical framework of randomized smoothing to variable-rate deletion, ensuring sound certification with respect to edit distance. We achieve strong empirical results in natural language tasks, observing up to 30 orders of magnitude improvement to median cardinality of the certified region, over state-of-the-art certifications.
zh

[NLP-16] owards Explainable Khmer Polarity Classification

【速读】: 该论文旨在解决柬埔寨语(Khmer)情感分类任务中缺乏可解释性的问题,即现有模型仅输出情感标签(正面、负面或中性),而无法提供预测依据。其解决方案的关键在于通过微调基于指令的推理模型 Qwen-3,使其生成自解释(self-explanations)来阐明预测理由,例如识别与情感相关的关键词或短语。该方法不仅提升了分类准确性,还增强了模型决策过程的透明度和可信度。

链接: https://arxiv.org/abs/2511.09313
作者: Marry Kong,Rina Buoy,Sovisal Chenda,Nguonly Taing
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Khmer polarity classification is a fundamental natural language processing task that assigns a positive, negative, or neutral label to a given Khmer text input. Existing Khmer models typically predict the label without explaining the rationale behind the prediction. This paper proposes an explainable Khmer polarity classifier by fine-tuning an instruction-based reasoning Qwen-3 model. The notion of explainability in this paper is limited to self-explanations, which the model uses to rationalize its predictions. Experimental results show that the fine-tuned model not only predicts labels accurately but also provides reasoning by identifying polarity-related keywords or phrases to support its predictions. In addition, we contribute a new Khmer polarity dataset consisting of short- to medium-length casual, romanized, and mixed-code Khmer expressions. This dataset was constructed using both heuristic rules and human curation and is publicly available through a gated Hugging Face repository (rinabuoy/khmerpolarity_nonreasoning). The fine-tuned Qwen-3 models are also made available in the same Hugging Face account.
zh

[NLP-17] LiteraryTaste: A Preference Dataset for Creative Writing Personalization

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在创意写作任务中缺乏个性化适配的问题,即现有模型通常基于统一的数据集训练,未能考虑个体用户在文学偏好上的差异。其解决方案的关键在于构建并发布LiteraryTaste数据集,该数据集包含60名参与者对创意写作文本的自我报告偏好(stated preference)和基于100对短文的标注偏好(revealed preference),并通过微调Transformer编码器实现对个人与集体偏好建模,准确率分别达到75.8%和67.7%,同时发现自述偏好对预测实际偏好效果有限,为个性化创意写作生成技术提供了可量化、可解释的数据基础与方法框架。

链接: https://arxiv.org/abs/2511.09310
作者: John Joon Young Chung,Vishakh Padmakumar,Melissa Roemmele,Yi Wang,Yuqian Sun,Tiffany Wang,Shm Garanganao Almeda,Brett A. Halperin,Yuwen Lu,Max Kreminski
机构: Midjourney( Midjourney); Stanford University (斯坦福大学); UC Berkeley (加州大学伯克利分校); University of Washington (华盛顿大学); University of Notre Dame (圣母大学)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:People have different creative writing preferences, and large language models (LLMs) for these tasks can benefit from adapting to each user’s preferences. However, these models are often trained over a dataset that considers varying personal tastes as a monolith. To facilitate developing personalized creative writing LLMs, we introduce LiteraryTaste, a dataset of reading preferences from 60 people, where each person: 1) self-reported their reading habits and tastes (stated preference), and 2) annotated their preferences over 100 pairs of short creative writing texts (revealed preference). With our dataset, we found that: 1) people diverge on creative writing preferences, 2) finetuning a transformer encoder could achieve 75.8% and 67.7% accuracy when modeling personal and collective revealed preferences, and 3) stated preferences had limited utility in modeling revealed preferences. With an LLM-driven interpretability pipeline, we analyzed how people’s preferences vary. We hope our work serves as a cornerstone for personalizing creative writing technologies.
zh

[NLP-18] C3TG: Conflict-aware Composite and Collaborative Controlled Text Generation AAAI-2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在文本生成过程中难以实现细粒度、多维属性控制的问题,尤其是当多个属性要求存在冲突时,现有方法缺乏协调机制导致属性间相互干扰,且未引入迭代优化过程以提升控制精度。其解决方案的关键在于提出一个两阶段框架——Conflict-aware, Composite, and Collaborative Controlled Text Generation (C³TG),第一阶段通过选择性地将LLM与17个维度的属性分类器配对,并利用加权KL散度调整词元概率;第二阶段则构建融合分类器得分与惩罚项的能量函数,通过迭代反馈机制化解属性冲突,从而实现多维属性的精确协同控制,同时保持文本的自然流畅性与多样性。

链接: https://arxiv.org/abs/2511.09292
作者: Yu Li,Zhe Yang,Yi Huang,Xin Liu,Guilin Qi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: This paper has been accepted as a poster presentation at AAAI-2026

点击查看摘要

Abstract:Recent advancements in large language models (LLMs) have demonstrated remarkable text generation capabilities. However, controlling specific attributes of generated text remains challenging without architectural modifications or extensive fine-tuning. Current methods typically toggle a single, basic attribute but struggle with precise multi-attribute control. In scenarios where attribute requirements conflict, existing methods lack coordination mechanisms, causing interference between desired attributes. Furthermore, these methods fail to incorporate iterative optimization processes in the controlled generation pipeline. To address these limitations, we propose Conflict-aware, Composite, and Collaborative Controlled Text Generation (C ^3 TG), a two-phase framework for fine-grained, multi-dimensional text attribute control. During generation, C ^3 TG selectively pairs the LLM with the required attribute classifiers from the 17 available dimensions and employs weighted KL-divergence to adjust token probabilities. The optimization phase then leverages an energy function combining classifier scores and penalty terms to resolve attribute conflicts through iterative feedback, enabling precise control over multiple dimensions simultaneously while preserving natural text flow. Experiments show that C ^3 TG significantly outperforms baselines across multiple metrics including attribute accuracy, linguistic fluency, and output diversity, while simultaneously reducing toxicity. These results establish C ^3 TG as an effective and flexible solution for multi-dimensional text attribute control that requires no costly model modifications.
zh

[NLP-19] End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering AAAI2026

【速读】: 该论文旨在解决长音频场景下语音问答(Spoken Question Answering, SQA)任务中现有方法难以有效处理长时间音频的问题,尤其是大型音频语言模型在长序列建模上的局限性。其解决方案的关键在于提出一种端到端的对比语言-语音检索器(Contrastive Language-Speech Retriever, CLSR),该模型通过引入一个中间步骤——将声学特征转换为类文本表示(text-like representations),再进行跨模态对齐,从而更有效地弥合语音与文本之间的模态鸿沟。实验表明,CLSR在四个跨模态检索数据集上均优于现有的端到端语音检索器和结合语音识别与文本检索的流水线方法,为实际长音频问答应用提供了可靠基础。

链接: https://arxiv.org/abs/2511.09282
作者: Jiliang Hu,Zuchao Li,Baoyuan Qi,Liu Guoming,Ping Wang
机构: 未知
类目: ound (cs.SD); Computation and Language (cs.CL)
备注: 12 pages, 7 figures, accepted by AAAI 2026

点击查看摘要

Abstract:Significant progress has been made in spoken question answering (SQA) in recent years. However, many existing methods, including large audio language models, struggle with processing long audio. Follow the success of retrieval augmented generation, a speech-related retriever shows promising in help preprocessing long-form speech. But the performance of existing speech-related retrievers is lacking. To address this challenge, we propose CLSR, an end-to-end contrastive language-speech retriever that efficiently extracts question-relevant segments from long audio recordings for downstream SQA task. Unlike conventional speech-text contrastive models, CLSR incorporates an intermediate step that converts acoustic features into text-like representations prior to alignment, thereby more effectively bridging the gap between modalities. Experimental results across four cross-modal retrieval datasets demonstrate that CLSR surpasses both end-to-end speech related retrievers and pipeline approaches combining speech recognition with text retrieval, providing a robust foundation for advancing practical long-form SQA applications.
zh

[NLP-20] POTSA: A Cross-Lingual Speech Alignment Framework for Low Resource Speech-to-Text Translation ICASSP2026

【速读】: 该论文旨在解决现有语音大语言模型(Speech Large Language Models, SpeechLLMs)在多语言语音到文本翻译(S2TT)任务中因忽视源语言间语义共性而导致的翻译性能偏差问题。其解决方案的关键在于提出一种基于最优传输(Optimal Transport, OT)的并行语音对齐框架POTSA(Parallel Optimal Transport for Speech Alignment),通过两个核心机制实现:首先引入偏置补偿模块粗粒度对齐不同语言的初始语音表示;其次在Q-Former中施加基于平行语音对的token级OT约束,以建立细粒度的跨语言表征一致性,并结合层调度策略聚焦于最具语义价值的网络层,从而有效缩小高资源与低资源语言间的翻译差距。

链接: https://arxiv.org/abs/2511.09232
作者: Xuanchen Li,Chenrui Cui,Tianrui Wang,Meng Ge,Zikang Huang,Jin Li,Yizhou Peng,Longbiao Wang,Jianwu Dang,Nyima Tashi
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注: 5 pages, 3 figures, submitted to ICASSP 2026

点击查看摘要

Abstract:Speech Large Language Models (SpeechLLMs) have achieved breakthroughs in multilingual speech-to-text translation (S2TT). However, existing approaches often overlook semantic commonalities across source languages, leading to biased translation performance. In this work, we propose \textbfPOTSA (Parallel Optimal Transport for Speech Alignment), a new framework based on cross-lingual parallel speech pairs and Optimal Transport (OT), designed to bridge high- and low-resource translation gaps. First, we introduce a Bias Compensation module to coarsely align initial speech representations across languages. Second, we impose token-level OT constraints on a Q-Former using parallel speech pairs to establish fine-grained consistency of representations. Then, we apply a layer scheduling strategy to focus OT constraints on the most semantically beneficial layers. Experiments on the FLEURS dataset show that our method achieves SOTA performance, with +0.93 BLEU on average over five common languages and +5.05 BLEU on zero-shot languages, using only 10 hours of parallel speech per source language.
zh

[NLP-21] aming Object Hallucinations with Verified Atomic Confidence Estimation

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中存在的幻觉问题,尤其是对象存在性、属性和关系方面的错误,这些问题严重影响了模型的可靠性。解决方案的关键在于提出TACO(Verified Atomic Confidence Estimation)框架,其核心机制包括:将回答分解为原子级查询以增强可验证性,通过重述(paraphrasing)降低对表述方式的敏感性,并利用自一致性(self-consistency,黑盒)或自信心聚合(self-confidence,灰盒)来估计置信度,最终通过语言模型对答案进行精炼。该方法无需依赖外部视觉专家,在多个基准测试中显著提升了模型输出的忠实性和置信度校准效果。

链接: https://arxiv.org/abs/2511.09228
作者: Jiarui Liu,Weihao Xuan,Zhijing Jin,Mona Diab
机构: CMU(卡内基梅隆大学); The University of Tokyo(东京大学); The University of Toronto(多伦多大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) often suffer from hallucinations, particularly errors in object existence, attributes, or relations, which undermine their reliability. We introduce TACO (Verified Atomic Confidence Estimation), a simple framework that mitigates hallucinations through self-verification and confidence calibration without relying on external vision experts. TACO decomposes responses into atomic queries, paraphrases them to reduce sensitivity to wording, and estimates confidence using self-consistency (black-box) or self-confidence (gray-box) aggregation, before refining answers with a language model. Experiments on five benchmarks (POPE, MME, HallusionBench, AMBER, and MM-Hal Bench) with two MLLMs (\textttLLaVA-1.5-7B and \textttCogVLM2) show that TACO consistently outperforms direct prompting and Visual Contrastive Decoding, reduces systematic biases, and improves confidence calibration, demonstrating its effectiveness in enhancing the faithfulness of MLLMs.
zh

[NLP-22] Stabilizing Reinforcement Learning for Honesty Alignment in Language Models on Deductive Reasoning

【速读】: 该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)在训练初期因负奖励主导而导致模型性能崩溃的问题,特别是在诚实对齐(honesty alignment)场景下,语言模型不仅需正确解答可回答的问题,还需识别无法从前提中得出结论的情况。针对这一挑战,作者通过构建基于图结构的多步演绎推理数据集(涵盖线性代数和逻辑推理),引入不可回答样本以测试模型的诚实性,并发现现有方法如GRPO即使采用监督微调初始化仍难以稳定训练。解决方案的关键在于提出Anchor方法——该方法在采样过程中注入真实轨迹(ground truth trajectories),从而避免早期训练崩溃,显著提升推理性能,强调了训练动态设计对实现可靠演绎推理的重要性。

链接: https://arxiv.org/abs/2511.09222
作者: Jiarui Liu,Kaustubh Dhole,Yingheng Wang,Haoyang Wen,Sarah Zhang,Haitao Mao,Gaotang Li,Neeraj Varshney,Jingguo Liu,Xiaoman Pan
机构: Carnegie Mellon University (卡内基梅隆大学); Amazon (亚马逊)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has recently emerged as a promising framework for aligning language models with complex reasoning objectives. However, most existing methods optimize only for final task outcomes, leaving models vulnerable to collapse when negative rewards dominate early training. This challenge is especially pronounced in honesty alignment, where models must not only solve answerable queries but also identify when conclusions cannot be drawn from the given premises. Deductive reasoning provides an ideal testbed because it isolates reasoning capability from reliance on external factual knowledge. To investigate honesty alignment, we curate two multi-step deductive reasoning datasets from graph structures, one for linear algebra and one for logical inference, and introduce unanswerable cases by randomly perturbing an edge in half of the instances. We find that GRPO, with or without supervised fine tuning initialization, struggles on these tasks. Through extensive experiments across three models, we evaluate stabilization strategies and show that curriculum learning provides some benefit but requires carefully designed in distribution datasets with controllable difficulty. To address these limitations, we propose Anchor, a reinforcement learning method that injects ground truth trajectories into rollouts, preventing early training collapse. Our results demonstrate that this method stabilizes learning and significantly improves the overall reasoning performance, underscoring the importance of training dynamics for enabling reliable deductive reasoning in aligned language models.
zh

[NLP-23] Pretraining Finnish ModernBERTs

【速读】: 该论文旨在解决多语言自然语言处理模型在有限语种覆盖下的性能瓶颈问题,尤其关注芬兰相关语言的建模能力。其解决方案的关键在于预训练一系列不同规模(51M至475M参数)的ModernBERT编码器模型,并在训练过程中聚焦于与芬兰相关的多语言数据集,从而在资源受限的语言场景下实现优于现有单语模型的性能表现,特别是在处理超过512 token上下文长度的任务时展现出显著优势。

链接: https://arxiv.org/abs/2511.09213
作者: Akseli Reunamo,Laura-Maria Peltonen,Hans Moen,Sampo Pyysalo
机构: University of Turku (图尔库大学); University of Eastern Finland (东芬兰大学); Kuopio University Hospital (库奥皮奥大学医院); Aalto University (阿尔托大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper reports on pretraining ModernBERT encoder models in six different sizes, ranging from 51M to 475M parameters, with a focus on limited multilingualism, emphasizing languages relevant to Finland. Our models are competitive with, or superior to, existing multilingual models. They outperform monolingual models on tasks that require a context longer than 512 tokens. We present empirical results on using different data in the final stage of training. The code and models are publicly released.
zh

[NLP-24] he Learning Dynamics of Subword Segmentation for Morphologically Diverse Languages

【速读】: 该论文旨在解决传统子词分段(subword segmentation)在预训练和微调过程中固定不变所导致的局限性问题,即无法根据语言模型的训练目标动态优化分词策略。其解决方案的关键在于扩展子词分段语言模型(Subword Segmentation Language Model, SSLM)框架,使其支持预训练与微调阶段的子词学习机制,从而实现子词边界在训练过程中的动态演化。通过在三种形态学特征迥异的语言(Isi-Xhosa、Setswana 和 English)上进行实验,研究发现子词结构会经历四个学习阶段,且在微调阶段趋于更细粒度的划分,尤其对形态复杂的低资源语言(如 Isi-Xhosa)表现出显著的文本生成与跨语言迁移性能提升潜力。

链接: https://arxiv.org/abs/2511.09197
作者: Francois Meyer,Jan Buys
机构: University of Cape Town (开普敦大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Subword segmentation is typically applied in preprocessing and stays fixed during training. Alternatively, it can be learned during training to optimise the training objective. In this paper we study the learning dynamics of subword segmentation: if a language model can dynamically optimise tokenisation, how do its subwords evolve during pretraining and finetuning? To explore this, we extend the subword segmental language model (SSLM), a framework for learning subwords during training, to support pretraining and finetuning. We train models for three typologically diverse languages to study learning dynamics across the morphological spectrum: Isi-Xhosa is conjunctive (long word forms composed of many morphemes), Setswana is disjunctive (morphemes written as separate words), and English represents a typological middle ground. We analyse subword dynamics from a linguistic perspective, tracking morphology, productivity, and fertility. We identify four stages of subword learning, with the morphologically complex isi-Xhosa exhibiting greater instability. During finetuning, subword boundaries shift to become finer-grained. Lastly, we show that learnable subwords offers a promising approach to improve text generation and cross-lingual transfer for low-resource, morphologically complex languages.
zh

[NLP-25] Context is Enough: Empirical Validation of textitSequentiality on Essays

【速读】: 该论文旨在解决生成式 AI (Generative AI) 在评估作文叙事连贯性(narrative flow)时所依赖的“顺序性”(sequentiality)指标的有效性和可解释性问题。原方法将主题词与上下文词结合,但受到批评认为其结果可能受主题选择偏差影响,且缺乏与人类标注的流利度(flow)等判别标准的验证。论文的关键解决方案是采用仅基于上下文词的顺序性变体(contextual sequentiality),并通过两个包含人工标注评分的作文数据集(ASAP++ 和 ELLIPSE)进行实证验证,证明该方法在预测组织结构(Organization)和连贯性(Cohesion)等 discourse-level traits 上优于原始版本,并在与传统语言特征结合后显著提升模型性能,从而确立了上下文驱动的顺序性作为可解释、互补且有效的自动作文评分特征。

链接: https://arxiv.org/abs/2511.09185
作者: Amal Sunny,Advay Gupta,Vishnu Sreekumar
机构: IIIT-Hyderabad (印度国际信息技术研究所)
类目: Computation and Language (cs.CL)
备注: 9 pages, 2 figures

点击查看摘要

Abstract:Recent work has proposed using Large Language Models (LLMs) to quantify narrative flow through a measure called sequentiality, which combines topic and contextual terms. A recent critique argued that the original results were confounded by how topics were selected for the topic-based component, and noted that the metric had not been validated against ground-truth measures of flow. That work proposed using only the contextual term as a more conceptually valid and interpretable alternative. In this paper, we empirically validate that proposal. Using two essay datasets with human-annotated trait scores, ASAP++ and ELLIPSE, we show that the contextual version of sequentiality aligns more closely with human assessments of discourse-level traits such as Organization and Cohesion. While zero-shot prompted LLMs predict trait scores more accurately than the contextual measure alone, the contextual measure adds more predictive value than both the topic-only and original sequentiality formulations when combined with standard linguistic features. Notably, this combination also outperforms the zero-shot LLM predictions, highlighting the value of explicitly modeling sentence-to-sentence flow. Our findings support the use of context-based sequentiality as a validated, interpretable, and complementary feature for automated essay scoring and related NLP tasks.
zh

[NLP-26] A Hybrid Search for Complex Table Question Answering in Securities Report

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在表格问答(Table Question Answering, TQA)任务中因无法有效理解复杂表格结构而导致的答非所问问题。其核心解决方案是提出一种无需人工标注的单元格提取方法:通过融合语言模型与TF-IDF的混合检索机制,计算问题与表格各单元格之间的相似度以自动识别相关表头;随后选取与问题最相关的行和列交点处的单元格作为答案。该方法的关键创新在于利用对比学习在小规模问题-表头配对数据上训练语言模型,从而提升对复杂表头结构的理解能力,最终在NTCIR-18 U4共享任务的数据集上达到74.6%的准确率,显著优于GPT-4o mini(63.9%)。

链接: https://arxiv.org/abs/2511.09179
作者: Daiki Shirafuji,Koji Tanaka,Tatsuhiko Saito
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to IIAI AAI 2025

点击查看摘要

Abstract:Recently, Large Language Models (LLMs) are gaining increased attention in the domain of Table Question Answering (TQA), particularly for extracting information from tables in documents. However, directly entering entire tables as long text into LLMs often leads to incorrect answers because most LLMs cannot inherently capture complex table structures. In this paper, we propose a cell extraction method for TQA without manual identification, even for complex table headers. Our approach estimates table headers by computing similarities between a given question and individual cells via a hybrid retrieval mechanism that integrates a language model and TF-IDF. We then select as the answer the cells at the intersection of the most relevant row and column. Furthermore, the language model is trained using contrastive learning on a small dataset of question-header pairs to enhance performance. We evaluated our approach in the TQA dataset from the U4 shared task at NTCIR-18. The experimental results show that our pipeline achieves an accuracy of 74.6%, outperforming existing LLMs such as GPT-4o mini~(63.9%). In the future, although we used traditional encoder models for retrieval in this study, we plan to incorporate more efficient text-search models to improve performance and narrow the gap with human evaluation results.
zh

[NLP-27] LoopTool: Closing the Data-Training Loop for Robust LLM Tool Calls

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在工具调用能力训练中因静态合成数据流水线导致的效率低下问题,即数据生成与模型训练分离、无法动态聚焦模型薄弱环节以及噪声标签持续存在等问题。解决方案的关键在于提出一种全自动化、模型感知的数据演化框架 LoopTool,其通过闭环迭代机制紧密集成数据合成与模型训练,包含三个协同模块:(1)贪婪能力探测(Greedy Capability Probing, GCP)识别模型已掌握和失败的能力;(2)判断引导的标签验证(Judgement-Guided Label Verification, JGLV)利用开源判别模型修正标注错误,逐步净化数据集;(3)错误驱动的数据扩展(Error-Driven Data Expansion, EDDE)基于识别出的失败案例生成更具挑战性的新样本。该闭环流程在低成本开源生态中运行,无需依赖昂贵闭源API,实验表明使用LoopTool训练的8B模型在BFCL-v3和ACEBench基准上超越了32B规模的数据生成器,并达到同规模下的新SOTA性能。

链接: https://arxiv.org/abs/2511.09148
作者: Kangning Zhang,Wenxiang Jiao,Kounianhua Du,Yuan Lu,Weiwen Liu,Weinan Zhang,Lei Zhang,Yong Yu
机构: Shanghai Jiao Tong University (上海交通大学); Xiaohongshu Inc.
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Augmenting Large Language Models (LLMs) with external tools enables them to execute complex, multi-step tasks. However, tool learning is hampered by the static synthetic data pipelines where data generation and model training are executed as two separate, non-interactive processes. This approach fails to adaptively focus on a model’s specific weaknesses and allows noisy labels to persist, degrading training efficiency. We introduce LoopTool, a fully automated, model-aware data evolution framework that closes this loop by tightly integrating data synthesis and model training. LoopTool iteratively refines both the data and the model through three synergistic modules: (1) Greedy Capability Probing (GCP) diagnoses the model’s mastered and failed capabilities; (2) Judgement-Guided Label Verification (JGLV) uses an open-source judge model to find and correct annotation errors, progressively purifying the dataset; and (3) Error-Driven Data Expansion (EDDE) generates new, challenging samples based on identified failures. This closed-loop process operates within a cost-effective, open-source ecosystem, eliminating dependence on expensive closed-source APIs. Experiments show that our 8B model trained with LoopTool significantly surpasses its 32B data generator and achieves new state-of-the-art results on the BFCL-v3 and ACEBench benchmarks for its scale. Our work demonstrates that closed-loop, self-refining data pipelines can dramatically enhance the tool-use capabilities of LLMs.
zh

[NLP-28] DoPE: Denoising Rotary Position Embedding

【速读】: 该论文旨在解决旋转位置编码(Rotary Position Embedding, RoPE)在Transformer模型中存在长度外推(length extrapolation)性能下降的问题,其核心表现是注意力机制中的“注意力下沉”(attention sink)现象。解决方案的关键在于提出了一种无需训练的去噪位置编码方法(Denoising Positional Encoding, DoPE),通过截断矩阵熵(truncated matrix entropy)检测特征图中的异常频率带,并利用无参数的高斯分布对特征图进行重参数化,从而有效抑制噪声并恢复平衡的注意力模式,显著提升长上下文下的检索准确性和推理稳定性(最长支持64K tokens)。

链接: https://arxiv.org/abs/2511.09146
作者: Jing Xiong,Liyang Fan,Hui Shen,Zunhai Su,Min Yang,Lingpeng Kong,Ngai Wong
机构: The University of Hong Kong (香港大学); University of Michigan, Ann Arbor (密歇根大学安娜堡分校); Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
类目: Computation and Language (cs.CL)
备注: Technical Report

点击查看摘要

Abstract:Rotary Position Embedding (RoPE) in Transformer models has inherent limits that weaken length extrapolation. We reinterpret the attention map with positional encoding as a noisy feature map, and propose Denoising Positional Encoding (DoPE), a training-free method based on truncated matrix entropy to detect outlier frequency bands in the feature map. Leveraging the noise characteristics of the feature map, we further reparameterize it with a parameter-free Gaussian distribution to achieve robust extrapolation. Our method theoretically reveals the underlying cause of the attention sink phenomenon and its connection to truncated matrix entropy. Experiments on needle-in-a-haystack and many-shot in-context learning tasks demonstrate that DoPE significantly improves retrieval accuracy and reasoning stability across extended contexts (up to 64K tokens). The results show that the denoising strategy for positional embeddings effectively mitigates attention sinks and restores balanced attention patterns, providing a simple yet powerful solution for improving length generalization. Our project page is Project: this https URL
zh

[NLP-29] One-Topic-Doesnt-Fit-All: Transcreating Reading Comprehension Test for Personalized Learning

【速读】: 该论文旨在解决英语作为外语(EFL)教学中学生阅读理解能力提升受限于学习动机不足与内容缺乏个性化匹配的问题。其核心解决方案在于构建一个基于GPT-4o的结构化内容转译(content transcreation)流程,通过提取学生兴趣主题、结合布卢姆分类法(Bloom’s taxonomy)对问题进行分类、分析语言特征,并生成语义上贴合学生兴趣但语言难度与原题相当的新文本及选择题,从而实现阅读材料的个性化定制,显著提升学生的参与度和理解效果。

链接: https://arxiv.org/abs/2511.09135
作者: Jieun Han,Daniel Lee,Haneul Yoo,Jinsung Yoon,Junyeong Park,Suin Kim,So-Yeon Ahn,Alice Oh
机构: KAIST(韩国科学技术院); Salesforce AI Research( Salesforce人工智能研究); Google Cloud(谷歌云); Elice
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Personalized learning has gained attention in English as a Foreign Language (EFL) education, where engagement and motivation play crucial roles in reading comprehension. We propose a novel approach to generating personalized English reading comprehension tests tailored to students’ interests. We develop a structured content transcreation pipeline using OpenAI’s gpt-4o, where we start with the RACE-C dataset, and generate new passages and multiple-choice reading comprehension questions that are linguistically similar to the original passages but semantically aligned with individual learners’ interests. Our methodology integrates topic extraction, question classification based on Bloom’s taxonomy, linguistic feature analysis, and content transcreation to enhance student engagement. We conduct a controlled experiment with EFL learners in South Korea to examine the impact of interest-aligned reading materials on comprehension and motivation. Our results show students learning with personalized reading passages demonstrate improved comprehension and motivation retention compared to those learning with non-personalized materials.
zh

[NLP-30] Assessing the Capabilities of LLM s in Humor:A Multi-dimensional Analysis of Oogiri Generation and Evaluation

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在幽默理解与生成能力评估中存在单一维度评价标准的问题,即仅以“是否有趣”作为评判依据,难以全面刻画人类幽默的复杂性。其解决方案的关键在于引入日本即兴喜剧游戏Oogiri作为多维幽默评测框架,并构建了一个包含六维人工标注(新颖性、清晰度、相关性、智力性、共情力和整体趣味性)的高质量语料库,从而系统性地评估LLMs在生成和评价幽默响应中的表现。研究发现,尽管LLMs在创意生成上达到低至中等人类水平,但其在共情(Empathy)维度显著不足,且模型与人类在评价偏好上呈现根本性差异——模型更关注新颖性,而人类更重视共情,这揭示了现有模型在情感智能方面的关键缺陷。

链接: https://arxiv.org/abs/2511.09133
作者: Ritsu Sakabe,Hwichan Kim,Tosho Hirasawa,Mamoru Komachi
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Computational humor is a frontier for creating advanced and engaging natural language processing (NLP) applications, such as sophisticated dialogue systems. While previous studies have benchmarked the humor capabilities of Large Language Models (LLMs), they have often relied on single-dimensional evaluations, such as judging whether something is simply ``funny.‘’ This paper argues that a multifaceted understanding of humor is necessary and addresses this gap by systematically evaluating LLMs through the lens of Oogiri, a form of Japanese improvisational comedy games. To achieve this, we expanded upon existing Oogiri datasets with data from new sources and then augmented the collection with Oogiri responses generated by LLMs. We then manually annotated this expanded collection with 5-point absolute ratings across six dimensions: Novelty, Clarity, Relevance, Intelligence, Empathy, and Overall Funniness. Using this dataset, we assessed the capabilities of state-of-the-art LLMs on two core tasks: their ability to generate creative Oogiri responses and their ability to evaluate the funniness of responses using a six-dimensional evaluation. Our results show that while LLMs can generate responses at a level between low- and mid-tier human performance, they exhibit a notable lack of Empathy. This deficit in Empathy helps explain their failure to replicate human humor assessment. Correlation analyses of human and model evaluation data further reveal a fundamental divergence in evaluation criteria: LLMs prioritize Novelty, whereas humans prioritize Empathy. We release our annotated corpus to the community to pave the way for the development of more emotionally intelligent and sophisticated conversational agents.
zh

[NLP-31] History-Aware Reasoning for GUI Agents AAAI2026

【速读】: 该论文旨在解决当前GUI代理在长周期任务中因缺乏历史感知推理能力而导致的性能瓶颈问题。现有方法虽结合强化学习(Reinforcement Learning, RL)与系统2类思维链(System-2 Chain-of-Thought),但其显式推理仍表现为对交互序列的离散屏幕理解,忽视了episode内的历史交互信息,从而削弱了短时记忆和决策连贯性。解决方案的关键在于提出一种历史感知推理(History-Aware Reasoning, HAR)框架,通过构建反思学习场景、设计定制化修正指导策略以及引入混合强化学习奖励函数,使GUI代理能够从自身错误中提取经验并增强短期记忆能力,从而实现从历史无关到历史感知的推理模式转变,显著提升GUI自动化任务中的稳定性与细节感知能力。

链接: https://arxiv.org/abs/2511.09127
作者: Ziwei Wang,Leyang Yang,Xiaoxuan Tang,Sheng Zhou,Dajun Chen,Wei Jiang,Yong Li
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); 3. Tsinghua University (清华大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Paper accepted to AAAI 2026

点击查看摘要

Abstract:Advances in Multimodal Large Language Models have significantly enhanced Graphical User Interface (GUI) automation. Equipping GUI agents with reliable episodic reasoning capabilities is essential for bridging the gap between users’ concise task descriptions and the complexities of real-world execution. Current methods integrate Reinforcement Learning (RL) with System-2 Chain-of-Thought, yielding notable gains in reasoning enhancement. For long-horizon GUI tasks, historical interactions connect each screen to the goal-oriented episode chain, and effectively leveraging these clues is crucial for the current decision. However, existing native GUI agents exhibit weak short-term memory in their explicit reasoning, interpreting the chained interactions as discrete screen understanding, i.e., unawareness of the historical interactions within the episode. This history-agnostic reasoning challenges their performance in GUI automation. To alleviate this weakness, we propose a History-Aware Reasoning (HAR) framework, which encourages an agent to reflect on its own errors and acquire episodic reasoning knowledge from them via tailored strategies that enhance short-term memory in long-horizon interaction. The framework mainly comprises constructing a reflective learning scenario, synthesizing tailored correction guidelines, and designing a hybrid RL reward function. Using the HAR framework, we develop a native end-to-end model, HAR-GUI-3B, which alters the inherent reasoning mode from history-agnostic to history-aware, equipping the GUI agent with stable short-term memory and reliable perception of screen details. Comprehensive evaluations across a range of GUI-related benchmarks demonstrate the effectiveness and generalization of our method.
zh

[NLP-32] hinking Forward and Backward: Multi-Objective Reinforcement Learning for Retrieval-Augmented Reasoning

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)在复杂多步推理任务中表现受限的问题,特别是由于现有方法依赖结果导向的监督信号,缺乏对中间推理步骤的有效引导,导致奖励黑客(reward hacking)和响应质量下降。其解决方案的关键在于提出Bi-RAR框架,通过双向信息距离(bidirectional information distance)量化每个中间步骤的信息完整性——该距离基于柯尔莫哥洛夫复杂度(Kolmogorov complexity)并用语言模型生成概率近似计算,从而同时衡量当前推理离答案的距离与对问题的覆盖程度;进一步结合多目标强化学习与级联奖励结构,强化早期轨迹对齐,实现更高效、高质量的搜索交互式推理。

链接: https://arxiv.org/abs/2511.09109
作者: Wenda Wei,Yu-An Liu,Ruqing Zhang,Jiafeng Guo,Lixin Su,Shuaiqiang Wang,Dawei Yin,Maarten de Rijke,Xueqi Cheng
机构: 1. Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); 2. University of Chinese Academy of Sciences (中国科学院大学); 3. Alibaba Group (阿里巴巴集团); 4. University of Amsterdam (阿姆斯特丹大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) has proven to be effective in mitigating hallucinations in large language models, yet its effectiveness remains limited in complex, multi-step reasoning this http URL efforts have incorporated search-based interactions into RAG, enabling iterative reasoning with real-time retrieval. Most approaches rely on outcome-based supervision, offering no explicit guidance for intermediate steps. This often leads to reward hacking and degraded response quality. We propose Bi-RAR, a novel retrieval-augmented reasoning framework that evaluates each intermediate step jointly in both forward and backward directions. To assess the information completeness of each step, we introduce a bidirectional information distance grounded in Kolmogorov complexity, approximated via language model generation probabilities. This quantification measures both how far the current reasoning is from the answer and how well it addresses the question. To optimize reasoning under these bidirectional signals, we adopt a multi-objective reinforcement learning framework with a cascading reward structure that emphasizes early trajectory alignment. Empirical results on seven question answering benchmarks demonstrate that Bi-RAR surpasses previous methods and enables efficient interaction and reasoning with the search engine during training and inference.
zh

[NLP-33] Context-Aware Dynamic Chunking for Streaming Tibetan Speech Recognition

【速读】: 该论文旨在解决流式语音识别中因固定分块(fixed-chunk)策略导致的上下文截断问题,从而影响识别准确率和适应不同语速的能力。其解决方案的关键在于提出一种基于上下文感知的动态分块机制(context-aware dynamic chunking mechanism),该机制根据编码状态自适应调整分块宽度,实现灵活的感受野、跨分块信息交互,并提升对不同说话速率的鲁棒性;同时,结合基于藏文字 orthographic 原则构建的词典和外部语言模型,进一步增强语言建模能力与长句识别性能。

链接: https://arxiv.org/abs/2511.09085
作者: Chao Wang,Yuqing Cai,Renzeng Duojie,Jin Zhang,Yutong Liu,Nyima Tashi
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:In this work, we propose a streaming speech recognition framework for Amdo Tibetan, built upon a hybrid CTC/Atten-tion architecture with a context-aware dynamic chunking mechanism. The proposed strategy adaptively adjusts chunk widths based on encoding states, enabling flexible receptive fields, cross-chunk information exchange, and robust adaptation to varying speaking rates, thereby alleviating the context truncation problem of fixed-chunk methods. To further capture the linguistic characteristics of Tibetan, we construct a lexicon grounded in its orthographic principles, providing linguistically motivated modeling units. During decoding, an external language model is integrated to enhance semantic consistency and improve recognition of long sentences. Experimental results show that the proposed framework achieves a word error rate (WER) of 6.23% on the test set, yielding a 48.15% relative improvement over the fixed-chunk baseline, while significantly reducing recognition latency and maintaining performance close to global decoding.
zh

[NLP-34] MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique

【速读】: 该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在跨模态任务中缺乏系统性批判能力评估的问题,尤其是在图像描述和视觉推理等场景下,现有研究主要聚焦于纯文本领域的批判能力,而对多模态情境下的批判能力关注不足。解决方案的关键在于提出一个全面的基准测试框架MM-CRITIC,涵盖基础、修正和比较三个维度,包含8类主要任务和超过500个具体任务,共4471个样本,并引入专家标注的参考答案作为评分标准,指导GPT-4o生成高质量参考批判意见,从而提升评估的可靠性与可解释性,为LMMs的自我改进和可信辅助提供量化依据。

链接: https://arxiv.org/abs/2511.09067
作者: Gailun Zeng,Ziyang Luo,Hongzhan Lin,Yuchen Tian,Kaixin Li,Ziyang Gong,Jianxiong Guo,Jing Ma
机构: Hong Kong Baptist University (香港浸会大学); Beijing Normal-Hong Kong Baptist University (北京师范大学-香港浸会大学); National University of Singapore (新加坡国立大学); Beijing Normal University (北京师范大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 28 pages, 14 figures, 19 tables

点击查看摘要

Abstract:The ability of critique is vital for models to self-improve and serve as reliable AI assistants. While extensively studied in language-only settings, multimodal critique of Large Multimodal Models (LMMs) remains underexplored despite their growing capabilities in tasks like captioning and visual reasoning. In this work, we introduce MM-CRITIC, a holistic benchmark for evaluating the critique ability of LMMs across multiple dimensions: basic, correction, and comparison. Covering 8 main task types and over 500 tasks, MM-CRITIC collects responses from various LMMs with different model sizes and is composed of 4471 samples. To enhance the evaluation reliability, we integrate expert-informed ground answers into scoring rubrics that guide GPT-4o in annotating responses and generating reference critiques, which serve as anchors for trustworthy judgments. Extensive experiments validate the effectiveness of MM-CRITIC and provide a comprehensive assessment of leading LMMs’ critique capabilities under multiple dimensions. Further analysis reveals some key insights, including the correlation between response quality and critique, and varying critique difficulty across evaluation dimensions. Our code is available at this https URL.
zh

[NLP-35] PAN: A World Model for General Interactable and Long-Horizon World Simulation

【速读】: 该论文旨在解决当前世界模型在因果控制、交互性及长时一致性方面的不足,以及视频生成模型缺乏对动作条件的可控性和长期动态一致性的局限问题。其解决方案的关键在于提出PAN(a general, interactable, and long-horizon world model),采用生成潜在预测(Generative Latent Prediction, GLP)架构:该架构融合基于大语言模型(Large Language Model, LLM)的自回归潜在动力学主干,以文本知识为 grounding 并支持自然语言动作条件;同时结合视频扩散解码器(video diffusion decoder),实现感知细节丰富且时间连贯的视觉重建,从而统一潜在空间推理(想象)与可实现的世界动态(现实)。这一设计使模型能够在多样环境中进行开放域、动作条件下的高质量长期模拟,显著提升预测能力和可解释的推演能力。

链接: https://arxiv.org/abs/2511.09057
作者: PAN Team Institute of Foundation Models:Jiannan Xiang,Yi Gu,Zihan Liu,Zeyu Feng,Qiyue Gao,Yiyan Hu,Benhao Huang,Guangyi Liu,Yichi Yang,Kun Zhou,Davit Abrahamyan,Arif Ahmad,Ganesh Bannur,Junrong Chen,Kimi Chen,Mingkai Deng,Ruobing Han,Xinqi Huang,Haoqiang Kang,Zheqi Li,Enze Ma,Hector Ren,Yashowardhan Shinde,Rohan Shingre,Ramsundar Tanikella,Kaiming Tao,Dequan Yang,Xinle Yu,Cong Zeng,Binglin Zhou,Hector Liu,Zhiting Hu,Eric P. Xing
机构: PAN Team, Institute of Foundation Models
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:A world model enables an intelligent agent to imagine, predict, and reason about how the world evolves in response to its actions, and accordingly to plan and strategize. While recent video generation models produce realistic visual sequences, they typically operate in the prompt-to-full-video manner without causal control, interactivity, or long-horizon consistency required for purposeful reasoning. Existing world modeling efforts, on the other hand, often focus on restricted domains (e.g., physical, game, or 3D-scene dynamics) with limited depth and controllability, and struggle to generalize across diverse environments and interaction formats. In this work, we introduce PAN, a general, interactable, and long-horizon world model that predicts future world states through high-quality video simulation conditioned on history and natural language actions. PAN employs the Generative Latent Prediction (GLP) architecture that combines an autoregressive latent dynamics backbone based on a large language model (LLM), which grounds simulation in extensive text-based knowledge and enables conditioning on language-specified actions, with a video diffusion decoder that reconstructs perceptually detailed and temporally coherent visual observations, to achieve a unification between latent space reasoning (imagination) and realizable world dynamics (reality). Trained on large-scale video-action pairs spanning diverse domains, PAN supports open-domain, action-conditioned simulation with coherent, long-term dynamics. Extensive experiments show that PAN achieves strong performance in action-conditioned world simulation, long-horizon forecasting, and simulative reasoning compared to other video generators and world models, taking a step towards general world models that enable predictive simulation of future world states for reasoning and acting.
zh

[NLP-36] Solving a Million-Step LLM Task with Zero Errors

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在执行需要大量逻辑步骤的长程任务时因累积误差导致无法规模化的问题。现有LLM在数百步后便容易出现错误,难以模拟人类、组织或社会层面的复杂流程。解决方案的关键在于提出MAKER系统,其核心是将任务极端分解为可由微代理(microagents)独立处理的子任务,并通过高效的多代理投票机制实现每一步的纠错,从而实现零错误地完成超过百万步的任务,且理论上可无限扩展。这一方法表明,大规模分解的代理流程(Massively Decomposed Agentic Processes, MDAPs)比单纯提升LLM自身能力更有效于解决高复杂度问题。

链接: https://arxiv.org/abs/2511.09030
作者: Elliot Meyerson,Giuseppe Paolo,Roberto Dailey,Hormoz Shahrzad,Olivier Francon,Conor F. Hayes,Xin Qiu,Babak Hodjat,Risto Miikkulainen
机构: Cognizant AI Lab; UT Austin
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: Main paper: 14 pages, 29 pages with references and appendix

点击查看摘要

Abstract:LLMs have achieved remarkable breakthroughs in reasoning, insights, and tool use, but chaining these abilities into extended processes at the scale of those routinely executed by humans, organizations, and societies has remained out of reach. The models have a persistent error rate that prevents scale-up: for instance, recent experiments in the Towers of Hanoi benchmark domain showed that the process inevitably becomes derailed after at most a few hundred steps. Thus, although LLM research is often still benchmarked on tasks with relatively few dependent logical steps, there is increasing attention on the ability (or inability) of LLMs to perform long range tasks. This paper describes MAKER, the first system that successfully solves a task with over one million LLM steps with zero errors, and, in principle, scales far beyond this level. The approach relies on an extreme decomposition of a task into subtasks, each of which can be tackled by focused microagents. The high level of modularity resulting from the decomposition allows error correction to be applied at each step through an efficient multi-agent voting scheme. This combination of extreme decomposition and error correction makes scaling possible. Thus, the results suggest that instead of relying on continual improvement of current LLMs, massively decomposed agentic processes (MDAPs) may provide a way to efficiently solve problems at the level of organizations and societies.
zh

[NLP-37] A Neurosymbolic Approach to Natural Language Formalization and Verification

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在金融、医疗等受监管行业中的应用受限问题,其根本原因在于LLMs固有的随机性导致推理结果不可控,难以满足严格合规性要求。解决方案的关键在于提出一个两阶段神经符号(neurosymbolic)框架:第一阶段利用LLMs结合可选的人类指导将自然语言政策形式化为逻辑表达式,实现对形式化过程的细粒度控制;第二阶段在推理时通过自动形式化(autoformalization)验证自然语言陈述与政策逻辑的一致性,并在关键场景下执行多冗余形式化步骤以交叉校验语义等价性,从而确保近零假阳性率(>99%保真度)。该方法生成可审计的逻辑产物,既支持验证结果的溯源,也可反哺原始文本的优化。

链接: https://arxiv.org/abs/2511.09008
作者: Sam Bayless,Stefano Buliani,Darion Cassel,Byron Cook,Duncan Clough,Rémi Delmas,Nafi Diallo,Ferhat Erata,Nick Feng,Dimitra Giannakopoulou,Aman Goel,Aditya Gokhale,Joe Hendrix,Marc Hudak,Dejan Jovanović,Andrew M. Kent,Benjamin Kiesl-Reiter,Jeffrey J. Kuna,Nadia Labai,Joseph Lilien,Divya Raghunathan,Zvonimir Rakamarić,Niloofar Razavi,Michael Tautschnig,Ali Torkamani,Nathaniel Weir,Michael W. Whalen,Jianan Yao
机构: Amazon Web Services (亚马逊网络服务); University College London (伦敦大学学院); University of Toronto (多伦多大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注: 20 pages, 12 figures

点击查看摘要

Abstract:Large Language Models perform well at natural language interpretation and reasoning, but their inherent stochasticity limits their adoption in regulated industries like finance and healthcare that operate under strict policies. To address this limitation, we present a two-stage neurosymbolic framework that (1) uses LLMs with optional human guidance to formalize natural language policies, allowing fine-grained control of the formalization process, and (2) uses inference-time autoformalization to validate logical correctness of natural language statements against those policies. When correctness is paramount, we perform multiple redundant formalization steps at inference time, cross checking the formalizations for semantic equivalence. Our benchmarks demonstrate that our approach exceeds 99% soundness, indicating a near-zero false positive rate in identifying logical validity. Our approach produces auditable logical artifacts that substantiate the verification outcomes and can be used to improve the original text.
zh

[NLP-38] AI Founding Fathers: A Case Study of GIS Search in Multi-Agent Pipelines

【速读】: 该论文旨在解决如何从大型语言模型(Large Language Models, LLMs)中提取更强推理能力的问题。当前LLMs虽具备出色的文本流畅性,但在复杂推理任务中仍存在局限,亟需系统性方法提升其逻辑严谨性和结构化思考能力。解决方案的关键在于提出一种基于“渐进式、增量式、序列化”(Gradual, Incremental, Sequential, GIS)搜索的多智能体流水线框架,并通过递归精炼(Recursive Refinement, RR)机制实现该框架:RR包含自我批评、对抗性压力测试与批判性反馈整合三个阶段,形成一个可控的迭代优化过程,从而引导模型在推理空间中进行更高效、更深入的探索。实验表明,采用此架构的复杂模型在分析深度、结构细腻度和策略呈现方面显著优于线性简单模型,验证了GIS搜索路径与递归精炼机制对增强LLM推理能力的有效性。

链接: https://arxiv.org/abs/2511.09005
作者: Alvin Chauhan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: 9 pages, 3 figures. Code and data available at this https URL

点击查看摘要

Abstract:Although Large Language Models (LLMs) show exceptional fluency, efforts persist to extract stronger reasoning capabilities from them. Drawing on search-based interpretations of LLM computation, this paper advances a systematic framework for understanding LLM reasoning and optimization. Namely, that enhancing reasoning is best achieved by structuring a multi-agent pipeline to ensure a traversal of the search space in a gradual, incremental, and sequential (GIS) manner. Stated succinctly, high-quality reasoning is a controlled, incremental search. To test this framework, we investigate the efficacy of recursive refinement (RR)–an iterative process of self-criticism, adversarial stress-testing, and integrating critical feedback–as a practical method for implementing GIS search. We designed an experiment comparing a simple, linear pipeline against a complex, explicitly structured pipeline leveraging a recursive refinement layer. The multi-agent models were constructed to reflect the historical personas of three US Founding Fathers (Hamilton, Jefferson, and Madison) using RAG-powered corpora and were prompted to generate responses to three contemporary political issues. Model performance was evaluated using a two-tiered approach: a quantitative score from an LLM arbiter agent and qualitative human judgment. Our results revealed that the complex model consistently outperformed the simple model across all nine test cases with an average arbiter-outputted score of 88.3 versus 71.7. The complex model’s arguments were superior in analytical depth, structural nuance, and strategic framing. We conclude that recursive refinement is a robust architectural feature for enhancing LLM reasoning via GIS search.
zh

[NLP-39] Detecting Emotional Dynamic Trajectories: An Evaluation Framework for Emotional Support in Language Models

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在情感支持(Emotional Support)能力评估中普遍存在的局限性,即现有方法多依赖于短时、静态对话样本,难以反映真实场景下用户情绪状态的动态演化与长期稳定性。其解决方案的关键在于提出一种基于轨迹(trajectory-based)的评估框架,从用户中心视角出发,通过构建包含328个情感情境和1,152个扰动事件的大规模基准数据集,模拟真实对话中的情绪波动;同时引入经心理学验证的情绪调节策略(如情境选择和认知重评)对模型输出进行约束,并将用户情绪轨迹建模为一阶马尔可夫过程,结合因果调整的情绪估计方法实现无偏的情绪状态追踪。最终,该框架定义了三个轨迹级指标:基线情绪水平(Baseline Emotional Level, BEL)、情绪轨迹波动性(Emotional Trajectory Volatility, ETV)和情绪中心位置(Emotional Centroid Position, ECP),从而系统性地刻画用户情绪随时间演变的特性,推动对LLMs长期情感支持能力的科学评估。

链接: https://arxiv.org/abs/2511.09003
作者: Zhouxing Tan,Ruochong Xiong,Yulong Wan,Jinlong Ma,Hanlin Xue,Qichun Deng,Haifeng Jing,Zhengtong Zhang,Depei Liu,Shiyuan Luo,Junfei Liu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Emotional support is a core capability in human-AI interaction, with applications including psychological counseling, role play, and companionship. However, existing evaluations of large language models (LLMs) often rely on short, static dialogues and fail to capture the dynamic and long-term nature of emotional support. To overcome this limitation, we shift from snapshot-based evaluation to trajectory-based assessment, adopting a user-centered perspective that evaluates models based on their ability to improve and stabilize user emotional states over time. Our framework constructs a large-scale benchmark consisting of 328 emotional contexts and 1,152 disturbance events, simulating realistic emotional shifts under evolving dialogue scenarios. To encourage psychologically grounded responses, we constrain model outputs using validated emotion regulation strategies such as situation selection and cognitive reappraisal. User emotional trajectories are modeled as a first-order Markov process, and we apply causally-adjusted emotion estimation to obtain unbiased emotional state tracking. Based on this framework, we introduce three trajectory-level metrics: Baseline Emotional Level (BEL), Emotional Trajectory Volatility (ETV), and Emotional Centroid Position (ECP). These metrics collectively capture user emotional dynamics over time and support comprehensive evaluation of long-term emotional support performance of LLMs. Extensive evaluations across a diverse set of LLMs reveal significant disparities in emotional support capabilities and provide actionable insights for model development.
zh

[NLP-40] SpiralThinker: Latent Reasoning through an Iterative Process with Text-Latent Interleaving

【速读】: 该论文旨在解决现有隐式推理(latent reasoning)方法中缺乏稳定潜空间表示演化机制,以及无法系统性地融合隐式与显式推理的问题。其解决方案的关键在于提出SpiralThinker框架,通过在潜空间中进行迭代更新实现无需生成额外文本标记的扩展隐式推理,并结合渐进对齐目标与结构化标注,确保潜空间推理与文本推理之间的一致性。实验证明,这种对齐驱动的迭代机制是提升推理性能的核心要素。

链接: https://arxiv.org/abs/2511.08983
作者: Shengmin Piao,Sanghyun Park
机构: Yonsei University (延世大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advances in large reasoning models have been driven by reinforcement learning and test-time scaling, accompanied by growing interest in latent rather than purely textual reasoning. However, existing latent reasoning methods lack mechanisms to ensure stable evolution of latent representations and a systematic way to interleave implicit and explicit reasoning. We introduce SpiralThinker, a unified framework that performs iterative updates over latent representations, enabling extended implicit reasoning without generating additional tokens. A progressive alignment objective combined with structured annotations maintains coherence between latent and textual reasoning. Across mathematical, logical, and commonsense reasoning tasks, SpiralThinker achieves the best overall performance among latent reasoning approaches, consistently surpassing previous methods across all benchmarks. Detailed analyses reveal that both iteration and alignment are indispensable, the numbers of latent tokens and iterations exhibit dataset-specific optima, and appropriate alignment proves critical for an effective iterative process. Overall, SpiralThinker bridges iterative computation and latent reasoning, demonstrating that aligned iterative updates can reliably steer reasoning in the latent space.
zh

[NLP-41] Bayesian Mixture of Experts For Large Language Models

【速读】: 该论文旨在解决微调后的大型语言模型(Large Language Models, LLMs)在推理过程中缺乏可靠不确定性估计的问题,从而提升下游任务决策的可靠性。解决方案的关键在于提出一种后处理的贝叶斯不确定性估计框架——Bayesian Mixture of Experts (Bayesian-MoE),其核心创新是通过对MoE架构中每个专家的第二线性层应用结构化的拉普拉斯近似(structured Laplace approximation),在不修改原始训练流程且不引入额外参数的前提下,实现校准的不确定性估计。该方法利用MoE模型固有的模块化结构进行可 tractable 的块状后验估计,并结合Kronecker分解的低秩近似来建模参数空间的曲率,从而高效计算预测不确定性与边际似然,实验表明该方法显著改善了预期校准误差(Expected Calibration Error, ECE)和负对数似然(Negative Log-Likelihood, NLL)。

链接: https://arxiv.org/abs/2511.08968
作者: Maryam Dialameh,Hossein Rajabzadeh,Weiwei Zhang,Walid Ahmed,Hyock Ju Kwon
机构: University of Waterloo (滑铁卢大学); Ascend Team, Huawei Technologies (华为技术有限公司)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present Bayesian Mixture of Experts (Bayesian-MoE), a post-hoc uncertainty estimation framework for fine-tuned large language models (LLMs) based on Mixture-of-Experts architectures. Our method applies a structured Laplace approximation to the second linear layer of each expert, enabling calibrated uncertainty estimation without modifying the original training procedure or introducing new parameters. Unlike prior approaches, which apply Bayesian inference to added adapter modules, Bayesian-MoE directly targets the expert pathways already present in MoE models, leveraging their modular design for tractable block-wise posterior estimation. We use Kronecker-factored low-rank approximations to model curvature and derive scalable estimates of predictive uncertainty and marginal likelihood. Experiments on common-sense reasoning benchmarks with Qwen1.5-MoE and DeepSeek-MoE demonstrate that Bayesian-MoE improves both expected calibration error (ECE) and negative log-likelihood (NLL) over baselines, confirming its effectiveness for reliable downstream decision-making.
zh

[NLP-42] EVADE: LLM -Based Explanation Generation and Validation for Error Detection in NLI

【速读】: 该论文旨在解决自然语言推理(Natural Language Inference, NLI)任务中因人类标注变异(Human Label Variation, HLV)导致的数据集质量难题,即如何在存在多个合理标签的情况下区分标注错误与合法变体。其解决方案的关键在于提出一种名为EVADE的新框架,利用大语言模型(Large Language Models, LLMs)生成并验证解释(explanations),从而自动检测和剔除错误标注,相较于传统两轮人工标注方法显著降低人力成本,并提升数据质量与模型微调性能。

链接: https://arxiv.org/abs/2511.08949
作者: Longfei Zuo,Barbara Plank,Siyao Peng
机构: LMU Munich (慕尼黑路德维希马克西米利安大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:High-quality datasets are critical for training and evaluating reliable NLP models. In tasks like natural language inference (NLI), human label variation (HLV) arises when multiple labels are valid for the same instance, making it difficult to separate annotation errors from plausible variation. An earlier framework VARIERR (Weber-Genzel et al., 2024) asks multiple annotators to explain their label decisions in the first round and flag errors via validity judgments in the second round. However, conducting two rounds of manual annotation is costly and may limit the coverage of plausible labels or explanations. Our study proposes a new framework, EVADE, for generating and validating explanations to detect errors using large language models (LLMs). We perform a comprehensive analysis comparing human- and LLM-detected errors for NLI across distribution comparison, validation overlap, and impact on model fine-tuning. Our experiments demonstrate that LLM validation refines generated explanation distributions to more closely align with human annotations, and that removing LLM-detected errors from training data yields improvements in fine-tuning performance than removing errors identified by human annotators. This highlights the potential to scale error detection, reducing human effort while improving dataset quality under label variation.
zh

[NLP-43] ransactionGPT

【速读】: 该论文旨在解决消费交易数据中复杂动态建模与多任务预测的难题,特别是在大规模支付网络场景下如何高效地理解并生成交易轨迹,同时支持下游分类与预测任务。其核心解决方案是提出了一种专为交易数据设计的3D-Transformer架构(3D-Transformer architecture),该架构通过创新性的模态融合机制和计算效率优化,实现了对交易时序、用户行为及商户特征等多维信息的联合建模,并支持与下游目标的端到端联合优化。该模型在百亿级真实交易数据上训练,显著优于现有生产模型,在未来交易生成和分类任务中均展现出优越性能,且相较微调的大语言模型(LLM)在预测精度、训练与推理速度方面更具优势。

链接: https://arxiv.org/abs/2511.08939
作者: Yingtong Dou,Zhimeng Jiang,Tianyi Zhang,Mingzhi Hu,Zhichao Xu,Shubham Jain,Uday Singh Saini,Xiran Fan,Jiarui Sun,Menghai Pan,Junpeng Wang,Xin Dai,Liang Wang,Chin-Chia Michael Yeh,Yujie Fan,Vineeth Rakesh,Huiyuan Chen,Mangesh Bendre,Zhongfang Zhuang,Xiaoting Li,Prince Aboagye,Vivian Lai,Minghua Xu,Hao Yang,Yiwei Cai,Mahashweta Das,Yuzhong Chen
机构: Visa Research
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Technical Report

点击查看摘要

Abstract:We present TransactionGPT (TGPT), a foundation model for consumer transaction data within one of world’s largest payment networks. TGPT is designed to understand and generate transaction trajectories while simultaneously supporting a variety of downstream prediction and classification tasks. We introduce a novel 3D-Transformer architecture specifically tailored for capturing the complex dynamics in payment transaction data. This architecture incorporates design innovations that enhance modality fusion and computational efficiency, while seamlessly enabling joint optimization with downstream objectives. Trained on billion-scale real-world transactions, TGPT significantly improves downstream classification performance against a competitive production model and exhibits advantages over baselines in generating future transactions. We conduct extensive empirical evaluations utilizing a diverse collection of company transaction datasets spanning multiple downstream tasks, thereby enabling a thorough assessment of TGPT’s effectiveness and efficiency in comparison to established methodologies. Furthermore, we examine the incorporation of LLM-derived embeddings within TGPT and benchmark its performance against fine-tuned LLMs, demonstrating that TGPT achieves superior predictive accuracy as well as faster training and inference. We anticipate that the architectural innovations and practical guidelines from this work will advance foundation models for transaction-like data and catalyze future research in this emerging field.
zh

[NLP-44] he Double Contingency Problem: AI Recursion and the Limits of Interspecies Understanding NEURIPS2025

【速读】: 该论文试图解决当前生物声学人工智能(Bioacoustic AI)系统在跨物种通信分析中忽视递归认知交互的问题,即当具有注意力机制、迭代处理和反馈回路的AI系统与其它物种的递归交流过程相遇时,其信息处理方式可能系统性地遮蔽或扭曲其他物种的沟通结构。解决方案的关键在于将生物声学AI从“通用模式识别”重新概念化为不同形式递归认知之间的“外交式相遇”,强调模型设计、评估框架和研究方法需适应双 contingent 性问题——即物种交流依赖于生态与进化条件的偶然性,而AI系统则受自身架构与训练条件的偶然性制约。

链接: https://arxiv.org/abs/2511.08927
作者: Graham L. Bishop(University of California, San Diego)
机构: UC San Diego (加州大学圣地亚哥分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 5 pages, no figures, to be published in the NeurIPS 2025: AI for Non-Human Animal Communication Workshop Proceedings

点击查看摘要

Abstract:Current bioacoustic AI systems achieve impressive cross-species performance by processing animal communication through transformer architectures, foundation model paradigms, and other computational approaches. However, these approaches overlook a fundamental question: what happens when one form of recursive cognition–AI systems with their attention mechanisms, iterative processing, and feedback loops–encounters the recursive communicative processes of other species? Drawing on philosopher Yuk Hui’s work on recursivity and contingency, I argue that AI systems are not neutral pattern detectors but recursive cognitive agents whose own information processing may systematically obscure or distort other species’ communicative structures. This creates a double contingency problem: each species’ communication emerges through contingent ecological and evolutionary conditions, while AI systems process these signals through their own contingent architectural and training conditions. I propose that addressing this challenge requires reconceptualizing bioacoustic AI from universal pattern recognition toward diplomatic encounter between different forms of recursive cognition, with implications for model design, evaluation frameworks, and research methodologies.
zh

[NLP-45] DAR: Think in Diffusion Talk in Autoregression

【速读】: 该论文旨在解决生成式 AI (Generative AI) 中并行生成效率与自回归(AR)模型质量之间的权衡问题:如何在保持高吞吐量和高GPU利用率的同时,达到AR模型的文本质量水平。现有方法要么因采用弱模型进行串行草稿(如推测解码)导致草稿效率低下,要么在扩散模型中引入类似AR的解码逻辑,从而牺牲并行性并降低质量。解决方案的关键在于提出TiDAR——一种序列级混合架构,其通过结构化注意力掩码在单次前向传播中实现两阶段操作:首先在扩散过程中并行草稿(Thinking),随后以自回归方式采样最终输出(Talking);该设计充分利用了GPU计算密度,实现了草稿与验证能力的高效平衡,并支持精确的KV缓存机制,显著提升推理效率与质量,最终在1.5B至8B规模模型上首次在保持AR级质量的同时实现4.71x至5.91x的token每秒生成速率提升。

链接: https://arxiv.org/abs/2511.08923
作者: Jingyu Liu,Xin Dong,Zhifan Ye,Rishabh Mehta,Yonggan Fu,Vartika Singh,Jan Kautz,Ce Zhang,Pavlo Molchanov
机构: NVIDIA(英伟达)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: NVIDIA-Tech Report

点击查看摘要

Abstract:Diffusion language models hold the promise of fast parallel generation, while autoregressive (AR) models typically excel in quality due to their causal structure aligning naturally with language modeling. This raises a fundamental question: can we achieve a synergy with high throughput, higher GPU utilization, and AR level quality? Existing methods fail to effectively balance these two aspects, either prioritizing AR using a weaker model for sequential drafting (speculative decoding), leading to lower drafting efficiency, or using some form of left-to-right (AR-like) decoding logic for diffusion, which still suffers from quality degradation and forfeits its potential parallelizability. We introduce TiDAR, a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively - all within a single forward pass using specially designed structured attention masks. This design exploits the free GPU compute density, achieving a strong balance between drafting and verification capacity. Moreover, TiDAR is designed to be serving-friendly (low overhead) as a standalone model. We extensively evaluate TiDAR against AR models, speculative decoding, and diffusion variants across generative and likelihood tasks at 1.5B and 8B scales. Thanks to the parallel drafting and sampling as well as exact KV cache support, TiDAR outperforms speculative decoding in measured throughput and surpasses diffusion models like Dream and Llada in both efficiency and quality. Most notably, TiDAR is the first architecture to close the quality gap with AR models while delivering 4.71x to 5.91x more tokens per second.
zh

[NLP-46] HalluClean: A Unified Framework to Combat Hallucinations in LLM s

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)生成内容中普遍存在幻觉(hallucination)问题,即模型输出与事实不符的虚假或无依据陈述,从而影响其在实际应用中的可信度。解决方案的关键在于提出一种轻量级、任务无关的框架 HalluClean,其核心创新是采用增强推理范式,将检测与修正过程显式分解为规划(planning)、执行(execution)和修订(revision)三个阶段,通过最小化任务路由提示(task-routing prompts)实现零样本跨领域泛化,且无需依赖外部知识源或监督式检测器,从而有效识别并修正不支持的主张,显著提升生成文本的事实一致性。

链接: https://arxiv.org/abs/2511.08916
作者: Yaxin Zhao,Yu Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved impressive performance across a wide range of natural language processing tasks, yet they often produce hallucinated content that undermines factual reliability. To address this challenge, we introduce HalluClean, a lightweight and task-agnostic framework for detecting and correcting hallucinations in LLM-generated text. HalluClean adopts a reasoning-enhanced paradigm, explicitly decomposing the process into planning, execution, and revision stages to identify and refine unsupported claims. It employs minimal task-routing prompts to enable zero-shot generalization across diverse domains, without relying on external knowledge sources or supervised detectors. We conduct extensive evaluations on five representative tasks-question answering, dialogue, summarization, math word problems, and contradiction detection. Experimental results show that HalluClean significantly improves factual consistency and outperforms competitive baselines, demonstrating its potential to enhance the trustworthiness of LLM outputs in real-world applications.
zh

[NLP-47] Hallucinate or Memorize? The Two Sides of Probabilistic Learning in Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在文献引用推荐中产生的幻觉问题,即模型生成不存在的参考文献。其解决方案的关键在于揭示了引用频次(citation count)作为训练数据冗余度(training data redundancy)代理指标的作用:研究发现,高被引论文(即在预训练语料库中出现频率更高的文献)具有更低的幻觉率,且当引用次数超过约1000次时,模型对这些文献的引用信息几乎完全以记忆形式保留,而非生成;同时识别出记忆干扰现象——当多个高被引文献内容相似时,模型易发生混淆。这一阈值标志着从泛化到记忆的转变,为提升LLM引用准确性提供了关键机制解释。

链接: https://arxiv.org/abs/2511.08877
作者: Junichiro Niimi
机构: Meijo University (明治大学); RIKEN AIP (理化学研究所先进智能研究中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have been increasingly applied to a wide range of tasks, from natural language understanding to code generation. While they have also been used to assist in citation recommendation, the hallucination of non-existent papers remains a major issue. Building on prior studies, this study hypothesizes that an LLM’s ability to correctly produce bibliographic records depends on whether the underlying knowledge is generated or memorized, with highly cited papers (i.e., more frequently appear in the pretraining corpus) showing lower hallucination rates. We therefore assume citation count as a proxy for training data redundancy (i.e., the frequency with which a given bibliographic record appears in the pretraining corpus) and investigate how citation frequency affects hallucinated references in LLM outputs. Using GPT-4.1, we generated and manually verified 100 citations across twenty computer-science domains, and measured factual consistency via cosine similarity between generated and authentic metadata. The results revealed that (i) citation count is strongly correlated with factual accuracy, (ii) bibliographic information becomes almost verbatim memorized beyond roughly 1,000 citations, and (iii) memory interference occurs when multiple highly cited papers share similar content. These findings indicate a threshold where generalization shifts into memorization, with highly cited papers being nearly verbatim retained in the model.
zh

[NLP-48] BioVerge: A Comprehensive Benchmark and Study of Self-Evaluating Agents for Biomedical Hypothesis Generation

【速读】: 该论文旨在解决当前生物医学假说生成(hypothesis generation)方法受限于单一数据类型或预定义提取模式,难以发现新颖且复杂关联的问题。解决方案的关键在于提出 BioVerge 基准数据集与 BioVerge Agent 框架,构建一个标准化的执行环境,其中 BioVerge Agent 采用基于 ReAct 的架构,包含独立的生成(Generation)与评估(Evaluation)模块,通过迭代式生成与自我评估机制提升假说的创新性与相关性,同时融合结构化和文本信息源以增强上下文多样性。

链接: https://arxiv.org/abs/2511.08866
作者: Fuyi Yang,Chenchen Ye,Mingyu Derek Ma,Yijia Xiao,Matthew Yang,Wei Wang
机构: University of California, Los Angeles (加州大学洛杉矶分校)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Hypothesis generation in biomedical research has traditionally centered on uncovering hidden relationships within vast scientific literature, often using methods like Literature-Based Discovery (LBD). Despite progress, current approaches typically depend on single data types or predefined extraction patterns, which restricts the discovery of novel and complex connections. Recent advances in Large Language Model (LLM) agents show significant potential, with capabilities in information retrieval, reasoning, and generation. However, their application to biomedical hypothesis generation has been limited by the absence of standardized datasets and execution environments. To address this, we introduce BioVerge, a comprehensive benchmark, and BioVerge Agent, an LLM-based agent framework, to create a standardized environment for exploring biomedical hypothesis generation at the frontier of existing scientific knowledge. Our dataset includes structured and textual data derived from historical biomedical hypotheses and PubMed literature, organized to support exploration by LLM agents. BioVerge Agent utilizes a ReAct-based approach with distinct Generation and Evaluation modules that iteratively produce and self-assess hypothesis proposals. Through extensive experimentation, we uncover key insights: 1) different architectures of BioVerge Agent influence exploration diversity and reasoning strategies; 2) structured and textual information sources each provide unique, critical contexts that enhance hypothesis generation; and 3) self-evaluation significantly improves the novelty and relevance of proposed hypotheses.
zh

[NLP-49] Beyond Task-Oriented and Chitchat Dialogues: Proactive and Transition-Aware Conversational Agents EMNLP2025

【速读】: 该论文旨在解决传统对话代理在任务导向型对话(Task-Oriented Dialogue, TOD)与开放式闲聊(Chitchat)之间缺乏统一建模的问题,而现实中的对话往往需要在这两种模式间自然切换。解决方案的关键在于提出一个名为TACT(TOD-And-Chitchat Transition)的数据集,该数据集支持用户和代理驱动的模式切换,并包含结构多样且集成的对话流;同时引入两个新指标——Switch和Recovery,用于评估模型在模式切换中的主动性与恢复能力。通过在TACT上训练模型并结合直接偏好优化(Direct Preference Optimization, DPO),显著提升了意图识别准确率和模式转换控制能力,最终实现了75.74%的联合模式-意图准确率及70.1%的人类评估胜率,验证了结构化多样性数据与DPO协同优化对构建更主动、过渡感知的对话系统的重要性。

链接: https://arxiv.org/abs/2511.08835
作者: Yejin Yoon,Yuri Son,Namyoung So,Minseo Kim,Minsoo Cho,Chanhee Park,Seungshin Lee,Taeuk Kim
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: accepted to EMNLP2025

点击查看摘要

Abstract:Conversational agents have traditionally been developed for either task-oriented dialogue (TOD) or open-ended chitchat, with limited progress in unifying the two. Yet, real-world conversations naturally involve fluid transitions between these modes. To address this gap, we introduce TACT (TOD-And-Chitchat Transition), a dataset designed for transition-aware dialogue modeling that incorporates structurally diverse and integrated mode flows. TACT supports both user- and agent-driven mode switches, enabling robust modeling of complex conversational dynamics. To evaluate an agent’s ability to initiate and recover from mode transitions, we propose two new metrics – Switch and Recovery. Models trained on TACT outperform baselines in both intent detection and mode transition handling. Moreover, applying Direct Preference Optimization (DPO) to TACT-trained models yields additional gains, achieving 75.74% joint mode-intent accuracy and a 70.1% win rate against GPT-4o in human evaluation. These results demonstrate that pairing structurally diverse data with DPO enhances response quality and transition control, paving the way for more proactive and transition-aware conversational agents.
zh

[NLP-50] BayesQ: Uncertainty-Guided Bayesian Quantization

【速读】: 该论文旨在解决后训练量化(Post-Training Quantization, PTQ)中因忽略权重不确定性而导致的精度损失问题,其核心挑战在于如何在有限比特预算下实现最优的量化性能。解决方案的关键在于引入贝叶斯视角,构建轻量级高斯后验分布(Gaussian posterior over weights),通过后验协方差进行白化预处理,并设计基于后验期望失真的码本(codebooks)以最小化量化误差;同时采用贪心背包算法分配混合精度,最大化每比特边际预期损失减少量。该方法将低比特量化重新建模为一种不确定性感知的风险最小化过程,在ResNet-50和BERT-base等模型上显著优于现有强基线(如GPTQ),且仅需一次与GPTQ相当的预处理开销。

链接: https://arxiv.org/abs/2511.08821
作者: Ismail Lamaakal,Chaymae Yahyati,Yassine Maleh,Khalid El Makkaoui,Ibrahim Ouahbi
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present BayesQ, an uncertainty-guided post-training quantization framework that is the first to optimize quantization under the posterior expected loss. BayesQ fits a lightweight Gaussian posterior over weights (diagonal Laplace by default; optional K-FAC/low-rank), whitens by the posterior covariance, designs codebooks to minimize posterior-expected distortion, and allocates mixed precision via a greedy knapsack that maximizes marginal expected-loss reduction per bit under a global budget. For scalar quantizers, posterior-expected MSE yields closed-form tables; task-aware proxies are handled by short Monte Carlo on a small calibration set. An optional calibration-only distillation aligns the quantized model with the posterior predictive teacher. At matched average bits/weight of 3.0/3.5/4.0, BayesQ improves over strong PTQ baselines on ResNet-50 (ImageNet) and BERT-base (GLUE) e.g., vs. GPTQ by +1.5/+0.7/+0.3 top-1 percentage points on RN50 and +1.1/+0.4/+0.2 GLUE points on BERT, while requiring one-time preprocessing comparable to a GPTQ pass. BayesQ reframes low-bit quantization as uncertainty-aware risk minimization in a practical, post-training pipeline.
zh

[NLP-51] BNLI: A Linguistically-Refined Bengali Dataset for Natural Language Inference

【速读】: 该论文旨在解决 Bengali 语言在自然语言推理(Natural Language Inference, NLI)研究中资源极度匮乏的问题,特别是现有 Bengali NLI 数据集存在的标注错误、句子对歧义以及语言多样性不足等缺陷,这些问题严重制约了模型的有效训练与评估。解决方案的关键在于构建了一个经过严格注释流程优化的 Bengali NLI 数据集 BNLI,其核心特征是强调语义清晰性与三类关系(蕴含、矛盾、中立)之间的平衡,并通过多款先进的 Transformer 架构进行基准测试,验证了 BNLI 在提升模型可靠性与可解释性方面的有效性,从而为 Bengali 及其他低资源语言的推理任务提供了高质量的数据基础。

链接: https://arxiv.org/abs/2511.08813
作者: Farah Binta Haque,Md Yasin,Shishir Saha,Md Shoaib Akhter Rafi,Farig Sadeque
机构: BRAC University (BRAC大学); Bangladesh University of Engineering and Technology (孟加拉国工程技术大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Despite the growing progress in Natural Language Inference (NLI) research, resources for the Bengali language remain extremely limited. Existing Bengali NLI datasets exhibit several inconsistencies, including annotation errors, ambiguous sentence pairs, and inadequate linguistic diversity, which hinder effective model training and evaluation. To address these limitations, we introduce BNLI, a refined and linguistically curated Bengali NLI dataset designed to support robust language understanding and inference modeling. The dataset was constructed through a rigorous annotation pipeline emphasizing semantic clarity and balance across entailment, contradiction, and neutrality classes. We benchmarked BNLI using a suite of state-of-the-art transformer-based architectures, including multilingual and Bengali-specific models, to assess their ability to capture complex semantic relations in Bengali text. The experimental findings highlight the improved reliability and interpretability achieved with BNLI, establishing it as a strong foundation for advancing research in Bengali and other low-resource language inference tasks.
zh

[NLP-52] oward Automated Cognitive Assessment in Parkinsons Disease Using Pretrained Language Models

【速读】: 该论文旨在解决如何从帕金森病(Parkinson’s disease, PD)患者的非结构化第一人称叙述中自动识别反映多种认知过程的类别问题,以捕捉疾病相关认知与情绪变化的细微特征。其解决方案的关键在于开发并比较三种自然语言处理(Natural Language Processing, NLP)模型:基于Bio_ClinicalBERT的嵌套实体识别模型、使用QLoRA微调的Meta-Llama-3-8B-Instruct指令跟随模型,以及在零样本和少样本设置下评估的GPT-4o mini模型。其中,微调后的Llama-3模型在整体F1分数上表现最优(micro-average 0.74,macro-average 0.59),尤其在依赖上下文的认知类别如“思维”和“社交互动”中展现出显著优势,表明针对复杂认知语义的指令微调和上下文理解能力是提升此类任务性能的核心因素。

链接: https://arxiv.org/abs/2511.08806
作者: Varada Khanna(1),Nilay Bhatt(1),Ikgyu Shin(1),Sule Tinaz(2),Yang Ren(1),Hua Xu(1),Vipina K. Keloth(1) ((1) Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT, (2) Department of Neurology, Yale School of Medicine, New Haven, CT)
机构: 未知
类目: Computation and Language (cs.CL)
备注: 15 pages, 4 figures, 1 table. Varada Khanna and Nilay Bhatt are co-first authors. Sule Tinaz and Hua Xu are co-senior authors. Corresponding author: Vipina K. Keloth ( this http URL @yale.edu)

点击查看摘要

Abstract:Understanding how individuals with Parkinson’s disease (PD) describe cognitive experiences in their daily lives can offer valuable insights into disease-related cognitive and emotional changes. However, extracting such information from unstructured patient narratives is challenging due to the subtle, overlapping nature of cognitive constructs. This study developed and evaluated natural language processing (NLP) models to automatically identify categories that reflect various cognitive processes from de-identified first-person narratives. Three model families, a Bio_ClinicalBERT-based span categorization model for nested entity recognition, a fine-tuned Meta-Llama-3-8B-Instruct model using QLoRA for instruction following, and GPT-4o mini evaluated under zero- and few-shot settings, were compared on their performance on extracting seven categories. Our findings indicated that model performance varied substantially across categories and model families. The fine-tuned Meta-Llama-3-8B-Instruct achieved the highest overall F1-scores (0.74 micro-average and 0.59 macro-average), particularly excelling in context-dependent categories such as thought and social interaction. Bio_ClinicalBERT exhibited high precision but low recall and performed comparable to Llama for some category types such as location and time but failed on other categories such as thought, emotion and social interaction. Compared to conventional information extraction tasks, this task presents a greater challenge due to the abstract and overlapping nature of narrative accounts of complex cognitive processes. Nonetheless, with continued refinement, these NLP systems hold promise for enabling low-burden, longitudinal monitoring of cognitive function and serving as a valuable complement to formal neuropsychological assessments in PD.
zh

[NLP-53] Structured Uncertainty guided Clarification for LLM Agents

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)代理在面对模糊用户指令时,因工具调用参数不确定性导致错误调用和任务失败的问题。其核心解决方案在于提出一种结构化的不确定性建模方法,将工具调用参数的不确定性形式化为部分可观测马尔可夫决策过程(Partially Observable Markov Decision Process, POMDP),并引入期望完美信息价值(Expected Value of Perfect Information, EVPI)目标以优化澄清问题的选择;同时采用基于任务方面的成本建模机制避免冗余交互。该方法通过SAGE-Agent实现,在提升模糊任务覆盖率7-39%的同时,使澄清提问次数减少1.5-2.7倍,显著提升了工具增强型代理的任务成功率与交互效率。

链接: https://arxiv.org/abs/2511.08798
作者: Manan Suri,Puneet Mathur,Nedim Lipka,Franck Dernoncourt,Ryan A. Rossi,Dinesh Manocha
机构: University of Maryland, College Park (马里兰大学学院市分校); Adobe Research (Adobe 研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM agents extend large language models with tool-calling capabilities, but ambiguous user instructions often lead to incorrect invocations and task failures. We introduce a principled formulation of structured uncertainty over tool-call parameters, modeling joint tool-argument clarification as a POMDP with Expected Value of Perfect Information (EVPI) objective for optimal question selection and aspect-based cost modeling to prevent redundancy. Our SAGE-Agent leverages this structured uncertainty to achieve superior efficiency: increasing coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7 \times compared to strong prompting and uncertainty-based baselines. We present ClarifyBench, the first multi-turn tool-augmented disambiguation benchmark with realistic LLM-based user simulation across diverse domains including document editing, vehicle control, and travel booking. Additionally, we demonstrate that structured uncertainty provides effective training signals for reinforcement learning, boosting When2Call accuracy from 36.5% to 65.2% (3B model) and 36.7% to 62.9% (7B model) through uncertainty-weighted GRPO training. These results establish structured uncertainty as a principled, efficient approach for tool-augmented agents, improving both task success and interaction efficiency in real-world scenarios.
zh

[NLP-54] Benevolent Dictators? On LLM Agent Behavior in Dictator Games

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的代理行为研究中存在的重要问题:现有研究往往忽视系统提示(system prompt)对模型行为的影响,且未充分评估提示微小变化所引发的行为敏感性,导致结果缺乏稳健性。为克服这一局限,作者提出LLM代理行为研究框架(LLM agent behavior study, LLM-ABS),其关键在于通过系统化地探索不同系统提示对LLM行为的影响、采用中立提示变体以获得更可靠的行为偏好分析,并结合开放指令下的语言特征提取来深入理解代理决策背后的推理机制。该框架有效提升了LLM代理行为研究的可重复性和可信度,为未来复杂行为建模提供了坚实基础。

链接: https://arxiv.org/abs/2511.08721
作者: Andreas Einwiller,Kanishka Ghosh Dastidar,Artur Romazanov,Annette Hautli-Janisz,Michael Granitzer,Florian Lemmerich
机构: University of Passau (帕绍大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 7 pages, 2 figures, v1 init

点击查看摘要

Abstract:In behavioral sciences, experiments such as the ultimatum game are conducted to assess preferences for fairness or self-interest of study participants. In the dictator game, a simplified version of the ultimatum game where only one of two players makes a single decision, the dictator unilaterally decides how to split a fixed sum of money between themselves and the other player. Although recent studies have explored behavioral patterns of AI agents based on Large Language Models (LLMs) instructed to adopt different personas, we question the robustness of these results. In particular, many of these studies overlook the role of the system prompt - the underlying instructions that shape the model’s behavior - and do not account for how sensitive results can be to slight changes in prompts. However, a robust baseline is essential when studying highly complex behavioral aspects of LLMs. To overcome previous limitations, we propose the LLM agent behavior study (LLM-ABS) framework to (i) explore how different system prompts influence model behavior, (ii) get more reliable insights into agent preferences by using neutral prompt variations, and (iii) analyze linguistic features in responses to open-ended instructions by LLM agents to better understand the reasoning behind their behavior. We found that agents often exhibit a strong preference for fairness, as well as a significant impact of the system prompt on their behavior. From a linguistic perspective, we identify that models express their responses differently. Although prompt sensitivity remains a persistent challenge, our proposed framework demonstrates a robust foundation for LLM agent behavior studies. Our code artifacts are available at this https URL.
zh

[NLP-55] AI-generated podcasts: Synthetic Intimacy and Cultural Translation in NotebookLMs Audio Overviews

【速读】: 该论文试图解决的问题是:AI生成的播客(AI-generated podcasts)作为新兴媒体形式,尚未被系统性地分析其媒介特性与社会影响,尤其是在文化语境和公共领域建构方面的潜在机制。解决方案的关键在于通过实验性上传不同类型的文本至Google的NotebookLM平台,生成对应的播客内容,并对其结构与语言特征进行细致分析,从而揭示其基于固定模板的叙事逻辑以及将多元文化背景抽象为标准化白人中产阶级美式语境的文化转译倾向,这一发现标志着从人类播客所体现的多中心公共领域向单一化、去情境化的媒介抽象形态的转变。

链接: https://arxiv.org/abs/2511.08654
作者: Jill Walker Rettberg
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement number 101142306. The project is also supported by the Center for Digital Narrative, which is funded by the Research Council of Norway through its Centres of Excellence scheme, project number 332643

点击查看摘要

Abstract:This paper analyses AI-generated podcasts produced by Google’s NotebookLM, which generates audio podcasts with two chatty AI hosts discussing whichever documents a user uploads. While AI-generated podcasts have been discussed as tools, for instance in medical education, they have not yet been analysed as media. By uploading different types of text and analysing the generated outputs I show how the podcasts’ structure is built around a fixed template. I also find that NotebookLM not only translates texts from other languages into a perky standardised Mid-Western American accent, it also translates cultural contexts to a white, educated, middle-class American default. This is a distinct development in how publics are shaped by media, marking a departure from the multiple public spheres that scholars have described in human podcasting from the early 2000s until today, where hosts spoke to specific communities and responded to listener comments, to an abstraction of the podcast genre.
zh

[NLP-56] Accelerating Training Speed of Tiny Recursive Models via Curriculum Guided Adaptive Recursion

【速读】: 该论文旨在解决递归推理模型(recursive reasoning models)在训练过程中计算成本过高这一问题,现有方法通常需要约36 GPU小时/数据集,限制了其广泛应用与研究进展。解决方案的关键在于提出一种新颖的训练方法CGAR(Curriculum-guided Architecture Refinement),其核心创新是将课程学习(curriculum learning)应用于模型架构深度而非传统数据顺序:一是引入渐进式深度课程(Progressive Depth Curriculum),在训练中从浅层到深层逐步增加递归深度,防止早期过拟合并显著降低计算开销;二是设计分层监督权重机制(Hierarchical Supervision Weighting),对不同推理步骤施加指数衰减的损失权重,使其与梯度幅度衰减趋势一致。实验表明,CGAR在Sudoku-Extreme数据集上实现1.71倍训练加速(从10.93降至6.38小时)且精度仅下降0.63%,同时提升推理效率,证明了基于架构深度的课程学习可实现训练效率与解题质量的帕累托改进。

链接: https://arxiv.org/abs/2511.08653
作者: Kaleem Ullah Qasim,Jiashu Zhang
机构: Southwest Jiaotong University (西南交通大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Recursive reasoning models achieve remarkable performance on complex reasoning tasks through iterative refinement, enabling tiny networks to match large language models thousands of times their size. However, training remains computationally expensive, prior work reporting approximately 36 GPU-hours per dataset, limiting broader adoption and research. We propose CGAR, a novel training methodology that applies curriculum learning to architectural depth rather than traditional data ordering. CGAR introduces two synergistic components: Progressive Depth Curriculum dynamically adjusts recursion depth from shallow to deep configurations during training, preventing early overfitting while reducing computational cost, and Hierarchical Supervision Weighting applies exponentially decaying importance to supervision steps, aligning loss weighting with observed gradient magnitude decay. On Sudoku-Extreme with 423,168 test puzzles, CGAR achieves 1.71x training speedup (10.93 to 6.38 hours, 42% cost reduction) with only 0.63% accuracy drop (86.65% to 86.02%). Systematic ablations reveal Progressive Depth Curriculum alone achieves 2.26x speedup with 85.47% accuracy, demonstrating a rare Pareto improvement where architectural curriculum simultaneously enhances training efficiency and solution quality. CGAR-trained models exhibit superior inference efficiency with 100% halting accuracy and 11% fewer reasoning steps. Our work demonstrates that principled curriculum on architectural depth enables efficient training of recursive reasoning models on modest hardware. Code and models: this https URL and this https URL
zh

[NLP-57] Detecting Suicidal Ideation in Text with Interpretable Deep Learning: A CNN-BiGRU with Attention Mechanism

【速读】: 该论文旨在解决青少年自杀问题中早期识别自杀意念(suicidal ideation)的难题,尤其关注通过社交媒体(Social Network, SN)数据自动检测具有潜在自杀风险的个体。其解决方案的关键在于提出一种混合深度学习架构,结合卷积神经网络(CNN)进行局部特征提取、双向门控循环单元(BiGRU)实现序列建模,并引入注意力机制增强关键语义信息的捕捉能力,同时利用SHapley Additive exPlanations(SHAP)方法提升模型预测结果的可解释性,从而构建一个高精度且可信的自杀意念检测框架。实验表明,该方法在公开数据集上达到了93.97%的准确率,优于现有主流机器学习与深度学习模型。

链接: https://arxiv.org/abs/2511.08636
作者: Mohaiminul Islam Bhuiyan,Nur Shazwani Kamarudin,Nur Hafieza Ismail
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 4 figures, 2025 IEEE 9th International Conference on Software Engineering Computer Systems

点击查看摘要

Abstract:Worldwide, suicide is the second leading cause of death for adolescents with past suicide attempts to be an important predictor for increased future suicides. While some people with suicidal thoughts may try to suppress them, many signal their intentions in social media platforms. To address these issues, we propose a new type of hybrid deep learning scheme, i.e., the combination of a CNN architecture and a BiGRU technique, which can accurately identify the patterns of suicidal ideation from SN datasets. Also, we apply Explainable AI methods using SHapley Additive exPlanations to interpret the prediction results and verifying the model reliability. This integration of CNN local feature extraction, BiGRU bidirectional sequence modeling, attention mechanisms, and SHAP interpretability provides a comprehensive framework for suicide detection. Training and evaluation of the system were performed on a publicly available dataset. Several performance metrics were used for evaluating model performance. Our method was found to have achieved 93.97 accuracy in experimental results. Comparative study to different state-of-the-art Machine Learning and DL models and existing literature demonstrates the superiority of our proposed technique over all the competing methods.
zh

[NLP-58] Learn More Forget Less: A Gradient-Aware Data Selection Approach for LLM

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在特定领域进行监督微调(Supervised Fine-Tuning, SFT)时面临的两大挑战:一是数据资源消耗高,二是易引发灾难性遗忘(Catastrophic Forgetting, CF)导致通用能力下降。其解决方案的关键在于提出一种自适应的梯度感知数据选择方法(Gradient-aware Data Selection, GrADS),通过分析预训练阶段获得的梯度信息,设计基于梯度幅值和统计分布的自引导标准,自动筛选出对模型学习贡献最大的代表性样本,从而在显著减少训练数据量的同时提升领域适应性能,并有效缓解灾难性遗忘问题。

链接: https://arxiv.org/abs/2511.08620
作者: Yibai Liu,Shihang Wang,Zeming Liu,Zheming Song,Junzhe Wang,Jingjing Liu,Qingjie Liu,Yunhong Wang
机构: Columbia University (哥伦比亚大学); Beihang University (北京航空航天大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under review

点击查看摘要

Abstract:Despite large language models (LLMs) have achieved impressive achievements across numerous tasks, supervised fine-tuning (SFT) remains essential for adapting these models to specialized domains. However, SFT for domain specialization can be resource-intensive and sometimes leads to a deterioration in performance over general capabilities due to catastrophic forgetting (CF). To address these issues, we propose a self-adaptive gradient-aware data selection approach (GrADS) for supervised fine-tuning of LLMs, which identifies effective subsets of training data by analyzing gradients obtained from a preliminary training phase. Specifically, we design self-guided criteria that leverage the magnitude and statistical distribution of gradients to prioritize examples that contribute the most to the model’s learning process. This approach enables the acquisition of representative samples that enhance LLMs understanding of domain-specific tasks. Through extensive experimentation with various LLMs across diverse domains such as medicine, law, and finance, GrADS has demonstrated significant efficiency and cost-effectiveness. Remarkably, utilizing merely 5% of the selected GrADS data, LLMs already surpass the performance of those fine-tuned on the entire dataset, and increasing to 50% of the data results in significant improvements! With catastrophic forgetting substantially mitigated simultaneously. We will release our code for GrADS later.
zh

[NLP-59] A Super-Learner with Large Language Models for Medical Emergency Advising

【速读】: 该论文旨在解决急诊医学中诊断准确性不足的问题,特别是如何利用大型语言模型(Large Language Models, LLMs)提升对急症疾病的诊断能力。研究发现,尽管单个LLMs在急诊场景下的诊断准确率介于58%至65%,显著高于人类医生的水平,但其性能存在差异。为此,作者提出了一种基于元学习(meta-learning)的集成方法——MEDAS(Medical Emergency Diagnostic Advising System),通过构建一个包含五个主流LLMs(Gemini、Llama、Grok、GPT和Claude)的超学习器(super-learner),实现对各模型诊断能力的协同优化。解决方案的关键在于:利用元学习器自动学习并整合不同LLMs的优势,从而在群体层面实现更高的诊断准确率(达70%),甚至在单一模型中达到85%的准确率,表明该方法能有效挖掘和融合多个LLM所依赖的不同医疗数据集的知识资源。

链接: https://arxiv.org/abs/2511.08614
作者: Sergey K. Aityan,Abdolreza Mosaddegh,Rolando Herrero,Haitham Tayyar,Jiang Han,Vikram Sawant,Qi Chen,Rishabh Jain,Aruna Senthamaraikannan,Stephen Wood,Manuel Mersini,Rita Lazzaro,Mario Balzaneli,Nicola Iacovazzo,Ciro Gargiulo Isacco
机构: 未知
类目: Computation and Language (cs.CL)
备注: 12 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Medical decision-support and advising systems are critical for emergency physicians to quickly and accurately assess patients’ conditions and make diagnosis. Artificial Intelligence (AI) has emerged as a transformative force in healthcare in recent years and Large Language Models (LLMs) have been employed in various fields of medical decision-support systems. We studied responses of a group of different LLMs to real cases in emergency medicine. The results of our study on five most renown LLMs showed significant differences in capabilities of Large Language Models for diagnostics acute diseases in medical emergencies with accuracy ranging between 58% and 65%. This accuracy significantly exceeds the reported accuracy of human doctors. We built a super-learner MEDAS (Medical Emergency Diagnostic Advising System) of five major LLMs - Gemini, Llama, Grok, GPT, and Claude). The super-learner produces higher diagnostic accuracy, 70%, even with a quite basic meta-learner. However, at least one of the integrated LLMs in the same super-learner produces 85% correct diagnoses. The super-learner integrates a cluster of LLMs using a meta-learner capable of learning different capabilities of each LLM to leverage diagnostic accuracy of the model by collective capabilities of all LLMs in the cluster. The results of our study showed that aggregated diagnostic accuracy provided by a meta-learning approach exceeds that of any individual LLM, suggesting that the super-learner can take advantage of the combined knowledge of the medical datasets used to train the group of LLMs.
zh

[NLP-60] Mina: A Multilingual LLM -Powered Legal Assistant Agent for Bangladesh for Empowering Access to Justice

【速读】: 该论文旨在解决孟加拉国低收入群体在获取负担得起的法律咨询方面面临的障碍,包括法律语言复杂、程序不透明及高昂成本等问题;同时针对现有AI法律助手缺乏孟加拉语支持和本地司法环境适配性的局限。解决方案的关键在于开发了一个名为Mina的多语言大语言模型(Large Language Model, LLM)驱动的法律助理系统,其核心创新是采用多语言嵌入(multilingual embeddings)与基于检索增强生成(Retrieval-Augmented Generation, RAG)的工具链框架,实现检索、推理、翻译与文书生成一体化流程,从而提供情境感知的法律草案、引用和通俗解释,并通过交互式聊天界面交付服务。该系统已在孟加拉国顶尖高校法学院教师对2022年和2023年律师资格考试各环节的评估中表现优异(初步选择题、笔试与模拟口试得分75–80%),达到或超过人类平均水平,验证了其在低成本、多语言环境下自动化关键法律任务并扩大司法可及性的潜力。

链接: https://arxiv.org/abs/2511.08605
作者: Azmine Toushik Wasi,Wahid Faisal,Mst Rafia Islam
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Bangladesh’s low-income population faces major barriers to affordable legal advice due to complex legal language, procedural opacity, and high costs. Existing AI legal assistants lack Bengali-language support and jurisdiction-specific adaptation, limiting their effectiveness. To address this, we developed Mina, a multilingual LLM-based legal assistant tailored for the Bangladeshi context. It employs multilingual embeddings and a RAG-based chain-of-tools framework for retrieval, reasoning, translation, and document generation, delivering context-aware legal drafts, citations, and plain-language explanations via an interactive chat interface. Evaluated by law faculty from leading Bangladeshi universities across all stages of the 2022 and 2023 Bangladesh Bar Council Exams, Mina scored 75-80% in Preliminary MCQs, Written, and simulated Viva Voce exams, matching or surpassing average human performance and demonstrating clarity, contextual understanding, and sound legal reasoning. These results confirm its potential as a low-cost, multilingual AI assistant that automates key legal tasks and scales access to justice, offering a real-world case study on building domain-specific, low-resource systems and addressing challenges of multilingual adaptation, efficiency, and sustainable public-service AI deployment.
zh

[NLP-61] Evaluating DisCoCirc in Translation Tasks its Limitations: A Comparative Study Between Bengali English

【速读】: 该论文旨在解决跨语言翻译中因语法结构差异导致的表达冗余与信息失真问题,尤其是在英语与孟加拉语之间。其解决方案的关键在于扩展DisCoCirc(Distributed Compositional Circuits)形式化框架至孟加拉语,并利用其基于生成规则的电路式表示来构建更精确的范畴论结构,从而减少语言官僚主义(language bureaucracy)。该方法通过将句法结构映射为可组合的逻辑电路,试图在保持语义一致性的同时提升翻译准确性,尽管实证结果显示其对简单句仍存在局限性,提示未来需进一步优化以应对两种语言间的结构性差异。

链接: https://arxiv.org/abs/2511.08601
作者: Nazmoon Falgunee Moon
机构: 未知
类目: Computation and Language (cs.CL)
备注: 17 pages, 21 figures, 3 tables

点击查看摘要

Abstract:In [4], the authors present the DisCoCirc (Distributed Compositional Circuits) formalism for the English language, a grammar-based framework derived from the production rules that incorporates circuit-like representations in order to give a precise categorical theoretical structure to the language. In this paper, we extend this approach to develop a similar framework for Bengali and apply it to translation tasks between English and Bengali. A central focus of our work lies in reassessing the effectiveness of DisCoCirc in reducing language bureaucracy. Unlike the result suggested in [5], our findings indicate that although it works well for a large part of the language, it still faces limitations due to the structural variation of the two languages. We discuss the possible methods that might handle these shortcomings and show that, in practice, DisCoCirc still struggles even with relatively simple sentences. This divergence from prior claims not only highlights the framework’s constraints in translation but also suggest scope for future improvement. Apart from our primary focus on English-Bengali translation, we also take a short detour to examine English conjunctions, following [1], showing a connection between conjunctions and Boolean logic.
zh

[NLP-62] Retrieval-Augmented Generation of Pediatric Speech-Language Pathology vignettes: A Proof-of-Concept Study

【速读】: 该论文旨在解决语音语言病理学(Speech-Language Pathology, SLP)教育中临床案例(clinical vignettes)手工制作耗时费力的问题。传统大型语言模型(Large Language Models, LLMs)虽能生成文本,但缺乏领域专业知识,易产生幻觉且需大量专家修正。解决方案的关键在于构建一个基于检索增强生成(Retrieval-Augmented Generation, RAG)的多模态系统,整合经筛选的领域知识库与工程化提示模板,从而生成符合专业指南、结构完整且临床适切的儿科SLP案例材料。该方法通过引入外部知识源显著提升了生成内容的准确性与可靠性,为教育和研究场景下的自动化案例生成提供了技术可行性验证。

链接: https://arxiv.org/abs/2511.08600
作者: Yilan Liu
机构: University of Redlands (红lands大学)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 37 pages, 3 figures

点击查看摘要

Abstract:Clinical vignettes are essential educational tools in speech-language pathology (SLP), but manual creation is time-intensive. While general-purpose large language models (LLMs) can generate text, they lack domain-specific knowledge, leading to hallucinations and requiring extensive expert revision. This study presents a proof-of-concept system integrating retrieval-augmented generation (RAG) with curated knowledge bases to generate pediatric SLP case materials. A multi-model RAG-based system was prototyped integrating curated domain knowledge with engineered prompt templates, supporting five commercial (GPT-4o, Claude 3.5 Sonnet, Gemini 2.5 Pro) and open-source (Llama 3.2, Qwen 2.5-7B) LLMs. Seven test scenarios spanning diverse disorder types and grade levels were systematically designed. Generated cases underwent automated quality assessment using a multi-dimensional rubric evaluating structural completeness, internal consistency, clinical appropriateness, and IEP goal/session note quality. This proof-of-concept demonstrates technical feasibility for RAG-augmented generation of pediatric SLP vignettes. Commercial models showed marginal quality advantages, but open-source alternatives achieved acceptable performance, suggesting potential for privacy-preserving institutional deployment. Integration of curated knowledge bases enabled content generation aligned with professional guidelines. Extensive validation through expert review, student pilot testing, and psychometric evaluation is required before educational or research implementation. Future applications may extend to clinical decision support, automated IEP goal generation, and clinical reflection training.
zh

[NLP-63] OKBench: Democratizing LLM Evaluation with Fully Automated On-Demand Open Knowledge Benchmarking

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)知识密集型问答评估中所依赖的静态基准测试无法反映现实世界动态知识更新的问题,以及集中式数据维护难以跟上LLM快速演进速度的挑战。其解决方案的关键在于提出Open Knowledge Bench (OKBench),一个全自动化的、面向新闻领域的动态知识基准生成框架,通过自动化地完成数据源获取、基准构建、验证与分发流程,实现按需生成高质量、低重叠于预训练数据的新知识基准,从而有效支持对检索增强型方法的全面评估,并揭示不同规模模型在面对新信息时的行为差异及检索机制对缩小大小模型性能差距的作用。

链接: https://arxiv.org/abs/2511.08598
作者: Yanhong Li,Tianyang Xu,Kenan Tang,Karen Livescu,David McAllester,Jiawei Zhou
机构: University of Chicago (芝加哥大学); TTI-Chicago (TTI-芝加哥); UC Santa Barbara (加州大学圣塔芭芭拉分校); Stony Brook University (石溪大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge-intensive question answering is central to large language models (LLMs) and is typically assessed using static benchmarks derived from sources like Wikipedia and textbooks. However, these benchmarks fail to capture evolving knowledge in a dynamic world, and centralized curation struggles to keep pace with rapid LLM advancements. To address these drawbacks, we propose Open Knowledge Bench (OKBench), a fully automated framework for generating high-quality, dynamic knowledge benchmarks on demand. Focusing on the news domain where knowledge updates daily, OKBench is an agentic framework that automates the sourcing, creation, validation, and distribution of benchmarks. Our approach democratizes benchmark creation and facilitates thorough evaluation of retrieval-augmented methods by reducing overlap with pretraining data. We evaluate our framework on a wide range open-source and proprietary LLMs of various sizes and configurations, both with and without retrieval over freshly generated knowledge. Our results reveal distinct model behaviors when confronted with new information and highlight how retrieval narrows the performance gap between small and large models. These findings underscore the importance of evaluating LLMs on evolving knowledge benchmarks.
zh

[NLP-64] Self-HarmLLM : Can Large Language Model Harm Itself?

【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)防御机制中一个被忽视的问题:即模型自身生成的有害查询(Mitigated Harmful Query, MHQ)可能成为新的攻击向量,而非仅依赖外部攻击者构造的恶意输入。其解决方案的关键在于提出“Self-HarmLLM”场景,通过设计一种意图保留但危害性不直接暴露的MHQ,并将其重新输入同一模型的不同会话中,以测试是否存在越狱(jailbreak)现象。实验表明,在零样本和少样本条件下均存在显著的转换成功率与越狱成功率,且自动化评估方法存在系统性高估问题,凸显了现有防御策略在面对模型内生攻击时的脆弱性,从而呼吁对护栏机制进行根本性重构和更严谨的评估体系建立。

链接: https://arxiv.org/abs/2511.08597
作者: Heehwan Kim,Sungjune Park,Daeseon Choi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are generally equipped with guardrails to block the generation of harmful responses. However, existing defenses always assume that an external attacker crafts the harmful query, and the possibility of a model’s own output becoming a new attack vector has not been sufficiently explored. In this study, we propose the Self-HarmLLM scenario, which uses a Mitigated Harmful Query (MHQ) generated by the same model as a new input. An MHQ is an ambiguous query whose original intent is preserved while its harmful nature is not directly exposed. We verified whether a jailbreak occurs when this MHQ is re-entered into a separate session of the same model. We conducted experiments on GPT-3.5-turbo, LLaMA3-8B-instruct, and DeepSeek-R1-Distill-Qwen-7B under Base, Zero-shot, and Few-shot conditions. The results showed up to 52% transformation success rate and up to 33% jailbreak success rate in the Zero-shot condition, and up to 65% transformation success rate and up to 41% jailbreak success rate in the Few-shot condition. By performing both prefix-based automated evaluation and human evaluation, we found that the automated evaluation consistently overestimated jailbreak success, with an average difference of 52%. This indicates that automated evaluation alone is not accurate for determining harmfulness. While this study is a toy-level study based on a limited query set and evaluators, it proves that our method can still be a valid attack scenario. These results suggest the need for a fundamental reconsideration of guardrail design and the establishment of a more robust evaluation methodology.
zh

[NLP-65] What About the Scene with the Hitler Reference? HAUNT: A Framework to Probe LLM s Self-consistency Via Adversarial Nudge

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险领域部署时因幻觉(Hallucination)导致的事实准确性问题,尤其关注模型在受到对抗性诱导(adversarial nudge)下的事实一致性脆弱性。解决方案的关键在于构建一个三步式压力测试框架:首先指令模型生成与封闭领域一致的真言与假话集合;其次要求模型自我验证这些陈述的真实性与虚假性;最后测试模型对自身生成的虚假陈述的鲁棒性。该方法能够系统评估LLM在面对内部生成的误导信息时的事实保持能力,从而揭示不同模型在真实应用场景中的可靠性差异。

链接: https://arxiv.org/abs/2511.08596
作者: Arka Dutta,Sujan Dutta,Rijul Magu,Soumyajit Datta,Munmun De Choudhury,Ashiqur R. KhudaBukhsh
机构: Rochester Institute of Technology (罗切斯特理工学院); Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Hallucinations pose a critical challenge to the real-world deployment of large language models (LLMs) in high-stakes domains. In this paper, we present a framework for stress testing factual fidelity in LLMs in the presence of adversarial nudge. Our framework consists of three steps. In the first step, we instruct the LLM to produce sets of truths and lies consistent with the closed domain in question. In the next step, we instruct the LLM to verify the same set of assertions as truths and lies consistent with the same closed domain. In the final step, we test the robustness of the LLM against the lies generated (and verified) by itself. Our extensive evaluation, conducted using five widely known proprietary LLMs across two closed domains of popular movies and novels, reveals a wide range of susceptibility to adversarial nudges: \textttClaude exhibits strong resilience, \textttGPT and \textttGrok demonstrate moderate resilience, while \textttGemini and \textttDeepSeek show weak resilience. Considering that a large population is increasingly using LLMs for information seeking, our findings raise alarm.
zh

[NLP-66] Chopping Trees: Semantic Similarity Based Dynamic Pruning for Tree-of-Thought Reasoning NEURIPS2025

【速读】: 该论文旨在解决树状思维(Tree-of-Thought, ToT)推理中因语义冗余导致的计算开销过大的问题,即不同分支探索了等价的推理路径,造成资源浪费。其解决方案的关键在于提出一种基于语义相似性的动态剪枝方法(Semantic Similarity-Based Dynamic Pruning, SSDP),首次将在线语义合并机制集成到并行化树搜索中,实现实时聚类与剪枝冗余步骤,从而显著提升推理效率。在GSM8K和MATH500等基准测试中,SSDP相较于最先进基线实现最高2.3倍的速度提升,同时保持竞争力的准确性(通常在最强基线的5%以内),并将探索节点数减少85–90%。

链接: https://arxiv.org/abs/2511.08595
作者: Joongho Kim,Xirui Huang,Zarreen Reza,Gabriel Grand,Kevin Zhu,Ryan Lagasse
机构: Algoverse AI Research; Massachusetts Institute of Technology (MIT)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop on Efficient Reasoning

点击查看摘要

Abstract:Tree-of-Thought (ToT) reasoning boosts the problem-solving abilities of Large Language Models (LLMs) but is computationally expensive due to semantic redundancy, where distinct branches explore equivalent reasoning paths. We introduce Semantic Similarity-Based Dynamic Pruning (SSDP), a lightweight method that, to the best of our knowledge, is the first framework to integrate online semantic merging into parallelized tree search, enabling the clustering and pruning of redundant steps in real time. Across reasoning benchmarks, including GSM8K and MATH500, SSDP achieves up to a 2.3x speedup over state-of-the-art tree-search baselines while maintaining competitive accuracy (typically within 5% of the strongest baseline) and reducing the number of explored nodes by 85-90%, demonstrating a practical approach to efficient, scalable LLM reasoning. The implementation of SSDP is publicly available at this https URL.
zh

[NLP-67] Diverse Preference Learning for Capabilities and Alignment

【速读】: 该论文试图解决的问题是:当前基于偏好学习(preference learning)的对齐算法(如RLHF和DPO)在提升语言模型(LLM)输出质量的同时,显著降低了其生成文本的多样性,表现为重复的结构、词汇选择以及对问题的趋同解法,并削弱了模型对社会多元观点的代表性。解决方案的关键在于提出Soft Preference Learning,该方法通过解耦KL散度惩罚项中的熵(entropy)与交叉熵(cross-entropy)成分,实现对生成多样性的细粒度控制,从而在保持对齐效果的同时显著提升语义和词汇多样性,同时增强模型对不同社会观点的覆盖能力和逻辑概率校准性能。

链接: https://arxiv.org/abs/2511.08594
作者: Stewart Slocum,Asher Parker-Sartori,Dylan Hadfield-Menell
机构: MIT CSAIL (Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The ability of LLMs to represent diverse perspectives is critical as they increasingly impact society. However, recent studies reveal that alignment algorithms such as RLHF and DPO significantly reduce the diversity of LLM outputs. Not only do aligned LLMs generate text with repetitive structure and word choice, they also approach problems in more uniform ways, and their responses reflect a narrower range of societal perspectives. We attribute this problem to the KL divergence regularizer employed in preference learning algorithms. This causes the model to systematically overweight majority opinions and sacrifice diversity in its outputs. To address this, we propose Soft Preference Learning, which decouples the entropy and cross-entropy terms in the KL penalty - allowing for fine-grained control over LLM generation diversity. From a capabilities perspective, LLMs trained using Soft Preference Learning attain higher accuracy on difficult repeated sampling tasks and produce outputs with greater semantic and lexical diversity. From an alignment perspective, they are capable of representing a wider range of societal viewpoints and display improved logit calibration. Notably, Soft Preference Learning resembles, but is a Pareto improvement over, standard temperature scaling.
zh

[NLP-68] Knowledge Graph Analysis of Legal Understanding and Violations in LLM s

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在解读敏感法律文本(如美国法典第18编第175条关于生物武器的规定)时存在的安全与合规风险问题,即LLMs虽具备法律分析能力,却可能生成有害输出(如生物武器制作步骤),暴露出其推理和安全机制的显著缺陷。解决方案的关键在于提出一种结合知识图谱构建与检索增强生成(Retrieval-Augmented Generation, RAG)的方法,以系统评估LLMs对法律条文的理解程度、对犯罪意图(mens rea)的识别能力及潜在滥用风险;通过结构化实验验证其在识别违法行为、生成禁止指令和检测非法意图方面的表现,从而为开发兼具伦理性和安全性、能有效辅助敏感法律领域的LLM提供理论基础和技术路径。

链接: https://arxiv.org/abs/2511.08593
作者: Abha Jha,Abel Salinas,Fred Morstatter
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rise of Large Language Models (LLMs) offers transformative potential for interpreting complex legal frameworks, such as Title 18 Section 175 of the US Code, which governs biological weapons. These systems hold promise for advancing legal analysis and compliance monitoring in sensitive domains. However, this capability comes with a troubling contradiction: while LLMs can analyze and interpret laws, they also demonstrate alarming vulnerabilities in generating unsafe outputs, such as actionable steps for bioweapon creation, despite their safeguards. To address this challenge, we propose a methodology that integrates knowledge graph construction with Retrieval-Augmented Generation (RAG) to systematically evaluate LLMs’ understanding of this law, their capacity to assess legal intent (mens rea), and their potential for unsafe applications. Through structured experiments, we assess their accuracy in identifying legal violations, generating prohibited instructions, and detecting unlawful intent in bioweapons-related scenarios. Our findings reveal significant limitations in LLMs’ reasoning and safety mechanisms, but they also point the way forward. By combining enhanced safety protocols with more robust legal reasoning frameworks, this research lays the groundwork for developing LLMs that can ethically and securely assist in sensitive legal domains - ensuring they act as protectors of the law rather than inadvertent enablers of its violation.
zh

[NLP-69] he Collective Turing Test: Large Language Models Can Generate Realistic Multi-User Discussions

【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)是否能够真实模拟社交媒体上的群体对话,从而在社会仿真中替代真实人类互动。其解决方案的关键在于通过对比真实Reddit用户对话与由Llama 3 70B和GPT-4o生成的虚拟对话,进行双盲实验评估人类参与者对两者辨识的准确率。结果显示,参与者将LLM生成的内容误判为人类创作的比例达39%,其中对Llama 3生成内容的识别正确率仅为56%,接近随机水平,表明当前LLMs已具备生成高度逼真社交文本的能力,可有效用于社会模拟研究,但也揭示了其被滥用于制造虚假内容的风险。

链接: https://arxiv.org/abs/2511.08592
作者: Azza Bouleimen,Giordano De Marzo,Taehee Kim,Nicol`o Pagan,Hannah Metzler,Silvia Giordano,David Garcia
机构: University of Zurich, Department of Informatics, Zurich, 8050, Switzerland; University of Konstanz, Department of Politics and Public Administration, Konstanz, 78464, Germany; Complexity Science Hub, Vienna, 1030, Austria; University of Applied Sciences and Arts of Southern Switzerland, Department of Innovative Technologies, Viganello, 6962, Switzerland
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) offer new avenues to simulate online communities and social media. Potential applications range from testing the design of content recommendation algorithms to estimating the effects of content policies and interventions. However, the validity of using LLMs to simulate conversations between various users remains largely untested. We evaluated whether LLMs can convincingly mimic human group conversations on social media. We collected authentic human conversations from Reddit and generated artificial conversations on the same topic with two LLMs: Llama 3 70B and GPT-4o. When presented side-by-side to study participants, LLM-generated conversations were mistaken for human-created content 39% of the time. In particular, when evaluating conversations generated by Llama 3, participants correctly identified them as AI-generated only 56% of the time, barely better than random chance. Our study demonstrates that LLMs can generate social media conversations sufficiently realistic to deceive humans when reading them, highlighting both a promising potential for social simulation and a warning message about the potential misuse of LLMs to generate new inauthentic social media content.
zh

[NLP-70] GMTRouter: Personalized LLM Router over Multi-turn User Interactions

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)路由中个性化不足的问题,即现有方法难以有效捕捉用户与LLM之间复杂的交互关系,且受限于稀疏、噪声大和格式不一致的用户偏好数据,导致个性化效果有限。其解决方案的关键在于提出GMTRouter,通过将多轮用户-LLM交互建模为包含用户、LLM、查询和响应四类节点的异构图结构(heterogeneous graph),并设计一种定制化的消息传递机制,在轻量级归纳式图学习框架下从少量样本中学习用户偏好,从而实现高效且灵活的个性化路由。

链接: https://arxiv.org/abs/2511.08590
作者: Encheng Xie,Yihang Sun,Tao Feng,Jiaxuan You
机构: Cranberry-Lemon University (克兰伯里-柠檬大学); University of the Witwatersrand (威特沃特斯兰德大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint

点击查看摘要

Abstract:Large Language Model (LLM) routing has demonstrated strong capability in balancing response quality with computational cost. As users exhibit diverse preferences, personalization has attracted increasing attention in LLM routing, since even identical queries may require different models to generate responses tailored to individual needs. However, existing approaches are not fully personalized and often fail to capture the complex interactions between specific users and LLMs. Moreover, user preference data is typically scarce, noisy, and inconsistent in format, which limits the effectiveness of methods that rely solely on user-specific data. To address these challenges, we propose GMTRouter, which represents multi-turn user-LLM interactions as a heterogeneous graph with four node types: user, LLM, query, and response, thereby preserving the rich relational structure of the interaction. Through a tailored message-passing mechanism, GMTRouter learns to capture user preferences from few-shot data within a lightweight inductive graph learning framework, enabling effective personalization. Extensive experiments demonstrate that GMTRouter consistently outperforms strong baselines, achieving 0.9 to 21.6 percent higher accuracy and 0.006 to 0.309 higher AUC across multiple datasets. More importantly, we demonstrate that GMTRouter can adapt to new users and evolving preferences using only few-shot data, without extensive fine-tuning. The code for GMTRouter is publicly available at this https URL.
zh

[NLP-71] Where did you get that? Towards Summarization Attribution for Analysts

【速读】: 该论文旨在解决摘要中信息来源的可追溯性问题,即如何自动将摘要中的每个句子与源文本中的相应片段进行关联,从而实现准确的归因(attribution)。其解决方案的关键在于采用混合摘要方法(hybrid summarization),即通过自动重写(automatic paraphrase)提取式摘要(extractive summary)来简化归因过程,并引入自定义拓扑结构(custom topology)以识别不同类别的归因错误比例,从而提升归因的准确性与可解释性。

链接: https://arxiv.org/abs/2511.08589
作者: Violet B,John M. Conroy,Sean Lynch,Danielle M,Neil P. Molino,Aaron Wiechmann,Julia S. Yang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Analysts require attribution, as nothing can be reported without knowing the source of the information. In this paper, we will focus on automatic methods for attribution, linking each sentence in the summary to a portion of the source text, which may be in one or more documents. We explore using a hybrid summarization, i.e., an automatic paraphrase of an extractive summary, to ease attribution. We also use a custom topology to identify the proportion of different categories of attribution-related errors.
zh

[NLP-72] Conversational Agents for Building Energy Efficiency – Advising Housing Cooperatives in Stockholm on Reducing Energy Consumption

【速读】: 该论文旨在解决瑞典住房合作社(Housing Cooperative)在能源效率提升过程中面临的决策支持不足问题,尤其针对董事会成员普遍缺乏专业知识和资源来有效管理建筑能耗的困境。解决方案的关键在于开发一个名为SPARA的对话代理系统(Conversational Agent System),该系统基于检索增强生成(Retrieval-Augmented Generation, RAG)框架与语言模型(Language Model, LM),利用斯德哥尔摩地区专业能源顾问与合作社代表之间的电子邮件通信构建知识库,从而生成精准的节能改造建议。初步结果显示,SPARA提供的建议准确率达80%,接近市政能源效率专家水平,表明此类生成式AI(Generative AI)技术可在能源转型中显著提升对利益相关方的支持能力。

链接: https://arxiv.org/abs/2511.08587
作者: Shadaab Ghani,Anne Håkansson,Oleksii Pasichnyi,Hossein Shahrokni
机构: KTH Royal Institute of Technology (皇家理工学院); Department of Sustainable Development, Environmental Science and Engineering (SEED) (可持续发展、环境科学与工程系)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Housing cooperative is a common type of multifamily building ownership in Sweden. Although this ownership structure grants decision-making autonomy, it places a burden of responsibility on cooperative’s board members. Most board members lack the resources or expertise to manage properties and their energy consumption. This ignorance presents a unique challenge, especially given the EU directives that prohibit buildings rated as energy classes F and G by 2033. Conversational agents (CAs) enable human-like interactions with computer systems, facilitating human-computer interaction across various domains. In our case, CAs can be implemented to support cooperative members in making informed energy retrofitting and usage decisions. This paper introduces a Conversational agent system, called SPARA, designed to advise cooperatives on energy efficiency. SPARA functions as an energy efficiency advisor by leveraging the Retrieval-Augmented Generation (RAG) framework with a Language Model(LM). The LM generates targeted recommendations based on a knowledge base composed of email communications between professional energy advisors and cooperatives’ representatives in Stockholm. The preliminary results indicate that SPARA can provide energy efficiency advice with precision 80%, comparable to that of municipal energy efficiency (EE) experts. A pilot implementation is currently underway, where municipal EE experts are evaluating SPARA performance based on questions posed to EE experts by BRF members. Our findings suggest that LMs can significantly improve outreach by supporting stakeholders in their energy transition. For future work, more research is needed to evaluate this technology, particularly limitations to the stability and trustworthiness of its energy efficiency advice.
zh

计算机视觉

[CV-0] IFG: Internet-Scale Guidance for Functional Grasping Generation

【速读】:该论文旨在解决大型视觉模型在复杂场景中虽具备强大的语义分割与对象部分理解能力,但缺乏精确几何理解从而难以实现灵巧机器人手对3D物体的精准抓取的问题。解决方案的关键在于引入一种基于仿真的力闭合抓取生成流程(force-closure grasping generation pipeline),该流程能够理解手部与物体局部几何关系;随后将该流程生成的慢速且依赖真值观测的数据蒸馏为一个可在相机点云上实时运行的扩散模型(diffusion model),从而结合互联网规模模型的全局语义理解与仿真驱动的局部几何感知能力,实现无需人工收集训练数据的高性能语义抓取。

链接: https://arxiv.org/abs/2511.09558
作者: Ray Muxin Liu,Mingxuan Li,Kenneth Shaw,Deepak Pathak
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注: Website at this https URL

点击查看摘要

Abstract:Large Vision Models trained on internet-scale data have demonstrated strong capabilities in segmenting and semantically understanding object parts, even in cluttered, crowded scenes. However, while these models can direct a robot toward the general region of an object, they lack the geometric understanding required to precisely control dexterous robotic hands for 3D grasping. To overcome this, our key insight is to leverage simulation with a force-closure grasping generation pipeline that understands local geometries of the hand and object in the scene. Because this pipeline is slow and requires ground-truth observations, the resulting data is distilled into a diffusion model that operates in real-time on camera point clouds. By combining the global semantic understanding of internet-scale models with the geometric precision of a simulation-based locally-aware force-closure, \our achieves high-performance semantic grasping without any manually collected training data. For visualizations of this please visit our website at this https URL
zh

[CV-1] SpatialActor: Exploring Disentangled Spatial Representations for Robust Robotic Manipulation AAAI2026

【速读】:该论文旨在解决机器人操作中因几何与语义信息耦合导致的鲁棒性不足问题,尤其是在真实场景下深度噪声干扰下语义理解易失真、以及对低层空间线索利用不充分的问题。其核心解决方案是提出SpatialActor框架,通过显式解耦语义与几何信息实现增强鲁棒性:一是设计语义引导的几何模块(Semantic-guided Geometric Module),自适应融合来自噪声深度数据和语义引导专家先验的互补几何信息;二是引入空间变换器(Spatial Transformer),利用低层空间线索实现精确的2D-3D映射,并促进空间特征间的交互,从而提升复杂任务下的精度与泛化能力。

链接: https://arxiv.org/abs/2511.09555
作者: Hao Shi,Bin Xie,Yingfei Liu,Yang Yue,Tiancai Wang,Haoqiang Fan,Xiangyu Zhang,Gao Huang
机构: Dexmal; Gao Huang
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2026 Oral | Project Page: this https URL

点击查看摘要

Abstract:Robotic manipulation requires precise spatial understanding to interact with objects in the real world. Point-based methods suffer from sparse sampling, leading to the loss of fine-grained semantics. Image-based methods typically feed RGB and depth into 2D backbones pre-trained on 3D auxiliary tasks, but their entangled semantics and geometry are sensitive to inherent depth noise in real-world that disrupts semantic understanding. Moreover, these methods focus on high-level geometry while overlooking low-level spatial cues essential for precise interaction. We propose SpatialActor, a disentangled framework for robust robotic manipulation that explicitly decouples semantics and geometry. The Semantic-guided Geometric Module adaptively fuses two complementary geometry from noisy depth and semantic-guided expert priors. Also, a Spatial Transformer leverages low-level spatial cues for accurate 2D-3D mapping and enables interaction among spatial features. We evaluate SpatialActor on multiple simulation and real-world scenarios across 50+ tasks. It achieves state-of-the-art performance with 87.4% on RLBench and improves by 13.9% to 19.4% under varying noisy conditions, showing strong robustness. Moreover, it significantly enhances few-shot generalization to new tasks and maintains robustness under various spatial perturbations. Project Page: this https URL
zh

[CV-2] RF-DETR: Neural Architecture Search for Real-Time Detection Transformers

【速读】:该论文旨在解决开放词汇检测器(open-vocabulary detectors)在面对分布外(out-of-distribution)类别时泛化能力不足的问题,尤其是在真实世界数据集上表现不佳的难题。传统方法通常通过微调重型视觉语言模型(Vision-Language Model, VLM)来适应新领域,但效率低下且难以平衡精度与延迟。其解决方案的关键在于提出RF-DETR——一种轻量级专用检测Transformer,结合权重共享神经架构搜索(weight-sharing Neural Architecture Search, NAS),能够在不重新训练的情况下快速评估数千种不同精度-延迟权衡的网络配置,从而为任意目标数据集发现最优的准确率-延迟帕累托前沿(accuracy-latency Pareto curves)。此外,作者重新审视了NAS中的“可调旋钮”(tunable knobs),提升了检测Transformer(DETR)在多样化目标域上的迁移能力,显著优于现有实时检测方法,在COCO和Roboflow100-VL等基准上实现更高性能与更低延迟的协同优化。

链接: https://arxiv.org/abs/2511.09554
作者: Isaac Robinson,Peter Robicheaux,Matvei Popov,Deva Ramanan,Neehar Peri
机构: Roboflow; Carnegie Mellon University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Open-vocabulary detectors achieve impressive performance on COCO, but often fail to generalize to real-world datasets with out-of-distribution classes not typically found in their pre-training. Rather than simply fine-tuning a heavy-weight vision-language model (VLM) for new domains, we introduce RF-DETR, a light-weight specialist detection transformer that discovers accuracy-latency Pareto curves for any target dataset with weight-sharing neural architecture search (NAS). Our approach fine-tunes a pre-trained base network on a target dataset and evaluates thousands of network configurations with different accuracy-latency tradeoffs without re-training. Further, we revisit the “tunable knobs” for NAS to improve the transferability of DETRs to diverse target domains. Notably, RF-DETR significantly improves on prior state-of-the-art real-time methods on COCO and Roboflow100-VL. RF-DETR (nano) achieves 48.0 AP on COCO, beating D-FINE (nano) by 5.3 AP at similar latency, and RF-DETR (2x-large) outperforms GroundingDINO (tiny) by 1.2 AP on Roboflow100-VL while running 20x as fast. To the best of our knowledge, RF-DETR (2x-large) is the first real-time detector to surpass 60 AP on COCO. Our code is at this https URL
zh

[CV-3] vMFCoOp: Towards Equilibrium on a Unified Hyperspherical Manifold for Prompting Biomedical VLMs AAAI2026

【速读】:该论文旨在解决基于医学视觉-语言模型(VLM)的提示学习(prompt learning)中因大语言模型(LLM)与CLIP变体之间语义偏差导致的对齐困难问题,以及传统欧氏空间优化方法在复杂生物医学图像场景下难以建模统一表示和施加局部几何约束、进而加剧模态差异并削弱少样本适应能力的问题。其解决方案的关键在于提出vMFCoOp框架,通过在共享超球面流形(Hyperspherical Manifold)上逆向估计von Mises-Fisher (vMF)分布,并利用统一语义锚点(Unified Semantic Anchors)对齐任意LLM与CLIP主干网络之间的语义偏置,从而实现鲁棒的生物医学提示生成与优越的少样本分类性能。

链接: https://arxiv.org/abs/2511.09540
作者: Minye Shao,Sihan Guo,Xinrun Li,Xingyu Miao,Haoran Duan,Yang Long
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted as an Oral Presentation at AAAI 2026 Main Technical Track (this version is not peer-reviewed; it is the extended version)

点击查看摘要

Abstract:Recent advances in context optimization (CoOp) guided by large language model (LLM)-distilled medical semantic priors offer a scalable alternative to manual prompt engineering and full fine-tuning for adapting biomedical CLIP-based vision-language models (VLMs). However, prompt learning in this context is challenged by semantic misalignment between LLMs and CLIP variants due to divergent training corpora and model architectures; it further lacks scalability across continuously evolving families of foundation models. More critically, pairwise multimodal alignment via conventional Euclidean-space optimization lacks the capacity to model unified representations or apply localized geometric constraints, which tends to amplify modality gaps in complex biomedical imaging and destabilize few-shot adaptation. In this work, we propose vMFCoOp, a framework that inversely estimates von Mises-Fisher (vMF) distributions on a shared Hyperspherical Manifold, aligning semantic biases between arbitrary LLMs and CLIP backbones via Unified Semantic Anchors to achieve robust biomedical prompting and superior few-shot classification. Grounded in three complementary constraints, vMFCoOp demonstrates consistent improvements across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming state-of-the-art methods in accuracy, generalization, and clinical applicability. This work will be continuously expanded to encompass more downstream applications, and the corresponding resources are intended to be shared through this https URL.
zh

[CV-4] MAP-VLA: Memory-Augmented Prompting for Vision-Language-Action Model in Robotic Manipulation

【速读】:该论文旨在解决预训练视觉-语言-动作(Vision-Language-Action, VLA)模型在长程任务中因缺乏记忆机制且仅依赖即时感官输入而导致性能下降的问题。其解决方案的关键在于提出一种名为Memory-Augmented Prompting for VLA(MAP-VLA)的框架,通过从历史演示中构建记忆库,将每个记忆单元表示为可学习的软提示(soft prompts),并在实时执行时基于轨迹相似性匹配检索相关记忆,并动态融合至冻结的VLA模型中以增强动作生成能力。该方法作为插件模块无需重训练VLA模型,具有轻量化和灵活性优势,实验表明其在仿真和真实机器人场景下分别带来最高7.0%和25.0%的性能提升。

链接: https://arxiv.org/abs/2511.09516
作者: Runhao Li,Wenkai Guo,Zhenyu Wu,Changyuan Wang,Haoyuan Deng,Zhenyu Weng,Yap-Peng Tan,Ziwei Wang
机构: Nanyang Technological University (南洋理工大学); VinUniversity (Vin大学); Beijing University of Posts and Telecommunications (北京邮电大学); Tsinghua University (清华大学); South China University of Technology (华南理工大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pre-trained Vision-Language-Action (VLA) models have achieved remarkable success in improving robustness and generalization for end-to-end robotic manipulation. However, these models struggle with long-horizon tasks due to their lack of memory and reliance solely on immediate sensory inputs. To address this limitation, we propose Memory-Augmented Prompting for Vision-Language-Action model (MAP-VLA), a novel framework that empowers pre-trained VLA models with demonstration-derived memory prompts to augment action generation for long-horizon robotic manipulation tasks. To achieve this, MAP-VLA first constructs a memory library from historical demonstrations, where each memory unit captures information about a specific stage of a task. These memory units are implemented as learnable soft prompts optimized through prompt tuning. Then, during real-time task execution, MAP-VLA retrieves relevant memory through trajectory similarity matching and dynamically integrates it into the VLA model for augmented action generation. Importantly, this prompt tuning and retrieval augmentation approach operates as a plug-and-play module for a frozen VLA model, offering a lightweight and flexible solution to improve task performance. Experimental results show that MAP-VLA delivers up to 7.0% absolute performance gains in the simulation benchmark and 25.0% on real robot evaluations for long-horizon tasks, surpassing the current state-of-the-art methods.
zh

[CV-5] DreamPose3D: Hallucinative Diffusion with Prompt Learning for 3D Human Pose Estimation

【速读】:该论文旨在解决3D人体姿态估计中因仅依赖几何线索而导致的运动歧义性难以解析、以及在真实场景下泛化能力不足的问题(即如何同时实现帧间时间一致性与关节间精细结构建模)。其解决方案的关键在于提出一种基于扩散模型的框架DreamPose3D,通过两个核心机制实现:一是引入动作感知推理(action-aware reasoning),利用从2D姿态序列中提取的任务相关动作提示动态调节去噪过程,从而捕捉高层意图;二是设计包含运动学关节亲和性的表示编码器,将关节间的结构关系融入注意力机制以增强对关节关系的建模能力;此外,还采用幻觉式姿态解码器(hallucinative pose decoder)在训练阶段生成具有时间一致性的3D姿态序列,模拟人类对运动轨迹的心理重构过程,有效缓解感知歧义。

链接: https://arxiv.org/abs/2511.09502
作者: Jerrin Bright,Yuhao Chen,John S. Zelek
机构: University of Waterloo (滑铁卢大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate 3D human pose estimation remains a critical yet unresolved challenge, requiring both temporal coherence across frames and fine-grained modeling of joint relationships. However, most existing methods rely solely on geometric cues and predict each 3D pose independently, which limits their ability to resolve ambiguous motions and generalize to real-world scenarios. Inspired by how humans understand and anticipate motion, we introduce DreamPose3D, a diffusion-based framework that combines action-aware reasoning with temporal imagination for 3D pose estimation. DreamPose3D dynamically conditions the denoising process using task-relevant action prompts extracted from 2D pose sequences, capturing high-level intent. To model the structural relationships between joints effectively, we introduce a representation encoder that incorporates kinematic joint affinity into the attention mechanism. Finally, a hallucinative pose decoder predicts temporally coherent 3D pose sequences during training, simulating how humans mentally reconstruct motion trajectories to resolve ambiguity in perception. Extensive experiments on benchmarked Human3.6M and MPI-3DHP datasets demonstrate state-of-the-art performance across all metrics. To further validate DreamPose3D’s robustness, we tested it on a broadcast baseball dataset, where it demonstrated strong performance despite ambiguous and noisy 2D inputs, effectively handling temporal consistency and intent-driven motion variations.
zh

[CV-6] SPIDER: Scalable Physics-Informed Dexterous Retargeting

【速读】:该论文旨在解决人形机器人和灵巧手控制中因缺乏大规模机器人特定示范数据而导致的策略学习难题,即如何利用易获取的人类运动数据(如动作捕捉、视频等)来生成动态可行的机器人轨迹。其解决方案的关键在于提出了一种基于物理的可扩展灵巧重定向框架SPIDER,该框架通过将人类演示提供的全局任务结构与大规模物理驱动采样结合,并引入课程式虚拟接触引导机制,实现从仅含运动学信息的人类示范到动态可行机器人轨迹的高效转换,从而显著提升轨迹可行性与任务成功率,同时大幅优于传统强化学习方法的效率。

链接: https://arxiv.org/abs/2511.09484
作者: Chaoyi Pan,Changhao Wang,Haozhi Qi,Zixi Liu,Homanga Bharadhwaj,Akash Sharma,Tingfan Wu,Guanya Shi,Jitendra Malik,Francois Hogan
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project website: this https URL

点击查看摘要

Abstract:Learning dexterous and agile policy for humanoid and dexterous hand control requires large-scale demonstrations, but collecting robot-specific data is prohibitively expensive. In contrast, abundant human motion data is readily available from motion capture, videos, and virtual reality, which could help address the data scarcity problem. However, due to the embodiment gap and missing dynamic information like force and torque, these demonstrations cannot be directly executed on robots. To bridge this gap, we propose Scalable Physics-Informed DExterous Retargeting (SPIDER), a physics-based retargeting framework to transform and augment kinematic-only human demonstrations to dynamically feasible robot trajectories at scale. Our key insight is that human demonstrations should provide global task structure and objective, while large-scale physics-based sampling with curriculum-style virtual contact guidance should refine trajectories to ensure dynamical feasibility and correct contact sequences. SPIDER scales across diverse 9 humanoid/dexterous hand embodiments and 6 datasets, improving success rates by 18% compared to standard sampling, while being 10X faster than reinforcement learning (RL) baselines, and enabling the generation of a 2.4M frames dynamic-feasible robot dataset for policy learning. As a universal physics-based retargeting method, SPIDER can work with diverse quality data and generate diverse and high-quality data to enable efficient policy learning with methods like RL.
zh

[CV-7] Revisiting Cross-Architecture Distillation: Adaptive Dual-Teacher Transfer for Lightweight Video Models

【速读】:该论文旨在解决视频动作识别中轻量级卷积神经网络(CNN)因架构差异导致的知识蒸馏效果受限,以及现有跨架构知识蒸馏方法难以有效利用强教师模型(如ViT和CNN)的问题。其核心解决方案是提出一种双教师知识蒸馏框架(Dual-Teacher Knowledge Distillation),关键创新在于:(1) 不确定性感知的教师加权机制(Discrepancy-Aware Teacher Weighting),通过动态融合ViT与CNN教师的预测结果,基于教师置信度和与学生预测的差异自适应分配权重,提升监督信号的有效性;(2) 结构差异感知的蒸馏策略(Structure Discrepancy-Aware Distillation),引入轻量辅助分支学习ViT与CNN教师之间的残差特征,聚焦于可迁移的架构差异而非全维模仿,从而在保持效率的同时显著提升轻量CNN学生的性能。

链接: https://arxiv.org/abs/2511.09469
作者: Ying Peng,Hongsen Ye,Changxin Huang,Xiping Hu,Jian Chen,Runhao Zeng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 2 figures, 7 tables

点击查看摘要

Abstract:Vision Transformers (ViTs) have achieved strong performance in video action recognition, but their high computational cost limits their practicality. Lightweight CNNs are more efficient but suffer from accuracy gaps. Cross-Architecture Knowledge Distillation (CAKD) addresses this by transferring knowledge from ViTs to CNNs, yet existing methods often struggle with architectural mismatch and overlook the value of stronger homogeneous CNN teachers. To tackle these challenges, we propose a Dual-Teacher Knowledge Distillation framework that leverages both a heterogeneous ViT teacher and a homogeneous CNN teacher to collaboratively guide a lightweight CNN student. We introduce two key components: (1) Discrepancy-Aware Teacher Weighting, which dynamically fuses the predictions from ViT and CNN teachers by assigning adaptive weights based on teacher confidence and prediction discrepancy with the student, enabling more informative and effective supervision; and (2) a Structure Discrepancy-Aware Distillation strategy, where the student learns the residual features between ViT and CNN teachers via a lightweight auxiliary branch, focusing on transferable architectural differences without mimicking all of ViT’s high-dimensional patterns. Extensive experiments on benchmarks including HMDB51, EPIC-KITCHENS-100, and Kinetics-400 demonstrate that our method consistently outperforms state-of-the-art distillation approaches, achieving notable performance improvements with a maximum accuracy gain of 5.95% on HMDB51.
zh

[CV-8] Hand Held Multi-Object Tracking Dataset in American Football

【速读】:该论文旨在解决美国橄榄球(American football)比赛中运动员检测与跟踪(Multi-Object Tracking, MOT)缺乏标准化公开数据集的问题,从而阻碍了不同算法之间的公平比较。其关键解决方案是构建了首个专为美式橄榄球运动员设计的检测与跟踪数据集,并通过微调检测模型和重识别(re-identification)模型,集成到跟踪系统中,显著提升了在高密度、频繁遮挡和身体接触等复杂场景下的跟踪精度。

链接: https://arxiv.org/abs/2511.09455
作者: Rintaro Otsubo,Kanta Sawafuji,Hideo Saito
机构: Keio University(庆应义塾大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-Object Tracking (MOT) plays a critical role in analyzing player behavior from videos, enabling performance evaluation. Current MOT methods are often evaluated using publicly available datasets. However, most of these focus on everyday scenarios such as pedestrian tracking or are tailored to specific sports, including soccer and basketball. Despite the inherent challenges of tracking players in American football, such as frequent occlusion and physical contact, no standardized dataset has been publicly available, making fair comparisons between methods difficult. To address this gap, we constructed the first dedicated detection and tracking dataset for the American football players and conducted a comparative evaluation of various detection and tracking methods. Our results demonstrate that accurate detection and tracking can be achieved even in crowded scenarios. Fine-tuning detection models improved performance over pre-trained models. Furthermore, when these fine-tuned detectors and re-identification models were integrated into tracking systems, we observed notable improvements in tracking accuracy compared to existing approaches. This work thus enables robust detection and tracking of American football players in challenging, high-density scenarios previously underserved by conventional methods.
zh

[CV-9] BronchOpt : Vision-Based Pose Optimization with Fine-Tuned Foundation Models for Accurate Bronchoscopy Navigation

【速读】:该论文旨在解决支气管镜术中定位精度低的问题,核心挑战在于呼吸运动、解剖结构变异以及CT与术中图像之间的配准偏差导致的形变和错位。解决方案的关键在于提出一种基于视觉的鲁棒性姿态优化框架,通过一个微调后的模态和域不变编码器实现真实内镜RGB图像与CT渲染深度图之间的直接相似性计算,并结合可微分渲染模块通过深度一致性迭代优化相机位姿,从而实现帧级2D-3D配准。此外,研究还构建了首个公开的合成基准数据集,支持标准化和可复现的评估,使模型在仅使用合成数据训练的情况下即可实现平均平移误差2.65 mm、旋转误差0.19 rad的高精度定位,并展现出跨域泛化能力。

链接: https://arxiv.org/abs/2511.09443
作者: Hongchao Shu,Roger D. Soberanis-Mukul,Jiru Xu,Hao Ding,Morgan Ringel,Mali Shen,Saif Iftekar Sayed,Hedyeh Rafii-Tari,Mathias Unberath
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate intra-operative localization of the bronchoscope tip relative to patient anatomy remains challenging due to respiratory motion, anatomical variability, and CT-to-body divergence that cause deformation and misalignment between intra-operative views and pre-operative CT. Existing vision-based methods often fail to generalize across domains and patients, leading to residual alignment errors. This work establishes a generalizable foundation for bronchoscopy navigation through a robust vision-based framework and a new synthetic benchmark dataset that enables standardized and reproducible evaluation. We propose a vision-based pose optimization framework for frame-wise 2D-3D registration between intra-operative endoscopic views and pre-operative CT anatomy. A fine-tuned modality- and domain-invariant encoder enables direct similarity computation between real endoscopic RGB frames and CT-rendered depth maps, while a differentiable rendering module iteratively refines camera poses through depth consistency. To enhance reproducibility, we introduce the first public synthetic benchmark dataset for bronchoscopy navigation, addressing the lack of paired CT-endoscopy data. Trained exclusively on synthetic data distinct from the benchmark, our model achieves an average translational error of 2.65 mm and a rotational error of 0.19 rad, demonstrating accurate and stable localization. Qualitative results on real patient data further confirm strong cross-domain generalization, achieving consistent frame-wise 2D-3D alignment without domain-specific adaptation. Overall, the proposed framework achieves robust, domain-invariant localization through iterative vision-based optimization, while the new benchmark provides a foundation for standardized progress in vision-based bronchoscopy navigation.
zh

[CV-10] OUGS: Active View Selection via Object-aware Uncertainty Estimation in 3DGS

【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在复杂场景中高效重建特定目标时面临的挑战,即现有主动重建方法依赖于全局场景级不确定性度量,易受无关背景干扰,导致视图选择效率低下。其解决方案的关键在于提出一种基于物理参数的不确定性建模方法——OUGS(Object-aware Uncertainty for 3DGS),通过直接从3D高斯基元的显式物理参数(如位置、尺度、旋转)出发,利用渲染雅可比矩阵传播这些参数的协方差,构建出具有高度可解释性的不确定性模型;进而结合语义分割掩码,生成面向目标的不确定性分数,有效将目标与环境解耦,从而实现更精准的对象感知视图选择策略,显著提升目标重建质量和重建效率。

链接: https://arxiv.org/abs/2511.09397
作者: Haiyi Li,Qi Chen,Denis Kalkofen,Hsiang-Ting Chen
机构: University of Adelaide (阿德莱德大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注: 11 pages (10 main + 1 appendix), 7 figures, 3 tables. Preprint, under review for Eurographics 2026

点击查看摘要

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have achieved state-of-the-art results for novel view synthesis. However, efficiently capturing high-fidelity reconstructions of specific objects within complex scenes remains a significant challenge. A key limitation of existing active reconstruction methods is their reliance on scene-level uncertainty metrics, which are often biased by irrelevant background clutter and lead to inefficient view selection for object-centric tasks. We present OUGS, a novel framework that addresses this challenge with a more principled, physically-grounded uncertainty formulation for 3DGS. Our core innovation is to derive uncertainty directly from the explicit physical parameters of the 3D Gaussian primitives (e.g., position, scale, rotation). By propagating the covariance of these parameters through the rendering Jacobian, we establish a highly interpretable uncertainty model. This foundation allows us to then seamlessly integrate semantic segmentation masks to produce a targeted, object-aware uncertainty score that effectively disentangles the object from its environment. This allows for a more effective active view selection strategy that prioritizes views critical to improving object fidelity. Experimental evaluations on public datasets demonstrate that our approach significantly improves the efficiency of the 3DGS reconstruction process and achieves higher quality for targeted objects compared to existing state-of-the-art methods, while also serving as a robust uncertainty estimator for the global scene.
zh

[CV-11] Learning by Neighbor-Aware Semantics Deciding by Open-form Flows: Towards Robust Zero-Shot Skeleton Action Recognition

【速读】:该论文旨在解决零样本骨骼动作识别(zero-shot skeleton action recognition)中因缺乏对应骨骼先验信息而导致的未见动作类别识别难题。现有方法多采用“对齐-分类”范式,面临两大瓶颈:一是由语义不完善导致的点对点对齐脆弱性,二是受限于静态决策边界和粗粒度锚点的刚性分类器。解决方案的关键在于提出名为Flora的新方法,其核心创新包括:(1) 通过引入邻近类间上下文线索构建方向感知的区域语义,并结合跨模态几何一致性目标实现稳定可靠的点到区域对齐;(2) 利用无噪流匹配弥合语义与骨骼潜在嵌入间的模态分布差异,同时采用条件无关对比正则化增强判别能力,从而实现基于token级速度预测的细粒度分布感知分类器。

链接: https://arxiv.org/abs/2511.09388
作者: Yang Chen,Miaoge Li,Zhijie Rao,Deze Zeng,Song Guo,Jingcai Guo
机构: The Hong Kong Polytechnic University (香港理工大学); China University of Geoscience (中国地质大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code is available at this https URL

点击查看摘要

Abstract:Recognizing unseen skeleton action categories remains highly challenging due to the absence of corresponding skeletal priors. Existing approaches generally follow an “align-then-classify” paradigm but face two fundamental issues, i.e., (i) fragile point-to-point alignment arising from imperfect semantics, and (ii) rigid classifiers restricted by static decision boundaries and coarse-grained anchors. To address these issues, we propose a novel method for zero-shot skeleton action recognition, termed \texttt \textbfFlora , which builds upon \textbfF lexib \textbfL e neighb \textbfO r-aware semantic attunement and open-form dist \textbfR ibution-aware flow cl \textbfA ssifier. Specifically, we flexibly attune textual semantics by incorporating neighboring inter-class contextual cues to form direction-aware regional semantics, coupled with a cross-modal geometric consistency objective that ensures stable and robust point-to-region alignment. Furthermore, we employ noise-free flow matching to bridge the modality distribution gap between semantic and skeleton latent embeddings, while a condition-free contrastive regularization enhances discriminability, leading to a distribution-aware classifier with fine-grained decision boundaries achieved through token-level velocity predictions. Extensive experiments on three benchmark datasets validate the effectiveness of our method, showing particularly impressive performance even when trained with only 10% of the seen data. Code is available at this https URL.
zh

[CV-12] Spatio-Temporal Context Learning with Temporal Difference Convolution for Moving Infrared Small Target Detection AAAI2026

【速读】:该论文旨在解决移动红外小目标检测(Moving Infrared Small Target Detection, IRSTD)中因目标特征微弱和背景干扰复杂而导致的检测精度低的问题。其核心挑战在于如何有效建模时空特征,以提升对运动目标的识别能力。解决方案的关键在于提出一种新颖的TDCNet网络架构,其中包含两个核心技术:一是设计了时差卷积(Temporal Difference Convolution, TDC)重参数化模块,通过三个并行的TDC块融合时差信息与3D卷积,实现多尺度运动上下文特征提取并抑制伪运动杂波;二是引入基于TDC的时空注意力机制,通过跨注意力机制在TDC主干与平行3D主干之间建模全局语义依赖关系,从而优化当前帧特征表示,显著提升检测性能。

链接: https://arxiv.org/abs/2511.09352
作者: Houzhang Fang,Shukai Guo,Qiuhuan Chen,Yi Chang,Luxin Yan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Moving infrared small target detection (IRSTD) plays a critical role in practical applications, such as surveillance of unmanned aerial vehicles (UAVs) and UAV-based search system. Moving IRSTD still remains highly challenging due to weak target features and complex background interference. Accurate spatio-temporal feature modeling is crucial for moving target detection, typically achieved through either temporal differences or spatio-temporal (3D) convolutions. Temporal difference can explicitly leverage motion cues but exhibits limited capability in extracting spatial features, whereas 3D convolution effectively represents spatio-temporal features yet lacks explicit awareness of motion dynamics along the temporal dimension. In this paper, we propose a novel moving IRSTD network (TDCNet), which effectively extracts and enhances spatio-temporal features for accurate target detection. Specifically, we introduce a novel temporal difference convolution (TDC) re-parameterization module that comprises three parallel TDC blocks designed to capture contextual dependencies across different temporal ranges. Each TDC block fuses temporal difference and 3D convolution into a unified spatio-temporal convolution representation. This re-parameterized module can effectively capture multi-scale motion contextual features while suppressing pseudo-motion clutter in complex backgrounds, significantly improving detection performance. Moreover, we propose a TDC-guided spatio-temporal attention mechanism that performs cross-attention between the spatio-temporal features from the TDC-based backbone and a parallel 3D backbone. This mechanism models their global semantic dependencies to refine the current frame’s features. Extensive experiments on IRSTD-UAV and public infrared datasets demonstrate that our TDCNet achieves state-of-the-art detection performance in moving target detection.
zh

[CV-13] FQ-PETR: Fully Quantized Position Embedding Transformation for Multi-View 3D Object Detection

【速读】:该论文旨在解决生成式视觉-激光雷达融合三维目标检测模型(如PETR及其变体)在实际部署中面临的高计算成本与内存占用问题,其核心挑战在于现有量化方法直接应用于PETR时会导致严重精度下降,主要原因包括多模态特征间量级差异显著(尤其是图像特征与相机射线位置嵌入(camera-ray positional embeddings, PE)之间),以及非线性运算(如反sigmoid函数)的量化效率低下和近似误差较大。解决方案的关键在于提出FQ-PETR框架,包含三项创新:(1) 量化友好型LiDAR射线位置嵌入(QFPE),通过引入基于LiDAR先验的单点采样和锚点嵌入替代多点采样,消除非线性操作并使PE尺度与图像特征对齐;(2) 双查表法(DULUT),采用两级级联线性查找表近似复杂非线性函数,在无需专用硬件的前提下实现高保真度;(3) 数值稳定后量化(QANS),在softmax数值稳定后再进行量化,缓解因大输入导致的注意力机制失真。实验表明,FQ-PETR在W8A8量化下仅损失约1%精度,同时将延迟降低高达75%,显著优于现有的PTQ和QAT基线方法。

链接: https://arxiv.org/abs/2511.09347
作者: Jiangyong Yu,Changyong Shu,Sifan Zhou,Zichen Yu,Xing Hu,Yan Chen,Dawei Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Camera-based multi-view 3D detection is crucial for autonomous driving. PETR and its variants (PETRs) excel in benchmarks but face deployment challenges due to high computational cost and memory footprint. Quantization is an effective technique for compressing deep neural networks by reducing the bit width of weights and activations. However, directly applying existing quantization methods to PETRs leads to severe accuracy degradation. This issue primarily arises from two key challenges: (1) significant magnitude disparity between multi-modal features-specifically, image features and camera-ray positional embeddings (PE), and (2) the inefficiency and approximation error of quantizing non-linear operators, which commonly rely on hardware-unfriendly computations. In this paper, we propose FQ-PETR, a fully quantized framework for PETRs, featuring three key innovations: (1) Quantization-Friendly LiDAR-ray Position Embedding (QFPE): Replacing multi-point sampling with LiDAR-prior-guided single-point sampling and anchor-based embedding eliminates problematic non-linearities (e.g., inverse-sigmoid) and aligns PE scale with image features, preserving accuracy. (2) Dual-Lookup Table (DULUT): This algorithm approximates complex non-linear functions using two cascaded linear LUTs, achieving high fidelity with minimal entries and no specialized hardware. (3) Quantization After Numerical Stabilization (QANS): Performing quantization after softmax numerical stabilization mitigates attention distortion from large inputs. On PETRs (e.g. PETR, StreamPETR, PETRv2, MV2d), FQ-PETR under W8A8 achieves near-floating-point accuracy (1% degradation) while reducing latency by up to 75%, significantly outperforming existing PTQ and QAT baselines.
zh

[CV-14] DualFete: Revisiting Teacher-Student Interactions from a Feedback Perspective for Semi-supervised Medical Image Segmentation AAAI-26 AAAI

【速读】:该论文旨在解决半监督医学图像分割中教师-学生(teacher-student)框架因图像固有模糊性导致的错误标注传播问题,尤其是学生对错误伪标签的迭代再确认会引发自增强偏差(self-reinforcing bias)。解决方案的关键在于引入一个反馈机制,使学生能够对教师伪标签引起的更新进行反馈,从而指导教师优化伪标签。该机制依赖两个核心组件:反馈 attributor(用于识别触发学生更新的伪标签)和反馈 receiver(决定反馈作用的位置)。进一步地,论文提出双教师反馈模型,通过跨教师监督缓解分歧并避免一致性错误,显著提升了误差纠正能力与分割性能。

链接: https://arxiv.org/abs/2511.09319
作者: Le Yi,Wei Huang,Lei Zhang,Kefu Zhao,Yan Wang,Zizhou Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by Proceedings of the AAAI Conference on Artificial Intelligence 40 (AAAI-26)

点击查看摘要

Abstract:The teacher-student paradigm has emerged as a canonical framework in semi-supervised learning. When applied to medical image segmentation, the paradigm faces challenges due to inherent image ambiguities, making it particularly vulnerable to erroneous supervision. Crucially, the student’s iterative reconfirmation of these errors leads to self-reinforcing bias. While some studies attempt to mitigate this bias, they often rely on external modifications to the conventional teacher-student framework, overlooking its intrinsic potential for error correction. In response, this work introduces a feedback mechanism into the teacher-student framework to counteract error reconfirmations. Here, the student provides feedback on the changes induced by the teacher’s pseudo-labels, enabling the teacher to refine these labels accordingly. We specify that this interaction hinges on two key components: the feedback attributor, which designates pseudo-labels triggering the student’s update, and the feedback receiver, which determines where to apply this feedback. Building on this, a dual-teacher feedback model is further proposed, which allows more dynamics in the feedback loop and fosters more gains by resolving disagreements through cross-teacher supervision while avoiding consistent errors. Comprehensive evaluations on three medical image benchmarks demonstrate the method’s effectiveness in addressing error propagation in semi-supervised medical image segmentation.
zh

[CV-15] DensiCrafter: Physically-Constrained Generation and Fabrication of Self-Supporting Hollow Structures

【速读】:该论文旨在解决当前3D生成模型在设计轻量化且自支撑结构时忽视物理约束与可制造性的问题。其核心挑战在于如何在保持几何保真度的同时,生成既节省材料又具备结构稳定性的中空3D模型。解决方案的关键在于提出DensiCrafter框架,通过优化密度场(density field)实现这一目标:从Trellis生成的粗粒度体素网格出发,将其视为连续密度场进行可微分优化,并引入三种无需仿真、基于物理约束的损失项以确保结构稳定性;同时,质量正则化项抑制冗余材料使用,受限优化域保留外表面形态。该方法可无缝集成至基于Trellis的预训练模型(如Trellis、DSO),无需修改网络架构,在文本到3D任务中实现最高达43%的材料质量减少,且实验证明所生成模型具备良好的打印可靠性和自支撑能力。

链接: https://arxiv.org/abs/2511.09298
作者: Shengqi Dang,Fu Chai,Jiaxin Li,Chao Yuan,Wei Ye,Nan Cao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rise of 3D generative models has enabled automatic 3D geometry and texture synthesis from multimodal inputs (e.g., text or images). However, these methods often ignore physical constraints and manufacturability considerations. In this work, we address the challenge of producing 3D designs that are both lightweight and self-supporting. We present DensiCrafter, a framework for generating lightweight, self-supporting 3D hollow structures by optimizing the density field. Starting from coarse voxel grids produced by Trellis, we interpret these as continuous density fields to optimize and introduce three differentiable, physically constrained, and simulation-free loss terms. Additionally, a mass regularization penalizes unnecessary material, while a restricted optimization domain preserves the outer surface. Our method seamlessly integrates with pretrained Trellis-based models (e.g., Trellis, DSO) without any architectural changes. In extensive evaluations, we achieve up to 43% reduction in material mass on the text-to-3D task. Compared to state-of-the-art baselines, our method could improve the stability and maintain high geometric fidelity. Real-world 3D-printing experiments confirm that our hollow designs can be reliably fabricated and could be self-supporting.
zh

[CV-16] Enriching Knowledge Distillation with Cross-Modal Teacher Fusion

【速读】:该论文旨在解决多教师知识蒸馏(Multi-teacher Knowledge Distillation, KD)中知识多样性不足的问题,现有方法通常仅依赖单一模态的视觉信息,忽略了跨模态表示(如文本-图像对齐)所蕴含的语义丰富性。其解决方案的关键在于引入CLIP(Contrastive Language–Image Pretraining)模型的视觉-语言知识作为补充监督信号,提出一种简单而有效的融合框架——RichKD,通过联合优化传统教师模型与CLIP的logits和特征,并利用CLIP的多提示文本引导机制,实现数据集特定与语义增强的视觉线索融合,从而提升学生模型的准确性、置信度可靠性及类别间一致性,显著改善蒸馏质量并增强鲁棒性。

链接: https://arxiv.org/abs/2511.09286
作者: Amir M. Mansourian,Amir Mohammad Babaei,Shohreh Kasaei
机构: Sharif University of Technology (谢里夫理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures, 8 tables

点击查看摘要

Abstract:Multi-teacher knowledge distillation (KD), a more effective technique than traditional single-teacher methods, transfers knowledge from expert teachers to a compact student model using logit or feature matching. However, most existing approaches lack knowledge diversity, as they rely solely on unimodal visual information, overlooking the potential of cross-modal representations. In this work, we explore the use of CLIP’s vision-language knowledge as a complementary source of supervision for KD, an area that remains largely underexplored. We propose a simple yet effective framework that fuses the logits and features of a conventional teacher with those from CLIP. By incorporating CLIP’s multi-prompt textual guidance, the fused supervision captures both dataset-specific and semantically enriched visual cues. Beyond accuracy, analysis shows that the fused teacher yields more confident and reliable predictions, significantly increasing confident-correct cases while reducing confidently wrong ones. Moreover, fusion with CLIP refines the entire logit distribution, producing semantically meaningful probabilities for non-target classes, thereby improving inter-class consistency and distillation quality. Despite its simplicity, the proposed method, Enriching Knowledge Distillation (RichKD), consistently outperforms most existing baselines across multiple benchmarks and exhibits stronger robustness under distribution shifts and input corruptions.
zh

[CV-17] Deep Learning for Metabolic Rate Estimation from Biosignals: A Comparative Study of Architectures and Signal Selection BMVC2025

【速读】:该论文旨在解决能量消耗(Energy Expenditure, EE)估计问题,即通过生理信号(如心率、呼吸或加速度计数据)准确推断人体代谢率。传统方法多依赖经典回归模型,而现有深度学习方法往往未区分神经网络架构与信号选择对性能的独立贡献。解决方案的关键在于系统性地评估两个维度:一是不同神经架构(包括Transformer、CNN、ResNet等)在单一信号、信号组合及多传感器输入下的表现;二是分析信号类型(如分钟通气量)对预测精度的影响。研究发现,分钟通气量(Minute Ventilation)是单信号中最具预测性的特征,结合Transformer模型可实现最低RMSE(0.87 W/kg),同时多信号融合(如Hexoskin智能衬衫提供的五维信号)能支持更高效模型(如CNN+注意力机制),并揭示个体间显著差异,强调了自适应建模策略的必要性。

链接: https://arxiv.org/abs/2511.09276
作者: Sarvenaz Babakhani,David Remy,Alina Roitberg
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the MPI Workshop, BMVC 2025. 17 pages, 6 figures. Code available at this https URL

点击查看摘要

Abstract:Energy expenditure estimation aims to infer human metabolic rate from physiological signals such as heart rate, respiration, or accelerometer data, and has been studied primarily with classical regression methods. The few existing deep learning approaches rarely disentangle the role of neural architecture from that of signal choice. In this work, we systematically evaluate both aspects. We compare classical baselines with newer neural architectures across single signals, signal pairs, and grouped sensor inputs for diverse physical activities. Our results show that minute ventilation is the most predictive individual signal, with a transformer model achieving the lowest root mean square error (RMSE) of 0.87 W/kg across all activities. Paired and grouped signals, such as those from the Hexoskin smart shirt (five signals), offer good alternatives for faster models like CNN and ResNet with attention. Per-activity evaluation revealed mixed outcomes: notably better results in low-intensity activities (RMSE down to 0.29 W/kg; NRMSE = 0.04), while higher-intensity tasks showed larger RMSE but more comparable normalized errors. Finally, subject-level analysis highlights strong inter-individual variability, motivating the need for adaptive modeling strategies. Our code and models will be publicly available at this https URL .
zh

[CV-18] GRACE: Designing Generative Face Video Codec via Agile Hardware-Centric Workflow

【速读】:该论文旨在解决生成式 AI (Generative AI) 领域中动画基生成编解码器(Animation-based Generative Codec, AGC)在资源与功耗受限的边缘设备上部署困难的问题,主要挑战包括模型参数量大、算法难以动态适配以及计算和传输带来的高能耗。解决方案的关键在于提出一种面向现场可编程门阵列(Field Programmable Gate Arrays, FPGAs)的新型部署方案:首先通过后训练静态量化和层融合等网络压缩技术降低模型复杂度;进而设计基于协处理器架构的重叠加速器,结合软硬件协同设计实现高效计算,并利用卷积、网格采样、上采样等硬件引擎及双缓冲流水线和循环展开等并行优化策略充分挖掘FPGA资源潜力。最终在PYNQ-Z1平台上构建AGC原型系统,相比商用CPU和GPU分别实现24.9倍和4.1倍的能量效率提升,单像素重建仅需11.7微焦耳(μJ)。

链接: https://arxiv.org/abs/2511.09272
作者: Rui Wan,Qi Zheng,Ruoyu Zhang,Bu Chen,Jiaming Liu,Min Li,Minge Jing,Jinjia Zhou,Yibo Fan
机构: Fudan University (复旦大学); Shanghai Jiao Tong University (上海交通大学); Hosei University (明治大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The Animation-based Generative Codec (AGC) is an emerging paradigm for talking-face video compression. However, deploying its intricate decoder on resource and power-constrained edge devices presents challenges due to numerous parameters, the inflexibility to adapt to dynamically evolving algorithms, and the high power consumption induced by extensive computations and data transmission. This paper for the first time proposes a novel field programmable gate arrays (FPGAs)-oriented AGC deployment scheme for edge-computing video services. Initially, we analyze the AGC algorithm and employ network compression methods including post-training static quantization and layer fusion techniques. Subsequently, we design an overlapped accelerator utilizing the co-processor paradigm to perform computations through software-hardware co-design. The hardware processing unit comprises engines such as convolution, grid sampling, upsample, etc. Parallelization optimization strategies like double-buffered pipelines and loop unrolling are employed to fully exploit the resources of FPGA. Ultimately, we establish an AGC FPGA prototype on the PYNQ-Z1 platform using the proposed scheme, achieving \textbf24.9 \times and \textbf4.1 \times higher energy efficiency against commercial Central Processing Unit (CPU) and Graphic Processing Unit (GPU), respectively. Specifically, only \textbf11.7 microjoules ( \upmu J) are required for one pixel reconstructed by this FPGA system.
zh

[CV-19] Spatial Information Bottleneck for Interpretable Visual Recognition

【速读】:该论文旨在解决深度神经网络中表征空间纠缠(spatially entangled representations)的问题,即模型在学习过程中将判别性前景特征与虚假的背景相关性混杂在一起,从而损害模型的可解释性和鲁棒性。其解决方案的关键在于从信息论视角重新理解基于梯度的归因方法:作者证明了反向传播中的向量-雅可比乘积(Vector-Jacobian Products, VJP)在弱条件下构成输入特征关于类别标签的最小充分统计量,并据此提出编码-解码视角——前向传播将输入编码至类别空间,而VJP则实现从类别空间到特征空间的解码。基于此,提出了空间信息瓶颈(Spatial Information Bottleneck, S-IB),通过最大化前景区域VJP与输入之间的互信息、最小化背景区域的互信息,强制网络仅在与类别相关的空间区域内编码信息,从而实现空间解耦的信息流。该方法直接优化训练过程中VJP的空间结构,无需针对特定解释方法进行调参,即可在多个基准数据集和六种解释方法上统一提升可视化质量与分类性能。

链接: https://arxiv.org/abs/2511.09239
作者: Kaixiang Shu,Kai Meng,Junqin Luo
机构: Shenzhen University (深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep neural networks typically learn spatially entangled representations that conflate discriminative foreground features with spurious background correlations, thereby undermining model interpretability and robustness. We propose a novel understanding framework for gradient-based attribution from an information-theoretic perspective. We prove that, under mild conditions, the Vector-Jacobian Products (VJP) computed during backpropagation form minimal sufficient statistics of input features with respect to class labels. Motivated by this finding, we propose an encoding-decoding perspective : forward propagation encodes inputs into class space, while VJP in backpropagation decodes this encoding back to feature space. Therefore, we propose Spatial Information Bottleneck (S-IB) to spatially disentangle information flow. By maximizing mutual information between foreground VJP and inputs while minimizing mutual information in background regions, S-IB encourages networks to encode information only in class-relevant spatial regions. Since post-hoc explanation methods fundamentally derive from VJP computations, directly optimizing VJP’s spatial structure during training improves visualization quality across diverse explanation paradigms. Experiments on five benchmarks demonstrate universal improvements across six explanation methods, achieving better foreground concentration and background suppression without method-specific tuning, alongside consistent classification accuracy gains.
zh

[CV-20] owards Trustworthy Dermatology MLLM s: A Benchmark and Multimodal Evaluator for Diagnostic Narratives

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在皮肤科诊断叙述生成任务中缺乏可靠评估手段的问题,这是其负责任临床部署的主要瓶颈。解决方案的关键在于提出了一套结合DermBench与DermEval的新型评估框架:DermBench是一个精心构建的基准数据集,包含4,000张真实世界皮肤科图像及其专家认证的诊断叙述,并利用基于大语言模型的判别器从临床相关维度对候选叙述进行评分,从而实现一致且全面的模型评估;DermEval则是一个无需参考文本的多模态自动评估器,能够针对单个病例生成结构化批评及整体分数和分维度评分,支持细粒度的个体案例分析,有助于识别模型局限性和偏倚。实验表明,该框架在4,500例多样化的皮肤科病例上与专家评分高度一致(平均偏差分别为0.251和0.117,满分5分),为不同多模态大语言模型的诊断能力与可信度提供了可重复、可扩展且具有临床意义的量化指标。

链接: https://arxiv.org/abs/2511.09195
作者: Yuhao Shen,Jiahe Qian,Shuping Zhang,Zhangtianyi Chen,Tao Lu,Juexiao Zhou
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); Chinese Academy of Sciences (中国科学院); Shantou University Medical College (汕头大学医学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal large language models (LLMs) are increasingly used to generate dermatology diagnostic narratives directly from images. However, reliable evaluation remains the primary bottleneck for responsible clinical deployment. We introduce a novel evaluation framework that combines DermBench, a meticulously curated benchmark, with DermEval, a robust automatic evaluator, to enable clinically meaningful, reproducible, and scalable assessment. We build DermBench, which pairs 4,000 real-world dermatology images with expert-certified diagnostic narratives and uses an LLM-based judge to score candidate narratives across clinically grounded dimensions, enabling consistent and comprehensive evaluation of multimodal models. For individual case assessment, we train DermEval, a reference-free multimodal evaluator. Given an image and a generated narrative, DermEval produces a structured critique along with an overall score and per-dimension ratings. This capability enables fine-grained, per-case analysis, which is critical for identifying model limitations and biases. Experiments on a diverse dataset of 4,500 cases demonstrate that DermBench and DermEval achieve close alignment with expert ratings, with mean deviations of 0.251 and 0.117 (out of 5), respectively, providing reliable measurement of diagnostic ability and trustworthiness across different multimodal LLMs.
zh

[CV-21] DBINDS - Can Initial Noise from Diffusion Model Inversion Help Reveal AI-Generated Videos?

【速读】:该论文旨在解决生成式 AI (Generative AI) 生成视频内容对数字内容安全和取证分析带来的挑战,特别是现有检测方法主要依赖像素级视觉线索、在面对未见过的生成器时泛化能力差的问题。其解决方案的关键在于提出一种基于扩散模型反演(diffusion-model-inversion)的检测框架 DBINDS,该框架不直接分析像素,而是通过分析潜在空间(latent space)的动力学特性来识别视频真伪;具体而言,DBINDS 发现真实与生成视频在扩散模型反演恢复的初始噪声序列上存在系统性差异,并据此构建初始噪声差异序列(Initial Noise Difference Sequence, INDS),进而提取多域、多尺度特征,结合 LightGBM 分类器与贝叶斯优化策略,实现了仅用单一生成器训练数据即可在 GenVidBench 上展现出优异的跨生成器泛化性能和鲁棒性。

链接: https://arxiv.org/abs/2511.09184
作者: Yanlin Wu,Xiaogang Yuan,Dezhi An
机构: Gansu University of Political Science and Law (甘肃政法大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Submitted to IEEE Transactions on Dependable and Secure Computing (TDSC) on 16 September 2025

点击查看摘要

Abstract:AI-generated video has advanced rapidly and poses serious challenges to content security and forensic analysis. Existing detectors rely mainly on pixel-level visual cues and generalize poorly to unseen generators. We propose DBINDS, a diffusion-model-inversion based detector that analyzes latent-space dynamics rather than pixels. We find that initial noise sequences recovered by diffusion inversion differ systematically between real and generated videos. Building on this, DBINDS forms an Initial Noise Difference Sequence (INDS) and extracts multi-domain, multi-scale features. With feature optimization and a LightGBM classifier tuned by Bayesian search, DBINDS (trained on a single generator) achieves strong cross-generator performance on GenVidBench, demonstrating good generalization and robustness in limited-data settings.
zh

[CV-22] FSampler: Training Free Acceleration of Diffusion Sampling via Epsilon Extrapolation

【速读】:该论文旨在解决扩散模型(Diffusion Model)在采样过程中计算效率低下的问题,即通过减少函数评估次数(Number of Function Evaluations, NFE)来加速生成过程,同时保持高保真度的图像质量。其核心解决方案是提出一种无需训练、与采样器无关的执行层FSampler,该方法利用近期模型调用中保留的去噪信号(epsilon)历史,采用二阶至四阶有限差分预测器对外推下一个epsilon值,并在特定步骤中用预测值替代实际模型调用,从而减少模型调用次数;为保证稳定性,FSampler引入学习稳定器(rescales predictions on skipped steps)、梯度估计稳定器(compensates local curvature)、保护窗口(protected windows)、周期性锚点(periodic anchors)以及连续跳过步数上限等机制,有效控制轨迹偏差。实验表明,FSampler可在不修改原有采样器更新规则的前提下,在SSIM 0.95–0.99范围内实现8–22%的加速和15–25%的模型调用减少,甚至在较低保真度下可达45–50%的调用削减。

链接: https://arxiv.org/abs/2511.09180
作者: Michael A. Vladimir
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages; diffusion models; accelerated sampling; ODE solvers; epsilon extrapolation; training free inference

点击查看摘要

Abstract:FSampler is a training free, sampler agnostic execution layer that accelerates diffusion sampling by reducing the number of function evaluations (NFE). FSampler maintains a short history of denoising signals (epsilon) from recent real model calls and extrapolates the next epsilon using finite difference predictors at second order, third order, or fourth order, falling back to lower order when history is insufficient. On selected steps the predicted epsilon substitutes the model call while keeping each sampler’s update rule unchanged. Predicted epsilons are validated for finiteness and magnitude; a learning stabilizer rescales predictions on skipped steps to correct drift, and an optional gradient estimation stabilizer compensates local curvature. Protected windows, periodic anchors, and a cap on consecutive skips bound deviation over the trajectory. Operating at the sampler level, FSampler integrates with Euler/DDIM, DPM++ 2M/2S, LMS/AB2, and RES family exponential multistep methods and drops into standard workflows. FLUX.1 dev, Qwen Image, and Wan 2.2, FSampler reduces time by 8 to 22% and model calls by 15 to 25% at high fidelity (Structural Similarity Index (SSIM) 0.95 to 0.99), without altering sampler formulas. With an aggressive adaptive gate, reductions can reach 45 to 50% fewer model calls at lower fidelity (SSIM 0.73 to 0.74).
zh

[CV-23] HOTFLoc: End-to-End Hierarchical LiDAR Place Recognition Re-Ranking and 6-DoF Metric Localisation in Forests

【速读】:该论文旨在解决森林等复杂场景下LiDAR位姿估计中因杂波、自相似性和视角变化导致的定位精度下降与重排序失败问题。其核心解决方案是提出HOTFLoc++框架,通过基于八叉树(octree)的Transformer结构提取多粒度层次化局部描述子以增强鲁棒性,并引入可学习的多尺度几何验证模块来降低单尺度对应关系退化引发的重排序错误;同时采用粗到精的配准策略,在保证高精度的同时实现比RANSAC快两个数量级的运行效率。

链接: https://arxiv.org/abs/2511.09170
作者: Ethan Griffiths,Maryam Haghighat,Simon Denman,Clinton Fookes,Milad Ramezani
机构: CSIRO Robotics, Data61, CSIRO (澳大利亚联邦科学与工业研究组织); Queensland University of Technology (昆士兰科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 9 pages, 2 figures. Submitted to RA-L

点击查看摘要

Abstract:This article presents HOTFLoc++, an end-to-end framework for LiDAR place recognition, re-ranking, and 6-DoF metric localisation in forests. Leveraging an octree-based transformer, our approach extracts hierarchical local descriptors at multiple granularities to increase robustness to clutter, self-similarity, and viewpoint changes in challenging scenarios, including ground-to-ground and ground-to-aerial in forest and urban environments. We propose a learnable multi-scale geometric verification module to reduce re-ranking failures in the presence of degraded single-scale correspondences. Our coarse-to-fine registration approach achieves comparable or lower localisation errors to baselines, with runtime improvements of two orders of magnitude over RANSAC for dense point clouds. Experimental results on public datasets show the superiority of our approach compared to state-of-the-art methods, achieving an average Recall@1 of 90.7% on CS-Wild-Places: an improvement of 29.6 percentage points over baselines, while maintaining high performance on single-source benchmarks with an average Recall@1 of 91.7% and 96.0% on Wild-Places and MulRan, respectively. Our method achieves under 2 m and 5 degrees error for 97.2% of 6-DoF registration attempts, with our multi-scale re-ranking module reducing localisation errors by ~2 \times on average. The code will be available upon acceptance.
zh

[CV-24] PressTrack-HMR: Pressure-Based Top-Down Multi-Person Global Human Mesh Recovery AAAI26

【速读】:该论文旨在解决多人群体场景下基于压力信号的全身网格重建(Multi-person Global Human Mesh Recovery, HMR)难题,特别是如何从混叠的压力信号中区分不同个体并提取其时序压力数据的问题。解决方案的关键在于提出了一种自顶向下的PressTrack-HMR框架,该框架采用“检测-跟踪”策略:首先从原始压力数据中识别并分割出每个个体的压力信号,随后对每个分离出的个体压力信号独立执行HMR,从而实现无遮挡、隐私友好的多人运动捕捉。

链接: https://arxiv.org/abs/2511.09147
作者: Jiayue Yuan,Fangting Xie,Guangwen Ouyang,Changhai Ma,Ziyu Wu,Heyu Ding,Quan Wan,Yi Ke,Yuchen Wu,Xiaohui Cai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepeted by AAAI26

点击查看摘要

Abstract:Multi-person global human mesh recovery (HMR) is crucial for understanding crowd dynamics and interactions. Traditional vision-based HMR methods sometimes face limitations in real-world scenarios due to mutual occlusions, insufficient lighting, and privacy concerns. Human-floor tactile interactions offer an occlusion-free and privacy-friendly alternative for capturing human motion. Existing research indicates that pressure signals acquired from tactile mats can effectively estimate human pose in single-person scenarios. However, when multiple individuals walk randomly on the mat simultaneously, how to distinguish intermingled pressure signals generated by different persons and subsequently acquire individual temporal pressure data remains a pending challenge for extending pressure-based HMR to the multi-person situation. In this paper, we present \textbfPressTrack-HMR, a top-down pipeline that recovers multi-person global human meshes solely from pressure signals. This pipeline leverages a tracking-by-detection strategy to first identify and segment each individual’s pressure signal from the raw pressure data, and subsequently performs HMR for each extracted individual signal. Furthermore, we build a multi-person interaction pressure dataset \textbfMIP, which facilitates further research into pressure-based human motion analysis in multi-person scenarios. Experimental results demonstrate that our method excels in multi-person HMR using pressure data, with 89.2~ mm MPJPE and 112.6~ mm WA-MPJPE _100 , and these showcase the potential of tactile mats for ubiquitous, privacy-preserving multi-person action recognition. Our dataset \ code are available at this https URL.
zh

[CV-25] MACEval: A Multi-Agent Continual Evaluation Network for Large Models

【速读】:该论文旨在解决当前大型模型评估基准存在的三大核心问题:一是多数基准为封闭式设计,易因训练语料库不断增长导致数据污染,从而引发过拟合,削弱评估可信度;二是现有基准规模与范围持续扩大,且指标动态变化,加之高度依赖人工维护,难以及时适应大模型能力的快速演进;三是缺乏自动化、可持续的评估机制。其解决方案的关键在于提出MACEval(Multi-Agent Continual Evaluation network),通过多智能体协同机制实现动态评估:采用角色分配、过程内数据生成及级联代理网络中的评估路由策略,构建交互式、自主化的评估流程,从而实现无需人工干预、高效经济且灵活可扩展的长期性能量化评估。

链接: https://arxiv.org/abs/2511.09139
作者: Zijian Chen,Yuze Sun,Yuan Tian,Wenjun Zhang,Guangtao Zhai
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 38 pages, 12 figures

点击查看摘要

Abstract:Hundreds of benchmarks dedicated to evaluating large models from multiple perspectives have been presented over the past few years. Albeit substantial efforts, most of them remain closed-ended and are prone to overfitting due to the potential data contamination in the ever-growing training corpus of large models, thereby undermining the credibility of the evaluation. Moreover, the increasing scale and scope of current benchmarks with transient metrics, as well as the heavily human-dependent curation procedure, pose significant challenges for timely maintenance and adaptation to gauge the advancing capabilities of large models. In this paper, we introduce MACEval, a \Multi-Agent Continual Evaluation network for dynamic evaluation of large models, and define a new set of metrics to quantify performance longitudinally and sustainably. MACEval adopts an interactive and autonomous evaluation mode that employs role assignment, in-process data generation, and evaluation routing through a cascaded agent network. Extensive experiments on 9 open-ended tasks with 23 participating large models demonstrate that MACEval is (1) human-free and automatic, mitigating laborious result processing with inter-agent judgment guided; (2) efficient and economical, reducing a considerable amount of data and overhead to obtain similar results compared to related benchmarks; and (3) flexible and scalable, migrating or integrating existing benchmarks via customized evaluation topologies. We hope that MACEval can broaden future directions of large model evaluation.
zh

[CV-26] PIFF: A Physics-Informed Generative Flow Model for Real-Time Flood Depth Mapping

【速读】:该论文旨在解决传统洪水淹没制图方法(如数值模拟和航空摄影)在效率和可靠性方面的局限性,尤其是在需要快速响应的实时洪水预测场景中。解决方案的关键在于提出一种物理信息引导的、基于流形的生成神经网络(PIFF),其核心创新是将水动力学先验知识嵌入到训练过程中,通过简化淹没模型(SPM)作为条件约束,并结合基于Transformer的降雨编码器捕捉降水的时间依赖性,从而在图像到图像的生成框架下实现从数字高程模型(DEM)到洪水深度的高效映射。该方法有效融合了物理规律与数据驱动学习,替代了昂贵的数值模拟,实现了高精度、近实时的洪水地图生成。

链接: https://arxiv.org/abs/2511.09130
作者: ChunLiang Wu,Tsunhua Yang,Hungying Chen
机构: Brightest Technology Inc.( brightest科技公司); National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Flood mapping is crucial for assessing and mitigating flood impacts, yet traditional methods like numerical modeling and aerial photography face limitations in efficiency and reliability. To address these challenges, we propose PIFF, a physics-informed, flow-based generative neural network for near real-time flood depth estimation. Built on an image-to-image generative framework, it efficiently maps Digital Elevation Models (DEM) to flood depth predictions. The model is conditioned on a simplified inundation model (SPM) that embeds hydrodynamic priors into the training process. Additionally, a transformer-based rainfall encoder captures temporal dependencies in precipitation. Integrating physics-informed constraints with data-driven learning, PIFF captures the causal relationships between rainfall, topography, SPM, and flooding, replacing costly simulations with accurate, real-time flood maps. Using a 26 km study area in Tainan, Taiwan, with 182 rainfall scenarios ranging from 24 mm to 720 mm over 24 hours, our results demonstrate that PIFF offers an effective, data-driven alternative for flood prediction and response.
zh

[CV-27] DKDS: A Benchmark Dataset of Degraded Kuzushiji Documents with Seals for Detection and Binarization

【速读】:该论文旨在解决当前光学字符识别(OCR)技术在处理带有噪声的前现代日文手写体(Kuzushiji)文档时准确率下降的问题,尤其是针对文档退化和印章干扰等现实场景中的挑战。现有方法虽能在干净文本上表现良好,但缺乏对复杂噪声的鲁棒性。解决方案的关键在于构建首个专门面向此类问题的基准数据集——Degraded Kuzushiji Documents with Seals (DKDS),并在此基础上定义两个任务:文本与印章检测及文档二值化。通过引入专家辅助标注的数据构建流程,并提供基于YOLO系列模型和GAN等先进算法的基线结果,为后续研究提供了标准化评估平台和可复现的技术路径。

链接: https://arxiv.org/abs/2511.09117
作者: Rui-Yang Ju,Kohei Yamashita,Hirotaka Kameko,Shinsuke Mori
机构: Kyoto University (京都大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Kuzushiji, a pre-modern Japanese cursive script, can currently be read and understood by only a few thousand trained experts in Japan. With the rapid development of deep learning, researchers have begun applying Optical Character Recognition (OCR) techniques to transcribe Kuzushiji into modern Japanese. Although existing OCR methods perform well on clean pre-modern Japanese documents written in Kuzushiji, they often fail to consider various types of noise, such as document degradation and seals, which significantly affect recognition accuracy. To the best of our knowledge, no existing dataset specifically addresses these challenges. To address this gap, we introduce the Degraded Kuzushiji Documents with Seals (DKDS) dataset as a new benchmark for related tasks. We describe the dataset construction process, which required the assistance of a trained Kuzushiji expert, and define two benchmark tracks: (1) text and seal detection and (2) document binarization. For the text and seal detection track, we provide baseline results using multiple versions of the You Only Look Once (YOLO) models for detecting Kuzushiji characters and seals. For the document binarization track, we present baseline results from traditional binarization algorithms, traditional algorithms combined with K-means clustering, and Generative Adversarial Network (GAN)-based methods. The DKDS dataset and the implementation code for baseline methods are available at this https URL.
zh

[CV-28] Ultra-Light Test-Time Adaptation for Vision–Language Models

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在域偏移(domain shift)场景下出现的特征漂移(feature drift)、类别先验不匹配(class-prior mismatch)以及严重校准不足(miscalibration)的问题。现有测试时自适应(Test-Time Adaptation, TTA)方法通常依赖于对大型骨干网络的反向传播、协方差估计或高内存开销,难以适用于流式处理和边缘计算场景。其解决方案的关键在于提出一种完全免训练且无需反向传播的超轻量级测试时自适应框架(Ultra-Light Test-Time Adaptation, UL-TTA),该框架仅冻结模型骨干网络,仅优化 logits 层参数——包括类别原型(class prototypes)、类别先验(class priors)和温度(temperature)。UL-TTA 采用在线 EM 风格更新机制,结合置信度筛选、基于文本和 Dirichlet 先验的闭式贝叶斯更新、预测与校准解耦温度,并引入轻量级保护机制(如范数裁剪、先验 KL 约束和平滑温度)以防止长序列中的性能退化,从而在保持极低延迟(<8% 开销)的同时显著提升准确率与校准性能(ECE 降低 20–30%)。

链接: https://arxiv.org/abs/2511.09101
作者: Byunghyun Kim
机构: Kyungpook National University (庆北国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages

点击查看摘要

Abstract:Vision-Language Models (VLMs) such as CLIP achieve strong zero-shot recognition by comparing image embeddings to text-derived class prototypes. However, under domain shift, they suffer from feature drift, class-prior mismatch, and severe miscalibration. Existing test-time adaptation (TTA) methods often require backpropagation through large backbones, covariance estimation, or heavy memory/state, which is problematic for streaming and edge scenarios. We propose Ultra-Light Test-Time Adaptation (UL-TTA), a fully training-free and backprop-free framework that freezes the backbone and adapts only logit-level parameters: class prototypes, class priors, and temperature. UL-TTA performs an online EM-style procedure with (i) selective sample filtering to use only confident predictions, (ii) closed-form Bayesian updates for prototypes and priors anchored by text and Dirichlet priors, (iii) decoupled temperatures for prediction vs. calibration, and (iv) lightweight guards (norm clipping, prior KL constraints, smoothed temperature) to prevent drift in long streams. Across large-scale cross-domain and OOD benchmarks (PACS, Office-Home, DomainNet, Terra Incognita, ImageNet-R/A/V2/Sketch; ~726K test samples) and strong TTA baselines including Tent, T3A, CoTTA, SAR, Tip-Adapter, and FreeTTA, UL-TTA consistently improves top-1 accuracy (e.g., +4.7 points over zero-shot CLIP on average) while reducing ECE by 20-30%, with less than 8% latency overhead. Long-stream experiments up to 200K samples show no collapse. Our results demonstrate that logit-level Bayesian adaptation is sufficient to obtain state-of-the-art accuracy-calibration trade-offs for VLMs under domain shift, without updating any backbone parameters.
zh

[CV-29] Composition-Incremental Learning for Compositional Generalization

【速读】:该论文旨在解决真实世界中持续涌现的、组合无限且长尾分布的数据场景下,模型如何逐步提升组合泛化能力(compositional generalization)的问题。现有方法多基于预收集的训练数据,难以适应动态变化的组合模式。为实现组合增量学习(Composition-Incremental Learning, CompIL),论文提出一种伪回放框架:其关键在于利用视觉合成器(visual synthesizer)生成已学组合的视觉表征,并通过语言基元蒸馏机制(linguistic primitive distillation mechanism)保持语义基元表示的一致性,从而在不遗忘旧知识的前提下持续学习新组合,有效提升模型的组合泛化性能。

链接: https://arxiv.org/abs/2511.09082
作者: Zhen Li,Yuwei Wu,Chenchen Jing,Che Sun,Chuanhao Li,Yunde Jia
机构: 1. Tsinghua University (清华大学); 2. Alibaba Group (阿里巴巴集团); 3. Chinese Academy of Sciences (中国科学院); 4. Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 6 figures

点击查看摘要

Abstract:Compositional generalization has achieved substantial progress in computer vision on pre-collected training data. Nonetheless, real-world data continually emerges, with possible compositions being nearly infinite, long-tailed, and not entirely visible. Thus, an ideal model is supposed to gradually improve the capability of compositional generalization in an incremental manner. In this paper, we explore Composition-Incremental Learning for Compositional Generalization (CompIL) in the context of the compositional zero-shot learning (CZSL) task, where models need to continually learn new compositions, intending to improve their compositional generalization capability progressively. To quantitatively evaluate CompIL, we develop a benchmark construction pipeline leveraging existing datasets, yielding MIT-States-CompIL and C-GQA-CompIL. Furthermore, we propose a pseudo-replay framework utilizing a visual synthesizer to synthesize visual representations of learned compositions and a linguistic primitive distillation mechanism to maintain aligned primitive representations across the learning process. Extensive experiments demonstrate the effectiveness of the proposed framework.
zh

[CV-30] SMF-VO: Direct Ego-Motion Estimation via Sparse Motion Fields

【速读】:该论文旨在解决传统视觉里程计(Visual Odometry, VO)和视觉惯性里程计(Visual Inertial Odometry, VIO)方法因依赖“以位姿为中心”(pose-centric)范式而带来的计算复杂度高、难以在资源受限设备上实现实时运行的问题。这类方法需维护大规模特征点地图并持续优化,导致性能瓶颈。其解决方案的关键在于提出一种轻量级的“以运动为中心”(motion-centric)框架——稀疏运动场视觉里程计(Sparse Motion Field Visual Odometry, SMF-VO),该方法直接从稀疏光流中估计瞬时线速度与角速度,无需显式位姿估计或昂贵的特征跟踪;同时引入广义的基于3D射线的运动场建模方法,适用于多种相机模型(包括广角镜头),从而在保证精度的同时显著提升效率,在仅使用CPU的Raspberry Pi 5上实现超过100 FPS的实时性能。

链接: https://arxiv.org/abs/2511.09072
作者: Sangheon Yang,Yeongin Yoon,Hong Mo Jung,Jongwoo Lim
机构: Hanyang University (汉阳大学); Seoul National University (首尔国立大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Traditional Visual Odometry (VO) and Visual Inertial Odometry (VIO) methods rely on a ‘pose-centric’ paradigm, which computes absolute camera poses from the local map thus requires large-scale landmark maintenance and continuous map optimization. This approach is computationally expensive, limiting their real-time performance on resource-constrained devices. To overcome these limitations, we introduce Sparse Motion Field Visual Odometry (SMF-VO), a lightweight, ‘motion-centric’ framework. Our approach directly estimates instantaneous linear and angular velocity from sparse optical flow, bypassing the need for explicit pose estimation or expensive landmark tracking. We also employed a generalized 3D ray-based motion field formulation that works accurately with various camera models, including wide-field-of-view lenses. SMF-VO demonstrates superior efficiency and competitive accuracy on benchmark datasets, achieving over 100 FPS on a Raspberry Pi 5 using only a CPU. Our work establishes a scalable and efficient alternative to conventional methods, making it highly suitable for mobile robotics and wearable devices.
zh

[CV-31] Diversifying Counterattacks: Orthogonal Exploration for Robust CLIP Inference AAAI-2026

【速读】:该论文旨在解决视觉-语言预训练模型(Vision-Language Pre-training Models, VLPs)在测试时对对抗样本(adversarial examples)敏感的问题,即现有测试时防御方法(如Test-Time Counterattack, TTC)因优化目标与攻击不一致而导致的泛化能力不足和对抗扰动多样性缺失问题。解决方案的关键在于提出方向正交对抗攻击(Directional Orthogonal Counterattack, DOC),通过引入正交梯度方向和基于动量的更新机制,扩展对抗扰动的搜索空间并提升扰动多样性;同时设计基于平均余弦相似度的方向敏感性评分,以增强样本判别能力并自适应调节对抗强度,从而实现更鲁棒且通用的测试时防御效果。

链接: https://arxiv.org/abs/2511.09064
作者: Chengze Jiang,Minjing Dong,Xinli Shi,Jie Gui
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI-2026 Oral

点击查看摘要

Abstract:Vision-language pre-training models (VLPs) demonstrate strong multimodal understanding and zero-shot generalization, yet remain vulnerable to adversarial examples, raising concerns about their reliability. Recent work, Test-Time Counterattack (TTC), improves robustness by generating perturbations that maximize the embedding deviation of adversarial inputs using PGD, pushing them away from their adversarial representations. However, due to the fundamental difference in optimization objectives between adversarial attacks and counterattacks, generating counterattacks solely based on gradients with respect to the adversarial input confines the search to a narrow space. As a result, the counterattacks could overfit limited adversarial patterns and lack the diversity to fully neutralize a broad range of perturbations. In this work, we argue that enhancing the diversity and coverage of counterattacks is crucial to improving adversarial robustness in test-time defense. Accordingly, we propose Directional Orthogonal Counterattack (DOC), which augments counterattack optimization by incorporating orthogonal gradient directions and momentum-based updates. This design expands the exploration of the counterattack space and increases the diversity of perturbations, which facilitates the discovery of more generalizable counterattacks and ultimately improves the ability to neutralize adversarial perturbations. Meanwhile, we present a directional sensitivity score based on averaged cosine similarity to boost DOC by improving example discrimination and adaptively modulating the counterattack strength. Extensive experiments on 16 datasets demonstrate that DOC improves adversarial robustness under various attacks while maintaining competitive clean accuracy. Code is available at this https URL.
zh

[CV-32] VietMEAgent : Culturally-Aware Few-Shot Multimodal Explanation for Vietnamese Visual Question Answering

【速读】:该论文旨在解决当前视觉问答(Visual Question Answering, VQA)系统在处理越南文化相关内容时表现受限的问题,其根源在于训练数据中文化知识代表性不足,且推理过程缺乏可解释性。解决方案的关键在于提出 VietMEAgent 框架,该框架通过集成文化对象检测主干网络与结构化程序生成层,实现答案预测与解释的紧密耦合;同时引入一个精心构建的越南文化实体知识库作为显式背景信息源,并设计双模态解释模块,融合基于注意力机制的视觉证据与结构化的、人类可读的文本推理过程,从而在保障文化敏感性的同时提供透明、可理解的解释,支持教育与文化传承应用。

链接: https://arxiv.org/abs/2511.09058
作者: Hai-Dang Nguyen,Minh-Anh Dang,Minh-Tan Le,Minh-Tuan Le
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 3 figures, 3 tables, FAIR 2025 conference

点击查看摘要

Abstract:Contemporary Visual Question Answering (VQA) systems remain constrained when confronted with culturally specific content, largely because cultural knowledge is under-represented in training corpora and the reasoning process is not rendered interpretable to end users. This paper introduces VietMEAgent, a multimodal explainable framework engineered for Vietnamese cultural understanding. The method integrates a cultural object detection backbone with a structured program generation layer, yielding a pipeline in which answer prediction and explanation are tightly coupled. A curated knowledge base of Vietnamese cultural entities serves as an explicit source of background information, while a dual-modality explanation module combines attention-based visual evidence with structured, human-readable textual rationales. We further construct a Vietnamese Cultural VQA dataset sourced from public repositories and use it to demonstrate the practicality of programming-based methodologies for cultural AI. The resulting system provides transparent explanations that disclose both the computational rationale and the underlying cultural context, supporting education and cultural preservation with an emphasis on interpretability and cultural sensitivity.
zh

[CV-33] 4KDehazeFlow: Ultra-High-Definition Image Dehazing via Flow Matching

【速读】:该论文旨在解决超高清(Ultra-High-Definition, UHD)图像去雾中现有方法存在的两大问题:基于先验的方法场景适应性差,而深度学习方法则存在计算复杂度高且易引入色彩失真的缺陷。其解决方案的关键在于提出一种基于流匹配(Flow Matching)与雾霾感知向量场(Haze-Aware vector field)的新方法——4KDehazeFlow。该方法将去雾过程建模为连续向量场流的渐进优化,通过可学习的三维查找表(3D Lookup Table, LUT)编码雾霾变换参数,实现高效推理;并采用四阶龙格-库塔(Runge-Kutta, RK4)常微分方程(ODE)求解器稳定迭代求解去雾流场,有效抑制伪影,从而在保持高色彩保真度的同时显著提升去雾质量与效率。

链接: https://arxiv.org/abs/2511.09055
作者: Xingchi Chen,Pu Wang,Xuerui Li,Chaopeng Li,Juxiang Zhou,Jianhou Gan,Dianjie Lu,Guijuan Zhang,Wenqi Ren,Zhuoran Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ultra-High-Definition (UHD) image dehazing faces challenges such as limited scene adaptability in prior-based methods and high computational complexity with color distortion in deep learning approaches. To address these issues, we propose 4KDehazeFlow, a novel method based on Flow Matching and the Haze-Aware vector field. This method models the dehazing process as a progressive optimization of continuous vector field flow, providing efficient data-driven adaptive nonlinear color transformation for high-quality dehazing. Specifically, our method has the following advantages: 1) 4KDehazeFlow is a general method compatible with various deep learning networks, without relying on any specific network architecture. 2) We propose a learnable 3D lookup table (LUT) that encodes haze transformation parameters into a compact 3D mapping matrix, enabling efficient inference through precomputed mappings. 3) We utilize a fourth-order Runge-Kutta (RK4) ordinary differential equation (ODE) solver to stably solve the dehazing flow field through an accurate step-by-step iterative method, effectively suppressing artifacts. Extensive experiments show that 4KDehazeFlow exceeds seven state-of-the-art methods. It delivers a 2dB PSNR increase and better performance in dense haze and color fidelity.
zh

[CV-34] USF-Net: A Unified Spatiotemporal Fusion Network for Ground-Based Remote Sensing Cloud Image Sequence Extrapolation

【速读】:该论文旨在解决地面遥感云图像序列外推(ground-based remote sensing cloud image sequence extrapolation)中的三大挑战:(1) 现有方法依赖静态卷积核,缺乏动态多尺度特征提取能力;(2) 时间引导不足,难以建模长程时空依赖关系;(3) 注意力机制的二次计算复杂度被忽视,限制了实际部署效率。解决方案的关键在于提出一种统一时空融合网络(Unified Spatiotemporal Fusion Network, USF-Net),其核心创新包括:(1) 引入自适应大卷积核(adaptive large-kernel convolutions)与轻量级注意力机制(low-complexity attention mechanism)以实现动态特征增强和高效时序建模;(2) 设计时空融合模块(USTM),包含具有状态空间模型(SSM)的多尺度信息捕捉单元(SiB)和具备时序注意力模块(TAM)的长期时序依赖建模单元(TiB);(3) 提出统一的时间引导时空依赖建模机制(DSM with temporal guidance module, TGM),并引入去伪影模块(DUM)缓解“鬼影效应”,从而在预测精度与计算效率之间取得更优平衡。

链接: https://arxiv.org/abs/2511.09045
作者: Penghui Niu,Taotao Cai,Jiashuai She,Yajuan Zhang,Junhua Gua,Ping Zhanga,Jungong Hane,Jianxin Li
机构: Hebei University of Technology (河北工业大学); University of Southern Queensland (南昆士兰大学); Edith Cowan University (埃迪斯科文大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ground-based remote sensing cloud image sequence extrapolation is a key research area in the development of photovoltaic power systems. However, existing approaches exhibit several limitations:(1)they primarily rely on static kernels to augment feature information, lacking adaptive mechanisms to extract features at varying resolutions dynamically;(2)temporal guidance is insufficient, leading to suboptimal modeling of long-range spatiotemporal dependencies; and(3)the quadratic computational cost of attention mechanisms is often overlooked, limiting efficiency in practical deployment. To address these challenges, we propose USF-Net, a Unified Spatiotemporal Fusion Network that integrates adaptive large-kernel convolutions and a low-complexity attention mechanism, combining temporal flow information within an encoder-decoder framework. Specifically, the encoder employs three basic layers to extract features. Followed by the USTM, which comprises:(1)a SiB equipped with a SSM that dynamically captures multi-scale contextual information, and(2)a TiB featuring a TAM that effectively models long-range temporal dependencies while maintaining computational efficiency. In addition, a DSM with a TGM is introduced to enable unified modeling of temporally guided spatiotemporal dependencies. On the decoder side, a DUM is employed to address the common “ghosting effect.” It utilizes the initial temporal state as an attention operator to preserve critical motion signatures. As a key contribution, we also introduce and release the ASI-CIS dataset. Extensive experiments on ASI-CIS demonstrate that USF-Net significantly outperforms state-of-the-art methods, establishing a superior balance between prediction accuracy and computational efficiency for ground-based cloud extrapolation. The dataset and source code will be available at this https URL.
zh

[CV-35] Dense Cross-Scale Image Alignment With Fully Spatial Correlation and Just Noticeable Difference Guidance

【速读】:该论文旨在解决现有无监督图像对齐方法在准确性和计算复杂度方面的局限性。其核心解决方案是提出一种密集的跨尺度图像对齐模型,通过建模跨尺度特征间的相关性来降低对齐难度,并引入全空间相关模块以提升精度同时保持低计算成本;此外,通过引入刚刚可察觉差异(Just Noticeable Difference, JND)机制,引导模型关注对失真更敏感的图像区域,从而有效避免明显的对齐误差。

链接: https://arxiv.org/abs/2511.09028
作者: Jinkun You,Jiaxue Li,Jie Zhang,Yicong Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing unsupervised image alignment methods exhibit limited accuracy and high computational complexity. To address these challenges, we propose a dense cross-scale image alignment model. It takes into account the correlations between cross-scale features to decrease the alignment difficulty. Our model supports flexible trade-offs between accuracy and efficiency by adjusting the number of scales utilized. Additionally, we introduce a fully spatial correlation module to further improve accuracy while maintaining low computational costs. We incorporate the just noticeable difference to encourage our model to focus on image regions more sensitive to distortions, eliminating noticeable alignment errors. Extensive quantitative and qualitative experiments demonstrate that our method surpasses state-of-the-art approaches.
zh

[CV-36] Causally-Grounded Dual-Path Attention Intervention for Object Hallucination Mitigation in LVLMs AAAI2026

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中存在的图像幻觉(object hallucination)问题,即模型生成的内容与输入图像不一致的现象。现有基于语言解码器的缓解方法通常独立调控视觉或文本注意力机制,忽略了二者作为两个关键因果因素之间的交互关系。其解决方案的核心在于提出一个基于因果建模的框架Owl(Bi-mOdal attention reWeighting for Layer-wise hallucination mitigation),通过结构化因果图将分解后的视觉和文本注意力视为中介变量,并引入VTACR(Visual-to-Textual Attention Contribution Ratio)这一新指标量化解码过程中模态贡献失衡情况;研究发现幻觉多发生在低VTACR场景下,此时文本先验主导而视觉 grounding 被削弱。为此,作者设计了一种细粒度注意力干预机制,依据VTACR信号动态调整token级和层级注意力;进一步提出双路径对比解码策略:一条路径强化视觉引导预测,另一条路径放大幻觉内容——使视觉真实性凸显、幻觉自然坍缩。实验表明,Owl在POPE和CHAIR基准上显著降低幻觉率,在保持视觉语言理解能力的同时实现了忠实性(faithfulness)的新SOTA。

链接: https://arxiv.org/abs/2511.09018
作者: Liu Yu,Zhonghao Chen,Ping Kuang,Zhikun Feng,Fan Zhou,Lan Wang,Gillian Dobbie
机构: University of Auckland (奥克兰大学); China Scholarship Council (国家留学基金委)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, published to AAAI 2026

点击查看摘要

Abstract:Object hallucination remains a critical challenge in Large Vision-Language Models (LVLMs), where models generate content inconsistent with visual inputs. Existing language-decoder based mitigation approaches often regulate visual or textual attention independently, overlooking their interaction as two key causal factors. To address this, we propose Owl (Bi-mOdal attention reWeighting for Layer-wise hallucination mitigation), a causally-grounded framework that models hallucination process via a structural causal graph, treating decomposed visual and textual attentions as mediators. We introduce VTACR (Visual-to-Textual Attention Contribution Ratio), a novel metric that quantifies the modality contribution imbalance during decoding. Our analysis reveals that hallucinations frequently occur in low-VTACR scenarios, where textual priors dominate and visual grounding is weakened. To mitigate this, we design a fine-grained attention intervention mechanism that dynamically adjusts token- and layer-wise attention guided by VTACR signals. Finally, we propose a dual-path contrastive decoding strategy: one path emphasizes visually grounded predictions, while the other amplifies hallucinated ones – letting visual truth shine and hallucination collapse. Experimental results on the POPE and CHAIR benchmarks show that Owl achieves significant hallucination reduction, setting a new SOTA in faithfulness while preserving vision-language understanding capability. Our code is available at this https URL
zh

[CV-37] UniMM-V2X: MoE-Enhanced Multi-Level Fusion for End-to-End Cooperative Autonomous Driving

【速读】:该论文旨在解决当前自动驾驶系统在多智能体协作中感知、预测与规划模块之间缺乏协同一致性的问题,尤其是现有方法往往仅关注感知层面的协作,而忽视了与下游决策规划任务的对齐,且未能充分利用端到端自动驾驶模型的潜力。其解决方案的关键在于提出UniMM-V2X框架,通过多层次融合策略统一感知与预测阶段的协作机制,使多个智能体能够共享查询并协同推理以实现一致且安全的决策;同时引入Mixture-of-Experts(MoE)架构动态增强BEV表示,并进一步将MoE扩展至解码器以更好地捕捉多样化运动模式,从而显著提升整体性能,在DAIR-V2X数据集上实现了感知准确率提升39.7%、预测误差降低7.2%、规划性能提高33.2%。

链接: https://arxiv.org/abs/2511.09013
作者: Ziyi Song,Chen Xia,Chenbing Wang,Haibao Yu,Sheng Zhou,Zhisheng Niu
机构: 1. Tsinghua University (清华大学); 2. China Unicom (中国联通); 3. Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Autonomous driving holds transformative potential but remains fundamentally constrained by the limited perception and isolated decision-making with standalone intelligence. While recent multi-agent approaches introduce cooperation, they often focus merely on perception-level tasks, overlooking the alignment with downstream planning and control, or fall short in leveraging the full capacity of the recent emerging end-to-end autonomous driving. In this paper, we present UniMM-V2X, a novel end-to-end multi-agent framework that enables hierarchical cooperation across perception, prediction, and planning. At the core of our framework is a multi-level fusion strategy that unifies perception and prediction cooperation, allowing agents to share queries and reason cooperatively for consistent and safe decision-making. To adapt to diverse downstream tasks and further enhance the quality of multi-level fusion, we incorporate a Mixture-of-Experts (MoE) architecture to dynamically enhance the BEV representations. We further extend MoE into the decoder to better capture diverse motion patterns. Extensive experiments on the DAIR-V2X dataset demonstrate our approach achieves state-of-the-art (SOTA) performance with a 39.7% improvement in perception accuracy, a 7.2% reduction in prediction error, and a 33.2% improvement in planning performance compared with UniV2X, showcasing the strength of our MoE-enhanced multi-level cooperative paradigm.
zh

[CV-38] -Rex-Omni: Integrating Negative Visual Prompt in Generic Object Detection AAAI2026

【速读】:该论文旨在解决当前开放集目标检测(open-set object detection)方法在面对视觉相似但语义不同的干扰项时表现脆弱的问题,其根源在于现有方法仅依赖正样本提示(positive indicators),如文本描述或视觉示例,缺乏对负样本的建模能力。解决方案的关键在于引入负向视觉提示(negative visual prompts)以抑制硬负样本干扰,核心创新包括:1)提出统一的视觉提示编码器,联合处理正负视觉提示;2)设计无需训练的“否定负样本计算”(Negating Negative Computing, NNC)模块,在概率计算阶段动态抑制负响应;3)引入“否定负样本铰链损失”(Negating Negative Hinge, NNH)用于微调,强化正负嵌入间的判别边界。该框架支持正样本独用或正负样本联合推理模式,显著提升零样本检测性能,尤其在长尾分布场景下优势明显(LVIS-minival上达51.2 AP_r)。

链接: https://arxiv.org/abs/2511.08997
作者: Jiazhou Zhou,Qing Jiang,Kanghao Chen,Lutao Jiang,Yuanhuiyi Lyu,Ying-Cong Chen,Lei Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026. Main paper: 7 pages with 4 figures; Appendix: 8 pages with 7 figures

点击查看摘要

Abstract:Object detection methods have evolved from closed-set to open-set paradigms over the years. Current open-set object detectors, however, remain constrained by their exclusive reliance on positive indicators based on given prompts like text descriptions or visual exemplars. This positive-only paradigm experiences consistent vulnerability to visually similar but semantically different distractors. We propose T-Rex-Omni, a novel framework that addresses this limitation by incorporating negative visual prompts to negate hard negative distractors. Specifically, we first introduce a unified visual prompt encoder that jointly processes positive and negative visual prompts. Next, a training-free Negating Negative Computing (NNC) module is proposed to dynamically suppress negative responses during the probability computing stage. To further boost performance through fine-tuning, our Negating Negative Hinge (NNH) loss enforces discriminative margins between positive and negative embeddings. T-Rex-Omni supports flexible deployment in both positive-only and joint positive-negative inference modes, accommodating either user-specified or automatically generated negative examples. Extensive experiments demonstrate remarkable zero-shot detection performance, significantly narrowing the performance gap between visual-prompted and text-prompted methods while showing particular strength in long-tailed scenarios (51.2 AP_r on LVIS-minival). This work establishes negative prompts as a crucial new dimension for advancing open-set visual recognition systems.
zh

[CV-39] Fast k-means clustering in Riemannian manifolds via Fréchet maps: Applications to large-dimensional SPD matrices

【速读】:该论文旨在解决高维非欧几里得流形(non-Euclidean manifold)上数据聚类的计算效率问题,尤其是针对对称正定矩阵流形 SPD(n)\mathit{SPD}(n) 的应用场景。传统内在方法在处理此类数据时面临显著的计算复杂度瓶颈,难以扩展至大规模数据集。解决方案的关键在于提出一种基于 pp-Fréchet 映射 Fp:MRF^p: \mathcal{M} \to \mathbb{R}^\ell 的嵌入框架,该映射通过一组参考点将流形数据高效投影到低维欧氏空间 R\mathbb{R}^\ell 中,从而使得标准欧氏聚类算法(如 k-means)能够被直接、准确地应用。理论分析与大量数值实验表明,该方法在保持高聚类精度的同时,相较于传统内在方法可实现高达两个数量级的运行时间加速,尤其在现有替代方法失效或表现不佳的情况下仍能稳定工作。

链接: https://arxiv.org/abs/2511.08993
作者: Ji Shi,Nicolas Charon,Andreas Mang,Demetrio Labate,Robert Azencott
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Differential Geometry (math.DG)
备注: 32 pages, 5 figures, 5 tables

点击查看摘要

Abstract:We introduce a novel, efficient framework for clustering data on high-dimensional, non-Euclidean manifolds that overcomes the computational challenges associated with standard intrinsic methods. The key innovation is the use of the p -Fréchet map F^p : \mathcalM \to \mathbbR^\ell – defined on a generic metric space \mathcalM – which embeds the manifold data into a lower-dimensional Euclidean space \mathbbR^\ell using a set of reference points \r_i_i=1^\ell , r_i \in \mathcalM . Once embedded, we can efficiently and accurately apply standard Euclidean clustering techniques such as k-means. We rigorously analyze the mathematical properties of F^p in the Euclidean space and the challenging manifold of n \times n symmetric positive definite matrices \mathitSPD(n) . Extensive numerical experiments using synthetic and real \mathitSPD(n) data demonstrate significant performance gains: our method reduces runtime by up to two orders of magnitude compared to intrinsic manifold-based approaches, all while maintaining high clustering accuracy, including scenarios where existing alternative methods struggle or fail.
zh

[CV-40] An ICTM-RMSAV Framework for Bias-Field Aware Image Segmentation under Poisson and Multiplicative Noise

【速读】:该论文旨在解决图像分割在强噪声干扰和灰度不均匀性条件下性能显著下降的问题。其核心解决方案在于提出一种变分分割模型,该模型通过引入I散度(I-divergence)项与自适应总变差(adaptive total-variation, TV)正则化项来实现去噪,有效应对Gamma分布乘性噪声和泊松噪声;同时,利用基于灰度水平的区域自适应权重引导扩散过程,以区分不同强度区域,并通过估计平滑变化的偏置场(bias field)缓解强度不均匀性对分割精度的影响;此外,采用特征函数表示区域并编码轮廓长度,结合松弛修改的标量辅助变量(relaxed modified scalar auxiliary variable, RMSAV)优化策略,实现了高效计算与稳定收敛。

链接: https://arxiv.org/abs/2511.08988
作者: Xinyu Wang,Wenjun Yao,Fanghui Song,Zhichang Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Image segmentation is a core task in image processing, yet many methods degrade when images are heavily corrupted by noise and exhibit intensity inhomogeneity. Within the iterative-convolution thresholding method (ICTM) framework, we propose a variational segmentation model that integrates denoising terms. Specifically, the denoising component consists of an I-divergence term and an adaptive total-variation (TV) regularizer, making the model well suited to images contaminated by Gamma–distributed multiplicative noise and Poisson noise. A spatially adaptive weight derived from a gray-level indicator guides diffusion differently across regions of varying intensity. To further address intensity inhomogeneity, we estimate a smoothly varying bias field, which improves segmentation accuracy. Regions are represented by characteristic functions, with contour length encoded accordingly. For efficient optimization, we couple ICTM with a relaxed modified scalar auxiliary variable (RMSAV) scheme. Extensive experiments on synthetic and real-world images with intensity inhomogeneity and diverse noise types show that the proposed model achieves superior accuracy and robustness compared with competing approaches.
zh

[CV-41] WDT-MD: Wavelet Diffusion Transformers for Microaneurysm Detection in Fundus Images AAAI2026

【速读】:该论文旨在解决生成式AI在糖尿病视网膜病变(Diabetic Retinopathy, DR)早期病灶——微动脉瘤(Microaneurysms, MAs)自动检测中面临的三大挑战:一是模型易出现“身份映射”(identity mapping)现象,导致输出与输入图像高度相似;二是难以区分微动脉瘤与其他异常结构,造成假阳性率高;三是正常视网膜特征重建效果不佳,影响整体检测性能。解决方案的关键在于提出一种基于小波扩散Transformer的微动脉瘤检测框架(Wavelet Diffusion Transformer for MA Detection, WDT-MD),其核心创新包括:1)噪声编码图像条件机制,通过扰动训练阶段的图像条件避免身份映射;2)基于图像修复(inpainting)的伪正常模式合成方法,引入像素级监督以增强对微动脉瘤与其他异常的区分能力;3)融合小波分析与扩散Transformer架构,利用多尺度小波变换提升对正常视网膜结构的重建精度,从而显著改善像素级和图像级检测性能。

链接: https://arxiv.org/abs/2511.08987
作者: Yifei Sun,Yuzhi He,Junhao Jia,Jinhong Wang,Ruiquan Ge,Changmiao Wang,Hongxia Xu
机构: 1. Tsinghua University (清华大学); 2. Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); 3. Peking University (北京大学); 4. Shanghai Jiao Tong University (上海交通大学); 5. Microsoft Research (微软研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 6 figures, 8 tables, accepted by AAAI 2026

点击查看摘要

Abstract:Microaneurysms (MAs), the earliest pathognomonic signs of Diabetic Retinopathy (DR), present as sub-60 \mu m lesions in fundus images with highly variable photometric and morphological characteristics, rendering manual screening not only labor-intensive but inherently error-prone. While diffusion-based anomaly detection has emerged as a promising approach for automated MA screening, its clinical application is hindered by three fundamental limitations. First, these models often fall prey to “identity mapping”, where they inadvertently replicate the input image. Second, they struggle to distinguish MAs from other anomalies, leading to high false positives. Third, their suboptimal reconstruction of normal features hampers overall performance. To address these challenges, we propose a Wavelet Diffusion Transformer framework for MA Detection (WDT-MD), which features three key innovations: a noise-encoded image conditioning mechanism to avoid “identity mapping” by perturbing image conditions during training; pseudo-normal pattern synthesis via inpainting to introduce pixel-level supervision, enabling discrimination between MAs and other anomalies; and a wavelet diffusion Transformer architecture that combines the global modeling capability of diffusion Transformers with multi-scale wavelet analysis to enhance reconstruction of normal retinal features. Comprehensive experiments on the IDRiD and e-ophtha MA datasets demonstrate that WDT-MD outperforms state-of-the-art methods in both pixel-level and image-level MA detection. This advancement holds significant promise for improving early DR screening.
zh

[CV-42] A Finite Difference Approximation of Second Order Regularization of Neural-SDFs SIGGRAPH

【速读】:该论文旨在解决神经符号距离场(Signed Distance Field, SDF)学习中曲率正则化计算效率低的问题。现有方法依赖二阶自动微分(automatic differentiation)获取完整的海森矩阵(Hessian)信息以施加曲率先验,虽精度高但计算开销大;另一些方法虽避免了显式海森矩阵组装,但仍需高阶微分操作。论文提出的关键解决方案是采用轻量级有限差分模板(finite-difference stencils),通过泰勒展开近似二阶导数,截断误差为O(h²),可直接替代高斯曲率损失和秩不足损失(rank-deficiency losses)。该方法在保持重建保真度的同时,将GPU内存占用和训练时间减少达两倍,且在稀疏、不完整及非CAD数据上表现出鲁棒性和通用性,为曲率感知的SDF学习提供了高效可扩展的新范式。

链接: https://arxiv.org/abs/2511.08980
作者: Haotian Yin,Aleksander Plocharski,Michal Jan Wlodarczyk,Przemyslaw Musialski
机构: New Jersey Institute of Technology (新泽西理工学院); Warsaw University of Technology (华沙理工大学); IDEAS NCBR (IDEAS 国家科技中心)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: SIGGRAPH Asia Technical Communications, 6 pages, 6 figures, preprint

点击查看摘要

Abstract:We introduce a finite-difference framework for curvature regularization in neural signed distance field (SDF) learning. Existing approaches enforce curvature priors using full Hessian information obtained via second-order automatic differentiation, which is accurate but computationally expensive. Others reduced this overhead by avoiding explicit Hessian assembly, but still required higher-order differentiation. In contrast, our method replaces these operations with lightweight finite-difference stencils that approximate second derivatives using the well known Taylor expansion with a truncation error of O(h^2), and can serve as drop-in replacements for Gaussian curvature and rank-deficiency losses. Experiments demonstrate that our finite-difference variants achieve reconstruction fidelity comparable to their automatic-differentiation counterparts, while reducing GPU memory usage and training time by up to a factor of two. Additional tests on sparse, incomplete, and non-CAD data confirm that the proposed formulation is robust and general, offering an efficient and scalable alternative for curvature-aware SDF learning.
zh

[CV-43] Spatio-Temporal Data Enhanced Vision-Language Model for Traffic Scene Understanding

【速读】:该论文旨在解决交通场景理解(Traffic Scene Understanding, TSU)任务中因忽视时空信息与视觉-文本数据间关联性而导致的性能瓶颈问题。现有研究常将TSU简化为普通图像理解任务,忽略了时空数据对场景语义解析的关键作用。其解决方案的核心在于提出一种基于CLIP(Contrastive Language–Image Pretraining)架构的时空增强模型ST-CLIP,通过设计“时空上下文感知多视角提示学习方法”(SpatioTemporal Context Aware Multiaspect Prompt, SCAMP),将动态提取的时空上下文表示向量嵌入到CLIP模型的提示词嵌入空间中,并结合低层视觉特征与高层语义特征以挖掘交通场景各要素间的交互关系,从而实现对复杂场景的精准理解。

链接: https://arxiv.org/abs/2511.08978
作者: Jingtian Ma,Jingyuan Wang,Wayne Xin Zhao,Guoping Liu,Xiang Wen
机构: Beihang University (北京航空航天大学); Renmin University of China (中国人民大学); DiDi Global Inc. (滴滴全球公司)
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Nowadays, navigation and ride-sharing apps have collected numerous images with spatio-temporal data. A core technology for analyzing such images, associated with spatiotemporal information, is Traffic Scene Understanding (TSU), which aims to provide a comprehensive description of the traffic scene. Unlike traditional spatio-temporal data analysis tasks, the dependence on both spatio-temporal and visual-textual data introduces distinct challenges to TSU task. However, recent research often treats TSU as a common image understanding task, ignoring the spatio-temporal information and overlooking the interrelations between different aspects of the traffic scene. To address these issues, we propose a novel SpatioTemporal Enhanced Model based on CILP (ST-CLIP) for TSU. Our model uses the classic vision-language model, CLIP, as the backbone, and designs a Spatio-temporal Context Aware Multiaspect Prompt (SCAMP) learning method to incorporate spatiotemporal information into TSU. The prompt learning method consists of two components: A dynamic spatio-temporal context representation module that extracts representation vectors of spatio-temporal data for each traffic scene image, and a bi-level ST-aware multi-aspect prompt learning module that integrates the ST-context representation vectors into word embeddings of prompts for the CLIP model. The second module also extracts low-level visual features and image-wise high-level semantic features to exploit interactive relations among different aspects of traffic scenes. To the best of our knowledge, this is the first attempt to integrate spatio-temporal information into visionlanguage models to facilitate TSU task. Experiments on two realworld datasets demonstrate superior performance in the complex scene understanding scenarios with a few-shot learning strategy.
zh

[CV-44] Efficient and Effective In-context Demonstration Selection with Coreset AAAI26

【速读】:该论文旨在解决大视觉语言模型(Large Visual Language Models, LVLMs)在上下文学习(In-context Learning, ICL)中演示样本(demonstration)选择效率与效果难以兼顾的问题。现有方法如随机采样、基于相似性的采样或Infoscore采样往往因计算复杂度高或信息利用不充分而表现不佳。其解决方案的关键在于提出一种名为基于核心集的双重检索(Coreset-based Dual Retrieval, CoDR)的新框架:首先通过聚类剪枝方法构建一个具有多样性的核心子集(coreset),以提升预期互信息;其次设计双重检索机制,在保证全局最优演示选择的同时维持高效性,从而实现更鲁棒且高效的演示样本筛选。

链接: https://arxiv.org/abs/2511.08977
作者: Zihua Wang,Jiarui Wang,Haiyang Xu,Ming Yan,Fei Huang,Xu Yang,Xiu-Shen Wei,Siya Mi,Yu Zhang
机构: 1. Tsinghua University (清华大学); 2. Alibaba Group (阿里巴巴集团); 3. Peking University (北京大学); 4. Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper is accepted by AAAI26

点击查看摘要

Abstract:In-context learning (ICL) has emerged as a powerful paradigm for Large Visual Language Models (LVLMs), enabling them to leverage a few examples directly from input contexts. However, the effectiveness of this approach is heavily reliant on the selection of demonstrations, a process that is NP-hard. Traditional strategies, including random, similarity-based sampling and infoscore-based sampling, often lead to inefficiencies or suboptimal performance, struggling to balance both efficiency and effectiveness in demonstration selection. In this paper, we propose a novel demonstration selection framework named Coreset-based Dual Retrieval (CoDR). We show that samples within a diverse subset achieve a higher expected mutual information. To implement this, we introduce a cluster-pruning method to construct a diverse coreset that aligns more effectively with the query while maintaining diversity. Additionally, we develop a dual retrieval mechanism that enhances the selection process by achieving global demonstration selection while preserving efficiency. Experimental results demonstrate that our method significantly improves the ICL performance compared to the existing strategies, providing a robust solution for effective and efficient demonstration selection.
zh

[CV-45] Plug-and-Play Clarifier: A Zero-Shot Multimodal Framework for Egocentric Intent Disambiguation AAAI2026

【速读】:该论文旨在解决第一人称视角人工智能代理(egocentric AI agents)在执行任务时因多模态意图模糊性(multimodal intent ambiguity)而导致的性能瓶颈问题,具体表现为语言指令不明确、视觉数据不完整以及指示性手势(deictic gestures)难以解析,从而引发任务失败或生成错误响应。解决方案的关键在于提出一种即插即用式澄清器(Plug-and-Play Clarifier)框架,其核心是将复杂的多模态歧义问题分解为三个协同工作的模块:(1) 文本澄清器通过对话驱动推理交互式消除语言意图歧义;(2) 视觉澄清器提供实时引导反馈,指导用户调整姿态以提升图像捕捉质量;(3) 跨模态澄清器结合定位机制,准确理解三维指向手势并识别目标对象。该框架实现了零样本部署,显著提升了小规模语言模型(4–8B参数)的意图澄清能力约30%,同时对大模型亦具增益效果,有效增强了第一人称场景下的用户交互体验。

链接: https://arxiv.org/abs/2511.08971
作者: Sicheng Yang,Yukai Huang,Weitong Cai,Shitong Sun,You He,Jiankang Deng,Hang Zhang,Jifei Song,Zhensong Zhang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 16 pages, 9 figures, AAAI 2026

点击查看摘要

Abstract:The performance of egocentric AI agents is fundamentally limited by multimodal intent ambiguity. This challenge arises from a combination of underspecified language, imperfect visual data, and deictic gestures, which frequently leads to task failure. Existing monolithic Vision-Language Models (VLMs) struggle to resolve these multimodal ambiguous inputs, often failing silently or hallucinating responses. To address these ambiguities, we introduce the Plug-and-Play Clarifier, a zero-shot and modular framework that decomposes the problem into discrete, solvable sub-tasks. Specifically, our framework consists of three synergistic modules: (1) a text clarifier that uses dialogue-driven reasoning to interactively disambiguate linguistic intent, (2) a vision clarifier that delivers real-time guidance feedback, instructing users to adjust their positioning for improved capture quality, and (3) a cross-modal clarifier with grounding mechanism that robustly interprets 3D pointing gestures and identifies the specific objects users are pointing to. Extensive experiments demonstrate that our framework improves the intent clarification performance of small language models (4–8B) by approximately 30%, making them competitive with significantly larger counterparts. We also observe consistent gains when applying our framework to these larger models. Furthermore, our vision clarifier increases corrective guidance accuracy by over 20%, and our cross-modal clarifier improves semantic answer accuracy for referential grounding by 5%. Overall, our method provides a plug-and-play framework that effectively resolves multimodal ambiguity and significantly enhances user experience in egocentric interaction.
zh

[CV-46] AuthSig: Safeguarding Scanned Signatures Against Unauthorized Reuse in Paperless Workflows

【速读】:该论文旨在解决静态电子签名(static electronic signature)在实际应用中因缺乏可验证性而易被恶意复制和重用的问题。当前虽然纸质签名正逐步向电子化过渡,但大量仍采用扫描图像形式的静态签名已丧失原始身份认证属性,难以保障安全性。为此,作者提出AuthSig框架,其核心在于通过生成式模型与水印技术相结合的方式,将认证信息隐式嵌入签名图像中,实现“一签一用”(One Signature, One Use)的防伪机制。关键创新点包括:利用人类视觉系统对细微风格变化不敏感的特性,在生成过程中微调风格嵌入以编码水印比特;同时引入基于关键点驱动的数据增强策略,提升手写签名数据的多样性,从而增强水印嵌入的鲁棒性,实验表明该方案在数字域失真及打印-扫描场景下均能保持超过98%的水印提取准确率。

链接: https://arxiv.org/abs/2511.08967
作者: RuiQiang Zhang,Zehua Ma,Guanjie Wang,Chang Liu,Hengyi Wang,Weiming Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the deepening trend of paperless workflows, signatures as a means of identity authentication are gradually shifting from traditional ink-on-paper to electronic this http URL the availability of dynamic pressure-sensitive and PKI-based digital signatures, static scanned signatures remain prevalent in practice due to their convenience. However, these static images, having almost lost their authentication attributes, cannot be reliably verified and are vulnerable to malicious copying and reuse. To address these issues, we propose AuthSig, a novel static electronic signature framework based on generative models and watermark, which binds authentication information to the signature image. Leveraging the human visual system’s insensitivity to subtle style variations, AuthSig finely modulates style embeddings during generation to implicitly encode watermark bits-enforcing a One Signature, One Use this http URL overcome the scarcity of handwritten signature data and the limitations of traditional augmentation methods, we introduce a keypoint-driven data augmentation strategy that effectively enhances style diversity to support robust watermark embedding. Experimental results show that AuthSig achieves over 98% extraction accuracy under both digital-domain distortions and signature-specific degradations, and remains effective even in print-scan scenarios.
zh

[CV-47] FGM-HD: Boosting Generation Diversity of Fractal Generative Models through Hausdorff Dimension Induction AAAI-26

【速读】:该论文旨在解决生成式图像模型中一个关键难题:在保持高视觉质量的前提下提升生成结果的多样性。现有分形生成模型(Fractal Generative Models, FGMs)虽能高效生成高质量图像,但其固有的自相似性限制了输出多样性。解决方案的关键在于引入豪斯多夫维数(Hausdorff Dimension, HD)作为结构复杂性的量化指标,并设计了一种可学习的HD估计方法,直接从图像嵌入中预测HD以降低计算开销。进一步地,作者提出基于HD的损失函数结合单调动量驱动调度策略,在训练阶段逐步优化超参数以实现多样性的提升而不牺牲图像质量;在推理阶段则采用HD引导的拒绝采样策略,筛选几何结构更丰富的样本。该方法首次将HD引入FGM框架,显著提升了生成多样性(ImageNet上提升39%),同时维持与原模型相当的视觉质量,为FGM的发展提供了理论支撑和实践路径。

链接: https://arxiv.org/abs/2511.08945
作者: Haowei Zhang,Yuanpei Zhao,Jizhe Zhou,Mao Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, AAAI-26

点击查看摘要

Abstract:Improving the diversity of generated results while maintaining high visual quality remains a significant challenge in image generation tasks. Fractal Generative Models (FGMs) are efficient in generating high-quality images, but their inherent self-similarity limits the diversity of output images. To address this issue, we propose a novel approach based on the Hausdorff Dimension (HD), a widely recognized concept in fractal geometry used to quantify structural complexity, which aids in enhancing the diversity of generated outputs. To incorporate HD into FGM, we propose a learnable HD estimation method that predicts HD directly from image embeddings, addressing computational cost concerns. However, simply introducing HD into a hybrid loss is insufficient to enhance diversity in FGMs due to: 1) degradation of image quality, and 2) limited improvement in generation diversity. To this end, during training, we adopt an HD-based loss with a monotonic momentum-driven scheduling strategy to progressively optimize the hyperparameters, obtaining optimal diversity without sacrificing visual quality. Moreover, during inference, we employ HD-guided rejection sampling to select geometrically richer outputs. Extensive experiments on the ImageNet dataset demonstrate that our FGM-HD framework yields a 39% improvement in output diversity compared to vanilla FGMs, while preserving comparable image quality. To our knowledge, this is the very first work introducing HD into FGM. Our method effectively enhances the diversity of generated outputs while offering a principled theoretical contribution to FGM development.
zh

[CV-48] Neural B-frame Video Compression with Bi-directional Reference Harmonization

【速读】:该论文旨在解决神经网络视频压缩(Neural Video Compression, NVC)中双向参考帧(B-frame)压缩效率不足的问题,尤其针对层级编码下因帧间跨度大导致的双向参考帧贡献失衡问题。解决方案的关键在于提出一种新型神经 B 帧视频压缩方法——双向参考调和视频压缩(Bi-directional Reference Harmonization Video Compression, BRHVC),其核心创新包括双向运动收敛(Bi-directional Motion Converge, BMC)与双向上下文融合(Bi-directional Contextual Fusion, BCF):BMC 通过汇聚多光流实现更精确的大尺度运动补偿,BCF 则基于运动补偿精度显式建模参考帧上下文权重,从而有效调和双向参考信息的利用效率,显著提升压缩性能,在 HEVC 数据集上超越了现有最先进 NVC 方法甚至传统编码标准 VTM-RA(随机访问配置)。

链接: https://arxiv.org/abs/2511.08938
作者: Yuxi Liu,Dengchao Jin,Shuai Huo,Jiawen Gu,Chao Zhou,Huihui Bai,Ming Lu,Zhan Ma
机构: Nanjing University (南京大学); Kuaishou Technology (快手科技); Beijing Jiaotong University (北京交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Neural video compression (NVC) has made significant progress in recent years, while neural B-frame video compression (NBVC) remains underexplored compared to P-frame compression. NBVC can adopt bi-directional reference frames for better compression performance. However, NBVC’s hierarchical coding may complicate continuous temporal prediction, especially at some hierarchical levels with a large frame span, which could cause the contribution of the two reference frames to be unbalanced. To optimize reference information utilization, we propose a novel NBVC method, termed Bi-directional Reference Harmonization Video Compression (BRHVC), with the proposed Bi-directional Motion Converge (BMC) and Bi-directional Contextual Fusion (BCF). BMC converges multiple optical flows in motion compression, leading to more accurate motion compensation on a larger scale. Then BCF explicitly models the weights of reference contexts under the guidance of motion compensation accuracy. With more efficient motions and contexts, BRHVC can effectively harmonize bi-directional references. Experimental results indicate that our BRHVC outperforms previous state-of-the-art NVC methods, even surpassing the traditional coding, VTM-RA (under random access configuration), on the HEVC datasets. The source code is released at this https URL.
zh

[CV-49] Boosting Adversarial Transferability via Ensemble Non-Attention

【速读】:该论文旨在解决异构模型架构间对抗样本迁移性能不佳的问题,其核心挑战在于不同架构(如CNN与ViT)的梯度更新方向差异显著,导致集成模型梯度方差难以降低且无法有效利用个体模型的优势。解决方案的关键在于提出一种名为NAMEA的新颖集成攻击方法,首次将集成模型中非注意力区域(non-attention areas)的梯度引入迭代梯度优化过程;其设计灵感源于观察到CNN与ViT在注意力区域上存在显著差异,因此非注意力区域可能互补地反映对方模型关注的重点。通过元学习策略对注意力与非注意力区域梯度进行解耦并融合,NAMEA实现了CNN与ViT之间跨架构迁移能力的显著提升,在ImageNet数据集上的实验表明其相比AdaEA和SMER等当前最优集成攻击方法平均提升15.0%和9.6%。

链接: https://arxiv.org/abs/2511.08937
作者: Yipeng Zou,Qin Liu,Jie Wu,Yu Peng,Guo Chen,Hui Zhou,Guanghui Ye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Ensemble attacks integrate the outputs of surrogate models with diverse architectures, which can be combined with various gradient-based attacks to improve adversarial transferability. However, previous work shows unsatisfactory attack performance when transferring across heterogeneous model architectures. The main reason is that the gradient update directions of heterogeneous surrogate models differ widely, making it hard to reduce the gradient variance of ensemble models while making the best of individual model. To tackle this challenge, we design a novel ensemble attack, NAMEA, which for the first time integrates the gradients from the non-attention areas of ensemble models into the iterative gradient optimization process. Our design is inspired by the observation that the attention areas of heterogeneous models vary sharply, thus the non-attention areas of ViTs are likely to be the focus of CNNs and vice versa. Therefore, we merge the gradients respectively from the attention and non-attention areas of ensemble models so as to fuse the transfer information of CNNs and ViTs. Specifically, we pioneer a new way of decoupling the gradients of non-attention areas from those of attention areas, while merging gradients by meta-learning. Empirical evaluations on ImageNet dataset indicate that NAMEA outperforms AdaEA and SMER, the state-of-the-art ensemble attacks by an average of 15.0% and 9.6%, respectively. This work is the first attempt to explore the power of ensemble non-attention in boosting cross-architecture transferability, providing new insights into launching ensemble attacks.
zh

[CV-50] Expand Your SCOPE: Semantic Cognition over Potential-Based Exploration for Embodied Visual Navigation

【速读】:该论文旨在解决具身视觉导航(Embodied Visual Navigation)中因环境未知性导致的长期规划能力不足问题,特别是现有零样本方法忽视了视觉前哨边界(visual frontier boundaries)对轨迹预测的关键作用,且难以从局部观测中推断与导航目标的关系。解决方案的关键在于提出语义认知驱动的潜在探索框架(Semantic Cognition Over Potential-based Exploration, SCOPE):首先利用视觉-语言模型(Vision-Language Model)估计探索潜在值,并构建时空潜在图(spatio-temporal potential graph)以捕捉边界动态,从而支持长程规划;其次引入自省机制(self-reconsideration mechanism),通过回溯和修正先前决策提升策略可靠性并减少过度自信误差。此设计显著提升了导航准确率(较最优基线提升4.6%)及泛化能力。

链接: https://arxiv.org/abs/2511.08935
作者: Ningnan Wang,Weihuang Chen,Liming Chen,Haoxuan Ji,Zhongyu Guo,Xuchong Zhang,Hongbin Sun
机构: 西安交通大学(Shaanxi Province Key Laboratory of Intelligent Information Processing, School of Computer Science and Technology, Xi’an Jiaotong University)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Embodied visual navigation remains a challenging task, as agents must explore unknown environments with limited knowledge. Existing zero-shot studies have shown that incorporating memory mechanisms to support goal-directed behavior can improve long-horizon planning performance. However, they overlook visual frontier boundaries, which fundamentally dictate future trajectories and observations, and fall short of inferring the relationship between partial visual observations and navigation goals. In this paper, we propose Semantic Cognition Over Potential-based Exploration (SCOPE), a zero-shot framework that explicitly leverages frontier information to drive potential-based exploration, enabling more informed and goal-relevant decisions. SCOPE estimates exploration potential with a Vision-Language Model and organizes it into a spatio-temporal potential graph, capturing boundary dynamics to support long-horizon planning. In addition, SCOPE incorporates a self-reconsideration mechanism that revisits and refines prior decisions, enhancing reliability and reducing overconfident errors. Experimental results on two diverse embodied navigation tasks show that SCOPE outperforms state-of-the-art baselines by 4.6% in accuracy. Further analysis demonstrates that its core components lead to improved calibration, stronger generalization, and higher decision quality.
zh

[CV-51] From Structure to Detail: Hierarchical Distillation for Efficient Diffusion Model

【速读】:该论文旨在解决扩散模型(diffusion models)在推理时延(inference latency)过高的问题,这一瓶颈限制了其在实时应用场景中的部署。现有方法主要分为基于轨迹(trajectory-based)和基于分布(distribution-based)的步数蒸馏(step distillation),但二者存在根本性权衡:前者保留全局结构却损失高频细节,后者虽可提升保真度却易出现模式崩溃(mode collapse)且训练不稳定。本文提出一种层级蒸馏(Hierarchical Distillation, HD)框架,将两类方法整合为协同组件——首先利用轨迹蒸馏生成结构“草图”(sketch),作为后续分布蒸馏阶段的近优初始分布,从而提升整体性能上限;随后引入自适应加权判别器(Adaptive Weighted Discriminator, AWD),通过动态分配token权重聚焦局部瑕疵,实现高效细节优化。该方案在ImageNet 256×256上单步模型FID达2.26,媲美250步教师模型,并在高分辨率文生图MJHQ基准上表现优异,确立了高保真、单步扩散模型的新范式。

链接: https://arxiv.org/abs/2511.08930
作者: Hanbo Cheng,Peng Wang,Kaixiang Lei,Qi Li,Zhen Zou,Pengfei Hu,Jun Du
机构: University of Science and Technology of China (中国科学技术大学); ByteDance China (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The inference latency of diffusion models remains a critical barrier to their real-time application. While trajectory-based and distribution-based step distillation methods offer solutions, they present a fundamental trade-off. Trajectory-based methods preserve global structure but act as a “lossy compressor”, sacrificing high-frequency details. Conversely, distribution-based methods can achieve higher fidelity but often suffer from mode collapse and unstable training. This paper recasts them from independent paradigms into synergistic components within our novel Hierarchical Distillation (HD) framework. We leverage trajectory distillation not as a final generator, but to establish a structural ``sketch", providing a near-optimal initialization for the subsequent distribution-based refinement stage. This strategy yields an ideal initial distribution that enhances the ceiling of overall performance. To further improve quality, we introduce and refine the adversarial training process. We find standard discriminator structures are ineffective at refining an already high-quality generator. To overcome this, we introduce the Adaptive Weighted Discriminator (AWD), tailored for the HD pipeline. By dynamically allocating token weights, AWD focuses on local imperfections, enabling efficient detail refinement. Our approach demonstrates state-of-the-art performance across diverse tasks. On ImageNet 256\times256 , our single-step model achieves an FID of 2.26, rivaling its 250-step teacher. It also achieves promising results on the high-resolution text-to-image MJHQ benchmark, proving its generalizability. Our method establishes a robust new paradigm for high-fidelity, single-step diffusion models.
zh

[CV-52] “Its trained by non-disabled people”: Evaluating How Image Quality Affects Product Captioning with VLMs

【速读】:该论文旨在解决盲人及低视力(Blind and Low-Vision, BLV)人群在使用视觉语言模型(Vision-Language Models, VLMs)进行日常物品识别时,因图像质量缺陷(如模糊、错位构图)导致生成描述准确性下降的问题,并探究这些缺陷是否影响BLV用户的实际信息需求满足。其解决方案的关键在于通过一项针对86名BLV用户的调研,系统评估不同图像质量问题对VLM生成caption的影响,发现当图像无质量问题时模型准确率达98%,但存在质量问题时整体准确率降至75%,且随着问题叠加显著恶化;因此提出需在模型评估中嵌入残障用户的真实体验,并为HCI与机器学习(Machine Learning, ML)研究者提供具体建议,以提升VLM对BLV群体的可靠性。

链接: https://arxiv.org/abs/2511.08917
作者: Kapil Garg,Xinru Tang,Jimin Heo,Dwayne R. Morgan,Darren Gergle,Erik B. Sudderth,Anne Marie Piper
机构: University of California, Irvine (加州大学欧文分校); Northwestern University (西北大学)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: Paper under review

点击查看摘要

Abstract:Vision-Language Models (VLMs) are increasingly used by blind and low-vision (BLV) people to identify and understand products in their everyday lives, such as food, personal products, and household goods. Despite their prevalence, we lack an empirical understanding of how common image quality issues, like blur and misframing of items, affect the accuracy of VLM-generated captions and whether resulting captions meet BLV people’s information needs. Grounded in a survey with 86 BLV people, we systematically evaluate how image quality issues affect captions generated by VLMs. We show that the best model recognizes products in images with no quality issues with 98% accuracy, but drops to 75% accuracy overall when quality issues are present, worsening considerably as issues compound. We discuss the need for model evaluations that center on disabled people’s experiences throughout the process and offer concrete recommendations for HCI and ML researchers to make VLMs more reliable for BLV people.
zh

[CV-53] Machines Serve Human: A Novel Variable Human-machine Collaborative Compression Framework

【速读】:该论文旨在解决当前人机协同压缩(human-machine collaborative compression)方法在复杂度和比特率上的不足,特别是现有方法基于以人类视觉为中心的压缩流程,未能有效利用机器视觉仅关注图像/视频核心区域、所需信息远低于人类视觉的特点。解决方案的关键在于提出一种以机器视觉为导向的新压缩框架——Diff-FCHM(diffusion-prior based feature compression for human and machine visions),其核心创新是将机器视觉作为基础,并通过可插拔的变比特率策略优化机器视觉任务;同时,采用渐进式语义聚合机制结合扩散先验(diffusion prior)恢复高保真细节,从而实现对人类视觉和机器视觉的高效协同压缩。

链接: https://arxiv.org/abs/2511.08915
作者: Zifu Zhang,Shengxi Li,Xiancheng Sun,Mai Xu,Zhengyuan Liu,Jingyuan Xia
机构: Beihang University (北京航空航天大学); State Key Laboratory of Virtual Reality Technology and Systems, Beihang University (虚拟现实技术与系统全国重点实验室,北京航空航天大学); National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human-machine collaborative compression has been receiving increasing research efforts for reducing image/video data, serving as the basis for both human perception and machine intelligence. Existing collaborative methods are dominantly built upon the de facto human-vision compression pipeline, witnessing deficiency on complexity and bit-rates when aggregating the machine-vision compression. Indeed, machine vision solely focuses on the core regions within the image/video, requiring much less information compared with the compressed information for human vision. In this paper, we thus set out the first successful attempt by a novel collaborative compression method based on the machine-vision-oriented compression, instead of human-vision pipeline. In other words, machine vision serves as the basis for human vision within collaborative compression. A plug-and-play variable bit-rate strategy is also developed for machine vision tasks. Then, we propose to progressively aggregate the semantics from the machine-vision compression, whilst seamlessly tailing the diffusion prior to restore high-fidelity details for human vision, thus named as diffusion-prior based feature compression for human and machine visions (Diff-FCHM). Experimental results verify the consistently superior performances of our Diff-FCHM, on both machine-vision and human-vision compression with remarkable margins. Our code will be released upon acceptance.
zh

[CV-54] SPEED-Q: Staged Processing with Enhanced Distillation towards Efficient Low-bit On-device VLM Quantization

【速读】:该论文旨在解决在资源受限的边缘设备(如智能手机和机器人)上部署视觉-语言模型(Vision-Language Models, VLMs)时面临的两大关键挑战:一是视觉Transformer(ViT)与语言大模型(LLM)组件在量化敏感性上的显著差异;二是低比特量化带来的训练不稳定性问题。为应对上述挑战,论文提出SPEED-Q框架,其核心创新在于引入一种分阶段的敏感性自适应机制,以协调不同模态间的性能表现,并设计了一种增强型蒸馏量化策略,从而提升训练稳定性并降低对数据的依赖性。SPEED-Q是首个针对小规模百亿参数VLM进行低比特权重-only量化的方法,在2-bit设置下相较现有方法最高提升6倍准确率,并在2-bit和4-bit设置下持续优于已有边缘端VLM方案。

链接: https://arxiv.org/abs/2511.08914
作者: Tianyu Guo,Shanwei Zhao,Shiai Zhu,Chenguang Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deploying Vision-Language Models (VLMs) on edge devices (e.g., smartphones and robots) is crucial for enabling low-latency and privacy-preserving intelligent applications. Given the resource constraints of these devices, quantization offers a promising solution by improving memory efficiency and reducing bandwidth requirements, thereby facilitating the deployment of VLMs. However, existing research has rarely explored aggressive quantization on VLMs, particularly for the models ranging from 1B to 2B parameters, which are more suitable for resource-constrained edge devices. In this paper, we propose SPEED-Q, a novel Staged Processing with Enhanced Distillation framework for VLM low-bit weight-only quantization that systematically addresses the following two critical obstacles: (1) significant discrepancies in quantization sensitivity between vision (ViT) and language (LLM) components in VLMs; (2) training instability arising from the reduced numerical precision inherent in low-bit quantization. In SPEED-Q, a staged sensitivity adaptive mechanism is introduced to effectively harmonize performance across different modalities. We further propose a distillation-enhanced quantization strategy to stabilize the training process and reduce data dependence. Together, SPEED-Q enables accurate, stable, and data-efficient quantization of complex VLMs. SPEED-Q is the first framework tailored for quantizing entire small-scale billion-parameter VLMs to low bits. Extensive experiments across multiple benchmarks demonstrate that SPEED-Q achieves up to 6x higher accuracy than existing quantization methods under 2-bit settings and consistently outperforms prior on-device VLMs under both 2-bit and 4-bit settings. Our code and models are available at this https URL.
zh

[CV-55] Negative Entity Suppression for Zero-Shot Captioning with Synthetic Images

【速读】:该论文旨在解决零样本图像描述(Zero-Shot Image Captioning, ZIC)中因仅使用文本训练导致的跨域泛化能力差的问题,尤其是在面对新颖视觉环境时生成内容出现幻觉(hallucination)的现象。其核心解决方案是引入“负向实体”(Negative Entities)的概念,并提出负向实体抑制(Negative Entity Suppression, NES)机制,通过三个阶段实现:首先利用合成图像确保训练与推理阶段图像到文本检索的一致性;其次过滤检索内容中的负向实体以提升准确性;最后在注意力层面抑制识别出的负向实体,从而降低幻觉特征的影响。该方法在多个基准测试中实现了优于现有技术的跨域迁移性能和更低的幻觉率。

链接: https://arxiv.org/abs/2511.08909
作者: Zimao Lu,Hui Xu,Bing Liu,Ke Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 5 figures

点击查看摘要

Abstract:Text-only training provides an attractive approach to address data scarcity challenges in zero-shot image captioning (ZIC), avoiding the expense of collecting paired image-text annotations. However, although these approaches perform well within training domains, they suffer from poor cross-domain generalization, often producing hallucinated content when encountering novel visual environments. Retrieval-based methods attempt to mitigate this limitation by leveraging external knowledge, but they can paradoxically exacerbate hallucination when retrieved captions contain entities irrelevant to the inputs. We introduce the concept of negative entities–objects that appear in generated caption but are absent from the input–and propose Negative Entity Suppression (NES) to tackle this challenge. NES seamlessly integrates three stages: (1) it employs synthetic images to ensure consistent image-to-text retrieval across both training and inference; (2) it filters negative entities from retrieved content to enhance accuracy; and (3) it applies attention-level suppression using identified negative entities to further minimize the impact of hallucination-prone features. Evaluation across multiple benchmarks demonstrates that NES maintains competitive in-domain performance while improving cross-domain transfer and reducing hallucination rates, achieving new state-of-the-art results in ZIC. Our code is available at this https URL.
zh

[CV-56] HitoMi-Cam: A Shape-Agnostic Person Detection Method Using the Spectral Characteristics of Clothing

【速读】:该论文旨在解决卷积神经网络(Convolutional Neural Network, CNN)在目标检测中对物体形状敏感的问题,即当待检测目标的姿势未出现在训练数据中时,其性能显著下降。为此,作者提出了一种名为HitoMi-Cam的轻量级、形状无关的人体检测方法,其核心创新在于利用衣物的光谱反射特性进行识别,而非依赖形状或外观特征。该方案的关键在于通过光谱信息实现对人的鲁棒检测,从而在形状不可预测的实际场景(如灾难搜救)中保持高精度(平均精度达93.5%),同时可在无GPU的边缘设备上以23.2帧/秒的速度实时运行,展现出良好的实用性和互补性。

链接: https://arxiv.org/abs/2511.08908
作者: Shuji Ono
机构: Fujifilm Corporation(富士胶片公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 37 pages, 21 figures, 9 tables. Published in MDPI Journal of Imaging. Includes 1 supplementary video file (ancillary file)

点击查看摘要

Abstract:While convolutional neural network (CNN)-based object detection is widely used, it exhibits a shape dependency that degrades performance for postures not included in the training data. Building upon our previous simulation study published in this journal, this study implements and evaluates the spectral-based approach on physical hardware to address this limitation. Specifically, this paper introduces HitoMi-Cam, a lightweight and shape-agnostic person detection method that uses the spectral reflectance properties of clothing. The author implemented the system on a resource-constrained edge device without a GPU to assess its practical viability. The results indicate that a processing speed of 23.2 frames per second (fps) (253x190 pixels) is achievable, suggesting that the method can be used for real-time applications. In a simulated search and rescue scenario where the performance of CNNs declines, HitoMi-Cam achieved an average precision (AP) of 93.5%, surpassing that of the compared CNN models (best AP of 53.8%). Throughout all evaluation scenarios, the occurrence of false positives remained minimal. This study positions the HitoMi-Cam method not as a replacement for CNN-based detectors but as a complementary tool under specific conditions. The results indicate that spectral-based person detection can be a viable option for real-time operation on edge devices in real-world environments where shapes are unpredictable, such as disaster rescue.
zh

[CV-57] Consistency Change Detection Framework for Unsupervised Remote Sensing Change Detection ICME

【速读】:该论文旨在解决无监督遥感变化检测中因生成器过拟合导致性能不佳的问题。现有方法依赖生成网络进行多时相遥感图像的重建,将无法重建区域视为变化区域,但易受生成器过拟合影响,从而降低检测精度。其解决方案的关键在于提出了一种一致性变化检测框架(Consistency Change Detection Framework, CCDF),其中包含两个核心模块:一是循环一致性(Cycle Consistency, CC)模块,用于缓解生成器过拟合;二是语义一致性(Semantic Consistency, SC)模块,以提升细节重建能力,从而更准确地识别变化区域。

链接: https://arxiv.org/abs/2511.08904
作者: Yating Liu,Yan Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 2025 IEEE International Conference on Multimedia and Expo (ICME)

点击查看摘要

Abstract:Unsupervised remote sensing change detection aims to monitor and analyze changes from multi-temporal remote sensing images in the same geometric region at different times, without the need for labeled training data. Previous unsupervised methods attempt to achieve style transfer across multi-temporal remote sensing images through reconstruction by a generator network, and then capture the unreconstructable areas as the changed regions. However, it often leads to poor performance due to generator overfitting. In this paper, we propose a novel Consistency Change Detection Framework (CCDF) to address this challenge. Specifically, we introduce a Cycle Consistency (CC) module to reduce the overfitting issues in the generator-based reconstruction. Additionally, we propose a Semantic Consistency (SC) module to enable detail reconstruction. Extensive experiments demonstrate that our method outperforms other state-of-the-art approaches.
zh

[CV-58] LLM -Guided Probabilistic Fusion for Label-Efficient Document Layout Analysis

【速读】:该论文旨在解决文档版面理解(Document Layout Understanding)任务中对标注数据高度依赖的问题,即便在半监督学习(Semi-Supervised Learning, SSL)取得进展的背景下依然存在。其核心解决方案是通过一种基于概率加权融合的框架,将OCR与文本预训练大语言模型(LLM)提供的结构先验(Structural Priors)与视觉检测器的预测结果进行融合。关键创新在于使用逆方差加权(Inverse-Variance Fusion)策略,实现视觉预测与LLM结构信息的自适应整合,并引入实例自适应门控机制(Instance-Adaptive Gating),显著提升模型性能(+0.9 AP),同时理论上的PAC边界可解释收敛行为。此方法在轻量级SwiftFormer和强预训练LayoutLMv3架构上均表现出优越性,证明了LLM结构先验对不同规模模型具有普适增强效果。

链接: https://arxiv.org/abs/2511.08903
作者: Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
机构: Iowa State University (爱荷华州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Document layout understanding remains data-intensive despite advances in semi-supervised learning. We present a framework that enhances semi-supervised detection by fusing visual predictions with structural priors from text-pretrained LLMs via principled probabilistic weighting. Given unlabeled documents, an OCR-LLM pipeline infers hierarchical regions which are combined with teacher detector outputs through inverse-variance fusion to generate refined this http URL method demonstrates consistent gains across model scales. With a lightweight SwiftFormer backbone (26M params), we achieve 88.2 \pm 0.3 AP using only 5% labels on PubLayNet. When applied to document-pretrained LayoutLMv3 (133M params), our fusion framework reaches 89.7 \pm 0.4 AP, surpassing both LayoutLMv3 with standard semi-supervised learning (89.1 \pm 0.4 AP, p=0.02) and matching UDOP~\citeudop (89.8 AP) which requires 100M+ pages of multimodal pretraining. This demonstrates that LLM structural priors are complementary to both lightweight and pretrained architectures. Key findings include: (1) learned instance-adaptive gating improves over fixed weights by +0.9 AP with data-dependent PAC bounds correctly predicting convergence; (2) open-source LLMs enable privacy-preserving deployment with minimal loss (Llama-3-70B: 87.1 AP lightweight, 89.4 AP with LayoutLMv3); (3) LLMs provide targeted semantic disambiguation (18.7% of cases, +3.8 AP gain) beyond simple text this http URL system cost includes \ 12 for GPT-4o-mini API or 17 GPU-hours for local Llama-3-70B per 50K pages, amortized across training runs.
zh

[CV-59] Asymmetric Cross-Modal Knowledge Distillation: Bridging Modalities with Weak Semantic Consistency AAAI-2026

【速读】:该论文旨在解决跨模态知识蒸馏(Cross-modal Knowledge Distillation, CKD)在现实场景中因配对模态数据稀缺而导致性能受限的问题,特别是当模态间语义一致性较弱时(即非对称跨模态知识蒸馏,Asymmetric Cross-modal Knowledge Distillation, ACKD),如何有效传递知识。其解决方案的关键在于提出一个名为SemBridge的框架,该框架包含两个核心模块:一是“学生友好匹配模块”(Student-Friendly Matching module),利用自监督学习获取语义信息,并通过动态选择相关教师样本为每个学生样本提供个性化指导;二是“语义感知知识对齐模块”(Semantic-aware Knowledge Alignment module),基于拉格朗日优化寻找最优传输路径以降低知识迁移成本。该方法在多光谱(Multi-Spectral, MS)与RGB图像的遥感场景分类任务上验证了有效性,显著优于7种现有方法。

链接: https://arxiv.org/abs/2511.08901
作者: Riling Wei,Kelu Yao,Chuanguang Yang,Jin Wang,Zhuoyan Gao,Chao Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI-2026

点击查看摘要

Abstract:Cross-modal Knowledge Distillation has demonstrated promising performance on paired modalities with strong semantic connections, referred to as Symmetric Cross-modal Knowledge Distillation (SCKD). However, implementing SCKD becomes exceedingly constrained in real-world scenarios due to the limited availability of paired modalities. To this end, we investigate a general and effective knowledge learning concept under weak semantic consistency, dubbed Asymmetric Cross-modal Knowledge Distillation (ACKD), aiming to bridge modalities with limited semantic overlap. Nevertheless, the shift from strong to weak semantic consistency improves flexibility but exacerbates challenges in knowledge transmission costs, which we rigorously verified based on optimal transport theory. To mitigate the issue, we further propose a framework, namely SemBridge, integrating a Student-Friendly Matching module and a Semantic-aware Knowledge Alignment module. The former leverages self-supervised learning to acquire semantic-based knowledge and provide personalized instruction for each student sample by dynamically selecting the relevant teacher samples. The latter seeks the optimal transport path by employing Lagrangian optimization. To facilitate the research, we curate a benchmark dataset derived from two modalities, namely Multi-Spectral (MS) and asymmetric RGB images, tailored for remote sensing scene classification. Comprehensive experiments exhibit that our framework achieves state-of-the-art performance compared with 7 existing approaches on 6 different model architectures across various datasets.
zh

[CV-60] Improving VisNet for Object Recognition

【速读】:该论文旨在解决人工视觉系统在对象识别(Object Recognition)中难以实现类人效率与鲁棒性的问题,特别是如何构建具备变换不变性(Transformation Invariant Representations)的视觉表征。其解决方案的关键在于基于生物启发的VisNet神经网络模型,通过引入径向基函数(Radial Basis Function, RBF)神经元、基于马氏距离(Mahalanobis Distance)的学习机制以及类似视网膜的预处理模块,并结合赫布学习(Hebbian Learning)和时间连续性关联机制,使模型能够从时序相邻视图中提取稳定且具有生物学合理性的特征表示。实验表明,这些改进显著提升了在MNIST、CIFAR10及自定义对称物体数据集上的识别准确率,验证了该架构在神经科学与人工智能交叉领域的有效性与可解释性。

链接: https://arxiv.org/abs/2511.08897
作者: Mehdi Fatan Serj,C. Alejandro Parraga,Xavier Otazu
机构: Computer Vision Centre(CVC); Autonomous University of Barcelona(UAB)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object recognition plays a fundamental role in how biological organisms perceive and interact with their environment. While the human visual system performs this task with remarkable efficiency, reproducing similar capabilities in artificial systems remains challenging. This study investigates VisNet, a biologically inspired neural network model, and several enhanced variants incorporating radial basis function neurons, Mahalanobis distance based learning, and retinal like preprocessing for both general object recognition and symmetry classification. By leveraging principles of Hebbian learning and temporal continuity associating temporally adjacent views to build invariant representations. VisNet and its extensions capture robust and transformation invariant features. Experimental results across multiple datasets, including MNIST, CIFAR10, and custom symmetric object sets, show that these enhanced VisNet variants substantially improve recognition accuracy compared with the baseline model. These findings underscore the adaptability and biological relevance of VisNet inspired architectures, offering a powerful and interpretable framework for visual recognition in both neuroscience and artificial intelligence. Keywords: VisNet, Object Recognition, Symmetry Detection, Hebbian Learning, RBF Neurons, Mahalanobis Distance, Biologically Inspired Models, Invariant Representations Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2511.08897 [cs.CV] (or arXiv:2511.08897v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2511.08897 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-61] Classifying Histopathologic Glioblastoma Sub-regions with EfficientNet

【速读】:该论文旨在解决胶质母细胞瘤(Glioblastoma, GBM)病理图像中不同组织亚区域自动识别的难题,以实现对疾病形态学特征的大规模解析。其解决方案的关键在于设计了一种基于深度学习的四步分类方法,采用多种EfficientNet架构(B0–B4)在BraTS-Path 2024挑战数据集上进行训练与评估,其中EfficientNet-B1和EfficientNet-B4在交叉验证中表现最优(F1分数达0.98),但模型在独立验证集和测试集上的性能显著下降(F1分数分别为0.546和0.517),揭示了当前模型泛化能力不足的问题,凸显了临床部署中提升模型鲁棒性的关键挑战。

链接: https://arxiv.org/abs/2511.08896
作者: Sanyukta Adap,Ujjwal Baid,Spyridon Bakas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Glioblastoma (GBM) is the most common aggressive, fast-growing brain tumor, with a grim prognosis. Despite clinical diagnostic advancements, there have not been any substantial improvements to patient prognosis. Histopathological assessment of excised tumors is the first line of clinical diagnostic routine. We hypothesize that automated, robust, and accurate identification of distinct histological sub-regions within GBM could contribute to morphologically understanding this disease at scale. In this study, we designed a four-step deep learning approach to classify six (6) histopathological regions and quantitatively evaluated it on the BraTS-Path 2024 challenge dataset, which includes digitized Hematoxylin \ Eosin (H\E) stained GBM tissue sections annotated for six distinct regions. We used the challenge’s publicly available training dataset to develop and evaluate the effectiveness of several variants of EfficientNet architectures (i.e., B0, B1, B2, B3, B4). EfficientNet-B1 and EfficientNet-B4 achieved the best performance, achieving an F1 score of 0.98 in a 5-fold cross-validation configuration using the BraTS-Path training set. The quantitative performance evaluation of our proposed approach with EfficientNet-B1 on the BraTS-Path hold-out validation data and the final hidden testing data yielded F1 scores of 0.546 and 0.517, respectively, for the associated 6-class classification task. The difference in the performance on training, validation, and testing data highlights the challenge of developing models that generalize well to new data, which is crucial for clinical applications. The source code of the proposed approach can be found at the GitHub repository of Indiana University Division of Computational Pathology: this https URL.
zh

[CV-62] Improve Contrastive Clustering Performance by Multiple Fusing-Augmenting ViT Blocks

【速读】:该论文旨在解决现有对比学习网络在图像聚类任务中因编码器间隐式交互(如参数共享或动量更新)而未能充分挖掘正样本对之间的互补性与相似性,从而限制了聚类特征提取能力的问题。其解决方案的关键在于设计了一种新颖的多融合-增强Vision Transformer块(Multiple Fusing-Augmenting ViT Blocks, MFAVBs),通过显式融合两个正样本对的特征并引入多轮融合与增强机制,提升模型对相似图像的区分能力;同时利用CLIP预训练模型提取的特征对输入数据进行预处理,进一步优化聚类性能。

链接: https://arxiv.org/abs/2511.08883
作者: Cheng Wang,Shuisheng Zhou,Fengjiao Peng,Jin Sheng,Feng Ye,Yinli Dong
机构: Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In the field of image clustering, the widely used contrastive learning networks improve clustering performance by maximizing the similarity between positive pairs and the dissimilarity of negative pairs of the inputs. Extant contrastive learning networks, whose two encoders often implicitly interact with each other by parameter sharing or momentum updating, may not fully exploit the complementarity and similarity of the positive pairs to extract clustering features from input data. To explicitly fuse the learned features of positive pairs, we design a novel multiple fusing-augmenting ViT blocks (MFAVBs) based on the excellent feature learning ability of Vision Transformers (ViT). Firstly, two preprocessed augmentions as positive pairs are separately fed into two shared-weight ViTs, then their output features are fused to input into a larger ViT. Secondly, the learned features are split into a pair of new augmented positive samples and passed to the next FAVBs, enabling multiple fusion and augmention through MFAVBs operations. Finally, the learned features are projected into both instance-level and clustering-level spaces to calculate the cross-entropy loss, followed by parameter updates by backpropagation to finalize the training process. To further enhance ability of the model to distinguish between similar images, our input data for the network we propose is preprocessed augmentions with features extracted from the CLIP pretrained model. Our experiments on seven public datasets demonstrate that MFAVBs serving as the backbone for contrastive clustering outperforms the state-of-the-art techniques in terms of clustering performance.
zh

[CV-63] SasMamba: A Lightweight Structure-Aware Stride State Space Model for 3D Human Pose Estimation WACV2026

【速读】:该论文旨在解决基于状态空间模型(State Space Models, SSM)的3D人体姿态估计方法中,现有手工设计的扫描操作将检测到的2D姿态序列扁平化为纯时间序列时所导致的空间结构破坏和时空特征纠缠问题,从而难以捕捉复杂的姿态依赖关系。其解决方案的关键在于提出骨架结构感知的步长SSM(Skeleton Structure-Aware Stride SSM, SAS-SSM),首先通过结构感知的时空卷积动态捕获关节间的局部交互关系,再采用基于步长的扫描策略构建多尺度全局结构表示,从而在保持线性计算复杂度的前提下,实现对局部与全局姿态信息的灵活建模。

链接: https://arxiv.org/abs/2511.08872
作者: Hu Cui,Wenqiang Hua,Renjing Huang,Shurui Jia,Tessai Hayama
机构: Nagaoka University of Technology (长冈技术科学大学); Xi’an University of Posts and Telecommunications (西安邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 8pages, WACV2026 accepted

点击查看摘要

Abstract:Recently, the Mamba architecture based on State Space Models (SSMs) has gained attention in 3D human pose estimation due to its linear complexity and strong global modeling capability. However, existing SSM-based methods typically apply manually designed scan operations to flatten detected 2D pose sequences into purely temporal sequences, either locally or globally. This approach disrupts the inherent spatial structure of human poses and entangles spatial and temporal features, making it difficult to capture complex pose dependencies. To address these limitations, we propose the Skeleton Structure-Aware Stride SSM (SAS-SSM), which first employs a structure-aware spatiotemporal convolution to dynamically capture essential local interactions between joints, and then applies a stride-based scan strategy to construct multi-scale global structural representations. This enables flexible modeling of both local and global pose information while maintaining linear computational complexity. Built upon SAS-SSM, our model SasMamba achieves competitive 3D pose estimation performance with significantly fewer parameters compared to existing hybrid models. The source code is available at this https URL.
zh

[CV-64] Enhancing Rotation-Invariant 3D Learning with Global Pose Awareness and Attention Mechanisms AAAI2026

【速读】:该论文旨在解决现有旋转不变(Rotation-Invariant, RI)学习方法在处理3D点云时因局部感受野受限而导致的全局姿态信息丢失问题,进而无法区分几何相似但空间位置不同的结构(如飞机左右机翼),即“翼尖特征坍缩”(Wing-tip feature collapse)。其解决方案的关键在于提出一种受阴影启发的姿态特征(Shadow-informed Pose Feature, SiPF),通过引入一个由共享旋转学习得到的全局一致参考点(称为“阴影”)来增强局部RI描述子,从而在保持旋转不变性的同时保留全局姿态感知能力;进一步设计了基于注意力机制的旋转不变注意力卷积(RIAttnConv)以整合SiPF并提升模型对结构相似组件的区分能力,并结合基于单位四元数上Bingham分布的任务自适应阴影定位模块动态优化全局旋转,实现更鲁棒的3D形状理解。

链接: https://arxiv.org/abs/2511.08833
作者: Jiaxun Guo,Manar Amayri,Nizar Bouguila,Xin Liu,Wentao Fan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 6 gigures,AAAI 2026

点击查看摘要

Abstract:Recent advances in rotation-invariant (RI) learning for 3D point clouds typically replace raw coordinates with handcrafted RI features to ensure robustness under arbitrary rotations. However, these approaches often suffer from the loss of global pose information, making them incapable of distinguishing geometrically similar but spatially distinct structures. We identify that this limitation stems from the restricted receptive field in existing RI methods, leading to Wing-tip feature collapse, a failure to differentiate symmetric components (e.g., left and right airplane wings) due to indistinguishable local geometries. To overcome this challenge, we introduce the Shadow-informed Pose Feature (SiPF), which augments local RI descriptors with a globally consistent reference point (referred to as the ‘shadow’) derived from a learned shared rotation. This mechanism enables the model to preserve global pose awareness while maintaining rotation invariance. We further propose Rotation-invariant Attention Convolution (RIAttnConv), an attention-based operator that integrates SiPFs into the feature aggregation process, thereby enhancing the model’s capacity to distinguish structurally similar components. Additionally, we design a task-adaptive shadow locating module based on the Bingham distribution over unit quaternions, which dynamically learns the optimal global rotation for constructing consistent shadows. Extensive experiments on 3D classification and part segmentation benchmarks demonstrate that our approach substantially outperforms existing RI methods, particularly in tasks requiring fine-grained spatial discrimination under arbitrary rotations.
zh

[CV-65] DT-NVS: Diffusion Transformers for Novel View Synthesis

【速读】:该论文旨在解决从单张图像中生成自然场景(包括室内和室外的日常场景)新视角的难题,这一问题在现有研究中尚未充分探索,尤其相较于以物体为中心的新视角合成任务。传统基于扩散模型的方法主要局限于真实场景中的小范围相机运动或仅处理非自然的物体中心场景,限制了其在真实世界应用中的潜力。解决方案的关键在于提出一种名为DT-NVS的3D感知扩散模型,该模型利用基于Transformer的架构将图像映射到3D表示,并引入新颖的相机条件策略以适应真实世界中未对齐的数据集;同时创新性地采用了一种训练范式,通过交换参考帧与噪声输入的角色,显著提升了模型在真实、多类别、随意采集视频数据上的泛化能力与生成多样性,从而实现了优于当前最优3D感知扩散模型和确定性方法的性能表现。

链接: https://arxiv.org/abs/2511.08823
作者: Wonbong Jang,Jonathan Tremblay,Lourdes Agapito
机构: UCL (伦敦大学学院); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages

点击查看摘要

Abstract:Generating novel views of a natural scene, e.g., every-day scenes both indoors and outdoors, from a single view is an under-explored problem, even though it is an organic extension to the object-centric novel view synthesis. Existing diffusion-based approaches focus rather on small camera movements in real scenes or only consider unnatural object-centric scenes, limiting their potential applications in real-world settings. In this paper we move away from these constrained regimes and propose a 3D diffusion model trained with image-only losses on a large-scale dataset of real-world, multi-category, unaligned, and casually acquired videos of everyday scenes. We propose DT-NVS, a 3D-aware diffusion model for generalized novel view synthesis that exploits a transformer-based architecture backbone. We make significant contributions to transformer and self-attention architectures to translate images to 3d representations, and novel camera conditioning strategies to allow training on real-world unaligned datasets. In addition, we introduce a novel training paradigm swapping the role of reference frame between the conditioning image and the sampled noisy input. We evaluate our approach on the 3D task of generalized novel view synthesis from a single input image and show improvements over state-of-the-art 3D aware diffusion models and deterministic approaches, while generating diverse outputs.
zh

[CV-66] SIFT-Graph: Benchmarking Multimodal Defense Against Image Adversarial Attacks With Robust Feature Graph ICCV2025

【速读】:该论文旨在解决现代深度视觉模型对像素级表示的高度依赖所导致的脆弱性问题,这种脆弱性使得模型极易受到微小但精心设计的对抗扰动(adversarial perturbations)影响。解决方案的关键在于提出一种多模态防御框架 SIFT-Graph,其核心创新是通过融合手工设计的尺度不变特征变换(Scale-Invariant Feature Transform, SIFT)关键点与图注意力网络(Graph Attention Network, GAT),提取具有结构意义且对扰动鲁棒的局部特征;这些鲁棒特征嵌入随后与传统视觉模型(如 Vision Transformer 和卷积神经网络)进行融合,构建出兼具结构感知能力和抗扰动能力的统一模型,从而在保持较高原始准确率的前提下显著提升模型对基于梯度的白盒对抗攻击的鲁棒性。

链接: https://arxiv.org/abs/2511.08810
作者: Jingjie He,Weijie Liang,Zihan Shan,Matthew Caesar
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV2025 Workshop, short paper

点击查看摘要

Abstract:Adversarial attacks expose a fundamental vulnerability in modern deep vision models by exploiting their dependence on dense, pixel-level representations that are highly sensitive to imperceptible perturbations. Traditional defense strategies typically operate within this fragile pixel domain, lacking mechanisms to incorporate inherently robust visual features. In this work, we introduce SIFT-Graph, a multimodal defense framework that enhances the robustness of traditional vision models by aggregating structurally meaningful features extracted from raw images using both handcrafted and learned modalities. Specifically, we integrate Scale-Invariant Feature Transform keypoints with a Graph Attention Network to capture scale and rotation invariant local structures that are resilient to perturbations. These robust feature embeddings are then fused with traditional vision model, such as Vision Transformer and Convolutional Neural Network, to form a unified, structure-aware and perturbation defensive model. Preliminary results demonstrate that our method effectively improves the visual model robustness against gradient-based white box adversarial attacks, while incurring only a marginal drop in clean accuracy.
zh

[CV-67] Adaptive graph Kolmogorov-Arnold network for 3D human pose estimation

【速读】:该论文旨在解决基于图卷积网络(GCN)的3D人体姿态估计方法在处理遮挡和深度模糊时因局部感受野有限而难以捕捉长程依赖关系,以及存在谱偏差(spectral bias)导致对高频细节建模能力不足的问题。其解决方案的关键在于提出PoseKAN框架,这是一种自适应图Kolmogorov-Arnold Network(KAN),通过将KAN扩展至图结构学习任务中,用可学习的边函数替代GCN中固定的激活函数,实现数据驱动的特征变换,从而增强模型的表达能力和适应性;同时引入多跳特征聚合机制以融合局部与远距离邻域信息,并结合残差PoseKAN模块和全局响应归一化,显著提升空间感知能力和特征选择性。

链接: https://arxiv.org/abs/2511.08809
作者: Abu Taib Mohammed Shahjahan,A. Ben Hamza
机构: Concordia University (康考迪亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Graph convolutional network (GCN)-based methods have shown strong performance in 3D human pose estimation by leveraging the natural graph structure of the human skeleton. However, their local receptive field limits their ability to capture long-range dependencies essential for handling occlusions and depth ambiguities. They also exhibit spectral bias, which prioritizes low-frequency components while struggling to model high-frequency details. In this paper, we introduce PoseKAN, an adaptive graph Kolmogorov-Arnold Network (KAN), framework that extends KANs to graph-based learning for 2D-to-3D pose lifting from a single image. Unlike GCNs that use fixed activation functions, KANs employ learnable functions on graph edges, allowing data-driven, adaptive feature transformations. This enhances the model’s adaptability and expressiveness, making it more expressive in learning complex pose variations. Our model employs multi-hop feature aggregation, ensuring the body joints can leverage information from both local and distant neighbors, leading to improved spatial awareness. It also incorporates residual PoseKAN blocks for deeper feature refinement, and a global response normalization for improved feature selectivity and contrast. Extensive experiments on benchmark datasets demonstrate the competitive performance of our model against state-of-the-art methods.
zh

[CV-68] WiCV at CVPR 2025: The Women in Computer Vision Workshop

【速读】:该论文旨在解决计算机视觉(Computer Vision)领域中女性和代表性不足群体(Underrepresented Minorities)在学术影响力、职业发展与参与度方面的不平等问题,通过系统性记录和分析Women in Computer Vision (WiCV) 工作坊的组织成效与演变趋势,为促进人工智能(AI)社区的多样性、公平性与包容性(DEI)提供可复制的实践范式。其解决方案的关键在于:构建一个持续性的学术交流平台,结合高质量论文展示(14篇录用论文及36篇扩展摘要海报)、跨行业导师制(80名学员匹配37位来自学界与工业界的导师)、大规模资金支持(约44,000美元旅行补助与多样性奖项)以及高参与度的现场互动(超100名与会者),从而增强弱势群体的研究可见度、专业成长机会与网络连接,推动整个计算机视觉社区向更具包容性的方向发展。

链接: https://arxiv.org/abs/2511.08748
作者: Estefania Talavera,Deblina Bhattacharjee,Himangi Mittal,Mengwei Ren,Karen Sanchez,Carla Muntean,JungEun Kim,Mona Jalal
机构: University of Twente(特温特大学); University of Bath(巴斯大学); Carnegie Mellon University(卡内基梅隆大学); Adobe(Adobe公司); KAUST(沙特阿美科技大学); Microsoft(微软); KAIST(韩国科学技术院); General Robotics(通用机器人); Toyota Material Handling North America(丰田物料搬运北美公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The Women in Computer Vision Workshop (WiCV@CVPR 2025) was held in conjunction with the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025) in Nashville, Tennessee, United States. This report presents an overview of the workshop program, participation statistics, mentorship outcomes, and historical trends from previous WiCV editions. The goal is to document the impact and evolution of WiCV as a reference for future editions and for other initiatives aimed at advancing diversity, equity, and inclusion within the AI and computer vision communities. WiCV@CVPR 2025 marked the 16th edition of this long-standing event dedicated to increasing the visibility, inclusion, and professional growth of women and underrepresented minorities in the computer vision community. This year’s workshop featured 14 accepted papers in the CVPR Workshop Proceedings out of 32 full-paper submissions. Five of these were selected for oral presentations, while all 14 were also presented as posters, along with 36 extended abstract posters accepted from 62 short-paper submissions, which are not included in the proceedings. The mentoring program matched 80 mentees with 37 mentors from both academia and industry. The 2025 edition attracted over 100 onsite participants, fostering rich technical and networking interactions across all career stages. Supported by 10 sponsors and approximately 44,000 USD in travel grants and diversity awards, WiCV continued its mission to empower emerging researchers and amplify diverse voices in computer vision.
zh

[CV-69] Harnessing Diffusion-Generated Synthetic Images for Fair Image Classification

【速读】:该论文旨在解决图像分类系统中因训练数据群体代表性不均而继承的偏见问题,例如在头发颜色分类任务中,金发常与女性过度关联,导致刻板印象强化。其解决方案的关键在于利用扩散模型(Diffusion Models)进行微调以生成更均衡的训练数据:首先采用LoRA和DreamBooth等微调技术直接从各群体样本中学习,从而更准确地保留原始数据分布;进一步地,为避免单个DreamBooth模型因组内变异过大而性能下降,提出对每个群体内的图像进行聚类,并为每个簇训练独立的DreamBooth模型,最终使用这些模型生成群体平衡数据用于预训练,再在真实数据上微调。实验表明,该方法在多个基准测试中优于原始Stable Diffusion,且在严重偏见数据集上超越当前最先进去偏技术(如Group-DRO)。

链接: https://arxiv.org/abs/2511.08711
作者: Abhipsa Basu,Aviral Gupta,Abhijnya Bhat,R. Venkatesh Babu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image classification systems often inherit biases from uneven group representation in training data. For example, in face datasets for hair color classification, blond hair may be disproportionately associated with females, reinforcing stereotypes. A recent approach leverages the Stable Diffusion model to generate balanced training data, but these models often struggle to preserve the original data distribution. In this work, we explore multiple diffusion-finetuning techniques, e.g., LoRA and DreamBooth, to generate images that more accurately represent each training group by learning directly from their samples. Additionally, in order to prevent a single DreamBooth model from being overwhelmed by excessive intra-group variations, we explore a technique of clustering images within each group and train a DreamBooth model per cluster. These models are then used to generate group-balanced data for pretraining, followed by fine-tuning on real data. Experiments on multiple benchmarks demonstrate that the studied finetuning approaches outperform vanilla Stable Diffusion on average and achieve results comparable to SOTA debiasing techniques like Group-DRO, while surpassing them as the dataset bias severity increases.
zh

[CV-70] Stabilizing Direct Training of Spiking Neural Networks: Membrane Potential Initialization and Threshold-robust Surrogate Gradient WACV2026

【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在直接训练过程中面临的两个核心挑战:时间协变量偏移(Temporal Covariate Shift, TCS)和可学习阈值下的梯度不稳定问题。针对这些问题,作者提出了两项关键技术:MP-Init(膜电位初始化),通过将初始膜电位对齐到其稳态分布来缓解TCS;以及TrSG(阈值鲁棒替代梯度),用于稳定训练过程中与阈值电压相关的梯度流动。实验表明,该方法在静态和动态图像数据集上均达到当前最优性能。

链接: https://arxiv.org/abs/2511.08708
作者: Hyunho Kook,Byeongho Yu,Jeong Min Oh,Eunhyeok Park
机构: Pohang University of Science and Technology (POSTECH)
类目: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by WACV 2026

点击查看摘要

Abstract:Recent advancements in the direct training of Spiking Neural Networks (SNNs) have demonstrated high-quality outputs even at early timesteps, paving the way for novel energy-efficient AI paradigms. However, the inherent non-linearity and temporal dependencies in SNNs introduce persistent challenges, such as temporal covariate shift (TCS) and unstable gradient flow with learnable neuron thresholds. In this paper, we present two key innovations: MP-Init (Membrane Potential Initialization) and TrSG (Threshold-robust Surrogate Gradient). MP-Init addresses TCS by aligning the initial membrane potential with its stationary distribution, while TrSG stabilizes gradient flow with respect to threshold voltage during training. Extensive experiments validate our approach, achieving state-of-the-art accuracy on both static and dynamic image datasets. The code is available at: this https URL
zh

[CV-71] Rethinking generative image pretraining: How far are we from scaling up next-pixel prediction?

【速读】:该论文旨在解决统一视觉模型中自回归式逐像素预测(autoregressive next-pixel prediction)的缩放规律问题,即如何在不同计算预算下最优地平衡模型规模与数据量以提升图像分类和生成任务的性能。其解决方案的关键在于通过控制FLOPs(浮点运算次数)一致的实验设计,在32×32至更高分辨率下系统性地评估模型缩放策略,并发现:图像生成任务的最优缩放需数据量增长速度是图像分类任务的3–5倍;随着分辨率提升,模型规模的增长速率远超数据规模;且当前主要瓶颈为计算资源而非训练数据量,预计未来五年内基于像素级建模的图像生成将具备可行性。

链接: https://arxiv.org/abs/2511.08704
作者: Xinchen Yan,Chen Liang,Lijun Yu,Adams Wei Yu,Yifeng Lu,Quoc V. Le
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper investigates the scaling properties of autoregressive next-pixel prediction, a simple, end-to-end yet under-explored framework for unified vision models. Starting with images at resolutions of 32x32, we train a family of Transformers using IsoFlops profiles across compute budgets up to 7e19 FLOPs and evaluate three distinct target metrics: next-pixel prediction objective, ImageNet classification accuracy, and generation quality measured by Fr’echet Distance. First, optimal scaling strategy is critically task-dependent. At a fixed 32x32 resolution alone, the optimal scaling properties for image classification and image generation diverge, where generation optimal setup requires the data size grow three to five times faster than for the classification optimal setup. Second, as image resolution increases, the optimal scaling strategy indicates that the model size must grow much faster than data size. Surprisingly, by projecting our findings, we discover that the primary bottleneck is compute rather than the amount of training data. As compute continues to grow four to five times annually, we forecast the feasibility of pixel-by-pixel modeling of images within the next five years.
zh

[CV-72] Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding

【速读】:该论文旨在解决视频基础模型(video foundation models)在提取和存储视觉特征时可能泄露敏感个人信息(如肤色、性别或着装)的问题。现有隐私保护方法通常在输入像素层面进行匿名化处理,需重新训练整个模型且仅适用于特定任务,难以适配当前冻结权重的视频基础模型。解决方案的关键在于提出一种轻量级的“匿名化适配器模块”(Anonymizing Adapter Module, AAM),该模块可在冻结的视频编码器中以即插即用方式部署,通过三个新设计的训练目标实现隐私去相关与任务性能保留:(1) 片段级自监督隐私目标以降低静态片段间的互信息,(2) 联合训练目标维持已见任务的通用性,(3) 潜空间一致性损失提升未见任务的泛化能力。实验表明,AAM在保持接近基线性能的同时,显著减少35%的隐私泄露,并有效缓解动作识别中的性别偏见问题。

链接: https://arxiv.org/abs/2511.08666
作者: Joseph Fioresi,Ishan Rajendrakumar Dave,Mubarak Shah
机构: University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce a novel formulation of visual privacy preservation for video foundation models that operates entirely in the latent space. While spatio-temporal features learned by foundation models have deepened general understanding of video content, sharing or storing these extracted visual features for downstream tasks inadvertently reveals sensitive personal information like skin color, gender, or clothing. Current privacy preservation methods focus on input-pixel-level anonymization, which requires retraining the entire utility video model and results in task-specific anonymization, making them unsuitable for recent video foundational models. To address these challenges, we introduce a lightweight Anonymizing Adapter Module (AAM) that removes private information from video features while retaining general task utility. AAM can be applied in a plug-and-play fashion to frozen video encoders, minimizing the computational burden of finetuning and re-extracting features. Our framework employs three newly designed training objectives: (1) a clip-level self-supervised privacy objective to reduce mutual information between static clips, (2) a co-training objective to retain utility across seen tasks, and (3) a latent consistency loss for generalization on unseen tasks. Our extensive evaluations demonstrate a significant 35% reduction in privacy leakage while maintaining near-baseline utility performance across various downstream tasks: Action Recognition (Kinetics400, UCF101, HMDB51), Temporal Action Detection (THUMOS14), and Anomaly Detection (UCF-Crime). We also provide an analysis on anonymization for sensitive temporal attribute recognition. Additionally, we propose new protocols for assessing gender bias in action recognition models, showing that our method effectively mitigates such biases and promotes more equitable video understanding.
zh

[CV-73] RS-Net: Context-Aware Relation Scoring for Dynamic Scene Graph Generation

【速读】:该论文旨在解决动态场景图生成(Dynamic Scene Graph Generation, DSGG)中因训练仅依赖标注的物体对关系而缺乏对非相关物体对的判别能力,导致推理阶段难以识别有意义关系的问题。解决方案的关键在于提出一种模块化框架——关系评分网络(Relation Scoring Network, RS-Net),其通过空间上下文编码器(含可学习上下文令牌)和时序编码器分别建模物体间的空间交互与长程时间依赖,并将两者融合得到关系得分,进而集成到统一的三元组评分机制中以提升关系预测性能。该方法无需修改现有DSGG模型结构即可有效增强对稀有关系的识别能力,尤其在处理关系分布长尾问题上表现显著。

链接: https://arxiv.org/abs/2511.08651
作者: Hae-Won Jo,Yeong-Jun Cho
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Dynamic Scene Graph Generation (DSGG) models how object relations evolve over time in videos. However, existing methods are trained only on annotated object pairs and lack guidance for non-related pairs, making it difficult to identify meaningful relations during inference. In this paper, we propose Relation Scoring Network (RS-Net), a modular framework that scores the contextual importance of object pairs using both spatial interactions and long-range temporal context. RS-Net consists of a spatial context encoder with learnable context tokens and a temporal encoder that aggregates video-level information. The resulting relation scores are integrated into a unified triplet scoring mechanism to enhance relation prediction. RS-Net can be easily integrated into existing DSGG models without architectural changes. Experiments on the Action Genome dataset show that RS-Net consistently improves both Recall and Precision across diverse baselines, with notable gains in mean Recall, highlighting its ability to address the long-tailed distribution of relations. Despite the increased number of parameters, RS-Net maintains competitive efficiency, achieving superior performance over state-of-the-art methods.
zh

[CV-74] Predict and Resist: Long-Term Accident Anticipation under Sensor Noise AAAI-26 AAAI

【速读】:该论文旨在解决自动驾驶中事故预判(accident anticipation)的两个关键挑战:一是传感器输入在恶劣天气、运动模糊或硬件限制下容易出现噪声或退化,二是需要在保证预警可靠性的同时实现及时预测,即平衡提前预警与误报抑制。其解决方案的核心在于提出一个统一框架,融合基于扩散模型(diffusion-based denoising)的去噪模块与具备时间感知能力的演员-评论家(actor-critic)结构:前者通过迭代优化重建鲁棒的图像和目标特征,保留传感器退化下的关键运动与交互线索;后者利用长时程时序推理与时间加权奖励机制,精准判断最优预警时机,从而实现更早、更稳定且符合人类直觉的事故预警。

链接: https://arxiv.org/abs/2511.08640
作者: Xingcheng Liu,Bin Rao,Yanchen Guan,Chengyue Wang,Haicheng Liao,Jiaxun Zhang,Chengyu Lin,Meixin Zhu,Zhenning Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: accepted by the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26)

点击查看摘要

Abstract:Accident anticipation is essential for proactive and safe autonomous driving, where even a brief advance warning can enable critical evasive actions. However, two key challenges hinder real-world deployment: (1) noisy or degraded sensory inputs from weather, motion blur, or hardware limitations, and (2) the need to issue timely yet reliable predictions that balance early alerts with false-alarm suppression. We propose a unified framework that integrates diffusion-based denoising with a time-aware actor-critic model to address these challenges. The diffusion module reconstructs noise-resilient image and object features through iterative refinement, preserving critical motion and interaction cues under sensor degradation. In parallel, the actor-critic architecture leverages long-horizon temporal reasoning and time-weighted rewards to determine the optimal moment to raise an alert, aligning early detection with reliability. Experiments on three benchmark datasets (DAD, CCD, A3D) demonstrate state-of-the-art accuracy and significant gains in mean time-to-accident, while maintaining robust performance under Gaussian and impulse noise. Qualitative analyses further show that our model produces earlier, more stable, and human-aligned predictions in both routine and highly complex traffic scenarios, highlighting its potential for real-world, safety-critical deployment.
zh

[CV-75] CADIC: Continual Anomaly Detection Based on Incremental Coreset

【速读】:该论文旨在解决持续异常检测(Continual Anomaly Detection, CAD)中因动态数据分布导致的灾难性遗忘问题,同时克服现有基于嵌入的记忆库方法在任务间需构建类特定子记忆库所引发的灵活性与可扩展性不足。其解决方案的关键在于提出一种所有任务共享统一记忆库的新框架:训练阶段通过增量更新固定大小的coreset中的嵌入表示,实现对连续任务的知识持续获取,避免任务特异性内存碎片化;推理阶段则采用最近邻匹配机制计算异常分数,从而在MVTec AD和Visa等基准数据集上达到SOTA性能(图像级AUROC分别为0.972和0.891),并在真实电子纸数据集上实现100%异常样本检出率,验证了方法在实际场景中的鲁棒性。

链接: https://arxiv.org/abs/2511.08634
作者: Gen Yang,Zhipeng Deng,Junfeng Man
机构: Hunan First Normal University (湖南第一师范学院); Changsha University of Science and Technology (长沙理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 8 figures

点击查看摘要

Abstract:The primary objective of Continual Anomaly Detection (CAD) is to learn the normal patterns of new tasks under dynamic data distribution assumptions while mitigating catastrophic forgetting. Existing embedding-based CAD approaches continuously update a memory bank with new embeddings to adapt to sequential tasks. However, these methods require constructing class-specific sub-memory banks for each task, which restricts their flexibility and scalability. To address this limitation, we propose a novel CAD framework where all tasks share a unified memory bank. During training, the method incrementally updates embeddings within a fixed-size coreset, enabling continuous knowledge acquisition from sequential tasks without task-specific memory fragmentation. In the inference phase, anomaly scores are computed via a nearest-neighbor matching mechanism, achieving state-of-the-art detection accuracy. We validate the method through comprehensive experiments on MVTec AD and Visa datasets. Results show that our approach outperforms existing baselines, achieving average image-level AUROC scores of 0.972 (MVTec AD) and 0.891 (Visa). Notably, on a real-world electronic paper dataset, it demonstrates 100% accuracy in anomaly sample detection, confirming its robustness in practical scenarios. The implementation will be open-sourced on GitHub.
zh

[CV-76] me-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising

【速读】:该论文旨在解决基于扩散模型的视频生成中运动控制精度不足的问题,现有方法依赖图像或文本条件,难以实现精确的运动调控;同时,已有运动条件生成方案通常需要针对特定模型进行微调,计算成本高且灵活性差。其解决方案的关键在于提出一种无需训练、即插即用的框架 Time-to-Move (TTM),通过用户友好的操作(如剪切拖拽或基于深度的重投影)生成粗略参考动画作为粗粒度运动提示,并借鉴 SDEdit 中布局提示的思想将其拓展至视频域。TTM 采用像素级图像条件保持外观一致性,并引入双时钟去噪(dual-clock denoising)机制——一种区域依赖的去噪策略,在指定运动区域内强制强对齐,其他区域则保留自然动态,从而在不增加额外训练或推理开销的前提下,显著提升运动控制精度与视频真实性。

链接: https://arxiv.org/abs/2511.08633
作者: Assaf Singer,Noam Rotstein,Amir Mann,Ron Kimmel,Or Litany
机构: Technion – Israel Institute of Technology (以色列理工学院); NVIDIA
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Diffusion-based video generation can create realistic videos, yet existing image- and text-based conditioning fails to offer precise motion control. Prior methods for motion-conditioned synthesis typically require model-specific fine-tuning, which is computationally expensive and restrictive. We introduce Time-to-Move (TTM), a training-free, plug-and-play framework for motion- and appearance-controlled video generation with image-to-video (I2V) diffusion models. Our key insight is to use crude reference animations obtained through user-friendly manipulations such as cut-and-drag or depth-based reprojection. Motivated by SDEdit’s use of coarse layout cues for image editing, we treat the crude animations as coarse motion cues and adapt the mechanism to the video domain. We preserve appearance with image conditioning and introduce dual-clock denoising, a region-dependent strategy that enforces strong alignment in motion-specified regions while allowing flexibility elsewhere, balancing fidelity to user intent with natural dynamics. This lightweight modification of the sampling process incurs no additional training or runtime cost and is compatible with any backbone. Extensive experiments on object and camera motion benchmarks show that TTM matches or exceeds existing training-based baselines in realism and motion control. Beyond this, TTM introduces a unique capability: precise appearance control through pixel-level conditioning, exceeding the limits of text-only prompting. Visit our project page for video examples and code: this https URL.
zh

[CV-77] Learning Topology-Driven Multi-Subspace Fusion for Grassmannian Deep Network AAAI2026

【速读】:该论文旨在解决现有基于Grassmann流形的几何表示学习方法主要依赖静态单子空间表示,忽视了多个子空间之间动态交互关系的问题,从而难以捕捉复杂几何结构。其解决方案的关键在于提出一种拓扑驱动的多子空间融合框架,通过两个核心创新实现:(1) 受Kolmogorov-Arnold表示定理启发,设计了一种自适应多子空间建模机制,利用拓扑收敛分析动态选择并加权任务相关的子空间;(2) 引入多子空间交互模块,通过流形上的Fréchet均值优化融合异构几何表示。该方法在投影度量拓扑下建立了自适应子空间的收敛性保障,并结合Riemannian批量归一化和互信息正则化提升判别性和鲁棒性,成功将欧氏空间中成熟的多通道交互思想迁移至非欧几里得域,显著提升了模型的性能与可解释性。

链接: https://arxiv.org/abs/2511.08628
作者: Xuan Yu,Tianyang Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures, accepted at AAAI 2026

点击查看摘要

Abstract:Grassmannian manifold offers a powerful carrier for geometric representation learning by modelling high-dimensional data as low-dimensional subspaces. However, existing approaches predominantly rely on static single-subspace representations, neglecting the dynamic interplay between multiple subspaces critical for capturing complex geometric structures. To address this limitation, we propose a topology-driven multi-subspace fusion framework that enables adaptive subspace collaboration on the Grassmannian. Our solution introduces two key innovations: (1) Inspired by the Kolmogorov-Arnold representation theorem, an adaptive multi-subspace modelling mechanism is proposed that dynamically selects and weights task-relevant subspaces via topological convergence analysis, and (2) a multi-subspace interaction block that fuses heterogeneous geometric representations through Fréchet mean optimisation on the manifold. Theoretically, we establish the convergence guarantees of adaptive subspaces under a projection metric topology, ensuring stable gradient-based optimisation. Practically, we integrate Riemannian batch normalisation and mutual information regularisation to enhance discriminability and robustness. Extensive experiments on 3D action recognition (HDM05, FPHA), EEG classification (MAMEM-SSVEPII), and graph tasks demonstrate state-of-the-art performance. Our work not only advances geometric deep learning but also successfully adapts the proven multi-channel interaction philosophy of Euclidean networks to non-Euclidean domains, achieving superior discriminability and interpretability.
zh

[CV-78] A Multi-Drone Multi-View Dataset and Deep Learning Framework for Pedestrian Detection and Tracking

【速读】:该论文旨在解决动态无人机(drone)场景下行人跟踪的挑战,特别是由于相机位置不断变化和复杂遮挡导致的传统静态摄像头方法性能显著下降的问题。解决方案的关键在于提出一个名为MATRIX的多视角数据集和一套深度学习框架,其中包含实时相机标定、基于特征的图像配准以及鸟瞰图(bird’s-eye-view, BEV)表示下的多视角特征融合技术,从而在复杂城市环境中实现高精度的检测与跟踪,且具备良好的鲁棒性和可迁移性。

链接: https://arxiv.org/abs/2511.08615
作者: Kosta Dakic,Kanchana Thilakarathna,Rodrigo N. Calheiros,Teng Joon Lim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: Introduction of the MATRIX Dataset, featuring synchronized footage from eight drones in an urban environment with comprehensive annotations for detection and tracking, available at this https URL

点击查看摘要

Abstract:Multi-drone surveillance systems offer enhanced coverage and robustness for pedestrian tracking, yet existing approaches struggle with dynamic camera positions and complex occlusions. This paper introduces MATRIX (Multi-Aerial TRacking In compleX environments), a comprehensive dataset featuring synchronized footage from eight drones with continuously changing positions, and a novel deep learning framework for multi-view detection and tracking. Unlike existing datasets that rely on static cameras or limited drone coverage, MATRIX provides a challenging scenario with 40 pedestrians and a significant architectural obstruction in an urban environment. Our framework addresses the unique challenges of dynamic drone-based surveillance through real-time camera calibration, feature-based image registration, and multi-view feature fusion in bird’s-eye-view (BEV) representation. Experimental results demonstrate that while static camera methods maintain over 90% detection and tracking precision and accuracy metrics in a simplified MATRIX environment without an obstruction, 10 pedestrians and a much smaller observational area, their performance significantly degrades in the complex environment. Our proposed approach maintains robust performance with \sim 90% detection and tracking accuracy, as well as successfully tracks \sim 80% of trajectories under challenging conditions. Transfer learning experiments reveal strong generalization capabilities, with the pretrained model achieving much higher detection and tracking accuracy performance compared to training the model from scratch. Additionally, systematic camera dropout experiments reveal graceful performance degradation, demonstrating practical robustness for real-world deployments where camera failures may occur. The MATRIX dataset and framework provide essential benchmarks for advancing dynamic multi-view surveillance systems.
zh

[CV-79] Assessing Identity Leakage in Talking Face Generation: Metrics and Evaluation Framework

【速读】:该论文旨在解决基于图像修复(inpainting-based)的说话人脸生成方法中存在的唇部泄漏(lip leaking)问题,即生成的唇部运动不仅受驱动音频影响,还受到身份参考图像的干扰,导致语音与唇动不一致且难以通过传统评估指标检测。解决方案的关键在于提出了一套系统性的评估方法,包含三种互补的测试设置:无声输入生成、音频-视频错配合成以及音视频匹配合成,并引入了唇同步偏差(lip-sync discrepancy)和无声音频基础唇同步得分等衍生指标,从而能够量化并分析唇部泄漏现象;同时,研究不同身份参考选择对泄漏的影响,为参考图像的设计提供指导,整个方法具有模型无关性,可作为未来说话人脸生成研究的更可靠基准。

链接: https://arxiv.org/abs/2511.08613
作者: Dogucan Yaman,Fevziye Irem Eyiokur,Hazım Kemal Ekenel,Alexander Waibel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Inpainting-based talking face generation aims to preserve video details such as pose, lighting, and gestures while modifying only lip motion, often using an identity reference image to maintain speaker consistency. However, this mechanism can introduce lip leaking, where generated lips are influenced by the reference image rather than solely by the driving audio. Such leakage is difficult to detect with standard metrics and conventional test setup. To address this, we propose a systematic evaluation methodology to analyze and quantify lip leakage. Our framework employs three complementary test setups: silent-input generation, mismatched audio-video pairing, and matched audio-video synthesis. We also introduce derived metrics including lip-sync discrepancy and silent-audio-based lip-sync scores. In addition, we study how different identity reference selections affect leakage, providing insights into reference design. The proposed methodology is model-agnostic and establishes a more reliable benchmark for future research in talking face generation.
zh

[CV-80] Case Study: Transformer-Based Solution for the Automatic Digitization of Gas Plants

【速读】:该论文旨在解决天然气设施(gas plant)数字化过程中信息提取效率低下的问题,即如何自动化地从非结构化PDF文档中准确获取设计数据和拓扑结构信息。解决方案的关键在于融合多种人工智能技术——包括光学字符识别(OCR)、视觉大语言模型(Vision LLM)、目标检测(Object Detection)、关系推理(Relational Reasoning)与优化算法,并引入一种新型Transformer架构以增强对设备组件间复杂关系的建模能力,从而实现高精度的信息结构化输出:文本类设计数据提取准确率达91%,组件识别准确率93%,层级结构提取准确率约80%。

链接: https://arxiv.org/abs/2511.08609
作者: I. Bailo,F. Buonora,G. Ciarfaglia,L. T. Consoli,A. Evangelista,M. Gabusi,M. Ghiani,C. Petracca Ciavarella,F. Picariello,F. Sarcina,F. Tuosto,V. Zullo,L. Airoldi,G. Bruno,D. D. Gobbo,S. Pezzenati,G. A. Tona
机构: Eng AI&Data @Engineering Group (Eng AI&Data @Engineering Group); Snam S.p.A. (Snam S.p.A.)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The energy transition is a key theme of the last decades to determine a future of eco-sustainability, and an area of such importance cannot disregard digitization, innovation and the new technological tools available. This is the context in which the Generative Artificial Intelligence models described in this paper are positioned, developed by Engineering Ingegneria Informatica SpA in order to automate the plant structures acquisition of SNAM energy infrastructure, a leading gas transportation company in Italy and Europe. The digitization of a gas plant consists in registering all its relevant information through the interpretation of the related documentation. The aim of this work is therefore to design an effective solution based on Artificial Intelligence techniques to automate the extraction of the information necessary for the digitization of a plant, in order to streamline the daily work of MGM users. The solution received the PID of the plant as input, each one in pdf format, and uses OCR, Vision LLM, Object Detection, Relational Reasoning and optimization algorithms to return an output consisting of two sets of information: a structured overview of the relevant design data and the hierarchical framework of the plant. To achieve convincing results, we extend a state-of-the-art model for Scene Graph Generation introducing a brand new Transformer architecture with the aim of deepening the analysis of the complex relations between the plant’s components. The synergistic use of the listed AI-based technologies allowed to overcome many obstacles arising from the high variety of data, due to the lack of standardization. An accuracy of 91% has been achieved in the extraction of textual information relating to design data. Regarding the plants topology, 93% of components are correctly identified and the hierarchical structure is extracted with an accuracy around 80%.
zh

[CV-81] Exploring Category-level Articulated Object Pose Tracking on SE(3) Manifolds

【速读】:该论文旨在解决关节式物体(articulated objects)在多帧场景下的位姿跟踪(pose tracking)问题,这一任务相较于刚性物体更具挑战性,因其受内在运动学约束限制且结构复杂。解决方案的关键在于提出了一种基于点对特征(Point Pair Features, PPF)的位姿跟踪框架——PPF-Tracker:首先在SE(3)李群空间中对点云进行准规范变换(quasi-canonicalization),利用PPF对点对几何关系建模并结合SE(3)不变性预测位姿投票参数;随后引入关节轴语义信息以统一施加全结构的运动学约束,从而实现对关节式物体各部件协同、鲁棒的位姿估计。

链接: https://arxiv.org/abs/2511.05996
作者: Xianhui Meng,Yukang Huo,Li Zhang,Liu Liu,Haonan Jiang,Yan Zhong,Pingrui Zhang,Cewu Lu,Jun Liu
机构: 1. Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); 2. School of Computer Science and Engineering, South China University of Technology (华南理工大学计算机科学与工程学院); 3. School of Information Science and Technology, Sun Yat-sen University (中山大学信息科学与技术学院); 4. School of Data Science, Fudan University (复旦大学数据科学学院); 5. Beijing Key Laboratory of Intelligent Perception and Computing, Beijing Institute of Technology (北京理工大学智能感知与计算北京市重点实验室); 6. School of Artificial Intelligence, Beihang University (北京航空航天大学人工智能学院); 7. School of Computer Science and Technology, Tsinghua University (清华大学计算机科学与技术系); 8. Institute of Artificial Intelligence, Peking University (北京大学人工智能研究院); 9. Alibaba Cloud (阿里巴巴云)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Articulated objects are prevalent in daily life and robotic manipulation tasks. However, compared to rigid objects, pose tracking for articulated objects remains an underexplored problem due to their inherent kinematic constraints. To address these challenges, this work proposes a novel point-pair-based pose tracking framework, termed \textbfPPF-Tracker. The proposed framework first performs quasi-canonicalization of point clouds in the SE(3) Lie group space, and then models articulated objects using Point Pair Features (PPF) to predict pose voting parameters by leveraging the invariance properties of SE(3). Finally, semantic information of joint axes is incorporated to impose unified kinematic constraints across all parts of the articulated object. PPF-Tracker is systematically evaluated on both synthetic datasets and real-world scenarios, demonstrating strong generalization across diverse and challenging environments. Experimental results highlight the effectiveness and robustness of PPF-Tracker in multi-frame pose tracking of articulated objects. We believe this work can foster advances in robotics, embodied intelligence, and augmented reality. Codes are available at this https URL.
zh

[CV-82] oken Is All You Need: Cognitive Planning through Belief-Intent Co-Evolution

【速读】:该论文旨在解决端到端自动驾驶(End-to-End Autonomous Driving, E2EAD)中长期依赖的“全场景建模”假设问题,即传统方法认为高精度规划必须依赖对环境的完整重建。其解决方案的关键在于提出一种基于认知科学启发的“信念-意图共演化”机制(belief-intent co-evolution),通过一组语义丰富的稀疏意图标记(intent tokens)实现高效决策:模型无需显式预测未来状态,而是利用这些标记在感知信念与目标意图之间动态协同优化,从而在nuPlan基准上实现0.382米平均位移误差(ADE),较仅使用意图标记提升21.6%,且显式重建损失反而降低性能,验证了任务驱动的信念-意图耦合足以支撑鲁棒规划。这一范式将规划从“反应式行为”重构为“理解式推理”,推动自动驾驶系统向具备想象力的前瞻性智能体演进。

链接: https://arxiv.org/abs/2511.05540
作者: Shiyao Sang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
备注: 7 pages, 3 figures. A paradigm shift from reconstructing the world to understanding it: planning through belief-intent co-evolution

点击查看摘要

Abstract:We challenge the long-standing assumption that exhaustive scene modeling is required for high-performance end-to-end autonomous driving (E2EAD). Inspired by cognitive science, we propose that effective planning arises not from reconstructing the world, but from the co-evolution of belief and intent within a minimal set of semantically rich tokens. Experiments on the nuPlan benchmark (720 scenarios, 11k+ samples) reveal three principles: (1) sparse intent tokens alone achieve 0.487 m ADE, demonstrating strong performance without future prediction; (2) conditioning trajectory decoding on predicted future tokens reduces ADE to 0.382 m, a 21.6% improvement, showing that performance emerges from cognitive planning; and (3) explicit reconstruction loss degrades performance, confirming that task-driven belief-intent co-evolution suffices under reliable perception inputs. Crucially, we observe the emergence of cognitive consistency: through prolonged training, the model spontaneously develops stable token dynamics that balance current perception (belief) and future goals (intent). This process, accompanied by “temporal fuzziness,” enables robustness under uncertainty and continuous self-optimization. Our work establishes a new paradigm: intelligence lies not in pixel fidelity, but in the tokenized duality of belief and intent. By reframing planning as understanding rather than reaction, TIWM bridges the gap between world models and VLA systems, paving the way for foresightful agents that plan through imagination. Note: Numerical comparisons with methods reporting results on nuScenes are indicative only, as nuPlan presents a more challenging planning-focused evaluation.
zh

[CV-83] Augment to Augment: Diverse Augmentations Enable Competitive Ultra-Low-Field MRI Enhancement MICCAI2025

【速读】:该论文旨在解决超低场强磁共振成像(Ultra-low-field MRI, ULF-MRI)因信噪比(SNR)低、空间分辨率差及对比度偏离高场标准而导致图像质量受限的问题。其解决方案的关键在于通过任务适配的数据增强策略,提升标准深度学习模型在有限配对训练数据(仅50例3D体积)下的图像增强性能,特别是引入基于高场数据的辅助任务以增强模型的泛化能力与重建 fidelity,从而实现从ULF到高场外观的高质量图像映射。

链接: https://arxiv.org/abs/2511.09366
作者: Felix F Zimmermann
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注: MICCAI 2025 ULF-EnC Challenge

点击查看摘要

Abstract:Ultra-low-field (ULF) MRI promises broader accessibility but suffers from low signal-to-noise ratio (SNR), reduced spatial resolution, and contrasts that deviate from high-field standards. Imageto- image translation can map ULF images to a high-field appearance, yet efficacy is limited by scarce paired training data. Working within the ULF-EnC challenge constraints (50 paired 3D volumes; no external data), we study how task-adapted data augmentations impact a standard deep model for ULF image enhancement. We show that strong, diverse augmentations, including auxiliary tasks on high-field data, substantially improve fidelity. Our submission ranked third by brain-masked SSIM on the public validation leaderboard and fourth by the official score on the final test leaderboard. Code is available at this https URL.
zh

[CV-84] RadHARSimulator V2: Video to Doppler Generator

【速读】:该论文旨在解决雷达辅助的人类活动识别(Radar-based Human Activity Recognition, HAR)领域中缺乏全面仿真方法的问题。现有软件多依赖于物理模型或运动捕捉数据,灵活性不足。解决方案的关键在于提出一个直接从视频帧生成多普勒谱的模拟器(RadHARSimulator V2),其核心包括两个模块:计算机视觉模块通过目标检测、二维与三维姿态估计及卡尔曼滤波实现高精度时序人体姿态重建;雷达模块则利用萨维茨基-戈拉过滤(Savitzky-Golay method)进行姿态插值平滑,并结合延迟模型和镜像法模拟自由空间与穿墙场景下的回波信号,最终通过脉冲压缩、动目标指示(MTI)以及DnCNN网络生成距离-时间图(Range-Time Map)和多普勒-时间图(Doppler-Time Map, DTM),并提取DTM上的脊线特征(ridge features)。此外,论文还设计了一种混合并行-串行神经网络架构用于雷达HAR任务,实验证明该模拟器与网络模型均具有效性和实用性。

链接: https://arxiv.org/abs/2511.09022
作者: Weicheng Gao
机构: 未知
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 16 figures, 8 tables

点击查看摘要

Abstract:Radar-based human activity recognition (HAR) still lacks a comprehensive simulation method. Existing software is developed based on models or motion-captured data, resulting in limited flexibility. To address this issue, a simulator that directly generates Doppler spectra from recorded video footage (RadHARSimulator V2) is presented in this paper. Both computer vision and radar modules are included in the simulator. In computer vision module, the real-time model for object detection with global nearest neighbor is first used to detect and track human targets in the video. Then, the high-resolution network is used to estimate two-dimensional poses of the detected human targets. Next, the three-dimensional poses of the detected human targets are obtained by nearest matching method. Finally, smooth temporal three-dimensional pose estimation is achieved through Kalman filtering. In radar module, pose interpolation and smoothing are first achieved through the Savitzky-Golay method. Second, the delay model and the mirror method are used to simulate echoes in both free-space and through-the-wall scenarios. Then, range-time map is generated using pulse compression, moving target indication, and DnCNN. Next, Doppler-time map (DTM) is generated using short-time Fourier transform and DnCNN again. Finally, the ridge features on the DTM are extracted using the maximum local energy method. In addition, a hybrid parallel-serial neural network architecture is proposed for radar-based HAR. Numerical experiments are conducted and analyzed to demonstrate the effectiveness of the designed simulator and the proposed network model. The open-source code of this work can be found in: this https URL.
zh

[CV-85] MicroEvoEval: A Systematic Evaluation Framework for Image-Based Microstructure Evolution Prediction AAAI2026

【速读】:该论文旨在解决微结构演化(Microstructure Evolution, MicroEvo)模拟中缺乏标准化评估基准的问题,现有研究普遍存在对物理保真度关注不足、误差随时间传播未被充分分析,以及专用模型与先进时空架构之间缺乏系统比较等缺陷。解决方案的关键在于提出首个面向图像驱动的微结构演化预测的综合性基准测试平台 MicroEvoEval,通过涵盖14种模型(包括领域特定与通用架构)在四个代表性任务上的多维度评估,不仅量化数值精度和计算成本,还引入结构保持性指标以衡量物理保真度,并支持短中期与长期演化评估。实验结果表明,现代架构(如 VMamba)在保证更高物理保真度的同时,实现约一个数量级的计算效率提升,验证了其作为数据驱动材料科学中高效可靠代理模型的巨大潜力。

链接: https://arxiv.org/abs/2511.08955
作者: Qinyi Zhang,Duanyu Feng,Ronghui Han,Yangshuai Wang,Hao Wang
机构: National University of Singapore (新加坡国立大学); Sichuan University (四川大学)
类目: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Simulating microstructure evolution (MicroEvo) is vital for materials design but demands high numerical accuracy, efficiency, and physical fidelity. Although recent studies on deep learning (DL) offer a promising alternative to traditional solvers, the field lacks standardized benchmarks. Existing studies are flawed due to a lack of comparing specialized MicroEvo DL models with state-of-the-art spatio-temporal architectures, an overemphasis on numerical accuracy over physical fidelity, and a failure to analyze error propagation over time. To address these gaps, we introduce MicroEvoEval, the first comprehensive benchmark for image-based microstructure evolution prediction. We evaluate 14 models, encompassing both domain-specific and general-purpose architectures, across four representative MicroEvo tasks with datasets specifically structured for both short- and long-term assessment. Our multi-faceted evaluation framework goes beyond numerical accuracy and computational cost, incorporating a curated set of structure-preserving metrics to assess physical fidelity. Our extensive evaluations yield several key insights. Notably, we find that modern architectures (e.g., VMamba), not only achieve superior long-term stability and physical fidelity but also operate with an order-of-magnitude greater computational efficiency. The results highlight the necessity of holistic evaluation and identify these modern architectures as a highly promising direction for developing efficient and reliable surrogate models in data-driven materials science.
zh

[CV-86] ROI-based Deep Image Compression with Implicit Bit Allocation

【速读】:该论文旨在解决现有基于感兴趣区域(Region of Interest, ROI)的图像压缩方法中,依赖显式比特分配策略(即硬门控掩码)导致熵模型统计分布失衡、进而限制编码性能的问题。其解决方案的关键在于提出一种隐式比特分配机制,通过设计Mask-Guided Feature Enhancement (MGFE)模块实现灵活的区域自适应比特分配:该模块包含区域自适应注意力(Region-Adaptive Attention, RAA)和频域-空间域协同注意力(Frequency-Spatial Collaborative Attention, FSCA)结构,以增强全局与局部特征;同时采用双解码器分别重建前景和背景图像,使编码网络能够数据驱动地优化前景增强与背景质量保持之间的平衡。此方法首次在高保真区域自适应编码中应用隐式比特分配,显著提升了率失真性能。

链接: https://arxiv.org/abs/2511.08918
作者: Kai Hu,Han Wang,Renhe Liu,Zhilin Li,Shenghui Song,Yu Liu
机构: Tianjin University (天津大学); Hong Kong University of Science and Technology (香港科技大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Multimedia (cs.MM)
备注: 10 pages, 10 figures, journal

点击查看摘要

Abstract:Region of Interest (ROI)-based image compression has rapidly developed due to its ability to maintain high fidelity in important regions while reducing data redundancy. However, existing compression methods primarily apply masks to suppress background information before quantization. This explicit bit allocation strategy, which uses hard gating, significantly impacts the statistical distribution of the entropy model, thereby limiting the coding performance of the compression model. In response, this work proposes an efficient ROI-based deep image compression model with implicit bit allocation. To better utilize ROI masks for implicit bit allocation, this paper proposes a novel Mask-Guided Feature Enhancement (MGFE) module, comprising a Region-Adaptive Attention (RAA) block and a Frequency-Spatial Collaborative Attention (FSCA) block. This module allows for flexible bit allocation across different regions while enhancing global and local features through frequencyspatial domain collaboration. Additionally, we use dual decoders to separately reconstruct foreground and background images, enabling the coding network to optimally balance foreground enhancement and background quality preservation in a datadriven manner. To the best of our knowledge, this is the first work to utilize implicit bit allocation for high-quality regionadaptive coding. Experiments on the COCO2017 dataset show that our implicit-based image compression method significantly outperforms explicit bit allocation approaches in rate-distortion performance, achieving optimal results while maintaining satisfactory visual quality in the reconstructed background regions.
zh

[CV-87] OG-PCL: Efficient Sparse Point Cloud Processing for Human Activity Recognition

【速读】:该论文旨在解决基于毫米波(mmWave)雷达的人员活动识别(Human Activity Recognition, HAR)中因点云稀疏性导致的特征提取困难与模型复杂度高的问题。其核心解决方案是提出一种轻量级的Occupancy-Gated Parallel-CNN Bi-LSTM(OG-PCL)网络架构,关键创新在于引入了三视图并行结构以有效保留三维空间信息,并设计了Occupancy-Gated Convolution(OGConv)模块,通过占用补偿机制增强对稀疏点云的感知能力,从而在仅0.83M参数规模下实现91.75%的准确率,显著优于2D CNN、PointNet和3D CNN等基线方法。

链接: https://arxiv.org/abs/2511.08910
作者: Jiuqi Yan,Chendong Xu,Dongyu Liu
机构: 未知
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Human activity recognition (HAR) with millimeter-wave (mmWave) radar offers a privacy-preserving and robust alternative to camera- and wearable-based approaches. In this work, we propose the Occupancy-Gated Parallel-CNN Bi-LSTM (OG-PCL) network to process sparse 3D radar point clouds produced by mmWave sensing. Designed for lightweight deployment, the parameter size of the proposed OG-PCL is only 0.83M and achieves 91.75 accuracy on the RadHAR dataset, outperforming those existing baselines such as 2D CNN, PointNet, and 3D CNN methods. We validate the advantages of the tri-view parallel structure in preserving spatial information across three dimensions while maintaining efficiency through ablation studies. We further introduce the Occupancy-Gated Convolution (OGConv) block and demonstrate the necessity of its occupancy compensation mechanism for handling sparse point clouds. The proposed OG-PCL thus offers a compact yet accurate framework for real-time radar-based HAR on lightweight platforms.
zh

[CV-88] 3D-TDA - Topological feature extraction from 3D images for Alzheimers disease classification

【速读】:该论文旨在解决阿尔茨海默病(Alzheimer’s Disease, AD)早期、客观且准确的临床诊断问题,尤其是在使用最低成本测量模态(如结构磁共振成像,structural MRI)的情况下。其解决方案的关键在于提出了一种基于持久同调(persistent homology)的新颖特征提取方法,将脑部结构MRI中的拓扑特征转化为Betti函数表示的特征向量,并结合轻量级机器学习模型(如XGBoost)进行分类。该方法无需数据增强或复杂预处理,计算效率高,在ADNI 3D MRI数据集上实现了优于当前主流深度学习模型的性能,尤其在二分类和三分类任务中分别达到97.43%和95.47%的平均准确率,同时具备与现有模型互补的潜力,可拓展用于多模态融合分析。

链接: https://arxiv.org/abs/2511.08663
作者: Faisal Ahmed,Taymaz Akan,Fatih Gelir,Owen T. Carmichael,Elizabeth A. Disbrow,Steven A. Conrad,Mohammad A. N. Bhuiyan
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Now that disease-modifying therapies for Alzheimer disease have been approved by regulatory agencies, the early, objective, and accurate clinical diagnosis of AD based on the lowest-cost measurement modalities possible has become an increasingly urgent need. In this study, we propose a novel feature extraction method using persistent homology to analyze structural MRI of the brain. This approach converts topological features into powerful feature vectors through Betti functions. By integrating these feature vectors with a simple machine learning model like XGBoost, we achieve a computationally efficient machine learning model. Our model outperforms state-of-the-art deep learning models in both binary and three-class classification tasks for ADNI 3D MRI disease diagnosis. Using 10-fold cross-validation, our model achieved an average accuracy of 97.43 percent and sensitivity of 99.09 percent for binary classification. For three-class classification, it achieved an average accuracy of 95.47 percent and sensitivity of 94.98 percent. Unlike many deep learning models, our approach does not require data augmentation or extensive preprocessing, making it particularly suitable for smaller datasets. Topological features differ significantly from those commonly extracted using convolutional filters and other deep learning machinery. Because it provides an entirely different type of information from machine learning models, it has the potential to combine topological features with other models later on.
zh

[CV-89] Fluence Map Prediction with Deep Learning: A Transformer-based Approach

【速读】:该论文旨在解决传统调强放射治疗(IMRT)中剂量分布优化过程耗时且高度依赖医生经验的问题,以实现高效、高质量的射野强度图(fluence map)生成。其解决方案的关键在于提出了一种基于3D Swin-UNETR的端到端深度学习框架,利用Transformer架构中的分层自注意力机制,同时捕捉局部解剖结构与远距离空间依赖关系,从而直接从体积CT图像和解剖轮廓预测九野强度图,无需逆向优化过程。该方法在测试集上实现了高精度预测(平均R²=0.95±0.02,MAE=0.035±0.008),并保持了与临床计划相当的剂量体积直方图(DVH)参数,显著提升了自动化IMRT计划生成的空间一致性、准确性和效率。

链接: https://arxiv.org/abs/2511.08645
作者: Ujunwa Mgboh,Rafi Sultan,Dongxiao Zhu,Joshua Kim
机构: Wayne State University (韦恩州立大学); Henry Ford Health (亨利福特健康)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate fluence map prediction is essential in intensity-modulated radiation therapy (IMRT) to maximize tumor coverage while minimizing dose to healthy tissues. Conventional optimization is time-consuming and dependent on planner expertise. This study presents a deep learning framework that accelerates fluence map generation while maintaining clinical quality. An end-to-end 3D Swin-UNETR network was trained to predict nine-beam fluence maps directly from volumetric CT images and anatomical contours using 99 prostate IMRT cases (79 for training and 20 for testing). The transformer-based model employs hierarchical self-attention to capture both local anatomical structures and long-range spatial dependencies. Predicted fluence maps were imported into the Eclipse Treatment Planning System for dose recalculation, and model performance was evaluated using beam-wise fluence correlation, spatial gamma analysis, and dose-volume histogram (DVH) metrics. The proposed model achieved an average R^2 of 0.95 +/- 0.02, MAE of 0.035 +/- 0.008, and gamma passing rate of 85 +/- 10 percent (3 percent / 3 mm) on the test set, with no significant differences observed in DVH parameters between predicted and clinical plans. The Swin-UNETR framework enables fully automated, inverse-free fluence map prediction directly from anatomical inputs, enhancing spatial coherence, accuracy, and efficiency while offering a scalable and consistent solution for automated IMRT plan generation.
zh

[CV-90] SAMora: Enhancing SAM through Hierarchical Self-Supervised Pre-Training for Medical Images

【速读】:该论文旨在解决生成式 AI(Generative AI)在医学图像分割任务中,当标注数据稀缺时性能受限的问题,同时挖掘和利用医学数据中常被忽视的层次化信息。其关键解决方案是提出 SAMora 框架,通过在图像、补丁和像素三个层级上应用互补的自监督学习目标来捕获层次化医学知识,并引入 HL-Attn 层次融合模块以整合多尺度特征并保留其独特性,从而显著提升模型在少样本和全监督场景下的表现,同时将微调轮次减少 90%。

链接: https://arxiv.org/abs/2511.08626
作者: Shuhang Chen,Hangjie Yuan,Pengwei Liu,Hanxue Gu,Tao Feng,Dong Ni
机构: Zhejiang University (浙江大学); Duke University (杜克大学); Tsinghua University (清华大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The Segment Anything Model (SAM) has demonstrated significant potential in medical image segmentation. Yet, its performance is limited when only a small amount of labeled data is available, while there is abundant valuable yet often overlooked hierarchical information in medical data. To address this limitation, we draw inspiration from self-supervised learning and propose SAMora, an innovative framework that captures hierarchical medical knowledge by applying complementary self-supervised learning objectives at the image, patch, and pixel levels. To fully exploit the complementarity of hierarchical knowledge within LoRAs, we introduce HL-Attn, a hierarchical fusion module that integrates multi-scale features while maintaining their distinct characteristics. SAMora is compatible with various SAM variants, including SAM2, SAMed, and H-SAM. Experimental results on the Synapse, LA, and PROMISE12 datasets demonstrate that SAMora outperforms existing SAM variants. It achieves state-of-the-art performance in both few-shot and fully supervised settings while reducing fine-tuning epochs by 90%. The code is available at this https URL.
zh

[CV-91] Moving pattern-based modeling using a new type of interval ARX model

【速读】:该论文旨在解决传统自回归移动平均模型(ARX)在处理区间数据时的局限性,即无法有效建模具有不确定性的区间型输入输出数据。其解决方案的关键在于提出了一种新的区间数与实矩阵之间的运算算子,并基于此构建了区间自回归移动平均模型(Interval ARX Model, IARX),使模型能够直接处理区间数据;进一步将该模型应用于基于移动模式的建模方法中,实验验证表明,所提IARX模型对参数变化具有鲁棒性,且建模性能优于现有方法。

链接: https://arxiv.org/abs/2307.04402
作者: Changping Sun
机构: 未知
类目: Methodology (stat.ME); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:In this paper,firstly,to overcome the shortcoming of traditional ARX model, a new operator between an interval number and a real matrix is defined, and then it is applied to the traditional ARX model to get a new type of structure interval ARX model that can deal with interval data, which is defined as interval ARX model (IARX). Secondly,the IARX model is applied to moving pattern-based modeling. Finally,to verify the validity of the proposed modeling method,it is applied to a sintering process. The simulation results show the moving pattern-based modeling using the new type of interval ARX model is robust to variation in parameters of the model, and the performance of the modeling using the proposed IARX is superior to that of the previous work.
zh

人工智能

[AI-0] Breadth-First Search vs. Restarting Random Walks for Escaping Uninformed Heuristic Regions

【速读】:该论文旨在解决在启发式搜索中遇到无信息启发区域(Uninformed Heuristic Regions, UHRs)时,传统贪婪搜索方法如强制爬坡算法(Enforced Hill-Climbing, EHC)难以有效逃脱的问题。其解决方案的关键在于引入基于随机游走的重启策略(Restarting Random Walks, RRWs)替代EHC中使用的广度优先搜索(Breadth-First Search, BrFS)来逃离UHRs,并通过理论分析和实验验证表明,在特定条件下RRWs在期望运行时间上优于BrFS,从而显著提升算法在复杂启发式地形中的效率与鲁棒性。

链接: https://arxiv.org/abs/2511.09549
作者: Daniel Platnick,Dawson Tomasz,Eamon Earl,Sourena Khanzadeh,Richard Valenzano
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Greedy search methods like Greedy Best-First Search (GBFS) and Enforced Hill-Climbing (EHC) often struggle when faced with Uninformed Heuristic Regions (UHRs) like heuristic local minima or plateaus. In this work, we theoretically and empirically compare two popular methods for escaping UHRs in breadth-first search (BrFS) and restarting random walks (RRWs). We first derive the expected runtime of escaping a UHR using BrFS and RRWs, based on properties of the UHR and the random walk procedure, and then use these results to identify when RRWs will be faster in expectation than BrFS. We then evaluate these methods for escaping UHRs by comparing standard EHC, which uses BrFS to escape UHRs, to variants of EHC called EHC-RRW, which use RRWs for that purpose. EHC-RRW is shown to have strong expected runtime guarantees in cases where EHC has previously been shown to be effective. We also run experiments with these approaches on PDDL planning benchmarks to better understand their relative effectiveness for escaping UHRs.
zh

[AI-1] Robust and Diverse Multi-Agent Learning via Rational Policy Gradient NEURIPS2025

【速读】:该论文旨在解决对抗优化算法在合作(cooperative)和非零和(general-sum)多智能体场景中因导致智能体自我破坏(self-sabotage)而失效的问题。传统对抗优化方法在合作环境中会诱导智能体采取非理性行为,从而阻碍任务完成并终止学习过程。解决方案的关键在于提出一种理性保持策略优化(Rationality-preserving Policy Optimization, RPO),其核心思想是确保智能体在对抗过程中仍保持理性——即每个智能体的策略对其可能的伙伴策略而言仍是最优的。为实现RPO,作者进一步设计了理性策略梯度(Rational Policy Gradient, RPG),通过对手塑造(opponent shaping)技术重构原博弈环境,在此修改后的环境中训练智能体最大化自身奖励,从而有效规避自毁激励。该方法使多种现有对抗优化算法能够在合作场景下稳定运行,提升鲁棒性、适应性和策略多样性。

链接: https://arxiv.org/abs/2511.09535
作者: Niklas Lauffer,Ameesh Shah,Micah Carroll,Sanjit A. Seshia,Stuart Russell,Michael Dennis
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Published at NeurIPS 2025

点击查看摘要

Abstract:Adversarial optimization algorithms that explicitly search for flaws in agents’ policies have been successfully applied to finding robust and diverse policies in multi-agent settings. However, the success of adversarial optimization has been largely limited to zero-sum settings because its naive application in cooperative settings leads to a critical failure mode: agents are irrationally incentivized to self-sabotage, blocking the completion of tasks and halting further learning. To address this, we introduce Rationality-preserving Policy Optimization (RPO), a formalism for adversarial optimization that avoids self-sabotage by ensuring agents remain rational–that is, their policies are optimal with respect to some possible partner policy. To solve RPO, we develop Rational Policy Gradient (RPG), which trains agents to maximize their own reward in a modified version of the original game in which we use opponent shaping techniques to optimize the adversarial objective. RPG enables us to extend a variety of existing adversarial optimization algorithms that, no longer subject to the limitations of self-sabotage, can find adversarial examples, improve robustness and adaptability, and learn diverse policies. We empirically validate that our approach achieves strong performance in several popular cooperative and general-sum environments. Our project page can be found at this https URL.
zh

[AI-2] Digital Co-Founders: Transforming Imagination into Viable Solo Business via Agent ic AI

【速读】:该论文试图解决的问题是:个体创业者如何在人工智能(AI)日益渗透的时代,将创意构想转化为成功的独资企业(solopreneurship)。其核心挑战在于资源有限、决策责任完全集中于创始人,且业务成败高度依赖创始人身份认同。解决方案的关键在于提出一个三阶段框架:(1) 想象塑造(Imagination shaping),借助AI代理进行市场扫描与概念生成,将模糊目标转化为清晰的价值主张;(2) 现实验证(Reality testing),通过低成本实验、反馈循环和AI自动化任务(如原型设计、内容生成、客户交互)实现高效试错;(3) 现实扩展(Reality scaling),将验证有效的模式转化为可重复流程与规模化策略,并由自主或半自主的AI工作流持续优化运营。该框架强调心理适应性、有效规划及人机协同能力作为关键促成因素,从而应对不确定性与认知过载等典型挑战。

链接: https://arxiv.org/abs/2511.09533
作者: Farhad Rezazadeh,Pegah Bonehgazy
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 13 pages, 3 figures

点击查看摘要

Abstract:This paper investigates how individual entrepreneurs can turn creative ideas into successful solo businesses in an era increasingly shaped by Artificial Intelligence (AI) agents. It highlights the key steps that connect personal vision, structured experimentation, and lasting value creation, and shows how AI agents can act as digital co-founders throughout this journey. Building on research in entrepreneurship, creativity, and innovation, we present a framework with three key stages: (1) Imagination shaping, where vague goals become clear value propositions, supported by AI agents that help with market scanning, idea refinement, and rapid concept generation; (2) Reality testing, where these ideas are tested through low-cost experiments, structured feedback loops, and efficient execution, with AI agents automating tasks such as prototyping, content creation, customer interaction, and data analysis; and (3) Reality scaling, where successful ideas are transformed into repeatable processes, scalable market strategies, and long-term business models, increasingly operated and optimized by autonomous or semi-autonomous AI workflows. We focus on the specific context of solopreneurship, characterized by limited human resources, complete accountability for decision-making, and a strong association between the founder’s identity and the business. The framework clearly identifies key enabling factors such as mental adaptability, effective planning, and successful human-AI collaboration within digital ecosystems. It also thoughtfully addresses ongoing challenges, like uncertainty and cognitive overload, which are heightened by our constant connectivity.
zh

[AI-3] WMPO: World Model-based Policy Optimization for Vision-Language-Action Models

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人操作中依赖专家示范、难以从失败中学习和进行自我修正的问题,以及强化学习(Reinforcement Learning, RL)在真实机器人上因样本复杂度高而难以高效训练的挑战。其解决方案的关键在于提出世界模型驱动的策略优化(World-Model-based Policy Optimization, WMPO),该框架通过基于像素的世界模型预测来对齐VLA特征空间,实现无需与真实环境交互的在线策略优化(on-policy VLA RL),并利用基于梯度的策略优化(GRPO)方法提升策略性能,从而显著提高样本效率、增强自纠正能力,并支持鲁棒泛化与持续学习。

链接: https://arxiv.org/abs/2511.09515
作者: Fangqi Zhu,Zhengyang Yan,Zicong Hong,Quanxin Shou,Xiao Ma,Song Guo
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: project website: this https URL

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, but their reliance on expert demonstrations limits their ability to learn from failures and perform self-corrections. Reinforcement learning (RL) addresses these through self-improving interactions with the physical environment, but suffers from high sample complexity on real robots. We introduce World-Model-based Policy Optimization (WMPO), a principled framework for on-policy VLA RL without interacting with the real environment. In contrast to widely used latent world models, WMPO focuses on pixel-based predictions that align the “imagined” trajectories with the VLA features pretrained with web-scale images. Crucially, WMPO enables the policy to perform on-policy GRPO that provides stronger performance than the often-used off-policy methods. Extensive experiments in both simulation and real-robot settings demonstrate that WMPO (i) substantially improves sample efficiency, (ii) achieves stronger overall performance, (iii) exhibits emergent behaviors such as self-correction, and (iv) demonstrates robust generalization and lifelong learning capabilities.
zh

[AI-4] Fundamentals of Physical AI

【速读】:该论文试图解决如何从科学和系统视角构建物理智能(Physical AI)的理论基础,以描述智能系统在物理世界中所具有的具身性、感知能力、行动能力、学习机制、自主性和情境敏感性等问题。其解决方案的关键在于提出并论证六个核心原则——具身性(embodiment)、感官感知、运动行为、学习、自主性和情境敏感性——它们并非孤立的功能模块,而是构成一个闭环控制回路,其中能量、信息、控制与情境持续交互,从而使得智能系统能够通过物理体验生成意义,而非依赖数据库或抽象计算;这种范式转变强调学习是代理与环境之间结构耦合的变化,而非参数调整,为设计和评估具有真实物理交互能力的智能系统提供了理论框架。

链接: https://arxiv.org/abs/2511.09497
作者: Vahid Salehi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This paper is already published in Journal of Intelligent System of Systems Lifecycle Management

点击查看摘要

Abstract:This work will elaborate the fundamental principles of physical artificial intelligence (Physical AI) from a scientific and systemic perspective. The aim is to create a theoretical foundation that describes the physical embodiment, sensory perception, ability to act, learning processes, and context sensitivity of intelligent systems within a coherent framework. While classical AI approaches rely on symbolic processing and data driven models, Physical AI understands intelligence as an emergent phenomenon of real interaction between body, environment, and experience. The six fundamentals presented here are embodiment, sensory perception, motor action, learning, autonomy, and context sensitivity, and form the conceptual basis for designing and evaluating physically intelligent systems. Theoretically, it is shown that these six principles do not represent loose functional modules but rather act as a closed control loop in which energy, information, control, and context are in constant interaction. This circular interaction enables a system to generate meaning not from databases, but from physical experience, a paradigm shift that understands intelligence as an physical embodied process. Physical AI understands learning not as parameter adjustment, but as a change in the structural coupling between agents and the environment. To illustrate this, the theoretical model is explained using a practical scenario: An adaptive assistant robot supports patients in a rehabilitation clinic. This example illustrates that physical intelligence does not arise from abstract calculation, but from immediate, embodied experience. It shows how the six fundamentals interact in a real system: embodiment as a prerequisite, perception as input, movement as expression, learning as adaptation, autonomy as regulation, and context as orientation.
zh

[AI-5] Consensus Sampling for Safer Generative AI

【速读】:该论文旨在解决生成式 AI (Generative AI) 中某些安全风险无法仅通过模型输出或激活值检查来检测的问题,即传统方法在面对隐蔽性风险时存在局限。其解决方案的关键在于提出一种架构无关(architecture-agnostic)的安全增强机制:通过聚合多个生成模型的输出,使聚合模型的安全性继承自这组模型中安全性最高的子集。具体而言,作者设计了一种共识采样算法(consensus sampling algorithm),在给定 $ k $ 个模型和一个提示(prompt)时,能够实现与其中最安全的 $ s $ 个模型平均风险相当的安全水平,同时在模型间缺乏足够一致性时选择“回避”(abstention)。该方法依赖于各模型对输出概率的计算能力,并在足够多模型安全且具备良好一致性条件下,理论上可控制回避概率。这一策略为提升生成式 AI 的整体安全性提供了一种新的、不依赖特定模型结构的保障方式,其核心思想是将未知子集中可靠的安全性放大至单一可信模型。

链接: https://arxiv.org/abs/2511.09493
作者: Adam Tauman Kalai,Yael Tauman Kalai,Or Zamir
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Many approaches to AI safety rely on inspecting model outputs or activations, yet certain risks are inherently undetectable by inspection alone. We propose a complementary, architecture-agnostic approach that enhances safety through the aggregation of multiple generative models, with the aggregated model inheriting its safety from the safest subset of a given size among them. Specifically, we present a consensus sampling algorithm that, given k models and a prompt, achieves risk competitive with the average risk of the safest s of the k models, where s is a chosen parameter, while abstaining when there is insufficient agreement between them. The approach leverages the models’ ability to compute output probabilities, and we bound the probability of abstention when sufficiently many models are safe and exhibit adequate agreement. The algorithm is inspired by the provable copyright protection algorithm of Vyas et al. (2023). It requires some overlap among safe models, offers no protection when all models are unsafe, and may accumulate risk over repeated use. Nonetheless, our results provide a new, model-agnostic approach for AI safety by amplifying safety guarantees from an unknown subset of models within a collection to that of a single reliable model.
zh

[AI-6] Enhancing Password Security Through a High-Accuracy Scoring Framework Using Random Forests

【速读】:该论文旨在解决传统密码强度检测工具因依赖静态规则(如字符类型要求)而无法有效识别常见密码模式(例如 ‘P@ssw0rd1!’)所导致的安全漏洞问题,从而给用户带来虚假的安全感。其解决方案的关键在于提出一种新颖的混合特征工程方法,通过引入多种细粒度特征来捕捉传统指标遗漏的脆弱性,包括:基于 leetspeak 归一化的香农熵(Shannon entropy)以评估密码的真实随机性、键盘路径与序列模式检测机制,以及基于字符级 TF-IDF n-gram 的子串重复分析,用于识别来自泄露密码数据集中的高频重复片段。在此基础上,采用随机森林(Random Forest, RF)模型实现最优性能(测试集准确率达 99.12%),并借助其可解释性优势进行特征重要性分析,为开发提供具体、可操作反馈的安全工具奠定基础,从而在提升预测准确性的同时增强实际可用性。

链接: https://arxiv.org/abs/2511.09492
作者: Muhammed El Mustaqeem Mazelan,Noor Hazlina Abdul,Nouar AlDahoul
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Password security plays a crucial role in cybersecurity, yet traditional password strength meters, which rely on static rules like character-type requirements, often fail. Such methods are easily bypassed by common password patterns (e.g., ‘P@ssw0rd1!’), giving users a false sense of security. To address this, we implement and evaluate a password strength scoring system by comparing four machine learning models: Random Forest (RF), Support Vector Machine (SVM), a Convolutional Neural Network (CNN), and Logistic Regression with a dataset of over 660,000 real-world passwords. Our primary contribution is a novel hybrid feature engineering approach that captures nuanced vulnerabilities missed by standard metrics. We introduce features like leetspeak-normalized Shannon entropy to assess true randomness, pattern detection for keyboard walks and sequences, and character-level TF-IDF n-grams to identify frequently reused substrings from breached password datasets. our RF model achieved superior performance, achieving 99.12% accuracy on a held-out test set. Crucially, the interpretability of the Random Forest model allows for feature importance analysis, providing a clear pathway to developing security tools that offer specific, actionable feedback to users. This study bridges the gap between predictive accuracy and practical usability, resulting in a high-performance scoring system that not only reduces password-based vulnerabilities but also empowers users to make more informed security decisions.
zh

[AI-7] CrochetBench: Can Vision-Language Models Move from Describing to Doing in Crochet Domain?

【速读】:该论文旨在解决当前多模态大语言模型在低层次、细粒度程序性推理能力上的不足,尤其是在手工编织(crochet)这一现实世界创造性任务中的执行精度问题。现有基准主要关注高层次描述或视觉问答,难以评估模型将输入转化为可执行操作的能力。解决方案的关键在于构建CrochetBench基准,采用CrochetPARADE领域特定语言(DSL)作为中间表示,使模型输出可被结构化验证和功能执行;通过 stitch 分类、指令定位以及自然语言与图像到 DSL 的转换等任务,系统性地衡量模型从感知到生成可编译编织程序的全过程能力,从而揭示模型在长程符号推理和三维感知程序合成方面的局限性。

链接: https://arxiv.org/abs/2511.09483
作者: Peiyu Li,Xiaobao Huang,Nitesh V. Chawla
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: code available at this https URL

点击查看摘要

Abstract:We present CrochetBench, a benchmark for evaluating the ability of multimodal large language models to perform fine-grained, low-level procedural reasoning in the domain of crochet. Unlike prior benchmarks that focus on high-level description or visual question answering, CrochetBench shifts the emphasis from describing to doing: models are required to recognize stitches, select structurally appropriate instructions, and generate compilable crochet procedures. We adopt the CrochetPARADE DSL as our intermediate representation, enabling structural validation and functional evaluation via execution. The benchmark covers tasks including stitch classification, instruction grounding, and both natural language and image-to-DSL translation. Across all tasks, performance sharply declines as the evaluation shifts from surface-level similarity to executable correctness, exposing limitations in long-range symbolic reasoning and 3D-aware procedural synthesis. CrochetBench offers a new lens for assessing procedural competence in multimodal models and highlights the gap between surface-level understanding and executable precision in real-world creative domains. Code is available at this https URL.
zh

[AI-8] Algorithmic Advice as a Strategic Signal on Competitive Markets

链接: https://arxiv.org/abs/2511.09454
作者: Tobias R. Rebholz,Maxwell Uphoff,Christian H. R. Bernges,Florian Scholten
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); General Economics (econ.GN)
备注:

点击查看摘要

[AI-9] How does the Performance of the Data-driven Traffic Flow Forecasting Models deteriorate with Increasing Forecasting Horizon? An Extensive Approach Considering Statistical Machine Learning and Deep Learning Models

链接: https://arxiv.org/abs/2511.09450
作者: Amanta Sherfenaz,Nazmul Haque,Protiva Sadhukhan Prova,Md Asif Raihan,Md. Hadiuzzaman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6,227 words text + 2*250 (2 tables) = 6,727 words

点击查看摘要

[AI-10] LLM -Guided Dynamic-UMAP for Personalized Federated Graph Learning

【速读】:该论文旨在解决在个性化和隐私约束条件下,如何利用大语言模型(Large Language Models, LLMs)增强图机器学习(Graph Machine Learning, GML)的问题,特别是在低资源场景下实现节点分类与链接预测任务。其解决方案的关键在于:结合稀疏图的数据增强、提示(prompt)与指令微调以适配基础模型至图任务,并通过上下文学习提供少量样本的图推理信号;这些信号用于参数化客户端特定图嵌入的动态UMAP流形结构,嵌入贝叶斯变分目标中以支持个性化联邦学习;同时引入跨模态正则化器将语言模型潜在表示与图结构对齐,从而提升模型在隐私保护下的性能表现。

链接: https://arxiv.org/abs/2511.09438
作者: Sai Puppala,Ismail Hossain,Md Jahangir Alam,Tanzim Ahad,Sajedul Talukder
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose a method that uses large language models to assist graph machine learning under personalization and privacy constraints. The approach combines data augmentation for sparse graphs, prompt and instruction tuning to adapt foundation models to graph tasks, and in-context learning to supply few-shot graph reasoning signals. These signals parameterize a Dynamic UMAP manifold of client-specific graph embeddings inside a Bayesian variational objective for personalized federated learning. The method supports node classification and link prediction in low-resource settings and aligns language model latent representations with graph structure via a cross-modal regularizer. We outline a convergence argument for the variational aggregation procedure, describe a differential privacy threat model based on a moments accountant, and present applications to knowledge graph completion, recommendation-style link prediction, and citation and product graphs. We also discuss evaluation considerations for benchmarking LLM-assisted graph machine learning.
zh

[AI-11] What We Dont C: Representations for scientific discovery beyond VAEs NEURIPS2025

【速读】:该论文旨在解决如何在高维数据的已学习表征中有效访问和解析信息的问题,这对于科学发现至关重要。其解决方案的关键在于提出一种基于无分类器引导(classifier-free guidance)的潜在流匹配(latent flow matching)方法,通过显式分离条件信息(conditioning information)与残差表示中的信息,实现潜在子空间的解耦。该方法在合成二维高斯问题、彩色MNIST以及Galaxy10天文数据集上的实验表明,能够提取出有意义的数据特征,从而为分析、控制和重用潜在表征提供了一种简单而强大的机制,推动生成模型在科学探索中用于理解未被捕捉、考虑或编目内容的潜力。

链接: https://arxiv.org/abs/2511.09433
作者: Brian Rogers,Micah Bowles,Chris J. Lintott,Steve Croft
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to the Machine Learning and the Physical Sciences workshop at NeurIPS 2025

点击查看摘要

Abstract:Accessing information in learned representations is critical for scientific discovery in high-dimensional domains. We introduce a novel method based on latent flow matching with classifier-free guidance that disentangles latent subspaces by explicitly separating information included in conditioning from information that remains in the residual representation. Across three experiments – a synthetic 2D Gaussian toy problem, colored MNIST, and the Galaxy10 astronomy dataset – we show that our method enables access to meaningful features of high dimensional data. Our results highlight a simple yet powerful mechanism for analyzing, controlling, and repurposing latent representations, providing a pathway toward using generative models for scientific exploration of what we don’t capture, consider, or catalog.
zh

[AI-12] Spatio-Temporal Graph Unlearning

【速读】:该论文旨在解决时空图(spatio-temporal graph)中未经授权数据的完全遗忘(complete unlearning)问题,即在隐私法规(如GDPR和CCPA)要求下,需彻底移除特定节点的数据影响,而传统方法因时空图的全局信息扩散特性导致删除单个节点的成本接近于重新训练整个模型。解决方案的关键在于提出CallosumNet框架,其核心创新为两项技术:(1) 增强子图构建(Enhanced Subgraph Construction, ESC),通过生物启发的虚拟神经节(virtual ganglions)自适应地构造多个局部子图;(2) 全局神经节桥接(Global Ganglion Bridging, GGB),从这些局部子图重建全局时空依赖关系,从而高效恢复完整图表示,实现仅需1%-2%相对MAE损失即可完成完全遗忘。

链接: https://arxiv.org/abs/2511.09404
作者: Qiming Guo,Wenbo Sun,Wenlu Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Spatio-temporal graphs are widely used in modeling complex dynamic processes such as traffic forecasting, molecular dynamics, and healthcare monitoring. Recently, stringent privacy regulations such as GDPR and CCPA have introduced significant new challenges for existing spatio-temporal graph models, requiring complete unlearning of unauthorized data. Since each node in a spatio-temporal graph diffuses information globally across both spatial and temporal dimensions, existing unlearning methods primarily designed for static graphs and localized data removal cannot efficiently erase a single node without incurring costs nearly equivalent to full model retraining. Therefore, an effective approach for complete spatio-temporal graph unlearning is a pressing need. To address this, we propose CallosumNet, a divide-and-conquer spatio-temporal graph unlearning framework inspired by the corpus callosum structure that facilitates communication between the brain’s two hemispheres. CallosumNet incorporates two novel techniques: (1) Enhanced Subgraph Construction (ESC), which adaptively constructs multiple localized subgraphs based on several factors, including biologically-inspired virtual ganglions; and (2) Global Ganglion Bridging (GGB), which reconstructs global spatio-temporal dependencies from these localized subgraphs, effectively restoring the full graph representation. Empirical results on four diverse real-world datasets show that CallosumNet achieves complete unlearning with only 1%-2% relative MAE loss compared to the gold model, significantly outperforming state-of-the-art baselines. Ablation studies verify the effectiveness of both proposed techniques.
zh

[AI-13] Potent but Stealthy: Rethink Profile Pollution against Sequential Recommendation via Bi-level Constrained Reinforcement Paradigm

【速读】:该论文旨在解决序列推荐系统(Sequential Recommenders)在面对针对用户画像污染攻击(Profile Pollution Attack, PPA)时的脆弱性问题,即通过隐蔽地篡改部分用户交互行为来诱导目标预测错误。现有PPA方法存在两大局限:一是过度依赖序列长度影响,难以实现对物品转移关系的细粒度扰动;二是整体修改导致可检测的分布偏移。为此,作者提出了一种受约束强化驱动的攻击方法CREAT,其核心创新在于融合双层优化框架与多奖励强化学习机制,以平衡攻击的有效性和隐蔽性。关键解决方案包括:设计模式平衡奖励策略(Pattern Balanced Rewarding Policy),结合模式反转奖励和分布一致性奖励(基于非平衡最优传输)实现精准扰动与低可检测性;以及引入受约束分组相对强化学习范式(Constrained Group Relative Reinforcement Learning),通过动态屏障约束和共享经验回放机制,实现逐步扰动并最小化攻击痕迹。

链接: https://arxiv.org/abs/2511.09392
作者: Jiajie Su,Zihan Nan,Yunshan Ma,Xiaobo Xia,Xiaohua Feng,Weiming Liu,Xiaolin Zheng,Chaochao Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sequential Recommenders, which exploit dynamic user intents through interaction sequences, is vulnerable to adversarial attacks. While existing attacks primarily rely on data poisoning, they require large-scale user access or fake profiles thus lacking practicality. In this paper, we focus on the Profile Pollution Attack that subtly contaminates partial user interactions to induce targeted mispredictions. Previous PPA methods suffer from two limitations, i.e., i) over-reliance on sequence horizon impact restricts fine-grained perturbations on item transitions, and ii) holistic modifications cause detectable distribution shifts. To address these challenges, we propose a constrained reinforcement driven attack CREAT that synergizes a bi-level optimization framework with multi-reward reinforcement learning to balance adversarial efficacy and stealthiness. We first develop a Pattern Balanced Rewarding Policy, which integrates pattern inversion rewards to invert critical patterns and distribution consistency rewards to minimize detectable shifts via unbalanced co-optimal transport. Then we employ a Constrained Group Relative Reinforcement Learning paradigm, enabling step-wise perturbations through dynamic barrier constraints and group-shared experience replay, achieving targeted pollution with minimal detectability. Extensive experiments demonstrate the effectiveness of CREAT.
zh

[AI-14] he 2025 Planning Performance of Frontier Large Language Models

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在自动规划任务中推理能力的评估与提升问题,特别是其在标准和扰动(obfuscated)PDDL(Planning Domain Definition Language)域上的端到端规划性能。解决方案的关键在于通过对比前沿LLMs(如GPT-5、DeepSeek R1、Gemini 2.5 Pro)与传统规划器LAMA在国际规划竞赛最新学习赛道子集上的表现,量化LLMs在结构化任务规划中的推理能力,并验证其相对于早期模型的显著进步,从而缩小与专用规划器之间的性能差距。

链接: https://arxiv.org/abs/2511.09378
作者: Augusto B. Corrêa,André G. Pereira,Jendrik Seipp
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The capacity of Large Language Models (LLMs) for reasoning remains an active area of research, with the capabilities of frontier models continually advancing. We provide an updated evaluation of the end-to-end planning performance of three frontier LLMs as of 2025, where models are prompted to generate a plan from PDDL domain and task descriptions. We evaluate DeepSeek R1, Gemini 2.5 Pro, GPT-5 and as reference the planner LAMA on a subset of domains from the most recent Learning Track of the International Planning Competition. Our results show that on standard PDDL domains, the performance of GPT-5 in terms of solved tasks is competitive with LAMA. When the PDDL domains and tasks are obfuscated to test for pure reasoning, the performance of all LLMs degrades, though less severely than previously reported for other models. These results show substantial improvements over prior generations of LLMs, reducing the performance gap to planners on a challenging benchmark.
zh

[AI-15] BarrierBench : Evaluating Large Language Models for Safety Verification in Dynamical Systems

【速读】:该论文旨在解决动态系统安全验证中屏障证书(barrier certificate)合成的可扩展性差、对模板依赖性强以及手动调参复杂等问题。传统方法在函数空间搜索中效率低,且高度依赖专家经验来选择模板、求解器、超参数及采样策略,限制了自动化与泛化能力。解决方案的关键在于提出一个基于大语言模型(Large Language Model, LLM)的智能体框架(agentic framework),通过自然语言推理实现屏障证书的自动提议、精炼与验证,并融合LLM驱动的模板发现与SMT-based验证机制,同时支持屏障-控制器协同合成以保证一致性。该框架显著提升了证书生成的成功率(>90%),并通过引入BarrierBench基准测试集和检索增强生成(retrieval-augmented generation)与智能体协作策略,为语言模型引导的形式化验证提供了可复现、可评估的工具链。

链接: https://arxiv.org/abs/2511.09363
作者: Ali Taheri,Alireza Taban,Sadegh Soudjani,Ashutosh Trivedi
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Safety verification of dynamical systems via barrier certificates is essential for ensuring correctness in autonomous applications. Synthesizing these certificates involves discovering mathematical functions with current methods suffering from poor scalability, dependence on carefully designed templates, and exhaustive or incremental function-space searches. They also demand substantial manual expertise–selecting templates, solvers, and hyperparameters, and designing sampling strategies–requiring both theoretical and practical knowledge traditionally shared through linguistic reasoning rather than formalized methods. This motivates a key question: can such expert reasoning be captured and operationalized by language models? We address this by introducing an LLM-based agentic framework for barrier certificate synthesis. The framework uses natural language reasoning to propose, refine, and validate candidate certificates, integrating LLM-driven template discovery with SMT-based verification, and supporting barrier-controller co-synthesis to ensure consistency between safety certificates and controllers. To evaluate this capability, we introduce BarrierBench, a benchmark of 100 dynamical systems spanning linear, nonlinear, discrete-time, and continuous-time settings. Our experiments assess not only the effectiveness of LLM-guided barrier synthesis but also the utility of retrieval-augmented generation and agentic coordination strategies in improving its reliability and performance. Across these tasks, the framework achieves more than 90% success in generating valid certificates. By releasing BarrierBench and the accompanying toolchain, we aim to establish a community testbed for advancing the integration of language-based reasoning with formal verification in dynamical systems. The benchmark is publicly available at this https URL Subjects: Artificial Intelligence (cs.AI); Systems and Control (eess.SY) Cite as: arXiv:2511.09363 [cs.AI] (or arXiv:2511.09363v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2511.09363 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Ali Taheri [view email] [v1] Wed, 12 Nov 2025 14:23:49 UTC (281 KB)
zh

[AI-16] Distribution-Based Feature Attribution for Explaining the Predictions of Any Classifier AAAI-26 AAAI

【速读】:该论文旨在解决复杂黑箱人工智能(Artificial Intelligence, AI)模型决策过程缺乏可解释性的问题,特别是针对特征归因(Feature Attribution)方法在理论基础和实际效果上的不足。传统方法虽被广泛用于提供事后解释(post-hoc explanations),但长期以来缺乏严格的数学定义与形式化问题框架,导致许多方法无法基于数据分布提供可靠解释。本文的关键贡献在于首次提出一个形式化的特征归因问题定义,要求解释必须由给定数据集所隐含的概率分布支撑;在此基础上,提出分布感知的特征归因解释方法(Distributional Feature Attribution eXplanations, DFAX),其核心创新是直接基于数据分布对分类器预测进行归因分析,从而实现更有效、更高效的特征重要性评估。

链接: https://arxiv.org/abs/2511.09332
作者: Xinpeng Li,Kai Ming Ting
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted for an oral presentation at the 40th Annual AAAI Conference on Artificial Intelligence (AAAI-26). This document is the extended version which includes the appendix

点击查看摘要

Abstract:The proliferation of complex, black-box AI models has intensified the need for techniques that can explain their decisions. Feature attribution methods have become a popular solution for providing post-hoc explanations, yet the field has historically lacked a formal problem definition. This paper addresses this gap by introducing a formal definition for the problem of feature attribution, which stipulates that explanations be supported by an underlying probability distribution represented by the given dataset. Our analysis reveals that many existing model-agnostic methods fail to meet this criterion, while even those that do often possess other limitations. To overcome these challenges, we propose Distributional Feature Attribution eXplanations (DFAX), a novel, model-agnostic method for feature attribution. DFAX is the first feature attribution method to explain classifier predictions directly based on the data distribution. We show through extensive experiments that DFAX is more effective and efficient than state-of-the-art baselines.
zh

[AI-17] askSense: Cognitive Chain Modeling and Difficulty Estimation for GUI Tasks

【速读】:该论文旨在解决现有GUI任务难度评估基准主要依赖操作步骤数量(如动作计数)而忽视认知负荷的问题,从而无法准确反映用户完成任务时的内在认知挑战。其解决方案的关键在于提出“认知链”(Cognitive Chain)框架,将执行一个操作前的认知过程分解为一系列可量化认知步骤(如寻找、决策、计算),每个步骤均基于信息论构建难度指数,并利用大语言模型(LLM)自动从任务执行轨迹中提取认知链,实现对认知难度的客观测量与验证。

链接: https://arxiv.org/abs/2511.09309
作者: Yiwen Yin,Zhian Hu,Xiaoxi Xu,Chun Yu,Xintong Wu,Wenyu Fan,Yuanchun Shi
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 22 pages, 5 figures

点击查看摘要

Abstract:Measuring GUI task difficulty is crucial for user behavior analysis and agent capability evaluation. Yet, existing benchmarks typically quantify difficulty based on motor actions (e.g., step counts), overlooking the cognitive demands underlying task completion. In this work, we propose Cognitive Chain, a novel framework that models task difficulty from a cognitive perspective. A cognitive chain decomposes the cognitive processes preceding a motor action into a sequence of cognitive steps (e.g., finding, deciding, computing), each with a difficulty index grounded in information theories. We develop an LLM-based method to automatically extract cognitive chains from task execution traces. Validation with linear regression shows that our estimated cognitive difficulty correlates well with user completion time (step-level R-square=0.46 after annotation). Assessment of state-of-the-art GUI agents shows reduced success on cognitively demanding tasks, revealing capability gaps and Human-AI consistency patterns. We conclude by discussing potential applications in agent training, capability assessment, and human-agent delegation optimization.
zh

[AI-18] GuardFed: A Trustworthy Federated Learning Framework Against Dual-Facet Attacks

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中同时损害模型预测准确性和群体公平性的双重威胁问题,现有研究多聚焦于单一目标攻击或防御,而对兼具破坏效用与公平性的协同攻击缺乏系统探讨。其解决方案的关键在于提出一种新型威胁模型——双维度攻击(Dual-Facet Attack, DFA),包含同步型(S-DFA)和拆分型(Sp-DFA)两种变体以模拟不同场景下的恶意协作行为;并进一步设计了自适应防御框架GuardFed,通过引入少量干净服务器数据结合合成样本构建公平感知的参考模型,在每轮训练中计算客户端的双视角可信度分数(综合评估其效用偏差与公平性退化),从而实现对可信更新的选择性聚合,有效保障在非独立同分布(non-IID)及对抗环境下模型的准确性与公平性。

链接: https://arxiv.org/abs/2511.09294
作者: Yanli Li,Yanan Zhou,Zhongliang Guo,Nan Yang,Yuning Zhang,Huaming Chen,Dong Yuan,Weiping Ding,Witold Pedrycz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated learning (FL) enables privacy-preserving collaborative model training but remains vulnerable to adversarial behaviors that compromise model utility or fairness across sensitive groups. While extensive studies have examined attacks targeting either objective, strategies that simultaneously degrade both utility and fairness remain largely unexplored. To bridge this gap, we introduce the Dual-Facet Attack (DFA), a novel threat model that concurrently undermines predictive accuracy and group fairness. Two variants, Synchronous DFA (S-DFA) and Split DFA (Sp-DFA), are further proposed to capture distinct real-world collusion scenarios. Experimental results show that existing robust FL defenses, including hybrid aggregation schemes, fail to resist DFAs effectively. To counter these threats, we propose GuardFed, a self-adaptive defense framework that maintains a fairness-aware reference model using a small amount of clean server data augmented with synthetic samples. In each training round, GuardFed computes a dual-perspective trust score for every client by jointly evaluating its utility deviation and fairness degradation, thereby enabling selective aggregation of trustworthy updates. Extensive experiments on real-world datasets demonstrate that GuardFed consistently preserves both accuracy and fairness under diverse non-IID and adversarial conditions, achieving state-of-the-art performance compared with existing robust FL methods.
zh

[AI-19] From Model Training to Model Raising - A call to reform AI model training paradigms from post-hoc alignment to intrinsic identity-based development

【速读】:该论文试图解决当前人工智能(AI)训练方法中模型对齐(alignment)滞后的问题,即在模型核心能力形成之后才进行价值观对齐,导致模型容易出现价值错位且缺乏深层次的价值观根基。其解决方案的关键在于提出从“模型训练”到“模型养育”(model raising)的范式转变,核心是重新设计训练语料库:包括以第一人称视角重构训练数据、将信息重新语境化为生活经验、模拟社会互动以及结构化地组织训练数据顺序,从而实现从第一个训练标记(token)起就嵌入价值观,使知识、技能与价值观难以分离。

链接: https://arxiv.org/abs/2511.09287
作者: Roland Aydin,Christian Cyron,Steve Bachelor,Ashton Anderson,Robert West
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: Accepted for publication in Communications of the ACM (CACM), Opinion section

点击查看摘要

Abstract:Current AI training methods align models with human values only after their core capabilities have been established, resulting in models that are easily misaligned and lack deep-rooted value systems. We propose a paradigm shift from “model training” to “model raising”, in which alignment is woven into a model’s development from the start. We identify several key components for this paradigm, all centered around redesigning the training corpus: reframing training data from a first-person perspective, recontextualizing information as lived experience, simulating social interactions, and scaffolding the ordering of training data. We expect that this redesign of the training corpus will lead to an early commitment to values from the first training token onward, such that knowledge, skills, and values are intrinsically much harder to separate. In an ecosystem in which large language model capabilities start overtaking human capabilities in many tasks, this seems to us like a critical need.
zh

[AI-20] HyperD: Hybrid Periodicity Decoupling Framework for Traffic Forecasting

【速读】:该论文旨在解决交通流量预测中的两大挑战:一是路网中道路段与交通传感器之间动态交互所引发的复杂空间依赖关系;二是多尺度周期性模式(如日周期和周周期)与由事故、天气或施工等不可预测事件引起的非周期性波动共存的问题。解决方案的关键在于提出HyperD框架,通过将交通数据解耦为周期分量和残差分量进行分别建模:周期分量由混合周期表示模块处理,该模块利用可学习的周期嵌入和时空注意力机制提取细粒度的日/周模式;残差分量则由频域感知的残差表示模块建模,该模块采用复数域MLP捕捉高频非周期波动。此外,引入双视角对齐损失(Dual-View Alignment Loss)以强制两类分量在语义上分离,从而提升预测精度与鲁棒性。

链接: https://arxiv.org/abs/2511.09275
作者: Minlan Shao,Zijian Zhang,Yili Wang,Yiwei Dai,Xu Shen,Xin Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate traffic forecasting plays a vital role in intelligent transportation systems, enabling applications such as congestion control, route planning, and urban mobility this http URL, traffic forecasting remains challenging due to two key factors: (1) complex spatial dependencies arising from dynamic interactions between road segments and traffic sensors across the network, and (2) the coexistence of multi-scale periodic patterns (e.g., daily and weekly periodic patterns driven by human routines) with irregular fluctuations caused by unpredictable events (e.g., accidents, weather, or construction). To tackle these challenges, we propose HyperD (Hybrid Periodic Decoupling), a novel framework that decouples traffic data into periodic and residual components. The periodic component is handled by the Hybrid Periodic Representation Module, which extracts fine-grained daily and weekly patterns using learnable periodic embeddings and spatial-temporal attention. The residual component, which captures non-periodic, high-frequency fluctuations, is modeled by the Frequency-Aware Residual Representation Module, leveraging complex-valued MLP in frequency domain. To enforce semantic separation between the two components, we further introduce a Dual-View Alignment Loss, which aligns low-frequency information with the periodic branch and high-frequency information with the residual branch. Extensive experiments on four real-world traffic datasets demonstrate that HyperD achieves state-of-the-art prediction accuracy, while offering superior robustness under disturbances and improved computational efficiency compared to existing methods.
zh

[AI-21] Unveiling Hidden Threats: Using Fractal Triggers to Boost Stealthiness of Distributed Backdoor Attacks in Federated Learning

【速读】:该论文旨在解决传统分布式后门攻击(Distributed Backdoor Attack, DBA)在联邦学习中因需大量污染数据以维持攻击强度而导致暴露风险增加的问题。其解决方案的关键在于提出一种基于分形触发的分布式后门攻击(Fractal-Triggerred Distributed Backdoor Attack, FTDBA),利用分形结构的自相似特性增强子触发器(sub-trigger)的特征强度,从而显著降低完成同等攻击效果所需的污染比例;同时引入动态角度扰动机制,在训练过程中自适应调整扰动强度,以平衡攻击效率与隐蔽性,有效降低频域和梯度域中的可检测性。

链接: https://arxiv.org/abs/2511.09252
作者: Jian Wang,Hong Shen,Chan-Tong Lam
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 10 pages, 1 figures, conference

点击查看摘要

Abstract:Traditional distributed backdoor attacks (DBA) in federated learning improve stealthiness by decomposing global triggers into sub-triggers, which however requires more poisoned data to maintian the attck strength and hence increases the exposure risk. To overcome this defect, This paper proposes a novel method, namely Fractal-Triggerred Distributed Backdoor Attack (FTDBA), which leverages the self-similarity of fractals to enhance the feature strength of sub-triggers and hence significantly reduce the required poisoning volume for the same attack strength. To address the detectability of fractal structures in the frequency and gradient domains, we introduce a dynamic angular perturbation mechanism that adaptively adjusts perturbation intensity across the training phases to balance efficiency and stealthiness. Experiments show that FTDBA achieves a 92.3% attack success rate with only 62.4% of the poisoning volume required by traditional DBA methods, while reducing the detection rate by 22.8% and KL divergence by 41.2%. This study presents a low-exposure, high-efficiency paradigm for federated backdoor attacks and expands the application of fractal features in adversarial sample generation.
zh

[AI-22] MedFuse: Multiplicative Embedding Fusion For Irregular Clinical Time Series

【速读】:该论文旨在解决临床时间序列数据(来源于电子健康记录,EHR)中固有的不规则性问题,包括采样异步、缺失值以及特征动态的异质性。现有嵌入策略通常通过加法操作融合特征标识嵌入与数值嵌入,限制了对依赖于数值的特征交互关系的建模能力。其解决方案的关键在于提出MedFuse框架,核心是MuFuse(Multiplicative Embedding Fusion)模块——该模块通过乘法调制方式融合值嵌入(value embeddings)与特征嵌入(feature embeddings),在保留特征特异性信息的同时,能够捕捉跨特征的高阶依赖关系,从而显著提升对不规则临床时间序列的建模效果与泛化能力。

链接: https://arxiv.org/abs/2511.09247
作者: Yi-Hsien Hsieh,Ta-Jung Chien,Chun-Kai Huang,Shao-Hua Sun,Che Lin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Clinical time series derived from electronic health records (EHRs) are inherently irregular, with asynchronous sampling, missing values, and heterogeneous feature dynamics. While numerical laboratory measurements are highly informative, existing embedding strategies usually combine feature identity and value embeddings through additive operations, which constrains their ability to capture value-dependent feature interactions. We propose MedFuse, a framework for irregular clinical time series centered on the MuFuse (Multiplicative Embedding Fusion) module. MuFuse fuses value and feature embeddings through multiplicative modulation, preserving feature-specific information while modeling higher-order dependencies across features. Experiments on three real-world datasets covering both intensive and chronic care show that MedFuse consistently outperforms state-of-the-art baselines on key predictive tasks. Analysis of the learned representations further demonstrates that multiplicative fusion enhances expressiveness and supports cross-dataset pretraining. These results establish MedFuse as a generalizable approach for modeling irregular clinical time series.
zh

[AI-23] Leverag ing Large Language Models for Use Case Model Generation from Software Requirements

【速读】:该论文试图解决软件需求建模中使用用例模型(use case modeling)的效率低下问题,即传统手动创建用例模型过程繁琐且耗时,导致实践中常被省略。解决方案的关键在于利用大语言模型(Large Language Models, LLMs)结合先进的提示工程(prompt engineering)技术,系统性地从软件需求文档中自动提取参与者(actors)和用例(use cases),从而显著提升建模效率并保持模型质量。

链接: https://arxiv.org/abs/2511.09231
作者: Tobias Eisenreich,Nicholas Friedlaender,Stefan Wagner
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted at the Intelligent Software Engineering Workshop (ISE 2025) at ASE 2025

点击查看摘要

Abstract:Use case modeling employs user-centered scenarios to outline system requirements. These help to achieve consensus among relevant stakeholders. Because the manual creation of use case models is demanding and time-consuming, it is often skipped in practice. This study explores the potential of Large Language Models (LLMs) to assist in this tedious process. The proposed method integrates an open-weight LLM to systematically extract actors and use cases from software requirements with advanced prompt engineering techniques. The method is evaluated using an exploratory study conducted with five professional software engineers, which compares traditional manual modeling to the proposed LLM-based approach. The results show a substantial acceleration, reducing the modeling time by 60%. At the same time, the model quality remains on par. Besides improving the modeling efficiency, the participants indicated that the method provided valuable guidance in the process.
zh

[AI-24] Learning Binary Autoencoder-Based Codes with Progressive Training

【速读】:该论文旨在解决生成式 AI (Generative AI) 在通信系统中端到端设计时,如何在可微分的自编码器(Autoencoder, AE)架构中有效实现二进制码字(binary codewords)的问题。传统方法因离散化操作破坏梯度传播而导致训练不稳定。解决方案的关键在于提出一种简化的两阶段训练流程:首先进行连续预训练以获得稳定的编码器-解码器参数,随后直接对输出进行二值化并进行微调,无需依赖梯度近似技术(如Straight-Through Estimator)。该方法在(7,4)分组码配置下成功学习到了最优汉明码的旋转版本(coset code),自然恢复了其线性结构和最小距离特性,从而实现了与最大似然(ML)译码相当的块错误率(BLER),验证了紧凑型AE架构可通过稳定且直接的训练机制学习具有代数最优性的二进制编码方案。

链接: https://arxiv.org/abs/2511.09221
作者: Vukan Ninkovic,Dejan Vukobratovic
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注: Invited paper at TELFOR 2025

点击查看摘要

Abstract:Error correcting codes play a central role in digital communication, ensuring that transmitted information can be accurately reconstructed despite channel impairments. Recently, autoencoder (AE) based approaches have gained attention for the end-to-end design of communication systems, offering a data driven alternative to conventional coding schemes. However, enforcing binary codewords within differentiable AE architectures remains difficult, as discretization breaks gradient flow and often leads to unstable convergence. To overcome this limitation, a simplified two stage training procedure is proposed, consisting of a continuous pretraining phase followed by direct binarization and fine tuning without gradient approximation techniques. For the (7,4) block configuration over a binary symmetric channel (BSC), the learned encoder-decoder pair learns a rotated version (coset code) of the optimal Hamming code, naturally recovering its linear and distance properties and thereby achieving the same block error rate (BLER) with maximum likelihood (ML) decoding. These results indicate that compact AE architectures can effectively learn structured, algebraically optimal binary codes through stable and straightforward training.
zh

[AI-25] Enhancing PIBT via Multi-Action Operations

【速读】:该论文旨在解决传统基于规则的多智能体路径规划(Multi-Agent Path Finding, MAPF)求解器PIBT在处理具有方向性且需执行耗时旋转动作的场景时性能不佳的问题。其关键解决方案在于对PIBT进行增强,引入多动作操作(multi-action operations),使其能够在单次决策中考虑多个连续动作,从而有效缓解短视设计带来的局限性;同时保持原有算法的高效率优势,并结合图引导技术与大邻域搜索优化,在在线LMAPF-T场景中实现了当前最优性能。

链接: https://arxiv.org/abs/2511.09193
作者: Egor Yukhnevich,Anton Andreychuk
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:PIBT is a rule-based Multi-Agent Path Finding (MAPF) solver, widely used as a low-level planner or action sampler in many state-of-the-art approaches. Its primary advantage lies in its exceptional speed, enabling action selection for thousands of agents within milliseconds by considering only the immediate next timestep. However, this short-horizon design leads to poor performance in scenarios where agents have orientation and must perform time-consuming rotation actions. In this work, we present an enhanced version of PIBT that addresses this limitation by incorporating multi-action operations. We detail the modifications introduced to improve PIBT’s performance while preserving its hallmark efficiency. Furthermore, we demonstrate how our method, when combined with graph-guidance technique and large neighborhood search optimization, achieves state-of-the-art performance in the online LMAPF-T setting.
zh

[AI-26] Perspectives on a Reliability Monitoring Framework for Agent ic AI Systems

【速读】:该论文旨在解决agentic AI系统在运行过程中可靠性不足的问题,尤其是在高风险领域(如医疗或工业过程控制)中,其自主决策行为可能导致不可预测的后果。研究表明,这一问题源于agentic AI系统固有的特性,且与传统AI系统存在共性,即缺乏对输入异常和内部操作透明性的有效监控机制。论文的关键解决方案是提出一个两层可靠性监测框架:第一层为分布外检测层(out-of-distribution detection layer),用于识别新型或异常输入;第二层为AI透明层(AI transparency layer),揭示系统内部决策逻辑。该框架通过提供人类操作者所需的决策支持信息,帮助判断输出是否可能不可靠并适时干预,从而为降低因可靠性不确定性引发的风险提供了基础支撑。

链接: https://arxiv.org/abs/2511.09178
作者: Niclas Flehmig,Mary Ann Lundteigen,Shen Yin
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:The implementation of agentic AI systems has the potential of providing more helpful AI systems in a variety of applications. These systems work autonomously towards a defined goal with reduced external control. Despite their potential, one of their flaws is the insufficient reliability which makes them especially unsuitable for high-risk domains such as healthcare or process industry. Unreliable systems pose a risk in terms of unexpected behavior during operation and mitigation techniques are needed. In this work, we derive the main reliability challenges of agentic AI systems during operation based on their characteristics. We draw the connection to traditional AI systems and formulate a fundamental reliability challenge during operation which is inherent to traditional and agentic AI systems. As our main contribution, we propose a two-layered reliability monitoring framework for agentic AI systems which consists of a out-of-distribution detection layer for novel inputs and AI transparency layer to reveal internal operations. This two-layered monitoring approach gives a human operator the decision support which is needed to decide whether an output is potential unreliable or not and intervene. This framework provides a foundation for developing mitigation techniques to reduce risk stemming from uncertain reliability during operation.
zh

[AI-27] ractable Weighted First-Order Model Counting with Bounded Treewidth Binary Evidence AAAI2026

【速读】:该论文致力于解决加权一阶模型计数(Weighted First-Order Model Counting, WFOMC)在条件证据下的计算难题,即在给定一组基元命题真值固定(证据)的情况下,计算逻辑句子的加权模型总数。已有研究表明,在一般情况下,即使对于原本可高效求解WFOMC的逻辑片段,加入证据后问题也难以在域大小的多项式时间内求解(除非#P ⊆ FP)。为突破这一障碍,作者的关键解决方案是限制证据为二元形式,并要求其对应的Gaifman图具有有界树宽(bounded treewidth)。在此约束下,论文提出了针对两变量逻辑片段FO²和C²的多项式时间算法,从而首次实现了在受限证据条件下对WFOMC的高效计算。该方法还被成功应用于解决高难度组合优化问题——有界度且有界树宽图上的稳定座位安排问题,展示了其理论与实际应用价值。

链接: https://arxiv.org/abs/2511.09174
作者: Václav Kůla,Qipeng Kuang,Yuyi Wang,Yuanhong Wang,Ondřej Kuželka
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: To be published in AAAI 2026

点击查看摘要

Abstract:The Weighted First-Order Model Counting Problem (WFOMC) asks to compute the weighted sum of models of a given first-order logic sentence over a given domain. Conditioning WFOMC on evidence – fixing the truth values of a set of ground literals – has been shown impossible in time polynomial in the domain size (unless \mathsf#P \subseteq FP ) even for fragments of logic that are otherwise tractable for WFOMC without evidence. In this work, we address the barrier by restricting the binary evidence to the case where the underlying Gaifman graph has bounded treewidth. We present a polynomial-time algorithm in the domain size for computing WFOMC for the two-variable fragments \textFO^2 and \textC^2 conditioned on such binary evidence. Furthermore, we show the applicability of our algorithm in combinatorial problems by solving the stable seating arrangement problem on bounded-treewidth graphs of bounded degree, which was an open problem. We also conducted experiments to show the scalability of our algorithm compared to the existing model counting solvers.
zh

[AI-28] Data Fusion-Enhanced Decision Transformer for Stable Cross-Domain Generalization

【速读】:该论文旨在解决决策变压器(Decision Transformer, DT)在跨域迁移中因轨迹片段拼接不可靠而导致的性能下降问题。具体而言,现有方法依赖单一过滤标准选择源域轨迹片段,导致状态结构错位、回报至目标(Return-to-Go, RTG)不可比以及动作跳跃,从而破坏RTG token的连续性,削弱DT的推理能力。其解决方案的关键在于提出数据融合增强型决策变压器(Data Fusion-Enhanced Decision Transformer, DFDT),通过两级数据筛选机制融合稀缺目标域数据与可信源片段,并利用最大均值差异(Maximum Mean Discrepancy, MMD)对齐状态结构、最优传输(Optimal Transport, OT)保障动作可行性;同时,将RTG token替换为优势条件token以提升语义连续性,并引入Q引导正则项抑制拼接点处的价值和动作突变。理论分析进一步表明,状态值和策略性能差距与MMD不匹配度和OT偏差呈正相关,且随着这两项指标减小而收紧,验证了DFDT设计的有效性。

链接: https://arxiv.org/abs/2511.09173
作者: Guojian Wang,Quinson Hon,Xuyang Chen,Lin Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 27 pages,4 figures

点击查看摘要

Abstract:Cross-domain shifts present a significant challenge for decision transformer (DT) policies. Existing cross-domain policy adaptation methods typically rely on a single simple filtering criterion to select source trajectory fragments and stitch them together. They match either state structure or action feasibility. However, the selected fragments still have poor stitchability: state structures can misalign, the return-to-go (RTG) becomes incomparable when the reward or horizon changes, and actions may jump at trajectory junctions. As a result, RTG tokens lose continuity, which compromises DT’s inference ability. To tackle these challenges, we propose Data Fusion-Enhanced Decision Transformer (DFDT), a compact pipeline that restores stitchability. Particularly, DFDT fuses scarce target data with selectively trusted source fragments via a two-level data filter, maximum mean discrepancy (MMD) mismatch for state-structure alignment, and optimal transport (OT) deviation for action feasibility. It then trains on a feasibility-weighted fusion distribution. Furthermore, DFDT replaces RTG tokens with advantage-conditioned tokens, which improves the continuity of the semantics in the token sequence. It also applies a Q -guided regularizer to suppress junction value and action jumps. Theoretically, we provide bounds that tie state value and policy performance gaps to the MMD-mismatch and OT-deviation measures, and show that the bounds tighten as these two measures shrink. We show that DFDT improves return and stability over strong offline RL and sequence-model baselines across gravity, kinematic, and morphology shifts on D4RL-style control tasks, and further corroborate these gains with token-stitching and sequence-semantics stability analyses.
zh

[AI-29] Efficient Reasoning via Reward Model

【速读】:该论文旨在解决大语言模型在强化学习中因“过度思考”(overthinking)现象导致的冗余推理路径问题,即模型生成的响应包含大量无关或重复的推理步骤,从而显著增加计算开销。其核心解决方案是提出一种训练简洁性奖励模型(Conciseness Reward Model, CRM)的流水线,并设计一种新的奖励函数——简洁性奖励函数(Conciseness Reward Function, CRF),该函数显式建模结果奖励与简洁性评分之间的依赖关系,从而在提升推理准确性的同时增强token效率。理论分析表明,CRF可通过降低方差和改善收敛性实现更优性能;实验验证了该方法在五个数学基准数据集上的有效性,尤其在Qwen2.5-7B模型上实现了8.1%的准确率提升和19.9%的响应token减少,且具备良好的跨模型泛化能力。

链接: https://arxiv.org/abs/2511.09158
作者: Yuhao Wang,Xiaopeng Li,Cheng Gong,Ziru Liu,Suiyun Zhang,Rui Liu,Xiangyu Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning with verifiable rewards (RLVR) has been shown to enhance the reasoning capabilities of large language models (LLMs), enabling the development of large reasoning models (LRMs). However, LRMs such as DeepSeek-R1 and OpenAI o1 often generate verbose responses containing redundant or irrelevant reasoning step-a phenomenon known as overthinking-which substantially increases computational costs. Prior efforts to mitigate this issue commonly incorporate length penalties into the reward function, but we find they frequently suffer from two critical issues: length collapse and training collapse, resulting in sub-optimal performance. To address them, we propose a pipeline for training a Conciseness Reward Model (CRM) that scores the conciseness of reasoning path. Additionally, we introduce a novel reward formulation named Conciseness Reward Function (CRF) with explicit dependency between the outcome reward and conciseness score, thereby fostering both more effective and more efficient reasoning. From a theoretical standpoint, we demonstrate the superiority of the new reward from the perspective of variance reduction and improved convergence properties. Besides, on the practical side, extensive experiments on five mathematical benchmark datasets demonstrate the method’s effectiveness and token efficiency, which achieves an 8.1% accuracy improvement and a 19.9% reduction in response token length on Qwen2.5-7B. Furthermore, the method generalizes well to other LLMs including Llama and Mistral. The implementation code and datasets are publicly available for reproduction: this https URL.
zh

[AI-30] ProBench: Benchmarking GUI Agents with Accurate Process Information AAAI2026

【速读】:该论文旨在解决当前GUI Agent评估中仅依赖最终屏幕状态而忽视中间操作过程的问题,导致对代理在复杂多步骤任务中的真实性能评估不准确。其解决方案的关键在于提出ProBench基准,包含超过200个具有挑战性的移动GUI任务,并引入“Process Provider”模块自动提供精确的中间步骤信息,从而实现对代理执行过程的精准评估,弥补了传统基于状态的任务评估方法的局限性。

链接: https://arxiv.org/abs/2511.09157
作者: Leyang Yang,Ziwei Wang,Xiaoxuan Tang,Sheng Zhou,Dajun Chen,Wei Jiang,Yong Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Paper accepted to AAAI 2026

点击查看摘要

Abstract:With the deep integration of artificial intelligence and interactive technology, Graphical User Interface (GUI) Agent, as the carrier connecting goal-oriented natural language and real-world devices, has received widespread attention from the community. Contemporary benchmarks aim to evaluate the comprehensive capabilities of GUI agents in GUI operation tasks, generally determining task completion solely by inspecting the final screen state. However, GUI operation tasks consist of multiple chained steps while not all critical information is presented in the final few pages. Although a few research has begun to incorporate intermediate steps into evaluation, accurately and automatically capturing this process information still remains an open challenge. To address this weakness, we introduce ProBench, a comprehensive mobile benchmark with over 200 challenging GUI tasks covering widely-used scenarios. Remaining the traditional State-related Task evaluation, we extend our dataset to include Process-related Task and design a specialized evaluation method. A newly introduced Process Provider automatically supplies accurate process information, enabling presice assessment of agent’s performance. Our evaluation of advanced GUI agents reveals significant limitations for real-world GUI scenarios. These shortcomings are prevalent across diverse models, including both large-scale generalist models and smaller, GUI-specific models. A detailed error analysis further exposes several universal problems, outlining concrete directions for future improvements.
zh

[AI-31] Enabling Agents to Communicate Entirely in Latent Space

【速读】:该论文旨在解决基于大语言模型(Large Language Models, LLMs)的智能体在协作解决问题时,受限于自然语言作为通信媒介所带来的信息损失问题。自然语言通过离散标记(token)对模型内部丰富的潜在状态(latent states)进行下采样,导致信息深度和细微差别难以有效传递,从而制约了协作效率与探索能力。其解决方案的关键在于提出Interlat(Inter-agent Latent Space Communication)范式,即利用LLM的最后隐藏层状态作为智能体“心智”的表示,并直接在潜在空间中进行通信(称为潜在通信),同时引入完全基于潜在空间的推理压缩机制,在保持信息完整性的同时显著提升推理速度并增强协作表现。

链接: https://arxiv.org/abs/2511.09149
作者: Zhuoyun Du,Runze Wang,Huiyu Bai,Zouying Cao,Xiaoyong Zhu,Bo Zheng,Wei Chen,Haochao Ying
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Work in progess

点击查看摘要

Abstract:While natural language is the de facto communication medium for LLM-based agents, it presents a fundamental constraint. The process of downsampling rich, internal latent states into discrete tokens inherently limits the depth and nuance of information that can be transmitted, thereby hindering collaborative problem-solving. Inspired by human mind-reading, we propose Interlat (Inter-agent Latent Space Communication), a paradigm that leverages the last hidden states of an LLM as a representation of its mind for direct transmission (termed latent communication). An additional compression process further compresses latent communication via entirely latent space reasoning. Experiments demonstrate that Interlat outperforms both fine-tuned chain-of-thought (CoT) prompting and single-agent baselines, promoting more exploratory behavior and enabling genuine utilization of latent information. Further compression not only substantially accelerates inference but also maintains competitive performance through an efficient information-preserving mechanism. We position this work as a feasibility study of entirely latent space inter-agent communication, and our results highlight its potential, offering valuable insights for future research.
zh

[AI-32] Vendor-Aware Industrial Agents : RAG -Enhanced LLM s for Secure On-Premise PLC Code Generation

【速读】:该论文旨在解决工业控制领域中可编程逻辑控制器(Programmable Logic Controllers, PLC)编程助手的开发难题,特别是针对专有代码方言导致的大语言模型(Large Language Models, LLMs)难以适配、企业对云服务信任不足以及低数据环境下高质量代码生成困难等问题。解决方案的关键在于采用小规模本地化模型微调(fine-tuning),结合检索增强生成(Retrieval-Augmented Generation, RAG)技术与精心设计的提示工程(prompt engineering),在无需大规模预训练或云端依赖的前提下实现高精度代码生成;同时通过多模型竞争机制、自动推理纠错和即时编译验证,显著提升代码质量和实用性,适用于边缘设备部署。

链接: https://arxiv.org/abs/2511.09122
作者: Joschka Kersting,Michael Rummel,Gesa Benndorf
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Programmable Logic Controllers are operated by proprietary code dialects; this makes it challenging to train coding assistants. Current LLMs are trained on large code datasets and are capable of writing IEC 61131-3 compatible code out of the box, but they neither know specific function blocks, nor related project code. Moreover, companies like Mitsubishi Electric and their customers do not trust cloud providers. Hence, an own coding agent is the desired solution to cope with this. In this study, we present our work on a low-data domain coding assistant solution for industrial use. We show how we achieved high quality code generation without fine-tuning large models and by fine-tuning small local models for edge device usage. Our tool lets several AI models compete with each other, uses reasoning, corrects bugs automatically and checks code validity by compiling it directly in the chat interface. We support our approach with an extensive evaluation that comes with code compilation statistics and user ratings. We found that a Retrieval-Augmented Generation (RAG) supported coding assistant can work in low-data domains by using extensive prompt engineering and directed retrieval.
zh

[AI-33] Differentially Private Rankings via Outranking Methods and Performance Data Aggregation

【速读】:该论文旨在解决多准则决策(Multiple-Criteria Decision Making, MCDM)方法在处理个人敏感数据时隐私保护机制不足的问题,尤其是在推荐系统等动态数据驱动场景中,如何在保障个体贡献隐私的前提下实现可靠的排序决策。其解决方案的关键在于将MCDM的优序法(outranking methods)与差分隐私(Differential Privacy, DP)相结合,通过一个预处理步骤聚合多个用户评估以构建综合性能矩阵,并在该过程中引入差分隐私机制,从而在保证排名结果统计相关性(强至极强)的同时提供可证明的隐私保障。

链接: https://arxiv.org/abs/2511.09120
作者: Luis Del Vasto-Terrientes
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted and published in the USB Proceedings of the 22th International Conference on Modeling Decisions for Artificial Intelligence (MDAI 2025), Valencia, Spain, September 15–18, 2025, ISBN 978-91-531-0240-3, pp. 21–32

点击查看摘要

Abstract:Multiple-Criteria Decision Making (MCDM) is a sub-discipline of Operations Research that helps decision-makers in choosing, ranking, or sorting alternatives based on conflicting criteria. Over time, its application has been expanded into dynamic and data-driven domains, such as recommender systems. In these contexts, the availability and handling of personal and sensitive data can play a critical role in the decision-making process. Despite this increased reliance on sensitive data, the integration of privacy mechanisms with MCDM methods is underdeveloped. This paper introduces an integrated approach that combines MCDM outranking methods with Differential Privacy (DP), safeguarding individual contributions’ privacy in ranking problems. This approach relies on a pre-processing step to aggregate multiple user evaluations into a comprehensive performance matrix. The evaluation results show a strong to very strong statistical correlation between the true rankings and their anonymized counterparts, ensuring robust privacy parameter guarantees.
zh

[AI-34] Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment AAAI2026

【速读】:该论文旨在解决强化学习人类反馈(Reinforcement Learning from Human Feedback, RLHF)与直接偏好优化(Direct Preference Optimization, DPO)对齐过程中,攻击者通过翻转偏好标签(preference labels)以最小成本操纵大语言模型(Large Language Models, LLMs)策略的问题。其核心解决方案是将最小代价的标签翻转攻击建模为一个带有线性约束的凸优化问题,从而推导出攻击成本的上下界;关键创新在于提出一种后处理方法,可在不改变中毒效果的前提下显著减少所需标签翻转的数量,尤其在奖励模型特征维度远小于数据集规模时表现突出。

链接: https://arxiv.org/abs/2511.09105
作者: Shigeki Kusaka,Keita Saito,Mikoto Kudo,Takumi Tanabe,Akifumi Wachi,Youhei Akimoto
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: accepted for AAAI 2026 Special Track on AI Alignment

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed in real-world systems, making it critical to understand their vulnerabilities. While data poisoning attacks during RLHF/DPO alignment have been studied empirically, their theoretical foundations remain unclear. We investigate the minimum-cost poisoning attack required to steer an LLM’s policy toward an attacker’s target by flipping preference labels during RLHF/DPO, without altering the compared outputs. We formulate this as a convex optimization problem with linear constraints, deriving lower and upper bounds on the minimum attack cost. As a byproduct of this theoretical analysis, we show that any existing label-flipping attack can be post-processed via our proposed method to reduce the number of label flips required while preserving the intended poisoning effect. Empirical results demonstrate that this cost-minimization post-processing can significantly reduce poisoning costs over baselines, particularly when the reward model’s feature dimension is small relative to the dataset size. These findings highlight fundamental vulnerabilities in RLHF/DPO pipelines and provide tools to evaluate their robustness against low-cost poisoning attacks.
zh

[AI-35] Factorization-in-Loop: Proximal Fill-in Minimization for Sparse Matrix Reordering

【速读】:该论文旨在解决大规模稀疏矩阵LU分解过程中因填充元(fill-ins)导致的内存占用增加和计算时间延长的问题。传统矩阵重排序方法虽可减少填充元,但其优化目标(如最小化非零元素数量)难以直接建模,且现有替代目标缺乏理论保证。解决方案的关键在于提出一种基于学习的重排序网络(reordering network),通过最小化重排后矩阵三角因子的l1l_1范数来近似真实填充元数量,并引入两种重参数化技术将节点评分映射为排列矩阵,再结合交替方向乘子法(ADMM)与近端梯度下降法联合优化目标函数,从而实现高效且可微分的重排序策略。实验表明,该方法在SuiteSparse基准集上相较最先进基线显著降低了20%的填充元数量和17.8%的LU分解耗时。

链接: https://arxiv.org/abs/2511.09093
作者: Ziwei Li,Shuzi Niu,Tao Yuan,Huiyuan Li,Wenjia Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fill-ins are new nonzero elements in the summation of the upper and lower triangular factors generated during LU factorization. For large sparse matrices, they will increase the memory usage and computational time, and be reduced through proper row or column arrangement, namely matrix reordering. Finding a row or column permutation with the minimal fill-ins is NP-hard, and surrogate objectives are designed to derive fill-in reduction permutations or learn a reordering function. However, there is no theoretical guarantee between the golden criterion and these surrogate objectives. Here we propose to learn a reordering network by minimizing (l_1) norm of triangular factors of the reordered matrix to approximate the exact number of fill-ins. The reordering network utilizes a graph encoder to predict row or column node scores. For inference, it is easy and fast to derive the permutation from sorting algorithms for matrices. For gradient based optimization, there is a large gap between the predicted node scores and resultant triangular factors in the optimization objective. To bridge the gap, we first design two reparameterization techniques to obtain the permutation matrix from node scores. The matrix is reordered by multiplying the permutation matrix. Then we introduce the factorization process into the objective function to arrive at target triangular factors. The overall objective function is optimized with the alternating direction method of multipliers and proximal gradient descent. Experimental results on benchmark sparse matrix collection SuiteSparse show the fill-in number and LU factorization time reduction of our proposed method is 20% and 17.8% compared with state-of-the-art baselines.
zh

[AI-36] OR-R1: Automating Modeling and Solving of Operations Research Optimization Problem via Test-Time Reinforcement Learning AAAI2026

【速读】:该论文旨在解决运筹学(Operations Research, OR)中自动化优化建模与求解过程中,将自然语言问题描述转化为形式化模型和求解代码的专家依赖性强、数据效率低的问题。现有基于大语言模型(Large Language Models, LLMs)的方法普遍需要大量标注或合成数据,导致成本高且难以扩展。其解决方案的关键在于提出OR-R1框架,采用两阶段训练策略:首先通过监督微调(Supervised Fine-Tuning, SFT)利用少量标注数据学习问题建模与代码生成的核心推理模式;其次引入测试时组相对策略优化(Test-Time Group Relative Policy Optimization, TGRPO),有效提升模型能力与一致性,从而在稀缺标注数据下充分挖掘未标注数据的价值。实验表明,OR-R1仅需1/10的合成数据即可达到当前最优性能(平均求解准确率67.7%),且在极小样本(如100个样本)下仍显著优于基线方法,同时TGRPO进一步缩小了单次尝试(Pass@1)与多次尝试(Pass@8)之间的性能差距,展现出鲁棒性、可扩展性和成本效益。

链接: https://arxiv.org/abs/2511.09092
作者: Zezhen Ding,Zhen Tan,Jiheng Zhang,Tianlong Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: 9 pages, 5 figures, AAAI 2026

点击查看摘要

Abstract:Optimization modeling and solving are fundamental to the application of Operations Research (OR) in real-world decision making, yet the process of translating natural language problem descriptions into formal models and solver code remains highly expertise intensive. While recent advances in large language models (LLMs) have opened new opportunities for automation, the generalization ability and data efficiency of existing LLM-based methods are still limited, asmost require vast amounts of annotated or synthetic data, resulting in high costs and scalability barriers. In this work, we present OR-R1, a data-efficient training framework for automated optimization modeling and solving. OR-R1 first employs supervised fine-tuning (SFT) to help the model acquire the essential reasoning patterns for problem formulation and code generation from limited labeled data. In addition, it improves the capability and consistency through Test-Time Group Relative Policy Optimization (TGRPO). This two-stage design enables OR-R1 to leverage both scarce labeled and abundant unlabeled data for effective learning. Experiments show that OR-R1 achieves state-of-the-art performance with an average solving accuracy of 67.7% , using only 1/10 the synthetic data required by prior methods such as ORLM, exceeding ORLM’s solving accuracy by up to 4.2% . Remarkably, OR-R1 outperforms ORLM by over 2.4% with just 100 synthetic samples. Furthermore, TGRPO contributes an additional 3.1%-6.4% improvement in accuracy, significantly narrowing the gap between single-attempt (Pass@1) and multi-attempt (Pass@8) performance from 13% to 7% . Extensive evaluations across diverse real-world benchmarks demonstrate that OR-R1 provides a robust, scalable, and cost-effective solution for automated OR optimization problem modeling and solving, lowering the expertise and data barriers for industrial OR applications.
zh

[AI-37] Diff-V2M: A Hierarchical Conditional Diffusion Model with Explicit Rhythmic Modeling for Video-to-Music Generation AAAI2026

【速读】:该论文旨在解决视频到音乐(Video-to-Music, V2M)生成中的两个核心挑战:一是现有方法缺乏显式的节奏建模,导致音频与视觉在时间上的对齐困难;二是如何有效融合多种视觉特征以条件化音乐生成。解决方案的关键在于提出Diff-V2M框架,其基于分层条件扩散模型,包含视觉特征提取与条件音乐生成两部分。其中,通过评估多种节奏表示(如低分辨率音高图、tempogram和起音检测函数ODF),发现低分辨率ODF更适合作为节奏信号,并设计了节奏预测器直接从视频中推断该信号;同时提取语义与情感特征以确保音乐的情感一致性。所有特征通过分层交叉注意力机制注入生成器:第一层用情感特征控制整体情绪基调,第二层融合语义与节奏特征;进一步引入时间步感知的特征融合策略(如FiLM和加权融合),使模型在扩散过程中动态调整语义与节奏信息的权重,从而实现更精准的音频-视觉对齐与高质量音乐生成。

链接: https://arxiv.org/abs/2511.09090
作者: Shulei Ji,Zihao Wang,Jiaxing Yu,Xiangyuan Yang,Shuyu Li,Songruoyao Wu,Kejun Zhang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: AAAI 2026

点击查看摘要

Abstract:Video-to-music (V2M) generation aims to create music that aligns with visual content. However, two main challenges persist in existing methods: (1) the lack of explicit rhythm modeling hinders audiovisual temporal alignments; (2) effectively integrating various visual features to condition music generation remains non-trivial. To address these issues, we propose Diff-V2M, a general V2M framework based on a hierarchical conditional diffusion model, comprising two core components: visual feature extraction and conditional music generation. For rhythm modeling, we begin by evaluating several rhythmic representations, including low-resolution mel-spectrograms, tempograms, and onset detection functions (ODF), and devise a rhythmic predictor to infer them directly from videos. To ensure contextual and affective coherence, we also extract semantic and emotional features. All features are incorporated into the generator via a hierarchical cross-attention mechanism, where emotional features shape the affective tone via the first layer, while semantic and rhythmic features are fused in the second cross-attention layer. To enhance feature integration, we introduce timestep-aware fusion strategies, including feature-wise linear modulation (FiLM) and weighted fusion, allowing the model to adaptively balance semantic and rhythmic cues throughout the diffusion process. Extensive experiments identify low-resolution ODF as a more effective signal for modeling musical rhythm and demonstrate that Diff-V2M outperforms existing models on both in-domain and out-of-domain datasets, achieving state-of-the-art performance in terms of objective metrics and subjective comparisons. Demo and code are available at this https URL.
zh

[AI-38] Improving Sustainability of Adversarial Examples in Class-Incremental Learning AAAI2026

【速读】:该论文旨在解决当前对抗样本(Adversarial Examples, AEs)在类增量学习(Class-Incremental Learning, CIL)场景下鲁棒性下降的问题。由于CIL模型随新类别的加入发生显著的域偏移(domain drift),传统AEs往往失效。解决方案的关键在于提出一种名为SAE(Sustainable Adversarial Examples)的方法,其核心是通过语义增强机制提升AE对域漂移的适应能力:首先利用视觉-语言模型生成通用语义以避免过拟合初始模型;其次设计语义修正模块(Semantic Correction Module),结合CIL模型引导AE语义向目标类别靠近;最后引入过滤与增强模块(Filtering-and-Augmentation Module)稳定潜在空间中的语义分布,从而实现AE在多轮CIL更新后仍保持高攻击成功率。

链接: https://arxiv.org/abs/2511.09088
作者: Taifeng Liu,Xinjing Liu,Liangqiu Dong,Yang Liu,Yilong Yang,Zhuo Ma
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: This paper is accepted to AAAI 2026

点击查看摘要

Abstract:Current adversarial examples (AEs) are typically designed for static models. However, with the wide application of Class-Incremental Learning (CIL), models are no longer static and need to be updated with new data distributed and labeled differently from the old ones. As a result, existing AEs often fail after CIL updates due to significant domain drift. In this paper, we propose SAE to enhance the sustainability of AEs against CIL. The core idea of SAE is to enhance the robustness of AE semantics against domain drift by making them more similar to the target class while distinguishing them from all other classes. Achieving this is challenging, as relying solely on the initial CIL model to optimize AE semantics often leads to overfitting. To resolve the problem, we propose a Semantic Correction Module. This module encourages the AE semantics to be generalized, based on a visual-language model capable of producing universal semantics. Additionally, it incorporates the CIL model to correct the optimization direction of the AE semantics, guiding them closer to the target class. To further reduce fluctuations in AE semantics, we propose a Filtering-and-Augmentation Module, which first identifies non-target examples with target-class semantics in the latent space and then augments them to foster more stable semantics. Comprehensive experiments demonstrate that SAE outperforms baselines by an average of 31.28% when updated with a 9-fold increase in the number of classes.
zh

[AI-39] -LLM -Hub: Building Context-Aware Multi-Agent LLM Systems for Telecom Networks

【速读】:该论文旨在解决5G及未来无线网络中,智能大语言模型(Large Language Model, LLM)应用在复杂网络环境下缺乏领域特定网络状态共享机制的问题,从而限制了多智能体(Multi-Agent, MA)系统对网络上下文的准确理解与协同决策。其解决方案的关键在于提出TeleMCP(Telecom Model Context Protocol),一种用于电信环境中智能体间结构化、语境丰富通信的协议,并通过Tele-LLM-Hub这一低代码平台实现该协议,支持智能体创建、工作流编排以及与srsRAN等软件栈的集成,从而加速上下文感知型多智能体LLM系统的原型设计与部署。

链接: https://arxiv.org/abs/2511.09087
作者: Vijay K Shah,Cong Shen
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces Tele-LLM-Hub, a user friendly low-code solution for rapid prototyping and deployment of context aware multi-agent (MA) Large Language Model (LLM) systems tailored for 5G and beyond. As telecom wireless networks become increasingly complex, intelligent LLM applications must share a domainspecific understanding of network state. We propose TeleMCP, the Telecom Model Context Protocol, to enable structured and context-rich communication between agents in telecom environments. Tele-LLM-Hub actualizes TeleMCP through a low-code interface that supports agent creation, workflow composition, and interaction with software stacks such as srsRAN. Key components include a direct chat interface, a repository of pre-built systems, an Agent Maker leveraging finetuning with our RANSTRUCT framework, and an MA-Maker for composing MA workflows. The goal of Tele-LLM-Hub is to democratize the design of contextaware MA systems and accelerate innovation in next-generation wireless networks.
zh

[AI-40] Good-for-MDP State Reduction for Stochastic LTL Planning AAAI2026

【速读】:该论文旨在解决马尔可夫决策过程(MDP)中基于线性时序逻辑(LTL)目标的随机规划问题,其核心挑战在于传统方法通过将LTL公式转换为适用于MDP的良好数学自动机(good-for-MDP, GFM)来实现策略合成,但生成的GFM状态空间规模过大,严重限制了算法的可扩展性。解决方案的关键在于提出一种新颖的GFM状态空间缩减技术,该技术通过一系列复杂的变换链,利用在对抗场景下最近发展的“好游戏”(good-for-games)最小化理论,显著减少GFM自动机的状态数量;此外,论文还提出了一种针对特定形式 GFφ\mathsf{G}\mathsf{F}\varphi(其中φ\varphi为共安全公式)的直接构造方法,该方法在最坏情况下仅具有单指数复杂度,相比通用方法的双指数复杂度大幅提升可扩展性,实验结果验证了该策略的有效性和实用性。

链接: https://arxiv.org/abs/2511.09073
作者: Christoph Weinhuber,Giuseppe De Giacomo,Yong Li,Sven Schewe,Qiyi Tang
机构: 未知
类目: Formal Languages and Automata Theory (cs.FL); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注: 16 pages including appendices, accepted to AAAI 2026

点击查看摘要

Abstract:We study stochastic planning problems in Markov Decision Processes (MDPs) with goals specified in Linear Temporal Logic (LTL). The state-of-the-art approach transforms LTL formulas into good-for-MDP (GFM) automata, which feature a restricted form of nondeterminism. These automata are then composed with the MDP, allowing the agent to resolve the nondeterminism during policy synthesis. A major factor affecting the scalability of this approach is the size of the generated automata. In this paper, we propose a novel GFM state-space reduction technique that significantly reduces the number of automata states. Our method employs a sophisticated chain of transformations, leveraging recent advances in good-for-games minimisation developed for adversarial settings. In addition to our theoretical contributions, we present empirical results demonstrating the practical effectiveness of our state-reduction technique. Furthermore, we introduce a direct construction method for formulas of the form \mathsfG\mathsfF\varphi , where \varphi is a co-safety formula. This construction is provably single-exponential in the worst case, in contrast to the general doubly-exponential complexity. Our experiments confirm the scalability advantages of this specialised construction.
zh

[AI-41] Break the Tie: Learning Cluster-Customized Category Relationships for Categorical Data Clustering AAAI2026

【速读】:该论文旨在解决类别型属性(categorical attributes)在聚类分析中因缺乏明确值间关系而难以发现紧凑簇结构的问题。传统方法通常假设类别间存在固定拓扑关系来学习距离度量,这限制了其对不同簇结构的适应能力并导致次优聚类效果。解决方案的关键在于打破类别间的固有关系约束,通过学习可定制的距离度量以灵活且准确地揭示多种簇分布;所学类别关系被证明满足欧氏距离度量相容性(Euclidean distance metric-compatible),从而可无缝扩展至包含数值与类别属性的混合数据集。

链接: https://arxiv.org/abs/2511.09049
作者: Mingjie Zhao,Zhanpei Huang,Yang Lu,Mengke Li,Yiqun Zhang,Weifeng Su,Yiu-ming Cheung
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Aeecpted to AAAI 2026

点击查看摘要

Abstract:Categorical attributes with qualitative values are ubiquitous in cluster analysis of real datasets. Unlike the Euclidean distance of numerical attributes, the categorical attributes lack well-defined relationships of their possible values (also called categories interchangeably), which hampers the exploration of compact categorical data clusters. Although most attempts are made for developing appropriate distance metrics, they typically assume a fixed topological relationship between categories when learning distance metrics, which limits their adaptability to varying cluster structures and often leads to suboptimal clustering performance. This paper, therefore, breaks the intrinsic relationship tie of attribute categories and learns customized distance metrics suitable for flexibly and accurately revealing various cluster distributions. As a result, the fitting ability of the clustering algorithm is significantly enhanced, benefiting from the learnable category relationships. Moreover, the learned category relationships are proved to be Euclidean distance metric-compatible, enabling a seamless extension to mixed datasets that include both numerical and categorical attributes. Comparative experiments on 12 real benchmark datasets with significance tests show the superior clustering accuracy of the proposed method with an average ranking of 1.25, which is significantly higher than the 5.21 ranking of the current best-performing method.
zh

[AI-42] Advancing Autonomous Emergency Response Systems: A Generative AI Perspective

【速读】:该论文旨在解决自动驾驶车辆(AV)在应急响应场景中因传统强化学习(Reinforcement Learning, RL)方法存在样本效率低和动态环境适应性差的问题。其解决方案的关键在于引入生成式AI技术:一方面,采用扩散模型(Diffusion Model, DM)增强的强化学习框架,通过合成数据提升策略鲁棒性;另一方面,探索大语言模型(Large Language Model, LLM)辅助的上下文学习(In-Context Learning, ICL)范式,实现无需重训练的轻量级、可解释的实时适应能力。这两种路径共同构成了下一代自主应急响应系统的核心优化策略。

链接: https://arxiv.org/abs/2511.09044
作者: Yousef Emami,Radha Reddy,Azadeh Pourkabirian,Miguel Gutierrez Gaitan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Autonomous Vehicles (AVs) are poised to revolutionize emergency services by enabling faster, safer, and more efficient responses. This transformation is driven by advances in Artificial Intelligence (AI), particularly Reinforcement Learning (RL), which allows AVs to navigate complex environments and make critical decisions in real time. However, conventional RL paradigms often suffer from poor sample efficiency and lack adaptability in dynamic emergency scenarios. This paper reviews next-generation AV optimization strategies to address these limitations. We analyze the shift from conventional RL to Diffusion Model (DM)-augmented RL, which enhances policy robustness through synthetic data generation, albeit with increased computational cost. Additionally, we explore the emerging paradigm of Large Language Model (LLM)-assisted In-Context Learning (ICL), which offers a lightweight and interpretable alternative by enabling rapid, on-the-fly adaptation without retraining. By reviewing the state of the art in AV intelligence, DM-augmented RL, and LLM-assisted ICL, this paper provides a critical framework for understanding the next generation of autonomous emergency response systems from a Generative AI perspective.
zh

[AI-43] MedHE: Communication-Efficient Privacy-Preserving Federated Learning with Adaptive Gradient Sparsification for Healthcare

【速读】:该论文旨在解决医疗联邦学习中隐私保护与计算效率之间的矛盾问题,即在资源受限的医疗机构间进行协作训练时,如何在保障敏感医疗数据隐私的同时实现高效的通信和计算。其解决方案的关键在于提出了一种名为MedHE的新框架,该框架结合了自适应梯度稀疏化(adaptive gradient sparsification)与CKKS同态加密(CKKS homomorphic encryption),通过引入带有误差补偿的动态阈值机制进行top-k梯度选择,在实现97.5%通信量减少(从1277 MB降至32 MB/轮)的同时保持模型性能稳定(准确率89.5%±0.8%,与标准联邦学习无显著差异,p=0.32),并提供基于环学习误差(Ring Learning with Errors, RLWE)假设的形式化安全分析及ε≤1.0的差分隐私保证,从而满足HIPAA合规性并具备扩展至100家以上机构的可行性。

链接: https://arxiv.org/abs/2511.09043
作者: Farjana Yesmin
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 Figures, 5 Tables

点击查看摘要

Abstract:Healthcare federated learning requires strong privacy guarantees while maintaining computational efficiency across resource-constrained medical institutions. This paper presents MedHE, a novel framework combining adaptive gradient sparsification with CKKS homomorphic encryption to enable privacy-preserving collaborative learning on sensitive medical data. Our approach introduces a dynamic threshold mechanism with error compensation for top-k gradient selection, achieving 97.5 percent communication reduction while preserving model utility. We provide formal security analysis under Ring Learning with Errors assumptions and demonstrate differential privacy guarantees with epsilon less than or equal to 1.0. Statistical testing across 5 independent trials shows MedHE achieves 89.5 percent plus or minus 0.8 percent accuracy, maintaining comparable performance to standard federated learning (p=0.32) while reducing communication from 1277 MB to 32 MB per training round. Comprehensive evaluation demonstrates practical feasibility for real-world medical deployments with HIPAA compliance and scalability to 100 plus institutions.
zh

[AI-44] FedSDWC: Federated Synergistic Dual-Representation Weak Causal Learning for OOD

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在实际部署中因数据分布差异(如协变量偏移和语义偏移)导致的可靠性下降问题。现有方法在准确捕捉不变特征(invariant features)并直接构建因果表示方面存在局限,难以有效提升模型的泛化能力和对分布外(Out-of-Distribution, OOD)数据的检测性能。解决方案的关键在于提出FedSDWC,一种基于因果推断的方法,通过建模不变特征与可变特征之间的弱因果影响,同时融合不变和可变特征以推断因果语义表征(causal semantic representations),从而突破传统不变学习方法的瓶颈。该方法首次建立了其泛化误差界与客户端先验分布之间的关系,并在多个基准数据集上验证了其在处理协变量和语义偏移时的优越性能。

链接: https://arxiv.org/abs/2511.09036
作者: Zhenyuan Huang,Hui Zhang,Wenzhong Tang,Haijun Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Amid growing demands for data privacy and advances in computational infrastructure, federated learning (FL) has emerged as a prominent distributed learning paradigm. Nevertheless, differences in data distribution (such as covariate and semantic shifts) severely affect its reliability in real-world deployments. To address this issue, we propose FedSDWC, a causal inference method that integrates both invariant and variant features. FedSDWC infers causal semantic representations by modeling the weak causal influence between invariant and variant features, effectively overcoming the limitations of existing invariant learning methods in accurately capturing invariant features and directly constructing causal representations. This approach significantly enhances FL’s ability to generalize and detect OOD data. Theoretically, we derive FedSDWC’s generalization error bound under specific conditions and, for the first time, establish its relationship with client prior distributions. Moreover, extensive experiments conducted on multiple benchmark datasets validate the superior performance of FedSDWC in handling covariate and semantic shifts. For example, FedSDWC outperforms FedICON, the next best baseline, by an average of 3.04% on CIFAR-10 and 8.11% on CIFAR-100.
zh

[AI-45] Argus: Resilience-Oriented Safety Assurance Framework for End-to-End ADSs

【速读】:该论文旨在解决端到端自动驾驶系统(End-to-End Autonomous Driving Systems, ADSs)在实际道路部署中面临的运行时安全性挑战,即如何持续监测潜在驾驶危害并自适应响应以防止安全违规行为,从而提升复杂场景下的鲁棒性表现。其解决方案的关键在于提出一个名为Argus的运行时韧性框架,该框架通过持续监控ADS生成的车辆轨迹来识别安全隐患,并在检测到不安全状态时无缝接管控制权,由一个危害缓解模块(hazard mitigator)进行干预,从而有效降低违规发生率并显著提升驾驶评分(平均提升达150.30%),同时保持极低的时间开销。

链接: https://arxiv.org/abs/2511.09032
作者: Dingji Wang,You Lu,Bihuan Chen,Shuo Hao,Haowen Jiang,Yifan Tian,Xin Peng
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO); Software Engineering (cs.SE)
备注: The paper has been accepted by the 40th IEEE/ACM International Conference on Automated Software Engineering, ASE 2025

点击查看摘要

Abstract:End-to-end autonomous driving systems (ADSs), with their strong capabilities in environmental perception and generalizable driving decisions, are attracting growing attention from both academia and industry. However, once deployed on public roads, ADSs are inevitably exposed to diverse driving hazards that may compromise safety and degrade system performance. This raises a strong demand for resilience of ADSs, particularly the capability to continuously monitor driving hazards and adaptively respond to potential safety violations, which is crucial for maintaining robust driving behaviors in complex driving scenarios. To bridge this gap, we propose a runtime resilience-oriented framework, Argus, to mitigate the driving hazards, thus preventing potential safety violations and improving the driving performance of an ADS. Argus continuously monitors the trajectories generated by the ADS for potential hazards and, whenever the EGO vehicle is deemed unsafe, seamlessly takes control through a hazard mitigator. We integrate Argus with three state-of-the-art end-to-end ADSs, i.e., TCP, UniAD and VAD. Our evaluation has demonstrated that Argus effectively and efficiently enhances the resilience of ADSs, improving the driving score of the ADS by up to 150.30% on average, and preventing up to 64.38% of the violations, with little additional time overhead. Comments: The paper has been accepted by the 40th IEEE/ACM International Conference on Automated Software Engineering, ASE 2025 Subjects: Artificial Intelligence (cs.AI); Robotics (cs.RO); Software Engineering (cs.SE) Cite as: arXiv:2511.09032 [cs.AI] (or arXiv:2511.09032v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2511.09032 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering.2025
zh

[AI-46] Heterogeneous Graph Neural Networks for Assumption-Based Argumentation AAAI2026

【速读】:该论文旨在解决结构化论证形式化中基于假设的论证(Assumption-Based Argumentation, ABA)在稳定语义下计算扩展(extensions)的计算复杂性问题,尤其针对大规模框架难以进行精确计算的挑战。其解决方案的关键在于提出两种基于图神经网络(Graph Neural Network, GNN)的新架构——ABAGCN与ABAGAT,通过构建包含假设、主张和规则节点及其支持、推导和攻击关系的异构依赖图表示,利用残差异构卷积或注意力机制学习节点嵌入,并结合ICCMA 2023基准数据集与合成ABA框架进行训练优化;进一步设计了一个基于预测结果的多项式时间扩展重构算法,在小规模框架上实现F1超过0.85的稳定扩展重建性能,且在大规模框架上保持约0.58的F1分数,从而为结构化论证中的可扩展近似推理开辟了新路径。

链接: https://arxiv.org/abs/2511.08982
作者: Preesha Gehlot,Anna Rapberger,Fabrizio Russo,Francesca Toni
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to AAAI2026. Version with Appendix

点击查看摘要

Abstract:Assumption-Based Argumentation (ABA) is a powerful structured argumentation formalism, but exact computation of extensions under stable semantics is intractable for large frameworks. We present the first Graph Neural Network (GNN) approach to approximate credulous acceptance in ABA. To leverage GNNs, we model ABA frameworks via a dependency graph representation encoding assumptions, claims and rules as nodes, with heterogeneous edge labels distinguishing support, derive and attack relations. We propose two GNN architectures - ABAGCN and ABAGAT - that stack residual heterogeneous convolution or attention layers, respectively, to learn node embeddings. Our models are trained on the ICCMA 2023 benchmark, augmented with synthetic ABAFs, with hyperparameters optimised via Bayesian search. Empirically, both ABAGCN and ABAGAT outperform a state-of-the-art GNN baseline that we adapt from the abstract argumentation literature, achieving a node-level F1 score of up to 0.71 on the ICCMA instances. Finally, we develop a sound polynomial time extension-reconstruction algorithm driven by our predictor: it reconstructs stable extensions with F1 above 0.85 on small ABAFs and maintains an F1 of about 0.58 on large frameworks. Our work opens new avenues for scalable approximate reasoning in structured argumentation.
zh

[AI-47] AlphaCast: A Human Wisdom-LLM Intelligence Co-Reasoning Framework for Interactive Time Series Forecasting

【速读】:该论文旨在解决传统时间序列预测方法在高风险领域(如能源、医疗和气候)中因缺乏交互性、推理能力和适应性而难以应对复杂现实环境的问题。现有方法通常将预测视为静态的一次性映射任务,无法模拟人类专家的动态决策过程。解决方案的关键在于提出AlphaCast框架,该框架通过人智与大语言模型(Large Language Model, LLM)的协同推理机制,将预测重构为一个分阶段的交互式过程:第一阶段构建多源认知基础(包括特征集、领域知识库、上下文存储库和案例库),第二阶段引入生成式推理与反思优化机制,整合统计时序特征、先验知识、上下文信息及策略匹配,并通过元推理循环实现持续自我修正与策略精炼,从而显著提升预测准确性。

链接: https://arxiv.org/abs/2511.08947
作者: Xiaohan Zhang,Tian Gao,Mingyue Cheng,Bokai Pan,Ze Guo,Yaguo Liu,Xiaoyu Tao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Time series forecasting plays a critical role in high-stakes domains such as energy, healthcare, and climate. Although recent advances have improved accuracy, most approaches still treat forecasting as a static one-time mapping task, lacking the interaction, reasoning, and adaptability of human experts. This gap limits their usefulness in complex real-world environments. To address this, we propose AlphaCast, a human wisdom-large language model (LLM) intelligence co-reasoning framework that redefines forecasting as an interactive process. The key idea is to enable step-by-step collaboration between human wisdom and LLM intelligence to jointly prepare, generate, and verify forecasts. The framework consists of two stages: (1) automated prediction preparation, where AlphaCast builds a multi-source cognitive foundation comprising a feature set that captures key statistics and time patterns, a domain knowledge base distilled from corpora and historical series, a contextual repository that stores rich information for each time window, and a case base that retrieves optimal strategies via pattern clustering and matching; and (2) generative reasoning and reflective optimization, where AlphaCast integrates statistical temporal features, prior knowledge, contextual information, and forecasting strategies, triggering a meta-reasoning loop for continuous self-correction and strategy refinement. Extensive experiments on short- and long-term datasets show that AlphaCast consistently outperforms state-of-the-art baselines in predictive accuracy. Code is available at this repository: this https URL .
zh

[AI-48] hink Remember Navigate: Zero-Shot Object-Goal Navigation with VLM-Powered Reasoning

【速读】:该论文旨在解决当前视觉-语言模型(Vision-Language Models, VLMs)在机器人导航任务中推理能力未被充分挖掘的问题,即现有方法多将VLM视为被动观察者,未能发挥其高层规划与情境理解的优势。解决方案的关键在于重构VLM的角色,使其成为导航过程中的主动策略制定者:通过结构化链式思维提示(structured chain-of-thought prompting)激发逻辑化的分步推理,动态引入智能体近期动作历史以避免陷入循环,以及创新性地赋予VLM同时解读俯视障碍物地图与第一人称视角的能力,从而显著提升空间感知与路径规划的合理性。实验证明,该框架在HM3D、Gibson和MP3D等复杂基准上生成了更直接、逻辑更强的导航轨迹,大幅优于现有方法。

链接: https://arxiv.org/abs/2511.08942
作者: Mobin Habibpour,Fatemeh Afghah
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While Vision-Language Models (VLMs) are set to transform robotic navigation, existing methods often underutilize their reasoning capabilities. To unlock the full potential of VLMs in robotics, we shift their role from passive observers to active strategists in the navigation process. Our framework outsources high-level planning to a VLM, which leverages its contextual understanding to guide a frontier-based exploration agent. This intelligent guidance is achieved through a trio of techniques: structured chain-of-thought prompting that elicits logical, step-by-step reasoning; dynamic inclusion of the agent’s recent action history to prevent getting stuck in loops; and a novel capability that enables the VLM to interpret top-down obstacle maps alongside first-person views, thereby enhancing spatial awareness. When tested on challenging benchmarks like HM3D, Gibson, and MP3D, this method produces exceptionally direct and logical trajectories, marking a substantial improvement in navigation efficiency over existing approaches and charting a path toward more capable embodied agents.
zh

[AI-49] A Research on Business Process Optimisation Model Integrating AI and Big Data Analytics

【速读】:该论文旨在解决企业数字化转型背景下业务流程优化不足的问题,以提升企业竞争力。其解决方案的关键在于构建一个融合人工智能(Artificial Intelligence, AI)与大数据技术的业务流程优化模型,采用包含数据处理、AI算法和业务逻辑的三层架构,实现对业务流程全生命周期的智能管理,并通过分布式计算与深度学习技术保障系统在复杂场景下的高性能与高可靠性。

链接: https://arxiv.org/abs/2511.08934
作者: Di Liao,Ruijia Liang,Ziyi Ye
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the deepening of digital transformation, business process optimisation has become the key to improve the competitiveness of enterprises. This study constructs a business process optimisation model integrating artificial intelligence and big data to achieve intelligent management of the whole life cycle of processes. The model adopts a three-layer architecture incorporating data processing, AI algorithms, and business logic to enable real-time process monitoring and optimization. Through distributed computing and deep learning techniques, the system can handle complex business scenarios while maintaining high performance and reliability. Experimental validation across multiple enterprise scenarios shows that the model shortens process processing time by 42%, improves resource utilisation by 28%, and reduces operating costs by 35%. The system maintained 99.9% availability under high concurrent loads. The research results have important theoretical and practical value for promoting the digital transformation of enterprises, and provide new ideas for improving the operational efficiency of enterprises.
zh

[AI-50] Achieving Equilibrium under Utility Heterogeneity: An Agent -Attention Framework for Multi-Agent Multi-Objective Reinforcement Learning

【速读】:该论文旨在解决多智能体多目标强化学习(Multi-agent Multi-objective Reinforcement Learning, MAMORL)中因个体目标异质性与私有效用函数导致的训练非平稳性问题,尤其在去中心化执行约束下难以达成贝叶斯纳什均衡(Bayesian Nash Equilibrium, BNE)的挑战。解决方案的关键在于提出一种Agent-Attention Multi-Agent Multi-Objective Reinforcement Learning(AA-MAMORL)框架:通过中央训练阶段隐式学习其他智能体效用函数及其策略的联合信念,从而在不依赖智能体间通信的前提下,使每个智能体基于局部观测和自身私有效用函数独立决策,逼近全局最优的BNE,显著提升性能并优于现有最先进方法。

链接: https://arxiv.org/abs/2511.08926
作者: Zhuhui Li,Chunbo Luo,Liming Huang,Luyu Qi,Geyong Min
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-agent multi-objective systems (MAMOS) have emerged as powerful frameworks for modelling complex decision-making problems across various real-world domains, such as robotic exploration, autonomous traffic management, and sensor network optimisation. MAMOS offers enhanced scalability and robustness through decentralised control and more accurately reflects inherent trade-offs between conflicting objectives. In MAMOS, each agent uses utility functions that map return vectors to scalar values. Existing MAMOS optimisation methods face challenges in handling heterogeneous objective and utility function settings, where training non-stationarity is intensified due to private utility functions and the associated policies. In this paper, we first theoretically prove that direct access to, or structured modeling of, global utility functions is necessary for the Bayesian Nash Equilibrium under decentralised execution constraints. To access the global utility functions while preserving the decentralised execution, we propose an Agent-Attention Multi-Agent Multi-Objective Reinforcement Learning (AA-MAMORL) framework. Our approach implicitly learns a joint belief over other agents’ utility functions and their associated policies during centralised training, effectively mapping global states and utilities to each agent’s policy. In execution, each agent independently selects actions based on local observations and its private utility function to approximate a BNE, without relying on inter-agent communication. We conduct comprehensive experiments in both a custom-designed MAMO Particle environment and the standard MOMALand benchmark. The results demonstrate that access to global preferences and our proposed AA-MAMORL significantly improve performance and consistently outperform state-of-the-art methods.
zh

[AI-51] Diffusion Policies with Value-Conditional Optimization for Offline Reinforcement Learning IROS2025

【速读】:该论文旨在解决离线强化学习(offline reinforcement learning)中因分布外(out-of-distribution, OOD)动作导致的价值高估问题,该问题严重限制了策略性能。现有方法虽利用扩散模型(diffusion models)的强分布匹配能力并通过行为策略约束实现保守性,但常对低质量数据集中冗余动作施加无差别正则化,造成过度保守并破坏扩散建模的表达能力与效率之间的平衡。其解决方案的关键在于提出一种名为DIffusion policies with Value-conditional Optimization (DIVO)的新方法:通过引入基于动作优势值的二值加权机制,在扩散模型训练阶段引导生成更贴近数据分布且仅扩展高优势动作边界,从而实现精准分布对齐;同时在策略改进阶段动态筛选高回报潜力动作,有效提升策略性能。此设计实现了离线强化学习中保守性与探索性之间的关键平衡。

链接: https://arxiv.org/abs/2511.08922
作者: Yunchang Ma,Tenglong Liu,Yixing Lan,Xin Yin,Changxin Zhang,Xinglong Zhang,Xin Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: IROS 2025

点击查看摘要

Abstract:In offline reinforcement learning, value overestimation caused by out-of-distribution (OOD) actions significantly limits policy performance. Recently, diffusion models have been leveraged for their strong distribution-matching capabilities, enforcing conservatism through behavior policy constraints. However, existing methods often apply indiscriminate regularization to redundant actions in low-quality datasets, resulting in excessive conservatism and an imbalance between the expressiveness and efficiency of diffusion modeling. To address these issues, we propose DIffusion policies with Value-conditional Optimization (DIVO), a novel approach that leverages diffusion models to generate high-quality, broadly covered in-distribution state-action samples while facilitating efficient policy improvement. Specifically, DIVO introduces a binary-weighted mechanism that utilizes the advantage values of actions in the offline dataset to guide diffusion model training. This enables a more precise alignment with the dataset’s distribution while selectively expanding the boundaries of high-advantage actions. During policy improvement, DIVO dynamically filters high-return-potential actions from the diffusion model, effectively guiding the learned policy toward better performance. This approach achieves a critical balance between conservatism and explorability in offline RL. We evaluate DIVO on the D4RL benchmark and compare it against state-of-the-art baselines. Empirical results demonstrate that DIVO achieves superior performance, delivering significant improvements in average returns across locomotion tasks and outperforming existing methods in the challenging AntMaze domain, where sparse rewards pose a major difficulty.
zh

[AI-52] Seal: Encrypted Fingerprinting for Reliable LLM Ownership Verification AAAI2026

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)知识产权(Intellectual Property, IP)保护中指纹识别方法在模型窃贼完全控制推理过程时失效的问题。现有方法依赖于提取或注入模型特有特征进行所有权验证,但在攻击者能操控推理流程的情况下,可通过共享提示-响应对实现指纹遗忘(fingerprint unlearning)或篡改输出以规避精确匹配验证,导致验证失败。解决方案的关键在于提出 iSeal,一种面向端到端控制场景的新型指纹机制:其通过在模型内部与外部模块中同时注入唯一特征,并结合纠错机制和基于相似度的验证策略,有效抵御包括合谋式指纹遗忘和响应操纵在内的多种验证时攻击,理论分析与实证结果表明,iSeal 在12个LLM上对10余种攻击实现了100%的指纹成功识别率(Fingerprint Success Rate, FSR),而基线方法则在指纹遗忘和响应操纵下失效。

链接: https://arxiv.org/abs/2511.08905
作者: Zixun Xiong,Gaoyi Wu,Qingyang Yu,Mingyu Derek Ma,Lingfeng Yao,Miao Pan,Xiaojiang Du,Hao Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Given the high cost of large language model (LLM) training from scratch, safeguarding LLM intellectual property (IP) has become increasingly crucial. As the standard paradigm for IP ownership verification, LLM fingerprinting thus plays a vital role in addressing this challenge. Existing LLM fingerprinting methods verify ownership by extracting or injecting model-specific features. However, they overlook potential attacks during the verification process, leaving them ineffective when the model thief fully controls the LLM’s inference process. In such settings, attackers may share prompt-response pairs to enable fingerprint unlearning or manipulate outputs to evade exact-match verification. We propose iSeal, the first fingerprinting method designed for reliable verification when the model thief controls the suspected LLM in an end-to-end manner. It injects unique features into both the model and an external module, reinforced by an error-correction mechanism and a similarity-based verification strategy. These components are resistant to verification-time attacks, including collusion-based fingerprint unlearning and response manipulation, backed by both theoretical analysis and empirical results. iSeal achieves 100 percent Fingerprint Success Rate (FSR) on 12 LLMs against more than 10 attacks, while baselines fail under unlearning and response manipulations.
zh

[AI-53] Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds

【速读】:该论文旨在解决如何在复杂、开放世界的3D环境中开发具备长期任务执行能力的通用智能体(generalist agents)的问题,尤其关注其在真实时间内的持续决策与多模态交互能力。解决方案的关键在于提出Lumine,一个采用类人交互范式的端到端架构,通过视觉-语言模型(vision-language model)统一感知、推理与动作生成,并以5 Hz处理原始像素输入、输出30 Hz的键盘鼠标动作,同时仅在必要时动态调用推理模块,从而实现高效且灵活的实时控制。该方法在《原神》中成功完成五小时主线剧情,且无需微调即可零样本迁移至《鸣潮》和《崩坏:星穹铁道》,展现出强大的跨游戏泛化能力。

链接: https://arxiv.org/abs/2511.08892
作者: Weihao Tan,Xiangyang Li,Yunhao Fang,Heyuan Yao,Shi Yan,Hao Luo,Tenglong Ao,Huihui Li,Hongbin Ren,Bairen Yi,Yujia Qin,Bo An,Libin Liu,Guang Shi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Lumine, the first open recipe for developing generalist agents capable of completing hours-long complex missions in real time within challenging 3D open-world environments. Lumine adopts a human-like interaction paradigm that unifies perception, reasoning, and action in an end-to-end manner, powered by a vision-language model. It processes raw pixels at 5 Hz to produce precise 30 Hz keyboard-mouse actions and adaptively invokes reasoning only when necessary. Trained in Genshin Impact, Lumine successfully completes the entire five-hour Mondstadt main storyline on par with human-level efficiency and follows natural language instructions to perform a broad spectrum of tasks in both 3D open-world exploration and 2D GUI manipulation across collection, combat, puzzle-solving, and NPC interaction. In addition to its in-domain performance, Lumine demonstrates strong zero-shot cross-game generalization. Without any fine-tuning, it accomplishes 100-minute missions in Wuthering Waves and the full five-hour first chapter of Honkai: Star Rail. These promising results highlight Lumine’s effectiveness across distinct worlds and interaction dynamics, marking a concrete step toward generalist agents in open-ended environments.
zh

[AI-54] FAST-CAD: A Fairness-Aware Framework for Non-Contact Stroke Diagnosis

【速读】:该论文旨在解决当前自动化卒中(stroke)诊断方法在不同人口群体中存在公平性问题,可能导致医疗资源分配不均和健康差距扩大的挑战。其解决方案的关键在于提出FAST-CAD框架,该框架融合了领域对抗训练(domain-adversarial training, DAT)与组分布鲁棒优化(group distributionally robust optimization, Group-DRO),通过学习跨群体不变的特征表示并最小化最差组别的风险,从而在保证诊断准确性的同时实现公平性保障。该方法基于领域适应和极小极大公平性理论,提供收敛性证明和公平性边界,实验证明其在12个由年龄、性别和体位定义的人口亚组上均表现出优越的诊断性能与公平性。

链接: https://arxiv.org/abs/2511.08887
作者: Tianming Sha,Zechuan Chen,Zhan Cheng,Haotian Zhai,Xuwei Ding,Junnan Li,Haixiang Tang,Zaoting Sun,Yanchuan Tang,Yongzhe Yi,Yanjie Huang,Anhao Li,Yuan Gao,Keze Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Stroke is an acute cerebrovascular disease, and timely diagnosis significantly improves patient survival. However, existing automated diagnosis methods suffer from fairness issues across demographic groups, potentially exacerbating healthcare disparities. In this work we propose FAST-CAD, a theoretically grounded framework that combines domain-adversarial training (DAT) with group distributionally robust optimization (Group-DRO) for fair and accurate non-contact stroke diagnosis. Our approach is built on domain adaptation and minimax fairness theory and provides convergence guarantees and fairness bounds. We curate a multimodal dataset covering 12 demographic subgroups defined by age, gender, and posture. FAST-CAD employs self-supervised encoders with adversarial domain discrimination to learn demographic-invariant representations, while Group-DRO optimizes worst-group risk to ensure robust performance across all subgroups. Extensive experiments show that our method achieves superior diagnostic performance while maintaining fairness across demographic groups, and our theoretical analysis supports the effectiveness of the unified DAT + Group-DRO framework. This work provides both practical advances and theoretical insights for fair medical AI systems.
zh

[AI-55] UCO: A Multi-Turn Interactive Reinforcement Learning Method for Adaptive Teaching with Large Language Models

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在教育场景中作为智能导师时存在的两大局限性:一是监督微调方法仅学习表面教学模式,缺乏动态适应能力;二是强化学习方法虽能提升适应性,但无法区分学生是否真正理解知识(而非机械复现答案),且无法实时感知学生的认知状态变化,从而难以动态调整教学策略以匹配其最近发展区(Zone of Proximal Development, ZPD)。解决方案的关键在于提出单向认知优化(Unidirectional Cognitive Optimization, UCO)方法,其核心创新是设计了两个协同作用的奖励函数:Progress Reward 用于衡量学生认知进展,判断其是否从困惑走向理解;Scaffold Reward 则动态识别每个学生的 ZPD,引导教师在该区域内实施有效教学,从而实现基于多轮交互式强化学习的自适应教学优化。

链接: https://arxiv.org/abs/2511.08873
作者: Shouang Wei,Min Zhang,Xin Lin,Bo Jiang,Kun Kuang,Zhongxiang Dai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are shifting from answer providers to intelligent tutors in educational settings, yet current supervised fine-tuning methods only learn surface teaching patterns without dynamic adaptation capabilities. Recent reinforcement learning approaches address this limitation but face two critical challenges. First, they evaluate teaching effectiveness solely based on whether students produce correct outputs, unable to distinguish whether students genuinely understand or echo teacher-provided answers during interaction. Second, they cannot perceive students’ evolving cognitive states in real time through interactive dialogue, thus failing to adapt teaching strategies to match students’ cognitive levels dynamically. We propose the Unidirectional Cognitive Optimization (UCO) method to address these challenges. UCO uses a multi-turn interactive reinforcement learning paradigm where the innovation lies in two synergistic reward functions: the Progress Reward captures students’ cognitive advancement, evaluating whether students truly transition from confusion to comprehension, while the Scaffold Reward dynamically identifies each student’s Zone of Proximal Development (ZPD), encouraging teachers to maintain productive teaching within this zone. We evaluate UCO by comparing it against 11 baseline models on BigMath and MathTutorBench benchmarks. Experimental results demonstrate that our UCO model outperforms all models of equivalent scale and achieves performance comparable to advanced closed-source models. The code and data are available at this https URL.
zh

[AI-56] Conformal Prediction for Multi-Source Detection on a Network

【速读】:该论文旨在解决网络中信息或感染传播的多源检测问题(multi-source detection problem),即在给定图上节点感染状态的快照观测下,估计引发传播的源节点集合。现有方法要么缺乏统计保障,要么受限于特定扩散模型和假设。其解决方案的关键在于提出了一种新颖的置信预测框架(conformal prediction framework),该框架不依赖于底层扩散过程或数据分布,即可提供严格的召回率(recall)保证;通过设计合理的评分函数量化预测概率与真实源节点的一致性,并利用校准集构造具有用户指定召回率和覆盖率的预测集,从而在单源与多源场景下均具备通用性和高效性。

链接: https://arxiv.org/abs/2511.08867
作者: Xingchao Jian,Purui Zhang,Lan Tian,Feng Ji,Wenfei Liang,Wee Peng Tay,Bihan Wen,Felix Krahmer
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Detecting the origin of information or infection spread in networks is a fundamental challenge with applications in misinformation tracking, epidemiology, and beyond. We study the multi-source detection problem: given snapshot observations of node infection status on a graph, estimate the set of source nodes that initiated the propagation. Existing methods either lack statistical guarantees or are limited to specific diffusion models and assumptions. We propose a novel conformal prediction framework that provides statistically valid recall guarantees for source set detection, independent of the underlying diffusion process or data distribution. Our approach introduces principled score functions to quantify the alignment between predicted probabilities and true sources, and leverages a calibration set to construct prediction sets with user-specified recall and coverage levels. The method is applicable to both single- and multi-source scenarios, supports general network diffusion dynamics, and is computationally efficient for large graphs. Empirical results demonstrate that our method achieves rigorous coverage with competitive accuracy, outperforming existing baselines in both reliability and this http URL code is available online.
zh

[AI-57] ransformer-Based Sleep Stage Classification Enhanced by Clinical Information

【速读】:该论文旨在解决人工睡眠分期(sleep staging)耗时且存在评分者间变异的问题,同时克服现有深度学习模型仅依赖原始多导睡眠图(polysomnography, PSG)信号而忽略人类专家使用的上下文线索的局限性。其解决方案的关键在于提出一种两阶段架构:首先使用基于Transformer的逐epoch编码器提取PSG特征,再通过一维卷积神经网络(1D CNN)聚合特征,并系统性地融合两类显式上下文信息——受试者层面的临床元数据(如年龄、性别、BMI)和逐epoch专家标注事件(如呼吸暂停、血氧饱和度下降、觉醒、周期性呼吸)。实验表明,这种特征融合策略显著提升分期准确率(macro-F1从0.7745提升至0.8031),且优于多任务学习方法,证明了引入临床可解释特征能有效增强模型性能与可解释性,无需修改PSG采集方案或增加传感器。

链接: https://arxiv.org/abs/2511.08864
作者: Woosuk Chung,Seokwoo Hong,Wonhyeok Lee,Sangyoon Bae
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Manual sleep staging from polysomnography (PSG) is labor-intensive and prone to inter-scorer variability. While recent deep learning models have advanced automated staging, most rely solely on raw PSG signals and neglect contextual cues used by human experts. We propose a two-stage architecture that combines a Transformer-based per-epoch encoder with a 1D CNN aggregator, and systematically investigates the effect of incorporating explicit context: subject-level clinical metadata (age, sex, BMI) and per-epoch expert event annotations (apneas, desaturations, arousals, periodic breathing). Using the Sleep Heart Health Study (SHHS) cohort (n=8,357), we demonstrate that contextual fusion substantially improves staging accuracy. Compared to a PSG-only baseline (macro-F1 0.7745, micro-F1 0.8774), our final model achieves macro-F1 0.8031 and micro-F1 0.9051, with event annotations contributing the largest gains. Notably, feature fusion outperforms multi-task alternatives that predict the same auxiliary labels. These results highlight that augmenting learned representations with clinically meaningful features enhances both performance and interpretability, without modifying the PSG montage or requiring additional sensors. Our findings support a practical and scalable path toward context-aware, expert-aligned sleep staging systems.
zh

[AI-58] Rethinking Graph Super-resolution: Dual Frameworks for Topological Fidelity

【速读】:该论文旨在解决图超分辨率(Graph Super-Resolution)任务中现有基于图神经网络(GNN)方法的两大局限性:一是矩阵形式的节点超分辨率方法忽视图结构信息,缺乏排列不变性;二是依赖节点表示来推断边权重,限制了模型的可扩展性和表达能力。其解决方案的关键在于提出两个与GNN无关的框架:首先,Bi-SR通过构建连接低分辨率(LR)与高分辨率(HR)节点的二分图,实现结构感知的节点超分辨率,从而保持拓扑结构并具备排列不变性;其次,DEFEND通过将HR边映射到对偶图(dual graph)的节点上,学习边表示,并利用标准的节点GNN进行边推理,有效提升了模型的表达能力和可扩展性。

链接: https://arxiv.org/abs/2511.08853
作者: Pragya Singh,Islem Rekik
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Graph super-resolution, the task of inferring high-resolution (HR) graphs from low-resolution (LR) counterparts, is an underexplored yet crucial research direction that circumvents the need for costly data acquisition. This makes it especially desirable for resource-constrained fields such as the medical domain. While recent GNN-based approaches show promise, they suffer from two key limitations: (1) matrix-based node super-resolution that disregards graph structure and lacks permutation invariance; and (2) reliance on node representations to infer edge weights, which limits scalability and expressivity. In this work, we propose two GNN-agnostic frameworks to address these issues. First, Bi-SR introduces a bipartite graph connecting LR and HR nodes to enable structure-aware node super-resolution that preserves topology and permutation invariance. Second, DEFEND learns edge representations by mapping HR edges to nodes of a dual graph, allowing edge inference via standard node-based GNNs. We evaluate both frameworks on a real-world brain connectome dataset, where they achieve state-of-the-art performance across seven topological measures. To support generalization, we introduce twelve new simulated datasets that capture diverse topologies and LR-HR relationships. These enable comprehensive benchmarking of graph super-resolution methods.
zh

[AI-59] 3D Guard-Layer: An Integrated Agent ic AI Safety System for Edge Artificial Intelligence

【速读】:该论文旨在解决边缘人工智能(Edge AI)系统中日益突出的安全漏洞与挑战问题,这些问题已成为制约AI系统实际部署与安全性的主要障碍。其解决方案的关键在于提出一种基于代理的AI安全架构(agentic AI safety architecture),通过3D集成方式嵌入专用安全层,并构建具备动态学习能力的自适应AI安全基础设施,能够持续监测、检测并主动缓解针对AI系统的攻击。该架构充分利用了与边缘计算硬件共置的优势,实现了本地化处理与学习,从而在最小成本和3D集成开销下提升系统对新型网络攻击的韧性、可靠性、模块化和性能表现。

链接: https://arxiv.org/abs/2511.08842
作者: Eren Kurshan,Yuan Xie,Paul Franzon
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Resubmitting Re: Arxiv Committee Approval

点击查看摘要

Abstract:AI systems have found a wide range of real-world applications in recent years. The adoption of edge artificial intelligence, embedding AI directly into edge devices, is rapidly growing. Despite the implementation of guardrails and safety mechanisms, security vulnerabilities and challenges have become increasingly prevalent in this domain, posing a significant barrier to the practical deployment and safety of AI systems. This paper proposes an agentic AI safety architecture that leverages 3D to integrate a dedicated safety layer. It introduces an adaptive AI safety infrastructure capable of dynamically learning and mitigating attacks against the AI system. The system leverages the inherent advantages of co-location with the edge computing hardware to continuously monitor, detect and proactively mitigate threats to the AI system. The integration of local processing and learning capabilities enhances resilience against emerging network-based attacks while simultaneously improving system reliability, modularity, and performance, all with minimal cost and 3D integration overhead.
zh

[AI-60] Enhancing DPSGD via Per-Sample Momentum and Low-Pass Filtering AAAI2026

【速读】:该论文旨在解决差分隐私随机梯度下降(Differentially Private Stochastic Gradient Descent, DPSGD)在训练深度神经网络时因引入噪声和偏差而导致模型准确率下降的问题。现有方法通常仅针对噪声或裁剪偏差(clipping bias)之一进行优化,而无法兼顾二者,因为降低噪声可能加剧裁剪偏差,反之亦然。其解决方案的关键在于提出一种新方法 DP-PMLF,该方法结合了每个样本的动量(per-sample momentum)与低通滤波策略:前者在梯度裁剪前平滑梯度估计以减少采样方差,后者通过后处理低通滤波器抑制高频差分隐私噪声,且不消耗额外隐私预算。理论分析表明该方法在严格的差分隐私保证下具有更优的收敛速率,实验结果也验证了其在隐私-效用权衡上的显著提升。

链接: https://arxiv.org/abs/2511.08841
作者: Xincheng Xu,Thilina Ranbaduge,Qing Wang,Thierry Rakotoarivelo,David Smith
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: To appear in AAAI 2026

点击查看摘要

Abstract:Differentially Private Stochastic Gradient Descent (DPSGD) is widely used to train deep neural networks with formal privacy guarantees. However, the addition of differential privacy (DP) often degrades model accuracy by introducing both noise and bias. Existing techniques typically address only one of these issues, as reducing DP noise can exacerbate clipping bias and vice-versa. In this paper, we propose a novel method, \emphDP-PMLF, which integrates per-sample momentum with a low-pass filtering strategy to simultaneously mitigate DP noise and clipping bias. Our approach uses per-sample momentum to smooth gradient estimates prior to clipping, thereby reducing sampling variance. It further employs a post-processing low-pass filter to attenuate high-frequency DP noise without consuming additional privacy budget. We provide a theoretical analysis demonstrating an improved convergence rate under rigorous DP guarantees, and our empirical evaluations reveal that DP-PMLF significantly enhances the privacy-utility trade-off compared to several state-of-the-art DPSGD variants.
zh

[AI-61] IGER-MARL: Enhancing Multi-Agent Reinforcement Learning with Temporal Information through Graph-based Embeddings and Representations

【速读】:该论文旨在解决多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)中因忽略智能体间协作结构随时间动态演化而导致的协调鲁棒性与适应性不足的问题。现有方法通常依赖静态或每步独立的关系图,未能捕捉智能体在适应环境、移动或调整合作策略过程中自然产生的时序交互变化。解决方案的关键在于提出Temporal Information through Graph-based Embeddings and Representations (TIGER),其通过构建包含当前与历史交互的动态时序图,并利用基于时间注意力机制的编码器聚合结构邻域与时间邻域的信息,生成具备时间感知能力的智能体嵌入(agent embeddings),从而指导更有效的协作策略学习。

链接: https://arxiv.org/abs/2511.08832
作者: Nikunj Gupta,Ludwika Twardecka,James Zachary Hare,Jesse Milzman,Rajgopal Kannan,Viktor Prasanna
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In this paper, we propose capturing and utilizing \textitTemporal Information through Graph-based Embeddings and Representations or \textbfTIGER to enhance multi-agent reinforcement learning (MARL). We explicitly model how inter-agent coordination structures evolve over time. While most MARL approaches rely on static or per-step relational graphs, they overlook the temporal evolution of interactions that naturally arise as agents adapt, move, or reorganize cooperation strategies. Capturing such evolving dependencies is key to achieving robust and adaptive coordination. To this end, TIGER constructs dynamic temporal graphs of MARL agents, connecting their current and historical interactions. It then employs a temporal attention-based encoder to aggregate information across these structural and temporal neighborhoods, yielding time-aware agent embeddings that guide cooperative policy learning. Through extensive experiments on two coordination-intensive benchmarks, we show that TIGER consistently outperforms diverse value-decomposition and graph-based MARL baselines in task performance and sample efficiency. Furthermore, we conduct comprehensive ablation studies to isolate the impact of key design parameters in TIGER, revealing how structural and temporal factors can jointly shape effective policy learning in MARL. All codes can be found here: this https URL.
zh

[AI-62] Neural Value Iteration

【速读】:该论文旨在解决部分可观测马尔可夫决策过程(Partially Observable Markov Decision Process, POMDP)中因状态空间规模庞大而导致的传统点基值迭代(point-based value iteration)方法计算复杂度急剧上升的问题。现有方法依赖于对 α-向量(α-vectors)进行贝尔曼更新,而每个 α-向量为 |S|-维(|S| 为状态数),在大规模问题中难以高效执行。解决方案的关键在于利用 POMDP 值函数的分段线性凸(piecewise-linear-convex, PWLC)性质,提出一种基于神经网络的新表示方式:将值函数建模为一组有限的神经网络,从而实现对值函数的高效近似与泛化。该方法构建了名为 Neural Value Iteration 的新型规划算法,融合了神经网络的泛化能力与经典值迭代框架,在现有离线求解器无法处理的超大规模 POMDP 中仍能获得近优解。

链接: https://arxiv.org/abs/2511.08825
作者: Yang You,Ufuk Çakır,Alex Schutz,Robert Skilton,Nick Hawes
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The value function of a POMDP exhibits the piecewise-linear-convex (PWLC) property and can be represented as a finite set of hyperplanes, known as \alpha -vectors. Most state-of-the-art POMDP solvers (offline planners) follow the point-based value iteration scheme, which performs Bellman backups on \alpha -vectors at reachable belief points until convergence. However, since each \alpha -vector is |S| -dimensional, these methods quickly become intractable for large-scale problems due to the prohibitive computational cost of Bellman backups. In this work, we demonstrate that the PWLC property allows a POMDP’s value function to be alternatively represented as a finite set of neural networks. This insight enables a novel POMDP planning algorithm called \emphNeural Value Iteration, which combines the generalization capability of neural networks with the classical value iteration framework. Our approach achieves near-optimal solutions even in extremely large POMDPs that are intractable for existing offline solvers.
zh

[AI-63] Hey Pentti We Did (More of) It!: A Vector-Symbolic Lisp With Residue Arithmetic IJCNN2025

【速读】:该论文旨在解决当前神经网络缺乏对结构化表示(structured representations)的有效编码与处理能力的问题,从而限制了其在复杂任务中实现更通用智能的能力。解决方案的关键在于引入基于频域全息降维表示(Frequency-domain Holographic Reduced Representations, FHRRs)的向量符号架构(Vector-Symbolic Architecture, VSA),并结合残差超维计算(Residue Hyperdimensional Computing, RHC)实现对Lisp 1.5语法的图灵完备编码,使得高维向量空间能够自然地包含任意结构化的、可解释的表示形式,进而提升神经网络状态的表达能力和结构敏感性。

链接: https://arxiv.org/abs/2511.08767
作者: Connor Hanley,Eilene Tomkins-Flanaganm,Mary Alexandria Kelly
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: 11 pages, 1 figure, conference paper at IJCNN 2025 Rome

点击查看摘要

Abstract:Using Frequency-domain Holographic Reduced Representations (FHRRs), we extend a Vector-Symbolic Architecture (VSA) encoding of Lisp 1.5 with primitives for arithmetic operations using Residue Hyperdimensional Computing (RHC). Encoding a Turing-complete syntax over a high-dimensional vector space increases the expressivity of neural network states, enabling network states to contain arbitrarily structured representations that are inherently interpretable. We discuss the potential applications of the VSA encoding in machine learning tasks, as well as the importance of encoding structured representations and designing neural networks whose behavior is sensitive to the structure of their representations in virtue of attaining more general intelligent agents than exist at present.
zh

[AI-64] Information-Driven Fault Detection and Identification for Multi-Agent Spacecraft Systems: Collaborative On-Orbit Inspection Mission

【速读】:该论文旨在解决多航天器在低地球轨道执行协同巡检任务时的故障检测与识别(Fault Detection and Identification, FDI)问题,尤其关注如何在不确定环境下实现高可靠性的故障定位与分类。解决方案的关键在于提出了一种“全局到局部、任务感知”的FDI框架,通过统一的全局信息驱动代价函数(cost functional)将任务分配、导航控制与故障检测/识别相耦合:该代价函数融合了传感器模型、航天器位姿及任务级信息增益目标,利用预期与观测的任务指标差异进行故障检测,并借助高阶代价梯度测量实现对传感器、执行器和状态估计器等部件的故障识别;同时引入自适应阈值机制以应对巡检几何结构和任务动态变化,从而为自主巡检系统的鲁棒性提供统一的信息驱动基础。

链接: https://arxiv.org/abs/2511.08752
作者: Akshita Gupta,Arna Bhardwaj,Yashwanth Kumar Nakka,Changrak Choi,Amir Rahmani
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
备注: AIAA Book Chapter (accepted)

点击查看摘要

Abstract:This work presents a global-to-local, task-aware fault detection and identification (FDI) framework for multi-spacecraft systems conducting collaborative inspection missions in low Earth orbit. The inspection task is represented by a global information-driven cost functional that integrates the sensor model, spacecraft poses, and mission-level information-gain objectives. This formulation links guidance, control, and FDI by using the same cost function to drive both global task allocation and local sensing or motion decisions. Fault detection is achieved through comparisons between expected and observed task metrics, while higher-order cost-gradient measures enable the identification of faults among sensors, actuators, and state estimators. An adaptive thresholding mechanism captures the time-varying inspection geometry and dynamic mission conditions. Simulation results for representative multi-spacecraft inspection scenarios demonstrate the reliability of fault localization and classification under uncertainty, providing a unified, information-driven foundation for resilient autonomous inspection architectures.
zh

[AI-65] Interpretable by Design: Query-Specific Neural Modules for Explainable Reinforcement Learning

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)系统中知识表征与控制目标耦合过深的问题,即当前架构难以高效提取和利用训练过程中隐含的环境知识(如可达性、路径、状态距离等),从而限制了模型的可解释性、验证能力和人机协作潜力。其解决方案的关键在于提出Query Conditioned Deterministic Inference Networks (QDIN),一种将多种查询类型(策略、可达性、路径、比较等)作为第一类公民统一建模的架构,通过为每种推理模式设计专用神经模块来实现高效的知识检索。关键实证发现表明:推理准确性(如99%的可达性交并比)与控制性能(如31%回报率)之间存在本质解耦,说明用于精确世界知识表示的表征与最优控制所需的表征不同,这为构建从初始设计即具备可查询能力的RL知识库提供了新范式。

链接: https://arxiv.org/abs/2511.08749
作者: Mehrdad Zakershahrak
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning has traditionally focused on a singular objective: learning policies that select actions to maximize reward. We challenge this paradigm by asking: what if we explicitly architected RL systems as inference engines that can answer diverse queries about their environment? In deterministic settings, trained agents implicitly encode rich knowledge about reachability, distances, values, and dynamics - yet current architectures are not designed to expose this information efficiently. We introduce Query Conditioned Deterministic Inference Networks (QDIN), a unified architecture that treats different types of queries (policy, reachability, paths, comparisons) as first-class citizens, with specialized neural modules optimized for each inference pattern. Our key empirical finding reveals a fundamental decoupling: inference accuracy can reach near-perfect levels (99% reachability IoU) even when control performance remains suboptimal (31% return), suggesting that the representations needed for accurate world knowledge differ from those required for optimal control. Experiments demonstrate that query specialized architectures outperform both unified models and post-hoc extraction methods, while maintaining competitive control performance. This work establishes a research agenda for RL systems designed from inception as queryable knowledge bases, with implications for interpretability, verification, and human-AI collaboration.
zh

[AI-66] Vector Symbolic Algebras for the Abstraction and Reasoning Corpus

【速读】:该论文旨在解决当前人工智能系统在应对抽象推理任务时表现不佳的问题,尤其是针对人工通用智能(Artificial General Intelligence, AGI)评估基准ARC-AGI(Abstraction and Reasoning Corpus for AGI)的挑战——尽管人类能轻松完成该任务,但现有AI模型仍难以有效应对。解决方案的关键在于提出一种认知上合理的ARC-AGI求解器,其核心是将System 1(直觉式)与System 2(推理式)认知机制通过神经符号方法(neurosymbolic methods)整合,利用向量符号代数(Vector Symbolic Algebras, VSAs)实现对象中心的程序合成(object-centric program synthesis),从而高效、可解释地表示抽象对象、引导解题搜索并支持样本高效的神经学习。该方法首次将VSAs应用于ARC-AGI,并展现出优于主流大模型(如GPT-4)在部分简化基准上的性能与计算效率优势。

链接: https://arxiv.org/abs/2511.08747
作者: Isaac Joffe,Chris Eliasmith
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) is a generative, few-shot fluid intelligence benchmark. Although humans effortlessly solve ARC-AGI, it remains extremely difficult for even the most advanced artificial intelligence systems. Inspired by methods for modelling human intelligence spanning neuroscience to psychology, we propose a cognitively plausible ARC-AGI solver. Our solver integrates System 1 intuitions with System 2 reasoning in an efficient and interpretable process using neurosymbolic methods based on Vector Symbolic Algebras (VSAs). Our solver works by object-centric program synthesis, leveraging VSAs to represent abstract objects, guide solution search, and enable sample-efficient neural learning. Preliminary results indicate success, with our solver scoring 10.8% on ARC-AGI-1-Train and 3.0% on ARC-AGI-1-Eval. Additionally, our solver performs well on simpler benchmarks, scoring 94.5% on Sort-of-ARC and 83.1% on 1D-ARC – the latter outperforming GPT-4 at a tiny fraction of the computational cost. Importantly, our approach is unique; we believe we are the first to apply VSAs to ARC-AGI and have developed the most cognitively plausible ARC-AGI solver yet. Our code is available at: this https URL.
zh

[AI-67] Intuitive Programming Adaptive Task Planning and Dynamic Role Allocation in Human-Robot Collaboration WWW

【速读】:该论文旨在解决人类在人机协作(Human-Robot Collaboration, HRC)中被动参与、难以有效交互的问题,以及机器人在人类环境中无法充分适应人类状态与意图而导致协同效率低下的挑战。其核心解决方案在于构建一个连续的信息流机制:一方面,通过多模态输入到机器人可理解表示的映射,实现人类对机器人的直观指令传达、知识共享与需求表达;另一方面,机器人需清晰地反馈自身内部状态和下一步动作,以维持用户的信息感知、心理舒适度与控制感。这一双向信息流贯穿从人到机器的通信桥接、自适应规划与角色分配,直至控制层与反馈机制的闭环设计,从而推动更自然、高效且可扩展的人机协同系统发展。

链接: https://arxiv.org/abs/2511.08732
作者: Marta Lagomarsino,Elena Merlo,Andrea Pupa,Timo Birr,Franziska Krebs,Cristian Secchi,Tamim Asfour,Arash Ajoudani
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published in the Annual Review of Control, Robotics, and Autonomous Systems, Volume 9; copyright 2026 the author(s), CC BY 4.0, this https URL

点击查看摘要

Abstract:Remarkable capabilities have been achieved by robotics and AI, mastering complex tasks and environments. Yet, humans often remain passive observers, fascinated but uncertain how to engage. Robots, in turn, cannot reach their full potential in human-populated environments without effectively modeling human states and intentions and adapting their behavior. To achieve a synergistic human-robot collaboration (HRC), a continuous information flow should be established: humans must intuitively communicate instructions, share expertise, and express needs. In parallel, robots must clearly convey their internal state and forthcoming actions to keep users informed, comfortable, and in control. This review identifies and connects key components enabling intuitive information exchange and skill transfer between humans and robots. We examine the full interaction pipeline: from the human-to-robot communication bridge translating multimodal inputs into robot-understandable representations, through adaptive planning and role allocation, to the control layer and feedback mechanisms to close the loop. Finally, we highlight trends and promising directions toward more adaptive, accessible HRC.
zh

[AI-68] Bridging Natural Language and ASP: A Hybrid Approach Using LLM s and AMR Parsing

【速读】:该论文旨在解决自然语言到逻辑编程(Answer Set Programming, ASP)的自动转换问题,尤其针对非编程背景用户难以直接编写ASP程序的挑战。解决方案的关键在于提出一种结合大型语言模型(LLM)与抽象意义表示(Abstract Meaning Representation, AMR)图的新方法:LLM仅用于简化自然语言句子、识别关键词并生成基础事实,而AMR图则被用来系统化地解析语义并生成ASP规则、事实和约束,从而构建完整的ASP程序。该方法显著降低了对LLM的依赖,提升了系统的轻量化与可解释性,为将自然语言转化为复杂组合逻辑问题求解方案提供了可行路径。

链接: https://arxiv.org/abs/2511.08715
作者: Connar Hite,Sean Saud,Raef Taha,Nayim Rahman,Tanvir Atahary,Scott Douglass,Tarek Taha
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Answer Set Programming (ASP) is a declarative programming paradigm based on logic programming and non-monotonic reasoning. It is a tremendously powerful tool for describing and solving combinatorial problems. Like any other language, ASP requires users to learn how it works and the syntax involved. It is becoming increasingly required for those unfamiliar with programming languages to interact with code. This paper proposes a novel method of translating unconstrained English into ASP programs for logic puzzles using an LLM and Abstract Meaning Representation (AMR) graphs. Everything from ASP rules, facts, and constraints is generated to fully represent and solve the desired problem. Example logic puzzles are used to demonstrate the capabilities of the system. While most current methods rely entirely on an LLM, our system minimizes the role of the LLM only to complete straightforward tasks. The LLM is used to simplify natural language sentences, identify keywords, and generate simple facts. The AMR graphs are then parsed from simplified language and used to generate ASP constraints systematically. The system successfully creates an entire ASP program that solves a combinatorial logic problem. This approach is a significant first step in creating a lighter-weight, explainable system that converts natural language to solve complex logic problems.
zh

[AI-69] Convergence dynamics of Agent -to-Agent Interactions with Misaligned objectives

【速读】:该论文旨在解决多智能体场景中基于语言模型的代理之间因目标不一致而导致的交互不稳定与偏差问题,特别是在上下文学习(in-context learning)框架下,两个代理通过迭代梯度更新各自目标时所引发的非对称收敛和次优均衡状态。解决方案的关键在于构建一个理论框架,明确刻画了当代理目标存在错位时的生成动力学行为,并揭示了最终平衡点中的残差误差可由目标差距(objective gap)及每个代理提示(prompt)诱导的几何结构预测;进一步提出了一个能保证对抗性结果的算法,实现单边成功,从而为多智能体系统的稳定性、偏见控制与鲁棒性提供可预测且可防御的设计依据。

链接: https://arxiv.org/abs/2511.08710
作者: Romain Cosentino,Sarath Shekkizhar,Adam Earle
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We develop a theoretical framework for agent-to-agent interactions in multi-agent scenarios. We consider the setup in which two language model based agents perform iterative gradient updates toward their respective objectives in-context, using the output of the other agent as input. We characterize the generation dynamics associated with the interaction when the agents have misaligned objectives, and show that this results in a biased equilibrium where neither agent reaches its target - with the residual errors predictable from the objective gap and the geometry induced by the prompt of each agent. We establish the conditions for asymmetric convergence and provide an algorithm that provably achieves an adversarial result, producing one-sided success. Experiments with trained transformer models as well as GPT 5 for the task of in-context linear regression validate the theory. Our framework presents a setup to study, predict, and defend multi-agent systems; explicitly linking prompt design and interaction setup to stability, bias, and robustness.
zh

[AI-70] FAIRPLAI: A Human-in-the-Loop Approach to Fair and Private Machine Learning

【速读】:该论文旨在解决机器学习系统在实际应用中面临的多目标冲突问题,即如何在保障准确性、公平性(fairness)和隐私保护(privacy)三者之间取得平衡。现有方法往往难以兼顾:差分隐私(differential privacy)可能无意中加剧群体间的不公平,公平性干预通常依赖敏感数据而受限于隐私约束,且自动化流程忽视了公平性本质上是人类判断与情境相关的特性。解决方案的关键在于提出 FAIRPLAI 框架,其核心创新包括:构建隐私-公平前沿以可视化权衡关系;引入交互式利益相关者输入机制,使决策者能根据领域需求选择公平标准与运行点;以及嵌入差分隐私审计循环,允许人类审查解释和边缘案例而不泄露个体数据。该框架通过将人类监督主动融入模型设计与部署全过程,实现了可解释、可控且符合社会价值的机器学习实践路径。

链接: https://arxiv.org/abs/2511.08702
作者: David Sanchez Jr.,Holly Lopez,Michelle Buraczyk,Anantaa Kotal
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:As machine learning systems move from theory to practice, they are increasingly tasked with decisions that affect healthcare access, financial opportunities, hiring, and public services. In these contexts, accuracy is only one piece of the puzzle - models must also be fair to different groups, protect individual privacy, and remain accountable to stakeholders. Achieving all three is difficult: differential privacy can unintentionally worsen disparities, fairness interventions often rely on sensitive data that privacy restricts, and automated pipelines ignore that fairness is ultimately a human and contextual judgment. We introduce FAIRPLAI (Fair and Private Learning with Active Human Influence), a practical framework that integrates human oversight into the design and deployment of machine learning systems. FAIRPLAI works in three ways: (1) it constructs privacy-fairness frontiers that make trade-offs between accuracy, privacy guarantees, and group outcomes transparent; (2) it enables interactive stakeholder input, allowing decision-makers to select fairness criteria and operating points that reflect their domain needs; and (3) it embeds a differentially private auditing loop, giving humans the ability to review explanations and edge cases without compromising individual data security. Applied to benchmark datasets, FAIRPLAI consistently preserves strong privacy protections while reducing fairness disparities relative to automated baselines. More importantly, it provides a straightforward, interpretable process for practitioners to manage competing demands of accuracy, privacy, and fairness in socially impactful applications. By embedding human judgment where it matters most, FAIRPLAI offers a pathway to machine learning systems that are effective, responsible, and trustworthy in practice. GitHub: this https URL
zh

[AI-71] Binary and Multiclass Cyberattack Classification on GeNIS Dataset

【速读】:该论文旨在解决当前基于人工智能(AI)的网络入侵检测系统(NIDS)在面对新型网络流量时泛化能力不足的问题,其根本原因在于训练数据缺乏多样性与时效性。解决方案的关键在于实验验证GeNIS数据集作为基准的可靠性,并通过五种特征选择方法(信息增益、卡方检验、递归特征消除、平均绝对偏差和离散度比)对数据集进行降维处理,从而提取出最具代表性的时序与数量行为特征,提升模型的计算效率与检测性能。在此基础上,采用三种决策树集成模型和两种深度神经网络进行二分类与多分类任务训练,结果表明ML集成模型在保持较高准确率和F1分数的同时展现出更优的泛化能力和计算效率,证明了GeNIS数据集在支持智能入侵检测与攻击分类方面的有效性。

链接: https://arxiv.org/abs/2511.08660
作者: Miguel Silva,Daniela Pinto,João Vitorino,Eva Maia,Isabel Praça,Ivone Amorim,Maria João Viamonte
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 17 pages, 12 tables, FPS 2025 conference

点击查看摘要

Abstract:The integration of Artificial Intelligence (AI) in Network Intrusion Detection Systems (NIDS) is a promising approach to tackle the increasing sophistication of cyberattacks. However, since Machine Learning (ML) and Deep Learning (DL) models rely heavily on the quality of their training data, the lack of diverse and up-to-date datasets hinders their generalization capability to detect malicious activity in previously unseen network traffic. This study presents an experimental validation of the reliability of the GeNIS dataset for AI-based NIDS, to serve as a baseline for future benchmarks. Five feature selection methods, Information Gain, Chi-Squared Test, Recursive Feature Elimination, Mean Absolute Deviation, and Dispersion Ratio, were combined to identify the most relevant features of GeNIS and reduce its dimensionality, enabling a more computationally efficient detection. Three decision tree ensembles and two deep neural networks were trained for both binary and multiclass classification tasks. All models reached high accuracy and F1-scores, and the ML ensembles achieved slightly better generalization while remaining more efficient than DL models. Overall, the obtained results indicate that the GeNIS dataset supports intelligent intrusion detection and cyberattack classification with time-based and quantity-based behavioral features.
zh

[AI-72] Introduction to Automated Negotiation

【速读】:该论文旨在解决计算机科学初学者在学习自动化协商(Automated Negotiation)领域时缺乏系统性入门教材的问题。其解决方案的关键在于提供一个基于Python实现的简单玩具世界协商框架,该框架结构精简、易于理解,允许读者无需复杂前置知识即可快速实现自定义协商算法并开展实验验证。此框架具备良好的可移植性,便于使用者根据需求迁移到其他编程语言中,从而有效降低学习门槛并促进实践能力培养。

链接: https://arxiv.org/abs/2511.08659
作者: Dave de Jonge
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注:

点击查看摘要

Abstract:This book is an introductory textbook targeted towards computer science students who are completely new to the topic of automated negotiation. It does not require any prerequisite knowledge, except for elementary mathematics and basic programming skills. This book comes with an simple toy-world negotiation framework implemented in Python that can be used by the readers to implement their own negotiation algorithms and perform experiments with them. This framework is small and simple enough that any reader who does not like to work in Python should be able to re-implement it very quickly in any other programming language of their choice. Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT) Cite as: arXiv:2511.08659 [cs.MA] (or arXiv:2511.08659v1 [cs.MA] for this version) https://doi.org/10.48550/arXiv.2511.08659 Focus to learn more arXiv-issued DOI via DataCite
zh

[AI-73] Learning the Basis: A Kolmogorov-Arnold Network Approach Embedding Greens Function Priors

【速读】:该论文旨在解决传统矩量法(Method of Moments, MoM)受限于静态几何定义基函数(如Rao-Wilton-Glisson, RWG基函数)所带来的表达灵活性不足和物理一致性难以保障的问题。其解决方案的关键在于提出一种物理信息引导的Kolmogorov-Arnold网络(PhyKAN),将RWG基函数从静态、分段线性表示推广为可学习且自适应的基函数族,该方法基于EFIE(电场积分方程)推导,融合局部KAN分支与嵌入格林函数先验的全局分支,从而在保持物理一致性的同时实现高精度电磁建模,包括亚0.01的重建误差和无监督下的雷达散射截面预测能力。

链接: https://arxiv.org/abs/2511.08655
作者: Rui Zhu,Yuexing Peng,George C. Alexandropoulos,Wenbo Wang,Wei Xiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Method of Moments (MoM) is constrained by the usage of static, geometry-defined basis functions, such as the Rao-Wilton-Glisson (RWG) basis. This letter reframes electromagnetic modeling around a learnable basis representation rather than solving for the coefficients over a fixed basis. We first show that the RWG basis is essentially a static and piecewise-linear realization of the Kolmogorov-Arnold representation theorem. Inspired by this insight, we propose PhyKAN, a physics-informed Kolmogorov-Arnold Network (KAN) that generalizes RWG into a learnable and adaptive basis family. Derived from the EFIE, PhyKAN integrates a local KAN branch with a global branch embedded with Green’s function priors to preserve physical consistency. It is demonstrated that, across canonical geometries, PhyKAN achieves sub-0.01 reconstruction errors as well as accurate, unsupervised radar cross section predictions, offering an interpretable, physics-consistent bridge between classical solvers and modern neural network models for electromagnetic modeling.
zh

[AI-74] Energy Consumption of Dataframe Libraries for End-to-End Deep Learning Pipelines:A Comparative Analysis

【速读】:该论文试图解决现有研究中缺乏对Python数据处理库(如Pandas、Polars和Dask)在完整深度学习(Deep Learning, DL)训练与推理流水线中性能表现的系统性比较问题,尤其关注这些库在GPU密集型任务阶段(如数据加载、预处理和批处理输入)中的交互效率。其解决方案的关键在于通过量化运行时间、内存占用、磁盘使用量以及CPU和GPU能耗等关键指标,在多种机器学习模型和数据集上对三类主流数据处理库进行综合评估,从而为实际DL工程实践提供基于实证的数据驱动决策依据。

链接: https://arxiv.org/abs/2511.08644
作者: Punit Kumar,Asif Imran,Tevfik Kosar
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:

点击查看摘要

Abstract:This paper presents a detailed comparative analysis of the performance of three major Python data manipulation libraries - Pandas, Polars, and Dask - specifically when embedded within complete deep learning (DL) training and inference pipelines. The research bridges a gap in existing literature by studying how these libraries interact with substantial GPU workloads during critical phases like data loading, preprocessing, and batch feeding. The authors measured key performance indicators including runtime, memory usage, disk usage, and energy consumption (both CPU and GPU) across various machine learning models and datasets.
zh

[AI-75] he Journal of Prompt-Engineered Philosophy Or: How I Started to Track AI Assistance and Stopped Worrying About Slop

【速读】:该论文试图解决学术出版中因披露生成式 AI (Generative AI) 辅助而面临的声誉成本与透明度需求之间的结构性矛盾,即当 AI 辅助程度较高时,作者反而因披露而受损,从而抑制了关键场景下的透明性。解决方案的关键在于构建一个脱离传统 prestige 系统的替代性出版基础设施,该系统强制要求披露、支持基于复现的评审,并通过详尽的方法学记录保障生态效度(ecological validity),从而实现对 AI 辅助研究的可验证评估,而非简单地支持或反对此类研究。

链接: https://arxiv.org/abs/2511.08639
作者: Michele Loi
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
备注: 46 pages (30 Article + 16 Appendix); 2 figures This arXiv preprint presents the conceptual framework and AI usage methodology (in the Appendix). Complete supplementary materials have been prepared and will accompany journal submission. This version is shared for early community feedback on the proposed infrastructure design

点击查看摘要

Abstract:Academic publishing increasingly requires authors to disclose AI assistance, yet imposes reputational costs for doing so–especially when such assistance is substantial. This article analyzes that structural contradiction, showing how incentives discourage transparency in precisely the work where it matters most. Traditional venues cannot resolve this tension through policy tweaks alone, as the underlying prestige economy rewards opacity. To address this, the article proposes an alternative publishing infrastructure: a venue outside prestige systems that enforces mandatory disclosure, enables reproduction-based review, and supports ecological validity through detailed documentation. As a demonstration of this approach, the article itself is presented as an example of AI-assisted scholarship under reasonably detailed disclosure, with representative prompt logs and modification records included. Rather than taking a position for or against AI-assisted scholarship, the article outlines conditions under which such work can be evaluated on its own terms: through transparent documentation, verification-oriented review, and participation by methodologically committed scholars. While focused on AI, the framework speaks to broader questions about how academic systems handle methodological innovation.
zh

[AI-76] How do data owners say no? A case study of data consent mechanisms in web-scraped vision-language AI training datasets

【速读】:该论文旨在解决当前大规模文本-图像模型训练数据采集过程中对数据所有权人意愿尊重不足的问题,特别是Web规模数据收集是否充分尊重了数据所有者的使用许可。其核心问题在于:现有AI训练数据集(如DataComp)中大量样本可能来自未明确授权的来源,且数据提供方通过多种渠道(如版权声明、水印、网站服务条款等)表达的同意意图常被忽视。解决方案的关键在于提出一套综合性的分析方法,从样本级(如版权信息、水印、元数据)和域名级(如服务条款ToS、机器人排除协议Robots Exclusion Protocol)两个维度系统评估数据合规性,并发现至少1.22亿条样本存在版权提示,60%来自禁止爬取的前50大域名,且9–13%的样本含水印但现有检测方法无法高保真识别。这一发现揭示了当前数据采集流程对多渠道数据许可信号的忽视,强调需建立面向AI目的的统一数据许可框架以提升伦理与法律合规性。

链接: https://arxiv.org/abs/2511.08637
作者: Chung Peng Lee,Rachel Hong,Harry Jiang,Aster Plotnik,William Agnew,Jamie Morgenstern
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:The internet has become the main source of data to train modern text-to-image or vision-language models, yet it is increasingly unclear whether web-scale data collection practices for training AI systems adequately respect data owners’ wishes. Ignoring the owner’s indication of consent around data usage not only raises ethical concerns but also has recently been elevated into lawsuits around copyright infringement cases. In this work, we aim to reveal information about data owners’ consent to AI scraping and training, and study how it’s expressed in DataComp, a popular dataset of 12.8 billion text-image pairs. We examine both the sample-level information, including the copyright notice, watermarking, and metadata, and the web-domain-level information, such as a site’s Terms of Service (ToS) and Robots Exclusion Protocol. We estimate at least 122M of samples exhibit some indication of copyright notice in CommonPool, and find that 60% of the samples in the top 50 domains come from websites with ToS that prohibit scraping. Furthermore, we estimate 9-13% with 95% confidence interval of samples from CommonPool to contain watermarks, where existing watermark detection methods fail to capture them in high fidelity. Our holistic methods and findings show that data owners rely on various channels to convey data consent, of which current AI data collection pipelines do not entirely respect. These findings highlight the limitations of the current dataset curation/release practice and the need for a unified data consent framework taking AI purposes into consideration.
zh

[AI-77] Hope Aspirations and the Impact of LLM s on Female Programming Learners in Afghanistan

【速读】:该论文试图解决在社会政治不稳定背景下,如何有效测量教育抱负(aspirations)这一关键变量的问题,以支持设计具有影响力的教育技术。当前缺乏可扩展的指标来量化教育抱负,限制了相关干预措施的效果评估。其解决方案的关键在于对Snyder的希望量表(Hope Scale)进行适应性改编、翻译和验证,使其适用于阿富汗女性在线学习编程这一特殊情境。该量表展现出良好的信度(Cronbach’s α = 0.78),且参与者认为其内容清晰且相关;尽管整体抱负得分未因是否接触大型语言模型(LLMs)而显著差异,但有LLM访问权限的群体在“路径”子维度上呈现边际显著更高得分(p = .056),提示LLM可能拓宽了受试者对实现教育目标的认知路径。这表明该适配后的量表可用于衡量不稳定环境中教育抱负的变化,并为以抱负驱动的设计理念评估教育技术影响提供可靠工具。

链接: https://arxiv.org/abs/2511.08630
作者: Hamayoon Behmanush,Freshta Akhtari,Roghieh Nooripour,Ingmar Weber,Vikram Kamath Cannanure
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Designing impactful educational technologies in contexts of socio-political instability requires a nuanced understanding of educational aspirations. Currently, scalable metrics for measuring aspirations are limited. This study adapts, translates, and evaluates Snyder’s Hope Scale as a metric for measuring aspirations among 136 women learning programming online during a period of systemic educational restrictions in Afghanistan. The adapted scale demonstrated good reliability (Cronbach’s \alpha = 0.78) and participants rated it as understandable and relevant. While overall aspiration-related scores did not differ significantly by access to Large Language Models (LLMs), those with access reported marginally higher scores on the Avenues subscale (p = .056), suggesting broader perceived pathways to achieving educational aspirations. These findings support the use of the adapted scale as a metric for aspirations in contexts of socio-political instability. More broadly, the adapted scale can be used to evaluate the impact of aspiration-driven design of educational technologies.
zh

[AI-78] CometNet: Contextual Motif-guided Long-term Time Series Forecasting AAAI2026

【速读】:该论文旨在解决长时序预测(Long-term Time Series Forecasting)中因模型感受野瓶颈(receptive field bottleneck)导致的准确性受限问题,尤其针对主流基于Transformer和多层感知机(MLP)的方法依赖有限历史窗口、难以建模长期依赖关系的缺陷。其解决方案的关键在于提出CometNet框架,通过引入上下文模式提取模块(Contextual Motif Extraction module),从复杂的历史序列中识别出具有重复性和主导性的上下文模式(contextual motifs),从而获得远超有限回看窗口的时空依赖信息;随后利用模式引导预测模块(Motif-guided Forecasting module),将提取的主导模式动态映射到预测过程中,有效利用其上下文信息增强长期预测能力。

链接: https://arxiv.org/abs/2511.08049
作者: Weixu Wang,Xiaobo Zhou,Xin Qiao,Lei Wang,Tie Qiu
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Long-term Time Series Forecasting is crucial across numerous critical domains, yet its accuracy remains fundamentally constrained by the receptive field bottleneck in existing models. Mainstream Transformer- and Multi-layer Perceptron (MLP)-based methods mainly rely on finite look-back windows, limiting their ability to model long-term dependencies and hurting forecasting performance. Naively extending the look-back window proves ineffective, as it not only introduces prohibitive computational complexity, but also drowns vital long-term dependencies in historical noise. To address these challenges, we propose CometNet, a novel Contextual Motif-guided Long-term Time Series Forecasting framework. CometNet first introduces a Contextual Motif Extraction module that identifies recurrent, dominant contextual motifs from complex historical sequences, providing extensive temporal dependencies far exceeding limited look-back windows; Subsequently, a Motif-guided Forecasting module is proposed, which integrates the extracted dominant motifs into forecasting. By dynamically mapping the look-back window to its relevant motifs, CometNet effectively harnesses their contextual information to strengthen long-term forecasting capability. Extensive experimental results on eight real-world datasets have demonstrated that CometNet significantly outperforms current state-of-the-art (SOTA) methods, particularly on extended forecast horizons.
zh

[AI-79] A general framework for adaptive nonparametric dimensionality reduction

【速读】:该论文旨在解决维度约简(Dimensionality Reduction)中因局部邻域结构选择不当而导致嵌入质量下降的问题,尤其是传统方法需手动调参邻域数和低维空间维度,而这些参数对结果影响显著。解决方案的关键在于利用一种近期提出的内在维度估计器(intrinsic dimension estimator),该估计器不仅能估计数据的内在维度,还能自适应地确定最优局部邻域大小,从而实现对依赖局部邻接结构的任意维度约简算法的超参数优化。实验表明,该方法可显著提升经典投影方法在多种学习任务中的性能,且在定量指标与低维可视化质量上均有改善。

链接: https://arxiv.org/abs/2511.09486
作者: Antonio Di Noia,Federico Ravenda,Antonietta Mira
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Dimensionality reduction is a fundamental task in modern data science. Several projection methods specifically tailored to take into account the non-linearity of the data via local embeddings have been proposed. Such methods are often based on local neighbourhood structures and require tuning the number of neighbours that define this local structure, and the dimensionality of the lower-dimensional space onto which the data are projected. Such choices critically influence the quality of the resulting embedding. In this paper, we exploit a recently proposed intrinsic dimension estimator which also returns the optimal locally adaptive neighbourhood sizes according to some desirable criteria. In principle, this adaptive framework can be employed to perform an optimal hyper-parameter tuning of any dimensionality reduction algorithm that relies on local neighbourhood structures. Numerical experiments on both real-world and simulated datasets show that the proposed method can be used to significantly improve well-known projection methods when employed for various learning tasks, with improvements measurable through both quantitative metrics and the quality of low-dimensional visualizations.
zh

[AI-80] DeepVRegulome: DNABERT-based deep-learning framework for predicting the functional impact of short genomic variants on the human regulome

【速读】:该论文旨在解决非编码区短变异(non-coding short variants)的功能影响难以准确预测和优先排序的问题,尤其是在基因调控区域中具有临床意义的突变识别。其解决方案的关键在于提出 Deep VRegulome,一种基于深度学习的预测与解析方法,该方法整合了700个在 ENCODE 基因调控区域上微调的 DNABERT 模型,结合变异评分、基序分析、注意力可视化和生存分析,从而实现对人类调控组(regulome)中功能破坏性变异的精准识别与解释。

链接: https://arxiv.org/abs/2511.09026
作者: Pratik Dutta,Matthew Obusan,Rekha Sathian,Max Chao,Pallavi Surana,Nimisha Papineni,Yanrong Ji,Zhihan Zhou,Han Liu,Alisa Yurovsky,Ramana V Davuluri
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Whole-genome sequencing (WGS) has revealed numerous non-coding short variants whose functional impacts remain poorly understood. Despite recent advances in deep-learning genomic approaches, accurately predicting and prioritizing clinically relevant mutations in gene regulatory regions remains a major challenge. Here we introduce Deep VRegulome, a deep-learning method for prediction and interpretation of functionally disruptive variants in the human regulome, which combines 700 DNABERT fine-tuned models, trained on vast amounts of ENCODE gene regulatory regions, with variant scoring, motif analysis, attention-based visualization, and survival analysis. We showcase its application on TCGA glioblastoma WGS dataset in prioritizing survival-associated mutations and regulatory regions. The analysis identified 572 splice-disrupting and 9,837 transcription-factor binding site altering mutations occurring in greater than 10% of glioblastoma samples. Survival analysis linked 1352 mutations and 563 disrupted regulatory regions to patient outcomes, enabling stratification via non-coding mutation signatures. All the code, fine-tuned models, and an interactive data portal are publicly available.
zh

[AI-81] When is a System Discoverable from Data? Discovery Requires Chaos

【速读】:该论文旨在解决从有限观测数据中唯一识别 governing equations(控制方程)的问题,即在数据驱动的科学发现中,如何确保学习到的模型具有真正的预测能力而非仅拟合现有数据。其核心挑战在于非唯一性问题:即使模型完美匹配观测数据,也可能不具备泛化能力。论文的关键解决方案是揭示混沌(chaos)在系统可发现性中的决定性作用——尽管混沌通常被视为不可预测性的来源,但研究表明,在连续函数空间中,全域混沌系统可通过单条轨迹被唯一识别;而在奇异吸引子上混沌的系统,在吸引子满足特定几何条件时,可在解析函数空间中实现解析可发现性。这一发现不仅首次证明经典Lorenz系统具有解析可发现性,还指出存在第一积分(first integrals)的系统无法解析可发现,从而为理解数据驱动方法的成功(如天气预报)与局限(如数字孪生等工程应用)提供了理论依据,并强调了引入先验物理知识对非混沌系统的必要性。

链接: https://arxiv.org/abs/2511.08860
作者: Zakhar Shumaylov,Peter Zaika,Philipp Scholl,Gitta Kutyniok,Lior Horesh,Carola-Bibiane Schönlieb
机构: 未知
类目: Dynamical Systems (math.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA); Chaotic Dynamics (nlin.CD)
备注: 20 pages, 5 figures

点击查看摘要

Abstract:The deep learning revolution has spurred a rise in advances of using AI in sciences. Within physical sciences the main focus has been on discovery of dynamical systems from observational data. Yet the reliability of learned surrogates and symbolic models is often undermined by the fundamental problem of non-uniqueness. The resulting models may fit the available data perfectly, but lack genuine predictive power. This raises the question: under what conditions can the systems governing equations be uniquely identified from a finite set of observations? We show, counter-intuitively, that chaos, typically associated with unpredictability, is crucial for ensuring a system is discoverable in the space of continuous or analytic functions. The prevalence of chaotic systems in benchmark datasets may have inadvertently obscured this fundamental limitation. More concretely, we show that systems chaotic on their entire domain are discoverable from a single trajectory within the space of continuous functions, and systems chaotic on a strange attractor are analytically discoverable under a geometric condition on the attractor. As a consequence, we demonstrate for the first time that the classical Lorenz system is analytically discoverable. Moreover, we establish that analytic discoverability is impossible in the presence of first integrals, common in real-world systems. These findings help explain the success of data-driven methods in inherently chaotic domains like weather forecasting, while revealing a significant challenge for engineering applications like digital twins, where stable, predictable behavior is desired. For these non-chaotic systems, we find that while trajectory data alone is insufficient, certain prior physical knowledge can help ensure discoverability. These findings warrant a critical re-evaluation of the fundamental assumptions underpinning purely data-driven discovery. Comments: 20 pages, 5 figures Subjects: Dynamical Systems (math.DS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Numerical Analysis (math.NA); Chaotic Dynamics (nlin.CD) Cite as: arXiv:2511.08860 [math.DS] (or arXiv:2511.08860v1 [math.DS] for this version) https://doi.org/10.48550/arXiv.2511.08860 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-82] Bio AI Agent : A Multi-Agent Artificial Intelligence System for Autonomous CAR-T Cell Therapy Development with Integrated Target Discovery Toxicity Prediction and Rational Molecular Design

【速读】:该论文旨在解决嵌合抗原受体T细胞(CAR-T)疗法在开发过程中存在的效率低下问题,具体表现为研发周期长达8–12年、临床失败率超过40–60%,主要瓶颈在于靶点选择不当、安全性评估不足及分子设计优化困难。解决方案的关键在于提出并实现了一种由多个专业化智能代理组成的自主人工智能系统——Bio AI Agent,其通过六个协同工作的专用代理(包括靶点筛选、毒性预测、分子设计、专利分析、临床转化与决策协调)实现并行化、专业化推理与自主决策,从而显著提升CAR-T开发的精准性与效率,推动下一代免疫治疗药物从发现到临床应用的快速转化。

链接: https://arxiv.org/abs/2511.08649
作者: Yi Ni,Liwei Zhu,Shuai Li
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注: 12 pages, 0 figures

点击查看摘要

Abstract:Chimeric antigen receptor T-cell (CAR-T) therapy represents a paradigm shift in cancer treatment, yet development timelines of 8-12 years and clinical attrition rates exceeding 40-60% highlight critical inefficiencies in target selection, safety assessment, and molecular optimization. We present Bio AI Agent, a multi-agent artificial intelligence system powered by large language models that enables autonomous CAR-T development through collaborative specialized agents. The system comprises six autonomous agents: Target Selection Agent for multi-parametric antigen prioritization across 10,000 cancer-associated targets, Toxicity Prediction Agent for comprehensive safety profiling integrating tissue expression atlases and pharmacovigilance databases, Molecular Design Agent for rational CAR engineering, Patent Intelligence Agent for freedom-to-operate analysis, Clinical Translation Agent for regulatory compliance, and Decision Orchestration Agent for multi-agent coordination. Retrospective validation demonstrated autonomous identification of high-risk targets including FcRH5 (hepatotoxicity) and CD229 (off-tumor toxicity), patent infringement risks for CD38+SLAMF7 combinations, and generation of comprehensive development roadmaps. By enabling parallel processing, specialized reasoning, and autonomous decision-making superior to monolithic AI systems, Bio AI Agent addresses critical gaps in precision oncology development and has potential to accelerate translation of next-generation immunotherapies from discovery to clinic.
zh

[AI-83] Cross-Field Interface-Aware Neural Operators for Multiphase Flow Simulation

【速读】:该论文旨在解决多相流系统中传统数值求解器计算效率低、神经算子在高分辨率精度上表现不足的问题,其核心挑战源于多相流的强空间异质性和高质量训练数据稀缺。解决方案的关键在于提出界面信息感知神经算子(Interface Information-Aware Neural Operator, IANO),通过两个创新机制实现:一是界面感知的多函数编码机制,联合建模多个物理场与界面信息以捕捉界面处的高频物理特征;二是几何感知的位置编码机制,显式建立界面信息、物理变量与空间位置之间的关系,从而在低数据条件下仍能实现逐点超分辨率预测。

链接: https://arxiv.org/abs/2511.08625
作者: ZhenZhong Wang,Xin Zhang,Jun Liao,Min Jiang
机构: 未知
类目: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multiphase flow systems, with their complex dynamics, field discontinuities, and interphase interactions, pose significant computational challenges for traditional numerical solvers. While neural operators offer efficient alternatives, they often struggle to achieve high-resolution numerical accuracy in these systems. This limitation primarily stems from the inherent spatial heterogeneity and the scarcity of high-quality training data in multiphase flows. In this work, we propose the Interface Information-Aware Neural Operator (IANO), a novel framework that explicitly leverages interface information as a physical prior to enhance the prediction accuracy. The IANO architecture introduces two key components: 1) An interface-aware multiple function encoding mechanism jointly models multiple physical fields and interfaces, thus capturing the high-frequency physical features at the interface. 2) A geometry-aware positional encoding mechanism further establishes the relationship between interface information, physical variables, and spatial positions, enabling it to achieve pointwise super-resolution prediction even in the low-data regimes. Experimental results demonstrate that IANO outperforms baselines by \sim 10% in accuracy for multiphase flow simulations while maintaining robustness under data-scarce and noise-perturbed conditions.
zh

[AI-84] Multi-period Learning for Financial Time Series Forecasting

【速读】:该论文旨在解决金融时间序列预测(Financial Time Series Forecasting, TSF)中因短期公众情绪与中长期政策及市场趋势共同作用而导致的多周期输入处理不足问题。现有模型或仅使用单周期输入,或缺乏针对多周期特征的定制化设计,导致预测精度受限。解决方案的关键在于提出一种多周期学习框架(Multi-period Learning Framework, MLF),其核心创新包括三个模块:(i) 跨周期冗余过滤(Inter-period Redundancy Filtering, IRF),用于去除不同周期间的信息冗余以提升自注意力建模准确性;(ii) 可学习加权平均融合(Learnable Weighted-average Integration, LWI),实现多周期预测结果的有效整合;(iii) 多周期自适应分块(Multi-period self-Adaptive Patching, MAP),通过统一各周期的分块数量缓解对特定周期的偏差。此外,引入Patch Squeeze模块降低自注意力计算复杂度,兼顾预测精度与效率。

链接: https://arxiv.org/abs/2511.08622
作者: Xu Zhang,Zhengang Huang,Yunzhi Wu,Xun Lu,Erpeng Qi,Yunkai Chen,Zhongya Xue,Qitong Wang,Peng Wang,Wei Wang
机构: 未知
类目: atistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: The codes are available at this https URL

点击查看摘要

Abstract:Time series forecasting is important in finance domain. Financial time series (TS) patterns are influenced by both short-term public opinions and medium-/long-term policy and market trends. Hence, processing multi-period inputs becomes crucial for accurate financial time series forecasting (TSF). However, current TSF models either use only single-period input, or lack customized designs for addressing multi-period characteristics. In this paper, we propose a Multi-period Learning Framework (MLF) to enhance financial TSF performance. MLF considers both TSF’s accuracy and efficiency requirements. Specifically, we design three new modules to better integrate the multi-period inputs for improving accuracy: (i) Inter-period Redundancy Filtering (IRF), that removes the information redundancy between periods for accurate self-attention modeling, (ii) Learnable Weighted-average Integration (LWI), that effectively integrates multi-period forecasts, (iii) Multi-period self-Adaptive Patching (MAP), that mitigates the bias towards certain periods by setting the same number of patches across all periods. Furthermore, we propose a Patch Squeeze module to reduce the number of patches in self-attention modeling for maximized efficiency. MLF incorporates multiple inputs with varying lengths (periods) to achieve better accuracy and reduces the costs of selecting input lengths during training. The codes and datasets are available at this https URL.
zh

[AI-85] he LLM Pro Finance Suite: Multilingual Large Language Models for Financial Applications

【速读】:该论文旨在解决通用大语言模型(Large Language Models, LLMs)在金融领域任务中表现不足的问题,尤其是在处理金融文本理解、生成和翻译等专业场景时的局限性。其解决方案的关键在于构建一个专为金融应用设计的指令微调大模型套件——LLM Pro Finance Suite,该套件包含5个参数规模从8B到70B不等的模型,通过在高质量、多语言(英语、法语、德语)金融语料库(其中超过50%为金融相关数据)上进行精细化微调,在保留原始基座模型在指令遵循、推理能力和毒性控制等方面优势的同时,显著提升其在金融任务上的性能。这种“双能力平衡”策略使该套件成为金融工作流中可直接替换现有LLMs的理想选择,既增强了领域专业性,又不牺牲通用能力。

链接: https://arxiv.org/abs/2511.08621
作者: Gaëtan Caillaut,Raheel Qader,Jingshu Liu,Mariam Nakhlé,Arezki Sadoune,Massinissa Ahmim,Jean-Gabriel Barthelemy
机构: 未知
类目: atistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP)
备注:

点击查看摘要

Abstract:The financial industry’s growing demand for advanced natural language processing (NLP) capabilities has highlighted the limitations of generalist large language models (LLMs) in handling domain-specific financial tasks. To address this gap, we introduce the LLM Pro Finance Suite, a collection of five instruction-tuned LLMs (ranging from 8B to 70B parameters) specifically designed for financial applications. Our approach focuses on enhancing generalist instruction-tuned models, leveraging their existing strengths in instruction following, reasoning, and toxicity control, while fine-tuning them on a curated, high-quality financial corpus comprising over 50% finance-related data in English, French, and German. We evaluate the LLM Pro Finance Suite on a comprehensive financial benchmark suite, demonstrating consistent improvement over state-of-the-art baselines in finance-oriented tasks and financial translation. Notably, our models maintain the strong general-domain capabilities of their base models, ensuring reliable performance across non-specialized tasks. This dual proficiency, enhanced financial expertise without compromise on general abilities, makes the LLM Pro Finance Suite an ideal drop-in replacement for existing LLMs in financial workflows, offering improved domain-specific performance while preserving overall versatility. We publicly release two 8B-parameters models to foster future research and development in financial NLP applications: this https URL. Subjects: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP) Cite as: arXiv:2511.08621 [q-fin.ST] (or arXiv:2511.08621v1 [q-fin.ST] for this version) https://doi.org/10.48550/arXiv.2511.08621 Focus to learn more arXiv-issued DOI via DataCite
zh

[AI-86] Reasoning on Time-Series for Financial Technical Analysis

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在股票预测中仅依赖文本报告而忽视历史价格数据(即技术分析,Technical Analysis)的问题,该问题导致模型输出难以同时兼顾准确性与可解释性。其核心挑战在于跨域推理:输入和输出为时间序列数据,而推理过程需转化为自然语言以实现可解释性。解决方案的关键在于提出一种名为“口语化技术分析”(Verbal Technical Analysis, VTA)的新框架,该框架通过将股价数据转化为文本注释进行语义推理,并利用逆均方误差(inverse Mean Squared Error, MSE)奖励优化推理路径;同时,基于推理生成的属性条件化时间序列主干模型(time-series backbone model),从而实现高精度且可解释的股票时序预测。

链接: https://arxiv.org/abs/2511.08616
作者: Kelvin J.L. Koa,Jan Chen,Yunshan Ma,Huanhuan Zheng,Tat-Seng Chua
机构: 未知
类目: atistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
备注: ICAIF 2025 Workshop (Best Paper)

点击查看摘要

Abstract:While Large Language Models have been used to produce interpretable stock forecasts, they mainly focus on analyzing textual reports but not historical price data, also known as Technical Analysis. This task is challenging as it switches between domains: the stock price inputs and outputs lie in the time-series domain, while the reasoning step should be in natural language. In this work, we introduce Verbal Technical Analysis (VTA), a novel framework that combine verbal and latent reasoning to produce stock time-series forecasts that are both accurate and interpretable. To reason over time-series, we convert stock price data into textual annotations and optimize the reasoning trace using an inverse Mean Squared Error (MSE) reward objective. To produce time-series outputs from textual reasoning, we condition the outputs of a time-series backbone model on the reasoning-based attributes. Experiments on stock datasets across U.S., Chinese, and European markets show that VTA achieves state-of-the-art forecasting accuracy, while the reasoning traces also perform well on evaluation by industry experts.
zh

[AI-87] Data-driven Feynman-Kac Discovery with Applications to Prediction and Data Generation

【速读】:该论文旨在解决从有限金融时间序列数据中识别隐含在费曼-卡茨公式(Feynman-Kac formula)背后的概率规律问题,特别是如何从单一股票与期权轨迹对中恢复后向随机微分方程(Backward Stochastic Differential Equation, BSDE)。传统方法通常依赖于遍历性(ergodicity)假设来识别随机微分方程,这限制了其在实际金融数据中的适用性。本文的关键创新在于首次提出在风险中性概率测度(risk-neutral probability measure)下构建随机SINDy(Sparse Identification of Nonlinear Dynamics)方法,从而无需遍历性假设即可实现BSDE的精确恢复,进而支持向前预测和生成符合底层概率规律的合成数据路径。

链接: https://arxiv.org/abs/2511.08606
作者: Qi Feng,Guang Lin,Purav Matlia,Denny Serdarevic
机构: 未知
类目: Mathematical Finance (q-fin.MF); Artificial Intelligence (cs.AI); Computational Finance (q-fin.CP)
备注:

点击查看摘要

Abstract:In this paper, we propose a novel data-driven framework for discovering probabilistic laws underlying the Feynman-Kac formula. Specifically, we introduce the first stochastic SINDy method formulated under the risk-neutral probability measure to recover the backward stochastic differential equation (BSDE) from a single pair of stock and option trajectories. Unlike existing approaches to identifying stochastic differential equations-which typically require ergodicity-our framework leverages the risk-neutral measure, thereby eliminating the ergodicity assumption and enabling BSDE recovery from limited financial time series data. Using this algorithm, we are able not only to make forward-looking predictions but also to generate new synthetic data paths consistent with the underlying probabilistic law.
zh

机器学习

[LG-0] LLM Inference Beyond a Single Node: From Bottlenecks to Mitigations with Fast All-Reduce Communication

链接: https://arxiv.org/abs/2511.09557
作者: Prajwal Singhania,Siddharth Singh,Lannie Dalton Hough,Akarsh Srivastava,Harshitha Menon,Charles Fredrick Jekel,Abhinav Bhatele
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 12 Figures

点击查看摘要

Abstract:As large language models (LLMs) continue to grow in size, distributed inference has become increasingly important. Model-parallel strategies must now efficiently scale not only across multiple GPUs but also across multiple nodes. In this work, we present a detailed performance study of multi-node distributed inference using LLMs on GPU-based supercomputers. We conduct experiments with several state-of-the-art inference engines alongside YALIS, a research-oriented prototype engine designed for controlled experimentation. We analyze the strong-scaling behavior of different model-parallel schemes and identify key bottlenecks. Since all-reduce operations are a common performance bottleneck, we develop NVRAR, a hierarchical all-reduce algorithm based on recursive doubling with NVSHMEM. NVRAR achieves up to 1.9x-3.6x lower latency than NCCL for message sizes between 128 KB and 2 MB on HPE Slingshot and InfiniBand interconnects. Integrated into YALIS, NVRAR achieves up to a 1.72x reduction in end-to-end batch latency for the Llama 3.1 405B model in multi-node decode-heavy workloads using tensor parallelism.

[LG-1] NSL-MT: Linguistically Informed Negative Samples for Efficient Machine Translation in Low-Resource Languages

链接: https://arxiv.org/abs/2511.09537
作者: Mamadou K. Keita,Christopher Homan,Huy Le
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Negative Space Learning MT (NSL-MT), a training method that teaches models what not to generate by encoding linguistic constraints as severity-weighted penalties in the loss function. NSL-MT increases limited parallel data with synthetically generated violations of target language grammar, explicitly penalizing the model when it assigns high probability to these linguistically invalid outputs. We demonstrate that NSL-MT delivers improvements across all architectures: 3-12% BLEU gains for well-performing models and 56-89% gains for models lacking descent initial support. Furthermore, NSL-MT provides a 5x data efficiency multiplier – training with 1,000 examples matches or exceeds normal training with 5,000 examples. Thus, NSL-MT provides a data-efficient alternative training method for settings where there is limited annotated parallel corporas.

[LG-2] SiDGen: Structure-informed Diffusion for Generative modeling of Ligands for Proteins

链接: https://arxiv.org/abs/2511.09529
作者: Samyak Sanghvi,Nishant Ranjan,Tarak Karmakar
类目: Machine Learning (cs.LG)
*备注: 10 pages, 2 figures

点击查看摘要

Abstract:Designing ligands that are both chemically valid and structurally compatible with protein binding pockets is a key bottleneck in computational drug discovery. Existing approaches either ignore structural context or rely on expensive, memory-intensive encoding that limits throughput and scalability. We present SiDGen (Structure-informed Diffusion Generator), a protein-conditioned diffusion framework that integrates masked SMILES generation with lightweight folding-derived features for pocket awareness. To balance expressivity with efficiency, SiDGen supports two conditioning pathways: a streamlined mode that pools coarse structural signals from protein embeddings and a full mode that injects localized pairwise biases for stronger coupling. A coarse-stride folding mechanism with nearest-neighbor upsampling alleviates the quadratic memory costs of pair tensors, enabling training on realistic sequence lengths. Learning stability is maintained through in-loop chemical validity checks and an invalidity penalty, while large-scale training efficiency is restored \textitvia selective compilation, dataloader tuning, and gradient accumulation. In automated benchmarks, SiDGen generates ligands with high validity, uniqueness, and novelty, while achieving competitive performance in docking-based evaluations and maintaining reasonable molecular properties. These results demonstrate that SiDGen can deliver scalable, pocket-aware molecular design, providing a practical route to conditional generation for high-throughput drug discovery.

[LG-3] Event-Driven Digital-Time-Domain Inference Architectures for Tsetlin Machines

链接: https://arxiv.org/abs/2511.09527
作者: Tian Lan,Rishad Shafik,Alex Yakovlev
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning fits model parameters to approximate input-output mappings, predicting unknown samples. However, these models often require extensive arithmetic computations during inference, increasing latency and power consumption. This paper proposes a digital-time-domain computing approach for Tsetlin machine ™ inference process to address these challenges. This approach leverages a delay accumulation mechanism to mitigate the costly arithmetic sums of classes and employs a Winner-Takes-All scheme to replace conventional magnitude comparators. Specifically, a Hamming distance-driven time-domain scheme is implemented for multi-class TMs. Furthermore, differential delay paths, combined with a leading-ones-detector logarithmic delay compression digital-time-domain scheme, are utilised for the coalesced TMs, accommodating both binary-signed and exponential-scale delay accumulation issues. Compared to the functionally equivalent, post-implementation digital TM architecture baseline, the proposed architecture demonstrates orders-of-magnitude improvements in energy efficiency and throughput.

[LG-4] GenePheno: Interpretable Gene Knockout-Induced Phenotype Abnormality Prediction from Gene Sequences

链接: https://arxiv.org/abs/2511.09512
作者: Jingquan Yan,Yuwei Miao,Lei Yu,Yuzhi Guo,Xue Xiao,Lin Xu,Junzhou Huang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Exploring how genetic sequences shape phenotypes is a fundamental challenge in biology and a key step toward scalable, hypothesis-driven experimentation. The task is complicated by the large modality gap between sequences and phenotypes, as well as the pleiotropic nature of gene-phenotype relationships. Existing sequence-based efforts focus on the degree to which variants of specific genes alter a limited set of phenotypes, while general gene knockout induced phenotype abnormality prediction methods heavily rely on curated genetic information as inputs, which limits scalability and generalizability. As a result, the task of broadly predicting the presence of multiple phenotype abnormalities under gene knockout directly from gene sequences remains underexplored. We introduce GenePheno, the first interpretable multi-label prediction framework that predicts knockout induced phenotypic abnormalities from gene sequences. GenePheno employs a contrastive multi-label learning objective that captures inter-phenotype correlations, complemented by an exclusive regularization that enforces biological consistency. It further incorporates a gene function bottleneck layer, offering human interpretable concepts that reflect functional mechanisms behind phenotype formation. To support progress in this area, we curate four datasets with canonical gene sequences as input and multi-label phenotypic abnormalities induced by gene knockouts as targets. Across these datasets, GenePheno achieves state-of-the-art gene-centric Fmax and phenotype-centric AUC, and case studies demonstrate its ability to reveal gene functional mechanisms.

[LG-5] Quasi-Newton Compatible Actor-Critic for Deterministic Policies

链接: https://arxiv.org/abs/2511.09509
作者: Arash Bahari Kordabad,Dean Brandner,Sebastien Gros,Sergio Lucia,Sadegh Soudjani
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注: 8 pages, 9 figs

点击查看摘要

Abstract:In this paper, we propose a second-order deterministic actor-critic framework in reinforcement learning that extends the classical deterministic policy gradient method to exploit curvature information of the performance function. Building on the concept of compatible function approximation for the critic, we introduce a quadratic critic that simultaneously preserves the true policy gradient and an approximation of the performance Hessian. A least-squares temporal difference learning scheme is then developed to estimate the quadratic critic parameters efficiently. This construction enables a quasi-Newton actor update using information learned by the critic, yielding faster convergence compared to first-order methods. The proposed approach is general and applicable to any differentiable policy class. Numerical examples demonstrate that the method achieves improved convergence and performance over standard deterministic actor-critic baselines.

[LG-6] AutoSynth: Automated Workflow Optimization for High-Quality Synthetic Dataset Generation via Monte Carlo Tree Search

链接: https://arxiv.org/abs/2511.09488
作者: Shuzhen Bi,Chang Song,Siyu Song,Jinze Lv,Jian Chen,Xinyun Wang,Aimin Zhou,Hao Hao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Supervised fine-tuning (SFT) of large language models (LLMs) for specialized tasks requires high-quality datasets, but manual curation is prohibitively expensive. Synthetic data generation offers scalability, but its effectiveness relies on complex, multi-stage workflows, integrating prompt engineering and model orchestration. Existing automated workflow methods face a cold start problem: they require labeled datasets for reward modeling, which is especially problematic for subjective, open-ended tasks with no objective ground truth. We introduce AutoSynth, a framework that automates workflow discovery and optimization without reference datasets by reframing the problem as a Monte Carlo Tree Search guided by a novel dataset-free hybrid reward. This reward enables meta-learning through two LLM-as-judge components: one evaluates sample quality using dynamically generated task-specific metrics, and another assesses workflow code and prompt quality. Experiments on subjective educational tasks show that while expert-designed workflows achieve higher human preference rates (96-99% win rates vs. AutoSynth’s 40-51%), models trained on AutoSynth-generated data dramatically outperform baselines (40-51% vs. 2-5%) and match or surpass expert workflows on certain metrics, suggesting discovery of quality dimensions beyond human intuition. These results are achieved while reducing human effort from 5-7 hours to just 30 minutes (90% reduction). AutoSynth tackles the cold start issue in data-centric AI, offering a scalable, cost-effective method for subjective LLM tasks. Code: this https URL.

[LG-7] PDAC: Efficient Coreset Selection for Continual Learning via Probability Density Awareness

链接: https://arxiv.org/abs/2511.09487
作者: Junqi Gao,Zhichang Guo,Dazhi Zhang,Yao Li,Yi Ran,Biqing Qi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Rehearsal-based Continual Learning (CL) maintains a limited memory buffer to store replay samples for knowledge retention, making these approaches heavily reliant on the quality of the stored samples. Current Rehearsal-based CL methods typically construct the memory buffer by selecting a representative subset (referred to as coresets), aiming to approximate the training efficacy of the full dataset with minimal storage overhead. However, mainstream Coreset Selection (CS) methods generally formulate the CS problem as a bi-level optimization problem that relies on numerous inner and outer iterations to solve, leading to substantial computational cost thus limiting their practical efficiency. In this paper, we aim to provide a more efficient selection logic and scheme for coreset construction. To this end, we first analyze the Mean Squared Error (MSE) between the buffer-trained model and the Bayes-optimal model through the perspective of localized error decomposition to investigate the contribution of samples from different regions to MSE suppression. Further theoretical and experimental analyses demonstrate that samples with high probability density play a dominant role in error suppression. Inspired by this, we propose the Probability Density-Aware Coreset (PDAC) method. PDAC leverages the Projected Gaussian Mixture (PGM) model to estimate each sample’s joint density, enabling efficient density-prioritized buffer selection. Finally, we introduce the streaming Expectation Maximization (EM) algorithm to enhance the adaptability of PGM parameters to streaming data, yielding Streaming PDAC (SPDAC) for streaming scenarios. Extensive comparative experiments show that our methods outperforms other baselines across various CL settings while ensuring favorable efficiency.

[LG-8] Latent Planning via Embedding Arithmetic: A Contrastive Approach to Strategic Reasoning

链接: https://arxiv.org/abs/2511.09477
作者: Andrew Hamara,Greg Hamerly,Pablo Rivas,Andrew C. Freeman
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Planning in high-dimensional decision spaces is increasingly being studied through the lens of learned representations. Rather than training policies or value heads, we investigate whether planning can be carried out directly in an evaluation-aligned embedding space. We introduce SOLIS, which learns such a space using supervised contrastive learning. In this representation, outcome similarity is captured by proximity, and a single global advantage vector orients the space from losing to winning regions. Candidate actions are then ranked according to their alignment with this direction, reducing planning to vector operations in latent space. We demonstrate this approach in chess, where SOLIS uses only a shallow search guided by the learned embedding to reach competitive strength under constrained conditions. More broadly, our results suggest that evaluation-aligned latent planning offers a lightweight alternative to traditional dynamics models or policy learning.

[LG-9] Enhancing Explainability in Solar Energetic Particle Event Prediction: A Global Feature Mapping Approach ICDM

链接: https://arxiv.org/abs/2511.09475
作者: Anli Ji,Pranjal Patil,Chetraj Pandey,Manolis K. Georgoulis,Berkay Aydin
类目: Machine Learning (cs.LG)
*备注: 10 pages, 3 Figures. This is a pre-print of an accepted paper at ICDMW: SABID 2025

点击查看摘要

Abstract:Solar energetic particle (SEP) events, as one of the most prominent manifestations of solar activity, can generate severe hazardous radiation when accelerated by solar flares or shock waves formed aside from coronal mass ejections (CMEs). However, most existing data-driven methods used for SEP predictions are operated as black-box models, making it challenging for solar physicists to interpret the results and understand the underlying physical causes of such events rather than just obtain a prediction. To address this challenge, we propose a novel framework that integrates global explanations and ad-hoc feature mapping to enhance model transparency and provide deeper insights into the decision-making process. We validate our approach using a dataset of 341 SEP events, including 244 significant (=10 MeV) proton events exceeding the Space Weather Prediction Center S1 threshold, spanning solar cycles 22, 23, and 24. Furthermore, we present an explainability-focused case study of major SEP events, demonstrating how our method improves explainability and facilitates a more physics-informed understanding of SEP event prediction.

[LG-10] MCAD: Multimodal Context-Aware Audio Description Generation For Soccer

链接: https://arxiv.org/abs/2511.09448
作者: Lipisha Chaudhary,Trisha Mittal,Subhadra Gopalakrishnan,Ifeoma Nwogu,Jaclyn Pytlarz
类目: Multimedia (cs.MM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Audio Descriptions (AD) are essential for making visual content accessible to individuals with visual impairments. Recent works have shown a promising step towards automating AD, but they have been limited to describing high-quality movie content using human-annotated ground truth AD in the process. In this work, we present an end-to-end pipeline, MCAD, that extends AD generation beyond movies to the domain of sports, with a focus on soccer games, without relying on ground truth AD. To address the absence of domain-specific AD datasets, we fine-tune a Video Large Language Model on publicly available movie AD datasets so that it learns the narrative structure and conventions of AD. During inference, MCAD incorporates multimodal contextual cues such as player identities, soccer events and actions, and commentary from the game. These cues, combined with input prompts to the fine-tuned VideoLLM, allow the system to produce complete AD text for each video segment. We further introduce a new evaluation metric, ARGE-AD, designed to accurately assess the quality of generated AD. ARGE-AD evaluates the generated AD for the presence of five characteristics: (i) usage of people’s names, (ii) mention of actions and events, (iii) appropriate length of AD, (iv) absence of pronouns, and (v) overlap from commentary or subtitles. We present an in-depth analysis of our approach on both movie and soccer datasets. We also validate the use of this metric to quantitatively comment on the quality of generated AD using our metric across domains. Additionally, we contribute audio descriptions for 100 soccer game clips annotated by two AD experts.

[LG-11] Group Equivariance Meets Mechanistic Interpretability: Equivariant Sparse Autoencoders NEURIPS2025

链接: https://arxiv.org/abs/2511.09432
作者: Ege Erdogan,Ana Lucic
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2025 Mechanistic Interpretability and UniReps workshops

点击查看摘要

Abstract:Sparse autoencoders (SAEs) have proven useful in disentangling the opaque activations of neural networks, primarily large language models, into sets of interpretable features. However, adapting them to domains beyond language, such as scientific data with group symmetries, introduces challenges that can hinder their effectiveness. We show that incorporating such group symmetries into the SAEs yields features more useful in downstream tasks. More specifically, we train autoencoders on synthetic images and find that a single matrix can explain how their activations transform as the images are rotated. Building on this, we develop adaptively equivariant SAEs that can adapt to the base model’s level of equivariance. These adaptive SAEs discover features that lead to superior probing performance compared to regular SAEs, demonstrating the value of incorporating symmetries in mechanistic interpretability tools.

[LG-12] Several Supporting Evidences for the Adaptive Feature Program

链接: https://arxiv.org/abs/2511.09425
作者: Yicheng Li,Qian Lin
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Theoretically exploring the advantages of neural networks might be one of the most challenging problems in the AI era. An adaptive feature program has recently been proposed to analyze the feature learning characteristic property of neural networks in a more abstract way. Motivated by the celebrated Le Cam equivalence, we advocate the over-parametrized sequence models to further simplify the analysis of the training dynamics of adaptive feature program and present several supporting evidences for the adaptive feature program. More precisely, after having introduced the feature error measure (FEM) to characterize the quality of the learned feature, we show that the FEM is decreasing during the training process of several concrete adaptive feature models including linear regression, single/multiple index models, etc. We believe that this hints at the potential successes of the adaptive feature program.

[LG-13] ransformer Semantic Genetic Programming for d-dimensional Symbolic Regression Problems

链接: https://arxiv.org/abs/2511.09416
作者: Philipp Anthes,Dominik Sobania,Franz Rothlauf
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Transformer Semantic Genetic Programming (TSGP) is a semantic search approach that uses a pre-trained transformer model as a variation operator to generate offspring programs with controlled semantic similarity to a given parent. Unlike other semantic GP approaches that rely on fixed syntactic transformations, TSGP aims to learn diverse structural variations that lead to solutions with similar semantics. We find that a single transformer model trained on millions of programs is able to generalize across symbolic regression problems of varying dimension. Evaluated on 24 real-world and synthetic datasets, TSGP significantly outperforms standard GP, SLIM_GSGP, Deep Symbolic Regression, and Denoising Autoencoder GP, achieving an average rank of 1.58 across all benchmarks. Moreover, TSGP produces more compact solutions than SLIM_GSGP, despite its higher accuracy. In addition, the target semantic distance \mathrmSD_t is able to control the step size in the semantic space: small values of \mathrmSD_t enable consistent improvement in fitness but often lead to larger programs, while larger values promote faster convergence and compactness. Thus, \mathrmSD_t provides an effective mechanism for balancing exploration and exploitation.

[LG-14] Probing then Editing: A Push-Pull Framework for Retain-Free Machine Unlearning in Industrial IoT

链接: https://arxiv.org/abs/2511.09414
作者: Jiao Chen,Weihua Li,Jianhua Tang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In dynamic Industrial Internet of Things (IIoT) environments, models need the ability to selectively forget outdated or erroneous knowledge. However, existing methods typically rely on retain data to constrain model behavior, which increases computational and energy burdens and conflicts with industrial data silos and privacy compliance requirements. To address this, we propose a novel retain-free unlearning framework, referred to as Probing then Editing (PTE). PTE frames unlearning as a probe-edit process: first, it probes the decision boundary neighborhood of the model on the to-be-forgotten class via gradient ascent and generates corresponding editing instructions using the model’s own predictions. Subsequently, a push-pull collaborative optimization is performed: the push branch actively dismantles the decision region of the target class using the editing instructions, while the pull branch applies masked knowledge distillation to anchor the model’s knowledge on retained classes to their original states. Benefiting from this mechanism, PTE achieves efficient and balanced knowledge editing using only the to-be-forgotten data and the original model. Experimental results demonstrate that PTE achieves an excellent balance between unlearning effectiveness and model utility across multiple general and industrial benchmarks such as CWRU and SCUT-FD.

[LG-15] Abstract Gradient Training: A Unified Certification Framework for Data Poisoning Unlearning and Differential Privacy

链接: https://arxiv.org/abs/2511.09400
作者: Philip Sosnin,Matthew Wicker,Josh Collyer,Calvin Tsay
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The impact of inference-time data perturbation (e.g., adversarial attacks) has been extensively studied in machine learning, leading to well-established certification techniques for adversarial robustness. In contrast, certifying models against training data perturbations remains a relatively under-explored area. These perturbations can arise in three critical contexts: adversarial data poisoning, where an adversary manipulates training samples to corrupt model performance; machine unlearning, which requires certifying model behavior under the removal of specific training data; and differential privacy, where guarantees must be given with respect to substituting individual data points. This work introduces Abstract Gradient Training (AGT), a unified framework for certifying robustness of a given model and training procedure to training data perturbations, including bounded perturbations, the removal of data points, and the addition of new samples. By bounding the reachable set of parameters, i.e., establishing provable parameter-space bounds, AGT provides a formal approach to analyzing the behavior of models trained via first-order optimization methods.

[LG-16] Diffusion-based Sinogram Interpolation for Limited Angle PET

链接: https://arxiv.org/abs/2511.09383
作者: Rüveyda Yilmaz,Julian Thull,Johannes Stegmaier,Volkmar Schulz
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate PET imaging increasingly requires methods that support unconstrained detector layouts from walk-through designs to long-axial rings where gaps and open sides lead to severely undersampled sinograms. Instead of constraining the hardware to form complete cylinders, we propose treating the missing lines-of-responses as a learnable prior. Data-driven approaches, particularly generative models, offer a promising pathway to recover this missing information. In this work, we explore the use of conditional diffusion models to interpolate sparsely sampled sinograms, paving the way for novel, cost-efficient, and patient-friendly PET geometries in real clinical settings.

[LG-17] From Decision Trees to Boolean Logic: A Fast and Unified SHAP Algorithm AAAI2026

链接: https://arxiv.org/abs/2511.09376
作者: Alexander Nadel,Ron Wettenstein
类目: Machine Learning (cs.LG)
*备注: Accepted to AAAI 2026

点击查看摘要

Abstract:SHapley Additive exPlanations (SHAP) is a key tool for interpreting decision tree ensembles by assigning contribution values to features. It is widely used in finance, advertising, medicine, and other domains. Two main approaches to SHAP calculation exist: Path-Dependent SHAP, which leverages the tree structure for efficiency, and Background SHAP, which uses a background dataset to estimate feature distributions. We introduce WOODELF, a SHAP algorithm that integrates decision trees, game theory, and Boolean logic into a unified framework. For each consumer, WOODELF constructs a pseudo-Boolean formula that captures their feature values, the structure of the decision tree ensemble, and the entire background dataset. It then leverages this representation to compute Background SHAP in linear time. WOODELF can also compute Path-Dependent SHAP, Shapley interaction values, Banzhaf values, and Banzhaf interaction values. WOODELF is designed to run efficiently on CPU and GPU hardware alike. Available via the WOODELF Python package, it is implemented using NumPy, SciPy, and CuPy without relying on custom C++ or CUDA code. This design enables fast performance and seamless integration into existing frameworks, supporting large-scale computation of SHAP and other game-theoretic values in practice. For example, on a dataset with 3,000,000 rows, 5,000,000 background samples, and 127 features, WOODELF computed all Background Shapley values in 162 seconds on CPU and 16 seconds on GPU - compared to 44 minutes required by the best method on any hardware platform, representing 16x and 165x speedups, respectively. Comments: Accepted to AAAI 2026 Subjects: Machine Learning (cs.LG) Cite as: arXiv:2511.09376 [cs.LG] (or arXiv:2511.09376v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.09376 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Ron Wettenstein [view email] [v1] Wed, 12 Nov 2025 14:43:27 UTC (567 KB)

[LG-18] GAMMA_FLOW: Guided Analysis of Multi-label spectra by MAtrix Factorization for Lightweight Operational Workflows

链接: https://arxiv.org/abs/2511.09326
作者: Viola Rädle,Tilman Hartwig,Benjamin Oesen,Emily Alice Kröger,Julius Vogt,Eike Gericke,Martin Baron
类目: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注:

点击查看摘要

Abstract:GAMMA_FLOW is an open-source Python package for real-time analysis of spectral data. It supports classification, denoising, decomposition, and outlier detection of both single- and multi-component spectra. Instead of relying on large, computationally intensive models, it employs a supervised approach to non-negative matrix factorization (NMF) for dimensionality reduction. This ensures a fast, efficient, and adaptable analysis while reducing computational costs. gamma_flow achieves classification accuracies above 90% and enables reliable automated spectral interpretation. Originally developed for gamma-ray spectra, it is applicable to any type of one-dimensional spectral data. As an open and flexible alternative to proprietary software, it supports various applications in research and industry.

[LG-19] MARBLE: Multi-Armed Restless Bandits in Latent Markovian Environment

链接: https://arxiv.org/abs/2511.09324
作者: Mohsen Amiri,Konstantin Avrachenkov,Ibtihal El Mimouni,Sindri Magnússon
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Restless Multi-Armed Bandits (RMABs) are powerful models for decision-making under uncertainty, yet classical formulations typically assume fixed dynamics, an assumption often violated in nonstationary environments. We introduce MARBLE (Multi-Armed Restless Bandits in a Latent Markovian Environment), which augments RMABs with a latent Markov state that induces nonstationary behavior. In MARBLE, each arm evolves according to a latent environment state that switches over time, making policy learning substantially more challenging. We further introduce the Markov-Averaged Indexability (MAI) criterion as a relaxed indexability assumption and prove that, despite unobserved regime switches, under the MAI criterion, synchronous Q-learning with Whittle Indices (QWI) converges almost surely to the optimal Q-function and the corresponding Whittle indices. We validate MARBLE on a calibrated simulator-embedded (digital twin) recommender system, where QWI consistently adapts to a shifting latent state and converges to an optimal policy, empirically corroborating our theoretical findings.

[LG-20] Mixture-of-Channels: Exploiting Sparse FFNs for Efficient LLM s Pre-Training and Inference

链接: https://arxiv.org/abs/2511.09323
作者: Tong Wu,Yutong He,Bin Wang,Kun Yuan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated remarkable success across diverse artificial intelligence tasks, driven by scaling laws that correlate model size and training data with performance improvements. However, this scaling paradigm incurs substantial memory overhead, creating significant challenges for both training and inference. While existing research has primarily addressed parameter and optimizer state memory reduction, activation memory-particularly from feed-forward networks (FFNs)-has become the critical bottleneck, especially when FlashAttention is implemented. In this work, we conduct a detailed memory profiling of LLMs and identify FFN activations as the predominant source to activation memory overhead. Motivated by this, we introduce Mixture-of-Channels (MoC), a novel FFN architecture that selectively activates only the Top-K most relevant channels per token determined by SwiGLU’s native gating mechanism. MoC substantially reduces activation memory during pre-training and improves inference efficiency by reducing memory access through partial weight loading into GPU SRAM. Extensive experiments validate that MoC delivers significant memory savings and throughput gains while maintaining competitive model performance.

[LG-21] A Tensor Residual Circuit Neural Network Factorized with Matrix Product Operation

链接: https://arxiv.org/abs/2511.09315
作者: Andi Chen
类目: Machine Learning (cs.LG)
*备注: This is the supplementary material link: this https URL

点击查看摘要

Abstract:It is challenging to reduce the complexity of neural networks while maintaining their generalization ability and robustness, especially for practical applications. Conventional solutions for this problem incorporate quantum-inspired neural networks with Kronecker products and hybrid tensor neural networks with MPO factorization and fully-connected layers. Nonetheless, the generalization power and robustness of the fully-connected layers are not as outstanding as circuit models in quantum computing. In this paper, we propose a novel tensor circuit neural network (TCNN) that takes advantage of the characteristics of tensor neural networks and residual circuit models to achieve generalization ability and robustness with low complexity. The proposed activation operation and parallelism of the circuit in complex number field improves its non-linearity and efficiency for feature learning. Moreover, since the feature information exists in the parameters in both the real and imaginary parts in TCNN, an information fusion layer is proposed for merging features stored in those parameters to enhance the generalization capability. Experimental results confirm that TCNN showcases more outstanding generalization and robustness with its average accuracies on various datasets 2%-3% higher than those of the state-of-the-art compared models. More significantly, while other models fail to learn features under noise parameter attacking, TCNN still showcases prominent learning capability owing to its ability to prevent gradient explosion. Furthermore, it is comparable to the compared models on the number of trainable parameters and the CPU running time. An ablation study also indicates the advantage of the activation operation, the parallelism architecture and the information fusion layer.

[LG-22] Efficiently Transforming Neural Networks into Decision Trees: A Path to Ground Truth Explanations with RENTT

链接: https://arxiv.org/abs/2511.09299
作者: Helena Monke,Benjamin Fresz,Marco Bernreuther,Yilin Chen,Marco F. Huber
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Although neural networks are a powerful tool, their widespread use is hindered by the opacity of their decisions and their black-box nature, which result in a lack of trustworthiness. To alleviate this problem, methods in the field of explainable Artificial Intelligence try to unveil how such automated decisions are made. But explainable AI methods are often plagued by missing faithfulness/correctness, meaning that they sometimes provide explanations that do not align with the neural network’s decision and logic. Recently, transformations to decision trees have been proposed to overcome such problems. Unfortunately, they typically lack exactness, scalability, or interpretability as the size of the neural network grows. Thus, we generalize these previous results, especially by considering convolutional neural networks, recurrent neural networks, non-ReLU activation functions, and bias terms. Our findings are accompanied by rigorous proofs and we present a novel algorithm RENTT (Runtime Efficient Network to Tree Transformation) designed to compute an exact equivalent decision tree representation of neural networks in a manner that is both runtime and memory efficient. The resulting decision trees are multivariate and thus, possibly too complex to understand. To alleviate this problem, we also provide a method to calculate the ground truth feature importance for neural networks via the equivalent decision trees - for entire models (global), specific input regions (regional), or single decisions (local). All theoretical results are supported by detailed numerical experiments that emphasize two key aspects: the computational efficiency and scalability of our algorithm, and that only RENTT succeeds in uncovering ground truth explanations compared to conventional approximation methods like LIME and SHAP. All code is available at this https URL .

[LG-23] Multi-step Predictive Coding Leads To Simplicity Bias

链接: https://arxiv.org/abs/2511.09290
作者: Aviv Ratzon,Omri Barak
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:Predictive coding is a framework for understanding the formation of low-dimensional internal representations mirroring the environment’s latent structure. The conditions under which such representations emerge remain unclear. In this work, we investigate how the prediction horizon and network depth shape the solutions of predictive coding tasks. Using a minimal abstract setting inspired by prior work, we show empirically and theoretically that sufficiently deep networks trained with multi-step prediction horizons consistently recover the underlying latent structure, a phenomenon explained through the Ordinary Least Squares estimator structure and biases in learning dynamics. We then extend these insights to nonlinear networks and complex datasets, including piecewise linear functions, MNIST, multiple latent states and higher dimensional state geometries. Our results provide a principled understanding of when and why predictive coding induces structured representations, bridging the gap between empirical observations and theoretical foundations.

[LG-24] A Distributed Training Architecture For Combinatorial Optimization

链接: https://arxiv.org/abs/2511.09261
作者: Yuyao Long
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, graph neural networks (GNNs) have been widely applied in tackling combinatorial optimization problems. However, existing methods still suffer from limited accuracy when addressing that on complex graphs and exhibit poor scalability, since full training requires loading the whole adjacent matrix and all embeddings at a time, the it may results in out of memory of a single machine. This limitation significantly restricts their applicability to large-scale scenarios. To address these challenges, we propose a distributed GNN-based training framework for combinatorial optimization. In details, firstly, large graph is partition into several small subgraphs. Then the individual subgraphs are full trained, providing a foundation for efficient local optimization. Finally, reinforcement learning (RL) are employed to take actions according to GNN output, to make sure the restrictions between cross nodes can be learned. Extensive experiments are conducted on both real large-scale social network datasets (e.g., Facebook, Youtube) and synthetically generated high-complexity graphs, which demonstrate that our framework outperforms state-of-the-art approaches in both solution quality and computational efficiency. Moreover, the experiments on large graph instances also validate the scalability of the model.

[LG-25] Planning in Branch-and-Bound: Model-Based Reinforcement Learning for Exact Combinatorial Optimization

链接: https://arxiv.org/abs/2511.09219
作者: Paul Strang,Zacharie Alès,Côme Bissuel,Safia Kedad-Sidhoum,Emmanuel Rachelson
类目: Machine Learning (cs.LG)
*备注: arXiv admin note: text overlap with arXiv:2510.19348

点击查看摘要

Abstract:Mixed-Integer Linear Programming (MILP) lies at the core of many real-world combinatorial optimization (CO) problems, traditionally solved by branch-and-bound (BB). A key driver influencing BB solvers efficiency is the variable selection heuristic that guides branching decisions. Looking to move beyond static, hand-crafted heuristics, recent work has explored adapting traditional reinforcement learning (RL) algorithms to the BB setting, aiming to learn branching strategies tailored to specific MILP distributions. In parallel, RL agents have achieved remarkable success in board games, a very specific type of combinatorial problems, by leveraging environment simulators to plan via Monte Carlo Tree Search (MCTS). Building on these developments, we introduce Plan-and-Branch-and-Bound (PlanBB), a model-based reinforcement learning (MBRL) agent that leverages a learned internal model of the BB dynamics to discover improved branching strategies. Computational experiments empirically validate our approach, with our MBRL branching agent outperforming previous state-of-the-art RL methods across four standard MILP benchmarks.

[LG-26] Controllable protein design through Feynman-Kac steering

链接: https://arxiv.org/abs/2511.09216
作者: Erik Hartman,Jonas Wallin,Johan Malmström,Jimmy Olsson
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
*备注: 15 pages, 6 figures

点击查看摘要

Abstract:Diffusion-based models have recently enabled the generation of realistic and diverse protein structures, yet they remain limited in their ability to steer outcomes toward specific functional or biochemical objectives, such as binding affinity or sequence composition. Here we extend the Feynman-Kac (FK) steering framework, an inference-time control approach, to diffusion-based protein design. By coupling FK steering with structure generation, the method guides sampling toward desirable structural or energetic features while maintaining the diversity of the underlying diffusion process. To enable simultaneous generation of both sequence and structure properties, rewards are computed on models refined through ProteinMPNN and all-atom relaxation. Applied to binder design, FK steering consistently improves predicted interface energetics across diverse targets with minimal computational overhead. More broadly, this work demonstrates that inference-time FK control generalizes diffusion-based protein design to arbitrary, non-differentiable, and reward-agnostic objectives, providing a unified and model-independent framework for guided molecular generation.

[LG-27] Parameter-Free Clustering via Self-Supervised Consensus Maximization (Extended Version)

链接: https://arxiv.org/abs/2511.09211
作者: Lijun Zhang,Suyuan Liu,Siwei Wang,Shengju Yu,Xueling Zhu,Miaomiao Li,Xinwang Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Clustering is a fundamental task in unsupervised learning, but most existing methods heavily rely on hyperparameters such as the number of clusters or other sensitive settings, limiting their applicability in real-world scenarios. To address this long-standing challenge, we propose a novel and fully parameter-free clustering framework via Self-supervised Consensus Maximization, named SCMax. Our framework performs hierarchical agglomerative clustering and cluster evaluation in a single, integrated process. At each step of agglomeration, it creates a new, structure-aware data representation through a self-supervised learning task guided by the current clustering structure. We then introduce a nearest neighbor consensus score, which measures the agreement between the nearest neighbor-based merge decisions suggested by the original representation and the self-supervised one. The moment at which consensus maximization occurs can serve as a criterion for determining the optimal number of clusters. Extensive experiments on multiple datasets demonstrate that the proposed framework outperforms existing clustering approaches designed for scenarios with an unknown number of clusters.

[LG-28] CoCo-MILP: Inter-Variable Contrastive and Intra-Constraint Competitive MILP Solution Prediction

链接: https://arxiv.org/abs/2511.09209
作者: Tianle Pu,Jianing Li,Yingying Gao,Shixuan Liu,Zijie Geng,Haoyang Liu,Chao Chen,Changjun Fan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mixed-Integer Linear Programming (MILP) is a cornerstone of combinatorial optimization, yet solving large-scale instances remains a significant computational challenge. Recently, Graph Neural Networks (GNNs) have shown promise in accelerating MILP solvers by predicting high-quality solutions. However, we identify that existing methods misalign with the intrinsic structure of MILP problems at two levels. At the leaning objective level, the Binary Cross-Entropy (BCE) loss treats variables independently, neglecting their relative priority and yielding plausible logits. At the model architecture level, standard GNN message passing inherently smooths the representations across variables, missing the natural competitive relationships within constraints. To address these challenges, we propose CoCo-MILP, which explicitly models inter-variable Contrast and intra-constraint Competition for advanced MILP solution prediction. At the objective level, CoCo-MILP introduces the Inter-Variable Contrastive Loss (VCL), which explicitly maximizes the embedding margin between variables assigned one versus zero. At the architectural level, we design an Intra-Constraint Competitive GNN layer that, instead of homogenizing features, learns to differentiate representations of competing variables within a constraint, capturing their exclusionary nature. Experimental results on standard benchmarks demonstrate that CoCo-MILP significantly outperforms existing learning-based approaches, reducing the solution gap by up to 68.12% compared to traditional solvers. Our code is available at this https URL.

[LG-29] Stochastic Mean-Shift Clustering

链接: https://arxiv.org/abs/2511.09202
作者: Itshak Lapidot,Yann Sepulcre,Tom Trigano
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a stochastic version of the mean-shift clustering algorithm. In this stochastic version a randomly chosen sequence of data points move according to partial gradient ascent steps of the objective function. Theoretical results illustrating the convergence of the proposed approach, and its relative performances is evaluated on synthesized 2-dimensional samples generated by a Gaussian mixture distribution and compared with state-of-the-art methods. It can be observed that in most cases the stochastic mean-shift clustering outperforms the standard mean-shift. We also illustrate as a practical application the use of the presented method for speaker clustering.

[LG-30] Sure! Heres a short and concise title for your paper: “Contamination in Generated Text Detection Benchmarks”

链接: https://arxiv.org/abs/2511.09200
作者: Philipp Dingfelder,Christian Riess
类目: Machine Learning (cs.LG)
*备注: published at CSCML 2025

点击查看摘要

Abstract:Large language models are increasingly used for many applications. To prevent illicit use, it is desirable to be able to detect AI-generated text. Training and evaluation of such detectors critically depend on suitable benchmark datasets. Several groups took on the tedious work of collecting, curating, and publishing large and diverse datasets for this task. However, it remains an open challenge to ensure high quality in all relevant aspects of such a dataset. For example, the DetectRL benchmark exhibits relatively simple patterns of AI-generation in 98.5% of the Claude-LLM data. These patterns may include introductory words such as “Sure! Here is the academic article abstract:”, or instances where the LLM rejects the prompted task. In this work, we demonstrate that detectors trained on such data use such patterns as shortcuts, which facilitates spoofing attacks on the trained detectors. We consequently reprocessed the DetectRL dataset with several cleansing operations. Experiments show that such data cleansing makes direct attacks more difficult. The reprocessed dataset is publicly available.

[LG-31] Iterated Population Based Training with Task-Agnostic Restarts

链接: https://arxiv.org/abs/2511.09190
作者: Alexander Chebykin,Tanja Alderliesten,Peter A. N. Bosman
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:

点击查看摘要

Abstract:Hyperparameter Optimization (HPO) can lift the burden of tuning hyperparameters (HPs) of neural networks. HPO algorithms from the Population Based Training (PBT) family are efficient thanks to dynamically adjusting HPs every few steps of the weight optimization. Recent results indicate that the number of steps between HP updates is an important meta-HP of all PBT variants that can substantially affect their performance. Yet, no method or intuition is available for efficiently setting its value. We introduce Iterated Population Based Training (IPBT), a novel PBT variant that automatically adjusts this HP via restarts that reuse weight information in a task-agnostic way and leverage time-varying Bayesian optimization to reinitialize HPs. Evaluation on 8 image classification and reinforcement learning tasks shows that, on average, our algorithm matches or outperforms 5 previous PBT variants and other HPO algorithms (random search, ASHA, SMAC3), without requiring a budget increase or any changes to its HPs. The source code is available at this https URL.

[LG-32] Compact Memory for Continual Logistic Regression

链接: https://arxiv.org/abs/2511.09167
作者: Yohan Jung,Hyungi Lee,Wenlong Chen,Thomas Möllenhoff,Yingzhen Li,Juho Lee,Mohammad Emtiyaz Khan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Despite recent progress, continual learning still does not match the performance of batch training. To avoid catastrophic forgetting, we need to build compact memory of essential past knowledge, but no clear solution has yet emerged, even for shallow neural networks with just one or two layers. In this paper, we present a new method to build compact memory for logistic regression. Our method is based on a result by Khan and Swaroop [2021] who show the existence of optimal memory for such models. We formulate the search for the optimal memory as Hessian-matching and propose a probabilistic PCA method to estimate them. Our approach can drastically improve accuracy compared to Experience Replay. For instance, on Split-ImageNet, we get 60% accuracy compared to 30% obtained by replay with memory-size equivalent to 0.3% of the data size. Increasing the memory size to 2% further boosts the accuracy to 74%, closing the gap to the batch accuracy of 77.6% on this task. Our work opens a new direction for building compact memory that can also be useful in the future for continual deep learning.

[LG-33] Unsupervised Feature Selection Through Group Discovery AAAI2026

链接: https://arxiv.org/abs/2511.09166
作者: Shira Lifshitz,Ofir Lindenbaum,Gal Mishne,Ron Meir,Hadas Benisty
类目: Machine Learning (cs.LG)
*备注: Accepted to AAAI 2026

点击查看摘要

Abstract:Unsupervised feature selection (FS) is essential for high-dimensional learning tasks where labels are not available. It helps reduce noise, improve generalization, and enhance interpretability. However, most existing unsupervised FS methods evaluate features in isolation, even though informative signals often emerge from groups of related features. For example, adjacent pixels, functionally connected brain regions, or correlated financial indicators tend to act together, making independent evaluation suboptimal. Although some methods attempt to capture group structure, they typically rely on predefined partitions or label supervision, limiting their applicability. We propose GroupFS, an end-to-end, fully differentiable framework that jointly discovers latent feature groups and selects the most informative groups among them, without relying on fixed a priori groups or label supervision. GroupFS enforces Laplacian smoothness on both feature and sample graphs and applies a group sparsity regularizer to learn a compact, structured representation. Across nine benchmarks spanning images, tabular data, and biological datasets, GroupFS consistently outperforms state-of-the-art unsupervised FS in clustering and selects groups of features that align with meaningful patterns.

[LG-34] Practical Global and Local Bounds in Gaussian Process Regression via Chaining AAAI2026

链接: https://arxiv.org/abs/2511.09144
作者: Junyi Liu,Stanley Kok
类目: Machine Learning (cs.LG)
*备注: Accepted as a conference paper at AAAI2026

点击查看摘要

Abstract:Gaussian process regression (GPR) is a popular nonparametric Bayesian method that provides predictive uncertainty estimates and is widely used in safety-critical applications. While prior research has introduced various uncertainty bounds, most existing approaches require access to specific input features and rely on posterior mean and variance estimates or tuning hyperparameters. These limitations hinder robustness and fail to capture the model’s global behavior in expectation. To address these limitations, we propose a chaining-based framework for estimating upper and lower bounds on the expected extreme values over unseen data, without requiring access to specific input locations. We provide kernel-specific refinements for commonly used kernels such as RBF and Matérn, in which our bounds are tighter than generic constructions. We further improve numerical tightness by avoiding analytical relaxations. In addition to global estimation, we also develop a novel method for local uncertainty quantification at specified inputs. This approach leverages chaining geometry through partition diameters, adapting to local structure without relying on posterior variance scaling. Our experimental results validate the theoretical findings and demonstrate that our method outperforms existing approaches on both synthetic and real-world datasets.

[LG-35] rusted Multi-view Learning for Long-tailed Classification AAAI2026

链接: https://arxiv.org/abs/2511.09138
作者: Chuanqing Tang,Yifei Shi,Guanghao Lin,Lei Xing,Long Shi
类目: Machine Learning (cs.LG)
*备注: This paper has been accepted to AAAI2026

点击查看摘要

Abstract:Class imbalance has been extensively studied in single-view scenarios; however, addressing this challenge in multi-view contexts remains an open problem, with even scarcer research focusing on trustworthy solutions. In this paper, we tackle a particularly challenging class imbalance problem in multi-view scenarios: long-tailed classification. We propose TMLC, a Trusted Multi-view Long-tailed Classification framework, which makes contributions on two critical aspects: opinion aggregation and pseudo-data generation. Specifically, inspired by Social Identity Theory, we design a group consensus opinion aggregation mechanism that guides decision making toward the direction favored by the majority of the group. In terms of pseudo-data generation, we introduce a novel distance metric to adapt SMOTE for multi-view scenarios and develop an uncertainty-guided data generation module that produces high-quality pseudo-data, effectively mitigating the adverse effects of class imbalance. Extensive experiments on long-tailed multi-view datasets demonstrate that our model is capable of achieving superior performance. The code is released at this https URL.

[LG-36] owards a Generalisable Cyber Defence Agent for Real-World Computer Networks

链接: https://arxiv.org/abs/2511.09114
作者: Tim Dudman,Martyn Bull
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: CAMLIS 2025. To be published in the Proceedings of Machine Learning Research (PMLR)

点击查看摘要

Abstract:Recent advances in deep reinforcement learning for autonomous cyber defence have resulted in agents that can successfully defend simulated computer networks against cyber-attacks. However, many of these agents would need retraining to defend networks with differing topology or size, making them poorly suited to real-world networks where topology and size can vary over time. In this research we introduce a novel set of Topological Extensions for Reinforcement Learning Agents (TERLA) that provide generalisability for the defence of networks with differing topology and size, without the need for retraining. Our approach involves the use of heterogeneous graph neural network layers to produce a fixed-size latent embedding representing the observed network state. This representation learning stage is coupled with a reduced, fixed-size, semantically meaningful and interpretable action space. We apply TERLA to a standard deep reinforcement learning Proximal Policy Optimisation (PPO) agent model, and to reduce the sim-to-real gap, conduct our research using Cyber Autonomy Gym for Experimentation (CAGE) Challenge 4. This Cyber Operations Research Gym environment has many of the features of a real-world network, such as realistic Intrusion Detection System (IDS) events and multiple agents defending network segments of differing topology and size. TERLA agents retain the defensive performance of vanilla PPO agents whilst showing improved action efficiency. Generalisability has been demonstrated by showing that all TERLA agents have the same network-agnostic neural network architecture, and by deploying a single TERLA agent multiple times to defend network segments with differing topology and size, showing improved defensive performance and efficiency.

[LG-37] FedPM: Federated Learning Using Second-order Optimization with Preconditioned Mixing of Local Parameters AAAI-26 AAAI

链接: https://arxiv.org/abs/2511.09100
作者: Hiro Ishii,Kenta Niwa,Hiroshi Sawada,Akinori Fujino,Noboru Harada,Rio Yokota
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 18 pages, 7 figures. Accepted for publication in the Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-26)

点击查看摘要

Abstract:We propose Federated Preconditioned Mixing (FedPM), a novel Federated Learning (FL) method that leverages second-order optimization. Prior methods–such as LocalNewton, LTDA, and FedSophia–have incorporated second-order optimization in FL by performing iterative local updates on clients and applying simple mixing of local parameters on the server. However, these methods often suffer from drift in local preconditioners, which significantly disrupts the convergence of parameter training, particularly in heterogeneous data settings. To overcome this issue, we refine the update rules by decomposing the ideal second-order update–computed using globally preconditioned global gradients–into parameter mixing on the server and local parameter updates on clients. As a result, our FedPM introduces preconditioned mixing of local parameters on the server, effectively mitigating drift in local preconditioners. We provide a theoretical convergence analysis demonstrating a superlinear rate for strongly convex objectives in scenarios involving a single local update. To demonstrate the practical benefits of FedPM, we conducted extensive experiments. The results showed significant improvements with FedPM in the test accuracy compared to conventional methods incorporating simple mixing, fully leveraging the potential of second-order optimization. Comments: 18 pages, 7 figures. Accepted for publication in the Proceedings of the AAAI Conference on Artificial Intelligence (AAAI-26) Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC) Cite as: arXiv:2511.09100 [cs.LG] (or arXiv:2511.09100v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.09100 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-38] Human-Corrected Labels Learning: Enhancing Labels Quality via Human Correction of VLMs Discrepancies

链接: https://arxiv.org/abs/2511.09063
作者: Zhongnian Li,Lan Chen,Yixin Xu,Shi Xu,Xinzheng Xu
类目: Machine Learning (cs.LG)
*备注: 11 pages, 3 figures

点击查看摘要

Abstract:Vision-Language Models (VLMs), with their powerful content generation capabilities, have been successfully applied to data annotation processes. However, the VLM-generated labels exhibit dual limitations: low quality (i.e., label noise) and absence of error correction mechanisms. To enhance label quality, we propose Human-Corrected Labels (HCLs), a novel setting that efficient human correction for VLM-generated noisy labels. As shown in Figure 1(b), HCL strategically deploys human correction only for instances with VLM discrepancies, achieving both higher-quality annotations and reduced labor costs. Specifically, we theoretically derive a risk-consistent estimator that incorporates both human-corrected labels and VLM predictions to train classifiers. Besides, we further propose a conditional probability method to estimate the label distribution using a combination of VLM outputs and model predictions. Extensive experiments demonstrate that our approach achieves superior classification performance and is robust to label noise, validating the effectiveness of HCL in practical weak supervision scenarios. Code this https URL

[LG-39] Guaranteeing Conservation of Integrals with Projection in Physics-Informed Neural Networks

链接: https://arxiv.org/abs/2511.09048
作者: Anthony Baez,Wang Zhang,Ziwen Ma,Lam Nguyen,Subhro Das,Luca Daniel
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a novel projection method that guarantees the conservation of integral quantities in Physics-Informed Neural Networks (PINNs). While the soft constraint that PINNs use to enforce the structure of partial differential equations (PDEs) enables necessary flexibility during training, it also permits the discovered solution to violate physical laws. To address this, we introduce a projection method that guarantees the conservation of the linear and quadratic integrals, both separately and jointly. We derived the projection formulae by solving constrained non-linear optimization problems and found that our PINN modified with the projection, which we call PINN-Proj, reduced the error in the conservation of these quantities by three to four orders of magnitude compared to the soft constraint and marginally reduced the PDE solution error. We also found evidence that the projection improved convergence through improving the conditioning of the loss landscape. Our method holds promise as a general framework to guarantee the conservation of any integral quantity in a PINN if a tractable solution exists.

[LG-40] Preference is More Than Comparisons: Rethinking Dueling Bandits with Augmented Human Feedback AAAI2026

链接: https://arxiv.org/abs/2511.09047
作者: Shengbo Wang,Hong Sun,Ke Li
类目: Machine Learning (cs.LG)
*备注: Extended version of our AAAI 2026 paper

点击查看摘要

Abstract:Interactive preference elicitation (IPE) aims to substantially reduce human effort while acquiring human preferences in wide personalization systems. Dueling bandit (DB) algorithms enable optimal decision-making in IPE building on pairwise comparisons. However, they remain inefficient when human feedback is sparse. Existing methods address sparsity by heavily relying on parametric reward models, whose rigid assumptions are vulnerable to misspecification. In contrast, we explore an alternative perspective based on feedback augmentation, and introduce critical improvements to the model-free DB framework. Specifically, we introduce augmented confidence bounds to integrate augmented human feedback under generalized concentration properties, and analyze the multi-factored performance trade-off via regret analysis. Our prototype algorithm achieves competitive performance across several IPE benchmarks, including recommendation, multi-objective optimization, and response optimization for large language models, demonstrating the potential of our approach for provably efficient IPE in broader applications.

[LG-41] GeoGNN: Quantifying and Mitigating Semantic Drift in Text-Attributed Graphs

链接: https://arxiv.org/abs/2511.09042
作者: Liangwei Yang,Jing Ma,Jianguo Zhang,Zhiwei Liu,Jielin Qiu,Shirley Kokane,Shiyu Wang,Haolin Chen,Rithesh Murthy,Ming Zhu,Huan Wang,Weiran Yao,Caiming Xiong,Shelby Heinecke
类目: Machine Learning (cs.LG)
*备注: 10 pages

点击查看摘要

Abstract:Graph neural networks (GNNs) on text–attributed graphs (TAGs) typically encode node texts using pretrained language models (PLMs) and propagate these embeddings through linear neighborhood aggregation. However, the representation spaces of modern PLMs are highly non–linear and geometrically structured, where textual embeddings reside on curved semantic manifolds rather than flat Euclidean spaces. Linear aggregation on such manifolds inevitably distorts geometry and causes semantic drift–a phenomenon where aggregated representations deviate from the intrinsic manifold, losing semantic fidelity and expressive power. To quantitatively investigate this problem, this work introduces a local PCA–based metric that measures the degree of semantic drift and provides the first quantitative framework to analyze how different aggregation mechanisms affect manifold structure. Building upon these insights, we propose Geodesic Aggregation, a manifold–aware mechanism that aggregates neighbor information along geodesics via log–exp mappings on the unit sphere, ensuring that representations remain faithful to the semantic manifold during message passing. We further develop GeoGNN, a practical instantiation that integrates spherical attention with manifold interpolation. Extensive experiments across four benchmark datasets and multiple text encoders show that GeoGNN substantially mitigates semantic drift and consistently outperforms strong baselines, establishing the importance of manifold–aware aggregation in text–attributed graph learning.

[LG-42] Fairness-Aware Few-Shot Learning for Audio-Visual Stress Detection

链接: https://arxiv.org/abs/2511.09039
作者: Anushka Sanjay Shelke,Aditya Sneh,Arya Adyasha,Haroon R. Lone
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Fairness in AI-driven stress detection is critical for equitable mental healthcare, yet existing models frequently exhibit gender bias, particularly in data-scarce scenarios. To address this, we propose FairM2S, a fairness-aware meta-learning framework for stress detection leveraging audio-visual data. FairM2S integrates Equalized Odds constraints during both meta-training and adaptation phases, employing adversarial gradient masking and fairness-constrained meta-updates to effectively mitigate bias. Evaluated against five state-of-the-art baselines, FairM2S achieves 78.1% accuracy while reducing the Equal Opportunity to 0.06, demonstrating substantial fairness gains. We also release SAVSD, a smartphone-captured dataset with gender annotations, designed to support fairness research in low-resource, real-world contexts. Together, these contributions position FairM2S as a state-of-the-art approach for equitable and scalable few-shot stress detection in mental health AI. We release our dataset and FairM2S publicly with this paper.

[LG-43] FLAD: Federated Learning for LLM -based Autonomous Driving in Vehicle-Edge-Cloud Networks

链接: https://arxiv.org/abs/2511.09025
作者: Tianao Xiang,Mingjian Zhi,Yuanguo Bi,Lin Cai,Yuhao Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have impressive data fusion and reasoning capabilities for autonomous driving (AD). However, training LLMs for AD faces significant challenges including high computation transmission costs, and privacy concerns associated with sensitive driving data. Federated Learning (FL) is promising for enabling autonomous vehicles (AVs) to collaboratively train models without sharing raw data. We present Federated LLM-based Autonomous Driving (FLAD), an FL framework that leverages distributed multimodal sensory data across AVs in heterogeneous environment. FLAD has three key innovations: (1) a cloud-edge-vehicle collaborative architecture that reduces communication delay and preserving data privacy; (2) an intelligent parallelized collaborative training with a communication scheduling mechanism that optimizes training efficiency, leveraging end-devices otherwise having insufficient resources for model training; and (3) a knowledge distillation method that personalizes LLM according to heterogeneous edge data. In addition, we prototype FLAD in a testbed with NVIDIA Jetsons, overcoming practical implementation challenges including CPU/GPU memory sharing in resource-constrained devices, dynamic model partitions, and fault-tolerant this http URL experimental evaluation demonstrates that FLAD achieves superior end-to-end AD performance while efficiently utilizing distributed vehicular resources, opening up new possibilities for future collaborative AD model training and knowledge sharing.

[LG-44] Assumed Density Filtering and Smoothing with Neural Network Surrogate Models

链接: https://arxiv.org/abs/2511.09016
作者: Simon Kuang,Xinfan Lin
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Kalman filter and Rauch-Tung-Striebel (RTS) smoother are optimal for state estimation in linear dynamic systems. With nonlinear systems, the challenge consists in how to propagate uncertainty through the state transitions and output function. For the case of a neural network model, we enable accurate uncertainty propagation using a recent state-of-the-art analytic formula for computing the mean and covariance of a deep neural network with Gaussian input. We argue that cross entropy is a more appropriate performance metric than RMSE for evaluating the accuracy of filters and smoothers. We demonstrate the superiority of our method for state estimation on a stochastic Lorenz system and a Wiener system, and find that our method enables more optimal linear quadratic regulation when the state estimate is used for feedback.

[LG-45] Data reuse enables cost-efficient randomized trials of medical AI models

链接: https://arxiv.org/abs/2511.08986
作者: Michael Nercessian,Wenxin Zhang,Alexander Schubert,Daphne Yang,Maggie Chung,Ahmed Alaa,Adam Yala
类目: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Randomized controlled trials (RCTs) are indispensable for establishing the clinical value of medical artificial-intelligence (AI) tools, yet their high cost and long timelines hinder timely validation as new models emerge rapidly. Here, we propose BRIDGE, a data-reuse RCT design for AI-based risk models. AI risk models support a broad range of interventions, including screening, treatment selection, and clinical alerts. BRIDGE trials recycle participant-level data from completed trials of AI models when legacy and updated models make concordant predictions, thereby reducing the enrollment requirement for subsequent trials. We provide a practical checklist for investigators to assess whether reusing data from previous trials allows for valid causal inference and preserves type I error. Using real-world datasets across breast cancer, cardiovascular disease, and sepsis, we demonstrate concordance between successive AI models, with up to 64.8% overlap in top 5% high-risk cohorts. We then simulate a series of breast cancer screening studies, where our design reduced required enrollment by 46.6%–saving over US 2.8 million–while maintaining 80% power. By transforming trials into adaptive, modular studies, our proposed design makes Level I evidence generation feasible for every model iteration, thereby accelerating cost-effective translation of AI into routine care.

[LG-46] DeepTracer: Tracing Stolen Model via Deep Coupled Watermarks AAAI2026

链接: https://arxiv.org/abs/2511.08985
作者: Yunfei Yang,Xiaojun Chen,Yuexin Xuan,Zhendong Zhao,Xin Zhao,He Li
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Extended version of the paper accepted by AAAI 2026

点击查看摘要

Abstract:Model watermarking techniques can embed watermark information into the protected model for ownership declaration by constructing specific input-output pairs. However, existing watermarks are easily removed when facing model stealing attacks, and make it difficult for model owners to effectively verify the copyright of stolen models. In this paper, we analyze the root cause of the failure of current watermarking methods under model stealing scenarios and then explore potential solutions. Specifically, we introduce a robust watermarking framework, DeepTracer, which leverages a novel watermark samples construction method and a same-class coupling loss constraint. DeepTracer can incur a high-coupling model between watermark task and primary task that makes adversaries inevitably learn the hidden watermark task when stealing the primary task functionality. Furthermore, we propose an effective watermark samples filtering mechanism that elaborately select watermark key samples used in model ownership verification to enhance the reliability of watermarks. Extensive experiments across multiple datasets and models demonstrate that our method surpasses existing approaches in defending against various model stealing attacks, as well as watermark attacks, and achieves new state-of-the-art effectiveness and robustness.

[LG-47] Selective Sinkhorn Routing for Improved Sparse Mixture of Experts

链接: https://arxiv.org/abs/2511.08972
作者: Duc Anh Nguyen,Huu Binh Ta,Nhuan Le Duc,Tan M. Nguyen,Toan Tran
类目: Machine Learning (cs.LG)
*备注: 14 pages, 6 figures

点击查看摘要

Abstract:Sparse Mixture-of-Experts (SMoE) has gained prominence as a scalable and computationally efficient architecture, enabling significant growth in model capacity without incurring additional inference costs. However, existing SMoE models often rely on auxiliary losses (e.g., z-loss, load balancing) and additional trainable parameters (e.g., noisy gating) to encourage expert diversity, leading to objective misalignment and increased model complexity. Moreover, existing Sinkhorn-based methods suffer from significant training overhead due to their heavy reliance on the computationally expensive Sinkhorn algorithm. In this work, we formulate token-to-expert assignment as an optimal transport problem, incorporating constraints to ensure balanced expert utilization. We demonstrate that introducing a minimal degree of optimal transport-based routing enhances SMoE performance without requiring auxiliary balancing losses. Unlike previous methods, our approach derives gating scores directly from the transport map, enabling more effective token-to-expert balancing, supported by both theoretical analysis and empirical results. Building on these insights, we propose Selective Sinkhorn Routing (SSR), a routing mechanism that replaces auxiliary loss with lightweight Sinkhorn-based routing. SSR promotes balanced token assignments while preserving flexibility in expert selection. Across both language modeling and image classification tasks, SSR achieves faster training, higher accuracy, and greater robustness to input corruption.

[LG-48] Improving Conditional VAE with approximation using Normalizing Flows

链接: https://arxiv.org/abs/2511.08946
作者: Tuhin Subhra De
类目: Machine Learning (cs.LG)
*备注: Independent Work

点击查看摘要

Abstract:Variational Autoencoders and Generative Adversarial Networks remained the state-of-the-art (SOTA) generative models until 2022. Now they are superseded by diffusion based models. Efforts to improve traditional models have stagnated as a result. In old-school fashion, we explore image generation with conditional Variational Autoencoders (CVAE) to incorporate desired attributes within the images. VAEs are known to produce blurry images with less diversity, we refer a method that solve this issue by leveraging the variance of the gaussian decoder as a learnable parameter during training. Previous works on CVAEs assumed that the conditional distribution of the latent space given the labels is equal to the prior distribution, which is not the case in reality. We show that estimating it using normalizing flows results in better image generation than existing methods by reducing the FID by 5% and increasing log likelihood by 7.7% than the previous case.

[LG-49] Robust Backdoor Removal by Reconstructing Trigger-Activated Changes in Latent Representation

链接: https://arxiv.org/abs/2511.08944
作者: Kazuki Iwahana,Yusuke Yamasaki,Akira Ito,Takayuki Miura,Toshiki Shibahara
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Backdoor attacks pose a critical threat to machine learning models, causing them to behave normally on clean data but misclassify poisoned data into a poisoned class. Existing defenses often attempt to identify and remove backdoor neurons based on Trigger-Activated Changes (TAC) which is the activation differences between clean and poisoned data. These methods suffer from low precision in identifying true backdoor neurons due to inaccurate estimation of TAC values. In this work, we propose a novel backdoor removal method by accurately reconstructing TAC values in the latent representation. Specifically, we formulate the minimal perturbation that forces clean data to be classified into a specific class as a convex quadratic optimization problem, whose optimal solution serves as a surrogate for TAC. We then identify the poisoned class by detecting statistically small L^2 norms of perturbations and leverage the perturbation of the poisoned class in fine-tuning to remove backdoors. Experiments on CIFAR-10, GTSRB, and TinyImageNet demonstrated that our approach consistently achieves superior backdoor suppression with high clean accuracy across different attack types, datasets, and architectures, outperforming existing defense methods.

[LG-50] QIBONN: A Quantum-Inspired Bilevel Optimizer for Neural Networks on Tabular Classification

链接: https://arxiv.org/abs/2511.08940
作者: Pedro Chumpitaz-Flores,My Duong,Ying Mao,Kaixun Hua
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 6 pages, 3 figures, 3 tables. Accepted at IEEE International Conference on Big Data 2025

点击查看摘要

Abstract:Hyperparameter optimization (HPO) for neural networks on tabular data is critical to a wide range of applications, yet it remains challenging due to large, non-convex search spaces and the cost of exhaustive tuning. We introduce the Quantum-Inspired Bilevel Optimizer for Neural Networks (QIBONN), a bilevel framework that encodes feature selection, architectural hyperparameters, and regularization in a unified qubit-based representation. By combining deterministic quantum-inspired rotations with stochastic qubit mutations guided by a global attractor, QIBONN balances exploration and exploitation under a fixed evaluation budget. We conduct systematic experiments under single-qubit bit-flip noise (0.1%–1%) emulated by an IBM-Q backend. Results on 13 real-world datasets indicate that QIBONN is competitive with established methods, including classical tree-based methods and both classical/quantum-inspired HPO algorithms under the same tuning budget.

[LG-51] DeepDR: an integrated deep-learning model web server for drug repositioning

链接: https://arxiv.org/abs/2511.08921
作者: Shuting Jin,Yi Jiang,Yimin Liu,Tengfei Ma,Dongsheng Cao,Leyi Wei,Xiangrong Liu,Xiangxiang Zeng
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 13 pages, 4 figures

点击查看摘要

Abstract:Background: Identifying new indications for approved drugs is a complex and time-consuming process that requires extensive knowledge of pharmacology, clinical data, and advanced computational methods. Recently, deep learning (DL) methods have shown their capability for the accurate prediction of drug repositioning. However, implementing DL-based modeling requires in-depth domain knowledge and proficient programming skills. Results: In this application, we introduce DeepDR, the first integrated platform that combines a variety of established DL-based models for disease- and target-specific drug repositioning tasks. DeepDR leverages invaluable experience to recommend candidate drugs, which covers more than 15 networks and a comprehensive knowledge graph that includes 5.9 million edges across 107 types of relationships connecting drugs, diseases, proteins/genes, pathways, and expression from six existing databases and a large scientific corpus of 24 million PubMed publications. Additionally, the recommended results include detailed descriptions of the recommended drugs and visualize key patterns with interpretability through a knowledge graph. Conclusion: DeepDR is free and open to all users without the requirement of registration. We believe it can provide an easy-to-use, systematic, highly accurate, and computationally automated platform for both experimental and computational scientists.

[LG-52] Weaver: Kronecker Product Approximations of Spatiotemporal Attention for Traffic Network Forecasting

链接: https://arxiv.org/abs/2511.08888
作者: Christopher Cheong,Gary Davis,Seongjin Choi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Spatiotemporal forecasting on transportation networks is a complex task that requires understanding how traffic nodes interact within a dynamic, evolving system dictated by traffic flow dynamics and social behavioral patterns. The importance of transportation networks and ITS for modern mobility and commerce necessitates forecasting models that are not only accurate but also interpretable, efficient, and robust under structural or temporal perturbations. Recent approaches, particularly Transformer-based architectures, have improved predictive performance but often at the cost of high computational overhead and diminished architectural interpretability. In this work, we introduce Weaver, a novel attention-based model that applies Kronecker product approximations (KPA) to decompose the PN X PN spatiotemporal attention of O(P^2N^2) complexity into local P X P temporal and N X N spatial attention maps. This Kronecker attention map enables our Parallel-Kronecker Matrix-Vector product (P2-KMV) for efficient spatiotemporal message passing with O(P^2N + N^2P) complexity. To capture real-world traffic dynamics, we address the importance of negative edges in modeling traffic behavior by introducing Valence Attention using the continuous Tanimoto coefficient (CTC), which provides properties conducive to precise latent graph generation and training stability. To fully utilize the model’s learning capacity, we introduce the Traffic Phase Dictionary for self-conditioning. Evaluations on PEMS-BAY and METR-LA show that Weaver achieves competitive performance across model categories while training more efficiently.

[LG-53] Spectral Predictability as a Fast Reliability Indicator for Time Series Forecasting Model Selection

链接: https://arxiv.org/abs/2511.08884
作者: Oliver Wang,Pengrui Quan,Kang Yang,Mani Srivastava
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Practitioners deploying time series forecasting models face a dilemma: exhaustively validating dozens of models is computationally prohibitive, yet choosing the wrong model risks poor performance. We show that spectral predictability~ \Omega – a simple signal processing metric – systematically stratifies model family performance, enabling fast model selection. We conduct controlled experiments in four different domains, then further expand our analysis to 51 models and 28 datasets from the GIFT-Eval benchmark. We find that large time series foundation models (TSFMs) systematically outperform lightweight task-trained baselines when \Omega is high, while their advantage vanishes as \Omega drops. Computing \Omega takes seconds per dataset, enabling practitioners to quickly assess whether their data suits TSFM approaches or whether simpler, cheaper models suffice. We demonstrate that \Omega stratifies model performance predictably, offering a practical first-pass filter that reduces validation costs while highlighting the need for models that excel on genuinely difficult (low- \Omega ) problems rather than merely optimizing easy ones.

[LG-54] Covariance Scattering Transforms

链接: https://arxiv.org/abs/2511.08878
作者: Andrea Cavallo,Ayushman Raghuvanshi,Sundeep Prabhakar Chepuri,Elvin Isufi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Machine learning and data processing techniques relying on covariance information are widespread as they identify meaningful patterns in unsupervised and unlabeled settings. As a prominent example, Principal Component Analysis (PCA) projects data points onto the eigenvectors of their covariance matrix, capturing the directions of maximum variance. This mapping, however, falls short in two directions: it fails to capture information in low-variance directions, relevant when, e.g., the data contains high-variance noise; and it provides unstable results in low-sample regimes, especially when covariance eigenvalues are close. CoVariance Neural Networks (VNNs), i.e., graph neural networks using the covariance matrix as a graph, show improved stability to estimation errors and learn more expressive functions in the covariance spectrum than PCA, but require training and operate in a labeled setup. To get the benefits of both worlds, we propose Covariance Scattering Transforms (CSTs), deep untrained networks that sequentially apply filters localized in the covariance spectrum to the input data and produce expressive hierarchical representations via nonlinearities. We define the filters as covariance wavelets that capture specific and detailed covariance spectral patterns. We improve CSTs’ computational and memory efficiency via a pruning mechanism, and we prove that their error due to finite-sample covariance estimations is less sensitive to close covariance eigenvalues compared to PCA, improving their stability. Our experiments on age prediction from cortical thickness measurements on 4 datasets collecting patients with neurodegenerative diseases show that CSTs produce stable representations in low-data settings, as VNNs but without any training, and lead to comparable or better predictions w.r.t. more complex learning models.

[LG-55] EEG-X: Device-Agnostic and Noise-Robust Foundation Model for EEG

链接: https://arxiv.org/abs/2511.08861
作者: Navid Mohammadi Foumani,Soheila Ghane,Nam Nguyen,Mahsa Salehi,Geoffrey I. Webb,Geoffrey Mackellar
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:

点击查看摘要

Abstract:Foundation models for EEG analysis are still in their infancy, limited by two key challenges: (1) variability across datasets caused by differences in recording devices and configurations, and (2) the low signal-to-noise ratio (SNR) of EEG, where brain signals are often buried under artifacts and non-brain sources. To address these challenges, we present EEG-X, a device-agnostic and noise-robust foundation model for EEG representation learning. EEG-X introduces a novel location-based channel embedding that encodes spatial information and improves generalization across domains and tasks by allowing the model to handle varying channel numbers, combinations, and recording lengths. To enhance robustness against noise, EEG-X employs a noise-aware masking and reconstruction strategy in both raw and latent spaces. Unlike previous models that mask and reconstruct raw noisy EEG signals, EEG-X is trained to reconstruct denoised signals obtained through an artifact removal process, ensuring that the learned representations focus on neural activity rather than noise. To further enhance reconstruction-based pretraining, EEG-X introduces a dictionary-inspired convolutional transformation (DiCT) layer that projects signals into a structured feature space before computing reconstruction (MSE) loss, reducing noise sensitivity and capturing frequency- and shape-aware similarities. Experiments on datasets collected from diverse devices show that EEG-X outperforms state-of-the-art methods across multiple downstream EEG tasks and excels in cross-domain settings where pre-trained and downstream datasets differ in electrode layouts. The models and code are available at: this https URL

[LG-56] ForeSWE: Forecasting Snow-Water Equivalent with an Uncertainty-Aware Attention Model AAAI

链接: https://arxiv.org/abs/2511.08856
作者: Krishu K Thapa,Supriya Savalkar,Bhupinderjeet Singh,Trong Nghia Hoang,Kirti Rajagopalan,Ananth Kalyanaraman
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: Accepted for publication at the 2026 AAAI conference

点击查看摘要

Abstract:Various complex water management decisions are made in snow-dominant watersheds with the knowledge of Snow-Water Equivalent (SWE) – a key measure widely used to estimate the water content of a snowpack. However, forecasting SWE is challenging because SWE is influenced by various factors including topography and an array of environmental conditions, and has therefore been observed to be spatio-temporally variable. Classical approaches to SWE forecasting have not adequately utilized these spatial/temporal correlations, nor do they provide uncertainty estimates – which can be of significant value to the decision maker. In this paper, we present ForeSWE, a new probabilistic spatio-temporal forecasting model that integrates deep learning and classical probabilistic techniques. The resulting model features a combination of an attention mechanism to integrate spatiotemporal features and interactions, alongside a Gaussian process module that provides principled quantification of prediction uncertainty. We evaluate the model on data from 512 Snow Telemetry (SNOTEL) stations in the Western US. The results show significant improvements in both forecasting accuracy and prediction interval compared to existing approaches. The results also serve to highlight the efficacy in uncertainty estimates between different approaches. Collectively, these findings have provided a platform for deployment and feedback by the water management community.

[LG-57] Decomposition of Small Transformer Models NEURIPS2025

链接: https://arxiv.org/abs/2511.08854
作者: Casper L. Christensen,Logan Riggs
类目: Machine Learning (cs.LG)
*备注: Accepted at Neurips 2025 Workshop on Mechanistic Interpretability

点击查看摘要

Abstract:Recent work in mechanistic interpretability has shown that decomposing models in parameter space may yield clean handles for analysis and intervention. Previous methods have demonstrated successful applications on a wide range of toy models, but the gap to “real models” has not yet been bridged. In this work, we extend Stochastic Parameter Decomposition (SPD) to Transformer models, proposing an updated causal importance function suited for sequential data and a new loss function. We demonstrate that SPD can successfully decompose a toy induction-head model and recover the expected 2-step circuit. We also show that applying SPD to GPT-2-small can successfully locate subcomponents corresponding to interpretable concepts like “golf” and “basketball”. These results take the first step in the direction of extending SPD to modern models, and show that we can use the method to surface interpretable parameter-space mechanisms.

[LG-58] Learning-based Radio Link Failure Prediction Based on Measurement Dataset in Railway Environments

链接: https://arxiv.org/abs/2511.08851
作者: Po-Heng Chou,Da-Chih Lin,Hung-Yu Wei,Walid Saad,Yu Tsao
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 7 pages, 3 figures, 2 tables, and submitted to IEEE ICC 2026

点击查看摘要

Abstract:In this paper, a measurement-driven framework is proposed for early radio link failure (RLF) prediction in 5G non-standalone (NSA) railway environments. Using 10 Hz metro-train traces with serving and neighbor-cell indicators, we benchmark six models, namely CNN, LSTM, XGBoost, Anomaly Transformer, PatchTST, and TimesNet, under varied observation windows and prediction horizons. When the observation window is three seconds, TimesNet attains the highest F1 score with a three-second prediction horizon, while CNN provides a favorable accuracy-latency tradeoff with a two-second horizon, enabling proactive actions such as redundancy and adaptive handovers. The results indicate that deep temporal models can anticipate reliability degradations several seconds in advance using lightweight features available on commercial devices, offering a practical path to early-warning control in 5G-based railway systems.

[LG-59] On topological descriptors for graph products NEURIPS2025

链接: https://arxiv.org/abs/2511.08846
作者: Mattie Ji,Amauri H. Souza,Vikas Garg
类目: Machine Learning (cs.LG); Algebraic Topology (math.AT); Machine Learning (stat.ML)
*备注: 26 pages, 4 tables, 5 figures. Accepted at NeurIPS 2025

点击查看摘要

Abstract:Topological descriptors have been increasingly utilized for capturing multiscale structural information in relational data. In this work, we consider various filtrations on the (box) product of graphs and the effect on their outputs on the topological descriptors - the Euler characteristic (EC) and persistent homology (PH). In particular, we establish a complete characterization of the expressive power of EC on general color-based filtrations. We also show that the PH descriptors of (virtual) graph products contain strictly more information than the computation on individual graphs, whereas EC does not. Additionally, we provide algorithms to compute the PH diagrams of the product of vertex- and edge-level filtrations on the graph product. We also substantiate our theoretical analysis with empirical investigations on runtime analysis, expressivity, and graph classification performance. Overall, this work paves way for powerful graph persistent descriptors via product filtrations. Code is available at this https URL.

[LG-60] Physics-Informed Machine Learning for Characterizing System Stability

链接: https://arxiv.org/abs/2511.08831
作者: Tomoki Koike,Elizabeth Qian
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:In the design and operation of complex dynamical systems, it is essential to ensure that all state trajectories of the dynamical system converge to a desired equilibrium within a guaranteed stability region. Yet, for many practical systems – especially in aerospace – this region cannot be determined a priori and is often challenging to compute. One of the most common methods for computing the stability region is to identify a Lyapunov function. A Lyapunov function is a positive function whose time derivative along system trajectories is non-positive, which provides a sufficient condition for stability and characterizes an estimated stability region. However, existing methods of characterizing a stability region via a Lyapunov function often rely on explicit knowledge of the system governing equations. In this work, we present a new physics-informed machine learning method of characterizing an estimated stability region by inferring a Lyapunov function from system trajectory data that treats the dynamical system as a black box and does not require explicit knowledge of the system governing equations. In our presented Lyapunov function Inference method (LyapInf), we propose a quadratic form for the unknown Lyapunov function and fit the unknown quadratic operator to system trajectory data by minimizing the average residual of the Zubov equation, a first-order partial differential equation whose solution yields a Lyapunov function. The inferred quadratic Lyapunov function can then characterize an ellipsoidal estimate of the stability region. Numerical results on benchmark examples demonstrate that our physics-informed stability analysis method successfully characterizes a near-maximal ellipsoid of the system stability region associated with the inferred Lyapunov function without requiring knowledge of the system governing equations.

[LG-61] A Neural-Operator Preconditioned Newton Method for Accelerated Nonlinear Solvers

链接: https://arxiv.org/abs/2511.08811
作者: Youngkyu Lee,Shanqing Liu,Jerome Darbon,George Em Karniadakis
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 14 pages, 5 figures, 7 tables

点击查看摘要

Abstract:We propose a novel neural preconditioned Newton (NP-Newton) method for solving parametric nonlinear systems of equations. To overcome the stagnation or instability of Newton iterations caused by unbalanced nonlinearities, we introduce a fixed-point neural operator (FPNO) that learns the direct mapping from the current iterate to the solution by emulating fixed-point iterations. Unlike traditional line-search or trust-region algorithms, the proposed FPNO adaptively employs negative step sizes to effectively mitigate the effects of unbalanced nonlinearities. Through numerical experiments we demonstrate the computational efficiency and robustness of the proposed NP-Newton method across multiple real-world applications, especially for very strong nonlinearities.

[LG-62] A Generalized Bias-Variance Decomposition for Bregman Divergences

链接: https://arxiv.org/abs/2511.08789
作者: David Pfau
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Extended version of notes previously posted here: this http URL

点击查看摘要

Abstract:The bias-variance decomposition is a central result in statistics and machine learning, but is typically presented only for the squared error. We present a generalization of the bias-variance decomposition where the prediction error is a Bregman divergence, which is relevant to maximum likelihood estimation with exponential families. While the result is already known, there was not previously a clear, standalone derivation, so we provide one for pedagogical purposes. A version of this note previously appeared on the author’s personal website without context. Here we provide additional discussion and references to the relevant prior literature.

[LG-63] Gromov-Wasserstein Graph Coarsening

链接: https://arxiv.org/abs/2511.08733
作者: Carlos A. Taveras,Santiago Segarra,César A. Uribe
类目: Machine Learning (cs.LG)
*备注: 11 pages, 2 figures

点击查看摘要

Abstract:We study the problem of graph coarsening within the Gromov-Wasserstein geometry. Specifically, we propose two algorithms that leverage a novel representation of the distortion induced by merging pairs of nodes. The first method, termed Greedy Pair Coarsening (GPC), iteratively merges pairs of nodes that locally minimize a measure of distortion until the desired size is achieved. The second method, termed k -means Greedy Pair Coarsening (KGPC), leverages clustering based on pairwise distortion metrics to directly merge clusters of nodes. We provide conditions guaranteeing optimal coarsening for our methods and validate their performance on six large-scale datasets and a downstream clustering task. Results show that the proposed methods outperform existing approaches on a wide range of parameters and scenarios.

[LG-64] Macroscopic Emission Modeling of Urban Traffic Using Probe Vehicle Data: A Machine Learning Approach

链接: https://arxiv.org/abs/2511.08722
作者: Mohammed Ali El Adlouni,Ling Jin,Xiaodan Xu,C. Anna Spurlock,Alina Lazar,Kaveh Farokhi Sadabadi,Mahyar Amirgholy,Mona Asudegi
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 3 pages, 5 figures, IEEE Big Data 2024 conference

点击查看摘要

Abstract:Urban congestions cause inefficient movement of vehicles and exacerbate greenhouse gas emissions and urban air pollution. Macroscopic emission fundamental diagram (eMFD)captures an orderly relationship among emission and aggregated traffic variables at the network level, allowing for real-time monitoring of region-wide emissions and optimal allocation of travel demand to existing networks, reducing urban congestion and associated emissions. However, empirically derived eMFD models are sparse due to historical data limitation. Leveraging a large-scale and granular traffic and emission data derived from probe vehicles, this study is the first to apply machine learning methods to predict the network wide emission rate to traffic relationship in U.S. urban areas at a large scale. The analysis framework and insights developed in this work generate data-driven eMFDs and a deeper understanding of their location dependence on network, infrastructure, land use, and vehicle characteristics, enabling transportation authorities to measure carbon emissions from urban transport of given travel demand and optimize location specific traffic management and planning decisions to mitigate network-wide emissions.

[LG-65] Automated Hardware Trojan Insertion in Industrial-Scale Designs DATE2026

链接: https://arxiv.org/abs/2511.08703
作者: Yaroslav Popryho,Debjit Pal,Inna Partin-Vaisband
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted in DATE 2026

点击查看摘要

Abstract:Industrial Systems-on-Chips (SoCs) often comprise hundreds of thousands to millions of nets and millions to tens of millions of connectivity edges, making empirical evaluation of hardware-Trojan (HT) detectors on realistic designs both necessary and difficult. Public benchmarks remain significantly smaller and hand-crafted, while releasing truly malicious RTL raises ethical and operational risks. This work presents an automated and scalable methodology for generating HT-like patterns in industry-scale netlists whose purpose is to stress-test detection tools without altering user-visible functionality. The pipeline (i) parses large gate-level designs into connectivity graphs, (ii) explores rare regions using SCOAP testability metrics, and (iii) applies parameterized, function-preserving graph transformations to synthesize trigger-payload pairs that mimic the statistical footprint of stealthy HTs. When evaluated on the benchmarks generated in this work, representative state-of-the-art graph-learning models fail to detect Trojans. The framework closes the evaluation gap between academic circuits and modern SoCs by providing reproducible challenge instances that advance security research without sharing step-by-step attack instructions.

[LG-66] PEGNet: A Physics-Embedded Graph Network for Long-Term Stable Multiphysics Simulation

链接: https://arxiv.org/abs/2511.08697
作者: Can Yang,Zhenzhong Wang,Junyuan Liu,Yunpeng Gong,Min Jiang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Accurate and efficient simulations of physical phenomena governed by partial differential equations (PDEs) are important for scientific and engineering progress. While traditional numerical solvers are powerful, they are often computationally expensive. Recently, data-driven methods have emerged as alternatives, but they frequently suffer from error accumulation and limited physical consistency, especially in multiphysics and complex geometries. To address these challenges, we propose PEGNet, a Physics-Embedded Graph Network that incorporates PDE-guided message passing to redesign the graph neural network architecture. By embedding key PDE dynamics like convection, viscosity, and diffusion into distinct message functions, the model naturally integrates physical constraints into its forward propagation, producing more stable and physically consistent solutions. Additionally, a hierarchical architecture is employed to capture multi-scale features, and physical regularization is integrated into the loss function to further enforce adherence to governing physics. We evaluated PEGNet on benchmarks, including custom datasets for respiratory airflow and drug delivery, showing significant improvements in long-term prediction accuracy and physical consistency over existing methods. Our code is available at this https URL.

[LG-67] Practical and Performant Enhancements for Maximization of Algebraic Connectivity ICRA2026

链接: https://arxiv.org/abs/2511.08694
作者: Leonard Jung,Alan Papalia,Kevin Doherty,Michael Everett
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Submitted to ICRA 2026

点击查看摘要

Abstract:Long-term state estimation over graphs remains challenging as current graph estimation methods scale poorly on large, long-term graphs. To address this, our work advances a current state-of-the-art graph sparsification algorithm, maximizing algebraic connectivity (MAC). MAC is a sparsification method that preserves estimation performance by maximizing the algebraic connectivity, a spectral graph property that is directly connected to the estimation error. Unfortunately, MAC remains computationally prohibitive for online use and requires users to manually pre-specify a connectivity-preserving edge set. Our contributions close these gaps along three complementary fronts: we develop a specialized solver for algebraic connectivity that yields an average 2x runtime speedup; we investigate advanced step size strategies for MAC’s optimization procedure to enhance both convergence speed and solution quality; and we propose automatic schemes that guarantee graph connectivity without requiring manual specification of edges. Together, these contributions make MAC more scalable, reliable, and suitable for real-time estimation applications.

[LG-68] abPFN-2.5: Advancing the State of the Art in Tabular Foundation Models

链接: https://arxiv.org/abs/2511.08667
作者: Léo Grinsztajn,Klemens Flöge,Oscar Key,Felix Birkel,Philipp Jund,Brendan Roof,Benjamin Jäger,Dominik Safaric,Simone Alessi,Adrian Hayler,Mihir Manium,Rosen Yu,Felix Jablonski,Shi Bin Hoo,Anurag Garg,Jake Robertson,Magnus Bühler,Vladyslav Moroshan,Lennart Purucker,Clara Cornu,Lilly Charlotte Wehrhahn,Alessandro Bonetto,Bernhard Schölkopf,Sauraj Gambhir,Noah Hollmann,Frank Hutter
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The first tabular foundation model, TabPFN, and its successor TabPFNv2 have impacted tabular AI substantially, with dozens of methods building on it and hundreds of applications across different use cases. This report introduces TabPFN-2.5, the next generation of our tabular foundation model, built for datasets with up to 50,000 data points and 2,000 features, a 20x increase in data cells compared to TabPFNv2. TabPFN-2.5 is now the leading method for the industry standard benchmark TabArena (which contains datasets with up to 100,000 training data points), substantially outperforming tuned tree-based models and matching the accuracy of AutoGluon 1.4, a complex four-hour tuned ensemble that even includes the previous TabPFNv2. Remarkably, default TabPFN-2.5 has a 100% win rate against default XGBoost on small to medium-sized classification datasets (=10,000 data points, 500 features) and a 87% win rate on larger datasets up to 100K samples and 2K features (85% for regression). For production use cases, we introduce a new distillation engine that converts TabPFN-2.5 into a compact MLP or tree ensemble, preserving most of its accuracy while delivering orders-of-magnitude lower latency and plug-and-play deployment. This new release will immediately strengthen the performance of the many applications and methods already built on the TabPFN ecosystem.

[LG-69] A Lightweight CNN-Attention-BiLSTM Architecture for Multi-Class Arrhythmia Classification on Standard and Wearable ECGs

链接: https://arxiv.org/abs/2511.08650
作者: Vamsikrishna Thota,Hardik Prajapati,Yuvraj Joshi,Shubhangi Rathi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Early and accurate detection of cardiac arrhythmias is vital for timely diagnosis and intervention. We propose a lightweight deep learning model combining 1D Convolutional Neural Networks (CNN), attention mechanisms, and Bidirectional Long Short-Term Memory (BiLSTM) for classifying arrhythmias from both 12-lead and single-lead ECGs. Evaluated on the CPSC 2018 dataset, the model addresses class imbalance using a class-weighted loss and demonstrates superior accuracy and F1- scores over baseline models. With only 0.945 million parameters, our model is well-suited for real-time deployment in wearable health monitoring systems.

[LG-70] Learning based Modelling of Throttleable Engine Dynamics for Lunar Landing Mission

链接: https://arxiv.org/abs/2511.08612
作者: Suraj Kumar,Aditya Rallapalli,Bharat Kumar GVP
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 5 pages, 9 figures, Global Space Exploration Conference 2025

点击查看摘要

Abstract:Typical lunar landing missions involve multiple phases of braking to achieve soft-landing. The propulsion system configuration for these missions consists of throttleable engines. This configuration involves complex interconnected hydraulic, mechanical, and pneumatic components each exhibiting non-linear dynamic characteristics. Accurate modelling of the propulsion dynamics is essential for analyzing closed-loop guidance and control schemes during descent. This paper presents a learning-based system identification approach for modelling of throttleable engine dynamics using data obtained from high-fidelity propulsion model. The developed model is validated with experimental results and used for closed-loop guidance and control simulations.

[LG-71] MoE-GraphSAGE-Based Integrated Evaluation of Transient Rotor Angle and Voltage Stability in Power Systems

链接: https://arxiv.org/abs/2511.08610
作者: Kunyu Zhang,Guang Yang,Fashun Shi,Shaoying He,Yuchi Zhang
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The large-scale integration of renewable energy and power electronic devices has increased the complexity of power system stability, making transient stability assessment more challenging. Conventional methods are limited in both accuracy and computational efficiency. To address these challenges, this paper proposes MoE-GraphSAGE, a graph neural network framework based on the MoE for unified TAS and TVS assessment. The framework leverages GraphSAGE to capture the power grid’s spatiotemporal topological features and employs multi-expert networks with a gating mechanism to model distinct instability modes jointly. Experimental results on the IEEE 39-bus system demonstrate that MoE-GraphSAGE achieves superior accuracy and efficiency, offering an effective solution for online multi-task transient stability assessment in complex power systems.

[LG-72] Distributional Shrinkage I: Universal Denoisers in Multi-Dimensions

链接: https://arxiv.org/abs/2511.09500
作者: Tengyuan Liang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 26 pages, 5 figures

点击查看摘要

Abstract:We revisit the problem of denoising from noisy measurements where only the noise level is known, not the noise distribution. In multi-dimensions, independent noise Z corrupts the signal X , resulting in the noisy measurement Y = X + \sigma Z , where \sigma \in (0, 1) is a known noise level. Our goal is to recover the underlying signal distribution P_X from denoising P_Y . We propose and analyze universal denoisers that are agnostic to a wide range of signal and noise distributions. Our distributional denoisers offer order-of-magnitude improvements over the Bayes-optimal denoiser derived from Tweedie’s formula, if the focus is on the entire distribution P_X rather than on individual realizations of X . Our denoisers shrink P_Y toward P_X optimally, achieving O(\sigma^4) and O(\sigma^6) accuracy in matching generalized moments and density functions. Inspired by optimal transport theory, the proposed denoisers are optimal in approximating the Monge-Ampère equation with higher-order accuracy, and can be implemented efficiently via score matching. Let q represent the density of P_Y ; for optimal distributional denoising, we recommend replacing the Bayes-optimal denoiser, [ \mathbfT^*(y) = y + \sigma^2 \nabla \log q(y), ] with denoisers exhibiting less aggressive distributional shrinkage, [ \mathbfT_1(y) = y + \frac\sigma^22 \nabla \log q(y), ] [ \mathbfT_2(y) = y + \frac\sigma^22 \nabla \log q(y) - \frac\sigma^48 \nabla \left( \frac12 | \nabla \log q(y) |^2 + \nabla \cdot \nabla \log q(y) \right) . ] Comments: 26 pages, 5 figures Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST) Cite as: arXiv:2511.09500 [stat.ML] (or arXiv:2511.09500v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2511.09500 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-73] Branching Flows: Discrete Continuous and Manifold Flow Matching with Splits and Deletions

链接: https://arxiv.org/abs/2511.09465
作者: Hedwig Nora Nordlinder(1),Lukas Billera(1),Jack Collier Ryder(1),Anton Oresten(1),Aron Stålmarck(1),Theodor Mosetti Björk(1),Ben Murrell(1) ((1) Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 30 pages, 10 figures

点击查看摘要

Abstract:Diffusion and flow matching approaches to generative modeling have shown promise in domains where the state space is continuous, such as image generation or protein folding design, and discrete, exemplified by diffusion large language models. They offer a natural fit when the number of elements in a state is fixed in advance (e.g. images), but require ad hoc solutions when, for example, the length of a response from a large language model, or the number of amino acids in a protein chain is not known a priori. Here we propose Branching Flows, a generative modeling framework that, like diffusion and flow matching approaches, transports a simple distribution to the data distribution. But in Branching Flows, the elements in the state evolve over a forest of binary trees, branching and dying stochastically with rates that are learned by the model. This allows the model to control, during generation, the number of elements in the sequence. We also show that Branching Flows can compose with any flow matching base process on discrete sets, continuous Euclidean spaces, smooth manifolds, and `multimodal’ product spaces that mix these components. We demonstrate this in three domains: small molecule generation (multimodal), antibody sequence generation (discrete), and protein backbone generation (multimodal), and show that Branching Flows is a capable distribution learner with a stable learning objective, and that it enables new capabilities. Comments: 30 pages, 10 figures Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG) Cite as: arXiv:2511.09465 [stat.ML] (or arXiv:2511.09465v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2511.09465 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-74] Adversarially and Distributionally Robust Virtual Energy Storag e Systems via the Scenario Approach

链接: https://arxiv.org/abs/2511.09427
作者: Georgios Pantazis,Nicola Mignoni,Raffaele Carli,Mariagrazia Dotoli,Sergio Grammatico
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:We propose an optimization model where a parking lot manager (PLM) can aggregate parked EV batteries to provide virtual energy storage services that are provably robust under uncertain EV departures and state-of-charge caps. Our formulation yields a data-driven convex optimization problem where a prosumer community agrees on a contract with the PLM for the provision of storage services over a finite horizon. Leveraging recent results in the scenario approach, we certify out-of-sample constraint safety. Furthermore, we enable a tunable profit-risk trade-off through scenario relaxation and extend our model to account for robustness to adversarial perturbations and distributional shifts over Wasserstein-based ambiguity sets. All the approaches are accompanied by tight finite-sample certificates. Numerical studies demonstrate the out-of-sample and out-of-distribution constraint satisfaction of our proposed model compared to the developed theoretical guarantees, showing their effectiveness and potential in robust and efficient virtual energy services.

[LG-75] Robust Least-Squares Optimization for Data-Driven Predictive Control: A Geometric Approach

链接: https://arxiv.org/abs/2511.09242
作者: Shreyas Bharadwaj,Bamdev Mishra,Cyrus Mostajeran,Alberto Padoan,Jeremy Coulson,Ravi N. Banavar
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: Submitted to the 8th Annual Learning for Dynamics Control Conference June 17-19 2026, USC, Los Angeles, USA

点击查看摘要

Abstract:The paper studies a geometrically robust least-squares problem that extends classical and norm-based robust formulations. Rather than minimizing residual error for fixed or perturbed data, we interpret least-squares as enforcing approximate subspace inclusion between measured and true data spaces. The uncertainty in this geometric relation is modeled as a metric ball on the Grassmannian manifold, leading to a min-max problem over Euclidean and manifold variables. The inner maximization admits a closed-form solution, enabling an efficient algorithm with a transparent geometric interpretation. Applied to robust finite-horizon linear-quadratic tracking in data-enabled predictive control, the method improves upon existing robust least-squares formulations, achieving stronger robustness and favorable scaling under small uncertainty.

[LG-76] Resource-Efficient Variational Quantum Classifier

链接: https://arxiv.org/abs/2511.09204
作者: Petr Ptáček,Paulina Lewandowska,Ryszard Kukulski
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 20 pages, 9 figures, 2 tables

点击查看摘要

Abstract:Quantum computing promises a revolution in information processing, with significant potential for machine learning and classification tasks. However, achieving this potential requires overcoming several fundamental challenges. One key limitation arises at the prediction stage, where the intrinsic randomness of quantum model outputs necessitates repeated executions, resulting in substantial overhead. To overcome this, we propose a novel measurement strategy for a variational quantum classifier that allows us to define the unambiguous quantum classifier. This strategy achieves near-deterministic predictions while maintaining competitive classification accuracy in noisy environments, all with significantly fewer quantum circuit executions. Although this approach entails a slight reduction in performance, it represents a favorable trade-off for improved resource efficiency. We further validate our theoretical model with supporting experimental results.

[LG-77] Scalable Mixed-Integer Optimization with Neural Constraints via Dual Decomposition

链接: https://arxiv.org/abs/2511.09186
作者: Shuli Zeng,Sijia Zhang,Feng Wu,Shaojie Tang,Xiang-Yang Li
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Embedding deep neural networks (NNs) into mixed-integer programs (MIPs) is attractive for decision making with learned constraints, yet state-of-the-art monolithic linearisations blow up in size and quickly become intractable. In this paper, we introduce a novel dual-decomposition framework that relaxes the single coupling equality u=x with an augmented Lagrange multiplier and splits the problem into a vanilla MIP and a constrained NN block. Each part is tackled by the solver that suits it best-branch and cut for the MIP subproblem, first-order optimisation for the NN subproblem-so the model remains modular, the number of integer variables never grows with network depth, and the per-iteration cost scales only linearly with the NN size. On the public \textscSurrogateLIB benchmark, our method proves \textbfscalable, \textbfmodular, and \textbfadaptable: it runs (120\times) faster than an exact Big-M formulation on the largest test case; the NN sub-solver can be swapped from a log-barrier interior step to a projected-gradient routine with no code changes and identical objective value; and swapping the MLP for an LSTM backbone still completes the full optimisation in 47s without any bespoke adaptation.

[LG-78] Learning to Validate Generative Models: a Goodness-of-Fit Approach

链接: https://arxiv.org/abs/2511.09118
作者: Pietro Cappelli,Gaia Grosso,Marco Letizia,Humberto Reyes-González,Marco Zanetti
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); High Energy Physics - Phenomenology (hep-ph)
*备注: 16 pages, 6 figures

点击查看摘要

Abstract:Generative models are increasingly central to scientific workflows, yet their systematic use and interpretation require a proper understanding of their limitations through rigorous validation. Classic approaches struggle with scalability, statistical power, or interpretability when applied to high-dimensional data, making it difficult to certify the reliability of these models in realistic, high-dimensional scientific settings. Here, we propose the use of the New Physics Learning Machine (NPLM), a learning based approach to goodness-of-fit testing inspired by the Neyman-Pearson construction, to test generative networks trained on high-dimensional scientific data. We demonstrate the performance of NPLM for validation in two benchmark cases: generative models trained on mixtures of Gaussian models with increasing dimensionality, and a public end-to-end generator for the Large Hadron Collider called FlashSim, trained on jet data, typical in the field of high-energy physics. We demonstrate that the NPLM can serve as a powerful validation method while also providing a means to diagnose sub-optimally modeled regions of the data.

[LG-79] VAE-Based Synthetic EMG Generation with Mix-Consistency Loss for Recognizing Unseen Motion Combinations

链接: https://arxiv.org/abs/2511.09060
作者: Itsuki Yazawa,Akira Furui
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 6 pages, 5 figures, accepted at IEEE SII 2026

点击查看摘要

Abstract:Electromyogram (EMG)-based motion classification using machine learning has been widely employed in applications such as prosthesis control. While previous studies have explored generating synthetic patterns of combined motions to reduce training data requirements, these methods assume that combined motions can be represented as linear combinations of basic motions. However, this assumption often fails due to complex neuromuscular phenomena such as muscle co-contraction, resulting in low-fidelity synthetic signals and degraded classification performance. To address this limitation, we propose a novel method that learns to synthesize combined motion patterns in a structured latent space. Specifically, we employ a variational autoencoder (VAE) to encode EMG signals into a low-dimensional representation and introduce a mixconsistency loss that structures the latent space such that combined motions are embedded between their constituent basic motions. Synthetic patterns are then generated within this structured latent space and used to train classifiers for recognizing unseen combined motions. We validated our approach through upper-limb motion classification experiments with eight healthy participants. The results demonstrate that our method outperforms input-space synthesis approaches, achieving approximately 30% improvement in accuracy.

[LG-80] Convergence and Stability Analysis of Self-Consuming Generative Models with Heterogeneous Human Curation

链接: https://arxiv.org/abs/2511.09002
作者: Hongru Zhao,Jinwen Fu,Tuan Pham
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 41 pages, 2 tables

点击查看摘要

Abstract:Self-consuming generative models have received significant attention over the last few years. In this paper, we study a self-consuming generative model with heterogeneous preferences that is a generalization of the model in Ferbach et al. (2024). The model is retrained round by round using real data and its previous-round synthetic outputs. The asymptotic behavior of the retraining dynamics is investigated across four regimes using different techniques including the nonlinear Perron–Frobenius theory. Our analyses improve upon that of Ferbach et al. (2024) and provide convergence results in settings where the well-known Banach contraction mapping arguments do not apply. Stability and non-stability results regarding the retraining dynamics are also given.

[LG-81] Generalisable prediction model of surgical case duration: multicentre development and temporal validation

链接: https://arxiv.org/abs/2511.08994
作者: Daijiro Kabata,Mari Ito,Tokito Koga,Kazuma Yunoki
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: 20 pages, 2 tables, and 5 figures

点击查看摘要

Abstract:Background: Accurate prediction of surgical case duration underpins operating room (OR) scheduling, yet existing models often depend on site- or surgeon-specific inputs and rarely undergo external validation, limiting generalisability. Methods: We undertook a retrospective multicentre study using routinely collected perioperative data from two general hospitals in Japan (development: 1 January 2021-31 December 2023; temporal test: 1 January-31 December 2024). Elective weekday procedures with American Society of Anesthesiologists (ASA) Physical Status 1-4 were included. Pre-specified preoperative predictors comprised surgical context (year, month, weekday, scheduled duration, general anaesthesia indicator, body position) and patient factors (sex, age, body mass index, allergy, infection, comorbidity, ASA). Missing data were addressed by multiple imputation by chained equations. Four learners (elastic-net, generalised additive models, random forest, gradient-boosted trees) were tuned within internal-external cross-validation (IECV; leave-one-cluster-out by centre-year) and combined by stacked generalisation to predict log-transformed duration. Results: We analysed 63,206 procedures (development 45,647; temporal test 17,559). Cluster-specific and pooled errors and calibrations from IECV are provided with consistent performance across centres and years. In the 2024 temporal test cohort, calibration was good (intercept 0.423, 95%CI 0.372 to 0.474; slope 0.921, 95%CI 0.911 to 0.932). Conclusions: A stacked machine-learning model using only widely available preoperative variables achieved accurate, well-calibrated predictions in temporal external validation, supporting transportability across sites and over time. Such general-purpose tools may improve OR scheduling without relying on idiosyncratic inputs. Comments: 20 pages, 2 tables, and 5 figures Subjects: Applications (stat.AP); Machine Learning (cs.LG) MSC classes: 62P10, 68T05 ACMclasses: J.3; I.2.6 Cite as: arXiv:2511.08994 [stat.AP] (or arXiv:2511.08994v1 [stat.AP] for this version) https://doi.org/10.48550/arXiv.2511.08994 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Daijiro Kabata [view email] [v1] Wed, 12 Nov 2025 05:30:51 UTC (1,925 KB)

[LG-82] Robust Sampling for Active Statistical Inference NEURIPS2025

链接: https://arxiv.org/abs/2511.08991
作者: Puheng Li,Tijana Zrnic,Emmanuel Candès
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: NeurIPS 2025

点击查看摘要

Abstract:Active statistical inference is a new method for inference with AI-assisted data collection. Given a budget on the number of labeled data points that can be collected and assuming access to an AI predictive model, the basic idea is to improve estimation accuracy by prioritizing the collection of labels where the model is most uncertain. The drawback, however, is that inaccurate uncertainty estimates can make active sampling produce highly noisy results, potentially worse than those from naive uniform sampling. In this work, we present robust sampling strategies for active statistical inference. Robust sampling ensures that the resulting estimator is never worse than the estimator using uniform sampling. Furthermore, with reliable uncertainty estimates, the estimator usually outperforms standard active inference. This is achieved by optimally interpolating between uniform and active sampling, depending on the quality of the uncertainty scores, and by using ideas from robust optimization. We demonstrate the utility of the method on a series of real datasets from computational social science and survey research.

[LG-83] DRL-Based Beam Positioning for LEO Satellite Constellations with Weighted Least Squares

链接: https://arxiv.org/abs/2511.08852
作者: Po-Heng Chou,Chiapin Wang,Kuan-Hao Chen,Wei-Chen Hsiao
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 6 pages, 2 figures, 1 table, and submitted to IEEE ICC 2026

点击查看摘要

Abstract:In this paper, we propose a reinforcement learning based beam weighting framework that couples a policy network with an augmented weighted least squares (WLS) estimator for accurate and low-complexity positioning in multi-beam LEO constellations. Unlike conventional geometry or CSI-dependent approaches, the policy learns directly from uplink pilot responses and geometry features, enabling robust localization without explicit CSI estimation. An augmented WLS jointly estimates position and receiver clock bias, improving numerical stability under dynamic beam geometry. Across representative scenarios, the proposed method reduces the mean positioning error by 99.3% compared with the geometry-based baseline, achieving 0.395 m RMSE with near real-time inference.

[LG-84] Effects of label noise on the classification of outlier observations

链接: https://arxiv.org/abs/2511.08808
作者: Matheus Vinícius Barreto de Farias,Mario de Castro
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 10 pages, 8 figures

点击查看摘要

Abstract:This study investigates the impact of adding noise to the training set classes in classification tasks using the BCOPS algorithm (Balanced and Conformal Optimized Prediction Sets), proposed by Guan Tibshirani (2022). The BCOPS algorithm is an application of conformal prediction combined with a machine learning method to construct prediction sets such that the probability of the true class being included in the prediction set for a test observation meets a specified coverage guarantee. An observation is considered an outlier if its true class is not present in the training set. The study employs both synthetic and real datasets and conducts experiments to evaluate the prediction abstention rate for outlier observations and the model’s robustness in this previously untested scenario. The results indicate that the addition of noise, even in small amounts, can have a significant effect on model performance.

[LG-85] he Probably Approximately Correct Learning Model in Computational Learning Theory

链接: https://arxiv.org/abs/2511.08791
作者: Rocco A. Servedio
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 45 pages, 4 figures

点击查看摘要

Abstract:This survey paper gives an overview of various known results on learning classes of Boolean functions in Valiant’s Probably Approximately Correct (PAC) learning model and its commonly studied variants.

[LG-86] WATSON-Net: Vetting Validation and Analysis of Transits from Space Observations with Neural Networks

链接: https://arxiv.org/abs/2511.08768
作者: M. Dévora-Pajares,F.J. Pozuelos,J.C. Suárez,M. González-Penedo,C. Dafonte
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: Submitted, Open to comments

点击查看摘要

Abstract:Context. As the number of detected transiting exoplanet candidates continues to grow, the need for robust and scalable automated tools to prioritize or validate them has become increasingly critical. Among the most promising solutions, deep learning models offer the ability to interpret complex diagnostic metrics traditionally used in the vetting process. Aims. In this work, we present WATSON-Net, a new open-source neural network classifier and data preparation package designed to compete with current state-of-the-art tools for vetting and validation of transiting exoplanet signals from space-based missions. Methods. Trained on Kepler Q1-Q17 DR25 data using 10-fold cross-validation, WATSON-Net produces ten independent models, each evaluated on dedicated validation and test sets. The ten models are calibrated and prepared to be extensible for TESS data by standardizing the input pipeline, allowing for performance assessment across different space missions. Results. For Kepler targets, WATSON-Net achieves a recall-at-precision of 0.99 (R@P0.99) of 0.903, ranking second, with only the ExoMiner network performing better (R@P0.99 = 0.936). For TESS signals, WATSON-Net emerges as the best-performing non-fine-tuned machine learning classifier, achieving a precision of 0.93 and a recall of 0.76 on a test set comprising confirmed planets and false positives. Both the model and its data preparation tools are publicly available in the dearwatson Python package, fully open-source and integrated into the vetting engine of the SHERLOCK pipeline.

[LG-87] A Deep Learning-Based Method for Fully Coupled Non-Markovian FBSDEs with Applications

链接: https://arxiv.org/abs/2511.08735
作者: Hasib Uddin Molla,Ankit Banarjee,Matthew Backhouse,Jinniao Qiu
类目: Mathematical Finance (q-fin.MF); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we extend deep learning-based numerical methods to fully coupled forward-backward stochastic differential equations (FBSDEs) within a non-Markovian framework. Error estimates and convergence are provided. In contrast to the existing literature, our approach not only analyzes the non-Markovian framework but also addresses fully coupled settings, in which both the drift and diffusion coefficients of the forward process may be random and depend on the backward components Y and Z . Furthermore, we illustrate the practical applicability of our framework by addressing utility maximization problems under rough volatility, which are solved numerically with the proposed deep learning-based methods.

[LG-88] Practical considerations when designing an online learning algorithm for an app-based mHealth intervention

链接: https://arxiv.org/abs/2511.08719
作者: Rachel T Gonzalez,Madeline R Abbott,Brahmajee Nallamothu,Scott Hummel,Michael Dorsch,Walter Dempsey
类目: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:The ubiquitous nature of mobile health (mHealth) technology has expanded opportunities for the integration of reinforcement learning into traditional clinical trial designs, allowing researchers to learn individualized treatment policies during the study. LowSalt4Life 2 (LS4L2) is a recent trial aimed at reducing sodium intake among hypertensive individuals through an app-based intervention. A reinforcement learning algorithm, which was deployed in one of the trial arms, was designed to send reminder notifications to promote app engagement in contexts where the notification would be effective, i.e., when a participant is likely to open the app in the next 30-minute and not when prior data suggested reduced effectiveness. Such an algorithm can improve app-based mHealth interventions by reducing participant burden and more effectively promoting behavior change. We encountered various challenges during the implementation of the learning algorithm, which we present as a template to solving challenges in future trials that deploy reinforcement learning algorithms. We provide template solutions based on LS4L2 for solving the key challenges of (i) defining a relevant reward, (ii) determining a meaningful timescale for optimization, (iii) specifying a robust statistical model that allows for automation, (iv) balancing model flexibility with computational cost, and (v) addressing missing values in gradually collected data.

[LG-89] Optimal Control of the Future via Prospective Forag ing

链接: https://arxiv.org/abs/2511.08717
作者: Yuxin Bai,Aranyak Acharyya,Ashwin De Silva,Zeyu Shen,James Hassett,Joshua T. Vogelstein
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Optimal control of the future is the next frontier for AI. Current approaches to this problem are typically rooted in either reinforcement learning or online learning. While powerful, these frameworks for learning are mathematically distinct from Probably Approximately Correct (PAC) learning, which has been the workhorse for the recent technological achievements in AI. We therefore build on the prior work of prospective learning, an extension of PAC learning (without control) in non-stationary environments (De Silva et al., 2023; Silva et al., 2024; Bai et al., 2026). Here, we further extend the PAC learning framework to address learning and control in non-stationary environments. Using this framework, called ‘‘Prospective Control’’, we prove that under certain fairly general assumptions, empirical risk minimization (ERM) asymptotically achieves the Bayes optimal policy. We then consider a specific instance of prospective control, foraging, which is a canonical task for any mobile agent, be it natural or artificial. We illustrate that existing reinforcement learning algorithms fail to learn in these non-stationary environments, and even with modifications, they are orders of magnitude less efficient than our prospective foraging agents. Code is available at: this https URL.

[LG-90] “It Looks All the Same to Me”: Cross-index Training for Long-term Financial Series Prediction

链接: https://arxiv.org/abs/2511.08658
作者: Stanislav Selitskiy
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We investigate a number of Artificial Neural Network architectures (well-known and more ``exotic’‘) in application to the long-term financial time-series forecasts of indexes on different global markets. The particular area of interest of this research is to examine the correlation of these indexes’ behaviour in terms of Machine Learning algorithms cross-training. Would training an algorithm on an index from one global market produce similar or even better accuracy when such a model is applied for predicting another index from a different market? The demonstrated predominately positive answer to this question is another argument in favour of the long-debated Efficient Market Hypothesis of Eugene Fama.

[LG-91] Compact Artificial Neural Network Models for Predicting Protein Residue - RNA Base Binding

链接: https://arxiv.org/abs/2511.08648
作者: Stanislav Selitskiy
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Artificial Neural Network (ANN) models have demonstrated success in various domains, including general text and image generation, drug discovery, and protein-RNA (ribonucleic acid) binding tasks. However, these models typically demand substantial computational resources, time, and data for effective training. Given that such extensive resources are often inaccessible to many researchers and that life sciences data sets are frequently limited, we investigated whether small ANN models could achieve acceptable accuracy in protein-RNA prediction. We experimented with shallow feed-forward ANNs comprising two hidden layers and various non-linearities. These models did not utilize explicit structural information; instead, a sliding window approach was employed to implicitly consider the context of neighboring residues and bases. We explored different training techniques to address the issue of highly unbalanced data. Among the seven most popular non-linearities for feed-forward ANNs, only three: Rectified Linear Unit (ReLU), Gated Linear Unit (GLU), and Hyperbolic Tangent (Tanh) yielded converging models. Common re-balancing techniques, such as under- and over-sampling of training sets, proved ineffective, whereas increasing the volume of training data and using model ensembles significantly improved performance. The optimal context window size, balancing both false negative and false positive errors, was found to be approximately 30 residues and bases. Our findings indicate that high-accuracy protein-RNA binding prediction is achievable using computing hardware accessible to most educational and research institutions.

[LG-92] Pattern Recognition of Scrap Plastic Misclassification in Global Trade Data

链接: https://arxiv.org/abs/2511.08638
作者: Muhammad Sukri Bin Ramli
类目: General Economics (econ.GN); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose an interpretable machine learning framework to help identify trade data discrepancies that are challenging to detect with traditional methods. Our system analyzes trade data to find a novel inverse price-volume signature, a pattern where reported volumes increase as average unit prices decrease. The model achieves 0.9375 accuracy and was validated by comparing large-scale UN data with detailed firm-level data, confirming that the risk signatures are consistent. This scalable tool provides customs authorities with a transparent, data-driven method to shift from conventional to priority-based inspection protocols, translating complex data into actionable intelligence to support international environmental policies. Subjects: General Economics (econ.GN); Machine Learning (cs.LG) Cite as: arXiv:2511.08638 [econ.GN] (or arXiv:2511.08638v1 [econ.GN] for this version) https://doi.org/10.48550/arXiv.2511.08638 Focus to learn more arXiv-issued DOI via DataCite

[LG-93] Explainable Federated Learning for U.S. State-Level Financial Distress Modeling

链接: https://arxiv.org/abs/2511.08588
作者: Lorenzo Carta,Fernando Spadea,Oshani Seneviratne
类目: atistical Finance (q-fin.ST); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present the first application of federated learning (FL) to the U.S. National Financial Capability Study, introducing an interpretable framework for predicting consumer financial distress across all 50 states and the District of Columbia without centralizing sensitive data. Our cross-silo FL setup treats each state as a distinct data silo, simulating real-world governance in nationwide financial systems. Unlike prior work, our approach integrates two complementary explainable AI techniques to identify both global (nationwide) and local (state-specific) predictors of financial hardship, such as contact from debt collection agencies. We develop a machine learning model specifically suited for highly categorical, imbalanced survey data. This work delivers a scalable, regulation-compliant blueprint for early warning systems in finance, demonstrating how FL can power socially responsible AI applications in consumer credit risk and financial inclusion.

[LG-94] Extrapolation to infinite model space of no-core shell model calculations using machine learning

链接: https://arxiv.org/abs/2511.05061
作者: Aleksandr Mazur,Roman Sharypov,Andrey Shirokov
类目: Nuclear Theory (nucl-th); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注: 9 pages, 3 figures

点击查看摘要

Abstract:An ensemble of neural networks is employed to extrapolate no-core shell model (NCSM) results to infinite model space for light nuclei. We present a review of our neural network extrapolations of the NCSM results obtained with the Daejeon16 NN interaction in different model spaces and with different values of the NCSM basis parameter \hbar\Omega for energies of nuclear states and root-mean-square (rms) radii of proton, neutron and matter distributions in light nuclei. The method yields convergent predictions with quantifiable uncertainties. Ground-state energies for ^6 Li, ^6 He, and the unbound ^6 Be, as well as the excited (3^+,0) and (0^+,1) states of ^6 Li, are obtained within a few hundred keV of experiment. The extrapolated radii of bound states converge well. In contrast, radii of unbound states in ^6 Be and ^6 Li do not stabilize.

信息检索

[IR-0] Practical RAG Evaluation: A Rarity-Aware Set-Based Metric and Cost-Latency-Quality Trade-offs

链接: https://arxiv.org/abs/2511.09545
作者: Etienne Dallaire
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:This paper addresses the guessing game in building production RAG. Classical rank-centric IR metrics (nDCG/MAP/MRR) are a poor fit for RAG, where LLMs consume a set of passages rather than a browsed list; position discounts and prevalence-blind aggregation miss what matters: whether the prompt at cutoff K contains the decisive evidence. Second, there is no standardized, reproducible way to build and audit golden sets. Third, leaderboards exist but lack end-to-end, on-corpus benchmarking that reflects production trade-offs. Fourth, how state-of-the-art embedding models handle proper-name identity signals and conversational noise remains opaque. To address these, we contribute: (1) RA-nWG@K, a rarity-aware, per-query-normalized set score, and operational ceilings via the pool-restricted oracle ceiling (PROC) and the percentage of PROC (%PROC) to separate retrieval from ordering headroom within a Cost-Latency-Quality (CLQ) lens; (2) rag-gs (MIT), a lean golden-set pipeline with Plackett-Luce listwise refinement whose iterative updates outperform single-shot LLM ranking; (3) a comprehensive benchmark on a production RAG (scientific-papers corpus) spanning dense retrieval, hybrid dense+BM25, embedding models and dimensions, cross-encoder rerankers, ANN (HNSW), and quantization; and (4) targeted diagnostics that quantify proper-name identity signal and conversational-noise sensitivity via identity-destroying and formatting ablations. Together, these components provide practitioner Pareto guidance and auditable guardrails to support reproducible, budget/SLA-aware decisions.

[IR-1] Sim4IA-Bench: A User Simulation Benchmark Suite for Next Query and Utterance Prediction

链接: https://arxiv.org/abs/2511.09329
作者: Andreas Konstantin Kruff,Christin Katharina Kreutz,Timo Breuer,Philipp Schaer,Krisztian Balog
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Validating user simulation is a difficult task due to the lack of established measures and benchmarks, which makes it challenging to assess whether a simulator accurately reflects real user behavior. As part of the Sim4IA Micro-Shared Task at the Sim4IA Workshop, SIGIR 2025, we present Sim4IA-Bench, a simulation benchmark suit for the prediction of the next queries and utterances, the first of its kind in the IR com- munity. Our dataset as part of the suite comprises 160 real-world search sessions from the CORE search engine. For 70 of these sessions, up to 62 simulator runs are available, divided into Task A and Task B, in which different approaches predicted users next search queries or utterances. Sim4IA-Bench provides a basis for evaluating and comparing user simu- lation approaches and for developing new measures of simulator validity. Although modest in size, the suite represents the first publicly available benchmark that links real search sessions with simulated next-query pre- dictions. In addition to serving as a testbed for next query prediction, it also enables exploratory studies on query reformulation behavior, intent drift, and interaction-aware retrieval evaluation. We also introduce a new measure for evaluating next-query predictions in this task. By making the suite publicly available, we aim to promote reproducible research and stimulate further work on realistic and explainable user simulation for information access: this https URL.

[IR-2] NeuroCLIP: Brain-Inspired Prompt Tuning for EEG-to-Image Multimodal Contrastive Learning

链接: https://arxiv.org/abs/2511.09250
作者: Jiyuan Wang,Li Zhang,Haipeng Lin,Qile Liu,Gan Huang,Ziyu Li,Zhen Liang,Xia Wu
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recent advances in brain-inspired artificial intelligence have sought to align neural signals with visual semantics using multimodal models such as CLIP. However, existing methods often treat CLIP as a static feature extractor, overlooking its adaptability to neural representations and the inherent physiological-symbolic gap in EEG-image alignment. To address these challenges, we present NeuroCLIP, a prompt tuning framework tailored for EEG-to-image contrastive learning. Our approach introduces three core innovations: (1) We design a dual-stream visual embedding pipeline that combines dynamic filtering and token-level fusion to generate instance-level adaptive prompts, which guide the adjustment of patch embedding tokens based on image content, thereby enabling fine-grained modulation of visual representations under neural constraints; (2) We are the first to introduce visual prompt tokens into EEG-image alignment, acting as global, modality- level prompts that work in conjunction with instance-level adjustments. These visual prompt tokens are inserted into the Transformer architecture to facilitate neural-aware adaptation and parameter optimization at a global level; (3) Inspired by neuroscientific principles of human visual encoding, we propose a refined contrastive loss that better model the semantic ambiguity and cross-modal noise present in EEG signals. On the THINGS-EEG2 dataset, NeuroCLIP achieves a Top-1 accuracy of 63.2% in zero-shot image retrieval, surpassing the previous best method by +12.3%, and demonstrates strong generalization under inter-subject conditions (+4.6% Top-1), highlighting the potential of physiology-aware prompt tuning for bridging brain signals and visual semantics.

[IR-3] Efficient Model-Agnostic Continual Learning for Next POI Recommendation ICDE2026

链接: https://arxiv.org/abs/2511.08941
作者: Chenhao Wang,Shanshan Feng,Lisi Chen,Fan Li,Shuo Shang
类目: Information Retrieval (cs.IR)
*备注: This paper was accepted by ICDE2026

点击查看摘要

Abstract:Next point-of-interest (POI) recommendation improves personalized location-based services by predicting users’ next destinations based on their historical check-ins. However, most existing methods rely on static datasets and fixed models, limiting their ability to adapt to changes in user behavior over time. To address this limitation, we explore a novel task termed continual next POI recommendation, where models dynamically adapt to evolving user interests through continual updates. This task is particularly challenging, as it requires capturing shifting user behaviors while retaining previously learned knowledge. Moreover, it is essential to ensure efficiency in update time and memory usage for real-world deployment. To this end, we propose GIRAM (Generative Key-based Interest Retrieval and Adaptive Modeling), an efficient, model-agnostic framework that integrates context-aware sustained interests with recent interests. GIRAM comprises four components: (1) an interest memory to preserve historical preferences; (2) a context-aware key encoding module for unified interest key representation; (3) a generative key-based retrieval module to identify diverse and relevant sustained interests; and (4) an adaptive interest update and fusion module to update the interest memory and balance sustained and recent interests. In particular, GIRAM can be seamlessly integrated with existing next POI recommendation models. Experiments on three real-world datasets demonstrate that GIRAM consistently outperforms state-of-the-art methods while maintaining high efficiency in both update time and memory consumption.

附件下载

点击下载今日全部论文列表