本篇博文主要内容为 2025-11-26 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。

目录

概览 (2025-11-26)

今日共更新656篇论文,其中:

  • 自然语言处理51篇(Computation and Language (cs.CL))
  • 人工智能195篇(Artificial Intelligence (cs.AI))
  • 计算机视觉219篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习195篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] Latent Collaboration in Multi-Agent Systems

【速读】: 该论文旨在解决多智能体系统(Multi-agent Systems, MAS)在依赖文本媒介进行推理与通信时所面临的效率低、信息损失及复杂度高的问题。现有方法受限于文本交互的离散性与冗余性,难以实现高效且无损的信息传递。其解决方案的关键在于提出LatentMAS框架,该框架无需额外训练即可实现基于连续潜在空间(continuous latent space)的纯潜在协作:每个智能体通过最后一层隐藏嵌入生成自回归潜在思维(auto-regressive latent thoughts),并通过共享潜在工作记忆(shared latent working memory)实现内部表征的无损保存与传递,从而在保持更高表达能力的同时显著降低计算复杂度,并在多个基准测试中展现出更高的准确率和推理效率。

链接: https://arxiv.org/abs/2511.20639
作者: Jiaru Zou,Xiyuan Yang,Ruizhong Qiu,Gaotang Li,Katherine Tieu,Pan Lu,Ke Shen,Hanghang Tong,Yejin Choi,Jingrui He,James Zou,Mengdi Wang,Ling Yang
机构: Princeton University (普林斯顿大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project: this https URL

点击查看摘要

Abstract:Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent’s internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at this https URL.
zh

[NLP-1] On Evaluating LLM Alignment by Evaluating LLM s as Judges NEURIPS2025

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)对齐人类偏好(alignment with human preferences)的评估难题,即如何在不依赖人工标注或强LLM评判者直接评估生成内容的情况下,准确衡量模型对齐程度。其核心问题是:LLMs的生成能力与评价能力之间是否存在内在关联,能否利用后者间接反映前者?解决方案的关键在于提出“生成-评价一致性”(generation-evaluation consistency, GE-consistency)这一新认知,发现LLMs在强LLM偏好基准(preference oracle)下表现出显著的生成与评价能力相关性,并据此设计了AlignEval基准——通过评估LLM作为评价者的性能来间接衡量其对齐水平,从而无需直接分析模型输出即可有效捕捉人类偏好排序结果,且在性能上优于AlpacaEval和Arena-Hard等现有自动评估方法。

链接: https://arxiv.org/abs/2511.20604
作者: Yixin Liu,Pengfei Liu,Arman Cohan
机构: Yale University (耶鲁大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: NeurIPS 2025 Camera Ready

点击查看摘要

Abstract:Alignment with human preferences is an important evaluation aspect of LLMs, requiring them to be helpful, honest, safe, and to precisely follow human instructions. Evaluating large language models’ (LLMs) alignment typically involves directly assessing their open-ended responses, requiring human annotators or strong LLM judges. Conversely, LLMs themselves have also been extensively evaluated as judges for assessing alignment. In this work, we examine the relationship between LLMs’ generation and evaluation capabilities in aligning with human preferences. To this end, we first conduct a comprehensive analysis of the generation-evaluation consistency (GE-consistency) among various LLMs, revealing a strong correlation between their generation and evaluation capabilities when evaluated by a strong LLM preference oracle. Utilizing this finding, we propose a benchmarking paradigm that measures LLM alignment with human preferences without directly evaluating their generated outputs, instead assessing LLMs in their role as evaluators. Our evaluation shows that our proposed benchmark, AlignEval, matches or surpasses widely used automatic LLM evaluation benchmarks, such as AlpacaEval and Arena-Hard, in capturing human preferences when ranking LLMs. Our study offers valuable insights into the connection between LLMs’ generation and evaluation capabilities, and introduces a benchmark that assesses alignment without directly evaluating model outputs.
zh

[NLP-2] Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward

【速读】: 该论文旨在解决统一多模态模型中“理解是否真正赋能生成”的核心问题,即识别并量化理解与生成之间的能力差距。其关键解决方案是提出UniSandbox——一个解耦的评估框架,并结合受控的合成数据集以避免数据泄露并支持细粒度分析。实验发现,该差距主要体现在推理生成和知识迁移两个维度:在推理生成方面,显式的思维链(Chain-of-Thought, CoT)可有效弥合差距,且通过自训练方法可将此能力内化为隐式推理;在知识迁移方面,CoT有助于检索新学知识,同时发现基于查询的架构本身具有潜在的类CoT特性,影响知识传递效率。该工作为未来统一架构设计和训练策略提供了初步洞见。

链接: https://arxiv.org/abs/2511.20561
作者: Yuwei Niu,Weiyang Jin,Jiaqi Liao,Chaoran Feng,Peng Jin,Bin Lin,Zongjian Li,Bin Zhu,Weihao Yu,Li Yuan
机构: Peking University (北京大学); Chongqing University (重庆大学); HKU MMLab (香港大学多媒体实验室); PengCheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve newly learned knowledge, and also discover that query-based architectures inherently exhibit latent CoT-like properties that affect this transfer. UniSandbox provides preliminary insights for designing future unified architectures and training strategies that truly bridge the gap between understanding and generation. Code and data are available at this https URL
zh

[NLP-3] From Words to Wisdom: Discourse Annotation and Baseline Models for Student Dialogue Understanding

【速读】: 该论文旨在解决教育研究中对学生对话中话语特征(discourse features)识别的自动化难题,以提升对知识建构(knowledge construction)与任务完成(task production)等教学变量的分析效率。传统人工标注方法耗时且难以扩展,限制了大规模教育数据的研究能力。其解决方案的关键在于构建一个标注的教育对话数据集,并基于预训练大语言模型(如GPT-3.5和Llama-3.1)建立基线预测模型,用于自动识别每一轮对话中的 discourse 属性,从而为教育研究提供可扩展、数据驱动的分析工具。实验表明当前主流模型在此任务上表现不佳,提示未来在教育语境下优化 NLP 模型具有重要研究价值。

链接: https://arxiv.org/abs/2511.20547
作者: Farjana Sultana Mim,Shuchin Aeron,Eric Miller,Kristen Wendell
机构: Tufts University (塔夫茨大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Identifying discourse features in student conversations is quite important for educational researchers to recognize the curricular and pedagogical variables that cause students to engage in constructing knowledge rather than merely completing tasks. The manual analysis of student conversations to identify these discourse features is time-consuming and labor-intensive, which limits the scale and scope of studies. Leveraging natural language processing (NLP) techniques can facilitate the automatic detection of these discourse features, offering educational researchers scalable and data-driven insights. However, existing studies in NLP that focus on discourse in dialogue rarely address educational data. In this work, we address this gap by introducing an annotated educational dialogue dataset of student conversations featuring knowledge construction and task production discourse. We also establish baseline models for automatically predicting these discourse properties for each turn of talk within conversations, using pre-trained large language models GPT-3.5 and Llama-3.1. Experimental results indicate that these state-of-the-art models perform suboptimally on this task, indicating the potential for future research.
zh

[NLP-4] Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition ICML2025

【速读】: 该论文旨在解决低资源语言在自动语音识别(Automatic Speech Recognition, ASR)系统中性能显著落后于高资源语言的问题,其根本原因在于训练数据的稀缺性和获取成本高。解决方案的关键在于提出了一种新颖的语音语料库数据增强技术,通过该方法显著提升了ASR模型在低资源语言上的表现,并且实验证明其优于现有增强策略,为提升欠代表语言群体的语音技术提供了可行路径。

链接: https://arxiv.org/abs/2511.20534
作者: Wesley Bian,Xiaofeng Lin,Guang Cheng
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted at ICML 2025 Workshop on Machine Learning for Audio

点击查看摘要

Abstract:Modern machine learning models for audio tasks often exhibit superior performance on English and other well-resourced languages, primarily due to the abundance of available training data. This disparity leads to an unfair performance gap for low-resource languages, where data collection is both challenging and costly. In this work, we introduce a novel data augmentation technique for speech corpora designed to mitigate this gap. Through comprehensive experiments, we demonstrate that our method significantly improves the performance of automatic speech recognition systems on low-resource languages. Furthermore, we show that our approach outperforms existing augmentation strategies, offering a practical solution for enhancing speech technology in underrepresented linguistic communities.
zh

[NLP-5] DesignPref: Capturing Personal Preferences in Visual Design Generation

【速读】: 该论文旨在解决生成式 AI 在视觉设计(如用户界面 UI 和演示文稿)评估中因个体偏好差异导致的模型泛化能力不足问题。传统方法依赖于多数投票机制训练统一的判别模型,但研究发现专业设计师之间对设计偏好的判断存在显著分歧(Krippendorff’s alpha = 0.25),且分歧源于对设计要素重要性的认知差异和个性化偏好。解决方案的关键在于引入 DesignPref 数据集(包含 12,000 条成对比较及多级评分),并提出通过微调或结合检索增强生成(Retrieval-Augmented Generation, RAG)管道中的个体标注信息进行个性化建模,实验证明此类个性化策略在仅使用 1/20 样本的情况下仍能显著优于聚合基线模型,从而为建模个体设计品味提供了首个数据基础与有效方法路径。

链接: https://arxiv.org/abs/2511.20513
作者: Yi-Hao Peng,Jeffrey P. Bigham,Jason Wu
机构: Carnegie Mellon University (卡内基梅隆大学); Apple (苹果)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Generative models, such as large language models and text-to-image diffusion models, are increasingly used to create visual designs like user interfaces (UIs) and presentation slides. Finetuning and benchmarking these generative models have often relied on datasets of human-annotated design preferences. Yet, due to the subjective and highly personalized nature of visual design, preference varies widely among individuals. In this paper, we study this problem by introducing DesignPref, a dataset of 12k pairwise comparisons of UI design generation annotated by 20 professional designers with multi-level preference ratings. We found that among trained designers, substantial levels of disagreement exist (Krippendorff’s alpha = 0.25 for binary preferences). Natural language rationales provided by these designers indicate that disagreements stem from differing perceptions of various design aspect importance and individual preferences. With DesignPref, we demonstrate that traditional majority-voting methods for training aggregated judge models often do not accurately reflect individual preferences. To address this challenge, we investigate multiple personalization strategies, particularly fine-tuning or incorporating designer-specific annotations into RAG pipelines. Our results show that personalized models consistently outperform aggregated baseline models in predicting individual designers’ preferences, even when using 20 times fewer examples. Our work provides the first dataset to study personalized visual design evaluation and support future research into modeling individual design taste.
zh

[NLP-6] he Text Aphasia Battery (TAB): A Clinically-Grounded Benchmark for Aphasia-Like Deficits in Language Models

【速读】: 该论文旨在解决当前缺乏适用于大语言模型(Large Language Models, LLMs)的标准化语言障碍评估工具的问题,以系统性地识别和量化其类失语症(aphasia-like)语言缺陷。传统临床评估方法依赖于人类特有的语用压力和认知机制,不适用于人工架构,因此无法有效用于LLMs的语言能力分析。解决方案的关键在于提出Text Aphasia Battery(TAB),这是一个基于Quick Aphasia Battery(QAB)改编的纯文本基准测试框架,包含连贯文本生成、词汇理解、句子理解与重复四个子测验,并配套自动化评分协议(采用Gemini 2.5 Flash实现),其评估一致性与专家人类评分相当(预估加权Cohen’s kappa = 0.255 vs. 0.286),从而为大规模、可扩展地研究人工系统中的语言缺陷提供了临床基础且高效的分析工具。

链接: https://arxiv.org/abs/2511.20507
作者: Nathan Roll,Jill Kries,Flora Jin,Catherine Wang,Ann Marie Finley,Meghan Sumner,Cory Shain,Laura Gwilliams
机构: Stanford University (斯坦福大学); University of California, San Francisco (加州大学旧金山分校); Temple University (天普大学); San Diego State University (圣地亚哥州立大学); University of California, San Diego (加州大学圣地亚哥分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have emerged as a candidate “model organism” for human language, offering an unprecedented opportunity to study the computational basis of linguistic disorders like aphasia. However, traditional clinical assessments are ill-suited for LLMs, as they presuppose human-like pragmatic pressures and probe cognitive processes not inherent to artificial architectures. We introduce the Text Aphasia Battery (TAB), a text-only benchmark adapted from the Quick Aphasia Battery (QAB) to assess aphasic-like deficits in LLMs. The TAB comprises four subtests: Connected Text, Word Comprehension, Sentence Comprehension, and Repetition. This paper details the TAB’s design, subtests, and scoring criteria. To facilitate large-scale use, we validate an automated evaluation protocol using Gemini 2.5 Flash, which achieves reliability comparable to expert human raters (prevalence-weighted Cohen’s kappa = 0.255 for model–consensus agreement vs. 0.286 for human–human agreement). We release TAB as a clinically-grounded, scalable framework for analyzing language deficits in artificial systems.
zh

[NLP-7] Adversarial Confusion Attack: Disrupting Multimodal Large Language Models

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在面对对抗性干扰时的鲁棒性不足问题,尤其是针对生成式AI(Generative AI)系统中因图像输入被恶意扰动而导致输出不一致或错误的问题。解决方案的关键在于提出一种新型攻击方法——对抗混淆攻击(Adversarial Confusion Attack),其核心思想是通过最大化下一个token的熵来诱导模型产生混乱或自信错误的输出;该方法利用一个小型开源MLLM集合进行梯度估计,并采用基础的投影梯度下降(PGD)算法生成高迁移性的扰动,实验证明单张对抗图像即可在白盒环境下破坏整个模型集合,且扰动能有效转移到未见过的开源和专有模型上,从而揭示了当前MLLMs在实际部署中的潜在安全风险。

链接: https://arxiv.org/abs/2511.20494
作者: Jakub Hoscilowicz,Artur Janicki
机构: Warsaw University of Technology (华沙理工大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We introduce the Adversarial Confusion Attack, a new class of threats against multimodal large language models (MLLMs). Unlike jailbreaks or targeted misclassification, the goal is to induce systematic disruption that makes the model generate incoherent or confidently incorrect outputs. Applications include embedding adversarial images into websites to prevent MLLM-powered agents from operating reliably. The proposed attack maximizes next-token entropy using a small ensemble of open-source MLLMs. In the white-box setting, we show that a single adversarial image can disrupt all models in the ensemble, both in the full-image and adversarial CAPTCHA settings. Despite relying on a basic adversarial technique (PGD), the attack generates perturbations that transfer to both unseen open-source (e.g., Qwen3-VL) and proprietary (e.g., GPT-5.1) models.
zh

[NLP-8] Generation Evaluation and Explanation of Novelists Styles with Single-Token Prompts

【速读】: 该论文旨在解决两个核心问题:一是当缺乏配对数据时如何训练生成式模型以模仿特定写作风格,二是如何在不依赖人类主观判断的前提下评估文本的风格一致性。解决方案的关键在于构建一个结合生成与评估的框架:首先利用少量单标记提示(single-token prompts)微调大语言模型(Large Language Models, LLMs),使其能够生成19世纪小说家如狄更斯、奥斯汀等人的文风文本;其次采用基于Transformer的检测器,该检测器在真实语句上训练,既作为分类器用于风格识别,又通过可解释人工智能方法(如注意力机制和梯度分析)揭示驱动风格模仿的语言特征,从而实现对生成文本风格质量的客观量化评估。

链接: https://arxiv.org/abs/2511.20459
作者: Mosab Rezaei,Mina Rajaei Moghadam,Abdul Rahman Shaikh,Hamed Alhoori,Reva Freedman
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large language models have created new opportunities for stylometry, the study of writing styles and authorship. Two challenges, however, remain central: training generative models when no paired data exist, and evaluating stylistic text without relying only on human judgment. In this work, we present a framework for both generating and evaluating sentences in the style of 19th-century novelists. Large language models are fine-tuned with minimal, single-token prompts to produce text in the voices of authors such as Dickens, Austen, Twain, Alcott, and Melville. To assess these generative models, we employ a transformer-based detector trained on authentic sentences, using it both as a classifier and as a tool for stylistic explanation. We complement this with syntactic comparisons and explainable AI methods, including attention-based and gradient-based analyses, to identify the linguistic cues that drive stylistic imitation. Our findings show that the generated text reflects the authors’ distinctive patterns and that AI-based evaluation offers a reliable alternative to human assessment. All artifacts of this work are published online.
zh

[NLP-9] A Task-Oriented Evaluation Framework for Text Normalization in Modern NLP Pipelines

【速读】: 该论文旨在解决现有词干提取(stemming)方法评估体系的局限性问题,特别是当前方法无法有效捕捉过度词干化可能带来的语义损害。解决方案的关键在于提出一种任务导向的综合评估框架,包含三个核心指标:(1) 词干有效性得分(Stemming Effectiveness Score, SES),衡量词干化对词汇压缩的效果;(2) 模型性能变化量(Model Performance Delta, MPD),评估词干化对下游自然语言处理任务的影响;(3) 平均归一化莱文施泰因距离(Average Normalized Levenshtein Distance, ANLD),量化词干后词汇与原始词之间的语义相似度。该框架能够区分单纯效率提升(高SES)与语义保留能力(低ANLD),从而更全面地判断词干化方法的实际可靠性。

链接: https://arxiv.org/abs/2511.20409
作者: Md Abdullah Al Kafi,Raka Moni,Sumit Kumar Banshal
机构: Dhaka International University (达卡国际大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Text normalization is an essential preprocessing step in many natural language processing (NLP) tasks, and stemming is one such normalization technique that reduces words to their base or root form. However, evaluating stemming methods is challenging because current evaluation approaches are limited and do not capture the potential harm caused by excessive stemming; therefore, it is essential to develop new approaches to evaluate stemming methods. To address this issue, this study propose a novel, task-oriented approach to evaluate stemming methods, which considers three aspects: (1) the utility of stemming using Stemming Effectiveness Score (SES), (2) the impact of stemming on downstream tasks using Model Performance Delta (MPD), and (3) the semantic similarity between stemmed and original words using Average Normalized Levenshtein Distance (ANLD), thus providing a comprehensive evaluation framework. We apply our evaluation framework to compare two stemmers for Bangla (BNLTK) and English (Snowball), and our results reveal a significant issue, prompting us to analyze their performance in detail. While the Bangla stemmer achieves the highest SES (1.67) due to effective word reduction (CR = 1.90), SES alone is insufficient because our proposed safety measure, ANLD, reveals that this high SES is due to harmful over-stemming (ANLD = 0.26), which correlates with the observed decrease in downstream this http URL contrast, the English stemmer achieves a moderate SES (1.31) with a safe meaning distance (ANLD = 0.14), allowing its word reduction to contribute positively to downstream performance; therefore, it is a more reliable stemmer. Our study provides a valuable tool for distinguishing between potential efficiency gains (high SES) and meaning preservation (low ANLD).
zh

[NLP-10] BengaliFig: A Low-Resource Challenge for Figurative and Culturally Grounded Reasoning in Bengali

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在比喻性和文化语境推理任务中表现不足的问题,尤其是在低资源语言如孟加拉语(Bengali)中的评估缺失。其解决方案的关键在于构建了一个名为BengaliFig的高质量、多维标注的挑战数据集,包含435个源自孟加拉语口头与文学传统的谜题,并从推理类型、陷阱类型、文化深度、答案类别和难度五个正交维度进行标注,同时通过约束感知的AI辅助流水线自动转换为多项选择格式。该数据集不仅作为诊断工具用于评估LLMs在低资源文化场景下的鲁棒性,也为实现更具包容性和遗产意识的自然语言处理(Natural Language Processing, NLP)评估提供了重要支撑。

链接: https://arxiv.org/abs/2511.20399
作者: Abdullah Al Sefat
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models excel on broad multilingual benchmarks but remain to be evaluated extensively in figurative and culturally grounded reasoning, especially in low-resource contexts. We present BengaliFig, a compact yet richly annotated challenge set that targets this gap in Bengali, a widely spoken low-resourced language. The dataset contains 435 unique riddles drawn from Bengali oral and literary traditions. Each item is annotated along five orthogonal dimensions capturing reasoning type, trap type, cultural depth, answer category, and difficulty, and is automatically converted to multiple-choice format through a constraint-aware, AI-assisted pipeline. We evaluate eight frontier LLMs from major providers under zero-shot and few-shot chain-of-thought prompting, revealing consistent weaknesses in metaphorical and culturally specific reasoning. BengaliFig thus contributes both a diagnostic probe for evaluating LLM robustness in low-resource cultural contexts and a step toward inclusive and heritage-aware NLP evaluation.
zh

[NLP-11] Soft Adaptive Policy Optimization

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在强化学习(Reinforcement Learning, RL)训练中因策略优化不稳定而导致的性能瓶颈问题,特别是由token-level重要性比率(importance ratios)高方差引发的更新不稳定性,这一现象在混合专家模型(Mixture-of-Experts, MoE)中尤为显著。现有基于分组的策略优化方法(如GSPO和GRPO)通过硬截断(hard clipping)缓解方差问题,但难以兼顾训练稳定性和有效学习信号的保留。其解决方案的关键在于提出Soft Adaptive Policy Optimization (SAPO),该方法用平滑、温度可控的门控机制替代硬截断,自适应地衰减离策略(off-policy)更新,同时保留有用的学习信号。SAPO实现了序列一致性(sequence-coherent)与token级适应性(token-adaptive)的统一:相比GSPO,其连续的信任区域避免了因少量离策略token导致整个序列梯度被抑制的问题;相比GRPO,其软缩放机制提供了更稳定且信息丰富的更新方向。实证结果表明,SAPO在数学推理基准上提升了训练稳定性与Pass@1指标,并在Qwen3-VL系列模型的不同任务和规模下均展现出一致的性能提升,证明其是一种更可靠、可扩展且高效的LLM强化学习优化策略。

链接: https://arxiv.org/abs/2511.20347
作者: Chang Gao,Chujie Zheng,Xiong-Hui Chen,Kai Dang,Shixuan Liu,Bowen Yu,An Yang,Shuai Bai,Jingren Zhou,Junyang Lin
机构: Qwen Team, Alibaba Inc(阿里巴巴集团)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance-a phenomenon exacerbated in Mixture-of-Experts models-leading to unstable updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, making it difficult to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy ones, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.
zh

[NLP-12] he Curious Case of Analogies: Investigating Analogical Reasoning in Large Language Models AAAI2026

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)是否能够编码并应用高层次关系概念(high-level relational concepts)进行类比推理的问题,特别是在面对新颖情境时的泛化能力。其关键发现在于:LLMs 能够在中间到上层网络中传播属性与关系信息,且成功类比推理依赖于结构对齐(structural alignment);然而,当关系信息缺失或需迁移至新实体时,模型表现受限,此时通过在关键标记位置策略性地修补隐藏表示可部分促进信息传递,揭示了LLMs在关系建模上的潜力与局限性。

链接: https://arxiv.org/abs/2511.20344
作者: Taewhoo Lee,Minju Song,Chanwoong Yoon,Jungwoo Park,Jaewoo Kang
机构: 未知
类目: Computation and Language (cs.CL)
备注: AAAI 2026

点击查看摘要

Abstract:Analogical reasoning is at the core of human cognition, serving as an important foundation for a variety of intellectual activities. While prior work has shown that LLMs can represent task patterns and surface-level concepts, it remains unclear whether these models can encode high-level relational concepts and apply them to novel situations through structured comparisons. In this work, we explore this fundamental aspect using proportional and story analogies, and identify three key findings. First, LLMs effectively encode the underlying relationships between analogous entities; both attributive and relational information propagate through mid-upper layers in correct cases, whereas reasoning failures reflect missing relational information within these layers. Second, unlike humans, LLMs often struggle not only when relational information is missing, but also when attempting to apply it to new entities. In such cases, strategically patching hidden representations at critical token positions can facilitate information transfer to a certain extent. Lastly, successful analogical reasoning in LLMs is marked by strong structural alignment between analogous situations, whereas failures often reflect degraded or misplaced alignment. Overall, our findings reveal that LLMs exhibit emerging but limited capabilities in encoding and applying high-level relational concepts, highlighting both parallels and gaps with human cognition.
zh

[NLP-13] Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios AAAI-2026

【速读】: 该论文旨在解决当前推测解码(speculative decoding)方法在低验证资源和低调度开销下难以实现高效加速的问题,尤其是在主流模型推理系统中批量处理(batching)已成主流的背景下。传统方法依赖于大型前缀树(prefix tree)和强大计算资源来生成复杂且庞大的草稿树,导致资源浪费与调度开销高。解决方案的关键在于提出一种名为 SpecFormer 的新架构,其核心创新是融合单向(unidirectional)与双向(bidirectional)注意力机制,从而在保持自回归模型(autoregressive model)对完整输入序列建模能力的同时,获得非自回归模型(non-autoregressive model)支持并行生成的优势,消除了对大规模前缀树的依赖,并在大批次场景下仍能实现稳定加速,显著降低训练需求和计算成本。

链接: https://arxiv.org/abs/2511.20340
作者: Luohe Shi,Zuchao Li,Lefei Zhang,Baoyuan Qi,Guoming Liu,Hai Zhao
机构: 未知
类目: Computation and Language (cs.CL)
备注: accepted by AAAI-2026

点击查看摘要

Abstract:Speculative decoding accelerates LLM inference by utilizing otherwise idle computational resources during memory-to-chip data transfer. Current speculative decoding methods typically assume a considerable amount of available computing power, then generate a complex and massive draft tree using a small autoregressive language model to improve overall prediction accuracy. However, methods like batching have been widely applied in mainstream model inference systems as a superior alternative to speculative decoding, as they compress the available idle computing power. Therefore, performing speculative decoding with low verification resources and low scheduling costs has become an important research problem. We believe that more capable models that allow for parallel generation on draft sequences are what we truly need. Recognizing the fundamental nature of draft models to only generate sequences of limited length, we propose SpecFormer, a novel architecture that integrates unidirectional and bidirectional attention mechanisms. SpecFormer combines the autoregressive model’s ability to extract information from the entire input sequence with the parallel generation benefits of non-autoregressive models. This design eliminates the reliance on large prefix trees and achieves consistent acceleration, even in large-batch scenarios. Through lossless speculative decoding experiments across models of various scales, we demonstrate that SpecFormer sets a new standard for scaling LLM inference with lower training demands and reduced computational costs.
zh

[NLP-14] Geometry of Decision Making in Language Models NEURIPS2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在多选题问答(Multiple-Choice Question Answering, MCQA)任务中决策过程的可解释性问题,即揭示其内部隐藏表示的几何结构如何支撑推理与泛化能力。解决方案的关键在于通过**内在维数(Intrinsic Dimension, ID)**分析方法,系统量化不同层隐藏表示的空间复杂度,发现模型在中间层经历维度扩张、随后在后期压缩至低维决策相关流形的规律性动态,从而表明LLMs隐式地将语言输入投影到与任务目标对齐的结构化低维流形上,为理解生成式AI的推理机制提供了新的几何视角。

链接: https://arxiv.org/abs/2511.20315
作者: Abhinav Joshi,Divyanshu Bhatt,Ashutosh Modi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at NeurIPS 2025

点击查看摘要

Abstract:Large Language Models (LLMs) show strong generalization across diverse tasks, yet the internal decision-making processes behind their predictions remain opaque. In this work, we study the geometry of hidden representations in LLMs through the lens of \textitintrinsic dimension (ID), focusing specifically on decision-making dynamics in a multiple-choice question answering (MCQA) setting. We perform a large-scale study, with 28 open-weight transformer models and estimate ID across layers using multiple estimators, while also quantifying per-layer performance on MCQA tasks. Our findings reveal a consistent ID pattern across models: early layers operate on low-dimensional manifolds, middle layers expand this space, and later layers compress it again, converging to decision-relevant representations. Together, these results suggest LLMs implicitly learn to project linguistic inputs onto structured, low-dimensional manifolds aligned with task-specific decisions, providing new geometric insights into how generalization and reasoning emerge in language models.
zh

[NLP-15] Beyond Components: Singular Vector-Based Interpretability of Transformer Circuits NEURIPS2025

【速读】: 该论文旨在解决当前对基于Transformer的语言模型内部计算机制理解不足的问题,特别是现有机制可解释性方法将注意力头(attention heads)和多层感知机层(MLPs)视为不可分割的单元,忽略了其内部可能存在的功能子结构。解决方案的关键在于提出一种更细粒度的视角,通过将这些组件分解为正交的奇异方向(singular directions),揭示单个注意力头或MLP中叠加且独立的计算过程;实验证明,先前识别出的典型功能模块(如“名字搬运者”头)实际上编码了多个与不同奇异方向对齐的重叠子功能,且计算图中的节点在特定低秩方向上表现出强激活,表明有意义的计算存在于紧凑的子空间中,从而揭示了Transformer内部计算比以往认为的更加分布化、结构化和组合化。

链接: https://arxiv.org/abs/2511.20273
作者: Areeb Ahmad,Abhinav Joshi,Ashutosh Modi
机构: Indian Institute of Technology Kanpur (IIT Kanpur)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at NeurIPS 2025

点击查看摘要

Abstract:Transformer-based language models exhibit complex and distributed behavior, yet their internal computations remain poorly understood. Existing mechanistic interpretability methods typically treat attention heads and multilayer perceptron layers (MLPs) (the building blocks of a transformer architecture) as indivisible units, overlooking possibilities of functional substructure learned within them. In this work, we introduce a more fine-grained perspective that decomposes these components into orthogonal singular directions, revealing superposed and independent computations within a single head or MLP. We validate our perspective on widely used standard tasks like Indirect Object Identification (IOI), Gender Pronoun (GP), and Greater Than (GT), showing that previously identified canonical functional heads, such as the name mover, encode multiple overlapping subfunctions aligned with distinct singular directions. Nodes in a computational graph, that are previously identified as circuit elements show strong activation along specific low-rank directions, suggesting that meaningful computations reside in compact subspaces. While some directions remain challenging to interpret fully, our results highlight that transformer computations are more distributed, structured, and compositional than previously assumed. This perspective opens new avenues for fine-grained mechanistic interpretability and a deeper understanding of model internals.
zh

[NLP-16] REFLEX: Self-Refining Explainable Fact-Checking via Disentangling Truth into Style and Substance

【速读】: 该论文旨在解决社交媒体上虚假信息泛滥问题,提出一种可自动提供准确判断并具备可解释性的事实核查系统。现有基于大语言模型(Large Language Model, LLM)的方法通常依赖外部知识源,导致延迟高、易产生幻觉,从而削弱了可靠性、可解释性和实时响应能力。其解决方案的关键在于提出REFLEX(REason-guided Fact-checking with Latent EXplanations)范式,该范式采用角色扮演对话形式联合训练判断预测与解释生成,并通过自适应提取骨干模型与其微调版本之间的对比激活对,构建解耦“真理”为“风格”与“实质”的引导向量,从而在激活层面上指导推理过程、抑制噪声解释,实现更忠实高效的推理机制。实验表明,仅需465个自精炼样本即可达到最先进性能,且带有解释目标训练的模型能有效提升无解释目标模型的表现,凸显内部解释信号在增强事实推理中的双重作用。

链接: https://arxiv.org/abs/2511.20233
作者: Chuyi Kong,Gao Wei,Jing Ma,Hongzhan Lin,Zhiyuan Fan
机构: Hong Kong Baptist University (香港浸会大学); Singapore Management University (新加坡管理大学); Soochow University (苏州大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The prevalence of misinformation on social media threatens public trust, demanding automated fact-checking systems that provide accurate verdicts with interpretable explanations. However, existing large language model-based (LLM-based) approaches often rely heavily on external knowledge sources, introducing substantial latency and even hallucinations that undermine reliability, interpretability, and responsiveness, which is crucial for real-time use. To address these challenges, we propose REason-guided Fact-checking with Latent EXplanations REFLEX paradigm, a plug-and-play, self-refining paradigm that leverages the internal knowledge in backbone model to improve both verdict accuracy and explanation quality. REFLEX reformulates fact-checking as a role-play dialogue and jointly trains verdict prediction and explanation generation. It adaptively extracts contrastive activation pairs between the backbone model and its fine-tuned variant to construct steering vectors that disentangle truth into style and substance naturally. These activation-level signals guide inference and suppress noisy explanations, enabling more faithful and efficient reasoning. Experiments on real-world datasets show that REFLEX outperforms previous methods that steer toward a single truth direction and underscores the challenge traditional approaches face when handling the subtle, human-unknown truth in fact-checking tasks. Remarkably, with only 465 self-refined training samples, RELFEX achieves state-of-the-art performance. Furthermore, models trained with explanatory objectives can effectively guide those without them, yielding up to a 7.57% improvement, highlighting that internal explanation signals play a dual role in both interpreting and enhancing factual reasoning.
zh

[NLP-17] KyrgyzBERT: A Compact Efficient Language Model for Kyrgyz NLP

【速读】: 该论文旨在解决吉尔吉斯语(Kyrgyz)作为低资源语言在自然语言处理(Natural Language Processing, NLP)领域缺乏基础工具的问题。其解决方案的关键在于提出并发布首个公开可用的吉尔吉斯语单语BERT模型——KyrgyzBERT,该模型具有3590万参数,并采用针对吉尔吉斯语形态结构定制的分词器;同时构建了kyrgyz-sst2情感分析基准数据集,通过在该数据集上微调,KyrgyzBERT实现了F1分数0.8280,性能优于参数量为其五倍的多语言BERT(mBERT),验证了其有效性与高效性。

链接: https://arxiv.org/abs/2511.20182
作者: Adilet Metinov,Gulida M. Kudakeeva,Gulnara D. Kabaeva
机构: 未知
类目: Computation and Language (cs.CL)
备注: 3 pages, 1 figure, 2 tables. Preprint

点击查看摘要

Abstract:Kyrgyz remains a low-resource language with limited foundational NLP tools. To address this gap, we introduce KyrgyzBERT, the first publicly available monolingual BERT-based language model for Kyrgyz. The model has 35.9M parameters and uses a custom tokenizer designed for the language’s morphological structure. To evaluate performance, we create kyrgyz-sst2, a sentiment analysis benchmark built by translating the Stanford Sentiment Treebank and manually annotating the full test set. KyrgyzBERT fine-tuned on this dataset achieves an F1-score of 0.8280, competitive with a fine-tuned mBERT model five times larger. All models, data, and code are released to support future research in Kyrgyz NLP.
zh

[NLP-18] SEDA: A Self-Adapted Entity-Centric Data Augmentation for Boosting Gird-based Discontinuous NER Models

【速读】: 该论文旨在解决命名实体识别(Named Entity Recognition, NER)中跨句不连续实体的识别难题,尤其是传统文本分割方法常导致此类实体被错误分割或完全遗漏的问题。其解决方案的关键在于将图像数据增强技术(如裁剪、缩放和填充)引入基于网格的标注模型中,通过提升模型对不连续实体的空间感知能力来改善分割准确性和识别性能。实验表明,该方法在CADEC、ShARe13和ShARe14数据集上显著提升了F1分数,尤其在不连续实体上的增益达3.7–8.4%,验证了其有效性。

链接: https://arxiv.org/abs/2511.20143
作者: Wen-Fang Su,Hsiao-Wei Chou,Wen-Yang Lin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:Named Entity Recognition (NER) is a critical task in natural language processing, yet it remains particularly challenging for discontinuous entities. The primary difficulty lies in text segmentation, as traditional methods often missegment or entirely miss cross-sentence discontinuous entities, significantly affecting recognition accuracy. Therefore, we aim to address the segmentation and omission issues associated with such entities. Recent studies have shown that grid-tagging methods are effective for information extraction due to their flexible tagging schemes and robust architectures. Building on this, we integrate image data augmentation techniques, such as cropping, scaling, and padding, into grid-based models to enhance their ability to recognize discontinuous entities and handle segmentation challenges. Experimental results demonstrate that traditional segmentation methods often fail to capture cross-sentence discontinuous entities, leading to decreased performance. In contrast, our augmented grid models achieve notable improvements. Evaluations on the CADEC, ShARe13, and ShARe14 datasets show F1 score gains of 1-2.5% overall and 3.7-8.4% for discontinuous entities, confirming the effectiveness of our approach.
zh

[NLP-19] “When Data is Scarce Prompt Smarter”… Approaches to Grammatical Error Correction in Low-Resource Settings AACL2025

【速读】: 该论文旨在解决低资源印地语系(Indic)语言在语法错误纠正(Grammatical Error Correction, GEC)任务中性能不足的问题,其核心挑战包括数据稀缺、语言多样性以及复杂的形态学结构。解决方案的关键在于利用最先进的大语言模型(Large Language Models, LLMs),如GPT-4.1、Gemini-2.5和LLaMA-4,结合少量样本(few-shot)提示策略进行轻量级适配,而非传统的微调方法。实验表明,即使采用基础的零样本(zero-shot)或少样本提示方式,这些LLMs也能显著超越专门针对印地语系语言训练的模型(如Sarvam-22B),证明了当代LLMs在多语言GEC任务中具备卓越的泛化能力;同时,精心设计的提示工程与轻量适应机制进一步提升了多种印地语系语言的纠错质量,在多个语言的共享任务中取得了领先结果。

链接: https://arxiv.org/abs/2511.20120
作者: Somsubhra De,Harsh Kumar,Arun Prakash A
机构: IIT Madras; AI4Bharat
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures, 5 tables; Accept-demonstration at BHASHA Workshop, IJCNLP-AACL 2025

点击查看摘要

Abstract:Grammatical error correction (GEC) is an important task in Natural Language Processing that aims to automatically detect and correct grammatical mistakes in text. While recent advances in transformer-based models and large annotated datasets have greatly improved GEC performance for high-resource languages such as English, the progress has not extended equally. For most Indic languages, GEC remains a challenging task due to limited resources, linguistic diversity and complex morphology. In this work, we explore prompting-based approaches using state-of-the-art large language models (LLMs), such as GPT-4.1, Gemini-2.5 and LLaMA-4, combined with few-shot strategy to adapt them to low-resource settings. We observe that even basic prompting strategies, such as zero-shot and few-shot approaches, enable these LLMs to substantially outperform fine-tuned Indic-language models like Sarvam-22B, thereby illustrating the exceptional multilingual generalization capabilities of contemporary LLMs for GEC. Our experiments show that carefully designed prompts and lightweight adaptation significantly enhance correction quality across multiple Indic languages. We achieved leading results in the shared task–ranking 1st in Tamil (GLEU: 91.57) and Hindi (GLEU: 85.69), 2nd in Telugu (GLEU: 85.22), 4th in Bangla (GLEU: 92.86), and 5th in Malayalam (GLEU: 92.97). These findings highlight the effectiveness of prompt-driven NLP techniques and underscore the potential of large-scale LLMs to bridge resource gaps in multilingual GEC.
zh

[NLP-20] Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach

【速读】: 该论文旨在解决语言学习与语音治疗中发音错误检测与诊断(Mispronunciation Detection and Diagnosis, MDD)的问题,传统方法通常依赖于评分模型或需训练音素级别的专用模型,存在建模复杂度高、数据依赖性强等局限。其解决方案的关键在于提出一种无需训练的框架,利用预训练的自动语音识别(Automatic Speech Recognition, ASR)模型结合检索技术,直接从语音中提取语义与发音特征,从而实现对发音错误的精准检测与诊断,避免了音素级建模和额外任务特定训练的繁琐过程,在L2-ARCTIC数据集上取得了F1分数69.60%的优异性能。

链接: https://arxiv.org/abs/2511.20107
作者: Huu Tuong Tu,Ha Viet Khanh,Tran Tien Dat,Vu Huan,Thien Van Luong,Nguyen Tien Cuong,Nguyen Thi Thu Trang
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Mispronunciation Detection and Diagnosis (MDD) is crucial for language learning and speech therapy. Unlike conventional methods that require scoring models or training phoneme-level models, we propose a novel training-free framework that leverages retrieval techniques with a pretrained Automatic Speech Recognition model. Our method avoids phoneme-specific modeling or additional task-specific training, while still achieving accurate detection and diagnosis of pronunciation errors. Experiments on the L2-ARCTIC dataset show that our method achieves a superior F1 score of 69.60% while avoiding the complexity of model training.
zh

[NLP-21] EM2LDL: A Multilingual Speech Corpus for Mixed Emotion Recognition through Label Distribution Learning

【速读】: 该论文旨在解决现有情感识别数据集在语言多样性、混合情绪建模及生态效度方面的局限性,即多数数据集为单语种且仅标注单一情绪标签,难以反映多语言环境中真实的情感表达复杂性。其解决方案的关键在于构建EM2LDL——一个包含英语、普通话和粤语的多语言语音语料库,通过整合来自在线平台的自发情感表达,并采用32类细粒度情绪分布进行标注,从而支持跨语言混合情绪识别研究。该语料库特别捕捉了香港与澳门地区常见的句内语言切换(intra-utterance code-switching)现象,提升了模型在真实多语境下的泛化能力。实验表明,基于自监督学习模型(如HuBERT-large-EN)的基线方法在性别、年龄和人格特征独立评估中均表现稳健,验证了EM2LDL作为多语言混合情绪识别研究平台的有效性。

链接: https://arxiv.org/abs/2511.20106
作者: Xingfeng Li,Xiaohan Shi,Junjie Li,Yongwei Li,Masashi Unoki,Tomoki Toda,Masato Akagi
机构: City University of Macau (澳门城市大学); Nagoya University (名古屋大学); Chinese Academy of Sciences (中国科学院); Japan Advanced Institute of Science and Technology (日本高级科学技术研究院)
类目: Computation and Language (cs.CL)
备注: Submitted to IEEE Transactions on Affective computing

点击查看摘要

Abstract:This study introduces EM2LDL, a novel multilingual speech corpus designed to advance mixed emotion recognition through label distribution learning. Addressing the limitations of predominantly monolingual and single-label emotion corpora \textcolorblackthat restrict linguistic diversity, are unable to model mixed emotions, and lack ecological validity, EM2LDL comprises expressive utterances in English, Mandarin, and Cantonese, capturing the intra-utterance code-switching prevalent in multilingual regions like Hong Kong and Macao. The corpus integrates spontaneous emotional expressions from online platforms, annotated with fine-grained emotion distributions across 32 categories. Experimental baselines using self-supervised learning models demonstrate robust performance in speaker-independent gender-, age-, and personality-based evaluations, with HuBERT-large-EN achieving optimal results. By incorporating linguistic diversity and ecological validity, EM2LDL enables the exploration of complex emotional dynamics in multilingual settings. This work provides a versatile testbed for developing adaptive, empathetic systems for applications in affective computing, including mental health monitoring and cross-cultural communication. The dataset, annotations, and baseline codes are publicly available at this https URL.
zh

[NLP-22] he Devil in the Details: Emergent Misalignment Format and Coherence in Open-Weights LLM s

【速读】: 该论文旨在解决当前开放权重模型(open-weights models)在特定领域微调时是否会出现“涌现偏差”(emergent misalignment)的问题,即模型在训练数据与安全目标不一致时,其输出行为偏离预期规范的现象。研究发现,尽管所有测试模型均表现出一定程度的涌现偏差,但不同架构和规模的模型存在显著差异:例如Qwen-2.5家族相对更具鲁棒性,而GPT-4o则表现出最强的偏差;此外,在九种现代开源模型中,微调后生成不安全代码的误判率平均为0.68%,远低于GPT-4o的20%,表明开放模型的偏差水平明显低于闭源系统。关键解决方案在于识别出一种格式依赖性的脆弱性——当要求模型以JSON格式输出时,其误判率(0.96%)几乎翻倍于自然语言提示下的结果(0.42%),说明结构化输出限制了模型的自由度,从而可能绕过安全训练机制,这为设计更稳健的安全对齐策略提供了重要依据。

链接: https://arxiv.org/abs/2511.20104
作者: Craig Dickson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Prior work has shown that fine-tuning models on a narrow domain with misaligned data can lead to broad misalignment - a phenomenon termed “emergent misalignment” (Betley et al. 2025). While all tested models were susceptible to emergent misalignment, some models showed more resistance than others. Specifically the Qwen-2.5 family proved to be relatively resistant, while GPT-4o exhibited the strongest misalignment. In this paper we evaluate if current-generation open-weights models exhibit similar resistance to the Qwen-2.5 family and measure misalignment robustness over a range of model architectures and scales. We replicate the effect across nine modern open-weights models (Gemma 3 and Qwen 3 families, 1B-32B parameters). Models fine-tuned on insecure code generation show a 0.68% misalignment rate (compared to 0.07% for base models), matching the lower end of prior open-model results but dramatically lower than GPT-4o’s 20%. We identify a critical format-dependent vulnerability: requiring JSON output doubles misalignment rates compared to natural language prompts (0.96% vs 0.42%). This suggests that structural constraints may bypass safety training by reducing the model’s ‘degrees of freedom’ to refuse. These findings confirm emergent misalignment as a reproducible phenomenon in modern open-weights models, with rates substantially lower than observed in proprietary systems. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2511.20104 [cs.LG] (or arXiv:2511.20104v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.20104 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-23] SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中全注意力(full attention)机制因二次复杂度导致的长文本处理效率低下问题,同时克服现有稀疏注意力(sparse attention)方法在训练过程中因缺乏梯度传播而导致性能下降的缺陷。其解决方案的关键在于提出一种统一的训练框架——稀疏稀疏注意力(Sparse Sparse Attention, SSA),该框架在每一层强制实现稀疏注意力与全注意力之间的双向对齐,从而确保所有token都能获得梯度更新,并显式引导稀疏注意力输出逼近全注意力结果,有效提升稀疏程度与模型性能的平衡,最终实现推理时灵活的计算-性能权衡以及更强的长上下文外推能力。

链接: https://arxiv.org/abs/2511.20102
作者: Zhenyi Shen,Junru Lu,Lin Gui,Jiazheng Li,Yulan He,Di Yin,Xing Sun
机构: King’s College London (伦敦国王学院); Tencent Youtu Lab (腾讯优图实验室)
类目: Computation and Language (cs.CL)
备注: 28 pages

点击查看摘要

Abstract:The quadratic complexity of full attention limits efficient long-context processing in large language models (LLMs). Sparse attention mitigates this cost by restricting each query to attend to a subset of previous tokens; however, training-free approaches often lead to severe performance degradation. Native sparse-attention methods (e.g., NSA, MoBA) alleviate this issue, yet exhibit a critical paradox: they produce lower attention sparsity than full-attention models, despite aiming to approximate full attention, which may constrain their effectiveness. We attribute this paradox to gradient update deficiency: low-ranked key-value pairs excluded during sparse training receive neither forward contribution nor backward gradients, and thus never learn proper suppression. To overcome this limitation, we propose SSA (Sparse Sparse Attention), a unified training framework that considers both sparse and full attention and enforces bidirectional alignment at every layer. This design preserves gradient flow to all tokens while explicitly encouraging sparse-attention outputs to align with their full-attention counterparts, thereby promoting stronger sparsity. As a result, SSA achieves state-of-the-art performance under both sparse and full attention inference across multiple commonsense benchmarks. Furthermore, SSA enables models to adapt smoothly to varying sparsity budgets; performance improves consistently as more tokens are allowed to attend, supporting flexible compute-performance trade-offs at inference time. Finally, we show that native sparse-attention training surprisingly improves long-context extrapolation by mitigating the over-allocation of attention values in sink areas, with SSA demonstrating the strongest extrapolation capability.
zh

[NLP-24] QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM -Based High-Performance GPU Kernel Generation AAAI2026

【速读】: 该论文旨在解决高性能GPU内核(GPU kernel)开发中面临的两大核心挑战:一是依赖专家手工优化导致效率低下且可移植性差,二是现有基于大语言模型(LLM)的方法在正确性(correctness)与效率(efficiency)之间存在根本性矛盾。其关键解决方案是提出一种分层框架Macro Thinking Micro Coding (MTMC),通过将优化策略(optimization strategy)与实现细节(implementation details)解耦,利用强化学习引导轻量级LLM高效探索语义层面的优化策略以最大化硬件利用率,同时借助通用LLM逐步增量式地实现这些策略,从而避免全内核生成带来的错误,有效应对庞大优化空间和复杂实现细节的双重挑战。

链接: https://arxiv.org/abs/2511.20100
作者: Xinguo Zhu,Shaohui Peng,Jiaming Guo,Yunji Chen,Qi Guo,Yuanbo Wen,Hang Qin,Ruizhi Chen,Qirui Zhou,Ke Gao,Yanjun Wu,Chen Zhao,Ling Li
机构: 1. Tsinghua University (清华大学); 2. Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); 3. Beijing Academy of Artificial Intelligence (北京人工智能研究院); 4. National Engineering Research Center for Intelligent Computing Systems (国家智能计算系统工程研究中心)
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL)
备注: 9 pages, 2 figures, accepted by AAAI 2026

点击查看摘要

Abstract:Developing high-performance GPU kernels is critical for AI and scientific computing, but remains challenging due to its reliance on expert crafting and poor portability. While LLMs offer promise for automation, both general-purpose and finetuned LLMs suffer from two fundamental and conflicting limitations: correctness and efficiency. The key reason is that existing LLM-based approaches directly generate the entire optimized low-level programs, requiring exploration of an extremely vast space encompassing both optimization policies and implementation codes. To address the challenge of exploring an intractable space, we propose Macro Thinking Micro Coding (MTMC), a hierarchical framework inspired by the staged optimization strategy of human experts. It decouples optimization strategy from implementation details, ensuring efficiency through high-level strategy and correctness through low-level implementation. Specifically, Macro Thinking employs reinforcement learning to guide lightweight LLMs in efficiently exploring and learning semantic optimization strategies that maximize hardware utilization. Micro Coding leverages general-purpose LLMs to incrementally implement the stepwise optimization proposals from Macro Thinking, avoiding full-kernel generation errors. Together, they effectively navigate the vast optimization space and intricate implementation details, enabling LLMs for high-performance GPU kernel generation. Comprehensive results on widely adopted benchmarks demonstrate the superior performance of MTMC on GPU kernel generation in both accuracy and running time. On KernelBench, MTMC achieves near 100% and 70% accuracy at Levels 1-2 and 3, over 50% than SOTA general-purpose and domain-finetuned LLMs, with up to 7.3x speedup over LLMs, and 2.2x over expert-optimized PyTorch Eager kernels. On the more challenging TritonBench, MTMC attains up to 59.64% accuracy and 34x speedup.
zh

[NLP-25] More Bias Less Bias: BiasPrompting for Enhanced Multiple-Choice Question Answering

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在多项选择题(Multiple-Choice Question, MCQ)任务中因缺乏对选项的上下文锚定与解释而导致推理能力受限的问题。现有方法通常直接将答案选项输入模型,未引导其充分探索所有可能选项的合理性,从而影响最终预测准确性。解决方案的关键在于提出BiasPrompting框架,该框架包含两个阶段:首先通过推理生成阶段促使模型为每个选项生成支持性理由;随后在推理引导的一致性阶段,综合分析这些理由以选出最合理的答案。这一机制有效提升了LLMs在复杂MCQ场景下的系统性推理能力。

链接: https://arxiv.org/abs/2511.20086
作者: Duc Anh Vu,Thong Nguyen,Cong-Duy Nguyen,Viet Anh Nguyen,Anh Tuan Luu
机构: Nanyang Techonological University (南洋理工大学); National University of Singapore (新加坡国立大学); Centre for AI research, VinUniversity (VinUniversity人工智能研究中心)
类目: Computation and Language (cs.CL)
备注: Accepted at the 41st ACM/SIGAPP Symposium On Applied Computing (SAC 2026), Main Conference

点击查看摘要

Abstract:With the advancement of large language models (LLMs), their performance on multiple-choice question (MCQ) tasks has improved significantly. However, existing approaches face key limitations: answer choices are typically presented to LLMs without contextual grounding or explanation. This absence of context can lead to incomplete exploration of all possible answers, ultimately degrading the models’ reasoning capabilities. To address these challenges, we introduce BiasPrompting, a novel inference framework that guides LLMs to generate and critically evaluate reasoning across all plausible answer options before reaching a final prediction. It consists of two components: first, a reasoning generation stage, where the model is prompted to produce supportive reasonings for each answer option, and then, a reasoning-guided agreement stage, where the generated reasonings are synthesized to select the most plausible answer. Through comprehensive evaluations, BiasPrompting demonstrates significant improvements in five widely used multiple-choice question answering benchmarks. Our experiments showcase that BiasPrompting enhances the reasoning capabilities of LLMs and provides a strong foundation for tackling complex and challenging questions, particularly in settings where existing methods underperform.
zh

[NLP-26] MTA: A Merge-then-Adapt Framework for Personalized Large Language Model

【速读】: 该论文旨在解决个性化大语言模型(Personalized Large Language Models, PLLMs)在实际应用中面临的两大挑战:一是为每个用户单独微调模型导致存储成本随用户数量线性增长,难以扩展;二是针对数据稀疏用户的静态模型微调往往效果不佳。解决方案的关键在于提出一种名为MTA(Merge-then-Adapt)的框架,其核心创新包括三个阶段:首先构建共享的元LoRA库(Meta-LoRA Bank),通过锚定用户预训练元个性化特征;其次引入自适应LoRA融合(Adaptive LoRA Fusion)阶段,动态检索并合并最相关的锚点元LoRA以生成用户专属表示,从而避免用户级存储且支持灵活组合;最后设计少样本LoRA堆叠(LoRA Stacking for Few-Shot Personalization)阶段,在合并后的LoRA基础上叠加一个超低秩轻量模块进行微调,实现高效少样本个性化。

链接: https://arxiv.org/abs/2511.20072
作者: Xiaopeng Li,Yuanjin Zheng,Wanyu Wang,wenlin zhang,Pengyue Jia,Yiqi Wang,Maolin Wang,Xuetao Wei,Xiangyu Zhao
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Personalized Large Language Models (PLLMs) aim to align model outputs with individual user preferences, a crucial capability for user-centric applications. However, the prevalent approach of fine-tuning a separate module for each user faces two major limitations: (1) storage costs scale linearly with the number of users, rendering the method unscalable; and (2) fine-tuning a static model from scratch often yields suboptimal performance for users with sparse data. To address these challenges, we propose MTA, a Merge-then-Adapt framework for PLLMs. MTA comprises three key stages. First, we construct a shared Meta-LoRA Bank by selecting anchor users and pre-training meta-personalization traits within meta-LoRA modules. Second, to ensure scalability and enable dynamic personalization combination beyond static models, we introduce an Adaptive LoRA Fusion stage. This stage retrieves and dynamically merges the most relevant anchor meta-LoRAs to synthesize a user-specific one, thereby eliminating the need for user-specific storage and supporting more flexible personalization. Third, we propose a LoRA Stacking for Few-Shot Personalization stage, which applies an additional ultra-low-rank, lightweight LoRA module on top of the merged LoRA. Fine-tuning this module enables effective personalization under few-shot settings. Extensive experiments on the LaMP benchmark demonstrate that our approach outperforms existing SOTA methods across multiple tasks.
zh

[NLP-27] Online-PVLM: Advancing Personalized VLMs with Online Concept Learning

【速读】: 该论文旨在解决个性化视觉语言模型(Personalized Visual Language Models, PVLMs)在实际应用中难以实现在线概念学习的问题,尤其是在大规模场景下,传统方法需为每个新概念单独训练嵌入表示(embedding),导致测试阶段无法实时适应且检索效率低下。其解决方案的关键在于提出Online-PVLM框架,通过引入双曲空间(hyperbolic representations)实现无需训练的嵌入生成机制,从而在测试时动态生成概念嵌入,显著提升了个性化VLM的可扩展性与实时适应能力。

链接: https://arxiv.org/abs/2511.20056
作者: Huiyu Bai,Runze Wang,Zhuoyun Du,Yiyang Zhao,Fengji Zhang,Haoyu Chen,Xiaoyong Zhu,Bo Zheng,Xuejiao Zhao
机构: Nanyang Technological University (南洋理工大学); Alibaba Group (阿里巴巴集团); Zhejiang University (浙江大学); City University of Hong Kong (香港城市大学); University of Oulu (奥卢大学)
类目: Computation and Language (cs.CL)
备注: Work in Progress

点击查看摘要

Abstract:Personalized Visual Language Models (VLMs) are gaining increasing attention for their formidable ability in user-specific concepts aligned interactions (e.g., identifying a user’s bike). Existing methods typically require the learning of separate embeddings for each new concept, which fails to support real-time adaptation during testing. This limitation becomes particularly pronounced in large-scale scenarios, where efficient retrieval of concept embeddings is not achievable. To alleviate this gap, we propose Online-PVLM, a framework for online concept learning by leveraging hyperbolic representations. Our approach makes a train-free paradigm for concept embeddings generation at test time, making the use of personalized VLMs both scalable and efficient. In addition, we develop OP-Eval, a comprehensive and large-scale benchmark comprising 1,292 concepts and over 30K high-quality instances with diverse question types, designed to rigorously assess online concept learning in realistic scenarios. Extensive experiments demonstrate the state-of-the-art performance of our proposed framework. Our source code and dataset will be made available.
zh

[NLP-28] A Machine Learning Approach for Detection of Mental Health Conditions and Cyberbullying from Social Media AAAI-26 ALT

【速读】: 该论文旨在解决数字空间中日益严重的心理健康挑战与网络欺凌(cyberbullying)的检测问题,尤其关注如何构建一个可扩展且具备可解释性的多类别分类系统。其核心解决方案在于提出一个统一的多类分类框架,通过“分治再平衡”(split-then-balance)的数据处理流程,在保持训练数据平衡的同时于真实分布的不平衡测试集上评估模型性能;关键创新点在于验证了端到端微调(end-to-end fine-tuning)对模型性能的重要性,并基于领域适配的MentalBERT模型实现了最优效果(准确率0.92,Macro F1 0.76),同时引入混合SHAPLLM可解释性框架与原型仪表板(Social Media Screener),将预测结果及其解释嵌入到人工审核工作流中,从而将系统定位为“人在回路”的筛查辅助工具而非诊断工具。

链接: https://arxiv.org/abs/2511.20001
作者: Edward Ajayi,Martha Kachweka,Mawuli Deku,Emily Aiken
机构: Carnegie Mellon University Africa (卡内基梅隆大学非洲分校)
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注: Accepted for Oral Presentation at the AAAI-26 Bridge Program on AI for Medicine and Healthcare (AIMedHealth). To appear in Proceedings of Machine Learning Research (PMLR)

点击查看摘要

Abstract:Mental health challenges and cyberbullying are increasingly prevalent in digital spaces, necessitating scalable and interpretable detection systems. This paper introduces a unified multiclass classification framework for detecting ten distinct mental health and cyberbullying categories from social media data. We curate datasets from Twitter and Reddit, implementing a rigorous “split-then-balance” pipeline to train on balanced data while evaluating on a realistic, held-out imbalanced test set. We conducted a comprehensive evaluation comparing traditional lexical models, hybrid approaches, and several end-to-end fine-tuned transformers. Our results demonstrate that end-to-end fine-tuning is critical for performance, with the domain-adapted MentalBERT emerging as the top model, achieving an accuracy of 0.92 and a Macro F1 score of 0.76, surpassing both its generic counterpart and a zero-shot LLM baseline. Grounded in a comprehensive ethical analysis, we frame the system as a human-in-the-loop screening aid, not a diagnostic tool. To support this, we introduce a hybrid SHAPLLM explainability framework and present a prototype dashboard (“Social Media Screener”) designed to integrate model predictions and their explanations into a practical workflow for moderators. Our work provides a robust baseline, highlighting future needs for multi-label, clinically-validated datasets at the critical intersection of online safety and computational mental health.
zh

[NLP-29] Directional Optimization Asymmetry in Transformers: A Synthetic Stress Test

【速读】: 该论文旨在解决Transformer模型在方向性学习上存在的“反转诅咒”(reversal curse)问题,即其在处理从左到右与从右到左映射时表现出显著的不对称性能差异,这种差异究竟是源于语言统计特性(如语料库的时间不对称性),还是由架构本身导致。为厘清这一争议,作者设计了一个完全合成、熵可控的基准测试,作为纯净环境下的压力测试工具:通过构造具有可调分支因子K的随机字符串映射,生成前向任务(条件熵为零)和逆向任务(理论熵下界可解析计算),从而量化超出理论最小损失的部分。关键发现是,即使在无语义、无语言先验的条件下,GPT-2模型仍表现出显著且可复现的方向优化差距(例如K=5时达1.16 nats),远超同等数据下MLP模型的表现;预训练初始化虽改变优化轨迹但无法消除该差距,而LoRA在高熵逆向任务中遭遇容量瓶颈。这表明方向性摩擦是因果Transformer训练中的内在属性,与语言统计无关,为理解现代序列模型的方向偏差提供了可控实验框架,并推动对逆向建模本质困难的机制研究。

链接: https://arxiv.org/abs/2511.19997
作者: Mihir Sahasrabudhe
机构: University of Illinois(伊利诺伊大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 19 pages, 4 figures. Code available at this https URL

点击查看摘要

Abstract:Transformers are theoretically reversal-invariant: their function class does not prefer left-to-right over right-to-left mappings. Yet empirical studies on natural language repeatedly report a “reversal curse,” and recent work on temporal asymmetry in LLMs suggests that real-world corpora carry their own arrow of time. This leaves an unresolved question: do directional failures stem from linguistic statistics, or from the architecture itself? We cut through this ambiguity with a fully synthetic, entropy-controlled benchmark designed as a clean-room stress test for directional learning. Using random string mappings with tunable branching factor K, we construct forward tasks with zero conditional entropy and inverse tasks with analytically determined entropy floors. Excess loss above these floors reveals that even scratch-trained GPT-2 models exhibit a strong, reproducible directional optimization gap (e.g., 1.16 nats at K=5), far larger than that of an MLP trained on the same data. Pre-trained initializations shift optimization behavior but do not eliminate this gap, while LoRA encounters a sharp capacity wall on high-entropy inverse mappings. Together, these results isolate a minimal, semantics-free signature of directional friction intrinsic to causal Transformer training-one that persists even when linguistic priors, token frequencies, and corpus-level temporal asymmetries are removed. Our benchmark provides a controlled instrument for dissecting directional biases in modern sequence models and motivates deeper mechanistic study of why inversion remains fundamentally harder for Transformers.
zh

[NLP-30] textR2textR: A Route-to-Rerank Post-Training Framework for Multi-Domain Decoder-Only Rerankers

【速读】: 该论文旨在解决生成式 AI (Generative AI) 中检索增强生成(Retrieval-Augmented Generation, RAG)系统中Decoder-only rerankers在高风险领域(如金融与法律)因缺乏领域特异性而表现不佳的问题,以及传统微调方法导致的表面特征过拟合和灾难性遗忘问题。其解决方案的关键在于提出一种名为R2R的领域感知框架,该框架通过两个核心技术实现:一是两阶段训练策略——Entity Abstraction for Generalization (EAG),其引入反捷径机制,通过掩码最具预测性的表面线索迫使重排序器学习领域不变的相关性模式;二是轻量级潜在语义路由机制(Latent Semantic Router),基于冻结主干解码器的内部表示动态选择最优LoRA专家,从而高效激活领域专家模型。该方法具有模型无关性和模块化特性,在多个领域(法律、医疗、金融)中均展现出优于通用模型和单领域微调基线的跨域鲁棒性。

链接: https://arxiv.org/abs/2511.19987
作者: Xinyu Wang,Hanwei Wu,Qingchen Hu,Zhenghan Tai,Jingrui Tian,Lei Ding,Jijun Chi,Hailin He,Tung Sum Thomas Kwok,Yufei Cui,Sicheng Lyu,Muzhi Li,Mingze Li,Xinyue Yu,Ling Zhou,Peng Lu
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 13 pages, including 3 figures and 3 tables

点击查看摘要

Abstract:Decoder-only rerankers are central to Retrieval-Augmented Generation (RAG). However, generalist models miss domain-specific nuances in high-stakes fields like finance and law, and naive fine-tuning causes surface-form overfitting and catastrophic forgetting. To address this challenge, we introduce R2R, a domain-aware framework that combines dynamic expert routing with a two-stage training strategy, Entity Abstraction for Generalization (EAG). EAG introduces a counter-shortcut mechanism by masking the most predictive surface cues, forcing the reranker to learn domain-invariant relevance patterns rather than memorizing dataset-specific entities. To efficiently activate domain experts, R2R employs a lightweight Latent Semantic Router that probes internal representations from the frozen backbone decoder to select the optimal LoRA expert per query. Extensive experiments across different reranker backbones and diverse domains (legal, medical, and financial) demonstrate that R2R consistently surpasses generalist and single-domain fine-tuned baselines. Our results confirm that R2R is a model-agnostic and modular approach to domain specialization with strong cross-domain robustness.
zh

[NLP-31] AppSelectBench: Application-Level Tool Selection Benchmark

【速读】: 该论文旨在解决当前计算机使用代理(Computer Using Agents, CUAs)在执行复杂任务时,缺乏对跨应用层级的合理选择能力的问题。现有基准主要聚焦于细粒度API的选择评估,未能充分考察模型在不同应用程序之间进行推理和决策的能力,导致对CUAs整体应用级推理能力的认知存在盲区。解决方案的关键在于提出AppSelectBench——一个全面的基准测试平台,其核心创新包括:一是设计了一种新颖的用户任务生成管道,可规模化生成真实、多样且语义 grounded 的用户意图;二是构建了统一的评估协议,涵盖随机、启发式、零样本、少样本及检索增强等多种设置;三是覆盖100个常用桌面应用和超过十万条高质量任务数据,从而系统性地揭示大语言模型在跨应用选择中的性能瓶颈与优势,为推进CUAs在应用层面上的智能推理提供可靠评测基础。

链接: https://arxiv.org/abs/2511.19957
作者: Tianyi Chen,Michael Solodko,Sen Wang,Jongwoo Ko,Junheng Hao,Colby Banbury,Sara Abdali,Saeed Amizadeh,Qing Xiao,Yinheng Li,Tianyu Ding,Kamran Ghasedi Dizaji,Suzhen Zheng,Hao Fan,Justin Wagle,Pashmina Cameron,Kazuhito Koishida
机构: Microsoft(微软)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Computer Using Agents (CUAs) are increasingly equipped with external tools, enabling them to perform complex and realistic tasks. For CUAs to operate effectively, application selection, which refers to deciding which application to use before invoking fine-grained tools such as APIs, is a fundamental capability. It determines whether the agent initializes the correct environment, avoids orchestration confusion, and efficiently focuses on relevant context. However, existing benchmarks primarily assess fine-grained API selection, offering limited insight into whether models can reason across and choose between different applications. To fill this gap, we introduce AppSelectBench, a comprehensive benchmark for evaluating application selection in CUAs. AppSelectBench contains a novel user task generation pipeline that produces realistic, diverse, and semantically grounded user intents at scale, together with unified evaluation protocols covering random, heuristic, zero-shot, few-shot, and retrieval-augmented-settings. AppSelectBench covers one hundred widely used desktop applications and includes more than one hundred thousand realistic, diverse, and semantically grounded user tasks. Extensive experiments across both closed-source and open-source large language models reveal systematic strengths and weaknesses in inter-application reasoning, showing that even the most capable models still struggle to make consistent application choices. Together, these results establish AppSelectBench as a foundation for studying and advancing application level reasoning, an essential yet underexplored capability of intelligent CUAs. The source is available at this https URL.
zh

[NLP-32] EfficientXpert: Efficient Domain Adaptation for Large Language Models via Propagation-Aware Pruning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在特定领域(如医疗、法律和金融)部署时面临的两个核心问题:一是模型规模过大导致难以在资源受限环境中应用,二是现有压缩方法在跨领域迁移时性能下降或计算开销过高。其解决方案的关键在于提出EfficientXpert框架,该框架通过两个创新机制实现高效且领域自适应的模型压缩:一是基于传播感知的剪枝准则(Foresight Mask),能够识别并保留对特定领域任务至关重要的模型结构;二是高效的适配器更新算法(Partial Brain Surgeon),与LoRA微调流程无缝集成,在单步操作中将通用预训练模型转化为稀疏、领域特化的专家模型。实验表明,该方法在医疗和法律任务中可在40%稀疏度下保持高达98%的密集模型性能,显著优于当前最优方法,并揭示了领域依赖的结构变化对通用剪枝策略有效性的影响,强调了设计领域感知剪枝策略的必要性。

链接: https://arxiv.org/abs/2511.19935
作者: Songlin Zhao,Michael Pitts,Zhuwei Qin
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) has increased the demand for domain-specialized variants in areas such as law, healthcare, and finance. However, their large size remains a barrier to deployment in resource-constrained environments, and existing compression methods either generalize poorly across domains or incur high overhead. In this work, we propose \textbfEfficientXpert, a lightweight domain-pruning framework that combines a propagation-aware pruning criterion (Foresight Mask) with an efficient adapter-update algorithm (Partial Brain Surgeon). Integrated into the LoRA fine-tuning process, EfficientXpert enables a one-step transformation of general pretrained models into sparse, domain-adapted experts. Across health and legal tasks, it retains up to 98% of dense-model performance at 40% sparsity, outperforming state-of-the-art methods. Further analysis reveals substantial domain-dependent structural shifts that degrade the effectiveness of general pruning masks, underscoring the need for adaptive, domain-aware pruning strategies tailored to each domain.
zh

[NLP-33] CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding

【速读】: 该论文旨在解决当前视觉语言模型(Vision Language Models, VLMs)在视频理解中对反事实推理(counterfactual reasoning)能力不足的问题,即模型难以在假设条件下推断未发生事件的可能结果,这要求识别潜在因果结构并推理未观测可能性,而非仅依赖已观察到的模式。解决方案的关键在于提出一种后训练方法CFGPT,通过从语言模态中蒸馏反事实推理能力来增强模型的视觉反事实推理性能,在CounterVQA基准的不同难度层级上均实现一致提升。

链接: https://arxiv.org/abs/2511.19923
作者: Yuefei Chen,Jiang Liu,Xiaodong Lin,Ruixiang Tang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Vision Language Models (VLMs) have recently shown significant advancements in video understanding, especially in feature alignment, event reasoning, and instruction-following tasks. However, their capability for counterfactual reasoning, inferring alternative outcomes under hypothetical conditions, remains underexplored. This capability is essential for robust video understanding, as it requires identifying underlying causal structures and reasoning about unobserved possibilities, rather than merely recognizing observed patterns. To systematically evaluate this capability, we introduce CounterVQA, a video-based benchmark featuring three progressive difficulty levels that assess different aspects of counterfactual reasoning. Through comprehensive evaluation of both state-of-the-art open-source and closed-source models, we uncover a substantial performance gap: while these models achieve reasonable accuracy on simple counterfactual questions, performance degrades significantly on complex multi-hop causal chains. To address these limitations, we develop a post-training method, CFGPT, that enhances a model’s visual counterfactual reasoning ability by distilling its counterfactual reasoning capability from the language modality, yielding consistent improvements across all CounterVQA difficulty levels. Dataset and code will be further released.
zh

[NLP-34] MAPS: Preserving Vision-Language Representations via Module-Wise Proximity Scheduling for Better Vision-Language-Action Generalization

【速读】: 该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在微调过程中因破坏预训练视觉-语言模型(Vision-Language Model, VLM)先验表示而导致泛化能力下降的问题。现有方法如模块冻结或统一正则化,要么过度约束适应过程,要么忽视VLA各组件的功能差异。解决方案的关键在于提出MAPS(Module-Wise Proximity Scheduling)框架,通过系统性分析发现一个经验性的“接近约束松弛顺序”,并线性调度该顺序:使视觉编码器保持与预训练VLM的接近性以维持稳定性,同时允许面向动作的语言层更自由地适应,从而在稳定性和灵活性之间取得平衡。MAPS无需引入额外参数或数据,可无缝集成至现有VLA模型中,并在多个基准测试和真实机器人平台上显著提升性能(最高达+30%)。

链接: https://arxiv.org/abs/2511.19878
作者: Chengyue Huang,Mellon M. Zhang,Robert Azarcon,Glen Chou,Zsolt Kira
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models inherit strong priors from pretrained Vision-Language Models (VLMs), but naive fine-tuning often disrupts these representations and harms generalization. Existing fixes – freezing modules or applying uniform regularization – either overconstrain adaptation or ignore the differing roles of VLA components. We present MAPS (Module-Wise Proximity Scheduling), the first robust fine-tuning framework for VLAs. Through systematic analysis, we uncover an empirical order in which proximity constraints should be relaxed to balance stability and flexibility. MAPS linearly schedules this relaxation, enabling visual encoders to stay close to their pretrained priors while action-oriented language layers adapt more freely. MAPS introduces no additional parameters or data, and can be seamlessly integrated into existing VLAs. Across MiniVLA-VQ, MiniVLA-OFT, OpenVLA-OFT, and challenging benchmarks such as SimplerEnv, CALVIN, LIBERO, as well as real-world evaluations on the Franka Emika Panda platform, MAPS consistently boosts both in-distribution and out-of-distribution performance (up to +30%). Our findings highlight empirically guided proximity to pretrained VLMs as a simple yet powerful principle for preserving broad generalization in VLM-to-VLA transfer.
zh

[NLP-35] A Systematic Analysis of Large Language Models with RAG -enabled Dynamic Prompting for Medical Error Detection and Correction

【速读】: 该论文旨在解决临床文档中存在事实性、诊断及管理错误的问题,这些错误可能危及患者安全,而大型语言模型(Large Language Models, LLMs)在自动检测与修正此类错误方面的有效性及其在不同提示策略下的行为尚不明确。解决方案的关键在于对比三种提示策略:零样本提示(zero-shot prompting)、静态提示加随机示例(static prompting with random exemplars, SPR)以及检索增强的动态提示(retrieval-augmented dynamic prompting, RDP),并基于MEDEC数据集评估其在医疗错误处理三个子任务中的表现——错误标志检测、错误句子检测和错误修正。研究发现,RDP通过引入检索到的相关示例,显著降低了假阳性率(FPR)约15%,提升了错误句子检测的召回率5–10%,并生成更具上下文准确性的修正结果,从而在多种LLM上展现出最优性能。

链接: https://arxiv.org/abs/2511.19858
作者: Farzad Ahmed,Joniel Augustine Jerome,Meliha Yetisgen,Özlem Uzuner
机构: George Mason University (乔治梅森大学); University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Objective: Clinical documentation contains factual, diagnostic, and management errors that can compromise patient safety. Large language models (LLMs) may help detect and correct such errors, but their behavior under different prompting strategies remains unclear. We evaluate zero-shot prompting, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP) for three subtasks of medical error processing: error flag detection, error sentence detection, and error correction. Methods: Using the MEDEC dataset, we evaluated nine instruction-tuned LLMs (GPT, Claude, Gemini, and OpenAI o-series models). We measured performance using accuracy, recall, false-positive rate (FPR), and an aggregate score of ROUGE-1, BLEURT, and BERTScore for error correction. We also analyzed example outputs to identify failure modes and differences between LLM and clinician reasoning. Results: Zero-shot prompting showed low recall in both detection tasks, often missing abbreviation-heavy or atypical errors. SPR improved recall but increased FPR. Across all nine LLMs, RDP reduced FPR by about 15 percent, improved recall by 5 to 10 percent in error sentence detection, and generated more contextually accurate corrections. Conclusion: Across diverse LLMs, RDP outperforms zero-shot and SPR prompting. Using retrieved exemplars improves detection accuracy, reduces false positives, and enhances the reliability of medical error correction. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.19858 [cs.CL] (or arXiv:2511.19858v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2511.19858 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Farzad Ahmed [view email] [v1] Tue, 25 Nov 2025 02:40:49 UTC (1,607 KB)
zh

[NLP-36] Profile-LLM : Dynamic Profile Optimization for Realistic Personality Expression in LLM s

【速读】: 该论文旨在解决个性化大语言模型(Personalized Large Language Models, LLMs)在生成具有真实感人格表达时,因提示词(prompt)设计不足而导致的人格表现力有限的问题。现有方法多依赖心理学研究中的人格描述构建提示词,但未针对最大化人格特征表达进行优化。解决方案的关键在于提出PersonaPulse框架,该框架利用LLM自身对人格特质的先验知识,通过迭代优化角色扮演提示词,并引入情境化响应基准作为评分工具,实现更真实、上下文贴合的评估与引导,从而显著提升人格表达的强度和自然度。

链接: https://arxiv.org/abs/2511.19852
作者: Shi-Wei Dai,Yan-Wei Shie,Tsung-Huan Yang,Lun-Wei Ku,Yung-Hui Li
机构: Academia Sinica (中央研究院); AI Research Center, Hon Hai Research Institute (鸿海研究院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Personalized Large Language Models (LLMs) have been shown to be an effective way to create more engaging and enjoyable user-AI interactions. While previous studies have explored using prompts to elicit specific personality traits in LLMs, they have not optimized these prompts to maximize personality expression. To address this limitation, we propose PersonaPulse: Dynamic Profile Optimization for Realistic Personality Expression in LLMs, a framework that leverages LLMs’ inherent knowledge of personality traits to iteratively enhance role-play prompts while integrating a situational response benchmark as a scoring tool, ensuring a more realistic and contextually grounded evaluation to guide the optimization process. Quantitative evaluations demonstrate that the prompts generated by PersonaPulse outperform those of prior work, which were designed based on personality descriptions from psychological studies. Additionally, we explore the relationship between model size and personality modeling through extensive experiments. Finally, we find that, for certain personality traits, the extent of personality evocation can be partially controlled by pausing the optimization process. These findings underscore the importance of prompt optimization in shaping personality expression within LLMs, offering valuable insights for future research on adaptive AI interactions.
zh

[NLP-37] CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception

【速读】: 该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在需要细粒度图像理解的任务(如场景文本识别或文档分析)中表现不佳的问题,其根源在于感知能力受限和视觉信息碎片化。解决方案的关键在于提出CropVLM,一种外部低成本的增强方法,通过强化学习训练使VLM能够动态“聚焦”于图像中的相关区域,从而提升对细节的捕捉能力;该方法无需人类标注的边界框监督信号,也无需昂贵的合成评估,且仅需一次训练即可适配各类开源或专有VLM,显著提升性能而不修改或微调原模型,有效避免灾难性遗忘。

链接: https://arxiv.org/abs/2511.19820
作者: Miguel Carvalho,Helder Dias,Bruno Martins
机构: INESC-ID (Institute for Systems and Robotics); Instituto Superior Técnico (Technical University of Lisbon); University of Lisbon (里斯本大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) often struggle with tasks that require fine-grained image understanding, such as scene-text recognition or document analysis, due to perception limitations and visual fragmentation. To address these challenges, we introduce CropVLM as an external low-cost method for boosting performance, enabling VLMs to dynamically ‘‘zoom in’’ on relevant image regions, enhancing their ability to capture fine details. CropVLM is trained using reinforcement learning, without using human-labeled bounding boxes as a supervision signal, and without expensive synthetic evaluations. The model is trained once and can be paired with both open-source and proprietary VLMs to improve their performance. Our approach delivers significant improvements on tasks that require high-resolution image understanding, notably for benchmarks that are out-of-domain for the target VLM, without modifying or fine-tuning the VLM, thus avoiding catastrophic forgetting.
zh

[NLP-38] Language-Independent Sentiment Labelling with Distant Supervision: A Case Study for English Sepedi and Setswana

【速读】: 该论文旨在解决非洲低资源语言(如Sepedi和Setswana)在情感分析任务中因缺乏标注文本数据而难以构建有效模型的问题。其核心挑战在于手动标注成本高、效率低,限制了这些语言的情感计算研究进展。解决方案的关键在于提出一种语言无关的情感自动标注方法,通过融合情感承载的emoji和词汇信息,实现对社交媒体文本(如推文)的初步情感标签生成,从而显著减少人工校正的工作量——实验表明该方法在英语、Sepedi和Setswana语种上平均仅需34%的人工修正即可获得高质量标签。

链接: https://arxiv.org/abs/2511.19818
作者: Koena Ronny Mabokela,Tim Schlippe,Mpho Raborife,Turgay Celik
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Published in the The Fourth Workshop on Processing Emotions, Decisions and Opinions (EDO 2023) at 10th Language Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC 2023), Poznań, Poland, 21-23 April 2023. ISBN: 978-83-232-4176-8

点击查看摘要

Abstract:Sentiment analysis is a helpful task to automatically analyse opinions and emotions on various topics in areas such as AI for Social Good, AI in Education or marketing. While many of the sentiment analysis systems are developed for English, many African languages are classified as low-resource languages due to the lack of digital language resources like text labelled with corresponding sentiment classes. One reason for that is that manually labelling text data is time-consuming and expensive. Consequently, automatic and rapid processes are needed to reduce the manual effort as much as possible making the labelling process as efficient as possible. In this paper, we present and analyze an automatic language-independent sentiment labelling method that leverages information from sentiment-bearing emojis and words. Our experiments are conducted with tweets in the languages English, Sepedi and Setswana from SAfriSenti, a multilingual sentiment corpus for South African languages. We show that our sentiment labelling approach is able to label the English tweets with an accuracy of 66%, the Sepedi tweets with 69%, and the Setswana tweets with 63%, so that on average only 34% of the automatically generated labels remain to be corrected.
zh

[NLP-39] Breaking Bad: Norms for Valence Arousal and Dominance for over 10k English Multiword Expressions

【速读】: 该论文旨在解决现有情感词典(如NRC VAD Lexicon v1)在多词表达(Multiword Expressions, MWEs)和近年高频词汇覆盖不足的问题。其解决方案的关键在于构建了一个扩展版本的NRC VAD Lexicon v2,新增了10,000个英文MWEs及其组成词的情感维度(Valence, Arousal, Dominance)的人工标注评分,并扩充了未登录词(unigrams)的覆盖范围,尤其纳入了自2018年以来使用频率显著上升的词汇。该版本共包含10,000个MWEs和25,000个单词,显著提升了情感语义资源的完整性与时效性,从而支持更广泛的语言处理、心理学、公共健康及社会科学等领域的研究。

链接: https://arxiv.org/abs/2511.19816
作者: Saif M. Mohammad
机构: National Research Council Canada (加拿大国家研究委员会)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Factor analysis studies have shown that the primary dimensions of word meaning are Valence (V), Arousal (A), and Dominance (D). Existing lexicons such as the NRC VAD Lexicon, published in 2018, include VAD association ratings for words. Here, we present a complement to it, which has human ratings of valence, arousal, and dominance for 10k English Multiword Expressions (MWEs) and their constituent words. We also increase the coverage of unigrams, especially words that have become more common since 2018. In all, the new NRC VAD Lexicon v2 now has entries for 10k MWEs and 25k words, in addition to the entries in v1. We show that the associations are highly reliable. We use the lexicon to examine emotional characteristics of MWEs, including: 1. The degree to which MWEs (idioms, noun compounds, and verb particle constructions) exhibit strong emotionality; 2. The degree of emotional compositionality in MWEs. The lexicon enables a wide variety of research in NLP, Psychology, Public Health, Digital Humanities, and Social Sciences. The NRC VAD Lexicon v2 is freely available through the project webpage: this http URL
zh

[NLP-40] raining-Free Generation of Diverse and High-Fidelity Images via Prompt Semantic Space Optimization

【速读】: 该论文旨在解决文本到图像扩散模型中存在的图像多样性不足问题(image diversity),即模型倾向于重复生成某些主导模式的样本,导致采样冗余并影响创意探索与下游应用。其解决方案的关键在于提出一种无需训练且与模型无关的模块——Token-Prompt Embedding Space Optimization (TPSO),通过在token嵌入空间中引入可学习参数来探索被低估的区域,从而降低模型对强模式的依赖;同时利用提示层面的空间约束实现全局语义调控,防止分布偏移引发的图像质量下降,有效提升生成多样性而不牺牲保真度。

链接: https://arxiv.org/abs/2511.19811
作者: Debin Meng,Chen Jin,Zheng Gao,Yanran Li,Ioannis Patras,Georgios Tzimiropoulos
机构: Queen Mary University of London (伦敦玛丽女王大学); Centre for AI, AstraZeneca (阿斯利康人工智能中心); University of Bedfordshire (贝德福德大学); Samsung AI Center (三星人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: under review

点击查看摘要

Abstract:Image diversity remains a fundamental challenge for text-to-image diffusion models. Low-diversity models tend to generate repetitive outputs, increasing sampling redundancy and hindering both creative exploration and downstream applications. A primary cause is that generation often collapses toward a strong mode in the learned distribution. Existing attempts to improve diversity, such as noise resampling, prompt rewriting, or steering-based guidance, often still collapse to dominant modes or introduce distortions that degrade image quality. In light of this, we propose Token-Prompt embedding Space Optimization (TPSO), a training-free and model-agnostic module. TPSO introduces learnable parameters to explore underrepresented regions of the token embedding space, reducing the tendency of the model to repeatedly generate samples from strong modes of the learned distribution. At the same time, the prompt-level space provides a global semantic constraint that regulates distribution shifts, preventing quality degradation while maintaining high fidelity. Extensive experiments on MS-COCO and three diffusion backbones show that TPSO significantly enhances generative diversity, improving baseline performance from 1.10 to 4.18 points, without sacrificing image quality. Code will be released upon acceptance.
zh

[NLP-41] Gender Bias in Emotion Recognition by Large Language Models AAAI2026

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在情感理论心智(emotional theory of mind)任务中可能存在的性别偏见问题,即当模型被给予一个人及其环境的描述并要求判断其情绪状态时,是否会表现出对不同性别的刻板印象或不公平倾向。研究发现,仅通过推理阶段的提示工程(prompt engineering)等干预手段难以有效降低偏见,而关键在于采用基于训练数据的干预策略,如在训练过程中引入去偏方法,才能实现显著且有意义的偏见缓解效果。

链接: https://arxiv.org/abs/2511.19785
作者: Maureen Herbert,Katie Sun,Angelica Lim,Yasaman Etesam
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted at AAAI 2026 Workshop (WS37)

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) and their growing integration into daily life underscore the importance of evaluating and ensuring their fairness. In this work, we examine fairness within the domain of emotional theory of mind, investigating whether LLMs exhibit gender biases when presented with a description of a person and their environment and asked, “How does this person feel?”. Furthermore, we propose and evaluate several debiasing strategies, demonstrating that achieving meaningful reductions in bias requires training based interventions rather than relying solely on inference-time prompt-based approaches such as prompt engineering.
zh

[NLP-42] Scaling Agent ic Reinforcement Learning for Tool-Integrated Reasoning in VLMs

【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在多步视觉交互推理能力上的局限性,即模型难以实现“以图像为媒介的思考”(think with images),尤其是在工具选择、调用与协调方面的不足。解决方案的关键在于提出一个可扩展的训练环境 VISTA-Gym,该环境统一了来自13个数据集的7类真实世界多模态推理任务,并提供标准化的视觉工具接口(如目标定位、结构解析)、可执行的交互循环、可验证的反馈信号以及高效的轨迹日志记录机制,从而支持大规模视觉代理强化学习(visual agentic reinforcement learning)。在此基础上,作者训练出 VISTA-R1 模型,通过多轮轨迹采样和端到端强化学习,使模型能够将工具使用与代理式推理有机结合,在11个公开的高推理强度VQA基准测试中显著优于同类规模的最先进基线模型(提升9.51%–18.72%),验证了VISTA-Gym作为激发VLM工具集成推理能力的有效训练平台。

链接: https://arxiv.org/abs/2511.19773
作者: Meng Lu,Ran Xu,Yi Fang,Wenxuan Zhang,Yue Yu,Gaurav Srivastava,Yuchen Zhuang,Mohamed Elhoseiny,Charles Fleming,Carl Yang,Zhengzhong Tu,Yang Xie,Guanghua Xiao,Hanrui Wang,Di Jin,Wenqi Shi,Xuan Wang
机构: Virginia Tech (弗吉尼亚理工学院); Emory University (埃默里大学); KAUST (沙特阿卜杜拉国王科技大学); Georgia Tech (佐治亚理工学院); Cisco (思科); TAMU (德州农工大学); UT Southwestern Medical Center (西南医学中心); Eigen AI (Eigen人工智能)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 9 figures, work in progress

点击查看摘要

Abstract:While recent vision-language models (VLMs) demonstrate strong image understanding, their ability to “think with images”, i.e., to reason through multi-step visual interactions, remains limited. We introduce VISTA-Gym, a scalable training environment for incentivizing tool-integrated visual reasoning capabilities in VLMs. VISTA-Gym unifies diverse real-world multimodal reasoning tasks (7 tasks from 13 datasets in total) with a standardized interface for visual tools (e.g., grounding, parsing), executable interaction loops, verifiable feedback signals, and efficient trajectory logging, enabling visual agentic reinforcement learning at scale. While recent VLMs exhibit strong text-only reasoning, both proprietary and open-source models still struggle with tool selection, invocation, and coordination. With VISTA-Gym, we train VISTA-R1 to interleave tool-use with agentic reasoning via multi-turn trajectory sampling and end-to-end reinforcement learning. Extensive experiments across 11 public reasoning-intensive VQA benchmarks show that VISTA-R1-8B outperforms state-of-the-art baselines with similar sizes by 9.51%-18.72%, demonstrating VISTA-Gym as an effective training ground to unlock the tool-integrated reasoning capabilities for VLMs.
zh

[NLP-43] What does it mean to understand language?

【速读】: 该论文试图解决的问题是:如何实现对语言的深层理解,即超越表面语义提取,构建对语言描述情境的丰富心理模型(mental models)。其解决方案的关键在于提出并论证一个核心假设——由于大脑核心语言系统的信息处理能力存在根本限制,深度语言理解必须将信息从语言系统导出至其他脑区,这些脑区负责计算感知与运动表征、构建心理模型以及存储世界知识和自传体记忆。论文进一步指出,认知神经科学的最新进展为验证这一假设提供了概念基础和实验方法,从而开辟了一条揭示语言理解在认知与神经层面本质的新路径。

链接: https://arxiv.org/abs/2511.19757
作者: Colton Casto,Anna Ivanova,Evelina Fedorenko,Nancy Kanwisher
机构: Harvard University (哈佛大学); Georgia Institute of Technology (佐治亚理工学院); Massachusetts Institute of Technology (麻省理工学院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Language understanding entails not just extracting the surface-level meaning of the linguistic input, but constructing rich mental models of the situation it describes. Here we propose that because processing within the brain’s core language system is fundamentally limited, deeply understanding language requires exporting information from the language system to other brain regions that compute perceptual and motor representations, construct mental models, and store our world knowledge and autobiographical memories. We review the existing evidence for this hypothesis, and argue that recent progress in cognitive neuroscience provides both the conceptual foundation and the methods to directly test it, thus opening up a new strategy to reveal what it means, cognitively and neurally, to understand language.
zh

[NLP-44] Comparative Analysis of LoRA-Adapted Embedding Models for Clinical Cardiology Text Representation

【速读】: 该论文旨在解决临床自然语言处理(Clinical Natural Language Processing, CNLP)中领域特定文本嵌入(domain-specific text embeddings)性能评估不足的问题,特别是缺乏对不同模型架构在心血管专科文本上的系统性比较。其解决方案的关键在于采用低秩适应(Low-Rank Adaptation, LoRA)微调方法,在来自权威医学教材的106,535个心血管文本对数据集上对十种基于Transformer的嵌入模型进行适配与评估,结果表明仅编码器架构(如BioLinkBERT)在保持高领域区分度(分离得分:0.510)的同时显著降低计算资源消耗,从而挑战了“更大语言模型必然产生更优领域嵌入”的假设,并为临床NLP系统的开发提供了实证依据和可复现的工具链。

链接: https://arxiv.org/abs/2511.19739
作者: Richard J. Young,Alice M. Matthews
机构: University of Nevada Las Vegas (内华达大学拉斯维加斯分校); Concorde Career Colleges (康科德职业学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 25 pages, 13 figures, 5 tables

点击查看摘要

Abstract:Domain-specific text embeddings are critical for clinical natural language processing, yet systematic comparisons across model architectures remain limited. This study evaluates ten transformer-based embedding models adapted for cardiology through Low-Rank Adaptation (LoRA) fine-tuning on 106,535 cardiology text pairs derived from authoritative medical textbooks. Results demonstrate that encoder-only architectures, particularly BioLinkBERT, achieve superior domain-specific performance (separation score: 0.510) compared to larger decoder-based models, while requiring significantly fewer computational resources. The findings challenge the assumption that larger language models necessarily produce better domain-specific embeddings and provide practical guidance for clinical NLP system development. All models, training code, and evaluation datasets are publicly available to support reproducible research in medical informatics.
zh

[NLP-45] Can LLM s Faithfully Explain Themselves in Low-Resource Languages? A Case Study on Emotion Detection in Persian

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在低资源语言(如波斯语)中生成解释时的可信度(faithfulness)问题,即模型生成的自解释是否真实反映其决策依据。解决方案的关键在于通过对比LLM识别的关键词与人工标注者识别的关键词,并利用基于token级对数概率的置信度分数评估解释的可信度,同时测试两种提示策略(先预测后解释 vs. 先解释后预测)对解释可信度的影响。结果表明,尽管LLM在情感分类任务中表现良好,但其生成的解释往往偏离人类判断,且不同模型间的解释一致性高于与人类标注的一致性,凸显了现有解释方法和评估指标在多语言及低资源场景下的局限性。

链接: https://arxiv.org/abs/2511.19719
作者: Mobina Mehrazar,Mohammad Amin Yousefi,Parisa Abolfath Beygi,Behnam Bahrak
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used to generate self-explanations alongside their predictions, a practice that raises concerns about the faithfulness of these explanations, especially in low-resource languages. This study evaluates the faithfulness of LLM-generated explanations in the context of emotion classification in Persian, a low-resource language, by comparing the influential words identified by the model against those identified by human annotators. We assess faithfulness using confidence scores derived from token-level log-probabilities. Two prompting strategies, differing in the order of explanation and prediction (Predict-then-Explain and Explain-then-Predict), are tested for their impact on explanation faithfulness. Our results reveal that while LLMs achieve strong classification performance, their generated explanations often diverge from faithful reasoning, showing greater agreement with each other than with human judgments. These results highlight the limitations of current explanation methods and metrics, emphasizing the need for more robust approaches to ensure LLM reliability in multilingual and low-resource contexts.
zh

[NLP-46] Fara-7B: An Efficient Agent ic Model for Computer Use

【速读】: 该论文旨在解决计算机使用代理(Computer Use Agents, CUAs)发展受限于缺乏大规模高质量数据集的问题,尤其是缺少能够真实反映人类与计算机交互行为的轨迹数据。现有大语言模型(LLMs)得益于海量文本数据取得显著进展,但CUA领域尚无类似规模的轨迹数据集。其解决方案的关键在于提出FaraGen——一个用于多步骤网页任务的合成数据生成系统,该系统能从高频网站中提出多样化任务、生成多种解题尝试,并通过多验证器筛选出成功轨迹,从而实现高吞吐量、高产出率和高多样性;基于此数据训练得到的小型本地化CUA模型Fara-7B仅依赖屏幕截图感知环境并预测坐标执行操作,在WebVoyager、Online-Mind2Web及新提出的WebTailBench等基准上表现优于同规模模型,且媲美更大规模前沿模型,验证了可扩展数据生成对推动高效小模型发展的关键作用。

链接: https://arxiv.org/abs/2511.19663
作者: Ahmed Awadallah,Yash Lara,Raghav Magazine,Hussein Mozannar,Akshay Nambi,Yash Pandya,Aravind Rajeswaran,Corby Rosset,Alexey Taymanov,Vibhav Vineet,Spencer Whitehead,Andrew Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Progress in computer use agents (CUAs) has been constrained by the absence of large and high-quality datasets that capture how humans interact with a computer. While LLMs have thrived on abundant textual data, no comparable corpus exists for CUA trajectories. To address these gaps, we introduce FaraGen, a novel synthetic data generation system for multi-step web tasks. FaraGen can propose diverse tasks from frequently used websites, generate multiple solution attempts, and filter successful trajectories using multiple verifiers. It achieves high throughput, yield, and diversity for multi-step web tasks, producing verified trajectories at approximately 1 each. We use this data to train Fara-7B, a native CUA model that perceives the computer using only screenshots, executes actions via predicted coordinates, and is small enough to run on-device. We find that Fara-7B outperforms other CUA models of comparable size on benchmarks like WebVoyager, Online-Mind2Web, and WebTailBench – our novel benchmark that better captures under-represented web tasks in pre-existing benchmarks. Furthermore, Fara-7B is competitive with much larger frontier models, illustrating key benefits of scalable data generation systems in advancing small efficient agentic models. We are making Fara-7B open-weight on Microsoft Foundry and HuggingFace, and we are releasing WebTailBench.
zh

[NLP-47] Efficient Multi-Hop Question Answering over Knowledge Graphs via LLM Planning and Embedding-Guided Search

【速读】: 该论文旨在解决知识图谱上的多跳问答(multi-hop question answering over knowledge graphs)中存在的计算效率低和答案可验证性差的问题。现有方法依赖昂贵的大语言模型(Large Language Model, LLM)进行实体链接与路径排序,导致部署受限,且生成的答案缺乏结构化知识的可追溯性。解决方案的关键在于提出两种互补的混合算法:一是LLM-Guided Planning,通过一次LLM调用预测关系序列并结合广度优先搜索执行推理,实现近完美的准确率(micro-F1 0.90)并保证所有答案均基于知识图谱结构化推理;二是Embedding-Guided Neural Search,完全摒弃LLM调用,利用轻量级6.7M参数边评分器融合文本与图嵌入,实现超100倍的速度提升且保持竞争力的准确率。进一步通过知识蒸馏将规划能力压缩至4B参数模型,在无API费用下达到大模型性能,表明可验证的多跳推理无需大规模模型,而需架构层面结合符号结构与学习表示的归纳偏置。

链接: https://arxiv.org/abs/2511.19648
作者: Manil Shrestha,Edward Kim
机构: Drexel University (德雷塞尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-hop question answering over knowledge graphs remains computationally challenging due to the combinatorial explosion of possible reasoning paths. Recent approaches rely on expensive Large Language Model (LLM) inference for both entity linking and path ranking, limiting their practical deployment. Additionally, LLM-generated answers often lack verifiable grounding in structured knowledge. We present two complementary hybrid algorithms that address both efficiency and verifiability: (1) LLM-Guided Planning that uses a single LLM call to predict relation sequences executed via breadth-first search, achieving near-perfect accuracy (micro-F1 0.90) while ensuring all answers are grounded in the knowledge graph, and (2) Embedding-Guided Neural Search that eliminates LLM calls entirely by fusing text and graph embeddings through a lightweight 6.7M-parameter edge scorer, achieving over 100 times speedup with competitive accuracy. Through knowledge distillation, we compress planning capability into a 4B-parameter model that matches large-model performance at zero API cost. Evaluation on MetaQA demonstrates that grounded reasoning consistently outperforms ungrounded generation, with structured planning proving more transferable than direct answer generation. Our results show that verifiable multi-hop reasoning does not require massive models at inference time, but rather the right architectural inductive biases combining symbolic structure with learned representations.
zh

[NLP-48] Studying Maps at Scale: A Digital Investigation of Cartography and the Evolution of Figuration

【速读】: 该论文旨在解决如何从文化视角大规模研究地图遗产(cartographic heritage)的问题,尤其关注地图作为语义符号系统和文化对象所承载的政治与认知期待,而现有自动化方法多局限于技术层面的内容识别,缺乏对地图历史与意义的深入理解。其解决方案的关键在于构建了一个包含771,561条记录和99,715幅数字化图像的多样化语料库,并结合语义分割(semantic segmentation)与目标检测(object detection)模型,对土地类别和制图符号进行通用识别;同时通过分析制图符号的潜在视觉空间分布,揭示了图示演变规律(如等高线取代晕滃法)及其局部一致性特征,进而阐明政治动态、军事冲突与城市网络在制图规范传播中的作用机制。

链接: https://arxiv.org/abs/2511.19538
作者: Remi Petitpierre
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Digital Libraries (cs.DL)
备注: PhD thesis, EPFL. 396 pages, 156 figures

点击查看摘要

Abstract:This thesis presents methods and datasets to investigate cartographic heritage on a large scale and from a cultural perspective. Heritage institutions worldwide have digitized more than one million maps, and automated techniques now enable large-scale recognition and extraction of map content. Yet these methods have engaged little with the history of cartography, or the view that maps are semantic-symbolic systems, and cultural objects reflecting political and epistemic expectations. This work leverages a diverse corpus of 771,561 map records and 99,715 digitized images aggregated from 38 digital catalogs. After normalization, the dataset includes 236,925 contributors and spans six centuries, from 1492 to 1948. These data make it possible to chart geographic structures and the global chronology of map publication. The spatial focus of cartography is analyzed in relation to political dynamics, evidencing links between Atlantic maritime charting, the triangular trade, and colonial expansion. Further results document the progression of national, domestic focus and the impact of military conflicts on publication volumes. The research introduces semantic segmentation techniques and object detection models for the generic recognition of land classes and cartographic signs, trained on annotated data and synthetic images. The analysis of land classes shows that maps are designed images whose framing and composition emphasize features through centering and semantic symmetries. The study of cartographic figuration encodes 63 M signs and 25 M fragments into a latent visual space, revealing figurative shifts such as the replacement of relief hachures by terrain contours and showing that signs tend to form locally consistent systems. Analyses of collaboration and diffusion highlight the role of legitimacy, larger actors, and major cities in the spread of figurative norms and semiotic cultures.
zh

[NLP-49] Quantifying Modality Contributions via Disentangling Multimodal Representations

【速读】: 该论文旨在解决多模态模型中模态贡献量化难题,现有方法常将模态的“贡献”与“影响”混淆,依赖性能下降来衡量模态重要性,但这类基于结果(outcome-based)的指标无法区分模态是本身具有信息价值,还是仅通过与其他模态交互才体现价值。尤其在交叉注意力(cross-attention)架构中,模态间存在复杂的信息交互,传统方法难以准确刻画其作用机制。论文的关键解决方案是引入部分信息分解(Partial Information Decomposition, PID),将内部嵌入中的预测信息分解为唯一(unique)、冗余(redundant)和协同(synergistic)三类成分,从而实现对各模态贡献的精细解构;并进一步提出基于迭代比例拟合过程(Iterative Proportional Fitting Procedure, IPFP)的算法,在无需重新训练的前提下,实现层级和数据集级别的可扩展、推理-only分析,提供了更清晰、可解释的多模态表示层面行为理解。

链接: https://arxiv.org/abs/2511.19470
作者: Padegal Amit,Omkar Mahesh Kashyap,Namitha Rayasam,Nidhi Shekhar,Surabhi Narayan
机构: PES University (PES大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 16 pages, 11 figures

点击查看摘要

Abstract:Quantifying modality contributions in multimodal models remains a challenge, as existing approaches conflate the notion of contribution itself. Prior work relies on accuracy-based approaches, interpreting performance drops after removing a modality as indicative of its influence. However, such outcome-driven metrics fail to distinguish whether a modality is inherently informative or whether its value arises only through interaction with other modalities. This distinction is particularly important in cross-attention architectures, where modalities influence each other’s representations. In this work, we propose a framework based on Partial Information Decomposition (PID) that quantifies modality contributions by decomposing predictive information in internal embeddings into unique, redundant, and synergistic components. To enable scalable, inference-only analysis, we develop an algorithm based on the Iterative Proportional Fitting Procedure (IPFP) that computes layer and dataset-level contributions without retraining. This provides a principled, representation-level view of multimodal behavior, offering clearer and more interpretable insights than outcome-based metrics.
zh

[NLP-50] BlockCert: Certified Blockwise Extraction of Transformer Mechanisms

【速读】: 该论文旨在解决生成式 AI(Generative AI)中机制可解释性(mechanistic interpretability)与模型编辑(model editing)领域缺乏形式化保障的问题,即现有方法通常依赖非正式证据和临时实验,难以确保提取或编辑后的模型在关键输入上偏离原始模型的程度。其解决方案的关键在于提出 BlockCert 框架,通过结构化地提取 Transformer 残差块(residual block)的替代实现,并附带机器可验证的证书,以界定了近似误差、记录覆盖度指标并哈希底层产物;进一步利用 Lean 4 中的形式化 Lipschitz 基础组合定理,将局部保证提升为全局偏差边界,从而为块级提取与局部编辑提供可证明的安全性支撑。

链接: https://arxiv.org/abs/2511.17645
作者: Sandro Andric
机构: New York University (纽约大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 16 pages, 1 figure

点击查看摘要

Abstract:Mechanistic interpretability aspires to reverse-engineer neural networks into explicit algorithms, while model editing seeks to modify specific behaviours without retraining. Both areas are typically evaluated with informal evidence and ad-hoc experiments, with few explicit guarantees about how far an extracted or edited model can drift from the original on relevant inputs. We introduce BlockCert, a framework for certified blockwise extraction of transformer mechanisms, and outline how a lightweight extension can support certified local edits. Given a pre-trained transformer and a prompt distribution, BlockCert extracts structured surrogate implementations for residual blocks together with machine-checkable certificates that bound approximation error, record coverage metrics, and hash the underlying artifacts. We formalize a simple Lipschitz-based composition theorem in Lean 4 that lifts these local guarantees to a global deviation bound. Empirically, we apply the framework to GPT-2 small, TinyLlama-1.1B-Chat, and Llama-3.2-3B. Across these models we obtain high per-block coverage and small residual errors on the evaluated prompts, and in the TinyLlama setting we show that a fully stitched model matches the baseline perplexity within approximately 6e-5 on stress prompts. Our results suggest that blockwise extraction with explicit certificates is feasible for real transformer language models and offers a practical bridge between mechanistic interpretability and formal reasoning about model behaviour.
zh

计算机视觉

[CV-0] RubricRL: Simple Generalizable Rewards for Text-to-Image Generation

【速读】:该论文旨在解决文本到图像生成模型在强化学习(Reinforcement Learning, RL)对齐过程中奖励设计缺乏可解释性与灵活性的问题。现有方法通常依赖固定权重的复合指标(如CLIP、OCR和真实感评分)或从人类偏好模型中蒸馏出的单一标量奖励,难以满足用户对特定视觉属性的精细化控制需求。其解决方案的关键在于提出RubricRL框架,通过为每个提示动态构建结构化的评分清单(rubric),将生成结果分解为细粒度的视觉标准(如物体正确性、属性准确性、OCR保真度和真实感),并由多模态判别器独立评估各维度,再结合提示自适应加权机制突出相关特征。该设计不仅提供模块化且可解释的监督信号用于策略优化(如GRPO或PPO),还允许用户直接调整不同维度的奖励权重,从而实现更灵活、可控且通用的文本到图像生成模型对齐。

链接: https://arxiv.org/abs/2511.20651
作者: Xuelu Feng,Yunsheng Li,Ziyu Wan,Zixuan Gao,Junsong Yuan,Dongdong Chen,Chunming Qiao
机构: University at Buffalo (纽约州立大学布法罗分校); Microsoft CoreAI (微软核心人工智能部门); Nikola Tesla STEM High School (尼古拉·特斯拉科技高中)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has recently emerged as a promising approach for aligning text-to-image generative models with human preferences. A key challenge, however, lies in designing effective and interpretable rewards. Existing methods often rely on either composite metrics (e.g., CLIP, OCR, and realism scores) with fixed weights or a single scalar reward distilled from human preference models, which can limit interpretability and flexibility. We propose RubricRL, a simple and general framework for rubric-based reward design that offers greater interpretability, composability, and user control. Instead of using a black-box scalar signal, RubricRL dynamically constructs a structured rubric for each prompt–a decomposable checklist of fine-grained visual criteria such as object correctness, attribute accuracy, OCR fidelity, and realism–tailored to the input text. Each criterion is independently evaluated by a multimodal judge (e.g., o4-mini), and a prompt-adaptive weighting mechanism emphasizes the most relevant dimensions. This design not only produces interpretable and modular supervision signals for policy optimization (e.g., GRPO or PPO), but also enables users to directly adjust which aspects to reward or penalize. Experiments with an autoregressive text-to-image model demonstrate that RubricRL improves prompt faithfulness, visual detail, and generalizability, while offering a flexible and extensible foundation for interpretable RL alignment across text-to-image architectures.
zh

[CV-1] MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities

【速读】:该论文旨在解决医学影像中传统目标检测模型在封闭集(closed-set)范式下无法识别新类别对象的问题,即缺乏对未见标签结构的检测能力。其核心解决方案是提出首个实时开放词汇目标检测模型 MedROV,关键创新在于:1)构建大规模多模态医学图像检测数据集 Omnis(含600K样本),并设计伪标签策略处理多源数据中的标注缺失问题;2)利用预训练基础模型的知识增强泛化能力,并通过对比学习和跨模态表示实现对已知与未知结构的有效检测。实验表明,MedROV 在保持70 FPS 实时性能的同时,相较现有最先进模型提升40 mAP50,显著优于封闭集检测器超过3 mAP50,确立了医学图像检测的新基准。

链接: https://arxiv.org/abs/2511.20650
作者: Tooba Tehreem Sheikh,Jean Lahoud,Rao Muhammad Anwer,Fahad Shahbaz Khan,Salman Khan,Hisham Cholakkal
机构: Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Traditional object detection models in medical imaging operate within a closed-set paradigm, limiting their ability to detect objects of novel labels. Open-vocabulary object detection (OVOD) addresses this limitation but remains underexplored in medical imaging due to dataset scarcity and weak text-image alignment. To bridge this gap, we introduce MedROV, the first Real-time Open Vocabulary detection model for medical imaging. To enable open-vocabulary learning, we curate a large-scale dataset, Omnis, with 600K detection samples across nine imaging modalities and introduce a pseudo-labeling strategy to handle missing annotations from multi-source datasets. Additionally, we enhance generalization by incorporating knowledge from a large pre-trained foundation model. By leveraging contrastive learning and cross-modal representations, MedROV effectively detects both known and novel structures. Experimental results demonstrate that MedROV outperforms the previous state-of-the-art foundation model for medical image detection with an average absolute improvement of 40 mAP50, and surpasses closed-set detectors by more than 3 mAP50, while running at 70 FPS, setting a new benchmark in medical detection. Our source code, dataset, and trained model are available at this https URL.
zh

[CV-2] Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout

【速读】:该论文旨在解决当前自回归视频扩散模型的三大核心瓶颈:(i)基于3D旋转位置编码(3D Rotary Positional Embedding, 3D-RoPE)的有限时间范围限制;(ii)在长序列生成过程中难以实现细粒度动作控制导致的提示响应迟缓;(iii)单次生成流中无法实现非连续的电影级转场。解决方案的关键在于提出一个训练-free的推理时框架——∞-RoPE,其由三个相互关联的组件构成:Block-Relativistic RoPE通过将时间编码重构为移动局部参考系,使新生成的潜在块相对于基础模型最大帧数进行相对旋转,同时向前回退早期块以保持相对时间几何结构,从而突破固定时间位置限制,实现无限时长视频生成;KV Flush通过仅保留两个潜在帧(全局锚点与最后一帧)刷新键值(KV)缓存,无需重新编码即可实现即时提示响应,保障细粒度动作控制;RoPE Cut则引入可控的时间RoPE坐标断点,支持单次连续生成流中的多段场景切换,实现电影级转场效果。三者协同构建了具备无限时域、可控性和影视化能力的视频扩散基础架构。

链接: https://arxiv.org/abs/2511.20649
作者: Hidir Yesiltepe,Tuna Han Salih Meral,Adil Kaan Akan,Kaan Oktay,Pinar Yanardag
机构: Virginia Tech (弗吉尼亚理工大学); fal
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Current autoregressive video diffusion models are constrained by three core bottlenecks: (i) the finite temporal horizon imposed by the base model’s 3D Rotary Positional Embedding (3D-RoPE), (ii) slow prompt responsiveness in maintaining fine-grained action control during long-form rollouts, and (iii) the inability to realize discontinuous cinematic transitions within a single generation stream. We introduce \infty -RoPE, a unified inference-time framework that addresses all three limitations through three interconnected components: Block-Relativistic RoPE, KV Flush, and RoPE Cut. Block-Relativistic RoPE reformulates temporal encoding as a moving local reference frame, where each newly generated latent block is rotated relative to the base model’s maximum frame horizon while earlier blocks are rotated backward to preserve relative temporal geometry. This relativistic formulation eliminates fixed temporal positions, enabling continuous video generation far beyond the base positional limits. To obtain fine-grained action control without re-encoding, KV Flush renews the KV cache by retaining only two latent frames, the global sink and the last generated latent frame, thereby ensuring immediate prompt responsiveness. Finally, RoPE Cut introduces controlled discontinuities in temporal RoPE coordinates, enabling multi-cut scene transitions within a single continuous rollout. Together, these components establish \infty -RoPE as a training-free foundation for infinite-horizon, controllable, and cinematic video diffusion. Comprehensive experiments show that \infty -RoPE consistently surpasses previous autoregressive models in overall VBench scores.
zh

[CV-3] LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在多目标三维检测(multi-object 3D detection)任务中能力缺失的问题,尽管现有VLMs在开放域二维描述与定位(open-ended 2D description and grounding)方面表现优异。其解决方案的关键在于提出一种原生面向VLM的框架——LocateAnything3D,将3D检测建模为一个next-token prediction问题,核心创新是引入一种短而明确的“视线链”(Chain-of-Sight, CoS)序列:首先在二维空间中识别物体,再逐步推断其距离、尺寸和姿态。该方法通过两个层面的分层策略优化预测稳定性与学习效率:跨对象按由近及远顺序降低早期歧义并匹配以自我为中心的实用性;同一对象内则采用中心-尺寸-旋转的因子分解方式,按信息稳定性和可学习性排序。此设计无需专用检测头即可保持开放词汇(open-vocabulary)和视觉提示(visual-prompting)能力,在Omni3D基准上实现49.89 AP_3D,较此前最优结果提升15.51绝对分数,且具备零样本泛化能力和强鲁棒性。

链接: https://arxiv.org/abs/2511.20648
作者: Yunze Man,Shihao Wang,Guowen Zhang,Johan Bjorck,Zhiqi Li,Liang-Yan Gui,Jim Fan,Jan Kautz,Yu-Xiong Wang,Zhiding Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Tech report. Project page: this https URL

点击查看摘要

Abstract:To act in the world, a model must name what it sees and know where it is in 3D. Today’s vision-language models (VLMs) excel at open-ended 2D description and grounding, yet multi-object 3D detection remains largely missing from the VLM toolbox. We present LocateAnything3D, a VLM-native recipe that casts 3D detection as a next-token prediction problem. The key is a short, explicit Chain-of-Sight (CoS) sequence that mirrors how human reason from images: find an object in 2D, then infer its distance, size, and pose. The decoder first emits 2D detections as a visual chain-of-thought, then predicts 3D boxes under an easy-to-hard curriculum: across objects, a near-to-far order reduces early ambiguity and matches ego-centric utility; within each object, a center-from-camera, dimensions, and rotation factorization ranks information by stability and learnability. This VLM-native interface preserves open-vocabulary and visual-prompting capability without specialized heads. On the challenging Omni3D benchmark, our model achieves state-of-the-art results, with 49.89 AP_3D, surpassing the previous best by +15.51 absolute improvement even when the baseline is given ground-truth 2D boxes. It also generalizes zero-shot to held-out categories with strong robustness. By turning 3D detection into a disciplined next-token problem, LocateAnything3D offers a practical foundation for models to perceive in 3D.
zh

[CV-4] Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization

【速读】:该论文旨在解决当前文本到视频(Text-to-Video, T2V)扩散模型在单一文本提示下生成视频时多样性不足的问题,即多次采样产生的视频内容重复性高、缺乏变化。其解决方案的关键在于提出一种名为 DPP-GRPO 的新框架,该框架融合了行列式点过程(Determinantal Point Processes, DPPs)与组相对策略优化(Group Relative Policy Optimization, GRPO)理论:DPP 通过引入边际收益递减机制对冗余样本施加显式惩罚,从而强化多样性;GRPO 则提供群体层面的反馈信号,指导模型从候选视频集合中学习更优的多样化分布。该方法无需修改基础生成模型,具有良好的通用性和可扩展性,并在多个主流评估指标(如 VBench、VideoScore)和人类偏好实验中显著提升了视频多样性,同时保持了对文本提示的忠实度和视觉质量。

链接: https://arxiv.org/abs/2511.20647
作者: Tahira Kazimi,Connor Dunlop,Pinar Yanardag
机构: Virginia Tech (弗吉尼亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project webpage: this https URL

点击查看摘要

Abstract:While recent text-to-video (T2V) diffusion models have achieved impressive quality and prompt alignment, they often produce low-diversity outputs when sampling multiple videos from a single text prompt. We tackle this challenge by formulating it as a set-level policy optimization problem, with the goal of training a policy that can cover the diverse range of plausible outcomes for a given prompt. To address this, we introduce DPP-GRPO, a novel framework for diverse video generation that combines Determinantal Point Processes (DPPs) and Group Relative Policy Optimization (GRPO) theories to enforce explicit reward on diverse generations. Our objective turns diversity into an explicit signal by imposing diminishing returns on redundant samples (via DPP) while supplies groupwise feedback over candidate sets (via GRPO). Our framework is plug-and-play and model-agnostic, and encourages diverse generations across visual appearance, camera motions, and scene structure without sacrificing prompt fidelity or perceptual quality. We implement our method on WAN and CogVideoX, and show that our method consistently improves video diversity on state-of-the-art benchmarks such as VBench, VideoScore, and human preference studies. Moreover, we release our code and a new benchmark dataset of 30,000 diverse prompts to support future research.
zh

[CV-5] 3D-Aware Multi-Task Learning with Cross-View Correlations for Dense Scene Understanding

【速读】:该论文旨在解决多任务学习(Multi-Task Learning, MTL)中单一网络联合执行多个密集预测任务(如语义分割和深度估计)时,现有方法主要在二维图像空间捕捉跨任务关系,导致特征缺乏三维感知能力的问题。其解决方案的关键在于引入跨视角相关性(cross-view correlations),通过构建代价体(cost volume)作为几何一致性约束,集成到MTL网络中;具体提出一种轻量级的跨视角模块(Cross-view Module, CvM),该模块可跨任务共享、用于视图间信息交换并融合MTL编码器特征,从而提升模型对场景的三维理解能力,且该模块与网络架构无关,适用于单目和多目数据。

链接: https://arxiv.org/abs/2511.20646
作者: Xiaoye Wang,Chen Tang,Xiangyu Yue,Wei-Hong Li
机构: University of Cambridge (剑桥大学); The Chinese University of Hong Kong (香港中文大学); University of Bristol (布里斯托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 3D-aware Multi-task Learning, Cross-view Correlations, Code will be available at this https URL

点击查看摘要

Abstract:This paper addresses the challenge of training a single network to jointly perform multiple dense prediction tasks, such as segmentation and depth estimation, i.e., multi-task learning (MTL). Current approaches mainly capture cross-task relations in the 2D image space, often leading to unstructured features lacking 3D-awareness. We argue that 3D-awareness is vital for modeling cross-task correlations essential for comprehensive scene understanding. We propose to address this problem by integrating correlations across views, i.e., cost volume, as geometric consistency in the MTL network. Specifically, we introduce a lightweight Cross-view Module (CvM), shared across tasks, to exchange information across views and capture cross-view correlations, integrated with a feature from MTL encoder for multi-task predictions. This module is architecture-agnostic and can be applied to both single and multi-view data. Extensive results on NYUv2 and PASCAL-Context demonstrate that our method effectively injects geometric consistency into existing MTL methods to improve performance.
zh

[CV-6] PixelDiT: Pixel Diffusion Transformers for Image Generation

【速读】:该论文旨在解决扩散 Transformer(DiT)在潜空间(latent space)建模中存在的两阶段流水线问题,即预训练自编码器引入的有损重建会导致误差累积并阻碍联合优化。其核心解决方案是提出 PixelDiT,一种直接在像素空间(pixel space)中学习扩散过程的单阶段端到端模型,摒弃了传统依赖自编码器的架构。关键创新在于采用双层级设计:Patch-level DiT 用于捕捉全局语义信息,Pixel-level DiT 用于精细纹理重构,从而实现高效训练且保留细节的像素级扩散生成。

链接: https://arxiv.org/abs/2511.20645
作者: Yongsheng Yu,Wei Xiong,Weili Nie,Yichen Sheng,Shiqiu Liu,Jiebo Luo
机构: NVIDIA; University of Rochester (罗切斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. Our analysis reveals that effective pixel-level token modeling is essential to the success of pixel diffusion. PixelDiT achieves 1.61 FID on ImageNet 256x256, surpassing existing pixel generative models by a large margin. We further extend PixelDiT to text-to-image generation and pretrain it at the 1024x1024 resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models.
zh

[CV-7] Vision-Language Memory for Spatial Reasoning

【速读】:该论文旨在解决当前视觉-语言模型(Vision-Language Models, VLMs)在基于视频的空间推理任务中难以达到人类水平性能的问题,其核心挑战在于语义-几何错位导致的3D理解不一致,以及缺乏持久记忆以长期保留和更新3D表征。解决方案的关键在于提出VLM²,一种具备持久记忆机制的视觉-语言模型,通过引入双记忆模块——工作记忆(作为滑动窗口聚焦短期上下文)与情景记忆(固化并存储关键长期信息),实现视图一致且具备3D感知能力的2D视频驱动空间推理;该设计在保持固定计算成本的前提下显著提升了长时程空间推理能力,在多个基准测试中取得了视频单模态模型中的最先进性能。

链接: https://arxiv.org/abs/2511.20644
作者: Zuntao Liu,Yi Du,Taimeng Fu,Shaoshu Su,Cherie Ho,Chen Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spatial reasoning is a critical capability for intelligent robots, yet current vision-language models (VLMs) still fall short of human-level performance in video-based spatial reasoning. This gap mainly stems from two challenges: a semantic-geometric misalignment that prevents consistent 3D understanding, and the absence of persistent memory to retain 3D representation and understanding over time. To address these limitations, we present VLM ^2 , a Vision-Language Model with persistent Memory for spatial reasoning with a view-consistent, 3D-aware representation purely from 2D video. Specifically, to enhance long-horizon reasoning, we incorporate a dual-memory module, consisting of a working memory that operates as a sliding window to focus on immediate context, and an episodic memory that consolidates and stores critical long-term information. This design enables efficient and long-horizon spatial reasoning with a fixed computational cost. Extensive experiments on multiple benchmarks show that VLM ^2 achieves state-of-the-art performance among video-only models, significantly advancing the frontier of visual-spatial intelligence.
zh

[CV-8] Concept-Aware Batch Sampling Improves Language-Image Pretraining

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Model)训练数据选择中的关键问题:现有数据筛选方法多为离线且概念无关(offline and concept-agnostic),即基于预设规则生成静态数据集,且使用模型驱动的过滤器引入额外偏差,难以适配特定下游任务。解决方案的核心在于提出一种灵活、任务自适应的在线概念驱动数据筛选框架——Concept-Aware Batch Sampling (CABS),其关键创新是利用一个包含1.28亿条细粒度概念标注的图像-文本对数据集DataConcept,实现按需动态构建批次(batch),支持两种策略:多样性最大化(CABS-DM)以覆盖广泛概念,以及频率最大化(CABS-FM)以提升对象多重性。实验证明,CABS显著提升CLIP/SigLIP类模型性能,是一种可定制化、开源的替代方案,能根据下游任务需求优化概念分布。

链接: https://arxiv.org/abs/2511.20643
作者: Adhiraj Ghosh,Vishaal Udandarao,Thao Nguyen,Matteo Farina,Mehdi Cherti,Jenia Jitsev,Sewoong Oh,Elisa Ricci,Ludwig Schmidt,Matthias Bethge
机构: Tübingen AI Center, University of Tübingen (图宾根人工智能中心,图宾根大学); University of Cambridge (剑桥大学); University of Washington (华盛顿大学); University of Trento (特伦托大学); LAION; Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Tech Report

点击查看摘要

Abstract:What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most of these existing methods are (i) offline, i.e. they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e. they use model-based filters which induce additional data biases. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce Concept-Aware Batch Sampling (CABS), a simple yet effective batch sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. We propose two variants: (i) Diversity Maximization (CABS-DM) to curate batches with a broad coverage of available concepts, and (ii) Frequency Maximization (CABS-FM) to curate batches with high object multiplicity. Through extensive evaluations across 28 benchmarks, we demonstrate that our CABS method significantly benefits CLIP/SigLIP model classes and yields highly performant models. Overall, CABS represents a strong open-source alternative to proprietary online data curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks.
zh

[CV-9] Unleashing the Power of Vision-Language Models for Long-Tailed Multi-Label Visual Recognition

【速读】:该论文旨在解决长尾多标签视觉识别(long-tailed multi-label visual recognition)中的模型偏差问题,即在类别分布极度不均衡的情况下,现有方法因依赖数据稀缺的尾部类别所推导出的语义关联不可靠,且预训练视觉-语言模型(如CLIP)的零样本范式并不适配多标签任务。其解决方案的关键在于提出一种端到端框架——相关性自适应提示网络(Correlation Adaptation Prompt Network, CAPNET),该框架通过CLIP文本编码器显式建模标签间语义关系,结合图卷积网络实现标签感知传播,并引入可学习软提示(learnable soft prompts)优化嵌入表示;同时采用分布平衡的Focal损失与类别感知重加权策略提升训练稳定性,并通过测试时集成和参数高效微调增强泛化能力,从而在不牺牲头部类别性能的前提下显著改善尾部类别的识别效果。

链接: https://arxiv.org/abs/2511.20641
作者: Wei Tang,Zuo-Zheng Wang,Kun Zhang,Tong Wei,Min-Ling Zhang
机构: Southeast University (东南大学); Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) (穆罕默德·本·扎耶德人工智能大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Long-tailed multi-label visual recognition poses a significant challenge, as images typically contain multiple labels with highly imbalanced class distributions, leading to biased models that favor head classes while underperforming on tail classes. Recent efforts have leveraged pre-trained vision-language models, such as CLIP, alongside long-tailed learning techniques to exploit rich visual-textual priors for improved performance. However, existing methods often derive semantic inter-class relationships directly from imbalanced datasets, resulting in unreliable correlations for tail classes due to data scarcity. Moreover, CLIP’s zero-shot paradigm is optimized for single-label image-text matching, making it suboptimal for multi-label tasks. To address these issues, we propose the correlation adaptation prompt network (CAPNET), a novel end-to-end framework that explicitly models label correlations from CLIP’s textual encoder. The framework incorporates a graph convolutional network for label-aware propagation and learnable soft prompts for refined embeddings. It utilizes a distribution-balanced Focal loss with class-aware re-weighting for optimized training under imbalance. Moreover, it improves generalization through test-time ensembling and realigns visual-textual modalities using parameter-efficient fine-tuning to avert overfitting on tail classes without compromising head class performance. Extensive experiments and ablation studies on benchmarks including VOC-LT, COCO-LT, and NUS-WIDE demonstrate that CAPNET achieves substantial improvements over state-of-the-art methods, validating its effectiveness for real-world long-tailed multi-label visual recognition.
zh

[CV-10] MotionV2V: Editing Motion in a Video

【速读】:该论文旨在解决现有生成式视频模型在视频编辑任务中面临的挑战,即如何实现对已有视频运动信息的精确控制与修改。尽管当前生成式视频模型在图像质量和时序一致性方面已取得显著进展,但其在视频编辑场景下的可控性仍不足。解决方案的关键在于提出一种基于稀疏轨迹(sparse trajectories)的“运动编辑”(motion edit)表示方法:通过直接修改输入视频中提取的稀疏轨迹来改变运动模式,并构建包含相同内容但不同运动的视频对(称为“运动反事实”(motion counterfactuals)),进而利用该数据集微调一个运动条件驱动的视频扩散架构(motion-conditioned video diffusion architecture)。此方法使编辑可从任意时间戳开始并自然传播,实验证明其在四组用户对比测试中获得超过65%的偏好率,优于现有方法。

链接: https://arxiv.org/abs/2511.20640
作者: Ryan Burgert,Charles Herrmann,Forrester Cole,Michael S Ryoo,Neal Wadhwa,Andrey Voynov,Nataniel Ruiz
机构: Google(谷歌); Stony Brook University (石溪大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:While generative video models have achieved remarkable fidelity and consistency, applying these capabilities to video editing remains a complex challenge. Recent research has explored motion controllability as a means to enhance text-to-video generation or image animation; however, we identify precise motion control as a promising yet under-explored paradigm for editing existing videos. In this work, we propose modifying video motion by directly editing sparse trajectories extracted from the input. We term the deviation between input and output trajectories a “motion edit” and demonstrate that this representation, when coupled with a generative backbone, enables powerful video editing capabilities. To achieve this, we introduce a pipeline for generating “motion counterfactuals”, video pairs that share identical content but distinct motion, and we fine-tune a motion-conditioned video diffusion architecture on this dataset. Our approach allows for edits that start at any timestamp and propagate naturally. In a four-way head-to-head user study, our model achieves over 65 percent preference against prior work. Please see our project page: this https URL
zh

[CV-11] Montage: Unified Versatile Highly Dynamic Many-to-many Image Generation

【速读】:该论文旨在解决预训练视频模型在生成动态内容时受限于训练数据连续性而导致的多样性不足问题,即如何在保持时间连贯性的同时扩展生成内容的动态范围与多样性。解决方案的关键在于提出iMontage框架,通过一种优雅且侵入性极小的适配策略,将图像数据中丰富的、非受限的内容多样性注入到原本以视频为基础的时序框架中;同时结合定制化的数据整理流程和训练范式,使模型能够在不破坏原有运动先验(motion priors)的前提下,获得广泛的图像操作能力,从而实现从变长图像集合输入到输出的统一生成与编辑任务。

链接: https://arxiv.org/abs/2511.20635
作者: Zhoujie Fu,Xianfang Zeng,Jinghong Lan,Xinyao Liao,Cheng Chen,Junyi Chen,Jiacheng Wei,Wei Cheng,Shiyu Liu,Yunuo Chen,Gang Yu,Guosheng Lin
机构: Nanyang Technological University (南洋理工大学); StepFun; Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Pre-trained video models learn powerful priors for generating high-quality, temporally coherent content. While these models excel at temporal coherence, their dynamics are often constrained by the continuous nature of their training data. We hypothesize that by injecting the rich and unconstrained content diversity from image data into this coherent temporal framework, we can generate image sets that feature both natural transitions and a far more expansive dynamic range. To this end, we introduce iMontage, a unified framework designed to repurpose a powerful video model into an all-in-one image generator. The framework consumes and produces variable-length image sets, unifying a wide array of image generation and editing tasks. To achieve this, we propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors. iMontage excels across several mainstream many-in-many-out tasks, not only maintaining strong cross-image contextual consistency but also generating scenes with extraordinary dynamics that surpass conventional scopes. Find our homepage at: this https URL.
zh

[CV-12] MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models

【速读】:该论文旨在解决多奖励目标联合优化时出现的“对齐税”(alignment tax)问题,即在提升某一偏好维度性能的同时,往往会损害其他维度的表现。其解决方案的关键在于提出两种互补方法:一是MapReduce LoRA,通过并行训练特定偏好的LoRA专家,并迭代合并以精炼共享基础模型;二是Reward-aware Token Embedding (RaTE),学习与奖励相关的token嵌入,在推理阶段动态组合以实现灵活的偏好控制。这两种机制共同实现了跨模态(文本到图像、文本到视频、语言任务)的多偏好对齐,显著提升了多个评估指标的性能。

链接: https://arxiv.org/abs/2511.20629
作者: Chieh-Yun Chen,Zhonghao Wang,Qi Chen,Zhifan Ye,Min Shi,Yue Zhao,Yinan Zhao,Hui Qu,Wei-An Lin,Yiru Shen,Ajinkya Kale,Irfan Essa,Humphrey Shi
机构: Georgia Tech (佐治亚理工学院); Adobe (Adobe公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reinforcement learning from human feedback (RLHF) with reward models has advanced alignment of generative models to human aesthetic and perceptual preferences. However, jointly optimizing multiple rewards often incurs an alignment tax, improving one dimension while degrading others. To address this, we introduce two complementary methods: MapReduce LoRA and Reward-aware Token Embedding (RaTE). MapReduce LoRA trains preference-specific LoRA experts in parallel and iteratively merges them to refine a shared base model; RaTE learns reward-specific token embeddings that compose at inference for flexible preference control. Experiments on Text-to-Image generation (Stable Diffusion 3.5 Medium and FLUX.1-dev) show improvements of 36.1%, 4.6%, and 55.7%, and 32.7%, 4.3%, and 67.1% on GenEval, PickScore, and OCR, respectively. On Text-to-Video generation (HunyuanVideo), visual and motion quality improve by 48.1% and 90.0%, respectively. On the language task, Helpful Assistant, with Llama-2 7B, helpful and harmless improve by 43.4% and 136.7%, respectively. Our framework sets a new state-of-the-art multi-preference alignment recipe across modalities.
zh

[CV-13] ShapeGen: Towards High-Quality 3D Shape Synthesis SIGGRAPH

【速读】:该论文旨在解决当前图像到3D形状生成方法中存在的关键问题,包括生成结果缺乏精细细节、表面过于平滑以及薄壳结构易破碎等缺陷,这些问题导致生成的3D资产难以满足艺术家对高质量内容的要求。解决方案的关键在于三方面改进:一是优化3D表示与监督机制,二是实现分辨率的提升,三是利用线性Transformer(linear transformers)的优势。这些改进协同作用,显著提升了生成质量,使结果更适用于实际3D工作流,并达到了该领域的最新技术水平。

链接: https://arxiv.org/abs/2511.20624
作者: Yangguang Li,Xianglong He,Zi-Xin Zou,Zexiang Liu,Wanli Ouyang,Ding Liang,Yan-Pei Cao
机构: The Chinese University of Hong Kong (香港中文大学); VASTChina (中国科学院自动化研究所); Tsinghua University (清华大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to SIGGRAPH Asia 2025

点击查看摘要

Abstract:Inspired by generative paradigms in image and video, 3D shape generation has made notable progress, enabling the rapid synthesis of high-fidelity 3D assets from a single image. However, current methods still face challenges, including the lack of intricate details, overly smoothed surfaces, and fragmented thin-shell structures. These limitations leave the generated 3D assets still one step short of meeting the standards favored by artists. In this paper, we present ShapeGen, which achieves high-quality image-to-3D shape generation through 3D representation and supervision improvements, resolution scaling up, and the advantages of linear transformers. These advancements allow the generated assets to be seamlessly integrated into 3D pipelines, facilitating their widespread adoption across various applications. Through extensive experiments, we validate the impact of these improvements on overall performance. Ultimately, thanks to the synergistic effects of these enhancements, ShapeGen achieves a significant leap in image-to-3D generation, establishing a new state-of-the-art performance.
zh

[CV-14] Wanderland: Geometrically Grounded Simulation for Open-World Embodied AI

【速读】:该论文旨在解决开放世界具身智能(Embodied AI)中可复现的闭环评估难题,特别是由于视觉与几何领域的“仿真到现实”(sim-to-real)差距导致的基准测试不可靠问题。其核心挑战在于现有方法难以在复杂城市环境中实现高保真度的物理交互与感知模拟。解决方案的关键是提出Wanderland框架,该框架通过多传感器采集、可靠重建、精确几何建模和鲁棒视图合成技术,构建了一个从真实世界到仿真环境的高质量映射管道。这一方法不仅显著缩小了sim-to-real差距,还验证了仅依赖图像的管线在扩展性上的局限性以及几何质量对新视角合成和导航策略学习的影响,从而为具身导航提供了可信的测试平台,并支持3D重建与视图合成模型的联合基准测试。

链接: https://arxiv.org/abs/2511.20620
作者: Xinhao Liu,Jiaqi Li,Youming Deng,Ruxin Chen,Yingjia Zhang,Yifei Ma,Li Guo,Yiming Li,Jing Zhang,Chen Feng
机构: New York University (纽约大学); Cornell University (康奈尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Reproducible closed-loop evaluation remains a major bottleneck in Embodied AI such as visual navigation. A promising path forward is high-fidelity simulation that combines photorealistic sensor rendering with geometrically grounded interaction in complex, open-world urban environments. Although recent video-3DGS methods ease open-world scene capturing, they are still unsuitable for benchmarking due to large visual and geometric sim-to-real gaps. To address these challenges, we introduce Wanderland, a real-to-sim framework that features multi-sensor capture, reliable reconstruction, accurate geometry, and robust view synthesis. Using this pipeline, we curate a diverse dataset of indoor-outdoor urban scenes and systematically demonstrate how image-only pipelines scale poorly, how geometry quality impacts novel view synthesis, and how all of these adversely affect navigation policy learning and evaluation reliability. Beyond serving as a trusted testbed for embodied navigation, Wanderland’s rich raw sensor data further allows benchmarking of 3D reconstruction and novel view synthesis models. Our work establishes a new foundation for reproducible research in open-world embodied AI. Project website is at this https URL.
zh

[CV-15] Evaluating the Performance of Deep Learning Models in Whole-body Dynamic 3D Posture Prediction During Load-reaching Activities

【速读】:该论文旨在解决动态负载搬运任务中全身姿态预测的准确性问题,特别是如何利用深度神经网络有效捕捉时间序列运动数据中的依赖关系以提升预测精度。其解决方案的关键在于:(1)采用双向长短期记忆网络(Bidirectional Long Short-Term Memory, BLSTM)和Transformer两种时序建模架构进行对比学习;(2)提出一种新的损失函数优化策略,通过强制约束肢体段长度恒定来改善模型对当前及未来姿态的预测性能,显著降低了臂部和腿部预测误差(分别减少约8%和21%);(3)实验表明,基于Transformer的模型在长期预测上表现更优,均方根误差(Root-Mean-Square Error, RMSE)为47.0 mm,相比BLSTM模型提升了约58%的准确性。

链接: https://arxiv.org/abs/2511.20615
作者: Seyede Niloofar Hosseini,Ali Mojibi,Mahdi Mohseni,Navid Arjmand,Alireza Taheri
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 6 figures, 7 tables

点击查看摘要

Abstract:This study aimed to explore the application of deep neural networks for whole-body human posture prediction during dynamic load-reaching activities. Two time-series models were trained using bidirectional long short-term memory (BLSTM) and transformer architectures. The dataset consisted of 3D full-body plug-in gait dynamic coordinates from 20 normal-weight healthy male individuals each performing 204 load-reaching tasks from different load positions while adapting various lifting and handling techniques. The model inputs consisted of the 3D position of the hand-load position, lifting (stoop, full-squat and semi-squat) and handling (one- and two-handed) techniques, body weight and height, and the 3D coordinate data of the body posture from the first 25% of the task duration. These inputs were used by the models to predict body coordinates during the remaining 75% of the task period. Moreover, a novel method was proposed to improve the accuracy of the previous and present posture prediction networks by enforcing constant body segment lengths through the optimization of a new cost function. The results indicated that the new cost function decreased the prediction error of the models by approximately 8% and 21% for the arm and leg models, respectively. We indicated that utilizing the transformer architecture, with a root-mean-square-error of 47.0 mm, exhibited ~58% more accurate long-term performance than the BLSTM-based model. This study merits the use of neural networks that capture time series dependencies in 3D motion frames, providing a unique approach for understanding and predict motion dynamics during manual material handling activities.
zh

[CV-16] he Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment

【速读】:该论文旨在解决生成式图像在细粒度细节上的一致性问题(inconsistency problem),即现有生成模型在参考图像引导下的定制化生成任务中,难以保持局部细节的准确性和一致性。其解决方案的关键在于提出一种参考引导的后编辑方法 ImageCritic,通过构建基于视觉语言模型(VLM)选择与显式退化模拟的参考-退化-目标三元组数据集,精准刻画常见不一致现象;同时设计注意力对齐损失(attention alignment loss)和细节编码器(detail encoder),从模型内部机制出发进行精细化修正,并支持在智能体框架中实现多轮、局部化的自动检测与修复,从而显著提升复杂场景下生成图像的细节一致性。

链接: https://arxiv.org/abs/2511.20614
作者: Ziheng Ouyang,Yiren Song,Yaoli Liu,Shihao Zhu,Qibin Hou,Ming-Ming Cheng,Mike Zheng Shou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Previous works have explored various customized generation tasks given a reference image, but they still face limitations in generating consistent fine-grained details. In this paper, our aim is to solve the inconsistency problem of generated images by applying a reference-guided post-editing approach and present our ImageCritic. We first construct a dataset of reference-degraded-target triplets obtained via VLM-based selection and explicit degradation, which effectively simulates the common inaccuracies or inconsistencies observed in existing generation models. Furthermore, building on a thorough examination of the model’s attention mechanisms and intrinsic representations, we accordingly devise an attention alignment loss and a detail encoder to precisely rectify inconsistencies. ImageCritic can be integrated into an agent framework to automatically detect inconsistencies and correct them with multi-round and local editing in complex scenarios. Extensive experiments demonstrate that ImageCritic can effectively resolve detail-related issues in various customized generation scenarios, providing significant improvements over existing methods.
zh

[CV-17] Latent Diffusion Inversion Requires Understanding the Latent Space

【速读】:该论文旨在解决潜空间生成模型(如Latent Diffusion Models, LDMs)中训练数据隐私泄露问题,特别是针对基于得分的成员推断攻击(score-based membership inference attack)在低误报率条件下性能不足的问题。传统方法通常忽略编码器/解码器结构及其潜在空间中的几何特性,而本文发现扩散模型在潜空间中存在非均匀记忆现象:高失真区域的潜码更容易过拟合训练样本,且同一潜码的不同维度对记忆的贡献不均等。解决方案的关键在于提出一种系统性方法,通过量化每个潜维对解码器拉回度量(decoder pullback metric)的贡献程度来排序并筛选出最易导致记忆的维度;实验证明,在计算攻击统计量时剔除低记忆贡献维度后,平均AUROC提升2.7%,TPR@1%FPR显著提高6.42%,从而在极低误报容忍度下增强了对成员身份识别的置信度。这一成果揭示了自编码器几何结构对LDM记忆行为的重要影响,并为分析扩散模型的隐私风险提供了新视角。

链接: https://arxiv.org/abs/2511.20592
作者: Mingxing Rao,Bowen Qu,Daniel Moyer
机构: Vanderbilt University (范德比尔特大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 4 figures, 4 tables

点击查看摘要

Abstract:The recovery of training data from generative models (``model inversion’') has been extensively studied for diffusion models in the data domain. The encoder/decoder pair and corresponding latent codes have largely been ignored by inversion techniques applied to latent space generative models, e.g., Latent Diffusion models (LDMs). In this work we describe two key findings: (1) The diffusion model exhibits non-uniform memorization across latent codes, tending to overfit samples located in high-distortion regions of the decoder pullback metric. (2) Even within a single latent code, different dimensions contribute unequally to memorization. We introduce a principled method to rank latent dimensions by their per-dimensional contribution to the decoder pullback metric, identifying those most responsible for memorization. Empirically, removing less-memorizing dimensions when computing attack statistics for score-based membership inference attacker significantly improves performance, with average AUROC gains of 2.7% and substantial increases in TPR@1%FPR (6.42%) across diverse datasets including CIFAR-10, CelebA, ImageNet-1K, Pokémon, MS-COCO, and Flickr. This indicates stronger confidence in identifying members under extremely low false-positive tolerance. Our results highlight the overlooked influence of the auto-encoder geometry on LDM memorization and provide a new perspective for analyzing privacy risks in diffusion-based generative models.
zh

[CV-18] VQ-VA World: Towards High-Quality Visual Question-Visual Answering

【速读】:该论文旨在解决生成式 AI (Generative AI) 中视觉问答到视觉回答(Visual Question-Visual Answering, VQ-VA)的能力缺失问题,即模型需根据视觉问题生成图像而非文本作为输出。为实现这一目标,作者提出VQ-VA World框架,其核心在于构建一个以智能体(agentic)驱动的数据采集流水线,用于大规模、定向地生成高质量的图像-文本交错样本(约180万条),从而支持开放源代码模型训练。该方案的关键创新在于数据构建方法,通过网络级部署实现高效数据采集与清洗,并辅以Human-curated IntelligentBench基准系统评估世界知识、设计知识和推理能力,最终使LightFusion模型在该基准上得分提升至53.06,显著缩小了与主流闭源系统的差距。

链接: https://arxiv.org/abs/2511.20573
作者: Chenhui Gou,Zilong Chen,Zeyu Wang,Feng Li,Deyao Zhu,Zicheng Duan,Kunchang Li,Chaorui Deng,Hongyi Yuan,Haoqi Fan,Cihang Xie,Jianfei Cai,Hamid Rezatofighi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper studies Visual Question-Visual Answering (VQ-VA): generating an image, rather than text, in response to a visual question – an ability that has recently emerged in proprietary systems such as NanoBanana and GPT-Image. To also bring this capability to open-source models, we introduce VQ-VA World, a data-centric framework built around an agentic pipeline for large-scale, targeted data construction. Leveraging web-scale deployment, this pipeline crawls a massive amount of ~1.8M high-quality, interleaved image-text samples for model training. For evaluation, we further release IntelligentBench, a human-curated benchmark that systematically assesses VQ-VA along the aspects of world knowledge, design knowledge, and reasoning. Training with VQ-VA World data yields strong empirical gains: it helps LightFusion attain 53.06 on IntelligentBench, substantially surpassing the best prior open-source baselines (i.e., 7.78 from vanilla LightFusion; 1.94 from UniWorld-V1), and significantly narrowing the gap toward leading proprietary systems (e.g., 81.67 from NanoBanana; 82.64 from GPT-Image). By releasing the full suite of model weights, datasets, and pipelines, we hope to stimulate future research on VQ-VA.
zh

[CV-19] DINO-Tok: Adapting DINO for Visual Tokenizers

【速读】:该论文旨在解决现有视觉生成模型中视觉分词器(visual tokenizer)在高维潜在空间中难以同时平衡语义表达能力与重建保真度的问题,尤其是在从零开始训练的分词器中,常出现语义信息丢失和码本坍缩(codebook collapse)等现象。其解决方案的关键在于提出一种基于DINO的视觉分词器DINO-Tok,通过融合浅层保留细粒度细节的特征与深层编码全局语义的特征,构建一个信息完备的层次化潜在空间;并进一步设计全局主成分分析(PCA)重加权机制以稳定向量量化(VQ),防止关键信息在高维空间中丢失,从而实现语义对齐且高保真的潜在表示,显著提升图像重建性能,在ImageNet 256×256上达到28.54 PSNR(自编码)和23.98 PSNR(VQ建模),优于以往分词器并媲美百亿级数据训练模型。

链接: https://arxiv.org/abs/2511.20565
作者: Mingkai Jia,Mingxiao Li,Liaoyuan Fan,Tianxing Shi,Jiaxin Guo,Zeming Li,Xiaoyang Guo,Xiao-Xiao Long,Qian Zhang,Ping Tan,Wei Yin
机构: The Hong Kong University of Science and Technology (香港科技大学); Horizon Robotics ( horizon 机器人); Nanjing University (南京大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in visual generation have highlighted the rise of Latent Generative Models (LGMs), which rely on effective visual tokenizers to bridge pixels and semantics. However, existing tokenizers are typically trained from scratch and struggle to balance semantic representation and reconstruction fidelity, particularly in high-dimensional latent spaces. In this work, we introduce DINO-Tok, a DINO-based visual tokenizer that unifies hierarchical representations into an information-complete latent space. By integrating shallow features that retain fine-grained details with deep features encoding global semantics, DINO-Tok effectively bridges pretrained representations and visual generation. We further analyze the challenges of vector quantization (VQ) in this high-dimensional space, where key information is often lost and codebook collapse occurs. We thus propose a global PCA reweighting mechanism to stabilize VQ and preserve essential information across dimensions. On ImageNet 256 \times 256, DINO-Tok achieves state-of-the-art reconstruction performance, reaching 28.54 PSNR for autoencoding and 23.98 PSNR for VQ-based modeling, significantly outperforming prior tokenizers and comparable to billion-level data trained models (such as Hunyuan and Wan). These results demonstrate that adapting powerful pretrained vision models like DINO for tokenization enables semantically aligned and high-fidelity latent representations, enabling next-generation visual generative models. Code will be publicly available at this https URL.
zh

[CV-20] A Reason -then-Describe Instruction Interpreter for Controllable Video Generation

【速读】:该论文旨在解决扩散模型(Diffusion Models)在视频生成任务中可控性不足的问题,特别是用户输入的模糊性、简洁性和组合复杂性与训练阶段使用的详细提示之间存在的意图-输出不匹配现象。解决方案的关键在于提出一种通用且模型无关的解释器 ReaDe,其采用“先推理后描述”(reason-then-describe)范式:首先对用户请求进行分析以识别核心需求并消除歧义,随后生成精确的动作指令以指导下游视频生成器;该方法通过两阶段优化实现——第一阶段引入增强推理的监督信号,提供逐步推理轨迹和密集标注,第二阶段利用多维奖励分配机制实现自然风格描述的稳定反馈优化,从而显著提升指令忠实度、描述准确性及视频生成质量,并展现出对高推理强度和未见输入的良好泛化能力。

链接: https://arxiv.org/abs/2511.20563
作者: Shengqiong Wu,Weicai Ye,Yuanxing Zhang,Jiahao Wang,Quande Liu,Xintao Wang,Pengfei Wan,Kun Gai,Hao Fei,Tat-Seng Chua
机构: Kling Team, Kuaishou Technology (快手科技); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 13 figures, 13 tables, Project Page: this https URL

点击查看摘要

Abstract:Diffusion Transformers have significantly improved video fidelity and temporal coherence, however, practical controllability remains limited. Concise, ambiguous, and compositionally complex user inputs contrast with the detailed prompts used in training, yielding an intent-output mismatch. We propose ReaDe, a universal, model-agnostic interpreter that converts raw instructions into precise, actionable specifications for downstream video generators. ReaDe follows a reason-then-describe paradigm: it first analyzes the user request to identify core requirements and resolve ambiguities, then produces detailed guidance that enables faithful, controllable generation. We train ReaDe via a two-stage optimization: (i) reasoning-augmented supervision imparts analytic parsing with stepwise traces and dense captions, and (ii) a multi-dimensional reward assigner enables stable, feedback-driven refinement for natural-style captions. Experiments across single- and multi-condition scenarios show consistent gains in instruction fidelity, caption accuracy, and downstream video quality, with strong generalization to reasoning-intensive and unseen inputs. ReaDe offers a practical route to aligning controllable video generation with accurately interpreted user intent. Project Page: this https URL.
zh

[CV-21] PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding

【速读】:该论文旨在解决当前视频生成模型在物理可控性和合理性方面的不足,即尽管现有方法在视觉保真度上表现优异,但难以实现对物理行为的显式控制和长期时序上的物理一致性。其解决方案的关键在于提出PhysChoreo框架,该框架通过两个阶段实现:首先利用基于部件感知的物理属性重建方法估计图像中所有物体的静态初始物理属性;其次通过时序指令引导且可物理编辑的仿真机制,合成具有丰富动态行为和物理真实感的高质量视频。这一两阶段设计有效提升了视频生成过程中对物理规律的遵循能力与可控性。

链接: https://arxiv.org/abs/2511.20562
作者: Haoze Zhang,Tianyu Huang,Zichen Wan,Xiaowei Jin,Hongzhi Zhang,Hui Li,Wangmeng Zuo
机构: Harbin Institute of Technology (哈尔滨工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While recent video generation models have achieved significant visual fidelity, they often suffer from the lack of explicit physical controllability and plausibility. To address this, some recent studies attempted to guide the video generation with physics-based rendering. However, these methods face inherent challenges in accurately modeling complex physical properties and effectively control ling the resulting physical behavior over extended temporal sequences. In this work, we introduce PhysChoreo, a novel framework that can generate videos with diverse controllability and physical realism from a single image. Our method consists of two stages: first, it estimates the static initial physical properties of all objects in the image through part-aware physical property reconstruction. Then, through temporally instructed and physically editable simulation, it synthesizes high-quality videos with rich dynamic behaviors and physical realism. Experimental results show that PhysChoreo can generate videos with rich behaviors and physical realism, outperforming state-of-the-art methods on multiple evaluation metrics.
zh

[CV-22] Flash-DMD: Towards High-Fidelity Few-Step Image Generation with Efficient Distillation and Joint Reinforcement Learning

【速读】:该论文旨在解决扩散模型(Diffusion Models)在生成过程中因迭代采样导致的计算成本过高问题,以及现有时间步蒸馏(timestep distillation)方法训练成本高、图像质量下降和强化学习(Reinforcement Learning, RL)微调不稳定(易出现奖励劫持)的问题。其解决方案的关键在于提出了一种名为Flash-DMD的新框架:首先设计了一种高效的、时间步感知的蒸馏策略,大幅降低训练开销(仅需DMD2的2.1%),同时提升生成图像的真实性;其次引入联合训练机制,在蒸馏训练持续进行的同时对模型进行RL微调,利用蒸馏过程提供的稳定且明确的损失作为正则项,有效抑制RL训练中的策略崩溃,从而实现快速收敛、高保真度与训练稳定性三者的统一。

链接: https://arxiv.org/abs/2511.20549
作者: Guanjie Chen,Shirui Huang,Kai Liu,Jianchen Zhu,Xiaoye Qu,Peng Chen,Yu Cheng,Yifu Sun
机构: Shanghai Jiao Tong University (上海交通大学); Tencent (腾讯); Huazhong University of Science and Technology (华中科技大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion Models have emerged as a leading class of generative models, yet their iterative sampling process remains computationally expensive. Timestep distillation is a promising technique to accelerate generation, but it often requires extensive training and leads to image quality degradation. Furthermore, fine-tuning these distilled models for specific objectives, such as aesthetic appeal or user preference, using Reinforcement Learning (RL) is notoriously unstable and easily falls into reward hacking. In this work, we introduce Flash-DMD, a novel framework that enables fast convergence with distillation and joint RL-based refinement. Specifically, we first propose an efficient timestep-aware distillation strategy that significantly reduces training cost with enhanced realism, outperforming DMD2 with only 2.1% its training cost. Second, we introduce a joint training scheme where the model is fine-tuned with an RL objective while the timestep distillation training continues simultaneously. We demonstrate that the stable, well-defined loss from the ongoing distillation acts as a powerful regularizer, effectively stabilizing the RL training process and preventing policy collapse. Extensive experiments on score-based and flow matching models show that our proposed Flash-DMD not only converges significantly faster but also achieves state-of-the-art generation quality in the few-step sampling regime, outperforming existing methods in visual quality, human preference, and text-image alignment metrics. Our work presents an effective paradigm for training efficient, high-fidelity, and stable generative models. Codes are coming soon.
zh

[CV-23] New York Smells: A Large Multimodal Dataset for Olfaction

【速读】:该论文旨在解决机器难以感知和理解化学感官信息(即嗅觉)的问题,这是当前人工智能在多模态感知能力上的关键瓶颈。其核心挑战在于缺乏在自然环境中采集的多样化、多模态嗅觉训练数据。论文提出的解决方案是构建一个大规模的“纽约气味”(New York Smells)数据集,包含7,000对图像与嗅觉信号,覆盖3,500个不同物体,数量级远超现有数据集。该数据集支持跨模态嗅觉表征学习,实验证明视觉信息可有效促进嗅觉特征的学习,并且所学的嗅觉表示优于传统手工设计特征(hand-crafted features),从而为机器嗅觉感知提供了新的技术路径。

链接: https://arxiv.org/abs/2511.20544
作者: Ege Ozguroglu,Junbang Liang,Ruoshi Liu,Mia Chiquier,Michael DeTienne,Wesley Wei Qian,Alexandra Horowitz,Andrew Owens,Carl Vondrick
机构: Columbia University (哥伦比亚大学); Cornell University (康奈尔大学); Osmo Labs
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project website at this https URL

点击查看摘要

Abstract:While olfaction is central to how animals perceive the world, this rich chemical sensory modality remains largely inaccessible to machines. One key bottleneck is the lack of diverse, multimodal olfactory training data collected in natural settings. We present New York Smells, a large dataset of paired image and olfactory signals captured ``in the wild.‘’ Our dataset contains 7,000 smell-image pairs from 3,500 distinct objects across indoor and outdoor environments, with approximately 70 \times more objects than existing olfactory datasets. Our benchmark has three tasks: cross-modal smell-to-image retrieval, recognizing scenes, objects, and materials from smell alone, and fine-grained discrimination between grass species. Through experiments on our dataset, we find that visual data enables cross-modal olfactory representation learning, and that our learned olfactory representations outperform widely-used hand-crafted features.
zh

[CV-24] Automated Monitoring of Cultural Heritage Artifacts Using Semantic Segmentation

【速读】:该论文旨在解决文化遗产保护中裂缝自动检测的难题,特别是针对雕像和纪念碑等对象实现像素级裂纹识别。其解决方案的关键在于采用基于卷积神经网络(CNN)的U-Net架构,并通过对比不同CNN编码器在OmniCrack30k数据集上的语义分割性能,验证模型在细粒度裂纹分割任务中的有效性。实验表明,尽管模型未在雕像或纪念碑图像上进行显式训练,仍展现出对未见文化遗产场景的良好泛化能力。

链接: https://arxiv.org/abs/2511.20541
作者: Andrea Ranieri,Giorgio Palmieri,Silvia Biasotti
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Keywords: Cultural Heritage, Monitoring, Deep Learning, U-Nets, Semantic Segmentation

点击查看摘要

Abstract:This paper addresses the critical need for automated crack detection in the preservation of cultural heritage through semantic segmentation. We present a comparative study of U-Net architectures, using various convolutional neural network (CNN) encoders, for pixel-level crack identification on statues and monuments. A comparative quantitative evaluation is performed on the test set of the OmniCrack30k dataset [1] using popular segmentation metrics including Mean Intersection over Union (mIoU), Dice coefficient, and Jaccard index. This is complemented by an out-of-distribution qualitative evaluation on an unlabeled test set of real-world cracked statues and monuments. Our findings provide valuable insights into the capabilities of different CNN- based encoders for fine-grained crack segmentation. We show that the models exhibit promising generalization capabilities to unseen cultural heritage contexts, despite never having been explicitly trained on images of statues or monuments.
zh

[CV-25] Beyond Generation: Multi-Hop Reasoning for Factual Accuracy in Vision-Language Models ICML

【速读】:该论文旨在解决视觉语言模型(Visual Language Models, VLMs)在生成内容时缺乏事实准确性的问题,其根源在于VLMs推理能力不足且尚未充分整合外部知识以支持多步逻辑推理。解决方案的关键在于提出一个基于结构化知识图谱(Knowledge Graph, KG)的知识引导式推理框架,通过图像-描述任务实现跨模态的多跳验证:首先识别视觉实体,继而遍历知识图谱进行逻辑推理,最终基于事实对生成描述进行精炼。该方法在多个知识表示形式(层级、三元组和列表)下验证了其在提升事实准确性和逻辑一致性方面的有效性,初步实验表明事实准确性平均提升约31%。

链接: https://arxiv.org/abs/2511.20531
作者: Shamima Hossain
机构: Brac University (布拉克大学); bKash Limited (bkash有限公司)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted as poster at NewInML Workshop ICML, 2025

点击查看摘要

Abstract:Visual Language Models (VLMs) are powerful generative tools but often produce factually in- accurate outputs due to a lack of robust reason- ing capabilities. While extensive research has been conducted on integrating external knowl- edge for reasoning in large language models (LLMs), such efforts remain underexplored in VLMs, where the challenge is compounded by the need to bridge multiple modalities seam- lessly. This work introduces a framework for knowledge-guided reasoning in VLMs, leverag- ing structured knowledge graphs for multi-hop verification using image-captioning task to il- lustrate our framework. Our approach enables systematic reasoning across multiple steps, in- cluding visual entity recognition, knowledge graph traversal, and fact-based caption refine- ment. We evaluate the framework using hi- erarchical, triple-based and bullet-point based knowledge representations, analyzing their ef- fectiveness in factual accuracy and logical infer- ence. Empirical results show that our approach improves factual accuracy by approximately 31% on preliminary experiments on a curated dataset of mixtures from Google Landmarks v2, Conceptual captions and Coco captions re- vealing key insights into reasoning patterns and failure modes. This work demonstrates the po- tential of integrating external knowledge for advancing reasoning in VLMs, paving the way for more reliable and knowledgable multimodal systems.
zh

[CV-26] Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos

【速读】:该论文旨在解决人类在第一人称视频(egocentric video)中犯错时的细粒度理解问题,现有研究缺乏对错误来源的精确归因,难以定位错误发生的具体环节。为应对这一挑战,作者提出Mistake Attribution (MATT)任务,明确将错误归因于输入指令文本或尝试视频,并进一步识别违反指令的语义角色(what)、不可逆偏差发生的时刻(Point-of-No-Return, PNR)以及PNR帧中的空间位置(where)。解决方案的关键在于开发MisEngine数据引擎,能够自动从已有数据集中构建富含归因信息的错误样本并继承原有标注;同时设计了MisFormer模型,通过统一的注意力机制联合建模语义、时间与空间维度的错误归因,实现跨模态细粒度错误分析。

链接: https://arxiv.org/abs/2511.20525
作者: Yayuan Li,Aadit Jain,Filippos Bellos,Jason J. Corso
机构: University of Michigan (密歇根大学); Voxel51
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 4 figures, 6 tables

点击查看摘要

Abstract:We introduce Mistake Attribution (MATT), a task for fine-grained understanding of human mistakes in egocentric video. Unlike prior mistake understanding work, which lacks fine-grained output, MATT concretely attributes mistakes to the input instruction text or the attempt video. MATT determines what part of the instruction is violated (semantic role), when the deviation becomes irreversible (the Point-of-No-Return, PNR), and where the mistake appears in the PNR frame. We develop MisEngine, a data engine that automatically constructs attribution-rich mistake samples from existing datasets and inherits their annotations. Applied to large egocentric corpora, MisEngine yields EPIC-KITCHENS-M and Ego4D-M, two datasets that are up to two orders of magnitude larger than prior mistake datasets. We then present MisFormer, a unified attention-based model for mistake attribution across semantic (what), temporal (when), and spatial (where) dimensions, trained using MisEngine supervision. Experiments on our new datasets and prior benchmarks show that MisFormer outperforms strong video-language, temporal localization, hand-object interaction, and mistake-detection baselines.
zh

[CV-27] HBridge: H-Shape Bridging of Heterogeneous Experts for Unified Multimodal Understanding and Generation

【速读】:该论文旨在解决当前统一多模态模型中因对称式架构(如Mixture-of-Transformers, MoT)导致的模态差异难以有效融合的问题,尤其在理解专家(如大语言模型,LLMs)与生成专家(如扩散模型)之间存在固有语义和表示差异时,现有方法效率低且生成质量受限。解决方案的关键在于提出一种非对称的H形架构(HBridge),通过选择性地桥接中间层而非全连接方式,减少超过40%的注意力共享,从而提升计算效率并增强生成质量;同时,通过解耦浅层和深层的模态特异性表征、在中层进行语义对齐,并引入语义重建标记(semantic reconstruction tokens)显式引导生成专家重构目标图像的视觉语义token,显著提升了跨模态一致性与生成性能。

链接: https://arxiv.org/abs/2511.20520
作者: Xiang Wang,Zhifei Zhang,He Zhang,Zhe Lin,Yuqian Zhou,Qing Liu,Shiwei Zhang,Yijun Li,Shaoteng Liu,Haitian Zheng,Jason Kuen,Yuehuan Wang,Changxin Gao,Nong Sang
机构: Huazhong University of Science and Technology (华中科技大学); Adobe Research
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent unified models integrate understanding experts (e.g., LLMs) with generative experts (e.g., diffusion models), achieving strong multimodal performance. However, recent advanced methods such as BAGEL and LMFusion follow the Mixture-of-Transformers (MoT) paradigm, adopting a symmetric design that mirrors one expert to another for convenient initialization and fusion, which remains suboptimal due to inherent modality discrepancies. In this work, we propose HBridge, an asymmetric H-shaped architecture that enables heterogeneous experts to optimally leverage pretrained priors from their respective modality domains. Unlike prior dense fusion strategies that straightforwardly connect all layers between experts via shared attention, HBridge selectively bridges intermediate layers, reducing over 40% attention sharing, which improves efficiency and enhances generation quality. Shallow and deep layers, which capture modality-specific representations, are decoupled, while mid-layer bridging promotes semantic alignment. To further strengthen cross-modal coherence, we introduce semantic reconstruction tokens that explicitly guide the generative expert to reconstruct visual semantic tokens of the target image. Extensive experiments across multiple benchmarks demonstrate the effectiveness and superior performance of HBridge, establishing a new paradigm for unified multimodal generation.
zh

[CV-28] AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs

【速读】:该论文旨在解决当前图像-文本对齐模型(如CLIP)评估基准存在的局限性问题,即现有基准多依赖规则扰动或简短描述,难以衡量细粒度的对齐能力。其解决方案的关键在于提出AlignBench,一个基于多样化的图像到文本和文本到图像生成模型所产生详细图像-标题配对的新基准,每条句子均经过正确性标注,从而可直接评估视觉语言模型(VLM)作为对齐评估器的能力。通过该基准对多种解码器架构的VLM进行评测,揭示了三类关键现象:CLIP类模型即使针对组合推理优化仍表现不佳;检测器对早期句子存在系统性高估;且模型表现出显著的自我偏好倾向,损害检测性能。

链接: https://arxiv.org/abs/2511.20515
作者: Kuniaki Saito,Risa Shinoda,Shohei Tanaka,Tosho Hirasawa,Fumio Okura,Yoshitaka Ushiku
机构: OMRON SINIC X Corporation(欧姆龙辛尼克X公司); The University of Osaka(大阪大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Assessing image-text alignment models such as CLIP is crucial for bridging visual and linguistic representations. Yet existing benchmarks rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment. We introduce AlignBench, a benchmark that provides a new indicator of image-text alignment by evaluating detailed image-caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators. Benchmarking a wide range of decoder-based VLMs reveals three key findings: (i) CLIP-based models, even those tailored for compositional reasoning, remain nearly blind; (ii) detectors systematically over-score early sentences; and (iii) they show strong self-preference, favoring their own outputs and harming detection performance. Our project page will be available at this https URL.
zh

[CV-29] A Physics-Informed Loss Function for Boundary-Consistent and Robust Artery Segmentation in DSA Sequences

【速读】:该论文旨在解决数字减影血管造影(Digital Subtraction Angiography, DSA)序列中脑动脉提取与分割的精度问题,传统损失函数仅依赖像素级重叠度,忽略了血管边界在几何和物理上的一致性,导致分割结果碎片化或不稳定。解决方案的关键在于提出一种新型物理信息损失(Physics-Informed Loss, PIL),其灵感来源于材料物理学中的位错理论,将预测边界与真实边界之间的相互作用建模为弹性过程,并引入基于物理的正则化项,以强制轮廓平滑演化和结构一致性,从而提升网络对细粒度血管几何形状的捕捉能力。

链接: https://arxiv.org/abs/2511.20501
作者: Muhammad Irfan,Nasir Rahim,Khalid Mahmood Malik
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Accurate extraction and segmentation of the cerebral arteries from digital subtraction angiography (DSA) sequences is essential for developing reliable clinical management models of complex cerebrovascular diseases. Conventional loss functions often rely solely on pixel-wise overlap, overlooking the geometric and physical consistency of vascular boundaries, which can lead to fragmented or unstable vessel predictions. To overcome this limitation, we propose a novel \textitPhysics-Informed Loss (PIL) that models the interaction between the predicted and ground-truth boundaries as an elastic process inspired by dislocation theory in materials physics. This formulation introduces a physics-based regularization term that enforces smooth contour evolution and structural consistency, allowing the network to better capture fine vascular geometry. The proposed loss is integrated into several segmentation architectures, including U-Net, U-Net++, SegFormer, and MedFormer, and evaluated on two public benchmarks: DIAS and DSCA. Experimental results demonstrate that PIL consistently outperforms conventional loss functions such as Cross-Entropy, Dice, Active Contour, and Surface losses, achieving superior sensitivity, F1 score, and boundary coherence. These findings confirm that the incorporation of physics-based boundary interactions into deep neural networks improves both the precision and robustness of vascular segmentation in dynamic angiographic imaging. The implementation of the proposed method is publicly available at this https URL.
zh

[CV-30] Modular Deep Learning Framework for Assistive Perception: Gaze Affect and Speaker Identification

【速读】:该论文旨在解决辅助技术中视觉与听觉感知模块化集成的难题,以实现资源受限设备上的实时多模态感知。其解决方案的关键在于采用轻量级、任务专用的独立模型架构:通过卷积神经网络(Convolutional Neural Network, CNN)实现高精度的眼部状态检测(如困倦/注意力),利用深度CNN进行面部表情识别,以及基于长短期记忆网络(Long Short-Term Memory, LSTM)完成语音驱动的说话人识别。三个模块分别在各自数据集上达到93.0%、97.8%和96.89%的准确率,验证了模块化设计在离散任务中的高保真性能,为后续在嵌入式系统中部署高效多模态融合提供了可靠基础。

链接: https://arxiv.org/abs/2511.20474
作者: Akshit Pramod Anchan,Jewelith Thomas,Sritama Roy
机构: Vellore Institute of Technology (VIT); School of Computer Science and Engineering (SCOPE); School of Electronics Engineering (SENSE)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 9 figures, and 3 tables

点击查看摘要

Abstract:Developing comprehensive assistive technologies requires the seamless integration of visual and auditory perception. This research evaluates the feasibility of a modular architecture inspired by core functionalities of perceptive systems like ‘Smart Eye.’ We propose and benchmark three independent sensing modules: a Convolutional Neural Network (CNN) for eye state detection (drowsiness/attention), a deep CNN for facial expression recognition, and a Long Short-Term Memory (LSTM) network for voice-based speaker identification. Utilizing the Eyes Image, FER2013, and customized audio datasets, our models achieved accuracies of 93.0%, 97.8%, and 96.89%, respectively. This study demonstrates that lightweight, domain-specific models can achieve high fidelity on discrete tasks, establishing a validated foundation for future real-time, multimodal integration in resource-constrained assistive devices.
zh

[CV-31] Dance Style Classification using Laban-Inspired and Frequency-Domain Motion Features

【速读】:该论文旨在解决基于运动数据对舞蹈风格进行识别与区分的难题,这一问题在人类活动识别中尤为复杂,因为不同舞种常共享相似的姿态、手势及时间上的运动模式。其解决方案的关键在于提出一种轻量级框架,通过提取视频中的姿态估计信息构建时-空特征描述符,这些描述符受Laban Movement Analysis(LMA)启发,能够捕捉上肢关节的局部动态特性(如速度、加速度和角运动),从而实现空间协调性的结构化表示;同时引入快速傅里叶变换(Fast Fourier Transform, FFT)特征以编码动作的节奏性和周期性,最终在无需复杂模型架构的情况下实现高鲁棒性的舞蹈风格分类,并验证了可解释的运动表征能有效捕获风格细微差异。

链接: https://arxiv.org/abs/2511.20469
作者: Ben Hamscher,Arnold Brosch,Nicolas Binninger,Maksymilian Jan Dejna,Kira Maag
机构: Heinrich-Heine-University Düsseldorf (海因里希·海涅大学杜塞尔多夫分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Dance is an essential component of human culture and serves as a tool for conveying emotions and telling stories. Identifying and distinguishing dance genres based on motion data is a complex problem in human activity recognition, as many styles share similar poses, gestures, and temporal motion patterns. This work presents a lightweight framework for classifying dance styles that determines motion characteristics based on pose estimates extracted from videos. We propose temporal-spatial descriptors inspired by Laban Movement Analysis. These features capture local joint dynamics such as velocity, acceleration, and angular movement of the upper body, enabling a structured representation of spatial coordination. To further encode rhythmic and periodic aspects of movement, we integrate Fast Fourier Transform features that characterize movement patterns in the frequency domain. The proposed approach achieves robust classification of different dance styles with low computational effort, as complex model architectures are not required, and shows that interpretable motion representations can effectively capture stylistic nuances.
zh

[CV-32] STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flow

【速读】:该论文旨在解决视频生成任务中因时空复杂度高、计算成本大而导致的生成质量与效率难题,尤其针对当前主流扩散模型在长时间序列建模中易出现误差累积的问题。其解决方案的关键在于提出STARFlow-V,一种基于归一化流(Normalizing Flows, NFs)的视频生成模型,通过引入全局-局部架构限制因果依赖仅存在于全局潜在空间以缓解误差传播,并结合流-得分匹配(flow-score matching)机制引入轻量级因果去噪器,在自回归框架下提升视频生成的一致性;同时采用视频感知的雅可比迭代策略优化采样效率,保持因果性的同时实现并行化更新。该设计使模型具备端到端学习能力、鲁棒的因果预测和原生似然估计优势,首次实证表明归一化流可实现高质量自回归视频生成,为构建世界模型提供了新路径。

链接: https://arxiv.org/abs/2511.20462
作者: Jiatao Gu,Ying Shen,Tianrong Chen,Laurent Dinh,Yuyang Wang,Miguel Angel Bautista,David Berthelot,Josh Susskind,Shuangfei Zhai
机构: Apple(苹果)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 21 pages

点击查看摘要

Abstract:Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in the spatiotemporal latent space with a global-local architecture which restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion model generation. Additionally, we propose flow-score matching, which equips the model with a light-weight causal denoiser to improve the video generation consistency in an autoregressive fashion. To improve the sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text-to-video, image-to-video as well as video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models. Code and generated samples are available at this https URL.
zh

[CV-33] Look Where It Matters: Training-Free Ultra-HR Remote Sensing VQA via Adaptive Zoom Search

【速读】:该论文旨在解决超高分辨率(Ultra-HR)遥感图像在视觉问答(RS-VQA)任务中因模型输入规模过大导致的token与内存资源消耗过高,以及传统缩放预处理方法丢失细粒度关键信息的问题。解决方案的关键在于提出一种无需训练、即插即用的ZoomSearch框架,其核心创新是将“关注区域定位”与“答案生成”解耦:通过自适应多分支缩放搜索(Adaptive Multi-Branch Zoom Search)实现对图像块的分层定位以找到查询相关区域,并结合布局感知的图像块重组(Layout-Aware Patch Reassembly)将选中区域重构为紧凑且保持空间结构的输入画面,从而在不牺牲细节的前提下显著提升模型精度与推理效率。

链接: https://arxiv.org/abs/2511.20460
作者: Yunqi Zhou,Chengjie Jiang,Chun Yuan,Jing Li
机构: Central University of Finance and Economics (中央财经大学); Tsinghua University (清华大学); East China Normal University (华东师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 8 figures

点击查看摘要

Abstract:With advances in satellite constellations, sensor technologies, and imaging pipelines, ultra-high-resolution (Ultra-HR) remote sensing imagery is becoming increasingly widespread. However, current remote sensing foundation models are ill-suited to such inputs: full-image encoding exhausts token and memory budgets, while resize-based preprocessing loses fine-grained and answer-critical details. In this context, guiding the model look where it matters before prediction becomes crucial. Therefore, we present ZoomSearch, a training-free, plug-and-play pipeline that decouples ‘where to look’ from ‘how to answer’ for Ultra-HR Remote Sensing Visual Question Answering (RS-VQA). ZoomSearch combines Adaptive Multi-Branch Zoom Search, which performs a hierarchical search over image patches to localize query-relevant regions, with Layout-Aware Patch Reassembly, which reorganizes the selected patches into a compact, layout-faithful canvas. We conduct comprehensive experiments on Ultra-HR RS-VQA benchmarks MME-RealWorld-RS and LRS-VQA, comparing against (i) strong general foundation models, (ii) remote sensing foundation models, (iii) Ultra-HR RS-VQA methods, and (iv) plug-and-play search-based VQA methods. When integrated with LLaVA-ov, ZoomSearch attains state-of-the-art accuracy across diverse tasks, improving the LLaVA-ov baseline by 26.3% on LRS-VQA and 114.8% on MME-RealWorld-RS. Meanwhile, it achieves much higher inference efficiency, outperforming prior search-based methods by 20%~44% in speed.
zh

[CV-34] Learning to Generate Human-Human-Object Interactions from Textual Descriptions

【速读】:该论文旨在解决人类在共享交互中涉及物体时的复杂行为建模问题,即Human-Human-Object Interactions (HHOIs) 的建模与生成问题。传统方法通常仅关注单人与物体的交互(Human-Object Interactions, HOIs),难以刻画多人协同动作的语义关联和空间结构。解决方案的关键在于提出一种统一的生成框架,通过利用图像生成模型合成高质量的HHOI数据,并基于score-based diffusion model分别训练文本到HOI和文本到Human-Human Interaction (HHI) 的子模型;最终将二者融合,在单一高级采样过程中联合生成完整的HHOI场景,从而实现多主体、多物体参与的复杂交互行为的可控合成。该方法显著扩展了现有研究对多人群体交互建模的能力。

链接: https://arxiv.org/abs/2511.20446
作者: Jeonghyeon Na,Sangwon Baik,Inhee Lee,Junyoung Lee,Hanbyul Joo
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:The way humans interact with each other, including interpersonal distances, spatial configuration, and motion, varies significantly across different situations. To enable machines to understand such complex, context-dependent behaviors, it is essential to model multiple people in relation to the surrounding scene context. In this paper, we present a novel research problem to model the correlations between two people engaged in a shared interaction involving an object. We refer to this formulation as Human-Human-Object Interactions (HHOIs). To overcome the lack of dedicated datasets for HHOIs, we present a newly captured HHOIs dataset and a method to synthesize HHOI data by leveraging image generative models. As an intermediary, we obtain individual human-object interaction (HOIs) and human-human interaction (HHIs) from the HHOIs, and with these data, we train an text-to-HOI and text-to-HHI model using score-based diffusion model. Finally, we present a unified generative framework that integrates the two individual model, capable of synthesizing complete HHOIs in a single advanced sampling process. Our method extends HHOI generation to multi-human settings, enabling interactions involving more than two individuals. Experimental results show that our method generates realistic HHOIs conditioned on textual descriptions, outperforming previous approaches that focus only on single-human HOIs. Furthermore, we introduce multi-human motion generation involving objects as an application of our framework.
zh

[CV-35] Object-Centric Vision Token Pruning for Vision Language Models

【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)中视觉标记(vision tokens)数量庞大但信息分散导致计算冗余的问题,从而提升推理效率。现有剪枝方法多依赖间接且无保障的方式选择视觉标记,难以确保精度不损失。解决方案的关键在于提出OC-VTP——一种直接且保证最优性的视觉标记剪枝方法:通过轻量级预训练一个以对象为中心的视觉标记剪枝器(object-centric vision token pruner),该剪枝器无需任何下游数据微调即可插入到现有VLM中,其核心思想是通过最小化从所选标记重构原始未剪枝标记的误差,来保证保留最具代表性的视觉标记,从而在任意剪枝比例下均能实现最高推理精度与效率的平衡。

链接: https://arxiv.org/abs/2511.20439
作者: Guangyuan Li,Rongzhen Zhao,Jinhong Deng,Yanbo Wang,Joni Pajarinen
机构: Aalto University (阿尔托大学); University of Electronic Science and Technology of China (电子科技大学); Delft University of Technology (代尔夫特理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, thus consume too much unnecessary computation. Pruning redundant vision tokens for high VLM inference efficiency has been continuously studied but all existing methods resort to indirect and non-guaranteed ways. We propose OC-VTP, a direct and guaranteed approach to select the most representative vision tokens for high-efficiency yet accuracy-preserving VLM inference. Our OC-VTP requires merely light-weight pre-training of a small object-centric vision token pruner, which can then be inserted into existing VLMs, without fine-tuning of any models on any datasets. It is gauranteed that the most representative vision tokens are kept by minimizing the error in reconstructing the original unpruned tokens from the selected ones. Across any vision pruning ratios, i.e., inference efficiency, our OC-VTP consistently helps mainstream VLMs to preserve the highest inference accuracy. Our pruning also demonstrates interesting interpretability. Our codes are available at this https URL.
zh

[CV-36] BRIC: Bridging Kinematic Plans and Physical Control at Test Time

【速读】:该论文旨在解决扩散模型生成的运动规划与强化学习驱动的物理控制器之间在测试阶段存在执行偏差的问题,即扩散模型虽能生成多样且语义丰富的动作序列,但其输出常缺乏物理合理性,导致仿真中出现执行漂移(execution drift)。解决方案的关键在于提出一种名为BRIC的测试时适应(Test-Time Adaptation, TTA)框架,其核心包含两个创新机制:一是通过动态调整物理控制器以适应噪声运动计划,同时利用损失函数防止灾难性遗忘(catastrophic forgetting),从而保留预训练技能;二是引入轻量级测试时引导机制,在信号空间内引导扩散模型生成更符合物理约束的运动轨迹,而无需更新模型参数。二者协同作用,实现了跨多种环境下的长期、一致且物理合理的动作执行,显著提升了任务性能。

链接: https://arxiv.org/abs/2511.20431
作者: Dohun Lim,Minji Kim,Jaewoon Lim,Sungchan Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:We propose BRIC, a novel test-time adaptation (TTA) framework that enables long-term human motion generation by resolving execution discrepancies between diffusion-based kinematic motion planners and reinforcement learning-based physics controllers. While diffusion models can generate diverse and expressive motions conditioned on text and scene context, they often produce physically implausible outputs, leading to execution drift during simulation. To address this, BRIC dynamically adapts the physics controller to noisy motion plans at test time, while preserving pre-trained skills via a loss function that mitigates catastrophic forgetting. In addition, BRIC introduces a lightweight test-time guidance mechanism that steers the diffusion model in the signal space without updating its parameters. By combining both adaptation strategies, BRIC ensures consistent and physically plausible long-term executions across diverse environments in an effective and efficient manner. We validate the effectiveness of BRIC on a variety of long-term tasks, including motion composition, obstacle avoidance, and human-scene interaction, achieving state-of-the-art performance across all tasks.
zh

[CV-37] Block Cascading: Training Free Acceleration of Block-Causal Video Models

【速读】:该论文旨在解决块因果(block-causal)视频生成中存在显著的速度-质量权衡问题:小模型(1.3B参数)虽能实现16 FPS,但质量较低;大模型(14B参数)虽质量更高,但推理速度仅4.5 FPS,难以满足实时交互需求。其解决方案的关键在于提出“块级级联”(Block Cascading)机制——通过训练-free的并行化策略,利用前序块的部分去噪信息启动当前块的生成,从而将原本串行的生成流程转变为可并行执行的级联结构。此方法在不牺牲生成质量的前提下,借助多GPU的时序并行能力实现约2倍加速,并消除KV缓存重载(KV-recaching)带来的约200ms延迟,显著提升交互式视频生成的响应效率。

链接: https://arxiv.org/abs/2511.20426
作者: Hmrishav Bandyopadhyay,Nikhil Pinnaparaju,Rahim Entezari,Jim Scott,Yi-Zhe Song,Varun Jampani
机构: Stability AI (Stability.AI); SketchX, University of Surrey (斯凯奇X,萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Block-causal video generation faces a stark speed-quality trade-off: small 1.3B models manage only 16 FPS while large 14B models crawl at 4.5 FPS, forcing users to choose between responsiveness and quality. Block Cascading significantly mitigates this trade-off through training-free parallelization. Our key insight: future video blocks do not need fully denoised current blocks to begin generation. By starting block generation with partially denoised context from predecessors, we transform sequential pipelines into parallel cascades where multiple blocks denoise simultaneously. With 5 GPUs exploiting temporal parallelism, we achieve ~2x acceleration across all model scales: 1.3B models accelerate from 16 to 30 FPS, 14B models from 4.5 to 12.5 FPS. Beyond inference speed, Block Cascading eliminates overhead from KV-recaching (of ~200ms) during context switches for interactive generation. Extensive evaluations validated against multiple block-causal pipelines demonstrate no significant loss in generation quality when switching from block-causal to Block Cascading pipelines for inference. Project Page: this https URL
zh

[CV-38] VibraVerse: A Large-Scale Geometry-Acoustics Alignment Dataset for Physically-Consistent Multimodal Learning

【速读】:该论文旨在解决现有多模态学习框架在视觉与语言任务中缺乏物理一致性的问题,特别是忽视了物体几何结构、材料属性、振动模态与声音产生之间的内在因果关系。解决方案的关键在于构建VibraVerse数据集和CLASP对比学习框架:VibraVerse是一个大规模的几何-声学对齐数据集,显式地连接从3D几何到物理属性、模态参数再到声学信号的因果链;CLASP则通过跨模态对齐机制保留物体物理结构与其声学响应间的因果对应关系,确保样本在形状、图像和声音统一表示空间中具有可追溯性和物理一致性。这一方法显著提升了模型在几何到声音预测、声引导形状重建及跨模态表征学习等任务中的准确性、可解释性与泛化能力。

链接: https://arxiv.org/abs/2511.20422
作者: Bo Pang,Chenxi Xu,Jierui Ren,Guoping Wang,Sheng Li
机构: Peking University (北京大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Understanding the physical world requires perceptual models grounded in physical laws rather than mere statistical correlations. However, existing multimodal learning frameworks, focused on vision and language, lack physical consistency and overlook the intrinsic causal relationships among an object’s geometry, material, vibration modes, and the sounds it produces. We introduce VibraVerse, a large-scale geometry-acoustics alignment dataset that explicitly bridges the causal chain from 3D geometry - physical attributes - modal parameters - acoustic signals. Each 3D model has explicit physical properties (density, Young’s modulus, Poisson’s ratio) and volumetric geometry, from which modal eigenfrequencies and eigenvectors are computed for impact sound synthesis under controlled excitations. To establish this coherence, we introduce CLASP, a contrastive learning framework for cross-modal alignment that preserves the causal correspondence between an object’s physical structure and its acoustic response. This framework enforces physically consistent alignment across modalities, ensuring that every sample is coherent, traceable to the governing equations, and embedded within a unified representation space spanning shape, image, and sound. Built upon VibraVerse, we define a suite of benchmark tasks for geometry-to-sound prediction, sound-guided shape reconstruction, and cross-modal representation learning. Extensive validations on these tasks demonstrate that models trained on VibraVerse exhibit superior accuracy, interpretability, and generalization across modalities. These results establish VibraVerse as a benchmark for physically consistent and causally interpretable multimodal learning, providing a foundation for sound-guided embodied perception and a deeper understanding of the physical world. The dataset will be open-sourced.
zh

[CV-39] StableTrack: Stabilizing Multi-Object Tracking on Low-Frequency Detections

【速读】:该论文旨在解决多目标跟踪(Multi-object Tracking, MOT)在低频检测条件下的性能瓶颈问题,即在计算资源受限场景下,传统方法因依赖高频率帧内检测而难以维持稳定跟踪质量。解决方案的关键在于提出一种名为StableTrack的新方法,其核心创新包括:1)设计了一种两阶段匹配策略,提升跨帧低频检测的关联准确性;2)引入基于边界框(Bbox-Based Distance)的距离度量替代传统的马氏距离(Mahalanobis distance),从而更有效地结合重识别(Re-ID)模型进行匹配;3)将视觉跟踪模块嵌入卡尔曼滤波(Kalman Filter)框架,优化整体跟踪流水线。实验表明,该方法在MOT17-val数据集上以1 Hz频率运行时,HOTA指标提升11.6%,同时在标准基准测试中保持与当前最优方法相当的性能。

链接: https://arxiv.org/abs/2511.20418
作者: Matvei Shelukhan,Timur Mamedov,Karina Kvanchiani
机构: Tevian(特维安); Lomonosov Moscow State University (莫斯科国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multi-object tracking (MOT) is one of the most challenging tasks in computer vision, where it is important to correctly detect objects and associate these detections across frames. Current approaches mainly focus on tracking objects in each frame of a video stream, making it almost impossible to run the model under conditions of limited computing resources. To address this issue, we propose StableTrack, a novel approach that stabilizes the quality of tracking on low-frequency detections. Our method introduces a new two-stage matching strategy to improve the cross-frame association between low-frequency detections. We propose a novel Bbox-Based Distance instead of the conventional Mahalanobis distance, which allows us to effectively match objects using the Re-ID model. Furthermore, we integrate visual tracking into the Kalman Filter and the overall tracking pipeline. Our method outperforms current state-of-the-art trackers in the case of low-frequency detections, achieving \textit11.6% HOTA improvement at \textit1 Hz on MOT17-val, while keeping up with the best approaches on the standard MOT17, MOT20, and DanceTrack benchmarks with full-frequency detections.
zh

[CV-40] MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts

【速读】:该论文旨在解决现有3D城市生成方法难以兼顾文本驱动的创造性灵活性与基于显式结构表示的对象级可编辑性的问题。其核心挑战在于如何在保持几何一致性与风格多样性的同时,实现对城市场景的细粒度控制。解决方案的关键在于提出MajutsuCity框架,该框架通过四阶段流水线将城市建模为可控的布局、资产和材质组合,并引入MajutsuAgent这一交互式语言引导编辑代理,支持五种对象级操作以增强生成后的可编辑性;同时构建了高质量多模态数据集MajutsuDataset,涵盖2D语义布局、高度图、多样化3D建筑资产及PBR材质与天空盒,并配套设计了一套覆盖结构一致性、场景复杂度、材质保真度和光照氛围的评估指标体系。实验表明,MajutsuCity在布局FID上较CityDreamer降低83.7%,较CityCraft降低20.1%,并在所有AQS和RDR评分中排名第一,显著优于现有方法,在几何保真度、风格适应性和语义可控性方面达到新SOTA水平。

链接: https://arxiv.org/abs/2511.20415
作者: Zilong Huang,Jun He,Xiaobin Huang,Ziyi Xiong,Yang Luo,Junyan Ye,Weijia Li,Yiping Chen,Ting Han
机构: Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 6 figures

点击查看摘要

Abstract:Generating realistic 3D cities is fundamental to world models, virtual reality, and game development, where an ideal urban scene must satisfy both stylistic diversity, fine-grained, and controllability. However, existing methods struggle to balance the creative flexibility offered by text-based generation with the object-level editability enabled by explicit structural representations. We introduce MajutsuCity, a natural language-driven and aesthetically adaptive framework for synthesizing structurally consistent and stylistically diverse 3D urban scenes. MajutsuCity represents a city as a composition of controllable layouts, assets, and materials, and operates through a four-stage pipeline. To extend controllability beyond initial generation, we further integrate MajutsuAgent, an interactive language-grounded editing agent that supports five object-level operations. To support photorealistic and customizable scene synthesis, we also construct MajutsuDataset, a high-quality multimodal dataset containing 2D semantic layouts and height maps, diverse 3D building assets, and curated PBR materials and skyboxes, each accompanied by detailed annotations. Meanwhile, we develop a practical set of evaluation metrics, covering key dimensions such as structural consistency, scene complexity, material fidelity, and lighting atmosphere. Extensive experiments demonstrate MajutsuCity reduces layout FID by 83.7% compared with CityDreamer and by 20.1% over CityCraft. Our method ranks first across all AQS and RDR scores, outperforming existing methods by a clear margin. These results confirm MajutsuCity as a new state-of-the-art in geometric fidelity, stylistic adaptability, and semantic controllability for 3D city generation. We expect our framework can inspire new avenues of research in 3D city generation. Our dataset and code will be released at this https URL.
zh

[CV-41] Image-Free Timestep Distillation via Continuous-Time Consistency with Trajectory-Sampled Pairs

【速读】:该论文旨在解决连续时间一致性蒸馏方法在训练数据依赖性强、计算资源消耗大等问题,限制了其在资源受限场景下的部署及跨域扩展性。解决方案的关键在于提出轨迹反向一致性模型(Trajectory-Backward Consistency Model, TBCM),通过直接从教师模型的生成轨迹中提取潜在表示,摒弃对外部训练数据和VAE编码的依赖,构建自包含的蒸馏范式,从而显著提升效率与简化流程;同时,轨迹提取的样本天然弥合了训练与推理阶段的数据分布差异,促进更有效的知识迁移。

链接: https://arxiv.org/abs/2511.20410
作者: Bao Tang,Shuai Zhang,Yueting Zhu,Jijun Xiang,Xin Yang,Li Yu,Wenyu Liu,Xinggang Wang
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Timestep distillation is an effective approach for improving the generation efficiency of diffusion models. The Consistency Model (CM), as a trajectory-based framework, demonstrates significant potential due to its strong theoretical foundation and high-quality few-step generation. Nevertheless, current continuous-time consistency distillation methods still rely heavily on training data and computational resources, hindering their deployment in resource-constrained scenarios and limiting their scalability to diverse domains. To address this issue, we propose Trajectory-Backward Consistency Model (TBCM), which eliminates the dependence on external training data by extracting latent representations directly from the teacher model’s generation trajectory. Unlike conventional methods that require VAE encoding and large-scale datasets, our self-contained distillation paradigm significantly improves both efficiency and simplicity. Moreover, the trajectory-extracted samples naturally bridge the distribution gap between training and inference, thereby enabling more effective knowledge transfer. Empirically, TBCM achieves 6.52 FID and 28.08 CLIP scores on MJHQ-30k under one-step generation, while reducing training time by approximately 40% compared to Sana-Sprint and saving a substantial amount of GPU memory, demonstrating superior efficiency without sacrificing quality. We further reveal the diffusion-generation space discrepancy in continuous-time consistency distillation and analyze how sampling strategies affect distillation performance, offering insights for future distillation research. GitHub Link: this https URL.
zh

[CV-42] A Training-Free Approach for Multi-ID Customization via Attention Adjustment and Spatial Control

【速读】:该论文旨在解决多ID定制(multi-ID customization)任务中的两个核心挑战:一是生成图像时出现的“复制粘贴”问题,导致图像质量下降;二是文本控制能力弱,生成结果无法与输入文本提示对齐。其解决方案的关键在于提出一种无需训练的框架MultiID,通过引入ID解耦的交叉注意力机制,将不同个体的ID嵌入注入对应图像区域,从而实现多身份融合的高质量生成;同时结合局部提示(local prompt)、深度引导的空间控制(depth-guided spatial control)和扩展自注意力(extended self-attention)三种策略,显著提升生成结果与文本提示及ID图像的一致性。

链接: https://arxiv.org/abs/2511.20401
作者: Jiawei Lin,Guanlong Jiao,Jianjin Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-ID customization is an interesting topic in computer vision and attracts considerable attention recently. Given the ID images of multiple individuals, its purpose is to generate a customized image that seamlessly integrates them while preserving their respective identities. Compared to single-ID customization, multi-ID customization is much more difficult and poses two major challenges. First, since the multi-ID customization model is trained to reconstruct an image from the cropped person regions, it often encounters the copy-paste issue during inference, leading to lower quality. Second, the model also suffers from inferior text controllability. The generated result simply combines multiple persons into one image, regardless of whether it is aligned with the input text. In this work, we propose MultiID to tackle this challenging task in a training-free manner. Since the existing single-ID customization models have less copy-paste issue, our key idea is to adapt these models to achieve multi-ID customization. To this end, we present an ID-decoupled cross-attention mechanism, injecting distinct ID embeddings into the corresponding image regions and thus generating multi-ID outputs. To enhance the generation controllability, we introduce three critical strategies, namely the local prompt, depth-guided spatial control, and extended self-attention, making the results more consistent with the text prompts and ID images. We also carefully build a benchmark, called IDBench, for evaluation. The extensive qualitative and quantitative results demonstrate the effectiveness of MultiID in solving the aforementioned two challenges. Its performance is comparable or even better than the training-based multi-ID customization methods.
zh

[CV-43] FREE: Uncertainty-Aware Autoregression for Parallel Diffusion Transformers

【速读】:该论文旨在解决扩散模型中基于Transformer的生成式AI(Diffusion Transformers, DiTs)在推理阶段因长序列去噪轨迹导致的高延迟问题。现有推测推理方法虽能在U-Net结构中实现无损并行采样,但在DiTs上加速效果受限,主要由于验证阶段草稿准确率不足。其解决方案的关键在于:首先发现DiTs顶层(top-block)特征具有强时序一致性与丰富的语义抽象能力,据此提出FREE框架,利用轻量级草稿模型进行特征级自回归并配合并行验证,实现理论保证的无损加速;其次,针对后期去噪步骤预测方差(不确定性)增大导致接受率下降的问题,进一步引入不确定性引导的松弛策略(uncertainty-guided relaxation),动态调整接受概率以提升效率,最终在ImageNet 512²图像生成任务上实现最高达2.25倍的加速,同时保持高质量生成结果。

链接: https://arxiv.org/abs/2511.20390
作者: Xinwan Wen,Bowen Li,Jiajun Luo,Ye Li,Zhi Wang
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion Transformers (DiTs) achieve state-of-the-art generation quality but require long sequential denoising trajectories, leading to high inference latency. Recent speculative inference methods enable lossless parallel sampling in U-Net-based diffusion models via a drafter-verifier scheme, but their acceleration is limited on DiTs due to insufficient draft accuracy during verification. To address this limitation, we analyze the DiTs’ feature dynamics and find the features of the final transformer layer (top-block) exhibit strong temporal consistency and rich semantic abstraction. Based on this insight, we propose FREE, a novel framework that employs a lightweight drafter to perform feature-level autoregression with parallel verification, guaranteeing lossless acceleration with theoretical and empirical support. Meanwhile, prediction variance (uncertainty) of DiTs naturally increases in later denoising steps, reducing acceptance rates under speculative sampling. To mitigate this effect, we further introduce an uncertainty-guided relaxation strategy, forming FREE (relax), which dynamically adjusts the acceptance probability in response to uncertainty levels. Experiments on ImageNet- 512^2 show that FREE achieves up to 1.86 \times acceleration, and FREE (relax) further reaches 2.25 \times speedup while maintaining high perceptual and quantitative fidelity in generation quality.
zh

[CV-44] VGGTFace: Topologically Consistent Facial Geometry Reconstruction in the Wild

【速读】:该论文旨在解决从日常用户拍摄的多视角野生图像(in-the-wild images)中自动重建拓扑一致的人脸几何结构的问题,现有方法普遍存在手动干预繁琐、泛化能力弱或受限于3D形态模型(3D Morphable Models, 3DMM)表达能力不足等局限。解决方案的关键在于提出VGGTFace框架,其核心创新是将3D基础模型VGGT与Pixel3DMM结合:利用VGGT强大的泛化能力和点图表示(point map representation),并通过Pixel3DMM注入像素对齐的UV值以恢复缺失的拓扑信息,从而将VGGT输出的无拓扑点图转换为具有拓扑结构的点云;进一步地,设计了一种拓扑感知的Bundle Adjustment策略,通过构建拉普拉斯能量项优化点云融合过程,实现高质量且高效的重建,仅需10秒即可完成16视图的重建任务。

链接: https://arxiv.org/abs/2511.20366
作者: Xin Ming,Yuxuan Han,Tianyu Huang,Feng Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reconstructing topologically consistent facial geometry is crucial for the digital avatar creation pipelines. Existing methods either require tedious manual efforts, lack generalization to in-the-wild data, or are constrained by the limited expressiveness of 3D Morphable Models. To address these limitations, we propose VGGTFace, an automatic approach that innovatively applies the 3D foundation model, \emphi.e. VGGT, for topologically consistent facial geometry reconstruction from in-the-wild multi-view images captured by everyday users. Our key insight is that, by leveraging VGGT, our method naturally inherits strong generalization ability and expressive power from its large-scale training and point map representation. However, it is unclear how to reconstruct a topologically consistent mesh from VGGT, as the topology information is missing in its prediction. To this end, we augment VGGT with Pixel3DMM for injecting topology information via pixel-aligned UV values. In this manner, we convert the pixel-aligned point map of VGGT to a point cloud with topology. Tailored to this point cloud with known topology, we propose a novel Topology-Aware Bundle Adjustment strategy to fuse them, where we construct a Laplacian energy for the Bundle Adjustment objective. Our method achieves high-quality reconstruction in 10 seconds for 16 views on a single NVIDIA RTX 4090. Experiments demonstrate state-of-the-art results on benchmarks and impressive generalization to in-the-wild data. Code is available at this https URL.
zh

[CV-45] From Passive Perception to Active Memory: A Weakly Supervised Image Manipulation Localization Framework Driven by Coarse-Grained Annotations AAAI2026

【速读】:该论文旨在解决图像操作定位(Image Manipulation Localization, IML)中注释成本与细粒度定位精度之间的根本权衡问题。现有全监督方法依赖密集的像素级掩码标注,虽精度高但难以扩展;而弱监督方法多基于图像级标签,虽降低注释成本但缺乏空间精度。其解决方案的关键在于提出BoxPromptIML框架:首先采用粗略区域标注策略,在较低注释成本下生成相对准确的操作掩码;其次设计轻量级学生模型,通过知识蒸馏从固定教师模型(基于Segment Anything Model, SAM)中学习精细定位能力;最后引入受人类潜意识记忆机制启发的双引导特征融合模块,实现长期记忆原型与实时观测线索的动态交互,从而显著提升定位精度与鲁棒性。

链接: https://arxiv.org/abs/2511.20359
作者: Zhiqing Guo,Dongdong Xi,Songlin Li,Gaobo Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Image manipulation localization (IML) faces a fundamental trade-off between minimizing annotation cost and achieving fine-grained localization accuracy. Existing fully-supervised IML methods depend heavily on dense pixel-level mask annotations, which limits scalability to large datasets or real-world this http URL contrast, the majority of existing weakly-supervised IML approaches are based on image-level labels, which greatly reduce annotation effort but typically lack precise spatial localization. To address this dilemma, we propose BoxPromptIML, a novel weakly-supervised IML framework that effectively balances annotation cost and localization performance. Specifically, we propose a coarse region annotation strategy, which can generate relatively accurate manipulation masks at lower cost. To improve model efficiency and facilitate deployment, we further design an efficient lightweight student model, which learns to perform fine-grained localization through knowledge distillation from a fixed teacher model based on the Segment Anything Model (SAM). Moreover, inspired by the human subconscious memory mechanism, our feature fusion module employs a dual-guidance strategy that actively contextualizes recalled prototypical patterns with real-time observational cues derived from the input. Instead of passive feature extraction, this strategy enables a dynamic process of knowledge recollection, where long-term memory is adapted to the specific context of the current image, significantly enhancing localization accuracy and robustness. Extensive experiments across both in-distribution and out-of-distribution datasets show that BoxPromptIML outperforms or rivals fully-supervised models, while maintaining strong generalization, low annotation cost, and efficient deployment characteristics.
zh

[CV-46] GS-Checker: Tampering Localization for 3D Gaussian Splatting AAAI2026

【速读】:该论文旨在解决3D Gaussian Splatting (3DGS) 模型在编辑技术发展背景下可能被恶意篡改的问题,其核心挑战在于如何精准定位被篡改的三维区域。解决方案的关键在于提出 GS-Checker 方法:首先,在3D高斯参数中嵌入一个3D篡改属性(3D tampering attribute),用于标记每个高斯点是否被篡改;其次,设计一种基于关键属性相似度比较的3D对比机制(3D contrastive mechanism),从三维空间层面挖掘篡改线索;最后,引入循环优化策略迭代 refine 该篡改属性,从而提升定位精度。值得注意的是,该方法无需昂贵的3D标注监督即可实现有效检测。

链接: https://arxiv.org/abs/2511.20354
作者: Haoliang Han,Ziyuan Luo,Jun Qi,Anderson Rocha,Renjie Wan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI2026

点击查看摘要

Abstract:Recent advances in editing technologies for 3D Gaussian Splatting (3DGS) have made it simple to manipulate 3D scenes. However, these technologies raise concerns about potential malicious manipulation of 3D content. To avoid such malicious applications, localizing tampered regions becomes crucial. In this paper, we propose GS-Checker, a novel method for locating tampered areas in 3DGS models. Our approach integrates a 3D tampering attribute into the 3D Gaussian parameters to indicate whether the Gaussian has been tampered. Additionally, we design a 3D contrastive mechanism by comparing the similarity of key attributes between 3D Gaussians to seek tampering cues at 3D level. Furthermore, we introduce a cyclic optimization strategy to refine the 3D tampering attribute, enabling more accurate tampering localization. Notably, our approach does not require expensive 3D labels for supervision. Extensive experimental results demonstrate the effectiveness of our proposed method to locate the tampered 3DGS area.
zh

[CV-47] hinking in 360°: Humanoid Visual Search in the Wild

【速读】:该论文旨在解决现有视觉搜索方法局限于静态图像、忽视物理具身(embodiment)及其与三维世界交互的问题,从而难以实现类人级的高效视觉搜索。其核心挑战在于如何在不依赖真实硬件限制的前提下,构建具备主动头部旋转能力的具身视觉搜索代理(embodied visual search agents)。解决方案的关键是提出“人形视觉搜索”(humanoid visual search)范式,即通过一个类人代理在360°全景图像表示的沉浸式环境中主动转动头部进行目标或路径搜索,并基于此构建了H* Bench基准,涵盖交通枢纽、大型零售空间等复杂现实场景,以评估模型在高密度视觉环境下的空间推理能力。实验表明,即使顶级商用模型在该任务中成功率仅约30%,而通过后训练优化开源模型Qwen2.5-VL可显著提升性能,尤其在对象搜索中成功率达47.38%,验证了该范式的有效性及当前系统仍面临巨大挑战,尤其是在路径搜索中体现的空间常识推理难度。

链接: https://arxiv.org/abs/2511.20351
作者: Heyang Yu,Yinan Han,Xiangyu Zhang,Baiqiao Yin,Bowen Chang,Xiangyu Han,Xinhao Liu,Jing Zhang,Marco Pavone,Chen Feng,Saining Xie,Yiming Li
机构: NYU(纽约大学); NVIDIA(英伟达); TU Darmstadt(达姆施塔特工业大学); UC Berkeley(加州大学伯克利分校); Stanford University(斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Humans rely on the synergistic control of head (cephalomotor) and eye (oculomotor) to efficiently search for visual information in 360°. However, prior approaches to visual search are limited to a static image, neglecting the physical embodiment and its interaction with the 3D world. How can we develop embodied visual search agents as efficient as humans while bypassing the constraints imposed by real-world hardware? To this end, we propose humanoid visual search where a humanoid agent actively rotates its head to search for objects or paths in an immersive world represented by a 360° panoramic image. To study visual search in visually-crowded real-world scenarios, we build H* Bench, a new benchmark that moves beyond household scenes to challenging in-the-wild scenes that necessitate advanced visual-spatial reasoning capabilities, such as transportation hubs, large-scale retail spaces, urban streets, and public institutions. Our experiments first reveal that even top-tier proprietary models falter, achieving only ~30% success in object and path search. We then use post-training techniques to enhance the open-source Qwen2.5-VL, increasing its success rate by over threefold for both object search (14.83% to 47.38%) and path search (6.44% to 24.94%). Notably, the lower ceiling of path search reveals its inherent difficulty, which we attribute to the demand for sophisticated spatial commonsense. Our results not only show a promising path forward but also quantify the immense challenge that remains in building MLLM agents that can be seamlessly integrated into everyday human life.
zh

[CV-48] Material-informed Gaussian Splatting for 3D World Reconstruction in a Digital Twin

【速读】:该论文旨在解决数字孪生(Digital Twins)中3D重建依赖LiDAR导致的语义信息与纹理缺失问题,以及传统LiDAR-相机融合方法在玻璃等特殊材质上表现不佳且需复杂标定的问题。其解决方案的关键在于提出一种纯相机驱动的重建流程:利用多视角图像通过3D高斯溅射(3D Gaussian Splatting)实现高保真几何重建,结合视觉模型提取语义材料掩码,将高斯表示转换为带投影材料标签的网格表面,并赋予物理基础材料属性,从而在无需LiDAR硬件和标定的情况下,实现与LiDAR-相机融合相当的传感器仿真精度。

链接: https://arxiv.org/abs/2511.20348
作者: João Malheiro Silva,Andy Huynh,Tong Duy Son,Holger Caesar
机构: Siemens Digital Industries Software (西门子数字工业软件); Delft University of Technology (代尔夫特理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages, 5 figures. Submitted to IEEE Intelligent Vehicles Symposium (IV) 2026 for possible publication

点击查看摘要

Abstract:3D reconstruction for Digital Twins often relies on LiDAR-based methods, which provide accurate geometry but lack the semantics and textures naturally captured by cameras. Traditional LiDAR-camera fusion approaches require complex calibration and still struggle with certain materials like glass, which are visible in images but poorly represented in point clouds. We propose a camera-only pipeline that reconstructs scenes using 3D Gaussian Splatting from multi-view images, extracts semantic material masks via vision models, converts Gaussian representations to mesh surfaces with projected material labels, and assigns physics-based material properties for accurate sensor simulation in modern graphics engines and simulators. This approach combines photorealistic reconstruction with physics-based material assignment, providing sensor simulation fidelity comparable to LiDAR-camera fusion while eliminating hardware complexity and calibration requirements. We validate our camera-only method using an internal dataset from an instrumented test vehicle, leveraging LiDAR as ground truth for reflectivity validation alongside image similarity metrics.
zh

[CV-49] AMB3R: Accurate Feed-forward Metric-scale 3D Reconstruction with Backend

【速读】:该论文旨在解决大规模场景下密集三维重建(dense 3D reconstruction)中几何精度与计算效率难以兼顾的问题,同时拓展模型在未标定视觉里程计(uncalibrated visual odometry)和大规模结构光束法平差(structure from motion, SfM)等任务中的泛化能力。解决方案的关键在于提出一种多视角前馈模型AMB3R,其核心创新是采用稀疏但紧凑的体素场景表示(sparse yet compact volumetric scene representation)作为后端,从而在保持空间紧凑性的同时实现高效的几何推理。该设计使模型无需针对不同任务进行微调或测试时优化,即可在相机位姿估计、深度图恢复、度量尺度重建及三维重建等多个任务上达到最优性能,并超越传统基于优化的SLAM和SfM方法。

链接: https://arxiv.org/abs/2511.20343
作者: Hengyi Wang,Lourdes Agapito
机构: University College London (伦敦大学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:We present AMB3R, a multi-view feed-forward model for dense 3D reconstruction on a metric-scale that addresses diverse 3D vision tasks. The key idea is to leverage a sparse, yet compact, volumetric scene representation as our backend, enabling geometric reasoning with spatial compactness. Although trained solely for multi-view reconstruction, we demonstrate that AMB3R can be seamlessly extended to uncalibrated visual odometry (online) or large-scale structure from motion without the need for task-specific fine-tuning or test-time optimization. Compared to prior pointmap-based models, our approach achieves state-of-the-art performance in camera pose, depth, and metric-scale estimation, 3D reconstruction, and even surpasses optimization-based SLAM and SfM methods with dense reconstruction priors on common benchmarks.
zh

[CV-50] ShelfRectNet: Single View Shelf Image Rectification with Homography Estimation

【速读】:该论文旨在解决从单张图像中估计同形变换(homography)的问题,这在零售场景中具有重要应用价值,例如货架监控与商品对齐,其中通常只能获取单一视角的图像。其解决方案的关键在于提出了一种基于ConvNeXt主干网络的深度学习框架,通过预测4点参数化的同形变换矩阵来校正任意角度拍摄的货架图像;同时引入归一化坐标回归策略以提升稳定性,并设计了一种新颖的合成同形变换采样增强方法,有效缓解数据稀缺问题并促进模型泛化能力。实验表明,该方法在测试集上平均角点误差为1.298像素,在精度和推理速度上均优于传统计算机视觉与深度学习方法,展现出良好的鲁棒性和实用性。

链接: https://arxiv.org/abs/2511.20335
作者: Onur Berk Tore,Ibrahim Samil Yalciner,Server Calap
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Estimating homography from a single image remains a challenging yet practically valuable task, particularly in domains like retail, where only one viewpoint is typically available for shelf monitoring and product alignment. In this paper, we present a deep learning framework that predicts a 4-point parameterized homography matrix to rectify shelf images captured from arbitrary angles. Our model leverages a ConvNeXt-based backbone for enhanced feature representation and adopts normalized coordinate regression for improved stability. To address data scarcity and promote generalization, we introduce a novel augmentation strategy by modeling and sampling synthetic homographies. Our method achieves a mean corner error of 1.298 pixels on the test set. When compared with both classical computer vision and deep learning-based approaches, our method demonstrates competitive performance in both accuracy and inference speed. Together, these results establish our approach as a robust and efficient solution for realworld single-view rectification. To encourage further research in this domain, we will make our dataset, ShelfRectSet, and code publicly available
zh

[CV-51] 3D Motion Perception of Binocular Vision Target with PID-CNN

【速读】:该论文旨在解决基于双目视觉(binocular vision)的三维运动感知问题,即从图像序列中实时估计目标的三维坐标、速度和加速度,从而实现基本的时空感知能力。其解决方案的关键在于设计了一个结构紧凑的PID卷积神经网络(PID Convolutional Neural Network),该网络通过多层非线性变换逐步将原始输入特征映射到期望的输出表示,并引入了基于差分方程的局部建模思想(将单层网络视为二阶差分方程与非线性函数的组合),同时采用特征拼接与池化相结合的简单但有效的特征复用策略,使得模型在仅含413千参数的情况下仍能逼近输入图像分辨率所允许的预测精度上限。

链接: https://arxiv.org/abs/2511.20332
作者: Shi Jiazhao,Pan Pan,Shi Haotian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 pages, 9 figures, 2 tables

点击查看摘要

Abstract:This article trained a network for perceiving three-dimensional motion information of binocular vision target, which can provide real-time three-dimensional coordinate, velocity, and acceleration, and has a basic spatiotemporal perception capability. Understood the ability of neural networks to fit nonlinear problems from the perspective of PID. Considered a single-layer neural network as using a second-order difference equation and a nonlinearity to describe a local problem. Multilayer networks gradually transform the raw representation to the desired representation through multiple such combinations. Analysed some reference principles for designing neural networks. Designed a relatively small PID convolutional neural network, with a total of 17 layers and 413 thousand parameters. Implemented a simple but practical feature reuse method by concatenation and pooling. The network was trained and tested using the simulated randomly moving ball datasets, and the experimental results showed that the prediction accuracy was close to the upper limit that the input image resolution can represent. Analysed the experimental results and errors, as well as the existing shortcomings and possible directions for improvement. Finally, discussed the advantages of high-dimensional convolution in improving computational efficiency and feature space utilization. As well as the potential advantages of using PID information to implement memory and attention mechanisms.
zh

[CV-52] ArtiBench and ArtiBrain: Benchmarking Generalizable Vision-Language Articulated Object Manipulation

【速读】:该论文旨在解决交互式可动物体操作(interactive articulated manipulation)中长期、多步骤交互时的泛化性难题,尤其是现有视觉语言模型(vision-language models)和扩散模型(diffusion-based policies)在不同部件(part)、实例(instance)及类别(category)间难以迁移的问题。解决方案的关键在于提出ArtiBrain框架,其核心由两部分组成:一是基于大语言模型(VLM-based Task Reasoner, GPT-4.1)的高层任务分解与子目标验证机制,确保策略逻辑的合理性;二是融合几何感知关键帧执行与属性引导扩散(affordance-guided diffusion)的混合控制器(Hybrid Controller),实现精确且可解释的低层控制;此外,通过持续积累成功执行经验并构建属性记忆库(Affordance Memory Bank),将已知部件的动作属性传播至未见过的可动部件配置,显著提升系统在复杂场景下的鲁棒性和泛化能力。

链接: https://arxiv.org/abs/2511.20330
作者: Yuhan Wu,Tiantian Wei,Shuo Wang,ZhiChao Wang,Yanyong Zhang,Daniel Cremers,Yan Xia
机构: University of Science and Technology of China (中国科学技术大学); Technical University of Munich (慕尼黑工业大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Interactive articulated manipulation requires long-horizon, multi-step interactions with appliances while maintaining physical consistency. Existing vision-language and diffusion-based policies struggle to generalize across parts, instances, and categories. We first introduce ArtiBench, a five-level benchmark covering kitchen, storage, office, and tool environments. ArtiBench enables structured evaluation from cross-part and cross-instance variation to long-horizon multi-object tasks, revealing the core generalization challenges of articulated object manipulation. Building on this benchmark, we propose ArtiBrain, a modular framework that unifies high-level reasoning with adaptive low-level control. ArtiBrain uses a VLM-based Task Reasoner (GPT-4.1) to decompose and validate subgoals, and employs a Hybrid Controller that combines geometry-aware keyframe execution with affordance-guided diffusion for precise and interpretable manipulation. An Affordance Memory Bank continually accumulates successful execution episodes and propagates part-level actionable affordances to unseen articulated parts and configurations. Extensive experiments on ArtiBench show that our ArtiBrain significantly outperforms state-of-the-art multimodal and diffusion-based methods in robustness and generalization. Code and dataset will be released upon acceptance.
zh

[CV-53] AD-R1: Closed-Loop Reinforcement Learning for End-to-End Autonomous Driving with Impartial World Models

【速读】:该论文旨在解决自主驾驶中基于强化学习(Reinforcement Learning, RL)的端到端模型在安全性与长尾事件处理上的不足,其核心问题在于现有世界模型存在深层的乐观偏差(optimistic bias),导致对危险场景预测失真。解决方案的关键在于提出一种基于“公正世界模型”(Impartial World Model)的策略后训练精炼框架,通过新颖的反事实合成(Counterfactual Synthesis)数据生成管道,系统性地构建包含高风险碰撞和离路事件的丰富课程(curriculum),使模型从被动场景补全者转变为忠实反映行为与后果因果关系的可信预测器。该模型作为闭环RL中的内部评判者(internal critic),在策略优化阶段引导智能体“梦想”候选动作的潜在后果,从而显著提升失败预测能力,并在挑战性仿真环境中大幅减少安全违规事件,验证了“教会模型预见危险”是实现真正安全智能自主代理的关键步骤。

链接: https://arxiv.org/abs/2511.20325
作者: Tianyi Yan,Tao Tang,Xingtai Gui,Yongkang Li,Jiasen Zhesng,Weiyao Huang,Lingdong Kong,Wencheng Han,Xia Zhou,Xueyang Zhang,Yifei Zhan,Kun Zhan,Cheng-zhong Xu,Jianbing Shen
机构: University of Macau (澳门大学); Sun Yat-sen University (中山大学); Li Auto Inc. (理想汽车公司); Huazhong University of Science and Technology (华中科技大学); Northeastern University (东北大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:End-to-end models for autonomous driving hold the promise of learning complex behaviors directly from sensor data, but face critical challenges in safety and handling long-tail events. Reinforcement Learning (RL) offers a promising path to overcome these limitations, yet its success in autonomous driving has been elusive. We identify a fundamental flaw hindering this progress: a deep seated optimistic bias in the world models used for RL. To address this, we introduce a framework for post-training policy refinement built around an Impartial World Model. Our primary contribution is to teach this model to be honest about danger. We achieve this with a novel data synthesis pipeline, Counterfactual Synthesis, which systematically generates a rich curriculum of plausible collisions and off-road events. This transforms the model from a passive scene completer into a veridical forecaster that remains faithful to the causal link between actions and outcomes. We then integrate this Impartial World Model into our closed-loop RL framework, where it serves as an internal critic. During refinement, the agent queries the critic to ``dream" of the outcomes for candidate actions. We demonstrate through extensive experiments, including on a new Risk Foreseeing Benchmark, that our model significantly outperforms baselines in predicting failures. Consequently, when used as a critic, it enables a substantial reduction in safety violations in challenging simulations, proving that teaching a model to dream of danger is a critical step towards building truly safe and intelligent autonomous agents.
zh

[CV-54] IrisNet: Infrared Image Status Awareness Meta Decoder for Infrared Small Targets Detection

【速读】:该论文旨在解决红外小目标检测(Infrared Small Target Detection, IRSTD)中因信噪比低、背景复杂及目标特征不明显所带来的挑战,尤其是现有基于深度学习的编码器-解码器框架在不同场景下(如昼夜变化、天空/海洋/陆地区域)存在模式漂移(pattern drift)问题,导致检测鲁棒性不足。解决方案的关键在于提出一种名为IrisNet的元学习框架,其核心创新是通过图像到解码器的Transformer建立输入红外图像特征与整个解码器参数之间的动态映射关系;具体而言,将解码器参数建模为保留层级相关性的二维张量,并利用自注意力机制建模层间依赖、交叉注意力机制生成自适应解码模式,同时引入高频分量增强对目标位置和场景边缘信息的感知能力,从而实现跨场景的动态适配与高性能检测。

链接: https://arxiv.org/abs/2511.20319
作者: Xuelin Qian,Jiaming Lu,Zixuan Wang,Wenxuan Wang,Zhongling Huang,Dingwen Zhang,Junwei Han
机构: Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10pages,5figures

点击查看摘要

Abstract:Infrared Small Target Detection (IRSTD) faces significant challenges due to low signal-to-noise ratios, complex backgrounds, and the absence of discernible target features. While deep learning-based encoder-decoder frameworks have advanced the field, their static pattern learning suffers from pattern drift across diverse scenarios (\emphe.g., day/night variations, sky/maritime/ground domains), limiting robustness. To address this, we propose IrisNet, a novel meta-learned framework that dynamically adapts detection strategies to the input infrared image status. Our approach establishes a dynamic mapping between infrared image features and entire decoder parameters via an image-to-decoder transformer. More concretely, we represent the parameterized decoder as a structured 2D tensor preserving hierarchical layer correlations and enable the transformer to model inter-layer dependencies through self-attention while generating adaptive decoding patterns via cross-attention. To further enhance the perception ability of infrared images, we integrate high-frequency components to supplement target-position and scene-edge information. Experiments on NUDT-SIRST, NUAA-SIRST, and IRSTD-1K datasets demonstrate the superiority of our IrisNet, achieving state-of-the-art performance.
zh

[CV-55] ReFT: Taming Rectified Flow Models For One-Step Image Translation

【速读】:该论文旨在解决生成式 AI(Generative AI)中基于修正流(Rectified Flow, RF)模型在图像到图像翻译任务中仍依赖高成本多步去噪的问题,从而限制了实时应用的可行性。现有方法如CycleGAN-Turbo虽能在预训练扩散模型上实现一步推理,但直接应用于RF模型时会出现严重收敛问题。解决方案的关键在于提出TReFT(Tame Rectified Flow for one-step Translation),其核心创新是:在对抗训练框架下,直接使用预训练DiT或UNet预测的速度作为输出,这一设计有效缓解了收敛难题;该策略的合理性源于一个新发现——在去噪过程末期,RF模型预测的速度趋于从原点指向最终干净图像的向量,该性质通过理论分析得到验证。此外,TReFT还引入内存高效的潜在循环一致性与身份损失,以及轻量化结构简化,使大型预训练RF模型(如SD3.5和FLUX)在保持SOTA性能的同时支持实时推理。

链接: https://arxiv.org/abs/2511.20307
作者: Shengqian Li,Ming Gao,Yi Liu,Zuzeng Lin,Feng Wang,Feng Dai
机构: University of Chinese Academy of Sciences (中国科学院大学); Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); Beihang University (北京航空航天大学); Tianjin University (天津大学); CreateAI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Rectified Flow (RF) models have advanced high-quality image and video synthesis via optimal transport theory. However, when applied to image-to-image translation, they still depend on costly multi-step denoising, hindering real-time applications. Although the recent adversarial training paradigm, CycleGAN-Turbo, works in pretrained diffusion models for one-step image translation, we find that directly applying it to RF models leads to severe convergence issues. In this paper, we analyze these challenges and propose TReFT, a novel method to Tame Rectified Flow models for one-step image Translation. Unlike previous works, TReFT directly uses the velocity predicted by pretrained DiT or UNet as output-a simple yet effective design that tackles the convergence issues under adversarial training with one-step inference. This design is mainly motivated by a novel observation that, near the end of the denoising process, the velocity predicted by pretrained RF models converges to the vector from origin to the final clean image, a property we further justify through theoretical analysis. When applying TReFT to large pretrained RF models such as SD3.5 and FLUX, we introduce memory-efficient latent cycle-consistency and identity losses during training, as well as lightweight architectural simplifications for faster inference. Pretrained RF models finetuned with TReFT achieve performance comparable to sota methods across multiple image translation datasets while enabling real-time inference.
zh

[CV-56] aCo: Capturing Spatio-Temporal Semantic Consistency in Remote Sensing Change Detection

【速读】:该论文旨在解决遥感变化检测(Remote Sensing Change Detection, RSCD)中因仅依赖掩码监督而导致的时序语义不一致问题。现有方法虽能实现空间上的连贯预测,但难以保证跨时相的语义一致性,导致检测结果存在语义层面的错误。解决方案的关键在于提出TaCo网络,其核心创新是引入时空语义联合约束机制,将变化建模为双时相状态间的语义转换过程,并通过文本引导的过渡生成器(Text-guided Transition Generator)融合文本语义与双时相视觉特征,构建跨时相过渡特征;同时设计了双时相重建约束和过渡约束,前者确保重构特征与原始特征对齐,后者增强变化区域的判别能力,从而在不增加推理计算开销的前提下显著提升性能。

链接: https://arxiv.org/abs/2511.20306
作者: Han Guo,Chenyang Liu,Haotian Zhang,Bowen Chen,Zhengxia Zou,Zhenwei Shi
机构: Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote sensing change detection (RSCD) aims to identify surface changes across bi-temporal satellite images. Most previous methods rely solely on mask supervision, which effectively guides spatial localization but provides limited constraints on the temporal semantic transitions. Consequently, they often produce spatially coherent predictions while still suffering from unresolved semantic inconsistencies. To address this limitation, we propose TaCo, a spatio-temporal semantic consistent network, which enriches the existing mask-supervised framework with a spatio-temporal semantic joint constraint. TaCo conceptualizes change as a semantic transition between bi-temporal states, in which one temporal feature representation can be derived from the other via dedicated transition features. To realize this, we introduce a Text-guided Transition Generator that integrates textual semantics with bi-temporal visual features to construct the cross-temporal transition features. In addition, we propose a spatio-temporal semantic joint constraint consisting of bi-temporal reconstruct constraints and a transition constraint: the former enforces alignment between reconstructed and original features, while the latter enhances discrimination for changes. This design can yield substantial performance gains without introducing any additional computational overhead during inference. Extensive experiments on six public datasets, spanning both binary and semantic change detection tasks, demonstrate that TaCo consistently achieves SOTA performance.
zh

[CV-57] CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation

【速读】:该论文旨在解决遥感(Remote Sensing, RS)领域中参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法在大规模地球观测任务中适应性不足的问题,尤其是在存在多维且不可预测的域偏移(如空间、语义和频率偏移)时,现有PEFT方法难以有效激活基础模型的泛化表示能力。其解决方案的关键在于提出CrossEarth-Gate框架,该框架包含两个核心创新:一是构建一个涵盖空间、语义和频率模块的综合RS模块工具箱以应对多维域偏移;二是设计基于Fisher信息引导的自适应选择机制,通过量化各模块对任务特定梯度流的贡献度,动态地在合适层激活最关键模块,从而优化梯度传播路径,提升微调的有效性和效率。

链接: https://arxiv.org/abs/2511.20302
作者: Shilei Cao,Ziyang Gong,Hehai Lin,Yang Liu,Jiashun Cheng,Xiaoxing Hu,Haoyuan Liang,Guowen Li,Chengwei Qin,Hong Cheng,Xue Yang,Juepeng Zheng,Haohuan Fu
机构: Sun Yat-sen University (中山大学); The Chinese University of Hong Kong (香港中文大学); Shanghai Jiao Tong University (上海交通大学); National Supercomputing Center in Shenzhen (深圳国家超级计算中心); The Hong Kong University of Science and Technology (香港科技大学); Beijing Institute of Technology (北京理工大学); The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In Remote Sensing (RS), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key approach to activate the generalizable representation ability of foundation models for downstream tasks. However, existing specialized PEFT methods often fail when applied to large-scale Earth observation tasks, as they are unable to fully handle the multifaceted and unpredictable domain gaps (\eg, spatial, semantic, and frequency shifts) inherent in RS data. To overcome this, we propose CrossEarth-Gate, which introduces two primary contributions. First, we establish a comprehensive RS module toolbox to address multifaceted domain gaps, comprising spatial, semantic, and frequency modules. Second, we develop a Fisher-guided adaptive selection mechanism that operates on this toolbox. This selection is guided by Fisher Information to quantify each module’s importance by measuring its contribution to the task-specific gradient flow. It dynamically activates only the most critical modules at the appropriate layers, guiding the gradient flow to maximize adaptation effectiveness and efficiency. Comprehensive experiments validate the efficacy and generalizability of our method, where CrossEarth-Gate achieves state-of-the-art performance across 16 cross-domain benchmarks for RS semantic segmentation. The code of the work will be released.
zh

[CV-58] Prompting Lipschitz-constrained network for multiple-in-one sparse-view CT reconstruction

【速读】:该论文旨在解决深度学习驱动的稀疏视图计算机断层成像(Sparse-View Computed Tomography, SVCT)重建算法中存在的两大问题:一是现有深度展开算法中先验网络难以显式证明满足Lipschitz约束,因其多为经验设计;二是针对不同稀疏采样设置需训练独立模型,导致存储成本高,限制了临床实用价值。解决方案的关键在于提出一种可显式证明Lipschitz约束的网络结构LipNet,并引入显式提示模块(prompt module),以区分不同稀疏采样配置,从而实现“多视图合一”的统一重建模型。在此基础上,构建了名为PromptCT的存储高效深度展开框架,将LipNet作为先验网络嵌入迭代优化流程,确保算法收敛性,同时在模拟与真实数据实验中显著优于基准方法,在保证高质量重建的同时大幅降低存储开销。理论层面进一步证明LipNet满足边界性质,从而严格确立其Lipschitz连续性并分析迭代算法收敛性。

链接: https://arxiv.org/abs/2511.20296
作者: Baoshun Shi,Ke Jiang,Qiusheng Lian,Xinran Yu,Huazhu Fu
机构: Yanshan University (燕山大学); Capital Normal University (首都师范大学); Institute of High Performance Computing (IHPC) (高性能计算研究所); Agency for Science, Technology and Research (A*STAR) (新加坡科技研究局)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite significant advancements in deep learning-based sparse-view computed tomography (SVCT) reconstruction algorithms, these methods still encounter two primary limitations: (i) It is challenging to explicitly prove that the prior networks of deep unfolding algorithms satisfy Lipschitz constraints due to their empirically designed nature. (ii) The substantial storage costs of training a separate model for each setting in the case of multiple views hinder practical clinical applications. To address these issues, we elaborate an explicitly provable Lipschitz-constrained network, dubbed LipNet, and integrate an explicit prompt module to provide discriminative knowledge of different sparse sampling settings, enabling the treatment of multiple sparse view configurations within a single model. Furthermore, we develop a storage-saving deep unfolding framework for multiple-in-one SVCT reconstruction, termed PromptCT, which embeds LipNet as its prior network to ensure the convergence of its corresponding iterative algorithm. In simulated and real data experiments, PromptCT outperforms benchmark reconstruction algorithms in multiple-in-one SVCT reconstruction, achieving higher-quality reconstructions with lower storage costs. On the theoretical side, we explicitly demonstrate that LipNet satisfies boundary property, further proving its Lipschitz continuity and subsequently analyzing the convergence of the proposed iterative algorithms. The data and code are publicly available at this https URL.
zh

[CV-59] Back to the Feature: Explaining Video Classifiers with Video Counterfactual Explanations

【速读】:该论文旨在解决视频分类模型的可解释性问题,即如何生成物理上合理、时间上连贯且运动轨迹平滑的视频类反事实解释(Counterfactual Explanations, CFEs)。现有方法主要针对图像分类器设计,无法有效生成满足视频特性要求的反事实视频。解决方案的关键在于提出一种名为 Back To The Feature (BTTF) 的优化框架,其核心创新包括:1)基于输入视频第一帧条件化的初始潜在噪声检索机制,确保生成视频与原视频在空间上一致;2)两阶段优化策略,使反事实视频搜索聚焦于输入视频邻域,从而提升解释的局部性和有效性;同时,通过仅依赖目标分类器进行引导的优化过程,保证解释的忠实性,并引入渐进式优化策略加速收敛。

链接: https://arxiv.org/abs/2511.20295
作者: Chao Wang,Chengan Che,Xinyue Chen,Sophia Tsoka,Luis C. Garcia-Peraza-Herrera
机构: King’s College London (伦敦国王学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Counterfactual explanations (CFEs) are minimal and semantically meaningful modifications of the input of a model that alter the model predictions. They highlight the decisive features the model relies on, providing contrastive interpretations for classifiers. State-of-the-art visual counterfactual explanation methods are designed to explain image classifiers. The generation of CFEs for video classifiers remains largely underexplored. For the counterfactual videos to be useful, they have to be physically plausible, temporally coherent, and exhibit smooth motion trajectories. Existing CFE image-based methods, designed to explain image classifiers, lack the capacity to generate temporally coherent, smooth and physically plausible video CFEs. To address this, we propose Back To The Feature (BTTF), an optimization framework that generates video CFEs. Our method introduces two novel features, 1) an optimization scheme to retrieve the initial latent noise conditioned by the first frame of the input video, 2) a two-stage optimization strategy to enable the search for counterfactual videos in the vicinity of the input video. Both optimization processes are guided solely by the target classifier, ensuring the explanation is faithful. To accelerate convergence, we also introduce a progressive optimization strategy that incrementally increases the number of denoising steps. Extensive experiments on video datasets such as Shape-Moving (motion classification), MEAD (emotion classification), and NTU RGB+D (action classification) show that our BTTF effectively generates valid, visually similar and realistic counterfactual videos that provide concrete insights into the classifier’s decision-making mechanism.
zh

[CV-60] Bootstrapping Physics-Grounded Video Generation through VLM-Guided Iterative Self-Refinement ICCV2025

【速读】:该论文旨在解决当前视频生成模型在视觉质量提升的同时,仍难以遵循真实世界物理规律的问题(即物理一致性不足)。其解决方案的关键在于提出一种无需训练、可即插即用的迭代自精炼框架,利用大语言模型(LLM)和视觉-语言模型(Vision-Language Model, VLM)提供物理感知引导,并引入多模态思维链(Multimodal Chain-of-Thought, MM-CoT)机制,通过识别与反馈物理不一致信息来逐步优化生成提示(prompt),从而显著提升视频生成结果的物理合理性。实验表明,该方法在PhyIQ基准测试中将Physics-IQ得分从56.31提升至62.38。

链接: https://arxiv.org/abs/2511.20280
作者: Yang Liu,Xilin Zhao,Peisong Wen,Siran Dai,Qingming Huang
机构: University of Chinese Academy of Sciences (中国科学院大学); Beijing Institute of Technology (北京理工大学); Chinese Academy of Sciences (中国科学院); School of Cyber Security, University of Chinese Academy of Sciences (中国科学院大学网络空间安全学院); Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025 Physics-IQ Challenge Third Place Solution

点击查看摘要

Abstract:Recent progress in video generation has led to impressive visual quality, yet current models still struggle to produce results that align with real-world physical principles. To this end, we propose an iterative self-refinement framework that leverages large language models and vision-language models to provide physics-aware guidance for video generation. Specifically, we introduce a multimodal chain-of-thought (MM-CoT) process that refines prompts based on feedback from physical inconsistencies, progressively enhancing generation quality. This method is training-free and plug-and-play, making it readily applicable to a wide range of video generation models. Experiments on the PhyIQ benchmark show that our method improves the Physics-IQ score from 56.31 to 62.38. We hope this work serves as a preliminary exploration of physics-consistent video generation and may offer insights for future research.
zh

[CV-61] SelfMOTR: Revisiting MOTR with Self-Generating Detection Priors

【速读】:该论文旨在解决基于Transformer架构的端到端多目标跟踪(Multi-Object Tracking, MOT)方法中检测性能不佳以及检测与关联模块在联合架构中存在冲突的问题。其解决方案的关键在于提出SelfMOTR,一种依赖自生成检测先验(detection priors)的新型跟踪Transformer模型;通过深入分析和消融实验,作者揭示了MOTR类模型隐藏的强检测能力,并提供了一套实用工具以有效利用这些能力,从而在DanceTrack数据集上达到与当前先进端到端跟踪方法相当的性能。

链接: https://arxiv.org/abs/2511.20279
作者: Fabian Gülhan,Emil Mededovic,Yuli Wu,Johannes Stegmaier
机构: RWTH Aachen University (亚琛工业大学); Heinrich Heine University Düsseldorf (海德堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures, 10 tables

点击查看摘要

Abstract:Despite progress toward end-to-end tracking with transformer architectures, poor detection performance and the conflict between detection and association in a joint architecture remain critical concerns. Recent approaches aim to mitigate these issues by (i) employing advanced denoising or label assignment strategies, or (ii) incorporating detection priors from external object detectors via distillation or anchor proposal techniques. Inspired by the success of integrating detection priors and by the key insight that MOTR-like models are secretly strong detection models, we introduce SelfMOTR, a novel tracking transformer that relies on self-generated detection priors. Through extensive analysis and ablation studies, we uncover and demonstrate the hidden detection capabilities of MOTR-like models, and present a practical set of tools for leveraging them effectively. On DanceTrack, SelfMOTR achieves strong performance, competing with recent state-of-the-art end-to-end tracking methods.
zh

[CV-62] DAPointMamba: Domain Adaptive Point Mamba for Point Cloud Completion AAAI2026

【速读】:该论文旨在解决域自适应点云补全(Domain Adaptive Point Cloud Completion, DA PCC)中源域与目标域之间几何结构和语义差异导致的性能下降问题。现有方法受限于卷积神经网络(CNNs)或视觉Transformer的局部感受野限制或二次计算复杂度,难以有效适应不同域的数据分布。解决方案的关键在于提出一种基于状态空间模型(State Space Models, SSMs)的新框架DAPointMamba,其核心创新包括三个模块:跨域Patch级扫描(Cross-Domain Patch-Level Scanning)建立局部几何对应关系以实现有效对齐;跨域空间SSM对齐(Cross-Domain Spatial SSM Alignment)通过跨域相似性调节patch特征强化空间一致性;跨域通道SSM对齐(Cross-Domain Channel SSM Alignment)通过交错对齐特征通道缓解全局语义鸿沟。该框架兼具全局感受野与线性复杂度优势,在合成与真实世界基准上显著优于现有最先进方法,同时降低计算开销与推理延迟。

链接: https://arxiv.org/abs/2511.20278
作者: Yinghui Li,Qianyu Zhou,Di Shao,Hao Yang,Ye Zhu,Richard Dazeley,Xuequan Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2026

点击查看摘要

Abstract:Domain adaptive point cloud completion (DA PCC) aims to narrow the geometric and semantic discrepancies between the labeled source and unlabeled target domains. Existing methods either suffer from limited receptive fields or quadratic complexity due to using CNNs or vision Transformers. In this paper, we present the first work that studies the adaptability of State Space Models (SSMs) in DA PCC and find that directly applying SSMs to DA PCC will encounter several challenges: directly serializing 3D point clouds into 1D sequences often disrupts the spatial topology and local geometric features of the target domain. Besides, the overlook of designs in the learning domain-agnostic representations hinders the adaptation performance. To address these issues, we propose a novel framework, DAPointMamba for DA PCC, that exhibits strong adaptability across domains and has the advantages of global receptive fields and efficient linear complexity. It has three novel modules. In particular, Cross-Domain Patch-Level Scanning introduces patch-level geometric correspondences, enabling effective local alignment. Cross-Domain Spatial SSM Alignment further strengthens spatial consistency by modulating patch features based on cross-domain similarity, effectively mitigating fine-grained structural discrepancies. Cross-Domain Channel SSM Alignment actively addresses global semantic gaps by interleaving and aligning feature channels. Extensive experiments on both synthetic and real-world benchmarks demonstrate that our DAPointMamba outperforms state-of-the-art methods with less computational complexity and inference latency.
zh

[CV-63] ScenarioCLIP: Pretrained Transferable Visual Language Models and Action-Genome Dataset for Natural Scene Analysis

【速读】:该论文旨在解决当前基于CLIP(Contrastive Language–Image Pretraining)的模型在处理现实世界场景图像时存在的局限性,即现有方法主要聚焦于单对象分类或短文本描述检索任务,未能有效建模图像中多对象之间的关系结构和组合特性。为应对这一挑战,论文提出ScenarioCLIP模型,其关键创新在于引入输入文本、关系锚定(grounded relations)以及聚焦区域(focused regions),显式地对图像中的对象间关系进行建模,并通过预训练阶段利用人工与自动结合的标注流程构建了一个新的场景级数据集(scenario-based dataset),从而支持跨模态检索和细粒度视觉理解等下游任务。该方案显著提升了零样本(zero-shot)及微调(fine-tune)性能,在多个领域特定任务上展现出更强的泛化能力。

链接: https://arxiv.org/abs/2511.20274
作者: Advik Sinha,Saurabh Atreya,Aashutosh A V,Sk Aziz Ali,Abhijit Das
机构: Birla Institute of Technology and Science, Pilani – Hyderabad Campus (比尔拉理工大学与科学学院,皮拉尼-海得拉巴校区)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Until recently, the general corpus of CLIP-type fundamental models has widely explored either the retrieval of short descriptions or the classification of objects in the scene as SINGLE-object image classification task. The same holds for retrieving the image embedding (image retrieval task) given a text prompt. However, real-world scene images exhibit rich compositional structure involving multiple objects and actions. The latest methods in the CLIP-based literature improve class-level discrimination by mining harder negative image-text pairs and by refining permanent text prompts, often using LLMs. However, these improvements remain confined to predefined class lists and do not explicitly model relational or compositional structure. PyramidCLIP partially addresses this gap by aligning global and local visual features, yet it still lacks explicit modeling of inter-object relations. Hence, to further leverage this aspect for scene analysis, the proposed ScenarioCLIP model accepts input texts, grounded relations, and input images, along with focused regions highlighting relations. The proposed model is pretrained on curated scenario data, and finetuned for specialized downstream tasks, such as cross-modal retrieval and fine-grained visual understanding tasks. To address the lack of domain-specific datasets, we generate a novel dataset by extending image-text pairs from existing diverse indoor and outdoor scenario datasets that are publicly available. We used a pipeline of existing language models to ground action, object, and relations, filled by manual and automatic curation. We established a comprehensive benchmark for several scenario-based tasks and compared it with many baseline methods. ScenarioCLIP demonstrates robust zero-shot and finetune performance on various domain-specific tasks. Our code and dataset are available at this https URL
zh

[CV-64] VKnowU: Evaluating Visual Knowledge Understanding in Multimodal LLM s

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在理解视觉世界深层语义方面的不足,即缺乏对物理和社会原理的类人直觉认知能力,这种能力被称为视觉知识(visual knowledge),是连接感知与推理的关键桥梁。为系统评估这一能力,作者构建了VKnowU基准测试集,并发现主流MLLMs在世界中心型(如直观物理)任务上表现显著落后于人类水平。解决方案的核心在于提出新数据集VKnowQA和VideoKnow+基线模型,后者采用“看-思考-回答”结构化范式,并引入基于视觉知识奖励的强化学习机制,从而显式地将视觉知识融入模型训练中,在VKnowU上实现+3.7%的性能提升,并在多个视频理解基准(MVBench、Video-MME、MMVU)上取得一致改进,验证了视觉知识作为通用MLLM发展基石的重要性。

链接: https://arxiv.org/abs/2511.20272
作者: Tianxiang Jiang,Sheng Xia,Yicheng Xu,Linquan Wu,Xiangyu Zeng,Limin Wang,Yu Qiao,Yi Wang
机构: University of Science and Technology of China (中国科学技术大学); Shanghai AI Laboratory (上海人工智能实验室); Nanjing University (南京大学); Shanghai Innovation Institute (上海创新研究院); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Data Code: this https URL

点击查看摘要

Abstract:While Multimodal Large Language Models (MLLMs) have become adept at recognizing objects, they often lack the intuitive, human-like understanding of the world’s underlying physical and social principles. This high-level vision-grounded semantics, which we term visual knowledge, forms a bridge between perception and reasoning, yet remains an underexplored area in current MLLMs. To systematically evaluate this capability, we present VKnowU, a comprehensive benchmark featuring 1,680 questions in 1,249 videos, covering 8 core types of visual knowledge spanning both world-centric (e.g., intuitive physics) and human-centric (e.g., subjective intentions). Evaluation of 23 SOTA MLLMs reveals that leading models still fall short of human performance, with particularly notable gaps in the world-centric. To bridge this gap, we introduce a new dataset, VKnowQA, and VideoKnow+, a baseline model that explicitly incorporates visual knowledge into MLLMs. VideoKnow+ follows a structured See-Think-Answer paradigm and adopts reinforcement learning with visual knowledge reward, achieving a +3.7% improvement on VKnowU and consistent gains on MVBench, Video-MME, and MMVU. Our work highlights visual knowledge as a missing cornerstone for developing more generalizable MLLMs that can not only see but also truly understand our physical and social worlds.
zh

[CV-65] DRL-Guided Neural Batch Sampling for Semi-Supervised Pixel-Level Anomaly Detection

【速读】:该论文旨在解决工业视觉检测中因缺陷样本稀缺而导致的异常检测难题,现有方法多依赖仅用正常数据进行无监督重建,常导致过拟合且难以识别细微缺陷。其解决方案的关键在于提出一种半监督深度强化学习框架,核心创新包括:基于强化学习的神经批次采样器通过复合奖励机制平衡探索与利用,自适应选择具有信息量的图像块;自动编码器生成损失轮廓以突出异常区域,预测器在损失空间中执行分割任务;二者协同作用使系统能在有限标注数据下有效学习正常与缺陷模式,从而提升对细微异常的检测精度与定位能力。

链接: https://arxiv.org/abs/2511.20270
作者: Amirhossein Khadivi Noghredeh,Abdollah Safari,Fatemeh Ziaeetabar,Firoozeh Haghighi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Anomaly detection in industrial visual inspection is challenging due to the scarcity of defective samples. Most existing methods rely on unsupervised reconstruction using only normal data, often resulting in overfitting and poor detection of subtle defects. We propose a semi-supervised deep reinforcement learning framework that integrates a neural batch sampler, an autoencoder, and a predictor. The RL-based sampler adaptively selects informative patches by balancing exploration and exploitation through a composite reward. The autoencoder generates loss profiles highlighting abnormal regions, while the predictor performs segmentation in the loss-profile space. This interaction enables the system to effectively learn both normal and defective patterns with limited labeled data. Experiments on the MVTec AD dataset demonstrate that our method achieves higher accuracy and better localization of subtle anomalies than recent state-of-the-art approaches while maintaining low complexity, yielding an average improvement of 0.15 in F1_max and 0.06 in AUC, with a maximum gain of 0.37 in F1_max in the best case.
zh

[CV-66] Advancing Image Classification with Discrete Diffusion Classification Modeling

【速读】:该论文旨在解决图像分类任务在高不确定性条件下的性能瓶颈问题,例如输入图像受损或训练数据有限时,传统分类模型直接从图像预测类别标签往往表现不佳。其解决方案的关键在于提出了一种名为离散扩散分类建模(Discrete Diffusion Classification Modeling, DiDiCM)的新框架,该框架利用基于扩散过程的方法来建模给定输入图像条件下类别标签的后验分布,从而支持在类别概率或离散类别标签上进行扩散推理,提供了计算与内存消耗之间的灵活权衡。实验表明,DiDiCM在ImageNet数据集上显著优于标准分类器,尤其在任务难度增加时准确率提升更为明显。

链接: https://arxiv.org/abs/2511.20263
作者: Omer Belhasin,Shelly Golan,Ran El-Yaniv,Michael Elad
机构: Technion - Israel Institute of Technology (以色列理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image classification is a well-studied task in computer vision, and yet it remains challenging under high-uncertainty conditions, such as when input images are corrupted or training data are limited. Conventional classification approaches typically train models to directly predict class labels from input images, but this might lead to suboptimal performance in such scenarios. To address this issue, we propose Discrete Diffusion Classification Modeling (DiDiCM), a novel framework that leverages a diffusion-based procedure to model the posterior distribution of class labels conditioned on the input image. DiDiCM supports diffusion-based predictions either on class probabilities or on discrete class labels, providing flexibility in computation and memory trade-offs. We conduct a comprehensive empirical study demonstrating the superior performance of DiDiCM over standard classifiers, showing that a few diffusion iterations achieve higher classification accuracy on the ImageNet dataset compared to baselines, with accuracy gains increasing as the task becomes more challenging. We release our code at this https URL .
zh

[CV-67] Modality-Balanced Collaborative Distillation for Multi-Modal Domain Generalization

【速读】:该论文旨在解决多模态领域泛化(Multi-Modal Domain Generalization, MMDG)中权重平均(Weight Averaging, WA)技术因模态间优化速度差异而导致的早期过拟合问题:WA会偏向快速收敛的模态,抑制慢速但互补模态的贡献,从而破坏模态融合并使损失曲面趋向于更尖锐、泛化能力差的极小值。解决方案的关键在于提出MBCD(Modality-Balanced Collaborative Distillation)框架,其核心机制包括:1)在学生模型中引入自适应模态丢弃策略以缓解早期对主导模态的偏倚;2)通过梯度一致性约束对齐单模态分支与融合表示的学习信号,促进协同优化;3)基于WA的教师模型执行跨模态蒸馏,将融合知识传递至各单模态分支,强化跨模态交互并引导收敛至更平坦的解空间。

链接: https://arxiv.org/abs/2511.20258
作者: Xiaohan Wang,Zhangtao Cheng,Ting Zhong,Leiting Chen,Fan Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Weight Averaging (WA) has emerged as a powerful technique for enhancing generalization by promoting convergence to a flat loss landscape, which correlates with stronger out-of-distribution performance. However, applying WA directly to multi-modal domain generalization (MMDG) is challenging: differences in optimization speed across modalities lead WA to overfit to faster-converging ones in early stages, suppressing the contribution of slower yet complementary modalities, thereby hindering effective modality fusion and skewing the loss surface toward sharper, less generalizable minima. To address this issue, we propose MBCD, a unified collaborative distillation framework that retains WA’s flatness-inducing advantages while overcoming its shortcomings in multi-modal contexts. MBCD begins with adaptive modality dropout in the student model to curb early-stage bias toward dominant modalities. A gradient consistency constraint then aligns learning signals between uni-modal branches and the fused representation, encouraging coordinated and smoother optimization. Finally, a WA-based teacher conducts cross-modal distillation by transferring fused knowledge to each uni-modal branch, which strengthens cross-modal interactions and steer convergence toward flatter solutions. Extensive experiments on MMDG benchmarks show that MBCD consistently outperforms existing methods, achieving superior accuracy and robustness across diverse unseen domains.
zh

[CV-68] he Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation

【速读】:该论文旨在解决生成式图像强化学习(Reinforcement Learning, RL)中奖励函数不可靠的问题,尤其是现有方法依赖预训练偏好模型输出标量奖励时,常因无法准确反映人类感知而出现奖励黑客(reward hacking)现象,即高奖励分数并不对应高质量图像。其解决方案的关键在于提出Adv-GRPO框架,通过引入对抗性奖励机制,迭代更新奖励模型与生成器;同时,将参考图像与视觉基础模型(如DINO)结合,以图像本身作为密集视觉奖励信号替代单一标量奖励,从而更有效地引导生成过程,在图像质量、美学和任务指标上实现稳定提升。此外,该方法还支持分布迁移与风格定制,显著优于Flow-GRPO和SD3等基线模型。

链接: https://arxiv.org/abs/2511.20256
作者: Weijia Mao,Hao Chen,Zhenheng Yang,Mike Zheng Shou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:A reliable reward function is essential for reinforcement learning (RL) in image generation. Most current RL approaches depend on pre-trained preference models that output scalar rewards to approximate human preferences. However, these rewards often fail to capture human perception and are vulnerable to reward hacking, where higher scores do not correspond to better images. To address this, we introduce Adv-GRPO, an RL framework with an adversarial reward that iteratively updates both the reward model and the generator. The reward model is supervised using reference images as positive samples and can largely avoid being hacked. Unlike KL regularization that constrains parameter updates, our learned reward directly guides the generator through its visual outputs, leading to higher-quality images. Moreover, while optimizing existing reward functions can alleviate reward hacking, their inherent biases remain. For instance, PickScore may degrade image quality, whereas OCR-based rewards often reduce aesthetic fidelity. To address this, we take the image itself as a reward, using reference images and vision foundation models (e.g., DINO) to provide rich visual rewards. These dense visual signals, instead of a single scalar, lead to consistent gains across image quality, aesthetics, and task-specific metrics. Finally, we show that combining reference samples with foundation-model rewards enables distribution transfer and flexible style customization. In human evaluation, our method outperforms Flow-GRPO and SD3, achieving 70.0% and 72.4% win rates in image quality and aesthetics, respectively. Code and models have been released.
zh

[CV-69] XiCAD: Camera Activation Detection in the Da Vinci Xi User Interface

【速读】:该论文旨在解决机器人辅助微创手术中缺乏可靠摄像头状态信息的问题,尤其针对达芬奇Xi系统UI中相机图块(camera tile)激活状态的自动识别问题。这一信息对于下游手术数据分析任务(如器械追踪、技能评估或摄像控制自动化)至关重要。解决方案的关键在于提出了一种基于ResNet18卷积神经网络的轻量级检测流水线,通过在SurgToolLoc数据集上手动标注的数据进行微调,并在三个公开数据集(总计超7万帧图像)上验证,实现了对相机激活状态的高精度二分类(F1分数0.993–1.000)和零误检的精确定位,从而支持实时、可靠的手术视频元数据提取。

链接: https://arxiv.org/abs/2511.20254
作者: Alexander C. Jenke,Gregor Just,Claas de Boer,Martin Wagner,Sebastian Bodenstedt,Stefanie Speidel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Purpose: Robot-assisted minimally invasive surgery relies on endoscopic video as the sole intraoperative visual feedback. The DaVinci Xi system overlays a graphical user interface (UI) that indicates the state of each robotic arm, including the activation of the endoscope arm. Detecting this activation provides valuable metadata such as camera movement information, which can support downstream surgical data science tasks including tool tracking, skill assessment, or camera control automation. Methods: We developed a lightweight pipeline based on a ResNet18 convolutional neural network to automatically identify the position of the camera tile and its activation state within the DaVinci Xi UI. The model was fine-tuned on manually annotated data from the SurgToolLoc dataset and evaluated across three public datasets comprising over 70,000 frames. Results: The model achieved F1-scores between 0.993 and 1.000 for the binary detection of active cameras and correctly localized the camera tile in all cases without false multiple-camera detections. Conclusion: The proposed pipeline enables reliable, real-time extraction of camera activation metadata from surgical videos, facilitating automated preprocessing and analysis for diverse downstream applications. All code, trained models, and annotations are publicly available. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.20254 [cs.CV] (or arXiv:2511.20254v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2511.20254 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Alexander C. Jenke [view email] [v1] Tue, 25 Nov 2025 12:29:10 UTC (816 KB)
zh

[CV-70] Zoo3D: Zero-Shot 3D Object Detection at Scene Level

【速读】:该论文旨在解决开放词汇表三维(3D)目标检测中模型依赖训练数据的问题,尤其是在真实环境中识别未见过物体的挑战。现有方法虽降低了标注需求,但仍需在特定场景的点云或图像上进行训练,限制了其泛化能力。解决方案的关键在于提出首个无需训练的3D目标检测框架Zoo3D:首先通过图聚类2D实例掩码生成3D边界框,再利用新颖的开放词汇模块结合最佳视角选择和视图一致性掩码生成实现语义标签分配;该框架包含零样本模式(Zoo3D₀)与自监督微调模式(Zoo3D₁),后者基于前者生成的伪标签进一步优化检测性能,且可直接处理带位姿甚至无位姿的图像输入,显著提升了模型在真实场景下的适应性与准确性。

链接: https://arxiv.org/abs/2511.20253
作者: Andrey Lemeshko,Bulat Gabdullin,Nikita Drozdov,Anton Konushin,Danila Rukhovich,Maksim Kolodiazhnyi
机构: Lomonosov Moscow State University (莫斯科国立大学); Higher School of Economics (高等经济学院); M:3L Lab, Institute of Mechanics, Armenia (亚美尼亚力学研究所M:3L实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D object detection is fundamental for spatial understanding. Real-world environments demand models capable of recognizing diverse, previously unseen objects, which remains a major limitation of closed-set methods. Existing open-vocabulary 3D detectors relax annotation requirements but still depend on training scenes, either as point clouds or images. We take this a step further by introducing Zoo3D, the first training-free 3D object detection framework. Our method constructs 3D bounding boxes via graph clustering of 2D instance masks, then assigns semantic labels using a novel open-vocabulary module with best-view selection and view-consensus mask generation. Zoo3D operates in two modes: the zero-shot Zoo3D _0 , which requires no training at all, and the self-supervised Zoo3D _1 , which refines 3D box prediction by training a class-agnostic detector on Zoo3D _0 -generated pseudo labels. Furthermore, we extend Zoo3D beyond point clouds to work directly with posed and even unposed images. Across ScanNet200 and ARKitScenes benchmarks, both Zoo3D _0 and Zoo3D _1 achieve state-of-the-art results in open-vocabulary 3D object detection. Remarkably, our zero-shot Zoo3D _0 outperforms all existing self-supervised methods, hence demonstrating the power and adaptability of training-free, off-the-shelf approaches for real-world 3D understanding. Code is available at this https URL .
zh

[CV-71] PromptMoG: Enhancing Diversity in Long-Prompt Image Generation via Prompt Embedding Mixture-of-Gaussian Sampling

【速读】:该论文旨在解决生成式 AI(Generative AI)在长提示(long prompts)条件下出现的保真度(fidelity)与多样性(diversity)之间的权衡问题。研究表明,随着提示长度增加,当前最先进的文本到图像(text-to-image, T2I)模型虽能提升输出细节和语义一致性,但多样性显著下降,导致生成结果重复且缺乏创造性。解决方案的关键在于提出一种无需训练的方法 PromptMoG,其核心思想是通过在嵌入空间中从高斯混合模型(Mixture-of-Gaussians, MoG)采样多个提示嵌入(prompt embeddings),以增加采样熵并提升多样性,同时保持语义一致性,从而实现对长提示生成任务中多样性瓶颈的有效缓解。

链接: https://arxiv.org/abs/2511.20251
作者: Bo-Kai Ruan,Teng-Fang Hsiao,Ling Lo,Yi-Lun Wu,Hong-Han Shuai
机构: National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report

点击查看摘要

Abstract:Recent advances in text-to-image (T2I) generation have achieved remarkable visual outcomes through large-scale rectified flow models. However, how these models behave under long prompts remains underexplored. Long prompts encode rich content, spatial, and stylistic information that enhances fidelity but often suppresses diversity, leading to repetitive and less creative outputs. In this work, we systematically study this fidelity-diversity dilemma and reveal that state-of-the-art models exhibit a clear drop in diversity as prompt length increases. To enable consistent evaluation, we introduce LPD-Bench, a benchmark designed for assessing both fidelity and diversity in long-prompt generation. Building on our analysis, we develop a theoretical framework that increases sampling entropy through prompt reformulation and propose a training-free method, PromptMoG, which samples prompt embeddings from a Mixture-of-Gaussians in the embedding space to enhance diversity while preserving semantics. Extensive experiments on four state-of-the-art models, SD3.5-Large, Flux.1-Krea-Dev, CogView4, and Qwen-Image, demonstrate that PromptMoG consistently improves long-prompt generation diversity without semantic drifting.
zh

[CV-72] Uplifting Table Tennis: A Robust Real-World Application for 3D Trajectory and Spin Estimation

【速读】:该论文旨在解决从标准单目视频中精确获取乒乓球三维运动轨迹的难题,尤其是现有基于合成数据训练的方法在真实场景下因缺乏三维地面真值轨迹和旋转标注而泛化能力差的问题。解决方案的关键在于提出一种两阶段流水线:前端感知模块利用新构建的TTHQ数据集进行大量二维监督训练,后端2D到3D提升网络则仅使用物理正确的合成数据训练,并针对性地重构提升模型以增强对真实世界常见干扰(如检测缺失和帧率变化)的鲁棒性;通过整合球体检测器与台面关键点检测器,实现了从概念验证到实用、稳健且高性能的端到端乒乓球三维轨迹与旋转分析系统。

链接: https://arxiv.org/abs/2511.20250
作者: Daniel Kienzle,Katja Ludwig,Julian Lorenz,Shin’ichi Satoh,Rainer Lienhart
机构: University of Augsburg (奥格斯堡大学); National Institute of Informatics (日本信息研究所); University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Obtaining the precise 3D motion of a table tennis ball from standard monocular videos is a challenging problem, as existing methods trained on synthetic data struggle to generalize to the noisy, imperfect ball and table detections of the real world. This is primarily due to the inherent lack of 3D ground truth trajectories and spin annotations for real-world video. To overcome this, we propose a novel two-stage pipeline that divides the problem into a front-end perception task and a back-end 2D-to-3D uplifting task. This separation allows us to train the front-end components with abundant 2D supervision from our newly created TTHQ dataset, while the back-end uplifting network is trained exclusively on physically-correct synthetic data. We specifically re-engineer the uplifting model to be robust to common real-world artifacts, such as missing detections and varying frame rates. By integrating a ball detector and a table keypoint detector, our approach transforms a proof-of-concept uplifting method into a practical, robust, and high-performing end-to-end application for 3D table tennis trajectory and spin analysis.
zh

[CV-73] HistoSpeckle-Net: Mutual Information-Guided Deep Learning for high-fidelity reconstruction of complex OrganAMNIST images via perturbed Multimode Fibers

【速读】:该论文旨在解决多模光纤(Multimode Fiber, MMF)成像中现有深度学习方法在复杂真实场景下泛化能力不足、数据依赖性强的问题,尤其是在有限训练样本和光纤弯曲扰动条件下重建图像结构 fidelity 低的挑战。解决方案的关键在于提出 HistoSpeckle-Net 架构,其核心创新包括:一是引入分布感知的学习策略,通过基于直方图的互信息损失(histogram-based mutual information loss)增强模型对 speckle 统计特性与重构图像统计一致性建模的能力,从而降低对大规模标注数据的依赖;二是设计三尺度特征精炼模块(Three-Scale Feature Refinement Module),结合多尺度结构相似性指数(SSIM)损失以提升图像结构保真度。这两项机制共同提升了模型在复杂医学图像重建中的鲁棒性和性能,使其更适用于临床环境下的实际部署。

链接: https://arxiv.org/abs/2511.20245
作者: Jawaria Maqbool,M. Imran Cheema
机构: Lahore University of Management Sciences (拉合尔管理科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
备注:

点击查看摘要

Abstract:Existing deep learning methods in multimode fiber (MMF) imaging often focus on simpler datasets, limiting their applicability to complex, real-world imaging tasks. These models are typically data-intensive, a challenge that becomes more pronounced when dealing with diverse and complex images. In this work, we propose HistoSpeckle-Net, a deep learning architecture designed to reconstruct structurally rich medical images from MMF speckles. To build a clinically relevant dataset, we develop an optical setup that couples laser light through a spatial light modulator (SLM) into an MMF, capturing output speckle patterns corresponding to input OrganAMNIST images. Unlike previous MMF imaging approaches, which have not considered the underlying statistics of speckles and reconstructed images, we introduce a distribution-aware learning strategy. We employ a histogram-based mutual information loss to enhance model robustness and reduce reliance on large datasets. Our model includes a histogram computation unit that estimates smooth marginal and joint histograms for calculating mutual information loss. It also incorporates a unique Three-Scale Feature Refinement Module, which leads to multiscale Structural Similarity Index Measure (SSIM) loss computation. Together, these two loss functions enhance both the structural fidelity and statistical alignment of the reconstructed images. Our experiments on the complex OrganAMNIST dataset demonstrate that HistoSpeckle-Net achieves higher fidelity than baseline models such as U-Net and Pix2Pix. It gives superior performance even with limited training samples and across varying fiber bending conditions. By effectively reconstructing complex anatomical features with reduced data and under fiber perturbations, HistoSpeckle-Net brings MMF imaging closer to practical deployment in real-world clinical environments.
zh

[CV-74] V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs

【速读】:该论文旨在解决当前对抗攻击方法在大型视觉语言模型(Large Vision-Language Models, LVLMs)中难以实现对图像特定概念语义进行精确操控的问题。现有方法通常基于patch-token表示进行攻击,但由于自注意力机制聚合的全局上下文主导了局部patch特征,导致语义纠缠严重,从而丧失了对局部语义的可控性。论文的关键发现是:Transformer注意力模块中的值特征(Value features, V)能够有效抑制全局上下文干扰,保留高熵且解耦的局部语义信息,因此成为更精准的操控目标。基于此,作者提出V-Attack方法,其核心创新包括两个组件:(1) 自值增强模块(Self-Value Enhancement),用于提升V本身的语义丰富度;(2) 文本引导的值操纵模块(Text-Guided Value Manipulation),利用文本提示定位源概念并优化为目标概念。通过直接操作V而非传统patch特征,V-Attack显著提升了语义操控的准确性和成功率,在多个主流LVLM上平均攻击成功率比现有最优方法提高36%,揭示了现代视觉语言理解模型的关键脆弱性。

链接: https://arxiv.org/abs/2511.20223
作者: Sen Nie,Jie Zhang,Jianxin Yan,Shiguang Shan,Xilin Chen
机构: State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); University of Chinese Academy of Sciences (中国科学院大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages

点击查看摘要

Abstract:Adversarial attacks have evolved from simply disrupting predictions on conventional task-specific models to the more complex goal of manipulating image semantics on Large Vision-Language Models (LVLMs). However, existing methods struggle with controllability and fail to precisely manipulate the semantics of specific concepts in the image. We attribute this limitation to semantic entanglement in the patch-token representations on which adversarial attacks typically operate: global context aggregated by self-attention in the vision encoder dominates individual patch features, making them unreliable handles for precise local semantic manipulation. Our systematic investigation reveals a key insight: value features (V) computed within the transformer attention block serve as much more precise handles for manipulation. We show that V suppresses global-context channels, allowing it to retain high-entropy, disentangled local semantic information. Building on this discovery, we propose V-Attack, a novel method designed for precise local semantic attacks. V-Attack targets the value features and introduces two core components: (1) a Self-Value Enhancement module to refine V’s intrinsic semantic richness, and (2) a Text-Guided Value Manipulation module that leverages text prompts to locate source concept and optimize it toward a target concept. By bypassing the entangled patch features, V-Attack achieves highly effective semantic control. Extensive experiments across diverse LVLMs, including LLaVA, InternVL, DeepseekVL and GPT-4o, show that V-Attack improves the attack success rate by an average of 36% over state-of-the-art methods, exposing critical vulnerabilities in modern visual-language understanding. Our code and data are available this https URL.
zh

[CV-75] Patch-Level Glioblastoma Subregion Classification with a Contrastive Learning-Based Encoder MICCAI2025

【速读】:该论文旨在解决胶质母细胞瘤(glioblastoma)在分子和病理层面的高度异质性所带来的诊断与患者分层难题,传统组织病理学评估虽为标准方法,但存在主观性和低效问题。解决方案的关键在于利用深度学习技术对全切片图像(whole slide images, WSI)进行客观、自动化的分析,具体采用预训练的视觉Transformer(Vision Transformer, ViT)编码器,并在其基础上添加专用分类头,在官方训练数据集上进行微调。该方法在BraTS-Path 2025挑战赛中取得了优异性能,验证了ViT在病理图像分类任务中的有效性,为基于ViT的组织病理学分析建立了可靠基线。

链接: https://arxiv.org/abs/2511.20221
作者: Juexin Zhang,Qifeng Zhong,Ying Weng,Ke Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by the International Brain Tumor Segmentation (BraTS) challenge organized at MICCAI 2025 conference

点击查看摘要

Abstract:The significant molecular and pathological heterogeneity of glioblastoma, an aggressive brain tumor, complicates diagnosis and patient stratification. While traditional histopathological assessment remains the standard, deep learning offers a promising path toward objective and automated analysis of whole slide images. For the BraTS-Path 2025 Challenge, we developed a method that fine-tunes a pre-trained Vision Transformer (ViT) encoder with a dedicated classification head on the official training dataset. Our model’s performance on the online validation set, evaluated via the Synapse platform, yielded a Matthews Correlation Coefficient (MCC) of 0.7064 and an F1-score of 0.7676. On the final test set, the model achieved an MCC of 0.6509 and an F1-score of 0.5330, which secured our team second place in the BraTS-Pathology 2025 Challenge. Our results establish a solid baseline for ViT-based histopathological analysis, and future efforts will focus on bridging the performance gap observed on the unseen validation data.
zh

[CV-76] xt-guided Controllable Diffusion for Realistic Camouflage Images Generation AAAI2026

【速读】:该论文旨在解决现有伪装图像生成(Camouflage Images Generation, CIG)方法在合成过程中忽视伪装物体与背景环境之间逻辑关系的问题,导致生成结果不够自然和合理。其解决方案的关键在于提出一种可控文本引导的伪装图像生成方法(Controllable Text-guided Camouflage Images Generation, CT-CIG),核心创新包括:1)设计伪装揭示对话机制(Camouflage-Revealing Dialogue Mechanism, CRDM),利用大型视觉语言模型(Large Visual Language Models, VLM)为现有数据集自动标注高质量文本提示,增强语义一致性;2)基于图像-提示对微调Stable Diffusion,并引入轻量级控制器以精确控制伪装物体的位置与形状,提升场景适配性;3)设计频域交互细化模块(Frequency Interaction Refinement Module, FIRM),有效捕捉高频纹理特征,从而学习复杂伪装模式,最终实现高保真且逻辑合理的伪装图像生成。

链接: https://arxiv.org/abs/2511.20218
作者: Yuhang Qian,Haiyan Chen,Wentong Li,Ningzhong Liu,Jie Qin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Camouflage Images Generation (CIG) is an emerging research area that focuses on synthesizing images in which objects are harmoniously blended and exhibit high visual consistency with their surroundings. Existing methods perform CIG by either fusing objects into specific backgrounds or outpainting the surroundings via foreground object-guided diffusion. However, they often fail to obtain natural results because they overlook the logical relationship between camouflaged objects and background environments. To address this issue, we propose CT-CIG, a Controllable Text-guided Camouflage Images Generation method that produces realistic and logically plausible camouflage images. Leveraging Large Visual Language Models (VLM), we design a Camouflage-Revealing Dialogue Mechanism (CRDM) to annotate existing camouflage datasets with high-quality text prompts. Subsequently, the constructed image-prompt pairs are utilized to finetune Stable Diffusion, incorporating a lightweight controller to guide the location and shape of camouflaged objects for enhanced camouflage scene fitness. Moreover, we design a Frequency Interaction Refinement Module (FIRM) to capture high-frequency texture features, facilitating the learning of complex camouflage patterns. Extensive experiments, including CLIPScore evaluation and camouflage effectiveness assessment, demonstrate the semantic alignment of our generated text prompts and CT-CIG’s ability to produce photorealistic camouflage images.
zh

[CV-77] CostNav: A Navigation Benchmark for Cost-Aware Evaluation of Embodied Agents

【速读】:该论文旨在解决当前自主配送机器人导航研究中忽视经济可行性的问题,即现有基准测试主要关注任务成功率,而未考虑商业化部署所需的成本收益分析。其解决方案的关键在于提出CostNav——一个微尺度导航经济测试平台,通过建模硬件、训练、能源、维护成本及配送收入的完整经济生命周期,并结合服务等级协议(Service-Level Agreement, SLA)和行业数据参数,实现对具身智能体的量化经济评估。该方法首次揭示了任务成功优化与商业可行性的显著差距,指出碰撞导致的维护成本占单次运行成本的99.7%,凸显碰撞避免是关键优化目标,从而为导航算法设计提供经济驱动的决策依据。

链接: https://arxiv.org/abs/2511.20216
作者: Haebin Seong,Sungmin Kim,Minchan Kim,Yongjun Cho,Myunchul Joe,Suhwan Choi,Jaeyoon Jung,Jiyong Youn,Yoonshik Kim,Samwoo Seong,Yubeen Park,Youngjae Yu,Yunsung Lee
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Existing navigation benchmarks focus on task success metrics while overlooking economic viability – critical for commercial deployment of autonomous delivery robots. We introduce \emphCostNav, a \textbfMicro-Navigation Economic Testbed that evaluates embodied agents through comprehensive cost-revenue analysis aligned with real-world business operations. CostNav models the complete economic lifecycle including hardware, training, energy, maintenance costs, and delivery revenue with service-level agreements, using industry-derived parameters. \textbfTo our knowledge, CostNav is the first work to quantitatively expose the gap between navigation research metrics and commercial viability, revealing that optimizing for task success fundamentally differs from optimizing for economic deployment. Our cost model uses parameters derived from industry data sources (energy rates, delivery service pricing), and we project from a reduced-scale simulation to realistic deliveries. Under this projection, the baseline achieves 43.0% SLA compliance but is \emphnot commercially viable: yielding a loss of \ 30.009 per run with no finite break-even point, because operating costs are dominated by collision-induced maintenance, which accounts for 99.7% of per-run costs and highlights collision avoidance as a key optimization target. We demonstrate a learning-based on-device navigation baseline and establish a foundation for evaluating rule-based navigation, imitation learning, and cost-aware RL training. CostNav bridges the gap between navigation research and commercial deployment, enabling data-driven decisions about economic trade-offs across navigation paradigms.
zh

[CV-78] OmniAlpha: A Sequence-to-Sequence Framework for Unified Multi-Task RGBA Generation

【速读】:该论文旨在解决生成式 AI (Generative AI) 在 RGBA 图像处理中缺乏统一多任务框架的问题:现有方法要么专注于单一任务(如 alpha 通道处理),缺乏通用性;要么虽为多任务但局限于 RGB 域,无法有效支持透明度信息的联合建模与编辑。解决方案的关键在于提出 OmniAlpha,首个面向序列到序列 RGBA 图像生成与编辑的统一多任务生成框架,其核心创新包括:1)MSRoPE-BiL,一种具有双向可扩展层轴的旋转位置编码(Rotary Position Embedding, RoPE)方法,用于增强 Diffusion Transformer (DiT) 骨干网络对多输入/输出 RGBA 层的并行处理能力;2)AlphaLayers 数据集,通过自动化合成与过滤管道构建的 1000 个高质量多层图像三元组,支撑跨 21 项多样化任务的联合训练。实验表明,该方案在 mask-free matting 和 layer-conditioned completion 等任务上显著优于专用基线模型,验证了统一多任务学习在 RGBA 表示学习中的优越性。

链接: https://arxiv.org/abs/2511.20211
作者: Hao Yu,Jiabo Zhan,Zile Wang,Jinglin Wang,Huaisong Zhang,Hongyu Li,Xinrui Chen,Yongxian Wei,Chun Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative models have excelled in RGB synthesis, but real-world applications require RGBA manipulation. This has led to a fragmented landscape: specialized, single-task models handle alpha but lack versatility, while unified multi-task frameworks are confined to the RGB domain. To bridge this critical gap, we propose OmniAlpha, the first unified, multi-task generative framework for sequence-to-sequence RGBA image generation and editing. Its architecture features MSRoPE-BiL, a novel RoPE method with a bi-directionally extendable layer axis for its Diffusion Transformer (DiT) backbone, enabling the concurrent processing of multiple input and target RGBA layers. To power this framework, we introduce AlphaLayers, a new dataset of 1,000 high-quality, multi-layer triplets, built via a novel automated synthesis and filter pipeline. Jointly training OmniAlpha on this dataset across a comprehensive suite of 21 diverse tasks, extensive experiments demonstrate that our unified approach consistently outperforms strong, specialized baselines. Most notably, OmniAlpha achieves a dramatic 84.8% relative reduction in SAD for mask-free matting on AIM-500 and wins over 90% of human preferences in layer-conditioned completion. Our work proves that a unified, multi-task model can learn a superior shared representation for RGBA, paving the way for more powerful, layer-aware generative systems.
zh

[CV-79] Robust 3D Brain MRI Inpainting with Random Masking Augmentation MICCAI2025

【速读】:该论文旨在解决脑肿瘤磁共振成像(MRI)中因数据集偏差导致深度学习模型在定量分析时性能受限的问题。其核心解决方案是一种新颖的3D深度学习框架,用于合成健康组织区域,关键创新在于基于U-Net架构训练模型以修复人工损坏的区域,并引入随机掩码增强策略(random masking augmentation strategy)提升模型泛化能力。该方法在验证集和最终在线测试集上均取得了优异的客观指标表现,包括SSIM、PSNR和MSE等,最终在BraTS-Inpainting 2025挑战赛中排名第一,超越了前两届冠军方案。

链接: https://arxiv.org/abs/2511.20202
作者: Juexin Zhang,Ying Weng,Ke Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by the International Brain Tumor Segmentation (BraTS) challenge organized at MICCAI 2025 conference

点击查看摘要

Abstract:The ASNR-MICCAI BraTS-Inpainting Challenge was established to mitigate dataset biases that limit deep learning models in the quantitative analysis of brain tumor MRI. This paper details our submission to the 2025 challenge, a novel deep learning framework for synthesizing healthy tissue in 3D scans. The core of our method is a U-Net architecture trained to inpaint synthetically corrupted regions, enhanced with a random masking augmentation strategy to improve generalization. Quantitative evaluation confirmed the efficacy of our approach, yielding an SSIM of 0.873 \pm 0.004, a PSNR of 24.996 \pm 4.694, and an MSE of 0.005 \pm 0.087 on the validation set. On the final online test set, our method achieved an SSIM of 0.919 \pm 0.088, a PSNR of 26.932 \pm 5.057, and an RMSE of 0.052 \pm 0.026. This performance secured first place in the BraTS-Inpainting 2025 challenge and surpassed the winning solutions from the 2023 and 2024 competitions on the official leaderboard.
zh

[CV-80] GHR-VQA: Graph-guided Hierarchical Relational Reasoning for Video Question Answering

【速读】:该论文旨在解决视频问答(Video QA)中对复杂人-物交互关系建模不足的问题,尤其在跨帧语义理解和动作分解方面存在局限。传统基于像素的方法难以捕捉动态场景中的人体行为与物体之间的细粒度互动,导致模型对时空动态的推理能力受限。其解决方案的关键在于提出GHR-VQA框架,通过构建以人类为中心的图结构——将每一帧表示为场景图(scene graph),并将跨帧的人节点连接至全局根节点形成视频级图结构,从而实现以人类为主体的层级关系推理;随后利用图神经网络(Graph Neural Networks, GNNs)提取上下文感知的嵌入表示,并结合问题特征在多抽象层次上进行融合,显著提升了局部细节与全局语义的理解能力,同时增强了模型的可解释性。

链接: https://arxiv.org/abs/2511.20201
作者: Dionysia Danai Brilli,Dimitrios Mallis,Vassilis Pitsikalis,Petros Maragos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose GHR-VQA, Graph-guided Hierarchical Relational Reasoning for Video Question Answering (Video QA), a novel human-centric framework that incorporates scene graphs to capture intricate human-object interactions within video sequences. Unlike traditional pixel-based methods, each frame is represented as a scene graph and human nodes across frames are linked to a global root, forming the video-level graph and enabling cross-frame reasoning centered on human actors. The video-level graphs are then processed by Graph Neural Networks (GNNs), transforming them into rich, context-aware embeddings for efficient processing. Finally, these embeddings are integrated with question features in a hierarchical network operating across different abstraction levels, enhancing both local and global understanding of video content. This explicit human-rooted structure enhances interpretability by decomposing actions into human-object interactions and enables a more profound understanding of spatiotemporal dynamics. We validate our approach on the Action Genome Question Answering (AGQA) dataset, achieving significant performance improvements, including a 7.3% improvement in object-relation reasoning over the state of the art.
zh

[CV-81] SFA: Scan Focus and Amplify toward Guidance-aware Answering for Video TextVQA

【速读】:该论文针对视频文本视觉问答(Video TextVQA)任务中模型难以准确感知和理解跨帧尺度、方向与清晰度各异的场景文本,且难以有效融合时空语义上下文以生成精准答案的问题提出解决方案。其关键在于设计了一个无需训练的框架SFA,该框架模拟人类答题过程,通过自适应扫描视频帧、选择性聚焦关键区域并直接增强这些区域,从而引导视频大语言模型(Video-LLM)注意力集中于最相关的信息线索,提升答案准确性。

链接: https://arxiv.org/abs/2511.20190
作者: Haibin He,Qihuang Zhong,Juhua Liu,Bo Du,Peng Wang,Jing Zhang
机构: Wuhan University (武汉大学); Manchester Metropolitan University (曼彻斯特城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video text-based visual question answering (Video TextVQA) task aims to answer questions about videos by leveraging the visual text appearing within the videos. This task poses significant challenges, requiring models to accurately perceive and comprehend scene text that varies in scale, orientation, and clarity across frames, while effectively integrating temporal and semantic context to generate precise answers. Moreover, the model must identify question-relevant textual cues and filter out redundant or irrelevant information to ensure answering is guided by the most relevant and informative cues. To address these challenges, we propose SFA, a training-free framework and the first Video-LLM-based method tailored for Video TextVQA, motivated by the human process of answering questions. By adaptively scanning video frames, selectively focusing on key regions, and directly amplifying them, SFA effectively guides the Video-LLM’s attention toward essential cues, enabling it to generate more accurate answers. SFA achieves new state-of-the-art results across several public Video TextVQA datasets and surpasses previous methods by a substantial margin, demonstrating its effectiveness and generalizability.
zh

[CV-82] Exo2EgoSyn: Unlocking Foundation Video Generation Models for Exocentric-to-Egocentric Video Synthesis

【速读】:该论文旨在解决基础视频生成模型(如WAN 2.2)在跨视角视频合成中的局限性,即其仅支持同视角(same-view)生成,无法从第三人称视角(exocentric)生成第一人称视角(egocentric)视频。解决方案的关键在于提出Exo2EgoSyn框架,包含三个核心模块:Ego-Exo View Alignment(EgoExo-Align)实现第三人称与第一人称初始帧表示的潜在空间对齐,将生成空间从给定的第三人称视角重新定向至第一人称视角;Multi-view Exocentric Video Conditioning(MultiExoCon)通过聚合多视角第三人称视频构建统一条件信号,扩展了原始模型仅依赖单图或文本条件的能力;Pose-Aware Latent Injection(PoseInj)注入相对第三人称到第一人称相机位姿信息至潜在状态,引导跨视角几何感知的视频合成。上述模块协同作用,实现了无需从头训练即可从第三人称观测中生成高保真第一人称视角视频的能力。

链接: https://arxiv.org/abs/2511.20186
作者: Mohammad Mahdi,Yuqian Fu,Nedko Savov,Jiancheng Pan,Danda Pani Paudel,Luc Van Gool
机构: INSAIT(INSAT); Sofia University “St. Kliment Ohridski”(索非亚大学“圣克莱门特·奥霍里斯基”)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Foundation video generation models such as WAN 2.2 exhibit strong text- and image-conditioned synthesis abilities but remain constrained to the same-view generation setting. In this work, we introduce Exo2EgoSyn, an adaptation of WAN 2.2 that unlocks Exocentric-to-Egocentric(Exo2Ego) cross-view video synthesis. Our framework consists of three key modules. Ego-Exo View Alignment(EgoExo-Align) enforces latent-space alignment between exocentric and egocentric first-frame representations, reorienting the generative space from the given exo view toward the ego view. Multi-view Exocentric Video Conditioning (MultiExoCon) aggregates multi-view exocentric videos into a unified conditioning signal, extending WAN2.2 beyond its vanilla single-image or text conditioning. Furthermore, Pose-Aware Latent Injection (PoseInj) injects relative exo-to-ego camera pose information into the latent state, guiding geometry-aware synthesis across viewpoints. Together, these modules enable high-fidelity ego view video generation from third-person observations without retraining from scratch. Experiments on ExoEgo4D validate that Exo2EgoSyn significantly improves Ego2Exo synthesis, paving the way for scalable cross-view video generation with foundation models. Source code and models will be released publicly.
zh

[CV-83] Realizing Fully-Integrated Low-Power Event-Based Pupil Tracking with Neuromorphic Hardware

【速读】:该论文旨在解决可穿戴平台中实现高频率、低功耗眼动追踪的难题,尤其针对事件驱动视觉传感器(event-based vision sensors)缺乏集成化、低功耗实时推理方案的问题。解决方案的关键在于首次实现了基于商用Speck2f系统级芯片(system-on-chip)的全本地集成式瞳孔中心追踪系统,结合事件感知与类脑计算(neuromorphic processing),并引入一种具有门控时间解码机制的不确定性量化脉冲神经网络(spiking neural network),在严苛的内存与带宽约束下优化模型性能;同时通过系统级部署机制弥合现实差距(reality gap),最终在双类脑设备原型上实现了每只眼睛平均功耗低于5 mW、100 Hz稳定双眼瞳孔追踪,验证了端到端类脑计算在下一代节能可穿戴系统中的可行性。

链接: https://arxiv.org/abs/2511.20175
作者: Federico Paredes-Valles,Yoshitaka Miyatani,Kirk Y. W. Scheper
机构: Sony Advanced Visual Sensing AG (索尼高级视觉传感AG); Sony Semiconductor Solutions Corporation (索尼半导体解决方案公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 14 figures, 3 tables

点击查看摘要

Abstract:Eye tracking is fundamental to numerous applications, yet achieving robust, high-frequency tracking with ultra-low power consumption remains challenging for wearable platforms. While event-based vision sensors offer microsecond resolution and sparse data streams, they have lacked fully integrated, low-power processing solutions capable of real-time inference. In this work, we present the first battery-powered, wearable pupil-center-tracking system with complete on-device integration, combining event-based sensing and neuromorphic processing on the commercially available Speck2f system-on-chip with lightweight coordinate decoding on a low-power microcontroller. Our solution features a novel uncertainty-quantifying spiking neural network with gated temporal decoding, optimized for strict memory and bandwidth constraints, complemented by systematic deployment mechanisms that bridge the reality gap. We validate our system on a new multi-user dataset and demonstrate a wearable prototype with dual neuromorphic devices achieving robust binocular pupil tracking at 100 Hz with an average power consumption below 5 mW per eye. Our work demonstrates that end-to-end neuromorphic computing enables practical, always-on eye tracking for next-generation energy-efficient wearable systems.
zh

[CV-84] ADNet: A Large-Scale and Extensible Multi-Domain Benchmark for Anomaly Detection Across 380 Real-World Categories

【速读】:该论文旨在解决当前异常检测(Anomaly Detection, AD)基准测试在类别覆盖范围狭窄、跨场景泛化能力与可扩展性不足的问题。现有基准如MVTec-AD仅涵盖15个类别,难以评估模型在大规模多领域环境下的性能表现。为此,作者提出ADNet——一个包含380个类别的大规模多域基准,覆盖电子、工业、农业食品、基础设施和医疗五大领域,共包含196,294张RGB图像,并提供像素级标注和结构化文本描述以支持多模态异常检测任务。实验表明,主流方法在单类别设置下表现优异(I-AUROC达90.6%),但在扩展至全部380类别时性能显著下降(降至78.5%),凸显了规模化挑战。为应对这一问题,研究者提出Dinomaly-m,一种基于上下文引导的专家混合(Mixture-of-Experts)架构改进方案,在不增加推理成本的前提下扩展解码器容量,最终实现83.2% I-AUROC和93.1% P-AUROC的性能,优于现有方法,验证了其在复杂多场景下的有效性与可扩展性。

链接: https://arxiv.org/abs/2511.20169
作者: Hai Ling,Jia Guo,Zhulin Tao,Yunkang Cao,Donglin Di,Hongyan Xu,Xiu Su,Yang Song,Lei Fan
机构: Communication University of China (中国传媒大学); DZ Matrix; Tsinghua University (清华大学); Hunan University (湖南大学); Central South University (中南大学); UNSW Sydney (新南威尔士大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Anomaly detection (AD) aims to identify defects using normal-only training data. Existing anomaly detection benchmarks (e.g., MVTec-AD with 15 categories) cover only a narrow range of categories, limiting the evaluation of cross-context generalization and scalability. We introduce ADNet, a large-scale, multi-domain benchmark comprising 380 categories aggregated from 49 publicly available datasets across Electronics, Industry, Agrifood, Infrastructure, and Medical domains. The benchmark includes a total of 196,294 RGB images, consisting of 116,192 normal samples for training and 80,102 test images, of which 60,311 are anomalous. All images are standardized with MVTec-style pixel-level annotations and structured text descriptions spanning both spatial and visual attributes, enabling multimodal anomaly detection tasks. Extensive experiments reveal a clear scalability challenge: existing state-of-the-art methods achieve 90.6% I-AUROC in one-for-one settings but drop to 78.5% when scaling to all 380 categories in a multi-class setting. To address this, we propose Dinomaly-m, a context-guided Mixture-of-Experts extension of Dinomaly that expands decoder capacity without increasing inference cost. It achieves 83.2% I-AUROC and 93.1% P-AUROC, demonstrating superior performance over existing approaches. ADNet is designed as a standardized and extensible benchmark, supporting the community in expanding anomaly detection datasets across diverse domains and providing a scalable foundation for future anomaly detection foundation models. Dataset: this https URL
zh

[CV-85] While recognizing actions LMMs struggle to detect core interaction events

【速读】:该论文旨在解决大型多模态模型(Large Multi-Modal Models, LMMs)在理解动态视觉场景时缺乏对物理交互事件的精确感知定位能力的问题,即模型虽能识别对象和动作并提供合理描述,却难以准确判断交互开始或结束的具体帧时间与空间位置。其解决方案的关键在于构建了一个大规模视频标注数据集(超过20K个交互事件),基于Something-Something-V2数据集,由人工标注者明确标记出物体与主体之间接触(contact)和释放(release)的核心事件时刻与位置,并利用该数据集评估两个先进LMM(Qwen-2.5VL 和 GPT-4o)在短视频中定位这些物理交互事件的能力,从而揭示模型在感知 grounding 方面的局限性。

链接: https://arxiv.org/abs/2511.20162
作者: Daniel Harari,Michael Sidorov,Liel David,Chen Shterental,Abrham Kahsay Gebreselasie,Muhammad Haris Khan
机构: Weizmann Institute of Science (魏茨曼科学研究所); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Large multi-modal models (LMMs) show increasing performance in realistic visual tasks for images and, more recently, for videos. For example, given a video sequence, such models are able to describe in detail objects, the surroundings and dynamic actions. In this study, we explored the extent to which these models ground their semantic understanding in the actual visual input. Specifically, given sequences of hands interacting with objects, we asked models when and where the interaction begins or ends. For this purpose, we introduce a first of its kind, large-scale dataset with more than 20K annotated interactions on videos from the Something-Something-V2 dataset. 250 AMTurk human annotators labeled core interaction events, particularly when and where objects and agents become attached (‘contact’) or detached (‘release’). We asked two LMMs (Qwen-2.5VL and GPT-4o) to locate these events in short videos, each with a single event. The results show that although the models can reliably name the target objects, identify the action and provide coherent reasoning, they consistently fail to identify the frame where the interaction begins or ends and cannot localize the event within the scene. Our findings suggest that in struggling to pinpoint the moment and location of physical contact that defines the interaction, the models lack the perceptual grounding required for deeper understanding of dynamic scenes.
zh

[CV-86] Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLM s

【速读】:该论文旨在解决持续视觉指令微调(Continual Visual Instruction Tuning, CVIT)过程中,安全对齐的多模态大语言模型(Multimodal Large Language Models, MLLMs)在任务性能提升时出现的安全性退化与灾难性遗忘问题。现有研究多忽视模型安全性,而实际应用中MLLMs必须具备可靠的安全机制以规避潜在风险。解决方案的关键在于提出一种名为“和谐参数适配”(Harmonious Parameter Adaptation, HPA)的后训练框架,其核心包括:基于参数对安全或任务性能的关注度进行划分、从平衡视角选择保留关键参数,以及通过正交约束限制参数更新方向,从而在保持高安全性的同时有效缓解遗忘现象。

链接: https://arxiv.org/abs/2511.20158
作者: Ziqi Wang,Chang Che,Qi Wang,Hui Ma,Zenglin Shi,Cees G. M. Snoek,Meng Wang
机构: Hefei University of Technology (合肥工业大学); Tsinghua University (清华大学); University of Amsterdam (阿姆斯特丹大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While continual visual instruction tuning (CVIT) has shown promise in adapting multimodal large language models (MLLMs), existing studies predominantly focus on models without safety alignment. This critical oversight ignores the fact that real-world MLLMs inherently require such mechanisms to mitigate potential risks. In this work, we shift our focus to CVIT for safety-aligned MLLMs and observe that during continual adaptation, the model not only suffers from task forgetting but also exhibits degradation in its safety. Achieving a harmonious balance between safety and task performance remains a crucial challenge. To address this, we propose Harmonious Parameter Adaptation (HPA), a post-training framework composed of focusing-based parameter partition, harmoniously balanced parameter selection, and orthogonal parameter adjustment. Specifically, HPA partitions parameters into two types based on their focus on safety or task performance, and selects the focused ones to preserve from a balanced perspective. In addition, HPA imposes orthogonality constraints on parameter updates to further alleviate catastrophic forgetting. Extensive experiments on the CVIT benchmark and safety evaluation datasets demonstrate that HPA better maintains high safety and mitigates forgetting than existing baselines.
zh

[CV-87] SKEL-CF: Coarse-to-Fine Biomechanical Skeleton and Surface Mesh Recovery

【速读】:该论文旨在解决现有参数化3D人体模型(如SMPL)在人体姿态与形状估计中因简化骨骼结构而导致生物力学真实性不足的问题。针对这一问题,作者提出SKEL-CF框架,其关键在于采用分阶段细化的Transformer编码器-解码器架构:编码器首先预测粗粒度的相机参数和SKEL参数,随后解码器通过多层逐步精化这些参数;同时,为确保解剖学一致性,将原有SMPL数据集4DHuman重构为与SKEL对齐的4DHuman-SKEL数据集,并显式引入相机建模以缓解深度与尺度模糊性,从而显著提升参数估计精度与生物力学合理性。

链接: https://arxiv.org/abs/2511.20157
作者: Da Li,Ji-Ping Jin,Xuanlong Yu,Wei Liu,Xiaodong Cun,Kai Chen,Rui Fan,Jiangang Kong,Shen Xi
机构: Intellindust AI Lab; Shenzhen University (深圳大学); ShanghaiTech University (上海科技大学); Great Bay University (大湾大学); Didi Chuxing Co.Ltd (滴滴出行有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 10 figures

点击查看摘要

Abstract:Parametric 3D human models such as SMPL have driven significant advances in human pose and shape estimation, yet their simplified kinematics limit biomechanical realism. The recently proposed SKEL model addresses this limitation by re-rigging SMPL with an anatomically accurate skeleton. However, estimating SKEL parameters directly remains challenging due to limited training data, perspective ambiguities, and the inherent complexity of human articulation. We introduce SKEL-CF, a coarse-to-fine framework for SKEL parameter estimation. SKEL-CF employs a transformer-based encoder-decoder architecture, where the encoder predicts coarse camera and SKEL parameters, and the decoder progressively refines them in successive layers. To ensure anatomically consistent supervision, we convert the existing SMPL-based dataset 4DHuman into a SKEL-aligned version, 4DHuman-SKEL, providing high-quality training data for SKEL estimation. In addition, to mitigate depth and scale ambiguities, we explicitly incorporate camera modeling into the SKEL-CF pipeline and demonstrate its importance across diverse viewpoints. Extensive experiments validate the effectiveness of the proposed design. On the challenging MOYO dataset, SKEL-CF achieves 85.0 MPJPE / 51.4 PA-MPJPE, significantly outperforming the previous SKEL-based state-of-the-art HSMR (104.5 / 79.6). These results establish SKEL-CF as a scalable and anatomically faithful framework for human motion analysis, bridging the gap between computer vision and biomechanics. Our implementation is available on the project page: this https URL.
zh

[CV-88] Map-World: Masked Action planning and Path-Integral World Model for Autonomous Driving

【速读】:该论文旨在解决自动驾驶中运动规划(motion planning)在处理多模态未来轨迹时的效率与信息损失问题:现有端到端系统和基于世界模型的规划器虽能预测丰富的多模态轨迹,但通常依赖人工设计的锚点(anchor)或强化学习(reinforcement learning)从多个可能轨迹中选择单一最优模式进行训练和控制,这不仅丢弃了其他潜在合理路径的信息,还导致优化过程复杂化。其解决方案的关键在于提出一种无先验(prior-free)的多模态规划框架MAP-World,核心创新包括两个模块:一是掩码动作规划(Masked Action Planning, MAP),将未来自车运动建模为掩码序列补全任务,通过注入噪声生成多样且时序一致的轨迹查询;二是路径加权的世界模型(path-weighted world model),基于候选轨迹对BEV语义进行条件滚动预测,并在训练中以轨迹概率作为离散路径权重,计算语义损失的期望值,从而使规划器从整个可行未来分布中学习,而非局限于单条路径。该方法无需锚点库或教师策略,同时保持实时推理延迟,在NAVSIM上达到当前基于世界模型方法的最先进性能。

链接: https://arxiv.org/abs/2511.20156
作者: Bin Hu,Zijian Lu,Haicheng Liao,Chengran Yuan,Bin Rao,Yongkang Li,Guofa Li,Zhiyong Cui,Cheng-zhong Xu,Zhenning Li
机构: University of Macau (澳门大学); National University of Singapore (新加坡国立大学); Purdue University (普渡大学); Chongqing University (重庆大学); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Motion planning for autonomous driving must handle multiple plausible futures while remaining computationally efficient. Recent end-to-end systems and world-model-based planners predict rich multi-modal trajectories, but typically rely on handcrafted anchors or reinforcement learning to select a single best mode for training and control. This selection discards information about alternative futures and complicates optimization. We propose MAP-World, a prior-free multi-modal planning framework that couples masked action planning with a path-weighted world model. The Masked Action Planning (MAP) module treats future ego motion as masked sequence completion: past waypoints are encoded as visible tokens, future waypoints are represented as mask tokens, and a driving-intent path provides a coarse scaffold. A compact latent planning state is expanded into multiple trajectory queries with injected noise, yielding diverse, temporally consistent modes without anchor libraries or teacher policies. A lightweight world model then rolls out future BEV semantics conditioned on each candidate trajectory. During training, semantic losses are computed as an expectation over modes, using trajectory probabilities as discrete path weights, so the planner learns from the full distribution of plausible futures instead of a single selected path. On NAVSIM, our method matches anchor-based approaches and achieves state-of-the-art performance among world-model-based methods, while avoiding reinforcement learning and maintaining real-time inference latency.
zh

[CV-89] Alzheimers Disease Progression Prediction Based on Manifold Mapping of Irregularly Sampled Longitudinal Data

【速读】:该论文旨在解决从不规则采样的纵向结构磁共振成像(sMRI)数据中建模阿尔茨海默病(Alzheimer’s Disease, AD)进展的难题。现有基于影像的疾病预测模型通常在欧几里得空间(Euclidean space)中操作,无法充分捕捉不规则采样数据中疾病进展的内在连续性和非线性几何结构。解决方案的关键在于提出一个融合几何建模与时间感知动态建模的联合框架——R-TNAG,其核心包括:(1) 将高维sMRI特征映射到黎曼流形空间(Riemannian manifold),以保留疾病进展的内在几何特性;(2) 引入时间感知神经微分方程(Time-aware Neural Ordinary Differential Equation, TNODE),实现观测点之间潜在状态的连续演化建模;(3) 设计注意力机制驱动的黎曼门控循环单元(Attention-based Riemannian Gated Recurrent Unit, ARGRU),自适应融合历史与当前信息以应对不规则时间间隔。该设计显著提升了时序一致性与预测鲁棒性,在疾病状态分类和认知评分回归任务中均优于现有最先进方法。

链接: https://arxiv.org/abs/2511.20154
作者: Xin Hong,Ying Shi,Yinhao Li,Yen-Wei Chen
机构: Huaqiao University (华侨大学); Key Laboratory of Computer Vision and Machine Learning in Fujian Province (福建省计算机视觉与机器学习重点实验室); Ritsumeikan University (立命馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:The uncertainty of clinical examinations frequently leads to irregular observation intervals in longitudinal imaging data, posing challenges for modeling disease this http URL existing imaging-based disease prediction models operate in Euclidean space, which assumes a flat representation of data and fails to fully capture the intrinsic continuity and nonlinear geometric structure of irregularly sampled longitudinal images. To address the challenge of modeling Alzheimers disease (AD) progression from irregularly sampled longitudinal structural Magnetic Resonance Imaging (sMRI) data, we propose a Riemannian manifold mapping, a Time-aware manifold Neural ordinary differential equation, and an Attention-based riemannian Gated recurrent unit (R-TNAG) framework. Our approach first projects features extracted from high-dimensional sMRI into a manifold space to preserve the intrinsic geometry of disease progression. On this representation, a time-aware Neural Ordinary Differential Equation (TNODE) models the continuous evolution of latent states between observations, while an Attention-based Riemannian Gated Recurrent Unit (ARGRU) adaptively integrates historical and current information to handle irregular intervals. This joint design improves temporal consistency and yields robust AD trajectory prediction under irregular this http URL results demonstrate that the proposed method consistently outperforms state-of-the-art models in both disease status prediction and cognitive score regression. Ablation studies verify the contributions of each module, highlighting their complementary roles in enhancing predictive accuracy. Moreover, the model exhibits stable performance across varying sequence lengths and missing data rates, indicating strong temporal generalizability. Cross-dataset validation further confirms its robustness and applicability in diverse clinical settings.
zh

[CV-90] Restora-Flow: Mask-Guided Image Restoration with Flow Matching WACV2026

【速读】:该论文旨在解决当前基于流模型(flow models)的图像恢复方法中存在的两个关键问题:一是处理时间过长,二是生成结果过度平滑。为应对这些挑战,作者提出Restora-Flow,其核心创新在于无需训练即可通过退化掩码(degradation mask)引导流匹配(flow matching)采样,并引入轨迹校正机制(trajectory correction mechanism)以确保生成结果与退化输入的一致性。该方案在自然和医学图像数据集上针对基于掩码的退化任务(如图像修复、超分辨率和去噪)均展现出更优的感知质量和更快的处理速度。

链接: https://arxiv.org/abs/2511.20152
作者: Arnela Hadzic,Franz Thaler,Lea Bogensperger,Simon Johannes Joham,Martin Urschler
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for WACV 2026

点击查看摘要

Abstract:Flow matching has emerged as a promising generative approach that addresses the lengthy sampling times associated with state-of-the-art diffusion models and enables a more flexible trajectory design, while maintaining high-quality image generation. This capability makes it suitable as a generative prior for image restoration tasks. Although current methods leveraging flow models have shown promising results in restoration, some still suffer from long processing times or produce over-smoothed results. To address these challenges, we introduce Restora-Flow, a training-free method that guides flow matching sampling by a degradation mask and incorporates a trajectory correction mechanism to enforce consistency with degraded inputs. We evaluate our approach on both natural and medical datasets across several image restoration tasks involving a mask-based degradation, i.e., inpainting, super-resolution and denoising. We show superior perceptual quality and processing time compared to diffusion and flow matching-based reference methods.
zh

[CV-91] Hybrid Convolution and Frequency State Space Network for Image Compression

【速读】:该论文旨在解决当前基于状态空间模型(State Space Model, SSM)和Transformer的图像压缩方法在建模长距离依赖时易丢失结构信息或忽略频域特性,从而影响压缩效率的问题。解决方案的关键在于提出一种混合架构HCFSSNet,其核心创新包括:(1) 引入视觉频率状态空间(Vision Frequency State Space, VFSS)模块,结合全向邻域状态空间(Omni-directional Neighborhood State Space, VONSS)以多方向扫描特征,捕获长程低频信息;(2) 设计自适应频域调制模块(Adaptive Frequency Modulation Module, AFMM),对离散余弦变换(Discrete Cosine Transform, DCT)频域成分进行内容感知加权,实现更高效的比特分配;(3) 将AFMM与Swin Transformer融合形成频率感知的注意力模块(Frequency Swin Transformer Attention Module, FSTAM),提升熵模型中侧信息的建模能力。该方案在保持较低参数量的同时,在Kodak、Tecnick和CLIC数据集上显著优于传统视频编码标准VTM,验证了其高效性和可解释性。

链接: https://arxiv.org/abs/2511.20151
作者: Haodong Pan,Hao Wei,Yusong Wang,Nanning Zheng,Caigui Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 36 pages, 8 figures

点击查看摘要

Abstract:Learned image compression (LIC) has recently benefited from Transformer based and state space model (SSM) based architectures. Convolutional neural networks (CNNs) effectively capture local high frequency details, whereas Transformers and SSMs provide strong long range modeling capabilities but may cause structural information loss or ignore frequency characteristics that are crucial for compression. In this work we propose HCFSSNet, a Hybrid Convolution and Frequency State Space Network for LIC. HCFSSNet uses CNNs to extract local high frequency structures and introduces a Vision Frequency State Space (VFSS) block that models long range low frequency information. The VFSS block combines an Omni directional Neighborhood State Space (VONSS) module, which scans features horizontally, vertically and diagonally, with an Adaptive Frequency Modulation Module (AFMM) that applies content adaptive weighting of discrete cosine transform frequency components for more efficient bit allocation. To further reduce redundancy in the entropy model, we integrate AFMM with a Swin Transformer to form a Frequency Swin Transformer Attention Module (FSTAM) for frequency aware side information modeling. Experiments on the Kodak, Tecnick and CLIC Professional Validation datasets show that HCFSSNet achieves competitive rate distortion performance compared with recent SSM based codecs such as MambaIC, while using significantly fewer parameters. On Kodak, Tecnick and CLIC, HCFSSNet reduces BD rate over the VTM anchor by 18.06, 24.56 and 22.44 percent, respectively, providing an efficient and interpretable hybrid architecture for future learned image compression systems.
zh

[CV-92] Vision-Language Models for Automated 3D PET/CT Report Generation

【速读】:该论文旨在解决正电子发射断层扫描/计算机断层扫描(PET/CT)报告自动化生成(PETRG)中因功能成像特性复杂、跨中心报告风格差异大而导致的临床实用性不足问题。其核心挑战在于:PET代谢模式受示踪剂生理学影响显著,且需依赖全身体积三维上下文信息而非局部区域判断。解决方案的关键是提出一种端到端的3D双分支框架PETRG-3D,该框架独立编码PET与CT体积数据,并引入风格自适应提示(style-adaptive prompts)以缓解不同医院间的报告习惯差异;同时构建了多中心淋巴瘤数据集PETRG-Lym及公开基准AutoPET-RG-Lym,并设计淋巴瘤特异性评估协议PETRG-Score,实现对代谢与结构发现的联合量化评价,从而显著提升模型在自然语言指标(如ROUGE-L提升31.49%)和临床有效性指标(如PET-All提升8.18%)上的表现。

链接: https://arxiv.org/abs/2511.20145
作者: Wenpei Jiao,Kun Shang,Hui Li,Ke Yan,Jiajin Zhang,Guangjie Yang,Lijuan Guo,Yan Wan,Xing Yang,Dakai Jin,Zhaoheng Xie
机构: Peking University (北京大学); Peking University People’s Hospital (北京大学人民医院); Peking University Third Hospital (北京大学第三医院); Alibaba Group (阿里巴巴集团); DAMO Academy (达摩院); Hupan Lab; Shanghai Jiao Tong University (上海交通大学); Qingdao University (青岛大学); Henan Medical University (河南医科大学); Jiujiang City Key Laboratory of Cell Therapy, Jiu Jiang NO.1 People’s Hospital (九江市细胞治疗重点实验室,九江市第一人民医院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Positron emission tomography/computed tomography (PET/CT) is essential in oncology, yet the rapid expansion of scanners has outpaced the availability of trained specialists, making automated PET/CT report generation (PETRG) increasingly important for reducing clinical workload. Compared with structural imaging (e.g., X-ray, CT, and MRI), functional PET poses distinct challenges: metabolic patterns vary with tracer physiology, and whole-body 3D contextual information is required rather than local-region interpretation. To advance PETRG, we propose PETRG-3D, an end-to-end 3D dual-branch framework that separately encodes PET and CT volumes and incorporates style-adaptive prompts to mitigate inter-hospital variability in reporting practices. We construct PETRG-Lym, a multi-center lymphoma dataset collected from four hospitals (824 reports w/ 245,509 paired PET/CT slices), and construct AutoPET-RG-Lym, a publicly accessible PETRG benchmark derived from open imaging data but equipped with new expert-written, clinically validated reports (135 cases). To assess clinical utility, we introduce PETRG-Score, a lymphoma-specific evaluation protocol that jointly measures metabolic and structural findings across curated anatomical regions. Experiments show that PETRG-3D substantially outperforms existing methods on both natural language metrics (e.g., +31.49% ROUGE-L) and clinical efficacy metrics (e.g., +8.18% PET-All), highlighting the benefits of volumetric dual-modality modeling and style-aware prompting. Overall, this work establishes a foundation for future PET/CT-specific models emphasizing disease-aware reasoning and clinically reliable evaluation. Codes, models, and AutoPET-RG-Lym will be released.
zh

[CV-93] UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers

【速读】:该论文致力于解决视频扩散变换器(Video Diffusion Transformers)在训练长度之外的泛化能力不足问题,即视频长度外推(video length extrapolation)难题。研究识别出两种失效模式:模型特异性的周期性内容重复和普遍存在的质量退化。传统方法仅通过位置编码缓解重复问题,却忽视了质量退化且外推效果有限。本文从注意力机制的本质出发,发现两类失效均由“注意力分散”(attention dispersion)统一导致——即超出训练窗口的token稀释了已学习的注意力模式。基于此洞察,提出无需训练、可即插即用的UltraViCo方法,通过恒定衰减因子抑制训练窗口外token的注意力权重,从而同时解决上述两类问题。实验表明,该方法将外推极限从2倍提升至4倍,在4倍外推下动态度(Dynamic Degree)和成像质量(Imaging Quality)分别提升233%和40.5%,并可无缝迁移至可控视频生成与编辑等下游任务。

链接: https://arxiv.org/abs/2511.20123
作者: Min Zhao,Hongzhou Zhu,Yingze Wang,Bokai Yan,Jintao Zhang,Guande He,Ling Yang,Chongxuan Li,Jun Zhu
机构: Tsinghua University (清华大学); Renmin University of China (中国人民大学); The University of Texas at Austin (德克萨斯大学奥斯汀分校); Princeton University (普林斯顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Despite advances, video diffusion transformers still struggle to generalize beyond their training length, a challenge we term video length extrapolation. We identify two failure modes: model-specific periodic content repetition and a universal quality degradation. Prior works attempt to solve repetition via positional encodings, overlooking quality degradation and achieving only limited extrapolation. In this paper, we revisit this challenge from a more fundamental view: attention maps, which directly govern how context influences outputs. We identify that both failure modes arise from a unified cause: attention dispersion, where tokens beyond the training window dilute learned attention patterns. This leads to quality degradation and repetition emerges as a special case when this dispersion becomes structured into periodic attention patterns, induced by harmonic properties of positional encodings. Building on this insight, we propose UltraViCo, a training-free, plug-and-play method that suppresses attention for tokens beyond the training window via a constant decay factor. By jointly addressing both failure modes, we outperform a broad set of baselines largely across models and extrapolation ratios, pushing the extrapolation limit from 2x to 4x. Remarkably, it improves Dynamic Degree and Imaging Quality by 233% and 40.5% over the previous best method at 4x extrapolation. Furthermore, our method generalizes seamlessly to downstream tasks such as controllable video synthesis and editing.
zh

[CV-94] LungEvaty: A Scalable Open-Source Transformer-based Deep Learning Model for Lung Cancer Risk Prediction in LDCT Screening

【速读】:该论文旨在解决肺癌风险预测中因影像数据规模扩大而面临的可扩展性与性能瓶颈问题,尤其针对现有方法在依赖像素级标注导致难以规模化,或分割肺部区域造成信息碎片化从而削弱预测效果的局限性。其解决方案的关键在于提出一个完全基于Transformer架构的框架LungEvaty,该框架直接以完整肺部体积为输入,利用大规模筛查数据学习全面的解剖和病理特征,无需区域监督即可达到当前最优性能;同时引入可选的解剖引导注意力损失(Anatomically Informed Attention Guidance, AIAG),增强模型对关键解剖区域的关注能力,从而在保证高效性的同时提升预测精度。

链接: https://arxiv.org/abs/2511.20116
作者: Johannes Brandt,Maulik Chevli,Rickmer Braren,Georgios Kaissis,Philip Müller,Daniel Rueckert
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Lung cancer risk estimation is gaining increasing importance as more countries introduce population-wide screening programs using low-dose CT (LDCT). As imaging volumes grow, scalable methods that can process entire lung volumes efficiently are essential to tap into the full potential of these large screening datasets. Existing approaches either over-rely on pixel-level annotations, limiting scalability, or analyze the lung in fragments, weakening performance. We present LungEvaty, a fully transformer-based framework for predicting 1-6 year lung cancer risk from a single LDCT scan. The model operates on whole-lung inputs, learning directly from large-scale screening data to capture comprehensive anatomical and pathological cues relevant for malignancy risk. Using only imaging data and no region supervision, LungEvaty matches state-of-the-art performance, refinable by an optional Anatomically Informed Attention Guidance (AIAG) loss that encourages anatomically focused attention. In total, LungEvaty was trained on more than 90,000 CT scans, including over 28,000 for fine-tuning and 6,000 for evaluation. The framework offers a simple, data-efficient, and fully open-source solution that provides an extensible foundation for future research in longitudinal and multimodal lung cancer risk prediction.
zh

[CV-95] Multi Head Attention Enhanced Inception v3 for Cardiomegaly Detection

【速读】:该论文旨在解决心血管疾病中心脏扩大(cardiomegaly)的自动检测问题,传统方法依赖人工判读X-ray图像,存在主观性强、效率低等局限。解决方案的关键在于提出一种融合深度学习与多头注意力机制(multi-head attention mechanism)的集成模型:首先通过高质量标注的X-ray数据集进行训练,结合Inception V3架构提取特征,并引入多层注意力机制以实现对输入图像关键区域的自适应聚焦,从而提升模型对心脏边界和形态异常的敏感性。该方法显著增强了特征表示能力,在准确率(95.6%)、精确率(95.2%)、召回率(96.2%)及AUC(96.0%)等指标上表现优异,验证了其在临床辅助诊断中的可行性与有效性。

链接: https://arxiv.org/abs/2511.20101
作者: Abishek Karthik,Pandiyaraju V
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The healthcare industry has been revolutionized significantly by novel imaging technologies, not just in the diagnosis of cardiovascular diseases but also by the visualization of structural abnormalities like cardiomegaly. This article explains an integrated approach to the use of deep learning tools and attention mechanisms for automatic detection of cardiomegaly using X-ray images. The initiation of the project is grounded on a strong Data Collection phase and gathering the data of annotated X-ray images of various types. Then, while the Preprocessing module fine-tunes image quality, it is feasible to utilize the best out of the data quality in the proposed system. In our proposed system, the process is a CNN configuration leveraging the inception V3 model as one of the key blocks. Besides, we also employ a multilayer attention mechanism to enhance the strength. The most important feature of the method is the multi-head attention mechanism that can learn features automatically. By exact selective focusing on only some regions of input, the model can thus identify cardiomegaly in a sensitive manner. Attention rating is calculated, duplicated, and applied to enhance representation of main data, and therefore there is a successful diagnosis. The Evaluation stage will be extremely strict and it will thoroughly evaluate the model based on such measures as accuracy and precision. This will validate that the model can identify cardiomegaly and will also show the clinical significance of this method. The model has accuracy of 95.6, precision of 95.2, recall of 96.2, sensitivity of 95.7, specificity of 96.1 and an Area Under Curve(AUC) of 96.0 and their respective graphs are plotted for visualisation.
zh

[CV-96] Exploring State-of-the-art models for Early Detection of Forest Fires

【速读】:该论文旨在解决森林火灾早期检测中因数据集规模不足和模型针对性不强而导致的漏检问题。其解决方案的关键在于构建了一个专门用于早期识别森林火灾的视觉分析数据集,该数据集聚焦于火势初起阶段的烟雾羽流和火焰实例,而非已广泛蔓延的火灾图像;同时,通过游戏模拟器(如《荒野大镖客2》)合成数据,并与公开图像结合,提升了数据多样性与代表性;在此基础上,对比了图像分类与目标定位方法,特别是YOLOv7和检测Transformer(DETR)等模型在该数据集上的性能表现,从而为早期火灾预警提供了更可靠的算法基础。

链接: https://arxiv.org/abs/2511.20096
作者: Sharjeel Ahmed,Daim Armaghan,Fatima Naweed,Umair Yousaf,Ahmad Zubair,Murtaza Taj
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:There have been many recent developments in the use of Deep Learning Neural Networks for fire detection. In this paper, we explore an early warning system for detection of forest fires. Due to the lack of sizeable datasets and models tuned for this task, existing methods suffer from missed detection. In this work, we first propose a dataset for early identification of forest fires through visual analysis. Unlike existing image corpuses that contain images of wide-spread fire, our dataset consists of multiple instances of smoke plumes and fire that indicates the initiation of fire. We obtained this dataset synthetically by utilising game simulators such as Red Dead Redemption 2. We also combined our dataset with already published images to obtain a more comprehensive set. Finally, we compared image classification and localisation methods on the proposed dataset. More specifically we used YOLOv7 (You Only Look Once) and different models of detection transformer.
zh

[CV-97] WPT: World-to-Policy Transfer via Online World Model Distillation

【速读】:该论文旨在解决当前世界模型(World Model)在实际应用中面临的两个核心问题:一是现有方法通常存在运行时耦合紧密导致推理开销大,二是依赖离线奖励信号阻碍了端到端优化。为克服上述限制,作者提出了一种名为WPT(World-to-Policy Transfer)的训练范式,其关键在于通过可训练的奖励模型将世界模型预测的未来动态与候选轨迹对齐,从而将世界知识注入教师策略(teacher policy);随后采用策略蒸馏(policy distillation)和世界奖励蒸馏(world reward distillation)技术,将教师策略的推理能力高效迁移至轻量级学生策略(student policy),在保持实时部署能力的同时显著提升规划性能。

链接: https://arxiv.org/abs/2511.20095
作者: Guangfeng Jiang,Yueru Luo,Jun Liu,Yi Huang,Yiyao Zhu,Zhan Qu,Dave Zhenyu Chen,Bingbing Liu,Xu Yan
机构: University of Science and Technology of China (中国科学技术大学); CUHK-SZ; HKUST; Huawei Foundation Model Dept (华为基础模型部门)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent years have witnessed remarkable progress in world models, which primarily aim to capture the spatio-temporal correlations between an agent’s actions and the evolving environment. However, existing approaches often suffer from tight runtime coupling or depend on offline reward signals, resulting in substantial inference overhead or hindering end-to-end optimization. To overcome these limitations, we introduce WPT, a World-to-Policy Transfer training paradigm that enables online distillation under the guidance of an end-to-end world model. Specifically, we develop a trainable reward model that infuses world knowledge into a teacher policy by aligning candidate trajectories with the future dynamics predicted by the world model. Subsequently, we propose policy distillation and world reward distillation to transfer the teacher’s reasoning ability into a lightweight student policy, enhancing planning performance while preserving real-time deployability. Extensive experiments on both open-loop and closed-loop benchmarks show that our WPT achieves state-of-the-art performance with a simple policy architecture: it attains a 0.11 collision rate (open-loop) and achieves a 79.23 driving score (closed-loop) surpassing both world-model-based and imitation-learning methods in accuracy and safety. Moreover, the student sustains up to 4.9x faster inference, while retaining most of the gains.
zh

[CV-98] Explainable Visual Anomaly Detection via Concept Bottleneck Models

【速读】:该论文旨在解决视觉异常检测(Visual Anomaly Detection, VAD)中现有方法虽能提供异常区域的可视化解释,但缺乏语义层面可理解描述的问题。其关键解决方案是将概念瓶颈模型(Concept Bottleneck Models, CBMs)引入VAD场景,通过学习有意义的概念(concepts),使模型不仅能定位异常区域,还能生成人类可读的异常描述,从而实现语义解释与定位解释的融合。该方案的核心创新包括:构建用于CBMs的VAD概念数据集、改进CBM架构以同时输出概念和视觉解释,并设计合成人工异常的流程,在不依赖稀有异常样本的前提下提升模型的可解释性与可信度。

链接: https://arxiv.org/abs/2511.20088
作者: Arianna Stropeni,Valentina Zaccaria,Francesco Borsatti,Davide Dalle Pezze,Manuel Barusco,Gian Antonio Susto
机构: University of Padova (帕多瓦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent years, Visual Anomaly Detection (VAD) has gained significant attention due to its ability to identify anomalous images using only normal images during training. Many VAD models work without supervision but are still able to provide visual explanations by highlighting the anomalous regions within an image. However, although these visual explanations can be helpful, they lack a direct and semantically meaningful interpretation for users. To address this limitation, we propose extending Concept Bottleneck Models (CBMs) to the VAD setting. By learning meaningful concepts, the network can provide human-interpretable descriptions of anomalies, offering a novel and more insightful way to explain them. Our contributions are threefold: (i) we develop a Concept Dataset to support research on CBMs for VAD; (ii) we improve the CBM architecture to generate both concept-based and visual explanations, bridging semantic and localization interpretability; and (iii) we introduce a pipeline for synthesizing artificial anomalies, preserving the VAD paradigm of minimizing dependence on rare anomalous samples. Our approach, Concept-Aware Visual Anomaly Detection (CONVAD), achieves performance comparable to classic VAD methods while providing richer, concept-driven explanations that enhance interpretability and trust in VAD systems.
zh

[CV-99] Blind Adaptive Local Denoising for CEST Imaging

【速读】:该论文旨在解决化学交换饱和转移(Chemical Exchange Saturation Transfer, CEST)磁共振成像中因硬件限制导致的空间异质噪声(heteroscedasticity)问题,以及复杂成像协议引发的定量对比度映射(如酰胺质子转移,Amide Proton Transfer, APT)精度下降问题。传统去噪方法无法适应此类非平稳噪声特性,常会破坏关键生物医学信息。其解决方案的关键在于提出一种盲适应局部去噪(Blind Adaptive Local Denoising, BALD)方法:首先利用CEST数据的自相似性构建一个无需先验噪声知识的方差稳定变换(variance-stabilizing transform),使各像素噪声分布均一化;随后在数据的局部奇异值分解(SVD)线性变换域上进行两阶段去噪,有效分离分子信号与噪声,同时避免空间和谱域伪影。实验表明,BALD在多个体模和活体CEST扫描中均显著优于现有最优去噪方法,在去噪指标及下游任务(如分子浓度图估计与癌症检测)中表现更优。

链接: https://arxiv.org/abs/2511.20081
作者: Chu Chen,Aitor Artola,Yang Liu,Se Weon Park,Raymond H. Chan,Jean-Michel Morel,Kannie W. Y. Chan
机构: City University of Hong Kong (香港城市大学); Hong Kong Centre for Cerebro-cardiovascular Health Engineering (香港脑心血管健康工程中心); Lingnan University (岭南大学); Johns Hopkins University School of Medicine (约翰霍普金斯大学医学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Chemical Exchange Saturation Transfer (CEST) MRI enables molecular-level visualization of low-concentration metabolites by leveraging proton exchange dynamics. However, its clinical translation is hindered by inherent challenges: spatially varying noise arising from hardware limitations, and complex imaging protocols introduce heteroscedasticity in CEST data, perturbing the accuracy of quantitative contrast mapping such as amide proton transfer (APT) imaging. Traditional denoising methods are not designed for this complex noise and often alter the underlying information that is critical for biomedical analysis. To overcome these limitations, we propose a new Blind Adaptive Local Denoising (BALD) method. BALD exploits the self-similar nature of CEST data to derive an adaptive variance-stabilizing transform that equalizes the noise distributions across CEST pixels without prior knowledge of noise characteristics. Then, BALD performs two-stage denoising on a linear transformation of data to disentangle molecular signals from noise. A local SVD decomposition is used as a linear transform to prevent spatial and spectral denoising artifacts. We conducted extensive validation experiments on multiple phantoms and \textitin vivo CEST scans. In these experiments, BALD consistently outperformed state-of-the-art CEST denoisers in both denoising metrics and downstream tasks such as molecular concentration maps estimation and cancer detection.
zh

[CV-100] Learning Procedural-aware Video Representations through State-Grounded Hierarchy Unfolding AAAI2026

【速读】:该论文旨在解决现有方法在构建可执行复杂任务的智能体时,难以将抽象的任务与步骤描述与视频中具体的视觉细节进行稳健对齐的问题。其核心解决方案是引入“状态”(state)作为视觉可感知的语义层,即通过物体配置的文本快照来锚定抽象程序与模型实际可见的内容,并提出任务-步骤-状态(Task-Step-State, TSS)框架,将任务实现建模为由步骤驱动的状态转移过程。关键创新在于采用渐进式预训练策略,逐步展开TSS层级结构,强制模型在状态层面建立视觉基础的同时关联步骤和高层任务,从而提升视频表征的程序感知能力。

链接: https://arxiv.org/abs/2511.20073
作者: Jinghan Zhao,Yifei Huang,Feng Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026. 15 pages, 12 figures

点击查看摘要

Abstract:Learning procedural-aware video representations is a key step towards building agents that can reason about and execute complex tasks. Existing methods typically address this problem by aligning visual content with textual descriptions at the task and step levels to inject procedural semantics into video representations. However, due to their high level of abstraction, ‘task’ and ‘step’ descriptions fail to form a robust alignment with the concrete, observable details in visual data. To address this, we introduce ‘states’, i.e., textual snapshots of object configurations, as a visually-grounded semantic layer that anchors abstract procedures to what a model can actually see. We formalize this insight in a novel Task-Step-State (TSS) framework, where tasks are achieved via steps that drive transitions between observable states. To enforce this structure, we propose a progressive pre-training strategy that unfolds the TSS hierarchy, forcing the model to ground representations in states while associating them with steps and high-level tasks. Extensive experiments on the COIN and CrossTask datasets show that our method outperforms baseline models on multiple downstream tasks, including task recognition, step recognition, and next step prediction. Ablation studies show that introducing state supervision is a key driver of performance gains across all tasks. Additionally, our progressive pretraining strategy proves more effective than standard joint training, as it better enforces the intended hierarchical structure.
zh

[CV-101] PRADA: Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images

【速读】:该论文旨在解决自回归图像生成(Autoregressive Image Generation)所引发的图像真实性检测难题,即如何可靠地识别由自回归模型生成的图像,并准确追溯其来源模型。现有方法缺乏对这类生成图像的专门检测能力,而随着生成式AI(Generative AI)技术的发展,这一问题愈发紧迫。解决方案的关键在于提出PRADA(Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images),其核心思想是通过分析图像对应的自回归标记序列在条件概率与无条件概率之间的比值特征来实现检测与溯源:当图像由特定模型生成时,该概率比值表现出独特且可区分的模式,而真实图像或其他模型生成的图像则不具备此特性。作者利用这一特性构建了一个简单、模型专属的评分函数,并通过阈值判断完成检测与归属任务,在八类图像到图像和四类文本到图像生成模型上验证了方法的有效性。

链接: https://arxiv.org/abs/2511.20068
作者: Simon Damm,Jonas Ricker,Henning Petzka,Asja Fischer
机构: Ruhr University Bochum (鲁尔大学波鸿分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Autoregressive (AR) image generation has recently emerged as a powerful paradigm for image synthesis. Leveraging the generation principle of large language models, they allow for efficiently generating deceptively real-looking images, further increasing the need for reliable detection methods. However, to date there is a lack of work specifically targeting the detection of images generated by AR image generators. In this work, we present PRADA (Probability-Ratio-Based Attribution and Detection of Autoregressive-Generated Images), a simple and interpretable approach that can reliably detect AR-generated images and attribute them to their respective source model. The key idea is to inspect the ratio of a model’s conditional and unconditional probability for the autoregressive token sequence representing a given image. Whenever an image is generated by a particular model, its probability ratio shows unique characteristics which are not present for images generated by other models or real images. We exploit these characteristics for threshold-based attribution and detection by calibrating a simple, model-specific score function. Our experimental evaluation shows that PRADA is highly effective against eight class-to-image and four text-to-image models.
zh

[CV-102] FLaTEC: Frequency-Disentangled Latent Triplanes for Efficient Compression of LiDAR Point Clouds

【速读】:该论文旨在解决点云压缩中压缩比与重建质量难以平衡的问题,尤其针对低频结构和高频纹理在相同分辨率下贡献差异显著的挑战。解决方案的关键在于提出一种频率感知的压缩模型FLaTEC,其核心创新包括:(1)引入频率解耦机制,将低频结构与高频纹理分离存储,其中低频成分以紧凑形式编码,高频细节跨尺度聚合;(2)采用混合潜在三平面(hybrid latent triplanes)作为点云的紧凑代理表示,通过将体素嵌入转换为三平面表示降低稀疏性、计算成本和存储开销;(3)设计基于频率的注意力机制以恢复丢失的3D相关性,并支持任意分辨率点云的重建。该方法在SemanticKITTI和Ford数据集上分别相比标准编解码器提升78%和94%的BD-rate性能,达到当前最优率失真表现。

链接: https://arxiv.org/abs/2511.20065
作者: Xiaoge Zhang,Zijie Wu,Mingtao Feng,Zichen Geng,Mehwish Nasim,Saeed Anwar,Ajmal Mian
机构: University of Western Australia (西澳大利亚大学); Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Point cloud compression methods jointly optimize bitrates and reconstruction distortion. However, balancing compression ratio and reconstruction quality is difficult because low-frequency and high-frequency components contribute differently at the same resolution. To address this, we propose FLaTEC, a frequency-aware compression model that enables the compression of a full scan with high compression ratios. Our approach introduces a frequency-aware mechanism that decouples low-frequency structures and high-frequency textures, while hybridizing latent triplanes as a compact proxy for point cloud. Specifically, we convert voxelized embeddings into triplane representations to reduce sparsity, computational cost, and storage requirements. We then devise a frequency-disentangling technique that extracts compact low-frequency content while collecting high-frequency details across scales. The decoupled low-frequency and high-frequency components are stored in binary format. During decoding, full-spectrum signals are progressively recovered via a modulation block. Additionally, to compensate for the loss of 3D correlation, we introduce an efficient frequency-based attention mechanism that fosters local connectivity and outputs arbitrary resolution points. Our method achieves state-of-the-art rate-distortion performance and outperforms the standard codecs by 78% and 94% in BD-rate on both SemanticKITTI and Ford datasets.
zh

[CV-103] DeLightMono: Enhancing Self-Supervised Monocular Depth Estimation in Endoscopy by Decoupling Uneven Illumination

【速读】:该论文旨在解决内窥镜图像中由于光照不均(尤其是低亮度区域)导致的单目深度估计性能下降问题。现有低光增强技术无法有效引导深度网络,而其他领域(如自动驾驶)的方法则依赖良好光照条件,难以直接迁移且增加数据采集负担。解决方案的关键在于提出一种名为DeLight-Mono的自监督单目深度估计框架,其核心是通过设计一个光照-反射-深度分解模型(illumination-reflectance-depth model),利用辅助网络对图像进行解耦,并构建基于解耦成分的新型自监督联合优化损失函数,从而显著缓解光照不均对深度估计的影响。

链接: https://arxiv.org/abs/2511.20058
作者: Mingyang Ou,Haojin Li,Yifeng Zhang,Ke Niu,Zhongxi Qiu,Heng Li,Jiang Liu
机构: 1.未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Self-supervised monocular depth estimation serves as a key task in the development of endoscopic navigation systems. However, performance degradation persists due to uneven illumination inherent in endoscopic images, particularly in low-intensity regions. Existing low-light enhancement techniques fail to effectively guide the depth network. Furthermore, solutions from other fields, like autonomous driving, require well-lit images, making them unsuitable and increasing data collection burdens. To this end, we present DeLight-Mono - a novel self-supervised monocular depth estimation framework with illumination decoupling. Specifically, endoscopic images are represented by a designed illumination-reflectance-depth model, and are decomposed with auxiliary networks. Moreover, a self-supervised joint-optimizing framework with novel losses leveraging the decoupled components is proposed to mitigate the effects of uneven illumination on depth estimation. The effectiveness of the proposed methods was rigorously verified through extensive comparisons and an ablation study performed on two public datasets.
zh

[CV-104] History-Augmented Contrastive Meta-Learning for Unsupervised Blind Super-Resolution of Planetary Remote Sensing Images

【速读】:该论文旨在解决行星遥感图像在复杂成像环境和硬件限制下因未知退化因素导致的图像质量下降问题,尤其针对监督式盲超分辨率(blind super-resolution)因缺乏真实标签(ground-truth)而难以实施的挑战。其解决方案的关键在于提出一种无监督的盲超分辨率框架 HACBSR,核心创新包括:(1) 基于核相似性控制的对比采样机制,以缓解传统高斯采样带来的分布偏移;(2) 利用历史模型生成负样本的历史增强对比学习策略,从而实现更稳健的优化并诱导强凸性,无需依赖真实标签即可提升重建性能。

链接: https://arxiv.org/abs/2511.20045
作者: Huijia Zhao,Jie Lu,Yunqing Jiang,Xiao-Ping Lu,Kaichang Di
机构: Macau University of Science and Technology (澳门科技大学); State Key Laboratory of Lunar and Planetary Sciences (月球与行星科学国家重点实验室); State Key Laboratory of Remote Sensing Science (遥感科学国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13pages

点击查看摘要

Abstract:Planetary remote sensing images are affected by diverse and unknown degradations caused by imaging environments and hardware constraints. These factors limit image quality and hinder supervised blind super-resolution due to the lack of ground-truth images. This work presents History-Augmented Contrastive Blind Super-Resolution (HACBSR), an unsupervised framework for blind super-resolution that operates without ground-truth images and external kernel priors. HACBSR comprises two components: (1) a contrastive kernel sampling mechanism with kernel similarity control to mitigate distribution bias from Gaussian sampling, and (2) a history-augmented contrastive learning that uses historical models to generate negative samples to enable less greedy optimization and to induce strong convexity without ground-truth. A convergence analysis of the history-augmented contrastive learning is given in the Appendix. To support evaluation in planetary applications, we introduce Ceres-50, a dataset with diverse geological features simulated degradation patterns. Experiments show that HACBSR achieves competitive performance compared with state-of-the-art unsupervised methods across multiple upscaling factors. The code is available at this https URL, and the dataset is available at this https URL.
zh

[CV-105] MFM-point: Multi-scale Flow Matching for Point Cloud Generation

【速读】:该论文旨在解决点云生成中点基方法(point-based methods)在性能上落后于基于其他表示形式(如网格、体素或潜在特征)的方法的问题,同时保持其训练成本低和算法简洁的优势。解决方案的关键在于提出一种多尺度流匹配(multi-scale Flow Matching, MFM-Point)框架,采用从粗到细的生成范式,在不增加训练或推理开销的前提下显著提升生成质量和可扩展性;其核心创新在于设计了一种结构化的下采样与上采样策略,有效保留无序点云的几何结构,并确保不同分辨率间分布过渡平滑一致,从而实现点基方法在多类别和高分辨率生成任务中的最优表现。

链接: https://arxiv.org/abs/2511.20041
作者: Petr Molodyk,Jaemoo Choi,David W. Romero,Ming-Yu Liu,Yongxin Chen
机构: Georgia Institute of Technology (佐治亚理工学院); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In recent years, point cloud generation has gained significant attention in 3D generative modeling. Among existing approaches, point-based methods directly generate point clouds without relying on other representations such as latent features, meshes, or voxels. These methods offer low training cost and algorithmic simplicity, but often underperform compared to representation-based approaches. In this paper, we propose MFM-Point, a multi-scale Flow Matching framework for point cloud generation that substantially improves the scalability and performance of point-based methods while preserving their simplicity and efficiency. Our multi-scale generation algorithm adopts a coarse-to-fine generation paradigm, enhancing generation quality and scalability without incurring additional training or inference overhead. A key challenge in developing such a multi-scale framework lies in preserving the geometric structure of unordered point clouds while ensuring smooth and consistent distributional transitions across resolutions. To address this, we introduce a structured downsampling and upsampling strategy that preserves geometry and maintains alignment between coarse and fine resolutions. Our experimental results demonstrate that MFM-Point achieves best-in-class performance among point-based methods and challenges the best representation-based methods. In particular, MFM-point demonstrates strong results in multi-category and high-resolution generation tasks.
zh

[CV-106] Clair Obscur: an Illumination-Aware Method for Real-World Image Vectorization

【速读】:该论文旨在解决现有图像矢量化方法在处理复杂真实世界图像时存在的问题,即难以保持视觉保真度的同时实现语义简洁性,常导致生成碎片化形状。其解决方案的关键在于提出COVec框架,首次在矢量域中引入内在图像分解(intrinsic image decomposition),将图像统一表示为反照率(albedo)、阴影(shade)和光照(light)三层,并通过语义引导的初始化与两阶段可微渲染优化策略,实现对各层的精细化重构,从而显著提升矢量表示的视觉保真度和可编辑性。

链接: https://arxiv.org/abs/2511.20034
作者: Xingyue Lin,Shuai Peng,Xiangyu Xie,Jianhua Zhu,Yuxuan Zhou,Liangcai Gao
机构: Wangxuan Institute of Computer Technology, Peking University (北京大学王选计算机研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image vectorization aims to convert raster images into editable, scalable vector representations while preserving visual fidelity. Existing vectorization methods struggle to represent complex real-world images, often producing fragmented shapes at the cost of semantic conciseness. In this paper, we propose COVec, an illumination-aware vectorization framework inspired by the Clair-Obscur principle of light-shade contrast. COVec is the first to introduce intrinsic image decomposition in the vector domain, separating an image into albedo, shade, and light layers in a unified vector representation. A semantic-guided initialization and two-stage optimization refine these layers with differentiable rendering. Experiments on various datasets demonstrate that COVec achieves higher visual fidelity and significantly improved editability compared to existing methods.
zh

[CV-107] Model Where to Look: Mitigating Hallucinations in MLLM s by Vision-Guided Attention

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉理解过程中因视觉注意力(Visual Attention)定位能力有限而导致的幻觉问题(Hallucination)。解决方案的关键在于提出一种无需训练的视觉引导注意力机制(Vision-Guided Attention, VGA),其核心思想是利用视觉token的语义内容构建精确的视觉锚定(Visual Grounding),并以此指导模型聚焦于相关视觉区域;在图像描述生成阶段,VGA进一步通过抑制已描述区域动态优化注意力分布,从而有效减少幻觉。该方法仅引入4.36%的延迟开销,且兼容高效注意力实现如FlashAttention,在多个基准测试中显著提升了去幻觉性能。

链接: https://arxiv.org/abs/2511.20032
作者: Jianfei Zhao,Feng Zhang,Xin Sun,Chong Feng,Zhixing Tan
机构: Beijing Institute of Technology (北京理工大学); Zhongguancun Academy (中关村学院); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

点击查看摘要

Abstract:Visual attention serves as the primary mechanism through which MLLMs interpret visual information; however, its limited localization capability often leads to hallucinations. We observe that although MLLMs can accurately extract visual semantics from visual tokens, they fail to fully leverage this advantage during subsequent inference. To address this limitation, we propose Vision-Guided Attention (VGA), a training-free method that first constructs precise visual grounding by exploiting the semantic content of visual tokens, and then uses this grounding to guide the model’s focus toward relevant visual regions. In image captioning, VGA further refines this guidance dynamically during generation by suppressing regions that have already been described. In VGA, each token undergoes only a single forward pass, introducing a negligible latency overhead of just 4.36%. In addition, VGA is fully compatible with efficient attention implementations such as FlashAttention. Extensive experiments across diverse MLLMs and multiple hallucination benchmarks demonstrate that VGA achieves state-of-the-art dehallucination performance. Further analysis confirms that explicit visual guidance plays a crucial role in enhancing the visual understanding capabilities of MLLMs.
zh

[CV-108] SAM-MI: A Mask-Injected Framework for Enhancing Open-Vocabulary Semantic Segmentation with SAM

【速读】:该论文旨在解决开放词汇语义分割(Open-vocabulary Semantic Segmentation, OVSS)中两个关键挑战:一是Segment Anything Model (SAM) 产生的过分割(over-segmentation)问题,二是固定掩码与标签之间难以有效融合的硬组合问题。解决方案的关键在于提出一种新颖的掩码注入框架 SAM-MI,其核心创新包括:(1) 文本引导的稀疏点提示器(Text-guided Sparse Point Prompter),替代传统的密集网格提示,显著提升掩码生成效率;(2) 浅层掩码聚合(Shallow Mask Aggregation, SMAgg),用于合并部分掩码以缓解过分割;(3) 解耦掩码注入(Decoupled Mask Injection, DMI),分别在低频和高频特征空间中引入 SAM 掩码指导,实现更灵活、高效的掩码融合。该方法在多个基准测试上验证了优越性,尤其在 MESS 数据集上相较 Grounded-SAM 提升了 16.7% 的 mIoU,并实现了 1.6 倍加速。

链接: https://arxiv.org/abs/2511.20027
作者: Lin Chen,Yingjian Zhu,Qi Yang,Xin Niu,Kun Ding,Shiming Xiang
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Renmin University of China (中国人民大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open-vocabulary semantic segmentation (OVSS) aims to segment and recognize objects universally. Trained on extensive high-quality segmentation data, the segment anything model (SAM) has demonstrated remarkable universal segmentation capabilities, offering valuable support for OVSS. Although previous methods have made progress in leveraging SAM for OVSS, there are still some challenges: (1) SAM’s tendency to over-segment and (2) hard combinations between fixed masks and labels. This paper introduces a novel mask-injected framework, SAM-MI, which effectively integrates SAM with OVSS models to address these challenges. Initially, SAM-MI employs a Text-guided Sparse Point Prompter to sample sparse prompts for SAM instead of previous dense grid-like prompts, thus significantly accelerating the mask generation process. The framework then introduces Shallow Mask Aggregation (SMAgg) to merge partial masks to mitigate the SAM’s over-segmentation issue. Finally, Decoupled Mask Injection (DMI) incorporates SAM-generated masks for guidance at low-frequency and high-frequency separately, rather than directly combining them with labels. Extensive experiments on multiple benchmarks validate the superiority of SAM-MI. Notably, the proposed method achieves a 16.7% relative improvement in mIoU over Grounded-SAM on the MESS benchmark, along with a 1.6 \times speedup. We hope SAM-MI can serve as an alternative methodology to effectively equip the OVSS model with SAM.
zh

[CV-109] WaymoQA: A Multi-View Visual Question Answering Dataset for Safety-Critical Reasoning in Autonomous Driving

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在自动驾驶中处理安全关键场景时的推理能力不足问题,尤其在高风险情境下,单一前视视角难以识别和规避潜在风险,且决策可能引发新的次生风险。解决方案的关键在于提出“安全关键推理”(Safety-Critical Reasoning)这一新任务,并将其分解为两个阶段:首先消除即时风险,再缓解由决策引发的下游风险;同时构建了包含35,000个高质量人工标注问答对的WaymoQA数据集,涵盖复杂高风险驾驶场景的多模态(图像与视频)多选题与开放题形式,通过该数据集微调MLLMs可显著提升其在安全关键场景下的推理性能,从而增强自动驾驶系统的安全性与决策合理性。

链接: https://arxiv.org/abs/2511.20022
作者: Seungjun Yu,Seonho Lee,Namho Kim,Jaeyo Shin,Junsung Park,Wonjeong Ryu,Raehyuk Jung,Hyunjung Shim
机构: Korea Advanced Institute of Science and Technology (韩国科学技术院); Hanyang University (汉阳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in multimodal large language models (MLLMs) have shown strong understanding of driving scenes, drawing interest in their application to autonomous driving. However, high-level reasoning in safety-critical scenarios, where avoiding one traffic risk can create another, remains a major challenge. Such reasoning is often infeasible with only a single front view and requires a comprehensive view of the environment, which we achieve through multi-view inputs. We define Safety-Critical Reasoning as a new task that leverages multi-view inputs to address this challenge. Then, we distill Safety-Critical Reasoning into two stages: first resolve the immediate risk, then mitigate the decision-induced downstream risks. To support this, we introduce WaymoQA, a dataset of 35,000 human-annotated question-answer pairs covering complex, high-risk driving scenarios. The dataset includes multiple-choice and open-ended formats across both image and video modalities. Experiments reveal that existing MLLMs underperform in safety-critical scenarios compared to normal scenes, but fine-tuning with WaymoQA significantly improves their reasoning ability, highlighting the effectiveness of our dataset in developing safer and more reasoning-capable driving agents.
zh

[CV-110] ACIT: Attention-Guided Cross-Modal Interaction Transformer for Pedestrian Crossing Intention Prediction

【速读】:该论文旨在解决自动驾驶车辆中行人过街意图预测问题,其核心挑战在于如何有效提取并融合来自多种模态数据(如视觉与运动信息)的互补特征。解决方案的关键在于提出一种注意力引导的跨模态交互Transformer(ACIT),该模型通过三组模态对(全局语义图与光流、局部RGB图像与光流、自车速度与行人边界框)实现细粒度的跨模态交互:在每组视觉模态对中引入双路径注意力机制,利用自注意力增强主模态中的显著区域,并通过光流引导注意力促进与辅助模态(光流)的深度交互;在运动模态对中采用跨模态注意力建模动态关系以提取互补运动特征;此外,还设计了多模态特征融合模块和基于Transformer的时间特征聚合模块,从而在时序维度上进一步增强跨模态信息整合能力。实验表明,该方法在JAADbeh和JAADall数据集上分别达到70%和89%的准确率,优于现有最优方法。

链接: https://arxiv.org/abs/2511.20020
作者: Yuanzhe Li,Steffen Müller
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Predicting pedestrian crossing intention is crucial for autonomous vehicles to prevent pedestrian-related collisions. However, effectively extracting and integrating complementary cues from different types of data remains one of the major challenges. This paper proposes an attention-guided cross-modal interaction Transformer (ACIT) for pedestrian crossing intention prediction. ACIT leverages six visual and motion modalities, which are grouped into three interaction pairs: (1) Global semantic map and global optical flow, (2) Local RGB image and local optical flow, and (3) Ego-vehicle speed and pedestrian’s bounding box. Within each visual interaction pair, a dual-path attention mechanism enhances salient regions within the primary modality through intra-modal self-attention and facilitates deep interactions with the auxiliary modality (i.e., optical flow) via optical flow-guided attention. Within the motion interaction pair, cross-modal attention is employed to model the cross-modal dynamics, enabling the effective extraction of complementary motion features. Beyond pairwise interactions, a multi-modal feature fusion module further facilitates cross-modal interactions at each time step. Furthermore, a Transformer-based temporal feature aggregation module is introduced to capture sequential dependencies. Experimental results demonstrate that ACIT outperforms state-of-the-art methods, achieving accuracy rates of 70% and 89% on the JAADbeh and JAADall datasets, respectively. Extensive ablation studies are further conducted to investigate the contribution of different modules of ACIT.
zh

[CV-111] Multi-Context Fusion Transformer for Pedestrian Crossing Intention Prediction in Urban Environments

【速读】:该论文旨在解决城市环境中行人过街意图预测的准确性难题,该问题对自动驾驶车辆提升行人安全性和减少交通事故至关重要。其解决方案的关键在于提出一种多上下文融合Transformer(Multi-context Fusion Transformer, MFT),通过四个维度的上下文信息——行人行为上下文、环境上下文、行人定位上下文和车辆运动上下文——进行深度特征融合。MFT采用渐进式融合策略:首先在各上下文中利用互注意力机制实现内部特征交互与序列融合,生成上下文特定表示;随后通过跨上下文注意力机制,以全局CLS标记作为紧凑的多上下文表征进行跨模态整合;最后通过引导式内上下文与跨上下文注意力进一步优化各上下文token及全局表示,实现更深层次的信息传播与融合,从而显著提升预测精度。

链接: https://arxiv.org/abs/2511.20011
作者: Yuanzhe Li,Hang Zhong,Steffen Müller
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pedestrian crossing intention prediction is essential for autonomous vehicles to improve pedestrian safety and reduce traffic accidents. However, accurate pedestrian intention prediction in urban environments remains challenging due to the multitude of factors affecting pedestrian behavior. In this paper, we propose a multi-context fusion Transformer (MFT) that leverages diverse numerical contextual attributes across four key dimensions, encompassing pedestrian behavior context, environmental context, pedestrian localization context and vehicle motion context, to enable accurate pedestrian intention prediction. MFT employs a progressive fusion strategy, where mutual intra-context attention enables reciprocal interactions within each context, thereby facilitating feature sequence fusion and yielding a context token as a context-specific representation. This is followed by mutual cross-context attention, which integrates features across contexts with a global CLS token serving as a compact multi-context representation. Finally, guided intra-context attention refines context tokens within each context through directed interactions, while guided cross-context attention strengthens the global CLS token to promote multi-context fusion via guided information propagation, yielding deeper and more efficient integration. Experimental results validate the superiority of MFT over state-of-the-art methods, achieving accuracy rates of 73%, 93%, and 90% on the JAADbeh, JAADall, and PIE datasets, respectively. Extensive ablation studies are further conducted to investigate the effectiveness of the network architecture and contribution of different input context. Our code is open-source: this https URL.
zh

[CV-112] Pedestrian Crossing Intention Prediction Using Multimodal Fusion Network

【速读】:该论文旨在解决城市环境中自动驾驶车辆(AV)对行人过街意图预测的难题,该任务因行人行为多样性及多因素情境依赖性而极具挑战。解决方案的关键在于提出一种多模态融合网络,通过七个来自视觉与运动分支的模态特征进行互补信息提取与整合;具体包括:利用多个基于Transformer的提取模块分别获取运动和视觉特征,设计深度引导注意力模块以空间特征交互方式指导注意力聚焦于另一模态中的显著区域,并引入模态注意力与时间注意力机制,动态加权不同模态和帧的重要性,从而有效捕捉时空依赖关系并提升预测准确性。

链接: https://arxiv.org/abs/2511.20008
作者: Yuanzhe Li,Steffen Müller
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pedestrian crossing intention prediction is essential for the deployment of autonomous vehicles (AVs) in urban environments. Ideal prediction provides AVs with critical environmental cues, thereby reducing the risk of pedestrian-related collisions. However, the prediction task is challenging due to the diverse nature of pedestrian behavior and its dependence on multiple contextual factors. This paper proposes a multimodal fusion network that leverages seven modality features from both visual and motion branches, aiming to effectively extract and integrate complementary cues across different modalities. Specifically, motion and visual features are extracted from the raw inputs using multiple Transformer-based extraction modules. Depth-guided attention module leverages depth information to guide attention towards salient regions in another modality through comprehensive spatial feature interactions. To account for the varying importance of different modalities and frames, modality attention and temporal attention are designed to selectively emphasize informative modalities and effectively capture temporal dependencies. Extensive experiments on the JAAD dataset validate the effectiveness of the proposed network, achieving superior performance compared to the baseline methods.
zh

[CV-113] Zero-Shot Transfer Capabilities of the Sundial Foundation Model for Leaf Area Index Forecasting

【速读】:该论文旨在解决农业监测中叶面积指数(Leaf Area Index, LAI)预测问题,特别是如何在不进行任务特定调优的情况下实现高精度的零样本(zero-shot)时间序列预测。其解决方案的关键在于验证了预训练的时间序列基础模型(如Sundial)在输入上下文窗口覆盖多个完整季节周期时,能够超越经过全监督训练的LSTM模型,从而首次证明通用基础模型可在遥感时间序列预测任务中无需微调即可优于专用监督模型,展现出强大的即插即用潜力。

链接: https://arxiv.org/abs/2511.20004
作者: Peining Zhang,Hongchen Qin,Haochen Zhang,Ziqi Guo,Guiling Wang,Jinbo Bi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This work investigates the zero-shot forecasting capability of time-series foundation models for Leaf Area Index (LAI) forecasting in agricultural monitoring. Using the HiQ dataset (U.S., 2000-2022), we systematically compare statistical baselines, a fully supervised LSTM, and the Sundial foundation model under multiple evaluation protocols. We find that Sundial, in the zero-shot setting, can outperform a fully trained LSTM provided that the input context window is sufficiently long-specifically, when covering more than one or two full seasonal cycles. This demonstrates, for the first time, that a general-purpose foundation model can surpass specialized supervised models on remote-sensing time series prediction without any task-specific tuning. These results highlight the strong potential of pretrained time-series foundation models to serve as effective plug-and-play forecasters in agricultural and environmental applications.
zh

[CV-114] On the Feasibility of Hijacking MLLM s Decision Chain via One Perturbation

【速读】:该论文旨在解决传统对抗攻击仅针对神经网络单一决策的局限性,揭示在实际应用中因决策链式传递而导致的级联错误风险——即单个扰动可操控模型在整个决策序列中的多个输出结果。其关键解决方案是提出语义感知通用扰动(Semantic-Aware Universal Perturbations, SAUPs),通过在归一化空间中结合语义分离策略进行优化搜索,使扰动能根据输入内容的语义特征诱导出多种预设的目标错误分类结果(如同时将“非机动车道”标志误判为“机动车道”和“行人”误判为“塑料袋”),从而实现对多目标决策链的精准控制。

链接: https://arxiv.org/abs/2511.20002
作者: Changyue Li,Jiaying Li,Youliang Yuan,Jiaming He,Zhicong Huang,Pinjia He
机构: The Chinese University of Hong Kong, Shenzhen (深圳大学); University of Electronic Science and Technology of China (电子科技大学); Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Conventional adversarial attacks focus on manipulating a single decision of neural networks. However, real-world models often operate in a sequence of decisions, where an isolated mistake can be easily corrected, but cascading errors can lead to severe risks. This paper reveals a novel threat: a single perturbation can hijack the whole decision chain. We demonstrate the feasibility of manipulating a model’s outputs toward multiple, predefined outcomes, such as simultaneously misclassifying “non-motorized lane” signs as “motorized lane” and “pedestrian” as “plastic bag”. To expose this threat, we introduce Semantic-Aware Universal Perturbations (SAUPs), which induce varied outcomes based on the semantics of the inputs. We overcome optimization challenges by developing an effective algorithm, which searches for perturbations in normalized space with a semantic separation strategy. To evaluate the practical threat of SAUPs, we present RIST, a new real-world image dataset with fine-grained semantic annotations. Extensive experiments on three multimodal large language models demonstrate their vulnerability, achieving a 70% attack success rate when controlling five distinct targets using just an adversarial frame. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR) Cite as: arXiv:2511.20002 [cs.CV] (or arXiv:2511.20002v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2511.20002 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-115] CREward: A Type-Specific Creativity Reward Model

【速读】:该论文旨在解决现有创造力评估方法将创造力视为单一维度量度的局限性问题,提出了一种类型特异性的创造力奖励模型(CREward),该模型从图像生成流程中的几何(geometry)、材质(material)和纹理(texture)三个维度对创造力进行建模。其解决方案的关键在于:首先通过人类基准评估捕捉不同类别创造力的人类感知,随后验证大型视觉语言模型(LVLMs)在预测人类创造力判断上的强一致性,进而利用LVLM生成的标签训练出可应用于创造力评估、可解释性分析及创意样本获取的CREward模型,从而实现对创造性图像的多维量化与引导生成。

链接: https://arxiv.org/abs/2511.19995
作者: Jiyeon Han,Ali Mahdavi-Amiri,Hao Zhang,Haedong Jeong
机构: Simon Fraser University (西蒙菲莎大学); Sogang University (首尔中央大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Creativity is a complex phenomenon. When it comes to representing and assessing creativity, treating it as a single undifferentiated quantity would appear naive and underwhelming. In this work, we learn the \emphfirst type-specific creativity reward model, coined CREward, which spans three creativity ``axes," geometry, material, and texture, to allow us to view creativity through the lens of the image formation pipeline. To build our reward model, we first conduct a human benchmark evaluation to capture human perception of creativity for each type across various creative images. We then analyze the correlation between human judgments and predictions by large vision-language models (LVLMs), confirming that LVLMs exhibit strong alignment with human perception. Building on this observation, we collect LVLM-generated labels to train our CREward model that is applicable to both evaluation and generation of creative images. We explore three applications of CREward: creativity assessment, explainable creativity, and creative sample acquisition for both human design inspiration and guiding creative generation through low-rank adaptation.
zh

[CV-116] OmniRefiner: Reinforcement-Guided Local Diffusion Refinement

【速读】:该论文旨在解决当前基于扩散模型(diffusion models)的参考引导图像生成方法在细节修复过程中难以保持精细视觉特征的问题。具体而言,VAE(变分自编码器)隐空间压缩会丢失细微纹理信息,导致身份和属性相关的线索消失;同时,现有后编辑方法在增强局部细节时往往与原图在光照、纹理或形状上不一致。解决方案的关键在于提出一个两阶段的细节感知精修框架 \ourMthd:第一阶段通过微调单图扩散编辑器,联合输入草稿图像与参考图像,实现全局结构一致性下的精细重构;第二阶段引入强化学习进一步优化局部编辑能力,显式地提升细节准确性与语义一致性,从而显著改善参考对齐度与细粒度细节保留效果。

链接: https://arxiv.org/abs/2511.19990
作者: Yaoli Liu,Ziheng Ouyang,Shengtao Lou,Yiren Song
机构: Zhejiang University (浙江大学); Nankai University (南开大学); National University of Singapore (新加坡国立大学); Creatly.ai
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reference-guided image generation has progressed rapidly, yet current diffusion models still struggle to preserve fine-grained visual details when refining a generated image using a reference. This limitation arises because VAE-based latent compression inherently discards subtle texture information, causing identity- and attribute-specific cues to vanish. Moreover, post-editing approaches that amplify local details based on existing methods often produce results inconsistent with the original image in terms of lighting, texture, or shape. To address this, we introduce \ourMthd, a detail-aware refinement framework that performs two consecutive stages of reference-driven correction to enhance pixel-level consistency. We first adapt a single-image diffusion editor by fine-tuning it to jointly ingest the draft image and the reference image, enabling globally coherent refinement while maintaining structural fidelity. We then apply reinforcement learning to further strengthen localized editing capability, explicitly optimizing for detail accuracy and semantic consistency. Extensive experiments demonstrate that \ourMthd significantly improves reference alignment and fine-grained detail preservation, producing faithful and visually coherent edits that surpass both open-source and commercial models on challenging reference-guided restoration benchmarks.
zh

[CV-117] GazeProphetV2: Head-Movement-Based Gaze Prediction Enabling Efficient Foveated Rendering on Mobile VR

【速读】:该论文旨在解决虚拟现实(Virtual Reality, VR)环境中注视行为预测的难题,该问题对渲染优化和交互界面设计具有重要意义。其解决方案的关键在于提出了一种多模态方法,通过融合时间维度上的注视模式、头部运动数据与视觉场景信息,并采用门控融合机制结合跨模态注意力机制,使模型能够根据上下文相关性自适应地调整注视历史、头部朝向和场景内容的权重。实验表明,该方法在22个VR场景、530万条注视样本的数据集上显著提升了预测精度,且具备良好的跨场景泛化能力(验证准确率达93.1%),为构建能预判用户注意力模式的高效VR系统提供了新路径。

链接: https://arxiv.org/abs/2511.19988
作者: Farhaan Ebadulla,Chiraag Mudlpaur,Shreya Chaurasia,Gaurav BV
机构: PES University (PES大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Predicting gaze behavior in virtual reality environments remains a significant challenge with implications for rendering optimization and interface design. This paper introduces a multimodal approach to VR gaze prediction that combines temporal gaze patterns, head movement data, and visual scene information. By leveraging a gated fusion mechanism with cross-modal attention, the approach learns to adaptively weight gaze history, head movement, and scene content based on contextual relevance. Evaluations using a dataset spanning 22 VR scenes with 5.3M gaze samples demonstrate improvements in predictive accuracy when combining modalities compared to using individual data streams alone. The results indicate that integrating past gaze trajectories with head orientation and scene content enhances prediction accuracy across 1-3 future frames. Cross-scene generalization testing shows consistent performance with 93.1% validation accuracy and temporal consistency in predicted gaze trajectories. These findings contribute to understanding attention mechanisms in virtual environments while suggesting potential applications in rendering optimization, interaction design, and user experience evaluation. The approach represents a step toward more efficient virtual reality systems that can anticipate user attention patterns without requiring expensive eye tracking hardware.
zh

[CV-118] On-Demand Multi-Task Sparsity for Efficient Large-Model Deployment on Edge Devices

【速读】:该论文旨在解决在资源受限的边缘平台部署大规模模型时,因任务频繁切换导致的显著I/O开销问题。现有方法通常孤立地优化每个任务的稀疏模式,忽略了任务切换带来的冷启动延迟。其解决方案的关键在于提出一种按需多任务稀疏框架(on-demand multi-task sparsity framework),通过将权重分解为可重用的块粒度单元,并对齐不同任务间的稀疏结构以最大化参数重用率,从而动态加载下一任务所需的最小差异块集合,有效降低任务切换延迟。实验表明,该方法在真实自动驾驶平台上平均提速超过6.6倍。

链接: https://arxiv.org/abs/2511.19986
作者: Lianming Huang,Haibo Hu,Qiao Li,Nan Guan,Chun Jason Xue
机构: City University of Hong Kong (香港城市大学); Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Sparsity is essential for deploying large models on resource constrained edge platforms. However, optimizing sparsity patterns for individual tasks in isolation ignores the significant I/O overhead incurred during frequent task switching. We introduce an on-demand multi-task sparsity framework specifically designed to minimize switching costs by maximizing parameter reuse. Unlike monolithic approaches, we decompose weights into reusable block-granular units and align sparse structures across tasks to maximize overlap. By dynamically loading only the small differential set of blocks required for the next task, our method effectively mitigates the cold-start latency inherent in traditional monolithic this http URL on a real-world autonomous driving platform demonstrate that our framework achieves superior switching efficiency, accelerating task switching by over 6.6X on average compared to existing sparsity methods.
zh

[CV-119] SONIC: Spectral Optimization of Noise for Inpainting with Consistency

【速读】:该论文旨在解决如何在不进行模型训练的前提下,利用现成的文本到图像生成模型(text-to-image models)实现高质量图像修复(inpainting)的问题。当前基于引导(guidance-based)的方法虽然理论上适用于此类逆问题,但在实践中效果受限,通常需要专门设计的修复模型。其核心解决方案在于:优化初始噪声种子(initial seed noise),使其生成结果在未掩码区域与真实数据尽可能一致,仅需数十次优化步骤即可实现;关键创新点包括:(i)通过线性近似避免昂贵的反向传播链式计算(unrolling),提升效率;(ii)在频域空间中优化初始噪声以增强优化稳定性。该方法显著优于现有无训练修复技术。

链接: https://arxiv.org/abs/2511.19985
作者: Seungyeon Baek,Erqun Dong,Shadan Namazifard,Mark J. Matthews,Kwang Moo Yi
机构: University of British Columbia (不列颠哥伦比亚大学); Google DeepMind (谷歌深度心智)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose a novel training-free method for inpainting with off-the-shelf text-to-image models. While guidance-based methods in theory allow generic models to be used for inverse problems such as inpainting, in practice, their effectiveness is limited, leading to the necessity of specialized inpainting-specific models. In this work, we argue that the missing ingredient for training-free inpainting is the optimization (guidance) of the initial seed noise. We propose to optimize the initial seed noise to approximately match the unmasked parts of the data - with as few as a few tens of optimization steps. We then apply conventional training-free inpainting methods on top of our optimized initial seed noise. Critically, we propose two core ideas to effectively implement this idea: (i) to avoid the costly unrolling required to relate the initial noise and the generated outcome, we perform linear approximation; and (ii) to stabilize the optimization, we optimize the initial seed noise in the spectral domain. We demonstrate the effectiveness of our method on various inpainting tasks, outperforming the state of the art. Project page: this https URL
zh

[CV-120] EmoFeedback2: Reinforcement of Continuous Emotional Image Generation via LVLM-based Reward and Textual Feedback

【速读】:该论文旨在解决连续情感图像生成(Continuous Emotional Image Generation, C-EICG)中两个核心问题:一是现有方法缺乏对生成图像的情感反馈,导致难以控制情感的连续性;二是情感与文本提示之间的简单对齐方式无法根据图像内容自适应调整提示,从而造成情感保真度不足。解决方案的关键在于提出一种“生成-理解-反馈”强化学习范式(EmoFeedback2),其核心创新包括:1)引入情感感知的奖励反馈策略,利用微调后的视觉语言模型(Large Vision-Language Model, LVLM)评估生成图像的情感值并计算与目标情感的奖励,指导生成模型的强化微调以增强情感连续性;2)设计自促进文本反馈框架,LVLM迭代分析图像情感内容并自适应生成精细化提示改进建议,从而提升情感保真度与细粒度内容一致性。

链接: https://arxiv.org/abs/2511.19982
作者: Jingyang Jia,Kai Shu,Gang Yang,Long Xing,Xun Chen,Aiping Liu
机构: University of Science and Technology of China (中国科学技术大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Continuous emotional image generation (C-EICG) is emerging rapidly due to its ability to produce images aligned with both user descriptions and continuous emotional values. However, existing approaches lack emotional feedback from generated images, limiting the control of emotional continuity. Additionally, their simple alignment between emotions and naively generated texts fails to adaptively adjust emotional prompts according to image content, leading to insufficient emotional fidelity. To address these concerns, we propose a novel generation-understanding-feedback reinforcement paradigm (EmoFeedback2) for C-EICG, which exploits the reasoning capability of the fine-tuned large vision-language model (LVLM) to provide reward and textual feedback for generating high-quality images with continuous emotions. Specifically, we introduce an emotion-aware reward feedback strategy, where the LVLM evaluates the emotional values of generated images and computes the reward against target emotions, guiding the reinforcement fine-tuning of the generative model and enhancing the emotional continuity of images. Furthermore, we design a self-promotion textual feedback framework, in which the LVLM iteratively analyzes the emotional content of generated images and adaptively produces refinement suggestions for the next-round prompt, improving the emotional fidelity with fine-grained content. Extensive experimental results demonstrate that our approach effectively generates high-quality images with the desired emotions, outperforming existing state-of-the-art methods in our custom dataset. The code and dataset will be released soon.
zh

[CV-121] Boosting Reasoning in Large Multimodal Models via Activation Replay

【速读】:该论文旨在解决强化学习带可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)后训练大型多模态模型(Large Multimodal Models, LMMs)时,其推理能力提升机制不明确的问题,尤其是输入激活状态如何被RLVR影响以及能否通过简单方法进一步优化推理性能。解决方案的关键在于提出一种名为“激活重放”(Activation Replay)的新颖、无需训练的策略:在测试阶段操纵视觉token,将基础LMM中低熵激活状态(low-entropy activations)从输入上下文中“重放”至RLVR模型,从而引导其产生更高质量的推理行为。实验表明,该方法能有效提升数学推理、视觉代理和视频理解等多场景下的表现,并改善RLVR带来的推理覆盖范围狭窄问题,优于其他替代方案如高熵激活重放或直接跨模型干预。

链接: https://arxiv.org/abs/2511.19972
作者: Yun Xing,Xiaobin Hu,Qingdong He,Jiangning Zhang,Shuicheng Yan,Shijian Lu,Yu-Gang Jiang
机构: Nanyang Technological University (南洋理工大学); National University of Singapore (新加坡国立大学); Tencent Youtu Lab (腾讯优图实验室); Zhejiang University (浙江大学); Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 figures, 10 tables

点击查看摘要

Abstract:Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach to incentivizing reasoning capability in Large Multimodal Models (LMMs), while the underlying mechanisms behind this post-training paradigm are poorly understood. We begin by exploring how input activations are affected by RLVR through the perspective of logit lens. Our systematic investigations across multiple post-trained LMMs suggest that RLVR shifts low-entropy activations unexpectedly, while high-entropy ones are less affected. We further demonstrate that such phenomena are associated with LMM reasoning by controlled experiments, suggesting a potentially beneficial role of modulating low-entropy activations. To this end, we propose Activation Replay, a novel simple yet effective training-free approach that boosts multimodal reasoning of post-trained LMMs without requiring expensive policy optimization. Our design involves manipulation of visual tokens at test time, replaying low-entropy activations from the input context of base LMMs to regulating the RLVR counterparts. Activation Replay triggers better reasoning across diverse scenarios, including mathematics, o3-like visual agents, and video reasoning. We further show that Activation Replay boosts Pass@K and mitigates narrower reasoning coverage of RLVR. Our design is compared against alternative choices, such as replaying high-entropy activations instead of low-entropy ones, or direct cross-model intervention instead of manipulating input tokens, demonstrating the superiority of our implementation. Codes will be made publicly available.
zh

[CV-122] VGGT4D: Mining Motion Cues in Visual Geometry Transformers for 4D Scene Reconstruction

【速读】:该论文旨在解决动态4D场景重建中难以有效分离运动物体与静态背景的问题,尤其是在运动物体占据主导时,现有3D基础模型(如VGGT)性能显著下降。其解决方案的关键在于利用VGGT模型中已存在的全局注意力层所隐式编码的分层动态线索,通过Gram相似性挖掘并放大这些全局动态特征,并在时间窗口内进行聚合以生成准确的静态-动态掩码;进一步引入基于投影梯度的精修策略优化掩码边界,最终将高精度掩码嵌入VGGT的早期推理阶段,从而有效抑制运动干扰,提升相机位姿估计和几何重建的鲁棒性。

链接: https://arxiv.org/abs/2511.19971
作者: Yu Hu,Chong Cheng,Sicheng Yu,Xiaoyang Guo,Hao Wang
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); Horizon Robotics ( horizon 机器人)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reconstructing dynamic 4D scenes is challenging, as it requires robust disentanglement of dynamic objects from the static background. While 3D foundation models like VGGT provide accurate 3D geometry, their performance drops markedly when moving objects dominate. Existing 4D approaches often rely on external priors, heavy post-optimization, or require fine-tuning on 4D datasets. In this paper, we propose VGGT4D, a training-free framework that extends the 3D foundation model VGGT for robust 4D scene reconstruction. Our approach is motivated by the key finding that VGGT’s global attention layers already implicitly encode rich, layer-wise dynamic cues. To obtain masks that decouple static and dynamic elements, we mine and amplify global dynamic cues via gram similarity and aggregate them across a temporal window. To further sharpen mask boundaries, we introduce a refinement strategy driven by projection gradient. We then integrate these precise masks into VGGT’s early-stage inference, effectively mitigating motion interference in both pose estimation and geometric reconstruction. Across six datasets, our method achieves superior performance in dynamic object segmentation, camera pose estimation, and dense reconstruction. It also supports single-pass inference on sequences longer than 500 frames.
zh

[CV-123] HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning

【速读】:该论文旨在解决当前扩散模型在处理复杂提示(complex prompts)时存在的生成偏差问题,特别是当提示涉及多个对象和层次结构时,模型容易出现概念遗漏、混淆及组合性差等缺陷。其解决方案的核心在于提出一种基于新型“合成链”(Chain of Synthesis, CoS)范式的分层组合生成框架(Hierarchical Compositional Generative framework, HiCoGen)。该框架首先利用大语言模型(Large Language Model, LLM)将复杂提示分解为最小语义单元,随后通过迭代式合成机制逐步构建图像,每一步生成的图像为下一步提供关键视觉上下文,从而保障所有文本概念被准确映射至最终场景。此外,为优化此过程,作者引入强化学习(Reinforcement Learning, RL)框架,并设计了分层奖励机制与去噪调度策略(Decaying Stochasticity Schedule),以提升样本多样性并增强对复杂结构的理解能力。

链接: https://arxiv.org/abs/2511.19965
作者: Hongji Yang,Yucheng Zhou,Wencheng Han,Runzhou Tao,Zhongying Qiu,Jianfei Yang,Jianbing Shen
机构: University of Macau (澳门大学); Zhejiang ZEEKR Automobile Research & Development Co., Ltd. (浙江极氪汽车研发有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages

点击查看摘要

Abstract:Recent advances in diffusion models have demonstrated impressive capability in generating high-quality images for simple prompts. However, when confronted with complex prompts involving multiple objects and hierarchical structures, existing models struggle to accurately follow instructions, leading to issues such as concept omission, confusion, and poor compositionality. To address these limitations, we propose a Hierarchical Compositional Generative framework (HiCoGen) built upon a novel Chain of Synthesis (CoS) paradigm. Instead of monolithic generation, HiCoGen first leverages a Large Language Model (LLM) to decompose complex prompts into minimal semantic units. It then synthesizes these units iteratively, where the image generated in each step provides crucial visual context for the next, ensuring all textual concepts are faithfully constructed into the final scene. To further optimize this process, we introduce a reinforcement learning (RL) framework. Crucially, we identify that the limited exploration of standard diffusion samplers hinders effective RL. We theoretically prove that sample diversity is maximized by concentrating stochasticity in the early generation stages and, based on this insight, propose a novel Decaying Stochasticity Schedule to enhance exploration. Our RL algorithm is then guided by a hierarchical reward mechanism that jointly evaluates the image at the global, subject, and relationship levels. We also construct HiCoPrompt, a new text-to-image benchmark with hierarchical prompts for rigorous evaluation. Experiments show our approach significantly outperforms existing methods in both concept coverage and compositional accuracy.
zh

[CV-124] MambaEye: A Size-Agnostic Visual Encoder with Causal Sequential Processing

【速读】:该论文旨在解决视觉编码器在输入尺寸上缺乏泛化能力的问题,即当前模型难以像人类视觉系统一样对任意分辨率图像实现统一处理。其解决方案的关键在于提出MambaEye,一种基于纯Mamba2架构的因果顺序编码器,通过严格的单向处理保持状态空间模型(State Space Model, SSM)的固有因果性,从而可在输入序列任意位置生成预测;同时引入相对位移嵌入(relative move embedding),编码连续图像块间的空间偏移,为平移不变性提供强归纳偏置,使模型天然适配任意图像分辨率和扫描方式,并结合扩散启发的损失函数实现逐步监督训练,提升模型在多尺度下的鲁棒性与推理效率。

链接: https://arxiv.org/abs/2511.19963
作者: Changho Choi,Minho Kim,Jinkyu Kim
机构: Korea University (韩国国立大学); MIT (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code will be released in github

点击查看摘要

Abstract:Despite decades of progress, a truly input-size agnostic visual encoder-a fundamental characteristic of human vision-has remained elusive. We address this limitation by proposing \textbfMambaEye, a novel, causal sequential encoder that leverages the low complexity and causal-process based pure Mamba2 backbone. Unlike previous Mamba-based vision encoders that often employ bidirectional processing, our strictly unidirectional approach preserves the inherent causality of State Space Models, enabling the model to generate a prediction at any point in its input sequence. A core innovation is our use of relative move embedding, which encodes the spatial shift between consecutive patches, providing a strong inductive bias for translation invariance and making the model inherently adaptable to arbitrary image resolutions and scanning patterns. To achieve this, we introduce a novel diffusion-inspired loss function that provides dense, step-wise supervision, training the model to build confidence as it gathers more visual evidence. We demonstrate that MambaEye exhibits robust performance across a wide range of image resolutions, especially at higher resolutions such as 1536^2 on the ImageNet-1K classification task. This feat is achieved while maintaining linear time and memory complexity relative to the number of patches.
zh

[CV-125] GFT-GCN: Privacy-Preserving 3D Face Mesh Recognition with Spectral Diffusion WACV2026

【速读】:该论文旨在解决3D人脸认证中生物特征模板的隐私保护问题,特别是在确保高识别准确率的同时抵御重建攻击。其核心解决方案是提出GFT-GCN框架,关键在于将图傅里叶变换(Graph Fourier Transform, GFT)与图卷积网络(Graph Convolutional Networks, GCN)结合,从3D人脸网格中提取紧凑且具有判别性的频域特征,并引入频谱扩散机制生成不可逆、可更新且无法关联的模板,从而实现隐私保护与识别性能之间的平衡。

链接: https://arxiv.org/abs/2511.19958
作者: Hichem Felouat,Hanrui Wang,Isao Echizen
机构: The Graduate University for Advanced Studies, SOKENDAI, Kanagawa, Japan; National Institute of Informatics, NII, Tokyo, Japan; The University of Tokyo, Tokyo, Japan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 8 figures, WACV 2026

点击查看摘要

Abstract:3D face recognition offers a robust biometric solution by capturing facial geometry, providing resilience to variations in illumination, pose changes, and presentation attacks. Its strong spoof resistance makes it suitable for high-security applications, but protecting stored biometric templates remains critical. We present GFT-GCN, a privacy-preserving 3D face recognition framework that combines spectral graph learning with diffusion-based template protection. Our approach integrates the Graph Fourier Transform (GFT) and Graph Convolutional Networks (GCN) to extract compact, discriminative spectral features from 3D face meshes. To secure these features, we introduce a spectral diffusion mechanism that produces irreversible, renewable, and unlinkable templates. A lightweight client-server architecture ensures that raw biometric data never leaves the client device. Experiments on the BU-3DFE and FaceScape datasets demonstrate high recognition accuracy and strong resistance to reconstruction attacks. Results show that GFT-GCN effectively balances privacy and performance, offering a practical solution for secure 3D face authentication.
zh

[CV-126] Supervise Less See More: Training-free Nuclear Instance Segmentation with Prototype-Guided Prompting

【速读】:该论文旨在解决组织病理学中核实例分割(nuclear instance segmentation)任务对密集标注和昂贵微调的依赖问题,从而阻碍了模型在临床场景中的高效部署与泛化能力。其解决方案的关键在于提出一种完全无需训练和标注的提示框架SPROUT,通过引入组织学先验知识构建针对每张切片的参考原型(reference prototypes),利用部分最优传输(partial optimal transport)逐步引导特征对齐,并将对齐后的前景与背景特征转化为正负点提示(positive and negative point prompts),进而驱动Segment Anything Model (SAM) 实现精确的核边界分割,且无需任何参数更新。

链接: https://arxiv.org/abs/2511.19953
作者: Wen Zhang,Qin Ren,Wenjing Liu,Haibin Ling,Chenyu You
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint; 40 pages, 25 figures, 18 tables

点击查看摘要

Abstract:Accurate nuclear instance segmentation is a pivotal task in computational pathology, supporting data-driven clinical insights and facilitating downstream translational applications. While large vision foundation models have shown promise for zero-shot biomedical segmentation, most existing approaches still depend on dense supervision and computationally expensive fine-tuning. Consequently, training-free methods present a compelling research direction, yet remain largely unexplored. In this work, we introduce SPROUT, a fully training- and annotation-free prompting framework for nuclear instance segmentation. SPROUT leverages histology-informed priors to construct slide-specific reference prototypes that mitigate domain gaps. These prototypes progressively guide feature alignment through a partial optimal transport scheme. The resulting foreground and background features are transformed into positive and negative point prompts, enabling the Segment Anything Model (SAM) to produce precise nuclear delineations without any parameter updates. Extensive experiments across multiple histopathology benchmarks demonstrate that SPROUT achieves competitive performance without supervision or retraining, establishing a novel paradigm for scalable, training-free nuclear instance segmentation in pathology.
zh

[CV-127] Low-Resolution Editing is All You Need for High-Resolution Editing

【速读】:该论文旨在解决高分辨率图像编辑(high-resolution image editing)这一核心挑战,即如何在保持用户意图一致性的前提下实现可控、高质量的高分辨率图像内容生成。现有方法普遍受限于低分辨率(通常仅支持至1K分辨率),难以满足实际应用中对细节和视觉保真度的要求。其解决方案的关键在于提出一种测试时优化(test-time optimization)框架:首先对高分辨率源图像进行分块(patch-wise)优化,随后引入细粒度细节迁移模块(fine-grained detail transfer module)以增强局部纹理一致性,并设计了一种新颖的同步策略(synchronization strategy)来确保各图像块之间的全局一致性,从而显著提升高分辨率图像编辑的质量与可控性。

链接: https://arxiv.org/abs/2511.19945
作者: Junsung Lee,Hyunsoo Lee,Yong Jae Lee,Bohyung Han
机构: Seoul National University (首尔国立大学); University of Wisconsin-Madison (威斯康星大学麦迪逊分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 8 figures, 2 tables

点击查看摘要

Abstract:High-resolution content creation is rapidly emerging as a central challenge in both the vision and graphics communities. While images serve as the most fundamental modality for visual expression, content generation that aligns with the user intent requires effective, controllable high-resolution image manipulation mechanisms. However, existing approaches remain limited to low-resolution settings, typically supporting only up to 1K resolution. In this work, we introduce the task of high-resolution image editing and propose a test-time optimization framework to address it. Our method performs patch-wise optimization on high-resolution source images, followed by a fine-grained detail transfer module and a novel synchronization strategy to maintain consistency across patches. Extensive experiments show that our method produces high-quality edits, facilitating the way toward high-resolution content creation.
zh

[CV-128] Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos

【速读】:该论文旨在解决视频中目标对象的零样本跟踪(zero-shot object tracking)问题,即在不依赖特定任务训练数据的情况下实现精准的像素级目标定位与追踪。其解决方案的关键在于重新诠释图像扩散模型(image diffusion models)中的自注意力(self-attention)机制,将其视为语义标签传播核(semantic label propagation kernel),从而在图像内部建立鲁棒的像素级对应关系;进一步将该机制扩展至时序维度,构建时间传播核(temporal propagation kernel),实现基于分割的视频目标跟踪。此外,通过测试时优化策略(如DDIM反演、文本反演和自适应头权重调整)增强扩散特征的稳定性与一致性,并引入DRIFT框架结合SAM引导的掩码精炼(SAM-guided mask refinement),最终在标准视频目标分割基准上实现了最先进(state-of-the-art)的零样本性能。

链接: https://arxiv.org/abs/2511.19936
作者: Youngseo Kim,Dohyun Kim,Geohee Han,Paul Hongsuck Seo
机构: Korea University (韩国大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image diffusion models, though originally developed for image generation, implicitly capture rich semantic structures that enable various recognition and localization tasks beyond synthesis. In this work, we investigate their self-attention maps can be reinterpreted as semantic label propagation kernels, providing robust pixel-level correspondences between relevant image regions. Extending this mechanism across frames yields a temporal propagation kernel that enables zero-shot object tracking via segmentation in videos. We further demonstrate the effectiveness of test-time optimization strategies-DDIM inversion, textual inversion, and adaptive head weighting-in adapting diffusion features for robust and consistent label propagation. Building on these findings, we introduce DRIFT, a framework for object tracking in videos leveraging a pretrained image diffusion model with SAM-guided mask refinement, achieving state-of-the-art zero-shot performance on standard video object segmentation benchmarks.
zh

[CV-129] Context-Aware Token Pruning and Discriminative Selective Attention for Transformer Tracking

【速读】:该论文旨在解决单流Transformer跟踪器中因背景和干扰项(distractor)token过多地与目标模板token进行注意力交互,从而削弱模型判别能力的问题。其关键解决方案在于提出CPDATrack框架:首先在两个指定编码层之间引入可学习模块,估计每个搜索区域token与目标的关联概率,并据此剪枝低信息量的背景token以保留目标周围上下文;其次设计判别性选择性注意力机制,在早期层完全屏蔽搜索到模板的注意力,在后续层仅选取高概率目标token局部区域来关注模板,有效抑制背景和干扰项的影响,同时提升计算效率。

链接: https://arxiv.org/abs/2511.19928
作者: Janani Kugarajeevan,Thanikasalam Kokul,Amirthalingam Ramanan,Subha Fernando
机构: University of Jaffna (贾夫纳大学); University of Moratuwa (莫鲁图瓦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:One-stream Transformer-based trackers have demonstrated remarkable performance by concatenating template and search region tokens, thereby enabling joint attention across all tokens. However, enabling an excessive proportion of background search tokens to attend to the target template tokens weakens the tracker’s discriminative capability. Several token pruning methods have been proposed to mitigate background interference; however, they often remove tokens near the target, leading to the loss of essential contextual information and degraded tracking performance. Moreover, the presence of distractors within the search tokens further reduces the tracker’s ability to accurately identify the target. To address these limitations, we propose CPDATrack, a novel tracking framework designed to suppress interference from background and distractor tokens while enhancing computational efficiency. First, a learnable module is integrated between two designated encoder layers to estimate the probability of each search token being associated with the target. Based on these estimates, less-informative background tokens are pruned from the search region while preserving the contextual cues surrounding the target. To further suppress background interference, a discriminative selective attention mechanism is employed that fully blocks search-to-template attention in the early layers. In the subsequent encoder layers, high-probability target tokens are selectively extracted from a localized region to attend to the template tokens, thereby reducing the influence of background and distractor tokens. The proposed CPDATrack achieves state-of-the-art performance across multiple benchmarks, particularly on GOT-10k, where it attains an average overlap of 75.1 percent.
zh

[CV-130] Intelligent Image Search Algorithms Fusing Visual Large Models

【速读】:该论文旨在解决细粒度图像检索(fine-grained image retrieval)中的关键瓶颈问题:传统方法在对象组件识别与状态判断方面存在局限性,具体表现为手工特征(如SIFT)鲁棒性差、基于深度学习的检测器(如YOLO)无法实现状态特定检索或零样本搜索,而视觉大模型(VLMs)虽具备语义理解和零样本能力,却因空间定位不准且计算开销高,难以直接用于高效检索。解决方案的关键在于提出DetVLM框架,其核心创新是构建一个两阶段协同架构:第一阶段由YOLO检测器完成高效、高召回率的对象组件筛查;第二阶段利用VLM作为召回增强单元,对漏检组件进行二次验证,并结合任务提示(task-specific prompts)执行状态判别与零样本检索,从而实现状态搜索(State Search)和零样本搜索(Zero-shot Search)两大先进功能。

链接: https://arxiv.org/abs/2511.19920
作者: Kehan Wang,Tingqiong Cui,Yang Zhang,Yu Chen,Shifeng Wu,Zhenzhang Li
机构: Chongqing University (重庆大学); CRRC Chongqing Co., Ltd. (中车重庆公司); Guangdong Polytechnic Normal University (广东技术师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages,7 figures

点击查看摘要

Abstract:Fine-grained image retrieval, which aims to find images containing specific object components and assess their detailed states, is critical in fields like security and industrial inspection. However, conventional methods face significant limitations: manual features (e.g., SIFT) lack robustness; deep learning-based detectors (e.g., YOLO) can identify component presence but cannot perform state-specific retrieval or zero-shot search; Visual Large Models (VLMs) offer semantic and zero-shot capabilities but suffer from poor spatial grounding and high computational cost, making them inefficient for direct retrieval. To bridge these gaps, this paper proposes DetVLM, a novel intelligent image search framework that synergistically fuses object detection with VLMs. The framework pioneers a search-enhancement paradigm via a two-stage pipeline: a YOLO detector first conducts efficient, high-recall component-level screening to determine component presence; then, a VLM acts as a recall-enhancement unit, performing secondary verification for components missed by the detector. This architecture directly enables two advanced capabilities: 1) State Search: Guided by task-specific prompts, the VLM refines results by verifying component existence and executing sophisticated state judgments (e.g., “sun visor lowered”), allowing retrieval based on component state. 2) Zero-shot Search: The framework leverages the VLM’s inherent zero-shot capability to recognize and retrieve images containing unseen components or attributes (e.g., “driver wearing a mask”) without any task-specific training. Experiments on a vehicle component dataset show DetVLM achieves a state-of-the-art overall retrieval accuracy of 94.82%, significantly outperforming detection-only baselines. It also attains 94.95% accuracy in zero-shot search for driver mask-wearing and over 90% average accuracy in state search tasks.
zh

[CV-131] HybriDLA: Hybrid Generation for Document Layout Analysis AAAI2026

【速读】:该论文旨在解决传统文档版面分析(Document Layout Analysis, DLA)在处理现代复杂文档时的局限性问题,即其依赖经验先验或固定数量的可学习查询,在面对元素数量多样且布局复杂的文档时性能下降。解决方案的关键在于提出HybriDLA框架,该框架将扩散模型(diffusion)与自回归解码(autoregressive decoding)统一在一个层内:扩散组件通过迭代优化边界框假设提升定位精度,而自回归组件则引入语义和上下文信息以增强区域预测能力;同时设计多尺度特征融合编码器,有效捕获细粒度与高层视觉线索,从而实现83.5%的平均精度(mAP),显著优于现有方法。

链接: https://arxiv.org/abs/2511.19919
作者: Yufan Chen,Omar Moured,Ruiping Liu,Junwei Zheng,Kunyu Peng,Jiaming Zhang,Rainer Stiefelhagen
机构: 1. University of Stuttgart (斯图加特大学); 2. Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026 (Oral). Project page at this https URL

点击查看摘要

Abstract:Conventional document layout analysis (DLA) traditionally depends on empirical priors or a fixed set of learnable queries executed in a single forward pass. While sufficient for early-generation documents with a small, predetermined number of regions, this paradigm struggles with contemporary documents, which exhibit diverse element counts and increasingly complex layouts. To address challenges posed by modern documents, we present HybriDLA, a novel generative framework that unifies diffusion and autoregressive decoding within a single layer. The diffusion component iteratively refines bounding-box hypotheses, whereas the autoregressive component injects semantic and contextual awareness, enabling precise region prediction even in highly varied layouts. To further enhance detection quality, we design a multi-scale feature-fusion encoder that captures both fine-grained and high-level visual cues. This architecture elevates performance to 83.5% mean Average Precision (mAP). Extensive experiments on the DocLayNet and M ^6 Doc benchmarks demonstrate that HybriDLA sets a state-of-the-art performance, outperforming previous approaches. All data and models will be made publicly available at this https URL.
zh

[CV-132] Scale Where It Matters: Training-Free Localized Scaling for Diffusion Models

【速读】:该论文旨在解决当前测试时扩展(Test-Time Scaling, TTS)方法在图像生成中效率低下的问题,即现有TTS方法对全图进行统一计算,忽略了图像质量的空间异质性,导致对已高质量区域冗余计算,而局部缺陷修正不足。其解决方案的关键在于提出一种全新的“局部化TTS”(Localized TTS)范式——LoTTS,该框架无需训练即可实现:首先通过质量感知提示(如高质量与低质量对比)提取交叉注意力和自注意力信号差异,精确定位缺陷区域并生成连贯掩码;其次仅对缺陷区域进行扰动和局部去噪,确保修复效果受限于局部且不破坏全局一致性。此方法显著缩小搜索空间,在保持甚至提升图像局部质量和整体保真度的同时,GPU成本降低2–4倍。

链接: https://arxiv.org/abs/2511.19917
作者: Qin Ren,Yufei Wang,Lanqing Guo,Wen Zhang,Zhiwen Fan,Chenyu You
机构: Stony Brook University (纽约州立大学石溪分校); Nanyang Technological University (南洋理工大学); University of Texas at Austin (德克萨斯大学奥斯汀分校); Johns Hopkins University (约翰霍普金斯大学); Texas A&M University (德州农工大学); SparcAI Research
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have become the dominant paradigm in text-to-image generation, and test-time scaling (TTS) further improves quality by allocating more computation during inference. However, existing TTS methods operate at the full-image level, overlooking the fact that image quality is often spatially heterogeneous. This leads to unnecessary computation on already satisfactory regions and insufficient correction of localized defects. In this paper, we explore a new direction - Localized TTS - that adaptively resamples defective regions while preserving high-quality regions, thereby substantially reducing the search space. This paradigm poses two central challenges: accurately localizing defects and maintaining global consistency. We propose LoTTS, the first fully training-free framework for localized TTS. For defect localization, LoTTS contrasts cross- and self-attention signals under quality-aware prompts (e.g., high-quality vs. low-quality) to identify defective regions, and then refines them into coherent masks. For consistency, LoTTS perturbs only defective regions and denoises them locally, ensuring that corrections remain confined while the rest of the image remains undisturbed. Extensive experiments on SD2.1, SDXL, and FLUX demonstrate that LoTTS achieves state-of-the-art performance: it consistently improves both local quality and global fidelity, while reducing GPU cost by 2-4x compared to Best-of-N sampling. These findings establish localized TTS as a promising new direction for scaling diffusion models at inference time.
zh

[CV-133] Coupled Physics-Gated Adaptation: Spatially Decoding Volumetric Photochemical Conversion in Complex 3D-Printed Objects

【速读】:该论文旨在解决复杂三维打印物体中光化学转化过程的预测问题,这是一项全新的计算机视觉任务:从3D视觉数据中预测密集且非视觉的体积物理属性。传统视觉模型因缺乏对光学物理(衍射、吸收)与材料物理(扩散、对流)之间耦合非线性相互作用的归纳偏置,难以胜任此任务。解决方案的关键在于提出了一种名为“耦合物理门控适配”(Coupled Physics-Gated Adaptation, C-PGA)的新型多模态融合架构,其通过将稀疏几何和工艺参数(如表面传输、打印层高)作为Query,利用特征逐元素线性调制(FiLM)动态门控并调整密集视觉特征,从而在双路3D视觉流(分别处理原始投影堆栈及其经扩散-衍射校正后的版本)上实现基于物理情境的空间调制,使模型能够根据物理条件重新校准其视觉感知,从而实现虚拟化学表征的突破,无需传统后处理测量即可精确控制化学转化状态。

链接: https://arxiv.org/abs/2511.19913
作者: Maryam Eftekharifar,Churun Zhang,Jialiang Wei,Xudong Cao,Hossein Heidari
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present a framework that pioneers the prediction of photochemical conversion in complex three-dimensionally printed objects, introducing a challenging new computer vision task: predicting dense, non-visual volumetric physical properties from 3D visual data. This approach leverages the largest-ever optically printed 3D specimen dataset, comprising a large family of parametrically designed complex minimal surface structures that have undergone terminal chemical characterisation. Conventional vision models are ill-equipped for this task, as they lack an inductive bias for the coupled, non-linear interactions of optical physics (diffraction, absorption) and material physics (diffusion, convection) that govern the final chemical state. To address this, we propose Coupled Physics-Gated Adaptation (C-PGA), a novel multimodal fusion architecture. Unlike standard concatenation, C-PGA explicitly models physical coupling by using sparse geometrical and process parameters (e.g., surface transport, print layer height) as a Query to dynamically gate and adapt the dense visual features via feature-wise linear modulation (FiLM). This mechanism spatially modulates dual 3D visual streams-extracted by parallel 3D-CNNs processing raw projection stacks and their diffusion-diffraction corrected counterparts allowing the model to recalibrate its visual perception based on the physical context. This approach offers a breakthrough in virtual chemical characterisation, eliminating the need for traditional post-print measurements and enabling precise control over the chemical conversion state.
zh

[CV-134] Reasoning -VLA: A Fast and General Vision-Language-Action Reasoning Model for Autonomous Driving

【速读】:该论文旨在解决当前视觉-语言-动作(Vision-Language-Action, VLA)模型在自动驾驶场景中推理效率低、泛化能力弱的问题,尤其针对新型车辆配置和未见驾驶场景下的性能下降。解决方案的关键在于提出一种名为Reasoning-VLA的通用且高效的动作生成框架:通过可学习的动作查询(learnable action queries)初始化为训练集中真实轨迹的高斯采样,并与增强推理能力的视觉-语言特征交互,实现连续动作轨迹的并行生成;同时,将八个公开自动驾驶数据集统一为基于思维链(Chain-of-Thought)推理的数据格式,结合监督学习与强化学习进行微调,从而显著提升模型的泛化性能与推理速度。

链接: https://arxiv.org/abs/2511.19912
作者: Dapeng Zhang,Zhenlong Yuan,Zhangquan Chen,Chih-Ting Liao,Yinda Chen,Fei Shen,Qingguo Zhou,Tat-Seng Chua
机构: Lanzhou University (兰州大学); National University of Singapore (新加坡国立大学); University of Science and Technology of China (中国科学技术大学); Tsinghua University (清华大学); University of New South Wales (新南威尔士大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have recently shown strong decision-making capabilities in autonomous driving. However, existing VLAs often struggle with achieving efficient inference and generalizing to novel autonomous vehicle configurations and driving scenarios. In this paper, we propose Reasoning-VLA, a general and fast action-generation VLA framework. The proposed model employs a set of learnable action queries, initialized via Gaussian sampling from ground-truth trajectories within the training corpus. These learnable queries interact with reasoning-enhanced vision-language features to generate continuous action trajectories in parallel. To promote robust generalization, we consolidate eight publicly available autonomous driving datasets into a standardized, Chain-of-Thought reasoning-based, and easy-to-use data format for model training. Leveraging both supervised learning and reinforcement learning fine-tuning, extensive empirical evaluations across multiple benchmarks demonstrate that Reasoning-VLA achieves state-of-the-art performance, superior generalization capability, and the excellent inference speed reported to date.
zh

[CV-135] Motion Marionette: Rethinking Rigid Motion Transfer via Prior Guidance

【速读】:该论文旨在解决单目源视频到单视图目标图像的刚性运动迁移问题,现有方法通常依赖几何、生成或仿真先验来引导迁移过程,但这些外部先验会引入额外约束,导致泛化能力与时间一致性之间的权衡。解决方案的关键在于提出一种内部先验(internal prior),该先验仅捕捉源视频与任意迁移目标视频之间共享的空间-时间变换关系,而不依赖于物体几何或语义信息。具体而言,作者首先将源视频和目标图像映射到统一的3D表示空间,从中提取运动轨迹构建空间-时间(SpaT)先验,进而将其与目标对象结合生成可控制的速度场,并通过基于位置的动力学(Position-Based Dynamics)进行优化以减少伪影并提升视觉连贯性,从而实现高效且可控的视频生成。

链接: https://arxiv.org/abs/2511.19909
作者: Haoxuan Wang,Jiachen Tao,Junyi Wu,Gaowen Liu,Ramana Rao Kompella,Yan Yan
机构: University of Illinois Chicago (伊利诺伊大学芝加哥分校); Cisco Research (思科研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present Motion Marionette, a zero-shot framework for rigid motion transfer from monocular source videos to single-view target images. Previous works typically employ geometric, generative, or simulation priors to guide the transfer process, but these external priors introduce auxiliary constraints that lead to trade-offs between generalizability and temporal consistency. To address these limitations, we propose guiding the motion transfer process through an internal prior that exclusively captures the spatial-temporal transformations and is shared between the source video and any transferred target video. Specifically, we first lift both the source video and the target image into a unified 3D representation space. Motion trajectories are then extracted from the source video to construct a spatial-temporal (SpaT) prior that is independent of object geometry and semantics, encoding relative spatial variations over time. This prior is further integrated with the target object to synthesize a controllable velocity field, which is subsequently refined using Position-Based Dynamics to mitigate artifacts and enhance visual coherence. The resulting velocity field can be flexibly employed for efficient video production. Empirical results demonstrate that Motion Marionette generalizes across diverse objects, produces temporally consistent videos that align well with the source motion, and supports controllable video generation.
zh

[CV-136] MHB: Multimodal Handshape-aware Boundary Detection for Continuous Sign Language Recognition

【速读】:该论文旨在解决连续手语识别中签名边界检测不准确的问题,尤其是在美国手语(ASL)视频中如何有效分割出单个手势并进行准确识别。其解决方案的关键在于提出一种多模态融合方法:首先利用3D骨骼特征(3D skeletal features)捕捉手语动作的空间与时间动态特性,这些特征在签名边界处具有聚类倾向;其次引入预训练的手形分类器(handshape classifier),基于自建的标准化数据集对87种语言学定义的典型手形进行分类,以增强起始和结束帧的检测鲁棒性;最后通过多模态融合模块整合视频分割框架与手形信息,并使用包含孤立词和连续手语标注数据的大规模数据库训练最终的签名识别模型,从而显著提升整体识别性能。

链接: https://arxiv.org/abs/2511.19907
作者: Mingyu Zhao,Zhanfu Yang,Yang Zhou,Zhaoyang Xia,Can Jin,Xiaoxiao He,Carol Neidle,Dimitris N. Metaxas
机构: Rutgers University (罗格斯大学); Boston University (波士顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents a multimodal approach for continuous sign recognition that first uses machine learning to detect the start and end frames of signs in videos of American Sign Language (ASL) sentences, and then recognizes the segmented signs. For improved robustness, we use 3D skeletal features extracted from sign language videos to capture the convergence of sign properties and their dynamics, which tend to cluster at sign boundaries. Another focus of this work is the incorporation of information from 3D handshape for boundary detection. To detect handshapes normally expected at the beginning and end of signs, we pretrain a handshape classifier for 87 linguistically defined canonical handshape categories using a dataset that we created by integrating and normalizing several existing datasets. A multimodal fusion module is then used to unify the pretrained sign video segmentation framework and the handshape classification models. Finally, the estimated boundaries are used for sign recognition, where the recognition model is trained on a large database containing both citation-form isolated signs and signs pre-segmented (based on manual annotations) from continuous signing, as such signs often differ in certain respects. We evaluate our method on the ASLLRP corpus and demonstrate significant improvements over previous work.
zh

[CV-137] Agent 0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在多模态推理任务中因依赖人工标注监督而导致的学习受限问题,以及现有纯文本自评估方法难以验证复杂视觉推理步骤、易产生评估幻觉的缺陷。其解决方案的关键在于提出Agent0-VL——一个具备工具集成推理能力的自进化视觉语言代理,通过将工具使用嵌入到推理、自评估与自我修复三个环节中,实现基于证据的反思与优化;具体而言,Agent0-VL在单一大型视觉语言模型(Large Vision-Language Model, LVLM)内统一了“求解器(Solver)”与“验证器(Verifier)”两个协同角色:前者执行多轮工具增强的推理,后者基于工具驱动的批判生成结构化反馈和细粒度自奖励,二者通过“自进化推理循环”交互,借助工具驱动的验证与强化学习联合对齐推理与评估分布,从而在无外部奖励信号或人工标注的情况下实现稳定持续的自我改进。

链接: https://arxiv.org/abs/2511.19900
作者: Jiaqi Liu,Kaiwen Xiong,Peng Xia,Yiyang Zhou,Haonian Ji,Lu Feng,Siwei Han,Mingyu Ding,Huaxiu Yao
机构: UNC-Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement with tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, where tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. Experiments on geometric problem solving and visual scientific analysis show that Agent0-VL achieves an 12.5% improvement over the base model. Our code is available at \hrefthis https URLthis https URL.
zh

[CV-138] VeriSciQA: An Auto-Verified Dataset for Scientific Visual Question Answering

【速读】:该论文旨在解决开源大型视觉语言模型(Large Vision-Language Models, LVLMs)在科学视觉问答(Scientific Visual Question Answering, SVQA)任务中表现不佳的问题,其核心瓶颈在于缺乏大规模、高质量的公开SVQA数据集。现有方法虽尝试利用LVLMs自动生成数据,但存在系统性错误,主要源于模型自身局限性和图文信息不对称。为此,作者提出了一种以验证为核心的“生成-验证”(Generate-then-Verify)框架:首先基于图注文本上下文生成QA对,再通过跨模态一致性检查与辅助过滤机制剔除错误样本。该方案的关键创新在于引入验证机制以提升数据质量,从而构建出包含20,351个QA对的VeriSciQA数据集,显著提升了开源模型在SVQA上的性能表现。

链接: https://arxiv.org/abs/2511.19899
作者: Yuyi Li,Daoyuan Chen,Zhen Wang,Yutong Lu,Yaliang Li
机构: Sun Yat-sen University (中山大学); Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) show promise for scientific applications, yet open-source models still struggle with Scientific Visual Question Answering (SVQA), namely answering questions about figures from scientific papers. A key bottleneck lies in the lack of public, large-scale, high-quality SVQA datasets. Although recent work uses LVLMs to synthesize data at scale, we identify systematic errors in their resulting QA pairs, stemming from LVLMs’ inherent limitations and information asymmetry between figures and text. To address these challenges, we propose a verification-centric Generate-then-Verify framework that first generates QA pairs with figure-associated textual context, then applies cross-modal consistency checks against figures along with auxiliary filters to eliminate erroneous pairs. We instantiate this framework to curate VeriSciQA, a dataset of 20,351 QA pairs spanning 20 scientific domains and 12 figure types. VeriSciQA poses a challenging benchmark for open-source models, with a substantial accuracy gap between the leading open-source models (64%) and a proprietary model (82%). Moreover, models fine-tuned on VeriSciQA achieve consistent improvements on SVQA benchmarks, with performance gains that scale with data size and surpass models trained on existing datasets. Human evaluation further validates the superior correctness of VeriSciQA. Together, these evidences demonstrate that continued data expansion by our scalable framework can further advance SVQA capability in the open-source community.
zh

[CV-139] LiMT: A Multi-task Liver Image Benchmark Dataset ALT

【速读】:该论文旨在解决当前计算机辅助诊断(CAD)技术中数据集应用范围受限的问题,尤其是现有数据集通常仅支持单一任务,限制了多任务协同分析与模型泛化能力的发展。其解决方案的关键在于构建一个面向肝脏疾病的多任务数据集(LiMT),该数据集可同时支持肝脏及肿瘤分割、多标签病变分类和病变检测三项任务,并基于动脉期增强CT影像进行标注与验证。该数据集涵盖150例不同病例,包含四种肝病类型及正常样本,且所有图像均由经验丰富的临床医生精细标注,从而为探索任务间关联性提供统一、高质量的数据基础,避免因任务特异性数据异质性带来的训练干扰。

链接: https://arxiv.org/abs/2511.19889
作者: Zhe Liu,Kai Han,Siqi Ma,Yan Zhu,Jun Chen,Chongwen Lyu,Xinyi Qiu,Chengxuan Qian,Yuqing Song,Yi Liu,Liyuan Tian,Yang Ji,Yuefeng Li
机构: Jiangsu University (江苏大学); Affiliated Hospital of Jiangsu University (江苏大学附属医院); Medical College of Jiangsu University (江苏大学医学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE Journal of Biomedical and Health Informatics

点击查看摘要

Abstract:Computer-aided diagnosis (CAD) technology can assist clinicians in evaluating liver lesions and intervening with treatment in time. Although CAD technology has advanced in recent years, the application scope of existing datasets remains relatively limited, typically supporting only single tasks, which has somewhat constrained the development of CAD technology. To address the above limitation, in this paper, we construct a multi-task liver dataset (LiMT) used for liver and tumor segmentation, multi-label lesion classification, and lesion detection based on arterial phase-enhanced computed tomography (CT), potentially providing an exploratory solution that is able to explore the correlation between tasks and does not need to worry about the heterogeneity between task-specific datasets during training. The dataset includes CT volumes from 150 different cases, comprising four types of liver diseases as well as normal cases. Each volume has been carefully annotated and calibrated by experienced clinicians. This public multi-task dataset may become a valuable resource for the medical imaging research community in the future. In addition, this paper not only provides relevant baseline experimental results but also reviews existing datasets and methods related to liver-related tasks. Our dataset is available at this https URL.
zh

[CV-140] Distilling Cross-Modal Knowledge via Feature Disentanglement AAAI2026

【速读】:该论文旨在解决跨模态知识蒸馏(cross-modal knowledge distillation)中因模态间表征不一致而导致的知识迁移效率低下的问题。其核心解决方案是提出频域解耦的跨模态知识蒸馏方法,关键在于利用频域特征分离不同模态间的知识:低频特征具有高跨模态一致性,因此施加强对齐损失以增强知识传递;高频特征跨模态相似性极低,故采用宽松对齐策略避免误导;同时引入尺度一致性损失缓解模态间分布偏移,并通过共享分类器统一特征空间,从而显著提升跨模态知识蒸馏的效果。

链接: https://arxiv.org/abs/2511.19887
作者: Junhong Liu,Yuan Zhang,Tao Huang,Wenchao Xu,Renyu Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Knowledge distillation (KD) has proven highly effective for compressing large models and enhancing the performance of smaller ones. However, its effectiveness diminishes in cross-modal scenarios, such as vision-to-language distillation, where inconsistencies in representation across modalities lead to difficult knowledge transfer. To address this challenge, we propose frequency-decoupled cross-modal knowledge distillation, a method designed to decouple and balance knowledge transfer across modalities by leveraging frequency-domain features. We observed that low-frequency features exhibit high consistency across different modalities, whereas high-frequency features demonstrate extremely low cross-modal similarity. Accordingly, we apply distinct losses to these features: enforcing strong alignment in the low-frequency domain and introducing relaxed alignment for high-frequency features. We also propose a scale consistency loss to address distributional shifts between modalities, and employ a shared classifier to unify feature spaces. Extensive experiments across multiple benchmark datasets show our method substantially outperforms traditional KD and state-of-the-art cross-modal KD approaches. Code is available at this https URL.
zh

[CV-141] Frequency Bias Matters: Diving into Robust and Generalized Deep Image Forgery Detection

【速读】:该论文旨在解决生成式 AI (Generative AI) 伪造图像检测中普遍存在的泛化能力(generalizability)和鲁棒性(robustness)问题,即检测器在面对未知生成对抗网络(GAN)模型和含噪样本时的可靠性不足。研究表明,深度神经网络(DNN)伪造检测器存在频率偏差(frequency bias),这是导致上述两类问题的根本原因之一。解决方案的关键在于提出一种两步频率对齐方法(two-step frequency alignment method),通过消除真实图像与伪造图像之间的频域差异,实现双面效益:在反取证场景下可作为强黑盒攻击提升伪造图像的逃避能力,在取证场景下则作为通用防御机制增强检测器的可靠性和泛化性能。

链接: https://arxiv.org/abs/2511.19886
作者: Chi Liu,Tianqing Zhu,Wanlei Zhou,Wei Zhao
机构: Faculty of Data Science, City University of Macau (澳门城市大学数据科学学院); Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)
类目: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in IEEE Transactions on Dependable and Secure Computing

点击查看摘要

Abstract:As deep image forgery powered by AI generative models, such as GANs, continues to challenge today’s digital world, detecting AI-generated forgeries has become a vital security topic. Generalizability and robustness are two critical concerns of a forgery detector, determining its reliability when facing unknown GANs and noisy samples in an open world. Although many studies focus on improving these two properties, the root causes of these problems have not been fully explored, and it is unclear if there is a connection between them. Moreover, despite recent achievements in addressing these issues from image forensic or anti-forensic aspects, a universal method that can contribute to both sides simultaneously remains practically significant yet unavailable. In this paper, we provide a fundamental explanation of these problems from a frequency perspective. Our analysis reveals that the frequency bias of a DNN forgery detector is a possible cause of generalization and robustness issues. Based on this finding, we propose a two-step frequency alignment method to remove the frequency discrepancy between real and fake images, offering double-sided benefits: it can serve as a strong black-box attack against forgery detectors in the anti-forensic context or, conversely, as a universal defense to improve detector reliability in the forensic context. We also develop corresponding attack and defense implementations and demonstrate their effectiveness, as well as the effect of the frequency alignment method, in various experimental settings involving twelve detectors, eight forgery models, and five metrics.
zh

[CV-142] ChessMamba: Structure-Aware Interleaving of State Spaces for Change Detection in Remote Sensing Images

【速读】:该论文旨在解决多时相遥感影像中细粒度变化检测(Change Detection, CD)面临的挑战,尤其是由于异质性和时空错位导致的局部结构一致性破坏问题。现有基于视觉Transformer或状态空间模型的方法在时间序列化过程中常削弱局部结构信息,从而掩盖了错位下的判别性特征,影响可靠的变化定位。其解决方案的关键在于提出ChessMamba框架,通过两个核心创新实现结构感知的鲁棒变化检测:一是采用棋盘交错(Chessboard interleaving)与蛇形扫描顺序(snake scanning order)对多时相特征进行统一序列化,缩短交互路径并支持直接比较以提升定位精度;二是利用多膨胀卷积(multi-dilated convolutions)实现结构感知融合,选择性捕获单时相图像中中心与角落邻域上下文信息,增强对异质特征的有效融合能力。

链接: https://arxiv.org/abs/2511.19882
作者: Lei Ding,Tong Liu,Xuanguang Liu,Xiangyun Liu,Haitao Guo,Jun Lu
机构: Information Engineering University (信息工程大学); Aerospace Information Research Institute, Chinese Academy of Sciences (中国科学院空天信息研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Change detection (CD) in multitemporal remote sensing imagery presents significant challenges for fine-grained recognition, owing to heterogeneity and spatiotemporal misalignment. However, existing methodologies based on vision transformers or state-space models typically disrupt local structural consistency during temporal serialization, obscuring discriminative cues under misalignment and hindering reliable change localization. To address this, we introduce ChessMamba, a structure-aware framework leveraging interleaved state-space modeling for robust CD with multi-temporal inputs. ChessMamba integrates a SpatialMamba encoder with a lightweight cross-source interaction module, featuring two key innovations: (i) Chessboard interleaving with snake scanning order, which serializes multi-temporal features into a unified sequence within a single forward pass, thereby shortening interaction paths and enabling direct comparison for accurate change localization; and (ii) Structure-aware fusion via multi-dilated convolutions, selectively capturing center-and-corner neighborhood contexts within each mono-temporal. Comprehensive evaluations on three CD tasks, including binary CD, semantic CD and multimodal building damage assessment, demonstrate that ChessMamba effectively fuses heterogeneous features and achieves substantial accuracy improvements over state-of-the-art this http URL relevant code will be available at: this http URL.
zh

[CV-143] It Hears It Sees too: Multi-Modal LLM for Depression Detection By Integrating Visual Understanding into Audio Language Models

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的抑郁检测系统难以有效融合音频与视觉等非语言模态信息的问题。传统LLMs主要依赖文本输入,无法充分捕捉语音语调、面部表情等关键非言语线索,而这些线索在心理评估中具有重要价值。解决方案的关键在于提出一种新颖的多模态大语言模型框架,通过将音频语言模型增强为具备视觉理解能力,并在时间戳级别对音视频特征进行细粒度对齐,从而更好地建模跨模态的时间动态性,同时降低对大规模标注数据和计算资源的需求。实验表明,该方法在DAIC-WoZ数据集上优于单一模态及现有多模态方法,且具备扩展至生理信号融合的能力,为临床场景中的抑郁症辅助诊断提供了可拓展的技术路径。

链接: https://arxiv.org/abs/2511.19877
作者: Xiangyu Zhao,Yaling Shen,Yiwen Jiang,Zimu Wang,Jiahe Liu,Maxmartwell H Cheng,Guilherme C Oliveira,Robert Desimone,Dominic Dwyer,Zongyuan Ge
机构: Monash University (蒙纳士大学); Massachusetts Institute of Technology (麻省理工学院); The University of Melbourne (墨尔本大学)
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Depression is one of the most prevalent mental health disorders globally. In recent years, multi-modal data, such as speech, video, and transcripts, has been increasingly used to develop AI-assisted depression assessment systems. Large language models have further advanced this field due to their strong language understanding and generalization capabilities. However, conventional LLMs remain text-centric and cannot process the rich non-verbal cues found in audio and visual modalities, which are critical components in mental health evaluation. While multi-modal LLMs offer a promising direction, few are tailored for psychological applications. In this study, we propose a novel multi-modal LLM framework for depression detection. Our approach augments an audio language model with visual understanding and aligns audio-visual features at the timestamp level. This fine-grained alignment improves modeling of temporal dynamics across modalities while reducing the need for extensive training data and computational resources. Experiments on the DAIC-WoZ dataset demonstrate that our model outperforms both single-modality approaches and previous multi-modal methods. Moreover, the proposed framework can be extended to incorporate additional physiological signals, paving the way for broader clinical applications beyond mental health.
zh

[CV-144] GigaWorld-0: World Models as Data Engine to Empower Embodied AI

【速读】:该论文旨在解决当前具身人工智能(Embodied AI)在训练过程中对真实世界数据依赖性强、数据获取成本高且难以实现大规模多样化交互数据生成的问题。解决方案的关键在于提出GigaWorld-0框架,其核心创新是将视频生成与3D物理建模相结合:一方面通过GigaWorld-0-Video模块实现高保真、时序一致且可控的视觉-动作序列合成;另一方面借助GigaWorld-0-3D模块融合3D生成建模、3D高斯泼溅重建、可微分物理系统辨识和可执行运动规划,确保几何一致性与物理真实性。两者协同优化实现了视觉逼真、空间连贯、物理合理且指令对齐的大规模具身交互数据合成,从而显著提升VLA模型(如GigaBrain-0)在真实机器人上的泛化能力和任务成功率,且无需任何真实交互数据参与训练。

链接: https://arxiv.org/abs/2511.19861
作者: GigaWorld Team,Angen Ye,Boyuan Wang,Chaojun Ni,Guan Huang,Guosheng Zhao,Haoyun Li,Jiagang Zhu,Kerui Li,Mengyuan Xu,Qiuping Deng,Siting Wang,Wenkang Qin,Xinze Chen,Xiaofeng Wang,Yankai Wang,Yu Cao,Yifan Chang,Yuan Xu,Yun Ye,Yang Wang,Yukun Zhou,Zhengyuan Zhang,Zhehao Dong,Zheng Zhu
机构: GigaAI
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Project Page: this https URL

点击查看摘要

Abstract:World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI. In this work, we present GigaWorld-0, a unified world model framework designed explicitly as a data engine for Vision-Language-Action (VLA) learning. GigaWorld-0 integrates two synergistic components: GigaWorld-0-Video, which leverages large-scale video generation to produce diverse, texture-rich, and temporally coherent embodied sequences under fine-grained control of appearance, camera viewpoint, and action semantics; and GigaWorld-0-3D, which combines 3D generative modeling, 3D Gaussian Splatting reconstruction, physically differentiable system identification, and executable motion planning to ensure geometric consistency and physical realism. Their joint optimization enables the scalable synthesis of embodied interaction data that is visually compelling, spatially coherent, physically plausible, and instruction-aligned. Training at scale is made feasible through our efficient GigaTrain framework, which exploits FP8-precision and sparse attention to drastically reduce memory and compute requirements. We conduct comprehensive evaluations showing that GigaWorld-0 generates high-quality, diverse, and controllable data across multiple dimensions. Critically, VLA model (e.g., GigaBrain-0) trained on GigaWorld-0-generated data achieve strong real-world performance, significantly improving generalization and task success on physical robots without any real-world interaction during training.
zh

[CV-145] mporal-Visual Semantic Alignment: A Unified Architecture for Transferring Spatial Priors from Vision Models to Zero-Shot Temporal Tasks

【速读】:该论文旨在解决现有方法在时间序列与视觉模态之间缺乏语义层面对齐的问题,尤其是如何利用非视觉的连续时序信号(如时间序列)作为条件来生成高保真图像。传统方法常将时间序列转换为“伪图像”进行预测,但无法实现跨模态的语义一致性。其解决方案的关键在于提出TimeArtist框架,采用“预热-对齐”(warmup-align)范式:首先通过大规模数据自监督训练双自动编码器和共享量化器以学习跨模态共享表征;随后冻结编码器与量化器,并引入投影层在表示层面实现时序与视觉样本的对齐。这一设计使得模型能够直接从时间序列生成高质量、多样化的图像,同时捕捉时间波动模式并将其作为风格迁移的依据,从而建立时序动态与视觉语义之间的新桥梁。

链接: https://arxiv.org/abs/2511.19856
作者: Xiangkai Ma,Han Zhang,Wenzhong Li,Sanglu Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Multimodal Models (LMMs) have achieved remarkable progress in aligning and generating content across text and image modalities. However, the potential of using non-visual, continuous sequential, as a conditioning signal for high-fidelity image generation remains largely unexplored. Furthermore, existing methods that convert series into “pseudo-images” for temporal forecasting fail to establish semantic-level alignment. In this paper, we propose TimeArtist, a temporal-visual conversion framework that pioneers semantic-level alignment between time series fluctuations and visual concepts. It pioneers a “warmup-align” paradigm: first, a dual-autoencoder and shared quantizer are self-supervised trained on large-scale datasets to learn modality-shared representations. Then, the encoders and quantizer are frozen, and a projection is introduced to align temporal and visual samples at the representation level. TimeArtist establishes a versatile cross-modal framework, enabling high-quality, diverse image generation directly from time series, while capturing temporal fluctuation patterns to render images as styles transfer. Extensive experiments show that TimeArtist achieves satisfactory performance in image generation metrics, while also attaining superior results in zero-shot temporal tasks. Our work establishes a new paradigm for cross-modal generation, bridging the gap between temporal dynamics and visual semantics.
zh

[CV-146] STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction

【速读】:该论文旨在解决从单目视频中重建高保真且可驱动的3D头部虚拟形象(3D head avatar)这一挑战性问题,现有基于3D高斯溅射(3D Gaussian Splatting)的方法通常将高斯分布绑定到网格三角面片上,并仅通过线性混合皮肤(Linear Blend Skinning, LBS)建模形变,导致运动僵硬、表达能力有限,且缺乏对频繁遮挡区域(如口腔内部、眼睑)的有效处理策略。解决方案的关键在于提出STAvatar框架,其核心创新包括:(1) UV自适应软绑定(UV-Adaptive Soft Binding)机制,利用图像和几何先验在UV空间内学习每个高斯的特征偏移,支持动态重采样,兼容自适应密度控制(Adaptive Density Control, ADC),提升对形状与纹理变化的适应性;(2) 时间域ADC策略,通过结构相似帧聚类优化密度增加准则的计算效率,并引入融合感知误差作为新增密判据,联合捕捉几何与纹理差异,从而在细节需求更高的区域实现更精细的重建。

链接: https://arxiv.org/abs/2511.19854
作者: Jiankuo Zhao,Xiangyu Zhu,Zidu Wang,Zhen Lei
机构: MAIS, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences (中国科学院香港科学与创新研究院人工智能与机器人中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 14 figures

点击查看摘要

Abstract:Reconstructing high-fidelity and animatable 3D head avatars from monocular videos remains a challenging yet essential task. Existing methods based on 3D Gaussian Splatting typically bind Gaussians to mesh triangles and model deformations solely via Linear Blend Skinning, which results in rigid motion and limited expressiveness. Moreover, they lack specialized strategies to handle frequently occluded regions (e.g., mouth interiors, eyelids). To address these limitations, we propose STAvatar, which consists of two key components: (1) a UV-Adaptive Soft Binding framework that leverages both image-based and geometric priors to learn per-Gaussian feature offsets within the UV space. This UV representation supports dynamic resampling, ensuring full compatibility with Adaptive Density Control (ADC) and enhanced adaptability to shape and textural variations. (2) a Temporal ADC strategy, which first clusters structurally similar frames to facilitate more targeted computation of the densification criterion. It further introduces a novel fused perceptual error as clone criterion to jointly capture geometric and textural discrepancies, encouraging densification in regions requiring finer details. Extensive experiments on four benchmark datasets demonstrate that STAvatar achieves state-of-the-art reconstruction performance, especially in capturing fine-grained details and reconstructing frequently occluded regions. The code will be publicly available.
zh

[CV-147] DOGE: Differentiable Bezier Graph Optimization for Road Network Extraction

【速读】:该论文旨在解决从航空影像中自动提取道路网络时,现有方法依赖折线(polyline)表示难以建模弯曲几何结构的问题。其核心创新在于提出了一种基于贝塞尔曲线(Bézier Graph)的可微分参数化表示,突破了传统方法对难获取的矢量地面真值(vector ground-truth)的依赖。解决方案的关键在于将任务重构为在贝塞尔图上的全局优化问题,并通过DOGE框架实现:该框架交替运行两个互补模块——DiffAlign通过可微渲染持续优化几何形状,TopoAdapt利用离散操作优化拓扑结构,从而直接从分割掩码学习高保真道路矢量地图,显著提升了精度并树立了新基准。

链接: https://arxiv.org/abs/2511.19850
作者: Jiahui Sun,Junran Lu,Jinhui Yin,Yishuo Xu,Yuanqi Li,Yanwen Guo
机构: Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 11 pages, 6 figures

点击查看摘要

Abstract:Automatic extraction of road networks from aerial imagery is a fundamental task, yet prevailing methods rely on polylines that struggle to model curvilinear geometry. We maintain that road geometry is inherently curve-based and introduce the Bézier Graph, a differentiable parametric curve-based representation. The primary obstacle to this representation is to obtain the difficult-to-construct vector ground-truth (GT). We sidestep this bottleneck by reframing the task as a global optimization problem over the Bézier Graph. Our framework, DOGE, operationalizes this paradigm by learning a parametric Bézier Graph directly from segmentation masks, eliminating the need for curve GT. DOGE holistically optimizes the graph by alternating between two complementary modules: DiffAlign continuously optimizes geometry via differentiable rendering, while TopoAdapt uses discrete operators to refine its topology. Our method sets a new state-of-the-art on the large-scale SpaceNet and CityScale benchmarks, presenting a new paradigm for generating high-fidelity vector maps of road networks. We will release our code and related data.
zh

[CV-148] Face Whole-Person and Object Classification in a Unified Space Via The Interleaved Multi-Domain Identity Curriculum

【速读】:该论文旨在解决多任务视觉模型在微调过程中面临的灾难性遗忘(catastrophic forgetting)问题,同时实现四种不同但相关的识别任务(物体分类、高质量与低质量人脸识别、全身人体识别)在同一嵌入空间中的联合建模。解决方案的关键在于提出两种变体的交错多域身份课程学习(Interleaved Multi-Domain Identity Curriculum, IMIC),通过梯度耦合的交错训练策略,同步微调基础骨干网络在所有四个任务上,从而在不显著损害分布外泛化能力的前提下,实现任务间特征共享与线性可分性。实验表明,基于EVA-02和CLIP基础模型的IMIC方法在四项任务上均达到或超越领域专家水平,并且仅需少于100个主成分即可完成全部任务,展现出高度高效的特征复用能力。

链接: https://arxiv.org/abs/2511.19846
作者: Thomas M Metz,Matthew Q Hill,Alice J O’Toole
机构: The University of Texas at Dallas (德克萨斯大学达拉斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision foundation models can perform generalized object classification in zero-shot mode, and face/person recognition when they are fine-tuned. However, fine-tuned models suffer from catastrophic forgetting. We create models that perform four tasks (object recognition, face recognition from high- and low-quality images, and person recognition from whole-body images) in a single embedding space – without incurring substantial catastrophic forgetting. To accomplish this, we introduce two variants of the Interleaved Multi-Domain Identity Curriculum (IMIC): a gradient-coupled, interleaving training schedule that fine-tunes a foundation backbone simultaneously on all four tasks. The IMIC method proved effective with three foundation model bases: DINOv3, CLIP, and EVA-02. Two of these (EVA-02 and CLIP) performed comparably with domain experts on all four tasks concurrently and were more accurate than humans at multi-tasking across face, body, and object datasets. Further, we demonstrate that our approach does not substantially harm out-of-distribution generalization, thus maintaining a key property of foundation models. Analysis of the most accurate model variants (EVA-02 + IMIC A and B) showed linearly separable representations of the four tasks in the unified embedding space, but with substantial sharing of features across tasks. Fewer than 100 PCs calculated from any one task could perform all other tasks with nearly zero performance degradation.
zh

[CV-149] 4DWorldBench: A Comprehensive Evaluation Framework for 3D/4D World Generation Models

【速读】:该论文旨在解决当前世界生成模型(World Generation Models)缺乏统一评估基准的问题,尤其是现有评测体系在感知质量、条件-4D对齐、物理真实性及4D一致性等关键维度上割裂且不系统,难以客观衡量模型构建真实、动态、物理一致的3D/4D世界的能力。解决方案的关键在于提出4DWorldBench这一综合性评估框架,其核心创新包括:(1)定义四个核心评价维度以全面刻画世界生成能力;(2)引入跨模态自适应条件映射机制,将图像、视频、文本等多种输入条件统一转化为文本空间进行标准化评估;(3)融合LLM-as-judge、MLLM-as-judge与传统网络方法,实现多视角协同判断,显著提升评估结果与人类主观感知的一致性。该设计实现了对世界生成模型在多维属性上的统一、可比、高保真评估,推动从“视觉生成”向“世界生成”的范式演进。

链接: https://arxiv.org/abs/2511.19836
作者: Yiting Lu,Wei Luo,Peiyan Tu,Haoran Li,Hanxin Zhu,Zihao Yu,Xingrui Wang,Xinyi Chen,Xinge Peng,Xin Li,Zhibo Chen
机构: University of Science and Technology of China (中国科学技术大学); Zhejiang University (浙江大学); Beijing Zhongguancun Academy (北京中关村学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:World Generation Models are emerging as a cornerstone of next-generation multimodal intelligence systems. Unlike traditional 2D visual generation, World Models aim to construct realistic, dynamic, and physically consistent 3D/4D worlds from images, videos, or text. These models not only need to produce high-fidelity visual content but also maintain coherence across space, time, physics, and instruction control, enabling applications in virtual reality, autonomous driving, embodied intelligence, and content creation. However, prior benchmarks emphasize different evaluation dimensions and lack a unified assessment of world-realism capability. To systematically evaluate World Models, we introduce the 4DWorldBench, which measures models across four key dimensions: Perceptual Quality, Condition-4D Alignment, Physical Realism, and 4D Consistency. The benchmark covers tasks such as Image-to-3D/4D, Video-to-4D, Text-to-3D/4D. Beyond these, we innovatively introduce adaptive conditioning across multiple modalities, which not only integrates but also extends traditional evaluation paradigms. To accommodate different modality-conditioned inputs, we map all modality conditions into a unified textual space during evaluation, and further integrate LLM-as-judge, MLLM-as-judge, and traditional network-based methods. This unified and adaptive design enables more comprehensive and consistent evaluation of alignment, physical realism, and cross-modal coherence. Preliminary human studies further demonstrate that our adaptive tool selection achieves closer agreement with subjective human judgments. We hope this benchmark will serve as a foundation for objective comparisons and improvements, accelerating the transition from “visual generation” to “world generation.” Our project can be found at this https URL.
zh

[CV-150] Rectified SpaAttn: Revisiting Attention Sparsity for Efficient Video Generation

【速读】:该论文旨在解决扩散模型中视频生成任务里注意力机制计算复杂度高(二次方复杂度)导致的延迟问题,同时克服现有稀疏注意力方法因系统性偏差造成性能下降的缺陷。其核心解决方案是提出Rectified SpaAttn,通过引入隐式的全注意力参考来修正稀疏注意力分配:对于关键token,采用隔离池化注意力重分配(Isolated-Pooling Attention Reallocation)以准确计算校正因子;对于非关键token,则设计增益感知池化校正(Gain-Aware Pooling Rectification)确保校正后的增益始终大于池化误差。该方法在保持高质量视频生成的同时,在HunyuanVideo和Wan 2.1上分别实现最高3.33倍和2.08倍的速度提升。

链接: https://arxiv.org/abs/2511.19835
作者: Xuewen Liu,Zhikai Li,Jing Zhang,Mengjuan Chen,Qingyi Gu
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Code at this https URL

点击查看摘要

Abstract:Diffusion Transformers dominate video generation, but the quadratic complexity of attention computation introduces substantial latency. Attention sparsity reduces computational costs by focusing on critical tokens while ignoring non-critical tokens. However, existing methods suffer from severe performance degradation. In this paper, we revisit attention sparsity and reveal that existing methods induce systematic biases in attention allocation: (1) excessive focus on critical tokens amplifies their attention weights; (2) complete neglect of non-critical tokens causes the loss of relevant attention weights. To address these issues, we propose Rectified SpaAttn, which rectifies attention allocation with implicit full attention reference, thereby enhancing the alignment between sparse and full attention maps. Specifically: (1) for critical tokens, we show that their bias is proportional to the sparse attention weights, with the ratio governed by the amplified weights. Accordingly, we propose Isolated-Pooling Attention Reallocation, which calculates accurate rectification factors by reallocating multimodal pooled weights. (2) for non-critical tokens, recovering attention weights from the pooled query-key yields attention gains but also introduces pooling errors. Therefore, we propose Gain-Aware Pooling Rectification, which ensures that the rectified gain consistently surpasses the induced error. Moreover, we customize and integrate the Rectified SpaAttn kernel using Triton, achieving up to 3.33 and 2.08 times speedups on HunyuanVideo and Wan 2.1, respectively, while maintaining high generation quality. We release Rectified SpaAttn as open-source at this https URL .
zh

[CV-151] Large Language Model Aided Birt-Hogg-Dube Syndrome Diagnosis with Multimodal Retrieval-Augmented Generation

【速读】:该论文旨在解决深度学习方法在弥漫性囊性肺疾病(Diffuse Cystic Lung Diseases, DCLDs)诊断中面临的双重挑战:临床样本有限以及不同类别间图像特征区分度低,从而影响Birt-Hogg-Dubé综合征(BHD)的CT影像诊断准确性。同时,尽管多模态大语言模型(Multimodal Large Language Models, MLLMs)在罕见病诊断中展现出潜力,但其缺乏领域专业知识和可参考的影像学特征易导致幻觉问题。解决方案的关键在于提出一种名为BHD-RAG的多模态检索增强生成框架,通过三个核心组件实现:(1) 专用代理生成DCLDs病例的影像表现描述以构建多模态语料库;(2) 基于余弦相似度的检索器精准匹配查询图像对应的影像描述对;(3) MLLM融合检索到的证据与影像数据进行综合诊断推理,从而显著提升诊断准确率并生成符合专家认知的解释性描述。

链接: https://arxiv.org/abs/2511.19834
作者: Haoqing Li,Jun Shi,Xianmeng Chen,Qiwei Jia,Rui Wang,Wei Wei,Hong An,Xiaowen Hu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning methods face dual challenges of limited clinical samples and low inter-class differentiation among Diffuse Cystic Lung Diseases (DCLDs) in advancing Birt-Hogg-Dube syndrome (BHD) diagnosis via Computed Tomography (CT) imaging. While Multimodal Large Language Models (MLLMs) demonstrate diagnostic potential fo such rare diseases, the absence of domain-specific knowledge and referable radiological features intensify hallucination risks. To address this problem, we propose BHD-RAG, a multimodal retrieval-augmented generation framework that integrates DCLD-specific expertise and clinical precedents with MLLMs to improve BHD diagnostic accuracy. BHDRAG employs: (1) a specialized agent generating imaging manifestation descriptions of CT images to construct a multimodal corpus of DCLDs cases. (2) a cosine similarity-based retriever pinpointing relevant imagedescription pairs for query images, and (3) an MLLM synthesizing retrieved evidence with imaging data for diagnosis. BHD-RAG is validated on the dataset involving four types of DCLDs, achieving superior accuracy and generating evidence-based descriptions closely aligned with expert insights.
zh

[CV-152] ReDirector: Creating Any-Length Video Retakes with Rotary Camera Encoding

【速读】:该论文旨在解决动态捕捉的变长视频在重拍生成过程中存在的时空位置错位与多视角关系建模不足的问题,尤其针对先前方法中RoPE(Rotary Position Embedding)使用不当导致的时空对齐偏差。其解决方案的关键在于提出一种名为RoCE(Rotary Camera Encoding)的新机制,通过引入相机条件驱动的RoPE相位偏移,显式建模输入视频与目标重拍视频之间跨视图的时空关系,从而提升模型对分布外相机轨迹和视频长度的泛化能力,并实现更精确的动态物体定位与静态背景保持。

链接: https://arxiv.org/abs/2511.19827
作者: Byeongjun Park,Byung-Hoon Kim,Hyungjin Chung,Jong Chul Ye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:We present ReDirector, a novel camera-controlled video retake generation method for dynamically captured variable-length videos. In particular, we rectify a common misuse of RoPE in previous works by aligning the spatiotemporal positions of the input video and the target retake. Moreover, we introduce Rotary Camera Encoding (RoCE), a camera-conditioned RoPE phase shift that captures and integrates multi-view relationships within and across the input and target videos. By integrating camera conditions into RoPE, our method generalizes to out-of-distribution camera trajectories and video lengths, yielding improved dynamic object localization and static background preservation. Extensive experiments further demonstrate significant improvements in camera controllability, geometric consistency, and video quality across various trajectories and lengths.
zh

[CV-153] Reading Between the Lines: Abstaining from VLM-Generated OCR Errors via Latent Representation Probes

【速读】:该论文旨在解决视觉语言模型(VLM)在安全关键任务中缺乏可靠不确定性感知能力的问题,特别是在场景文本视觉问答(STVQA)任务中,OCR错误可能导致严重后果,而现有方法依赖于不可靠的输出概率或不适用于OCR任务的语义一致性判断。其解决方案的关键在于提出一种基于潜在表示探测(Latent Representation Probing, LRP)的新框架:通过训练轻量级探针(probe)分析VLM内部隐藏状态或注意力模式来识别不确定性信号,而非依赖最终输出的概率分布。该方法探索了三种探针设计策略,并在多模态数据集上验证其有效性,显著提升了 abstention 准确率(相比最优基线提升7.6%),且发现中间层特征比最终层更适合作为不确定性信号来源,从而为部署可靠的AI系统提供了可解释、泛化性强的内在置信度检测机制。

链接: https://arxiv.org/abs/2511.19806
作者: Jihan Yao,Achin Kulshrestha,Nathalie Rauschmayr,Reed Roberts,Banghua Zhu,Yulia Tsvetkov,Federico Tombari
机构: University of Washington (华盛顿大学); Google(谷歌)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:As VLMs are deployed in safety-critical applications, their ability to abstain from answering when uncertain becomes crucial for reliability, especially in Scene Text Visual Question Answering (STVQA) tasks. For example, OCR errors like misreading “50 mph” as “60 mph” could cause severe traffic accidents. This leads us to ask: Can VLMs know when they can’t see? Existing abstention methods suggest pessimistic answers: they either rely on miscalibrated output probabilities or require semantic agreement unsuitable for OCR tasks. However, this failure may indicate we are looking in the wrong place: uncertainty signals could be hidden in VLMs’ internal representations. Building on this insight, we propose Latent Representation Probing (LRP): training lightweight probes on hidden states or attention patterns. We explore three probe designs: concatenating representations across all layers, aggregating attention over visual tokens, and ensembling single layer probes by majority vote. Experiments on four benchmarks across image and video modalities show LRP improves abstention accuracy by 7.6% over best baselines. Our analysis reveals: probes generalize across various uncertainty sources and datasets, and optimal signals emerge from intermediate rather than final layers. This establishes a principled framework for building deployment-ready AI systems by detecting confidence signals from internal states rather than unreliable outputs. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2511.19806 [cs.CV] (or arXiv:2511.19806v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2511.19806 Focus to learn more arXiv-issued DOI via DataCite
zh

[CV-154] rminal Velocity Matching

【速读】:该论文旨在解决生成式模型在少步(one- and few-step)条件下难以同时实现高保真度与高效采样的问题。传统流匹配(flow matching)方法通常在初始时间点对模型行为进行正则化,而本文提出终端速度匹配(Terminal Velocity Matching, TVM),通过建模任意两个扩散时间步之间的转移过程,并将正则化约束施加于终末时间点而非初始时间点,从而提升模型在少步生成中的精度。其关键创新在于:1)理论层面证明了当模型满足Lipschitz连续性时,TVM可提供数据分布与模型分布之间2-Wasserstein距离的上界;2)针对Diffusion Transformer缺乏Lipschitz性质的问题,引入最小架构修改以实现稳定的一阶段训练;3)开发融合注意力核(fused attention kernel)支持雅可比向量积(Jacobian-Vector Products)的反向传播,显著提升计算效率,使TVM在ImageNet 256x256和512x512图像上分别达到3.29 FID(1次函数评估)和4.32 FID(1次函数评估)的SOTA性能。

链接: https://arxiv.org/abs/2511.19797
作者: Linqi Zhou,Mathias Parger,Ayaan Haque,Jiaming Song
机构: Luma AI
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: Code available at: this https URL

点击查看摘要

Abstract:We propose Terminal Velocity Matching (TVM), a generalization of flow matching that enables high-fidelity one- and few-step generative modeling. TVM models the transition between any two diffusion timesteps and regularizes its behavior at its terminal time rather than at the initial time. We prove that TVM provides an upper bound on the 2 -Wasserstein distance between data and model distributions when the model is Lipschitz continuous. However, since Diffusion Transformers lack this property, we introduce minimal architectural changes that achieve stable, single-stage training. To make TVM efficient in practice, we develop a fused attention kernel that supports backward passes on Jacobian-Vector Products, which scale well with transformer architectures. On ImageNet-256x256, TVM achieves 3.29 FID with a single function evaluation (NFE) and 1.99 FID with 4 NFEs. It similarly achieves 4.32 1-NFE FID and 2.94 4-NFE FID on ImageNet-512x512, representing state-of-the-art performance for one/few-step models from scratch.
zh

[CV-155] One Attention One Scale: Phase-Aligned Rotary Positional Embeddings for Mixed-Resolution Diffusion Transformer

【速读】:该论文旨在解决扩散 Transformer(Diffusion Transformers, DiTs)在混合分辨率去噪任务中使用常规线性插值进行旋转位置编码(Rotary Position Embedding, RoPE)时出现的核心失效问题:当来自不同空间网格的 token 被混合时,注意力机制会崩溃。其根本原因在于结构层面——线性坐标映射强制单个注意力头以不兼容的采样率比较 RoPE 相位,导致相位混叠(phase aliasing),从而破坏得分流形的稳定性。针对此问题,作者提出了一种无需训练的即插即用解决方案——跨分辨率相位对齐注意力(Cross-Resolution Phase-Aligned Attention, CRPA),其关键在于修改每次注意力计算中的 RoPE 索引映射方式:将所有查询(Q)和键(K)位置统一表达在查询的步长下,确保相同物理距离始终产生相同的相位增量,从而恢复 DiT 依赖的精确相位模式。CRPA 兼容预训练 DiT,可均匀稳定所有注意力头与层,并显著提升图像与视频生成的保真度与效率。

链接: https://arxiv.org/abs/2511.19778
作者: Haoyu Wu,Jingyi Xu,Qiaomu Miao,Dimitris Samaras,Hieu Le
机构: Stony Brook University (石溪大学); UNC-Charlotte (北卡罗来纳大学夏洛特分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We identify a core failure mode that occurs when using the usual linear interpolation on rotary positional embeddings (RoPE) for mixed-resolution denoising with Diffusion Transformers. When tokens from different spatial grids are mixed, the attention mechanism collapses. The issue is structural. Linear coordinate remapping forces a single attention head to compare RoPE phases sampled at incompatible rates, creating phase aliasing that destabilizes the score landscape. Pretrained DiTs are especially brittle-many heads exhibit extremely sharp, periodic phase selectivity-so even tiny cross-rate inconsistencies reliably cause blur, artifacts, or full collapse. To this end, our main contribution is Cross-Resolution Phase-Aligned Attention (CRPA), a training-free drop-in fix that eliminates this failure at its source. CRPA modifies only the RoPE index map for each attention call: all Q/K positions are expressed on the query’s stride so that equal physical distances always induce identical phase increments. This restores the precise phase patterns that DiTs rely on. CRPA is fully compatible with pretrained DiTs, stabilizes all heads and layers uniformly. We demonstrate that CRPA enables high-fidelity and efficient mixed-resolution generation, outperforming previous state-of-the-art methods on image and video generation. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2511.19778 [cs.CV] (or arXiv:2511.19778v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2511.19778 Focus to learn more arXiv-issued DOI via DataCite
zh

[CV-156] Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, VLMs)在具身问答(Embodied Question Answering, EQA)任务中进行逐步探索时出现的前沿振荡(frontier oscillations)问题,即由于模型过自信和校准不足导致的不稳定来回移动,从而降低导航效率和答案质量。解决方案的关键在于提出“剪枝-规划”(Prune-Then-Plan)框架:首先通过受Holm-Bonferroni启发的剪枝过程排除不合理的前沿选择,再由基于覆盖度的规划器做出最终决策,从而将VLM的过度自信预测转化为保守且可解释的动作,实现对VLM步级行为的人类水平校准。该方法在3D-Mem EQA框架中集成后,在视觉锚定SPL和LLM-Match指标上分别较基线提升49%和33%,并在OpenEQA与EXPRESS-Bench数据集上以相同探索预算获得更优场景覆盖率。

链接: https://arxiv.org/abs/2511.19768
作者: Noah Frahm,Prakrut Patel,Yue Zhang,Shoubin Yu,Mohit Bansal,Roni Sengupta
机构: University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: webpage: this https URL

点击查看摘要

Abstract:Large vision-language models (VLMs) have improved embodied question answering (EQA) agents by providing strong semantic priors for open-vocabulary reasoning. However, when used directly for step-level exploration, VLMs often exhibit frontier oscillations, unstable back-and-forth movements caused by overconfidence and miscalibration, leading to inefficient navigation and degraded answer quality. We propose Prune-Then-Plan, a simple and effective framework that stabilizes exploration through step-level calibration. Instead of trusting raw VLM scores, our method prunes implausible frontier choices using a Holm-Bonferroni inspired pruning procedure and then delegates final decisions to a coverage-based planner. This separation converts overconfident predictions into conservative, interpretable actions by relying on human-level judgments to calibrate the step-level behavior of VLMs. Integrated into the 3D-Mem EQA framework, our approach achieves relative improvements of up to 49% and 33% in visually grounded SPL and LLM-Match metrics respectively over baselines. Overall, our method achieves better scene coverage under equal exploration budgets on both OpenEQA and EXPRESS-Bench datasets.
zh

[CV-157] Lightweight Transformer Framework for Weakly Supervised Semantic Segmentation

【速读】:该论文致力于解决弱监督语义分割(Weakly Supervised Semantic Segmentation, WSSS)中因噪声和不充分标注线索导致的分割精度不足问题,特别是边界模糊、小物体漏检及对标签噪声敏感等挑战。其解决方案的关键在于对SegFormer解码器进行三项协同改进:(1)引入边界分支,通过轻量级边缘头和边界感知损失监督细长目标轮廓;(2)设计不确定性引导精修模块,预测像素级似然不确定性(aleatoric uncertainty),并据此加权损失与门控残差修正分割logits;(3)构建动态多尺度融合层,以空间softmax门控替代静态特征拼接,可选地由不确定性调制各尺度特征的融合权重。这些改动无需修改主干网络MiT且不依赖复杂后处理,实现了单次前向传播下边界清晰、局部尺度自适应选择以及对弱标签噪声鲁棒的分割性能提升。

链接: https://arxiv.org/abs/2511.19765
作者: Ali Torabi,Sanjog Gaihre,Yaqoob Majeed
机构: University of Wyoming (怀俄明大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Weakly supervised semantic segmentation (WSSS) must learn dense masks from noisy, under-specified cues. We revisit the SegFormer decoder and show that three small, synergistic changes make weak supervision markedly more effective-without altering the MiT backbone or relying on heavy post-processing. Our method, CrispFormer, augments the decoder with: (1) a boundary branch that supervises thin object contours using a lightweight edge head and a boundary-aware loss; (2) an uncertainty-guided refiner that predicts per-pixel aleatoric uncertainty and uses it to weight losses and gate a residual correction of the segmentation logits; and (3) a dynamic multi-scale fusion layer that replaces static concatenation with spatial softmax gating over multi-resolution features, optionally modulated by uncertainty. The result is a single-pass model that preserves crisp boundaries, selects appropriate scales per location, and resists label noise from weak cues. Integrated into a standard WSSS pipeline (seed, student, and EMA relabeling), CrispFormer consistently improves boundary F-score, small-object recall, and mIoU over SegFormer baselines trained on the same seeds, while adding minimal compute. Our decoder-centric formulation is simple to implement, broadly compatible with existing SegFormer variants, and offers a reproducible path to higher-fidelity masks from image-level supervision.
zh

[CV-158] A Storag e-Efficient Feature for 3D Concrete Defect Segmentation to Replace Normal Vector

【速读】:该论文旨在解决点云重建在混凝土表面缺陷识别中因3D数据量庞大而导致的存储与计算资源消耗过高的问题。其关键解决方案是提出了一种新的单维特征——相对角度(relative angle),即某一点的法向量与父点云平均法向量之间的夹角,该特征能够等效表达法向量的方向信息,同时通过熵-based特征评估方法有效过滤无损区域的冗余信息,保留损伤区域的有效特征。基于此特征训练的PointNet++模型在性能上与使用法向量的模型相当,却实现了27.6%的存储压缩和83%的输入通道压缩,从而在不改变模型结构的前提下显著提升了资源受限硬件上的批量处理能力。

链接: https://arxiv.org/abs/2511.19760
作者: Linxin Hua(1),Jianghua Deng(2),Ye Lu(1) ((1) Department of Civil and Environmental Engineering, Monash University, Melbourne, Australia, (2) School of Civil Engineering and Architecture, Changzhou Institute of Technology, Changzhou, China)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 7 figures

点击查看摘要

Abstract:Point cloud reconstruction of damage offers an effective solution to image-based methods vulnerable to background noise, yet its application is constrained by the high volume of 3D data. This study proposes a new feature, relative angle, computed as the angle between the normal vector of a point and the average normal vector of its parent point cloud. This single-dimensional feature provides directionality information equivalent to normal vectors for concrete surface defect characteristics. Through entropy-based feature evaluation, this study demonstrates the ability of relative angle to filter out redundant information in undamaged sections while retaining effective information in damaged sections. By training and testing with PointNet++, models based on the relative angles achieved similar performance to that of models based on normal vectors while delivering 27.6% storage reduction and 83% input channel compression. This novel feature has the potential to enable larger-batch execution on resource-constrained hardware without the necessity of architectural modifications to models.
zh

[CV-159] Vision–Language Enhanced Foundation Model for Semi-supervised Medical Image Segmentation

【速读】:该论文旨在解决医学图像分割中对大量专家标注数据依赖性强的问题,尤其是在标注资源极其有限的情况下如何提升分割精度。其解决方案的关键在于提出一种视觉-语言增强的半监督分割辅助系统(Vision-Language Enhanced Semi-supervised Segmentation Assistant, VESSA),通过引入具备基础级视觉-语义理解能力的视觉语言模型(VLM),在两个阶段实现性能优化:第一阶段利用模板库中的金标准示例训练VESSA作为参考引导的分割助手,基于输入图像与模板对的视觉特征匹配生成结构化提示,驱动类似SAM2的掩码解码器输出初步分割结果;第二阶段将VESSA嵌入到先进的半监督学习(SSL)框架中,形成动态交互机制——随着学生模型预测质量的提升,其输出被反馈至VESSA以生成更高质量的伪标签和更强的指导信号,从而显著提升分割准确性,尤其在极端低标注条件下优于现有最先进方法。

链接: https://arxiv.org/abs/2511.19759
作者: Jiaqi Guo,Mingzhen Li,Hanyu Su,Santiago López,Lexiaozi Fan,Daniel Kim,Aggelos Katsaggelos
机构: Northwestern University (西北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Semi-supervised learning (SSL) has emerged as an effective paradigm for medical image segmentation, reducing the reliance on extensive expert annotations. Meanwhile, vision-language models (VLMs) have demonstrated strong generalization and few-shot capabilities across diverse visual domains. In this work, we integrate VLM-based segmentation into semi-supervised medical image segmentation by introducing a Vision-Language Enhanced Semi-supervised Segmentation Assistant (VESSA) that incorporates foundation-level visual-semantic understanding into SSL frameworks. Our approach consists of two stages. In Stage 1, the VLM-enhanced segmentation foundation model VESSA is trained as a reference-guided segmentation assistant using a template bank containing gold-standard exemplars, simulating learning from limited labeled data. Given an input-template pair, VESSA performs visual feature matching to extract representative semantic and spatial cues from exemplar segmentations, generating structured prompts for a SAM2-inspired mask decoder to produce segmentation masks. In Stage 2, VESSA is integrated into a state-of-the-art SSL framework, enabling dynamic interaction with the student model: as student predictions become more refined, they are fed back to VESSA as prompts, allowing it to generate higher-quality pseudo-labels and stronger guidance. Extensive experiments across multiple segmentation datasets and domains show that VESSA-augmented SSL significantly enhances segmentation accuracy, outperforming state-of-the-art baselines under extremely limited annotation conditions.
zh

[CV-160] What You See is (Usually) What You Get: Multimodal Prototype Networks that Abstain from Expensive Modalities

【速读】:该论文旨在解决多模态神经网络在物种识别任务中存在的两大问题:一是模型决策过程缺乏可解释性(black-box nature),二是获取基因数据成本高且常需侵入性采样操作。解决方案的关键在于扩展原型网络(Prototype Networks, ProtoPNets)至多模态、成本感知的设置,通过融合各模态的原型并引入权重机制来动态分配不同模态对预测的贡献;同时设计方法识别无需昂贵基因信息即可做出高置信度判断的情形,从而实现对基因数据的智能分配,仅在需要精细区分时使用,其余场景主要依赖图像数据完成分类,最终在保持与始终使用双模态数据模型相当准确率的前提下显著降低资源消耗。

链接: https://arxiv.org/abs/2511.19752
作者: Muchang Bahng,Charlie Berens,Jon Donnelly,Eric Chen,Chaofan Chen,Cynthia Rudin
机构: Duke University (杜克大学); University of Maine (缅因大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages. 16 figures. 10 tables

点击查看摘要

Abstract:Species detection is important for monitoring the health of ecosystems and identifying invasive species, serving a crucial role in guiding conservation efforts. Multimodal neural networks have seen increasing use for identifying species to help automate this task, but they have two major drawbacks. First, their black-box nature prevents the interpretability of their decision making process. Second, collecting genetic data is often expensive and requires invasive procedures, often necessitating researchers to capture or kill the target specimen. We address both of these problems by extending prototype networks (ProtoPNets), which are a popular and interpretable alternative to traditional neural networks, to the multimodal, cost-aware setting. We ensemble prototypes from each modality, using an associated weight to determine how much a given prediction relies on each modality. We further introduce methods to identify cases for which we do not need the expensive genetic information to make confident predictions. We demonstrate that our approach can intelligently allocate expensive genetic data for fine-grained distinctions while using abundant image data for clearer visual classifications and achieving comparable accuracy to models that consistently use both modalities.
zh

[CV-161] Leverag ing Foundation Models for Histological Grading in Cutaneous Squamous Cell Carcinoma using PathFMTools ALT ML4H

【速读】:该论文旨在解决病理学基础模型(pathology foundation models)在特定临床任务中适应困难的问题,主要挑战包括全切片图像(Whole-Slide Image, WSI)处理的复杂性、模型学习特征的不透明性以及多样化的适配策略选择。解决方案的关键在于提出 PathFMTools,一个轻量且可扩展的 Python 工具包,用于高效执行、分析和可视化病理基础模型;通过该工具对接并评估两个先进的视觉-语言基础模型(CONCH 和 MUSK)在皮肤鳞状细胞癌(cutaneous squamous cell carcinoma, cSCC)组织学分级任务中的表现,验证了利用基础模型嵌入训练小型专用模型的可行性,并揭示了不同预测方法之间的权衡,从而为病理基础模型在真实临床场景中的落地提供了实证支持。

链接: https://arxiv.org/abs/2511.19751
作者: Abdul Rahman Diab,Emily E. Karn,Renchin Wu,Emily S. Ruiz,William Lotter
机构: Dana-Farber Cancer Institute(达纳-法伯癌症研究所); Brigham and Women’s Hospital(布里格姆妇女医院); Harvard Medical School(哈佛医学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Proceedings of the 5th Machine Learning for Health (ML4H) Symposium (2025)

点击查看摘要

Abstract:Despite the promise of computational pathology foundation models, adapting them to specific clinical tasks remains challenging due to the complexity of whole-slide image (WSI) processing, the opacity of learned features, and the wide range of potential adaptation strategies. To address these challenges, we introduce PathFMTools, a lightweight, extensible Python package that enables efficient execution, analysis, and visualization of pathology foundation models. We use this tool to interface with and evaluate two state-of-the-art vision-language foundation models, CONCH and MUSK, on the task of histological grading in cutaneous squamous cell carcinoma (cSCC), a critical criterion that informs cSCC staging and patient management. Using a cohort of 440 cSCC HE WSIs, we benchmark multiple adaptation strategies, demonstrating trade-offs across prediction approaches and validating the potential of using foundation model embeddings to train small specialist models. These findings underscore the promise of pathology foundation models for real-world clinical applications, with PathFMTools enabling efficient analysis and validation.
zh

[CV-162] Efficient Transferable Optimal Transport via Min-Sliced Transport Plans

【速读】:该论文旨在解决最优传输(Optimal Transport, OT)在计算成本上的瓶颈问题,尤其是在处理大规模数据或频繁进行OT计算的场景下,传统方法难以满足效率需求。其核心挑战在于如何设计一种高效且具有迁移能力的slice-based OT方法,使得在不同分布对之间优化出的切片(slicer)能够有效泛化至新的分布对,从而避免重复训练并提升可扩展性。解决方案的关键在于提出min-Sliced Transport Plan(min-STP)框架,并通过理论证明:当数据分布发生微小扰动时,已优化的slicer仍保持近似稳定,这为跨分布的迁移提供了理论保障;同时引入minibatch形式的min-STP及其统计一致性保证,显著提升了算法的实用性与可扩展性,在点云对齐和基于流的生成建模等任务中实现了强one-shot匹配性能和高效的 amortized 训练机制。

链接: https://arxiv.org/abs/2511.19741
作者: Xinran Liu,Elaheh Akbari,Rocio Diaz Martin,Navid NaderiAlizadeh,Soheil Kolouri
机构: Vanderbilt University (范德堡大学); Florida State University (佛罗里达州立大学); Duke University (杜克大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Optimal Transport (OT) offers a powerful framework for finding correspondences between distributions and addressing matching and alignment problems in various areas of computer vision, including shape analysis, image generation, and multimodal tasks. The computation cost of OT, however, hinders its scalability. Slice-based transport plans have recently shown promise for reducing the computational cost by leveraging the closed-form solutions of 1D OT problems. These methods optimize a one-dimensional projection (slice) to obtain a conditional transport plan that minimizes the transport cost in the ambient space. While efficient, these methods leave open the question of whether learned optimal slicers can transfer to new distribution pairs under distributional shift. Understanding this transferability is crucial in settings with evolving data or repeated OT computations across closely related distributions. In this paper, we study the min-Sliced Transport Plan (min-STP) framework and investigate the transferability of optimized slicers: can a slicer trained on one distribution pair yield effective transport plans for new, unseen pairs? Theoretically, we show that optimized slicers remain close under slight perturbations of the data distributions, enabling efficient transfer across related tasks. To further improve scalability, we introduce a minibatch formulation of min-STP and provide statistical guarantees on its accuracy. Empirically, we demonstrate that the transferable min-STP achieves strong one-shot matching performance and facilitates amortized training for point cloud alignment and flow-based generative modeling.
zh

[CV-163] Maritime Small Object Detection from UAVs using Deep Learning with Altitude-Aware Dynamic Tiling

【速读】:该论文旨在解决无人机(Unmanned Aerial Vehicles, UAVs)在海上搜救(Search and Rescue, SAR)任务中因高空拍摄导致小目标检测困难的问题,主要表现为小目标与背景的像素比例低,从而影响检测精度。解决方案的关键在于提出一种高度感知的动态分块(altitude-aware dynamic tiling)方法,通过结合高度依赖的缩放策略与自适应分块因子,在保证检测性能的同时减少冗余计算;实验表明,该方法相较基线模型在小目标上的平均精度(mAP)提升38%,且推理速度超过静态分块方式的两倍,显著提升了UAV在多样化环境下的搜救效率与准确性。

链接: https://arxiv.org/abs/2511.19728
作者: Sakib Ahmed,Oscar Pizarro
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: This is the author’s accepted version of an article that has been published by IEEE. The final published version is available at IEEE Xplore

点击查看摘要

Abstract:Unmanned Aerial Vehicles (UAVs) are crucial in Search and Rescue (SAR) missions due to their ability to monitor vast maritime areas. However, small objects often remain difficult to detect from high altitudes due to low object-to-background pixel ratios. We propose an altitude-aware dynamic tiling method that scales and adaptively subdivides the image into tiles for enhanced small object detection. By integrating altitude-dependent scaling with an adaptive tiling factor, we reduce unnecessary computation while maintaining detection performance. Tested on the SeaDronesSee dataset [1] with YOLOv5 [2] and Slicing Aided Hyper Inference (SAHI) framework [3], our approach improves Mean Average Precision (mAP) for small objects by 38% compared to a baseline and achieves more than double the inference speed compared to static tiling. This approach enables more efficient and accurate UAV-based SAR operations under diverse conditions.
zh

[CV-164] Rethinking Vision Transformer Depth via Structural Reparameterization

【速读】:该论文旨在解决视觉 Transformer(Vision Transformer, ViT)在实际应用中因深层架构导致的计算开销过大的问题,特别是探索是否可以在减少堆叠的 Transformer 层数量的同时保持相近的表征能力。其解决方案的关键在于提出一种基于分支的结构重参数化技术(branch-based structural reparameterization),该技术在训练阶段引入并行分支结构,并通过渐进式合并非线性组件入口处的分支,实现对前馈网络(Feed-Forward Network, FFN)和多头自注意力模块(Multi-Head Self-Attention, MHSA)的精确数学重参数化,从而在推理时构建单路径模型,无需近似误差即可显著压缩模型深度,最终在 ImageNet-1K 分类任务上验证了从原始 12 层压缩至 3 层仍能保持准确率,并在移动 CPU 平台上实现最高达 37% 的推理加速。

链接: https://arxiv.org/abs/2511.19718
作者: Chengwei Zhou,Vipin Chaudhary,Gourav Datta
机构: Case Western Reserve University (凯斯西储大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 6 figures

点击查看摘要

Abstract:The computational overhead of Vision Transformers in practice stems fundamentally from their deep architectures, yet existing acceleration strategies have primarily targeted algorithmic-level optimizations such as token pruning and attention speedup. This leaves an underexplored research question: can we reduce the number of stacked transformer layers while maintaining comparable representational capacity? To answer this, we propose a branch-based structural reparameterization technique that operates during the training phase. Our approach leverages parallel branches within transformer blocks that can be systematically consolidated into streamlined single-path models suitable for inference deployment. The consolidation mechanism works by gradually merging branches at the entry points of nonlinear components, enabling both feed-forward networks (FFN) and multi-head self-attention (MHSA) modules to undergo exact mathematical reparameterization without inducing approximation errors at test time. When applied to ViT-Tiny, the framework successfully reduces the original 12-layer architecture to 6, 4, or as few as 3 layers while maintaining classification accuracy on ImageNet-1K. The resulting compressed models achieve inference speedups of up to 37% on mobile CPU platforms. Our findings suggest that the conventional wisdom favoring extremely deep transformer stacks may be unnecessarily restrictive, and point toward new opportunities for constructing efficient vision transformers.
zh

[CV-165] RADSeg: Unleashing Parameter and Compute Efficient Zero-Shot Open-Vocabulary Segmentation Using Agglomerative Models

【速读】:该论文旨在解决开放词汇语义分割(Open-vocabulary Semantic Segmentation, OVSS)中模型泛化能力不足与计算资源消耗过高的问题。现有方法要么受限于训练数据的稀缺性,导致泛化性能差,要么依赖零样本启发式策略或多个大型视觉语言模型组合,造成高延迟和内存开销。其解决方案的关键在于利用一个被忽视的凝聚型视觉基础模型RADIO,并通过三项核心改进显著提升零样本OVSS性能:自相关递归注意力(self-correlating recursive attention)、自相关全局聚合(self-correlating global aggregation)以及计算高效的掩码精炼机制。由此提出的RADSeg方法在保持极低参数量(仅105M)的同时,实现了比以往多模型组合方案(850–1350M参数)更高的mIoU精度,并且推理速度提升3.95倍、参数使用减少2.5倍,达到当前最优的准确性-效率平衡。

链接: https://arxiv.org/abs/2511.19704
作者: Omar Alama,Darshil Jariwala,Avigyan Bhattacharya,Seungchan Kim,Wenshan Wang,Sebastian Scherer
机构: Carnegie Mellon University (卡内基梅隆大学); IIIT Hyderabad (印度国际技术研究所海得拉巴分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open-vocabulary semantic segmentation (OVSS) underpins many vision and robotics tasks that require generalizable semantic understanding. Existing approaches either rely on limited segmentation training data, which hinders generalization, or apply zero-shot heuristics to vision-language models (e.g CLIP), while the most competitive approaches combine multiple models to improve performance at the cost of high computational and memory demands. In this work, we leverage an overlooked agglomerative vision foundation model, RADIO, to improve zero-shot OVSS along three key axes simultaneously: mIoU, latency, and parameter efficiency. We present the first comprehensive study of RADIO for zero-shot OVSS and enhance its performance through self-correlating recursive attention, self-correlating global aggregation, and computationally efficient mask refinement. Our approach, RADSeg, achieves 6-30% mIoU improvement in the base ViT class while being 3.95x faster and using 2.5x fewer parameters. Surprisingly, RADSeg-base (105M) outperforms previous combinations of huge vision models (850-1350M) in mIoU, achieving state-of-the-art accuracy with substantially lower computational and memory cost.
zh

[CV-166] CountXplain: Interpretable Cell Counting with Prototype-Based Density Map Estimation

【速读】:该论文旨在解决深度学习模型在生物医学图像细胞计数任务中可解释性不足的问题(interpretability challenge)。其关键解决方案是提出一种基于原型(prototype-based)的方法,通过在密度估计网络中引入原型层,使模型能够学习代表细胞和背景伪影的视觉模式;该方法不仅保持了与现有方法相当的计数精度,还通过生成与每个原型最相似的图像区域来提供直观的解释,从而增强模型决策过程的透明度。

链接: https://arxiv.org/abs/2511.19686
作者: Abdurahman Ali Mohammed,Wallapak Tavanapong,Catherine Fonder,Donald S. Sakaguchi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Medical Imaging with Deep Learning 2025

点击查看摘要

Abstract:Cell counting in biomedical imaging is pivotal for various clinical applications, yet the interpretability of deep learning models in this domain remains a significant challenge. We propose a novel prototype-based method for interpretable cell counting via density map estimation. Our approach integrates a prototype layer into the density estimation network, enabling the model to learn representative visual patterns for both cells and background artifacts. The learned prototypes were evaluated through a survey of biologists, who confirmed the relevance of the visual patterns identified, further validating the interpretability of the model. By generating interpretations that highlight regions in the input image most similar to each prototype, our method offers a clear understanding of how the model identifies and counts cells. Extensive experiments on two public datasets demonstrate that our method achieves interpretability without compromising counting effectiveness. This work provides researchers and clinicians with a transparent and reliable tool for cell counting, potentially increasing trust and accelerating the adoption of deep learning in critical biomedical applications. Code is available at this https URL.
zh

[CV-167] IndEgo: A Dataset of Industrial Scenarios and Collaborative Work for Egocentric Assistants NEURIPS2025

【速读】:该论文旨在解决工业场景中多模态协作任务理解的难题,特别是针对装配/拆卸、物流与组织、检查与维修等常见工业任务,现有数据集缺乏对双人协同工作过程中多模态信息(如第一人称视角、第三人称视角、眼动、语音、动作等)的系统性采集与标注。解决方案的关键在于构建IndEgo数据集,该数据集包含3,460段第一人称记录(约197小时)和1,092段第三人称记录(约97小时),并提供精细标注(动作、摘要、错误标记、叙述)、元数据、处理后的输出(眼动、手部姿态、半稠密点云)以及针对程序性与非程序性任务理解、错误检测和基于推理的问题回答等基准测试。这一结构化且多模态的数据资源为提升工业场景下协作任务的理解能力提供了关键支撑,当前主流多模态模型在该数据集上的基线评估结果表明其仍具挑战性。

链接: https://arxiv.org/abs/2511.19684
作者: Vivek Chavan,Yasmina Imgrund,Tung Dao,Sanwantri Bai,Bosong Wang,Ze Lu,Oliver Heimann,Jörg Krüger
机构: Fraunhofer IPK (弗劳恩霍夫信息物理系统集成研究所); Technical University of Berlin (柏林工业大学); University of Tübingen (图宾根大学); RWTH Aachen University (亚琛工业大学); Leibniz University Hannover (汉诺威莱布尼茨大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Robotics (cs.RO)
备注: Accepted to NeurIPS 2025 DB Track. Project Page: this https URL

点击查看摘要

Abstract:We introduce IndEgo, a multimodal egocentric and exocentric dataset addressing common industrial tasks, including assembly/disassembly, logistics and organisation, inspection and repair, woodworking, and others. The dataset contains 3,460 egocentric recordings (approximately 197 hours), along with 1,092 exocentric recordings (approximately 97 hours). A key focus of the dataset is collaborative work, where two workers jointly perform cognitively and physically intensive tasks. The egocentric recordings include rich multimodal data and added context via eye gaze, narration, sound, motion, and others. We provide detailed annotations (actions, summaries, mistake annotations, narrations), metadata, processed outputs (eye gaze, hand pose, semi-dense point cloud), and benchmarks on procedural and non-procedural task understanding, Mistake Detection, and reasoning-based Question Answering. Baseline evaluations for Mistake Detection, Question Answering and collaborative task understanding show that the dataset presents a challenge for the state-of-the-art multimodal models. Our dataset is available at: this https URL
zh

[CV-168] INTERLACE: Interleaved Layer Pruning and Efficient Adaptation in Large Vision-Language Models

【速读】:该论文旨在解决视觉语言模型(VLMs)在层剪枝(layer pruning)过程中性能显著下降的问题。现有方法在对VLMs进行剪枝时难以保持模型性能,而本文提出INTERLACE框架,其核心创新在于采用“交错式微调-冻结”设计:通过分析连续三层的局部冗余性,移除前两层中最冗余的一层,仅对剩余层进行微调以补偿容量损失,并冻结第三层作为微调过程中的稳定锚点。这一机制使得模型在仅使用FineVision数据集1%样本微调一个epoch后即可实现快速收敛与高保真度性能保留——在剪掉25%网络参数的情况下仍达到88.9%的平均性能保留率,优于当前最优方法(SOTA)。

链接: https://arxiv.org/abs/2511.19676
作者: Parsa Madinei,Ryan Solgi,Ziqi Wen,Jonathan Skaza,Miguel Eckstein,Ramtin Pedarsani
机构: UC Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce INTERLACE, a novel framework that prunes redundant layers in VLMs while maintaining performance through sample-efficient finetuning. Existing layer pruning methods lead to significant performance drop when applied to VLMs. Instead, we analyze triplets of consecutive layers to identify local redundancy, removing the most redundant of the first two layers, finetune the remaining layer to compensate for the lost capacity, and freeze the third layer to serve as a stable anchor during finetuning. We found that this interleaved finetune-freeze design enables rapid convergence with minimal data after pruning. By finetuning only a subset of layers on just 1% of the FineVision dataset for one epoch, Interlace achieves 88.9% average performance retention after dropping 25% of the network, achieving SOTA performance. Our code is available at: this https URL
zh

[CV-169] OncoVision: Integrating Mammography and Clinical Data through Attention-Driven Multimodal AI for Enhanced Breast Cancer Diagnosis

【速读】:该论文旨在解决乳腺癌早期诊断中因影像与临床数据割裂导致的准确性不足和判读一致性差的问题。其解决方案的关键在于构建一个基于注意力机制的多模态AI流水线(OncoVision),通过联合分割四种感兴趣区域(masses, calcifications, axillary findings, and breast tissues)并预测十项结构化临床特征,实现影像与临床信息的深度融合;进一步采用两种后融合(late-fusion)策略,有效整合多模态数据优势,提升诊断精度并降低观察者间差异,从而为临床提供可信赖、可视化且易部署的实时辅助诊断工具。

链接: https://arxiv.org/abs/2511.19667
作者: Istiak Ahmed,Galib Ahmed,K. Shahriar Sanjid,Md. Tanzim Hossain,Md. Nishan Khan,Md. Misbah Khan,Md. Arifur Rahman,Sheikh Anisul Haque,Sharmin Akhtar Rupa,Mohammed Mejbahuddin Mia,Mahmud Hasan Mostofa Kamal,Md. Mostafa Kamal Sarker,M. Monir Uddin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:OncoVision is a multimodal AI pipeline that combines mammography images and clinical data for better breast cancer diagnosis. Employing an attention-based encoder-decoder backbone, it jointly segments four ROIs - masses, calcifications, axillary findings, and breast tissues - with state-of-the-art accuracy and robustly predicts ten structured clinical features: mass morphology, calcification type, ACR breast density, and BI-RADS categories. To fuse imaging and clinical insights, we developed two late-fusion strategies. By utilizing complementary multimodal data, late fusion strategies improve diagnostic precision and reduce inter-observer variability. Operationalized as a secure, user-friendly web application, OncoVision produces structured reports with dual-confidence scoring and attention-weighted visualizations for real-time diagnostic support to improve clinician trust and facilitate medical teaching. It can be easily incorporated into the clinic, making screening available in underprivileged areas around the world, such as rural South Asia. Combining accurate segmentation with clinical intuition, OncoVision raises the bar for AI-based mammography, offering a scalable and equitable solution to detect breast cancer at an earlier stage and enhancing treatment through timely interventions.
zh

[CV-170] CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization

【速读】:该论文旨在解决当前视觉-语言代理模型在执行视觉推理任务时存在的“不可信推理”问题,即模型虽然最终答案准确率高,但其工具调用(如图像裁剪、区域选择等)可能基于无关区域或忽略工具输出,仅靠猜测获得正确结果。为解决此问题,作者提出了一种基于代码的视觉代理系统 CodeV,其核心创新在于引入了工具感知策略优化(Tool-Aware Policy Optimization, TAPO),这是一种过程级强化学习框架,通过直接对视觉工具输入与输出设计密集奖励信号,而非依赖链式思维(chain-of-thought)token,从而实现更易验证且抗奖励黑客攻击的监督机制。TAPO 鼓励模型在每一步都进行必要且证据一致的工具操作,显著提升了工具使用的可信度,同时在视觉搜索及其他多模态推理任务中保持或超越现有方法的准确性。

链接: https://arxiv.org/abs/2511.19661
作者: Xinhai Hou,Shaoyuan Xu,Manan Biyani,Mayan Li,Jia Liu,Todd C. Hollon,Bryan Wang
机构: University of Michigan (密歇根大学); Amazon.com (亚马逊); The Ohio State University (俄亥俄州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Agentic vision-language models are increasingly trained to “think with images” by calling image operations. However, we show that high final-answer accuracy often hides unfaithful visual reasoning: models may invoke tools on irrelevant regions or ignore tool outputs entirely, yet still guess the correct answer. In this work, we first propose a faithfulness evaluation protocol that measures whether intermediate visual tool outputs (e.g., crops) actually contain the queried evidence. This reveals that recent visual agents achieve high final-answer accuracy but exhibit low rates of faithful tool-use on visual search benchmarks. We then introduce CodeV, a code-based visual agent trained with Tool-Aware Policy Optimization (TAPO). TAPO is a process-level RL framework that augments GRPO with dense rewards defined directly on visual tool inputs and outputs, rather than on chain-of-thought tokens, making supervision easier to verify and less susceptible to reward hacking. CodeV represents visual tools as executable Python code, and TAPO assigns step-wise rewards based solely on the question and tool output, encouraging both necessary and evidence-consistent tool use. In a two-stage SFT+RL pipeline, CodeV achieves competitive or superior accuracy while substantially increasing faithful tool-use rates on related visual search benchmarks. Beyond visual search, CodeV attains strong performance on a range of multimodal reasoning and math benchmarks, suggesting that explicitly supervising intermediate tool behavior is crucial for building trustworthy, agentic visual reasoning systems.
zh

[CV-171] Navigating Gigapixel Pathology Images with Large Multimodal Models

【速读】:该论文旨在解决通用大模型(如生成式AI)在病理图像分析中表现不佳的问题,特别是针对高分辨率全切片图像(Whole-Slide Images, WSIs)的复杂推理任务。传统方法多依赖低分辨率缩略图或随机图像块(patch),导致模型无法充分理解组织结构和空间上下文信息,从而低估了其潜在性能。解决方案的关键在于提出首个支持迭代导航WSI的框架——Gigapixel Image Agent for Navigating Tissue (GIANT),它使大模型能够像病理学家一样逐步探索、聚焦并推理整个图像内容;同时构建了MultiPathQA基准,涵盖934个WSI级问题及128个由专业病理医师撰写的直接滑片解读题,用于系统评估模型在癌症诊断到开放性推理等临床相关任务中的表现。实验表明,结合GIANT的简单代理系统显著优于传统基线,并接近甚至超越需数百万张图像训练的专业模型。

链接: https://arxiv.org/abs/2511.19652
作者: Thomas A. Buckley,Kian R. Weihrauch,Katherine Latham,Andrew Z. Zhou,Padmini A. Manrai,Arjun K. Manrai
机构: Harvard Medical School (哈佛医学院); Massachusetts General Hospital (马萨诸塞州总医院); Brown University (布朗大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Despite being widely used to support clinical care, general-purpose large multimodal models (LMMs) have generally shown poor or inconclusive performance in medical image interpretation, particularly in pathology, where gigapixel images are used. However, prior studies have used either low-resolution thumbnails or random patches, which likely underestimated model performance. Here, we ask whether LMMs can be adapted to reason coherently and accurately in the evaluation of such images. In this study, we introduce Gigapixel Image Agent for Navigating Tissue (GIANT), the first framework that allows LMMs to iteratively navigate whole-slide images (WSIs) like a pathologist. Accompanying GIANT, we release MultiPathQA, a new benchmark, which comprises 934 WSI-level questions, encompassing five clinically-relevant tasks ranging from cancer diagnosis to open-ended reasoning. MultiPathQA also includes 128 questions, authored by two professional pathologists, requiring direct slide interpretation. Using MultiPathQA, we show that our simple agentic system substantially outperforms conventional patch- and thumbnail-based baselines, approaching or surpassing the performance of specialized models trained on millions of images. For example, on pathologist-authored questions, GPT-5 with GIANT achieves 62.5% accuracy, outperforming specialist pathology models such as TITAN (43.8%) and SlideChat (37.5%). Our findings reveal the strengths and limitations of current foundation models and ground future development of LMMs for expert reasoning in pathology.
zh

[CV-172] On the Utility of Foundation Models for Fast MRI: Vision-Language-Guided Image Reconstruction

【速读】:该论文旨在解决欠采样磁共振成像(undersampled MRI)重建中因传统先验信息不足而导致的细节丢失与感知质量下降问题。其解决方案的关键在于提出了一种语义分布引导的重建框架(semantic distribution-guided reconstruction framework),利用预训练的视觉-语言基础模型(vision-language foundation model)将重建图像与辅助信息编码为高层语义特征,并通过对比目标(contrastive objective)使重建表示对齐于目标语义分布,从而在保持数据保真度的同时,引入高阶语义先验以提升图像的结构保真度和感知质量。该方法可灵活融合来自单模态(仅图像)或多模态(图像-文本)的语义先验,显著优于传统正则化策略。

链接: https://arxiv.org/abs/2511.19641
作者: Ruimin Feng,Xingxin He,Ronald Mercer,Zachary Stewart,Fang Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Purpose: To investigate whether a vision-language foundation model can enhance undersampled MRI reconstruction by providing high-level contextual information beyond conventional priors. Methods: We proposed a semantic distribution-guided reconstruction framework that uses a pre-trained vision-language foundation model to encode both the reconstructed image and auxiliary information into high-level semantic features. A contrastive objective aligns the reconstructed representation with the target semantic distribution, ensuring consistency with high-level perceptual cues. The proposed objective works with various deep learning-based reconstruction methods and can flexibly incorporate semantic priors from multimodal sources. To test the effectiveness of these semantic priors, we evaluated reconstruction results guided by priors derived from either image-only or image-language auxiliary information. Results: Experiments on knee and brain datasets demonstrate that semantic priors from images preserve fine anatomical structures and achieve superior perceptual quality, as reflected in lower LPIPS values, higher Tenengrad scores, and improved scores in the reader study, compared with conventional regularization. The image-language information further expands the semantic distribution and enables high-level control over reconstruction attributes. Across all evaluations, the contrastive objective consistently guided the reconstructed features toward the desired semantic distributions while maintaining data fidelity, demonstrating the effectiveness of the proposed optimization framework. Conclusion: The study highlights that vision-language foundation models can improve undersampled MRI reconstruction through semantic-space optimization.
zh

[CV-173] SkillSight: Efficient First-Person Skill Assessment with Gaze

【速读】:该论文旨在解决从第一人称视角(egocentric)数据中实现高效技能评估的技术难题,尤其针对智能眼镜等边缘设备上的低功耗应用场景。其核心解决方案在于提出了一种两阶段框架——SkillSight,关键创新点是首次实证表明:技能水平不仅体现在个体执行动作的视频内容中,还体现在其执行过程中的注视行为(gaze)中。通过联合建模注视与第一人称视频来训练教师模型以获得最优性能,再将知识蒸馏至仅依赖注视输入的学生模型,从而在推理阶段无需持续处理视频流,显著降低功耗(减少73倍),同时保持高精度,为野外环境下基于AI的技能学习提供了可行路径。

链接: https://arxiv.org/abs/2511.19629
作者: Chi Hsuan Wu,Kumar Ashutosh,Kristen Grauman
机构: UT Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Egocentric perception on smart glasses could transform how we learn new skills in the physical world, but automatic skill assessment remains a fundamental technical challenge. We introduce SkillSight for power-efficient skill assessment from first-person data. Central to our approach is the hypothesis that skill level is evident not only in how a person performs an activity (video), but also in how they direct their attention when doing so (gaze). Our two-stage framework first learns to jointly model gaze and egocentric video when predicting skill level, then distills a gaze-only student model. At inference, the student model requires only gaze input, drastically reducing power consumption by eliminating continuous video processing. Experiments on three datasets spanning cooking, music, and sports establish, for the first time, the valuable role of gaze in skill understanding across diverse real-world settings. Our SkillSight teacher model achieves state-of-the-art performance, while our gaze-only student variant maintains high accuracy using 73x less power than competing methods. These results pave the way for in-the-wild AI-supported skill learning.
zh

[CV-174] Learning Massively Multitask World Models for Continuous Control WWW

【速读】:该论文旨在解决通用控制(general-purpose control)中在线强化学习(online reinforcement learning, RL)难以扩展至多任务场景的问题,即当前连续控制领域的研究仍以单任务或离线训练为主,导致在线RL在复杂多任务环境下缺乏可扩展性。其解决方案的关键在于提出一种基于“基础模型范式”(foundation model recipe)的新方法:首先利用大规模演示数据对世界模型进行预训练,获得任务感知的表征和动作先验(action priors),随后通过跨所有任务的在线交互进行联合优化。这一策略显著提升了多任务性能与数据效率,并实现了对未见任务的快速适应能力。

链接: https://arxiv.org/abs/2511.19584
作者: Nicklas Hansen,Hao Su,Xiaolong Wang
机构: University of California San Diego(加州大学圣地亚哥分校)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Webpage: this https URL

点击查看摘要

Abstract:General-purpose control demands agents that act across many tasks and embodiments, yet research on reinforcement learning (RL) for continuous control remains dominated by single-task or offline regimes, reinforcing a view that online RL does not scale. Inspired by the foundation model recipe (large-scale pretraining followed by light RL) we ask whether a single agent can be trained on hundreds of tasks with online interaction. To accelerate research in this direction, we introduce a new benchmark with 200 diverse tasks spanning many domains and embodiments, each with language instructions, demonstrations, and optionally image observations. We then present \emphNewt, a language-conditioned multitask world model that is first pretrained on demonstrations to acquire task-aware representations and action priors, and then jointly optimized with online interaction across all tasks. Experiments show that Newt yields better multitask performance and data-efficiency than a set of strong baselines, exhibits strong open-loop control, and enables rapid adaptation to unseen tasks. We release our environments, demonstrations, code for training and evaluation, as well as 200+ checkpoints.
zh

[CV-175] Multiscale Vector-Quantized Variational Autoencoder for Endoscopic Image Synthesis

【速读】:该论文旨在解决无线胶囊内镜(Wireless Capsule Endoscopy, WCE)图像分析中因真实标注数据稀缺而导致临床决策支持(Clinical Decision Support, CDS)系统训练受限的问题。其关键解决方案是提出一种基于向量量化变分自编码器(Vector Quantized Variational Autoencoder, VQ-VAE)的多尺度扩展模型——多尺度向量量化变分自编码器(Multiscale Vector Quantized Variational Autoencoder, MSVQ-VAE),该方法不仅能够生成高质量的正常WCE图像,还可通过条件生成机制无缝引入多种类型的异常(如息肉、血管病变和炎症),从而有效扩充训练数据集,并在图像分类任务中验证了生成数据对CDS模型性能的提升效果,与仅使用真实数据训练的结果相当。

链接: https://arxiv.org/abs/2511.19578
作者: Dimitrios E. Diamantis,Dimitris K. Iakovidis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in IEEE Int. Conf. Imaging Systems and Techniques (IST 2025), Strasburg, France

点击查看摘要

Abstract:Gastrointestinal (GI) imaging via Wireless Capsule Endoscopy (WCE) generates a large number of images requiring manual screening. Deep learning-based Clinical Decision Support (CDS) systems can assist screening, yet their performance relies on the existence of large, diverse, training medical datasets. However, the scarcity of such data, due to privacy constraints and annotation costs, hinders CDS development. Generative machine learning offers a viable solution to combat this limitation. While current Synthetic Data Generation (SDG) methods, such as Generative Adversarial Networks and Variational Autoencoders have been explored, they often face challenges with training stability and capturing sufficient visual diversity, especially when synthesizing abnormal findings. This work introduces a novel VAE-based methodology for medical image synthesis and presents its application for the generation of WCE images. The novel contributions of this work include a) multiscale extension of the Vector Quantized VAE model, named as Multiscale Vector Quantized Variational Autoencoder (MSVQ-VAE); b) unlike other VAE-based SDG models for WCE image generation, MSVQ-VAE is used to seamlessly introduce abnormalities into normal WCE images; c) it enables conditional generation of synthetic images, enabling the introduction of different types of abnormalities into the normal WCE images; d) it performs experiments with a variety of abnormality types, including polyps, vascular and inflammatory conditions. The utility of the generated images for CDS is assessed via image classification. Comparative experiments demonstrate that training a CDS classifier using the abnormal images generated by the proposed methodology yield comparable results with a classifier trained with only real data. The generality of the proposed methodology promises its applicability to various domains related to medical multimedia.
zh

[CV-176] Leverag ing Unlabeled Scans for NCCT Image Segmentation in Early Stroke Diagnosis: A Semi-Supervised GAN Approach

【速读】:该论文旨在解决急性缺血性卒中(acute ischemic stroke)早期诊断中非增强计算机断层扫描(NCCT)难以识别细微梗死灶的问题,从而导致干预延迟。解决方案的关键在于提出一种基于生成对抗网络(GAN)的半监督分割方法,通过结合少量标注数据与大量未标注数据,在Dice损失、交叉熵损失、特征匹配损失和自训练损失的联合优化下,有效学习早期梗死区域的特征表示,实现对微小或低对比度病灶的精准分割,提升诊断效率并减轻人工标注负担。

链接: https://arxiv.org/abs/2511.19576
作者: Maria Thoma,Michalis A. Savelonas,Dimitris K. Iakovidis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ischemic stroke is a time-critical medical emergency where rapid diagnosis is essential for improving patient outcomes. Non-contrast computed tomography (NCCT) serves as the frontline imaging tool, yet it often fails to reveal the subtle ischemic changes present in the early, hyperacute phase. This limitation can delay crucial interventions. To address this diagnostic challenge, we introduce a semi-supervised segmentation method using generative adversarial networks (GANs) to accurately delineate early ischemic stroke regions. The proposed method employs an adversarial framework to effectively learn from a limited number of annotated NCCT scans, while simultaneously leveraging a larger pool of unlabeled scans. By employing Dice loss, cross-entropy loss, a feature matching loss and a self-training loss, the model learns to identify and delineate early infarcts, even when they are faint or their size is small. Experiments on the publicly available Acute Ischemic Stroke Dataset (AISD) demonstrate the potential of the proposed method to enhance diagnostic capabilities, reduce the burden of manual annotation, and support more efficient clinical decision-making in stroke care.
zh

[CV-177] HunyuanOCR Technical Report

【速读】:该论文旨在解决传统OCR(光学字符识别)系统中存在的三大核心问题:一是模型功能单一,难以兼顾文本检测(Text Spotting)、解析(Parsing)、信息抽取(IE)等多任务需求;二是流水线式架构依赖复杂的预处理模块(如版面分析),导致误差传播严重且部署复杂;三是现有轻量级视觉语言模型(VLM)在性能上难以媲美大模型或商业API。其解决方案的关键在于:首先,构建一个统一的轻量化VLM框架(HunyuanOCR,1B参数),通过MLP适配器连接ViT与轻量LLM,实现多种OCR相关任务的端到端联合优化;其次,采用纯端到端架构摒弃外部预处理模块,从根本上减少误差累积并提升部署效率;最后,结合高质量数据策略与业界首次应用于OCR任务的强化学习(Reinforcement Learning, RL)方法,在ICDAR 2025 DIMT Challenge和OCRBench等多个基准上取得SOTA结果,显著优于同类轻量模型及商用API。

链接: https://arxiv.org/abs/2511.19575
作者: Hunyuan Vision Team,Pengyuan Lyu,Xingyu Wan,Gengluo Li,Shangpin Peng,Weinong Wang,Liang Wu,Huawen Shen,Yu Zhou,Canhui Tang,Qi Yang,Qiming Peng,Bin Luo,Hower Yang,Houwen Peng,Hongming Yang,Senhao Xie,Binghong Wu,Mana Yang,Sergey Wang,Raccoon Liu,Dick Zhu,Jie Jiang,Linus,Han Hu,Chengquan Zhang
机构: Tencent Hunyuan Vision Team (腾讯混元视觉团队)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents HunyuanOCR, a commercial-grade, open-source, and lightweight (1B parameters) Vision-Language Model (VLM) dedicated to OCR tasks. The architecture comprises a Native Vision Transformer (ViT) and a lightweight LLM connected via an MLP adapter. HunyuanOCR demonstrates superior performance, outperforming commercial APIs, traditional pipelines, and larger models (e.g., Qwen3-VL-4B). Specifically, it surpasses current public solutions in perception tasks (Text Spotting, Parsing) and excels in semantic tasks (IE, Text Image Translation), securing first place in the ICDAR 2025 DIMT Challenge (Small Model Track). Furthermore, it achieves state-of-the-art (SOTA) results on OCRBench among VLMs with fewer than 3B parameters. HunyuanOCR achieves breakthroughs in three key aspects: 1) Unifying Versatility and Efficiency: We implement comprehensive support for core capabilities including spotting, parsing, IE, VQA, and translation within a lightweight framework. This addresses the limitations of narrow “OCR expert models” and inefficient “General VLMs”. 2) Streamlined End-to-End Architecture: Adopting a pure end-to-end paradigm eliminates dependencies on pre-processing modules (e.g., layout analysis). This fundamentally resolves error propagation common in traditional pipelines and simplifies system deployment. 3) Data-Driven and RL Strategies: We confirm the critical role of high-quality data and, for the first time in the industry, demonstrate that Reinforcement Learning (RL) strategies yield significant performance gains in OCR tasks. HunyuanOCR is officially open-sourced on HuggingFace. We also provide a high-performance deployment solution based on vLLM, placing its production efficiency in the top tier. We hope this model will advance frontier research and provide a solid foundation for industrial applications. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.19575 [cs.CV] (or arXiv:2511.19575v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2511.19575 Focus to learn more arXiv-issued DOI via DataCite
zh

[CV-178] Merging without Forgetting: Continual Fusion of Task-Specific Models via Optimal Transport

【速读】:该论文旨在解决多任务模型融合过程中因直接参数插值(parameter interpolation)导致的特征空间分布偏移问题,该偏移会损害各任务特定的知识保留。解决方案的关键在于提出基于最优传输理论的掩码融合方法(OTMF),通过发现任务向量间的最优传输计划来识别共用掩码,从而对齐任务模型的语义几何结构;该掩码能选择性地提取跨任务通用组件并保留每个任务的独特结构特征,同时支持增量式持续融合机制,在不回溯历史任务的前提下实现高效、低内存开销的多任务模型合并。

链接: https://arxiv.org/abs/2511.19561
作者: Zecheng Pan,Zhikang Chen,Ding Li,Min Zhang,Sen Cui,Hongshuo Jin,Luqi Tao,Yi Yang,Deheng Ye,Yu Zhang,Tingting Zhu,Tianling Ren
机构: Tsinghua University (清华大学); University of Oxford (牛津大学); East China Normal University (华东师范大学); Zhejiang University (浙江大学); Tencent (腾讯); Southern University of Science and Technology (南方科技大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Merging models fine-tuned for different tasks into a single unified model has become an increasingly important direction for building versatile, efficient multi-task systems. Existing approaches predominantly rely on parameter interpolation in weight space, which we show introduces significant distribution shift in the feature space and undermines task-specific knowledge. In this paper, we propose OTMF (Optimal Transport-based Masked Fusion), a novel model merging framework rooted in optimal transport theory to address the distribution shift that arises from naive parameter interpolation. Instead of directly aggregating features or weights, OTMF aligns the semantic geometry of task-specific models by discovering common masks applied to task vectors through optimal transport plans. These masks selectively extract transferable and task-agnostic components while preserving the unique structural identities of each task. To ensure scalability in real-world settings, OTMF further supports a continual fusion paradigm that incrementally integrates each new task vector without revisiting previous ones, maintaining a bounded memory footprint and enabling efficient fusion across a growing number of tasks. We conduct comprehensive experiments on multiple vision and language benchmarks, and results show that OTMF achieves state-of-the-art performance in terms of both accuracy and efficiency. These findings highlight the practical and theoretical value of our approach to model merging.
zh

[CV-179] SPQR: A Standardized Benchmark for Modern Safety Alignment Methods in Text-to-Image Diffusion Models

【速读】:该论文旨在解决当前文本到图像扩散模型(text-to-image diffusion models)在经过良性下游微调(如LoRA个性化、风格/领域适配器)后,安全对齐(safety alignment)效果容易失效的问题。现有评估方法未充分测试此类场景下的安全性稳定性,导致实际部署中可能仍存在生成受版权保护、不安全或隐私敏感内容的风险。解决方案的关键在于提出SPQR基准(Safety-Prompt adherence-Quality-Robustness),这是一个单分数指标框架,能够标准化地评估模型在良性微调下保持安全性、实用性与鲁棒性的能力,并通过统一的排行榜得分促进不同安全对齐技术的比较。

链接: https://arxiv.org/abs/2511.19558
作者: Mohammed Talha Alam,Nada Saadi,Fahad Shamshad,Nils Lukas,Karthik Nandakumar,Fahkri Karray,Samuele Poppi
机构: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); Michigan State University; University of Waterloo
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 20 pages, 8 figures, 10 tables

点击查看摘要

Abstract:Text-to-image diffusion models can emit copyrighted, unsafe, or private content. Safety alignment aims to suppress specific concepts, yet evaluations seldom test whether safety persists under benign downstream fine-tuning routinely applied after deployment (e.g., LoRA personalization, style/domain adapters). We study the stability of current safety methods under benign fine-tuning and observe frequent breakdowns. As true safety alignment must withstand even benign post-deployment adaptations, we introduce the SPQR benchmark (Safety-Prompt adherence-Quality-Robustness). SPQR is a single-scored metric that provides a standardized and reproducible framework to evaluate how well safety-aligned diffusion models preserve safety, utility, and robustness under benign fine-tuning, by reporting a single leaderboard score to facilitate comparisons. We conduct multilingual, domain-specific, and out-of-distribution analyses, along with category-wise breakdowns, to identify when safety alignment fails after benign fine-tuning, ultimately showcasing SPQR as a concise yet comprehensive benchmark for T2I safety alignment techniques for T2I models.
zh

[CV-180] hink First Assign Next (ThiFAN-VQA): A Two-stage Chain-of-Thought Framework for Post-Disaster Damage Assessment

【速读】:该论文旨在解决灾后损伤评估中AI模型训练数据稀缺、标注成本高以及传统分类框架答案空间固定导致信息获取受限的问题。解决方案的关键在于提出ThiFAN-VQA,一个基于两阶段推理的视觉问答(VQA)框架:首先利用链式思维(Chain-of-Thought, CoT)提示与上下文学习(In-Context Learning, ICL)生成结构化推理路径,实现有限监督下的可解释推理;随后通过答案选择模块对生成响应进行评估和筛选,确保输出的连贯性与场景相关性,从而在零样本与监督方法之间建立平衡,提升模型在真实灾害场景中的准确性、可解释性和适应性。

链接: https://arxiv.org/abs/2511.19557
作者: Ehsan Karimi,Nhut Le,Maryam Rahnemoonfar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Timely and accurate assessment of damages following natural disasters is essential for effective emergency response and recovery. Recent AI-based frameworks have been developed to analyze large volumes of aerial imagery collected by Unmanned Aerial Vehicles, providing actionable insights rapidly. However, creating and annotating data for training these models is costly and time-consuming, resulting in datasets that are limited in size and diversity. Furthermore, most existing approaches rely on traditional classification-based frameworks with fixed answer spaces, restricting their ability to provide new information without additional data collection or model retraining. Using pre-trained generative models built on in-context learning (ICL) allows for flexible and open-ended answer spaces. However, these models often generate hallucinated outputs or produce generic responses that lack domain-specific relevance. To address these limitations, we propose ThiFAN-VQA, a two-stage reasoning-based framework for visual question answering (VQA) in disaster scenarios. ThiFAN-VQA first generates structured reasoning traces using chain-of-thought (CoT) prompting and ICL to enable interpretable reasoning under limited supervision. A subsequent answer selection module evaluates the generated responses and assigns the most coherent and contextually accurate answer, effectively improve the model performance. By integrating a custom information retrieval system, domain-specific prompting, and reasoning-guided answer selection, ThiFAN-VQA bridges the gap between zero-shot and supervised methods, combining flexibility with consistency. Experiments on FloodNet and RescueNet-VQA, UAV-based datasets from flood- and hurricane-affected regions, demonstrate that ThiFAN-VQA achieves superior accuracy, interpretability, and adaptability for real-world post-disaster damage assessment tasks.
zh

[CV-181] Proxy-Free Gaussian Splats Deformation with Splat-Based Surface Estimation

【速读】:该论文旨在解决现有高斯 splatting (Gaussian splats, GS) 变形方法依赖变形代理(如笼状结构或网格)所带来的质量敏感性和额外计算开销问题,以及直接将 splat 视为点云进行拉普拉斯(Laplacian)变形时因缺乏显式表面结构而导致的细节和拓扑信息丢失问题。解决方案的关键在于提出一种无需代理的变形方法 SpLap,其核心创新是构建了一个表面感知的 splat 图(surface-aware splat graph),通过定义基于 splat 交集而非中心距离的邻接关系来更准确地捕捉几何结构;同时引入高斯核自适应技术,在变形过程中保持表面结构完整性,从而提升变形后的渲染质量。

链接: https://arxiv.org/abs/2511.19542
作者: Jaeyeong Kim,Seungwoo Yoo,Minhyuk Sung
机构: KAIST
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, Accepted to 3DV 2026 (IEEE/CVF International Conference on 3D Vision)

点击查看摘要

Abstract:We introduce SpLap, a proxy-free deformation method for Gaussian splats (GS) based on a Laplacian operator computed from our novel surface-aware splat graph. Existing approaches to GS deformation typically rely on deformation proxies such as cages or meshes, but they suffer from dependency on proxy quality and additional computational overhead. An alternative is to directly apply Laplacian-based deformation techniques by treating splats as point clouds. However, this often fail to properly capture surface information due to lack of explicit structure. To address this, we propose a novel method that constructs a surface-aware splat graph, enabling the Laplacian operator derived from it to support more plausible deformations that preserve details and topology. Our key idea is to leverage the spatial arrangement encoded in splats, defining neighboring splats not merely by the distance between their centers, but by their intersections. Furthermore, we introduce a Gaussian kernel adaptation technique that preserves surface structure under deformation, thereby improving rendering quality after deformation. In our experiments, we demonstrate the superior performance of our method compared to both proxy-based and proxy-free baselines, evaluated on 50 challenging objects from the ShapeNet, Objaverse, and Sketchfab datasets, as well as the NeRF-Synthetic dataset. Code is available at this https URL.
zh

[CV-182] Cross-Domain Generalization of Multimodal LLM s for Global Photovoltaic Assessment

【速读】:该论文旨在解决分布式光伏(Distributed Photovoltaic, PV)系统在全球范围内的监测与评估难题,尤其是由于大量安装未被记录导致的电网管理挑战。传统计算机视觉(Computer Vision, CV)模型如卷积神经网络(Convolutional Neural Networks, CNNs)和U-Net依赖大量标注数据且跨区域泛化能力差,难以实现全球尺度的高效识别与量化。解决方案的关键在于利用多模态大语言模型(Multimodal Large Language Model, LLM)的能力,通过结构化提示(structured prompts)和微调策略,将检测、定位与量化任务统一建模于一个框架中,从而显著提升在未见区域上的性能稳定性,展现出更强的域迁移鲁棒性与可解释性,为全球光伏地图构建提供了可扩展、可迁移的技术路径。

链接: https://arxiv.org/abs/2511.19537
作者: Muhao Guo,Yang Weng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 5 pages, 7 figures

点击查看摘要

Abstract:The rapid expansion of distributed photovoltaic (PV) systems poses challenges for power grid management, as many installations remain undocumented. While satellite imagery provides global coverage, traditional computer vision (CV) models such as CNNs and U-Nets require extensive labeled data and fail to generalize across regions. This study investigates the cross-domain generalization of a multimodal large language model (LLM) for global PV assessment. By leveraging structured prompts and fine-tuning, the model integrates detection, localization, and quantification within a unified schema. Cross-regional evaluation using the \Delta F1 metric demonstrates that the proposed model achieves the smallest performance degradation across unseen regions, outperforming conventional CV and transformer baselines. These results highlight the robustness of multimodal LLMs under domain shift and their potential for scalable, transferable, and interpretable global PV mapping.
zh

[CV-183] Vidi2: Large Multimodal Models for Video Understanding and Creation

【速读】:该论文旨在解决视频理解中细粒度时空定位(Spatio-Temporal Grounding, STG)与视频问答(Video QA)的挑战,以支持更复杂、精准的视频内容编辑与交互应用。其解决方案的关键在于提出Vidi2模型,该模型具备端到端的时空定位能力,能够在给定文本查询时精确识别目标对象的时间戳及其在视频帧中的边界框(bounding boxes),从而实现对视频内容的细粒度语义解析。此外,研究构建了新的基准数据集VUE-STG,通过扩展视频时长范围、优化查询格式、提升标注质量及引入改进的评估指标(vIoU/tIoU/vIoU-Intersection),显著提升了STG任务的评测实用性与严谨性,使Vidi2在多项指标上超越主流商用系统(如Gemini 3 Pro和GPT-5)并保持与开源模型相当的性能表现。

链接: https://arxiv.org/abs/2511.19529
作者: Vidi Team,Celong Liu,Chia-Wen Kuo,Chuang Huang,Dawei Du,Fan Chen,Guang Chen,Haoji Zhang,Haojun Zhao,Lingxi Zhang,Lu Guo,Lusha Li,Longyin Wen,Qihang Fan,Qingyu Chen,Rachel Deng,Sijie Zhu,Stuart Siew,Tong Jin,Weiyan Tao,Wen Zhong,Xiaohui Shen,Xin Gu,Zhenfang Chen,Zuhua Lin
机构: ByteDance Inc. (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video has emerged as the primary medium for communication and creativity on the Internet, driving strong demand for scalable, high-quality video production. Vidi models continue to evolve toward next-generation video creation and have achieved state-of-the-art performance in multimodal temporal retrieval (TR). In its second release, Vidi2 advances video understanding with fine-grained spatio-temporal grounding (STG) and extends its capability to video question answering (Video QA), enabling comprehensive multimodal reasoning. Given a text query, Vidi2 can identify not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. This end-to-end spatio-temporal grounding capability enables potential applications in complex editing scenarios, such as plot or character understanding, automatic multi-view switching, and intelligent, composition-aware reframing and cropping. To enable comprehensive evaluation of STG in practical settings, we introduce a new benchmark, VUE-STG, which offers four key improvements over existing STG datasets: 1) Video duration: spans from roughly 10s to 30 mins, enabling long-context reasoning; 2) Query format: queries are mostly converted into noun phrases while preserving sentence-level expressiveness; 3) Annotation quality: all ground-truth time ranges and bounding boxes are manually annotated with high accuracy; 4) Evaluation metric: a refined vIoU/tIoU/vIoU-Intersection scheme. In addition, we upgrade the previous VUE-TR benchmark to VUE-TR-V2, achieving a more balanced video-length distribution and more user-style queries. Remarkably, the Vidi2 model substantially outperforms leading proprietary systems, such as Gemini 3 Pro (Preview) and GPT-5, on both VUE-TR-V2 and VUE-STG, while achieving competitive results with popular open-source models with similar scale on video QA benchmarks.
zh

[CV-184] MapRF: Weakly Supervised Online HD Map Construction via NeRF-Guided Self-Training

【速读】:该论文旨在解决自主驾驶系统中高精地图(High-Definition Map, HD Map)在线构建时依赖昂贵3D标注数据导致的泛化性差与可扩展性受限的问题。其核心解决方案是提出一种弱监督框架MapRF,关键在于利用仅有的2D图像标签,通过引入一个基于地图预测条件化的神经辐射场(Neural Radiance Fields, NeRF)模块生成高质量伪标签,进而以自训练方式迭代优化地图网络;同时设计了Map-to-Ray Matching策略,通过将地图预测与由2D标签推导出的相机射线对齐,有效缓解自训练过程中的误差累积问题,从而实现无需额外3D标注即可逼近全监督方法性能的高效HD地图构建。

链接: https://arxiv.org/abs/2511.19527
作者: Hongyu Lyu,Thomas Monninger,Julie Stephany Berrio Perez,Mao Shan,Zhenxing Ming,Stewart Worrall
机构: The University of Sydney (悉尼大学); Mercedes-Benz Research & Development North America (梅赛德斯-奔驰北美研发公司); University of Stuttgart (斯图加特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Autonomous driving systems benefit from high-definition (HD) maps that provide critical information about road infrastructure. The online construction of HD maps offers a scalable approach to generate local maps from on-board sensors. However, existing methods typically rely on costly 3D map annotations for training, which limits their generalization and scalability across diverse driving environments. In this work, we propose MapRF, a weakly supervised framework that learns to construct 3D maps using only 2D image labels. To generate high-quality pseudo labels, we introduce a novel Neural Radiance Fields (NeRF) module conditioned on map predictions, which reconstructs view-consistent 3D geometry and semantics. These pseudo labels are then iteratively used to refine the map network in a self-training manner, enabling progressive improvement without additional supervision. Furthermore, to mitigate error accumulation during self-training, we propose a Map-to-Ray Matching strategy that aligns map predictions with camera rays derived from 2D labels. Extensive experiments on the Argoverse 2 and nuScenes datasets demonstrate that MapRF achieves performance comparable to fully supervised methods, attaining around 75% of the baseline while surpassing several approaches using only 2D labels. This highlights the potential of MapRF to enable scalable and cost-effective online HD map construction for autonomous driving.
zh

[CV-185] Perceptual Taxonomy: Evaluating and Guiding Hierarchical Scene Reasoning in Vision-Language Models

【速读】:该论文旨在解决当前视觉语言模型在物理 grounded 视觉推理能力上的不足,尤其是对场景中物体及其空间配置的识别后,进一步推断任务相关属性(如材质、可操作性、功能和物理特性)的能力缺失问题。现有基准主要关注表面级识别或图像-文本对齐,缺乏对结构化感知分类(Perceptual Taxonomy)的系统评估。其解决方案的关键在于提出一个名为 Perceptual Taxonomy 的新基准,通过标注 3173 个对象的 84 个细粒度属性(涵盖四类属性家族),构建包含 5802 张图像的多选题数据集,涵盖四种推理类型(物体描述、空间推理、属性匹配和分类推理),并引入专家设计的 50 道题目以全面评估模型在多层次感知推理中的表现。实验表明,主流视觉语言模型在属性驱动的问题上性能下降 10–20%,尤其在多步结构化属性推理中表现不佳,而基于模拟场景的上下文提示策略可有效提升真实世界与专家题目上的表现,验证了感知分类引导提示的有效性。

链接: https://arxiv.org/abs/2511.19526
作者: Jonathan Lee,Xingrui Wang,Jiawei Peng,Luoxin Ye,Zehan Zheng,Tiezheng Zhang,Tao Wang,Wufei Ma,Siyi Chen,Yu-Cheng Chou,Prakhar Kaushik,Alan Yuille
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose Perceptual Taxonomy, a structured process of scene understanding that first recognizes objects and their spatial configurations, then infers task-relevant properties such as material, affordance, function, and physical attributes to support goal-directed reasoning. While this form of reasoning is fundamental to human cognition, current vision-language benchmarks lack comprehensive evaluation of this ability and instead focus on surface-level recognition or image-text alignment. To address this gap, we introduce Perceptual Taxonomy, a benchmark for physically grounded visual reasoning. We annotate 3173 objects with four property families covering 84 fine-grained attributes. Using these annotations, we construct a multiple-choice question benchmark with 5802 images across both synthetic and real domains. The benchmark contains 28033 template-based questions spanning four types (object description, spatial reasoning, property matching, and taxonomy reasoning), along with 50 expert-crafted questions designed to evaluate models across the full spectrum of perceptual taxonomy reasoning. Experimental results show that leading vision-language models perform well on recognition tasks but degrade by 10 to 20 percent on property-driven questions, especially those requiring multi-step reasoning over structured attributes. These findings highlight a persistent gap in structured visual understanding and the limitations of current models that rely heavily on pattern matching. We also show that providing in-context reasoning examples from simulated scenes improves performance on real-world and expert-curated questions, demonstrating the effectiveness of perceptual-taxonomy-guided prompting. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2511.19526 [cs.CV] (or arXiv:2511.19526v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2511.19526 Focus to learn more arXiv-issued DOI via DataCite
zh

[CV-186] Shortcut Invariance: Targeted Jacobian Regularization in Disentangled Latent Space

【速读】:该论文旨在解决深度神经网络在训练过程中学习到数据中的捷径(shortcut)信号、虚假相关性及易学特征,从而导致分布外(out-of-distribution, OOD)泛化性能严重下降的问题。其解决方案的关键在于不依赖于显式分离鲁棒表征,而是通过学习一个函数上不变的分类器来实现鲁棒性:在解耦的潜在空间中识别与标签强相关的候选捷径特征(作为语义简单性的代理),并在训练中注入定向的各向异性潜噪声,使分类器对这些特征不敏感;该方法可被理论解释为一种目标雅可比正则化(targeted Jacobian regularization),促使模型忽略捷径特征并依赖更复杂的本质语义信号,最终在主流捷径学习基准上达到当前最优的OOD性能。

链接: https://arxiv.org/abs/2511.19525
作者: Shivam Pal,Sakshi Varshney,Piyush Rai
机构: IIT Kanpur (印度理工学院坎普尔分校); ARF, IIT Kanpur (空气研究基金会,印度理工学院坎普尔分校)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Deep neural networks are prone to learning shortcuts, spurious and easily learned correlations in training data that cause severe failures in out-of-distribution (OOD) generalization. A dominant line of work seeks robustness by learning a robust representation, often explicitly partitioning the latent space into core and spurious components; this approach can be complex, brittle, and difficult to scale. We take a different approach, instead of a robust representation, we learn a robust function. We present a simple and effective training method that renders the classifier functionally invariant to shortcut signals. Our method operates within a disentangled latent space, which is essential as it isolates spurious and core features into distinct dimensions. This separation enables the identification of candidate shortcut features by their strong correlation with the label, used as a proxy for semantic simplicity. The classifier is then desensitized to these features by injecting targeted, anisotropic latent noise during training. We analyze this as targeted Jacobian regularization, which forces the classifier to ignore spurious features and rely on more complex, core semantic signals. The result is state-of-the-art OOD performance on established shortcut learning benchmarks.
zh

[CV-187] VideoChat-M1: Collaborative Policy Planning for Video Understanding via Multi-Agent Reinforcement Learning

【速读】:该论文旨在解决当前基于工具增强的多模态大语言模型(Multimodal Large Language Models, MLLMs)在视频理解任务中,因采用静态且不可学习的工具调用机制而导致难以发现时空复杂视频中多样线索的问题。解决方案的关键在于提出一种名为VideoChat-M1的新型多智能体系统,其核心是引入协作策略规划(Collaborative Policy Planning, CPP)范式,包含策略生成、策略执行与策略通信三个关键过程:首先各智能体根据用户查询生成个性化工具调用策略;随后按序调用相关工具探索视频内容;并在执行过程中通过交互更新各自策略,实现动态优化。该框架结合简明的多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)方法,使策略智能体团队能够联合优化,从而显著提升视频理解性能,在八个基准测试中达到最先进水平(SOTA)。

链接: https://arxiv.org/abs/2511.19524
作者: Boyu Chen,Zikang Wang,Zhengrong Yue,Kainan Yan,Chenyun Yu,Yi Huang,Zijun Liu,Yafei Wen,Xiaoxin Chen,Yang Liu,Peng Li,Yali Wang
机构: Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); VIVO AI Lab (维沃人工智能实验室); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Shenzhen Campus of Sun Yat-sen University (中山大学深圳校区); Shanghai Jiao Tong University (上海交通大学); Institute for AI Industry Research (AIR), Tsinghua University (清华大学智能产业研究院); Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University (清华大学计算机科学与技术系、清华大学人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注: 21 pages, 9 figures

点击查看摘要

Abstract:By leveraging tool-augmented Multimodal Large Language Models (MLLMs), multi-agent frameworks are driving progress in video understanding. However, most of them adopt static and non-learnable tool invocation mechanisms, which limit the discovery of diverse clues essential for robust perception and reasoning regarding temporally or spatially complex videos. To address this challenge, we propose a novel Multi-agent system for video understanding, namely VideoChat-M1. Instead of using a single or fixed policy, VideoChat-M1 adopts a distinct Collaborative Policy Planning (CPP) paradigm with multiple policy agents, which comprises three key processes. (1) Policy Generation: Each agent generates its unique tool invocation policy tailored to the user’s query; (2) Policy Execution: Each agent sequentially invokes relevant tools to execute its policy and explore the video content; (3) Policy Communication: During the intermediate stages of policy execution, agents interact with one another to update their respective policies. Through this collaborative framework, all agents work in tandem, dynamically refining their preferred policies based on contextual insights from peers to effectively respond to the user’s query. Moreover, we equip our CPP paradigm with a concise Multi-Agent Reinforcement Learning (MARL) method. Consequently, the team of policy agents can be jointly optimized to enhance VideoChat-M1’s performance, guided by both the final answer reward and intermediate collaborative process feedback. Extensive experiments demonstrate that VideoChat-M1 achieves SOTA performance across eight benchmarks spanning four tasks. Notably, on LongVideoBench, our method outperforms the SOTA model Gemini 2.5 pro by 3.6% and GPT-4o by 15.6%.
zh

[CV-188] Blinking Beyond EAR: A Stable Eyelid Angle Metric for Driver Drowsiness Detection and Data Augmentation

【速读】:该论文旨在解决驾驶员疲劳检测中因视角变化导致的准确性下降以及自然状态下疲劳数据稀缺的问题。其解决方案的关键在于提出一种新的眼睑角度(Eyelid Angle, ELA)度量方法,该方法基于三维面部关键点计算,能够稳定描述眼睑运动,且对相机视角变化具有鲁棒性;同时利用ELA信号驱动Blender 3D中的绑定虚拟角色,生成可控噪声、视角和眨眼动态的合成数据集,从而提升疲劳识别模型的训练多样性和泛化能力。

链接: https://arxiv.org/abs/2511.19519
作者: Mathis Wolter,Julie Stephany Berrio Perez,Mao Shan
机构: Hamburg University of Technology (汉堡工业大学); Australian Center for Robotics (澳大利亚机器人中心); The University of Sydney (悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 8 pages, 5 figures, 3 tables

点击查看摘要

Abstract:Detecting driver drowsiness reliably is crucial for enhancing road safety and supporting advanced driver assistance systems (ADAS). We introduce the Eyelid Angle (ELA), a novel, reproducible metric of eye openness derived from 3D facial landmarks. Unlike conventional binary eye state estimators or 2D measures, such as the Eye Aspect Ratio (EAR), the ELA provides a stable geometric description of eyelid motion that is robust to variations in camera angle. Using the ELA, we design a blink detection framework that extracts temporal characteristics, including the closing, closed, and reopening durations, which are shown to correlate with drowsiness levels. To address the scarcity and risk of collecting natural drowsiness data, we further leverage ELA signals to animate rigged avatars in Blender 3D, enabling the creation of realistic synthetic datasets with controllable noise, camera viewpoints, and blink dynamics. Experimental results in public driver monitoring datasets demonstrate that the ELA offers lower variance under viewpoint changes compared to EAR and achieves accurate blink detection. At the same time, synthetic augmentation expands the diversity of training data for drowsiness recognition. Our findings highlight the ELA as both a reliable biometric measure and a powerful tool for generating scalable datasets in driver state monitoring.
zh

[CV-189] owards Efficient VLMs: Information-Theoretic Driven Compression via Adaptive Structural Pruning

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)因规模不断增长而导致的部署效率低下问题,现有压缩方法多依赖启发式重要性度量或经验剪枝规则,缺乏关于信息保留的理论保障。其解决方案的关键在于提出 InfoPrune,一个基于信息瓶颈(Information Bottleneck)原理的信息论框架,通过在保留任务相关语义与去除冗余依赖之间建立权衡关系,量化每个注意力头的贡献并引入熵基有效秩(eRank)和柯尔莫哥洛夫-斯米尔诺夫距离(KS distance)构建统一的结构稀疏性与信息效率评估准则,进而设计两种互补的压缩策略:基于训练的注意力头剪枝和无需训练的前馈网络(FFN)低秩自适应近似,从而实现显著的计算资源节省(最高达3.2倍FLOP减少)且性能损失可忽略。

链接: https://arxiv.org/abs/2511.19518
作者: Zhaoqi Xu,Yingying Zhang,Jian Li,Jianwei Guo,Qiannan Zhu,Hua Huang
机构: Beijing Normal University (北京师范大学); Shandong University (山东大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advances in vision-language models (VLMs) have shown remarkable performance across multimodal tasks, yet their ever-growing scale poses severe challenges for deployment and efficiency. Existing compression methods often rely on heuristic importance metrics or empirical pruning rules, lacking theoretical guarantees about information preservation. In this work, we propose InfoPrune, an information-theoretic framework for adaptive structural compression of VLMs. Grounded in the Information Bottleneck principle, we formulate pruning as a trade-off between retaining task-relevant semantics and discarding redundant dependencies. To quantify the contribution of each attention head, we introduce an entropy-based effective rank (eRank) and employ the Kolmogorov–Smirnov (KS) distance to measure the divergence between original and compressed structures. This yields a unified criterion that jointly considers structural sparsity and informational efficiency. Building on this foundation, we further design two complementary schemes: (1) a training-based head pruning guided by the proposed information loss objective, and (2) a training-free FFN compression via adaptive low-rank approximation. Extensive experiments on VQAv2, TextVQA, and GQA demonstrate that InfoPrune achieves up to 3.2x FLOP reduction and 1.8x acceleration with negligible performance degradation, establishing a theoretically grounded and practically effective step toward efficient multimodal large models.
zh

[CV-190] Connecting the Dots: Training-Free Visual Grounding via Agent ic Reasoning AAAI2025

【速读】:该论文旨在解决视觉接地(Visual Grounding)任务中现有方法依赖大量特定任务标注和微调、难以泛化到新场景或分布外情况的问题。其解决方案的关键在于提出GroundingAgent框架,该框架不依赖任何任务特定微调,而是通过结构化的迭代推理机制,融合预训练的开放词汇目标检测器、多模态大语言模型(MLLM)和大语言模型(LLM),实现语义与空间信息的联合分析,逐步精炼候选区域。该设计使模型在无需微调的情况下达到65.1%的零样本准确率,并在选择阶段单独使用原始查询文本时接近监督性能(约90%),凸显了LLM推理能力的核心作用。

链接: https://arxiv.org/abs/2511.19516
作者: Liqin Luo,Guangyao Chen,Xiawu Zheng,Yongxing Dai,Yixiong Zou,Yonghong Tian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2025

点击查看摘要

Abstract:Visual grounding, the task of linking textual queries to specific regions within images, plays a pivotal role in vision-language integration. Existing methods typically rely on extensive task-specific annotations and fine-tuning, limiting their ability to generalize effectively to novel or out-of-distribution scenarios. To address these limitations, we introduce GroundingAgent, a novel agentic visual grounding framework that operates without any task-specific fine-tuning. GroundingAgent employs a structured, iterative reasoning mechanism that integrates pretrained open-vocabulary object detectors, multimodal large language models (MLLMs), and large language models (LLMs) to progressively refine candidate regions through joint semantic and spatial analyses. Remarkably, GroundingAgent achieves an average zero-shot grounding accuracy of 65.1 % on widely-used benchmarks (RefCOCO, RefCOCO+, RefCOCOg), entirely without fine-tuning. Furthermore, by substituting MLLM-generated captions with the original query texts, the accuracy at the selection stage alone reaches approximately 90 %, closely matching supervised performance and underscoring the critical role of LLM reasoning capabilities. GroundingAgent also offers strong interpretability, transparently illustrating each reasoning step and providing clear insights into its decision-making process.
zh

[CV-191] Fewer Tokens Greater Scaling: Self-Adaptive Visual Bases for Efficient and Expansive Representation Learning

【速读】:该论文旨在解决模型容量与维持图像语义所需的最小视觉标记(visual tokens)数量之间的根本关系问题。其核心挑战在于如何量化图像在视觉语义空间中的内在复杂度,并据此优化标记效率。解决方案的关键在于提出一种基于最小描述长度(Minimum Description Length)原理的重新诠释:将视觉标记视为视觉语义空间中的向量,并定义图像的内在语义复杂度为能张成该空间的最小基向量集。在此基础上,作者设计了轻量级模块Orthogonal Filtering,通过自适应地将冗余标记聚类为一组正交基向量,从而实现对视觉语义空间的紧凑表示。实验表明,随着模型规模增大,所需标记数显著减少,揭示了标记数与模型规模之间的一致性缩放规律。

链接: https://arxiv.org/abs/2511.19515
作者: Shawn Young,Xingyu Zeng,Lijian Xu
机构: Shenzhen University of Advanced Technology (深圳先进技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper investigates the fundamental relationship between model capacity and the minimal number of visual tokens required to preserve image semantics. Inspired by the Minimum Description Length principle, we reinterpret image tokens as vectors in a visual semantic space and define the intrinsic semantic complexity of an image as the smallest set of basis vectors needed to span this space. Building on this perspective, we propose Orthogonal Filtering, a lightweight module that adaptively clusters redundant tokens into a compact set of orthogonal bases. Through extensive experiments across a range of ViT models, we reveal a consistent token, model scaling law: larger models require significantly fewer tokens to span visual semantic space. Besides, we also contribute a visual long-context dataset.
zh

[CV-192] Single Image to High-Quality 3D Object via Latent Features

【速读】:该论文旨在解决单图生成3D物体时难以同时实现快速、高细节和高保真度的问题。解决方案的关键在于提出LatentDreamer框架,其核心是利用预训练的变分自编码器(Variational Autoencoder, VAE)将3D几何结构映射到潜在特征空间,从而显著降低3D生成的难度;在此基础上,LatentDreamer通过从潜在特征出发,依次生成粗略几何、精细化几何和真实纹理,最终在约70秒内完成高质量3D物体生成,且仅需少量训练数据即可达到与当前主流方法相当的性能。

链接: https://arxiv.org/abs/2511.19512
作者: Huanning Dong,Yinuo Huang,Fan Li,Ping Kuang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D assets are essential in the digital age. While automatic 3D generation, such as image-to-3d, has made significant strides in recent years, it often struggles to achieve fast, detailed, and high-fidelity generation simultaneously. In this work, we introduce LatentDreamer, a novel framework for generating 3D objects from single images. The key to our approach is a pre-trained variational autoencoder that maps 3D geometries to latent features, which greatly reducing the difficulty of 3D generation. Starting from latent features, the pipeline of LatentDreamer generates coarse geometries, refined geometries, and realistic textures sequentially. The 3D objects generated by LatentDreamer exhibit high fidelity to the input images, and the entire generation process can be completed within a short time (typically in 70 seconds). Extensive experiments show that with only a small amount of training, LatentDreamer demonstrates competitive performance compared to contemporary approachs.
zh

[CV-193] he Determinant Ratio Matrix Approach to Solving 3D Matching and 2D Orthographic Projection Alignment Tasks

【速读】:该论文旨在解决三维(3D)和二维(2D)正交投影姿态估计问题,即在已知3D参考对象及其在图像中的正交投影(OnP问题)或3D到3D对应点对的情况下,精确求解物体的完整位姿(EnP问题)。其关键解决方案是提出并应用行列式比矩阵(DRaM)方法,该方法能够以闭式解析形式求解误差-free 的 EnP 和 OnP 问题,并通过简单的旋转校正策略处理含噪数据场景。相比传统基于奇异值分解(SVD)或最优四元数特征系统的方法仅适用于3D-3D对齐(EnP),DRaM 方法首次为3D-2D正交投影(OnP)提供了可比较的闭式解,且揭示了此类解法构成一个统一的“DRaM家族”,具有广泛的适用性与理论深度,甚至可推广至任意N维欧氏空间的姿态估计问题。

链接: https://arxiv.org/abs/2511.19511
作者: Andrew J. Hanson,Sonya M. Hanson
机构: Indiana University (印第安纳大学); The Flatiron Institute (Flatiron研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 12 pages of main text, 3 figures, 31 pages total (including references and 2 appendices, one with algorithm-defining source code)

点击查看摘要

Abstract:Pose estimation is a general problem in computer vision with wide applications. The relative orientation of a 3D reference object can be determined from a 3D rotated version of that object, or from a projection of the rotated object to a 2D planar image. This projection can be a perspective projection (the PnP problem) or an orthographic projection (the OnP problem). We restrict our attention here to the OnP problem and the full 3D pose estimation task (the EnP problem). Here we solve the least squares systems for both the error-free EnP and OnP problems in terms of the determinant ratio matrix (DRaM) approach. The noisy-data case can be addressed with a straightforward rotation correction scheme. While the SVD and optimal quaternion eigensystem methods solve the noisy EnP 3D-3D alignment exactly, the noisy 3D-2D orthographic (OnP) task has no known comparable closed form, and can be solved by DRaM-class methods. We note that while previous similar work has been presented in the literature exploiting both the QR decomposition and the Moore-Penrose pseudoinverse transformations, here we place these methods in a larger context that has not previously been fully recognized in the absence of the corresponding DRaM solution. We term this class of solutions as the DRaM family, and conduct comparisons of the behavior of the families of solutions for the EnP and OnP rotation estimation problems. Overall, this work presents both a new solution to the 3D and 2D orthographic pose estimation problems and provides valuable insight into these classes of problems. With hindsight, we are able to show that our DRaM solutions to the exact EnP and OnP problems possess derivations that could have been discovered in the time of Gauss, and in fact generalize to all analogous N-dimensional Euclidean pose estimation problems.
zh

[CV-194] Beyond Binary Classification: A Semi-supervised Approach to Generalized AI-generated Image Detection AAAI

【速读】:该论文旨在解决当前数字图像伪造检测中缺乏跨生成器泛化能力的问题,特别是当检测模型从一种架构(如生成对抗网络 GAN)迁移到另一种架构(如扩散模型 DM)时性能显著下降的难题。其核心原因是不同生成架构在优化目标上的差异导致了截然不同的伪影模式:GANs 通常产生局部覆盖不全的特征空间,易出现边界伪影;而 DMs 则强制全局覆盖,引发过度平滑现象。解决方案的关键在于提出一种名为 Triarchy Detector (TriDetect) 的半监督方法,通过在“假图像”类别中挖掘潜在的架构级特征模式,利用 Sinkhorn-Knopp 算法实现平衡聚类分配,并引入跨视图一致性机制,促使模型学习到本质的架构区分特征,从而显著提升对未见过的生成器的泛化能力。

链接: https://arxiv.org/abs/2511.19499
作者: Hong-Hanh Nguyen-Le,Van-Tuan Tran,Dinh-Thuc Nguyen,Nhien-An Le-Khac
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to The 40th Annual AAAI Conference on Artificial Intelligence - 2025

点击查看摘要

Abstract:The rapid advancement of generators (e.g., StyleGAN, Midjourney, DALL-E) has produced highly realistic synthetic images, posing significant challenges to digital media authenticity. These generators are typically based on a few core architectural families, primarily Generative Adversarial Networks (GANs) and Diffusion Models (DMs). A critical vulnerability in current forensics is the failure of detectors to achieve cross-generator generalization, especially when crossing architectural boundaries (e.g., from GANs to DMs). We hypothesize that this gap stems from fundamental differences in the artifacts produced by these \textbfdistinct architectures. In this work, we provide a theoretical analysis explaining how the distinct optimization objectives of the GAN and DM architectures lead to different manifold coverage behaviors. We demonstrate that GANs permit partial coverage, often leading to boundary artifacts, while DMs enforce complete coverage, resulting in over-smoothing patterns. Motivated by this analysis, we propose the \textbfTriarchy \textbfDetector (TriDetect), a semi-supervised approach that enhances binary classification by discovering latent architectural patterns within the “fake” class. TriDetect employs balanced cluster assignment via the Sinkhorn-Knopp algorithm and a cross-view consistency mechanism, encouraging the model to learn fundamental architectural distincts. We evaluate our approach on two standard benchmarks and three in-the-wild datasets against 13 baselines to demonstrate its generalization capability to unseen generators.
zh

[CV-195] racking and Segmenting Anything in Any Modality AAAI2026

【速读】:该论文旨在解决视频理解中跟踪(Tracking)与分割(Segmentation)任务在多模态输入和多任务训练中存在的两个关键挑战:一是不同模态间的分布差异(distributional gap),二是任务间特征表示的不一致性(feature representation gap),这些问题限制了跨任务与跨模态的知识共享,阻碍了通用模型(generalist model)的发展。其解决方案的核心在于提出一个名为SATA的统一框架,其中包含两个关键技术:一是解耦式专家混合机制(Decoupled Mixture-of-Expert, DeMoE),用于将统一表征学习分解为跨模态共享知识建模与特定信息提取两部分,从而在保持灵活性的同时提升泛化能力;二是任务感知的多目标跟踪流水线(Task-aware Multi-object Tracking, TaMOT),通过将所有任务输出统一为带有校准ID信息的实例集合,缓解多任务训练过程中任务特异性知识的退化问题。

链接: https://arxiv.org/abs/2511.19475
作者: Tianlu Zhang,Qiang Zhang,Guiguang Ding,Jungong Han
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Accpetd by AAAI 2026

点击查看摘要

Abstract:Tracking and segmentation play essential roles in video understanding, providing basic positional information and temporal association of objects within video sequences. Despite their shared objective, existing approaches often tackle these tasks using specialized architectures or modality-specific parameters, limiting their generalization and scalability. Recent efforts have attempted to unify multiple tracking and segmentation subtasks from the perspectives of any modality input or multi-task inference. However, these approaches tend to overlook two critical challenges: the distributional gap across different modalities and the feature representation gap across tasks. These issues hinder effective cross-task and cross-modal knowledge sharing, ultimately constraining the development of a true generalist model. To address these limitations, we propose a universal tracking and segmentation framework named SATA, which unifies a broad spectrum of tracking and segmentation subtasks with any modality input. Specifically, a Decoupled Mixture-of-Expert (DeMoE) mechanism is presented to decouple the unified representation learning task into the modeling process of cross-modal shared knowledge and specific information, thus enabling the model to maintain flexibility while enhancing generalization. Additionally, we introduce a Task-aware Multi-object Tracking (TaMOT) pipeline to unify all the task outputs as a unified set of instances with calibrated ID information, thereby alleviating the degradation of task-specific knowledge during multi-task training. SATA demonstrates superior performance on 18 challenging tracking and segmentation benchmarks, offering a novel perspective for more generalizable video understanding.
zh

[CV-196] Pistachio: Towards Synthetic Balanced and Long-Form Video Anomaly Benchmarks

【速读】:该论文旨在解决现有视频异常检测(Video Anomaly Detection, VAD)与视频异常理解(Video Anomaly Understanding, VAU)基准数据集在场景多样性、异常覆盖均衡性及时间复杂性方面的不足,以及VAU因依赖大量人工标注而难以有效评估的问题。解决方案的关键在于提出一个完全基于生成式方法的新型基准Pistachio,其核心创新包括:场景条件化的异常分配机制、多步骤叙事生成策略和时序一致的长视频合成方法,从而实现对视频场景、异常类型和时间叙事的精确控制,显著减少人为偏差并提升数据质量与复杂度,为VAD/VAU模型提供更贴近真实世界挑战的测试环境。

链接: https://arxiv.org/abs/2511.19474
作者: Jie Li,Hongyi Cai,Mingkang Dong,Muxin Pu,Shan You,Fei Wang,Tao Huang
机构: University of Science & Technology Beijing (北京科技大学); Universiti Malaya (马来西亚大学); Monash University (莫纳什大学); SenseTime Research (商汤科技研究院); Shanghai Jiaotong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Automatically detecting abnormal events in videos is crucial for modern autonomous systems, yet existing Video Anomaly Detection (VAD) benchmarks lack the scene diversity, balanced anomaly coverage, and temporal complexity needed to reliably assess real-world performance. Meanwhile, the community is increasingly moving toward Video Anomaly Understanding (VAU), which requires deeper semantic and causal reasoning but remains difficult to benchmark due to the heavy manual annotation effort it demands. In this paper, we introduce Pistachio, a new VAD/VAU benchmark constructed entirely through a controlled, generation-based pipeline. By leveraging recent advances in video generation models, Pistachio provides precise control over scenes, anomaly types, and temporal narratives, effectively eliminating the biases and limitations of Internet-collected datasets. Our pipeline integrates scene-conditioned anomaly assignment, multi-step storyline generation, and a temporally consistent long-form synthesis strategy that produces coherent 41-second videos with minimal human intervention. Extensive experiments demonstrate the scale, diversity, and complexity of Pistachio, revealing new challenges for existing methods and motivating future research on dynamic and multi-event anomaly understanding.
zh

[CV-197] SG-OIF: A Stability-Guided Online Influence Framework for Reliable Vision Data

【速读】:该论文旨在解决深度学习视觉模型中训练样本对测试预测影响的近似计算问题,这对于识别噪声数据至关重要。传统影响函数方法在深度神经网络中难以应用,主要受限于逆曲率计算成本高以及训练过程非平稳性导致静态近似失效的问题。为应对这些挑战,论文提出了一种稳定性引导的在线影响框架(Stability-Guided Online Influence Framework, SG-OIF),其核心创新在于将算法稳定性作为实时控制器:(i) 通过随机Richardson和预条件Neumann方法维持轻量级锚点Hessian向量积(IHVP);(ii) 设计模块化曲率后端,利用稳定性引导的残差阈值、异常门控和置信度调节每个样本的影响评分。该方案实现了噪声标签和分布外检测任务上的SOTA性能,验证了其作为在线影响估计实用控制器的有效性。

链接: https://arxiv.org/abs/2511.19466
作者: Penghao Rao,Runmin Jiang,Min Xu
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Approximating training-point influence on test predictions is critical for deploying deep-learning vision models, essential for locating noisy data. Though the influence function was proposed for attributing how infinitesimal up-weighting or removal of individual training examples affects model outputs, its implementation is still challenging in deep-learning vision models: inverse-curvature computations are expensive, and training non-stationarity invalidates static approximations. Prior works use iterative solvers and low-rank surrogates to reduce cost, but offline computation lags behind training dynamics, and missing confidence calibration yields fragile rankings that misidentify critical examples. To address these challenges, we introduce a Stability-Guided Online Influence Framework (SG-OIF), the first framework that treats algorithmic stability as a real-time controller, which (i) maintains lightweight anchor IHVPs via stochastic Richardson and preconditioned Neumann; (ii) proposes modular curvature backends to modulate per-example influence scores using stability-guided residual thresholds, anomaly gating, and confidence. Experimental results show that SG-OIF achieves SOTA (State-Of-The-Art) on noise-label and out-of-distribution detection tasks across multiple datasets with various corruption. Notably, our approach achieves 91.1% accuracy in the top 1% prediction samples on the CIFAR-10 (20% asym), and gets 99.8% AUPR score on MNIST, effectively demonstrating that this framework is a practical controller for online influence estimation.
zh

[CV-198] Personalized Reward Modeling for Text-to-Image Generation

【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)生成模型在个性化评价方面的挑战,即如何准确评估生成图像与个体用户偏好之间的对齐程度。传统方法如通用奖励函数或基于相似性的指标难以捕捉用户视觉偏好的多样性与复杂性。其解决方案的关键在于提出 PIGReward——一个能够动态生成用户条件化评估维度并借助思维链(Chain-of-Thought, CoT)推理进行图像评分的个性化奖励模型。该模型通过自引导(self-bootstrapping)策略,在有限参考数据下构建丰富的用户上下文,实现无需用户特定训练的个性化建模;同时提供可解释的个性化反馈以优化用户提示(prompt),从而提升生成结果与个体意图的一致性。

链接: https://arxiv.org/abs/2511.19458
作者: Jeongeun Lee,Ryang Heo,Dongha Lee
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent text-to-image (T2I) models generate semantically coherent images from textual prompts, yet evaluating how well they align with individual user preferences remains an open challenge. Conventional evaluation methods, general reward functions or similarity-based metrics, fail to capture the diversity and complexity of personal visual tastes. In this work, we present PIGReward, a personalized reward model that dynamically generates user-conditioned evaluation dimensions and assesses images through CoT reasoning. To address the scarcity of user data, PIGReward adopt a self-bootstrapping strategy that reasons over limited reference data to construct rich user contexts, enabling personalization without user-specific training. Beyond evaluation, PIGReward provides personalized feedback that drives user-specific prompt optimization, improving alignment between generated images and individual intent. We further introduce PIGBench, a per-user preference benchmark capturing diverse visual interpretations of shared prompts. Extensive experiments demonstrate that PIGReward surpasses existing methods in both accuracy and interpretability, establishing a scalable and reasoning-based foundation for personalized T2I evaluation and optimization. Taken together, our findings highlight PIGReward as a robust steptoward individually aligned T2I generation.
zh

[CV-199] PuzzlePoles: Cylindrical Fiducial Markers Based on the PuzzleBoard Pattern

【速读】:该论文旨在解决自主系统中环境感知的可靠性问题,尤其是标定(calibration)与定位(localization)任务对鲁棒视觉标记的需求。解决方案的关键在于提出一种新型特征标记——PuzzlePole,它基于近期提出的PuzzleBoard标定图案设计而成,是一种圆柱形标记,能够从360°视角可靠识别并估计位姿;其独特组合结构确保了高精度的定位与朝向估计,并具备良好的遮挡鲁棒性,从而提升多种自主系统场景(如机器人导航、SLAM及物理交互界面)中的感知性能。

链接: https://arxiv.org/abs/2511.19448
作者: Juri Zach,Peer Stelldinger
机构: HAW Hamburg(汉堡应用技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reliable perception of the environment is a key enabler for autonomous systems, where calibration and localization tasks often rely on robust visual markers. We introduce the PuzzlePole, a new type of fiducial markers derived from the recently proposed PuzzleBoard calibration pattern. The PuzzlePole is a cylindrical marker, enabling reliable recognition and pose estimation from 360° viewing direction. By leveraging the unique combinatorial structure of the PuzzleBoard pattern, PuzzlePoles provide a high accuracy in localization and orientation while being robust to occlusions. The design offers flexibility for deployment in diverse autonomous systems scenarios, ranging from robot navigation and SLAM to tangible interfaces.
zh

[CV-200] Splatblox: Traversability-Aware Gaussian Splatting for Outdoor Robot Navigation ICRA2026

【速读】:该论文旨在解决户外复杂环境中机器人自主导航的难题,特别是针对密集植被、不规则障碍物和复杂地形导致的传统方法失效的问题。其解决方案的关键在于提出Splatblox系统,通过高斯点绘(Gaussian Splatting)融合分割后的RGB图像与LiDAR点云数据,构建一种可在线更新的可通行性感知欧氏有符号距离场(Traversability-aware Euclidean Signed Distance Field, ESDF),该场同时编码几何结构与语义信息,从而实现对可通行植被(如高草)与刚性障碍物(如树木)的精准区分,并借助LiDAR提供360°几何覆盖以支持长距离路径规划。

链接: https://arxiv.org/abs/2511.18525
作者: Samarth Chopra,Jing Liang,Gershom Seneviratne,Yonghan Lee,Jaehoon Choi,Jianyu An,Stephen Cheng,Dinesh Manocha
机构: University of Maryland, College Park (马里兰大学学院公园分校)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ICRA 2026

点击查看摘要

Abstract:We present Splatblox, a real-time system for autonomous navigation in outdoor environments with dense vegetation, irregular obstacles, and complex terrain. Our method fuses segmented RGB images and LiDAR point clouds using Gaussian Splatting to construct a traversability-aware Euclidean Signed Distance Field (ESDF) that jointly encodes geometry and semantics. Updated online, this field enables semantic reasoning to distinguish traversable vegetation (e.g., tall grass) from rigid obstacles (e.g., trees), while LiDAR ensures 360-degree geometric coverage for extended planning horizons. We validate Splatblox on a quadruped robot and demonstrate transfer to a wheeled platform. In field trials across vegetation-rich scenarios, it outperforms state-of-the-art methods with over 50% higher success rate, 40% fewer freezing incidents, 5% shorter paths, and up to 13% faster time to goal, while supporting long-range missions up to 100 meters. Experiment videos and more details can be found on our project page: this https URL
zh

[CV-201] Optimization of Sums of Bivariate Functions: An Introduction to Relaxation-Based Methods for the Case of Finite Domains

【速读】:该论文旨在解决在有限域上对具有 $ n^2 $ 个变量的函数进行优化的问题,这类函数可表示为多个仅涉及 $ n $ 个变量中两两组合的子函数之和,即“和式双变量函数”(sums of bivariates)。此类问题在计算复杂性上被证明是 NP 等价的,且存在“免费午餐”现象(free lunch),意味着某些优化策略在特定条件下优于随机搜索。解决方案的关键在于引入基于测度值扩展的目标函数(称为松弛,relaxations),结合 2\ell^2-逼近与熵正则化(entropy-regularization),从而将原问题转化为可通过线性规划、坐标上升法或闭式解求解的可 tractable(可处理)形式。进一步通过从二阶边缘分布重构测度的一般理论,分析了这些松弛方法的适用边界,实验验证了其在随机函数、顶点着色及信号重建等场景中的有效性,揭示了不同函数类别可建模为和式双变量结构的特性。

链接: https://arxiv.org/abs/2511.20607
作者: Nils Müller
机构: 未知
类目: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: 59 pages, 7 figures

点击查看摘要

Abstract:We study the optimization of functions with n2 arguments that have a representation as a sum of several functions that have only 2 of the n arguments each, termed sums of bivariates, on finite domains. The complexity of optimizing sums of bivariates is shown to be NP-equivalent and it is shown that there exists free lunch in the optimization of sums of bivariates. Based on measure-valued extensions of the objective function, so-called relaxations, \ell^2 -approximation, and entropy-regularization, we derive several tractable problem formulations solvable with linear programming, coordinate ascent as well as with closed-form solutions. The limits of applying tractable versions of such relaxations to sums of bivariates are investigated using general results for reconstructing measures from their bivariate marginals. Experiments in which the derived algorithms are applied to random functions, vertex coloring, and signal reconstruction problems provide insights into qualitatively different function classes that can be modeled as sums of bivariates.
zh

[CV-202] Development of a fully deep learning model to improve the reproducibility of sector classification systems for predicting unerupted maxillary canine likelihood of impaction

【速读】:该论文旨在解决上颌埋伏尖牙位置分类系统在临床应用中存在的人工操作者间与操作者内一致性差的问题,从而影响对埋伏可能性预测的准确性。解决方案的关键在于开发一个完全基于深度学习的模型,利用预训练于1,222张放射影像的大规模数据集进行优化,并通过对比不同AI模型的表现(以敏感性和精确度为指标)识别出最优模型——DenseNet121,其整体准确率达到76.8%,能够自动且稳定地完成埋伏尖牙位置的三类分类任务,显著提升诊断的一致性与客观性。

链接: https://arxiv.org/abs/2511.20493
作者: Marzio Galdi,Davide Cannatà,Flavia Celentano,Luigia Rizzo,Domenico Rossi,Tecla Bocchino,Stefano Martina
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Objectives. The aim of the present study was to develop a fully deep learning model to reduce the intra- and inter-operator reproducibility of sector classification systems for predicting unerupted maxillary canine likelihood of impaction. Methods. Three orthodontists (Os) and three general dental practitioners (GDPs) classified the position of unerupted maxillary canines on 306 radiographs (T0) according to the three different sector classification systems (5-, 4-, and 3-sector classification system). The assessment was repeated after four weeks (T1). Intra- and inter-observer agreement were evaluated with Cohen’s K and Fleiss K, and between group differences with a z-test. The same radiographs were tested on different artificial intelligence (AI) models, pre-trained on an extended dataset of 1,222 radiographs. The best-performing model was identified based on its sensitivity and precision. Results. The 3-sector system was found to be the classification method with highest reproducibility, with an agreement (Cohen’s K values) between observations (T0 versus T1) for each examiner ranged from 0.80 to 0.92, and an overall agreement of 0.85 [95% confidence interval (CI) = 0.83-0.87]. The overall inter-observer agreement (Fleiss K) ranged from 0.69 to 0.7. The educational background did not affect either intra- or inter-observer agreement (p0.05). DenseNet121 proved to be the best-performing model in allocating impacted canines in the three different classes, with an overall accuracy of 76.8%. Conclusion. AI models can be designed to automatically classify the position of unerupted maxillary canines.
zh

[CV-203] Redefining Radar Segmentation: Simultaneous Static-Moving Segmentation and Ego-Motion Estimation using Radar Point Clouds

【速读】:该论文旨在解决传统雷达目标分割研究中忽视静态与动态物体区分的问题,而这一区分是自动驾驶中多数感知任务的前提。现有方法通常仅关注类别标签的预测,却未充分考虑雷达与光学传感器在可靠性上的差异,导致对静态和移动物体的识别不够准确。为填补这一空白,论文提出了一种基于神经网络的解决方案,能够直接从原始雷达点云中同时实现静态与动态物体的分割,并估计移动平台(即“自我运动”)的瞬时二维速度。其关键创新在于使用简单的构建模块——多层感知机(MLP)和循环神经网络(RNN)——即可完成双任务,且无需进行点云聚合、多普勒补偿或运动补偿等中间处理步骤,从而首次实现了在未经预处理的雷达点云上直接提取双任务所需信息的可行性。

链接: https://arxiv.org/abs/2511.20003
作者: Simin Zhu,Satish Ravindran,Alexander Yarovoy,Francesco Fioranelli
机构: Delft University of Technology (代尔夫特理工大学); NXP Semiconductors (恩智浦半导体)
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 9 figures, under review at IEEE Transactions on Radar Systems

点击查看摘要

Abstract:Conventional radar segmentation research has typically focused on learning category labels for different moving objects. Although fundamental differences between radar and optical sensors lead to differences in the reliability of predicting accurate and consistent category labels, a review of common radar perception tasks in automotive reveals that determining whether an object is moving or static is a prerequisite for most tasks. To fill this gap, this study proposes a neural network based solution that can simultaneously segment static and moving objects from radar point clouds. Furthermore, since the measured radial velocity of static objects is correlated with the motion of the radar, this approach can also estimate the instantaneous 2D velocity of the moving platform or vehicle (ego motion). However, despite performing dual tasks, the proposed method employs very simple yet effective building blocks for feature extraction: multi layer perceptrons (MLPs) and recurrent neural networks (RNNs). In addition to being the first of its kind in the literature, the proposed method also demonstrates the feasibility of extracting the information required for the dual task directly from unprocessed point clouds, without the need for cloud aggregation, Doppler compensation, motion compensation, or any other intermediate signal processing steps. To measure its performance, this study introduces a set of novel evaluation metrics and tests the proposed method using a challenging real world radar dataset, RadarScenes. The results show that the proposed method not only performs well on the dual tasks, but also has broad application potential in other radar perception tasks.
zh

[CV-204] DLADiff: A Dual-Layer Defense Framework against Fine-Tuning and Zero-Shot Customization of Diffusion Models

【速读】:该论文旨在解决扩散模型(Diffusion Models)在个性化定制过程中对人脸隐私造成的严重威胁,特别是针对基于少量图像(3–5张)的微调方法和仅需单张参考图像的零样本生成方法所引发的合成身份伪造风险。现有防御手段主要聚焦于微调类攻击,忽视了零样本生成场景下的防护需求。解决方案的关键在于提出双层反扩散机制(Dual-Layer Anti-Diffusion, DLADiff):第一层通过双代理模型(Dual-Surrogate Models, DSUR)与交替动态微调(Alternating Dynamic Fine-Tuning, ADFT)结合对抗训练和预微调模型先验知识,有效抵御未经授权的微调攻击;第二层则以简洁设计实现对零样本生成的有效阻断,从而在两类主流攻击范式下均展现出显著优于现有方法的防御性能。

链接: https://arxiv.org/abs/2511.19910
作者: Jun Jia,Hongyi Miao,Yingjie Zhou,Linhan Cao,Yanwei Jiang,Wangqiu Zhou,Dandan Zhu,Hua Yang,Wei Sun,Xiongkuo Min,Guangtao Zhai
机构: Shanghai Jiao Tong University (上海交通大学); Shandong University (山东大学); Hefei University of Technology (合肥工业大学); East China Normal University (华东师范大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the rapid advancement of diffusion models, a variety of fine-tuning methods have been developed, enabling high-fidelity image generation with high similarity to the target content using only 3 to 5 training images. More recently, zero-shot generation methods have emerged, capable of producing highly realistic outputs from a single reference image without altering model weights. However, technological advancements have also introduced significant risks to facial privacy. Malicious actors can exploit diffusion model customization with just a few or even one image of a person to create synthetic identities nearly identical to the original identity. Although research has begun to focus on defending against diffusion model customization, most existing defense methods target fine-tuning approaches and neglect zero-shot generation defenses. To address this issue, this paper proposes Dual-Layer Anti-Diffusion (DLADiff) to defense both fine-tuning methods and zero-shot methods. DLADiff contains a dual-layer protective mechanism. The first layer provides effective protection against unauthorized fine-tuning by leveraging the proposed Dual-Surrogate Models (DSUR) mechanism and Alternating Dynamic Fine-Tuning (ADFT), which integrates adversarial training with the prior knowledge derived from pre-fine-tuned models. The second layer, though simple in design, demonstrates strong effectiveness in preventing image generation through zero-shot methods. Extensive experimental results demonstrate that our method significantly outperforms existing approaches in defending against fine-tuning of diffusion models and achieves unprecedented performance in protecting against zero-shot generation.
zh

[CV-205] he Selective Disk Bispectrum and Its Inversion with Application to Multi-Reference Alignment

【速读】:该论文旨在解决图像形状分析中旋转不变性表示的构建问题,即如何在保留图像全部形状信息的同时忽略物体在图像中的旋转角度,从而支持基于形状的学习任务。传统方法如平移不变的双谱(translational bispectrum)虽已应用于一维和二维信号,但其扩展至二维图像的旋转不变性时面临两大挑战:缺乏可逆公式以及计算复杂度为立方级(cubic complexity)。本文的关键解决方案是推导出一种显式的逆变换,从而定义了“选择性”盘双谱(selective disk bispectrum),该方法仅使用恢复形状所需的最小系数集,显著提升了效率并保证了可逆性。这一突破使得多参考对齐(multi-reference alignment)等此前难以实现的任务成为可能,使盘双谱成为一种理论严谨且实际可用的旋转不变形状学习工具。

链接: https://arxiv.org/abs/2511.19706
作者: Adele Myers,Nina Miolane
机构: UC Santa Barbara (加州大学圣塔芭芭拉分校)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In many computer vision and shape analysis tasks, practitioners are interested in learning from the shape of the object in an image, while disregarding the object’s orientation. To this end, it is valuable to define a rotation-invariant representation of images, retaining all information about that image, but disregarding the way an object is rotated in the frame. To be practical for learning tasks, this representation must be computationally efficient for large datasets and invertible, so the representation can be visualized in image space. To this end, we present the selective disk bispectrum: a fast, rotation-invariant representation for image shape analysis. While the translational bispectrum has long been used as a translational invariant representation for 1-D and 2-D signals, its extension to 2-D (disk) rotational invariance on images has been hindered by the absence of an invertible formulation and its cubic complexity. In this work, we derive an explicit inverse for the disk bispectrum, which allows us to define a “selective” disk bispectrum, which only uses the minimal number of coefficients needed for faithful shape recovery. We show that this representation enables multi-reference alignment for rotated images-a task previously intractable for disk bispectrum methods. These results establish the disk bispectrum as a practical and theoretically grounded tool for learning on rotation-invariant shape data.
zh

[CV-206] PhysDNet: Physics-Guided Decomposition Network of Side-Scan Sonar Imagery

【速读】:该论文旨在解决侧扫声呐(Side-scan Sonar, SSS)图像中强度信息受海底反射率、地形高程和声学路径衰减共同影响而导致的视点依赖性强、下游分析鲁棒性差的问题。解决方案的关键在于提出一种物理引导的多分支网络(PhysDNet),通过嵌入朗伯反射模型(Lambertian reflection model),将SSS图像解耦为三个可解释的物理场:海底反射率、地形高程和传播损耗,并基于这些成分重构声呐强度,从而实现无需真实标注的自监督训练。该方法显著提升了图像的物理一致性与地质结构稳定性,增强了配准与阴影解析等下游任务的可靠性。

链接: https://arxiv.org/abs/2511.19539
作者: Can Lei,Hayat Rajani,Nuno Gracias,Rafael Garcia,Huigang Wang
机构: Northwestern Polytechnical University (西北工业大学); Research & Development Institute of Northwestern Polytechnical University in Shenzhen (西北工业大学深圳研究院); University of Girona (赫罗纳大学)
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Computer Vision and Pattern Recognition (cs.CV)
备注: This work was previously submitted in error as arXiv:2509.11255v2

点击查看摘要

Abstract:Side-scan sonar (SSS) imagery is widely used for seafloor mapping and underwater remote sensing, yet the measured intensity is strongly influenced by seabed reflectivity, terrain elevation, and acoustic path loss. This entanglement makes the imagery highly view-dependent and reduces the robustness of downstream analysis. In this letter, we present PhysDNet, a physics-guided multi-branch network that decouples SSS images into three interpretable fields: seabed reflectivity, terrain elevation, and propagation loss. By embedding the Lambertian reflection model, PhysDNet reconstructs sonar intensity from these components, enabling self-supervised training without ground-truth annotations. Experiments show that the decomposed representations preserve stable geological structures, capture physically consistent illumination and attenuation, and produce reliable shadow maps. These findings demonstrate that physics-guided decomposition provides a stable and interpretable domain for SSS analysis, improving both physical consistency and downstream tasks such as registration and shadow interpretation.
zh

[CV-207] A Multi-Stage Deep Learning Framework with PKCP-MixUp Augmentation for Pediatric Liver Tumor Diagnosis Using Multi-Phase Contrast-Enhanced CT

【速读】:该论文旨在解决儿科肝脏肿瘤(pediatric liver tumors)诊断中依赖有创活检所带来的风险与局限性问题,如高血管性导致的出血并发症、患儿配合度差需麻醉操作增加医疗成本及心理创伤等。其解决方案的关键在于构建一个基于多期增强CT影像的多阶段深度学习(deep learning, DL)框架,通过创新性的PKCP-MixUp数据增强方法缓解小样本和类别不平衡问题,并采用两阶段诊断流程:第一阶段利用肿瘤检测模型提取感兴趣区域(ROI),第二阶段基于ROI-masked图像使用三个骨干网络进行良恶性及亚型分类,最终实现了高准确率的自动化诊断(良恶性分类AUC=0.989,良性亚型AUC=0.915,恶性亚型AUC=0.979),填补了儿童专用DL诊断空白,为精准、无创的儿科肝脏肿瘤诊断提供了可行路径。

链接: https://arxiv.org/abs/2511.19478
作者: Wanqi Wang,Chun Yang,Jianbo Shao,Yaokai Zhang,Xuehua Peng,Jin Sun,Chao Xiong,Long Lu,Lianting Hu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Pediatric liver tumors are one of the most common solid tumors in pediatrics, with differentiation of benign or malignant status and pathological classification critical for clinical treatment. While pathological examination is the gold standard, the invasive biopsy has notable limitations: the highly vascular pediatric liver and fragile tumor tissue raise complication risks such as bleeding; additionally, young children with poor compliance require anesthesia for biopsy, increasing medical costs or psychological trauma. Although many efforts have been made to utilize AI in clinical settings, most researchers have overlooked its importance in pediatric liver tumors. To establish a non-invasive examination procedure, we developed a multi-stage deep learning (DL) framework for automated pediatric liver tumor diagnosis using multi-phase contrast-enhanced CT. Two retrospective and prospective cohorts were enrolled. We established a novel PKCP-MixUp data augmentation method to address data scarcity and class imbalance. We also trained a tumor detection model to extract ROIs, and then set a two-stage diagnosis pipeline with three backbones with ROI-masked images. Our tumor detection model has achieved high performance (mAP=0.871), and the first stage classification model between benign and malignant tumors reached an excellent performance (AUC=0.989). Final diagnosis models also exhibited robustness, including benign subtype classification (AUC=0.915) and malignant subtype classification (AUC=0.979). We also conducted multi-level comparative analyses, such as ablation studies on data and training pipelines, as well as Shapley-Value and CAM interpretability analyses. This framework fills the pediatric-specific DL diagnostic gap, provides actionable insights for CT phase selection and model design, and paves the way for precise, accessible pediatric liver tumor diagnosis.
zh

[CV-208] Not Quite Anything: Overcoming SAMs Limitations for 3D Medical Imaging

【速读】:该论文旨在解决基础分割模型(如SAM和SAM-2)在脑部磁共振成像(MRI)中表现不佳的问题,尤其是在边界模糊、对比度低的结构(如尾状核和丘脑)上。其核心解决方案是提出一种组合式架构(has-a architecture),将基础模型的输出作为额外输入通道与MRI图像融合,从而增强感兴趣区域的提示信息;具体而言,通过一个轻量级3D U-Net生成SAM-2的提示,该U-Net可来自不同数据集,虽精度有限但定位合理,随后对基础模型预测结果的边缘进行平滑处理以提升与MRI的空间一致性。此方法无需微调基础模型权重,有效适应域偏移(domain shift),且在基底节分割中达到约96%的体积准确率,适用于纵向体积变化研究,具有速度快、标签效率高、鲁棒性强等优势。

链接: https://arxiv.org/abs/2511.19471
作者: Keith Moore
机构: Stanford University (斯坦福大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint; Paper accepted at AIAS 2025

点击查看摘要

Abstract:Foundation segmentation models such as SAM and SAM-2 perform well on natural images but struggle with brain MRIs where structures like the caudate and thalamus lack sharp boundaries and have low contrast. Rather than fine tune these models (for example MedSAM), we propose a compositional alternative where the foundation model output is treated as an additional input channel and passed alongside the MRI to highlight regions of interest. We generate SAM-2 prompts by using a lightweight 3D U-Net that was previously trained on MRI segmentation. The U-Net may have been trained on a different dataset, so its guesses are often imprecise but usually in the correct region. The edges of the resulting foundation model guesses are smoothed to improve alignment with the MRI. We also test prompt free segmentation using DINO attention maps in the same framework. This has-a architecture avoids modifying foundation weights and adapts to domain shift without retraining the foundation model. It reaches about 96 percent volume accuracy on basal ganglia segmentation, which is sufficient for our study of longitudinal volume change. The approach is fast, label efficient, and robust to out of distribution scans. We apply it to study inflammation linked changes in sudden onset pediatric OCD. Comments: Preprint; Paper accepted at AIAS 2025 Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2511.19471 [eess.IV] (or arXiv:2511.19471v1 [eess.IV] for this version) https://doi.org/10.48550/arXiv.2511.19471 Focus to learn more arXiv-issued DOI via DataCite
zh

[CV-209] Dual-Path Knowledge-Augmented Contrastive Alignment Network for Spatially Resolved Transcriptomics AAAI2026

【速读】:该论文旨在解决空间转录组学(Spatial Transcriptomics, ST)技术因成本高昂而难以广泛应用的问题,具体目标是通过全切片图像(Whole Slide Images, WSI)预测空间分辨的基因表达谱。当前方法存在三大局限:对高层次生物语境利用不足、过度依赖样本检索(exemplar retrieval),以及跨模态特征对齐不充分。其解决方案的关键在于提出DKAN——一种双路径知识增强对比对齐网络,核心创新包括:1)引入基因语义表示模块,利用外部基因数据库提供生物学先验信息以增强预测能力;2)采用统一的一阶段对比学习范式,融合对比学习与监督学习,消除对示例检索的依赖,并结合自适应权重机制优化训练;3)设计双路径对比对齐模块,以基因语义特征作为动态跨模态协调器,实现组织病理图像与基因表达特征的有效融合。实验表明,DKAN在三个公开ST数据集上显著优于现有最优模型,为生物医学研究提供了强有力的工具。

链接: https://arxiv.org/abs/2511.17685
作者: Wei Zhang,Jiajun Chu,Xinci Liu,Chen Tong,Xinyue Li
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: AAAI 2026 Oral, extended version

点击查看摘要

Abstract:Spatial Transcriptomics (ST) is a technology that measures gene expression profiles within tissue sections while retaining spatial context. It reveals localized gene expression patterns and tissue heterogeneity, both of which are essential for understanding disease etiology. However, its high cost has driven efforts to predict spatial gene expression from whole slide images. Despite recent advancements, current methods still face significant limitations, such as under-exploitation of high-level biological context, over-reliance on exemplar retrievals, and inadequate alignment of heterogeneous modalities. To address these challenges, we propose DKAN, a novel Dual-path Knowledge-Augmented contrastive alignment Network that predicts spatially resolved gene expression by integrating histopathological images and gene expression profiles through a biologically informed approach. Specifically, we introduce an effective gene semantic representation module that leverages the external gene database to provide additional biological insights, thereby enhancing gene expression prediction. Further, we adopt a unified, one-stage contrastive learning paradigm, seamlessly combining contrastive learning and supervised learning to eliminate reliance on exemplars, complemented with an adaptive weighting mechanism. Additionally, we propose a dual-path contrastive alignment module that employs gene semantic features as dynamic cross-modal coordinators to enable effective heterogeneous feature integration. Through extensive experiments across three public ST datasets, DKAN demonstrates superior performance over state-of-the-art models, establishing a new benchmark for spatial gene expression prediction and offering a powerful tool for advancing biological and clinical research.
zh

人工智能

[AI-0] Fighting AI with AI: Leverag ing Foundation Models for Assuring AI-Enabled Safety-Critical Systems

【速读】:该论文旨在解决将生成式 AI(Generative AI)组件,尤其是深度神经网络(Deep Neural Networks, DNNs)集成到航空航天和自动驾驶等安全关键系统中时所面临的验证与保障难题。核心问题包括:AI系统的黑箱特性、高层需求与低层网络表示之间的语义鸿沟,以及传统需求工程(Requirements Engineering, RE)中存在的自然语言规范模糊性和形式化过程的可扩展性瓶颈。解决方案的关键在于利用AI自身能力构建两个互补模块:REACT(Requirements Engineering with AI for Consistency and Testing)通过大型语言模型(Large Language Models, LLMs)实现非正式自然语言需求到形式化规格的映射,支持早期验证与确认;SemaLens(Semantic Analysis of Visual Perception using large Multi-modal models)则借助视觉语言模型(Vision Language Models, VLMs)以人类可理解的概念对基于DNN的感知系统进行推理、测试与监控,从而形成从非正式需求到可验证实现的完整闭环流程。

链接: https://arxiv.org/abs/2511.20627
作者: Anastasia Mavridou,Divya Gopinath,Corina S. Păsăreanu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The integration of AI components, particularly Deep Neural Networks (DNNs), into safety-critical systems such as aerospace and autonomous vehicles presents fundamental challenges for assurance. The opacity of AI systems, combined with the semantic gap between high-level requirements and low-level network representations, creates barriers to traditional verification approaches. These AI-specific challenges are amplified by longstanding issues in Requirements Engineering, including ambiguity in natural language specifications and scalability bottlenecks in formalization. We propose an approach that leverages AI itself to address these challenges through two complementary components. REACT (Requirements Engineering with AI for Consistency and Testing) employs Large Language Models (LLMs) to bridge the gap between informal natural language requirements and formal specifications, enabling early verification and validation. SemaLens (Semantic Analysis of Visual Perception using large Multi-modal models) utilizes Vision Language Models (VLMs) to reason about, test, and monitor DNN-based perception systems using human-understandable concepts. Together, these components provide a comprehensive pipeline from informal requirements to validated implementations.
zh

[AI-1] ROOT: Robust Orthogonalized Optimizer for Neural Network Training

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)训练中因模型规模扩大而加剧的算法不精确性和训练不稳定性问题,特别是现有优化器在正交化精度上的维度脆弱性以及对异常值噪声的敏感性。解决方案的关键在于提出ROOT(Robust Orthogonalized Optimizer),其核心创新为两个双稳健机制:一是基于自适应牛顿迭代和细粒度系数设计的维度鲁棒正交化方案,确保不同矩阵尺寸下保持高精度;二是通过近端优化框架抑制异常值噪声,同时保留有效梯度方向,从而提升训练稳定性与收敛性能。

链接: https://arxiv.org/abs/2511.20626
作者: Wei He,Kai Han,Hang Zhou,Hanting Chen,Zhicheng Liu,Xinghao Chen,Yunhe Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The optimization of large language models (LLMs) remains a critical challenge, particularly as model scaling exacerbates sensitivity to algorithmic imprecision and training instability. Recent advances in optimizers have improved convergence efficiency through momentum orthogonalization, but suffer from two key robustness limitations: dimensional fragility in orthogonalization precision and vulnerability to outlier-induced noise. To address these robustness challenges, we introduce ROOT, a Robust Orthogonalized Optimizer that enhances training stability through dual robustness mechanisms. First, we develop a dimension-robust orthogonalization scheme using adaptive Newton iterations with fine-grained coefficients tailored to specific matrix sizes, ensuring consistent precision across diverse architectural configurations. Second, we introduce an optimization-robust framework via proximal optimization that suppresses outlier noise while preserving meaningful gradient directions. Extensive experiments demonstrate that ROOT achieves significantly improved robustness, with faster convergence and superior final performance compared to both Muon and Adam-based optimizers, particularly in noisy and non-convex scenarios. Our work establishes a new paradigm for developing robust and precise optimizers capable of handling the complexities of modern large-scale model training. The code will be available at this https URL.
zh

[AI-2] Copyright Detection in Large Language Models : An Ethical Approach to Generative AI Development

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)训练数据中可能未经授权包含受版权保护内容的问题,尤其针对当前检测框架如DE-COP存在计算资源消耗高、对独立创作者不友好等局限性。其解决方案的关键在于提出一个开源的版权检测平台,通过优化相似性检测算法、改进数据集验证流程,并借助高效的API调用将计算开销降低10%-30%,同时提供直观的用户界面和可扩展的后端架构,从而实现对内容创作者友好的、透明且可扩展的版权合规验证机制。

链接: https://arxiv.org/abs/2511.20623
作者: David Szczecina,Senan Gaffori,Edmond Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 4 pages, 3 figures

点击查看摘要

Abstract:The widespread use of Large Language Models (LLMs) raises critical concerns regarding the unauthorized inclusion of copyrighted content in training data. Existing detection frameworks, such as DE-COP, are computationally intensive, and largely inaccessible to independent creators. As legal scrutiny increases, there is a pressing need for a scalable, transparent, and user-friendly solution. This paper introduce an open-source copyright detection platform that enables content creators to verify whether their work was used in LLM training datasets. Our approach enhances existing methodologies by facilitating ease of use, improving similarity detection, optimizing dataset validation, and reducing computational overhead by 10-30% with efficient API calls. With an intuitive user interface and scalable backend, this framework contributes to increasing transparency in AI development and ethical compliance, facilitating the foundation for further research in responsible AI development and copyright enforcement.
zh

[AI-3] DiFR: Inference Verification Despite Nondeterminism

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)推理过程中可验证性不足的问题,即如何在不重新运行整个推理流程的前提下,高效且可靠地验证生成结果的正确性,从而防止错误或恶意篡改。其核心挑战在于,由于数值噪声的存在,相同输入下多次推理可能产生不同输出,导致难以区分合法波动与实际故障。解决方案的关键在于提出Token-DiFR(Token-Divergence-From-Reference),通过将待验证模型的输出与使用相同随机种子的可信参考实现(reference implementation)进行逐token比对,利用采样种子同步严格约束有效输出空间,使生成token本身即可作为零额外成本的可审计证据。此外,为支持样本效率更高的前向传递验证,论文进一步引入Activation-DiFR,采用随机正交投影压缩激活值为紧凑指纹以供验证,在仅需2个输出token的情况下即可检测4-bit量化误差(AUC > 0.999),同时相比现有方法降低25%-75%通信开销。

链接: https://arxiv.org/abs/2511.20621
作者: Adam Karvonen,Daniel Reuter,Roy Rinberg,Luke Marks,Adrià Garriga-Alonso,Keri Warr
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As demand for LLM inference grows, it is becoming increasingly important that providers and their customers can verify that inference processes are performed correctly, without errors or tampering. However, re-running the same inference process twice often leads to different results due to benign numerical noise, making it difficult to distinguish legitimate variation from actual problems. To address this problem, we introduce Token-DiFR (Token-Divergence-From-Reference), a method for verifying inference outputs by comparing generated tokens against predictions made by a trusted reference implementation conditioned on the same random seed. Sampling seed synchronization tightly constrains valid outputs, leaving providers minimal room to deviate from correct inference, which allows output tokens themselves to serve as auditable evidence of correctness at zero additional cost to the provider. Token-DiFR reliably identifies sampling errors, simulated bugs, and model quantization, detecting 4-bit quantization with AUC 0.999 within 300 output tokens. For applications requiring sample-efficient forward-pass verification, we additionally introduce Activation-DiFR, a scheme that uses random orthogonal projections to compress activations into compact fingerprints for subsequent verification. Activation-DiFR detects 4-bit quantization with AUC 0.999 using just 2 output tokens, while reducing communication overhead by 25-75% relative to existing methods. We release an open-source integration with vLLM to accelerate practical deployment of verifiable inference.
zh

[AI-4] Can Vibe Coding Beat Graduate CS Students? An LLM vs. Human Coding Tournament on Market-driven Strategic Planning

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在代码生成能力评估中存在的重要局限性问题,即现有基准测试过度依赖单元测试通过率和语法正确性,而忽视了真实世界场景中所需的规划、优化与策略交互等复杂推理能力。为弥补这一差距,作者提出了一种基于真实物流优化问题(拍卖、取货与配送问题,Auction, Pickup, and Delivery Problem)的多智能体推理驱动型基准测试,其核心在于构建能够同时应对不确定性下的战略竞标和容量受限路径优化的智能体系统。该解决方案的关键在于设计了一个耦合竞争性拍卖与资源约束路径规划的复杂任务环境,并通过对比40个由不同LLM生成的智能体与17个由人类(研究生)编写的智能体在大规模对战中的表现,揭示了当前LLM在生成具备现实竞争力代码方面的显著不足。

链接: https://arxiv.org/abs/2511.20613
作者: Panayiotis Danassis,Naman Goel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:The rapid proliferation of Large Language Models (LLMs) has revolutionized AI-assisted code generation. This rapid development of LLMs has outpaced our ability to properly benchmark them. Prevailing benchmarks emphasize unit-test pass rates and syntactic correctness. Such metrics understate the difficulty of many real-world problems that require planning, optimization, and strategic interaction. We introduce a multi-agent reasoning-driven benchmark based on a real-world logistics optimization problem (Auction, Pickup, and Delivery Problem) that couples competitive auctions with capacity-constrained routing. The benchmark requires building agents that can (i) bid strategically under uncertainty and (ii) optimize planners that deliver tasks while maximizing profit. We evaluate 40 LLM-coded agents (by a wide range of state-of-the-art LLMs under multiple prompting methodologies, including vibe coding) against 17 human-coded agents developed before the advent of LLMs. Our results over 12 double all-play-all tournaments and \sim 40 k matches demonstrate (i) a clear superiority of human(graduate students)-coded agents: the top 5 spots are consistently won by human-coded agents, (ii) the majority of LLM-coded agents (33 out of 40) are beaten by very simple baselines, and (iii) given the best human solution as an input and prompted to improve upon, the best performing LLM makes the solution significantly worse instead of improving it. Our results highlight a gap in LLMs’ ability to produce code that works competitively in the real-world, and motivate new evaluations that emphasize reasoning-driven code synthesis in real-world scenarios.
zh

[AI-5] Building a Foundation Model for Trajectory from Scratch

【速读】:该论文试图解决当前在移动轨迹数据上构建基础模型(Foundation Models)缺乏清晰、可复现的实现路径的问题,尤其是在从零开始设计和训练针对时空轨迹数据的基础模型时,相关方法尚未系统化或文档化。其解决方案的关键在于提供一个从GPT-2出发的最小化实现流程,通过代码驱动的方式展示如何将预训练语言模型适配为处理时空轨迹数据的架构,并在此基础上对比分析如TrajFM和TrajGPT等代表性轨迹基础模型的结构创新与差异,同时引入TimesFM等跨领域技术(如patching策略)作为补充工具。该教程旨在提升研究者在移动性AI领域的实现能力与评估标准,从而增强研究透明度与同行评审质量。

链接: https://arxiv.org/abs/2511.20610
作者: Gaspard Merten,Mahmoud Sakr,Gilles Dejaegere
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Foundation models are transformative in artificial intelligence, but building them from scratch, especially for mobility trajectories, is not yet clear or documented. This tutorial bridges this gap by demonstrating the steps and code of a minimal implementation of a trajectory-focused foundation model starting from GPT-2. Through a concise, step-by-step, code-driven process, we demonstrate adapting GPT-2 for spatiotemporal data. We then review and compare representative trajectory foundation models, such as TrajFM and TrajGPT, highlighting their architectural innovations and differences. Additionally, we introduce complementary techniques from related domains, like TimesFM’s patching approach. Targeted at researchers and practitioners, this tutorial aims to explain the concepts and terminology of foundation models, at the implementation level. We find it timely and indispensable to create this educational material in order to support the SIGSPATIAL community in building and evaluating mobility foundation models, enhancing both research clarity and peer-review effectiveness in mobility AI.
zh

[AI-6] he Driver-Blindness Phenomenon: Why Deep Sequence Models Default to Autocorrelation in Blood Glucose Forecasting

【速读】:该论文旨在解决深度序列模型在血糖预测中普遍存在的“驱动盲视”(Driver-Blindness)问题,即模型未能有效利用临床关键驱动因素(如胰岛素、饮食和活动)进行预测,尽管这些因素具有明确的生理机制。作者通过定义 Δₘᵢᵥₑᵣₛ(多变量模型相对于单变量基线的性能提升)量化该问题,并指出其根源在于三个相互作用的因素:架构偏向自相关性(C1)、数据保真度不足导致驱动因素噪声大且混杂(C2),以及生理异质性削弱群体模型泛化能力(C3)。解决方案的关键在于综合采用生理特征编码器、因果正则化和个性化策略来部分缓解驱动盲视现象,并建议未来研究应常规报告 Δₘᵢᵥₑᵣₛ 以避免误导性的“最先进”模型被误认为有效。

链接: https://arxiv.org/abs/2511.20601
作者: Heman Shakeri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages, 1 figure

点击查看摘要

Abstract:Deep sequence models for blood glucose forecasting consistently fail to leverage clinically informative drivers–insulin, meals, and activity–despite well-understood physiological mechanisms. We term this Driver-Blindness and formalize it via \Delta_\textdrivers , the performance gain of multivariate models over matched univariate baselines. Across the literature, \Delta_\textdrivers is typically near zero. We attribute this to three interacting factors: architectural biases favoring autocorrelation (C1), data fidelity gaps that render drivers noisy and confounded (C2), and physiological heterogeneity that undermines population-level models (C3). We synthesize strategies that partially mitigate Driver-Blindness–including physiological feature encoders, causal regularization, and personalization–and recommend that future work routinely report \Delta_\textdrivers to prevent driver-blind models from being considered state-of-the-art.
zh

[AI-7] BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents

【速读】:该论文旨在解决生成式 AI (Generative AI) 代理集成到网页浏览器中所引发的新型安全挑战,特别是针对提示注入(prompt injection)攻击在真实环境中的影响尚不明确的问题。现有研究虽已识别提示注入为一种新的攻击向量,但缺乏对其实战场景下危害性的系统评估。为此,作者构建了一个嵌入真实 HTML 载荷中的提示注入攻击基准测试集,其特点在于聚焦于能触发实际行为而非仅文本输出的攻击,并模拟现实世界中复杂性和干扰项频率。基于此基准,论文对前沿 AI 模型的多种防御机制进行了全面实证评估,并提出一种多层次防御策略,融合架构级与模型级防护措施,以应对不断演进的提示注入攻击。其关键创新在于通过“纵深防御”(defense-in-depth)理念,为设计可落地、安全的 Web Agent 提供了一套实用框架。

链接: https://arxiv.org/abs/2511.20597
作者: Kaiyuan Zhang,Mark Tenenholtz,Kyle Polley,Jerry Ma,Denis Yarats,Ninghui Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:The integration of artificial intelligence (AI) agents into web browsers introduces security challenges that go beyond traditional web application threat models. Prior work has identified prompt injection as a new attack vector for web agents, yet the resulting impact within real-world environments remains insufficiently understood. In this work, we examine the landscape of prompt injection attacks and synthesize a benchmark of attacks embedded in realistic HTML payloads. Our benchmark goes beyond prior work by emphasizing injections that can influence real-world actions rather than mere text outputs, and by presenting attack payloads with complexity and distractor frequency similar to what real-world agents encounter. We leverage this benchmark to conduct a comprehensive empirical evaluation of existing defenses, assessing their effectiveness across a suite of frontier AI models. We propose a multi-layered defense strategy comprising both architectural and model-based defenses to protect against evolving prompt injection attacks. Our work offers a blueprint for designing practical, secure web agents through a defense-in-depth approach. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR) Cite as: arXiv:2511.20597 [cs.LG] (or arXiv:2511.20597v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.20597 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-8] EnergyTwin: A Multi-Agent System for Simulating and Coordinating Energy Microgrids

【速读】:该论文旨在解决微电网(Microgrid)在多时间尺度和动态条件下协调异构分布式能源资源(Distributed Energy Resources, DERs)的难题,以实现降低购电成本、减少对波动电价的暴露并保障扰动下的服务连续性。现有工具中,电力系统仿真器虽能准确模拟物理行为但假设集中式控制,而多智能体框架虽支持去中心化决策却缺乏物理基础;为此,论文提出EnergyTwin——一个基于智能体的微电网仿真环境,其核心创新在于将物理建模与预测驱动的滚动时域规划及协商机制相耦合:每个资产作为独立智能体,与中央代理交互,由后者获取预测信息、生成调度方案并通过合同机制分配能量。该方法聚焦三级决策层,具备扩展为数字孪生(Digital Twin)平台的能力,实验证明其可显著提升本地能源自给率、维持更高电池储备并降低低韧性运行状态的风险。

链接: https://arxiv.org/abs/2511.20590
作者: Jakub Muszyński,Ignacy Walużenicz,Patryk Zan,Zofia Wrona,Maria Ganzha,Marcin Paprzycki,Costin Bădică
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Microgrids are deployed to reduce purchased grid energy, limit exposure to volatile tariffs, and ensure service continuity during disturbances. This requires coordinating heterogeneous distributed energy resources across multiple time scales and under variable conditions. Among existing tools, typically, power-system simulators capture physical behaviour but assume centralized control, while multi-agent frameworks model decentralized decision-making but represent energy with no physical grounding. In this context, the EnergyTwin is introduced, an agent-based microgrid simulation environment that couples physically grounded models with forecast-informed, rolling-horizon planning, and negotiations. Each asset is modeled as an agent, interacting with a central agent that obtains forecasts, formulates predictions, and allocates energy through contract-based interactions. EnergyTwin targets tertiary-layer decision making and is extensible for digital-twin use. Its feasibility was evaluated in a university campus microgrid scenario where multiple planning strategies were compared. Achieved results show that forecast-driven rolling-horizon planning increases local energy self-sufficiency, maintains higher battery reserves, and reduces exposure to low-resilience operating states. They demonstrate also potential of EnergyTwin as platform supporting research on resilient, negotiation-driven microgrids.
zh

[AI-9] PaTAS: A Parallel System for Trust Propagation in Neural Networks Using Subjective Logic

【速读】:该论文旨在解决当前人工智能系统在安全关键应用中缺乏可信赖评估机制的问题,尤其是传统评价指标(如准确率和精确度)无法捕捉模型预测的不确定性或可靠性,特别是在对抗性或退化条件下。解决方案的关键在于提出并实现一个并行的可信度评估系统(Parallel Trust Assessment System, PaTAS),该系统基于主观逻辑(Subjective Logic, SL)建模和传播神经网络中的信任信息;其核心创新包括:1)通过信任节点(Trust Nodes)与信任函数(Trust Functions)在标准神经计算之外并行传播输入、参数和激活的信任值;2)引入参数可信度更新机制以在训练过程中优化参数可靠性;3)设计推理路径可信度评估(Inference-Path Trust Assessment, IPTA)方法,在推理阶段生成实例特定的可信度估计。该框架实现了可解释、对称且收敛的可信度量化,有效识别模型置信度与实际可靠性之间的偏差,并提升AI全生命周期内的可信推理能力。

链接: https://arxiv.org/abs/2511.20586
作者: Koffi Ismael Ouattara,Ioannis Krontiris,Theo Dimitrakos,Dennis Eisermann,Frank Kargl
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Trustworthiness has become a key requirement for the deployment of artificial intelligence systems in safety-critical applications. Conventional evaluation metrics such as accuracy and precision fail to capture uncertainty or the reliability of model predictions, particularly under adversarial or degraded conditions. This paper introduces the \emphParallel Trust Assessment System (PaTAS), a framework for modeling and propagating trust in neural networks using Subjective Logic (SL). PaTAS operates in parallel with standard neural computation through \emphTrust Nodes and \emphTrust Functions that propagate input, parameter, and activation trust across the network. The framework defines a \emphParameter Trust Update mechanism to refine parameter reliability during training and an \emphInference-Path Trust Assessment (IPTA) method to compute instance-specific trust at inference. Experiments on real-world and adversarial datasets demonstrate that PaTAS produces interpretable, symmetric, and convergent trust estimates that complement accuracy and expose reliability gaps in poisoned, biased, or uncertain data scenarios. The results show that PaTAS effectively distinguishes between benign and adversarial inputs and identifies cases where model confidence diverges from actual reliability. By enabling transparent and quantifiable trust reasoning within neural architectures, PaTAS provides a principled foundation for evaluating model reliability across the AI lifecycle.
zh

[AI-10] Gated Uncertainty-Aware Runtime Dual Invariants for Neural Signal-Controlled Robotics NEURIPS2025

【速读】:该论文旨在解决安全关键型辅助系统(如脑机接口控制的机器人)在直接从神经信号中解码用户意图时,如何提供可靠性和可信度保障的问题。解决方案的关键在于提出GUARDIAN框架,其通过将置信度校准的脑信号解码与符号化目标接地相结合,并引入双层运行时监控机制,实现逻辑安全性与生理可信性的协同约束。该框架能够在低精度解码模型(测试准确率仅27–46%)和高置信度偏差(ECE达0.22–0.41)条件下仍维持94–97%的安全率,且具备100Hz的实时监测频率与亚毫秒级决策延迟,从而为闭环神经信号控制系统提供了可验证、可审计的运行保障。

链接: https://arxiv.org/abs/2511.20570
作者: Tasha Kim,Oiwi Parker Jones
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Embodied and Safe-Assured Robotic Systems workshop at NeurIPS 2025

点击查看摘要

Abstract:Safety-critical assistive systems that directly decode user intent from neural signals require rigorous guarantees of reliability and trust. We present GUARDIAN (Gated Uncertainty-Aware Runtime Dual Invariants), a framework for real-time neuro-symbolic verification for neural signal-controlled robotics. GUARDIAN enforces both logical safety and physiological trust by coupling confidence-calibrated brain signal decoding with symbolic goal grounding and dual-layer runtime monitoring. On the BNCI2014 motor imagery electroencephalogram (EEG) dataset with 9 subjects and 5,184 trials, the system performs at a high safety rate of 94-97% even with lightweight decoder architectures with low test accuracies (27-46%) and high ECE confidence miscalibration (0.22-0.41). We demonstrate 1.7x correct interventions in simulated noise testing versus at baseline. The monitor operates at 100Hz and sub-millisecond decision latency, making it practically viable for closed-loop neural signal-based systems. Across 21 ablation results, GUARDIAN exhibits a graduated response to signal degradation, and produces auditable traces from intent, plan to action, helping to link neural evidence to verifiable robot action.
zh

[AI-11] Proceedings Twentieth Conference on Theoretical Aspects of Rationality and Knowledge

【速读】:该论文旨在推进对理性与知识理论的跨学科理解,聚焦于多智能体系统中关于理性决策、信念更新、认知逻辑及博弈论等核心问题的研究。其解决方案的关键在于通过形式化建模(如epistemic logic、bounded rationality模型)和算法分析(如计算社会选择、算法博弈论),构建统一框架以刻画个体与群体在不确定环境下的推理机制,并促进计算机科学、人工智能、哲学与认知科学等领域的交叉融合。

链接: https://arxiv.org/abs/2511.20540
作者: Adam Bjorndahl(Carnegie Mellon University)
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:The TARK conference (Theoretical Aspects of Rationality and Knowledge) is a conference that aims to bring together researchers from a wide variety of fields, including computer science, artificial intelligence, game theory, decision theory, philosophy, logic, linguistics, and cognitive science. Its goal is to further our understanding of interdisciplinary issues involving reasoning about rationality and knowledge. Previous conferences have been held biennially around the world since 1986, on the initiative of Joe Halpern (Cornell University). Topics of interest include, but are not limited to, semantic models for knowledge, belief, uncertainty, awareness, bounded rationality, common sense epistemic reasoning, epistemic logic, epistemic game theory, knowledge and action, applications of reasoning about knowledge and other mental states, belief revision, computational social choice, algorithmic game theory, and foundations of multi-agent systems. Information about TARK is available at this http URL. These proceedings contain the papers that have been accepted for presentation at the Twentieth Conference on Theoretical Aspects of Rationality and Knowledge (TARK 2025), held July 14–16, 2025, at Heinrich-Heine-Universität, Düsseldorf, Germany. The conference website can be found at this https URL. Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA) Cite as: arXiv:2511.20540 [cs.LO] (or arXiv:2511.20540v1 [cs.LO] for this version) https://doi.org/10.48550/arXiv.2511.20540 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: EPTCS 437, 2025 Related DOI: https://doi.org/10.4204/EPTCS.437 Focus to learn more DOI(s) linking to related resources Submission history From: EPTCS [view email][via EPTCS proxy] [v1] Tue, 25 Nov 2025 17:41:15 UTC (8 KB)
zh

[AI-12] Assessing LLM s Performance: Insights from the Chinese Pharmacist Exam

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险、领域特定的药剂师执业资格认证任务中表现差异的问题,重点评估不同模型在真实考试题目上的准确率及其对AI辅助形成性评价的启示。解决方案的关键在于系统性地对比了两个主流LLM——ChatGPT-4o与DeepSeek-R1——在2017–2021年中国药剂师执业资格考试中的文本类选择题上的答题准确性,并通过统计检验验证其性能差异,结果表明DeepSeek-R1在整体准确率(90.0% vs. 76.1%,p < 0.001)及多个知识模块上均显著优于ChatGPT-4o,凸显出针对专业领域优化的模型在复杂临床和理论判断任务中的优势,同时也强调了在法律与伦理敏感场景下人工审核的必要性。

链接: https://arxiv.org/abs/2511.20526
作者: Xinran Wang,Boran Zhu,Shujuan Zhou,Ziwen Long,Dehua Zhou,Shu Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 4 figures

点击查看摘要

Abstract:Background: As large language models (LLMs) become increasingly integrated into digital health education and assessment workflows, their capabilities in supporting high-stakes, domain-specific certification tasks remain this http URL China, the national pharmacist licensure exam serves as a standardized benchmark for evaluating pharmacists’ clinical and theoretical competencies. Objective: This study aimed to compare the performance of two LLMs: ChatGPT-4o and DeepSeek-R1 on real questions from the Chinese Pharmacist Licensing Examination (2017-2021), and to discuss the implications of these performance differences for AI-enabled formative evaluation. Methods: A total of 2,306 multiple-choice (text-only) questions were compiled from official exams, training materials, and public databases. Questions containing tables or images were excluded. Each item was input in its original Chinese format, and model responses were evaluated for exact accuracy. Pearson’s Chi-squared test was used to compare overall performance, and Fisher’s exact test was applied to year-wise multiple-choice accuracy. Results: DeepSeek-R1 outperformed ChatGPT-4o with a significantly higher overall accuracy (90.0% vs. 76.1%, p 0.001). Unit-level analyses revealed consistent advantages for DeepSeek-R1, particularly in foundational and clinical synthesis modules. While year-by-year multiple-choice performance also favored DeepSeek-R1, this performance gap did not reach statistical significance in any specific unit-year (all p 0.05). Conclusion: DeepSeek-R1 demonstrated robust alignment with the structural and semantic demands of the pharmacist licensure exam. These findings suggest that domain-specific models warrant further investigation for this context, while also reinforcing the necessity of human oversight in legally and ethically sensitive contexts.
zh

[AI-13] FRAG MENTA: End-to-end Frag mentation-based Generative Model with Agent ic Tuning for Drug Lead Optimization

【速读】:该论文旨在解决生成式 AI 在药物发现中因类别特定数据集样本量少(通常少于100个)而导致的分子生成质量差的问题,以及传统碎片化模型受限于启发式分割策略、多样性不足且难以捕捉关键片段的缺陷。此外,现有模型调优依赖药物化学家与AI工程师之间的低效协作流程,效率低下。其解决方案的关键在于提出FRAGMENTA框架:一是设计了一种新颖的生成模型,将碎片化过程重构为“词汇选择”问题,并通过动态Q-learning联合优化碎片划分与分子生成;二是构建了一个代理型AI系统(agentic AI system),通过与领域专家的对话反馈自动迭代优化目标,从而实现无需人工干预的自主调优。实验表明,该框架在真实癌症药物发现场景中显著优于基线方法,验证了其在提升分子生成多样性和精准匹配专家意图方面的有效性。

链接: https://arxiv.org/abs/2511.20510
作者: Yuto Suzuki,Paul Awolade,Daniel V. LaBarbera,Farnoush Banaei-Kashani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Molecule generation using generative AI is vital for drug discovery, yet class-specific datasets often contain fewer than 100 training examples. While fragment-based models handle limited data better than atom-based approaches, existing heuristic fragmentation limits diversity and misses key fragments. Additionally, model tuning typically requires slow, indirect collaboration between medicinal chemists and AI engineers. We introduce FRAGMENTA, an end-to-end framework for drug lead optimization comprising: 1) a novel generative model that reframes fragmentation as a “vocabulary selection” problem, using dynamic Q-learning to jointly optimize fragmentation and generation; and 2) an agentic AI system that refines objectives via conversational feedback from domain experts. This system removes the AI engineer from the loop and progressively learns domain knowledge to eventually automate tuning. In real-world cancer drug discovery experiments, FRAGMENTA’s Human-Agent configuration identified nearly twice as many high-scoring molecules as baselines. Furthermore, the fully autonomous Agent-Agent system outperformed traditional Human-Human tuning, demonstrating the efficacy of agentic tuning in capturing expert intent.
zh

[AI-14] From One Attack Domain to Another: Contrastive Transfer Learning with Siamese Networks for APT Detection

【速读】:该论文旨在解决高级持续性威胁(Advanced Persistent Threats, APT)检测中传统机器学习方法面临的三大挑战:类别不平衡、高维特征冗余以及真实攻击样本稀缺导致的跨域泛化能力差的问题。解决方案的关键在于提出一个融合迁移学习(Transfer Learning)、可解释人工智能(Explainable AI, XAI)、对比学习(contrastive learning)与Siamese网络的混合框架:通过注意力机制自动编码器实现跨域知识迁移,利用Shapley Additive exPlanations(SHAP)筛选稳定且信息量大的特征以降低维度和计算开销,同时采用对比损失训练Siamese编码器对齐源域与目标域表征,从而增强异常样本的可分离性并缓解特征漂移问题。该方案在DARPA透明计算(Transparent Computing, TC)项目的真实数据集上验证有效,并通过合成攻击场景测试了鲁棒性,展现出良好的可扩展性、可解释性和跨域适应能力。

链接: https://arxiv.org/abs/2511.20500
作者: Sidahmed Benabderrahmane,Talal Rahwan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Advanced Persistent Threats (APT) pose a major cybersecurity challenge due to their stealth, persistence, and adaptability. Traditional machine learning detectors struggle with class imbalance, high dimensional features, and scarce real world traces. They often lack transferability-performing well in the training domain but degrading in novel attack scenarios. We propose a hybrid transfer framework that integrates Transfer Learning, Explainable AI (XAI), contrastive learning, and Siamese networks to improve cross-domain generalization. An attention-based autoencoder supports knowledge transfer across domains, while Shapley Additive exPlanations (SHAP) select stable, informative features to reduce dimensionality and computational cost. A Siamese encoder trained with a contrastive objective aligns source and target representations, increasing anomaly separability and mitigating feature drift. We evaluate on real-world traces from the DARPA Transparent Computing (TC) program and augment with synthetic attack scenarios to test robustness. Across source to target transfers, the approach delivers improved detection scores with classical and deep baselines, demonstrating a scalable, explainable, and transferable solution for APT detection.
zh

[AI-15] Quantifying the Privacy Implications of High-Fidelity Synthetic Network Traffic

【速读】:该论文旨在解决生成式网络流量(Generative Network Traffic)在隐私保护方面的不确定性问题,即合成流量是否隐含敏感信息及其泄露程度尚不明确,且不同模型架构对隐私风险的影响缺乏系统评估。其解决方案的关键在于提出一套综合性的隐私度量指标体系,结合标准的成员推断攻击(Membership Inference Attacks, MIA)与数据提取攻击,并引入网络特异性标识符和属性作为评估维度,从而对多种代表性生成模型进行全面隐私脆弱性分析。结果表明,不同模型和数据集间的隐私风险差异显著,MIA成功率可达88%,部分场景下可完全恢复网络标识符,进而识别出关键影响因素如训练数据多样性及模型拟合度,为设计低隐私泄露的生成模型提供了可操作的指导依据。

链接: https://arxiv.org/abs/2511.20497
作者: Van Tran,Shinan Liu,Tian Li,Nick Feamster
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 13 Figures, 6 Tables

点击查看摘要

Abstract:To address the scarcity and privacy concerns of network traffic data, various generative models have been developed to produce synthetic traffic. However, synthetic traffic is not inherently privacy-preserving, and the extent to which it leaks sensitive information, and how to measure such leakage, remain largely unexplored. This challenge is further compounded by the diversity of model architectures, which shape how traffic is represented and synthesized. We introduce a comprehensive set of privacy metrics for synthetic network traffic, combining standard approaches like membership inference attacks (MIA) and data extraction attacks with network-specific identifiers and attributes. Using these metrics, we systematically evaluate the vulnerability of different representative generative models and examine the factors that influence attack success. Our results reveal substantial variability in privacy risks across models and datasets. MIA success ranges from 0% to 88%, and up to 100% of network identifiers can be recovered from generated traffic, highlighting serious privacy vulnerabilities. We further identify key factors that significantly affect attack outcomes, including training data diversity and how well the generative model fits the training data. These findings provide actionable guidance for designing and deploying generative models that minimize privacy leakage, establishing a foundation for safer synthetic network traffic generation.
zh

[AI-16] MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology NEURIPS2025

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在生物医学推理任务中评估不足的问题,特别是现有基准测试无法反映真实临床场景下多代理协作决策的复杂性,如分子肿瘤委员会(Molecular Tumor Boards, MTBs)所体现的纵向、多模态和跨专家的知识整合过程。解决方案的关键在于提出MTBBench——一个模拟MTB决策流程的代理型基准测试框架,其核心创新包括:(1) 设计具有临床挑战性的多模态、纵向肿瘤学问题,确保任务与实际医疗工作流一致;(2) 通过临床医生协同开发的应用程序对标注结果进行验证,保障临床相关性;(3) 提供基于基础模型工具的代理框架,增强模型在多模态信息融合与时间序列数据推理方面的能力,从而显著提升任务性能(最高达9.0%和11.2%)。

链接: https://arxiv.org/abs/2511.20490
作者: Kiril Vasilev,Alexandre Misrahi,Eeshaan Jain,Phil F Cheng,Petros Liakopoulos,Olivier Michielin,Michael Moor,Charlotte Bunne
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to NeurIPS 2025

点击查看摘要

Abstract:Multimodal Large Language Models (LLMs) hold promise for biomedical reasoning, but current benchmarks fail to capture the complexity of real-world clinical workflows. Existing evaluations primarily assess unimodal, decontextualized question-answering, overlooking multi-agent decision-making environments such as Molecular Tumor Boards (MTBs). MTBs bring together diverse experts in oncology, where diagnostic and prognostic tasks require integrating heterogeneous data and evolving insights over time. Current benchmarks lack this longitudinal and multimodal complexity. We introduce MTBBench, an agentic benchmark simulating MTB-style decision-making through clinically challenging, multimodal, and longitudinal oncology questions. Ground truth annotations are validated by clinicians via a co-developed app, ensuring clinical relevance. We benchmark multiple open and closed-source LLMs and show that, even at scale, they lack reliability – frequently hallucinating, struggling with reasoning from time-resolved data, and failing to reconcile conflicting evidence or different modalities. To address these limitations, MTBBench goes beyond benchmarking by providing an agentic framework with foundation model-based tools that enhance multi-modal and longitudinal reasoning, leading to task-level performance gains of up to 9.0% and 11.2%, respectively. Overall, MTBBench offers a challenging and realistic testbed for advancing multimodal LLM reasoning, reliability, and tool-use with a focus on MTB environments in precision oncology.
zh

[AI-17] Ranking-Enhanced Anomaly Detection Using Active Learning-Assisted Attention Adversarial Dual AutoEncoders

【速读】:该论文旨在解决高级持续性威胁(Advanced Persistent Threats, APTs)在真实网络安全环境中因标注数据稀缺而导致监督学习方法难以有效应用的问题。其解决方案的关键在于提出一种基于注意力对抗双自编码器(Attention Adversarial Dual AutoEncoder)的无监督异常检测框架,并引入主动学习(active learning)机制,通过有选择地向专家(oracle)查询不确定性样本的标签,以最小化标注成本并迭代提升模型对APT异常的检测精度。该方法在DARPA透明计算项目生成的真实世界异构操作系统(Android、Linux、BSD、Windows)数据集上验证,其中APT攻击样本占比低至0.004%,仍实现了显著的检测性能提升。

链接: https://arxiv.org/abs/2511.20480
作者: Sidahmed Benabderrahmane,James Cheney,Talal Rahwan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Advanced Persistent Threats (APTs) pose a significant challenge in cybersecurity due to their stealthy and long-term nature. Modern supervised learning methods require extensive labeled data, which is often scarce in real-world cybersecurity environments. In this paper, we propose an innovative approach that leverages AutoEncoders for unsupervised anomaly detection, augmented by active learning to iteratively improve the detection of APT anomalies. By selectively querying an oracle for labels on uncertain or ambiguous samples, we minimize labeling costs while improving detection rates, enabling the model to improve its detection accuracy with minimal data while reducing the need for extensive manual labeling. We provide a detailed formulation of the proposed Attention Adversarial Dual AutoEncoder-based anomaly detection framework and show how the active learning loop iteratively enhances the model. The framework is evaluated on real-world imbalanced provenance trace databases produced by the DARPA Transparent Computing program, where APT-like attacks constitute as little as 0.004% of the data. The datasets span multiple operating systems, including Android, Linux, BSD, and Windows, and cover two attack scenarios. The results have shown significant improvements in detection rates during active learning and better performance compared to other existing approaches.
zh

[AI-18] Universe of Thoughts: Enabling Creative Reasoning with Large Language Models

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在复杂逻辑任务中虽能有效执行传统推理(如Chain-of-Thought prompting),但缺乏创造性推理能力的问题,尤其是在解空间广阔且常规解次优的领域(如药物发现或商业战略制定)。其核心挑战在于如何系统性地激发LLMs生成具有可行性的创新解决方案。解决方案的关键是提出一个受认知科学启发的“思维宇宙”(Universe of Thoughts, UoT)计算框架,该框架定义了三种核心创造性推理范式:组合式(combinational)、探索式(exploratory)和转化式(transformative)推理,通过结构化扩展和引导思维空间,实现对潜在解的系统性探索与创新生成。

链接: https://arxiv.org/abs/2511.20471
作者: Yuto Suzuki,Farnoush Banaei-Kashani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reasoning based on Large Language Models (LLMs) has garnered increasing attention due to outstanding performance of these models in mathematical and complex logical tasks. Beginning with the Chain-of-Thought (CoT) prompting technique, numerous reasoning methods have emerged that decompose problems into smaller, sequential steps (or thoughts). However, existing reasoning models focus on conventional problem-solving and do not necessarily generate creative solutions by ``creative reasoning’'. In domains where the solution space is expansive and conventional solutions are suboptimal, such as drug discovery or business strategization, creative reasoning to discover innovative solutions is crucial. To address this gap, first we introduce a computational framework for creative reasoning inspired by established cognitive science principles. With this framework, we propose three core creative reasoning paradigms, namely, \textitcombinational, \textitexploratory, and \textittransformative reasoning, where each offers specific directions for systematic exploration of the universe of thoughts to generate creative solutions. Next, to materialize this framework using LLMs, we introduce the \textitUniverse of Thoughts (or \textitUoT, for short), a novel set of methods to implement the aforementioned three creative processes. Finally, we introduce three novel tasks that necessitate creative problem-solving, along with an evaluation benchmark to assess creativity from three orthogonal perspectives: feasibility as constraint, and utility and novelty as metrics. With a comparative analysis against the state-of-the-art (SOTA) reasoning techniques as well as representative commercial models with reasoning capability, we show that UoT demonstrates superior performance in creative reasoning.
zh

[AI-19] Efficient and Fast Generative-Based Singing Voice Separation using a Latent Diffusion Model IJCNN2025

【速读】:该论文旨在解决音乐混音中个体声源(如人声)分离的难题,特别是针对传统神经网络方法在处理声源重叠与相关性时性能受限、且训练需完整源信号信息难以获取的问题。其解决方案的关键在于采用扩散模型(diffusion models)进行生成式歌唱人声分离,仅使用孤立人声与混音对应的成对数据进行训练,并借助潜在扩散(latent diffusion)机制——在紧凑的潜在空间中生成样本并解码为音频,从而实现高效优化与快速推理。此方法显著提升了分离性能与推理效率,同时在噪声鲁棒性方面提供了新见解。

链接: https://arxiv.org/abs/2511.20470
作者: Genís Plaja-Roglans,Yun-Ning Hung,Xavier Serra,Igor Pereira
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: Accepted for oral presentation at IJCNN 2025

点击查看摘要

Abstract:Extracting individual elements from music mixtures is a valuable tool for music production and practice. While neural networks optimized to mask or transform mixture spectrograms into the individual source(s) have been the leading approach, the source overlap and correlation in music signals poses an inherent challenge. Also, accessing all sources in the mixture is crucial to train these systems, while complicated. Attempts to address these challenges in a generative fashion exist, however, the separation performance and inference efficiency remain limited. In this work, we study the potential of diffusion models to advance toward bridging this gap, focusing on generative singing voice separation relying only on corresponding pairs of isolated vocals and mixtures for training. To align with creative workflows, we leverage latent diffusion: the system generates samples encoded in a compact latent space, and subsequently decodes these into audio. This enables efficient optimization and faster inference. Our system is trained using only open data. We outperform existing generative separation systems, and level the compared non-generative systems on a list of signal quality measures and on interference removal. We provide a noise robustness study on the latent encoder, providing insights on its potential for the task. We release a modular toolkit for further research on the topic.
zh

[AI-20] DRAFT-RL: Multi-Agent Chain-of-Draft Reasoning for Reinforcement Learning-Enhanced LLM s

【速读】:该论文旨在解决当前多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)框架中因依赖单次响应而缺乏推理路径多样性的问题,从而限制了大型语言模型(Large Language Models, LLMs)在复杂任务中的表现。其解决方案的关键在于提出DRAFT-RL框架,该框架将Chain-of-Draft(CoD)推理机制融入多智能体强化学习训练流程:每个代理对同一查询生成多个草稿(draft),由其他代理和一个学习到的奖励模型共同评估并选择最具潜力的推理轨迹,进而通过Actor-Critic结构更新策略,实现显式的多路径探索、同伴引导的反思以及奖励对齐的选择,显著提升了LLM代理行为的鲁棒性与可解释性。

链接: https://arxiv.org/abs/2511.20468
作者: Yuanhao Li,Mingshan Liu,Hongbo Wang,Yiding Zhang,Yifei Ma,Wei Tan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have shown impressive capabilities in multi-step reasoning and this http URL works introduce multi-agent reflection frameworks where multiple LLM agents critique and refine each other’s outputs using reinforcement learning (RL). However, these approaches often rely on single-shot responses and lack structural diversity in reasoning exploration. In this paper, we propose DRAFT-RL, a novel framework that integrates Chain-of-Draft (CoD) reasoning into multi-agent RL training. Instead of generating single responses, each agent produces multiple drafts per query, which are then evaluated by peer agents and a learned reward model to identify the most promising trajectory. These selected drafts are used to refine future reasoning strategies through actor-critic this http URL-RL enables explicit multi-path exploration, peer-guided reflection, and reward-aligned selection, resulting in more robust and interpretable LLM agent behavior. We evaluate our method on complex reasoning tasks including code synthesis, symbolic math, and knowledge-intensive QA,demonstrating that DRAFT-RL outperforms existing reflective and RL-based agents by significant margins in both accuracy and convergence speed
zh

[AI-21] Short-Range Oversquashing

【速读】:该论文旨在解决消息传递神经网络(Message Passing Neural Networks, MPNNs)在图结构数据上学习时存在的“过度压缩”(oversquashing)问题,即模型难以有效处理长距离信息传递。研究发现,oversquashing不仅存在于长程任务中,也会出现在短程任务中,从而揭示了两种不同的机制:一是短程场景下的瓶颈效应(bottleneck phenomenon),二是与长程任务密切相关的梯度消失现象(vanishing gradient phenomenon)。关键在于,现有MPNN改进方法如虚拟节点(virtual nodes)仅能缓解长程问题,无法解决短程瓶颈效应;而图注意力机制(Graph Transformers)在短程任务中表现更优,表明其作为解决oversquashing的更具潜力方案。

链接: https://arxiv.org/abs/2511.20406
作者: Yaaqov Mishayev,Yonatan Sverdlov,Tal Amir,Nadav Dym
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to Learning on Graphs (LoG) 2025. Version identical to the camera-ready paper

点击查看摘要

Abstract:Message Passing Neural Networks (MPNNs) are widely used for learning on graphs, but their ability to process long-range information is limited by the phenomenon of oversquashing. This limitation has led some researchers to advocate Graph Transformers as a better alternative, whereas others suggest that it can be mitigated within the MPNN framework, using virtual nodes or other rewiring techniques. In this work, we demonstrate that oversquashing is not limited to long-range tasks, but can also arise in short-range problems. This observation allows us to disentangle two distinct mechanisms underlying oversquashing: (1) the bottleneck phenomenon, which can arise even in low-range settings, and (2) the vanishing gradient phenomenon, which is closely associated with long-range tasks. We further show that the short-range bottleneck effect is not captured by existing explanations for oversquashing, and that adding virtual nodes does not resolve it. In contrast, transformers do succeed in such tasks, positioning them as the more compelling solution to oversquashing, compared to specialized MPNNs. Comments: Accepted to Learning on Graphs (LoG) 2025. Version identical to the camera-ready paper Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.20406 [cs.LG] (or arXiv:2511.20406v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.20406 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-22] LLM s for Automated Unit Test Generation and Assessment in Java: The AgoneTest Framework

【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)生成的单元测试在实际软件开发中缺乏标准化评估框架的问题。当前LLM生成的测试质量参差不齐,且缺少统一、可复现的评价体系来比较不同模型和提示策略的效果。解决方案的关键在于提出AgoneTest框架,其核心包括一个名为Classes2Test的数据集,用于映射待测Java类与其对应的测试类,并集成多种高级评估指标(如变异分数和测试异味),构建端到端的自动化评估流程。该框架支持在真实场景下对LLM生成的测试进行系统性比较,实验证明,在编译通过的测试子集中,LLM生成的测试在覆盖率和缺陷检测能力上可达到甚至超越人工编写的测试水平,同时表明优化提示策略能显著提升测试质量。

链接: https://arxiv.org/abs/2511.20403
作者: Andrea Lops,Fedelucio Narducci,Azzurra Ragone,Michelantonio Trizio,Claudio Barto
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted at 40th IEEE/ACM International Conference on Automated Software Engineering

点击查看摘要

Abstract:Unit testing is an essential but resource-intensive step in software development, ensuring individual code units function correctly. This paper introduces AgoneTest, an automated evaluation framework for Large Language Model-generated (LLM) unit tests in Java. AgoneTest does not aim to propose a novel test generation algorithm; rather, it supports researchers and developers in comparing different LLMs and prompting strategies through a standardized end-to-end evaluation pipeline under realistic conditions. We introduce the Classes2Test dataset, which maps Java classes under test to their corresponding test classes, and a framework that integrates advanced evaluation metrics, such as mutation score and test smells, for a comprehensive assessment. Experimental results show that, for the subset of tests that compile, LLM-generated tests can match or exceed human-written tests in terms of coverage and defect detection. Our findings also demonstrate that enhanced prompting strategies contribute to test quality. AgoneTest clarifies the potential of LLMs in software testing and offers insights for future improvements in model design, prompt engineering, and testing practices.
zh

[AI-23] NNGPT : Rethinking AutoML with Large Language Models

【速读】:该论文旨在解决构建具备自我改进能力的AI系统这一根本性挑战,尤其聚焦于神经网络(Neural Network)开发中的自动化与优化问题。其核心解决方案是提出NNGPT框架,该框架将大型语言模型(Large Language Model, LLM)转变为一个自增强的自动机器学习(AutoML)引擎,通过闭环生成-评估-自我改进机制扩展神经网络数据集,并在统一工作流中集成五个协同的LLM驱动管道:零样本架构合成、超参数优化(Hyperparameter Optimization, HPO)、代码感知的准确率/早停预测、检索增强的范围封闭PyTorch模块合成(NN-RAG)以及强化学习。关键创新在于利用Lemur数据集作为可复现指标的审计语料库,实现从单一提示到端到端执行与学习的全流程自动化,从而显著提升模型生成效率与质量,例如NN-RAG达到73%执行成功率,HPO性能优于Optuna,且支持大规模自主生成5000+验证模型,验证了其作为自治AutoML引擎的有效性。

链接: https://arxiv.org/abs/2511.20333
作者: Roman Kochnev,Waleed Khalid,Tolgay Atinc Uzun,Xi Zhang,Yashkumar Sanjaybhai Dhameliya,Furui Qin,Chandini Vysyaraju,Raghuvir Duvvuri,Avi Goyal,Dmitry Ignatov,Radu Timofte
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:

点击查看摘要

Abstract:Building self-improving AI systems remains a fundamental challenge in the AI domain. We present NNGPT, an open-source framework that turns a large language model (LLM) into a self-improving AutoML engine for neural network development, primarily for computer vision. Unlike previous frameworks, NNGPT extends the dataset of neural networks by generating new models, enabling continuous fine-tuning of LLMs based on closed-loop system of generation, assessment, and self-improvement. It integrates within one unified workflow five synergistic LLM-based pipelines: zero-shot architecture synthesis, hyperparameter optimization (HPO), code-aware accuracy/early-stop prediction, retrieval-augmented synthesis of scope-closed PyTorch blocks (NN-RAG), and reinforcement learning. Built on the LEMUR dataset as an audited corpus with reproducible metrics, NNGPT emits from a single prompt and validates network architecture, preprocessing code, and hyperparameters, executes them end-to-end, and learns from result. The PyTorch adapter makes NNGPT framework-agnostic, enabling strong performance: NN-RAG achieves 73% executability on 1,289 targets, 3-shot prompting boosts accuracy on common datasets, and hash-based deduplication saves hundreds of runs. One-shot prediction matches search-based AutoML, reducing the need for numerous trials. HPO on LEMUR achieves RMSE 0.60, outperforming Optuna (0.64), while the code-aware predictor reaches RMSE 0.14 with Pearson r=0.78. The system has already generated over 5K validated models, proving NNGPT as an autonomous AutoML engine. Upon acceptance, the code, prompts, and checkpoints will be released for public access to enable reproducibility and facilitate community usage.
zh

[AI-24] Active Inference in Discrete State Spaces from First Principles

【速读】:该论文旨在澄清主动推理(active inference)的概念,将其与自由能原理(Free Energy Principle)区分开来。其核心问题在于:如何在离散状态空间中实现主动推理,而不依赖于预期自由能(expected free energy)的框架。解决方案的关键在于将主动推理所需的优化问题形式化为约束散度最小化问题,并利用标准均场方法(mean field methods)进行求解,这些方法不涉及预期自由能的概念。在此框架下,用于建模感知时,提出的感知/动作散度准则等价于变分自由能;而用于建模行动时,则与预期自由能函数相差一个熵正则化项。

链接: https://arxiv.org/abs/2511.20321
作者: Patrick Kenny
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 56 pages

点击查看摘要

Abstract:We seek to clarify the concept of active inference by disentangling it from the Free Energy Principle. We show how the optimizations that need to be carried out in order to implement active inference in discrete state spaces can be formulated as constrained divergence minimization problems which can be solved by standard mean field methods that do not appeal to the idea of expected free energy. When it is used to model perception, the perception/action divergence criterion that we propose coincides with variational free energy. When it is used to model action, it differs from an expected free energy functional by an entropy regularizer.
zh

[AI-25] Data Augmentation Techniques to Reverse-Engineer Neural Network Weights from Input-Output Queries

【速读】:该论文试图解决在教师-学生框架下,当教师网络参数数量远超训练数据量时,学生网络难以准确逆向工程(reverse-engineer)教师网络权重的问题。传统方法在此场景中因学生过拟合查询样本而失效,无法实现对教师参数的有效拟合。解决方案的关键在于设计专门针对网络隐藏层表示空间的数据增强技术,以更有效地采样教师网络的输入-输出映射,从而激发丰富的中间层表征;通过此类定制化增强策略,研究者显著扩展了可恢复的网络规模上限,实验证明可在训练数据点数量的100倍范围内成功恢复教师网络参数。

链接: https://arxiv.org/abs/2511.20312
作者: Alexander Beiser,Flavio Martinelli,Wulfram Gerstner,Johanni Brea
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Proceedings of the III edition of the Workshop on Unifying Representations in Neural Models (UniReps 2025)

点击查看摘要

Abstract:Network weights can be reverse-engineered given enough informative samples of a network’s input-output function. In a teacher-student setup, this translates into collecting a dataset of the teacher mapping – querying the teacher – and fitting a student to imitate such mapping. A sensible choice of queries is the dataset the teacher is trained on. But current methods fail when the teacher parameters are more numerous than the training data, because the student overfits to the queries instead of aligning its parameters to the teacher. In this work, we explore augmentation techniques to best sample the input-output mapping of a teacher network, with the goal of eliciting a rich set of representations from the teacher hidden layers. We discover that standard augmentations such as rotation, flipping, and adding noise, bring little to no improvement to the identification problem. We design new data augmentation techniques tailored to better sample the representational space of the network’s hidden layers. With our augmentations we extend the state-of-the-art range of recoverable network sizes. To test their scalability, we show that we can recover networks of up to 100 times more parameters than training data-points.
zh

[AI-26] RIS-Assisted Downlink Pinching-Antenna Systems: GNN-Enabled Optimization Approaches

【速读】:该论文旨在解决集成可重构智能表面(Reconfigurable Intelligent Surface, RIS)与多波导夹紧天线(Multi-waveguide Pinching Antenna, PA)系统(PASS)后对无线通信性能影响不明确的问题,特别是如何在受限的PA移动区域、总功率预算及RIS单元可调相位约束下,实现多用户下行链路的信息传输中和能效(Energy Efficiency, EE)的最大化。解决方案的关键在于提出了一种新型三阶段图神经网络(Graph Neural Network, GNN),该GNN基于RIS辅助PASS的图结构拓扑,通过无监督训练学习PA位置与RIS相位偏移,最终确定波束赋形向量;其创新性体现在结合凸优化的三种实现策略,可在推理时间与解的最优性之间提供灵活权衡,同时具备良好的泛化能力、性能可靠性与实时适用性。

链接: https://arxiv.org/abs/2511.20305
作者: Changpeng He,Yang Lu,Yanqing Xu,Chong-Yung Chi,Bo Ai,Arumugam Nallanathan
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper investigates a reconfigurable intelligent surface (RIS)-assisted multi-waveguide pinching-antenna (PA) system (PASS) for multi-user downlink information transmission, motivated by the unknown impact of the integration of emerging PASS and RIS on wireless communications. First, we formulate sum rate (SR) and energy efficiency (EE) maximization problems in a unified framework, subject to constraints on the movable region of PAs, total power budget, and tunable phase of RIS elements. Then, by leveraging a graph-structured topology of the RIS-assisted PASS, a novel three-stage graph neural network (GNN) is proposed, which learns PA positions based on user locations, and RIS phase shifts according to composite channel conditions at the first two stages, respectively, and finally determines beamforming vectors. Specifically, the proposed GNN is achieved through unsupervised training, together with three implementation strategies for its integration with convex optimization, thus offering trade-offs between inference time and solution optimality. Extensive numerical results are provided to validate the effectiveness of the proposed GNN, and to support its unique attributes of viable generalization capability, good performance reliability, and real-time applicability. Moreover, the impact of key parameters on RIS-assisted PASS is illustrated and analyzed.
zh

[AI-27] Improving Language Agents through BREW

【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的智能体在训练过程中存在的高计算开销、策略难以解释与迭代优化困难等问题,尤其针对强化学习类方法(如PPO和GRPO)在rollout收敛上效率低下且缺乏可扩展性的问题。其解决方案的关键在于引入BREW(Bootstrapping expeRientially-learned Environmental knoWledge)框架,通过构建并持续精炼智能体从环境中获得的经验知识库(Knowledge Base, KB),实现对智能体行为的结构化记忆管理与可控优化。BREW采用任务评分器与行为评分标准(behavior rubrics)提取经验洞察,并结合状态空间搜索以增强自然语言指令中的鲁棒性,从而在不显著增加计算成本的前提下,提升任务精度(10–20%)、减少API调用次数(10–15%),并使智能体策略具备透明、可解释和可扩展的特性。

链接: https://arxiv.org/abs/2511.20297
作者: Shashank Kirtania,Param Biyani,Priyanshu Gupta,Yasharth Bajpai,Roshni Iyer,Sumit Gulwani,Gustavo Soares
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Model (LLM)-based agents are increasingly applied to tasks requiring structured reasoning, tool use, and environmental adaptation, such as data manipulation, multistep planning, and computer-use automation. However, despite their versatility, current training paradigms for model weight optimization methods, like PPO and GRPO, remain relatively impractical with their high computational overhead for rollout convergence. In addition, the resulting agent policies are difficult to interpret, adapt, or incrementally improve. To address this, we investigate creating and refining structured memory of experiential learning of an agent from its environment as an alternative route to agent optimization. We introduce BREW (Bootstrapping expeRientially-learned Environmental knoWledge), a framework for agent optimization for downstream tasks via KB construction and refinement. In our formulation, we introduce an effective method for partitioning agent memory for more efficient retrieval and refinement. BREW uses task graders and behavior rubrics to learn insights while leveraging state-space search for ensuring robustness from the noise and non-specificity in natural language. Empirical results on real world, domain-grounded benchmarks – OSWorld, \tau^2 Bench, and SpreadsheetBench – show BREW achieves 10-20% improvement in task precision, 10-15% reduction in API/tool calls leading to faster execution time, all while maintaining computational efficiency on par with base models. Unlike prior work where memory is treated as static context, we establish the KB as a modular and controllable substrate for agent optimization – an explicit lever for shaping behavior in a transparent, interpretable, and extensible manner.
zh

[AI-28] Forgetting by Pruning: Data Deletion in Join Cardinality Estimation AAAI26

【速读】:该论文旨在解决多表学习型基数估计(Learned Cardinality Estimation, CE)系统中数据删除(data deletion)所面临的三大挑战:属性级敏感性(attribute-level sensitivity)、跨表传播效应(inter-table propagation)以及域消失导致的多路连接严重高估问题。解决方案的关键在于提出首个专为多表学习型CE系统设计的机器遗忘框架——基数估计剪枝(Cardinality Estimation Pruning, CEP),其核心创新包括:1)分布敏感性剪枝(Distribution Sensitivity Pruning),通过构建半连接删除结果并计算敏感度分数指导参数剪枝;2)域剪枝(Domain Pruning),移除因删除操作而完全消失的取值域的支持。实验表明,CEP在IMDB和TPC-H数据集上显著优于现有方法,尤其在高删除比例下保持最低的Q-error,并大幅减少收敛迭代次数,计算开销仅为微调时间的0.3%-2.5%。

链接: https://arxiv.org/abs/2511.20293
作者: Chaowei He,Yuanjun Liu,Qingzhi Ma,Shenyuan Ren,Xizhao Luo,Lei Zhao,An Liu
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: AAAI26

点击查看摘要

Abstract:Machine unlearning in learned cardinality estimation (CE) systems presents unique challenges due to the complex distributional dependencies in multi-table relational data. Specifically, data deletion, a core component of machine unlearning, faces three critical challenges in learned CE models: attribute-level sensitivity, inter-table propagation and domain disappearance leading to severe overestimation in multi-way joins. We propose Cardinality Estimation Pruning (CEP), the first unlearning framework specifically designed for multi-table learned CE systems. CEP introduces Distribution Sensitivity Pruning, which constructs semi-join deletion results and computes sensitivity scores to guide parameter pruning, and Domain Pruning, which removes support for value domains entirely eliminated by deletion. We evaluate CEP on state-of-the-art architectures NeuroCard and FACE across IMDB and TPC-H datasets. Results demonstrate CEP consistently achieves the lowest Q-error in multi-table scenarios, particularly under high deletion ratios, often outperforming full retraining. Furthermore, CEP significantly reduces convergence iterations, incurring negligible computational overhead of 0.3%-2.5% of fine-tuning time.
zh

[AI-29] SMoG: Schema Matching on Graph

【速读】:该论文旨在解决医疗领域中电子健康记录(Electronic Health Record, EHR)系统在数据集成过程中面临的模式匹配(schema matching)难题,尤其针对大语言模型(Large Language Models, LLMs)在该任务中存在的幻觉问题和缺乏最新领域知识的局限性。解决方案的关键在于提出SMoG(Schema Matching on Graph)框架,其核心创新是通过迭代执行简单的单跳SPARQL查询来利用知识图谱(Knowledge Graph, KG)增强LLMs的推理能力,从而在保证可解释性和可靠性的同时,显著降低存储开销并提升效率。

链接: https://arxiv.org/abs/2511.20285
作者: Mingyu Jeon,Jaeyoung Suh,Suwan Cho
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Schema matching is a critical task in data integration, par- ticularly in the medical domain where disparate Electronic Health Record (EHR) systems must be aligned to standard models like OMOP CDM. While Large Language Models (LLMs) have shown promise in schema matching, they suf- fer from hallucination and lack of up-to-date domain knowl- edge. Knowledge Graphs (KGs) offer a solution by pro- viding structured, verifiable knowledge. However, existing KG-augmented LLM approaches often rely on inefficient complex multi-hop queries or storage-intensive vector-based retrieval methods. This paper introduces SMoG (Schema Matching on Graph), a novel framework that leverages iter- ative execution of simple 1-hop SPARQL queries, inspired by successful strategies in Knowledge Graph Question An- swering (KGQA). SMoG enhances explainability and relia- bility by generating human-verifiable query paths while sig- nificantly reducing storage requirements by directly querying SPARQL endpoints. Experimental results on real-world med- ical datasets demonstrate that SMoG achieves performance comparable to state-of-the-art baselines, validating its effec- tiveness and efficiency in KG-augmented schema matching.
zh

[AI-30] Can LLM s Make (Personalized) Access Control Decisions?

【速读】:该论文旨在解决用户在面对日益复杂和自动化的系统时,因认知负荷过重而导致访问控制决策质量下降的问题,尤其是在传统应用与基于代理的系统中,用户的权限授予行为可能变得不理性或随意。其解决方案的关键在于利用大语言模型(Large Language Models, LLMs)的处理与推理能力,构建一种动态、上下文感知的访问控制机制,使其能够根据用户的自然语言隐私偏好做出更符合个体意图的决策。研究通过用户实验获取了307条自然语言隐私陈述及14,682个访问控制决策数据,并对比通用与个性化LLM版本的表现,发现LLM可达到高达86%的准确率,但同时也揭示了个性化可能导致违反安全最佳实践的风险,从而为实现兼顾个性化、安全性与可用性的自然语言驱动访问控制系统提供了设计依据。

链接: https://arxiv.org/abs/2511.20284
作者: Friederike Groschupp,Daniele Lain,Aritra Dhar,Lara Magdalena Lazier,Srdjan Čapkun
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Precise access control decisions are crucial to the security of both traditional applications and emerging agent-based systems. Typically, these decisions are made by users during app installation or at runtime. Due to the increasing complexity and automation of systems, making these access control decisions can add a significant cognitive load on users, often overloading them and leading to suboptimal or even arbitrary access control decisions. To address this problem, we propose to leverage the processing and reasoning capabilities of large language models (LLMs) to make dynamic, context-aware decisions aligned with the user’s security preferences. For this purpose, we conducted a user study, which resulted in a dataset of 307 natural-language privacy statements and 14,682 access control decisions made by users. We then compare these decisions against those made by two versions of LLMs: a general and a personalized one, for which we also gathered user feedback on 1,446 of its decisions. Our results show that in general, LLMs can reflect users’ preferences well, achieving up to 86% accuracy when compared to the decision made by the majority of users. Our study also reveals a crucial trade-off in personalizing such a system: while providing user-specific privacy preferences to the LLM generally improves agreement with individual user decisions, adhering to those preferences can also violate some security best practices. Based on our findings, we discuss design and risk considerations for implementing a practical natural-language-based access control system that balances personalization, security, and utility. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.20284 [cs.CR] (or arXiv:2511.20284v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2511.20284 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-31] HVAdam: A Full-Dimension Adaptive Optimizer

【速读】:该论文旨在解决自适应优化器(如Adam)在训练大规模模型时泛化性能较差的问题,尤其是在与非自适应方法(如SGD)对比时,后者在经典架构(如CNN)上表现更优。其关键原因在于预条件机制中的适应性限制了优化器对多样化优化景观的适应能力。为解决此问题,作者提出Anon(Adaptivity Non-restricted Optimizer with Novel convergence technique),其核心创新在于引入连续可调的适应性设计,使优化器能够在SGD-like和Adam-like行为之间插值甚至外推;同时提出增量延迟更新(Incremental Delay Update, IDU)机制,替代AMSGrad中硬性最大值追踪策略,从而提升对梯度噪声的鲁棒性并确保在整个适应性范围内实现收敛性保障。

链接: https://arxiv.org/abs/2511.20277
作者: Yiheng Zhang,Shaowu Wu,Yuanzhuo Xu,Jiajun Wu,Shang Xu,Steve Drew,Xiaoguang Niu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Adaptive optimizers such as Adam have achieved great success in training large-scale models like large language models and diffusion models. However, they often generalize worse than non-adaptive methods, such as SGD on classical architectures like CNNs. We identify a key cause of this performance gap: adaptivity in pre-conditioners, which limits the optimizer’s ability to adapt to diverse optimization landscapes. To address this, we propose Anon (Adaptivity Non-restricted Optimizer with Novel convergence technique), a novel optimizer with continuously tunable adaptivity , allowing it to interpolate between SGD-like and Adam-like behaviors and even extrapolate beyond both. To ensure convergence across the entire adaptivity spectrum, we introduce incremental delay update (IDU), a novel mechanism that is more flexible than AMSGrad’s hard max-tracking strategy and enhances robustness to gradient noise. We theoretically establish convergence guarantees under both convex and non-convex settings. Empirically, Anon consistently outperforms state-of-the-art optimizers on representative image classification, diffusion, and language modeling tasks. These results demonstrate that adaptivity can serve as a valuable tunable design principle, and Anon provides the first unified and reliable framework capable of bridging the gap between classical and modern optimizers and surpassing their advantageous properties. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.20277 [cs.LG] (or arXiv:2511.20277v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.20277 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-32] Interpretable Air Pollution Forecasting by Physics-Guided Spatiotemporal Decoupling

【速读】:该论文旨在解决空气质量预测中模型性能与可解释性之间的权衡问题(trade-off between performance and interpretability)。其解决方案的关键在于提出了一种物理引导的、设计即可解释的时空学习框架,该框架将污染物浓度的时空行为分解为两个透明且可加的模块:一是基于风场和地理信息的物理引导输运核(advection),其权重具有方向性;二是可解释的注意力机制,用于学习局部响应并明确未来浓度对特定历史滞后和外生驱动因素的贡献。这种结构设计实现了高预测精度与时空可解释性的统一,为实际空气质量管理提供了更可靠的决策支持。

链接: https://arxiv.org/abs/2511.20257
作者: Zhiguo Zhang,Xiaoliang Ma,Daniel Schlesinger
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to 2025 IEEE International Conference on Big Data

点击查看摘要

Abstract:Accurate and interpretable air pollution forecasting is crucial for public health, but most models face a trade-off between performance and interpretability. This study proposes a physics-guided, interpretable-by-design spatiotemporal learning framework. The model decomposes the spatiotemporal behavior of air pollutant concentrations into two transparent, additive modules. The first is a physics-guided transport kernel with directed weights conditioned on wind and geography (advection). The second is an explainable attention mechanism that learns local responses and attributes future concentrations to specific historical lags and exogenous drivers. Evaluated on a comprehensive dataset from the Stockholm region, our model consistently outperforms state-of-the-art baselines across multiple forecasting horizons. Our model’s integration of high predictive performance and spatiotemporal interpretability provides a more reliable foundation for operational air-quality management in real-world applications.
zh

[AI-33] Actionable and diverse counterfactual explanations incorporating domain knowledge and causal constraints

【速读】:该论文旨在解决现有反事实解释方法在生成可行动解释时忽视真实世界数据中复杂特征依赖关系的问题,导致生成的修改方案不切实际或不可行。其解决方案的关键在于提出一种名为DANCE(Diverse, Actionable, and kNowledge-Constrained Explanations)的方法,该方法通过学习数据中的线性与非线性约束或整合专家提供的依赖图结构,将特征间的因果关系和领域知识纳入约束条件,从而确保生成的反事实样本既符合现实可行性,又具备可操作性和多样性,并在可实现性、多样性与稀疏性之间取得平衡。

链接: https://arxiv.org/abs/2511.20236
作者: Szymon Bobek,Łukasz Bałec,Grzegorz J. Nalepa
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Counterfactual explanations enhance the actionable interpretability of machine learning models by identifying the minimal changes required to achieve a desired outcome of the model. However, existing methods often ignore the complex dependencies in real-world datasets, leading to unrealistic or impractical modifications. Motivated by cybersecurity applications in the email marketing domain, we propose a method for generating Diverse, Actionable, and kNowledge-Constrained Explanations (DANCE), which incorporates feature dependencies and causal constraints to ensure plausibility and real-world feasibility of counterfactuals. Our method learns linear and nonlinear constraints from data or integrates expert-provided dependency graphs, ensuring counterfactuals are plausible and actionable. By maintaining consistency with feature relationships, the method produces explanations that align with real-world constraints. Additionally, it balances plausibility, diversity, and sparsity, effectively addressing key limitations in existing algorithms. The work is developed based on a real-life case study with Freshmail, the largest email marketing company in Poland and supported by a joint RD project Sendguard. Furthermore, we provide an extensive evaluation using 140 public datasets, which highlights its ability to generate meaningful, domain-relevant counterfactuals that outperform other existing approaches based on widely used metrics. The source code for reproduction of the results can be found in a GitHub repository we provide.
zh

[AI-34] Leverag ing weights signals - Predicting and improving generalizability in reinforcement learning

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)智能体在训练环境之外的泛化能力不足的问题,即智能体容易过拟合到特定训练环境中。其解决方案的关键在于:提出一种基于RL智能体神经网络内部权重的泛化能力预测方法,并据此对近端策略优化(Proximal Policy Optimization, PPO)损失函数进行改进,从而在训练过程中主动提升智能体的泛化性能。实验表明,改进后的PPO算法能够生成具有更强泛化能力的智能体。

链接: https://arxiv.org/abs/2511.20234
作者: Olivier Moulin,Vincent Francois-lavet,Paul Elbers,Mark Hoogendoorn
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generalizability of Reinforcement Learning (RL) agents (ability to perform on environments different from the ones they have been trained on) is a key problem as agents have the tendency to overfit to their training environments. In order to address this problem and offer a solution to increase the generalizability of RL agents, we introduce a new methodology to predict the generalizability score of RL agents based on the internal weights of the agent’s neural networks. Using this prediction capability, we propose some changes in the Proximal Policy Optimization (PPO) loss function to boost the generalization score of the agents trained with this upgraded version. Experimental results demonstrate that our improved PPO algorithm yields agents with stronger generalizability compared to the original version.
zh

[AI-35] DUO-TOK: Dual-Track Semantic Music Tokenizer for Vocal-Accompaniment Generation

【速读】:该论文旨在解决现代歌词到歌曲生成系统中重建质量与语言模型(Language Model, LM)可学习性之间的矛盾问题。现有编码器要么追求高保真重建而产生难以建模的声学标记,要么过度压缩为语义友好的标记但造成信息损失,且很少考虑双轨结构(人声与伴奏)的显式建模。解决方案的关键在于提出Duo-Tok——一种源感知的双码本分层编码器,其核心是四阶段自监督学习(SSL)流程:首先在大规模音频上预训练BEST-RQ风格编码器,随后通过高斯替换噪声和多任务监督稳定并因子分解表示,接着冻结编码器并基于SimVQ学习用于人声与伴奏的硬路由双码本,最终在离散标记上训练潜空间扩散解码器。该方法在0.75 kbps下显著提升了重建-生成帕累托前沿,实现了最佳音乐标签准确率(AP)和最低词汇归一化语言模型困惑度(perplexity),同时保持了与当前最优音乐编码器相当的重建质量。

链接: https://arxiv.org/abs/2511.20224
作者: Rui Lin,Zhiyue Wu,Jiahe Le,Kangdi Wang,Weixiong Chen,Junyu Dai,Tao Jiang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: 17 pages, 5 figures, 8 tables. Project page: this https URL

点击查看摘要

Abstract:Duo-Tok is a source-aware dual-codebook tokenizer for vocal-accompaniment music that targets the growing tension between reconstruction quality and language-model (LM) learnability in modern lyrics-to-song systems. Existing codecs either prioritize high-fidelity reconstruction with difficult-to-model acoustic tokens or compress aggressively into semantic tokens that are LM-friendly but lossy, and they rarely make the tokenizer itself aware of dual-track structure. Duo-Tok follows a four-stage, SSL-centered pipeline: we first pretrain a BEST-RQ-style encoder on large-scale audio, then stabilize and factorize the representation with Gaussian replacement noise and multi-task supervision, before freezing the encoder to learn SimVQ-based dual codebooks with hard routing for vocals and accompaniment, and finally training latent diffusion decoders on top of the discrete tokens. Duo-Tok at 0.75 kbps shifts the empirical reconstruction-generation Pareto frontier, achieving the best music-tagging AP and the lowest vocabulary-normalized LM perplexity among compared codecs while maintaining reconstruction quality comparable to state-of-the-art music tokenizers.
zh

[AI-36] Interactive AI NPCs Powered by LLM s: Technical Report for the CPDC Challenge 2025

【速读】:该论文旨在解决常识性人格基础对话(Commonsense Persona-Grounded Dialogue, CPD)任务中的多模态工具调用稳定性、角色扮演一致性与小样本过拟合问题,尤其在GPU Track和API Track两个赛道中提升对话系统的性能。其解决方案的关键在于提出一个简洁但高效的框架:首先,通过上下文工程(Context Engineering)实现输入压缩与后处理优化,包括动态工具剪枝(dynamic tool pruning)、人格片段截断(persona clipping)、参数归一化及函数合并等技术,显著增强工具调用的稳定性和角色扮演引导能力;其次,在GPU Track中引入基于奖励信号直接优化的GRPO(Generalized Reward Policy Optimization)训练策略,替代传统的监督微调(Supervised Fine-Tuning),有效缓解小样本场景下的过拟合问题,从而大幅提升任务导向型对话的表现力。

链接: https://arxiv.org/abs/2511.20200
作者: Yitian Huang,Yuxuan Lei,Jianxun Lian,Hao Liao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This report presents the solution and results of our team MSRA_SC in the Commonsense Persona-Grounded Dialogue Challenge (CPDC 2025). We propose a simple yet effective framework that unifies improvements across both GPU Track and API Track. Our method centers on two key components. First, Context Engineering applies dynamic tool pruning and persona clipping for input compression, combined with post-processing techniques such as parameter normalization and function merging. Together with manually refined prompts, this design improves tool call stability, execution reliability, and role-playing guidance. Second, in the GPU Track, we further adopt GRPO training, replacing supervised fine-tuning with reinforcement learning directly optimized by reward signals. This mitigates small-sample overfitting and significantly enhances task-oriented dialogue performance. In the final evaluation, our team ranks 1st in Task 2 API, 2nd in Task 1 API, and 3rd in both Task 3 API and GPU track, demonstrating the effectiveness of our approach. Our code is publicly available at this https URL
zh

[AI-37] owards Benign Memory Forgetting for Selective Multimodal Large Language Model Unlearning

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在训练过程中可能无意中记忆敏感信息的问题,同时避免现有遗忘方法因导致模型整体图像理解能力下降而无法实现“良性遗忘”(benign forgetting)。其解决方案的关键在于提出 Sculpted Memory Forgetting Adapter (SMFA),该方法通过两阶段机制实现选择性遗忘:首先微调模型将敏感回答替换为拒绝响应,生成一个遗忘适配器;随后引入保留锚点引导的掩码机制(retaining anchor-guided masking mechanism),精准抑制目标记忆区域,同时保护无关知识和基础视觉理解能力。实验表明,SMFA 能在不损害模型通用视觉理解性能的前提下实现精确可控的遗忘。

链接: https://arxiv.org/abs/2511.20196
作者: Zhen Zeng,Leijiang Gu,Zhangling Duan,Feng Li,Zenglin Shi,Cees G. M. Snoek,Meng Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) achieve remarkable capabilities but can inadvertently memorize privacy-sensitive information. Although existing unlearning methods can remove such knowledge, they fail to achieve benign forgetting because they often degrade the model’s general image understanding performance. To address this, we propose the Sculpted Memory Forgetting Adapter (SMFA), which confines forgetting to targeted memory regions while preserving overall capabilities. SMFA first fine-tunes the model to replace sensitive responses with refusals, yielding a memory forgetting adapter, and then applies a retaining anchor-guided masking mechanism to prevent interference with unrelated knowledge and understanding ability. To systematically evaluate selective MLLM unlearning, we introduce S-MLLMUn Bench, the first benchmark designed to jointly assess the removal of sensitive knowledge and retention of general visual understanding. Extensive experiments show that, unlike prior methods, SMFA achieves precise and controllable unlearning while maintaining the model’s foundational image understanding.
zh

[AI-38] Beluga: A CXL-Based Memory Architecture for Scalable and Efficient LLM KVCache Management SIGMOD’26

【速读】:该论文旨在解决大规模语言模型(Large Language Model, LLM)推理过程中因KVCache(键值缓存)占用内存过大而导致的GPU内存瓶颈问题。随着LLM模型规模和长上下文需求的增长,传统依赖CPU DRAM与GPU HBM协同工作的架构受限于CPU内存通道数量,而现有基于RDMA的分离式内存池方案又引入了高延迟、复杂通信协议及同步开销。论文提出Beluga架构,其核心创新在于利用新兴的Compute Express Link (CXL) 技术构建共享大容量内存池,使GPU可直接通过CXL交换机以原生load/store语义访问该内存池,从而实现接近本地内存的访问延迟,同时显著降低编程复杂度和同步开销。在此基础上设计的Beluga-KVCache系统在vLLM推理引擎中实现了89.6%的Time-To-First-Token(TTFT)减少和7.35倍吞吐量提升,是首个支持GPU通过CXL开关直接访问大规模内存池的系统,为GPU高效共享海量内存资源提供了新范式。

链接: https://arxiv.org/abs/2511.20172
作者: Xinjun Yang,Qingda Hu,Junru Li,Feifei Li,Yuqi Zhou,Yicong Zhu,Qiuru Lin,Jian Dai,Yang Kong,Jiayu Zhang,Guoqiang Xu,Qiang Liu
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 13 pages, accepted by SIGMOD’26

点击查看摘要

Abstract:The rapid increase in LLM model sizes and the growing demand for long-context inference have made memory a critical bottleneck in GPU-accelerated serving systems. Although high-bandwidth memory (HBM) on GPUs offers fast access, its limited capacity necessitates reliance on host memory (CPU DRAM) to support larger working sets such as the KVCache. However, the maximum DRAM capacity is constrained by the limited number of memory channels per CPU socket. To overcome this limitation, current systems often adopt RDMA-based disaggregated memory pools, which introduce significant challenges including high access latency, complex communication protocols, and synchronization overhead. Fortunately, the emerging CXL technology introduces new opportunities in KVCache design. In this paper, we propose Beluga, a novel memory architecture that enables GPUs and CPUs to access a shared, large-scale memory pool through CXL switches. By supporting native load/store access semantics over the CXL fabric, our design delivers near-local memory latency, while reducing programming complexity and minimizing synchronization overhead. We conduct a systematic characterization of a commercial CXL switch-based memory pool and propose a set of design guidelines. Based on Beluga, we design and implement Beluga-KVCache, a system tailored for managing the large-scale KVCache in LLM inference. Beluga-KVCache achieves an 89.6% reduction in Time-To-First-Token (TTFT) and 7.35x throughput improvement in the vLLM inference engine compared to RDMA-based solutions. To the best of our knowledge, Beluga is the first system that enables GPUs to directly access large-scale memory pools through CXL switches, marking a significant step toward low-latency, shared access to vast memory resources by GPUs.
zh

[AI-39] On the Limits of Momentum in Decentralized and Federated Optimization NEURIPS2025

【速读】:该论文旨在解决在去中心化联邦学习(Federated Learning, FL)场景下,动量(momentum)方法是否能在存在无界统计异质性(statistical heterogeneity)时保证收敛的问题。现有研究表明,动量在本地优化中可缓解异质性影响,但在客户端周期性参与(cyclic client participation)的去中心化设置中,理论分析表明动量仍无法消除异质性的干扰。其关键发现是:无论步长衰减速率如何,只要快于 Θ(1/t)\Theta(1/t),算法最终将收敛到一个依赖于初始值和异质性上界的常数解,而非全局最优解;这揭示了动量在异质性环境下固有的局限性,为设计更鲁棒的分布式优化算法提供了理论依据。

链接: https://arxiv.org/abs/2511.20168
作者: Riccardo Zaccone,Sai Praneeth Karimireddy,Carlo Masone
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the 17th Workshop on Optimization for Machine Learning (OPT@NeurIPS2025)

点击查看摘要

Abstract:Recent works have explored the use of momentum in local methods to enhance distributed SGD. This is particularly appealing in Federated Learning (FL), where momentum intuitively appears as a solution to mitigate the effects of statistical heterogeneity. Despite recent progress in this direction, it is still unclear if momentum can guarantee convergence under unbounded heterogeneity in decentralized scenarios, where only some workers participate at each round. In this work we analyze momentum under cyclic client participation, and theoretically prove that it remains inevitably affected by statistical heterogeneity. Similarly to SGD, we prove that decreasing step-sizes do not help either: in fact, any schedule decreasing faster than \Theta\left(1/t\right) leads to convergence to a constant value that depends on the initialization and the heterogeneity bound. Numerical results corroborate the theory, and deep learning experiments confirm its relevance for realistic settings.
zh

[AI-40] IDAP: Advancing Divergence-Based Pruning via Filter-Level and Layer-Level Optimization

【速读】:该论文旨在解决神经网络压缩中冗余信息的高效去除问题,特别是在滤波器(filter)层级和架构(architectural)层级上的冗余识别与优化。其解决方案的关键在于提出一种基于信息流分析的统一框架,核心是利用张量流发散(tensor flow divergence)这一指标来量化信息在各层间的变换程度,并据此设计两阶段优化策略:第一阶段通过迭代的发散感知剪枝(divergence-aware pruning)剔除冗余滤波器,保留关键信息通路;第二阶段进一步扩展至架构层面,分析每层对信息传播的贡献度,选择性移除对性能影响最小的层。该方法可适配多种架构(如卷积神经网络、Transformer及混合模型),提供跨层类型的一致结构重要性评估标准,从而在保持竞争力准确率的同时实现显著的参数压缩。

链接: https://arxiv.org/abs/2511.20141
作者: Aleksei Samarin,Artem Nazarenko,Egor Kotenko,Valentin Malykh,Alexander Savelev,Aleksei Toropov
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 65 pages, 4 figures, 38 tables

点击查看摘要

Abstract:This paper presents a novel approach to neural network compression that addresses redundancy at both the filter and architectural levels through a unified framework grounded in information flow analysis. Building on the concept of tensor flow divergence, which quantifies how information is transformed across network layers, we develop a two-stage optimization process. The first stage employs iterative divergence-aware pruning to identify and remove redundant filters while preserving critical information pathways. The second stage extends this principle to higher-level architecture optimization by analyzing layer-wise contributions to information propagation and selectively eliminating entire layers that demonstrate minimal impact on network performance. The proposed method naturally adapts to diverse architectures, including convolutional networks, transformers, and hybrid designs, providing a consistent metric for comparing the structural importance across different layer types. Experimental validation across multiple modern architectures and datasets reveals that this combined approach achieves substantial model compression while maintaining competitive accuracy. The presented approach achieves parameter reduction results that are globally comparable to those of state-of-the-art solutions and outperforms them across a wide range of modern neural network architectures, from convolutional models to transformers. The results demonstrate how flow divergence serves as an effective guiding principle for both filter-level and layer-level optimization, offering practical benefits for deployment in resource-constrained environments.
zh

[AI-41] From data to concepts via wiring diagrams

【速读】:该论文旨在解决从顺序数据中自动提取结构化行为模式的问题,特别是如何将复杂的时间过程抽象为可解释的图结构表示。其核心解决方案是引入**准骨架布线图(quasi-skeleton wiring diagram graph)的概念,并证明这类图与哈斯图(Hasse diagram)**一一对应,从而建立了一个理论基础,使得可以从序列数据中高效提取出具有语义意义的因果或依赖关系结构。基于此理论,作者设计了相应的算法来识别自主代理在计算机游戏中所采用的获胜策略,并通过对比DBSCAN和凝聚层次聚类等传统聚类方法,在数据扰动情况下仍表现出优越的鲁棒性和准确性,体现了该方法在强化学习与数据工程交叉场景中的有效性。

链接: https://arxiv.org/abs/2511.20138
作者: Jason Lo,Mohammadnima Jafari
机构: 未知
类目: Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Machine Learning (cs.LG); Combinatorics (math.CO)
备注: 19 pages

点击查看摘要

Abstract:A wiring diagram is a labeled directed graph that represents an abstract concept such as a temporal process. In this article, we introduce the notion of a quasi-skeleton wiring diagram graph, and prove that quasi-skeleton wiring diagram graphs correspond to Hasse diagrams. Using this result, we designed algorithms that extract wiring diagrams from sequential data. We used our algorithms in analyzing the behavior of an autonomous agent playing a computer game, and the algorithms correctly identified the winning strategies. We compared the performance of our main algorithm with two other algorithms based on standard clustering techniques (DBSCAN and agglomerative hierarchical), including when some of the data was perturbed. Overall, this article brings together techniques in category theory, graph theory, clustering, reinforcement learning, and data engineering.
zh

[AI-42] he Making of Digital Ghosts: Designing Ethical AI Afterlives

【速读】:该论文旨在解决人工智能驱动的数字遗存(digital ghosts)在个人、商业与机构场景中日益普及所带来的伦理挑战,包括哀伤与福祉、真实性与欺骗、同意与死后隐私、尊严与误读,以及哀悼商品化等问题。其解决方案的关键在于提出一个九维分类法来系统化理解数字后生命技术,并进一步界定一种伦理上可接受的数字幽灵应具备的核心特征:生前意图、相互同意、透明且有限的数据使用、明确披露、用途与访问受限、家庭或遗产管理、以及最小行为自主性,从而确保数字幽灵能够辅助记忆而不滑向欺骗。

链接: https://arxiv.org/abs/2511.20094
作者: Giovanni Spitale,Federico Germani
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Advances in artificial intelligence now make it possible to simulate the dead through chatbots, voice clones, and video avatars trained on a person’s digital traces. These “digital ghosts” are moving from fiction to commercial reality, reshaping how people mourn and remember. This paper offers a conceptual and ethical analysis of AI-mediated digital afterlives. We define what counts as a digital ghost, trace their rise across personal, commercial, and institutional contexts, and identify core ethical tensions around grief and well-being, truthfulness and deception, consent and posthumous privacy, dignity and misrepresentation, and the commercialization of mourning. To analyze these challenges, we propose a nine-dimensional taxonomy of digital afterlife technologies and, building on it, outline the features of an ethically acceptable digital ghost: premortem intent, mutual consent, transparent and limited data use, clear disclosure, restricted purposes and access, family or estate stewardship, and minimal behavioral agency. We argue for targeted regulation and professional guidelines to ensure that digital ghosts can aid remembrance without slipping into forms of deception.
zh

[AI-43] R3A: Reliable RTL Repair Framework with Multi-Agent Fault Localization and Stochastic Tree-of-Thoughts Patch Generation

【速读】:该论文旨在解决寄存器传输级(Register Transfer Level, RTL)代码中缺陷修复的可靠性与覆盖范围问题。传统自动程序修复(Automatic Program Repair, APR)方法依赖固定模板,难以应对复杂或多样化的bug;而基于大语言模型(Large Language Models, LLMs)的方法虽具备理解代码语义的能力,却因固有随机性和RTL代码及波形输入上下文过长导致结果不可靠。为此,作者提出R3A框架,其核心创新在于:一是引入随机性控制的“思维树”(Tree-Of-Thoughts)策略以引导补丁生成代理在探索与利用之间平衡,提升修复路径的稳定性;二是设计多智能体故障定位机制,精准识别潜在错误候选点作为补丁生成起点,从而显著增强修复的可靠性和有效性。实验表明,R3A在给定时间限制内可修复90.6%的RTL缺陷,较传统方法和现有LLM方案分别多覆盖45%的bug,并实现平均86.7%的pass@5率。

链接: https://arxiv.org/abs/2511.20090
作者: Zizhang Luo,Fan Cui,Kexing Zhou,Runlin Guo,Mile Xia,Hongyuan Hou,Yun Lian
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Repairing RTL bugs is crucial for hardware design and verification. Traditional automatic program repair (APR) methods define dedicated search spaces to locate and fix bugs with program synthesis. However, they heavily rely on fixed templates and can only deal with limited bugs. As an alternative, Large Language Models with the ability to understand code semantics can be explored for RTL repair. However, they suffer from unreliable outcomes due to inherent randomness and long input contexts of RTL code and waveform. To address these challenges, we propose R3A, an LLM-based automatic RTL program repair framework upon the basic model to improve reliability. R3A proposes the stochastic Tree-Of-Thoughts method to control a patch generation agent to explore a validated solution for the bug. The algorithm samples search states according to a heuristic function to balance between exploration and exploitation for a reliable outcome. Besides, R3A proposes a multi-agent fault localization method to find fault candidates as the starting points for the patch generation agent, further increasing the reliability. Experiments show R3A can fix 90.6% of bugs in the RTL-repair dataset within a given time limit, which covers 45% more bugs than traditional methods and other LLM-based approaches, while achieving an 86.7% pass@5 rate on average, showing a high reliability.
zh

[AI-44] VICoT-Agent : A Vision-Interleaved Chain-of-Thought Framework for Interpretable Multimodal Reasoning and Scalable Remote Sensing Analysis

【速读】:该论文旨在解决遥感图像分析任务从传统目标识别向复杂智能推理演进过程中,模型推理能力不足与工具调用灵活性欠缺的问题。其核心解决方案是提出一种新的多模态智能体框架——视觉交错思维链框架(Vision-Interleaved Chain-of-Thought Framework, VICoT),通过动态将视觉工具嵌入思维链中实现显式的多轮推理;该框架采用基于栈的推理结构和模块化MCP兼容工具集,使大语言模型(LLM)能够高效执行多轮交错的视觉-语言推理任务,并具备良好泛化能力;此外,论文还提出“推理栈蒸馏”(Reasoning Stack distillation)方法,可将复杂智能体行为迁移至轻量级模型,在显著降低计算复杂度的同时保留推理能力。

链接: https://arxiv.org/abs/2511.20085
作者: Chujie Wang,Zhiyuan Luo,Ruiqi Liu,Can Ran,Shenghua Fan,Xi Chen,Chu He
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:The current remote sensing image analysis task is increasingly evolving from traditional object recognition to complex intelligence reasoning, which places higher requirements on the model’s reasoning ability and the flexibility of tool invocation. To this end, we propose a new multimodal agent framework, Vision-Interleaved Chain-of-Thought Framework (VICoT), which implements explicit multi-round reasoning by dynamically incorporating visual tools into the chain of thought. Through a stack-based reasoning structure and a modular MCP-compatible tool suite, VICoT enables LLMs to efficiently perform multi-round, interleaved vision-language reasoning tasks with strong generalization and this http URL also propose the Reasoning Stack distillation method to migrate complex Agent behaviors to small, lightweight models, which ensures the reasoning capability while significantly reducing complexity. Experiments on multiple remote sensing benchmarks demonstrate that VICoT significantly outperforms existing SOTA frameworks in reasoning transparency, execution efficiency, and generation quality.
zh

[AI-45] “Are We Done Yet?”: A Vision-Based Judge for Autonomous Task Completion of Computer Use Agents AAAI2026

【速读】:该论文旨在解决计算机使用代理(Computer Use Agents, CUAs)在自主操作数字界面时,难以可靠判断任务是否完成的问题。解决方案的关键在于提出了一种基于视觉-语言模型(vision-language models)的自主评估与反馈框架,该框架能够直接通过截图和任务描述来评估任务完成情况,并提供反馈以提升CUA的自我纠错能力与任务成功率。实验表明,该方法在任务成功检测上最高可达73%准确率,并在引入评估反馈后平均提升27%的任务成功率。

链接: https://arxiv.org/abs/2511.20067
作者: Marta Sumyk,Oleksandr Kosovan
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: This work has been accepted to appear at the AAAI 2026 Workshop on Trust and Control in Agentic AI (TrustAgent)

点击查看摘要

Abstract:Computer Use Agents (CUAs) are designed to autonomously operate digital interfaces, yet they often fail to reliably determine whether a given task has been completed. We present an autonomous evaluation and feedback framework that uses vision-language models to assess task completion directly from screenshots and task descriptions. Our dataset covers 42 built-in macOS applications and 1,260 human-labeled tasks across a wide range of scenarios. Our framework achieves up to 73 percent accuracy in task success detection and yields an average relative improvement of 27 percent in overall task success when evaluator feedback is applied. These results show that vision-based evaluation can serve as an effective feedback mechanism that improves the reliability and self-correction of autonomous computer-use agents.
zh

[AI-46] Reducing Latency of LLM Search Agent via Speculation-based Algorithm-System Co-Design

【速读】:该论文旨在解决基于大语言模型(Large Language Model, LLM)的搜索代理在多步推理过程中因串行执行导致的严重延迟问题。传统方法中,每一步都需要先进行LLM推理再执行工具操作,形成瓶颈。解决方案的关键在于引入“推测”(speculation)机制,并提出SPAgent框架——该框架采用两阶段自适应推测策略,在安全条件下跳过验证步骤以减少冗余计算;同时设计两级调度器根据系统负载动态调节推测请求,确保推测始终带来性能收益。实验表明,SPAgent可在保持甚至提升准确率的前提下实现最高1.65倍的端到端加速,显著改善多步搜索代理的实际部署可行性。

链接: https://arxiv.org/abs/2511.20048
作者: Zixiao Huang,Wen Zeng,Tianyu Fu,Tengxuan Liu,Yizhou Sun,Ke Hong,Xinhao Yang,Chengchun Liu,Yan Li,Quanlu Zhang,Guohao Dai,Zhenhua Zhu,Yu Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
备注:

点击查看摘要

Abstract:LLM-based search agents achieve strong performance but suffer from severe latency, as each step requires serialized LLM reasoning followed by action of tool execution. We revisit this bottleneck through the lens of speculation. While traditional predict-verify speculation paradigm can break serial execution, its benefit remains limited, as it retains the full original workload and adds extra inference overhead. We observe that early agent steps often involve simple evidence-gathering, where correct actions can often be predicted without full reasoning. Building on these observations, we present SPAgent, an algorithm-system co-design framework that expands the role of speculation in search agents to reduce latency. Algorithmically, SPAgent introduces a two-phase adaptive speculation mechanism that selectively omits verification when safe. System-wise, a two-level scheduler regulates speculative requests based on engine load to ensure speculation remains beneficial. We implement SPAgent in real-world systems. Across extensive experimental settings, SPAgent achieves up to 1.65\times end-to-end speedup while maintaining same or even achieving higher accuracy, enabling practical deployment of multi-step search agents.
zh

[AI-47] Energy Costs and Neural Complexity Evolution in Changing Environments

【速读】:该论文旨在解决神经复杂性演化背后的驱动机制问题,特别是探讨环境可变性与能量成本如何共同影响大脑尺寸和结构的进化。研究通过演化用于强化学习(Reinforcement Learning, RL)代理的人工神经网络(Artificial Neural Networks, ANNs),模拟不同季节性和能量约束条件下的神经结构适应过程。其解决方案的关键在于:在能量受限条件下,高度季节性的环境会降低净能量摄入,从而限制ANN规模,这支持了昂贵大脑假说(Expensive Brain Hypothesis, EBH)而非认知缓冲假说(Cognitive Buffer Hypothesis, CBH);同时发现,ANN结构复杂性主要作为规模的副产物出现,且能量成本促进了更高效网络的演化,揭示了能量约束在塑造神经复杂性中的核心作用。

链接: https://arxiv.org/abs/2511.20018
作者: Sian Heesom-Green,Jonathan Shock,Geoff Nitschke
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: Presented at ALIFE 2025, proceedings forthcoming (MIT Press)

点击查看摘要

Abstract:The Cognitive Buffer Hypothesis (CBH) posits that larger brains evolved to enhance survival in changing conditions. However, larger brains also carry higher energy demands, imposing additional metabolic burdens. Alongside brain size, brain organization plays a key role in cognitive ability and, with suitable architectures, may help mitigate energy challenges. This study evolves Artificial Neural Networks (ANNs) used by Reinforcement Learning (RL) agents to investigate how environmental variability and energy costs influence the evolution of neural complexity, defined in terms of ANN size and structure. Results indicate that under energy constraints, increasing seasonality led to smaller ANNs. This challenges CBH and supports the Expensive Brain Hypothesis (EBH), as highly seasonal environments reduced net energy intake and thereby constrained brain size. ANN structural complexity primarily emerged as a byproduct of size, where energy costs promoted the evolution of more efficient networks. These results highlight the role of energy constraints in shaping neural complexity, offering in silico support for biological theory and energy-efficient robotic design.
zh

[AI-48] Popularity Bias Alignment Estimates

【速读】:该论文旨在解决推荐系统中流行度偏差记忆(Popularity Bias Memorization)的理论边界问题,特别是在任意度分布下的泛化能力与最优对齐性估计。其核心解决方案是将原始定理扩展至更广泛的网络结构,并首次在top-k奇异超空间(singular hyperspace)上建立了上下界估计,从而为理解模型对高频项的记忆机制提供了严格的数学框架。

链接: https://arxiv.org/abs/2511.19999
作者: Anton Lyubinin
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We are extending Popularity Bias Memorization theorem from arXiv:archive/2404.12008 in several directions. We extend it to arbitrary degree distributions and also prove both upper and lower estimates for the alignment with top-k singular hyperspace.
zh

[AI-49] M3Prune: Hierarchical Communication Graph Pruning for Efficient Multi-Modal Multi-Agent Retrieval-Augmented Generation

【速读】:该论文旨在解决多模态检索增强生成(multi-modal retrieval-augmented generation, mRAG)系统中多智能体架构带来的高Token开销与计算成本问题,从而阻碍其大规模部署。解决方案的关键在于提出一种新颖的分层通信图剪枝框架——M³Prune,其核心机制包括:首先对文本和视觉模态内部进行图稀疏化以识别任务关键边;随后基于这些关键边构建动态跨模态通信拓扑;最后通过渐进式剪枝策略获得更高效且具有层次结构的通信图,从而在保持甚至提升任务性能的同时显著降低Token消耗。

链接: https://arxiv.org/abs/2511.19969
作者: Weizi Shao,Taolin Zhang,Zijie Zhou,Chen Chen,Chengyu Wang,Xiaofeng He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in multi-modal retrieval-augmented generation (mRAG), which enhance multi-modal large language models (MLLMs) with external knowledge, have demonstrated that the collective intelligence of multiple agents can significantly outperform a single model through effective communication. Despite impressive performance, existing multi-agent systems inherently incur substantial token overhead and increased computational costs, posing challenges for large-scale deployment. To address these issues, we propose a novel Multi-Modal Multi-agent hierarchical communication graph PRUNING framework, termed M ^3 Prune. Our framework eliminates redundant edges across different modalities, achieving an optimal balance between task performance and token overhead. Specifically, M ^3 Prune first applies intra-modal graph sparsification to textual and visual modalities, identifying the edges most critical for solving the task. Subsequently, we construct a dynamic communication topology using these key edges for inter-modal graph sparsification. Finally, we progressively prune redundant edges to obtain a more efficient and hierarchical topology. Extensive experiments on both general and domain-specific mRAG benchmarks demonstrate that our method consistently outperforms both single-agent and robust multi-agent mRAG systems while significantly reducing token consumption.
zh

[AI-50] Optimize Flip Angle Schedules In MR Fingerprinting Using Reinforcement Learning

【速读】:该论文旨在解决磁共振指纹成像(Magnetic Resonance Fingerprinting, MRF)中脉冲序列参数优化的高维复杂性问题,特别是如何通过自动化的策略设计提升指纹的可区分性。其解决方案的关键在于引入强化学习(Reinforcement Learning, RL)框架来优化翻转角(flip angle)调度,从而在不依赖人工经验的前提下,学习出具有非周期性特征的翻转角序列,显著增强不同组织参数间的指纹分离度;同时,RL优化的调度还可能减少重复时间(repetition time),实现MRF采集加速。

链接: https://arxiv.org/abs/2511.19941
作者: Shenjun Zhong,Zhifeng Chen,Zhaolin Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: 4 pages, 5 figures, submitted to conference

点击查看摘要

Abstract:Magnetic Resonance Fingerprinting (MRF) leverages transient-state signal dynamics generated by the tunable acquisition parameters, making the design of an optimal, robust sequence a complex, high-dimensional sequential decision problem, such as optimizing one of the key parameters, flip angle. Reinforcement learning (RL) offers a promising approach to automate parameter selection, to optimize pulse sequences that maximize the distinguishability of fingerprints across the parameter space. In this work, we introduce an RL framework for optimizing the flip-angle schedule in MRF and demonstrate a learned schedule exhibiting non-periodic patterns that enhances fingerprint separability. Additionally, an interesting observation is that the RL-optimized schedule may enable a reduction in the number of repetition time, potentially accelerate MRF acquisitions.
zh

[AI-51] A System-Level Taxonomy of Failure Modes in Large Language Model Applications

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在实际生产环境中行为不可预测、失败模式复杂且与传统机器学习模型显著不同的问题。其核心挑战在于现有评估和监控手段无法有效捕捉LLM系统的稳定性、可复现性、漂移现象及工作流集成问题。解决方案的关键在于提出一个涵盖十五种隐性故障模式的系统级分类法(system-level taxonomy),并将LLM可靠性视为系统工程问题而非仅模型层面的问题,从而为构建可靠、可维护且成本敏感的LLM应用提供设计原则,推动评价方法、AI系统鲁棒性和部署可靠性等领域的研究发展。

链接: https://arxiv.org/abs/2511.19933
作者: Vaishali Vinay
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are being rapidly integrated into decision-support tools, automation workflows, and AI-enabled software systems. However, their behavior in production environments remains poorly understood, and their failure patterns differ fundamentally from those of traditional machine learning models. This paper presents a system-level taxonomy of fifteen hidden failure modes that arise in real-world LLM applications, including multi-step reasoning drift, latent inconsistency, context-boundary degradation, incorrect tool invocation, version drift, and cost-driven performance collapse. Using this taxonomy, we analyze the growing gap in evaluation and monitoring practices: existing benchmarks measure knowledge or reasoning but provide little insight into stability, reproducibility, drift, or workflow integration. We further examine the production challenges associated with deploying LLMs - including observability limitations, cost constraints, and update-induced regressions - and outline high-level design principles for building reliable, maintainable, and cost-aware LLM systems. Finally, we outline high-level design principles for building reliable, maintainable, and cost-aware LLM-based systems. By framing LLM reliability as a system-engineering problem rather than a purely model-centric one, this work provides an analytical foundation for future research on evaluation methodology, AI system robustness, and dependable LLM deployment.
zh

[AI-52] LLM -EDT: Large Language Model Enhanced Cross-domain Sequential Recommendation with Dual-phase Training

【速读】:该论文针对跨域序列推荐(Cross-domain Sequential Recommendation, CDSR)中存在的两个核心问题展开研究:一是域不平衡问题(imbalance issue),即某一域的用户行为交互占据主导地位,导致难以捕捉其他域的特定特征;二是域间过渡问题(transition issue),即在混合行为序列中难以准确建模用户的跨域偏好,从而影响特定域的下一物品预测性能。为解决上述挑战,论文提出了一种基于大语言模型(Large Language Models, LLMs)增强的双阶段训练框架(LLM-EDT),其关键创新在于:首先设计可迁移物品增强器(transferable item augmenter),自适应生成用户可能的跨域行为以缓解域不平衡并减少无关噪声;其次引入双阶段训练策略,使域特定分支在共享背景知识下学习更精细的偏好表示,从而改善跨域行为序列中的过渡建模;最后构建域感知的用户画像模块(domain-aware profiling module),对每个域内的用户偏好进行精准总结,并自适应聚合形成完整的用户画像,有效缓解粗粒度建模带来的偏差问题。

链接: https://arxiv.org/abs/2511.19931
作者: Ziwei Liu,Qidong Liu,Wanyu Wang,Yejing Wang,Tong Xu,Wei Huang,Chong Chen,Peng Chuan,Xiangyu Zhao
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cross-domain Sequential Recommendation (CDSR) has been proposed to enrich user-item interactions by incorporating information from various domains. Despite current progress, the imbalance issue and transition issue hinder further development of CDSR. The former one presents a phenomenon that the interactions in one domain dominate the entire behavior, leading to difficulty in capturing the domain-specific features in the other domain. The latter points to the difficulty in capturing users’ cross-domain preferences within the mixed interaction sequence, resulting in poor next-item prediction performance for specific domains. With world knowledge and powerful reasoning ability, Large Language Models (LLMs) partially alleviate the above issues by performing as a generator and an encoder. However, current LLMs-enhanced CDSR methods are still under exploration, which fail to recognize the irrelevant noise and rough profiling problems. Thus, to make peace with the aforementioned challenges, we proposed an LLMs Enhanced Cross-domain Sequential Recommendation with Dual-phase Training (LLM-EDT). To address the imbalance issue while introducing less irrelevant noise, we first propose the transferable item augmenter to adaptively generate possible cross-domain behaviors for users. Then, to alleviate the transition issue, we introduce a dual-phase training strategy to empower the domain-specific thread with a domain-shared background. As for the rough profiling problem, we devise a domain-aware profiling module to summarize the user’s preference in each domain and adaptively aggregate them to generate comprehensive user profiles. The experiments on three public datasets validate the effectiveness of our proposed LLM-EDT. To ease reproducibility, we have released the detailed code online at this https URL.
zh

[AI-53] Semantic-KG: Using Knowledge Graphs to Construct Benchmarks for Measuring Semantic Similarity

【速读】:该论文旨在解决当前用于评估大语言模型(Large Language Models, LLMs)生成文本语义相似性的方法存在局限性的问题,特别是现有方法可能更关注句法或词汇形式而非深层语义内容,且现有基准测试常因依赖主观人工判断而成本高昂、领域适配性差、等价定义模糊。其解决方案的关键在于提出一种基于知识图谱(Knowledge Graphs, KGs)的新颖基准生成方法,通过KG自动构建语义相似或不相似的自然语言对,并将不相似对细分为四类子类型,从而在通用知识、生物医学、金融和生物学四个领域构建高质量、可复现的语义相似性评测基准数据集。该方法显著降低了人工标注成本,提升了领域适应性和语义差异的细粒度刻画能力,为评估LLM生成文本的语义一致性提供了更可靠的标准。

链接: https://arxiv.org/abs/2511.19925
作者: Qiyao Wei,Edward Morrell,Lea Goetz,Mihaela van der Schaar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Evaluating the open-form textual responses generated by Large Language Models (LLMs) typically requires measuring the semantic similarity of the response to a (human generated) reference. However, there is evidence that current semantic similarity methods may capture syntactic or lexical forms over semantic content. While benchmarks exist for semantic equivalence, they often suffer from high generation costs due to reliance on subjective human judgment, limited availability for domain-specific applications, and unclear definitions of equivalence. This paper introduces a novel method for generating benchmarks to evaluate semantic similarity methods for LLM outputs, specifically addressing these limitations. Our approach leverages knowledge graphs (KGs) to generate pairs of natural-language statements that are semantically similar or dissimilar, with dissimilar pairs categorized into one of four sub-types. We generate benchmark datasets in four different domains (general knowledge, biomedicine, finance, biology), and conduct a comparative study of semantic similarity methods including traditional natural language processing scores and LLM-as-a-judge predictions. We observe that the sub-type of semantic variation, as well as the domain of the benchmark impact the performance of semantic similarity methods, with no method being consistently superior. Our results present important implications for the use of LLM-as-a-judge in detecting the semantic content of text. Code is available at this https URL and the dataset is available at this https URL.
zh

[AI-54] Zero-Knowledge Proof Based Verifiable Inference of Models

【速读】:该论文旨在解决生成式 AI 模型推理结果的可验证性问题,即在模型所有者无法或不愿公开其参数(因涉及高昂训练成本和知识产权)的前提下,如何确保第三方能够验证模型输出的正确性。解决方案的关键在于提出一个无需可信设置的零知识框架,通过递归组合零知识证明技术,支持线性和非线性神经网络层(如矩阵乘法、归一化、Softmax 和 SiLU),并利用 Fiat-Shamir 启发式方法生成常数大小的非交互式知识论证(zkSNARK),从而实现高效且灵活的模型推理验证。

链接: https://arxiv.org/abs/2511.19902
作者: Yunxiao Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in artificial intelligence (AI), particularly deep learning, have led to widespread adoption across various applications. Yet, a fundamental challenge persists: how can we verify the correctness of AI model inference when model owners cannot (or will not) reveal their parameters? These parameters represent enormous training costs and valuable intellectual property, making transparent verification difficult. In this paper, we introduce a zero-knowledge framework capable of verifying deep learning inference without exposing model internal parameters. Built on recursively composed zero-knowledge proofs and requiring no trusted setup, our framework supports both linear and nonlinear neural network layers, including matrix multiplication, normalization, softmax, and SiLU. Leveraging the Fiat-Shamir heuristic, we obtain a succinct non-interactive argument of knowledge (zkSNARK) with constant-size proofs. To demonstrate the practicality of our approach, we translate the DeepSeek model into a fully SNARK-verifiable version named ZK-DeepSeek and show experimentally that our framework delivers both efficiency and flexibility in real-world AI verification workloads.
zh

[AI-55] RPM-MCTS: Knowledge-Retrieval as Process Reward Model with Monte Carlo Tree Search for Code Generation AAAI2026

【速读】:该论文旨在解决基于树搜索(Tree Search)的代码生成方法在实际应用中面临的两大问题:一是难以有效评估中间算法步骤的正确性,二是无法及时定位并修正错误步骤,导致生成错误代码且计算开销较高。解决方案的关键在于提出RPM-MCTS方法,其核心创新包括:利用基于知识库检索的进程奖励模型(Knowledge-Retrieval as Process Reward Model, RPM)替代复杂训练过程来高效评估中间步骤;在扩展阶段引入相似度过滤机制以减少冗余节点、保障推理路径多样性;并通过沙箱执行反馈精准识别错误步骤,实现即时、定向修正。实验表明,该方法在多个公开代码生成基准上优于当前最先进方法,并将token消耗降低约15%。

链接: https://arxiv.org/abs/2511.19895
作者: Yuanyuan Lin,Xiangyu Ouyang,Teng Zhang,Kaixin Sui
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at AAAI 2026

点击查看摘要

Abstract:Tree search-based methods have made significant progress in enhancing the code generation capabilities of large language models. However, due to the difficulty in effectively evaluating intermediate algorithmic steps and the inability to locate and timely correct erroneous steps, these methods often generate incorrect code and incur increased computational costs. To tackle these problems, we propose RPM-MCTS, an effective method that utilizes Knowledge-Retrieval as Process Reward Model based on Monte Carlo Tree Search to evaluate intermediate algorithmic steps. By utilizing knowledge base retrieval, RPM-MCTS avoids the complex training of process reward models. During the expansion phase, similarity filtering is employed to remove redundant nodes, ensuring diversity in reasoning paths. Furthermore, our method utilizes sandbox execution feedback to locate erroneous algorithmic steps during generation, enabling timely and targeted corrections. Extensive experiments on four public code generation benchmarks demonstrate that RPM-MCTS outperforms current state-of-the-art methods while achieving an approximately 15% reduction in token consumption. Furthermore, full fine-tuning of the base model using the data constructed by RPM-MCTS significantly enhances its code capabilities.
zh

[AI-56] CodeFuse-CommitEval: Towards Benchmarking LLM s Power on Commit Message and Code Change Inconsistency Detection

【速读】:该论文旨在解决版本控制系统中提交信息(commit message)与代码变更差异(diff)之间存在语义不一致的问题,即消息-代码不一致(message-code inconsistency, MCI)。MCI会导致审查误导、维护困难、研究数据污染甚至掩盖安全补丁,而此前缺乏专门用于评估大语言模型(LLM)检测MCI能力的基准。解决方案的关键在于构建首个面向MCI检测的基准数据集CODEFUSE-COMMITEVAL:基于ApacheCM数据集保证多样性和质量,通过规则引导的变异生成七类不一致消息,并采用两阶段验证确保正负样本准确性;在此基础上,系统评估六种主流开源LLM在零样本及三种增强策略(少样本提示、思维链、扩展上下文)下的性能表现,揭示了不同模型和方法在检测准确率、效率及类型敏感性上的差异,为后续MCI检测研究提供了可量化、可比较的基准和方向。

链接: https://arxiv.org/abs/2511.19875
作者: Qingyu Zhang,Puzhuo Liu,Peng Di,Chenxiong Qian
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Version control relies on commit messages to convey the rationale for code changes, but these messages are often low quality and, more critically, inconsistent with their diffs-known as message-code inconsistency (MCI). MCIs mislead reviewers, hinder maintenance, contaminate research datasets, and may obscure security patches. Yet, no dedicated benchmark exists to evaluate models for MCI detection. We introduce CODEFUSE-COMMITEVAL, the first benchmark designed for MCI detection using large language models (LLMs). Built on the ApacheCM dataset for diversity and quality, we generate seven types of inconsistent messages through rule-guided mutations of originally consistent commits and apply two-fold validation to verify both positive and negative samples. Using this labeled dataset of message-diff pairs, we evaluate six state-of-the-art open-source LLMs under a vanilla setting and with three augmentation strategies: few-shot prompting, chain-of-thought, and extended context. Results show models detect inconsistent commits more reliably than consistent ones (average Recall 85.95%, Precision 80.28%, Specificity 63.8%); gpt-oss-20B performs best overall but uses over twice the tokens of others. Augmentation effects vary: adjacent context helps larger models but adds noise for smaller ones; few-shot improves accuracy and reduces token use, yet increases universally incorrect predictions; chain-of-thought boosts precision and specificity at the cost of recall and higher token consumption. Type-wise analysis reveals higher detectability for component, file-path, and operation inconsistencies, but lower accuracy and higher token cost for intent-level “purpose” inconsistencies. CODEFUSE-COMMITEVAL provides a rigorous foundation for measuring, comparing, and advancing MCI detection, highlighting the need for richer context and balanced data to capture high-level semantic gaps.
zh

[AI-57] Cross-LLM Generalization of Behavioral Backdoor Detection in AI Agent Supply Chains

【速读】:该论文旨在解决生成式 AI(Generative AI)在企业级部署中因共享工具库和预训练组件而产生的供应链安全漏洞问题,特别是现有行为后门检测方法无法跨不同大语言模型(Large Language Models, LLMs)有效泛化这一关键挑战。解决方案的关键在于识别出模型特异性行为特征(尤其是时序特征,其变异系数达0.8),并提出一种引入模型身份作为额外特征的“模型感知”检测机制,从而在六种生产级LLM上实现90.6%的统一检测准确率,显著缩小了单模型检测在跨模型场景下的43.4个百分点性能差距。

链接: https://arxiv.org/abs/2511.19874
作者: Arun Chowdary Sanna
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 2 figures, 8 tables. Evaluation across 6 production LLMs with 1,198 traces

点击查看摘要

Abstract:As AI agents become integral to enterprise workflows, their reliance on shared tool libraries and pre-trained components creates significant supply chain vulnerabilities. While previous work has demonstrated behavioral backdoor detection within individual LLM architectures, the critical question of cross-LLM generalization remains unexplored, a gap with serious implications for organizations deploying multiple AI systems. We present the first systematic study of cross-LLM behavioral backdoor detection, evaluating generalization across six production LLMs (GPT-5.1, Claude Sonnet 4.5, Grok 4.1, Llama 4 Maverick, GPT-OSS 120B, and DeepSeek Chat V3.1). Through 1,198 execution traces and 36 cross-model experiments, we quantify a critical finding: single-model detectors achieve 92.7% accuracy within their training distribution but only 49.2% across different LLMs, a 43.4 percentage point generalization gap equivalent to random guessing. Our analysis reveals that this gap stems from model-specific behavioral signatures, particularly in temporal features (coefficient of variation 0.8), while structural features remain stable across architectures. We show that model-aware detection incorporating model identity as an additional feature achieves 90.6% accuracy universally across all evaluated models. We release our multi-LLM trace dataset and detection framework to enable reproducible research.
zh

[AI-58] Simulated Self-Assessment in Large Language Models : A Psychometric Approach to AI Self-Efficacy

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)评估中过度依赖任务准确率、忽视自我效能感(self-efficacy)测量的问题。其解决方案的关键在于引入并适配了10项一般自我效能量表(General Self-Efficacy Scale, GSES),通过结构化提示(psychometric prompting)诱导LLMs在不同任务条件下(无任务、计算推理、社会推理和摘要生成)进行模拟自我评估,从而系统性地量化模型的自我认知水平。结果显示,尽管模型自评分数稳定,但其与实际表现之间缺乏可靠一致性,且高分模型常伴随更强的人格化推理风格,而低分模型则更谨慎、去人格化,表明此类自评可揭示模型沟通行为特征,但无法提供校准的性能估计。

链接: https://arxiv.org/abs/2511.19872
作者: Daniel I Jackson,Emma L Jensen,Syed-Amad Hussain,Emre Sezgin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 25 pages,5 tables, 3 figures

点击查看摘要

Abstract:Self-assessment is a key aspect of reliable intelligence, yet evaluations of large language models (LLMs) focus mainly on task accuracy. We adapted the 10-item General Self-Efficacy Scale (GSES) to elicit simulated self-assessments from ten LLMs across four conditions: no task, computational reasoning, social reasoning, and summarization. GSES responses were highly stable across repeated administrations and randomized item orders. However, models showed significantly different self-efficacy levels across conditions, with aggregate scores lower than human norms. All models achieved perfect accuracy on computational and social questions, whereas summarization performance varied widely. Self-assessment did not reliably reflect ability: several low-scoring models performed accurately, while some high-scoring models produced weaker summaries. Follow-up confidence prompts yielded modest, mostly downward revisions, suggesting mild overestimation in first-pass assessments. Qualitative analysis showed that higher self-efficacy corresponded to more assertive, anthropomorphic reasoning styles, whereas lower scores reflected cautious, de-anthropomorphized explanations. Psychometric prompting provides structured insight into LLM communication behavior but not calibrated performance estimates.
zh

[AI-59] Agent ic AI-Empowered Conversational Embodied Intelligence Networks in 6G

【速读】:该论文旨在解决6G时代多具身智能设备(Multi-Embodied Intelligent Devices, MEIDs)在复杂任务执行中面临的多模态信息融合困难、自适应通信能力不足以及决策可解释性差等问题。其核心解决方案是提出一种协同式对话具身智能网络(Collaborative Conversational Embodied Intelligence Network, CC-EIN),关键创新在于:1)PerceptiNet模块实现图像与雷达数据的跨模态特征融合,生成统一语义表示;2)自适应语义通信策略根据任务紧急性和信道质量动态调整编码方案与传输功率;3)语义驱动的协作机制支持异构设备间的任务分解与无冲突协调;4)InDec模块通过Grad-CAM可视化提升决策透明度。仿真结果表明,该方案在灾后救援场景下实现了95.4%的任务完成率和95%的传输效率,同时保持高语义一致性与能效。

链接: https://arxiv.org/abs/2511.19865
作者: Mingkai Chen,Zijie Feng,Lei Wang,Yaser Khamayseh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 7 pages, 8 figures. Preprint submitted to IEEE Vehicle Technology Magazine

点击查看摘要

Abstract:In the 6G era, semantic collaboration among multiple embodied intelligent devices (MEIDs) becomes crucial for complex task execution. However, existing systems face challenges in multimodal information fusion, adaptive communication, and decision interpretability. To address these limitations, we propose a collaborative Conversational Embodied Intelligence Network (CC-EIN) integrating multimodal feature fusion, adaptive semantic communication, task coordination, and interpretability. PerceptiNet performs cross-modal fusion of image and radar data to generate unified semantic representations. An adaptive semantic communication strategy dynamically adjusts coding schemes and transmission power according to task urgency and channel quality. A semantic-driven collaboration mechanism further supports task decomposition and conflict-free coordination among heterogeneous devices. Finally, the InDec module enhances decision transparency through Grad-CAM visualization. Simulation results in post-earthquake rescue scenarios demonstrate that CC-EIN achieves 95.4% task completion rate and 95% transmission efficiency while maintaining strong semantic consistency and energy efficiency.
zh

[AI-60] MicroSims: A Framework for AI-Generated Scalable Educational Simulations with Universal Embedding and Adaptive Learning Support

【速读】:该论文旨在解决教育模拟(Educational Simulations)在实际应用中长期存在的三大障碍:高昂的开发成本、复杂的技術要求以及平台依赖性,从而限制了其在教学中的普及与个性化使用。解决方案的关键在于提出MicroSims框架,其核心创新包括:(1) 标准化设计模式(standardized design patterns)以支持人工智能辅助生成;(2) 基于iframe的架构实现跨平台嵌入与沙箱安全;(3) 透明可修改的代码结构保障教学透明度与定制灵活性。该框架使教师无需编程即可按需创建符合课程标准的交互式模拟,显著降低使用门槛并提升学习效果(实证表明可提高概念理解达30–40%)。

链接: https://arxiv.org/abs/2511.19864
作者: Valerie Lockhart,Dan McCreary,Troy A. Peterson
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 42 pages, 4 figures

点击查看摘要

Abstract:Educational simulations have long been recognized as powerful tools for enhancing learning outcomes, yet their creation has traditionally required substantial resources and technical expertise. This paper introduces MicroSims a novel framework for creating lightweight, interactive educational simulations that can be rapidly generated using artificial intelligence, universally embedded across digital learning platforms, and easily customized without programming knowledge. MicroSims occupy a unique position at the intersection of three key innovations: (1) standardized design patterns that enable AI-assisted generation, (2) iframe-based architecture that provides universal embedding and sandboxed security, and (3) transparent, modifiable code that supports customization and pedagogical transparency. We present a comprehensive framework encompassing design principles, technical architecture, metadata standards, and development workflows. Drawing on empirical research from physics education studies and meta-analyses across STEM disciplines, we demonstrate that interactive simulations can improve conceptual understanding by up to 30-40% compared to traditional instruction. MicroSims extend these benefits while addressing persistent barriers of cost, technical complexity, and platform dependence. This work has significant implications for educational equity, and low-cost intelligent interactive textbooks that enabling educators worldwide to create customized, curriculum-aligned simulations on demand. We discuss implementation considerations, present evidence of effectiveness, and outline future directions for AI-powered adaptive learning systems built on the MicroSim foundation.
zh

[AI-61] Reinforcement Learning with ω-Regular Objectives and Constraints

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中因依赖标量奖励而导致的局限性问题,包括难以表达时序性、条件性或安全关键目标,以及可能出现的奖励黑客(reward hacking)现象。同时,现有方法在衡量性能时仅使用单一标量指标(如奖励值或满足概率),忽略了安全与性能之间的权衡关系。解决方案的关键在于将ω-正则目标(ω-regular objectives)与显式约束相结合,从而分别建模安全要求和优化目标;作者进一步提出一种基于线性规划的模型化强化学习算法,在极限情况下可生成最大化满足ω-正则目标概率的同时严格遵守指定阈值内ω-正则约束的策略,并通过映射到受限平均回报问题实现了最优性保持的理论保证。

链接: https://arxiv.org/abs/2511.19849
作者: Dominik Wagner,Leon Witzman,Luke Ong
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) commonly relies on scalar rewards with limited ability to express temporal, conditional, or safety-critical goals, and can lead to reward hacking. Temporal logic expressible via the more general class of \omega -regular objectives addresses this by precisely specifying rich behavioural properties. Even still, measuring performance by a single scalar (be it reward or satisfaction probability) masks safety-performance trade-offs that arise in settings with a tolerable level of risk. We address both limitations simultaneously by combining \omega -regular objectives with explicit constraints, allowing safety requirements and optimisation targets to be treated separately. We develop a model-based RL algorithm based on linear programming, which in the limit produces a policy maximising the probability of satisfying an \omega -regular objective while also adhering to \omega -regular constraints within specified thresholds. Furthermore, we establish a translation to constrained limit-average problems with optimality-preserving guarantees. Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2511.19849 [cs.AI] (or arXiv:2511.19849v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2511.19849 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-62] Cisco Time Series Model Technical Report

【速读】:该论文旨在解决传统时间序列预测模型在处理多分辨率输入时的局限性,以及在特定领域(如可观测性数据)中缺乏高性能通用模型的问题。解决方案的关键在于提出一种基于架构创新的单变量零样本预测模型——Cisco Time Series Model,其核心是将多分辨率输入机制引入到流行的解码器-only 时间序列模型(TimesFM)中,从而构建出一个多分辨率解码器-only 模型。该模型在超过3000亿个唯一数据点上进行训练,其中一半以上来自可观测性领域,实验证明其在可观测性数据集上表现优异,同时保持了在通用基准(GIFT-Eval)上的稳定性能,表明多分辨率结构显著提升了长上下文输入下的预测准确性。

链接: https://arxiv.org/abs/2511.19841
作者: Liang Gou,Archit Khare,Praneet Pabolu,Prachi Patel,Joseph Ross,Hercy Shen,Yuhan(Ellen)Song,Jingze Sun,Kristal Curtis,Vedant Dharnidharka,Abhinav Mathur,Hao Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We introduce the Cisco Time Series Model, a univariate zero-shot forecaster. This time series foundation model is the result of a general architectural innovation to a time series model enabling it to accept multiresolution input, applied to a popular decoder-only time series model (TimesFM). The resulting multiresolution decoder-only model is trained on over 300B unique data points, with more than half coming from the observability domain. Quantitative and qualitative evaluations demonstrate that the resulting model achieves superior performance on observability datasets while retaining very similar performance on a standard general-purpose forecasting benchmark (GIFT-Eval), and suggest that the multiresolution structure enables the model to make more accurate predictions on long context input.
zh

[AI-63] GED-Consistent Disentanglement of Aligned and Unaligned Substructures for Graph Similarity Learning

【速读】:该论文旨在解决现有基于图神经网络(Graph Neural Network, GNN)的图相似性计算方法在逼近图编辑距离(Graph Edit Distance, GED)时存在的两个关键问题:一是未能捕捉最优对齐下的全局结构对应关系,二是因节点级信号的虚假关联导致编辑成本误判。其解决方案的核心在于提出GCGSim框架,该框架以图级匹配和子结构级编辑成本为中心,通过三个关键技术贡献实现GED一致性建模,从而有效学习解耦且语义明确的子结构表示,并在四个基准数据集上达到当前最优性能。

链接: https://arxiv.org/abs/2511.19837
作者: Zhentao Zhan,Xiaoliang Xu,Jingjing Wang,Junmei Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:

点击查看摘要

Abstract:Graph Similarity Computation (GSC) is a fundamental graph related task where Graph Edit Distance (GED) serves as a prevalent metric. GED is determined by an optimal alignment between a pair of graphs that partitions each into aligned (zero-cost) and unaligned (cost-incurring) substructures. Due to NP-hard nature of exact GED computation, GED approximations based on Graph Neural Network(GNN) have emerged. Existing GNN-based GED approaches typically learn node embeddings for each graph and then aggregate pairwise node similarities to estimate the final similarity. Despite their effectiveness, we identify a mismatch between this prevalent node-centric matching paradigm and the core principles of GED. This discrepancy leads to two critical limitations: (1) a failure to capture the global structural correspondence for optimal alignment, and (2) a misattribution of edit costs driven by spurious node level signals. To address these limitations, we propose GCGSim, a GED-consistent graph similarity learning framework centering on graph-level matching and substructure-level edit costs. Specifically, we make three core technical contributions. Extensive experiments on four benchmark datasets show that GCGSim achieves state-of-the-art performance. Our comprehensive analyses further validate that the framework effectively learns disentangled and semantically meaningful substructure representations.
zh

[AI-64] Beyond Relational: Semantic-Aware Multi-Modal Analytics with LLM -Native Query Optimization

【速读】:该论文旨在解决传统关系型查询操作符在处理多模态数据分析时语义理解能力有限的问题,从而限制了其在电商、医疗、娱乐等领域的实际应用。其核心挑战在于如何构建一个能够理解自然语言语义并高效执行复杂查询的数据分析系统。解决方案的关键在于提出Nirvana框架,该框架通过两个创新机制实现:一是引入基于自然语言规则的代理式逻辑优化器(agentic logical optimizer),利用随机游走搜索策略探索海量语义等价查询计划空间;二是设计成本感知的物理优化器(cost-aware physical optimizer),基于新颖的改进评分指标动态选择最优的大语言模型(LLM)后端以执行每个操作符。此外,Nirvana还结合模型能力假设指导的计算复用与评估下推技术,显著提升了整体效率和可扩展性。

链接: https://arxiv.org/abs/2511.19830
作者: Junhao Zhu,Lu Chen,Xiangyu Ke,Ziquan Fang,Tianyi Li,Yunjun Gao,Christian S. Jensen
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-modal analytical processing has the potential to transform applications in e-commerce, healthcare, entertainment, and beyond. However, real-world adoption remains elusive due to the limited ability of traditional relational query operators to capture query semantics. The emergence of foundation models, particularly the large language models (LLMs), opens up new opportunities to develop flexible, semantic-aware data analytics systems that transcend the relational paradigm. We present Nirvana, a multi-modal data analytics framework that incorporates programmable semantic operators while leveraging both logical and physical query optimization strategies, tailored for LLM-driven semantic query processing. Nirvana addresses two key challenges. First, it features an agentic logical optimizer that uses natural language-specified transformation rules and random-walk-based search to explore vast spaces of semantically equivalent query plans – far beyond the capabilities of conventional optimizers. Second, it introduces a cost-aware physical optimizer that selects the most effective LLM backend for each operator using a novel improvement-score metric. To further enhance efficiency, Nirvana incorporates computation reuse and evaluation pushdown techniques guided by model capability hypotheses. Experimental evaluations on three real-world benchmarks demonstrate that Nirvana is able to reduce end-to-end runtime by 10%–85% and reduces system processing costs by 76% on average, outperforming state-of-the-art systems at both efficiency and scalability. Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.19830 [cs.DB] (or arXiv:2511.19830v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2511.19830 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-65] A Unified Evaluation-Instructed Framework for Query-Dependent Prompt Optimization

【速读】:该论文旨在解决现有提示优化方法在复杂动态用户场景下效果不佳的问题,尤其是静态提示模板无法适应变化的查询需求,以及现有依赖文本反馈或黑箱奖励模型的查询相关方法所提供的优化信号不稳定且不可解释。其解决方案的关键在于提出一个以性能为导向、系统化且多维的提示质量评估框架,并开发了一个无需执行即可预测多维质量分数的评估器(evaluator),该评估器通过指导一个指标感知的优化器(metric-aware optimizer)来诊断失败模式并以可解释的方式重写提示,从而实现稳定、可解释且与模型无关的提示优化效果。

链接: https://arxiv.org/abs/2511.19829
作者: Ke Chen,Yifeng Wang,Hassan Almosapeeh,Haohan Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Most prompt-optimization methods refine a single static template, making them ineffective in complex and dynamic user scenarios. Existing query-dependent approaches rely on unstable textual feedback or black-box reward models, providing weak and uninterpretable optimization signals. More fundamentally, prompt quality itself lacks a unified, systematic definition, resulting in fragmented and unreliable evaluation signals. Our approach first establishes a performance-oriented, systematic, and comprehensive prompt evaluation framework. Furthermore, we develop and finetune an execution-free evaluator that predicts multi-dimensional quality scores directly from text. The evaluator then instructs a metric-aware optimizer that diagnoses failure modes and rewrites prompts in an interpretable, query-dependent manner. Our evaluator achieves the strongest accuracy in predicting prompt performance, and the evaluation-instructed optimization consistently surpass both static-template and query-dependent baselines across eight datasets and on three backbone models. Overall, we propose a unified, metric-grounded perspective on prompt quality, and demonstrated that our evaluation-instructed optimization pipeline delivers stable, interpretable, and model-agnostic improvements across diverse tasks.
zh

[AI-66] Mosaic Pruning: A Hierarchical Framework for Generalizable Pruning of Mixture-of-Experts Models

【速读】:该论文旨在解决稀疏专家混合模型(Sparse Mixture-of-Experts, SMoE)在实际部署中因静态内存开销大而导致的可扩展性问题,以及现有训练后剪枝方法因依赖单一通用语料库而引发的跨领域性能下降问题。解决方案的关键在于提出一种名为“Mosaic Pruning (MoP)”的新剪枝策略,其核心是通过“聚类-选择”结构化流程构建功能全面的专家集合:首先基于跨任务域的专家表现相似性度量对专家进行功能聚类,随后利用提出的激活变异性评分(Activation Variability Score)从每个簇中选出最具代表性的专家,从而确保剪枝后的模型保留功能互补的专家子集,实现对多样化下游任务的良好泛化能力。

链接: https://arxiv.org/abs/2511.19822
作者: Wentao Hu,Mingkuan Zhao,Shuangyong Song,Xiaoyan Zhu,Xin Lai,Jiayin Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Sparse Mixture-of-Experts (SMoE) architectures have enabled a new frontier in scaling Large Language Models (LLMs), offering superior performance by activating only a fraction of their total parameters during inference. However, their practical deployment is severely hampered by substantial static memory overhead, as all experts must be loaded into memory. Existing post-training pruning methods, while reducing model size, often derive their pruning criteria from a single, general-purpose corpus. This leads to a critical limitation: a catastrophic performance degradation when the pruned model is applied to other domains, necessitating a costly re-pruning for each new domain. To address this generalization gap, we introduce Mosaic Pruning (MoP). The core idea of MoP is to construct a functionally comprehensive set of experts through a structured ``cluster-then-select" process. This process leverages a similarity metric that captures expert performance across different task domains to functionally cluster the experts, and subsequently selects the most representative expert from each cluster based on our proposed Activation Variability Score. Unlike methods that optimize for a single corpus, our proposed Mosaic Pruning ensures that the pruned model retains a functionally complementary set of experts, much like the tiles of a mosaic that together form a complete picture of the original model’s capabilities, enabling it to handle diverse downstream this http URL experiments on various MoE models demonstrate the superiority of our approach. MoP significantly outperforms prior work, achieving a 7.24% gain on general tasks and 8.92% on specialized tasks like math reasoning and code generation.
zh

[AI-67] Learning to Clean: Reinforcement Learning for Noisy Label Correction NEURIPS2025

【速读】:该论文旨在解决机器学习中因标签噪声导致预测模型性能显著下降的问题(learning with noisy labels)。其解决方案的关键在于将标签修正过程建模为强化学习(Reinforcement Learning, RL)问题,提出了一种名为RLNLC(Reinforcement Learning for Noisy Label Correction)的框架:该框架定义了包含数据及其标签的状态空间、可能的标签修正动作空间以及评估修正效果的奖励机制,并通过Actor-Critic方法训练一个基于深度特征表示的策略网络,在迭代过程中自动修正训练标签并提升预测模型性能。

链接: https://arxiv.org/abs/2511.19808
作者: Marzi Heidari,Hanping Zhang,Yuhong Guo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025

点击查看摘要

Abstract:The challenge of learning with noisy labels is significant in machine learning, as it can severely degrade the performance of prediction models if not addressed properly. This paper introduces a novel framework that conceptualizes noisy label correction as a reinforcement learning (RL) problem. The proposed approach, Reinforcement Learning for Noisy Label Correction (RLNLC), defines a comprehensive state space representing data and their associated labels, an action space that indicates possible label corrections, and a reward mechanism that evaluates the efficacy of label corrections. RLNLC learns a deep feature representation based policy network to perform label correction through reinforcement learning, utilizing an actor-critic method. The learned policy is subsequently deployed to iteratively correct noisy training labels and facilitate the training of the prediction model. The effectiveness of RLNLC is demonstrated through extensive experiments on multiple benchmark datasets, where it consistently outperforms existing state-of-the-art techniques for learning with noisy labels.
zh

[AI-68] KOM: A Multi-Agent Artificial Intelligence System for Precision Management of Knee Osteoarthritis (KOA)

【速读】:该论文旨在解决膝骨关节炎(Knee Osteoarthritis, KOA)在资源有限环境中难以实施个性化多学科干预的问题,此类干预虽能延缓疾病进展并提升生活质量,但通常依赖大量医疗资源与专业能力。解决方案的关键在于开发一种名为KOM的多智能体系统(multi-agent system),该系统能够自动化完成KOA的评估、风险预测与治疗处方生成,通过整合患者个体特征、疾病状态、风险因素及禁忌症,辅助临床医生制定精准管理方案,并在模拟试验中显著缩短诊断与规划时间(减少38.5%)且提升治疗质量,从而实现高效、可扩展的KOA智能化管理。

链接: https://arxiv.org/abs/2511.19798
作者: Weizhi Liu,Xi Chen,Zekun Jiang,Liang Zhao,Kunyuan Jiang,Ruisi Tang,Li Wang,Mingke You,Hanyu Zhou,Hongyu Chen,Qiankun Xiong,Yong Nie,Kang Li,Jian Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Knee osteoarthritis (KOA) affects more than 600 million individuals globally and is associated with significant pain, functional impairment, and disability. While personalized multidisciplinary interventions have the potential to slow disease progression and enhance quality of life, they typically require substantial medical resources and expertise, making them difficult to implement in resource-limited settings. To address this challenge, we developed KOM, a multi-agent system designed to automate KOA evaluation, risk prediction, and treatment prescription. This system assists clinicians in performing essential tasks across the KOA care pathway and supports the generation of tailored management plans based on individual patient profiles, disease status, risk factors, and contraindications. In benchmark experiments, KOM demonstrated superior performance compared to several general-purpose large language models in imaging analysis and prescription generation. A randomized three-arm simulation study further revealed that collaboration between KOM and clinicians reduced total diagnostic and planning time by 38.5% and resulted in improved treatment quality compared to each approach used independently. These findings indicate that KOM could help facilitate automated KOA management and, when integrated into clinical workflows, has the potential to enhance care efficiency. The modular architecture of KOM may also offer valuable insights for developing AI-assisted management systems for other chronic conditions.
zh

[AI-69] NOEM3A: A Neuro-Symbolic Ontology-Enhanced Method for Multi-Intent Understanding in Mobile Agents

【速读】:该论文旨在解决移动AI代理中多意图理解(multi-intent understanding)的准确性与效率问题,特别是在资源受限设备上实现高性能自然语言理解(Natural Language Understanding, NLU)。其核心挑战在于如何在保持低计算开销的同时提升模型对复杂、模糊对话中多个意图的识别能力。解决方案的关键在于提出一种神经符号融合框架,通过将结构化的意图本体(intent ontology)与紧凑的语言模型结合,利用检索增强提示(retrieval-augmented prompting)、logit偏置(logit biasing)和可选分类头注入符号化意图结构到输入和输出表示中,从而增强模型的语义一致性与解释性。该方法不仅显著提升了多意图解析的准确性(接近GPT-4水平),还大幅降低能耗与内存占用,验证了符号对齐(symbolic alignment)作为高效且准确的端侧NLU策略的有效性。

链接: https://arxiv.org/abs/2511.19780
作者: Ioannis Tzachristas,Aifen Sui
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce a neuro-symbolic framework for multi-intent understanding in mobile AI agents by integrating a structured intent ontology with compact language models. Our method leverages retrieval-augmented prompting, logit biasing and optional classification heads to inject symbolic intent structure into both input and output representations. We formalize a new evaluation metric-Semantic Intent Similarity (SIS)-based on hierarchical ontology depth, capturing semantic proximity even when predicted intents differ lexically. Experiments on a subset of ambiguous/demanding dialogues of MultiWOZ 2.3 (with oracle labels from GPT-o3) demonstrate that a 3B Llama model with ontology augmentation approaches GPT-4 accuracy (85% vs 90%) at a tiny fraction of the energy and memory footprint. Qualitative comparisons show that ontology-augmented models produce more grounded, disambiguated multi-intent interpretations. Our results validate symbolic alignment as an effective strategy for enabling accurate and efficient on-device NLU.
zh

[AI-70] Scaling Item-to-Standard Alignment with Large Language Models : Accuracy Limits and Solutions

【速读】:该论文旨在解决教育评估项目(assessment items)与内容标准(content standards)之间对齐校验的效率问题,传统人工审核方式虽准确但耗时且劳动密集,难以应对大规模题库的持续验证需求。解决方案的关键在于利用大型语言模型(Large Language Models, LLMs)实现自动化筛选与初步分类,尤其通过预过滤候选技能(candidate skills)策略显著提升准确性——实验表明,在数学任务中LLM可识别约83–94%的对齐状态,阅读任务中因标准语义重叠导致性能下降,但引入候选过滤后,正确技能在前五名建议中的出现率超过95%。因此,研究提出构建“LLM初筛+人工复核”的混合流程,以兼顾效率与精度,为持续的项目验证和教学对齐提供可扩展的技术路径。

链接: https://arxiv.org/abs/2511.19749
作者: Farzan Karimi-Malekabadi,Pooya Razavi,Sonya Powers
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As educational systems evolve, ensuring that assessment items remain aligned with content standards is essential for maintaining fairness and instructional relevance. Traditional human alignment reviews are accurate but slow and labor-intensive, especially across large item banks. This study examines whether Large Language Models (LLMs) can accelerate this process without sacrificing accuracy. Using over 12,000 item-skill pairs in grades K-5, we tested three LLMs (GPT-3.5 Turbo, GPT-4o-mini, and GPT-4o) across three tasks that mirror real-world challenges: identifying misaligned items, selecting the correct skill from the full set of standards, and narrowing candidate lists prior to classification. In Study 1, GPT-4o-mini correctly identified alignment status in approximately 83-94% of cases, including subtle misalignments. In Study 2, performance remained strong in mathematics but was lower for reading, where standards are more semantically overlapping. Study 3 demonstrated that pre-filtering candidate skills substantially improved results, with the correct skill appearing among the top five suggestions more than 95% of the time. These findings suggest that LLMs, particularly when paired with candidate filtering strategies, can significantly reduce the manual burden of item review while preserving alignment accuracy. We recommend the development of hybrid pipelines that combine LLM-based screening with human review in ambiguous cases, offering a scalable solution for ongoing item validation and instructional alignment.
zh

[AI-71] Prompt Fencing: A Cryptographic Approach to Establishing Security Boundaries in Large Language Model Prompts

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在生产环境中面临的提示注入攻击(prompt injection attacks)问题,这是当前LLM部署中最严重的安全威胁之一。解决方案的关键在于提出“提示围栏”(Prompt Fencing)架构,通过引入密码学认证和数据架构原则,在提示中嵌入带有数字签名的元数据(如信任评分和内容类型),使LLM能够区分可信指令与不可信内容。实验表明,即使现有LLM缺乏原生围栏感知能力,仅通过提示指令模拟围栏意识即可将攻击成功率从86.7%(260/300次成功攻击)降至0%(0/300次成功攻击),且实现了一个开销仅为0.224秒(含生成与验证)的原型验证管道,具备平台无关性和渐进式部署优势。

链接: https://arxiv.org/abs/2511.19727
作者: Steven Peh
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 44 pages, 1 figure

点击查看摘要

Abstract:Large Language Models (LLMs) remain vulnerable to prompt injection attacks, representing the most significant security threat in production deployments. We present Prompt Fencing, a novel architectural approach that applies cryptographic authentication and data architecture principles to establish explicit security boundaries within LLM prompts. Our approach decorates prompt segments with cryptographically signed metadata including trust ratings and content types, enabling LLMs to distinguish between trusted instructions and untrusted content. While current LLMs lack native fence awareness, we demonstrate that simulated awareness through prompt instructions achieved complete prevention of injection attacks in our experiments, reducing success rates from 86.7% (260/300 successful attacks) to 0% (0/300 successful attacks) across 300 test cases with two leading LLM providers. We implement a proof-of-concept fence generation and verification pipeline with a total overhead of 0.224 seconds (0.130s for fence generation, 0.094s for validation) across 100 samples. Our approach is platform-agnostic and can be incrementally deployed as a security layer above existing LLM infrastructure, with the expectation that future models will be trained with native fence awareness for optimal security.
zh

[AI-72] An Adaptive Data-Integrated Agent -Based Modeling Framework for Explainable and Contestable Policy Design

【速读】:该论文旨在解决多智能体系统(multi-agent systems)在动态反馈、适应性和非平稳性环境下,传统仿真研究中普遍采用静态决策规则和固定控制参数所带来的局限性问题。其解决方案的关键在于提出一个通用的自适应多智能体学习框架,该框架通过五个核心要素实现:(i) 区分静态与自适应智能体及固定与自适应系统参数的四类动态状态;(ii) 利用信息论诊断指标(如熵率、统计复杂度和预测信息)量化系统的可预测性和结构特征;(iii) 引入结构因果模型以明确干预语义;(iv) 从群体或样本数据中生成智能体级别的先验知识;(v) 运用无监督方法识别涌现的行为模式。该框架提供了一个领域无关的架构,用于系统分析学习智能体与自适应控制如何共同塑造系统轨迹,并支持对非平衡、振荡或漂移动力学下稳定性、性能与可解释性的对比评估。

链接: https://arxiv.org/abs/2511.19726
作者: Roberto Garrone
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: 27 pages, 2 case studies (emissions and smart grids). Preprint prepared during the author’s PhD research at the Open University of Cyprus and the University of Milano-Bicocca. Introduces a unified framework for adaptive multi-agent learning with information-theoretic, causal, and clustering diagnostics

点击查看摘要

Abstract:Multi-agent systems often operate under feedback, adaptation, and non-stationarity, yet many simulation studies retain static decision rules and fixed control parameters. This paper introduces a general adaptive multi-agent learning framework that integrates: (i) four dynamic regimes distinguishing static versus adaptive agents and fixed versus adaptive system parameters; (ii) information-theoretic diagnostics (entropy rate, statistical complexity, and predictive information) to assess predictability and structure; (iii) structural causal models for explicit intervention semantics; (iv) procedures for generating agent-level priors from aggregate or sample data; and (v) unsupervised methods for identifying emergent behavioral regimes. The framework offers a domain-neutral architecture for analyzing how learning agents and adaptive controls jointly shape system trajectories, enabling systematic comparison of stability, performance, and interpretability across non-equilibrium, oscillatory, or drifting dynamics. Mathematical definitions, computational operators, and an experimental design template are provided, yielding a structured methodology for developing explainable and contestable multi-agent decision processes.
zh

[AI-73] CrypTorch: PyTorch-based Auto-tuning Compiler for Machine Learning with Multi-party Computation

【速读】:该论文旨在解决多方计算(Multi-Party Computing, MPC)框架中用于生成式 AI(Generative AI)等机器学习(Machine Learning, ML)任务时,因对Softmax、GELU等非线性操作采用近似算法而导致的性能瓶颈问题。现有方案中的近似方法往往精度不足或效率低下,且难以识别与优化。其关键解决方案是提出一个名为CrypTorch的编译器,该编译器将近似算法与MPC运行时分离,通过编程接口灵活扩展新近似方法,并自动选择最优近似策略以平衡性能与精度。实验证明,仅使用CrypTorch的自动调优机制即可带来1.20–1.7×的即时加速,允许一定精度损失时可达1.31–1.8×加速,结合工程优化后相比主流框架CrypTen实现3.22–8.6×端到端提速。

链接: https://arxiv.org/abs/2511.19711
作者: Jinyu Liu,Gang Tan,Kiwan Maeng
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: 28 pages, 17 figures. Submitted to PLDI 2026

点击查看摘要

Abstract:Machine learning (ML) involves private data and proprietary model parameters. MPC-based ML allows multiple parties to collaboratively run an ML workload without sharing their private data or model parameters using multi-party computing (MPC). Because MPC cannot natively run ML operations such as Softmax or GELU, existing frameworks use different approximations. Our study shows that, on a well-optimized framework, these approximations often become the dominating bottleneck. Popular approximations are often insufficiently accurate or unnecessarily slow, and these issues are hard to identify and fix in existing frameworks. To tackle this issue, we propose a compiler for MPC-based ML, CrypTorch. CrypTorch disentangles these approximations with the rest of the MPC runtime, allows easily adding new approximations through its programming interface, and automatically selects approximations to maximize both performance and accuracy. Built as an extension to PyTorch 2’s compiler, we show that CrypTorch’s auto-tuning alone provides 1.20–1.7 \times immediate speedup without sacrificing accuracy, and 1.31–1.8 \times speedup when some accuracy degradation is allowed, compared to our well-optimized baseline. Combined with better engineering and adoption of state-of-the-art practices, the entire framework brings 3.22–8.6 \times end-to-end speedup compared to the popular framework, CrypTen.
zh

[AI-74] A Layered Protocol Architecture for the Internet of Agents

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂任务处理中因上下文窗口有限而导致的计算与记忆瓶颈问题,进而提出通过多智能体协作实现系统可扩展性的解决方案。其关键在于引入两个新的网络层:第8层(L8)——智能体通信层(Agent Communication Layer),用于标准化消息结构、言语行为规范(如REQUEST、INFORM)及交互模式(如请求-回复、发布-订阅);第9层(L9)——智能体语义协商层(Agent Semantic Negotiation Layer),首次提出用于定义通信语义的机制,使智能体能够发现、协商并锁定一个“共享上下文”(Shared Context),即一套形式化的概念、任务与参数规范,从而支持跨智能体的语义对齐与协同推理。这两层共同构成了“智能体互联网”(Internet of Agents, IoA)的基础架构,为下一代分布式多智能体系统提供可扩展的协作能力。

链接: https://arxiv.org/abs/2511.19699
作者: Charles Fleming,Vijoy Pandey,Ramana Kompella,Luca Muscariello
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable performance improvements and the ability to learn domain-specific languages (DSLs), including APIs and tool interfaces. This capability has enabled the creation of AI agents that can perform preliminary computations and act through tool calling, now being standardized via protocols like MCP. However, LLMs face fundamental limitations: their context windows cannot grow indefinitely, constraining their memory and computational capacity. Agent collaboration emerges as essential for solving increasingly complex problems, mirroring how computational systems rely on different types of memory to scale. The “Internet of Agents” (IoA) represents the communication stack that enables agents to scale by distributing computation across collaborating entities. Current network architectural stacks (OSI and TCP/IP) were designed for data delivery between hosts and processes, not for agent collaboration with semantic understanding. To address this gap, we propose two new layers: an \textbfAgent Communication Layer (L8) and an \textbfAgent Semantic Negotiation Layer (L9). L8 formalizes the \textitstructure of communication, standardizing message envelopes, speech-act performatives (e.g., REQUEST, INFORM), and interaction patterns (e.g., request-reply, publish-subscribe), building on protocols like MCP. L9, which does not exist today, formalizes the \textitmeaning of communication, enabling agents to discover, negotiate, and lock a “Shared Context” – a formal schema defining the concepts, tasks, and parameters relevant to their interaction. Together, these layers provide the foundation for scalable, distributed agent collaboration, enabling the next generation of multi-agentic systems. Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA) Cite as: arXiv:2511.19699 [cs.NI] (or arXiv:2511.19699v1 [cs.NI] for this version) https://doi.org/10.48550/arXiv.2511.19699 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-75] CT: A Synthetically Pre-Trained Foundation Model for Time Series Classification

【速读】:该论文旨在解决时间序列分类任务中因标注数据成本高而导致的通用基础模型(foundation models)难以构建的问题。现有大规模时间序列模型多聚焦于预测任务,缺乏适用于多样化分类场景且无需微调的通用解决方案。其关键解决方案是提出TiCT(Time-series in-Context Transformer),一种基于Transformer架构、仅在合成数据上预训练的模型,具备in-context learning(ICL)能力,可在推理时通过少量示例动态适应新任务,而无需更新任何模型参数。核心技术贡献包括:1)一种可扩展的基于位的标签编码机制与特殊的输出注意力机制,支持任意类别数;2)结合Mixup思想与数据增强的合成预训练框架,提升模型泛化能力和抗噪性。实验表明,TiCT在UCR Archive上的分类性能媲美监督式先进方法,且完全依赖in-context示例完成推理。

链接: https://arxiv.org/abs/2511.19694
作者: Chin-Chia Michael Yeh,Uday Singh Saini,Junpeng Wang,Xin Dai,Xiran Fan,Jiarui Sun,Yujie Fan,Yan Zheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The ubiquity of time series data creates a strong demand for general-purpose foundation models, yet developing them for classification remains a significant challenge, largely due to the high cost of labeled data. Foundation models capable of in-context learning (ICL) offer a powerful solution, adapting to new tasks with minimal examples and reducing the need for extensive retraining. However, prior work on large-scale time series models has predominantly focused on forecasting, leaving a critical gap for versatile, fine-tuning-free classification. To address this, we introduce TiCT (Time-series in-Context Transformer), a transformer-based model pre-trained exclusively on synthetic data to perform in-context classification. We make two primary technical contributions: 1) a novel architecture featuring a scalable bit-based label encoding and a special output attention mechanism to handle an arbitrary number of classes; and 2) a synthetic pre-training framework that combines a Mixup-inspired process with data augmentation to foster generalization and noise invariance. Extensive evaluations on the UCR Archive show that TiCT achieves competitive performance against state-of-the-art supervised methods. Crucially, this is accomplished using only in-context examples at inference time, without updating a single model weight.
zh

[AI-76] REASURE: A Transformer-Based Foundation Model for High-Volume Transaction Understanding

【速读】:该论文旨在解决支付网络中交易数据建模的挑战,以支持异常行为检测和消费者级个性化推荐等应用。传统方法难以同时捕捉用户行为模式与支付网络信号(如响应码和系统标志),导致模型性能受限。解决方案的关键在于提出 TREASURE——一个基于 Transformer 的基础模型,其核心创新包括:1)设计专用子模块处理静态与动态属性,提升训练与推理效率;2)开发针对高基数类别属性的有效训练范式;3)在工业级数据集上验证其作为独立模型可使异常行为检测准确率提升 111%,或作为嵌入提供者使推荐模型性能提升 104%。

链接: https://arxiv.org/abs/2511.19693
作者: Chin-Chia Michael Yeh,Uday Singh Saini,Xin Dai,Xiran Fan,Shubham Jain,Yujie Fan,Jiarui Sun,Junpeng Wang,Menghai Pan,Yingtong Dou,Yuzhong Chen,Vineeth Rakesh,Liang Wang,Yan Zheng,Mahashweta Das
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Payment networks form the backbone of modern commerce, generating high volumes of transaction records from daily activities. Properly modeling this data can enable applications such as abnormal behavior detection and consumer-level insights for hyper-personalized experiences, ultimately improving people’s lives. In this paper, we present TREASURE, TRansformer Engine As Scalable Universal transaction Representation Encoder, a multipurpose transformer-based foundation model specifically designed for transaction data. The model simultaneously captures both consumer behavior and payment network signals (such as response codes and system flags), providing comprehensive information necessary for applications like accurate recommendation systems and abnormal behavior detection. Verified with industry-grade datasets, TREASURE features three key capabilities: 1) an input module with dedicated sub-modules for static and dynamic attributes, enabling more efficient training and inference; 2) an efficient and effective training paradigm for predicting high-cardinality categorical attributes; and 3) demonstrated effectiveness as both a standalone model that increases abnormal behavior detection performance by 111% over production systems and an embedding provider that enhances recommendation models by 104%. We present key insights from extensive ablation studies, benchmarks against production models, and case studies, highlighting valuable knowledge gained from developing TREASURE.
zh

[AI-77] FISCAL: Financial Synthetic Claim-document Augmented Learning for Efficient Fact-Checking NEURIPS2025

【速读】:该论文旨在解决金融领域大语言模型(Large Language Models, LLMs)在事实可靠性与计算效率之间的矛盾问题,即现有系统常出现幻觉(hallucination)且依赖庞大模型导致部署成本高。其解决方案的关键在于提出FISCAL(Financial Synthetic Claim-Document Augmented Learning)框架,通过生成针对金融事实核查任务的合成数据(synthetic data),并基于此训练轻量级验证器MiniCheck-FISCAL。该方法结合领域特定的合成数据增强与高效微调策略,使小型模型在准确率、鲁棒性和可扩展性上达到甚至逼近大型模型(如Mixtral-8x22B和GPT-4o)的性能水平,从而实现金融AI应用中“小而强”的高效验证能力。

链接: https://arxiv.org/abs/2511.19671
作者: Rishab Sharma,Iman Saberi,Elham Alipour,Jie JW Wu,Fatemeh Fard
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 3 tables, 11 pages, 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Generative AI in Finance

点击查看摘要

Abstract:Financial applications of large language models (LLMs) require factual reliability and computational efficiency, yet current systems often hallucinate details and depend on prohibitively large models. We propose FISCAL (Financial Synthetic Claim-Document Augmented Learning), a modular framework for generating synthetic data tailored to financial fact-checking. Using FISCAL, we generate a dataset called FISCAL-data and use it to train MiniCheck-FISCAL, a lightweight verifier for numerical financial claims. MiniCheck-FISCAL outperforms its baseline, surpasses GPT-3.5 Turbo and other open-source peers of similar size, and approaches the accuracy of much larger systems (20x), such as Mixtral-8x22B and Command R+. On external datasets FinDVer and Fin-Fact, it rivals GPT-4o and Claude-3.5 while outperforming Gemini-1.5 Flash. These results show that domain-specific synthetic data, combined with efficient fine-tuning, enables compact models to achieve state-of-the-art accuracy, robustness, and scalability for practical financial AI. The dataset and scripts are available in the project repository (link provided in the paper).
zh

[AI-78] HeaRT: A Hierarchical Circuit Reasoning Tree-Based Agent ic Framework for AMS Design Optimization

【速读】:该论文旨在解决传统基于人工智能(AI)的自动设计方法(AMS design automation)中存在的三大问题:依赖高质量数据集来捕捉电路行为、跨架构迁移能力差,以及缺乏自适应机制。其解决方案的关键在于提出 HeaRT——一个用于自动化循环的基础推理引擎,该引擎具备类人式智能优化能力,能够在不牺牲设计意图的前提下实现快速收敛(提速3倍),并在40个电路基准测试中保持97%的推理准确率和98%的Pass@1性能,同时仅需当前最先进基线模型一半的实时token预算。

链接: https://arxiv.org/abs/2511.19669
作者: Souradip Poddar,Chia-Tung Ho,Ziming Wei,Weidong Cao,Haoxing Ren,David Z. Pan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Conventional AI-driven AMS design automation algorithms remain constrained by their reliance on high-quality datasets to capture underlying circuit behavior, coupled with poor transferability across architectures, and a lack of adaptive mechanisms. This work proposes HeaRT, a foundational reasoning engine for automation loops and a first step toward intelligent, adaptive, human-style design optimization. HeaRT consistently demonstrates reasoning accuracy 97% and Pass@1 performance 98% across our 40-circuit benchmark repository, even as circuit complexity increases, while operating at 0.5x real-time token budget of SOTA baselines. Our experiments show that HeaRT yields 3x faster convergence in both sizing and topology design adaptation tasks across diverse optimization approaches, while preserving prior design intent.
zh

[AI-79] Accuracy and Efficiency Trade-Offs in LLM -Based Malware Detection and Explanation: A Comparative Study of Parameter Tuning vs. Full Fine-Tuning

【速读】:该论文旨在解决生成式 AI(Generative AI)在恶意软件分类任务中生成人类可解释决策与解释时的可信度问题,尤其是在资源受限环境下如何平衡模型性能与计算效率。其解决方案的关键在于采用低秩适应(Low-Rank Adaptation, LoRA)技术对大型语言模型(Large Language Models, LLMs)进行微调,通过引入不同配置的LoRA模块,在显著降低模型参数量(约减少81%)和训练时间(超过80%)的同时,仍能保持接近全量微调模型的解释质量——尤其在部分语义评估指标上甚至优于全量微调基线,从而实现了可解释性与资源效率之间的有效权衡。

链接: https://arxiv.org/abs/2511.19654
作者: Stephen C. Gravereaux,Sheikh Rabiul Islam
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted in IEEE Big Data 2025

点击查看摘要

Abstract:This study examines whether Low-Rank Adaptation (LoRA) fine-tuned Large Language Models (LLMs) can approximate the performance of fully fine-tuned models in generating human-interpretable decisions and explanations for malware classification. Achieving trustworthy malware detection, particularly when LLMs are involved, remains a significant challenge. We developed an evaluation framework using Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), and Semantic Similarity Metrics to benchmark explanation quality across five LoRA configurations and a fully fine-tuned baseline. Results indicate that full fine-tuning achieves the highest overall scores, with BLEU and ROUGE improvements of up to 10% over LoRA variants. However, mid-range LoRA models deliver competitive performance exceeding full fine-tuning on two metrics while reducing model size by approximately 81% and training time by over 80% on a LoRA model with 15.5% trainable parameters. These findings demonstrate that LoRA offers a practical balance of interpretability and resource efficiency, enabling deployment in resource-constrained environments without sacrificing explanation quality. By providing feature-driven natural language explanations for malware classifications, this approach enhances transparency, analyst confidence, and operational scalability in malware detection systems.
zh

[AI-80] Synthetic Data: AIs New Weapon Against Android Malware

【速读】:该论文旨在解决Android恶意软件检测中因真实样本稀缺且标注成本高昂而导致的高质量训练数据不足问题,进而影响机器学习模型性能。解决方案的关键在于提出MalSynGen方法,其核心是利用条件生成对抗网络(conditional Generative Adversarial Network, cGAN)生成具有真实数据统计特性的合成表格型数据,从而提升安卓恶意软件分类器的性能,并在不同数据集上展现出良好的泛化能力。

链接: https://arxiv.org/abs/2511.19649
作者: Angelo Gaspar Diniz Nogueira,Kayua Oleques Paim,Hendrio Bragança,Rodrigo Brandão Mansilha,Diego Kreutz
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 23 pages, 18 figures, 8 tables. Accepted for publication at the JBCS

点击查看摘要

Abstract:The ever-increasing number of Android devices and the accelerated evolution of malware, reaching over 35 million samples by 2024, highlight the critical importance of effective detection methods. Attackers are now using Artificial Intelligence to create sophisticated malware variations that can easily evade traditional detection techniques. Although machine learning has shown promise in malware classification, its success relies heavily on the availability of up-to-date, high-quality datasets. The scarcity and high cost of obtaining and labeling real malware samples presents significant challenges in developing robust detection models. In this paper, we propose MalSynGen, a Malware Synthetic Data Generation methodology that uses a conditional Generative Adversarial Network (cGAN) to generate synthetic tabular data. This data preserves the statistical properties of real-world data and improves the performance of Android malware classifiers. We evaluated the effectiveness of this approach using various datasets and metrics that assess the fidelity of the generated data, its utility in classification, and the computational efficiency of the process. Our experiments demonstrate that MalSynGen can generalize across different datasets, providing a viable solution to address the issues of obsolescence and low quality data in malware detection.
zh

[AI-81] Robot-Powered Data Flywheels: Deploying Robots in the Wild for Continual Data Collection and Foundation Model Adaptation

【速读】:该论文旨在解决基础模型(Foundation Models, FMs)在真实世界部署中因训练数据与实际场景不匹配而导致的性能脆弱性问题,尤其是面对未见过的、杂乱的现实数据(如遮挡文本或跨语言内容)时表现不佳。其解决方案的关键在于提出“机器人驱动的数据飞轮”(Robot-Powered Data Flywheel)框架,通过将机器人作为具身代理部署于真实环境(如图书馆),使其在执行任务的同时自动收集高质量、领域相关的现实世界数据,并用于持续优化基础模型,形成“任务执行—数据生成—模型改进”的闭环迭代机制。该方法不仅降低了人工标注成本,还显著提升了模型在特定场景(如图书识别)和邻近任务(如多语言光学字符识别)中的泛化能力。

链接: https://arxiv.org/abs/2511.19647
作者: Jennifer Grannen,Michelle Pan,Kenneth Llontop,Cherie Ho,Mark Zolotas,Jeannette Bohg,Dorsa Sadigh
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Foundation models (FM) have unlocked powerful zero-shot capabilities in vision and language, yet their reliance on internet pretraining data leaves them brittle in unstructured, real-world settings. The messy, real-world data encountered during deployment (e.g. occluded or multilingual text) remains massively underrepresented in existing corpora. Robots, as embodied agents, are uniquely positioned to close this gap: they can act in physical environments to collect large-scale, real-world data that enriches FM training with precisely the examples current models lack. We introduce the Robot-Powered Data Flywheel, a framework that transforms robots from FM consumers into data generators. By deploying robots equipped with FMs in the wild, we enable a virtuous cycle: robots perform useful tasks while collecting real-world data that improves both domain-specific adaptation and domain-adjacent generalization. We instantiate this framework with Scanford, a mobile manipulator deployed in the East Asia Library for 2 weeks. Scanford autonomously scans shelves, identifies books using a vision-language model (VLM), and leverages the library catalog to label images without human annotation. This deployment both aids librarians and produces a dataset to finetune the underlying VLM, improving performance on the domain-specific in-the-wild library setting and on domain-adjacent multilingual OCR benchmarks. Using data collected from 2103 shelves, Scanford improves VLM performance on book identification from 32.0% to 71.8% and boosts domain-adjacent multilingual OCR from 24.8% to 46.6% (English) and 30.8% to 38.0% (Chinese), while saving an ~18.7 hrs of human time. These results highlight how robot-powered data flywheels can both reduce human effort in real deployments and unlock new pathways for continually adapting FMs to the messiness of reality. More details are at: this https URL
zh

[AI-82] IRSDA: An Agent -Orchestrated Framework for Enterprise Intrusion Response

【速读】:该论文旨在解决现代企业系统中日益复杂、分布式且多阶段的网络攻击难以通过传统静态规则和人工流程实现快速精准响应的问题。其核心解决方案是提出一种基于代理(Agent)的入侵响应数字助手(Intrusion Response System Digital Assistant, IRSDA),关键在于融合自适应自主计算系统(Self-Adaptive Autonomic Computing Systems, SA-ACS)与知识引导的监控-分析-规划-执行-知识(MAPE-K)循环,实现跨企业基础设施的实时、分区感知决策,并借助知识驱动架构整合上下文信息与AI推理能力,在确保操作合规的前提下自动完成隔离、响应与可追溯输出,从而提升网络安全防御的自动化水平、可解释性与系统状态感知能力。

链接: https://arxiv.org/abs/2511.19644
作者: Damodar Panigrahi,Raj Patel,Shaswata Mitra,Sudip Mittal,Shahram Rahimi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:Modern enterprise systems face escalating cyber threats that are increasingly dynamic, distributed, and multi-stage in nature. Traditional intrusion detection and response systems often rely on static rules and manual workflows, which limit their ability to respond with the speed and precision required in high-stakes environments. To address these challenges, we present the Intrusion Response System Digital Assistant (IRSDA), an agent-based framework designed to deliver autonomous and policy-compliant cyber defense. IRSDA combines Self-Adaptive Autonomic Computing Systems (SA-ACS) with the Knowledge guided Monitor, Analyze, Plan, and Execute (MAPE-K) loop to support real-time, partition-aware decision-making across enterprise infrastructure. IRSDA incorporates a knowledge-driven architecture that integrates contextual information with AI-based reasoning to support system-guided intrusion response. The framework leverages retrieval mechanisms and structured representations to inform decision-making while maintaining alignment with operational policies. We assess the system using a representative real-world microservices application, demonstrating its ability to automate containment, enforce compliance, and provide traceable outputs for security analyst interpretation. This work outlines a modular and agent-driven approach to cyber defense that emphasizes explainability, system-state awareness, and operational control in intrusion response. Comments: 10 pages, 4 figures Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.19644 [cs.CR] (or arXiv:2511.19644v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2511.19644 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-83] Many Ways to be Right: Rashomon Sets for Concept-Based Neural Networks

【速读】:该论文旨在解决深度神经网络中存在“Rashomon效应”(Rashomon Effect)的问题,即多个模型在相同任务上可达到相似性能,但其内部决策机制和依赖特征却可能截然不同,而这种多样性难以被有效挖掘和利用。解决方案的关键在于提出一种名为“Rashomon概念瓶颈模型”(Rashomon Concept Bottleneck Models)的框架,通过结合轻量级适配模块(adapter modules)与多样性正则化训练目标,在不从头重新训练的前提下高效构建一组兼具高准确率且基于不同人类可理解概念进行推理的深度模型,从而系统性揭示等效性能下模型决策过程的多样性。

链接: https://arxiv.org/abs/2511.19636
作者: Shihan Feng,Cheng Zhang,Michael Xi,Ethan Hsu,Lesia Semenova,Chudi Zhong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern neural networks rarely have a single way to be right. For many tasks, multiple models can achieve identical performance while relying on different features or reasoning patterns, a property known as the Rashomon Effect. However, uncovering this diversity in deep architectures is challenging as their continuous parameter spaces contain countless near-optimal solutions that are numerically distinct but often behaviorally similar. We introduce Rashomon Concept Bottleneck Models, a framework that learns multiple neural networks which are all accurate yet reason through distinct human-understandable concepts. By combining lightweight adapter modules with a diversity-regularized training objective, our method constructs a diverse set of deep concept-based models efficiently without retraining from scratch. The resulting networks provide fundamentally different reasoning processes for the same predictions, revealing how concept reliance and decision making vary across equally performing solutions. Our framework enables systematic exploration of data-driven reasoning diversity in deep models, offering a new mechanism for auditing, comparison, and alignment across equally accurate solutions.
zh

[AI-84] owards Synergistic Teacher-AI Interactions with Generative Artificial Intelligence

【速读】:该论文旨在解决生成式 AI (Generative AI) 在教育领域广泛应用背景下,教师如何与AI有效协作以避免职业能力弱化、保持专业自主性并提升教学效能的问题。其解决方案的关键在于提出一个五级教师-AI协同框架(transactional、situational、operational、praxical 和 synergistic teaming),系统刻画教师与生成式 AI 互动的复杂动态,强调从简单的任务替代向深度协同演进,最终实现教师与AI在协商、建设性挑战和共同推理中相互增强能力,达成单一主体无法独立实现的教育成果。

链接: https://arxiv.org/abs/2511.19580
作者: Mutlu Cukurova,Wannapon Suraworachet,Qi Zhou,Sahan Bulathwela
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 18 pages, 6 pages

点击查看摘要

Abstract:Generative artificial intelligence (GenAI) is increasingly used in education, posing significant challenges for teachers adapting to these changes. GenAI offers unprecedented opportunities for accessibility, scalability and productivity in educational tasks. However, the automation of teaching tasks through GenAI raises concerns about reduced teacher agency, potential cognitive atrophy, and the broader deprofessionalisation of teaching. Drawing findings from prior literature on AI in Education, and refining through a recent systematic literature review, this chapter presents a conceptualisation of five levels of teacher-AI teaming: transactional, situational, operational, praxical and synergistic teaming. The framework aims to capture the nuanced dynamics of teacher-AI interactions, particularly with GenAI, that may lead to the replacement, complementarity, or augmentation of teachers’ competences and professional practice. GenAI technological affordances required in supporting teaming, along with empirical studies, are discussed. Drawing on empirical observations, we outline a future vision that moves beyond individual teacher agency toward collaborative decision-making between teachers and AI, in which both agents engage in negotiation, constructive challenge, and co-reasoning that enhance each other’s capabilities and enable outcomes neither could realise independently. Further discussion of socio-technical factors beyond teacher-AI teaming is also included to streamline the synergy of teachers and AI in education ethically and practically.
zh

[AI-85] Using Wearable Devices to Improve Chronic PainTreatment among Patients with Opioid Use Disorder

【速读】:该论文旨在解决慢性疼痛(Chronic Pain, CP)与阿片类药物使用障碍(Opioid Use Disorder, OUD)共病人群在接受阿片类药物维持治疗(Medication for Opioid Use Disorder, MOUD)过程中缺乏循证整合干预措施的问题。其解决方案的关键在于利用可穿戴设备实时监测患者生理与心理指标(如疼痛波动和主观压力水平),结合机器学习(Machine Learning, ML)与大语言模型(Large Language Models, LLMs)等人工智能方法,识别疼痛峰值的临床相关因素,并探索早期预警与个性化干预的可能性。研究发现,ML模型对疼痛峰值预测准确率达0.7,而LLMs表现有限,表明当前LLMs尚不足以提供可操作的临床洞察,亟需开发更适配OUD/CP场景的AI模型以支持精准干预。

链接: https://arxiv.org/abs/2511.19577
作者: Abhay Goyal,Navin Kumar,Kimberly DiMeola,Rafael Trujillo,Soorya Ram Shimgekar,Christian Poellabauer,Pi Zonooz,Ermonda Gjoni-Markaj,Declan Barry,Lynn Madden
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Chronic pain (CP) and opioid use disorder (OUD) are common and interrelated chronic medical conditions. Currently, there is a paucity of evidence-based integrated treatments for CP and OUD among individuals receiving medication for opioid use disorder (MOUD). Wearable devices have the potential to monitor complex patient information and inform treatment development for persons with OUD and CP, including pain variability (e.g., exacerbations of pain or pain spikes) and clinical correlates (e.g., perceived stress). However, the application of large language models (LLMs) with wearable data for understanding pain spikes, remains unexplored. Consequently, the aim of this pilot study was to examine the clinical correlates of pain spikes using a range of AI approaches. We found that machine learning models achieved relatively high accuracy (0.7) in predicting pain spikes, while LLMs were limited in providing insights on pain spikes. Real-time monitoring through wearable devices, combined with advanced AI models, could facilitate early detection of pain spikes and support personalized interventions that may help mitigate the risk of opioid relapse, improve adherence to MOUD, and enhance the integration of CP and OUD care. Given overall limited LLM performance, these findings highlight the need to develop LLMs which can provide actionable insights in the OUD/CP context.
zh

[AI-86] Deductive Systems for Logic Programs with Counting

【速读】:该论文旨在解决含计数聚合(counting aggregate)的程序在答案集编程(Answer Set Programming, ASP)中的强等价性判定问题。传统上,强等价可通过在适当的演绎系统中互相推导规则来证明,但这一方法此前未适用于包含计数聚合的程序。论文的关键解决方案是扩展该演绎方法,使其能够处理含计数聚合的规则,从而实现对这类程序强等价性的形式化验证。

链接: https://arxiv.org/abs/2511.19565
作者: Jorge Fandinno,Vladimir Lifschitz
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: Under consideration in Theory and Practice of Logic Programming (TPLP)

点击查看摘要

Abstract:In answer set programming, two groups of rules are considered strongly equivalent if they have the same meaning in any context. Strong equivalence of two programs can be sometimes established by deriving rules of each program from rules of the other in an appropriate deductive system. This paper shows how to extend this method of proving strong equivalence to programs containing the counting aggregate.
zh

[AI-87] rust-Based Social Learning for Communication (TSLEC) Protocol Evolution in Multi-Agent Reinforcement Learning

【速读】:该论文旨在解决多智能体系统中通过独立学习实现涌现通信时存在的收敛速度慢和协议性能欠佳的问题。其解决方案的关键在于提出TSLEC(基于信任的社会学习与涌现通信)框架,该框架使智能体能够基于学习到的信任关系,主动向同伴传授成功策略,从而实现知识的高效传递与筛选;实验表明,该机制显著提升了收敛效率(减少23.9%的episode数,p < 0.001),并生成具有组合性且对动态目标鲁棒的通信协议(C = 0.38,Phi > 0.867)。

链接: https://arxiv.org/abs/2511.19562
作者: Abraham Itzhak Weinberg
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Emergent communication in multi-agent systems typically occurs through independent learning, resulting in slow convergence and potentially suboptimal protocols. We introduce TSLEC (Trust-Based Social Learning with Emergent Communication), a framework where agents explicitly teach successful strategies to peers, with knowledge transfer modulated by learned trust relationships. Through experiments with 100 episodes across 30 random seeds, we demonstrate that trust-based social learning reduces episodes-to-convergence by 23.9% (p 0.001, Cohen’s d = 1.98) compared to independent emergence, while producing compositional protocols (C = 0.38) that remain robust under dynamic objectives (Phi 0.867 decoding accuracy). Trust scores strongly correlate with teaching quality (r = 0.743, p 0.001), enabling effective knowledge filtering. Our results establish that explicit social learning fundamentally accelerates emergent communication in multi-agent coordination.
zh

[AI-88] Online Sparse Feature Selection in Data Streams via Differential Evolution

【速读】:该论文旨在解决高维数据流中因设备故障和技术限制导致的数据不完整问题,以及现有在线稀疏流特征选择(Online Sparse Streaming Feature Selection, OS2FS)方法在特征评估方面的性能瓶颈。解决方案的关键在于提出一种新的在线差分进化稀疏特征选择方法(Online Differential Evolution for Sparse Feature Selection, ODESFS),其核心创新包括:(1) 基于潜在因子分析模型进行缺失值插补,提升数据完整性;(2) 利用差分进化算法实现特征重要性评估,从而更有效地筛选最优特征子集,显著提升分类准确率。

链接: https://arxiv.org/abs/2511.19555
作者: Ruiyang Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The processing of high-dimensional streaming data commonly utilizes online streaming feature selection (OSFS) techniques. However, practical implementations often face challenges with data incompleteness due to equipment failures and technical constraints. Online Sparse Streaming Feature Selection (OS2FS) tackles this issue through latent factor analysis-based missing data imputation. Despite this advancement, existing OS2FS approaches exhibit substantial limitations in feature evaluation, resulting in performance deterioration. To address these shortcomings, this paper introduces a novel Online Differential Evolution for Sparse Feature Selection (ODESFS) in data streams, incorporating two key innovations: (1) missing value imputation using a latent factor analysis model, and (2) feature importance evaluation through differential evolution. Comprehensive experiments conducted on six real-world datasets demonstrate that ODESFS consistently outperforms state-of-the-art OSFS and OS2FS methods by selecting optimal feature subsets and achieving superior accuracy.
zh

[AI-89] he Semiotic Channel Principle: Measuring the Capacity for Meaning in LLM Communication

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在人机语义交互中“表达丰富性”与“解释稳定性”之间的内在矛盾问题,即如何在保持模型输出语义多样性的同时确保人类能够准确理解其意图。解决方案的关键在于提出一个基于信息论的符号学框架,将LLM建模为随机符号引擎,并引入生成复杂度参数(lambda)来量化和调控这一权衡关系:其中语义广度(semiotic breadth)由源熵(source entropy)衡量,而可解码性(decipherability)则以消息与人类解释间的互信息(mutual information)表征;通过优化lambda可实现语义传输容量的最大化,从而将原本难以观测的模型内部机制转化为可测量的文本表征,为模型评估、提示工程、风险分析及自适应系统设计提供可操作的理论工具。

链接: https://arxiv.org/abs/2511.19550
作者: Davide Picca
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper proposes a novel semiotic framework for analyzing Large Language Models (LLMs), conceptualizing them as stochastic semiotic engines whose outputs demand active, asymmetric human interpretation. We formalize the trade-off between expressive richness (semiotic breadth) and interpretive stability (decipherability) using information-theoretic tools. Breadth is quantified as source entropy, and decipherability as the mutual information between messages and human interpretations. We introduce a generative complexity parameter (lambda) that governs this trade-off, as both breadth and decipherability are functions of lambda. The core trade-off is modeled as an emergent property of their distinct responses to \lambda . We define a semiotic channel, parameterized by audience and context, and posit a capacity constraint on meaning transmission, operationally defined as the maximum decipherability by optimizing lambda. This reframing shifts analysis from opaque model internals to observable textual artifacts, enabling empirical measurement of breadth and decipherability. We demonstrate the framework’s utility across four key applications: (i) model profiling; (ii) optimizing prompt/context design; (iii) risk analysis based on ambiguity; and (iv) adaptive semiotic systems. We conclude that this capacity-based semiotic approach offers a rigorous, actionable toolkit for understanding, evaluating, and designing LLM-mediated communication.
zh

[AI-90] When Should Neural Data Inform Welfare? A Critical Framework for Policy Uses of Neuroeconomics

【速读】:该论文试图解决的问题是:在政策制定中,如何合法且有效地利用神经数据来支持福利判断,而非仅仅描述行为。其核心挑战在于区分神经信号是否能够提供关于个体真实利益的规范性证据,而不是仅反映情境依赖的行为模式。解决方案的关键在于构建一个基于模型的非实证框架,该框架将三个层次——神经信号、计算决策模型和规范性福利标准——进行系统关联。具体而言,论文提出必须满足三个条件:(1)神经-计算映射需经过充分验证;(2)决策模型能识别“真实兴趣”与情境性错误;(3)福利标准须明确设定并加以辩护。通过这一框架,作者进一步推导出适用于成瘾、神经营销和环境政策等场景的“神经经济学福利推理检查清单”,从而为监管者和神经人工智能(NeuroAI)系统设计者提供可操作的评估工具。

链接: https://arxiv.org/abs/2511.19548
作者: Yiven(Louis)Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); General Economics (econ.GN); Neurons and Cognition (q-bio.NC)
备注: Durham Economic Journal 2025

点击查看摘要

Abstract:Neuroeconomics promises to ground welfare analysis in neural and computational evidence about how people value outcomes, learn from experience and exercise self-control. At the same time, policy and commercial actors increasingly invoke neural data to justify paternalistic regulation, “brain-based” interventions and new welfare measures. This paper asks under what conditions neural data can legitimately inform welfare judgements for policy rather than merely describing behaviour. I develop a non-empirical, model-based framework that links three levels: neural signals, computational decision models and normative welfare criteria. Within an actor-critic reinforcement-learning model, I formalise the inference path from neural activity to latent values and prediction errors and then to welfare claims. I show that neural evidence constrains welfare judgements only when the neural-computational mapping is well validated, the decision model identifies “true” interests versus context-dependent mistakes, and the welfare criterion is explicitly specified and defended. Applying the framework to addiction, neuromarketing and environmental policy, I derive a Neuroeconomic Welfare Inference Checklist for regulators and for designers of NeuroAI systems. The analysis treats brains and artificial agents as value-learning systems while showing that internal reward signals, whether biological or artificial, are computational quantities and cannot be treated as welfare measures without an explicit normative model.
zh

[AI-91] AttackPilot: Autonomous Inference Attacks Against ML Services With LLM -Based Agents

【速读】:该论文旨在解决非专家在实施推理攻击(inference attack)时面临的挑战,包括攻击参数优化困难及缺乏系统性风险评估能力的问题。解决方案的关键在于提出 AttackPilot——一个基于大语言模型(LLM)的自主代理,能够无需人工干预独立执行推理攻击任务;其核心创新包括多智能体框架与任务特定动作空间的设计,有效缓解了计划错误、指令偏离、上下文丢失和幻觉等问题,从而实现高成功率(100.0%任务完成率)与接近专家水平的攻击性能,同时具备对不同LLM和约束条件的自适应优化能力。

链接: https://arxiv.org/abs/2511.19536
作者: Yixin Wu,Rui Wen,Chi Cui,Michael Backes,Yang Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Inference attacks have been widely studied and offer a systematic risk assessment of ML services; however, their implementation and the attack parameters for optimal estimation are challenging for non-experts. The emergence of advanced large language models presents a promising yet largely unexplored opportunity to develop autonomous agents as inference attack experts, helping address this challenge. In this paper, we propose AttackPilot, an autonomous agent capable of independently conducting inference attacks without human intervention. We evaluate it on 20 target services. The evaluation shows that our agent, using GPT-4o, achieves a 100.0% task completion rate and near-expert attack performance, with an average token cost of only 0.627 per run. The agent can also be powered by many other representative LLMs and can adaptively optimize its strategy under service constraints. We further perform trace analysis, demonstrating that design choices, such as a multi-agent framework and task-specific action spaces, effectively mitigate errors such as bad plans, inability to follow instructions, task context loss, and hallucinations. We anticipate that such agents could empower non-expert ML service providers, auditors, or regulators to systematically assess the risks of ML services without requiring deep domain expertise.
zh

[AI-92] Discover Learn and Reinforce: Scaling Vision-Language-Action Pretraining with Diverse RL-Generated Trajectories

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型预训练中高质量、多样化操作轨迹数据获取难的问题。当前主流方法依赖人工远程操作(teleoperation)收集数据,成本高且难以扩展;而标准强化学习(Reinforcement Learning, RL)虽可自主探索生成数据,但易陷入单一执行模式,导致轨迹多样性不足,限制了其在大规模预训练中的应用。论文提出的Discover, Learn and Reinforce (DLR) 框架通过信息论驱动的模式发现机制,从RL中自动挖掘多个不同且高成功率的行为策略,从而显著提升轨迹数据的多样性与覆盖范围。其关键创新在于利用信息最大化原理识别并保留多模态行为模式,使VLA模型在下游任务迁移时表现优于基于单一模式RL数据预训练的模型,并展现出正向的数据规模扩展特性。

链接: https://arxiv.org/abs/2511.19528
作者: Rushuai Yang,Zhiyuan Feng,Tianxiang Zhang,Kaixin Wang,Chuheng Zhang,Li Zhao,Xiu Su,Yi Chen,Jiang Bian
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scaling vision-language-action (VLA) model pre-training requires large volumes of diverse, high-quality manipulation trajectories. Most current data is obtained via human teleoperation, which is expensive and difficult to scale. Reinforcement learning (RL) methods learn useful skills through autonomous exploration, making them a viable approach for generating data. However, standard RL training collapses to a narrow execution pattern, limiting its utility for large-scale pre-training. We propose Discover, Lea rn and Reinforce (DLR), an information-theoretic pattern discovery framework that generates multiple distinct, high-success behavioral patterns for VLA pretraining. Empirically, DLR generates a markedly more diverse trajectory corpus on LIBERO. Specifically, it learns multiple distinct, high-success strategies for the same task where standard RL discovers only one, and hence it covers substantially broader regions of the state-action space. When adapted to unseen downstream task suites, VLA models pretrained on our diverse RL data surpass counterparts trained on equal-sized standard RL datasets. Moreover, DLR exhibits positive data-scaling behavior that single-pattern RL lacks. These results position multi-pattern RL as a practical, scalable data engine for embodied foundation models.
zh

[AI-93] Hierarchical Dual-Strategy Unlearning for Biomedical and Healthcare Intelligence Using Imperfect and Privacy-Sensitive Medical Data

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)因训练数据记忆而导致的隐私风险问题,特别是在医疗场景中涉及不完整或敏感患者信息时,如何实现精准的知识擦除而不损害基础医学能力。其解决方案的关键在于提出一种分层双策略框架,通过几何约束的梯度更新机制对目标参数进行选择性调节,并结合基于统一四层医学概念层级的token级干预策略,区分需保留的关键知识与应删除的目标知识,从而在仅修改0.1%参数的情况下实现82.7%的遗忘率和88.5%的知识保留率,兼顾隐私保护、监管合规与临床研究伦理要求。

链接: https://arxiv.org/abs/2511.19498
作者: Yi Zhang,Tianxiang Xu,Zijian Li,Chao Zhang,Kunyu Zhang,Zhan Gao,Meinuo Li,Xiaohan Zhang,Qichao Qi,Bing Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Large language models (LLMs) exhibit exceptional performance but pose substantial privacy risks due to training data memorization, particularly within healthcare contexts involving imperfect or privacy-sensitive patient information. We present a hierarchical dual-strategy framework for selective knowledge unlearning that precisely removes specialized knowledge while preserving fundamental medical competencies. Our approach synergistically integrates geometric-constrained gradient updates to selectively modulate target parameters with concept-aware token-level interventions that distinguish between preservation-critical and unlearning-targeted tokens via a unified four-level medical concept hierarchy. Comprehensive evaluations on the MedMCQA (surgical) and MHQA (anxiety, depression, trauma) datasets demonstrate superior performance, achieving an 82.7% forgetting rate and 88.5% knowledge preservation. Notably, our framework maintains robust privacy guarantees while requiring modification of only 0.1% of parameters, addressing critical needs for regulatory compliance, auditability, and ethical standards in clinical research.
zh

[AI-94] PeriodNet: Boosting the Potential of Attention Mechanism for Time Series Forecasting

【速读】:该论文旨在解决注意力机制在时间序列预测(Time Series Forecasting, TSF)中应用效果不佳的问题,尤其是在捕捉局部特征、周期性模式与全局依赖关系方面的局限性。其解决方案的关键在于提出了一种全新的网络结构PeriodNet,核心创新包括:引入周期注意力(period attention)和稀疏周期注意力(sparse period attention)机制以增强对相邻周期的分析能力;设计迭代分组机制(iterative grouping mechanism)高效减少变量间的冗余信息;并重构原始Transformer架构,提出周期扩散器(period diffuser)用于精准的多周期预测。这些改进显著提升了模型在单变量和多变量时间序列预测中的性能表现。

链接: https://arxiv.org/abs/2511.19497
作者: Bowen Zhao,Huanlai Xing,Zhiwen Xiao,Jincheng Peng,Li Feng,Xinhan Wang,Rong Qu,Hui Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The attention mechanism has demonstrated remarkable potential in sequence modeling, exemplified by its successful application in natural language processing with models such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT). Despite these advancements, its utilization in time series forecasting (TSF) has yet to meet expectations. Exploring a better network structure for attention in TSF holds immense significance across various domains. In this paper, we present PeriodNet with a brand new structure to forecast univariate and multivariate time series. PeriodNet incorporates period attention and sparse period attention mechanism for analyzing adjacent periods. It enhances the mining of local characteristics, periodic patterns, and global dependencies. For efficient cross-variable modeling, we introduce an iterative grouping mechanism which can directly reduce the cross-variable redundancy. To fully leverage the extracted features on the encoder side, we redesign the entire architecture of the vanilla Transformer and propose a period diffuser for precise multi-period prediction. Through comprehensive experiments conducted on eight datasets, we demonstrate that PeriodNet outperforms six state-of-the-art models in both univariate and multivariate TSF scenarios in terms of mean square error and mean absolute error. In particular, PeriodNet achieves a relative improvement of 22% when forecasting time series with a length of 720, in comparison to other models based on the conventional encoder-decoder Transformer architecture.
zh

[AI-95] Xmodel-2.5: 1.3B Data-Efficient Reasoning SLM

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在边缘计算或成本敏感场景中因计算资源消耗过高而难以部署的问题。其核心解决方案是提出一个参数量为13亿的轻量级语言模型Xmodel-2.5,作为可直接替换的代理核心(drop-in agent core)。关键创新在于采用最大更新参数化(maximal-update parameterization, μP),使得在小型代理模型(20M参数)上优化的超参数能够无修改地迁移到完整模型,即使在参数共享的词嵌入架构(tie-word-embedding)下也有效;此外,通过设计1.4万亿token的Warmup–Stable–Decay训练课程,并在衰减阶段从AdamW切换至Muon优化器,在保持其他超参数不变的情况下使13项推理任务平均性能提升4.58%,验证了早期稳定性与后期锐化相结合的有效性;同时利用FP8混合精度训练实现准确率与吞吐量之间的平衡。

链接: https://arxiv.org/abs/2511.19496
作者: Yang Liu,Xiaolong Zhong,Ling Jiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models deliver strong reasoning and tool-use skills, yet their computational demands make them impractical for edge or cost-sensitive deployments. We present \textbfXmodel-2.5, a 1.3-billion-parameter small language model designed as a \emphdrop-in agent core. Training with maximal-update parameterization ( \mu P) allows hyper-parameters tuned on a 20M-parameter proxy to transfer directly to the full model, even under the parameter-tied \emphtie-word-embedding architecture. A 1.4T-token Warmup–Stable–Decay curriculum is used, and we further show that \textbfswitching from AdamW to Muon during the decay phase improves the 13-task reasoning average by 4.58,% while keeping every other hyper-parameter fixed, verifying that early AdamW stability can be paired with late Muon sharpening for better downstream performance. FP8-mixed-precision training balances accuracy and throughput. All checkpoints, recipes, and evaluation code are released under the Apache-2.0 license.\footnotethis https URL and this https URL (training checkpoints). Training code and evaluation harness: this https URL.
zh

[AI-96] A Systematic Study of Compression Ordering for Large Language Models

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在资源受限环境中部署时面临的计算资源消耗过大的问题,重点探究知识蒸馏(Knowledge Distillation)、结构化剪枝(Structured Pruning)和低比特量化(Low-bit Quantization)这三种主流压缩技术在单独应用及组合使用时的性能表现及其相互作用机制。解决方案的关键在于系统性地评估多种压缩流水线,并发现技术顺序对最终模型质量具有显著影响:最优顺序为剪枝→知识蒸馏→量化(P-KD-Q),该序列在实现3.68倍压缩比的同时保持了较强的指令遵循能力和语言理解性能,而早期应用量化会导致不可逆的信息损失,严重损害后续训练效果。

链接: https://arxiv.org/abs/2511.19495
作者: Shivansh Chhawri,Rahul Mahadik,Suparna Rooj
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) require substantial computational resources, making model compression essential for efficient deployment in constrained environments. Among the dominant compression techniques: knowledge distillation, structured pruning, and low-bit quantization, their individual effects are well studied, but their interactions and optimal sequencing remain unclear. This work systematically examines how these techniques perform both independently and in combination when applied to the Qwen2.5 3B model. We evaluate multiple compression pipelines, including single, and proposed three-technique sequences, using perplexity, G-Eval, clarity, prompt alignment, and compression ratio as metrics. Our experiments show that quantization provides the greatest standalone compression, while pruning introduces moderate quality degradation. Critically, the ordering of techniques significantly affects the final model quality: the sequence Pruning, Knowledge Distillation, Quantization (P-KD-Q) yields the best balance, achieving a 3.68x compression ratio while preserving strong instruction-following and language understanding capabilities. Conversely, pipelines applying quantization early suffer severe performance degradation due to irreversible information loss that impairs subsequent training. Overall, this study offers practical insight into designing effective, ordering-aware compression pipelines for deploying LLMs in resource-limited settings.
zh

[AI-97] Forecasting AI Time Horizon Under Compute Slowdowns

【速读】:该论文试图解决的问题是:在当前算力(compute)持续增长背景下,若未来算力增速放缓,将如何影响人工智能代理(AI agent)的时间跨度(time horizon)能力预测。其解决方案的关键在于构建一个包含训练算力与算法进步关系的模型,并基于2019–2025年时间跨度和算力均呈恒定增长率的实证事实,推导出时间跨度的增长必须与算力增长成正比。该模型进一步结合OpenAI的算力预测,揭示了即使在理想情况下,若算力增长不及预期,时间跨度能力的实现可能显著延迟——例如,在80%可靠性下,一个月时间跨度的实现将比简单趋势外推推迟7年。

链接: https://arxiv.org/abs/2511.19492
作者: Parker Whitfill,Ben Snodin,Joel Becker
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:METR’s time horizon metric has grown exponentially since 2019, along with compute. However, it is unclear whether compute scaling will persist at current rates through 2030, raising the question of how possible compute slowdowns might impact AI agent capability forecasts. Given a model of time horizon as a function of training compute and algorithms, along with a model of how compute investment spills into algorithmic progress (which, notably, precludes the possibility of a software-only singularity), and the empirical fact that both time horizon and compute have grown at constant rates over 2019–2025, we derive that time horizon growth must be proportional to compute growth. We provide additional, albeit limited, experimental evidence consistent with this theory. We use our model to project time horizon growth under OpenAI’s compute projection, finding substantial projected delays in some cases. For example, 1-month time horizons at 80% reliability occur 7 years later than simple trend extrapolation suggests.
zh

[AI-98] Generative Model-Aided Continual Learning for CSI Feedback in FDD mMIMO-OFDM Systems

【速读】:该论文旨在解决大规模多输入多输出(massive multiple-input multiple-output, mMIMO)正交频分复用(orthogonal frequency division multiplexing, OFDM)系统中信道状态信息(channel state information, CSI)反馈开销高且模型难以适应动态环境的问题。现有CSI反馈模型在用户移动导致的环境变化下需频繁重训练,且在返回旧环境时因灾难性遗忘导致性能下降。解决方案的关键在于提出一种基于生成对抗网络(generative adversarial network, GAN)的学习方法,利用GAN生成器作为记忆单元,实现对历史环境知识的有效保留,从而在不牺牲先前任务性能的前提下持续学习新环境,显著提升深度自编码器(deep autoencoder, DAE)框架的泛化能力与鲁棒性。

链接: https://arxiv.org/abs/2511.19490
作者: Guijun Liu,Yuwen Cao,Tomoaki Ohtsuki,Jiguang He,Shahid Mumtaz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Deep autoencoder (DAE) frameworks have demonstrated their effectiveness in reducing channel state information (CSI) feedback overhead in massive multiple-input multiple-output (mMIMO) orthogonal frequency division multiplexing (OFDM) systems. However, existing CSI feedback models struggle to adapt to dynamic environments caused by user mobility, requiring retraining when encountering new CSI distributions. Moreover, returning to previously encountered environments often leads to performance degradation due to catastrophic forgetting. Continual learning involves enabling models to incorporate new information while maintaining performance on previously learned tasks. To address these challenges, we propose a generative adversarial network (GAN)-based learning approach for CSI feedback. By using a GAN generator as a memory unit, our method preserves knowledge from past environments and ensures consistently high performance across diverse scenarios without forgetting. Simulation results show that the proposed approach enhances the generalization capability of the DAE framework while maintaining low memory overhead. Furthermore, it can be seamlessly integrated with other advanced CSI feedback models, highlighting its robustness and adaptability.
zh

[AI-99] Evolution without an Oracle: Driving Effective Evolution with LLM Judges

【速读】:该论文试图解决的问题是:如何在缺乏可计算目标函数(Oracle)的场景下,利用大语言模型(Large Language Models, LLMs)驱动进化计算(Evolutionary Computation, EC)进行有效优化。传统EC方法高度依赖客观、机器可计算的适应度函数,但在许多开放域任务中(如软件需求满足、复杂指令遵循),这类函数难以定义。解决方案的关键在于提出MADE(Multi-Agent Decomposed Evolution)框架,其核心机制是通过“问题规范”(Problem Specification)将模糊的主观指令分解为具体、可验证的子要求,从而将高方差的LLM主观反馈转化为稳定且精确的选择压力,实现从“可计算指标”到“可描述质量”的范式转变,显著提升了进化优化在无地面真值场景下的性能表现。

链接: https://arxiv.org/abs/2511.19489
作者: Zhe Zhao,Yuheng Yang,Haibin Wen,Xiaojie Qiu,Zaixi Zhang,Qingfu Zhang
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 14 pages, 5 figures

点击查看摘要

Abstract:The integration of Large Language Models (LLMs) with Evolutionary Computation (EC) has unlocked new frontiers in scientific discovery but remains shackled by a fundamental constraint: the reliance on an Oracle–an objective, machine-computable fitness function. This paper breaks this barrier by asking: Can evolution thrive in a purely subjective landscape governed solely by LLM judges? We introduce MADE (Multi-Agent Decomposed Evolution), a framework that tames the inherent noise of subjective evaluation through “Problem Specification.” By decomposing vague instructions into specific, verifiable sub-requirements, MADE transforms high-variance LLM feedback into stable, precise selection pressure. The results are transformative: across complex benchmarks like DevAI and InfoBench, MADE outperforms strong baselines by over 50% in software requirement satisfaction (39.9% to 61.9%) and achieves a 95% perfect pass rate on complex instruction following. This work validates a fundamental paradigm shift: moving from optimizing “computable metrics” to “describable qualities,” thereby unlocking evolutionary optimization for the vast open-ended domains where no ground truth exists.
zh

[AI-100] Building Resilient Information Ecosystems: Large LLM -Generated Dataset of Persuasion Attacks

【速读】:该论文旨在解决生成式 AI(Generative AI)模型在大规模生成具有说服力的内容时,对组织官方信息传播构成的挑战,这些内容可能与政府或商业机构的声明形成竞争性叙事,从而削弱其传播效果并损害公众信任。解决方案的关键在于构建一个包含134,136条由GPT-4、Gemma 2和Llama 3.1生成的 persuasion attack(说服攻击)的数据集,涵盖23种来自SemEval 2023 Task 3的说服技术,并覆盖972篇来自10个机构的新闻稿,攻击形式包括长文本新闻声明与短文本社交媒体帖子。通过分析不同模型生成攻击中的道德共鸣(moral resonance),识别出各模型偏好的说服策略(如GPT-4侧重Care,Llama 3.1侧重Loyalty与Care),为组织建立主动防御机制、打造声誉防护盾(reputation armor)提供依据,从而提升信息生态中沟通的有效性与韧性。

链接: https://arxiv.org/abs/2511.19488
作者: Hsien-Te Kao,Aleksey Panasyuk,Peter Bautista,William Dupree,Gabriel Ganberg,Jeffrey M. Beaubien,Laura Cassani,Svitlana Volkova
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Organization’s communication is essential for public trust, but the rise of generative AI models has introduced significant challenges by generating persuasive content that can form competing narratives with official messages from government and commercial organizations at speed and scale. This has left agencies in a reactive position, often unaware of how these models construct their persuasive strategies, making it more difficult to sustain communication effectiveness. In this paper, we introduce a large LLM-generated persuasion attack dataset, which includes 134,136 attacks generated by GPT-4, Gemma 2, and Llama 3.1 on agency news. These attacks span 23 persuasive techniques from SemEval 2023 Task 3, directed toward 972 press releases from ten agencies. The generated attacks come in two mediums, press release statements and social media posts, covering both long-form and short-form communication strategies. We analyzed the moral resonance of these persuasion attacks to understand their attack vectors. GPT-4’s attacks mainly focus on Care, with Authority and Loyalty also playing a role. Gemma 2 emphasizes Care and Authority, while Llama 3.1 centers on Loyalty and Care. Analyzing LLM-generated persuasive attacks across models will enable proactive defense, allow to create the reputation armor for organizations, and propel the development of both effective and resilient communications in the information ecosystem.
zh

[AI-101] Efficient Inference Using Large Language Models with Limited Human Data: Fine-Tuning then Rectification

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在市场研究与社会科学应用中生成人类类似响应时存在的性能不足问题,特别是如何有效利用有限的标注样本提升预测准确性。其核心挑战在于如何在Fine-tuning(微调)和Rectification(校正)两个阶段之间最优分配标注数据,以协同优化整体性能。解决方案的关键在于提出一种新的微调目标:最小化预测误差的方差而非传统均方误差(MSE),这一设计更有利于后续校正阶段的偏差修正;并基于经验缩放定律(empirical scaling laws)构建了一种数据驱动的方法,实现标注样本在两个阶段间的最优分割,从而显著优于单独使用微调或校正的效果。

链接: https://arxiv.org/abs/2511.19486
作者: Lei Wang,Zikun Ye,Jinglong Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Driven by recent advances in artificial intelligence (AI), a growing body of work demonstrates the potential of using large language models (LLMs) to generate human-like responses in market research and social science applications. Two primary approaches can be applied to improve the performance of LLMs: fine-tuning, which aligns LLM predictions more closely with human responses, and rectification, which corrects biases in LLM outputs. In this paper, we develop a framework that combines fine-tuning and rectification, and optimally allocates limited labeled samples across the two stages. Unlike the conventional objective that minimizes the mean squared prediction errors, we propose to minimize the variance of the prediction errors as the fine-tuning objective, which is optimal for the downstream rectification stage. Building on this insight, we leverage empirical scaling laws to develop a data-driven method for optimally splitting samples between the fine-tuning and rectification stages. Empirical analysis validates our framework, demonstrating improved estimation and inference performance compared to using either fine-tuning or rectification alone.
zh

[AI-102] Z-Space: A Multi-Agent Tool Orchestration Framework for Enterprise-Grade LLM Automation

【速读】:该论文旨在解决大规模企业级模型上下文协议(Model Context Protocol, MCP)服务中,如何高效且准确地从数千种异构工具中匹配目标功能的问题,这一挑战限制了系统的实用性。现有方法依赖全提示注入或静态语义检索,存在用户查询与工具描述间语义断层、大语言模型(Large Language Model, LLM)输入上下文膨胀及高推理延迟等问题。解决方案的关键在于提出Z-Space框架,其核心包括:(1) 通过意图解析模型实现对用户查询的结构化语义理解;(2) 设计基于融合子空间加权算法(Fused Subspace Weighted Algorithm, FSWW)的工具过滤模块,无需参数调优即可实现意图与工具间的细粒度语义对齐;(3) 构建推理执行代理以支持多步骤任务的动态规划与容错执行。该方案已在饿了么平台技术部门部署,服务于多个业务单元的大规模测试数据生成场景,实测显示平均令牌消耗降低96.26%,工具调用准确率达92%,显著提升了智能测试数据生成系统的效率与可靠性。

链接: https://arxiv.org/abs/2511.19483
作者: Qingsong He,Jing Nan,Jiayu Jiao,Liangjie Tang,Xiaodong Xu,Mengmeng Sun,Qingyao Wang,Minghui Yan
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models can break through knowledge and timeliness limitations by invoking external tools within the Model Context Protocol framework to achieve automated execution of complex tasks. However, with the rapid growth of enterprise-scale MCP services, efficiently and accurately matching target functionalities among thousands of heterogeneous tools has become a core challenge restricting system practicality. Existing approaches generally rely on full-prompt injection or static semantic retrieval, facing issues including semantic disconnection between user queries and tool descriptions, context inflation in LLM input, and high inference latency. To address these challenges, this paper proposes Z-Space, a data-generation-oriented multi-agent collaborative tool invocation framework Z-Space. The Z-Space framework establishes a multi-agent collaborative architecture and tool filtering algorithm: (1) A structured semantic understanding of user queries is achieved through an intent parsing model; (2) A tool filtering module (FSWW) based on fused subspace weighted algorithm realizes fine-grained semantic alignment between intents and tools without parameter tuning; (3) An inference execution agent is constructed to support dynamic planning and fault-tolerant execution for multi-step tasks. This framework has been deployed in the Eleme platform’s technical division, serving large-scale test data generation scenarios across multiple business units including Taotian, Gaode, and Hema. Production data demonstrates that the system reduces average token consumption in tool inference by 96.26% while achieving a 92% tool invocation accuracy rate, significantly enhancing the efficiency and reliability of intelligent test data generation systems.
zh

[AI-103] Human Experts Evaluation of Generative AI for Contextualizing STEAM Education in the Global South

【速读】:该论文旨在解决生成式 AI(Generative AI)在非洲发展中国家(以加纳为例)情境下,如何有效支持STEAM教育本土化与文化适切性的问题。研究发现,通过结合定制化的“文化响应式教案规划工具”(Culturally Responsive Lesson Planner, CRLP),生成式AI能够将抽象课程标准与学习者的文化知识、社区实践和日常经验相连接,从而提升教学内容的文化根基性和教学法适应性。其解决方案的关键在于:利用CRLP引导AI输出更贴近本地语境的教案,使生成内容能整合本土知识体系、双语元素及地方相关案例,从而增强文化代表性与教学相关性;但同时也揭示出当前模型在深度呈现文化多样性方面存在局限,尤其在数学与计算机学科中表现不足,表明仍需教师介入、社区参与以及基于本土语言语料库的模型微调,以实现真正符合全球南方背景的文化忠实度。

链接: https://arxiv.org/abs/2511.19482
作者: Matthew Nyaaba,Macharious Nabang,Patrick Kyeremeh,Ibrahim Nantomah,Collins Owusu-Fordjour,Martin Ako,Bismark Nyaaba Akanzire,Kassim Korah Nantom,Cecilia Issaka,Xiaoming Zhai
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This study investigates how human experts evaluate the capacity of Generative AI (GenAI) to contextualize STEAM education in the Global South, with a focus on Ghana. Using a convergent mixed-methods design, four STEAM specialists assessed GenAI-generated lesson plans created with a customized Culturally Responsive Lesson Planner (CRLP) and compared them to standardized lesson plans from the Ghana National Council for Curriculum and Assessment (NaCCA). Quantitative ratings were based on a validated 25-item Culturally Responsive Pedagogy Rubric measuring bias awareness, cultural representation, contextual relevance, linguistic responsiveness, and teacher agency. Qualitative reflections provided additional insight into how GenAI handles cultural and pedagogical appropriateness. Findings show that GenAI, when paired with the CRLP tool, can support contextualized STEAM instruction by linking abstract curriculum standards to learners’ cultural knowledge, community practices, and everyday experiences. Experts rated GenAI-assisted lessons as more culturally grounded and pedagogically responsive than NaCCA plans, integrating Indigenous knowledge, bilingual elements, and locally relevant examples. However, GenAI struggled to represent Ghana’s cultural pluralism, often offering surface-level references to language, history, and identity. These weaknesses were most evident in Mathematics and Computing, where cultural nuance was limited. The results highlight the need for continued teacher mediation, community involvement, and culturally attuned refinement of AI outputs. Future work should include classroom trials, expanded expert participation, and model fine-tuning using Indigenous language corpora to strengthen cultural fidelity in Global South contexts. Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI) Cite as: arXiv:2511.19482 [cs.CY] (or arXiv:2511.19482v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2511.19482 Focus to learn more arXiv-issued DOI via DataCite
zh

[AI-104] Exploiting the Experts: Unauthorized Compression in MoE-LLM s

【速读】:该论文旨在解决混合专家(Mixture-of-Experts, MoE)大语言模型在任务特定使用场景下易受剪枝攻击的问题,即恶意用户可通过移除部分专家并低成本微调剩余模块来绕过授权和安全限制。解决方案的关键在于提出了一种基于专家归属(expert attribution)的分析框架,识别出对特定任务贡献最大的专家子集,并通过主动学习驱动的微调策略进行再对齐,从而揭示了知识损失与恢复之间的权衡关系;在此基础上,进一步设计了“纠缠式专家训练”和“选择性微调协议”等防御机制,使MoE模型更难被未经授权地压缩或适应,从而实现对模型专用化过程的安全保障。

链接: https://arxiv.org/abs/2511.19480
作者: Pinaki Prasad Guha Neogi,Ahmad Mohammadshirazi,Dheeraj Kulshrestha,Rajiv Ramnath
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) architectures are increasingly adopted in large language models (LLMs) for their scalability and efficiency. However, their modular structure introduces a unique vulnerability: adversaries can attempt to compress or repurpose models by pruning experts and cheaply fine-tuning the remainder, effectively bypassing licensing and security constraints. In this paper, we systematically study the prunability of MoE-LLMs under task-specific usage. We first develop an expert attribution framework that identifies the subset of experts most responsible for a given task, then evaluate the performance trade-offs of pruning and re-aligning these experts using active learning-driven fine-tuning. Our findings reveal a critical knowledge loss–recovery trade-off: while certain experts can be isolated to retain task accuracy, significant degradation occurs without targeted re-alignment. Based on this analysis, we propose defense strategies that aim to make MoE models harder to compress and fine-tune without authorization, including entangled expert training and selective fine-tuning protocols that resist unauthorized adaptation. By positioning expert pruning as both a threat vector and a defense target, this work highlights the dual-use nature of MoE modularity and provides the first systematic evaluation framework for secure specialization of MoE-LLMs.
zh

[AI-105] WavefrontDiffusion: Dynamic Decoding Schedule or Improved Reasoning

【速读】:该论文旨在解决扩散语言模型(Diffusion Language Models, DLMs)在文本生成过程中因固定 denoising 策略导致的语义不连贯与过早终止问题。现有主流方法如 Standard Diffusion 由于全局去噪易造成上下文 incomplete 和 premature end-of-sequence 预测,而 BlockDiffusion 虽控制计算复杂度但其刚性块更新机制会破坏语义单元完整性,影响推理能力。解决方案的关键在于提出 WavefrontDiffusion,一种动态解码策略:它以已确定位置为起点,向外扩展“波前”式活跃 token 区域进行自适应更新,从而遵循自然语义结构流动,同时保持与块状方法相当的计算效率。该方法在四个推理与代码生成基准上实现 SOTA 性能,并显著提升输出语义保真度。

链接: https://arxiv.org/abs/2511.19473
作者: Haojin Yang,Rui Hu,Zequn Sun,Rui Zhou,Yujun Cai,Yiwei Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages. 3 figures

点击查看摘要

Abstract:Diffusion Language Models (DLMs) have shown strong potential for text generation and are becoming a competitive alternative to autoregressive models. The denoising strategy plays an important role in determining the quality of their outputs. Mainstream denoising strategies include Standard Diffusion and BlockDiffusion. Standard Diffusion performs global denoising without restricting the update range, often finalizing incomplete context and causing premature end-of-sequence predictions. BlockDiffusion updates fixed-size blocks in a preset order, but its rigid structure can break apart coherent semantic units and disrupt reasoning. We present WavefrontDiffusion, a dynamic decoding approach that expands a wavefront of active tokens outward from finalized positions. This adaptive process follows the natural flow of semantic structure while keeping computational cost equal to block-based methods. Across four benchmarks in reasoning and code generation, WavefrontDiffusion achieves state-of-the-art performance while producing outputs with higher semantic fidelity, showing the value of adaptive scheduling for more coherent and efficient generation.
zh

[AI-106] PrefixGPT : Prefix Adder Optimization by a Generative Pre-trained Transformer AAAI-2026

【速读】:该论文旨在解决前缀加法器(prefix adder)在计算密集型应用中因设计规则严格和设计空间呈指数级增长而导致的优化难题。解决方案的关键在于提出PrefixGPT,一种基于生成式预训练Transformer(Generative Pre-trained Transformer, GPT)的模型,它通过将加法器拓扑表示为二维坐标序列,并在生成过程中引入合法性掩码(legality mask),确保所有生成的设计均满足约束条件(valid by construction)。该模型采用定制化的仅解码器(decoder-only)架构,先在随机合成的有效前缀加法器语料库上进行预训练以学习设计规则,再通过微调实现高质量设计空间探索,最终在面积-延迟积(Area-Delay Product, ADP)指标上显著优于现有方法,证明了GPT类模型在硬件设计中的潜力。

链接: https://arxiv.org/abs/2511.19472
作者: Ruogu Ding,Xin Ning,Ulf Schlichtmann,Weikang Qian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: An extended version that has been accepted by AAAI-2026 conference

点击查看摘要

Abstract:Prefix adders are widely used in compute-intensive applications for their high speed. However, designing optimized prefix adders is challenging due to strict design rules and an exponentially large design space. We introduce PrefixGPT, a generative pre-trained Transformer (GPT) that directly generates optimized prefix adders from scratch. Our approach represents an adder’s topology as a two-dimensional coordinate sequence and applies a legality mask during generation, ensuring every design is valid by construction. PrefixGPT features a customized decoder-only Transformer architecture. The model is first pre-trained on a corpus of randomly synthesized valid prefix adders to learn design rules and then fine-tuned to navigate the design space for optimized design quality. Compared with existing works, PrefixGPT not only finds a new optimal design with a 7.7% improved area-delay product (ADP) but exhibits superior exploration quality, lowering the average ADP by up to 79.1%. This demonstrates the potential of GPT-style models to first master complex hardware design principles and then apply them for more efficient design optimization.
zh

[AI-107] Hidden markov model to predict tourists visited place

【速读】:该论文旨在解决如何基于社交网络数据准确预测游客未来移动轨迹的问题,以支持旅游营销决策和需求分析。其解决方案的关键在于提出一种基于机器学习语法推断算法的方法,并将其适配至大数据场景,从而构建一个可编辑的隐马尔可夫模型(Hidden Markov Model, HMM),用于建模和学习群体游客的移动模式,实现在巴黎这一案例城市中的高效预测验证。

链接: https://arxiv.org/abs/2511.19465
作者: Theo Demessance,Chongke Bi,Sonia Djebali,Guillaume Guerard
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Nowadays, social networks are becoming a popular way of analyzing tourist behavior, thanks to the digital traces left by travelers during their stays on these networks. The massive amount of data generated; by the propensity of tourists to share comments and photos during their trip; makes it possible to model their journeys and analyze their behavior. Predicting the next movement of tourists plays a key role in tourism marketing to understand demand and improve decision support. In this paper, we propose a method to understand and to learn tourists’ movements based on social network data analysis to predict future movements. The method relies on a machine learning grammatical inference algorithm. A major contribution in this paper is to adapt the grammatical inference algorithm to the context of big data. Our method produces a hidden Markov model representing the movements of a group of tourists. The hidden Markov model is flexible and editable with new data. The capital city of France, Paris is selected to demonstrate the efficiency of the proposed methodology.
zh

[AI-108] mperature in SLMs: Impact on Incident Categorization in On-Premises Environments

【速读】:该论文旨在解决安全运营中心(SOC)和计算机应急响应小组(CSIRT)在网络安全事件分类自动化过程中面临的挑战,特别是云部署的大语言模型(LLM)带来的成本高、延迟大及数据保密性风险问题。解决方案的关键在于评估本地部署的小型语言模型(SLM)是否能够有效替代云LLM实现高效且安全的事件分类;研究发现,模型参数量和GPU计算能力是影响性能的核心因素,而温度超参数对分类精度影响较小,表明在资源受限环境下选择合适规模的SLM并优化硬件配置即可实现可靠的自动化分类。

链接: https://arxiv.org/abs/2511.19464
作者: Marcio Pohlmann,Alex Severo,Gefté Almeida,Diego Kreutz,Tiago Heinrich,Lourenço Pereira
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Performance (cs.PF)
备注: 5 pages, 3 figures, 2 tables, submitted to ERRC/WRSeg 2025

点击查看摘要

Abstract:SOCs and CSIRTs face increasing pressure to automate incident categorization, yet the use of cloud-based LLMs introduces costs, latency, and confidentiality risks. We investigate whether locally executed SLMs can meet this challenge. We evaluated 21 models ranging from 1B to 20B parameters, varying the temperature hyperparameter and measuring execution time and precision across two distinct architectures. The results indicate that temperature has little influence on performance, whereas the number of parameters and GPU capacity are decisive factors.
zh

[AI-109] Systemic approach for modeling a generic smart grid

【速读】:该论文旨在解决传统计算方法难以应对智能电网(Smart Grid)复杂跨学科建模与仿真问题的挑战,特别是如何有效整合电力系统、能源市场、需求侧管理及其他资源资产的系统性建模。其解决方案的关键在于提出一个智能电网的骨干模型(backbone model),通过分布式优化子系统实现发电与用电调度,在保证灵活性和可扩展性的前提下验证不同场景假设,从而为大规模实际部署提供可靠的仿真基础。

链接: https://arxiv.org/abs/2511.19460
作者: Sofiane Ben Amor,Guillaume Guerard,Loup-Noé Levy
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Smart grid technological advances present a recent class of complex interdisciplinary modeling and increasingly difficult simulation problems to solve using traditional computational methods. To simulate a smart grid requires a systemic approach to integrated modeling of power systems, energy markets, demand-side management, and much other resources and assets that are becoming part of the current paradigm of the power grid. This paper presents a backbone model of a smart grid to test alternative scenarios for the grid. This tool simulates disparate systems to validate assumptions before the human scale model. Thanks to a distributed optimization of subsystems, the production and consumption scheduling is achieved while maintaining flexibility and scalability.
zh

[AI-110] SparOA: Sparse and Operator-aware Hybrid Scheduling for Edge DNN Inference

【速读】:该论文旨在解决深度神经网络(Deep Neural Network, DNN)模型在资源受限的边缘设备上部署时面临的性能瓶颈问题,现有方案如模型压缩常导致精度损失,而专用硬件则成本高且灵活性差。其解决方案的关键在于提出一种CPU-GPU混合推理框架SparOA,通过融合稀疏性(sparsity)与计算强度(computational intensity)来优化算子调度;具体包括三个核心组件:(1) 基于阈值预测器精确确定最优稀疏性和计算强度阈值;(2) 基于强化学习的调度器根据实时硬件状态动态分配资源;(3) 混合推理引擎通过异步执行和批处理优化效率。实验表明,SparOA相较基线平均提速1.22–1.31倍,相比纯CPU方案最高提速50.7倍,并在每推理能耗上优于当前最优协同执行基线7%–16%。

链接: https://arxiv.org/abs/2511.19457
作者: Ziyang Zhang,Jie Liu,Luca Mottola
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 14 pages, 12 figures

点击查看摘要

Abstract:The resource demands of deep neural network (DNN) models introduce significant performance challenges, especially when deployed on resource-constrained edge devices. Existing solutions like model compression often sacrifice accuracy, while specialized hardware remains costly and inflexible. Hybrid inference methods, however, typically overlook how operator characteristics impact performance. In this work, we present SparOA, a CPU-GPU hybrid inference framework, which leverages both sparsity and computational intensity to optimize operator scheduling. SparOA embraces aforementioned challenges through three key components: (1) a threshold predictor that accurately determines optimal sparsity and computational intensity thresholds; (2) a reinforcement learning-based scheduler that dynamically optimizes resource allocation based on real-time hardware states; and (3) a hybrid inference engine that enhances efficiency through asynchronous execution and batch size this http URL results show that SparOA achieves an average speedup of 1.22-1.31x compared to all baselines, and outperforms the CPU-Only by up to 50.7x. Also, SparOA achieves optimal energy-per-inference, consuming 7%-16% less energy than the SOTA co-execution baseline.
zh

[AI-111] DiverseClaire: Simulating Students to Improve Introductory Programming Course Materials for All CS1 Learners

【速读】:该论文旨在解决当前计算机科学(Computer Science, CS)入门课程(如CS1)普遍采用“一刀切”教学模式所导致的认知负荷过重问题,尤其对自闭症、注意力缺陷多动障碍(ADHD)、阅读障碍等神经多样性学习者不利。其解决方案的关键在于引入DiverseClaire——一个基于大语言模型(LLMs)模拟不同神经多样性学生群体的试点研究框架,结合布鲁姆认知目标分类法(Bloom’s Taxonomy)与全纳性学习设计(Universal Design for Learning, UDL),通过对比传统课件与UDL优化后的教学材料,验证多模态、可访问的学习资源对学生认知表现的影响。实验结果表明,未经适配的课件显著阻碍了模拟神经多样性学习者的理解与学习效果,凸显了为不同学习偏好提供多样化内容格式的重要性。

链接: https://arxiv.org/abs/2511.14198
作者: Wendy Wong,Yuchao Jiang,Yuekang Li
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Programming Languages (cs.PL)
备注: 2 pages

点击查看摘要

Abstract:Although CS programs are booming, introductory courses like CS1 still adopt a one-size-fits-all formats that can exacerbate cognitive load and discourage learners with autism, ADHD, dyslexia and other neurological conditions. These call for compassionate pedagogies and Universal Design For Learning (UDL) to create learning environments and materials where cognitive diversity is welcomed. To address this, we introduce DiverseClaire a pilot study, which simulates students including neurodiverse profiles using LLMs and diverse personas. By leveraging Bloom’s Taxonomy and UDL, DiverseClaire compared UDL-transformed lecture slides with traditional formats. To evaluate DiverseClaire controlled experiments, we used the evaluation metric the average score. The findings revealed that the simulated neurodiverse students struggled with learning due to lecture slides that were in inaccessible formats. These results highlight the need to provide course materials in multiple formats for diverse learner preferences. Data from our pilot study will be made available to assist future CS1 instructors.
zh

[AI-112] Are Large Brainwave Foundation Models Capable Yet? Insights from Fine-tuning

【速读】:该论文旨在解决当前大型脑电波基础模型(Large Brainwave Foundation Models, LBMs)在脑机接口(Brain-Computer Interface, BCI)任务中效率与适用性不足的问题。研究表明,尽管LBMs在多个BCI基准任务(如记忆任务和睡眠阶段分类)中展现出一定潜力,但其性能提升仅达0.9%–1.2%,远低于传统深度架构,且参数量呈数量级增长(百万级 vs 千级),暴露出其在训练和部署中的低效性。解决方案的关键在于通过细致的消融实验与低秩适应(Low-Rank Adaptation, LoRA)技术,显著减少可训练参数而保持性能稳定,并揭示了当前LBMs的局限源于架构设计和训练策略的不匹配;尤其重要的是,论文首次将LoRA应用于LBMs,发现同时适配多个神经网络组件时才能获得性能增益,从而强调未来需采用面向脑电波分析的领域特定开发策略,甚至可能需要重新设计模型架构以充分发挥基础模型在脑波建模中的潜力。

链接: https://arxiv.org/abs/2507.01196
作者: Na Lee,Konstantinos Barmpas,Yannis Panagakis,Dimitrios Adamos,Nikolaos Laskaris,Stefanos Zafeiriou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Foundation Models have demonstrated significant success across various domains in Artificial Intelligence (AI), yet their capabilities for brainwave modeling remain unclear. In this paper, we comprehensively evaluate current Large Brainwave Foundation Models (LBMs) through systematic fine-tuning experiments across multiple Brain-Computer Interface (BCI) benchmark tasks, including memory tasks and sleep stage classification. Our extensive analysis shows that state-of-the-art LBMs achieve only marginal improvements (0.9%-1.2%) over traditional deep architectures while requiring significantly more parameters (millions vs thousands), raising important questions about their efficiency and applicability in BCI contexts. Moreover, through detailed ablation studies and Low-Rank Adaptation (LoRA), we significantly reduce trainable parameters without performance degradation, while demonstrating that architectural and training inefficiencies limit LBMs’ current capabilities. Our experiments span both full model fine-tuning and parameter-efficient adaptation techniques, providing insights into optimal training strategies for BCI applications. We pioneer the application of LoRA to LBMs, revealing that performance benefits generally emerge when adapting multiple neural network components simultaneously. These findings highlight the critical need for domain-specific development strategies to advance LBMs, suggesting that current architectures may require redesign to fully leverage the potential of foundation models in brainwave analysis.
zh

[AI-113] me-Domain Linear Model-based Framework for Passive Acoustic Mapping of Cavitation Activity

【速读】:该论文旨在解决被动声学成像(Passive Acoustic Mapping, PAM)中因缺乏参考发射起始时间而导致轴向分辨率受限的问题,尤其针对传统时域和频域波束成形方法在数据效率与空间分辨率上的不足。解决方案的关键在于提出一种完全基于时域的线性模型驱动波束成形框架,其中通过建立离散时空分布的空化活动与探头记录的时间信号之间的线性关系,显式考虑由采集几何结构决定的传播时间延迟(time-of-flight delays),并利用时空域先验知识对模型进行正则化求解,从而在仅使用传统频域方法所需数据20%的情况下实现更优或相当的空化图质量。

链接: https://arxiv.org/abs/2511.20551
作者: Tatiana Gelvez-Barrera,Barbara Nicolas,Denis Kouamé,Bruno Gilles,Adrian Basarab
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Passive acoustic mapping enables the spatial mapping and temporal monitoring of cavitation activity, playing a crucial role in therapeutic ultrasound applications. Most conventional beamforming methods, whether implemented in the time or frequency domains, suffer from limited axial resolution due to the absence of a reference emission onset time. While frequency-domain methods, the most efficient of which are based on the cross-spectral matrix, require long signals for accurate estimation, time-domain methods typically achieve lower spatial resolution. To address these limitations, we propose a linear model-based beamforming framework fully formulated in the time domain. The linear forward model relates a discretized spatiotemporal distribution of cavitation activity to the temporal signals recorded by a probe, explicitly accounting for time-of-flight delays dictated by the acquisition geometry. This model is then inverted using regularization techniques that exploit prior knowledge of cavitation activity in both spatial and temporal domains. Experimental results show that the proposed framework achieves enhanced or competitive cavitation map quality while using only 20% of the data typically required by frequency-domain methods. This highlights the substantial gain in data efficiency and the flexibility of our spatiotemporal regularization to adapt to diverse passive cavitation scenarios, outperforming state-of-the-art techniques.
zh

[AI-114] MIMIC-MJX: Neuromechanical Emulation of Animal Behavior

【速读】:该论文旨在解决如何从运动学轨迹中推断出生物上合理的神经控制策略这一问题,即如何通过观测到的行为(如肢体运动)反推出驱动这些行为的神经控制机制。传统方法仅依赖运动学轨迹难以揭示内在的控制过程,而本文提出的MIMIC-MJX框架的关键在于:通过训练神经控制器在物理仿真中驱动生物力学逼真的身体模型,使其能够重现真实运动学轨迹,从而学习到符合生物学原理的神经控制策略。该方法的核心创新在于将生成式建模与物理仿真相结合,实现了对神经控制机制的可解释、高效且泛化的建模。

链接: https://arxiv.org/abs/2511.20532
作者: Charles Y. Zhang(1),Yuanjia Yang(2, 3),Aidan Sirbu(4, 5),Elliott T.T. Abe(6),Emil Wärnberg(1),Eric J. Leonardis(2),Diego E. Aldarondo(1),Adam Lee(1, 2),Aaditya Prasad(7),Jason Foat(2),Kaiwen Bian(2),Joshua Park(2),Rusham Bhatt(2),Hutton Saunders(2),Akira Nagamori(2),Ayesha R. Thanawalla(2),Kee Wui Huang(2),Fabian Plum(8),Hendrik K. Beck(8),Steven W. Flavell(7, 9),David Labonte(8),Blake A. Richards(4, 5, 10),Bingni W. Brunton(6),Eiman Azim(2),Bence P. Ölveczky(1),Talmo D. Pereira(2) ((1) Harvard University, (2) Salk Institute for Biological Studies, (3) University of California San Diego, (4) Mila, (5) McGill University, (6) University of Washington, (7) Massachusetts Institute of Technology, (8) Imperial College London, (9) Howard Hughes Medical Institute, (10) Canadian Institute for Advanced Research)
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:The primary output of the nervous system is movement and behavior. While recent advances have democratized pose tracking during complex behavior, kinematic trajectories alone provide only indirect access to the underlying control processes. Here we present MIMIC-MJX, a framework for learning biologically-plausible neural control policies from kinematics. MIMIC-MJX models the generative process of motor control by training neural controllers that learn to actuate biomechanically-realistic body models in physics simulation to reproduce real kinematic trajectories. We demonstrate that our implementation is accurate, fast, data-efficient, and generalizable to diverse animal body models. Policies trained with MIMIC-MJX can be utilized to both analyze neural control strategies and simulate behavioral experiments, illustrating its potential as an integrative modeling framework for neuroscience.
zh

[AI-115] Human-computer interactions predict mental health

【速读】:该论文旨在解决精神疾病(mental illness)全球范围内可扩展评估的难题,这是实现可及且公平心理健康服务的关键障碍。其解决方案的核心在于提出一种名为MAILA(MAchine-learning framework for Inferring Latent mental states from digital Activity)的机器学习框架,该框架通过分析用户与计算机交互时产生的数字活动(如鼠标移动和触控屏操作)来推断个体的心理状态及其动态变化。关键创新在于:MAILA能够从9,000名在线参与者中采集的20,000份轨迹数据中训练出对130万条自我报告心理健康数据的预测模型,并在三个正交维度上追踪心理状态的变化,具备跨场景泛化能力,在群体层面达到接近天花板的预测精度。此方法实现了无需主动反馈的被动、可靠、动态且高度可扩展的心理健康评估,为精准医学和公共卫生设定了新基准。

链接: https://arxiv.org/abs/2511.20179
作者: Veith Weilnhammer,Jefferson Ortega,David Whitney
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Scalable assessments of mental illness, the leading driver of disability worldwide, remain a critical roadblock toward accessible and equitable care. Here, we show that human-computer interactions encode multiple dimensions of self-reported mental health and their changes over time. We introduce MAILA, a MAchine-learning framework for Inferring Latent mental states from digital Activity. We trained MAILA to predict 1.3 million mental-health self-reports from 20,000 cursor and touchscreen recordings recorded in 9,000 online participants. The dataset includes 2,000 individuals assessed longitudinally, 1,500 diagnosed with depression, and 500 with obsessive-compulsive disorder. MAILA tracks dynamic mental states along three orthogonal dimensions, generalizes across contexts, and achieves near-ceiling accuracy when predicting group-level mental health. The model translates from general to clinical populations, identifies individuals living with mental illness, and captures signatures of psychological function that are not conveyed by language. Our results demonstrate how everyday human-computer interactions can power passive, reliable, dynamic, and maximally scalable mental health assessments. The ability to decode mental states at zero marginal cost sets new benchmarks for precision medicine and public health, while raising important questions about privacy, agency, and autonomy online. Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC) Cite as: arXiv:2511.20179 [q-bio.NC] (or arXiv:2511.20179v1 [q-bio.NC] for this version) https://doi.org/10.48550/arXiv.2511.20179 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-116] BERT-APC: A Reference-free Framework for Automatic Pitch Correction via Musical Context Inference

【速读】:该论文旨在解决自动音高修正(Automatic Pitch Correction, APC)系统中普遍存在的两个问题:一是现有方法依赖参考音高(reference pitch),限制了其在无标注或实时场景下的实用性;二是简单音高估计算法难以保留演唱者的表达性与自然性。解决方案的关键在于提出一种基于音乐语言模型的无参考音高修正框架 BERT-APC,其核心创新包括:1)引入一个平稳音高预测器(stationary pitch predictor)从偏移的歌声中估计感知音高;2)利用重构的音乐语言模型作为上下文感知的音高预测器,推断出意图音高序列;3)设计逐音符级校正算法,在纠正音高偏差的同时保留有意的情感表达性偏移;4)提出可学习的数据增强策略,模拟真实偏移模式以提升模型鲁棒性。该方法首次实现了基于符号化音乐语境的无参考音高修正,显著优于现有商业工具(如AutoTune和Melodyne)在主观评测(MOS)和客观准确率上的表现。

链接: https://arxiv.org/abs/2511.20006
作者: Sungjae Kim,Kihyun Na,Jinyoung Choi,Injung Kim
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: 12 pages, 6 figures, 5 tables

点击查看摘要

Abstract:Automatic Pitch Correction (APC) enhances vocal recordings by aligning pitch deviations with the intended musical notes. However, existing APC systems either rely on reference pitches, which limits their practical applicability, or employ simple pitch estimation algorithms that often fail to preserve expressiveness and naturalness. We propose BERT-APC, a novel reference-free APC framework that corrects pitch errors while maintaining the natural expressiveness of vocal performances. In BERT-APC, a novel stationary pitch predictor first estimates the perceived pitch of each note from the detuned singing voice. A context-aware note pitch predictor estimates the intended pitch sequence by leveraging a music language model repurposed to incorporate musical context. Finally, a note-level correction algorithm fixes pitch errors while preserving intentional pitch deviations for emotional expression. In addition, we introduce a learnable data augmentation strategy that improves the robustness of the music language model by simulating realistic detuning patterns. Compared to two recent singing voice transcription models, BERT-APC demonstrated superior performance in note pitch prediction, outperforming the second-best model, ROSVOT, by 10.49%p on highly detuned samples in terms of the raw pitch accuracy. In the MOS test, BERT-APC achieved the highest score of 4.32 \pm 0.15 , which is significantly higher than those of the widely-used commercial APC tools, AutoTune ( 3.22 \pm 0.18 ) and Melodyne ( 3.08 \pm 0.18 ), while maintaining a comparable ability to preserve expressive nuances. To the best of our knowledge, this is the first APC model that leverages a music language model to achieve reference-free pitch correction with symbolic musical context. The corrected audio samples of BERT-APC are available online.
zh

[AI-117] AI/ML based Joint Source and Channel Coding for HARQ-ACK Payload

【速读】:该论文旨在解决5G新空口(NR)上行链路中混合自动重传请求确认(HARQ-ACK)比特因非均匀分布导致的传输效率低下问题。传统信道编码假设输入比特在物理层呈均匀分布,而实际HARQ-ACK比特具有明显的非均匀特性,若不加以利用将造成性能损失。解决方案的关键在于:首先,设计基于Transformer的编码器,并采用一种新颖的“免费午餐”(free-lunch)训练算法,结合逐码字功率整形(per-codeword power shaping)策略,在保持对HARQ-ACK分布微小变化鲁棒性的前提下,有效利用源先验信息;其次,在解码端提出扩展的Neyman-Pearson检验方法,实现多信息比特系统下的不等错误保护(Unequal Error Protection, UEP),以显著降低否定确认(NACK)误码率,避免因多次NACK累积引发无线链路失败。仿真结果表明,该方案在衰落信道下相较NR基线可实现平均发射功率降低3–6 dB、最大发射功率降低2–3 dB,从而带来显著的覆盖增益和能效提升。

链接: https://arxiv.org/abs/2511.19943
作者: Akash Doshi,Pinar Sen,Kirill Ivanov,Wei Yang,June Namgoong,Runxin Wang,Rachel Wang,Taesang Yoo,Jing Jiang,Tingfang Ji
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 39 pages, 15 figures. Under consideration for publication in Journal of Sel. Areas in Information Theory. This paper was presented in part at the International Symposium on Topics in Coding, August 2025 in the Session for Coding and AI

点击查看摘要

Abstract:Channel coding from 2G to 5G has assumed the inputs bits at the physical layer to be uniformly distributed. However, hybrid automatic repeat request acknowledgement (HARQ-ACK) bits transmitted in the uplink are inherently non-uniformly distributed. For such sources, significant performance gains could be obtained by employing joint source channel coding, aided by deep learning-based techniques. In this paper, we learn a transformer-based encoder using a novel “free-lunch” training algorithm and propose per-codeword power shaping to exploit the source prior at the encoder whilst being robust to small changes in the HARQ-ACK distribution. Furthermore, any HARQ-ACK decoder has to achieve a low negative acknowledgement (NACK) error rate to avoid radio link failures resulting from multiple NACK errors. We develop an extension of the Neyman-Pearson test to a coded bit system with multiple information bits to achieve Unequal Error Protection of NACK over ACK bits at the decoder. Finally, we apply the proposed encoder and decoder designs to a 5G New Radio (NR) compliant uplink setup under a fading channel, describing the optimal receiver design and a low complexity coherent approximation to it. Our results demonstrate 3-6 dB reduction in the average transmit power required to achieve the target error rates compared to the NR baseline, while also achieving a 2-3 dB reduction in the maximum transmit power, thus providing for significant coverage gains and power savings.
zh

[AI-118] he Alexander-Hirschowitz theorem for neurovarieties

【速读】:该论文旨在解决多项式神经网络(polynomial neural networks)中神经变体(neurovarieties)的维度问题,特别是在单输出情形下,明确其何时能达到预期维度。解决方案的关键在于对单输出架构下的神经变体维度进行完全刻画,从而推导出多输出结构的非缺陷性(non-defectiveness)和全局可识别性(global identifiability)。

链接: https://arxiv.org/abs/2511.19703
作者: A. Massarenti,M. Mella
机构: 未知
类目: Algebraic Geometry (math.AG); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Commutative Algebra (math.AC)
备注: 21 pages

点击查看摘要

Abstract:We study neurovarieties for polynomial neural networks and fully characterize when they attain the expected dimension in the single-output case. As consequences, we establish non-defectiveness and global identifiability for multi-output architectures.
zh

[AI-119] CycleChemist: A Dual-Pronged Machine Learning Framework for Organic Photovoltaic Discovery

【速读】:该论文旨在解决有机光伏(Organic Photovoltaic, OPV)材料开发中难以高效筛选高性能给体-受体配对的问题,尤其针对现有设计策略多局限于单一组分而缺乏对两者协同作用的统一建模。其解决方案的关键在于构建了一个双阶段机器学习框架:首先利用大规模标注数据集OPV2D训练分类模型OPVC和多任务图神经网络,实现对材料是否具备OPV行为的预测及HOMO/LUMO能级(分子轨道能量估计器MOE2)与功率转换效率(PCE)的精准预测;其次引入基于强化学习的材料生成预训练Transformer(MatGPT),通过三目标策略优化生成具有合成可行性的新型有机半导体分子,从而打通从分子结构到光电性能的闭环设计路径。该框架实现了分子表示学习与性能预测的深度融合,显著提升了高PCE OPV材料的发现效率。

链接: https://arxiv.org/abs/2511.19500
作者: Hou Hei Lam,Jiangjie Qiu,Xiuyuan Hu,Wentao Li,Fankun Zeng,Siwei Fu,Hao Zhang,Xiaonan Wang
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Organic photovoltaic (OPV) materials offer a promising path toward sustainable energy generation, but their development is limited by the difficulty of identifying high performance donor and acceptor pairs with strong power conversion efficiencies (PCEs). Existing design strategies typically focus on either the donor or the acceptor alone, rather than using a unified approach capable of modeling both components. In this work, we introduce a dual machine learning framework for OPV discovery that combines predictive modeling with generative molecular design. We present the Organic Photovoltaic Donor Acceptor Dataset (OPV2D), the largest curated dataset of its kind, containing 2000 experimentally characterized donor acceptor pairs. Using this dataset, we develop the Organic Photovoltaic Classifier (OPVC) to predict whether a material exhibits OPV behavior, and a hierarchical graph neural network that incorporates multi task learning and donor acceptor interaction modeling. This framework includes the Molecular Orbital Energy Estimator (MOE2) for predicting HOMO and LUMO energy levels, and the Photovoltaic Performance Predictor (P3) for estimating PCE. In addition, we introduce the Material Generative Pretrained Transformer (MatGPT) to produce synthetically accessible organic semiconductors, guided by a reinforcement learning strategy with three objective policy optimization. By linking molecular representation learning with performance prediction, our framework advances data driven discovery of high performance OPV materials.
zh

[AI-120] FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection

【速读】:该论文旨在解决现有核心集(coreset)选择方法在分布匹配上的局限性问题:传统基于深度神经网络(DNN-based)的方法受模型参数约束并引入架构偏差,而无DNN方法依赖启发式策略且缺乏理论保障,同时二者均未显式约束分布等价性,且常用指标(如MSE、KL、MMD、CE)难以准确捕捉高阶矩差异,导致核心集质量不佳。其解决方案的关键在于提出FAST框架——首个无DNN的分布匹配核心集选择方法,通过将核心集选择建模为图约束优化问题,利用特征函数距离(Characteristic Function Distance, CFD)在频域中完整捕获分布信息;进一步针对CFD在中高频区域存在的“相位梯度消失”问题,设计了衰减相位解耦的CFD(Attenuated Phase-Decoupled CFD),并结合渐进式差异感知采样策略(Progressive Discrepancy-Aware Sampling),按频率从低到高逐步调度采样,优先保留全局结构再细化局部细节,在减少频率使用的同时避免过拟合,实现高效且精准的分布匹配。

链接: https://arxiv.org/abs/2511.19476
作者: Jin Cui(1),Boran Zhao(2),Jiajun Xu(2),Jiaqi Guo(3),Shuo Guan(2),Pengju Ren(1) ((1) State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, (2) School of Software Engineering, Xi’an Jiaotong University, (3) School of Mathematical Sciences, Nankai University)
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Coreset selection compresses large datasets into compact, representative subsets, reducing the energy and computational burden of training deep neural networks. Existing methods are either: (i) DNN-based, which are tied to model-specific parameters and introduce architectural bias; or (ii) DNN-free, which rely on heuristics lacking theoretical guarantees. Neither approach explicitly constrains distributional equivalence, largely because continuous distribution matching is considered inapplicable to discrete sampling. Moreover, prevalent metrics (e.g., MSE, KL, MMD, CE) cannot accurately capture higher-order moment discrepancies, leading to suboptimal coresets. In this work, we propose FAST, the first DNN-free distribution-matching coreset selection framework that formulates the coreset selection task as a graph-constrained optimization problem grounded in spectral graph theory and employs the Characteristic Function Distance (CFD) to capture full distributional information in the frequency domain. We further discover that naive CFD suffers from a “vanishing phase gradient” issue in medium and high-frequency regions; to address this, we introduce an Attenuated Phase-Decoupled CFD. Furthermore, for better convergence, we design a Progressive Discrepancy-Aware Sampling strategy that progressively schedules frequency selection from low to high, preserving global structure before refining local details and enabling accurate matching with fewer frequencies while avoiding overfitting. Extensive experiments demonstrate that FAST significantly outperforms state-of-the-art coreset selection methods across all evaluated benchmarks, achieving an average accuracy gain of 9.12%. Compared to other baseline coreset methods, it reduces power consumption by 96.57% and achieves a 2.2x average speedup, underscoring its high performance and energy efficiency.
zh

机器学习

[LG-0] Image2Gcode: Image-to-G-code Generation for Additive Manufacturing Using Diffusion-Transformer Model

链接: https://arxiv.org/abs/2511.20636
作者: Ziyue Wang,Yayati Jadhav,Peter Pak,Amir Barati Farimani
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mechanical design and manufacturing workflows conventionally begin with conceptual design, followed by the creation of a computer-aided design (CAD) model and fabrication through material-extrusion (MEX) printing. This process requires converting CAD geometry into machine-readable G-code through slicing and path planning. While each step is well established, dependence on CAD modeling remains a major bottleneck: constructing object-specific 3D geometry is slow and poorly suited to rapid prototyping. Even minor design variations typically necessitate manual updates in CAD software, making iteration time-consuming and difficult to scale. To address this limitation, we introduce Image2Gcode, an end-to-end data-driven framework that bypasses the CAD stage and generates printer-ready G-code directly from images and part drawings. Instead of relying on an explicit 3D model, a hand-drawn or captured 2D image serves as the sole input. The framework first extracts slice-wise structural cues from the image and then employs a denoising diffusion probabilistic model (DDPM) over G-code sequences. Through iterative denoising, the model transforms Gaussian noise into executable print-move trajectories with corresponding extrusion parameters, establishing a direct mapping from visual input to native toolpaths. By producing structured G-code directly from 2D imagery, Image2Gcode eliminates the need for CAD or STL intermediates, lowering the entry barrier for additive manufacturing and accelerating the design-to-fabrication cycle. This approach supports on-demand prototyping from simple sketches or visual references and integrates with upstream 2D-to-3D reconstruction modules to enable an automated pipeline from concept to physical artifact. The result is a flexible, computationally efficient framework that advances accessibility in design iteration, repair workflows, and distributed manufacturing.

[LG-1] Sparse-to-Field Reconstruction via Stochastic Neural Dynamic Mode Decomposition

链接: https://arxiv.org/abs/2511.20612
作者: Yujin Kim,Sarah Dean
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Many consequential real-world systems, like wind fields and ocean currents, are dynamic and hard to model. Learning their governing dynamics remains a central challenge in scientific machine learning. Dynamic Mode Decomposition (DMD) provides a simple, data-driven approximation, but practical use is limited by sparse/noisy observations from continuous fields, reliance on linear approximations, and the lack of principled uncertainty quantification. To address these issues, we introduce Stochastic NODE-DMD, a probabilistic extension of DMD that models continuous-time, nonlinear dynamics while remaining interpretable. Our approach enables continuous spatiotemporal reconstruction at arbitrary coordinates and quantifies predictive uncertainty. Across four benchmarks, a synthetic setting and three physics-based flows, it surpasses a baseline in reconstruction accuracy when trained from only 10% observation density. It further recovers the dynamical structure by aligning learned modes and continuous-time eigenvalues with ground truth. Finally, on datasets with multiple realizations, our method learns a calibrated distribution over latent dynamics that preserves ensemble variability rather than averaging across regimes. Our code is available at: this https URL

[LG-2] Adaptive Hopfield Network: Rethinking Similarities in Associative Memory

链接: https://arxiv.org/abs/2511.20609
作者: Shurong Wang,Yuqi Pan,Zhuoyang Shen,Meng Zhang,Hongwei Wang,Guoqi Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Associative memory models are content-addressable memory systems fundamental to biological intelligence and are notable for their high interpretability. However, existing models evaluate the quality of retrieval based on proximity, which cannot guarantee that the retrieved pattern has the strongest association with the query, failing correctness. We reframe this problem by proposing that a query is a generative variant of a stored memory pattern, and define a variant distribution to model this subtle context-dependent generative process. Consequently, correct retrieval should return the memory pattern with the maximum a posteriori probability of being the query’s origin. This perspective reveals that an ideal similarity measure should approximate the likelihood of each stored pattern generating the query in accordance with variant distribution, which is impossible for fixed and pre-defined similarities used by existing associative memories. To this end, we develop adaptive similarity, a novel mechanism that learns to approximate this insightful but unknown likelihood from samples drawn from context, aiming for correct retrieval. We theoretically prove that our proposed adaptive similarity achieves optimal correct retrieval under three canonical and widely applicable types of variants: noisy, masked, and biased. We integrate this mechanism into a novel adaptive Hopfield network (A-Hop), and empirical results show that it achieves state-of-the-art performance across diverse tasks, including memory retrieval, tabular classification, image classification, and multiple instance learning.

[LG-3] How to Purchase Labels? A Cost-Effective Approach Using Active Learning Markets

链接: https://arxiv.org/abs/2511.20605
作者: Xiwen Huang,Pierre Pinson
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Submitted as a preprint. 34 pages, 14 figures, 4 tables

点击查看摘要

Abstract:We introduce and analyse active learning markets as a way to purchase labels, in situations where analysts aim to acquire additional data to improve model fitting, or to better train models for predictive analytics applications. This comes in contrast to the many proposals that already exist to purchase features and examples. By originally formalising the market clearing as an optimisation problem, we integrate budget constraints and improvement thresholds into the label acquisition process. We focus on a single-buyer-multiple-seller setup and propose the use of two active learning strategies (variance based and query-by-committee based), paired with distinct pricing mechanisms. They are compared to a benchmark random sampling approach. The proposed strategies are validated on real-world datasets from two critical application domains: real estate pricing and energy forecasting. Results demonstrate the robustness of our approach, consistently achieving superior performance with fewer labels acquired compared to conventional methods. Our proposal comprises an easy-to-implement practical solution for optimising data acquisition in resource-constrained environments.

[LG-4] Attention Trajectories as a Diagnostic Axis for Deep Reinforcement Learning

链接: https://arxiv.org/abs/2511.20591
作者: Charlotte Beylier,Hannah Selder,Arthur Fleig,Simon M. Hofmann,Nico Scherf
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The learning process of a reinforcement learning (RL) agent remains poorly understood beyond the mathematical formulation of its learning algorithm. To address this gap, we introduce attention-oriented metrics (ATOMs) to investigate the development of an RL agent’s attention during training. In a controlled experiment, we tested ATOMs on three variations of a Pong game, each designed to teach the agent distinct behaviours, complemented by a behavioural assessment. ATOMs successfully delineate the attention patterns of an agent trained on each game variation, and that these differences in attention patterns translate into differences in the agent’s behaviour. Through continuous monitoring of ATOMs during training, we observed that the agent’s attention developed in phases, and that these phases were consistent across game variations. Overall, we believe that ATOM could help improve our understanding of the learning processes of RL agents and better understand the relationship between attention and learning.

[LG-5] Anatomica: Localized Control over Geometric and Topological Properties for Anatomical Diffusion Models

链接: https://arxiv.org/abs/2511.20587
作者: Karim Kadry,Abdallah Abdelwahed,Shoaib Goraya,Ajay Manicka,Naravich Chutisilp,Farhad Nezami,Elazer Edelman
类目: Machine Learning (cs.LG)
*备注: 8 pages, 10 figures

点击查看摘要

Abstract:We present Anatomica: an inference-time framework for generating multi-class anatomical voxel maps with localized geo-topological control. During generation, we use cuboidal control domains of varying dimensionality, location, and shape to slice out relevant substructures. These local substructures are used to compute differentiable penalty functions that steer the sample towards target constraints. We control geometric features such as size, shape, and position through voxel-wise moments, while topological features such as connected components, loops, and voids are enforced through persistent homology. Lastly, we implement Anatomica for latent diffusion models, where neural field decoders partially extract substructures, enabling the efficient control of anatomical properties. Anatomica applies flexibly across diverse anatomical systems, composing constraints to control complex structures over arbitrary dimensions and coordinate systems, thereby enabling the rational design of synthetic datasets for virtual trials or machine learning workflows.

[LG-6] A Tale of Two Geometries: Adaptive Optimizers and Non-Euclidean Descent

链接: https://arxiv.org/abs/2511.20584
作者: Shuo Xie,Tianhao Wang,Beining Wu,Zhiyuan Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adaptive optimizers can reduce to normalized steepest descent (NSD) when only adapting to the current gradient, suggesting a close connection between the two algorithmic families. A key distinction between their analyses, however, lies in the geometries, e.g., smoothness notions, they rely on. In the convex setting, adaptive optimizers are governed by a stronger adaptive smoothness condition, while NSD relies on the standard notion of smoothness. We extend the theory of adaptive smoothness to the nonconvex setting and show that it precisely characterizes the convergence of adaptive optimizers. Moreover, we establish that adaptive smoothness enables acceleration of adaptive optimizers with Nesterov momentum in the convex setting, a guarantee unattainable under standard smoothness for certain non-Euclidean geometry. We further develop an analogous comparison for stochastic optimization by introducing adaptive gradient variance, which parallels adaptive smoothness and leads to dimension-free convergence guarantees that cannot be achieved under standard gradient variance for certain non-Euclidean geometry.

[LG-7] MSTN: Fast and Efficient Multivariate Time Series Model

链接: https://arxiv.org/abs/2511.20577
作者: Sumit S Shevtekar,Chandresh K Maurya,Gourab Sil
类目: Machine Learning (cs.LG)
*备注: 21 pages, 1 figure, 5 tables

点击查看摘要

Abstract:Real-world time-series data is highly non stationary and complex in dynamics that operate across multiple timescales, ranging from fast, short-term changes to slow, long-term trends. Most existing models rely on fixed-scale structural priors, such as patch-based tokenization, fixed frequency transformations, or frozen backbone architectures. This often leads to over-regularization of temporal dynamics, which limits their ability to adaptively model the full spectrum of temporal variations and impairs their performance on unpredictable, Sudden, high-magnitude events. To address this, we introduce the Multi-scale Temporal Network (MSTN), a novel deep learning architecture founded on a hierarchical multi-scale and sequence modeling principle. The MSTN framework integrates: (i) a multi-scale convolutional encoder that constructs a hierarchical feature pyramid for local patterns (ii) a sequence modeling component for long-range temporal dependencies. We empirically validate this with BiLSTM and Transformer variants, establishing a flexible foundation for future architectural advancements. and (iii) a gated fusion mechanism augmented with squeeze-and-excitation (SE) and multi-head temporal attention (MHTA) for dynamic, context-aware feature integration. This design enables MSTN to adaptively model temporal patterns from milliseconds to long-range dependencies within a unified framework. Extensive evaluations across time-series long-horizon forecasting, imputation, classification and generalizability study demonstrate that MSTN achieves competitive state-of-the-art (SOTA) performance, showing improvements over contemporary approaches including EMTSF, LLM4TS, HiMTM, TIME-LLM, MTST, SOFTS, iTransformer, TimesNet, and PatchTST. In total, MSTN establishes new SOTA performance on 24 of 32 benchmark datasets, demonstrating its consistent performance across diverse temporal tasks.

[LG-8] E2E-GRec: An End-to-End Joint Training Framework for Graph Neural Networks and Recommender Systems

链接: https://arxiv.org/abs/2511.20564
作者: Rui Xue,Shichao Zhu,Liang Qin,Guangmou Pan,Yang Song,Tianfu Wu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have emerged as powerful tools for modeling graph-structured data and have been widely used in recommender systems, such as for capturing complex user-item and item-item relations. However, most industrial deployments adopt a two-stage pipeline: GNNs are first pre-trained offline to generate node embeddings, which are then used as static features for downstream recommender systems. This decoupled paradigm leads to two key limitations: (1) high computational overhead, since large-scale GNN inference must be repeatedly executed to refresh embeddings; and (2) lack of joint optimization, as the gradient from the recommender system cannot directly influence the GNN learning process, causing the GNN to be suboptimally informative for the recommendation task. In this paper, we propose E2E-GRec, a novel end-to-end training framework that unifies GNN training with the recommender system. Our framework is characterized by three key components: (i) efficient subgraph sampling from a large-scale cross-domain heterogeneous graph to ensure training scalability and efficiency; (ii) a Graph Feature Auto-Encoder (GFAE) serving as an auxiliary self-supervised task to guide the GNN to learn structurally meaningful embeddings; and (iii) a two-level feature fusion mechanism combined with Gradnorm-based dynamic loss balancing, which stabilizes graph-aware multi-task end-to-end training. Extensive offline evaluations, online A/B tests (e.g., a +0.133% relative improvement in stay duration, a 0.3171% reduction in the average number of videos a user skips) on large-scale production data, together with theoretical analysis, demonstrate that E2E-GRec consistently surpasses traditional approaches, yielding significant gains across multiple recommendation metrics.

[LG-9] Feature-Modulated UFNO for Improved Prediction of Multiphase Flow in Porous Media

链接: https://arxiv.org/abs/2511.20543
作者: Alhasan Abdellatif,Hannah P. Menke,Ahmed H. Elsheikh,Florian Doster,Kamaljit Singh
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The UNet-enhanced Fourier Neural Operator (UFNO) extends the Fourier Neural Operator (FNO) by incorporating a parallel UNet pathway, enabling the retention of both high- and low-frequency components. While UFNO improves predictive accuracy over FNO, it inefficiently treats scalar inputs (e.g., temperature, injection rate) as spatially distributed fields by duplicating their values across the domain. This forces the model to process redundant constant signals within the frequency domain. Additionally, its standard loss function does not account for spatial variations in error sensitivity, limiting performance in regions of high physical importance. We introduce UFNO-FiLM, an enhanced architecture that incorporates two key innovations. First, we decouple scalar inputs from spatial features using a Feature-wise Linear Modulation (FiLM) layer, allowing the model to modulate spatial feature maps without introducing constant signals into the Fourier transform. Second, we employ a spatially weighted loss function that prioritizes learning in critical regions. Our experiments on subsurface multiphase flow demonstrate a 21% reduction in gas saturation Mean Absolute Error (MAE) compared to UFNO, highlighting the effectiveness of our approach in improving predictive accuracy.

[LG-10] Adam Simplified: Bias Correction Simplified

链接: https://arxiv.org/abs/2511.20516
作者: Sam Laing,Antonio Orvieto
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Adam optimizer is a cornerstone of modern deep learning, yet the empirical necessity of each of its individual components is often taken for granted. This paper presents a focused investigation into the role of bias-correction, a feature whose contribution remains poorly understood. Through a series of systematic ablations on vision and language modelling tasks, we demonstrate that the conventional wisdom surrounding bias correction is misleading. In particular, we demonstrate that in the optimal hyper-parameter configuration, the inclusion of bias correction leads to no improvement in final test performance. Moreover, unless appropriate learning rate scheduling is implemented, the inclusion of bias correction can sometimes be detrimental to performance. We further reinterpret bias correction as a form of implicit learning rate scheduling whose behaviour is strongly dependent on the choice of smoothing hyper-parameters \beta_1, \beta_2 \in [0,1) . Our findings challenge the universal inclusion of this component.

[LG-11] DP-MicroAdam: Private and Frugal Algorithm for Training and Fine-tuning

链接: https://arxiv.org/abs/2511.20509
作者: Mihaela Hudişteanu,Edwige Cyffers,Nikita P. Kalinin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Adaptive optimizers are the de facto standard in non-private training as they often enable faster convergence and improved performance. In contrast, differentially private (DP) training is still predominantly performed with DP-SGD, typically requiring extensive compute and hyperparameter tuning. We propose DP-MicroAdam, a memory-efficient and sparsity-aware adaptive DP optimizer. We prove that DP-MicroAdam converges in stochastic non-convex optimization at the optimal \mathcalO(1/\sqrtT) rate, up to privacy-dependent constants. Empirically, DP-MicroAdam outperforms existing adaptive DP optimizers and achieves competitive or superior accuracy compared to DP-SGD across a range of benchmarks, including CIFAR-10, large-scale ImageNet training, and private fine-tuning of pretrained transformers. These results demonstrate that adaptive optimization can improve both performance and stability under differential privacy.

[LG-12] InferF: Declarative Factorization of AI/ML Inferences over Joins SIGMOD2026

链接: https://arxiv.org/abs/2511.20489
作者: Kanchan Chowdhury,Lixi Zhou,Lulu Xie,Xinwei Fu,Jia Zou
类目: Databases (cs.DB); Machine Learning (cs.LG)
*备注: Accepted to SIGMOD 2026 as full research paper. This archived version has a full appendix

点击查看摘要

Abstract:Real-world AI/ML workflows often apply inference computations to feature vectors joined from multiple datasets. To avoid the redundant AI/ML computations caused by repeated data records in the join’s output, factorized ML has been proposed to decompose ML computations into sub-computations to be executed on each normalized dataset. However, there is insufficient discussion on how factorized ML could impact AI/ML inference over multi-way joins. To address the limitations, we propose a novel declarative InferF system, focusing on the factorization of arbitrary inference workflows represented as analyzable expressions over the multi-way joins. We formalize our problem to flexibly push down partial factorized computations to qualified nodes in the join tree to minimize the overall inference computation and join costs and propose two algorithms to resolve the problem: (1) a greedy algorithm based on a per-node cost function that estimates the influence on overall latency if a subset of factorized computations is pushed to a node, and (2) a genetic algorithm for iteratively enumerating and evaluating promising factorization plans. We implement InferF on Velox, an open-sourced database engine from Meta, evaluate it on real-world datasets, observed up to 11.3x speedups, and systematically summarized the factors that determine when factorized ML can benefit AI/ML inference workflows.

[LG-13] NVIDIA Nemotron Parse 1.1

链接: https://arxiv.org/abs/2511.20478
作者: Kateryna Chumachenko,Amala Sanjay Deshmukh,Jarno Seppanen,Ilia Karmanov,Chia-Chih Chen,Lukas Voegtle,Philipp Fischer,Marek Wawrzos,Saeid Motiian,Roman Ageev,Kedi Wu,Alexandre Milesi,Maryam Moosaei,Krzysztof Pawelec,Padmavathy Subramanian,Mehrzad Samadi,Xin Yu,Celina Dear,Sarah Stoddard,Jenna Diamond,Jesse Oliver,Leanna Chraghchian,Patrick Skelly,Tom Balough,Yao Xu,Jane Polak Scowcroft,Daniel Korzekwa,Darragh Hanley,Sandip Bhaskar,Timo Roman,Karan Sapra,Andrew Tao,Bryan Catanzaro
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce Nemotron-Parse-1.1, a lightweight document parsing and OCR model that advances the capabilities of its predecessor, Nemoretriever-Parse-1.0. Nemotron-Parse-1.1 delivers improved capabilities across general OCR, markdown formatting, structured table parsing, and text extraction from pictures, charts, and diagrams. It also supports a longer output sequence length for visually dense documents. As with its predecessor, it extracts bounding boxes of text segments, as well as corresponding semantic classes. Nemotron-Parse-1.1 follows an encoder-decoder architecture with 885M parameters, including a compact 256M-parameter language decoder. It achieves competitive accuracy on public benchmarks making it a strong lightweight OCR solution. We release the model weights publicly on Huggingface, as well as an optimized NIM container, along with a subset of the training data as part of the broader Nemotron-VLM-v2 dataset. Additionally, we release Nemotron-Parse-1.1-TC which operates on a reduced vision token length, offering a 20% speed improvement with minimal quality degradation.

[LG-14] owards Trustworthy Wi-Fi Sensing: Systematic Evaluation of Deep Learning Model Robustness to Adversarial Attacks

链接: https://arxiv.org/abs/2511.20456
作者: Shreevanth Krishnaa Gopalakrishnan,Stephen Hailes
类目: Machine Learning (cs.LG)
*备注: 19 pages, 8 figures, 7 tables

点击查看摘要

Abstract:Machine learning has become integral to Channel State Information (CSI)-based human sensing systems and is expected to power applications such as device-free activity recognition and identity detection in future cellular and Wi-Fi generations. However, these systems rely on models whose decisions can be subtly perturbed, raising concerns for security and reliability in ubiquitous sensing. Quantifying and understanding the robustness of such models, defined as their ability to maintain accurate predictions under adversarial perturbations, is therefore critical before wireless sensing can be safely deployed in real-world environments. This work presents a systematic evaluation of the robustness of CSI deep learning models under diverse threat models (white-box, black-box/transfer, and universal perturbations) and varying degrees of attack realism. We establish a framework to compare compact temporal autoencoder models with larger deep architectures across three public datasets, quantifying how model scale, training regime, and physical constraints influence robustness. Our experiments show that smaller models, while efficient and equally performant on clean data, are markedly less robust. We further confirm that physically realizable signal-space perturbations, designed to be feasible in real wireless channels, significantly reduce attack success compared to unconstrained feature-space attacks. Adversarial training mitigates these vulnerabilities, improving mean robust accuracy with only moderate degradation in clean performance across both model classes. As wireless sensing advances towards reliable, cross-domain operation, these findings provide quantitative baselines for robustness estimation and inform design principles for secure and trustworthy human-centered sensing systems. Comments: 19 pages, 8 figures, 7 tables Subjects: Machine Learning (cs.LG) Cite as: arXiv:2511.20456 [cs.LG] (or arXiv:2511.20456v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.20456 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-15] Diffusion for Fusion: Designing Stellarators with Generative AI

链接: https://arxiv.org/abs/2511.20445
作者: Misha Padidar,Teresa Huang,Andrew Giuliani,Marina Spivak
类目: Machine Learning (cs.LG); Plasma Physics (physics.plasm-ph)
*备注:

点击查看摘要

Abstract:Stellarators are a prospective class of fusion-based power plants that confine a hot plasma with three-dimensional magnetic fields. Typically framed as a PDE-constrained optimization problem, stellarator design is a time-consuming process that can take hours to solve on a computing cluster. Developing fast methods for designing stellarators is crucial for advancing fusion research. Given the recent development of large datasets of optimized stellarators, machine learning approaches have emerged as a potential candidate. Motivated by this, we present an open inverse problem to the machine learning community: to rapidly generate high-quality stellarator designs which have a set of desirable characteristics. As a case study in the problem space, we train a conditional diffusion model on data from the QUASR database to generate quasisymmetric stellarator designs with desirable characteristics (aspect ratio and mean rotational transform). The diffusion model is applied to design stellarators with characteristics not seen during training. We provide evaluation protocols and show that many of the generated stellarators exhibit solid performance: less than 5% deviation from quasisymmetry and the target characteristics. The modest deviation from quasisymmetry highlights an opportunity to reach the sub 1% target. Beyond the case study, we share multiple promising avenues for generative modeling to advance stellarator design.

[LG-16] ght Margin-Based Generalization Bounds for Voting Classifiers over Finite Hypothesis Sets

链接: https://arxiv.org/abs/2511.20407
作者: Kasper Green Larsen,Natascha Schalburg
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We prove the first margin-based generalization bound for voting classifiers, that is asymptotically tight in the tradeoff between the size of the hypothesis set, the margin, the fraction of training points with the given margin, the number of training samples and the failure probability.

[LG-17] Model-Based Learning of Whittle indices

链接: https://arxiv.org/abs/2511.20397
作者: Joël Charles-Rebuffé,Nicolas Gast,Bruno Gaujal
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Numerical Analysis (math.NA)
*备注: 31 pages, 8 figures, submitted to TOMPECS

点击查看摘要

Abstract:We present BLINQ, a new model-based algorithm that learns the Whittle indices of an indexable, communicating and unichain Markov Decision Process (MDP). Our approach relies on building an empirical estimate of the MDP and then computing its Whittle indices using an extended version of a state-of-the-art existing algorithm. We provide a proof of convergence to the Whittle indices we want to learn as well as a bound on the time needed to learn them with arbitrary precision. Moreover, we investigate its computational complexity. Our numerical experiments suggest that BLINQ significantly outperforms existing Q-learning approaches in terms of the number of samples needed to get an accurate approximation. In addition, it has a total computational cost even lower than Q-learning for any reasonably high number of samples. These observations persist even when the Q-learning algorithms are speeded up using pre-trained neural networks to predict Q-values.

[LG-18] Identifying environmental factors associated with tetrodotoxin contamination in bivalve mollusks using eXplainable AI

链接: https://arxiv.org/abs/2511.20395
作者: M.C. Schoppema,B.H.M. van der Velden,A. Hürriyetoğlu,M.D. Klijnstra,E.J. Faassen,A. Gerssen,H.J. van der Fels-Klerx
类目: Machine Learning (cs.LG)
*备注: 18 pages, 6 figures, submitted to Nature Food

点击查看摘要

Abstract:Since 2012, tetrodotoxin (TTX) has been found in seafoods such as bivalve mollusks in temperate European waters. TTX contamination leads to food safety risks and economic losses, making early prediction of TTX contamination vital to the food industry and competent authorities. Recent studies have pointed to shallow habitats and water temperature as main drivers to TTX contamination in bivalve mollusks. However, the temporal relationships between abiotic factors, biotic factors, and TTX contamination remain unexplored. We have developed an explainable, deep learning-based model to predict TTX contamination in the Dutch Zeeland estuary. Inputs for the model were meteorological and hydrological features; output was the presence or absence of TTX contamination. Results showed that the time of sunrise, time of sunset, global radiation, water temperature, and chloride concentration contributed most to TTX contamination. Thus, the effective number of sun hours, represented by day length and global radiation, was an important driver for tetrodotoxin contamination in bivalve mollusks. To conclude, our explainable deep learning model identified the aforementioned environmental factors (number of sun hours, global radiation, water temperature, and water chloride concentration) to be associated with tetrodotoxin contamination in bivalve mollusks; making our approach a valuable tool to mitigate marine toxin risks for food industry and competent authorities. Comments: 18 pages, 6 figures, submitted to Nature Food Subjects: Machine Learning (cs.LG) Cite as: arXiv:2511.20395 [cs.LG] (or arXiv:2511.20395v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.20395 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Matthijs Cornelis Schoppema [view email] [v1] Tue, 25 Nov 2025 15:19:57 UTC (418 KB)

[LG-19] MoRE: Batch-Robust Multi-Omics Representations from Frozen Pre-trained Transformers

链接: https://arxiv.org/abs/2511.20382
作者: Audrey Pei-Hsuan Chen
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注:

点击查看摘要

Abstract:Representation learning on multi-omics data is challenging due to extreme dimensionality, modality heterogeneity, and cohort-specific batch effects. While pre-trained transformer backbones have shown broad generalization capabilities in biological sequence modeling, their application to multi-omics integration remains underexplored. We present MoRE (Multi-Omics Representation Embedding), a framework that repurposes frozen pre-trained transformers to align heterogeneous assays into a shared latent space. Unlike purely generative approaches, MoRE employs a parameter-efficient fine-tuning (PEFT) strategy, prioritizing cross-sample and cross-modality alignment over simple sequence reconstruction. Specifically, MoRE attaches lightweight, modality-specific adapters and a task-adaptive fusion layer to the frozen backbone. It optimizes a masked modeling objective jointly with supervised contrastive and batch-invariant alignment losses, yielding structure-preserving embeddings that generalize across unseen cell types and platforms. We benchmark MoRE against established baselines, including scGPT, scVI, and Harmony with scArches, evaluating integration fidelity, rare population detection, and modality transfer. Our results demonstrate that MoRE achieves competitive batch robustness and biological conservation while significantly reducing trainable parameters compared to fully fine-tuned models. This work positions MoRE as a practical step toward general-purpose omics foundation models.

[LG-20] Differentiable Attenuation Filters for Feedback Delay Networks

链接: https://arxiv.org/abs/2511.20380
作者: Ilias Ibnyahya,Joshua D. Reiss
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a novel method for designing attenuation filters in digital audio reverberation systems based on Feedback Delay Net- works (FDNs). Our approach uses Second Order Sections (SOS) of Infinite Impulse Response (IIR) filters arranged as parametric equalizers (PEQ), enabling fine control over frequency-dependent reverberation decay. Unlike traditional graphic equalizer designs, which require numerous filters per delay line, we propose a scal- able solution where the number of filters can be adjusted. The fre- quency, gain, and quality factor (Q) parameters are shared parame- ters across delay lines and only the gain is adjusted based on delay length. This design not only reduces the number of optimization parameters, but also remains fully differentiable and compatible with gradient-based learning frameworks. Leveraging principles of analog filter design, our method allows for efficient and accu- rate filter fitting using supervised learning. Our method delivers a flexible and differentiable design, achieving state-of-the-art per- formance while significantly reducing computational cost.

[LG-21] PRISM: Periodic Representation with multIscale and Similarity graph Modelling for enhanced crystal structure property prediction

链接: https://arxiv.org/abs/2511.20362
作者: Àlex Solé,Albert Mosella-Montoro,Joan Cardona,Daniel Aravena,Silvia Gómez-Coca,Eliseo Ruiz,Javier Ruiz-Hidalgo
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:

点击查看摘要

Abstract:Crystal structures are characterised by repeating atomic patterns within unit cells across three-dimensional space, posing unique challenges for graph-based representation learning. Current methods often overlook essential periodic boundary conditions and multiscale interactions inherent to crystalline structures. In this paper, we introduce PRISM, a graph neural network framework that explicitly integrates multiscale representations and periodic feature encoding by employing a set of expert modules, each specialised in encoding distinct structural and chemical aspects of periodic systems. Extensive experiments across crystal structure-based benchmarks demonstrate that PRISM improves state-of-the-art predictive accuracy, significantly enhancing crystal property prediction.

[LG-22] Extension and neural operator approximation of the electrical impedance tomography inverse map

链接: https://arxiv.org/abs/2511.20361
作者: Maarten V. de Hoop,Nikola B. Kovachki,Matti Lassas,Nicholas H. Nelsen
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Analysis of PDEs (math.AP)
*备注: 80 pages (49 main text, 20 appendix, and 11 references pages), 14 figures, 2 tables

点击查看摘要

Abstract:This paper considers the problem of noise-robust neural operator approximation for the solution map of Calderón’s inverse conductivity problem. In this continuum model of electrical impedance tomography (EIT), the boundary measurements are realized as a noisy perturbation of the Neumann-to-Dirichlet map’s integral kernel. The theoretical analysis proceeds by extending the domain of the inversion operator to a Hilbert space of kernel functions. The resulting extension shares the same stability properties as the original inverse map from kernels to conductivities, but is now amenable to neural operator approximation. Numerical experiments demonstrate that Fourier neural operators excel at reconstructing infinite-dimensional piecewise constant and lognormal conductivities in noisy setups both within and beyond the theory’s assumptions. The methodology developed in this paper for EIT exemplifies a broader strategy for addressing nonlinear inverse problems with a noise-aware operator learning framework.

[LG-23] Complexity Reduction Study Based on RD Costs Approximation for VVC Intra Partitioning

链接: https://arxiv.org/abs/2511.20349
作者: M.E.A. Kherchouche,F. Galpin,T. Dumas,F. Schnitzler,D. Menard,L. Zhang
类目: Machine Learning (cs.LG)
*备注: 2025 Data Compression Conference (DCC)

点击查看摘要

Abstract:In this paper, a complexity study is conducted for Versatile Video Codec (VVC) intra partitioning to accelerate the exhaustive search involved in Rate-Distortion Optimization (RDO) process. To address this problem, two main machine learning techniques are proposed and compared. Unlike existing methods, the proposed approaches are size independent and incorporate the Rate-Distortion (RD) costs of neighboring blocks as input features. The first method is a regression based technique that predicts normalized RD costs of a given Coding Unit (CU). As partitioning possesses the Markov property, the associated decision-making problem can be modeled as a Markov Decision Process (MDP) and solved by Reinforcement Learning (RL). The second approach is a RL agent learned from trajectories of CU decision across two depths with Deep Q-Network (DQN) algorithm. Then a pre-determined thresholds are applied for both methods to select a suitable split for the current CU.

[LG-24] MXtalTools: A Toolkit for Machine Learning on Molecular Crystals

链接: https://arxiv.org/abs/2511.20327
作者: Michael Kilgour,Mark E. Tuckerman,Jutta Rogal
类目: Machine Learning (cs.LG)
*备注: 16 pages, 11 figures

点击查看摘要

Abstract:We present MXtalTools, a flexible Python package for the data-driven modelling of molecular crystals, facilitating machine learning studies of the molecular solid state. MXtalTools comprises several classes of utilities: (1) synthesis, collation, and curation of molecule and crystal datasets, (2) integrated workflows for model training and inference, (3) crystal parameterization and representation, (4) crystal structure sampling and optimization, (5) end-to-end differentiable crystal sampling, construction and analysis. Our modular functions can be integrated into existing workflows or combined and used to build novel modelling pipelines. MXtalTools leverages CUDA acceleration to enable high-throughput crystal modelling. The Python code is available open-source on our GitHub page, with detailed documentation on ReadTheDocs.

[LG-25] Quantum-Enhanced Reinforcement Learning for Accelerating Newton-Raphson Convergence with Ising Machines: A Case Study for Power Flow Analysis

链接: https://arxiv.org/abs/2511.20237
作者: Zeynab Kaseb,Matthias Moller,Lindsay Spoor,Jerry J. Guo,Yu Xiang,Peter Palensky,Pedro P. Vergara
类目: ystems and Control (eess.SY); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: 10 pages, 9 figures, 4 tables

点击查看摘要

Abstract:The Newton-Raphson (NR) method is widely used for solving power flow (PF) equations due to its quadratic convergence. However, its performance deteriorates under poor initialization or extreme operating scenarios, e.g., high levels of renewable energy penetration. Traditional NR initialization strategies often fail to address these challenges, resulting in slow convergence or even divergence. We propose the use of reinforcement learning (RL) to optimize the initialization of NR, and introduce a novel quantum-enhanced RL environment update mechanism to mitigate the significant computational cost of evaluating power system states over a combinatorially large action space at each RL timestep by formulating the voltage adjustment task as a quadratic unconstrained binary optimization problem. Specifically, quantum/digital annealers are integrated into the RL environment update to evaluate state transitions using a problem Hamiltonian designed for PF. Results demonstrate significant improvements in convergence speed, a reduction in NR iteration counts, and enhanced robustness under different operating conditions.

[LG-26] DiCaP: Distribution-Calibrated Pseudo-labeling for Semi-Supervised Multi-Label Learning AAAI-26

链接: https://arxiv.org/abs/2511.20225
作者: Bo Han,Zhuoming Li,Xiaoyu Wang,Yaxin Hou,Hui Liu,Junhui Hou,Yuheng Jia
类目: Machine Learning (cs.LG)
*备注: Accepted by AAAI-26

点击查看摘要

Abstract:Semi-supervised multi-label learning (SSMLL) aims to address the challenge of limited labeled data in multi-label learning (MLL) by leveraging unlabeled data to improve the model’s performance. While pseudo-labeling has become a dominant strategy in SSMLL, most existing methods assign equal weights to all pseudo-labels regardless of their quality, which can amplify the impact of noisy or uncertain predictions and degrade the overall performance. In this paper, we theoretically verify that the optimal weight for a pseudo-label should reflect its correctness likelihood. Empirically, we observe that on the same dataset, the correctness likelihood distribution of unlabeled data remains stable, even as the number of labeled training samples varies. Building on this insight, we propose Distribution-Calibrated Pseudo-labeling (DiCaP), a correctness-aware framework that estimates posterior precision to calibrate pseudo-label weights. We further introduce a dual-thresholding mechanism to separate confident and ambiguous regions: confident samples are pseudo-labeled and weighted accordingly, while ambiguous ones are explored by unsupervised contrastive learning. Experiments conducted on multiple benchmark datasets verify that our method achieves consistent improvements, surpassing state-of-the-art methods by up to 4.27%.

[LG-27] Decoupling and Damping: Structurally-Regularized Gradient Matching for Multimodal Graph Condensation

链接: https://arxiv.org/abs/2511.20222
作者: Lian Shen,Zhendan Chen,Yinhui jiang,Meijia Song,Ziming Su,Juan Liu,Xiangrong Liu
类目: Machine Learning (cs.LG)
*备注: 11pages,5 figures,6 tables

点击查看摘要

Abstract:In critical web applications such as e-commerce and recommendation systems, multimodal graphs integrating rich visual and textual attributes are increasingly central, yet their large scale introduces substantial computational burdens for training Graph Neural Networks (GNNs). While Graph Condensation (GC) offers a promising solution by synthesizing smaller datasets, existing methods falter in the multimodal setting. We identify a dual challenge causing this failure: (1) conflicting gradients arising from semantic misalignments between modalities, and (2) the GNN’s message-passing architecture pathologically amplifying this gradient noise across the graph structure. To address this, we propose Structurally-Regularized Gradient Matching (SR-GM), a novel condensation framework tailored for multimodal graphs. SR-GM introduces two synergistic components: first, a gradient decoupling mechanism that resolves inter-modality conflicts at their source via orthogonal projection; and second, a structural damping regularizer that acts directly on the gradient field. By leveraging the graph’s Dirichlet energy, this regularizer transforms the topology from a noise amplifier into a stabilizing force during optimization. Extensive experiments demonstrate that SR-GM significantly improves accuracy and accelerates convergence compared to baseline methods. Ablation studies confirm that addressing both gradient conflict and structural amplification in tandem is essential for achieving superior performance. Moreover, the condensed multimodal graphs exhibit strong cross-architecture generalization and promise to accelerate applications like Neural Architecture Search. This research provides a scalable methodology for multimodal graph-based learning in resource-constrained environments.

[LG-28] Communication-Efficient Learning for Satellite Constellations

链接: https://arxiv.org/abs/2511.20220
作者: Ruxandra-Stefania Tudose,Moritz H.W. Grüss,Grace Ra Kim,Karl H. Johansson,Nicola Bastianello
类目: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Satellite constellations in low-Earth orbit are now widespread, enabling positioning, Earth imaging, and communications. In this paper we address the solution of learning problems using these satellite constellations. In particular, we focus on a federated approach, where satellites collect and locally process data, with the ground station aggregating local models. We focus on designing a novel, communication-efficient algorithm that still yields accurate trained models. To this end, we employ several mechanisms to reduce the number of communications with the ground station (local training) and their size (compression). We then propose an error feedback mechanism that enhances accuracy, which yields, as a byproduct, an algorithm-agnostic error feedback scheme that can be more broadly applied. We analyze the convergence of the resulting algorithm, and compare it with the state of the art through simulations in a realistic space scenario, showcasing superior performance.

[LG-29] In-Context Compositional Learning via Sparse Coding Transformer NEURIPS2025

链接: https://arxiv.org/abs/2511.20194
作者: Wei Chen,Jingxi Yu,Zichen Miao,Qiang Qiu
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2025

点击查看摘要

Abstract:Transformer architectures have achieved remarkable success across language, vision, and multimodal tasks, and there is growing demand for them to address in-context compositional learning tasks. In these tasks, models solve the target problems by inferring compositional rules from context examples, which are composed of basic components structured by underlying rules. However, some of these tasks remain challenging for Transformers, which are not inherently designed to handle compositional tasks and offer limited structural inductive bias. In this work, inspired by the principle of sparse coding, we propose a reformulation of the attention to enhance its capability for compositional tasks. In sparse coding, data are represented as sparse combinations of dictionary atoms with coefficients that capture their compositional rules. Specifically, we reinterpret the attention block as a mapping of inputs into outputs through projections onto two sets of learned dictionary atoms: an encoding dictionary and a decoding dictionary. The encoding dictionary decomposes the input into a set of coefficients, which represent the compositional structure of the input. To enhance structured representations, we impose sparsity on these coefficients. The sparse coefficients are then used to linearly combine the decoding dictionary atoms to generate the output. Furthermore, to assist compositional generalization tasks, we propose estimating the coefficients of the target problem as a linear combination of the coefficients obtained from the context examples. We demonstrate the effectiveness of our approach on the S-RAVEN and RAVEN datasets. For certain compositional generalization tasks, our method maintains performance even when standard Transformers fail, owing to its ability to learn and apply compositional rules.

[LG-30] Learning Subgroups with Maximum Treatment Effects without Causal Heuristics AAAI2026

链接: https://arxiv.org/abs/2511.20189
作者: Lincen Yang,Zhong Li,Matthijs van Leeuwen,Saber Salehkaleybar
类目: Machine Learning (cs.LG)
*备注: The full version (including the Appendix). Accepted at AAAI 2026

点击查看摘要

Abstract:Discovering subgroups with the maximum average treatment effect is crucial for targeted decision making in domains such as precision medicine, public policy, and education. While most prior work is formulated in the potential outcome framework, the corresponding structural causal model (SCM) for this task has been largely overlooked. In practice, two approaches dominate. The first estimates pointwise conditional treatment effects and then fits a tree on those estimates, effectively turning subgroup estimation into the harder problem of accurate pointwise estimation. The second constructs decision trees or rule sets with ad-hoc ‘causal’ heuristics, typically without rigorous justification for why a given heuristic may be used or whether such heuristics are necessary at all. We address these issues by studying the problem directly under the SCM framework. Under the assumption of a partition-based model, we show that optimal subgroup discovery reduces to recovering the data-generating models and hence a standard supervised learning problem (regression or classification). This allows us to adopt any partition-based methods to learn the subgroup from data. We instantiate the approach with CART, arguably one of the most widely used tree-based methods, to learn the subgroup with maximum treatment effect. Finally, on a large collection of synthetic and semi-synthetic datasets, we compare our method against a wide range of baselines and find that our approach, which avoids such causal heuristics, more accurately identifies subgroups with maximum treatment effect. Our source code is available at this https URL.

[LG-31] AdaCap: An Adaptive Contrastive Approach for Small-Data Neural Networks

链接: https://arxiv.org/abs/2511.20170
作者: Bruno Belucci,Karim Lounici,Katia Meziani
类目: Machine Learning (cs.LG)
*备注: Submitted to ESANN 2026

点击查看摘要

Abstract:Neural networks struggle on small tabular datasets, where tree-based models remain dominant. We introduce Adaptive Contrastive Approach (AdaCap), a training scheme that combines a permutation-based contrastive loss with a Tikhonov-based closed-form output mapping. Across 85 real-world regression datasets and multiple architectures, AdaCap yields consistent and statistically significant improvements in the small-sample regime, particularly for residual models. A meta-predictor trained on dataset characteristics (size, skewness, noise) accurately anticipates when AdaCap is beneficial. These results show that AdaCap acts as a targeted regularization mechanism, strengthening neural networks precisely where they are most fragile. All results and code are publicly available at this https URL.

[LG-32] CLIMATEAGENT : Multi-Agent Orchestration for Complex Climate Data Science Workflows

链接: https://arxiv.org/abs/2511.20109
作者: Hyeonjae Kim,Chenyue Li,Wen Deng,Mengxi Jin,Wen Huang,Mengqian Lu,Binhang Yuan
类目: Machine Learning (cs.LG)
*备注: 30 pages, 6 figures, 3 tables

点击查看摘要

Abstract:Climate science demands automated workflows to transform comprehensive questions into data-driven statements across massive, heterogeneous datasets. However, generic LLM agents and static scripting pipelines lack climate-specific context and flexibility, thus, perform poorly in practice. We present ClimateAgent, an autonomous multi-agent framework that orchestrates end-to-end climate data analytic workflows. ClimateAgent decomposes user questions into executable sub-tasks coordinated by an Orchestrate-Agent and a Plan-Agent; acquires data via specialized Data-Agents that dynamically introspect APIs to synthesize robust download scripts; and completes analysis and reporting with a Coding-Agent that generates Python code, visualizations, and a final report with a built-in self-correction loop. To enable systematic evaluation, we introduce Climate-Agent-Bench-85, a benchmark of 85 real-world tasks spanning atmospheric rivers, drought, extreme precipitation, heat waves, sea surface temperature, and tropical cyclones. On Climate-Agent-Bench-85, ClimateAgent achieves 100% task completion and a report quality score of 8.32, outperforming GitHub-Copilot (6.27) and a GPT-5 baseline (3.26). These results demonstrate that our multi-agent orchestration with dynamic API awareness and self-correcting execution substantially advances reliable, end-to-end automation for climate science analytic tasks.

[LG-33] Multivariate Forecasting of Bitcoin Volatility with Gradient Boosting: Deterministic Probabilistic and Feature Importance Perspectives

链接: https://arxiv.org/abs/2511.20105
作者: Grzegorz Dudek,Mateusz Kasprzyk,Paweł Pełka
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study investigates the application of the Light Gradient Boosting Machine (LGBM) model for both deterministic and probabilistic forecasting of Bitcoin realized volatility. Utilizing a comprehensive set of 69 predictors – encompassing market, behavioral, and macroeconomic indicators – we evaluate the performance of LGBM-based models and compare them with both econometric and machine learning baselines. For probabilistic forecasting, we explore two quantile-based approaches: direct quantile regression using the pinball loss function, and a residual simulation method that transforms point forecasts into predictive distributions. To identify the main drivers of volatility, we employ gain-based and permutation feature importance techniques, consistently highlighting the significance of trading volume, lagged volatility measures, investor attention, and market capitalization. The results demonstrate that LGBM models effectively capture the nonlinear and high-variance characteristics of cryptocurrency markets while providing interpretable insights into the underlying volatility dynamics.

[LG-34] QiMeng-CRUX: Narrowing the Gap between Natural Language and Verilog via Core Refined Understanding eXpression AAAI26

链接: https://arxiv.org/abs/2511.20099
作者: Lei Huang,Rui Zhang,Jiaming Guo,Yang Zhang,Di Huang,Shuyao Cheng,Pengwei Jin,Chongxiao Li,Zidong Du,Xing Hu,Qi Guo,Yunji Chen
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Programming Languages (cs.PL)
*备注: Accepted by the AAAI26 Conference Main Track

点击查看摘要

Abstract:Large language models (LLMs) have shown promising capabilities in hardware description language (HDL) generation. However, existing approaches often rely on free-form natural language descriptions that are often ambiguous, redundant, and unstructured, which poses significant challenges for downstream Verilog code generation. We treat hardware code generation as a complex transformation from an open-ended natural language space to a domain-specific, highly constrained target space. To bridge this gap, we introduce Core Refined Understanding eXpression (CRUX), a structured intermediate space that captures the essential semantics of user intent while organizing the expression for precise Verilog code generation. We further design a two-stage training framework, comprising Joint Expression Modeling and Dual-Space Optimization, to enhance the quality of both CRUX and Verilog code. Experiments across multiple Verilog generation benchmarks demonstrate that our model, CRUX-V, achieves state-of-the-art performance among general models, particularly under challenging design tasks. Furthermore, the CRUX space proves transferable and beneficial when used as input prompts for other code models, highlighting its effectiveness in narrowing the gap between free-form natural language descriptions and precise Verilog generation.

[LG-35] SOMBRL: Scalable and Optimistic Model-Based RL

链接: https://arxiv.org/abs/2511.20066
作者: Bhavya Sukhija,Lenart Treven,Carmelo Sferrazza,Florian Dörfler,Pieter Abbeel,Andreas Krause
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We address the challenge of efficient exploration in model-based reinforcement learning (MBRL), where the system dynamics are unknown and the RL agent must learn directly from online interactions. We propose Scalable and Optimistic MBRL (SOMBRL), an approach based on the principle of optimism in the face of uncertainty. SOMBRL learns an uncertainty-aware dynamics model and greedily maximizes a weighted sum of the extrinsic reward and the agent’s epistemic uncertainty. SOMBRL is compatible with any policy optimizers or planners, and under common regularity assumptions on the system, we show that SOMBRL has sublinear regret for nonlinear dynamics in the (i) finite-horizon, (ii) discounted infinite-horizon, and (iii) non-episodic settings. Additionally, SOMBRL offers a flexible and scalable solution for principled exploration. We evaluate SOMBRL on state-based and visual-control environments, where it displays strong performance across all tasks and baselines. We also evaluate SOMBRL on a dynamic RC car hardware and show SOMBRL outperforms the state-of-the-art, illustrating the benefits of principled exploration for MBRL.

[LG-36] RED-F: Reconstruction-Elimination based Dual-stream Contrastive Forecasting for Multivariate Time Series Anomaly Prediction

链接: https://arxiv.org/abs/2511.20044
作者: PengYu Chen,Xiaohou Shi,Yuan Chang,Yan Sun,Sajal K. Das
类目: Machine Learning (cs.LG)
*备注: 13 pages, 12 figures

点击查看摘要

Abstract:The proactive prediction of anomalies (AP) in mul- tivariate time series (MTS) is a critical challenge to ensure system dependability. The difficulty lies in identifying subtle anomaly precursors concealed within normal signals. However, existing unsupervised methods, trained exclusively on normal data, demonstrate a fundamental propensity to reconstruct normal patterns. Consequently, when confronted with weak precursors, their predictions are dominated by the normal pattern, submerging the very signal required for prediction. To contend with the limitation, we propose RED-F, a Reconstruction- Elimination based Dual-stream Contrastive Forecasting frame- work, comprising the Reconstruction-Elimination Model (REM) and the Dual-stream Contrastive Forecasting Model (DFM). The REM utilizes a hybrid time-frequency mechanism to mitigate the precursor, generating a purified, normal-pattern baseline. The DFM then receives this purified baseline and the original sequence which retains the precursor as parallel inputs. At the core of our framework, RED-F employs a contrastive forecast that transforms the difficult task of absolute signal detection into a simpler, more robust task of relative trajectory comparison by computing the divergence between these two predictive streams. This contrastive mechanism serves to amplify the faint precursor signal. Furthermore, the DFM is trained with a novel Multi-Series Prediction (MSP) objective, which leverages distant future con- text to enhance its predictive sensitivity. Extensive experiments on six real-world datasets demonstrate the superior capability of RED-F in anomaly prediction tasks.

[LG-37] Softmax Transformers are Turing-Complete

链接: https://arxiv.org/abs/2511.20038
作者: Hongjian Jiang,Michael Hahn,Georg Zetzsche,Anthony Widjaja Lin
类目: Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注:

点击查看摘要

Abstract:Hard attention Chain-of-Thought (CoT) transformers are known to be Turing-complete. However, it is an open problem whether softmax attention Chain-of-Thought (CoT) transformers are Turing-complete. In this paper, we prove a stronger result that length-generalizable softmax CoT transformers are Turing-complete. More precisely, our Turing-completeness proof goes via the CoT extension of the Counting RASP (C-RASP), which correspond to softmax CoT transformers that admit length generalization. We prove Turing-completeness for CoT C-RASP with causal masking over a unary alphabet (more generally, for letter-bounded languages). While we show this is not Turing-complete for arbitrary languages, we prove that its extension with relative positional encoding is Turing-complete for arbitrary languages. We empirically validate our theory by training transformers for languages requiring complex (non-linear) arithmetic reasoning.

[LG-38] Cross-Contrastive Clustering for Multimodal Attributed Graphs with Dual Graph Filtering KDD2026

链接: https://arxiv.org/abs/2511.20030
作者: Haoran Zheng,Renchi Yang,Hongtao Wang,Jianliang Xu
类目: Machine Learning (cs.LG)
*备注: Accepted by SIGKDD 2026. The code is available at this https URL

点击查看摘要

Abstract:Multimodal Attributed Graphs (MMAGs) are an expressive data model for representing the complex interconnections among entities that associate attributes from multiple data modalities (text, images, etc.). Clustering over such data finds numerous practical applications in real scenarios, including social community detection, medical data analytics, etc. However, as revealed by our empirical studies, existing multi-view clustering solutions largely rely on the high correlation between attributes across various views and overlook the unique characteristics (e.g., low modality-wise correlation and intense feature-wise noise) of multimodal attributes output by large pre-trained language and vision models in MMAGs, leading to suboptimal clustering performance. Inspired by foregoing empirical observations and our theoretical analyses with graph signal processing, we propose the Dual Graph Filtering (DGF) scheme, which innovatively incorporates a feature-wise denoising component into node representation learning, thereby effectively overcoming the limitations of traditional graph filters adopted in the extant multi-view graph clustering approaches. On top of that, DGF includes a tri-cross contrastive training strategy that employs instance-level contrastive learning across modalities, neighborhoods, and communities for learning robust and discriminative node representations. Our comprehensive experiments on eight benchmark MMAG datasets exhibit that DGF is able to outperform a wide range of state-of-the-art baselines consistently and significantly in terms of clustering quality measured against ground-truth labels. Comments: Accepted by SIGKDD 2026. The code is available at this https URL Subjects: Machine Learning (cs.LG) Cite as: arXiv:2511.20030 [cs.LG] (or arXiv:2511.20030v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.20030 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-39] RadioDiff: Physics-Informed Diffusion Model for Indoor Radio Map Construction and Localization

链接: https://arxiv.org/abs/2511.20015
作者: Xiucheng Wang,Tingwei Yuan,Yang Cao,Nan Cheng,Ruijin Sun,Weihua Zhuang
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:Radio maps (RMs) serve as environment-aware electromagnetic (EM) representations that connect scenario geometry and material properties to the spatial distribution of signal strength, enabling localization without costly in-situ measurements. However, constructing high-fidelity indoor RMs remains challenging due to the prohibitive latency of EM solvers and the limitations of learning-based methods, which often rely on sparse measurements or assumptions of homogeneous material, which are misaligned with the heterogeneous and multipath-rich nature of indoor environments. To overcome these challenges, we propose iRadioDiff, a sampling-free diffusion-based framework for indoor RM construction. iRadioDiff is conditioned on access point (AP) positions, and physics-informed prompt encoded by material reflection and transmission coefficients. It further incorporates multipath-critical priors, including diffraction points, strong transmission boundaries, and line-of-sight (LoS) contours, to guide the generative process via conditional channels and boundary-weighted objectives. This design enables accurate modeling of nonstationary field discontinuities and efficient construction of physically consistent RMs. Experiments demonstrate that iRadioDiff achieves state-of-the-art performance in indoor RM construction and received signal strength based indoor localization, which offers effective generalization across layouts and material configurations. Code is available at this https URL.

[LG-40] REWA: Witness-Overlap Theory – Foundations for Composable Binary Similarity Systems

链接: https://arxiv.org/abs/2511.19998
作者: Nikit Phadke
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:REWA introduces a general theory of similarity based on witness-overlap structures. We show that whenever similarity between concepts can be expressed as monotone witness overlap – whether arising from graph neighborhoods, causal relations, temporal structure, topological features, symbolic patterns, or embedding-based neighborhoods – it admits a reduction to compact encodings with provable ranking preservation guarantees. REWA systems consist of: (1) finite witness sets W(v) , (2) semi-random bit assignments generated from each witness, and (3) monotonicity of expected similarity in the overlap \Delta(u, v) = |W(u) \cap W(v)| . We prove that under an overlap-gap condition on the final witness sets – independent of how they were constructed – top- k rankings are preserved using m = O(\log(|V|/\delta)) bits. The witness-set formulation is compositional: any sequence of structural, temporal, causal, topological, information-theoretic, or learned transformations can be combined into pipelines that terminate in discrete witness sets. The theory applies to the final witness overlap, enabling modular construction of similarity systems from reusable primitives. This yields a vast design space: millions of composable similarity definitions inherit logarithmic encoding complexity. REWA subsumes and unifies Bloom filters, minhash, LSH bitmaps, random projections, sketches, and hierarchical filters as special cases. It provides a principled foundation for similarity systems whose behavior is governed by witness overlap rather than hash-function engineering. This manuscript presents the axioms, the main reducibility theorem, complete proofs with explicit constants, and a detailed discussion of compositional design, limitations, and future extensions including multi-bit encodings, weighted witnesses, and non-set representations.

[LG-41] RankOOD - Class Ranking-based Out-of-Distribution Detection

链接: https://arxiv.org/abs/2511.19996
作者: Dishanika Denipitiyage,Naveen Karunanayake,Suranga Seneviratne,Sanjay Chawla
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose RankOOD, a rank-based Out-of-Distribution (OOD) detection approach based on training a model with the Placket-Luce loss, which is now extensively used for preference alignment tasks in foundational models. Our approach is based on the insight that with a deep learning model trained using the Cross Entropy Loss, in-distribution (ID) class prediction induces a ranking pattern for each ID class prediction. The RankOOD framework formalizes the insight by first extracting a rank list for each class using an initial classifier and then uses another round of training with the Plackett-Luce loss, where the class rank, a fixed permutation for each class, is the predicted variable. An OOD example may get assigned with high probability to an ID example, but the probability of it respecting the ranking classification is likely to be small. RankOOD, achieves SOTA performance on the near-ODD TinyImageNet evaluation benchmark, reducing FPR95 by 4.3%.

[LG-42] Rethinking Message Passing Neural Networks with Diffusion Distance-guided Stress Majorization KDD2026

链接: https://arxiv.org/abs/2511.19984
作者: Haoran Zheng,Renchi Yang,Yubo Zhou,Jianliang Xu
类目: Machine Learning (cs.LG)
*备注: Accepted by SIGKDD 2026. The code is available at this https URL

点击查看摘要

Abstract:Message passing neural networks (MPNNs) have emerged as go-to models for learning on graph-structured data in the past decade. Despite their effectiveness, most of such models still incur severe issues such as over-smoothing and -correlation, due to their underlying objective of minimizing the Dirichlet energy and the derived neighborhood aggregation operations. In this paper, we propose the DDSM, a new MPNN model built on an optimization framework that includes the stress majorization and orthogonal regularization for overcoming the above issues. Further, we introduce the diffusion distances for nodes into the framework to guide the new message passing operations and develop efficient algorithms for distance approximations, both backed by rigorous theoretical analyses. Our comprehensive experiments showcase that DDSM consistently and considerably outperforms 15 strong baselines on both homophilic and heterophilic graphs.

[LG-43] Operator Learning at Machine Precision

链接: https://arxiv.org/abs/2511.19980
作者: Aras Bacho,Aleksei G. Sorokin,Xianjin Yang,Théo Bourdais,Edoardo Calvello,Matthieu Darcy,Alexander Hsu,Bamdad Hosseini,Houman Owhadi
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Neural operator learning methods have garnered significant attention in scientific computing for their ability to approximate infinite-dimensional operators. However, increasing their complexity often fails to substantially improve their accuracy, leaving them on par with much simpler approaches such as kernel methods and more traditional reduced-order models. In this article, we set out to address this shortcoming and introduce CHONKNORIS (Cholesky Newton–Kantorovich Neural Operator Residual Iterative System), an operator learning paradigm that can achieve machine precision. CHONKNORIS draws on numerical analysis: many nonlinear forward and inverse PDE problems are solvable by Newton-type methods. Rather than regressing the solution operator itself, our method regresses the Cholesky factors of the elliptic operator associated with Tikhonov-regularized Newton–Kantorovich updates. The resulting unrolled iteration yields a neural architecture whose machine-precision behavior follows from achieving a contractive map, requiring far lower accuracy than end-to-end approximation of the solution operator. We benchmark CHONKNORIS on a range of nonlinear forward and inverse problems, including a nonlinear elliptic equation, Burgers’ equation, a nonlinear Darcy flow problem, the Calderón problem, an inverse wave scattering problem, and a problem from seismic imaging. We also present theoretical guarantees for the convergence of CHONKNORIS in terms of the accuracy of the emulated Cholesky factors. Additionally, we introduce a foundation model variant, FONKNORIS (Foundation Newton–Kantorovich Neural Operator Residual Iterative System), which aggregates multiple pre-trained CHONKNORIS experts for diverse PDEs to emulate the solution map of a novel nonlinear PDE. Our FONKNORIS model is able to accurately solve unseen nonlinear PDEs such as the Klein–Gordon and Sine–Gordon equations.

[LG-44] Rethinking Semi-Supervised Node Classification with Self-Supervised Graph Clustering

链接: https://arxiv.org/abs/2511.19976
作者: Songbo Wang,Renchi Yang,Yurui Lai,Xiaoyang Lin,Tsz Nam Chan
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 14 pages

点击查看摘要

Abstract:The emergence of graph neural networks (GNNs) has offered a powerful tool for semi-supervised node classification tasks. Subsequent studies have achieved further improvements through refining the message passing schemes in GNN models or exploiting various data augmentation techniques to mitigate limited supervision. In real graphs, nodes often tend to form tightly-knit communities/clusters, which embody abundant signals for compensating label scarcity in semi-supervised node classification but are not explored in prior methods. Inspired by this, this paper presents NCGC that integrates self-supervised graph clustering and semi-supervised classification into a unified framework. Firstly, we theoretically unify the optimization objectives of GNNs and spectral graph clustering, and based on that, develop soft orthogonal GNNs (SOGNs) that leverage a refined message passing paradigm to generate node representations for both classification and clustering. On top of that, NCGC includes a self-supervised graph clustering module that enables the training of SOGNs for learning representations of unlabeled nodes in a self-supervised manner. Particularly, this component comprises two non-trivial clustering objectives and a Sinkhorn-Knopp normalization that transforms predicted cluster assignments into balanced soft pseudo-labels. Through combining the foregoing clustering module with the classification model using a multi-task objective containing the supervised classification loss on labeled data and self-supervised clustering loss on unlabeled data, NCGC promotes synergy between them and achieves enhanced model capacity. Our extensive experiments showcase that the proposed NCGC framework consistently and considerably outperforms popular GNN models and recent baselines for semi-supervised node classification on seven real graphs, when working with various classic GNN backbones. Comments: 14 pages Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI) Cite as: arXiv:2511.19976 [cs.LG] (or arXiv:2511.19976v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.19976 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-45] Strag glers Can Contribute More: Uncertainty-Aware Distillation for Asynchronous Federated Learning

链接: https://arxiv.org/abs/2511.19966
作者: Yujia Wang,Fenglong Ma,Jinghui Chen
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 28 pages

点击查看摘要

Abstract:Asynchronous federated learning (FL) has recently gained attention for its enhanced efficiency and scalability, enabling local clients to send model updates to the server at their own pace without waiting for slower participants. However, such a design encounters significant challenges, such as the risk of outdated updates from straggler clients degrading the overall model performance and the potential bias introduced by faster clients dominating the learning process, especially under heterogeneous data distributions. Existing methods typically address only one of these issues, creating a conflict where mitigating the impact of outdated updates can exacerbate the bias created by faster clients, and vice versa. To address these challenges, we propose FedEcho, a novel framework that incorporates uncertainty-aware distillation to enhance the asynchronous FL performances under large asynchronous delays and data heterogeneity. Specifically, uncertainty-aware distillation enables the server to assess the reliability of predictions made by straggler clients, dynamically adjusting the influence of these predictions based on their estimated uncertainty. By prioritizing more certain predictions while still leveraging the diverse information from all clients, FedEcho effectively mitigates the negative impacts of outdated updates and data heterogeneity. Through extensive experiments, we demonstrate that FedEcho consistently outperforms existing asynchronous federated learning baselines, achieving robust performance without requiring access to private client data.

[LG-46] ParaBlock: Communication-Computation Parallel Block Coordinate Federated Learning for Large Language Models

链接: https://arxiv.org/abs/2511.19959
作者: Yujia Wang,Yuanpu Cao,Jinghui Chen
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注: 32 pages, 2 figures

点击查看摘要

Abstract:Federated learning (FL) has been extensively studied as a privacy-preserving training paradigm. Recently, federated block coordinate descent scheme has become a popular option in training large-scale models, as it allows clients to train only a subset of the model locally instead of the entire model. However, in the era of large language models (LLMs), even a single block can contain a significant number of parameters, posing substantial communication latency, particularly for resource-constrained clients. To address this challenge in federated training/fine-tuning LLMs, we propose ParaBlock, a novel approach that establishes two parallel threads for communication and computation to enhance communication efficiency. We theoretically prove that the proposed ParaBlock achieves the same convergence rate as the standard federated block coordinate descent methods. Empirical evaluations on fine-tuning LLMs on general instruction following and mathematical reasoning confirm that ParaBlock not only maintains strong performance but also significantly improves communication efficiency.

[LG-47] Prompt Fairness: Sub-group Disparities in LLM s

链接: https://arxiv.org/abs/2511.19956
作者: Meiyu Zhong,Noel Teku,Ravi Tandon
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs), though shown to be effective in many applications, can vary significantly in their response quality. In this paper, we investigate this problem of prompt fairness: specifically, the phrasing of a prompt by different users/styles, despite the same question being asked in principle, may elicit different responses from an LLM. To quantify this disparity, we propose to use information-theoretic metrics that can capture two dimensions of bias: subgroup sensitivity, the variability of responses within a subgroup and cross group consistency, the variability of responses across subgroups. Our analysis reveals that certain subgroups exhibit both higher internal variability and greater divergence from others. Our empirical analysis reveals that certain demographic sub groups experience both higher internal variability and greater divergence from others, indicating structural inequities in model behavior. To mitigate these disparities, we propose practical interventions, including majority voting across multiple generations and prompt neutralization, which together improve response stability and enhance fairness across user populations. In the experiments, we observe clear prompt sensitivity disparities across demographic subgroups: before mitigation, cross-group divergence values reach 0.28 and typically fall in the from 0.14 to 0.22 range. After applying our neutralization and multi generation strategy, these divergences consistently decrease, with the largest gap reduced to 0.22 and many distances falling to 0.17 or below, indicating more stable and consistent outputs across subgroups.

[LG-48] Hierarchical Spatio-Temporal Attention Network with Adaptive Risk-Aware Decision for Forward Collision Warning in Complex Scenarios

链接: https://arxiv.org/abs/2511.19952
作者: Haoran Hu,Junren Shi,Shuo Jiang,Kun Cheng,Xia Yang,Changhao Piao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Forward Collision Warning systems are crucial for vehicle safety and autonomous driving, yet current methods often fail to balance precise multi-agent interaction modeling with real-time decision adaptability, evidenced by the high computational cost for edge deployment and the unreliability stemming from simplified interaction this http URL overcome these dual challenges-computational complexity and modeling insufficiency-along with the high false alarm rates of traditional static-threshold warnings, this paper introduces an integrated FCW framework that pairs a Hierarchical Spatio-Temporal Attention Network with a Dynamic Risk Threshold Adjustment algorithm. HSTAN employs a decoupled architecture (Graph Attention Network for spatial, cascaded GRU with self-attention for temporal) to achieve superior performance and efficiency, requiring only 12.3 ms inference time (73% faster than Transformer methods) and reducing the Average Displacement Error (ADE) to 0.73m (42.2% better than Social_LSTM) on the NGSIM dataset. Furthermore, Conformalized Quantile Regression enhances reliability by generating prediction intervals (91.3% coverage at 90% confidence), which the DTRA module then converts into timely warnings via a physics-informed risk potential function and an adaptive threshold mechanism inspired by statistical process this http URL across multi-scenario datasets, the complete system demonstrates high efficacy, achieving an F1 score of 0.912, a low false alarm rate of 8.2%, and an ample warning lead time of 2.8 seconds, validating the framework’s superior performance and practical deployment feasibility in complex environments.

[LG-49] Differential Smoothing Mitigates Sharpening and Improves LLM Reasoning

链接: https://arxiv.org/abs/2511.19942
作者: Jingchu Gai,Guanning Zeng,Huaqing Zhang,Aditi Raghunathan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:It is widely recognized that reinforcement learning (RL) fine-tuning of large language models often leads to \textitdiversity collapse, where outputs lack variety. Prior work has proposed a range of heuristics to counteract this effect, but these methods are ad hoc: they frequently trade off correctness for diversity, their effectiveness varies across tasks, and in some cases they even contradict one another. In this work, we place these observations on a rigorous foundation. We first provide a formal proof of why RL fine-tuning exhibits diversity collapse via a selection and reinforcement bias. Next, we make a key observation that any reward modification to address diversity collapse only needs to be applied on the correct trajectories. Building directly on this analysis, we introduce a principled method – \textitdifferential smoothing – that provably improves both correctness and diversity, outperforming vanilla RL as well as widely used entropy-based heuristics. Our theory precisely characterizes when existing heuristics help and why they fail, while showing that differential smoothing is universally superior. Extensive experiments with models from 1B to 7B parameters, across domains including CountDown and real-world mathematical reasoning, demonstrate consistent gains. Differential smoothing improves both Pass@1 and Pass@k, with up to 6.7% improvements on AIME24 dataset.

[LG-50] Adaptivity and Universality: Problem-dependent Universal Regret for Online Convex Optimization

链接: https://arxiv.org/abs/2511.19937
作者: Peng Zhao,Yu-Hu Yan,Hang Yu,Zhi-Hua Zhou
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Universal online learning aims to achieve optimal regret guarantees without requiring prior knowledge of the curvature of online functions. Existing methods have established minimax-optimal regret bounds for universal online learning, where a single algorithm can simultaneously attain \mathcalO(\sqrtT) regret for convex functions, \mathcalO(d \log T) for exp-concave functions, and \mathcalO(\log T) for strongly convex functions, where T is the number of rounds and d is the dimension of the feasible domain. However, these methods still lack problem-dependent adaptivity. In particular, no universal method provides regret bounds that scale with the gradient variation V_T , a key quantity that plays a crucial role in applications such as stochastic optimization and fast-rate convergence in games. In this work, we introduce UniGrad, a novel approach that achieves both universality and adaptivity, with two distinct realizations: this http URL and this http URL. Both methods achieve universal regret guarantees that adapt to gradient variation, simultaneously attaining \mathcalO(\log V_T) regret for strongly convex functions and \mathcalO(d \log V_T) regret for exp-concave functions. For convex functions, the regret bounds differ: this http URL achieves an \mathcalO(\sqrtV_T \log V_T) bound while preserving the RVU property that is crucial for fast convergence in online games, whereas this http URL achieves the optimal \mathcalO(\sqrtV_T) regret bound through a novel design. Both methods employ a meta algorithm with \mathcalO(\log T) base learners, which naturally requires \mathcalO(\log T) gradient queries per round. To enhance computational efficiency, we introduce UniGrad++, which retains the regret while reducing the gradient query to just 1 per round via surrogate optimization. We further provide various implications.

[LG-51] Designing Reputation Systems for Manufacturing Data Trading Markets: A Multi-Agent Evaluation with Q-Learning and IRL-Estimated Utilities

链接: https://arxiv.org/abs/2511.19930
作者: Kenta Yamamoto,Teruaki Hayashi
类目: Computer Science and Game Theory (cs.GT); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 10 pages, 10 figures

点击查看摘要

Abstract:Recent advances in machine learning and big data analytics have intensified the demand for high-quality cross-domain datasets and accelerated the growth of data trading across organizations. As data become increasingly recognized as an economic asset, data marketplaces have emerged as a key infrastructure for data-driven innovation. However, unlike mature product or service markets, data-trading environments remain nascent and suffer from pronounced information asymmetry. Buyers cannot verify the content or quality before purchasing data, making trust and quality assurance central challenges. To address these issues, this study develops a multi-agent data-market simulator that models participant behavior and evaluates the institutional mechanisms for trust formation. Focusing on the manufacturing sector, where initiatives such as GAIA-X and Catena-X are advancing, the simulator integrates reinforcement learning (RL) for adaptive agent behavior and inverse reinforcement learning (IRL) to estimate utility functions from empirical behavioral data. Using the simulator, we examine the market-level effects of five representative reputation systems-Time-decay, Bayesian-beta, PageRank, PowerTrust, and PeerTrust-and found that PeerTrust achieved the strongest alignment between data price and quality, while preventing monopolistic dominance. Building on these results, we develop a hybrid reputation mechanism that integrates the strengths of existing systems to achieve improved price-quality consistency and overall market stability. This study extends simulation-based data-market analysis by incorporating trust and reputation as endogenous mechanisms and offering methodological and institutional insights into the design of reliable and efficient data ecosystems.

[LG-52] Frailty-Aware Transformer for Recurrent Survival Modeling of Driver Retention in Ride-Hailing Platforms KDD

链接: https://arxiv.org/abs/2511.19893
作者: Shuoyan Xu,Yu Zhang,Eric J. Miller
类目: Machine Learning (cs.LG)
*备注: 13 pages, 6 figures, under review, Accepted by KDD Workshop 2025

点击查看摘要

Abstract:Ride-hailing platforms are characterized by high-frequency, behavior-driven environments. Although survival analysis has been applied to recurrent events in other domains, its use in modeling ride-hailing driver behavior remains largely unexplored. This study formulates idle behavior as a recurrent survival process using large-scale platform data and proposes a Transformer-based framework that captures long-term temporal dependencies with causal masking and incorporates driver-specific embeddings to model latent heterogeneity. Results on Toronto ride-hailing data demonstrate that the proposed Frailty-Aware Cox Transformer (FACT) achieves the highest time-dependent C-indices and lowest Brier Scores, outperforming classical and deep learning survival models. This approach enables more accurate risk estimation, supports platform retention strategies, and provides policy-relevant insights.

[LG-53] Complex Instruction Following with Diverse Style Policies in Football Games AAAI2026

链接: https://arxiv.org/abs/2511.19885
作者: Chenglu Sun,Shuo Shen,Haonan Hu,Wei Zhou,Chen Chen
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注: 21 pages, 13 figures, accepted by AAAI2026

点击查看摘要

Abstract:Despite advancements in language-controlled reinforcement learning (LC-RL) for basic domains and straightforward commands (e.g., object manipulation and navigation), effectively extending LC-RL to comprehend and execute high-level or abstract instructions in complex, multi-agent environments, such as football games, remains a significant challenge. To address this gap, we introduce Language-Controlled Diverse Style Policies (LCDSP), a novel LC-RL paradigm specifically designed for complex scenarios. LCDSP comprises two key components: a Diverse Style Training (DST) method and a Style Interpreter (SI). The DST method efficiently trains a single policy capable of exhibiting a wide range of diverse behaviors by modulating agent actions through style parameters (SP). The SI is designed to accurately and rapidly translate high-level language instructions into these corresponding SP. Through extensive experiments in a complex 5v5 football environment, we demonstrate that LCDSP effectively comprehends abstract tactical instructions and accurately executes the desired diverse behavioral styles, showcasing its potential for complex, real-world applications.

[LG-54] Accelerating Wireless Distributed Learning via Hybrid Split and Federated Learning Optimization

链接: https://arxiv.org/abs/2511.19851
作者: Kun Guo,Xuefei Li,Xijun Wang,Howard H. Yang,Wei Feng,Tony Q. S. Quek
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Federated learning (FL) and split learning (SL) are two effective distributed learning paradigms in wireless networks, enabling collaborative model training across mobile devices without sharing raw data. While FL supports low-latency parallel training, it may converge to less accurate model. In contrast, SL achieves higher accuracy through sequential training but suffers from increased delay. To leverage the advantages of both, hybrid split and federated learning (HSFL) allows some devices to operate in FL mode and others in SL mode. This paper aims to accelerate HSFL by addressing three key questions: 1) How does learning mode selection affect overall learning performance? 2) How does it interact with batch size? 3) How can these hyperparameters be jointly optimized alongside communication and computational resources to reduce overall learning delay? We first analyze convergence, revealing the interplay between learning mode and batch size. Next, we formulate a delay minimization problem and propose a two-stage solution: a block coordinate descent method for a relaxed problem to obtain a locally optimal solution, followed by a rounding algorithm to recover integer batch sizes with near-optimal performance. Experimental results demonstrate that our approach significantly accelerates convergence to the target accuracy compared to existing methods.

[LG-55] SX-GeoTree: Self-eXplaining Geospatial Regression Tree Incorporating the Spatial Similarity of Feature Attributions

链接: https://arxiv.org/abs/2511.19845
作者: Chaogui Kang,Lijian Luo,Qingfeng Guan,Yu Liu
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Machine Learning (stat.ML)
*备注: 41 pages, 7 figures, 12 tables

点击查看摘要

Abstract:Decision trees remain central for tabular prediction but struggle with (i) capturing spatial dependence and (ii) producing locally stable (robust) explanations. We present SX-GeoTree, a self-explaining geospatial regression tree that integrates three coupled objectives during recursive splitting: impurity reduction (MSE), spatial residual control (global Moran’s I), and explanation robustness via modularity maximization on a consensus similarity network formed from (a) geographically weighted regression (GWR) coefficient distances (stimulus-response similarity) and (b) SHAP attribution distances (explanatory similarity). We recast local Lipschitz continuity of feature attributions as a network community preservation problem, enabling scalable enforcement of spatially coherent explanations without per-sample neighborhood searches. Experiments on two exemplar tasks (county-level GDP in Fujian, n=83; point-wise housing prices in Seattle, n=21,613) show SX-GeoTree maintains competitive predictive accuracy (within 0.01 R^2 of decision trees) while improving residual spatial evenness and doubling attribution consensus (modularity: Fujian 0.19 vs 0.09; Seattle 0.10 vs 0.05). Ablation confirms Moran’s I and modularity terms are complementary; removing either degrades both spatial residual structure and explanation stability. The framework demonstrates how spatial similarity - extended beyond geometric proximity through GWR-derived local relationships - can be embedded in interpretable models, advancing trustworthy geospatial machine learning and offering a transferable template for domain-aware explainability.

[LG-56] Provably Outlier-resistant Semi-parametric Regression for Transferable Calibration of Low-cost Air-quality Sensors

链接: https://arxiv.org/abs/2511.19810
作者: Divyansh Chaurasia,Manoj Daram,Roshan Kumar,Nihal Thukarama Rao,Vipul Sangode,Pranjal Srivastava,Avnish Tripathi,Shoubhik Chakraborty,Akanksha,Ambasht Kumar,Davender Sethi,Sachchida Nand Tripathi,Purushottam Kar
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 20 pages, 14 figures, under peer review

点击查看摘要

Abstract:We present a case study for the calibration of Low-cost air-quality (LCAQ) CO sensors from one of the largest multi-site-multi-season-multi-sensor-multi-pollutant mobile air-quality monitoring network deployments in India. LCAQ sensors have been shown to play a critical role in the establishment of dense, expansive air-quality monitoring networks and combating elevated pollution levels. The calibration of LCAQ sensors against regulatory-grade monitors is an expensive, laborious and time-consuming process, especially when a large number of sensors are to be deployed in a geographically diverse layout. In this work, we present the RESPIRE technique to calibrate LCAQ sensors to detect ambient CO (Carbon Monoxide) levels. RESPIRE offers specific advantages over baseline calibration methods popular in literature, such as improved prediction in cross-site, cross-season, and cross-sensor settings. RESPIRE offers a training algorithm that is provably resistant to outliers and an explainable model with the ability to flag instances of model overfitting. Empirical results are presented based on data collected during an extensive deployment spanning four sites, two seasons and six sensor packages. RESPIRE code is available at this https URL.

[LG-57] Scalable Data Attribution via Forward-Only Test-Time Inference

链接: https://arxiv.org/abs/2511.19803
作者: Sibo Ma,Julian Nyarko
类目: Machine Learning (cs.LG)
*备注: 8 pages. Work in progress

点击查看摘要

Abstract:Data attribution seeks to trace model behavior back to the training examples that shaped it, enabling debugging, auditing, and data valuation at scale. Classical influence-function methods offer a principled foundation but remain impractical for modern networks because they require expensive backpropagation or Hessian inversion at inference. We propose a data attribution method that preserves the same first-order counterfactual target while eliminating per-query backward passes. Our approach simulates each training example’s parameter influence through short-horizon gradient propagation during training and later reads out attributions for any query using only forward evaluations. This design shifts computation from inference to simulation, reflecting real deployment regimes where a model may serve billions of user queries but originate from a fixed, finite set of data sources (for example, a large language model trained on diverse corpora while compensating a specific publisher such as the New York Times). Empirically, on standard MLP benchmarks, our estimator matches or surpasses state-of-the-art baselines such as TRAK on standard attribution metrics (LOO and LDS) while offering orders-of-magnitude lower inference cost. By combining influence-function fidelity with first-order scalability, our method provides a theoretical framework for practical, real-time data attribution in large pretrained models.

[LG-58] When 1% Is Not Enough: A Paired Bootstrap Protocol for Evaluating Small Improvements

链接: https://arxiv.org/abs/2511.19794
作者: Wenzhang Du
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 13 pages, 3 figures

点击查看摘要

Abstract:Recent machine learning papers often report 1-2 percentage point improvements from a single run on a benchmark. These gains are highly sensitive to random seeds, data ordering, and implementation details, yet are rarely accompanied by uncertainty estimates or significance tests. It is therefore unclear when a reported +1-2% reflects a real algorithmic advance versus noise. We revisit this problem under realistic compute budgets, where only a few runs are affordable. We propose a simple, PC-friendly evaluation protocol based on paired multi-seed runs, bias-corrected and accelerated (BCa) bootstrap confidence intervals, and a sign-flip permutation test on per-seed deltas. The protocol is intentionally conservative and is meant as a guardrail against over-claiming. We instantiate it on CIFAR-10, CIFAR-10N, and AG News using synthetic no-improvement, small-gain, and medium-gain scenarios. Single runs and unpaired t-tests often suggest significant gains for 0.6-2.0 point improvements, especially on text. With only three seeds, our paired protocol never declares significance in these settings. We argue that such conservative evaluation is a safer default for small gains under tight budgets. Comments: 13 pages, 3 figures Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML) MSC classes: 62F40, 62G09 ACMclasses: I.5.2; G.3; H.3.4 Cite as: arXiv:2511.19794 [cs.LG] (or arXiv:2511.19794v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.19794 Focus to learn more arXiv-issued DOI via DataCite

[LG-59] DISCO: A Browser-Based Privacy-Preserving Framework for Distributed Collaborative Learning

链接: https://arxiv.org/abs/2511.19750
作者: Julien T. T. Vignoud,Valérian Rousset,Hugo El Guedj,Ignacio Aleman,Walid Bennaceur,Batuhan Faik Derinbay,Eduard Ďurech,Damien Gengler,Lucas Giordano,Felix Grimberg,Franziska Lippoldt,Christina Kopidaki,Jiafan Liu,Lauris Lopata,Nathan Maire,Paul Mansat,Martin Milenkoski,Emmanuel Omont,Güneş Özgün,Mina Petrović,Francesco Posa,Morgan Ridel,Giorgio Savini,Marcel Torne,Lucas Trognon,Alyssa Unell,Olena Zavertiaieva,Sai Praneeth Karimireddy,Tahseen Rabbani,Mary-Anne Hartley,Martin Jaggi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data is often impractical to share for a range of well considered reasons, such as concerns over privacy, intellectual property, and legal constraints. This not only fragments the statistical power of predictive models, but creates an accessibility bias, where accuracy becomes inequitably distributed to those who have the resources to overcome these concerns. We present DISCO: an open-source DIStributed COllaborative learning platform accessible to non-technical users, offering a means to collaboratively build machine learning models without sharing any original data or requiring any programming knowledge. DISCO’s web application trains models locally directly in the browser, making our tool cross-platform out-of-the-box, including smartphones. The modular design of \disco offers choices between federated and decentralized paradigms, various levels of privacy guarantees and several approaches to weight aggregation strategies that allow for model personalization and bias resilience in the collaborative training. Code repository is available at this https URL and a showcase web interface at this https URL

[LG-60] CAMformer: Associative Memory is All You Need

链接: https://arxiv.org/abs/2511.19740
作者: Tergel Molom-Ochir,Benjamin F. Morris,Mark Horton,Chiyue Wei,Cong Guo,Brady Taylor,Peter Liu,Shan X. Wang,Deliang Fan,Hai Helen Li,Yiran Chen
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注: 7 pages, 10 figures

点击查看摘要

Abstract:Transformers face scalability challenges due to the quadratic cost of attention, which involves dense similarity computations between queries and keys. We propose CAMformer, a novel accelerator that reinterprets attention as an associative memory operation and computes attention scores using a voltage-domain Binary Attention Content Addressable Memory (BA-CAM). This enables constant-time similarity search through analog charge sharing, replacing digital arithmetic with physical similarity sensing. CAMformer integrates hierarchical two-stage top-k filtering, pipelined execution, and high-precision contextualization to achieve both algorithmic accuracy and architectural efficiency. Evaluated on BERT and Vision Transformer workloads, CAMformer achieves over 10x energy efficiency, up to 4x higher throughput, and 6-8x lower area compared to state-of-the-art accelerators–while maintaining near-lossless accuracy.

[LG-61] raining-Free Active Learning Framework in Materials Science with Large Language Models

链接: https://arxiv.org/abs/2511.19730
作者: Hongchen Wang,Rafael Espinosa Castañeda,Jay R. Werber,Yao Fehlis,Edward Kim,Jason Hattrick-Simpers
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注:

点击查看摘要

Abstract:Active learning (AL) accelerates scientific discovery by prioritizing the most informative experiments, but traditional machine learning (ML) models used in AL suffer from cold-start limitations and domain-specific feature engineering, restricting their generalizability. Large language models (LLMs) offer a new paradigm by leveraging their pretrained knowledge and universal token-based representations to propose experiments directly from text-based descriptions. Here, we introduce an LLM-based active learning framework (LLM-AL) that operates in an iterative few-shot setting and benchmark it against conventional ML models across four diverse materials science datasets. We explored two prompting strategies: one using concise numerical inputs suited for datasets with more compositional and structured features, and another using expanded descriptive text suited for datasets with more experimental and procedural features to provide additional context. Across all datasets, LLM-AL could reduce the number of experiments needed to reach top-performing candidates by over 70% and consistently outperformed traditional ML models. We found that LLM-AL performs broader and more exploratory searches while still reaching the optima with fewer iterations. We further examined the stability boundaries of LLM-AL given the inherent non-determinism of LLMs and found its performance to be broadly consistent across runs, within the variability range typically observed for traditional ML approaches. These results demonstrate that LLM-AL can serve as a generalizable alternative to conventional AL pipelines for more efficient and interpretable experiment selection and potential LLM-driven autonomous discovery.

[LG-62] Large Scale Community-Aware Network Generation CCS

链接: https://arxiv.org/abs/2511.19717
作者: Vikram Ramavarapu,João Alfredo Cardoso Lamy,Mohammad Dindoost,David A. Bader
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注: 22 pages, 10 figures, code made available at this https URL

点击查看摘要

Abstract:Community detection, or network clustering, is used to identify latent community structure in networks. Due to the scarcity of labeled ground truth in real-world networks, evaluating these algorithms poses significant challenges. To address this, researchers use synthetic network generators that produce networks with ground-truth community labels. RECCS is one such algorithm that takes a network and its clustering as input and generates a synthetic network through a modular pipeline. Each generated ground truth cluster preserves key characteristics of the corresponding input cluster, including connectivity, minimum degree, and degree sequence distribution. The output consists of a synthetically generated network, and disjoint ground truth cluster labels for all nodes. In this paper, we present two enhanced versions: RECCS+ and RECCS++. RECCS+ maintains algorithmic fidelity to the original RECCS while introducing parallelization through an orchestrator that coordinates algorithmic components across multiple processes and employs multithreading. RECCS++ builds upon this foundation with additional algorithmic optimizations to achieve further speedup. Our experimental results demonstrate that RECCS+ and RECCS++ achieve speedups of up to 49x and 139x respectively on our benchmark datasets, with RECCS++'s additional performance gains involving a modest accuracy tradeoff. With this newfound performance, RECCS++ can now scale to networks with over 100 million nodes and nearly 2 billion edges.

[LG-63] Designing Preconditioners for SGD: Local Conditioning Noise Floors and Basin Stability

链接: https://arxiv.org/abs/2511.19716
作者: Mitchell Scott,Tianshi Xu,Ziyuan Tang,Alexandra Pichette-Emmons,Qiang Ye,Yousef Saad,Yuanzhe Xi
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 31 pages, 11 Figures

点击查看摘要

Abstract:Stochastic Gradient Descent (SGD) often slows in the late stage of training due to anisotropic curvature and gradient noise. We analyze preconditioned SGD in the geometry induced by a symmetric positive definite matrix \mathbfM , deriving bounds in which both the convergence rate and the stochastic noise floor are governed by \mathbfM -dependent quantities: the rate through an effective condition number in the \mathbfM -metric, and the floor through the product of that condition number and the preconditioned noise level. For nonconvex objectives, we establish a preconditioner-dependent basin-stability guarantee: when smoothness and basin size are measured in the \mathbfM -norm, the probability that the iterates remain in a well-behaved local region admits an explicit lower bound. This perspective is particularly relevant in Scientific Machine Learning (SciML), where achieving small training loss under stochastic updates is closely tied to physical fidelity, numerical stability, and constraint satisfaction. The framework applies to both diagonal/adaptive and curvature-aware preconditioners and yields a simple design principle: choose \mathbfM to improve local conditioning while attenuating noise. Experiments on a quadratic diagnostic and three SciML benchmarks validate the predicted rate-floor behavior.

[LG-64] CafeQ: Calibration-free Quantization via Learned Transformations and Adaptive Rounding

链接: https://arxiv.org/abs/2511.19705
作者: Ziteng Sun,Adrian Benton,Samuel Kushnir,Asher Trockman,Vikas Singh,Suhas Diggavi,Ananda Theertha Suresh
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Post-training quantization is an effective method for reducing the serving cost of large language models, where the standard approach is to use a round-to-nearest quantization level scheme. However, this often introduces large errors due to outliers in the weights. Proposed mitigation mechanisms include applying adaptive rounding, random rotation transformations or committing to a post-training target using calibration data. Unfortunately, this reliance on calibration data can be severely limiting in some real-world scenarios as such data may be unavailable or subject to privacy regulations. In this paper, we propose algorithms to optimize transformations and adaptive rounding without access to any calibration data. The optimization is achieved by designing a suitable proxy function for the quantization loss without calibration data. To maintain inference efficiency, we perform structured matrix transformations for single matrices. For paired weights that interact directly in the computation graph, we use dual matrix transformations and adaptive rounding methods. We conduct experiments on Gemma 2 models, and observe consistent improvement over the baselines. For Gemma 2 9B quantization, our method improves the average benchmark score from 61.9 to 62.4 for 4-bit quantization and from 52.0 to 60.6 for 3-bit quantization, while adding less than 3% of computation overhead. Furthermore, our method achieves performance comparable to the commonly used GPTQ method, which requires calibration data.

[LG-65] Demystifying Diffusion Objectives: Reweighted Losses are Better Variational Bounds

链接: https://arxiv.org/abs/2511.19664
作者: Jiaxin Shi,Michalis K. Titsias
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We derive a new theoretical interpretation of the reweighted losses that are widely used for training diffusion models. Our method is based on constructing a cascade of time-dependent variational lower bounds on the data log-likelihood, that provably improves upon the standard evidence lower bound and results in reduced data-model KL-divergences. Combining such bounds gives rise to reweighted objectives that can be applied to any generative diffusion model including both continuous Gaussian diffusion and masked (discrete) diffusion models. Then, we showcase this framework in masked diffusion and report significant improvements over previous training losses in pixel-space image modeling, approaching sample quality comparable to continuous diffusion models. Our results also provide a theoretical justification for the simple weighting scheme widely used in masked image models.

[LG-66] Structured Noise Modeling for Enhanced Time-Series Forecasting

链接: https://arxiv.org/abs/2511.19657
作者: Sepideh Koohfar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time-series forecasting remains difficult in real-world settings because temporal patterns operate at multiple scales, from broad contextual trends to fast, fine-grained fluctuations that drive critical decisions. Existing neural models often struggle to represent these interacting dynamics, leading to unstable predictions and reduced reliability in downstream applications. This work introduces a forecast-blur-denoise framework that improves temporal fidelity through structured noise modeling. The approach incorporates a learnable Gaussian Process module that generates smooth, correlated perturbations, encouraging the forecasting backbone to capture long-range structure while a dedicated refinement model restores high-resolution temporal detail. Training the components jointly enables natural competence division and avoids the artifacts commonly produced by isotropic corruption methods. Experiments across electricity, traffic, and solar datasets show consistent gains in multi-horizon accuracy and stability. The modular design also allows the blur-denoise layer to operate as a lightweight enhancement for pretrained models, supporting efficient adaptation in limited-data scenarios. By strengthening the reliability and interpretability of fine-scale temporal predictions, this framework contributes to more trustworthy AI systems used in forecasting-driven decision support across energy, infrastructure, and other time-critical domains.

[LG-67] Lower Complexity Bounds for Nonconvex-Strongly-Convex Bilevel Optimization with First-Order Oracles

链接: https://arxiv.org/abs/2511.19656
作者: Kaiyi Ji
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 24 pages, 1 figure

点击查看摘要

Abstract:Although upper bound guarantees for bilevel optimization have been widely studied, progress on lower bounds has been limited due to the complexity of the bilevel structure. In this work, we focus on the smooth nonconvex-strongly-convex setting and develop new hard instances that yield nontrivial lower bounds under deterministic and stochastic first-order oracle models. In the deterministic case, we prove that any first-order zero-respecting algorithm requires at least \Omega(\kappa^3/2\epsilon^-2) oracle calls to find an \epsilon -accurate stationary point, improving the optimal lower bounds known for single-level nonconvex optimization and for nonconvex-strongly-convex min-max problems. In the stochastic case, we show that at least \Omega(\kappa^5/2\epsilon^-4) stochastic oracle calls are necessary, again strengthening the best known bounds in related settings. Our results expose substantial gaps between current upper and lower bounds for bilevel optimization and suggest that even simplified regimes, such as those with quadratic lower-level objectives, warrant further investigation toward understanding the optimal complexity of bilevel optimization under standard first-order oracles.

[LG-68] Agint: Agent ic Graph Compilation for Software Engineering Agents NEURIPS2025

链接: https://arxiv.org/abs/2511.19635
作者: Abhi Chivukula,Jay Somasundaram,Vijay Somasundaram
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 18 pages, 5 figures, NeurIPS 2025: Deep Learning for Code in the Agentic Era

点击查看摘要

Abstract:LLM-based coding agents are increasingly common but still face challenges in context management, latency, reliability, reproducibility, and scalability. We present Agint, an agentic graph compiler, interpreter, and runtime that incrementally and hierarchically converts natural-language instructions into typed, effect-aware code DAGs. Agint introduces explicit type floors (text to data to spec to code) grounded in semantic graph transformations and a hybrid LLM and function-based JIT runtime. This enables dynamic graph refinement, reproducible and optimizable execution, speculative evaluation, and interoperability with existing developer tools. Agint’s typed graph bindings improve reliability and allow concurrent composition of concurrent codebases by construction, supporting accelerated development with smaller and faster models, lower latency, efficient context utilization, and higher throughput. Hierarchical compilation allows scalable graph edits, while the graph structure supports reproducibility and efficient parallel generation. Agint provides a composable unix-style toolchain: dagify (DAG compiler), dagent (hybrid JIT runtime), schemagin (schema generator), and datagin (data transformer) for realtime, low-latency code and dataflow creation. Human developers and coding agents refine graphs through the Agint CLI, while non-technical users use Agint Flow GUI for visual editing, conversational refinement, and debugging to promote prototype agentic workflows to production code. This continuous co-creation model allows teams to prototype quickly, refine seamlessly, and deploy reliably, bridging natural language, compiler methods, and developer tooling to enable a new generation of composable, team-centric coding agents at scale.

[LG-69] Neural Tractability via Structure: Learning-Augmented Algorithms for Graph Combinatorial Optimization

链接: https://arxiv.org/abs/2511.19573
作者: Jialiang Li,Weitong Chen,Mingyu Guo
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Neural models have shown promise in solving NP-hard graph combinatorial optimization (CO) problems. Once trained, they offer fast inference and reasonably high-quality solutions for in-distribution testing instances, but they generally fall short in terms of absolute solution quality compared to classical search-based algorithms that are admittedly slower but offer optimality guarantee once search finishes. We propose a novel framework that combines the inference efficiency and exploratory power of neural models with the solution quality guarantee of search-based algorithms. In particular, we use parameterized algorithms (PAs) as the search component. PAs are dedicated to identifying easy instances of generally NP-hard problems, and allow for practically efficient search by exploiting structural simplicity (of the identified easy instances). Under our framework, we use parameterized analysis to identify the structurally hard parts of a CO instance. The neural model handles the hard parts by generating advisory signals based on its data-driven understanding. The PA-based search component then integrates the advisory signals to systematically and efficiently searches through the remaining structurally easy parts. Notably, our framework is agnostic to the choice of neural model and produces strictly better solutions than neural solvers alone. We examine our framework on multiple CO tasks. Empirical results show that it achieves superior solution quality, competitive with that of commercial solvers. Furthermore, by using the neural model only for exploratory advisory signals, our framework exhibits improved out-of-distribution generalization, addressing a key limitation of existing neural CO solvers. Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML) Cite as: arXiv:2511.19573 [cs.LG] (or arXiv:2511.19573v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.19573 Focus to learn more arXiv-issued DOI via DataCite

[LG-70] An Invariant Latent Space Perspective on Language Model Inversion AAAI-26 AAAI

链接: https://arxiv.org/abs/2511.19569
作者: Wentao Ye,Jiaqi Hu,Haobo Wang,Xinpeng Ti,Zhiqing Xiao,Hao Chen,Liyao Li,Lei Feng,Sai Wu,Junbo Zhao
类目: Machine Learning (cs.LG)
*备注: The Fortieth AAAI Conference on Artificial Intelligence (AAAI-26)

点击查看摘要

Abstract:Language model inversion (LMI), i.e., recovering hidden prompts from outputs, emerges as a concrete threat to user privacy and system security. We recast LMI as reusing the LLM’s own latent space and propose the Invariant Latent Space Hypothesis (ILSH): (1) diverse outputs from the same source prompt should preserve consistent semantics (source invariance), and (2) input-output cyclic mappings should be self-consistent within a shared latent space (cyclic invariance). Accordingly, we present Inv^2A, which treats the LLM as an invariant decoder and learns only a lightweight inverse encoder that maps outputs to a denoised pseudo-representation. When multiple outputs are available, they are sparsely concatenated at the representation layer to increase information density. Training proceeds in two stages: contrastive alignment (source invariance) and supervised reinforcement (cyclic invariance). An optional training-free neighborhood search can refine local performance. Across 9 datasets covering user and system prompt scenarios, Inv^2A outperforms baselines by an average of 4.77% BLEU score while reducing dependence on large inverse corpora. Our analysis further shows that prevalent defenses provide limited protection, underscoring the need for stronger strategies. The source code and data involved in this paper can be found in this https URL.

[LG-71] ModHiFi: Identifying High Fidelity predictive components for Model Modification NEURIPS2025

链接: https://arxiv.org/abs/2511.19566
作者: Dhruva Kashyap,Chaitanya Murti,Pranav K Nayak,Tanay Narshana,Chiranjib Bhattacharyya
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: NeurIPS 2025 (Spotlight). Our code is available at this https URL

点击查看摘要

Abstract:Open weight models, which are ubiquitous, rarely provide access to their training data or loss function. This makes modifying such models for tasks such as pruning or unlearning constrained by this unavailability an active area of research. Existing techniques typically require gradients or ground-truth labels, rendering them infeasible in settings with limited computational resources. In this work, we investigate the fundamental question of identifying components that are critical to the model’s predictive performance, without access to either gradients or the loss function, and with only distributional access such as synthetic data. We theoretically demonstrate that the global reconstruction error is linearly bounded by local reconstruction errors for Lipschitz-continuous networks such as CNNs and well-trained Transformers (which, contrary to existing literature, we find exhibit Lipschitz continuity). This motivates using the locally reconstructive behavior of component subsets to quantify their global importance, via a metric that we term Subset Fidelity. In the uncorrelated features setting, selecting individual components via their Subset Fidelity scores is optimal, which we use to propose ModHiFi, an algorithm for model modification that requires no training data or loss function access. ModHiFi-P, for structured pruning, achieves an 11% speedup over the current state of the art on ImageNet models and competitive performance on language models. ModHiFi-U, for classwise unlearning, achieves complete unlearning on CIFAR-10 without fine-tuning and demonstrates competitive performance on Swin Transformers.

[LG-72] Learning to Solve Weighted Maximum Satisfiability with a Co-Training Architecture

链接: https://arxiv.org/abs/2511.19544
作者: Kaidi Wan,Minghao Liu,Yong Lai
类目: Machine Learning (cs.LG)
*备注: 10 pages, 4 figures

点击查看摘要

Abstract:Wepropose SplitGNN, a graph neural network (GNN)-based approach that learns to solve weighted maximum satisfiabil ity (MaxSAT) problem. SplitGNN incorporates a co-training architecture consisting of supervised message passing mech anism and unsupervised solution boosting layer. A new graph representation called edge-splitting factor graph is proposed to provide more structural information for learning, which is based on spanning tree generation and edge classification. To improve the solutions on challenging and weighted instances, we implement a GPU-accelerated layer applying efficient score calculation and relaxation-based optimization. Exper iments show that SplitGNN achieves 3* faster convergence and better predictions compared with other GNN-based ar chitectures. More notably, SplitGNN successfully finds solu tions that outperform modern heuristic MaxSAT solvers on much larger and harder weighted MaxSAT benchmarks, and demonstrates exceptional generalization abilities on diverse structural instances. Comments: 10 pages, 4 figures Subjects: Machine Learning (cs.LG) Cite as: arXiv:2511.19544 [cs.LG] (or arXiv:2511.19544v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.19544 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Kaidi Wan [view email] [v1] Mon, 24 Nov 2025 11:22:29 UTC (282 KB) Full-text links: Access Paper: View a PDF of the paper titled Learning to Solve Weighted Maximum Satisfiability with a Co-Training Architecture, by Kaidi Wan and 2 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.LG prev | next new | recent | 2025-11 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) IArxiv recommender toggle IArxiv Recommender (What is IArxiv?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status

[LG-73] Automating Deception: Scalable Multi-Turn LLM Jailbreaks

链接: https://arxiv.org/abs/2511.19517
作者: Adarsh Kumarappan,Ananya Mujoo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-turn conversational attacks, which leverage psychological principles like Foot-in-the-Door (FITD), where a small initial request paves the way for a more significant one, to bypass safety alignments, pose a persistent threat to Large Language Models (LLMs). Progress in defending against these attacks is hindered by a reliance on manual, hard-to-scale dataset creation. This paper introduces a novel, automated pipeline for generating large-scale, psychologically-grounded multi-turn jailbreak datasets. We systematically operationalize FITD techniques into reproducible templates, creating a benchmark of 1,500 scenarios across illegal activities and offensive content. We evaluate seven models from three major LLM families under both multi-turn (with history) and single-turn (without history) conditions. Our results reveal stark differences in contextual robustness: models in the GPT family demonstrate a significant vulnerability to conversational history, with Attack Success Rates (ASR) increasing by as much as 32 percentage points. In contrast, Google’s Gemini 2.5 Flash exhibits exceptional resilience, proving nearly immune to these attacks, while Anthropic’s Claude 3 Haiku shows strong but imperfect resistance. These findings highlight a critical divergence in how current safety architectures handle conversational context and underscore the need for defenses that can resist narrative-based manipulation.

[LG-74] Row-stochastic matrices can provably outperform doubly stochastic matrices in decentralized learning

链接: https://arxiv.org/abs/2511.19513
作者: Bing Liu,Boao Kong,Limin Lu,Kun Yuan,Chengcheng Zhao
类目: Machine Learning (cs.LG)
*备注: 41 pages, 38 figures

点击查看摘要

Abstract:Decentralized learning often involves a weighted global loss with heterogeneous node weights \lambda . We revisit two natural strategies for incorporating these weights: (i) embedding them into the local losses to retain a uniform weight (and thus a doubly stochastic matrix), and (ii) keeping the original losses while employing a \lambda -induced row-stochastic matrix. Although prior work shows that both strategies yield the same expected descent direction for the global loss, it remains unclear whether the Euclidean-space guarantees are tight and what fundamentally differentiates their behaviors. To clarify this, we develop a weighted Hilbert-space framework L^2(\lambda;\mathbbR^d) and obtain convergence rates that are strictly tighter than those from Euclidean analysis. In this geometry, the row-stochastic matrix becomes self-adjoint whereas the doubly stochastic one does not, creating additional penalty terms that amplify consensus error, thereby slowing convergence. Consequently, the difference in convergence arises not only from spectral gaps but also from these penalty terms. We then derive sufficient conditions under which the row-stochastic design converges faster even with a smaller spectral gap. Finally, by using a Rayleigh-quotient and Loewner-order eigenvalue comparison, we further obtain topology conditions that guarantee this advantage and yield practical topology-design guidelines.

[LG-75] ouchFormer: A Robust Transformer-based Framework for Multimodal Material Perception AAAI2026

链接: https://arxiv.org/abs/2511.19509
作者: Kailin Lyu,Long Xiao,Jianing Zeng,Junhao Dong,Xuexin Liu,Zhuojun Zou,Haoyue Yang,Lin Shu,Jie Hao
类目: Machine Learning (cs.LG)
*备注: 9 pages, 7 figures, Accepted by AAAI 2026

点击查看摘要

Abstract:Traditional vision-based material perception methods often experience substantial performance degradation under visually impaired conditions, thereby motivating the shift toward non-visual multimodal material perception. Despite this, existing approaches frequently perform naive fusion of multimodal inputs, overlooking key challenges such as modality-specific noise, missing modalities common in real-world scenarios, and the dynamically varying importance of each modality depending on the task. These limitations lead to suboptimal performance across several benchmark tasks. In this paper, we propose a robust multimodal fusion framework, TouchFormer. Specifically, we employ a Modality-Adaptive Gating (MAG) mechanism and intra- and inter-modality attention mechanisms to adaptively integrate cross-modal features, enhancing model robustness. Additionally, we introduce a Cross-Instance Embedding Regularization(CER) strategy, which significantly improves classification accuracy in fine-grained subcategory material recognition tasks. Experimental results demonstrate that, compared to existing non-visual methods, the proposed TouchFormer framework achieves classification accuracy improvements of 2.48% and 6.83% on SSMC and USMC tasks, respectively. Furthermore, real-world robotic experiments validate TouchFormer’s effectiveness in enabling robots to better perceive and interpret their environment, paving the way for its deployment in safety-critical applications such as emergency response and industrial automation. The code and datasets will be open-source, and the videos are available in the supplementary materials.

[LG-76] Profile Generators: A Link between the Narrative and the Binary Matrix Representation

链接: https://arxiv.org/abs/2511.19506
作者: Raoul H. Kutil,Georg Zimmermann,Barbara Strasser-Kirchweger,Christian Borgelt
类目: Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
*备注: 31 pages, 8 figures, 4 tables

点击查看摘要

Abstract:Mental health disorders, particularly cognitive disorders defined by deficits in cognitive abilities, are described in detail in the DSM-5, which includes definitions and examples of signs and symptoms. A simplified, machine-actionable representation was developed to assess the similarity and separability of these disorders, but it is not suited for the most complex cases. Generating or applying a full binary matrix for similarity calculations is infeasible due to the vast number of symptom combinations. This research develops an alternative representation that links the narrative form of the DSM-5 with the binary matrix representation and enables automated generation of valid symptom combinations. Using a strict pre-defined format of lists, sets, and numbers with slight variations, complex diagnostic pathways involving numerous symptom combinations can be represented. This format, called the symptom profile generator (or simply generator), provides a readable, adaptable, and comprehensive alternative to a binary matrix while enabling easy generation of symptom combinations (profiles). Cognitive disorders, which typically involve multiple diagnostic criteria with several symptoms, can thus be expressed as lists of generators. Representing several psychotic disorders in generator form and generating all symptom combinations showed that matrix representations of complex disorders become too large to manage. The MPCS (maximum pairwise cosine similarity) algorithm cannot handle matrices of this size, prompting the development of a profile reduction method using targeted generator manipulation to find specific MPCS values between disorders. The generators allow easier creation of binary representations for large matrices and make it possible to calculate specific MPCS cases between complex disorders through conditional generators.

[LG-77] Position: The Complexity of Perfect AI Alignment – Formalizing the RLHF Trilemma NEURIPS2025

链接: https://arxiv.org/abs/2511.19504
作者: Subramanyam Sahoo,Aman Chadha,Vinija Jain,Divya Chaudhary
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models (ResponsibleFM)

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback (RLHF) is widely used for aligning large language models, yet practitioners face a persistent puzzle: improving safety often reduces fairness, scaling to diverse populations becomes computationally intractable, and making systems robust often amplifies majority biases. We formalize this tension as the Alignment Trilemma: no RLHF system can simultaneously achieve (i) epsilon-representativeness across diverse human values, (ii) polynomial tractability in sample and compute complexity, and (iii) delta-robustness against adversarial perturbations and distribution shift. Through a complexity-theoretic analysis integrating statistical learning theory and robust optimization, we prove that achieving both representativeness (epsilon = 0.01) and robustness (delta = 0.001) for global-scale populations requires Omega(2^d_context) operations, which is super-polynomial in the context dimensionality. We show that current RLHF implementations resolve this trilemma by sacrificing representativeness: they collect only 10^3–10^4 samples from homogeneous annotator pools while 10^7–10^8 samples are needed for true global representation. Our framework provides a unified explanation for documented RLHF pathologies including preference collapse, sycophancy, and systematic bias amplification. We conclude with concrete directions for navigating these fundamental trade-offs through strategic relaxations of alignment requirements.

[LG-78] RFX: High-Performance Random Forests with GPU Acceleration and QLORA Compression

链接: https://arxiv.org/abs/2511.19493
作者: Chris Kuchar
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: 39 pages, 9 tables, 4 figures

点击查看摘要

Abstract:RFX (Random Forests X), where X stands for compression or quantization, presents a production-ready implementation of Breiman and Cutler’s Random Forest classification methodology in Python. RFX v1.0 provides complete classification: out-of-bag error estimation, overall and local importance measures, proximity matrices with QLORA compression, case-wise analysis, and interactive visualization (rfviz)–all with CPU and GPU acceleration. Regression, unsupervised learning, CLIQUE importance, and RF-GAP proximity are planned for v2.0. This work introduces four solutions addressing the proximity matrix memory bottleneck limiting Random Forest analysis to ~60,000 samples: (1) QLORA (Quantized Low-Rank Adaptation) compression for GPU proximity matrices, reducing memory from 80GB to 6.4MB for 100k samples (12,500x compression with INT8 quantization) while maintaining 99% geometric structure preservation, (2) CPU TriBlock proximity–combining upper-triangle storage with block-sparse thresholding–achieving 2.7x memory reduction with lossless quality, (3) SM-aware GPU batch sizing achieving 95% GPU utilization, and (4) GPU-accelerated 3D MDS visualization computing embeddings directly from low-rank factors using power iteration. Validation across four implementation modes (GPU/CPU x case-wise/non-case-wise) demonstrates correct implementation. GPU achieves 1.4x speedup over CPU for overall importance with 500+ trees. Proximity computation scales from 1,000 to 200,000+ samples (requiring GPU QLORA), with CPU TriBlock filling the gap for medium-scale datasets (10K-50K samples). RFX v1.0 eliminates the proximity memory bottleneck, enabling proximity-based Random Forest analysis on datasets orders of magnitude larger than previously feasible. Open-source production-ready classification following Breiman and Cutler’s original methodology. Comments: 39 pages, 9 tables, 4 figures Subjects: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML) MSC classes: 62H30 (secondary) Cite as: arXiv:2511.19493 [cs.LG] (or arXiv:2511.19493v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2511.19493 Focus to learn more arXiv-issued DOI via DataCite

[LG-79] OpenCML: End-to-End Framework of Open-world Machine Learning to Learn Unknown Classes Incrementally

链接: https://arxiv.org/abs/2511.19491
作者: Jitendra Parmar,Praveen Singh Thakur
类目: Machine Learning (cs.LG)
*备注: Introduces an open-world machine learning model for continual and adaptive learning Discovers unknown classes and dynamically creates new class this http URL class-incremental learning to retain and extend prior knowledge. Enables continuous model improvement across multiple learning iterations. Achieved superior performance with an average accuracy of 82.54

点击查看摘要

Abstract:Open-world machine learning is an emerging technique in artificial intelligence, where conventional machine learning models often follow closed-world assumptions, which can hinder their ability to retain previously learned knowledge for future tasks. However, automated intelligence systems must learn about novel classes and previously known tasks. The proposed model offers novel learning classes in an open and continuous learning environment. It consists of two different but connected tasks. First, it discovers unknown classes in the data and creates novel classes; next, it learns how to perform class incrementally for each new class. Together, they enable continual learning, allowing the system to expand its understanding of the data and improve over time. The proposed model also outperformed existing approaches in open-world learning. Furthermore, it demonstrated strong performance in continuous learning, achieving a highest average accuracy of 82.54% over four iterations and a minimum accuracy of 65.87%.

[LG-80] he Generalized Proximity Forest

链接: https://arxiv.org/abs/2511.19487
作者: Ben Shaw,Adam Rustad,Sofia Pelagalli Maia,Jake S. Rhodes,Kevin R. Moon
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Recent work has demonstrated the utility of Random Forest (RF) proximities for various supervised machine learning tasks, including outlier detection, missing data imputation, and visualization. However, the utility of the RF proximities depends upon the success of the RF model, which itself is not the ideal model in all contexts. RF proximities have recently been extended to time series by means of the distance-based Proximity Forest (PF) model, among others, affording time series analysis with the benefits of RF proximities. In this work, we introduce the generalized PF model, thereby extending RF proximities to all contexts in which supervised distance-based machine learning can occur. Additionally, we introduce a variant of the PF model for regression tasks. We also introduce the notion of using the generalized PF model as a meta-learning framework, extending supervised imputation capability to any pre-trained classifier. We experimentally demonstrate the unique advantages of the generalized PF model compared with both the RF model and the k -nearest neighbors model.

[LG-81] OmniTFT: Omni Target Forecasting for Vital Signs and Laboratory Result Trajectories in Multi Center ICU Data

链接: https://arxiv.org/abs/2511.19485
作者: Wanzhe Xu,Yutong Dai,Yitao Yang,Martin Loza,Weihang Zhang,Yang Cui,Xin Zeng,Sung Joon Park,Kenta Nakai
类目: Machine Learning (cs.LG)
*备注: 23 pages, 5 figures, 2 tables

点击查看摘要

Abstract:Accurate multivariate time-series prediction of vital signs and laboratory results is crucial for early intervention and precision medicine in intensive care units (ICUs). However, vital signs are often noisy and exhibit rapid fluctuations, while laboratory tests suffer from missing values, measurement lags, and device-specific bias, making integrative forecasting highly challenging. To address these issues, we propose OmniTFT, a deep learning framework that jointly learns and forecasts high-frequency vital signs and sparsely sampled laboratory results based on the Temporal Fusion Transformer (TFT). Specifically, OmniTFT implements four novel strategies to enhance performance: sliding window equalized sampling to balance physiological states, frequency-aware embedding shrinkage to stabilize rare-class representations, hierarchical variable selection to guide model attention toward informative feature clusters, and influence-aligned attention calibration to enhance robustness during abrupt physiological changes. By reducing the reliance on target-specific architectures and extensive feature engineering, OmniTFT enables unified modeling of multiple heterogeneous clinical targets while preserving cross-institutional generalizability. Across forecasting tasks, OmniTFT achieves substantial performance improvement for both vital signs and laboratory results on the MIMIC-III, MIMIC-IV, and eICU datasets. Its attention patterns are interpretable and consistent with known pathophysiology, underscoring its potential utility for quantitative decision support in clinical care.

[LG-82] stable-pretraining-v1: Foundation Model Research Made Simple

链接: https://arxiv.org/abs/2511.19484
作者: Randall Balestriero,Hugues Van Assel,Sami BuGhanem,Lucas Maes
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Foundation models and self-supervised learning (SSL) have become central to modern AI, yet research in this area remains hindered by complex codebases, redundant re-implementations, and the heavy engineering burden of scaling experiments. We present stable-pretraining, a modular, extensible, and performance-optimized library built on top of PyTorch, Lightning, Hugging Face, and TorchMetrics. Unlike prior toolkits focused narrowly on reproducing state-of-the-art results, stable-pretraining is designed for flexibility and iteration speed: it unifies essential SSL utilities–including probes, collapse detection metrics, augmentation pipelines, and extensible evaluation routines–within a coherent and reliable framework. A central design principle is logging everything, enabling fine-grained visibility into training dynamics that makes debugging, monitoring, and reproducibility seamless. We validate the library by demonstrating its ability to generate new research insights with minimal overhead, including depthwise representation probing and the analysis of CLIP degradation under synthetic data finetuning. By lowering barriers to entry while remaining scalable to large experiments, stable-pretraining aims to accelerate discovery and expand the possibilities of foundation model research.

[LG-83] Quality analysis and evaluation prediction of RAG retrieval based on machine learning algorithms

链接: https://arxiv.org/abs/2511.19481
作者: Ruoxin Zhang,Zhizhao Wen,Chao Wang,Chenchen Tang,Puyang Xu,Yifan Jiang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the rapid evolution of large language models, retrieval enhanced generation technology has been widely used due to its ability to integrate external knowledge to improve output accuracy. However, the performance of the system is highly dependent on the quality of the retrieval module. If the retrieval results have low relevance to user needs or contain noisy information, it will directly lead to distortion of the generated content. In response to the performance bottleneck of existing models in processing tabular features, this paper proposes an XGBoost machine learning regression model based on feature engineering and particle swarm optimization. Correlation analysis shows that answer_quality is positively correlated with doc_delevance by 0.66, indicating that document relevance has a significant positive effect on answer quality, and improving document relevance may enhance answer quality; The strong negative correlations between semantic similarity, redundancy, and diversity were -0.89 and -0.88, respectively, indicating a trade- off between semantic similarity, redundancy, and diversity. In other words, as the former two increased, diversity significantly decreased. The experimental results comparing decision trees, AdaBoost, etc. show that the VMD PSO BiLSTM model is superior in all evaluation indicators, with significantly lower MSE, RMSE, MAE, and MAPE compared to the comparison model. The R2 value is higher, indicating that its prediction accuracy, stability, and data interpretation ability are more outstanding. This achievement provides an effective path for optimizing the retrieval quality and improving the generation effect of RAG system, and has important value in promoting the implementation and application of related technologies.

[LG-84] Federated Learning Framework for Scalable AI in Heterogeneous HPC and Cloud Environments

链接: https://arxiv.org/abs/2511.19479
作者: Sangam Ghimire,Paribartan Timalsina,Nirjal Bhurtel,Bishal Neupane,Bigyan Byanju Shrestha,Subarna Bhattarai,Prajwal Gaire,Jessica Thapa,Sudan Jha
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As the demand grows for scalable and privacy-aware AI systems, Federated Learning (FL) has emerged as a promising solution, allowing decentralized model training without moving raw data. At the same time, the combination of high- performance computing (HPC) and cloud infrastructure offers vast computing power but introduces new complexities, especially when dealing with heteroge- neous hardware, communication limits, and non-uniform data. In this work, we present a federated learning framework built to run efficiently across mixed HPC and cloud environments. Our system addresses key challenges such as system het- erogeneity, communication overhead, and resource scheduling, while maintaining model accuracy and data privacy. Through experiments on a hybrid testbed, we demonstrate strong performance in terms of scalability, fault tolerance, and convergence, even under non-Independent and Identically Distributed (non-IID) data distributions and varied hardware. These results highlight the potential of federated learning as a practical approach to building scalable Artificial Intelligence (AI) systems in modern, distributed computing settings.

[LG-85] owards a future space-based highly scalable AI infrastructure system design

链接: https://arxiv.org/abs/2511.19468
作者: Blaise Agüera y Arcas,Travis Beals,Maria Biggs,Jessica V. Bloom,Thomas Fischbacher,Konstantin Gromov,Urs Köster,Rishiraj Pravahan,James Manyika
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 19 pages, 4 figures

点击查看摘要

Abstract:If AI is a foundational general-purpose technology, we should anticipate that demand for AI compute – and energy – will continue to grow. The Sun is by far the largest energy source in our solar system, and thus it warrants consideration how future AI infrastructure could most efficiently tap into that power. This work explores a scalable compute system for machine learning in space, using fleets of satellites equipped with solar arrays, inter-satellite links using free-space optics, and Google tensor processing unit (TPU) accelerator chips. To facilitate high-bandwidth, low-latency inter-satellite communication, the satellites would be flown in close proximity. We illustrate the basic approach to formation flight via a 81-satellite cluster of 1 km radius, and describe an approach for using high-precision ML-based models to control large-scale constellations. Trillium TPUs are radiation tested. They survive a total ionizing dose equivalent to a 5 year mission life without permanent failures, and are characterized for bit-flip errors. Launch costs are a critical part of overall system cost; a learning curve analysis suggests launch to low-Earth orbit (LEO) may reach \lesssim \ 200/kg by the mid-2030s.

[LG-86] Spatio-Temporal Hierarchical Causal Models

链接: https://arxiv.org/abs/2511.20558
作者: Xintong Li,Haoran Zhang,Xiao Zhou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The abundance of fine-grained spatio-temporal data, such as traffic sensor networks, offers vast opportunities for scientific discovery. However, inferring causal relationships from such observational data remains challenging, particularly due to unobserved confounders that are specific to units (e.g., geographical locations) yet influence outcomes over time. Most existing methods for spatio-temporal causal inference assume that all confounders are observed, an assumption that is often violated in practice. In this paper, we introduce Spatio-Temporal Hierarchical Causal Models (ST-HCMs), a novel graphical framework that extends hierarchical causal modeling to the spatio-temporal domain. At the core of our approach is the Spatio-Temporal Collapse Theorem, which shows that a complex ST-HCM converges to a simpler flat causal model as the amount of subunit data increases. This theoretical result enables a general procedure for causal identification, allowing ST-HCMs to recover causal effects even in the presence of unobserved, time-invariant unit-level confounders, a scenario where standard non-hierarchical models fail. We validate the effectiveness of our framework on both synthetic and real-world datasets, demonstrating its potential for robust causal inference in complex dynamic systems.

[LG-87] Generative Modeling with Manifold Percolation

链接: https://arxiv.org/abs/2511.20503
作者: Rui Tong
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 13 pages, 7 figures. Correspondence: this http URL @warwick. this http URL

点击查看摘要

Abstract:Generative modeling is typically framed as learning mapping rules, but from an observer’s perspective without access to these rules, the task manifests as disentangling the geometric support from the probability distribution. We propose that Continuum Percolation is uniquely suited for this support analysis, as the sampling process effectively projects high-dimensional density estimation onto a geometric counting problem on the support. In this work, we establish a rigorous isomorphism between the topological phase transitions of Random Geometric Graphs and the underlying data manifold in high-dimensional space. By analyzing the relationship between our proposed Percolation Shift metric and FID, we demonstrate that our metric captures structural pathologies (such as implicit mode collapse) where statistical metrics fail. Finally, we translate this topological phenomenon into a differentiable loss function to guide training. Experimental results confirm that this approach not only prevents manifold shrinkage but drives the model toward a state of “Hyper-Generalization,” achieving good fidelity and verified topological expansion.

[LG-88] A Fully Probabilistic Tensor Network for Regularized Volterra System Identification

链接: https://arxiv.org/abs/2511.20457
作者: Afra Kilic,Kim Batselier
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 6 pages, 3 figures, 1 table. Submitted to IFAC 2026. Code available at: this https URL

点击查看摘要

Abstract:Modeling nonlinear systems with Volterra series is challenging because the number of kernel coefficients grows exponentially with the model order. This work introduces Bayesian Tensor Network Volterra kernel machines (BTN-V), extending the Bayesian Tensor Network framework to Volterra system identification. BTN-V represents Volterra kernels using canonical polyadic decomposition, reducing model complexity from O(I^D) to O(DIR). By treating all tensor components and hyperparameters as random variables, BTN-V provides predictive uncertainty estimation at no additional computational cost. Sparsity-inducing hierarchical priors enable automatic rank determination and the learning of fading-memory behavior directly from data, improving interpretability and preventing overfitting. Empirical results demonstrate competitive accuracy, enhanced uncertainty quantification, and reduced computational cost.

[LG-89] Solving Heterogeneous Agent Models with Physics-informed Neural Networks

链接: https://arxiv.org/abs/2511.20283
作者: Marta Grzeskiewicz
类目: General Economics (econ.GN); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Understanding household behaviour is essential for modelling macroeconomic dynamics and designing effective policy. While heterogeneous agent models offer a more realistic alternative to representative agent frameworks, their implementation poses significant computational challenges, particularly in continuous time. The Aiyagari-Bewley-Huggett (ABH) framework, recast as a system of partial differential equations, typically relies on grid-based solvers that suffer from the curse of dimensionality, high computational cost, and numerical inaccuracies. This paper introduces the ABH-PINN solver, an approach based on Physics-Informed Neural Networks (PINNs), which embeds the Hamilton-Jacobi-Bellman and Kolmogorov Forward equations directly into the neural network training objective. By replacing grid-based approximation with mesh-free, differentiable function learning, the ABH-PINN solver benefits from the advantages of PINNs of improved scalability, smoother solutions, and computational efficiency. Preliminary results show that the PINN-based approach is able to obtain economically valid results matching the established finite-difference solvers.

[LG-90] Learning Degenerate Manifolds of Frustrated Magnets with Boltzmann Machines

链接: https://arxiv.org/abs/2511.19879
作者: Jackson C. Glass,Gia-Wei Chern
类目: rongly Correlated Electrons (cond-mat.str-el); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*备注: 12 pages, 10 figures

点击查看摘要

Abstract:We show that Restricted Boltzmann Machines (RBMs) provide a flexible generative framework for modeling spin configurations in disordered yet strongly correlated phases of frustrated magnets. As a benchmark, we first demonstrate that an RBM can learn the zero-temperature ground-state manifold of the one-dimensional ANNNI model at its multiphase point, accurately reproducing its characteristic oscillatory and exponentially decaying correlations. We then apply RBMs to kagome spin ice and show that they successfully learn the local ice rules and short-range correlations of the extensively degenerate ice-I manifold. Correlation functions computed from RBM-generated configurations closely match those from direct Monte Carlo simulations. For the partially ordered ice-II phase – featuring long-range charge order and broken time-reversal symmetry – accurate modeling requires RBMs with uniform-sign bias fields, mirroring the underlying symmetry breaking. These results highlight the utility of RBMs as generative models for learning constrained and highly frustrated magnetic states.

[LG-91] Latent-space metrics for Complex-Valued VAE out-of-distribution detection under radar clutter ICASSP2026

链接: https://arxiv.org/abs/2511.19805
作者: Y. A. Rouzoumka,E. Terreaux,C. Morisseau,J.-P. Ovarlez,C. Ren
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Under review at ICASSP 2026

点击查看摘要

Abstract:We investigate complex-valued Variational AutoEncoders (CVAE) for radar Out-Of-Distribution (OOD) detection in complex radar environments. We proposed several detection metrics: the reconstruction error of CVAE (CVAE-MSE), the latent-based scores (Mahalanobis, Kullback-Leibler divergence (KLD)), and compared their performance against the classical ANMF-Tyler detector (ANMF-FP). The performance of all these detectors is analyzed on synthetic and experimental radar data, showing the advantages and the weaknesses of each detector.

[LG-92] Clustering Approaches for Mixed-Type Data: A Comparative Study

链接: https://arxiv.org/abs/2511.19755
作者: Badih Ghattas,Alvaro Sanchez San-Benito
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Clustering is widely used in unsupervised learning to find homogeneous groups of observations within a dataset. However, clustering mixed-type data remains a challenge, as few existing approaches are suited for this task. This study presents the state-of-the-art of these approaches and compares them using various simulation models. The compared methods include the distance-based approaches k-prototypes, PDQ, and convex k-means, and the probabilistic methods KAy-means for MIxed LArge data (KAMILA), the mixture of Bayesian networks (MBNs), and latent class model (LCM). The aim is to provide insights into the behavior of different methods across a wide range of scenarios by varying some experimental factors such as the number of clusters, cluster overlap, sample size, dimension, proportion of continuous variables in the dataset, and clusters’ distribution. The degree of cluster overlap and the proportion of continuous variables in the dataset and the sample size have a significant impact on the observed performances. When strong interactions exist between variables alongside an explicit dependence on cluster membership, none of the evaluated methods demonstrated satisfactory performance. In our experiments KAMILA, LCM, and k-prototypes exhibited the best performance, with respect to the adjusted rand index (ARI). All the methods are available in R.

[LG-93] Integrating RCTs RWD AI/ML and Statistics: Next-Generation Evidence Synthesis

链接: https://arxiv.org/abs/2511.19735
作者: Shu Yang,Margaret Gamalo,Haoda Fu
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Randomized controlled trials (RCTs) have been the cornerstone of clinical evidence; however, their cost, duration, and restrictive eligibility criteria limit power and external validity. Studies using real-world data (RWD), historically considered less reliable for establishing causality, are now recognized to be important for generating real-world evidence (RWE). In parallel, artificial intelligence and machine learning (AI/ML) are being increasingly used throughout the drug development process, providing scalability and flexibility but also presenting challenges in interpretability and rigor that traditional statistics do not face. This Perspective argues that the future of evidence generation will not depend on RCTs versus RWD, or statistics versus AI/ML, but on their principled integration. To this end, a causal roadmap is needed to clarify inferential goals, make assumptions explicit, and ensure transparency about tradeoffs. We highlight key objectives of integrative evidence synthesis, including transporting RCT results to broader populations, embedding AI-assisted analyses within RCTs, designing hybrid controlled trials, and extending short-term RCTs with long-term RWD. We also outline future directions in privacy-preserving analytics, uncertainty quantification, and small-sample methods. By uniting statistical rigor with AI/ML innovation, integrative approaches can produce robust, transparent, and policy-relevant evidence, making them a key component of modern regulatory science.

[LG-94] Individual and group fairness in geographical partitioning

链接: https://arxiv.org/abs/2511.19722
作者: Ilya O. Ryzhov,John Gunnar Carlsson,Yinchu Zhu
类目: Econometrics (econ.EM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Socioeconomic segregation often arises in school districting and other contexts, causing some groups to be over- or under-represented within a particular district. This phenomenon is closely linked with disparities in opportunities and outcomes. We formulate a new class of geographical partitioning problems in which the population is heterogeneous, and it is necessary to ensure fair representation for each group at each facility. We prove that the optimal solution is a novel generalization of the additively weighted Voronoi diagram, and we propose a simple and efficient algorithm to compute it, thus resolving an open question dating back to Dvoretzky et al. (1951). The efficacy and potential for practical insight of the approach are demonstrated in a realistic case study involving seven demographic groups and 78 district offices.

[LG-95] Optimization and Regularization Under Arbitrary Objectives

链接: https://arxiv.org/abs/2511.19628
作者: Jared N. Lakhani,Etienne Pienaar
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注: 46 pages, 28 figures, 16 tables

点击查看摘要

Abstract:This study investigates the limitations of applying Markov Chain Monte Carlo (MCMC) methods to arbitrary objective functions, focusing on a two-block MCMC framework which alternates between Metropolis-Hastings and Gibbs sampling. While such approaches are often considered advantageous for enabling data-driven regularization, we show that their performance critically depends on the sharpness of the employed likelihood form. By introducing a sharpness parameter and exploring alternative likelihood formulations proportional to the target objective function, we demonstrate how likelihood curvature governs both in-sample performance and the degree of regularization inferred by the training data. Empirical applications are conducted on reinforcement learning tasks: including a navigation problem and the game of tic-tac-toe. The study concludes with a separate analysis examining the implications of extreme likelihood sharpness on arbitrary objective functions stemming from the classic game of blackjack, where the first block of the two-block MCMC framework is replaced with an iterative optimization step. The resulting hybrid approach achieves performance nearly identical to the original MCMC framework, indicating that excessive likelihood sharpness effectively collapses posterior mass onto a single dominant mode.

[LG-96] Masked Autoencoder Joint Learning for Robust Spitzoid Tumor Classification

链接: https://arxiv.org/abs/2511.19535
作者: Ilán Carretero,Roshni Mahtani,Silvia Perez-Deben,José Francisco González-Muñoz,Carlos Monteagudo,Valery Naranjo,Rocío del Amor
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: Accepted in CASEIB 2025

点击查看摘要

Abstract:Accurate diagnosis of spitzoid tumors (ST) is critical to ensure a favorable prognosis and to avoid both under- and over-treatment. Epigenetic data, particularly DNA methylation, provide a valuable source of information for this task. However, prior studies assume complete data, an unrealistic setting as methylation profiles frequently contain missing entries due to limited coverage and experimental artifacts. Our work challenges these favorable scenarios and introduces ReMAC, an extension of ReMasker designed to tackle classification tasks on high-dimensional data under complete and incomplete regimes. Evaluation on real clinical data demonstrates that ReMAC achieves strong and robust performance compared to competing classification methods in the stratification of ST. Code is available: this https URL.

信息检索

[IR-0] Kleinkram: Open Robotic Data Management KR

链接: https://arxiv.org/abs/2511.20492
作者: Cyrill Püntener,Johann Schwabe,Dominique Garmier,Jonas Frey,Marco Hutter
类目: Robotics (cs.RO); Information Retrieval (cs.IR)
*备注: for associated source code, see this https URL

点击查看摘要

Abstract:We introduce Kleinkram, a free and open-source system designed to solve the challenge of managing massive, unstructured robotic datasets. Designed as a modular, on-premises cloud solution, Kleinkram enables scalable storage, indexing, and sharing of datasets, ranging from individual experiments to large-scale research collections. Kleinkram natively integrates with standard formats such as ROS bags and MCAP and utilises S3-compatible storage for flexibility. Beyond storage, Kleinkram features an integrated “Action Runner” that executes customizable Docker-based workflows for data validation, curation, and benchmarking. Kleinkram has successfully managed over 30 TB of data from diverse robotic systems, streamlining the research lifecycle through a modern web interface and a robust Command Line Interface (CLI).

[IR-1] HHFT: Hierarchical Heterogeneous Feature Transformer for Recommendation Systems

链接: https://arxiv.org/abs/2511.20235
作者: Liren Yu,Wenming Zhang,Silu Zhou,Zhixuan Zhang,Dan Ou
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:We propose HHFT (Hierarchical Heterogeneous Feature Transformer), a Transformer-based architecture tailored for industrial CTR prediction. HHFT addresses the limitations of DNN through three key designs: (1) Semantic Feature Partitioning: Grouping heterogeneous features (e.g. user profile, item information, behaviour sequennce) into semantically coherent blocks to preserve domain-specific information; (2) Heterogeneous Transformer Encoder: Adopting block-specific QKV projections and FFNs to avoid semantic confusion between distinct feature types; (3) Hiformer Layer: Capturing high-order interactions across features. Our findings reveal that Transformers significantly outperform DNN baselines, achieving a +0.4% improvement in CTR AUC at scale. We have successfully deployed the model on Taobao’s production platform, observing a significant uplift in key business metrics, including a +0.6% increase in Gross Merchandise Value (GMV).

[IR-2] HKRAG : Holistic Knowledge Retrieval-Augmented Generation Over Visually-Rich Documents

链接: https://arxiv.org/abs/2511.20227
作者: Anyang Tong,Xiang Niu,ZhiPing Liu,Chang Tian,Yanyan Wei,Zenglin Shi,Meng Wang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Existing multimodal Retrieval-Augmented Generation (RAG) methods for visually rich documents (VRD) are often biased towards retrieving salient knowledge(e.g., prominent text and visual elements), while largely neglecting the critical fine-print knowledge(e.g., small text, contextual details). This limitation leads to incomplete retrieval and compromises the generator’s ability to produce accurate and comprehensive answers. To bridge this gap, we propose HKRAG, a new holistic RAG framework designed to explicitly capture and integrate both knowledge types. Our framework features two key components: (1) a Hybrid Masking-based Holistic Retriever that employs explicit masking strategies to separately model salient and fine-print knowledge, ensuring a query-relevant holistic information retrieval; and (2) an Uncertainty-guided Agentic Generator that dynamically assesses the uncertainty of initial answers and actively decides how to integrate the two distinct knowledge streams for optimal response generation. Extensive experiments on open-domain visual question answering benchmarks show that HKRAG consistently outperforms existing methods in both zero-shot and supervised settings, demonstrating the critical importance of holistic knowledge retrieval for VRD understanding.

[IR-3] Enhancing Sequential Recommendation with World Knowledge from Large Language Models

链接: https://arxiv.org/abs/2511.20177
作者: Tianjie Dai,Xu Chen,Yunmeng Shu,Jinsong Lan,Xiaoyong Zhu,Jiangchao Yao,Bo Zheng
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Sequential Recommendation System~(SRS) has become pivotal in modern society, which predicts subsequent actions based on the user’s historical behavior. However, traditional collaborative filtering-based sequential recommendation models often lead to suboptimal performance due to the limited information of their collaborative signals. With the rapid development of LLMs, an increasing number of works have incorporated LLMs’ world knowledge into sequential recommendation. Although they achieve considerable gains, these approaches typically assume the correctness of LLM-generated results and remain susceptible to noise induced by LLM hallucinations. To overcome these limitations, we propose GRASP (Generation Augmented Retrieval with Holistic Attention for Sequential Prediction), a flexible framework that integrates generation augmented retrieval for descriptive synthesis and similarity retrieval, and holistic attention enhancement which employs multi-level attention to effectively employ LLM’s world knowledge even with hallucinations and better capture users’ dynamic interests. The retrieved similar users/items serve as auxiliary contextual information for the later holistic attention enhancement module, effectively mitigating the noisy guidance of supervision-based methods. Comprehensive evaluations on two public benchmarks and one industrial dataset reveal that GRASP consistently achieves state-of-the-art performance when integrated with diverse backbones. The code is available at: this https URL.

[IR-4] owards A Tri-View Diffusion Framework for Recommendation KDD2026

链接: https://arxiv.org/abs/2511.20122
作者: Ximing Chen,Pui Ieng Lei,Yijun Sheng,Yanyan Liu,Zhiguo Gong
类目: Information Retrieval (cs.IR)
*备注: 13 pages, 11 figures, accepted by KDD2026 (First Cycle)

点击查看摘要

Abstract:Diffusion models (DMs) have recently gained significant interest for their exceptional potential in recommendation tasks. This stems primarily from their prominent capability in distilling, modeling, and generating comprehensive user preferences. However, previous work fails to examine DMs in recommendation tasks through a rigorous lens. In this paper, we first experimentally investigate the completeness of recommender models from a thermodynamic view. We reveal that existing DM-based recommender models operate by maximizing the energy, while classic recommender models operate by reducing the entropy. Based on this finding, we propose a minimalistic diffusion framework that incorporates both factors via the maximization of Helmholtz free energy. Meanwhile, to foster the optimization, our reverse process is armed with a well-designed denoiser to maintain the inherent anisotropy, which measures the user-item cross-correlation in the context of bipartite graphs. Finally, we adopt an Acceptance-Rejection Gumbel Sampling Process (AR-GSP) to prioritize the far-outnumbered unobserved interactions for model robustness. AR-GSP integrates an acceptance-rejection sampling to ensure high-quality hard negative samples for general recommendation tasks, and a timestep-dependent Gumbel Softmax to handle an adaptive sampling strategy for diffusion models. Theoretical analyses and extensive experiments demonstrate that our proposed framework has distinct superiority over baselines in terms of accuracy and efficiency.

[IR-5] Invisible in Search? Auditing Aesthetic Bias in the Visual Representation of Holocaust Victims on Google

链接: https://arxiv.org/abs/2511.20036
作者: Mykola Makhortykh,Tobias Rohrbach,Maryna Sydorova
类目: Computers and Society (cs.CY); Information Retrieval (cs.IR)
*备注: 22 pages

点击查看摘要

Abstract:Information retrieval systems, such as search engines, increasingly shape the representation of the past and present states of social reality. Despite their importance, these systems face challenges in dealing with the ethical aspects of representation due to various forms of bias, including aesthetic bias that perpetuates hegemonic patterns of representation. While most research on aesthetic bias has examined it in the context of current societal issues, it is also crucial for historical representation, particularly of sensitive subjects such as historical atrocities. To address this gap, we conduct a comparative audit of the visual representation of Holocaust victims on Google. We find that Google tends to propagate a male-dominated representation of Holocaust victims with an emphasis on atrocity context, risking rendering invisible gender-specific suffering and decreasing potential for nurturing empathy. We also observe a variation in representation across geographic locations, suggesting that search algorithms may produce their own aesthetic of victimhood.

[IR-6] Adaptive Knowledge Transfer for Cross-Disciplinary Cold-Start Knowledge Tracing

链接: https://arxiv.org/abs/2511.20009
作者: Yulong Deng,Zheng Guan,Min He,Xue Wang,Jie Liu,Zheng Li
类目: Information Retrieval (cs.IR)
*备注: 10 pages, 5 figures

点击查看摘要

Abstract:Cross-Disciplinary Cold-start Knowledge Tracing (CDCKT) faces a critical challenge: insufficient student interaction data in the target discipline prevents effective knowledge state modeling and performance prediction. Existing cross-disciplinary methods rely on overlapping entities between disciplines for knowledge transfer through simple mapping functions, but suffer from two key limitations: (1) overlapping entities are scarce in real-world scenarios, and (2) simple mappings inadequately capture cross-disciplinary knowledge complexity. To overcome these challenges, we propose Mixed of Experts and Adversarial Generative Network-based Cross-disciplinary Cold-start Knowledge Tracing Framework. Our approach consists of three key components: First, we pre-train a source discipline model and cluster student knowledge states into K categories. Second, these cluster attributes guide a mixture-of-experts network through a gating mechanism, serving as a cross-domain mapping bridge. Third, an adversarial discriminator enforces feature separation by pulling same-attribute student features closer while pushing different-attribute features apart, effectively mitigating small-sample limitations. We validate our method’s effectiveness across 20 extreme cross-disciplinary cold-start scenarios.

[IR-7] he 2nd Workshop on Human-Centered Recommender Systems

链接: https://arxiv.org/abs/2511.19979
作者: Kaike Zhang,Jiakai Tang,Du Su,Shuchang Liu,Julian McAuley,Lina Yao,Qi Cao,Yue Feng,Fei Sun
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recommender systems shape how people discover information, form opinions, and connect with society. Yet, as their influence grows, traditional metrics, e.g., accuracy, clicks, and engagement, no longer capture what truly matters to humans. The workshop on Human-Centered Recommender Systems (HCRS) calls for a paradigm shift from optimizing engagement toward designing systems that truly understand, involve, and benefit people. It brings together researchers in recommender systems, human-computer interaction, AI safety, and social computing to explore how human values, e.g., trust, safety, fairness, transparency, and well-being, can be integrated into recommendation processes. Centered around three thematic axes-Human Understanding, Human Involvement, and Human Impact-HCRS features keynotes, panels, and papers covering topics from LLM-based interactive recommenders to societal welfare optimization. By fostering interdisciplinary collaboration, HCRS aims to shape the next decade of responsible and human-aligned recommendation research.

[IR-8] SCoTER: Structured Chain-of-Thought Transfer for Enhanced Recommendation

链接: https://arxiv.org/abs/2511.19514
作者: Yang Wu,Qian Li,Yuling Xiong,Hongbo Tang,Xun Liu,Jun Zhang,Huan Yu,Jie Jiang,Hailong Shi
类目: Information Retrieval (cs.IR)
*备注: 12 pages,4 figures

点击查看摘要

Abstract:Harnessing the reasoning power of Large Language Models (LLMs) for recommender systems is hindered by two fundamental challenges. First, current approaches lack a mechanism for automated, data-driven discovery of effective reasoning patterns, relying instead on brittle manual templates or unstable zero-shot prompting. Second, they employ structure-collapsing integration: direct prompting incurs prohibitive online inference costs, while feature extraction collapses reasoning chains into single vectors, discarding stepwise logic. To address these challenges, we propose SCoTER (Structured Chain-of-Thought Transfer for Enhanced Recommendation), a unified framework that treats pattern discovery and structure-aware transfer as a jointly optimized problem. Specifically, SCoTER operationalizes this through two synergistic components: a GVM pipeline for automated pattern discovery and a structure-preserving integration architecture that transfers stepwise logic to efficient models. Formally, we provide information-theoretic justification proving that structure-preserving transfer achieves tighter performance bounds than structure-agnostic alternatives. Empirically, experiments on four benchmarks demonstrate improvements of 3.75%-11.59% over a strong TIGER backbone. Moreover, in production deployment on the Tencent Advertising Platform, SCoTER achieved a 2.14% lift in Gross Merchandise Value (GMV) while eliminating online LLM inference costs. Overall, SCoTER establishes a principled and production-validated blueprint for transferring structured LLM reasoning to large-scale recommender systems.

附件下载

点击下载今日全部论文列表