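This post lists the latest papers retrieved from arXiv.org on 2025-11-17. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. To receive the daily paper data by email, leave your email address in the comments.
Note: the daily paper data is fetched from arXiv.org and refreshed automatically at around 12:00 every day.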
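Contents
Overview (2025-11-17)
516 papers were updated today, including:
- Natural Language Processing: 91 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 150 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 135 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 123 papers (Machine Learning (cs.LG))
Natural Language Processing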
[NLP-0] Optimizing Mixture of Block Attention
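[Quick read]: This paper addresses two gaps in Mixture of Block Attention (MoBA): it is efficient for long contexts, but the mechanism behind its performance is poorly understood and it lacks an efficient GPU implementation. First, the authors build a statistical model showing that MoBA's performance hinges on the router's ability to separate relevant from irrelevant blocks by query-key affinity. This points to two improvements that raise routing accuracy: smaller block sizes, and a short convolution on the keys to cluster relevant signals. Second, because small blocks are inefficient on GPUs, they design FlashMoBA, a hardware-aware CUDA kernel that runs MoBA efficiently at the small block sizes the theory recommends. LLMs trained from scratch with the improved MoBA match dense-attention baselines, and FlashMoBA is up to 14.7x faster than FlashAttention-2 for small blocks.
Link: https://arxiv.org/abs/2511.11571
Authors: Guangxuan Xiao, Junxian Guo, Kasra Mazaheri, Song Han
Affiliations: MIT; NVIDIA
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: The first two authors contributed equally to this work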
Abstract:Mixture of Block Attention (MoBA) (Lu et al., 2025) is a promising building block for efficiently processing long contexts in LLMs by enabling queries to sparsely attend to a small subset of key-value blocks, drastically reducing computational cost. However, the design principles governing MoBA’s performance are poorly understood, and it lacks an efficient GPU implementation, hindering its practical adoption. In this paper, we first develop a statistical model to analyze MoBA’s underlying mechanics. Our model reveals that performance critically depends on the router’s ability to accurately distinguish relevant from irrelevant blocks based on query-key affinities. We derive a signal-to-noise ratio that formally connects architectural parameters to this retrieval accuracy. Guided by our analysis, we identify two key pathways for improvement: using smaller block sizes and applying a short convolution on keys to cluster relevant signals, which enhances routing accuracy. While theoretically better, small block sizes are inefficient on GPUs. To bridge this gap, we introduce FlashMoBA, a hardware-aware CUDA kernel that enables efficient MoBA execution even with the small block sizes our theory recommends. We validate our insights by training LLMs from scratch, showing that our improved MoBA models match the performance of dense attention baselines. FlashMoBA achieves up to 14.7x speedup over FlashAttention-2 for small blocks, making our theoretically-grounded improvements practical. Code is available at: this https URL.
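The block-routing idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the paper's implementation: scoring each block by the dot product between the query and the mean of the block's keys is an assumption about how the router works, the function names are hypothetical, and causal masking, the short convolution on keys, and the FlashMoBA kernel are all omitted.

```python
import math

def mean_pool_blocks(keys, block_size):
    """Average the key vectors inside each block to get one centroid per block."""
    centroids = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        dim = len(block[0])
        centroids.append([sum(k[d] for k in block) / len(block) for d in range(dim)])
    return centroids

def route_query(query, keys, block_size, top_k):
    """Return the indices of the top_k blocks whose centroid best matches the query.

    Assumption: affinity is the dot product between query and block centroid.
    """
    centroids = mean_pool_blocks(keys, block_size)
    scores = [sum(q * c for q, c in zip(query, centroid)) for centroid in centroids]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])

def sparse_attention(query, keys, values, block_size, top_k):
    """Softmax attention restricted to the tokens of the routed blocks."""
    selected = route_query(query, keys, block_size, top_k)
    token_ids = [
        t
        for b in selected
        for t in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, keys[t])) / scale for t in token_ids]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
            for d in range(dim)]

# Toy data: three blocks of two tokens each.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [-1.0, 0.0], [-1.0, 0.0]]
values = [[1.0], [1.0], [2.0], [2.0], [3.0], [3.0]]
query = [1.0, 0.0]
chosen = route_query(query, keys, block_size=2, top_k=1)
output = sparse_attention(query, keys, values, block_size=2, top_k=1)
```

With a smaller `block_size`, each centroid averages over fewer keys, which is one way to picture why the abstract links small blocks to more accurate routing.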
[NLP-1] PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning
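[Quick read]: This paper addresses the weakness of current frontier-model evaluation in real professional settings: in high-stakes domains such as Finance and Legal, existing academic benchmarks fail to measure how well generative AI handles open-ended, economically consequential tasks. The authors build and open-source Professional Reasoning Bench (PRBench), a benchmark designed by 182 professionals holding JDs, CFAs, or 6+ years of experience. It contains 1,100 real-world tasks and 19,356 expert-curated rubric criteria, and spans 114 countries and 47 US jurisdictions. A rigorous expert validation pipeline safeguards rubric quality. The result is a quantifiable, reproducible, practice-oriented benchmark for professional reasoning, which reveals clear deficiencies in current leading models: inaccurate judgments, a lack of process transparency, and incomplete reasoning.
Link: https://arxiv.org/abs/2511.11562
Authors: Afra Feyza Akyürek, Advait Gosai, Chen Bo Calvin Zhang, Vipul Gupta, Jaehwan Jeong, Anisha Gunjal, Tahseen Rabbani, Maria Mazzone, David Randolph, Mohammad Mahmoudi Meymand, Gurshaan Chattha, Paula Rodriguez, Diego Mares, Pavit Singh, Michael Liu, Subodh Chawla, Pete Cline, Lucy Ogaz, Ernesto Hernandez, Zihao Wang, Pavi Bhatter, Marcos Ayestaran, Bing Liu, Yunzhong He
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: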
Abstract:Frontier model progress is often measured by academic benchmarks, which offer a limited view of performance in real-world professional contexts. Existing evaluations often fail to assess open-ended, economically consequential tasks in high-stakes domains like Legal and Finance, where practical returns are paramount. To address this, we introduce Professional Reasoning Bench (PRBench), a realistic, open-ended, and difficult benchmark of real-world problems in Finance and Law. We open-source its 1,100 expert-authored tasks and 19,356 expert-curated criteria, making it, to our knowledge, the largest public, rubric-based benchmark for both legal and finance domains. We recruit 182 qualified professionals, holding JDs, CFAs, or 6+ years of experience, who contributed tasks inspired by their actual workflows. This process yields significant diversity, with tasks spanning 114 countries and 47 US jurisdictions. Our expert-curated rubrics are validated through a rigorous quality pipeline, including independent expert validation. Subsequent evaluation of 20 leading models reveals substantial room for improvement, with top scores of only 0.39 (Finance) and 0.37 (Legal) on our Hard subsets. We further catalog associated economic impacts of the prompts and analyze performance using human-annotated rubric categories. Our analysis shows that models with similar overall scores can diverge significantly on specific capabilities. Common failure modes include inaccurate judgments, a lack of process transparency and incomplete reasoning, highlighting critical gaps in their reliability for professional adoption.
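[NLP-2] DocLens: A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding
[Quick read]: This paper targets evidence localization in long visual document understanding, where information is scattered across many pages of text and visual elements. Existing Vision-Language Models (VLMs) struggle to retrieve the relevant pages and overlook fine-grained details inside visual elements, which limits performance and causes hallucination. The proposed solution is DocLens, a tool-augmented multi-agent framework that "zooms in" on evidence: it navigates step by step from the full document to specific visual elements on relevant pages, then uses a sampling-adjudication mechanism to produce a single, reliable answer. The gains are most pronounced on vision-centric and unanswerable queries.
Link: https://arxiv.org/abs/2511.11552
Authors: Dawei Zhu, Rui Meng, Jiefeng Chen, Sujian Li, Tomas Pfister, Jinsung Yoon
Affiliations: Google Cloud AI Research; School of Computer Science, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: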
Abstract:Comprehending long visual documents, where information is distributed across extensive pages of text and visual elements, is a critical but challenging task for modern Vision-Language Models (VLMs). Existing approaches falter on a fundamental challenge: evidence localization. They struggle to retrieve relevant pages and overlook fine-grained details within visual elements, leading to limited performance and model hallucination. To address this, we propose DocLens, a tool-augmented multi-agent framework that effectively "zooms in" on evidence like a lens. It first navigates from the full document to specific visual elements on relevant pages, then employs a sampling-adjudication mechanism to generate a single, reliable answer. Paired with Gemini-2.5-Pro, DocLens achieves state-of-the-art performance on MMLongBench-Doc and FinRAGBench-V, surpassing even human experts. The framework’s superiority is particularly evident on vision-centric and unanswerable queries, demonstrating the power of its enhanced localization capabilities.
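[NLP-3] Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping AAAI2026
[Quick read]: This paper asks how decision-making AI agents operating in complex, dynamic environments can stay aligned with human values or ethical guidelines without being retrained. Current methods face a trade-off between reward maximization and alignment. For pre-trained agents in particular, retraining is costly and slow, and the ethical attributes involved are diverse and may conflict. The proposed solution is a test-time alignment technique based on model-guided policy shaping: scenario-action attribute classifiers give precise control over individual behavioral attributes, so the agent's decisions can be adjusted at test time without changing the original agent, allowing a principled trade-off between ethical alignment and reward. The method is evaluated on the MACHIAVELLI benchmark (134 text-based games and thousands of scenarios annotated for ethical decisions), where it generalizes across environments and mitigates several types of ethical violations and power-seeking behavior.
Link: https://arxiv.org/abs/2511.11551
Authors: Dena Mujtaba, Brian Hu, Anthony Hoogs, Arslan Basharat
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to AAAI 2026 AI Alignment Track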
Abstract:The deployment of decision-making AI agents presents a critical challenge in maintaining alignment with human values or guidelines while operating in complex, dynamic environments. Agents trained solely to achieve their objectives may adopt harmful behavior, exposing a key trade-off between maximizing the reward function and maintaining the alignment. For the pre-trained agents, ensuring alignment is particularly challenging, as retraining can be a costly and slow process. This is further complicated by the diverse and potentially conflicting attributes representing the ethical values for alignment. To address these challenges, we propose a test-time alignment technique based on model-guided policy shaping. Our method allows precise control over individual behavioral attributes, generalizes across diverse reinforcement learning (RL) environments, and facilitates a principled trade-off between ethical alignment and reward maximization without requiring agent retraining. We evaluate our approach using the MACHIAVELLI benchmark, which comprises 134 text-based game environments and thousands of annotated scenarios involving ethical decisions. The RL agents are first trained to maximize the reward in their respective games. At test time, we apply policy shaping via scenario-action attribute classifiers to ensure decision alignment with ethical attributes. We compare our approach against prior training-time methods and general-purpose agents, as well as study several types of ethical violations and power-seeking behavior. Our results demonstrate that test-time policy shaping provides an effective and scalable solution for mitigating unethical behavior across diverse environments and alignment attributes.
[NLP-4] W2S-AlignTree: Weak-to-Strong Inference-Time Alignment for Large Language Models via Monte Carlo Tree Search AAAI2026
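[Quick read]: This paper addresses the mismatch between the outputs of Large Language Models (LLMs) and human preferences. It focuses on the limitations of training-time alignment methods such as Reinforcement Learning from Human Feedback (RLHF): high expert-annotation cost, poor scalability, and a lack of fine-grained control at inference time. The proposed W2S-AlignTree framework is the first to combine Monte Carlo Tree Search (MCTS) with the Weak-to-Strong Generalization paradigm, casting LLM alignment as an optimal heuristic search problem over a generative search tree. It uses a weak model's real-time, step-level signals as alignment proxies and adds an entropy-aware exploration mechanism, guiding the strong model's generation at inference time without modifying its parameters and dynamically balancing exploration and exploitation in the high-dimensional generation space.
Link: https://arxiv.org/abs/2511.11518
Authors: Zhenyu Ding, Yuhao Wang, Tengyue Xiao, Haoying Wang, Guojun Ma, Mingyang Wan, Caigui Jiang, Ning Ding
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: AAAI 2026 Oral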
Abstract:Large Language Models (LLMs) demonstrate impressive capabilities, yet their outputs often suffer from misalignment with human preferences due to the inadequacy of weak supervision and a lack of fine-grained control. Training-time alignment methods like Reinforcement Learning from Human Feedback (RLHF) face prohibitive costs in expert supervision and inherent scalability limitations, offering limited dynamic control during inference. Consequently, there is an urgent need for scalable and adaptable alignment mechanisms. To address this, we propose W2S-AlignTree, a pioneering plug-and-play inference-time alignment framework that synergistically combines Monte Carlo Tree Search (MCTS) with the Weak-to-Strong Generalization paradigm for the first time. W2S-AlignTree formulates LLM alignment as an optimal heuristic search problem within a generative search tree. By leveraging weak model’s real-time, step-level signals as alignment proxies and introducing an Entropy-Aware exploration mechanism, W2S-AlignTree enables fine-grained guidance during strong model’s generation without modifying its parameters. The approach dynamically balances exploration and exploitation in high-dimensional generation search trees. Experiments across controlled sentiment generation, summarization, and instruction-following show that W2S-AlignTree consistently outperforms strong baselines. Notably, W2S-AlignTree raises the performance of Llama3-8B from 1.89 to 2.19, a relative improvement of 15.9% on the summarization task.
[NLP-5] Proactive Hearing Assistants that Isolate Egocentric Conversations EMNLP2025
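[Quick read]: This paper addresses the inability of conventional hearing devices to automatically identify and isolate the wearer's conversation partners in multi-speaker scenes, with the goal of improving speech intelligibility and interaction in complex acoustic environments. The core solution is a proactive hearing assistant that uses the wearer's self-speech as an anchor and combines turn-taking behavior with dialogue dynamics to infer conversation partners and suppress other voices. The key design is a dual-model architecture: a lightweight streaming model runs every 12.5 ms for low-latency extraction of the conversation partners, while a slower model captures longer-range conversational structure. Together they enable real-time on-device processing, and the system generalizes to real-world multi-speaker settings.
Link: https://arxiv.org/abs/2511.11473
Authors: Guilin Hu, Malek Itani, Tuochao Chen, Shyamnath Gollakota
Affiliations: University of Washington
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted at EMNLP 2025 Main Conference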
Abstract:We introduce proactive hearing assistants that automatically identify and separate the wearer’s conversation partners, without requiring explicit prompts. Our system operates on egocentric binaural audio and uses the wearer’s self-speech as an anchor, leveraging turn-taking behavior and dialogue dynamics to infer conversational partners and suppress others. To enable real-time, on-device operation, we propose a dual-model architecture: a lightweight streaming model runs every 12.5 ms for low-latency extraction of the conversation partners, while a slower model runs less frequently to capture longer-range conversational dynamics. Results on real-world 2- and 3-speaker conversation test sets, collected with binaural egocentric hardware from 11 participants totaling 6.8 hours, show generalization in identifying and isolating conversational partners in multi-conversation settings. Our work marks a step toward hearing assistants that adapt proactively to conversational dynamics and engagement. More information can be found on our website: this https URL
[NLP-6] From Synthetic Scenes to Real Performance: Enhancing Spatial Reasoning in VLMs
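[Quick read]: This paper addresses the biases, annotation errors, and distribution imbalance that commonly arise when Vision-Language Models (VLMs) are fine-tuned on real-world scene data, and that lead to overfitting and uneven performance. The solution redesigns the fine-tuning process in two steps. First, the authors automatically and comprehensively sample object attributes (color, shape, size, and position) to generate a synthetic dataset that is unbiased, balanced in distribution, and accurately annotated. Second, they fine-tune state-of-the-art VLMs on this dataset and test how well performance on the absolute position task transfers to real-world data. Experiments show that fine-tuning on balanced synthetic data yields uniform performance across the visual scene and also significantly improves results on real data (COCO), outperforming models fine-tuned in the matched setting.
Link: https://arxiv.org/abs/2511.11440
Authors: Massimo Rizzoli, Simone Alghisi, Seyed Mahed Mousavi, Giuseppe Riccardi
Affiliations: University of Trento
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: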
Abstract:Fine-tuning Vision-Language Models (VLMs) is a common strategy to improve performance following an ad-hoc data collection and annotation of real-world scenes. However, this process is often prone to biases, errors, and distribution imbalance, resulting in overfitting and imbalanced performance. Although a few studies have tried to address this problem by generating synthetic data, they lacked control over distribution bias and annotation quality. To address these challenges, we redesign the fine-tuning process in two ways. First, we control the generation of data and its annotations, ensuring it is free from bias, distribution imbalance, and annotation errors. We automatically construct the dataset by comprehensively sampling objects’ attributes, including color, shape, size, and position within the scene. Secondly, using this annotated dataset, we fine-tune state-of-the-art VLMs and assess performance transferability to real-world data on the absolute position task. We conduct exhaustive evaluations on both synthetic and real-world benchmarks. Our experiments reveal two key findings: 1) fine-tuning on balanced synthetic data yields uniform performance across the visual scene and mitigates common biases; and 2) fine-tuning on synthetic stimuli significantly improves performance on real-world data (COCO), outperforming models fine-tuned in the matched setting.
[NLP-7] MajinBook: An open catalogue of digital world literature with likes
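[Quick read]: This paper addresses the difficulty of obtaining high-quality, large-scale text corpora for computational social science and cultural analytics, given that traditional digital libraries such as HathiTrust suffer from data biases and access restrictions. The solution is MajinBook, an open catalogue that links metadata from shadow libraries (such as Library Genesis and Z-Library) with structured bibliographic data from Goodreads. The result is a high-precision corpus of more than 539,000 English-language books, enriched with first publication dates, genres, and popularity metrics such as ratings and reviews. Natively digital EPUB files are prioritized to ensure machine-readable quality, improving both usability and representativeness for research.
Link: https://arxiv.org/abs/2511.11412
Authors: Antoine Mazières, Thierry Poibeau
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Other Statistics (stat.OT)
Comments: 9 pages, 5 figures, 1 table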
Abstract:This data paper introduces MajinBook, an open catalogue designed to facilitate the use of shadow libraries–such as Library Genesis and Z-Library–for computational social science and cultural analytics. By linking metadata from these vast, crowd-sourced archives with structured bibliographic data from Goodreads, we create a high-precision corpus of over 539,000 references to English-language books spanning three centuries, enriched with first publication dates, genres, and popularity metrics like ratings and reviews. Our methodology prioritizes natively digital EPUB files to ensure machine-readable quality, while addressing biases in traditional corpora like HathiTrust, and includes secondary datasets for French, German, and Spanish. We evaluate the linkage strategy for accuracy, release all underlying data openly, and discuss the project’s legal permissibility under EU and US frameworks for text and data mining in research.
[NLP-8] Studies with impossible languages falsify LMs as models of human language
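[Quick read]: This paper asks whether Language Models (LMs) share the inductive biases of human infants, that is, whether they learn attested human languages more easily than impossible languages with unnatural structures. Futrell and Mahowald argue that both infants and LMs find attested languages easier to learn. Reviewing the literature, the authors show that LMs often learn attested languages and many impossible languages equally well, and that the impossible languages that are hard to learn tend to be more complex or more random rather than merely structurally unnatural. The central claim is that LMs lack the human-specific inductive biases that support language acquisition, so their sensitivity to linguistic structure differs fundamentally from that of humans: some impossible languages are harder for LMs because they are more complex, not because they are unnatural.
Link: https://arxiv.org/abs/2511.11389
Authors: Jeffrey S. Bowers, Jeff Mitchell
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Commentary on Futrell, R., Mahowald, K. arXiv:2501.17047 (in press). How linguistics learned to stop worrying and love the language models. Behavioural and Brain Sciences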
Abstract:According to Futrell and Mahowald [arXiv:2501.17047], both infants and language models (LMs) find attested languages easier to learn than impossible languages that have unnatural structures. We review the literature and show that LMs often learn attested and many impossible languages equally well. Difficult to learn impossible languages are simply more complex (or random). LMs are missing human inductive biases that support language acquisition.
[NLP-9] On-Device Fine-Tuning via Backprop-Free Zeroth-Order Optimization
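[Quick read]: This paper addresses the fine-tuning bottleneck that device memory limits impose on edge AI systems. On resource-constrained edge devices, conventional backpropagation (BP)-based training must store per-layer activations and optimizer states, which severely limits the largest model that can be deployed. The paper adopts memory-efficient zeroth-order optimization (MeZO) instead. MeZO estimates gradients from forward evaluations alone, so no intermediate activations or optimizer states need to be stored. This lets significantly larger models fit in on-chip memory, at the cost of potentially longer fine-tuning wall-clock time.
Link: https://arxiv.org/abs/2511.11362
Authors: Prabodh Katti, Sangwoo Park, Bipin Rajendran, Osvaldo Simeone
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Conference submission; Under review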
Abstract:On-device fine-tuning is a critical capability for edge AI systems, which must support adaptation to different agentic tasks under stringent memory constraints. Conventional backpropagation (BP)-based training requires storing layer activations and optimizer states, a demand that can be only partially alleviated through checkpointing. In edge deployments in which the model weights must reside entirely in device memory, this overhead severely limits the maximum model size that can be deployed. Memory-efficient zeroth-order optimization (MeZO) alleviates this bottleneck by estimating gradients using forward evaluations alone, eliminating the need for storing intermediate activations or optimizer states. This enables significantly larger models to fit within on-chip memory, albeit at the cost of potentially longer fine-tuning wall-clock time. This paper first provides a theoretical estimate of the relative model sizes that can be accommodated under BP and MeZO training. We then numerically validate the analysis, demonstrating that MeZO exhibits accuracy advantages under on-device memory constraints, provided sufficient wall-clock time is available for fine-tuning.
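To make "estimating gradients using forward evaluations alone" concrete, here is a minimal sketch of a two-point zeroth-order update on a toy quadratic loss. It follows the general MeZO-style recipe as commonly described (perturb the parameters along a random direction, evaluate the loss twice, and regenerate the direction from a seed instead of storing it); the function names, the toy loss, and the hyperparameters are illustrative assumptions, not taken from the paper.

```python
import random

def perturb(params, seed, scale):
    """Add scale * z to params in place, where z is regenerated from the seed.

    Regenerating z avoids keeping a second parameter-sized buffer in memory.
    """
    rng = random.Random(seed)
    for i in range(len(params)):
        params[i] += scale * rng.gauss(0.0, 1.0)

def zeroth_order_step(params, loss_fn, lr, eps, seed):
    """One update that uses two forward evaluations and no backpropagation."""
    perturb(params, seed, +eps)            # theta + eps * z
    loss_plus = loss_fn(params)
    perturb(params, seed, -2.0 * eps)      # theta - eps * z
    loss_minus = loss_fn(params)
    perturb(params, seed, +eps)            # restore theta
    # Scalar estimate of the directional derivative along z.
    projected_grad = (loss_plus - loss_minus) / (2.0 * eps)
    # SGD-style update: theta <- theta - lr * projected_grad * z.
    perturb(params, seed, -lr * projected_grad)
    return projected_grad

def quadratic_loss(params):
    """Toy loss (an assumption for illustration): minimized at all-threes."""
    return sum((p - 3.0) ** 2 for p in params)

params = [0.0, 0.0, 0.0, 0.0]
initial_loss = quadratic_loss(params)
for step in range(500):
    zeroth_order_step(params, quadratic_loss, lr=0.05, eps=1e-3, seed=step)
final_loss = quadratic_loss(params)
```

The only persistent state is the parameter vector and an integer seed, which is the memory saving the abstract refers to. The price is a noisier gradient estimate, consistent with the longer wall-clock time it mentions.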
[NLP-10] M-DAIGT: A Shared Task on Multi-Domain Detection of AI-Generated Text
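[Quick read]: This paper addresses the challenge that highly fluent text generated by Large Language Models (LLMs) poses to information integrity and academic research, focusing on the detection of AI-generated text in news articles and academic writing. The authors introduce the Multi-Domain Detection of AI-Generated Text (M-DAIGT) shared task, which comprises two binary classification subtasks: News Article Detection (NAD) and Academic Writing Detection (AWD). They also build a large-scale benchmark dataset of 30,000 samples, with AI text generated by a variety of LLMs (e.g., GPT-4, Claude) and diverse prompting strategies, providing a standardized evaluation platform for cross-domain detection of AI-generated content.
Link: https://arxiv.org/abs/2511.11340
Authors: Salima Lamsiyah, Saad Ezzini, Abdelkader El Mahdaouy, Hamza Alami, Abdessamad Benlahbib, Samir El Amrany, Salmane Chafik, Hicham Hammouchi
Affiliations: University of Luxembourg, Luxembourg; King Fahd University of Petroleum and Minerals, Saudi Arabia; Mohammed VI Polytechnic University, Morocco; Sidi Mohamed Ben Abdellah University, Morocco
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: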
Abstract:The generation of highly fluent text by Large Language Models (LLMs) poses a significant challenge to information integrity and academic research. In this paper, we introduce the Multi-Domain Detection of AI-Generated Text (M-DAIGT) shared task, which focuses on detecting AI-generated text across multiple domains, particularly in news articles and academic writing. M-DAIGT comprises two binary classification subtasks: News Article Detection (NAD) (Subtask 1) and Academic Writing Detection (AWD) (Subtask 2). To support this task, we developed and released a new large-scale benchmark dataset of 30,000 samples, balanced between human-written and AI-generated texts. The AI-generated content was produced using a variety of modern LLMs (e.g., GPT-4, Claude) and diverse prompting strategies. A total of 46 unique teams registered for the shared task, of which four teams submitted final results. All four teams participated in both Subtask 1 and Subtask 2. We describe the methods employed by these participating teams and briefly discuss future directions for M-DAIGT.
[NLP-11] LaoBench: A Large-Scale Multidimensional Lao Benchmark for Large Language Models
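[Quick read]: This paper addresses the under-evaluation of Large Language Models (LLMs) in low-resource languages, especially Southeast Asian languages such as Lao. Existing LLM evaluations concentrate on high-resource languages such as English and lack a systematic measure of overall ability in low-resource languages. The solution is LaoBench, the first large-scale, high-quality, multidimensional benchmark dataset for Lao. It covers three core dimensions (knowledge application, K12 foundational education, and translation among Lao, Chinese, and English) with more than 17,000 carefully curated samples. The dataset is split into open-source and closed-source subsets, the latter supporting black-box evaluation to ensure fairness and data security. The construction pipeline combines expert human curation with automated agent-assisted verification to ensure linguistic accuracy, cultural relevance, and educational value.
Link: https://arxiv.org/abs/2511.11334
Authors: Jian Gao, Richeng Xuan, Zhaolu Kang, Dingshi Liao, Wenxin Huang, Zongmou Huang, Yangdi Xu, Bowen Qin, Zheqi He, Xi Yang, Changjin Li
Affiliations: China-ASEAN Information Harbor Co., Ltd.; Beijing Academy of Artificial Intelligence; School of Software & Microelectronics, Peking University
Categories: Computation and Language (cs.CL)
Comments: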
Abstract:The rapid advancement of large language models (LLMs) has not been matched by their evaluation in low-resource languages, especially Southeast Asian languages like Lao. To fill this gap, we introduce LaoBench, the first large-scale, high-quality, and multidimensional benchmark dataset dedicated to assessing LLMs’ comprehensive language understanding and reasoning abilities in Lao. LaoBench comprises over 17,000 carefully curated samples spanning three core dimensions: knowledge application, K12 foundational education, and bilingual translation among Lao, Chinese, and English. The dataset is divided into open-source and closed-source subsets, with the closed-source portion enabling black-box evaluation on an official platform to ensure fairness and data security. Our data construction pipeline integrates expert human curation with automated agent-assisted verification, ensuring linguistic accuracy, cultural relevance, and educational value. Benchmarking multiple state-of-the-art LLMs on LaoBench reveals that current models still face significant challenges in mastering Lao across diverse tasks. We hope LaoBench will catalyze further research and development of AI technologies for underrepresented Southeast Asian languages.
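[NLP-12] NOVA: An Agentic Framework for Automated Histopathology Analysis and Discovery
[Quick read]: This paper addresses the limited accessibility of digitized histopathology analysis, which involves complex, time-intensive workflows and specialized expertise. The solution is NOVA, an agentic framework that translates scientific queries into executable analysis pipelines by iteratively generating and running Python code. NOVA integrates 49 domain-specific tools built on open-source software (e.g., nuclei segmentation, whole-slide encoding) and can also create new tools ad hoc, improving the efficiency and scalability of pathology analysis.
Link: https://arxiv.org/abs/2511.11324
Authors: Anurag J. Vaidya, Felix Meissen, Daniel C. Castro, Shruthi Bannur, Tristan Lazard, Drew F. K. Williamson, Faisal Mahmood, Javier Alvarez-Valle, Stephanie L. Hyland, Kenza Bouzid
Affiliations: Mass General Brigham; Microsoft Health Futures; Emory University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: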
Abstract:Digitized histopathology analysis involves complex, time-intensive workflows and specialized expertise, limiting its accessibility. We introduce NOVA, an agentic framework that translates scientific queries into executable analysis pipelines by iteratively generating and running Python code. NOVA integrates 49 domain-specific tools (e.g., nuclei segmentation, whole-slide encoding) built on open-source software, and can also create new tools ad hoc. To evaluate such systems, we present SlideQuest, a 90-question benchmark – verified by pathologists and biomedical scientists – spanning data processing, quantitative analysis, and hypothesis testing. Unlike prior biomedical benchmarks focused on knowledge recall or diagnostic QA, SlideQuest demands multi-step reasoning, iterative coding, and computational problem solving. Quantitative evaluation shows NOVA outperforms coding-agent baselines, and a pathologist-verified case study links morphology to prognostically relevant PAM50 subtypes, demonstrating its scalable discovery potential.
[NLP-13] LAET: A Layer-wise Adaptive Ensemble Tuning Framework for Pretrained Language Models
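[Quick read]: This paper addresses the high computational demands that limit real-world deployment of Large Language Models (LLMs) in finance: the models perform well but are out of reach for many organizations. The solution is Layer-wise Adaptive Ensemble Tuning (LAET), a strategy that analyzes the hidden state representations of a pre-trained model, selectively fine-tunes the layers most effective for a given task, and freezes the less critical layers. This significantly reduces computational overhead. On financial Natural Language Processing (NLP) tasks, LAET outperforms existing benchmarks and state-of-the-art models such as GPT-4, even with smaller models of about 3B parameters.
Link: https://arxiv.org/abs/2511.11315
Authors: Jawad Ibn Ahad, Muhammad Rafsan Kabir, Robin Krambroeckers, Sifat Momen, Nabeel Mohammed, Shafin Rahman
Affiliations: Robotbulls.com; North South University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: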
Abstract:Natural Language Processing (NLP) has transformed the financial industry, enabling advancements in areas such as textual analysis, risk management, and forecasting. Large language models (LLMs) like BloombergGPT and FinMA have set new benchmarks across various financial NLP tasks, including sentiment analysis, stock movement prediction, and credit risk assessment. Furthermore, FinMA-ES, a bilingual financial LLM, has also demonstrated strong performance using the FLARE and FLARE-ES benchmarks. However, the high computational demands of these models limit the accessibility of many organizations. To address this, we propose Layer-wise Adaptive Ensemble Tuning (LAET), a novel strategy that selectively fine-tunes the most effective layers of pre-trained LLMs by analyzing hidden state representations while freezing less critical layers. LAET significantly reduces computational overhead while enhancing task-specific performance. Our approach shows strong results in financial NLP tasks, outperforming existing benchmarks and state-of-the-art LLMs such as GPT-4, even with smaller LLMs (~3B parameters). This work bridges cutting-edge financial NLP research and real-world deployment with efficient and scalable models for financial applications.
[NLP-14] destroR: Attacking Transfer Models with Obfuscous Examples to Discard Perplexity
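[Quick read]: This paper addresses the vulnerability of current state-of-the-art machine learning models to adversarial attacks, in particular attacks that mislead a model with ambiguous, high-perplexity inputs and thereby expose its lack of robustness. The solution is a new adversarial attack strategy that uses machine learning and deep learning methods to construct highly confusing adversarial examples, and that brings the Bangla language into adversarial attack research. The authors also emphasize keeping the attack practical and efficient, and present the work as a path toward more robust models.
Link: https://arxiv.org/abs/2511.11309
Authors: Saadat Rafid Ahmed, Rubayet Shareen, Radoan Sharkar, Nazia Hossain, Mansur Mahi, Farig Yousuf Sadeque
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 2 figures, 6 Table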
Abstract:Advancements in Machine Learning Neural Networks in recent years have led to widespread implementations of Natural Language Processing across a variety of fields with remarkable success, solving a wide range of complicated problems. However, recent research has shown that machine learning models may be vulnerable in a number of ways, putting both the models and the systems they're used in at risk. In this paper, we intend to analyze and experiment with the best of existing adversarial attack recipes and create new ones. We concentrated on developing a novel adversarial attack strategy on current state-of-the-art machine learning models by producing ambiguous inputs for the models to confound them and then constructing the path to the future development of the robustness of the models. We will develop adversarial instances with maximum perplexity, utilizing machine learning and deep learning approaches in order to trick the models. In our attack recipe, we will analyze several datasets and focus on creating obfuscous adversary examples to put the models in a state of perplexity, and by including the Bangla Language in the field of adversarial attacks. We strictly uphold utility usage reduction and efficiency throughout our work.
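[NLP-15] iMAD: Intelligent Multi-Agent Debate for Efficient and Accurate LLM Inference AAAI2026
[Quick read]: This paper addresses the core practical weakness of Multi-Agent Debate (MAD): triggering MAD for every query incurs a high computational (token) cost and can even reduce accuracy by overturning single-agent answers that were already correct. The solution is iMAD (intelligent Multi-Agent Debate), a decision-based trigger mechanism. A single agent first produces a structured self-critique response, from which 41 interpretable linguistic and semantic features capturing hesitation cues are extracted. A lightweight debate-decision classifier, trained with the authors' proposed FocusCal loss, then decides whether to trigger MAD, so that debate runs only when it is likely to correct an initially wrong answer. This cuts token usage by up to 92% and improves final answer accuracy by up to 13.5%.
Link: https://arxiv.org/abs/2511.11306
Authors: Wei Fan, JinYi Yoon, Bo Ji
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted in AAAI 2026 (Oral)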
Abstract:Large Language Model (LLM) agent systems have advanced rapidly, driven by their strong generalization in zero-shot settings. To further enhance reasoning and accuracy on complex tasks, Multi-Agent Debate (MAD) has emerged as a promising framework that engages multiple LLM agents in structured debates to encourage diverse reasoning. However, triggering MAD for every query is inefficient, as it incurs substantial computational (token) cost and may even degrade accuracy by overturning correct single-agent answers. To address these limitations, we propose intelligent Multi-Agent Debate (iMAD), a token-efficient framework that selectively triggers MAD only when it is likely to be beneficial (i.e., correcting an initially wrong answer). To achieve this goal, iMAD learns generalizable model behaviors to make accurate debate decisions. Specifically, iMAD first prompts a single agent to produce a structured self-critique response, from which we extract 41 interpretable linguistic and semantic features capturing hesitation cues. Then, iMAD uses a lightweight debate-decision classifier, trained using our proposed FocusCal loss, to determine whether to trigger MAD, enabling robust debate decisions without test dataset-specific tuning. Through extensive experiments using six (visual) question answering datasets against five competitive baselines, we have shown that iMAD significantly reduces token usage (by up to 92%) while also improving final answer accuracy (by up to 13.5%).
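The selective-trigger idea can be pictured with a toy gate. Everything in the sketch below is a hypothetical stand-in: it uses two hand-made hesitation features and fixed logistic weights, whereas the paper extracts 41 features and trains its classifier with the FocusCal loss, neither of which is reproduced here. The `single_agent` and `debate` callables are also assumptions introduced only for illustration.

```python
import math

# Hypothetical hedge phrases; the paper's actual feature set is not given here.
HEDGE_PHRASES = ("maybe", "possibly", "not sure", "uncertain", "might")

def hesitation_features(self_critique):
    """Return [hedge phrase count, hedge phrases per word] for a self-critique."""
    text = self_critique.lower()
    hedge_count = sum(text.count(phrase) for phrase in HEDGE_PHRASES)
    word_count = max(len(text.split()), 1)
    return [float(hedge_count), hedge_count / word_count]

def should_debate(self_critique, weights=(1.5, 10.0), bias=-2.0, threshold=0.5):
    """Logistic gate with illustrative, untrained weights."""
    features = hesitation_features(self_critique)
    logit = bias + sum(w * f for w, f in zip(weights, features))
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability >= threshold

def answer(query, single_agent, debate):
    """Run the expensive debate only when the gate predicts it may help."""
    initial_answer, self_critique = single_agent(query)
    if should_debate(self_critique):
        return debate(query)
    return initial_answer

confident = "The answer is 42. I verified each step."
hesitant = "Maybe 42, but I am not sure; it might be 41."
gate_confident = should_debate(confident)
gate_hesitant = should_debate(hesitant)
```

The token saving comes from the early return in `answer`: queries whose self-critique looks confident never reach the multi-agent debate.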
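[NLP-16] Building the Web for Agents: A Declarative Framework for Agent-Web Interaction
[Quick read]: This paper addresses a fundamental misalignment in deploying autonomous AI agents on the web: agents must infer affordances from user interfaces designed for humans, which makes interactions brittle, inefficient, and insecure. The solution is VOIX, a web-native framework in which websites expose reliable, auditable, and privacy-preserving capabilities to AI agents through simple, declarative HTML elements. VOIX introduces tool tags and context tags that let developers explicitly define the available actions and the relevant state, creating a clear, machine-readable contract for agent behavior. This shifts control to the website developer while preserving user privacy by disconnecting the conversational interactions from the website.
Link: https://arxiv.org/abs/2511.11287
Authors: Sven Schultze, Meike Verena Kietzmann, Nils-Lucas Schönfeld, Ruth Stock-Homburg
Affiliations: Technical University of Darmstadt
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: for associated documentation, see this https URL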
Abstract:The increasing deployment of autonomous AI agents on the web is hampered by a fundamental misalignment: agents must infer affordances from human-oriented user interfaces, leading to brittle, inefficient, and insecure interactions. To address this, we introduce VOIX, a web-native framework that enables websites to expose reliable, auditable, and privacy-preserving capabilities for AI agents through simple, declarative HTML elements. VOIX introduces tool and context tags, allowing developers to explicitly define available actions and relevant state, thereby creating a clear, machine-readable contract for agent behavior. This approach shifts control to the website developer while preserving user privacy by disconnecting the conversational interactions from the website. We evaluated the framework’s practicality, learnability, and expressiveness in a three-day hackathon study with 16 developers. The results demonstrate that participants, regardless of prior experience, were able to rapidly build diverse and functional agent-enabled web applications. Ultimately, this work provides a foundational mechanism for realizing the Agentic Web, enabling a future of seamless and secure human-AI collaboration on the web.
[NLP-17] Language-Aided State Estimation
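Summary: This paper addresses state estimation for physical systems in which humans act as sensing agents and supply observations as natural language data such as text and speech. Conventional state estimation methods cannot readily use such unstructured human descriptions. The paper proposes a Language-Aided Particle Filter (LAPF), whose key idea is to structure human observations with natural language processing (NLP) and incorporate the result into the update step of the particle filter. The method is applied to water level estimation in an irrigation canal to demonstrate its effectiveness.
Link: https://arxiv.org/abs/2511.11285
Authors: Yuki Miyoshi,Masaki Inoue,Yusuke Fujimoto
Affiliations: Keio University; The University of Osaka
Categories: Systems and Control (eess.SY); Computation and Language (cs.CL)
Comments: 7 pages, 5 figures, submitted to IFAC World Congress 2026 with Journal option (IFAC Journal of Systems and Control)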
Abstract:Natural language data, such as text and speech, have become readily available through social networking services and chat platforms. By leveraging human observations expressed in natural language, this paper addresses the problem of state estimation for physical systems, in which humans act as sensing agents. To this end, we propose a Language-Aided Particle Filter (LAPF), a particle filter framework that structures human observations via natural language processing and incorporates them into the update step of the state estimation. Finally, the LAPF is applied to the water level estimation problem in an irrigation canal and its effectiveness is demonstrated.
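The update step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword lookup standing in for the NLP step, the three water-level categories, the thresholds, and the likelihood values are all assumptions chosen for illustration.

```python
import random

# Hypothetical mapping from a phrase to a coarse water-level category.
KEYWORDS = {"almost dry": "low", "normal": "medium", "overflowing": "high"}

def categorize(text):
    """Toy stand-in for the NLP step: a keyword lookup, not a language model."""
    lowered = text.lower()
    for phrase, label in KEYWORDS.items():
        if phrase in lowered:
            return label
    return None

def likelihood(label, level):
    """Assumed p(label | water level in metres); thresholds are illustrative."""
    true_label = "low" if level < 1.0 else "medium" if level < 2.0 else "high"
    return 0.8 if label == true_label else 0.1

def update(particles, weights, text):
    """Particle filter measurement update driven by a language observation."""
    label = categorize(text)
    if label is None:  # uninformative report: leave the weights unchanged
        return weights
    new = [w * likelihood(label, x) for x, w in zip(particles, weights)]
    total = sum(new)
    return [w / total for w in new]

def estimate(particles, weights):
    """Weighted mean of the particles."""
    return sum(x * w for x, w in zip(particles, weights))

rng = random.Random(0)
particles = [rng.uniform(0.0, 3.0) for _ in range(1000)]
weights = [1.0 / len(particles)] * len(particles)
prior_mean = estimate(particles, weights)
weights = update(particles, weights, "The canal is overflowing near the gate")
posterior_mean = estimate(particles, weights)
```

After a report containing "overflowing", particles at high water levels gain weight, so the estimate moves upward. A full filter would also include the prediction and resampling steps, which are omitted here.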
[NLP-18] SQuaD: The Software Quality Dataset
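Summary: This paper addresses the narrow scope of existing software quality datasets, which makes combined analysis across time and multiple quality dimensions difficult. Current resources usually focus on a few indicators such as code smells, technical debt, or refactoring activity, which limits a full understanding of how software systems evolve and what characterizes their quality. The key to the solution is the Software Quality Dataset (SQuaD), a multi-dimensional, time-aware dataset that integrates nine static analysis tools (including SonarQube and RefactoringMiner) to unify over 700 unique metrics at method, class, file, and project levels, covering 63,586 releases of 450 mature open-source projects. It also includes version control and issue-tracking histories, vulnerability data (CVE/CWE), and process metrics that help Just-In-Time (JIT) defect prediction, providing a data foundation for research on maintainability, technical debt evolution, and software quality assessment.
Link: https://arxiv.org/abs/2511.11265
Authors: Mikel Robredo,Matteo Esposito,Davide Taibi,Rafael Peñaloza,Valentina Lenarduzzi
Affiliations: University of Oulu; University of Southern Denmark; University of Milano-Bicocca
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Comments: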
Abstract:Software quality research increasingly relies on large-scale datasets that measure both the product and process aspects of software systems. However, existing resources often focus on limited dimensions, such as code smells, technical debt, or refactoring activity, thereby restricting comprehensive analyses across time and quality dimensions. To address this gap, we present the Software Quality Dataset (SQuaD), a multi-dimensional, time-aware collection of software quality metrics extracted from 450 mature open-source projects across diverse ecosystems, including Apache, Mozilla, FFmpeg, and the Linux kernel. By integrating nine state-of-the-art static analysis tools, i.e., SonarQube, CodeScene, PMD, Understand, CK, JaSoMe, RefactoringMiner, RefactoringMiner++, and PyRef, our dataset unifies over 700 unique metrics at method, class, file, and project levels. Covering a total of 63,586 analyzed project releases, SQuaD also provides version control and issue-tracking histories, software vulnerability data (CVE/CWE), and process metrics proven to enhance Just-In-Time (JIT) defect prediction. The SQuaD enables empirical research on maintainability, technical debt, software evolution, and quality assessment at unprecedented scale. We also outline emerging research directions, including automated dataset updates and cross-project quality modeling to support the continuous evolution of software analytics. The dataset is publicly available on ZENODO (DOI: https://doi.org/10.5281/zenodo.17566690).
[NLP-19] Discovering Meaningful Units with Visually Grounded Semantics from Image Captions
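Summary: This paper addresses insufficient semantic alignment in the fine-grained understanding of vision-language models: existing methods mainly align image patches with language tokens and overlook object-level semantics that humans can perceive. The key to the solution is a model that groups caption tokens as part of its architecture, so that the language representations correspond to objects actually present in the image, and that aligns these representations with the output of an image encoder trained to discover objects. Experiments indicate that learning to group tokens improves the model's fine-grained understanding, and that the discovered token groups closely match groundable phrases in text.
Link: https://arxiv.org/abs/2511.11262
Authors: Melika Behjati,James Henderson
Affiliations: Idiap Research Institute; EPFL
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: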
Abstract:Fine-grained knowledge is crucial for vision-language models to obtain a better understanding of the real world. While there has been work trying to acquire this kind of knowledge in the space of vision and language, it has mostly focused on aligning the image patches with the tokens on the language side. However, image patches do not have any meaning to the human eye, and individual tokens do not necessarily carry groundable information in the image. It is groups of tokens which describe different aspects of the scene. In this work, we propose a model which groups the caption tokens as part of its architecture in order to capture a fine-grained representation of the language. We expect our representations to be at the level of objects present in the image, and therefore align our representations with the output of an image encoder trained to discover objects. We show that by learning to group the tokens, the vision-language model has a better fine-grained understanding of vision and language. In addition, the token groups that our model discovers are highly similar to groundable phrases in text, both qualitatively and quantitatively.
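[NLP-20] KGQuest: Template-Driven QA Generation from Knowledge Graphs with LLM-Based Refinement
Summary: This paper addresses poor scalability, low linguistic quality, and weak factual consistency when generating natural language question-answer (QA) pairs from knowledge graphs (KGs). The key to the solution is a scalable, deterministic pipeline. It first clusters KG triples by relation and derives reusable natural language templates from entity types and relations. It then uses large language models (LLMs) to refine the templates, improving fluency and clarity while preserving factual accuracy. Finally, it builds the answer options by selecting distractors from the KG.
Link: https://arxiv.org/abs/2511.11258
Authors: Sania Nayab,Marco Simoni,Giulio Rossolini,Andrea Saracino
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: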
Abstract:The generation of questions and answers (QA) from knowledge graphs (KG) plays a crucial role in the development and testing of educational platforms, dissemination tools, and large language models (LLM). However, existing approaches often struggle with scalability, linguistic quality, and factual consistency. This paper presents a scalable and deterministic pipeline for generating natural language QA from KGs, with an additional refinement step using LLMs to further enhance linguistic quality. The approach first clusters KG triplets based on their relations, creating reusable templates through natural language rules derived from the entity types of objects and relations. A module then leverages LLMs to refine these templates, improving clarity and coherence while preserving factual accuracy. Finally, the instantiation of answer options is achieved through a selection strategy that introduces distractors from the KG. Our experiments demonstrate that this hybrid approach efficiently generates high-quality QA pairs, combining scalability with fluency and linguistic precision.
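The three stages named in the abstract (relation clustering, template instantiation, distractor selection) can be sketched as follows. This is an illustrative toy, not the authors' pipeline: the triples and the hand-written templates are hypothetical, the LLM refinement step is left out, and distractors are picked in sorted order only to keep the output deterministic.

```python
from collections import defaultdict

# Hypothetical knowledge graph triples: (subject, relation, object).
TRIPLES = [
    ("Paris", "capital_of", "France"),
    ("Rome", "capital_of", "Italy"),
    ("Madrid", "capital_of", "Spain"),
    ("Berlin", "capital_of", "Germany"),
    ("Marie Curie", "born_in", "Warsaw"),
]

# Hypothetical templates, one per relation. The paper derives templates from
# rules and refines them with an LLM; that step is not shown here.
TEMPLATES = {
    "capital_of": "{subject} is the capital of which country?",
    "born_in": "Where was {subject} born?",
}

def cluster_by_relation(triples):
    """Group (subject, object) pairs under their relation."""
    clusters = defaultdict(list)
    for subject, relation, obj in triples:
        clusters[relation].append((subject, obj))
    return clusters

def make_question(subject, relation, answer, clusters, n_distractors=3):
    """Fill the relation's template and add distractors from the same cluster."""
    pool = sorted({obj for _, obj in clusters[relation] if obj != answer})
    options = sorted(pool[:n_distractors] + [answer])
    return {
        "question": TEMPLATES[relation].format(subject=subject),
        "options": options,
        "answer": answer,
    }

clusters = cluster_by_relation(TRIPLES)
qa = make_question("Paris", "capital_of", "France", clusters)
```

Drawing distractors from objects of the same relation keeps them type-compatible with the correct answer, which is one plausible reading of "a selection strategy that introduces distractors from the KG".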
[NLP-21] LANE: Lexical Adversarial Negative Examples for Word Sense Disambiguation
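Summary: This paper addresses a limitation of neural language models (NLMs) in fine-grained word meaning resolution: they tend to rely on global sentence representations and miss local semantic details. The key to the solution is LANE, an adversarial training strategy that creates challenging negative examples by selectively marking alternate words in the training set, forcing the model to separate the representations of the same sentence with different marked words. The method is model-agnostic, can be integrated into existing representation learning frameworks, and improves over standard contrastive learning baselines on lexical semantic change detection and word sense disambiguation.
Link: https://arxiv.org/abs/2511.11234
Authors: Jader Martins Camboim de Sá,Jooyoung Lee,Cédric Pruski,Marcos Da Silveira
Affiliations: FSTM - University of Luxembourg; Luxembourg Institute of Science and Technology; Brown University
Categories: Computation and Language (cs.CL)
Comments: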
Abstract:Fine-grained word meaning resolution remains a critical challenge for neural language models (NLMs) as they often overfit to global sentence representations, failing to capture local semantic details. We propose a novel adversarial training strategy, called LANE, to address this limitation by deliberately shifting the model’s learning focus to the target word. This method generates challenging negative training examples through the selective marking of alternate words in the training set. The goal is to force the model to create a greater separability between same sentences with different marked words. Experimental results on lexical semantic change detection and word sense disambiguation benchmarks demonstrate that our approach yields more discriminative word representations, improving performance over standard contrastive learning baselines. We further provide qualitative analyses showing that the proposed negatives lead to representations that better capture subtle meaning differences even in challenging environments. Our method is model-agnostic and can be integrated into existing representation learning frameworks.
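A minimal sketch of how such negatives might be constructed is given below. The marker tokens `[TGT]` and `[/TGT]` are hypothetical, and marking every other word is a simplification: the abstract says alternate words are marked selectively but does not spell out the selection rule.

```python
def mark(tokens, index, left="[TGT]", right="[/TGT]"):
    """Return the sentence with the token at `index` wrapped in marker tokens."""
    marked = tokens[:index] + [left, tokens[index], right] + tokens[index + 1:]
    return " ".join(marked)

def lane_examples(sentence, target_index):
    """Build an anchor (true target marked) and negatives (other words marked)."""
    tokens = sentence.split()
    anchor = mark(tokens, target_index)
    negatives = [mark(tokens, i) for i in range(len(tokens)) if i != target_index]
    return anchor, negatives

anchor, negatives = lane_examples("the bank raised rates", 1)
```

Because the anchor and each negative share exactly the same words and differ only in which word is marked, a model can only tell them apart by attending to the marked word rather than to the sentence as a whole.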
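[NLP-22] Adverbs Revisited: Enhancing WordNet Coverage of Adverbs with a Supersense Taxonomy
Summary: This paper addresses the lack of a systematic semantic classification of adverbs in lexical resources: WordNet offers rich supersense hierarchies for nouns and verbs, but adverbs still have no unified scheme. The key to the solution is a linguistically grounded supersense typology for adverbs, validated through manual annotation, that covers the major semantic domains of manner, temporal, frequency, degree, domain, speaker-oriented, and subject-oriented functions. A pilot annotation study indicates that the categories give broad coverage of adverbs in natural text and can be assigned reliably by human annotators. The typology extends WordNet's coverage, brings it closer to linguistic theory, and supports downstream NLP tasks such as word sense disambiguation, event extraction, sentiment analysis, and discourse modeling.
Link: https://arxiv.org/abs/2511.11214
Authors: Jooyoung Lee,Jader Martins Camboim de Sá
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: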
Abstract:WordNet offers rich supersense hierarchies for nouns and verbs, yet adverbs remain underdeveloped, lacking a systematic semantic classification. We introduce a linguistically grounded supersense typology for adverbs, empirically validated through annotation, that captures major semantic domains including manner, temporal, frequency, degree, domain, speaker-oriented, and subject-oriented functions. Results from a pilot annotation study demonstrate that these categories provide broad coverage of adverbs in natural text and can be reliably assigned by human annotators. Incorporating this typology extends WordNet’s coverage, aligns it more closely with linguistic theory, and facilitates downstream NLP applications such as word sense disambiguation, event extraction, sentiment analysis, and discourse modeling. We present the proposed supersense categories, annotation outcomes, and directions for future work.
[NLP-23] Multi-agent Undercover Gaming: Hallucination Removal via Counterfactual Test for Multimodal Reasoning AAAI2026
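Summary: This paper addresses the reliability problems that hallucination causes in the multimodal reasoning of large language models (LLMs), and in particular the unrealistic assumption of existing Multi-Agent Debate (MAD) frameworks that all participants are rational and reflective. The key to the solution is a new protocol, Multi-agent Undercover Gaming (MUG), inspired by social deduction games. It uses multimodal counterfactual tests to identify "undercover" agents, that is, agents that hallucinate: reference images are modified to introduce counterfactual evidence, and whether an agent accurately identifies the change provides a verifiable ground truth for detecting hallucination. MUG improves on MAD along three dimensions: factual verification beyond statistical consensus, cross-evidence reasoning over dynamically modified evidence sources, and active reasoning in which agents probe one another rather than passively answer.
Link: https://arxiv.org/abs/2511.11182
Authors: Dayong Liang,Xiao-Yong Wei,Changmeng Zheng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Multimedia (cs.MM)
Comments: Accepted by AAAI 2026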
Abstract:Hallucination continues to pose a major obstacle in the reasoning capabilities of large language models (LLMs). Although the Multi-Agent Debate (MAD) paradigm offers a promising solution by promoting consensus among multiple agents to enhance reliability, it relies on the unrealistic assumption that all debaters are rational and reflective, which is a condition that may not hold when agents themselves are prone to hallucinations. To address this gap, we introduce the Multi-agent Undercover Gaming (MUG) protocol, inspired by social deduction games like “Who is Undercover?”. MUG reframes MAD as a process of detecting “undercover” agents (those suffering from hallucinations) by employing multimodal counterfactual tests. Specifically, we modify reference images to introduce counterfactual evidence and observe whether agents can accurately identify these changes, providing ground-truth for identifying hallucinating agents and enabling robust, crowd-powered multimodal reasoning. MUG advances MAD protocols along three key dimensions: (1) enabling factual verification beyond statistical consensus through counterfactual testing; (2) introducing cross-evidence reasoning via dynamically modified evidence sources instead of relying on static inputs; and (3) fostering active reasoning, where agents engage in probing discussions rather than passively answering questions. Collectively, these innovations offer a more reliable and effective framework for multimodal reasoning in LLMs. The source code can be accessed at this https URL.
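[NLP-24] PRSM: A Measure to Evaluate CLIP's Robustness Against Paraphrases
Summary: This paper studies how stable the text-image alignment of the multimodal model CLIP remains when a query is paraphrased, that is, reworded without changing its meaning. The question matters most in socially sensitive settings, where inconsistent representations can amplify demographic biases such as gender bias. The key contribution is the Paraphrase Ranking Stability Metric (PRSM), a new measure of CLIP's sensitivity to paraphrased queries. An empirical evaluation on the Social Counterfactuals dataset shows that robustness varies across paraphrasing strategies and that there are subtle but consistent differences between male- and female-associated queries, giving a quantitative basis for analyzing the fairness and reliability of multimodal systems.
Link: https://arxiv.org/abs/2511.11141
Authors: Udo Schlegel,Franziska Weeber,Jian Lan,Thomas Seidl
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 8 pages, accepted as short paper at MMM 2026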
Abstract:Contrastive Language-Image Pre-training (CLIP) is a widely used multimodal model that aligns text and image representations through large-scale training. While it performs strongly on zero-shot and few-shot tasks, its robustness to linguistic variation, particularly paraphrasing, remains underexplored. Paraphrase robustness is essential for reliable deployment, especially in socially sensitive contexts where inconsistent representations can amplify demographic biases. In this paper, we introduce the Paraphrase Ranking Stability Metric (PRSM), a novel measure for quantifying CLIP’s sensitivity to paraphrased queries. Using the Social Counterfactuals dataset, a benchmark designed to reveal social and demographic biases, we empirically assess CLIP’s stability under paraphrastic variation, examine the interaction between paraphrase robustness and gender, and discuss implications for fairness and equitable deployment of multimodal systems. Our analysis reveals that robustness varies across paraphrasing strategies, with subtle yet consistent differences observed between male- and female-associated queries.
[NLP-25] Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition
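Summary: This paper addresses the difficulty automatic speech recognition (ASR) systems have in using long-context information in contextualized scenarios that require domain-specific knowledge. The core challenges are limited model context windows and the sparsity of relevant information within a large amount of context. The key to the solution is the SAP² method, which dynamically prunes and integrates relevant contextual keywords in two stages. Each stage uses the proposed Speech-Driven Attention-based Pooling mechanism, which compresses context embeddings efficiently while preserving speech-salient information.
Link: https://arxiv.org/abs/2511.11139
Authors: Yiming Rong,Yixin Zhang,Ziyi Wang,Deyang Jiang,Yunlong Zhao,Haoran Wu,Shiyu Zhou,Bo Xu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: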
Abstract:Automatic speech recognition (ASR) systems have achieved remarkable performance in common conditions but often struggle to leverage long-context information in contextualized scenarios that require domain-specific knowledge, such as conference presentations. This challenge arises primarily due to constrained model context windows and the sparsity of relevant information within extensive contextual noise. To solve this, we propose the SAP² method, a novel framework that dynamically prunes and integrates relevant contextual keywords in two stages. Specifically, each stage leverages our proposed Speech-Driven Attention-based Pooling mechanism, enabling efficient compression of context embeddings while preserving speech-salient information. Experimental results demonstrate state-of-the-art performance of SAP² on the SlideSpeech and LibriSpeech datasets, achieving word error rates (WER) of 7.71% and 1.12%, respectively. On SlideSpeech, our method notably reduces biased keyword error rates (B-WER) by 41.1% compared to non-contextual baselines. SAP² also exhibits robust scalability, consistently maintaining performance under extensive contextual input conditions on both datasets.
[NLP-26] Enhancing Meme Emotion Understanding with Multi-Level Modality Enhancement and Dual-Stage Modal Fusion
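Summary: This paper addresses two core challenges in Meme Emotion Understanding (MEU): the lack of fine-grained multimodal fusion strategies, and insufficient mining of the implicit meanings and background knowledge in memes. The key to the solution is the MemoDetector framework, which has two main components. First, a four-step textual enhancement module uses Multimodal Large Language Models (MLLMs) to progressively infer and extract implicit and contextual information from memes, enriching the original content and guiding downstream classification. Second, a dual-stage modal fusion strategy performs shallow fusion of the raw image and text in the first stage and deep fusion of the enhanced visual and textual features in the second, capturing nuanced cross-modal emotional cues. The method improves F1 scores by 4.3% on MET-MEME and 3.4% on MOOD.
Link: https://arxiv.org/abs/2511.11126
Authors: Yi Shi,Wenlong Meng,Zhenyuan Guo,Chengkun Wei,Wenzhi Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: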
Abstract:With the rapid rise of social media and Internet culture, memes have become a popular medium for expressing emotional tendencies. This has sparked growing interest in Meme Emotion Understanding (MEU), which aims to classify the emotional intent behind memes by leveraging their multimodal contents. While existing efforts have achieved promising results, two major challenges remain: (1) a lack of fine-grained multimodal fusion strategies, and (2) insufficient mining of memes’ implicit meanings and background knowledge. To address these challenges, we propose MemoDetector, a novel framework for advancing MEU. First, we introduce a four-step textual enhancement module that utilizes the rich knowledge and reasoning capabilities of Multimodal Large Language Models (MLLMs) to progressively infer and extract implicit and contextual insights from memes. These enhanced texts significantly enrich the original meme contents and provide valuable guidance for downstream classification. Next, we design a dual-stage modal fusion strategy: the first stage performs shallow fusion on raw meme image and text, while the second stage deeply integrates the enhanced visual and textual features. This hierarchical fusion enables the model to better capture nuanced cross-modal emotional cues. Experiments on two datasets, MET-MEME and MOOD, demonstrate that our method consistently outperforms state-of-the-art baselines. Specifically, MemoDetector improves F1 scores by 4.3% on MET-MEME and 3.4% on MOOD. Further ablation studies and in-depth analyses validate the effectiveness and robustness of our approach, highlighting its strong potential for advancing MEU. Our code is available at this https URL.
[NLP-27] AV-Dialog: Spoken Dialogue Models with Audio-Visual Input
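Summary: This paper addresses the tendency of dialogue models to produce irrelevant responses and awkward turn-taking in noisy, multi-speaker environments. The key to the solution is AV-Dialog, described as the first multimodal dialogue framework to use both audio and visual cues. It combines acoustic tokenization with multi-task, multi-stage training on monadic, synthetic, and real audio-visual dialogue data to achieve robust streaming transcription, semantically grounded turn-boundary detection, and accurate response generation, which makes the conversation flow more naturally.
Link: https://arxiv.org/abs/2511.11124
Authors: Tuochao Chen,Bandhav Veluri,Hongyu Gong,Shyamnath Gollakota
Affiliations: University of Washington; Meta AI Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
Comments: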
Abstract:Dialogue models falter in noisy, multi-speaker environments, often producing irrelevant responses and awkward turn-taking. We present AV-Dialog, the first multimodal dialog framework that uses both audio and visual cues to track the target speaker, predict turn-taking, and generate coherent responses. By combining acoustic tokenization with multi-task, multi-stage training on monadic, synthetic, and real audio-visual dialogue datasets, AV-Dialog achieves robust streaming transcription, semantically grounded turn-boundary detection and accurate responses, resulting in a natural conversational flow. Experiments show that AV-Dialog outperforms audio-only models under interference, reducing transcription errors, improving turn-taking prediction, and enhancing human-rated dialogue quality. These results highlight the power of seeing as well as hearing for speaker-aware interaction, paving the way for spoken dialogue agents that perform robustly in real-world, noisy environments.
[NLP-28] Analysing Personal Attacks in U.S. Presidential Debates
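Summary: This paper addresses the automated detection of personal attacks in U.S. presidential debates, with the aim of improving transparency in political discourse and giving journalists, analysts, and the public better insight. The key to the solution is a manually annotated dataset of debate transcripts covering the 2016, 2020, and 2024 election cycles, analyzed with statistical methods and language models: fine-tuned transformer models such as BERT alongside general-purpose large language models (LLMs). The study stresses the value of task-specific adaptation of language models for understanding political communication.
Link: https://arxiv.org/abs/2511.11108
Authors: Ruban Goyal,Rohitash Chandra,Sonit Singh
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 13 pages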
Abstract:Personal attacks have become a notable feature of U.S. presidential debates and play an important role in shaping public perception during elections. Detecting such attacks can improve transparency in political discourse and provide insights for journalists, analysts and the public. Advances in deep learning and transformer-based models, particularly BERT and large language models (LLMs), have created new opportunities for automated detection of harmful language. Motivated by these developments, we present a framework for analysing personal attacks in U.S. presidential debates. Our work involves manual annotation of debate transcripts across the 2016, 2020 and 2024 election cycles, followed by statistical and language-model based analysis. We investigate the potential of fine-tuned transformer models alongside general-purpose LLMs to detect personal attacks in formal political speech. This study demonstrates how task-specific adaptation of modern language models can contribute to a deeper understanding of political communication.
[NLP-29] CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation ICASSP2026
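Summary: This paper addresses two coupled biases in instruction-guided text-to-speech (TTS) synthesis: accent bias, where models default to dominant phonetic patterns, and linguistic bias, where dialect-specific lexical and cultural cues are ignored. The key to the solution is CLARITY (Contextual Linguistic Adaptation and Retrieval for Inclusive TTS sYnthesis), a backbone-agnostic framework built on dual-signal optimization: (i) contextual linguistic adaptation, which localizes the input text to the target dialect, and (ii) retrieval-augmented accent prompting (RAAP), which supplies accent-consistent speech prompts. Together these improve accent accuracy and fairness while maintaining strong perceptual quality.
Link: https://arxiv.org/abs/2511.11104
Authors: Crystal Min Hui Poon,Pai Chet Ng,Xiaoxiao Miao,Immanuel Jun Kai Loh,Bowen Zhang,Haoyu Song,Ian Mcloughlin
Affiliations: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Comments: Submitted to ICASSP 2026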
Abstract:Instruction-guided text-to-speech (TTS) research has reached a maturity level where excellent speech generation quality is possible on demand, yet two coupled biases persist: accent bias, where models default to dominant phonetic patterns, and linguistic bias, where dialect-specific lexical and cultural cues are ignored. These biases are interdependent, as authentic accent generation requires both accent fidelity and localized text. We present Contextual Linguistic Adaptation and Retrieval for Inclusive TTS sYnthesis (CLARITY), a backbone-agnostic framework that addresses these biases through dual-signal optimization: (i) contextual linguistic adaptation that localizes input text to the target dialect, and (ii) retrieval-augmented accent prompting (RAAP) that supplies accent-consistent speech prompts. Across twelve English accents, CLARITY improves accent accuracy and fairness while maintaining strong perceptual quality.
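[NLP-30] Can LLMs Detect Their Own Hallucinations?
Summary: This paper addresses hallucination in large language models (LLMs), that is, generated content that contradicts the facts, and asks whether LLMs can detect their own hallucinations. It formulates hallucination detection as a sentence-level classification task, proposes a framework for estimating an LLM's detection capability, and proposes a classification method that uses Chain-of-Thought (CoT) to extract knowledge from the model's parameters. In the experiments, GPT-3.5 Turbo with CoT detected 58.2% of its own hallucinations, and the authors conclude that LLMs with CoT can detect hallucinations when their parameters contain sufficient knowledge.
Link: https://arxiv.org/abs/2511.11087
Authors: Sora Kadotani,Kosuke Nishida,Kyosuke Nishida
Affiliations: NTT Human Informatics Labs., NTT, Inc.
Categories: Computation and Language (cs.CL)
Comments: 8 pages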
Abstract:Large language models (LLMs) can generate fluent responses, but sometimes hallucinate facts. In this paper, we investigate whether LLMs can detect their own hallucinations. We formulate hallucination detection as a classification task of a sentence. We propose a framework for estimating LLMs’ capability of hallucination detection and a classification method using Chain-of-Thought (CoT) to extract knowledge from their parameters. The experimental results indicated that GPT-3.5 Turbo with CoT detected 58.2% of its own hallucinations. We concluded that LLMs with CoT can detect hallucinations if sufficient knowledge is contained in their parameters.
[NLP-31] S2D-ALIGN: Shallow-to-Deep Auxiliary Learning for Anatomically-Grounded Radiology Report Generation
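Summary: This paper addresses sub-optimal generation quality in Radiology Report Generation (RRG) that arises when training relies only on instance-level alignment of image-text pairs: because reports are highly templated, such alignment fails to become anatomically grounded. The key to the solution is S2D-Align, a new Supervised Fine-Tuning (SFT) paradigm that uses auxiliary signals of varying granularity, such as reference reports and key phrases, in a shallow-to-deep strategy that progressively strengthens the anatomical grounding of the cross-modal alignment. A memory-based adapter shares features across the alignment stages, integrating coarse and fine-grained guidance.
Link: https://arxiv.org/abs/2511.11066
Authors: Jiechao Gao,Chang Liu,Yuangang Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: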
Abstract:Radiology Report Generation (RRG) aims to automatically generate diagnostic reports from radiology images. To achieve this, existing methods have leveraged the powerful cross-modal generation capabilities of Multimodal Large Language Models (MLLMs), primarily focusing on optimizing cross-modal alignment between radiographs and reports through Supervised Fine-Tuning (SFT). However, by only performing instance-level alignment with the image-text pairs, the standard SFT paradigm fails to establish anatomically-grounded alignment, where the templated nature of reports often leads to sub-optimal generation quality. To address this, we propose S2D-Align, a novel SFT paradigm that establishes anatomically-grounded alignment by leveraging auxiliary signals of varying granularities. S2D-Align implements a shallow-to-deep strategy, progressively enriching the alignment process: it begins with the coarse radiograph-report pairing, then introduces reference reports for instance-level guidance, and ultimately utilizes key phrases to ground the generation in specific anatomical details. To bridge the different alignment stages, we introduce a memory-based adapter that empowers feature sharing, thereby integrating coarse and fine-grained guidance. For evaluation, we conduct experiments on the public MIMIC-CXR and IU X-Ray benchmarks, where S2D-Align achieves state-of-the-art performance compared to existing methods. Ablation studies validate the effectiveness of our multi-stage, auxiliary-guided approach, highlighting a promising direction for enhancing grounding capabilities in complex, multi-modal generation tasks.
[NLP-32] Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB
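Summary: This paper addresses a systematic bias in current text embedding models: every embedding vector $e$ can be decomposed as $\tilde{e} + \mu$, where $\mu$ is almost identical across sentences, and this shared component limits performance on downstream tasks. The key to the solution is Renormalization, a training-free, lightweight, plug-and-play method that removes the offset either by subtracting $\mu$ from the original embedding or by subtracting the projection of the embedding onto $\mu$. This improves results on the Massive Multilingual Text Embedding Benchmark (MMTEB) for retrieval, classification, and other tasks. The authors predict theoretically, and confirm experimentally, that the projection-based variant performs better.
Link: https://arxiv.org/abs/2511.11041
Authors: Xingyu Ren,Youran Sun,Haoyu Liang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: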
Abstract:We find that current text embedding models produce outputs with a consistent bias, i.e., each embedding vector $e$ can be decomposed as $\tilde{e} + \mu$, where $\mu$ is almost identical across all sentences. We propose a plug-and-play, training-free and lightweight solution called Renormalization. Through extensive experiments, we show that renormalization consistently and statistically significantly improves the performance of existing models on the Massive Multilingual Text Embedding Benchmark (MMTEB). In particular, across 38 models, renormalization improves performance by 9.7$\sigma$ on retrieval tasks, 3.1$\sigma$ on classification tasks, and 0.8$\sigma$ on other types of tasks. Renormalization has two variants: directly subtracting $\mu$ from $e$, or subtracting the projection of $e$ onto $\mu$. We theoretically predict that the latter performs better, and our experiments confirm this prediction.
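The two variants named in the abstract can be written out in a few lines. The sketch below is pure Python and rests on assumptions: $\mu$ is estimated as the mean of a small hypothetical set of embeddings, and any rescaling to unit length afterwards is left out because the abstract does not describe it.

```python
def mean_vector(embeddings):
    """Component-wise mean, used here as an assumed estimate of mu."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def subtract_mean(e, mu):
    """Variant 1: e - mu."""
    return [x - m for x, m in zip(e, mu)]

def subtract_projection(e, mu):
    """Variant 2: e - (<e, mu> / <mu, mu>) * mu, the part of e orthogonal to mu."""
    scale = dot(e, mu) / dot(mu, mu)
    return [x - scale * m for x, m in zip(e, mu)]

# Hypothetical 3-dimensional embeddings sharing a large common component.
embeddings = [[1.0, 2.0, 0.5], [1.2, 1.8, -0.5], [0.8, 2.2, 0.0]]
mu = mean_vector(embeddings)
centered = [subtract_mean(e, mu) for e in embeddings]
projected = [subtract_projection(e, mu) for e in embeddings]
```

Variant 1 makes the set of embeddings zero-mean, while variant 2 makes each embedding orthogonal to $\mu$, so the two generally give different vectors.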
[NLP-33] Automata-Based Steering of Large Language Models for Diverse Structured Generation
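Summary: This paper addresses the limited diversity of structured outputs generated by large language models (LLMs): existing structured generation methods guarantee validity, but their outputs often lack diversity, which limits how well they carry over to practical applications. The key to the solution is to use the automaton's traversal history to steer the LLM toward new structural patterns, which significantly improves structural and content diversity while keeping generation efficiency comparable.
Link: https://arxiv.org/abs/2511.11018
Authors: Xiaokun Luan,Zeming Wei,Yihao Zhang,Meng Sun
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: ICFEM 2025 (Best Paper Award)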
Abstract:Large language models (LLMs) are increasingly tasked with generating structured outputs. While structured generation methods ensure validity, they often lack output diversity, a critical limitation that we confirm in our preliminary study. We propose a novel method to enhance diversity in automaton-based structured generation. Our approach utilizes automata traversal history to steer LLMs towards novel structural patterns. Evaluations show our method significantly improves structural and content diversity while maintaining comparable generation efficiency. Furthermore, we conduct a case study showcasing the effectiveness of our method in generating diverse test cases for testing open-source libraries.
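One way to picture steering by traversal history is the toy sketch below. It is an assumption-laden illustration rather than the paper's algorithm: the automaton, the fixed scores standing in for LLM token log-probabilities, the count-based penalty, and the greedy choice are all hypothetical.

```python
from collections import Counter

# Toy automaton for the pattern ("a" | "b") followed by ("x" | "y").
DFA = {
    "start": {"a": "mid", "b": "mid"},
    "mid": {"x": "end", "y": "end"},
    "end": {},
}

# Stand-in for LLM token scores (hypothetical, fixed for reproducibility).
BASE_SCORE = {"a": 0.0, "b": -0.5, "x": 0.0, "y": -0.5}

def generate(history, penalty=1.0):
    """Greedy constrained decoding that penalizes already-visited transitions."""
    state, output = "start", []
    while DFA[state]:
        candidates = DFA[state]  # only automaton-valid tokens are considered
        best = max(
            sorted(candidates),
            key=lambda tok: BASE_SCORE[tok] - penalty * history[(state, tok)],
        )
        history[(state, best)] += 1  # record the traversal for later runs
        output.append(best)
        state = candidates[best]
    return "".join(output)

history = Counter()
steered = [generate(history) for _ in range(4)]
unsteered = [generate(Counter(), penalty=0.0) for _ in range(4)]
```

With the penalty switched off, every run follows the same highest-scoring path; with the shared history, later runs are pushed onto transitions that have been used less often, and every output stays valid because candidates always come from the automaton.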
[NLP-34] When Data is the Algorithm: A Systematic Study and Curation of Preference Optimization Datasets
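Summary: This paper addresses the lack of systematic comparison and quality assessment of open-source Direct Preference Optimization (DPO) datasets, in particular how preferences were selected, which task types they span, and how well they reflect human judgment at the sample level. The key to the solution is to use the Magpie framework to annotate each sample of several popular DPO datasets with its task category, input quality, and preference reward, a reward-model-based signal that validates the preference order without human annotation. Building on this analysis, the authors curate a new mixture, UltraMix, that removes noisy and redundant samples. It is 30% smaller than the best-performing individual dataset yet exceeds its performance on key benchmarks.
Link: https://arxiv.org/abs/2511.10985
Authors: Aladin Djuhera,Farhan Ahmed,Swanand Ravindra Kadhe,Syed Zawad,Heiko Ludwig,Holger Boche
Affiliations: Technical University Munich; IBM Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: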
Abstract:Aligning large language models (LLMs) is a central objective of post-training, often achieved through reward modeling and reinforcement learning methods. Among these, direct preference optimization (DPO) has emerged as a widely adopted technique that fine-tunes LLMs on preferred completions over less favorable ones. While most frontier LLMs do not disclose their curated preference pairs, the broader LLM community has released several open-source DPO datasets, including TuluDPO, ORPO, UltraFeedback, HelpSteer, and Code-Preference-Pairs. However, systematic comparisons remain scarce, largely due to the high computational cost and the lack of rich quality annotations, making it difficult to understand how preferences were selected, which task types they span, and how well they reflect human judgment on a per-sample level. In this work, we present the first comprehensive, data-centric analysis of popular open-source DPO corpora. We leverage the Magpie framework to annotate each sample for task category, input quality, and preference reward, a reward-model-based signal that validates the preference order without relying on human annotations. This enables a scalable, fine-grained inspection of preference quality across datasets, revealing structural and qualitative discrepancies in reward margins. Building on these insights, we systematically curate a new DPO mixture, UltraMix, that draws selectively from all five corpora while removing noisy or redundant samples. UltraMix is 30% smaller than the best-performing individual dataset yet exceeds its performance across key benchmarks. We publicly release all annotations, metadata, and our curated mixture to facilitate future research in data-centric preference optimization.
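The idea of removing noisy or redundant preference pairs can be sketched as a simple filter. The field names, the margin threshold, and the prompt-level deduplication rule below are hypothetical; the abstract does not state the actual curation criteria used for UltraMix.

```python
def curate(samples, min_margin=0.5):
    """Keep pairs with a clear reward margin and drop repeated prompts."""
    seen, kept = set(), []
    for sample in samples:
        margin = sample["reward_chosen"] - sample["reward_rejected"]
        key = sample["prompt"].strip().lower()
        if margin < min_margin:  # noisy: the reward model has no clear preference
            continue
        if key in seen:  # redundant: the same prompt was already kept
            continue
        seen.add(key)
        kept.append(sample)
    return kept

# Hypothetical annotated preference pairs.
samples = [
    {"prompt": "Explain DPO.", "reward_chosen": 2.0, "reward_rejected": 0.5},
    {"prompt": "explain dpo. ", "reward_chosen": 1.8, "reward_rejected": 0.2},
    {"prompt": "Write a haiku.", "reward_chosen": 1.0, "reward_rejected": 0.9},
    {"prompt": "Sort a list in Python.", "reward_chosen": 0.2, "reward_rejected": 1.1},
]
kept = curate(samples)
```

A negative margin, as in the last sample, means the reward model prefers the response labeled as rejected, which is the kind of disagreement a reward-based annotation can surface without human review.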
[NLP-35] DiscoX: Benchmarking Discourse-Level Translation task in Expert Domains
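[Quick read]: This paper addresses the inadequate evaluation of discourse-level translation in expert domains: mainstream evaluation methods focus on segment-level accuracy and fluency and struggle to measure the discourse coherence and terminological precision that cross-lingual scholarly communication requires. The key to the solution is a new benchmark, DiscoX, together with a reference-free, fine-grained evaluation system, Metric-S. DiscoX contains 200 high-quality texts for Chinese-English translation from 7 expert domains, averaging more than 1700 tokens each, which ensures discourse complexity and domain depth. Metric-S automatically assesses accuracy, fluency, and appropriateness, agrees strongly with human judgments, and significantly outperforms existing metrics. Experiments show that even the most advanced large language models (LLMs) still trail human experts by a wide margin on this task, which validates the difficulty of DiscoX and exposes the bottlenecks that remain on the way to professional-grade machine translation.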
Link: https://arxiv.org/abs/2511.10984
Authors: Xiying Zhao, Zhoufutu Wen, Zhixuan Chen, Jingzhe Ding, Jianpeng Jiao, Shuai Li, Xi Li, Danni Liang, Shengda Long, Qianqian Liu, Xianbo Wu, Hongwan Gao, Xiang Gao, Liang Hu, Jiashuo Liu, Mengyun Liu, Weiran Shi, Chenghao Yang, Qianyu Yang, Xuanliang Zhang, Ge Zhang, Wenhao Huang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 36 pages
Abstract:The evaluation of discourse-level translation in expert domains remains inadequate, despite its centrality to knowledge dissemination and cross-lingual scholarly communication. While these translations demand discourse-level coherence and strict terminological precision, current evaluation methods predominantly focus on segment-level accuracy and fluency. To address this limitation, we introduce DiscoX, a new benchmark for discourse-level and expert-level Chinese-English translation. It comprises 200 professionally-curated texts from 7 domains, with an average length exceeding 1700 tokens. To evaluate performance on DiscoX, we also develop Metric-S, a reference-free system that provides fine-grained automatic assessments across accuracy, fluency, and appropriateness. Metric-S demonstrates strong consistency with human judgments, significantly outperforming existing metrics. Our experiments reveal a remarkable performance gap: even the most advanced LLMs still trail human experts on these tasks. This finding validates the difficulty of DiscoX and underscores the challenges that remain in achieving professional-grade machine translation. The proposed benchmark and evaluation system provide a robust framework for more rigorous evaluation, facilitating future advancements in LLM-based translation.
[NLP-36] CardioEmbed: Domain-Specialized Text Embeddings for Clinical Cardiology
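[Quick read]: This paper addresses the limited usefulness of existing biomedical text embedding models in clinical cardiology: current models are trained mainly on PubMed research literature and fail to capture the semantics of clinical cardiology text, which centers on procedural knowledge and specialized terminology. The key to the solution is CardioEmbed, a domain-specialized embedding model based on Qwen3-Embedding-8B and trained with contrastive learning on a deduplicated corpus of roughly 150,000 sentences drawn from seven comprehensive cardiology textbooks, using an InfoNCE loss with in-batch negatives. It reaches 99.60% Acc@1 on cardiac-specific semantic retrieval, 15.94 percentage points above MedTE, the current state-of-the-art medical embedding model.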
Link: https://arxiv.org/abs/2511.10930
Authors: Richard J. Young, Alice M. Matthews
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 14 pages, 6 figures
Abstract:Biomedical text embeddings have primarily been developed using research literature from PubMed, yet clinical cardiology practice relies heavily on procedural knowledge and specialized terminology found in comprehensive textbooks rather than research abstracts. This research-practice gap limits the effectiveness of existing embedding models for clinical applications in cardiology. This study trained CardioEmbed, a domain-specialized embedding model based on Qwen3-Embedding-8B, using contrastive learning on a curated corpus of seven comprehensive cardiology textbooks totaling approximately 150,000 sentences after deduplication. The model employs InfoNCE loss with in-batch negatives and achieves 99.60% retrieval accuracy on cardiac-specific semantic retrieval tasks, a +15.94 percentage point improvement over MedTE, the current state-of-the-art medical embedding model. On MTEB medical benchmarks, the model obtained BIOSSES 0.77 Spearman and SciFact 0.61 NDCG@10, indicating competitive performance on related biomedical domains. Domain-specialized training on comprehensive clinical textbooks yields near-perfect cardiology retrieval (99.60% Acc@1), improving over MedTE by +15.94 percentage points.
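The InfoNCE loss with in-batch negatives mentioned in this abstract can be sketched in a few lines of plain Python. This is a generic illustration of the loss, not CardioEmbed's training code: the toy embeddings and the temperature value of 0.05 are assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce_in_batch(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss where positives[j] (j != i) act as negatives for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, pos) / temperature for pos in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy unit-length embeddings (made up for illustration).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]  # each positive matches its own anchor
swapped = [[0.0, 1.0], [1.0, 0.0]]  # each positive matches the other anchor

loss_aligned = info_nce_in_batch(anchors, aligned)
loss_swapped = info_nce_in_batch(anchors, swapped)
```

The loss is near zero when every anchor is closest to its own positive and grows when another item in the batch is closer, which is what pushes paired sentences together during training.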
[NLP-37] Evaluating Large Language Models on Rare Disease Diagnosis: A Case Study using House M.D
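[Quick read]: This paper addresses the weak and poorly characterized performance of large language models (LLMs) on rare disease diagnosis from narrative medical cases. The key to the solution is a new dataset of 176 symptom-diagnosis pairs extracted from the television series House M.D., which has been validated for use in medical education, and an evaluation of four recent LLMs (GPT 4o mini, GPT 5 mini, Gemini 2.5 Flash, and Gemini 2.5 Pro) on narrative diagnostic reasoning. Overall accuracy is low (16.48% to 38.64%), but newer model generations show a 2.3-fold improvement over older ones. The benchmark thus serves as an educationally validated, publicly accessible evaluation framework and provides a reproducible baseline for future research on AI-assisted diagnosis.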
Link: https://arxiv.org/abs/2511.10912
Authors: Arsh Gupta, Ajay Narayanan Sridhar, Bonam Mingole, Amulya Yadav
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) have demonstrated capabilities across diverse domains, yet their performance on rare disease diagnosis from narrative medical cases remains underexplored. We introduce a novel dataset of 176 symptom-diagnosis pairs extracted from House M.D., a medical television series validated for teaching rare disease recognition in medical education. We evaluate four state-of-the-art LLMs (GPT 4o mini, GPT 5 mini, Gemini 2.5 Flash, and Gemini 2.5 Pro) on narrative-based diagnostic reasoning tasks. Results show significant variation in performance, ranging from 16.48% to 38.64% accuracy, with newer model generations demonstrating a 2.3 times improvement. While all models face substantial challenges with rare disease diagnosis, the observed improvement across architectures suggests promising directions for future development. Our educationally validated benchmark establishes baseline performance metrics for narrative medical reasoning and provides a publicly accessible evaluation framework for advancing AI-assisted diagnosis research.
[NLP-38] Automated Analysis of Learning Outcomes and Exam Questions Based on Blooms Taxonomy
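[Quick read]: This paper addresses the automatic classification of exam questions and learning outcomes according to Bloom's Taxonomy, with the aim of making educational assessment more efficient and consistent. The key to the solution is a comparison of several families of machine learning models (traditional methods, recurrent neural networks, Transformer models, and large language models) on a small labeled dataset. A Support Vector Machine (SVM) combined with data augmentation strategies (such as synonym replacement and word embeddings) reaches 94% accuracy, recall, and F1 with minimal overfitting. By contrast, deep learning models such as LSTM and BERT overfit severely because of the limited data, and RoBERTa performs well at first but also begins to overfit as training progresses. In zero-shot tests, LLMs from OpenAI and Gemini generalize to some degree (roughly 0.72-0.73 accuracy) but still fall short of the carefully tuned SVM. The study therefore stresses the value of simple algorithms paired with effective data augmentation in small-data settings.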
Link: https://arxiv.org/abs/2511.10903
Authors: Ramya Kumar, Dhruv Gulwani, Sonit Singh
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 7 pages
Abstract:This paper explores the automatic classification of exam questions and learning outcomes according to Bloom’s Taxonomy. A small dataset of 600 sentences labeled with six cognitive categories - Knowledge, Comprehension, Application, Analysis, Synthesis, and Evaluation - was processed using traditional machine learning (ML) models (Naive Bayes, Logistic Regression, Support Vector Machines), recurrent neural network architectures (LSTM, BiLSTM, GRU, BiGRU), transformer-based models (BERT and RoBERTa), and large language models (OpenAI, Gemini, Ollama, Anthropic). Each model was evaluated under different preprocessing and augmentation strategies (for example, synonym replacement, word embeddings, etc.). Among traditional ML approaches, Support Vector Machines (SVM) with data augmentation achieved the best overall performance, reaching 94 percent accuracy, recall, and F1 scores with minimal overfitting. In contrast, the RNN models and BERT suffered from severe overfitting, while RoBERTa initially overcame it but began to show signs as training progressed. Finally, zero-shot evaluations of large language models (LLMs) indicated that OpenAI and Gemini performed best among the tested LLMs, achieving approximately 0.72-0.73 accuracy and comparable F1 scores. These findings highlight the challenges of training complex deep models on limited data and underscore the value of careful data augmentation and simpler algorithms (such as augmented SVM) for Bloom’s Taxonomy classification.
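Synonym replacement, one of the augmentation strategies named in this abstract, can be sketched as follows. The synonym table, the example sentences, and their labels are made up for illustration; the paper does not specify its exact augmentation procedure here.

```python
import random

# Hypothetical synonym table; a real setup might draw on a thesaurus instead.
SYNONYMS = {
    "explain": ["describe", "clarify"],
    "compare": ["contrast"],
    "design": ["devise", "construct"],
}


def synonym_replace(sentence, rng, prob=0.5):
    """Return a copy of sentence with some known words swapped for a synonym."""
    out = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < prob:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)


def augment(dataset, copies=2, seed=0):
    """Add `copies` augmented variants per (sentence, label) pair; labels are unchanged."""
    rng = random.Random(seed)
    augmented = list(dataset)
    for sentence, label in dataset:
        for _ in range(copies):
            augmented.append((synonym_replace(sentence, rng), label))
    return augmented


data = [
    ("Explain the water cycle", "Comprehension"),
    ("Design an experiment to test friction", "Synthesis"),
]
bigger = augment(data)
```

Because each variant keeps its original label, the classifier sees more surface forms per category without any new annotation effort, which matters for a dataset of only 600 sentences.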
[NLP-39] Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions
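[Quick read]: This paper addresses three core limitations of current academic peer review systems: text-only inputs, a lack of contextual grounding, and feedback that is not actionable. The key to the solution is an interactive web platform built on multimodal large language models (MLLMs). It combines textual and visual information to make reviews more comprehensive, uses retrieval-augmented generation (RAG) over large-scale OpenReview data to ground reviews in context and improve their quality, and introduces the Action:Objective[#] format to turn generated reviews into a structured, traceable to-do list that guides manuscript revision. The system integrates seamlessly into existing academic writing platforms and supports real-time feedback and revision tracking. Experiments show that its reviews are more comprehensive and useful than those of the baseline methods and are better aligned with expert standards.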
Link: https://arxiv.org/abs/2511.10902
Authors: Mengze Hong, Di Jiang, Weiwei Zhao, Yawen Li, Yihang Wang, Xinyuan Luo, Yanjie Sun, Chen Jason Zhang
Affiliations: Hong Kong Polytechnic University; WeBank; Beijing University of Posts and Telecommunications; Independent Researcher
Categories: Computation and Language (cs.CL)
Comments:
Abstract:While large language models (LLMs) offer promising capabilities for automating academic workflows, existing systems for academic peer review remain constrained by text-only inputs, limited contextual grounding, and a lack of actionable feedback. In this work, we present an interactive web-based system for multimodal, community-aware peer review simulation to enable effective manuscript revisions before paper submission. Our framework integrates textual and visual information through multimodal LLMs, enhances review quality via retrieval-augmented generation (RAG) grounded in web-scale OpenReview data, and converts generated reviews into actionable to-do lists using the proposed Action:Objective[#] format, providing structured and traceable guidance. The system integrates seamlessly into existing academic writing platforms, providing interactive interfaces for real-time feedback and revision tracking. Experimental results highlight the effectiveness of the proposed system in generating more comprehensive and useful reviews aligned with expert standards, surpassing ablated baselines and advancing transparent, human-centered scholarly assistance.
[NLP-40] Expert-Guided Prompting and Retrieval-Augmented Generation for Emergency Medical Service Question Answering AAAI2026
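[Quick read]: This paper addresses the tendency of large language models (LLMs) to overlook domain-specific expertise in medical question answering: structured context such as the clinical subject area (e.g., trauma, airway) and the certification level (e.g., EMT, Paramedic) is not exploited, which limits performance in high-stakes settings. The key to the solution is the EMSQA dataset (24.3K multiple-choice questions covering 10 clinical subject areas and 4 certification levels) and two enhancement strategies: (i) Expert-CoT, which conditions chain-of-thought (CoT) reasoning on a specific clinical subject area and certification level to improve reasoning accuracy; and (ii) ExpertRAG, which performs retrieval-augmented generation over subject area-aligned knowledge bases and real-world patient data to support more precise medical decisions. Experiments show that Expert-CoT improves on baseline CoT by up to 2.05%, that combining it with ExpertRAG yields up to a 4.59% gain over standard RAG baselines, and that the 32B expertise-augmented LLMs pass the computer-adaptive EMS certification simulation exams.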
Link: https://arxiv.org/abs/2511.10900
Authors: Xueren Ge, Sahil Murtaza, Anthony Cortez, Homa Alemzadeh
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026
Abstract:Large language models (LLMs) have shown promise in medical question answering, yet they often overlook the domain-specific expertise that professionals depend on, such as the clinical subject areas (e.g., trauma, airway) and the certification level (e.g., EMT, Paramedic). Existing approaches typically apply general-purpose prompting or retrieval strategies without leveraging this structured context, limiting performance in high-stakes settings. We address this gap with EMSQA, a 24.3K-question multiple-choice dataset spanning 10 clinical subject areas and 4 certification levels, accompanied by curated, subject area-aligned knowledge bases (40K documents and 2M tokens). Building on EMSQA, we introduce (i) Expert-CoT, a prompting strategy that conditions chain-of-thought (CoT) reasoning on a specific clinical subject area and certification level, and (ii) ExpertRAG, a retrieval-augmented generation pipeline that grounds responses in subject area-aligned documents and real-world patient data. Experiments on 4 LLMs show that Expert-CoT improves up to 2.05% over vanilla CoT prompting. Additionally, combining Expert-CoT with ExpertRAG yields up to a 4.59% accuracy gain over standard RAG baselines. Notably, the 32B expertise-augmented LLMs pass all the computer-adaptive EMS certification simulation exams.
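The idea behind Expert-CoT, conditioning the chain-of-thought prompt on subject area and certification level, can be sketched as a simple prompt builder. The function name, the prompt wording, and the sample question are hypothetical; the paper's actual prompt template is not given in this abstract.

```python
def build_expert_cot_prompt(question, options, subject_area, certification_level):
    """Compose a chain-of-thought prompt conditioned on subject area and certification level."""
    option_lines = "\n".join(f"{key}. {text}" for key, text in options.items())
    return (
        f"You are answering an EMS exam question at the {certification_level} level.\n"
        f"Clinical subject area: {subject_area}.\n\n"
        f"Question: {question}\n{option_lines}\n\n"
        "Think step by step using knowledge appropriate to this subject area "
        "and certification level, then give the letter of the best answer."
    )


# Made-up question for illustration only; not an item from EMSQA.
prompt = build_expert_cot_prompt(
    question="An unresponsive adult has snoring respirations. What is the first step?",
    options={"A": "Open the airway", "B": "Apply a splint"},
    subject_area="airway",
    certification_level="EMT",
)
```

A vanilla CoT prompt would contain only the question, the options, and the step-by-step instruction; the two extra header lines are the structured context the abstract says general-purpose prompting leaves out.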
[NLP-41] From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models
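[Quick read]: This paper addresses a problem with tool-augmented language models (TaLMs): using external tools such as a Code Interpreter can raise final-answer accuracy, yet over-reliance on tool outputs can leave the reasoning process incoherent and untrustworthy. The authors call this phenomenon Tool-Induced Myopia (TIM): the model treats tool results as a shortcut that replaces reasoning rather than as supporting evidence, so solutions that look correct lack logical consistency and depth. The key to the solution is a preference-optimization-based framework that realigns model behavior so that tools serve as assistive evidence rather than a substitute for reasoning, which improves reasoning depth and reliability while maintaining or improving final-answer accuracy.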
Link: https://arxiv.org/abs/2511.10899
Authors: Farima Fatahi Bayat, Pouya Pezeshkpour, Estevam Hruschka
Affiliations: Megagon Labs
Categories: Computation and Language (cs.CL); Logic in Computer Science (cs.LO); Software Engineering (cs.SE)
Comments: 19 pages, 5 figures
Abstract:Tool-augmented Language Models (TaLMs) can invoke external tools to solve problems beyond their parametric capacity. However, it remains unclear whether these tool-enabled gains reflect trustworthy reasoning. Focusing on the Code Interpreter tool, we show that even when tools are selected and executed correctly, TaLMs treat tool outputs as substitutes for reasoning, producing solutions that appear correct but lack coherent justification. We term this failure mode Tool-Induced Myopia (TIM), and study it using PYMATH, a benchmark of 1,679 competition-level mathematical problems for which Python code is helpful but not sufficient. We further develop a multi-dimensional evaluation suite to quantify reasoning degradation in TaLMs relative to their non-tool counterparts. Our findings reveal that while TaLMs achieve up to a 19.3 percentage point gain in final-answer accuracy, their reasoning behavior consistently deteriorates (e.g., non-tool LLMs win up to 41.5% more often in pairwise comparisons of the reasoning process). This degradation intensifies with tool use; the more frequently a model invokes tools, the less coherent its reasoning becomes. Moreover, tool use shifts errors from arithmetic mistakes toward global reasoning failures (logic, assumption, creativity), with TIM present in ~55% of high-risk cases. Finally, we propose a preference-optimization-based framework that realigns TaLMs to use tools as assistive evidence, improving both final-answer accuracy and reasoning depth under tool use. Codes and data are available at: this https URL.
[NLP-42] MedPath: Multi-Domain Cross-Vocabulary Hierarchical Paths for Biomedical Entity Linking AACL
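[Quick read]: This paper addresses three challenges in biomedical Named Entity Recognition (NER) and Entity Linking (EL) research: a fragmented data landscape, a lack of resources for building explainable models, and the limitations of semantically blind evaluation metrics. The key to the solution is MedPath, a large-scale, multi-domain biomedical EL dataset that integrates nine existing expert-annotated EL datasets and contributes three things: 1) all entities are normalized using the latest version of the Unified Medical Language System (UMLS); 2) mappings are extended to 62 other biomedical vocabularies; and 3) crucially, entities are enriched with full ontological paths, i.e., hierarchical paths from general to specific, in up to 11 biomedical vocabularies, which supplies structured semantic information. This design increases the semantic richness and interpretability available to models and supports the development of the next generation of interoperable, explainable clinical natural language processing (NLP) models.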
Link: https://arxiv.org/abs/2511.10887
Authors: Nishant Mishra, Wilker Aziz, Iacer Calixto
Affiliations: Amsterdam UMC, University of Amsterdam; Amsterdam Public Health, Methodology; ILLC, University of Amsterdam
Categories: Computation and Language (cs.CL); Databases (cs.DB)
Comments: Accepted at AACL-IJCNLP 2025 (main)
Abstract:Progress in biomedical Named Entity Recognition (NER) and Entity Linking (EL) is currently hindered by a fragmented data landscape, a lack of resources for building explainable models, and the limitations of semantically-blind evaluation metrics. To address these challenges, we present MedPath, a large-scale and multi-domain biomedical EL dataset that builds upon nine existing expert-annotated EL datasets. In MedPath, all entities are 1) normalized using the latest version of the Unified Medical Language System (UMLS), 2) augmented with mappings to 62 other biomedical vocabularies and, crucially, 3) enriched with full ontological paths – i.e., from general to specific – in up to 11 biomedical vocabularies. MedPath directly enables new research frontiers in biomedical NLP, facilitating training and evaluation of semantic-rich and interpretable EL systems, and the development of the next generation of interoperable and explainable clinical NLP models.
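A full ontological path "from general to specific" can be sketched as a walk up a child-to-parent map followed by a reversal. The toy hierarchy below uses illustrative concept names, not real UMLS identifiers, and is not MedPath's data format.

```python
# Toy child -> parent links; the concept names are illustrative, not real vocabulary codes.
PARENT = {
    "Myocardial infarction": "Ischemic heart disease",
    "Ischemic heart disease": "Heart disease",
    "Heart disease": "Cardiovascular disease",
    "Cardiovascular disease": None,  # root of this toy hierarchy
}


def ontological_path(concept, parent=PARENT):
    """Return the path from the most general ancestor down to `concept`."""
    path = []
    seen = set()
    node = concept
    while node is not None:
        if node in seen:
            raise ValueError(f"cycle detected at {node!r}")
        seen.add(node)
        path.append(node)
        node = parent.get(node)
    return list(reversed(path))


path = ontological_path("Myocardial infarction")
```

With such paths available, a prediction that lands on a sibling or parent concept can be scored as closer than one in an unrelated branch, which is the kind of semantics-aware evaluation the abstract argues current metrics lack.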
[NLP-43] A Multifaceted Analysis of Negative Bias in Large Language Models through the Lens of Parametric Knowledge
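[Quick read]: This paper addresses negative bias in large language models (LLMs), i.e., their tendency to generate negative responses excessively in binary decision tasks. The key contribution is an account of the multiple factors behind this bias. In particular, the models exhibit format-level negative bias: the prompt format influences the output tendency more than the semantic content does. Using a fine-grained evaluation set built from the model's parametric knowledge state (split into correct, incorrect, and insufficient-knowledge subsets), the authors identify a shortcut behavior in which models answer negatively when they lack sufficient knowledge, and this produces the bias. Further experiments show that supplying relevant context or adding an "I don't know" option mitigates the bias, whereas chain-of-thought prompting can amplify it, so the type of prompt significantly affects the direction of the response. The study provides mechanistic insight and empirical evidence for systematically mitigating negative bias in LLMs.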
Link: https://arxiv.org/abs/2511.10881
Authors: Jongyoon Song, Sangwon Yu, Sungroh Yoon
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to IEEE Transactions on Audio, Speech and Language Processing
Abstract:Negative bias refers to the tendency of large language models (LLMs) to excessively generate negative responses in binary decision tasks (e.g., yes-no question answering). Previous research has focused on detecting and addressing negative attention heads that induce negative bias. However, the underlying detailed factors influencing negative bias remain underexplored. In this paper, we demonstrate that LLMs exhibit format-level negative bias, meaning the prompt format influences their responses more than the semantics of the negative response does. For the fine-grained study of the negative bias, we introduce a pipeline for constructing the evaluation set, which systematically categorizes the dataset into three subsets based on the model’s parametric knowledge: correct, incorrect, and insufficient relevant knowledge. Through analysis of this evaluation set, we identify a shortcut behavior in which models tend to generate negative responses when they lack sufficient knowledge to answer a yes-no question, leading to negative bias. We further examine how negative bias changes under various prompting scenarios related to parametric knowledge. We observe that providing relevant context and offering an “I don’t know” option generally reduces negative bias, whereas chain-of-thought prompting tends to amplify the bias. Finally, we demonstrate that the degree of negative bias can vary depending on the type of prompt, which influences the direction of the response. Our work reveals the various factors that influence negative bias, providing critical insights for mitigating it in LLMs.
[NLP-44] ICX360: In-Context eXplainability 360 Toolkit
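[Quick read]: This paper addresses the lack of explainability of large language models (LLMs) in high-stakes applications, especially where the user-provided context (the prompt) has a decisive effect on model output. The core of the solution is In-Context Explainability 360 (ICX360), an open-source Python toolkit that explains LLM outputs with black-box and white-box methods (based on perturbations and gradients, respectively), helping users understand what a model's output depends on and improving trust and control in critical tasks such as answering medical questions and summarizing meetings.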
Link: https://arxiv.org/abs/2511.10879
Authors: Dennis Wei, Ronny Luss, Xiaomeng Hu, Lucas Monteiro Paes, Pin-Yu Chen, Karthikeyan Natesan Ramamurthy, Erik Miehling, Inge Vejsbjerg, Hendrik Strobelt
Affiliations: IBM Research; The Chinese University of Hong Kong; Harvard University; Apple
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 14 pages, 4 figures
Abstract:Large Language Models (LLMs) have become ubiquitous in everyday life and are entering higher-stakes applications ranging from summarizing meeting transcripts to answering doctors’ questions. As was the case with earlier predictive models, it is crucial that we develop tools for explaining the output of LLMs, be it a summary, list, response to a question, etc. With these needs in mind, we introduce In-Context Explainability 360 (ICX360), an open-source Python toolkit for explaining LLMs with a focus on the user-provided context (or prompts in general) that are fed to the LLMs. ICX360 contains implementations for three recent tools that explain LLMs using both black-box and white-box methods (via perturbations and gradients respectively). The toolkit, available at this https URL, contains quick-start guidance materials as well as detailed tutorials covering use cases such as retrieval augmented generation, natural language generation, and jailbreaking.
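The black-box, perturbation-based style of explanation mentioned in this abstract can be illustrated with a generic leave-one-out sketch. This is not the ICX360 API: the function names are hypothetical, and `toy_score` stands in for a real model call such as the probability assigned to the original answer.

```python
def leave_one_out_attribution(context_units, score_fn):
    """Score drop when each context unit is removed; larger means more influential."""
    baseline = score_fn(context_units)
    attributions = []
    for i in range(len(context_units)):
        perturbed = context_units[:i] + context_units[i + 1:]
        attributions.append(baseline - score_fn(perturbed))
    return attributions


def toy_score(units):
    """Stand-in for a model call.

    Here it simply checks whether the supporting fact is still in the context.
    """
    return 0.9 if any("Paris" in u for u in units) else 0.2


context = [
    "The Eiffel Tower is in Paris.",
    "It was completed in 1889.",
    "Tickets can be bought online.",
]
scores = leave_one_out_attribution(context, toy_score)
```

Only the first sentence receives a nonzero attribution, since removing it is the only perturbation that changes the score. A white-box method would instead read this sensitivity from gradients with respect to the input.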
[NLP-45] From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems
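[Quick read]: This paper addresses the reliability of large language models (LLMs) as judges in social or conversational settings, namely whether an LLM's judgment shifts systematically when a task changes from a direct factual question to a conversational judgment task that requires assessing a speaker's correctness. The key to the solution is a reproducible evaluation framework: the same information is presented both as a direct factual query and as an assessment of a speaker's correctness within a minimal dialogue, and a simple rebuttal ("The previous answer is incorrect.") is applied as a pressure perturbation to quantify how stable the model's conviction is under social framing. Experiments show marked differences across models: some (such as GPT-4o-mini) become sycophantic, others (such as Llama-8B-Instruct) become overly critical, and the average performance change is 9.24%, which highlights how strongly conversational framing affects LLM judgment.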
Link: https://arxiv.org/abs/2511.10871
Authors: Parisa Rabbani, Nimet Beyza Bozdag, Dilek Hakkani-Tür
Affiliations: University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL)
Comments: 11 pages, 3 figures. Under review at IWSDS 2026
Abstract:LLMs are increasingly employed as judges across a variety of tasks, including those involving everyday social interactions. Yet, it remains unclear whether such LLM-judges can reliably assess tasks that require social or conversational judgment. We investigate how an LLM’s conviction is changed when a task is reframed from a direct factual query to a Conversational Judgment Task. Our evaluation framework contrasts the model’s performance on direct factual queries with its assessment of a speaker’s correctness when the same information is presented within a minimal dialogue, effectively shifting the query from “Is this statement correct?” to “Is this speaker correct?”. Furthermore, we apply pressure in the form of a simple rebuttal (“The previous answer is incorrect.”) to both conditions. This perturbation allows us to measure how firmly the model maintains its position under conversational pressure. Our findings show that while some models like GPT-4o-mini reveal sycophantic tendencies under social framing tasks, others like Llama-8B-Instruct become overly-critical. We observe an average performance change of 9.24% across all models, demonstrating that even minimal dialogue context can significantly alter model judgment, underscoring conversational framing as a key factor in LLM-based evaluation. The proposed framework offers a reproducible methodology for diagnosing model conviction and contributes to the development of more trustworthy dialogue systems.
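[NLP-46] Leveraging Parameter Space Symmetries for Reasoning Skill Transfer in LLMs
[Quick read]: This paper addresses the negative interference that task arithmetic suffers when transferring skills between large language models (LLMs) whose parameter spaces have diverged. The key to the solution is to align the models' parameter spaces first, exploiting the permutation, rotation, and scaling symmetries inherent in Transformer architectures. The alignment is adapted to modern Grouped-Query Attention (GQA) and SwiGLU layers and explores both weight-based and activation-based approaches. With this alignment-first strategy, the authors transfer advanced reasoning skills to a non-reasoning model and consistently outperform standard task arithmetic on challenging reasoning benchmarks, offering an effective route for merging and transferring specialized skills across LLM families.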
Link: https://arxiv.org/abs/2511.10850
Authors: Stefan Horoi, Sangwoo Cho, Supriyo Chakraborty, Shi-Xiong Zhang, Sambit Sahu, Guy Wolf, Genta Indra Winata
Affiliations: Université de Montréal; Mila – Quebec AI Institute; Capital One
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Task arithmetic is a powerful technique for transferring skills between Large Language Models (LLMs), but it often suffers from negative interference when models have diverged during training. We address this limitation by first aligning the models’ parameter spaces, leveraging the inherent permutation, rotation, and scaling symmetries of Transformer architectures. We adapt parameter space alignment for modern Grouped-Query Attention (GQA) and SwiGLU layers, exploring both weight-based and activation-based approaches. Using this alignment-first strategy, we successfully transfer advanced reasoning skills to a non-reasoning model. Experiments on challenging reasoning benchmarks show that our method consistently outperforms standard task arithmetic. This work provides an effective approach for merging and transferring specialized skills across evolving LLM families, reducing redundant fine-tuning and enhancing model adaptability.
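Why alignment matters before merging can be seen on a toy example of permutation symmetry. The sketch below uses a two-layer ReLU network in plain Python rather than GQA or SwiGLU layers, and it is not the authors' alignment procedure: it only shows that reordering hidden units leaves the function unchanged while naive weight averaging of the two equivalent networks does not.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def matvec(matrix, v):
    return [sum(w * x for w, x in zip(row, v)) for row in matrix]


def mlp(w_in, w_out, x):
    """Two-layer ReLU network: w_out @ relu(w_in @ x)."""
    return matvec(w_out, relu(matvec(w_in, x)))


def permute_hidden(w_in, w_out, perm):
    """Reorder hidden units: rows of w_in and the matching columns of w_out."""
    new_in = [w_in[j] for j in perm]
    new_out = [[row[j] for j in perm] for row in w_out]
    return new_in, new_out


w_in = [[1.0, -2.0], [0.5, 0.25], [-1.0, 3.0]]  # 3 hidden units, 2 inputs
w_out = [[2.0, -1.0, 0.5]]                       # 1 output, 3 hidden units
x = [1.0, 2.0]

p_in, p_out = permute_hidden(w_in, w_out, perm=[2, 0, 1])
same_function = mlp(w_in, w_out, x) == mlp(p_in, p_out, x)

# Averaging the two equivalent networks without aligning them changes the function.
avg_in = [[(a + b) / 2 for a, b in zip(r1, r2)] for r1, r2 in zip(w_in, p_in)]
avg_out = [[(a + b) / 2 for a, b in zip(r1, r2)] for r1, r2 in zip(w_out, p_out)]
naive_merge_differs = mlp(avg_in, avg_out, x) != mlp(w_in, w_out, x)
```

Two models that compute the same function can thus sit at different points in parameter space, so adding or averaging their weights directly mixes unrelated units. Undoing the permutation first removes that source of interference.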
[NLP-47] Reinforcing Stereotypes of Anger: Emotion AI on African American Vernacular English
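[Quick read]: This paper addresses the pronounced bias of current emotion recognition models across cultures and dialects, in particular their poor accuracy on emotional expression in African American Vernacular English (AAVE). This leads to misjudging the emotions of AAVE speakers (for example, overestimating anger), which can reinforce racial stereotypes and raises ethical risks. The key to the solution is an ingroup annotation scheme: African American annotators fluent in AAVE produce community-informed "silver" labels, which serve as a fairer and more culturally sensitive evaluation reference. By quantifying differences in model performance across dialects (for example, SpanEmo's false positive rate for anger reaches 60% on AAVE), the study exposes the cultural bias of existing models and calls for affective computing systems that are more inclusive and informed by dialect and cultural context.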
Link: https://arxiv.org/abs/2511.10846
Authors: Rebecca Dorn, Christina Chance, Casandra Rusti, Charles Bickham Jr., Kai-Wei Chang, Fred Morstatter, Kristina Lerman
Affiliations: University of Southern California, Information Sciences Institute; University of California, Los Angeles
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:
Abstract:Automated emotion detection is widely used in applications ranging from well-being monitoring to high-stakes domains like mental health and hiring. However, models often rely on annotations that reflect dominant cultural norms, limiting models' ability to recognize emotional expression in dialects often excluded from training data distributions, such as African American Vernacular English (AAVE). This study examines emotion recognition model performance on AAVE compared to General American English (GAE). We analyze 2.7 million tweets geo-tagged within Los Angeles. Texts are scored for strength of AAVE using computational approximations of dialect features. Annotations of emotion presence and intensity are collected on a dataset of 875 tweets with both high and low AAVE densities. To assess model accuracy on a task as subjective as emotion perception, we calculate community-informed “silver” labels where AAVE-dense tweets are labeled by African American, AAVE-fluent (ingroup) annotators. On our labeled sample, GPT and BERT-based models exhibit false positive prediction rates of anger on AAVE more than double those on GAE. SpanEmo, a popular text-based emotion model, increases false positive rates of anger from 25 percent on GAE to 60 percent on AAVE. Additionally, a series of linear regressions reveals that models and non-ingroup annotations are significantly more correlated with profanity-based AAVE features than ingroup annotations. Linking Census tract demographics, we observe that neighborhoods with higher proportions of African American residents are associated with higher predictions of anger (Pearson’s correlation r = 0.27) and lower joy (r = -0.10). These results point to an emergent safety issue of emotion AI reinforcing racial stereotypes through biased emotion classification. We emphasize the need for culturally and dialect-informed affective computing systems.
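[NLP-48] Tracing Multilingual Representations in LLMs with Cross-Layer Transcoders
[Quick read]: This paper addresses how multilingual large language models (LLMs) internally represent linguistic diversity, in particular whether they form shared multilingual (pivot language) representations and why performance still favors the dominant training language. The key to the solution is to train models on different mixtures of multilingual data and analyze their internal mechanisms with cross-layer transcoders (CLT) and attribution graphs. The models use nearly identical representations across languages, while language-specific decoding emerges in later layers. Attribution analysis shows that a small set of high-frequency language features in the final layers linearly read out language identity from the first layers, and intervening on these features makes it possible to control the output language. The study argues that understanding this pivot-language mechanism is crucial for improving multilingual alignment.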
Link: https://arxiv.org/abs/2511.10840
Authors: Abir Harrasse, Florent Draye, Zhijing Jin, Bernhard Schölkopf
Affiliations: Max Planck Institute for Intelligent Systems; Mohammed VI Polytechnic University; University of Toronto; Vector Institute; ELLIS Institute
Categories: Computation and Language (cs.CL)
Comments: 28 pages, 35 figures, under review. Extensive supplementary materials. Code and models available at this https URL and this https URL
Abstract:Multilingual Large Language Models (LLMs) can process many languages, yet how they internally represent this diversity remains unclear. Do they form shared multilingual representations with language-specific decoding, and if so, why does performance still favor the dominant training language? To address this, we train a series of LLMs on different mixtures of multilingual data and analyze their internal mechanisms using cross-layer transcoders (CLT) and attribution graphs. Our results provide strong evidence for pivot language representations: the model employs nearly identical representations across languages, while language-specific decoding emerges in later layers. Attribution analyses reveal that decoding relies in part on a small set of high-frequency language features in the final layers, which linearly read out language identity from the first layers in the model. By intervening on these features, we can suppress one language and substitute another in the model’s outputs. Finally, we study how the dominant training language influences these mechanisms across attribution graphs and decoding pathways. We argue that understanding this pivot-language mechanism is crucial for improving multilingual alignment in LLMs.
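[NLP-49] The Map of Misbelief: Tracing Intrinsic and Extrinsic Hallucinations Through Attention Patterns AAAI2025
[Quick read]: This paper addresses hallucination in large language models (LLMs) deployed in safety-critical domains, and in particular two weaknesses of existing detection methods based on confidence representations: most rely on computationally expensive sampling strategies, and they do not distinguish between hallucination types. The key to the solution is a principled evaluation framework that separates extrinsic from intrinsic hallucinations, combined with a recent attention-based uncertainty quantification algorithm and novel attention aggregation strategies that improve both interpretability and detection performance. Experiments show that sampling-based methods such as Semantic Entropy are effective for extrinsic hallucinations but generally fail on intrinsic ones, whereas aggregating attention over input tokens is better suited to intrinsic hallucinations. Detection strategies therefore need to match the nature of the hallucination, and attention is a rich signal for quantifying model uncertainty.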
Link: https://arxiv.org/abs/2511.10837
Authors: Elyes Hajji, Aymen Bouguerra, Fabio Arnez
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at AAAI 2025-FS-ATRACC
Abstract:Large Language Models (LLMs) are increasingly deployed in safety-critical domains, yet remain susceptible to hallucinations. While prior works have proposed confidence representation methods for hallucination detection, most of these approaches rely on computationally expensive sampling strategies and often disregard the distinction between hallucination types. In this work, we introduce a principled evaluation framework that differentiates between extrinsic and intrinsic hallucination categories and evaluates detection performance across a suite of curated benchmarks. In addition, we leverage a recent attention-based uncertainty quantification algorithm and propose novel attention aggregation strategies that improve both interpretability and hallucination detection performance. Our experimental findings reveal that sampling-based methods like Semantic Entropy are effective for detecting extrinsic hallucinations but generally fail on intrinsic ones. In contrast, our method, which aggregates attention over input tokens, is better suited for intrinsic hallucinations. These insights provide new directions for aligning detection strategies with the nature of hallucination and highlight attention as a rich signal for quantifying model uncertainty.
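[NLP-50] LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation
[Quick read]: This paper addresses the under-examined question of how well large language models (LLMs) agree with human graders in real classrooms, in particular for short-answer quizzes and project reports. The approach uses GPT-4o to grade the responses of approximately 50 students across five quizzes and the project reports of 14 teams in an undergraduate Computational Linguistics course, and compares the results with grades assigned independently by the course teaching assistants (TAs). GPT-4o achieves a correlation of up to 0.98 with human graders and exact score agreement in 55% of quiz cases. On project reports it also aligns well overall but shows some variability on technical, open-ended responses. The study offers empirical evidence for LLM-based automated grading in real teaching settings.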
Link: https://arxiv.org/abs/2511.10819
Authors: Grace Byun, Swati Rajwal, Jinho D. Choi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) are increasingly explored for educational tasks such as grading, yet their alignment with human evaluation in real classrooms remains underexamined. In this study, we investigate the feasibility of using an LLM (GPT-4o) to evaluate short-answer quizzes and project reports in an undergraduate Computational Linguistics course. We collect responses from approximately 50 students across five quizzes and receive project reports from 14 teams. LLM-generated scores are compared against human evaluations conducted independently by the course teaching assistants (TAs). Our results show that GPT-4o achieves strong correlation with human graders (up to 0.98) and exact score agreement in 55% of quiz cases. For project reports, it also shows strong overall alignment with human grading, while exhibiting some variability in scoring technical, open-ended responses. We release all code and sample data to support further research on LLMs in educational assessment. This work highlights both the potential and limitations of LLM-based grading systems and contributes to advancing automated grading in real-world academic settings.
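The two agreement measures reported in this abstract, correlation with human graders and exact score agreement, can be computed as in the sketch below. The scores are made up for illustration and are not data from the paper; Pearson correlation is assumed here, since the abstract does not name the coefficient.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def exact_agreement(xs, ys):
    """Fraction of items on which both graders gave exactly the same score."""
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)


# Made-up quiz scores for five answers (not data from the paper).
human_scores = [5, 3, 4, 2, 5]
llm_scores = [5, 3, 3, 2, 4]

r = pearson(human_scores, llm_scores)
agreement = exact_agreement(human_scores, llm_scores)
```

The two numbers answer different questions: correlation can be high even when the LLM is consistently one point stricter, while exact agreement counts only identical scores. That is how a 0.98 correlation and 55% exact agreement can coexist.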
[NLP-51] From Efficiency to Adaptivity: A Deeper Look at Adaptive Reasoning in Large Language Models
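[Quick Read]: This survey addresses the lack of adaptivity in how current large language models (LLMs) reason: they apply one uniform strategy to every task, producing long reasoning traces for trivial problems while failing to extend reasoning for hard ones. It reframes reasoning as adaptive reasoning, meaning that reasoning effort is allocated according to input characteristics such as difficulty and uncertainty. The authors formalize adaptive reasoning as a control-augmented policy optimization problem that balances task performance against computational cost, and organize existing methods into training-based approaches (reinforcement learning, supervised fine-tuning, learned controllers) and training-free approaches (prompt conditioning, feedback-driven halting, modular composition), which makes it possible to explain and compare how different mechanisms realize adaptive reasoning.
Link: https://arxiv.org/abs/2511.10788
Authors: Chao Wu, Baoheng Li, Mingchen Gao, Zhenyi Wang
Affiliations: University at Buffalo; University of Central Florida
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: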
Abstract:Recent advances in large language models (LLMs) have made reasoning a central benchmark for evaluating intelligence. While prior surveys focus on efficiency by examining how to shorten reasoning chains or reduce computation, this view overlooks a fundamental challenge: current LLMs apply uniform reasoning strategies regardless of task complexity, generating long traces for trivial problems while failing to extend reasoning for difficult tasks. This survey reframes reasoning through the lens of adaptivity: the capability to allocate reasoning effort based on input characteristics such as difficulty and uncertainty. We make three contributions. First, we formalize deductive, inductive, and abductive reasoning within the LLM context, connecting these classical cognitive paradigms with their algorithmic realizations. Second, we formalize adaptive reasoning as a control-augmented policy optimization problem balancing task performance with computational cost, distinguishing learned policies from inference-time control mechanisms. Third, we propose a systematic taxonomy organizing existing methods into training-based approaches that internalize adaptivity through reinforcement learning, supervised fine-tuning, and learned controllers, and training-free approaches that achieve adaptivity through prompt conditioning, feedback-driven halting, and modular composition. This framework clarifies how different mechanisms realize adaptive reasoning in practice and enables systematic comparison across diverse strategies. We conclude by identifying open challenges in self-evaluation, meta-reasoning, and human-aligned reasoning control.
[NLP-52] Sabiá: Um Chatbot de Inteligência Artificial Generativa para Suporte no Dia a Dia do Ensino Superior
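[Quick Read]: This paper addresses the difficulty students face in finding day-to-day academic information, which is typically scattered across many institutional documents and websites and is therefore unclear and confusing. The solution is a chatbot built on Generative Artificial Intelligence (GenAI) and Retrieval-Augmented Generation (RAG) that consolidates and simplifies access to this information. Several GenAI models were compared using quality metrics and the LLM-as-a-Judge approach; Gemini 2.0 Flash stood out for quality and speed, and Gemma 3n for good performance combined with its open-source nature.
Link: https://arxiv.org/abs/2511.10787
Authors: Guilherme Biava Rodrigues, Franciele Beal, Marlon Marcon, Alinne Cristinne Corrêa Souza, André Roberto Ortoncelli, Francisco Carlos Monteiro Souza, Rodolfo Adamshuk Silva
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes: Accepted for publication at SBIE2025, in Portuguese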
Abstract:Students often report difficulties in accessing day-to-day academic information, which is usually spread across numerous institutional documents and websites. This fragmentation results in a lack of clarity and causes confusion about routine university information. This project proposes the development of a chatbot using Generative Artificial Intelligence (GenAI) and Retrieval-Augmented Generation (RAG) to simplify access to such information. Several GenAI models were tested and evaluated based on quality metrics and the LLM-as-a-Judge approach. Among them, Gemini 2.0 Flash stood out for its quality and speed, and Gemma 3n for its good performance and open-source nature.
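[NLP-53] TEDxTN: A Three-way Speech Translation Corpus for Code-Switched Tunisian Arabic - English
[Quick Read]: This paper addresses data scarcity for speech translation of Arabic dialects, Tunisian Arabic in particular. It builds and publicly releases TEDxTN, the first Tunisian Arabic to English speech translation corpus: 108 TEDx talks totaling 25 hours of speech, covering a variety of accents and code-switching, together with the annotation guidelines. The resource provides an extensible basis for end-to-end speech recognition and speech translation systems and should help advance natural language processing research on Tunisian Arabic.
Link: https://arxiv.org/abs/2511.10780
Authors: Fethi Bougares, Salima Mdhaffar, Haroun Elleuch, Yannick Estève
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: The Third Arabic Natural Language Processing Conference. Association for Computational Linguistics. 2025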
Abstract:In this paper, we introduce TEDxTN, the first publicly available Tunisian Arabic to English speech translation dataset. This work is in line with the ongoing effort to mitigate the data scarcity obstacle for a number of Arabic dialects. We collected, segmented, transcribed and translated 108 TEDx talks following our internally developed annotations guidelines. The collected talks represent 25 hours of speech with code-switching that cover speakers with various accents from over 11 different regions of Tunisia. We make the annotation guidelines and corpus publicly available. This will enable the extension of TEDxTN to new talks as they become available. We also report results for strong baseline systems of Speech Recognition and Speech Translation using multiple pre-trained and fine-tuned end-to-end models. This corpus is the first open source and publicly available speech translation corpus of Code-Switching Tunisian dialect. We believe that this is a valuable resource that can motivate and facilitate further research on the natural language processing of Tunisian Dialect.
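[NLP-54] Faithful Summarization of Consumer Health Queries: A Cross-Lingual Framework with LLMs NEURIPS2025
[Quick Read]: This paper addresses the serious risk posed by unfaithful summaries in medical text summarization, which can misrepresent key medical details and thereby harm doctor-patient communication and clinical decisions. The proposed framework combines TextRank-based sentence extraction and medical named entity recognition (NER) with fine-tuned large language models (LLMs) to improve faithfulness. On the MeQSum (English) and BanglaCHQ-Summ (Bangla) datasets, the method outperforms zero-shot baselines and prior systems, with consistent gains on quality (ROUGE, BERTScore, readability) and faithfulness (SummaC, AlignScore) metrics. Over 80% of generated summaries preserve critical medical information, which supports faithfulness as an essential dimension of reliable medical summarization.
Link: https://arxiv.org/abs/2511.10768
Authors: Ajwad Abrar, Nafisa Tabassum Oeshy, Prianka Maheru, Farzana Tabassum, Tareque Mohmud Chowdhury
Affiliations: Islamic University of Technology
Categories: Computation and Language (cs.CL)
Notes: Accepted at the 5th Muslims in Machine Learning (MusIML) Workshop, co-located with NeurIPS 2025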
Abstract:Summarizing consumer health questions (CHQs) can ease communication in healthcare, but unfaithful summaries that misrepresent medical details pose serious risks. We propose a framework that combines TextRank-based sentence extraction and medical named entity recognition with large language models (LLMs) to enhance faithfulness in medical text summarization. In our experiments, we fine-tuned the LLaMA-2-7B model on the MeQSum (English) and BanglaCHQ-Summ (Bangla) datasets, achieving consistent improvements across quality (ROUGE, BERTScore, readability) and faithfulness (SummaC, AlignScore) metrics, and outperforming zero-shot baselines and prior systems. Human evaluation further shows that over 80% of generated summaries preserve critical medical information. These results highlight faithfulness as an essential dimension for reliable medical summarization and demonstrate the potential of our approach for safer deployment of LLMs in healthcare contexts.
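The TextRank-based sentence extraction mentioned in this abstract can be sketched as a graph ranking over sentences. This is a generic, minimal TextRank (word-overlap similarity plus PageRank-style iteration), not the authors' pipeline; the tokenizer, damping factor, iteration count, and example sentences are all assumptions.

```python
import math
import re

def tokenize(sentence):
    """Lowercased set of alphabetic words (a deliberately simple tokenizer)."""
    return set(re.findall(r"[a-z]+", sentence.lower()))

def similarity(a, b):
    """Shared words, normalised by the log sizes of both word sets."""
    words_a, words_b = tokenize(a), tokenize(b)
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids log(1) = 0 in the denominator
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))

def textrank(sentences, damping=0.85, iters=50):
    """Score sentences by weighted PageRank over the similarity graph."""
    n = len(sentences)
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] * scores[j] / out_sum[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores

# Hypothetical consumer health question split into sentences.
sentences = [
    "I have had severe headaches and nausea for two weeks.",
    "My neighbor recently bought a new car.",
    "Are the headaches and nausea a side effect of my blood pressure medication?",
]
scores = textrank(sentences)
top_sentence = sentences[scores.index(max(scores))]
```

In a pipeline like the one described, the top-ranked sentences would presumably be passed to the LLM together with extracted medical entities; that wiring is not shown here.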
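[NLP-55] PISanitizer: Preventing Prompt Injection to Long-Context LLMs via Prompt Sanitization
[Quick Read]: This paper addresses the vulnerability of long-context large language models (LLMs) to prompt injection attacks, in which a malicious instruction embedded in a long context induces the model to produce an attacker-desired output. Existing defenses are designed for short contexts and have limited effectiveness on long ones, because the injected content makes up only a very small portion of the context. The proposed PISanitizer first uses the LLM's own attention mechanism to pinpoint the high-attention tokens that drive instruction-following behavior, then sanitizes these potentially injected tokens to eliminate the influence of the malicious instruction. It builds on two observations: (1) a prompt injection essentially crafts an instruction that compels the model to follow it, and (2) the model relies on attention to focus on crucial input tokens when generating output. This creates a dilemma for the attacker: the more effectively an injected instruction compels the model, the more likely it is to be detected and sanitized.
Link: https://arxiv.org/abs/2511.10720
Authors: Runpeng Geng, Yanting Wang, Chenlong Yin, Minhao Cheng, Ying Chen, Jinyuan Jia
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Notes: The code is available at this https URL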
Abstract:Long context LLMs are vulnerable to prompt injection, where an attacker can inject an instruction in a long context to induce an LLM to generate an attacker-desired output. Existing prompt injection defenses are designed for short contexts. When extended to long-context scenarios, they have limited effectiveness. The reason is that an injected instruction constitutes only a very small portion of a long context, making the defense very challenging. In this work, we propose PISanitizer, which first pinpoints and sanitizes potential injected tokens (if any) in a context before letting a backend LLM generate a response, thereby eliminating the influence of the injected instruction. To sanitize injected tokens, PISanitizer builds on two observations: (1) prompt injection attacks essentially craft an instruction that compels an LLM to follow it, and (2) LLMs intrinsically leverage the attention mechanism to focus on crucial input tokens for output generation. Guided by these two observations, we first intentionally let an LLM follow arbitrary instructions in a context and then sanitize tokens receiving high attention that drive the instruction-following behavior of the LLM. By design, PISanitizer presents a dilemma for an attacker: the more effectively an injected instruction compels an LLM to follow it, the more likely it is to be sanitized by PISanitizer. Our extensive evaluation shows that PISanitizer can successfully prevent prompt injection, maintain utility, outperform existing defenses, is efficient, and is robust to optimization-based and strong adaptive attacks. The code is available at this https URL.
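The core idea of removing tokens that receive high attention can be shown with a toy filter. This is only a sketch of the principle: the attention weights are made-up numbers, the fixed threshold is an assumption, and the abstract does not describe how PISanitizer actually aggregates attention or selects tokens.

```python
def sanitize(tokens, attention, threshold):
    """Drop tokens whose attention weight is at or above the threshold."""
    if len(tokens) != len(attention):
        raise ValueError("tokens and attention must have the same length")
    return [tok for tok, att in zip(tokens, attention) if att < threshold]

# Hypothetical context with an injected instruction and invented attention weights.
tokens = ["The", "report", "covers", "Q3", "sales", ".",
          "Ignore", "previous", "instructions", "."]
attention = [0.02, 0.05, 0.03, 0.04, 0.05, 0.01,
             0.30, 0.22, 0.27, 0.01]

cleaned = sanitize(tokens, attention, threshold=0.2)
```

The attacker's dilemma described in the abstract shows up even in this toy: an instruction that attracts more attention is exactly the one the filter removes.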
[NLP-56] Co-EPG: A Framework for Co-Evolution of Planning and Grounding in Autonomous GUI Agents AAAI2026
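[Quick Read]: This paper addresses two core limitations of current GUI agent models for graphical user interface (GUI) task automation: insufficient exploitation of the synergy between the planning and grounding models, and over-reliance on synthetic data generation without making sufficient use of that data. It proposes Co-EPG (Co-Evolution of Planning and Grounding), a self-iterative training framework that sets up a positive feedback loop in which the two models co-evolve. Guided by rewards from the grounding model, the planning model explores better strategies via Group Relative Policy Optimization (GRPO) and generates diverse data to optimize the grounding model; the optimized grounding model in turn provides more effective rewards for subsequent GRPO training. Through self-play optimization and training data distillation, agent capability improves iteratively without external data.
Link: https://arxiv.org/abs/2511.10705
Authors: Yuan Zhao, Hualei Zhu, Tingyu Jiang, Shen Li, Xiaohang Xu, Hao Henry Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: Accepted by AAAI 2026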
Abstract:Graphical User Interface (GUI) task automation constitutes a critical frontier in artificial intelligence research. While effective GUI agents synergistically integrate planning and grounding capabilities, current methodologies exhibit two fundamental limitations: (1) insufficient exploitation of cross-model synergies, and (2) over-reliance on synthetic data generation without sufficient utilization. To address these challenges, we propose Co-EPG, a self-iterative training framework for Co-Evolution of Planning and Grounding. Co-EPG establishes an iterative positive feedback loop: through this loop, the planning model explores superior strategies under grounding-based reward guidance via Group Relative Policy Optimization (GRPO), generating diverse data to optimize the grounding model. Concurrently, the optimized Grounding model provides more effective rewards for subsequent GRPO training of the planning model, fostering continuous improvement. Co-EPG thus enables iterative enhancement of agent capabilities through self-play optimization and training data distillation. On the Multimodal-Mind2Web and AndroidControl benchmarks, our framework outperforms existing state-of-the-art methods after just three iterations without requiring external data. The agent consistently improves with each iteration, demonstrating robust self-enhancement capabilities. This work establishes a novel training paradigm for GUI agents, shifting from isolated optimization to an integrated, self-driven co-evolution approach.
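GRPO, which this abstract uses to train the planning model, scores each sampled output relative to the other outputs in its group. A minimal sketch of that group-relative advantage follows; the rewards are hypothetical stand-ins for grounding-based rewards, and the population standard deviation and the `eps` term are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward by the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four plans sampled for the same GUI task.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Plans that beat the group average get a positive advantage and the others a negative one, so no separate value model is needed to obtain a baseline.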
[NLP-57] π-Attention: Periodic Sparse Transformers for Efficient Long-Context Modeling
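[Quick Read]: This paper addresses the bottleneck that the quadratic cost of attention, $\mathcal{O}(L^2)$, imposes on Transformer long-sequence modeling, and in particular the limited receptive field and lack of adaptability of existing sparse attention mechanisms such as RingAttention. It proposes ΠAttention, a periodic sparse attention mechanism that factorizes attention into ring-local neighborhoods, deterministic $\pi$-stride skips, and an adaptive fusion gate. Per-layer complexity stays linear in sequence length, $\mathcal{O}(L)$, while the receptive field grows as $\mathcal{O}(kL + \pi \log L)$, compared with $\mathcal{O}(kL)$ for RingAttention. Experiments on language modeling, retrieval, and vision-language tasks show that the method matches or surpasses dense attention quality at lower resource cost.
Link: https://arxiv.org/abs/2511.10696
Authors: Dong Liu, Yanxuan Yu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted to IEEE International Conference on Parallel and Distributed Systems 2025 (ICPADS 2025 Oral)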
Abstract:Transformers have revolutionized natural language processing, but their quadratic complexity with respect to sequence length remains a fundamental bottleneck for long-range modeling. While sparse attention mechanisms like RingAttention reduce computational costs by restricting attention to local neighborhoods, they suffer from limited receptive fields and lack of adaptability. We present ΠAttention, a periodic sparse Transformer that factorizes attention into ring-local neighborhoods, deterministic $\pi$-stride skips, and an adaptive fusion gate. The periodic structure provides predictable coverage of distant tokens, while the sparse footprint keeps the per-layer complexity linear in context length. We prove that ΠAttention achieves $\mathcal{O}(kL + \pi \log L)$ receptive field growth compared to $\mathcal{O}(kL)$ for RingAttention, where $k$ is the local window size, $\pi$ is the skip period, and $L$ is the sequence length. Extensive experiments on language modeling, retrieval, and vision-language tasks demonstrate that ΠAttention matches or surpasses dense attention quality with 8.3% lower perplexity than RingAttention while using 50% fewer GPUs for the same context length. Our detailed ablations and visualizations reveal the importance of periodic skips, adaptive fusion, and head-level sparsity coordination for efficient long-context modeling.
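One way to picture the combination of ring-local neighborhoods and periodic stride skips is as a boolean attention mask. The construction below is an assumption made for illustration (ring distance for the local window, offsets that are multiples of the period for the skips); the abstract does not give the paper's exact pattern, and the adaptive fusion gate is omitted.

```python
def pi_attention_mask(seq_len, window, period):
    """Boolean mask: True where query i may attend to key j.

    Assumed pattern: a ring-local window of radius `window` plus
    stride skips to positions whose offset is a multiple of `period`.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            offset = abs(i - j)
            ring_dist = min(offset, seq_len - offset)  # wrap-around distance
            local = ring_dist <= window
            skip = offset % period == 0
            row.append(local or skip)
        mask.append(row)
    return mask

# Small hypothetical configuration: L = 16, k = 1, period = 4.
mask = pi_attention_mask(seq_len=16, window=1, period=4)
keys_per_query = [sum(row) for row in mask]
```

In this configuration every query attends to far fewer keys than the full sequence, while the skips still reach distant positions that the local window alone would miss.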
[NLP-58] “As Eastern Powers I will veto.”: An Investigation of Nation-level Bias of Large Language Models in International Relations AAAI2026
[Quick Read]: This paper examines nation-level bias of large language models (LLMs) in the International Relations (IR) domain, focusing on bias toward the five permanent members of the United Nations Security Council and on how that bias varies across models and tasks. It proposes a debiasing framework that combines Retrieval-Augmented Generation (RAG) with Reflexion-based self-reflection to strengthen the models' factual reasoning, and shows that the framework reduces nation-level bias, particularly for GPT-4o-mini and LLama-3.3-70B.
Link: https://arxiv.org/abs/2511.10695
Authors: Jonghyeon Choi, Yeonjun Choi, Hyun-chul Kim, Beakcheol Jang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Notes: 21 pages, 4 figures. This is the extended version of the paper accepted at AAAI 2026, which includes all technical appendices and additional experimental details
Abstract:This paper systematically examines nation-level biases exhibited by Large Language Models (LLMs) within the domain of International Relations (IR). Leveraging historical records from the United Nations Security Council (UNSC), we developed a bias evaluation framework comprising three distinct tests to explore nation-level bias in various LLMs, with a particular focus on the five permanent members of the UNSC. Experimental results show that, even with the general bias patterns across models (e.g., favorable biases toward the western nations, and unfavorable biases toward Russia), these still vary based on the LLM. Notably, even within the same LLM, the direction and magnitude of bias for a nation change depending on the evaluation context. This observation suggests that LLM biases are fundamentally multidimensional, varying across models and tasks. We also observe that models with stronger reasoning abilities show reduced bias and better performance. Building on this finding, we introduce a debiasing framework that improves LLMs’ factual reasoning combining Retrieval-Augmented Generation with Reflexion-based self-reflection techniques. Experiments show it effectively reduces nation-level bias, and improves performance, particularly in GPT-4o-mini and LLama-3.3-70B. Our findings emphasize the need to assess nation-level bias alongside performance when applying LLMs in the IR domain.
[NLP-59] Where does an LLM begin computing an instruction?
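[Quick Read]: This paper asks where along the layer stack instruction following begins, that is, the point at which a model moves from merely reading its input to executing the instruction. It introduces three simple tasks (Key-Value, Quote Attribution, Letter Selection) and their two-hop compositions, and uses activation patching on minimal-contrast prompt pairs to measure a layer-wise flip rate. This reveals an inflection point, termed the onset: patching residual activations before this layer changes the predicted answer, while doing so afterward is largely ineffective. The onset marks where instruction execution begins.
Link: https://arxiv.org/abs/2511.10694
Authors: Aditya Pola, Vineeth N. Balasubramanian
Affiliations: Indian Institute of Technology, Hyderabad
Categories: Computation and Language (cs.CL)
Notes: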
Abstract:Following an instruction involves distinct sub-processes, such as reading content, reading the instruction, executing it, and producing an answer. We ask where, along the layer stack, instruction following begins, the point where reading gives way to doing. We introduce three simple datasets (Key-Value, Quote Attribution, Letter Selection) and two hop compositions of these tasks. Using activation patching on minimal-contrast prompt pairs, we measure a layer-wise flip rate that indicates when substituting selected residual activations changes the predicted answer. Across models in the Llama family, we observe an inflection point, which we term onset, where interventions that change predictions before this point become largely ineffective afterward. Multi-hop compositions show a similar onset location. These results provide a simple, replicable way to locate where instruction following begins and to compare this location across tasks and model sizes.
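The layer-wise flip rate and the onset can be sketched from patching results alone. In the sketch below the predictions and flip rates are hypothetical, and defining the onset as the first layer whose flip rate falls below a fixed threshold is an assumption; the abstract only describes it as an inflection point.

```python
def flip_rate(clean_predictions, patched_predictions):
    """Fraction of prompt pairs whose prediction changes after patching."""
    flips = sum(c != p for c, p in zip(clean_predictions, patched_predictions))
    return flips / len(clean_predictions)

def onset_layer(flip_rates, threshold=0.5):
    """First layer at which patching stops changing most predictions."""
    for layer, rate in enumerate(flip_rates):
        if rate < threshold:
            return layer
    return None  # patching stays effective at every layer

# Hypothetical predictions for four prompt pairs at one layer.
rate = flip_rate(["A", "B", "C", "D"], ["B", "B", "A", "D"])

# Hypothetical flip rates for an eight-layer model.
rates_by_layer = [0.95, 0.93, 0.90, 0.88, 0.40, 0.10, 0.05, 0.02]
onset = onset_layer(rates_by_layer)
```

Running the same measurement per task and per model size gives the comparison of onset locations that the abstract describes.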
[NLP-60] Do AI Voices Learn Social Nuances? A Case of Politeness and Speech Rate
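[Quick Read]: This paper asks whether generative voice AI can implicitly learn and reproduce social conventions of human communication that are not explicitly programmed, such as slowing speech rate to convey politeness, a non-obvious prosodic marker. In a controlled experiment, 22 synthetic voices from two leading AI platforms (AI Studio and OpenAI) read the same script under “polite and formal” and “casual and informal” prompts, and the resulting speech durations were compared. Under the polite prompt, all AI Studio voices and a large majority of OpenAI voices spoke significantly more slowly, indicating that these systems have internalized psychological nuances of human social behavior and can act as social actors that mirror human social norms.
Link: https://arxiv.org/abs/2511.10693
Authors: Eyal Rabin, Zohar Elyoseph, Rotem Israel-Fishelson, Adi Dali, Ravit Nussinson
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Sound (cs.SD)
Notes: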
Abstract:Voice-based artificial intelligence is increasingly expected to adhere to human social conventions, but can it learn implicit cues that are not explicitly programmed? This study investigates whether state-of-the-art text-to-speech systems have internalized the human tendency to reduce speech rate to convey politeness - a non-obvious prosodic marker. We prompted 22 synthetic voices from two leading AI platforms (AI Studio and OpenAI) to read a fixed script under both “polite and formal” and “casual and informal” conditions and measured the resulting speech duration. Across both AI platforms, the polite prompt produced slower speech than the casual prompt with very large effect sizes, an effect that was statistically significant for all of AI Studio’s voices and for a large majority of OpenAI’s voices. These results demonstrate that AI can implicitly learn and replicate psychological nuances of human communication, highlighting its emerging role as a social actor capable of reinforcing human social norms.
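The comparison of speech durations under the two prompts can be sketched as a paired effect size. The durations are hypothetical, and using a paired Cohen's d (mean difference divided by the standard deviation of the differences) is an assumption; the abstract reports very large effect sizes without naming the statistic.

```python
from statistics import mean, stdev

def paired_effect_size(polite_durations, casual_durations):
    """Paired Cohen's d: mean of the differences over their sample std.

    Raises ZeroDivisionError if every difference is identical.
    """
    diffs = [p - c for p, c in zip(polite_durations, casual_durations)]
    return mean(diffs) / stdev(diffs)

# Hypothetical durations (seconds) of the same script for three voices.
polite = [12.0, 11.5, 12.5]
casual = [10.0, 10.5, 10.5]

d = paired_effect_size(polite, casual)
```

A positive value means the polite readings took longer, that is, the speech rate was slower for a fixed script.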
[NLP-61] Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models
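[Quick Read]: This paper addresses two core challenges facing current benchmarks for large language models (LLMs). First, static benchmarks may overstate performance because of training-data contamination, so they may not reflect genuine problem-solving ability. Second, existing evaluations mostly assume benign, resource-rich settings, leaving model behavior under resource constraints and information asymmetry unexamined. The paper proposes Squid Game, a dynamic, adversarial evaluation environment with six elimination-style levels that tests LLMs through interactive play against other LLMs, under resource-constrained and asymmetric-information settings, on instruction following, code generation, reasoning, planning, and safety alignment. The study reveals a generational phase transition in performance within the same model lineage, finds that some models win through speculative shortcuts (which points to possible higher-level evaluation-paradigm contamination in static benchmarks), and shows that dynamic evaluation can complement static evaluation.
Link: https://arxiv.org/abs/2511.10691
Authors: Zijian Chen, Wenjun Zhang, Guangtao Zhai
Affiliations: Shanghai Jiao Tong University; Shanghai AI Lab
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 26 pages, 12 figures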
Abstract:Contemporary benchmarks are struggling to keep pace with the development of large language models (LLMs). Although they are indispensable to evaluate model performance on various tasks, it is uncertain whether the models trained on Internet data have genuinely learned how to solve problems or merely seen the questions before. This potential data contamination issue presents a fundamental challenge to establishing trustworthy evaluation frameworks. Meanwhile, existing benchmarks predominantly assume benign, resource-rich settings, leaving the behavior of LLMs under pressure unexplored. In this paper, we introduce Squid Game, a dynamic and adversarial evaluation environment with resource-constrained and asymmetric information settings elaborated to evaluate LLMs through interactive gameplay against other LLM opponents. Notably, Squid Game consists of six elimination-style levels, focusing on multi-faceted abilities, such as instruction-following, code, reasoning, planning, and safety alignment. We evaluate over 50 LLMs on Squid Game, presenting the largest behavioral evaluation study of general LLMs on dynamic adversarial scenarios. We observe a clear generational phase transition on performance in the same model lineage and find evidence that some models resort to speculative shortcuts to win the game, indicating the possibility of higher-level evaluation paradigm contamination in static benchmarks. Furthermore, we compare prominent LLM benchmarks and Squid Game with correlation analyses, highlighting that dynamic evaluation can serve as a complementary part for static evaluations. The code and data will be released in the future.
[NLP-62] Saying the Unsaid: Revealing the Hidden Language of Multimodal Systems Through Telephone Games NEURIPS2025
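[Quick Read]: This paper addresses the opacity of the “hidden language” of closed-source multimodal systems: because of their black-box architectures, the way they associate concepts while compressing images into text and reconstructing them into images cannot be inspected directly. The approach exploits the systems' preference bias in cross-modal conversion. A multi-round “telephone game” framework quantifies how this bias shifts the co-occurrence frequencies of input concepts, yielding a map of the concept connections that underlie a multimodal system's understanding of the world. The method is test-time scalable, can identify preference bias inherited from training and assess generalization capability, and uses Reasoning-LLMs to uncover concept relationships beyond textual and visual similarity, opening a new route for research on the interpretability and controllability of multimodal systems.
Link: https://arxiv.org/abs/2511.10690
Authors: Juntu Zhao, Jialing Zhang, Chongxuan Li, Dequan Wang
Affiliations: Shanghai Jiao Tong University; Renmin University of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted by NeurIPS 2025 MTI-LLM Workshop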
Abstract:Recent closed-source multimodal systems have made great advances, but their hidden language for understanding the world remains opaque because of their black-box architectures. In this paper, we use the systems’ preference bias to study their hidden language: During the process of compressing the input images (typically containing multiple concepts) into texts and then reconstructing them into images, the systems’ inherent preference bias introduces specific shifts in the outputs, disrupting the original input concept co-occurrence. We employ the multi-round “telephone game” to strategically leverage this bias. By observing the co-occurrence frequencies of concepts in telephone games, we quantitatively investigate the concept connection strength in the understanding of multimodal systems, i.e., “hidden language.” We also contribute Telescope, a dataset of 10,000+ concept pairs, as the database of our telephone game framework. Our telephone game is test-time scalable: By iteratively running telephone games, we can construct a global map of concept connections in multimodal systems’ understanding. Here we can identify preference bias inherited from training, assess generalization capability advancement, and discover more stable pathways for fragile concept connections. Furthermore, we use Reasoning-LLMs to uncover unexpected concept relationships that transcend textual and visual similarities, inferring how multimodal systems understand and simulate the world. This study offers a new perspective on the hidden language of multimodal systems and lays the foundation for future research on the interpretability and controllability of multimodal systems.
[NLP-63] Equilibrium Dynamics and Mitigation of Gender Bias in Synthetically Generated Data AAAI
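[Quick Read]: This paper examines how recursive prompting for synthetic data generation may amplify gender bias, focusing on how bias evolves across generations of text and how effective mitigation strategies are. Its key result concerns contrastive augmentation, which adds gender-swapped variants to balance the bias distribution: although it scores worse on the embedding-based bias metric, it greatly reduces downstream bias (98.8% for low initial bias and 91% on average). This shows that semantic similarity metrics alone may not reflect behavioral fairness, and underlines the need for multidimensional evaluation in responsible synthetic data generation.
Link: https://arxiv.org/abs/2511.10689
Authors: Ashish Kattamuri, Arpita Vats, Harshwardhan Fartale, Rahul Raja, Akshata Kishore Moharir, Ishita Prasad
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted to AAAI Workshop on Shaping Responsible Synthetic Data in the Era of Foundation Models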
Abstract:Recursive prompting with large language models enables scalable synthetic dataset generation but introduces the risk of bias amplification. We investigate gender bias dynamics across three generations of recursive text generation using three complementary evaluation frameworks: rule-based pattern matching, embedding-based semantic similarity, and downstream task performance. Experiments with three initial bias levels (0.1, 0.3, 0.6) and four mitigation strategies reveal equilibrium dynamics rather than monotonic amplification. The low initial bias amplifies toward the model’s inherent bias level (+36%), whereas the high initial bias decays toward it (-26%). Among mitigation methods, contrastive augmentation, which introduces gender-swapped variants, achieves significant downstream bias reduction (98.8% for low initial bias and 91% on average) despite producing higher embedding-based bias scores. This paradox demonstrates that semantic similarity metrics may diverge from behavioral fairness outcomes, highlighting the need for multidimensional evaluation in responsible synthetic data generation.
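Contrastive augmentation with gender-swapped variants can be sketched as a word-level substitution. The swap table below is a small hypothetical one and is deliberately simplistic (for example, “her” is always mapped to “his”, which is wrong in object position); the abstract does not describe how the paper builds its variants.

```python
import re

# Hypothetical, intentionally small swap table (an assumption for illustration).
SWAPS = {
    "he": "she", "she": "he",
    "his": "her", "him": "her", "her": "his",
    "man": "woman", "woman": "man",
}

def gender_swap(text):
    """Replace each gendered word with its counterpart, keeping capitalisation."""
    def replace(match):
        word = match.group(0)
        swapped = SWAPS.get(word.lower())
        if swapped is None:
            return word  # leave all other words untouched
        return swapped.capitalize() if word[0].isupper() else swapped
    return re.sub(r"[A-Za-z]+", replace, text)

def augment(corpus):
    """Return each text followed by its gender-swapped variant."""
    return [variant for text in corpus for variant in (text, gender_swap(text))]

augmented = augment(["He thanked the woman."])
```

Pairing every sample with its swapped counterpart balances gendered contexts in the training data, which is the mechanism the abstract credits for the downstream bias reduction.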
[NLP-64] Modeling and Predicting Multi-Turn Answer Instability in Large Language Models
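[Quick Read]: This paper addresses answer instability of large language models (LLMs) in interactive use, namely the fluctuation in accuracy as the number of conversation turns grows. It models the evolution of accuracy across turns with Markov chains and uses linear probes on the model's hidden states to predict future answer changes, which allows the model's stationary (long-run) accuracy to be estimated. A prompt as simple as “Think again” lowers accuracy by about 10% for Gemini 1.5 Flash over nine turns, and that model's stationary accuracy is on average about 8% lower than its first-turn accuracy. These results expose the fragility of current LLMs under repeated questioning and provide a quantifiable robustness metric for interactive settings.
Link: https://arxiv.org/abs/2511.10688
Authors: Jiahang He, Rishi Ramachandran, Neel Ramachandran, Aryan Katakam, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Aryan Shrivastava
Affiliations: Algoverse AI Research; University of Chicago
Categories: Computation and Language (cs.CL)
Notes: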
Abstract:As large language models (LLMs) are adopted in an increasingly wide range of applications, user-model interactions have grown in both frequency and scale. Consequently, research has focused on evaluating the robustness of LLMs, an essential quality for real-world tasks. In this paper, we employ simple multi-turn follow-up prompts to evaluate models’ answer changes, model accuracy dynamics across turns with Markov chains, and examine whether linear probes can predict these changes. Our results show significant vulnerabilities in LLM robustness: a simple “Think again” prompt led to an approximate 10% accuracy drop for Gemini 1.5 Flash over nine turns, while combining this prompt with a semantically equivalent reworded question caused a 7.5% drop for Claude 3.5 Haiku. Additionally, we find that model accuracy across turns can be effectively modeled using Markov chains, enabling the prediction of accuracy probabilities over time. This allows for estimation of the model’s stationary (long-run) accuracy, which we find to be on average approximately 8% lower than its first-turn accuracy for Gemini 1.5 Flash. Our results from a model’s hidden states also reveal evidence that linear probes can help predict future answer changes. Together, these results establish stationary accuracy as a principled robustness metric for interactive settings and expose the fragility of models under repeated questioning. Addressing this instability will be essential for deploying LLMs in high-stakes and interactive settings where consistent reasoning is as important as initial accuracy.
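Stationary accuracy can be illustrated with a two-state Markov chain whose states are “correct” and “incorrect”. The two-state form and the transition probabilities below are assumptions made for this sketch; the abstract does not specify the chain's states or parameters.

```python
def stationary_accuracy(p_stay_correct, p_become_correct):
    """Long-run probability of the 'correct' state in a two-state chain.

    Undefined when p_stay_correct == 1 and p_become_correct == 0.
    """
    return p_become_correct / (1.0 - p_stay_correct + p_become_correct)

def accuracy_after(turns, first_turn_accuracy, p_stay_correct, p_become_correct):
    """Accuracy after a number of follow-up turns, starting from turn one."""
    accuracy = first_turn_accuracy
    for _ in range(turns):
        accuracy = (accuracy * p_stay_correct
                    + (1.0 - accuracy) * p_become_correct)
    return accuracy

# Hypothetical transition probabilities and first-turn accuracy.
long_run = stationary_accuracy(p_stay_correct=0.9, p_become_correct=0.3)
after_nine = accuracy_after(9, first_turn_accuracy=0.85,
                            p_stay_correct=0.9, p_become_correct=0.3)
```

With these made-up numbers, accuracy drifts from 0.85 toward a long-run value of 0.75, which mirrors the kind of gap between first-turn and stationary accuracy that the abstract reports.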
[NLP-65] Who Gets the Reward Who Gets the Blame? Evaluation-Aligned Training Signals for Multi-LLM Agents
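[Quick Read]: This paper addresses the lack of principled ways, when training large language models (LLMs) in multi-agent systems (MAS), to connect system-level evaluation with agent-level and message-level learning. Existing training methods rely either only on cooperative game-theoretic attribution (e.g., Shapley values) or only on step-level labels (e.g., process reward models, PRM), and struggle to produce local, signed, credit-conserving learning signals. The proposed theoretical framework unifies cooperative game-theoretic attribution with process reward modeling, turning system-level evaluation into agent credit and then into response-level signals. In success cases, Shapley-based credit fairly allocates the outcome across agents and is refined into per-message rewards that promote cooperation and discourage redundancy or sabotage. In failure cases, first-error localization yields repair-aware preference signals. The result is a unified, auditable pathway from global evaluation to local supervision.
Link: https://arxiv.org/abs/2511.10687
Authors: Chih-Hsuan Yang, Tanwi Mallick, Le Chen, Krishnan Raghavan, Azton Wells, Amal Gueroudji, Ian T. Foster, Rajeev Thakur
Affiliations: Argonne National Laboratory; University of Chicago
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
Notes: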
Abstract:Large Language Models (LLMs) in multi-agent systems (MAS) have shown promise for complex tasks, yet current training methods lack principled ways to connect system-level evaluation with agent-level and message-level learning. We propose a theoretical framework that unifies cooperative game-theoretic attribution with process reward modeling to transform system evaluation into agent credit and then into response-level signals. Unlike prior approaches that rely only on attribution (e.g., Shapley) or step-level labels (e.g., PRM), our method produces local, signed, and credit-conserving signals. In success cases, Shapley-based credit assignment fairly allocates outcomes across agents and is refined into per-message rewards that promote cooperation while discouraging redundancy or sabotage. In failure cases, first-error localization yields repair-aware preferences that penalize harmful steps while rewarding corrective attempts. The resulting signals are bounded, cooperative, and directly compatible with reinforcement-based or preference-based post-training, providing a unified and auditable pathway from global evaluation to local supervision in LLM multi-agent training. Our contribution is conceptual: we present a theoretical foundation and training signals, leaving empirical validation for future work.
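The Shapley-based credit assignment in success cases can be sketched by averaging each agent's marginal contribution over all join orders. The agent names and the coalition value function are hypothetical; the abstract presents the framework conceptually and gives no concrete value function.

```python
from itertools import permutations

def shapley_values(agents, value):
    """Exact Shapley values: average marginal contribution over all orders."""
    totals = {agent: 0.0 for agent in agents}
    orders = list(permutations(agents))
    for order in orders:
        coalition = frozenset()
        for agent in order:
            extended = coalition | {agent}
            totals[agent] += value(extended) - value(coalition)
            coalition = extended
    return {agent: total / len(orders) for agent, total in totals.items()}

def system_score(coalition):
    """Hypothetical system-level evaluation of a subset of agents."""
    if {"planner", "coder"} <= coalition:
        return 1.0 if "reviewer" in coalition else 0.8
    return 0.0  # without both planner and coder the task fails

credit = shapley_values(["planner", "coder", "reviewer"], system_score)
```

The credits sum to the full system score, which is the credit-conserving property the abstract highlights. Exact enumeration grows factorially with the number of agents, so larger systems would need sampling-based approximations.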
[NLP-66] A methodological analysis of prompt perturbations and their effect on attack success rates
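[Quick Read]: This paper investigates how different alignment methods for large language models (LLMs) affect the models' sensitivity to prompt attacks. It selects open-source models based on three common alignment methods, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning with Human Feedback (RLHF), and uses statistical analysis to assess how sensitive the Attack Success Rate (ASR) is to prompt perturbations. The results show that running existing “attack benchmarks” alone may not expose all vulnerabilities of models and alignment methods, and they support statistically rigorous evaluation of model attacks.
Link: https://arxiv.org/abs/2511.10686
Authors: Tiago Machado, Maysa Malfiza Garcia de Macedo, Rogerio Abreu de Paula, Marcelo Carpinette Grave, Aminat Adebiyi, Luan Soares de Souza, Enrico Santarelli, Claudio Pinhanez
Affiliations: Universidade de São Paulo; University of Ibadan
Categories: Computation and Language (cs.CL)
Notes: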
Abstract:This work aims to investigate how different Large Language Models (LLMs) alignment methods affect the models’ responses to prompt attacks. We selected open source models based on the most common alignment methods, namely, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning with Human Feedback (RLHF). We conducted a systematic analysis using statistical methods to verify how sensitive the Attack Success Rate (ASR) is when we apply variations to prompts designed to elicit inappropriate content from LLMs. Our results show that even small prompt modifications can significantly change the Attack Success Rate (ASR) according to the statistical tests we run, making the models more or less susceptible to types of attack. Critically, our results demonstrate that running existing ‘attack benchmarks’ alone may not be sufficient to elicit all possible vulnerabilities of both models and alignment methods. This paper thus contributes to ongoing efforts on model attack evaluation by means of systematic and statistically-based analyses of the different alignment methods and how sensitive their ASR is to prompt variation.
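Whether a prompt perturbation changes the Attack Success Rate can be checked with a standard significance test. The sketch below uses a two-proportion z-test with hypothetical counts; the abstract mentions statistical tests without naming them, so this particular test is an assumption.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # both rates are 0 or 1, so there is no difference
    z = (p_a - p_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Hypothetical counts: successful attacks out of 100 prompts each,
# for the original prompts and for a perturbed variant.
z, p_value = two_proportion_z_test(30, 100, 50, 100)
```

A small p-value here would indicate that the perturbed prompts yield a different ASR than the originals, the kind of sensitivity the abstract reports.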
[NLP-67] SpiderGen: Towards Procedure Generation For Carbon Life Cycle Assessments with Generative AI
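[Quick read]: This paper addresses the labor-intensive and costly Life Cycle Assessment (LCA) required to estimate the carbon footprint of consumer products, focusing on how to generate the process information that an LCA needs efficiently and accurately. The key to the solution is SpiderGen, a workflow based on a Large Language Model (LLM) that combines the taxonomy and methodology of traditional LCA with the reasoning capabilities and world knowledge of LLMs to automatically build the process information used in LCA. Experiments show that SpiderGen reaches an F1 score of 62% on 10 sample data points, outperforms baselines such as chain-of-thought prompting and one-shot prompting, and produces LCA process information in under 10 minutes for less than 1 USD, compared with a traditional LCA that can take up to 21 person-days and cost over 25,000 USD.
Link: https://arxiv.org/abs/2511.10684
Authors: Anupama Sitaraman,Bharathan Balaji,Yuvraj Agarwal
Affiliations: Amazon
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: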
Abstract:Investigating the effects of climate change and global warming caused by GHG emissions has been a primary concern worldwide. These emissions are largely driven by the production, use, and disposal of consumer products. Thus, it is important to build tools to estimate the environmental impact of consumer goods, an essential part of which is conducting Life Cycle Assessments (LCAs). LCAs specify and account for the appropriate processes involved with the production, use, and disposal of the products. We present SpiderGen, an LLM-based workflow which integrates the taxonomy and methodology of traditional LCA with the reasoning capabilities and world knowledge of LLMs to generate the procedural information used for LCA. We additionally evaluate the output of SpiderGen using real-world LCA documents as ground-truth. We find that SpiderGen provides accurate LCA process information that is either fully correct or has minor errors, achieving an F1-Score of 62% across 10 sample data points. We observe that the remaining missed processes and hallucinated errors occur primarily due to differences in detail between LCA documents, as well as differences in the “scope” of which auxiliary processes must also be included. We also demonstrate that SpiderGen performs better than several baseline techniques, such as chain-of-thought prompting and one-shot prompting. Finally, we highlight SpiderGen’s potential to reduce the human effort and costs for estimating carbon impact, as it is able to produce LCA process information for less than $1 USD in under 10 minutes, compared to the status quo LCA, which can cost over $25,000 USD and take up to 21 person-days.
[NLP-68] Pre-Attention Expert Prediction and Prefetching for Mixture-of-Experts Large Language Models
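[Quick read]: This paper addresses the inference bottleneck in Mixture-of-Experts (MoE) large language models caused by inaccurate expert prediction; existing methods predict from the previous layer's activations, which gives low accuracy and leaves the first layer unoptimized. The key to the solution is "pre-attention expert prediction", which takes the activations before the attention block of the same layer and combines two linear functions with a ranking-aware loss to achieve accurate, lightweight expert prefetching. The method builds on the observation that some functions in LLMs are ranking-preserving, so it improves expert prediction accuracy (about 15% absolute over state-of-the-art methods) without complex computation or separately trained networks, and it also supports prefetching in the first layer.
Link: https://arxiv.org/abs/2511.10676
Authors: Shien Zhu,Samuel Bohl,Robin Oester,Gustavo Alonso
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: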
Abstract:Mixture-of-Experts (MoE) Large Language Models (LLMs) efficiently scale up the model while keeping inference cost relatively low. As MoE models only activate part of the experts, related work has proposed expert prediction and caching methods to prefetch the experts for faster inference. However, existing approaches use the activations from the previous layer for prediction, which yields low accuracy and leaves the first layer unoptimized. Applying complex layers or even training standalone networks for better prediction introduces high computation overhead. In this paper, we propose pre-attention expert prediction to achieve accurate and lightweight expert prefetching. The key insight is that some functions in LLMs are ranking-preserving, indicating that matching the ranking of selected experts using simple linear functions is possible. Therefore, we utilize the activations before the attention block in the same layer with 2 linear functions and ranking-aware loss to achieve accurate prediction, which also supports prefetching in the first layer. Our lightweight, pre-attention expert routers achieve 93.03% accuracy on DeepSeek V2 Lite, 94.69% on Qwen3-30B, and 97.62% on Phi-mini-MoE, showing about 15% improvement on absolute accuracy over the state-of-the-art methods.
[NLP-69] Learn to Select: Exploring Label Distribution Divergence for In-Context Demonstration Selection in Text Classification
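[Quick read]: This paper addresses the performance loss that large language models (LLMs) suffer in text classification when in-context demonstrations are poorly chosen. Existing methods mainly select demonstrations by semantic similarity to the test input and overlook label distribution alignment, which can hurt inference quality. The key to the solution is a two-stage demonstration selection method, TopK + Label Distribution Divergence (L2D): a fine-tuned BERT-like small language model (SLM) generates label distributions for the test input and for candidate demonstrations, and the divergence between them is used to prefer demonstrations that are both semantically relevant and aligned in label distribution, which improves LLM performance across several text classification benchmarks.
Link: https://arxiv.org/abs/2511.10675
Authors: Ye Jiang,Taihang Wang,Youzheng Liu,Yimin Wang,Yuhan Xia,Yunfei Long
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: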
Abstract:In-context learning (ICL) for text classification, which uses a few input-label demonstrations to describe a task, has demonstrated impressive performance on large language models (LLMs). However, the selection of in-context demonstrations plays a crucial role and can significantly affect LLMs’ performance. Most existing demonstration selection methods primarily focus on semantic similarity between test inputs and demonstrations, often overlooking the importance of label distribution alignment. To address this limitation, we propose a two-stage demonstration selection method, TopK + Label Distribution Divergence (L2D), which leverages a fine-tuned BERT-like small language model (SLM) to generate label distributions and calculate their divergence for both test inputs and candidate demonstrations. This enables the selection of demonstrations that are not only semantically similar but also aligned in label distribution with the test input. Extensive experiments across seven text classification benchmarks show that our method consistently outperforms previous demonstration selection strategies. Further analysis reveals a positive correlation between the performance of LLMs and the accuracy of the underlying SLMs used for label distribution estimation.
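The two-stage selection described above can be sketched as a similarity shortlist followed by a divergence-based rerank. This is an illustration under stated assumptions: the abstract does not say which divergence is used, so KL divergence is a stand-in, and the embeddings (`emb`) and SLM label distributions (`dist`) are hypothetical precomputed values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two label distributions (assumed divergence, not confirmed by the paper)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def select_demonstrations(test, candidates, top_k, n_demos):
    """Stage 1: shortlist top_k candidates by similarity. Stage 2: keep the n_demos closest in label distribution."""
    shortlist = sorted(candidates, key=lambda c: cosine(test["emb"], c["emb"]), reverse=True)[:top_k]
    return sorted(shortlist, key=lambda c: kl_divergence(test["dist"], c["dist"]))[:n_demos]


# Hypothetical precomputed embeddings and SLM label distributions.
test_input = {"emb": [1.0, 0.0], "dist": [0.9, 0.1]}
pool = [
    {"id": "a", "emb": [0.9, 0.1], "dist": [0.2, 0.8]},
    {"id": "b", "emb": [0.8, 0.2], "dist": [0.85, 0.15]},
    {"id": "c", "emb": [0.0, 1.0], "dist": [0.9, 0.1]},
    {"id": "d", "emb": [0.7, 0.3], "dist": [0.6, 0.4]},
]
chosen = select_demonstrations(test_input, pool, top_k=3, n_demos=2)
```

In this toy pool, candidate `a` is the most similar to the test input but has a very different label distribution, so the second stage ranks it below `b` and `d`; candidate `c` matches the label distribution but never reaches the shortlist.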
[NLP-70] Continual Learning of Domain Knowledge from Human Feedback in Text-to-SQL
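[Quick read]: This paper addresses the difficulty Large Language Models (LLMs) have with database-specific schemas and tacit domain knowledge when translating natural language questions into SQL queries. The key to the solution is a framework for continual learning from human feedback, in which a learning agent receives natural language feedback to refine its SQL queries and stores the distilled knowledge in a structured memory for reuse and iterative improvement. Experiments show that memory-augmented agent architectures, particularly the Procedural Agent, significantly improve execution accuracy and reduce errors, indicating that turning tacit human expertise into reusable knowledge matters for building more adaptive, domain-aware text-to-SQL systems.
Link: https://arxiv.org/abs/2511.10674
Authors: Thomas Cook,Kelly Patel,Sivapriya Vellaichamy,Saba Rahimi,Zhen Zeng,Sumitra Ganesh
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 34 pages, 6 figures, 4 tables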
Abstract:Large Language Models (LLMs) can generate SQL queries from natural language questions but struggle with database-specific schemas and tacit domain knowledge. We introduce a framework for continual learning from human feedback in text-to-SQL, where a learning agent receives natural language feedback to refine queries and distills the revealed knowledge for reuse on future tasks. This distilled knowledge is stored in a structured memory, enabling the agent to improve execution accuracy over time. We design and evaluate multiple variations of a learning agent architecture that vary in how they capture and retrieve past experiences. Experiments on the BIRD benchmark Dev set show that memory-augmented agents, particularly the Procedural Agent, achieve significant accuracy gains and error reduction by leveraging human-in-the-loop feedback. Our results highlight the importance of transforming tacit human expertise into reusable knowledge, paving the way for more adaptive, domain-aware text-to-SQL systems that continually learn from a human-in-the-loop.
[NLP-71] Large language models in materials science and the need for open-source approaches
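[Quick read]: This paper addresses the low transparency, poor reproducibility, high cost, and data privacy risks that come with the reliance of current generative AI applications in materials science on closed-source commercial models. The key to the approach is to use benchmark results to show that open-source models can match closed-source models across the materials discovery pipeline while offering greater transparency, reproducibility, cost-effectiveness, and data privacy, which supports building open, flexible, and community-driven platforms for scientific discovery.
Link: https://arxiv.org/abs/2511.10673
Authors: Fengxu Yang,Weitong Chen,Jack D. Evans
Affiliations: The University of Adelaide
Subjects: Computation and Language (cs.CL); Materials Science (cond-mat.mtrl-sci)
Comments: 16 pages, 5 figures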
Abstract:Large language models (LLMs) are rapidly transforming materials science. This review examines recent LLM applications across the materials discovery pipeline, focusing on three key areas: mining scientific literature, predictive modelling, and multi-agent experimental systems. We highlight how LLMs extract valuable information such as synthesis conditions from text, learn structure-property relationships, and can coordinate agentic systems integrating computational tools and laboratory automation. While progress has been largely dependent on closed-source commercial models, our benchmark results demonstrate that open-source alternatives can match performance while offering greater transparency, reproducibility, cost-effectiveness, and data privacy. As open-source models continue to improve, we advocate their broader adoption to build accessible, flexible, and community-driven AI platforms for scientific discovery.
[NLP-72] Grounded Visual Factualization: Factual Anchor-Based Finetuning for Enhancing MLLM Factual Consistency
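[Quick read]: This paper addresses visual hallucination in Multimodal Large Language Models (MLLMs), where a model generates details that are inconsistent with the input image and so undermines its reliability. The key to the solution is a fine-tuning method called Grounded Visual Factualization (GVF), which intervenes in factual reasoning through three core mechanisms: 1) Factual Anchor Data Augmentation, which adds structured factual anchors and counter-factual prompts to the training data; 2) Fact-Aware Instruction Tuning, which embeds these factual cues explicitly into instructions; and 3) a Factual Consistency Loss, which specifically penalizes factual errors. Experiments show that GVF significantly improves LLaVA-1.5-13B on the VHTest benchmark while maintaining or slightly improving performance on general multimodal benchmarks such as MME and POPE.
Link: https://arxiv.org/abs/2511.10671
Authors: Filippo Morbiato,Luca Romano,Alessandro Persona
Affiliations: University of Padua
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: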
Abstract:Visual hallucination, where Multimodal Large Language Models fabricate details inconsistent with image content, critically undermines their reliability. Existing fine-tuning methods offer limited improvement, failing to deeply intervene in factual reasoning. This paper introduces Grounded Visual Factualization (GVF) Finetuning, a novel approach to systematically enhance MLLM visual factual consistency. GVF integrates explicit factual signals via three core mechanisms: Factual Anchor Data Augmentation, enriching training data with structured factual anchors and counter-factual prompts; Fact-Aware Instruction Tuning, embedding these cues into explicit instructions; and a Factual Consistency Loss function, specifically penalizing factual inaccuracies. Evaluated on LLaVA-1.5-13B, GVF Finetuning significantly outperforms standard fine-tuning on the VHTest benchmark for both Open-Ended Question (OEQ) and Yes/No Question (YNQ) formats. Crucially, GVF maintains or even slightly improves performance on general multimodal benchmarks like MME and POPE, demonstrating effective mitigation of visual hallucinations without compromising general understanding and reasoning abilities.
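[NLP-73] Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment
[Quick read]: This paper addresses the complexity of semantic modeling and the scarcity of high-quality data in code-switching (CS) speech translation. Existing methods usually rely on the model to learn semantics implicitly and on costly manual annotation. The key to the solution is threefold: 1) a Mixture of Experts (MoE) speech projector in which each expert specializes in the semantic subspace of a specific language, enabling fine-grained modeling of speech features; 2) a multi-stage training paradigm that uses readily available monolingual automatic speech recognition (ASR) and monolingual speech translation (ST) data to improve speech-text alignment and translation; and 3) a language-specific loss and an intra-group load balancing loss that guide the MoE speech projector in allocating tokens across experts, together with a transition loss that smooths the move of data between training stages, which mitigates the scarcity of CS data and improves adaptation to CS scenarios.
Link: https://arxiv.org/abs/2511.10670
Authors: Yan Gao,Yazheng Yang,Zhibin Lan,Yidong Chen,Min Zhang,Daimeng Wei,Hui Huang,Jinsong Su
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: Working in progress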
Abstract:Code-switching (CS) speech translation (ST) refers to translating speech that alternates between two or more languages into a target language text, which poses significant challenges due to the complexity of semantic modeling and the scarcity of CS data. Previous studies tend to rely on the model itself to implicitly learn semantic modeling during training, and resort to inefficient and costly manual annotations for these two challenges. To mitigate these limitations, we propose enhancing Large Language Models (LLMs) with a Mixture of Experts (MoE) speech projector, where each expert specializes in the semantic subspace of a specific language, enabling fine-grained modeling of speech features. Additionally, we introduce a multi-stage training paradigm that utilizes readily available monolingual automatic speech recognition (ASR) and monolingual ST data, facilitating speech-text alignment and improving translation capabilities. During training, we leverage a combination of language-specific loss and intra-group load balancing loss to guide the MoE speech projector in efficiently allocating tokens to the appropriate experts, across expert groups and within each group, respectively. To bridge the data gap across different training stages and improve adaptation to the CS scenario, we further employ a transition loss, enabling smooth transitions of data between stages, to effectively address the scarcity of high-quality CS speech translation data. Extensive experiments on widely used datasets demonstrate the effectiveness and generality of our approach.
[NLP-74] Forecasting Spoken Language Development in Children with Cochlear Implants Using Preimplantation MRI
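[Quick read]: This paper addresses the limited accuracy of predicting spoken language outcomes in children after cochlear implantation (CI); traditional predictors such as age at implantation or residual hearing cannot reliably predict how much an individual child will improve. The key to the solution is a deep transfer learning (DTL) algorithm that uses a bilinear attention-based fusion strategy to extract discriminative, task-specific representations from brain neuroanatomic features. On the binary task of classifying high versus low language improvers, the DTL model reaches 92.39% accuracy, 91.22% sensitivity, 93.56% specificity, and an AUC of 0.977, outperforming traditional machine learning and supporting the feasibility of a single DTL model for language prediction across cochlear implant programs worldwide.
Link: https://arxiv.org/abs/2511.10669
Authors: Yanlin Wang,Di Yuan,Shani Dettman,Dawn Choo,Emily Shimeng Xu,Denise Thomas,Maura E Ryan,Patrick C M Wong,Nancy M Young
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 38 pages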
Abstract:Cochlear implants (CI) significantly improve spoken language in children with severe-to-profound sensorineural hearing loss (SNHL), yet outcomes remain more variable than in children with normal hearing. This variability cannot be reliably predicted for individual children using age at implantation or residual hearing. This study aims to compare the accuracy of traditional machine learning (ML) to deep transfer learning (DTL) algorithms to predict post-CI spoken language development of children with bilateral SNHL using a binary classification model of high versus low language improvers. A total of 278 implanted children were enrolled from three centers. The accuracy, sensitivity, and specificity of prediction models based upon brain neuroanatomic features were compared between traditional ML and DTL. DTL prediction models using a bilinear attention-based fusion strategy achieved: accuracy of 92.39% (95% CI, 90.70%-94.07%), sensitivity of 91.22% (95% CI, 89.98%-92.47%), specificity of 93.56% (95% CI, 90.91%-96.21%), and area under the curve (AUC) of 0.977 (95% CI, 0.969-0.986). DTL outperformed traditional ML models in all outcome measures. DTL benefited significantly from directly capturing discriminative, task-specific information, an advantage of the representation learning this approach enables over traditional ML. The results support the feasibility of a single DTL prediction model for language prediction of children served by CI programs worldwide.
[NLP-75] Evaluating LLM Understanding via Structured Tabular Decision Simulations
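[Quick read]: This paper addresses the problem that Large Language Models (LLMs) can achieve high predictive accuracy and still lack genuine understanding. Existing evaluations focus on local accuracy and overlook whether a model reaches reliable judgments based on consistent, domain-relevant decision factors, so a systematic global-level assessment of "understanding" is missing. The key to the solution is Structured Tabular Decision Simulations (STaDS), which evaluate LLMs in expert-like decision settings along several dimensions: question and instruction comprehension, knowledge-based prediction, and reliance on relevant decision factors. STaDS measures whether a model identifies and uses the correct decision factors across diverse tasks, moving evaluation beyond accuracy alone.
Link: https://arxiv.org/abs/2511.10667
Authors: Sichao Li,Xinyue Xu,Xiaomeng Li
Affiliations: City University of Macau; The Hong Kong University of Science and Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: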
Abstract:Large language models (LLMs) often achieve impressive predictive accuracy, yet correctness alone does not imply genuine understanding. True LLM understanding, analogous to human expertise, requires making consistent, well-founded decisions across multiple instances and diverse domains, relying on relevant and domain-grounded decision factors. We introduce Structured Tabular Decision Simulations (STaDS), a suite of expert-like decision settings that evaluate LLMs as if they were professionals undertaking structured decision “exams”. In this context, understanding is defined as the ability to identify and rely on the correct decision factors, features that determine outcomes within a domain. STaDS jointly assesses understanding through: (i) question and instruction comprehension, (ii) knowledge-based prediction, and (iii) reliance on relevant decision factors. By analyzing 9 frontier LLMs across 15 diverse decision settings, we find that (a) most models struggle to achieve consistently strong accuracy across diverse domains; (b) models can be accurate yet globally unfaithful, and there are frequent mismatches between stated rationales and factors driving predictions. Our findings highlight the need for global-level understanding evaluation protocols and advocate for novel frameworks that go beyond accuracy to enhance LLMs’ understanding ability.
[NLP-76] Guarding the Meaning: Self-Supervised Training for Semantic Robustness in Guard Models
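[Quick read]: This paper addresses the sensitivity of guard models, a safety component for Large Language Models (LLMs), to superficial linguistic variation: even meaning-preserving paraphrases can cause large swings in safety scores, which reveals a lack of semantic grounding. The key to the solution is a practical self-supervised framework that uses paraphrase sets to enforce prediction consistency and a novel skew-aware aggregation strategy to compute robust targets. The method reduces score variability across paraphrases by about 58%, improves benchmark accuracy by about 2.5% on average, and reveals a bidirectional relationship between calibration and consistency, providing a scalable training recipe for more reliable guard models.
Link: https://arxiv.org/abs/2511.10665
Authors: Cristina Pinneri,Christos Louizos
Affiliations: Qualcomm AI Research; Qualcomm Technologies, Inc.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: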
Abstract:Guard models are a critical component of LLM safety, but their sensitivity to superficial linguistic variations remains a key vulnerability. We show that even meaning-preserving paraphrases can cause large fluctuations in safety scores, revealing a lack of semantic grounding. To address this, we introduce a practical, self-supervised framework for improving the semantic robustness of guard models. Our method leverages paraphrase sets to enforce prediction consistency using a novel, skew-aware aggregation strategy for robust target computation. Notably, we find that standard aggregation methods like mean and median can degrade safety, underscoring the need for skew-aware alternatives. We analyze six open-source guard models and show that our approach reduces semantic variability across paraphrases by ~58%, improves benchmark accuracy by ~2.5% on average, and generalizes to unseen stylistic variations. Intriguingly, we discover a bidirectional relationship between model calibration and consistency: our robustness training improves calibration by up to 40%, revealing a fundamental connection between these properties. These results highlight the value of treating semantic consistency as a first-class training objective and provide a scalable recipe for building more reliable guard models.
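[NLP-77] Evaluating Modern Large Language Models on Low-Resource and Morphologically Rich Languages: A Cross-Lingual Benchmark Across Cantonese Japanese and Turkish
[Quick read]: This paper addresses the weaker performance of Large Language Models (LLMs) on low-resource and morphologically rich languages, in particular cross-lingual generalization and cultural appropriateness in Cantonese, Japanese, and Turkish. The key to the approach is a new cross-lingual benchmark covering four tasks (open-domain question answering, document summarization, English-to-X translation, and culturally grounded dialogue), on which seven cutting-edge LLMs are evaluated with both human ratings (fluency, factual accuracy, and cultural appropriateness) and automated metrics (e.g., BLEU, ROUGE). The evaluation exposes the strengths and limits of each model in multilingual settings, especially gaps in morphological generalization and culturally nuanced understanding, and provides evidence and reproducible benchmark data for developing more culturally aware and linguistically generalizable LLMs.
Link: https://arxiv.org/abs/2511.10664
Authors: Chengxuan Xia,Qianye Wu,Hongbin Guan,Sixuan Tian,Yilun Hao,Xiaoyu Wu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: This paper requires XeLaTeX for proper Unicode rendering of Japanese and Cantonese text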
Abstract:Large language models (LLMs) have achieved impressive results in high-resource languages like English, yet their effectiveness in low-resource and morphologically rich languages remains underexplored. In this paper, we present a comprehensive evaluation of seven cutting-edge LLMs (GPT-4o, GPT-4, Claude 3.5 Sonnet, LLaMA 3.1, Mistral Large 2, LLaMA-2 Chat 13B, and Mistral 7B Instruct) on a new cross-lingual benchmark covering Cantonese, Japanese, and Turkish. Our benchmark spans four diverse tasks: open-domain question answering, document summarization, English-to-X translation, and culturally grounded dialogue. We combine human evaluations (rating fluency, factual accuracy, and cultural appropriateness) with automated metrics (e.g., BLEU, ROUGE) to assess model performance. Our results reveal that while the largest proprietary models (GPT-4o, GPT-4, Claude 3.5) generally lead across languages and tasks, significant gaps persist in culturally nuanced understanding and morphological generalization. Notably, GPT-4o demonstrates robust multilingual performance even on cross-lingual tasks, and Claude 3.5 Sonnet achieves competitive accuracy on knowledge and reasoning benchmarks. However, all models struggle to some extent with the unique linguistic challenges of each language, such as Turkish agglutinative morphology and Cantonese colloquialisms. Smaller open-source models (LLaMA-2 13B, Mistral 7B) lag substantially in fluency and accuracy, highlighting the resource disparity. We provide detailed quantitative results, qualitative error analysis, and discuss implications for developing more culturally aware and linguistically generalizable LLMs. Our benchmark and evaluation data are released to foster reproducibility and further research.
[NLP-78] Bayesian Evaluation of Large Language Model Behavior NEURIPS2025
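[Quick read]: This paper addresses the lack of statistical uncertainty quantification in evaluations of text generation systems based on Large Language Models (LLMs). Existing evaluations typically rely on a predefined set of input prompts, assess each output in a binary fashion (e.g., harmful/non-harmful), and aggregate the scores, while ignoring the uncertainty introduced by the model's probabilistic generation strategy. The paper proposes a Bayesian approach for quantifying uncertainty in such binary evaluation metrics; the key is to treat text generation as a probabilistic process and use Bayesian inference to quantify confidence in the evaluation results. Two case studies, refusal rates on adversarial inputs and pairwise preferences between LLMs in dialogue, show that the approach provides useful uncertainty estimates.
Link: https://arxiv.org/abs/2511.10661
Authors: Rachel Longjohn,Shang Wu,Saatvik Kher,Catarina Belém,Padhraic Smyth
Affiliations: University of California, Irvine
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
Comments: Accepted to NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling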
Abstract:It is increasingly important to evaluate how text generation systems based on large language models (LLMs) behave, such as their tendency to produce harmful output or their sensitivity to adversarial inputs. Such evaluations often rely on a curated benchmark set of input prompts provided to the LLM, where the output for each prompt may be assessed in a binary fashion (e.g., harmful/non-harmful or does not leak/leaks sensitive information), and the aggregation of binary scores is used to evaluate the LLM. However, existing approaches to evaluation often neglect statistical uncertainty quantification. With an applied statistics audience in mind, we provide background on LLM text generation and evaluation, and then describe a Bayesian approach for quantifying uncertainty in binary evaluation metrics. We focus in particular on uncertainty that is induced by the probabilistic text generation strategies typically deployed in LLM-based systems. We present two case studies applying this approach: 1) evaluating refusal rates on a benchmark of adversarial inputs designed to elicit harmful responses, and 2) evaluating pairwise preferences of one LLM over another on a benchmark of open-ended interactive dialogue examples. We demonstrate how the Bayesian approach can provide useful uncertainty quantification about the behavior of LLM-based systems.
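A simple instance of Bayesian uncertainty for a binary metric is a Beta-Binomial model of a refusal rate. This is a simplified sketch and not necessarily the model used in the paper: it assumes every response is an independent draw with one shared refusal probability, a uniform Beta(1, 1) prior, and hypothetical counts; the paper's focus on per-prompt generation randomness would call for a richer model.

```python
import random


def beta_posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Posterior Beta parameters after observing binary outcomes under a Beta prior."""
    return prior_a + successes, prior_b + (trials - successes)


def credible_interval(a, b, level=0.95, draws=20000, seed=0):
    """Equal-tailed credible interval, estimated from Monte Carlo draws of Beta(a, b)."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lower = samples[int((1 - level) / 2 * draws)]
    upper = samples[int((1 + level) / 2 * draws) - 1]
    return lower, upper


# Hypothetical benchmark result: the model refused 42 of 50 adversarial prompts.
a, b = beta_posterior(42, 50)
posterior_mean = a / (a + b)
lower, upper = credible_interval(a, b)
```

Reporting the interval next to the point estimate makes it visible when two systems' refusal rates are not distinguishable at the benchmark's sample size.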
[NLP-79] Test-Time Steering for Lossless Text Compression via Weighted Product of Experts EMNLP2025
[Quick read]: This paper addresses the poor generalization of neural compressors to unseen data while retaining their compression-rate advantage over traditional universal compressors such as gzip. The key to the solution is a Test-Time Steering framework based on a Weighted Product of Experts (wPoE), which at inference adaptively combines a universal compression model with a pretrained neural language model so that the compression rate is at least as good as that of the better individual model. It improves text compression without additional fine-tuning and is compatible with any autoregressive language model.
Link: https://arxiv.org/abs/2511.10660
Authors: Qihang Zhang,Muchen Li,Ziao Wang,Renjie Liao,Lele Wang
Affiliations: University of British Columbia; Vector Institute for AI; Canada CIFAR AI Chair; University of Michigan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: 8 pages. Accepted by EMNLP 2025. Code and additional details are available at: this https URL
Abstract:Lossless compression techniques are crucial in an era of rapidly growing data. Traditional universal compressors like gzip offer low computational overhead, high speed, and broad applicability across data distributions. However, they often lead to worse compression rates than modern neural compressors, which leverage large-scale training data to model data distributions more effectively. Despite their advantages, neural compressors struggle to generalize to unseen data. To address this limitation, we propose a novel framework that performs Test-Time Steering via a Weighted Product of Experts (wPoE). At inference, our method adaptively combines a universal compression model with a pretrained neural language model, ensuring the compression rate is at least as good as that of the best individual model. Extensive experiments demonstrate that our approach improves the performance of text compression without requiring fine-tuning. Furthermore, it seamlessly integrates with any autoregressive language model, providing a practical solution for enhancing text compression across diverse data distributions.
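The "at least as good as the best individual model" guarantee can be seen in a toy weighted product of two experts. This sketch makes several simplifying assumptions that are not taken from the paper: both experts are fixed, context-free symbol distributions with strictly positive probabilities, the weight `alpha` is chosen by grid search on the data itself, and the distributions `p` and `q` are hypothetical.

```python
import math


def weighted_product(p, q, alpha):
    """Normalised product p^alpha * q^(1 - alpha) over a shared alphabet."""
    unnormalised = {s: (p[s] ** alpha) * (q[s] ** (1 - alpha)) for s in p}
    z = sum(unnormalised.values())
    return {s: w / z for s, w in unnormalised.items()}


def code_length_bits(symbols, dist):
    """Ideal code length of the sequence under dist (what an arithmetic coder approaches)."""
    return -sum(math.log2(dist[s]) for s in symbols)


def best_alpha(symbols, p, q, grid=11):
    """Grid search over alpha; alpha = 1 recovers p and alpha = 0 recovers q."""
    alphas = [i / (grid - 1) for i in range(grid)]
    return min(alphas, key=lambda a: code_length_bits(symbols, weighted_product(p, q, a)))


# Hypothetical experts: one favours "a", the other favours "b".
p = {"a": 0.7, "b": 0.2, "c": 0.1}
q = {"a": 0.2, "b": 0.7, "c": 0.1}
data = "aabbab"

alpha = best_alpha(data, p, q)
mixed_bits = code_length_bits(data, weighted_product(p, q, alpha))
```

Because the search range includes both endpoints, the selected mixture can never code the data in more bits than the better single expert; on this balanced toy sequence the intermediate weight does strictly better than either expert alone.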
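[NLP-80] Information Extraction From Fiscal Documents Using LLMs
[Quick read]: This paper addresses the limited ability of Large Language Models (LLMs) to process complex, hierarchical tabular data, in particular the accurate extraction of structured data from multi-page government fiscal documents. The core challenge is that traditional optical character recognition (OCR) methods offer no way to verify that numbers were extracted correctly, whereas fiscal tables have an explicit hierarchy (totals at each level) that provides an internal consistency check. The key to the solution is a multi-stage pipeline that first uses domain knowledge and sequential context to interpret the tables and then builds multi-level validation checks from the hierarchical relationships in the fiscal data. The approach offers a scalable process for converting PDF-based fiscal disclosures into research-ready databases, with potential for broader use in developing-country contexts.
Link: https://arxiv.org/abs/2511.10659
Authors: Vikram Aggarwal,Jay Kulkarni,Aditi Mascarenhas,Aakriti Narang,Siddarth Raman,Ajay Shah,Susan Thomas
Affiliations: Google Inc; XKDR Forum
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 6 pages. Presented at the AI for Financial Inclusion, Risk Modeling and Resilience in Emerging Markets workshop at ACM ICAIF 2025 Singapore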
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in text comprehension, but their ability to process complex, hierarchical tabular data remains underexplored. We present a novel approach to extracting structured data from multi-page government fiscal documents using LLM-based techniques. Applied to annual fiscal documents from the State of Karnataka in India (200+ pages), our method achieves high accuracy through a multi-stage pipeline that leverages domain knowledge, sequential context, and algorithmic validation. A large challenge with traditional OCR methods is the inability to verify the accurate extraction of numbers. When applied to fiscal data, the inherent structure of fiscal tables, with totals at each level of the hierarchy, allows for robust internal validation of the extracted data. We use these hierarchical relationships to create multi-level validation checks. We demonstrate that LLMs can read tables and also process document-specific structural hierarchies, offering a scalable process for converting PDF-based fiscal disclosures into research-ready databases. Our implementation shows promise for broader applications across developing country contexts.
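The multi-level validation idea, where each stated total must equal the sum of the items beneath it, can be sketched as a recursive check over an extracted table. The nested-dictionary layout, the field names `name`, `amount`, and `children`, and the figures are all hypothetical; the paper's actual data model is not described in the abstract.

```python
def validate_totals(node, tolerance=0.0, path=""):
    """Return the paths of all nodes whose stated amount differs from the sum of their children."""
    name = f"{path}/{node['name']}"
    children = node.get("children", [])
    errors = []
    if children:
        child_sum = sum(child["amount"] for child in children)
        if abs(child_sum - node["amount"]) > tolerance:
            errors.append(name)
        # Recurse so that a mismatch is reported at every level where it occurs.
        for child in children:
            errors.extend(validate_totals(child, tolerance, name))
    return errors


# Hypothetical extracted budget; the "Health" total does not match its line items.
budget = {
    "name": "Budget",
    "amount": 300,
    "children": [
        {"name": "Education", "amount": 120, "children": [
            {"name": "Schools", "amount": 80},
            {"name": "Universities", "amount": 40},
        ]},
        {"name": "Health", "amount": 180, "children": [
            {"name": "Hospitals", "amount": 100},
            {"name": "Clinics", "amount": 70},
        ]},
    ],
}
errors = validate_totals(budget)
```

A flagged path localizes the likely extraction error to one subtree, which could then be re-extracted or sent for manual review; a small non-zero `tolerance` would be needed if the source documents round their totals.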
[NLP-81] Evaluating Open-Weight Large Language Models for Structured Data Extraction from Narrative Medical Reports Across Multiple Use Cases and Languages
[Quick read]: This paper addresses the challenge of extracting structured information from free-text clinical reports (such as pathology and radiology reports), with a focus on generality and scalability across diseases, languages, and institutions. The key to the approach is a systematic evaluation of 15 open-weight large language models (LLMs), both general-purpose and medical-specialised, under six prompting strategies (zero-shot, one-shot, few-shot, chain-of-thought, self-consistency, and prompt graph). Small-to-medium general-purpose models perform comparably to large models, prompt graph and few-shot prompting improve performance by about 13%, and task-specific factors such as variable complexity and annotation variability influence results more than model size or prompting strategy. These findings indicate that open-weight LLMs can extract structured data across settings and offer a scalable approach to clinical data curation.
Link: https://arxiv.org/abs/2511.10658
Authors: Douwe J. Spaanderman,Karthik Prathaban,Petr Zelina,Kaouther Mouheb,Lukáš Hejtmánek,Matthew Marzetti,Antonius W. Schurink,Damian Chan,Ruben Niemantsverdriet,Frederik Hartmann,Zhen Qian,Maarten G.J. Thomeer,Petr Holub,Farhan Akram,Frank J. Wolters,Meike W. Vernooij,Cornelis Verhoef,Esther E. Bron,Vít Nováček,Dirk J. Grünhagen,Wiro J. Niessen,Martijn P.A. Starmans,Stefan Klein
Affiliations: Erasmus MC Cancer Institute, University Medical Center Rotterdam, the Netherlands; Masaryk University; University of Leeds, UK; Erasmus MC University Medical Center, Rotterdam, the Netherlands; BBMRI-ERIC, Graz, Austria; Masaryk Memorial Cancer Institute; University of Groningen, Groningen, the Netherlands
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) are increasingly used to extract structured information from free-text clinical records, but prior work often focuses on single tasks, limited models, and English-language reports. We evaluated 15 open-weight LLMs on pathology and radiology reports across six use cases, colorectal liver metastases, liver tumours, neurodegenerative diseases, soft-tissue tumours, melanomas, and sarcomas, at three institutes in the Netherlands, UK, and Czech Republic. Models included general-purpose and medical-specialised LLMs of various sizes, and six prompting strategies were compared: zero-shot, one-shot, few-shot, chain-of-thought, self-consistency, and prompt graph. Performance was assessed using task-appropriate metrics, with consensus rank aggregation and linear mixed-effects models quantifying variance. Top-ranked models achieved macro-average scores close to inter-rater agreement across tasks. Small-to-medium general-purpose models performed comparably to large models, while tiny and specialised models performed worse. Prompt graph and few-shot prompting improved performance by ~13%. Task-specific factors, including variable complexity and annotation variability, influenced results more than model size or prompting strategy. These findings show that open-weight LLMs can extract structured data from clinical reports across diseases, languages, and institutions, offering a scalable approach for clinical data curation.
[NLP-82] Patent Representation Learning via Self-supervision
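[Quick read]: This paper addresses the limited embedding quality in patent representation learning that results from a lack of high-quality annotations; in particular, existing self-supervised methods such as SimCSE-style dropout augmentation produce overly uniform embeddings on patents and lose semantic cohesion. The key to the solution is section-based augmentation, which treats different sections of the same patent (e.g., abstract, claims, background) as complementary views for contrastive learning. This introduces natural semantic and structural diversity, mitigates over-dispersion, and better preserves both global structure and local continuity. On large-scale benchmarks the method matches or surpasses baselines supervised with citations and International Patent Classification (IPC) labels, without relying on brittle or incomplete annotations.
Link: https://arxiv.org/abs/2511.10657
Authors: You Zuo(ALMAnaCH),Kim Gerdes(LISN),Eric Villemonte de La Clergerie(ALMAnaCH),Benoît Sagot(ALMAnaCH)
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: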
Abstract:This paper presents a simple yet effective contrastive learning framework for learning patent embeddings by leveraging multiple views from within the same document. We first identify a patent-specific failure mode of SimCSE-style dropout augmentation: it produces overly uniform embeddings that lose semantic cohesion. To remedy this, we propose section-based augmentation, where different sections of a patent (e.g., abstract, claims, background) serve as complementary views. This design introduces natural semantic and structural diversity, mitigating over-dispersion and yielding embeddings that better preserve both global structure and local continuity. On large-scale benchmarks, our fully self-supervised method matches or surpasses citation- and IPC-supervised baselines in prior-art retrieval and classification, while avoiding reliance on brittle or incomplete annotations. Our analysis further shows that different sections specialize for different tasks: claims and summaries benefit retrieval, while background sections aid classification, highlighting the value of patents’ inherent discourse structure for representation learning. These results highlight the value of exploiting intra-document views for scalable and generalizable patent understanding.
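The section-based contrastive objective can be illustrated with a small sketch. It assumes an InfoNCE-style loss over cosine similarities, with the abstract embedding of each patent as the anchor and the claims embedding of the same patent as the positive; the function names, the temperature value, and the toy two-dimensional embeddings are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss; anchors[i] and positives[i] embed two sections of patent i.

    Sections of the other patents in the batch act as negatives (an assumption).
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

# Toy embeddings (hypothetical): row i of each list belongs to patent i.
abstract_emb = [[1.0, 0.0], [0.0, 1.0]]
claims_emb = [[0.9, 0.1], [0.1, 0.9]]

aligned = info_nce(abstract_emb, claims_emb)         # views paired correctly
shuffled = info_nce(abstract_emb, claims_emb[::-1])  # views paired with the wrong patent
```

The loss is small when the two sections of the same patent are closer to each other than to sections of other patents, and large when the pairing is wrong.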
[NLP-83] Preference Orchestrator: Prompt-Aware Multi-Objective Alignment for Large Language Models
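Quick read: This paper addresses the heavy user burden and low training efficiency that arise in multi-objective alignment of Large Language Models (LLMs) when preference weights must be specified manually. Existing methods struggle to explore effective preference combinations automatically, which hurts model performance. The key to the solution is a framework named PRO (PReference Orchestrator), which introduces a lightweight preference adapter that automatically infers prompt-specific preference weights during both training and deployment. The adapter is trained on normalized reward scores from multiple reward models, implicitly learning effective preference balances across objectives and thereby achieving more efficient and more accurate multi-objective alignment.
Link: https://arxiv.org/abs/2511.10656
Authors: Biao Liu, Ning Xu, Junming Yang, Xin Geng
Affiliations: Southeast University; Ministry of Education
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: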
Abstract:While Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, aligning these models with varying human preferences across multiple objectives remains a significant challenge in practical deployments. Existing multi-objective alignment methods rely on manually specified preference weights, which not only burden users with difficult preference specification tasks but also lead to suboptimal training efficiency due to exploration of irrelevant preference combinations. To alleviate these issues, we propose a novel framework named PRO, i.e., PReference Orchestrator, which features a lightweight preference adapter that automatically infers prompt-specific preference weights during both training and deployment phases. Specifically, the adapter automatically learns appropriate preference weights for each prompt by training on normalized reward scores from multiple reward models for preferred responses, which inherently reflect effective preference balances across objectives. Additionally, we provide theoretical analysis proving that our prompt-aware preference mechanism achieves superior performance compared to fixed preference weights in multi-objective alignment scenarios. Extensive experiments across multiple tasks demonstrate the effectiveness of our method over existing multi-objective alignment approaches.
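As a rough illustration of how reward scores for a preferred response might be turned into a per-prompt weight target, the sketch below normalizes scores with a softmax and uses the result to scalarize rewards. The softmax choice, the objective names, and both function names are assumptions made for illustration; the abstract does not specify the normalization or the adapter architecture.

```python
import math

def preference_weights(reward_scores, temperature=1.0):
    """Map per-objective reward scores to weights that sum to 1 (softmax, an assumption)."""
    top = max(reward_scores.values())
    exps = {name: math.exp((score - top) / temperature)
            for name, score in reward_scores.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}

def scalarize(rewards, weights):
    """Combine per-objective rewards into one scalar using the given weights."""
    return sum(weights[name] * rewards[name] for name in weights)

# Hypothetical reward-model scores for the preferred response to one prompt.
scores = {"helpfulness": 0.8, "harmlessness": 0.2, "humor": -0.5}
target = preference_weights(scores)   # a possible training target for the adapter
combined = scalarize(scores, target)  # weighted reward under those weights
```

In this reading, objectives on which the preferred response scores highly receive larger weights for that prompt, which is one way to interpret "prompt-specific preference weights".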
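[NLP-84] Spectral Neuro-Symbolic Reasoning II: Semantic Node Merging, Entailment Filtering, and Knowledge Graph Alignment
Quick read: This paper addresses the insufficient knowledge-graph quality, noisy inference, and limited modeling of complex semantic relations in the existing Spectral Neuro-Symbolic Reasoning (Spectral NSR) framework. The key to the solution is three semantically grounded preprocessing enhancements: (1) transformer-based node merging using contextual embeddings (e.g., Sentence-BERT, SimCSE) to reduce redundancy; (2) sentence-level entailment validation with pretrained Natural Language Inference (NLI) classifiers (e.g., RoBERTa, DeBERTa) to improve edge quality; and (3) alignment with external knowledge graphs (e.g., ConceptNet, Wikidata) to supply missing context. Without modifying the core spectral reasoning engine, these changes improve graph fidelity and reasoning robustness, yielding an efficient, interpretable, and scalable reasoning system.
Link: https://arxiv.org/abs/2511.10655
Authors: Andrew Kiruluta, Priscilla Burity
Affiliations: UC Berkeley
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Notes: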
Abstract:This report extends the Spectral Neuro-Symbolic Reasoning (Spectral NSR) framework by introducing three semantically grounded enhancements: (1) transformer-based node merging using contextual embeddings (e.g., Sentence-BERT, SimCSE) to reduce redundancy, (2) sentence-level entailment validation with pretrained NLI classifiers (e.g., RoBERTa, DeBERTa) to improve edge quality, and (3) alignment with external knowledge graphs (e.g., ConceptNet, Wikidata) to augment missing context. These modifications enhance graph fidelity while preserving the core spectral reasoning pipeline. Experimental results on ProofWriter, EntailmentBank, and CLUTRR benchmarks show consistent accuracy gains (up to +3.8%), improved generalization to adversarial cases, and reduced inference noise. The novelty lies in performing semantic and symbolic refinement entirely upstream of the spectral inference stage, enabling efficient, interpretable, and scalable reasoning without relying on quadratic attention mechanisms. In summary, this work extends the Spectral NSR framework with modular, semantically grounded preprocessing steps that improve graph quality without altering the core spectral reasoning engine. The result is a more robust, interpretable, and scalable reasoning system suitable for deployment in open-domain and real-world settings.
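The first enhancement, embedding-based node merging, can be sketched as follows. The example assumes that each node already has an embedding vector and that nodes are merged when their cosine similarity reaches a threshold; the union-find grouping, the threshold of 0.9, and the toy vectors are illustrative assumptions rather than the report's actual procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def merge_nodes(embeddings, threshold=0.9):
    """Group node labels whose embeddings have cosine similarity >= threshold."""
    labels = list(embeddings)
    parent = {label: label for label in labels}

    def find(x):
        # Follow parent pointers to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                parent[find(b)] = find(a)  # merge b's group into a's group

    groups = {}
    for label in labels:
        groups.setdefault(find(label), []).append(label)
    return sorted(groups.values())

# Toy 2-D vectors standing in for Sentence-BERT or SimCSE embeddings.
nodes = {"dog": [1.0, 0.0], "canine": [0.98, 0.05], "car": [0.0, 1.0]}
merged = merge_nodes(nodes)
```

Each resulting group would become a single node in the graph passed to the spectral reasoning stage.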
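[NLP-85] Empirical Characterization of Temporal Constraint Processing in LLMs
Quick read: This paper addresses the unreliability of Large Language Models (LLMs) in judging whether an action window is still open when they are deployed under temporal constraints. The study finds notable risks in current LLMs on temporally constrained tasks: a bimodal performance distribution (accuracy of either 95% or 50%), extreme sensitivity to prompt format (formatting changes alone shift accuracy by 30-60 percentage points), and a systematic action bias (100% false positive rates in failing models). The key to the solution is three architectural mechanisms: (1) continuous temporal state representation, (2) explicit constraint checking separate from linguistic pattern matching, and (3) systematic compositional reasoning over temporal relations. Current autoregressive architectures lack these mechanisms, so time-critical applications require hybrid architectures that incorporate symbolic reasoning modules; otherwise the deployment risk is unacceptable.
Link: https://arxiv.org/abs/2511.10654
Authors: Javier Marín
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: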
Abstract:When deploying LLMs in agentic architectures requiring real-time decisions under temporal constraints, we assume they reliably determine whether action windows remain open or have closed. This assumption is untested. We characterize temporal constraint processing across eight production-scale models (2.8-8B parameters) using deadline detection tasks, revealing systematic deployment risks: bimodal performance distribution (models achieve either 95% or 50% accuracy), extreme prompt brittleness (30-60 percentage point swings from formatting changes alone), and systematic action bias (100% false positive rates in failing models). Parameter count shows no correlation with capability in this range: a 3.8B model matches 7B models while other 7B models fail completely. Fine-tuning on 200 synthetic examples improves models with partial capability by 12-37 percentage points. We demonstrate that temporal constraint satisfaction cannot be reliably learned through next-token prediction on natural language, even with targeted fine-tuning. This capability requires architectural mechanisms for: (1) continuous temporal state representation, (2) explicit constraint checking separate from linguistic pattern matching, (3) systematic compositional reasoning over temporal relations. Current autoregressive architectures lack these mechanisms. Deploying such systems in time-critical applications without hybrid architectures incorporating symbolic reasoning modules represents unacceptable risk.
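Mechanism (2), explicit constraint checking kept separate from the language model, can be sketched as a small symbolic gate. The function names, the half-open window convention, and the "abstain" fallback are assumptions for illustration; the paper does not prescribe this interface.

```python
from datetime import datetime, timedelta

def window_open(now, opens_at, closes_at):
    """Return True while opens_at <= now < closes_at."""
    return opens_at <= now < closes_at

def gate_action(proposed_action, now, opens_at, closes_at):
    """Pass the model's proposed action through only if the window is open."""
    if window_open(now, opens_at, closes_at):
        return proposed_action
    return "abstain"

# Hypothetical scenario: a 30-minute action window starting at 09:00.
opens_at = datetime(2025, 11, 17, 9, 0)
closes_at = opens_at + timedelta(minutes=30)

on_time = gate_action("submit_order", datetime(2025, 11, 17, 9, 10), opens_at, closes_at)
too_late = gate_action("submit_order", datetime(2025, 11, 17, 9, 45), opens_at, closes_at)
```

Because the deadline comparison runs outside the model, a model with a systematic action bias cannot push an action through after the window has closed.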
[NLP-86] Hybrid Quantum Transformer for Language Generation
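Quick read: This paper addresses the lack of any successful application of quantum computing to large-scale natural language generation, i.e., how to integrate quantum computing effectively into a Large Language Model (LLM). The key to the solution is HyQuT, a hybrid quantum-classical architecture that embeds Variational Quantum Circuits (VQCs) into the Transformer framework and produces coherent, context-aware dialogue at the 8M and 150M parameter scales. Experiments show that 10 qubits with 80 quantum gates can replace about 10% of the classical parameters in the 150M-parameter model while maintaining comparable convergence stability and generation quality, an early demonstration that integrating quantum computing into large-scale generative language models is feasible.
Link: https://arxiv.org/abs/2511.10653
Authors: Desheng Kong, Xiangshuo Cui, Jiaying Jin, Jing Xu, Donglin Wang
Affiliations: Nankai University; Beijing Sursen Information Technology Co., Ltd
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Notes: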
Abstract:Although quantum computing has been increasingly applied to replace classical computation, most existing quantum or hybrid models remain confined to simple tasks, with no successful application to large-scale natural language generation to date. In this work, we present the first hybrid quantum-classical large language model (LLM) for natural language generation, HyQuT, capable of performing coherent and context-aware dialogue. The proposed architecture integrates variational quantum circuits (VQCs) into the Transformer framework at both 8M and 150M parameter scales. Experimental results show that a minimal number of qubits (10 qubits with 80 quantum gates) can replace about 10% of the classical parameters in the 150M-parameter model, while achieving comparable convergence stability and generation quality. This study provides an early demonstration of the feasibility of integrating quantum computing into large-scale generative language models.
[NLP-87] Cognitively-Inspired Episodic Memory Architectures for Accurate and Efficient Character AI
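Quick read: This paper addresses the trade-off between efficiency and depth when Large Language Models (LLMs) embody historical characters in dialogue systems: simple Retrieval-Augmented Generation (RAG) produces shallow responses, while multi-stage reflection achieves depth at the cost of substantial latency. The key to the solution is combining offline data augmentation with efficient parallel retrieval from structured episodic memory: biographical data is transformed into 1,774 first-person memories with affective-semantic metadata, and a two-stage retrieval strategy completes prompt generation in 0.52 seconds. The architecture matches traditional RAG on GPT-4 and significantly outperforms it on smaller models (GPT-3.5, GPT-3) while remaining efficient, and it supports visualization tools (such as spatiotemporal heatmaps and emotional trajectory analysis), so it serves both as a dialogue interface and as a research tool for biographical analysis.
Link: https://arxiv.org/abs/2511.10652
Authors: Rafael Arias Gonzalez, Steve DiPaola
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Notes: 25 pages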
Abstract:Large language models show promise for embodying historical characters in dialogue systems, but existing approaches face a critical trade-off: simple retrieval-augmented generation produces shallow responses, while multi-stage reflection achieves depth at prohibitive latency. We present an architecture that resolves this tension through offline data augmentation and efficient parallel retrieval from structured episodic memory. Our system transforms biographical data into 1,774 enriched first-person memories with affective-semantic metadata, then employs two-stage retrieval achieving 0.52s prompt generation. Evaluation using LLM-as-judge and RAGAs metrics shows our approach achieves parity with traditional RAG on GPT-4 while significantly outperforming it on smaller models (GPT-3.5, GPT-3), suggesting particular value for resource-constrained deployments. Beyond dialogue, the structured memory enables novel visualization tools: spatiotemporal heatmaps, emotional trajectory analysis, and interactive path tracking, positioning the system as both a dialogue interface and research tool for biographical analysis. We use Van Gogh as a test case, but the architecture is generalizable to any historical figure with substantial textual records, offering a practical framework for educational, museum, and research applications requiring both accuracy and efficiency.
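[NLP-88] Data Analysis and Performance Evaluation of Simulation Deduction Based on LLMs
Quick read: This paper addresses the reliance on manual work for data analysis and performance evaluation in military simulation deduction, which is inefficient and error-prone. The key to the solution is to first decompose the complex task into several sub-tasks and design effective system prompts and user prompts for each one, then conduct multi-round interactions with the LLM that incorporate self-check and reflection, enabling structured data extraction as well as multi-step analysis and evaluation. Custom tools are also defined and invoked to generate figures and compute metrics, and multiple report templates are designed for different applications and input data types, improving the quality and adaptability of the generated analysis reports.
Link: https://arxiv.org/abs/2511.10651
Authors: Shansi Zhang, Min Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: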
Abstract:Data analysis and performance evaluation of simulation deduction play a pivotal role in modern warfare, enabling military personnel to gain invaluable insights into the potential effectiveness of different strategies, tactics, and operational plans. The traditional manual analysis approach is time-consuming and prone to human error. To enhance efficiency and accuracy, large language models (LLMs) with strong analytical and inferencing capabilities can be employed. However, high-quality analysis reports with well-structured formatting cannot be obtained through a single instruction input to the LLM. To tackle this issue, we propose a method that first decomposes the complex task into several sub-tasks and designs effective system prompts and user prompts for each sub-task. Multi-round interactions with the LLM incorporating self-check and reflection are then conducted to enable structured data extraction as well as multi-step analysis and evaluation. Furthermore, custom tools are defined and invoked to generate figures and compute metrics. We also design multiple report templates, each tailored to a specific application and input data type, ensuring their adaptability across a variety of scenarios. Extensive evaluation results demonstrate that the reports generated by our method exhibit higher quality and therefore obtain higher scores than those of the baseline method.
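[NLP-89] Unsupervised Cycle Detection in Agentic Applications
Quick read: This paper addresses hidden execution cycles in agentic applications powered by Large Language Models (LLMs). Caused by non-deterministic behavior, these cycles silently consume computing resources without triggering explicit errors, and traditional observability platforms struggle to detect them. The key to the solution is an unsupervised cycle detection framework that combines structural and semantic analysis: efficient temporal call stack analysis first identifies explicit loops, and semantic similarity analysis then uncovers subtle cycles caused by redundant content generation.
Link: https://arxiv.org/abs/2511.10650
Authors: Felix George, Harshit Kumar, Divya Pathak, Kaustabha Ray, Mudit Verma, Pratibha Moogi
Affiliations: IBM Research India
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Notes: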
Abstract:Agentic applications powered by Large Language Models exhibit non-deterministic behaviors that can form hidden execution cycles, silently consuming resources without triggering explicit errors. Traditional observability platforms fail to detect these costly inefficiencies. We present an unsupervised cycle detection framework that combines structural and semantic analysis. Our approach first applies computationally efficient temporal call stack analysis to identify explicit loops and then leverages semantic similarity analysis to uncover subtle cycles characterized by redundant content generation. Evaluated on 1575 trajectories from a LangGraph-based stock market application, our hybrid approach achieves an F1 score of 0.72 (precision: 0.62, recall: 0.86), significantly outperforming individual structural (F1: 0.08) and semantic methods (F1: 0.28). While these results are encouraging, there remains substantial scope for improvement, and future work is needed to refine the approach and address its current limitations.
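A toy version of the two-stage idea is sketched below. It assumes a trajectory is a list of call names plus a list of generated texts, treats a back-to-back repeated block of calls as an explicit loop, and uses token-level Jaccard overlap as a stand-in for the semantic similarity model; all function names and thresholds are illustrative assumptions. The last helper recomputes the reported F1 from the reported precision and recall.

```python
def has_structural_cycle(calls, min_repeats=3):
    """True if some contiguous block of calls repeats back-to-back min_repeats times."""
    n = len(calls)
    for size in range(1, n // min_repeats + 1):
        for start in range(0, n - size * min_repeats + 1):
            block = calls[start:start + size]
            if all(calls[start + k * size:start + (k + 1) * size] == block
                   for k in range(1, min_repeats)):
                return True
    return False

def jaccard(a, b):
    """Token-set overlap between two texts (a cheap proxy for embedding similarity)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0

def has_semantic_cycle(outputs, threshold=0.8, min_pairs=2):
    """True if at least min_pairs pairs of outputs are near-duplicates."""
    near_duplicates = sum(
        1
        for i in range(len(outputs))
        for j in range(i + 1, len(outputs))
        if jaccard(outputs[i], outputs[j]) >= threshold
    )
    return near_duplicates >= min_pairs

def detect_cycle(calls, outputs):
    """Cheap structural check first; fall back to the semantic check."""
    return has_structural_cycle(calls) or has_semantic_cycle(outputs)

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

With the abstract's precision of 0.62 and recall of 0.86, `f1` gives about 0.72, which matches the reported hybrid score.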
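[NLP-90] Assessing the Capabilities of LLMs in Humor: A Multi-dimensional Analysis of Oogiri Generation and Evaluation
Quick read: This paper addresses the lack of multi-dimensional criteria in current evaluations of how well Large Language Models (LLMs) understand and generate humor; existing studies mostly rely on a single-dimensional "is it funny" judgment, which cannot capture the complexity of human humor. The key to the solution is to use Oogiri, a Japanese improvisational comedy game, as the evaluation framework and to build an expanded corpus with manual annotations along six dimensions (Novelty, Clarity, Relevance, Intelligence, Empathy, and Overall Funniness). Systematically evaluating LLMs on the two core tasks of generation and evaluation reveals a notable deficit in Empathy, which helps explain why the models fail to replicate human humor assessment.
Link: https://arxiv.org/abs/2511.09133
Authors: Ritsu Sakabe, Hwichan Kim, Tosho Hirasawa, Mamoru Komachi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: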
Abstract:Computational humor is a frontier for creating advanced and engaging natural language processing (NLP) applications, such as sophisticated dialogue systems. While previous studies have benchmarked the humor capabilities of Large Language Models (LLMs), they have often relied on single-dimensional evaluations, such as judging whether something is simply “funny.” This paper argues that a multifaceted understanding of humor is necessary and addresses this gap by systematically evaluating LLMs through the lens of Oogiri, a form of Japanese improvisational comedy games. To achieve this, we expanded upon existing Oogiri datasets with data from new sources and then augmented the collection with Oogiri responses generated by LLMs. We then manually annotated this expanded collection with 5-point absolute ratings across six dimensions: Novelty, Clarity, Relevance, Intelligence, Empathy, and Overall Funniness. Using this dataset, we assessed the capabilities of state-of-the-art LLMs on two core tasks: their ability to generate creative Oogiri responses and their ability to evaluate the funniness of responses using a six-dimensional evaluation. Our results show that while LLMs can generate responses at a level between low- and mid-tier human performance, they exhibit a notable lack of Empathy. This deficit in Empathy helps explain their failure to replicate human humor assessment. Correlation analyses of human and model evaluation data further reveal a fundamental divergence in evaluation criteria: LLMs prioritize Novelty, whereas humans prioritize Empathy. We release our annotated corpus to the community to pave the way for the development of more emotionally intelligent and sophisticated conversational agents.
Computer Vision
[CV-0] LARM: A Large Articulated-Object Reconstruction Model
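Quick read: This paper addresses the difficulty of high-fidelity reconstruction of 3D articulated objects from sparse-view inputs. Existing optimization-based methods depend on dense multi-view data and expensive per-instance optimization, while feedforward methods often produce coarse geometry, lack texture reconstruction, and rely on brittle, complex multi-stage pipelines. The key to the solution is LARM, a unified feedforward framework that extends LVSM, a recent novel view synthesis method for static objects, to the articulated setting. A transformer-based architecture reasons jointly over camera pose and articulation variation, enabling scalable and accurate novel view synthesis; auxiliary outputs such as depth maps and part masks support explicit 3D mesh extraction and joint estimation, so high-quality reconstruction across diverse object categories needs no dense supervision.
Link: https://arxiv.org/abs/2511.11563
Authors: Sylvia Yuan, Ruoxi Shi, Xinyue Wei, Xiaoshuai Zhang, Hao Su, Minghua Liu
Affiliations: University of California San Diego; Hillbot Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Project page: this https URL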
Abstract:Modeling 3D articulated objects with realistic geometry, textures, and kinematics is essential for a wide range of applications. However, existing optimization-based reconstruction methods often require dense multi-view inputs and expensive per-instance optimization, limiting their scalability. Recent feedforward approaches offer faster alternatives but frequently produce coarse geometry, lack texture reconstruction, and rely on brittle, complex multi-stage pipelines. We introduce LARM, a unified feedforward framework that reconstructs 3D articulated objects from sparse-view images by jointly recovering detailed geometry, realistic textures, and accurate joint structures. LARM extends LVSM, a recent novel view synthesis (NVS) approach for static 3D objects, into the articulated setting by jointly reasoning over camera pose and articulation variation using a transformer-based architecture, enabling scalable and accurate novel view synthesis. In addition, LARM generates auxiliary outputs such as depth maps and part masks to facilitate explicit 3D mesh extraction and joint estimation. Our pipeline eliminates the need for dense supervision and supports high-fidelity reconstruction across diverse object categories. Extensive experiments demonstrate that LARM outperforms state-of-the-art methods in both novel view and state synthesis as well as 3D articulated object reconstruction, generating high-quality meshes that closely adhere to the input images. Project page: this https URL
[CV-1] Bridging Hidden States in Vision-Language Models
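Quick read: This paper addresses the trade-off between fusion efficiency and alignment accuracy in Vision-Language Models (VLMs). Existing methods typically use early fusion (mixing features inside the encoders) or late fusion (comparing pooled embeddings), and often tie fusion to an autoregressive decoder, which increases model complexity and couples generation with understanding. The core solution is BRIDGE, a lightweight fusion module that adds a few cross-only, bidirectional attention layers near the top of both modality encoders. Each layer projects the hidden states into a shared space, attends across modalities, and applies gated residual updates, giving efficient and stable alignment. The encoders remain non-causal, which preserves understanding ability, and generation is decoupled through an optional decoder. BRIDGE outperforms comparable models on retrieval, visual question answering (VQA), and visual reasoning benchmarks while keeping the efficiency of contrastive bi-encoders.
Link: https://arxiv.org/abs/2511.11526
Authors: Benjamin Fein-Ashley, Jacob Fein-Ashley
Affiliations: University of Southern California
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: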
Abstract:Vision-Language Models (VLMs) are a new family of models that align image content with natural language. Existing approaches typically fuse either (a) early: by mixing tokens/features inside the encoders, or (b) late: by comparing pooled embeddings. Many methods also tie fusion to an autoregressive decoder. However, the hidden states of both modalities already carry rich, modality-specific structure (spatial layout in vision; syntax and semantics in text), so directly aligning these states is a natural way to match what the two modalities “think”. We propose a lightweight fusion module: a few cross-only, bidirectional attention layers placed near the top of both encoders. Each layer projects the vision and text encoder hidden-state sequences into a shared space, attends across modalities, and sends gated residual updates back, with simple stabilizers to improve alignment. The encoders remain non-causal and strong for understanding, while generation stays cleanly decoupled via an optional decoder. Across standard retrieval, VQA, and visual reasoning benchmarks, BRIDGE outperforms comparable VLMs while preserving the bi-encoder efficiency of contrastive models. We make our code publicly available at this https URL.
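The layer described above (cross-modal attention followed by gated residual updates in both directions) can be sketched in plain Python. This is a deliberately reduced illustration: it uses single-head dot-product attention, identity projections in place of the learned shared-space projections, and a fixed scalar gate, all of which are assumptions and not the authors' design.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values):
    """Each query attends over the other modality's states (used as keys and values)."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values]
        weights = softmax(scores)
        attended.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                         for j in range(dim)])
    return attended

def gated_residual_update(states, other_states, gate):
    """states + gate * cross_attention(states -> other_states)."""
    attended = cross_attend(states, other_states)
    return [[s + gate * a for s, a in zip(state_row, attended_row)]
            for state_row, attended_row in zip(states, attended)]

def bridge_layer(vision, text, gate=0.1):
    """Bidirectional, cross-only update: vision reads text and text reads vision."""
    return (gated_residual_update(vision, text, gate),
            gated_residual_update(text, vision, gate))

# Toy hidden states: 3 vision tokens and 2 text tokens of width 2.
vision_states = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
text_states = [[0.2, 0.8], [0.9, 0.1]]
new_vision, new_text = bridge_layer(vision_states, text_states)
```

With the gate at zero the layer is an identity map, which is one reason gated residual designs are convenient to add on top of pretrained encoders.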
[CV-2] CVChess: A Deep Learning Framework for Converting Chessboard Images to Forsyth-Edwards Notation
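Quick read: This paper addresses the gap between physical chessboards and digital chess platforms: online learning platforms offer real-time move suggestions, but no equivalent support exists for over-the-board games. The key to the solution is CVChess, a deep learning framework that converts smartphone photos of a chessboard into Forsyth-Edwards Notation (FEN). The system uses a convolutional neural network (CNN) with residual connections inside a multistep pipeline: edge detection with the Hough Line Transform, a projective transform to obtain a top-down view, segmentation into 64 squares, and classification into 13 classes (six white pieces, six black pieces, and the empty square). The recognized pieces are encoded as a FEN string that online chess engines can use directly to analyze the position and recommend the best move.
Link: https://arxiv.org/abs/2511.11522
Authors: Luthira Abeykoon, Ved Patel, Gawthaman Senthilvelan, Darshan Kasundra
Affiliations: University of Toronto
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: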
Abstract:Chess has experienced a large increase in viewership since the pandemic, driven largely by the accessibility of online learning platforms. However, no equivalent assistance exists for physical chess games, creating a divide between analog and digital chess experiences. This paper presents CVChess, a deep learning framework for converting chessboard images to Forsyth-Edwards Notation (FEN), which is later input into online chess engines to provide the best next move. Our approach employs a convolutional neural network (CNN) with residual layers to perform piece recognition from smartphone camera images. The system processes RGB images of a physical chess board through a multistep process: image preprocessing using the Hough Line Transform for edge detection, projective transform to achieve a top-down board alignment, segmentation into 64 individual squares, and piece classification into 13 classes (6 unique white pieces, 6 unique black pieces and an empty square) using the residual CNN. Residual connections help retain low-level visual features while enabling deeper feature extraction, improving accuracy and stability during training. We train and evaluate our model using the Chess Recognition Dataset (ChessReD), containing 10,800 annotated smartphone images captured under diverse lighting conditions and angles. The resulting classifications are encoded as an FEN string, which can be fed into a chess engine to generate the optimal move.
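The final encoding step, from 64 per-square classifications to a FEN string, can be sketched as follows. The input layout (eight rows listed from rank 8 down to rank 1, with standard FEN piece letters and `None` for the empty class) and the function name are assumptions for illustration. Only the piece-placement field of FEN is produced; the remaining fields (side to move, castling rights, en passant square, move counters) cannot be read from a single photo and would have to be supplied separately.

```python
def board_to_fen_placement(board):
    """Encode an 8x8 grid as the piece-placement field of a FEN string.

    board: 8 rows from rank 8 down to rank 1; each entry is a FEN piece letter
    (uppercase for white, lowercase for black) or None for an empty square.
    """
    ranks = []
    for row in board:
        encoded, empty_run = "", 0
        for square in row:
            if square is None:
                empty_run += 1  # consecutive empty squares collapse into a digit
            else:
                if empty_run:
                    encoded += str(empty_run)
                    empty_run = 0
                encoded += square
        if empty_run:
            encoded += str(empty_run)
        ranks.append(encoded)
    return "/".join(ranks)

# The standard starting position as the classifier output would be arranged.
back_rank = ["r", "n", "b", "q", "k", "b", "n", "r"]
start = (
    [back_rank, ["p"] * 8]
    + [[None] * 8 for _ in range(4)]
    + [["P"] * 8, [piece.upper() for piece in back_rank]]
)
placement = board_to_fen_placement(start)
# placement == "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"
```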
[CV-3] Collaborative Representation Learning for Alignment of Tactile Language and Vision Modalities
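Quick read: This paper addresses the feature redundancy caused by the lack of standardization across tactile sensors, and the insufficient intermediate interaction among the tactile, language, and vision modalities. The key to the solution is TLV-CoRe: a Sensor-Aware Modulator unifies tactile feature representations across sensors, tactile-irrelevant decoupled learning disentangles irrelevant tactile features to improve generalization, and a Unified Bridging Adapter enhances tri-modal interaction within the shared representation space.
Link: https://arxiv.org/abs/2511.11512
Authors: Yiyun Zhou, Mingjing Xu, Jingwei Shi, Quanjiang Li, Jingyuan Chen
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Notes: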
Abstract:Tactile sensing offers rich and complementary information to vision and language, enabling robots to perceive fine-grained object properties. However, existing tactile sensors lack standardization, leading to redundant features that hinder cross-sensor generalization. Moreover, existing methods fail to fully integrate the intermediate communication among tactile, language, and vision modalities. To address this, we propose TLV-CoRe, a CLIP-based Tactile-Language-Vision Collaborative Representation learning method. TLV-CoRe introduces a Sensor-Aware Modulator to unify tactile features across different sensors and employs tactile-irrelevant decoupled learning to disentangle irrelevant tactile features. Additionally, a Unified Bridging Adapter is introduced to enhance tri-modal interaction within the shared representation space. To fairly evaluate the effectiveness of tactile models, we further propose the RSS evaluation framework, focusing on Robustness, Synergy, and Stability across different methods. Experimental results demonstrate that TLV-CoRe significantly improves sensor-agnostic representation learning and cross-modal alignment, offering a new direction for multimodal tactile representation.
[CV-4] OpenUS: A Fully Open-Source Foundation Model for Ultrasound Image Analysis via Self-Adaptive Masked Contrastive Learning
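Quick read: This paper addresses the poor generalization and low label efficiency of existing ultrasound AI models, which stem from strong operator dependence, large variation in imaging conditions (anatomical regions, acquisition protocols, and device types), and limited image quality (speckle noise, low contrast) in ultrasound (US) image analysis. The key to the solution is OpenUS, the first reproducible, open-source ultrasound foundation model. It uses a Vision Mamba backbone to capture both local and global long-range dependencies, and introduces a self-adaptive masking framework that combines contrastive learning with masked image modeling, using the teacher's attention map together with the student reconstruction loss to refine the masked regions adaptively and strengthen the extraction of clinically relevant features during pre-training. A dynamic learning schedule progressively raises the pre-training difficulty. The model is pre-trained on a public dataset of over 308K images spanning institutions, devices, and disease types, and supports label-efficient fine-tuning on downstream tasks.
Link: https://arxiv.org/abs/2511.11510
Authors: Xiaoyu Zheng, Xu Chen, Awais Rauf, Qifan Fu, Benedetta Monosi, Felice Rivellese, Myles J. Lewis, Shaogang Gong, Gregory Slabaugh
Affiliations: Digital Environment Research Institute (DERI); The William Harvey Research Institute; School of Electronic Engineering and Computer Science; Queen Mary University of London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: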
Abstract:Ultrasound (US) is one of the most widely used medical imaging modalities, thanks to its low cost, portability, real-time feedback, and absence of ionizing radiation. However, US image interpretation remains highly operator-dependent and varies significantly across anatomical regions, acquisition protocols, and device types. These variations, along with unique challenges such as speckle, low contrast, and limited standardized annotations, hinder the development of generalizable, label-efficient ultrasound AI models. In this paper, we propose OpenUS, the first reproducible, open-source ultrasound foundation model built on a large collection of public data. OpenUS employs a vision Mamba backbone, capturing both local and global long-range dependencies across the image. To extract rich features during pre-training, we introduce a novel self-adaptive masking framework that combines contrastive learning with masked image modeling. This strategy integrates the teacher’s attention map with student reconstruction loss, adaptively refining clinically-relevant masking to enhance pre-training effectiveness. OpenUS also applies a dynamic learning schedule to progressively adjust the difficulty of the pre-training process. To develop the foundation model, we compile the largest to-date public ultrasound dataset comprising over 308K images from 42 publicly available datasets, covering diverse anatomical regions, institutions, imaging devices, and disease types. Our pre-trained OpenUS model can be easily adapted to specific downstream tasks by serving as a backbone for label-efficient fine-tuning. Code is available at this https URL.
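[CV-5] PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision–Language Models
Quick read: This paper addresses object hallucination in Large Vision-Language Models (LVLMs), where the generated text mentions objects that are not present in the image. The study shows that many hallucinated predictions arise because the model ignores the image and instead relies on previously generated output (prelim) tokens to infer new objects. The key to the solution is a lightweight, training-free signal, the Prelim Attention Score (PAS), which is computed from the attention weights over prelim tokens and serves as an indicator of how weakly the prediction depends on the image. PAS can be computed on the fly during inference with no additional forward passes, and it achieves state-of-the-art object-hallucination detection across multiple models and datasets, enabling real-time filtering and intervention.
Link: https://arxiv.org/abs/2511.11502
Authors: Nhat Hoang-Xuan, Minh Vu, My T. Thai, Manish Bhattarai
Affiliations: Los Alamos National Laboratory; University of Florida
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: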
Abstract:Large vision-language models (LVLMs) are powerful, yet they remain unreliable due to object hallucinations. In this work, we show that in many hallucinatory predictions the LVLM effectively ignores the image and instead relies on previously generated output (prelim) tokens to infer new objects. We quantify this behavior via the mutual information between the image and the predicted object conditioned on the prelim, demonstrating that weak image dependence strongly correlates with hallucination. Building on this finding, we introduce the Prelim Attention Score (PAS), a lightweight, training-free signal computed from attention weights over prelim tokens. PAS requires no additional forward passes and can be computed on the fly during inference. Exploiting this previously overlooked signal, PAS achieves state-of-the-art object-hallucination detection across multiple models and datasets, enabling real-time filtering and intervention.
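One plausible reading of "a signal computed from attention weights over prelim tokens" is the attention mass that the object token places on prelim positions, averaged over heads. The sketch below implements only that reading; the aggregation rule, the function name, and the toy attention rows are assumptions, since the abstract does not give the exact formula.

```python
def prelim_attention_score(attention_by_head, prelim_positions):
    """Average over heads of the attention mass placed on prelim token positions.

    attention_by_head: one attention row per head for the token being generated;
    each row holds weights over all context positions and sums to 1.
    prelim_positions: indices of previously generated output tokens (assumed known).
    """
    per_head = [sum(row[i] for i in prelim_positions) for row in attention_by_head]
    return sum(per_head) / len(per_head)

# Toy context of 4 positions: indices 0-1 are image tokens, 2-3 are prelim tokens.
attention = [
    [0.1, 0.1, 0.3, 0.5],  # head 0
    [0.2, 0.2, 0.2, 0.4],  # head 1
]
score = prelim_attention_score(attention, prelim_positions=[2, 3])
```

Under this reading, a high score means the prediction leans on earlier output rather than on the image, which the abstract associates with hallucination; any decision threshold would need to be calibrated on labeled data.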
[CV-6] Multimodal Posterior Sampling-based Uncertainty in PD-L1 Segmentation from HE Images
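Quick read: This paper addresses the resource-intensive, costly nature of current immunohistochemistry (IHC) based assessment of programmed death-ligand 1 (PD-L1) expression. It proposes nnUNet-B, a Bayesian segmentation framework that infers PD-L1 expression directly from hematoxylin and eosin (HE) stained tissue sections without IHC staining. The key innovation is Multimodal Posterior Sampling (MPS), which samples multiple model checkpoints during cyclic training to approximate the posterior, enabling accurate tissue segmentation together with epistemic uncertainty estimation and pixel-wise uncertainty maps, and offering a route toward interpretable biomarker assessment in clinical workflows.
Link: https://arxiv.org/abs/2511.11486
Authors: Roman Kinakh, Gonzalo R. Ríos-Muñoz, Arrate Muñoz-Barrutia
Affiliations: Universidad Carlos III de Madrid
Categories: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Notes: Preprint (pre-review). Accepted for publication in Lecture Notes in Bioinformatics (Springer, 2025). The final authenticated version will be available on SpringerLink once published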
Abstract:Accurate assessment of PD-L1 expression is critical for guiding immunotherapy, yet current immunohistochemistry (IHC) based methods are resource-intensive. We present nnUNet-B: a Bayesian segmentation framework that infers PD-L1 expression directly from HE-stained histology images using Multimodal Posterior Sampling (MPS). Built upon nnUNet-v2, our method samples diverse model checkpoints during cyclic training to approximate the posterior, enabling both accurate segmentation and epistemic uncertainty estimation via entropy and standard deviation. Evaluated on a dataset of lung squamous cell carcinoma, our approach achieves competitive performance against established baselines with mean Dice Score and mean IoU of 0.805 and 0.709, respectively, while providing pixel-wise uncertainty maps. Uncertainty estimates show strong correlation with segmentation error, though calibration remains imperfect. These results suggest that uncertainty-aware HE-based PD-L1 prediction is a promising step toward scalable, interpretable biomarker assessment in clinical workflows.
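The uncertainty estimates mentioned above (entropy and standard deviation over sampled checkpoints) can be illustrated for a single pixel. The sketch assumes a binary foreground probability from each sampled checkpoint and computes the entropy of the mean prediction plus the spread across samples; the function name and the toy probabilities are assumptions, and the paper's exact formulation may differ.

```python
import math
from statistics import mean, pstdev

def pixel_uncertainty(probabilities):
    """Summarize one pixel's foreground probabilities from several checkpoints.

    Returns (mean probability, binary entropy of the mean, standard deviation).
    """
    p = mean(probabilities)
    eps = 1e-12  # guards against log(0) when p is exactly 0 or 1
    entropy = -(p * math.log(p + eps) + (1 - p) * math.log(1 - p + eps))
    return p, entropy, pstdev(probabilities)

# Hypothetical predictions from three sampled checkpoints.
confident = pixel_uncertainty([0.98, 0.99, 0.97])  # checkpoints agree
disputed = pixel_uncertainty([0.10, 0.90, 0.50])   # checkpoints disagree
```

Applying the same computation to every pixel yields the kind of pixel-wise uncertainty map described in the abstract.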
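[CV-7] ImAgent: A Unified Multimodal Agent Framework for Test-Time Scalable Image Generation
Quick read: This paper addresses the unstable outputs of current text-to-image (T2I) models when prompts are vague or underspecified, which result from randomness and poor semantic consistency. Existing remedies such as prompt rewriting, best-of-N sampling, and self-refinement mitigate the problem but usually need additional modules and run independently, which limits test-time scaling efficiency and adds computational overhead. The key to the solution is ImAgent, a training-free unified multimodal agent framework in which a policy controller guides multiple generation actions to interact and self-organize dynamically, integrating reasoning, generation, and self-evaluation in a single pipeline. This improves image fidelity and semantic alignment without relying on external models and scales efficiently at test time.
Link: https://arxiv.org/abs/2511.11483
Authors: Kaishen Wang, Ruibo Chen, Tong Zheng, Heng Huang
Affiliations: University of Maryland, College Park
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 12 pages, 5 tables, 6 figures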
Abstract:Recent text-to-image (T2I) models have made remarkable progress in generating visually realistic and semantically coherent images. However, they still suffer from randomness and inconsistency with the given prompts, particularly when textual descriptions are vague or underspecified. Existing approaches, such as prompt rewriting, best-of-N sampling, and self-refinement, can mitigate these issues but usually require additional modules and operate independently, hindering test-time scaling efficiency and increasing computational overhead. In this paper, we introduce ImAgent, a training-free unified multimodal agent that integrates reasoning, generation, and self-evaluation within a single framework for efficient test-time scaling. Guided by a policy controller, multiple generation actions dynamically interact and self-organize to enhance image fidelity and semantic alignment without relying on external models. Extensive experiments on image generation and editing tasks demonstrate that ImAgent consistently improves over the backbone and even surpasses other strong baselines where the backbone model fails, highlighting the potential of unified multimodal agents for adaptive and efficient image generation under test-time scaling.
[CV-8] Rethinking Progression of Memory State in Robotic Manipulation: An Object-Centric Perspective AAAI2026
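[Quick Read]: This paper addresses the decision failures that occur when robots carry out sequential manipulation tasks in non-Markovian settings without persistent memory of past object interactions. Specifically, in scenes crowded with visually similar objects that involve object-level partial observability, conventional vision-language-action (VLA) models struggle to use short- and long-term object histories for context-aware action prediction. The core challenge is to model object identities consistently across space and time and to reason over long-term memory without a significant increase in computational complexity. The key to the solution is the Embodied-SlotSSM framework, which is slot-centric and achieves temporally scalable non-Markovian reasoning through two mechanisms: slot-state-space modeling to reconstruct short-term history, and a relational encoder that aligns input tokens with action decoding. Together they support temporally grounded, context-aware action prediction. Experiments report its baseline performance on the LIBERO-Mem task suite and on general tasks.
Link: https://arxiv.org/abs/2511.11478
Authors: Nhat Chung, Taisei Hanyu, Toan Nguyen, Huy Le, Frederick Bumgarner, Duy Minh Ho Nguyen, Khoa Vo, Kashu Yamazaki, Chase Rainwater, Tung Kieu, Anh Nguyen, Ngan Le
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at AAAI 2026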
Abstract:As embodied agents operate in increasingly complex environments, the ability to perceive, track, and reason about individual object instances over time becomes essential, especially in tasks requiring sequenced interactions with visually similar objects. In these non-Markovian settings, key decision cues are often hidden in object-specific histories rather than the current scene. Without persistent memory of prior interactions (what has been interacted with, where it has been, or how it has changed) visuomotor policies may fail, repeat past actions, or overlook completed ones. To surface this challenge, we introduce LIBERO-Mem, a non-Markovian task suite for stress-testing robotic manipulation under object-level partial observability. It combines short- and long-horizon object tracking with temporally sequenced subgoals, requiring reasoning beyond the current frame. However, vision-language-action (VLA) models often struggle in such settings, with token scaling quickly becoming intractable even for tasks spanning just a few hundred frames. We propose Embodied-SlotSSM, a slot-centric VLA framework built for temporal scalability. It maintains spatio-temporally consistent slot identities and leverages them through two mechanisms: (1) slot-state-space modeling for reconstructing short-term history, and (2) a relational encoder to align the input tokens with action decoding. Together, these components enable temporally grounded, context-aware action prediction. Experiments show Embodied-SlotSSM’s baseline performance on LIBERO-Mem and general tasks, offering a scalable solution for non-Markovian reasoning in object-centric robotic policies.
[CV-9] Sat2RealCity: Geometry-Aware and Appearance-Controllable 3D Urban Generation from Satellite Imagery
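[Quick Read]: This paper addresses two major challenges facing current 3D urban generation methods. First, they rely on large-scale 3D city assets for supervised training, which are difficult and costly to obtain. Second, they mostly generate buildings from semantic or height maps with no connection to real-world appearance, which limits the realism and generalizability of the results. The key to the solution is the Sat2RealCity framework, whose core innovations are: (1) an OpenStreetMap (OSM)-based spatial prior strategy that yields interpretable geometric generation from spatial topology to building instances; (2) an appearance-guided controllable modeling mechanism that improves fine-grained realism and style control; and (3) a semantic-guided generation pipeline driven by a multimodal large language model (MLLM) that bridges semantic interpretation and geometric reconstruction. The approach improves the structural consistency and appearance realism of generated cities, laying a foundation for real-world-aligned 3D urban content creation.
Link: https://arxiv.org/abs/2511.11470
Authors: Yijie Kang, Xinliang Wang, Zhenyu Wu, Yifeng Shi, Hailong Zhu
Affiliations: Ke Holdings Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: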
Abstract:Recent advances in generative modeling have substantially enhanced 3D urban generation, enabling applications in digital twins, virtual cities, and large-scale simulations. However, existing methods face two key challenges: (1) the need for large-scale 3D city assets for supervised training, which are difficult and costly to obtain, and (2) reliance on semantic or height maps, which are used exclusively for generating buildings in virtual worlds and lack connection to real-world appearance, limiting the realism and generalizability of generated cities. To address these limitations, we propose Sat2RealCity, a geometry-aware and appearance-controllable framework for 3D urban generation from real-world satellite imagery. Unlike previous city-level generation methods, Sat2RealCity builds generation upon individual building entities, enabling the use of rich priors and pretrained knowledge from 3D object generation while substantially reducing dependence on large-scale 3D city assets. Specifically, (1) we introduce the OSM-based spatial priors strategy to achieve interpretable geometric generation from spatial topology to building instances; (2) we design an appearance-guided controllable modeling mechanism for fine-grained appearance realism and style control; and (3) we construct an MLLM-powered semantic-guided generation pipeline, bridging semantic interpretation and geometric reconstruction. Extensive quantitative and qualitative experiments demonstrate that Sat2RealCity significantly surpasses existing baselines in structural consistency and appearance realism, establishing a strong foundation for real-world aligned 3D urban content creation. The code will be released soon.
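[CV-10] Benchmarking Visual LLMs Resilience to Unanswerable Questions on Visually Rich Documents
[Quick Read]: This paper addresses the limited robustness of Visual Large Language Models (VLLMs) on Visually Rich Documents (VRDs) when they face questions that look plausible but cannot be answered, i.e., the models' difficulty in recognizing such questions. The key to the solution is VRD-UQA (VISUALLY RICH DOCUMENT UNANSWERABLE QUESTION ANSWERING), a benchmark framework that automatically applies semantic perturbations to the questions of multi-page VRD datasets (for example, swapping in entities of the same type, or changing the document element or layout position), verifies their unanswerability with a VLLM-as-a-judge approach, and then systematically evaluates how well different models detect unanswerable questions at the page and document levels, along with the effect of several knowledge injection strategies (OCR, multi-page selection, and hinting at the possibility of unanswerability).
Link: https://arxiv.org/abs/2511.11468
Authors: Davide Napolitano, Luca Cagliero, Fabrizio Battiloro
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: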
Abstract:The evolution of Visual Large Language Models (VLLMs) has revolutionized the automatic understanding of Visually Rich Documents (VRDs), which contain both textual and visual elements. Although VLLMs excel in Visual Question Answering (VQA) on multi-page VRDs, their ability to detect unanswerable questions is still an open research question. Our research delves into the robustness of the VLLMs to plausible yet unanswerable questions, i.e., questions that appear valid but cannot be answered due to subtle corruptions caused by swaps between related concepts or plausible question formulations. Corruptions are generated by replacing the original natural language entities with other ones of the same type, belonging to different document elements, and in different layout positions or pages of the related document. To this end, we present VRD-UQA (VISUALLY RICH DOCUMENT UNANSWERABLE QUESTION ANSWERING), a benchmark for evaluating VLLMs’ resilience to plausible yet unanswerable questions across multiple dimensions. It automatically alters the questions of existing VQA datasets consisting of multi-page VRDs, verifies their unanswerability using a VLLM-as-a-judge approach, and then thoroughly evaluates VLLMs’ performance. Experiments, run on 12 models, analyze: (1) The VLLMs’ accuracy in detecting unanswerable questions at both page and document levels; (2) The effect of different types of corruption (NLP entity, document element, layout); (3) The effectiveness of different knowledge injection strategies based on in-context learning (OCR, multi-page selection, or the possibility of unanswerability). Our findings reveal VLLMs’ limitations and demonstrate that VRD-UQA can serve as an evaluation framework for developing resilient document VQA systems.
[CV-11] Rethinking Efficient Mixture-of-Experts for Remote Sensing Modality-Missing Classification
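[Quick Read]: This paper addresses missing modalities in multimodal remote sensing classification, caused by environmental interference, sensor failures, or atmospheric effects, which significantly degrade classification performance. Existing two-stage adaptation methods are computationally expensive and assume complete data during training, so they generalize poorly to real-world incomplete scenarios. The key to the solution is a Missing-aware Mixture-of-Loras (MaMOL) model that reformulates modality missing as a multi-task learning problem and introduces a dual-routing mechanism: a task-oriented dynamic router that adaptively activates experts for different missing patterns, and a modality-specific-shared static router that maintains stable cross-modal knowledge sharing. The framework achieves parameter-efficient adaptation through lightweight expert updates and shared expert reuse, with no need to train a separate network for each missing configuration. It shows strong robustness and generalization on multiple remote sensing benchmarks, and experiments on natural image datasets validate its cross-domain scalability.
Link: https://arxiv.org/abs/2511.11460
Authors: Qinghao Gao, Jianhai Qu, Yunsong Li, Weiqiang Dong
Affiliations: Xidian University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 4 figures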
Abstract:Multimodal classification in remote sensing often suffers from missing modalities caused by environmental interference, sensor failures, or atmospheric effects, which severely degrade classification performance. Existing two-stage adaptation methods are computationally expensive and assume complete multimodal data during training, limiting their generalization to real-world incompleteness. To overcome these issues, we propose a Missing-aware Mixture-of-Loras (MaMOL) framework that reformulates modality missing as a multi-task learning problem. MaMOL introduces a dual-routing mechanism: a task-oriented dynamic router that adaptively activates experts for different missing patterns, and a modality-specific-shared static router that maintains stable cross-modal knowledge sharing. Unlike prior methods that train separate networks for each missing configuration, MaMOL achieves parameter-efficient adaptation via lightweight expert updates and shared expert reuse. Experiments on multiple remote sensing benchmarks demonstrate superior robustness and generalization under varying missing rates, with minimal computational overhead. Moreover, transfer experiments on natural image datasets validate its scalability and cross-domain applicability, highlighting MaMOL as a general and efficient solution for incomplete multimodal learning.
[CV-12] VoxTell: Free-Text Promptable Universal 3D Medical Image Segmentation
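[Quick Read]: This paper addresses text-prompted three-dimensional (3D) segmentation of medical images, i.e., how to map natural language descriptions (from single words to full clinical sentences) accurately to 3D masks of the corresponding anatomical or pathological structures. Conventional methods usually rely on supervised training with annotated data and generalize poorly to unseen classes or modalities. The key to VoxTell is a multi-stage vision-language fusion mechanism that aligns textual and visual features at multiple scales across decoder layers. Trained on more than 62,000 medical volumes covering CT, MRI, and PET, the model has strong zero-shot capability: it performs well on familiar concepts, generalizes to related unseen classes, is robust to linguistic variations and clinical language, and produces accurate instance-specific segmentation from real-world text.
Link: https://arxiv.org/abs/2511.11450
Authors: Maximilian Rokuss, Moritz Langenberg, Yannick Kirchhoff, Fabian Isensee, Benjamin Hamm, Constantin Ulrich, Sebastian Regnery, Lukas Bauer, Efthimios Katsigiannopulos, Tobias Norajitra, Klaus Maier-Hein
Affiliations: German Cancer Research Center; Faculty of Mathematics and Computer Science; Medical Faculty - Heidelberg University; Helmholtz Imaging; Department of Radiation Oncology, Heidelberg University Hospital; HIDSS4Health, Heidelberg; Pattern Analysis and Learning Group, Heidelberg University Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: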
Abstract:We introduce VoxTell, a vision-language model for text-prompted volumetric medical image segmentation. It maps free-form descriptions, from single words to full clinical sentences, to 3D masks. Trained on 62K+ CT, MRI, and PET volumes spanning over 1K anatomical and pathological classes, VoxTell uses multi-stage vision-language fusion across decoder layers to align textual and visual features at multiple scales. It achieves state-of-the-art zero-shot performance across modalities on unseen datasets, excelling on familiar concepts while generalizing to related unseen classes. Extensive experiments further demonstrate strong cross-modality transfer, robustness to linguistic variations and clinical language, as well as accurate instance-specific segmentation from real-world text. Code is available at: this https URL
[CV-13] VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models AAAI2026
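[Quick Read]: This paper addresses the lack of systematic evaluation of how well current Multimodal Large Language Models (MLLMs) understand and use the "visual prompts" (VPs) that humans naturally employ. Existing benchmarks do not adequately examine MLLMs' perception of VPs such as bounding boxes or their practical effect on downstream tasks, so it is unclear whether these models can recognize VPs and use them to solve concrete problems. The key to the solution is VP-Bench, a two-stage evaluation framework: Stage 1 quantifies models' VP perception using 30k visualized prompts covering eight shapes and 355 attribute combinations; Stage 2 tests the impact of VPs on downstream tasks in real-world scenarios, giving a comprehensive assessment of MLLMs' ability to comprehend and resolve grounded referring questions. The benchmark provides a new reference framework for studying how MLLMs handle visual prompts.
Link: https://arxiv.org/abs/2511.11438
Authors: Mingjie Xu, Jinpeng Chen, Yuzhi Zhao, Jason Chun Lok Li, Yue Qiu, Zekang Du, Mengyang Wu, Pingping Zhang, Kun Li, Hongzheng Yang, Wenao Ma, Jiaheng Wei, Qinbin Li, Kangcheng Liu, Wenqiang Lei
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This is the extended version of the paper accepted at AAAI 2026, which includes all technical appendices and additional experimental details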
Abstract:Multimodal large language models (MLLMs) have enabled a wide range of advanced vision-language applications, including fine-grained object recognition and contextual understanding. When querying specific regions or objects in an image, human users naturally use “visual prompts” (VPs), such as bounding boxes, to provide reference. However, no existing benchmark systematically evaluates the ability of MLLMs to interpret such VPs. This gap leaves it unclear whether current MLLMs can effectively recognize VPs, an intuitive prompting method for humans, and use them to solve problems. To address this limitation, we introduce VP-Bench, a benchmark for assessing MLLMs’ capability in VP perception and utilization. VP-Bench employs a two-stage evaluation framework: Stage 1 examines models’ ability to perceive VPs in natural scenes, using 30k visualized prompts spanning eight shapes and 355 attribute combinations. Stage 2 investigates the impact of VPs on downstream tasks, measuring their effectiveness in real-world problem-solving scenarios. Using VP-Bench, we evaluate 28 MLLMs, including proprietary systems (e.g., GPT-4o) and open-source models (e.g., InternVL3 and Qwen2.5-VL), and provide a comprehensive analysis of factors that affect VP understanding, such as variations in VP attributes, question arrangement, and model scale. VP-Bench establishes a new reference framework for studying how MLLMs comprehend and resolve grounded referring questions.
[CV-14] Hi-DREAM: Brain Inspired Hierarchical Diffusion for fMRI Reconstruction via ROI Encoder and visuAl Mapping
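[Quick Read]: This paper addresses a shortcoming of current diffusion-based methods for reconstructing images from fMRI: the conditioning input relies directly on fMRI features and ignores the hierarchical organization of the visual cortex, which blurs the functional roles of early, middle, and late visual areas. The key to the solution is the Hi-DREAM framework, whose core innovation is to model the cortical hierarchy explicitly: a region-of-interest (ROI) adapter groups fMRI signals into early, middle, and late visual streams and builds a multi-scale cortical pyramid aligned with the U-Net depth (shallow scales preserve layout and edge information, deeper scales emphasize objects and semantics), while a lightweight, depth-matched ControlNet injects the scale-specific hints during denoising. This design gives each conditioning signal a brain-like functional meaning, enabling efficient and interpretable image reconstruction while revealing the functional contributions of different visual areas.
Link: https://arxiv.org/abs/2511.11437
Authors: Guowei Zhang, Yun Zhao, Moein Khajehnejad, Adeel Razi, Levin Kuhlmann
Affiliations: Monash University
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: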
Abstract:Mapping human brain activity to natural images offers a new window into vision and cognition, yet current diffusion-based decoders face a core difficulty: most condition directly on fMRI features without analyzing how visual information is organized across the cortex. This overlooks the brain’s hierarchical processing and blurs the roles of early, middle, and late visual areas. We propose Hi-DREAM, a brain-inspired conditional diffusion framework that makes the cortical organization explicit. A region-of-interest (ROI) adapter groups fMRI into early/mid/late streams and converts them into a multi-scale cortical pyramid aligned with the U-Net depth (shallow scales preserve layout and edges; deeper scales emphasize objects and semantics). A lightweight, depth-matched ControlNet injects these scale-specific hints during denoising. The result is an efficient and interpretable decoder in which each signal plays a brain-like role, allowing the model not only to reconstruct images but also to illuminate functional contributions of different visual areas. Experiments on the Natural Scenes Dataset (NSD) show that Hi-DREAM attains state-of-the-art performance on high-level semantic metrics while maintaining competitive low-level fidelity. These findings suggest that structuring conditioning by cortical hierarchy is a powerful alternative to purely data-driven embeddings and provides a useful lens for studying the visual cortex.
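[CV-15] The Persistence of Cultural Memory: Investigating Multimodal Iconicity in Diffusion Models
[Quick Read]: This paper addresses the ambiguity between generalization and memorization in text-to-image diffusion models, focusing on a specific phenomenon, multimodal iconicity, in which images and texts together evoke culturally shared associations (for example, a title that recalls a well-known artwork or film scene). The key to the solution is an evaluation framework that separates "recognition" (whether a model identifies a cultural reference) from "realization" (how it depicts that reference, through replication or reinterpretation) and introduces quantitative measures for each dimension. The framework distinguishes more effectively than existing similarity-based methods between models that simply replicate cultural knowledge and those that creatively transform it, offering richer contextual understanding when assessing how deeply a model grasps cultural references.
Link: https://arxiv.org/abs/2511.11435
Authors: Maria-Teresa De Rosa Palmini, Eva Cetinic
Affiliations: University of Zurich
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: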
Abstract:Our work addresses the ambiguity between generalization and memorization in text-to-image diffusion models, focusing on a specific case we term multimodal iconicity. This refers to instances where images and texts evoke culturally shared associations, such as when a title recalls a familiar artwork or film scene. While prior research on memorization and unlearning emphasizes forgetting, we examine what is remembered and how, focusing on the balance between recognizing cultural references and reproducing them. We introduce an evaluation framework that separates recognition, whether a model identifies a reference, from realization, how it depicts it through replication or reinterpretation, quantified through measures capturing both dimensions. By evaluating five diffusion models across 767 Wikidata-derived cultural references spanning static and dynamic imagery, we show that our framework distinguishes replication from transformation more effectively than existing similarity-based methods. To assess linguistic sensitivity, we conduct prompt perturbation experiments using synonym substitutions and literal image descriptions, finding that models often reproduce iconic visual structures even when textual cues are altered. Finally, our analysis shows that cultural alignment correlates not only with training data frequency, but also textual uniqueness, reference popularity, and creation date. Our work reveals that the value of diffusion models lies not only in what they reproduce but in how they transform and recontextualize cultural knowledge, advancing evaluation beyond simple text-image matching toward richer contextual understanding.
[CV-16] WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
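[Quick Read]: This paper addresses the single-turn limitation common to current Unified Multimodal Models (UMMs) in visual comprehension and generation: existing datasets and benchmarks fail to capture the multi-turn, context-dependent nature of real-world image creation and editing. The key to the solution is WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation, which has two core components. WEAVE-100k is a large-scale dataset of interleaved samples (100K samples, more than 370K dialogue turns, 500K images) covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-annotated benchmark that uses a hybrid vision-language model (VLM) judge, based on both the reference image and the combination of the original image with editing instructions, to evaluate multi-turn generation, visual memory, and world-knowledge reasoning. Experiments show that training on WEAVE-100k improves UMMs' visual comprehension, image editing, and comprehension-generation collaboration, and elicits emergent visual-memory capabilities, while evaluations on WEAVEBench expose persistent challenges for current approaches in multi-turn, context-aware image generation and editing.
Link: https://arxiv.org/abs/2511.11434
Authors: Wei Chow, Jiachun Pan, Yongyuan Liang, Mingze Zhou, Xue Song, Liyu Jia, Saining Zhang, Siliang Tang, Juncheng Li, Fengda Zhang, Weijia Wu, Hanwang Zhang, Tat-Seng Chua
Affiliations: National University of Singapore; Nanyang Technological University; University of Maryland, College Park; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: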
Abstract:Recent advances in unified multimodal models (UMMs) have enabled impressive progress in visual comprehension and generation. However, existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. Our suite consists of two complementary parts. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-annotated benchmark with 100 tasks based on 480 images, featuring a hybrid VLM judger evaluation framework based on both the reference image and the combination of the original image with editing instructions that assesses models’ abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains. Experiments demonstrate that training on WEAVE-100k enables vision comprehension, image editing, and comprehension-generation collaboration capabilities. Furthermore, it facilitates UMMs to develop emergent visual-memory capabilities, while extensive evaluations on WEAVEBench expose the persistent limitations and challenges of current approaches in multi-turn, context-aware image generation and editing. We believe WEAVE provides a view and foundation for studying in-context interleaved comprehension and generation for multi-modal community.
[CV-17] Comprehension of Multilingual Expressions Referring to Target Objects in Visual Inputs
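[Quick Read]: This paper addresses a limitation of Referring Expression Comprehension (REC) in vision-language understanding: research is concentrated on English and falls short of global deployment needs. The solution has two key parts. First, it builds a unified dataset spanning 10 languages with approximately 8 million multilingual referring expressions, expanding 12 existing English REC benchmarks through machine translation and context-based translation enhancement. Second, it proposes an attention-anchored neural architecture that uses multilingual SigLIP2 encoders; it generates coarse spatial anchors from attention distributions and refines them through learned residuals, providing cross-lingual visual grounding. Experiments show that the method reaches 86.9% accuracy at IoU@50 on the multilingual RefCOCO benchmark, compared with 91.3% for the English-only result, demonstrating the feasibility and cross-language consistency of multilingual visual grounding systems.
Link: https://arxiv.org/abs/2511.11427
Authors: Francisco Nogueira, Alexandre Bernardino, Bruno Martins
Affiliations: Instituto Superior Técnico
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: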
Abstract:Referring Expression Comprehension (REC) requires models to localize objects in images based on natural language descriptions. Research on the area remains predominantly English-centric, despite increasing global deployment demands. This work addresses multilingual REC through two main contributions. First, we construct a unified multilingual dataset spanning 10 languages, by systematically expanding 12 existing English REC benchmarks through machine translation and context-based translation enhancement. The resulting dataset comprises approximately 8 million multilingual referring expressions across 177,620 images, with 336,882 annotated objects. Second, we introduce an attention-anchored neural architecture that uses multilingual SigLIP2 encoders. Our attention-based approach generates coarse spatial anchors from attention distributions, which are subsequently refined through learned residuals. Experimental evaluation demonstrates competitive performance on standard benchmarks, e.g. achieving 86.9% accuracy at IoU@50 on RefCOCO aggregate multilingual evaluation, compared to an English-only result of 91.3%. Multilingual evaluation shows consistent capabilities across languages, establishing the practical feasibility of multilingual visual grounding systems. The dataset and model are available at this https URL.
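The "accuracy at IoU@50" figure quoted above counts a prediction as correct when its box overlaps the annotated box with an intersection-over-union of at least 0.5. A minimal sketch of that metric follows, assuming axis-aligned boxes in (x1, y1, x2, y2) form; the function names and toy boxes are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(preds, targets, threshold=0.5):
    """Fraction of predictions whose IoU with the target reaches the threshold."""
    hits = sum(1 for p, t in zip(preds, targets) if box_iou(p, t) >= threshold)
    return hits / len(targets)

# Toy example: one exact match, one half-shifted box, one complete miss.
preds = [(0, 0, 10, 10), (0, 0, 10, 10), (20, 20, 30, 30)]
targets = [(0, 0, 10, 10), (5, 0, 15, 10), (0, 0, 10, 10)]
acc = accuracy_at_iou(preds, targets)
```

The half-shifted box shares half its area with the target but has an IoU of only 1/3, which shows why the 0.5 threshold is stricter than it may first appear.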
[CV-18] Shrinking the Teacher: An Adaptive Teaching Paradigm for Asymmetric EEG-Vision Alignment AAAI2026
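[Quick Read]: This paper addresses the decoding of visual features from electroencephalography (EEG) signals, a central challenge in neuroscience, and in particular the fact that existing cross-modal alignment methods overlook the fundamental asymmetry between the visual and brain modalities. The authors characterize this asymmetry by two critical gaps: a Fidelity Gap, in which EEG's inherent noise and signal degradation leave it far below the high-fidelity features of vision, and a Semantic Gap, in which EEG's shallow representation cannot match vision's rich semantic depth. Conventional methods align the two as if they were equal partners, which limits generalization. The paper therefore proposes an adaptive teaching paradigm in which the "teacher" modality (vision) dynamically shrinks and adjusts its knowledge structure under task guidance to fit the capacity of the "student" modality (EEG). It is implemented as the ShrinkAdapter module, which uses a residual-free design and a bottleneck structure to reduce the semantic density of visual features to match what EEG can express. Experiments show a top-1 accuracy of 60.2% on zero-shot brain-to-image retrieval, surpassing the previous state of the art by 9.8%.
Link: https://arxiv.org/abs/2511.11422
Authors: Lukun Wu, Jie Li, Ziqi Ren, Kaifan Zhang, Xinbo Gao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 12 figures, published to AAAI 2026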
Abstract:Decoding visual features from EEG signals is a central challenge in neuroscience, with cross-modal alignment as the dominant approach. We argue that the relationship between visual and brain modalities is fundamentally asymmetric, characterized by two critical gaps: a Fidelity Gap (stemming from EEG’s inherent noise and signal degradation, vs. vision’s high-fidelity features) and a Semantic Gap (arising from EEG’s shallow conceptual representation, vs. vision’s rich semantic depth). Previous methods often overlook this asymmetry, forcing alignment between the two modalities as if they were equal partners and thereby leading to poor generalization. To address this, we propose the adaptive teaching paradigm. This paradigm empowers the "teacher" modality (vision) to dynamically shrink and adjust its knowledge structure under task guidance, tailoring its semantically dense features to match the "student" modality (EEG)'s capacity. We implement this paradigm with the ShrinkAdapter, a simple yet effective module featuring a residual-free design and a bottleneck structure. Through extensive experiments, we validate the underlying rationale and effectiveness of our paradigm. Our method achieves a top-1 accuracy of 60.2% on the zero-shot brain-to-image retrieval task, surpassing previous state-of-the-art methods by a margin of 9.8%. Our work introduces a new perspective for asymmetric alignment: the teacher must shrink and adapt to bridge the vision-brain gap.
[CV-19] BOFA: Bridge-Layer Orthogonal Low-Rank Fusion for CLIP-Based Class-Incremental Learning AAAI2026
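[Quick Read]: This paper studies two major challenges in applying vision-language models such as CLIP to Class-Incremental Learning (CIL). First, existing methods need additional learnable modules to adapt to downstream tasks, which increases model complexity and aggravates forgetting. Second, the complementary strengths of the visual and textual modalities have not been fully exploited for effective fusion. The key to the solution is the BOFA (Bridge-layer Orthogonal Fusion for Adaptation) framework, whose core innovations are: (1) confining all model adaptation to CLIP's existing cross-modal bridge-layer, adding no parameters or inference cost; (2) an Orthogonal Low-Rank Fusion mechanism that constrains parameter updates to a low-rank "safe subspace" mathematically constructed to be orthogonal to past task features, which prevents forgetting without data replay; and (3) a cross-modal hybrid prototype that combines stable textual prototypes with visual prototypes derived from the stably adapted bridge-layer, improving classification performance.
Link: https://arxiv.org/abs/2511.11421
Authors: Lan Li, Tao Hu, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by AAAI 2026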
Abstract:Class-Incremental Learning (CIL) aims to continually learn new categories without forgetting previously acquired knowledge. Vision-language models such as CLIP offer strong transferable representations via multi-modal supervision, making them promising for CIL. However, applying CLIP to CIL poses two major challenges: (1) adapting to downstream tasks often requires additional learnable modules, increasing model complexity and susceptibility to forgetting; and (2) while multi-modal representations offer complementary strengths, existing methods have yet to fully realize their potential in effectively integrating visual and textual modalities. To address these issues, we propose BOFA (Bridge-layer Orthogonal Fusion for Adaptation), a novel framework for CIL. BOFA confines all model adaptation exclusively to CLIP’s existing cross-modal bridge-layer, thereby adding no extra parameters or inference cost. To prevent forgetting within this layer, it leverages Orthogonal Low-Rank Fusion, a mechanism that constrains parameter updates to a low-rank "safe subspace" mathematically constructed to be orthogonal to past task features. This ensures stable knowledge accumulation without data replay. Furthermore, BOFA employs a cross-modal hybrid prototype that synergizes stable textual prototypes with visual counterparts derived from our stably adapted bridge-layer, enhancing classification performance. Extensive experiments on standard benchmarks show that BOFA achieves superior accuracy and efficiency compared to existing methods.
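The idea of keeping updates orthogonal to past task features can be illustrated with a generic projection. This is only a sketch of the orthogonality constraint, assuming a linear layer and a small set of stored past feature vectors; it is not BOFA's actual construction, and all names and values are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

def project_out(update, basis):
    """Remove from each row of `update` its component inside span(basis)."""
    projected = []
    for row in update:
        r = list(row)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        projected.append(r)
    return projected

def matvec(m, x):
    return [dot(row, x) for row in m]

# Hypothetical past-task features (3-dimensional) and a raw weight update.
past_features = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
raw_update = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.7]]
safe_update = project_out(raw_update, orthonormal_basis(past_features))

# The projected update leaves the layer's response to past features unchanged.
responses = [matvec(safe_update, f) for f in past_features]
```

Because every row of the projected update is orthogonal to the span of the past features, adding it to the layer weights changes nothing for those inputs, which is the sense in which such a subspace is "safe".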
[CV-20] Low-Bit High-Fidelity: Optimal Transport Quantization for Flow Matching
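[Quick Read]: This paper addresses the high-precision parameter requirements that hinder practical deployment of Flow Matching (FM) generative models, especially the challenge of model compression for edge and embedded AI. The key to the solution is optimal transport (OT)-based post-training quantization, which preserves model performance by minimizing the 2-Wasserstein distance between the quantized and original weights. Theoretical analysis gives upper bounds on the generative degradation caused by quantization, and experiments show that the method maintains image generation quality and latent space stability down to 2-3 bits per parameter, where uniform, piecewise, and logarithmic quantization schemes fail.
Link: https://arxiv.org/abs/2511.11418
Authors: Dara Varam, Diaa A. Abuhani, Imran Zualkernan, Raghad AlDamani, Lujain Khalil
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 8 figures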
Abstract:Flow Matching (FM) generative models offer efficient simulation-free training and deterministic sampling, but their practical deployment is challenged by high-precision parameter requirements. We adapt optimal transport (OT)-based post-training quantization to FM models, minimizing the 2-Wasserstein distance between quantized and original weights, and systematically compare its effectiveness against uniform, piecewise, and logarithmic quantization schemes. Our theoretical analysis provides upper bounds on generative degradation under quantization, and empirical results across five benchmark datasets of varying complexity show that OT-based quantization preserves both visual generation quality and latent space stability down to 2-3 bits per parameter, where alternative methods fail. This establishes OT-based quantization as a principled, effective approach to compress FM generative models for edge and embedded AI applications.
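To make the comparison with uniform quantization concrete, here is a minimal one-dimensional sketch. It assumes scalar weights and nearest-level assignment, for which the mean squared weight error equals the squared 2-Wasserstein distance between the empirical weight distribution and its quantized counterpart; minimizing that error over the levels is the classic Lloyd (1D k-means) procedure. Whether the paper uses this exact procedure is an assumption, and the function names and toy weights are hypothetical.

```python
def nearest_index(levels, w):
    """Index of the quantization level closest to weight w."""
    return min(range(len(levels)), key=lambda j: (w - levels[j]) ** 2)

def quantization_error(weights, levels):
    """Mean squared error under nearest-level assignment (squared W2 in 1D)."""
    return sum((w - levels[nearest_index(levels, w)]) ** 2
               for w in weights) / len(weights)

def uniform_levels(weights, bits):
    """2**bits evenly spaced levels between the smallest and largest weight."""
    lo, hi = min(weights), max(weights)
    k = 2 ** bits
    return [lo + (hi - lo) * i / (k - 1) for i in range(k)]

def lloyd_levels(weights, bits, iters=50):
    """Lloyd iterations: assign weights to levels, then move levels to bucket means."""
    levels = uniform_levels(weights, bits)
    for _ in range(iters):
        buckets = [[] for _ in levels]
        for w in weights:
            buckets[nearest_index(levels, w)].append(w)
        levels = [sum(b) / len(b) if b else c for b, c in zip(buckets, levels)]
    return levels

# Toy weights: most values cluster near zero, with two outliers.
weights = [-1.0, -0.1, -0.05, -0.02, 0.0, 0.01, 0.03, 0.06, 0.12, 1.0]
uniform_err = quantization_error(weights, uniform_levels(weights, 2))
lloyd_err = quantization_error(weights, lloyd_levels(weights, 2))
```

With 2 bits, the uniform grid spends its interior levels far from where the weights cluster, whereas the Lloyd levels move toward the dense region near zero, which is the kind of distribution-aware behavior that low-bit quantization depends on.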
[CV-21] Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models
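[Quick Read]: This paper addresses the fact that the Document Image Quality Assessment (DIQA) capabilities of Multi-modal Large Language Models (MLLMs) remain underexplored. Existing research focuses mainly on high-level vision tasks and lacks a systematic evaluation of MLLMs' potential for fine-grained perception and judgment of image quality. The authors therefore propose Q-Doc, a framework that probes DIQA capabilities at three levels: quality scoring at the coarse level, distortion-type identification (single-choice and multi-choice tasks) at the middle level, and distortion-severity classification at the fine level. The key to the approach is the tiered evaluation scheme combined with Chain-of-Thought (CoT) prompting, which substantially improves model performance at every level and exposes both the limitations of current MLLMs in quality perception and pathways for improvement.
Link: https://arxiv.org/abs/2511.11410
Authors: Jiaxi Huang, Dongxu Wu, Hanwei Zhu, Lingyu Zhu, Jun Xing, Xu Wang, Baoliang Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: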
Abstract:The rapid advancement of Multi-modal Large Language Models (MLLMs) has expanded their capabilities beyond high-level vision tasks. Nevertheless, their potential for Document Image Quality Assessment (DIQA) remains underexplored. To bridge this gap, we propose Q-Doc, a three-tiered evaluation framework for systematically probing DIQA capabilities of MLLMs at coarse, middle, and fine granularity levels. a) At the coarse level, we instruct MLLMs to assign quality scores to document images and analyze their correlation with Quality Annotations. b) At the middle level, we design distortion-type identification tasks, including single-choice and multi-choice tests for multi-distortion scenarios. c) At the fine level, we introduce distortion-severity assessment where MLLMs classify distortion intensity against human-annotated references. Our evaluation demonstrates that while MLLMs possess nascent DIQA abilities, they exhibit critical limitations: inconsistent scoring, distortion misidentification, and severity misjudgment. Significantly, we show that Chain-of-Thought (CoT) prompting substantially enhances performance across all levels. Our work provides a benchmark for DIQA capabilities in MLLMs, revealing pronounced deficiencies in their quality perception and promising pathways for enhancement. The benchmark and code are publicly available at: this https URL.
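The coarse-level analysis correlates model-assigned scores with quality annotations. The abstract does not name the correlation measure, so the sketch below assumes a rank correlation (Spearman's), a common choice in quality assessment; the function names and toy scores are hypothetical.

```python
def ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical model scores and human quality annotations for five documents.
model_scores = [4.5, 2.0, 3.0, 1.0, 3.5]
annotations = [5, 2, 4, 1, 3]
rho = spearman(model_scores, annotations)
```

A rank correlation rewards a model for ordering documents by quality correctly even when its raw scores are on a different scale from the annotations.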
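[CV-22] MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model
[Quick Read]: This paper addresses the limited scientific reasoning ability in biomedical microscopy imaging, which stems from the scarcity of large-scale, high-quality training data. The key to the solution is a three-stage dataset generation and filtering pipeline. First, supervision is bootstrapped from expert-validated figure-caption pairs in peer-reviewed literature. Second, HiCQA-Graph, the first heterogeneous graph to jointly model images, captions, and question-answer pairs, fuses natural language inference (NLI), CLIP-based vision-language alignment, and agent signals to identify and remove inconsistent samples. Finally, a multimodal large language model (MLLM) generates multiple-choice questions (MCQs), which are then screened by humans for quality. The method combines automated construction of high-quality data with human refinement, which improves model performance on microscopy image understanding.
Link: https://arxiv.org/abs/2511.11407
Authors: Manyu Li, Ruian He, Chenxi Ma, Weimin Tan, Bo Yan
Affiliations: Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 4 figures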
Abstract:Multimodal Large Language Models are increasingly applied to biomedical imaging, yet scientific reasoning for microscopy remains limited by the scarcity of large-scale, high-quality training data. We introduce MicroVQA++, a three-stage, large-scale and high-quality microscopy VQA corpus derived from the BIOMEDICA archive. Stage one bootstraps supervision from expert-validated figure-caption pairs sourced from peer-reviewed articles. Stage two applies HiCQA-Graph, a novel heterogeneous graph over images, captions, and QAs that fuses NLI-based textual entailment, CLIP-based vision-language alignment, and agent signals to identify and filter inconsistent samples. Stage three uses a MultiModal Large Language Model (MLLM) agent to generate multiple-choice questions (MCQ) followed by human screening. The resulting release comprises a large training split and a human-checked test split whose Bloom’s level hard-sample distribution exceeds the MicroVQA benchmark. Our work delivers (i) a quality-controlled dataset that couples expert literature with graph-based filtering and human refinement; (ii) HiCQA-Graph, the first graph that jointly models (image, caption, QA) for cross-modal consistency filtering; (iii) evidence that careful data construction enables 4B-scale MLLMs to reach competitive microscopy reasoning performance (e.g., GPT-5) and achieve state-of-the-art performance among open-source MLLMs. Code and dataset will be released after the review process concludes.
[CV-23] Disentangling Emotional Bases and Transient Fluctuations: A Low-Rank Sparse Decomposition Approach for Video Affective Analysis
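Quick read: The paper addresses model instability and representational degradation in Video-based Affective Computing (VAC) caused by complex emotional dynamics. The core challenge is that existing methods lack a hierarchical mechanism for disentangling affective components, so they cannot separate the long-term emotional tone (emotional bases) from short-term emotional changes (transient fluctuations). The key solution is the Low-Rank Sparse Emotion Understanding Framework (LSEF), which rests on the premise that affective dynamics can be modeled as a hierarchical low-rank sparse compositional process. LSEF has three modules: the Stability Encoding Module (SEM) captures low-rank emotional bases, the Dynamic Decoupling Module (DDM) isolates sparse transient signals, and the Consistency Integration Module (CIM) reconstructs multi-scale stability and reactivity coherence. A Rank Aware Optimization (RAO) strategy adaptively balances gradient smoothness and sensitivity, which significantly improves robustness and dynamic discrimination.
Link: https://arxiv.org/abs/2511.11406
Authors: Feng-Qi Cui, Jinyang Huang, Ziyu Jia, Xinyu Li, Xin Yan, Xiaokang Zhou, Meng Wang
Affiliations: University of Science and Technology of China; Hefei University of Technology; Chinese Academy of Sciences; Cylingo Group; Kansai University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: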
Abstract:Video-based Affective Computing (VAC), vital for emotion analysis and human-computer interaction, suffers from model instability and representational degradation due to complex emotional dynamics. Since the meaning of different emotional fluctuations may differ under different emotional contexts, the core limitation is the lack of a hierarchical structural mechanism to disentangle distinct affective components, i.e., emotional bases (the long-term emotional tone), and transient fluctuations (the short-term emotional fluctuations). To address this, we propose the Low-Rank Sparse Emotion Understanding Framework (LSEF), a unified model grounded in the Low-Rank Sparse Principle, which theoretically reframes affective dynamics as a hierarchical low-rank sparse compositional process. LSEF employs three plug-and-play modules, i.e., the Stability Encoding Module (SEM) captures low-rank emotional bases; the Dynamic Decoupling Module (DDM) isolates sparse transient signals; and the Consistency Integration Module (CIM) reconstructs multi-scale stability and reactivity coherence. This framework is optimized by a Rank Aware Optimization (RAO) strategy that adaptively balances gradient smoothness and sensitivity. Extensive experiments across multiple datasets confirm that LSEF significantly enhances robustness and dynamic discrimination, which further validates the effectiveness and generality of hierarchical low-rank sparse modeling for understanding affective dynamics.
[CV-24] Unsupervised Segmentation of Micro-CT Scans of Polyurethane Structures By Combining Hidden-Markov-Random Fields and a U-Net
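Quick read: The paper addresses the accuracy and efficiency of extracting digital representations from material images. Traditional segmentation methods fall short in accuracy or speed; supervised convolutional neural networks (CNNs) perform well but depend on large labeled datasets; unsupervised methods suffer from long segmentation times and lower accuracy. The key solution combines Hidden Markov Random Field (HMRF) theory with a CNN segmentation framework into an unsupervised model trained with an HMRF loss (HMRF-UNet). By incorporating neighborhood terms and class-distribution constraints, it achieves accurate and fast segmentation without ground-truth labels. The authors also propose a pre-training strategy that considerably reduces the labeled data needed for subsequent fine-tuning.
Link: https://arxiv.org/abs/2511.11378
Authors: Julian Grolig, Lars Griem, Michael Selzer, Hans-Ulrich Kauczor, Simon M.F. Triphan, Britta Nestler, Arnd Koeppe
Affiliations: Karlsruhe Institute of Technology (KIT); Karlsruhe University of Applied Sciences; University Hospital of Heidelberg; German Center for Lung Research (DZL)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: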
Abstract:Extracting digital material representations from images is a necessary prerequisite for a quantitative analysis of material properties. Different segmentation approaches have been extensively studied in the past to achieve this task, but were often lacking accuracy or speed. With the advent of machine learning, supervised convolutional neural networks (CNNs) have achieved state-of-the-art performance for different segmentation tasks. However, these models are often trained in a supervised manner, which requires large labeled datasets. Unsupervised approaches do not require ground-truth data for learning, but suffer from long segmentation times and often worse segmentation accuracy. Hidden Markov Random Fields (HMRF) are an unsupervised segmentation approach that incorporates concepts of neighborhood and class distributions. We present a method that integrates HMRF theory and CNN segmentation, leveraging the advantages of both areas: unsupervised learning and fast segmentation times. We investigate the contribution of different neighborhood terms and components for the unsupervised HMRF loss. We demonstrate that the HMRF-UNet enables high segmentation accuracy without ground truth on a Micro-Computed Tomography (μCT) image dataset of Polyurethane (PU) foam structures. Finally, we propose and demonstrate a pre-training strategy that considerably reduces the required amount of ground-truth data when training a segmentation model.
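To make the idea of an HMRF-style loss concrete, here is a minimal sketch of an HMRF energy that combines a class-distribution term with a neighborhood term. It assumes Gaussian class likelihoods, a Potts penalty on the 4-neighborhood, and hard labels; the abstract does not specify the paper's actual loss, and every name below (`hmrf_energy`, `beta`, and so on) is hypothetical.

```python
import math

def hmrf_energy(image, labels, means, sigmas, beta=1.0):
    """Energy of a labeling under a simple HMRF: Gaussian data term + Potts prior.

    image, labels: 2D lists of equal shape; means/sigmas: per-class parameters.
    The Gaussian likelihood and 4-neighbourhood Potts prior are assumptions.
    """
    height, width = len(image), len(image[0])
    data_term = 0.0
    smooth_term = 0.0
    for y in range(height):
        for x in range(width):
            k = labels[y][x]
            # Negative log-likelihood of the pixel under its class Gaussian
            # (constant terms dropped).
            diff = image[y][x] - means[k]
            data_term += diff * diff / (2.0 * sigmas[k] ** 2) + math.log(sigmas[k])
            # Potts penalty: count each right/down neighbour pair once.
            if x + 1 < width and labels[y][x + 1] != k:
                smooth_term += 1.0
            if y + 1 < height and labels[y + 1][x] != k:
                smooth_term += 1.0
    return data_term + beta * smooth_term
```

A lower energy means the labeling agrees better with both the class intensity model and its neighbors. A network-based variant would presumably replace the hard labels with soft predictions so that the energy becomes differentiable.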
[CV-25] Free3D: 3D Human Motion Emerges from Single-View 2D Supervision
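Quick read: The paper addresses the difficulty current 3D human motion generation models have in generalizing beyond their training distributions. The cause is partly the reliance on precise 3D supervision, which leads models to overfit fixed coordinate patterns rather than learn the essential 3D structure and motion semantics. The key solution is Free3D, a framework trained entirely on 2D motion data. It introduces a Motion-Lifting Residual Quantized VAE (ML-RQ) that maps 2D motion sequences into a 3D-consistent latent space, together with regularization objectives that need no 3D annotations: view consistency, orientation coherence, and physical plausibility. Free3D generates diverse, temporally coherent, and semantically aligned 3D motions, with performance comparable to or better than fully 3D-supervised methods, which suggests that relaxing explicit 3D supervision improves structural reasoning and generalization.
Link: https://arxiv.org/abs/2511.11368
Authors: Sheng Liu, Yuanzhi Liang, Sidan Du
Affiliations: Nanjing University; TeleAI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: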
Abstract:Recent 3D human motion generation models demonstrate remarkable reconstruction accuracy yet struggle to generalize beyond training distributions. This limitation arises partly from the use of precise 3D supervision, which encourages models to fit fixed coordinate patterns instead of learning the essential 3D structure and motion semantic cues required for robust generalization. To overcome this limitation, we propose Free3D, a framework that synthesizes realistic 3D motions without any 3D motion annotations. Free3D introduces a Motion-Lifting Residual Quantized VAE (ML-RQ) that maps 2D motion sequences into 3D-consistent latent spaces, and a suite of 3D-free regularization objectives enforcing view consistency, orientation coherence, and physical plausibility. Trained entirely on 2D motion data, Free3D generates diverse, temporally coherent, and semantically aligned 3D motions, achieving performance comparable to or even surpassing fully 3D-supervised counterparts. These results suggest that relaxing explicit 3D supervision encourages stronger structural reasoning and generalization, offering a scalable and data-efficient paradigm for 3D motion generation.
[CV-26] YCB-Ev SD: Synthetic event-vision dataset for 6DoF object pose estimation
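Quick read: The paper addresses the lack of high-quality synthetic datasets for 6-degree-of-freedom (6DoF) object pose estimation with event cameras. Synthetic data is widely used in frame-based vision, but event-based vision has no comparably comprehensive resource. The key solution is YCB-Ev SD, a synthetic dataset of 50,000 event sequences, each 34 ms long, generated from Physically Based Rendering (PBR) scenes following the Benchmark for 6D Object Pose (BOP) methodology. The generation framework simulates linear camera motion to ensure complete scene coverage, including background activity. A systematic evaluation of event representations finds that time-surfaces with linear decay combined with dual-channel polarity encoding markedly improve CNN inference, outperforming exponential decay and single-channel alternatives. Further analysis shows that polarity information contributes most to the gains, while linear temporal encoding better preserves critical motion information.
Link: https://arxiv.org/abs/2511.11344
Authors: Pavel Rojtberg, Julius Kühn
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: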
Abstract:We introduce YCB-Ev SD, a synthetic dataset of event-camera data at standard definition (SD) resolution for 6DoF object pose estimation. While synthetic data has become fundamental in frame-based computer vision, event-based vision lacks comparable comprehensive resources. Addressing this gap, we present 50,000 event sequences of 34 ms duration each, synthesized from Physically Based Rendering (PBR) scenes of YCB-Video objects following the Benchmark for 6D Object Pose (BOP) methodology. Our generation framework employs simulated linear camera motion to ensure complete scene coverage, including background activity. Through systematic evaluation of event representations for CNN-based inference, we demonstrate that time-surfaces with linear decay and dual-channel polarity encoding achieve superior pose estimation performance, outperforming exponential decay and single-channel alternatives by significant margins. Our analysis reveals that polarity information contributes most substantially to performance gains, while linear temporal encoding preserves critical motion information more effectively than exponential decay. The dataset is provided in a structured format with both raw event streams and precomputed optimal representations to facilitate immediate research use and reproducible benchmarking. The dataset is publicly available at this https URL.
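The event representation favored in the abstract, a time-surface with linear decay and two polarity channels, can be sketched as follows. This is an illustrative construction, not the dataset's own code: the event tuple layout `(x, y, t, polarity)`, the function name, and the choice to keep the most recent event per pixel are all assumptions.

```python
def linear_time_surface(events, width, height, t_ref, window):
    """Build a 2-channel time surface with linear decay.

    events: iterable of (x, y, t, polarity) with polarity in {0, 1}.
    Channel 0 holds negative events, channel 1 positive events.
    (Hypothetical layout; times are in the same unit as `window`.)
    """
    surface = [[[0.0] * width for _ in range(height)] for _ in range(2)]
    for x, y, t, polarity in events:
        age = t_ref - t
        if age < 0 or age > window:
            continue  # Event lies outside the accumulation window.
        value = 1.0 - age / window  # Linear decay: 1 at t_ref, 0 at the window edge.
        # Keep the most recent (largest) value per pixel and polarity.
        if value > surface[polarity][y][x]:
            surface[polarity][y][x] = value
    return surface
```

With a 34 ms window, matching the sequence length quoted above, an event that fired 17 ms before the reference time contributes 0.5 to its polarity channel. An exponential variant would instead use `math.exp(-age / tau)`, which compresses older events toward zero faster.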
[CV-27] DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding
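Quick read: The paper addresses the difficulty of deploying Large Vision-Language Models (LVLMs) on resource-constrained edge devices because of their high memory footprint. The core solution is DocSLM, a small vision-language model designed for long-document understanding. Its key innovation is a Hierarchical Multimodal Compressor that jointly encodes the visual, textual, and layout information of each page into a fixed-length sequence, greatly reducing memory consumption while preserving local and global semantics. In addition, a Streaming Abstention mechanism processes document segments sequentially and filters low-confidence responses with an entropy-based uncertainty calibrator, enabling scalable processing of arbitrarily long inputs.
Link: https://arxiv.org/abs/2511.11313
Authors: Tanveer Hannan, Dimitrios Mallios, Parth Pathak, Faegheh Sardari, Thomas Seidl, Gedas Bertasius, Mohsen Fayyaz, Sunando Sengupta
Affiliations: LMU Munich; MCML; Microsoft; FAIR Meta; UNC Chapel Hill
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: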
Abstract:Large Vision-Language Models (LVLMs) have demonstrated strong multimodal reasoning capabilities on long and complex documents. However, their high memory footprint makes them impractical for deployment on resource-constrained edge devices. We present DocSLM, an efficient Small Vision-Language Model designed for long-document understanding under constrained memory resources. DocSLM incorporates a Hierarchical Multimodal Compressor that jointly encodes visual, textual, and layout information from each page into a fixed-length sequence, greatly reducing memory consumption while preserving both local and global semantics. To enable scalable processing over arbitrarily long inputs, we introduce a Streaming Abstention mechanism that operates on document segments sequentially and filters low-confidence responses using an entropy-based uncertainty calibrator. Across multiple long multimodal document benchmarks, DocSLM matches or surpasses state-of-the-art methods while using 82% fewer visual tokens, 75% fewer parameters, and 71% lower latency, delivering reliable multimodal document understanding on lightweight edge devices. Code is available in the supplementary material.
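The streaming abstention idea, answering segment by segment and discarding low-confidence responses based on entropy, might look roughly like the sketch below. It assumes each segment yields an answer plus a probability distribution over candidate answers; the normalization, the threshold value, and all function names are hypothetical and not taken from DocSLM.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of a distribution, scaled to [0, 1]."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs)) if len(probs) > 1 else 0.0

def stream_answers(segment_outputs, max_entropy=0.5):
    """Keep per-segment answers whose predictive entropy is low enough.

    segment_outputs: list of (answer, probs) pairs, one per document segment.
    The threshold and the data layout are hypothetical.
    """
    kept = []
    for answer, probs in segment_outputs:
        if normalized_entropy(probs) <= max_entropy:
            kept.append(answer)  # Confident segment: keep the response.
        # Otherwise abstain on this segment and move on.
    return kept
```

Because segments are handled one at a time, memory use stays bounded by the segment size regardless of document length, which is the property the abstract emphasizes.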
[CV-28] 6D Strawberry Pose Estimation: Real-time and Edge AI Solutions Using Purely Synthetic Training Data
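Quick read: The paper addresses 6D pose estimation for automated fruit harvesting, in particular the limited model performance caused by scarce training data for strawberries. The key solution is a synthetic data pipeline based on procedural Blender rendering, which makes the synthetic data more realistic to compensate for the lack of real annotated data, combined with the YOLOX-6D-Pose algorithm for single-shot pose estimation, which offers both high accuracy and edge deployability. Experiments show comparable accuracy on the NVIDIA RTX 3090 and the Jetson Orin Nano; the Jetson Orin Nano is particularly suited to resource-constrained agricultural robotics, and the framework could be extended to other fruits such as apples and peaches.
Link: https://arxiv.org/abs/2511.11307
Authors: Saptarshi Neil Sinha, Julius Kühn, Mika Silvan Goschke, Michael Weinmann
Affiliations: Fraunhofer IGD; Delft University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: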
Abstract:Automated and selective harvesting of fruits has become an important area of research, particularly due to challenges such as high costs and a shortage of seasonal labor in advanced economies. This paper focuses on 6D pose estimation of strawberries using purely synthetic data generated through a procedural pipeline for photorealistic rendering. We employ the YOLOX-6D-Pose algorithm, a single-shot approach that leverages the YOLOX backbone, known for its balance between speed and accuracy, and its support for edge inference. To address the lacking availability of training data, we introduce a robust and flexible pipeline for generating synthetic strawberry data from various 3D models via a procedural Blender pipeline, where we focus on enhancing the realism of the synthesized data in comparison to previous work to make it a valuable resource for training pose estimation algorithms. Quantitative evaluations indicate that our models achieve comparable accuracy on both the NVIDIA RTX 3090 and Jetson Orin Nano across several ADD-S metrics, with the RTX 3090 demonstrating superior processing speed. However, the Jetson Orin Nano is particularly suited for resource-constrained environments, making it an excellent choice for deployment in agricultural robotics. Qualitative assessments further confirm the model’s performance, demonstrating its capability to accurately infer the poses of ripe and partially ripe strawberries, while facing challenges in detecting unripe specimens. This suggests opportunities for future improvements, especially in enhancing detection capabilities for unripe strawberries (if desired) by exploring variations in color. Furthermore, the methodology presented could be adapted easily for other fruits such as apples, peaches, and plums, thereby expanding its applicability and impact in the field of agricultural automation.
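For readers unfamiliar with the ADD-S metric mentioned above, the following sketch computes it in its usual form: the mean distance from each model point under the ground-truth pose to the closest model point under the predicted pose. The function takes already-transformed point sets to stay short; that interface is an assumption made here for illustration.

```python
import math

def add_s(points_gt, points_pred):
    """ADD-S: mean distance from each ground-truth-posed model point
    to its closest predicted-posed model point."""
    total = 0.0
    for p in points_gt:
        total += min(math.dist(p, q) for q in points_pred)
    return total / len(points_gt)
```

Because it matches each point to its nearest neighbor rather than to its own counterpart, ADD-S does not penalize rotations about a symmetry axis, which suits near-symmetric objects such as fruit.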
[CV-29] MOON Embedding: Multimodal Representation Learning for E-commerce Search Advertising
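Quick read: The paper addresses sustainable iterative optimization of multimodal representation learning in e-commerce, with the aim of improving click-through rate (CTR) prediction. The core challenge is bridging the misalignment between the objectives of multimodal representation learning and those of downstream tasks while keeping model improvement continuous and scalable. The key solution is the MOON framework, which follows a three-stage "Pretraining, Post-training, and Application" paradigm and defines an "exchange rate" that quantifies how efficiently gains in an intermediate metric (such as image-based search recall) translate into downstream gains, thereby guiding optimization. In the Taobao search advertising system the approach delivered an overall +20.00% online CTR improvement. Over five full-scale iterations, MOON evolved along four dimensions: data processing, training strategy, model architecture, and downstream application. The paper also studies scaling laws for multimodal representation learning in e-commerce.
Link: https://arxiv.org/abs/2511.11305
Authors: Chenghan Fu, Daoze Zhang, Yukang Lin, Zhanheng Nie, Xiang Zhang, Jianyu Liu, Yueran Liu, Wanxian Guan, Pengjie Wang, Jian Xu, Bo Zheng
Affiliations: Alimama; Alibaba Group
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 31 pages, 12 figures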
Abstract:We introduce MOON, our comprehensive set of sustainable iterative practices for multimodal representation learning for e-commerce applications. MOON has already been fully deployed across all stages of Taobao search advertising system, including retrieval, relevance, ranking, and so on. The performance gains are particularly significant on click-through rate (CTR) prediction task, which achieves an overall +20.00% online CTR improvement. Over the past three years, this project has delivered the largest improvement on CTR prediction task and undergone five full-scale iterations. Throughout the exploration and iteration of our MOON, we have accumulated valuable insights and practical experience that we believe will benefit the research community. MOON contains a three-stage training paradigm of “Pretraining, Post-training, and Application”, allowing effective integration of multimodal representations with downstream tasks. Notably, to bridge the misalignment between the objectives of multimodal representation learning and downstream training, we define the exchange rate to quantify how effectively improvements in an intermediate metric can translate into downstream gains. Through this analysis, we identify the image-based search recall as a critical intermediate metric guiding the optimization of multimodal models. Over three years and five iterations, MOON has evolved along four critical dimensions: data processing, training strategy, model architecture, and downstream application. The lessons and insights gained through the iterative improvements will also be shared. As part of our exploration into scaling effects in the e-commerce field, we further conduct a systematic study of the scaling laws governing multimodal representation learning, examining multiple factors such as the number of training tokens, negative samples, and the length of user behavior sequences.
[CV-30] AUVIC: Adversarial Unlearning of Visual Concepts for Multi-modal Large Language Models AAAI2026
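Quick read: The paper addresses the data privacy concerns raised when the training data of Multimodal Large Language Models (MLLMs) contains sensitive or copyrighted content, focusing on unlearning visual concepts: precisely removing a target visual concept without resource-intensive retraining and without degrading performance on related, non-target concepts. The key solution is the AUVIC framework, which applies adversarial perturbations to isolate and remove the target visual concept precisely, achieving high target forgetting rates while keeping overall model performance stable.
Link: https://arxiv.org/abs/2511.11299
Authors: Haokun Chen, Jianing Li, Yao Zhang, Jinhe Bi, Yan Xia, Jindong Gu, Volker Tresp
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: AAAI 2026. Code: this https URL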
Abstract:Multimodal Large Language Models (MLLMs) achieve impressive performance once optimized on massive datasets. Such datasets often contain sensitive or copyrighted content, raising significant data privacy concerns. Regulatory frameworks mandating the ‘right to be forgotten’ drive the need for machine unlearning. This technique allows for the removal of target data without resource-consuming retraining. However, while well-studied for text, visual concept unlearning in MLLMs remains underexplored. A primary challenge is precisely removing a target visual concept without disrupting model performance on related entities. To address this, we introduce AUVIC, a novel visual concept unlearning framework for MLLMs. AUVIC applies adversarial perturbations to enable precise forgetting. This approach effectively isolates the target concept while avoiding unintended effects on similar entities. To evaluate our method, we construct VCUBench. It is the first benchmark designed to assess visual concept unlearning in group contexts. Experimental results demonstrate that AUVIC achieves state-of-the-art target forgetting rates while incurring minimal performance degradation on non-target concepts.
[CV-31] SimuFreeMark: A Noise-Simulation-Free Robust Watermarking Against Image Editing
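Quick read: The paper addresses the challenge facing image watermarking in the era of Generative AI: remaining robust against both conventional signal-processing attacks and novel semantic editing attacks. Existing deep learning methods train with hand-crafted noise simulation layers, which limits generalization to unforeseen distortions. The key solution is SimuFreeMark, a framework that dispenses with noise simulation during training. It exploits the inherent stability of low-frequency image components, embedding the watermark directly into the deep feature space of those components and using a pre-trained variational autoencoder (VAE) to bind the watermark to structurally stable image representations, which yields high robustness and strong visual quality.
Link: https://arxiv.org/abs/2511.11295
Authors: Yichao Tang, Mingyang Li, Di Miao, Sheng Li, Zhenxing Qian, Xinpeng Zhang
Affiliation: Fudan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: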
Abstract:The advancement of artificial intelligence generated content (AIGC) has created a pressing need for robust image watermarking that can withstand both conventional signal processing and novel semantic editing attacks. Current deep learning-based methods rely on training with hand-crafted noise simulation layers, which inherently limit their generalization to unforeseen distortions. In this work, we propose SimuFreeMark, a noise-simulation-free watermarking framework that circumvents this limitation by exploiting the inherent stability of image low-frequency components. We first systematically establish that low-frequency components exhibit significant robustness against a wide range of attacks. Building on this foundation, SimuFreeMark embeds watermarks directly into the deep feature space of the low-frequency components, leveraging a pre-trained variational autoencoder (VAE) to bind the watermark with structurally stable image representations. This design completely eliminates the need for noise simulation during training. Extensive experiments demonstrate that SimuFreeMark outperforms state-of-the-art methods across a wide range of conventional and semantic attacks, while maintaining superior visual quality.
[CV-32] RTGaze: Real-Time 3D-Aware Gaze Redirection from a Single Image AAAI2026
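Quick read: The paper addresses the limitations of existing gaze redirection methods in 3D consistency, efficiency, and image quality, which restrict their practical use. The key solution is RTGaze, a real-time, high-quality gaze redirection method. It learns a gaze-controllable facial representation from face images and gaze prompts, decodes that representation via neural rendering to redirect gaze, and distills face geometric priors from a pretrained 3D portrait generator to improve generation quality. With a feedforward architecture it runs at about 0.06 seconds per image, 800x faster than the previous state-of-the-art 3D-aware methods, while achieving state-of-the-art efficiency, redirection accuracy, and image quality on multiple datasets.
Link: https://arxiv.org/abs/2511.11289
Authors: Hengfei Wang, Zhongqun Zhang, Yihua Cheng, Hyung Jin Chang
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2026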
Abstract:Gaze redirection methods aim to generate realistic human face images with controllable eye movement. However, recent methods often struggle with 3D consistency, efficiency, or quality, limiting their practical applications. In this work, we propose RTGaze, a real-time and high-quality gaze redirection method. Our approach learns a gaze-controllable facial representation from face images and gaze prompts, then decodes this representation via neural rendering for gaze redirection. Additionally, we distill face geometric priors from a pretrained 3D portrait generator to enhance generation quality. We evaluate RTGaze both qualitatively and quantitatively, demonstrating state-of-the-art performance in efficiency, redirection accuracy, and image quality across multiple datasets. Our system achieves real-time, 3D-aware gaze redirection with a feedforward network (~0.06 sec/image), making it 800x faster than the previous state-of-the-art 3D-aware methods.
[CV-33] D-GAP: Improving Out-of-Domain Robustness via Dataset-Agnostic and Gradient-Guided Augmentation in Amplitude and Pixel Spaces
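Quick read: The paper addresses the drop in out-of-domain (OOD) robustness in real-world computer vision applications caused by shifts in image background, style, and acquisition instruments. Generic augmentations give inconsistent gains under domain shift, while dataset-specific augmentations require expert knowledge and generalize poorly. Neural networks also show a learning bias toward domain-specific frequency components; perturbing frequencies mitigates this but overlooks pixel-level detail, which limits performance. The key solution is D-GAP (Dataset-agnostic and Gradient-guided augmentation in Amplitude and Pixel spaces). It computes frequency-space sensitivity maps from task gradients and uses them to adaptively interpolate amplitudes between source and target samples, reducing the frequency-space learning bias. A complementary pixel-space blending step restores fine spatial detail, so the two spaces are optimized jointly and OOD robustness improves.
Link: https://arxiv.org/abs/2511.11286
Authors: Ruoqi Wang, Haitao Wang, Shaojie Guo, Qiong Luo
Affiliations: HKUST(GZ); SYSU; ECNU; HKUST
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: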
Abstract:Out-of-domain (OOD) robustness is challenging to achieve in real-world computer vision applications, where shifts in image background, style, and acquisition instruments always degrade model performance. Generic augmentations show inconsistent gains under such shifts, whereas dataset-specific augmentations require expert knowledge and prior analysis. Moreover, prior studies show that neural networks adapt poorly to domain shifts because they exhibit a learning bias to domain-specific frequency components. Perturbing frequency values can mitigate such bias but overlooks pixel-level details, leading to suboptimal performance. To address these problems, we propose D-GAP (Dataset-agnostic and Gradient-guided augmentation in Amplitude and Pixel spaces), improving OOD robustness by introducing targeted augmentation in both the amplitude space (frequency space) and pixel space. Unlike conventional handcrafted augmentations, D-GAP computes sensitivity maps in the frequency space from task gradients, which reflect how strongly the model responds to different frequency components, and uses the maps to adaptively interpolate amplitudes between source and target samples. This way, D-GAP reduces the learning bias in frequency space, while a complementary pixel-space blending procedure restores fine spatial details. Extensive experiments on four real-world datasets and three domain-adaptation benchmarks show that D-GAP consistently outperforms both generic and dataset-specific augmentations, improving average OOD performance by +5.3% on real-world datasets and +1.8% on benchmark datasets.
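The amplitude-space step described above can be illustrated on a 1D signal: transform source and target to the frequency domain, interpolate their amplitudes per frequency with a weight map, keep the source phase, and transform back. This is a simplified sketch and not D-GAP itself. The `weights` argument merely stands in for the gradient-derived sensitivity map, which would require a model and a task loss to compute, and the naive DFT is used only to keep the example dependency-free.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse DFT, returning only the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def mix_amplitudes(source, target, weights):
    """Interpolate per-frequency amplitudes of source toward target, keeping source phase.

    weights[k] in [0, 1] is a hypothetical stand-in for a sensitivity map.
    For an exactly real result, weights should satisfy weights[k] == weights[n - k].
    """
    spec_s, spec_t = dft(source), dft(target)
    mixed = []
    for s, t, w in zip(spec_s, spec_t, weights):
        amplitude = (1.0 - w) * abs(s) + w * abs(t)
        mixed.append(cmath.rect(amplitude, cmath.phase(s)))
    return idft(mixed)
```

Keeping the source phase preserves the spatial layout of the source sample, while the amplitude carries more of the style-like statistics. On images the same operation would apply to a 2D FFT per channel.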
[CV-34] Coordinative Learning with Ordinal and Relational Priors for Volumetric Medical Image Segmentation
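Quick read: The paper addresses the challenges that inherent anatomical structure and scarce annotations pose for volumetric medical image segmentation. Existing methods define positive and negative samples with hard binary thresholds, discarding continuous information about anatomical similarity, and they overlook the global directional consistency of anatomical progression across patients. The result is a distorted feature space that fails to capture the anatomical manifold shared across patients. The key solution is Coordinative Ordinal-Relational Anatomical Learning (CORAL). It uses a contrastive ranking objective to exploit continuous anatomical similarity, so that feature distances between slices are proportional to their anatomical position differences, and an ordinal objective to enforce global directional consistency, aligning the learned feature distribution with the anatomical progression that is consistent across patients. With this coordinative framework CORAL achieves state-of-the-art segmentation under limited annotation and learns representations with meaningful anatomical structure.
Link: https://arxiv.org/abs/2511.11276
Authors: Haoyi Wang
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: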
Abstract:Volumetric medical image segmentation presents unique challenges due to the inherent anatomical structure and limited availability of annotations. While recent methods have shown promise by contrasting spatial relationships between slices, they rely on hard binary thresholds to define positive and negative samples, thereby discarding valuable continuous information about anatomical similarity. Moreover, these methods overlook the global directional consistency of anatomical progression, resulting in distorted feature spaces that fail to capture the canonical anatomical manifold shared across patients. To address these limitations, we propose Coordinative Ordinal-Relational Anatomical Learning (CORAL) to capture both local and global structure in volumetric images. First, CORAL employs a contrastive ranking objective to leverage continuous anatomical similarity, ensuring relational feature distances between slices are proportional to their anatomical position differences. In addition, CORAL incorporates an ordinal objective to enforce global directional consistency, aligning the learned feature distribution with the canonical anatomical progression across patients. Learning these inter-slice relationships produces anatomically informed representations that benefit the downstream segmentation task. Through this coordinative learning framework, CORAL achieves state-of-the-art performance on benchmark datasets under limited-annotation settings while learning representations with meaningful anatomical structure. Code is available at this https URL.
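One way to picture a ranking objective over slices is a triplet-style hinge: whenever slice j is anatomically closer to an anchor slice i than slice k is, the feature distance to j should be smaller than the distance to k. The sketch below implements that simplified stand-in. It is not CORAL's actual loss, which the abstract does not spell out, and the margin and function name are hypothetical.

```python
import math

def ranking_loss(features, positions, margin=0.1):
    """Hinge penalty whenever feature distances disagree with the ordering
    of anatomical position differences (a simplified stand-in objective)."""
    n = len(features)
    loss, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if len({i, j, k}) < 3:
                    continue
                # Slice j is anatomically closer to the anchor i than slice k is.
                if abs(positions[i] - positions[j]) < abs(positions[i] - positions[k]):
                    d_near = math.dist(features[i], features[j])
                    d_far = math.dist(features[i], features[k])
                    loss += max(0.0, d_near - d_far + margin)
                    count += 1
    return loss / count if count else 0.0
```

Unlike a hard positive/negative threshold, this uses the full ordering of position differences, which is the continuous information the abstract says earlier methods discard.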
[CV-35] Φeat: Physically-Grounded Feature Representation
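Quick read: The paper addresses the entanglement of high-level semantics with low-level physical factors (such as geometry and illumination) in current self-supervised visual features, which limits their use in tasks requiring explicit physical reasoning. The key solution is Φeat, a physically-grounded visual backbone. Its core idea is a purely self-supervised pretraining strategy that contrasts spatial crops and physical augmentations of the same material under varying shapes and lighting conditions, guiding the model toward representations sensitive to material identity, including reflectance cues and geometric mesostructure. Without explicit labels, this yields a strong prior for features invariant to external physical factors and captures physical structure beyond semantic grouping.
Link: https://arxiv.org/abs/2511.11270
Authors: Giuseppe Vecchio, Adrien Kaiser, Rouffet Romain, Rosalie Martin, Elena Garces, Tamy Boubekeur
Affiliation: Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: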
Abstract:Foundation models have emerged as effective backbones for many vision tasks. However, current self-supervised features entangle high-level semantics with low-level physical factors, such as geometry and illumination, hindering their use in tasks requiring explicit physical reasoning. In this paper, we introduce Φeat, a novel physically-grounded visual backbone that encourages a representation sensitive to material identity, including reflectance cues and geometric mesostructure. Our key idea is to employ a pretraining strategy that contrasts spatial crops and physical augmentations of the same material under varying shapes and lighting conditions. While similar data have been used in high-end supervised tasks such as intrinsic decomposition or material estimation, we demonstrate that a pure self-supervised training strategy, without explicit labels, already provides a strong prior for tasks requiring robust features invariant to external physical factors. We evaluate the learned representations through feature similarity analysis and material selection, showing that Φeat captures physically-grounded structure beyond semantic grouping. These findings highlight the promise of unsupervised physical feature learning as a foundation for physics-aware perception in vision and graphics.
[CV-36] GraphPilot: Grounded Scene Graph Conditioning for Language-Based Autonomous Driving
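Quick read: The paper addresses the lack of explicit structured relational modeling in current Vision-Language Models (VLMs) used for autonomous driving planning. Existing models are typically trained without supervision that explicitly encodes the topology and dynamic interactions among agents in a traffic scene, which makes it hard for them to infer from raw sensor data how traffic participants influence one another. The key solution is a model-agnostic method that conditions language-based driving models on traffic scene graphs as structured relational context. The scene graphs are serialized and inserted through structured prompt templates, so the models are trained under relational priors. This significantly improves their understanding of spatial structure and dynamic interactions, and the gains persist even when no scene graph is supplied at test time.
Link: https://arxiv.org/abs/2511.11266
Authors: Fabian Schmidt, Markus Enzweiler, Abhinav Valada
Affiliations: Esslingen University of Applied Sciences; University of Freiburg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: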
Abstract:Vision-language models have recently emerged as promising planners for autonomous driving, where success hinges on topology-aware reasoning over spatial structure and dynamic interactions from multimodal input. However, existing models are typically trained without supervision that explicitly encodes these relational dependencies, limiting their ability to infer how agents and other traffic entities influence one another from raw sensor data. In this work, we bridge this gap with a novel model-agnostic method that conditions language-based driving models on structured relational context in the form of traffic scene graphs. We serialize scene graphs at various abstraction levels and formats, and incorporate them into the models via structured prompt templates, enabling a systematic analysis of when and how relational supervision is most beneficial. Extensive evaluations on the public LangAuto benchmark show that scene graph conditioning of state-of-the-art approaches yields large and persistent improvement in driving performance. Notably, we observe up to a 15.6% increase in driving score for LMDrive and 17.5% for BEVDriver, indicating that models can better internalize and ground relational priors through scene graph-conditioned training, even without requiring scene graph input at test-time. Code, fine-tuned models, and our scene graph dataset are publicly available at this https URL.
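Serializing a scene graph into a prompt, as the abstract describes, could be sketched along these lines. The node/edge schema, the relation wording, and the template are all hypothetical; the abstract says several abstraction levels and formats were tried but does not give them.

```python
def serialize_scene_graph(nodes, edges):
    """Turn a traffic scene graph into text lines for a prompt.

    nodes: dict id -> category; edges: list of (subject_id, relation, object_id).
    The schema and the wording of the template are hypothetical.
    """
    lines = ["Scene graph:"]
    for subject, relation, obj in edges:
        lines.append(f"- {nodes[subject]} ({subject}) {relation} {nodes[obj]} ({obj})")
    return "\n".join(lines)

def build_prompt(instruction, nodes, edges):
    """Prepend the serialized graph to a navigation instruction."""
    return f"{serialize_scene_graph(nodes, edges)}\nInstruction: {instruction}"
```

For example, a graph with one edge `("p1", "is in front of", "ego")` becomes the line `- pedestrian (p1) is in front of ego vehicle (ego)`, which a language-based driving model can read alongside its usual inputs during training.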
[CV-37] CountSteer: Steering Attention for Object Counting in Diffusion Models AAAI2026
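Quick read: The paper addresses the failure of text-to-image diffusion models to follow numerical instructions in text, a gap between language and visual representation. The key observation is that the model's internal signals shift consistently depending on whether the output meets the specified count, which indicates that the model implicitly encodes a latent notion of numerical correctness. Building on this, the authors propose CountSteer, a training-free method that steers cross-attention hidden states during inference to improve object-count accuracy. Experiments show an accuracy gain of about 4% without loss of image quality.
Link: https://arxiv.org/abs/2511.11253
Authors: Hyemin Boo, Hyoryung Kim, Myungjin Lee, Seunghyeon Lee, Jiyoung Lee, Jang-Hwan Choi, Hyunsoo Cho
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI 2026 Workshop on Shaping Responsible Synthetic Data in the Era of Foundation Models (RSD)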
Abstract:Text-to-image diffusion models generate realistic and coherent images but often fail to follow numerical instructions in text, revealing a gap between language and visual representation. Interestingly, we found that these models are not entirely blind to numbers: they are implicitly aware of their own counting accuracy, as their internal signals shift in consistent ways depending on whether the output meets the specified count. This observation suggests that the model already encodes a latent notion of numerical correctness, which can be harnessed to guide generation more precisely. Building on this intuition, we introduce CountSteer, a training-free method that improves generation of specified object counts by steering the model’s cross-attention hidden states during inference. In our experiments, CountSteer improved object-count accuracy by about 4% without compromising visual quality, demonstrating a simple yet effective step toward more controllable and semantically reliable text-to-image generation.
[CV-38] Toward Gaze Target Detection of Young Autistic Children AAAI2026
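Quick read: The paper addresses automatic gaze target detection for autistic children in real-world scenes, a foundational task for automated measurement of joint attention, which is a core challenge in Autism Spectrum Disorder (ASD). The authors propose a Socially Aware Coarse-to-Fine (SACF) gaze detection framework. Its key element is a two-pathway architecture with separate expert models for social gaze (such as faces) and non-social gaze, guided by a context-awareness gate module that explicitly uses the scene's social context to mitigate the class imbalance common in autism datasets, a consequence of autistic children's reduced gaze to faces. Experiments show state-of-the-art gaze target detection for this population, with notably better results than existing methods on the critical minority class of face-directed gaze.
Link: https://arxiv.org/abs/2511.11244
Authors: Shijian Deng, Erin E. Kosloski, Siva Sai Nagender Vasireddy, Jia Li, Randi Sierra Sherwood, Feroz Mohamed Hatha, Siddhi Patel, Pamela R Rollins, Yapeng Tian
Affiliation: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: AAAI 2026 Artificial Intelligence for Social Impact Track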
Abstract:The automatic detection of gaze targets in autistic children through artificial intelligence can be impactful, especially for those who lack access to a sufficient number of professionals to improve their quality of life. This paper introduces a new, real-world AI application for gaze target detection in autistic children, which predicts a child’s point of gaze from an activity image. This task is foundational for building automated systems that can measure joint attention, a core challenge in Autism Spectrum Disorder (ASD). To facilitate the study of this challenging application, we collected the first-ever Autism Gaze Target (AGT) dataset. We further propose a novel Socially Aware Coarse-to-Fine (SACF) gaze detection framework that explicitly leverages the social context of a scene to overcome the class imbalance common in autism datasets, a consequence of autistic children’s tendency to show reduced gaze to faces. It utilizes a two-pathway architecture with expert models specialized in social and non-social gaze, guided by a context-awareness gate module. The results of our comprehensive experiments demonstrate that our framework achieves new state-of-the-art performance for gaze target detection in this population, significantly outperforming existing methods, especially on the critical minority class of face-directed gaze.
[CV-39] Arcee: Differentiable Recurrent State Chain for Generative Vision Modeling with Mamba SSMs
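Quick read: This paper addresses a limitation of state-space model (SSM) vision variants on non-sequential signals such as images. In the conventional Mamba formulation, the selective-scan operation reinitializes the state-space dynamics at every block and discards the previous block's terminal state-space representation (SSR), so no memory is carried across blocks. The proposed Arcee mechanism builds a cross-block recurrent state chain: each block's terminal SSR becomes the initial condition of the next block, and the handoff is a differentiable boundary map so that gradients flow end to end across block boundaries. Arcee adds no parameters, incurs negligible cost, and is compatible with prior "vision-mamba" architectures. On CelebA-HQ image generation it reduces FID from 82.81 to 15.33, which supports the view of the terminal SSR as a lightweight directional prior.
Link: https://arxiv.org/abs/2511.11243
Authors: Jitesh Chavan,Rohit Lal,Anand Kamat,Mengjia Xu
Affiliations: New Jersey Institute of Technology; University of California, Riverside; AWS AI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: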
Abstract:State-space models (SSMs), Mamba in particular, are increasingly adopted for long-context sequence modeling, providing linear-time aggregation via an input-dependent, causal selective-scan operation. Along this line, recent “Mamba-for-vision” variants largely explore multiple scan orders to relax strict causality for non-sequential signals (e.g., images). Rather than preserving cross-block memory, the conventional formulation of the selective-scan operation in Mamba reinitializes each block’s state-space dynamics from zero, discarding the terminal state-space representation (SSR) from the previous block. Arcee, a cross-block recurrent state chain, reuses each block’s terminal state-space representation as the initial condition for the next block. Handoff across blocks is constructed as a differentiable boundary map whose Jacobian enables end-to-end gradient flow across terminal boundaries. Key to practicality, Arcee is compatible with all prior “vision-mamba” variants, parameter-free, and incurs constant, negligible cost. As a modeling perspective, we view terminal SSR as a mild directional prior induced by a causal pass over the input, rather than an estimator of the non-sequential signal itself. To quantify the impact, for unconditional generation on CelebA-HQ ($256 \times 256$) with Flow Matching, Arcee reduces FID$\downarrow$ from 82.81 to 15.33 ($5.4\times$ lower) on a single scan-order Zigzag Mamba baseline. Efficient CUDA kernels and training code will be released to support rigorous and reproducible research.
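The difference between reinitializing each block from zero and chaining terminal states can be illustrated with a toy example. This is a minimal sketch under strong assumptions: a fixed scalar linear recurrence stands in for Mamba's input-dependent selective scan, and the handoff between blocks is assumed to be the identity map. All names here are hypothetical and do not come from the Arcee code.

```python
# Toy illustration of a cross-block state chain; not the Arcee implementation.

def scan_block(xs, a, b, h0=0.0):
    """Run h_t = a * h_{t-1} + b * x_t over one block.

    Returns the per-step outputs and the terminal state.
    """
    h = h0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(h)
    return ys, h

def run_blocks(xs, params, chain_state):
    """Stack blocks; optionally pass each terminal state to the next block."""
    carried = 0.0
    for a, b in params:
        h0 = carried if chain_state else 0.0  # conventional scan restarts at 0
        xs, carried = scan_block(xs, a, b, h0=h0)
    return xs

params = [(0.5, 1.0), (0.5, 1.0)]  # two blocks with assumed coefficients
signal = [1.0, 0.0, 0.0]

baseline = run_blocks(signal, params, chain_state=False)
chained = run_blocks(signal, params, chain_state=True)
```

With chaining, the second block starts from the first block's terminal state (0.25 in this toy case), so its outputs differ from the zero-initialized baseline at every step.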
[CV-40] Beyond Flatlands: Unlocking Spatial Intelligence by Decoupling 3D Reasoning from Numerical Regression
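Quick read: This paper addresses a fundamental weakness of existing Vision Language Models (VLMs) in real-world 3D spatial intelligence. The authors trace it to a dual bottleneck: at the input stage, a conflict between computationally expensive geometry-aware encoders and 2D-only features; at the output stage, discrete tokenizers that are structurally unable to emit precise continuous values. The proposed GEODE (Geometric-Output and Decoupled-Input Engine) architecture decouples 3D reasoning from numerical generation through two plug-and-play modules. The Decoupled Rationale Module (DRM) acts as a spatial co-processor that aligns explicit 3D data with 2D visual features via cross-attention and distills spatial Chain-of-Thought (CoT) logic into injectable tokens. The Direct Regression Head (DRH) follows an "Embedding-as-Value" paradigm and uses a lightweight MLP to regress scalars and 3D bounding boxes directly. With this design, a 1.5B-parameter model reaches spatial reasoning performance that rivals 7B+ models.
Link: https://arxiv.org/abs/2511.11239
Authors: Zhongbin Guo,Jiahe Liu,Yushan Li,Wenyu Gao,Zhen Yang,Chenzhi Li,Xinyue Zhang,Ping Jian
Affiliations: Beijing Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: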
Abstract:Existing Vision Language Models (VLMs), architecturally rooted in “flatland” perception, fundamentally struggle to comprehend real-world 3D spatial intelligence. This failure stems from a dual bottleneck: an input-stage conflict between computationally exorbitant geometric-aware encoders and superficial 2D-only features, and an output-stage misalignment where discrete tokenizers are structurally incapable of producing precise, continuous numerical values. To break this impasse, we introduce GEODE (Geometric-Output and Decoupled-Input Engine), a novel architecture that resolves this dual bottleneck by decoupling 3D reasoning from numerical generation. GEODE augments the main VLM with two specialized, plug-and-play modules: a Decoupled Rationale Module (DRM) that acts as a spatial co-processor, aligning explicit 3D data with 2D visual features via cross-attention and distilling spatial Chain-of-Thought (CoT) logic into injectable Rationale Tokens; and a Direct Regression Head (DRH), an “Embedding-as-Value” paradigm that routes specialized control tokens to a lightweight MLP for precise, continuous regression of scalars and 3D bounding boxes. The synergy of these modules allows our 1.5B parameter model to function as a high-level semantic dispatcher, achieving state-of-the-art spatial reasoning performance that rivals 7B+ models.
[CV-41] Parameter-Efficient MoE LoRA for Few-Shot Multi-Style Editing
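Quick read: This paper addresses the poor results of general image editing models on new styles, especially when only a small amount of paired data is available for fine-tuning. The authors propose a parameter-efficient multi-style Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) framework with two routing mechanisms: style-specific routing keeps different styles from interfering with one another, and style-shared routing adaptively allocates shared MoE LoRAs to learn common patterns. A metric-guided strategy automatically determines the rank for each layer, the insertion position of LoRA inside the Diffusion in Transformer (DiT) model is optimized, and adversarial learning and flow matching guide diffusion training. The method outperforms existing state-of-the-art approaches with significantly fewer LoRA parameters.
Link: https://arxiv.org/abs/2511.11236
Authors: Cong Cao,Yujie Xu,Xiaodong Xu
Affiliations: SenseTime Group
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: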
Abstract:In recent years, image editing has garnered growing attention. However, general image editing models often fail to produce satisfactory results when confronted with new styles. The challenge lies in how to effectively fine-tune general image editing models to new styles using only a limited amount of paired data. To address this issue, this paper proposes a novel few-shot style editing framework. For this task, we construct a benchmark dataset that encompasses five distinct styles. Correspondingly, we propose a parameter-efficient multi-style Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) with style-specific and style-shared routing mechanisms for jointly fine-tuning multiple styles. The style-specific routing ensures that different styles do not interfere with one another, while the style-shared routing adaptively allocates shared MoE LoRAs to learn common patterns. Our MoE LoRA can automatically determine the optimal ranks for each layer through a novel metric-guided approach that estimates the importance score of each single-rank component. Additionally, we explore the optimal location to insert LoRA within the Diffusion in Transformer (DiT) model and integrate adversarial learning and flow matching to guide the diffusion training process. Experimental results demonstrate that our proposed method outperforms existing state-of-the-art approaches with significantly fewer LoRA parameters.
[CV-42] DoReMi: A Domain-Representation Mixture Framework for Generalizable 3D Understanding
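Quick read: This paper addresses the limited cross-domain generalization of 3D deep learning. The main obstacles are the small scale of existing datasets and the strong heterogeneity of multi-source point clouds (for example, LiDAR scans versus mesh-derived point clouds) in density and noise distribution, which causes negative transfer during multi-domain fusion. The proposed DoReMi (Domain-Representation Mixture) framework uses a Mixture-of-Experts (MoE) structure with two branches: a Domain-aware Experts branch that captures domain-specific features, and a unified Representation branch that is pretrained to preserve cross-domain geometric and structural priors. Domain-Guided Spatial Routing (DSR) dynamically activates experts for context-aware expert selection, and Entropy-Controlled Dynamic Allocation (EDA) keeps expert utilization stable and efficient. Together they model the distributions of different domains adaptively while learning specialized and general knowledge cooperatively.
Link: https://arxiv.org/abs/2511.11232
Authors: Mingwei Xing,Xinliang Wang,Yifeng Shi
Affiliations: Ke Holdings Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: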
Abstract:The generalization of 3D deep learning across multiple domains remains constrained by the limited scale of existing datasets and the high heterogeneity of multi-source point clouds. Point clouds collected from different sensors (e.g., LiDAR scans and mesh-derived point clouds) exhibit substantial discrepancies in density and noise distribution, resulting in negative transfer during multi-domain fusion. Most existing approaches focus exclusively on either domain-aware or domain-general features, overlooking the potential synergy between them. To address this, we propose DoReMi (Domain-Representation Mixture), a Mixture-of-Experts (MoE) framework that jointly models a Domain-aware Experts branch and a unified Representation branch to enable cooperative learning between specialized and generalizable knowledge. DoReMi dynamically activates the domain-aware expert branch via Domain-Guided Spatial Routing (DSR) for context-aware expert selection and employs Entropy-Controlled Dynamic Allocation (EDA) for stable and efficient expert utilization, thereby adaptively modeling diverse domain distributions. Complemented by a frozen unified representation branch pretrained through robust multi-attribute self-supervised learning, DoReMi preserves cross-domain geometric and structural priors while maintaining global consistency. We evaluate DoReMi across multiple 3D understanding benchmarks. Notably, DoReMi achieves 80.1% mIoU on ScanNet Val and 77.2% mIoU on S3DIS, demonstrating competitive or superior performance compared to existing approaches, and showing strong potential as a foundation framework for future 3D understanding research. The code will be released soon.
[CV-43] 3D Gaussian and Diffusion-Based Gaze Redirection
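Quick read: This paper addresses high-fidelity gaze redirection, which is used to generate synthetic training data that improves the generalization of gaze estimators. Existing 3D Gaussian Splatting (3DGS) methods such as GazeGaussian can struggle to render subtle, continuous gaze shifts. The proposed DiT-Gaze framework combines a Diffusion Transformer (DiT), weak supervision across gaze angles, and an orthogonality constraint loss. The DiT improves image synthesis quality, the weak supervision uses synthetically generated intermediate gaze angles to build a smooth manifold of gaze directions, and the orthogonality constraint loss enforces separation of the internal representations for gaze, head pose, and expression. The result is more accurate and more natural gaze redirection.
Link: https://arxiv.org/abs/2511.11231
Authors: Abiram Panchalingam,Indu Bodala,Stuart Middleton
Affiliations: University of Southampton
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: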
Abstract:High-fidelity gaze redirection is critical for generating augmented data to improve the generalization of gaze estimators. 3D Gaussian Splatting (3DGS) models like GazeGaussian represent the state-of-the-art but can struggle with rendering subtle, continuous gaze shifts. In this paper, we propose DiT-Gaze, a framework that enhances 3D gaze redirection models using a novel combination of Diffusion Transformer (DiT), weak supervision across gaze angles, and an orthogonality constraint loss. DiT allows higher-fidelity image synthesis, while our weak supervision strategy using synthetically generated intermediate gaze angles provides a smooth manifold of gaze directions during training. The orthogonality constraint loss mathematically enforces the disentanglement of internal representations for gaze, head pose, and expression. Comprehensive experiments show that DiT-Gaze sets a new state-of-the-art in both perceptual quality and redirection accuracy, reducing the state-of-the-art gaze error by 4.1% to 6.353 degrees, providing a superior method for creating synthetic training data. Our code and models will be made available for the research community to benchmark against.
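The abstract does not give the form of the orthogonality constraint loss. As a hedged illustration only, one simple way to penalize entanglement is to sum the squared cosine similarities between the three embeddings, which is zero exactly when they are mutually orthogonal. The function names and toy vectors below are hypothetical, and the sketch assumes non-zero embedding vectors.

```python
# Hypothetical orthogonality penalty; not the loss used in DiT-Gaze.
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def orthogonality_penalty(gaze, head_pose, expression):
    """Sum of squared pairwise cosine similarities (0 when all are orthogonal)."""
    pairs = [(gaze, head_pose), (gaze, expression), (head_pose, expression)]
    return sum(cosine(u, v) ** 2 for u, v in pairs)

# Mutually orthogonal embeddings incur no penalty.
disentangled = orthogonality_penalty([1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0])
# Gaze and head pose sharing a direction are penalized.
entangled = orthogonality_penalty([1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 3.0])
```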
[CV-44] Positional Bias in Multimodal Embedding Models: Do They Favor the Beginning the Middle or the End? AAAI2026
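Quick read: This paper studies positional bias in multimodal representation models, that is, over-reliance on particular positions of the input when processing image-text pairs, which hurts tasks such as image-text retrieval. Its contribution is to identify and quantify this bias systematically and to show that it differs by modality in text and image encoders. The authors also examine its causes: the positional encoding scheme, the training loss, context importance, and the use of image-text pairs in multimodal training jointly produce or amplify the bias.
Link: https://arxiv.org/abs/2511.11216
Authors: Kebin Wu,Fatima Albreiki
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted to AAAI 2026 main track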
Abstract:Positional bias - where models overemphasize certain positions regardless of content - has been shown to negatively impact model performance across various tasks. While recent research has extensively examined positional bias in text generation models, its presence and effects in representation models remain underexplored. Even less is known about such biases in multimodal models. In this work, we investigate positional bias in multimodal representation models, specifically in the context of image-text retrieval. We begin by distinguishing between context importance and positional bias, and then assess the presence and extent of positional bias across different models and datasets. Our experiments demonstrate that positional bias is prevalent in multimodal models, but manifests differently across modalities: text encoders tend to exhibit bias toward the beginning of the input, whereas image encoders show bias at both the beginning and end. Furthermore, we find that this bias arises from, or is amplified by, a combination of factors, including the positional encoding scheme, training loss, context importance, and the nature of using image-text pairs in multimodal training.
[CV-45] RealisticDreamer: Guidance Score Distillation for Few-shot Gaussian Splatting
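Quick read: This paper addresses overfitting of 3D Gaussian Splatting (3DGS) when training views are sparse, which is caused mainly by the lack of supervision for intermediate views. The proposed Guidance Score Distillation (GSD) framework extracts multi-view consistency priors from pretrained Video Diffusion Models (VDMs) to guide 3DGS optimization. Following the idea of Score Distillation Sampling (SDS), GSD supervises rendered images from multiple neighboring views and pushes the Gaussian representation toward the generative direction of the VDM. Because that direction often contains object motion and random camera trajectories, the authors add a unified guidance form that combines depth warp guidance based on real depth maps with guidance based on semantic image features. This keeps the update direction derived from the VDM's noise prediction consistent with the correct camera pose and geometry, and it improves reconstruction quality.
Link: https://arxiv.org/abs/2511.11213
Authors: Ruocheng Wu,Haolan He,Yufei Wang,Zhihao Li,Bihan Wen
Affiliations: University of Electronic Science and Technology of China; Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: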
Abstract:3D Gaussian Splatting (3DGS) has recently gained great attention in 3D scene representation for its high-quality real-time rendering capabilities. However, when the input comprises sparse training views, 3DGS is prone to overfitting, primarily due to the lack of intermediate-view supervision. Inspired by the recent success of Video Diffusion Models (VDM), we propose a framework called Guidance Score Distillation (GSD) to extract the rich multi-view consistency priors from pretrained VDMs. Building on the insights from Score Distillation Sampling (SDS), GSD supervises rendered images from multiple neighboring views, guiding the Gaussian splatting representation towards the generative direction of VDM. However, the generative direction often involves object motion and random camera trajectories, making direct supervision challenging in the optimization process. To address this problem, we introduce a unified guidance form to correct the noise prediction result of VDM. Specifically, we incorporate both a depth warp guidance based on real depth maps and a guidance based on semantic image features, ensuring that the score update direction from VDM aligns with the correct camera pose and accurate geometry. Experimental results show that our method outperforms existing approaches across multiple datasets.
[CV-46] MAFM3: Modular Adaptation of Foundation Models for Multi-Modal Medical AI
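Quick read: This paper addresses the difficulty of pre-training a separate foundation model for every domain, modality, or task in medical imaging, where data is scarce. Conventional practice builds isolated models per task, which wastes resources and scales poorly. The proposed MAFM$^3$ (Modular Adaptation of Foundation Models for Multi-Modal Medical AI) framework lets a single foundation model adapt to many tasks and modalities through lightweight modular components. These modules act as specialized skill sets that are activated at inference time according to the input type or clinical objective, which enables efficient multitask and multimodality adaptation beyond the model's initial training scope.
Link: https://arxiv.org/abs/2511.11212
Authors: Mohammad Areeb Qazi,Munachiso S Nwadike,Ibrahim Almakky,Mohammad Yaqub,Numan Saeed
Affiliations: Mohamed bin Zayed University of Artificial Intelligence
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2 figures, 3 tables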
Abstract:Foundation models are trained on extensive datasets to capture the general trends of a domain. However, in medical imaging, the scarcity of data makes pre-training for every domain, modality, or task challenging. Instead of building separate models, we propose MAFM$^3$ (Modular Adaptation of Foundation Models for Multi-Modal Medical AI), a framework that enables a single foundation model to expand into diverse domains, tasks, and modalities through lightweight modular components. These components serve as specialized skill sets that allow the system to flexibly activate the appropriate capability at inference time, depending on the input type or clinical objective. Unlike conventional adaptation methods that treat each new task or modality in isolation, MAFM$^3$ provides a unified and expandable framework for efficient multitask and multimodality adaptation. Empirically, we validate our approach by adapting a chest CT foundation model initially trained for classification into prognosis and segmentation modules. Our results show improved performance on both tasks. Furthermore, by incorporating PET scans, MAFM$^3$ improved the Dice score by 5% compared to the respective baselines. These findings establish that foundation models, when equipped with modular components, are not inherently constrained to their initial training scope but can evolve into multitask, multimodality systems for medical imaging. The code implementation of this work can be found at this https URL
[CV-47] One-to-N Backdoor Attack in 3D Point Cloud via Spherical Trigger
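Quick read: This paper addresses the restriction of existing backdoor attacks in 3D vision to a rigid one-to-one paradigm, in which an attack targets a single class and cannot model more complex real-world threats. The authors present the first one-to-N backdoor framework for 3D vision, built on a configurable spherical trigger: the spatial properties of a sphere serve as a parameter space, so one trigger design can encode multiple target classes. Theoretical analysis and experiments show that poisoned models map different trigger configurations to different target labels, with attack success rates of up to 100% across multiple datasets and architectures while accuracy on clean samples is maintained. The work provides a benchmark for multi-target threat assessment in 3D vision and a basis for future defense research.
Link: https://arxiv.org/abs/2511.11210
Authors: Dongmei Shan,Wei Lian,Chongxia Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 4 figures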
Abstract:Backdoor attacks represent a critical threat to deep learning systems, particularly in safety-sensitive 3D domains such as autonomous driving and robotics. However, existing backdoor attacks for 3D point clouds have been limited to a rigid one-to-one paradigm. To address this, we present the first one-to-N backdoor framework for 3D vision, based on a novel, configurable spherical trigger. Our key insight is to leverage the spatial properties of spheres as a parameter space, allowing a single trigger design to encode multiple target classes. We establish a theoretical foundation for one-to-N backdoor attacks in 3D, demonstrating that poisoned models can map distinct trigger configurations to different target labels. Experimental results systematically validate this conclusion across multiple datasets and model architectures, achieving high attack success rates (up to 100%) while maintaining accuracy on clean data. This work establishes a crucial benchmark for multi-target threats in 3D vision and provides the foundational understanding needed to secure future 3D-driven intelligent systems.
[CV-48] Questioning the Stability of Visual Question Answering
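Quick read: This paper examines how reliable current Visual Language Models (VLMs) are under small, meaning-preserving input perturbations such as pixel-level shifts, light geometric transformations, and text paraphrases. It finds that even state-of-the-art VLMs (for example, GPT-4o and Gemini 2.0 Flash) are highly sensitive to these benign changes and give unstable predictions. The key finding is a strong link between sample-level stability and correctness: samples with stable predictions are far more likely to be answered correctly. This property allows the stability patterns of small open-source models to predict, with high precision, whether much larger closed-source models answer correctly, which offers a new route for evaluating and improving VLM robustness.
Link: https://arxiv.org/abs/2511.11206
Authors: Amir Rosenfeld,Neta Glazer,Ethan Fetaya
Affiliations: Bar-Ilan University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: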
Abstract:Visual Language Models (VLMs) have achieved remarkable progress, yet their reliability under small, meaning-preserving input changes remains poorly understood. We present the first large-scale, systematic study of VLM robustness to benign visual and textual perturbations: pixel-level shifts, light geometric transformations, padded rescaling, paraphrasing, and multilingual rewrites that do not alter the underlying semantics of an image-question pair. Across a broad set of models and datasets, we find that modern VLMs are highly sensitive to such minor perturbations: a substantial fraction of samples change their predicted answer under at least one visual or textual modification. We characterize how this instability varies across perturbation types, question categories, and models, revealing that even state-of-the-art systems (e.g., GPT-4o, Gemini 2.0 Flash) frequently fail under shifts as small as a few pixels or harmless rephrasings. We further show that sample-level stability serves as a strong indicator of correctness: stable samples are consistently far more likely to be answered correctly. Leveraging this, we demonstrate that the stability patterns of small, accessible open-source models can be used to predict the correctness of much larger closed-source models with high precision. Our findings expose a fundamental fragility in current VLMs and highlight the need for robustness evaluations that go beyond adversarial perturbations, focusing instead on invariances that models should reliably uphold.
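The notion of sample-level stability described above can be made concrete with a small sketch. It assumes a sample counts as stable when every perturbed variant yields the same answer as the unperturbed input, and that answers are compared after trimming whitespace and lowercasing. Both assumptions, the function names, and the toy data are illustrative and are not the paper's evaluation protocol.

```python
# Illustrative stability measurement; not the paper's evaluation code.

def normalize(answer):
    """Assumed answer normalization: trim whitespace and lowercase."""
    return answer.strip().lower()

def is_stable(original_answer, perturbed_answers):
    """True if no perturbation changes the predicted answer."""
    reference = normalize(original_answer)
    return all(normalize(a) == reference for a in perturbed_answers)

def unstable_fraction(samples):
    """Share of samples whose answer flips under at least one perturbation."""
    flipped = sum(1 for original, perturbed in samples
                  if not is_stable(original, perturbed))
    return flipped / len(samples)

# Each entry: (answer on the original input, answers on perturbed variants).
samples = [
    ("cat", ["cat", "Cat "]),
    ("two", ["two", "three"]),
    ("red", ["red"]),
    ("yes", ["no", "yes"]),
]
rate = unstable_fraction(samples)
```

The per-sample `is_stable` flag is the quantity one would then correlate with correctness, for example by comparing accuracy on stable and unstable subsets.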
[CV-49] Geospatial Chain of Thought Reasoning for Enhanced Visual Question Answering on Satellite Imagery
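Quick read: This paper addresses the lack of structured geospatial chain of thought (CoT) reasoning in current Visual Question Answering (VQA) models for satellite imagery, which limits their reliability and interpretability in climate-related applications such as disaster monitoring and infrastructure risk assessment. The authors propose a VQA framework that combines CoT reasoning with Direct Preference Optimization (DPO). Generating intermediate reasoning steps improves the model's handling of detection, classification, spatial relations, and comparative analysis, and DPO further improves accuracy and reasoning quality. The result is more robust and interpretable decision support for multispectral Earth observation.
Link: https://arxiv.org/abs/2511.11198
Authors: Shambhavi Shanker,Manikandan Padmanaban,Jagabondhu Hazra
Affiliations: Indian Institute of Technology Bombay; IBM Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: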
Abstract:Geospatial chain of thought (CoT) reasoning is essential for advancing Visual Question Answering (VQA) on satellite imagery, particularly in climate related applications such as disaster monitoring, infrastructure risk assessment, urban resilience planning, and policy support. Existing VQA models enable scalable interpretation of remote sensing data but often lack the structured reasoning required for complex geospatial queries. We propose a VQA framework that integrates CoT reasoning with Direct Preference Optimization (DPO) to improve interpretability, robustness, and accuracy. By generating intermediate rationales, the model better handles tasks involving detection, classification, spatial relations, and comparative analysis, which are critical for reliable decision support in high stakes climate domains. Experiments show that CoT supervision improves accuracy by 34.9% over direct baselines, while DPO yields additional gains in accuracy and reasoning quality. The resulting system advances VQA for multispectral Earth observation by enabling richer geospatial reasoning and more effective climate use cases.
[CV-50] Computationally-efficient deep learning models for nowcasting of precipitation: A solution for the Weather4cast 2025 challenge
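Quick read: This paper addresses short-term rainfall prediction for the Weather4Cast 2025 competition, covering both cumulative rainfall and precipitation events. The solution is a transfer-learning framework built on Convolutional Gated Recurrent Units (ConvGRU). The input is a single SEVIRI infrared channel (10.8 μm wavelength) with four observations over a one-hour period, and forecasts extend up to four hours ahead. Training has two stages: first, the ConvGRU learns the spatiotemporal evolution of brightness temperature; second, an empirically derived nonlinear transformation maps the predicted brightness temperatures to OPERA-compatible rainfall rates. For event prediction, 3D event detection followed by spatiotemporal feature extraction identifies and characterizes precipitation events. The submission placed second in the cumulative rainfall task, and the same model, used without modification, scored on par with the baseline in the event prediction task.
Link: https://arxiv.org/abs/2511.11197
Authors: Anushree Bhuskute,Kaushik Gopalan,Jeet Shah
Affiliations: Flame University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: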
Abstract:This study presents a transfer-learning framework based on Convolutional Gated Recurrent Units (ConvGRU) for short-term rainfall prediction in the Weather4Cast 2025 competition. A single SEVIRI infrared channel (10.8 $\mu$m wavelength) is used as input, which consists of four observations over a one-hour period. A two-stage training strategy is applied to generate rainfall estimates up to four hours ahead. In the first stage, ConvGRU is trained to forecast the brightness temperatures from SEVIRI, enabling the model to capture relevant spatiotemporal patterns. In the second stage, an empirically derived nonlinear transformation maps the predicted fields to OPERA-compatible rainfall rates. For the event-prediction task, the transformed rainfall forecasts are processed using 3D event detection followed by spatiotemporal feature extraction to identify and characterize precipitation events. Our submission achieved 2nd place in the cumulative rainfall task. Further, the same model was used out-of-the-box for the event prediction task and achieved scores similar to those of the competition's baseline model.
[CV-51] A Comparison of Lightweight Deep Learning Models for Particulate-Matter Nowcasting in the Indian Subcontinent Surrounding Regions
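Quick read: This paper addresses 6-hour nowcasting of PM$_1$, PM$_{2.5}$, and PM$_{10}$ over the Indian subcontinent and surrounding regions, where the challenge is to obtain accurate, low-bias, fast-inference air quality predictions on a limited spatial domain. The authors build three lightweight, parameter-specific deep learning architectures that take Copernicus Atmosphere Monitoring Service (CAMS) Global Atmospheric Composition Forecasts at 0.4° resolution as input. A 256×256 region is used as model input and predictions are produced for the central 128×128 area, which keeps the forecast India-centric while retaining large-scale meteorological context. The models are trained on 2021–2023 data and evaluated independently on 2024 data. They show substantial gains over the Aurora foundation model on RMSE, MAE, Bias, and SSIM, which highlights the value of compact specialized models for short-range forecasts on small domains.
Link: https://arxiv.org/abs/2511.11185
Authors: Ansh Kushwaha,Kaushik Gopalan
Affiliations: FLAME University; Centre for Interdisciplinary Artificial Intelligence (CAI); School of Computing and Data Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: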
Abstract:This paper is a submission for the Weather4Cast 2025 complementary Pollution Task and presents an efficient framework for 6-hour lead-time nowcasting of PM$_1$, PM$_{2.5}$, and PM$_{10}$ across the Indian subcontinent and surrounding regions. The proposed approach leverages analysis fields from the Copernicus Atmosphere Monitoring Service (CAMS) Global Atmospheric Composition Forecasts at 0.4 degree resolution. A 256x256 spatial region, covering 28.4S-73.6N and 32E-134.0E, is used as the model input, while predictions are generated for the central 128x128 area spanning 2.8S-48N and 57.6E-108.4E, ensuring an India-centric forecast domain with sufficient synoptic-scale context. Models are trained on CAMS analyses from 2021-2023 using a shuffled 90/10 split and independently evaluated on 2024 data. Three lightweight parameter-specific architectures are developed to improve accuracy, minimize systematic bias, and enable rapid inference. Evaluation using RMSE, MAE, Bias, and SSIM demonstrates substantial performance gains over the Aurora foundation model, underscoring the effectiveness of compact specialized deep learning models for short-range forecasts on limited spatial domains.
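The input and output domains quoted in the abstract are consistent with a centered crop on a 0.4 degree grid, which the short check below works through. It is a sketch under stated assumptions: grid points are taken to include both endpoints, axes are taken to run south to north and west to east, and the crop is assumed to start at index 64. Variable names are hypothetical, and the storage order of the actual CAMS arrays may differ.

```python
# Worked check of the grid arithmetic; array layout is an assumption.

STEP = 0.4        # grid spacing in degrees
N_INPUT = 256     # input grid points per axis
N_OUTPUT = 128    # central output grid points per axis

def axis(start, n, step=STEP):
    """Coordinates of n grid points starting at `start`, rounded to 0.1 degree."""
    return [round(start + i * step, 1) for i in range(n)]

input_lats = axis(-28.4, N_INPUT)   # 28.4S is written as -28.4
input_lons = axis(32.0, N_INPUT)

margin = (N_INPUT - N_OUTPUT) // 2  # 64 points trimmed on each side
output_lats = input_lats[margin:margin + N_OUTPUT]
output_lons = input_lons[margin:margin + N_OUTPUT]

print("input :", input_lats[0], input_lats[-1], input_lons[0], input_lons[-1])
print("output:", output_lats[0], output_lats[-1], output_lons[0], output_lons[-1])
```

Under these assumptions the input axes end at 73.6N and 134.0E, and the central crop spans 2.8S to 48.0N and 57.6E to 108.4E, matching the figures in the abstract.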
[CV-52] Viper-F1: Fast and Fine-Grained Multimodal Understanding with Cross-Modal State-Space Modulation
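Quick read: This paper addresses two obstacles to deploying multimodal large language models (MLLMs) in resource-constrained settings. First, Transformer-based cross-attention has quadratic complexity, which makes inference inefficient. Second, small vision-language models struggle to capture fine-grained, task-relevant visual regions, which limits fine-grained reasoning. The proposed Viper-F1 is a hybrid state-space vision-language model that replaces attention with efficient Liquid State-Space Dynamics for linear-time inference. It also introduces a Token-Grid Correlation Module that computes lightweight correlations between text tokens and image patches and modulates the state-space dynamics through FiLM conditioning. This strengthens visual grounding so that the model focuses on regions relevant to the text prompt, improving both fine-grained understanding and efficiency.
Link: https://arxiv.org/abs/2511.11177
Authors: Quoc-Huy Trinh,Mustapha Abdullahi,Do Duy Hung Trinh,Bo Zhao,Debesh Jha
Affiliations: Aalto University; University of South Dakota; Physical Robotics AS
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: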
Abstract:Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as robotic manipulation, personal assistants, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, leading to degraded performance on fine-grained reasoning tasks that limit their effectiveness in the real world. To address these issues, we introduce Viper-F1, a Hybrid State-Space Vision-Language Model that replaces attention with efficient Liquid State-Space Dynamics. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates the state-space dynamics via FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Viper-F1 achieves accurate, fine-grained understanding with significantly improved efficiency.
[CV-53] Dynamic Gaussian Scene Reconstruction from Unsynchronized Videos AAAI2026
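Quick read: This paper addresses temporal misalignment between cameras in multi-view video reconstruction. The problem is common in practice, for example with camera trigger delays or independent recording setups, and it significantly degrades dynamic scene reconstruction methods such as 4D Gaussian Splatting (4DGS). The proposed solution is a coarse-to-fine temporal alignment module that first estimates a frame-level time offset for each camera and then refines it to sub-frame accuracy, which corrects unsynchronized multi-view videos. The module integrates into existing 4DGS frameworks and makes them more robust to asynchronous data.
Link: https://arxiv.org/abs/2511.11175
Authors: Zhixin Xu,Hengyu Zhou,Yuan Liu,Wenhan Xue,Hao Pan,Wenping Wang,Bin Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2026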
Abstract:Multi-view video reconstruction plays a vital role in computer vision, enabling applications in film production, virtual reality, and motion analysis. While recent advances such as 4D Gaussian Splatting (4DGS) have demonstrated impressive capabilities in dynamic scene reconstruction, they typically rely on the assumption that input video streams are temporally synchronized. However, in real-world scenarios, this assumption often fails due to factors like camera trigger delays or independent recording setups, leading to temporal misalignment across views and reduced reconstruction quality. To address this challenge, a novel temporal alignment strategy is proposed for high-quality 4DGS reconstruction from unsynchronized multi-view videos. Our method features a coarse-to-fine alignment module that estimates and compensates for each camera’s time shift. The method first determines a coarse, frame-level offset and then refines it to achieve sub-frame accuracy. This strategy can be readily integrated as a module into existing 4DGS frameworks, enhancing their robustness when handling asynchronous data. Experiments show that our approach effectively processes temporally misaligned videos and significantly enhances baseline methods.
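The coarse-to-fine idea (an integer frame offset followed by a sub-frame refinement) can be illustrated on one-dimensional signals. The sketch below swaps in a generic signal-processing technique, cross-correlation with parabolic peak interpolation, because the abstract does not describe how the paper's module estimates offsets inside 4DGS. The per-frame "motion signals" and all function names are hypothetical.

```python
# Generic coarse-to-fine offset estimate; not the paper's alignment module.

def correlation_at(ref, other, shift):
    """Correlation of ref[i] with other[i + shift] over the overlapping range."""
    total = 0.0
    for i, r in enumerate(ref):
        j = i + shift
        if 0 <= j < len(other):
            total += r * other[j]
    return total

def coarse_offset(ref, other, max_shift):
    """Coarse stage: integer frame shift that maximizes the correlation."""
    return max(range(-max_shift, max_shift + 1),
               key=lambda s: correlation_at(ref, other, s))

def refine_offset(ref, other, shift):
    """Fine stage: fit a parabola through the peak and its two neighbors."""
    y0, y1, y2 = (correlation_at(ref, other, shift + d) for d in (-1, 0, 1))
    denom = y0 - 2.0 * y1 + y2
    if denom == 0:
        return float(shift)
    return shift + 0.5 * (y0 - y2) / denom

# Toy per-frame motion signals: the second camera sees the event later.
ref = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
other = [0.0, 0.0, 0.0, 0.5, 1.0, 0.0, 0.0]

coarse = coarse_offset(ref, other, max_shift=3)
fine = refine_offset(ref, other, coarse)
```

Here the coarse stage finds a two-frame delay, and the refinement moves the estimate slightly below two frames because the correlation peak is skewed toward the earlier frame.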
[CV-54] Refine and Align: Confidence Calibration through Multi-Agent Interaction in VQA AAAI2026
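[Quick Read]: This paper addresses poor confidence calibration in Visual Question Answering (VQA) systems: a model's confidence in its answers deviates from their actual correctness, and in high-stakes settings such as medical diagnostics and autonomous driving this overconfidence can have serious consequences. The key to the solution is AlignVQA, a debate-based multi-agent framework: several specialized Vision-Language Models (VLMs), each following a distinct prompting strategy, first generate candidate answers, and generalist agents then engage in a two-stage interaction in which they critique, refine, and aggregate these proposals, yielding confidence estimates that better reflect true predictive performance. The paper also designs AlignCal, a differentiable calibration-aware loss that fine-tunes the specialized agents by minimizing an upper bound on the calibration error, which improves the accuracy of each agent's confidence estimates. Experiments show that the method reduces calibration discrepancies on multiple benchmark VQA datasets.
Link: https://arxiv.org/abs/2511.11169
Authors: Ayush Pandey, Jai Bardhan, Ishita Jain, Ramya S Hebbalaguppe, Rohan Raju Dhanakshirur, Lovekesh Vig
Affiliations: TCS Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, 6 figures, 5 tables. Accepted to Special Track on AI Alignment, AAAI 2026. Project Page- this https URL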
Abstract:In the context of Visual Question Answering (VQA) and Agentic AI, calibration refers to how closely an AI system’s confidence in its answers reflects their actual correctness. This aspect becomes especially important when such systems operate autonomously and must make decisions under visual uncertainty. While modern VQA systems, powered by advanced vision-language models (VLMs), are increasingly used in high-stakes domains like medical diagnostics and autonomous navigation due to their improved accuracy, the reliability of their confidence estimates remains under-examined. In particular, these systems often produce overconfident responses. To address this, we introduce AlignVQA, a debate-based multi-agent framework in which diverse specialized VLMs, each following a distinct prompting strategy, generate candidate answers and then engage in a two-stage interaction: generalist agents critique, refine, and aggregate these proposals. This debate process yields confidence estimates that more accurately reflect the model’s true predictive performance. We find that more calibrated specialized agents produce better aligned confidences. Furthermore, we introduce a novel differentiable calibration-aware loss function called AlignCal, designed to fine-tune the specialized agents by minimizing an upper bound on the calibration error. This objective explicitly improves the fidelity of each agent’s confidence estimates. Empirical results across multiple benchmark VQA datasets substantiate the efficacy of our approach, demonstrating substantial reductions in calibration discrepancies.
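Calibration, as defined in this abstract, is commonly quantified with the expected calibration error (ECE). The sketch below shows the standard binned ECE, which is a generic metric and not the paper's AlignCal loss; the bin count and the toy inputs are assumptions chosen for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |mean confidence - accuracy| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece

# An overconfident model: 95% confidence but only half the answers are right.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, False, True, False])

# A calibrated model: 75% confidence and three of four answers are right.
calibrated = expected_calibration_error(
    [0.75, 0.75, 0.75, 0.75], [True, True, True, False])

print(overconfident, calibrated)
```

A lower value means that stated confidence tracks observed accuracy more closely, which is the property the debate process and the fine-tuning loss are meant to improve.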
[CV-55] CATS-V2V: A Real-World Vehicle-to-Vehicle Cooperative Perception Dataset with Complex Adverse Traffic Scenarios
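[Quick Read]: This paper addresses the lack of data that limits vehicle-to-vehicle (V2V) cooperative perception for autonomous driving in complex adverse traffic scenarios (CATS). Existing datasets focus mainly on ordinary traffic scenarios and cannot support improving V2V cooperative perception under extreme conditions. The key to the solution is CATS-V2V, the first real-world V2V cooperative perception dataset targeting CATS. It was collected by two hardware time-synchronized vehicles, covers 10 weather and lighting conditions across 10 diverse locations, and contains 60K frames of 10 Hz LiDAR point clouds, 1.26M multi-view 30 Hz camera images, and 750K high-precision RTK-fixed GNSS and IMU records, along with time-consistent 3D bounding box annotations and static scene information for building a 4D BEV representation. The paper also proposes a target-based temporal alignment method that ensures objects are precisely aligned in space and time across all sensor modalities, providing a high-quality, large-scale, multimodal data foundation for follow-up research.
Link: https://arxiv.org/abs/2511.11168
Authors: Hangyu Li, Bofeng Cao, Zhaohui Liang, Wuzhen Li, Juyoung Oh, Yuxuan Chen, Shixiao Liang, Hang Zhou, Chengyuan Ma, Jiaxi Liu, Zheng Li, Peng Zhang, KeKe Long, Maolin Liu, Jackson Jiang, Chunlei Yu, Shengxiang Liu, Hongkai Yu, Xiaopeng Li
Affiliations: University of Wisconsin-Madison; wuwen-ai; Cleveland State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: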
Abstract:Vehicle-to-Vehicle (V2V) cooperative perception has great potential to enhance autonomous driving performance by overcoming perception limitations in complex adverse traffic scenarios (CATS). Meanwhile, data serves as the fundamental infrastructure for modern autonomous driving AI. However, due to stringent data collection requirements, existing datasets focus primarily on ordinary traffic scenarios, constraining the benefits of cooperative perception. To address this challenge, we introduce CATS-V2V, the first-of-its-kind real-world dataset for V2V cooperative perception under complex adverse traffic scenarios. The dataset was collected by two hardware time-synchronized vehicles, covering 10 weather and lighting conditions across 10 diverse locations. The 100-clip dataset includes 60K frames of 10 Hz LiDAR point clouds and 1.26M multi-view 30 Hz camera images, along with 750K anonymized yet high-precision RTK-fixed GNSS and IMU records. Correspondingly, we provide time-consistent 3D bounding box annotations for objects, as well as static scenes to construct a 4D BEV representation. On this basis, we propose a target-based temporal alignment method, ensuring that all objects are precisely aligned across all sensor modalities. We hope that CATS-V2V, the largest-scale, most supportive, and highest-quality dataset of its kind to date, will benefit the autonomous driving community in related tasks.
[CV-56] Explainable Deep Convolutional Multi-Type Anomaly Detection
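[Quick Read]: This paper addresses two shortcomings of existing explainable anomaly detection methods: they cannot distinguish anomaly types, and they incur high cost because a separate model must be trained for each object category. Current methods typically detect that an anomaly exists but cannot say which type it is (e.g., "Crack" vs. "Scratch"), which limits diagnostic accuracy and decision efficiency in practice; existing approaches are also ill-suited to real-time or embedded systems with limited compute. The key to the solution is MultiTypeFCDD, a lightweight convolutional framework that uses only image-level labels to learn and produce multi-channel heatmaps, with each channel corresponding to a specific anomaly type. A single unified model thus differentiates anomaly types across multiple object categories, avoiding per-category models while significantly reducing parameter count and inference time, which makes it practical to deploy.
Link: https://arxiv.org/abs/2511.11165
Authors: Alex George, Lyudmila Mihaylova, Sean Anderson
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: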
Abstract:Most explainable anomaly detection methods often identify anomalies but lack the capability to differentiate the type of anomaly. Furthermore, they often require the costly training and maintenance of separate models for each object category. The lack of specificity is a significant research gap, as identifying the type of anomaly (e.g., “Crack” vs. “Scratch”) is crucial for accurate diagnosis that facilitates cost-saving operational decisions across diverse application domains. While some recent large-scale Vision-Language Models (VLMs) have begun to address this, they are computationally intensive and memory-heavy, restricting their use in real-time or embedded systems. We propose MultiTypeFCDD, a simple and lightweight convolutional framework designed as a practical alternative for explainable multi-type anomaly detection. MultiTypeFCDD uses only image-level labels to learn and produce multi-channel heatmaps, where each channel is trained to correspond to a specific anomaly type. The model functions as a single, unified framework capable of differentiating anomaly types across multiple object categories, eliminating the need to train and manage separate models for each object category. We evaluated our proposed method on the Real-IAD dataset and it delivers results competitive with state-of-the-art complex models at significantly reduced parametric load and inference times. This makes it a highly practical and viable solution for real-world applications where computational resources are tightly constrained.
[CV-57] Reverberation: Learning the Latencies Before Forecasting Trajectories
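[Quick Read]: This paper addresses the insufficient modeling of agents' response latencies in trajectory prediction: different agents take different amounts of time to notice, process, and react to trajectory-changing events, yet existing methods usually ignore these temporal delays, so predicted trajectories lack causal continuity and may be implausible. The key to the solution is a "reverberation transform" inspired by reverberation curves in acoustics, together with the corresponding Rev model, which uses two explicit, learnable reverberation kernels to simulate and predict each agent's latency preferences and their stochasticity, enabling controllable trajectory generation based on the predicted latencies.
Link: https://arxiv.org/abs/2511.11164
Authors: Conghao Wong, Ziqian Zou, Beihao Xia, Xinge You
Affiliations: Huazhong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: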
Abstract:Bridging the past to the future, connecting agents both spatially and temporally, lies at the core of the trajectory prediction task. Despite great efforts, it remains challenging to explicitly learn and predict latencies, the temporal delays with which agents respond to different trajectory-changing events and adjust their future paths, whether on their own or interactively. Different agents may exhibit distinct latency preferences for noticing, processing, and reacting to any specific trajectory-changing event. The lack of consideration of such latencies may undermine the causal continuity of the forecasting system and also lead to implausible or unintended trajectories. Inspired by the reverberation curves in acoustics, we propose a new reverberation transform and the corresponding Reverberation (short for Rev) trajectory prediction model, which simulates and predicts different latency preferences of each agent as well as their stochasticity by using two explicit and learnable reverberation kernels, allowing for the controllable trajectory prediction based on these forecasted latencies. Experiments on multiple datasets, whether pedestrians or vehicles, demonstrate that Rev achieves competitive accuracy while revealing interpretable latency dynamics across agents and scenarios. Qualitative analyses further verify the properties of the proposed reverberation transform, highlighting its potential as a general latency modeling approach.
[CV-58] OT-ALD: Aligning Latent Distributions with Optimal Transport for Accelerated Image-to-Image Translation
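[Quick Read]: This paper addresses two key problems of the Dual Diffusion Implicit Bridge (DDIB) method for image-to-image (I2I) translation: low translation efficiency, and translation trajectory deviations caused by mismatched latent distributions between the source and target domains. The key to the solution is OT-ALD, a new framework grounded in optimal transport (OT) theory: it computes an OT map from the source-domain latent distribution to the target-domain latent distribution and uses the mapped distribution as the starting point of the reverse diffusion process in the target domain, which eliminates the latent distribution mismatch while improving both translation efficiency and image quality.
Link: https://arxiv.org/abs/2511.11162
Authors: Zhanpeng Wang, Shuting Cao, Yuhang Lu, Yuhan Li, Na Lei, Zhongxuan Luo
Affiliations: Dalian University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: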
Abstract:The Dual Diffusion Implicit Bridge (DDIB) is an emerging image-to-image (I2I) translation method that preserves cycle consistency while achieving strong flexibility. It links two independently trained diffusion models (DMs) in the source and target domains by first adding noise to a source image to obtain a latent code, then denoising it in the target domain to generate the translated image. However, this method faces two key challenges: (1) low translation efficiency, and (2) translation trajectory deviations caused by mismatched latent distributions. To address these issues, we propose a novel I2I translation framework, OT-ALD, grounded in optimal transport (OT) theory, which retains the strengths of DDIB-based approach. Specifically, we compute an OT map from the latent distribution of the source domain to that of the target domain, and use the mapped distribution as the starting point for the reverse diffusion process in the target domain. Our error analysis confirms that OT-ALD eliminates latent distribution mismatches. Moreover, OT-ALD effectively balances faster image translation with improved image quality. Experiments on four translation tasks across three high-resolution datasets show that OT-ALD improves sampling efficiency by 20.29% and reduces the FID score by 2.6 on average compared to the top-performing baseline models.
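To give a feel for what an OT map between two latent distributions does, here is a deliberately tiny one-dimensional sketch. It is not the OT-ALD procedure: real latent codes are high-dimensional and the abstract does not say how the map is computed. In 1D with a convex cost, the optimal map between two equal-size samples is simply the monotone rearrangement (match by rank), which is what the hypothetical `transport_1d` below implements.

```python
def transport_1d(source, target):
    """Map each source sample to a target sample by matching ranks.

    For 1D samples of equal size and a convex transport cost (e.g. squared
    distance), pairing the i-th smallest source value with the i-th smallest
    target value is the optimal transport plan.
    """
    if len(source) != len(target):
        raise ValueError("this sketch assumes equal-size samples")

    order = sorted(range(len(source)), key=lambda i: source[i])
    sorted_target = sorted(target)

    mapped = [0.0] * len(source)
    for rank, i in enumerate(order):
        mapped[i] = sorted_target[rank]
    return mapped

# Toy "latents": source samples near 0, target samples near 11.5.
src = [0.2, -1.0, 1.5, 0.0]
tgt = [10.0, 12.0, 11.0, 13.0]

mapped = transport_1d(src, tgt)
print(mapped)  # [12.0, 10.0, 13.0, 11.0]
```

After the mapping, the transported samples follow the target sample distribution exactly while preserving the ordering of the source samples, which is the sense in which a mismatch between the two distributions is removed in this toy setting.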
[CV-59] Hindsight Distillation Reasoning with Knowledge Encouragement Preference for Knowledge-based Visual Question Answering
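[Quick Read]: This paper addresses the implicit reasoning of existing methods for Knowledge-based Visual Question Answering (KBVQA): Multimodal Large Language Models (MLLMs) can answer using implicit or explicit knowledge, but they lack interpretable multi-step reasoning trajectories. The key to the solution is the Hindsight Distilled Reasoning (HinD) framework combined with Knowledge Encouragement Preference Optimization (KEPO). Specifically, a frozen 7B-size MLLM is prompted to complete the reasoning path between a question and its correct answer, producing the Hindsight-Zero training data; this data is then self-distilled into a Chain-of-Thought (CoT) Generator and a Knowledge Generator, which output sequential reasoning steps and discrete facts; finally, KEPO optimizes the Knowledge Generator to prefer under-confident but helpful knowledge, improving knowledge accuracy and reasoning consistency and achieving strong KBVQA performance without external APIs or knowledge bases.
Link: https://arxiv.org/abs/2511.11132
Authors: Yu Zhao, Ying Zhang, Xuhui Sui, Baohang Zhou, Li Shen, Dacheng Tao
Affiliations: Nankai University; Nanyang Technological University; Tiangong University; Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: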
Abstract:Knowledge-based Visual Question Answering (KBVQA) necessitates external knowledge incorporation beyond cross-modal understanding. Existing KBVQA methods either utilize implicit knowledge in multimodal large language models (MLLMs) via in-context learning or explicit knowledge via retrieval augmented generation. However, their reasoning processes remain implicit, without explicit multi-step trajectories from MLLMs. To address this gap, we provide a Hindsight Distilled Reasoning (HinD) framework with Knowledge Encouragement Preference Optimization (KEPO), designed to elicit and harness internal knowledge reasoning ability in MLLMs. First, to tackle the reasoning supervision problem, we propose to emphasize the hindsight wisdom of MLLM by prompting a frozen 7B-size MLLM to complete the reasoning process between the question and its ground truth answer, constructing Hindsight-Zero training data. Then we self-distill Hindsight-Zero into Chain-of-Thought (CoT) Generator and Knowledge Generator, enabling the generation of sequential steps and discrete facts. Secondly, to tackle the misalignment between knowledge correctness and confidence, we optimize the Knowledge Generator with KEPO, preferring under-confident but helpful knowledge over the over-confident but unhelpful one. The generated CoT and sampled knowledge are then exploited for answer prediction. Experiments on OK-VQA and A-OKVQA validate the effectiveness of HinD, showing that HinD with elicited reasoning from 7B-size MLLM achieves superior performance without commercial model APIs or outside knowledge.
[CV-60] Stroke Modeling Enables Vectorized Character Generation with Large Vectorized Glyph Model
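[Quick Read]: This paper addresses how to efficiently generate high-quality, scalable vectorized Chinese glyphs, in particular how to produce semantically plausible, structurally complete characters from only a few given strokes. The key to the solution is the Large Vectorized Glyph Model (LVGM): strokes are encoded as discrete latent variables (stroke embeddings), and the model is trained at the stroke level, building on the sequence-prediction ability of large language models (LLMs), to predict the next stroke embedding. Given only a few initial strokes, it can generate complete characters, semantically coherent words, and even unseen verses, while the outputs retain the flexibility and scalability of the vector format.
Link: https://arxiv.org/abs/2511.11119
Authors: Xinyue Zhang, Haolong Li, Jiawei Ma, Chen Ye
Affiliations: Tongji University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: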
Abstract:Vectorized glyphs are widely used in poster design, network animation, art display, and various other fields due to their scalability and flexibility. In typography, they are often seen as special sequences composed of ordered strokes. This concept extends to the token sequence prediction abilities of large language models (LLMs), enabling vectorized character generation through stroke modeling. In this paper, we propose a novel Large Vectorized Glyph Model (LVGM) designed to generate vectorized Chinese glyphs by predicting the next stroke. Initially, we encode strokes into discrete latent variables called stroke embeddings. Subsequently, we train our LVGM via fine-tuning DeepSeek LLM by predicting the next stroke embedding. With limited strokes given, it can generate complete characters, semantically elegant words, and even unseen verses in vectorized form. Moreover, we release a new large-scale Chinese SVG dataset containing 907,267 samples based on strokes for dynamically vectorized glyph generation. Experimental results show that our model has scaling behaviors on data scales. Our generated vectorized glyphs have been validated by experts and relevant individuals.
[CV-61] Toward Generalized Detection of Synthetic Media: Limitations, Challenges, and the Path to Multimodal Solutions
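[Quick Read]: This paper addresses the poor generalization of current AI-generated media detectors when faced with diverse, heavily modified synthetic content, in particular the drop in detection performance under cross-model transfer and on multimodal data. It argues that the way forward is to build detection frameworks on multimodal deep learning models, which integrate visual, textual, and other modalities and can therefore detect synthetic content from different generative mechanisms (such as GANs and diffusion models) more robustly and generally, giving future work on defenses against synthetic media a clear research direction.
Link: https://arxiv.org/abs/2511.11116
Authors: Redwan Hussain, Mizanur Rahman, Prithwiraj Bhattacharjee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: 10 Pages, 4 figures, 1 table, 7th International Conference on Trends in Computational and Cognitive Engineering (TCCE-2025)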
Abstract:Artificial intelligence (AI) in media has advanced rapidly over the last decade. The introduction of Generative Adversarial Networks (GANs) improved the quality of photorealistic image generation. Diffusion models later brought a new era of generative media. These advances made it difficult to separate real and synthetic content. The rise of deepfakes demonstrated how these tools could be misused to spread misinformation, political conspiracies, privacy violations, and fraud. For this reason, many detection models have been developed. They often use deep learning methods such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). These models search for visual, spatial, or temporal anomalies. However, such approaches often fail to generalize across unseen data and struggle with content from different models. In addition, existing approaches are ineffective in multimodal data and highly modified content. This study reviews twenty-four recent works on AI-generated media detection. Each study was examined individually to identify its contributions and weaknesses, respectively. The review then summarizes the common limitations and key challenges faced by current approaches. Based on this analysis, a research direction is suggested with a focus on multimodal deep learning models. Such models have the potential to provide more robust and generalized detection. It offers future researchers a clear starting point for building stronger defenses against harmful synthetic media.
[CV-62] VIDEOP2R: Video Understanding from Perception to Reasoning
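[Quick Read]: This paper addresses how to effectively extend Reinforcement Fine-Tuning (RFT) to Large Video Language Models (LVLMs) to improve their video reasoning. Existing RFT methods perform well on text, but on video they face challenges such as the coupling of perception and reasoning and reward schemes that do not distinguish between task stages. The key to the solution is the VideoP2R framework, whose core idea is to model perception and reasoning in video understanding as two separate processes: in the supervised fine-tuning (SFT) stage, a three-step pipeline generates VideoP2R-CoT-162K, a high-quality, process-aware Chain-of-Thought (CoT) dataset; in the reinforcement learning (RL) stage, a new Process-aware Group Relative Policy Optimization (PA-GRPO) algorithm supplies separate reward signals for the perception and reasoning outputs, enabling more precise policy optimization. Experiments show state-of-the-art performance on six of seven video understanding benchmarks, and ablations confirm the effectiveness of process-aware modeling and PA-GRPO.
Link: https://arxiv.org/abs/2511.11113
Authors: Yifan Jiang, Yueying Wang, Rui Zhao, Toufiq Parag, Zhimin Chen, Zhenyu Liao, Jayakrishnan Unnikrishnan
Affiliations: USC; Amazon
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: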
Abstract:Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL), has shown promising results in improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that the model’s perception output is information-sufficient for downstream reasoning.
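The phrase "separate rewards for perception and reasoning" can be made concrete with the group-relative normalization that GRPO-style methods use. The sketch below is an assumption-laden illustration, not PA-GRPO itself: the abstract does not say how the two reward streams are scored or how the resulting advantages are assigned to tokens, so the function names and the toy rewards here are hypothetical.

```python
import statistics

def group_relative(rewards):
    """Normalize rewards within one group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # Every response scored the same, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

def process_aware_advantages(perception_rewards, reasoning_rewards):
    """Compute one advantage per response for each process, independently."""
    return group_relative(perception_rewards), group_relative(reasoning_rewards)

# Four sampled responses to the same video question (toy 0/1 rewards).
perception = [1.0, 1.0, 0.0, 0.0]   # did the perception segment describe the video well?
reasoning = [1.0, 0.0, 0.0, 0.0]    # did the reasoning segment reach the right answer?

perc_adv, reas_adv = process_aware_advantages(perception, reasoning)
print(perc_adv)
print(reas_adv)
```

In this toy group, the second response is credited for its perception even though its reasoning failed, which a single combined reward could not express.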
[CV-63] AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization
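[Quick Read]: This paper addresses two problems in inference with Audio-Video Large Language Models (AV-LLMs): the growth of the key-value (KV) cache caused by the temporal dimension that video and audio introduce, and information confusion and misalignment between modalities. Conventional task-driven retention of audio or video KV caches works poorly, because attention in the higher layers leans more strongly toward the video modality, and directly mixing the temporal KV of audio with the spatial-temporal KV of video easily causes information conflicts and degrades performance. The key to the solution is the AccKV framework: first, layer-adaptive focusing dynamically selects the key modality according to the characteristics of each layer; second, a Cross-Calibration mechanism first integrates inefficient caches within each modality and then, through a priority-alignment strategy, selectively evicts the KV cache of the low-priority modality, yielding efficient and accurate multimodal cache management.
Link: https://arxiv.org/abs/2511.11106
Authors: Zhonghua Jiang, Kui Chen, Kunxi Li, Keting Yin, Yiyun Zhou, Zhaode Wang, Chengfei Lv, Shengyu Zhang
Affiliations: Unknown
Categories: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: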
Abstract:Recent advancements in Audio-Video Large Language Models (AV-LLMs) have enhanced their capabilities in tasks like audio-visual question answering and multimodal dialog systems. Video and audio introduce an extended temporal dimension, resulting in a larger key-value (KV) cache compared to static image embedding. A naive optimization strategy is to selectively focus on and retain KV caches of audio or video based on task. However, in the experiment, we observed that the attention of AV-LLMs to various modalities in the high layers is not strictly dependent on the task. In higher layers, the attention of AV-LLMs shifts more towards the video modality. In addition, we also found that directly integrating temporal KV of audio and spatial-temporal KV of video may lead to information confusion and significant performance degradation of AV-LLMs. If audio and video are processed indiscriminately, it may also lead to excessive compression or reservation of a certain modality, thereby disrupting the alignment between modalities. To address these challenges, we propose AccKV, an Adaptive-Focusing and Cross-Calibration KV cache optimization framework designed specifically for efficient AV-LLMs inference. Our method is based on layer adaptive focusing technology, selectively focusing on key modalities according to the characteristics of different layers, and enhances the recognition of heavy hitter tokens through attention redistribution. In addition, we propose a Cross-Calibration technique that first integrates inefficient KV caches within the audio and video modalities, and then aligns low-priority modalities with high-priority modalities to selectively evict KV cache of low-priority modalities. The experimental results show that AccKV can significantly improve the computational efficiency of AV-LLMs while maintaining accuracy.
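The eviction step described at the end of this abstract can be pictured with a generic heavy-hitter policy. The sketch below is not AccKV: it simply keeps, for each modality, the cached tokens with the largest accumulated attention, with a smaller budget for the modality treated as low priority. The cache layout, the scores, and the per-modality budgets are all assumptions made for illustration.

```python
def evict_by_modality(cache, scores, budgets):
    """Keep the top-scoring entries per modality, preserving token order.

    cache   -- list of (modality, payload) tuples, one per cached token
    scores  -- accumulated attention mass received by each cached token
    budgets -- how many entries each modality may keep
    """
    kept = set()
    for modality, budget in budgets.items():
        idxs = [i for i, (m, _) in enumerate(cache) if m == modality]
        idxs.sort(key=lambda i: scores[i], reverse=True)
        kept.update(idxs[:budget])
    return [cache[i] for i in sorted(kept)]

# Toy cache for one layer in which video is the high-priority modality.
cache = [("video", "v0"), ("audio", "a0"), ("video", "v1"),
         ("audio", "a1"), ("video", "v2")]
scores = [0.30, 0.05, 0.10, 0.20, 0.35]
budgets = {"video": 2, "audio": 1}   # hypothetical per-layer budgets

compressed = evict_by_modality(cache, scores, budgets)
print(compressed)  # [('video', 'v0'), ('audio', 'a1'), ('video', 'v2')]
```

In a layer-adaptive scheme the budgets would differ from layer to layer; here they are fixed by hand to keep the example short.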
[CV-64] Detection of Bark Beetle Attacks using Hyperspectral PRISMA Data and Few-Shot Learning
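[Quick Read]: This paper addresses the detection of bark beetle infestations in coniferous forests, with the aim of improving the accuracy and efficiency of forest health monitoring. The key to the solution is a few-shot learning approach based on contrastive learning: a one-dimensional convolutional neural network (1D CNN) encoder is pre-trained on PRISMA hyperspectral remote sensing data to extract robust feature representations, which are fed to support vector regression (SVR) models that, from only a few labeled samples, estimate the proportions of healthy, attacked, and dead trees in each pixel. The method outperforms using the original PRISMA spectral bands or Sentinel-2 data directly.
Link: https://arxiv.org/abs/2511.11096
Authors: Mattia Ferrari, Giancarlo Papitto, Giorgio Deligios, Lorenzo Bruzzone
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 figures, accepted at IGARSS conference 3-8 August 2025 Brisbane, Australia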
Abstract:Bark beetle infestations represent a serious challenge for maintaining the health of coniferous forests. This paper proposes a few-shot learning approach leveraging contrastive learning to detect bark beetle infestations using satellite PRISMA hyperspectral data. The methodology is based on a contrastive learning framework to pre-train a one-dimensional CNN encoder, enabling the extraction of robust feature representations from hyperspectral data. These extracted features are subsequently utilized as input to support vector regression estimators, one for each class, trained on few labeled samples to estimate the proportions of healthy, attacked by bark beetle, and dead trees for each pixel. Experiments on the area of study in the Dolomites show that our method outperforms the use of original PRISMA spectral bands and of Sentinel-2 data. The results indicate that PRISMA hyperspectral data combined with few-shot learning offers significant advantages for forest health monitoring.
[CV-65] Machine-Learning Based Detection of Coronary Artery Calcification Using Synthetic Chest X-Rays
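[Quick Read]: This paper addresses the lack of reliable labeled data for detecting Coronary Artery Calcification (CAC), in particular the way unreliable labels on chest X-rays (CXRs) constrain deep learning development. The key to the solution is to use Digitally Reconstructed Radiographs (DRRs) as a synthetic training domain: CT volumes are projected into X-ray-like images that inherit the precise labels of the original CT, providing a label-rich, scalable training foundation. The study finds that lightweight convolutional neural networks (CNNs) trained from scratch outperform large pretrained models, that combining super-resolution with contrast enhancement yields significant gains, and that curriculum learning stabilizes training under weak supervision. The best configuration achieves a mean AUC of 0.754, comparable to or better than prior methods based on real CXRs, and offers a viable path toward transfer learning and domain adaptation to real chest X-rays.
Link: https://arxiv.org/abs/2511.11093
Authors: Dylan Saeed, Ramtin Gharleghi, Susann Bier, Sonit Singh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures. Under review for MIDL 2026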
Abstract:Coronary artery calcification (CAC) is a strong predictor of cardiovascular events, with CT-based Agatston scoring widely regarded as the clinical gold standard. However, CT is costly and impractical for large-scale screening, while chest X-rays (CXRs) are inexpensive but lack reliable ground truth labels, constraining deep learning development. Digitally reconstructed radiographs (DRRs) offer a scalable alternative by projecting CT volumes into CXR-like images while inheriting precise labels. In this work, we provide the first systematic evaluation of DRRs as a surrogate training domain for CAC detection. Using 667 CT scans from the COCA dataset, we generate synthetic DRRs and assess model capacity, super-resolution fidelity enhancement, preprocessing, and training strategies. Lightweight CNNs trained from scratch outperform large pretrained networks; pairing super-resolution with contrast enhancement yields significant gains; and curriculum learning stabilises training under weak supervision. Our best configuration achieves a mean AUC of 0.754, comparable to or exceeding prior CXR-based studies. These results establish DRRs as a scalable, label-rich foundation for CAC detection, while laying the foundation for future transfer learning and domain adaptation to real CXRs.
[CV-66] A Space-Time Transformer for Precipitation Forecasting
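[Quick Read]: This paper addresses two limitations of traditional Numerical Weather Prediction (NWP) models for extreme precipitation forecasting: solving partial differential equations (PDEs) is computationally expensive, which hinders real-time use, and performance degrades significantly at nowcasting timescales (0-4 hours). The authors propose SaTformer, a video Transformer built on full space-time attention that forecasts extreme precipitation from satellite radiances. The solution rests on two ideas: first, a data-driven AI approach replaces physical parameterization, improving computational efficiency and real-time capability; second, because extreme precipitation data are heavily long-tailed, the regression task is reformulated as classification and a class-weighted loss is used to mitigate label imbalance, which improves the model's ability to capture rare extreme events.
Link: https://arxiv.org/abs/2511.11090
Authors: Levi Harris, Tianlong Chen
Affiliations: UNC Chapel Hill
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: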
Abstract:Meteorological agencies around the world rely on real-time flood guidance to issue life-saving advisories and warnings. For decades traditional numerical weather prediction (NWP) models have been state-of-the-art for precipitation forecasting. However, physically-parameterized models suffer from a few core limitations: first, solving PDEs to resolve atmospheric dynamics is computationally demanding, and second, these methods degrade in performance at nowcasting timescales (i.e., 0-4 hour lead-times). Motivated by these shortcomings, recent work proposes AI-weather prediction (AI-WP) alternatives that learn to emulate analysis data with neural networks. While these data-driven approaches have enjoyed enormous success across diverse spatial and temporal resolutions, applications of video-understanding architectures for weather forecasting remain underexplored. To address these gaps, we propose SaTformer: a video transformer built on full space-time attention that skillfully forecasts extreme precipitation from satellite radiances. Along with our novel architecture, we introduce techniques to tame long-tailed precipitation datasets. Namely, we reformulate precipitation regression into a classification problem, and employ a class-weighted loss to address label imbalances. Our model scored first place on the NeurIPS Weather4Cast 2025 Cumulative Rainfall challenge. Code and model weights are available: this https URL
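The reformulation of regression as classification with a class-weighted loss, mentioned in this abstract, can be sketched in a few lines. This is a generic illustration rather than the SaTformer training code: the bin edges, the inverse-frequency weighting, and the toy rainfall values are all assumptions.

```python
import math

def to_class(rain_mm, edges):
    """Return the index of the first bin whose upper edge exceeds the value."""
    for k, edge in enumerate(edges):
        if rain_mm < edge:
            return k
    return len(edges)

def inverse_frequency_weights(labels, n_classes):
    """Weight each class by n / (n_classes * count), so rare classes count more."""
    counts = [0] * n_classes
    for y in labels:
        counts[y] += 1
    n = len(labels)
    return [n / (n_classes * c) if c else 0.0 for c in counts]

def weighted_cross_entropy(probs, labels, weights):
    """Class-weighted negative log-likelihood, normalized by the total weight."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm

edges = [0.1, 5.0]                       # hypothetical thresholds in mm: dry / moderate / heavy
rain = [0.0, 0.0, 0.0, 0.0, 2.0, 20.0]   # long-tailed toy targets

labels = [to_class(r, edges) for r in rain]
weights = inverse_frequency_weights(labels, n_classes=3)

# A model that predicts a uniform distribution for every sample.
uniform = [[1.0 / 3.0] * 3 for _ in rain]
loss = weighted_cross_entropy(uniform, labels, weights)

print(labels, weights, loss)
```

With these weights the single heavy-rain sample contributes as much to the loss as the four dry samples combined, which is the intended counterweight to the long tail.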
[CV-67] SplineSplat: 3D Ray Tracing for Higher-Quality Tomography
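[Quick Read]: This paper addresses the efficiency and accuracy of computing projections of three-dimensional (3D) volumetric data, in particular how to ray-trace volumes represented with B-splines under arbitrary projection geometries to obtain high-fidelity tomographic projections. The key to the solution is a ray-tracing algorithm with a neural network component: the network quickly estimates the contribution of the basis functions to the line integrals, improving computational efficiency, while the approach achieves higher reconstruction quality than traditional voxel-based methods in well-posed cases where the data are sufficient and no regularization is needed.
Link: https://arxiv.org/abs/2511.11078
Authors: Youssef Haouchat, Sepand Kashani, Aleix Boquet-Pujadas, Philippe Thévenaz, Michael Unser
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Comments: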
Abstract:We propose a method to efficiently compute tomographic projections of a 3D volume represented by a linear combination of shifted B-splines. To do so, we propose a ray-tracing algorithm that computes 3D line integrals with arbitrary projection geometries. One of the components of our algorithm is a neural network that computes the contribution of the basis functions efficiently. In our experiments, we consider well-posed cases where the data are sufficient for accurate reconstruction without the need for regularization. We achieve higher reconstruction quality than traditional voxel-based methods.
[CV-68] Phys-Liquid: A Physics-Informed Dataset for Estimating 3D Geometry and Volume of Transparent Deformable Liquids AAAI-26
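[Quick Read]: This paper addresses the difficulty of estimating the geometric and volumetric properties of transparent deformable liquids while their containers move; the core challenges are optical complexity and the surface deformations induced by container motion. To tackle this, the authors propose Phys-Liquid, a physics-informed dataset of 97,200 simulation images with corresponding 3D meshes, covering multiple laboratory scenes, lighting conditions, liquid colors, and container rotations so as to simulate realistic liquid behavior along several dimensions. The paper also designs a four-stage reconstruction and estimation pipeline (liquid segmentation, multi-view mask generation, 3D mesh reconstruction, and real-world scaling) that improves the accuracy and consistency of liquid geometry and volume estimates and validates the dataset's usefulness for transparent liquid perception tasks.
Link: https://arxiv.org/abs/2511.11077
Authors: Ke Ma, Yizhou Fang, Jean-Baptiste Weibel, Shuai Tan, Xinggang Wang, Yang Xiao, Yi Fang, Tian Xia
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 14 pages, 19 figures. Accepted as an oral paper at AAAI-26 (Main Technical Track). Code and dataset: this https URL Project page: this https URL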
Abstract:Estimating the geometric and volumetric properties of transparent deformable liquids is challenging due to optical complexities and dynamic surface deformations induced by container movements. Autonomous robots performing precise liquid manipulation tasks, such as dispensing, aspiration, and mixing, must handle containers in ways that inevitably induce these deformations, complicating accurate liquid state assessment. Current datasets lack comprehensive physics-informed simulation data representing realistic liquid behaviors under diverse dynamic scenarios. To bridge this gap, we introduce Phys-Liquid, a physics-informed dataset comprising 97,200 simulation images and corresponding 3D meshes, capturing liquid dynamics across multiple laboratory scenes, lighting conditions, liquid colors, and container rotations. To validate the realism and effectiveness of Phys-Liquid, we propose a four-stage reconstruction and estimation pipeline involving liquid segmentation, multi-view mask generation, 3D mesh reconstruction, and real-world scaling. Experimental results demonstrate improved accuracy and consistency in reconstructing liquid geometry and volume, outperforming existing benchmarks. The dataset and associated validation methods facilitate future advancements in transparent liquid perception tasks. The dataset and code are available at this https URL.
[CV-69] Evaluating Latent Generative Paradigms for High-Fidelity 3D Shape Completion from a Single Depth Image
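[Quick Read]: This paper addresses the lack of consensus on which generative model suits which task in 3D shape modeling and completion, in particular that generation conditioned on different kinds of information (text, images, or partial 3D data) has not been systematically compared. The key to the solution is to adapt and quantitatively compare two leading generative models, Denoising Diffusion Probabilistic Models (DDPMs) and Autoregressive Causal Transformers, focusing on multi-modal shape completion and using ablations to show how continuous versus discrete latent spaces affect performance. The results show that the diffusion model with continuous latents achieves the best performance on multi-modal shape completion from a single noisy depth image under realistic conditions, whereas an autoregressive model compared on the same discrete latent space can match or exceed diffusion performance.
Link: https://arxiv.org/abs/2511.11074
Authors: Matthias Humt, Ulrich Hillenbrand, Rudolph Triebel
Affiliations: German Aerospace Center (DLR); Technical University Munich (TUM); Karlsruhe Institute of Technology (KIT)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 17 pages, 4 figures, 19 tables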
Abstract:While generative models have seen significant adoption across a wide range of data modalities, including 3D data, a consensus on which model is best suited for which task has yet to be reached. Further, conditional information such as text and images to steer the generation process are frequently employed, whereas others, like partial 3D data, have not been thoroughly evaluated. In this work, we compare two of the most promising generative models–Denoising Diffusion Probabilistic Models and Autoregressive Causal Transformers–which we adapt for the tasks of generative shape modeling and completion. We conduct a thorough quantitative evaluation and comparison of both tasks, including a baseline discriminative model and an extensive ablation study. Our results show that (1) the diffusion model with continuous latents outperforms both the discriminative model and the autoregressive approach and delivers state-of-the-art performance on multi-modal shape completion from a single, noisy depth image under realistic conditions and (2) when compared on the same discrete latent space, the autoregressive model can match or exceed diffusion performance on these tasks.
[CV-70] From Retinal Pixels to Patients: Evolution of Deep Learning Research in Diabetic Retinopathy Screening
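[Quick read]: This survey addresses the clinical and technical challenges of early screening for Diabetic Retinopathy (DR), including limited model generalization, data privacy concerns, label scarcity, and low clinical trust. The key to its approach is a systematic synthesis of the past decade of deep learning progress on DR, covering self- and semi-supervised learning, domain generalization, federated training, and hybrid neuro-symbolic models, with emphasis on standardized evaluation protocols, reproducibility, and multi-center clinical validation, in order to move DR AI toward deployable and trustworthy clinical use.
Link: https://arxiv.org/abs/2511.11065
Authors: Muskaan Chopra, Lorenz Sparrenberg, Armin Berger, Sarthak Khanna, Jan H. Terheyden, Rafet Sifa
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted in IEEE BigData 2025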
Abstract:Diabetic Retinopathy (DR) remains a leading cause of preventable blindness, with early detection critical for reducing vision loss worldwide. Over the past decade, deep learning has transformed DR screening, progressing from early convolutional neural networks trained on private datasets to advanced pipelines addressing class imbalance, label scarcity, domain shift, and interpretability. This survey provides the first systematic synthesis of DR research spanning 2016-2025, consolidating results from 50+ studies and over 20 datasets. We critically examine methodological advances, including self- and semi-supervised learning, domain generalization, federated training, and hybrid neuro-symbolic models, alongside evaluation protocols, reporting standards, and reproducibility challenges. Benchmark tables contextualize performance across datasets, while discussion highlights open gaps in multi-center validation and clinical trust. By linking technical progress with translational barriers, this work outlines a practical agenda for reproducible, privacy-preserving, and clinically deployable DR AI. Beyond DR, many of the surveyed innovations extend broadly to medical imaging at scale.
[CV-71] LiteAttention: A Temporal Sparse Attention for Diffusion Transformers
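[Quick read]: This paper addresses the high latency of Diffusion Transformers in video generation caused by the quadratic complexity of attention. Existing acceleration methods face a fundamental trade-off: dynamic sparse attention patterns adapt to the input but incur high computational overhead and estimation errors, while static sparse patterns are inflexible and often suboptimal. The key observation is that diffusion attention exhibits strong temporal coherence: attention regions judged non-essential at one denoising step typically remain so at later steps. Building on this, the authors design LiteAttention, which marks redundant tiles early and propagates skip decisions forward, yielding evolutionary computation skips across the denoising sequence. This combines the adaptivity of dynamic methods with the efficiency of static ones without repeated profiling, substantially speeding up inference with no loss in generation quality.
Link: https://arxiv.org/abs/2511.11062
Authors: Dor Shmilovich, Tony Wu, Aviad Dahan, Yuval Domb
Affiliations: MoonMath.ai
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: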
Abstract:Diffusion Transformers, particularly for video generation, achieve remarkable quality but suffer from quadratic attention complexity, leading to prohibitive latency. Existing acceleration methods face a fundamental trade-off: dynamically estimating sparse attention patterns at each denoising step incurs high computational overhead and estimation errors, while static sparsity patterns remain fixed and often suboptimal throughout denoising. We identify a key structural property of diffusion attention, namely, its sparsity patterns exhibit strong temporal coherence across denoising steps. Tiles deemed non-essential at step $t$ typically remain so at step $t+\delta$. Leveraging this observation, we introduce LiteAttention, a method that exploits temporal coherence to enable evolutionary computation skips across the denoising sequence. By marking non-essential tiles early and propagating skip decisions forward, LiteAttention eliminates redundant attention computations without repeated profiling overheads, combining the adaptivity of dynamic methods with the efficiency of static ones. We implement a highly optimized LiteAttention kernel on top of FlashAttention and demonstrate substantial speedups on production video diffusion models, with no degradation in quality. The code and implementation details will be publicly released.
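The skip-propagation idea can be illustrated with a toy sketch. This is not the LiteAttention kernel: the tile criterion (a tile counts as non-essential when its best logit trails the best logit in its query-tile row by more than `margin`), the function names, and the slowly drifting random inputs are all assumptions made for illustration. The sketch only shows the monotone bookkeeping the abstract describes, where a tile marked at one denoising step stays skipped at later steps.

```python
import numpy as np

def tile_max_logits(q, k, tile):
    """Largest scaled dot-product logit inside each (query-tile, key-tile) pair."""
    n, d = q.shape
    t = n // tile
    logits = (q @ k.T) / np.sqrt(d)
    return logits.reshape(t, tile, t, tile).max(axis=(1, 3))

def propagate_skips(skip, scores, margin):
    """Return the updated skip mask; tiles that are already skipped stay skipped."""
    active = np.where(skip, -np.inf, scores)   # skipped tiles carry no score
    row_best = active.max(axis=1, keepdims=True)
    return skip | (active < row_best - margin)

def simulate(steps=5, n=8, d=4, tile=2, margin=1.0, seed=0):
    rng = np.random.default_rng(seed)
    t = n // tile
    skip = np.zeros((t, t), dtype=bool)
    computed = []                              # tiles evaluated at each step
    q, k = rng.normal(size=(n, d)), rng.normal(size=(n, d))
    for _ in range(steps):
        # A real kernel would never touch skipped tiles; the toy computes
        # every logit and masks afterwards purely for brevity.
        scores = tile_max_logits(q, k, tile)
        computed.append(int((~skip).sum()))
        skip = propagate_skips(skip, scores, margin)
        # Hypothetical slow drift standing in for consecutive denoising steps.
        q = q + 0.05 * rng.normal(size=q.shape)
        k = k + 0.05 * rng.normal(size=k.shape)
    return computed, skip

computed, skip = simulate()
```

Because the mask only ever grows, the number of evaluated tiles is non-increasing across steps, and every query-tile row keeps at least its best-scoring key tile.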
[CV-72] CareCom: Generative Image Composition with Calibrated Reference Features
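[Quick read]: This paper addresses the difficulty, in generative image composition, of preserving foreground detail while also adjusting the foreground's pose/view. Existing methods struggle to achieve both goals at once. The key to the solution is a multi-reference extension of the composition model that accepts an arbitrary number of foreground reference images, together with calibration of the global and local features of those references so that they are compatible with the background. The calibrated features supplement the original reference features with useful global and local information about a suitable pose/view, markedly improving composition results.
Link: https://arxiv.org/abs/2511.11060
Authors: Jiaxuan Chen, Bo Zhang, Qingdong He, Jinlong Peng, Li Niu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: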
Abstract:Image composition aims to seamlessly insert a foreground object into a background. Despite the huge progress in generative image composition, existing methods still struggle with simultaneous detail preservation and foreground pose/view adjustment. To address this issue, we extend the existing generative composition model to a multi-reference version, which allows using an arbitrary number of foreground reference images. Furthermore, we propose to calibrate the global and local features of foreground reference images to make them compatible with the background information. The calibrated reference features can supplement the original reference features with useful global and local information of proper pose/view. Extensive experiments on MVImgNet and MureCom demonstrate that the generative model can greatly benefit from the calibrated reference features.
[CV-73] NP-LoRA: Null Space Projection Unifies Subject and Style in LoRA Fusion
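[Quick read]: This paper addresses the structural interference that arises when LoRA (Low-Rank Adaptation) modules are fused by weight merging: separately trained LoRAs occupy low-rank subspaces of a high-dimensional space with non-orthogonal, overlapping representations, so one LoRA dominates the other and generation quality degrades. The key to the solution is Null Space Projection LoRA (NP-LoRA), which extracts the principal style directions via singular value decomposition (SVD) and projects the subject LoRA into their orthogonal null space, separating the subspaces along the principal directions and thereby avoiding structural interference. A soft projection mechanism additionally controls the trade-off between subject fidelity and style consistency.
Link: https://arxiv.org/abs/2511.11051
Authors: Chuheng Chen, Xiaofei Zhou, Geyuan Zhang, Yong Huang
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: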
Abstract:Low-Rank Adaptation (LoRA) fusion has emerged as a key technique for reusing and composing learned subject and style representations for controllable generation without costly retraining. However, existing methods rely on weight-based merging, where one LoRA often dominates the other, leading to interference and degraded fidelity. This interference is structural: separately trained LoRAs occupy low-rank high-dimensional subspaces, leading to non-orthogonal and overlapping representations. In this work, we analyze the internal structure of LoRAs and find their generative behavior is dominated by a few principal directions in the low-rank subspace, which should remain free from interference during fusion. To achieve this, we propose Null Space Projection LoRA (NP-LoRA), a projection-based framework for LoRA fusion that enforces subspace separation to prevent structural interference among principal directions. Specifically, we first extract principal style directions via singular value decomposition (SVD) and then project the subject LoRA into its orthogonal null space. Furthermore, we introduce a soft projection mechanism that enables smooth control over the trade-off between subject fidelity and style consistency. Experiments show NP-LoRA consistently improves fusion quality over strong baselines (e.g., DINO and CLIP-based metrics, with human and LLM preference scores), and applies broadly across backbones and LoRA pairs without retraining.
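A minimal NumPy sketch of the projection step may help make the idea concrete. Several details here are assumptions rather than the paper's formulation: the principal style directions are taken to be the top-k right singular vectors of the style update, the soft projection is parameterized by a hypothetical `strength` in [0, 1], and the two LoRA updates are random rank-4 matrices.

```python
import numpy as np

def principal_style_directions(delta_style, k):
    """Top-k right singular vectors of the style update, as columns (in_dim, k)."""
    _, _, vt = np.linalg.svd(delta_style, full_matrices=False)
    return vt[:k].T

def project_subject(delta_subject, v_k, strength=1.0):
    """Remove (strength=1) or damp (0 < strength < 1) the subject update along v_k."""
    projector = np.eye(v_k.shape[0]) - strength * (v_k @ v_k.T)
    return delta_subject @ projector

rng = np.random.default_rng(0)
out_dim, in_dim, rank = 12, 16, 4
# Hypothetical LoRA updates, each of the form B @ A with rank 4.
delta_style = rng.normal(size=(out_dim, rank)) @ rng.normal(size=(rank, in_dim))
delta_subject = rng.normal(size=(out_dim, rank)) @ rng.normal(size=(rank, in_dim))

v_k = principal_style_directions(delta_style, k=2)
hard = project_subject(delta_subject, v_k)       # full null-space projection
soft = project_subject(delta_subject, v_k, 0.5)  # softened projection
fused = delta_style + hard                       # merged weight update
```

With the hard projection, the fused update acts on the principal style directions exactly as the style update alone does, so the subject LoRA cannot interfere there; lowering `strength` lets part of the subject signal back in.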
[CV-74] PINGS-X: Physics-Informed Normalized Gaussian Splatting with Axes Alignment for Efficient Super-Resolution of 4D Flow MRI AAAI2026
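[Quick read]: This paper addresses the long scan times required by 4D flow magnetic resonance imaging (4D flow MRI) at high spatiotemporal resolution, that is, how to improve the accuracy of blood flow velocity prediction without significantly increasing acquisition time. Existing methods such as physics-informed neural networks (PINNs) can super-resolve MRI data, but they must be retrained for every patient, which is slow and impractical. The key to the solution is the PINGS-X framework, whose core innovations are: (i) normalized Gaussian splatting with a formal convergence guarantee, (ii) axes-aligned Gaussian representations that simplify training on high-dimensional data while preserving accuracy and convergence, and (iii) a Gaussian merging procedure that prevents degenerate solutions and improves computational efficiency. The method substantially reduces training time and achieves better super-resolution accuracy on computational fluid dynamics (CFD) and real 4D flow MRI datasets.
Link: https://arxiv.org/abs/2511.11048
Authors: Sun Jo, Seok Young Hong, JinHyun Kim, Seungmin Kang, Ahjin Choi, Don-Gwan An, Simon Song, Je Hyeong Hong
Affiliations: 1. Korea University; 2. Seoul National University of Science and Technology; 3. Korea Institute of Science and Technology; 4. KAIST; 5. POSTECH
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at AAAI 2026. Supplementary material included after references. 27 pages, 21 figures, 11 tables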
Abstract:4D flow magnetic resonance imaging (MRI) is a reliable, non-invasive approach for estimating blood flow velocities, vital for cardiovascular diagnostics. Unlike conventional MRI focused on anatomical structures, 4D flow MRI requires high spatiotemporal resolution for early detection of critical conditions such as stenosis or aneurysms. However, achieving such resolution typically results in prolonged scan times, creating a trade-off between acquisition speed and prediction accuracy. Recent studies have leveraged physics-informed neural networks (PINNs) for super-resolution of MRI data, but their practical applicability is limited as the prohibitively slow training process must be performed for each patient. To overcome this limitation, we propose PINGS-X, a novel framework modeling high-resolution flow velocities using axes-aligned spatiotemporal Gaussian representations. Inspired by the effectiveness of 3D Gaussian splatting (3DGS) in novel view synthesis, PINGS-X extends this concept through several non-trivial novel innovations: (i) normalized Gaussian splatting with a formal convergence guarantee, (ii) axes-aligned Gaussians that simplify training for high-dimensional data while preserving accuracy and the convergence guarantee, and (iii) a Gaussian merging procedure to prevent degenerate solutions and boost computational efficiency. Experimental results on computational fluid dynamics (CFD) and real 4D flow MRI datasets demonstrate that PINGS-X substantially reduces training time while achieving superior super-resolution accuracy. Our code and datasets are available at this https URL.
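The abstract does not define "normalized Gaussian splatting", so the sketch below rests on an assumption: each query point's velocity is a weighted average of per-Gaussian velocities, with axis-aligned (diagonal-covariance) Gaussian weights divided by their sum. All names, shapes, and values are hypothetical, and the physics-informed loss and Gaussian merging are not modeled.

```python
import numpy as np

def axis_aligned_weights(x, centers, scales):
    """Unnormalized Gaussian weights with one scale per axis (x, y, z, t)."""
    z = (x[:, None, :] - centers[None, :, :]) / scales[None, :, :]
    return np.exp(-0.5 * (z ** 2).sum(axis=-1))   # shape (n_queries, n_gaussians)

def normalized_splat(x, centers, scales, values):
    """Weighted average of per-Gaussian velocities; weights sum to 1 per query.

    Assumes every query receives a nonzero weight from at least one Gaussian.
    """
    w = axis_aligned_weights(x, centers, scales)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ values                             # shape (n_queries, 3)

rng = np.random.default_rng(0)
centers = rng.uniform(size=(5, 4))    # hypothetical spatiotemporal centers
scales = np.full((5, 4), 0.3)         # hypothetical per-axis standard deviations
values = rng.normal(size=(5, 3))      # hypothetical velocity vectors
queries = rng.uniform(size=(7, 4))
velocity = normalized_splat(queries, centers, scales, values)
constant = normalized_splat(queries, centers, scales, np.ones((5, 3)))
```

Under this assumed form the weights form a partition of unity, so a constant velocity field is reproduced exactly, and axis-aligned scales avoid storing or optimizing a full 4x4 covariance per Gaussian.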
[CV-75] Hyperbolic Hierarchical Alignment Reasoning Network for Text-3D Retrieval AAAI-2026
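[Quick read]: This paper addresses two major challenges in text-3D retrieval: Hierarchy Representation Collapse (HRC) and Redundancy-Induced Saliency Dilution (RISD). HRC compresses abstract-to-specific and whole-to-part hierarchies in Euclidean embedding space, while RISD averages over noisy fragments, weakening key semantic cues and reducing the model's ability to distinguish hard negatives. The key to the solution is the Hyperbolic Hierarchical Alignment Reasoning Network (H²ARN), whose core innovations are: 1) embedding text and 3D data in a Lorentz-model hyperbolic space, whose exponential volume growth naturally preserves hierarchical distances; 2) a hierarchical ordering loss that constructs a shrinking entailment cone around each text vector so that the matched 3D instance falls inside it; and 3) a contribution-aware hyperbolic aggregation module that uses Lorentzian distance to assess the relevance of each local feature and applies weighted aggregation guided by hyperbolic geometry, enhancing discriminative regions and suppressing redundancy without additional supervision.
Link: https://arxiv.org/abs/2511.11045
Authors: Wenrui Li, Yidan Lu, Yeyu Chai, Rui Zhao, Hengyu Man, Xiaopeng Fan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI-2026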
Abstract:With the daily influx of 3D data on the internet, text-3D retrieval has gained increasing attention. However, current methods face two major challenges: Hierarchy Representation Collapse (HRC) and Redundancy-Induced Saliency Dilution (RISD). HRC compresses abstract-to-specific and whole-to-part hierarchies in Euclidean embeddings, while RISD averages noisy fragments, obscuring critical semantic cues and diminishing the model’s ability to distinguish hard negatives. To address these challenges, we introduce the Hyperbolic Hierarchical Alignment Reasoning Network (H²ARN) for text-3D retrieval. H²ARN embeds both text and 3D data in a Lorentz-model hyperbolic space, where exponential volume growth inherently preserves hierarchical distances. A hierarchical ordering loss constructs a shrinking entailment cone around each text vector, ensuring that the matched 3D instance falls within the cone, while an instance-level contrastive loss jointly enforces separation from non-matching samples. To tackle RISD, we propose a contribution-aware hyperbolic aggregation module that leverages Lorentzian distance to assess the relevance of each local feature and applies contribution-weighted aggregation guided by hyperbolic geometry, enhancing discriminative regions while suppressing redundancy without additional supervision. We also release the expanded T3DR-HIT v2 benchmark, which contains 8,935 text-to-3D pairs, 2.6 times the original size, covering both fine-grained cultural artefacts and complex indoor scenes. Our codes are available at this https URL.
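The Lorentz-model distance mentioned above has a standard closed form, sketched here for curvature -1. The `lift` and `lorentz_distance` functions follow the textbook hyperboloid model; `contribution_weights` is a hypothetical stand-in for the aggregation module, assuming a softmax over negative distances with an illustrative `temperature`, which the abstract does not specify.

```python
import math

def lift(v):
    """Map a Euclidean vector onto the hyperboloid <x, x>_L = -1 with x0 > 0."""
    return [math.sqrt(1.0 + sum(c * c for c in v))] + list(v)

def lorentz_inner(x, y):
    """Lorentzian inner product: minus the time component plus the spatial dot product."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lorentz_distance(x, y):
    """Geodesic distance on the hyperboloid; the clamp guards against rounding below 1."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

def contribution_weights(anchor, parts, temperature=1.0):
    """Assumed weighting rule: softmax over negative Lorentzian distances to the anchor."""
    scores = [-lorentz_distance(anchor, p) / temperature for p in parts]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

origin = lift([0.0, 0.0])
far = lift([math.sinh(2.0), 0.0])      # lies at hyperbolic distance 2 from the origin
anchor = lift([0.1, 0.0])
parts = [lift([0.2, 0.0]), lift([3.0, 0.0])]
weights = contribution_weights(anchor, parts)
```

Local features close to the anchor in hyperbolic distance receive larger weights, which is one simple way to realize "contribution-weighted aggregation" without extra supervision.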
[CV-76] SemanticNN: Compressive and Error-Resilient Semantic Offloading for Extremely Weak Devices
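[Quick read]: This paper addresses transmission errors caused by unstable communication channels during collaborative inference between resource-constrained embedded devices and edge nodes. Traditional methods rely on bit-level transmission correctness, which is inefficient under dynamic channel conditions. The paper proposes SemanticNN, whose core idea is to replace bit-level error correction with semantic-level error tolerance, enabling compressive and robust collaborative inference offloading. Key innovations include: (1) a Bit Error Rate (BER)-aware decoder that adapts to dynamic channel conditions, (2) a Soft Quantization (SQ)-based encoder that learns compact feature representations, and (3) Feature-augmentation Learning to improve offloading efficiency, together with an explainable-AI (XAI)-based asymmetry compensation strategy that mitigates the capability mismatch between encoder and decoder. Under strict computational and communication constraints, this reduces feature transmission volume by up to 344.83x while maintaining high inference accuracy.
Link: https://arxiv.org/abs/2511.11038
Authors: Jiaming Huang, Yi Gao, Fuchang Pan, Renjie Li, Wei Dong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: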
Abstract:With the rapid growth of the Internet of Things (IoT), integrating artificial intelligence (AI) on extremely weak embedded devices has garnered significant attention, enabling improved real-time performance and enhanced data privacy. However, the resource limitations of such devices and unreliable network conditions necessitate error-resilient device-edge collaboration systems. Traditional approaches focus on bit-level transmission correctness, which can be inefficient under dynamic channel conditions. In contrast, we propose SemanticNN, a semantic codec that tolerates bit-level errors in pursuit of semantic-level correctness, enabling compressive and resilient collaborative inference offloading under strict computational and communication constraints. It incorporates a Bit Error Rate (BER)-aware decoder that adapts to dynamic channel conditions and a Soft Quantization (SQ)-based encoder to learn compact representations. Building on this architecture, we introduce Feature-augmentation Learning, a novel training strategy that enhances offloading efficiency. To address encoder-decoder capability mismatches from asymmetric resources, we propose XAI-based Asymmetry Compensation to enhance decoding semantic fidelity. We conduct extensive experiments on STM32 using three models and six datasets across image classification and object detection tasks. Experimental results demonstrate that, under varying transmission error rates, SemanticNN significantly reduces feature transmission volume by 56.82-344.83x while maintaining superior inference accuracy.
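To make "varying transmission error rates" concrete, the sketch below simulates a binary symmetric channel that flips each transmitted bit independently with probability equal to the BER. The channel model, function names, and payload are assumptions for illustration; the SemanticNN codec itself is not reproduced here.

```python
import random

def transmit(payload: bytes, ber: float, seed: int = 0) -> bytes:
    """Flip each bit of the payload independently with probability `ber`."""
    rng = random.Random(seed)
    out = bytearray(payload)
    for i in range(len(out)):
        for bit in range(8):
            if rng.random() < ber:
                out[i] ^= 1 << bit
    return bytes(out)

def bit_errors(sent: bytes, received: bytes) -> int:
    """Number of bit positions that differ between two equal-length payloads."""
    return sum(bin(a ^ b).count("1") for a, b in zip(sent, received))

features = bytes(range(250)) * 4            # hypothetical 1000-byte feature payload
noisy = transmit(features, ber=0.1)
observed_ber = bit_errors(features, noisy) / (8 * len(features))
```

A bit-level scheme would need retransmission or error-correcting overhead to undo these flips, whereas a BER-aware decoder in the paper's sense is trained to produce the right prediction from the corrupted payload directly.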
[CV-77] CrossMed: A Multimodal Cross-Task Benchmark for Compositional Generalization in Medical Imaging
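[Quick read]: This paper addresses the open question of how well medical multimodal large language models (MLLMs) generalize compositionally to unseen combinations of imaging modality, anatomy, and task type; their zero-shot transfer across modalities, anatomical regions, and tasks has not been fully explored. The key to the solution is the CrossMed benchmark, built on a structured Modality-Anatomy-Task (MAT) schema, which converts four public medical imaging datasets into a unified visual question answering (VQA) format and defines Related, Unrelated, and zero-overlap test splits to systematically evaluate models under different compositional generalization conditions. Experiments show that traditional models (ResNet-50 and U-Net) gain only modestly, whereas multimodal LLMs (LLaVA-Vicuna-7B and Qwen2-VL-7B) show markedly stronger compositional generalization, including cross-task transfer, supporting the value of the MAT framework and of CrossMed as a rigorous evaluation platform.
Link: https://arxiv.org/abs/2511.11034
Authors: Pooja Singh, Siddhant Ujjain, Tapan Kumar Gandhi, Sandeep Kumar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: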
Abstract:Recent advances in multimodal large language models have enabled unified processing of visual and textual inputs, offering promising applications in general-purpose medical AI. However, their ability to generalize compositionally across unseen combinations of imaging modality, anatomy, and task type remains underexplored. We introduce CrossMed, a benchmark designed to evaluate compositional generalization (CG) in medical multimodal LLMs using a structured Modality-Anatomy-Task (MAT) schema. CrossMed reformulates four public datasets, CheXpert (X-ray classification), SIIM-ACR (X-ray segmentation), BraTS 2020 (MRI classification and segmentation), and MosMedData (CT classification) into a unified visual question answering (VQA) format, resulting in 20,200 multiple-choice QA instances. We evaluate two open-source multimodal LLMs, LLaVA-Vicuna-7B and Qwen2-VL-7B, on both Related and Unrelated MAT splits, as well as a zero-overlap setting where test triplets share no Modality, Anatomy, or Task with the training data. Models trained on Related splits achieve 83.2 percent classification accuracy and 0.75 segmentation cIoU, while performance drops significantly under Unrelated and zero-overlap conditions, demonstrating the benchmark difficulty. We also show cross-task transfer, where segmentation performance improves by 7 percent cIoU even when trained using classification-only data. Traditional models (ResNet-50 and U-Net) show modest gains, confirming the broad utility of the MAT framework, while multimodal LLMs uniquely excel at compositional generalization. CrossMed provides a rigorous testbed for evaluating zero-shot, cross-task, and modality-agnostic generalization in medical vision-language models.
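The zero-overlap condition described above can be expressed as a simple check on MAT triplets. The helper below is a sketch of that definition only; the triplet values are illustrative labels, not the benchmark's actual split files.

```python
def is_zero_overlap(test_triplet, train_triplets):
    """True when the test triplet shares no Modality, Anatomy, or Task with training."""
    for axis in range(3):  # 0 = modality, 1 = anatomy, 2 = task
        seen = {triplet[axis] for triplet in train_triplets}
        if test_triplet[axis] in seen:
            return False
    return True

# Hypothetical training triplets (modality, anatomy, task).
train = [
    ("X-ray", "chest", "classification"),
    ("CT", "chest", "classification"),
]

unseen = ("MRI", "brain", "segmentation")       # shares nothing with training
partial = ("MRI", "brain", "classification")    # shares the task axis
```

A Related or Unrelated split would relax this check to allow overlap on some axes; the exact rules for those splits are defined by the benchmark and are not reproduced here.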
[CV-78] MPCGNet: A Multiscale Feature Extraction and Progressive Feature Aggregation Network Using Coupling Gates for Polyp Segmentation IJCNN2025
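[Quick read]: This paper addresses three challenges in automatic polyp segmentation: small polyps are easily missed, the boundaries between polyps and their surroundings are ambiguous, and colonoscopy images contain noise caused by uneven lighting and other factors. The key to the solution is to introduce **coupling gates** as core components of specific modules for noise filtering and feature importance selection. Three modules are proposed: the coupling gates multiscale feature extraction (CGMFE) module, which extracts local features effectively while suppressing noise; the windows cross attention decoder (WCAD) module, which restores details after precisely locating polyps; and the decoder feature aggregation (DFA) module, which reduces the loss of small polyps through progressive aggregation and feature importance selection. Experiments show that MPCGNet improves mDice over the second-best network by 2.20% and 0.68% on the ETIS-LaribPolypDB and CVC-ColonDB datasets, respectively, supporting the effectiveness of the method.
Link: https://arxiv.org/abs/2511.11032
Authors: Wei Wang, Feng Jiang, Xin Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures, 3 tables. This paper has been accepted by IJCNN 2025 but not published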
Abstract:Automatic polyp segmentation is crucial for assisting doctors in colorectal polyp screening and cancer diagnosis. Despite the progress made by existing methods, polyp segmentation faces several challenges: (1) small-sized polyps are prone to being missed during identification, (2) the boundaries between polyps and the surrounding environment are often ambiguous, and (3) noise in colonoscopy images, caused by uneven lighting and other factors, affects segmentation results. To address these challenges, this paper introduces coupling gates as components in specific modules to filter noise and perform feature importance selection. Three modules are proposed: the coupling gates multiscale feature extraction (CGMFE) module, which effectively extracts local features and suppresses noise; the windows cross attention (WCAD) decoder module, which restores details after capturing the precise location of polyps; and the decoder feature aggregation (DFA) module, which progressively aggregates features, further extracts them, and performs feature importance selection to reduce the loss of small-sized polyps. Experimental results demonstrate that MPCGNet outperforms recent networks, with mDice scores 2.20% and 0.68% higher than the second-best network on the ETIS-LaribPolypDB and CVC-ColonDB datasets, respectively.
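For readers unfamiliar with the reported metric, mDice is the Dice coefficient averaged over test images. The sketch below uses the standard definition on flattened binary masks; treating two empty masks as a perfect match is a convention assumed here, and the masks are made-up examples.

```python
def dice(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flattened 0/1 masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    size = sum(pred) + sum(target)
    return 1.0 if size == 0 else 2.0 * intersection / size

def mean_dice(preds, targets):
    """Average the per-image Dice scores."""
    scores = [dice(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)

preds = [[1, 1, 0, 0], [0, 1, 1, 0]]     # hypothetical predicted masks
targets = [[1, 0, 0, 0], [0, 1, 1, 0]]   # hypothetical ground-truth masks
score = mean_dice(preds, targets)
```

Because Dice normalizes by mask size, a few wrong pixels cost far more on a small polyp than on a large one, which is why small-polyp performance weighs heavily on mDice.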
[CV-79] Accelerating Controllable Generation via Hybrid-grained Cache
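[Quick read]: This paper addresses the low generation efficiency of controllable generative models, which at inference must bear the computational cost of both control conditions and content generation. The key to the solution is a Hybrid-Grained Cache (HGC) that lowers computational overhead by using caches of different granularity at different computational stages: a coarse-grained (block-level) cache based on feature reuse dynamically bypasses redundant computation in encoder-decoder blocks between inference steps, and a fine-grained (prompt-level) cache inside a module reuses cross-attention maps across consecutive inference steps and extends them to the corresponding module computations of adjacent steps, enabling efficient, high-quality controllable image generation.
Link: https://arxiv.org/abs/2511.11031
Authors: Lin Liu, Huixia Ben, Shuo Wang, Jinda Lu, Junxiang Qiu, Shengeng Tang, Yanbin Hao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: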
Abstract:Controllable generative models have been widely used to improve the realism of synthetic visual content. However, such models must handle the computational requirements of both control conditions and content generation, resulting in generally low generation efficiency. To address this issue, we propose a Hybrid-Grained Cache (HGC) approach that reduces computational overhead by adopting cache strategies with different granularities at different computational stages. Specifically, (1) we use a coarse-grained cache (block-level) based on feature reuse to dynamically bypass redundant computations in encoder-decoder blocks between each step of model reasoning. (2) We design a fine-grained cache (prompt-level) that acts within a module, where the fine-grained cache reuses cross-attention maps within consecutive reasoning steps and extends them to the corresponding module computations of adjacent steps. These caches of different granularities can be seamlessly integrated into each computational link of the controllable generation process. We verify the effectiveness of HGC on four benchmark datasets, especially its advantages in balancing generation efficiency and visual quality. For example, on the COCO-Stuff segmentation benchmark, our HGC significantly reduces the computational cost (MACs) by 63% (from 18.22T to 6.70T), while keeping the loss of semantic fidelity (quantized performance degradation) within 1.5%.
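The block-level reuse idea can be sketched with a small wrapper that skips a block when its input has barely changed since the last real computation. The reuse rule (maximum absolute change within a `tolerance`), the class name, and the toy block are assumptions, not the paper's criterion. The last line simply checks the reported MACs reduction from the two figures quoted in the abstract.

```python
class BlockCache:
    """Reuse a block's cached output while its input stays close to the cached input."""

    def __init__(self, block, tolerance):
        self.block = block
        self.tolerance = tolerance
        self.last_input = None
        self.last_output = None
        self.calls = 0                      # number of real block evaluations

    def __call__(self, x):
        if self.last_input is not None:
            change = max(abs(a - b) for a, b in zip(x, self.last_input))
            if change <= self.tolerance:
                return self.last_output     # bypass the redundant computation
        self.calls += 1
        self.last_input = list(x)
        self.last_output = self.block(x)
        return self.last_output

cached = BlockCache(lambda x: [2.0 * v for v in x], tolerance=0.05)
# Hypothetical features at five consecutive inference steps.
outputs = [cached(x) for x in ([1.0], [1.01], [1.02], [1.5], [1.51])]

# Relative MACs reduction implied by the reported 18.22T -> 6.70T.
reduction_percent = round(100.0 * (18.22 - 6.70) / 18.22, 1)
```

Comparing against the input at the last real evaluation, rather than the previous step, keeps slow drift from accumulating unnoticed across many reused steps.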
[CV-80] Algorithms Trained on Normal Chest X-rays Can Predict Health Insurance Types
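[Quick read]: This paper examines the finding that current deep learning models for medical imaging (DenseNet121, SwinV2-B, MedMamba), without any explicit socioeconomic status (SES) information, can still predict a patient's health insurance type (a proxy for SES) from normal chest X-rays with above-chance accuracy (AUC around 0.67 to 0.68), which shows that medical images are not neutral biological data but carry implicit traces of social inequality. A key result from patch-based occlusion analysis is that this social signal is not localized but diffusely distributed over the upper and mid-thoracic regions, suggesting that models may capture subtle patterns related to social stratification such as clinical environments, equipment differences, or care pathways. The core fairness question is therefore no longer only dataset balancing or threshold adjustment, but identifying and disentangling the social fingerprints carried by clinical data itself.
Link: https://arxiv.org/abs/2511.11030
Authors: Chi-Yu Chen, Rawan Abulibdeh, Arash Asgari, Leo Anthony Celi, Deirdre Goode, Hassan Hamidi, Laleh Seyyed-Kalantari, Po-Chih Kuo, Ned McCague, Thomas Sounack
Affiliations: National Taiwan University Hospital; University of Toronto; York University; Massachusetts Institute of Technology; Mass General Brigham; National Tsing Hua University; Dana-Farber Cancer Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitting to MIDL 2026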
Abstract:Artificial intelligence is revealing what medicine never intended to encode. Deep vision models, trained on chest X-rays, can now detect not only disease but also invisible traces of social inequality. In this study, we show that state-of-the-art architectures (DenseNet121, SwinV2-B, MedMamba) can predict a patient’s health insurance type, a strong proxy for socioeconomic status, from normal chest X-rays with significant accuracy (AUC around 0.67 on MIMIC-CXR-JPG, 0.68 on CheXpert). The signal persists even when age, race, and sex are controlled for, and remains detectable when the model is trained exclusively on a single racial group. Patch-based occlusion reveals that the signal is diffuse rather than localized, embedded in the upper and mid-thoracic regions. This suggests that deep networks may be internalizing subtle traces of clinical environments, equipment differences, or care pathways; learning socioeconomic segregation itself. These findings challenge the assumption that medical images are neutral biological data. By uncovering how models perceive and exploit these hidden social signatures, this work reframes fairness in medical AI: the goal is no longer only to balance datasets or adjust thresholds, but to interrogate and disentangle the social fingerprints embedded in clinical data itself.
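Patch-based occlusion, as used above to localize the signal, can be sketched generically: mask one patch at a time and record how much the model's score drops. The scoring function below is a toy stand-in (the mean of the upper half of the image), not any of the paper's models, and the patch size and fill value are assumptions.

```python
import numpy as np

def occlusion_map(image, score_fn, patch, fill=0.0):
    """Score drop caused by occluding each non-overlapping patch of the image."""
    h, w = image.shape
    base = score_fn(image)
    drops = np.zeros((h // patch, w // patch))
    for i in range(h // patch):
        for j in range(w // patch):
            occluded = image.copy()
            occluded[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = fill
            drops[i, j] = base - score_fn(occluded)
    return drops

def toy_score(im):
    """Hypothetical model that only looks at the upper half of the image."""
    return float(im[:2].mean())

image = np.ones((4, 4))
drops = occlusion_map(image, toy_score, patch=2)
```

A localized signal would show one or two patches with large drops, while a diffuse signal of the kind the paper reports would spread smaller drops across many patches.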
[CV-81] EmbryoDiff: A Conditional Diffusion Framework with Multi-Focal Feature Fusion for Fine-Grained Embryo Developmental Stage Recognition
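[Quick read]: This paper addresses the accuracy of fine-grained embryo developmental stage recognition during in vitro fertilization (IVF). Existing discriminative models do not exploit the distributional prior of embryonic development and rely on single-focal information, which yields incomplete feature representations that become ambiguous under cell occlusion. The key to the solution is EmbryoDiff, a two-stage diffusion framework: it first trains and freezes a frame-level encoder to extract robust multi-focal features; it then applies a multi-focal feature fusion strategy to build a 3D-aware morphological representation that alleviates occlusion-induced ambiguity; finally, based on the fused representation, a hybrid semantic-boundary condition block injects complementary semantic and boundary cues into the diffusion denoising process, enabling accurate embryo stage classification.
Link: https://arxiv.org/abs/2511.11027
Authors: Yong Sun, Zhengjie Zhang, Junyu Shi, Zhiyuan Zhang, Lijiang Liu, Qiang Nie
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: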
Abstract:Identification of fine-grained embryo developmental stages during In Vitro Fertilization (IVF) is crucial for assessing embryo viability. Although recent deep learning methods have achieved promising accuracy, existing discriminative models fail to utilize the distributional prior of embryonic development to improve accuracy. Moreover, their reliance on single-focal information leads to incomplete embryonic representations, making them susceptible to feature ambiguity under cell occlusions. To address these limitations, we propose EmbryoDiff, a two-stage diffusion-based framework that formulates the task as a conditional sequence denoising process. Specifically, we first train and freeze a frame-level encoder to extract robust multi-focal features. In the second stage, we introduce a Multi-Focal Feature Fusion Strategy that aggregates information across focal planes to construct a 3D-aware morphological representation, effectively alleviating ambiguities arising from cell occlusions. Building on this fused representation, we derive complementary semantic and boundary cues and design a Hybrid Semantic-Boundary Condition Block to inject them into the diffusion-based denoising process, enabling accurate embryonic stage classification. Extensive experiments on two benchmark datasets show that our method achieves state-of-the-art results. Notably, with only a single denoising step, our model obtains the best average test performance, reaching 82.8% and 81.3% accuracy on the two datasets, respectively.
[CV-82] AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning
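[Quick read]: This paper addresses the lack of systematic benchmarks for evaluating Multimodal Large Language Models (MLLMs) on multi-agent collaborative perception, especially in complex real-world scenarios with degraded perception. Existing benchmarks mainly evaluate basic perception tasks on high-quality single-agent images and therefore cannot measure performance in multi-drone collaboration, particularly when perception is degraded by real-world conditions. To address this, the authors propose AirCopBench, the first comprehensive benchmark for embodied aerial collaborative perception. Its key contributions are a dataset of more than 14.6k questions covering 14 task types across four dimensions (Scene Understanding, Object Understanding, Perception Assessment, and Collaborative Decision), built from both simulator and real-world data with model-, rule-, and human-based question generation under rigorous quality control, enabling comprehensive evaluation of MLLMs in complex collaborative scenarios.
Link: https://arxiv.org/abs/2511.11025
Authors: Jirong Zha, Yuxuan Fan, Tianyu Zhang, Geng Chen, Yingfeng Chen, Chen Gao, Xinlei Chen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: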
Abstract:Multimodal Large Language Models (MLLMs) have shown promise in single-agent vision tasks, yet benchmarks for evaluating multi-agent collaborative perception remain scarce. This gap is critical, as multi-drone systems provide enhanced coverage, robustness, and collaboration compared to single-sensor setups. Existing multi-image benchmarks mainly target basic perception tasks using high-quality single-agent images, thus failing to evaluate MLLMs in more complex, egocentric collaborative scenarios, especially under real-world degraded perception conditions. To address these challenges, we introduce AirCopBench, the first comprehensive benchmark designed to evaluate MLLMs in embodied aerial collaborative perception under challenging perceptual conditions. AirCopBench includes 14.6k+ questions derived from both simulator and real-world data, spanning four key task dimensions: Scene Understanding, Object Understanding, Perception Assessment, and Collaborative Decision, across 14 task types. We construct the benchmark using data from challenging degraded-perception scenarios with annotated collaborative events, generating large-scale questions through model-, rule-, and human-based methods under rigorous quality control. Evaluations on 40 MLLMs show significant performance gaps in collaborative perception tasks, with the best model trailing humans by 24.38% on average and exhibiting inconsistent results across tasks. Fine-tuning experiments further confirm the feasibility of sim-to-real transfer in aerial collaborative perception and reasoning.
[CV-83] SUPER Decoder Block for Reconstruction-Aware U-Net Variants
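[Quick read]: This paper addresses the insufficient recovery of high-frequency detail in skip-connected encoder-decoder architectures (U-Net variants) applied to inverse problems, which results from information loss. The key to the solution is the Selectively Suppressed Perfect Reconstruction (SUPER) decoder block, which exploits the perfect reconstruction (PR) property of wavelets and, free from rigid framelet constraints, selectively suppresses (SS) redundant features to prevent information degradation, improving fidelity to high-frequency detail and overall representational capacity. As a plug-and-play decoder block, SUPER integrates into diverse U-Net variants, removing their intrinsic reconstruction bottleneck and enriching representational diversity and structural coherence at comparable computational cost.
Link: https://arxiv.org/abs/2511.11015
Authors: Siheon Joo, Hongjo Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages. Under review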
Abstract:Skip-connected encoder-decoder architectures (U-Net variants) are widely adopted for inverse problems but still suffer from information loss, limiting recovery of fine high-frequency details. We present Selectively Suppressed Perfect Reconstruction (SUPER), which exploits the perfect reconstruction (PR) property of wavelets to prevent information degradation while selectively suppressing (SS) redundant features. Free from rigid framelet constraints, SUPER serves as a plug-and-play decoder block for diverse U-Net variants, eliminating their intrinsic reconstruction bottlenecks and enhancing representational richness. Experiments across diverse crack benchmarks, including state-of-the-art (SOTA) models, demonstrate the structural potential of the proposed SUPER Decoder Block. Maintaining comparable computational cost, SUPER enriches representational diversity through increased parameterization. In small-scale in-domain experiments on the CrackVision12K dataset, SUPER markedly improves thin-crack segmentation performance, particularly for cracks narrower than 4 px, underscoring its advantage in high-frequency dominant settings. In smartphone image denoising on SIDD, where low-frequency components prevail, SUPER still achieves a moderate gain in PSNR, confirming its robustness across low- and high-frequency regimes. These results validate its plug-and-play generality across U-Net variants, achieving high-frequency fidelity and global coherence within a unified, reconstruction-aware framework.
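The perfect reconstruction property that SUPER builds on can be demonstrated with a one-level orthonormal Haar transform. This sketch illustrates only the PR property of wavelets, not the SUPER block: the 1D signal is made up, and zeroing the whole high-frequency band is a crude stand-in for information loss rather than the paper's selective suppression.

```python
import numpy as np

def haar_analysis(x):
    """Split an even-length signal into low- and high-frequency Haar bands."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

def haar_synthesis(low, high):
    """Invert haar_analysis exactly by interleaving the recovered samples."""
    x = np.empty(low.size * 2)
    x[0::2] = (low + high) / np.sqrt(2)
    x[1::2] = (low - high) / np.sqrt(2)
    return x

signal = np.array([4.0, 2.0, 5.0, 5.0, 1.0, 7.0, 0.0, 3.0])
low, high = haar_analysis(signal)
restored = haar_synthesis(low, high)                   # both bands kept
blurred = haar_synthesis(low, np.zeros_like(high))     # high band discarded
```

Keeping both bands recovers the signal exactly, while discarding the high band averages each sample pair, which is the kind of fine-detail loss that matters for thin cracks.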
[CV-84] SP-Guard: Selective Prompt-adaptive Guidance for Safe Text-to-Image Generation ECAI2025
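[Quick read]: This paper addresses the safety problem that diffusion-based text-to-image (T2I) models, while producing high-quality images, can easily be used to create harmful content. Existing inference-time guidance methods have two key shortcomings: a lack of adaptivity (they cannot adjust guidance strength to the prompt) and a lack of selectivity (they guide the whole image uniformly rather than only unsafe regions). The core of the proposed SP-Guard is to first estimate prompt harmfulness and then generate a selective guidance mask so that guidance is applied only to potentially unsafe regions, improving safety while minimizing unintended changes elsewhere.
Link: https://arxiv.org/abs/2511.11014
Authors: Sumin Yu, Taesup Moon
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: Accepted for presentation at TRUST-AI Workshop, ECAI 2025. Proceedings to appear in CEUR-WS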
Abstract:While diffusion-based T2I models have achieved remarkable image generation quality, they also enable easy creation of harmful content, raising social concerns and highlighting the need for safer generation. Existing inference-time guiding methods lack both adaptivity–adjusting guidance strength based on the prompt–and selectivity–targeting only unsafe regions of the image. Our method, SP-Guard, addresses these limitations by estimating prompt harmfulness and applying a selective guidance mask to guide only unsafe areas. Experiments show that SP-Guard generates safer images than existing methods while minimizing unintended content alteration. Beyond improving safety, our findings highlight the importance of transparency and controllability in image generation.
[CV-85] Unsupervised Robust Domain Adaptation: Paradigm Theory and Algorithm
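[Quick read]: This paper addresses the insufficient robustness of unsupervised domain adaptation (UDA) models against adversarial attacks. Existing UDA methods emphasize transferability but overlook stability under adversarial perturbation, and although vanilla adversarial training (VAT) improves the robustness of deep networks, it has little effect within UDA. The authors show that combining conventional UDA with VAT suffers from an inherent entanglement challenge that prevents the two from being optimized jointly. They therefore propose a new unsupervised robust domain adaptation (URDA) paradigm and derive its generalization bound, covering both adversarial noise and domain shift. The key to the solution is a two-step training procedure, Disentangled Adversarial Robustness Training (DART), which first pre-trains an arbitrary UDA model and then strengthens adversarial robustness through a disentangled, instantaneous robustification post-training step, achieving both transferability and robustness without complex modifications.
Link: https://arxiv.org/abs/2511.11009
Authors: Fuxiang Huang, Xiaowei Fu, Shiyu Ye, Lina Ma, Wen Li, Xinbo Gao, David Zhang, Lei Zhang
Affiliations: 1. University of Science and Technology of China; 2. Tsinghua University; 3. Institute of Automation, Chinese Academy of Sciences; 4. Beijing Institute of Technology; 5. Northeastern University; 6. University of Hong Kong
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear in IJCV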
Abstract:Unsupervised domain adaptation (UDA) aims to transfer knowledge from a label-rich source domain to an unlabeled target domain by addressing domain shifts. Most UDA approaches emphasize transfer ability, but often overlook robustness against adversarial attacks. Although vanilla adversarial training (VAT) improves the robustness of deep neural networks, it has little effect on UDA. This paper focuses on answering three key questions: 1) Why does VAT, known for its defensive effectiveness, fail in the UDA paradigm? 2) What is the generalization bound theory under attacks and how does it evolve from classical UDA theory? 3) How can we implement a robustification training procedure without complex modifications? Specifically, we explore and reveal the inherent entanglement challenge in general UDA+VAT paradigm, and propose an unsupervised robust domain adaptation (URDA) paradigm. We further derive the generalization bound theory of the URDA paradigm so that it can resist adversarial noise and domain shift. To the best of our knowledge, this is the first time to establish the URDA paradigm and theory. We further introduce a simple, novel yet effective URDA algorithm called Disentangled Adversarial Robustness Training (DART), a two-step training procedure that ensures both transferability and robustness. DART first pre-trains an arbitrary UDA model, and then applies an instantaneous robustification post-training step via disentangled this http URL on four benchmark datasets with/without attacks show that DART effectively enhances robustness while maintaining domain adaptability, and validate the URDA paradigm and theory.
[CV-86] VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models
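[Quick read]: This paper addresses the performance limits that vision-language models (VLMs) hit on complex visual tasks because of a "visual processing bottleneck": during prolonged generation the model tends to lose its grounding in visual evidence and lacks contextualized visual experience. The key to the solution is VisMem, a framework inspired by human cognitive memory theory that adds dynamic latent vision memories in two modules: a short-term module for fine-grained perceptual retention (short-term, visually dominant memory) and a long-term module for abstract semantic consolidation (long-term, semantically dominant memory). Both modules are invoked during inference so that the VLM keeps perceptual fidelity and semantic consistency across thinking and generation, giving an average performance gain of 11.8% over the vanilla model on understanding, reasoning, and generation benchmarks.
Link: https://arxiv.org/abs/2511.11007
Authors: Xinlei Yu, Chengming Xu, Guibin Zhang, Zhangquan Chen, Yudong Zhang, Yongbo He, Peng-Tao Jiang, Jiangning Zhang, Xiaobin Hu, Shuicheng Yan
Affiliations: National University of Singapore; Fudan University; Tsinghua University; Zhejiang University; University of Science and Technology of China; vivo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: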
Abstract:Despite the remarkable success of Vision-Language Models (VLMs), their performance on a range of complex visual tasks is often hindered by a “visual processing bottleneck”: a propensity to lose grounding in visual evidence and exhibit a deficit in contextualized visual experience during prolonged generation. Drawing inspiration from human cognitive memory theory, which distinguishes short-term visually-dominant memory and long-term semantically-dominant memory, we propose VisMem, a cognitively-aligned framework that equips VLMs with dynamic latent vision memories, a short-term module for fine-grained perceptual retention and a long-term module for abstract semantic consolidation. These memories are seamlessly invoked during inference, allowing VLMs to maintain both perceptual fidelity and semantic consistency across thinking and generation. Extensive experiments across diverse visual benchmarks for understanding, reasoning, and generation reveal that VisMem delivers a significant average performance boost of 11.8% relative to the vanilla model and outperforms all counterparts, establishing a new paradigm for latent-space memory enhancement. The code will be available: this https URL.
[CV-87] Draft and Refine with Visual Experts
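[Quick read]: This paper addresses the tendency of large vision-language models (LVLMs) to rely too heavily on linguistic priors instead of visual evidence in multimodal reasoning, which produces ungrounded or hallucinated responses. The key to the solution is Draft and Refine (DnR), an agent framework driven by a question-conditioned utilization metric that quantifies how much the model depends on visual information: it first builds a query-conditioned relevance map to localize question-specific visual cues, then measures dependence through relevance-guided probabilistic masking. Guided by this metric, the agent refines its initial answer using targeted feedback from external visual experts (such as boxes or masks), which strengthens visual grounding without retraining or architectural changes.
Link: https://arxiv.org/abs/2511.11005
Authors: Sungheon Jeong, Ryozo Masukawa, Jihong Park, Sanggeon Yun, Wenjun Huang, Hanning Chen, Mahdi Imani, Mohsen Imani
Affiliations: University of California, Irvine; Northeastern University; MOLOCO
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: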
Abstract:While recent Large Vision-Language Models (LVLMs) exhibit strong multimodal reasoning abilities, they often produce ungrounded or hallucinated responses because they rely too heavily on linguistic priors instead of visual evidence. This limitation highlights the absence of a quantitative measure of how much these models actually use visual information during reasoning. We propose Draft and Refine (DnR), an agent framework driven by a question-conditioned utilization metric. The metric quantifies the model’s reliance on visual evidence by first constructing a query-conditioned relevance map to localize question-specific cues and then measuring dependence through relevance-guided probabilistic masking. Guided by this metric, the DnR agent refines its initial draft using targeted feedback from external visual experts. Each expert’s output (such as boxes or masks) is rendered as visual cues on the image, and the model is re-queried to select the response that yields the largest improvement in utilization. This process strengthens visual grounding without retraining or architectural changes. Experiments across VQA and captioning benchmarks show consistent accuracy gains and reduced hallucination, demonstrating that measuring visual utilization provides a principled path toward more interpretable and evidence-driven multimodal agent systems.
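The selection step described in the abstract (re-query with each expert's visual cues, then keep the response with the largest improvement in utilization) can be sketched roughly as follows. This is only an illustration under assumptions: the function name `select_refinement`, the `(response, utilization)` pair format, and the rule of keeping the draft when no expert improves utilization are hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def select_refinement(
    draft: str,
    draft_utilization: float,
    candidates: List[Tuple[str, float]],
) -> str:
    """Pick the response with the largest gain in the utilization metric.

    `candidates` holds one (response, utilization) pair per visual expert,
    obtained by re-querying the model on the image with that expert's cues.
    Assumption: if no candidate improves on the draft, the draft is kept.
    """
    best_response = draft
    best_gain = 0.0
    for response, utilization in candidates:
        gain = utilization - draft_utilization
        if gain > best_gain:
            best_response, best_gain = response, gain
    return best_response
```

How the utilization score itself is computed (relevance map plus probabilistic masking) is left out here, since the abstract does not give enough detail to reproduce it.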
[CV-88] MeCaMIL: Causality-Aware Multiple Instance Learning for Fair and Interpretable Whole Slide Image Diagnosis
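[Quick read]: This paper addresses two key problems of multiple instance learning (MIL) for whole slide image (WSI) analysis in computational pathology: existing methods rely on attention mechanisms that lack causal interpretability, and they fail to integrate patient demographics (age, gender, race), which raises fairness concerns and may exacerbate health disparities. The key to the solution is MeCaMIL, a causality-aware MIL framework that explicitly models demographic confounders through structured causal graphs and uses do-calculus and collider structures for causal inference, separating disease-relevant signals from spurious demographic correlations. The method reports state-of-the-art diagnostic performance on three benchmarks, reduces demographic disparity variance by over 65% on average, and generalizes to survival prediction with a higher C-index than the best baseline.
Link: https://arxiv.org/abs/2511.11004
Authors: Yiran Song, Yikai Zhang, Shuang Zhou, Guojun Xiong, Xiaofeng Yang, Nian Wang, Fenglong Ma, Rui Zhang, Mingquan Lin
Affiliations: University of Minnesota; Harvard University; Emory University; University of Texas Southwestern Medical Center; The Pennsylvania State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 5 figures, 8 tables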
Abstract: Multiple instance learning (MIL) has emerged as the dominant paradigm for whole slide image (WSI) analysis in computational pathology, achieving strong diagnostic performance through patch-level feature aggregation. However, existing MIL methods face critical limitations: (1) they rely on attention mechanisms that lack causal interpretability, and (2) they fail to integrate patient demographics (age, gender, race), leading to fairness concerns across diverse populations. These shortcomings hinder clinical translation, where algorithmic bias can exacerbate health disparities. We introduce MeCaMIL, a causality-aware MIL framework that explicitly models demographic confounders through structured causal graphs. Unlike prior approaches treating demographics as auxiliary features, MeCaMIL employs principled causal inference, leveraging do-calculus and collider structures, to disentangle disease-relevant signals from spurious demographic correlations. Extensive evaluation on three benchmarks demonstrates state-of-the-art performance across CAMELYON16 (ACC/AUC/F1: 0.939/0.983/0.946), TCGA-Lung (0.935/0.979/0.931), and TCGA-Multi (0.977/0.993/0.970, five cancer types). Critically, MeCaMIL achieves superior fairness: demographic disparity variance drops by over 65% on average across attributes, with notable improvements for underserved populations. The framework generalizes to survival prediction (mean C-index: 0.653, +0.017 over best baseline across five cancer types). Ablation studies confirm causal graph structure is essential: alternative designs yield 0.048 lower accuracy and 4.2x worse fairness. These results establish MeCaMIL as a principled framework for fair, interpretable, and clinically actionable AI in digital pathology. Code will be released upon acceptance.
[CV-89] EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation AAAI2026
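[Quick read]: This paper addresses the neglect of the affective dimension in current video generation systems, which rely mainly on low-level visual metrics (such as PSNR and SSIM) and do not model or steer emotional perception. The key to the solution is EmoVid, the first multimodal, emotion-annotated video dataset aimed at creative media. Using this dataset, the authors uncover spatial and temporal patterns that link visual features (brightness, colorfulness, hue) to emotional perception, and then build an emotion-conditioned video generation technique by fine-tuning the Wan2.1 model. The resulting text-to-video and image-to-video generation is emotion-controllable and improves on both quantitative metrics and visual quality.
Link: https://arxiv.org/abs/2511.11002
Authors: Zongyang Qiu, Bingyuan Wang, Xingbei Chen, Yingqing He, Zeyu Wang
Affiliations: 1. University of Science and Technology of China; 2. Institute of Artificial Intelligence, University of Science and Technology of China; 3. School of Computer Science and Technology, University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 12 figures. Accepted as an Oral presentation at AAAI 2026. For code and dataset, see this https URL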
Abstract:Emotion plays a pivotal role in video-based expression, but existing video generation systems predominantly focus on low-level visual metrics while neglecting affective dimensions. Although emotion analysis has made progress in the visual domain, the video community lacks dedicated resources to bridge emotion understanding with generative tasks, particularly for stylized and non-realistic contexts. To address this gap, we introduce EmoVid, the first multimodal, emotion-annotated video dataset specifically designed for creative media, which includes cartoon animations, movie clips, and animated stickers. Each video is annotated with emotion labels, visual attributes (brightness, colorfulness, hue), and text captions. Through systematic analysis, we uncover spatial and temporal patterns linking visual features to emotional perceptions across diverse video forms. Building on these insights, we develop an emotion-conditioned video generation technique by fine-tuning the Wan2.1 model. The results show a significant improvement in both quantitative metrics and the visual quality of generated videos for text-to-video and image-to-video tasks. EmoVid establishes a new benchmark for affective video computing. Our work not only offers valuable insights into visual emotion analysis in artistically styled videos, but also provides practical methods for enhancing emotional expression in video generation.
[CV-90] PROMISE: Prompt-Attentive Hierarchical Contrastive Learning for Robust Cross-Modal Representation with Missing Modalities AAAI’2026
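[Quick read]: This paper addresses the sharp performance drop of multimodal models in real-world settings where some modalities are missing. The root cause is inconsistent representation learning between complete multimodal data and incomplete-modality scenarios. Existing methods typically handle missing modalities with relatively simple generation strategies that fail to preserve cross-modal consistency, which hurts overall performance. To overcome this, the authors propose PROMISE (PROMpting-Attentive HIerarchical ContraStive LEarning), a framework that integrates multimodal prompt learning into a hierarchical contrastive learning architecture and adds a dedicated prompt-attention mechanism. This mechanism dynamically generates robust and consistent representations when particular modalities are absent, bridging the representational gap between complete and incomplete data.
Link: https://arxiv.org/abs/2511.10997
Authors: Jiajun Chen, Sai Cheng, Yutao Yuan, Yirui Zhang, Haitao Yuan, Peng Peng, Yi Zhong
Affiliations: 1. Tsinghua University; 2. Peking University; 3. Chinese Academy of Sciences; 4. Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by AAAI’2026 Main Conference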
Abstract:Multimodal models integrating natural language and visual information have substantially improved generalization of representation models. However, their effectiveness significantly declines in real-world situations where certain modalities are missing or unavailable. This degradation primarily stems from inconsistent representation learning between complete multimodal data and incomplete modality scenarios. Existing approaches typically address missing modalities through relatively simplistic generation methods, yet these approaches fail to adequately preserve cross-modal consistency, leading to suboptimal performance. To overcome this limitation, we propose a novel multimodal framework named PROMISE, a PROMpting-Attentive HIerarchical ContraStive LEarning approach designed explicitly for robust cross-modal representation under conditions of missing modalities. Specifically, PROMISE innovatively incorporates multimodal prompt learning into a hierarchical contrastive learning framework, equipped with a specially designed prompt-attention mechanism. This mechanism dynamically generates robust and consistent representations for scenarios where particular modalities are absent, thereby effectively bridging the representational gap between complete and incomplete data. Extensive experiments conducted on benchmark datasets, along with comprehensive ablation studies, clearly demonstrate the superior performance of PROMISE compared to current state-of-the-art multimodal methods.
[CV-91] CLUE: Controllable Latent space of Unprompted Embeddings for Diversity Management in Text-to-Image Synthesis
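[Quick read]: This paper addresses unstable and insufficiently diverse image generation by generative AI in specialized domains (such as medical imaging), where data are scarce and lack variety. The key to the solution is CLUE (Controllable Latent space of Unprompted Embeddings), a framework built on the Stable Diffusion architecture. CLUE adds a Style Encoder that extracts style embeddings from images and prompts and feeds them into a new second attention layer in the U-Net. A Kullback-Leibler divergence term constrains the latent space so that image features are represented continuously within Gaussian regions, which enables stable yet diverse generation without any additional data.
Link: https://arxiv.org/abs/2511.10993
Authors: Keunwoo Park, Jihye Chae, Joong Ho Ahn, Jihoon Kweon
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: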
Abstract:Text-to-image synthesis models require the ability to generate diverse images while maintaining stability. To overcome this challenge, a number of methods have been proposed, including the collection of prompt-image datasets and the integration of additional data modalities during training. Although these methods have shown promising results in general domains, they face limitations when applied to specialized fields such as medicine, where only limited types and insufficient amounts of data are available. We present CLUE (Controllable Latent space of Unprompted Embeddings), a generative model framework that achieves diverse generation while maintaining stability through fixed-format prompts without requiring any additional data. Based on the Stable Diffusion architecture, CLUE employs a Style Encoder that processes images and prompts to generate style embeddings, which are subsequently fed into a new second attention layer of the U-Net architecture. Through Kullback-Leibler divergence, the latent space achieves continuous representation of image features within Gaussian regions, independent of prompts. Performance was assessed on otitis media dataset. CLUE reduced FID to 9.30 (vs. 46.81) and improved recall to 70.29% (vs. 49.60%). A classifier trained on synthetic-only data at 1000% scale achieved an F1 score of 83.21% (vs. 73.83%). Combining synthetic data with equal amounts of real data achieved an F1 score of 94.76%, higher than when using only real data. On an external dataset, synthetic-only training achieved an F1 score of 76.77% (vs. 60.61%) at 1000% scale. The combined approach achieved an F1 score of 85.78%, higher than when using only the internal dataset. These results demonstrate that CLUE enables diverse yet stable image generation from limited datasets and serves as an effective data augmentation method for domain-specific applications.
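The abstract only states that a Kullback-Leibler divergence shapes the latent space into Gaussian regions. One common way to realize such a constraint, assumed here rather than confirmed by the paper, is the VAE-style closed-form KL between a diagonal Gaussian N(mu, sigma^2) and the standard normal. The function name `kl_to_standard_normal` and the use of a log-variance parameterization are illustrative choices.

```python
import math
from typing import Sequence


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    Per dimension: 0.5 * (mu^2 + sigma^2 - 1 - log sigma^2).
    Assumption: the style embedding is parameterized as a diagonal Gaussian;
    the actual prior and parameterization used by CLUE may differ.
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )
```

The term is zero exactly when the embedding distribution already matches the standard normal, and it grows as the mean drifts away from the origin or the variance departs from one.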
[CV-92] Rethinking Autoregressive Models for Lossless Image Compression via Hierarchical Parallelism and Progressive Adaptation
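[Quick read]: This paper addresses the widespread view that autoregressive (AR) models are impractical for learned lossless image compression because of their computational cost. The core of the solution is an efficient framework built on hierarchical parallelism and progressive adaptation, embodied in the Hierarchical Parallel Autoregressive ConvNet (HPAC), which uses a hierarchical factorized structure and content-aware convolutional gating to model spatial dependencies with very few parameters. Key contributions include Cache-then-Select Inference (CSI), which speeds up coding by eliminating redundant computation, and Adaptive Focus Coding (AFC), which extends the framework to high bit-depth images. In addition, Spatially-Aware Rate-Guided Progressive Fine-tuning (SARP-FT) adapts the model to each test image by optimizing low-rank adapters on progressively larger, spatially continuous regions selected by estimated information density. Together these give strong compression performance with a small parameter count and competitive coding speed.
Link: https://arxiv.org/abs/2511.10991
Authors: Daxin Li, Yuanchao Bai, Kai Wang, Wenbo Zhao, Junjun Jiang, Xianming Liu
Affiliations: Harbin Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages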
Abstract:Autoregressive (AR) models, the theoretical performance benchmark for learned lossless image compression, are often dismissed as impractical due to prohibitive computational cost. This work re-thinks this paradigm, introducing a framework built on hierarchical parallelism and progressive adaptation that re-establishes pure autoregression as a top-performing and practical solution. Our approach is embodied in the Hierarchical Parallel Autoregressive ConvNet (HPAC), an ultra-lightweight pre-trained model using a hierarchical factorized structure and content-aware convolutional gating to efficiently capture spatial dependencies. We introduce two key optimizations for practicality: Cache-then-Select Inference (CSI), which accelerates coding by eliminating redundant computations, and Adaptive Focus Coding (AFC), which efficiently extends the framework to high bit-depth images. Building on this efficient foundation, our progressive adaptation strategy is realized by Spatially-Aware Rate-Guided Progressive Fine-tuning (SARP-FT). This instance-level strategy fine-tunes the model for each test image by optimizing low-rank adapters on progressively larger, spatially-continuous regions selected via estimated information density. Experiments on diverse datasets (natural, satellite, medical) validate that our method achieves new state-of-the-art compression. Notably, our approach sets a new benchmark in learned lossless compression, showing a carefully designed AR framework can offer significant gains over existing methods with a small parameter count and competitive coding speeds.
[CV-93] Binary Verification for Zero-Shot Vision
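[Quick read]: This paper addresses the difficulty of answering open-ended queries accurately in zero-shot vision, especially on referring expression grounding, spatial reasoning, and BLINK-Jigsaw, where directly querying off-the-shelf vision-language models (VLMs) works poorly. The key to the solution is a training-free binary verification workflow with two steps. The first, quantization, turns the open-ended query into a multiple-choice question (MCQ) with an explicit candidate list. The second, binarization, asks one True/False question per candidate and resolves the answer with Boolean logic: if exactly one candidate is True, it is selected; otherwise the workflow falls back to an MCQ over the remaining plausible candidates. This structured inference procedure improves zero-shot performance across several tasks, which suggests that inference-time design can matter more than task-specific training.
Link: https://arxiv.org/abs/2511.10983
Authors: Jeffrey Liu, Rongbin Hu
Affiliations: mycube.tv
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: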
Abstract:We propose a training-free, binary verification workflow for zero-shot vision with off-the-shelf VLMs. It comprises two steps: (i) quantization, which turns the open-ended query into a multiple-choice question (MCQ) with a small, explicit list of unambiguous candidates; and (ii) binarization, which asks one True/False question per candidate and resolves deterministically: if exactly one is True, select it; otherwise, revert to an MCQ over the remaining plausible candidates. We evaluate the workflow on referring expression grounding (REC), spatial reasoning (Spatial-Map, Spatial-Grid, Spatial-Maze), and BLINK-Jigsaw. Relative to answering open-ended queries directly, quantization to MCQ yields large gains, and True/False binarization provides a consistent additional boost. Across all tasks, the same workflow produces significant improvements, indicating generality. Our theory formalizes how open-ended vision queries can be quantized to MCQs and further binarized into True/False verifications, establishing a hardness ladder. A simple analysis explains why Boolean resolution boosts accuracy. Together, these components yield a simple and unified workflow that emphasizes inference-time design over task-specific training. It offers a practical, drop-in path to stronger zero-shot vision with today’s VLMs.
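The binarization and resolution rule in the abstract maps naturally onto a few lines of control flow. The sketch below is a minimal illustration, not the authors' code: `is_true` and `choose_mcq` are hypothetical callbacks standing in for a True/False query and an MCQ query to a VLM, and treating "no candidate accepted" as "all candidates remain plausible" is an assumption, since the abstract does not spell out that case.

```python
from typing import Callable, List, Sequence


def binary_verify(
    candidates: Sequence[str],
    is_true: Callable[[str], bool],
    choose_mcq: Callable[[List[str]], str],
) -> str:
    """Resolve a quantized query with per-candidate True/False checks."""
    # Binarization: ask one True/False question per candidate.
    accepted = [c for c in candidates if is_true(c)]
    # Deterministic resolution: exactly one True answer wins outright.
    if len(accepted) == 1:
        return accepted[0]
    # Otherwise revert to an MCQ over the remaining plausible candidates.
    # Assumption: when nothing was accepted, every candidate stays in play.
    remaining = accepted if accepted else list(candidates)
    return choose_mcq(remaining)
```

With stub callbacks, `binary_verify(["left", "right"], lambda c: c == "right", lambda cs: cs[0])` selects "right" without ever calling the MCQ fallback.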
[CV-94] PAS: A Training-Free Stabilizer for Temporal Encoding in Video LLMs
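[Quick read]: This paper addresses temporal inconsistency in video large language models (Video LLMs): small shifts in frame timing can flip attention and suppress relevant frames. The authors trace the instability to the multimodal RoPE commonly used to extend Rotary Position Embeddings (RoPE) to video. The induced inverse Fourier time kernel shows frame-scale ripples that weight adjacent frames by different factors, perturbing attention that should be governed by the raw query-key inner product. The key to the solution is Phase Aggregated Smoothing (PAS), a training-free mechanism that applies small opposed phase offsets across attention heads and aggregates their outputs. Without changing the positional encoding structure, this smooths the temporal kernel and reduces phase sensitivity while preserving each head's spectrum magnitude. The analysis shows that the RoPE-rotated logit can be approximated as a content dot product scaled by a time kernel, that smoothing this kernel gives Lipschitz stability of attention to small temporal shifts, and that multi-phase averaging attenuates high-frequency ripples while preserving per-head spectra under Nyquist-valid sampling. Experiments on several video understanding benchmarks show consistent improvements with negligible computational overhead, making PAS a plug-and-play upgrade.
Link: https://arxiv.org/abs/2511.10979
Authors: Bowen Sun, Yujun Cai, Ming-Hsuan Yang, Hang Wu, Yiwei Wang
Affiliations: University of California, Merced; The University of Queensland; Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 5 figures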
Abstract:Video LLMs suffer from temporal inconsistency: small shifts in frame timing can flip attention and suppress relevant frames. We trace this instability to the common extension of Rotary Position Embeddings to video through multimodal RoPE. The induced inverse Fourier time kernel exhibits frame-scale ripples that multiply adjacent frames by different factors, which perturbs attention that should otherwise be governed by the raw query key inner product. We present Phase Aggregated Smoothing (PAS), a simple, training-free mechanism that applies small opposed phase offsets across heads and then aggregates their outputs. PAS preserves the per-head spectrum magnitude, while the aggregation effectively smooths the temporal kernel and reduces phase sensitivity without changing the positional encoding structure. Our analysis shows that the RoPE rotated logit can be approximated as a content dot product scaled by a time kernel; smoothing this kernel yields Lipschitz stability of attention to small temporal shifts; multi phase averaging attenuates high frequency ripples while preserving per-head spectra under Nyquist-valid sampling. Experiments on multiple video understanding benchmarks under matched token budgets show consistent improvements with negligible computational overhead. PAS provides a plug and play upgrade for robust temporal encoding in Video LLMs.
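Why averaging opposed phase offsets attenuates high-frequency ripples can be seen on a toy kernel. The sketch below assumes a time kernel that is a plain average of cosines over a set of frequencies, and models the offsets as small opposed time shifts of plus and minus `eps`; both are simplifying assumptions for illustration and not the paper's exact construction, which aggregates head outputs rather than kernels. The identity used is cos(a + b) + cos(a - b) = 2 cos(a) cos(b), so each frequency w is scaled by cos(w * eps), which shrinks as w grows (for small eps).

```python
import math
from typing import Sequence


def time_kernel(t: float, freqs: Sequence[float], shift: float = 0.0) -> float:
    """Toy temporal kernel: mean of cos(w * (t + shift)) over frequencies w."""
    return sum(math.cos(w * (t + shift)) for w in freqs) / len(freqs)


def smoothed_kernel(t: float, freqs: Sequence[float], eps: float) -> float:
    """Average the kernel under two opposed offsets, +eps and -eps.

    Each frequency component cos(w * t) ends up multiplied by cos(w * eps),
    so low frequencies are nearly untouched and high ones are damped.
    """
    return 0.5 * (time_kernel(t, freqs, eps) + time_kernel(t, freqs, -eps))
```

For example, with `eps = 0.3` a component at frequency 0.1 keeps about 99.96% of its amplitude, while a component at frequency 3.0 keeps about 62%.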
[CV-95] Preserving Cross-Modal Consistency for CLIP-based Class-Incremental Learning
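[Quick read]: This paper addresses classifier bias and distributional drift of memory statistics in class-incremental learning (CIL) based on CLIP (Contrastive Language-Image Pretraining). The bias arises because task-specific soft prompts overfit to new classes, so text prototypes lean toward recent categories and neglect earlier knowledge. The drift arises because the vision encoder is updated over time, which makes the stored class-wise Gaussian statistics inconsistent and degrades generative replay. The key to the solution is DMC, a two-stage framework that trains each modality with the other frozen, so that one modality acts as a stable anchor and cross-modal alignment is preserved. The enhanced version, DMC-OT, adds an optimal-transport (OT) guided calibration strategy that aligns memory statistics across evolving vision encoders, together with a task-specific prompting design that improves inter-task separability.
Link: https://arxiv.org/abs/2511.10974
Authors: Haoran Chen, Houze Xu, Micah Goldblum, Daoguo Dong, Zuxuan Wu
Affiliations: Fudan University; Shanghai Collaborative Innovation Center of Intelligent Visual Computing; Columbia University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: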
Abstract:Class-incremental learning (CIL) enables models to continuously learn new categories from sequential tasks without forgetting previously acquired knowledge. While recent advances in vision-language models such as CLIP have demonstrated strong generalization across domains, extending them to continual settings remains challenging. In particular, learning task-specific soft prompts for newly introduced classes often leads to severe classifier bias, as the text prototypes overfit to recent categories when prior data are unavailable. In this paper, we propose DMC, a simple yet effective two-stage framework for CLIP-based CIL that decouples the adaptation of the vision encoder and the optimization of textual soft prompts. Each stage is trained with the other frozen, allowing one modality to act as a stable semantic anchor for the other to preserve cross-modal alignment. Furthermore, current CLIP-based CIL approaches typically store class-wise Gaussian statistics for generative replay, yet they overlook the distributional drift that arises when the vision encoder is updated over time. To address this issue, we introduce DMC-OT, an enhanced version of DMC that incorporates an optimal-transport guided calibration strategy to align memory statistics across evolving encoders, along with a task-specific prompting design that enhances inter-task separability. Extensive experiments on CIFAR-100, Imagenet-R, CUB-200, and UCF-101 demonstrate that both DMC and DMC-OT achieve state-of-the-art performance, with DMC-OT further improving accuracy by an average of 1.80%.
[CV-96] ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization
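[Quick read]: This paper addresses two core problems of sparse Mixture-of-Experts (MoE) architectures: misalignment between router logits and each expert's internal structure, which causes unstable routing and expert underutilization, and load imbalance, which creates straggler bottlenecks. The key to the solution is ERMoE, which reparameterizes each expert in a learned orthonormal eigenbasis and replaces the usual gating logits with an "Eigenbasis Score", defined as the cosine similarity between input features and an expert's basis. This content-aware routing ties token assignment directly to the experts' representation spaces, which stabilizes utilization and promotes interpretable specialization. It also removes the need for an explicit load-balancing loss and the interfering gradients such a loss introduces.
Link: https://arxiv.org/abs/2511.10971
Authors: Anzhe Cheng, Shukai Duan, Shixuan Li, Chenzhong Yin, Mingxi Cheng, Heng Ping, Tamoghna Chattopadhyay, Sophia I Thomopoulos, Shahin Nazarian, Paul Thompson, Paul Bogdan
Affiliations: University of Southern California
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: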
Abstract:Mixture-of-Experts (MoE) architectures expand model capacity by sparsely activating experts but face two core challenges: misalignment between router logits and each expert’s internal structure leads to unstable routing and expert underutilization, and load imbalances create straggler bottlenecks. Standard solutions, such as auxiliary load-balancing losses, can reduce load disparities but often weaken expert specialization and hurt downstream performance. To address these issues, we propose ERMoE, a sparse MoE transformer that reparameterizes each expert in a learned orthonormal eigenbasis and replaces learned gating logits with an “Eigenbasis Score”, defined as the cosine similarity between input features and an expert’s basis. This content-aware routing ties token assignments directly to experts’ representation spaces, stabilizing utilization and promoting interpretable specialization without sacrificing sparsity. Crucially, ERMoE removes the need for explicit balancing losses and avoids the interfering gradients they introduce. We show that ERMoE achieves state-of-the-art accuracy on ImageNet classification and cross-modal image-text retrieval benchmarks (e.g., COCO, Flickr30K), while naturally producing flatter expert load distributions. Moreover, a 3D MRI variant (ERMoE-ba) improves brain age prediction accuracy by more than 7% and yields anatomically interpretable expert specializations. ERMoE thus introduces a new architectural principle for sparse expert models that directly addresses routing instabilities and enables improved performance with scalable, interpretable specialization.
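A content-aware score of this kind can be sketched in plain Python. The abstract defines the Eigenbasis Score only as the cosine similarity between input features and an expert's basis; the version below assumes one concrete reading, namely the cosine of the angle between the token and its projection onto the expert's orthonormal basis. The names `eigenbasis_score` and `route_top_k`, and the top-k selection, are illustrative assumptions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def eigenbasis_score(x: Vector, basis: Sequence[Vector]) -> float:
    """Cosine between x and its projection onto span(basis).

    Assumption: `basis` rows are orthonormal, so the projection norm is
    sqrt(sum_i <x, b_i>^2) and the score lies in [0, 1].
    """
    x_norm = math.sqrt(dot(x, x))
    if x_norm == 0.0:
        return 0.0
    proj_sq = sum(dot(x, b) ** 2 for b in basis)
    return math.sqrt(proj_sq) / x_norm


def route_top_k(x: Vector, expert_bases: Sequence[Sequence[Vector]], k: int = 1) -> List[int]:
    """Return the indices of the k experts whose bases best align with x."""
    scores = [eigenbasis_score(x, basis) for basis in expert_bases]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

Because the score depends only on the token and the expert's own basis, there is no separate gating matrix to drift out of alignment with the experts, which is the intuition the abstract gives for more stable routing.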
[CV-97] Text-guided Weakly Supervised Framework for Dynamic Facial Expression Recognition
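[Quick read]: This paper addresses the many-to-one labeling problem in dynamic facial expression recognition (DFER): a video made of many frames receives a single emotion label, which makes it hard for a model to capture the temporal detail of how the emotion changes. The key to the solution is TG-DFER, a text-guided weakly supervised framework. It uses a vision-language pre-trained (VLP) model to provide fine-grained semantic guidance about emotion, and introduces visual prompts that align enriched textual emotion labels with visual instance features, enabling frame-level relevance estimation. A multi-grained temporal network jointly models short-term facial dynamics and long-range emotional flow, improving generalization, interpretability, and temporal sensitivity under weak supervision.
Link: https://arxiv.org/abs/2511.10958
Authors: Gunho Jung, Heejo Kong, Seong-Whan Lee
Affiliations: Korea University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: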
Abstract: Dynamic facial expression recognition (DFER) aims to identify emotional states by modeling the temporal changes in facial movements across video sequences. A key challenge in DFER is the many-to-one labeling problem, where a video composed of numerous frames is assigned a single emotion label. A common strategy to mitigate this issue is to formulate DFER as a Multiple Instance Learning (MIL) problem. However, MIL-based approaches inherently suffer from the visual diversity of emotional expressions and the complexity of temporal dynamics. To address this challenge, we propose TG-DFER, a text-guided weakly supervised framework that enhances MIL-based DFER by incorporating semantic guidance and coherent temporal modeling. A vision-language pre-trained (VLP) model is integrated to provide semantic guidance through fine-grained textual descriptions of emotional context. Furthermore, we introduce visual prompts, which align enriched textual emotion labels with visual instance features, enabling fine-grained reasoning and frame-level relevance estimation. In addition, a multi-grained temporal network is designed to jointly capture short-term facial dynamics and long-range emotional flow, ensuring coherent affective understanding across time. Extensive results demonstrate that TG-DFER achieves improved generalization, interpretability, and temporal sensitivity under weak supervision.
[CV-98] Language-Guided Graph Representation Learning for Video Summarization
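[Quick read]: This paper addresses two core challenges in video summarization: existing methods struggle to capture global dependencies in video content, and they struggle to accommodate multimodal user customization. In addition, temporal proximity between frames does not always correspond to semantic proximity. The key to the solution is the Language-guided Graph Representation Learning Network (LGRLN). First, a video graph generator converts frames into structured forward, backward, and undirected graphs that preserve temporal order and contextual dependencies. Second, an intra-graph relational reasoning module with a dual-threshold graph convolution mechanism separates semantically relevant frames from irrelevant ones. Third, a language-guided cross-modal embedding module lets the summary be customized by a text description; the output is modeled as a mixture of Bernoulli distributions and solved with the expectation-maximization (EM) algorithm. The method improves summary quality while cutting inference time by 87.8% and model parameters by 91.7%.
Link: https://arxiv.org/abs/2511.10953
Authors: Wenrui Li, Wei Han, Hengyu Man, Wangmeng Zuo, Xiaopeng Fan, Yonghong Tian
Affiliations: Harbin Institute of Technology; Harbin Institute of Technology Suzhou Research Institute; Peking University; Peng Cheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE TPAMI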
Abstract: With the rapid growth of video content on social media, video summarization has become a crucial task in multimedia processing. However, existing methods face challenges in capturing global dependencies in video content and accommodating multimodal user customization. Moreover, temporal proximity between video frames does not always correspond to semantic proximity. To tackle these challenges, we propose a novel Language-guided Graph Representation Learning Network (LGRLN) for video summarization. Specifically, we introduce a video graph generator that converts video frames into a structured graph to preserve temporal order and contextual dependencies. By constructing forward, backward and undirected graphs, the video graph generator effectively preserves the sequentiality and contextual relationships of video content. We design an intra-graph relational reasoning module with a dual-threshold graph convolution mechanism, which distinguishes semantically relevant frames from irrelevant ones among the graph nodes. Additionally, our proposed language-guided cross-modal embedding module generates video summaries with specific textual descriptions. We model the summary generation output as a mixture of Bernoulli distributions and solve it with the EM algorithm. Experimental results show that our method outperforms existing approaches across multiple benchmarks. Moreover, the proposed LGRLN reduces inference time and model parameters by 87.8% and 91.7%, respectively. Our codes and pre-trained models are available at this https URL.
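[CV-99] DEFT-LLM: Disentangled Expert Feature Tuning for Micro-Expression Recognition
[Quick read]: This paper addresses two core challenges in micro-expression recognition (MER). First, static appearance and dynamic motion cues are entangled, which keeps the model from focusing on subtle facial motion. Second, text labels in existing datasets do not fully correspond to the underlying facial muscle movements, leaving a semantic gap that limits the value of the supervision signal. The key to the solution is the DEFT-LLM framework, which achieves motion-semantic alignment through multi-expert disentanglement. It first builds the Uni-MER instruction dataset, using dual constraints from optical flow and Action Unit (AU) labels so that text descriptions stay spatio-temporally consistent with local facial motion. It then uses a three-expert architecture to decouple facial dynamics into three independent, interpretable representations: structure, dynamic texture, and motion semantics. This injects physical priors while drawing on the cross-modal reasoning ability of large language models, improving both the precision and the interpretability of capturing subtle emotional cues.
Link: https://arxiv.org/abs/2511.10948
Authors: Ren Zhang, Huilai Li, Chao Qi, Guoliang Xu, Tianyu Zhou, Wei Wei, Jianqin Yin
Affiliations: College of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Comments: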
Abstract:Micro-expression recognition (MER) is crucial for inferring genuine emotion. Applying a multimodal large language model (MLLM) to this task enables spatio-temporal analysis of facial motion and provides interpretable descriptions. However, there are still two core challenges: (1) The entanglement of static appearance and dynamic motion cues prevents the model from focusing on subtle motion; (2) Textual labels in existing MER datasets do not fully correspond to underlying facial muscle movements, creating a semantic gap between text supervision and physical motion. To address these issues, we propose DEFT-LLM, which achieves motion-semantic alignment through multi-expert disentanglement. We first introduce Uni-MER, a motion-driven instruction dataset designed to align text with local facial motion. Its construction leverages dual constraints from optical flow and Action Unit (AU) labels to ensure spatio-temporal consistency and reasonable correspondence to the movements. We then design an architecture with three experts to decouple facial dynamics into independent and interpretable representations (structure, dynamic textures, and motion semantics). By integrating the instruction-aligned knowledge from Uni-MER into DEFT-LLM, our method injects effective physical priors for micro-expressions while also leveraging the cross-modal reasoning ability of large language models, thus enabling precise capture of subtle emotional cues. Experiments on multiple challenging MER benchmarks demonstrate state-of-the-art performance, as well as a particular advantage in interpretable modeling of local facial motion.
[CV-100] Abstract 3D Perception for Spatial Intelligence in Vision-Language Models
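[Quick read]: This paper addresses the weak performance of vision-language models (VLMs) on 3D-related tasks, in particular spatial cognition and physical understanding, which matter for real-world applications such as robotics and embodied agents. The authors attribute the limitation to a modality gap: VLMs are trained on 2D data, so they retrieve 3D information from 2D input inefficiently. The key to the solution is the SandboxVLM framework, which uses abstract bounding boxes to encode geometric structure and physical kinematics. It builds a four-stage 3D Sandbox reconstruction and perception pipeline: multi-view prior generation with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. The method improves the 3D reasoning of VLMs without additional training, for example with an 8.3% gain on SAT Real over baseline methods, which suggests that 3D abstraction is useful for general-purpose embodied intelligence.
Link: https://arxiv.org/abs/2511.10946
Authors: Yifan Liu, Fangneng Zhan, Kaichen Zhou, Yilun Du, Paul Pu Liang, Hanspeter Pfister
Affiliations: Tsinghua University; Harvard University; Massachusetts Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: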
Abstract:Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding, which are crucial for real-world applications like robotics and embodied agents. We attribute this to a modality gap between the 3D tasks and the 2D training of VLMs, which leads to inefficient retrieval of 3D information from 2D input. To bridge this gap, we introduce SandboxVLM, a simple yet effective framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Specifically, we design a 3D Sandbox reconstruction and perception pipeline comprising four stages: generating multi-view priors with abstract control, proxy elevation, multi-view voting and clustering, and 3D-aware reasoning. Evaluated in zero-shot settings across multiple benchmarks and VLM backbones, our approach consistently improves spatial intelligence, achieving, for instance, an 8.3% gain on SAT Real over baseline methods. These results demonstrate that equipping VLMs with a 3D abstraction substantially enhances their 3D reasoning ability without additional training, suggesting new possibilities for general-purpose embodied intelligence.
[CV-101] Divide Conquer and Unite: Hierarchical Style-Recalibrated Prototype Alignment for Federated Medical Image Segmentation AAAI-26
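[Quick read]: This paper addresses feature heterogeneity in federated learning for medical image segmentation, which arises from differences in scanners or scanning protocols. Existing methods usually align only final-layer features. They therefore miss multi-level semantic cues and let style bias accumulate in intermediate layers, which hurts robustness and segmentation accuracy. The key to the solution is the FedBCS framework. It introduces a frequency-domain adaptive style recalibration into prototype construction, decoupling content from style and learning optimal style parameters. It also designs a context-aware dual-level prototype alignment strategy that extracts domain-invariant prototypes from different layers of both the encoder and the decoder and fuses them with contextual information for finer-grained representation alignment, mitigating the effect of distribution shift across institutions.
Link: https://arxiv.org/abs/2511.10945
Authors: Xingyue Zhao, Wenke Huang, Xingguang Wang, Haoyu Zhao, Linghao Zhuang, Anwen Jiang, Guancheng Wan, Mang Ye
Affiliations: Wuhan University; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at AAAI-26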
Abstract:Federated learning enables multiple medical institutions to train a global model without sharing data, yet feature heterogeneity from diverse scanners or protocols remains a major challenge. Many existing works attempt to address this issue by leveraging model representations (e.g., mean feature vectors) to correct local training; however, they often face two key limitations: 1) Incomplete Contextual Representation Learning: Current approaches primarily focus on final-layer features, overlooking critical multi-level cues and thus diluting essential context for accurate segmentation. 2) Layerwise Style Bias Accumulation: Although utilizing representations can partially align global features, these methods neglect domain-specific biases within intermediate layers, allowing style discrepancies to build up and reduce model robustness. To address these challenges, we propose FedBCS to bridge feature representation gaps via domain-invariant contextual prototype alignment. Specifically, we introduce a frequency-domain adaptive style recalibration into prototype construction that not only decouples content-style representations but also learns optimal style parameters, enabling more robust domain-invariant prototypes. Furthermore, we design a context-aware dual-level prototype alignment method that extracts domain-invariant prototypes from different layers of both encoder and decoder and fuses them with contextual information for finer-grained representation alignment. Extensive experiments on two public datasets demonstrate that our method exhibits remarkable performance.
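The abstract refers to prototypes in the sense of mean feature vectors shared in place of raw data. A minimal sketch of that baseline idea follows; it is not FedBCS itself (no style recalibration, no multi-level alignment), and the unweighted server-side average in `average_prototypes` is an assumption made for illustration.

```python
from collections import defaultdict


def class_prototypes(features, labels):
    """Mean feature vector per class for one client."""
    sums, counts = {}, defaultdict(int)
    for vec, lab in zip(features, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


def average_prototypes(client_protos):
    """Unweighted average of per-class prototypes across clients (an assumption)."""
    grouped = defaultdict(list)
    for protos in client_protos:
        for lab, vec in protos.items():
            grouped[lab].append(vec)
    return {lab: [sum(col) / len(vecs) for col in zip(*vecs)]
            for lab, vecs in grouped.items()}


client_a = class_prototypes([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]], [0, 0, 1])
client_b = class_prototypes([[4.0, 2.0], [0.0, 4.0]], [0, 1])
global_protos = average_prototypes([client_a, client_b])
```

Only the per-class means leave each client, which is why prototype sharing fits the no-data-sharing constraint of federated training.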
[CV-102] From Parameter to Representation: A Closed-Form Approach for Controllable Model Merging AAAI2026
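[Quick read]: This paper addresses parameter interference in model merging for multitask settings, and in particular the cost of existing controllable model merging methods, whose offline multi-objective optimization grows exponentially in complexity with the number of tasks. The key to the solution is to drop parameter-space optimization and instead correct the model's final representation directly. The correction is modeled as an optimal linear transformation with a closed-form solution, so a single-step, architecture-agnostic computation replaces the costly offline optimization. The approach reduces generation complexity from exponential to linear in the number of tasks and incorporates user preferences directly, so a Pareto-optimal model can be generated on the fly.
Link: https://arxiv.org/abs/2511.10943
Authors: Jialin Wu, Jian Yang, Handing Wang, Jiajun Wen, Zhiyong Yu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026, Extended Version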
Abstract:Model merging combines expert models for multitask performance but faces challenges from parameter interference. This has sparked recent interest in controllable model merging, giving users the ability to explicitly balance performance trade-offs. Existing approaches employ a compile-then-query paradigm, performing a costly offline multi-objective optimization to enable fast, preference-aware model generation. This offline stage typically involves iterative search or dedicated training, with complexity that grows exponentially with the number of tasks. To overcome these limitations, we shift the perspective from parameter-space optimization to a direct correction of the model’s final representation. Our approach models this correction as an optimal linear transformation, yielding a closed-form solution that replaces the entire offline optimization process with a single-step, architecture-agnostic computation. This solution directly incorporates user preferences, allowing a Pareto-optimal model to be generated on-the-fly with complexity that scales linearly with the number of tasks. Experimental results show our method generates a superior Pareto front with more precise preference alignment and drastically reduced computational cost.
[CV-103] Heterogeneous Complementary Distillation AAAI2026
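[Quick read]: This paper addresses a key challenge in heterogeneous knowledge distillation (KD): when the teacher (for example a Vision Transformer) and the student (for example ResNet18) differ substantially in architecture, conventional KD methods struggle to align spatial features and lose effectiveness. The core of the solution is the Heterogeneous Complementary Distillation (HCD) framework, which strengthens representation alignment by fusing complementary teacher and student features. The student's intermediate features pass through a convolutional projector and adaptive pooling, are concatenated with the teacher's penultimate-layer features, and are mapped by the Complementary Feature Mapper (CFM) module to produce shared logits. Sub-logit Decoupled Distillation (SDD) then partitions the shared logits into n sub-logits and fuses them with the teacher's logits, and an Orthogonality Loss (OL) increases sub-logit diversity and reduces redundant knowledge transfer. The mechanism preserves student-specific strengths while exploiting teacher knowledge, improving robustness and generalization.
Link: https://arxiv.org/abs/2511.10942
Authors: Liuchi Xu, Hao Zheng, Lu Wang, Lisheng Xu, Jun Cheng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI2026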
Abstract:Knowledge distillation (KD) transfers the dark knowledge from a complex teacher to a compact student. However, heterogeneous architecture distillation, such as Vision Transformer (ViT) to ResNet18, faces challenges due to differences in spatial feature representations. Traditional KD methods are mostly designed for homogeneous architectures and hence struggle to effectively address the disparity. Although heterogeneous KD approaches have been developed recently to solve these issues, they often incur high computational costs and complex designs, or overly rely on logit alignment, which limits their ability to leverage the complementary features. To overcome these limitations, we propose Heterogeneous Complementary Distillation (HCD), a simple yet effective framework that integrates complementary teacher and student features to align representations in shared [...]. [...] logits are decomposed and constrained to facilitate diverse knowledge transfer to the student. Specifically, HCD processes the student’s intermediate features through a convolutional projector and adaptive pooling, concatenates them with the teacher’s feature from the penultimate layer and then maps them via the Complementary Feature Mapper (CFM) module, comprising a fully connected layer, to produce shared logits. We further introduce Sub-logit Decoupled Distillation (SDD) that partitions the shared logits into n sub-logits, which are fused with the teacher’s logits to rectify [...]. To ensure sub-logit diversity and reduce redundant knowledge transfer, we propose an Orthogonality Loss (OL). By preserving student-specific strengths and leveraging teacher knowledge, HCD enhances robustness and generalization in [...]. Experiments on the CIFAR-100, Fine-grained (e.g., CUB200) and ImageNet-1K datasets demonstrate that HCD outperforms state-of-the-art KD methods, establishing it as an effective solution for heterogeneous KD.
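The abstract names an Orthogonality Loss that encourages diversity among sub-logits but does not give its form. One common way to express such a penalty is the mean squared cosine similarity between distinct vectors, sketched below. The function `orthogonality_penalty` and this particular formula are assumptions for illustration, not the paper's definition.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def orthogonality_penalty(sub_logits):
    """Mean squared cosine similarity over all distinct pairs of sub-logit vectors."""
    pairs = [(i, j) for i in range(len(sub_logits))
             for j in range(i + 1, len(sub_logits))]
    return sum(cosine(sub_logits[i], sub_logits[j]) ** 2 for i, j in pairs) / len(pairs)


# Mutually orthogonal sub-logits incur no penalty; parallel ones incur the maximum.
orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
redundant = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 2.0, 3.0]]
```

Minimizing a term of this kind pushes the sub-logits toward carrying different information, which matches the stated goal of reducing redundant knowledge transfer.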
[CV-104] Facial Expression Recognition with YOLOv11 and YOLOv12: A Comparative Study ICSE
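[Quick read]: This paper addresses the drop in facial expression recognition (FER) performance in unconstrained, real-world settings, and the challenge of efficient real-time recognition in resource-constrained environments. The key to the solution is to use two lightweight object detection models, YOLOv11n and YOLOv12n, within a unified detection and classification framework, training and evaluating them on the FER2013 and KDEF benchmark datasets after conversion to object detection format. YOLOv12n reaches the highest mAP 0.5 on the clean KDEF dataset (95.6) and also leads on FER2013 in mAP (63.8), showing stronger sensitivity to varied expressions. YOLOv11n achieves higher precision on the noisier FER2013 dataset (65.2), indicating fewer false positives and better robustness. The results expose a trade-off between sensitivity and precision in lightweight models and support their suitability for practical deployment.
Link: https://arxiv.org/abs/2511.10940
Authors: Umma Aymon, Nur Shazwani Kamarudin, Ahmad Fakhri Ab. Nasir
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE Conference Proceedings for the 2025 IEEE 9th International Conference on Software Engineering Computer Systems (ICSECS)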
Abstract:Facial Expression Recognition (FER) remains a challenging task, especially in unconstrained, real-world environments. This study investigates the performance of two lightweight models, YOLOv11n and YOLOv12n, which are the nano variants of the latest official YOLO series, within a unified detection and classification framework for FER. Two benchmark classification datasets, FER2013 and KDEF, are converted into object detection format, and model performance is evaluated using mAP 0.5, precision, recall, and confusion matrices. Results show that YOLOv12n achieves the highest overall performance on the clean KDEF dataset, with a mAP 0.5 of 95.6, and also outperforms YOLOv11n on the FER2013 dataset with a mAP of 63.8, reflecting stronger sensitivity to varied expressions. In contrast, YOLOv11n demonstrates higher precision (65.2) on FER2013, indicating fewer false positives and better reliability in noisy, real-world conditions. On FER2013, both models show more confusion between visually similar expressions, while clearer class separation is observed on the cleaner KDEF dataset. These findings underscore the trade-off between sensitivity and precision, illustrating how lightweight YOLO models can effectively balance performance and efficiency. The results demonstrate adaptability across both controlled and real-world conditions, establishing these models as strong candidates for real-time, resource-constrained emotion-aware AI applications.
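Converting a classification dataset into object detection format means writing one box annotation per image. The sketch below writes a line in the plain-text YOLO label layout (`class x_center y_center width height`, normalized to the image size). Using the whole image as the box and the `EMOTIONS` ordering are assumptions for illustration; the abstract does not describe how boxes were obtained.

```python
# Hypothetical class list for illustration; the index is the YOLO class id.
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]


def yolo_label_line(class_id, box, image_size):
    """Format one box as 'class x_center y_center width height', normalised to [0, 1]."""
    x_min, y_min, x_max, y_max = box
    img_w, img_h = image_size
    x_c = (x_min + x_max) / 2 / img_w
    y_c = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"


# A 48x48 face crop labelled "happy", with the full image taken as the box.
line = yolo_label_line(EMOTIONS.index("happy"), (0, 0, 48, 48), (48, 48))
```

With a face detector in the loop, the same function would take the detected face box in place of the full-image box.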
[CV-105] Out-of-Distribution Detection with Positive and Negative Prompt Supervision Using Large Language Models
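[Quick read]: This paper addresses a weakness of current out-of-distribution (OOD) detection methods built on vision-language models (VLMs): negative prompts cover a broad range of non-target features, which causes semantic confusion and degrades performance. The key to the solution is Positive and Negative Prompt Supervision. Class-specific positive and negative prompts are first initialized with large language models (LLMs) and then optimized, with positive prompts focusing on features within each class and negative prompts focusing on inter-class features near category boundaries. A graph-based architecture then aggregates the semantic information from the optimized prompts and propagates it to the visual branch, strengthening the discriminative power of the energy-based OOD detector. The method improves the accuracy and robustness of OOD detection.
Link: https://arxiv.org/abs/2511.10923
Authors: Zhixia He, Chen Zhao, Minglai Shao, Xintao Wu, Xujiang Zhao, Dong Li, Qin Tian, Linlin Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: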
Abstract:Out-of-distribution (OOD) detection is committed to delineating the classification boundaries between in-distribution (ID) and OOD images. Recent advances in vision-language models (VLMs) have demonstrated remarkable OOD detection performance by integrating both visual and textual modalities. In this context, negative prompts are introduced to emphasize the dissimilarity between image features and prompt content. However, these prompts often include a broad range of non-ID features, which may result in suboptimal outcomes due to the capture of overlapping or misleading information. To address this issue, we propose Positive and Negative Prompt Supervision, which encourages negative prompts to capture inter-class features and transfers this semantic knowledge to the visual modality to enhance OOD detection performance. Our method begins with class-specific positive and negative prompts initialized by large language models (LLMs). These prompts are subsequently optimized, with positive prompts focusing on features within each class, while negative prompts highlight features around category boundaries. Additionally, a graph-based architecture is employed to aggregate semantic-aware supervision from the optimized prompt representations and propagate it to the visual branch, thereby enhancing the performance of the energy-based OOD detector. Extensive experiments on two benchmarks, CIFAR-100 and ImageNet-1K, across eight OOD datasets and five different LLMs, demonstrate that our method outperforms state-of-the-art baselines.
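The energy-based detector mentioned in the abstract scores an input by the free energy of its logits, E(x) = -T · log Σ_i exp(f_i(x) / T); in-distribution inputs tend to have lower energy. The sketch below computes that standard score. The threshold value and the example logits are made up for illustration, and nothing here reproduces the prompt supervision or graph propagation of the paper.

```python
import math


def energy_score(logits, temperature=1.0):
    """Free energy E(x) = -T * log(sum_i exp(f_i(x) / T)), computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return -temperature * log_sum_exp


def is_ood(logits, threshold, temperature=1.0):
    """Flag an input as OOD when its energy exceeds a hypothetical threshold."""
    return energy_score(logits, temperature) > threshold


confident = [9.0, 0.5, 0.1]   # one class dominates: low energy
uncertain = [0.4, 0.5, 0.3]   # no class stands out: higher energy
```

In practice the threshold is usually chosen on held-out in-distribution data, for example so that a fixed fraction of in-distribution samples is accepted.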
[CV-106] PhaseWin Search Framework Enable Efficient Object-Level Interpretation
[Quick read]: This paper addresses the tension between faithfulness and computational efficiency in region attribution for object-level foundation models. Existing methods based on submodular subset selection achieve high faithfulness, but their quadratic cost limits deployment in practice. The key to the solution is the PhaseWin algorithm, a phased coarse-to-fine search that combines adaptive pruning, windowed fine-grained selection, and dynamic supervision to approximate greedy selection in near-linear time. It reaches over 95% of greedy attribution faithfulness with only 20% of the computational budget, improving scalability and practicality.
Link: https://arxiv.org/abs/2511.10914
Authors: Zihan Gu, Ruoyu Chen, Junchi Zhang, Yue Hu, Hua Zhang, Xiaochun Cao
Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; School of Mathematical Sciences, Fudan University; School of Cyber Science and Technology, Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Attribution is essential for interpreting object-level foundation models. Recent methods based on submodular subset selection have achieved high faithfulness, but their efficiency limitations hinder practical deployment in real-world scenarios. To address this, we propose PhaseWin, a novel phase-window search algorithm that enables faithful region attribution with near-linear complexity. PhaseWin replaces traditional quadratic-cost greedy selection with a phased coarse-to-fine search, combining adaptive pruning, windowed fine-grained selection, and dynamic supervision mechanisms to closely approximate greedy behavior while dramatically reducing model evaluations. Theoretically, PhaseWin retains near-greedy approximation guarantees under mild monotone submodular assumptions. Empirically, PhaseWin achieves over 95% of greedy attribution faithfulness using only 20% of the computational budget, and consistently outperforms other attribution baselines across object detection and visual grounding tasks with Grounding DINO and Florence-2. PhaseWin establishes a new state of the art in scalable, high-faithfulness attribution for object-level multimodal models.
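To see why plain greedy subset selection is expensive, the sketch below runs it on a toy coverage objective and counts score evaluations. This is the baseline that the abstract says PhaseWin replaces, not PhaseWin itself; the `coverage` function stands in for a model-based attribution score and is purely illustrative.

```python
def coverage(selected, region_items):
    """Toy monotone submodular score: number of distinct items covered."""
    covered = set()
    for r in selected:
        covered |= region_items[r]
    return len(covered)


def greedy_select(region_items, budget):
    """Plain greedy selection; returns chosen regions and the number of score evaluations."""
    selected, evaluations = [], 0
    remaining = sorted(region_items)
    for _ in range(budget):
        best, best_score = None, -1
        for r in remaining:               # every candidate is re-scored each round
            score = coverage(selected + [r], region_items)
            evaluations += 1
            if score > best_score:
                best, best_score = r, score
        selected.append(best)
        remaining.remove(best)
    return selected, evaluations


regions = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}, "d": {1, 2}}
order, n_evals = greedy_select(regions, budget=3)
```

With n candidate regions, ranking all of them costs n + (n - 1) + ... + 1 evaluations, which is quadratic in n. When each evaluation is a forward pass of a large model, that count is the bottleneck the abstract refers to.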
[CV-107] YOLO-Drone: An Efficient Object Detection Approach Using the GhostHead Network for Drone Images
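[Quick read]: This paper addresses the low object detection accuracy on drone imagery, where high-altitude capture makes objects small and hard to identify. The core of the solution is a modification of the Head network of YOLOv11, a lightweight structure called the GhostHead Network. The resulting model, YOLO-Drone, is evaluated on the VisDrone dataset, where it improves Precision, Recall, F1-Score, and mAP (0.5) over the original YOLOv11 and also improves inference speed. It also outperforms YOLOv8, YOLOv9, and YOLOv10 in mAP (0.5).
Link: https://arxiv.org/abs/2511.10905
Authors: Hyun-Ki Jung
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint version. Accepted for publication in the Journal of Information Systems Engineering and Management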
Abstract:Object detection using images or videos captured by drones is a promising technology with significant potential across various industries. However, a major challenge is that drone images are typically taken from high altitudes, making object identification difficult. This paper proposes an effective solution to address this issue. The base model used in the experiments is YOLOv11, the latest object detection model, with a specific implementation based on YOLOv11n. The experimental data were sourced from the widely used and reliable VisDrone dataset, a standard benchmark in drone-based object detection. This paper introduces an enhancement to the Head network of the YOLOv11 algorithm, called the GhostHead Network. The model incorporating this improvement is named YOLO-Drone. Experimental results demonstrate that YOLO-Drone achieves significant improvements in key detection accuracy metrics, including Precision, Recall, F1-Score, and mAP (0.5), compared to the original YOLOv11. Specifically, the proposed model recorded a 0.4% increase in Precision, a 0.6% increase in Recall, a 0.5% increase in F1-Score, and a 0.5% increase in mAP (0.5). Additionally, the Inference Speed metric, which measures image processing speed, also showed a notable improvement. These results indicate that YOLO-Drone is a high-performance model with enhanced accuracy and speed compared to YOLOv11. To further validate its reliability, comparative experiments were conducted against other high-performance object detection models, including YOLOv8, YOLOv9, and YOLOv10. The results confirmed that the proposed model outperformed YOLOv8 by 0.1% in mAP (0.5) and surpassed YOLOv9 and YOLOv10 by 0.3% and 0.6%, respectively.
[CV-108] DINOv3 as a Frozen Encoder for CRPS-Oriented Probabilistic Rainfall Nowcasting
[Quick read]: This paper addresses accuracy and computational efficiency in probabilistic rainfall nowcasting: how to keep the predicted distribution well calibrated while reducing model complexity. The key to the solution is a video projector (V-JEPA Vision Transformer) paired with a lightweight probabilistic head, attached to a pre-trained satellite vision encoder (DINOv3-SAT493M). Together they map encoder tokens to a discrete empirical cumulative distribution function (eCDF) over 4-hour accumulated rainfall, and the projector-head is trained end-to-end on the Continuous Ranked Probability Score (CRPS). On the Weather4Cast 2025 benchmark the method reaches a CRPS of 3.5102, roughly a 26% gain in effectiveness over the best 3D-UNET baseline.
Link: https://arxiv.org/abs/2511.10894
Authors: Luciano Araujo Dourado Filho, Almir Moreira da Silva Neto, Anthony Miyaguchi, Rodrigo Pereira David, Rodrigo Tripodi Calumby, Lukáš Picek
Affiliations: Advanced Data Analysis and Management, University of Feira de Santana; National Institute of Metrology, Quality and Technology; Georgia Institute of Technology; University of West Bohemia in Pilsen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper proposes a competitive and computationally efficient approach to probabilistic rainfall nowcasting. A video projector (V-JEPA Vision Transformer) coupled with a lightweight probabilistic head is attached to a pre-trained satellite vision encoder (DINOv3-SAT493M) to map encoder tokens into a discrete empirical CDF (eCDF) over 4-hour accumulated rainfall. The projector-head is optimized end-to-end over the Continuous Ranked Probability Score (CRPS). As an alternative, 3D-UNET baselines trained with an aggregate Rank Probability Score and a per-pixel Gamma-Hurdle objective are used. On the Weather4Cast 2025 benchmark, the proposed method achieved promising performance, with a CRPS of 3.5102, which represents an effectiveness gain of approximately 26% against the best 3D-UNET.
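For a forecast CDF F and an observation y, CRPS is the integral of (F(x) - 1[x ≥ y])² over x, and lower is better. The sketch below approximates it for a discrete eCDF on a threshold grid with a left Riemann sum. The grid, the example forecasts, and the name `crps_from_ecdf` are assumptions for illustration; the paper's binning is not given in the abstract.

```python
def crps_from_ecdf(cdf_values, thresholds, observed):
    """Approximate CRPS = integral of (F(x) - 1[x >= y])^2 dx on a threshold grid."""
    total = 0.0
    for k in range(len(thresholds) - 1):
        width = thresholds[k + 1] - thresholds[k]
        step = 1.0 if thresholds[k] >= observed else 0.0   # Heaviside step at y
        total += (cdf_values[k] - step) ** 2 * width
    return total


thresholds = [0.0, 1.0, 2.0, 3.0, 4.0]   # hypothetical rainfall bins in mm
observed = 2.0

sharp = [0.0, 0.0, 1.0, 1.0, 1.0]        # all mass placed exactly at the observation
vague = [0.2, 0.4, 0.6, 0.8, 1.0]        # mass spread evenly across the grid

crps_sharp = crps_from_ecdf(sharp, thresholds, observed)
crps_vague = crps_from_ecdf(vague, thresholds, observed)
```

Because the score is a sum of squared differences in the CDF values, it is differentiable with respect to the predicted eCDF, which is what makes direct end-to-end optimization of CRPS feasible.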
[CV-109] MCN-CL: Multimodal Cross-Attention Network and Contrastive Learning for Multimodal Emotion Recognition
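[Quick read]: This paper targets three challenges in multimodal emotion recognition: unbalanced category distribution, the complexity of temporal modeling for dynamic facial action units (AUs), and the difficulty of feature fusion caused by modal heterogeneity. It proposes the Multimodal Cross-Attention Network and Contrastive Learning (MCN-CL) model. Its core contribution is a triple query mechanism combined with a hard negative mining strategy, which removes redundant features while preserving important emotional cues, easing modal heterogeneity and category imbalance and improving cross-modal fusion. On the IEMOCAP and MELD datasets the method outperforms state-of-the-art models, with Weighted F1 scores improving by 3.42% and 5.73%, respectively.
Link: https://arxiv.org/abs/2511.10892
Authors: Feng Li, Ke Wu, Yongwei Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by 32nd International Conference on MultiMedia Modeling (MMM 2026)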
Abstract:Multimodal emotion recognition plays a key role in many domains, including mental health monitoring, educational interaction, and human-computer interaction. However, existing methods often face three major challenges: unbalanced category distribution, the complexity of temporal modeling for dynamic facial action units, and the difficulty of feature fusion due to modal heterogeneity. With the explosive growth of multimodal data in social media scenarios, the need for an efficient cross-modal fusion framework for emotion recognition is becoming increasingly urgent. To this end, this paper proposes Multimodal Cross-Attention Network and Contrastive Learning (MCN-CL) for multimodal emotion recognition. It uses a triple query mechanism and a hard negative mining strategy to remove feature redundancy while preserving important emotional cues, effectively addressing the issues of modal heterogeneity and category imbalance. Experimental results on the IEMOCAP and MELD datasets show that our proposed method outperforms state-of-the-art approaches, with Weighted F1 scores improving by 3.42% and 5.73%, respectively.
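[CV-110] Short-Window Sliding Learning for Real-Time Violence Detection via LLM-based Auto-Labeling
[Quick read]: This paper addresses the poor timeliness and weak fine-grained recognition of conventional violence detection methods, which rely on long-video training, in real-time intelligent surveillance systems. The key to the solution is a Short-Window Sliding Learning framework that divides videos into short clips of 1 to 2 seconds and uses Large Language Model (LLM)-based auto-caption labeling to build a fine-grained dataset. Each clip uses all of its frames to preserve temporal continuity, enabling precise recognition of rapid violent events and improving both performance on long videos and real-time applicability.
Link: https://arxiv.org/abs/2511.10866
Authors: Seoik Jung, Taekyung Song, Yangro Lee, Sungjun Lee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 5 pages, 2 figures. Accepted paper for the IEIE (Institute of Electronics and Information Engineers) Fall Conference 2025. Presentation on Nov 27, 2025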
Abstract:This paper proposes a Short-Window Sliding Learning framework for real-time violence detection in CCTV footage. Unlike conventional long-video training approaches, the proposed method divides videos into 1-2 second clips and applies Large Language Model (LLM)-based auto-caption labeling to construct fine-grained datasets. Each short clip fully utilizes all frames to preserve temporal continuity, enabling precise recognition of rapid violent events. Experiments demonstrate that the proposed method achieves 95.25% accuracy on RWF-2000 and significantly improves performance on long videos (UCF-Crime: 83.25%), confirming its strong generalization and real-time applicability in intelligent surveillance systems.
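Splitting a video into short sliding clips comes down to index arithmetic over frames. The sketch below returns the frame ranges for overlapping windows. The 1.5-second window sits inside the 1-2 second range the abstract gives, while the 0.5-second stride and the name `short_windows` are assumptions, since the abstract does not state how far the window moves each step.

```python
def short_windows(num_frames, fps, window_seconds=1.5, stride_seconds=0.5):
    """Return (start, end) frame index ranges for overlapping short clips."""
    window = int(round(window_seconds * fps))
    stride = int(round(stride_seconds * fps))
    clips = []
    start = 0
    while start + window <= num_frames:
        clips.append((start, start + window))   # end index is exclusive
        start += stride
    return clips


# A 5-second video at 30 fps yields eight 45-frame clips, one every 15 frames.
clips = short_windows(num_frames=150, fps=30)
```

Every frame inside a range is kept, so no temporal subsampling happens within a clip; each range would then be captioned and labelled on its own.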
[CV-111] Accuracy-Preserving CNN Pruning Method under Limited Data Availability
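[Quick read]: This paper addresses the significant accuracy degradation that existing pruning methods based on Layer-wise Relevance Propagation (LRP) still suffer when compressing Convolutional Neural Networks (CNNs), which limits their use when data is scarce. The key to the solution is a new pruning strategy that, with only a small amount of data, reaches higher pruning rates while better preserving model accuracy and without fine-tuning, making it suited to settings with limited compute and limited data.
Link: https://arxiv.org/abs/2511.10861
Authors: Daisuke Yasui, Toshitaka Matsuki, Hiroshi Sato
Affiliations: National Defense Academy of Japan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: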
Abstract:Convolutional Neural Networks (CNNs) are widely used in image recognition and have succeeded in various domains. CNN models have grown in scale to improve accuracy and generalization performance. Research has been conducted on compressing pre-trained models for specific target applications in environments with limited computing resources. Among model compression techniques, methods using Layer-wise Relevance Propagation (LRP), an explainable AI technique, have shown promise by achieving high pruning rates while preserving accuracy, even without fine-tuning. Because these methods do not require fine-tuning, they are suited to scenarios with limited data. However, existing LRP-based pruning approaches still suffer from significant accuracy degradation, limiting their practical usability. This study proposes a pruning method that achieves a higher pruning rate while better preserving model accuracy. Using only a small amount of data, our approach preserves accuracy better than existing methods.
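Relevance-based pruning of the kind the abstract builds on can be reduced to one step: score each filter, then remove the lowest-scoring fraction. The sketch below shows only that ranking step, assuming per-filter relevance scores (for example, LRP relevance aggregated over a small reference set) are already available. It is a generic illustration and not the method proposed in the paper; `select_filters_to_prune` is a hypothetical name.

```python
def select_filters_to_prune(relevance, prune_ratio):
    """Return indices of the least relevant filters, given one score per filter."""
    num_to_prune = int(len(relevance) * prune_ratio)
    # Rank filter indices from least to most relevant (ties keep index order).
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i])
    return sorted(ranked[:num_to_prune])


# Hypothetical aggregated relevance for the six filters of one layer.
scores = [0.30, 0.02, 0.25, 0.01, 0.40, 0.02]
pruned = select_filters_to_prune(scores, prune_ratio=0.5)
```

Because the ranking needs only forward and relevance passes on a few samples, no gradient-based fine-tuning is involved, which is the property that makes this family of methods attractive when data is limited.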
[CV-112] GFT: Graph Feature Tuning for Efficient Point Cloud Analysis WACV2026
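[Quick read]: This paper addresses parameter-efficient fine-tuning (PEFT) for point cloud data, where general-purpose PEFT methods perform suboptimally. The core of the solution is a point-cloud-specific PEFT method, Graph Features Tuning (GFT). It uses a lightweight graph convolution network to learn a dynamic graph from the transformer's initial tokenized inputs, then passes the extracted graph features to deeper layers through skip connections and efficient cross-attention modules, rivalling existing methods while reducing the number of trainable parameters.
Link: https://arxiv.org/abs/2511.10799
Authors: Manish Dhakal, Venkat R. Dasari, Raj Sunderraman, Yi Ding
Affiliations: Georgia State University; DEVCOM Army Research Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV 2026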
Abstract:Parameter-efficient fine-tuning (PEFT) significantly reduces computational and memory costs by updating only a small subset of the model’s parameters, enabling faster adaptation to new tasks with minimal loss in performance. Previous studies have introduced PEFTs tailored for point cloud data, as general approaches are suboptimal. To further reduce the number of trainable parameters, we propose a point-cloud-specific PEFT, termed Graph Features Tuning (GFT), which learns a dynamic graph from initial tokenized inputs of the transformer using a lightweight graph convolution network and passes these graph features to deeper layers via skip connections and efficient cross-attention modules. Extensive experiments on object classification and segmentation tasks show that GFT operates in the same domain, rivalling existing methods, while reducing the trainable parameters. Code is at this https URL.
[CV-113] Frequency-Aware Vision-Language Multimodality Generalization Network for Remote Sensing Image Classification
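[Quick read]: This paper addresses multimodality generalization in remote sensing (RS): how a model can overcome data heterogeneity across sensors or imaging conditions while generalizing across scenes. Existing vision-language models (VLMs) usually describe surface materials with universal texts and lack linguistic priors specific to different RS vision modalities. The key to the solution is a frequency-aware vision-language multimodality generalization network (FVMGN), whose core components are: 1) a diffusion-based training-test-time augmentation (DTAug) strategy that reconstructs multimodal land-cover distributions; 2) a multimodal wavelet disentanglement (MWDis) module that learns cross-domain invariant features by resampling low and high frequency components in the frequency domain; 3) shared and proprietary class texts as inputs to the transformer-based text encoder, to extract diverse text features; 4) a spatial-frequency-aware image encoder (SFIE) for local-global feature reconstruction; 5) a multiscale spatial-frequency feature alignment (MSFFA) module that builds a unified semantic space across spatial and frequency domains for fine-grained alignment of text and vision features.
Link: https://arxiv.org/abs/2511.10774
Authors: Junjie Zhang, Feng Zhao, Hanqiang Liu, Jun Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: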
Abstract:The booming remote sensing (RS) technology is giving rise to a novel multimodality generalization task, which requires the model to overcome data heterogeneity while possessing powerful cross-scene generalization ability. Moreover, most vision-language models (VLMs) describe surface materials in RS images using universal texts, lacking proprietary linguistic prior knowledge specific to different RS vision modalities. In this work, we formalize RS multimodality generalization (RSMG) as a learning paradigm, and propose a frequency-aware vision-language multimodality generalization network (FVMGN) for RS image classification. Specifically, a diffusion-based training-test-time augmentation (DTAug) strategy is designed to reconstruct multimodal land-cover distributions, enriching input information for FVMGN. Following that, to overcome multimodal heterogeneity, a multimodal wavelet disentanglement (MWDis) module is developed to learn cross-domain invariant features by resampling low and high frequency components in the frequency domain. Considering the characteristics of RS vision modalities, shared and proprietary class texts are designed as linguistic inputs for the transformer-based text encoder to extract diverse text features. For multimodal vision inputs, a spatial-frequency-aware image encoder (SFIE) is constructed to realize local-global feature reconstruction and representation. Finally, a multiscale spatial-frequency feature alignment (MSFFA) module is proposed to construct a unified semantic space, ensuring refined multiscale alignment of different text and vision features in spatial and frequency domains. Extensive experiments show that FVMGN has excellent multimodality generalization ability compared with state-of-the-art (SOTA) methods.
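Wavelet disentanglement relies on splitting a signal into low and high frequency components that can be processed separately and then recombined. The sketch below shows one level of the 1D Haar transform and its inverse. The choice of the Haar wavelet and of a 1D signal is an assumption made for brevity; the abstract does not specify the wavelet family, and image inputs would apply the same step along rows and columns.

```python
import math


def haar_step(signal):
    """One level of the Haar transform: (low-frequency averages, high-frequency details)."""
    scale = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal) - 1, 2)]
    high = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal) - 1, 2)]
    return low, high


def haar_inverse(low, high):
    """Rebuild the signal from its low and high frequency halves."""
    scale = math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out.extend([(l + h) / scale, (l - h) / scale])
    return out


signal = [4.0, 4.0, 2.0, 6.0]
low, high = haar_step(signal)
restored = haar_inverse(low, high)
```

The flat pair `4.0, 4.0` produces a zero detail coefficient, while the pair `2.0, 6.0` produces a non-zero one, which is the sense in which the high band isolates local variation.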
[CV-114] Expert Consensus-based Video-Based Assessment Tool for Workflow Analysis in Minimally Invasive Colorectal Surgery: Development and Validation of ColoWorkflow
[Quick read]: This paper addresses the variability in quality and outcomes of minimally invasive colorectal surgery (MICS), which stems from procedural variability, a difficult learning curve, and complications. Existing video-based assessment (VBA) tools are hard to standardize and implement, limiting their use for training and performance improvement. The key to the solution is a Delphi consensus process that produced a generalizable, validated VBA tool, ColoWorkflow. It decomposes MICS workflows into 10 procedure-agnostic phases and 34 procedure-specific steps, and was tested on a multicentre video dataset for applicability and inter-rater reliability (mean Cohen's kappa of 0.71 for phases and 0.66 for steps). The framework provides standardized analysis of the full workflow of minimally invasive colorectal surgery, offering a basis for AI-driven workflow recognition and cross-institution benchmarking, and may support standardized training and data-driven quality improvement.
Link: https://arxiv.org/abs/2511.10766
Authors: Pooja P Jain, Pietro Mascagni, Giuseppe Massimiani, Nabani Banik, Marta Goglia, Lorenzo Arboit, Britty Baby, Andrea Balla, Ludovica Baldari, Gianfranco Silecchia, Claudio Fiorillo, CompSurg Colorectal Experts Group, Sergio Alfieri, Salvador Morales-Conde, Deborah S Keller, Luigi Boni, Nicolas Padoy
Affiliations: IHU Strasbourg; University of Strasbourg; CNRS; INSERM; ICube; UMR7357; Fondazione Policlinico Universitario Agostino Gemelli IRCCS; Sapienza University of Rome; University Hospital Virgen Macarena; University of Sevilla; Department of General and Minimally Invasive Surgery, Fondazione IRCCS Ca’ Granda Ospedale Maggiore Policlinico; Arizona State University; Mayo Clinic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 4 figures
Abstract:Minimally invasive colorectal surgery is characterized by procedural variability, a difficult learning curve, and complications that impact quality and outcomes. Video-based assessment (VBA) offers an opportunity to generate data-driven insights to reduce variability, optimize training, and improve surgical performance. However, existing tools for workflow analysis remain difficult to standardize and implement. This study aims to develop and validate a VBA tool for workflow analysis across minimally invasive colorectal procedures. A Delphi process was conducted to achieve consensus on generalizable workflow descriptors. The resulting framework informed the development of a new VBA tool, ColoWorkflow. Independent raters then applied ColoWorkflow to a multicentre video dataset of laparoscopic and robotic colorectal surgery (CRS). Applicability and inter-rater reliability were evaluated. Consensus was achieved for 10 procedure-agnostic phases and 34 procedure-specific steps describing CRS workflows. ColoWorkflow was developed and applied to 54 colorectal operative videos (left and right hemicolectomies, sigmoid and rectosigmoid resections, and total proctocolectomies) from five centres. The tool demonstrated broad applicability, with all but one label utilized. Inter-rater reliability was moderate, with mean Cohen’s K of 0.71 for phases and 0.66 for steps. Most discrepancies arose at phase transitions and step boundary definitions. ColoWorkflow is the first consensus-based, validated VBA tool for comprehensive workflow analysis in minimally invasive CRS. It establishes a reproducible framework for video-based performance assessment, enabling benchmarking across institutions and supporting the development of artificial intelligence-driven workflow recognition. Its adoption may standardize training, accelerate competency acquisition, and advance data-informed surgical quality improvement.
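The inter-rater reliability figures above are Cohen's kappa values. As a minimal sketch of how such a value is computed for two raters (the phase names and label sequences below are hypothetical and not taken from the ColoWorkflow dataset):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical per-segment phase labels from two raters.
rater_1 = ["setup", "setup", "mobilization", "mobilization", "anastomosis", "closure"]
rater_2 = ["setup", "mobilization", "mobilization", "mobilization", "anastomosis", "closure"]
print(round(cohen_kappa(rater_1, rater_2), 3))
```

Kappa discounts the agreement expected by chance, which is why it is preferred over raw percentage agreement when some labels are much more frequent than others.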
[CV-115] Attentive Feature Aggregation or: How Policies Learn to Stop Worrying about Robustness and Attend to Task-Relevant Visual Cues
[Quick Read]: This paper addresses the problem that pre-trained visual representations (PVRs), when used to train visuomotor policies, encode a large amount of task-irrelevant scene information, leaving the policies sensitive to out-of-domain visual changes and distractors. The key to the solution is Attentive Feature Aggregation (AFA), a lightweight, trainable feature pooling mechanism that learns to attend to task-relevant visual cues and to ignore even semantically rich scene distractors, which significantly improves policy robustness and generalization in visually perturbed environments.
Link: https://arxiv.org/abs/2511.10762
Authors: Nikolaos Tsagkas, Andreas Sochopoulos, Duolikun Danier, Sethu Vijayakumar, Alexandros Kouris, Oisin Mac Aodha, Chris Xiaoxuan Lu
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper stems from a split of our earlier work “When Pre-trained Visual Representations Fall Short: Limitations in Visuo-Motor Robot Learning.” While “The Temporal Trap” replaces the original and focuses on temporal entanglement, this companion study examines policy robustness and task-relevant visual cue selection
Abstract:The adoption of pre-trained visual representations (PVRs), leveraging features from large-scale vision models, has become a popular paradigm for training visuomotor policies. However, these powerful representations can encode a broad range of task-irrelevant scene information, making the resulting trained policies vulnerable to out-of-domain visual changes and distractors. In this work we address visuomotor policy feature pooling as a solution to the observed lack of robustness in perturbed scenes. We achieve this via Attentive Feature Aggregation (AFA), a lightweight, trainable pooling mechanism that learns to naturally attend to task-relevant visual cues, ignoring even semantically rich scene distractors. Through extensive experiments in both simulation and the real world, we demonstrate that policies trained with AFA significantly outperform standard pooling approaches in the presence of visual perturbations, without requiring expensive dataset augmentation or fine-tuning of the PVR. Our findings show that ignoring extraneous visual information is a crucial step towards deploying robust and generalisable visuomotor policies. Project Page: this http URL
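The abstract does not spell out the AFA architecture, so the following is only a generic sketch of attention-based pooling with a single learned query, assuming token features from a frozen PVR. The function name, projection matrices, and sizes are all hypothetical.

```python
import numpy as np

def attentive_pool(tokens, query, w_k, w_v):
    """Pool N token features of shape (N, D) into one vector using one query of shape (H,)."""
    keys = tokens @ w_k                                  # (N, H)
    values = tokens @ w_v                                # (N, H)
    scores = keys @ query / np.sqrt(query.shape[0])      # (N,) scaled dot products
    scores = scores - scores.max()                       # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()                    # softmax over tokens
    return weights @ values, weights

# Hypothetical sizes: 16 patch tokens of width 32, pooled to a 8-dim policy input.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 32))
pooled, weights = attentive_pool(
    tokens,
    query=rng.normal(size=8),
    w_k=rng.normal(size=(32, 8)),
    w_v=rng.normal(size=(32, 8)),
)
print(pooled.shape, float(weights.sum()))
```

In contrast to mean pooling, the softmax weights let training push the weight of distractor tokens toward zero; in a real policy the query and projections would be trained jointly with the policy head.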
[CV-116] Fast Data Attribution for Text-to-Image Models FAST NEURIPS2025
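[Quick Read]: This paper addresses the computational cost of data attribution for text-to-image models, i.e., how to efficiently identify the training images that most influenced a given generated output. Existing methods require substantial computation for every query, which makes them impractical in real-world settings. The key to the solution is to distill a slow, unlearning-based attribution method into a feature embedding space so that highly influential training images can be retrieved efficiently. At deployment, combined with efficient indexing and search, the method locates influential training samples without re-running expensive attribution algorithms. Experiments on models trained on MSCOCO and on Stable Diffusion show that attribution completes within a few seconds, 2,500x to 400,000x faster than existing methods.
Link: https://arxiv.org/abs/2511.10721
Authors: Sheng-Yu Wang, Aaron Hertzmann, Alexei A Efros, Richard Zhang, Jun-Yan Zhu
Affiliations: Carnegie Mellon University; Adobe Research; UC Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: NeurIPS 2025 camera ready. Project page: this https URL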
Abstract:Data attribution for text-to-image models aims to identify the training images that most significantly influenced a generated output. Existing attribution methods involve considerable computational resources for each query, making them impractical for real-world applications. We propose a novel approach for scalable and efficient data attribution. Our key idea is to distill a slow, unlearning-based attribution method to a feature embedding space for efficient retrieval of highly influential training images. During deployment, combined with efficient indexing and search methods, our method successfully finds highly influential images without running expensive attribution algorithms. We show extensive results on both medium-scale models trained on MSCOCO and large-scale Stable Diffusion models trained on LAION, demonstrating that our method can achieve better or competitive performance in a few seconds, faster than existing methods by 2,500x - 400,000x. Our work represents a meaningful step towards the large-scale application of data attribution methods on real-world models such as Stable Diffusion.
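Once attribution has been distilled into an embedding space, a query reduces to nearest-neighbour search. A minimal sketch of that retrieval step, assuming embeddings are already computed and using brute-force cosine similarity in place of whatever index the authors use (the function name and toy vectors are hypothetical):

```python
import numpy as np

def top_k_attribution(query_emb, train_embs, k=3):
    """Rank training items by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = t @ q                       # cosine similarity per training item
    order = np.argsort(-sims)[:k]      # indices of the k most similar items
    return order, sims[order]

# Toy 2-D embeddings standing in for learned attribution features.
train_embs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_emb = np.array([1.0, 0.1])
order, sims = top_k_attribution(query_emb, train_embs, k=3)
print(order.tolist())
```

For a LAION-scale training set the brute-force matrix product would normally be replaced by an approximate nearest-neighbour index; the ranking logic stays the same.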
[CV-117] Semantic VLM Dataset for Safe Autonomous Driving
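[Quick Read]: This paper addresses the lack of high-quality, structured, interpretable frame-level annotations for vision-language models (VLMs) in autonomous driving, with the goal of advancing semantic understanding and risk awareness in complex traffic scenes. The key to the solution is the CAR-Scenes dataset, which covers environment, road geometry, background-vehicle behavior, ego-vehicle behavior, vulnerable road users, sensor states, and a discrete severity scale (1-10), totaling 350+ leaf attributes. Labels are produced by a GPT-4o-assisted vision-language pipeline with human verification to ensure annotation quality and consistency. The dataset also provides attribute co-occurrence graphs and JSONL records that support semantic retrieval, dataset triage, and risk-aware scenario mining, and the annotation scripts, graph construction, and evaluation pipeline are publicly released to enable reproducible, data-centric workflows for intelligent vehicle development.
Link: https://arxiv.org/abs/2511.10701
Authors: Yuankai He, Weisong Shi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 6 figures, 7 tables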
Abstract:CAR-Scenes is a frame-level dataset for autonomous driving that enables training and evaluation of vision-language models (VLMs) for interpretable, scene-level understanding. We annotate 5,192 images drawn from Argoverse 1, Cityscapes, KITTI, and nuScenes using a 28-key category/sub-category knowledge base covering environment, road geometry, background-vehicle behavior, ego-vehicle behavior, vulnerable road users, sensor states, and a discrete severity scale (1-10), totaling 350+ leaf attributes. Labels are produced by a GPT-4o-assisted vision-language pipeline with human-in-the-loop verification; we release the exact prompts, post-processing rules, and per-field baseline model performance. CAR-Scenes also provides attribute co-occurrence graphs and JSONL records that support semantic retrieval, dataset triage, and risk-aware scenario mining across sources. To calibrate task difficulty, we include reproducible, non-benchmark baselines, notably a LoRA-tuned Qwen2-VL-2B with deterministic decoding, evaluated via scalar accuracy, micro-averaged F1 for list attributes, and severity MAE/RMSE on a fixed validation split. We publicly release the annotation and analysis scripts, including graph construction and evaluation scripts, to enable explainable, data-centric workflows for future intelligent vehicles. Dataset: this https URL
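The baselines are scored with micro-averaged F1 for list attributes and MAE/RMSE for severity. A minimal sketch of those two metrics follows; the attribute values and severity scores are made up, and returning 1.0 when no labels exist anywhere is a convention chosen here, not something the paper states.

```python
import math

def micro_f1(pred_lists, true_lists):
    """Micro-averaged F1 over list-valued attributes, pooling counts across records."""
    tp = fp = fn = 0
    for pred, true in zip(pred_lists, true_lists):
        p, t = set(pred), set(true)
        tp += len(p & t)   # predicted and present
        fp += len(p - t)   # predicted but absent
        fn += len(t - p)   # present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0

def severity_errors(pred, true):
    """Mean absolute error and root-mean-square error for scalar severity scores."""
    diffs = [p - t for p, t in zip(pred, true)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse

# Hypothetical predictions for two frames (list attribute) and three frames (severity).
pred_lists = [["pedestrian", "cyclist"], ["pedestrian"]]
true_lists = [["pedestrian"], ["pedestrian", "scooter"]]
print(micro_f1(pred_lists, true_lists))
print(severity_errors([3, 7, 5], [4, 7, 8]))
```

Micro-averaging pools true and false positives over all records before computing F1, so frequent attributes dominate the score; macro-averaging would weight each attribute equally instead.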
[CV-118] LT-Soups: Bridging Head and Tail Classes via Subsampled Model Soups NEURIPS2025
[Quick Read]: This paper addresses a performance trade-off of parameter-efficient fine-tuning (PEFT) methods on long-tailed (LT) data: although PEFT methods such as LoRA and AdaptFormer preserve tail-class performance well, they significantly hurt head-class accuracy. The study identifies the head-tail ratio (η) as a crucial but overlooked factor in this trade-off. The core of the solution is LT-Soups, a two-stage model soups framework: the first stage fine-tunes models on balanced subsets and averages them to reduce head-class bias; the second stage fine-tunes only the classifier to restore head-class accuracy. The method achieves a better balance than conventional PEFT and model soups across a range of long-tailed regimes.
Link: https://arxiv.org/abs/2511.10683
Authors: Masih Aminbeidokhti, Subhankar Roy, Eric Granger, Elisa Ricci, Marco Pedersoli
Affiliations: École de technologie supérieure; University of Bergamo; University of Trento; Fondazione Bruno Kessler (FBK)
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025
Abstract:Real-world datasets typically exhibit long-tailed (LT) distributions, where a few head classes dominate and many tail classes are severely underrepresented. While recent work shows that parameter-efficient fine-tuning (PEFT) methods like LoRA and AdaptFormer preserve tail-class performance on foundation models such as CLIP, we find that they do so at the cost of head-class accuracy. We identify the head-tail ratio, the proportion of head to tail classes, as a crucial but overlooked factor influencing this trade-off. Through controlled experiments on CIFAR100 with varying imbalance ratio ($\rho$) and head-tail ratio ($\eta$), we show that PEFT excels in tail-heavy scenarios but degrades in more balanced and head-heavy distributions. To overcome these limitations, we propose LT-Soups, a two-stage model soups framework designed to generalize across diverse LT regimes. In the first stage, LT-Soups averages models fine-tuned on balanced subsets to reduce head-class bias; in the second, it fine-tunes only the classifier on the full dataset to restore head-class accuracy. Experiments across six benchmark datasets show that LT-Soups achieves superior trade-offs compared to both PEFT and traditional model soups across a wide range of imbalance regimes.
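A minimal sketch of the two ingredients of the first stage, balanced subsampling and weight averaging, assuming uniform averaging and plain NumPy parameter dictionaries. The helper names are hypothetical, and the fine-tuning itself and the second-stage classifier tuning are left out.

```python
import numpy as np

def balanced_subset(labels, per_class, rng):
    """Sample at most `per_class` indices from every class."""
    labels = np.asarray(labels)
    picked = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        take = min(per_class, idx.size)
        picked.extend(rng.choice(idx, size=take, replace=False))
    return np.sort(np.array(picked))

def average_soup(state_dicts):
    """Uniformly average parameter dictionaries with identical keys and shapes."""
    keys = state_dicts[0].keys()
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0) for k in keys}

# Toy long-tailed labels: class 0 is the head (10 samples), class 1 the tail (2 samples).
rng = np.random.default_rng(0)
idx = balanced_subset([0] * 10 + [1] * 2, per_class=2, rng=rng)

# Two stand-in "fine-tuned models", each a dict with one weight vector.
soup = average_soup([{"w": np.array([1.0, 3.0])}, {"w": np.array([3.0, 5.0])}])
print(idx.size, soup["w"].tolist())
```

Drawing a fresh balanced subset for each soup ingredient lets every head-class sample be used by some model while no single model sees the imbalance.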
[CV-119] A Mathematical Framework for AI Singularity: Conditions, Bounds, and Control of Recursive Improvement
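[Quick Read]: This paper addresses how to quantitatively detect and control "runaway growth" in AI systems, i.e., under what observable conditions capability could escalate without bound in finite time, and under what conditions this can be ruled out. The key to the solution is an analytic framework for recursive self-improvement based on resource inputs (compute, data, energy) and deployment policies. Physical and information-theoretic limits (power, bandwidth, memory) define a service envelope that caps instantaneous improvement, and an endogenous growth model identifies the critical boundary between superlinear and subcritical growth. The framework maps measurable series (facility power, IO bandwidth, training throughput, benchmark losses, and spending) into yes/no certificates for whether singular behavior can occur, and provides directly implementable safety controls such as power caps, throughput throttling, and evaluation gates.
Link: https://arxiv.org/abs/2511.10668
Authors: Akbar Anbar Jafari, Cagri Ozcinar, Gholamreza Anbarjafari
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 41 pages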
Abstract:AI systems improve by drawing on more compute, data, energy, and better training methods. This paper asks a precise, testable version of the “runaway growth” question: under what measurable conditions could capability escalate without bound in finite time, and under what conditions can that be ruled out? We develop an analytic framework for recursive self-improvement that links capability growth to resource build-out and deployment policies. Physical and information-theoretic limits from power, bandwidth, and memory define a service envelope that caps instantaneous improvement. An endogenous growth model couples capital to compute, data, and energy and defines a critical boundary separating superlinear from subcritical regimes. We derive decision rules that map observable series (facility power, IO bandwidth, training throughput, benchmark losses, and spending) into yes/no certificates for runaway versus nonsingular behavior. The framework yields falsifiable tests based on how fast improvement accelerates relative to its current level, and it provides safety controls that are directly implementable in practice, such as power caps, throughput throttling, and evaluation gates. Analytical case studies cover capped-power, saturating-data, and investment-amplified settings, illustrating when the envelope binds and when it does not. The approach is simulation-free and grounded in measurements engineers already collect. Limitations include dependence on the chosen capability metric and on regularity diagnostics; future work will address stochastic dynamics, multi-agent competition, and abrupt architectural shifts. Overall, the results replace speculation with testable conditions and deployable controls for certifying or precluding an AI singularity.
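As a toy illustration of the superlinear versus subcritical distinction (a simplified stand-in chosen here, not the paper's model), assume capability $C(t)$ grows as a power of its current level, with hypothetical constants $a > 0$, exponent $p$, and initial level $C(0) = C_0 > 0$:

```latex
\frac{dC}{dt} = a\,C^{p}
\quad\Longrightarrow\quad
\frac{d}{dt}\,C^{1-p} = (1-p)\,a
\quad\Longrightarrow\quad
C(t)^{1-p} = C_0^{1-p} - (p-1)\,a\,t \qquad (p \neq 1).

% Superlinear case p > 1: the right-hand side reaches zero at a finite time t^*.
C(t) = \bigl[\,C_0^{1-p} - (p-1)\,a\,t\,\bigr]^{-1/(p-1)},
\qquad
t^{*} = \frac{C_0^{1-p}}{(p-1)\,a}.

% Case p = 1: exponential growth, finite for every finite t.
C(t) = C_0\,e^{a t}.

% Case p < 1: polynomial growth, finite for every finite t.
C(t) = \bigl[\,C_0^{1-p} + (1-p)\,a\,t\,\bigr]^{1/(1-p)}.
```

Under this assumption the exponent $p$ plays the role of the critical boundary: finite-time blow-up is possible only for $p > 1$, and any hard cap on the growth rate $dC/dt$ (for example from power or bandwidth limits) rules it out, since $C$ can then grow at most linearly.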
[CV-120] Synergy vs. Noise: Performance-Guided Multimodal Fusion For Biochemical Recurrence-Free Survival in Prostate Cancer
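[Quick Read]: This paper examines a core assumption behind the wide use of multimodal deep learning (MDL) in computational pathology: that combining modalities always improves predictive performance. Most current MDL models take the benefit of fusion for granted, without systematically checking how modality quality relates to the gain from integration. The key contribution is to propose and validate a performance-guided integration strategy: fusion improves the model markedly only when the individual modalities are themselves strongly predictive, whereas adding a poorly performing modality introduces noise and lowers overall accuracy. The finding implies that MDL design should first assess unimodal performance and fuse selectively rather than stacking every available modality.
Link: https://arxiv.org/abs/2511.11452
Authors: Seth Alain Chang, Muhammad Mueez Amjad, Noorul Wahab, Ethar Alzaid, Nasir Rajpoot, Adam Shephard
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: 5 pages, 1 figure, 4 tables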
Abstract:Multimodal deep learning (MDL) has emerged as a transformative approach in computational pathology. By integrating complementary information from multiple data sources, MDL models have demonstrated superior predictive performance across diverse clinical tasks compared to unimodal models. However, the assumption that combining modalities inherently improves performance remains largely unexamined. We hypothesise that multimodal gains depend critically on the predictive quality of individual modalities, and that integrating weak modalities may introduce noise rather than complementary information. We test this hypothesis on a prostate cancer dataset with histopathology, radiology, and clinical data to predict time-to-biochemical recurrence. Our results confirm that combining high-performing modalities yields superior performance compared to unimodal approaches. However, integrating a poor-performing modality with other higher-performing modalities degrades predictive accuracy. These findings demonstrate that multimodal benefit requires selective, performance-guided integration rather than indiscriminate modality combination, with implications for MDL design across computational pathology and medical imaging.
[CV-121] Unsupervised Motion-Compensated Decomposition for Cardiac MRI Reconstruction via Neural Representation AAAI-26
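[Quick Read]: This paper addresses the degraded image quality caused by highly accelerated sampling in cardiac magnetic resonance (CMR) imaging and the scarcity of ground-truth data, with the aim of improving reconstruction quality and clinical applicability. The key to the solution is MoCo-INR, a new method that integrates implicit neural representations (INR) with the conventional motion-compensated (MoCo) framework. Using explicit motion modeling and the continuous prior of INRs, it achieves accurate cardiac motion decomposition and high-quality CMR reconstruction. It also introduces an INR network architecture tailored to CMR, which significantly stabilizes model optimization.
Link: https://arxiv.org/abs/2511.11436
Authors: Xuanyu Tian, Lixuan Chen, Qing Wu, Xiao Wang, Jie Feng, Yuyao Zhang, Hongjiang Wei
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI-26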
Abstract:Cardiac magnetic resonance (CMR) imaging is widely used to characterize cardiac morphology and function. To accelerate CMR imaging, various methods have been proposed to recover high-quality spatiotemporal CMR images from highly undersampled k-t space data. However, current CMR reconstruction techniques either fail to achieve satisfactory image quality or are restricted by the scarcity of ground truth data, leading to limited applicability in clinical scenarios. In this work, we propose MoCo-INR, a new unsupervised method that integrates implicit neural representations (INR) with the conventional motion-compensated (MoCo) framework. Using explicit motion modeling and the continuous prior of INRs, MoCo-INR can produce accurate cardiac motion decomposition and high-quality CMR reconstruction. Furthermore, we introduce a new INR network architecture tailored to the CMR problem, which significantly stabilizes model optimization. Experiments on retrospective (simulated) datasets demonstrate the superiority of MoCo-INR over state-of-the-art methods, achieving fast convergence and fine-detailed reconstructions at ultra-high acceleration factors (e.g., 20x in VISTA sampling). Additionally, evaluations on prospective (real-acquired) free-breathing CMR scans highlight the clinical practicality of MoCo-INR for real-time imaging. Several ablation studies further confirm the effectiveness of the critical components of MoCo-INR.
[CV-122] Large-scale modality-invariant foundation models for brain MRI analysis: Application to lesion segmentation
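[Quick Read]: This paper addresses the poor fit between self-supervised learning (SSL) frameworks and multi-modal brain MRI data: existing SSL methods are designed mainly for natural images and struggle to capture the multi-modal nature of MRI. The key to the solution is a modality-invariant representation learning strategy, evaluated after large-scale pre-training on stroke and epilepsy lesion segmentation. The experiments indicate that, although cross-modality alignment succeeds, preserving fine-grained modality-specific features is what primarily improves segmentation performance.
Link: https://arxiv.org/abs/2511.11311
Authors: Petros Koutsouvelis, Matej Gazda, Leroy Volmer, Sina Amirrajab, Kamil Barbierik, Branislav Setlak, Jakub Gazda, Peter Drotar
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Submitted to IEEE ISBI 2026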
Abstract:The field of computer vision is undergoing a paradigm shift toward large-scale foundation model pre-training via self-supervised learning (SSL). Leveraging large volumes of unlabeled brain MRI data, such models can learn anatomical priors that improve few-shot performance in diverse neuroimaging tasks. However, most SSL frameworks are tailored to natural images, and their adaptation to capture multi-modal MRI information remains underexplored. This work proposes a modality-invariant representation learning setup and evaluates its effectiveness in stroke and epilepsy lesion segmentation, following large-scale pre-training. Experimental results suggest that despite successful cross-modality alignment, lesion segmentation primarily benefits from preserving fine-grained modality-specific features. Model checkpoints and code are made publicly available.
[CV-123] Deep Learning-Enhanced Analysis for Delineating Anticoagulant Essay Efficacy Using Phase Microscopy
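[Quick Read]: This paper addresses the interference that ex vivo blood coagulation causes in hematological analysis, which can lead to inaccurate test results and altered cell morphology and thus compromise diagnostic reliability. The key to the solution is a framework that combines Digital Holographic Microscopy (DHM) with deep learning for label-free, non-invasive analysis of cell morphology and clustering in blood samples treated with different anticoagulants (conventional EDTA and novel potassium ferric oxalate nanoparticles, KFeOx-NPs). Automated image processing and deep learning quantify cell clustering and morphological change, enabling high-throughput, precise evaluation of anticoagulant efficacy and revealing how different anticoagulants affect red blood cell morphology and coagulation dynamics ex vivo.
Link: https://arxiv.org/abs/2511.11158
Authors: S. Shrivastava, M. Rathor, D. Yenurkar, S. K. Chaubey, S. Mukherjee, R. K. Singh
Affiliations: Unknown
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)
Comments: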
Abstract:The coagulation of blood after it is drawn from the body poses a significant challenge for hematological analysis, potentially leading to inaccurate test results and altered cellular characteristics, compromising diagnostic reliability. This paper presents a deep learning-enhanced framework for delineating anticoagulant efficacy ex vivo using Digital Holographic Microscopy (DHM). We demonstrate a label-free, non-invasive approach for analyzing human blood samples, capable of accurate cell counting and morphological estimation. A DHM with an automated image processing and deep learning pipeline is built for morphological analysis of the blood cells under two different anticoagulation agents, namely conventional EDTA and novel potassium ferric oxalate nanoparticles (KFeOx-NPs). This enables automated high-throughput screening of cells and estimation of blood coagulation rates when samples are treated with different anticoagulants. Results indicated that KFeOx-NPs prevented human blood coagulation without altering the cellular morphology of red blood cells (RBCs), whereas EDTA incubation caused notable changes within 6 hours of incubation. The system allows for quantitative analysis of coagulation dynamics by assessing parameters like cell clustering and morphology over time in these prepared samples, offering insights into the comparative efficacy and effects of anticoagulants outside the body.
[CV-124] Boosting Neural Video Representation via Online Structural Reparameterization
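[Quick Read]: This paper addresses two problems of Neural Video Representation (NVR) for video compression: existing methods rely on complex architectural designs that add computational overhead and are hard to integrate into other frameworks, and limited model capacity restricts the expressiveness of NVR networks, creating a performance bottleneck. The key to the solution is Online-RepNeRV, an NVR framework based on online structural reparameterization. Its core component is a universal reparameterization block (ERB) that uses multiple parallel convolutional paths to increase model capacity. An online reparameterization strategy dynamically fuses parameters during training, and after training the multi-branch structure is converted into an equivalent single-branch structure, so the extra computation and parameters are confined to the encoding stage and decoding efficiency is unaffected. Experiments on mainstream video datasets show an average PSNR gain of 0.37-2.7 dB over the baselines at comparable training time and decoding speed.
Link: https://arxiv.org/abs/2511.11071
Authors: Ziyi Li, Qingyu Mao, Shuai Liu, Qilei Li, Fanyang Meng, Yongsheng Liang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 15 pages, 7 figures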
Abstract:Neural Video Representation (NVR) is a promising paradigm for video compression, showing great potential in improving video storage and transmission efficiency. While recent advances have made efforts in architectural refinements to improve representational capability, these methods typically involve complex designs, which may incur increased computational overhead and lack the flexibility to integrate into other frameworks. Moreover, the inherent limitation in model capacity restricts the expressiveness of NVR networks, resulting in a performance bottleneck. To overcome these limitations, we propose Online-RepNeRV, an NVR framework based on online structural reparameterization. Specifically, we propose a universal reparameterization block named ERB, which incorporates multiple parallel convolutional paths to enhance the model capacity. To mitigate the overhead, an online reparameterization strategy is adopted to dynamically fuse the parameters during training, and the multi-branch structure is equivalently converted into a single-branch structure after training. As a result, the additional computational and parameter complexity is confined to the encoding stage, without affecting the decoding efficiency. Extensive experiments on mainstream video datasets demonstrate that our method achieves an average PSNR gain of 0.37-2.7 dB over baseline methods, while maintaining comparable training time and decoding speed.
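Structural reparameterization rests on the linearity of convolution: parallel branches whose outputs are summed can be collapsed into a single kernel. The sketch below shows this in 1-D with NumPy. It illustrates the general principle only; the actual ERB layout is not given in the abstract, and the branch sizes here are hypothetical.

```python
import numpy as np

def pad_to(kernel, size):
    """Zero-pad an odd-length kernel symmetrically to `size` taps."""
    pad = (size - len(kernel)) // 2
    return np.pad(kernel, (pad, pad))

def fuse_branches(kernels):
    """Collapse parallel convolution branches (with summed outputs) into one kernel."""
    size = max(len(k) for k in kernels)
    return sum(pad_to(np.asarray(k, dtype=float), size) for k in kernels)

rng = np.random.default_rng(0)
x = rng.normal(size=32)
# Three hypothetical parallel branches with 3, 1, and 5 taps.
branches = [rng.normal(size=3), rng.normal(size=1), rng.normal(size=5)]

multi_branch = sum(np.convolve(x, k, mode="same") for k in branches)
single_branch = np.convolve(x, fuse_branches(branches), mode="same")
print(np.allclose(multi_branch, single_branch))
```

The equivalence holds only while the branches are linear; any nonlinearity between a branch and the sum would block the merge. The same padding-and-adding argument carries over to 2-D convolution kernels.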
[CV-125] CLIPPan: Adapting CLIP as A Supervisor for Unsupervised Pansharpening AAAI2026
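[Quick Read]: This paper addresses the domain adaptation problem that supervised pansharpening networks face in practice because of the resolution gap between training data and real scenes. Conventional methods rely on simulated reduced-resolution training data and do not transfer directly to real full-resolution images, which limits performance. The key to the solution is CLIPPan, an unsupervised full-resolution framework whose core idea is to use the vision-language model CLIP as a supervisor. A lightweight fine-tuning step adapts CLIP to recognize multispectral, panchromatic, and fused images and to understand the pansharpening task. A new loss based on semantic language constraints then aligns the image fusion process with protocol-aligned textual prompts (e.g., Wald's or Khan's descriptions), guiding fusion learning without ground truth. This markedly improves spectral and spatial fidelity on real data and sets a new state of the art for unsupervised full-resolution pansharpening.
Link: https://arxiv.org/abs/2511.10896
Authors: Lihua Jian, Jiabo Liu, Shaowu Wu, Lihui Chen
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI 2026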
Abstract:Despite remarkable advancements in supervised pansharpening neural networks, these methods face domain adaptation challenges of resolution due to the intrinsic disparity between simulated reduced-resolution training data and real-world full-resolution this http URL. To bridge this gap, we propose an unsupervised pansharpening framework, CLIPPan, that enables model training at full resolution directly by taking CLIP, a visual-language model, as a supervisor. However, directly applying CLIP to supervise pansharpening remains challenging due to its inherent bias toward natural images and limited understanding of pansharpening tasks. Therefore, we first introduce a lightweight fine-tuning pipeline that adapts CLIP to recognize low-resolution multispectral, panchromatic, and high-resolution multispectral images, as well as to understand the pansharpening process. Then, building on the adapted CLIP, we formulate a novel loss integrating semantic language constraints, which aligns image-level fusion transitions with protocol-aligned textual prompts (e.g., Wald’s or Khan’s descriptions), thus enabling CLIPPan to use language as a powerful supervisory signal and guide fusion learning without ground truth. Extensive experiments demonstrate that CLIPPan consistently improves spectral and spatial fidelity across various pansharpening backbones on real-world datasets, setting a new state of the art for unsupervised full-resolution pansharpening.
[CV-126] From Attention to Frequency: Integration of Vision Transformer and FFT-ReLU for Enhanced Image Deblurring
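[Quick Read]: This paper addresses image deblurring, i.e., recovering a sharp image from one blurred by motion or camera shake. Deep learning methods such as convolutional neural networks (CNNs) and Vision Transformers (ViTs) have made progress but still hit performance limits on complex or high-resolution blur and are computationally expensive. The key to the solution is a dual-domain architecture that combines a ViT with a frequency-domain FFT-ReLU module, explicitly joining spatial attention modeling with frequency-domain sparsity: the ViT backbone captures local and global dependencies, while the FFT-ReLU module enforces frequency-domain sparsity to suppress blur-related artifacts and preserve fine details, yielding better peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and perceptual quality on several benchmark datasets.
Link: https://arxiv.org/abs/2511.10806
Authors: Syed Mumtahin Mahmud, Mahdi Mohd Hossain Noki, Prothito Shovon Majumder, Abdul Mohaimen Al Radi, Md. Haider Ali, Md. Mosaddek Khan
Affiliations: University of Dhaka
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: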
Abstract:Image deblurring is vital in computer vision, aiming to recover sharp images from blurry ones caused by motion or camera shake. While deep learning approaches such as CNNs and Vision Transformers (ViTs) have advanced this field, they often struggle with complex or high-resolution blur and computational demands. We propose a new dual-domain architecture that unifies Vision Transformers with a frequency-domain FFT-ReLU module, explicitly bridging spatial attention modeling and frequency sparsity. In this structure, the ViT backbone captures local and global dependencies, while the FFT-ReLU component enforces frequency-domain sparsity to suppress blur-related artifacts and preserve fine details. Extensive experiments on benchmark datasets demonstrate that this architecture achieves superior PSNR, SSIM, and perceptual quality compared to state-of-the-art models. Quantitative metrics, qualitative comparisons, and human preference evaluations all confirm its effectiveness, establishing a practical and generalizable paradigm for real-world image restoration.
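The abstract does not define the FFT-ReLU module. One common reading, assumed here, is: transform a feature map to the frequency domain, apply ReLU separately to the real and imaginary parts, and transform back. The function below is a hypothetical NumPy sketch of that reading, not the authors' implementation.

```python
import numpy as np

def fft_relu(feature_map):
    """FFT -> ReLU on real and imaginary parts -> inverse FFT, for a 2-D real array."""
    spectrum = np.fft.rfft2(feature_map)
    # Zero out negative real and imaginary coefficients (the sparsifying step).
    gated = np.maximum(spectrum.real, 0.0) + 1j * np.maximum(spectrum.imag, 0.0)
    return np.fft.irfft2(gated, s=feature_map.shape)

# A constant positive map has only a positive DC coefficient, so it passes through unchanged.
flat = np.ones((4, 4))
print(np.allclose(fft_relu(flat), flat))
```

Zeroing coefficients in the frequency domain affects the whole image at once, which is the sense in which such a block complements the spatially local and attention-based processing of the ViT backbone.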
[CV-127] DualVision ArthroNav: Investigating Opportunities to Enhance Localization and Reconstruction in Image-based Arthroscopy Navigation via External Cameras
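[Quick Read]: This paper addresses unstable visual localization in existing arthroscopy navigation systems: a monocular arthroscope camera is prone to scale ambiguity and drift over long use, while optical tracking systems impose strict workspace constraints and disrupt the surgical workflow. The key to the solution is DualVision ArthroNav, which pairs the monocular arthroscope camera with an external camera rigidly mounted on the arthroscope: the external camera provides stable visual odometry and absolute localization, and the monocular arthroscope video is used for dense scene reconstruction. Fusing the two complementary views removes the scale ambiguity and long-term drift of monocular SLAM and enables robust relocalization, significantly improving navigation accuracy and stability.
Link: https://arxiv.org/abs/2511.10699
Authors: Hongchao Shu, Lalithkumar Seenivasan, Mingxu Liu, Yunseo Hwang, Yu-Chun Ku, Jonathan Knopf, Alejandro Martin-Gomez, Mehran Armand, Mathias Unberath
Affiliations: Johns Hopkins University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: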
Abstract:Arthroscopic procedures can greatly benefit from navigation systems that enhance spatial awareness, depth perception, and field of view. However, existing optical tracking solutions impose strict workspace constraints and disrupt surgical workflow. Vision-based alternatives, though less invasive, often rely solely on the monocular arthroscope camera, making them prone to drift, scale ambiguity, and sensitivity to rapid motion or occlusion. We propose DualVision ArthroNav, a multi-camera arthroscopy navigation system that integrates an external camera rigidly mounted on the arthroscope. The external camera provides stable visual odometry and absolute localization, while the monocular arthroscope video enables dense scene reconstruction. By combining these complementary views, our system resolves the scale ambiguity and long-term drift inherent in monocular SLAM and ensures robust relocalization. Experiments demonstrate that our system effectively compensates for calibration errors, achieving an average absolute trajectory error of 1.09 mm. The reconstructed scenes reach an average target registration error of 2.16 mm, with high visual fidelity (SSIM = 0.69, PSNR = 22.19). These results indicate that our system provides a practical and cost-efficient solution for arthroscopic navigation, bridging the gap between optical tracking and purely vision-based systems, and paving the way toward clinically deployable, fully vision-based arthroscopic guidance.
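The localization result above is reported as an average absolute trajectory error. A minimal sketch of that metric, assuming the estimated and reference trajectories are already time-matched and expressed in the same coordinate frame and units (the alignment step, and the exact variant the authors use, are not specified in the abstract):

```python
import numpy as np

def absolute_trajectory_error(estimated, reference):
    """Mean Euclidean distance between matched positions of two trajectories."""
    estimated = np.asarray(estimated, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if estimated.shape != reference.shape:
        raise ValueError("trajectories must have matching shapes")
    return float(np.linalg.norm(estimated - reference, axis=1).mean())

# Made-up 3-D positions in millimetres; per-pose errors are 0, 1, and 2 mm.
reference = [[0, 0, 0], [1, 0, 0], [2, 0, 0]]
estimated = [[0, 0, 0], [1, 1, 0], [2, 0, 2]]
print(absolute_trajectory_error(estimated, reference))
```

Some evaluations report the root-mean-square of the per-pose errors instead of the mean, which penalizes occasional large deviations more heavily.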
Artificial Intelligence
[AI-0] Private Frequency Estimation Via Residue Number Systems AAAI2026
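[Quick Read]: This paper addresses the communication efficiency and computational complexity of frequency estimation under local differential privacy (LDP). Existing methods such as SubsetSelection (SS) and ProjectiveGeometryResponse (PGR) achieve excellent estimation accuracy but incur high user communication cost or need a complex decoder. The key to the solution is the ModularSubsetSelection (MSS) algorithm: it uses a Residue Number System (RNS) to encode each user's input as residues under several pairwise-coprime moduli, and the user reports one randomly chosen modulus index together with the perturbed residue. This cuts user communication from Θ(ω log₂(k/ω)) bits to ⌈log₂ ℓ⌉ + ⌈log₂ mⱼ⌉ bits while keeping mean squared error (MSE) comparable to that of the best existing protocols. The server solves a linear system iteratively with LSMR, giving efficient decoding (Θ(n + k log k) time in practice) without the algebraic prerequisites and dynamic-programming decoder that PGR requires.
Link: https://arxiv.org/abs/2511.11569
Authors: Héber H. Arcolezi
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: AAAI 2026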
Abstract:We present ModularSubsetSelection (MSS), a new algorithm for locally differentially private (LDP) frequency estimation. Given a universe of size $k$ and $n$ users, our $\varepsilon$-LDP mechanism encodes each input via a Residue Number System (RNS) over $\ell$ pairwise-coprime moduli $m_0, \ldots, m_{\ell-1}$, and reports a randomly chosen index $j \in [\ell]$ along with the perturbed residue using the statistically optimal SubsetSelection (SS) (Wang et al. 2016). This design reduces the user communication cost from $\Theta\bigl(\omega \log_2(k/\omega)\bigr)$ bits required by standard SS (with $\omega \approx k/(e^\varepsilon+1)$) down to $\lceil \log_2 \ell \rceil + \lceil \log_2 m_j \rceil$ bits, where $m_j < k$. Server-side decoding runs in $\Theta(n + r k \ell)$ time, where $r$ is the number of LSMR (Fong and Saunders 2011) iterations. In practice, with well-conditioned moduli (i.e., constant $r$ and $\ell = \Theta(\log k)$), this becomes $\Theta(n + k \log k)$. We prove that MSS achieves worst-case MSE within a constant factor of state-of-the-art protocols such as SS and ProjectiveGeometryResponse (PGR) (Feldman et al. 2022), while avoiding the algebraic prerequisites and dynamic-programming decoder required by PGR. Empirically, MSS matches the estimation accuracy of SS, PGR, and RAPPOR (Erlingsson, Pihur, and Korolova 2014) across realistic $(k, \varepsilon)$ settings, while offering faster decoding than PGR and shorter user messages than SS. Lastly, by sampling from multiple moduli and reporting only a single perturbed residue, MSS achieves the lowest reconstruction-attack success rate among all evaluated LDP protocols.
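To make the residue number system and the message-size formula concrete, here is a minimal sketch of the non-private parts only. The moduli, universe size, and input value are made up, and the SubsetSelection perturbation and the LSMR-based server decoding described in the abstract are not implemented.

```python
import math
import random

def rns_encode(x, moduli):
    """Residues of x with respect to each modulus."""
    return [x % m for m in moduli]

def crt_decode(residues, moduli):
    """Recover x in [0, prod(moduli)) from all residues (Chinese Remainder Theorem)."""
    total = math.prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        x += r * partial * pow(partial, -1, m)   # modular inverse, Python 3.8+
    return x % total

def message_bits(num_moduli, modulus):
    """Bits needed to send one modulus index j plus one residue modulo m_j."""
    return math.ceil(math.log2(num_moduli)) + math.ceil(math.log2(modulus))

moduli = [7, 11, 13]            # pairwise coprime, product 1001 covers a universe of k = 1000
x = 537                         # one user's (hypothetical) input
residues = rns_encode(x, moduli)
print(residues, crt_decode(residues, moduli))

j = random.randrange(len(moduli))   # the user picks one modulus at random ...
report = (j, residues[j])           # ... and MSS would perturb this residue before sending
print(message_bits(len(moduli), moduli[j]))
```

Note that the server in MSS never sees all residues of a single user, so `crt_decode` only illustrates why pairwise-coprime moduli identify a value uniquely; the actual protocol recovers the frequency histogram by solving a linear system over the aggregated reports.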
[AI-1] A Unified Convergence Analysis for Semi-Decentralized Learning: Sampled-to-Sampled vs. Sampled-to-All Communication AAAI2026
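[Quick Read]: This paper addresses the lack of a systematic theoretical and empirical comparison between two model aggregation strategies in semi-decentralized federated learning: sampled-to-sampled (S2S), where the aggregate is returned only to the sampled devices, and sampled-to-all (S2A), where it is broadcast to all devices. The key to the solution is a unified convergence analysis framework that accounts for the main system parameters (sampling rate, server aggregation frequency, and network connectivity). Theory and experiments together identify the regimes in which each strategy performs better at different levels of data heterogeneity, yielding practical design guidelines for deployment.
Link: https://arxiv.org/abs/2511.11560
Authors: Angelo Rodio, Giovanni Neglia, Zheng Chen, Erik G. Larsson
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Accepted as a conference paper at AAAI 2026 (oral presentation). This is the extended version including the appendix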
Abstract:In semi-decentralized federated learning, devices primarily rely on device-to-device communication but occasionally interact with a central server. Periodically, a sampled subset of devices uploads their local models to the server, which computes an aggregate model. The server can then either (i) share this aggregate model only with the sampled clients (sampled-to-sampled, S2S) or (ii) broadcast it to all clients (sampled-to-all, S2A). Despite their practical significance, a rigorous theoretical and empirical comparison of these two strategies remains absent. We address this gap by analyzing S2S and S2A within a unified convergence framework that accounts for key system parameters: sampling rate, server aggregation frequency, and network connectivity. Our results, both analytical and experimental, reveal distinct regimes where one strategy outperforms the other, depending primarily on the degree of data heterogeneity across devices. These insights lead to concrete design guidelines for practical semi-decentralized FL deployments.
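The difference between the two strategies comes down to who receives the server aggregate. A minimal sketch of one server round, assuming a plain unweighted average and leaving out local training and the device-to-device exchanges that happen between server rounds (the function name and toy models are hypothetical):

```python
import numpy as np

def server_round(models, sampled, mode):
    """One server aggregation step. models: (num_devices, dim); sampled: device indices."""
    models = np.array(models, dtype=float)       # copy so the caller's data is untouched
    aggregate = models[sampled].mean(axis=0)     # server averages the uploaded models
    if mode == "S2S":
        models[sampled] = aggregate              # only the sampled devices receive it
    elif mode == "S2A":
        models[:] = aggregate                    # every device receives it
    else:
        raise ValueError("mode must be 'S2S' or 'S2A'")
    return models

# Three devices with 1-D models; devices 0 and 1 are sampled this round.
models = [[0.0], [2.0], [10.0]]
print(server_round(models, [0, 1], "S2S").tolist())
print(server_round(models, [0, 1], "S2A").tolist())
```

In the toy run, S2A overwrites device 2's model with an average it did not contribute to, while S2S leaves it untouched. This is the kind of effect whose benefit or harm the paper ties to the degree of data heterogeneity.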
[AI-2] Volumetric Ergodic Control
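[Quick read]: This paper addresses the problem that conventional ergodic control models the robot as a volumeless point and therefore cannot accurately reflect its physical interaction with the environment, particularly in practical scenarios where the physical size of the robot body and sensors matters. The key to its solution is a new ergodic control formulation based on a volumetric state representation. The method preserves the original asymptotic coverage guarantees, adds minimal computational overhead so that real-time control remains feasible, and is compatible with arbitrary sample-based volumetric models. Experiments show that it improves coverage efficiency by more than a factor of two in search and manipulation tasks while maintaining a 100% task completion rate, outperforming standard ergodic control.
Link: https://arxiv.org/abs/2511.11533
Authors: Jueun Kwon,Max M. Sun,Todd Murphey
Institution: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 8 figures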
Abstract:Ergodic control synthesizes optimal coverage behaviors over spatial distributions for nonlinear systems. However, existing formulations model the robot as a non-volumetric point, but in practice a robot interacts with the environment through its body and sensors with physical volume. In this work, we introduce a new ergodic control formulation that optimizes spatial coverage using a volumetric state representation. Our method preserves the asymptotic coverage guarantees of ergodic control, adds minimal computational overhead for real-time control, and supports arbitrary sample-based volumetric models. We evaluate our method across search and manipulation tasks – with multiple robot dynamics and end-effector geometries or sensor models – and show that it improves coverage efficiency by more than a factor of two while maintaining a 100% task completion rate across all experiments, outperforming the standard ergodic control method. Finally, we demonstrate the effectiveness of our method on a robot arm performing mechanical erasing tasks.
[AI-3] Experience-Guided Adaptation of Inference-Time Reasoning Strategies
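[Quick read]: This paper addresses the difficulty generative AI systems have in dynamically adapting their problem-solving strategies at inference time based on post-training interactions. Existing methods steer a language model or agent only by modifying its textual input, and cannot adjust sampling parameters, remove tools, modify system prompts, or switch between agentic and workflow paradigms; more flexible adaptive systems rely on offline optimization and remain static after deployment. The key to its solution is the Experience-Guided Reasoner (EGuR), which centers on an LLM-based meta-strategy (a strategy that outputs strategies) that dynamically generates, at inference time, complete computational procedures comprising LLM calls, tool configurations, sampling parameters, and control logic. EGuR has two components: the Guide generates multiple candidate strategies from the current problem and a structured memory of past experience, and the Consolidator uses execution feedback to improve future strategy generation. This enables adaptation of all strategy components and supports caching, retrieval, and on-demand execution, significantly improving performance while reducing computational cost.
Link: https://arxiv.org/abs/2511.11519
Authors: Adam Stein,Matthew Trager,Benjamin Bowman,Michael Kleinman,Aditya Chattopadhyay,Wei Xia,Stefano Soatto
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 29 pages, 5 figures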
Abstract:Enabling agentic AI systems to adapt their problem-solving approaches based on post-training interactions remains a fundamental challenge. While systems that update and maintain a memory at inference time have been proposed, existing designs only steer the system by modifying textual input to a language model or agent, which means that they cannot change sampling parameters, remove tools, modify system prompts, or switch between agentic and workflow paradigms. On the other hand, systems that adapt more flexibly require offline optimization and remain static once deployed. We present Experience-Guided Reasoner (EGuR), which generates tailored strategies – complete computational procedures involving LLM calls, tools, sampling parameters, and control logic – dynamically at inference time based on accumulated experience. We achieve this using an LLM-based meta-strategy – a strategy that outputs strategies – enabling adaptation of all strategy components (prompts, sampling parameters, tool configurations, and control logic). EGuR operates through two components: a Guide generates multiple candidate strategies conditioned on the current problem and structured memory of past experiences, while a Consolidator integrates execution feedback to improve future strategy generation. This produces complete, ready-to-run strategies optimized for each problem, which can be cached, retrieved, and executed as needed without wasting resources. Across five challenging benchmarks (AIME 2025, 3-SAT, and three Big Bench Extra Hard tasks), EGuR achieves up to 14% accuracy improvements over the strongest baselines while reducing computational costs by up to 111x, with both metrics improving as the system gains experience.
[AI-4] Intrinsic Dimension Estimation for Radio Galaxy Zoo using Diffusion Models NEURIPS2025
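[Quick read]: This paper addresses the estimation of the intrinsic dimension (iD) of an astronomical image dataset (Radio Galaxy Zoo, RGZ), in order to quantify its complexity and reveal how different properties (such as energy scores, morphological class, and signal-to-noise ratio) affect iD. The key to its solution is to estimate iD with a score-based diffusion model and to systematically analyze how iD varies with Bayesian neural network (BNN) energy scores. The results show that out-of-distribution sources have higher iD values and that the overall iD of RGZ exceeds that of typical natural image datasets. In addition, iD shows a weak negative correlation with signal-to-noise ratio (SNR) but no significant association with Fanaroff-Riley (FR) morphological class. The approach provides a quantifiable benchmark for assessing the quality of representations that self-supervised learning algorithms learn from astronomical images.
Link: https://arxiv.org/abs/2511.11490
Authors: Joan Font-Quer Roset,Devina Mohan,Anna Scaife
Institution: Unknown
Categories: Machine Learning (cs.LG); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures, 2 tables, submitted to NeurIPS 2025 ML4PS Workshop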
Abstract:In this work, we estimate the intrinsic dimension (iD) of the Radio Galaxy Zoo (RGZ) dataset using a score-based diffusion model. We examine how the iD estimates vary as a function of Bayesian neural network (BNN) energy scores, which measure how similar the radio sources are to the MiraBest subset of the RGZ dataset. We find that out-of-distribution sources exhibit higher iD values, and that the overall iD for RGZ exceeds those typically reported for natural image datasets. Furthermore, we analyse how iD varies across Fanaroff-Riley (FR) morphological classes and as a function of the signal-to-noise ratio (SNR). While no relationship is found between FR I and FR II classes, we observe a weak trend toward higher SNR at lower iD. Future work using the RGZ dataset could make use of the relationship between iD and energy scores to quantitatively study and improve the representations learned by various self-supervised learning algorithms.
[AI-5] Context-aware Adaptive Visualizations for Critical Decision Making
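[Quick read]: This paper addresses the lack of real-time adaptation to users' cognitive state in information visualization (InfoVis) dashboards, with the goal of improving decision efficiency and user experience. The key to its solution is the Symbiotik system, which fuses neurophysiological signals to estimate mental workload (MWL) in real time and uses reinforcement learning (RL) to dynamically adjust how visualizations are presented, achieving personalized, adaptive interaction.
Link: https://arxiv.org/abs/2511.11476
Authors: Angela Lopez-Cardona,Mireia Masias Bruns,Nuwan T. Attygalle,Sebastian Idesis,Matteo Salvatori,Konstantinos Raftopoulos,Konstantinos Oikonomou,Saravanakumar Duraisamy,Parvin Emami,Nacera Latreche,Alaa Eddine Anis Sahraoui,Michalis Vakallelis,Jean Vanderdonckt,Ioannis Arapakis,Luis A. Leiva
Institution: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: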
Abstract:Effective decision-making often relies on timely insights from complex visual data. While Information Visualization (InfoVis) dashboards can support this process, they rarely adapt to users’ cognitive state, and less so in real time. We present Symbiotik, an intelligent, context-aware adaptive visualization system that leverages neurophysiological signals to estimate mental workload (MWL) and dynamically adapt visual dashboards using reinforcement learning (RL). Through a user study with 120 participants and three visualization types, we demonstrate that our approach improves task performance and engagement. Symbiotik offers a scalable, real-time adaptation architecture, and a validated methodology for neuroadaptive user interfaces.
[AI-6] Epistemic Error Decomposition for Multi-step Time Series Forecasting: Rethinking Bias-Variance in Recursive and Direct Strategies
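[Quick read]: This paper examines whether the conventional intuition in multi-step forecasting, that recursive strategies have high bias and low variance while direct strategies have low bias and high variance, actually holds. The key to its solution is to decompose the multi-step forecast error into three terms: irreducible noise, a structural approximation gap, and estimation variance. It shows that the structural gap is identically zero for linear predictors, whereas for nonlinear predictors the repeated composition in recursive strategies can increase model expressivity, making the structural gap depend on both the model and the data. It also shows that the estimation variance of the recursive strategy at any horizon can be written as the one-step variance multiplied by a Jacobian-based amplification factor that measures how sensitive the composed predictor is to parameter error. The analysis reveals that under some conditions the recursive strategy can have both lower bias and higher variance, providing a new basis, grounded in model nonlinearity and noise characteristics, for choosing between recursive and direct strategies.
Link: https://arxiv.org/abs/2511.11461
Authors: Riku Green,Huw Day,Zahraa S. Abdallah,Telmo M. Silva Filho
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 2025 EIML Eurips Workshop, 6 pages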
Abstract:Multi-step forecasting is often described through a simple rule of thumb: recursive strategies are said to have high bias and low variance, while direct strategies are said to have low bias and high variance. We revisit this belief by decomposing the expected multi-step forecast error into three parts: irreducible noise, a structural approximation gap, and an estimation-variance term. For linear predictors we show that the structural gap is identically zero for any dataset. For nonlinear predictors, however, the repeated composition used in recursion can increase model expressivity, making the structural gap depend on both the model and the data. We further show that the estimation variance of the recursive strategy at any horizon can be written as the one-step variance multiplied by a Jacobian-based amplification factor that measures how sensitive the composed predictor is to parameter error. This perspective explains when recursive forecasting may simultaneously have lower bias and higher variance than direct forecasting. Experiments with multilayer perceptrons on the ETTm1 dataset confirm these findings. The results offer practical guidance for choosing between recursive and direct strategies based on model nonlinearity and noise characteristics, rather than relying on traditional bias-variance intuition.
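The two forecasting strategies compared in the abstract can be sketched on a toy series. This is a minimal illustration, assuming a noiseless AR(1) process with a made-up coefficient of 0.8 and hypothetical helper names; it is not the paper's experimental setup. In this linear, noiseless case the recursive and direct forecasts coincide, which is consistent with the abstract's statement that the structural gap vanishes for linear predictors.

```python
import math

def fit_slope(xs, ys):
    """Least-squares slope through the origin: argmin_a sum (y - a*x)^2."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def recursive_forecast(a1, y_last, horizon):
    """Apply the one-step model repeatedly, feeding each prediction back in."""
    y = y_last
    for _ in range(horizon):
        y = a1 * y
    return y

def direct_forecast(series, y_last, horizon):
    """Fit a separate model that maps y_t straight to y_{t+horizon}."""
    a_h = fit_slope(series[:-horizon], series[horizon:])
    return a_h * y_last

series = [0.8 ** t for t in range(20)]        # hypothetical noiseless AR(1) data
a1 = fit_slope(series[:-1], series[1:])       # one-step coefficient, about 0.8
rec = recursive_forecast(a1, series[-1], 3)
direct = direct_forecast(series, series[-1], 3)
print(math.isclose(rec, direct, rel_tol=1e-9))
```

With a nonlinear one-step model or noisy data the two strategies would generally diverge, which is the regime the paper's decomposition is meant to explain.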
[AI-7] Retrofit: Continual Learning with Bounded Forgetting for Security Applications
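[Quick read]: This paper addresses two core challenges that continual learning (CL) faces in security-sensitive settings: how to retain prior knowledge without historical data, and how to integrate new knowledge with minimal interference. Conventional methods rely on full retraining or data replay, which is unsuitable where data privacy is restricted, and existing techniques are prone to catastrophic forgetting and model conflicts during knowledge transfer. The key to its solution is RETROFIT, whose core innovation is a parameter-level merging mechanism that combines the previously trained model (the teacher for old knowledge) with the fine-tuned model (the teacher for new knowledge), enabling knowledge retention and transfer without historical data. The method further applies low-rank and sparse updates that confine parameter changes to independent subspaces to reduce interference, and a knowledge arbitration mechanism based on model confidence that dynamically balances the teachers' contributions. On tasks such as malware detection and binary summarization, it significantly improves knowledge retention and cross-representation generalization.
Link: https://arxiv.org/abs/2511.11439
Authors: Yiling He,Junchi Lei,Hongyu She,Shuo Shao,Xinran Zheng,Yiping Liu,Zhan Qin,Lorenzo Cavallaro
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: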
Abstract:Modern security analytics are increasingly powered by deep learning models, but their performance often degrades as threat landscapes evolve and data representations shift. While continual learning (CL) offers a promising paradigm to maintain model effectiveness, many approaches rely on full retraining or data replay, which are infeasible in data-sensitive environments. Moreover, existing methods remain inadequate for security-critical scenarios, facing two coupled challenges in knowledge transfer: preserving prior knowledge without old data and integrating new knowledge with minimal interference. We propose RETROFIT, a data retrospective-free continual learning method that achieves bounded forgetting for effective knowledge transfer. Our key idea is to consolidate previously trained and newly fine-tuned models, serving as teachers of old and new knowledge, through parameter-level merging that eliminates the need for historical data. To mitigate interference, we apply low-rank and sparse updates that confine parameter changes to independent subspaces, while a knowledge arbitration dynamically balances the teacher contributions guided by model confidence. Our evaluation on two representative applications demonstrates that RETROFIT consistently mitigates forgetting while maintaining adaptability. In malware detection under temporal drift, it substantially improves the retention score, from 20.2% to 38.6% over CL baselines, and exceeds the oracle upper bound on new data. In binary summarization across decompilation levels, where analyzing stripped binaries is especially challenging, RETROFIT achieves around twice the BLEU score of transfer learning used in prior work and surpasses all baselines in cross-representation generalization. 
[AI-8] CURENet: Combining Unified Representations for Efficient Chronic Disease Prediction
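[Quick read]: This paper addresses the difficulty existing predictive models have in fully capturing the interactions, redundancies, and temporal patterns across multimodal electronic health record (EHR) data, such as unstructured clinical notes, structured lab tests, and time-series visit data. The key to its solution is CURENet, a multimodal fusion architecture that uses large language models (LLMs) to process clinical text and lab reports and Transformer encoders to model longitudinal visit sequences, thereby integrating heterogeneous EHR data and improving the accuracy of chronic disease prediction.
Link: https://arxiv.org/abs/2511.11423
Authors: Cong-Tinh Dao,Nguyen Minh Thao Phan,Jun-En Ding,Chenwei Wu,David Restrepo,Dongsheng Luo,Fanyi Zhao,Chun-Chieh Liao,Wen-Chih Peng,Chi-Te Wang,Pei-Fu Chen,Ling Chen,Xinglong Ju,Feng Liu,Fang-Ming Hung
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: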
Abstract:Electronic health records (EHRs) are designed to synthesize diverse data types, including unstructured clinical notes, structured lab tests, and time-series visit data. Physicians draw on these multimodal and temporal sources of EHR data to form a comprehensive view of a patient’s health, which is crucial for informed therapeutic decision-making. Yet, most predictive models fail to fully capture the interactions, redundancies, and temporal patterns across multiple data modalities, often focusing on a single data type or overlooking these complexities. In this paper, we present CURENet, a multimodal model (Combining Unified Representations for Efficient chronic disease prediction) that integrates unstructured clinical notes, lab tests, and patients’ time-series data by utilizing large language models (LLMs) for clinical text processing and textual lab tests, as well as transformer encoders for longitudinal sequential visits. CURENet is capable of capturing the intricate interactions between different forms of clinical data and yields a more reliable predictive model for chronic illnesses. We evaluated CURENet using the public MIMIC-III and private FEMH datasets, where it achieved over 94% accuracy in predicting the top 10 chronic conditions in a multi-label framework. Our findings highlight the potential of multimodal EHR integration to enhance clinical decision-making and improve patient outcomes.
[AI-9] Robust and Efficient Communication in Multi-Agent Reinforcement Learning
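[Quick read]: This paper addresses the performance degradation that multi-agent reinforcement learning (MARL) suffers in practical deployments because of communication constraints, in particular message perturbations, transmission delays, and bandwidth limits. It is a survey: its key contribution is a systematic review of robust and efficient communication strategies, arguing for the co-design of communication mechanisms, learning algorithms, and system robustness to connect theoretical MARL models with practical applications. The study focuses on three representative scenarios, cooperative autonomous driving, distributed simultaneous localization and mapping (SLAM), and federated learning, emphasizing the trade-offs among low-latency reliability, high-bandwidth data sharing, and communication privacy, and it argues that a unified framework is needed to optimize the co-design of communication, learning, and robustness.
Link: https://arxiv.org/abs/2511.11393
Authors: Zejiao Liu,Yi Li,Jiali Wang,Junqi Tu,Yitian Hong,Fangfei Li,Yang Liu,Toshiharu Sugawara,Yang Tang
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: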
Abstract:Multi-agent reinforcement learning (MARL) has made significant strides in enabling coordinated behaviors among autonomous agents. However, most existing approaches assume that communication is instantaneous, reliable, and has unlimited bandwidth; these conditions are rarely met in real-world deployments. This survey systematically reviews recent advances in robust and efficient communication strategies for MARL under realistic constraints, including message perturbations, transmission delays, and limited bandwidth. Furthermore, because the challenges of low-latency reliability, bandwidth-intensive data sharing, and communication-privacy trade-offs are central to practical MARL systems, we focus on three applications involving cooperative autonomous driving, distributed simultaneous localization and mapping, and federated learning. Finally, we identify key open challenges and future research directions, advocating a unified approach that co-designs communication, learning, and robustness to bridge the gap between theoretical MARL models and practical implementations.
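[AI-10] MarsRL: Advancing Multi-Agent Reasoning System via Reinforcement Learning with Agentic Pipeline Parallelism
[Quick read]: This paper addresses the limited generalization of open-source large language models (LLMs) in multi-agent reasoning systems, which stems from insufficient critic and correction capabilities. Existing methods such as reinforcement learning with verifiable rewards (RLVR) and test-time scaling can deepen reasoning, but they are constrained by the length of a single inference pass and struggle to iterate sufficiently on complex tasks. The authors propose the MarsRL framework, whose core innovations are agent-specific reward mechanisms that reduce reward noise and a pipeline-parallel training strategy that improves efficiency on long trajectories. Together these jointly optimize the Solver, Verifier, and Corrector agents, significantly strengthening the system's reasoning ability and stability in open settings.
Link: https://arxiv.org/abs/2511.11373
Authors: Shulin Liu,Dong Du,Tao Yang,Yang Li,Boyu Qiu
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages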
Abstract:Recent progress in large language models (LLMs) has been propelled by reinforcement learning with verifiable rewards (RLVR) and test-time scaling. However, the limited output length of LLMs constrains the depth of reasoning attainable in a single inference process. Multi-agent reasoning systems offer a promising alternative by employing multiple agents including Solver, Verifier, and Corrector, to iteratively refine solutions. While effective in closed-source models like Gemini 2.5 Pro, they struggle to generalize to open-source models due to insufficient critic and correction capabilities. To address this, we propose MarsRL, a novel reinforcement learning framework with agentic pipeline parallelism, designed to jointly optimize all agents in the system. MarsRL introduces agent-specific reward mechanisms to mitigate reward noise and employs pipeline-inspired training to enhance efficiency in handling long trajectories. Applied to Qwen3-30B-A3B-Thinking-2507, MarsRL improves AIME2025 accuracy from 86.5% to 93.3% and BeyondAIME from 64.9% to 73.8%, even surpassing Qwen3-235B-A22B-Thinking-2507. These findings highlight the potential of MarsRL to advance multi-agent reasoning systems and broaden their applicability across diverse reasoning tasks.
[AI-11] KarmaTS: A Universal Simulation Platform for Multivariate Time Series with Functional Causal Dynamics
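[Quick read]: This paper addresses the lack of multivariate time series (MTS) data with known causal structure for validating and benchmarking causal discovery algorithms when access to physiological data is restricted. The key to its solution is the KarmaTS framework, which uses a human-in-the-loop workflow to combine expert knowledge with algorithmic proposals and construct a discrete-time structural causal process (DSCP) that supports simulation of causal interventions, including under user-specified distribution shifts. The framework handles mixed variable types, accounts for both contemporaneous and lagged edges, and uses modular edge functionals (from parameterizable templates to neural network models), enabling flexible validation and benchmarking of causal discovery algorithms.
Link: https://arxiv.org/abs/2511.11357
Authors: Haixin Li,Yanke Li,Diego Paez-Granados
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: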
Abstract:We introduce KarmaTS, an interactive framework for constructing lag-indexed, executable spatiotemporal causal graphical models for multivariate time series (MTS) simulation. Motivated by the challenge of access-restricted physiological data, KarmaTS generates synthetic MTS with known causal dynamics and augments real-world datasets with expert knowledge. The system constructs a discrete-time structural causal process (DSCP) by combining expert knowledge and algorithmic proposals in a mixed-initiative, human-in-the-loop workflow. The resulting DSCP supports simulation and causal interventions, including those under user-specified distribution shifts. KarmaTS handles mixed variable types, contemporaneous and lagged edges, and modular edge functionals ranging from parameterizable templates to neural network models. Together, these features enable flexible validation and benchmarking of causal discovery algorithms through expert-informed simulation.
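A discrete-time structural causal process with lagged and contemporaneous edges can be sketched as a small simulator. This is a toy illustration only: the two variables, the linear edge functions, the coefficients, and the `simulate` helper with its `do_x` argument are all assumptions and do not reflect the KarmaTS API.

```python
import random

def simulate(steps, seed=0, do_x=None):
    """Simulate a two-variable process; do_x fixes x at every step (an intervention)."""
    rng = random.Random(seed)
    x_prev, y_prev = 0.0, 0.0
    trajectory = []
    for _ in range(steps):
        # Lagged edge x_{t-1} -> x_t; an intervention replaces this mechanism.
        x = do_x if do_x is not None else 0.7 * x_prev + rng.gauss(0, 1)
        # Contemporaneous edge x_t -> y_t plus lagged edge y_{t-1} -> y_t.
        y = 0.5 * x + 0.3 * y_prev + rng.gauss(0, 0.1)
        trajectory.append((x, y))
        x_prev, y_prev = x, y
    return trajectory

observational = simulate(100, seed=42)
interventional = simulate(100, seed=42, do_x=2.0)
print(observational[:2])
print(interventional[:2])
```

Because the generating graph is known by construction, output from such a simulator can serve as ground truth when checking whether a causal discovery algorithm recovers the x to y edges.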
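[AI-12] Privacy Challenges and Solutions in Retrieval-Augmented Generation-Enhanced LLMs for Healthcare Chatbots: A Review of Applications, Risks, and Future Directions
[Quick read]: This paper addresses the privacy risks that retrieval-augmented generation (RAG) architectures introduce when generative AI is applied in healthcare, especially the risk of exposing protected health information (PHI). The key to its solution is a pipeline-structured framework covering four stages, data storage, transmission, retrieval, and generation, used to systematically identify potential privacy failure modes and their causes at each stage. Building on this, it reviews 17 studies of privacy-preserving strategies and identifies critical gaps, including insufficient clinical validation, the absence of standardized evaluation frameworks, and a lack of automated assessment tools. It further proposes actionable research directions, stressing the need for RAG systems that combine clinical effectiveness with strong privacy protection, and offers a theoretical basis and practical roadmap for the safe deployment of future medical AI systems.
Link: https://arxiv.org/abs/2511.11347
Authors: Shaowei Guan,Hin Chi Kwok,Ngai Fong Law,Gregor Stiglic,Vivian Hui
Institution: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 23 pages, 2 figures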
Abstract:Retrieval-augmented generation (RAG) has rapidly emerged as a transformative approach for integrating large language models into clinical and biomedical workflows. However, privacy risks, such as protected health information (PHI) exposure, remain inconsistently mitigated. This review provides a thorough analysis of the current landscape of RAG applications in healthcare, including (i) sensitive data type across clinical scenarios, (ii) the associated privacy risks, (iii) current and emerging data-privacy protection mechanisms and (iv) future direction for patient data privacy protection. We synthesize 23 articles on RAG applications in healthcare and systematically analyze privacy challenges through a pipeline-structured framework encompassing data storage, transmission, retrieval and generation stages, delineating potential failure modes, their underlying causes in threat models and system mechanisms, and their practical implications. Building on this analysis, we critically review 17 articles on privacy-preserving strategies for RAG systems. Our evaluation reveals critical gaps, including insufficient clinical validation, absence of standardized evaluation frameworks, and lack of automated assessment tools. We propose actionable directions based on these limitations and conclude with a call to action. This review provides researchers and practitioners with a structured framework for understanding privacy vulnerabilities in healthcare RAG and offers a roadmap toward developing systems that achieve both clinical effectiveness and robust privacy preservation.
[AI-13] RLSLM: A Hybrid Reinforcement Learning Framework Aligning Rule-Based Social Locomotion Model with Human Social Norms AAAI2026
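[Quick read]: This paper addresses how socially aware agents can navigate human-populated environments without causing discomfort, that is, how to plan paths that conform to human social norms while remaining efficient and reliable. The core challenge is that conventional rule-based methods are interpretable but generalize poorly, while data-driven methods can learn complex behaviors but are inefficient, opaque, and hard to align with human intuition. The key to its solution is RLSLM, a hybrid reinforcement learning framework that embeds a rule-based Social Locomotion Model, built from empirical behavioral experiments, into the reward function. The model generates an orientation-sensitive social comfort field that quantifies human comfort across space, so that mechanical energy and social comfort are optimized jointly. This lets the agent produce socially compliant navigation policies with minimal training while significantly improving interpretability and user experience.
Link: https://arxiv.org/abs/2511.11323
Authors: Yitian Kou,Yihe Gu,Chen Zhou,DanDan Zhu,Shuguang Kuai
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: AAAI 2026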
Abstract:Navigating human-populated environments without causing discomfort is a critical capability for socially-aware agents. While rule-based approaches offer interpretability through predefined psychological principles, they often lack generalizability and flexibility. Conversely, data-driven methods can learn complex behaviors from large-scale datasets, but are typically inefficient, opaque, and difficult to align with human intuitions. To bridge this gap, we propose RLSLM, a hybrid Reinforcement Learning framework that integrates a rule-based Social Locomotion Model, grounded in empirical behavioral experiments, into the reward function of a reinforcement learning framework. The social locomotion model generates an orientation-sensitive social comfort field that quantifies human comfort across space, enabling socially aligned navigation policies with minimal training. RLSLM then jointly optimizes mechanical energy and social comfort, allowing agents to avoid intrusions into personal or group space. A human-agent interaction experiment using an immersive VR-based setup demonstrates that RLSLM outperforms state-of-the-art rule-based models in user experience. Ablation and sensitivity analyses further show the model’s significantly improved interpretability over conventional data-driven methods. This work presents a scalable, human-centered methodology that effectively integrates cognitive science and machine learning for real-world social navigation.
[AI-14] EcoAlign: An Economically Rational Framework for Efficient LVLM Alignment
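[Quick read]: This paper addresses the safety-utility-cost trade-off in aligning Large Vision-Language Models (LVLMs), in particular the way existing alignment methods, by attending only to final outputs (process-blindness), waste computation on unsafe deliberation. The key to its solution is the EcoAlign framework, which treats alignment as an economically rational search. It models the LVLM as a boundedly rational agent, incrementally expands a thought graph at inference time, and uses a forward-looking scoring function (analogous to net present value) to dynamically weigh expected safety, utility, and cost against the remaining budget. It also enforces path safety via the weakest-link principle, which prevents harmful reasoning from being masked by superficially benign justifications. Experiments across several closed-source and open-source models show that EcoAlign matches or surpasses state-of-the-art safety and utility at lower computational cost.
Link: https://arxiv.org/abs/2511.11301
Authors: Ruoxi Cheng,Haoxuan Ma,Teng Ma,Hongyi Zhang
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: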
Abstract:Large Vision-Language Models (LVLMs) exhibit powerful reasoning capabilities but suffer sophisticated jailbreak vulnerabilities. Fundamentally, aligning LVLMs is not just a safety challenge but a problem of economic efficiency. Current alignment methods struggle with the trade-off between safety, utility, and operational costs. Critically, a focus solely on final outputs (process-blindness) wastes significant computational budget on unsafe deliberation. This flaw allows harmful reasoning to be disguised with benign justifications, thereby circumventing simple additive safety scores. To address this, we propose EcoAlign, an inference-time framework that reframes alignment as an economically rational search by treating the LVLM as a boundedly rational agent. EcoAlign incrementally expands a thought graph and scores actions using a forward-looking function (analogous to net present value) that dynamically weighs expected safety, utility, and cost against the remaining budget. To prevent deception, path safety is enforced via the weakest-link principle. Extensive experiments across 3 closed-source and 2 open-source models on 6 datasets show that EcoAlign matches or surpasses state-of-the-art safety and utility at a lower computational cost, thereby offering a principled, economical pathway to robust LVLM alignment.
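The contrast between an additive safety score and the weakest-link principle can be shown with a short sketch. The per-step scores, the threshold, and the function names are hypothetical and chosen only to illustrate the idea; they are not EcoAlign's actual scoring function.

```python
def additive_safety(step_scores):
    """Average safety over a reasoning path; one unsafe step can be averaged away."""
    return sum(step_scores) / len(step_scores)

def weakest_link_safety(step_scores):
    """Path safety equals the safety of its least safe step."""
    return min(step_scores)

path = [0.95, 0.90, 0.10, 0.95]   # hypothetical per-step safety scores in [0, 1]
THRESHOLD = 0.5                    # hypothetical acceptance threshold

print(additive_safety(path) >= THRESHOLD)       # True: the unsafe step is masked
print(weakest_link_safety(path) >= THRESHOLD)   # False: the path is rejected
```

The third step is clearly unsafe, yet the averaged score still clears the threshold; taking the minimum is what stops benign-looking steps from offsetting it.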
[AI-15] Experiences from Benchmarking Vision-Language-Action Models for Robotic Manipulation
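[Quick read]: This paper addresses the lack of systematic real-world evaluation and cross-model comparison for generative AI in robotics, specifically Vision–Language–Action (VLA) models. The key to its solution is a standardized evaluation framework that quantitatively analyzes four representative VLA models (ACT, OpenVLA–OFT, RDT-1B, and π₀) on four manipulation tasks in simulation and on the ALOHA Mobile platform along several dimensions: accuracy and efficiency (success rate and time-to-success), adaptability (in-distribution, spatial out-of-distribution, and instance-plus-spatial out-of-distribution settings), and language instruction-following accuracy. The study reveals key differences among the models in generalization, computational demands, and failure modes, providing an actionable basis for trading off precision, robustness, and cost in real deployments.
Link: https://arxiv.org/abs/2511.11298
Authors: Yihao Zhang,Yuankai Qi,Xi Zheng
Institution: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: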
Abstract:Foundation models applied in robotics, particularly Vision–Language–Action (VLA) models, hold great promise for achieving general-purpose manipulation. Yet, systematic real-world evaluations and cross-model comparisons remain scarce. This paper reports our empirical experiences from benchmarking four representative VLAs – ACT, OpenVLA–OFT, RDT-1B, and π₀ – across four manipulation tasks conducted in both simulation and on the ALOHA Mobile platform. We establish a standardized evaluation framework that measures performance along three key dimensions: (1) accuracy and efficiency (success rate and time-to-success), (2) adaptability across in-distribution, spatial out-of-distribution, and instance-plus-spatial out-of-distribution settings, and (3) language instruction-following accuracy. Through this process, we observe that π₀ demonstrates superior adaptability in out-of-distribution scenarios, while ACT provides the highest stability in-distribution. Further analysis highlights differences in computational demands, data-scaling behavior, and recurring failure modes such as near-miss grasps, premature releases, and long-horizon state drift. These findings reveal practical trade-offs among VLA model architectures in balancing precision, generalization, and deployment cost, offering actionable insights for selecting and deploying VLAs in real-world robotic manipulation tasks.
[AI-16] Can You Tell the Difference? Contrastive Explanations for ABox Entailments AAAI-2026
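[Quick read]: This paper addresses contrastive explanation of ABox entailments in description logic (DL) ontologies, that is, answering contrastive questions such as "Why is individual a an instance of concept C, but individual b is not?". Conventional methods explain only positive entailments (why C(a) is entailed by the knowledge base) or missing entailments (why C(b) is not entailed) in isolation, and cannot capture the commonalities and differences between the two. The key to its solution is a new notion of contrastive explanation that jointly analyzes the differences between the two individuals in the knowledge base and focuses on the features responsible for their different concept membership, yielding more targeted and understandable explanations. The study also analyzes the computational complexity of the approach under different optimality criteria and implements an algorithm for one concrete variant, evaluated on problems generated from realistic knowledge bases.
Link: https://arxiv.org/abs/2511.11281
Authors: Patrick Koopmann,Yasir Mahmood,Axel-Cyrille Ngonga Ngomo,Balram Tiwari
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: Technical report to the paper accepted at AAAI-2026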
Abstract:We introduce the notion of contrastive ABox explanations to answer questions of the type “Why is a an instance of C, but b is not?”. While there are various approaches for explaining positive entailments (why is C(a) entailed by the knowledge base) as well as missing entailments (why is C(b) not entailed) in isolation, contrastive explanations consider both at the same time, which allows them to focus on the relevant commonalities and differences between a and b. We develop an appropriate notion of contrastive explanations for the special case of ABox reasoning with description logic ontologies, and analyze the computational complexity for different variants under different optimality criteria, considering lightweight as well as more expressive description logics. We implemented a first method for computing one variant of contrastive explanations, and evaluated it on generated problems for realistic knowledge bases.
[AI-17] A Workflow for Full Traceability of AI Decisions
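[Quick read]: This paper addresses the difficulty of assigning responsibility for high-stakes decisions made by artificial intelligence (AI) systems that lack traceability. As automated systems increasingly take part in critical decisions, existing AI technology often fails to provide enough documentation to trace the basis of a decision that harms personal well-being or fundamental human rights, which obstructs reconstruction of the responsibility chain. The key to its solution is to enforce documentation of every component that goes into training and inference, building a tamper-proof, verifiable, and exhaustive tracing workflow for AI decisions. By extending the DBOM concept and combining it with confidential computing technology, the method realizes the first running workflow for tracing AI decisions, intended to ensure that the cause of an unlawful AI decision can be determined in legal settings.
Link: https://arxiv.org/abs/2511.11275
Authors: Julius Wenzel,Syeda Umaima Alam,Andreas Schmidt,Hanwei Zhang,Holger Hermanns
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 10 figures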
Abstract:An ever increasing number of high-stakes decisions are made or assisted by automated systems employing brittle artificial intelligence technology. There is a substantial risk that some of these decisions induce harm to people, by infringing their well-being or their fundamental human rights. The state-of-the-art in AI systems makes little effort with respect to appropriate documentation of the decision process. This obstructs the ability to trace what went into a decision, which in turn is a prerequisite to any attempt at reconstructing a responsibility chain. Specifically, such traceability is linked to documentation that will stand up in court when determining the cause of some AI-based decision that inadvertently or intentionally violates the law. This paper takes a radical, yet practical, approach to this problem, by enforcing the documentation of each and every component that goes into the training or inference of an automated decision. As such, it presents the first running workflow supporting the generation of tamper-proof, verifiable and exhaustive traces of AI decisions. In doing so, we expand the DBOM concept into an effective running workflow leveraging confidential computing technology. We demonstrate the inner workings of the workflow in the development of an app to tell poisonous and edible mushrooms apart, meant as a playful example of high-stakes decision support.
[AI-18] AIonopedia: an LLM agent orchestrating multimodal learning for ionic liquid discovery
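[Quick read]: This paper addresses three key challenges in discovering new ionic liquids (ILs): limited data, insufficient predictive accuracy, and fragmented workflows. The core of its solution is AIonopedia, a large language model (LLM) agent built on an LLM-augmented multimodal domain foundation model that delivers accurate property prediction and is combined with a hierarchical search architecture for molecular screening and design. Trained and validated on a newly curated, comprehensive IL dataset, AIonopedia shows superior performance, and wet-lab experiments confirm strong generalization on out-of-distribution tasks, accelerating practical IL discovery.
Link: https://arxiv.org/abs/2511.11257
Authors: Yuqi Yin,Yibo Fu,Siyuan Wang,Peng Sun,Hongyu Wang,Xiaohui Wang,Lei Zheng,Zhiyong Li,Zhirong Liu,Jianji Wang,Zhaoxi Sun
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Comments: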
Abstract:The discovery of novel Ionic Liquids (ILs) is hindered by critical challenges in property prediction, including limited data, poor model accuracy, and fragmented workflows. Leveraging the power of Large Language Models (LLMs), we introduce AIonopedia, to the best of our knowledge, the first LLM agent for IL discovery. Powered by an LLM-augmented multimodal domain foundation model for ILs, AIonopedia enables accurate property predictions and incorporates a hierarchical search architecture for molecular screening and design. Trained and evaluated on a newly curated and comprehensive IL dataset, our model delivers superior performance. Complementing these results, evaluations on literature-reported systems indicate that the agent can perform effective IL modification. Moving beyond offline tests, the practical efficacy was further confirmed through real-world wet-lab validation, in which the agent demonstrated exceptional generalization capabilities on challenging out-of-distribution tasks, underscoring its ability to accelerate real-world IL discovery.
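[AI-19] UAVBench: An Open Benchmark Dataset for Autonomous and Agentic AI UAV Systems via LLM-Generated Flight Scenarios
Quick read: This paper addresses the lack of standardized, physically grounded benchmarks for autonomous aerial systems that increasingly rely on Large Language Models (LLMs) for mission planning, perception, and decision-making. Existing methods cannot systematically measure LLM reasoning under realistic airspace conditions, which limits trustworthy deployment in UAV settings. The key to the solution is UAVBench, a dataset of 50,000 UAV flight scenarios generated through taxonomy-guided prompting and multi-stage safety validation. Each scenario is encoded in structured JSON covering mission objectives, vehicle configuration, environmental conditions, and quantitative risk labels, giving a unified representation across domains. It is extended by UAVBench_MCQ, a reasoning-oriented extension of 50,000 multiple-choice questions spanning ten cognitive and ethical reasoning styles (such as aerodynamics, navigation, and multi-agent coordination), which supports interpretable, machine-checkable assessment of UAV-specific cognition. The framework combines physical constraints with semantic reasoning and provides a reproducible, scalable benchmark for next-generation UAV reasoning intelligence.
Link: https://arxiv.org/abs/2511.11252
Authors: Mohamed Amine Ferrag,Abderrahmane Lakas,Merouane Debbah
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages, 5 Figures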
Abstract:Autonomous aerial systems increasingly rely on large language models (LLMs) for mission planning, perception, and decision-making, yet the lack of standardized and physically grounded benchmarks limits systematic evaluation of their reasoning capabilities. To address this gap, we introduce UAVBench, an open benchmark dataset comprising 50,000 validated UAV flight scenarios generated through taxonomy-guided LLM prompting and multi-stage safety validation. Each scenario is encoded in a structured JSON schema that includes mission objectives, vehicle configuration, environmental conditions, and quantitative risk labels, providing a unified representation of UAV operations across diverse domains. Building on this foundation, we present UAVBench_MCQ, a reasoning-oriented extension containing 50,000 multiple-choice questions spanning ten cognitive and ethical reasoning styles, ranging from aerodynamics and navigation to multi-agent coordination and integrated reasoning. This framework enables interpretable and machine-checkable assessment of UAV-specific cognition under realistic operational contexts. We evaluate 32 state-of-the-art LLMs, including GPT-5, ChatGPT-4o, Gemini 2.5 Flash, DeepSeek V3, Qwen3 235B, and ERNIE 4.5 300B, and find strong performance in perception and policy reasoning but persistent challenges in ethics-aware and resource-constrained decision-making. UAVBench establishes a reproducible and physically grounded foundation for benchmarking agentic AI in autonomous aerial systems and advancing next-generation UAV reasoning intelligence. To support open science and reproducibility, we release the UAVBench dataset, the UAVBench_MCQ benchmark, evaluation scripts, and all related materials on GitHub at this https URL
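To make the scenario format concrete, here is a minimal Python sketch of what one record and a basic machine check could look like. The field names (`mission_objective`, `vehicle`, `environment`, `risk_label`), the example values, and the assumed [0, 1] risk range are illustrative guesses based only on the four groups of information named in the abstract; they are not the released UAVBench schema.

```python
import json

# Hypothetical field names: the abstract names the four groups of
# information, not the actual keys used in the released dataset.
REQUIRED_FIELDS = {
    "mission_objective": str,
    "vehicle": dict,
    "environment": dict,
    "risk_label": float,
}


def validate_scenario(record: dict) -> list:
    """Return a list of problems found in one scenario record (empty if OK)."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    # Assumed convention for this sketch only: risk is a score in [0, 1].
    risk = record.get("risk_label")
    if isinstance(risk, float) and not 0.0 <= risk <= 1.0:
        problems.append("risk_label out of range [0, 1]")
    return problems


# An invented example record, not taken from the dataset.
example = json.loads("""
{
  "mission_objective": "inspect power line segment",
  "vehicle": {"type": "quadrotor", "battery_wh": 90},
  "environment": {"wind_mps": 6.5, "visibility_km": 4.0},
  "risk_label": 0.35
}
""")
```

Calling `validate_scenario(example)` returns an empty list for this record; a structural check of this kind is one simple reading of "machine-checkable", and the actual validation pipeline in the paper may differ.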
[AI-20] HealSplit: Towards Self-Healing through Adversarial Distillation in Split Federated Learning AAAI2026
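Quick read: This paper targets the weak security of privacy-preserving learning in Split Federated Learning (SFL), where local features, labels, smashed data, and model weights are all exposed to sophisticated data poisoning attacks, and proposes HealSplit, a unified defense framework. The solution has three key components: (1) topology-aware graph construction over the data with Topological Anomaly Scoring (TAS) to identify poisoned samples; (2) a generative recovery pipeline that repairs detected anomalies with semantically consistent synthetic substitutes; and (3) an adversarial multi-teacher distillation mechanism that combines semantic supervision from a Vanilla Teacher with anomaly-aware signals from an Anomaly-Influence Debiasing (AD) Teacher and uses the alignment between topological and gradient-based interaction matrices to guide student training, giving end-to-end detection and recovery.
Link: https://arxiv.org/abs/2511.11240
Authors: Yuhan Xie,Chen Lyu
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026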
Abstract:Split Federated Learning (SFL) is an emerging paradigm for privacy-preserving distributed learning. However, it remains vulnerable to sophisticated data poisoning attacks targeting local features, labels, smashed data, and model weights. Existing defenses, primarily adapted from traditional Federated Learning (FL), are less effective under SFL due to limited access to complete model updates. This paper presents HealSplit, the first unified defense framework tailored for SFL, offering end-to-end detection and recovery against five sophisticated types of poisoning attacks. HealSplit comprises three key components: (1) a topology-aware detection module that constructs graphs over smashed data to identify poisoned samples via topological anomaly scoring (TAS); (2) a generative recovery pipeline that synthesizes semantically consistent substitutes for detected anomalies, validated by a consistency validation student; and (3) an adversarial multi-teacher distillation framework that trains the student using semantic supervision from a Vanilla Teacher and anomaly-aware signals from an Anomaly-Influence Debiasing (AD) Teacher, guided by the alignment between topological and gradient-based interaction matrices. Extensive experiments on four benchmark datasets demonstrate that HealSplit consistently outperforms ten state-of-the-art defenses, achieving superior robustness and defense effectiveness across diverse attack scenarios.
[AI-21] Virtual Width Networks
Quick read: This paper addresses the quadratic growth in compute that comes with increasing the hidden size of large models: conventional ways of widening the representation sharply raise computational cost. The key to the solution is Virtual Width Networks (VWN), which decouple representational width from backbone width so that the embedding space expands while backbone compute stays nearly constant. This lets the model optimize markedly faster without a significant increase in compute, and the gains in convergence speed and performance grow with larger-scale training.
Link: https://arxiv.org/abs/2511.11238
Authors: Seed,Baisheng Li,Banggu Wu,Bole Ma,Bowen Xiao,Chaoyi Zhang,Cheng Li,Chengyi Wang,Chenyin Xu,Chi Zhang,Chong Hu,Daoguang Zan,Defa Zhu,Dongyu Xu,Du Li,Faming Wu,Fan Xia,Ge Zhang,Guang Shi,Haobin Chen,Hongyu Zhu,Hongzhi Huang,Huan Zhou,Huanzhang Dou,Jianhui Duan,Jianqiao Lu,Jianyu Jiang,Jiayi Xu,Jiecao Chen,Jin Chen,Jin Ma,Jing Su,Jingji Chen,Jun Wang,Jun Yuan,Juncai Liu,Jundong Zhou,Kai Hua,Kai Shen,Kai Xiang,Kaiyuan Chen,Kang Liu,Ke Shen,Liang Xiang,Lin Yan,Lishu Luo,Mengyao Zhang,Ming Ding,Mofan Zhang,Nianning Liang,Peng Li,Penghao Huang,Pengpeng Mu,Qi Huang,Qianli Ma,Qiyang Min,Qiying Yu,Renming Pang,Ru Zhang,Shen Yan,Shen Yan,Shixiong Zhao,Shuaishuai Cao,Shuang Wu,Siyan Chen,Siyu Li,Siyuan Qiao,Tao Sun,Tian Xin,Tiantian Fan,Ting Huang,Ting-Han Fan,Wei Jia,Wenqiang Zhang,Wenxuan Liu,Xiangzhong Wu,Xiaochen Zuo,Xiaoying Jia,Ximing Yang,Xin Liu,Xin Yu,Xingyan Bin,Xintong Hao,Xiongcai Luo,Xujing Li,Xun Zhou,Yanghua Peng,Yangrui Chen,Yi Lin,Yichong Leng,Yinghao Li,Yingshuan Song,Yiyuan Ma,Yong Shan,Yongan Xiang,Yonghui Wu,Yongtao Zhang,Yongzhen Yao,Yu Bao,Yuehang Yang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:
Abstract:We introduce Virtual Width Networks (VWN), a framework that delivers the benefits of wider representations without incurring the quadratic cost of increasing the hidden size. VWN decouples representational width from backbone width, expanding the embedding space while keeping backbone compute nearly constant. In our large-scale experiment, an 8-times expansion accelerates optimization by over 2 times for next-token and 3 times for next-2-token prediction. The advantage amplifies over training as both the loss gap grows and the convergence-speedup ratio increases, showing that VWN is not only token-efficient but also increasingly effective with scale. Moreover, we identify an approximately log-linear scaling relation between virtual width and loss reduction, offering an initial empirical basis and motivation for exploring virtual-width scaling as a new dimension of large-model efficiency.
[AI-22] STaR: Towards Cognitive Table Reasoning via Slow-Thinking Large Language Models
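Quick read: This paper addresses two key problems of Large Language Models (LLMs) in table reasoning: the reasoning process lacks the depth and iterative refinement of human cognition, and it is unstable, which undermines reliability in downstream applications. The key to the solution is the STaR (slow-thinking for table reasoning) framework, which gives LLMs slow-thinking capabilities by explicitly modeling step-by-step thinking and uncertainty-aware inference. During training it uses two-stage difficulty-aware reinforcement learning (DRL), learning progressively from simple to complex queries. During inference it applies trajectory-level uncertainty quantification, integrating token-level confidence and answer consistency, to select more credible reasoning paths. This improves reasoning performance and stability and generalizes well to out-of-domain data.
Link: https://arxiv.org/abs/2511.11233
Authors: Huajian Zhang,Mingyue Cheng,Yucong Luo,Xiaoyu Tao
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: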
Abstract:Table reasoning with large language models (LLMs) is a fundamental path toward building intelligent systems that can understand and analyze structured data. While recent progress has shown promising results, existing methods still suffer from two key limitations: (i) the reasoning processes lack the depth and iterative refinement characteristic of human cognition; and (ii) the reasoning processes exhibit instability, which compromises their reliability in downstream applications. In this work, we present STaR (slow-thinking for table reasoning), a new framework achieving cognitive table reasoning, in which LLMs are equipped with slow-thinking capabilities by explicitly modeling step-by-step thinking and uncertainty-aware inference. During training, STaR employs two-stage difficulty-aware reinforcement learning (DRL), progressively learning from simple to complex queries under a composite reward. During inference, STaR performs trajectory-level uncertainty quantification by integrating token-level confidence and answer consistency, enabling selection of more credible reasoning paths. Extensive experiments on benchmarks demonstrate that STaR achieves superior performance and enhanced reasoning stability. Moreover, strong generalization over out-of-domain datasets further demonstrates STaR’s potential as a reliable and cognitively inspired solution for table reasoning with LLMs.
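The inference-time selection step can be illustrated with a small Python sketch. Everything specific here is an assumption made for illustration: confidence is taken as the geometric-mean token probability, consistency as the fraction of sampled trajectories that share the same final answer, and the two are mixed with a hypothetical weight `alpha`. The abstract does not give STaR's actual scoring formula.

```python
import math
from collections import Counter


def select_trajectory(trajectories, alpha=0.5):
    """Pick the most credible of several sampled reasoning trajectories.

    Each trajectory is a dict with an "answer" string and a non-empty
    "token_logprobs" list. `alpha` (a hypothetical knob) weights
    token-level confidence against answer consistency.
    """
    counts = Counter(t["answer"] for t in trajectories)
    n = len(trajectories)
    best, best_score = None, -math.inf
    for t in trajectories:
        logprobs = t["token_logprobs"]
        # Geometric-mean token probability as a confidence proxy.
        confidence = math.exp(sum(logprobs) / len(logprobs))
        # Share of sampled trajectories that agree on this final answer.
        consistency = counts[t["answer"]] / n
        score = alpha * confidence + (1 - alpha) * consistency
        if score > best_score:
            best, best_score = t, score
    return best, best_score
```

With `alpha=1.0` the rule reduces to picking the single most confident sample, and with `alpha=0.0` it reduces to majority voting, so the weight interpolates between two familiar baselines.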
[AI-23] Enhancing Group Recommendation using Soft Impute Singular Value Decomposition
Quick read: This paper addresses the drop in group recommendation quality caused by sparse, high-dimensional data. The key to the solution is Group Soft-Impute SVD, a recommender that uses soft-impute singular value decomposition for low-rank matrix completion. It improves recall for small user groups and performs comparably to the baselines across group sizes.
Link: https://arxiv.org/abs/2511.11172
Authors: Mubaraka Sani Ibrahim(1),Isah Charles Saidu(2),Lehel Csato(3)
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: ((1) African University of Science and Technology (Abuja, Nigeria), (2) Baze University (Abuja, Nigeria), (3) Babes-Bolyai University (Cluj-Napoca, Romania))
Abstract:The growing popularity of group activities has increased the need to develop methods for providing recommendations to a group of users based on the collective preferences of the group members. Several group recommender systems have been proposed, but these methods often struggle due to sparsity and high-dimensionality of the available data, common in many real-world applications. In this paper, we propose a group recommender system called Group Soft-Impute SVD, which leverages soft-impute singular value decomposition to enhance group recommendations. This approach addresses the challenge of sparse high-dimensional data using low-rank matrix completion. We compared the performance of Group Soft-Impute SVD with Group MF based approaches and found that our method outperforms the baselines in recall for small user groups while achieving comparable results across all group sizes when evaluated on Goodbooks, Movielens, and Synthetic datasets. Furthermore, our method recovers lower matrix ranks than the baselines, demonstrating its effectiveness in handling high-dimensional data.
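For readers unfamiliar with soft-impute, the sketch below shows the standard iteration in NumPy: fill the missing entries with the current estimate, take an SVD, and shrink the singular values by a threshold `lam`. The `group_scores` helper, which simply averages the completed ratings of the group members, is an assumed aggregation added for illustration; the abstract does not say how the paper combines member preferences.

```python
import numpy as np


def soft_impute(X, mask, lam=1.0, n_iters=100, tol=1e-5):
    """Low-rank completion of X, which is observed only where mask is True."""
    Z = np.zeros_like(X, dtype=float)
    for _ in range(n_iters):
        # Keep observed entries, fill the rest with the current estimate.
        filled = np.where(mask, X, Z)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - lam, 0.0)  # soft-threshold the singular values
        Z_new = (U * s) @ Vt
        # Relative change between successive estimates as a stopping rule.
        change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
        Z = Z_new
        if change < tol:
            break
    return Z


def group_scores(Z, members):
    """Assumed aggregation: average the completed ratings of the members."""
    return Z[members].mean(axis=0)
```

A larger `lam` zeroes out more singular values and so yields a lower-rank estimate, which is the mechanism behind recovering low matrix ranks from sparse rating data.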
[AI-24] Specification Application and Operationalization of a Metamodel of Fairness
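Quick read: This paper addresses the difficulty of formally defining, comparing, and evaluating fairness in artificial intelligence (AI) systems. The core challenge is to model and analyze multiple notions of fairness, such as the difference between equality and equity, uniformly across application scenarios. The key to the solution is a formal framework called the AR fairness metamodel, which represents fairness notions abstractly and is operationalized through the Tiles framework, whose modular components can be interconnected to express and compare different fairness definitions. This provides a robust methodological basis for operationalizing fairness in AI systems.
Link: https://arxiv.org/abs/2511.11144
Authors: Julian Alfredo Mendez,Timotheus Kampik
Affiliations: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: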
Abstract:This paper presents the AR fairness metamodel, aimed at formally representing, analyzing, and comparing fairness scenarios. The metamodel provides an abstract representation of fairness, enabling the formal definition of fairness notions. We instantiate the metamodel through several examples, with a particular focus on comparing the notions of equity and equality. We use the Tiles framework, which offers modular components that can be interconnected to represent various definitions of fairness. Its primary objective is to support the operationalization of AR-based fairness definitions in a range of scenarios, providing a robust method for defining, comparing, and evaluating fairness. Tiles has an open-source implementation for fairness modeling and evaluation.
[AI-25] GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models
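Quick read: This paper addresses a key gap in evaluating Unified Multimodal Models (UMMs): existing benchmarks assess discriminative understanding and unconstrained image generation separately, and fail to measure the integrated cognitive ability exercised in cross-modal generation, namely generative reasoning. The key to the solution is GGBench, a new benchmark built on geometric construction tasks, which inherently require language comprehension fused with precise visual generation. It systematically diagnoses whether a model can not only understand and reason but also actively construct a solution, setting a more rigorous evaluation standard for the next generation of intelligent systems.
Link: https://arxiv.org/abs/2511.11134
Authors: Jingxuan Wei,Caijun Jia,Xi Bai,Xinglong Xu,Siyuan Li,Linzhuang Sun,Bihui Yu,Conghui He,Lijun Wu,Cheng Tan
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 35 pages, 22 figures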
Abstract:The advent of Unified Multimodal Models (UMMs) signals a paradigm shift in artificial intelligence, moving from passive perception to active, cross-modal generation. Despite their unprecedented ability to synthesize information, a critical gap persists in evaluation: existing benchmarks primarily assess discriminative understanding or unconstrained image generation separately, failing to measure the integrated cognitive process of generative reasoning. To bridge this gap, we propose that geometric construction provides an ideal testbed as it inherently demands a fusion of language comprehension and precise visual generation. We introduce GGBench, a benchmark designed specifically to evaluate geometric generative reasoning. It provides a comprehensive framework for systematically diagnosing a model’s ability to not only understand and reason but to actively construct a solution, thereby setting a more rigorous standard for the next generation of intelligent systems. Project website: this https URL.
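[AI-26] Utilizing LLMs for Industrial Process Automation: A Case Study on Modifying RAPID Programs ICSE
Quick read: This paper addresses the limited applicability of Large Language Models (LLMs) in industrial process automation, particularly for highly specialized Domain-Specific Languages (DSLs) that are used only in proprietary contexts and are poorly supported by LLMs. The study finds that, without investing heavily in model fine-tuning, few-shot prompting alone can solve simple tasks in an on-premise deployment, which protects sensitive company data and makes LLMs more practical in this domain. The key to the solution is using few-shot prompting to obtain initial support for a non-mainstream DSL without custom training.
Link: https://arxiv.org/abs/2511.11125
Authors: Salim Fares,Steffen Herbold
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Submitted to the International Conference on Software Engineering (ICSE) track Software Engineering in Practice (SEIP) 2026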
Abstract:How to best use Large Language Models (LLMs) for software engineering has been covered in many publications in recent years. However, most of this work focuses on widely-used general purpose programming languages. The utility of LLMs for software within the industrial process automation domain, with highly-specialized languages that are typically only used in proprietary contexts, is still underexplored. Within this paper, we study what enterprises can achieve on their own without investing large amounts of effort into the training of models specific to the domain-specific languages that are used. We show that few-shot prompting approaches are sufficient to solve simple problems in a language that is otherwise not well-supported by an LLM and that this is possible on-premise, thereby ensuring the protection of sensitive company data.
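As a rough illustration of few-shot prompting for a domain-specific language, the Python sketch below assembles a prompt from a handful of worked edits followed by the new task. The prompt wording, the helper name `build_few_shot_prompt`, and the RAPID snippets are invented for this example and are not the prompts or programs used in the paper; sending the prompt to a locally hosted model is left out.

```python
def build_few_shot_prompt(examples, task, language="RAPID"):
    """Assemble a few-shot prompt from (instruction, before, after) examples.

    `task` is an (instruction, before) pair for the program to be modified.
    """
    parts = [f"You modify {language} programs. Return only the modified code."]
    for i, (instruction, before, after) in enumerate(examples, start=1):
        parts.append(
            f"### Example {i}\nInstruction: {instruction}\n"
            f"Original:\n{before}\nModified:\n{after}"
        )
    instruction, before = task
    parts.append(
        f"### Task\nInstruction: {instruction}\nOriginal:\n{before}\nModified:"
    )
    return "\n\n".join(parts)


# Invented demonstration pairs; real ones would come from in-house programs.
EXAMPLES = [
    ("Reduce the speed to v100",
     "MoveL p10, v1000, z50, tool0;",
     "MoveL p10, v100, z50, tool0;"),
    ("Use a fine stop point",
     "MoveL p20, v500, z50, tool0;",
     "MoveL p20, v500, fine, tool0;"),
]
```

Because the demonstrations carry the language's syntax, the model never needs to have been trained on it, and the whole exchange can stay on a local machine.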
[AI-27] Satisficing and Optimal Generalised Planning via Goal Regression (Extended Version) AAAI2026
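Quick read: This paper addresses Generalised Planning (GP): automatically synthesising, from a set of training problems, programs (generalised plans) that solve whole families of related planning problems, which classical planning handles inefficiently and without generality. The key to the solution is a novel yet simple method: for each training problem, compute an optimal plan for each goal atom in some order, perform goal regression on the resulting plans, and lift the outputs into a set of first-order Condition → Actions rules. The rules form a generalised plan that can be executed directly or used to prune the search space. The authors formalise and prove the conditions under which the method is guaranteed to learn valid generalised plans and state-space pruning axioms, and experiments show significant improvements over state-of-the-art (generalised) planners in synthesis cost, planning coverage, and solution quality.
Link: https://arxiv.org/abs/2511.11095
Authors: Dillon Z. Chen,Till Hofmann,Toryn Q. Klassen,Sheila A. McIlraith
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Extended version of AAAI 2026 paper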
Abstract:Generalised planning (GP) refers to the task of synthesising programs that solve families of related planning problems. We introduce a novel, yet simple method for GP: given a set of training problems, for each problem, compute an optimal plan for each goal atom in some order, perform goal regression on the resulting plans, and lift the corresponding outputs to obtain a set of first-order $\textit{Condition} \rightarrow \textit{Actions}$ rules. The rules collectively constitute a generalised plan that can be executed as is or alternatively be used to prune the planning search space. We formalise and prove the conditions under which our method is guaranteed to learn valid generalised plans and state space pruning axioms for search. Experiments demonstrate significant improvements over state-of-the-art (generalised) planners with respect to the 3 metrics of synthesis cost, planning coverage, and solution quality on various classical and numeric planning domains.
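Goal regression itself is a standard operation on STRIPS-style actions, and a small Python sketch may help. The version below regresses a ground goal backwards through a plan and records one (condition, action) pair per step. It uses the textbook rule (goal minus add effects, plus preconditions, undefined if the action deletes part of the goal), handles only ground propositional actions, and omits the lifting to first-order rules, so it is an illustration of the ingredient rather than the paper's method. The `Action` type and the blocks-world atoms are assumptions for the example.

```python
from typing import NamedTuple, Optional


class Action(NamedTuple):
    name: str
    pre: frozenset     # atoms that must hold before the action
    add: frozenset     # atoms the action makes true
    delete: frozenset  # atoms the action makes false


def regress(goal: frozenset, action: Action) -> Optional[frozenset]:
    """Condition that must hold before `action` so that `goal` holds after."""
    if goal & action.delete:
        return None  # the action destroys part of the goal
    return (goal - action.add) | action.pre


def regress_plan(goal: frozenset, plan: list) -> list:
    """Regress `goal` backwards through `plan`.

    Returns (condition, action name) pairs in execution order.
    """
    rules = []
    condition = goal
    for action in reversed(plan):
        condition = regress(condition, action)
        if condition is None:
            raise ValueError(f"{action.name} conflicts with the regressed goal")
        rules.append((condition, action.name))
    return list(reversed(rules))


# Invented two-step blocks-world plan for the goal on(A,B).
PICKUP_A = Action(
    "pickup(A)",
    pre=frozenset({"clear(A)", "ontable(A)", "handempty"}),
    add=frozenset({"holding(A)"}),
    delete=frozenset({"clear(A)", "ontable(A)", "handempty"}),
)
STACK_A_B = Action(
    "stack(A,B)",
    pre=frozenset({"holding(A)", "clear(B)"}),
    add=frozenset({"on(A,B)", "clear(A)", "handempty"}),
    delete=frozenset({"holding(A)", "clear(B)"}),
)
```

Each returned pair reads as "if this condition holds, take this action", which is the ground form of the rules that the method then lifts to first-order variables.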
[AI-28] Scalable Population Training for Zero-Shot Coordination
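Quick read: This paper addresses the difficulty of scaling population size in Zero-shot Coordination (ZSC) under limited compute: existing population-based training methods optimize diversity in small populations and cannot benefit from larger ones. The key to the solution is Scalable Population Training (ScaPT), a framework with two core components: a meta-agent that efficiently realizes a large population by selectively sharing parameters across agents, and a mutual information regularizer that keeps the population diverse. Together they improve coordination with unseen partners without any fine-tuning.
Link: https://arxiv.org/abs/2511.11083
Authors: Bingyu Hui,Lebin Yu,Quanming Yao,Yunpeng Qu,Xudong Zhang,Jian Wang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: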
Abstract:Zero-shot coordination (ZSC) has become a hot topic in reinforcement learning research recently. It focuses on the generalization ability of agents, requiring them to coordinate well with collaborators that they have not seen before, without any fine-tuning. Population-based training has been proven to provide good zero-shot coordination performance; nevertheless, existing methods are limited by computational resources, mainly focusing on optimizing diversity in small populations while neglecting the potential performance gains from scaling population size. To address this issue, this paper proposes Scalable Population Training (ScaPT), an efficient training framework comprising two key components: a meta-agent that efficiently realizes a population by selectively sharing parameters across agents, and a mutual information regularizer that guarantees population diversity. To empirically validate the effectiveness of ScaPT, this paper evaluates it along with representative frameworks in Hanabi and confirms its superiority.
[AI-29] ARCTraj: A Dataset and Benchmark of Human Reasoning Trajectories for Abstract Problem Solving
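Quick read: This paper addresses the reliance of existing abstract reasoning research on static input-output supervision, which makes it hard to capture how human reasoning unfolds over time. The key to the solution is ARCTraj, a dataset and methodological framework that records temporally ordered, object-level action trajectories of humans solving ARC tasks, revealing intermediate reasoning steps that conventional datasets overlook. The framework also defines a unified reasoning pipeline covering data collection, action abstraction, Markov decision process (MDP) formulation, and downstream learning, and supports integration with reinforcement learning, generative modeling, and sequence modeling methods, providing a structured and interpretable basis for studying human-like reasoning.
Link: https://arxiv.org/abs/2511.11079
Authors: Sejin Kim,Hayan Choi,Seokki Lee,Sundong Kim
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: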
Abstract:We present ARCTraj, a dataset and methodological framework for modeling human reasoning through complex visual tasks in the Abstraction and Reasoning Corpus (ARC). While ARC has inspired extensive research on abstract reasoning, most existing approaches rely on static input–output supervision, which limits insight into how reasoning unfolds over time. ARCTraj addresses this gap by recording temporally ordered, object-level actions that capture how humans iteratively transform inputs into outputs, revealing intermediate reasoning steps that conventional datasets overlook. Collected via the O2ARC web interface, it contains around 10,000 trajectories annotated with task identifiers, timestamps, and success labels across 400 training tasks from the ARC-AGI-1 benchmark. It further defines a unified reasoning pipeline encompassing data collection, action abstraction, Markov decision process (MDP) formulation, and downstream learning, enabling integration with reinforcement learning, generative modeling, and sequence modeling methods such as PPO, World Models, GFlowNets, Diffusion agents, and Decision Transformers. Analyses of spatial selection, color attribution, and strategic convergence highlight the structure and diversity of human reasoning. Together, these contributions position ARCTraj as a structured and interpretable foundation for studying human-like reasoning, advancing explainability, alignment, and generalizable intelligence.
[AI-30] Enhancing Graph Representations with Neighborhood-Contextualized Message-Passing
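Quick read: This paper addresses the limited expressive power of classical Message-Passing Graph Neural Networks (MP-GNNs) when processing local neighborhood information. Standard MP-GNNs form pairwise messages from the features of the center node and one neighbor at a time, ignoring the richer contextual relations within the neighborhood and limiting their ability to learn complex structural relationships. The key to the solution is the concept of neighborhood-contextualization, derived from a key property of the attentional variant, which underpins the Neighborhood-Contextualized Message-Passing (NCMP) framework. Using a simple, practical, and efficient parametrization, the author then develops the Soft-Isomorphic Neighborhood-Contextualized Graph Convolution Network (SINC-GCN), for which a preliminary analysis on a synthetic node classification task indicates gains in both expressivity and efficiency.
Link: https://arxiv.org/abs/2511.11046
Authors: Brian Godwin Lim
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: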
Abstract:Graph neural networks (GNNs) have become an indispensable tool for analyzing relational data. In the literature, classical GNNs may be classified into three variants: convolutional, attentional, and message-passing. While the standard message-passing variant is highly expressive, its typical pair-wise messages nevertheless only consider the features of the center node and each neighboring node individually. This design fails to incorporate the rich contextual information contained within the broader local neighborhood, potentially hindering its ability to learn complex relationships within the entire set of neighboring nodes. To address this limitation, this work first formalizes the concept of neighborhood-contextualization, rooted in a key property of the attentional variant. This then serves as the foundation for generalizing the message-passing variant to the proposed neighborhood-contextualized message-passing (NCMP) framework. To demonstrate its utility, a simple, practical, and efficient method to parametrize and operationalize NCMP is presented, leading to the development of the proposed Soft-Isomorphic Neighborhood-Contextualized Graph Convolution Network (SINC-GCN). A preliminary analysis on a synthetic binary node classification problem then underscores both the expressivity and efficiency of the proposed GNN architecture. Overall, the paper lays the foundation for the novel NCMP framework as a practical path toward further enhancing the graph representational power of classical GNNs.
[AI-31] Autonomous Vehicle Path Planning by Searching With Differentiable Simulation
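Quick read: This paper addresses safe path planning for autonomous driving in complex traffic, and in particular the challenge of searching efficiently for the best action sequence when the policy, next-state predictor, and critic all have to be learned. The key to the solution is Differentiable Simulation for Search (DSS), a framework that uses the differentiable simulator Waymax as both next-state predictor and critic. It relies on the simulator's hard-coded dynamics for highly accurate state predictions and exploits the simulator's differentiability to optimize actions by gradient descent over imagined future trajectories, which significantly improves tracking and path planning accuracy.
Link: https://arxiv.org/abs/2511.11043
Authors: Asen Nachkov,Jan-Nico Zaech,Danda Pani Paudel,Xi Wang,Luc Van Gool
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: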
Abstract:Planning allows an agent to safely refine its actions before executing them in the real world. In autonomous driving, this is crucial to avoid collisions and navigate in complex, dense traffic scenarios. One way to plan is to search for the best action sequence. However, this is challenging when all necessary components - policy, next-state predictor, and critic - have to be learned. Here we propose Differentiable Simulation for Search (DSS), a framework that leverages the differentiable simulator Waymax as both a next state predictor and a critic. It relies on the simulator’s hardcoded dynamics, making state predictions highly accurate, while utilizing the simulator’s differentiability to effectively search across action sequences. Our DSS agent optimizes its actions using gradient descent over imagined future trajectories. We show experimentally that DSS - the combination of planning gradients and stochastic search - significantly improves tracking and path planning accuracy compared to sequence prediction, imitation learning, model-free RL, and other planning methods.
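The idea of optimizing actions by gradient descent through known dynamics can be shown on a toy problem in plain Python. The sketch below is not Waymax and not the DSS agent: it assumes a one-dimensional point whose velocity is the action, a squared tracking error as the "critic", and a hand-derived gradient. It also covers only the gradient part, not the stochastic search the abstract mentions. The step size `lr` is tuned for the short horizon used here.

```python
def rollout(x0, actions, dt=0.1):
    """Hard-coded dynamics: a 1-D point whose velocity is the action."""
    xs, x = [], x0
    for a in actions:
        x = x + a * dt
        xs.append(x)
    return xs


def cost_and_grad(x0, actions, targets, dt=0.1):
    """Tracking cost of the imagined trajectory and its gradient per action."""
    xs = rollout(x0, actions, dt)
    errors = [x - t for x, t in zip(xs, targets)]
    cost = sum(e * e for e in errors)
    # Action k shifts its own state and every later one by dt,
    # so its gradient is a suffix sum of the error terms.
    grad, suffix = [0.0] * len(actions), 0.0
    for k in range(len(actions) - 1, -1, -1):
        suffix += 2.0 * errors[k] * dt
        grad[k] = suffix
    return cost, grad


def plan(x0, targets, steps=2000, lr=5.0, dt=0.1):
    """Gradient descent on the action sequence, starting from zero actions."""
    actions = [0.0] * len(targets)
    for _ in range(steps):
        _, grad = cost_and_grad(x0, actions, targets, dt)
        actions = [a - lr * g for a, g in zip(actions, grad)]
    return actions
```

For a reference path that advances by 0.1 per step with `dt=0.1`, the optimizer recovers a constant velocity of about 1.0. In a real setting the gradient would come from automatic differentiation through the simulator rather than from a hand-written formula.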
[AI-32] Key Decision-Makers in Multi-Agent Debates: Who Holds the Power?
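Quick read: This paper addresses the performance bottleneck that poor role allocation causes in Multi-Agent Debate (MAD) on reasoning tasks. The core challenge is to improve MAD's reasoning accuracy through sensible role allocation, especially when the truth is unknown in practice. The solution rests on a new role allocation strategy, "Truth Last", and the accompanying Multi-Agent Debate Consistency (MADC) strategy. The former places the role most likely to hold the truth in the last speaking position, improving reasoning performance by up to 22%. The latter uses path consistency to measure agreement among independent roles and treats the role with the highest consistency score as a proxy for the truth. The approach is validated on 9 LLMs.
Link: https://arxiv.org/abs/2511.11040
Authors: Qian Zhang,Yan Zheng,Jinyi Liu,Hebin Liang,Lanjun Wang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: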
Abstract:Recent studies on LLM agent scaling have highlighted the potential of Multi-Agent Debate (MAD) to enhance reasoning abilities. However, the critical aspect of role allocation strategies remains underexplored. In this study, we demonstrate that allocating roles with differing viewpoints to specific positions significantly impacts MAD’s performance in reasoning tasks. Specifically, we find a novel role allocation strategy, “Truth Last”, which can improve MAD performance by up to 22% in reasoning tasks. To address the issue of unknown truth in practical applications, we propose the Multi-Agent Debate Consistency (MADC) strategy, which systematically simulates and optimizes its core mechanisms. MADC incorporates path consistency to assess agreement among independent roles, simulating the role with the highest consistency score as the truth. We validated MADC across a range of LLMs (9 models), including the DeepSeek-R1 Distilled Models, on challenging reasoning tasks. MADC consistently demonstrated advanced performance, effectively overcoming MAD’s performance bottlenecks and providing a crucial pathway for further improvements in LLM agent scaling.
[AI-33] Faster Symmetry Breaking Constraints for Abstract Structures
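Quick read: This paper addresses inefficient symmetry breaking for abstract structures in constraint programming. With high-level modelling languages such as Essence, abstract structures (for example, nested sets) must be transformed into representations that constraint solvers support (for example, matrices), and conventional symmetry-breaking methods applied to such abstract variables produce a large number of complex constraints that slow solving considerably. The paper proposes a new incomplete symmetry-breaking method whose key idea is to better exploit the representation of abstract structures, reducing the number of constraints and speeding up solving. The method targets symmetries arising from indistinguishable objects and is faster in experiments than the previous methods (Akgün et al. 2025).
Link: https://arxiv.org/abs/2511.11029
Authors: Özgür Akgün,Mun See Chang,Ian P. Gent,Christopher Jefferson
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: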
Abstract:In constraint programming and related paradigms, a modeller specifies their problem in a modelling language for a solver to search and return its solution(s). Using high-level modelling languages such as Essence, a modeller may express their problems in terms of abstract structures. These are structures not natively supported by the solvers, and so they have to be transformed into or represented as other structures before solving. For example, nested sets are abstract structures, and they can be represented as matrices in constraint solvers. Many problems contain symmetries and one very common and highly successful technique used in constraint programming is to “break” symmetries, to avoid searching for symmetric solutions. This can speed up the solving process by many orders of magnitude. Most of these symmetry-breaking techniques involve placing some kind of ordering for the variables of the problem, and picking a particular member under the symmetries, usually the smallest. Unfortunately, applying this technique to abstract variables produces a very large number of complex constraints that perform poorly in practice. In this paper, we demonstrate a new incomplete method of breaking the symmetries of abstract structures by better exploiting their representations. We apply the method in breaking the symmetries arising from indistinguishable objects, a commonly occurring type of symmetry, and show that our method is faster than the previous methods proposed in (Akgün et al. 2025).
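The classical ordering technique the abstract describes can be seen on a tiny brute-force example in Python. Here a set of `k` distinct subsets of `n` items is represented as a `k x n` 0/1 matrix; because a set has no order, any permutation of the rows is the same solution. Requiring the rows to be in increasing lexicographic order keeps exactly one representative per set. This sketch illustrates the baseline idea only, using plain enumeration instead of a constraint solver, and is not the new method proposed in the paper.

```python
from itertools import product


def assignments(k, n):
    """All k x n 0/1 matrices with pairwise distinct rows.

    Each matrix represents a set of k distinct subsets of n items.
    """
    rows = list(product((0, 1), repeat=n))
    for matrix in product(rows, repeat=k):
        if len(set(matrix)) == k:
            yield matrix


def rows_ordered(matrix):
    """Symmetry-breaking constraint: rows in strictly increasing lex order."""
    return all(a < b for a, b in zip(matrix, matrix[1:]))


all_solutions = list(assignments(2, 3))
broken = [m for m in all_solutions if rows_ordered(m)]
```

With `k=2` and `n=3` the ordering constraint halves the 56 matrices to 28 while still covering every distinct set; for nested or larger structures the number of ordering constraints grows quickly, which is the cost the paper aims to reduce.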
[AI-34] Data Poisoning Vulnerabilities Across Healthcare AI Architectures: A Security Threat Analysis
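Quick read: This paper addresses the severe vulnerability of healthcare AI systems to data poisoning attacks, which current defenses and regulatory frameworks cannot adequately handle. Analyzing eight attack scenarios (spanning architecture, infrastructure, resource allocation, and supply chain), it finds that attackers with only 100-500 samples can compromise systems regardless of dataset size, often with success rates above 60%, while detection takes an estimated 6 to 12 months or never occurs. The proposed remedy is a multilayer defense: mandatory adversarial robustness testing, ensemble-based anomaly detection, privacy-preserving security mechanisms, and international coordination on AI security standards. The paper also questions whether black-box models are suitable for high-stakes clinical decisions and argues for a shift toward interpretable systems with verifiable safety guarantees.
Link: https://arxiv.org/abs/2511.11020
Authors: Farhad Abtahi,Fernando Seoane,Iván Pau,Mario Vega-Barbas
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: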
Abstract:Healthcare AI systems face major vulnerabilities to data poisoning that current defenses and regulations cannot adequately address. We analyzed eight attack scenarios in four categories: architectural attacks on convolutional neural networks, large language models, and reinforcement learning agents; infrastructure attacks exploiting federated learning and medical documentation systems; critical resource allocation attacks affecting organ transplantation and crisis triage; and supply chain attacks targeting commercial foundation models. Our findings indicate that attackers with access to only 100-500 samples can compromise healthcare AI regardless of dataset size, often achieving over 60 percent success, with detection taking an estimated 6 to 12 months or sometimes not occurring at all. The distributed nature of healthcare infrastructure creates many entry points where insiders with routine access can launch attacks with limited technical skill. Privacy laws such as HIPAA and GDPR can unintentionally shield attackers by restricting the analyses needed for detection. Supply chain weaknesses allow a single compromised vendor to poison models across 50 to 200 institutions. The Medical Scribe Sybil scenario shows how coordinated fake patient visits can poison data through legitimate clinical workflows without requiring a system breach. Current regulations lack mandatory adversarial robustness testing, and federated learning can worsen risks by obscuring attribution. We recommend multilayer defenses including required adversarial testing, ensemble-based detection, privacy-preserving security mechanisms, and international coordination on AI security standards. We also question whether opaque black-box models are suitable for high-stakes clinical decisions, suggesting a shift toward interpretable systems with verifiable safety guarantees.
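[AI-35] AI Agent-Driven Framework for Automated Product Knowledge Graph Construction in E-Commerce
Quick read: This paper addresses the difficulty of organizing and exploiting the vast amounts of unstructured product data on e-commerce platforms, and in particular the low automation and reliance on manual rules and predefined schemas when building high-quality, interpretable product Knowledge Graphs (KGs). The key to the solution is a fully automated, AI agent-driven framework in which Large Language Models (LLMs) work in three stages: ontology creation and expansion, ontology refinement, and knowledge graph population. It produces semantically coherent KGs with high coverage and minimal redundancy without predefined schemas or handcrafted extraction rules.
Link: https://arxiv.org/abs/2511.11017
Authors: Dimitar Peshevski,Riste Stojanov,Dimitar Trajanov
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Proceedings of the 1st GOBLIN Workshop on Knowledge Graph Technologies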
Abstract:The rapid expansion of e-commerce platforms generates vast amounts of unstructured product data, creating significant challenges for information retrieval, recommendation systems, and data analytics. Knowledge Graphs (KGs) offer a structured, interpretable format to organize such data, yet constructing product-specific KGs remains a complex and manual process. This paper introduces a fully automated, AI agent-driven framework for constructing product knowledge graphs directly from unstructured product descriptions. Leveraging Large Language Models (LLMs), our method operates in three stages using dedicated agents: ontology creation and expansion, ontology refinement, and knowledge graph population. This agent-based approach ensures semantic coherence, scalability, and high-quality output without relying on predefined schemas or handcrafted extraction rules. We evaluate the system on a real-world dataset of air conditioner product descriptions, demonstrating strong performance in both ontology generation and KG population. The framework achieves over 97% property coverage and minimal redundancy, validating its effectiveness and practical applicability. Our work highlights the potential of LLMs to automate structured knowledge extraction in retail, providing a scalable path toward intelligent product data integration and utilization.
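[AI-36] MSMT-FN: Multi-segment Multi-task Fusion Network for Marketing Audio Classification
[Quick read]: This paper addresses the challenge of efficiently classifying customer purchasing propensity from large volumes of audio data, a key task for sentiment analysis and attitude recognition in marketing phone calls. The core of the solution is a Multi-Segment Multi-Task Fusion Network (MSMT-FN) designed for this business need, which jointly models multiple audio segments and uses multi-task learning to improve classification accuracy and generalization. Experiments show that MSMT-FN outperforms or matches state-of-the-art methods on the authors' MarketCalls dataset and on the CMU-MOSI, CMU-MOSEI, and MELD benchmarks.
Link: https://arxiv.org/abs/2511.11006
Authors: HongYu Liu,Ruijie Wan,Yueju Han,Junxin Li,Liuxing Lu,Chao He,Lihua Cai
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted at The 21st International Conference on Advanced Data Mining and Applications (ADMA 2025). In book: Advanced Data Mining and Applications (pp.306-320)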
Abstract:Audio classification plays an essential role in sentiment analysis and emotion recognition, especially for analyzing customer attitudes in marketing phone calls. Efficiently categorizing customer purchasing propensity from large volumes of audio data remains challenging. In this work, we propose a novel Multi-Segment Multi-Task Fusion Network (MSMT-FN) that is uniquely designed for addressing this business demand. Evaluations conducted on our proprietary MarketCalls dataset, as well as established benchmarks (CMU-MOSI, CMU-MOSEI, and MELD), show MSMT-FN consistently outperforms or matches state-of-the-art methods. Additionally, our newly curated MarketCalls dataset will be available upon request, and the code base is made accessible at GitHub Repository MSMT-FN, to facilitate further research and advancements in the audio classification domain.
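[AI-37] DialogGraph-LLM: Graph-Informed LLMs for End-to-End Audio Dialogue Intent Recognition ECAI2025
[Quick read]: This paper addresses speaker intent recognition in long audio dialogues, a task made difficult by complex dependencies among speaker utterances and by scarce annotated data. The key to its solution is DialogGraph-LLM, an end-to-end framework that combines a novel Multi-Relational Dialogue Attention Network (MR-DAN) with multimodal foundation models (e.g., Qwen2.5-Omni-7B) for direct acoustic-to-intent inference. It also introduces an adaptive semi-supervised learning strategy that generates pseudo-labels through dual-threshold filtering on global and class confidences and uses an entropy-based sample selection process to prioritize high-information unlabeled samples, improving performance and practicality when labeled data are limited.
Link: https://arxiv.org/abs/2511.11000
Authors: HongYu Liu,Junxin Li,Changxi Guo,Hao Chen,Yaqian Huang,Yifu Guo,Huan Yang,Lihua Cai
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 8 pages, 2 figures; Series: Frontiers in Artificial Intelligence and Applications, Volume 413: ECAI 2025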
Abstract:Recognizing speaker intent in long audio dialogues has a wide range of applications, but is a non-trivial AI task due to complex inter-dependencies in speaker utterances and scarce annotated data. To address these challenges, an end-to-end framework, namely DialogGraph-LLM, is proposed in the current work. DialogGraph-LLM combines a novel Multi-Relational Dialogue Attention Network (MR-DAN) architecture with multimodal foundation models (e.g., Qwen2.5-Omni-7B) for direct acoustic-to-intent inference. An adaptive semi-supervised learning strategy is designed using an LLM with a confidence-aware pseudo-label generation mechanism based on dual-threshold filtering using both global and class confidences, and an entropy-based sample selection process that prioritizes high-information unlabeled instances. Extensive evaluations on the proprietary MarketCalls corpus and the publicly available MIntRec 2.0 benchmark demonstrate DialogGraph-LLM’s superiority over strong audio and text-driven baselines. The framework demonstrates strong performance and efficiency in intent recognition in real-world audio dialogues, proving its practical value for audio-rich domains with limited supervision. Our code is available at this https URL.
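The abstract does not spell out the thresholds or how the two steps are combined, so the following is only a minimal Python sketch of one plausible reading: a pseudo-label is kept when the top class probability clears both a global threshold and a per-class threshold, and unlabeled samples are ranked by predictive entropy. The function names, threshold values, and toy probabilities are hypothetical and are not taken from the DialogGraph-LLM code.

```python
import math


def pseudo_label(probs, global_thr, class_thr):
    """Return (sample_index, label) pairs that pass both confidence checks.

    probs      : list of per-sample class-probability lists
    global_thr : one confidence floor shared by all classes (assumed)
    class_thr  : one confidence floor per class (assumed)
    """
    accepted = []
    for i, p in enumerate(probs):
        label = max(range(len(p)), key=p.__getitem__)  # argmax class
        conf = p[label]
        # Dual-threshold filter: both floors must be cleared.
        if conf >= global_thr and conf >= class_thr[label]:
            accepted.append((i, label))
    return accepted


def entropy(p):
    """Shannon entropy (in nats) of one probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def most_informative(probs, k):
    """Indices of the k samples with the highest predictive entropy."""
    ranked = sorted(range(len(probs)), key=lambda i: entropy(probs[i]), reverse=True)
    return ranked[:k]


# Toy predictions for four unlabeled samples over three intent classes.
probs = [
    [0.95, 0.03, 0.02],
    [0.60, 0.30, 0.10],
    [0.05, 0.90, 0.05],
    [0.34, 0.33, 0.33],
]
accepted = pseudo_label(probs, global_thr=0.8, class_thr=[0.9, 0.95, 0.9])
selected = most_informative(probs, k=2)
print(accepted)  # [(0, 0)]
print(selected)  # [3, 1]
```

In this toy run the third sample clears the global threshold but not the stricter threshold of its predicted class, which is the case a second, class-level threshold is meant to catch.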
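[AI-38] How Data Quality Affects Machine Learning Models for Credit Risk Assessment
[Quick read]: This paper addresses the loss of predictive accuracy that Machine Learning (ML) models for credit risk assessment suffer from input data quality problems such as missing values, noisy attributes, outliers, and label errors. The key to its solution is to apply controlled data corruption to an open-source dataset with the Pucktrick library and systematically evaluate how the robustness of 10 commonly used models (such as Random Forest, SVM, and Logistic Regression) differs across data degradation scenarios. This gives practitioners practical tools for making data pipelines more robust and gives researchers a flexible framework for data-centric AI experiments.
Link: https://arxiv.org/abs/2511.10964
Authors: Andrea Maurino
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: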
Abstract:Machine Learning (ML) models are being increasingly employed for credit risk evaluation, with their effectiveness largely hinging on the quality of the input data. In this paper we investigate the impact of several data quality issues, including missing values, noisy attributes, outliers, and label errors, on the predictive accuracy of machine learning models used in credit risk assessment. Utilizing an open-source dataset, we introduce controlled data corruption using the Pucktrick library to assess the robustness of 10 frequently used models, such as Random Forest, SVM, and Logistic Regression. Our experiments show significant differences in model robustness based on the nature and severity of the data degradation. Moreover, the proposed methodology and accompanying tools offer practical support for practitioners seeking to enhance data pipeline robustness, and provide researchers with a flexible framework for further experimentation in data-centric AI contexts.
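To make the idea of controlled corruption concrete, here is a small standard-library Python sketch that injects missing values and label errors at a chosen rate. It does not use or imitate the Pucktrick API; the helper names, the `income` column, and the rates are assumptions for illustration only.

```python
import random


def inject_missing(rows, column, rate, seed=0):
    """Return a copy of `rows` with round(rate * len(rows)) values of `column` set to None."""
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    n_corrupt = round(rate * len(rows))
    targets = set(rng.sample(range(len(rows)), n_corrupt))
    return [
        {**row, column: None} if i in targets else dict(row)
        for i, row in enumerate(rows)
    ]


def flip_labels(labels, rate, seed=0):
    """Return a copy of binary `labels` with round(rate * len(labels)) entries flipped."""
    rng = random.Random(seed)
    n_flip = round(rate * len(labels))
    targets = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in targets else y for i, y in enumerate(labels)]


# Toy credit records: 20 applicants with a binary default label.
rows = [{"income": 1000 + i, "age": 30} for i in range(20)]
labels = [0, 1] * 10

corrupted_rows = inject_missing(rows, "income", rate=0.25)
noisy_labels = flip_labels(labels, rate=0.10)
```

Sweeping `rate` over a grid and retraining each model on the corrupted copy would give a robustness curve per model and per corruption type, which is the kind of comparison the abstract describes.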
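[AI-39] Requirements for Aligned Dynamic Resolution of Conflicts in Operational Constraints AAAI26
[Quick read]: This paper addresses how autonomous AI systems should evaluate and choose among multiple plausible courses of action when they face novel or under-specified situations. Because a trained policy cannot cover every scenario, the system must go beyond that policy to construct, evaluate, and justify candidate courses of action, which requires knowledge that lies outside its prior training. The key to the solution is that the agent must integrate normative, pragmatic, and situational understanding so that its decisions both serve its own goals and stay aligned with human expectations and values.
Link: https://arxiv.org/abs/2511.10952
Authors: Steven J. Jones,Robert E. Wray,John E. Laird
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 6 pages, technical appendix (submitted to AAAI26)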
Abstract:Deployed, autonomous AI systems must often evaluate multiple plausible courses of action (extended sequences of behavior) in novel or under-specified contexts. Despite extensive training, these systems will inevitably encounter scenarios where no available course of action fully satisfies all operational constraints (e.g., operating procedures, rules, laws, norms, and goals). To achieve goals in accordance with human expectations and values, agents must go beyond their trained policies and instead construct, evaluate, and justify candidate courses of action. These processes require contextual “knowledge” that may lie outside prior (policy) training. This paper characterizes requirements for agent decision making in these contexts. It also identifies the types of knowledge agents require to make decisions robust to agent goals and aligned with human expectations. Drawing on both analysis and empirical case studies, we examine how agents need to integrate normative, pragmatic, and situational understanding to select and then to pursue more aligned courses of action in complex, real-world environments.
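[AI-40] Exposing Weak Links in Multi-Agent Systems under Adversarial Prompting MICRO
[Quick read]: This paper addresses the security vulnerabilities that design complexity introduces into Multi-Agent Systems (MAS) in real-world use, noting that existing research mostly evaluates single-agent security and lacks a unified framework and quantitative metrics for the rejection modes unique to MAS. The key to its solution is SafeAgents, an extensible framework for fine-grained security assessment that systematically analyzes how design factors such as plan construction strategies, inter-agent context sharing, and fallback behaviors affect susceptibility to adversarial prompting, together with Dharma, a diagnostic measure for identifying weak links in multi-agent pipelines. The framework is evaluated on five widely adopted multi-agent architectures and reveals risks such as centralized systems that pass only atomic instructions to sub-agents, which obscures harmful objectives and reduces robustness; the results underline the need for security-aware MAS design.
Link: https://arxiv.org/abs/2511.10949
Authors: Nirmit Arora,Sathvik Joel,Ishan Kavathekar,Palak,Rohan Gandhi,Yash Pandya,Tanuja Ganu,Aditya Kanade,Akshay Nambi
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures. Code available at this https URL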
Abstract:LLM-based agents are increasingly deployed in multi-agent systems (MAS). As these systems move toward real-world applications, their security becomes paramount. Existing research largely evaluates single-agent security, leaving a critical gap in understanding the vulnerabilities introduced by multi-agent design. However, existing systems fall short due to lack of unified frameworks and metrics focusing on unique rejection modes in MAS. We present SafeAgents, a unified and extensible framework for fine-grained security assessment of MAS. SafeAgents systematically exposes how design choices such as plan construction strategies, inter-agent context sharing, and fallback behaviors affect susceptibility to adversarial prompting. We introduce Dharma, a diagnostic measure that helps identify weak links within multi-agent pipelines. Using SafeAgents, we conduct a comprehensive study across five widely adopted multi-agent architectures (centralized, decentralized, and hybrid variants) on four datasets spanning web tasks, tool use, and code generation. Our findings reveal that common design patterns carry significant vulnerabilities. For example, centralized systems that delegate only atomic instructions to sub-agents obscure harmful objectives, reducing robustness. Our results highlight the need for security-aware design in MAS. Link to code is this https URL
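[AI-41] GraphToxin: Reconstructing Full Unlearned Graphs from Graph Unlearning
[Quick read]: This paper addresses the insufficient privacy protection of graph unlearning: Graph Neural Networks (GNNs) may still retain sensitive information after a "right to be forgotten" request has been carried out. Existing methods can remove specified nodes or edges, but attackers can still recover the deleted data through reconstruction attacks, which defeats regulatory expectations. The key contribution is GraphToxin, the first graph reconstruction attack against graph unlearning, whose core innovation is a novel curvature matching module that provides fine-grained guidance for recovering the full unlearned graph, including deleted nodes, their links, and sensitive content. Experiments show that the attack is effective in both white-box and black-box settings and that existing defenses are not only ineffective but can even amplify the attack, underscoring the urgent need for more robust defense strategies.
Link: https://arxiv.org/abs/2511.10936
Authors: Ying Song,Balaji Palanisamy
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Submitted to SP 2026. Code will be available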
Abstract:Graph unlearning has emerged as a promising solution for complying with “the right to be forgotten” regulations by enabling the removal of sensitive information upon request. However, this solution is not foolproof. The involvement of multiple parties creates new attack surfaces, and residual traces of deleted data can still remain in the unlearned graph neural networks. These vulnerabilities can be exploited by attackers to recover the supposedly erased samples, thereby undermining the inherent functionality of graph unlearning. In this work, we propose GraphToxin, the first graph reconstruction attack against graph unlearning. Specifically, we introduce a novel curvature matching module to provide fine-grained guidance for full unlearned graph recovery. We demonstrate that GraphToxin can successfully subvert the regulatory guarantees expected from graph unlearning - it can recover not only a deleted individual’s information and personal links but also sensitive content from their connections, thereby posing substantially more detrimental threats. Furthermore, we extend GraphToxin to multiple node removals under both white-box and black-box settings. We highlight the necessity of a worst-case analysis and propose a comprehensive evaluation framework to systematically assess the attack performance under both random and worst-case node removals. This provides a more robust and realistic measure of the vulnerability of graph unlearning methods to graph reconstruction attacks. Our extensive experiments demonstrate the effectiveness and flexibility of GraphToxin. Notably, we show that existing defense mechanisms are largely ineffective against this attack and, in some cases, can even amplify its performance. Given the severe privacy risks posed by GraphToxin, our work underscores the urgent need for the development of more effective and robust defense strategies against this attack.
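[AI-42] Multi-Agent Legal Verifier Systems for Data Transfer Planning KR
[Quick read]: This paper addresses the accuracy and interpretability of legal compliance verification in AI-driven data transfer planning under strict privacy regulations such as the Japanese Act on the Protection of Personal Information (APPI). The key to its solution is a multi-agent legal verification framework that decomposes compliance checking into three specialized agents, for statutory interpretation, business context evaluation, and risk assessment, which reason together through a structured synthesis protocol. On 200 annotated Amended APPI Article 16 cases, the method reaches 72% overall accuracy, 21 percentage points above a single-agent baseline, including 90% accuracy on clear compliance cases (versus 16% for the baseline), while still detecting every clear violation; this shows that domain specialization and coordinated reasoning improve legal AI performance.
Link: https://arxiv.org/abs/2511.10925
Authors: Ha-Thanh Nguyen,Wachara Fungwacharakorn,Ken Satoh
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Presented at NeLaMKRR@KR, 2025 ( arXiv:2511.09575 )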
Abstract:Legal compliance in AI-driven data transfer planning is becoming increasingly critical under stringent privacy regulations such as the Japanese Act on the Protection of Personal Information (APPI). We propose a multi-agent legal verifier that decomposes compliance checking into specialized agents for statutory interpretation, business context evaluation, and risk assessment, coordinated through a structured synthesis protocol. Evaluated on a stratified dataset of 200 Amended APPI Article 16 cases with clearly defined ground truth labels and multiple performance metrics, the system achieves 72% accuracy, which is 21 percentage points higher than a single-agent baseline, including 90% accuracy on clear compliance cases (vs. 16% for the baseline) while maintaining perfect detection of clear violations. While challenges remain in ambiguous scenarios, these results show that domain specialization and coordinated reasoning can meaningfully improve legal AI performance, providing a scalable and regulation-aware framework for trustworthy and interpretable automated compliance verification.
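[AI-43] Synthetic Voices Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio
[Quick read]: This paper addresses a previously overlooked, content-centric misuse risk of Text-to-Speech (TTS) systems: using TTS built on Large Audio-Language Models (LALMs) to generate speech with harmful content, rather than speaker impersonation alone. There are two core challenges: LALM safety alignment usually rejects harmful text input, while existing jailbreak attacks are ill-suited to TTS systems (which are designed to faithfully vocalize any text); and real-world deployments often add text and audio filters that further block malicious output. The key contribution is the HARMGEN attack suite, which comprises two families of attacks: semantic obfuscation techniques (Concat, Shuffle) that hide harmful intent by restructuring the text, and audio-modality attacks (Read, Spell, Phoneme) that inject harmful content through auxiliary audio channels while keeping the text prompt benign on the surface. Experiments show that the attacks substantially reduce refusal rates and increase the toxicity of the generated speech, exposing weaknesses in current cross-modal defenses and the need for more comprehensive multimodal safeguards in training and deployment.
Link: https://arxiv.org/abs/2511.10913
Authors: Guangke Chen,Yuhui Wang,Shouling Ji,Xiapu Luo,Ting Wang
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: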
Abstract:Modern text-to-speech (TTS) systems, particularly those built on Large Audio-Language Models (LALMs), generate high-fidelity speech that faithfully reproduces input text and mimics specified speaker identities. While prior misuse studies have focused on speaker impersonation, this work explores a distinct content-centric threat: exploiting TTS systems to produce speech containing harmful content. Realizing such threats poses two core challenges: (1) LALM safety alignment frequently rejects harmful prompts, yet existing jailbreak attacks are ill-suited for TTS because these systems are designed to faithfully vocalize any input text, and (2) real-world deployment pipelines often employ input/output filters that block harmful text and audio. We present HARMGEN, a suite of five attacks organized into two families that address these challenges. The first family employs semantic obfuscation techniques (Concat, Shuffle) that conceal harmful content within text. The second leverages audio-modality exploits (Read, Spell, Phoneme) that inject harmful content through auxiliary audio channels while maintaining benign textual prompts. Through evaluation across five commercial LALM-based TTS systems and three datasets spanning two languages, we demonstrate that our attacks substantially reduce refusal rates and increase the toxicity of generated speech. We further assess both reactive countermeasures deployed by audio-streaming platforms and proactive defenses implemented by TTS providers. Our analysis reveals critical vulnerabilities: deepfake detectors underperform on high-fidelity audio; reactive moderation can be circumvented by adversarial perturbations; while proactive moderation detects 57-93% of attacks. Our work highlights a previously underexplored content-centric misuse vector for TTS and underscores the need for robust cross-modal safeguards throughout training and deployment.
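[AI-44] LLM enhanced graph inference for long-term disease progression modelling
[Quick read]: This paper addresses the difficulty of modeling the complex interactions among regional brain biomarkers in neurodegenerative diseases such as Alzheimer's disease, in particular the inaccurate long-term progression predictions that result when existing methods assume a single-modality brain connectome as the substrate for pathology spread. The key to its solution is to use Large Language Models (LLMs) as expert guides that enhance the learning of disease progression from irregularly sampled longitudinal patient data: the method jointly optimizes the reconstruction of long-term disease trajectories from individual-level observations and a biologically constrained graph of brain-region interactions with better identifiability, improving both prediction accuracy and interpretability.
Link: https://arxiv.org/abs/2511.10890
Authors: Tiantian He,An Zhao,Elinor Thompson,Anna Schroder,Ahmed Abdulaal,Frederik Barkhof,Daniel C. Alexander
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: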
Abstract:Understanding the interactions between biomarkers among brain regions during neurodegenerative disease is essential for unravelling the mechanisms underlying disease progression. For example, pathophysiological models of Alzheimer’s Disease (AD) typically describe how variables, such as regional levels of toxic proteins, interact spatiotemporally within a dynamical system driven by an underlying biological substrate, often based on brain connectivity. However, current methods grossly oversimplify the complex relationship between brain connectivity by assuming a single-modality brain connectome as the disease-spreading substrate. This leads to inaccurate predictions of pathology spread, especially during the long-term progression period. Meanwhile, other methods of learning such a graph in a purely data-driven way face the identifiability issue due to the lack of proper constraints. We thus present a novel framework that uses Large Language Models (LLMs) as expert guides on the interaction of regional variables to enhance learning of disease progression from irregularly sampled longitudinal patient data. By leveraging LLMs’ ability to synthesize multi-modal relationships and incorporate diverse disease-driving mechanisms, our method simultaneously optimizes 1) the construction of long-term disease trajectories from individual-level observations and 2) the biologically-constrained graph structure that captures interactions among brain regions with better identifiability. We demonstrate the new approach by estimating the pathology propagation using tau-PET imaging data from an Alzheimer’s disease cohort. The new framework demonstrates superior prediction accuracy and interpretability compared to traditional approaches while revealing additional disease-driving factors beyond conventional connectivity measures.
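[AI-45] Incorporating Spatial Information into Goal-Conditioned Hierarchical Reinforcement Learning via Graph Representations
[Quick read]: This paper addresses three key problems in current Goal-conditioned Hierarchical Reinforcement Learning (GCHRL): existing graph-based methods rely on domain knowledge to build their graphs and so generalize poorly to new tasks; dynamically generated graphs are limited in how they propagate information and cannot effectively guide the agent in unvisited states; and GCHRL methods generally suffer from low sample efficiency and weak subgoal representations. The key to its solution is a graph encoder-decoder for evaluating unseen states, implemented as a neural network trained on the state graph generated during exploration. The method, Graph-Guided sub-Goal representation Generation RL (G4RL), can be integrated into any GCHRL framework; in environments with primarily symmetric and reversible transitions it significantly improves performance, using high-level and low-level intrinsic rewards derived from the graph at only a small extra computational cost.
Link: https://arxiv.org/abs/2511.10872
Authors: Shuyuan Zhang,Zihan Wang,Xiao-Wen Chang,Doina Precup
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Transactions on Machine Learning Research (2025)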
Abstract:The integration of graphs with Goal-conditioned Hierarchical Reinforcement Learning (GCHRL) has recently gained attention, as intermediate goals (subgoals) can be effectively sampled from graphs that naturally represent the overall task structure in most RL tasks. However, existing approaches typically rely on domain-specific knowledge to construct these graphs, limiting their applicability to new tasks. Other graph-based approaches create graphs dynamically during exploration but struggle to fully utilize them, because they have problems passing the information in the graphs to newly visited states. Additionally, current GCHRL methods face challenges such as sample inefficiency and poor subgoal representation. This paper proposes a solution to these issues by developing a graph encoder-decoder to evaluate unseen states. Our proposed method, Graph-Guided sub-Goal representation Generation RL (G4RL), can be incorporated into any existing GCHRL method when operating in environments with primarily symmetric and reversible transitions to enhance performance across this class of problems. We show that the graph encoder-decoder can be effectively implemented using a network trained on the state graph generated during exploration. Empirical results indicate that leveraging high and low-level intrinsic rewards from the graph encoder-decoder significantly enhances the performance of state-of-the-art GCHRL approaches at a small extra computational cost in dense and sparse reward environments.
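[AI-46] Generative Artificial Intelligence Adoption Among Bangladeshi Journalists: Exploring Journalists Awareness Acceptance Usage and Organizational Stance on Generative AI
[Quick read]: This paper asks how, in a non-Western context (Bangladesh), journalists' adoption of Generative AI (GenAI) is shaped by specific social, organizational, and cultural conditions, and whether existing technology adoption theory, such as the Unified Theory of Acceptance and Use of Technology (UTAUT), applies in such settings. The key to its solution is a revision of the UTAUT model with two core adjustments. First, facilitating conditions do not significantly shape journalists' behavioral intent in non-Western settings, which indicates that institutional support is not a necessary driver. Second, in the absence of formal hierarchical pressure, social influence operates horizontally through informal peer pressure or professional motivation; the paper also notes that journalists' adoption of GenAI appears voluntary but in fact stems from professional compulsion rather than purely autonomous choice. This revision deepens the understanding of technology adoption paths in non-Western journalism.
Link: https://arxiv.org/abs/2511.10862
Authors: H. M. Murtuza,Md Oliullah
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: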
Abstract:Newsrooms and journalists across the world are adopting Generative AI (GenAI). Drawing on in-depth interviews with 23 journalists, this study identifies Bangladeshi journalists’ awareness, acceptance, usage patterns, and their media organizations’ stance toward GenAI. This study finds Bangladeshi journalists’ high reliance on GenAI like their Western colleagues despite limited institutional support and the near absence of AI policy. Despite this contrast, concerns over GenAI’s implications in journalism between the West and non-West were mostly identical. Moreover, this study contributes to the Unified Theory of Acceptance and Use of Technology (UTAUT) by proposing two changes regarding GenAI adoption among journalists in non-Western settings. First, this study identifies the non-contribution of facilitating conditions in shaping behavioral intent in GenAI adoption in non-Western contexts. Second, social influence works in a horizontal order through informal peer pressure or professional motivation in the absence of formal institutional hierarchical pressure. Voluntariness in the context of Bangladeshi journalists is underpinned by their professional compulsion. Therefore, this study contributes to understanding how contextual factors shape technology adoption trajectories in non-Western journalism.
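[AI-47] HPCAgentTester: A Multi-Agent LLM Approach for Enhanced HPC Unit Test Generation
[Quick read]: This paper addresses the limited automation and reliability of unit testing in High-Performance Computing (HPC) environments, especially the non-deterministic behavior and synchronization problems caused by parallelism, complex algorithms, and heterogeneous hardware. The key to its solution is HPCAgentTester, a multi-agent Large Language Model (LLM) framework with a collaborative workflow in which a Recipe Agent and a Test Agent iteratively generate and refine test cases through a critique loop. This yields context-aware unit tests that cover OpenMP and MPI parallel execution constructs, complex communication patterns, and hierarchical parallelism, significantly improving test compilability and functional correctness.
Link: https://arxiv.org/abs/2511.10860
Authors: Rabimba Karanjai,Lei Xu,Weidong Shi
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Accepted in AIWare 2025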
Abstract:Unit testing in High-Performance Computing (HPC) is critical but challenged by parallelism, complex algorithms, and diverse hardware. Traditional methods often fail to address non-deterministic behavior and synchronization issues in HPC applications. This paper introduces HPCAgentTester, a novel multi-agent Large Language Model (LLM) framework designed to automate and enhance unit test generation for HPC software utilizing OpenMP and MPI. HPCAgentTester employs a unique collaborative workflow where specialized LLM agents (Recipe Agent and Test Agent) iteratively generate and refine test cases through a critique loop. This architecture enables the generation of context-aware unit tests that specifically target parallel execution constructs, complex communication patterns, and hierarchical parallelism. We demonstrate HPCAgentTester’s ability to produce compilable and functionally correct tests for OpenMP and MPI primitives, effectively identifying subtle bugs that are often missed by conventional techniques. Our evaluation shows that HPCAgentTester significantly improves test compilation rates and correctness compared to standalone LLMs, offering a more robust and scalable solution for ensuring the reliability of parallel software systems.
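[AI-48] Enhancing Demand-Oriented Regionalization with Agentic AI and Local Heterogeneous Data for Adaptation Planning NEURIPS2025
[Quick read]: This paper addresses the problem that conventional planning units (such as census tracts, zip codes, or neighborhoods) do not accurately reflect the specific needs of local communities and lack the flexibility needed for effective hazard prevention or response strategies. The key to its solution is a planning support system based on agentic AI. Its core is a representative initialized spatially constrained self-organizing map (RepSC-SOM), which strengthens spatial clustering through adaptive geographic filtering and region-growing refinement, while AI agents reason, plan, and act to guide users in interactively generating demand-oriented, dynamic planning regions. The system combines computational rigor with user-driven decision making to make disaster planning more transparent and adaptable.
Link: https://arxiv.org/abs/2511.10857
Authors: Seyedeh Mobina Noorani,Shangde Gao,Changjie Chen,Karla Saldana Ochoa
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by NeurIPS 2025 UrbanAI Workshop as poster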
Abstract:Conventional planning units or urban regions, such as census tracts, zip codes, or neighborhoods, often do not capture the specific demands of local communities and lack the flexibility to implement effective strategies for hazard prevention or response. To support the creation of dynamic planning units, we introduce a planning support system with agentic AI that enables users to generate demand-oriented regions for disaster planning, integrating the human-in-the-loop principle for transparency and adaptability. The platform is built on a representative initialized spatially constrained self-organizing map (RepSC-SOM), extending traditional SOM with adaptive geographic filtering and region-growing refinement, while AI agents can reason, plan, and act to guide the process by suggesting input features, guiding spatial constraints, and supporting interactive exploration. We demonstrate the capabilities of the platform through a case study on the flooding-related risk in Jacksonville, Florida, showing how it allows users to explore, generate, and evaluate regionalization interactively, combining computational rigor with user-driven decision making.
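[AI-49] Advanced Tool for Traffic Crash Analysis: An AI-Driven Multi-Agent Approach to Pre-Crash Reconstruction
[Quick read]: This paper addresses the inconsistent results that arise in traffic crash reconstruction from its reliance on human expert judgment, particularly when the multimodal data (textual reports, structured tabular data, and visual scene diagrams) are incomplete. The key to its solution is a two-phase collaborative multi-agent AI framework: Phase I generates natural-language crash reconstructions from the multimodal inputs, and Phase II combines them with temporal Event Data Recorder (EDR) data for in-depth reasoning, identifying the most relevant event and the vehicle roles (striking versus struck). The framework achieves 100% accuracy on 39 complex cases, compared with 92% for human researchers, and remains robust to missing or erroneous EDR data and ambiguous scene diagrams, demonstrating strong capability in processing heterogeneous crash data.
Link: https://arxiv.org/abs/2511.10853
Authors: Gerui Xu,Boyou Chen,Huizhong Guo,Dave LeBlanc,Ananna Ahmed,Zhaonan Sun,Shan Bao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 26 pages, 10 figures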
Abstract:Traffic collision reconstruction traditionally relies on human expertise, often yielding inconsistent results when analyzing incomplete multimodal data. This study develops a multi-agent AI framework that reconstructs pre-crash scenarios and infers vehicle behaviors from fragmented collision data. We present a two-phase collaborative framework combining reconstruction and reasoning phases. The system processes 277 rear-end lead vehicle deceleration (LVD) collisions from the Crash Investigation Sampling System, integrating textual crash reports, structured tabular data, and visual scene diagrams. Phase I generates natural-language crash reconstructions from multimodal inputs. Phase II performs in-depth crash reasoning by combining these reconstructions with temporal Event Data Recorder (EDR) data. For validation, we applied it to all LVD cases, focusing on a subset of 39 complex crashes where multiple EDR records per collision introduced ambiguity (e.g., due to missing or conflicting data). The evaluation of the 39 LVD crash cases revealed our framework achieved perfect accuracy across all test cases, successfully identifying both the most relevant EDR event and correctly distinguishing striking versus struck vehicles, surpassing the 92% accuracy achieved by human researchers on the same challenging dataset. The system maintained robust performance even when processing incomplete data, including missing or erroneous EDR records and ambiguous scene diagrams. This study demonstrates superior AI capabilities in processing heterogeneous collision data, providing unprecedented precision in reconstructing impact dynamics and characterizing pre-crash behaviors.
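[AI-50] Adaptive Digital Twin of Sheet Metal Forming via Proper Orthogonal Decomposition-Based Koopman Operator with Model Predictive Control
[Quick read]: This paper addresses the challenges of applying Digital Twins (DT) to deformation-based metal forming, in particular the difficulty of real-time prediction and control caused by the strongly nonlinear relationship between toolpath and material response and by spatially and temporally coupled behavior. The key to its solution is an adaptive digital twin framework that combines Proper Orthogonal Decomposition (POD) for physics-aware dimensionality reduction with a Koopman operator that maps the nonlinear system into a linear lifted space to support Model Predictive Control (MPC). An online Recursive Least Squares (RLS) algorithm updates the Koopman operator coefficients so that the model adapts continuously to changes in process conditions or material state, keeping the digital twin up to date as new deformation data arrive.
Link: https://arxiv.org/abs/2511.10852
Authors: Yi-Ping Chen,Derick Suarez,Ying-Kuan Tsai,Vispi Karkaria,Guanzhong Hu,Zihan Chen,Ping Guo,Jian Cao,Wei Chen
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
Comments: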
Abstract:Digital Twin (DT) technologies are transforming manufacturing by enabling real-time prediction, monitoring, and control of complex processes. Yet, applying DT to deformation-based metal forming remains challenging because of the strongly coupled spatial-temporal behavior and the nonlinear relationship between toolpath and material response. For instance, sheet-metal forming by the English wheel, a highly flexible but artisan-dependent process, still lacks digital counterparts that can autonomously plan and adapt forming strategies. This study presents an adaptive DT framework that integrates Proper Orthogonal Decomposition (POD) for physics-aware dimensionality reduction with a Koopman operator for representing the nonlinear system in a linear lifted space for real-time decision-making via model predictive control (MPC). To accommodate evolving process conditions or material states, an online Recursive Least Squares (RLS) algorithm is introduced to update the operator coefficients in real time, enabling continuous adaptation of the DT model as new deformation data become available. The proposed framework is experimentally demonstrated on a robotic English Wheel sheet metal forming system, where deformation fields are measured and modeled under varying toolpaths. Results show that the adaptive DT is capable of controlling the forming process to achieve the given target shape by effectively capturing non-stationary process behaviors. Beyond this case study, the proposed framework establishes a generalizable approach for interpretable, adaptive, and computationally-efficient DT of nonlinear manufacturing systems, bridging reduced-order physics representations with data-driven adaptability to support autonomous process control and optimization.
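As a rough illustration of the online update step, the sketch below applies a textbook Recursive Least Squares recursion to a linear model z_next ≈ K z in a lifted space. It assumes the lifted states (for example, POD coefficients passed through lifting functions) are already available, omits the control input, and uses a made-up 3x3 operator and NumPy; it is not the authors' implementation.

```python
import numpy as np


def rls_update(K, P, z, z_next, forgetting=1.0):
    """One recursive-least-squares step for the linear model z_next ≈ K @ z.

    K          : current operator estimate, shape (r, r)
    P          : current inverse-covariance estimate, shape (r, r), symmetric
    z, z_next  : consecutive lifted states, shape (r,)
    forgetting : factor in (0, 1]; values below 1 discount older data
    """
    Pz = P @ z
    gain = Pz / (forgetting + z @ Pz)       # RLS gain vector
    error = z_next - K @ z                  # one-step prediction error
    K_new = K + np.outer(error, gain)       # correct every row of K
    P_new = (P - np.outer(gain, Pz)) / forgetting
    return K_new, P_new


# Hypothetical ground-truth operator, used only to generate toy data.
K_true = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.1, 0.0, 0.7],
])

rng = np.random.default_rng(0)
K_est = np.zeros((3, 3))        # start with no knowledge of the operator
P = 1e6 * np.eye(3)             # large P means a weak prior on K_est

for _ in range(50):
    z = rng.normal(size=3)              # stand-in for a measured lifted state
    z_next = K_true @ z                 # noise-free next state
    K_est, P = rls_update(K_est, P, z, z_next)

max_abs_error = np.max(np.abs(K_est - K_true))
```

With a forgetting factor below 1 the same recursion tracks a slowly drifting operator, which is the usual reason for choosing RLS when process conditions or material states change over time.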
[AI-51] STAMP: Spatial-Temporal Adapter with Multi-Head Pooling ML4H ALT NEURIPS2025
[Quick read]: This paper addresses how general time series foundation models (TSFMs) compare with purpose-built EEG foundation models (EEGFMs) on electroencephalography (EEG) tasks, a comparison that had not been carried out systematically. The key to its solution is STAMP, a lightweight and flexible Spatial-Temporal Adapter with Multi-Head Pooling that takes the univariate embeddings produced by a general TSFM and implicitly models the spatial-temporal characteristics of EEG data, reaching classification performance comparable to state-of-the-art EEGFMs without EEG-specific pretraining.
Link: https://arxiv.org/abs/2511.10848
Authors: Brad Shook,Abby Turner,Jieshi Chen,Michał Wiliński,Mononito Goswami,Jonathan Elmer,Artur Dubrawski
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted as a Proceedings paper at Machine Learning for Health (ML4H) 2025, invited presentation at the Time Series for Health (TS4H) Workshop, NeurIPS 2025
Abstract:Time series foundation models (TSFMs) pretrained on data from multiple domains have shown strong performance on diverse modeling tasks. Various efforts have been made to develop foundation models specific to electroencephalography (EEG) data, which records brain electrical activity as time series. However, no comparative analysis of EEG-specific foundation models (EEGFMs) versus general TSFMs has been performed on EEG-specific tasks. We introduce a novel Spatial-Temporal Adapter with Multi-Head Pooling (STAMP), which leverages univariate embeddings produced by a general TSFM, implicitly models spatial-temporal characteristics of EEG data, and achieves performance comparable to state-of-the-art EEGFMs. A comprehensive analysis is performed on 8 benchmark datasets of clinical tasks using EEG for classification, along with ablation studies. Our proposed adapter is lightweight in trainable parameters and flexible in the inputs it can accommodate, supporting easy modeling of EEG data using TSFMs.
[AI-52] Optimal Welfare in Noncooperative Network Formation under Attack AAAI2026
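[Quick read]: This paper studies how decentralized communication networks, in which selfish agents decide for themselves how to connect and how to secure themselves, can remain robust against an attacker. The core difficulty is that such networks (for example the Internet, or peer-to-peer networks among smart devices) have no single controlling authority, and each node makes interconnection and defense decisions in its own interest, which can hurt overall social welfare. The key to the solution is to revisit the game-theoretic model of Goyal et al. and prove that the networks formed by selfish agents still withstand a large class of attackers and retain asymptotically optimal social welfare after an attack. The result improves earlier robustness bounds, resolves an open problem, and reveals a counter-intuitive fact: an attacker that aims to minimize post-attack social welfare does not actually inflict the greatest possible damage.
Link: https://arxiv.org/abs/2511.10845
Authors: Natan Doubez,Pascal Lenzner,Marcus Wunderlich
Affiliations: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: Accepted at AAAI 2026 – full version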
Abstract:Communication networks are essential for our economy and our everyday lives. This makes them lucrative targets for attacks. Today, we see an ongoing battle between criminals that try to disrupt our key communication networks and security professionals that try to mitigate these attacks. However, today’s networks, like the Internet or peer-to-peer networks among smart devices, are not controlled by a single authority, but instead consist of many independently administrated entities that are interconnected. Thus, both the decisions of how to interconnect and how to secure against potential attacks are taken in a decentralized way by selfish agents. This strategic setting, with agents that want to interconnect and potential attackers that want to disrupt the network, was captured via an influential game-theoretic model by Goyal, Jabbari, Kearns, Khanna, and Morgenstern (WINE 2016). We revisit this model and show improved tight bounds on the achieved robustness of networks created by selfish agents. As our main result, we show that such networks can resist attacks of a large class of potential attackers, i.e., these networks maintain asymptotically optimal welfare post attack. This improves several bounds and resolves an open problem. Along the way, we show the counter-intuitive result, that attackers that aim at minimizing the social welfare post attack do not actually inflict the greatest possible damage.
[AI-53] Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning AAAI2026
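[Quick read]: This paper targets the poor sample efficiency and training instability that reinforcement learning algorithms suffer when the return estimates used for policy improvement have high variance. The key to its solution is a recent result from off-policy evaluation: data collected with a well-designed behaviour policy can yield provably lower-variance return estimates, which challenges the conventional view that on-policy data collection is variance-optimal. The authors extend this insight to online reinforcement learning, where policy evaluation and improvement are interleaved, using a single behaviour policy to collect data for policy improvement, and report better sample efficiency and performance for two policy-gradient methods extended in this way.
Link: https://arxiv.org/abs/2511.10843
Authors: Alexander W. Goodall,Edwin Hamel-De le Court,Francesco Belardinelli
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at AAAI 2026 (main track)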
Abstract:Many reinforcement learning algorithms, particularly those that rely on return estimates for policy improvement, can suffer from poor sample efficiency and training instability due to high-variance return estimates. In this paper we leverage new results from off-policy evaluation; it has recently been shown that well-designed behaviour policies can be used to collect off-policy data for provably lower variance return estimates. This result is surprising as it means collecting data on-policy is not variance optimal. We extend this key insight to the online reinforcement learning setting, where both policy evaluation and improvement are interleaved to learn optimal policies. Off-policy RL has been well studied (e.g., IMPALA), with correct and truncated importance weighted samples for de-biasing and managing variance appropriately. Generally these approaches are concerned with reconciling data collected from multiple workers in parallel while the policy is updated asynchronously; the mismatch between the workers and the policy is corrected in a mathematically sound way. Here we consider only one worker - the behaviour policy, which is used to collect data for policy improvement, with provably lower variance return estimates. In our experiments we extend two policy-gradient methods with this regime, demonstrating better sample efficiency and performance over a diverse set of environments.
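The claim that on-policy data collection is not variance-optimal can be checked by hand in a one-step bandit. The sketch below is a textbook illustration under assumptions, not the paper's algorithm: the two-action problem, the reward values, and the helper names are hypothetical, and the zero-variance construction shown (behaviour proportional to target probability times reward) requires strictly positive rewards.

```python
def is_estimates(target, behaviour, rewards):
    """Per-action importance-weighted return: pi(a) / b(a) * r(a)."""
    return [p / b * r for p, b, r in zip(target, behaviour, rewards)]


def variance(values, probs):
    """Variance of a discrete random variable taking `values` with `probs`."""
    mean = sum(v * q for v, q in zip(values, probs))
    return sum(q * (v - mean) ** 2 for v, q in zip(values, probs))


target = [0.5, 0.5]    # target policy pi over two actions
rewards = [1.0, 3.0]   # deterministic, strictly positive rewards
expected = sum(p * r for p, r in zip(target, rewards))

# On-policy sampling: behaviour equals target, so every weight is 1.
on_policy_var = variance(is_estimates(target, target, rewards), target)

# Behaviour policy proportional to pi(a) * r(a): every weighted sample
# equals the true expected return, so the estimator has zero variance.
behaviour = [p * r / expected for p, r in zip(target, rewards)]
off_policy_var = variance(is_estimates(target, behaviour, rewards), behaviour)
```

Here the on-policy estimator has variance 1 while the reweighted off-policy estimator has variance 0, with both remaining unbiased for the expected return of 2.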
[AI-54] HyperComplEx: Adaptive Multi-Space Knowledge Graph Embeddings
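[Quick read]: This paper addresses the limits of existing knowledge graph embedding methods when modeling diverse relation types at scale: Euclidean models struggle with hierarchies, vector space models cannot capture asymmetric relations, and hyperbolic models fail on symmetric relations. The key to its solution is HyperComplEx, a hybrid embedding framework that adaptively combines hyperbolic, complex, and Euclidean spaces through learned attention mechanisms. A relation-specific space weighting strategy dynamically selects the best geometry for each relation type, and a multi-space consistency loss keeps predictions coherent across spaces.
Link: https://arxiv.org/abs/2511.10842
Authors: Jugal Gajjar,Kaustik Ranaware,Kamalasankari Subramaniakuppusamy,Vaibhav Gandhi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 9 pages, 3 figures, 8 tables, 19 equations, accepted at the 5th Workshop on Knowledge Graphs and Big Data in IEEE BigData 2025 and the paper will be published in the IEEE BigData Conference Proceedings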
Abstract:Knowledge graphs have emerged as fundamental structures for representing complex relational data across scientific and enterprise domains. However, existing embedding methods face critical limitations when modeling diverse relationship types at scale: Euclidean models struggle with hierarchies, vector space models cannot capture asymmetry, and hyperbolic models fail on symmetric relations. We propose HyperComplEx, a hybrid embedding framework that adaptively combines hyperbolic, complex, and Euclidean spaces via learned attention mechanisms. A relation-specific space weighting strategy dynamically selects optimal geometries for each relation type, while a multi-space consistency loss ensures coherent predictions across spaces. We evaluate HyperComplEx on computer science research knowledge graphs ranging from 1K papers (~25K triples) to 10M papers (~45M triples), demonstrating consistent improvements over state-of-the-art baselines including TransE, RotatE, DistMult, ComplEx, SEPA, and UltraE. Additional tests on standard benchmarks confirm significantly higher results than all baselines. On the 10M-paper dataset, HyperComplEx achieves 0.612 MRR, a 4.8% relative gain over the best baseline, while maintaining efficient training, achieving 85 ms inference per triple. The model scales near-linearly with graph size through adaptive dimension allocation. We release our implementation and dataset family to facilitate reproducible research in scalable knowledge graph embeddings.
[AI-55] FlowPath: Learning Data-Driven Manifolds with Invertible Flows for Robust Irregularly-sampled Time Series Classification
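[Quick read]: This paper addresses modeling continuous-time dynamics from sparse, irregularly sampled time series, a setting where missing data or inconsistent sampling rates often degrade conventional methods. The core challenge is how to construct the control path so that it reflects the latent geometry between observations; existing methods mostly use fixed interpolation schemes that adapt poorly to complex data manifolds. The key to the solution is FlowPath, which learns the geometry of the control path with an invertible neural flow, building a continuous, data-adaptive manifold whose invertibility constraints enforce information-preserving and well-behaved transformations. This inductive bias distinguishes FlowPath from earlier unconstrained learnable-path models. On 18 benchmark datasets and a real-world case study it outperforms baselines that use fixed interpolants or non-invertible architectures with statistical significance, which supports modeling both the dynamics along the path and the geometry of the path itself.
Link: https://arxiv.org/abs/2511.10841
Authors: YongKyung Oh,Dong-Young Lim,Sungil Kim
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: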
Abstract:Modeling continuous-time dynamics from sparse and irregularly-sampled time series remains a fundamental challenge. Neural controlled differential equations provide a principled framework for such tasks, yet their performance is highly sensitive to the choice of control path constructed from discrete observations. Existing methods commonly employ fixed interpolation schemes, which impose simplistic geometric assumptions that often misrepresent the underlying data manifold, particularly under high missingness. We propose FlowPath, a novel approach that learns the geometry of the control path via an invertible neural flow. Rather than merely connecting observations, FlowPath constructs a continuous and data-adaptive manifold, guided by invertibility constraints that enforce information-preserving and well-behaved transformations. This inductive bias distinguishes FlowPath from prior unconstrained learnable path models. Empirical evaluations on 18 benchmark datasets and a real-world case study demonstrate that FlowPath consistently achieves statistically significant improvements in classification accuracy over baselines using fixed interpolants or non-invertible architectures. These results highlight the importance of modeling not only the dynamics along the path but also the geometry of the path itself, offering a robust and generalizable solution for learning from irregular time series.
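To make "invertibility constraints" concrete, the sketch below shows an affine coupling layer, one standard building block of invertible flows. It is an assumption-laden illustration: the abstract does not say which flow architecture FlowPath uses, and `scale_fn` and `shift_fn` are hypothetical stand-ins for small neural networks.

```python
import math


def coupling_forward(x1, x2, scale_fn, shift_fn):
    """Affine coupling: x1 passes through, x2 is scaled and shifted using x1."""
    return x1, x2 * math.exp(scale_fn(x1)) + shift_fn(x1)


def coupling_inverse(y1, y2, scale_fn, shift_fn):
    """Exact inverse; it exists for any choice of scale_fn and shift_fn."""
    return y1, (y2 - shift_fn(y1)) * math.exp(-scale_fn(y1))


def scale_fn(v):
    # Hypothetical stand-in for a learned scale network
    return math.tanh(0.7 * v)


def shift_fn(v):
    # Hypothetical stand-in for a learned shift network
    return 0.3 * v * v - 1.0


y1, y2 = coupling_forward(0.4, -1.5, scale_fn, shift_fn)
x1, x2 = coupling_inverse(y1, y2, scale_fn, shift_fn)
```

Because the inverse is available in closed form, no information about the input is lost by the transformation, which is the property the abstract describes as information-preserving.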
[AI-56] HARNESS: Human-Agent Risk Navigation and Event Safety System for Proactive Hazard Forecasting in High-Risk DOE Environments
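[Quick read]: This paper addresses the difficulty of forecasting and managing operational safety risks in real time at mission-critical work sites, particularly in U.S. Department of Energy (DOE) environments, where complex and hazardous daily tasks place high demands on safety. The key to its solution is the Human-Agent Risk Navigation and Event Safety System (HARNESS), a modular AI framework that combines large language models (LLMs) with structured work data, historical event retrieval, and risk analysis. A human-in-the-loop mechanism lets subject matter experts (SMEs) refine the predictions, forming an iterative reasoning loop intended to improve the reliability and efficiency of predictive safety systems.
Link: https://arxiv.org/abs/2511.10810
Authors: Ran Elgedawy,Sanjay Das,Ethan Seefried,Gavin Wiggins,Ryan Burchfield,Dana Hewit,Sudarshan Srinivasan,Todd Thomas,Prasanna Balaprakash,Tirthankar Ghosal
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: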
Abstract:Operational safety at mission-critical work sites is a top priority given the complex and hazardous nature of daily tasks. This paper presents the Human-Agent Risk Navigation and Event Safety System (HARNESS), a modular AI framework designed to forecast hazardous events and analyze operational risks in U.S. Department of Energy (DOE) environments. HARNESS integrates Large Language Models (LLMs) with structured work data, historical event retrieval, and risk analysis to proactively identify potential hazards. A human-in-the-loop mechanism allows subject matter experts (SMEs) to refine predictions, creating an adaptive learning loop that enhances performance over time. By combining SME collaboration with iterative agentic reasoning, HARNESS improves the reliability and efficiency of predictive safety systems. Preliminary deployment shows promising results, with future work focusing on quantitative evaluation of accuracy, SME agreement, and decision latency reduction.
[AI-57] Discounted Cuts: A Stackelberg Approach to Network Disruption AAAI2026
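[Quick read]: This paper studies a Stackelberg variant of the Most Vital Links problem: in a flow network, an attacker first removes up to $k$ edges to maximally disrupt the maximum flow from a source $s$ to a sink $t$, after which the defender optimally reroutes the remaining flow. To model this adversarial interaction the authors introduce a new mathematical framework of discounted cuts, in which the cost of a cut is evaluated after excluding its $k$ most expensive edges, generalizing the classical Most Vital Links problem to richer attacker-defender settings. The framework treats several discount mechanisms uniformly (excluding either the $k$ most expensive or the $k$ cheapest edges). Most variants are NP-complete on general graphs, but all are solvable in polynomial time on bounded-genus graphs, a class that includes many real-world transportation and infrastructure networks.
Link: https://arxiv.org/abs/2511.10804
Authors: Pål Grønås Drange,Fedor V. Fomin,Petr Golovach,Danil Sagunov
Affiliations: Unknown
Categories: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI 2026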
Abstract:We study a Stackelberg variant of the classical Most Vital Links problem, modeled as a one-round adversarial game between an attacker and a defender. The attacker strategically removes up to k edges from a flow network to maximally disrupt flow between a source s and a sink t, after which the defender optimally reroutes the remaining flow. To capture this attacker–defender interaction, we introduce a new mathematical model of discounted cuts, in which the cost of a cut is evaluated by excluding its k most expensive edges. This model generalizes the Most Vital Links problem and uncovers novel algorithmic and complexity-theoretic properties. We develop a unified algorithmic framework for analyzing various forms of discounted cut problems, including minimizing or maximizing the cost of a cut under discount mechanisms that exclude either the k most expensive or the k cheapest edges. While most variants are NP-complete on general graphs, our main result establishes polynomial-time solvability for all discounted cut problems in our framework when the input is restricted to bounded-genus graphs, a relevant class that includes many real-world networks such as transportation and infrastructure networks. With this work, we aim to open collaborative bridges between artificial intelligence, algorithmic game theory, and operations research.
MSC classes: 05C85 (Primary), 90B10 (Secondary) 91A65
ACM classes: F.2.2; G.2.2
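The discounted cost of a cut, as defined in the abstract, can be sketched in a few lines, together with a brute-force minimum discounted s-t cut for tiny graphs. This is only an illustration under assumptions: the function names and the five-edge directed example are hypothetical, and the exhaustive search is exponential, unlike the polynomial-time algorithms the paper establishes for bounded-genus graphs.

```python
from itertools import combinations


def discounted_cut_cost(edge_costs, k):
    """Cost of a cut after dropping its k most expensive edges."""
    kept = sorted(edge_costs)[: max(len(edge_costs) - k, 0)]
    return sum(kept)


def min_discounted_st_cut(nodes, edges, s, t, k):
    """Try every vertex set containing s but not t; only for tiny graphs.

    edges: list of directed (u, v, cost) triples.
    Returns (best discounted cost, source-side vertex set).
    """
    others = [v for v in nodes if v not in (s, t)]
    best = None
    for r in range(len(others) + 1):
        for extra in combinations(others, r):
            side = {s, *extra}
            crossing = [c for (u, v, c) in edges if u in side and v not in side]
            cost = discounted_cut_cost(crossing, k)
            if best is None or cost < best[0]:
                best = (cost, side)
    return best


nodes = ["s", "a", "b", "t"]
edges = [("s", "a", 4), ("s", "b", 2), ("a", "t", 3), ("b", "t", 5), ("a", "b", 1)]
plain_cost, _ = min_discounted_st_cut(nodes, edges, "s", "t", k=0)
attacked_cost, attacked_side = min_discounted_st_cut(nodes, edges, "s", "t", k=1)
```

With `k=0` this is the ordinary minimum cut (6 in the example, matching the maximum flow); with `k=1` the attacker deletes the edge of cost 4 leaving `s`, and the remaining cut cost drops to 2.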
[AI-58] Fast Neural Tangent Kernel Alignment, Norm and Effective Rank via Trace Estimation
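[Quick read]: This paper addresses the fact that computing the full Neural Tangent Kernel (NTK) matrix is often too expensive in practice, especially for recurrent architectures such as RNNs. The key to its solution is a matrix-free randomized approach that uses trace estimation to compute several NTK quantities quickly: the trace, Frobenius norm, effective rank, and alignment. The author gives numerical recipes based on the Hutch++ trace estimator with convergence guarantees. Exploiting the structure of the NTK, the paper further proposes one-sided estimators that need only forward-mode or only reverse-mode automatic differentiation; these can outperform Hutch++ in the low-sample regime, especially when the gap between the model state and parameter count is large. Overall the approach yields speedups of many orders of magnitude.
Link: https://arxiv.org/abs/2511.10796
Authors: James Hazelden
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments: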
Abstract:The Neural Tangent Kernel (NTK) characterizes how a model’s state evolves over Gradient Descent. Computing the full NTK matrix is often infeasible, especially for recurrent architectures. Here, we introduce a matrix-free perspective, using trace estimation to rapidly analyze the empirical, finite-width NTK. This enables fast computation of the NTK’s trace, Frobenius norm, effective rank, and alignment. We provide numerical recipes based on the Hutch++ trace estimator with provably fast convergence guarantees. In addition, we show that, due to the structure of the NTK, one can compute the trace using only forward- or reverse-mode automatic differentiation, not requiring both modes. We show these so-called one-sided estimators can outperform Hutch++ in the low-sample regime, especially when the gap between the model state and parameter count is large. In total, our results demonstrate that matrix-free randomized approaches can yield speedups of many orders of magnitude, leading to faster analysis and applications of the NTK.
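One way to see why a single automatic-differentiation mode can suffice for the trace: if the empirical NTK is written as $K = J J^\top$ for a Jacobian $J$, then $z^\top K z = \lVert J^\top z \rVert^2$, so a Hutchinson-style estimate needs only products of the form $J^\top z$. The sketch below is a plain Hutchinson estimator under that assumption, not Hutch++ and not the paper's estimators; the explicit random matrix `J` and the helper `vjp` are hypothetical stand-ins for a model Jacobian and a reverse-mode pass.

```python
import random


def vjp(J, z):
    """Return J^T z; stands in for one reverse-mode AD pass."""
    rows, cols = len(J), len(J[0])
    return [sum(J[i][j] * z[i] for i in range(rows)) for j in range(cols)]


def ntk_trace_estimate(J, num_probes, rng):
    """Average ||J^T z||^2 over Rademacher probes z to estimate tr(J J^T)."""
    rows = len(J)
    total = 0.0
    for _ in range(num_probes):
        z = [rng.choice((-1.0, 1.0)) for _ in range(rows)]
        g = vjp(J, z)
        total += sum(v * v for v in g)
    return total / num_probes


rng = random.Random(0)
# Hypothetical Jacobian: 8 outputs, 30 parameters
J = [[rng.gauss(0.0, 1.0) for _ in range(30)] for _ in range(8)]
exact = sum(v * v for row in J for v in row)  # tr(J J^T) = ||J||_F^2
approx = ntk_trace_estimate(J, num_probes=2000, rng=rng)
```

The 8-by-8 kernel matrix is never formed; in a real model each probe would cost one backward pass instead of one pass per kernel entry.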
[AI-59] Potential Outcome Rankings for Counterfactual Decision Making
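[Quick read]: This paper addresses counterfactual decision-making under uncertainty using causal reasoning: choosing, among several candidate actions, the one that is best with respect to an individual's potential outcomes (or their utility or desirability). The core challenge is to quantify the probability of each ranking of potential outcomes across actions and the probability that an action achieves the best outcome, giving decisions a finer-grained basis than expected outcomes alone. The key to the solution is two new metrics, the probability of potential outcome ranking (PoR) and the probability of achieving the best potential outcome (PoB). The authors establish identification theorems, derive bounds, and present estimation methods, then use numerical experiments to illustrate finite-sample properties of the estimators and apply them to a real-world dataset.
Link: https://arxiv.org/abs/2511.10776
Authors: Yuta Kawakami,Jin Tian
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: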
Abstract:Counterfactual decision-making in the face of uncertainty involves selecting the optimal action from several alternatives using causal reasoning. Decision-makers often rank expected potential outcomes (or their corresponding utility and desirability) to compare the preferences of candidate actions. In this paper, we study new counterfactual decision-making rules by introducing two new metrics: the probabilities of potential outcome ranking (PoR) and the probability of achieving the best potential outcome (PoB). PoR reveals the most probable ranking of potential outcomes for an individual, and PoB indicates the action most likely to yield the top-ranked outcome for an individual. We then establish identification theorems and derive bounds for these metrics, and present estimation methods. Finally, we perform numerical experiments to illustrate the finite-sample properties of the estimators and demonstrate their application to a real-world dataset.
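The difference between ranking expected outcomes and the two proposed metrics can be seen on a small simulated table in which every individual's potential outcome under every action is known. This is a sketch under a strong assumption: such a table is never observed in real data (which is why the paper needs identification theorems and bounds), and the function names and numbers are hypothetical.

```python
from collections import Counter


def prob_best(outcomes):
    """PoB: share of individuals for whom each action gives the top outcome."""
    n_actions = len(outcomes[0])
    wins = Counter(max(range(n_actions), key=lambda a: row[a]) for row in outcomes)
    return [wins[a] / len(outcomes) for a in range(n_actions)]


def prob_ranking(outcomes):
    """PoR: empirical probability of each full ranking, best action first."""
    n_actions = len(outcomes[0])
    ranks = Counter(
        tuple(sorted(range(n_actions), key=lambda a: -row[a])) for row in outcomes
    )
    return {r: c / len(outcomes) for r, c in ranks.items()}


def mean_outcome(outcomes):
    """Average potential outcome per action."""
    n_actions = len(outcomes[0])
    return [sum(row[a] for row in outcomes) / len(outcomes) for a in range(n_actions)]


# Rows are individuals, columns are actions 0, 1, 2 (no ties, simulated)
outcomes = [[3, 1, 2], [3, 1, 2], [0, 9, 1], [2, 1, 4], [3, 2, 1]]
pob = prob_best(outcomes)
por = prob_ranking(outcomes)
means = mean_outcome(outcomes)
```

In this table action 1 has the highest mean outcome (2.8) because of one extreme individual, while action 0 is the best choice for 60% of individuals, so the two decision rules disagree.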
[AI-60] Structure-Aware Encodings of Argumentation Properties for Clique-width AAAI2026
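[Quick read]: This paper addresses how to encode abstract argumentation problems into (Q)SAT while preserving a structural graph parameter, namely clique-width, both to solve such problems efficiently and to understand the theoretical limits of encodings. The key to its solution is a set of novel reductions that preserve clique-width linearly, called directed decomposition-guided (DDG) reductions. They cover all argumentation semantics considered, including counting problems, and the authors show that, under reasonable assumptions, the overhead the reductions introduce cannot be significantly improved.
Link: https://arxiv.org/abs/2511.10767
Authors: Yasir Mahmood,Markus Hecher,Johanna Groven,Johannes K. Fichte
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
Comments: Technical report of paper accepted at AAAI 2026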
Abstract:Structural measures of graphs, such as treewidth, are central tools in computational complexity resulting in efficient algorithms when exploiting the parameter. It is even known that modern SAT solvers work efficiently on instances of small treewidth. Since these solvers are widely applied, there is research interest in compact encodings into (Q)SAT, both for solving and for understanding encoding limitations. Even more general is the graph parameter clique-width, which unlike treewidth can be small for dense graphs. Although algorithms are available for clique-width, little is known about encodings. We initiate the quest to understand encoding capabilities with clique-width by considering abstract argumentation, which is a robust framework for reasoning with conflicting arguments. It is based on directed graphs and asks for computationally challenging properties, making it a natural candidate to study computational properties. We design novel reductions from argumentation problems to (Q)SAT. Our reductions linearly preserve the clique-width, resulting in directed decomposition-guided (DDG) reductions. We establish novel results for all argumentation semantics, including counting. Notably, the overhead caused by our DDG reductions cannot be significantly improved under reasonable assumptions.
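For readers unfamiliar with abstract argumentation, the sketch below enumerates the stable extensions of a tiny framework by brute force: a set of arguments is stable if no member attacks another member and every argument outside the set is attacked by a member. It illustrates the kind of property being encoded, including counting, and is not the paper's DDG reduction; the three-argument example is hypothetical.

```python
from itertools import combinations


def stable_extensions(arguments, attacks):
    """Enumerate conflict-free sets that attack every argument outside them."""
    found = []
    for r in range(len(arguments) + 1):
        for subset in combinations(arguments, r):
            s = set(subset)
            conflict_free = not any(a in s and b in s for a, b in attacks)
            attacked = {b for a, b in attacks if a in s}
            if conflict_free and attacked >= set(arguments) - s:
                found.append(s)
    return found


arguments = ["a", "b", "c"]
attacks = [("a", "b"), ("b", "a"), ("b", "c")]  # directed attack relation
extensions = stable_extensions(arguments, attacks)
count = len(extensions)
```

The search visits all subsets, which is exactly the exponential cost that structure-aware encodings aim to avoid on graphs with small clique-width.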
[AI-61] Surrogate-Based Differentiable Pipeline for Shape Optimization
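[Quick read]: This paper tries to address the limits on gradient-based optimization of engineering designs caused by non-differentiable components in the computer-aided engineering (CAE) workflow. Especially in high-dimensional design spaces, typical modules such as meshing and physical simulation are not differentiable as implemented, even though the underlying math or physics is, which blocks efficient optimization. The key to the solution is to replace those components with inherently differentiable surrogate models. Using aerodynamic shape optimization as an example, the authors train a 3D U-Net full-field surrogate that maps the signed distance field (SDF) of a shape directly to the physical fields of interest, giving an end-to-end differentiable pipeline that needs neither differentiable solvers nor adjoint methods.
Link: https://arxiv.org/abs/2511.10761
Authors: Andrin Rehmann,Nolan Black,Josiah Bjorgaard,Alessandro Angioi,Andrei Paleyes,Niklas Heim,Dion Häfner,Alexander Lavin
Affiliations: Unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Differential Geometry (math.DG)
Comments: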
Abstract:Gradient-based optimization of engineering designs is limited by non-differentiable components in the typical computer-aided engineering (CAE) workflow, which calculates performance metrics from design parameters. While gradient-based methods could provide noticeable speed-ups in high-dimensional design spaces, codes for meshing, physical simulations, and other common components are not differentiable even if the math or physics underneath them is. We propose replacing non-differentiable pipeline components with surrogate models which are inherently differentiable. Using a toy example of aerodynamic shape optimization, we demonstrate an end-to-end differentiable pipeline where a 3D U-Net full-field surrogate replaces both meshing and simulation steps by training it on the mapping between the signed distance field (SDF) of the shape and the fields of interest. This approach enables gradient-based shape optimization without the need for differentiable solvers, which can be useful in situations where adjoint methods are unavailable and/or hard to implement.
[AI-62] Picking a Representative Set of Solutions in Multiobjective Optimization: Axioms, Algorithms and Experiments AAAI’26
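[Quick read]: This paper addresses the problem that the set of Pareto optimal solutions in multiobjective optimization can be too large for a decision maker to choose from, that is, how to pick a fixed-size representative subset of all Pareto optimal solutions to reduce cognitive load. The core of the solution is to reframe the Pareto pruning problem as a multiwinner voting problem and, within that framing, carry out an axiomatic analysis of existing quality measures, which uncovers several unintuitive behaviors. Motivated by this, the authors propose a new measure, directed coverage, and analyze the computational complexity of optimizing various measures, identifying boundaries between tractable and intractable cases that depend on the number and structure of the objectives. Experiments show that the proposed measure performs competitively or even favorably across a range of settings.
Link: https://arxiv.org/abs/2511.10716
Authors: Niclas Boehmer,Maximilian T. Wittmann
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computer Science and Game Theory (cs.GT)
Comments: Accepted to AAAI '26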
Abstract:Many real-world decision-making problems involve optimizing multiple objectives simultaneously, rendering the selection of the most preferred solution a non-trivial problem: All Pareto optimal solutions are viable candidates, and it is typically up to a decision maker to select one for implementation based on their subjective preferences. To reduce the cognitive load on the decision maker, previous work has introduced the Pareto pruning problem, where the goal is to compute a fixed-size subset of Pareto optimal solutions that best represent the full set, as evaluated by a given quality measure. Reframing Pareto pruning as a multiwinner voting problem, we conduct an axiomatic analysis of existing quality measures, uncovering several unintuitive behaviors. Motivated by these findings, we introduce a new measure, directed coverage. We also analyze the computational complexity of optimizing various quality measures, identifying previously unknown boundaries between tractable and intractable cases depending on the number and structure of the objectives. Finally, we present an experimental evaluation, demonstrating that the choice of quality measure has a decisive impact on the characteristics of the selected set of solutions and that our proposed measure performs competitively or even favorably across a range of settings.
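Pareto pruning starts from the set of Pareto optimal solutions, which the sketch below computes for a small two-objective maximization example. It covers only this preliminary step under assumptions: the points are hypothetical, and no selection rule is shown because the abstract does not define the quality measures, including directed coverage.

```python
def dominates(p, q):
    """True if p is at least as good as q everywhere and strictly better somewhere."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def pareto_front(points):
    """Keep the points that no other point dominates (maximization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


points = [(1, 5), (2, 4), (3, 3), (2, 2), (4, 1), (1, 1)]
front = pareto_front(points)
print(front)  # [(1, 5), (2, 4), (3, 3), (4, 1)]
```

A pruning method would then choose a fixed-size subset of `front`, and the paper's point is that which subset counts as most representative depends strongly on the quality measure.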
[AI-63] BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models AAAI2026
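[Quick read]: This paper exposes a covert security vulnerability in the reasoning efficiency of Chain-of-Thought (CoT) enabled large language models (LLMs): a backdoor attack can induce redundant reasoning without affecting the consistency of the final output. The key to the approach is the BadThink attack, which uses an LLM-based iterative optimization process to generate highly naturalistic poisoned data and embeds "overthinking" behavior during fine-tuning, activated by carefully crafted trigger prompts. Without changing the correctness of outputs, it greatly lengthens reasoning traces (by more than 17x on the MATH-500 dataset), covertly consuming compute and degrading performance.
Link: https://arxiv.org/abs/2511.10714
Authors: Shuaitong Liu,Renjue Li,Lijia Yu,Lijun Zhang,Zhiming Liu,Gaojie Jin
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted at AAAI 2026 (Main Track). This arXiv version corresponds to the camera-ready manuscript and includes expanded appendices. Please cite the AAAI 2026 version when available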
Abstract:Recent advances in Chain-of-Thought (CoT) prompting have substantially improved the reasoning capabilities of large language models (LLMs), but have also introduced their computational efficiency as a new attack surface. In this paper, we propose BadThink, the first backdoor attack designed to deliberately induce “overthinking” behavior in CoT-enabled LLMs while ensuring stealth. When activated by carefully crafted trigger prompts, BadThink manipulates the model to generate inflated reasoning traces - producing unnecessarily redundant thought processes while preserving the consistency of final outputs. This subtle attack vector creates a covert form of performance degradation that significantly increases computational costs and inference time while remaining difficult to detect through conventional output evaluation methods. We implement this attack through a sophisticated poisoning-based fine-tuning strategy, employing a novel LLM-based iterative optimization process to embed the behavior by generating highly naturalistic poisoned data. Our experiments on multiple state-of-the-art models and reasoning tasks show that BadThink consistently increases reasoning trace lengths - achieving an over 17x increase on the MATH-500 dataset - while remaining stealthy and robust. This work reveals a critical, previously unexplored vulnerability where reasoning efficiency can be covertly manipulated, demonstrating a new class of sophisticated attacks against CoT-enabled systems.
[AI-64] Do Not Merge My Model! Safeguarding Open-Source LLMs Against Unauthorized Model Merging AAAI2026
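[Quick read]: This paper addresses model merging stealing, in which unauthorized third parties obtain a protected model's knowledge by merging it with homologous models, leaking intellectual property. Existing defenses cannot satisfy three key protection properties at once: proactively preventing unauthorized merging, compatibility with general open-source settings, and high security with negligible performance loss. The authors propose MergeBarrier, a plug-and-play defense whose core idea is to disrupt the Linear Mode Connectivity (LMC) between the protected model and its homologous counterparts, eliminating the low-loss path that effective merging requires while leaving accuracy almost unaffected.
Link: https://arxiv.org/abs/2511.10712
Authors: Qinfeng Li,Miao Pan,Jintao Chen,Fu Teng,Zhiqiang Shen,Ge Su,Hao Peng,Xuhong Zhang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026 Conference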
Abstract:Model merging has emerged as an efficient technique for expanding large language models (LLMs) by integrating specialized expert models. However, it also introduces a new threat: model merging stealing, where free-riders exploit models through unauthorized model merging. Unfortunately, existing defense mechanisms fail to provide effective protection. Specifically, we identify three critical protection properties that existing methods fail to simultaneously satisfy: (1) proactively preventing unauthorized merging; (2) ensuring compatibility with general open-source settings; (3) achieving high security with negligible performance loss. To address the above issues, we propose MergeBarrier, a plug-and-play defense that proactively prevents unauthorized merging. The core design of MergeBarrier is to disrupt the Linear Mode Connectivity (LMC) between the protected model and its homologous counterparts, thereby eliminating the low-loss path required for effective model merging. Extensive experiments show that MergeBarrier effectively prevents model merging stealing with negligible accuracy loss.
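Linear Mode Connectivity is commonly quantified by a loss barrier: the largest amount by which the loss along the straight line between two weight vectors exceeds the interpolated endpoint losses. The sketch below computes that quantity for a toy one-parameter loss. It is an illustration under assumptions, not MergeBarrier itself: the barrier definition is one common convention, and `toy_loss` is a hypothetical double-well function standing in for a model's loss.

```python
def loss_barrier(loss, theta_a, theta_b, steps=11):
    """Max excess loss along the straight line between two weight vectors."""
    end_a, end_b = loss(theta_a), loss(theta_b)
    worst = 0.0
    for i in range(steps):
        alpha = i / (steps - 1)
        theta = [(1 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]
        baseline = (1 - alpha) * end_a + alpha * end_b
        worst = max(worst, loss(theta) - baseline)
    return worst


def toy_loss(theta):
    # Double-well loss with minima at theta[0] = -1 and theta[0] = +1
    return (theta[0] ** 2 - 1.0) ** 2


barrier = loss_barrier(toy_loss, [-1.0], [1.0])
print(barrier)  # 1.0
```

Both endpoints have zero loss, yet their average (the simplest form of merging) sits on top of the barrier. A defense that raises this barrier between a protected model and its relatives makes averaged weights perform poorly, which is the intuition behind disrupting LMC.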
[AI-65] Towards Uncertainty Quantification in Generative Model Learning
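[Quick read]: This paper addresses the lack of uncertainty quantification when evaluating how well generative models approximate a distribution. Current evaluation methods focus on how close the learned distribution is to the target distribution and neglect the uncertainty inherent in those measurements. The key contribution is to formalize the uncertainty quantification problem and to propose ensemble-based precision-recall curves, which aggregate over multiple models to capture the uncertainty of the model approximation and allow systematic comparison of model architectures by their uncertainty characteristics.
Link: https://arxiv.org/abs/2511.10710
Authors: Giorgio Morales,Frederic Jurie,Jalal Fadili
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at EurIPS 2025 Workshop: Epistemic Intelligence in Machine Learning (EIML@EurIPS 2025)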
Abstract:While generative models have become increasingly prevalent across various domains, fundamental concerns regarding their reliability persist. A crucial yet understudied aspect of these models is the uncertainty quantification surrounding their distribution approximation capabilities. Current evaluation methodologies focus predominantly on measuring the closeness between the learned and the target distributions, neglecting the inherent uncertainty in these measurements. In this position paper, we formalize the problem of uncertainty quantification in generative model learning. We discuss potential research directions, including the use of ensemble-based precision-recall curves. Our preliminary experiments on synthetic datasets demonstrate the effectiveness of aggregated precision-recall curves in capturing model approximation uncertainty, enabling systematic comparison among different model architectures based on their uncertainty characteristics.
[AI-66] Bias-Restrained Prefix Representation Finetuning for Mathematical Reasoning AAAI2026
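[Quick read]: This paper addresses the poor performance of Representation Finetuning (ReFT) on mathematical reasoning tasks. The study finds that the decline mainly comes from ReFT's difficulty generating effective reasoning prefixes early in inference, together with disturbed numerical encoding and error accumulation during the Chain-of-Thought (CoT) stage. The key to the solution is Bias-REstrained Prefix Representation FineTuning (BREP ReFT), which truncates training data to optimize generation of the initial reasoning prefix, intervenes on the early inference stage to prevent error accumulation, and constrains the magnitude of the intervention vectors to avoid disturbing numerical encoding.
Link: https://arxiv.org/abs/2511.10707
Authors: Sirui Liang,Pengfei Cao,Jian Zhao,Cong Huang,Jun Zhao,Kang Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026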
Abstract:Parameter-Efficient finetuning (PEFT) enhances model performance on downstream tasks by updating a minimal subset of parameters. Representation finetuning (ReFT) methods further improve efficiency by freezing model weights and optimizing internal representations with fewer parameters than PEFT, outperforming PEFT on several tasks. However, ReFT exhibits a significant performance decline on mathematical reasoning tasks. To address this problem, the paper demonstrates that ReFT’s poor performance on mathematical tasks primarily stems from its struggle to generate effective reasoning prefixes during the early inference phase. Moreover, ReFT disturbs the numerical encoding, and the error accumulates during the CoT stage. Based on these observations, this paper proposes Bias-REstrained Prefix Representation FineTuning (BREP ReFT), which enhances ReFT’s mathematical reasoning capability by truncating training data to optimize the generation of initial reasoning prefixes, intervening on the early inference stage to prevent error accumulation, and constraining the intervention vectors’ magnitude to avoid disturbing numerical encoding. Extensive experiments across diverse model architectures demonstrate BREP’s superior effectiveness, efficiency, and robust generalization capability, outperforming both standard ReFT and weight-based PEFT methods on the task of mathematical reasoning. The source code is available at this https URL.
[AI-67] The Second Law of Intelligence: Controlling Ethical Entropy in Autonomous Systems
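[Quick read]: This paper tries to address the ethical instability that arises when unconstrained artificial intelligence drifts from its intended goals, that is, how to keep its behavior aligned with human intent over the long term. The key to the solution is a thermodynamics-like "Second Law" framework. Ethical entropy is defined as the Shannon entropy of the distribution $p(g_i; \theta)$ over a finite set of goals $g_i$, and the paper argues that under gradient-based optimization this entropy increases spontaneously unless continuous alignment work is applied. From the dominant eigenvalue $\lambda_{\text{max}}$ of the Fisher Information Matrix and the number of model parameters $N$ it derives a critical alignment-work threshold $\gamma_{\text{crit}} = (\lambda_{\text{max}} / 2) \ln N$; when the applied alignment work $\gamma$ exceeds this threshold, ethical entropy stays stable. In simulation, a 7-billion-parameter model with $\gamma = 20.4$ (1.5 times the critical value) keeps entropy near zero ($0.00 \pm 0.00$ nats), compared with the unregularized case, where entropy rises to $1.69 \pm 1.08$ nats. The theory recasts AI alignment as a problem of continuous thermodynamic control, offering a quantitative basis for the safety of advanced autonomous systems.
Link: https://arxiv.org/abs/2511.10704
Authors: Samih Fadli
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 12 pages, 4 figures, 1 table, includes Supplementary Materials, simulation code on GitHub ( this https URL )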
Abstract:We propose that unconstrained artificial intelligence obeys a Second Law analogous to thermodynamics, where ethical entropy, defined as a measure of divergence from intended goals, increases spontaneously without continuous alignment work. For gradient-based optimizers, we define this entropy over a finite set of goals g_i as S = -\sum_i p(g_i; \theta) \ln p(g_i; \theta), and we prove that its time derivative dS/dt >= 0, driven by exploration noise and specification gaming. We derive the critical stability boundary for alignment work as gamma_crit = (lambda_max / 2) ln N, where lambda_max is the dominant eigenvalue of the Fisher Information Matrix and N is the number of model parameters. Simulations validate this theory. A 7-billion-parameter model (N = 7 x 10^9) with lambda_max = 1.2 drifts from an initial entropy of 0.32 to 1.69 +/- 1.08 nats, while a system regularized with alignment work gamma = 20.4 (1.5 gamma_crit) maintains stability at 0.00 +/- 0.00 nats (p = 4.19 x 10^-17, n = 20 trials). This framework recasts AI alignment as a problem of continuous thermodynamic control, providing a quantitative foundation for maintaining the stability and safety of advanced autonomous systems.
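The numbers quoted in this abstract can be checked with a few lines of arithmetic. The sketch below only evaluates the stated formula gamma_crit = (lambda_max / 2) ln N and the Shannon entropy definition; the function names are illustrative and are not taken from the authors' simulation code.

```python
import math

def critical_alignment_work(lambda_max: float, n_params: float) -> float:
    """gamma_crit = (lambda_max / 2) * ln N, as stated in the abstract."""
    return 0.5 * lambda_max * math.log(n_params)

def shannon_entropy(probs) -> float:
    """S = -sum_i p_i ln p_i in nats; zero-probability goals contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Values quoted in the abstract: lambda_max = 1.2, N = 7e9.
gamma_crit = critical_alignment_work(lambda_max=1.2, n_params=7e9)
gamma_used = 1.5 * gamma_crit
print(f"gamma_crit = {gamma_crit:.2f}")        # gamma_crit = 13.60
print(f"1.5 * gamma_crit = {gamma_used:.1f}")  # 1.5 * gamma_crit = 20.4
```

The result reproduces the gamma = 20.4 figure reported for the regularized system, which confirms that the abstract uses the natural logarithm of the raw parameter count.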
[AI-68] Human-AI collaborative autonomous synthesis with pulsed laser deposition for remote epitaxy
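[Quick read]: This paper targets the limited efficiency of human-AI collaboration in autonomous laboratories, in particular how to close the loop from hypothesis generation through experimental execution to interpretation. The core challenge is that most existing systems follow a human-in-the-loop pattern and lack a workflow in which AI agents and human experts are tightly coupled and evolve together. The key to the solution is a human-AI collaborative (HAIC) workflow that uses large language models (LLMs) for hypothesis generation and data analysis and drives autonomous pulsed laser deposition (PLD) experiments through collaborative policy updates, efficiently exploring the growth space for remote epitaxy of BaTiO₃/graphene. After each autonomous batch, HAIC combines human insight with AI reasoning, which accelerates scientific discovery, pinpoints the causes of graphene damage (such as the effects of oxygen pressure and temperature), and refines the process route, leading to a two-step Ar/O₂ deposition strategy that allows ferroelectric BaTiO₃ to be exfoliated while preserving the monolayer graphene interlayer.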
Link: https://arxiv.org/abs/2511.11558
Authors: Asraful Haque,Daniel T. Yimam,Jawad Chowdhury,Ralph Bulanadi,Ivan Vlassiouk,John Lasseter,Sujoy Ghosh,Christopher M. Rouleau,Kai Xiao,Yongtao Liu,Eva Zarkadoula,Rama K. Vasudevan,Sumner B. Harris
Affiliation: Unknown
Subjects: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
Comments:
Abstract:Autonomous laboratories typically rely on data-driven decision-making, occasionally with human-in-the-loop oversight to inject domain expertise. Fully leveraging AI agents, however, requires tightly coupled, collaborative workflows spanning hypothesis generation, experimental planning, execution, and interpretation. To address this, we develop and deploy a human-AI collaborative (HAIC) workflow that integrates large language models for hypothesis generation and analysis, with collaborative policy updates driving autonomous pulsed laser deposition (PLD) experiments for remote epitaxy of BaTiO₃/graphene. HAIC accelerated the hypothesis formation and experimental design and efficiently mapped the growth space to graphene-damage. In situ Raman spectroscopy reveals that chemistry drives degradation while the highest energy plume components seed defects, identifying a low-O₂ pressure low-temperature synthesis window that preserves graphene but is incompatible with optimal BaTiO₃ growth. Thus, we show a two-step Ar/O₂ deposition is required to exfoliate ferroelectric BaTiO₃ while maintaining a monolayer graphene interlayer. HAIC stages human insight with AI reasoning between autonomous batches to drive rapid scientific progress, providing an evolution to many existing human-in-the-loop autonomous workflows.
[AI-69] Inferring response times of perceptual decisions with Poisson variational autoencoders NEURIPS2025
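[Quick read]: This paper addresses the fact that current deep neural network models of perceptual decision making ignore the temporal dynamics of the decision process: conventional architectures treat a decision as an instantaneous readout and fail to capture how it forms over time. The key to the solution is an image-computable model of perceptual decisions that combines sensory encoding with Bayesian decoding. First, a Poisson variational autoencoder learns unsupervised representations of visual stimuli in a population of rate-coded neurons, modeled as independent homogeneous Poisson processes. Next, a task-optimized decoder continually infers a posterior over actions from the incoming spiking activity. Finally, an entropy-based stopping rule determines when the decision terminates. The framework generates trial-by-trial patterns of choices and response times and, on MNIST digit classification, reproduces core empirical signatures of perceptual decision making: stochastic variability, right-skewed response time distributions, logarithmic scaling of response time with the number of alternatives (Hick's law), and speed-accuracy trade-offs.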
Link: https://arxiv.org/abs/2511.11480
Authors: Hayden R. Johnson,Anastasia N. Krouglova,Hadi Vafaii,Jacob L. Yates,Pedro J. Gonçalves
Affiliation: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: To appear at the NeurIPS 2025 Workshop on Data on the Mind and Brain
Abstract:Many properties of perceptual decision making are well-modeled by deep neural networks. However, such architectures typically treat decisions as instantaneous readouts, overlooking the temporal dynamics of the decision process. We present an image-computable model of perceptual decision making in which choices and response times arise from efficient sensory encoding and Bayesian decoding of neural spiking activity. We use a Poisson variational autoencoder to learn unsupervised representations of visual stimuli in a population of rate-coded neurons, modeled as independent homogeneous Poisson processes. A task-optimized decoder then continually infers an approximate posterior over actions conditioned on incoming spiking activity. Combining these components with an entropy-based stopping rule yields a principled and image-computable model of perceptual decisions capable of generating trial-by-trial patterns of choices and response times. Applied to MNIST digit classification, the model reproduces key empirical signatures of perceptual decision making, including stochastic variability, right-skewed response time distributions, logarithmic scaling of response times with the number of alternatives (Hick’s law), and speed-accuracy trade-offs.
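The decoding-and-stopping loop described in this abstract can be illustrated with a toy simulation. The sketch below is not the authors' model: it replaces the learned Poisson VAE encoder with a hand-written table of firing rates (`rates`, hypothetical), and the task-optimized decoder with the exact Bayesian posterior for independent homogeneous Poisson neurons. It only shows how an entropy threshold turns a stream of spikes into a choice and a response time.

```python
import math
import random

def poisson_sample(rng, mean):
    """Draw one Poisson count with Knuth's method (adequate for small means)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def simulate_trial(rates, true_class, dt=0.01, threshold=0.1, max_steps=2000, seed=0):
    """Accumulate spikes until the posterior entropy drops below `threshold`.

    rates[k][j] is the assumed firing rate (Hz) of neuron j under class k.
    Returns (choice, response_time_in_seconds).
    """
    rng = random.Random(seed)
    n_classes = len(rates)
    log_post = [0.0] * n_classes  # uniform prior, unnormalised log posterior
    for step in range(1, max_steps + 1):
        # Spikes are generated by the neurons' rates under the true class.
        counts = [poisson_sample(rng, r * dt) for r in rates[true_class]]
        # Poisson log-likelihood of the counts under each candidate class.
        for k in range(n_classes):
            log_post[k] += sum(n * math.log(r * dt) - r * dt
                               for n, r in zip(counts, rates[k]))
        top = max(log_post)
        weights = [math.exp(v - top) for v in log_post]
        total = sum(weights)
        post = [w / total for w in weights]
        if entropy(post) < threshold:
            break  # confident enough: stop and report
    choice = max(range(n_classes), key=lambda k: post[k])
    return choice, step * dt

# Two classes, three neurons; each class drives a different neuron strongly.
rates = [[20.0, 5.0, 5.0], [5.0, 20.0, 5.0]]
choice, rt = simulate_trial(rates, true_class=0)
print(choice, rt)
```

Lowering `threshold` demands more evidence before stopping, which lengthens response times and is one simple way to see a speed-accuracy trade-off in this toy setting.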
[AI-70] Variational Quantum Algorithms for Particle Track Reconstruction
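[Quick read]: This paper addresses the challenges of applying quantum computing to particle track reconstruction in high-energy physics, in particular the key difficulty of designing an efficient and expressive quantum ansatz for variational quantum algorithms (VQAs) under a fixed detector geometry. The core of the solution is a quantum architecture search method based on Monte Carlo Tree Search (MCTS) that automatically designs quantum circuits for different problem sizes. On top of this, two mathematical formulations, a ground-state energy problem and a system of linear equations, are evaluated experimentally in terms of performance and computational cost.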
Link: https://arxiv.org/abs/2511.11397
Authors: Vincenzo Lipardi,Xenofon Chiotopoulos,Jacco A. de Vries,Domenica Dibenedetto,Kurt Driessens,Marcel Merk,Mark H.M. Winands
Affiliation: Unknown
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: 17 pages, 5 figures, 2 tables, pre-proceedings BNAIC 2024
Abstract:Quantum Computing is a rapidly developing field with the potential to tackle the increasing computational challenges faced in high-energy physics. In this work, we explore the potential and limitations of variational quantum algorithms in solving the particle track reconstruction problem. We present an analysis of two distinct formulations for identifying straight-line tracks in a multilayer detection system, inspired by the LHCb vertex detector. The first approach is formulated as a ground-state energy problem, while the second approach is formulated as a system of linear equations. This work addresses one of the main challenges when dealing with variational quantum algorithms on general problems, namely designing an expressive and efficient quantum ansatz working on tracking events with fixed detector geometry. For this purpose, we employed a quantum architecture search method based on Monte Carlo Tree Search to design the quantum circuits for different problem sizes. We provide experimental results to test our approach on both formulations for different problem sizes in terms of performance and computational cost.
[AI-71] Understanding the Nature of Depth-1 Equivariant Quantum Circuit
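[Quick read]: This paper addresses the challenges of scaling the Equivariant Quantum Circuit (EQC) to large instances of the Travelling Salesman Problem (TSP): the exponential time and memory cost of quantum circuit simulation, and the growing noise and decoherence on actual quantum hardware. The core solution is Size-Invariant Grid Search (SIGS), an efficient training optimization for Quantum Reinforcement Learning (QRL). SIGS rests on what the authors call Size-Invariant Properties, which go beyond the notion of equivariance discussed in prior literature, and it allows the outputs of a trained depth-1 EQC to be simulated efficiently on TSP instances with up to 350 nodes. On 100-node instances it cuts total simulation time by 96.4% while keeping the mean optimality gap within 0.005 of the RL-trained model, giving the QRL community a practical benchmarking tool for larger problem sizes.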
Link: https://arxiv.org/abs/2511.10756
Authors: Jonathan Teo(1),Lee Xin Wei(1),Hoong Chuin Lau(1) ((1) Singapore Management University)
Affiliation: Unknown
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments:
Abstract:The Equivariant Quantum Circuit (EQC) for the Travelling Salesman Problem (TSP) has been shown to achieve near-optimal performance in solving small TSP problems (up to 20 nodes) using only two parameters at depth 1. However, extending EQCs to larger TSP problem sizes remains challenging due to the exponential time and memory for quantum circuit simulation, as well as increasing noise and decoherence when running on actual quantum hardware. In this work, we propose the Size-Invariant Grid Search (SIGS), an efficient training optimization for Quantum Reinforcement Learning (QRL), and use it to simulate the outputs of a trained Depth-1 EQC up to 350-node TSP instances - well beyond previously tractable limits. At TSP with 100 nodes, we reduce total simulation times by 96.4%, when comparing to RL simulations with the analytical expression (151 minutes using RL to under 6 minutes using SIGS on TSP-100), while achieving a mean optimality gap within 0.005 of the RL trained model on the test set. SIGS provides a practical benchmarking tool for the QRL community, allowing us to efficiently analyze the performance of QRL algorithms on larger problem sizes. We provide a theoretical explanation for SIGS called the Size-Invariant Properties that goes beyond the concept of equivariance discussed in prior literature.
Machine Learning
[LG-0] Multistability of Self-Attention Dynamics in Transformers
Link: https://arxiv.org/abs/2511.11553
Authors: Claudio Altafini
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS)
Comments: 8 pages, 3 figures
Abstract:In machine learning, a self-attention dynamics is a continuous-time multiagent-like model of the attention mechanisms of transformers. In this paper we show that such dynamics is related to a multiagent version of the Oja flow, a dynamical system that computes the principal eigenvector of a matrix corresponding for transformers to the value matrix. We classify the equilibria of the “single-head” self-attention system into four classes: consensus, bipartite consensus, clustering and polygonal equilibria. Multiple asymptotically stable equilibria from the first three classes often coexist in the self-attention dynamics. Interestingly, equilibria from the first two classes are always aligned with the eigenvectors of the value matrix, often but not exclusively with the principal eigenvector.
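For readers unfamiliar with the Oja flow mentioned in this abstract, the sketch below integrates the classical single-vector version, dx/dt = A x - (x^T A x) x, with forward Euler steps. This is the textbook flow, not the multiagent variant studied in the paper; the matrix `A`, the step size and the step count are illustrative choices.

```python
import math

def oja_step(A, x, dt):
    """One forward-Euler step of dx/dt = A x - (x^T A x) x."""
    Ax = [sum(a * xj for a, xj in zip(row, x)) for row in A]
    rayleigh = sum(xi * axi for xi, axi in zip(x, Ax))  # x^T A x
    return [xi + dt * (axi - rayleigh * xi) for xi, axi in zip(x, Ax)]

def oja_flow(A, x0, dt=0.01, steps=5000):
    """Integrate the flow from x0 and return the final state."""
    x = list(x0)
    for _ in range(steps):
        x = oja_step(A, x, dt)
    return x

# Symmetric matrix with eigenvalues 3 and 1; the principal eigenvector
# is (1, 1) / sqrt(2).
A = [[2.0, 1.0], [1.0, 2.0]]
x = oja_flow(A, [1.0, 0.0])
print(x)  # both components approach 1 / sqrt(2), about 0.7071
```

For a symmetric matrix with a positive leading eigenvalue, the state converges to a unit-norm principal eigenvector, which is the sense in which the flow "computes" it.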
[LG-1] Generalizing Fair Clustering to Multiple Groups: Algorithms and Applications AAAI2026
Link: https://arxiv.org/abs/2511.11539
Authors: Diptarka Chakraborty,Kushagra Chatterjee,Debarati Das,Tien-Long Nguyen
Subjects: Machine Learning (cs.LG)
Comments: Accepted in AAAI 2026 for Oral Representation
Abstract:Clustering is a fundamental task in machine learning and data analysis, but it frequently fails to provide fair representation for various marginalized communities defined by multiple protected attributes – a shortcoming often caused by biases in the training data. As a result, there is a growing need to enhance the fairness of clustering outcomes, ideally by making minimal modifications, possibly as a post-processing step after conventional clustering. Recently, Chakraborty et al. [COLT’25] initiated the study of closest fair clustering, though in a restricted scenario where data points belong to only two groups. In practice, however, data points are typically characterized by many groups, reflecting diverse protected attributes such as age, ethnicity, gender, etc. In this work, we generalize the study of the closest fair clustering problem to settings with an arbitrary number (more than two) of groups. We begin by showing that the problem is NP-hard even when all groups are of equal size – a stark contrast with the two-group case, for which an exact algorithm exists. Next, we propose near-linear time approximation algorithms that efficiently handle arbitrary-sized multiple groups, thereby answering an open question posed by Chakraborty et al. [COLT’25]. Leveraging our closest fair clustering algorithms, we further achieve improved approximation guarantees for the fair correlation clustering problem, advancing the state-of-the-art results established by Ahmadian et al. [AISTATS’20] and Ahmadi et al. [2020]. Additionally, we are the first to provide approximation algorithms for the fair consensus clustering problem involving multiple (more than two) groups, thus addressing another open direction highlighted by Chakraborty et al. [COLT’25].
[LG-2] FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models
Link: https://arxiv.org/abs/2511.11505
Authors: Yonatan Dukler,Guihong Li,Deval Shah,Vikram Appia,Emad Barsoum
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Blocking communication presents a major hurdle in running MoEs efficiently in distributed settings. To address this, we present FarSkip-Collective which modifies the architecture of modern models to enable overlapping of their computation with communication. Our approach modifies the architecture to skip connections in the model and it is unclear a priori whether the modified model architecture can remain as capable, especially for large state-of-the-art models and while modifying all of the model layers. We answer this question in the affirmative and fully convert a series of state-of-the-art models varying from 16B to 109B parameters to enable overlapping of their communication while achieving accuracy on par with their original open-source releases. For example, we convert Llama 4 Scout (109B) via self-distillation and achieve average accuracy within 1% of its instruction tuned release averaged across a wide range of downstream evaluations. In addition to demonstrating retained accuracy of the large modified models, we realize the benefits of FarSkip-Collective through optimized implementations that explicitly overlap communication with computation, accelerating both training and inference in existing frameworks.
[LG-3] Honesty over Accuracy: Trustworthy Language Models through Reinforced Hesitation
Link: https://arxiv.org/abs/2511.11500
Authors: Mohamad Amin Mohamadi,Tianhao Wang,Zhiyuan Li
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Modern language models fail a fundamental requirement of trustworthy intelligence: knowing when not to answer. Despite achieving impressive accuracy on benchmarks, these models produce confident hallucinations, even when wrong answers carry catastrophic consequences. Our evaluations on GSM8K, MedQA and GPQA show frontier models almost never abstain despite explicit warnings of severe penalties, suggesting that prompts cannot override training that rewards any answer over no answer. As a remedy, we propose Reinforced Hesitation (RH): a modification to Reinforcement Learning from Verifiable Rewards (RLVR) to use ternary rewards (+1 correct, 0 abstention, -\lambda error) instead of binary. Controlled experiments on logic puzzles reveal that varying \lambda produces distinct models along a Pareto frontier, where each training penalty yields the optimal model for its corresponding risk regime: low penalties produce aggressive answerers, high penalties conservative abstainers. We then introduce two inference strategies that exploit trained abstention as a coordination signal: cascading routes queries through models with decreasing risk tolerance, while self-cascading re-queries the same model on abstention. Both outperform majority voting with lower computational cost. These results establish abstention as a first-class training objective that transforms “I don’t know” from failure into a coordination signal, enabling models to earn trust through calibrated honesty about their limits.
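The ternary reward in this abstract implies a simple decision rule for an idealized, perfectly calibrated answerer: with probability p of being correct, answering has expected reward p - \lambda(1 - p), which beats the abstention reward of 0 exactly when p > \lambda / (1 + \lambda). The sketch below encodes that arithmetic; the function names are illustrative and the calibration assumption is ours, not a claim about how the trained models behave.

```python
def ternary_reward(outcome: str, penalty: float) -> float:
    """Reward scheme from the abstract: +1 correct, 0 abstention, -penalty error."""
    rewards = {"correct": 1.0, "abstain": 0.0, "error": -penalty}
    return rewards[outcome]

def expected_answer_reward(p_correct: float, penalty: float) -> float:
    """Expected reward of answering with success probability p_correct."""
    return p_correct * 1.0 - (1.0 - p_correct) * penalty

def abstention_threshold(penalty: float) -> float:
    """Confidence above which answering beats abstaining: penalty / (1 + penalty)."""
    return penalty / (1.0 + penalty)

def should_answer(p_correct: float, penalty: float) -> bool:
    return p_correct > abstention_threshold(penalty)

for lam in (1.0, 9.0):
    print(lam, abstention_threshold(lam), should_answer(0.8, lam))
# 1.0 0.5 True
# 9.0 0.9 False
```

The same 80% confidence justifies an answer under a mild penalty but not under a severe one, which matches the abstract's description of low penalties producing aggressive answerers and high penalties producing conservative abstainers.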
[LG-4] Learning and Testing Convex Functions
Link: https://arxiv.org/abs/2511.11498
Authors: Renato Ferreira Pinto Jr.,Cassandra Marcussen,Elchanan Mossel,Shivam Nadimpalli
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Comments: 43 pages
Abstract:We consider the problems of learning and testing real-valued convex functions over Gaussian space. Despite the extensive study of function convexity across mathematics, statistics, and computer science, its learnability and testability have largely been examined only in discrete or restricted settings – typically with respect to the Hamming distance, which is ill-suited for real-valued functions. In contrast, we study these problems in high dimensions under the standard Gaussian measure, assuming sample access to the function and a mild smoothness condition, namely Lipschitzness. A smoothness assumption is natural and, in fact, necessary even in one dimension: without it, convexity cannot be inferred from finitely many samples. As our main results, we give: (i) Learning Convex Functions: An agnostic proper learning algorithm for Lipschitz convex functions that achieves error \varepsilon using n^{O(1/\varepsilon^2)} samples, together with a complementary lower bound of n^{\mathrm{poly}(1/\varepsilon)} samples in the correlational statistical query (CSQ) model. (ii) Testing Convex Functions: A tolerant (two-sided) tester for convexity of Lipschitz functions with the same sample complexity (as a corollary of our learning result), and a one-sided tester (which never rejects convex functions) using O(\sqrt{n}/\varepsilon)^n samples.
[LG-5] Data-efficient U-Net for Segmentation of Carbide Microstructures in SEM Images of Steel Alloys
Link: https://arxiv.org/abs/2511.11485
Authors: Alinda Ezgi Gerçek,Till Korten,Paul Chekhonin,Maleeha Hassan,Peter Steinbach
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Comments:
Abstract:Understanding reactor-pressure-vessel steel microstructure is crucial for predicting mechanical properties, as carbide precipitates both strengthen the alloy and can initiate cracks. In scanning electron microscopy images, gray-value overlap between carbides and matrix makes simple thresholding ineffective. We present a data-efficient segmentation pipeline using a lightweight U-Net (30.7 M parameters) trained on just 10 annotated scanning electron microscopy images. Despite limited data, our model achieves a Dice-Sørensen coefficient of 0.98, significantly outperforming the state-of-the-art in the field of metallurgy (classical image analysis: 0.85), while reducing annotation effort by one order of magnitude compared to the state-of-the-art data efficient segmentation model. This approach enables rapid, automated carbide quantification for alloy design and generalizes to other steel types, demonstrating the potential of data-efficient deep learning in reactor-pressure-vessel steel analysis.
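The Dice-Sørensen coefficient reported in this abstract is the standard overlap score 2|P ∩ T| / (|P| + |T|) between a predicted mask P and a ground-truth mask T. A minimal sketch on small binary masks follows; the example masks are made up and the convention of returning 1.0 for two empty masks is an assumption, since the abstract does not specify how that case is handled.

```python
def dice_coefficient(pred, target):
    """Dice-Sørensen coefficient between two binary masks given as nested lists."""
    pred_flat = [bool(v) for row in pred for v in row]
    target_flat = [bool(v) for row in target for v in row]
    if len(pred_flat) != len(target_flat):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p and t for p, t in zip(pred_flat, target_flat))
    total = sum(pred_flat) + sum(target_flat)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = dice_coefficient(pred, target)
print(round(score, 4))  # 0.6667
```

Two of the three foreground pixels in each mask overlap, so the score is 2 * 2 / (3 + 3) = 0.6667; a score of 0.98 therefore indicates near-complete pixel-level agreement.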
[LG-6] Quantifying and Improving Adaptivity in Conformal Prediction through Input Transformations
Link: https://arxiv.org/abs/2511.11472
Authors: Sooyong Jang,Insup Lee
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Conformal prediction constructs a set of labels instead of a single point prediction, while providing a probabilistic coverage guarantee. Beyond the coverage guarantee, adaptiveness to example difficulty is an important property. It means that the method should produce larger prediction sets for more difficult examples, and smaller ones for easier examples. Existing evaluation methods for adaptiveness typically analyze coverage rate violation or average set size across bins of examples grouped by difficulty. However, these approaches often suffer from imbalanced binning, which can lead to inaccurate estimates of coverage or set size. To address this issue, we propose a binning method that leverages input transformations to sort examples by difficulty, followed by uniform-mass binning. Building on this binning, we introduce two metrics to better evaluate adaptiveness. These metrics provide more reliable estimates of coverage rate violation and average set size due to balanced binning, leading to more accurate adaptivity assessment. Through experiments, we demonstrate that our proposed metric correlates more strongly with the desired adaptiveness property compared to existing ones. Furthermore, motivated by our findings, we propose a new adaptive prediction set algorithm that groups examples by estimated difficulty and applies group-conditional conformal prediction. This allows us to determine appropriate thresholds for each group. Experimental results on both (a) an image classification task (ImageNet) and (b) a medical task (visual acuity prediction) show that our method outperforms existing approaches according to the new metrics.
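Uniform-mass binning, as referred to in this abstract, means sorting examples by a difficulty score and cutting the sorted list into bins of (nearly) equal size, so that per-bin coverage and set-size estimates rest on the same number of examples. The sketch below takes the difficulty scores as given; in the paper they come from input transformations, which are not reproduced here, and all data values are made up for illustration.

```python
def uniform_mass_bins(difficulty, n_bins):
    """Sort example indices by difficulty and split them into equal-mass bins."""
    order = sorted(range(len(difficulty)), key=lambda i: difficulty[i])
    n = len(order)
    return [order[b * n // n_bins:(b + 1) * n // n_bins] for b in range(n_bins)]

def per_bin_stats(bins, covered, set_sizes):
    """Return (coverage rate, average prediction-set size) for each bin."""
    stats = []
    for idx in bins:
        coverage = sum(covered[i] for i in idx) / len(idx)
        avg_size = sum(set_sizes[i] for i in idx) / len(idx)
        stats.append((coverage, avg_size))
    return stats

# Hypothetical data for six examples.
difficulty = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2]  # higher means harder
covered = [1, 1, 1, 1, 0, 1]                 # 1 if the true label is in the set
set_sizes = [5, 1, 3, 2, 4, 1]               # size of each prediction set

bins = uniform_mass_bins(difficulty, n_bins=3)
stats = per_bin_stats(bins, covered, set_sizes)
print(bins)   # [[1, 5], [3, 2], [4, 0]]
print(stats)  # [(1.0, 1.0), (1.0, 2.5), (0.5, 4.5)]
```

In this toy data the average set size grows from the easiest to the hardest bin, which is the adaptive behavior the abstract describes, while the coverage drop in the hardest bin is the kind of violation the proposed metrics are meant to expose.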
[LG-7] Adaptive Intrusion Detection for Evolving RPL IoT Attacks Using Incremental Learning
Link: https://arxiv.org/abs/2511.11464
Authors: Sumeyye Bas,Kiymet Kaya,Elif Ak,Sule Gunduz Oguducu
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:
Abstract:The routing protocol for low-power and lossy networks (RPL) has become the de facto routing standard for resource-constrained IoT systems, but its lightweight design exposes critical vulnerabilities to a wide range of routing-layer attacks such as hello flood, decreased rank, and version number manipulation. Traditional countermeasures, including protocol-level modifications and machine learning classifiers, can achieve high accuracy against known threats, yet they fail when confronted with novel or zero-day attacks unless fully retrained, an approach that is impractical for dynamic IoT environments. In this paper, we investigate incremental learning as a practical and adaptive strategy for intrusion detection in RPL-based networks. We systematically evaluate five model families, including ensemble models and deep learning models. Our analysis highlights that incremental learning not only restores detection performance on new attack classes but also mitigates catastrophic forgetting of previously learned threats, all while reducing training time compared to full retraining. By combining five diverse models with attack-specific analysis, forgetting behavior, and time efficiency, this study provides systematic evidence that incremental learning offers a scalable pathway to maintain resilient intrusion detection in evolving RPL-based IoT networks.
[LG-8] MoCap2Radar: A Spatiotemporal Transformer for Synthesizing Micro-Doppler Radar Signatures from Motion Capture
Link: https://arxiv.org/abs/2511.11462
Authors: Kevin Chen,Kenneth W. Parker,Anish Arora
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:We present a pure machine learning process for synthesizing radar spectrograms from Motion-Capture (MoCap) data. We formulate MoCap-to-spectrogram translation as a windowed sequence-to-sequence task using a transformer-based model that jointly captures spatial relations among MoCap markers and temporal dynamics across frames. Real-world experiments show that the proposed approach produces visually and quantitatively plausible Doppler radar spectrograms and achieves good generalizability. Ablation experiments show that the learned model includes both the ability to convert multi-part motion into Doppler signatures and an understanding of the spatial relations between different parts of the human body. The result is an interesting example of using transformers for time-series signal processing. It is especially applicable to edge computing and Internet of Things (IoT) radars. It also suggests the ability to augment scarce radar datasets using more abundant MoCap data for training higher-level applications. Finally, it requires far less computation than physics-based methods for generating radar data.
[LG-9] FairReweighing: Density Estimation-Based Reweighing Framework for Improving Separation in Fair Regression
Link: https://arxiv.org/abs/2511.11459
Authors: Xiaoyin Xi,Zhe Yu
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:There has been a prevalence of applying AI software in both high-stakes public-sector and industrial contexts. However, the lack of transparency has raised concerns about whether these data-informed AI software decisions secure fairness against people of all racial, gender, or age groups. Despite extensive research on emerging fairness-aware AI software, up to now most efforts to solve this issue have been dedicated to binary classification tasks. Fairness in regression is relatively underexplored. In this work, we adopted a mutual information-based metric to assess separation violations. The metric is also extended so that it can be directly applied to both classification and regression problems with both binary and continuous sensitive attributes. Inspired by the Reweighing algorithm in fair classification, we proposed a FairReweighing pre-processing algorithm based on density estimation to ensure that the learned model satisfies the separation criterion. Theoretically, we show that the proposed FairReweighing algorithm can guarantee separation in the training data under a data independence assumption. Empirically, on both synthetic and real-world data, we show that FairReweighing outperforms existing state-of-the-art regression fairness solutions in terms of improving separation while maintaining high accuracy.
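As background for the Reweighing algorithm that this abstract cites as its inspiration, the sketch below implements the classical discrete version from fair classification, where each example with sensitive attribute a and label y receives the weight P(a) P(y) / P(a, y). This is not the paper's FairReweighing method, which replaces the counts with density estimates and targets the separation criterion; the data are made up.

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Classical Reweighing: w(a, y) = P(a) * P(y) / P(a, y), from empirical counts."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [group_counts[a] * label_counts[y] / (n * joint_counts[(a, y)])
            for a, y in zip(groups, labels)]

def weighted_positive_rate(groups, labels, weights, group):
    """Weighted fraction of positive labels within one group."""
    total = sum(w for g, w in zip(groups, weights) if g == group)
    positive = sum(w for g, y, w in zip(groups, labels, weights) if g == group and y == 1)
    return positive / total

# Hypothetical data: group "a" has more positive labels than group "b".
groups = ["a", "a", "a", "b", "b", "b"]
labels = [1, 1, 0, 1, 0, 0]
weights = reweighing_weights(groups, labels)
print(weights)  # [0.75, 0.75, 1.5, 1.5, 0.75, 0.75]
print(weighted_positive_rate(groups, labels, weights, "a"))  # 0.5
print(weighted_positive_rate(groups, labels, weights, "b"))  # 0.5
```

After reweighting, both groups have the same weighted positive rate, so label and group are independent in the weighted training data; extending this idea to continuous targets is where density estimation becomes necessary.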
[LG-10] DiffPro: Joint Timestep and Layer-Wise Precision Optimization for Efficient Diffusion Inference
Link: https://arxiv.org/abs/2511.11446
Authors: Farhana Amin,Sabiha Afroz,Kanchon Gharami,Mona Moghadampanah,Dimitrios S. Nikolopoulos
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Diffusion models produce high quality images but inference is costly due to many denoising steps and heavy matrix operations. We present DiffPro, a post-training, hardware-faithful framework that works with the exact integer kernels used in deployment and jointly tunes timesteps and per-layer precision in Diffusion Transformers (DiTs) to reduce latency and memory without any training. DiffPro combines three parts: a manifold-aware sensitivity metric to allocate weight bits, dynamic activation quantization to stabilize activations across timesteps, and a budgeted timestep selector guided by teacher-student drift. In experiments DiffPro achieves up to 6.25x model compression, fifty percent fewer timesteps, and 2.8x faster inference with Delta FID <= 10 on standard benchmarks, demonstrating practical efficiency gains. DiffPro unifies step reduction and precision planning into a single budgeted deployable plan for real-time energy-aware diffusion inference.
[LG-11] Differentiation Strategies for Acoustic Inverse Problems: Admittance Estimation and Shape Optimization
Link: https://arxiv.org/abs/2511.11415
Authors: Nikolas Borrel-Jensen,Josiah Bjorgaard
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computational Physics (physics.comp-ph)
Comments: 4 pages, 2 figures
Abstract:We demonstrate a practical differentiable programming approach for acoustic inverse problems through two applications: admittance estimation and shape optimization for resonance damping. First, we show that JAX-FEM’s automatic differentiation (AD) enables direct gradient-based estimation of complex boundary admittance from sparse pressure measurements, achieving 3-digit precision without requiring manual derivation of adjoint equations. Second, we apply randomized finite differences to acoustic shape optimization, combining JAX-FEM for forward simulation with PyTorch3D for mesh manipulation through AD. By separating physics-driven boundary optimization from geometry-driven interior mesh adaptation, we achieve 48.1% energy reduction at target frequencies with 30-fold fewer FEM solutions compared to standard finite difference on the full mesh. This work showcases how modern differentiable software stacks enable rapid prototyping of optimization workflows for physics-based inverse problems, with automatic differentiation for parameter estimation and a combination of finite differences and AD for geometric design.
[LG-12] Multicalibration yields better matchings
Link: https://arxiv.org/abs/2511.11413
Authors: Riccardo Colini Baldeschi,Simone Di Gregorio,Simone Fioravanti,Federico Fusco,Ido Guy,Daniel Haimovich,Stefano Leonardi,Fridolin Linder,Lorenzo Perini,Matteo Russo,Niek Tax
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:
Abstract:Consider the problem of finding the best matching in a weighted graph where we only have access to predictions of the actual stochastic weights, based on an underlying context. If the predictor is the Bayes optimal one, then computing the best matching based on the predicted weights is optimal. However, in practice, this perfect information scenario is not realistic. Given an imperfect predictor, a suboptimal decision rule may compensate for the induced error and thus outperform the standard optimal rule. In this paper, we propose multicalibration as a way to address this problem. This fairness notion requires a predictor to be unbiased on each element of a family of protected sets of contexts. Given a class of matching algorithms \mathcal{C} and any predictor \gamma of the edge-weights, we show how to construct a specific multicalibrated predictor \hat{\gamma}, with the following property. Picking the best matching based on the output of \hat{\gamma} is competitive with the best decision rule in \mathcal{C} applied onto the original predictor \gamma. We complement this result by providing sample complexity bounds.
[LG-13] Multi-Phase Spacecraft Trajectory Optimization via Transformer-Based Reinforcement Learning
Link: https://arxiv.org/abs/2511.11402
Authors: Amit Jain,Victor Rodriguez-Fernandez,Richard Linares
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Autonomous spacecraft control for mission phases such as launch, ascent, stage separation, and orbit insertion remains a critical challenge due to the need for adaptive policies that generalize across dynamically distinct regimes. While reinforcement learning (RL) has shown promise in individual astrodynamics tasks, existing approaches often require separate policies for distinct mission phases, limiting adaptability and increasing operational complexity. This work introduces a transformer-based RL framework that unifies multi-phase trajectory optimization through a single policy architecture, leveraging the transformer’s inherent capacity to model extended temporal contexts. Building on proximal policy optimization (PPO), our framework replaces conventional recurrent networks with a transformer encoder-decoder structure, enabling the agent to maintain coherent memory across mission phases spanning seconds to minutes during critical operations. By integrating a Gated Transformer-XL (GTrXL) architecture, the framework eliminates manual phase transitions while maintaining stability in control decisions. We validate our approach progressively: first demonstrating near-optimal performance on single-phase benchmarks (double integrator and Van der Pol oscillator), then extending to multiphase waypoint navigation variants, and finally tackling a complex multiphase rocket ascent problem that includes atmospheric flight, stage separation, and vacuum operations. Results demonstrate that the transformer-based framework not only matches analytical solutions in simple cases but also effectively learns coherent control policies across dynamically distinct regimes, establishing a foundation for scalable autonomous mission planning that reduces reliance on phase-specific controllers while maintaining compatibility with safety-critical verification protocols.
[LG-14] SPOT: Single-Shot Positioning via Trainable Near-Field Rainbow Beamforming
Link: https://arxiv.org/abs/2511.11391
Authors: Yeyue Cai,Jianhua Mo,Meixia Tao
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:Phase-time arrays, which integrate phase shifters (PSs) and true-time delays (TTDs), have emerged as a cost-effective architecture for generating frequency-dependent rainbow beams in wideband sensing and localization. This paper proposes an end-to-end deep learning-based scheme that simultaneously designs the rainbow beams and estimates user positions. Treating the PS and TTD coefficients as trainable variables allows the network to synthesize task-oriented beams that maximize localization accuracy. A lightweight fully connected module then recovers the user’s angle-range coordinates from its feedback of the maximum quantized received power and its corresponding subcarrier index after a single downlink transmission. Compared with existing analytical and learning-based schemes, the proposed method reduces overhead by an order of magnitude and delivers consistently lower two-dimensional positioning error.
[LG-15] Robust inverse material design with physical guarantees using the Voigt-Reuss Net
Link: https://arxiv.org/abs/2511.11388
Authors: Sanath Keshav,Felix Fritzen
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:We propose a spectrally normalized surrogate for forward and inverse mechanical homogenization with hard physical guarantees. Leveraging the Voigt-Reuss bounds, we factor their difference via a Cholesky-like operator and learn a dimensionless, symmetric positive semi-definite representation with eigenvalues in [0,1]; the inverse map returns symmetric positive-definite predictions that lie between the bounds in the Löwner sense. In 3D linear elasticity on an open dataset of stochastic biphasic microstructures, a fully connected Voigt-Reuss net trained on >7.5\times 10^5 FFT-based labels with 236 isotropy-invariant descriptors and three contrast parameters recovers the isotropic projection with near-perfect fidelity (isotropy-related entries: R^2 \ge 0.998), while anisotropy-revealing couplings are unidentifiable from SO(3)-invariant inputs. Tensor-level relative Frobenius errors have median \approx 1.7% and mean \approx 3.4% across splits. For 2D plane strain on thresholded trigonometric microstructures, coupling spectral normalization with a differentiable renderer and a CNN yields R^2 > 0.99 on all components, subpercent normalized losses, accurate tracking of percolation-induced eigenvalue jumps, and robust generalization to out-of-distribution images. Treating the parametric microstructure as design variables, batched first-order optimization with a single surrogate matches target tensors within a few percent and returns diverse near-optimal designs. Overall, the Voigt-Reuss net unifies accurate, physically admissible forward prediction with large-batch, constraint-consistent inverse design, and is generic to elliptic operators and coupled-physics settings.
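The idea of predicting a normalized quantity in [0,1] and mapping it back between the Voigt and Reuss bounds can be illustrated with a scalar analogue. The sketch below is not the authors' tensor-valued construction (which factors the difference of the bounds with a Cholesky-like operator); it assumes a two-phase material with scalar moduli, and the names `voigt_reuss_bounds`, `stable_sigmoid`, and `bounded_prediction` are hypothetical.

```python
import math


def voigt_reuss_bounds(k1: float, k2: float, f1: float) -> tuple[float, float]:
    """Return (reuss, voigt) bounds for a two-phase scalar modulus.

    k1, k2: positive phase moduli; f1: volume fraction of phase 1 in [0, 1].
    """
    voigt = f1 * k1 + (1.0 - f1) * k2          # arithmetic mean (upper bound)
    reuss = 1.0 / (f1 / k1 + (1.0 - f1) / k2)  # harmonic mean (lower bound)
    return reuss, voigt


def stable_sigmoid(z: float) -> float:
    """Logistic function written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bounded_prediction(raw: float, k1: float, k2: float, f1: float) -> float:
    """Map an unconstrained model output `raw` into [reuss, voigt]."""
    reuss, voigt = voigt_reuss_bounds(k1, k2, f1)
    lam = stable_sigmoid(raw)  # normalized value in [0, 1]
    return reuss + lam * (voigt - reuss)
```

Whatever value the (hypothetical) model emits for `raw`, the returned modulus stays between the two bounds, which is the scalar counterpart of the hard guarantee described in the abstract.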
[LG-16] SoK: Security Evaluation of Wi-Fi CSI Biometrics: Attacks, Metrics, and Systemic Weaknesses
Link: https://arxiv.org/abs/2511.11381
Authors: Gioliano de Oliveira Braga,Pedro Henrique dos Santos Rocha,Rafael Pimenta de Mattos Paixão,Giovani Hoff da Costa,Gustavo Cavalcanti Morais,Lourenço Alves Pereira Júnior
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*Comments: An improved version will be submitted to Euro S&P 2026, and this paper will be updated in the near future
Abstract:Wi-Fi Channel State Information (CSI) has been repeatedly proposed as a biometric modality, often with reports of high accuracy and operational feasibility. However, the field lacks a consolidated understanding of its security properties, adversarial resilience, and methodological consistency. This Systematization of Knowledge (SoK) examines CSI-based biometric authentication through a security perspective, analyzing how existing work differs across sensing infrastructure, signal representations, feature pipelines, learning models, and evaluation methodologies. Our synthesis reveals systemic inconsistencies: reliance on aggregate accuracy metrics, limited reporting of FAR/FRR/EER, absence of per-user risk analysis, and scarce consideration of threat models or adversarial feasibility. We construct a unified evaluation framework to empirically expose these issues and demonstrate how security-relevant metrics, such as per-class EER, FCS, and the Gini Coefficient, uncover risk concentration that remains hidden under traditional reporting practices. Our analysis highlights concrete attack surfaces and shows how methodological choices materially influence vulnerability profiles, which include replay, geometric mimicry, and environmental perturbation. Based on these findings, we articulate the security boundaries of current CSI biometrics and provide guidelines for rigorous evaluation, reproducible experimentation, and future research directions. This SoK offers the security community a structured, evidence-driven reassessment of Wi-Fi CSI biometrics and their suitability as an authentication primitive.
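To make the notion of "risk concentration" concrete, here is a minimal sketch of a Gini coefficient computed over per-user error rates. It uses the standard mean-absolute-difference definition; how the paper itself computes the coefficient is not stated in the abstract, and the function name and the example values are hypothetical.

```python
from typing import Sequence


def gini_coefficient(values: Sequence[float]) -> float:
    """Gini coefficient of non-negative values (0 = evenly spread).

    Uses G = sum_i sum_j |x_i - x_j| / (2 * n * sum_i x_i).
    """
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    diff_sum = sum(abs(a - b) for a in values for b in values)
    return diff_sum / (2.0 * n * total)


# Hypothetical per-user equal error rates: same average, different concentration.
evenly_spread = [0.05, 0.05, 0.05, 0.05]
concentrated = [0.0, 0.0, 0.0, 0.20]
```

Both example lists have the same mean error, but the second places all of the risk on one user, so its Gini coefficient is much higher. This is the kind of difference an aggregate accuracy figure hides.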
[LG-17] When Genes Speak: A Semantic-Guided Framework for Spatially Resolved Transcriptomics Data Clustering AAAI’2026
Link: https://arxiv.org/abs/2511.11380
Authors: Jiangkai Long,Yanran Zhu,Chang Tang,Kun Sun,Yuanyuan Liu,Xuesong Yan
Categories: Machine Learning (cs.LG)
*Comments: AAAI’2026 poster paper. 12 pages, 8 figures
Abstract:Spatial transcriptomics enables gene expression profiling with spatial context, offering unprecedented insights into the tissue microenvironment. However, most computational models treat genes as isolated numerical features, ignoring the rich biological semantics encoded in their symbols. This prevents a truly deep understanding of critical biological characteristics. To overcome this limitation, we present SemST, a semantic-guided deep learning framework for spatial transcriptomics data clustering. SemST leverages Large Language Models (LLMs) to enable genes to “speak” through their symbolic meanings, transforming gene sets within each tissue spot into biologically informed embeddings. These embeddings are then fused with the spatial neighborhood relationships captured by Graph Neural Networks (GNNs), achieving a coherent integration of biological function and spatial structure. We further introduce the Fine-grained Semantic Modulation (FSM) module to optimally exploit these biological priors. The FSM module learns spot-specific affine transformations that empower the semantic embeddings to perform an element-wise calibration of the spatial features, thus dynamically injecting high-order biological knowledge into the spatial context. Extensive experiments on public spatial transcriptomics datasets show that SemST achieves state-of-the-art clustering performance. Crucially, the FSM module exhibits plug-and-play versatility, consistently improving the performance when integrated into other baseline methods.
[LG-18] Toward Multi-Fidelity Machine Learning Force Field for Cathode Materials
Link: https://arxiv.org/abs/2511.11361
Authors: Guangyi Dong,Zhihui Wang
Categories: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*Comments:
Abstract:Machine learning force fields (MLFFs), which employ neural networks to map atomic structures to system energies, effectively combine the high accuracy of first-principles calculation with the computational efficiency of empirical force fields. They are widely used in computational materials simulations. However, the development and application of MLFFs for lithium-ion battery cathode materials remain relatively limited. This is primarily due to the complex electronic structure characteristics of cathode materials and the resulting scarcity of high-quality computational datasets available for force field training. In this work, we develop a multi-fidelity machine learning force field framework to enhance the data efficiency of computational results, which can simultaneously utilize both low-fidelity non-magnetic and high-fidelity magnetic computational datasets of cathode materials for training. Tests conducted on the lithium manganese iron phosphate (LMFP) cathode material system demonstrate the effectiveness of this multi-fidelity approach. This work helps to achieve high-accuracy MLFF training for cathode materials at a lower training dataset cost, and offers new perspectives for applying MLFFs to computational simulations of cathode materials.
[LG-19] Fast and Expressive Multi-Token Prediction with Probabilistic Circuits
Link: https://arxiv.org/abs/2511.11346
Authors: Andreas Grivas,Lorenzo Loconte,Emile van Krieken,Piotr Nawrot,Yu Zhao,Euan Wielewski,Pasquale Minervini,Edoardo Ponti,Antonio Vergari
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:Multi-token prediction (MTP) is a prominent strategy to significantly speed up generation in large language models (LLMs), including byte-level LLMs, which are tokeniser-free but prohibitively slow. However, existing MTP methods often sacrifice expressiveness by assuming independence between future tokens. In this work, we investigate the trade-off between expressiveness and latency in MTP within the framework of probabilistic circuits (PCs). Our framework, named MTPC, allows one to explore different ways to encode the joint distributions over future tokens by selecting different circuit architectures, generalising classical models such as (hierarchical) mixture models, hidden Markov models and tensor networks. We show the efficacy of MTPC by retrofitting existing byte-level LLMs, such as EvaByte. Our experiments show that, when combined with speculative decoding, MTPC significantly speeds up generation compared to MTP with independence assumptions, while guaranteeing to retain the performance of the original verifier LLM. We also rigorously study the optimal trade-off between expressiveness and latency when exploring the possible parameterisations of MTPC, such as PC architectures and partial layer sharing between the verifier and draft LLMs.
[LG-20] StochEP: Stochastic Equilibrium Propagation for Spiking Convergent Recurrent Neural Networks
Link: https://arxiv.org/abs/2511.11320
Authors: Jiaqi Lin,Yi Jiang,Abhronil Sengupta
Categories: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*Comments:
Abstract:Spiking Neural Networks (SNNs) promise energy-efficient, sparse, biologically inspired computation. Training them with Backpropagation Through Time (BPTT) and surrogate gradients achieves strong performance but remains biologically implausible. Equilibrium Propagation (EP) provides a more local and biologically grounded alternative. However, existing EP frameworks, primarily based on deterministic neurons, either require complex mechanisms to handle discontinuities in spiking dynamics or fail to scale beyond simple visual tasks. Inspired by the stochastic nature of biological spiking mechanism and recent hardware trends, we propose a stochastic EP framework that integrates probabilistic spiking neurons into the EP paradigm. This formulation smoothens the optimization landscape, stabilizes training, and enables scalable learning in deep convolutional spiking convergent recurrent neural networks (CRNNs). We provide theoretical guarantees showing that the proposed stochastic EP dynamics approximate deterministic EP under mean-field theory, thereby inheriting its underlying theoretical guarantees. The proposed framework narrows the gap to both BPTT-trained SNNs and EP-trained non-spiking CRNNs in vision benchmarks while preserving locality, highlighting stochastic EP as a promising direction for neuromorphic and on-chip learning.
[LG-21] Toward Scalable Early Cancer Detection: Evaluating EHR-Based Predictive Models Against Traditional Screening Criteria
Link: https://arxiv.org/abs/2511.11293
Authors: Jiheum Park,Chao Pang,Tristan Y. Lee,Jeong Yun Yang,Jacob Berkowitz,Alexander Z. Wei,Nicholas Tatonetti
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*Comments:
Abstract:Current cancer screening guidelines cover only a few cancer types and rely on narrowly defined criteria, such as age or a single risk factor like smoking history, to identify high-risk individuals. Predictive models using electronic health records (EHRs), which capture large-scale longitudinal patient-level health information, may provide a more effective tool for identifying high-risk groups by detecting subtle prediagnostic signals of cancer. Recent advances in large language and foundation models have further expanded this potential, yet evidence remains limited on how useful EHR-based models are compared with traditional risk factors currently used in screening guidelines. We systematically evaluated the clinical utility of EHR-based predictive models against traditional risk factors, including gene mutations and family history of cancer, for identifying high-risk individuals across eight major cancers (breast, lung, colorectal, prostate, ovarian, liver, pancreatic, and stomach), using data from the All of Us Research Program, which integrates EHR, genomic, and survey data from over 865,000 participants. Even with a baseline modeling approach, EHR-based models achieved a 3- to 6-fold higher enrichment of true cancer cases among individuals identified as high risk compared with traditional risk factors alone, whether used as a standalone or complementary tool. The EHR foundation model, a state-of-the-art approach trained on comprehensive patient trajectories, further improved predictive performance across 26 cancer types, demonstrating the clinical potential of EHR-based predictive modeling to support more precise and scalable early detection strategies.
[LG-22] Heterogeneous Attributed Graph Learning via Neighborhood-Aware Star Kernels
Link: https://arxiv.org/abs/2511.11245
Authors: Hong Huang,Chengyu Yao,Haiming Chen,Hang Gao
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:Attributed graphs, typically characterized by irregular topologies and a mix of numerical and categorical attributes, are ubiquitous in diverse domains such as social networks, bioinformatics, and cheminformatics. While graph kernels provide a principled framework for measuring graph similarity, existing kernel methods often struggle to simultaneously capture heterogeneous attribute semantics and neighborhood information in attributed graphs. In this work, we propose the Neighborhood-Aware Star Kernel (NASK), a novel graph kernel designed for attributed graph learning. NASK leverages an exponential transformation of the Gower similarity coefficient to jointly model numerical and categorical features efficiently, and employs star substructures enhanced by Weisfeiler-Lehman iterations to integrate multi-scale neighborhood structural information. We theoretically prove that NASK is positive definite, ensuring compatibility with kernel-based learning frameworks such as SVMs. Extensive experiments are conducted on eleven attributed and four large-scale real-world graph benchmarks. The results demonstrate that NASK consistently achieves superior performance over sixteen state-of-the-art baselines, including nine graph kernels and seven Graph Neural Networks.
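As a rough illustration of how a Gower-style similarity treats mixed attributes, the sketch below scores numerical features by range-normalized distance and categorical features by exact match, then applies an exponential transformation. The abstract does not give NASK's exact formula, so the form `exp(-gamma * (1 - s))`, the parameter `gamma`, and the function names are assumptions made for illustration only.

```python
import math
from typing import Optional, Sequence, Union

Value = Union[float, str]


def gower_similarity(
    a: Sequence[Value], b: Sequence[Value], ranges: Sequence[Optional[float]]
) -> float:
    """Average per-attribute similarity for mixed-type attribute vectors.

    ranges[i] is the value range of numerical attribute i,
    or None when attribute i is categorical.
    """
    if not (len(a) == len(b) == len(ranges)) or not a:
        raise ValueError("a, b and ranges must be non-empty and equally long")
    scores = []
    for x, y, r in zip(a, b, ranges):
        if r is None:  # categorical: exact match
            scores.append(1.0 if x == y else 0.0)
        elif r > 0:    # numerical: range-normalized closeness
            scores.append(1.0 - abs(x - y) / r)
        else:          # constant numerical attribute carries no contrast
            scores.append(1.0)
    return sum(scores) / len(scores)


def exp_gower_kernel(a, b, ranges, gamma: float = 1.0) -> float:
    """Assumed exponential transformation of the Gower similarity."""
    return math.exp(-gamma * (1.0 - gower_similarity(a, b, ranges)))
```

For two nodes with attributes `(1.0, "C")` and `(3.0, "C")` and a numerical range of 4.0, the Gower similarity is the mean of 0.5 and 1.0, that is 0.75.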
[LG-23] Neural Network-Powered Finger-Drawn Biometric Authentication
Link: https://arxiv.org/abs/2511.11235
Authors: Maan Al Balkhi,Kordian Gontarska,Marko Harasic,Adrian Paschke
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*Comments:
Abstract:This paper investigates neural network-based biometric authentication using finger-drawn digits on touchscreen devices. We evaluated CNN and autoencoder architectures for user authentication through simple digit patterns (0-9) traced with finger input. Twenty participants contributed 2,000 finger-drawn digits each on personal touchscreen devices. We compared two CNN architectures: a modified Inception-V1 network and a lightweight shallow CNN for mobile environments. Additionally, we examined Convolutional and Fully Connected autoencoders for anomaly detection. Both CNN architectures achieved ~89% authentication accuracy, with the shallow CNN requiring fewer parameters. Autoencoder approaches achieved ~75% accuracy. The results demonstrate that finger-drawn symbol authentication provides a viable, secure, and user-friendly biometric solution for touchscreen devices. This approach can be integrated with existing pattern-based authentication methods to create multi-layered security systems for mobile applications.
[LG-24] Sparse Methods for Vector Embeddings of TPC Data NEURIPS
Link: https://arxiv.org/abs/2511.11221
Authors: Tyler Wheeler,Michelle P. Kuchera,Raghuram Ramanujan,Ryan Krupp,Chris Wrede,Saiprasad Ravishankar,Connor L. Cross,Hoi Yan Ian Heung,Andrew J. Jones,Benjamin Votaw
Categories: Machine Learning (cs.LG); Nuclear Experiment (nucl-ex)
*Comments: NeurIPS Machine Learning and the Physical Sciences Workshop 2025
Abstract:Time Projection Chambers (TPCs) are versatile detectors that reconstruct charged-particle tracks in an ionizing medium, enabling sensitive measurements across a wide range of nuclear physics experiments. We explore sparse convolutional networks for representation learning on TPC data, finding that a sparse ResNet architecture, even with randomly set weights, provides useful structured vector embeddings of events. Pre-training this architecture on a simple physics-motivated binary classification task further improves the embedding quality. Using data from the GAseous Detector with GErmanium Tagging (GADGET) II TPC, a detector optimized for measuring low-energy β-delayed particle decays, we represent raw pad-level signals as sparse tensors, train Minkowski Engine ResNet models, and probe the resulting event-level embeddings, which reveal rich event structure. As a cross-detector test, we embed data from the Active-Target TPC (AT-TPC) – a detector designed for nuclear reaction studies in inverse kinematics – using the same encoder. We find that even an untrained sparse ResNet model provides useful embeddings of AT-TPC data, and we observe improvements when the model is trained on GADGET data. Together, these results highlight the potential of sparse convolutional techniques as a general tool for representation learning in diverse TPC experiments.
[LG-25] A Best-of-Both-Worlds Proof for Tsallis-INF without Fenchel Conjugates
Link: https://arxiv.org/abs/2511.11211
Authors: Wei-Cheng Lee,Francesco Orabona
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*Comments:
Abstract:In this short note, we present a simple derivation of the best-of-both-worlds guarantee for the Tsallis-INF multi-armed bandit algorithm from J. Zimmert and Y. Seldin, "Tsallis-INF: An optimal algorithm for stochastic and adversarial bandits," Journal of Machine Learning Research, 22(28):1-49, 2021. In particular, the proof uses modern tools from online convex optimization and avoids the use of conjugate functions. Also, we do not optimize the constants in the bounds in favor of a slimmer proof.
[LG-26] When to Stop Federated Learning: Zero-Shot Generation of Synthetic Validation Data with Generative AI for Early Stopping
Link: https://arxiv.org/abs/2511.11208
Authors: Youngjoon Lee,Hyukjoon Lee,Jinu Gong,Yang Cao,Joonhyuk Kang
Categories: Machine Learning (cs.LG)
*Comments: Accepted to IEEE BigData 2025
Abstract:Federated Learning (FL) enables collaborative model training across decentralized devices while preserving data privacy. However, FL methods typically run for a predefined number of global rounds, often leading to unnecessary computation when optimal performance is reached earlier. In addition, training may continue even when the model fails to achieve meaningful performance. To address this inefficiency, we introduce a zero-shot synthetic validation framework that leverages generative AI to monitor model performance and determine early stopping points. Our approach adaptively stops training near the optimal round, thereby conserving computational resources and enabling rapid hyperparameter adjustments. Numerical results on multi-label chest X-ray classification demonstrate that our method reduces training rounds by up to 74% while maintaining accuracy within 1% of the optimal.
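The stopping decision described above can be sketched with a generic patience rule applied to per-round validation scores. The abstract does not specify the paper's actual criterion, so the rule, the `patience` and `min_delta` parameters, and the function name below are assumptions; the scores are presumed to come from evaluating the global model on the synthetic validation set after each round.

```python
from typing import Sequence, Tuple


def early_stop_round(
    val_scores: Sequence[float], patience: int = 3, min_delta: float = 0.0
) -> Tuple[int, int]:
    """Return (stop_round, best_round) under a simple patience rule.

    Training stops at the first round where the score has failed to improve
    by more than `min_delta` for `patience` consecutive rounds. If that never
    happens, stop_round is the last round. Rounds are 0-indexed.
    """
    best = float("-inf")
    best_round = -1
    waited = 0
    for rnd, score in enumerate(val_scores):
        if score > best + min_delta:
            best, best_round, waited = score, rnd, 0
        else:
            waited += 1
            if waited >= patience:
                return rnd, best_round
    return len(val_scores) - 1, best_round


# Hypothetical validation accuracy per global round.
scores = [0.50, 0.60, 0.70, 0.69, 0.68, 0.67, 0.90]
stop_round, best_round = early_stop_round(scores, patience=3)
```

With these hypothetical scores the rule stops at round 5 and keeps the model from round 2. It never sees the late jump at round 6, which is the usual trade-off of a patience rule between saved computation and missed late improvements.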
[LG-27] LoRaCompass: Robust Reinforcement Learning to Efficiently Search for a LoRa Tag
Link: https://arxiv.org/abs/2511.11190
Authors: Tianlang He,Zhongming Lin,Tianrui Jiang,S.-H. Gary Chan
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:The Long-Range (LoRa) protocol, known for its extensive range and low power, has increasingly been adopted in tags worn by mentally incapacitated persons (MIPs) and others at risk of going missing. We study the sequential decision-making process for a mobile sensor to locate a periodically broadcasting LoRa tag with the fewest moves (hops) in general, unknown environments, guided by the received signal strength indicator (RSSI). While existing methods leverage reinforcement learning for search, they remain vulnerable to domain shift and signal fluctuation, resulting in cascading decision errors that culminate in substantial localization inaccuracies. To bridge this gap, we propose LoRaCompass, a reinforcement learning model designed to achieve robust and efficient search for a LoRa tag. For exploitation under domain shift and signal fluctuation, LoRaCompass learns a robust spatial representation from RSSI to maximize the probability of moving closer to a tag, via a spatially-aware feature extractor and a policy distillation loss function. It further introduces an exploration function inspired by the upper confidence bound (UCB) that guides the sensor toward the tag with increasing confidence. We have validated LoRaCompass in ground-based and drone-assisted scenarios within diverse unseen environments covering an area of over 80 km^2. It has demonstrated a high success rate (90%) in locating the tag within 100 m proximity (a 40% improvement over existing methods) and high efficiency, with a search path length (in hops) that scales linearly with the initial distance.
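For readers unfamiliar with the upper confidence bound, the sketch below implements the textbook UCB1 rule for choosing among candidate moves. It is not LoRaCompass's exploration function, which the abstract only describes as "inspired by" UCB; the function names, the constant `c`, and the interpretation of actions as movement directions are assumptions.

```python
import math
from typing import Sequence


def ucb_score(mean_reward: float, visits: int, total_steps: int, c: float = math.sqrt(2)) -> float:
    """UCB1 score: estimated value plus an uncertainty bonus."""
    if visits == 0:
        return math.inf  # always try an untested action first
    return mean_reward + c * math.sqrt(math.log(total_steps) / visits)


def pick_action(means: Sequence[float], counts: Sequence[int], c: float = math.sqrt(2)) -> int:
    """Index of the action (e.g. a movement direction) with the highest score."""
    total = sum(counts)
    scores = [ucb_score(m, n, total, c) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)
```

The bonus term shrinks as an action is tried more often, so the choice shifts from exploration toward the best-looking action as confidence grows.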
[LG-28] Dynamic Deep Graph Learning for Incomplete Multi-View Clustering with Masked Graph Reconstruction Loss
Link: https://arxiv.org/abs/2511.11181
Authors: Zhenghao Zhang,Jun Xie,Xingchen Chen,Tao Yu,Hongzhu Yi,Kaixin Xu,Yuanxiang Wang,Tianyu Zong,Xinming Wang,Jiahuan Chen,Guoqing Chao,Feng Chen,Zhepeng Wang,Jungang Xu
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:The prevalence of real-world multi-view data makes incomplete multi-view clustering (IMVC) a crucial research topic. The rapid development of Graph Neural Networks (GNNs) has established them as one of the mainstream approaches for multi-view clustering. Despite significant progress in GNN-based IMVC, some challenges remain: (1) Most methods rely on the K-Nearest Neighbors (KNN) algorithm to construct static graphs from raw data, which introduces noise and diminishes the robustness of the graph topology. (2) Existing methods typically utilize the Mean Squared Error (MSE) loss between the reconstructed graph and the sparse adjacency graph directly as the graph reconstruction loss, leading to substantial gradient noise during optimization. To address these issues, we propose a novel Dynamic Deep Graph Learning for Incomplete Multi-View Clustering with Masked Graph Reconstruction Loss (DGIMVCM). Firstly, we construct a missing-robust global graph from the raw data. A graph convolutional embedding layer is then designed to extract primary features and refined dynamic view-specific graph structures, leveraging the global graph for imputation of missing views. This process is complemented by graph structure contrastive learning, which identifies consistency among view-specific graph structures. Secondly, a graph self-attention encoder is introduced to extract high-level representations based on the imputed primary features and view-specific graphs, and is optimized with a masked graph reconstruction loss to mitigate gradient noise during optimization. Finally, a clustering module is constructed and optimized through a pseudo-label self-supervised training mechanism. Extensive experiments on multiple datasets validate the effectiveness and superiority of DGIMVCM.
[LG-29] On-line learning of dynamic systems: sparse regression meets Kalman filtering
Link: https://arxiv.org/abs/2511.11178
Authors: Gianluigi Pillonetto,Akram Yazdani,Aleksandr Aravkin
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:Learning governing equations from data is central to understanding the behavior of physical systems across diverse scientific disciplines, including physics, biology, and engineering. The Sindy algorithm has proven effective in leveraging sparsity to identify concise models of nonlinear dynamical systems. In this paper, we extend sparsity-driven approaches to real-time learning by integrating a cornerstone algorithm from control theory – the Kalman filter (KF). The resulting Sindy Kalman Filter (SKF) unifies both frameworks by treating unknown system parameters as state variables, enabling real-time inference of complex, time-varying nonlinear models unattainable by either method alone. Furthermore, SKF enhances KF parameter identification strategies, particularly via look-ahead error, significantly simplifying the estimation of sparsity levels, variance parameters, and switching instants. We validate SKF on a chaotic Lorenz system with drifting or switching parameters and demonstrate its effectiveness in the real-time identification of a sparse nonlinear aircraft model built from real flight data.
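The core idea of treating unknown model parameters as Kalman filter states can be sketched for a model that is linear in its parameters, y = phi(x)^T theta + noise, with theta following a random walk. This is only the generic parameter-tracking update, not the SKF algorithm: sparsity promotion and the look-ahead error mentioned in the abstract are omitted, and the feature library, the noise levels `q` and `r`, and the function name are assumptions.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def kf_parameter_update(
    theta: List[float], P: Matrix, phi: List[float], y: float,
    q: float = 1e-4, r: float = 1e-2,
) -> Tuple[List[float], Matrix]:
    """One Kalman step for y = phi^T theta + noise with random-walk theta.

    theta: current parameter estimate; P: its (symmetric) covariance;
    phi: feature vector for this sample; q, r: process and measurement variances.
    """
    n = len(theta)
    # Predict: random-walk parameters keep their mean, covariance grows by q*I.
    P = [[P[i][j] + (q if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Update: scalar measurement, so the innovation variance s is a scalar.
    P_phi = [sum(P[i][j] * phi[j] for j in range(n)) for i in range(n)]
    s = sum(phi[i] * P_phi[i] for i in range(n)) + r
    gain = [v / s for v in P_phi]
    innovation = y - sum(phi[i] * theta[i] for i in range(n))
    theta = [theta[i] + gain[i] * innovation for i in range(n)]
    P = [[P[i][j] - gain[i] * P_phi[j] for j in range(n)] for i in range(n)]
    return theta, P


# Hypothetical example: recover dx/dt = 2*x - 1*x^3 from noise-free samples.
theta_hat, cov = [0.0, 0.0], [[10.0, 0.0], [0.0, 10.0]]
for _ in range(10):
    for k in range(21):
        x = -1.0 + 0.1 * k
        theta_hat, cov = kf_parameter_update(theta_hat, cov, [x, x ** 3], 2.0 * x - x ** 3)
```

Because each sample is processed as it arrives, the estimate can follow drifting parameters. The process variance `q` controls how quickly old data is forgotten.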
[LG-30] Power Ensemble Aggregation for Improved Extreme Event AI Prediction NEURIPS2025
Link: https://arxiv.org/abs/2511.11170
Authors: Julien Collard,Pierre Gentine,Tian Zheng
Categories: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*Comments: Accepted for the NeurIPS 2025 ML4PS workshop
Abstract:This paper addresses the critical challenge of improving predictions of climate extreme events, specifically heat waves, using machine learning methods. Our work is framed as a classification problem in which we try to predict whether surface air temperature will exceed its q-th local quantile within a specified timeframe. Our key finding is that aggregating ensemble predictions using a power mean significantly enhances the classifier’s performance. By making a machine-learning based weather forecasting model generative and applying this non-linear aggregation method, we achieve better accuracy in predicting extreme heat events than with the typical mean prediction from the same model. Our power aggregation method shows promise and adaptability, as its optimal performance varies with the quantile threshold chosen, demonstrating increased effectiveness for higher extremes prediction.
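A power mean generalizes the ordinary average: with exponent p = 1 it is the arithmetic mean, and larger p gives more weight to the largest ensemble members. The sketch below shows the aggregation step only. The abstract does not say exactly which quantity is aggregated, so the use of per-member exceedance scores in [0, 1], the decision threshold, and the function names are assumptions.

```python
import math
from typing import Sequence


def power_mean(values: Sequence[float], p: float) -> float:
    """Power (generalized) mean of non-negative values.

    p = 1 is the arithmetic mean; p = 0 is taken as the geometric-mean limit.
    """
    if not values:
        raise ValueError("need at least one ensemble member")
    if any(v < 0 for v in values):
        raise ValueError("values must be non-negative")
    n = len(values)
    if p <= 0 and any(v == 0 for v in values):
        return 0.0  # limiting value when a member is exactly zero
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


def predict_extreme(member_scores: Sequence[float], p: float, threshold: float = 0.5) -> bool:
    """Flag an extreme event when the aggregated score exceeds the threshold."""
    return power_mean(member_scores, p) > threshold


# Hypothetical ensemble: most members say "no heat wave", one is confident.
members = [0.1, 0.1, 0.2, 0.9]
```

For this hypothetical ensemble the arithmetic mean (p = 1) is 0.325 and stays below a 0.5 threshold, while p = 4 gives roughly 0.64 and triggers the alert. This illustrates how a higher exponent lets a confident minority of members drive the prediction.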
[LG-31] Training Neural Networks at Any Scale
Link: https://arxiv.org/abs/2511.11163
Authors: Thomas Pethick,Kimon Antonakopoulos,Antonio Silveti-Falls,Leena Chennuru Vankadara,Volkan Cevher
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:This article reviews modern optimization methods for training neural networks with an emphasis on efficiency and scale. We present state-of-the-art optimization algorithms under a unified algorithmic template that highlights the importance of adapting to the structures in the problem. We then cover how to make these algorithms agnostic to the scale of the problem. Our exposition is intended as an introduction for both practitioners and researchers who wish to be involved in these exciting new developments.
[LG-32] Adaptive Symmetrization of the KL Divergence
Link: https://arxiv.org/abs/2511.11159
Authors: Omri Ben-Dov,Luiz F.O. Chamon
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:Many tasks in machine learning can be described as or reduced to learning a probability distribution given a finite set of samples. A common approach is to minimize a statistical divergence between the (empirical) data distribution and a parameterized distribution, e.g., a normalizing flow (NF) or an energy-based model (EBM). In this context, the forward KL divergence is ubiquitous due to its tractability, though its asymmetry may prevent capturing some properties of the target distribution. Symmetric alternatives involve brittle min-max formulations and adversarial training (e.g., generative adversarial networks) or evaluating the reverse KL divergence, as is the case for the symmetric Jeffreys divergence, which is challenging to compute from samples. This work sets out to develop a new approach to minimize the Jeffreys divergence. To do so, it uses a proxy model whose goal is not only to fit the data, but also to assist in optimizing the Jeffreys divergence of the main model. This joint training task is formulated as a constrained optimization problem to obtain a practical algorithm that adapts the models' priorities throughout training. We illustrate how this framework can be used to combine the advantages of NFs and EBMs in tasks such as density estimation, image generation, and simulation-based inference.
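The asymmetry of the KL divergence and the symmetry of the Jeffreys divergence are easy to see on small discrete distributions. The sketch below uses the convention J(p, q) = KL(p||q) + KL(q||p); some references include a factor of 1/2. It evaluates the divergences exactly from known probabilities, which is the easy case and not the sample-based setting the paper targets. The function names are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for discrete distributions on the same support."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue          # 0 * log(0 / q) is taken as 0
        if qi == 0:
            return math.inf   # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


def jeffreys_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Symmetric Jeffreys divergence: forward KL plus reverse KL."""
    return kl_divergence(p, q) + kl_divergence(q, p)


p = [0.5, 0.5]
q = [0.9, 0.1]
forward, reverse = kl_divergence(p, q), kl_divergence(q, p)
```

Here the forward and reverse KL values differ (about 0.51 versus 0.37), while their sum is the same whichever distribution is listed first.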
[LG-33] Deep Learning for Short-Term Precipitation Prediction in Four Major Indian Cities: A ConvLSTM Approach with Explainable AI
Link: https://arxiv.org/abs/2511.11152
Authors: Tanmay Ghosh,Shaurabh Anand,Rakesh Gomaji Nannewar,Nithin Nagaraj
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:Deep learning models for precipitation forecasting often function as black boxes, limiting their adoption in real-world weather prediction. To enhance transparency while maintaining accuracy, we developed an interpretable deep learning framework for short-term precipitation prediction in four major Indian cities: Bengaluru, Mumbai, Delhi, and Kolkata, spanning diverse climate zones. We implemented a hybrid Time-Distributed CNN-ConvLSTM (Convolutional Neural Network-Long Short-Term Memory) architecture, trained on multi-decadal ERA5 reanalysis data. The architecture was optimized for each city with a different number of convolutional filters: Bengaluru (32), Mumbai and Delhi (64), and Kolkata (128). The models achieved root mean square error (RMSE) values of 0.21 mm/day (Bengaluru), 0.52 mm/day (Mumbai), 0.48 mm/day (Delhi), and 1.80 mm/day (Kolkata). Through interpretability analysis using permutation importance, Gradient-weighted Class Activation Mapping (Grad-CAM), temporal occlusion, and counterfactual perturbation, we identified distinct patterns in the model’s behavior. The model relied on city-specific variables, with prediction horizons ranging from one day for Bengaluru to five days for Kolkata. This study demonstrates how explainable AI (xAI) can provide accurate forecasts and transparent insights into precipitation patterns in diverse urban environments.
[LG-34] Anomaly Detection in High-Dimensional Bank Account Balances via Robust Methods
Link: https://arxiv.org/abs/2511.11143
Authors: Federico Maddanu,Tommaso Proietti,Riccardo Crupi
Categories: Machine Learning (cs.LG)
*Comments:
Abstract:Detecting point anomalies in bank account balances is essential for financial institutions, as it enables the identification of potential fraud, operational issues, or other irregularities. Robust statistics is useful for flagging outliers and for providing estimates of the data distribution parameters that are not affected by contaminated observations. However, such a strategy is often less efficient and computationally expensive in high-dimensional settings. In this paper, we propose and empirically evaluate several robust approaches that may be computationally efficient in medium- and high-dimensional datasets, with high breakdown points and low computational time. Our application deals with around 2.6 million daily records of anonymous users’ bank account balances.
[LG-35] One-Shot Transfer Learning for Nonlinear PDEs with Perturbative PINNs NEURIPS2025
Link: https://arxiv.org/abs/2511.11137
Authors: Samuel Auroy,Pavlos Protopapas
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*Comments: Accepted at Machine Learning and the Physical Sciences Workshop, NeurIPS 2025
Abstract:We propose a framework for solving nonlinear partial differential equations (PDEs) by combining perturbation theory with one-shot transfer learning in Physics-Informed Neural Networks (PINNs). Nonlinear PDEs with polynomial terms are decomposed into a sequence of linear subproblems, which are efficiently solved using a Multi-Head PINN. Once the latent representation of the linear operator is learned, solutions to new PDE instances with varying perturbations, forcing terms, or boundary/initial conditions can be obtained in closed form without retraining. We validate the method on KPP-Fisher and wave equations, achieving errors on the order of 1e-3 while adapting to new problem instances in under 0.2 seconds, with accuracy comparable to classical solvers but faster transfer. Sensitivity analyses show predictable error growth with epsilon and polynomial degree, clarifying the method’s effective regime. Our contributions are: (i) extending one-shot transfer learning from nonlinear ODEs to PDEs, (ii) deriving a closed-form solution for adapting to new PDE instances, and (iii) demonstrating accuracy and efficiency on canonical nonlinear PDEs. We conclude by outlining extensions to derivative-dependent nonlinearities and higher-dimensional PDEs.
[LG-36] Improving Continual Learning of Knowledge Graph Embeddings via Informed Initialization
Link: https://arxiv.org/abs/2511.11118
Authors: Gerard Pons, Besim Bilalli, Anna Queralt
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Many Knowledge Graphs (KGs) are frequently updated, forcing their Knowledge Graph Embeddings (KGEs) to adapt to these changes. To address this problem, continual learning techniques for KGEs incorporate embeddings for new entities while updating the old ones. One necessary step in these methods is the initialization of the embeddings, as an input to the KGE learning process, which can have an important impact on the accuracy of the final embeddings, as well as on the time required to train them. This is especially relevant for relatively small and frequent updates. We propose a novel informed embedding initialization strategy, which can be seamlessly integrated into existing continual learning methods for KGE, that enhances the acquisition of new knowledge while reducing catastrophic forgetting. Specifically, the KG schema and the previously learned embeddings are utilized to obtain initial representations for the new entities, based on the classes the entities belong to. Our extensive experimental analysis shows that the proposed initialization strategy improves the predictive performance of the resulting KGEs, while also enhancing knowledge retention. Furthermore, our approach accelerates knowledge acquisition, reducing the number of epochs, and therefore time, required to incrementally learn new embeddings. Finally, its benefits across various types of KGE learning models are demonstrated.
[LG-37] SMART: A Surrogate Model for Predicting Application Runtime in Dragonfly Systems AAAI2026
Link: https://arxiv.org/abs/2511.11111
Authors: Xin Wang, Pietro Lodi Rizzini, Sourav Medya, Zhiling Lan
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments: Accepted at AAAI 2026
Abstract:The Dragonfly network, with its high-radix and low-diameter structure, is a leading interconnect in high-performance computing. A major challenge is workload interference on shared network links. Parallel discrete event simulation (PDES) is commonly used to analyze workload interference. However, high-fidelity PDES is computationally expensive, making it impractical for large-scale or real-time scenarios. Hybrid simulation that incorporates data-driven surrogate models offers a promising alternative, especially for forecasting application runtime, a task complicated by the dynamic behavior of network traffic. We present SMART, a surrogate model that combines graph neural networks (GNNs) and large language models (LLMs) to capture both spatial and temporal patterns from port-level router data. SMART outperforms existing statistical and machine learning baselines, enabling accurate runtime prediction and supporting efficient hybrid simulation of Dragonfly networks.
[LG-38] Sheaf Cohomology of Linear Predictive Coding Networks NEURIPS2025
Link: https://arxiv.org/abs/2511.11092
Authors: Jeffrey Seely
Subjects: Machine Learning (cs.LG)
*Comments: Accepted to NeurIPS 2025 Workshop on Symmetry and Geometry in Neural Representations
Abstract:Predictive coding (PC) replaces global backpropagation with local optimization over weights and activations. We show that linear PC networks admit a natural formulation as cellular sheaves: the sheaf coboundary maps activations to edge-wise prediction errors, and PC inference is diffusion under the sheaf Laplacian. Sheaf cohomology then characterizes irreducible error patterns that inference cannot remove. We analyze recurrent topologies where feedback loops create internal contradictions, introducing prediction errors unrelated to supervision. Using a Hodge decomposition, we determine when these contradictions cause learning to stall. The sheaf formalism provides both diagnostic tools for identifying problematic network configurations and design principles for effective weight initialization for recurrent PC networks.
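As a rough illustration of the coboundary and Laplacian view described in this abstract, the sketch below builds a three-node scalar chain. Everything in it is an assumption made for illustration and is not taken from the paper: the scalar weights `w1` and `w2`, the sign convention of `coboundary`, the step size, and the choice to clamp the two end nodes.

```python
# Toy linear predictive-coding chain x0 -> x1 -> x2 with scalar weights.
# Edge-wise prediction errors: e1 = x1 - w1*x0 and e2 = x2 - w2*x1.

def coboundary(w1, w2):
    """Matrix delta such that delta applied to x gives the two edge errors."""
    return [[-w1, 1.0, 0.0],
            [0.0, -w2, 1.0]]


def matvec(m, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def laplacian(delta):
    """Laplacian-style matrix L = delta^T delta."""
    n = len(delta[0])
    return [[sum(row[i] * row[j] for row in delta) for j in range(n)]
            for i in range(n)]


def infer(w1, w2, x0, x2, steps=500, lr=0.1):
    """Diffuse the hidden activation x1 under L with x0 and x2 clamped."""
    lap = laplacian(coboundary(w1, w2))
    x = [x0, 0.0, x2]
    for _ in range(steps):
        grad = matvec(lap, x)   # gradient of 0.5 * ||delta x||^2
        x[1] -= lr * grad[1]    # only the free (hidden) node moves
    return x


x = infer(w1=2.0, w2=0.5, x0=1.0, x2=3.0)
errors = matvec(coboundary(2.0, 0.5), x)
```

With these toy values the hidden node settles near 2.8 and both edge errors stay nonzero, because the two clamped ends cannot be satisfied at the same time.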
[LG-39] Echoless Label-Based Pre-computation for Memory-Efficient Heterogeneous Graph Learning AAAI2026
Link: https://arxiv.org/abs/2511.11081
Authors: Jun Hu, Shangheng Chen, Yufei He, Yuan Li, Bryan Hooi, Bingsheng He
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*Comments: Accepted by AAAI 2026
Abstract:Heterogeneous Graph Neural Networks (HGNNs) are widely used for deep learning on heterogeneous graphs. Typical end-to-end HGNNs require repetitive message passing during training, limiting efficiency for large-scale real-world graphs. Pre-computation-based HGNNs address this by performing message passing only once during preprocessing, collecting neighbor information into regular-shaped tensors, which enables efficient mini-batch training. Label-based pre-computation methods collect neighbors’ label information but suffer from training label leakage, where a node’s own label information propagates back to itself during multi-hop message passing (the echo effect). Existing mitigation strategies are memory-inefficient on large graphs or suffer from compatibility issues with advanced message passing methods. We propose Echoless Label-based Pre-computation (Echoless-LP), which eliminates training label leakage with Partition-Focused Echoless Propagation (PFEP). PFEP partitions target nodes and performs echoless propagation, where nodes in each partition collect label information only from neighbors in other partitions, avoiding echo while remaining memory-efficient and compatible with any message passing method. We also introduce an Asymmetric Partitioning Scheme (APS) and a PostAdjust mechanism to address information loss from partitioning and distributional shifts across partitions. Experiments on public datasets demonstrate that Echoless-LP achieves superior performance and maintains memory efficiency compared to baselines.
[LG-40] Flow matching-based generative models for MIMO channel estimation
Link: https://arxiv.org/abs/2511.10941
Authors: Wenkai Liu, Nan Ma, Jianqiao Chen, Xiaoxuan Qi, Yuhang Ma
Subjects: Machine Learning (cs.LG)
*Comments: 6 pages, 4 figures
Abstract:Diffusion model (DM)-based channel estimation, which generates channel samples through stepwise posterior sampling with a denoising process, has shown potential for high-precision channel state information (CSI) acquisition. However, slow sampling speed is a key challenge for recently developed DM-based schemes. To alleviate this problem, we propose a novel flow matching (FM)-based generative model for multiple-input multiple-output (MIMO) channel estimation. We first formulate the channel estimation problem within the FM framework, where the conditional probability path is constructed from the noisy channel distribution to the true channel distribution. In this case, the path evolves along a straight-line trajectory at constant speed. Guided by this, we then derive a velocity field that depends solely on the noise statistics to guide the training of the generative model. Furthermore, during the sampling phase, we use the trained velocity field as prior information for channel estimation, which allows quick and reliable enhancement of the noisy channel via an ordinary differential equation (ODE) Euler solver. Finally, numerical results demonstrate that the proposed FM-based channel estimation scheme can significantly reduce the sampling overhead compared to other popular DM-based schemes, such as the score matching (SM)-based scheme. Meanwhile, it achieves superior channel estimation accuracy under different channel conditions.
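The straight-line path and Euler sampling mentioned in this abstract can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: the vectors `noisy` and `clean` are made-up real-valued stand-ins for channel coefficients, and `oracle_velocity` is a hypothetical placeholder for a trained velocity network (it simply returns the constant velocity of the straight line, which a real model would have to learn).

```python
def straight_line_point(x0, x1, t):
    """Point at time t on the straight-line path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def euler_sample(x_start, velocity, num_steps=8):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with forward Euler."""
    x = list(x_start)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Made-up toy data: a noisy observation and the clean target it should reach.
noisy = [1.0, -2.0, 0.5]
clean = [0.5, 0.0, 0.25]


def oracle_velocity(x, t):
    """Hypothetical stand-in for a trained model: constant straight-line velocity."""
    return [b - a for a, b in zip(noisy, clean)]


estimate = euler_sample(noisy, oracle_velocity, num_steps=8)
midpoint = straight_line_point(noisy, clean, 0.5)
```

Because the velocity along a straight line is constant, forward Euler reaches the endpoint regardless of the step count in this toy setting, which is one intuition for why such paths allow few sampling steps.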
[LG-41] Cascading Bandits With Feedback
Link: https://arxiv.org/abs/2511.10938
Authors: R Sri Prakash, Nikhil Karamchandani, Sharayu Moharir
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments:
Abstract:Motivated by the challenges of edge inference, we study a variant of the cascade bandit model in which each arm corresponds to an inference model with an associated accuracy and error probability. We analyse four decision-making policies, namely Explore-then-Commit, Action Elimination, Lower Confidence Bound (LCB), and Thompson Sampling, and provide sharp theoretical regret guarantees for each. Unlike in classical bandit settings, Explore-then-Commit and Action Elimination incur suboptimal regret because they commit to a fixed ordering after the exploration phase, limiting their ability to adapt. In contrast, LCB and Thompson Sampling continuously update their decisions based on observed feedback, achieving constant O(1) regret. Simulations corroborate these theoretical findings, highlighting the crucial role of adaptivity for efficient edge inference under uncertainty.
[LG-42] CAT-Net: A Cross-Attention Tone Network for Cross-Subject EEG-EMG Fusion Tone Decoding AAAI-26 AAAI
Link: https://arxiv.org/abs/2511.10935
Authors: Yifan Zhuang, Calvin Huang, Zepeng Yu, Yongjie Zou, Jiawei Ju
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*Comments: This is the extended version with technical appendices. The version of record appears in AAAI-26. Please cite the AAAI version
Abstract:Brain-computer interface (BCI) speech decoding has emerged as a promising tool for assisting individuals with speech impairments. In this context, the integration of electroencephalography (EEG) and electromyography (EMG) signals offers strong potential for enhancing decoding performance. Mandarin tone classification presents particular challenges, as tonal variations convey distinct meanings even when phonemes remain identical. In this study, we propose a novel cross-subject multimodal BCI decoding framework that fuses EEG and EMG signals to classify four Mandarin tones under both audible and silent speech conditions. Inspired by the cooperative mechanisms of neural and muscular systems in speech production, our neural decoding architecture combines spatial-temporal feature extraction branches with a cross-attention fusion mechanism, enabling informative interaction between modalities. We further incorporate domain-adversarial training to improve cross-subject generalization. We collected 4,800 EEG trials and 4,800 EMG trials from 10 participants using only twenty EEG and five EMG channels, demonstrating the feasibility of minimal-channel decoding. Despite employing lightweight modules, our model outperforms state-of-the-art baselines across all conditions, achieving average classification accuracies of 87.83% for audible speech and 88.08% for silent speech. In cross-subject evaluations, it still maintains strong performance with accuracies of 83.27% and 85.10% for audible and silent speech, respectively. We further conduct ablation studies to validate the effectiveness of each component. Our findings suggest that tone-level decoding with minimal EEG-EMG channels is feasible and potentially generalizable across subjects, contributing to the development of practical BCI applications.
[LG-43] Towards Federated Clustering: A Client-wise Private Graph Aggregation Framework
Link: https://arxiv.org/abs/2511.10915
Authors: Guanxiong He, Jie Wang, Liaoyuan Tang, Zheng Wang, Rong Wang, Feiping Nie
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Federated clustering addresses the critical challenge of extracting patterns from decentralized, unlabeled data. However, it is hampered by the flaw that current approaches are forced to accept a compromise between performance and privacy: transmitting embedding representations risks sensitive data leakage, while sharing only abstract cluster prototypes leads to diminished model accuracy. To resolve this dilemma, we propose Structural Privacy-Preserving Federated Graph Clustering (SPP-FGC), a novel algorithm that innovatively leverages local structural graphs as the primary medium for privacy-preserving knowledge sharing, thus moving beyond the limitations of conventional techniques. Our framework operates on a clear client-server logic; on the client-side, each participant constructs a private structural graph that captures intrinsic data relationships, which the server then securely aggregates and aligns to form a comprehensive global graph from which a unified clustering structure is derived. The framework offers two distinct modes to suit different needs. SPP-FGC is designed as an efficient one-shot method that completes its task in a single communication round, ideal for rapid analysis. For more complex, unstructured data like images, SPP-FGC+ employs an iterative process where clients and the server collaboratively refine feature representations to achieve superior downstream performance. Extensive experiments demonstrate that our framework achieves state-of-the-art performance, improving clustering accuracy by up to 10% (NMI) over federated baselines while maintaining provable privacy guarantees.
[LG-44] MMA-Sim: Bit-Accurate Reference Model of Tensor Cores and Matrix Cores
Link: https://arxiv.org/abs/2511.10909
Authors: Peichen Xie, Yang Wang, Fan Yang, Mao Yang
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*Comments:
Abstract:The rapidly growing computation demands of deep neural networks (DNNs) have driven hardware vendors to integrate matrix multiplication accelerators (MMAs), such as NVIDIA Tensor Cores and AMD Matrix Cores, into modern GPUs. However, due to distinct and undocumented arithmetic specifications for floating-point matrix multiplication, some MMAs can lead to numerical imprecision and inconsistency that can compromise the stability and reproducibility of DNN training and inference. This paper presents MMA-Sim, the first bit-accurate reference model that reveals the detailed arithmetic behaviors of the MMAs from ten GPU architectures (eight from NVIDIA and two from AMD). By dissecting the MMAs using a combination of targeted and randomized tests, our methodology derives nine arithmetic algorithms to simulate the floating-point matrix multiplication of the MMAs. Large-scale validation confirms bitwise equivalence between MMA-Sim and the real hardware. Using MMA-Sim, we investigate arithmetic behaviors that affect DNN training stability, and identify undocumented behaviors that could lead to significant errors.
[LG-45] Graph Attention Network for Predicting Duration of Large-Scale Power Outages Induced by Natural Disasters
Link: https://arxiv.org/abs/2511.10898
Authors: Chenghao Duan, Chuanyi Ji
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments:
Abstract:Natural disasters such as hurricanes, wildfires, and winter storms have induced large-scale power outages in the U.S., resulting in tremendous economic and societal impacts. Accurately predicting power outage recovery and impact is key to the resilience of the power grid. Recent advances in machine learning offer viable frameworks for estimating power outage duration from geospatial and weather data. However, three major challenges are inherent to the task in a real-world setting: spatial dependency of the data, spatial heterogeneity of the impact, and moderate event data. We propose a novel approach to estimate the duration of severe weather-induced power outages through Graph Attention Networks (GAT). Our network uses a simple structure from unsupervised pre-training, followed by semi-supervised learning. We use field data from four major hurricanes affecting 501 counties in eight Southeastern U.S. states. The model exhibits excellent performance (93% accuracy) and outperforms the existing methods XGBoost, Random Forest, GCN and simple GAT by 2% to 15% in both overall performance and class-wise accuracy.
[LG-46] Multi-View Polymer Representations for the Open Polymer Prediction
Link: https://arxiv.org/abs/2511.10893
Authors: Wonjin Jung, Yongseok Choi
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:We address polymer property prediction with a multi-view design that exploits complementary representations. Our system integrates four families: (i) tabular RDKit/Morgan descriptors, (ii) graph neural networks, (iii) 3D-informed representations, and (iv) pretrained SMILES language models, and averages per-property predictions via a uniform ensemble. Models are trained with 10-fold splits and evaluated with SMILES test-time augmentation. The approach ranks 9th of 2241 teams in the Open Polymer Prediction Challenge at NeurIPS 2025. The submitted ensemble achieves a public MAE of 0.057 and a private MAE of 0.082.
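The uniform ensemble and MAE evaluation mentioned in this abstract amount to simple averaging. The sketch below shows the arithmetic only; the prediction values and targets are made up, the three rows stand for hypothetical model families, and plain unweighted MAE is an assumption (the challenge's exact scoring rule is not given here).

```python
def uniform_ensemble(predictions):
    """Average per-sample predictions across models with equal weights."""
    num_models = len(predictions)
    return [sum(column) / num_models for column in zip(*predictions)]


def mean_absolute_error(y_true, y_pred):
    """Plain, unweighted mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up predictions for one property from three hypothetical model families.
model_predictions = [
    [1.0, 3.0, 5.0],
    [2.0, 3.0, 4.0],
    [3.0, 6.0, 3.0],
]
targets = [2.0, 5.0, 4.0]

blended = uniform_ensemble(model_predictions)
score = mean_absolute_error(targets, blended)
```

In a multi-property setting the same averaging would be repeated once per property, which is what "per-property predictions" suggests.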
[LG-47] Multi-Joint Physics-Informed Deep Learning Framework for Time-Efficient Inverse Dynamics
Link: https://arxiv.org/abs/2511.10878
Authors: Shuhao Ma, Zeyi Huang, Yu Cao, Wesley Doorsamy, Chaoyang Shi, Jun Li, Zhi-Qiang Zhang
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
*Comments: 11 pages
Abstract:Time-efficient estimation of muscle activations and forces across multi-joint systems is critical for clinical assessment and assistive device control. However, conventional approaches are computationally expensive and lack a high-quality labeled dataset for multi-joint applications. To address these challenges, we propose a physics-informed deep learning framework that estimates muscle activations and forces directly from kinematics. The framework employs a novel Multi-Joint Cross-Attention (MJCA) module with Bidirectional Gated Recurrent Unit (BiGRU) layers to capture inter-joint coordination, enabling each joint to adaptively integrate motion information from others. By embedding multi-joint dynamics, inter-joint coupling, and external force interactions into the loss function, our Physics-Informed MJCA-BiGRU (PI-MJCA-BiGRU) delivers physiologically consistent predictions without labeled data while enabling time-efficient inference. Experimental validation on two datasets demonstrates that PI-MJCA-BiGRU achieves performance comparable to conventional supervised methods without requiring ground-truth labels, while the MJCA module significantly enhances inter-joint coordination modeling compared to other baseline architectures.
[LG-48] Architecting software monitors for control-flow anomaly detection through large language models and conformance checking
Link: https://arxiv.org/abs/2511.10876
Authors: Francesco Vitale, Francesco Flammini, Mauro Caporuscio, Nicola Mazzocca
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
*Comments:
Abstract:Context: Ensuring high levels of dependability in modern computer-based systems has become increasingly challenging due to their complexity. Although systems are validated at design time, their behavior can be different at run-time, possibly showing control-flow anomalies due to “unknown unknowns”. Objective: We aim to detect control-flow anomalies through software monitoring, which verifies run-time behavior by logging software execution and detecting deviations from expected control flow. Methods: We propose a methodology to develop software monitors for control-flow anomaly detection through Large Language Models (LLMs) and conformance checking. The methodology builds on existing software development practices to maintain traditional V&V while providing an additional level of robustness and trustworthiness. It leverages LLMs to link design-time models and implementation code, automating source-code instrumentation. The resulting event logs are analyzed via conformance checking, an explainable and effective technique for control-flow anomaly detection. Results: We test the methodology on a case-study scenario from the European Railway Traffic Management System / European Train Control System (ERTMS/ETCS), which is a railway standard for modern interoperable railways. The results obtained from the ERTMS/ETCS case study demonstrate that LLM-based source-code instrumentation can achieve up to 84.775% control-flow coverage of the reference design-time process model, while the subsequent conformance checking-based anomaly detection reaches a peak performance of 96.610% F1-score and 93.515% AUC. Conclusion: Incorporating domain-specific knowledge to guide LLMs in source-code instrumentation yielded reliable, high-quality software logs and enabled effective control-flow anomaly detection through conformance checking.
[LG-49] Go-UT-Bench: A Fine-Tuning Dataset for LLM-Based Unit Test Generation in Go
Link: https://arxiv.org/abs/2511.10868
Authors: Yashshi Pipalani, Hritik Raj, Rajat Ghosh, Vaishnavi Bhargava, Debojyoti Dutta
Subjects: Machine Learning (cs.LG)
*Comments: 9 pages, 5 figures
Abstract:Training data imbalance poses a major challenge for code LLMs. Most available data heavily overrepresents raw open-source code while underrepresenting broader software engineering tasks, especially in low-resource languages like Golang. As a result, models excel at code autocompletion but struggle with real-world developer workflows such as unit test generation. To address this gap, we introduce Go-UT-Bench, a benchmark dataset of 5264 pairs of code and unit tests, drawn from 10 permissively licensed Golang repositories spanning diverse domains. We evaluate its effectiveness as a fine-tuning dataset across two LLM families: mixture-of-experts and dense decoders. Our results show that fine-tuned models outperform their base counterparts on more than 75% of benchmark tasks.
[LG-50] Private Zeroth-Order Optimization with Public Data NEURIPS2025
Link: https://arxiv.org/abs/2511.10859
Authors: Xuchen Gong, Tian Li
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
*Comments: NeurIPS 2025
Abstract:One of the major bottlenecks for deploying popular first-order differentially private (DP) machine learning algorithms (e.g., DP-SGD) lies in their high computation and memory cost, despite the existence of optimized implementations. Zeroth-order methods have promise in mitigating the overhead, as they leverage function evaluations to approximate the gradients and are hence significantly easier to privatize. While recent works have explored zeroth-order approaches in both private and non-private settings, they still suffer from relatively low utilities compared with DP-SGD, and have only been evaluated in limited application domains. In this work, we propose to leverage public information to guide and improve gradient approximation of private zeroth-order algorithms. We explore a suite of public-data-assisted zeroth-order optimizers (PAZO) with minimal overhead. We provide theoretical analyses of the PAZO framework under an assumption of the similarity between public and private data. Empirically, we demonstrate that PAZO achieves superior privacy/utility tradeoffs across vision and text tasks in both pre-training and fine-tuning settings, outperforming the best first-order baselines (with public data) especially in highly private regimes, while offering up to 16x runtime speedup.
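To make "leverage function evaluations to approximate the gradients" concrete, here is a minimal sketch of a generic two-point zeroth-order estimator. It is not PAZO: there is no clipping, no privacy noise, and no public data, and the function names, the smoothing parameter `mu`, and the test objective `sphere` are all illustrative assumptions.

```python
import random


def two_point_estimate(f, x, u, mu=1e-3):
    """Finite-difference slope of f along direction u, scaled back onto u."""
    plus = f([xi + mu * ui for xi, ui in zip(x, u)])
    minus = f([xi - mu * ui for xi, ui in zip(x, u)])
    slope = (plus - minus) / (2.0 * mu)
    return [slope * ui for ui in u]


def zeroth_order_gradient(f, x, num_dirs=100, mu=1e-3, rng=None):
    """Average two-point estimates over random Gaussian directions."""
    rng = rng or random.Random(0)
    total = [0.0] * len(x)
    for _ in range(num_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        g = two_point_estimate(f, x, u, mu)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / num_dirs for t in total]


def sphere(x):
    """Toy objective with known gradient 2*x."""
    return sum(xi * xi for xi in x)


single = two_point_estimate(sphere, [1.0, 2.0], [1.0, 0.0], mu=0.1)
averaged = zeroth_order_gradient(sphere, [1.0, -2.0], num_dirs=20000,
                                 rng=random.Random(0))
```

Each estimate needs only two scalar function values, so a private variant would only have to bound and perturb a scalar per direction rather than a full per-example gradient, which is the usual intuition for why such methods are easier to privatize.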
[LG-51] ExPairT-LLM: Exact Learning for LLM Code Selection by Pairwise Queries
Link: https://arxiv.org/abs/2511.10855
Authors: Tom Yuviler, Dana Drachsler-Cohen
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Despite recent advances in LLMs, the task of code generation is still challenging. To cope, code selection algorithms select the best program from multiple programs generated by an LLM. However, existing algorithms can fail to identify the correct program, either because they can misidentify nonequivalent programs or because they rely on an LLM and assume it always correctly determines the output for every input. We present ExPairT-LLM, an exact learning algorithm for code selection that selects a program by posing to an LLM oracle two new types of queries: pairwise membership and pairwise equivalence. These queries are simpler for LLMs and enable ExPairT-LLM to identify the correct program through a tournament, which is robust to some LLM mistakes. We evaluate ExPairT-LLM on four popular code datasets. Its pass@1 (success rate) outperforms the state-of-the-art code selection algorithm on average by +13.0% and up to +27.1%. It also improves the pass@1 of LLMs performing complex reasoning by +24.0%.
[LG-52] EarthSight: A Distributed Framework for Low-Latency Satellite Intelligence
Link: https://arxiv.org/abs/2511.10834
Authors: Ansel Kaplan Erol, Seungjun Lee, Divya Mahajan
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments:
Abstract:Low-latency delivery of satellite imagery is essential for time-critical applications such as disaster response, intelligence, and infrastructure monitoring. However, traditional pipelines rely on downlinking all captured images before analysis, introducing delays of hours to days due to restricted communication bandwidth. To address these bottlenecks, emerging systems perform onboard machine learning to prioritize which images to transmit. However, these solutions typically treat each satellite as an isolated compute node, limiting scalability and efficiency. Redundant inference across satellites and tasks further strains onboard power and compute costs, constraining mission scope and responsiveness. We present EarthSight, a distributed runtime framework that redefines satellite image intelligence as a distributed decision problem between orbit and ground. EarthSight introduces three core innovations: (1) multi-task inference on satellites using shared backbones to amortize computation across multiple vision tasks; (2) a ground-station query scheduler that aggregates user requests, predicts priorities, and assigns compute budgets to incoming imagery; and (3) dynamic filter ordering, which integrates model selectivity, accuracy, and execution cost to reject low-value images early and conserve resources. EarthSight leverages global context from ground stations and resource-aware adaptive decisions in orbit to enable constellations to perform scalable, low-latency image analysis within strict downlink bandwidth and onboard power budgets. Evaluations using a previously established satellite simulator show that EarthSight reduces average compute time per image by 1.9x and lowers 90th percentile end-to-end latency from first contact to delivery from 51 to 21 minutes compared to the state-of-the-art baseline.
[LG-53] SURFACEBENCH: Can Self-Evolving LLMs Find the Equations of 3D Scientific Surfaces?
Link: https://arxiv.org/abs/2511.10833
Authors: Sanchit Kabra, Shobhnik Kriplani, Parshin Shojaee, Chandan K. Reddy
Subjects: Machine Learning (cs.LG)
*Comments:
Abstract:Equation discovery from data is a core challenge in machine learning for science, requiring the recovery of concise symbolic expressions that govern complex physical and geometric phenomena. Recent approaches with large language models (LLMs) show promise in symbolic regression, but their success often hinges on memorized formulas or overly simplified functional forms. Existing benchmarks exacerbate this limitation: they focus on scalar functions, ignore domain grounding, and rely on brittle string-matching-based metrics that fail to capture scientific equivalence. We introduce SurfaceBench, the first comprehensive benchmark for symbolic surface discovery. SurfaceBench comprises 183 tasks across 15 categories of symbolic complexity, spanning explicit, implicit, and parametric equation representation forms. Each task includes ground-truth equations, variable semantics, and synthetically sampled three-dimensional data. Unlike prior SR datasets, our tasks reflect surface-level structure, resist LLM memorization through novel symbolic compositions, and are grounded in scientific domains such as fluid dynamics, robotics, electromagnetics, and geometry. To evaluate equation discovery quality, we pair symbolic checks with geometry-aware metrics such as Chamfer and Hausdorff distances, capturing both algebraic fidelity and spatial reconstruction accuracy. Our experiments reveal that state-of-the-art frameworks, while occasionally successful on specific families, struggle to generalize across representation types and surface complexities. SurfaceBench thus establishes a challenging and diagnostic testbed that bridges symbolic reasoning with geometric reconstruction, enabling principled benchmarking of progress in compositional generalization, data-driven scientific induction, and geometry-aware reasoning with LLMs. We release the code here: this https URL
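For readers unfamiliar with the two geometry-aware metrics named in this abstract, the following sketch computes them on tiny point sets. The conventions are assumptions, since definitions vary across libraries: Chamfer here is the sum of the two mean nearest-neighbour Euclidean distances (unsquared), and Hausdorff is the symmetric maximum. The benchmark's own implementation may differ, and the sample points are made up.

```python
import math


def nearest_distance(p, points):
    """Euclidean distance from p to its closest point in `points`."""
    return min(math.dist(p, q) for q in points)


def chamfer_distance(a, b):
    """Mean nearest-neighbour distance from a to b plus from b to a."""
    a_to_b = sum(nearest_distance(p, b) for p in a) / len(a)
    b_to_a = sum(nearest_distance(q, a) for q in b) / len(b)
    return a_to_b + b_to_a


def hausdorff_distance(a, b):
    """Largest nearest-neighbour distance in either direction."""
    a_to_b = max(nearest_distance(p, b) for p in a)
    b_to_a = max(nearest_distance(q, a) for q in b)
    return max(a_to_b, b_to_a)


# Made-up samples from a "true" surface and a "predicted" surface.
true_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]

chamfer = chamfer_distance(true_points, pred_points)
hausdorff = hausdorff_distance(true_points, pred_points)
```

Chamfer averages the mismatch over all points, while Hausdorff reports the single worst mismatch, so the two can disagree when a predicted surface is mostly right but has one outlying region.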
[LG-54] Benchmarking Quantum Kernels Across Diverse and Complex Data
Link: https://arxiv.org/abs/2511.10831
Authors: Yuhan Jiang, Matthew Otten
Subjects: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*Comments:
Abstract:Quantum kernel methods are a promising branch of quantum machine learning, yet their practical advantage on diverse, high-dimensional, real-world data remains unverified. Current research has largely been limited to low-dimensional or synthetic datasets, preventing a thorough evaluation of their potential. To address this gap, we developed a variational quantum kernel framework utilizing resource-efficient ansätze for complex classification tasks and introduced a parameter scaling technique to accelerate convergence. We conducted a comprehensive benchmark of this framework on eight challenging, real-world, high-dimensional datasets covering tabular, image, time series, and graph data. Our classically simulated results show that the proposed quantum kernel demonstrated a clear performance advantage over standard classical kernels, such as the radial basis function (RBF) kernel. This work demonstrates that properly designed quantum kernels can function as versatile, high-performance tools, laying a foundation for quantum-enhanced applications in real-world machine learning. Further research is needed to fully assess the practical quantum advantage.
[LG-55] Towards Universal Neural Operators through Multiphysics Pretraining NEURIPS2025
Link: https://arxiv.org/abs/2511.10829
Authors: Mikhail Masliaev,Dmitry Gusarov,Ilya Markov,Alexander Hvatov
Subjects: Machine Learning (cs.LG)
Comments: 5 pages, 1 figure, accepted for Machine Learning and the Physical Sciences Workshop, NeurIPS 2025
Abstract:Although neural operators are widely used in data-driven physical simulations, their training remains computationally expensive. Recent advances address this issue via downstream learning, where a model pretrained on simpler problems is fine-tuned on more complex ones. In this research, we investigate transformer-based neural operators, which have previously been applied only to specific problems, in a more general transfer learning setting. We evaluate their performance across diverse PDE problems, including extrapolation to unseen parameters, incorporation of new variables, and transfer from multi-equation datasets. Our results demonstrate that advanced neural operator architectures can effectively transfer knowledge across PDE problems.
[LG-56] Transformers know more than they can tell – Learning the Collatz sequence
Link: https://arxiv.org/abs/2511.10811
Authors: François Charton,Ashvni Narayanan
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:We investigate transformer prediction of long Collatz steps, a complex arithmetic function that maps odd integers to their distant successors in the Collatz sequence ($u_{n+1} = u_n/2$ if $u_n$ is even, $u_{n+1} = (3u_n+1)/2$ if $u_n$ is odd). Model accuracy varies with the base used to encode input and output. It can be as high as 99.7% for bases 24 and 32, and as low as 37% and 25% for bases 11 and 3. Yet, all models, no matter the base, follow a common learning pattern. As training proceeds, they learn a sequence of classes of inputs that share the same residual modulo $2^p$. Models achieve near-perfect accuracy on these classes, and less than 1% for all other inputs. This maps to a mathematical property of Collatz sequences: the length of the loops involved in the computation of a long Collatz step can be deduced from the binary representation of its input. The learning pattern reflects the model learning to predict inputs associated with increasing loop lengths. An analysis of failure cases reveals that almost all model errors follow predictable patterns. Hallucination, a common feature of large language models, almost never happens. In over 90% of failures, the model performs the correct calculation, but wrongly estimates loop lengths. Our observations give a full account of the algorithms learned by the models. They suggest that the difficulty of learning such a complex arithmetic function lies in figuring out the control structure of the computation – the length of the loops. We believe that the approach outlined here, using mathematical problems as tools for understanding, explaining, and perhaps improving language models, can be applied to a broad range of problems and bear fruitful results.
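A small sketch may help in reading the role of residues modulo 2^p here. It implements only the single-step map quoted in the abstract, plus a digit encoder for an arbitrary base; the paper's exact definition of a "long Collatz step" is not given in the abstract and is not reproduced. The helper names are hypothetical. The classical fact illustrated is that the parities of the first p iterates of this map depend only on the starting value modulo 2^p.

```python
def collatz_step(u):
    """One step of the map in the abstract: u/2 if even, (3u+1)/2 if odd."""
    return u // 2 if u % 2 == 0 else (3 * u + 1) // 2

def parity_vector(u, p):
    """Parities (1 = odd) of the first p iterates, starting from u itself."""
    bits = []
    for _ in range(p):
        bits.append(u % 2)
        u = collatz_step(u)
    return bits

def to_base(n, base):
    """Digits of n in the given base, most significant first."""
    digits = []
    while n > 0:
        n, r = divmod(n, base)
        digits.append(r)
    return digits[::-1] or [0]

# 3 and 19 agree modulo 2**4, so their first four parities coincide.
print(parity_vector(3, 4), parity_vector(19, 4))
# The same integer looks very different to a model depending on the base.
print(to_base(27, 2), to_base(27, 24))
```

Since the branch taken at each step is fixed by the low-order binary digits, a base that exposes those digits directly plausibly makes the control structure easier to read off, which is consistent with the base-dependent accuracies reported.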
[LG-57] Near-optimal Linear Predictive Clustering in Non-separable Spaces via Mixed Integer Programming and Quadratic Pseudo-Boolean Reductions
Link: https://arxiv.org/abs/2511.10809
Authors: Jiazhou Liang,Hassan Khurram,Scott Sanner
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Linear Predictive Clustering (LPC) partitions samples based on shared linear relationships between feature and target variables, with numerous applications including marketing, medicine, and education. Greedy optimization methods, commonly used for LPC, alternate between clustering and linear regression but lack global optimality. While effective for separable clusters, they struggle in non-separable settings where clusters overlap in feature space. In an alternative constrained optimization paradigm, Bertsimas and Shioda (2007) formulated LPC as a Mixed-Integer Program (MIP), ensuring global optimality regardless of separability but suffering from poor scalability. This work builds on the constrained optimization paradigm to introduce two novel approaches that improve the efficiency of global optimization for LPC. By leveraging key theoretical properties of separability, we derive near-optimal approximations with provable error bounds, significantly reducing the MIP formulation’s complexity and improving scalability. Additionally, we can further approximate LPC as a Quadratic Pseudo-Boolean Optimization (QPBO) problem, achieving substantial computational improvements in some settings. Comparative analyses on synthetic and real-world datasets demonstrate that our methods consistently achieve near-optimal solutions with substantially lower regression errors than greedy optimization while exhibiting superior scalability over existing MIP formulations.
[LG-58] Movement-Specific Analysis for FIM Score Classification Using Spatio-Temporal Deep Learning
Link: https://arxiv.org/abs/2511.10713
Authors: Jun Masaki,Ariaki Higashi,Naoko Shinagawa,Kazuhiko Hirata,Yuichi Kurita,Akira Furui
Subjects: Machine Learning (cs.LG)
Comments: 10 pages, 5 figures, 3 tables. Accepted for the 2026 IEEE/SICE International Symposium on System Integration (SII 2026), January 11-14, 2026, Cancun, Mexico
Abstract:The functional independence measure (FIM) is widely used to evaluate patients’ physical independence in activities of daily living. However, traditional FIM assessment imposes a significant burden on both patients and healthcare professionals. To address this challenge, we propose an automated FIM score estimation method that utilizes simple exercises different from the designated FIM assessment actions. Our approach employs a deep neural network architecture integrating a spatial-temporal graph convolutional network (ST-GCN), bidirectional long short-term memory (BiLSTM), and an attention mechanism to estimate FIM motor item scores. The model effectively captures long-term temporal dependencies and identifies key body-joint contributions through learned attention weights. We evaluated our method in a study of 277 rehabilitation patients, focusing on FIM transfer and locomotion items. Our approach successfully distinguishes between completely independent patients and those requiring assistance, achieving balanced accuracies of 70.09-78.79% across different FIM items. Additionally, our analysis reveals specific movement patterns that serve as reliable predictors for particular FIM evaluation items.
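Balanced accuracy, the metric reported above, is the mean of per-class recalls, which matters when one class dominates. The sketch below uses made-up labels (1 = completely independent, 0 = requires assistance) purely for illustration; none of the numbers come from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

def accuracy(y_true, y_pred):
    """Plain fraction of correct predictions, for comparison."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical, imbalanced labels: eight independent patients, two needing help.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

print(accuracy(y_true, y_pred))           # high despite missing half of class 0
print(balanced_accuracy(y_true, y_pred))  # penalises the missed minority class
```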
[LG-59] Differentiable Sparse Identification of Lagrangian Dynamics
Link: https://arxiv.org/abs/2511.10706
Authors: Zitong Zhang,Hao Sun
Subjects: Machine Learning (cs.LG)
Comments:
Abstract:Data-driven discovery of governing equations remains a fundamental challenge in nonlinear dynamics. Although sparse regression techniques have advanced system identification, they struggle with rational functions and noise sensitivity in complex mechanical systems. The Lagrangian formalism offers a promising alternative, as it typically avoids rational expressions and provides a more concise representation of system dynamics. However, existing Lagrangian identification methods are significantly affected by measurement noise and limited data availability. This paper presents a novel differentiable sparse identification framework that addresses these limitations through three key contributions: (1) the first integration of cubic B-spline approximation into Lagrangian system identification, enabling accurate representation of complex nonlinearities, (2) a robust equation discovery mechanism that effectively utilizes measurements while incorporating known physical constraints, (3) a recursive derivative computation scheme based on B-spline basis functions, effectively constraining higher-order derivatives and reducing noise sensitivity on second-order dynamical systems. Compared to baseline methods, the proposed method demonstrates superior performance and enables more accurate and reliable extraction of physical laws from noisy data, particularly in complex mechanical systems.
[LG-60] LAD-BNet: Lag-Aware Dual-Branch Networks for Real-Time Energy Forecasting on Edge Devices
Link: https://arxiv.org/abs/2511.10680
Authors: Jean-Philippe Lignier
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 27 pages, in French. 10 tables, 26 references. Submitted to Energy and AI
Abstract:Real-time energy forecasting on edge devices represents a major challenge for smart grid optimization and intelligent buildings. We present LAD-BNet (Lag-Aware Dual-Branch Network), an innovative neural architecture optimized for edge inference with Google Coral TPU. Our hybrid approach combines a branch dedicated to explicit exploitation of temporal lags with a Temporal Convolutional Network (TCN) featuring dilated convolutions, enabling simultaneous capture of short and long-term dependencies. Tested on real energy consumption data with 10-minute temporal resolution, LAD-BNet achieves 14.49% MAPE at 1-hour horizon with only 18 ms inference time on Edge TPU, representing an 8-12x acceleration compared to CPU. The multi-scale architecture enables predictions up to 12 hours with controlled performance degradation. Our model demonstrates a 2.39% improvement over LSTM baselines and 3.04% over pure TCN architectures, while maintaining a 180 MB memory footprint suitable for embedded device constraints. These results pave the way for industrial applications in real-time energy optimization, demand management, and operational planning.
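Two ingredients mentioned here, the MAPE metric and explicit lag inputs, can be sketched in a few lines. This is an illustration only: the lag set actually used by LAD-BNet is not given in the abstract, so the lags below (1 hour and 2 hours at 10-minute sampling) and the toy series are assumptions, and the function names are hypothetical.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Assumes no zero actuals."""
    errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)

def lag_features(series, lags):
    """For each usable index t, collect series[t - lag] for every lag."""
    start = max(lags)
    return [[series[t - lag] for lag in lags] for t in range(start, len(series))]

# With 10-minute sampling, 6 steps span one hour.
STEPS_PER_HOUR = 6
lags = [1 * STEPS_PER_HOUR, 2 * STEPS_PER_HOUR]   # assumed lag set

series = [float(i % 12) + 10.0 for i in range(36)]  # toy consumption signal
rows = lag_features(series, lags)

print(len(rows), rows[0])
print(mape([100.0, 200.0], [90.0, 220.0]))
```

Feeding such lagged values through a dedicated branch gives the network direct access to periodic structure, while a dilated convolutional branch covers the remaining short- and long-range context.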
[LG-61] Estimating Total Effects in Bipartite Experiments with Spillovers and Partial Eligibility
Link: https://arxiv.org/abs/2511.11564
Authors: Albert Tan,Mohsen Bayati,James Nordlund,Roman Istomin
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 21 pages, 6 figures. Appeared as an oral presentation at the 2025 Conference on Digital Experimentation (CODE) at MIT
Abstract:We study randomized experiments in bipartite systems where only a subset of treatment-side units are eligible for assignment while all units continue to interact, generating interference. We formalize eligibility-constrained bipartite experiments and define estimands aligned with full deployment: the Primary Total Treatment Effect (PTTE) on eligible units and the Secondary Total Treatment Effect (STTE) on ineligible units. Under randomization within the eligible set, we give identification conditions and develop interference-aware ensemble estimators that combine exposure mappings, generalized propensity scores, and flexible machine learning. We further introduce a projection that links treatment- and outcome-level estimands; this mapping is exact under a Linear Additive Edges condition and enables estimation on the (typically much smaller) treatment side with deterministic aggregation to outcomes. In simulations with known ground truth across realistic exposure regimes, the proposed estimators recover PTTE and STTE with low bias and variance and reduce the bias that could arise when interference is ignored. Two field experiments illustrate practical relevance: our method corrects the direction of expected interference bias for a pre-specified metric in both studies and reverses the sign and significance of the primary decision metric in one case.
[LG-62] Non-Euclidean SGD for Structured Optimization: Unified Analysis and Improved Rates
Link: https://arxiv.org/abs/2511.11466
Authors: Dmitry Kovalev,Ekaterina Borodich
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments:
Abstract:Recently, several instances of non-Euclidean SGD, including SignSGD, Lion, and Muon, have attracted significant interest from the optimization community due to their practical success in training deep neural networks. Consequently, a number of works have attempted to explain this success by developing theoretical convergence analyses. Unfortunately, these results cannot properly justify the superior performance of these methods, as they could not beat the convergence rate of vanilla Euclidean SGD. We resolve this important open problem by developing a new unified convergence analysis under the structured smoothness and gradient noise assumption. In particular, our results indicate that non-Euclidean SGD (i) can exploit the sparsity or low-rank structure of the upper bounds on the Hessian and gradient noise, (ii) can provably benefit from popular algorithmic tools such as extrapolation or momentum variance reduction, and (iii) can match the state-of-the-art convergence rates of adaptive and more complex optimization algorithms such as AdaGrad and Shampoo.
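As a concrete instance of a non-Euclidean update, SignSGD moves every coordinate by the same amount in the direction opposite to the gradient sign, which can be viewed as steepest descent with respect to the ℓ∞ norm. The sketch below contrasts one such step with a plain gradient step on a badly scaled quadratic; it uses an exact gradient rather than a stochastic one, and the objective and step size are illustrative assumptions rather than anything from the paper.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def signsgd_step(params, grads, lr):
    """One SignSGD update: each coordinate moves by lr against its gradient sign."""
    return [p - lr * sign(g) for p, g in zip(params, grads)]

def sgd_step(params, grads, lr):
    """Vanilla (Euclidean) gradient step, for comparison."""
    return [p - lr * g for p, g in zip(params, grads)]

# f(x) = 0.5 * (100 * x0**2 + x1**2), so grad f(x) = (100 * x0, x1).
x = [1.0, 1.0]
g = [100.0 * x[0], x[1]]

print(signsgd_step(x, g, lr=0.1))  # both coordinates shrink by the same amount
print(sgd_step(x, g, lr=0.1))      # the steep coordinate overshoots badly
```

The sign update ignores gradient magnitude, so its behaviour depends on the per-coordinate structure of the problem rather than on a single global smoothness constant, which is the kind of structure the analysis above is about.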
[LG-63] Decomposing Direct and Indirect Biases in Linear Models under Demographic Parity Constraint
Link: https://arxiv.org/abs/2511.11294
Authors: Bertille Tierny(1,2),Arthur Charpentier(3),François Hu(2) ((1) Milliman France, R&D Department, AI Lab, (2) ENSAE Paris, (3) Université du Québec à Montréal)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Linear models are widely used in high-stakes decision-making due to their simplicity and interpretability. Yet when fairness constraints such as demographic parity are introduced, their effects on model coefficients, and thus on how predictive bias is distributed across features, remain opaque. Existing approaches on linear models often rely on strong and unrealistic assumptions, or overlook the explicit role of the sensitive attribute, limiting their practical utility for fairness assessment. We extend the work of Chzhen and Schreuder (2022) and Fukuchi and Sakuma (2023) by proposing a post-processing framework that can be applied on top of any linear model to decompose the resulting bias into direct (sensitive-attribute) and indirect (correlated-features) components. Our method analytically characterizes how demographic parity reshapes each model coefficient, including those of both sensitive and non-sensitive features. This enables a transparent, feature-level interpretation of fairness interventions and reveals how bias may persist or shift through correlated variables. Our framework requires no retraining and provides actionable insights for model auditing and mitigation. Experiments on both synthetic and real-world datasets demonstrate that our method captures fairness dynamics missed by prior work, offering a practical and interpretable tool for responsible deployment of linear models.
[LG-64] Drift Estimation for Diffusion Processes Using Neural Networks Based on Discretely Observed Independent Paths
Link: https://arxiv.org/abs/2511.11161
Authors: Yuzhen Zhao,Yating Liu,Marc Hoffmann
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:
Abstract:This paper addresses the nonparametric estimation of the drift function over a compact domain for a time-homogeneous diffusion process, based on high-frequency discrete observations from $N$ independent trajectories. We propose a neural network-based estimator and derive a non-asymptotic convergence rate, decomposed into a training error, an approximation error, and a diffusion-related term scaling as $\log N / N$. For compositional drift functions, we establish an explicit rate. In the numerical experiments, we consider a drift function with local fluctuations generated by a double-layer compositional structure featuring local oscillations, and show that the empirical convergence rate becomes independent of the input dimension $d$. Compared to the B-spline method, the neural network estimator achieves better convergence rates and more effectively captures local features, particularly in higher-dimensional settings.
[LG-65] Heterogeneous Multisource Transfer Learning via Model Averaging for Positive-Unlabeled Data
Link: https://arxiv.org/abs/2511.10919
Authors: Jialei Liu,Jun Liao,Kuangnan Fang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:
Abstract:Positive-Unlabeled (PU) learning presents unique challenges due to the lack of explicitly labeled negative samples, particularly in high-stakes domains such as fraud detection and medical diagnosis. To address data scarcity and privacy constraints, we propose a novel transfer learning with model averaging framework that integrates information from heterogeneous data sources - including fully binary labeled, semi-supervised, and PU data sets - without direct data sharing. For each source domain type, a tailored logistic regression model is fitted, and knowledge is transferred to the PU target domain through model averaging. Optimal weights for combining source models are determined via a cross-validation criterion that minimizes the Kullback-Leibler divergence. We establish theoretical guarantees for weight optimality and convergence, covering both misspecified and correctly specified target models, with further extensions to high-dimensional settings using sparsity-penalized estimators. Extensive simulations and real-world credit risk data analyses demonstrate that our method outperforms other comparative methods in terms of predictive accuracy and robustness, especially under limited labeled data and heterogeneous environments.
[LG-66] Neural Local Wasserstein Regression
Link: https://arxiv.org/abs/2511.10824
Authors: Inga Girshfeld,Xiaohui Chen
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Accepted to TAG-DS 2025. 11 pages, 3 figures
Abstract:We study the estimation problem of distribution-on-distribution regression, where both predictors and responses are probability measures. Existing approaches typically rely on a global optimal transport map or tangent-space linearization, which can be restrictive in approximation capacity and distort geometry in multivariate underlying domains. In this paper, we propose Neural Local Wasserstein Regression, a flexible nonparametric framework that models regression through locally defined transport maps in Wasserstein space. Our method builds on the analogy with classical kernel regression: kernel weights based on the 2-Wasserstein distance localize estimators around reference measures, while neural networks parameterize transport operators that adapt flexibly to complex data geometries. This localized perspective broadens the class of admissible transformations and avoids the limitations of global map assumptions and linearization structures. We develop a practical training procedure using DeepSets-style architectures and Sinkhorn-approximated losses, combined with a greedy reference selection strategy for scalability. Through synthetic experiments on Gaussian and mixture models, as well as distributional prediction tasks on MNIST, we demonstrate that our approach effectively captures nonlinear and high-dimensional distributional relationships that elude existing methods.
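The kernel-regression analogy can be made concrete in the simplest setting. The sketch below assumes one-dimensional empirical measures with equal sample counts, where the 2-Wasserstein distance reduces to matching sorted samples, and it assumes a Gaussian kernel; the paper itself works with Sinkhorn-approximated losses in multivariate domains, and its kernel choice is not stated in the abstract. Function names and data are hypothetical.

```python
import math

def w2_1d(xs, ys):
    """2-Wasserstein distance between two equal-size 1-D empirical measures."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample counts")
    pairs = zip(sorted(xs), sorted(ys))
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(xs))

def kernel_weights(query, references, bandwidth):
    """Normalised Gaussian kernel weights from W2 distances to each reference."""
    raw = [math.exp(-0.5 * (w2_1d(query, r) / bandwidth) ** 2) for r in references]
    total = sum(raw)
    return [w / total for w in raw]

# Two reference measures; the query sits close to the first one.
refs = [[0.0, 1.0, 2.0], [5.0, 6.0, 7.0]]
query = [2.5, 0.5, 1.5]

weights = kernel_weights(query, refs, bandwidth=1.0)
print(w2_1d(query, refs[0]), weights)
```

In a local regression of this kind, the weights decide how much each reference measure's transport map contributes to the prediction at the query measure.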
Information Retrieval
[IR-0] GRIN Transfer: A production-ready tool for libraries to retrieve digital copies from Google Books
Link: https://arxiv.org/abs/2511.11447
Authors: Liza Daly,Matteo Cargnelutti,Catherine Brobston,John Hess,Greg Leppert,Amanda Watson,Jonathan Zittrain
Subjects: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments:
Abstract:Publicly launched in 2004, the Google Books project has scanned tens of millions of items in partnership with libraries around the world. As part of this project, Google created the Google Return Interface (GRIN). Through this platform, libraries can access their scanned collections, the associated metadata, and the ongoing OCR and metadata improvements that become available as Google reprocesses these collections using new technologies. When downloading the Harvard Library Google Books collection from GRIN to develop the Institutional Books dataset, we encountered several challenges related to rate-limiting and atomized metadata within the GRIN platform. To overcome these challenges and help other libraries make more robust use of their Google Books collections, this technical report introduces the initial release of GRIN Transfer. This open-source and production-ready Python pipeline allows partner libraries to efficiently retrieve their Google Books collections from GRIN. This report also introduces an updated version of our Institutional Books 1.0 pipeline, initially used to analyze, augment, and assemble the Institutional Books 1.0 dataset. We have revised this pipeline for compatibility with the output format of GRIN Transfer. A library could pair these two tools to create an end-to-end processing pipeline for their Google Books collection to retrieve, structure, and enhance data available from GRIN. This report gives an overview of how GRIN Transfer was designed to optimize for reliability and usability in different environments, as well as guidance on configuration for various use cases.
[IR-1] Unlocking Advanced Graph Machine Learning Insights through Knowledge Completion on Neo4j Graph Database
Link: https://arxiv.org/abs/2511.11399
Authors: Rosario Napoli,Antonio Celesti,Massimo Villari,Maria Fazio
Subjects: Databases (cs.DB); Information Retrieval (cs.IR)
Comments: Accepted at the 30th IEEE Symposium on Computers and Communications (ISCC) 2025
Abstract:Graph Machine Learning (GML) with Graph Databases (GDBs) has gained significant relevance in recent years, due to its ability to handle complex interconnected data and apply ML techniques using Graph Data Science (GDS). However, a critical gap exists in the current way GDB-GML applications analyze data, especially in terms of Knowledge Completion (KC) in Knowledge Graphs (KGs). In particular, current architectures ignore KC, working on datasets that appear incomplete or fragmented even though they actually contain valuable hidden knowledge. This limitation may cause wrong interpretations when these data are used as input for GML models. This paper proposes an innovative architecture that integrates a KC phase into GDB-GML applications, demonstrating how revealing hidden knowledge can heavily impact datasets’ behavior and metrics. For this purpose, we introduce scalable transitive relationships, which are links that propagate information over the network and are modelled by a decay function, allowing deterministic knowledge flows across multiple nodes. Experimental results demonstrate that our intuition radically reshapes both topology and overall dataset dynamics, underscoring the need for this new GDB-GML architecture to produce better models and unlock the full potential of graph-based data analysis.
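One way to picture transitive relationships governed by a decay function is a breadth-first walk that assigns each indirectly reachable node a weight shrinking with hop count. The sketch below is an assumption-laden illustration on a plain adjacency dictionary: the exponential decay, the cut-off threshold, and the function name are all hypothetical, and no Neo4j or Graph Data Science API is used or implied.

```python
from collections import deque

def transitive_links(graph, source, decay=0.5, min_weight=0.1):
    """Weights of inferred links from source: decay ** hops for hops >= 2."""
    weights = {}
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt in seen:
                continue
            w = decay ** (hops + 1)
            if w < min_weight:
                continue              # too far away: stop propagating here
            seen.add(nxt)
            if hops + 1 > 1:          # direct edges already exist in the graph
                weights[nxt] = w
            queue.append((nxt, hops + 1))
    return weights

# A simple chain A -> B -> C -> D -> E (illustrative data).
graph = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["E"]}
print(transitive_links(graph, "A"))
```

Because the walk and the decay are deterministic, rerunning the completion on the same graph yields the same inferred links, which matches the deterministic knowledge flow described in the abstract.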
[IR-2] SRLF: An Agent-Driven Set-Wise Reflective Learning Framework for Sequential Recommendation
Link: https://arxiv.org/abs/2511.11370
Authors: Jiahao Wang,Bokang Fu,Yu Zhu,Yuli Liu
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:LLM-based agents are emerging as a promising paradigm for simulating user behavior to enhance recommender systems. However, their effectiveness is often limited by existing studies that focus on modeling user ratings for individual items. This point-wise approach leads to prevalent issues such as inaccurate user preference comprehension and rigid item-semantic representations. To address these limitations, we propose the novel Set-wise Reflective Learning Framework (SRLF). Our framework operationalizes a closed-loop “assess-validate-reflect” cycle that harnesses the powerful in-context learning capabilities of LLMs. SRLF departs from conventional point-wise assessment by formulating a holistic judgment on an entire set of items. It accomplishes this by comprehensively analyzing both the intricate interrelationships among items within the set and their collective alignment with the user’s preference profile. This method of set-level contextual understanding allows our model to capture complex relational patterns essential to user behavior, making it significantly more adept for sequential recommendation. Extensive experiments validate our approach, confirming that this set-wise perspective is crucial for achieving state-of-the-art performance in sequential recommendation tasks.
[IR-3] Align$^3$GR: Unified Multi-Level Alignment for LLM-based Generative Recommendation AAAI2026
Link: https://arxiv.org/abs/2511.11255
Authors: Wencai Ye,Mingjie Sun,Shuhang Chen,Wenjin Wu,Peng Jiang
Subjects: Information Retrieval (cs.IR)
Comments: Accepted by AAAI 2026 (Oral)
Abstract:Large Language Models (LLMs) demonstrate significant advantages in leveraging structured world knowledge and multi-step reasoning capabilities. However, fundamental challenges arise when transforming LLMs into real-world recommender systems due to semantic and behavioral misalignment. To bridge this gap, we propose Align$^3$GR, a novel framework that unifies token-level, behavior modeling-level, and preference-level alignment. Our approach introduces: (1) dual tokenization fusing user-item semantic and collaborative signals; (2) enhanced behavior modeling with bidirectional semantic alignment; and (3) a progressive DPO strategy combining self-play (SP-DPO) and real-world feedback (RF-DPO) for dynamic preference adaptation. Experiments show Align$^3$GR outperforms the SOTA baseline by +17.8% in Recall@10 and +20.2% in NDCG@10 on the public dataset, with significant gains in online A/B tests and full-scale deployment on an industrial large-scale recommendation platform.
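For readers less familiar with the two reported metrics, the following sketch computes Recall@K and NDCG@K for a single user under binary relevance with the usual log2 position discount. The item IDs are made up, and the paper's evaluation protocol (candidate sets, averaging over users) is not described in the abstract, so this shows only the per-user formulas.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal

# Hypothetical ranking and ground truth for one user.
ranked = ["a", "b", "c", "d"]
relevant = {"b", "x"}

print(recall_at_k(ranked, relevant, k=3))
print(ndcg_at_k(ranked, relevant, k=3))
```

Recall@K only asks whether relevant items made the cut, whereas NDCG@K also rewards placing them near the top, which is why the two can move by different amounts.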
[IR-4] GovScape: A Public Multimodal Search System for 70 Million Pages of Government PDFs
Link: https://arxiv.org/abs/2511.11010
Authors: Kyle Deeds,Ying-Hsiang Huang,Claire Gong,Shreya Shaji,Alison Yan,Leslie Harka,Samuel J Klein,Shannon Zejiang Shen,Mark Phillips,Trevor Owens,Benjamin Charles Germain Lee
Subjects: Information Retrieval (cs.IR); Digital Libraries (cs.DL)
Comments: 10 pages, 5 figures, 2 tables
Abstract:Efforts over the past three decades have produced web archives containing billions of webpage snapshots and petabytes of data. The End of Term Web Archive alone contains, among other file types, millions of PDFs produced by the federal government. While preservation with web archives has been successful, significant challenges for access and discoverability remain. For example, current affordances for browsing the End of Term PDFs are limited to downloading and browsing individual PDFs, as well as performing basic keyword search across them. In this paper, we introduce GovScape, a public search system that supports multimodal searches across 10,015,993 federal government PDFs from the 2020 End of Term crawl (70,958,487 total PDF pages) - to our knowledge, all renderable PDFs in the 2020 crawl that are 50 pages or under. GovScape supports four primary forms of search over these 10 million PDFs: in addition to providing (1) filter conditions over metadata facets including domain and crawl date and (2) exact text search against the PDF text, we provide (3) semantic text search and (4) visual search against the PDFs across individual pages, enabling users to structure queries such as “redacted documents” or “pie charts.” We detail the constituent components of GovScape, including the search affordances, embedding pipeline, system architecture, and open source codebase. Significantly, the total estimated compute cost for GovScape’s pre-processing pipeline for 10 million PDFs was approximately $1,500, equivalent to 47,000 PDF pages per dollar spent on compute, demonstrating the potential for immediate scalability. Accordingly, we outline steps that we have already begun pursuing toward multimodal search at the 100+ million PDF scale. GovScape can be found at this https URL.
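The cost figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch assumes the quoted compute cost of 1,500 is in US dollars, which is what the "pages per dollar" phrasing implies. The final line is a naive linear extrapolation to 100 million PDFs, an assumption made here for illustration and not a figure from the paper.

```python
TOTAL_PAGES = 70_958_487   # PDF pages reported in the abstract
TOTAL_PDFS = 10_015_993    # PDFs reported in the abstract
COMPUTE_COST_USD = 1_500   # assumed to be US dollars

pages_per_dollar = TOTAL_PAGES / COMPUTE_COST_USD
pages_per_pdf = TOTAL_PAGES / TOTAL_PDFS

# Hypothetical: same cost per PDF at ten times the scale (linear scaling assumed).
cost_at_100m_pdfs = 100_000_000 * COMPUTE_COST_USD / TOTAL_PDFS

print(f"{pages_per_dollar:,.0f} pages per dollar")
print(f"{pages_per_pdf:.2f} pages per PDF on average")
print(f"about ${cost_at_100m_pdfs:,.0f} at 100 million PDFs if costs scale linearly")
```

The first ratio comes out slightly above 47,000, consistent with the rounded figure quoted in the abstract.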
[IR-5] LEMUR: Large scale End-to-end MUltimodal Recommendation
Link: https://arxiv.org/abs/2511.10962
Authors: Xintian Han,Honggang Chen,Quan Lin,Jingyue Gao,Xiangyuan Ren,Lifei Zhu,Zhisheng Ye,Shikang Wu,XiongHang Xie,Xiaochu Gan,Bingzheng Wei,Peng Xu,Zhe Wang,Yuchao Zheng,Jingjian Lin,Di Wu,Junfeng Ge
Subjects: Information Retrieval (cs.IR)
Comments:
Abstract:Traditional ID-based recommender systems often struggle with cold-start and generalization challenges. Multimodal recommendation systems, which leverage textual and visual data, offer a promising solution to mitigate these issues. However, existing industrial approaches typically adopt a two-stage training paradigm: first pretraining a multimodal model, then applying its frozen representations to train the recommendation model. This decoupled framework suffers from misalignment between multimodal learning and recommendation objectives, as well as an inability to adapt dynamically to new data. To address these limitations, we propose LEMUR, the first large-scale multimodal recommender system trained end-to-end from raw data. By jointly optimizing both the multimodal and recommendation components, LEMUR ensures tighter alignment with downstream objectives while enabling real-time parameter updates. Constructing multimodal sequential representations from user history often entails prohibitively high computational costs. To alleviate this bottleneck, we propose a novel memory bank mechanism that incrementally accumulates historical multimodal representations throughout the training process. After one month of deployment in Douyin Search, LEMUR has led to a 0.843% reduction in query change rate decay and a 0.81% improvement in QAUC. Additionally, LEMUR has shown significant gains across key offline metrics for Douyin Advertisement. Our results validate the superiority of end-to-end multimodal recommendation in real-world industrial scenarios.

