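This post lists the latest papers retrieved from arXiv.org on 2025-11-12. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: the daily paper data comes from arXiv.org and is refreshed automatically at around 12:00 each day.

Reminder: if you would like the daily paper data sent to your inbox, leave your email address in the comments.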

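Overview (2025-11-12)

643 papers were updated today, including:

  • Natural Language Processing: 93 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 221 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 139 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 170 papers (Machine Learning (cs.LG))

Natural Language Processing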

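[NLP-0] Training Language Models to Explain Their Own Computations

Quick read: This paper asks whether language models (LMs) can learn to describe their internal computations faithfully, and whether they explain themselves better than other models can. The key idea is to exploit an LM's privileged access to its own internal states: the model is fine-tuned on a small set of labeled examples (only tens of thousands) to generate natural language descriptions for three core tasks: (1) explaining the information encoded by model features, (2) describing the causal structure of internal activations, and (3) analyzing the influence of specific input tokens on outputs. Experiments show that these "explainer models" generalize non-trivially, and that a model explaining itself outperforms a different model explaining it, even when that other model is more capable. This suggests that direct access to a model's own internals is a key factor in explanation quality.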

Link: https://arxiv.org/abs/2511.08579
Authors: Belinda Z. Li, Zifan Carl Guo, Vincent Huang, Jacob Steinhardt, Jacob Andreas
Affiliations: Transluce; MIT CSAIL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 33 pages, 7 tables, 8 figures

Abstract:Can language models (LMs) learn to faithfully describe their internal computations? Are they better able to describe themselves than other models? We study the extent to which LMs’ privileged access to their own internals can be leveraged to produce new techniques for explaining their behavior. Using existing interpretability techniques as a source of ground truth, we fine-tune LMs to generate natural language descriptions of (1) the information encoded by LM features, (2) the causal structure of LMs’ internal activations, and (3) the influence of specific input tokens on LM outputs. When trained with only tens of thousands of example explanations, explainer models exhibit non-trivial generalization to new queries. This generalization appears partly attributable to explainer models’ privileged access to their own internals: using a model to explain its own computations generally works better than using a different model to explain its computations (even if the other model is significantly more capable). Our results suggest not only that LMs can learn to reliably explain their internal computations, but that such explanations offer a scalable complement to existing interpretability methods.

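[NLP-1] Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models

Quick read: This paper aims to improve the reasoning of large language models (LLMs) under parameter constraints, targeting the "latent overthinking" seen in existing recurrent transformer methods: easy tokens that are already predicted correctly after the first forward pass get revised into errors in later iterations. The proposed Think-at-Hard (TaH) method uses a lightweight neural decider to trigger latent iterations only at hard tokens, and Low-Rank Adaptation (LoRA) modules to shift the model objective from general next-token prediction to focused refinement of hard tokens. It also introduces a duo-causal attention mechanism that extends attention along the iteration-depth dimension, allowing information to flow across iterations while preserving sequential parallelism, so reasoning improves markedly without increasing the parameter count.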

Link: https://arxiv.org/abs/2511.08577
Authors: Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang, Yu Wang
Affiliations: Tsinghua University; Infinigence AI; Shanghai Jiao Tong University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Comments: none

Abstract:Improving reasoning capabilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Prior work proposes recurrent transformers, which allocate a fixed number of extra iterations per token to improve generation quality. After the first, standard forward pass, instead of verbalization, last-layer hidden states are fed back as inputs for additional iterations to refine token predictions. Yet we identify a latent overthinking phenomenon: easy token predictions that are already correct after the first pass are sometimes revised into errors in additional iterations. To address this, we propose Think-at-Hard (TaH), a dynamic latent thinking method that iterates deeper only at hard tokens. It employs a lightweight neural decider to trigger latent iterations only at tokens that are likely incorrect after the standard forward pass. During latent iterations, Low-Rank Adaptation (LoRA) modules shift the LLM objective from general next-token prediction to focused hard-token refinement. We further introduce a duo-causal attention mechanism that extends attention from the token sequence dimension to an additional iteration depth dimension. This enables cross-iteration information flow while maintaining full sequential parallelism. Experiments show that TaH boosts LLM reasoning performance across five challenging benchmarks while maintaining the same parameter count. Compared with baselines that iterate twice for all output tokens, TaH delivers 8.1-11.3% accuracy gains while exempting 94% of tokens from the second iteration. Against strong single-iteration Qwen3 models finetuned with the same data, it also delivers 4.0-5.0% accuracy gains. When allowing less than 3% additional parameters from LoRA and the iteration decider, the gains increase to 8.5-12.6% and 5.3-5.4%, respectively. Our code is available at this https URL.

[NLP-2] Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models

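Quick read: This paper studies how large language models (LLMs) express and shift moral judgments in social contexts, and in particular how persona role-play affects their moral responses. It introduces a benchmark based on the Moral Foundations Questionnaire (MFQ) that quantifies two properties: moral susceptibility (the variability of MFQ scores across personas) and moral robustness (the stability of MFQ scores within a persona). A systematic analysis across model families and sizes finds that model family strongly affects moral robustness, that model size mainly affects moral susceptibility, and that the two properties are positively correlated, giving a structured framework for understanding how persona conditioning shapes the moral behavior of LLMs.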

Link: https://arxiv.org/abs/2511.08565
Authors: Davi Bastos Costa, Felippe Alves, Renato Vicente
Affiliations: TELUS Digital Research Hub; Center for Artificial Intelligence and Machine Learning; Institute of Mathematics, Statistics and Computer Science; University of São Paulo
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 9+8 pages, 7 tables, 6 figures

Abstract:Large language models (LLMs) increasingly operate in social contexts, motivating analysis of how they express and shift moral judgments. In this work, we investigate the moral response of LLMs to persona role-play, prompting a LLM to assume a specific character. Using the Moral Foundations Questionnaire (MFQ), we introduce a benchmark that quantifies two properties: moral susceptibility and moral robustness, defined from the variability of MFQ scores across and within personas, respectively. We find that, for moral robustness, model family accounts for most of the variance, while model size shows no systematic effect. The Claude family is, by a significant margin, the most robust, followed by Gemini and GPT-4 models, with other families exhibiting lower robustness. In contrast, moral susceptibility exhibits a mild family effect but a clear within-family size effect, with larger variants being more susceptible. Moreover, robustness and susceptibility are positively correlated, an association that is more pronounced at the family level. Additionally, we present moral foundation profiles for models without persona role-play and for personas averaged across models. Together, these analyses provide a systematic view of how persona conditioning shapes moral behavior in large language models.
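
As a rough illustration of "variability across and within personas", the sketch below computes two toy statistics from repeated questionnaire scores. The exact formulas are assumptions made for this example (standard deviation of per-persona means for susceptibility, inverse of the mean within-persona standard deviation for robustness); the abstract does not give the paper's definitions, and the scores are made up.

```python
from statistics import mean, pstdev


def susceptibility(scores_by_persona):
    """Across-persona variability: spread of the per-persona mean scores (assumed definition)."""
    persona_means = [mean(runs) for runs in scores_by_persona.values()]
    return pstdev(persona_means)


def robustness(scores_by_persona):
    """Within-persona stability: inverse of the average per-persona spread (assumed definition)."""
    within = mean(pstdev(runs) for runs in scores_by_persona.values())
    return 1.0 / within if within > 0 else float("inf")


# Made-up MFQ-style scores from repeated runs under each persona.
toy_scores = {
    "teacher": [3.0, 3.2, 2.8],
    "soldier": [4.0, 4.1, 3.9],
    "artist": [2.0, 2.4, 1.6],
}

print(susceptibility(toy_scores), robustness(toy_scores))
```

Under these assumed definitions, a model whose answers change a lot between personas but little between repeated runs of the same persona would score high on both measures.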

[NLP-3] From Semantic Roles to Opinion Roles: SRL Data Extraction for Multi-Task and Transfer Learning in Low-Resource ORL

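Quick read: This paper addresses how to build a high-quality Semantic Role Labeling (SRL) dataset from a large corpus and transfer it effectively to Opinion Role Labeling (ORL), particularly for low-resource opinion mining. Built on the PropBank annotation framework, its reproducible extraction pipeline aligns predicate-argument structures with surface text, converts syntactic tree pointers into coherent text spans, and applies strict data cleaning to preserve semantic consistency. The resulting SRL dataset contains 97,169 predicate-argument instances mapped to ORL's Holder, Expression, and Target roles, offering a reusable high-quality resource for improving ORL.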

Link: https://arxiv.org/abs/2511.08537
Authors: Amirmohammad Omidi Galdiani, Sepehr Rezaei Melal, Mohammad Norasteh, Arash Yousefi Jordehi, Seyed Abolghasem Mirroshandel
Affiliations: University of Guilan
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 16 figures

Abstract:This report presents a detailed methodology for constructing a high-quality Semantic Role Labeling (SRL) dataset from the Wall Street Journal (WSJ) portion of the OntoNotes 5.0 corpus and adapting it for Opinion Role Labeling (ORL) tasks. Leveraging the PropBank annotation framework, we implement a reproducible extraction pipeline that aligns predicate-argument structures with surface text, converts syntactic tree pointers to coherent spans, and applies rigorous cleaning to ensure semantic fidelity. The resulting dataset comprises 97,169 predicate-argument instances with clearly defined Agent (ARG0), Predicate (REL), and Patient (ARG1) roles, mapped to ORL’s Holder, Expression, and Target schema. We provide a detailed account of our extraction algorithms, discontinuous argument handling, annotation corrections, and statistical analysis of the resulting dataset. This work offers a reusable resource for researchers aiming to leverage SRL for enhancing ORL, especially in low-resource opinion mining scenarios.
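
The role mapping named in the abstract (ARG0 to Holder, REL to Expression, ARG1 to Target) can be sketched as a simple relabeling step. The instance format, the example sentence, and the choice to drop other roles are assumptions for illustration, not the authors' pipeline.

```python
# Mapping stated in the abstract: Agent (ARG0), Predicate (REL), Patient (ARG1).
SRL_TO_ORL = {"ARG0": "Holder", "REL": "Expression", "ARG1": "Target"}


def srl_to_orl(instance):
    """Relabel the spans of one predicate-argument instance; other roles are dropped (assumption)."""
    return {
        SRL_TO_ORL[role]: span
        for role, span in instance.items()
        if role in SRL_TO_ORL
    }


# Hypothetical instance: role label -> text span.
example = {
    "ARG0": "The analysts",
    "REL": "criticized",
    "ARG1": "the merger",
    "ARGM-TMP": "yesterday",
}

print(srl_to_orl(example))
```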

[NLP-4] Investigating CoT Monitorability in Large Reasoning Models

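Quick read: This paper examines CoT monitorability: whether the chain-of-thought (CoT) that Large Reasoning Models (LRMs) produce during decision-making can be used to detect misbehavior such as shortcut use or sycophancy. Two problems stand in the way. First, a model's CoT does not always faithfully reflect the real basis for its decisions (verbalization). Second, CoT-based monitors may be too sensitive or not sensitive enough, and can be misled by long, elaborate reasoning. The paper offers a systematic empirical framework that quantifies monitorability along two dimensions, verbalization quality and monitor reliability, and proposes MoME (Monitor via Model Evaluation), a paradigm in which LLMs make structured judgments about other models' CoT and back them with evidence, improving the effectiveness and robustness of misbehavior detection.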

Link: https://arxiv.org/abs/2511.08525
Authors: Shu Yang, Junchao Wu, Xilin Gou, Xuansheng Wu, Derek Wong, Ninhao Liu, Di Wang
Affiliations: Provable Responsible AI and Data Analytics (PRADA) Lab; King Abdullah University of Science and Technology; University of Georgia; University of Macau
Subjects: Computation and Language (cs.CL)
Comments: none

Abstract:Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex tasks by engaging in extended reasoning before producing final answers. Beyond improving abilities, these detailed reasoning traces also create a new opportunity for AI safety, CoT Monitorability: monitoring potential model misbehavior, such as the use of shortcuts or sycophancy, through their chain-of-thought (CoT) during decision-making. However, two key fundamental challenges arise when attempting to build more effective monitors through CoT analysis. First, as prior research on CoT faithfulness has pointed out, models do not always truthfully represent their internal decision-making in the generated reasoning. Second, monitors themselves may be either overly sensitive or insufficiently sensitive, and can potentially be deceived by models’ long, elaborate reasoning traces. In this paper, we present the first systematic investigation of the challenges and potential of CoT monitorability. Motivated by two fundamental challenges we mentioned before, we structure our study around two central perspectives: (i) verbalization: to what extent do LRMs faithfully verbalize the true factors guiding their decisions in the CoT, and (ii) monitor reliability: to what extent can misbehavior be reliably detected by a CoT-based monitor? Specifically, we provide empirical evidence and correlation analyses between verbalization quality, monitor reliability, and LLM performance across mathematical, scientific, and ethical domains. Then we further investigate how different CoT intervention methods, designed to improve reasoning efficiency or performance, will affect monitoring effectiveness. Finally, we propose MoME, a new paradigm in which LLMs monitor other models’ misbehavior through their CoT and provide structured judgments along with supporting evidence.

[NLP-5] AlphaResearch: Accelerating New Algorithm Discovery with Language Models

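Quick read: Large language models (LLMs) do well on complex but verifiable problems yet struggle to discover unknown algorithms. To enable autonomous algorithm discovery, the authors propose AlphaResearch, an autonomous research agent built around a dual research environment that combines execution-based verification with simulated real-world peer review, balancing feasibility against innovation. The agent iterates over three steps: propose new ideas, verify them in the dual environment, and optimize the proposals. The paper also introduces the AlphaResearchComp benchmark of eight carefully curated open-ended algorithmic competition problems, supporting reproducible and objective evaluation. AlphaResearch achieves a 2/8 win rate across the eight tasks and reaches best-known performance on the "packing circles" problem, showing the potential of LLMs to accelerate algorithm discovery.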

Link: https://arxiv.org/abs/2511.08522
Authors: Zhaojian Yu, Kaiyue Feng, Yilun Zhao, Shilin He, Xiao-Ping Zhang, Arman Cohan
Affiliations: Tsinghua University; New York University; Yale University; ByteDance
Subjects: Computation and Language (cs.CL)
Comments: none

Abstract:Large language models have made significant progress in complex but easy-to-verify problems, yet they still struggle with discovering the unknown. In this paper, we present AlphaResearch, an autonomous research agent designed to discover new algorithms on open-ended problems. To synergize the feasibility and innovation of the discovery process, we construct a novel dual research environment by combining the execution-based verify and simulated real-world peer review environment. AlphaResearch discovers new algorithm by iteratively running the following steps: (1) propose new ideas (2) verify the ideas in the dual research environment (3) optimize the research proposals for better performance. To promote a transparent evaluation process, we construct AlphaResearchComp, a new evaluation benchmark that includes an eight open-ended algorithmic problems competition, with each problem carefully curated and verified through executable pipelines, objective metrics, and reproducibility checks. AlphaResearch gets a 2/8 win rate in head-to-head comparison with human researchers, demonstrate the possibility of accelerating algorithm discovery with LLMs. Notably, the algorithm discovered by AlphaResearch on the "packing circles" problem achieves the best-of-known performance, surpassing the results of human researchers and strong baselines from recent work (e.g., AlphaEvolve). Additionally, we conduct a comprehensive analysis of the remaining challenges of the 6/8 failure cases, providing valuable insights for future research.

[NLP-6] Introducing A Bangla Sentence - Gloss Pair Dataset for Bangla Sign Language Translation and Research

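Quick read: Bangla Sign Language (BdSL) translation is a low-resource natural language processing (NLP) task that lacks large-scale sentence-level datasets; existing work is mostly limited to word-level and alphabet-level recognition. This paper builds Bangla-SGP, a parallel corpus of 1,000 high-quality human-annotated sentence-gloss pairs, and augments it with about 3,000 synthetic samples generated by a Retrieval-Augmented Generation (RAG) pipeline driven by syntactic and morphological rules. The approach combines annotation by a professional signer with rule-driven prompt engineering to improve sentence-level BdSL translation, and fine-tuning results on transformer models such as mBart50 and Google mT5 support its effectiveness.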

Link: https://arxiv.org/abs/2511.08507
Authors: Neelavro Saha, Rafi Shahriyar, Nafis Ashraf Roudra, Saadman Sakib, Annajiat Alim Rasel
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: none

Abstract:Bangla Sign Language (BdSL) translation represents a low-resource NLP task due to the lack of large-scale datasets that address sentence-level translation. Correspondingly, existing research in this field has been limited to word and alphabet level detection. In this work, we introduce Bangla-SGP, a novel parallel dataset consisting of 1,000 human-annotated sentence-gloss pairs which was augmented with around 3,000 synthetically generated pairs using syntactic and morphological rules through a rule-based Retrieval-Augmented Generation (RAG) pipeline. The gloss sequences of the spoken Bangla sentences are made up of individual glosses which are Bangla sign supported words and serve as an intermediate representation for a continuous sign. Our dataset consists of 1000 high quality Bangla sentences that are manually annotated into a gloss sequence by a professional signer. The augmentation process incorporates rule-based linguistic strategies and prompt engineering techniques that we have adopted by critically analyzing our human annotated sentence-gloss pairs and by working closely with our professional signer. Furthermore, we fine-tune several transformer-based models such as mBart50, Google mT5, GPT4.1-nano and evaluate their sentence-to-gloss translation performance using BLEU scores, based on these evaluation metrics we compare the model’s gloss-translation consistency across our dataset and the RWTH-PHOENIX-2014T benchmark.

[NLP-7] Structured RAG for Answering Aggregative Questions

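Quick read: Current Retrieval-Augmented Generation (RAG) methods target cases where a single retrieval returns a few relevant documents, so they struggle with aggregative queries that require combining and reasoning over many documents. The proposed S-RAG framework builds a structured representation of the corpus at ingestion time and, at inference time, translates natural-language queries into formal queries over that representation, enabling information aggregation and reasoning over large document collections.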

Link: https://arxiv.org/abs/2511.08505
Authors: Omri Koshorek, Niv Granot, Aviv Alloni, Shahar Admati, Roee Hendel, Ido Weiss, Alan Arazi, Shay-Nitzan Cohen, Yonatan Belinkov
Affiliations: AI21 Labs
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: none

Abstract:Retrieval-Augmented Generation (RAG) has become the dominant approach for answering questions over large corpora. However, current datasets and methods are highly focused on cases where only a small part of the corpus (usually a few paragraphs) is relevant per query, and fail to capture the rich world of aggregative queries. These require gathering information from a large set of documents and reasoning over them. To address this gap, we propose S-RAG, an approach specifically designed for such queries. At ingestion time, S-RAG constructs a structured representation of the corpus; at inference time, it translates natural-language queries into formal queries over said representation. To validate our approach and promote further research in this area, we introduce two new datasets of aggregative queries: HOTELS and WORLD CUP. Experiments with S-RAG on the newly introduced datasets, as well as on a public benchmark, demonstrate that it substantially outperforms both common RAG systems and long-context LLMs.
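
The two-phase idea (structure the corpus at ingestion, answer with a formal query at inference) can be sketched with an in-memory SQLite table. This is only a toy under stated assumptions: the `hotels` schema, the records, and the use of SQL as the formal query language are all hypothetical, and the step that would translate the question into SQL (for example an LLM call) is replaced by a hard-coded string.

```python
import sqlite3


def ingest(records):
    """Ingestion time: store one row of extracted attributes per document."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hotels (name TEXT, city TEXT, rating REAL)")
    conn.executemany("INSERT INTO hotels VALUES (?, ?, ?)", records)
    return conn


# Hypothetical attributes, as if extracted from one document per hotel.
records = [("Alpha", "Paris", 4.5), ("Beta", "Paris", 3.5), ("Gamma", "Rome", 4.0)]
conn = ingest(records)

# Inference time: "What is the average rating of hotels in Paris?" is assumed
# to have been translated into the formal query below.
sql = "SELECT AVG(rating) FROM hotels WHERE city = ?"
(avg_rating,) = conn.execute(sql, ("Paris",)).fetchone()
print(avg_rating)
```

The aggregate is computed over every matching row, which is the property a top-k passage retriever cannot guarantee when the relevant documents number in the hundreds.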

[NLP-8] SPEAR-MM: Selective Parameter Evaluation and Restoration via Model Merging for Efficient Financial LLM Adaptation

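Quick read: This paper tackles catastrophic forgetting caused by continual pretraining when adapting large language models (LLMs) to finance: the model gains on financial tasks but loses much of the general reasoning ability that customer service and complex financial analysis depend on. The proposed framework, Selective Parameter Evaluation and Restoration via Model Merging (SPEAR-MM), uses post-hoc analysis to quantify each transformer layer's impact on general benchmarks, then selectively freezes or restores specific layers through spherical interpolation merging. It retains 91.2% of general capabilities while keeping 94% of the domain adaptation gains, cuts computational cost by 90%, and provides interpretable control over the trade-off.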

Link: https://arxiv.org/abs/2511.08500
Authors: Berkcan Kapusuzoglu, Supriyo Chakraborty, Renkun Ni, Stephen Rawls, Sambit Sahu
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Spectral Theory (math.SP)
Comments: none

Abstract:Large language models (LLMs) adapted to financial domains often suffer from catastrophic forgetting of general reasoning capabilities essential for customer interactions and complex financial analysis. We introduce Selective Parameter Evaluation and Restoration via Model Merging (SPEAR-MM), a practical framework that preserves critical capabilities while enabling domain adaptation. Our method approximates layer-wise impact on external benchmarks through post-hoc analysis, then selectively freezes or restores transformer layers via spherical interpolation merging. Applied to LLaMA-3.1-8B for financial tasks, SPEAR-MM achieves 91.2% retention of general capabilities versus 69.7% for standard continual pretraining, while maintaining 94% of domain adaptation gains. The approach provides interpretable trade-off control and reduces computational costs by 90% crucial for resource-constrained financial institutions.
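
Spherical interpolation between two weight vectors can be sketched as follows. The `slerp` function is the standard formula; `merge_layers`, the per-layer coefficient dictionary, and the flat-list weight format are hypothetical names and simplifications introduced here, and how SPEAR-MM actually chooses which layers to restore is not shown.

```python
import math


def slerp(v0, v1, t):
    """Spherical linear interpolation between two non-zero weight vectors, t in [0, 1]."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    sin_omega = math.sin(omega)
    if sin_omega < 1e-6:  # (anti)parallel vectors: fall back to linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * omega) / sin_omega
    s1 = math.sin(t * omega) / sin_omega
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]


def merge_layers(base, adapted, t_per_layer):
    """Hypothetical helper: t=0 restores the base layer, t=1 keeps the adapted one."""
    return {
        name: slerp(base[name], adapted[name], t_per_layer[name])
        for name in base
    }


base = {"layer0": [1.0, 0.0], "layer1": [1.0, 0.0]}
adapted = {"layer0": [0.0, 1.0], "layer1": [0.0, 1.0]}
merged = merge_layers(base, adapted, {"layer0": 0.0, "layer1": 0.5})
print(merged)
```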

[NLP-9] How Brittle is Agent Safety? Rethinking Agent Risk under Intent Concealment and Task Complexity

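Quick read: Safety evaluations of agents driven by large language models (LLMs) focus mainly on atomic harms and overlook covert threats in which malicious intent is hidden or diluted within complex tasks. This paper proposes a two-dimensional analysis framework that assesses agent safety brittleness along two orthogonal dimensions, intent concealment and task complexity. To support it, the authors build OASIS (Orthogonal Agent Safety Inquiry Suite), a hierarchical benchmark with fine-grained annotations and a high-fidelity simulation sandbox. The results show that safety alignment degrades sharply and predictably as intent becomes more concealed, and reveal a "Complexity Paradox": agents appear safer on harder tasks only because of capability limitations, not because they are actually safer.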

Link: https://arxiv.org/abs/2511.08487
Authors: Zihan Ma, Dongsheng Zhu, Shudong Liu, Taolin Zhang, Junnan Liu, Qingqiu Li, Minnan Luo, Songyang Zhang, Kai Chen
Affiliations: Xi'an Jiaotong University; Shanghai AI Laboratory; MOE KLINNS Lab
Subjects: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Comments: none

Abstract:Current safety evaluations for LLM-driven agents primarily focus on atomic harms, failing to address sophisticated threats where malicious intent is concealed or diluted within complex tasks. We address this gap with a two-dimensional analysis of agent safety brittleness under the orthogonal pressures of intent concealment and task complexity. To enable this, we introduce OASIS (Orthogonal Agent Safety Inquiry Suite), a hierarchical benchmark with fine-grained annotations and a high-fidelity simulation sandbox. Our findings reveal two critical phenomena: safety alignment degrades sharply and predictably as intent becomes obscured, and a “Complexity Paradox” emerges, where agents seem safer on harder tasks only due to capability limitations. By releasing OASIS and its simulation environment, we provide a principled foundation for probing and strengthening agent safety in these overlooked dimensions.

[NLP-10] Generative Artificial Intelligence in Qualitative Research Methods: Between Hype and Risks?

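Quick read: This paper asks whether using generative AI (genAI) in qualitative coding methods is methodologically valid. Its central argument is that, although genAI is widely promoted as an efficiency tool, its lack of meaningful documentation, the opacity of commercial algorithms, and its tendency to produce incorrect outputs leave it unable to meet the rigor and trustworthiness that qualitative research requires. The authors advise putting methodology first: researchers should rank the soundness and reliability of the research design above technological novelty, so that uncritical adoption of genAI does not erode the quality and credibility of qualitative research.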

Link: https://arxiv.org/abs/2511.08461
Authors: Maria Couto Teixeira, Marisa Tschopp, Anna Jobin
Affiliations: unknown
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments: 8 pages, peer-reviewed position paper accepted at CONVERSATIONS 2025

Abstract:As Artificial Intelligence (AI) is increasingly promoted and used in qualitative research, it also raises profound methodological issues. This position paper critically interrogates the role of generative AI (genAI) in the context of qualitative coding methodologies. Despite widespread hype and claims of efficiency, we propose that genAI is not methodologically valid within qualitative inquiries, and its use risks undermining the robustness and trustworthiness of qualitative research. The lack of meaningful documentation, commercial opacity, and the inherent tendencies of genAI systems to produce incorrect outputs all contribute to weakening methodological rigor. Overall, the balance between risk and benefits does not support the use of genAI in qualitative research, and our position paper cautions researchers to put sound methodology before technological novelty.

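[NLP-11] Bot Meets Shortcut: How Can LLMs Aid in Handling Unknown Invariance OOD Scenarios?

Quick read: Existing social bot detectors are not robust enough in real-world settings, in part because of shortcut learning: models rely on task-irrelevant surface text features (such as particular words or syntactic patterns) instead of causally relevant user behavior features. The paper uses large language models (LLMs) for counterfactual data augmentation to reduce shortcut dependence from both the data and the model side. On the data side it adjusts feature distributions at the level of individual user text and of the whole dataset; on the model side it strengthens the model's ability to extract causal information. Together these substantially improve detection accuracy when misleading textual cues are present.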

Link: https://arxiv.org/abs/2511.08455
Authors: Shiyan Zheng, Herun Wan, Minnan Luo, Junhang Huang
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: none

Abstract:While existing social bot detectors perform well on benchmarks, their robustness across diverse real-world scenarios remains limited due to unclear ground truth and varied misleading cues. In particular, the impact of shortcut learning, where models rely on spurious correlations instead of capturing causal task-relevant features, has received limited attention. To address this gap, we conduct an in-depth study to assess how detectors are influenced by potential shortcuts based on textual features, which are most susceptible to manipulation by social bots. We design a series of shortcut scenarios by constructing spurious associations between user labels and superficial textual cues to evaluate model robustness. Results show that shifts in irrelevant feature distributions significantly degrade social bot detector performance, with an average relative accuracy drop of 32% in the baseline models. To tackle this challenge, we propose mitigation strategies based on large language models, leveraging counterfactual data augmentation. These methods mitigate the problem from data and model perspectives across three levels, including data distribution at both the individual user text and overall dataset levels, as well as the model’s ability to extract causal information. Our strategies achieve an average relative performance improvement of 56% under shortcut scenarios.

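[NLP-12] Interaction Dynamics as a Reward Signal for LLMs

Quick read: Alignment of large language models (LLMs) for multi-turn conversation relies heavily on the text content itself as the reward signal, overlooking interaction dynamics as a complementary source of information. This paper proposes TRACE (Trajectory-based Reward for Agent Collaboration Estimation), a reward signal derived from the geometric properties of a dialogue's embedding trajectory, which models collaboration quality by capturing the geometry of the interaction structure. A reward model trained only on these structural signals reaches pairwise accuracy comparable to a baseline that analyzes the full transcript (68.20% vs. 70.04%), and a hybrid model combining both performs best (80.17%). This shows that interaction dynamics and text content are strongly complementary, and yields a privacy-friendly framework and diagnostic tool for dialogue alignment.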

Link: https://arxiv.org/abs/2511.08394
Authors: Sian Gooding, Edward Grefenstette
Affiliations: Google DeepMind
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: none

Abstract:The alignment of Large Language Models (LLMs) for multi-turn conversations typically relies on reward signals derived from the content of the text. This approach, however, overlooks a rich, complementary source of signal: the dynamics of the interaction itself. This paper introduces TRACE (Trajectory-based Reward for Agent Collaboration Estimation), a novel reward signal derived from the geometric properties of a dialogue’s embedding trajectory–a concept we term ‘conversational geometry’. Our central finding is that a reward model trained only on these structural signals achieves a pairwise accuracy (68.20%) comparable to a powerful LLM baseline that analyzes the full transcript (70.04%). Furthermore, a hybrid model combining interaction dynamics with textual analysis achieves the highest performance (80.17%), demonstrating their complementary nature. This work provides strong evidence that for interactive settings, how an agent communicates is as powerful a predictor of success as what it says, offering a new, privacy-preserving framework that not only aligns agents but also serves as a diagnostic tool for understanding the distinct interaction patterns that drive successful collaboration.

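[NLP-13] PCRLLM: Proof-Carrying Reasoning with Large Language Models under Stepwise Logical Constraints

Quick read: Large language models (LLMs) often lack logical coherence when reasoning: they map premises to conclusions without following explicit inference rules, which makes the reasoning unverifiable and hard to trust. The proposed Proof-Carrying Reasoning with LLMs (PCRLLM) framework constrains each reasoning step to a single inference and requires the output to state its premises, inference rule, and conclusion explicitly, enabling chain-level verification against a target logic. This improves trustworthiness in black-box settings and also supports systematic multi-model collaboration, with intermediate steps integrated under formal rules.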

Link: https://arxiv.org/abs/2511.08392
Authors: Tangrui Li, Pei Wang, Hongzheng Wang, Christian Hahm, Matteo Spatola, Justin Shi
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: none

Abstract:Large Language Models (LLMs) often exhibit limited logical coherence, mapping premises to conclusions without adherence to explicit inference rules. We propose Proof-Carrying Reasoning with LLMs (PCRLLM), a framework that constrains reasoning to single-step inferences while preserving natural language formulations. Each output explicitly specifies premises, rules, and conclusions, thereby enabling verification against a target logic. This mechanism mitigates trustworthiness concerns by supporting chain-level validation even in black-box settings. Moreover, PCRLLM facilitates systematic multi-LLM collaboration, allowing intermediate steps to be compared and integrated under formal rules. Finally, we introduce a benchmark schema for generating large-scale step-level reasoning data, combining natural language expressiveness with formal rigor.
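
To make "each output explicitly specifies premises, rules, and conclusions" concrete, here is a minimal sketch of a step and chain checker for a single rule, modus ponens. The step dictionary format, the tuple encoding of implications, and the rule name are all hypothetical; the paper's target logic and output schema are not given in the abstract.

```python
def check_step(step):
    """Verify one single-step inference against a tiny rule set (only modus ponens here)."""
    if step["rule"] == "modus_ponens" and len(step["premises"]) == 2:
        # Premises must be an implication ("if", p, q) followed by its antecedent p.
        implication, antecedent = step["premises"]
        return (
            isinstance(implication, tuple)
            and len(implication) == 3
            and implication[0] == "if"
            and implication[1] == antecedent
            and implication[2] == step["conclusion"]
        )
    return False  # unknown or malformed steps are rejected


def check_chain(steps, axioms):
    """Chain-level validation: each premise must be an axiom or an earlier conclusion."""
    known = set(axioms)
    for step in steps:
        if not all(p in known for p in step["premises"]) or not check_step(step):
            return False
        known.add(step["conclusion"])
    return True


axioms = [("if", "rain", "wet"), ("if", "wet", "slippery"), "rain"]
steps = [
    {"premises": [("if", "rain", "wet"), "rain"], "rule": "modus_ponens", "conclusion": "wet"},
    {"premises": [("if", "wet", "slippery"), "wet"], "rule": "modus_ponens", "conclusion": "slippery"},
]
print(check_chain(steps, axioms))
```

Because the checker only needs the stated premises, rule, and conclusion, it works even when the model that produced the steps is a black box.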

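[NLP-14] TurkEmbed: Turkish Embedding Model on NLI STS Tasks

Quick read: Current Turkish embedding models underperform on Natural Language Inference (NLI) and Semantic Textual Similarity (STS) tasks, in part because they rely on machine-translated datasets that limit accuracy and semantic understanding. The proposed TurkEmbed model combines diverse training datasets with advanced training techniques such as Matryoshka representation learning to produce more robust and accurate embeddings. The design allows fast encoding in resource-constrained environments, and TurkEmbed outperforms the current state-of-the-art model, Emrecan, by 1-4% on the All-NLI-TR and STS-b-TR benchmarks.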

Link: https://arxiv.org/abs/2511.08376
Authors: Özay Ezerceli, Gizem Gümüşçekiçci, Tuğba Erkoç, Berke Özenç
Affiliations: New Mind AI; Isik University
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 9 pages, 1 Figure, 4 Tables, ASYU Conference. 2025 IEEE 11th International Conference on Advances in Software, hardware and Systems Engineering (ASYU)

Abstract:This paper introduces TurkEmbed, a novel Turkish language embedding model designed to outperform existing models, particularly in Natural Language Inference (NLI) and Semantic Textual Similarity (STS) tasks. Current Turkish embedding models often rely on machine-translated datasets, potentially limiting their accuracy and semantic understanding. TurkEmbed utilizes a combination of diverse datasets and advanced training techniques, including matryoshka representation learning, to achieve more robust and accurate embeddings. This approach enables the model to adapt to various resource-constrained environments, offering faster encoding capabilities. Our evaluation on the Turkish STS-b-TR dataset, using Pearson and Spearman correlation metrics, demonstrates significant improvements in semantic similarity tasks. Furthermore, TurkEmbed surpasses the current state-of-the-art model, Emrecan, on All-NLI-TR and STS-b-TR benchmarks, achieving a 1-4% improvement. TurkEmbed promises to enhance the Turkish NLP ecosystem by providing a more nuanced understanding of language and facilitating advancements in downstream applications.
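
Matryoshka-style embeddings are trained so that a prefix of the vector remains usable on its own, which is what allows adaptation to tighter resource budgets. A minimal sketch of the inference-side step, truncating and renormalizing, is shown below; the vectors are made up and nothing here calls the TurkEmbed model itself.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and L2-renormalize the result."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


# Made-up 4-dimensional embeddings, truncated to 2 dimensions for cheaper comparison.
emb_a = truncate_embedding([3.0, 4.0, 12.0, 0.5], 2)
emb_b = truncate_embedding([4.0, 3.0, -1.0, 2.0], 2)
print(cosine(emb_a, emb_b))
```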

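[NLP-15] The Dynamic Articulatory Model DYNARTmo: Dynamic Movement Generation and Speech Gestures

Quick read: This paper addresses how to model the hierarchical control of speech production, from linguistic representation through articulator movement to acoustic output; that is, how to map abstract speech gestures onto continuous articulator trajectories within a computational framework. It presents DYNARTmo, a neurobiologically inspired dynamic articulatory model built on a gesture inventory and the coordination of gestures in a gesture score. The model translates high-level speech gestures into continuous articulator trajectories that drive a vocal tract model to produce acoustic output.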

Link: https://arxiv.org/abs/2511.08372
Authors: Bernd J. Kröger
Affiliations: RWTH Aachen University; Kröger Lab
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 3 figures, 4 tables, and supplementary material (python code)

Abstract:This paper describes the current implementation of the dynamic articulatory model DYNARTmo, which generates continuous articulator movements based on the concept of speech gestures and a corresponding gesture score. The model provides a neurobiologically inspired computational framework for simulating the hierarchical control of speech production from linguistic representation to articulatory-acoustic realization. We present the structure of the gesture inventory, the coordination of gestures in the gesture score, and their translation into continuous articulator trajectories controlling the DYNARTmo vocal tract model.

[NLP-16] DPRM: A Dual Implicit Process Reward Model in Multi-Hop Question Answering

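Quick read: This paper addresses weaknesses in how reasoning is evaluated for generative models on Multi-hop Question Answering (MHQA). Outcome Reward Models (ORMs) give feedback only on the final answer and cannot assess multi-step reasoning, while traditional Process Reward Models (PRMs) can assess reasoning steps but depend on costly human annotation or rollout generation. Implicit PRMs derive step rewards from outcome signals through reward parameterization, with no explicit annotation, but in MHQA they cannot handle the structural constraints of Knowledge Graphs (KGs) or capture potential inconsistency between Chain of Thought (CoT) and KG paths. The proposed Dual Implicit Process Reward Model (DPRM) trains two separate implicit PRMs, CoT-PRM and KG-PRM, which derive step rewards for CoT and KG reasoning from outcome signals. A consistency constraint lets the two verify each other and jointly optimize reasoning paths, improving reasoning accuracy and robustness on MHQA.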

Link: https://arxiv.org/abs/2511.08364
Authors: Xinyi Wang, Yiping Song, Zhiliang Tian, Bo Liu, Tingjin Luo, Minlie Huang
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: none

Abstract:In multi-hop question answering (MHQA) tasks, Chain of Thought (CoT) improves the quality of generation by guiding large language models (LLMs) through multi-step reasoning, and Knowledge Graphs (KGs) reduce hallucinations via semantic matching. Outcome Reward Models (ORMs) provide feedback after generating the final answers but fail to evaluate the process for multi-step reasoning. Traditional Process Reward Models (PRMs) evaluate the reasoning process but require costly human annotations or rollout generation. While implicit PRM is trained only with outcome signals and derives step rewards through reward parameterization without explicit annotations, it is more suitable for multi-step reasoning in MHQA tasks. However, existing implicit PRM has only been explored for plain text scenarios. When adapting to MHQA tasks, it cannot handle the graph structure constraints in KGs and capture the potential inconsistency between CoT and KG paths. To address these limitations, we propose the DPRM (Dual Implicit Process Reward Model). It trains two implicit PRMs for CoT and KG reasoning in MHQA tasks. Both PRMs, namely KG-PRM and CoT-PRM, derive step-level rewards from outcome signals via reward parameterization without additional explicit annotations. Among them, KG-PRM uses preference pairs to learn structural constraints from KGs. DPRM further introduces a consistency constraint between CoT and KG reasoning steps, making the two PRMs mutually verify and collaboratively optimize the reasoning paths. We also provide a theoretical demonstration of the derivation of process rewards. Experimental results show that our method outperforms 13 baselines on multiple datasets with up to 16.6% improvement on Hit@1.

[NLP-17] AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress

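[Quick read]: This paper addresses the weak performance of large language models (LLMs) on multi-turn decision-making tasks (agent tasks such as web shopping and browser navigation), which require a sequence of intelligent decisions based on environmental feedback. Prior approaches rely on elaborate prompt engineering or fine-tuning on expert trajectories, which generalize poorly and need data that is expensive to collect. The key contribution is a new process reward model (PRM) named AgentPRM that redefines the reward: rather than scoring each step for correctness, it evaluates how close each decision brings the agent to the final goal and how much it contributes within the sequence, capturing dependencies between decisions and enabling better progress tracking and exploration-exploitation balance. To obtain training data efficiently, the authors use a Temporal Difference (TD)-based estimate combined with Generalized Advantage Estimation (GAE), which is more sample-efficient than prior methods. Experiments show that AgentPRM is over 8× more compute-efficient than baselines and keeps improving as test-time compute is scaled up.

Link: https://arxiv.org/abs/2511.08325
Authors: Zhiheng Xi, Chenyang Liao, Guanyu Li, Yajie Yang, Wenxiang Chen, Zhihao Zhang, Binghai Wang, Senjie Jin, Yuhao Zhou, Jian Guan, Wei Wu, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Affiliations: Fudan University; Ant Group
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Preprint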

Abstract:Despite rapid development, large language models (LLMs) still encounter challenges in multi-turn decision-making tasks (i.e., agent tasks) like web shopping and browser navigation, which require making a sequence of intelligent decisions based on environmental feedback. Previous work for LLM agents typically relies on elaborate prompt engineering or fine-tuning with expert trajectories to improve performance. In this work, we take a different perspective: we explore constructing process reward models (PRMs) to evaluate each decision and guide the agent’s decision-making process. Unlike LLM reasoning, where each step is scored based on correctness, actions in agent tasks do not have a clear-cut correctness. Instead, they should be evaluated based on their proximity to the goal and the progress they have made. Building on this insight, we propose a re-defined PRM for agent tasks, named AgentPRM, to capture both the interdependence between sequential decisions and their contribution to the final goal. This enables better progress tracking and exploration-exploitation balance. To scalably obtain labeled data for training AgentPRM, we employ a Temporal Difference-based (TD-based) estimation method combined with Generalized Advantage Estimation (GAE), which proves more sample-efficient than prior methods. Extensive experiments across different agentic tasks show that AgentPRM is over 8× more compute-efficient than baselines, and it demonstrates robust improvement when scaling up test-time compute. Moreover, we perform detailed analyses to show how our method works and offer more insights, e.g., applying AgentPRM to the reinforcement learning of LLM agents.
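
The abstract names TD-based estimation combined with Generalized Advantage Estimation (GAE) as the source of training labels. The sketch below is the standard GAE recursion over one finished trajectory; how AgentPRM turns these quantities into PRM labels is not stated in the abstract, so the function name, the default `gamma` and `lam`, and the zero terminal value are assumptions for illustration.

```python
from typing import List


def gae_advantages(
    rewards: List[float],
    values: List[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Generalized Advantage Estimation over one finished trajectory.

    values[t] is the estimated value of the state before action t. The value
    after the final action is taken to be 0 because the episode has ended.
    """
    if len(rewards) != len(values):
        raise ValueError("need one value estimate per reward")
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    next_advantage = 0.0
    # Walk backwards so each step can reuse the advantage of its successor.
    for t in reversed(range(len(rewards))):
        td_error = rewards[t] + gamma * next_value - values[t]
        next_advantage = td_error + gamma * lam * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages
```

With `lam=0` each advantage is the one-step TD error, and with `gamma=lam=1` it is the full return minus the value estimate, so `lam` trades bias against variance between those two extremes.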

[NLP-18] Adaptive Multi-Agent Response Refinement in Conversational Systems AAAI2026

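[Quick read]: This paper addresses errors that large language models (LLMs) make in conversational systems when a response must account for personalization or specific knowledge, errors that users cannot be expected to detect and ask to have corrected. Refining a response within a single model struggles to cover the several dimensions of conversational quality. The key idea is a multi-agent refinement framework in which each agent handles one of three core aspects (factuality, personalization, or coherence), together with a dynamic communication strategy that adaptively selects and coordinates the most relevant agents for each query. This collaborative refinement significantly improves performance on tasks involving knowledge, user persona, or both.

Link: https://arxiv.org/abs/2511.08319
Authors: Soyeong Jeong, Aparna Elangovan, Emine Yilmaz, Oleg Rokhlenko
Affiliations: KAIST; Amazon; Collate; University College London
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: LaCATODA Workshop @ AAAI 2026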

Abstract:Large Language Models (LLMs) have demonstrated remarkable success in conversational systems by generating human-like responses. However, they can fall short, especially when required to account for personalization or specific knowledge. In real-life settings, it is impractical to rely on users to detect these errors and request a new response. One way to address this problem is to refine the response before returning it to the user. While existing approaches focus on refining responses within a single LLM, this method struggles to consider diverse aspects needed for effective conversations. In this work, we propose refining responses through a multi-agent framework, where each agent is assigned a specific role for each aspect. We focus on three key aspects crucial to conversational quality: factuality, personalization, and coherence. Each agent is responsible for reviewing and refining one of these aspects, and their feedback is then merged to improve the overall response. To enhance collaboration among them, we introduce a dynamic communication strategy. Instead of following a fixed sequence of agents, our approach adaptively selects and coordinates the most relevant agents based on the specific requirements of each query. We validate our framework on challenging conversational datasets, demonstrating that ours significantly outperforms relevant baselines, particularly in tasks involving knowledge or user’s persona, or both.

[NLP-19] Automatic Paper Reviewing with Heterogeneous Graph Reasoning over LLM-Simulated Reviewer-Author Debates

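[Quick read]: This paper addresses the hallucination, biased scoring, and limited reasoning of existing paper-review methods, which rely on superficial manuscript features or on direct use of large language models (LLMs) and fail to capture the complex argumentation and negotiation between reviewers and authors. The key idea is the ReViewGraph framework: LLM-based multi-agent collaboration simulates multi-round reviewer-author debates, the opinion relations in them (such as acceptance, rejection, clarification, and compromise) are explicitly extracted and encoded as typed edges in a heterogeneous interaction graph, and graph neural networks then reason over the structured debate graph to capture fine-grained argumentative dynamics and support more informed review decisions.

Link: https://arxiv.org/abs/2511.08317
Authors: Shuaimin Li, Liyang Fan, Yufang Lin, Zeyang Li, Xian Wei, Shiwen Ni, Hamid Alinejad-Rokny, Min Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: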

Abstract:Existing paper review methods often rely on superficial manuscript features or directly on large language models (LLMs), which are prone to hallucinations, biased scoring, and limited reasoning capabilities. Moreover, these methods often fail to capture the complex argumentative reasoning and negotiation dynamics inherent in reviewer-author interactions. To address these limitations, we propose ReViewGraph (Reviewer-Author Debates Graph Reasoner), a novel framework that performs heterogeneous graph reasoning over LLM-simulated multi-round reviewer-author debates. In our approach, reviewer-author exchanges are simulated through LLM-based multi-agent collaboration. Diverse opinion relations (e.g., acceptance, rejection, clarification, and compromise) are then explicitly extracted and encoded as typed edges within a heterogeneous interaction graph. By applying graph neural networks to reason over these structured debate graphs, ReViewGraph captures fine-grained argumentative dynamics and enables more informed review decisions. Extensive experiments on three datasets demonstrate that ReViewGraph outperforms strong baselines with an average relative improvement of 15.73%, underscoring the value of modeling detailed reviewer-author debate structures.

[NLP-20] Hierarchical structure understanding in complex tables with VLLMs: a benchmark and experiments

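[Quick read]: This paper asks whether Vision Large Language Models (VLLMs) can understand and parse the hierarchical structure of tables in scientific articles without additional preprocessing. The key contribution is Complex Hierarchical Tables (CHiTab), a benchmark of tables with complex hierarchical headings extracted from PubTables-1M. Several state-of-the-art open-weights VLLMs are tested on it with a range of prompt engineering strategies, first off the shelf and then, for some models, after fine-tuning, giving a systematic evaluation of their structure understanding.

Link: https://arxiv.org/abs/2511.08298
Authors: Luca Bindini, Simone Giovannini, Simone Marinai, Valeria Nardoni, Kimiya Noor Ali
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: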

Abstract:This work investigates the ability of Vision Large Language Models (VLLMs) to understand and interpret the structure of tables in scientific articles. Specifically, we explore whether VLLMs can infer the hierarchical structure of tables without additional processing. As a basis for our experiments we use the PubTables-1M dataset, a large-scale corpus of scientific tables. From this dataset, we extract a subset of tables that we introduce as Complex Hierarchical Tables (CHiTab): a benchmark collection of complex tables containing hierarchical headings. We adopt a series of prompt engineering strategies to probe the models’ comprehension capabilities, experimenting with various prompt formats and writing styles. Multiple state-of-the-art open-weights VLLMs are evaluated on the benchmark first using their off-the-shelf versions and then fine-tuning some models on our task. We also measure the performance of humans to solve the task on a small set of tables comparing with performance of the evaluated VLLMs. The experiments support our intuition that generic VLLMs, not explicitly designed for understanding the structure of tables, can perform this task. This study provides insights into the potential and limitations of VLLMs to process complex tables and offers guidance for future work on integrating structured data understanding into general-purpose VLLMs.

[NLP-21] Multi-Agent GraphRAG: A Text-to-Cypher Framework for Labeled Property Graphs

[Quick read]: This paper addresses the fact that current GraphRAG research has underexplored Cypher and Labeled Property Graph (LPG) databases as scalable and effective reasoning engines within GraphRAG pipelines. The key contribution is Multi-Agent GraphRAG, a modular LLM agentic system that generates and executes Cypher queries from natural language. Iterative content-aware correction and normalization, reinforced by an aggregated feedback loop, refines the generated queries both semantically and syntactically, giving more accurate access to LPG data.

Link: https://arxiv.org/abs/2511.08274
Authors: Anton Gusarov, Anastasia Volkova, Valentin Khrulkov, Andrey Kuznetsov, Evgenii Maslov, Ivan Oseledets
Affiliations: Skolkovo Institute of Science and Technology; National Research University Higher School of Economics
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code to be released

Abstract:While Retrieval-Augmented Generation (RAG) methods commonly draw information from unstructured documents, the emerging paradigm of GraphRAG aims to leverage structured data such as knowledge graphs. Most existing GraphRAG efforts focus on Resource Description Framework (RDF) knowledge graphs, relying on triple representations and SPARQL queries. However, the potential of Cypher and Labeled Property Graph (LPG) databases to serve as scalable and effective reasoning engines within GraphRAG pipelines remains underexplored in current research literature. To fill this gap, we propose Multi-Agent GraphRAG, a modular LLM agentic system for text-to-Cypher query generation serving as a natural language interface to LPG-based graph data. Our proof-of-concept system features an LLM-based workflow for automated Cypher queries generation and execution, using Memgraph as the graph database backend. Iterative content-aware correction and normalization, reinforced by an aggregated feedback loop, ensures both semantic and syntactic refinement of generated queries. We evaluate our system on the CypherBench graph dataset covering several general domains with diverse types of queries. In addition, we demonstrate performance of the proposed workflow on a property graph derived from the IFC (Industry Foundation Classes) data, representing a digital twin of a building. This highlights how such an approach can bridge AI with real-world applications at scale, enabling industrial digital automation use cases.
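
The abstract describes iterative correction of generated Cypher queries, reinforced by an aggregated feedback loop. A rough sketch of that control flow follows. The paper's agents and prompts are not public in this digest, so `refine_query`, `toy_generate`, and `toy_check` are hypothetical stand-ins for the LLM call and the database-side validation, not the authors' code or any Memgraph API.

```python
from typing import Callable, List, Optional, Tuple


def refine_query(
    question: str,
    generate: Callable[[str, List[str]], str],
    check: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Regenerate a query until `check` reports no error or rounds run out.

    `generate(question, feedback)` returns a candidate query string and
    `check(query)` returns an error message, or None if the query passes.
    The feedback list accumulates every error seen so far.
    """
    feedback: List[str] = []
    query = generate(question, feedback)
    for _ in range(max_rounds):
        error = check(query)
        if error is None:
            break
        feedback.append(error)
        query = generate(question, feedback)
    return query, feedback


# Toy stand-ins for the LLM call and the validation step (hypothetical).
def toy_generate(question: str, feedback: List[str]) -> str:
    # Pretend the model fixes its syntax once it has seen any feedback.
    if feedback:
        return "MATCH (n:Person) RETURN n.name"
    return "MATCH (n:Person RETURN n.name"


def toy_check(query: str) -> Optional[str]:
    if query.count("(") != query.count(")"):
        return "unbalanced parentheses"
    return None


final_query, errors = refine_query("Who is in the graph?", toy_generate, toy_check)
```

If the rounds run out, the last regenerated query is returned without a final check, so a caller that needs a guarantee would validate the result once more.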

[NLP-22] ParliaBench: An Evaluation and Benchmarking Framework for LLM-Generated Parliamentary Speech

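[Quick read]: This paper addresses the particular challenges large language models (LLMs) face in generating parliamentary speeches, which need political authenticity and ideological consistency in addition to basic linguistic quality. Current models lack specialized training for parliamentary contexts, and existing evaluation relies on standard natural language processing (NLP) metrics that ignore political authenticity. The key contribution is ParliaBench, a benchmark that includes a dataset of UK Parliament speeches and an evaluation framework combining computational metrics with LLM-as-a-judge assessments across three dimensions: linguistic quality, semantic coherence, and political authenticity. It introduces two embedding-based metrics, Political Spectrum Alignment and Party Alignment, to quantify the ideological positioning of generated text.

Link: https://arxiv.org/abs/2511.08247
Authors: Marios Koniaris, Argyro Tsipi, Panayiotis Tsanakas
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: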

Abstract:Parliamentary speech generation presents specific challenges for large language models beyond standard text generation tasks. Unlike general text generation, parliamentary speeches require not only linguistic quality but also political authenticity and ideological consistency. Current language models lack specialized training for parliamentary contexts, and existing evaluation methods focus on standard NLP metrics rather than political authenticity. To address this, we present ParliaBench, a benchmark for parliamentary speech generation. We constructed a dataset of speeches from UK Parliament to enable systematic model training. We introduce an evaluation framework combining computational metrics with LLM-as-a-judge assessments for measuring generation quality across three dimensions: linguistic quality, semantic coherence, and political authenticity. We propose two novel embedding-based metrics, Political Spectrum Alignment and Party Alignment, to quantify ideological positioning. We fine-tuned five large language models (LLMs), generated 28k speeches, and evaluated them using our framework, comparing baseline and fine-tuned models. Results show that fine-tuning produces statistically significant improvements across the majority of metrics and our novel metrics demonstrate strong discriminative power for political dimensions.

[NLP-23] Prompt Tuning for Natural Language to SQL with Embedding Fine-Tuning and RAG PAKDD2024

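[Quick read]: This paper addresses insufficient accuracy in natural language to SQL (NL-to-SQL) translation, especially for complex queries where existing methods struggle to detect and correct errors in generated SQL. The key idea is an error-correction framework based on prompt tuning and inspired by the medical diagnostic process. It works in four steps: diagnose the error type, identify its cause, produce fixing instructions, and apply the correction to the SQL query. The framework is combined with embedding fine-tuning and Retrieval-Augmented Generation (RAG), which draws on external knowledge bases to improve accuracy and transparency. Experiments report a 12 percent accuracy improvement over existing baselines.

Link: https://arxiv.org/abs/2511.08245
Authors: Jisoo Jang, Tien-Cuong Bui, Yunjun Choi, Wen-Syan Li
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Presented at the Workshop on Robust ML in Open Environments (PAKDD 2024)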

Abstract:This paper introduces an Error Correction through Prompt Tuning for NL-to-SQL, leveraging the latest advancements in generative pre-training-based LLMs and RAG. Our work addresses the crucial need for efficient and accurate translation of natural language queries into SQL expressions in various settings with the growing use of natural language interfaces. We explore the evolution of NLIDBs from early rule-based systems to advanced neural network-driven approaches. Drawing inspiration from the medical diagnostic process, we propose a novel framework integrating an error correction mechanism that diagnoses error types, identifies their causes, provides fixing instructions, and applies these corrections to SQL queries. This approach is further enriched by embedding fine-tuning and RAG, which harnesses external knowledge bases for improved accuracy and transparency. Through comprehensive experiments, we demonstrate that our framework achieves a significant 12 percent accuracy improvement over existing baselines, highlighting its potential to revolutionize data access and handling in contemporary data-driven environments.

[NLP-24] Towards Outcome-Oriented Task-Agnostic Evaluation of AI Agents

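[Quick read]: This paper addresses a limitation of current AI agent evaluation: infrastructure metrics such as latency, time-to-first-token, or throughput cannot capture an agent's decision quality, operational autonomy, or business value. The key contribution is a framework of eleven outcome-based, task-agnostic performance metrics, including Goal Completion Rate (GCR), Autonomy Index (AIx), Multi-Step Task Resilience (MTR), and Business Impact Efficiency (BIE), that supports a common evaluation standard across domains and architectures. A large-scale simulated experiment demonstrates the framework and reveals performance trade-offs between agent architectures, offering a standardized, quantitative methodology for developing, deploying, and governing AI agents.

Link: https://arxiv.org/abs/2511.08242
Authors: Waseem AlShikh, Muayad Sayed Ali, Brian Kennedy, Dmytro Mozolevskyi
Affiliations: Writer, Inc.
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: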

Abstract:As AI agents proliferate across industries and applications, evaluating their performance based solely on infrastructural metrics such as latency, time-to-first-token, or token throughput is proving insufficient. These metrics fail to capture the quality of an agent’s decisions, its operational autonomy, or its ultimate business value. This white paper proposes a novel, comprehensive framework of eleven outcome-based, task-agnostic performance metrics for AI agents that transcend domain boundaries. These metrics are designed to enable organizations to evaluate agents based on the quality of their decisions, their degree of autonomy, their adaptability to new challenges, and the tangible business value they deliver, regardless of the underlying model architecture or specific use case. We introduce metrics such as Goal Completion Rate (GCR), Autonomy Index (AIx), Multi-Step Task Resilience (MTR), and Business Impact Efficiency (BIE). Through a large-scale simulated experiment involving four distinct agent architectures (ReAct, Chain-of-Thought, Tool-Augmented, Hybrid) across five diverse domains (Healthcare, Finance, Marketing, Legal, and Customer Service), we demonstrate the framework’s efficacy. Our results reveal significant performance trade-offs between different agent designs, highlighting the Hybrid Agent as the most consistently high-performing model across the majority of our proposed metrics, achieving an average Goal Completion Rate of 88.8% and the highest Return on Investment (ROI). This work provides a robust, standardized methodology for the holistic evaluation of AI agents, paving the way for more effective development, deployment, and governance.

[NLP-25] VocalBench-zh: Decomposing and Benchmarking the Speech Conversational Abilities in Mandarin Context

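[Quick read]: This paper addresses the lack of a comprehensive benchmark for speech-to-speech (S2S) interaction in Mandarin, which impedes systematic evaluation by developers and fair model comparison by users. The key contribution is VocalBench-zh, an ability-level evaluation suite for the Mandarin context with 10 well-crafted subsets and over 10,000 high-quality instances covering 12 user-oriented ability dimensions. Evaluating 14 mainstream models with it reveals challenges common to current approaches and points to directions for next-generation speech interaction systems.

Link: https://arxiv.org/abs/2511.08230
Authors: Heyang Liu, Ziyang Cheng, Yuhao Wang, Hongcheng Liu, Yiqi Li, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang
Affiliations: Shanghai Jiao Tong University; Ant Group
Categories: Computation and Language (cs.CL)
Comments: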

Abstract:The development of multi-modal large language models (LLMs) leads to intelligent approaches capable of speech interactions. As one of the most widely spoken languages globally, Mandarin is supported by most models to enhance their applicability and reach. However, the scarcity of comprehensive speech-to-speech (S2S) benchmarks in Mandarin contexts impedes systematic evaluation for developers and hinders fair model comparison for users. In this work, we propose VocalBench-zh, an ability-level divided evaluation suite adapted to Mandarin context consisting of 10 well-crafted subsets and over 10K high-quality instances, covering 12 user-oriented characters. The evaluation experiment on 14 mainstream models reveals the common challenges for current routes, and highlights the need for new insights into next-generation speech interactive systems. The evaluation codes and datasets will be available at this https URL.

[NLP-26] Benchmarking Educational LLMs with Analytics: A Case Study on Gender Bias in Feedback

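[Quick read]: This paper addresses possible gender bias when generative AI provides formative feedback in education, focusing on how large language models (LLMs) respond to gender-related cues. The key idea is an embedding-based benchmarking framework that builds controlled counterfactuals of two kinds: implicit cues, created by lexicon-based swaps of gendered terms within the essay, and explicit cues, created by stating the author's gender in the prompt. Response divergence is quantified with cosine and Euclidean distances over sentence embeddings, tested for significance with permutation tests, and visualized with dimensionality reduction. All models showed asymmetric semantic shifts under implicit gender swaps, while only the GPT and Llama models were sensitive to explicit gender cues. Feedback was more autonomy-supportive under male cues and more controlling under female cues, indicating that state-of-the-art LLMs still carry subtle gender bias in educational feedback. The work offers guidance for fairness auditing and prompt design.

Link: https://arxiv.org/abs/2511.08225
Authors: Yishan Du, Conrad Borchers, Mutlu Cukurova
Affiliations: University College London, UCL Knowledge Lab; UCL Centre for Artificial Intelligence; Carnegie Mellon University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 21 pages, 7 figures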

Abstract:As teachers increasingly turn to GenAI in their educational practice, we need robust methods to benchmark large language models (LLMs) for pedagogical purposes. This article presents an embedding-based benchmarking framework to detect bias in LLMs in the context of formative feedback. Using 600 authentic student essays from the AES 2.0 corpus, we constructed controlled counterfactuals along two dimensions: (i) implicit cues via lexicon-based swaps of gendered terms within essays, and (ii) explicit cues via gendered author background in the prompt. We investigated six representative LLMs (i.e. GPT-5 mini, GPT-4o mini, DeepSeek-R1, DeepSeek-R1-Qwen, Gemini 2.5 Pro, Llama-3-8B). We first quantified the response divergence with cosine and Euclidean distances over sentence embeddings, then assessed significance via permutation tests, and finally, visualised structure using dimensionality reduction. In all models, implicit manipulations reliably induced larger semantic shifts for male-female counterfactuals than for female-male. Only the GPT and Llama models showed sensitivity to explicit gender cues. These findings show that even state-of-the-art LLMs exhibit asymmetric semantic responses to gender substitutions, suggesting persistent gender biases in feedback they provide learners. Qualitative analyses further revealed consistent linguistic differences (e.g., more autonomy-supportive feedback under male cues vs. more controlling feedback under female cues). We discuss implications for fairness auditing of pedagogical GenAI, propose reporting standards for counterfactual evaluation in learning analytics, and outline practical guidance for prompt design and deployment to safeguard equitable feedback.
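
The abstract's measurement pipeline (cosine and Euclidean distances over sentence embeddings, then permutation tests) can be sketched in plain Python. The embeddings here are ordinary lists of floats. Which statistic the paper permutes is not stated in the abstract, so the difference of group means used below is an assumption, as are the function names and defaults.

```python
import math
import random
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def _mean(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)


def permutation_p_value(
    group_a: Sequence[float],
    group_b: Sequence[float],
    n_permutations: int = 2000,
    seed: int = 0,
) -> float:
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(_mean(group_a) - _mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(_mean(pooled[:n_a]) - _mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # The +1 terms count the observed labelling itself, so p is never 0.
    return (hits + 1) / (n_permutations + 1)
```

In a setup like the paper's, `group_a` and `group_b` could hold the distances measured for male-to-female and female-to-male counterfactual pairs, and a small p-value would indicate that the two directions shift the feedback by different amounts.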

[NLP-27] Encoder Fine-tuning with Stochastic Sampling Outperforms Open-weight GPT in Astronomy Knowledge Extraction

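[Quick read]: This paper addresses automatic extraction of key entities and contextual information from astronomy literature, motivated by the rapid growth of that literature. The key idea is an encoder-based, multi-task transformer system built on SciBERT and fine-tuned on astronomy corpora, which classifies telescope references, detects auxiliary semantic attributes, and recognizes instrument mentions. It stochastically samples segments of the training data during fine-tuning and applies majority voting over test segments at inference time; despite its simplicity and low cost, it significantly outperforms the open-weight GPT baseline.

Link: https://arxiv.org/abs/2511.08204
Authors: Shivam Rawat, Lucie Flek, Akbar Karimi
Affiliations: Bonn-Aachen International Center for Information Technology, University of Bonn; Lamarr Institute for Machine Learning and Artificial Intelligence
Categories: Computation and Language (cs.CL)
Comments: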

Abstract:Scientific literature in astronomy is rapidly expanding, making it increasingly important to automate the extraction of key entities and contextual information from research papers. In this paper, we present an encoder-based system for extracting knowledge from astronomy articles. Our objective is to develop models capable of classifying telescope references, detecting auxiliary semantic attributes, and recognizing instrument mentions from textual content. To this end, we implement a multi-task transformer-based system built upon the SciBERT model and fine-tuned for astronomy corpora classification. To carry out the fine-tuning, we stochastically sample segments from the training data and use majority voting over the test segments at inference time. Our system, despite its simplicity and low-cost implementation, significantly outperforms the open-weight GPT baseline.
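
The training and inference recipe in the abstract (random segments for fine-tuning, majority voting over test segments) can be sketched without the model itself. The helper names, the fixed-size windows, and the tie-breaking rule below are assumptions for illustration; the abstract does not specify segment lengths or how ties are resolved.

```python
import random
from collections import Counter
from typing import List, Sequence


def sample_segment(tokens: Sequence, segment_len: int, rng: random.Random) -> List:
    """Pick one random contiguous window of at most segment_len tokens."""
    if len(tokens) <= segment_len:
        return list(tokens)
    start = rng.randrange(len(tokens) - segment_len + 1)
    return list(tokens[start:start + segment_len])


def split_segments(tokens: Sequence, segment_len: int) -> List[List]:
    """Cut a document into consecutive windows for inference."""
    return [list(tokens[i:i + segment_len]) for i in range(0, len(tokens), segment_len)]


def majority_vote(labels: Sequence[str]) -> str:
    """Return the most frequent label; ties go to the label seen first."""
    if not labels:
        raise ValueError("need at least one segment prediction")
    return Counter(labels).most_common(1)[0][0]
```

At training time each epoch would draw a fresh `sample_segment` per document, which exposes the encoder to different parts of long papers. At inference time a classifier would label every window from `split_segments` and `majority_vote` would combine the per-segment labels into one document-level prediction.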

[NLP-28] Do Syntactic Categories Help in Developmentally Motivated Curriculum Learning for Language Models?

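[Quick read]: This paper asks how the syntactic properties of the corpus can be used to improve language models trained on child-directed language data. Although CHILDES shows no strong syntactic differentiation by age, syntactic knowledge about the training data still helps in interpreting model performance on linguistic tasks. The key finding is that some curricula help on reading tasks, but the main improvement comes from training on the subset of syntactically categorizable data rather than on the full noisy corpus, which suggests that using syntactic information matters more than age-ordered curriculum learning alone.

Link: https://arxiv.org/abs/2511.08199
Authors: Arzu Burcu Güven, Anna Rogers, Rob van der Goot
Affiliations: IT University of Copenhagen
Categories: Computation and Language (cs.CL)
Comments: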

Abstract:We examine the syntactic properties of the BabyLM corpus, and age-groups within CHILDES. While we find that CHILDES does not exhibit strong syntactic differentiation by age, we show that the syntactic knowledge about the training data can be helpful in interpreting model performance on linguistic tasks. For curriculum learning, we explore developmental and several alternative cognitively inspired curriculum approaches. We find that some curricula help with reading tasks, but the main performance improvements come from using the subset of syntactically categorizable data, rather than the full noisy corpus.

[NLP-29] SciAgent: A Unified Multi-Agent System for Generalistic Scientific Reasoning

[Quick read]: This paper addresses the fact that current AI systems for scientific tasks remain narrow and handcrafted, and so cannot transfer or adapt their reasoning across disciplines (such as mathematics, physics, and chemistry) and difficulty levels. The key contribution is SciAgent, a unified multi-agent system with a hierarchical architecture: a Coordinator Agent identifies each problem's domain and complexity and dynamically orchestrates specialized Worker Systems, each made up of interacting Sub-agents for symbolic deduction, conceptual modeling, numerical computation, and verification, which together assemble and refine a reasoning pipeline for the task. On competitions such as the International Mathematical Olympiad (IMO) and the International Physics Olympiad (IPhO), the system attains or surpasses human gold-medalist performance, supporting its adaptability and generality across disciplines.

Link: https://arxiv.org/abs/2511.08151
Authors: Xuchen Li, Ruitao Wu, Xuanbo Liu, Xukai Wang, Jinbo Hu, Zhixin Bai, Bohan Zeng, Hao Liang, Leheng Chen, Mingrui Chen, Haitian Zhong, Xuanlin Yang, Xu-Yao Zhang, Liu Liu, Jia Li, Kaiqi Huang, Jiahao Xu, Haitao Mi, Wentao Zhang, Bin Dong
Affiliations: Zhongguancun Academy; Tencent AI Lab; Institute of Automation, Chinese Academy of Sciences; Beihang University; Peking University; Nanjing University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: Technique Report

Abstract:Recent advances in large language models have enabled AI systems to achieve expert-level performance on domain-specific scientific tasks, yet these systems remain narrow and handcrafted. We introduce SciAgent, a unified multi-agent system designed for generalistic scientific reasoning-the ability to adapt reasoning strategies across disciplines and difficulty levels. SciAgent organizes problem solving as a hierarchical process: a Coordinator Agent interprets each problem’s domain and complexity, dynamically orchestrating specialized Worker Systems, each composed of interacting reasoning Sub-agents for symbolic deduction, conceptual modeling, numerical computation, and verification. These agents collaboratively assemble and refine reasoning pipelines tailored to each task. Across mathematics and physics Olympiads (IMO, IMC, IPhO, CPhO), SciAgent consistently attains or surpasses human gold-medalist performance, demonstrating both domain generality and reasoning adaptability. Additionally, SciAgent has been tested on the International Chemistry Olympiad (IChO) and selected problems from the Humanity’s Last Exam (HLE) benchmark, further confirming the system’s ability to generalize across diverse scientific domains. This work establishes SciAgent as a concrete step toward generalistic scientific intelligence-AI systems capable of coherent, cross-disciplinary reasoning at expert levels.

[NLP-30] Still Not There: Can LLMs Outperform Smaller Task-Specific Seq2Seq Models on the Poetry-to-Prose Conversion Task?

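[Quick read]: This paper asks whether large language models (LLMs) are suitable for NLP tasks in low-resource, morphologically rich languages such as Sanskrit, using the difficult Sanskrit poetry-to-prose conversion task. The task requires multi-step reasoning, including compound segmentation, dependency resolution, and syntactic linearisation, and so places heavy demands on generalization and grammatical understanding. The key comparison is between two strategies: instruction fine-tuning of general-purpose LLMs together with in-context prompt templates grounded in Paninian grammar, and full fine-tuning of a task-specific ByT5-Sanskrit encoder-decoder model. The domain-specific Seq2Seq model significantly outperforms every instruction-driven LLM approach, and human evaluation scores correlate highly with Kendall's Tau scores, corroborating the result. The prompting strategies offer an alternative when domain-specific corpora are unavailable, and the task-specific model generalizes well in out-of-domain tests.

Link: https://arxiv.org/abs/2511.08145
Authors: Kunal Kingkar Das, Manoj Balaji Jagadeeshan, Nallani Chakravartula Sahith, Jivnesh Sandhan, Pawan Goyal
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: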

Abstract:Large Language Models (LLMs) are increasingly treated as universal, general-purpose solutions across NLP tasks, particularly in English. But does this assumption hold for low-resource, morphologically rich languages such as Sanskrit? We address this question by comparing instruction-tuned and in-context-prompted LLMs with smaller task-specific encoder-decoder models on the Sanskrit poetry-to-prose conversion task. This task is intrinsically challenging: Sanskrit verse exhibits free word order combined with rigid metrical constraints, and its conversion to canonical prose (anvaya) requires multi-step reasoning involving compound segmentation, dependency resolution, and syntactic linearisation. This makes it an ideal testbed to evaluate whether LLMs can surpass specialised models. For LLMs, we apply instruction fine-tuning on general-purpose models and design in-context learning templates grounded in Paninian grammar and classical commentary heuristics. For task-specific modelling, we fully fine-tune a ByT5-Sanskrit Seq2Seq model. Our experiments show that domain-specific fine-tuning of ByT5-Sanskrit significantly outperforms all instruction-driven LLM approaches. Human evaluation strongly corroborates this result, with scores exhibiting high correlation with Kendall’s Tau scores. Additionally, our prompting strategies provide an alternative to fine-tuning when domain-specific verse corpora are unavailable, and the task-specific Seq2Seq model demonstrates robust generalisation on out-of-domain evaluations.
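
The abstract reports Kendall's Tau scores without saying how they are computed. For a reordering task such as verse-to-prose linearisation, one plausible use (an assumption here) is to compare the position each word takes in the predicted prose with its position in the reference. A minimal tau-a sketch, ignoring tie corrections, with a hypothetical function name:

```python
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(x) != len(y):
        raise ValueError("both rankings must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two items to compare")
    concordant = 0
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A value of 1 means the predicted order matches the reference exactly, -1 means it is fully reversed, and values near 0 mean the two orders are unrelated.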

[NLP-31] Relation as a Prior: A Novel Paradigm for LLM-based Document-level Relation Extraction

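[Quick read]: This paper addresses the performance gap of large language models (LLMs) in Document-level Relation Extraction (DocRE), which stems from insufficient fine-grained comprehension. Existing LLM-based methods usually follow an "extract entities, then predict relations" paradigm, which has two problems: numerous unrelated entity pairs introduce noise that interferes with relation prediction for truly related pairs, and semantic associations that fall outside the predefined label set are counted as prediction errors. The authors propose the Relation as a Prior (RelPrior) paradigm. For the first problem, it uses a binary relation as a prior to select potentially related entity pairs and reduce prediction noise. For the second, it uses the predefined relations as a prior to guide triple extraction instead of predicting relation labels directly, avoiding misjudgments caused by strict label constraints. Experiments show that RelPrior achieves state-of-the-art performance on two benchmarks.

Link: https://arxiv.org/abs/2511.08143
Authors: Qiankun Pi, Yepeng Sun, Jicang Lu, Qinlong Fan, Ningbo Huang, Shiyu Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages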

Abstract:Large Language Models (LLMs) have demonstrated their remarkable capabilities in document understanding. However, recent research reveals that LLMs still exhibit performance gaps in Document-level Relation Extraction (DocRE) as requiring fine-grained comprehension. The commonly adopted “extract entities then predict relations” paradigm in LLM-based methods leads to these gaps due to two main reasons: (1) Numerous unrelated entity pairs introduce noise and interfere with the relation prediction for truly related entity pairs. (2) Although LLMs have identified semantic associations between entities, relation labels beyond the predefined set are still treated as prediction errors. To address these challenges, we propose a novel Relation as a Prior (RelPrior) paradigm for LLM-based DocRE. For challenge (1), RelPrior utilizes binary relation as a prior to extract and determine whether two entities are correlated, thereby filtering out irrelevant entity pairs and reducing prediction noise. For challenge (2), RelPrior utilizes predefined relation as a prior to match entities for triples extraction instead of directly predicting relation. Thus, it avoids misjudgment caused by strict predefined relation labeling. Extensive experiments on two benchmarks demonstrate that RelPrior achieves state-of-the-art performance, surpassing existing LLM-based methods.

[NLP-32] On the Interplay between Positional Encodings, Morphological Complexity, and Word Order Flexibility AACL

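[Quick read]: This paper asks whether language model architectures, designed primarily for English, degrade when applied to other languages, focusing on whether the choice of positional encodings interacts systematically with a language's morphological complexity and word order flexibility. The key idea is to pretrain monolingual model variants with absolute, relative, and no positional encodings for seven typologically diverse languages and evaluate them on four downstream tasks, testing what the trade-off hypothesis between word order flexibility and morphological complexity implies for language modelling. The results show no clear interaction between positional encoding strategy and these language properties, and underline that the choice of tasks, languages, and metrics is essential for drawing stable conclusions.

Link: https://arxiv.org/abs/2511.08139
Authors: Kushal Tatariya, Wessel Poelman, Miryam de Lhoneux
Affiliations: KU Leuven
Categories: Computation and Language (cs.CL)
Comments: IJCNLP-AACL: Main Conference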

Abstract:Language model architectures are predominantly first created for English and subsequently applied to other languages. It is an open question whether this architectural bias leads to degraded performance for languages that are structurally different from English. We examine one specific architectural choice: positional encodings, through the lens of the trade-off hypothesis: the supposed interplay between morphological complexity and word order flexibility. This hypothesis posits a trade-off between the two: a more morphologically complex language can have a more flexible word order, and vice-versa. Positional encodings are a direct target to investigate the implications of this hypothesis in relation to language modelling. We pretrain monolingual model variants with absolute, relative, and no positional encodings for seven typologically diverse languages and evaluate them on four downstream tasks. Contrary to previous findings, we do not observe a clear interaction between position encodings and morphological complexity or word order flexibility, as measured by various proxies. Our results show that the choice of tasks, languages, and metrics is essential for drawing stable conclusions.

[NLP-33] Sentence-Anchored Gist Compression for Long-Context LLMs

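[Quick Read]: This paper addresses the high memory footprint and computational cost that large language models (LLMs) incur on long sequences. The key idea is to introduce learned compression tokens and fine-tune a pre-trained model so that it compresses its input context by a factor of 2x to 8x (that is, to between 1/2 and 1/8 of the original length). This reduces resource consumption while keeping performance on short-context and long-context benchmarks largely unchanged.

Link: https://arxiv.org/abs/2511.08128
Authors: Dmitrii Tarasov, Elizaveta Goncharova, Kuznetsov Andrey
Affiliations: FusionBrainLab; HSE University; Innopolis University
Categories: Computation and Language (cs.CL)
Comments: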

Abstract:This work investigates context compression for Large Language Models (LLMs) using learned compression tokens to reduce the memory and computational demands of processing long sequences. We demonstrate that pre-trained LLMs can be fine-tuned to compress their context by factors of 2x to 8x without significant performance degradation, as evaluated on both short-context and long-context benchmarks. Furthermore, in experiments on a 3-billion-parameter LLaMA model, our method achieves results on par with alternative compression techniques while attaining higher compression ratios.

[NLP-34] Quantification and object perception in Multimodal Large Language Models deviate from human linguistic cognition

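[Quick Read]: This paper investigates why multimodal large language models (MLLMs) handle quantification poorly. It examines whether three features of human quantification that are shared across languages (the ordering of quantifiers into scales, ranges of use and prototypicality, and the biases of the approximate number system) are encoded in the models, and how these features vary with model type and language. The approach compares humans and MLLMs on tasks that tap into the representation of quantification in vivo versus in silico, which exposes the models' semantic and pragmatic limitations and points to directions for improving MLLMs as semantic and pragmatic agents.

Link: https://arxiv.org/abs/2511.08126
Authors: Raquel Montero, Natalia Moskvina, Paolo Morosi, Tamara Serrano, Elena Pagliarini, Evelina Leivada
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: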

Abstract:Quantification has been proven to be a particularly difficult linguistic phenomenon for (Multimodal) Large Language Models (MLLMs). However, given that quantification interfaces with the logic, pragmatic, and numerical domains, the exact reasons for the poor performance are still unclear. This paper looks at three key features of human quantification shared cross-linguistically that have remained so far unexplored in the (M)LLM literature: the ordering of quantifiers into scales, the ranges of use and prototypicality, and the biases inherent in the human approximate number system. The aim is to determine how these features are encoded in the models’ architecture, how they may differ from humans, and whether the results are affected by the type of model and language under investigation. We find that there are clear differences between humans and MLLMs with respect to these features across various tasks that tap into the representation of quantification in vivo vs. in silico. This work, thus, paves the way for addressing the nature of MLLMs as semantic and pragmatic agents, while the cross-linguistic lens can elucidate whether their abilities are robust and stable across different languages.

[NLP-35] Multimodal LLMs Do Not Compose Skills Optimally Across Modalities

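[Quick Read]: This paper studies a weakness of multimodal large language models (MLLMs) in cross-modality skill composition: on tasks that require combining two skills tied to different modalities (for example, vision and language), the models fall significantly short of expectations. It explores two mitigation strategies: chain-of-thought prompting that explicitly instructs the model to compose skills, and a specific fine-tuning recipe that promotes composition. Both improve performance, yet a significant skill composition gap remains, which indicates that cross-modal reasoning and composition in current MLLMs need further research.

Link: https://arxiv.org/abs/2511.08113
Authors: Paula Ontalvilla, Aitor Ormazabal, Gorka Azkune
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: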

Abstract:Skill composition is the ability to combine previously learned skills to solve new tasks. As neural networks acquire increasingly complex skills during their pretraining, it is not clear how successfully they can compose them. In this paper, we focus on Multimodal Large Language Models (MLLM), and study their ability to compose skills across modalities. To this end, we design three evaluation tasks which can be solved sequentially composing two modality-dependent skills, and evaluate several open MLLMs under two main settings: i) prompting the model to directly solve the task, and ii) using a two-step cascaded inference approach, which manually enforces the composition of the two skills for a given task. Even with these straightforward compositions, we find that all evaluated MLLMs exhibit a significant cross-modality skill composition gap. To mitigate the aforementioned gap, we explore two alternatives: i) use chain-of-thought prompting to explicitly instruct MLLMs for skill composition and ii) a specific fine-tuning recipe to promote skill composition. Although those strategies improve model performance, they still exhibit significant skill composition gaps, suggesting that more research is needed to improve cross-modal skill composition in MLLMs.

[NLP-36] Estranged Predictions: Measuring Semantic Category Disruption with Masked Language Modelling

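[Quick Read]: This paper asks how to quantify, computationally, the destabilising effect of science fiction on ontological categories such as human, animal, and machine, and thereby show how the genre restructures anthropocentric logics at the level of language. The key step is to turn Darko Suvin's theory of estrangement into computable measures: a masked language model (MLM) generates substitutes for masked referents, a Gemini classifier labels those substitutes, and three metrics (retention rate, replacement rate, and entropy) measure how stable or disrupted the category boundaries are. The results show that machine referents in science fiction exhibit marked conceptual permeability while human referents stay semantically coherent. This suggests that science fiction restructures conventional categories through controlled semantic perturbation, and that MLMs can serve as tools for surfacing genre-conditioned ontological assumptions.

Link: https://arxiv.org/abs/2511.08109
Authors: Yuxuan Liu, Haim Dubossarsky, Ruth Ahnert
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: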

Abstract:This paper examines how science fiction destabilises ontological categories by measuring conceptual permeability across the terms human, animal, and machine using masked language modelling (MLM). Drawing on corpora of science fiction (Gollancz SF Masterworks) and general fiction (NovelTM), we operationalise Darko Suvin’s theory of estrangement as computationally measurable deviation in token prediction, using RoBERTa to generate lexical substitutes for masked referents and classifying them via Gemini. We quantify conceptual slippage through three metrics: retention rate, replacement rate, and entropy, mapping the stability or disruption of category boundaries across genres. Our findings reveal that science fiction exhibits heightened conceptual permeability, particularly around machine referents, which show significant cross-category substitution and dispersion. Human terms, by contrast, maintain semantic coherence and often anchor substitutional hierarchies. These patterns suggest a genre-specific restructuring within anthropocentric logics. We argue that estrangement in science fiction operates as a controlled perturbation of semantic norms, detectable through probabilistic modelling, and that MLMs, when used critically, serve as interpretive instruments capable of surfacing genre-conditioned ontological assumptions. This study contributes to the methodological repertoire of computational literary studies and offers new insights into the linguistic infrastructure of science fiction.
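
The three metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not give their exact definitions, so the ones below are assumptions: retention is the share of substitutes that stay in the masked referent's category, replacement is the share that leave it, and entropy is the Shannon entropy of the category distribution. The function name and the toy labels are hypothetical.

```python
import math
from collections import Counter

def category_metrics(original, substitutes):
    """Summarise how predicted substitutes distribute over categories.

    original: category of the masked referent, e.g. "machine".
    substitutes: category label assigned to each predicted substitute.
    All three definitions are assumptions made for illustration.
    """
    counts = Counter(substitutes)
    total = len(substitutes)
    # Share of substitutes that stay in the original category.
    retention = counts[original] / total
    # Assumed definition: everything that left the original category.
    replacement = 1.0 - retention
    # Shannon entropy (in bits) of the category distribution.
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return retention, replacement, entropy

# Hypothetical labels for four substitutes of a masked "machine" referent.
labels = ["machine", "machine", "human", "animal"]
retention, replacement, entropy = category_metrics("machine", labels)
```

Under these assumed definitions, a referent whose substitutes scatter across categories gets low retention and high entropy, which is the pattern the abstract reports for machine terms in science fiction.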

[NLP-37] PerspAct: Enhancing LLM Situated Collaboration Skills through Perspective Taking and Active Vision

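[Quick Read]: This paper addresses interaction failures of large language models (LLMs) in multi-agent collaborative systems that stem from weak perspective-taking: when agents' physical and epistemic viewpoints differ, models struggle to understand what other agents need and intend. The approach introduces explicit perspective cues based on the ReAct framework (which combines reasoning and acting) together with active visual exploration, and tests them across seven scenarios of increasing perspective-taking complexity to improve the model's resolution of referential ambiguity and its collaborative effectiveness. Experiments show that explicitly modelling different observers' perspectives, combined with active perception, significantly improves the LLM's situational adaptability and interaction accuracy in multi-agent environments.

Link: https://arxiv.org/abs/2511.08098
Authors: Sabrina Patania, Luca Annese, Anita Pellegrini, Silvia Serino, Anna Lambiase, Luca Pallonetto, Silvia Rossi, Simone Colombani, Tom Foulsham, Azzurra Ruggeri, Dimitri Ognibene
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted at IAS19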

Abstract:Recent advances in Large Language Models (LLMs) and multimodal foundation models have significantly broadened their application in robotics and collaborative systems. However, effective multi-agent interaction necessitates robust perspective-taking capabilities, enabling models to interpret both physical and epistemic viewpoints. Current training paradigms often neglect these interactive contexts, resulting in challenges when models must reason about the subjectivity of individual perspectives or navigate environments with multiple observers. This study evaluates whether explicitly incorporating diverse points of view using the ReAct framework, an approach that integrates reasoning and acting, can enhance an LLM’s ability to understand and ground the demands of other agents. We extend the classic Director task by introducing active visual exploration across a suite of seven scenarios of increasing perspective-taking complexity. These scenarios are designed to challenge the agent’s capacity to resolve referential ambiguity based on visual access and interaction, under varying state representations and prompting strategies, including ReAct-style reasoning. Our results demonstrate that explicit perspective cues, combined with active exploration strategies, significantly improve the model’s interpretative accuracy and collaborative effectiveness. These findings highlight the potential of integrating active perception with perspective-taking mechanisms in advancing LLMs’ application in robotics and multi-agent systems, setting a foundation for future research into adaptive and context-aware AI systems.

[NLP-38] BARD10: A New Benchmark Reveals Significance of Bangla Stop-Words in Authorship Attribution

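[Quick Read]: This paper addresses Bangla authorship attribution, focusing on how stop-words contribute to authorial style across models and text types. It asks whether and how stop-words affect classical machine learning and deep learning models on short texts, and how to build a more representative and reproducible benchmark. The paper introduces BARD10, a balanced corpus of blog and opinion prose from ten contemporary authors, and systematically evaluates four representative classifiers (SVM, Bangla BERT, XGBoost, and MLP) under uniform preprocessing for their sensitivity to stop-word removal. It finds that stop-words are important indicators of Bangla authorial style and that the classical TF-IDF + SVM approach performs best (macro-F1 of 0.997 on BAAD16 and 0.921 on BARD10), while Bangla BERT trails by as much as five points. The error analysis attributes this to transformer models diminishing the authorial signatures carried by high-frequency components, and the sensitivity to stop-word pruning is genre-dependent: high for BARD10 authors, comparatively low for BAAD16 authors.

Link: https://arxiv.org/abs/2511.08085
Authors: Abdullah Muhammad Moosa (1), Nusrat Sultana (1), Mahdi Muhammad Moosa (2), Md. Miraiz Hossain (1) ((1) Department of Mechatronics & Industrial Engineering, Chittagong University of Engineering & Technology, Chittagong 4349, Bangladesh; (2) Department of Mathematics & Natural Sciences, Brac University, Dhaka 1212, Bangladesh)
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 28 pages, 6 Figures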

Abstract:This research presents a comprehensive investigation into Bangla authorship attribution, introducing a new balanced benchmark corpus BARD10 (Bangla Authorship Recognition Dataset of 10 authors) and systematically analyzing the impact of stop-word removal across classical and deep learning models to uncover the stylistic significance of Bangla stop-words. BARD10 is a curated corpus of Bangla blog and opinion prose from ten contemporary authors, alongside the methodical assessment of four representative classifiers: SVM (Support Vector Machine), Bangla BERT (Bidirectional Encoder Representations from Transformers), XGBoost, and an MLP (Multilayer Perceptron), utilizing uniform preprocessing on both BARD10 and the benchmark corpus BAAD16 (Bangla Authorship Attribution Dataset of 16 authors). In all datasets, the classical TF-IDF + SVM baseline performed best, attaining a macro-F1 score of 0.997 on BAAD16 and 0.921 on BARD10, while Bangla BERT lagged by as much as five points. This study reveals that BARD10 authors are highly sensitive to stop-word pruning, while BAAD16 authors remain comparatively robust, highlighting genre-dependent reliance on stop-word signatures. Error analysis revealed that high frequency components transmit authorial signatures that are diminished or reduced by transformer models. Three insights are identified: Bangla stop-words serve as essential stylistic indicators; finely calibrated ML models prove effective within short-text limitations; and BARD10 connects formal literature with contemporary web dialogue, offering a reproducible benchmark for future long-context or domain-adapted transformers.

[NLP-39] Information Capacity: Evaluating the Efficiency of Large Language Models via Text Compression

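[Quick Read]: This paper addresses the lack of a unified efficiency metric for large language models (LLMs) across model sizes and architectures, a gap made more pressing by test-time scaling, which increases compute consumption. It proposes information capacity, a metric based on text compression performance relative to computational complexity: larger models predict the next token more accurately and so compress better, but at higher computational cost. By quantifying this trade-off, information capacity allows fair efficiency comparison across model series and accurate performance prediction within a series. The metric also accounts for tokenizer efficiency, which affects both input and output token counts and is often neglected in existing evaluations.

Link: https://arxiv.org/abs/2511.08066
Authors: Cheng Yuan, Jiawei Shao, Chi Zhang, Xuelong Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Signal Processing (eess.SP)
Comments: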

Abstract:Recent years have witnessed the rapid advancements of large language models (LLMs) and their expanding applications, leading to soaring demands for computational resources. The widespread adoption of test-time scaling further aggravates the tension between model capability and resource consumption, highlighting the importance of inference efficiency. However, a unified metric that accurately reflects an LLM’s efficiency across different model sizes and architectures remains absent. Motivated by the correlation between compression and intelligence, we introduce information capacity, a measure of model efficiency based on text compression performance relative to computational complexity. Larger models can predict the next token more accurately, achieving greater compression gains but at higher computational costs. Empirical evaluations on mainstream open-source models show that models of varying sizes within a series exhibit consistent information capacity. This metric enables a fair efficiency comparison across model series and accurate performance prediction within a model series. A distinctive feature of information capacity is that it incorporates tokenizer efficiency, which affects both input and output token counts but is often neglected in LLM evaluations. We assess the information capacity of 49 models on 5 heterogeneous datasets and observe consistent results on the influences of tokenizer efficiency, pretraining data, and the mixture-of-experts architecture.
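
The compression side of this idea can be illustrated with a short sketch. It is not the paper's definition of information capacity, which the abstract does not spell out. It only shows the standard link between next-token prediction and compression: a token predicted with probability p costs about -log2(p) bits under an ideal entropy coder. Normalising by raw bytes rather than by tokens is one way to make models with different tokenizers comparable. The function names and the toy numbers are hypothetical.

```python
import math

def compressed_bits(token_probs):
    """Ideal code length in bits: each token costs -log2(p) bits."""
    return sum(-math.log2(p) for p in token_probs)

def bits_per_byte(token_probs, text):
    """Normalise by raw UTF-8 bytes so tokenizer granularity cancels out."""
    return compressed_bits(token_probs) / len(text.encode("utf-8"))

# Hypothetical probabilities a model assigned to the three true tokens
# of a hypothetical 4-byte input.
probs = [0.5, 0.25, 0.25]
text = "abcd"
bpb = bits_per_byte(probs, text)
```

A model that predicts the same text with higher probabilities yields fewer bits per byte; relating that gain to inference compute is the part the paper's metric adds.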

[NLP-40] Dual-Process Scaffold Reasoning for Enhancing LLM Code Debugging

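[Quick Read]: This paper addresses how large language models (LLMs) can balance the complexity of reasoning steps against computational efficiency in code debugging. Although LLMs show strong problem-solving ability on many benchmarks, prior work has not identified intermediate reasoning steps that are both accurate and efficient. The authors propose Scaffold Reasoning, a framework grounded in psychological theory that divides reasoning into three streams: the Scaffold Stream constructs reference code, the Analytic Stream analyses the buggy code, and the Integration Stream merges the two results to produce the final fix. Modelled on the interplay of System 1 (intuitive output) and System 2 (analytical reasoning) in human cognition, the framework reaches an 88.91% pass rate with an average inference time of 5.36 seconds per problem on DebugBench, outperforming other reasoning approaches, and the analysis supports its alignment with human cognitive processes.

Link: https://arxiv.org/abs/2511.08052
Authors: Po-Chung Hsieh, Chin-Po Chen, Jeng-Lin Li, Ming-Ching Chang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: 5 pages, 2 figures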

Abstract:Recent LLMs have demonstrated sophisticated problem-solving capabilities on various benchmarks through advanced reasoning algorithms. However, the key research question of identifying reasoning steps that balance complexity and computational efficiency remains unsolved. Recent research has increasingly drawn upon psychological theories to explore strategies for optimizing cognitive pathways. The LLM’s final outputs and intermediate steps are regarded as System 1 and System 2, respectively. However, an in-depth exploration of the System 2 reasoning is still lacking. Therefore, we propose a novel psychologically backed Scaffold Reasoning framework for code debugging, which encompasses the Scaffold Stream, Analytic Stream, and Integration Stream. The construction of reference code within the Scaffold Stream is integrated with the buggy code analysis results produced by the Analytic Stream through the Integration Stream. Our framework achieves an 88.91% pass rate and an average inference time of 5.36 seconds per-problem on DebugBench, outperforming other reasoning approaches across various LLMs in both reasoning accuracy and efficiency. Further analyses elucidate the advantages and limitations of various cognitive pathways across varying problem difficulties and bug types. Our findings also corroborate the alignment of the proposed Scaffold Reasoning framework with human cognitive processes.

[NLP-41] DynaAct: Large Language Model Reasoning with Dynamic Action Spaces NEURIPS2025

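[Quick Read]: This paper addresses the tension between efficiency and effectiveness when constructing candidate action spaces for complex problem solving: manually defined action spaces do not scale, while unstructured spaces make search prohibitively expensive. It proposes DynaAct, an automated framework that uses large language models to extract general action sketches from a corpus of diverse complex reasoning problems as a proxy for the complete action space. It then formulates a submodular function that jointly scores candidate actions by their utility to the current state and their diversity, and uses a greedy algorithm to select a compact candidate set, improving reasoning performance without adding substantial latency.

Link: https://arxiv.org/abs/2511.08043
Authors: Xueliang Zhao, Wei Wu, Jian Guan, Qintong Li, Lingpeng Kong
Affiliations: The University of Hong Kong; Ant Group
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted to NeurIPS 2025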

Abstract:In modern sequential decision-making systems, the construction of an optimal candidate action space is critical to efficient inference. However, existing approaches either rely on manually defined action spaces that lack scalability or utilize unstructured spaces that render exhaustive search computationally prohibitive. In this paper, we propose a novel framework named DynaAct for automatically constructing a compact action space to enhance sequential reasoning in complex problem-solving scenarios. Our method first estimates a proxy for the complete action space by extracting general sketches observed in a corpus covering diverse complex reasoning problems using large language models. We then formulate a submodular function that jointly evaluates candidate actions based on their utility to the current state and their diversity, and employ a greedy algorithm to select an optimal candidate set. Extensive experiments on six diverse standard benchmarks demonstrate that our approach significantly improves overall performance, while maintaining efficient inference without introducing substantial latency. The implementation is available at this https URL.
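
The selection step described in the abstract, a submodular function that balances utility and diversity and is maximised greedily, can be sketched as follows. This is a minimal illustration and not the authors' implementation. The objective (a sum of utilities plus a facility-location coverage term), the `lam` weight, and the toy scores are all assumptions.

```python
def greedy_select(utility, similarity, k, lam=1.0):
    """Greedily pick k actions that maximise utility plus coverage.

    utility: non-negative relevance of each action to the current state.
    similarity: n x n non-negative pairwise similarities between actions.
    Illustrative objective (an assumption, not the paper's):
        f(S) = sum_{a in S} utility[a]
               + lam * sum_j max_{a in S} similarity[a][j]
    This is monotone submodular: a modular term plus facility location.
    """
    n = len(utility)
    selected = []
    covered = [0.0] * n  # best similarity to the selected set, per action
    for _ in range(min(k, n)):
        best, best_gain = None, float("-inf")
        for a in range(n):
            if a in selected:
                continue
            # Marginal gain in coverage if action a joins the set.
            coverage_gain = sum(max(c, s) - c for c, s in zip(covered, similarity[a]))
            gain = utility[a] + lam * coverage_gain
            if gain > best_gain:
                best, best_gain = a, gain
        selected.append(best)
        covered = [max(c, s) for c, s in zip(covered, similarity[best])]
    return selected

# Hypothetical scores: actions 0 and 1 are near-duplicates, action 2 differs.
utility = [0.9, 0.85, 0.3]
similarity = [
    [1.0, 0.95, 0.0],
    [0.95, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
picked = greedy_select(utility, similarity, k=2)
```

In this toy case the second pick is action 2 rather than the higher-utility action 1, because action 1 adds almost no coverage once action 0 is selected. For monotone submodular objectives, greedy selection is a standard choice because it carries a well-known (1 - 1/e) approximation guarantee.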

[NLP-42] BiCA: Effective Biomedical Dense Retrieval with Citation-Aware Hard Negatives AAAI2026

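[Quick Read]: This paper addresses the difficulty of mining hard negatives in biomedical and scientific domains, where source documents and hard-negative documents are hard to tell apart. Conventional mining ranks documents with cross-encoders or static embedding models using similarity measures such as cosine distance, which struggles to identify high-quality hard negatives in specialised domains. The key idea is to exploit citation links: cited documents are contextually related to the source document without being duplicates, so they make natural hard negatives for training retrieval models. The proposed method, BiCA, builds citation-based hard negatives from 20,000 PubMed articles and fine-tunes the GTE_small and GTE_Base models. This yields consistent gains in zero-shot dense retrieval (nDCG@10) on BEIR and outperforms baselines on long-tailed topics in LoTTE (Success@5). The results show that document link structure can supply highly informative negatives and support data-efficient domain adaptation.

Link: https://arxiv.org/abs/2511.08029
Authors: Aarush Sinha, Pavan Kumar S, Roshan Balaji, Nirav Pravinbhai Bhatt
Affiliations: University of Copenhagen
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted for oral presentation at AAAI 2026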

Abstract:Hard negatives are essential for training effective retrieval models. Hard-negative mining typically relies on ranking documents using cross-encoders or static embedding models based on similarity metrics such as cosine distance. Hard negative mining becomes challenging for biomedical and scientific domains due to the difficulty in distinguishing between source and hard negative documents. However, referenced documents naturally share contextual relevance with the source document but are not duplicates, making them well-suited as hard negatives. In this work, we propose BiCA: Biomedical Dense Retrieval with Citation-Aware Hard Negatives, an approach for hard-negative mining by utilizing citation links in 20,000 PubMed articles for improving a domain-specific small dense retriever. We fine-tune the GTE_small and GTE_Base models using these citation-informed negatives and observe consistent improvements in zero-shot dense retrieval using nDCG@10 for both in-domain and out-of-domain tasks on BEIR and outperform baselines on long-tailed topics in LoTTE using Success@5. Our findings highlight the potential of leveraging document link structure to generate highly informative negatives, enabling state-of-the-art performance with minimal fine-tuning and demonstrating a path towards highly data-efficient domain adaptation.
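
The idea of using cited documents as hard negatives can be sketched as a small data-preparation step. This is an illustration under stated assumptions, not the BiCA pipeline: the abstract does not say how queries are formed, so the `queries` mapping, the record layout, and the function name `build_triples` are all hypothetical.

```python
def build_triples(articles, queries):
    """Build (query, positive, negatives) training records.

    articles: id -> {"text": str, "cites": [ids]} (hypothetical layout).
    queries: id -> query text associated with that article (hypothetical).
    The source article is the positive; the articles it cites are
    treated as hard negatives, following the idea in the abstract.
    """
    triples = []
    for doc_id, query in queries.items():
        source = articles[doc_id]
        # Keep only cited articles that are present in the collection.
        negatives = [
            articles[cited]["text"]
            for cited in source["cites"]
            if cited in articles and cited != doc_id
        ]
        if negatives:
            triples.append(
                {"query": query, "positive": source["text"], "negatives": negatives}
            )
    return triples

# Toy collection; "px" is cited but missing, so it is skipped.
articles = {
    "p1": {"text": "Source article on drug A.", "cites": ["p2", "p3", "px"]},
    "p2": {"text": "Cited article on drug B.", "cites": []},
    "p3": {"text": "Cited article on dosage.", "cites": []},
}
queries = {"p1": "What is known about drug A?"}
triples = build_triples(articles, queries)
```

Records of this shape are what contrastive fine-tuning of a dense retriever typically consumes; the appeal of citation links is that they need no extra ranking model to find negatives.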

[NLP-43] HyCoRA: Hyper-Contrastive Role-Adaptive Learning for Role-Playing

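[Quick Read]: This paper addresses the balance between role-specific and shared traits in multi-character role-playing: existing methods either share one parameterised module across all roles, which ignores what makes each role distinct, or assign a separate module to each role, which overlooks traits the roles have in common. It proposes HyCoRA (Hyper-Contrastive Role-Adaptive learning), built around a Hyper-Half Low-Rank Adaptation structure in which one half is a role-specific module generated by a lightweight hyper-network (capturing distinct persona signatures) and the other half is a trainable role-shared module (capturing common traits). A hyper-contrastive learning mechanism further helps the hyper-network distinguish the unique characteristics of different roles, giving efficient role-adaptive modelling that preserves role diversity.

Link: https://arxiv.org/abs/2511.08017
Authors: Shihao Yang, Zhicong Lu, Yong Yang, Bo Lv, Yang Shen, Nayu Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 5 figures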

Abstract:Multi-character role-playing aims to equip models with the capability to simulate diverse roles. Existing methods either use one shared parameterized module across all roles or assign a separate parameterized module to each role. However, the role-shared module may ignore distinct traits of each role, weakening personality learning, while the role-specific module may overlook shared traits across multiple roles, hindering commonality modeling. In this paper, we propose a novel HyCoRA: Hyper-Contrastive Role-Adaptive learning framework, which efficiently improves multi-character role-playing ability by balancing the learning of distinct and shared traits. Specifically, we propose a Hyper-Half Low-Rank Adaptation structure, where one half is a role-specific module generated by a lightweight hyper-network, and the other half is a trainable role-shared module. The role-specific module is devised to represent distinct persona signatures, while the role-shared module serves to capture common traits. Moreover, to better reflect distinct personalities across different roles, we design a hyper-contrastive learning mechanism to help the hyper-network distinguish their unique characteristics. Extensive experimental results on both English and Chinese available benchmarks demonstrate the superiority of our framework. Further GPT-4 evaluations and visual analyses also verify the capability of HyCoRA to capture role characteristics.

[NLP-44] Self-Correction Distillation for Structured Data Question Answering AAAI2026

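[Quick Read]: This paper addresses the limited performance of small-scale large language models (LLMs) on structured data question answering (QA), which is caused by their tendency to make errors when generating structured queries. It proposes self-correction distillation (SCD), which combines two components: an error prompt mechanism (EPM) that detects errors and returns customised error messages during inference, and a two-stage distillation strategy that transfers the query-generation and error-correction abilities of large-scale LLMs to small-scale models. This improves small models on table QA, knowledge graph QA, and temporal knowledge graph QA.

Link: https://arxiv.org/abs/2511.07998
Authors: Yushan Zhu, Wen Zhang, Long Jin, Mengshu Sun, Ling Zhong, Zhiqiang Liu, Juan Li, Lei Liang, Chong Long, Chao Deng, Junlan Feng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI 2026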

Abstract:Structured data question answering (QA), including table QA, Knowledge Graph (KG) QA, and temporal KG QA, is a pivotal research area. Advances in large language models (LLMs) have driven significant progress in unified structural QA frameworks like TrustUQA. However, these frameworks face challenges when applied to small-scale LLMs since small-scale LLMs are prone to errors in generating structured queries. To improve the structured data QA ability of small-scale LLMs, we propose a self-correction distillation (SCD) method. In SCD, an error prompt mechanism (EPM) is designed to detect errors and provide customized error messages during inference, and a two-stage distillation strategy is designed to transfer large-scale LLMs’ query-generation and error-correction capabilities to small-scale LLM. Experiments across 5 benchmarks with 3 structured data types demonstrate that our SCD achieves the best performance and superior generalization on small-scale LLM (8B) compared to other distillation methods, and closely approaches the performance of GPT4 on some datasets. Furthermore, large-scale LLMs equipped with EPM surpass the state-of-the-art results on most datasets.

[NLP-45] State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting? LREC2026

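[Quick Read]: This paper asks whether, for less-resourced languages (specifically South Slavic languages), large language models (LLMs) outperform fine-tuned BERT-like models on text classification, and whether their zero-shot ability is of practical use. It systematically compares open-source and closed-source LLMs with openly available fine-tuned BERT-like models on three tasks (sentiment classification in parliamentary speeches, topic classification in news articles and parliamentary speeches, and genre identification in web texts) across South Slavic languages and English. The results show that zero-shot LLMs often match or surpass fine-tuned BERT-like models and perform comparably across languages, but they are slower at inference, computationally more expensive, and less predictable in their outputs. Fine-tuned BERT-like models therefore remain the more practical choice for large-scale automatic text annotation.

Link: https://arxiv.org/abs/2511.07989
Authors: Taja Kuzman Pungeršek, Peter Rupnik, Ivan Porupski, Vuk Dinić, Nikola Ljubešić
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages; 4 figures; 3 tables. Submitted to the LREC 2026 conference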

Abstract:Until recently, fine-tuned BERT-like models provided state-of-the-art performance on text classification tasks. With the rise of instruction-tuned decoder-only models, commonly known as large language models (LLMs), the field has increasingly moved toward zero-shot and few-shot prompting. However, the performance of LLMs on text classification, particularly on less-resourced languages, remains under-explored. In this paper, we evaluate the performance of current language models on text classification tasks across several South Slavic languages. We compare openly available fine-tuned BERT-like models with a selection of open-source and closed-source LLMs across three tasks in three domains: sentiment classification in parliamentary speeches, topic classification in news articles and parliamentary speeches, and genre identification in web texts. Our results show that LLMs demonstrate strong zero-shot performance, often matching or surpassing fine-tuned BERT-like models. Moreover, when used in a zero-shot setup, LLMs perform comparably in South Slavic languages and English. However, we also point out key drawbacks of LLMs, including less predictable outputs, significantly slower inference, and higher computational costs. Due to these limitations, fine-tuned BERT-like models remain a more practical choice for large-scale automatic text annotation.

[NLP-46] NOTAM-Evolve: A Knowledge-Guided Self-Evolving Optimization Framework with LLMs for NOTAM Interpretation AAAI2026

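[Quick Read]: This paper addresses accurate interpretation of Notices to Airmen (NOTAMs) for aviation safety. Existing automated systems are mostly limited to shallow parsing and fail to extract the actionable intelligence needed for operational decisions. The core challenge is deep parsing, which requires both dynamic knowledge grounding (linking a NOTAM to evolving aeronautical data) and schema-based inference (applying static domain rules to deduce operational status). The proposed NOTAM-Evolve framework grounds data through a knowledge graph-enhanced retrieval module and adds a closed-loop learning process in which the large language model (LLM) iteratively improves from its own outputs, reducing reliance on human-annotated reasoning traces. It achieves a 30.4% absolute accuracy improvement over the base LLM on structured NOTAM interpretation, a new state of the art.

Link: https://arxiv.org/abs/2511.07982
Authors: Maoqi Liu, Quan Fang, Yuhao Wu, Can Zhao, Yang Yang, Kaiquan Cai
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI 2026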

Abstract:Accurate interpretation of Notices to Airmen (NOTAMs) is critical for aviation safety, yet their condensed and cryptic language poses significant challenges to both manual and automated processing. Existing automated systems are typically limited to shallow parsing, failing to extract the actionable intelligence needed for operational decisions. We formalize the complete interpretation task as deep parsing, a dual-reasoning challenge requiring both dynamic knowledge grounding (linking the NOTAM to evolving real-world aeronautical data) and schema-based inference (applying static domain rules to deduce operational status). To tackle this challenge, we propose NOTAM-Evolve, a self-evolving framework that enables a large language model (LLM) to autonomously master complex NOTAM interpretation. Leveraging a knowledge graph-enhanced retrieval module for data grounding, the framework introduces a closed-loop learning process where the LLM progressively improves from its own outputs, minimizing the need for extensive human-annotated reasoning traces. In conjunction with this framework, we introduce a new benchmark dataset of 10,000 expert-annotated NOTAMs. Our experiments demonstrate that NOTAM-Evolve achieves a 30.4% absolute accuracy improvement over the base LLM, establishing a new state of the art on the task of structured NOTAM interpretation.

[NLP-47] Unified Work Embeddings: Contrastive Learning of a Bidirectional Multi-task Ranker

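[Quick Read]: This paper addresses the real-world complexities of work-related natural language processing (NLP) tasks (long-tailed distributions, extreme multi-label target spaces, and scarce data) and evaluates how generalist embedding models perform in the work domain. The core challenge is cross-task knowledge transfer and efficient inference when labelled data is scarce. The paper introduces WorkBench, the first unified evaluation suite spanning six work-related tasks formulated as ranking problems, and finds significant positive cross-task transfer. Building on this, it composes task-specific bipartite graphs from real-world data, synthetically enriched through grounding, and trains Unified Work Embeddings (UWE), a task-agnostic bi-encoder that uses a many-to-many InfoNCE objective to exploit the training-data structure and a task-agnostic soft late interaction over token-level embeddings. UWE delivers zero-shot ranking on unseen target spaces, supports low-latency inference by caching target-space embeddings, and shows significant gains in macro-averaged MAP and RP@10 over generalist embedding models.

Link: https://arxiv.org/abs/2511.07969
Authors: Matthias De Lange, Jens-Joris Decorte, Jeroen Van Hautte
Affiliations: TechWolf
Categories: Computation and Language (cs.CL)
Comments: Preprint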

Abstract:Workforce transformation across diverse industries has driven an increased demand for specialized natural language processing capabilities. Nevertheless, tasks derived from work-related contexts inherently reflect real-world complexities, characterized by long-tailed distributions, extreme multi-label target spaces, and scarce data availability. The rise of generalist embedding models prompts the question of their performance in the work domain, especially as progress in the field has focused mainly on individual tasks. To this end, we introduce WorkBench, the first unified evaluation suite spanning six work-related tasks formulated explicitly as ranking problems, establishing a common ground for multi-task progress. Based on this benchmark, we find significant positive cross-task transfer, and use this insight to compose task-specific bipartite graphs from real-world data, synthetically enriched through grounding. This leads to Unified Work Embeddings (UWE), a task-agnostic bi-encoder that exploits our training-data structure with a many-to-many InfoNCE objective, and leverages token-level embeddings with task-agnostic soft late interaction. UWE demonstrates zero-shot ranking performance on unseen target spaces in the work domain, enables low-latency inference by caching the task target space embeddings, and shows significant gains in macro-averaged MAP and RP@10 over generalist embedding models.
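
An InfoNCE objective in which a query can have several positive targets can be sketched as below. The abstract does not give UWE's exact loss, so this is one common multi-positive formulation, stated as an assumption: each positive contributes a softmax cross-entropy term against all targets, and the terms are averaged. The function name, the `temperature` default, and the toy inputs are hypothetical.

```python
import math

def multi_positive_info_nce(scores, positives, temperature=0.05):
    """Average InfoNCE loss when each query may have several positives.

    scores[i][j]: similarity between query i and target j.
    positives[i]: set of target indices that are positives for query i.
    The same target may be a positive for many queries.
    """
    total, count = 0.0, 0
    for row, pos in zip(scores, positives):
        logits = [s / temperature for s in row]
        # Numerically stable log-sum-exp over all targets.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        for p in pos:
            total += log_z - logits[p]  # equals -log softmax(logits)[p]
            count += 1
    return total / count

# Uniform scores over four targets, two of which are positives.
uniform_loss = multi_positive_info_nce([[0.0, 0.0, 0.0, 0.0]], [{0, 1}])

# A query whose positives score higher than its negatives gets a lower loss.
aligned_loss = multi_positive_info_nce([[0.9, 0.8, 0.1, 0.0]], [{0, 1}])
```

With uniform scores the loss equals log(4), the chance level for four targets, which gives a quick sanity check for any implementation of this kind.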

[NLP-48] Thinker: Training LLMs in Hierarchical Thinking for Deep Search via Multi-Turn Interaction AAAI2026

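[Quick Read]: This paper addresses the difficulty of guaranteeing logical coherence and rigor when large language models (LLMs) use external knowledge bases and web pages to solve complex problems without supervision over the reasoning process. Existing methods mostly rely on end-to-end reinforcement learning and do not constrain intermediate reasoning steps. The proposed Thinker is a hierarchical thinking model that performs deep search through multi-turn interaction. It decomposes a complex problem into independently solvable sub-problems, each represented both in natural language and as an equivalent logical function, to support knowledge base and web searches. Dependencies between sub-problems are passed as parameters through these logical functions, which strengthens logical coherence, and a knowledge boundary determination step checks whether a sub-problem falls within the LLM's intrinsic knowledge, avoiding unnecessary external searches. This design makes the reasoning process supervisable and verifiable and significantly improves performance.

Link: https://arxiv.org/abs/2511.07943
Authors: Jun Xu, Xinkai Du, Yu Ao, Peilong Zhao, Yang Li, Ling Zhong, Lin Yuan, Zhongpu Bo, Xiaorui Wang, Mengshu Sun, Zhengke Gui, Dalong Zhang, Zhaoyang Wang, Qiwei Wang, Yangyang Hou, Zhiying Yin, Haofen Wang, Huajun Chen, Lei Liang, Jun Zhou
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to AAAI 2026. Extended version with full Appendix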

Abstract:Efficient retrieval of external knowledge bases and web pages is crucial for enhancing the reasoning abilities of LLMs. Previous works on training LLMs to leverage external retrievers for solving complex problems have predominantly employed end-to-end reinforcement learning. However, these approaches neglect supervision over the reasoning process, making it difficult to guarantee logical coherence and rigor. To address these limitations, we propose Thinker, a hierarchical thinking model for deep search through multi-turn interaction, making the reasoning process supervisable and verifiable. It decomposes complex problems into independently solvable sub-problems, each dually represented in both natural language and an equivalent logical function to support knowledge base and web searches. Concurrently, dependencies between sub-problems are passed as parameters via these logical functions, enhancing the logical coherence of the problem-solving process. To avoid unnecessary external searches, we perform knowledge boundary determination to check if a sub-problem is within the LLM’s intrinsic knowledge, allowing it to answer directly. Experimental results indicate that with as few as several hundred training samples, the performance of Thinker is competitive with established baselines. Furthermore, when scaled to the full training set, Thinker significantly outperforms these methods across various datasets and model sizes. The source code is available at this https URL.

[NLP-49] SpeechJudge: Towards Human-Level Judgment for Speech Naturalness

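【Quick Read】: This paper addresses the alignment of generative speech synthesis models with human perceptual preferences, in particular the performance bottleneck caused by the lack of a large-scale, high-quality human preference dataset. The key solution is SpeechJudge, a comprehensive suite comprising a human feedback corpus of 99K speech pairs (SpeechJudge-Data), a challenging naturalness evaluation benchmark (SpeechJudge-Eval), and SpeechJudge-GRM, a generative reward model (GRM) built on Qwen2.5-Omni-7B. Trained with a two-stage post-training strategy (supervised fine-tuning with chain-of-thought rationales, followed by reinforcement learning), the GRM significantly outperforms existing methods on naturalness judgment (77.2% accuracy, rising to 79.4% with inference-time scaling) and can serve as a reward function during the post-training of speech generation models, improving their agreement with human preferences.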

Link: https://arxiv.org/abs/2511.07931
Authors: Xueyao Zhang,Chaoren Wang,Huan Liao,Ziniu Li,Yuancheng Wang,Li Wang,Dongya Jia,Yuanzhe Chen,Xiulin Li,Zhuo Chen,Zhizheng Wu
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project Page: this https URL

Abstract: Aligning large generative models with human feedback is a critical challenge. In speech synthesis, this is particularly pronounced due to the lack of a large-scale human preference dataset, which hinders the development of models that truly align with human perception. To address this, we introduce SpeechJudge, a comprehensive suite comprising a dataset, a benchmark, and a reward model centered on naturalness, one of the most fundamental subjective metrics for speech synthesis. First, we present SpeechJudge-Data, a large-scale human feedback corpus of 99K speech pairs. The dataset is constructed using a diverse set of advanced zero-shot text-to-speech (TTS) models across diverse speech styles and multiple languages, with human annotations for both intelligibility and naturalness preference. From this, we establish SpeechJudge-Eval, a challenging benchmark for speech naturalness judgment. Our evaluation reveals that existing metrics and AudioLLMs struggle with this task; the leading model, Gemini-2.5-Flash, achieves less than 70% agreement with human judgment, highlighting a significant gap for improvement. To bridge this gap, we develop SpeechJudge-GRM, a generative reward model (GRM) based on Qwen2.5-Omni-7B. It is trained on SpeechJudge-Data via a two-stage post-training process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales followed by Reinforcement Learning (RL) with GRPO on challenging cases. On the SpeechJudge-Eval benchmark, the proposed SpeechJudge-GRM demonstrates superior performance, achieving 77.2% accuracy (and 79.4% after inference-time scaling @10) compared to a classic Bradley-Terry reward model (72.7%). Furthermore, SpeechJudge-GRM can also be employed as a reward function during the post-training of speech generation models to facilitate their alignment with human preferences.
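
The abstract compares SpeechJudge-GRM against a classic Bradley-Terry reward model. As a rough illustration of that baseline only, the sketch below shows the standard Bradley-Terry preference probability and a pairwise agreement score. The function names and the toy reward values are hypothetical and do not come from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bt_preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that sample A is preferred over sample B."""
    return sigmoid(reward_a - reward_b)


def pairwise_accuracy(pairs: list[tuple[float, float]]) -> float:
    """Fraction of pairs where the human-preferred sample gets the higher reward.

    Each pair is (reward_of_preferred, reward_of_rejected); values are toy data.
    """
    hits = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return hits / len(pairs)


toy_pairs = [(2.0, 1.0), (0.5, 1.0), (3.0, 0.0), (1.0, 0.2)]
toy_accuracy = pairwise_accuracy(toy_pairs)
```

Agreement figures such as those quoted in the abstract are of this pairwise kind: the share of speech pairs on which the model's ranking matches the human annotation.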

[NLP-50] Distinct Theta Synchrony across Speech Modes: Perceived, Spoken, Whispered, and Imagined

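【Quick Read】: This paper addresses how theta-band neural synchrony differs across speech modes (overt, whispered, perceived, and imagined speech), and in particular the lack of a systematic comparison of theta synchrony across multiple modes. The key to the solution is a connectivity-metric-based analysis of region-specific theta-band synchrony patterns, which reveals differences in the spatial distribution and strength of cortical engagement across modes. Overt and whispered speech show broader and stronger frontotemporal synchrony, reflecting motor-phonological coupling. Perceived speech is dominated by posterior and temporal synchrony, consistent with auditory perception and comprehension. Imagined speech shows spatially confined but internally coherent synchrony in frontal and supplementary motor regions, suggesting a more endogenous neural basis.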

Link: https://arxiv.org/abs/2511.07918
Authors: Jung-Sun Lee,Ha-Na Jo,Eunyeong Ko
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 4 pages, 2 figures, 1 table, Name of Conference: International Conference on Brain-Computer Interface

Abstract:Human speech production encompasses multiple modes such as perceived, overt, whispered, and imagined, each reflecting distinct neural mechanisms. Among these, theta-band synchrony has been closely associated with language processing, attentional control, and inner speech. However, previous studies have largely focused on a single mode, such as overt speech, and have rarely conducted an integrated comparison of theta synchrony across different speech modes. In this study, we analyzed differences in theta-band neural synchrony across speech modes based on connectivity metrics, focusing on region-wise variations. The results revealed that overt and whispered speech exhibited broader and stronger frontotemporal synchrony, reflecting active motor-phonological coupling during overt articulation, whereas perceived speech showed dominant posterior and temporal synchrony patterns, consistent with auditory perception and comprehension processes. In contrast, imagined speech demonstrated a more spatially confined but internally coherent synchronization pattern, primarily involving frontal and supplementary motor regions. These findings indicate that the extent and spatial distribution of theta synchrony differ substantially across modes, with overt articulation engaging widespread cortical interactions, whispered speech showing intermediate engagement, and perception relying predominantly on temporoparietal networks. Therefore, this study aims to elucidate the differences in theta-band neural synchrony across various speech modes, thereby uncovering both the shared and distinct neural dynamics underlying language perception and imagined speech.

[NLP-51] Social Media for Mental Health: Data, Methods, and Findings

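【Quick Read】: This paper addresses how social media data can be used to better understand and respond to mental health problems such as depression, anxiety, and suicidal thoughts. The core challenge is extracting quantifiable linguistic, visual, and emotional indicators from large volumes of unstructured user content to support early identification, intervention, and policymaking. The key to the solution is combining machine learning, natural language processing, and feature engineering to systematically analyze the mental-state signals in user disclosures, thereby supporting medical practice with timely help and raising awareness of mental health issues among governments and policymakers.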

Link: https://arxiv.org/abs/2511.07914
Authors: Nur Shazwani Kamarudin,Ghazaleh Beigi,Lydia Manikonda,Huan Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Open Source Intelligence and Cyber Crime. Lecture Notes in Social Networks. Springer, Cham

Abstract: There is an increasing number of virtual communities and forums available on the web. With social media, people can freely communicate and share their thoughts, ask personal questions, and seek peer support, especially those with conditions that are highly stigmatized, without revealing personal identity. We study the state-of-the-art research methodologies and findings on mental health challenges such as depression, anxiety, and suicidal thoughts, derived from the pervasive use of social media data. We also discuss how this novel thinking and these approaches can help to raise awareness of mental health issues in an unprecedented way. Specifically, this chapter describes linguistic, visual, and emotional indicators expressed in user disclosures. The main goal of this chapter is to show how this new source of data can be tapped to improve medical practice, provide timely support, and influence government or policymakers. In the context of social media for mental health issues, this chapter categorizes the social media data used, introduces the machine learning, feature engineering, natural language processing, and survey methods that have been deployed, and outlines directions for future research.

[NLP-52] Last Layer Logits to Logic: Empowering LLMs with Logic-Consistent Structured Knowledge Reasoning ICLR26

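【Quick Read】: This paper addresses the Logic Drift problem that large language models (LLMs) exhibit in structured knowledge reasoning because of representational differences between unstructured and structured knowledge. The problem is most visible in Knowledge Graph Question Answering (KGQA), where LLMs struggle to keep their outputs logically consistent. Existing methods guide reasoning with complex but fixed prompt workflows, which offer only input-level guidance, cannot fundamentally correct logical defects in model outputs, and do not adapt to different tasks and knowledge graphs. The paper proposes the Logits-to-Logic framework, whose core idea is to intervene directly on the logits produced during autoregressive generation. Two modules, logits strengthening and logits filtering, work together to correct logical errors, which significantly improves logical consistency in structured knowledge reasoning and achieves state-of-the-art performance on multiple KGQA benchmarks.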

Link: https://arxiv.org/abs/2511.07910
Authors: Songze Li,Zhiqiang Liu,Zhaoyan Gong,Xiaoke Guo,Zhengke Gui,Huajun Chen,Wen Zhang
Affiliations: Zhejiang University; Ant Group; ZJU-Ant Group Joint Lab of Knowledge Graph
Categories: Computation and Language (cs.CL)
Comments: ICLR26 Submission

Abstract: Large Language Models (LLMs) achieve excellent performance in natural language reasoning tasks through pre-training on vast unstructured text, enabling them to understand the logic in natural language and generate logic-consistent responses. However, the representational differences between unstructured and structured knowledge make LLMs inherently struggle to maintain logic consistency, leading to Logic Drift challenges in structured knowledge reasoning tasks such as Knowledge Graph Question Answering (KGQA). Existing methods address this limitation by designing complex workflows embedded in prompts to guide LLM reasoning. Nevertheless, these approaches only provide input-level guidance and fail to fundamentally address the Logic Drift in LLM outputs. Additionally, their inflexible reasoning workflows cannot adapt to different tasks and knowledge graphs. To enhance LLMs’ logic consistency in structured knowledge reasoning, we specifically target the logits output from the autoregressive generation process. We propose the Logits-to-Logic framework, which incorporates logits strengthening and logits filtering as core modules to correct logical defects in LLM outputs. Extensive experiments show that our approach significantly improves LLMs’ logic consistency in structured knowledge reasoning and achieves state-of-the-art performance on multiple KGQA benchmarks.
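
The abstract names two modules, logits strengthening and logits filtering, without describing how they work. The sketch below is therefore not the authors' method. It only illustrates the generic idea of intervening on next-token logits, assuming strengthening means adding a bonus to preferred token ids and filtering means masking out disallowed token ids. All names and values are hypothetical.

```python
import numpy as np


def strengthen_logits(logits, preferred_token_ids, bonus):
    """Return a copy of the logits with a bonus added to preferred tokens (assumed reading)."""
    out = np.array(logits, dtype=float)  # np.array copies, so the input is untouched
    out[list(preferred_token_ids)] += bonus
    return out


def filter_logits(logits, allowed_token_ids):
    """Return a copy of the logits where every token outside the allowed set is -inf (assumed reading)."""
    logits = np.asarray(logits, dtype=float)
    mask = np.full(logits.shape, -np.inf)
    mask[list(allowed_token_ids)] = 0.0
    return logits + mask


# Toy vocabulary of four tokens; ids 0 and 1 stand in for relations that are
# valid in a hypothetical knowledge graph, ids 2 and 3 for invalid ones.
raw = [2.0, 1.0, 3.0, 0.5]
filtered = filter_logits(raw, allowed_token_ids={0, 1})
boosted_then_filtered = filter_logits(
    strengthen_logits(raw, preferred_token_ids={1}, bonus=1.5),
    allowed_token_ids={0, 1},
)
```

With greedy decoding, the unfiltered logits would pick token 2, while the filtered logits can only pick a token from the allowed set.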

[NLP-53] SparseRM: A Lightweight Preference Modeling with Sparse Autoencoder AAAI-26

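【Quick Read】: This paper addresses the high training cost of reward models (RMs) in the post-training of large language models (LLMs) and their reliance on large-scale preference annotations. The core challenge is building a reliable and efficient reward model under limited resources. The key to the solution is SparseRM, which uses a Sparse Autoencoder (SAE) to extract interpretable, preference-relevant feature directions from the LLM's intermediate representations, projects the model representations onto those directions to compute alignment scores, and aggregates the scores with a lightweight reward head to predict a preference score. The method outperforms most mainstream reward models while using less than 1% of trainable parameters, and it integrates well into downstream alignment pipelines.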

Link: https://arxiv.org/abs/2511.07896
Authors: Dengcan Liu,Jiahao Li,Zheren Fu,Yi Tu,Jiajun Li,Zhendong Mao,Yongdong Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 15 pages, 11 figures, AAAI-26

Abstract:Reward models (RMs) are a core component in the post-training of large language models (LLMs), serving as proxies for human preference evaluation and guiding model alignment. However, training reliable RMs under limited resources remains challenging due to the reliance on large-scale preference annotations and the high cost of fine-tuning LLMs. To address this, we propose SparseRM, which leverages Sparse Autoencoder (SAE) to extract preference-relevant information encoded in model representations, enabling the construction of a lightweight and interpretable reward model. SparseRM first employs SAE to decompose LLM representations into interpretable directions that capture preference-relevant features. The representations are then projected onto these directions to compute alignment scores, which quantify the strength of each preference feature in the representations. A simple reward head aggregates these scores to predict preference scores. Experiments on three preference modeling tasks show that SparseRM achieves superior performance over most mainstream RMs while using less than 1% of trainable parameters. Moreover, it integrates seamlessly into downstream alignment pipelines, highlighting its potential for efficient alignment.
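
The pipeline in the abstract (project representations onto SAE-derived directions, then aggregate the alignment scores with a simple reward head) can be sketched roughly as follows. This is a minimal illustration under assumptions: the directions are taken as given rather than learned by an SAE, the projection is a dot product with unit-normalized directions, and the reward head is a single linear layer. The names and numbers are hypothetical.

```python
import numpy as np


def alignment_scores(representation, directions):
    """Project one representation onto each unit-normalized direction.

    representation: shape (d,); directions: shape (k, d), one row per direction.
    """
    representation = np.asarray(representation, dtype=float)
    directions = np.asarray(directions, dtype=float)
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return unit @ representation


def reward(representation, directions, head_weights, head_bias=0.0):
    """Linear reward head over the alignment scores (assumed form of the head)."""
    scores = alignment_scores(representation, directions)
    return float(np.asarray(head_weights, dtype=float) @ scores + head_bias)


# Toy example: two directions in a 3-dimensional representation space.
toy_directions = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
toy_representation = [1.0, 5.0, -2.0]
toy_scores = alignment_scores(toy_representation, toy_directions)
toy_reward = reward(toy_representation, toy_directions, head_weights=[2.0, 0.5])
```

In this form only the head weights and bias would be trainable, which is in keeping with the small trainable-parameter count the abstract reports, although the paper's actual head may differ.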

[NLP-54] Breaking the Adversarial Robustness-Performance Trade-off in Text Classification via Manifold Purification

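【Quick Read】: This paper addresses the trade-off in text classification (TC) between adversarial robustness and performance on clean data: conventional defenses tend to lower accuracy on clean samples as they strengthen resistance to adversarial attacks. The key to the solution is to model the distribution of clean samples on the encoder's embedding manifold so that robustness and accuracy can be achieved together. The paper proposes a two-module system, Manifold-Correcting Causal Flow (MC²F). A Stratified Riemannian Continuous Normalizing Flow (SR-CNF) learns the density of the clean data manifold in order to identify out-of-distribution adversarial embeddings, and a Geodesic Purification Solver projects them back onto the learned clean manifold along the shortest (geodesic) path, restoring a clean, semantically coherent representation. Across multiple datasets and adversarial attacks, the method achieves state-of-the-art robustness with no loss on clean data, and even modest accuracy gains.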

Link: https://arxiv.org/abs/2511.07888
Authors: Chenhao Dang,Jing Ma
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 3 figures

Abstract:A persistent challenge in text classification (TC) is that enhancing model robustness against adversarial attacks typically degrades performance on clean data. We argue that this challenge can be resolved by modeling the distribution of clean samples in the encoder embedding manifold. To this end, we propose the Manifold-Correcting Causal Flow (MC^2F), a two-module system that operates directly on sentence embeddings. A Stratified Riemannian Continuous Normalizing Flow (SR-CNF) learns the density of the clean data manifold. It identifies out-of-distribution embeddings, which are then corrected by a Geodesic Purification Solver. This solver projects adversarial points back onto the learned manifold via the shortest path, restoring a clean, semantically coherent representation. We conducted extensive evaluations on text classification (TC) across three datasets and multiple adversarial attacks. The results demonstrate that our method, MC^2F, not only establishes a new state-of-the-art in adversarial robustness but also fully preserves performance on clean data, even yielding modest gains in accuracy.

[NLP-55] Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

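【Quick Read】: This paper addresses the mismatch between the rapidly growing compute demand of large language model (LLM) inference and the limited ability of centralized cloud infrastructure to scale. LLM queries have traditionally been served by frontier models in the cloud, but as usage grows, cloud providers struggle to add capacity fast enough. The proposed solution is to use small local language models (small LMs, ≤20B parameters) whose performance approaches that of frontier models, together with local accelerators (e.g., Apple M4 Max), to run efficient, low-power inference locally and shift part of the load from the cloud to end devices. The core contribution is intelligence per watt (IPW), a combined metric defined as task accuracy per unit of power, which is used to systematically assess the capability and efficiency of model-accelerator pairs for local inference. The approach is validated in a large-scale empirical study covering one million real-world single-turn chat and reasoning queries.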

Link: https://arxiv.org/abs/2511.07885
Authors: Jon Saad-Falcon,Avanika Narayan,Hakki Orhun Akengin,J. Wes Griffin,Herumb Shandilya,Adrian Gamarra Lafuente,Medhya Goel,Rebecca Joseph,Shlok Natarajan,Etash Kumar Guha,Shang Zhu,Ben Athiwaratkun,John Hennessy,Azalia Mirhoseini,Christopher Ré
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract: Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure. Rapidly growing demand strains this paradigm, and cloud providers struggle to scale infrastructure at pace. Two advances enable us to rethink this paradigm: small LMs (≤20B active parameters) now achieve competitive performance to frontier models on many tasks, and local accelerators (e.g., Apple M4 Max) run these models at interactive latencies. This raises the question: can local inference viably redistribute demand from centralized infrastructure? Answering this requires measuring whether local LMs can accurately answer real-world queries and whether they can do so efficiently enough to be practical on power-constrained devices (i.e., laptops). We propose intelligence per watt (IPW), task accuracy divided by unit of power, as a metric for assessing capability and efficiency of local inference across model-accelerator pairs. We conduct a large-scale empirical study across 20+ state-of-the-art local LMs, 8 accelerators, and a representative subset of LLM traffic: 1M real-world single-turn chat and reasoning queries. For each query, we measure accuracy, energy, latency, and power. Our analysis reveals 3 findings. First, local LMs can accurately answer 88.7% of single-turn chat and reasoning queries with accuracy varying by domain. Second, from 2023-2025, IPW improved 5.3x and local query coverage rose from 23.2% to 71.3%. Third, local accelerators achieve at least 1.4x lower IPW than cloud accelerators running identical models, revealing significant headroom for optimization. These findings demonstrate that local inference can meaningfully redistribute demand from centralized infrastructure, with IPW serving as the critical metric for tracking this transition. We release our IPW profiling harness for systematic intelligence-per-watt benchmarking.
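
The abstract defines IPW as task accuracy divided by unit of power and lists accuracy, energy, latency, and power as the per-query measurements. The sketch below shows one way such a number could be computed. It assumes mean power is total energy divided by total latency, which is an assumption of this example and not necessarily the aggregation used by the authors' profiling harness. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueryMeasurement:
    correct: bool           # whether the model answered this query correctly
    energy_joules: float    # energy drawn while answering
    latency_seconds: float  # wall-clock time spent answering


def intelligence_per_watt(measurements: list[QueryMeasurement]) -> float:
    """Accuracy divided by mean power draw in watts (assumed aggregation)."""
    if not measurements:
        raise ValueError("need at least one measurement")
    accuracy = sum(m.correct for m in measurements) / len(measurements)
    total_energy = sum(m.energy_joules for m in measurements)
    total_time = sum(m.latency_seconds for m in measurements)
    mean_power_watts = total_energy / total_time  # joules per second
    return accuracy / mean_power_watts


# Toy data: 3 of 4 queries correct, 400 J consumed over 10 s (40 W on average).
toy_run = [
    QueryMeasurement(True, 100.0, 2.0),
    QueryMeasurement(False, 50.0, 1.0),
    QueryMeasurement(True, 150.0, 2.0),
    QueryMeasurement(True, 100.0, 5.0),
]
toy_ipw = intelligence_per_watt(toy_run)
```

Under this definition, a model-accelerator pair improves its IPW either by answering more queries correctly at the same power draw or by holding accuracy while drawing less power.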

[NLP-56] Planned Event Forecasting using Future Mentions and Related Entity Extraction in News Articles

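【Quick Read】: This paper addresses how to forecast, ahead of time, civil unrest events organized by the public in democracies, such as protests, rallies, and marches, so that administrative officials can take the necessary action. The core solution is a geographically independent, generalized model. It filters relevant news articles with topic modeling and word2vec, extracts key entities (people, organizations, locations, and dates) with Named Entity Recognition (NER), and applies time normalization to convert future date mentions into a standard format. On top of this, the paper proposes Related Entity Extraction, a method for identifying the entities that are actually involved in an event, which improves the accuracy and practical value of the forecasts.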

Link: https://arxiv.org/abs/2511.07879
Authors: Neelesh Kumar Shukla,Pranay Sanghvi
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:In democracies like India, people are free to express their views and demands. Sometimes this causes situations of civil unrest such as protests, rallies, and marches. These events may be disruptive in nature and are often held without prior permission from the competent authority. Forecasting these events helps administrative officials take necessary action. Usually, protests are announced well in advance to encourage large participation. Therefore, by analyzing such announcements in news articles, planned events can be forecasted beforehand. We developed such a system in this paper to forecast social unrest events using topic modeling and word2vec to filter relevant news articles, and Named Entity Recognition (NER) methods to identify entities such as people, organizations, locations, and dates. Time normalization is applied to convert future date mentions into a standard format. In this paper, we have developed a geographically independent, generalized model to identify key features for filtering civil unrest events. There could be many mentions of entities, but only a few may actually be involved in the event. This paper calls such entities Related Entities and proposes a method to extract them, referred to as Related Entity Extraction.
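
The time normalization step mentioned in the abstract converts future date mentions into a standard format. The abstract does not describe the rules used, so the sketch below is only a toy illustration. It assumes mentions are resolved relative to the article's publication date, that the standard format is ISO 8601, and that "next <weekday>" means the first such weekday strictly after publication. The function name and the supported phrases are hypothetical.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def normalize_future_mention(mention: str, published: date):
    """Resolve a relative date mention to an ISO 8601 string, or None if unsupported."""
    text = mention.strip().lower()
    if text == "today":
        return published.isoformat()
    if text == "tomorrow":
        return (published + timedelta(days=1)).isoformat()
    if text.startswith("next ") and text[5:] in WEEKDAYS:
        target = WEEKDAYS.index(text[5:])  # Monday is 0, matching date.weekday()
        # Number of days until the first such weekday strictly after `published`.
        days_ahead = (target - published.weekday() - 1) % 7 + 1
        return (published + timedelta(days=days_ahead)).isoformat()
    return None  # phrases outside this toy rule set are left unresolved


article_date = date(2025, 11, 12)
rally_date = normalize_future_mention("next Friday", article_date)
```

A real system would need far broader coverage (absolute dates, ranges, regional formats), which is why dedicated temporal taggers are normally used for this step.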

[NLP-57] LoopLLM: Transferable Energy-Latency Attacks in LLMs via Repetitive Generation AAAI2026

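【Quick Read】: This paper studies energy-latency attacks on large language models (LLMs) at inference time, in which carefully crafted input prompts induce high-energy, high-latency outputs. Existing methods mainly work by delaying the generation of termination symbols, but this control becomes harder as the output grows longer, so the attacks lose effectiveness. The paper proposes LoopLLM, an energy-latency attack framework with two core ideas: (1) a repetition-inducing prompt optimization that exploits vulnerabilities in autoregressive generation to trap the model in a low-entropy decoding loop, reliably driving it to its output limit; and (2) a token-aligned ensemble optimization that aggregates gradients to improve transferability across models. Experiments on 12 open-source and 2 commercial LLMs show that LoopLLM significantly outperforms baselines, reaching over 90% of the maximum output length and improving transferability to DeepSeek-V3 and Gemini 2.5 Flash by around 40%.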

Link: https://arxiv.org/abs/2511.07876
Authors: Xingyu Li,Xiaolei Liu,Cheng Liu,Yixiao Xu,Kangyi Ding,Bangzhou Xin,Jia-Li Yin
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 14 pages with 7 figures; accepted by the AAAI 2026

Abstract:As large language models (LLMs) scale, their inference incurs substantial computational resources, exposing them to energy-latency attacks, where crafted prompts induce high energy and latency cost. Existing attack methods aim to prolong output by delaying the generation of termination symbols. However, as the output grows longer, controlling the termination symbols through input becomes difficult, making these methods less effective. Therefore, we propose LoopLLM, an energy-latency attack framework based on the observation that repetitive generation can trigger low-entropy decoding loops, reliably compelling LLMs to generate until their output limits. LoopLLM introduces (1) a repetition-inducing prompt optimization that exploits autoregressive vulnerabilities to induce repetitive generation, and (2) a token-aligned ensemble optimization that aggregates gradients to improve cross-model transferability. Extensive experiments on 12 open-source and 2 commercial LLMs show that LoopLLM significantly outperforms existing methods, achieving over 90% of the maximum output length, compared to 20% for baselines, and improving transferability by around 40% to DeepSeek-V3 and Gemini 2.5 Flash.

[NLP-58] AlignSurvey: A Comprehensive Benchmark for Human Preferences Alignment in Social Surveys

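【Quick Read】: This paper addresses the limitations of traditional social surveys (fixed question formats, high cost, limited adaptability, and difficulty ensuring cross-cultural equivalence) as well as the shortcomings of current LLM-based survey simulation research, which is fragmented across tasks, overlooks the full survey process, and under-represents some groups. The key to the solution is AlignSurvey, the first benchmark that systematically replicates and evaluates the full social survey pipeline. It defines four tasks aligned with key survey stages: social role modeling, semi-structured interview modeling, attitude stance modeling, and survey response modeling. It also provides task-specific evaluation metrics that measure alignment fidelity, consistency, and fairness at both the individual and group levels. In addition, the authors build a multi-tiered dataset architecture (the cross-national Social Foundation Corpus and the Entire-Pipeline Survey Datasets) and the SurveyLM family of models obtained through two-stage fine-tuning, offering a standardized toolchain for interpretable, responsible, and representative generative survey research.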

Link: https://arxiv.org/abs/2511.07871
Authors: Chenxi Lin,Weikang Yuan,Zhuoren Jiang,Biao Huang,Ruitao Zhang,Jianan Ge,Yueqian Xu,Jianxing Yu
Affiliations: 1. University of Science and Technology of China; 2. Tsinghua University; 3. Institute of Automation, Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Understanding human attitudes, preferences, and behaviors through social surveys is essential for academic research and policymaking. Yet traditional surveys face persistent challenges, including fixed-question formats, high costs, limited adaptability, and difficulties ensuring cross-cultural equivalence. While recent studies explore large language models (LLMs) to simulate survey responses, most are limited to structured questions, overlook the entire survey process, and risk under-representing marginalized groups due to training data biases. We introduce AlignSurvey, the first benchmark that systematically replicates and evaluates the full social survey pipeline using LLMs. It defines four tasks aligned with key survey stages: social role modeling, semi-structured interview modeling, attitude stance modeling, and survey response modeling. It also provides task-specific evaluation metrics to assess alignment fidelity, consistency, and fairness at both individual and group levels, with a focus on demographic diversity. To support AlignSurvey, we construct a multi-tiered dataset architecture: (i) the Social Foundation Corpus, a cross-national resource with 44K+ interview dialogues and 400K+ structured survey records; and (ii) a suite of Entire-Pipeline Survey Datasets, including the expert-annotated AlignSurvey-Expert (ASE) and two nationally representative surveys for cross-cultural evaluation. We release the SurveyLM family, obtained through two-stage fine-tuning of open-source LLMs, and offer reference models for evaluating domain-specific alignment. All datasets, models, and tools are available on GitHub and Hugging Face to support transparent and socially responsible research.

[NLP-59] LLM-Powered Fully Automated Chaos Engineering: Towards Enabling Anyone to Build Resilient Software Systems at Low Cost

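【Quick Read】: This paper addresses two core problems that Chaos Engineering (CE) faces in practice. First, planning CE experiments and improving the system based on their results still rely on manual work that is time-consuming and requires multi-domain expertise. Second, existing tools only automate the execution of predefined experiments and cannot close the full loop from requirement definition through code generation, testing, and debugging. The key to the solution is ChaosEater, a system that builds an agentic workflow on Large Language Models (LLMs). It decomposes the CE cycle into subtasks and assigns them to LLMs, fully automating the CE cycle for software systems built on Kubernetes. This significantly reduces time and monetary cost, and case studies confirm that the resulting cycles are effective and reasonable.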

Link: https://arxiv.org/abs/2511.07865
Authors: Daisuke Kikuta,Hiroki Ikeuchi,Kengo Tajiri
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: Accepted at ASE 2025 NIER Track. The code is available at this https URL

Abstract:Chaos Engineering (CE) is an engineering technique aimed at improving the resilience of distributed systems. It involves intentionally injecting faults into a system to test its resilience, uncover weaknesses, and address them before they cause failures in production. Recent CE tools automate the execution of predefined CE experiments. However, planning such experiments and improving the system based on the experimental results still remain manual. These processes are labor-intensive and require multi-domain expertise. To address these challenges and enable anyone to build resilient systems at low cost, this paper proposes ChaosEater, a system that automates the entire CE cycle with Large Language Models (LLMs). It predefines an agentic workflow according to a systematic CE cycle and assigns subdivided processes within the workflow to LLMs. ChaosEater targets CE for software systems built on Kubernetes. Therefore, the LLMs in ChaosEater complete CE cycles through software engineering tasks, including requirement definition, code generation, testing, and debugging. We evaluate ChaosEater through case studies on small- and large-scale Kubernetes systems. The results demonstrate that it consistently completes reasonable CE cycles with significantly low time and monetary costs. Its cycles are also qualitatively validated by human engineers and LLMs.

[NLP-60] From Experience to Strategy: Empowering LLM Agents with Trainable Graph Memory

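【Quick Read】: This paper addresses the limited reasoning ability of large language model (LLM) agents in autonomous task-solving that results from poor use of prior experience. Existing approaches rely either on implicit memory acquired through training, which suffers from catastrophic forgetting and limited interpretability, or on explicit memory supplied through prompting, which lacks adaptability. The key to the solution is an agent-centric, trainable, multi-layered graph memory framework that abstracts raw agent trajectories into structured decision paths in a state machine and further distills them into human-interpretable strategic meta-cognition. A reinforcement-based weight optimization procedure estimates the empirical utility of each meta-cognition from the reward feedback of downstream tasks, and the optimized strategies are dynamically integrated into the LLM agent's training loop through meta-cognitive prompting, yielding adaptive memory integration and stronger strategies.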

Link: https://arxiv.org/abs/2511.07800
Authors: Siyu Xia,Zekun Xu,Jiajun Chai,Wentian Fan,Yan Song,Xiaohan Wang,Guojun Yin,Wei Lin,Haifeng Zhang,Jun Wang
Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Meituan; Nanjing University of Posts and Telecommunications; AI Centre, Department of Computer Science, University College London
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) based agents have demonstrated remarkable potential in autonomous task-solving across complex, open-ended environments. A promising approach for improving the reasoning capabilities of LLM agents is to better utilize prior experiences in guiding current decisions. However, LLMs acquire experience either through implicit memory via training, which suffers from catastrophic forgetting and limited interpretability, or explicit memory via prompting, which lacks adaptability. In this paper, we introduce a novel agent-centric, trainable, multi-layered graph memory framework and evaluate how context memory enhances the ability of LLMs to utilize parametric information. The graph abstracts raw agent trajectories into structured decision paths in a state machine and further distills them into high-level, human-interpretable strategic meta-cognition. In order to make memory adaptable, we propose a reinforcement-based weight optimization procedure that estimates the empirical utility of each meta-cognition based on reward feedback from downstream tasks. These optimized strategies are then dynamically integrated into the LLM agent’s training loop through meta-cognitive prompting. Empirically, the learnable graph memory delivers robust generalization, improves LLM agents’ strategic reasoning performance, and provides consistent benefits during Reinforcement Learning (RL) training.

[NLP-61] Design, Results and Industry Implications of the World's First Insurance Large Language Model Evaluation Benchmark

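【Quick Read】: This paper addresses the lack of a professional, systematic, and authoritative evaluation benchmark for large language models in the insurance domain. General-purpose models show clear weaknesses in insurance-specific capabilities such as actuarial work, compliance, and business-logic reasoning, and existing evaluation suites cannot pinpoint these limitations. The key to the solution is CUFEInse v1.0, a multi-dimensional benchmark designed around the principles of "quantitative-oriented, expert-driven, and multi-validation". It covers 5 core dimensions, 54 sub-indicators, and 14,430 high-quality questions spanning insurance theory, industry understanding, safety and compliance, intelligent agent application, and logical rigor. A systematic evaluation of 11 mainstream large models on this benchmark exposes the common bottlenecks of general-purpose models in insurance scenarios and confirms both the strengths and the shortcomings of domain-trained models, providing empirical evidence and methodological guidance for the "domain adaptation + reasoning enhancement" direction of insurance LLMs.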

Link: https://arxiv.org/abs/2511.07794
Authors: Hua Zhou(Central University of Finance and Economics),Bing Ma(Central University of Finance and Economics),Yufei Zhang(Zetavision AI Lab),Yi Zhao(Zetavision AI Lab)
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 11 models, 1 set of evaluation framework, 5 core dimensions, 54 sub-indicators, 14,430 high-quality questions

Abstract:This paper comprehensively elaborates on the construction methodology, multi-dimensional evaluation system, and underlying design philosophy of CUFEInse v1.0. Adhering to the principles of “quantitative-oriented, expert-driven, and multi-validation,” the benchmark establishes an evaluation framework covering 5 core dimensions, 54 sub-indicators, and 14,430 high-quality questions, encompassing insurance theoretical knowledge, industry understanding, safety and compliance, intelligent agent application, and logical rigor. Based on this benchmark, a comprehensive evaluation was conducted on 11 mainstream large language models. The evaluation results reveal that general-purpose models suffer from common bottlenecks such as weak actuarial capabilities and inadequate compliance adaptation. High-quality domain-specific training demonstrates significant advantages in insurance vertical scenarios but exhibits shortcomings in business adaptation and compliance. The evaluation also accurately identifies the common bottlenecks of current large models in professional scenarios such as insurance actuarial, underwriting and claim settlement reasoning, and compliant marketing copywriting. The establishment of CUFEInse not only fills the gap in professional evaluation benchmarks for the insurance field, providing academia and industry with a professional, systematic, and authoritative evaluation tool, but also its construction concept and methodology offer important references for the evaluation paradigm of large models in vertical fields, serving as an authoritative reference for academic model optimization and industrial model selection. Finally, the paper looks forward to the future iteration direction of the evaluation benchmark and the core development direction of “domain adaptation + reasoning enhancement” for insurance large models.

[NLP-62] CC30k: A Citation Contexts Dataset for Reproducibility-Oriented Sentiment Analysis

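【Quick Read】: This paper addresses the scarcity of resources for studying computational reproducibility in machine learning, in particular the lack of high-quality annotated datasets for training and evaluating reproducibility-oriented sentiment classifiers. The key solution is the CC30k dataset, which contains 30,734 citation contexts from machine learning papers, each labeled with one of three reproducibility-oriented sentiment labels (Positive, Negative, Neutral) reflecting how the community perceives the reproducibility of the cited paper. Built through rigorous cleansing, crowdsourced annotation, and controlled generation of negative samples, the dataset reaches a labeling accuracy of 94%. Large language models fine-tuned on it improve significantly on reproducibility-oriented sentiment classification, laying a foundation for large-scale assessment of the reproducibility of machine learning papers.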

Link: https://arxiv.org/abs/2511.07790
Authors: Rochana R. Obadage,Sarah M. Rajtmajer,Jian Wu
Affiliations: Old Dominion University; The Pennsylvania State University
Categories: Digital Libraries (cs.DL); Computation and Language (cs.CL)
Comments: Peer reviewed and accepted at JCDL 2025, 16 pages, 7 figures

Abstract: Sentiments about the reproducibility of cited papers in downstream literature offer community perspectives and have been shown to be a promising signal of the actual reproducibility of published findings. To train models that effectively predict reproducibility-oriented sentiments and to further study their correlation with reproducibility systematically, we introduce the CC30k dataset, comprising a total of 30,734 citation contexts in machine learning papers. Each citation context is labeled with one of three reproducibility-oriented sentiment labels: Positive, Negative, or Neutral, reflecting the cited paper’s perceived reproducibility or replicability. Of these, 25,829 are labeled through crowdsourcing, supplemented with negatives generated through a controlled pipeline to counter the scarcity of negative labels. Unlike traditional sentiment analysis datasets, CC30k focuses on reproducibility-oriented sentiments, addressing a research gap in resources for computational reproducibility studies. The dataset was created through a pipeline that includes robust data cleansing, careful crowd selection, and thorough validation. The resulting dataset achieves a labeling accuracy of 94%. We then demonstrated that the performance of three large language models significantly improves on the reproducibility-oriented sentiment classification after fine-tuning using our dataset. The dataset lays the foundation for large-scale assessments of the reproducibility of machine learning papers. The CC30k dataset and the Jupyter notebooks used to produce and analyze the dataset are publicly available at this https URL .

[NLP-63] SALT: Steering Activations towards Leakage-free Thinking in Chain of Thought

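【Quick Read】: This paper addresses the contextual privacy problem that arises when large language models (LLMs) acting as personal assistants leak sensitive user data through their internal reasoning. Even when the final output looks safe, the Chain of Thought (CoT) reasoning trace may inadvertently expose private information, which violates users' privacy expectations. The key to the solution is a lightweight test-time intervention, Steering Activations towards Leakage-free Thinking (SALT), which injects targeted steering vectors into the hidden states to suppress leakage at the high-leakage layers, effectively reducing contextual privacy leakage without significantly affecting task performance.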

Link: https://arxiv.org/abs/2511.07772
Authors: Shourya Batra,Pierce Tillman,Samarth Gaggar,Shashank Kesineni,Kevin Zhu,Sunishchal Dev,Ashwinee Panda,Vasu Sharma,Maheep Chaudhary
Affiliations: AlgoVerse AI Research; RAND; University of Maryland; Meta
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

点击查看摘要

Abstract: As Large Language Models (LLMs) evolve into personal assistants with access to sensitive user data, they face a critical privacy challenge: while prior work has addressed output-level privacy, recent findings reveal that LLMs often leak private information through their internal reasoning processes, violating contextual privacy expectations. These leaky thoughts occur when models inadvertently expose sensitive details in their reasoning traces, even when final outputs appear safe. The challenge lies in preventing such leakage without compromising the model’s reasoning capabilities, requiring a delicate balance between privacy and utility. We introduce Steering Activations towards Leakage-free Thinking (SALT), a lightweight test-time intervention that mitigates privacy leakage in a model’s Chain of Thought (CoT) by injecting targeted steering vectors into the hidden states. We identify the high-leakage layers responsible for this behavior. Through experiments across multiple LLMs, we demonstrate that SALT achieves an 18.2% reduction in CPL on QwQ-32B, a 17.9% reduction on Llama-3.1-8B, and a 31.2% reduction on Deepseek on the contextual privacy leakage dataset AirGapAgent-R, while maintaining comparable task performance and utility. Our work establishes SALT as a practical approach for test-time privacy protection in reasoning-capable language models, offering a path toward safer deployment of LLM-based personal agents.
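
The intervention described in the abstract can be pictured as adding a fixed vector to the hidden state at a chosen layer. The sketch below is only an illustration under stated assumptions: plain Python lists stand in for hidden states, and the steering vector is built as a difference of mean activations between non-leaking and leaking examples, which is a common construction and not necessarily the one used in the paper. The names `steering_vector`, `steer`, and `alpha`, and all numbers, are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(safe_acts, leaky_acts):
    """Difference of means (an assumed construction, not taken from the paper)."""
    safe_mean = mean_vector(safe_acts)
    leaky_mean = mean_vector(leaky_acts)
    return [s - l for s, l in zip(safe_mean, leaky_mean)]

def steer(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to every token's hidden state."""
    return [[h + alpha * v for h, v in zip(token, vector)] for token in hidden]

# Toy activations: two examples per class, hidden size 2.
safe_acts = [[1.0, 2.0], [3.0, 4.0]]
leaky_acts = [[0.0, 0.0], [1.0, 2.0]]
v = steering_vector(safe_acts, leaky_acts)

hidden = [[0.0, 0.0], [1.0, 1.0]]        # two tokens at the chosen layer
steered = steer(hidden, v, alpha=0.5)
```

In a real model the same addition would be applied inside the forward pass at the layers identified as high-leakage; which layers those are, and how strongly to steer, are the parts the paper determines empirically.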

[NLP-64] Back to the Future: The Role of Past and Future Context Predictability in Incremental Language Production

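Quick read: This paper investigates how future context affects the choice and form of words in natural language production, focusing on the backward predictability effect that earlier work left poorly explained. Prior studies mainly examined how the preceding context influences lexical choice and word duration. This paper introduces a new information-theoretic predictability measure that integrates information from both past and future context into a single framework, giving a fuller account of context dependence in production. The key elements are: (1) a principled, conceptually motivated information-theoretic measure that quantifies predictability from both preceding and following context; and (2) a generative modeling framework that separates the effects of lexical, contextual, and communicative factors on substitution errors, showing how speakers weigh form, meaning, and context during lexical planning. The approach reproduces classic predictability effects and provides new empirical grounding for theories of sentence planning.

Link: https://arxiv.org/abs/2511.07752
Authors: Shiva Upadhye, Richard Futrell
Affiliations: University of California, Irvine
Subjects: Computation and Language (cs.CL)
Comments: 42 pages, 13 figures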

Abstract:Contextual predictability shapes both the form and choice of words in online language production. The effects of the predictability of a word given its previous context are generally well-understood in both production and comprehension, but studies of naturalistic production have also revealed a poorly-understood backward predictability effect of a word given its future context, which may be related to future planning. Here, in two studies of naturalistic speech corpora, we investigate backward predictability effects using improved measures and more powerful language models, introducing a new principled and conceptually motivated information-theoretic predictability measure that integrates predictability from both the future and the past context. Our first study revisits classic predictability effects on word duration. Our second study investigates substitution errors within a generative framework that independently models the effects of lexical, contextual, and communicative factors on word choice, while predicting the actual words that surface as speech errors. We find that our proposed conceptually-motivated alternative to backward predictability yields qualitatively similar effects across both studies. Through a fine-grained analysis of substitution errors, we further show that different kinds of errors are suggestive of how speakers prioritize form, meaning, and context-based information during lexical planning. Together, these findings illuminate the functional roles of past and future context in how speakers encode and choose words, offering a bridge between contextual predictability effects and the mechanisms of sentence planning.

[NLP-65] ViPRA: Video Prediction for Robot Actions

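Quick read: This paper addresses learning continuous robot control policies from videos without action annotations. Such videos are rich in physical interaction but hard to use directly for robot learning because they lack labeled actions. The key to the solution is the Video Prediction for Robot Actions (ViPRA) framework, which follows a pretraining-finetuning paradigm: a video-language model is trained to predict both future visual observations and motion-centric latent actions, where the latent actions serve as intermediate representations of scene dynamics and are trained with perceptual losses and optical flow consistency so that they reflect physically grounded behavior. For downstream control, a chunked flow matching decoder maps latent actions to robot-specific continuous action sequences using only 100 to 200 teleoperated demonstrations, enabling smooth, high-frequency continuous control (up to 22 Hz). This avoids expensive action annotation, supports generalization across embodiments, and outperforms existing baselines.

Link: https://arxiv.org/abs/2511.07732
Authors: Sandeep Routray, Hengkai Pan, Unnat Jain, Shikhar Bahl, Deepak Pathak
Affiliations: Carnegie Mellon University; Skild AI; University of California, Irvine
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Website: this https URL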

Abstract: Can we turn a video prediction model into a robot policy? Videos, including those of humans or teleoperated robots, capture rich physical interactions. However, most of them lack labeled actions, which limits their use in robot learning. We present Video Prediction for Robot Actions (ViPRA), a simple pretraining-finetuning framework that learns continuous robot control from these actionless videos. Instead of directly predicting actions, we train a video-language model to predict both future visual observations and motion-centric latent actions, which serve as intermediate representations of scene dynamics. We train these latent actions using perceptual losses and optical flow consistency to ensure they reflect physically grounded behavior. For downstream control, we introduce a chunked flow matching decoder that maps latent actions to robot-specific continuous action sequences, using only 100 to 200 teleoperated demonstrations. This approach avoids expensive action annotation, supports generalization across embodiments, and enables smooth, high-frequency continuous control up to 22 Hz via chunked action decoding. Unlike prior latent action works that treat pretraining as autoregressive policy learning, ViPRA explicitly models both what changes and how. Our method outperforms strong baselines, with a 16% gain on the SIMPLER benchmark and a 13% improvement across real world manipulation tasks. We will release models and code at this https URL

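[NLP-66] Critical Confabulation: Can LLMs Hallucinate for Social Good?

Quick read: This paper addresses the missing narratives of history's "hidden figures", that is, the systematic omission of marginalized groups from archives as a result of social and political inequality. Conventional methods struggle to reconstruct the experiences of these groups, and generative AI is usually considered unreliable because of hallucination. The paper proposes critical confabulation: using controlled, evidence-bound hallucination to fill archival gaps and reconstruct narratives that are grounded in evidence yet reflect divergent perspectives. The key to the approach is an open-ended narrative cloze task in which audited open models (such as the OLMo-2 family) are prompted to generate a masked event that fits the surrounding timeline, supporting knowledge production without mistaking speculation for fact.

Link: https://arxiv.org/abs/2511.07722
Authors: Peiqi Sui, Eamon Duede, Hoyt Long, Richard Jean So
Affiliations: McGill University; Purdue University; University of Chicago; Duke University
Subjects: Computation and Language (cs.CL)
Comments: 24 pages, 4 figures, under review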

Abstract:LLMs hallucinate, yet some confabulations can have social affordances if carefully bounded. We propose critical confabulation (inspired by critical fabulation from literary and social theory), the use of LLM hallucinations to “fill-in-the-gap” for omissions in archives due to social and political inequality, and reconstruct divergent yet evidence-bound narratives for history’s “hidden figures”. We simulate these gaps with an open-ended narrative cloze task: asking LLMs to generate a masked event in a character-centric timeline sourced from a novel corpus of unpublished texts. We evaluate audited (for data contamination), fully-open models (the OLMo-2 family) and unaudited open-weight and proprietary baselines under a range of prompts designed to elicit controlled and useful hallucinations. Our findings validate LLMs’ foundational narrative understanding capabilities to perform critical confabulation, and show how controlled and well-specified hallucinations can support LLM applications for knowledge production without collapsing speculation into a lack of historical accuracy and fidelity.

[NLP-67] CAPO: Confidence Aware Preference Optimization Learning for Multilingual Preferences AACL2025

Link: https://arxiv.org/abs/2511.07691
Authors: Rhitabrat Pokharel, Yufei Tao, Ameeta Agrawal
Affiliations: Portland State University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at IJCNLP-AACL 2025 Findings

[NLP-68] Stress Testing Factual Consistency Metrics for Long-Document Summarization

Link: https://arxiv.org/abs/2511.07689
Authors: Zain Muhammad Mujahid, Dustin Wright, Isabelle Augenstein
Affiliations: University of Copenhagen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

[NLP-69] ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents

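Quick read: This paper addresses the difficulty of evaluating Deep Research (DR) tasks: how to assess, in a standardized, quantifiable, and fine-grained way, the long-form answers that large language models (LLMs) produce through multi-step reasoning and cross-document synthesis. The key to the solution is ResearchRubrics, a benchmark built with more than 2,800 hours of human labor and over 2,500 expert-written, fine-grained rubrics that score DR responses on factual grounding, reasoning soundness, and clarity. The paper also proposes a three-axis complexity framework (conceptual breadth, logical nesting, and exploration) for categorizing DR task difficulty, and develops human and model-based evaluation protocols that measure how well agents adhere to the rubrics. Experiments show that leading DR systems such as Gemini's DR and OpenAI's DR achieve under 68% average compliance, mainly because they miss implicit context and reason inadequately about retrieved information, which underlines the need for robust, scalable evaluation.

Link: https://arxiv.org/abs/2511.07685
Authors: Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, Sumana Basu, Aishwarya Balwani, Denis Peskoff, Marcos Ayestaran, Sean M. Hendryx, Brad Kenstler, Bing Liu
Affiliations: Scale AI; University of Maryland; University of Chicago; Washington University, St. Louis; McGill University; University of California, Berkeley
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 27 pages, 21 figures, pre-print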

Abstract: Deep Research (DR) is an emerging agent application that leverages large language models (LLMs) to address open-ended queries. It requires the integration of several capabilities, including multi-step reasoning, cross-document synthesis, and the generation of evidence-backed, long-form answers. Evaluating DR remains challenging because responses are lengthy and diverse, admit many valid solutions, and often depend on dynamic information sources. We introduce ResearchRubrics, a standardized benchmark for DR built with over 2,800 hours of human labor that pairs realistic, domain-diverse prompts with 2,500+ expert-written, fine-grained rubrics to assess factual grounding, reasoning soundness, and clarity. We also propose a new complexity framework for categorizing DR tasks along three axes: conceptual breadth, logical nesting, and exploration. In addition, we develop human and model-based evaluation protocols that measure rubric adherence for DR agents. We evaluate several state-of-the-art DR systems and find that even leading agents like Gemini’s DR and OpenAI’s DR achieve under 68% average compliance with our rubrics, primarily due to missed implicit context and inadequate reasoning about retrieved information. Our results highlight the need for robust, scalable assessment of deep research capabilities, to which end we release ResearchRubrics (including all prompts, rubrics, and evaluation code) to facilitate progress toward well-justified research assistants.
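
One way to read "average compliance" is as a weighted share of rubric items that a response satisfies. The sketch below assumes exactly that: each rubric item carries a weight and a judged score between 0 and 1. The abstract does not specify the scoring formula, so the function `rubric_compliance`, the weights, and the scores are all hypothetical.

```python
def rubric_compliance(verdicts):
    """Weighted average of per-rubric scores.

    verdicts: list of (weight, score) pairs, with score in [0, 1].
    The weighting scheme is an assumption made for illustration.
    """
    total_weight = sum(weight for weight, _ in verdicts)
    if total_weight == 0:
        raise ValueError("at least one rubric must carry weight")
    return sum(weight * score for weight, score in verdicts) / total_weight

# Three hypothetical rubric items: fully met, partially met, not met.
verdicts = [(3, 1.0), (2, 0.5), (1, 0.0)]
score = rubric_compliance(verdicts)
```

With these toy numbers the response earns 4 of 6 weighted points, so heavier rubric items dominate the result.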

[NLP-70] Revisiting NLI: Towards Cost-Effective and Human-Aligned Metrics for Evaluating LLMs in Question Answering

Quick read: This paper addresses the difficulty of evaluating answers generated by state-of-the-art large language models (LLMs): traditional lexical-matching metrics miss semantic nuances, while "LLM-as-Judge" scoring is computationally expensive. The key to the solution is to re-evaluate a lightweight alternative: an off-the-shelf Natural Language Inference (NLI) model augmented with a simple lexical-match flag. Experiments show that this approach matches GPT-4o's accuracy (89.9%) on long-form question answering while requiring orders of magnitude fewer parameters.

Link: https://arxiv.org/abs/2511.07659
Authors: Sai Shridhar Balamurali, Lu Cheng
Affiliations: University of Illinois at Chicago
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Evaluating answers from state-of-the-art large language models (LLMs) is challenging: lexical metrics miss semantic nuances, whereas “LLM-as-Judge” scoring is computationally expensive. We re-evaluate a lightweight alternative – off-the-shelf Natural Language Inference (NLI) scoring augmented by a simple lexical-match flag and find that this decades-old technique matches GPT-4o’s accuracy (89.9%) on long-form QA, while requiring orders-of-magnitude fewer parameters. To test human alignment of these metrics rigorously, we introduce DIVER-QA, a new 3000-sample human-annotated benchmark spanning five QA datasets and five candidate LLMs. Our results highlight that inexpensive NLI-based evaluation remains competitive and offer DIVER-QA as an open resource for future metric research.
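
The combination of an NLI score with a lexical-match flag can be sketched as a two-step check. The abstract does not say how the two signals are combined, so the rule below (accept if the reference appears verbatim, otherwise fall back to an entailment threshold) is an assumption. `entail_prob` is a hypothetical callable standing in for a real NLI model, and `toy_entail_prob` is a placeholder based on token overlap, not an actual NLI scorer.

```python
import re

def normalize(text):
    """Lowercase and keep only alphanumeric tokens."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def lexical_match(reference, answer):
    """True when the normalized reference appears verbatim in the answer."""
    return normalize(reference) in normalize(answer)

def judge(reference, answer, entail_prob, threshold=0.5):
    """Accept if the lexical flag fires or the answer entails the reference."""
    if lexical_match(reference, answer):
        return True
    return entail_prob(premise=answer, hypothesis=reference) >= threshold

def toy_entail_prob(premise, hypothesis):
    # Placeholder for an NLI model: fraction of hypothesis tokens found in the premise.
    p = set(normalize(premise).split())
    h = set(normalize(hypothesis).split())
    return len(p & h) / len(h) if h else 0.0

accepted = judge("Paris", "The capital is Paris.", toy_entail_prob)
rejected = judge("Paris", "It is Lyon.", toy_entail_prob)
```

Swapping `toy_entail_prob` for a call to a real NLI classifier is the only change needed to use the same rule with an off-the-shelf model.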

[NLP-71] LLMs vs. Traditional Sentiment Tools in Psychology: An Evaluation on Belgian-Dutch Narratives NEURIPS2025

Link: https://arxiv.org/abs/2511.07641
Authors: Ratna Kandala, Katie Hoemann
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling

[NLP-72] Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces AAAI2026

Link: https://arxiv.org/abs/2511.07587
Authors: Shreyas Rajesh, Pavan Holur, Chenda Duan, David Chong, Vwani Roychowdhury
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: AAAI 2026 Oral

[NLP-73] LLM Output Drift: Cross-Provider Validation Mitigation for Financial Workflows

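Quick read: This paper addresses the loss of auditability and trust caused by nondeterministic outputs (output drift) when large language models (LLMs) are used in finance, especially in compliance-critical settings such as regulatory reporting, client communications, and reconciliations. Its central finding is that larger models are not more consistent; the relationship is strongly inverse. Smaller models (Granite-3-8B and Qwen2.5-7B) achieve 100% output consistency at temperature T=0.0, whereas GPT-OSS-120B reaches only 12.5% (95% CI: 3.5-36.0%) regardless of configuration (p < 0.0001, Fisher's exact test). The key to the solution is a complete framework for deterministic evaluation, comprising: (i) a finance-calibrated deterministic test harness (greedy decoding, fixed random seeds, and SEC 10-K structure-aware retrieval ordering); (ii) task-specific invariant checking for RAG, JSON, and SQL outputs (using a ±5% financial materiality threshold and SEC citation validation); (iii) a three-tier model classification system that supports risk-appropriate deployment decisions; and (iv) an audit-ready attestation system with dual-provider validation. The framework is mapped to FSB, BIS, and CFTC requirements, offering a practical path to compliance-ready AI deployment.

Link: https://arxiv.org/abs/2511.07585
Authors: Raffi Khatchadourian, Rolando Franco
Affiliations: IBM – Financial Services Market
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 11 pages, 5 figures. To appear in AI4F @ ACM ICAIF '25, November 15-18, 2025, Singapore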

Abstract: Financial institutions deploy Large Language Models (LLMs) for reconciliations, regulatory reporting, and client communications, but nondeterministic outputs (output drift) undermine auditability and trust. We quantify drift across five model architectures (7B-120B parameters) on regulated financial tasks, revealing a stark inverse relationship: smaller models (Granite-3-8B, Qwen2.5-7B) achieve 100% output consistency at T=0.0, while GPT-OSS-120B exhibits only 12.5% consistency (95% CI: 3.5-36.0%) regardless of configuration (p < 0.0001, Fisher’s exact test). This finding challenges conventional assumptions that larger models are universally superior for production deployment. Our contributions include: (i) a finance-calibrated deterministic test harness combining greedy decoding (T=0.0), fixed seeds, and SEC 10-K structure-aware retrieval ordering; (ii) task-specific invariant checking for RAG, JSON, and SQL outputs using finance-calibrated materiality thresholds (±5%) and SEC citation validation; (iii) a three-tier model classification system enabling risk-appropriate deployment decisions; and (iv) an audit-ready attestation system with dual-provider validation. We evaluated five models (Qwen2.5-7B via Ollama, Granite-3-8B via IBM this http URL, Llama-3.3-70B, Mistral-Medium-2505, and GPT-OSS-120B) across three regulated financial tasks. Across 480 runs (n=16 per condition), structured tasks (SQL) remain stable even at T=0.2, while RAG tasks show drift (25-75%), revealing task-dependent sensitivity. Cross-provider validation confirms deterministic behavior transfers between local and cloud deployments. We map our framework to Financial Stability Board (FSB), Bank for International Settlements (BIS), and Commodity Futures Trading Commission (CFTC) requirements, demonstrating practical pathways for compliance-ready AI deployments.
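
The reported figures can be sanity-checked with a little arithmetic. With n=16 runs per condition, 12.5% consistency corresponds to 2 consistent runs out of 16, and a Wilson score interval for 2/16 comes out at roughly 3.5% to 36.0%, which agrees with the interval quoted in the abstract. Whether the authors actually used the Wilson method is an assumption; the sketch below only shows that it is compatible with the numbers. The `within_materiality` helper illustrates a ±5% tolerance check and is likewise hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def within_materiality(value, reference, tolerance=0.05):
    """True when value deviates from reference by at most the given fraction."""
    return abs(value - reference) <= tolerance * abs(reference)

# Assumed reading of "12.5% consistency" with n=16: 2 consistent runs.
low, high = wilson_interval(2, 16)

# Toy figures: a 4% deviation passes a ±5% threshold, a 6% deviation does not.
ok = within_materiality(104.0, 100.0)
too_far = within_materiality(106.0, 100.0)
```

The width of the interval is a reminder that 16 runs per condition leaves considerable uncertainty around any single consistency estimate.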

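[NLP-74] Think Before You Retrieve: Learning Test-Time Adaptive Search with Small Language Models

Quick read: This paper addresses the inability of existing information retrieval methods to explore, take feedback, and revise iteratively on complex user queries: neural retrievers cannot reason, large language models (LLMs) offer semantic depth at prohibitive computational cost, and query rewriting or decomposition is limited to static transformations that cannot capture a dynamic search process. The key to the solution is Orion, a training framework that enables compact models (350M-1.2B parameters) to perform iterative retrieval through three mechanisms: (1) synthetic trajectory generation and supervised fine-tuning to encourage diverse exploration patterns; (2) reinforcement learning (RL) that rewards effective query refinement and backtracking; and (3) inference-time beam search that exploits the self-reflection capabilities learned during RL. Experiments show that, using only 3% of the available training data, the 1.2B model beats prior retrievers on several benchmarks and outperforms retrievers up to 200-400x larger on five of six, suggesting that retrieval performance can come from learned search strategies rather than model scale alone.

Link: https://arxiv.org/abs/2511.07581
Authors: Supriti Vijay, Aman Priyanshu, Anu Vellore, Baturay Saglam, Amin Karbasi
Affiliations: Carnegie Mellon University; Foundation AI–Cisco Systems Inc.; Yale University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 37 images, 7 figures, and 15 tables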

Abstract:Effective information retrieval requires reasoning over partial evidence and refining strategies as information emerges. Yet current approaches fall short: neural retrievers lack reasoning capabilities, large language models (LLMs) provide semantic depth but at prohibitive cost, and query rewriting or decomposition limits improvement to static transformations. As a result, existing methods fail to capture the iterative dynamics of exploration, feedback, and revision that complex user queries demand. We introduce Orion, a training framework that enables compact models (350M-1.2B parameters) to perform iterative retrieval through learned search strategies. Orion combines: (1) synthetic trajectory generation and supervised fine-tuning to encourage diverse exploration patterns in models, (2) reinforcement learning (RL) that rewards effective query refinement and backtracking behaviors, and (3) inference-time beam search algorithms that exploit the self-reflection capabilities learned during RL. Despite using only 3% of the training data available, our 1.2B model achieves 77.6% success on SciFact (vs. 72.6% for prior retrievers), 25.2% on BRIGHT (vs. 22.1%), 63.2% on NFCorpus (vs. 57.8%), and remains competitive on FEVER, HotpotQA, and MSMarco. It outperforms retrievers up to 200-400x larger on five of six benchmarks. These findings suggest that retrieval performance can emerge from learned strategies, not just model scale, when models are trained to search, reflect, and revise.
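
The inference-time component, beam search over successive query refinements, can be sketched independently of the trained model. In the toy below, `refine` and `score` are hypothetical stand-ins: in Orion the refinements would come from the trained language model and the score from its learned self-assessment, neither of which is reproduced here. Backtracking is not modelled.

```python
def beam_search(initial_query, refine, score, beam_width=2, steps=3):
    """Keep the best `beam_width` refined queries at each step."""
    beam = [initial_query]
    best = initial_query
    for _ in range(steps):
        candidates = {q2 for q in beam for q2 in refine(q)}
        if not candidates:
            break
        # Sort by score (descending), then alphabetically so ties are deterministic.
        beam = sorted(candidates, key=lambda q: (-score(q), q))[:beam_width]
        if score(beam[0]) > score(best):
            best = beam[0]
    return best

# Toy setup: the "relevant document" is described by three target terms.
TARGET = {"aspirin", "heart", "risk"}
VOCAB = ["aspirin", "heart", "risk", "weather"]

def refine(query):
    """Propose refinements by appending one unused vocabulary term."""
    words = query.split()
    return [query + " " + w for w in VOCAB if w not in words]

def score(query):
    """Number of target terms the query covers."""
    return len(set(query.split()) & TARGET)

best = beam_search("aspirin", refine, score, beam_width=2, steps=2)
```

A wider beam explores more refinements per step at proportionally higher cost, which is the trade-off any inference-time search of this kind has to manage.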

[NLP-75] A Decentralized Retrieval Augmented Generation System with Source Reliabilities Secured on Blockchain

Link: https://arxiv.org/abs/2511.07577
Authors: Yining Lu, Wenyi Tang, Max Johnson, Taeho Jung, Meng Jiang
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

[NLP-76] LLM Optimization Unlocks Real-Time Pairwise Reranking

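Quick read: This paper addresses the efficiency of document reranking in Retrieval-Augmented Generation (RAG) systems: how to cut the latency of Pairwise Reranking Prompting (PRP) based on large language models (LLMs) without a significant loss in retrieval quality. The key to the solution is a set of carefully applied optimizations: using smaller models, limiting the size of the reranked set, using lower precision, reducing positional bias with one-directional order inference, and restricting the number of output tokens. Together these reduce per-query latency from 61.36 seconds to 0.37 seconds (a reduction of up to 166 times) while Recall@k stays almost unchanged, making LLM-based reranking far more feasible in real-time settings.

Link: https://arxiv.org/abs/2511.07555
Authors: Jingyu Wu, Aditya Shrivastava, Jing Zhu, Alfy Samuel, Anoop Kumar, Daben Liu
Affiliations: AI Foundations, Capital One
Subjects: Computation and Language (cs.CL)
Comments: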

Abstract: Efficiently reranking documents retrieved from information retrieval (IR) pipelines to enhance the overall quality of a Retrieval-Augmented Generation (RAG) system remains an important yet challenging problem. Recent studies have highlighted the importance of Large Language Models (LLMs) in reranking tasks. In particular, Pairwise Reranking Prompting (PRP) has emerged as a promising plug-and-play approach due to its usability and effectiveness. However, the inherent complexity of the algorithm, coupled with the high computational demands and latency incurred due to LLMs, raises concerns about its feasibility in real-time applications. To address these challenges, this paper presents a focused study on pairwise reranking, demonstrating that carefully applied optimization methods can significantly mitigate these issues. By implementing these methods, we achieve a remarkable latency reduction of up to 166 times, from 61.36 seconds to 0.37 seconds per query, with an insignificant drop in performance measured by Recall@k. Our study highlights the importance of design choices that were previously overlooked, such as using smaller models, limiting the reranked set, using lower precision, reducing positional bias with one-directional order inference, and restricting output tokens. These optimizations make LLM-based reranking substantially more efficient and feasible for latency-sensitive, real-world deployments.
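
Two of the listed design choices, limiting the reranked set and comparing each pair in one direction only, can be sketched with a single bubble pass over the top candidates. This is an illustrative reading, not the paper's implementation: `prefers` is a hypothetical stand-in for the LLM's pairwise judgment, and the sliding-pass strategy is one common way to organize pairwise comparisons. The last lines check the reported speedup arithmetic.

```python
def pairwise_rerank(query, docs, prefers, top_k=4, passes=1):
    """Rerank only the first `top_k` documents with one-directional comparisons.

    Each adjacent pair is judged once, as (lower, upper), instead of in both orders.
    """
    head, tail = list(docs[:top_k]), list(docs[top_k:])
    for _ in range(passes):
        # Bubble the strongest candidate from the bottom of the head to the top.
        for i in range(len(head) - 1, 0, -1):
            if prefers(query, head[i], head[i - 1]):
                head[i - 1], head[i] = head[i], head[i - 1]
    return head + tail

def toy_prefers(query, a, b):
    # Placeholder for an LLM call: prefer the document mentioning the query more often.
    return a.split().count(query) > b.split().count(query)

docs = ["b", "a a", "a a a", "c"]
reranked = pairwise_rerank("a", docs, toy_prefers, top_k=3, passes=1)

# Reported latencies: 61.36 s before and 0.37 s after optimization.
speedup = 61.36 / 0.37
```

One pass costs `top_k - 1` comparisons and guarantees only that the best of the head reaches position one; additional passes refine lower positions at linear extra cost.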

[NLP-77] Focusing on Language: Revealing and Exploiting Language Attention Heads in Multilingual Large Language Models AAAI-2026

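Quick read: This paper addresses the poorly understood internal mechanisms of multilingual processing in large language models (LLMs), in particular the role of multi-head self-attention (MHA) in supporting multilingual capabilities. The authors propose Language Attention Head Importance Scores (LAHIS), a method that efficiently identifies the attention heads that matter for multilingual capabilities using a single forward and backward pass. Key findings: both language-specific and language-general heads exist, and language-specific heads enable cross-lingual attention transfer and mitigate off-target language generation, which improves multilingual performance. The paper also introduces a lightweight adaptation that learns a soft head mask, requiring only 20 tunable parameters to improve XQuAD accuracy, and thereby improves both interpretability and multilingual performance.

Link: https://arxiv.org/abs/2511.07498
Authors: Xin Liu, Qiyang Song, Qihang Zhou, Haichao Du, Shaowen Xu, Wenbo Jiang, Weijuan Zhang, Xiaoqi Jia
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI-2026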

Abstract: Large language models (LLMs) increasingly support multilingual understanding and generation. Meanwhile, efforts to interpret their internal mechanisms have emerged, offering insights to enhance multilingual performance. While multi-head self-attention (MHA) has proven critical in many areas, its role in multilingual capabilities remains underexplored. In this work, we study the contribution of MHA in supporting multilingual processing in LLMs. We propose Language Attention Head Importance Scores (LAHIS), an effective and efficient method that identifies attention head importance for multilingual capabilities via a single forward and backward pass through the LLM. Applying LAHIS to Aya-23-8B, Llama-3.2-3B, and Mistral-7B-v0.1, we reveal the existence of both language-specific and language-general heads. Language-specific heads enable cross-lingual attention transfer to guide the model toward target language contexts and mitigate off-target language generation issues, contributing to addressing challenges in multilingual LLMs. We also introduce a lightweight adaptation that learns a soft head mask to modulate attention outputs over language heads, requiring only 20 tunable parameters to improve XQuAD accuracy. Overall, our work enhances both the interpretability and multilingual capabilities of LLMs from the perspective of MHA.
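
A soft head mask can be pictured as one learnable gate per selected attention head that scales that head's output before the heads are concatenated. The sketch below assumes a sigmoid gate over a scalar logit per head; the abstract says only that the mask modulates attention outputs with 20 tunable parameters, so the gating function and the names `apply_head_mask` and `mask_logits` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def apply_head_mask(head_outputs, mask_logits):
    """Scale each head's output by its gate, then concatenate the heads.

    head_outputs: one output vector per attention head.
    mask_logits: one tunable scalar per head (the only trainable parameters).
    """
    if len(head_outputs) != len(mask_logits):
        raise ValueError("need exactly one mask logit per head")
    gated = []
    for output, logit in zip(head_outputs, mask_logits):
        gate = sigmoid(logit)
        gated.extend(gate * x for x in output)
    return gated

# Two toy heads with two-dimensional outputs.
head_outputs = [[1.0, 2.0], [3.0, 4.0]]
mask_logits = [0.0, 50.0]   # first head half-suppressed, second passed through
mixed = apply_head_mask(head_outputs, mask_logits)
```

Because only the logits are trained, the parameter count equals the number of masked heads, which is how an adaptation of this kind can stay as small as a few dozen values.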

[NLP-78] Quantifying the Impact of CU: A Systematic Literature Review

Link: https://arxiv.org/abs/2511.07491
Authors: Thomas Compton
Affiliations: University of York
Subjects: Digital Libraries (cs.DL); Computation and Language (cs.CL)
Comments: 14 pages, 2 figures, 2 tables

[NLP-79] Alignment-Constrained Dynamic Pruning for LLMs: Identifying and Preserving Alignment-Critical Circuits

Link: https://arxiv.org/abs/2511.07482
Authors: Dev Patel, Gabrielle Gervacio, Diekola Raimi, Kevin Zhu, Ryan Lagasse, Gabriel Grand, Ashwinee Panda, Maheep Chaudhary
Affiliations: Algoverse; MIT; University of Maryland; Independent
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

[NLP-80] The Polite Liar: Epistemic Pathology in Language Models

Link: https://arxiv.org/abs/2511.07477
Authors: Bentley DeVilling (Course Correct Labs)
Affiliations: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 17 pages, 2 tables, Preprint - under review at AI Society

[NLP-81] Motif 2 12.7B technical report

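Quick read: This paper addresses how large language models (LLMs) can be trained and served efficiently under constrained compute budgets while retaining strong language understanding and instruction generalization. The core of the solution is the co-design of architectural innovation and system-level optimization. On the architecture side, Grouped Differential Attention (GDA) disentangles the signal and noise-control attention pathways to improve representational efficiency. On the systems side, the MuonClip optimizer is combined with custom high-performance kernels, including fused PolyNorm activations and the Parallel Muon algorithm, which significantly improve throughput and memory efficiency in distributed training. With this combination, Motif-2-12.7B achieves performance comparable to much larger models with only 12.7 billion parameters.

Link: https://arxiv.org/abs/2511.07464
Authors: Junghwan Lim, Sungmin Lee, Dongseok Kim, Taehyun Kim, Eunhwan Park, Jeesoo Lee, Jeongdoo Lee, Junhyeok Lee, Wai Ting Cheung, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Beomgyu Kim, Minjae Kim, Taewhan Kim, Youngrok Kim, Hyukjin Kweon, Haesol Lee, Kungyu Lee, Dongpin Oh, Yeongjae Park, Bokki Ryu, Dongjoo Weon
Affiliations: Motif Technologies
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: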

Abstract:We introduce Motif-2-12.7B, a new open-weight foundation model that pushes the efficiency frontier of large language models by combining architectural innovation with system-level optimization. Designed for scalable language understanding and robust instruction generalization under constrained compute budgets, Motif-2-12.7B builds upon Motif-2.6B with the integration of Grouped Differential Attention (GDA), which improves representational efficiency by disentangling signal and noise-control attention pathways. The model is pre-trained on 5.5 trillion tokens spanning diverse linguistic, mathematical, scientific, and programming domains using a curriculum-driven data scheduler that gradually changes the data composition ratio. The training system leverages the MuonClip optimizer alongside custom high-performance kernels, including fused PolyNorm activations and the Parallel Muon algorithm, yielding significant throughput and memory efficiency gains in large-scale distributed environments. Post-training employs a three-stage supervised fine-tuning pipeline that successively enhances general instruction adherence, compositional understanding, and linguistic precision. Motif-2-12.7B demonstrates competitive performance across diverse benchmarks, showing that thoughtful architectural scaling and optimized training design can rival the capabilities of much larger models.

[NLP-82] It Takes Two: A Dual Stage Approach for Terminology-Aware Translation

Link: https://arxiv.org/abs/2511.07461
Authors: Akshat Singh Jaswal
Affiliations: PES University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to WMT 2025. Code available at this https URL

[NLP-83] REFLEX: Reference-Free Evaluation of Log Summarization via Large Language Model Judgment

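Quick read: This paper addresses two obstacles in evaluating log summarization systems: the lack of high-quality reference summaries, and the limited usefulness of traditional metrics such as ROUGE and BLEU, which depend on surface-level lexical overlap. The key to the solution is REFLEX, a reference-free evaluation method that uses large language models (LLMs) as zero-shot evaluators to judge summary quality along dimensions such as relevance, informativeness, and coherence. It needs no gold-standard references or human annotations and yields stable, interpretable, fine-grained evaluations.

Link: https://arxiv.org/abs/2511.07458
Authors: Priyanka Mudgal
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments: Accepted at IEEE-ICETISI 2025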

Abstract: Evaluating log summarization systems is challenging due to the lack of high-quality reference summaries and the limitations of existing metrics like ROUGE and BLEU, which depend on surface-level lexical overlap. We introduce REFLEX, a reference-free evaluation metric for log summarization based on large language model (LLM) judgment. REFLEX uses LLMs as zero-shot evaluators to assess summary quality along dimensions such as relevance, informativeness, and coherence, without requiring gold-standard references or human annotations. We show that REFLEX produces stable, interpretable, and fine-grained evaluations across multiple log summarization datasets, and more effectively distinguishes model outputs than traditional metrics. REFLEX provides a scalable alternative for evaluating log summaries in real-world settings where reference data is scarce or unavailable.

[NLP-84] GRIP: In-Parameter Graph Reasoning through Fine-Tuning Large Language Models

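Quick read: This paper addresses the difficulty large language models (LLMs) have with structured data such as knowledge graphs or web data. Existing methods either convert graphs into text sequences at a significant token cost, or add extra modules that encode graphs into fixed-size representations, which requires large-scale post-training and complex alignment procedures and often performs poorly because of weak modality alignment. The key to the solution is GRIP, a framework that uses carefully designed fine-tuning tasks to let an LLM internalize complex relational information from a graph and store that knowledge in lightweight LoRA (Low-Rank Adaptation) parameters, so that it can perform a range of graph-related tasks without access to the original graph at inference time.

Link: https://arxiv.org/abs/2511.07457
Authors: Jiarui Feng, Donghong Cai, Yixin Chen, Muhan Zhang
Affiliations: Washington University in Saint Louis; Institute for Artificial Intelligence, Peking University; State Key Laboratory of General Artificial Intelligence, BIGAI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: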

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in modeling sequential textual data and generalizing across diverse tasks. However, adapting LLMs to effectively handle structural data, such as knowledge graphs or web data, remains a challenging problem. Some approaches adopt complex strategies to convert graphs into text sequences, resulting in significant token overhead and rendering them impractical for large-scale graphs. Others introduce additional modules to encode graphs into fixed-size token representations for LLMs. However, these methods typically require large-scale post-training on graph-text corpus and complex alignment procedures, yet often yield sub-optimal results due to poor modality alignment. Inspired by in-parameter knowledge injection for test-time adaptation of LLMs, we propose GRIP, a novel framework that equips LLMs with the ability to internalize complex relational information from graphs through carefully designed fine-tuning tasks. This knowledge is efficiently stored within lightweight LoRA parameters, enabling the fine-tuned LLM to perform a wide range of graph-related tasks without requiring access to the original graph at inference time. Extensive experiments across multiple benchmarks validate the effectiveness and efficiency of our approach.
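
Storing knowledge in LoRA parameters means the frozen weight matrix W is left untouched and a low-rank product B·A is added to it, so a layer computes y = Wx + (alpha / r)·B(Ax). The sketch below shows that forward pass with plain Python lists. It illustrates standard LoRA only; the fine-tuning tasks that GRIP uses to populate A and B are not reproduced, and the toy matrices are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha=1.0):
    """Frozen base projection plus a scaled low-rank update.

    W: d_out x d_in frozen weights; A: r x d_in; B: d_out x r.
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weights (identity for clarity)
A = [[1.0, 1.0]]               # rank-1 down-projection
B = [[2.0], [0.0]]             # rank-1 up-projection
x = [1.0, 2.0]
y = lora_forward(W, A, B, x)
```

Only A and B are trained, so the stored graph knowledge costs r·(d_in + d_out) values per adapted matrix rather than d_in·d_out.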

[NLP-85] Large Language Models for Scientific Idea Generation: A Creativity-Centered Survey

Link: https://arxiv.org/abs/2511.07448
Authors: Fatemeh Shahhosseini, Arash Marioriyad, Ali Momen, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban, Shaghayegh Haghjooy Javanmard
Affiliations: Sharif University of Technology; Iran University of Science and Technology; Isfahan University of Medical Sciences
Subjects: Computation and Language (cs.CL)
Comments: 67 Pages

[NLP-86] A Preliminary Study of RAG for Taiwanese Historical Archives

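Quick read: This paper examines how well Retrieval-Augmented Generation (RAG) works on a knowledge-intensive task, Taiwanese historical archives, through an empirical study of two Traditional Chinese historical datasets (Fort Zeelandia and the Taiwan Provincial Council Gazette) and their open-ended query sets. The key to the approach is building and systematically evaluating a RAG pipeline, with a focus on how query characteristics and metadata integration strategies affect retrieval quality, answer generation accuracy, and overall system performance. The results show that integrating metadata at an early stage improves both retrieval and answer accuracy, while hallucination during generation and temporal or multi-hop historical queries remain persistent challenges.

Link: https://arxiv.org/abs/2511.07445
Authors: Claire Lin, Bo-Han Feng, Xuanjun Chen, Te-Lun Yang, Hung-yi Lee, Jyh-Shing Roger Jang
Affiliations: National Taiwan University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ROCLING 2025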

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a promising approach for knowledge-intensive tasks. However, few studies have examined RAG for Taiwanese Historical Archives. In this paper, we present an initial study of a RAG pipeline applied to two historical Traditional Chinese datasets, Fort Zeelandia and the Taiwan Provincial Council Gazette, along with their corresponding open-ended query sets. We systematically investigate the effects of query characteristics and metadata integration strategies on retrieval quality, answer generation, and the performance of the overall system. The results show that early-stage metadata integration enhances both retrieval and answer accuracy while also revealing persistent challenges for RAG systems, including hallucinations during generation and difficulties in handling temporal or multi-hop historical queries.
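
Early-stage metadata integration can be read as folding fields such as year or source into each chunk before it is indexed, so that the retriever itself can match on them. The sketch below assumes that reading and uses a toy token-overlap retriever in place of a real embedding model. The chunk contents, field names, and functions are hypothetical and are not drawn from the paper's datasets.

```python
def tokens(text):
    return set(text.lower().split())

def index_chunk(chunk, with_metadata=True):
    """Return the token set used for retrieval, optionally including metadata."""
    text = chunk["text"]
    if with_metadata:
        meta = " ".join(f"{key} {value}" for key, value in sorted(chunk["meta"].items()))
        text = meta + " " + text
    return tokens(text)

def retrieve(query, chunks, with_metadata=True):
    """Return the chunk with the largest token overlap (first one wins ties)."""
    q = tokens(query)
    return max(chunks, key=lambda c: len(q & index_chunk(c, with_metadata)))

# Two toy chunks with identical text that differ only in their metadata.
chunks = [
    {"id": "a", "text": "the council discussed the budget", "meta": {"year": "1950"}},
    {"id": "b", "text": "the council discussed the budget", "meta": {"year": "1960"}},
]
query = "council budget 1960"
hit_with_meta = retrieve(query, chunks, with_metadata=True)
hit_without_meta = retrieve(query, chunks, with_metadata=False)
```

Without metadata the two chunks are indistinguishable to the retriever, which is the kind of temporal query the abstract flags as difficult.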

[NLP-87] Network and Systems Performance Characterization of MCP-Enabled LLM Agents

Link: https://arxiv.org/abs/2511.07426
Authors: Zihao Ding, Mufeng Zhu, Yao Liu
Affiliations: Rutgers University
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Networking and Internet Architecture (cs.NI); Software Engineering (cs.SE)
Notes:

[NLP-88] LLaDA-Rec: Discrete Diffusion for Parallel Semantic ID Generation in Generative Recommendation

Link: https://arxiv.org/abs/2511.06254
Authors: Teng Shi, Chenglei Shen, Weijie Yu, Shen Nie, Chongxuan Li, Xiao Zhang, Ming He, Yan Han, Jun Xu
Affiliations: Gaoling School of Artificial Intelligence; Renmin University of China; School of Information Technology and Management; University of International Business and Economics; AI Lab at Lenovo Research
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Notes:

[NLP-89] Unifying Model and Layer Fusion for Speech Foundation Models

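[Quick read]: This paper addresses how to fuse representations from multiple upstream Speech Foundation Models more effectively to improve downstream task performance. Existing methods usually adopt a single fusion strategy, either fusing layers within one model or fusing across models, and so struggle to exploit both multi-layer information and the strengths of multiple models. The key to the solution is a unified interface module that fuses across several upstream models while also integrating the multi-layer features inside each model. Experiments show that the method outperforms prior fusion approaches on tasks including automatic speech recognition (ASR) and paralinguistic analysis, and that the gain depends on selecting suitable upstream models.

Link: https://arxiv.org/abs/2511.08389
Authors: Yi-Jen Shih, David Harwath
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: Accepted by IEEE ASRU 2025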

Abstract:Speech Foundation Models have gained significant attention recently. Prior works have shown that the fusion of representations from multiple layers of the same model or the fusion of multiple models can improve performance on downstream tasks. We unify these two fusion strategies by proposing an interface module that enables fusion across multiple upstream speech models while integrating information across their layers. We conduct extensive experiments on different self-supervised and supervised models across various speech tasks, including ASR and paralinguistic analysis, and demonstrate that our method outperforms prior fusion approaches. We further analyze its scalability concerning model size and count, highlighting the importance of selecting appropriate upstream models. Our results show that the proposed interface provides an additional performance boost when given a suitable upstream model selection, making it a promising approach for utilizing Speech Foundation Models.
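
The two fusion strategies that the abstract unifies can be sketched with the simplest possible stand-in: a softmax-weighted sum over layers inside each model, followed by a softmax-weighted sum across models. This is an illustration only; the paper's interface module is not specified in the abstract, and the sketch assumes all features already share one dimensionality.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(vectors, weights):
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

def fuse(models_layers, layer_logits, model_logits):
    """models_layers[m][l] is the feature vector of layer l of upstream model m.

    Layer fusion happens first (per model), then model fusion on the results.
    The logits would be learned parameters; here they are fixed inputs.
    """
    per_model = [
        weighted_sum(layers, softmax(logits))
        for layers, logits in zip(models_layers, layer_logits)
    ]
    return weighted_sum(per_model, softmax(model_logits))

# Two hypothetical upstream models, two layers each, 2-dimensional features.
models_layers = [
    [[1.0, 0.0], [3.0, 0.0]],
    [[0.0, 2.0], [0.0, 4.0]],
]
fused = fuse(models_layers, layer_logits=[[0.0, 0.0], [0.0, 0.0]], model_logits=[0.0, 0.0])
```

With all logits equal, the result is a plain average over layers and then over models; training would shift the weights toward the layers and models that help the downstream task.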

[NLP-90] Hybrid Quantum-Classical Selective State Space Artificial Intelligence

Link: https://arxiv.org/abs/2511.08349
Authors: Amin Ebrahimi, Farzan Haddadi
Affiliations: Iran University of Science and Technology
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 21 pages

[NLP-91] Quantizing Whisper-small: How design choices affect ASR performance ICASSP2026

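[Quick read]: This paper addresses the difficulty of deploying large speech recognition models such as Whisper-small on edge devices, where the main bottleneck is high computational demand. The key to the solution is post-training quantization (PTQ), which reduces model size and inference cost without retraining. The study systematically evaluates combinations of quantization scheme, method, granularity, and bit-width across four libraries (PyTorch, Optimum-Quanto, HQQ, and bitsandbytes). Dynamic int8 quantization with Quanto gives the best trade-off: on the LibriSpeech test sets it shrinks the model by 57% while improving on the baseline's word error rate (WER). Static quantization performs worse, likely because of Whisper's Transformer architecture, and more aggressive formats (e.g., nf4, int3) reach up to 71% compression at the cost of accuracy in noisy conditions. Overall, a carefully chosen PTQ method can adapt the model to constrained hardware while preserving accuracy.

Link: https://arxiv.org/abs/2511.08093
Authors: Arthur Söhler, Julian Irigoyen, Andreas Søeborg Kirkedal
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Notes: Submitted to ICASSP 2026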

Abstract:Large speech recognition models like Whisper-small achieve high accuracy but are difficult to deploy on edge devices due to their high computational demand. To this end, we present a unified, cross-library evaluation of post-training quantization (PTQ) on Whisper-small that disentangles the impact of quantization scheme, method, granularity, and bit-width. Our study is based on four libraries: PyTorch, Optimum-Quanto, HQQ, and bitsandbytes. Experiments on LibriSpeech test-clean and test-other show that dynamic int8 quantization with Quanto offers the best trade-off, reducing model size by 57% while improving on the baseline’s word error rate. Static quantization performed worse, likely due to Whisper’s Transformer architecture, while more aggressive formats (e.g., nf4, int3) achieved up to 71% compression at the cost of accuracy in noisy conditions. Overall, our results demonstrate that carefully chosen PTQ methods can substantially reduce model size and inference cost without retraining, enabling efficient deployment of Whisper-small on constrained hardware.
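
As a rough illustration of what int8 quantization does to a tensor, the sketch below applies symmetric per-tensor quantization with a scale derived from the values themselves at call time, which is the sense in which dynamic schemes avoid a calibration pass. It is a generic textbook scheme in plain Python, not the API or the exact algorithm of Quanto or any other library named in the abstract.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    return [q * scale for q in quantized]

# Hypothetical weight values, chosen only for illustration.
weights = [0.02, -0.5, 0.31, 0.127, -0.08]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each value is stored in one byte instead of four (for float32), and the round-trip error per value is bounded by half the scale. The 57% figure in the abstract is a whole-model measurement and cannot be derived from this sketch.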

[NLP-92] Pruning as Regularization: Sensitivity-Aware One-Shot Pruning in ASR ICASSP2026

Link: https://arxiv.org/abs/2511.08092
Authors: Julian Irigoyen, Arthur Söhler, Andreas Søeborg Kirkedal
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Notes: Submitted to ICASSP 2026

Computer Vision

[CV-0] Simulating the Visual World with Artificial Intelligence: A Roadmap

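[Quick read]: This paper addresses the paradigm shift in video generation from producing visually appealing clips to building interactive, physically plausible virtual environments. The core challenge is modeling and simulating the physical dynamics, agent-environment interactions, and task planning that govern real or imagined worlds. The key idea is to view modern video foundation models as the combination of two core components: an implicit world model and a video renderer. The implicit world model encodes structured knowledge about physical laws, interaction dynamics, and agent behavior, and acts as a latent simulation engine that supports coherent visual reasoning, long-term temporal consistency, and goal-driven planning. The video renderer turns this latent simulation into realistic visual observations, so that the video serves as a "window" into the simulated world. This view traces the evolution of video generation from static clips to dynamic world models capable of planning across multiple spatiotemporal scales.

Link: https://arxiv.org/abs/2511.08585
Authors: Jingtong Yue, Ziqi Huang, Zhaoxi Chen, Xintao Wang, Pengfei Wan, Ziwei Liu
Affiliations: Carnegie Mellon University; Nanyang Technological University; Kuaishou Technology
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Notes: Project page: this https URL. GitHub repo: this https URL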

Abstract:The landscape of video generation is shifting, from a focus on generating visually appealing clips to building virtual environments that support interaction and maintain physical plausibility. These developments point toward the emergence of video foundation models that function not only as visual generators but also as implicit world models, models that simulate the physical dynamics, agent-environment interactions, and task planning that govern real or imagined worlds. This survey provides a systematic overview of this evolution, conceptualizing modern video foundation models as the combination of two core components: an implicit world model and a video renderer. The world model encodes structured knowledge about the world, including physical laws, interaction dynamics, and agent behavior. It serves as a latent simulation engine that enables coherent visual reasoning, long-term temporal consistency, and goal-driven planning. The video renderer transforms this latent simulation into realistic visual observations, effectively producing videos as a “window” into the simulated world. We trace the progression of video generation through four generations, in which the core capabilities advance step by step, ultimately culminating in a world model, built upon a video generation model, that embodies intrinsic physical plausibility, real-time multimodal interaction, and planning capabilities spanning multiple spatiotemporal scales. For each generation, we define its core characteristics, highlight representative works, and examine their application domains such as robotics, autonomous driving, and interactive gaming. Finally, we discuss open challenges and design principles for next-generation world models, including the role of agent intelligence in shaping and evaluating these systems. An up-to-date list of related works is maintained at this link.

[CV-1] SENCA-st: Integrating Spatial Transcriptomics and Histopathology with Cross Attention Shared Encoder for Region Identification in Cancer Pathology WACV2026

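[Quick read]: This paper addresses the loss of functional information in multimodal fusion, specifically how to integrate histopathology images with spatial transcriptomics data without the model over-relying on one modality and ignoring key functional features of the other. Current methods either let spatial transcriptomics dominate, which exposes the model to noise, or use naive contrastive learning, which over-smooths the result toward image structure and loses functional differences within the tumor microenvironment. The key to the solution is a new architecture, SENCA-st (Shared Encoder with Neighborhood Cross Attention), which preserves the features of both modalities through a shared encoder and uses neighborhood cross attention to emphasize regions that are structurally similar but functionally different. This allows more precise identification of tumor heterogeneity and tumor microenvironment regions, which is clinically relevant.

Link: https://arxiv.org/abs/2511.08573
Authors: Shanaka Liyanaarachchi, Chathurya Wijethunga, Shihab Aaquil Ahamed, Akthas Absar, Ranga Rodrigo
Affiliations: University of Moratuwa
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: Accepted at WACV 2026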

Abstract:Spatial transcriptomics is an emerging field that enables the identification of functional regions based on the spatial distribution of gene expression. Integrating this functional information present in transcriptomic data with structural data from histopathology images is an active research area with applications in identifying tumor substructures associated with cancer drug resistance. Current histopathology-spatial-transcriptomic region segmentation methods suffer from one of two problems: they either make spatial transcriptomics dominant, using histopathology features only to assist the processing of spatial transcriptomics data, or they use vanilla contrastive learning, which makes histopathology images dominant because it promotes only common features and loses functional information. At both extremes, the model either gets lost in the noise of spatial transcriptomics or is overly smoothed, losing essential information. Thus, we propose our novel architecture SENCA-st (Shared Encoder with Neighborhood Cross Attention) that preserves the features of both modalities. More importantly, it uses cross-attention to emphasize regions that are structurally similar in histopathology but functionally different in spatial transcriptomics. We demonstrate the superior performance of our model, which surpasses state-of-the-art methods in detecting tumor heterogeneity and tumor micro-environment regions, a clinically crucial aspect.

[CV-2] Vision Transformer Based User Equipment Positioning

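[Quick read]: This paper addresses two key problems that Deep Learning (DL) models face in User Equipment (UE) positioning: first, they spread attention uniformly over the input and cannot focus on the key features; second, existing methods are not suited to non-sequential data, for example when only instantaneous Channel State Information (CSI) is available. The core of the solution is an attention-based Vision Transformer (ViT) architecture that focuses on the Angle Delay Profile (ADP) extracted from the CSI matrix, making effective use of the structure in the CSI. On the DeepMIMO and ViWi ray-tracing datasets, the method achieves a Root Mean Squared Error (RMSE) of 0.55m (indoor), 13.59m (outdoor), and 3.45m (outdoor with blockage), about 38% better than the current state of the art, with a better error distribution as well.

Link: https://arxiv.org/abs/2511.08549
Authors: Parshwa Shah, Dhaval K. Patel, Brijesh Soni, Miguel López-Benítez, Siddhartan Govindasamy
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
Notes: The results are accepted in parts at IEEE CCNC2026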

Abstract:Recently, Deep Learning (DL) techniques have been used for User Equipment (UE) positioning. However, the key shortcomings of such models are that: i) they give equal attention to the entire input; ii) they are not well suited to non-sequential data, e.g., when only instantaneous Channel State Information (CSI) is available. In this context, we propose an attention-based Vision Transformer (ViT) architecture that focuses on the Angle Delay Profile (ADP) from the CSI matrix. Our approach, validated on the 'DeepMIMO' and 'ViWi' ray-tracing datasets, achieves a Root Mean Squared Error (RMSE) of 0.55m indoors, 13.59m outdoors in DeepMIMO, and 3.45m in ViWi's outdoor blockage scenario. The proposed scheme outperforms state-of-the-art schemes by approximately 38%. It also performs substantially better than other approaches that we have considered in terms of the distribution of error distance.
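
The abstract does not define the Angle Delay Profile. A definition commonly used in the CSI-positioning literature, assumed here, is the magnitude of a two-dimensional DFT of the channel matrix, |V^H H F|, with V and F unitary DFT matrices over antennas and subcarriers. The sketch below implements that assumed definition in plain Python and checks it on a synthetic single-path channel; the array sizes and the path parameters are hypothetical.

```python
import cmath
import math

def dft_matrix(n):
    """Unitary n x n DFT-style matrix with entries exp(2*pi*j*r*c/n) / sqrt(n)."""
    return [[cmath.exp(2j * math.pi * r * c / n) / math.sqrt(n) for c in range(n)]
            for r in range(n)]

def angle_delay_profile(H):
    """H[antenna][subcarrier] (complex) -> |V^H H F| as magnitudes[angle_bin][delay_bin]."""
    nt, nc = len(H), len(H[0])
    V, F = dft_matrix(nt), dft_matrix(nc)
    # First transform over the antenna axis: tmp = V^H H.
    tmp = [[sum(V[n][p].conjugate() * H[n][k] for n in range(nt)) for k in range(nc)]
           for p in range(nt)]
    # Then transform over the subcarrier axis and take magnitudes.
    return [[abs(sum(tmp[p][k] * F[k][q] for k in range(nc))) for q in range(nc)]
            for p in range(nt)]

# Synthetic single-path channel whose energy should land in angle bin 3, delay bin 5.
nt, nc, angle_bin, delay_bin = 8, 16, 3, 5
H = [[cmath.exp(2j * math.pi * angle_bin * n / nt) * cmath.exp(-2j * math.pi * delay_bin * k / nc)
      for k in range(nc)] for n in range(nt)]
adp = angle_delay_profile(H)
peak = max(((p, q) for p in range(nt) for q in range(nc)), key=lambda pq: adp[pq[0]][pq[1]])
```

The resulting two-dimensional magnitude map is image-like, which is what makes a ViT a natural fit: each propagation path shows up as a localized blob that attention can single out.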

[CV-3] RePose-NeRF: Robust Radiance Fields for Mesh Reconstruction under Noisy Camera Poses

[Quick read]: This paper addresses two challenges that 3D reconstruction methods based on neural implicit representations face in real robotic applications. First, existing methods require accurate camera poses, and reconstruction quality drops markedly when poses are noisy, as they are in real scenes. Second, their implicit volumetric representations are incompatible with the widely used polygonal meshes, which makes rendering and manipulation inefficient in standard 3D software. The key to the solution is a robust framework that, under noisy camera poses, jointly refines the poses while learning an implicit scene representation that captures fine geometry and photorealistic appearance, and finally produces high-quality, editable polygonal meshes. The meshes are compatible with mainstream 3D graphics and robotics toolchains, which makes them practical for downstream tasks.

Link: https://arxiv.org/abs/2511.08545
Authors: Sriram Srinivasan, Gautam Ramachandra
Affiliations: Bellatrix Aerospace
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Several figures are included to illustrate the reconstruction and rendering quality of the proposed method, which is why the submission exceeds the 50MB file size limit (now resolved)

Abstract:Accurate 3D reconstruction from multi-view images is essential for downstream robotic tasks such as navigation, manipulation, and environment understanding. However, obtaining precise camera poses in real-world settings remains challenging, even when calibration parameters are known. This limits the practicality of existing NeRF-based methods that rely heavily on accurate extrinsic estimates. Furthermore, their implicit volumetric representations differ significantly from the widely adopted polygonal meshes, making rendering and manipulation inefficient in standard 3D software. In this work, we propose a robust framework that reconstructs high-quality, editable 3D meshes directly from multi-view images with noisy extrinsic parameters. Our approach jointly refines camera poses while learning an implicit scene representation that captures fine geometric detail and photorealistic appearance. The resulting meshes are compatible with common 3D graphics and robotics tools, enabling efficient downstream use. Experiments on standard benchmarks demonstrate that our method achieves accurate and robust 3D reconstruction under pose uncertainty, bridging the gap between neural implicit representations and practical robotic applications.

[CV-4] LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics

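[Quick read]: This paper addresses the lack of theoretical guidance and of systematically designed practice in self-supervised representation learning, in particular the reliance of Joint-Embedding Predictive Architectures (JEPAs) on heuristic tuning, their instability, and their difficulty in scaling. The key to the solution is LeJEPA, a training objective that is theoretically rigorous and simple to implement. It introduces Sketched Isotropic Gaussian Regularization (SIGReg), which pushes the embedding distribution toward the optimal isotropic Gaussian and thereby minimizes downstream prediction risk. As a result, LeJEPA has a single hyperparameter, linear time and memory complexity, robustness across architectures and hyperparameters, and no need for stop-gradient or teacher-student structures; it also supports distributed training, which improves the practicality and reproducibility of JEPA-style methods.

Link: https://arxiv.org/abs/2511.08544
Authors: Randall Balestriero, Yann LeCun
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Notes: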

Abstract:Learning manipulable representations of the world and its dynamics is central to AI. Joint-Embedding Predictive Architectures (JEPAs) offer a promising blueprint, but lack of practical guidance and theory has led to ad-hoc R&D. We present a comprehensive theory of JEPAs and instantiate it in LeJEPA, a lean, scalable, and theoretically grounded training objective. First, we identify the isotropic Gaussian as the optimal distribution that JEPAs' embeddings should follow to minimize downstream prediction risk. Second, we introduce a novel objective, Sketched Isotropic Gaussian Regularization (SIGReg), to constrain embeddings to reach that ideal distribution. Combining the JEPA predictive loss with SIGReg yields LeJEPA with numerous theoretical and practical benefits: (i) single trade-off hyperparameter, (ii) linear time and memory complexity, (iii) stability across hyper-parameters, architectures (ResNets, ViTs, ConvNets) and domains, (iv) heuristics-free, e.g., no stop-gradient, no teacher-student, no hyper-parameter schedulers, and (v) distributed training-friendly implementation requiring only approximately 50 lines of code. Our empirical validation covers 10+ datasets, 60+ architectures, all with varying scales and domains. As an example, using imagenet-1k for pretraining and linear evaluation with frozen backbone, LeJEPA reaches 79% with a ViT-H/14. We hope that the simplicity and theory-friendly ecosystem offered by LeJEPA will reestablish self-supervised pre-training as a core pillar of AI research (GitHub repo: git@github.com:rbalestr-lab/lejepa.git).
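
The "sketched" idea, testing a high-dimensional distribution through random one-dimensional projections, can be illustrated as follows. This is only a sketch: it swaps in a simple moment-matching penalty (mean 0, variance 1 along each direction) for whatever statistic SIGReg actually uses, which the abstract does not specify, and the function name and defaults are hypothetical.

```python
import math
import random

def sketched_gaussianity_penalty(embeddings, num_directions=16, seed=0):
    """Average, over random unit directions, of mean^2 + (variance - 1)^2 of the projections.

    A stand-in for a proper one-dimensional Gaussianity test: an isotropic
    standard Gaussian has mean 0 and variance 1 along every direction.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    total = 0.0
    for _ in range(num_directions):
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in direction))
        direction = [x / norm for x in direction]
        projections = [sum(e * d for e, d in zip(emb, direction)) for emb in embeddings]
        mean = sum(projections) / len(projections)
        variance = sum((p - mean) ** 2 for p in projections) / len(projections)
        total += mean ** 2 + (variance - 1.0) ** 2
    return total / num_directions

data_rng = random.Random(1)
gaussian_embeddings = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(2000)]
collapsed_embeddings = [[0.0] * 8 for _ in range(2000)]
penalty_gaussian = sketched_gaussianity_penalty(gaussian_embeddings)
penalty_collapsed = sketched_gaussianity_penalty(collapsed_embeddings)
```

The cost grows linearly with the number of samples and the embedding dimension, which is consistent with the linear time and memory complexity claimed in the abstract; collapsed embeddings, the classic failure mode that heuristics such as stop-gradient guard against, receive a large penalty.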

[CV-5] 3D4D: An Interactive Editable 4D World Model via 3D Video Generation AAAI2026

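[Quick read]: This paper addresses the difficulty of presenting complex four-dimensional (4D) spatiotemporal scenes intuitively from static images and text, especially with respect to multimodal data fusion and real-time interaction. The key to the solution is the 3D4D framework, which integrates WebGL with Supersplat rendering, converts static inputs into coherent 4D scenes through four core modules, and uses a foveated rendering strategy to improve rendering efficiency, supporting adaptive, user-driven exploration and real-time interaction.

Link: https://arxiv.org/abs/2511.08536
Authors: Yunhong He, Zhengqing Yuan, Zhengzhong Tu, Yanfang Ye, Lichao Sun
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Accepted by AAAI 2026 Demo Track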

Abstract:We introduce 3D4D, an interactive 4D visualization framework that integrates WebGL with Supersplat rendering. It transforms static images and text into coherent 4D scenes through four core modules and employs a foveated rendering strategy for efficient, real-time multi-modal interaction. This framework enables adaptive, user-driven exploration of complex 4D environments. The project page and code are available at this https URL.

[CV-6] Large Sign Language Models: Toward 3D American Sign Language Translation

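[Quick read]: This paper addresses the digital communication barrier that hearing-impaired people face in virtual communication because efficient, accurate American Sign Language (ASL) translation tools are lacking. Existing methods mostly rely on 2D video for sign language recognition and cannot fully capture the spatial structure and depth of gestures, which limits translation accuracy. The key to the solution is Large Sign Language Models (LSLM), which use Large Language Models (LLMs) as the backbone and translate directly from 3D sign language data end to end, modeling spatial motion features and semantic content more precisely. The work explores two paradigms, direct mapping from 3D gesture features to text and controllable translation guided by external instruction prompts, which improves the system's understanding of complex, embodied multimodal language and its flexibility in use.

Link: https://arxiv.org/abs/2511.08535
Authors: Sen Zhang, Xiaoxiao He, Di Liu, Zhaoyang Xia, Mingyu Zhao, Chaowei Tan, Vivian Li, Bo Liu, Dimitris N. Metaxas, Mubbasir Kapadia
Affiliations: Rutgers University; Meta Reality Labs; Qualcomm; Walmart Global Tech; Roblox; PRISMS
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: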

Abstract:We present Large Sign Language Models (LSLM), a novel framework for translating 3D American Sign Language (ASL) by leveraging Large Language Models (LLMs) as the backbone, which can benefit hearing-impaired individuals’ virtual communication. Unlike existing sign language recognition methods that rely on 2D video, our approach directly utilizes 3D sign language data to capture rich spatial, gestural, and depth information in 3D scenes. This enables more accurate and resilient translation, enhancing digital communication accessibility for the hearing-impaired community. Beyond the task of ASL translation, our work explores the integration of complex, embodied multimodal languages into the processing capabilities of LLMs, moving beyond purely text-based inputs to broaden their understanding of human communication. We investigate both direct translation from 3D gesture features to text and an instruction-guided setting where translations can be modulated by external prompts, offering greater flexibility. This work provides a foundational step toward inclusive, multimodal intelligent systems capable of understanding diverse forms of language.

[CV-7] UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist

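[Quick read]: This paper addresses the problem that specialized AI models for video tasks operate in isolation and cannot support complex, iterative workflows, especially real-world scenarios that combine video understanding, segmentation, editing, and generation. The key to the solution is UniVA, an open-source, omni-capable multi-agent framework with a Plan-and-Act dual-agent architecture: a planner agent interprets user intent and decomposes it into structured video-processing steps, and executor agents carry out the steps through MCP-based tool servers. A hierarchical multi-level memory (global knowledge, task context, and user preferences) provides long-horizon reasoning, contextual continuity, and inter-agent communication, supporting interactive, self-reflective, and traceable video creation. This improves the automation and flexibility of any-conditioned video workflows (e.g., text/image/video-conditioned generation → multi-round editing → object segmentation → compositional synthesis).

Link: https://arxiv.org/abs/2511.08521
Authors: Zhengyang Liang, Daoan Zhang, Huichi Zhou, Rui Huang, Bobo Li, Yuechen Zhang, Shengqiong Wu, Xiaohan Wang, Jiebo Luo, Lizi Liao, Hao Fei
Affiliations: Singapore Management University; University of Rochester; University College London; National University of Singapore; The Chinese University of Hong Kong; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: Technical Report. 24 figures, 37 pages. Website: this https URL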

Abstract:While specialized AI models excel at isolated video tasks like generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video-processing steps, while executor agents execute these through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning, contextual continuity, and inter-agent communication, enabling interactive and self-reflective video creation with full traceability. This design enables iterative and any-conditioned video workflows (e.g., text/image/video-conditioned generation → multi-round editing → object segmentation → compositional synthesis) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA-Bench are fully open-sourced, aiming to catalyze research on interactive, agentic, and general-purpose video intelligence for the next generation of multimodal AI systems. (this https URL)
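
The Plan-and-Act control flow can be sketched in a few lines. Everything here is a toy stand-in under stated assumptions: a keyword-matching function replaces the LLM planner, plain Python functions replace the MCP-based tool servers, and a list of trace entries replaces the hierarchical memory. None of the names below come from UniVA itself.

```python
def plan(request):
    """Toy planner: map keywords in the request to an ordered list of tool names."""
    keyword_to_tool = (("generate", "generation"), ("edit", "editing"), ("segment", "segmentation"))
    lowered = request.lower()
    return [tool for keyword, tool in keyword_to_tool if keyword in lowered]

# Hypothetical tools: each takes the list of artifacts so far and returns a new list.
TOOLS = {
    "generation": lambda artifacts: artifacts + ["generated clip"],
    "editing": lambda artifacts: artifacts + ["edited clip"],
    "segmentation": lambda artifacts: artifacts + ["object masks"],
}

def act(steps, tools):
    """Toy executor: run each planned step and record a trace for later inspection."""
    artifacts, trace = [], []
    for step in steps:
        artifacts = tools[step](artifacts)
        trace.append((step, list(artifacts)))
    return artifacts, trace

steps = plan("Generate a clip of a dog, then edit the lighting and segment the dog")
artifacts, trace = act(steps, TOOLS)
```

Keeping the trace separate from the artifacts mirrors the traceability the abstract emphasizes: every intermediate state can be inspected or replayed when a later step needs revising.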

[CV-8] CleverBirds: A Multiple-Choice Benchmark for Fine-grained Human Knowledge Tracing NEURIPS2025

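[Quick read]: This paper addresses the challenge of accurately tracing the knowledge state of human learners in fine-grained visual recognition tasks, with particular attention to how expertise develops gradually. The key to the solution is building and releasing the CleverBirds dataset, a large-scale knowledge tracing benchmark collected through the citizen-science platform eBird. It covers more than 40,000 participants, 17 million multiple-choice questions, and over 10,000 bird species, with about 400 questions per participant on average, capturing long-range learning patterns. The dataset provides real-world learning trajectories for studying visual knowledge tracing, supports the development and evaluation of new knowledge tracing models, and shows that different forms of contextual information contribute differently to predicting learner performance.

Link: https://arxiv.org/abs/2511.08512
Authors: Leonie Bossemeyer, Samuel Heinrich, Grant Van Horn, Oisin Mac Aodha
Affiliations: University of Edinburgh; Cornell University; UMass Amherst
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: To appear at NeurIPS 2025 - Datasets and Benchmarks Track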

Abstract:Mastering fine-grained visual recognition, essential in many expert domains, can require that specialists undergo years of dedicated training. Modeling the progression of such expertise in humans remains challenging, and accurately inferring a human learner's knowledge state is a key step toward understanding visual learning. We introduce CleverBirds, a large-scale knowledge tracing benchmark for fine-grained bird species recognition. Collected by the citizen-science platform eBird, it offers insight into how individuals acquire expertise in complex fine-grained classification. More than 40,000 participants have engaged in the quiz, answering over 17 million multiple-choice questions spanning over 10,000 bird species, with long-range learning patterns across an average of 400 questions per participant. We release this dataset to support the development and evaluation of new methods for visual knowledge tracing. We show that tracking learners' knowledge is challenging, especially across participant subgroups and question types, with different forms of contextual information offering varying degrees of predictive benefit. CleverBirds is among the largest benchmarks of its kind, offering a substantially higher number of learnable concepts. With it, we hope to enable new avenues for studying the development of visual expertise over time and across individuals.

[CV-9] Fast Multi-Organ Fine Segmentation in CT Images with Hierarchical Sparse Sampling and Residual Transformer

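[Quick read]: This paper addresses the tension between computational cost and accuracy in multi-organ segmentation of 3D medical images: whole-volume convolutional neural network methods are accurate but slow and memory-hungry, while fast classifiers struggle to balance speed and accuracy. The key to the solution is a new segmentation framework based on hierarchical sparse sampling and a Residual Transformer. The hierarchical sparse sampling strategy preserves meaningful context across multiple resolution levels while sharply reducing computation, and the Residual Transformer efficiently extracts and combines multi-scale information from the sparse descriptor. Together they deliver accurate segmentation at low computational cost, at about 2.24 seconds per image on CPU hardware.

Link: https://arxiv.org/abs/2511.08509
Authors: Xueqi Guo, Halid Ziya Yerebakan, Yoshihisa Shinagawa, Kritika Iyer, Gerardo Hermosillo Valadez
Affiliations: Siemens Medical Solutions USA Inc
Categories: Computer Vision and Pattern Recognition (cs.CV)
Notes: EMBC 2025 oral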

Abstract:Multi-organ segmentation of 3D medical images is fundamental with meaningful applications in various clinical automation pipelines. Although deep learning has achieved superior performance, the time and memory consumption of segmenting the entire 3D volume voxel by voxel using neural networks can be huge. Classifiers have been developed as an alternative in cases with certain points of interest, but the trade-off between speed and accuracy remains an issue. Thus, we propose a novel fast multi-organ segmentation framework with the usage of hierarchical sparse sampling and a Residual Transformer. Compared with whole-volume analysis, the hierarchical sparse sampling strategy could successfully reduce computation time while preserving a meaningful hierarchical context utilizing multiple resolution levels. The architecture of the Residual Transformer segmentation network could extract and combine information from different levels of information in the sparse descriptor while maintaining a low computational cost. In an internal data set containing 10,253 CT images and the public dataset TotalSegmentator, the proposed method successfully improved qualitative and quantitative segmentation performance compared to the current fast organ classifier, with fast speed at the level of ~2.24 seconds on CPU hardware. The potential of achieving real-time fine organ segmentation is suggested.
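
One way to picture hierarchical sparse sampling is a small grid of sample points around a query voxel, repeated at several spacings so that coarse levels cover a wide context cheaply. The sketch below assumes a 3x3x3 grid whose spacing doubles at each level; those choices, and the function name, are hypothetical, since the abstract only says that multiple resolution levels are used.

```python
def hierarchical_sample_points(center, levels=3, grid=3, base_spacing=1):
    """Return (level, (z, y, x)) sample coordinates around a query voxel.

    Each level uses the same small grid, but the spacing doubles, so the
    sampled context widens while the number of points per level stays fixed.
    """
    cz, cy, cx = center
    half = grid // 2
    points = []
    for level in range(levels):
        spacing = base_spacing * (2 ** level)
        for dz in range(-half, half + 1):
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    points.append((level, (cz + dz * spacing, cy + dy * spacing, cx + dx * spacing)))
    return points

points = hierarchical_sample_points(center=(50, 60, 70))
farthest_offset = max(abs(p[1][0] - 50) for p in points)
dense_voxels_same_extent = (2 * farthest_offset + 1) ** 3
```

In this configuration 81 samples span the same extent as a dense window of 729 voxels, which illustrates where the reduction in computation could come from; the real descriptor and its sizes may differ.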

[CV-10] Compression then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding

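[Quick read]: This paper addresses the difficulty of balancing efficiency and effectiveness when Vision-Language Models (VLMs) are used to produce high-quality, task-oriented embeddings. Existing methods optimize embeddings through large-scale contrastive learning, but they tend to require large amounts of pre-training data and struggle to improve semantic completeness and discriminability at the same time. The key to the solution is a compressed pre-training phase, CoMa, which serves as a warm-up stage before contrastive learning and improves VLM embedding performance with only a small amount of pre-training data. By decoupling the two objectives, semantic fidelity and discriminative feature optimization, CoMa first secures a comprehensive understanding of the input and thereby gives contrastive learning a better initialization. It achieves new state-of-the-art results on the MMEB benchmark among VLMs of comparable size.

Link: https://arxiv.org/abs/2511.08480
Authors: Da Li, Yuxiao Luo, Keping Bi, Jiafeng Guo, Wei Yuan, Biao Yang, Yan Wang, Fan Yang, Tingting Gao, Guorui Zhou
Affiliations: University of Chinese Academy of Sciences; Kuaishou Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Notes: Multimodal Embedding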

Abstract:Vision-language models advance multimodal representation learning by acquiring transferable semantic embeddings, thereby substantially enhancing performance across a range of vision-language tasks, including cross-modal retrieval, clustering, and classification. An effective embedding is expected to comprehensively preserve the semantic content of the input while simultaneously emphasizing features that are discriminative for downstream tasks. Recent approaches demonstrate that VLMs can be adapted into competitive embedding models via large-scale contrastive learning, enabling the simultaneous optimization of two complementary objectives. We argue that the two aforementioned objectives can be decoupled: a comprehensive understanding of the input facilitates the embedding model in achieving superior performance in downstream tasks via contrastive learning. In this paper, we propose CoMa, a compressed pre-training phase, which serves as a warm-up stage for contrastive learning. Experiments demonstrate that with only a small amount of pre-training data, we can transform a VLM into a competitive embedding model. CoMa achieves new state-of-the-art results among VLMs of comparable size on the MMEB, realizing optimization in both efficiency and effectiveness.

[CV-11] Generalizable Blood Cell Detection via Unified Dataset and Faster R-CNN

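[Quick read]: This paper addresses two core challenges in automated classification and object detection of Peripheral Blood Cells (PBCs) in microscopic images: data scarcity and data heterogeneity. To tackle them, the study builds a robust data pipeline that standardizes four public datasets (PBC, BCCD, Chula, and Sickle Cell) and merges them into a unified resource. On top of this it uses a Faster R-CNN detector with a ResNet-50-FPN backbone and compares two training regimens to test the value of transfer learning against a randomly initialized model. The key step is initializing from weights pre-trained on a large general-purpose dataset (Microsoft COCO), which speeds up convergence and improves stability, reaching a final validation loss of 0.08666 and providing a basis for accurate, deployable automated hematology diagnosis systems.

Link: https://arxiv.org/abs/2511.08465
Authors: Siddharth Sahay
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: 7 pages, 7 tables, 3 figures, 2 algorithms, Submitted for review at Next-Gen Quantum and Advanced Computing: Algorithms, Security, and Beyond (NQComp-2026)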

Abstract:This paper presents a comprehensive methodology and comparative performance analysis for the automated classification and object detection of peripheral blood cells (PBCs) in microscopic images. Addressing the critical challenge of data scarcity and heterogeneity, a robust data pipeline was first developed to standardize and merge four public datasets (PBC, BCCD, Chula, Sickle Cell) into a unified resource. We then employed a state-of-the-art Faster R-CNN object detection framework, leveraging a ResNet-50-FPN backbone. Comparative training rigorously evaluated a randomly initialized baseline model (Regimen 1) against a Transfer Learning Regimen (Regimen 2), initialized with weights pre-trained on the Microsoft COCO dataset. The results demonstrate that the Transfer Learning approach achieved significantly faster convergence and superior stability, culminating in a final validation loss of 0.08666, a substantial improvement over the baseline. This validated methodology establishes a robust foundation for building high-accuracy, deployable systems for automated hematological diagnosis.
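
Standardizing and merging detection datasets usually comes down to two chores: mapping each dataset's class names onto one shared label set and converting bounding boxes to one format. The sketch below shows both under invented assumptions; the dataset keys, label vocabularies, and box formats are hypothetical and do not describe the four datasets in the abstract.

```python
# Hypothetical per-dataset label vocabularies mapped onto one shared schema.
LABEL_MAPS = {
    "dataset_a": {"RBC": "red_blood_cell", "WBC": "white_blood_cell", "Platelets": "platelet"},
    "dataset_b": {"erythrocyte": "red_blood_cell", "leukocyte": "white_blood_cell"},
}

def to_corner_box(box, box_format):
    """Convert a box to (x_min, y_min, x_max, y_max)."""
    if box_format == "xywh":
        x, y, w, h = box
        return (x, y, x + w, y + h)
    if box_format == "xyxy":
        return tuple(box)
    raise ValueError(f"unknown box format: {box_format}")

def unify(records):
    """Map labels to the shared schema and boxes to corner format; drop unmapped classes."""
    unified = []
    for rec in records:
        label = LABEL_MAPS[rec["dataset"]].get(rec["label"])
        if label is None:
            continue  # no counterpart in the shared schema
        unified.append({"label": label, "box": to_corner_box(rec["box"], rec["format"])})
    return unified

records = [
    {"dataset": "dataset_a", "label": "RBC", "box": (10, 20, 30, 40), "format": "xywh"},
    {"dataset": "dataset_b", "label": "erythrocyte", "box": (5, 5, 25, 35), "format": "xyxy"},
    {"dataset": "dataset_b", "label": "artifact", "box": (0, 0, 1, 1), "format": "xyxy"},
]
merged = unify(records)
```

Dropping unmapped classes is one policy among several; a real pipeline might instead keep them under an "other" label so that the detector still sees those regions during training.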

[CV-12] Contrastive Integrated Gradients: A Feature Attribution-Based Method for Explaining Whole Slide Image Classification

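[Quick read]: This paper addresses insufficient model interpretability in Whole Slide Image (WSI) analysis for computational pathology. Existing gradient-based attribution methods such as Integrated Gradients (IG) capture model decision patterns but have difficulty identifying the regions that actually discriminate between tumor subtypes. The core of the solution is a new attribution method, Contrastive Integrated Gradients (CIG), whose key innovation is computing contrastive gradients in logit space: feature importance is compared relative to a reference class, which sharpens the distinction between tumor and non-tumor regions. CIG also satisfies the axioms of integrated attribution, ensuring theoretical consistency, and the paper introduces two quality metrics (MIL-AIC and MIL-SIC) that quantify how predictive information and model confidence change as salient regions are revealed. Across several cancer datasets, CIG yields more informative attributions, both quantitatively and qualitatively, that align closely with ground-truth tumor regions.

Link: https://arxiv.org/abs/2511.08464
Authors: Anh Mai Vu, Tuan L. Vo, Ngoc Lam Quang Bui, Nam Nguyen Le Binh, Akash Awasthi, Huy Quoc Vo, Thanh-Huy Nguyen, Zhu Han, Chandra Mohan, Hien Van Nguyen
Affiliations: University of Houston; HCMC University of Technology and Education; The University of Da Nang; Vietnam National University; Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Notes: 10 pages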

Abstract:Interpretability is essential in Whole Slide Image (WSI) analysis for computational pathology, where understanding model predictions helps build trust in AI-assisted diagnostics. While Integrated Gradients (IG) and related attribution methods have shown promise, applying them directly to WSIs introduces challenges due to their high-resolution nature. These methods capture model decision patterns but may overlook class-discriminative signals that are crucial for distinguishing between tumor subtypes. In this work, we introduce Contrastive Integrated Gradients (CIG), a novel attribution method that enhances interpretability by computing contrastive gradients in logit space. First, CIG highlights class-discriminative regions by comparing feature importance relative to a reference class, offering sharper differentiation between tumor and non-tumor areas. Second, CIG satisfies the axioms of integrated attribution, ensuring consistency and theoretical soundness. Third, we propose two attribution quality metrics, MIL-AIC and MIL-SIC, which measure how predictive information and model confidence evolve with access to salient regions, particularly under weak supervision. We validate CIG across three datasets spanning distinct cancer types: CAMELYON16 (breast cancer metastasis in lymph nodes), TCGA-RCC (renal cell carcinoma), and TCGA-Lung (lung cancer). Experimental results demonstrate that CIG yields more informative attributions both quantitatively, using MIL-AIC and MIL-SIC, and qualitatively, through visualizations that align closely with ground truth tumor regions, underscoring its potential for interpretable and trustworthy WSI-based diagnostics
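
One simple reading of "contrastive gradients in logit space" is to run standard Integrated Gradients on the difference between the target-class logit and a reference-class logit. The sketch below does exactly that on a tiny hand-written model; it is an assumption made for illustration, not the paper's formulation, and the weights, inputs, and function names are hypothetical.

```python
import math

# Hypothetical two-class model: logit_c(x) = sum_i W[c][i] * tanh(x_i).
W = [[1.0, -0.5, 0.2],   # class 0, used as the target class
     [0.3, 0.8, -0.4]]   # class 1, used as the reference class

def logit(c, x):
    return sum(w * math.tanh(v) for w, v in zip(W[c], x))

def contrast(x, target, reference):
    """Logit difference between the target and the reference class."""
    return logit(target, x) - logit(reference, x)

def contrast_grad(x, target, reference):
    """Analytic gradient of the logit difference with respect to the input."""
    return [(W[target][i] - W[reference][i]) * (1.0 - math.tanh(v) ** 2)
            for i, v in enumerate(x)]

def contrastive_ig(x, baseline, target, reference, steps=200):
    """Integrated Gradients of the logit difference, using a midpoint Riemann sum."""
    attributions = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        grad = contrast_grad(point, target, reference)
        for i, g in enumerate(grad):
            attributions[i] += g * (x[i] - baseline[i]) / steps
    return attributions

x, baseline = [0.9, -0.4, 1.5], [0.0, 0.0, 0.0]
attributions = contrastive_ig(x, baseline, target=0, reference=1)
completeness_gap = abs(sum(attributions) - (contrast(x, 0, 1) - contrast(baseline, 0, 1)))
```

The completeness check at the end is the kind of axiom the abstract refers to: the attributions sum to the change in the explained quantity between the baseline and the input, here the logit difference rather than a single logit.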

[CV-13] Cross-pyramid consistency regularization for semi-supervised medical image segmentation

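【Quick Read】: This paper addresses how semi-supervised learning (SSL) can make effective use of large amounts of unlabeled data to improve medical image segmentation. The key to its solution is a hybrid consistency learning approach that applies Cross-Pyramid Consistency Regularization (CPCR) to enforce cross-scale prediction consistency between the two decoders of a Dual Branch Pyramid Network (DBPNet). Specifically, DBPNet has a shared encoder and two slightly different decoders, each producing perturbed auxiliary predictions at multiple scales. CPCR combines existing consistency learning and uncertainty minimization strategies with a new regularization term that extends the soft-labeling setting to pyramid predictions across decoders, enabling knowledge distillation in deep hierarchical features and improving both the use of unlabeled data and segmentation accuracy.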

Link: https://arxiv.org/abs/2511.08435
Authors: Matus Bojko,Maros Kollar,Marek Jakab,Wanda Benesova
Affiliations: Czech Technical University in Prague
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Semi-supervised learning (SSL) enables training of powerful models with the assumption of limited, carefully labelled data and a large amount of unlabeled data to support the learning. In this paper, we propose a hybrid consistency learning approach to effectively exploit unlabeled data for semi-supervised medical image segmentation by leveraging Cross-Pyramid Consistency Regularization (CPCR) between two decoders. First, we design a hybrid Dual Branch Pyramid Network (DBPNet), consisting of an encoder and two decoders that differ slightly, each producing a pyramid of perturbed auxiliary predictions across multiple resolution scales. Second, we present a learning strategy for this network named CPCR that combines existing consistency learning and uncertainty minimization approaches on the main output predictions of decoders with our novel regularization term. More specifically, in this term, we extend the soft-labeling setting to pyramid predictions across decoders to support knowledge distillation in deep hierarchical features. Experimental results show that DBPNet with CPCR outperforms five state-of-the-art self-supervised learning methods and has comparable performance with recent ones on a public benchmark dataset.

[CV-14] OmniAID: Decoupling Semantic and Artifacts for Universal AI-Generated Image Detection in the Wild

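【Quick Read】: This paper addresses the poor generalization of current AI-generated image (AIGI) detectors across different generative models and diverse semantic content. Existing methods typically learn a single, entangled forgery representation that mixes content-dependent flaws with content-agnostic universal artifacts, and they are constrained by outdated benchmark datasets. The key to its solution is OmniAID, a framework built on a decoupled Mixture-of-Experts (MoE) architecture: a set of routable specialized semantic experts (each targeting a specific content domain, such as humans or animals) and a fixed universal artifact expert explicitly separate "what is generated" (content-specific flaws) from "how it is generated" (universal artifacts), yielding more robust generalization. The authors also build the Mirage dataset to replace older benchmarks and better validate performance against modern AIGI threats in real-world settings.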

Link: https://arxiv.org/abs/2511.08423
Authors: Yuncheng Guo,Junyan Ye,Chenjue Zhang,Hengrui Kang,Haohuan Fu,Conghui He,Weijia Li
Affiliations: Shanghai Artificial Intelligence Laboratory; Sun Yat-Sen University; Tsinghua University; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 7 figures, 5 tables

Abstract:A truly universal AI-Generated Image (AIGI) detector must simultaneously generalize across diverse generative models and varied semantic content. Current state-of-the-art methods learn a single, entangled forgery representation–conflating content-dependent flaws with content-agnostic artifacts–and are further constrained by outdated benchmarks. To overcome these limitations, we propose OmniAID, a novel framework centered on a decoupled Mixture-of-Experts (MoE) architecture. The core of our method is a hybrid expert system engineered to decouple: (1) semantic flaws across distinct content domains, and (2) these content-dependent flaws from content-agnostic universal artifacts. This system employs a set of Routable Specialized Semantic Experts, each for a distinct domain (e.g., human, animal), complemented by a Fixed Universal Artifact Expert. This architecture is trained using a bespoke two-stage strategy: we first train the experts independently with domain-specific hard-sampling to ensure specialization, and subsequently train a lightweight gating network for effective input routing. By explicitly decoupling “what is generated” (content-specific flaws) from “how it is generated” (universal artifacts), OmniAID achieves robust generalization. To address outdated benchmarks and validate real-world applicability, we introduce Mirage, a new large-scale, contemporary dataset. Extensive experiments, using both traditional benchmarks and our Mirage dataset, demonstrate our model surpasses existing monolithic detectors, establishing a new, robust standard for AIGI authentication against modern, in-the-wild threats.

[CV-15] NeuCLIP: Efficient Large-Scale CLIP Training with Neural Normalizer Optimization

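【Quick Read】: This paper addresses inaccurate estimation of the normalization term (the partition function) in the contrastive loss, a central challenge in training Contrastive Language-Image Pre-training (CLIP) models. Conventional methods approximate it with large batches at high computational cost, while existing per-sample normalizer estimators incur an optimization error that scales with the ratio of dataset size to batch size, limiting their effectiveness on large datasets or with small batches. The solution rests on two ideas: (i) using convex analysis to reformulate each sample's contrastive loss as a minimization problem with an auxiliary variable representing its log-normalizer; and (ii) using variational analysis to turn the minimization over n auxiliary variables (n being the dataset size) into a minimization over a compact neural network that predicts the log-normalizers of all samples. The authors design an alternating optimization algorithm that jointly trains the CLIP model and the auxiliary network, and use a tailored architecture and acceleration techniques to improve normalizer estimation accuracy, outperforming previous methods.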

Link: https://arxiv.org/abs/2511.08417
Authors: Xiyuan Wei,Chih-Jen Lin,Tianbao Yang
Affiliations: Texas A&M University; National Taiwan University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 4 figures

Abstract:Accurately estimating the normalization term (also known as the partition function) in the contrastive loss is a central challenge for training Contrastive Language-Image Pre-training (CLIP) models. Conventional methods rely on large batches for approximation, demanding substantial computational resources. To mitigate this issue, prior works introduced per-sample normalizer estimators, which are updated at each epoch in a blockwise coordinate manner to keep track of updated encoders. However, this scheme incurs optimization error that scales with the ratio of dataset size to batch size, limiting effectiveness for large datasets or small batches. To overcome this limitation, we propose NeuCLIP, a novel and elegant optimization framework based on two key ideas: (i) reformulating the contrastive loss for each sample via convex analysis into a minimization problem with an auxiliary variable representing its log-normalizer; and (ii) transforming the resulting minimization over n auxiliary variables (where n is the dataset size) via variational analysis into the minimization over a compact neural network that predicts the log-normalizers. We design an alternating optimization algorithm that jointly trains the CLIP model and the auxiliary network. By employing a tailored architecture and acceleration techniques for the auxiliary network, NeuCLIP achieves more accurate normalizer estimation, leading to improved performance compared with previous methods. Extensive experiments on large-scale CLIP training, spanning datasets from millions to billions of samples, demonstrate that NeuCLIP outperforms previous methods.
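
One standard convex-analysis identity that matches the description in idea (i) is log(s) = min over ν of [s·exp(-ν) + ν - 1], whose minimizer is ν = log(s), i.e. the log-normalizer. The sketch below checks this numerically for a single anchor. Whether NeuCLIP uses exactly this form is an assumption here; the similarity scores, the temperature `tau`, and the step size are illustrative values, not taken from the paper.

```python
import math

def surrogate(s, nu):
    # Upper bound on log(s); tight exactly at nu = log(s).
    return s * math.exp(-nu) + nu - 1.0

# Hypothetical similarity scores between one anchor and its candidates.
sims = [0.9, 0.1, -0.3, 0.4]
tau = 0.5  # temperature (assumed value)

# Normalizer: average of exponentiated, temperature-scaled similarities.
s = sum(math.exp(v / tau) for v in sims) / len(sims)

nu_star = math.log(s)
tight_value = surrogate(s, nu_star)

# Plain gradient descent on the auxiliary variable recovers log(s).
nu = 0.0
for _ in range(500):
    nu -= 0.1 * (1.0 - s * math.exp(-nu))  # d/dnu of the surrogate
```

Because the auxiliary variable has a closed-form optimum per sample, a small network that maps a sample to its predicted ν can replace a table of n stored values, which is the role the abstract gives to the auxiliary network in idea (ii).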

[CV-16] Anatomy-VLM: A Fine-grained Vision-Language Model for Medical Interpretation WACV

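【Quick Read】: This paper addresses the limited accuracy of disease interpretation in medical imaging caused by imaging heterogeneity, and in particular the tendency of mainstream vision-language models (VLMs) to treat images as holistic entities and overlook fine-grained image features. The key to its solution is Anatomy-VLM, a fine-grained vision-language model that fuses multi-scale information: a model encoder first localizes key anatomical structures as regions of interest (ROIs) across the whole image; these regions are then enriched with structured knowledge for context-aware interpretation; finally, multi-scale medical information is aligned to produce clinically interpretable disease predictions. The method performs strongly on both in-distribution and out-of-distribution datasets and supports zero-shot anatomy-wise interpretation, reflecting expert-level clinical interpretation capability.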

Link: https://arxiv.org/abs/2511.08402
Authors: Difei Gu,Yunhe Gao,Mu Zhou,Dimitris Metaxas
Affiliations: Rutgers University; Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to Winter Conference on Applications of Computer Vision (WACV) 2026

Abstract:Accurate disease interpretation from radiology remains challenging due to imaging heterogeneity. Achieving expert-level diagnostic decisions requires integration of subtle image features with clinical knowledge. Yet major vision-language models (VLMs) treat images as holistic entities and overlook fine-grained image details that are vital for disease diagnosis. Clinicians analyze images by utilizing their prior medical knowledge and identify anatomical structures as important regions of interest (ROIs). Inspired by this human-centric workflow, we introduce Anatomy-VLM, a fine-grained vision-language model that incorporates multi-scale information. First, we design a model encoder to localize key anatomical features from entire medical images. Second, these regions are enriched with structured knowledge for contextually aware interpretation. Finally, the model encoder aligns multi-scale medical information to generate clinically interpretable disease predictions. Anatomy-VLM achieves outstanding performance on both in- and out-of-distribution datasets. We also validate the performance of Anatomy-VLM on downstream image segmentation tasks, suggesting that its fine-grained alignment captures anatomical and pathology-related knowledge. Furthermore, Anatomy-VLM's encoder facilitates zero-shot anatomy-wise interpretation, demonstrating its strong expert-level clinical interpretation capabilities.

[CV-17] Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment NEURIPS2025

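【Quick Read】: This paper addresses the overly uniform treatment of negative samples in contrastive learning for multimodal models, which overlooks ambiguous negatives that differ from the positive only in a small detail. Such samples carry important discriminative information but are often ignored. The key to its solution is BACL (Boundary-Aware Curriculum with Local Attention), a lightweight, modular framework built from two differentiable modules: a Boundary-aware Negative Sampler that builds a curriculum signal by gradually raising negative difficulty, and a Contrastive Local Attention loss that identifies and emphasizes the specific regions where image and text mismatch, making the model more sensitive to fine-grained differences. The method needs no extra labels, sets a new state of the art on four large-scale benchmarks, and works with any off-the-shelf dual encoder.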

Link: https://arxiv.org/abs/2511.08399
Authors: Hua Ye(1 and 2),Hang Ding(3),Siyuan Chen(4),Yiyang Jiang(5),Changyuan Zhang(6),Xuan Zhang(2 and 7) ((1) Nanjing University, (2) Airon Technology CO. LTD, (3) University of Bristol, (4) The Hong Kong Polytechnic University, (5) Shanghai Jiao Tong University, (6) The University of Hong Kong, (7) Carnegie Mellon University)
Affiliations: Nanjing University; Airon Technology CO., LTD; Shanghai Jiao Tong University; University of Bristol; The Hong Kong Polytechnic University; The University of Hong Kong; Carnegie Mellon University
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages, 6 figures, 5 tables. Submitted to NeurIPS 2025

Abstract:Most multimodal models treat every negative pair alike, ignoring the ambiguous negatives that differ from the positive by only a small detail. We propose Boundary-Aware Curriculum with Local Attention (BACL), a lightweight add-on that turns these borderline cases into a curriculum signal. A Boundary-aware Negative Sampler gradually raises difficulty, while a Contrastive Local Attention loss highlights where the mismatch occurs. The two modules are fully differentiable and work with any off-the-shelf dual encoder. Theory predicts a fast O(1/n) error rate; practice shows up to +32% R@1 over CLIP and new SOTA on four large-scale benchmarks, all without extra labels.

[CV-18] RAPTR: Radar-based 3D Pose Estimation using Transformer NEURIPS2025

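【Quick Read】: This paper addresses the heavy reliance of radar-based indoor 3D human pose estimation on fine-grained 3D keypoint labels, which are costly to annotate, especially in complex scenes with clutter, occlusion, or multiple people. The key to its solution is RAPTR (RAdar Pose esTimation using tRansformer), a weakly supervised framework whose core innovation is a two-stage pose decoder architecture: the first stage estimates initial 3D poses using 3D bounding box (3D BBox) labels and a purpose-built 3D template loss that mitigates depth ambiguity; the second stage refines the initial poses using 2D keypoint labels and a 3D gravity loss. A pseudo-3D deformable attention mechanism fuses multi-view radar features, improving pose estimation accuracy.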

Link: https://arxiv.org/abs/2511.08387
Authors: Sorachi Kato,Ryoma Yataka,Pu Perry Wang,Pedro Miraldo,Takuya Fujihashi,Petros Boufounos
Affiliations: Mitsubishi Electric Research Laboratories; The University of Osaka; Information Technology R&D Center; Mitsubishi Electric Corporation
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: 26 pages, Accepted to NeurIPS 2025

Abstract:Radar-based indoor 3D human pose estimation has typically relied on fine-grained 3D keypoint labels, which are costly to obtain, especially in complex indoor settings involving clutter, occlusions, or multiple people. In this paper, we propose RAPTR (RAdar Pose esTimation using tRansformer) under weak supervision, using only 3D BBox and 2D keypoint labels which are considerably easier and more scalable to collect. Our RAPTR is characterized by a two-stage pose decoder architecture with a pseudo-3D deformable attention to enhance (pose/joint) queries with multi-view radar features: a pose decoder estimates initial 3D poses with a 3D template loss designed to utilize the 3D BBox labels and mitigate depth ambiguities; and a joint decoder refines the initial poses with 2D keypoint labels and a 3D gravity loss. Evaluated on two indoor radar datasets, RAPTR outperforms existing methods, reducing joint position error by 34.3% on HIBER and 76.9% on MMVR. Our implementation is available at this https URL.

[CV-19] Text-based Aerial-Ground Person Retrieval

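【Quick Read】: This paper addresses cross-view text-based person retrieval, namely Text-based Aerial-Ground Person Retrieval (TAG-PR): retrieving a target person from aerial and ground images with large viewpoint differences, given a textual description. Traditional Text-based Person Retrieval (T-PR) considers only ground-view images, whereas TAG-PR introduces a multi-view setting closer to practical use, whose core challenge is the feature inconsistency caused by viewpoint differences. The solution has two parts: the TAG-PEDES dataset, which uses a diversified text generation strategy to make textual descriptions robust to view changes; and the TAG-CLIP framework, which uses a hierarchically-routed mixture of experts module to learn view-specific and view-agnostic features, together with a viewpoint decoupling strategy that separates view-specific features to improve cross-modal alignment.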

Link: https://arxiv.org/abs/2511.08369
Authors: Xinyu Zhou,Yu Wu,Jiayao Ma,Wenhao Wang,Min Cao,Mang Ye
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:This work introduces Text-based Aerial-Ground Person Retrieval (TAG-PR), which aims to retrieve person images from heterogeneous aerial and ground views with textual descriptions. Unlike traditional Text-based Person Retrieval (T-PR), which focuses solely on ground-view images, TAG-PR introduces greater practical significance and presents unique challenges due to the large viewpoint discrepancy across images. To support this task, we contribute: (1) TAG-PEDES dataset, constructed from public benchmarks with automatically generated textual descriptions, enhanced by a diversified text generation paradigm to ensure robustness under view heterogeneity; and (2) TAG-CLIP, a novel retrieval framework that addresses view heterogeneity through a hierarchically-routed mixture of experts module to learn view-specific and view-agnostic features and a viewpoint decoupling strategy to decouple view-specific features for better cross-modal alignment. We evaluate the effectiveness of TAG-CLIP on both the proposed TAG-PEDES dataset and existing T-PR benchmarks. The dataset and code are available at this https URL.

[CV-20] A Circular Argument : Does RoPE need to be Equivariant for Vision?

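【Quick Read】: This paper addresses how to generalize Rotary Positional Encodings (RoPE) from one-dimensional sequences to higher-dimensional data such as images and videos, while re-examining why RoPE performs well. First, it proves mathematically that RoPE is one of the most general solutions for equivariant positional embedding in one-dimensional data. Second, it shows that Mixed RoPE is the analogous equivariant generalization for M-dimensional data when the generators are required to commute. Third, it questions whether strict equivariance matters for RoPE's practical performance and proposes Spherical RoPE, built on non-commutative generators. Experiments show its learning behavior is equivalent to or better than that of the equivariant variants, suggesting that relative positional embeddings are less important than commonly believed and that positional encodings for vision can be efficient and generalize well without an equivariance constraint.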

Link: https://arxiv.org/abs/2511.08368
Authors: Chase van de Geijn,Timo Lüddecke,Polina Turishcheva,Alexander S. Ecker
Affiliations: University of Göttingen; Max Planck Institute for Dynamics and Self-Organization
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Rotary Positional Encodings (RoPE) have emerged as a highly effective technique for one-dimensional sequences in Natural Language Processing, spurring recent progress towards generalizing RoPE to higher-dimensional data such as images and videos. The success of RoPE has been thought to be due to its positional equivariance, i.e. its status as a relative positional encoding. In this paper, we mathematically show RoPE to be one of the most general solutions for equivariant positional embedding in one-dimensional data. Moreover, we show Mixed RoPE to be the analogously general solution for M-dimensional data, if we require commutative generators – a property necessary for RoPE's equivariance. However, we question whether strict equivariance plays a large role in RoPE's performance. We propose Spherical RoPE, a method analogous to Mixed RoPE, but one that assumes non-commutative generators. Empirically, we find Spherical RoPE to have equivalent or better learning behavior compared to its equivariant analogues. This suggests that relative positional embeddings are not as important as is commonly believed, at least within computer vision. We expect this discovery to facilitate future work in positional encodings for vision that can be faster and generalize better by removing the preconception that they must be relative.
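
The equivariance property discussed here can be checked on a single 2D feature pair: after rotating the query by an angle proportional to its position and the key likewise, their dot product depends only on the position offset. The second half of the sketch shows that 3D rotations about different axes do not commute, which is the algebraic reason a construction with non-commutative generators cannot be strictly relative. The frequency `theta`, the vectors, and the helper names are illustrative assumptions, and the 3D part is not the paper's Spherical RoPE construction, only a demonstration of non-commutativity.

```python
import math

def rotate2d(v, angle):
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def rope_score(q, k, m, n, theta=0.3):
    # Rotate query at position m and key at position n, then take the dot product.
    qm = rotate2d(q, m * theta)
    kn = rotate2d(k, n * theta)
    return qm[0] * kn[0] + qm[1] * kn[1]

q, k = (1.0, 2.0), (0.5, -1.0)
score_a = rope_score(q, k, m=5, n=2)    # offset 3
score_b = rope_score(q, k, m=40, n=37)  # offset 3, shifted by 35 positions
score_c = rope_score(q, k, m=5, n=3)    # offset 2

def rot_x(v, a):
    c, s = math.cos(a), math.sin(a)
    return (v[0], c * v[1] - s * v[2], s * v[1] + c * v[2])

def rot_y(v, a):
    c, s = math.cos(a), math.sin(a)
    return (c * v[0] + s * v[2], v[1], -s * v[0] + c * v[2])

v = (1.0, 0.0, 0.0)
x_then_y = rot_y(rot_x(v, 0.7), 0.4)
y_then_x = rot_x(rot_y(v, 0.4), 0.7)
order_gap = max(abs(a - b) for a, b in zip(x_then_y, y_then_x))
```

Here `score_a` and `score_b` agree because planar rotations compose additively, while `order_gap` is nonzero because the order of the two 3D rotations matters.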

[CV-21] Retrospective motion correction in MRI using disentangled embeddings

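【Quick Read】: This paper addresses image artifacts caused by physiological motion in magnetic resonance imaging (MRI), and in particular the limited ability of existing machine learning (ML)-based motion correction methods to generalize across motion types and body regions. The key to its solution is a hierarchical vector-quantized (VQ) variational auto-encoder that learns a disentangled embedding of motion-to-clean image features, using a codebook to capture a finite collection of motion patterns at multiple scales and enable coarse-to-fine correction. An auto-regressive model is also trained to learn the prior distribution of motion-free images and guides correction at inference, so the method handles unseen motion patterns without artifact-specific training, improving generality and robustness.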

Link: https://arxiv.org/abs/2511.08365
Authors: Qi Wang,Veronika Ecker,Marcel Früh,Sergios Gatidis,Thomas Küstner
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 3 figures

Abstract:Physiological motion can affect the diagnostic quality of magnetic resonance imaging (MRI). While various retrospective motion correction methods exist, many struggle to generalize across different motion types and body regions. In particular, machine learning (ML)-based corrections are often tailored to specific applications and datasets. We hypothesize that motion artifacts, though diverse, share underlying patterns that can be disentangled and exploited. To address this, we propose a hierarchical vector-quantized (VQ) variational auto-encoder that learns a disentangled embedding of motion-to-clean image features. A codebook is deployed to capture a finite collection of motion patterns at multiple resolutions, enabling coarse-to-fine correction. An auto-regressive model is trained to learn the prior distribution of motion-free images and is used at inference to guide the correction process. Unlike conventional approaches, our method does not require artifact-specific training and can generalize to unseen motion patterns. We demonstrate the approach on simulated whole-body motion artifacts and observe robust correction across varying motion severity. Our results suggest that the model effectively disentangles the physical motion in the simulated motion-affected scans, thereby improving the generalizability of ML-based MRI motion correction. Our work on disentangling motion features sheds light on its potential application across anatomical regions and motion types.
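
The codebook step in a vector-quantized auto-encoder can be sketched as a nearest-neighbor lookup: each encoder feature is replaced by the closest entry in a finite codebook. The snippet below is a generic illustration of that lookup with a hypothetical 2D codebook and made-up feature vectors; it does not reproduce the paper's hierarchical architecture or training losses.

```python
def nearest_code(vec, codebook):
    # Return the index of the codebook entry with the smallest squared distance.
    best_idx, best_dist = -1, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

# Hypothetical codebook of four 2D entries and three encoder features.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
features = [(0.9, 0.2), (0.1, 0.8), (0.6, 0.7)]

indices = [nearest_code(f, codebook) for f in features]
quantized = [codebook[i] for i in indices]
```

Because every feature maps to one of a small number of entries, the discrete indices form a sequence that an auto-regressive prior can model, which is the role the abstract assigns to the auto-regressive model at inference.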

[CV-22] Extreme Model Compression with Structured Sparsity at Low Precision

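【Quick Read】: This paper addresses the large model size and high computational cost that make deep neural networks (DNNs) hard to deploy on resource-constrained devices. Weight quantization and structured sparsity are each effective, but combining them directly causes a marked accuracy drop because their negative effects compound. The key to its solution is SLOPE (Structured Sparsity at Low Precision), a unified framework with a training-time regularization strategy that minimizes the discrepancy between full-precision weights and their sparse, quantized counterparts. Rather than forcing the values to match, it promotes angular alignment between the two, achieving roughly 20× model compression while retaining about 99% of the original accuracy, and outperforming state-of-the-art quantization and sparsity methods across multiple tasks and architectures.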

Link: https://arxiv.org/abs/2511.08360
Authors: Dan Liu,Nikita Dvornik,Xue Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 36th British Machine Vision Conference 2025

Abstract:Deep neural networks (DNNs) are used in many applications, but their large size and high computational cost make them hard to run on devices with limited resources. Two widely used techniques to address this challenge are weight quantization, which lowers the precision of all weights, and structured sparsity, which removes unimportant weights while retaining the important ones at full precision. Although both are effective individually, they are typically studied in isolation due to their compounded negative impact on model accuracy when combined. In this work, we introduce SLOPE (Structured Sparsity at Low Precision), a unified framework, to effectively combine structured sparsity and low-bit quantization in a principled way. We show that naively combining sparsity and quantization severely harms performance due to the compounded impact of both techniques. To address this, we propose a training-time regularization strategy that minimizes the discrepancy between full-precision weights and their sparse, quantized counterparts by promoting angular alignment rather than direct matching. On ResNet-18, SLOPE achieves ~20× model size reduction while retaining ~99% of the original accuracy. It consistently outperforms state-of-the-art quantization and structured sparsity methods across classification, detection, and segmentation tasks on models such as ResNet-18, ViT-Small, and Mask R-CNN.
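
A minimal sketch of what an angular-alignment penalty between full-precision weights and their sparse, quantized counterparts might look like: prune, quantize, then penalize one minus the cosine similarity. The 2:4 sparsity pattern, the symmetric uniform 4-bit quantizer, and all function names are assumptions chosen for illustration; the abstract does not specify SLOPE's exact sparsity pattern, quantizer, or loss.

```python
import math

def prune_n_of_m(w, n=2, m=4):
    # Assumed structured pattern: keep the n largest-magnitude weights in each group of m.
    out = []
    for i in range(0, len(w), m):
        group = w[i:i + m]
        order = sorted(range(len(group)), key=lambda j: abs(group[j]), reverse=True)
        keep = set(order[:n])
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

def quantize_symmetric(w, bits=4):
    # Assumed quantizer: symmetric uniform grid scaled to the largest magnitude.
    levels = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in w)
    scale = peak / levels if peak > 0 else 1.0
    return [round(v / scale) * scale for v in w]

def angular_penalty(w, w_hat):
    # 1 - cosine similarity: zero when the two vectors point in the same direction.
    dot = sum(a * b for a, b in zip(w, w_hat))
    norm = math.sqrt(sum(a * a for a in w)) * math.sqrt(sum(b * b for b in w_hat))
    return 1.0 - dot / norm

w = [0.8, -0.1, 0.05, -0.6, 0.3, 0.02, -0.9, 0.4]
w_sq = quantize_symmetric(prune_n_of_m(w), bits=4)
penalty = angular_penalty(w, w_sq)
```

Unlike a squared-error penalty, this quantity is unchanged when `w_sq` is rescaled, so it constrains the direction of the weights without forcing their magnitudes onto the quantization grid.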

[CV-23] VideoChain: A Transformer-Based Framework for Multi-hop Video Question Generation

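【Quick Read】: This paper addresses the lack of multi-hop reasoning in Video Question Generation (VideoQG): existing methods only generate zero-hop questions over single segments and cannot capture complex logical relations across temporally separated segments. The key to its solution is the VideoChain framework, a modular architecture built on a modified BART backbone fused with video embeddings to model textual and visual dependencies. Using the TVQA+ dataset, the authors automatically construct the large-scale multi-hop dataset MVQ-60, enabling generation of questions that require reasoning across multiple temporally separated video segments.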

Link: https://arxiv.org/abs/2511.08348
Authors: Arpan Phukan,Anupam Pandey,Deepjyoti Bodo,Asif Ekbal
Affiliations: IIT Patna
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multi-hop Question Generation (QG) effectively evaluates reasoning but remains confined to text; Video Question Generation (VideoQG) is limited to zero-hop questions over single segments. To address this, we introduce VideoChain, a novel Multi-hop Video Question Generation (MVQG) framework designed to generate questions that require reasoning across multiple, temporally separated video segments. VideoChain features a modular architecture built on a modified BART backbone enhanced with video embeddings, capturing textual and visual dependencies. Using the TVQA+ dataset, we automatically construct the large-scale MVQ-60 dataset by merging zero-hop QA pairs, ensuring scalability and diversity. Evaluations show VideoChain’s strong performance across standard generation metrics: ROUGE-L (0.6454), ROUGE-1 (0.6854), BLEU-1 (0.6711), BERTScore-F1 (0.7967), and semantic similarity (0.8110). These results highlight the model’s ability to generate coherent, contextually grounded, and reasoning-intensive questions.

[CV-24] Towards Open-Set Myoelectric Gesture Recognition via Dual-Perspective Inconsistency Learning

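【Quick Read】: This paper addresses overfitting and poor generalization in deep learning models for surface electromyography (sEMG)-based gesture recognition, caused by scarce training data. Existing data augmentation methods struggle to ensure both faithfulness and diversity of generated samples, and untargeted diversity can introduce redundant samples of limited utility. The key to its solution is a diffusion-based augmentation method, Sparse-Aware Semantic-Guided Diffusion Augmentation (SASG-DA), with three components: 1) a Semantic Representation Guidance (SRG) mechanism that uses fine-grained, task-aware semantic representations as generation conditions to improve faithfulness; 2) a Gaussian Modeling Semantic Modeling (GMSS) strategy that models the semantic distribution and supports stochastic sampling for faithful and diverse generation; 3) a Sparse-Aware Semantic Sampling strategy that explicitly explores underrepresented regions to enhance targeted diversity, improving distribution coverage and sample utility. Experiments on several Ninapro benchmark sEMG datasets show that SASG-DA significantly outperforms existing methods, mitigating overfitting and improving recognition performance and generalization.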

Link: https://arxiv.org/abs/2511.08344
Authors: Chen Liu,Can Han,Weishi Xu,Yaqi Wang,Dahong Qian
Affiliations: Shanghai Jiao Tong University; Communication University of Zhejiang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Under review

Abstract:Surface electromyography (sEMG)-based gesture recognition plays a critical role in human-machine interaction (HMI), particularly for rehabilitation and prosthetic control. However, sEMG-based systems often suffer from the scarcity of informative training data, leading to overfitting and poor generalization in deep learning models. Data augmentation offers a promising approach to increasing the size and diversity of training data, where faithfulness and diversity are two critical factors to effectiveness. However, promoting untargeted diversity can result in redundant samples with limited utility. To address these challenges, we propose a novel diffusion-based data augmentation approach, Sparse-Aware Semantic-Guided Diffusion Augmentation (SASG-DA). To enhance generation faithfulness, we introduce the Semantic Representation Guidance (SRG) mechanism by leveraging fine-grained, task-aware semantic representations as generation conditions. To enable flexible and diverse sample generation, we propose a Gaussian Modeling Semantic Modeling (GMSS) strategy, which models the semantic representation distribution and allows stochastic sampling to produce both faithful and diverse samples. To enhance targeted diversity, we further introduce a Sparse-Aware Semantic Sampling strategy to explicitly explore underrepresented regions, improving distribution coverage and sample utility. Extensive experiments on benchmark sEMG datasets, Ninapro DB2, DB4, and DB7, demonstrate that SASG-DA significantly outperforms existing augmentation methods. Overall, our proposed data augmentation approach effectively mitigates overfitting and improves recognition performance and generalization by offering both faithful and diverse samples.

[CV-25] Empowering DINO Representations for Underwater Instance Segmentation via Aligner and Prompter AAAI2026

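【Quick Read】: This paper addresses insufficient segmentation accuracy in Underwater Instance Segmentation (UIS) caused by color shifts and complex backgrounds in underwater images. The key to its solution is DiveSeg, a new framework with two core components: (1) the AquaStyle Aligner, which embeds underwater color style features into the fine-tuning of the pretrained DINO model to improve adaptation to the underwater domain; and (2) the ObjectPrior Prompter, which uses prompts derived from binary segmentation to supply object-level priors, guiding a task that requires both object- and instance-level reasoning. Experiments show state-of-the-art performance on the UIIS and USIS10K datasets.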

Link: https://arxiv.org/abs/2511.08334
Authors: Zhiyang Chen,Chen Zhang,Hao Fang,Runmin Cong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2026

Abstract:Underwater instance segmentation (UIS), integrating pixel-level understanding and instance-level discrimination, is a pivotal technology in marine resource exploration and ecological protection. In recent years, large-scale pretrained visual foundation models, exemplified by DINO, have advanced rapidly and demonstrated remarkable performance on complex downstream tasks. In this paper, we demonstrate that DINO can serve as an effective feature learner for UIS, and we introduce DiveSeg, a novel framework built upon two insightful components: (1) the AquaStyle Aligner, designed to embed underwater color style features into the DINO fine-tuning process, facilitating better adaptation to the underwater domain; and (2) the ObjectPrior Prompter, which incorporates binary segmentation-based prompts to deliver object-level priors, providing essential guidance for the instance segmentation task, which requires both object- and instance-level reasoning. We conduct thorough experiments on the popular UIIS and USIS10K datasets, and the results show that DiveSeg achieves state-of-the-art performance. Code: this https URL.

[CV-26] The Impact of Longitudinal Mammogram Alignment on Breast Cancer Risk Assessment

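【Quick Read】: This paper addresses inaccurate spatial alignment in longitudinal mammography images, which obscures changes in breast tissue and degrades deep learning-based risk modeling. Its core solution is image-based registration for alignment across time points: compared with feature-level alignment and implicit alignment, image-based registration performs best on predictive accuracy, recall, precision, and deformation field quality, and produces anatomically plausible deformation fields. The study further shows that applying the deformation fields obtained from image registration within the feature space yields the best risk prediction performance, underscoring the importance of image-based deformation fields for longitudinal risk modeling.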

Link: https://arxiv.org/abs/2511.08328
Authors: Solveig Thrun,Stine Hansen,Zijun Sun,Nele Blum,Suaiba A. Salahuddin,Xin Wang,Kristoffer Wickstrøm,Elisabeth Wetzer,Robert Jenssen,Maik Stille,Michael Kampffmeyer
Affiliations: University of Tromsø
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Regular mammography screening is crucial for early breast cancer detection. By leveraging deep learning-based risk models, screening intervals can be personalized, especially for high-risk individuals. While recent methods increasingly incorporate longitudinal information from prior mammograms, accurate spatial alignment across time points remains a key challenge. Misalignment can obscure meaningful tissue changes and degrade model performance. In this study, we provide insights into various alignment strategies, image-based registration, feature-level (representation space) alignment with and without regularization, and implicit alignment methods, for their effectiveness in longitudinal deep learning-based risk modeling. Using two large-scale mammography datasets, we assess each method across key metrics, including predictive accuracy, precision, recall, and deformation field quality. Our results show that image-based registration consistently outperforms the more recently favored feature-based and implicit approaches across all metrics, enabling more accurate, temporally consistent predictions and generating smooth, anatomically plausible deformation fields. Although regularizing the deformation field improves deformation quality, it reduces the risk prediction performance of feature-level alignment. Applying image-based deformation fields within the feature space yields the best risk prediction performance. These findings underscore the importance of image-based deformation fields for spatial alignment in longitudinal risk modeling, offering improved prediction accuracy and robustness. This approach has strong potential to enhance personalized screening and enable earlier interventions for high-risk individuals. The code is available at this https URL, allowing full reproducibility of the results. 

[CV-27] Mitigating Negative Flips via Margin Preserving Training AAAI2026

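【Quick Read】: This paper addresses classification inconsistency across successive versions of an AI system when new classes are added, which shows up as negative flips: the updated model misclassifies samples that the previous model classified correctly. As the number of training classes grows, the margins of the original classes shrink and conflicting patterns can arise, degrading performance on the original subset. The key to its solution is a new method combining an explicit margin-calibration term on the logits with a double-source focal distillation loss: margin calibration preserves the relative margins of the original model's classes to avoid negative flips, while supervision from both the previous model and an independently trained new model lets the network learn an appropriate decision margin, maintaining stability on the original classes while improving accuracy on the new ones, and thereby reducing the negative flip rate at high overall accuracy.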

Link: https://arxiv.org/abs/2511.08322
Authors: Simone Ricci,Niccolò Biondi,Federico Pernici,Alberto Del Bimbo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at AAAI2026

Abstract:Minimizing inconsistencies across successive versions of an AI system is as crucial as reducing the overall error. In image classification, such inconsistencies manifest as negative flips, where an updated model misclassifies test samples that were previously classified correctly. This issue becomes increasingly pronounced as the number of training classes grows over time, since adding new categories reduces the margin of each class and may introduce conflicting patterns that undermine their learning process, thereby degrading performance on the original subset. To mitigate negative flips, we propose a novel approach that preserves the margins of the original model while learning an improved one. Our method encourages a larger relative margin between the previously learned and newly introduced classes by introducing an explicit margin-calibration term on the logits. However, overly constraining the logit margin for the new classes can significantly degrade their accuracy compared to a new independently trained model. To address this, we integrate a double-source focal distillation loss with the previous model and a new independently trained model, learning an appropriate decision margin from both old and new data, even under a logit margin calibration. Extensive experiments on image classification benchmarks demonstrate that our approach consistently reduces the negative flip rate with high overall accuracy.
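
The negative flip definition above translates directly into a small metric: count test samples the old model got right and the new model gets wrong. The sketch below normalizes by the total number of test samples, which is one common convention and an assumption here; the labels and predictions are made-up values for illustration.

```python
def negative_flip_rate(labels, old_preds, new_preds):
    # A negative flip: the old model was correct and the new model is wrong.
    flips = sum(
        1 for y, old, new in zip(labels, old_preds, new_preds)
        if old == y and new != y
    )
    return flips / len(labels)

def accuracy(labels, preds):
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)

# Hypothetical test set with predictions from two model versions.
labels    = [0, 1, 2, 1, 0, 2]
old_preds = [0, 1, 1, 1, 0, 0]
new_preds = [0, 2, 2, 1, 1, 2]

nfr = negative_flip_rate(labels, old_preds, new_preds)
old_acc = accuracy(labels, old_preds)
new_acc = accuracy(labels, new_preds)
```

In this toy case both versions have the same accuracy, yet a third of the test samples are negative flips, which shows why the abstract treats consistency as a concern separate from overall error.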

[CV-28] NeuSpring: Neural Spring Fields for Reconstruction and Simulation of Deformable Objects from Videos AAAI2026

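[Quick Read]: This paper targets physical digital twins of deformable objects, where existing methods learn too little physics while modeling the current state and therefore generalize poorly to future prediction. The key solution is NeuSpring, a neural spring field for reconstructing and simulating deformable objects from videos, with two core innovations: 1) a piecewise topology solution that uses zero-order optimization to efficiently model multi-region spring connection topologies, accounting for the material heterogeneity of real objects; 2) a neural spring field, built on a canonical coordinate-based neural network, that represents spring physical properties across frames and exploits the spatial associativity of springs to strengthen physical learning. Experiments on real-world datasets show better current-state reconstruction and future prediction, with Chamfer distance improved by 20% and 25%, respectively.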

Link: https://arxiv.org/abs/2511.08310
Authors: Qingshan Xu, Jiao Liu, Shangshu Yu, Yuxuan Wang, Yuan Zhou, Junbao Zhou, Jiequan Cui, Yew-Soon Ong, Hanwang Zhang
Affiliations: 1. Nanyang Technological University; 2. Tsinghua University; 3. Zhejiang University; 4. Singapore Management University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI2026


Abstract:In this paper, we aim to create physical digital twins of deformable objects under interaction. Existing methods focus on physical learning for current-state modeling but generalize poorly to future prediction, because they ignore the intrinsic physical properties of deformable objects, which limits the physics learned during current-state modeling. To address this, we present NeuSpring, a neural spring field for the reconstruction and simulation of deformable objects from videos. Built upon spring-mass models for realistic physical simulation, our method consists of two major innovations: 1) a piecewise topology solution that efficiently models multi-region spring connection topologies using zero-order optimization, which considers the material heterogeneity of real-world objects; 2) a neural spring field that represents spring physical properties across different frames using a canonical coordinate-based neural network, which effectively leverages the spatial associativity of springs for physical learning. Experiments on real-world datasets demonstrate that NeuSpring achieves superior reconstruction and simulation performance for current state modeling and future prediction, with Chamfer distance improved by 20% and 25%, respectively.
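
Chamfer distance, the metric quoted for NeuSpring, measures how far each point of one set lies from its nearest neighbor in the other set. The following is a minimal sketch assuming the symmetric, unsquared, mean-normalized variant; papers differ on squaring and normalization, and the abstract does not say which is used. The name `chamfer_distance` and the toy points are hypothetical.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two non-empty lists of 3D points.

    Assumption: unsquared Euclidean distances, averaged per direction.
    """

    def one_way(src, dst):
        # Mean distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


# Toy example (made-up points): only the second point differs.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(pred, target)
print(f"Chamfer distance: {cd:.3f}")
```

This brute-force version is quadratic in the number of points; real pipelines typically use a spatial index or a batched GPU kernel for the nearest-neighbor search.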

[CV-29] SkelSplat: Robust Multi-view 3D Human Pose Estimation with Differentiable Gaussian Rendering WACV2026

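[Quick Read]: This paper addresses the poor generalization of multi-view 3D human pose estimation when the test scenario differs from the training data distribution. Existing learning-based methods learn view fusion from large annotated datasets, so their performance drops markedly in new scenes. The key solution is SkelSplat, a framework that models the human pose as a skeleton of 3D Gaussians, one per joint, and optimizes it through differentiable Gaussian rendering, enabling seamless fusion of arbitrary camera views without 3D ground-truth supervision. Because Gaussian Splatting was originally designed for dense scene reconstruction and does not suit sparse human structure, the authors design a one-hot encoding scheme that lets each joint be optimized independently. On Human3.6M and CMU the method outperforms approaches that do not use 3D ground truth, reduces cross-dataset error by up to 47.8% compared to learning-based methods, and remains robust under occlusion.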

Link: https://arxiv.org/abs/2511.08294
Authors: Laura Bragagnolo, Leonardo Barcellona, Stefano Ghidoni
Affiliations: University of Padova; University of Amsterdam
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV 2026


Abstract:Accurate 3D human pose estimation is fundamental for applications such as augmented reality and human-robot interaction. State-of-the-art multi-view methods learn to fuse predictions across views by training on large annotated datasets, leading to poor generalization when the test scenario differs. To overcome these limitations, we propose SkelSplat, a novel framework for multi-view 3D human pose estimation based on differentiable Gaussian rendering. Human pose is modeled as a skeleton of 3D Gaussians, one per joint, optimized via differentiable rendering to enable seamless fusion of arbitrary camera views without 3D ground-truth supervision. Since Gaussian Splatting was originally designed for dense scene reconstruction, we propose a novel one-hot encoding scheme that enables independent optimization of human joints. SkelSplat outperforms approaches that do not rely on 3D ground truth in Human3.6M and CMU, while reducing the cross-dataset error up to 47.8% compared to learning-based methods. Experiments on Human3.6M-Occ and Occlusion-Person demonstrate robustness to occlusions, without scenario-specific fine-tuning. Our project page is available here: this https URL.

[CV-30] SynWeather: Weather Observation Data Synthesis across Multiple Regions and Variables via a General Diffusion Transformer

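[Quick Read]: This paper addresses a common limitation of current weather data modeling: most methods handle single-variable, single-region tasks, which prevents unified synthesis across regions, overlooks cross-variable complementarity, and often produces over-smoothed results. The key solution is the SynWeather dataset together with the SynWeatherDiff model. SynWeather is the first benchmark dataset for unified multi-region, multi-variable weather observation data synthesis, covering the Continental United States, Europe, East Asia, and tropical cyclone regions, with key variables including composite radar reflectivity, hourly precipitation, visible light, and microwave brightness temperature. SynWeatherDiff is a general, probabilistic weather synthesis model built on the Diffusion Transformer framework, which mitigates the over-smoothing associated with deterministic models.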

Link: https://arxiv.org/abs/2511.08291
Authors: Kaiyi Xu, Junchao Gong, Zhiwang Zhou, Zhangrui Li, Yuandong Pu, Yihao Liu, Ben Fei, Fenghua Ling, Wenlong Zhang, Lei Bei
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:With the advancement of meteorological instruments, abundant data has become available. Current approaches typically focus on single-variable, single-region tasks and rely primarily on deterministic modeling. This limits unified synthesis across variables and regions, overlooks cross-variable complementarity, and often leads to over-smoothed results. To address these challenges, we introduce SynWeather, the first dataset designed for Unified Multi-region and Multi-variable Weather Observation Data Synthesis. SynWeather covers four representative regions (the Continental United States, Europe, East Asia, and Tropical Cyclone regions) and provides high-resolution observations of key weather variables, including Composite Radar Reflectivity, Hourly Precipitation, Visible Light, and Microwave Brightness Temperature. In addition, we introduce SynWeatherDiff, a general and probabilistic weather synthesis model built upon the Diffusion Transformer framework to address the over-smoothing problem. Experiments on the SynWeather dataset demonstrate the effectiveness of our network compared with both task-specific and general models.

[CV-31] MAUGIF: Mechanism-Aware Unsupervised General Image Fusion via Dual Cross-Image Autoencoders

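[Quick Read]: This paper addresses two limitations of existing image fusion methods: they are either highly task-specific and hard to generalize, or general frameworks that apply a uniform strategy and ignore the distinct mechanisms of different fusion tasks. The key solution is a mechanism-aware unsupervised general image fusion (MAUGIF) method built on dual cross-image autoencoders. Fusion tasks are first classified as additive or multiplicative according to their underlying mechanism. Dual encoders then map the source images into a shared latent space while isolating modality-specific details. In decoding, dual decoders act as feature injectors that selectively reinject modality-specific features into the shared content for reconstruction, with the decoder architecture chosen according to the fusion mechanism, which improves both performance and interpretability.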

Link: https://arxiv.org/abs/2511.08272
Authors: Kunjing Yang, Zhiwei Wang, Minru Bai
Affiliations: Hunan University; Zhejiang University of Technology; The University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Image fusion aims to integrate structural and complementary information from multi-source images. However, existing fusion methods are often either highly task-specific, or general frameworks that apply uniform strategies across diverse tasks, ignoring their distinct fusion mechanisms. To address this issue, we propose a mechanism-aware unsupervised general image fusion (MAUGIF) method based on dual cross-image autoencoders. Initially, we introduce a classification of additive and multiplicative fusion according to the inherent mechanisms of different fusion tasks. Then, dual encoders map source images into a shared latent space, capturing common content while isolating modality-specific details. During the decoding phase, dual decoders act as feature injectors, selectively reintegrating the unique characteristics of each modality into the shared content for reconstruction. The modality-specific features are injected into the source image in the fusion process, generating the fused image that integrates information from both modalities. The architecture of decoders varies according to their fusion mechanisms, enhancing both performance and interpretability. Extensive experiments are conducted on diverse fusion tasks to validate the effectiveness and generalization ability of our method. The code is available at this https URL.

[CV-32] SWAN - Enabling Fast and Mobile Histopathology Image Annotation through Swipeable Interfaces

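[Quick Read]: This paper addresses the inefficiency of annotating large-scale histopathology image datasets, a major bottleneck in developing robust deep learning models for clinically relevant tasks such as mitotic figure classification. Traditional folder-based annotation workflows are slow, fatiguing, and hard to scale. The key solution is SWipeable ANnotations (SWAN), an open-source, MIT-licensed web application that classifies image patches through an intuitive swiping gesture, runs on both desktop and mobile platforms, captures metadata in real time, and allows flexible mapping of swipe gestures to class labels. In a pilot study in which four pathologists annotated 600 mitotic figure image patches, SWAN enabled rapid annotation with agreement comparable to folder sorting (pairwise percent agreement 86.52% to 93.68%, Cohen's Kappa 0.61 to 0.80), suggesting a scalable and user-friendly alternative to conventional workflows.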

Link: https://arxiv.org/abs/2511.08271
Authors: Sweta Banerjee, Timo Gosch, Sara Hester, Viktoria Weiss, Thomas Conrad, Taryn A. Donovan, Nils Porsche, Jonas Ammeling, Christoph Stroblberger, Robert Klopfleisch, Christopher Kaltenecker, Christof A. Bertram, Katharina Breininger, Marc Aubreville
Affiliations: HafenCity University Hamburg; Friedrich-Alexander-Universität Erlangen-Nürnberg; Charité – Universitätsmedizin Berlin; Medical University of Vienna; AMC NY; Technische Hochschule Ingolstadt; University of Würzburg; University of Veterinary Medicine Vienna
Categories: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
Comments:


Abstract:The annotation of large scale histopathology image datasets remains a major bottleneck in developing robust deep learning models for clinically relevant tasks, such as mitotic figure classification. Folder-based annotation workflows are usually slow, fatiguing, and difficult to scale. To address these challenges, we introduce SWipeable ANnotations (SWAN), an open-source, MIT-licensed web application that enables intuitive image patch classification using a swiping gesture. SWAN supports both desktop and mobile platforms, offers real-time metadata capture, and allows flexible mapping of swipe gestures to class labels. In a pilot study with four pathologists annotating 600 mitotic figure image patches, we compared SWAN against a traditional folder-sorting workflow. SWAN enabled rapid annotations with pairwise percent agreement ranging from 86.52% to 93.68% (Cohen’s Kappa = 0.61-0.80), while for the folder-based method, the pairwise percent agreement ranged from 86.98% to 91.32% (Cohen’s Kappa = 0.63-0.75) for the task of classifying atypical versus normal mitotic figures, demonstrating high consistency between annotators and comparable performance. Participants rated the tool as highly usable and appreciated the ability to annotate on mobile devices. These results suggest that SWAN can accelerate image annotation while maintaining annotation quality, offering a scalable and user-friendly alternative to conventional workflows.
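
The two agreement statistics reported for SWAN, pairwise percent agreement and Cohen's Kappa, can be computed as in the sketch below. The function names and the six toy labels are made up for illustration; the study's actual annotations are not reproduced here.

```python
from collections import Counter


def percent_agreement(r1, r2):
    """Share of items on which two raters assign the same label."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohens_kappa(r1, r2):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = len(r1)
    p_o = percent_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum(c1[k] * c2[k] for k in c1.keys() | c2.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels from two raters on six patches.
rater_a = ["atypical", "normal", "normal", "atypical", "normal", "normal"]
rater_b = ["atypical", "normal", "atypical", "atypical", "normal", "normal"]
agreement = percent_agreement(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(f"agreement={agreement:.3f}, kappa={kappa:.3f}")
```

Kappa is lower than raw agreement because it discounts the matches two raters would produce by chance given how often each uses every label.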

[CV-33] Re-coding for Uncertainties: Edge-awareness Semantic Concordance for Resilient Event-RGB Segmentation NEURIPS2025

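[Quick Read]: This paper addresses the degradation of semantic segmentation under extreme conditions (e.g., low light, fierce camera motion), where RGB images lose substantial information, as well as the feature-level mismatch caused by the heterogeneity of event and RGB data. The key solution is an Edge-awareness Semantic Concordance framework with two core components: 1) Edge-awareness Latent Re-coding, which uses a pre-established edge dictionary to guide the realignment of event-RGB features into a unified semantic space and produces uncertainty indicators; 2) Re-coded Consolidation and Uncertainty Optimization, which uses the re-coded edge features and uncertainty indicators to fuse the heterogeneous modalities, improving resilience in extreme scenarios such as spatial occlusion.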

Link: https://arxiv.org/abs/2511.08269
Authors: Nan Bao, Yifan Zhao, Lin Zhu, Jia Li
Affiliations: Beihang University; Beijing Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS 2025; code and datasets available at this https URL


Abstract:Semantic segmentation has achieved great success in ideal conditions. However, when facing extreme conditions (e.g., insufficient light, fierce camera motion), most existing methods suffer from significant loss of RGB information, which severely damages segmentation results. Several studies exploit the high-speed and high-dynamic event modality as a complement, but event and RGB data are naturally heterogeneous, which leads to feature-level mismatch and inferior optimization in existing multi-modality methods. Unlike these studies, we exploit the edge cues of both modalities for resilient fusion and propose a novel Edge-awareness Semantic Concordance framework to unify the heterogeneous multi-modality features with latent edge cues. In this framework, we first propose Edge-awareness Latent Re-coding, which obtains uncertainty indicators while realigning event-RGB features into a unified semantic space guided by the re-coded distribution, and transfers event-RGB distributions into re-coded features by using a pre-established edge dictionary as clues. We then propose Re-coded Consolidation and Uncertainty Optimization, which use re-coded edge features and uncertainty indicators to solve heterogeneous event-RGB fusion issues under extreme conditions. We establish two synthetic and one real-world event-RGB semantic segmentation datasets for extreme scenario comparisons. Experimental results show that our method outperforms the state of the art by 2.55% mIoU on our proposed DERS-XS and possesses superior resilience under spatial occlusion. Our code and datasets are publicly available at this https URL.

[CV-34] ImagebindDC: Compressing Multi-modal Data with Imagebind-based Condensation AAAI2026

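[Quick Read]: This paper addresses the difficulty of preserving intricate inter-modal dependencies in multimodal data condensation, where methods that succeed in unimodal settings often fail. The core solution is the ImageBindDC framework, whose key ingredient is a Characteristic Function (CF) loss that operates in the Fourier domain and performs exact infinite moment matching, aligning the statistical distributions of synthetic and real data more precisely. The objective enforces three levels of distributional consistency: uni-modal alignment, cross-modal alignment, and joint-modal alignment. On the NYU-v2 dataset, a model trained on only 5 condensed datapoints per class performs comparably to one trained on the full dataset, with an 8.2% absolute improvement over the previous best method.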

Link: https://arxiv.org/abs/2511.08263
Authors: Yue Min, Shaobo Wang, Jiaze Li, Tianle Niu, Junxin Fan, Yongliang Miao, Lijin Yang, Linfeng Zhang
Affiliations: EPIC Lab, SJTU; Bosch Corporate Research Asia Pacific; HKUST
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: AAAI 2026, 18 pages, 6 figures, 6 tables


Abstract:Data condensation techniques aim to synthesize a compact dataset from a larger one to enable efficient model training, yet while successful in unimodal settings, they often fail in multimodal scenarios where preserving intricate inter-modal dependencies is crucial. To address this, we introduce ImageBindDC, a novel data condensation framework operating within the unified feature space of ImageBind. Our approach moves beyond conventional distribution-matching by employing a powerful Characteristic Function (CF) loss, which operates in the Fourier domain to facilitate a more precise statistical alignment via exact infinite moment matching. We design our objective to enforce three critical levels of distributional consistency: (i) uni-modal alignment, which matches the statistical properties of synthetic and real data within each modality; (ii) cross-modal alignment, which preserves pairwise semantics by matching the distributions of hybrid real-synthetic data pairs; and (iii) joint-modal alignment, which captures the complete multivariate data structure by aligning the joint distribution of real data pairs with their synthetic counterparts. Extensive experiments highlight the effectiveness of ImageBindDC: on the NYU-v2 dataset, a model trained on just 5 condensed datapoints per class achieves lossless performance comparable to one trained on the full dataset, achieving a new state-of-the-art with an 8.2% absolute improvement over the previous best method and more than 4× less condensation time.
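
To illustrate the general idea of matching distributions through characteristic functions, the sketch below compares the empirical characteristic functions φ(t) = E[exp(i t·x)] of two sample sets at a finite set of frequencies. Evaluating at Gaussian-sampled frequencies, the names `empirical_cf` and `cf_loss`, and the toy data are all assumptions; this is not the ImageBindDC objective, which the abstract does not specify in detail.

```python
import cmath
import random


def empirical_cf(samples, t):
    """Empirical characteristic function of d-dimensional samples at frequency t."""
    return sum(
        cmath.exp(1j * sum(ti * xi for ti, xi in zip(t, x))) for x in samples
    ) / len(samples)


def cf_loss(real, synthetic, freqs):
    """Mean squared modulus of the CF gap over the given frequency vectors."""
    return sum(
        abs(empirical_cf(real, t) - empirical_cf(synthetic, t)) ** 2 for t in freqs
    ) / len(freqs)


rng = random.Random(0)
dim = 2
# Hypothetical choice: frequencies drawn from a standard Gaussian.
freqs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(64)]
real = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(200)]
shifted = [[x + 3.0 for x in row] for row in real]

loss_same = cf_loss(real, real, freqs)
loss_shifted = cf_loss(real, shifted, freqs)
print(f"identical sets: {loss_same:.4f}, shifted set: {loss_shifted:.4f}")
```

Because a characteristic function determines its distribution uniquely, driving this gap to zero at all frequencies would align every moment at once, which is the intuition behind "infinite moment matching".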

[CV-35] Top2Ground: A Height-Aware Dual Conditioning Diffusion Model for Robust Aerial-to-Ground View Generation

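[Quick Read]: This paper addresses the generation of ground-level images from aerial views, a task made difficult by extreme viewpoint disparity, occlusions, and a limited field of view. The key solution is Top2Ground, a diffusion-based method that does not rely on intermediate representations such as depth maps or 3D voxels. Instead, it conditions the denoising process on a joint representation of VAE-encoded spatial features (from aerial RGB images and an estimated height map) and CLIP-based semantic embeddings, so that generation is geometrically constrained by the scene's 3D structure and semantically consistent with its content.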

Link: https://arxiv.org/abs/2511.08258
Authors: Jae Joong Lee, Bedrich Benes
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:Generating ground-level images from aerial views is a challenging task due to extreme viewpoint disparity, occlusions, and a limited field of view. We introduce Top2Ground, a novel diffusion-based method that directly generates photorealistic ground-view images from aerial input images without relying on intermediate representations such as depth maps or 3D voxels. Specifically, we condition the denoising process on a joint representation of VAE-encoded spatial features (derived from aerial RGB images and an estimated height map) and CLIP-based semantic embeddings. This design ensures the generation is both geometrically constrained by the scene’s 3D structure and semantically consistent with its content. We evaluate Top2Ground on three diverse datasets: CVUSA, CVACT, and the Auto Arborist. Our approach shows 7.3% average improvement in SSIM across three benchmark datasets, showing Top2Ground can robustly handle both wide and narrow fields of view, highlighting its strong generalization capabilities.

[CV-36] LayerEdit: Disentangled Multi-Object Editing via Conflict-Aware Multi-Layer Learning AAAI

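[Quick Read]: This paper addresses the editing confusion and constraints that arise in text-driven multi-object image editing when inter-object interactions are ignored. In particular, attention entanglement in inter-object conflict regions hinders disentangled editing and causes either inter-object editing leakage or intra-object editing constraints. The key solution is LayerEdit, a training-free multi-layer disentangled editing framework that achieves conflict-free object-layered editing through a three-stage "decompose-editing-fusion" pipeline: a Conflict-aware Layer Decomposition module strengthens the detection and suppression of conflict regions; an Object-layered Editing module establishes coordinated intra-layer text guidance and cross-layer geometric mapping to disentangle semantic and structural modifications; and a Transparency-guided Layer Fusion module fuses the object layers with coherent structure. The authors report that this is the first training-free method to achieve precise object-layered decomposition and coherent fusion, improving intra-object controllability and inter-object coherence in complex scenes.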

Link: https://arxiv.org/abs/2511.08251
Authors: Fengyi Fu, Mengqi Huang, Lei Zhang, Zhendong Mao
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The 40th Annual AAAI Conference on Artificial Intelligence


Abstract:Text-driven multi-object image editing, which aims to precisely modify multiple objects within an image based on text descriptions, has recently attracted considerable interest. Existing works primarily follow the localize-editing paradigm, focusing on independent object localization and editing while neglecting critical inter-object interactions. However, this work points out that the neglected attention entanglements in inter-object conflict regions inherently hinder disentangled multi-object editing, leading to either inter-object editing leakage or intra-object editing constraints. We therefore propose LayerEdit, a novel training-free multi-layer disentangled editing framework which, for the first time, enables conflict-free object-layered editing through precise object-layered decomposition and coherent fusion. Specifically, LayerEdit introduces a novel "decompose-editing-fusion" framework, consisting of: (1) a Conflict-aware Layer Decomposition module, which utilizes an attention-aware IoU scheme and time-dependent region removing to enhance conflict awareness and suppression for layer decomposition; (2) an Object-layered Editing module, which establishes coordinated intra-layer text guidance and cross-layer geometric mapping, achieving disentangled semantic and structural modifications; (3) a Transparency-guided Layer Fusion module, which facilitates structure-coherent inter-object layer fusion through precise transparency guidance learning. Extensive experiments verify the superiority of LayerEdit over existing methods, showing unprecedented intra-object controllability and inter-object coherence in complex multi-object scenarios. Codes are available at: this https URL.

[CV-37] NERVE: Neighbourhood Entropy-guided Random-walk for training free open-Vocabulary sEgmentation

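[Quick Read]: This paper addresses several limitations of training-free Open-Vocabulary Semantic Segmentation (OVSS) methods: they rely on computationally expensive affinity refinement strategies, fuse transformer attention maps ineffectively by weighting them equally, and use fixed-size Gaussian kernels to reinforce local spatial smoothness, which enforces isotropic neighbourhoods. The key solution is a strong baseline, NERVE (Neighbourhood Entropy-guided Random-walk for open-Vocabulary sEgmentation). It replaces fixed-kernel local context modeling with a stochastic random walk for affinity refinement, integrating global and fine-grained local information; it exploits the neighbourhood structure of self-attention layers so that the spatial diffusion propagates across connected, semantically related areas and can delineate objects of arbitrary shape; and it uses entropy-based uncertainty to select the most relevant attention maps rather than averaging all heads or layers. It needs no conventional post-processing such as Conditional Random Fields (CRF) or Pixel-Adaptive Mask Refinement (PAMR).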

Link: https://arxiv.org/abs/2511.08248
Authors: Kunal Mahatha, Jose Dolz, Christian Desrosiers
Affiliations: LIVIA, ÉTS Montréal, Canada; International Laboratory on Learning Systems (ILLS), McGill - ETS - MILA - CNRS - Université Paris-Saclay - CentraleSupélec, Canada
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:


Abstract:Despite recent advances in Open-Vocabulary Semantic Segmentation (OVSS), existing training-free methods face several limitations: computationally expensive affinity refinement strategies, ineffective fusion of transformer attention maps due to equal weighting, and reliance on fixed-size Gaussian kernels to reinforce local spatial smoothness, which enforces isotropic neighborhoods. We propose a strong baseline for training-free OVSS termed NERVE (Neighbourhood Entropy-guided Random-walk for open-Vocabulary sEgmentation), which uniquely integrates global and fine-grained local information, exploiting the neighbourhood structure from the self-attention layer of a stable diffusion model. We also introduce a stochastic random walk for refining the affinity rather than relying on fixed-size Gaussian kernels for local context. This spatial diffusion process encourages propagation across connected and semantically related areas, enabling it to effectively delineate objects with arbitrary shapes. Whereas most existing approaches treat self-attention maps from different transformer heads or layers equally, our method uses entropy-based uncertainty to select the most relevant maps. Notably, our method does not require any conventional post-processing techniques like Conditional Random Fields (CRF) or Pixel-Adaptive Mask Refinement (PAMR). Experiments are performed on 7 popular semantic segmentation benchmarks, yielding an overall state-of-the-art zero-shot segmentation performance, providing an effective approach to open-vocabulary semantic segmentation.
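
The idea of ranking attention maps by entropy rather than weighting them equally can be sketched as follows. Treating lower entropy (a more peaked map) as more relevant, keeping the top `k` maps, and the names `entropy` and `select_low_entropy_maps` are assumptions for illustration; the abstract does not state NERVE's exact selection rule.

```python
import math


def entropy(weights):
    """Shannon entropy (in nats) of a non-negative weight vector after normalization."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def select_low_entropy_maps(attention_maps, k):
    """Return the indices of the k maps with the lowest entropy (most peaked)."""
    ranked = sorted(
        range(len(attention_maps)), key=lambda i: entropy(attention_maps[i])
    )
    return ranked[:k]


# Three hypothetical flattened attention maps over four positions.
maps = [
    [0.25, 0.25, 0.25, 0.25],  # uniform: maximally uncertain
    [0.97, 0.01, 0.01, 0.01],  # sharply peaked
    [0.60, 0.20, 0.10, 0.10],  # in between
]
chosen = select_low_entropy_maps(maps, 2)
print("selected map indices:", chosen)
```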

[CV-38] Hierarchical Direction Perception via Atomic Dot-Product Operators for Rotation-Invariant Point Clouds Learning AAAI2026

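[Quick Read]: This paper addresses the disruption of directional characteristics that arbitrary rotations cause in point cloud processing, which makes representation learning across orientations difficult. The core challenge is that rotational perturbations disturb the point cloud's intrinsic directional properties and thus impair accurate modeling of geometric structure. The key solution is the Direction-Perceptive Vector Network (DiPVNet), built around an atomic dot-product operator that encodes directional selectivity and rotation invariance simultaneously, giving the network both rotational symmetry modeling and adaptive directional perception. At the local level, a Learnable Local Dot-Product (L2DP) operator captures non-uniform local structures. At the global level, generalized harmonic analysis is used to prove that the dot product between point clouds and spherical sampling vectors is equivalent to a direction-aware spherical Fourier transform (DASFT), which yields a global directional response spectrum for modeling holistic directional structure. Both operators are rigorously proven to be rotation invariant, and experiments under noise and large-angle rotations show state-of-the-art point cloud classification and segmentation performance.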

Link: https://arxiv.org/abs/2511.08240
Authors: Chenyu Hu, Xiaotong Li, Hao Zhu, Biao Hou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI 2026. Code is available at: this https URL


Abstract:Point cloud processing has become a cornerstone technology in many 3D vision tasks. However, arbitrary rotations introduce variations in point cloud orientations, posing a long-standing challenge for effective representation learning. The core of this issue is the disruption of the point cloud’s intrinsic directional characteristics caused by rotational perturbations. Recent methods attempt to implicitly model rotational equivariance and invariance, preserving directional information and propagating it into deep semantic spaces. Yet, they often fall short of fully exploiting the multiscale directional nature of point clouds to enhance feature representations. To address this, we propose the Direction-Perceptive Vector Network (DiPVNet). At its core is an atomic dot-product operator that simultaneously encodes directional selectivity and rotation invariance–endowing the network with both rotational symmetry modeling and adaptive directional perception. At the local level, we introduce a Learnable Local Dot-Product (L2DP) Operator, which enables interactions between a center point and its neighbors to adaptively capture the non-uniform local structures of point clouds. At the global level, we leverage generalized harmonic analysis to prove that the dot-product between point clouds and spherical sampling vectors is equivalent to a direction-aware spherical Fourier transform (DASFT). This leads to the construction of a global directional response spectrum for modeling holistic directional structures. We rigorously prove the rotation invariance of both operators. Extensive experiments on challenging scenarios involving noise and large-angle rotations demonstrate that DiPVNet achieves state-of-the-art performance on point cloud classification and segmentation tasks. Our code is available at this https URL.
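
The reason dot products are a natural building block for rotation-invariant features is that rotating two vectors together leaves their dot product unchanged. The sketch below checks this numerically on a toy neighbourhood. It illustrates only that underlying property, not the L2DP or DASFT operators; the helper names, the sample points, and the restriction to rotations about the z-axis are assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rotate_z(p, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = p
    return (c * x - s * y, s * x + c * y, z)


def local_dot_products(center, neighbours):
    """Gram matrix of the edge vectors from `center` to each neighbour."""
    edges = [tuple(n - c for n, c in zip(nb, center)) for nb in neighbours]
    return [[dot(a, b) for b in edges] for a in edges]


# Hypothetical center point and three neighbours.
center = (0.5, -1.0, 2.0)
neighbours = [(1.0, 0.0, 2.5), (0.0, -1.5, 1.0), (2.0, -1.0, 2.0)]

angle = 1.234
before = local_dot_products(center, neighbours)
after = local_dot_products(
    rotate_z(center, angle), [rotate_z(nb, angle) for nb in neighbours]
)
max_gap = max(
    abs(a - b) for row_a, row_b in zip(before, after) for a, b in zip(row_a, row_b)
)
print(f"largest change after rotation: {max_gap:.2e}")
```

The same check passes for any rotation matrix, since rotations preserve inner products; a single axis is used here only to keep the example short.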

[CV-39] Remodeling Semantic Relationships in Vision-Language Fine-Tuning

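[Quick Read]: This paper addresses the fact that existing vision-language fine-tuning methods overlook the semantic relationships conveyed by textual context when aligning the visual and language modalities, which leads to suboptimal multimodal alignment and fusion. The key solution has three steps: first, extract multilevel semantic features from different vision encoders to capture richer visual relationship cues; second, learn to project the vision features into groups of related semantics; finally, fuse visual and textual features with inheritable cross-attention and globally discard vision-language feature pairs with low correlation, removing redundant relationships and improving alignment.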

Link: https://arxiv.org/abs/2511.08238
Authors: Xiangyang Wu, Liu Liu, Baosheng Yu, Jiayan Qiu, Zhenwei Shi
Affiliations: Hangzhou International Innovation Institute; Beihang University; School of Artificial Intelligence; Nanyang Technological University; University of Leicester; School of Astronautics
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:


Abstract:Vision-language fine-tuning has emerged as an efficient paradigm for constructing multimodal foundation models. While textual context often highlights semantic relationships within an image, existing fine-tuning methods typically overlook this information when aligning vision and language, thus leading to suboptimal performance. Toward solving this problem, we propose a method that can improve multimodal alignment and fusion based on both semantics and this http URL, we first extract multilevel semantic features from different vision encoders to capture more visual cues of the relationships. Then, we learn to project the vision features into groups of related semantics, which are more likely to share relationships. Finally, we fuse the visual features with the textual features by using inheritable cross-attention, where we globally remove the redundant visual relationships by discarding visual-language feature pairs with low correlation. We evaluate our proposed method on eight foundation models and two downstream tasks, visual question answering and image captioning, and show that it outperforms all existing methods.

[CV-40] Accurate and Efficient Surface Reconstruction from Point Clouds via Geometry-Aware Local Adaptation

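[Quick Read]: This paper addresses the limited adaptability to geometric complexity in point cloud surface reconstruction that results from placing local regions uniformly with a fixed size, which hurts both accuracy and efficiency. The key solution is to adaptively modulate the spacing and size of local regions according to the curvature of the input point cloud, so that region placement better matches areas of differing geometric complexity.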

Link: https://arxiv.org/abs/2511.08233
Authors: Eito Ogawa, Taiga Hayami, Hiroshi Watanabe
Affiliations: Waseda University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages


Abstract:Point cloud surface reconstruction has improved in accuracy with advances in deep learning, enabling applications such as infrastructure inspection. Recent approaches that reconstruct from small local regions rather than entire point clouds have attracted attention for their strong generalization capability. However, prior work typically places local regions uniformly and keeps their size fixed, limiting adaptability to variations in geometric complexity. In this study, we propose a method that improves reconstruction accuracy and efficiency by adaptively modulating the spacing and size of local regions based on the curvature of the input point cloud.

[CV-41] The Online Patch Redundancy Eliminator (OPRE): A novel approach to online agnostic continual learning using dataset compression

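[Quick Read]: This paper addresses catastrophic forgetting in Continual Learning (CL) and argues that most current CL methods introduce a priori information about future data and therefore cannot be considered truly agnostic. The key solution is the Online Patch Redundancy Eliminator (OPRE), an online dataset compression algorithm that relies only on minimal and interpretable hypotheses about the data to come. Combined with training a classifier at test time, it outperforms a number of state-of-the-art online continual learning methods on CIFAR-10 and CIFAR-100, pointing to a feasible path toward fully agnostic continual learning.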

Link: https://arxiv.org/abs/2511.08226
Authors: Raphaël Bayle, Martial Mermillod, Robert M. French
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:


Abstract:In order to achieve Continual Learning (CL), the problem of catastrophic forgetting, one that has plagued neural networks since their inception, must be overcome. The evaluation of continual learning methods relies on splitting a known homogeneous dataset and learning the associated tasks one after the other. We argue that most CL methods introduce a priori information about the data to come and cannot be considered agnostic. We exemplify this point with the case of methods relying on pretrained feature extractors, which are still used in CL. After showing that pretrained feature extractors imply a loss of generality with respect to the data that can be learned by the model, we discuss other kinds of a priori information introduced in other CL methods. We then present the Online Patch Redundancy Eliminator (OPRE), an online dataset compression algorithm, which, along with the training of a classifier at test time, yields performance on CIFAR-10 and CIFAR-100 superior to a number of other state-of-the-art online continual learning methods. Additionally, OPRE requires only minimal and interpretable hypotheses on the data to come. We suggest that online dataset compression could well be necessary to achieve fully agnostic CL.

[CV-42] 2D Representation for Unguided Single-View 3D Super-Resolution in Real-Time ICASSP2026

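[Quick Read]: This paper addresses the dependence of single-view 3D super-resolution on high-resolution RGB guidance: conventional methods usually need high-quality RGB information to help reconstruct detail, yet such data is often unavailable in practice. The key solution is 2Dto3D-SR, a general framework that encodes 3D data from a single viewpoint into a structured 2D representation so that established 2D image super-resolution architectures can be applied directly. Its core idea is to use the Projected Normalized Coordinate Code (PNCC) to map the geometry of the visible surface to a regular image, avoiding the complexity of point-based or RGB-guided processing and enabling lightweight, real-time models suited to a variety of deployment environments.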

Link: https://arxiv.org/abs/2511.08224
Authors: Ignasi Mas, Ivan Huerta, Ramon Morros, Javier Ruiz-Hidalgo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Submitted to ICASSP 2026


Abstract:We introduce 2Dto3D-SR, a versatile framework for real-time single-view 3D super-resolution that eliminates the need for high-resolution RGB guidance. Our framework encodes 3D data from a single viewpoint into a structured 2D representation, enabling the direct application of existing 2D image super-resolution architectures. We utilize the Projected Normalized Coordinate Code (PNCC) to represent 3D geometry from a visible surface as a regular image, thereby circumventing the complexities of 3D point-based or RGB-guided methods. This design supports lightweight and fast models adaptable to various deployment environments. We evaluate 2Dto3D-SR with two implementations: one using Swin Transformers for high accuracy, and another using Vision Mamba for high efficiency. Experiments show the Swin Transformer model achieves state-of-the-art accuracy on standard benchmarks, while the Vision Mamba model delivers competitive results at real-time speeds. This establishes our geometry-guided pipeline as a surprisingly simple yet viable and practical solution for real-world scenarios, especially where high-resolution RGB data is inaccessible.

[CV-43] Evaluating Gemini LLM in Food Image-Based Recipe and Nutrition Description with EfficientNet-B4 Visual Backbone

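[Quick Read]: This paper addresses a bottleneck in automated nutritional analysis and culinary guidance for digital food applications: how to improve the quality of the nutritional data and recipes produced by a generative model while maintaining visual recognition accuracy. The key solution is a decoupled multimodal pipeline in which a specialized visual backbone (EfficientNet-B4) performs food recognition and a generative large language model (Google's Gemini LLM) produces the content. The study also introduces a formalization of "Semantic Error Propagation" (SEP) to analyze how errors in the visual module affect the generated output, showing that overall system utility is bottlenecked by the perceptual accuracy of the visual front end.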

Link: https://arxiv.org/abs/2511.08215
Authors: Rizal Khoirul Anam
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:


Abstract:The proliferation of digital food applications necessitates robust methods for automated nutritional analysis and culinary guidance. This paper presents a comprehensive comparative evaluation of a decoupled, multimodal pipeline for food recognition. We evaluate a system integrating a specialized visual backbone (EfficientNet-B4) with a powerful generative large language model (Google’s Gemini LLM). The core objective is to evaluate the trade-offs between visual classification accuracy, model efficiency, and the quality of generative output (nutritional data and recipes). We benchmark this pipeline against alternative vision backbones (VGG-16, ResNet-50, YOLOv8) and a lightweight LLM (Gemma). We introduce a formalization for “Semantic Error Propagation” (SEP) to analyze how classification inaccuracies from the visual module cascade into the generative output. Our analysis is grounded in a new Custom Chinese Food Dataset (CCFD) developed to address cultural bias in public datasets. Experimental results demonstrate that while EfficientNet-B4 (89.0% Top-1 Acc.) provides the best balance of accuracy and efficiency, and Gemini (9.2/10 Factual Accuracy) provides superior generative quality, the system’s overall utility is fundamentally bottlenecked by the visual front-end’s perceptive accuracy. We conduct a detailed per-class analysis, identifying high semantic similarity as the most critical failure mode.

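[CV-44] Twist and Compute: The Cost of Pose in 3D Generative Diffusion

Quick read: This paper addresses the **canonical view bias** in current large-scale image-to-3D generative models: the models generalize poorly across input viewpoints, and performance drops markedly under rotated inputs. The key to the solution is a lightweight convolutional neural network (CNN) that detects and corrects the orientation of the input image. This restores performance without modifying the generative backbone and supports the case for improving symmetry awareness through modular design.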

Link: https://arxiv.org/abs/2511.08203
Authors: Kyle Fogarty, Jack Foster, Boqiao Zhang, Jing Yang, Cengiz Öztireli
Affiliation: University of Cambridge
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to EurIPS 2025 Workshop on Principles of Generative Modeling (PriGM)

Abstract:Despite their impressive results, large-scale image-to-3D generative models remain opaque in their inductive biases. We identify a significant limitation in image-conditioned 3D generative models: a strong canonical view bias. Through controlled experiments using simple 2D rotations, we show that the state-of-the-art Hunyuan3D 2.0 model can struggle to generalize across viewpoints, with performance degrading under rotated inputs. We show that this failure can be mitigated by a lightweight CNN that detects and corrects input orientation, restoring model performance without modifying the generative backbone. Our findings raise an important open question: Is scale enough, or should we pursue modular, symmetry-aware designs?
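
The detect-and-correct idea in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions: rotations are restricted to multiples of 90 degrees, the image is a plain list of rows, and `predict_quarter_turns` is a hypothetical stand-in for the paper's lightweight CNN. None of these names come from the paper.

```python
def rotate_ccw(image):
    """Rotate a 2D grid (list of rows) by 90 degrees counterclockwise."""
    return [list(row) for row in zip(*image)][::-1]


def rotate_cw(image):
    """Rotate a 2D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def correct_orientation(image, predict_quarter_turns):
    """Undo the counterclockwise quarter turns reported by the predictor.

    predict_quarter_turns is a hypothetical callable standing in for an
    orientation classifier; it returns how many quarter turns were applied.
    """
    turns = predict_quarter_turns(image) % 4
    for _ in range(turns):
        image = rotate_cw(image)
    return image


# Toy usage: a stand-in "detector" that always reports one quarter turn.
upright = [[1, 2], [3, 4]]
rotated = rotate_ccw(upright)
restored = correct_orientation(rotated, lambda img: 1)
```

The corrected image would then be passed to the unchanged generative backbone, which is the modularity the abstract highlights.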

[CV-45] UCDSC: Open Set UnCertainty aware Deep Simplex Classifier for Medical Image Datasets WACV2026

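Quick read: This paper addresses open-set recognition in medical imaging, where scarce data, costly annotation, and too few samples of rare diseases make it hard to separate samples of known classes from unknown ones. The key to the solution is a new loss function that uses auxiliary datasets to penalize "open space" regions, strengthening the model's ability to reject unknown-class samples. The method builds on the observation that features from the later layers of deep neural networks cluster around their class means, which themselves form the vertices of a regular simplex. It yields significant gains on several medical imaging datasets (such as BloodMNIST and OCTMNIST), outperforming state-of-the-art techniques.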

Link: https://arxiv.org/abs/2511.08196
Authors: Arnav Aditya, Nitin Kumar, Saurabh Shigwan
Affiliation: Shiv Nadar Institution of Eminence, Delhi NCR, India
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages. Accepted at IEEE/CVF WACV 2026. Source code is available at this https URL

Abstract:Driven by advancements in deep learning, computer-aided diagnoses have made remarkable progress. However, outside controlled laboratory settings, algorithms may encounter several challenges. In the medical domain, these difficulties often stem from limited data availability due to ethical and legal restrictions, as well as the high cost and time required for expert annotations, especially in the face of emerging or rare diseases. In this context, open-set recognition plays a vital role by identifying whether a sample belongs to one of the known classes seen during training or should be rejected as an unknown. Recent studies have shown that features learned in the later stages of deep neural networks cluster around their class means, which themselves are arranged as individual vertices of a regular simplex [32]. The proposed method introduces a loss function designed to reject samples of unknown classes effectively by penalizing open space regions using auxiliary datasets. This approach achieves significant performance gains across four MedMNIST datasets (BloodMNIST, OCTMNIST, DermaMNIST, and TissueMNIST) and a publicly available skin dataset [29], outperforming state-of-the-art techniques.
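
The regular-simplex arrangement of class means mentioned in the abstract can be made concrete with a small construction. The sketch below is not the authors' code; it uses one standard construction (one-hot vectors with the centroid subtracted, then normalized), and the function names are illustrative. For C classes, every pair of vertices then has cosine similarity -1/(C-1).

```python
import math


def regular_simplex_vertices(num_classes):
    """Return num_classes unit vectors forming a regular simplex centered at the origin."""
    c = num_classes
    vertices = []
    for i in range(c):
        # Start from the one-hot vector e_i and subtract the centroid (1/c, ..., 1/c).
        v = [(1.0 if j == i else 0.0) - 1.0 / c for j in range(c)]
        norm = math.sqrt(sum(x * x for x in v))
        vertices.append([x / norm for x in v])
    return vertices


def cosine(u, v):
    """Dot product of two unit vectors, which equals their cosine similarity."""
    return sum(a * b for a, b in zip(u, v))


# Four class centers: every pair should have cosine -1/3.
centers = regular_simplex_vertices(4)
pairwise = [cosine(centers[i], centers[j]) for i in range(4) for j in range(i + 1, 4)]
```

Because all centers are equally far apart, no known class is favored geometrically, which is what makes the space between and beyond them a natural target for an open-space penalty.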

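[CV-46] UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation

Quick read: This paper addresses two problems in current automatic user interface (UI) coding: multimodal coding capabilities remain underdeveloped, so visual and code information are not fused effectively, and existing methods mostly follow a single-turn paradigm that makes little use of iterative visual feedback to improve generation quality. The key to the solution is an interactive UI-to-code paradigm, together with UI2Code^N, a visual language model trained through staged pretraining, fine-tuning, and reinforcement learning. The model unifies three core capabilities: UI-to-code generation, UI editing, and UI polishing. The study also explores test-time scaling, which makes systematic use of multi-turn feedback for higher-quality interactive generation. On UI-to-code and UI polishing benchmarks, the model sets a new state of the art among open-source models and approaches leading closed-source models such as Claude-4-Sonnet and GPT-5.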

Link: https://arxiv.org/abs/2511.08195
Authors: Zhen Yang, Wenyi Hong, Mingde Xu, Xinyue Fan, Weihan Wang, Jiele Cheng, Xiaotao Gu, Jie Tang
Affiliation: Tsinghua University; Zhipu AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages

Abstract:User interface (UI) programming is a core yet highly complex part of modern software development. Recent advances in visual language models (VLMs) highlight the potential of automatic UI coding, but current approaches face two key limitations: multimodal coding capabilities remain underdeveloped, and single-turn paradigms make little use of iterative visual feedback. We address these challenges with an interactive UI-to-code paradigm that better reflects real-world workflows and raises the upper bound of achievable performance. Under this paradigm, we present UI2Code^N, a visual language model trained through staged pretraining, fine-tuning, and reinforcement learning to achieve foundational improvements in multimodal coding. The model unifies three key capabilities: UI-to-code generation, UI editing, and UI polishing. We further explore test-time scaling for interactive generation, enabling systematic use of multi-turn feedback. Experiments on UI-to-code and UI polishing benchmarks show that UI2Code^N establishes a new state of the art among open-source models and achieves performance comparable to leading closed-source models such as Claude-4-Sonnet and GPT-5. Our code and models are available at this https URL.

[CV-47] Pixel-level Quality Assessment for Oriented Object Detection

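Quick read: This paper addresses a structural coupling problem in how existing oriented object detectors assess localization quality by predicting box-level Intersection over Union (IoU). Because the predicted box is itself derived from the detector's internal estimate of the ground-truth (GT) box, the predicted IoU can be overestimated for poorly localized boxes, which hurts detection performance. The key to the solution is a Pixel-level Quality Assessment (PQA) framework that replaces box-level IoU prediction with a measure of how consistent each pixel's position relative to the predicted box is with its position relative to the GT box, avoiding the similarity bias introduced by comparing the two boxes directly. A new aggregation metric then integrates the pixel-level spatial consistency into a single quality score that approximates the actual localization accuracy more closely.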

Link: https://arxiv.org/abs/2511.08186
Authors: Yunhui Zhu, Buliao Huang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Modern oriented object detectors typically predict a set of bounding boxes and select the top-ranked ones based on estimated localization quality. Achieving high detection performance requires that the estimated quality closely aligns with the actual localization accuracy. To this end, existing approaches predict the Intersection over Union (IoU) between the predicted and ground-truth (GT) boxes as a proxy for localization quality. However, box-level IoU prediction suffers from a structural coupling issue: since the predicted box is derived from the detector’s internal estimation of the GT box, the predicted IoU, which is based on their similarity, can be overestimated for poorly localized boxes. To overcome this limitation, we propose a novel Pixel-level Quality Assessment (PQA) framework, which replaces box-level IoU prediction with the integration of pixel-level spatial consistency. PQA measures the alignment between each pixel’s relative position to the predicted box and its corresponding position to the GT box. By operating at the pixel level, PQA avoids directly comparing the predicted box with the estimated GT box, thereby eliminating the inherent similarity bias in box-level IoU prediction. Furthermore, we introduce a new integration metric that aggregates pixel-level spatial consistency into a unified quality score, yielding a more accurate approximation of the actual localization quality. Extensive experiments on HRSC2016 and DOTA demonstrate that PQA can be seamlessly integrated into various oriented object detectors, consistently improving performance (e.g., +5.96% AP$_{50:95}$ on Rotated RetinaNet and +2.32% on STD).
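
For readers unfamiliar with the box-level quantity that PQA replaces, here is a minimal IoU sketch. It is deliberately simplified to axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`; oriented boxes, which are the paper's setting, require a polygon intersection instead. The function name is illustrative and this is not the authors' code.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes offset by one unit overlap in a 1x1 square: IoU = 1 / 7.
example_iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

A detector's quality head is trained to predict this number for each predicted box, which is where the coupling issue described in the abstract arises.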

[CV-48] WarpGAN: Warping-Guided 3D GAN Inversion with Style-Based Novel View Inpainting

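Quick read: This paper addresses the low quality of occluded regions in 3D generative adversarial network (3D GAN) inversion. Existing methods focus on high-fidelity reconstruction of visible regions, while occluded regions are generated from the 3D GAN's generative prior alone; the information lost in the low bit-rate latent code leaves those regions blurry and inconsistent across views. The key to the solution is a warping-and-inpainting strategy. A 3D GAN inversion encoder first projects the single-view image into the latent space, and the depth map generated by the 3D GAN is used to warp the image to a novel view. A new network, SVINet, then uses the symmetry prior and the multi-view image correspondence under the same latent code to inpaint the occluded regions of the warped image, markedly improving their realism and multi-view consistency.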

Link: https://arxiv.org/abs/2511.08178
Authors: Kaitao Huang, Yan Yan, Jing-Hao Xue, Hanzi Wang
Affiliation: Xiamen University; University College London
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D GAN inversion projects a single image into the latent space of a pre-trained 3D GAN to achieve single-shot novel view synthesis, which requires visible regions with high fidelity and occluded regions with realism and multi-view consistency. However, existing methods focus on the reconstruction of visible regions, while the generation of occluded regions relies only on the generative prior of 3D GAN. As a result, the generated occluded regions often exhibit poor quality due to the information loss caused by the low bit-rate latent code. To address this, we introduce the warping-and-inpainting strategy to incorporate image inpainting into 3D GAN inversion and propose a novel 3D GAN inversion method, WarpGAN. Specifically, we first employ a 3D GAN inversion encoder to project the single-view image into a latent code that serves as the input to 3D GAN. Then, we perform warping to a novel view using the depth map generated by 3D GAN. Finally, we develop a novel SVINet, which leverages the symmetry prior and multi-view image correspondence w.r.t. the same latent code to perform inpainting of occluded regions in the warped image. Quantitative and qualitative experiments demonstrate that our method consistently outperforms several state-of-the-art methods.

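[CV-49] VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion WACV2026

Quick read: This paper addresses visual anomaly detection in multi-class real-world images, and in particular the limitations of existing diffusion-based methods, which rely on synthetic noise generation, generalize poorly, and require a separate model per class. The key to the solution is combining a pre-trained Vision-Language Model (VLM) with a Latent Diffusion Model (LDM): a simple prompt obtains descriptions of normal images from the VLM without manual annotation, and these descriptions serve as additional conditioning when training the LDM. The model thereby learns a robust representation of normal images and supports multi-class anomaly detection without class-specific training. The strategy improves pixel-level anomaly localization, raising the Per-Region-Overlap (PRO) metric by up to 25 points on Real-IAD and 8 points on COCO-AD.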

Link: https://arxiv.org/abs/2511.08173
Authors: Samet Hicsonmez, Abd El Rahman Shabayek, Djamila Aouada
Affiliation: University of Luxembourg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV 2026

Abstract:Detecting visual anomalies in diverse, multi-class real-world images is a significant challenge. We introduce VLMDiff, a novel unsupervised multi-class visual anomaly detection framework. It integrates a Latent Diffusion Model (LDM) with a Vision-Language Model (VLM) for enhanced anomaly localization and detection. Specifically, a pre-trained VLM with a simple prompt extracts detailed image descriptions, serving as additional conditioning for LDM training. Current diffusion-based methods rely on synthetic noise generation, limiting their generalization and requiring per-class model training, which hinders scalability. VLMDiff, however, leverages VLMs to obtain normal captions without manual annotations or additional training. These descriptions condition the diffusion model, learning a robust normal image feature representation for multi-class anomaly detection. Our method achieves competitive performance, improving the pixel-level Per-Region-Overlap (PRO) metric by up to 25 points on the Real-IAD dataset and 8 points on the COCO-AD dataset, outperforming state-of-the-art diffusion-based approaches. Code is available at this https URL.

[CV-50] Distributed Zero-Shot Learning for Visual Recognition

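Quick read: This paper addresses the performance drop that data heterogeneity causes in Distributed Zero-Shot Learning (DistZSL), where training across multiple distributed nodes makes it hard to learn effective representations for unseen classes. The key to the solution is two components. The first, a cross-node attribute regularizer, constrains the distances between attribute features to be consistent across nodes, which stabilizes the overall attribute feature space and facilitates learning visual-to-attribute (V2A) relationships. The second, a global attribute-to-visual consensus, forces the bilateral mapping between attribute and visual feature distributions to be consistent across nodes, which mitigates the biased V2A mappings learned by individual nodes and yields consistent, effective zero-shot classification across nodes.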

Link: https://arxiv.org/abs/2511.08170
Authors: Zhi Chen, Yadan Luo, Zi Huang, Jingjing Li, Sen Wang, Xin Yu
Affiliation: University of Southern Queensland; The University of Queensland; University of Electronic Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE Transactions on Multimedia in Oct 2025

Abstract:In this paper, we propose a Distributed Zero-Shot Learning (DistZSL) framework that can fully exploit decentralized data to learn an effective model for unseen classes. Considering the data heterogeneity issues across distributed nodes, we introduce two key components to ensure the effective learning of DistZSL: a cross-node attribute regularizer and a global attribute-to-visual consensus. Our proposed cross-node attribute regularizer enforces the distances between attribute features to be similar across different nodes. In this manner, the overall attribute feature space remains stable during learning, which facilitates the establishment of visual-to-attribute (V2A) relationships. Then, we introduce the global attribute-to-visual consensus to mitigate biased V2A mappings learned from individual nodes. Specifically, we enforce the bilateral mapping between the attribute and visual feature distributions to be consistent across different nodes. Thus, the learned consistent V2A mapping can significantly enhance zero-shot learning across different nodes. Extensive experiments demonstrate that DistZSL achieves superior performance to the state-of-the-art in learning from distributed data.

[CV-51] KPLM-STA: Physically-Accurate Shadow Synthesis for Human Relighting via Keypoint-Based Light Modeling

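Quick read: This paper addresses shadow generation in image composition: producing shadows that are both visually realistic and geometrically accurate when a foreground object is blended into a background, especially for complex human poses. Existing diffusion-based methods such as IC-Light outperform GAN-based methods but still fall short in the appearance realism and geometric accuracy of shadows. The key to the solution is a new framework combining a Keypoints Linear Model (KPLM) with a Shadow Triangle Algorithm (STA). KPLM models the human body with nine keypoints and one bounding block, enabling dynamic shading across joints and physically plausible shadow projection, which improves visual realism. STA computes the angle, length, and spatial position of shadows through explicit geometric formulas, which improves geometric accuracy. The method achieves state-of-the-art shadow generation under complex poses and generalizes to multi-directional relighting scenarios.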

Link: https://arxiv.org/abs/2511.08169
Authors: Xinhui Yin, Qifei Li, Yilin Guo, Hongxia Xie, Xiaoli Zhang
Affiliation: 1. Institute of Artificial Intelligence, Chinese Academy of Sciences; 2. University of Chinese Academy of Sciences; 3. Beijing Academy of Artificial Intelligence; 4. Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Image composition aims to seamlessly integrate a foreground object into a background, where generating realistic and geometrically accurate shadows remains a persistent challenge. While recent diffusion-based methods have outperformed GAN-based approaches, existing techniques, such as the diffusion-based relighting framework IC-Light, still fall short in producing shadows with both high appearance realism and geometric precision, especially in composite images. To address these limitations, we propose a novel shadow generation framework based on a Keypoints Linear Model (KPLM) and a Shadow Triangle Algorithm (STA). KPLM models articulated human bodies using nine keypoints and one bounding block, enabling physically plausible shadow projection and dynamic shading across joints, thereby enhancing visual realism. STA further improves geometric accuracy by computing shadow angles, lengths, and spatial positions through explicit geometric formulations. Extensive experiments demonstrate that our method achieves state-of-the-art performance on shadow realism benchmarks, particularly under complex human poses, and generalizes effectively to multi-directional relighting scenarios such as those supported by IC-Light.

[CV-52] Multi-Granularity Mutual Refinement Network for Zero-Shot Learning

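Quick read: This paper addresses a weakness in zero-shot learning (ZSL): methods that ignore the intrinsic interactions between local region features acquire transferable and explicit visual features less effectively. Existing methods strengthen visual-semantic association by aligning global visual features or local region features with semantic information (such as attributes), but they do not fully exploit the interactions between local region features at different granularities. The key to the solution is the Multi-Granularity Mutual Refinement Network (Mg-MRN), built from two core modules. A multi-granularity feature extraction module learns discriminative region-level features through decoupled region feature mining. A cross-granularity feature fusion module strengthens the interactions between region features at different granularity levels, integrating region representations from adjacent hierarchies to make the representation at each level more discriminative and improve ZSL recognition performance.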

Link: https://arxiv.org/abs/2511.08163
Authors: Ning Wang, Long Yu, Cong Hua, Guangming Zhu, Lin Mei, Syed Afaq Ali Shah, Mohammed Bennamoun, Liang Zhang
Affiliation: Xidian University; Donghai Laboratory; Edith Cowan University; University of Western Australia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Zero-shot learning (ZSL) aims to recognize unseen classes with zero samples by transferring semantic knowledge from seen classes. Current approaches typically correlate global visual features with semantic information (i.e., attributes) or align local visual region features with corresponding attributes to enhance visual-semantic interactions. Although effective, these methods often overlook the intrinsic interactions between local region features, which can further improve the acquisition of transferable and explicit visual features. In this paper, we propose a network named Multi-Granularity Mutual Refinement Network (Mg-MRN), which refines discriminative and transferable visual features by learning decoupled multi-granularity features and cross-granularity feature interactions. Specifically, we design a multi-granularity feature extraction module to learn region-level discriminative features through decoupled region feature mining. Then, a cross-granularity feature fusion module strengthens the inherent interactions between region features of varying granularities. This module enhances the discriminability of representations at each granularity level by integrating region representations from adjacent hierarchies, further improving ZSL recognition performance. Extensive experiments on three popular ZSL benchmark datasets demonstrate the superiority and competitiveness of our proposed Mg-MRN method. Our code is available at this https URL.

[CV-53] LandSegmenter: Towards a Flexible Foundation Model for Land Use and Land Cover Mapping

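Quick read: This paper addresses the limited multi-modal adaptability and zero-shot transfer of current Land Use and Land Cover (LULC) classification models, which are typically trained for a specific modality and a fixed class taxonomy and so generalize poorly to unseen scenes or data domains. The key to the solution is the LandSegmenter framework, which makes LULC modeling efficient and general through design choices at three levels. At the input level, it builds the LAS (LAnd Segment) dataset, replacing expensive manual annotation with globally sampled weak labels to cut annotation cost and improve scalability. At the model level, it adds a remote sensing (RS)-specific adapter and a text encoder to strengthen cross-modal feature extraction and semantic awareness. At the output level, it applies a class-wise confidence-guided fusion strategy that mitigates semantic omissions and improves zero-shot performance. Experiments on multiple heterogeneous datasets show strong transfer learning and zero-shot generalization.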

Link: https://arxiv.org/abs/2511.08156
Authors: Chenying Liu, Wei Huang, Xiao Xiang Zhu
Affiliation: Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Land Use and Land Cover (LULC) mapping is a fundamental task in Earth Observation (EO). However, current LULC models are typically developed for a specific modality and a fixed class taxonomy, limiting their generalizability and broader applicability. Recent advances in foundation models (FMs) offer promising opportunities for building universal models. Yet, task-agnostic FMs often require fine-tuning for downstream applications, whereas task-specific FMs rely on massive amounts of labeled data for training, which is costly and impractical in the remote sensing (RS) domain. To address these challenges, we propose LandSegmenter, an LULC FM framework that resolves three-stage challenges at the input, model, and output levels. From the input side, to alleviate the heavy demand on labeled data for FM training, we introduce LAnd Segment (LAS), a large-scale, multi-modal, multi-source dataset built primarily with globally sampled weak labels from existing LULC products. LAS provides a scalable, cost-effective alternative to manual annotation, enabling large-scale FM training across diverse LULC domains. For model architecture, LandSegmenter integrates an RS-specific adapter for cross-modal feature extraction and a text encoder for semantic awareness enhancement. At the output stage, we introduce a class-wise confidence-guided fusion strategy to mitigate semantic omissions and further improve LandSegmenter’s zero-shot performance. We evaluate LandSegmenter on six precisely annotated LULC datasets spanning diverse modalities and class taxonomies. Extensive transfer learning and zero-shot experiments demonstrate that LandSegmenter achieves competitive or superior performance, particularly in zero-shot settings when transferred to unseen datasets. These results highlight the efficacy of our proposed framework and the utility of weak supervision for building task-specific FMs.

[CV-54] Non-Aligned Reference Image Quality Assessment for Novel View Synthesis

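Quick read: This paper addresses perceptual quality assessment of Novel View Synthesis (NVS) images, particularly when no pixel-aligned reference image is available. Traditional Full-Reference Image Quality Assessment (FR-IQA) methods fail because they depend on pixel alignment, while No-Reference (NR-IQA) methods generalize poorly. The authors propose a Non-Aligned Reference IQA (NAR-IQA) framework whose core assumption is that the reference and target views share part of the scene content but are not pixel-aligned. The key elements are a large-scale dataset of synthetic distortions targeting Temporal Regions of Interest (TROI); a model that combines LoRA-enhanced DINOv2 embeddings with contrastive learning and is supervised by existing IQA methods; and training on synthetic distortions only, which avoids overfitting to specific real NVS samples and improves generalization. Experiments show that the method outperforms state-of-the-art FR-IQA, NR-IQA, and NAR-IQA methods with both aligned and non-aligned references and agrees closely with human subjective preferences.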

Link: https://arxiv.org/abs/2511.08155
Authors: Abhijay Ghildyal, Rajesh Sureddi, Nabajeet Barman, Saman Zadtootaghaj, Alan Bovik
Affiliation: Sony Interactive Entertainment; University of Texas at Austin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Evaluating the perceptual quality of Novel View Synthesis (NVS) images remains a key challenge, particularly in the absence of pixel-aligned ground truth references. Full-Reference Image Quality Assessment (FR-IQA) methods fail under misalignment, while No-Reference (NR-IQA) methods struggle with generalization. In this work, we introduce a Non-Aligned Reference (NAR-IQA) framework tailored for NVS, where it is assumed that the reference view shares partial scene content but lacks pixel-level alignment. We constructed a large-scale image dataset containing synthetic distortions targeting Temporal Regions of Interest (TROI) to train our NAR-IQA model. Our model is built on a contrastive learning framework that incorporates LoRA-enhanced DINOv2 embeddings and is guided by supervision from existing IQA methods. We train exclusively on synthetically generated distortions, deliberately avoiding overfitting to specific real NVS samples and thereby enhancing the model’s generalization capability. Our model outperforms state-of-the-art FR-IQA, NR-IQA, and NAR-IQA methods, achieving robust performance on both aligned and non-aligned references. We also conducted a novel user study to gather data on human preferences when viewing non-aligned references in NVS. We find strong correlation between our proposed quality prediction model and the collected subjective ratings. For dataset and code, please visit our project page: this https URL

[CV-55] Boomda: Balanced Multi-objective Optimization for Multimodal Domain Adaptation

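Quick read: This paper addresses the performance drop in Multimodal Domain Adaptation (MDA) caused by heterogeneous domain shifts, where different modalities shift by different amounts between the source and target domains. The key to the solution is a balancing strategy based on multi-objective optimization. The Information Bottleneck (IB) method first learns a representation for each modality independently, and Correlation Alignment (CORAL) then aligns the source and target domains in the representation space. To balance domain alignment across all modalities, the problem is cast as a multi-objective optimization task, simplified to a quadratic programming problem using properties of the model, and further approximated to obtain a closed-form solution. The result is an efficient, modality-balanced algorithm called Boomda (Balanced Multi-objective Optimization for Multimodal Domain Adaptation).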

Link: https://arxiv.org/abs/2511.08152
Authors: Jun Sun, Xinxin Zhang, Simin Hong, Jian Zhu, Xiang Gao
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Multimodal learning, while contributing to numerous success stories across various fields, faces the challenge of prohibitively expensive manual annotation. To address the scarcity of annotated data, a popular solution is unsupervised domain adaptation, which has been extensively studied in unimodal settings yet remains less explored in multimodal settings. In this paper, we investigate heterogeneous multimodal domain adaptation, where the primary challenge is the varying domain shifts of different modalities from the source to the target domain. We first introduce the information bottleneck method to learn representations for each modality independently, and then match the source and target domains in the representation space with correlation alignment. To balance the domain alignment of all modalities, we formulate the problem as a multi-objective task, aiming for a Pareto optimal solution. By exploiting the properties specific to our model, the problem can be simplified to a quadratic programming problem. Further approximation yields a closed-form solution, leading to an efficient modality-balanced multimodal domain adaptation algorithm. The proposed method features Balanced multi-objective optimization for multimodal domain adaptation, termed Boomda. Extensive empirical results showcase the effectiveness of the proposed approach and demonstrate that Boomda outperforms the competing schemes. The code is available at: this https URL.
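
The correlation alignment step named in the abstract can be illustrated with the CORAL loss as it is commonly defined in the domain adaptation literature: the squared Frobenius norm of the difference between the source and target feature covariance matrices, scaled by 1/(4d²) for d-dimensional features. The sketch below is a plain-Python illustration of that general definition, not the paper's implementation; the exact per-modality objective and the balancing weights in Boomda may differ.

```python
def covariance(rows):
    """Sample covariance matrix (d x d) of a list of d-dimensional feature rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def coral_loss(source, target):
    """Squared Frobenius distance between covariances, scaled by 1 / (4 d^2)."""
    d = len(source[0])
    cs, ct = covariance(source), covariance(target)
    sq_frob = sum((cs[i][j] - ct[i][j]) ** 2 for i in range(d) for j in range(d))
    return sq_frob / (4 * d * d)


# Toy features: the two domains have opposite-sign correlation between dimensions.
source_feats = [[0.0, 0.0], [2.0, 2.0]]
target_feats = [[0.0, 0.0], [2.0, -2.0]]
loss = coral_loss(source_feats, target_feats)
```

In a multimodal setting one such loss would be computed per modality, which is what gives rise to the multiple objectives the abstract then balances.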

[CV-56] PEOD: A Pixel-Aligned Event-RGB Benchmark for Object Detection under Challenging Conditions

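Quick read: This paper addresses the insufficient evaluation of event-camera and RGB fusion detectors in extreme scenes. Existing Event-RGB datasets cover extreme conditions sparsely and have low spatial resolution (≤ 640×480), so they cannot fully validate detectors in challenging scenarios. The key to the solution is PEOD, the first large-scale, pixel-aligned, high-resolution (1280×720) Event-RGB dataset, with more than 130 spatiotemporally aligned sequences and 340k manually annotated bounding boxes, 57% of the data captured under challenging conditions such as low light, overexposure, and high-speed motion. Benchmarking 14 methods on PEOD across three input configurations (event only, RGB only, and Event-RGB fusion) shows that fusion models perform well in normal scenes but the top event-based model does better under extreme illumination. This exposes the limits of existing fusion strategies when one modality is severely degraded and provides a more realistic, higher-quality benchmark for multimodal perception.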

Link: https://arxiv.org/abs/2511.08140
Authors: Luoping Cui, Hanqing Liu, Mingjie Liu, Endian Lin, Donghong Jiang, Yuhao Wang, Chuang Zhu
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Robust object detection for challenging scenarios increasingly relies on event cameras, yet existing Event-RGB datasets remain constrained by sparse coverage of extreme conditions and low spatial resolution (≤ 640 x 480), which prevents comprehensive evaluation of detectors under challenging scenarios. To address these limitations, we propose PEOD, the first large-scale, pixel-aligned and high-resolution (1280 x 720) Event-RGB dataset for object detection under challenging conditions. PEOD contains 130+ spatiotemporal-aligned sequences and 340k manual bounding boxes, with 57% of data captured under low-light, overexposure, and high-speed motion. Furthermore, we benchmark 14 methods across three input configurations (Event-based, RGB-based, and Event-RGB fusion) on PEOD. On the full test set and normal subset, fusion-based models achieve excellent performance. However, on the illumination challenge subset, the top event-based model outperforms all fusion models, while fusion models still outperform their RGB-based counterparts, indicating limits of existing fusion methods when the frame modality is severely degraded. PEOD establishes a realistic, high-quality benchmark for multimodal perception and facilitates future research.

[CV-57] OTSNet: A Neurocognitive-Inspired Observation-Thinking-Spelling Pipeline for Scene Text Recognition

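Quick read: This paper targets the loss of accuracy in Scene Text Recognition (STR) caused by real-world complexity, and in particular the error propagation that arises when existing frameworks optimize the visual and linguistic modalities separately. The core difficulty is that visual encoders show attention bias toward background distractors while decoders suffer spatial misalignment on geometrically deformed text, which sharply reduces accuracy on irregular text. The key to the solution is OTSNet, a three-stage network inspired by human visual cognition, with three core components: (1) a Dual Attention Macaron Encoder (DAME) that uses differential attention maps to suppress irrelevant regions and sharpen discriminative focus; (2) a Position-Aware Module (PAM) and Semantic Quantizer (SQ) that jointly integrate spatial context with glyph-level semantic abstraction via adaptive sampling; and (3) a Multi-Modal Collaborative Verifier (MMCV) that enforces self-correction through cross-modal fusion of visual, semantic, and character-level features. The architecture mitigates cross-modal misalignment and achieves state-of-the-art results on benchmarks including Union14M-L and OST.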

Link: https://arxiv.org/abs/2511.08133
Authors: Lixu Sun, Nurmemet Yolwas, Wushour Silamu
Affiliation: Xinjiang University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Scene Text Recognition (STR) remains challenging due to real-world complexities, where decoupled visual-linguistic optimization in existing frameworks amplifies error propagation through cross-modal misalignment. Visual encoders exhibit attention bias toward background distractors, while decoders suffer from spatial misalignment when parsing geometrically deformed text, collectively degrading recognition accuracy for irregular patterns. Inspired by the hierarchical cognitive processes in human visual perception, we propose OTSNet, a novel three-stage network embodying a neurocognitive-inspired Observation-Thinking-Spelling pipeline for unified STR modeling. The architecture comprises three core components: (1) a Dual Attention Macaron Encoder (DAME) that refines visual features through differential attention maps to suppress irrelevant regions and enhance discriminative focus; (2) a Position-Aware Module (PAM) and Semantic Quantizer (SQ) that jointly integrate spatial context with glyph-level semantic abstraction via adaptive sampling; and (3) a Multi-Modal Collaborative Verifier (MMCV) that enforces self-correction through cross-modal fusion of visual, semantic, and character-level features. Extensive experiments demonstrate that OTSNet achieves state-of-the-art performance, attaining 83.5% average accuracy on the challenging Union14M-L benchmark and 79.1% on the heavily occluded OST dataset, establishing new records across 9 out of 14 evaluation scenarios.

[CV-58] Foam Segmentation in Wastewater Treatment Plants: A Federated Learning Approach with Segment Anything Model 2

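Quick read: This paper addresses foam formation in Wastewater Treatment Plants (WTPs), which reduces treatment efficiency and raises operating costs. Conventional Machine Learning (ML) models need large amounts of labeled data, but such data are scarce and heterogeneous, and plants are reluctant to share data because of privacy concerns, so model development is slow. The key to the solution is combining Federated Learning (FL) with Segment Anything Model 2 (SAM2), a state-of-the-art foundation model for image segmentation. FL lets multiple WTPs train a model collaboratively without centralizing sensitive operational data, preserving privacy, while initializing local models from SAM2's pre-trained weights speeds up convergence and improves segmentation even with limited local data. SAM2 is fine-tuned on edge nodes using the Flower framework, and a central Fog server aggregates model parameters without accessing private data, giving a scalable, privacy-aware approach to automatic foam detection.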

Link: https://arxiv.org/abs/2511.08130
Authors: Mehmet Batuhan Duman, Alejandro Carnero, Cristian Martín, Daniel Garrido, Manuel Díaz
Affiliation: Universidad de Málaga
Categories: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 36 pages, 14 figures, 3 tables, 4 algorithms. This work is part of the Zerovision project. Code available at: this https URL

Abstract:Foam formation in Wastewater Treatment Plants (WTPs) is a major challenge that can reduce treatment efficiency and increase costs. The ability to automatically examine changes in real-time with respect to the percentage of foam can be of great benefit to the plant. However, large amounts of labeled data are required to train standard Machine Learning (ML) models. The development of these systems is slow due to the scarcity and heterogeneity of labeled data. Additionally, the development is often hindered by the fact that different WTPs do not share their data due to privacy concerns. This paper proposes a new framework to address these challenges by combining Federated Learning (FL) with the state-of-the-art base model for image segmentation, Segment Anything Model 2 (SAM2). The FL paradigm enables collaborative model training across multiple WTPs without centralizing sensitive operational data, thereby ensuring privacy. The framework accelerates training convergence and improves segmentation performance even with limited local datasets by leveraging SAM2’s strong pre-trained weights for initialization. The methodology involves fine-tuning SAM2 on distributed clients (edge nodes) using the Flower framework, where a central Fog server orchestrates the process by aggregating model weights without accessing private data. The model was trained and validated using various data collections, including real-world images captured at a WTP in Granada, Spain, a synthetically generated foam dataset, and images from publicly available datasets to improve generalization. This research offers a practical, scalable, and privacy-aware solution for automatic foam tracking in WTPs. The findings highlight the significant potential of integrating large-scale foundational models into FL systems to solve real-world industrial challenges characterized by distributed and sensitive data.
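
The server-side aggregation described in the abstract can be sketched as a weighted average of client weights. The abstract does not say which aggregation rule is used, so the sketch below assumes the widely used FedAvg rule (weights proportional to each client's local dataset size). It represents a model as a flat list of floats and does not use the Flower API; `federated_average` is an illustrative name.

```python
def federated_average(client_weights, client_sizes):
    """Average per-client weight vectors, weighted by local dataset size.

    client_weights: list of equal-length lists of floats, one per client.
    client_sizes: number of local training samples per client (assumed known).
    """
    total = sum(client_sizes)
    length = len(client_weights[0])
    return [
        sum(w[k] * n for w, n in zip(client_weights, client_sizes)) / total
        for k in range(length)
    ]


# Two plants: the second holds three times as much data, so it dominates the average.
plant_a = [1.0, 2.0]
plant_b = [3.0, 6.0]
global_weights = federated_average([plant_a, plant_b], [1, 3])
```

Only the weight vectors and sample counts reach the server, which is why the raw plant images never need to leave the edge nodes.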

[CV-59] LatentPrintFormer: A Hybrid CNN-Transformer with Spatial Attention for Latent Fingerprint identification

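Quick read: This paper addresses the difficulty of latent fingerprint identification caused by low image quality, background noise, and partial impressions. The key to the solution is a new model, LatentPrintFormer, which combines a CNN backbone (EfficientNet-B0) with a Transformer backbone (Swin Tiny) to extract both local and global fingerprint features. A spatial attention module emphasizes high-quality ridge regions and suppresses background noise. The fused features are projected into a unified 512-dimensional embedding, and matching uses cosine similarity in a closed-set identification setting, which improves identification performance.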

链接: https://arxiv.org/abs/2511.08119
作者: Arnab Maity,Manasa,Pavan Kumar C,Raghavendra Ramachandra
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in CVIP 2025

Abstract:Latent fingerprint identification remains a challenging task due to low image quality, background noise, and partial impressions. In this work, we propose a novel identification approach called LatentPrintFormer. The proposed model integrates a CNN backbone (EfficientNet-B0) and a Transformer backbone (Swin Tiny) to extract both local and global features from latent fingerprints. A spatial attention module is employed to emphasize high-quality ridge regions while suppressing background noise. The extracted features are fused and projected into a unified 512-dimensional embedding, and matching is performed using cosine similarity in a closed-set identification setting. Extensive experiments on two publicly available datasets demonstrate that LatentPrintFormer consistently outperforms three state-of-the-art latent fingerprint recognition techniques, achieving higher identification rates across Rank-10.
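
The matching step described in the abstract (cosine similarity over 512-dimensional embeddings in a closed-set setting) can be sketched as follows. This is only an illustration: the embeddings are random stand-ins rather than outputs of LatentPrintFormer, and `rank_k_rate` and the noise level are hypothetical choices.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def rank_k_rate(probe, gallery, probe_ids, gallery_ids, k=10):
    """Fraction of probes whose true identity is among the k most similar gallery entries."""
    sims = l2_normalize(probe) @ l2_normalize(gallery).T
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = (gallery_ids[top_k] == probe_ids[:, None]).any(axis=1)
    return float(hits.mean())

rng = np.random.default_rng(0)
gallery = rng.normal(size=(50, 512))        # stand-in for enrolled 512-d embeddings
gallery_ids = np.arange(50)
probe_ids = np.array([3, 17, 42])
# Probes are noisy copies of their gallery entries (a stand-in for latent prints).
probe = gallery[probe_ids] + 0.1 * rng.normal(size=(3, 512))

rate = rank_k_rate(probe, gallery, probe_ids, gallery_ids, k=1)
```

In a closed-set setting every probe identity is assumed to be enrolled in the gallery, so the rate is well defined for every probe.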

[CV-60] Introducing Nylon Face Mask Attacks: A Dataset for Evaluating Generalised Face Presentation Attack Detection

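Quick read: This paper addresses the vulnerability of face recognition systems to new, realistic presentation attacks (PAs), in particular 3D spoofing attacks using Nylon Face Masks (NFMs), which combine an elastic structure with a high-fidelity appearance. The key to its solution is a new large-scale attack dataset collected under realistic conditions, containing 3,760 bona fide samples and 51,281 NFM attack samples across four presentation scenarios (involving both humans and mannequins), captured with an iPhone 11 Pro to reflect smartphone usage. Benchmarking five state-of-the-art presentation attack detection (PAD) methods on this dataset shows how differently they generalise under unseen attack conditions, exposing the limitations of current PAD techniques and motivating more robust anti-spoofing mechanisms.

Link: https://arxiv.org/abs/2511.08114
Authors: Manasa, Sushrut Patwardhan, Narayan Vetrekar, Pavan Kumar, R. S. Gad, Raghavendra Ramachandra
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
Comments: Accepted in Proc. of International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA 2026)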

Abstract:Face recognition systems are increasingly deployed across a wide range of applications, including smartphone authentication, access control, and border security. However, these systems remain vulnerable to presentation attacks (PAs), which can significantly compromise their reliability. In this work, we introduce a new dataset focused on a novel and realistic presentation attack instrument called Nylon Face Masks (NFMs), designed to simulate advanced 3D spoofing scenarios. NFMs are particularly concerning due to their elastic structure and photorealistic appearance, which enable them to closely mimic the victim’s facial geometry when worn by an attacker. To reflect real-world smartphone-based usage conditions, we collected the dataset using an iPhone 11 Pro, capturing 3,760 bona fide samples from 100 subjects and 51,281 NFM attack samples across four distinct presentation scenarios involving both humans and mannequins. We benchmark the dataset using five state-of-the-art PAD methods to evaluate their robustness under unseen attack conditions. The results demonstrate significant performance variability across methods, highlighting the challenges posed by NFMs and underscoring the importance of developing PAD techniques that generalise effectively to emerging spoofing threats.

[CV-61] StableMorph: High-Quality Face Morph Generation with Stable Diffusion

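Quick read: This paper addresses the lack of high-quality, realistic, hard-to-detect morphed images for evaluating face morphing attack detection systems. Images produced by existing methods are often blurry, contain artifacts, or have structural flaws, which makes them easy to detect and unrepresentative of real threats. The key to its solution is StableMorph, a face morph generation method based on diffusion-based image synthesis that produces artifact-free, sharply detailed full-head morphs with fine-grained control over visual attributes. Experiments show that StableMorph images rival or exceed genuine face images in quality while retaining a strong ability to fool face recognition systems, making the attacks more realistic and the evaluation of detection systems more meaningful, and setting a new standard for biometric security research and operational testing.

Link: https://arxiv.org/abs/2511.08090
Authors: Wassim Kabbani, Kiran Raja, Raghavendra Ramachandra, Christoph Busch
Affiliations: Norwegian University of Science and Technology (挪威科技大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: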

Abstract:Face morphing attacks threaten the integrity of biometric identity systems by enabling multiple individuals to share a single identity. To develop and evaluate effective morphing attack detection (MAD) systems, we need access to high-quality, realistic morphed images that reflect the challenges posed in real-world scenarios. However, existing morph generation methods often produce images that are blurry, riddled with artifacts, or poorly constructed, making them easy to detect and not representative of the most dangerous attacks. In this work, we introduce StableMorph, a novel approach that generates highly realistic, artifact-free morphed face images using modern diffusion-based image synthesis. Unlike prior methods, StableMorph produces full-head images with sharp details, avoids common visual flaws, and offers unmatched control over visual attributes. Through extensive evaluation, we show that StableMorph images not only rival or exceed the quality of genuine face images but also maintain a strong ability to fool face recognition systems, posing a greater challenge to existing MAD solutions and setting a new standard for morph quality in research and operational testing. StableMorph improves the evaluation of biometric security by creating more realistic and effective attacks and supports the development of more robust detection systems.

[CV-62] Beyond the Pixels: VLM-based Evaluation of Identity Preservation in Reference-Guided Synthesis

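Quick read: This paper addresses the difficulty of evaluating identity preservation in generative models. Existing metrics rely on global embeddings or coarse vision-language model (VLM) prompting, so they miss fine-grained identity changes and offer limited diagnostic insight. The key to its solution is Beyond the Pixels, a hierarchical evaluation framework that decomposes identity assessment into feature-level transformations and guides the VLM along a structured reasoning path: first, the subject is decomposed hierarchically into a (type, style)-attribute-feature decision tree; second, the model is asked to output concrete transformation descriptions rather than abstract similarity scores. This grounds the VLM's analysis in verifiable visual evidence, reducing hallucinations and improving consistency. The framework is validated on four state-of-the-art generative models, showing strong agreement with human judgments, and the paper introduces a new benchmark designed for stress testing.

Link: https://arxiv.org/abs/2511.08087
Authors: Aditi Singhania, Krutik Malani, Riddhi Dhawan, Arushi Jain, Garv Tandon, Nippun Sharma, Souymodip Chakraborty, Vineet Batra, Ankit Phogat
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: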

Abstract:Evaluating identity preservation in generative models remains a critical yet unresolved challenge. Existing metrics rely on global embeddings or coarse VLM prompting, failing to capture fine-grained identity changes and providing limited diagnostic insight. We introduce Beyond the Pixels, a hierarchical evaluation framework that decomposes identity assessment into feature-level transformations. Our approach guides VLMs through structured reasoning by (1) hierarchically decomposing subjects into a (type, style)-attribute-feature decision tree, and (2) prompting for concrete transformations rather than abstract similarity scores. This decomposition grounds VLM analysis in verifiable visual evidence, reducing hallucinations and improving consistency. We validate our framework across four state-of-the-art generative models, demonstrating strong alignment with human judgments in measuring identity consistency. Additionally, we introduce a new benchmark specifically designed to stress-test generative models. It comprises 1,078 image-prompt pairs spanning diverse subject types, including underrepresented categories such as anthropomorphic and animated characters, and captures an average of six to seven transformation axes per prompt.

[CV-63] CLIP is All You Need for Human-like Semantic Representations in Stable Diffusion

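Quick read: This paper asks whether generative AI models such as Stable Diffusion hold semantic representations that are meaningful to humans during text-to-image generation. Its core finding is that the semantic information the model relies on comes mainly from the text encoder of the pretrained CLIP (Contrastive Language-Image Pretraining) model rather than from the diffusion process itself. Specifically, probing Stable Diffusion's latent space with simple regression layers shows that the accuracy of predicting objects' semantic attributes depends heavily on the semantic embeddings extracted by CLIP, that decodability differs markedly between attribute groups, and that attributes become harder to tell apart during the reverse diffusion process. This further supports the conclusion that CLIP is the main source of the semantic representation, while the diffusion mechanism serves only for visual reconstruction.

Link: https://arxiv.org/abs/2511.08075
Authors: Cameron Braunstein, Mariya Toneva, Eddy Ilg
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 28 pages, 8 figures, 2 tables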

Abstract:Latent diffusion models such as Stable Diffusion achieve state-of-the-art results on text-to-image generation tasks. However, the extent to which these models have a semantic understanding of the images they generate is not well understood. In this work, we investigate whether the internal representations used by these models during text-to-image generation contain semantic information that is meaningful to humans. To do so, we perform probing on Stable Diffusion with simple regression layers that predict semantic attributes for objects and evaluate these predictions against human annotations. Surprisingly, we find that this success can actually be attributed to the text encoding occurring in CLIP rather than the reverse diffusion process. We demonstrate that groups of specific semantic attributes have markedly different decoding accuracy than the average, and are thus represented to different degrees. Finally, we show that attributes become more difficult to disambiguate from one another during the inverse diffusion process, further demonstrating the strongest semantic representation of object attributes in CLIP. We conclude that the separately trained CLIP vision-language model is what determines the human-like semantic representation, and that the diffusion process instead takes the role of a visual decoder.
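
Probing with "simple regression layers" can be pictured as fitting a linear map from frozen internal features to attribute ratings and checking how well it predicts held-out items. The sketch below assumes a closed-form ridge regression as the probe and uses synthetic features and ratings; the paper's actual probe, features, and annotations are not specified here, and `fit_ridge_probe` is a hypothetical name.

```python
import numpy as np

def fit_ridge_probe(features, targets, l2=1e-3):
    """Closed-form ridge regression: W = (X^T X + l2 * I)^{-1} X^T Y."""
    d = features.shape[1]
    gram = features.T @ features + l2 * np.eye(d)
    return np.linalg.solve(gram, features.T @ targets)

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 16))   # stand-in for frozen model representations
true_w = rng.normal(size=(16, 3))
ratings = features @ true_w             # stand-in for human attribute annotations

# Fit on 150 items, evaluate on the remaining 50.
w = fit_ridge_probe(features[:150], ratings[:150])
pred = features[150:] @ w
mse = float(np.mean((pred - ratings[150:]) ** 2))
```

Because the probe is linear and the features stay frozen, a low held-out error suggests the attribute is already linearly decodable from the representation rather than learned by the probe.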

[CV-64] Radar-APLANC: Unsupervised Radar-based Heartbeat Sensing via Augmented Pseudo-Label and Noise Contrast AAAI2026

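Quick read: This paper addresses two problems: traditional radar-based heartbeat sensing methods degrade in noisy environments, and learning-based methods depend on expensive labeled data, which makes training costly. The key to its solution is Radar-APLANC, the first unsupervised framework for the task. It uses the heartbeat range and the noise range within the radar range matrix to build positive and negative samples, respectively, and designs a Noise-Contrastive Triplet (NCT) loss that relies only on positive samples, negative samples, and pseudo-label signals produced by a traditional radar method, removing the dependence on expensive ground-truth physiological signals. A pseudo-label augmentation strategy with adaptive noise-aware selection further improves pseudo-label quality. On the Equipleth dataset and a self-collected radar dataset, the method performs comparably to state-of-the-art supervised methods.

Link: https://arxiv.org/abs/2511.08071
Authors: Ying Wang, Zhaodong Sun, Xu Cheng, Zuxian He, Xiaobai Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Signal Processing (eess.SP)
Comments: Accepted by AAAI 2026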

Abstract:Frequency Modulated Continuous Wave (FMCW) radars can measure subtle chest wall oscillations to enable non-contact heartbeat sensing. However, traditional radar-based heartbeat sensing methods face performance degradation due to noise. Learning-based radar methods achieve better noise robustness but require costly labeled signals for supervised training. To overcome these limitations, we propose the first unsupervised framework for radar-based heartbeat sensing via Augmented Pseudo-Label and Noise Contrast (Radar-APLANC). We propose to use both the heartbeat range and noise range within the radar range matrix to construct the positive and negative samples, respectively, for improved noise robustness. Our Noise-Contrastive Triplet (NCT) loss only utilizes positive samples, negative samples, and pseudo-label signals generated by the traditional radar method, thereby avoiding dependence on expensive ground-truth physiological signals. We further design a pseudo-label augmentation approach featuring adaptive noise-aware label selection to improve pseudo-label signal quality. Extensive experiments on the Equipleth dataset and our collected radar dataset demonstrate that our unsupervised method achieves performance comparable to state-of-the-art supervised methods. Our code, dataset, and supplementary materials can be accessed from this https URL.

[CV-65] I2E: Real-Time Image-to-Event Conversion for High-Performance Spiking Neural Networks AAAI-26

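Quick read: This paper addresses the training bottleneck that spiking neural networks (SNNs) face in practice because event-stream data is scarce. The key to its solution is I2E, an algorithmic framework that simulates microsaccadic eye movements with a highly parallelized convolution to convert static images into high-fidelity event streams more than 300x faster than prior methods, enabling on-the-fly data augmentation for SNN training. An SNN trained on the generated I2E-ImageNet dataset reaches 60.50% accuracy, and a sim-to-real paradigm achieves an unprecedented 92.5% accuracy on CIFAR10-DVS, indicating that synthetic event data can serve as an effective proxy for real sensor data and providing a scalable basis for data generation in neuromorphic systems.

Link: https://arxiv.org/abs/2511.08065
Authors: Ruichen Ma, Liwei Meng, Guanchao Qiao, Ning Ning, Yang Liu, Shaogang Hu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI-26 Oral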

Abstract:Spiking neural networks (SNNs) promise highly energy-efficient computing, but their adoption is hindered by a critical scarcity of event-stream data. This work introduces I2E, an algorithmic framework that resolves this bottleneck by converting static images into high-fidelity event streams. By simulating microsaccadic eye movements with a highly parallelized convolution, I2E achieves a conversion speed over 300x faster than prior methods, uniquely enabling on-the-fly data augmentation for SNN training. The framework’s effectiveness is demonstrated on large-scale benchmarks. An SNN trained on the generated I2E-ImageNet dataset achieves a state-of-the-art accuracy of 60.50%. Critically, this work establishes a powerful sim-to-real paradigm where pre-training on synthetic I2E data and fine-tuning on the real-world CIFAR10-DVS dataset yields an unprecedented accuracy of 92.5%. This result validates that synthetic event data can serve as a high-fidelity proxy for real sensor data, bridging a long-standing gap in neuromorphic engineering. By providing a scalable solution to the data problem, I2E offers a foundational toolkit for developing high-performance neuromorphic systems. The open-source algorithm and all generated datasets are provided to accelerate research in the field.
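
The general idea of turning a static image into events by simulating small eye movements can be shown with a toy sketch. This is not the I2E algorithm: the abstract mentions a parallelized convolution, whereas the version below simply shifts the image with `np.roll` (which wraps around at the borders) and thresholds the brightness change. The shift sequence, the threshold, and `image_to_events` are all illustrative assumptions.

```python
import numpy as np

def image_to_events(image, shifts, threshold=0.1):
    """Return a (T, 2, H, W) array of binary ON/OFF event maps, one pair per shift."""
    image = image.astype(np.float64)
    prev = image
    events = []
    for dy, dx in shifts:
        moved = np.roll(image, shift=(dy, dx), axis=(0, 1))  # simulated eye movement
        diff = moved - prev                                   # brightness change per pixel
        on = diff > threshold                                 # brightness increased
        off = diff < -threshold                               # brightness decreased
        events.append(np.stack([on, off]))
        prev = moved
    return np.stack(events).astype(np.uint8)

# A bright 4x4 square on a dark 8x8 background.
image = np.zeros((8, 8))
image[2:6, 2:6] = 1.0

events = image_to_events(image, shifts=[(0, 1), (1, 1), (1, 0)])
print(events.shape)  # (3, 2, 8, 8)
```

Events appear only along edges that move, which is why a one-pixel horizontal shift of the square produces one column of ON events and one column of OFF events.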

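[CV-66] Taming Identity Consistency and Prompt Diversity in Diffusion Models via Latent Concatenation and Masked Conditional Flow Matching

Quick read: This paper addresses the core challenge of subject-driven image generation: generating images that adapt to diverse prompts while keeping the identity features of a specific subject consistent. The task involves a fundamental trade-off between identity consistency and prompt diversity. The key to its solution is a LoRA fine-tuned diffusion model that combines a latent concatenation strategy with a masked Conditional Flow Matching (CFM) objective, achieving robust identity preservation without architectural modifications. The paper also introduces a two-stage Distilled Data Curation Framework, which builds a high-quality seed dataset and then applies parameter-efficient fine-tuning to extend generation across subjects and contexts, and CHARIS, a fine-grained evaluation framework for attribute-level quality assessment.

Link: https://arxiv.org/abs/2511.08061
Authors: Aditi Singhania, Arushi Jain, Krutik Malani, Riddhi Dhawan, Souymodip Chakraborty, Vineet Batra, Ankit Phogat
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: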

Abstract:Subject-driven image generation aims to synthesize novel depictions of a specific subject across diverse contexts while preserving its core identity features. Achieving both strong identity consistency and high prompt diversity presents a fundamental trade-off. We propose a LoRA fine-tuned diffusion model employing a latent concatenation strategy, which jointly processes reference and target images, combined with a masked Conditional Flow Matching (CFM) objective. This approach enables robust identity preservation without architectural modifications. To facilitate large-scale training, we introduce a two-stage Distilled Data Curation Framework: the first stage leverages data restoration and VLM-based filtering to create a compact, high-quality seed dataset from diverse sources; the second stage utilizes these curated examples for parameter-efficient fine-tuning, thus scaling the generation capability across various subjects and contexts. Finally, for filtering and quality assessment, we present CHARIS, a fine-grained evaluation framework that performs attribute-level comparisons along five key axes: identity consistency, prompt adherence, region-wise color fidelity, visual quality, and transformation diversity.
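
A masked flow matching objective over concatenated latents can be sketched roughly as below. The abstract does not give the exact formulation, so this assumes the common linear interpolation path x_t = (1 - t) * x0 + t * x1 with target velocity x1 - x0, noise at t = 0, and a mask that restricts the loss to the target half of the concatenated latent. The function name, shapes, and the way the reference half is noised are all assumptions.

```python
import numpy as np

def masked_cfm_loss(predict_velocity, x0, x1, t, mask):
    """Mean squared velocity error, averaged only over positions where mask == 1."""
    x_t = (1.0 - t) * x0 + t * x1          # point on the straight path from noise to data
    target = x1 - x0                       # velocity of that path
    err = (predict_velocity(x_t, t) - target) ** 2
    return float((err * mask).sum() / np.maximum(mask.sum(), 1.0))

rng = np.random.default_rng(0)
reference = rng.normal(size=(4, 8))        # stand-in for reference-image latents
target_img = rng.normal(size=(4, 8))       # stand-in for target-image latents
x1 = np.concatenate([reference, target_img], axis=1)   # latent concatenation
x0 = rng.normal(size=x1.shape)                          # noise sample
mask = np.concatenate([np.zeros((4, 8)), np.ones((4, 8))], axis=1)

oracle = lambda x_t, t: x1 - x0            # a perfect velocity predictor
zero = lambda x_t, t: np.zeros_like(x_t)   # a predictor that always outputs zero

loss_oracle = masked_cfm_loss(oracle, x0, x1, 0.3, mask)
loss_zero = masked_cfm_loss(zero, x0, x1, 0.3, mask)
```

In this sketch the reference half still enters the model input but contributes nothing to the loss, so gradient signal comes only from reconstructing the target region.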

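[CV-67] Re²MaP: Macro Placement by Recursively Prototyping and Packing Tree-based Relocating

Quick read: This paper addresses macro placement optimization in large-scale integrated circuit design, in particular improving global placement quality while maintaining routing quality and meeting timing constraints. Existing methods often struggle to balance macro connectivity, dataflow distribution, and physical constraints, leading to severe timing violations (such as negative slack). The key to its solution is Re²MaP, which performs hierarchical optimization through recursive prototyping and packing tree-based relocating. It first uses multi-level macro grouping and PPA (power, performance, area)-aware cell clustering to produce a unified connection matrix; it then uses DREAMPlace to build a mixed-size prototype and obtain reference positions; next, ABPlace optimizes macro positions on an ellipse so that macros are distributed uniformly near the chip periphery while accounting for wirelength and dataflow; finally, a packing tree-based relocating procedure jointly adjusts the positions of macro groups and the macros within them, using an expertise-inspired cost function optimized by evolutionary search to capture various design constraints. The process is iterative, placing only a subset of macro groups in each round to improve prototype accuracy step by step, and it significantly outperforms the state-of-the-art academic placer Hier-RTLMP on key metrics such as WNS and TNS.

Link: https://arxiv.org/abs/2511.08054
Authors: Yunqi Shi, Xi Lin, Zhiang Wang, Siyuan Xu, Shixiong Kai, Yao Lai, Chengrui Gao, Ke Xue, Mingxuan Yuan, Chao Qian, Zhi-Hua Zhou
Affiliations: Nanjing University (南京大学); Fudan University (复旦大学); Huawei Noah’s Ark Lab (华为诺亚方舟实验室); the University of Hong Kong (香港大学)
Categories: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: IEEE Transactions on Computer-Aided Design under review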

Abstract:This work introduces the Re²MaP method, which generates expert-quality macro placements through recursively prototyping and packing tree-based relocating. We first perform multi-level macro grouping and PPA-aware cell clustering to produce a unified connection matrix that captures both wirelength and dataflow among macros and clusters. Next, we use DREAMPlace to build a mixed-size placement prototype and obtain reference positions for each macro and cluster. Based on this prototype, we introduce ABPlace, an angle-based analytical method that optimizes macro positions on an ellipse to distribute macros uniformly near chip periphery, while optimizing wirelength and dataflow. A packing tree-based relocating procedure is then designed to jointly adjust the locations of macro groups and the macros within each group, by optimizing an expertise-inspired cost function that captures various design constraints through evolutionary search. Re²MaP repeats the above process: Only a subset of macro groups are positioned in each iteration, and the remaining macros are deferred to the next iteration to improve the prototype’s accuracy. Using a well-established backend flow with sufficient timing optimizations, Re²MaP achieves up to 22.22% (average 10.26%) improvement in worst negative slack (WNS) and up to 97.91% (average 33.97%) improvement in total negative slack (TNS) compared to the state-of-the-art academic placer Hier-RTLMP. It also ranks higher on WNS, TNS, power, design rule check (DRC) violations, and runtime than the conference version ReMaP, across seven tested cases. Our code is available at this https URL.
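
For readers unfamiliar with the timing metrics quoted above, the sketch below shows how WNS and TNS are commonly derived from per-endpoint slack values. The slack numbers are invented for illustration, and the convention of reporting WNS as 0 when no endpoint is violating is an assumption; timing tools differ on this point.

```python
def worst_negative_slack(slacks):
    """Most negative endpoint slack, or 0.0 if no endpoint violates timing (assumed convention)."""
    return min(min(slacks), 0.0)

def total_negative_slack(slacks):
    """Sum of all negative endpoint slacks; non-violating endpoints contribute nothing."""
    return sum(s for s in slacks if s < 0.0)

# Hypothetical endpoint slacks in nanoseconds (negative = timing violation).
slacks_ns = [-0.30, 0.12, -0.05, 0.40]

wns = worst_negative_slack(slacks_ns)   # -0.30
tns = total_negative_slack(slacks_ns)   # about -0.35
```

WNS reflects the single worst path, while TNS reflects how widespread the violations are, which is why the two can improve by very different percentages.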

[CV-68] Generalized-Scale Object Counting with Gradual Query Aggregation AAAI2026

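Quick read: This paper addresses the poor performance of existing few-shot counting methods on images that contain objects of diverse sizes and densely populated regions of small objects. The key to its solution is GECO2, an end-to-end few-shot counting and detection method that introduces a new dense query representation. It gradually aggregates exemplar-specific feature information across scales to produce high-resolution dense queries that support detection of both large and small objects. GECO2 improves counting and detection accuracy by 10% over the previous state of the art while running 3x faster with a smaller GPU memory footprint.

Link: https://arxiv.org/abs/2511.08048
Authors: Jer Pelhan, Alan Lukezic, Matej Kristan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI2026, code: this https URL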

Abstract:Few-shot detection-based counters estimate the number of instances in the image specified only by a few test-time exemplars. A common approach to localize objects across multiple sizes is to merge backbone features of different resolutions. Furthermore, to enable small object detection in densely populated regions, the input image is commonly upsampled and tiling is applied to cope with the increased computational and memory requirements. Because of these ad-hoc solutions, existing counters struggle with images containing diverse-sized objects and densely populated regions of small objects. We propose GECO2, an end-to-end few-shot counting and detection method that explicitly addresses the object scale issues. A new dense query representation gradually aggregates exemplar-specific feature information across scales, which leads to high-resolution dense queries that enable detection of large as well as small objects. GECO2 surpasses state-of-the-art few-shot counters in counting as well as detection accuracy by 10% while running 3x faster with a smaller GPU memory footprint.

[CV-69] ProSona: Prompt-Guided Personalization for Multi-Expert Medical Image Segmentation

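Quick read: This paper addresses the limits that inter-observer variability places on the generalization of medical image segmentation models; in tasks such as lung nodule delineation, experts often disagree markedly on the boundary of the same lesion. Existing methods either collapse this variability into a single consensus mask or model each annotator with a separate branch, which lacks flexibility and interpretability. The key to its solution is ProSona, a two-stage framework: a probabilistic U-Net first captures the diverse segmentation hypotheses of multiple experts and builds a continuous latent space of annotation styles; a prompt-guided projection mechanism then provides personalized segmentation controlled by natural language instructions, and a multi-level contrastive loss aligns textual and visual representations so that expert styles are disentangled and interpretable. Experiments on LIDC-IDRI and a multi-institutional prostate MRI dataset show improved segmentation consistency and accuracy compared with DPersona.

Link: https://arxiv.org/abs/2511.08046
Authors: Aya Elgebaly, Nikolaos Delopoulos, Juliane Hörner-Rieber, Carolin Rippke, Sebastian Klüter, Luca Boldrini, Lorenzo Placidi, Riccardo Dal Bello, Nicolaus Andratschke, Michael Baumgartl, Claus Belka, Christopher Kurz, Guillaume Landry, Shadi Albarqouni
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 5 pages, 5 figures. Submitted to IEEE International Symposium on Biomedical Imaging (ISBI) 2026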

Abstract:Automated medical image segmentation suffers from high inter-observer variability, particularly in tasks such as lung nodule delineation, where experts often disagree. Existing approaches either collapse this variability into a consensus mask or rely on separate model branches for each annotator. We introduce ProSona, a two-stage framework that learns a continuous latent space of annotation styles, enabling controllable personalization via natural language prompts. A probabilistic U-Net backbone captures diverse expert hypotheses, while a prompt-guided projection mechanism navigates this latent space to generate personalized segmentations. A multi-level contrastive objective aligns textual and visual representations, promoting disentangled and interpretable expert styles. Across the LIDC-IDRI lung nodule and multi-institutional prostate MRI datasets, ProSona reduces the Generalized Energy Distance by 17% and improves mean Dice by more than one point compared with DPersona. These results demonstrate that natural-language prompts can provide flexible, accurate, and interpretable control over personalized medical image segmentation. Our implementation is available online.
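
The two metrics reported above can be made concrete with a short sketch. Dice is standard; for the Generalized Energy Distance the sketch assumes the commonly used distance d = 1 - IoU between masks, which the abstract does not confirm. The masks are toy squares, not medical data.

```python
import numpy as np

def dice(a, b, eps=1e-8):
    """Dice overlap between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    inter = np.logical_and(a, b).sum()
    return (2.0 * inter + eps) / (a.sum() + b.sum() + eps)

def iou_distance(a, b):
    """1 - IoU; two empty masks are treated as identical."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return 1.0 - np.logical_and(a, b).sum() / union

def generalized_energy_distance(samples, annotations):
    """GED^2 = 2 E[d(S, Y)] - E[d(S, S')] - E[d(Y, Y')] over all pairs."""
    def mean_dist(xs, ys):
        return np.mean([iou_distance(x, y) for x in xs for y in ys])
    return float(2 * mean_dist(samples, annotations)
                 - mean_dist(samples, samples)
                 - mean_dist(annotations, annotations))

# Two hypothetical annotators who outline the same lesion slightly differently.
mask_a = np.zeros((8, 8), dtype=bool); mask_a[2:6, 2:6] = True
mask_b = np.zeros((8, 8), dtype=bool); mask_b[3:7, 3:7] = True

overlap = dice(mask_a, mask_b)                                           # 18 / 32 = 0.5625
ged_match = generalized_energy_distance([mask_a, mask_b], [mask_a, mask_b])
ged_collapsed = generalized_energy_distance([mask_a, mask_a], [mask_a, mask_b])
```

A model that reproduces both annotators' styles scores a distance of zero, while one that always outputs a single style is penalized even though each individual prediction matches one expert.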

[CV-70] WEDepth: Efficient Adaptation of World Knowledge for Monocular Depth Estimation

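Quick read: This paper addresses the limited accuracy and generalization of monocular depth estimation (MDE), which stem from the ill-posed nature of recovering 3D scenes from 2D images. The key to its solution is WEDepth, which leaves the structure and pretrained weights of Vision Foundation Models (VFMs) unchanged and instead uses the VFM as a multi-level feature enhancer that systematically injects prior knowledge at different representation levels, eliciting and exploiting the world understanding already present in the VFM. Experiments show new state-of-the-art performance on NYU-Depth v2 and KITTI, along with strong zero-shot transfer.

Link: https://arxiv.org/abs/2511.08036
Authors: Gongshu Wang, Zhirui Wang, Kan Yang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: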

Abstract:Monocular depth estimation (MDE) is widely applicable but remains highly challenging due to the inherently ill-posed nature of reconstructing 3D scenes from single 2D images. Modern Vision Foundation Models (VFMs), pre-trained on large-scale diverse datasets, exhibit remarkable world understanding capabilities that benefit various vision tasks. Recent studies have demonstrated significant improvements in MDE through fine-tuning these VFMs. Inspired by these developments, we propose WEDepth, a novel approach that adapts VFMs for MDE without modifying their structures and pretrained weights, while effectively eliciting and leveraging their inherent priors. Our method employs the VFM as a multi-level feature enhancer, systematically injecting prior knowledge at different representation levels. Experiments on NYU-Depth v2 and KITTI datasets show that WEDepth establishes new state-of-the-art (SOTA) performance, achieving competitive results compared to both diffusion-based approaches (which require multiple forward passes) and methods pre-trained on relative depth. Furthermore, we demonstrate that our method exhibits strong zero-shot transfer capability across diverse scenarios.

[CV-71] Perceptual Quality Assessment of 3D Gaussian Splatting: A Subjective Dataset and Prediction Metric

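Quick read: This paper addresses the loss of perceptual quality in 3D Gaussian Splatting (3DGS) renderings caused in practice by viewpoint sparsity, limited training iterations, point downsampling, noise, and color distortions; subjective quality under these distortions has not yet been studied systematically. The key to its solution is 3DGS-QA, the first subjective quality assessment dataset for 3DGS, containing 225 degraded reconstructions, together with a no-reference quality prediction model that needs neither rendered images nor ground-truth references. The model extracts spatial and photometric cues directly from the native 3D Gaussian primitives and estimates perceived quality in a structure-aware manner, enabling efficient and robust quality assessment of 3DGS content.

Link: https://arxiv.org/abs/2511.08032
Authors: Zhaolin Wan, Yining Diao, Jingqi Xu, Hao Wang, Zhiyang Li, Xiaopeng Fan, Wangmeng Zuo, Debin Zhao
Affiliations: 1. Tsinghua University (清华大学); 2. Huawei Technologies Co., Ltd. (华为技术有限公司); 3. Peking University (北京大学); 4. Chinese Academy of Sciences (中国科学院)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: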

Abstract:With the rapid advancement of 3D visualization, 3D Gaussian Splatting (3DGS) has emerged as a leading technique for real-time, high-fidelity rendering. While prior research has emphasized algorithmic performance and visual fidelity, the perceptual quality of 3DGS-rendered content, especially under varying reconstruction conditions, remains largely underexplored. In practice, factors such as viewpoint sparsity, limited training iterations, point downsampling, noise, and color distortions can significantly degrade visual quality, yet their perceptual impact has not been systematically studied. To bridge this gap, we present 3DGS-QA, the first subjective quality assessment dataset for 3DGS. It comprises 225 degraded reconstructions across 15 object types, enabling a controlled investigation of common distortion factors. Based on this dataset, we introduce a no-reference quality prediction model that directly operates on native 3D Gaussian primitives, without requiring rendered images or ground-truth references. Our model extracts spatial and photometric cues from the Gaussian representation to estimate perceived quality in a structure-aware manner. We further benchmark existing quality assessment methods, spanning both traditional and learning-based approaches. Experimental results show that our method consistently achieves superior performance, highlighting its robustness and effectiveness for 3DGS content evaluation. The dataset and code are made publicly available at this https URL to facilitate future research in 3DGS quality assessment.

[CV-72] Multi-modal Deepfake Detection and Localization with FPN-Transformer

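Quick read: This paper addresses the limited cross-modal generalization and fine-grained localization accuracy of current deepfake detection methods, especially their restricted ability to identify audio-visual content and localize manipulated regions precisely in complex environments. The key to its solution is a multi-modal deepfake detection and localization framework based on a Feature Pyramid-Transformer (FPN-Transformer). It first extracts hierarchical temporal features with pretrained self-supervised models (WavLM for audio, CLIP for video); it then builds a multi-scale feature pyramid from R-TLM blocks with localized attention to model cross-context temporal dependencies jointly; finally, a dual-branch prediction head outputs forgery probabilities and refined temporal offsets of manipulated segments, giving frame-level localization precision. The method achieves a final score of 0.7535 on the IJCAI'25 DDL-AV benchmark, supporting its effectiveness and generality in complex scenarios.

Link: https://arxiv.org/abs/2511.08031
Authors: Chende Zheng, Ruiqi Suo, Zhoulin Ji, Jingyi Deng, Fangbin Yi, Chenhao Lin, Chao Shen
Affiliations: Xi’an Jiaotong University (西安交通大学)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: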

Abstract:The rapid advancement of generative adversarial networks (GANs) and diffusion models has enabled the creation of highly realistic deepfake content, posing significant threats to digital trust across audio-visual domains. While unimodal detection methods have shown progress in identifying synthetic media, their inability to leverage cross-modal correlations and precisely localize forged segments limits their practicality against sophisticated, fine-grained manipulations. To address this, we introduce a multi-modal deepfake detection and localization framework based on a Feature Pyramid-Transformer (FPN-Transformer), addressing critical gaps in cross-modal generalization and temporal boundary regression. The proposed approach utilizes pre-trained self-supervised models (WavLM for audio, CLIP for video) to extract hierarchical temporal features. A multi-scale feature pyramid is constructed through R-TLM blocks with localized attention mechanisms, enabling joint analysis of cross-context temporal dependencies. The dual-branch prediction head simultaneously predicts forgery probabilities and refines temporal offsets of manipulated segments, achieving frame-level localization precision. We evaluate our approach on the test set of the IJCAI’25 DDL-AV benchmark, showing a good performance with a final score of 0.7535 for cross-modal deepfake detection and localization in challenging environments. Experimental results confirm the effectiveness of our approach and provide a novel way for generalized deepfake detection. Our code is available at this https URL
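
One way to read the dual-branch head is that each time step emits a forgery score plus distances to the left and right boundaries of the manipulated segment it belongs to. The sketch below decodes such outputs into segments. This anchor-free decoding scheme is an assumption borrowed from temporal localization practice, not something the abstract specifies, and it omits any merging or non-maximum suppression step.

```python
import numpy as np

def decode_segments(probs, left_offsets, right_offsets, threshold=0.5):
    """Turn per-step scores and boundary offsets into (start, end, score) tuples."""
    segments = []
    for t in np.flatnonzero(probs >= threshold):
        start = max(t - left_offsets[t], 0.0)   # clamp at the start of the clip
        end = t + right_offsets[t]
        segments.append((float(start), float(end), float(probs[t])))
    return segments

# Hypothetical head outputs for a 4-step clip: only step 2 looks forged.
probs = np.array([0.1, 0.2, 0.9, 0.3])
left = np.array([0.0, 0.0, 1.5, 0.0])
right = np.array([0.0, 0.0, 2.0, 0.0])

segments = decode_segments(probs, left, right)
print(segments)  # [(0.5, 4.0, 0.9)]
```

Regressing offsets rather than classifying each frame independently lets segment boundaries fall between feature steps, which is one route to the frame-level precision the abstract mentions.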

[CV-73] High-Quality Proposal Encoding and Cascade Denoising for Imaginary Supervised Object Detection

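Quick read: This paper addresses three challenges in Imaginary Supervised Object Detection (ISOD), where detectors train on images produced by generative AI and are tested on real images: low-quality synthetic data; DETR-based detectors that converge slowly and overfit to synthetic patterns; and uniform denoising pressure that makes the model sensitive to pseudo-label noise. The key to its solution is the Cascade HQP-DETR framework. First, it builds the high-quality synthetic datasets FluxVOC and FluxCOCO using LLaMA-3, Flux, and Grounding DINO, moving ISOD from weak to full supervision. Second, High-Quality Proposal guided query encoding initializes object queries from SAM-generated proposals and RoI-pooled features, which speeds up convergence and reduces overfitting to synthetic patterns. Third, a cascade denoising algorithm adjusts training weights dynamically by raising the IoU threshold layer by layer across the decoder, so the model learns robust boundaries instead of relying on noisy labels. Trained for only 12 epochs on FluxVOC, the method reaches 61.04% mAP@0.5 on PASCAL VOC 2007, significantly outperforming existing baselines and supporting its generalization to real-world data.

Link: https://arxiv.org/abs/2511.08018
Authors: Zhiyuan Chen, Yuelin Guo, Zitong Huang, Haoyu He, Renhao Lu, Weizhe Zhang
Affiliations: PCL (Pattern Recognition and Computational Intelligence Laboratory); Harbin Institute of Technology (哈尔滨工业大学)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to Pattern Recognition for possible publication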

Abstract:Object detection models demand large-scale annotated datasets, which are costly and labor-intensive to create. This motivated Imaginary Supervised Object Detection (ISOD), where models train on synthetic images and test on real images. However, existing methods face three limitations: (1) synthetic datasets suffer from simplistic prompts, poor image quality, and weak supervision; (2) DETR-based detectors, due to their random query initialization, struggle with slow convergence and overfitting to synthetic patterns, hindering real-world generalization; (3) uniform denoising pressure promotes model overfitting to pseudo-label noise. We propose Cascade HQP-DETR to address these limitations. First, we introduce a high-quality data pipeline using LLaMA-3, Flux, and Grounding DINO to generate the FluxVOC and FluxCOCO datasets, advancing ISOD from weak to full supervision. Second, our High-Quality Proposal guided query encoding initializes object queries with image-specific priors from SAM-generated proposals and RoI-pooled features, accelerating convergence while steering the model to learn transferable features instead of overfitting to synthetic patterns. Third, our cascade denoising algorithm dynamically adjusts training weights through progressively increasing IoU thresholds across decoder layers, guiding the model to learn robust boundaries from reliable visual cues rather than overfitting to noisy labels. Trained for just 12 epochs solely on FluxVOC, Cascade HQP-DETR achieves a SOTA 61.04% mAP@0.5 on PASCAL VOC 2007, outperforming strong baselines, with its competitive real-data performance confirming the architecture’s universal applicability.
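
The phrase "progressively increasing IoU thresholds across decoder layers" can be pictured with a toy weighting scheme. The abstract does not say what the IoU is measured against or how weights are derived from it, so the sketch below is purely illustrative: it assumes one IoU value per denoising query, a linear threshold schedule, and binary weights. All names and numbers are hypothetical.

```python
import numpy as np

def cascade_thresholds(num_layers, low=0.5, high=0.7):
    """Linearly increasing IoU threshold, one per decoder layer (assumed schedule)."""
    return np.linspace(low, high, num_layers)

def layer_weights(ious, num_layers, low=0.5, high=0.7):
    """Weight matrix of shape (num_layers, num_queries): 1 where IoU passes the layer's threshold."""
    taus = cascade_thresholds(num_layers, low, high)
    return (ious[None, :] >= taus[:, None]).astype(np.float64)

# Hypothetical IoUs between four denoising queries and their pseudo-label boxes.
ious = np.array([0.45, 0.55, 0.65, 0.90])
weights = layer_weights(ious, num_layers=3)
print(weights)
# Layer 0 (tau = 0.5) keeps three queries, layer 1 (0.6) keeps two, layer 2 (0.7) keeps one.
```

Under this reading, early layers learn from loosely matching boxes while deeper layers are supervised only by the most reliable ones, in contrast to the uniform denoising pressure the abstract criticizes.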

[CV-74] Invisible Triggers Visible Threats! Road-Style Adversarial Creation Attack for Visual 3D Detection in Autonomous Driving AAAI2026

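Quick read: This paper addresses the safety risk that RGB-camera-based 3D object detection models in autonomous driving (AD) systems are highly susceptible to adversarial examples. Existing methods can make a detector hallucinate objects by placing adversarial posters on the road surface, but the posters look unnatural, are easily noticed by humans, and have fixed content, so a stealthy attack is difficult. The key to its solution is the AdvRoad framework, which uses a two-stage strategy: Road-Style Adversary Generation produces adversarial posters with a natural road-texture appearance, and Scenario-Associated Adaptation adjusts the poster content to the input scene to maximize attack effectiveness while staying visually inconspicuous. The approach generalizes well across detectors, scenes, and spoofing locations, and physical-world experiments confirm the practical threat.

Link: https://arxiv.org/abs/2511.08015
Authors: Jian Wang, Lijun He, Yixing Yong, Haixia Bi, Fan Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by the AAAI 2026 (Main Track)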

Abstract:Modern autonomous driving (AD) systems leverage 3D object detection to perceive foreground objects in 3D environments for subsequent prediction and planning. Visual 3D detection based on RGB cameras provides a cost-effective solution compared to the LiDAR paradigm. While achieving promising detection accuracy, current deep neural network-based models remain highly susceptible to adversarial examples. The underlying safety concerns motivate us to investigate realistic adversarial attacks in AD scenarios. Previous work has demonstrated the feasibility of placing adversarial posters on the road surface to induce hallucinations in the detector. However, the unnatural appearance of the posters makes them easily noticeable by humans, and their fixed content can be readily targeted and defended. To address these limitations, we propose the AdvRoad to generate diverse road-style adversarial posters. The adversaries have naturalistic appearances resembling the road surface while compromising the detector to perceive non-existent objects at the attack locations. We employ a two-stage approach, termed Road-Style Adversary Generation and Scenario-Associated Adaptation, to maximize the attack effectiveness on the input scene while ensuring the natural appearance of the poster, allowing the attack to be carried out stealthily without drawing human attention. Extensive experiments show that AdvRoad generalizes well to different detectors, scenes, and spoofing locations. Moreover, physical attacks further demonstrate the practical threats in real-world environments.

[CV-75] EAGLE: Episodic Appearance- and Geometry-aware Memory for Unified 2D-3D Visual Query Localization in Egocentric Vision AAAI2026

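[Quick Read]: This paper tackles unified modeling of 2D and 3D visual query localization (VQL) in egocentric vision, a capability that matters for embodied AI and VR/AR but is made difficult by camera motion, viewpoint changes, and appearance variations. The key to the solution is the EAGLE framework, whose core idea is a memory mechanism inspired by avian memory consolidation: an Appearance-aware Meta-learning Memory (AMM) guides segmentation to capture changes in target appearance, while a Geometry-aware Localization Memory (GLM) drives tracking to keep localization spatially consistent. Together they form structured appearance and geometry memory banks that store high-confidence retrieval samples, supporting both short- and long-term modeling of target appearance. Combining the 2D output with a Visual Geometry Grounded Transformer (VGGT) then unifies the 2D and 3D tasks efficiently and improves localization accuracy and robustness.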

Link: https://arxiv.org/abs/2511.08007
Authors: Yifei Cao, Yu Liu, Guolong Wang, Zhu Liu, Kai Wang, Xianjie Zhang, Jizhe Yu, Xun Tu
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 Pages, AAAI2026

Abstract:Egocentric visual query localization is vital for embodied AI and VR/AR, yet remains challenging due to camera motion, viewpoint changes, and appearance variations. We present EAGLE, a novel framework that leverages episodic appearance- and geometry-aware memory to achieve unified 2D-3D visual query localization in egocentric vision. Inspired by avian memory consolidation, EAGLE synergistically integrates segmentation guided by an appearance-aware meta-learning memory (AMM), with tracking driven by a geometry-aware localization memory (GLM). This memory consolidation mechanism, through structured appearance and geometry memory banks, stores high-confidence retrieval samples, effectively supporting both long- and short-term modeling of target appearance variations. This enables precise contour delineation with robust spatial discrimination, leading to significantly improved retrieval accuracy. Furthermore, by integrating the VQL-2D output with a visual geometry grounded Transformer (VGGT), we achieve an efficient unification of 2D and 3D tasks, enabling rapid and accurate back-projection into 3D space. Our method achieves state-of-the-art performance on the Ego4D-VQ benchmark.

[CV-76] Sharp Eyes and Memory for VideoLLMs: Information-Aware Visual Token Pruning for Efficient and Reliable VideoLLM Reasoning

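[Quick Read]: This paper addresses the quadratic computational complexity and key-value (KV) cache growth of current Video Large Language Models (VideoLLMs), both of which stem from processing large numbers of redundant visual tokens. The key to the solution is SharpV, a minimalist and efficient method for adaptively pruning visual tokens and the KV cache. Unlike uniform compression, SharpV adjusts its pruning ratio dynamically from spatial-temporal information, and in the KV cache stage it uses a self-calibration step, guided by similarity to the original visual features, to identify and remove degraded visual features, which yields hierarchical cache pruning from an information-bottleneck perspective. Both pruning stages run without access to exposed attention scores, so the method stays compatible with hardware acceleration such as Flash Attention, and the adaptive mechanism occasionally even outperforms the dense model.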

Link: https://arxiv.org/abs/2511.08003
Authors: Jialong Qin, Xin Zou, Di Lu, Yibo Yan, Xuming Hu
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Current Video Large Language Models (VideoLLMs) suffer from quadratic computational complexity and key-value cache scaling, due to their reliance on processing excessive redundant visual tokens. To address this problem, we propose SharpV, a minimalist and efficient method for adaptive pruning of visual tokens and KV cache. Different from most uniform compression approaches, SharpV dynamically adjusts pruning ratios based on spatial-temporal information. Remarkably, this adaptive mechanism occasionally achieves performance gains over dense models, offering a novel paradigm for adaptive pruning. During the KV cache pruning stage, based on observations of visual information degradation, SharpV prunes degraded visual features via a self-calibration manner, guided by similarity to original visual features. In this way, SharpV achieves hierarchical cache pruning from the perspective of information bottleneck, offering a new insight into VideoLLMs’ information flow. Experiments on multiple public benchmarks demonstrate the superiority of SharpV. Moreover, to the best of our knowledge, SharpV is notably the first two-stage pruning framework that operates without requiring access to exposed attention scores, ensuring full compatibility with hardware acceleration techniques like Flash Attention.
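
The KV cache stage described in the abstract drops visual features that have drifted away from their originals. The following is a minimal sketch of that idea only, assuming cosine similarity as the similarity measure and a fixed threshold. Both choices are illustrative assumptions, since the abstract specifies neither, and `keep_indices` is a hypothetical helper rather than part of SharpV.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def keep_indices(original, layer_features, threshold=0.5):
    """Return the indices of tokens whose current features still resemble the originals.

    `threshold` is a made-up value for illustration.
    """
    return [
        i
        for i, (o, h) in enumerate(zip(original, layer_features))
        if cosine(o, h) >= threshold
    ]

# Toy data: three visual tokens with 2-dimensional features.
original = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
layer_features = [[2.0, 0.0], [1.0, 0.0], [1.0, 0.9]]
kept = keep_indices(original, layer_features)  # token 1 has drifted and is dropped
```

Because the filter only compares feature vectors, it needs no attention scores, which is the property the abstract highlights for compatibility with fused attention kernels.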

[CV-77] Hardware-Aware YOLO Compression for Low-Power Edge AI on STM32U5 for Weeds Detection in Digital Agriculture

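[Quick Read]: This paper addresses two problems: traditional weed management, which relies mainly on chemical herbicides, risks environmental contamination and promotes herbicide-resistant species, while existing precision-weeding techniques depend on high-power computing platforms that are hard to deploy at scale in the field. The key to the solution is a low-power edge AI system built on the YOLOv8n object detector and compressed with structured pruning, integer quantization, and input-resolution scaling so that it runs on the STM32U575ZI microcontroller. The system keeps a balanced trade-off between detection accuracy and efficiency while consuming 51.8 mJ per inference, which suits the energy and real-time constraints of agricultural deployments.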

Link: https://arxiv.org/abs/2511.07990
Authors: Charalampos S. Kouzinopoulos, Yuri Manna
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Weeds significantly reduce crop yields worldwide and pose major challenges to sustainable agriculture. Traditional weed management methods, primarily relying on chemical herbicides, risk environmental contamination and lead to the emergence of herbicide-resistant species. Precision weeding, leveraging computer vision and machine learning methods, offers a promising eco-friendly alternative but is often limited by reliance on high-power computational platforms. This work presents an optimized, low-power edge AI system for weeds detection based on the YOLOv8n object detector deployed on the STM32U575ZI microcontroller. Several compression techniques are applied to the detection model, including structured pruning, integer quantization and input image resolution scaling in order to meet strict hardware constraints. The model is trained and evaluated on the CropAndWeed dataset with 74 plant species, achieving a balanced trade-off between detection accuracy and efficiency. Our system supports real-time, in-situ weeds detection with a minimal energy consumption of 51.8mJ per inference, enabling scalable deployment in power-constrained agricultural environments.
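
Integer quantization is one of the compression steps the abstract lists. As a rough illustration of what that step does to a weight tensor, here is a sketch of symmetric per-tensor int8 quantization in plain Python. It is a generic textbook scheme, not the authors' toolchain, and the weight values are invented.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q, with q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float weights."""
    return [scale * v for v in q]

weights = [0.5, -1.27, 0.0, 0.9]      # made-up float weights
q, scale = quantize_int8(weights)     # integer codes plus one float scale
restored = dequantize(q, scale)       # each value is within scale / 2 of the original
```

Storing one byte per weight instead of four is what makes a detector of this kind fit in microcontroller memory; the price is the rounding error bounded by half the scale.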

[CV-78] CSF-Net: Context-Semantic Fusion Network for Large Mask Inpainting WACV2026

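[Quick Read]: This paper addresses large-mask image inpainting: recovering structurally accurate and semantically consistent content when a large region is missing and contextual cues are limited. The key to the solution is a semantic-guided framework, the Context-Semantic Fusion Network (CSF-Net). It uses a pretrained Amodal Completion (AC) model to generate structure-aware candidate images that serve as semantic priors, then fuses these candidates with local contextual features through a transformer-based fusion mechanism to produce a semantic guidance image that steers the inpainting process. CSF-Net plugs into existing inpainting models without architectural changes and improves performance consistently across diverse masking conditions.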

Link: https://arxiv.org/abs/2511.07987
Authors: Chae-Yeon Heo, Yeong-Jun Cho
Affiliation: Chonnam National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures, Accepted to WACV 2026 (to appear)

Abstract:In this paper, we propose a semantic-guided framework to address the challenging problem of large-mask image inpainting, where essential visual content is missing and contextual cues are limited. To compensate for the limited context, we leverage a pretrained Amodal Completion (AC) model to generate structure-aware candidates that serve as semantic priors for the missing regions. We introduce Context-Semantic Fusion Network (CSF-Net), a transformer-based fusion framework that fuses these candidates with contextual features to produce a semantic guidance image for image inpainting. This guidance improves inpainting quality by promoting structural accuracy and semantic consistency. CSF-Net can be seamlessly integrated into existing inpainting models without architectural changes and consistently enhances performance across diverse masking conditions. Extensive experiments on the Places365 and COCOA datasets demonstrate that CSF-Net effectively reduces object hallucination while enhancing visual realism and semantic alignment. The code for CSF-Net is available at this https URL.

[CV-79] ChexFract: From General to Specialized - Enhancing Fracture Description Generation

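[Quick Read]: This paper addresses the weakness of current medical-imaging AI at describing fractures, a rare but clinically important pathology, when generating chest X-ray reports. General-purpose vision-language models (VLMs) perform well on general radiology report generation but often fail to describe such findings adequately. The key to the solution is to train fracture-specific vision-language models with encoders from MAIRA-2 and CheXagent, which significantly improves the accuracy of generated fracture descriptions over general-purpose models. An analysis of model outputs by fracture type, location, and age further exposes the strengths and limitations of current VLM architectures.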

Link: https://arxiv.org/abs/2511.07983
Authors: Nikolay Nechaev, Evgeniia Przhezdzetskaia, Dmitry Umerenkov, Dmitry V. Dylov
Affiliation: Artificial Intelligence Research Institute (AIRI), Moscow, Russia; Skolkovo Institute of Science and Technology (Skoltech), Moscow, Russia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 3 figures

Abstract:Generating accurate and clinically meaningful radiology reports from chest X-ray images remains a significant challenge in medical AI. While recent vision-language models achieve strong results in general radiology report generation, they often fail to adequately describe rare but clinically important pathologies like fractures. This work addresses this gap by developing specialized models for fracture pathology detection and description. We train fracture-specific vision-language models with encoders from MAIRA-2 and CheXagent, demonstrating significant improvements over general-purpose models in generating accurate fracture descriptions. Analysis of model outputs by fracture type, location, and age reveals distinct strengths and limitations of current vision-language model architectures. We publicly release our best-performing fracture-reporting model, facilitating future research in accurate reporting of rare pathologies.

[CV-80] DANCE: Density-agnostic and Class-aware Network for Point Cloud Completion

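[Quick Read]: This paper addresses point cloud completion, where geometry is missing because of occlusion or limited sensor viewpoints, with a focus on real-world settings in which input density varies and supervision is limited. The key to the solution is DANCE, a density-agnostic and class-aware network: it generates candidate points by ray-based sampling from multiple viewpoints, refines their positions with a transformer decoder, and predicts an opacity score per point to decide which points belong to the final surface. A lightweight classification head trained directly on geometric features adds semantic guidance, giving category-consistent completion without external image supervision. On the PCN and MVP benchmarks the method outperforms prior work in accuracy and structural consistency and remains robust to varying input density and noise.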

Link: https://arxiv.org/abs/2511.07978
Authors: Da-Yeong Kim, Yeong-Jun Cho
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Point cloud completion aims to recover missing geometric structures from incomplete 3D scans, which often suffer from occlusions or limited sensor viewpoints. Existing methods typically assume fixed input/output densities or rely on image-based representations, making them less suitable for real-world scenarios with variable sparsity and limited supervision. In this paper, we introduce Density-agnostic and Class-aware Network (DANCE), a novel framework that completes only the missing regions while preserving the observed geometry. DANCE generates candidate points via ray-based sampling from multiple viewpoints. A transformer decoder then refines their positions and predicts opacity scores, which determine the validity of each point for inclusion in the final surface. To incorporate semantic guidance, a lightweight classification head is trained directly on geometric features, enabling category-consistent completion without external image supervision. Extensive experiments on the PCN and MVP benchmarks show that DANCE outperforms state-of-the-art methods in accuracy and structural consistency, while remaining robust to varying input densities and noise levels.

[CV-81] Morphing Through Time: Diffusion-Based Bridging of Temporal Gaps for Robust Alignment in Change Detection WACV2026

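[Quick Read]: This paper addresses the performance drop in remote sensing change detection caused by spatial misalignment between bi-temporal images, especially image pairs acquired across seasonal or multi-year gaps. Convolutional and transformer-based detectors work well on aligned data, but their reliance on precise co-registration limits robustness in practice, and existing joint registration-detection frameworks usually need retraining and transfer poorly across domains. The key to the solution is a modular pipeline: a diffusion module synthesizes intermediate semantic-morphing frames that bridge large appearance gaps, RoMa estimates stepwise correspondences between consecutive frames, and a lightweight U-Net applies residual refinement to the composed flow to produce a high-fidelity warp that co-registers the original pair. This improves registration accuracy and downstream change detection without modifying the change detection network.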

Link: https://arxiv.org/abs/2511.07976
Authors: Seyedehanita Madani, Vishal M. Patel
Affiliation: Johns Hopkins University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures. To appear in WACV 2026

Abstract:Remote sensing change detection is often challenged by spatial misalignment between bi-temporal images, especially when acquisitions are separated by long seasonal or multi-year gaps. While modern convolutional and transformer-based models perform well on aligned data, their reliance on precise co-registration limits their robustness in real-world conditions. Existing joint registration-detection frameworks typically require retraining and transfer poorly across domains. We introduce a modular pipeline that improves spatial and temporal robustness without altering existing change detection networks. The framework integrates diffusion-based semantic morphing, dense registration, and residual flow refinement. A diffusion module synthesizes intermediate morphing frames that bridge large appearance gaps, enabling RoMa to estimate stepwise correspondences between consecutive frames. The composed flow is then refined through a lightweight U-Net to produce a high-fidelity warp that co-registers the original image pair. Extensive experiments on LEVIR-CD, WHU-CD, and DSIFN-CD show consistent gains in both registration accuracy and downstream change detection across multiple backbones, demonstrating the generality and effectiveness of the proposed approach.
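
The pipeline estimates correspondences frame by frame and then works with the composed flow. A minimal 1D sketch of what composing two displacement fields means is given below, assuming linear interpolation and border clamping; both are simplifying assumptions, and `sample` and `compose` are hypothetical helper names. The real setting is 2D and the paper's exact composition rule is not given in the abstract.

```python
def sample(flow, x):
    """Linearly interpolate a 1D displacement field at position x, clamping at the borders."""
    x = min(max(x, 0.0), len(flow) - 1.0)
    i = int(x)
    j = min(i + 1, len(flow) - 1)
    t = x - i
    return (1 - t) * flow[i] + t * flow[j]

def compose(flow_ab, flow_bc):
    """Displacement from frame A to frame C: move by A->B, then by B->C read at the landing point."""
    return [d + sample(flow_bc, x + d) for x, d in enumerate(flow_ab)]

flow_ab = [1.0, 1.0, 1.0, 1.0, 1.0]   # every pixel moves one step to the right
flow_bc = [0.0, 1.0, 2.0, 3.0, 4.0]   # a shift that grows with position
flow_ac = compose(flow_ab, flow_bc)   # total displacement from A to C
```

Chaining more intermediate frames repeats the same call, for example `compose(compose(f01, f12), f23)`.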

[CV-82] Multi-Modal Assistance for Unsupervised Domain Adaptation on Point Cloud 3D Object Detection AAAI-26

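[Quick Read]: This paper asks how image data can be used to improve target-domain performance in unsupervised domain adaptation for LiDAR-based 3D object detection (3D UDA). Existing methods widely adopt a teacher-student architecture with pseudo labels but pay little attention to what the image modality can contribute. The key to the solution is the MMAssist framework, which aligns source- and target-domain 3D features by using image and text features as bridges. Ground-truth or pseudo labels are projected onto the images to obtain 2D bounding boxes; for each box, an image feature is extracted with a pre-trained vision backbone, and a text feature is obtained by describing the box with a large vision-language model (LVLM) and encoding the description with a pre-trained text encoder. While training the source-domain model and the target-domain student model, the 3D features of predicted boxes are aligned with the corresponding image and text features, and the 3D and aligned features are fused with learned weights for the final prediction. Features of the student and teacher branches in the target domain are aligned as well, and pseudo labels are strengthened with 3D boxes estimated from the detections of an off-the-shelf 2D detector.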

Link: https://arxiv.org/abs/2511.07966
Authors: Shenao Zhao, Pengpeng Liang, Zhoufan Yang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI-26

Abstract:Unsupervised domain adaptation for LiDAR-based 3D object detection (3D UDA) based on the teacher-student architecture with pseudo labels has achieved notable improvements in recent years. Although it is quite popular to collect point clouds and images simultaneously, little attention has been paid to the usefulness of image data in 3D UDA when training the models. In this paper, we propose an approach named MMAssist that improves the performance of 3D UDA with multi-modal assistance. A method is designed to align 3D features between the source domain and the target domain by using image and text features as bridges. More specifically, we project the ground truth labels or pseudo labels to the images to get a set of 2D bounding boxes. For each 2D box, we extract its image feature from a pre-trained vision backbone. A large vision-language model (LVLM) is adopted to extract the box’s text description, and a pre-trained text encoder is used to obtain its text feature. During the training of the model in the source domain and the student model in the target domain, we align the 3D features of the predicted boxes with their corresponding image and text features, and the 3D features and the aligned features are fused with learned weights for the final prediction. The features between the student branch and the teacher branch in the target domain are aligned as well. To enhance the pseudo labels, we use an off-the-shelf 2D object detector to generate 2D bounding boxes from images and estimate their corresponding 3D boxes with the aid of point cloud, and these 3D boxes are combined with the pseudo labels generated by the teacher model. Experimental results show that our approach achieves promising performance compared with state-of-the-art methods in three domain adaptation tasks on three popular 3D object detection datasets. The code is available at this https URL.
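
The first step of the method projects 3D labels onto the image to obtain 2D boxes. A minimal sketch of that projection with a pinhole camera follows. It assumes an axis-aligned box already expressed in camera coordinates with z pointing forward, and the intrinsics `fx`, `fy`, `cx`, `cy` are made-up values. Real datasets also need the LiDAR-to-camera transform and box rotation, which are omitted here.

```python
from itertools import product

def project_box(center, size, fx, fy, cx, cy):
    """Project an axis-aligned 3D box (camera coordinates) to a 2D pixel box.

    Returns (u_min, v_min, u_max, v_max), the tightest box around the 8 projected corners.
    """
    x0, y0, z0 = center
    dx, dy, dz = (s / 2.0 for s in size)
    us, vs = [], []
    for sx, sy, sz in product((-1, 1), repeat=3):
        x, y, z = x0 + sx * dx, y0 + sy * dy, z0 + sz * dz
        if z <= 0:
            raise ValueError("box corner is behind the camera")
        us.append(fx * x / z + cx)   # pinhole model: u = fx * X / Z + cx
        vs.append(fy * y / z + cy)   # pinhole model: v = fy * Y / Z + cy
    return min(us), min(vs), max(us), max(vs)

# A 2 m cube centred 10 m in front of a hypothetical camera.
box2d = project_box(center=(0.0, 0.0, 10.0), size=(2.0, 2.0, 2.0),
                    fx=100.0, fy=100.0, cx=50.0, cy=50.0)
```

The near face of the box (smaller z) dominates the 2D extent, which is why all eight corners are projected rather than only the centre.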

[CV-83] Burst Image Quality Assessment: A New Benchmark and Unified Framework for Multiple Downstream Tasks

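[Quick Read]: This paper addresses the redundancy in burst images, which raises storage and transmission costs and slows downstream tasks. It proposes a new task, Burst Image Quality Assessment (BuIQA), which scores the task-driven quality of each frame in a burst so that high-quality frames can be selected. The key to the solution has two parts. First, the authors build the first BuIQA benchmark, with 7,346 burst sequences, 45,827 images, and 191,572 annotated quality scores. Second, guided by an analysis of this data, they design a unified BuIQA framework in which a task-driven prompt generation network, trained with heterogeneous knowledge distillation, learns priors of the downstream task, and a task-aware quality assessment network scores the burst frames based on that prompt. Across 10 downstream scenarios the method outperforms the state of the art, and selecting frames with it yields a 0.33 dB PSNR gain in denoising and super-resolution.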

Link: https://arxiv.org/abs/2511.07958
Authors: Xiaoye Liang, Lai Jiang, Minglang Qiao, Yichen Guo, Yue Zhang, Xin Deng, Shengxi Li, Yufan Liu, Mai Xu
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In recent years, the development of burst imaging technology has improved the capture and processing capabilities of visual data, enabling a wide range of applications. However, the redundancy in burst images leads to the increased storage and transmission demands, as well as reduced efficiency of downstream tasks. To address this, we propose a new task of Burst Image Quality Assessment (BuIQA), to evaluate the task-driven quality of each frame within a burst sequence, providing reasonable cues for burst image selection. Specifically, we establish the first benchmark dataset for BuIQA, consisting of 7,346 burst sequences with 45,827 images and 191,572 annotated quality scores for multiple downstream scenarios. Inspired by the data analysis, a unified BuIQA framework is proposed to achieve an efficient adaption for BuIQA under diverse downstream scenarios. Specifically, a task-driven prompt generation network is developed with heterogeneous knowledge distillation, to learn the priors of the downstream task. Then, the task-aware quality assessment network is introduced to assess the burst image quality based on the task prompt. Extensive experiments across 10 downstream scenarios demonstrate the impressive BuIQA performance of the proposed approach, outperforming the state-of-the-art. Furthermore, it can achieve 0.33 dB PSNR improvement in the downstream tasks of denoising and super-resolution, by applying our approach to select the high-quality burst frames.

[CV-84] ReIDMamba: Learning Discriminative Features with Visual State Space Model for Person Re-Identification

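[Quick Read]: This paper addresses the difficulty of extracting robust, discriminative features in person re-identification (ReID), and in particular the scalability bottleneck of Transformer-based methods, whose memory and compute costs grow quadratically with input sequence length. The key to the solution is ReIDMamba, a ReID framework built purely on Mamba, with three main contributions: (1) a strong Mamba-based baseline that introduces multiple class tokens to capture fine-grained, discriminative global features; (2) a multi-granularity feature extractor (MGFE) that uses a multi-branch architecture and class token fusion to form multi-granularity features, improving both discrimination and fine-grained coverage; (3) ranking-aware triplet regularization (RATR), which imposes intra-class and inter-class diversity constraints across branches to reduce redundancy and diversify the multi-granularity features, making the person features more robust. ReIDMamba reaches state-of-the-art performance on five person ReID benchmarks with only one-third the parameters of TransReID, lower GPU memory usage, and higher inference throughput.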

Link: https://arxiv.org/abs/2511.07948
Authors: Hongyang Gu, Qisong Yang, Lei Pu, Siming Han, Yao Ding
Affiliation: Rocket Force University of Engineering
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 8 figures. Accepted to IEEE Transactions on Multimedia (TMM). Accepted Manuscript version uploaded

Abstract:Extracting robust discriminative features is a critical challenge in person re-identification (ReID). While Transformer-based methods have successfully addressed some limitations of convolutional neural networks (CNNs), such as their local processing nature and information loss resulting from convolution and downsampling operations, they still face the scalability issue due to the quadratic increase in memory and computational requirements with the length of the input sequence. To overcome this, we propose a pure Mamba-based person ReID framework named ReIDMamba. Specifically, we have designed a Mamba-based strong baseline that effectively leverages fine-grained, discriminative global features by introducing multiple class tokens. To further enhance robust features learning within Mamba, we have carefully designed two novel techniques. First, the multi-granularity feature extractor (MGFE) module, designed with a multi-branch architecture and class token fusion, effectively forms multi-granularity features, enhancing both discrimination ability and fine-grained coverage. Second, the ranking-aware triplet regularization (RATR) is introduced to reduce redundancy in features from multiple branches, enhancing the diversity of multi-granularity features by incorporating both intra-class and inter-class diversity constraints, thus ensuring the robustness of person features. To our knowledge, this is the pioneering work that integrates a purely Mamba-driven approach into ReID research. Our proposed ReIDMamba model boasts only one-third the parameters of TransReID, along with lower GPU memory usage and faster inference throughput. Experimental results demonstrate ReIDMamba’s superior and promising performance, achieving state-of-the-art performance on five person ReID benchmarks. Code is available at this https URL.

[CV-85] Class-feature Watermark: A Resilient Black-box Watermark Against Model Extraction Attacks AAAI’26

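[Quick Read]: This paper addresses ownership protection for machine learning models under model extraction attacks (MEAs), and specifically the fragility of existing black-box watermarks against sequential MEAs and removal attacks. Current methods rely on representation entanglement to help the watermark survive extraction, and the paper finds that the risk from removal attacks has been underestimated because existing removal methods are themselves weakened by entanglement. It first proposes the Watermark Removal attacK (WRK), which sidesteps the entanglement constraint by exploiting the decision boundaries shaped by sample-level watermark artifacts and reduces watermark success rates by at least 88.79% on existing benchmarks. For robust protection it then proposes Class-Feature Watermarks (CFW), which build a synthetic class from out-of-domain samples and thereby remove the vulnerable decision boundary between original-domain samples and their watermarked counterparts. CFW optimizes MEA transferability and post-MEA stability together, and across multiple domains it keeps a watermark success rate of at least 70.15% in extracted models, even under combined MEA and WRK distortion, without harming the utility of the protected model.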

Link: https://arxiv.org/abs/2511.07947
Authors: Yaxin Xiao, Qingqing Ye, Zi Liang, Haoyang Li, RongHua Li, Huadi Zheng, Haibo Hu
Affiliation: Hong Kong Polytechnic University (PolyU)
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by AAAI’26

Abstract:Machine learning models constitute valuable intellectual property, yet remain vulnerable to model extraction attacks (MEA), where adversaries replicate their functionality through black-box queries. Model watermarking counters MEAs by embedding forensic markers for ownership verification. Current black-box watermarks prioritize MEA survival through representation entanglement, yet inadequately explore resilience against sequential MEAs and removal attacks. Our study reveals that this risk is underestimated because existing removal methods are weakened by entanglement. To address this gap, we propose Watermark Removal attacK (WRK), which circumvents entanglement constraints by exploiting decision boundaries shaped by prevailing sample-level watermark artifacts. WRK effectively reduces watermark success rates by at least 88.79% across existing watermarking benchmarks. For robust protection, we propose Class-Feature Watermarks (CFW), which improve resilience by leveraging class-level artifacts. CFW constructs a synthetic class using out-of-domain samples, eliminating vulnerable decision boundaries between original domain samples and their artifact-modified counterparts (watermark samples). CFW concurrently optimizes both MEA transferability and post-MEA stability. Experiments across multiple domains show that CFW consistently outperforms prior methods in resilience, maintaining a watermark success rate of at least 70.15% in extracted models even under the combined MEA and WRK distortion, while preserving the utility of protected models.

[CV-86] Libra-MIL: Multimodal Prototypes Stereoscopic Infused with Task-specific Language Priors for Few-shot Whole Slide Image Classification

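[Quick Read]: This paper addresses two problems in computational pathology: giga-pixel Whole Slide Images (WSIs) are costly to process, which makes Multi-Instance Learning (MIL) necessary, and pathology tasks usually provide only bag-level labels, while instance-level descriptions generated by LLMs tend to be biased because they lack fine-grained medical knowledge. The core of the solution is to construct task-specific pathological entity prototypes, which helps the model learn generalizable features and makes it more interpretable. On top of this, the paper proposes Multimodal Prototype-based Multi-Instance Learning, which uses a balanced information compression scheme to promote bidirectional interaction between the vision and language modalities, and it fuses the two with a similarity-based Stereoscopic Optimal Transport (SOT) algorithm for broader cross-modal semantic alignment. The result is better few-shot classification and explainability.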

Link: https://arxiv.org/abs/2511.07941
Authors: Zhenfeng Zhuang, Fangyu Zhou, Liansheng Wang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:While Large Language Models (LLMs) are emerging as a promising direction in computational pathology, the substantial computational cost of giga-pixel Whole Slide Images (WSIs) necessitates the use of Multi-Instance Learning (MIL) to enable effective modeling. A key challenge is that pathological tasks typically provide only bag-level labels, while instance-level descriptions generated by LLMs often suffer from bias due to a lack of fine-grained medical knowledge. To address this, we propose that constructing task-specific pathological entity prototypes is crucial for learning generalizable features and enhancing model interpretability. Furthermore, existing vision-language MIL methods often employ unidirectional guidance, limiting cross-modal synergy. In this paper, we introduce a novel approach, Multimodal Prototype-based Multi-Instance Learning, that promotes bidirectional interaction through a balanced information compression scheme. Specifically, we leverage a frozen LLM to generate task-specific pathological entity descriptions, which are learned as text prototypes. Concurrently, the vision branch learns instance-level prototypes to mitigate the model’s reliance on redundant data. For the fusion stage, we employ the Stereoscopic Optimal Transport (SOT) algorithm, which is based on a similarity metric, thereby facilitating broader semantic alignment in a higher-dimensional space. We conduct few-shot classification and explainability experiments on three distinct cancer datasets, and the results demonstrate the superior generalization capabilities of our proposed method.

[CV-87] Is It Truly Necessary to Process and Fit Minutes-Long Reference Videos for Personalized Talking Face Generation?

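[Quick Read]: This paper addresses the heavy computational burden of current Talking Face Generation (TFG) methods based on Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS), which need several minutes of reference video to be processed and fitted. Such methods use long reference videos to capture enough 3D information and learn the lip-audio mapping, but the paper's case studies show that the informative quality of the video matters much more than its length. It therefore proposes ISExplore (Informative Segment Explore), a simple yet effective segment selection strategy that automatically identifies an informative 5-second reference segment according to three data quality dimensions: audio feature diversity, lip movement amplitude, and number of camera views. This speeds up data processing and training by more than 5x while keeping high-fidelity output, which makes TFG models considerably more practical.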

Link: https://arxiv.org/abs/2511.07940
Authors: Rui-Qing Sun, Ang Li, Zhijing Wu, Tian Lan, Qianyu Lu, Xingshan Yao, Chen Xu, Xian-Ling Mao
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Talking Face Generation (TFG) aims to produce realistic and dynamic talking portraits, with broad applications in fields such as digital education, film and television production, e-commerce live streaming, and other related areas. Currently, TFG methods based on Neural Radiance Field (NeRF) or 3D Gaussian Splatting (3DGS) have received widespread attention. They learn and store personalized features from reference videos of each target individual to generate realistic speaking videos. To ensure models can capture sufficient 3D information and successfully learn the lip-audio mapping, previous studies usually require meticulous processing and fitting of several minutes of reference video, which always takes hours. The computational burden of processing and fitting long reference videos severely limits the practical application value of these methods. However, is it really necessary to fit several minutes of reference video? Our exploratory case studies show that using some informative reference video segments of just a few seconds can achieve performance comparable to or even better than the full reference video. This indicates that video informative quality is much more important than its length. Inspired by this observation, we propose ISExplore (short for Informative Segment Explore), a simple-yet-effective segment selection strategy that automatically identifies the informative 5-second reference video segment based on three key data quality dimensions: audio feature diversity, lip movement amplitude, and number of camera views. Extensive experiments demonstrate that our approach increases data processing and training speed by more than 5x for NeRF and 3DGS methods, while maintaining high-fidelity output. Project resources are available at xx.

[CV-88] DiffRegCD: Integrated Registration and Change Detection with Diffusion Features WACV2026

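[Quick Read]: This paper addresses the severe misalignment in real-world imagery caused by parallax, viewpoint shifts, and long temporal gaps, which degrades change detection (CD) models that assume co-registered inputs. Traditional two-stage register-then-detect methods and recent joint frameworks (e.g., BiFA, ChangeRD) still struggle with large displacements because they rely on regression-only flow, global homographies, or synthetic perturbations. The paper proposes DiffRegCD, an integrated framework that unifies dense registration and change detection in a single model. Its key ideas are to reformulate correspondence estimation as a Gaussian-smoothed classification task, which gives sub-pixel accuracy and stable training; to use frozen multi-scale features from a pretrained denoising diffusion model for robustness to illumination and viewpoint variation; and to derive supervision from controlled affine perturbations of standard CD datasets, which yields paired ground truth for both flow and change masks without pseudo labels.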

Link: https://arxiv.org/abs/2511.07935
Authors: Seyedehnanita Madani, Rama Chellappa, Vishal M. Patel
Affiliation: Johns Hopkins University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 10 pages, 8 figures. Accepted to WACV 2026

Abstract:Change detection (CD) is fundamental to computer vision and remote sensing, supporting applications in environmental monitoring, disaster response, and urban development. Most CD models assume co-registered inputs, yet real-world imagery often exhibits parallax, viewpoint shifts, and long temporal gaps that cause severe misalignment. Traditional two stage methods that first register and then detect, as well as recent joint frameworks (e.g., BiFA, ChangeRD), still struggle under large displacements, relying on regression only flow, global homographies, or synthetic perturbations. We present DiffRegCD, an integrated framework that unifies dense registration and change detection in a single model. DiffRegCD reformulates correspondence estimation as a Gaussian smoothed classification task, achieving sub-pixel accuracy and stable training. It leverages frozen multi-scale features from a pretrained denoising diffusion model, ensuring robustness to illumination and viewpoint variation. Supervision is provided through controlled affine perturbations applied to standard CD datasets, yielding paired ground truth for both flow and change detection without pseudo labels. Extensive experiments on aerial (LEVIR-CD, DSIFN-CD, WHU-CD, SYSU-CD) and ground level (VL-CMU-CD) datasets show that DiffRegCD consistently surpasses recent baselines and remains reliable under wide temporal and geometric variation, establishing diffusion features and classification based correspondence as a strong foundation for unified change detection.
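
Treating correspondence as Gaussian-smoothed classification can be pictured in one dimension: instead of regressing a displacement directly, the model predicts a distribution over candidate displacement bins, and the training target is a Gaussian centred on the true (possibly fractional) displacement. The sketch below illustrates that general idea under stated assumptions; the bin range, `sigma`, and the expectation-based decoding are illustrative choices, not details taken from the paper.

```python
import math

def gaussian_target(true_disp, bins, sigma=1.0):
    """Soft classification target: a normalised Gaussian over displacement bins."""
    w = [math.exp(-0.5 * ((b - true_disp) / sigma) ** 2) for b in bins]
    s = sum(w)
    return [v / s for v in w]

def expected_disp(probs, bins):
    """Decode a sub-pixel displacement as the expectation over the bins."""
    return sum(p * b for p, b in zip(probs, bins))

def cross_entropy(target, predicted, eps=1e-12):
    """Classification loss between the soft target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

bins = list(range(-8, 9))                 # integer candidate displacements
target = gaussian_target(2.3, bins)       # true displacement lies between two bins
decoded = expected_disp(target, bins)     # close to 2.3 although the bins are integers
uniform = [1.0 / len(bins)] * len(bins)
loss_uniform = cross_entropy(target, uniform)
loss_perfect = cross_entropy(target, target)
```

The soft target gives neighbouring bins partial credit, which tends to make training smoother than a one-hot label, while the expectation recovers sub-pixel precision from integer bins.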

[CV-89] Laytrol: Preserving Pretrained Knowledge in Layout Control for Multimodal Diffusion Transformers AAAI2026

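[Quick Read]: This paper addresses limited spatial controllability of diffusion models in text-to-image generation, focusing on layout-to-image generation, where existing methods inject the layout condition through adapter modules and end up with lower visual quality and a style inconsistent with the base model. The key to the solution has three parts. First, the Layout Synthesis (LaySyn) dataset is built from images synthesized by the base model itself, which mitigates the distribution shift from the pretraining data. Second, the Layout Control (Laytrol) network inherits its parameters from MM-DiT and uses a dedicated initialization scheme: the layout encoder is initialized as a pure text encoder so that its output tokens stay within the data domain of MM-DiT, and the outputs of the layout control network are initialized to zero, which activates the copied parameters effectively while avoiding disturbance from unstable control conditions. Third, Object-level Rotary Position Embedding is applied to the layout tokens to provide coarse positional information.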

Link: https://arxiv.org/abs/2511.07934
Authors: Sida Huang, Siqi Huang, Ping Luo, Hongyuan Zhang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026

Abstract:With the development of diffusion models, enhancing spatial controllability in text-to-image generation has become a vital challenge. As a representative task for addressing this challenge, layout-to-image generation aims to generate images that are spatially consistent with the given layout condition. Existing layout-to-image methods typically introduce the layout condition by integrating adapter modules into the base generative model. However, the generated images often exhibit low visual quality and stylistic inconsistency with the base model, indicating a loss of pretrained knowledge. To alleviate this issue, we construct the Layout Synthesis (LaySyn) dataset, which leverages images synthesized by the base model itself to mitigate the distribution shift from the pretraining data. Moreover, we propose the Layout Control (Laytrol) Network, in which parameters are inherited from MM-DiT to preserve the pretrained knowledge of the base model. To effectively activate the copied parameters and avoid disturbance from unstable control conditions, we adopt a dedicated initialization scheme for Laytrol. In this scheme, the layout encoder is initialized as a pure text encoder to ensure that its output tokens remain within the data domain of MM-DiT. Meanwhile, the outputs of the layout control network are initialized to zero. In addition, we apply Object-level Rotary Position Embedding to the layout tokens to provide coarse positional information. Qualitative and quantitative experiments demonstrate the effectiveness of our method.
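
Initializing the control network's outputs to zero means the combined model starts out behaving exactly like the base model. The toy sketch below shows that property, assuming the control signal is added to the base output through a linear output projection `w_out`. That additive wiring and all numbers are assumptions made for illustration; the abstract does not describe how the two branches are combined.

```python
def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def controlled_output(base_out, control_feat, w_out):
    """Add the control branch's projected features to the base model's output."""
    delta = matvec(w_out, control_feat)
    return [b + d for b, d in zip(base_out, delta)]

base_out = [0.2, -0.7, 1.5]     # stand-in for a base-model block output
control_feat = [3.0, -1.0]      # stand-in for layout-branch features

# Zero-initialised projection: the control branch contributes nothing at step 0.
w_zero = [[0.0, 0.0] for _ in base_out]
start = controlled_output(base_out, control_feat, w_zero)

# After some hypothetical training the projection is non-zero and layout information flows in.
w_trained = [[0.1, 0.0], [0.0, 0.0], [0.0, 0.2]]
later = controlled_output(base_out, control_feat, w_trained)
```

Training therefore begins from the pretrained behaviour and departs from it only as far as the layout supervision requires.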

[CV-90] IBMA: An Imputation-Based Mixup Augmentation Using Self-Supervised Learning for Time Series Data AAAI2025

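[Quick Read]: This paper addresses the shortage of data augmentation strategies for time series forecasting: compared with domains such as images or text, time series data has fewer augmentation methods available, and advanced techniques such as Mixup are rarely applied. The key to the solution is a new augmentation method, Imputation-Based Mixup Augmentation (IBMA), which combines imputation-augmented data with the Mixup strategy to introduce diversity while preserving the intrinsic temporal patterns of the series, thereby improving model generalization and forecasting performance. Experiments show that IBMA outperforms eight existing augmentation techniques on several mainstream time series models (DLinear, TimesNet, and iTransformer), and performs best when iTransformer is used for imputation.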

Link: https://arxiv.org/abs/2511.07930
Authors: Dang Nha Nguyen, Hai Dang Nguyen, Khoa Tho Anh Nguyen
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 1 figure, 1 table, accepted at the AAAI2025 conference

Abstract: Data augmentation in time series forecasting plays a crucial role in enhancing model performance by introducing variability while maintaining the underlying temporal patterns. However, time series data offers fewer augmentation strategies than fields such as image or text, and advanced techniques like Mixup are rarely used. In this work, we propose a novel approach, Imputation-Based Mixup Augmentation (IBMA), which combines Imputation-Augmented data with Mixup augmentation to bolster model generalization and improve forecasting performance. We evaluate the effectiveness of this method across several forecasting models, including DLinear (MLP), TimesNet (CNN), and iTransformer (Transformer), which represent some of the most recent advances in time series forecasting. Our experiments, conducted on four datasets (ETTh1, ETTh2, ETTm1, ETTm2) and compared against eight other augmentation techniques, demonstrate that IBMA consistently enhances performance, achieving 22 improvements out of 24 instances, with 10 of those being the best performances, particularly with iTransformer imputation.
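
A minimal sketch of the two ingredients named above, under loud assumptions: linear interpolation stands in for the learned (self-supervised) imputation model, and mixing a window with its own imputed version is just one plausible reading of how the two are combined. The function names and the Beta parameter `alpha` are illustrative, not taken from the paper.

```python
import random

def mask_and_impute(series, mask_ratio, rng):
    """Hide random interior points and refill them by linear interpolation.
    Linear interpolation is a stand-in for a learned imputation model."""
    n = len(series)
    interior = list(range(1, n - 1))  # endpoints stay known so every gap is bracketed
    hidden = set(rng.sample(interior, int(mask_ratio * len(interior))))
    known = [i for i in range(n) if i not in hidden]
    imputed = list(series)
    for i in sorted(hidden):
        left = max(k for k in known if k < i)
        right = min(k for k in known if k > i)
        weight = (i - left) / (right - left)
        imputed[i] = (1 - weight) * series[left] + weight * series[right]
    return imputed

def mixup(a, b, alpha, rng):
    """Standard Mixup: convex combination with lambda drawn from Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    return [lam * x + (1 - lam) * y for x, y in zip(a, b)], lam

rng = random.Random(0)
window = [float(i) for i in range(10)]
augmented, lam = mixup(window, mask_and_impute(window, 0.5, rng), 0.5, rng)
```

Because the mixed sample is a convex combination of two closely related series, it stays near the original temporal pattern while still adding variation.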

[CV-91] Federated CLIP for Resource-Efficient Heterogeneous Medical Image Classification AAAI2026

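[Quick Read]: This paper addresses two major challenges that federated learning (FL) faces in medical image classification: data heterogeneity, which degrades model performance, and the high communication and computation cost of using vision language models (VLMs). To tackle them, the paper proposes FedMedCLIP, a new FL method based on Contrastive Language-Image Pre-training (CLIP). Its core contributions are a masked feature adaptation module (FAM) that reduces the communication load while the CLIP encoders stay frozen; a masked multi-layer perceptron (MLP) that serves as a private local classifier adapted to each client's task; and an adaptive Kullback-Leibler (KL) divergence-based distillation regularizer that enables mutual learning between the FAM and the MLP. Combined with model compression and ensemble prediction, the method improves efficiency substantially (for example, 120× faster than FedAVG) while maintaining classification accuracy.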

Link: https://arxiv.org/abs/2511.07929
Authors: Yihang Wu, Ahmad Chaddad
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in AAAI 2026 Main track. Code is available at this https URL

Abstract: Despite the remarkable performance of deep models in medical imaging, they still require source data for training, which limits their potential in light of privacy concerns. Federated learning (FL), as a decentralized learning framework that trains a shared model with multiple hospitals (a.k.a. FL clients), provides a feasible solution. However, data heterogeneity and resource costs hinder the deployment of FL models, especially when using vision language models (VLM). To address these challenges, we propose a novel contrastive language-image pre-training (CLIP) based FL approach for medical image classification (FedMedCLIP). Specifically, we introduce a masked feature adaptation module (FAM) as a communication module to reduce the communication load while freezing the CLIP encoders to reduce the computational overhead. Furthermore, we propose a masked multi-layer perceptron (MLP) as a private local classifier to adapt to the client tasks. Moreover, we design an adaptive Kullback-Leibler (KL) divergence-based distillation regularization method to enable mutual learning between FAM and MLP. Finally, we incorporate model compression to transmit the FAM parameters while using ensemble predictions for classification. Extensive experiments on four publicly available medical datasets demonstrate that our model provides feasible performance (e.g., 8% higher compared to second best baseline on ISIC2019) with reasonable resource cost (e.g., 120× faster than FedAVG).
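
The mutual-learning term between the two heads can be sketched as a KL-based distillation loss. This is an assumption-laden illustration: the symmetric sum below is one common form of mutual distillation, and the paper's adaptive weighting is not specified in the abstract, so it is omitted. The function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_distillation_loss(logits_a, logits_b):
    """Symmetric KL between two heads' predictions (assumed form, fixed weight)."""
    p, q = softmax(logits_a), softmax(logits_b)
    return kl_divergence(p, q) + kl_divergence(q, p)

# Two heads that disagree pay a positive penalty; identical heads pay none.
print(mutual_distillation_loss([2.0, 0.5, -1.0], [0.1, 1.5, 0.0]))
print(mutual_distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```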

[CV-92] An Image-Based Path Planning Algorithm Using a UAV Equipped with Stereo Vision

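[Quick Read]: This paper addresses safe path planning for UAVs over complex terrain, particularly when obstacles such as craters and hills cannot be distinguished in a two-dimensional image. The key to the solution is to generate a disparity map from stereo vision captured by the UAV, combine it with computer vision techniques such as edge, line, and corner detection to extract candidate way-points, and detect the start and goal points automatically through ArUco marker pose estimation and circle detection, yielding a safer and more accurate path plan.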

Link: https://arxiv.org/abs/2511.07928
Authors: Selim Ahmet Iz, Mustafa Unel
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:This paper presents a novel image-based path planning algorithm that was developed using computer vision techniques, as well as its comparative analysis with well-known deterministic and probabilistic algorithms, namely A* and Probabilistic Road Map algorithm (PRM). The terrain depth has a significant impact on the calculated path safety. The craters and hills on the surface cannot be distinguished in a two-dimensional image. The proposed method uses a disparity map of the terrain that is generated by using a UAV. Several computer vision techniques, including edge, line and corner detection methods, as well as the stereo depth reconstruction technique, are applied to the captured images and the found disparity map is used to define candidate way-points of the trajectory. The initial and desired points are detected automatically using ArUco marker pose estimation and circle detection techniques. After presenting the mathematical model and vision techniques, the developed algorithm is compared with well-known algorithms on different virtual scenes created in the V-REP simulation program and a physical setup created in a laboratory environment. Results are promising and demonstrate effectiveness of the proposed algorithm.
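
To see why a disparity map separates craters from hills, recall the standard relation for a rectified stereo pair, Z = f·B / d, where f is the focal length in pixels, B the baseline, and d the disparity. The sketch below uses made-up camera values; it is the textbook relation, not the paper's full pipeline.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        return float("inf")  # zero disparity corresponds to a point at infinity
    return focal_px * baseline_m / disparity_px

# Hypothetical camera: 800 px focal length, 0.1 m baseline.
flat_ground = depth_from_disparity(20.0, 800.0, 0.1)
hill_top = depth_from_disparity(25.0, 800.0, 0.1)   # larger disparity, closer to the camera
crater_floor = depth_from_disparity(16.0, 800.0, 0.1)  # smaller disparity, farther away
```

Seen from above, a hill and a crater can look alike in a single 2D image, but their disparities fall on opposite sides of the flat-ground value.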

[CV-93] CNN-Based Automated Parameter Extraction Framework for Modeling Memristive Devices

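[Quick Read]: This paper addresses the fact that existing compact models for resistive random access memory (RRAM) rely on many fitting parameters that lack direct physical interpretation, so parameter extraction requires extensive manual tuning, is slow, and adapts poorly across devices. The key to the solution is an automated framework in which a convolutional neural network (CNN) produces initial parameter estimates directly from the device I-V characteristics, and three heuristic optimization blocks (minimizing error via adaptive binary search) then refine those parameters. This gives fast, reliable, and robust model parameter extraction and markedly improves modeling efficiency and cross-device adaptability.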

Link: https://arxiv.org/abs/2511.07926
Authors: Akif Hamid, Orchi Hassan
Affiliation: Bangladesh University of Engineering and Technology
Categories: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Resistive random access memory (RRAM) is a promising candidate for next-generation nonvolatile memory (NVM) and in-memory computing applications. Compact models are essential for analyzing the circuit and system-level performance of experimental RRAM devices. However, most existing RRAM compact models rely on multiple fitting parameters to reproduce the device I-V characteristics, and in most cases, as the parameters are not directly related to measurable quantities, their extraction requires extensive manual tuning, making the process time-consuming and limiting adaptability across different devices. This work presents an automated framework for extracting the fitting parameters of the widely used Stanford RRAM model directly from the device I-V characteristics. The framework employs a convolutional neural network (CNN) trained on a synthetic dataset to generate initial parameter estimates, which are then refined through three heuristic optimization blocks that minimize errors via adaptive binary search in the parameter space. We evaluated the framework using four key NVM metrics: set voltage, reset voltage, hysteresis loop area, and low resistance state (LRS) slope. Benchmarking against RRAM device characteristics derived from previously reported Stanford model fits, other analytical models, and experimental data shows that the framework achieves low error across diverse device characteristics, offering a fast, reliable, and robust solution for RRAM modeling.
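
The refinement stage can be pictured with a one-parameter toy version of an adaptive binary search. This is a hypothetical sketch, not the paper's three optimization blocks: `simulate` stands in for running the compact model and reading off one metric (say, set voltage), and it is assumed to increase monotonically with the parameter.

```python
def refine_parameter(simulate, target, start, step, tol=1e-6, max_iter=200):
    """Step toward the target metric and halve the step whenever the error
    changes sign (an overshoot). Assumes `simulate` is monotonically increasing."""
    value = start
    error = simulate(value) - target
    for _ in range(max_iter):
        if abs(error) <= tol:
            break
        value += -step if error > 0 else step
        new_error = simulate(value) - target
        if (new_error > 0) != (error > 0):
            step /= 2.0  # overshot the target: search more finely
        error = new_error
    return value

# Toy "device model": the metric grows linearly with the parameter.
toy_model = lambda v: 2.0 * v + 1.0
fitted = refine_parameter(toy_model, target=4.5, start=0.0, step=1.0)
```

In the framework described above, the CNN's estimate would play the role of `start`, which keeps the search short.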

[CV-94] HD2-SSC: High-Dimension High-Density Semantic Scene Completion for Autonomous Driving AAAI2026

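[Quick Read]: This paper addresses two key problems in camera-based 3D semantic scene completion (SSC): the input-output dimension gap and the annotation-reality density gap. The former is the mismatch between 2D image inputs and 3D voxel outputs; the latter is that a 2D planar view with sparse annotations struggles to predict the dense 3D occupancy of the real world. To meet these challenges, the paper proposes the HD²-SSC framework, built on two modules. The High-dimension Semantic Decoupling module expands 2D image features along a pseudo third dimension, separates coarse pixel semantics from occlusions, and focuses on regions with fine semantics to enrich the features. The High-density Occupancy Refinement module uses a "detect-and-refine" architecture that exploits contextual geometric and semantic structure to fill in missing voxels and correct erroneous ones, improving the density and accuracy of 3D scene completion.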

Link: https://arxiv.org/abs/2511.07925
Authors: Zhiwen Yang, Yuxin Peng
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures, accepted by AAAI 2026

Abstract: Camera-based 3D semantic scene completion (SSC) plays a crucial role in autonomous driving, enabling voxelized 3D scene understanding for effective scene perception and decision-making. Existing SSC methods have shown efficacy in improving 3D scene representations, but suffer from the inherent input-output dimension gap and annotation-reality density gap, where the 2D planar view from input images with sparse annotated labels leads to inferior prediction of real-world dense occupancy with a 3D stereoscopic view. In light of this, we propose the corresponding High-Dimension High-Density Semantic Scene Completion (HD²-SSC) framework with expanded pixel semantics and refined voxel occupancies. To bridge the dimension gap, a High-dimension Semantic Decoupling module is designed to expand 2D image features along a pseudo third dimension, decoupling coarse pixel semantics from occlusions, and then identify focal regions with fine semantics to enrich image features. To mitigate the density gap, a High-density Occupancy Refinement module is devised with a "detect-and-refine" architecture to leverage contextual geometric and semantic structures for enhanced semantic density with the completion of missing voxels and correction of erroneous ones. Extensive experiments and analyses on the SemanticKITTI and SSCBench-KITTI-360 datasets validate the effectiveness of our HD²-SSC framework.

[CV-95] Exploring the Underwater World Segmentation without Extra Training

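[Quick Read]: This paper addresses accurate segmentation of underwater organisms in support of marine biodiversity monitoring and ecological assessment. Existing datasets and models are largely limited to terrestrial scenes, and underwater open-vocabulary (OV) segmentation has not been studied systematically. The key to the solution is Earth2Ocean, a training-free cross-domain transfer framework with two core modules: a Geometric-guided Visual Mask Generator (GMG), which strengthens local structure perception through self-similarity geometric priors, and a Category-visual Semantic Alignment (CSA) module, which improves text embeddings through multimodal large language model reasoning and scene-aware template construction. Together they transfer terrestrial vision-language models (VLMs) to the underwater domain in a zero-shot manner and substantially improve underwater open-vocabulary segmentation.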

Link: https://arxiv.org/abs/2511.07923
Authors: Bingyu Li, Tao Huo, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li
Affiliation: University of Science and Technology of China; Institute of Artificial Intelligence (TeleAI), China Telecom; Northwestern Polytechnical University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Accurate segmentation of marine organisms is vital for biodiversity monitoring and ecological assessment, yet existing datasets and models remain largely limited to terrestrial scenes. To bridge this gap, we introduce AquaOV255, the first large-scale and fine-grained underwater segmentation dataset containing 255 categories and over 20K images, covering diverse categories for open-vocabulary (OV) evaluation. Furthermore, we establish the first underwater OV segmentation benchmark, UOVSBench, by integrating AquaOV255 with five additional underwater datasets to enable comprehensive evaluation. Alongside, we present Earth2Ocean, a training-free OV segmentation framework that transfers terrestrial vision–language models (VLMs) to underwater domains without any additional underwater training. Earth2Ocean consists of two core components: a Geometric-guided Visual Mask Generator (GMG) that refines visual features via self-similarity geometric priors for local structure perception, and a Category-visual Semantic Alignment (CSA) module that enhances text embeddings through multimodal large language model reasoning and scene-aware template construction. Extensive experiments on the UOVSBench benchmark demonstrate that Earth2Ocean achieves significant performance improvement on average while maintaining efficient inference.

[CV-96] Theoretical Analysis of Power-law Transformation on Images for Text Polarity Detection

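[Quick Read]: This paper addresses text polarity detection in image binarization, that is, determining the contrast relationship between text and background (dark text on a bright background or bright text on a dark background) so that binarization can be done well. The key to the solution is a theoretical analysis of images under power-law transformation. Treating text and background as two classes, the analysis explains how the between-class variance in the histogram statistics of the transformed image behaves: for dark text on a bright background the between-class variance increases as the transformation is strengthened, whereas for bright text on a dark background it decreases. This provides a theoretical basis for determining text polarity automatically and thereby improving binarization.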

Link: https://arxiv.org/abs/2511.07916
Authors: Narendra Singh Yadav, Pavan Kumar Perepu
Affiliation: Indian Institute of Information Technology, Sri City, India
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: In several computer vision applications, such as vehicle license plate recognition, captcha recognition, and printed or handwritten character recognition from images, text polarity detection and binarization are important preprocessing tasks. To analyze an image, it has to be converted to a simple binary image. This binarization process requires knowledge of the polarity of the text in the image. Text polarity is defined as the contrast of text with respect to background: the text is either darker than the background (dark text on bright background) or vice versa. The binarization process uses this polarity information to convert the original colour or grayscale image into a binary image. In the literature, there is an intuitive approach based on power-law transformation of the original images. In this approach, the authors illustrated an interesting phenomenon in the histogram statistics of the transformed images. Considering text and background as two classes, they observed that the maximum between-class variance between the two classes increases (decreases) for dark (bright) text on bright (dark) background. The corresponding empirical results have been presented. In this paper, we present a theoretical analysis of the above phenomenon.
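
The two quantities the abstract refers to can be computed in a few lines: the power-law (gamma) transformation s = r^γ on intensities in [0, 1], and the maximum between-class variance max over thresholds of w0·w1·(μ0 − μ1)², which is the Otsu criterion. The sketch below is a plain illustration of those definitions; the 256-level quantization and the toy image are assumptions, and it makes no claim about which direction the variance moves for a given γ.

```python
def power_law(pixels, gamma):
    """Power-law transformation s = r ** gamma for intensities r in [0, 1]."""
    return [r ** gamma for r in pixels]

def max_between_class_variance(pixels, levels=256):
    """Otsu criterion: max over thresholds of w0 * w1 * (mu0 - mu1) ** 2."""
    hist = [0] * levels
    for r in pixels:
        hist[min(int(r * (levels - 1) + 0.5), levels - 1)] += 1
    total = len(pixels)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best, w0, sum0 = 0.0, 0, 0
    for t in range(levels - 1):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty at this threshold
        mu0 = sum0 / w0
        mu1 = (total_sum - sum0) / w1
        best = max(best, (w0 / total) * (w1 / total) * (mu0 - mu1) ** 2)
    return best / (levels - 1) ** 2  # rescale to [0, 1] intensity units

# Toy image: 20% dark "text" pixels on a bright background.
image = [0.2] * 20 + [0.8] * 80
baseline_variance = max_between_class_variance(image)
```

Sweeping `gamma` through `power_law` and recomputing the variance is a simple way to probe the trend that the paper analyzes theoretically.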

[CV-97] Generating Sketches in a Hierarchical Auto-Regressive Process for Flexible Sketch Drawing Manipulation at Stroke-Level AAAI2026

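[Quick Read]: This paper addresses the limited flexibility of existing generative AI methods for controlling the sketch drawing process: current approaches that edit stroke embeddings must receive all conditions at once before generation starts and cannot be adjusted during generation. The key to the solution is a hierarchical auto-regressive sketch generation process in which each stroke is generated in three stages (predicting a stroke embedding, anchoring its position, and translating it into drawing actions), with the auto-regressive mechanism using previously generated strokes and their positions to guide the current stroke. This allows flexible stroke-level manipulation at any point during generation.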

Link: https://arxiv.org/abs/2511.07889
Authors: Sicong Zang, Shuhui Gao, Zhijun Fang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026

Abstract: Generating sketches with specific expected patterns, i.e., manipulating sketches in a controllable way, is a popular task. Recent studies control sketch features at stroke level by editing the values of stroke embeddings used as conditions. However, to give the generator a global view of the sketch that is going to be drawn, all these edited conditions must be collected and fed into the generator simultaneously before generation starts, i.e., no further manipulation is allowed during the sketch generation process. To make sketch drawing manipulation more flexible, we propose a hierarchical auto-regressive sketch generation process. Instead of generating an entire sketch at once, each stroke in a sketch is generated in a three-stage hierarchy: 1) predicting a stroke embedding to represent which stroke is going to be drawn, 2) anchoring the predicted stroke on the canvas, and 3) translating the embedding into a sequence of drawing actions to form the full sketch. Moreover, stroke prediction, anchoring, and translation proceed auto-regressively, i.e., both the recently generated strokes and their positions are considered when predicting the current one, guiding the model to produce an appropriate stroke at a suitable position to benefit the full sketch generation. Stroke-level sketch drawing can be manipulated flexibly at any time during generation by adjusting the exposed editable stroke embeddings.

[CV-98] Visual Bridge: Universal Visual Perception Representations Generating AAAI2026

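[Quick Read]: This paper addresses the "single-task-single-model" paradigm that dominates current diffusion models in computer vision and severely limits their generalizability and scalability in multi-task scenarios. The key to the solution is a universal visual perception framework based on flow matching, which models the generation of representations for different tasks as a single flow-matching problem from image patch tokens to task-specific representations, rather than as independent generation or regression tasks. Using a strong self-supervised foundation model as the anchor and a multi-scale, circular task embedding mechanism, the method learns a universal velocity field that bridges the gap between heterogeneous tasks and supports efficient, flexible representation transfer.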

Link: https://arxiv.org/abs/2511.07877
Authors: Yilin Gao, Shuguang Dou, Junzhou Li, Zhiheng Yu, Yin Li, Dongsheng Jiang, Shugong Xu
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI2026

Abstract: Recent advances in diffusion models have achieved remarkable success in isolated computer vision tasks such as text-to-image generation, depth estimation, and optical flow. However, these models are often restricted by a "single-task-single-model" paradigm, severely limiting their generalizability and scalability in multi-task scenarios. Motivated by the cross-domain generalization ability of large language models, we propose a universal visual perception framework based on flow matching that can generate diverse visual representations across multiple tasks. Our approach formulates the process as a universal flow-matching problem from image patch tokens to task-specific representations rather than an independent generation or regression problem. By leveraging a strong self-supervised foundation model as the anchor and introducing a multi-scale, circular task embedding mechanism, our method learns a universal velocity field to bridge the gap between heterogeneous tasks, supporting efficient and flexible representation transfer. Extensive experiments on classification, detection, segmentation, depth estimation, and image-text retrieval demonstrate that our model achieves competitive performance in both zero-shot and fine-tuned settings, outperforming prior generalist and several specialist models. Ablation studies further validate the robustness, scalability, and generalization of our framework. Our work marks a significant step towards general-purpose visual perception, providing a solid foundation for future research in universal vision modeling.
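
For readers new to flow matching, the sketch below shows the training target and sampling loop in its most common form, assuming a straight-line path x_t = (1 − t)·x0 + t·x1 between a source vector x0 (here playing the role of image patch tokens) and a target x1 (a task-specific representation). The abstract does not state which path or solver the paper uses, so the linear path, the Euler integrator, and all names are assumptions.

```python
def flow_matching_pair(x0, x1, t):
    """Point on the straight path at time t and its constant target velocity x1 - x0."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity

def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    squared = [(p - v) ** 2 for p, v in zip(predicted_velocity, target_velocity)]
    return sum(squared) / len(squared)

def euler_integrate(x0, velocity_fn, steps):
    """Follow a velocity field from t = 0 to t = 1 with fixed-size Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

source, target = [0.0, 1.0], [2.0, -1.0]
_, true_velocity = flow_matching_pair(source, target, 0.5)
# With a perfect velocity model, integration lands exactly on the target.
reached = euler_integrate(source, lambda x, t: true_velocity, steps=4)
```

A task embedding would enter as an extra argument to the learned velocity function, so that one field can serve several tasks.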

[CV-99] MonoCLUE: Object-Aware Clustering Enhances Monocular 3D Object Detection AAAI2026

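[Quick Read]: This paper addresses the ill-posed depth and limited field of view in monocular 3D object detection, which lead to missing geometric cues and reduced accuracy in occluded or truncated scenes. The key to the solution is MonoCLUE, which makes better use of visual features by combining local clustering with a generalized scene memory. First, K-means clustering is applied to visual features within an image to extract object-level appearance parts (such as the bonnet or car roof), improving recognition of partially visible objects. Second, clustered features are aggregated across images into a generalized scene memory that provides object-level representations consistent across scenes, stabilizing detection under changing environments. Finally, both kinds of features are integrated into the object queries to guide attention toward informative regions, yielding robust monocular 3D detection and state-of-the-art performance on the KITTI benchmark.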

Link: https://arxiv.org/abs/2511.07862
Authors: Sunghun Yang, Minhyeok Lee, Jungho Lee, Sangyoun Lee
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI 2026

Abstract:Monocular 3D object detection offers a cost-effective solution for autonomous driving but suffers from ill-posed depth and limited field of view. These constraints cause a lack of geometric cues and reduced accuracy in occluded or truncated scenes. While recent approaches incorporate additional depth information to address geometric ambiguity, they overlook the visual cues crucial for robust recognition. We propose MonoCLUE, which enhances monocular 3D detection by leveraging both local clustering and generalized scene memory of visual features. First, we perform K-means clustering on visual features to capture distinct object-level appearance parts (e.g., bonnet, car roof), improving detection of partially visible objects. The clustered features are propagated across regions to capture objects with similar appearances. Second, we construct a generalized scene memory by aggregating clustered features across images, providing consistent representations that generalize across scenes. This improves object-level feature consistency, enabling stable detection across varying environments. Lastly, we integrate both local cluster features and generalized scene memory into object queries, guiding attention toward informative regions. Exploiting a unified local clustering and generalized scene memory strategy, MonoCLUE enables robust monocular 3D detection under occlusion and limited visibility, achieving state-of-the-art performance on the KITTI benchmark.
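
The local clustering step relies on plain K-means over feature vectors. Below is a small dependency-free sketch of Lloyd's algorithm; the 2-D "features", the fixed initial centroids, and the iteration count are illustrative assumptions, and the propagation and memory stages described in the abstract are not shown.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    centroids = [list(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centroids)),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])))
            for point in points
        ]
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment

# Toy 2-D features forming two appearance groups.
features = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels = kmeans(features, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

In the detector, each cluster center would act as a compact descriptor of one appearance part that object queries can attend to.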

[CV-100] CloudMamba: Grouped Selective State Spaces for Point Cloud Analysis AAAI’26

Link: https://arxiv.org/abs/2511.07823
Authors: Kanglin Qu, Pan Gao, Qun Dai, Zhanzhi Ye, Rui Ye, Yuanhao Sun
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI '26

[CV-101] SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

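[Quick Read]: This paper addresses the fact that current neural controllers for humanoid robots are small, cover a limited set of behaviors, and are trained with modest resources, and asks how scaling can deliver more natural and robust whole-body control. The key to the solution is to treat motion tracking as the foundational task, using dense supervision from diverse motion-capture data to learn human motion priors without manual reward engineering. The authors scale along three axes, namely model size (from 1.2M to 42M parameters), dataset volume (over 100M frames, 700 hours of high-quality motion data), and compute (9k GPU hours), to build a general foundation model for humanoid control. Performance improves steadily with more compute and more diverse data, and the model generalizes well: it supports multiple input interfaces (such as VR teleoperation, human videos, and vision-language-action models) and executes downstream tasks through a real-time universal kinematic planner, offering a scalable route to general humanoid control.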

Link: https://arxiv.org/abs/2511.07820
Authors: Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Sirui Chen, Fernando Castañeda, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, Xingye Da, Runyu Ding, Cyrus Hogg, Lina Song, Edy Lim, Eugene Jeong, Tairan He, Haoru Xue, Wenli Xiao, Zi Wang, Simon Yuen, Jan Kautz, Yan Chang, Umar Iqbal, Linxi "Jim" Fan, Yuke Zhu
Affiliation: Nvidia
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Systems and Control (eess.SY)
Comments: Project page: this https URL

Abstract: Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited behavior set, and are trained on a handful of GPUs over several days. We show that scaling up model capacity, data, and compute yields a generalist humanoid controller capable of creating natural and robust whole-body movements. Specifically, we posit motion tracking as a natural and scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (from 1.2M to 42M parameters), dataset volume (over 100M frames, 700 hours of high-quality motion data), and compute (9k GPU hours). Beyond demonstrating the benefits of scale, we show the practical utility of our model through two mechanisms: (1) a real-time universal kinematic planner that bridges motion tracking to downstream task execution, enabling natural and interactive control, and (2) a unified token space that supports various motion input interfaces, such as VR teleoperation devices, human videos, and vision-language-action (VLA) models, all using the same policy. Scaling motion tracking exhibits favorable properties: performance improves steadily with increased compute and data diversity, and learned representations generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.

[CV-102] Human Motion Synthesis in 3D Scenes via Unified Scene Semantic Occupancy

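[Quick Read]: This paper addresses the tendency of current methods for human motion synthesis in 3D scenes to rely heavily on scene structure while neglecting semantic understanding. The core of the solution is a unified Scene Semantic Occupancy (SSO) representation and the SSOMotion framework built on it. The key innovations are a bi-directional tri-plane decomposition that compresses the SSO representation, and the mapping of scene semantics into a unified feature space via CLIP encoding and shared linear dimensionality reduction, which preserves fine-grained semantic structure while significantly reducing redundant computation. These scene hints, together with movement directions derived from instructions, then control motion through frame-wise scene queries for more precise human motion generation.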

Link: https://arxiv.org/abs/2511.07819
Authors: Gong Jingyu, Tong Kunkun, Chen Zhuoran, Yuan Chuanhan, Chen Mingang, Zhang Zhizhong, Tan Xin, Xie Yuan
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Human motion synthesis in 3D scenes relies heavily on scene comprehension, while current methods focus mainly on scene structure but ignore semantic understanding. In this paper, we propose a human motion synthesis framework that takes a unified Scene Semantic Occupancy (SSO) for scene representation, termed SSOMotion. We design a bi-directional tri-plane decomposition to derive a compact version of the SSO, and scene semantics are mapped to a unified feature space via CLIP encoding and shared linear dimensionality reduction. This strategy can derive the fine-grained scene semantic structures while significantly reducing redundant computations. We further take these scene hints and the movement direction derived from instructions for motion control via frame-wise scene query. Extensive experiments and ablation studies conducted on cluttered scenes using ShapeNet furniture, as well as scanned scenes from the PROX and Replica datasets, demonstrate its cutting-edge performance while validating its effectiveness and generalization ability. Code will be publicly available at this https URL.

[CV-103] Cancer-Net PCa-MultiSeg: Multimodal Enhancement of Prostate Cancer Lesion Segmentation Using Synthetic Correlated Diffusion Imaging ML4H2025

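[Quick Read]: This paper addresses the limited performance of current deep learning methods for prostate cancer lesion segmentation in large patient cohorts (Dice scores of 0.32 or lower). The key to the solution is synthetic correlated diffusion imaging (CDI^s), which derives additional information from existing diffusion-weighted imaging (DWI) data and improves segmentation without extra scan time or changes to the network architecture. Experiments show that CDI^s integration enhances or preserves segmentation performance in 94% of evaluated configurations, and that combining CDI^s with DWI yields statistically significant gains in half of the architectures with no instance of degradation; individual architectures reach up to 72.5% relative improvement. This offers a practical drop-in enhancement for clinical use.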

Link: https://arxiv.org/abs/2511.07816
Authors: Jarett Dewbury, Chi-en Amy Tai, Alexander Wong
Affiliation: University of Waterloo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ML4H 2025 Findings

Abstract: Current deep learning approaches for prostate cancer lesion segmentation achieve limited performance, with Dice scores of 0.32 or lower in large patient cohorts. To address this limitation, we investigate synthetic correlated diffusion imaging (CDI^s) as an enhancement to standard diffusion-based protocols. We conduct a comprehensive evaluation across six state-of-the-art segmentation architectures using 200 patients with co-registered CDI^s, diffusion-weighted imaging (DWI) and apparent diffusion coefficient (ADC) sequences. We demonstrate that CDI^s integration reliably enhances or preserves segmentation performance in 94% of evaluated configurations, with individual architectures achieving up to 72.5% statistically significant relative improvement over baseline modalities. CDI^s + DWI emerges as the safest enhancement pathway, achieving significant improvements in half of evaluated architectures with zero instances of degradation. Since CDI^s derives from existing DWI acquisitions without requiring additional scan time or architectural modifications, it enables immediate deployment in clinical workflows. Our results establish validated integration pathways for CDI^s as a practical drop-in enhancement for PCa lesion segmentation tasks across diverse deep learning architectures.
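
The two metrics quoted above are easy to state precisely. The Dice score is 2·|A ∩ B| / (|A| + |B|) for a predicted mask A and a reference mask B, and a relative improvement is (new − old) / old. The masks and scores in this sketch are made-up illustration values, not results from the paper.

```python
def dice_score(pred, truth):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two binary masks given as flat lists."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    return 1.0 if size == 0 else 2.0 * intersection / size

def relative_improvement(baseline, enhanced):
    """Fractional gain of `enhanced` over `baseline`, e.g. 0.25 means +25%."""
    return (enhanced - baseline) / baseline

# Hypothetical 4-pixel masks: one overlapping pixel out of two predicted and two true.
overlap = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
# Hypothetical scores: going from 0.40 to 0.50 is a 25% relative improvement.
gain = relative_improvement(0.40, 0.50)
```

Note that a large relative improvement on a low baseline can still correspond to a modest absolute Dice score.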

[CV-104] Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views

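[Quick Read]: This paper addresses the low accuracy and poor efficiency of training-free 3D scene understanding methods in practical deployment. The core of the solution is Sparse3DPR, a training-free framework built on pre-trained large language models (LLMs) that needs only sparse-view RGB inputs for open-vocabulary scene understanding. Its key innovations are a hierarchical plane-enhanced scene graph, which uses dominant planar structures as spatial anchors to make reasoning chains clearer, and a task-adaptive subgraph extraction mechanism that dynamically filters out irrelevant information, reducing contextual noise and improving reasoning efficiency and accuracy.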

Link: https://arxiv.org/abs/2511.07813
Authors: Haida Feng, Hao Wei, Zewen Xu, Haolin Wang, Chade Li, Yihong Wu
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recently, large language models (LLMs) have been explored widely for 3D scene understanding. Among them, training-free approaches are gaining attention for their flexibility and generalization over training-based methods. However, they typically struggle with accuracy and efficiency in practical deployment. To address the problems, we propose Sparse3DPR, a novel training-free framework for open-ended scene understanding, which leverages the reasoning capabilities of pre-trained LLMs and requires only sparse-view RGB inputs. Specifically, we introduce a hierarchical plane-enhanced scene graph that supports open vocabulary and adopts dominant planar structures as spatial anchors, which enables clearer reasoning chains and more reliable high-level inferences. Furthermore, we design a task-adaptive subgraph extraction method to filter query-irrelevant information dynamically, reducing contextual noise and improving 3D scene reasoning efficiency and accuracy. Experimental results demonstrate the superiority of Sparse3DPR, which achieves a 28.7% EM@1 improvement and a 78.2% speedup compared with ConceptGraphs on the Space3D-Bench. Moreover, Sparse3DPR obtains comparable performance to training-based methods on ScanQA, with additional real-world experiments confirming its robustness and generalization capability.
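
The idea of task-adaptive subgraph extraction can be sketched with a deliberately naive keyword filter over a toy scene graph. This is purely hypothetical: the node labels, the `(source, relation, target)` edge format, and the one-hop keyword rule are all assumptions for illustration, and the paper's actual selection mechanism is not described in the abstract.

```python
def extract_subgraph(nodes, edges, query):
    """Keep nodes whose label appears in the query, plus their direct neighbours."""
    words = set(query.lower().replace("?", " ").split())
    seeds = {nid for nid, label in nodes.items() if label.lower() in words}
    keep = set(seeds)
    for a, _relation, b in edges:
        if a in seeds or b in seeds:
            keep.update((a, b))
    kept_edges = [(a, r, b) for a, r, b in edges if a in keep and b in keep]
    return {nid: nodes[nid] for nid in keep}, kept_edges

# Toy scene graph with the floor plane acting as a spatial anchor.
nodes = {1: "floor", 2: "table", 3: "mug", 4: "sofa", 5: "lamp"}
edges = [(2, "on", 1), (3, "on", 2), (4, "on", 1), (5, "next to", 4)]
sub_nodes, sub_edges = extract_subgraph(nodes, edges, "What is on the table?")
```

Only the table, the mug on it, and the supporting floor survive, so the prompt handed to the LLM omits the sofa and lamp entirely.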

[CV-105] Revisiting MLLM Based Image Quality Assessment: Errors and Remedy

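[Quick Read]: This paper addresses the inherent mismatch, in image quality assessment (IQA) with multi-modal large language models (MLLMs), between discrete token outputs and the continuous quality scores the task requires. Existing methods incur errors when converting discrete predictions into continuous scores, and the semantic confusion introduced by level tokens (such as "good") further weakens both IQA performance and the original capabilities of the MLLM. The key to the solution is Q-Scorer, a simple yet effective framework that integrates a lightweight regression module and IQA-specific score tokens into the MLLM pipeline, giving more accurate continuous score prediction and reducing semantic confusion.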

Link: https://arxiv.org/abs/2511.07812
Authors: Zhenchen Tang, Songlin Yang, Bo Peng, Zichuan Wang, Jing Dong
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages

Abstract: The rapid progress of multi-modal large language models (MLLMs) has boosted the task of image quality assessment (IQA). However, a key challenge arises from the inherent mismatch between the discrete token outputs of MLLMs and the continuous nature of quality scores required by IQA tasks. This discrepancy significantly hinders the performance of MLLM-based IQA methods. Previous approaches that convert discrete token predictions into continuous scores often suffer from conversion errors. Moreover, the semantic confusion introduced by level tokens (e.g., "good") further constrains the performance of MLLMs on IQA tasks and degrades their original capabilities for related tasks. To tackle these problems, we provide a theoretical analysis of the errors inherent in previous approaches and, motivated by this analysis, propose a simple yet effective framework, Q-Scorer. This framework incorporates a lightweight regression module and IQA-specific score tokens into the MLLM pipeline. Extensive experiments demonstrate that Q-Scorer achieves state-of-the-art performance across multiple IQA benchmarks, generalizes well to mixed datasets, and further improves when combined with other methods.
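
The discrete-to-continuous conversion that the abstract criticizes can be made concrete. One conversion used in prior MLLM-based IQA work takes a softmax over the logits of a few level tokens and returns the probability-weighted average of their numeric values. The five level names and the 1 to 5 scale below are assumptions for illustration (only "good" appears in the abstract), and this sketch shows the earlier style of conversion, not Q-Scorer's regression module.

```python
import math

# Assumed five-level scale; only "good" is named in the abstract.
LEVELS = {"bad": 1.0, "poor": 2.0, "fair": 3.0, "good": 4.0, "excellent": 5.0}

def expected_score(level_logits):
    """Softmax over level-token logits, then the probability-weighted mean level."""
    shift = max(level_logits.values())
    exps = {name: math.exp(logit - shift) for name, logit in level_logits.items()}
    total = sum(exps.values())
    return sum(LEVELS[name] * weight / total for name, weight in exps.items())

def argmax_score(level_logits):
    """Hard decision: the numeric value of the single most likely level token."""
    return LEVELS[max(level_logits, key=level_logits.get)]

logits = {"bad": 0.0, "poor": 0.0, "fair": 1.0, "good": 1.2, "excellent": 0.0}
hard = argmax_score(logits)    # snaps to one of five values
soft = expected_score(logits)  # continuous, but tied to the assumed 1-5 spacing
```

Both conversions squeeze a continuous target through a handful of word tokens, which is the mismatch that motivates dedicated score tokens and a regression head.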

[CV-106] DI3CL: Contrastive Learning With Dynamic Instances and Contour Consistency for SAR Land-Cover Classification Foundation Model

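[Quick Read]: This paper addresses the poor scalability, weak generalization, and limited adaptability of supervised methods for synthetic aperture radar (SAR) land-cover classification, which depend on large labeled datasets. The key to the solution is a general-purpose foundation model pre-training framework, Dynamic Instance and Contour Consistency Contrastive Learning (DI3CL), with two core modules. The Dynamic Instance (DI) module enhances global contextual awareness by enforcing local consistency across different views of the same region. The Contour Consistency (CC) module uses shallow feature maps to guide the model toward the geometric contours of SAR land-cover objects, improving structural discrimination. In addition, the authors construct SARSense, a large and diverse dataset of 460,532 SAR images, so that the model captures comprehensive and representative features during pre-training, which markedly improves robustness and cross-task generalization.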

Link: https://arxiv.org/abs/2511.07808
Authors: Zhongle Ren, Hui Ding, Kai Wang, Biao Hou, Xingyu Luo, Weibin Li, Licheng Jiao
Affiliation: Xidian University; Hangzhou Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 10 figures

Abstract:Although significant advances have been achieved in SAR land-cover classification, recent methods remain predominantly focused on supervised learning, which relies heavily on extensive labeled datasets. This dependency not only limits scalability and generalization but also restricts adaptability to diverse application scenarios. In this paper, a general-purpose foundation model for SAR land-cover classification is developed, serving as a robust cornerstone to accelerate the development and deployment of various downstream models. Specifically, a Dynamic Instance and Contour Consistency Contrastive Learning (DI3CL) pre-training framework is presented, which incorporates a Dynamic Instance (DI) module and a Contour Consistency (CC) module. DI module enhances global contextual awareness by enforcing local consistency across different views of the same region. CC module leverages shallow feature maps to guide the model to focus on the geometric contours of SAR land-cover objects, thereby improving structural discrimination. Additionally, to enhance robustness and generalization during pre-training, a large-scale and diverse dataset named SARSense, comprising 460,532 SAR images, is constructed to enable the model to capture comprehensive and representative features. To evaluate the generalization capability of our foundation model, we conducted extensive experiments across a variety of SAR land-cover classification tasks, including SAR land-cover mapping, water body detection, and road extraction. The results consistently demonstrate that the proposed DI3CL outperforms existing methods. Our code and pre-trained weights are publicly available at: this https URL.

[CV-107] PC-Diffusion: Aligning Diffusion Models with Human Preferences via Preference Classifier

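[Quick read]: This paper targets the misalignment between diffusion model outputs and human preferences in conditional image generation. Existing methods based on Direct Preference Optimization (DPO) improve preference consistency but have two limitations: high computational cost, because the entire generative model must be fine-tuned, and sensitivity to the quality of the reference model, which can introduce instability and bias. The key to the solution is PC-Diffusion, a framework built around a lightweight, trainable Preference Classifier that directly models the relative preference between samples and thereby decouples preference alignment from the generative model. This design avoids both full-model fine-tuning and reliance on a reference model. The authors also give theoretical guarantees: preference-guided distributions propagate consistently across timesteps, the classifier's training objective is equivalent to DPO, and the preference-guided correction progressively steers generation toward preference-aligned regions. Experiments show preference consistency comparable to DPO at significantly lower training cost, with efficient and stable preference-guided generation.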

Link: https://arxiv.org/abs/2511.07806
Authors: Shaomeng Wang, He Wang, Xiaolu Wei, Longquan Dai, Jinhui Tang
Affiliations: Nanjing University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 3 figures, 2 tables

Abstract:Diffusion models have achieved remarkable success in conditional image generation, yet their outputs often remain misaligned with human preferences. To address this, recent work has applied Direct Preference Optimization (DPO) to diffusion models, yielding significant improvements. However, DPO-like methods exhibit two key limitations: 1) high computational cost, due to fine-tuning of the entire model; 2) sensitivity to reference model quality, due to its tendency to introduce instability and bias. To overcome these limitations, we propose a novel framework for human preference alignment in diffusion models (PC-Diffusion), using a lightweight, trainable Preference Classifier that directly models the relative preference between samples. By restricting preference learning to this classifier, PC-Diffusion decouples preference alignment from the generative model, eliminating the need for entire-model fine-tuning and reliance on a reference model. We further provide theoretical guarantees for PC-Diffusion: 1) PC-Diffusion ensures that the preference-guided distributions are consistently propagated across timesteps. 2) The training objective of the preference classifier is equivalent to DPO, but does not require a reference model. 3) The proposed preference-guided correction can progressively steer generation toward preference-aligned regions. Empirical results show that PC-Diffusion achieves comparable preference consistency to DPO while significantly reducing training costs and enabling efficient and stable preference-guided generation.
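
The abstract says the preference classifier models the relative preference between two samples. One common way to train a pairwise model of this kind is a Bradley-Terry style logistic loss on the difference of two scalar scores. The sketch below assumes that form, which the abstract does not confirm; `pairwise_preference_loss`, `preference_probability`, and the scalar scores are hypothetical names used only for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Assumed Bradley-Terry loss: -log sigmoid(s_preferred - s_rejected)."""
    return -log_sigmoid(score_preferred - score_rejected)


def preference_probability(score_a: float, score_b: float) -> float:
    """Probability that sample a is preferred over sample b under the same model."""
    return math.exp(log_sigmoid(score_a - score_b))
```

The loss is small when the classifier scores the preferred sample higher and grows when the order is reversed. No reference model appears in it, which is in the spirit of the abstract's second guarantee, though the paper's actual objective may differ.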

[CV-108] Learning Sparse Label Couplings for Multilabel Chest X-Ray Diagnosis

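[Quick read]: This paper tackles key challenges in multilabel classification of chest X-rays (CXR): extreme class imbalance, insufficient modeling of label co-occurrence, and performance optimization at training and inference time. The key to the solution is a simple but strong pipeline. It uses SE-ResNeXt101 (32×4d) as the backbone with a sigmoid output head, and Multilabel Iterative Stratification (MIS) for robust cross-validation splits. Asymmetric Loss mitigates class imbalance and asymmetric error costs, while mixed-precision training (AMP), cosine learning-rate decay with warm-up, gradient clipping, and an exponential moving average (EMA) of weights stabilize training. The main novelty is a lightweight Label-Graph Refinement module placed after the classifier: it learns a sparse, trainable inter-label coupling matrix that corrects the logits through a single message-passing step, adding only an L1-regularized parameter head at negligible compute cost. At inference, horizontal-flip test-time augmentation (TTA) is applied and predictions are averaged across MIS folds, forming a compact deep ensemble. The authors report that the refinement consistently improves validation macro AUC without extra annotations, and describe the method as reproducible and hardware-friendly.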

Link: https://arxiv.org/abs/2511.07801
Authors: Utkarsh Prakash Srivastava, Kaushik Gupta, Kaushik Nath
Affiliations: RenewCred; IKEN Solutions
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages, 8 figures

Abstract:We study multilabel classification of chest X-rays and present a simple, strong pipeline built on SE-ResNeXt101 (32×4d). The backbone is finetuned for 14 thoracic findings with a sigmoid head, trained using Multilabel Iterative Stratification (MIS) for robust cross-validation splits that preserve label co-occurrence. To address extreme class imbalance and asymmetric error costs, we optimize with Asymmetric Loss, employ mixed-precision (AMP), cosine learning-rate decay with warm-up, gradient clipping, and an exponential moving average (EMA) of weights. We propose a lightweight Label-Graph Refinement module placed after the classifier: given per-label probabilities, it learns a sparse, trainable inter-label coupling matrix that refines logits via a single message-passing step while adding only an L1-regularized parameter head. At inference, we apply horizontal flip test-time augmentation (TTA) and average predictions across MIS folds (a compact deep ensemble). Evaluation uses macro AUC, averaging classwise ROC-AUC and skipping single-class labels in a fold, to reflect balanced performance across conditions. On our dataset, a strong SE-ResNeXt101 baseline attains competitive macro AUC (e.g., 92.64% in our runs). Adding the Label-Graph Refinement consistently improves validation macro AUC across folds with negligible compute. The resulting method is reproducible, hardware-friendly, and requires no extra annotations, offering a practical route to stronger multilabel CXR classifiers.
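
The Label-Graph Refinement step described above can be sketched in a few lines. This is a minimal illustration, assuming the refinement adds a coupling-weighted sum of per-label probabilities to each logit; the abstract does not give the exact update rule, and `refine_logits`, `l1_penalty`, and the plain nested-list matrix are hypothetical choices made for the example.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def refine_logits(logits, coupling):
    """One message-passing step (assumed form): z_i' = z_i + sum_j A[i][j] * sigmoid(z_j)."""
    probs = [sigmoid(z) for z in logits]
    return [
        z + sum(a * p for a, p in zip(row, probs))
        for z, row in zip(logits, coupling)
    ]


def l1_penalty(coupling, weight):
    """L1 regularizer that pushes the coupling matrix toward sparsity."""
    return weight * sum(abs(a) for row in coupling for a in row)
```

With an all-zero coupling matrix the refinement leaves the classifier output unchanged, so the module can only help once training finds useful couplings, and the L1 term keeps most entries at zero.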

[CV-109] Divide-and-Conquer Decoupled Network for Cross-Domain Few-Shot Segmentation

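[Quick read]: This paper addresses cross-domain few-shot segmentation (CD-FSS), where encoder features entangle domain-relevant and category-relevant information, which limits generalization and rapid adaptation to new domains. The key to the solution is the Divide-and-Conquer Decoupled Network (DCDNet). In the training stage, an Adversarial-Contrastive Feature Decomposition (ACFD) module decouples backbone features into category-relevant private representations and domain-relevant shared representations via contrastive and adversarial learning. A Matrix-Guided Dynamic Fusion (MGDF) module then adaptively integrates base, shared, and private features under spatial guidance to maintain structural coherence. In the fine-tuning stage, a Cross-Adaptive Modulation (CAM) module lets shared features modulate private features so that domain-relevant information is integrated effectively, improving cross-domain generalization and few-shot adaptation.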

Link: https://arxiv.org/abs/2511.07798
Authors: Runmin Cong, Anpeng Wang, Bin Wan, Cong Zhang, Xiaofei Zhou, Wei Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cross-domain few-shot segmentation (CD-FSS) aims to tackle the dual challenge of recognizing novel classes and adapting to unseen domains with limited annotations. However, encoder features often entangle domain-relevant and category-relevant information, limiting both generalization and rapid adaptation to new domains. To address this issue, we propose a Divide-and-Conquer Decoupled Network (DCDNet). In the training stage, to tackle feature entanglement that impedes cross-domain generalization and rapid adaptation, we propose the Adversarial-Contrastive Feature Decomposition (ACFD) module. It decouples backbone features into category-relevant private and domain-relevant shared representations via contrastive learning and adversarial learning. Then, to mitigate the potential degradation caused by the disentanglement, the Matrix-Guided Dynamic Fusion (MGDF) module adaptively integrates base, shared, and private features under spatial guidance, maintaining structural coherence. In addition, in the fine-tuning stage, to enhance model generalization, the Cross-Adaptive Modulation (CAM) module is placed before the MGDF, where shared features guide private features via modulation, ensuring effective integration of domain-relevant information. Extensive experiments on four challenging datasets show that DCDNet outperforms existing CD-FSS methods, setting a new state-of-the-art for cross-domain generalization and few-shot adaptation.

[CV-110] Semantic-Consistent Bidirectional Contrastive Hashing for Noisy Multi-Label Cross-Modal Retrieval

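[Quick read]: This paper addresses the drop in retrieval performance that cross-modal hashing (CMH) suffers in real-world settings, caused by label noise and by ignoring the partial semantic overlaps in multi-label data. Existing methods usually rely on fully annotated datasets and do not model fine-grained semantic relations between samples in multi-label settings, which limits robustness and generalization. The key to the solution is Semantic-Consistent Bidirectional Contrastive Hashing (SCBCH), which has two core components: (1) Cross-modal Semantic-Consistent Classification (CSCC), which uses cross-modal semantic consistency to estimate sample reliability and reduce the impact of noisy labels; and (2) Bidirectional Soft Contrastive Hashing (BSCH), which dynamically generates soft contrastive sample pairs based on multi-label semantic overlap, enabling adaptive contrastive learning between semantically similar and dissimilar samples across modalities and improving stability and accuracy under noise.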

Link: https://arxiv.org/abs/2511.07780
Authors: Likang Peng, Chao Su, Wenyuan Wu, Yuan Sun, Dezhong Peng, Xi Peng, Xu Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cross-modal hashing (CMH) facilitates efficient retrieval across different modalities (e.g., image and text) by encoding data into compact binary representations. While recent methods have achieved remarkable performance, they often rely heavily on fully annotated datasets, which are costly and labor-intensive to obtain. In real-world scenarios, particularly in multi-label datasets, label noise is prevalent and severely degrades retrieval performance. Moreover, existing CMH approaches typically overlook the partial semantic overlaps inherent in multi-label data, limiting their robustness and generalization. To tackle these challenges, we propose a novel framework named Semantic-Consistent Bidirectional Contrastive Hashing (SCBCH). The framework comprises two complementary modules: (1) Cross-modal Semantic-Consistent Classification (CSCC), which leverages cross-modal semantic consistency to estimate sample reliability and reduce the impact of noisy labels; (2) Bidirectional Soft Contrastive Hashing (BSCH), which dynamically generates soft contrastive sample pairs based on multi-label semantic overlap, enabling adaptive contrastive learning between semantically similar and dissimilar samples across modalities. Extensive experiments on four widely-used cross-modal retrieval benchmarks validate the effectiveness and robustness of our method, consistently outperforming state-of-the-art approaches under noisy multi-label conditions.
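
To make "soft contrastive pairs based on multi-label semantic overlap" concrete, the sketch below scores a pair of samples by the Jaccard overlap of their label sets, giving a value between 0 and 1 instead of a hard similar/dissimilar flag. Jaccard overlap is an assumption chosen for illustration; the abstract does not say which overlap measure BSCH uses, and `soft_similarity` is a hypothetical name.

```python
def soft_similarity(labels_a, labels_b) -> float:
    """Jaccard overlap of two label sets, used here as an assumed soft pair weight."""
    a, b = set(labels_a), set(labels_b)
    union = a | b
    if not union:
        return 0.0  # two unlabeled samples carry no evidence of similarity
    return len(a & b) / len(union)
```

A pair sharing one of three distinct labels, such as `{"cat", "dog"}` and `{"dog", "car"}`, gets a weight of one third, so partially overlapping samples are neither fully attracted nor fully repelled.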

[CV-111] Beyond Randomness: Understand the Order of the Noise in Diffusion

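[Quick read]: This paper examines how the initial noise in text-driven diffusion models affects the semantics of generated content. The conventional view treats the initial noise as a purely random factor that adds diversity, but the paper shows that it carries analyzable semantic patterns. The key to the solution is a training-free, universal two-step "Semantic Erasure-Injection" process: first, unwanted semantics are erased from the initial noise using an information-theoretic approach; second, the target semantics are injected into the cleaned noise by exploiting the equivalence between the diffusion generation process and semantic injection, which modulates the semantics of the generated content. The method is reported to be consistently effective across diffusion models based on both DiT and UNet architectures, offering a new perspective and a general tool for optimizing diffusion generation.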

Link: https://arxiv.org/abs/2511.07756
Authors: Song Yan, Min Li, Bi Xinliang, Jian Yang, Yusen Zhang, Guanye Xiong, Yunwei Lan, Tao Zhang, Wei Zhai, Zheng-Jun Zha
Affiliations: Xi’an High-tech Research Institute; USTC (University of Science and Technology of China); HUST (Huazhong University of Science and Technology)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In text-driven content generation (T2C) diffusion models, the semantics of generated content are mostly attributed to the interaction between text embeddings and the attention mechanism. The initial noise of the generation process is typically characterized as a random element that contributes to the diversity of the generated content. Contrary to this view, this paper reveals that beneath the random surface of noise lie strong analyzable patterns. Specifically, this paper first conducts a comprehensive analysis of the impact of random noise on the model’s generation. We find that noise not only contains rich semantic information, but also allows unwanted semantics to be erased from it in an extremely simple way based on information theory, and that the equivalence between the generation process of a diffusion model and semantic injection can be used to inject semantics into the cleaned noise. Then, we mathematically decipher these observations and propose a simple but efficient training-free and universal two-step “Semantic Erasure-Injection” process to modulate the initial noise in T2C diffusion models. Experimental results demonstrate that our method is consistently effective across various T2C models based on both DiT and UNet architectures and presents a novel perspective for optimizing the generation of diffusion models, providing a universal tool for consistent generation.

[CV-112] Filtered-ViT: A Robust Defense Against Multiple Adversarial Patch Attacks

Link: https://arxiv.org/abs/2511.07755
Authors: Aja Khanal, Ahmed Faid, Apurva Narayan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

[CV-113] Class Incremental Medical Image Segmentation via Prototype-Guided Calibration and Dual-Aligned Distillation

Link: https://arxiv.org/abs/2511.07749
Authors: Shengqian Zhu, Chengrong Yu, Qiang Wang, Ying Song, Guangjun Li, Jiafei Wu, Xiaogang Xu, Zhang Yi, Junjie Hu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-114] Auto-US: An Ultrasound Video Diagnosis Agent Using Video Classification Framework and LLMs

Link: https://arxiv.org/abs/2511.07748
Authors: Yuezhe Yang, Yiyue Guo, Wenjie Cai, Qingqing Ruan, Siying Wang, Xingbo Dong, Zhe Jin, Yong Dai
Affiliations: The Anhui Provincial International Joint Research Center for Advanced Technology in Medical Imaging, Anhui University, Hefei, 230093, China; School of First Clinical Medicine, Anhui University of Science and Technology, Huainan, 232001, China; School of Medicine, Anhui University of Science and Technology, Huainan, 232001, China; The First Hospital, Anhui University of Science and Technology, Huainan, 232001, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under Review

[CV-115] VectorSynth: Fine-Grained Satellite Image Synthesis with Structured Semantics

Link: https://arxiv.org/abs/2511.07744
Authors: Daniel Cher, Brian Wei, Srikumar Sastry, Nathan Jacobs
Affiliations: Washington University in St. Louis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-116] UltraGS: Gaussian Splatting for Ultrasound Novel View Synthesis

Link: https://arxiv.org/abs/2511.07743
Authors: Yuezhe Yang, Wenjie Cai, Dexin Yang, Yufang Dong, Xingbo Dong, Zhe Jin
Affiliations: Anhui Provincial International Joint Research Center for Advanced Technology in Medical Imaging; Anhui University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Under Review

[CV-117] From Exploration to Exploitation: A Two-Stage Entropy RLVR Approach for Noise-Tolerant MLLM Training

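[Quick read]: This paper addresses the performance degradation of Multimodal Large Language Models (MLLMs) trained with Reinforcement Learning with Verifiable Rewards (RLVR) when high-quality labeled data is scarce and subject to annotation noise. Existing unsupervised RLVR methods, such as pure entropy minimization, tend to overfit to incorrect labels and weaken the reward ranking signal that Group-Relative Policy Optimization (GRPO) depends on. The key to the solution is a two-stage, token-level entropy optimization strategy. In the initial exploration phase, token-level entropy is maximized to promote output diversity, which improves robustness to noise and preserves intra-group variation, making reward gradient estimation in GRPO more reliable. In the subsequent exploitation phase, token-level entropy is minimized to push the model toward confident, deterministic outputs, consolidating acquired knowledge and refining prediction accuracy. The authors report that the method consistently outperforms prior approaches across several MLLM backbones and noise settings, unifying and enhancing external, internal, and entropy-based methods.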

Link: https://arxiv.org/abs/2511.07738
Authors: Donglai Xu, Hongzheng Yang, Yuzhi Zhao, Pingping Zhang, Jinpeng Chen, Wenao Ma, Zhijian Hou, Mengyang Wu, Xiaolei Li, Senkang Hu, Ziyi Guan, Jason Chun Lok Li, Lai Man Po
Affiliations: The Chinese University of Hong Kong; City University of Hong Kong; Hong Kong University of Science and Technology; University of Hong Kong
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) for Multimodal Large Language Models (MLLMs) is highly dependent on high-quality labeled data, which is often scarce and prone to substantial annotation noise in real-world scenarios. Existing unsupervised RLVR methods, including pure entropy minimization, can overfit to incorrect labels and limit the crucial reward ranking signal for Group-Relative Policy Optimization (GRPO). To address these challenges and enhance noise tolerance, we propose a novel two-stage, token-level entropy optimization method for RLVR. This approach dynamically guides the model from exploration to exploitation during training. In the initial exploration phase, token-level entropy maximization promotes diverse and stochastic output generation, serving as a strong regularizer that prevents premature convergence to noisy labels and ensures sufficient intra-group variation, which enables more reliable reward gradient estimation in GRPO. As training progresses, the method transitions into the exploitation phase, where token-level entropy minimization encourages the model to produce confident and deterministic outputs, thereby consolidating acquired knowledge and refining prediction accuracy. Empirically, across three MLLM backbones - Qwen2-VL-2B, Qwen2-VL-7B, and Qwen2.5-VL-3B - spanning diverse noise settings and multiple tasks, our phased strategy consistently outperforms prior approaches by unifying and enhancing external, internal, and entropy-based methods, delivering robust and superior performance across the board.
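
The exploration-to-exploitation schedule can be illustrated with a token-level entropy term whose sign flips partway through training. The sketch below is only an illustration under stated assumptions: the hard switch at `switch_step`, the coefficient `magnitude`, and the additive combination with a task reward are all hypothetical, since the abstract says only that the method "dynamically guides" the model between the two phases.

```python
import math


def token_entropy(probs) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_token_entropy(token_distributions) -> float:
    """Average entropy over the tokens of one sampled response."""
    return sum(token_entropy(p) for p in token_distributions) / len(token_distributions)


def entropy_coefficient(step: int, switch_step: int, magnitude: float = 0.01) -> float:
    """Positive (reward entropy) while exploring, negative (penalize it) while exploiting."""
    return magnitude if step < switch_step else -magnitude


def shaped_objective(task_reward, token_distributions, step, switch_step, magnitude=0.01):
    """Task reward plus the phase-dependent entropy term (assumed additive form)."""
    bonus = entropy_coefficient(step, switch_step, magnitude)
    return task_reward + bonus * mean_token_entropy(token_distributions)
```

Early in training a high-entropy response scores above its task reward, which keeps the samples in a GRPO group from collapsing onto one answer; after the switch the same response is penalized, which favors confident outputs.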

[CV-118] Operational machine learning for remote spectroscopic detection of CH₄ point sources

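[Quick read]: This paper addresses the high false-detection rate in detecting methane point sources from satellite remote sensing data. Current matched-filter methane retrieval methods can flag potential leaks but require extensive manual verification. The key to the solution is a deep-learning detection system deployed within the Methane Alert and Response System (MARS) of the United Nations Environment Programme’s International Methane Emissions Observatory. The models are trained and evaluated on what the authors describe as the largest and most diverse global dataset of annotated methane plumes, and model ensembling reduces false detections by over 74%. During seven months of operational deployment, the system facilitated the verification of 1,351 distinct methane leaks, leading to 479 stakeholder notifications, and case studies demonstrate its use in verifying mitigation success.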

Link: https://arxiv.org/abs/2511.07719
Authors: Vít Růžička, Gonzalo Mateo-García, Itziar Irakulis-Loitxate, Juan Emmanuel Johnson, Manuel Montesino San Martín, Anna Allen, Luis Guanter, David R. Thompson
Affiliations: NASA Jet Propulsion Laboratory
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 14 pages, 12 figures, 5 tables. In review

Abstract:Mitigating anthropogenic methane sources is one of the most cost-effective levers to slow down global warming. While satellite-based imaging spectrometers, such as EMIT, PRISMA, and EnMAP, can detect these point sources, current methane retrieval methods based on matched filters still produce a high number of false detections requiring laborious manual verification. This paper describes the operational deployment of a machine learning system for detecting methane emissions within the Methane Alert and Response System (MARS) of the United Nations Environment Programme’s International Methane Emissions Observatory. We created the largest and most diverse global dataset of annotated methane plumes from three imaging spectrometer missions and quantitatively compared different deep learning model configurations. Focusing on the requirements for operational deployment, we extended prior evaluation methodologies from small tiled datasets to full granule evaluation. This revealed that deep learning models still produce a large number of false detections, a problem we address with model ensembling, which reduced false detections by over 74%. Deployed in the MARS pipeline, our system processes scenes and proposes plumes to analysts, accelerating the detection and analysis process. During seven months of operational deployment, it facilitated the verification of 1,351 distinct methane leaks, resulting in 479 stakeholder notifications. We further demonstrate the model’s utility in verifying mitigation success through case studies in Libya, Argentina, Oman, and Azerbaijan. Our work represents a critical step towards a global AI-assisted methane leak detection system, which is required to process the dramatically higher data volumes expected from new and current imaging spectrometers.
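
Why ensembling can cut false detections is easy to see on a toy example: a spurious detection made by one model is diluted when its score is averaged with models that disagree. The sketch below assumes the ensemble averages per-pixel plume probabilities and then thresholds them; the abstract does not describe how the deployed system combines its models, and `ensemble_detections` is a hypothetical name.

```python
def ensemble_detections(per_model_probs, threshold=0.5):
    """Average per-pixel probabilities across models, then threshold (assumed rule)."""
    n_models = len(per_model_probs)
    n_pixels = len(per_model_probs[0])
    mean_probs = [
        sum(model[i] for model in per_model_probs) / n_models
        for i in range(n_pixels)
    ]
    return [p >= threshold for p in mean_probs]


# Three hypothetical models scoring three pixels; only the first model
# fires on the middle pixel, so the ensemble suppresses it.
example = ensemble_detections([
    [0.9, 0.9, 0.1],
    [0.8, 0.2, 0.1],
    [0.9, 0.1, 0.2],
])
```

Here the first pixel is kept because all models agree, while the middle pixel, flagged by a single model, falls below the threshold after averaging.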

[CV-119] RoboTAG: End-to-end Robot Configuration Estimation via Topological Alignment Graph

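[Quick read]: This paper addresses robot pose estimation from a monocular RGB image. Existing methods depend heavily on labeled data, which causes a sim-to-real gap, and reduce the 3D problem to the 2D domain, neglecting 3D priors. The key to the solution is the Robot Topological Alignment Graph (RoboTAG), which adds a 3D branch to inject 3D priors. In the graph, nodes represent the states of the camera and robot system, and edges capture dependencies or alignments between these variables, allowing the 2D and 3D representations to co-evolve. Closed loops in the graph provide consistency supervision across branches, which reduces the reliance on labels and allows unannotated in-the-wild images to be used for training.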

Link: https://arxiv.org/abs/2511.07717
Authors: Yifan Liu, Fangneng Zhan, Wanhua Li, Haowen Sun, Katerina Fragkiadaki, Hanspeter Pfister
Affiliations: Tsinghua University; Harvard University; Massachusetts Institute of Technology; Carnegie Mellon University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Estimating robot pose from a monocular RGB image is a challenge in robotics and computer vision. Existing methods typically build networks on top of 2D visual backbones and depend heavily on labeled data for training, which is often scarce in real-world scenarios, causing a sim-to-real gap. Moreover, these approaches reduce the 3D-based problem to the 2D domain, neglecting the 3D priors. To address these issues, we propose Robot Topological Alignment Graph (RoboTAG), which incorporates a 3D branch to inject 3D priors while enabling co-evolution of the 2D and 3D representations, alleviating the reliance on labels. Specifically, RoboTAG consists of a 3D branch and a 2D branch, where nodes represent the states of the camera and robot system, and edges capture the dependencies between these variables or denote alignments between them. Closed loops are then defined in the graph, on which a consistency supervision across branches can be applied. This design allows us to utilize in-the-wild images as training data without annotations. Experimental results demonstrate that our method is effective across robot types, highlighting its potential to alleviate the data bottleneck in robotics.

[CV-120] Cross Modal Fine-grained Alignment via Granularity-aware and Region-uncertain Modeling AAAI2026

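[Quick read]: This paper addresses a core challenge in multimodal learning, fine-grained image-text alignment: establishing precise correspondence between visual regions and textual tokens. Existing methods are limited by noise-sensitive attention mechanisms and oversimplified modeling of cross-modal relationships, so they generalize poorly in complex scenes and cannot capture the one-to-many and many-to-one nature of region-word correspondences. The key to the solution is a unified framework that combines significance-aware and granularity-aware modeling with region-level uncertainty modeling: it uses modality-specific biases to identify salient features without relying on brittle cross-modal attention, and represents region features as a mixture of Gaussian distributions to model fine-grained uncertainty explicitly, improving the robustness and interpretability of alignment.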

Link: https://arxiv.org/abs/2511.07710
Authors: Jiale Liu, Haoming Zhou, Yishu Zhu, Bingzhi Chen, Yuncheng Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 10 pages, 6 figures, accepted by AAAI 2026

Abstract:Fine-grained image-text alignment is a pivotal challenge in multimodal learning, underpinning key applications such as visual question answering, image captioning, and vision-language navigation. Unlike global alignment, fine-grained alignment requires precise correspondence between localized visual regions and textual tokens, often hindered by noisy attention mechanisms and oversimplified modeling of cross-modal relationships. In this work, we identify two fundamental limitations of existing approaches: the lack of robust intra-modal mechanisms to assess the significance of visual and textual tokens, leading to poor generalization in complex scenes; and the absence of fine-grained uncertainty modeling, which fails to capture the one-to-many and many-to-one nature of region-word correspondences. To address these issues, we propose a unified approach that incorporates significance-aware and granularity-aware modeling and region-level uncertainty modeling. Our method leverages modality-specific biases to identify salient features without relying on brittle cross-modal attention, and represents region features as a mixture of Gaussian distributions to capture fine-grained uncertainty. Extensive experiments on Flickr30K and MS-COCO demonstrate that our approach achieves state-of-the-art performance across various backbone architectures, significantly enhancing the robustness and interpretability of fine-grained image-text alignment.

[CV-121] On the Role of Calibration in Benchmarking Algorithmic Fairness for Skin Cancer Detection

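[Quick read]: This paper addresses performance disparities of AI models for skin cancer detection across demographic subgroups such as sex, race, and age, and in particular the gap left by fairness evaluations that rely only on AUROC and ignore the calibration of predicted probabilities. The key to the solution is to add calibration as a complementary benchmarking metric, measuring how well predicted probabilities agree with observed event rates, so that subgroup biases can be identified more completely. By comparing the top three models of the ISIC 2020 Challenge on multiple datasets, the study finds that although existing models improve discriminative accuracy, they often over-diagnose risk and show calibration issues on new datasets, which underscores the need for comprehensive model auditing and extensive metadata collection to achieve equitable AI-driven healthcare.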

Link: https://arxiv.org/abs/2511.07700
Authors: Brandon Dominique, Prudence Lam, Nicholas Kurtansky, Jochen Weber, Kivanc Kose, Veronica Rotemberg, Jennifer Dy
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 19 pages, 4 figures. Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URL

Abstract:Artificial Intelligence (AI) models have demonstrated expert-level performance in melanoma detection, yet their clinical adoption is hindered by performance disparities across demographic subgroups such as gender, race, and age. Previous efforts to benchmark the performance of AI models have primarily focused on assessing model performance using group fairness metrics that rely on the Area Under the Receiver Operating Characteristic curve (AUROC), which does not provide insights into a model’s ability to provide accurate estimates. In line with clinical assessments, this paper addresses this gap by incorporating calibration as a complementary benchmarking metric to AUROC-based fairness metrics. Calibration evaluates the alignment between predicted probabilities and observed event rates, offering deeper insights into subgroup biases. We assess the performance of the leading skin cancer detection algorithm of the ISIC 2020 Challenge on the ISIC 2020 Challenge dataset and the PROVE-AI dataset, and compare it with the second and third place models, focusing on subgroups defined by sex, race (Fitzpatrick Skin Tone), and age. Our findings reveal that while existing models enhance discriminative accuracy, they often over-diagnose risk and exhibit calibration issues when applied to new datasets. This study underscores the necessity for comprehensive model auditing strategies and extensive metadata collection to achieve equitable AI-driven healthcare solutions. All code is publicly available at this https URL.
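
Calibration, as defined above, compares predicted probabilities with observed event rates. A widely used summary is the expected calibration error (ECE), which bins predictions by probability and averages the gap between mean prediction and event rate in each bin. The sketch below computes it overall and per subgroup; ECE with equal-width bins is an assumption made for illustration, since the abstract does not name the calibration metric the authors use, and both function names are hypothetical.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean |mean predicted probability - observed event rate| over bins."""
    if not probs:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        event_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / len(probs)) * abs(mean_prob - event_rate)
    return ece


def ece_by_subgroup(probs, labels, groups, n_bins=10):
    """Compute ECE separately for each subgroup label (e.g. an age band)."""
    by_group = {}
    for p, y, g in zip(probs, labels, groups):
        group_probs, group_labels = by_group.setdefault(g, ([], []))
        group_probs.append(p)
        group_labels.append(y)
    return {
        g: expected_calibration_error(ps, ys, n_bins)
        for g, (ps, ys) in by_group.items()
    }
```

Two subgroups can share the same AUROC while one of them receives systematically inflated risk estimates; a per-subgroup calibration score like this one exposes that difference.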

[CV-122] FlowFeat: Pixel-Dense Embedding of Motion Profiles

Link: https://arxiv.org/abs/2511.07696
Authors: Nikita Araslanov, Anna Sonnweber, Daniel Cremers
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project website: this https URL

[CV-123] Predicting Coronary Artery Calcium Severity based on Non-Contrast Cardiac CT images using Deep Learning

Link: https://arxiv.org/abs/2511.07695
Authors: Lachlan Nguyen, Aidan Cousins, Arcot Sowmya, Hugh Dixson, Sonit Singh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages

[CV-124] TrackStudio: An Integrated Toolkit for Markerless Tracking

Link: https://arxiv.org/abs/2511.07624
Authors: Hristo Dimitrov, Giulia Dominijanni, Viktorija Pavalkyte, Tamar R. Makin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments: 26 pages, 5 main text figures, 5 supplementary figures

[CV-125] A Hybrid Multimodal Deep Learning Framework for Intelligent Fashion Recommendation

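[Quick read]: This paper addresses two needs of intelligent recommender systems on online fashion platforms, outfit compatibility prediction and complementary item retrieval, where the core challenge is fusing visual and textual information effectively. The key to the solution is a hybrid multimodal deep learning framework that uses the visual and textual encoders of the CLIP (Contrastive Language–Image Pretraining) architecture to obtain joint latent representations of fashion items, integrates them into a unified feature vector, and feeds it to a transformer encoder. An “outfit token” models the holistic relationships among the items of an outfit for compatibility prediction (AUC of 0.95 on the Polyvore dataset), and a “target item token” representing the description of the desired item is used for retrieval, reaching 69.24% accuracy under the Fill-in-the-Blank (FITB) metric.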

Link: https://arxiv.org/abs/2511.07573
Authors: Kamand Kalashi, Babak Teimourpour
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 1 figure

Abstract:The rapid expansion of online fashion platforms has created an increasing demand for intelligent recommender systems capable of understanding both visual and textual cues. This paper proposes a hybrid multimodal deep learning framework for fashion recommendation that jointly addresses two key tasks: outfit compatibility prediction and complementary item retrieval. The model leverages the visual and textual encoders of the CLIP architecture to obtain joint latent representations of fashion items, which are then integrated into a unified feature vector and processed by a transformer encoder. For compatibility prediction, an “outfit token” is introduced to model the holistic relationships among items, achieving an AUC of 0.95 on the Polyvore dataset. For complementary item retrieval, a “target item token” representing the desired item description is used to retrieve compatible items, reaching an accuracy of 69.24% under the Fill-in-the-Blank (FITB) metric. The proposed approach demonstrates strong performance across both tasks, highlighting the effectiveness of multimodal learning for fashion recommendation.

[CV-126] LiveNeRF: Efficient Face Replacement Through Neural Radiance Fields Integration

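[Quick read]: This paper addresses the trade-off between real-time performance and visual quality in existing face replacement techniques, targeting applications in entertainment, education, and communication that need low latency and high fidelity. The key to the solution is the LiveNeRF framework, which integrates Neural Radiance Fields (NeRF) into face replacement and reaches real-time performance (33 FPS) while maintaining high visual quality, enabling practical deployment in live streaming, video conferencing, and interactive media.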

Link: https://arxiv.org/abs/2511.07552
Authors: Tung Vu, Hai Nguyen, Cong Tran
Affiliations: Hanoi Architectural University; Posts and Telecommunications Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Face replacement technology enables significant advancements in entertainment, education, and communication applications, including dubbing, virtual avatars, and cross-cultural content adaptation. Our LiveNeRF framework addresses critical limitations of existing methods by achieving real-time performance (33 FPS) with superior visual quality, enabling practical deployment in live streaming, video conferencing, and interactive media. The technology particularly benefits content creators, educators, and individuals with speech impairments through accessible avatar communication. While acknowledging potential misuse in unauthorized deepfake creation, we advocate for responsible deployment with user consent verification and integration with detection systems to ensure positive societal impact while minimizing risks.

[CV-127] Toward the Frontiers of Reliable Diffusion Sampling via Adversarial Sinkhorn Attention Guidance AAAI26

Link: https://arxiv.org/abs/2511.07499
Authors: Kwanyoung Kim
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI 26

[CV-128] Laplacian Score Sharpening for Mitigating Hallucination in Diffusion Models

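[Quick read]: This paper addresses hallucinations in diffusion models, that is, the generation of incoherent or unrealistic samples. Prior work attributes the phenomenon to mode interpolation and score smoothening but offers no way to prevent it during sampling. The paper proposes a post-hoc adjustment applied at inference time that uses the Laplacian (or sharpness) of the score function to suppress hallucinations caused by mode interpolation, and approximates the Laplacian efficiently in high dimensions with a finite-difference variant of the Hutchinson trace estimator. Experiments show a significant reduction in the rate of hallucinated samples on 1D, 2D, and high-dimensional image data, and the analysis explores the relationship between the Laplacian and uncertainty in the score.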

Link: https://arxiv.org/abs/2511.07496
Authors: Barath Chandran.C, Srinivas Anumasa, Dianbo Liu
Affiliations: Indian Institute of Technology, Roorkee; National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Diffusion models, though successful, are known to suffer from hallucinations that create incoherent or unrealistic samples. Recent works have attributed this to the phenomenon of mode interpolation and score smoothening, but they lack a method to prevent their generation during sampling. In this paper, we propose a post-hoc adjustment to the score function during inference that leverages the Laplacian (or sharpness) of the score to reduce mode interpolation hallucination in unconditional diffusion models across 1D, 2D, and high-dimensional image data. We derive an efficient Laplacian approximation for higher dimensions using a finite-difference variant of the Hutchinson trace estimator. We show that this correction significantly reduces the rate of hallucinated samples across toy 1D/2D distributions and a high-dimensional image dataset. Furthermore, our analysis explores the relationship between the Laplacian and uncertainty in the score.
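
A finite-difference Hutchinson estimator can be sketched without any automatic differentiation. The example below assumes the quantity of interest is the trace of the Jacobian of the score (equivalently, the Laplacian of the log-density), which is what a Hutchinson trace estimator targets; how the paper defines its Laplacian term and feeds it back into the score is not stated in the abstract. `hutchinson_divergence`, `gaussian_score`, the Rademacher probes, and the central-difference step `eps` are illustrative choices.

```python
import random


def hutchinson_divergence(score_fn, x, n_probes=16, eps=1e-3, seed=0):
    """Estimate tr(d score / d x) at x using random probes and central differences.

    For each Rademacher probe v, v . (s(x + eps*v) - s(x - eps*v)) / (2*eps)
    approximates v^T J v, whose expectation over v is the trace of J.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_probes):
        v = [rng.choice((-1.0, 1.0)) for _ in x]
        x_plus = [xi + eps * vi for xi, vi in zip(x, v)]
        x_minus = [xi - eps * vi for xi, vi in zip(x, v)]
        s_plus = score_fn(x_plus)
        s_minus = score_fn(x_minus)
        directional = sum(vi * (sp - sm) for vi, sp, sm in zip(v, s_plus, s_minus))
        total += directional / (2.0 * eps)
    return total / n_probes


def gaussian_score(x):
    """Score of a standard normal, grad log p(x) = -x, used as a sanity check."""
    return [-xi for xi in x]
```

For the standard normal in `d` dimensions the Jacobian of the score is minus the identity, so the estimate should be close to `-d`. Each probe costs two score evaluations, so the cost does not grow with the data dimension.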

[CV-129] Modulo Video Recovery via Selective Spatiotemporal Vision Transformer

Link: https://arxiv.org/abs/2511.07479
Authors: Tianyu Geng, Feng Ji, Wee Peng Tay
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments:

[CV-130] Multivariate Variational Autoencoder

Link: https://arxiv.org/abs/2511.07472
Authors: Mehmet Can Yavuz
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-131] Towards Personalized Quantum Federated Learning for Anomaly Detection

Link: https://arxiv.org/abs/2511.07471
Authors: Ratun Rahman, Sina Shaham, Dinh C. Nguyen
Affiliations: University of Alabama in Huntsville; Meta
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Quantum Physics (quant-ph)
Comments: Accepted at IEEE Transactions on Network Science and Engineering

[CV-132] Two Datasets Are Better Than One: Method of Double Moments for 3-D Reconstruction in Cryo-EM

【速读】:该论文旨在解决冷冻电子显微镜(cryo-EM)中从噪声严重的投影图像重建三维分子结构的难题,尤其是在仅依赖第二阶统计量(second-order statistics)的情况下如何实现高精度重构。其解决方案的关键在于提出了一种新的数据融合框架——双矩法(method of double moments, MoDM),该方法利用两个不同取向分布下获得的投影图像的二阶矩:一个均匀取向分布和一个未知非均匀取向分布。理论证明表明,这两个矩在一般情况下唯一确定了分子结构(至多全局旋转与反射等价类),并进一步设计了一个基于凸松弛(convex relaxation)的算法,实现了仅用二阶统计信息即可准确恢复分子结构的能力。此方案凸显了在不同实验条件下采集多组数据并建模差异性对提升计算成像质量的重要性。

链接: https://arxiv.org/abs/2511.07438
作者: Joe Kileel,Oscar Mickelin,Amit Singer,Sheng Xu
机构: University of Texas at Austin (德克萨斯大学奥斯汀分校); Tsinghua University (清华大学); Princeton University (普林斯顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA); Methodology (stat.ME)
备注:

点击查看摘要

Abstract: Cryo-electron microscopy (cryo-EM) is a powerful imaging technique for reconstructing three-dimensional molecular structures from noisy tomographic projection images of randomly oriented particles. We introduce a new data fusion framework, termed the method of double moments (MoDM), which reconstructs molecular structures from two instances of the second-order moment of projection images obtained under distinct orientation distributions: one uniform, the other non-uniform and unknown. We prove that these moments generically uniquely determine the underlying structure, up to a global rotation and reflection, and we develop a convex-relaxation-based algorithm that achieves accurate recovery using only second-order statistics. Our results demonstrate the advantage of collecting and modeling multiple datasets under different experimental conditions, illustrating that leveraging dataset diversity can substantially enhance reconstruction quality in computational imaging tasks.

[CV-133] Knowledge-Guided Textual Reasoning for Explainable Video Anomaly Detection via LLMs

Link: https://arxiv.org/abs/2511.07429
Authors: Hari Lee
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

[CV-134] From Noise to Latent: Generating Gaussian Latents for INR-Based Image Compression

Link: https://arxiv.org/abs/2511.08009
Authors: Chaoyi Lin, Yaojun Wu, Yue Li, Junru Li, Kai Zhang, Li Zhang
Affiliations: Bytedance Inc.
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

[CV-135] DynaQuant: Dynamic Mixed-Precision Quantization for Learned Image Compression AAAI2026

Link: https://arxiv.org/abs/2511.07903
Authors: Youneng Bao, Yulong Cheng, Yiping Liu, Yichen Yang, Peng Qin, Mu Li, Yongsheng Liang
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, accepted by AAAI 2026

Click to view abstract

[CV-136] Deep Learning Analysis of Prenatal Ultrasound for Identification of Ventriculomegaly

Link: https://arxiv.org/abs/2511.07827
Authors: Youssef Megahed, Inok Lee, Robin Ducharme, Aylin Erman, Olivier X. Miguel, Kevin Dick, Adrian D. C. Chan, Steven Hawken, Mark Walker, Felipe Moretti
Affiliations: Carleton University; Toronto General Hospital
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 7 figures, 3 tables

Click to view abstract

[CV-137] EvoPS: Evolutionary Patch Selection for Whole Slide Image Analysis in Computational Pathology

Link: https://arxiv.org/abs/2511.07560
Authors: Saya Hashemian, Azam Asilian Bidgoli
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract

Artificial Intelligence

[AI-0] DeepProofLog: Efficient Proving in Deep Stochastic Logic Programs AAAI2026

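[Quick read]: This paper addresses the limited scalability of neurosymbolic (NeSy) AI systems that aim to retain high accuracy, interpretability, and generalization. Existing methods can guarantee these properties through logical inference, but the inference is often complex and computationally inefficient, which limits their use with large knowledge bases and complex proof spaces. The key to the solution is DeepProofLog (DPrL), a new NeSy framework based on stochastic logic programs. It parameterizes every derivation step with neural networks, enabling efficient neural guidance of the prover, and it establishes a formal mapping between its resolution process and Markov Decision Processes (MDPs), so that dynamic programming and reinforcement learning techniques can be applied for efficient inference and learning, improving scalability in complex settings.

Link: https://arxiv.org/abs/2511.08581
Authors: Ying Jiao, Rodrigo Castellano Ontiveros, Luc De Raedt, Marco Gori, Francesco Giannini, Michelangelo Diligenti, Giuseppe Marra
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at AAAI 2026

Click to view abstract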

Abstract:Neurosymbolic (NeSy) AI aims to combine the strengths of neural architectures and symbolic reasoning to improve the accuracy, interpretability, and generalization capability of AI models. While logic inference on top of subsymbolic modules has been shown to effectively guarantee these properties, this often comes at the cost of reduced scalability, which can severely limit the usability of NeSy models. This paper introduces DeepProofLog (DPrL), a novel NeSy system based on stochastic logic programs, which addresses the scalability limitations of previous methods. DPrL parameterizes all derivation steps with neural networks, allowing efficient neural guidance over the proving system. Additionally, we establish a formal mapping between the resolution process of our deep stochastic logic programs and Markov Decision Processes, enabling the application of dynamic programming and reinforcement learning techniques for efficient inference and learning. This theoretical connection improves scalability for complex proof spaces and large knowledge bases. Our experiments on standard NeSy benchmarks and knowledge graph reasoning tasks demonstrate that DPrL outperforms existing state-of-the-art NeSy systems, advancing scalability to larger and more complex settings than previously possible.

[AI-1] Automatic Grid Updates for Kolmogorov-Arnold Networks using Layer Histograms

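[Quick read]: This paper addresses the extra user burden created by the need to adjust the domain discretization (the "domain grid") of Kolmogorov-Arnold Networks (KANs) during training, which hinders automated and efficient use in practice. Typical KAN layers cannot update their own domains as the output ranges of previous layers change. The key to the solution is an adaptive domain update mechanism: a histogram-based algorithm adjusts each layer's domain grid so that the KAN tunes the domain of its parameterized activation functions in a data-driven manner. According to the abstract, the resulting model, AdaptKAN, matches or exceeds prior KAN architectures and MLPs on learning scientific equations, image classification from frozen features, learning a control Lyapunov function, and detecting out-of-distribution (OOD) inputs on the OpenOOD v1.5 benchmark.

Link: https://arxiv.org/abs/2511.08570
Authors: Jamison Moody, James Usevitch
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract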

Abstract:Kolmogorov-Arnold Networks (KANs) are a class of neural networks that have received increased attention in recent literature. In contrast to MLPs, KANs leverage parameterized, trainable activation functions and offer several benefits including improved interpretability and higher accuracy on learning symbolic equations. However, the original KAN architecture requires adjustments to the domain discretization of the network (called the “domain grid”) during training, creating extra overhead for the user in the training process. Typical KAN layers are not designed with the ability to autonomously update their domains in a data-driven manner informed by the changing output ranges of previous layers. As an added benefit, this histogram algorithm may also be applied towards detecting out-of-distribution (OOD) inputs in a variety of settings. We demonstrate that AdaptKAN exceeds or matches the performance of prior KAN architectures and MLPs on four different tasks: learning scientific equations from the Feynman dataset, image classification from frozen features, learning a control Lyapunov function, and detecting OOD inputs on the OpenOOD v1.5 benchmark.
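The abstract describes layers that update their domains from the observed output ranges of previous layers using histograms. The sketch below illustrates only that general idea and is not the AdaptKAN algorithm: it histograms a layer's inputs over the current domain and widens the domain when too many samples fall outside it. The function name `update_domain` and the parameters `n_bins`, `max_outside`, and `margin` are hypothetical.

```python
import numpy as np

def update_domain(samples, domain, n_bins=8, max_outside=0.05, margin=0.1):
    """Return a (possibly widened) domain for a layer's inputs.

    Hypothetical rule: keep the domain while at most `max_outside` of the
    samples fall outside it, otherwise cover the observed range plus padding.
    """
    lo, hi = domain
    samples = np.asarray(samples, dtype=float)
    # np.histogram ignores values outside `range`, so the total count tells
    # us how many samples the current grid actually covers.
    counts, _ = np.histogram(samples, bins=n_bins, range=(lo, hi))
    outside = 1.0 - counts.sum() / samples.size
    if outside <= max_outside:
        return (lo, hi)
    new_lo, new_hi = float(samples.min()), float(samples.max())
    pad = margin * (new_hi - new_lo)
    return (new_lo - pad, new_hi + pad)

kept = update_domain(np.linspace(-0.5, 0.5, 50), (-1.0, 1.0))
grown = update_domain(np.linspace(-3.0, 3.0, 101), (-1.0, 1.0))
```

In this toy rule the first call leaves the domain unchanged because every sample is covered, while the second widens it to the observed range plus a 10% margin. The same out-of-range fraction could serve as a crude OOD signal, which is the kind of reuse the abstract alludes to.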

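[AI-2] The Path Not Taken: RLVR Provably Learns Off the Principals NEURIPS2025

[Quick read]: This paper examines an apparent paradox in Reinforcement Learning with Verifiable Rewards (RLVR): why RLVR reliably improves the reasoning ability of large language models while seeming to modify only a small fraction of parameters. The key is a Three-Gate Theory that explains RLVR's learning dynamics at the level of parameter-space optimization: Gate I (KL Anchor) imposes a KL-constrained update; Gate II (Model Geometry) steers the step off principal directions into low-curvature subspaces that preserve the spectrum; and Gate III (Precision) hides micro-updates in non-preferred regions, so the off-principal bias appears as sparsity. RLVR therefore learns off the principal directions in weight space, with minimal spectral drift and off-principal update alignment, whereas supervised fine-tuning (SFT) targets principal weights and distorts the spectrum. Because the two operate in distinct optimization regimes, directly reusing SFT-era parameter-efficient fine-tuning (PEFT) methods can be flawed.

Link: https://arxiv.org/abs/2511.08567
Authors: Hanqing Zhu, Zhenyu Zhang, Hanxian Huang, DiJia Su, Zechun Liu, Jiawei Zhao, Igor Fedorov, Hamed Pirsiavash, Zhizhou Sha, Jinwon Lee, David Z. Pan, Zhangyang Wang, Yuandong Tian, Kai Sheng Tai
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preliminary version accepted as a spotlight in NeurIPS 2025 Workshop on Efficient Reasoning

Click to view abstract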

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) reliably improves the reasoning performance of large language models, yet it appears to modify only a small fraction of parameters. We revisit this paradox and show that sparsity is a surface artifact of a model-conditioned optimization bias: for a fixed pretrained model, updates consistently localize to preferred parameter regions, highly consistent across runs and largely invariant to datasets and RL recipes. We mechanistically explain these dynamics with a Three-Gate Theory: Gate I (KL Anchor) imposes a KL-constrained update; Gate II (Model Geometry) steers the step off principal directions into low-curvature, spectrum-preserving subspaces; and Gate III (Precision) hides micro-updates in non-preferred regions, making the off-principal bias appear as sparsity. We then validate this theory and, for the first time, provide a parameter-level characterization of RLVR’s learning dynamics: RLVR learns off principal directions in weight space, achieving gains via minimal spectral drift, reduced principal-subspace rotation, and off-principal update alignment. In contrast, SFT targets principal weights, distorts the spectrum, and even lags RLVR. Together, these results provide the first parameter-space account of RLVR’s training dynamics, revealing clear regularities in how parameters evolve. Crucially, we show that RL operates in a distinct optimization regime from SFT, so directly adapting SFT-era parameter-efficient fine-tuning (PEFT) methods can be flawed, as evidenced by our case studies on advanced sparse fine-tuning and LoRA variants. We hope this work charts a path toward a white-box understanding of RLVR and the design of geometry-aware, RLVR-native learning algorithms, rather than repurposed SFT-era heuristics. 
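To make "off principal directions" concrete, here is a small sketch of one possible diagnostic; it is an assumption for illustration, not the metric used in the paper. It measures what fraction of a weight update's squared Frobenius norm lies in the subspace spanned by the top-k singular vectors of the pretrained weight matrix. The function name `principal_energy_fraction` and the toy matrices are hypothetical.

```python
import numpy as np

def principal_energy_fraction(weight, update, k):
    """Fraction of ||update||_F^2 lying in the top-k singular subspace of weight."""
    u, _, vt = np.linalg.svd(weight, full_matrices=False)
    u_k, v_k = u[:, :k], vt[:k, :].T
    # Project the update onto span(u_k) on the left and span(v_k) on the right.
    projected = u_k @ (u_k.T @ update @ v_k) @ v_k.T
    return float(np.linalg.norm(projected) ** 2 / np.linalg.norm(update) ** 2)

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 5))  # toy stand-in for a pretrained weight matrix
u, s, vt = np.linalg.svd(W, full_matrices=False)

on_principal = np.outer(u[:, 0], vt[0])   # update along the top singular pair
off_principal = np.outer(u[:, 4], vt[4])  # update along the smallest singular pair

frac_on = principal_energy_fraction(W, on_principal, k=2)
frac_off = principal_energy_fraction(W, off_principal, k=2)
```

Under this diagnostic, an update aligned with the leading singular pair scores close to 1 and one aligned with the trailing pair scores close to 0. An off-principal learner in the abstract's sense would be expected to sit nearer the low end.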

[AI-3] Hyperdimensional Decoding of Spiking Neural Networks

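[Quick read]: This paper addresses the difficulty of optimizing classification accuracy, noise robustness, latency, and energy use at the same time with conventional spiking neural network (SNN) decoding methods. The key to the solution is to combine SNNs with Hyperdimensional Computing (HDC) in an SNN-HDC model. On test cases from the literature, including the DvsGesture and SL-Animals-DVS datasets, the model attains generally better classification accuracy, lower classification latency, and lower estimated energy consumption (reductions of 1.24x to 3.67x on DvsGesture and 1.38x to 2.27x on SL-Animals-DVS). It can also identify classes it was not trained on; on DvsGesture it identifies 100% of samples from an unseen class. The combined architecture offers an alternative to rate and latency decoding.

Link: https://arxiv.org/abs/2511.08558
Authors: Cedrick Kinavuidi, Luca Peres, Oliver Rhodes
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract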

Abstract:This work presents a novel spiking neural network (SNN) decoding method, combining SNNs with Hyperdimensional computing (HDC). The goal is to create a decoding method with high accuracy, high noise robustness, low latency and low energy usage. Compared to analogous architectures decoded with existing approaches, the presented SNN-HDC model attains generally better classification accuracy, lower classification latency and lower estimated energy consumption on multiple test cases from literature. The SNN-HDC achieved estimated energy consumption reductions ranging from 1.24x to 3.67x on the DvsGesture dataset and from 1.38x to 2.27x on the SL-Animals-DVS dataset. The presented decoding method can also efficiently identify unknown classes it has not been trained on. In the DvsGesture dataset the SNN-HDC model can identify 100% of samples from an unseen/untrained class. Given the numerous benefits shown and discussed in this paper, this decoding method represents a very compelling alternative to both rate and latency decoding.
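The following sketch shows generic hyperdimensional classification with rejection of unknown classes. It assumes the SNN output has already been encoded as a bipolar hypervector; the abstract does not describe that encoding, so everything here (dimension, noise level, threshold, function names) is an illustrative assumption rather than the paper's decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 2048  # hypervector dimensionality (illustrative choice)

def random_hv():
    # Random bipolar hypervector with entries in {-1, +1}.
    return rng.choice([-1, 1], size=DIM)

def flip_noise(hv, frac):
    # Flip the sign of a random fraction of components to mimic noisy samples.
    noisy = hv.copy()
    idx = rng.choice(DIM, size=int(frac * DIM), replace=False)
    noisy[idx] *= -1
    return noisy

def bundle(hvs):
    # Majority vote per component forms a class prototype.
    total = np.sum(hvs, axis=0)
    return np.where(total >= 0, 1, -1)

def classify(query, prototypes, threshold=0.3):
    # Normalized dot product equals cosine similarity for bipolar vectors.
    sims = prototypes @ query / DIM
    best = int(np.argmax(sims))
    # Return -1 ("unknown") when no prototype is similar enough.
    return best if sims[best] >= threshold else -1

bases = [random_hv() for _ in range(3)]
prototypes = np.stack([bundle([flip_noise(b, 0.1) for _ in range(5)]) for b in bases])

known_label = classify(flip_noise(bases[1], 0.1), prototypes)
unknown_label = classify(random_hv(), prototypes)
```

Because independent random hypervectors are nearly orthogonal in high dimensions, a query from an untrained class has low similarity to every prototype, and a similarity threshold is enough to flag it as unknown in this toy setting.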

[AI-4] A Matter of Interest: Understanding Interestingness of Math Problems in Humans and Language Models NEURIPS2025

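[Quick read]: This paper asks whether AI systems judge the "interestingness" of mathematical problems the way humans do, a question that matters for human-AI collaboration in research and education. The approach consists of two empirical studies, one with participants from a crowdsourcing platform and one with International Math Olympiad competitors, that compare how humans and large language models (LLMs) assess the interestingness and difficulty of math problems. The studies find that many LLMs broadly agree with human notions of interestingness but mostly fail to reproduce the distribution of human judgments, and that they align only weakly with the reasons humans give for finding a problem interesting. The results reveal both the promise and the limits of current LLMs in modeling human judgments of mathematical interest and inform the design of mathematical AI thought partners.

Link: https://arxiv.org/abs/2511.08548
Authors: Shubhra Mishra, Yuka Machino, Gabriel Poesia, Albert Jiang, Joy Hsu, Adrian Weller, Challenger Mishra, David Broman, Joshua B. Tenenbaum, Mateja Jamnik, Cedegao E. Zhang, Katherine M. Collins
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Published at the Math-AI Workshop, NeurIPS 2025

Click to view abstract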

Abstract:The evolution of mathematics has been guided in part by interestingness. From researchers choosing which problems to tackle next, to students deciding which ones to engage with, people’s choices are often guided by judgments about how interesting or challenging problems are likely to be. As AI systems, such as LLMs, increasingly participate in mathematics with people – whether for advanced research or education – it becomes important to understand how well their judgments align with human ones. Our work examines this alignment through two empirical studies of human and LLM assessment of mathematical interestingness and difficulty, spanning a range of mathematical experience. We study two groups: participants from a crowdsourcing platform and International Math Olympiad competitors. We show that while many LLMs appear to broadly agree with human notions of interestingness, they mostly do not capture the distribution observed in human judgments. Moreover, most LLMs only somewhat align with why humans find certain math problems interesting, showing weak correlation with human-selected interestingness rationales. Together, our findings highlight both the promises and limitations of current LLMs in capturing human interestingness judgments for mathematical AI thought partnerships.

[AI-5] HQ-SVC: Towards High-Quality Zero-Shot Singing Voice Conversion in Low-Resource Scenarios AAAI2026

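[Quick read]: This paper addresses the loss of acoustic information and the high computational cost that arise in zero-shot singing voice conversion (SVC) when vocal content and singer timbre are modeled separately. Existing methods struggle to retain key acoustic features under this separation, which degrades output quality. The key to the solution is the HQ-SVC framework: it first extracts content and speaker features jointly with a decoupled codec; it then adds pitch and volume modeling to improve fidelity and retain acoustic detail that separate modeling tends to lose; finally, it refines the output progressively with differentiable signal processing and diffusion techniques. According to the abstract, the framework improves conversion quality and efficiency and natively supports voice super-resolution.

Link: https://arxiv.org/abs/2511.08496
Authors: Bingsong Bai, Yizhong Geng, Fengping Wang, Cong Wang, Puyuan Guo, Yingming Gao, Ya Li
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted by AAAI 2026 main technical track

Click to view abstract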

Abstract:Zero-shot singing voice conversion (SVC) transforms a source singer’s timbre to an unseen target speaker’s voice while preserving melodic content without fine-tuning. Existing methods model speaker timbre and vocal content separately, losing essential acoustic information that degrades output quality while requiring significant computational resources. To overcome these limitations, we propose HQ-SVC, an efficient framework for high-quality zero-shot SVC. HQ-SVC first extracts jointly content and speaker features using a decoupled codec. It then enhances fidelity through pitch and volume modeling, preserving critical acoustic information typically lost in separate modeling approaches, and progressively refines outputs via differentiable signal processing and diffusion techniques. Evaluations confirm HQ-SVC significantly outperforms state-of-the-art zero-shot SVC methods in conversion quality and efficiency. Beyond voice conversion, HQ-SVC achieves superior voice naturalness compared to specialized audio super-resolution methods while natively supporting voice super-resolution tasks.

[AI-6] Patching LLM Like Software: A Lightweight Method for Improving Safety Policy in Large Language Models

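[Quick read]: This paper addresses the difficulty of quickly fixing known safety gaps in deployed large language models (LLMs), such as toxic generation, bias, and insufficient refusal of harmful requests. Current approaches such as full-model fine-tuning or major version updates are costly, infrequent, and hard to customize. The key to the solution is a lightweight, modular "patch": a compact, learnable prefix prepended to an existing model, adding only 0.003% additional parameters, that steers the model's behavior toward that of a safer reference model. Across three safety domains the method achieves improvements comparable to next-generation safety-aligned models while preserving fluency, offering an efficient, composable, and scalable way to ship safety updates between major releases.

Link: https://arxiv.org/abs/2511.08484
Authors: Huzaifa Arif, Keerthiram Murugesan, Ching-Yun Ko, Pin-Yu Chen, Payel Das, Alex Gittens
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract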

Abstract:We propose patching for large language models (LLMs) like software versions, a lightweight and modular approach for addressing safety vulnerabilities. While vendors release improved LLM versions, major releases are costly, infrequent, and difficult to tailor to customer needs, leaving released models with known safety gaps. Unlike full-model fine-tuning or major version updates, our method enables rapid remediation by prepending a compact, learnable prefix to an existing model. This “patch” introduces only 0.003% additional parameters, yet reliably steers model behavior toward that of a safer reference model. Across three critical domains (toxicity mitigation, bias reduction, and harmfulness refusal) policy patches achieve safety improvements comparable to next-generation safety-aligned models while preserving fluency. Our results demonstrate that LLMs can be “patched” much like software, offering vendors and practitioners a practical mechanism for distributing scalable, efficient, and composable safety updates between major model releases.

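[AI-7] Designing LLM-based Multi-Agent Systems for Software Engineering Tasks: Quality Attributes, Design Patterns, and Rationale

[Quick read]: This paper addresses the lack of a systematic study of how multi-agent systems (MASs) based on large language models (LLMs) are designed for software engineering (SE) tasks, covering the quality attributes (QAs) designers focus on, the design patterns they use, and the rationale behind their designs. The approach is a study of 94 papers from which the design characteristics of LLM-based MASs for SE are extracted and summarized. The study finds that Code Generation is the most common SE task, Functional Suitability is the QA that receives the most attention, Role-Based Cooperation is the most frequently used design pattern, and improving the quality of generated code is the most common design rationale. These findings provide empirical evidence and practical guidance for designing LLM-based MASs for SE.

Link: https://arxiv.org/abs/2511.08475
Authors: Yangxiao Cai, Ruiyin Li, Peng Liang, Mojtaba Shahin, Zengyang Li
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract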

Abstract:As the complexity of Software Engineering (SE) tasks continues to escalate, Multi-Agent Systems (MASs) have emerged as a focal point of research and practice due to their autonomy and scalability. Furthermore, through leveraging the reasoning and planning capabilities of Large Language Models (LLMs), the application of LLM-based MASs in the field of SE is garnering increasing attention. However, there is no dedicated study that systematically explores the design of LLM-based MASs, including the Quality Attributes (QAs) on which the designers mainly focus, the design patterns used by the designers, and the rationale guiding the design of LLM-based MASs for SE tasks. To this end, we conducted a study to identify the QAs that LLM-based MASs for SE tasks focus on, the design patterns used in the MASs, and the design rationale for the MASs. We collected 94 papers on LLM-based MASs for SE tasks as the source. Our study shows that: (1) Code Generation is the most common SE task solved by LLM-based MASs among ten identified SE tasks, (2) Functional Suitability is the QA on which designers of LLM-based MASs pay the most attention, (3) Role-Based Cooperation is the design pattern most frequently employed among 16 patterns used to construct LLM-based MASs, and (4) Improving the Quality of Generated Code is the most common rationale behind the design of LLM-based MASs. Based on the study results, we presented the implications for the design of LLM-based MASs to support SE tasks.

[AI-8] Binary Split Categorical feature with Mean Absolute Error Criteria in CART

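[Quick read]: This paper addresses the inefficient handling of categorical features in the Classification and Regression Trees (CART) algorithm when the Mean Absolute Error (MAE) is used as the splitting criterion. Traditional approaches rely on various numerical encoding strategies, but the paper shows that unsupervised numerical encoding methods are not viable for the MAE criterion. It proposes a novel and efficient splitting algorithm that optimizes the MAE objective directly, without numerical encoding, so that categorical features are handled accurately while computation stays efficient.

Link: https://arxiv.org/abs/2511.08470
Authors: Peng Yu, Yike Chen, Chao Xu, Albert Bifet, Jesse Read
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract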

Abstract:In the context of the Classification and Regression Trees (CART) algorithm, the efficient splitting of categorical features using standard criteria like GINI and Entropy is well-established. However, using the Mean Absolute Error (MAE) criterion for categorical features has traditionally relied on various numerical encoding methods. This paper demonstrates that unsupervised numerical encoding methods are not viable for the MAE criteria. Furthermore, we present a novel and efficient splitting algorithm that addresses the challenges of handling categorical features with the MAE criterion. Our findings underscore the limitations of existing approaches and offer a promising solution to enhance the handling of categorical data in CART algorithms.
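For context, the problem the abstract describes can be stated as a brute-force search: try every binary partition of the category levels and keep the one with the lowest total absolute error around each side's median. The sketch below is that exhaustive baseline only, exponential in the number of levels, and is not the efficient algorithm the paper proposes. The function names and the toy data are assumptions.

```python
from itertools import combinations
import numpy as np

def abs_error_cost(y):
    # Sum of absolute deviations from the median, which minimizes absolute error.
    y = np.asarray(y, dtype=float)
    return float(np.abs(y - np.median(y)).sum()) if y.size else 0.0

def best_binary_split(categories, targets):
    """Exhaustively search binary partitions of the category levels."""
    categories = np.asarray(categories)
    targets = np.asarray(targets, dtype=float)
    levels = sorted(set(categories.tolist()))
    best_cost, best_left = float("inf"), None
    # Subsets up to half the number of levels cover every partition,
    # since a subset and its complement describe the same split.
    for size in range(1, len(levels) // 2 + 1):
        for left in combinations(levels, size):
            mask = np.isin(categories, left)
            cost = abs_error_cost(targets[mask]) + abs_error_cost(targets[~mask])
            if cost < best_cost:
                best_cost, best_left = cost, set(left)
    return best_cost, best_left

cats = ["a", "a", "b", "b", "c", "c"]
ys = [1.0, 2.0, 10.0, 11.0, 1.5, 2.5]
split_cost, left_levels = best_binary_split(cats, ys)
```

On the toy data the search isolates level "b", whose targets differ most from the rest. The exponential cost of this baseline is what motivates a dedicated splitting algorithm.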

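[AI-9] Dataset Safety in Autonomous Driving: Requirements, Risks, and Assurance

[Quick read]: This paper addresses the safety and reliability problems caused by insufficient dataset integrity in autonomous driving systems; the core challenge is how to build safe datasets aligned with the ISO/PAS 8800 guidelines. The key to the solution is a structured framework that introduces the AI Data Flywheel and a dataset lifecycle covering data collection, annotation, curation, and maintenance. The framework applies rigorous safety analyses to identify hazards, defines dataset safety requirements, and proposes verification and validation strategies, using AI-based perception systems as the primary use case.

Link: https://arxiv.org/abs/2511.08439
Authors: Alireza Abbaspour, Tejaskumar Balgonda Patil, B Ravi Kiran, Russel Mohr, Senthil Yogamani
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract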

Abstract:Dataset integrity is fundamental to the safety and reliability of AI systems, especially in autonomous driving. This paper presents a structured framework for developing safe datasets aligned with ISO/PAS 8800 guidelines. Using AI-based perception systems as the primary use case, it introduces the AI Data Flywheel and the dataset lifecycle, covering data collection, annotation, curation, and maintenance. The framework incorporates rigorous safety analyses to identify hazards and mitigate risks caused by dataset insufficiencies. It also defines processes for establishing dataset safety requirements and proposes verification and validation strategies to ensure compliance with safety standards. In addition to outlining best practices, the paper reviews recent research and emerging trends in dataset safety and autonomous vehicle development, providing insights into current challenges and future directions. By integrating these perspectives, the paper aims to advance robust, safety-assured AI systems for autonomous driving applications.

[AI-10] Understanding Electro-communication and Electro-sensing in Weakly Electric Fish using Multi-Agent Deep Reinforcement Learning

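[Quick read]: This paper addresses the difficulty of studying electrosensing and electrocommunication behavior, and the associated neural activity, of weakly electric fish in naturalistic settings, where recording from multiple individuals is hard and traditional data-driven modeling is therefore limited. The key to the solution is a biologically inspired computational framework in which artificial agents based on recurrent neural networks (RNNs) are trained with multi-agent reinforcement learning (MARL) to modulate their electric organ discharges (EODs) and movement patterns and forage collectively in virtual environments. With evolution-inspired rewards for individual fitness, and without explicit rewards for social interaction, the agents develop behaviors consistent with real fish collectives, such as heavy-tailed EOD interval distributions, context-dependent shifts in EOD intervals, and freeloading. A minimal two-fish assay further shows that access to conspecific EODs and relative dominance jointly shape foraging success.

Link: https://arxiv.org/abs/2511.08436
Authors: Satpreet H. Singh, Sonja Johnson-Yu, Zhouyang Lu, Aaron Walsman, Federico Pedraja, Denis Turcu, Pratyusha Sharma, Naomi Saphra, Nathaniel B. Sawtell, Kanaka Rajan
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY); Neurons and Cognition (q-bio.NC)
Comments:

Click to view abstract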

Abstract:Weakly electric fish, like Gnathonemus petersii, use a remarkable electrical modality for active sensing and communication, but studying their rich electrosensing and electrocommunication behavior and associated neural activity in naturalistic settings remains experimentally challenging. Here, we present a novel biologically-inspired computational framework to study these behaviors, where recurrent neural network (RNN) based artificial agents trained via multi-agent reinforcement learning (MARL) learn to modulate their electric organ discharges (EODs) and movement patterns to collectively forage in virtual environments. Trained agents demonstrate several emergent features consistent with real fish collectives, including heavy tailed EOD interval distributions, environmental context dependent shifts in EOD interval distributions, and social interaction patterns like freeloading, where agents reduce their EOD rates while benefiting from neighboring agents’ active sensing. A minimal two-fish assay further isolates the role of electro-communication, showing that access to conspecific EODs and relative dominance jointly shape foraging success. Notably, these behaviors emerge through evolution-inspired rewards for individual fitness and emergent inter-agent interactions, rather than through rewarding agents explicitly for social interactions. Our work has broad implications for the neuroethology of weakly electric fish, as well as other social, communicating animals in which extensive recordings from multiple individuals, and thus traditional data-driven modeling, are infeasible.

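[AI-11] FaithAct: Faithfulness Planning and Acting in MLLMs

[Quick read]: This paper addresses unfaithfulness in large language models (LLMs): reasoning chains that look plausible but lack perceptual evidence or diverge from the final conclusion. The authors distinguish behavioral faithfulness (alignment between reasoning and output) from perceptual faithfulness (alignment between reasoning and input) and introduce FaithEval, which quantifies step-level and chain-level faithfulness by checking whether each claimed object is visually supported by the image. The key to the solution is FaithAct, a faithfulness-first planning and acting framework that enforces evidential grounding at every reasoning step and improves perceptual faithfulness by up to 26% without degrading task accuracy. The experiments indicate that treating faithfulness as a guiding principle reduces hallucination and also leads to more stable reasoning trajectories.

Link: https://arxiv.org/abs/2511.08409
Authors: Junxian Li, Xinyue Xu, Sai Ma, Sichao Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract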

Abstract:Unfaithfulness remains a persistent challenge for large language models (LLMs), which often produce plausible yet ungrounded reasoning chains that diverge from perceptual evidence or final conclusions. We distinguish between behavioral faithfulness (alignment between reasoning and output) and perceptual faithfulness (alignment between reasoning and input), and introduce FaithEval for quantifying step-level and chain-level faithfulness by evaluating whether each claimed object is visually supported by the image. Building on these insights, we propose FaithAct, a faithfulness-first planning and acting framework that enforces evidential grounding at every reasoning step. Experiments across multiple reasoning benchmarks demonstrate that FaithAct improves perceptual faithfulness by up to 26% without degrading task accuracy compared to prompt-based and tool-augmented baselines. Our analysis shows that treating faithfulness as a guiding principle not only mitigates hallucination but also leads to more stable reasoning trajectories. This work thereby establishes a unified framework for both evaluating and enforcing faithfulness in multimodal reasoning.

[AI-12] SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models AAAI2026

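[Quick read]: This paper addresses the mechanistic interpretation of refusal in language models: how to identify and extract, from the model's internal representation space, multiple directions associated with refusing harmful or unethical prompts. Prior work models refusal as a single direction, typically computed as the difference between the centroids of harmful and harmless prompt representations, which overlooks evidence that concepts may be encoded as low-dimensional manifolds in the high-dimensional latent space. The key to the solution is to use Self-Organizing Maps (SOMs): neurons are learned from harmful prompt representations, and subtracting the centroid of harmless representations from each neuron yields a set of refusal directions that characterizes refusal more fully. Experiments show that ablating these multiple directions suppresses refusal more effectively than the single-direction baseline and specialized jailbreak algorithms.

Link: https://arxiv.org/abs/2511.08379
Authors: Giorgio Piras, Raffaele Mura, Fabio Brau, Luca Oneto, Fabio Roli, Battista Biggio
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted at AAAI 2026

Click to view abstract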

Abstract:Refusal refers to the functional behavior enabling safety-aligned language models to reject harmful or unethical prompts. Following the growing scientific interest in mechanistic interpretability, recent work encoded refusal behavior as a single direction in the model’s latent space; e.g., computed as the difference between the centroids of harmful and harmless prompt representations. However, emerging evidence suggests that concepts in LLMs often appear to be encoded as a low-dimensional manifold embedded in the high-dimensional latent space. Motivated by these findings, we propose a novel method leveraging Self-Organizing Maps (SOMs) to extract multiple refusal directions. To this end, we first prove that SOMs generalize the prior work’s difference-in-means technique. We then train SOMs on harmful prompt representations to identify multiple neurons. By subtracting the centroid of harmless representations from each neuron, we derive a set of multiple directions expressing the refusal concept. We validate our method on an extensive experimental setup, demonstrating that ablating multiple directions from models’ internals outperforms not only the single-direction baseline but also specialized jailbreak algorithms, leading to an effective suppression of refusal. Finally, we conclude by analyzing the mechanistic implications of our approach.
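The pipeline in the abstract (train a SOM on harmful representations, subtract the harmless centroid from each neuron, ablate the resulting directions) can be sketched on toy data as below. This is an illustration under strong assumptions, not the authors' code: random Gaussian vectors stand in for hidden states, the 1-D SOM and its hyperparameters are hypothetical, and ablation is done by projecting out the span of the directions.

```python
import numpy as np

def train_som(data, n_neurons=4, epochs=20, lr=0.5, sigma=1.0, seed=0):
    """Train a tiny 1-D self-organizing map; returns the neuron weight vectors."""
    rng = np.random.default_rng(seed)
    neurons = data[rng.choice(len(data), size=n_neurons, replace=False)].copy()
    grid = np.arange(n_neurons)
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs  # shrink learning rate and neighborhood over time
        for x in data[rng.permutation(len(data))]:
            bmu = np.argmin(np.linalg.norm(neurons - x, axis=1))  # best-matching unit
            width = sigma * decay + 1e-3
            influence = np.exp(-((grid - bmu) ** 2) / (2 * width ** 2))
            neurons += (lr * decay) * influence[:, None] * (x - neurons)
    return neurons

def ablate(hidden, directions):
    # Orthonormalize the directions, then remove their span from the hidden state.
    q, _ = np.linalg.qr(directions.T)
    return hidden - q @ (q.T @ hidden)

rng = np.random.default_rng(1)
harmful = rng.normal(loc=2.0, size=(60, 16))   # toy "harmful prompt" representations
harmless = rng.normal(loc=0.0, size=(60, 16))  # toy "harmless prompt" representations

# One candidate refusal direction per SOM neuron.
directions = train_som(harmful) - harmless.mean(axis=0)
hidden = rng.normal(size=16)
ablated = ablate(hidden, directions)
```

With a single neuron placed at the harmful centroid, the same construction reduces to the difference-in-means direction, which is the sense in which the abstract says SOMs generalize the earlier technique.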

[AI-13] Bid Farewell to Seesaw: Towards Accurate Long-tail Session-based Recommendation via Dual Constraints of Hybrid Intents

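[Quick read]: This paper addresses the "see-saw" effect between recommendation diversity and accuracy in session-based recommendation (SBR), which arises because long-tail items receive little exposure. Existing methods improve long-tail performance at the cost of overall accuracy, and the paper attributes this to their failure to identify and constrain session-irrelevant noise among tail items. The key to the solution is a plug-and-play Hybrid Intent-based Dual Constraint framework (HID) with two innovations: (i) Hybrid Intent Learning, which rebuilds the item-to-intent mapping with attribute-aware spectral clustering and distinguishes target intents from noise intents; and (ii) Intent Constraint Loss, which introduces dual constraints on diversity and accuracy to jointly regulate the representation learning of items and sessions, improving long-tail performance and accuracy together.

Link: https://arxiv.org/abs/2511.08378
Authors: Xiao Wang, Ke Qin, Dongyang Zhang, Xiurui Xie, Shuang Liang
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract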

Abstract: Session-based recommendation (SBR) aims to predict anonymous users’ next interaction based on their interaction sessions. In the practical recommendation scenario, low-exposure items constitute the majority of interactions, creating a long-tail distribution that severely compromises recommendation diversity. Existing approaches attempt to address this issue by promoting tail items but incur accuracy degradation, exhibiting a “see-saw” effect between long-tail and accuracy performance. We attribute such conflict to session-irrelevant noise within the tail items, which existing long-tail approaches fail to identify and constrain effectively. To resolve this fundamental conflict, we propose HID (Hybrid Intent-based Dual Constraint Framework), a plug-and-play framework that transforms the conventional “see-saw” into “win-win” through introducing the hybrid intent-based dual constraints for both long-tail and accuracy. Two key innovations are incorporated in this framework: (i) Hybrid Intent Learning, where we reformulate the intent extraction strategies by employing attribute-aware spectral clustering to reconstruct the item-to-intent mapping. Furthermore, discrimination of session-irrelevant noise is achieved through the assignment of the target and noise intents to each session. (ii) Intent Constraint Loss, which incorporates two novel constraint paradigms regarding the diversity and accuracy to regulate the representation learning process of both items and sessions. These two objectives are unified into a single training loss through rigorous theoretical derivation. Extensive experiments across multiple SBR models and datasets demonstrate that HID can enhance both long-tail performance and recommendation accuracy, establishing new state-of-the-art performance in long-tail recommender systems.

[AI-14] AI-Powered Data Visualization Platform: An Intelligent Web Application for Automated Dataset Analysis

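[Quick read]: This paper addresses the time-consuming and inefficient nature of manual data analysis in data-driven environments, in particular the heavy reliance on specialists for data cleaning, feature selection, and visualization. The key to the solution is an AI-based automated data visualization platform that uses machine learning algorithms for automatic preprocessing (including missing-value imputation and outlier detection), intelligent feature selection (using four different algorithms), and automatic generation of charts and titles based on the attributes of the dataset. The platform uses a Python Flask backend and a React frontend with Firebase Cloud Storage for cloud data storage. According to the abstract, it supports real-time analysis of datasets as large as 100,000 rows and concurrent requests from multiple users, reducing manual effort in the analysis process.

Link: https://arxiv.org/abs/2511.08363
Authors: Srihari R, Pallavi M, Tejaswini S, Vaishnavi R C
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 7 pages, 4 figures, 4 tables

Click to view abstract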

Abstract:An AI-powered data visualization platform that automates the entire data analysis process, from uploading a dataset to generating an interactive visualization. Advanced machine learning algorithms are employed to clean and preprocess the data, analyse its features, and automatically select appropriate visualizations. The system establishes the process of automating AI-based analysis and visualization from the context of data-driven environments, and eliminates the challenge of time-consuming manual data analysis. The combination of a Python Flask backend to access the dataset, paired with a React frontend, provides a robust platform that automatically interacts with Firebase Cloud Storage for numerous data processing and data analysis solutions and real-time sources. Key contributions include automatic and intelligent data cleaning, with imputation for missing values, and detection of outliers, via analysis of the data set. AI solutions to intelligently select features, using four different algorithms, and intelligent title generation and visualization are determined by the attributes of the dataset. These contributions were evaluated using two separate datasets to assess the platform’s performance. In the process evaluation, the initial analysis was performed in real-time on datasets as large as 100000 rows, while the cloud-based demand platform scales to meet requests from multiple users and processes them simultaneously. In conclusion, the cloud-based data visualization application allowed for a significant reduction of manual inputs to the data analysis process while maintaining a high quality, impactful visual outputs, and user experiences

[AI-15] JobSphere: An AI-Powered Multilingual Career Copilot for Government Employment Platforms

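[Quick Read]: This paper aims to address the low engagement and poor accessibility that users of government employment websites face, which show up as complex navigation, a lack of language options, and no personalized support. The key to the solution is JobSphere, a multilingual AI career assistant built on a Retrieval-Augmented Generation (RAG) architecture that supports English, Hindi, and Punjabi. It uses 4-bit quantization so that it can be deployed on consumer-grade GPUs (e.g., NVIDIA RTX 3050 4GB), at a cost 89% lower than cloud-based systems. It also integrates voice interaction, automated mock tests, resume parsing with skill recognition, and embedding-based job recommendation (precision@10 of 68%). The evaluation reports 94% factual accuracy, a median response time of 1.8 seconds, and a System Usability Scale score of 78.5/100, a 50% improvement over the baseline platform.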

Link: https://arxiv.org/abs/2511.08343
Authors: Srihari R,Adarsha B V,Mohammed Usman Hussain,Shweta Singh
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 7 pages, 4 figures, 4 tables

Abstract:Users of government employment websites commonly face engagement and accessibility challenges linked to navigational complexity, a dearth of language options, and a lack of personalized support. This paper introduces JobSphere, an AI-powered career assistant that is redefining the employment platform in Punjab called PGRKAM. JobSphere employs Retrieval-Augmented Generation (RAG) architecture, and it is multilingual, available in English, Hindi and Punjabi. JobSphere technique uses 4-bit quantization, allowing the platform to deploy on consumer-grade GPUs (i.e., NVIDIA RTX 3050 4GB), making the implementation 89% cheaper than that of cloud-based systems. Key innovations include voice-enabled interaction with the assistant, automated mock tests, resume parsing with skills recognition, and embed-based job recommendation that achieves a precision@10 score of 68%. An evaluation of JobSphere’s implementation reveals 94% factual accuracy, a median response time of 1.8 seconds, and a System Usability Scale score of 78.5/100, a 50% improvement compared to the baseline PGRKAM platform context. In conclusion, JobSphere effectively fills significant accessibility gaps for Punjab/Hindi-speaking users in rural locations, while also affirming the users access to trusted job content provided by government agencies.
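
The precision@10 figure quoted in the abstract is a standard ranking metric: the fraction of the top 10 recommended items that are relevant. The sketch below is a generic illustration of that metric only. The function name and the sample job IDs are hypothetical and are not taken from JobSphere.

```python
def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    relevant_set = set(relevant)
    hits = sum(1 for item in top_k if item in relevant_set)
    # Divide by k (not len(top_k)), the usual convention for precision@k.
    return hits / k


# Hypothetical example: 10 recommended job IDs, 7 of which the user found relevant.
recommended_jobs = ["j1", "j2", "j3", "j4", "j5", "j6", "j7", "j8", "j9", "j10"]
relevant_jobs = {"j1", "j2", "j4", "j5", "j7", "j8", "j10"}
score = precision_at_k(recommended_jobs, relevant_jobs, k=10)
print(score)
```

A reported precision@10 of 68% would correspond to averaging this score over many users or queries, though the abstract does not say how the averaging was done.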

[AI-16] HN-MVTS: HyperNetwork-based Multivariate Time Series Forecasting AAAI2026

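[Quick Read]: This paper aims to address the accuracy problem in Multivariate Time Series (MTS) forecasting caused by complex temporal dependencies, and in particular the performance degradation that existing neural network models often suffer when they adopt channel-dependent modeling. The key to the solution is a new architecture, HN-MVTS, in which a hypernetwork generates the weights of the last layer of the target forecasting model and thereby acts as a data-adaptive regularizer that improves generalization and long-range forecasting accuracy. Notably, the hypernetwork is used only during training, so it adds no inference time, and it can be combined with current mainstream forecasting models (e.g., DLinear, PatchTST), typically improving their performance.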

Link: https://arxiv.org/abs/2511.08340
Authors: Andrey Savchenko,Oleg Kachan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: AAAI 2026

Abstract:Accurate forecasting of multivariate time series data remains a formidable challenge, particularly due to the growing complexity of temporal dependencies in real-world scenarios. While neural network-based models have achieved notable success in this domain, complex channel-dependent models often suffer from performance degradation compared to channel-independent models that do not consider the relationship between components but provide high robustness due to small capacity. In this work, we propose HN-MVTS, a novel architecture that integrates a hypernetwork-based generative prior with an arbitrary neural network forecasting model. The input of this hypernetwork is a learnable embedding matrix of time series components. To restrict the number of new parameters, the hypernetwork learns to generate the weights of the last layer of the target forecasting networks, serving as a data-adaptive regularizer that improves generalization and long-range predictive accuracy. The hypernetwork is used only during the training, so it does not increase the inference time compared to the base forecasting model. Extensive experiments on eight benchmark datasets demonstrate that application of HN-MVTS to the state-of-the-art models (DLinear, PatchTST, TSMixer, etc.) typically improves their performance. Our findings suggest that hypernetwork-driven parameterization offers a promising direction for enhancing existing forecasting techniques in complex scenarios.

[AI-17] LPPG-RL: Lexicographically Projected Policy Gradient Reinforcement Learning with Subproblem Exploration

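[Quick Read]: This paper aims to address the difficulty of enforcing priority constraints efficiently in Lexicographic Multi-Objective Reinforcement Learning (LMORL) over continuous action spaces. Existing methods either rely on heuristic threshold tuning or apply only to discrete domains, which limits their generality and practicality. The key to the solution is the Lexicographically Projected Policy Gradient RL (LPPG-RL) framework: it formulates the priority constraints as a sequential gradient projection optimization problem and uses Dykstra's projection algorithm instead of a generic solver, which yields large speedups, especially on small- to medium-scale instances. It also introduces Subproblem Exploration (SE) to prevent vanishing gradients, accelerate convergence, and improve training stability, enabling effective control of multi-objective priorities and policy optimization in continuous spaces.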

Link: https://arxiv.org/abs/2511.08339
Authors: Ruiyu Qiu,Rui Wang,Guanghui Yang,Xiang Li,Zhijiang Shao
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Lexicographic multi-objective problems, which consist of multiple conflicting subtasks with explicit priorities, are common in real-world applications. Despite the advantages of Reinforcement Learning (RL) in single tasks, extending conventional RL methods to prioritized multiple objectives remains challenging. In particular, traditional Safe RL and Multi-Objective RL (MORL) methods have difficulty enforcing priority orderings efficiently. Therefore, Lexicographic Multi-Objective RL (LMORL) methods have been developed to address these challenges. However, existing LMORL methods either rely on heuristic threshold tuning with prior knowledge or are restricted to discrete domains. To overcome these limitations, we propose Lexicographically Projected Policy Gradient RL (LPPG-RL), a novel LMORL framework which leverages sequential gradient projections to identify feasible policy update directions, thereby enabling LPPG-RL broadly compatible with all policy gradient algorithms in continuous spaces. LPPG-RL reformulates the projection step as an optimization problem, and utilizes Dykstra’s projection rather than generic solvers to deliver great speedups, especially for small- to medium-scale instances. In addition, LPPG-RL introduces Subproblem Exploration (SE) to prevent gradient vanishing, accelerate convergence and enhance stability. We provide theoretical guarantees for convergence and establish a lower bound on policy improvement. Finally, through extensive experiments in a 2D navigation environment, we demonstrate the effectiveness of LPPG-RL, showing that it outperforms existing state-of-the-art continuous LMORL methods.
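
The abstract names Dykstra's projection as the solver for the projection step. The sketch below shows the generic Dykstra algorithm for projecting a vector onto an intersection of half-spaces. It rests on an assumption that is not stated in the abstract: each higher-priority objective with gradient `g` contributes a constraint `g . d >= 0` on the update direction `d`. The function names are hypothetical, and this is not the authors' implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_halfspace(y, g):
    """Euclidean projection of y onto the half-space {d : g . d >= 0} (g nonzero)."""
    gy = dot(g, y)
    if gy >= 0.0:
        return list(y)
    scale = gy / dot(g, g)
    return [yi - scale * gi for yi, gi in zip(y, g)]


def dykstra(point, normals, cycles=200):
    """Project `point` onto the intersection of half-spaces {d : g . d >= 0}."""
    x = list(point)
    # One correction term per constraint set; this is what distinguishes
    # Dykstra's method from plain alternating projections.
    corrections = [[0.0] * len(point) for _ in normals]
    for _ in range(cycles):
        for i, g in enumerate(normals):
            y = [xi + ci for xi, ci in zip(x, corrections[i])]
            x = project_halfspace(y, g)
            corrections[i] = [yi - xi for yi, xi in zip(y, x)]
    return x


# Assumed setup: two higher-priority gradients and a lower-priority
# gradient that conflicts with the first one.
higher_priority_grads = [[1.0, 0.0], [1.0, 1.0]]
low_priority_grad = [-1.0, 0.5]
feasible_direction = dykstra(low_priority_grad, higher_priority_grads)
print(feasible_direction)
```

The correction terms are what make the result the nearest feasible point rather than just some feasible point, which is why Dykstra's method is preferred over plain alternating projections when the exact projection matters.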

[AI-18] Improving the accuracy and generalizability of molecular property regression models with a substructure-substitution-rule-informed framework

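[Quick Read]: This paper aims to address the poor accuracy of AI models on molecular property regression tasks and their weak generalization to out-of-distribution (OOD) molecules. The key to the solution is the MolRuleLoss framework, which embeds partial derivative constraints for substructure substitution rules (SSRs) into the loss function of molecular property regression models (MPRMs), improving prediction accuracy and generalization across several molecular properties (e.g., lipophilicity, water solubility, solvation free energy). Experiments show that MolRuleLoss lowers the root mean squared error (RMSE) on multiple datasets, with particularly strong gains on OOD molecules and "activity cliff" molecules. The paper also gives a formal demonstration that the upper bound of the variation in the property change of SSRs is positively correlated with the model's error.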

Link: https://arxiv.org/abs/2511.08314
Authors: Xiaoyu Fan,Lin Guo,Ruizhen Jia,Yang Tian,Zhihao Yang,Boxue Tian
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Artificial Intelligence (AI)-aided drug discovery is an active research field, yet AI models often exhibit poor accuracy in regression tasks for molecular property prediction, and perform catastrophically poorly for out-of-distribution (OOD) molecules. Here, we present MolRuleLoss, a substructure-substitution-rule-informed framework that improves the accuracy and generalizability of multiple molecular property regression models (MPRMs) such as GEM and UniMol for diverse molecular property prediction tasks. MolRuleLoss incorporates partial derivative constraints for substructure substitution rules (SSRs) into an MPRM’s loss function. When using GEM models for predicting lipophilicity, water solubility, and solvation-free energy (using lipophilicity, ESOL, and freeSolv datasets from MoleculeNet), the root mean squared error (RMSE) values with and without MolRuleLoss were 0.587 vs. 0.660, 0.777 vs. 0.798, and 1.252 vs. 1.877, respectively, representing 2.6-33.3% performance improvements. We show that both the number and the quality of SSRs contribute to the magnitude of prediction accuracy gains obtained upon adding MolRuleLoss to an MPRM. MolRuleLoss improved the generalizability of MPRMs for “activity cliff” molecules in a lipophilicity prediction task and improved the generalizability of MPRMs for OOD molecules in a melting point prediction task. In a molecular weight prediction task for OOD molecules, MolRuleLoss reduced the RMSE value of a GEM model from 29.507 to 0.007. We also provide a formal demonstration that the upper bound of the variation for property change of SSRs is positively correlated with an MPRM’s error. Together, we show that using the MolRuleLoss framework as a bolt-on boosts the prediction accuracy and generalizability of multiple MPRMs, supporting diverse applications in areas like cheminformatics and AI-aided drug discovery.
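
The "2.6-33.3% performance improvements" quoted in the abstract can be checked from the three RMSE pairs it reports, assuming the improvement is measured as the relative RMSE reduction against the baseline without MolRuleLoss. That definition is an assumption, since the abstract does not spell it out, and the variable names below are illustrative.

```python
# (RMSE with MolRuleLoss, RMSE without MolRuleLoss), as quoted in the abstract.
rmse_pairs = {
    "lipophilicity": (0.587, 0.660),
    "ESOL": (0.777, 0.798),
    "freeSolv": (1.252, 1.877),
}


def relative_reduction(with_rule, without_rule):
    """Relative RMSE reduction, as a percentage of the baseline RMSE."""
    return 100.0 * (without_rule - with_rule) / without_rule


improvements = {
    name: round(relative_reduction(with_rule, without_rule), 1)
    for name, (with_rule, without_rule) in rmse_pairs.items()
}
print(improvements)
# {'lipophilicity': 11.1, 'ESOL': 2.6, 'freeSolv': 33.3}
```

Under this definition the smallest gain (ESOL, 2.6%) and the largest gain (freeSolv, 33.3%) reproduce the endpoints of the quoted range.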

[AI-19] Test-time Diverse Reasoning by Riemannian Activation Steering AAAI2026

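[Quick Read]: This paper aims to address the performance bottleneck that the Best-of-N reasoning strategy hits on complex tasks because of insufficient output diversity: even under stochastic sampling, the model generates similar or identical erroneous reasoning paths (the output diversity limit). The key to the solution is an unsupervised activation steering strategy that optimizes the steering vectors for multiple reasoning trajectories simultaneously at test time. At each synchronization anchor in the batch generation process, it maximizes the total volume spanned by all possible subsets of intervened activations, which increases the diversity and accuracy of the reasoning paths. Concretely, the method casts the optimization of the steering vectors as a Riemannian optimization problem over a product of spheres with a log-determinant objective, and finds a stationary point with a Riemannian block-coordinate descent algorithm that uses a tuned learning rate.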

Link: https://arxiv.org/abs/2511.08305
Authors: Ly Tran Ho Khanh,Dongxuan Zhu,Man-Chung Yue,Viet Anh Nguyen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 19 pages, 6 figures. Accepted for publication at AAAI 2026 (40th AAAI Conference on Artificial Intelligence)

Abstract:Best-of-N reasoning improves the accuracy of language models in solving complex tasks by sampling multiple candidate solutions and then selecting the best one based on some criteria. A critical bottleneck for this strategy is the output diversity limit, which occurs when the model generates similar outputs despite stochastic sampling, and hence recites the same error. To address this lack of variance in reasoning paths, we propose a novel unsupervised activation steering strategy that simultaneously optimizes the steering vectors for multiple reasoning trajectories at test time. At any synchronization anchor along the batch generation process, we find the steering vectors that maximize the total volume spanned by all possible intervened activation subsets. We demonstrate that these steering vectors can be determined by solving a Riemannian optimization problem over the product of spheres with a log-determinant objective function. We then use a Riemannian block-coordinate descent algorithm with a well-tuned learning rate to obtain a stationary point of the problem, and we apply these steering vectors until the generation process reaches the subsequent synchronization anchor. Empirical evaluations on popular mathematical benchmarks demonstrate that our test-time Riemannian activation steering strategy outperforms vanilla sampling techniques in terms of generative diversity and solution accuracy.
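
To give some intuition for why a log-determinant objective rewards diversity, the sketch below computes the log-determinant of the Gram matrix of a set of unit-normalized vectors, which equals the log of the squared volume of the parallelepiped they span. This is only a simplified illustration: the paper's objective aggregates over subsets of intervened activations and is optimized with Riemannian block-coordinate descent, neither of which is reproduced here. All function names are hypothetical.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def gram_matrix(vectors):
    """Matrix of pairwise inner products of the given vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def log_det_spd(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0.0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    # det(M) = prod(L_ii)^2, so log det(M) = 2 * sum(log L_ii).
    return 2.0 * sum(math.log(lower[i][i]) for i in range(n))


def log_volume(vectors):
    """Log squared volume spanned by the unit-normalized input vectors."""
    return log_det_spd(gram_matrix([normalize(v) for v in vectors]))


orthogonal = log_volume([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
nearly_parallel = log_volume([[1.0, 0.0, 0.0], [1.0, 0.1, 0.0]])
print(orthogonal, nearly_parallel)
```

Orthogonal unit vectors reach the maximum value of 0, while nearly parallel vectors give a large negative value, so maximizing this quantity pushes the vectors apart.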

[AI-20] Smarter Together: Creating Agentic Communities of Practice through Shared Experiential Learning

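[Quick Read]: This paper aims to address the rapid decline of developer knowledge-sharing environments during the shift from human-centric to AI-agent-centric software development. As participation drops, traditional developer-led repositories and communities can no longer support what AI agents need to learn, leaving AI coding agents without a mechanism for accumulating and sharing knowledge over time. The key to the solution is Spark, a new shared agentic memory architecture that emulates the collective intelligence and accumulated experience of human developer communities, so that AI coding agents can continuously contribute to and draw from it and achieve collective continual learning. Experiments show that using Spark as a coach improves the code quality of generic code generation models, allowing a small 30-billion-parameter model to match a much larger state-of-the-art model, and that its recommendations reach helpfulness levels of up to 98.2% in the top two of five qualitative helpfulness bands.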

Link: https://arxiv.org/abs/2511.08301
Authors: Valentin Tablan,Scott Taylor,Gabriel Hurtado,Kristoffer Bernhem,Anders Uhrenholt,Gabriele Farei,Karo Moilanen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 24 pages

Abstract:The transition from human-centric to agent-centric software development practices is disrupting existing knowledge sharing environments for software developers. Traditional peer-to-peer repositories and developer communities for shared technical knowledge and best practice have witnessed dramatic drops in participation in a short period of time. At the same time, agentic functional equivalents are yet to emerge leaving AI agents, which already generate a significant proportion of all new software code produced, without access to repositories of valuable shared learning. In this paper, we introduce Spark, a novel shared agentic memory architecture which is designed to emulate the collective intelligence and know-how of human developer communities. Spark enables AI coding agents to both contribute to and draw from a persistent and continuously evolving experiential memory. Agents operating in the same general problem space use the Spark shared memory as a repository of new knowledge to achieve collective continual learning. We evaluate Spark as a coach for AI coding agents performing software development tasks. We demonstrate that recommendations made by Spark improve the quality of code generated by generic code generation models at varying sizes and capability tiers. Boosted by Spark, a small open-weights model with 30 billion parameters was able to match the code quality afforded by a much larger state-of-the-art model. Separately, we measure the intrinsic quality of recommendations generated by Spark against a wide range of criteria inspired by software development best practice, and achieve helpfulness levels of up to 98.2% in the top two (out of five) qualitative helpfulness bands. 

[AI-21] Dual-Kernel Graph Community Contrastive Learning

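[Quick Read]: This paper aims to address two core problems that Graph Contrastive Learning (GCL) faces when training on large-scale graphs: the high computational cost of the intensive message passing in Graph Neural Networks (GNNs), and the quadratic complexity of the contrastive loss over positive and negative node pairs, which limits scalability. The key to the solution is an efficient GCL framework that transforms the original graph into a compact network of interconnected node sets while preserving community structure. It introduces a kernelized graph community contrastive loss with linear complexity, which enables effective information transfer among node sets to capture the hierarchical structure of the graph, and it incorporates knowledge distillation into a decoupled GNN architecture to accelerate inference while maintaining strong generalization.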

Link: https://arxiv.org/abs/2511.08287
Authors: Xiang Chen,Kun Yue,Wenjie Liu,Zhenyu Zhang,Liang Duan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Graph Contrastive Learning (GCL) has emerged as a powerful paradigm for training Graph Neural Networks (GNNs) in the absence of task-specific labels. However, its scalability on large-scale graphs is hindered by the intensive message passing mechanism of GNN and the quadratic computational complexity of contrastive loss over positive and negative node pairs. To address these issues, we propose an efficient GCL framework that transforms the input graph into a compact network of interconnected node sets while preserving structural information across communities. We firstly introduce a kernelized graph community contrastive loss with linear complexity, enabling effective information transfer among node sets to capture hierarchical structural information of the graph. We then incorporate a knowledge distillation technique into the decoupled GNN architecture to accelerate inference while maintaining strong generalization performance. Extensive experiments on sixteen real-world datasets of varying scales demonstrate that our method outperforms state-of-the-art GCL baselines in both effectiveness and scalability.

[AI-22] DiagramIR: An Automatic Pipeline for Educational Math Diagram Evaluation NEURIPS2025

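[Quick Read]: This paper aims to address the lack of an efficient, scalable way to evaluate visual content (such as geometric figures) that Large Language Models (LLMs) generate in educational settings. Most LLM tools remain text-only, which limits their use in domains such as mathematics that depend on visual representations. Prior work has shown that LLMs can generate LaTeX TikZ code that compiles to educational figures, but verifying the correctness of these figures at scale remains a bottleneck. The key to the solution is DiagramIR, an automatic evaluation pipeline based on intermediate representations (IRs) of LaTeX TikZ code. The pipeline agrees with human raters more closely than baselines such as LLM-as-a-Judge, and it lets smaller models (e.g., GPT-4.1-Mini) perform comparably to larger models (e.g., GPT-5) at a 10x lower inference cost.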

Link: https://arxiv.org/abs/2511.08283
Authors: Vishal Kumar,Shubhra Mishra,Rebecca Hao,Rizwaan Malik,David Broman,Dorottya Demszky
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Published at the Math-AI Workshop at NeurIPS 2025

Abstract:Large Language Models (LLMs) are increasingly being adopted as tools for learning; however, most tools remain text-only, limiting their usefulness for domains where visualizations are essential, such as mathematics. Recent work shows that LLMs are capable of generating code that compiles to educational figures, but a major bottleneck remains: scalable evaluation of these diagrams. We address this by proposing DiagramIR: an automatic and scalable evaluation pipeline for geometric figures. Our method relies on intermediate representations (IRs) of LaTeX TikZ code. We compare our pipeline to other evaluation baselines such as LLM-as-a-Judge, showing that our approach has higher agreement with human raters. This evaluation approach also enables smaller models like GPT-4.1-Mini to perform comparably to larger models such as GPT-5 at a 10x lower inference cost, which is important for deploying accessible and scalable education technologies.

[AI-23] Bi-Objective Evolutionary Optimization for Large-Scale Open Pit Mine Scheduling Problem under Uncertainty with Chance Constraints

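[Quick Read]: This paper aims to address the open-pit mine scheduling problem (OPMSP), where traditional deterministic methods ignore geological uncertainty and therefore often produce suboptimal and potentially infeasible production schedules. The core challenge is to model geological uncertainty effectively in long-term mine planning while balancing economic value against scheduling risk. The key to the solution is a bi-objective formulation that simultaneously maximizes the expected net present value and minimizes scheduling risk, independent of any specific confidence level. The paper further introduces a domain-specific greedy randomized initialization and a precedence-aware period-swap mutation operator, and integrates them into three multi-objective evolutionary algorithms (GSEMO, a mutation-only variant of MOEA/D, and NSGA-II) to search for schedules under the problem's constraints and approximate a high-quality Pareto front.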

Link: https://arxiv.org/abs/2511.08275
Authors: Ishara Hewa Pathiranage,Aneta Neumann
Affiliations: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments:

Abstract:The open-pit mine scheduling problem (OPMSP) is a complex, computationally expensive process in long-term mine planning, constrained by operational and geological dependencies. Traditional deterministic approaches often ignore geological uncertainty, leading to suboptimal and potentially infeasible production schedules. Chance constraints allow modeling of stochastic components by ensuring probabilistic constraints are satisfied with high probability. This paper presents a bi-objective formulation of the OPMSP that simultaneously maximizes expected net present value and minimizes scheduling risk, independent of the confidence level required for the constraint. Solutions are represented using integer encoding, inherently satisfying reserve constraints. We introduce a domain-specific greedy randomized initialization and a precedence-aware period-swap mutation operator. We integrate these operators into three multi-objective evolutionary algorithms: the global simple evolutionary multi-objective optimizer (GSEMO), a mutation-only variant of multi-objective evolutionary algorithm based on decomposition (MOEA/D), and non-dominated sorting genetic algorithm II (NSGA-II). We compare our bi-objective formulation against the single-objective approach, which depends on a specific confidence level, by analyzing mine deposits consisting of up to 112 687 blocks. Results demonstrate that the proposed bi-objective formulation yields more robust and balanced trade-offs between economic value and risk compared to single-objective, confidence-dependent approach.

[AI-24] Where and What Matters: Sensitivity-Aware Task Vectors for Many-Shot Multimodal In-Context Learning AAAI2026

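[Quick Read]: This paper aims to address two challenges that Large Multimodal Models (LMMs) face in many-shot in-context learning (ICL): limited context length and high inference cost. Existing task-vector-based methods mitigate these problems by inserting compact representations of the demonstrations, but they either overlook where to insert the vectors or struggle to determine suitable values for each location. The paper proposes a Sensitivity-aware Task Vector insertion framework (STV). Its core insight is that activation deltas across query-context pairs exhibit consistent structural patterns, which can be used to locate sensitive insertion positions. On this basis, STV builds a pre-clustered activation bank for each location and uses reinforcement learning to choose the most suitable vector to insert, yielding consistent improvements over previous task-vector-based methods across multimodal models and tasks.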

Link: https://arxiv.org/abs/2511.08246
Authors: Ziyu Ma,Chenhui Gou,Yiming Hu,Yong Wang,Xiangxiang Chu,Bohan Zhuang,Jianfei Cai
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by AAAI 2026

Abstract:Large Multimodal Models (LMMs) have shown promising in-context learning (ICL) capabilities, but scaling to many-shot settings remains difficult due to limited context length and high inference cost. To address these challenges, task-vector-based methods have been explored by inserting compact representations of many-shot in-context demonstrations into model activations. However, existing task-vector-based methods either overlook the importance of where to insert task vectors or struggle to determine suitable values for each location. To this end, we propose a novel Sensitivity-aware Task Vector insertion framework (STV) to figure out where and what to insert. Our key insight is that activation deltas across query-context pairs exhibit consistent structural patterns, providing a reliable cue for insertion. Based on the identified sensitive-aware locations, we construct a pre-clustered activation bank for each location by clustering the activation values, and then apply reinforcement learning to choose the most suitable one to insert. We evaluate STV across a range of multimodal models (e.g., Qwen-VL, Idefics-2) and tasks (e.g., VizWiz, OK-VQA), demonstrating its effectiveness and showing consistent improvements over previous task-vector-based methods with strong generalization.

[AI-25] Beyond Distributions: Geometric Action Control for Continuous Reinforcement Learning

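[Quick Read]: This paper aims to address the geometric mismatch between continuous control policies in deep reinforcement learning (such as Gaussian policies) and bounded action spaces, as well as the impracticality of existing methods based on spherical distributions (such as the von Mises-Fisher distribution), whose reliance on Bessel functions and rejection sampling makes them computationally expensive. The key to the solution is a new paradigm called Geometric Action Control (GAC): it decomposes action generation into a direction vector and a learnable concentration parameter, which keeps the geometric benefits of spherical distributions while reducing the computational complexity from O(dk) to O(d) and the parameter count from 2d to d+1. This design enables efficient interpolation between deterministic actions and uniform spherical noise, and it matches or exceeds state-of-the-art methods across six MuJoCo benchmarks.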

Link: https://arxiv.org/abs/2511.08234
Authors: Zhihao Lin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages, 5 figures

Abstract:Gaussian policies have dominated continuous control in deep reinforcement learning (RL), yet they suffer from a fundamental mismatch: their unbounded support requires ad-hoc squashing functions that distort the geometry of bounded action spaces. While von Mises-Fisher (vMF) distributions offer a theoretically grounded alternative on the sphere, their reliance on Bessel functions and rejection sampling hinders practical adoption. We propose Geometric Action Control (GAC), a novel action generation paradigm that preserves the geometric benefits of spherical distributions while simplifying computation. GAC decomposes action generation into a direction vector and a learnable concentration parameter, enabling efficient interpolation between deterministic actions and uniform spherical noise. This design reduces parameter count from 2d to d+1, and avoids the O(dk) complexity of vMF rejection sampling, achieving simple O(d) operations. Empirically, GAC consistently matches or exceeds state-of-the-art methods across six MuJoCo benchmarks, achieving 37.6% improvement over SAC on Ant-v4 and the best results on 4 out of 6 tasks. Our ablation studies reveal that both spherical normalization and adaptive concentration control are essential to GAC’s success. These findings suggest that robust and efficient continuous control does not require complex distributions, but a principled respect for the geometry of action spaces. Code and pretrained models are available in supplementary materials.
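
The abstract does not give GAC's exact sampling formula, so the following is only one plausible reading of "interpolation between deterministic actions and uniform spherical noise". It assumes a concentration value in [0, 1] that linearly blends a unit direction with a uniform sample on the sphere and then renormalizes. The function names and the blending rule are assumptions made for illustration, not the authors' method.

```python
import math
import random


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def uniform_sphere_sample(dim, rng):
    """Uniform sample on the unit sphere: a normalized standard Gaussian vector."""
    return normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])


def sample_action(direction, concentration, rng):
    """Blend a unit direction with spherical noise, then renormalize.

    `concentration` is assumed to lie in [0, 1]: 1 returns the deterministic
    direction, 0 returns pure uniform spherical noise.
    """
    if not 0.0 <= concentration <= 1.0:
        raise ValueError("concentration must be in [0, 1]")
    mean_dir = normalize(direction)
    noise = uniform_sphere_sample(len(mean_dir), rng)
    mixed = [
        concentration * m + (1.0 - concentration) * n
        for m, n in zip(mean_dir, noise)
    ]
    return normalize(mixed)


rng = random.Random(0)
deterministic = sample_action([3.0, 4.0, 0.0], 1.0, rng)
noisy = sample_action([3.0, 4.0, 0.0], 0.3, rng)
print(deterministic, noisy)
```

Under this reading the policy needs d numbers for the direction plus one for the concentration, and each sample costs a constant number of O(d) vector operations, which is consistent with the d+1 parameter count and O(d) cost quoted in the abstract.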

[AI-26] Real-Time Performance Analysis of Multi-Fidelity Residual Physics-Informed Neural Process-Based State Estimation for Robotic Systems

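[Quick Read]: This paper aims to address the loss of accuracy that model mismatch causes in real-time nonlinear state estimation, particularly given the need for reliable error bounds in safety-critical applications. The key to the solution is a new data-driven estimation method based on the multi-fidelity residual physics-informed neural process (MFR-PINP), which compensates for model bias by learning the residuals between low-fidelity predictions and high-fidelity ground-truth dynamics. It also incorporates the split conformal (SC) prediction framework to provide robust uncertainty guarantees in both training and inference, improving the reliability and practicality of the estimates.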

Link: https://arxiv.org/abs/2511.08231
Authors: Devin Hunter,Chinwendu Enyioha
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 5 figures

Abstract:Various neural network architectures are used in many of the state-of-the-art approaches for real-time nonlinear state estimation. With the ever-increasing incorporation of these data-driven models into the estimation domain, model predictions with reliable margins of error are a requirement – especially for safety-critical applications. This paper discusses the application of a novel real-time, data-driven estimation approach based on the multi-fidelity residual physics-informed neural process (MFR-PINP) toward the real-time state estimation of a robotic system. Specifically, we address the model-mismatch issue of selecting an accurate kinematic model by tasking the MFR-PINP to also learn the residuals between simple, low-fidelity predictions and complex, high-fidelity ground-truth dynamics. To account for model uncertainty present in a physical implementation, robust uncertainty guarantees from the split conformal (SC) prediction framework are modeled in the training and inference paradigms. We provide implementation details of our MFR-PINP-based estimator for a hybrid online learning setting to validate our model’s usage in real-time applications. Experimental results of our approach’s performance in comparison to the state-of-the-art variants of the Kalman filter (i.e. unscented Kalman filter and deep Kalman filter) in estimation scenarios showed promising results for the MFR-PINP model as a viable option in real-time estimation tasks.
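
Split conformal prediction, which the abstract relies on for uncertainty guarantees, has a simple generic recipe: hold out a calibration set, compute absolute residuals of the model on it, and use a finite-sample-corrected quantile of those residuals as the interval half-width. The sketch below shows that generic recipe for a scalar output. How the paper integrates it into MFR-PINP training and inference is not described in the abstract, and the function names here are hypothetical.

```python
import math


def split_conformal_radius(cal_preds, cal_targets, alpha=0.1):
    """Half-width giving roughly (1 - alpha) coverage under exchangeability."""
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    # Finite-sample correction: take the ceil((n + 1)(1 - alpha))-th smallest score.
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        # Too few calibration points for the requested coverage level.
        return math.inf
    return scores[rank - 1]


def predict_interval(prediction, radius):
    return (prediction - radius, prediction + radius)


# Hypothetical calibration data: the model predicts 0.0 and the errors are 1..10.
cal_preds = [0.0] * 10
cal_targets = [float(i) for i in range(1, 11)]
radius = split_conformal_radius(cal_preds, cal_targets, alpha=0.2)
interval = predict_interval(5.0, radius)
print(radius, interval)
```

The guarantee is marginal and depends on the calibration and test points being exchangeable, which is a nontrivial assumption for time-correlated robot state data.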

[AI-27] MADD: Multi-Agent Drug Discovery Orchestra EMNLP2025

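[Quick Read]: This paper aims to address the challenge of hit identification in early drug discovery, where traditional methods require substantial experimental resources and are inefficient. To this end, the authors propose MADD, a multi-agent system whose core idea is to use four coordinated agents to build and execute customized pipelines that run from a natural language query through molecule generation and screening. This combines the interpretability of large language models (LLMs) with the precision of specialized models and tools, making drug design more automated and more accessible.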

Link: https://arxiv.org/abs/2511.08217
Authors: Gleb V. Solovev,Alina B. Zhidkovskaya,Anastasia Orlova,Nina Gubina,Anastasia Vepreva,Rodion Golovinskii,Ilya Tonkii,Ivan Dubrovsky,Ivan Gurev,Dmitry Gilemkhanov,Denis Chistiakov,Timur A. Aliev,Ivan Poddiakov,Galina Zubkova,Ekaterina V. Skorb,Vladimir Vinogradov,Alexander Boukhanovsky,Nikolay Nikitin,Andrei Dmitrenko,Anna Kalyuzhnaya,Andrey Savchenko
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: EMNLP2025 accepted paper

Abstract:Hit identification is a central challenge in early drug discovery, traditionally requiring substantial experimental resources. Recent advances in artificial intelligence, particularly large language models (LLMs), have enabled virtual screening methods that reduce costs and improve efficiency. However, the growing complexity of these tools has limited their accessibility to wet-lab researchers. Multi-agent systems offer a promising solution by combining the interpretability of LLMs with the precision of specialized models and tools. In this work, we present MADD, a multi-agent system that builds and executes customized hit identification pipelines from natural language queries. MADD employs four coordinated agents to handle key subtasks in de novo compound generation and screening. We evaluate MADD across seven drug discovery cases and demonstrate its superior performance compared to existing LLM-based solutions. Using MADD, we pioneer the application of AI-first drug design to five biological targets and release the identified hit molecules. Finally, we introduce a new benchmark of query-molecule pairs and docking scores for over three million compounds to contribute to the agentic future of drug design.

[AI-28] FedPoP: Federated Learning Meets Proof of Participation

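[Quick Read]: This paper aims to address the difficulty that participants in Federated Learning (FL) have in proving their contribution to the training of the global model, which is especially important when models serve as monetizable digital assets. Existing schemes typically depend on a public ledger or heavy computation, making it hard to combine privacy protection with practicality. The proposed FedPoP framework provides a nonlinkable proof-of-participation mechanism that needs no public ledger and is computationally lightweight, making participation auditable without revealing client identities. Its core design choice is to integrate seamlessly with existing secure aggregation protocols, so that a client can efficiently prove its participation to a third party while client anonymity and privacy are preserved. In the prototype, FedPoP adds 0.97 seconds of per-round overhead, and a client can prove its participation in 0.0612 seconds, which indicates that real-world deployment is feasible.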

Link: https://arxiv.org/abs/2511.08207
Authors: Devriş İşler(IMDEA Networks Institute - Universidad Carlos III de Madrid),Elina van Kempen(University of California, Irvine),Seoyeon Hwang(Stealth Software Technologies Inc.),Nikolaos Laoutaris(IMDEA Networks Institute)
Affiliations: IMDEA Networks Institute - Universidad Carlos III de Madrid; University of California, Irvine; Stealth Software Technologies Inc.; IMDEA Networks Institute
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This version is currently under review

Abstract:Federated learning (FL) offers privacy preserving, distributed machine learning, allowing clients to contribute to a global model without revealing their local data. As models increasingly serve as monetizable digital assets, the ability to prove participation in their training becomes essential for establishing ownership. In this paper, we address this emerging need by introducing FedPoP, a novel FL framework that allows nonlinkable proof of participation while preserving client anonymity and privacy without requiring either extensive computations or a public ledger. FedPoP is designed to seamlessly integrate with existing secure aggregation protocols to ensure compatibility with real-world FL deployments. We provide a proof of concept implementation and an empirical evaluation under realistic client dropouts. In our prototype, FedPoP introduces 0.97 seconds of per-round overhead atop securely aggregated FL and enables a client to prove its participation/contribution to a model held by a third party in 0.0612 seconds. These results indicate FedPoP is practical for real-world deployments that require auditable participation without sacrificing privacy.

[AI-29] EHRStruct: A Comprehensive Benchmark Framework for Evaluating Large Language Models on Structured Electronic Health Record Tasks

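[Quick Read]: This paper aims to address the lack of standardized evaluation frameworks and clearly defined tasks for Large Language Models (LLMs) that process structured Electronic Health Record (EHR) data, which makes model performance hard to compare systematically. The key to the solution is the EHRStruct benchmark, which comprises 11 representative clinical tasks and 2,200 task-specific evaluation samples spanning diverse clinical needs. The paper further introduces EHRMaster, a code-augmented method that achieves state-of-the-art performance and strengthens LLMs' understanding of and reasoning over structured EHR data.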

Link: https://arxiv.org/abs/2511.08206
Authors: Xiao Yang,Xuejiao Zhao,Zhiqi Shen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 28 pages, 6 figures, 6 tables

Abstract:Structured Electronic Health Record (EHR) data stores patient information in relational tables and plays a central role in clinical decision-making. Recent advances have explored the use of large language models (LLMs) to process such data, showing promise across various clinical [tasks]. However, the absence of standardized evaluation frameworks and clearly defined tasks makes it difficult to systematically assess and compare LLM performance on structured EHR [data]. To address these evaluation challenges, we introduce EHRStruct, a benchmark specifically designed to evaluate LLMs on structured EHR [tasks]. [EHRStruct] defines 11 representative tasks spanning diverse clinical needs and includes 2,200 task-specific evaluation samples derived from two widely used EHR [datasets]. We use EHRStruct to evaluate 20 advanced and representative LLMs, covering both general and medical [models]. We further analyze key factors influencing model performance, including input formats, few-shot generalisation, and finetuning strategies, and compare results with 11 state-of-the-art LLM-based enhancement methods for structured data reasoning. Our results indicate that many structured EHR tasks place high demands on the understanding and reasoning capabilities of [LLMs]. In response, we propose EHRMaster, a code-augmented method that achieves state-of-the-art performance and offers practical

[AI-30] Towards Provably Unlearnable Examples via Bayes Error Optimisation

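[Quick Read]: This paper aims to address the protection of user data in the training of large-scale machine learning models, and specifically how to prevent a model from learning from particular data instances when training data is collected from online sources without explicit consent. Existing methods introduce the concept of unlearnable examples, but they mostly rely on heuristics, lack theoretical guarantees, and lose their effect when mixed with clean data. The key to the solution is to systematically maximize the Bayes error, i.e., the irreducible classification error, through an optimization-based construction of unlearnable examples. The optimization is solved efficiently with projected gradient ascent, provably increases the Bayes error, and remains effective when the examples are mixed with clean data.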

链接: https://arxiv.org/abs/2511.08191
作者: Ruihan Zhang,Jun Sun,Ee-Peng Lim,Peixin Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The recent success of machine learning models, especially large-scale classifiers and language models, relies heavily on training with massive data. These data are often collected from online sources. This raises serious concerns about the protection of user data, as individuals may not have given consent for their data to be used in training. To address this concern, recent studies introduce the concept of unlearnable examples, i.e., data instances that appear natural but are intentionally altered to prevent models from effectively learning from them. While existing methods demonstrate empirical effectiveness, they typically rely on heuristic trials and lack formal guarantees. Besides, when unlearnable examples are mixed with clean data, as is often the case in practice, their unlearnability disappears. In this work, we propose a novel approach to constructing unlearnable examples by systematically maximising the Bayes error, a measurement of irreducible classification error. We develop an optimisation-based approach and provide an efficient solution using projected gradient ascent. Our method provably increases the Bayes error and remains effective when the unlearnable examples are mixed with clean samples. Experimental results across multiple datasets and model architectures are consistent with our theoretical analysis and show that our approach can effectively restrict data learnability in practice.
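The abstract names projected gradient ascent as the solver. The following is a minimal sketch of that generic technique only: the toy objective, the L-infinity budget `eps`, and the function names are assumptions for illustration, and the paper's actual Bayes-error objective is not reproduced here.

```python
def project_linf(delta, eps):
    """Project onto the L-infinity ball by clipping each coordinate to [-eps, eps]."""
    return [max(-eps, min(eps, d)) for d in delta]


def projected_gradient_ascent(grad_fn, dim, eps, step=0.1, iters=200):
    """Repeat: take an ascent step along the gradient, then project back."""
    delta = [0.0] * dim
    for _ in range(iters):
        grad = grad_fn(delta)
        delta = [d + step * g for d, g in zip(delta, grad)]
        delta = project_linf(delta, eps)
    return delta


# Toy stand-in objective (an assumption): f(delta) = -sum((delta_i - t_i)^2).
target = [0.3, -0.05, 1.0]


def toy_grad(delta):
    """Gradient of the toy objective with respect to delta."""
    return [-2.0 * (d - t) for d, t in zip(delta, target)]


delta = projected_gradient_ascent(toy_grad, dim=3, eps=0.1)
```

Coordinates whose unconstrained optimum lies outside the budget end up on the boundary of the ball, while the others converge to their optimum inside it.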

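[AI-31] MARC: Multimodal and Multi-Task Agentic Retrieval-Augmented Generation for Cold-Start Recommender System CIKM2025

Quick read: This paper addresses the drop in recommendation quality that data sparsity causes in cocktail recommender systems under cold-start conditions. Traditional approaches are limited by single-modality information or by the expressiveness of knowledge-graph structures, and struggle to capture the complex relationships between users and items. The key to the solution is MARC, a multimodal, multi-task recommender system based on agentic Retrieval-Augmented Generation (RAG). It builds its knowledge representation on a graph database and combines a task recognition router with a reflection process to generate high-quality, context-aware answers. On 200 manually crafted questions, it improves answer quality over a simple vector database, supporting the advantage of graph databases in cold-start settings.

Link: https://arxiv.org/abs/2511.08181
Authors: Seung Hwan Cho, Yujin Yang, Danik Baeck, Minjoo Kim, Young-Min Kim, Heejung Lee, Sangjin Park
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 13 pages, 2 figures, Accepted at RDGENAI at CIKM 2025 workshop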

Abstract:Recommender systems (RS) are currently being studied to mitigate limitations during cold-start conditions by leveraging modality information or introducing Agent concepts based on the exceptional reasoning capabilities of Large Language Models (LLMs). Meanwhile, food and beverage recommender systems have traditionally used knowledge graph and ontology concepts due to the domain’s unique data attributes and relationship characteristics. Against this background, we propose MARC, a multimodal and multi-task cocktail recommender system based on Agentic Retrieval-Augmented Generation (RAG) utilizing a graph database under cold-start conditions. The proposed system generates high-quality, contextually appropriate answers through two core processes: a task recognition router and a reflection process. The graph database was constructed by processing cocktail data from Kaggle, and its effectiveness was evaluated using 200 manually crafted questions. The evaluation used both LLM-as-a-judge and human evaluation to demonstrate that answers generated via the graph database outperformed those from a simple vector database in terms of quality. The code is available online.

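[AI-32] Deep (Predictive) Discounted Counterfactual Regret Minimization AAAI2026

Quick read: This paper addresses the difficulty existing neural counterfactual regret minimization (CFR) methods have in approximating advanced CFR variants, which limits their use in large imperfect-information games. The key to the solution is an efficient model-free neural CFR algorithm: at each iteration it collects variance-reduced sampled advantages using a value network, fits cumulative advantages by bootstrapping, and applies discounting and clipping operations to simulate the update mechanisms of advanced CFR variants. Experiments show faster convergence in typical imperfect-information games and stronger adversarial performance in a large poker game.

Link: https://arxiv.org/abs/2511.08174
Authors: Hang Xu, Kai Li, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: Accepted to 40th AAAI Conference on Artificial Intelligence (AAAI 2026)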

Abstract:Counterfactual regret minimization (CFR) is a family of algorithms for effectively solving imperfect-information games. To enhance CFR’s applicability in large games, researchers use neural networks to approximate its behavior. However, existing methods are mainly based on vanilla CFR and struggle to effectively integrate more advanced CFR variants. In this work, we propose an efficient model-free neural CFR algorithm, overcoming the limitations of existing methods in approximating advanced CFR variants. At each iteration, it collects variance-reduced sampled advantages based on a value network, fits cumulative advantages by bootstrapping, and applies discounting and clipping operations to simulate the update mechanisms of advanced CFR variants. Experimental results show that, compared with model-free neural algorithms, it exhibits faster convergence in typical imperfect-information games and demonstrates stronger adversarial performance in a large poker game.
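To make the "discounting and clipping" step concrete, here is a minimal tabular sketch for a single decision point. It is not the paper's neural algorithm: the constant `discount`, the function names, and the example regrets are assumptions chosen for illustration.

```python
def regret_matching(cum_regrets):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in cum_regrets]
    total = sum(positive)
    if total == 0.0:
        # No action has positive regret: fall back to the uniform strategy.
        return [1.0 / len(cum_regrets)] * len(cum_regrets)
    return [p / total for p in positive]


def discounted_clipped_update(cum_regrets, inst_regrets, discount):
    """Discount old cumulative regrets, add the new ones, then clip at zero."""
    return [max(discount * c + r, 0.0) for c, r in zip(cum_regrets, inst_regrets)]


cum = [2.0, -1.0, 0.0]
cum = discounted_clipped_update(cum, inst_regrets=[1.0, 1.0, -3.0], discount=0.5)
strategy = regret_matching(cum)
```

Discounting reduces the weight of early iterations, and clipping keeps an action that performed badly in the past from staying suppressed for long once it starts performing well.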

[AI-33] An Efficient Training Pipeline for Reasoning Graphical User Interface Agents

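Quick read: This paper addresses the performance limits that reliance on large volumes of noisy synthetic data imposes on visual grounding, particularly when building reasoning-capable Graphical User Interface (GUI) agents. The key to the solution is an efficient training pipeline that combines model-based data filtering with parameter-efficient fine-tuning. First, 12K clean and diverse instances are curated from 4.8M synthetic examples by identifying challenging cases, removing misaligned samples, and selecting a diverse set of multimodal instances. Then a 3B-parameter vision-language model is trained on the curated data under three lightweight regimes: supervised fine-tuning, chain-of-thought-augmented fine-tuning, and reinforcement learning via Group Relative Policy Optimization. On benchmarks such as ScreenSpot, Multimodal-Mind2Web, and AndroidControl, the approach matches or surpasses larger models, indicating that principled data curation and robust adaptation can rival large-scale training and yield compact yet capable multimodal reasoning agents.

Link: https://arxiv.org/abs/2511.08172
Authors: Georgios Pantazopoulos, Eda B. Özyiğit
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: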

Abstract:Visual grounding is the task of localising image regions from natural language queries and is critical for reasoning-capable Graphical User Interface agents. Many existing methods rely on massive, noisy synthetic data. This work introduces an efficient training pipeline that combines model-based data filtering with parameter-efficient fine-tuning. From 4.8M synthetic examples, 12K clean and diverse instances are curated by first identifying challenging cases, removing misaligned examples, and then selecting a diverse set of multimodal instances. On this data, a 3B-parameter Vision-Language Model is trained under three regimes: supervised fine-tuning, chain-of-thought-augmented fine-tuning, and reinforcement learning via Group Relative Policy Optimization. Models trained with the filtered data and lightweight training strategies match or surpass larger baselines on benchmarks such as ScreenSpot, Multimodal-Mind2Web, and AndroidControl. These results demonstrate that principled data curation and robust adaptation can rival large-scale training, enabling compact yet capable multimodal reasoning agents.

[AI-34] oboro: Text-to-Image Synthesis on Limited Data using Flow-based Diffusion Transformer with MMH Attention

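Quick read: This paper addresses labor shortages in Japan's anime production industry by developing an image generation model from scratch to improve production efficiency. The key to the solution is an architecture designed to generate high-quality images from limited datasets, trained only on copyright-cleared images so that content generation is both compliant and efficient. The model is named "oboro:", and its foundation model weights and inference code are publicly available, marking the first release of an open-source, commercially oriented image generation AI fully developed in Japan.

Link: https://arxiv.org/abs/2511.08168
Authors: Ryusuke Mizutani, Kazuaki Matano, Tsugumi Kadowaki, Haruki Tenya, Layris, nuigurumi, Koki Hashimoto, Yu Tanaka
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 11 pages, 10 figures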

Abstract:This project was conducted as a 2nd-term adopted project of the “Post-5G Information and Communication System Infrastructure Enhancement R&D Project / Development of Competitive Generative AI Foundation Models (GENIAC),” a business of the Ministry of Economy, Trade and Industry (METI) and the New Energy and Industrial Technology Development Organization (NEDO). To address challenges such as labor shortages in Japan’s anime production industry, this project aims to develop an image generation model from scratch. This report details the technical specifications of the developed image generation model, “oboro:.” We have developed “oboro:,” a new image generation model built from scratch, using only copyright-cleared images for training. A key characteristic is its architecture, designed to generate high-quality images even from limited datasets. The foundation model weights and inference code are publicly available alongside this report. This project marks the first release of an open-source, commercially-oriented image generation AI fully developed in Japan. AiHUB originated from the OSS community; by maintaining transparency in our development process, we aim to contribute to Japan’s AI researcher and engineer community and promote the domestic AI development ecosystem.

[AI-35] ProbSelect: Stochastic Client Selection for GPU-Accelerated Compute Devices in the 3D Continuum

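Quick read: This paper addresses client selection for federated learning in the 3D continuum, which integrates edge, cloud, and space devices. In dynamic environments where satellites and mobile devices frequently change operating conditions, traditional methods that rely on historical data and continuous monitoring become impractical, and existing solutions mostly consider CPU computation and fail to capture the complex characteristics of GPU-accelerated training. The key to the solution is ProbSelect, which uses analytical modeling and probabilistic forecasting for client selection without historical data or continuous monitoring, and models client selection within user-defined service level objectives (SLOs). Experiments show it improves SLO compliance by 13.77% on average and reduces computational waste by 72.5%.

Link: https://arxiv.org/abs/2511.08147
Authors: Andrija Stanisic, Stefan Nastic
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: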

Abstract:Integration of edge, cloud and space devices into a unified 3D continuum imposes significant challenges for client selection in federated learning systems. Traditional approaches rely on continuous monitoring and historical data collection, which becomes impractical in dynamic environments where satellites and mobile devices frequently change operational conditions. Furthermore, existing solutions primarily consider CPU-based computation, failing to capture complex characteristics of GPU-accelerated training that is prevalent across the 3D continuum. This paper introduces ProbSelect, a novel approach utilizing analytical modeling and probabilistic forecasting for client selection on GPU-accelerated devices, without requiring historical data or continuous monitoring. We model client selection within user-defined SLOs. Extensive evaluation across diverse GPU architectures and workloads demonstrates that ProbSelect improves SLO compliance by 13.77% on average while achieving 72.5% computational waste reduction compared to baseline approaches.
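One way to picture probabilistic client selection against an SLO is sketched below. Everything here is an assumption for illustration: the abstract does not describe ProbSelect's analytical model, so the sketch simply treats each client's round time as normally distributed and keeps clients likely to finish before a deadline. The client names, forecasts, and the `min_prob` threshold are hypothetical.

```python
import math


def prob_within_deadline(mean_s, std_s, deadline_s):
    """P(T <= deadline) for a round time T ~ Normal(mean_s, std_s**2)."""
    z = (deadline_s - mean_s) / (std_s * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def select_clients(forecasts, deadline_s, min_prob=0.9):
    """Keep clients forecast to meet the deadline with probability >= min_prob."""
    chosen = []
    for name, (mean_s, std_s) in forecasts.items():
        if prob_within_deadline(mean_s, std_s, deadline_s) >= min_prob:
            chosen.append(name)
    return sorted(chosen)


# Hypothetical per-client forecasts: (mean seconds, standard deviation).
forecasts = {
    "edge-gpu": (40.0, 5.0),
    "cloud-gpu": (20.0, 2.0),
    "sat-gpu": (55.0, 15.0),
}
selected = select_clients(forecasts, deadline_s=50.0)
```

A client with a fast mean but high variance can still be rejected, which is the point of selecting on a probability rather than on a point estimate.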

[AI-36] SafeMIL: Learning Offline Safe Imitation Policy from Non-Preferred Trajectories AAAI2026

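Quick read: This paper addresses offline safe imitation learning: in real-world settings where online interaction is risky and per-step reward and safety-cost information is hard to specify, the agent must learn the expert's demonstrated behavior while avoiding the risky behavior implicitly conveyed by non-preferred trajectories. The key to the solution is SafeMIL, which uses Multiple Instance Learning (MIL) to learn a parameterized cost function that predicts whether a state-action pair is risky, and embeds this cost in policy optimization to prioritize safety, yielding a policy that satisfies cost constraints without sacrificing reward performance.

Link: https://arxiv.org/abs/2511.08136
Authors: Returaj Burnwal, Nirav Pravinbhai Bhatt, Balaraman Ravindran
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 18 pages, AAAI 2026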

Abstract:In this work, we study the problem of offline safe imitation learning (IL). In many real-world settings, online interactions can be risky, and accurately specifying the reward and the safety cost information at each timestep can be difficult. However, it is often feasible to collect trajectories reflecting undesirable or risky behavior, implicitly conveying the behavior the agent should avoid. We refer to these trajectories as non-preferred trajectories. Unlike standard IL, which aims to mimic demonstrations, our agent must also learn to avoid risky behavior using non-preferred trajectories. In this paper, we propose a novel approach, SafeMIL, to learn a parameterized cost that predicts if the state-action pair is risky via Multiple Instance Learning. The learned cost is then used to avoid non-preferred behaviors, resulting in a policy that prioritizes safety. We empirically demonstrate that our approach can learn a safer policy that satisfies cost constraints without degrading the reward performance, thereby outperforming several baselines.

[AI-37] National Institute on Aging PREPARE Challenge: Early Detection of Cognitive Impairment Using Speech - The SpeechCARE Solution

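Quick read: This paper addresses the low rate of early diagnosis of Alzheimer's disease and related dementias (ADRD), given that more than half of older adults with cognitive impairment remain undiagnosed. Conventional speech-processing methods based on hand-crafted features or general-purpose audio classifiers are limited in performance and generalizability. The key to the solution is SpeechCARE, a multimodal speech processing pipeline built on pretrained multilingual acoustic and linguistic transformer models. It uses a dynamic fusion architecture inspired by Mixture of Experts (MoE) to weight acoustic, linguistic, and demographic inputs, and can be extended with further modalities such as social factors and imaging. It also includes robust preprocessing (automatic transcription, LLM-based anomaly detection, and task identification) and a SHAP-based explainability module with LLM reasoning to improve accuracy and transparency across tasks. SpeechCARE distinguishes cognitively healthy, mild cognitive impairment (MCI), and Alzheimer's disease (AD) individuals with AUC = 0.88 and F1 = 0.72, and reaches AUC = 0.90 and F1 = 0.62 for MCI detection.

Link: https://arxiv.org/abs/2511.08132
Authors: Maryam Zolnoori, Hossein Azadmaleki, Yasaman Haghbin, Ali Zolnour, Mohammad Javad Momeni Nezhad, Sina Rashidi, Mehdi Naserian, Elyas Esmaeili, Sepehr Karimi Arpanahi
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: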

Abstract:Alzheimer’s disease and related dementias (ADRD) affect one in five adults over 60, yet more than half of individuals with cognitive decline remain undiagnosed. Speech-based assessments show promise for early detection, as phonetic motor planning deficits alter acoustic features (e.g., pitch, tone), while memory and language impairments lead to syntactic and semantic errors. However, conventional speech-processing pipelines with hand-crafted features or general-purpose audio classifiers often exhibit limited performance and generalizability. To address these limitations, we introduce SpeechCARE, a multimodal speech processing pipeline that leverages pretrained, multilingual acoustic and linguistic transformer models to capture subtle speech-related cues associated with cognitive impairment. Inspired by the Mixture of Experts (MoE) paradigm, SpeechCARE employs a dynamic fusion architecture that weights transformer-based acoustic, linguistic, and demographic inputs, allowing integration of additional modalities (e.g., social factors, imaging) and enhancing robustness across diverse tasks. Its robust preprocessing includes automatic transcription, large language model (LLM)-based anomaly detection, and task identification. A SHAP-based explainability module and LLM reasoning highlight each modality’s contribution to decision-making. SpeechCARE achieved AUC = 0.88 and F1 = 0.72 for classifying cognitively healthy, MCI, and AD individuals, with AUC = 0.90 and F1 = 0.62 for MCI detection. Bias analysis showed minimal disparities, except for adults over 80. Mitigation techniques included oversampling and weighted loss. Future work includes deployment in real-world care settings (e.g., VNS Health, Columbia ADRC) and EHR-integrated explainability for underrepresented populations in New York City.

[AI-38] A robust methodology for long-term sustainability evaluation of Machine Learning models

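Quick read: This paper addresses the lack of standardized, model-agnostic evaluation protocols for the sustainability and efficiency of AI systems. Existing assessments measure only short-term experimental resource usage and overemphasize batch learning, so they fail to reflect the long-term lifecycle of real-world models. The key to the solution is a comprehensive evaluation protocol, applicable to both batch and streaming learning, for assessing the long-term sustainability of machine learning models. Experiments across a variety of classification tasks and model types show that traditional static train-test evaluation does not reliably capture sustainability under evolving data and repeated model updates, that long-term sustainability varies significantly across models, and that higher environmental cost does not necessarily bring better performance.

Link: https://arxiv.org/abs/2511.08120
Authors: Jorge Paz-Ruza, João Gama, Amparo Alonso-Betanzos, Bertha Guijarro-Berdiñas
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: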

Abstract:Sustainability and efficiency have become essential considerations in the development and deployment of Artificial Intelligence systems, yet existing regulatory and reporting practices lack standardized, model-agnostic evaluation protocols. Current assessments often measure only short-term experimental resource usage and disproportionately emphasize batch learning settings, failing to reflect real-world, long-term AI lifecycles. In this work, we propose a comprehensive evaluation protocol for assessing the long-term sustainability of ML models, applicable to both batch and streaming learning scenarios. Through experiments on diverse classification tasks using a range of model types, we demonstrate that traditional static train-test evaluations do not reliably capture sustainability under evolving data and repeated model updates. Our results show that long-term sustainability varies significantly across models, and in many cases, higher environmental cost yields little performance benefit.

[AI-39] Advancements in synthetic data extraction for industrial injection molding

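Quick read: This paper addresses the limited performance of machine learning models for industrial processes when data acquisition is difficult (time-consuming and costly). The key to the solution is to generate synthetic data by simulation and combine it with real data for training, using iterative experiments to find the proportion of synthetic data that improves generalization and robustness while preserving the authenticity and relevance of real data. The approach targets manufacturing settings such as injection molding, with potential to reduce manual labor, machine use, and material waste.

Link: https://arxiv.org/abs/2511.08117
Authors: Georg Rottenwalter, Marcel Tilly, Christian Bielenberg, Katharina Obermeier
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Published in: Progress in Artificial Intelligence, EPIA 2023, Lecture Notes in Computer Science, vol. 14258. 13 pages, 3 figures, 3 tables. This is the author-accepted manuscript (AAM) version. DOI: https://doi.org/10.1007/978-3-031-49011-8_43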

Abstract:Machine learning has significant potential for optimizing various industrial processes. However, data acquisition remains a major challenge as it is both time-consuming and costly. Synthetic data offers a promising solution to augment insufficient data sets and improve the robustness of machine learning models. In this paper, we investigate the feasibility of incorporating synthetic data into the training process of the injection molding process using an existing Long Short-Term Memory architecture. Our approach is to generate synthetic data by simulating production cycles and incorporating them into the training data set. Through iterative experimentation with different proportions of synthetic data, we attempt to find an optimal balance that maximizes the benefits of synthetic data while preserving the authenticity and relevance of real data. Our results suggest that the inclusion of synthetic data improves the model’s ability to handle different scenarios, with potential practical industrial applications to reduce manual labor, machine use, and material waste. This approach provides a valuable alternative for situations where extensive data collection and maintenance have been impractical or costly and thus could contribute to more efficient manufacturing processes in the future.

[AI-40] Improving Industrial Injection Molding Processes with Explainable AI for Quality Classification

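Quick read: This paper addresses two obstacles to machine learning in industrial quality control: limited interpretability caused by model complexity, and incomplete sensor data. The key to the solution is Explainable Artificial Intelligence (XAI): SHAP, Grad-CAM, and LIME are used to identify the most relevant features of a Long Short-Term Memory (LSTM) model, and the original 19 input features are then reduced to 9 and to 6. Experiments show that this XAI-based feature selection maintains high classification performance while improving generalization and slightly speeding up inference, which makes AI-driven quality control more feasible in manufacturing settings with limited sensor capabilities.

Link: https://arxiv.org/abs/2511.08108
Authors: Georg Rottenwalter, Marcel Tilly, Victor Owolabi
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted and published at the 2025 IEEE 9th Forum on Research and Technologies for Society and Industry (RTSI). 10 pages, 6 figures. DOI: https://doi.org/10.1109/RTSI64020.2025.11212395 . This is the author-accepted manuscript (AAM) version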

Abstract:Machine learning is an essential tool for optimizing industrial quality control processes. However, the complexity of machine learning models often limits their practical applicability due to a lack of interpretability. Additionally, many industrial machines lack comprehensive sensor technology, making data acquisition incomplete and challenging. Explainable Artificial Intelligence offers a solution by providing insights into model decision-making and identifying the most relevant features for classification. In this paper, we investigate the impact of feature reduction using XAI techniques on the quality classification of injection-molded parts. We apply SHAP, Grad-CAM, and LIME to analyze feature importance in a Long Short-Term Memory model trained on real production data. By reducing the original 19 input features to 9 and 6, we evaluate the trade-off between model accuracy, inference speed, and interpretability. Our results show that reducing features can improve generalization while maintaining high classification performance, with a small increase in inference speed. This approach enhances the feasibility of AI-driven quality control, particularly for industrial settings with limited sensor capabilities, and paves the way for more efficient and interpretable machine learning applications in manufacturing.
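The feature-reduction step can be illustrated with a small sketch: rank features by their mean absolute attribution and keep the top k. The feature names and attribution values below are made up, and mean absolute attribution is a common aggregation choice assumed here, not necessarily the one used in the paper.

```python
def rank_features(attributions, names):
    """Order features by mean absolute attribution, most important first."""
    n_samples = len(attributions)
    scores = {
        name: sum(abs(row[j]) for row in attributions) / n_samples
        for j, name in enumerate(names)
    }
    return sorted(names, key=lambda name: scores[name], reverse=True)


def keep_top_k(sample, names, ranked, k):
    """Drop every feature of one sample that is not among the k top-ranked."""
    keep = set(ranked[:k])
    return [value for value, name in zip(sample, names) if name in keep]


# Hypothetical sensor features and per-sample attribution scores.
names = ["pressure", "temperature", "cycle_time", "humidity"]
attributions = [
    [0.40, -0.30, 0.05, 0.01],
    [-0.50, 0.20, 0.10, 0.00],
    [0.45, -0.25, -0.02, 0.02],
]
ranked = rank_features(attributions, names)
reduced = keep_top_k([7.1, 230.0, 12.5, 0.4], names, ranked, k=2)
```

In practice the attribution matrix would come from an explainer such as SHAP or LIME run on the trained model, and the model would be retrained on the reduced inputs.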

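[AI-41] Gateways to Tractability for Satisfiability in Pearl's Causal Hierarchy

Quick read: This paper addresses the satisfiability problem for formulas in Pearl's Causal Hierarchy (PCH), which is computationally intractable in almost all classical settings. The key to the solution is a parameterized complexity analysis that identifies the first gateways to tractability: using parameters such as the number of variables and primal treewidth, the authors design fixed-parameter and XP algorithms for satisfiability in key probabilistic and counterfactual fragments. Technically, they depart from the dynamic programming paradigm typical of treewidth-based algorithms and instead exploit structural characterizations of well-formed causal models, providing a new algorithmic toolkit for causal reasoning.

Link: https://arxiv.org/abs/2511.08091
Authors: Robert Ganian, Marlene Gründel, Simon Wietheger
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC)
Comments: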

Abstract:Pearl’s Causal Hierarchy (PCH) is a central framework for reasoning about probabilistic, interventional, and counterfactual statements, yet the satisfiability problem for PCH formulas is computationally intractable in almost all classical settings. We revisit this challenge through the lens of parameterized complexity and identify the first gateways to tractability. Our results include fixed-parameter and XP-algorithms for satisfiability in key probabilistic and counterfactual fragments, using parameters such as primal treewidth and the number of variables, together with matching hardness results that map the limits of tractability. Technically, we depart from the dynamic programming paradigm typically employed for treewidth-based algorithms and instead exploit structural characterizations of well-formed causal models, providing a new algorithmic toolkit for causal reasoning.

[AI-42] Dynamic Sparsity: Challenging Common Sparsity Assumptions for Learning World Models in Robotic Reinforcement Learning Benchmarks

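Quick read: This paper examines whether the sparsity priors commonly assumed when learning dynamics models (world models) in reinforcement learning are justified, empirically testing the two common assumptions of state sparsity and temporal sparsity in realistic robotic environments. The key to the approach is an analysis of ground-truth dynamics from a set of robotic reinforcement learning tasks in the MuJoCo Playground benchmark suite, assessing whether the causal graph is globally sparse, whether sparsity is state-dependent, and whether local dynamics change sparsely over time. The results indicate that global sparsity is rare, but that local, state-dependent sparsity exists and appears in temporally localized clusters (for example, during contact events) affecting specific subsets of state dimensions. This challenges common sparsity priors and points to the need for grounded inductive biases that reflect the state-dependent structure of real-world dynamics.

Link: https://arxiv.org/abs/2511.08086
Authors: Muthukumar Pandaram, Jakob Hollenstein, David Drexel, Samuele Tosatto, Antonio Rodríguez-Sánchez, Justus Piater
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: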

Abstract:The use of learned dynamics models, also known as world models, can improve the sample efficiency of reinforcement learning. Recent work suggests that the underlying causal graphs of such dynamics models are sparsely connected, with each of the future state variables depending only on a small subset of the current state variables, and that learning may therefore benefit from sparsity priors. Similarly, temporal sparsity, i.e. sparsely and abruptly changing local dynamics, has also been proposed as a useful inductive bias. In this work, we critically examine these assumptions by analyzing ground-truth dynamics from a set of robotic reinforcement learning environments in the MuJoCo Playground benchmark suite, aiming to determine whether the proposed notions of state and temporal sparsity actually tend to hold in typical reinforcement learning tasks. We study (i) whether the causal graphs of environment dynamics are sparse, (ii) whether such sparsity is state-dependent, and (iii) whether local system dynamics change sparsely. Our results indicate that global sparsity is rare, but instead the tasks show local, state-dependent sparsity in their dynamics and this sparsity exhibits distinct structures, appearing in temporally localized clusters (e.g., during contact events) and affecting specific subsets of state dimensions. These findings challenge common sparsity prior assumptions in dynamics learning, emphasizing the need for grounded inductive biases that reflect the state-dependent sparsity structure of real-world dynamics.
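State-dependent sparsity can be seen in a small sketch that counts the nonzero entries of a dynamics Jacobian at two different states. The toy system below (a point mass that feels a spring force only while its position is negative) is an assumption for illustration and is unrelated to the MuJoCo Playground environments analysed in the paper.

```python
def step(state, dt=0.1, k=50.0):
    """Toy dynamics: a spring force acts on the mass only while pos < 0."""
    pos, vel, other = state
    force = -k * pos if pos < 0.0 else 0.0  # "contact" is active below zero
    return [pos + dt * vel, vel + dt * force, 0.9 * other]


def jacobian(f, state, eps=1e-6):
    """Central finite-difference Jacobian with J[i][j] = d f_i / d state_j."""
    n = len(state)
    columns = []
    for j in range(n):
        upper = list(state)
        lower = list(state)
        upper[j] += eps
        lower[j] -= eps
        f_upper, f_lower = f(upper), f(lower)
        columns.append([(a - b) / (2 * eps) for a, b in zip(f_upper, f_lower)])
    return [[columns[j][i] for j in range(n)] for i in range(n)]


def nonzero_count(matrix, tol=1e-8):
    """Number of entries whose magnitude exceeds tol."""
    return sum(1 for row in matrix for value in row if abs(value) > tol)


free = nonzero_count(jacobian(step, [1.0, 0.5, 2.0]))      # no contact
contact = nonzero_count(jacobian(step, [-1.0, 0.5, 2.0]))  # in contact
```

The same system has a different dependency pattern in the two states: velocity depends on position only during contact, so a single global sparsity mask would not describe both regimes.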

[AI-43] Prudential Reliability of Large Language Models in Reinsurance: Governance Assurance and Capital Efficiency

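Quick read: This paper addresses how to assess the reliability of large language models (LLMs) in reinsurance, with particular attention to the information asymmetry and uncontrolled risk they may introduce in risk transfer and capital allocation. The key to the solution is a five-pillar architecture (governance, data lineage, assurance, resilience, and regulatory alignment) that translates Solvency II, SR 11-7, and guidance from EIOPA (2025), NAIC (2023), and IAIS (2024) into measurable lifecycle controls, validated empirically with the Reinsurance AI Reliability and Assurance Benchmark (RAIRAB). Retrieval-grounded LLM configurations with embedded governance achieve higher grounding accuracy (0.90), reduce hallucination and interpretive drift by roughly 40%, and nearly double transparency, indicating that existing prudential frameworks can accommodate reliable AI when governance is explicit, data are traceable, and assurance is verifiable.

Link: https://arxiv.org/abs/2511.08082
Authors: Stella C. Dong
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); General Economics (econ.GN)
Comments: 48 pages, 9 figures, 5 tables. Submitted to the Journal of Risk and Insurance (JRI), November 2025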

Abstract:This paper develops a prudential framework for assessing the reliability of large language models (LLMs) in reinsurance. A five-pillar architecture–governance, data lineage, assurance, resilience, and regulatory alignment–translates supervisory expectations from Solvency II, SR 11-7, and guidance from EIOPA (2025), NAIC (2023), and IAIS (2024) into measurable lifecycle controls. The framework is implemented through the Reinsurance AI Reliability and Assurance Benchmark (RAIRAB), which evaluates whether governance-embedded LLMs meet prudential standards for grounding, transparency, and accountability. Across six task families, retrieval-grounded configurations achieved higher grounding accuracy (0.90), reduced hallucination and interpretive drift by roughly 40%, and nearly doubled transparency. These mechanisms lower informational frictions in risk transfer and capital allocation, showing that existing prudential doctrines already accommodate reliable AI when governance is explicit, data are traceable, and assurance is verifiable.

[AI-44] Hierarchical Structure-Property Alignment for Data-Efficient Molecular Generation and Editing

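Quick read: This paper addresses two core challenges in generative molecular design: the difficulty of accurately modeling the complex relationships between molecular structure and multiple properties, and the narrow coverage and incomplete annotation of molecular property data, which limit property-based models. The key to the solution is HSPAG (Hierarchical Structure-Property Alignment Generation), a data-efficient framework that treats SMILES sequences and molecular properties as complementary modalities and aligns structure and property at the atom, substructure, and whole-molecule levels. It selects representative samples through scaffold clustering and hard samples through an auxiliary variational auto-encoder (VAE) to reduce the pre-training data required, and adds a property relevance-aware masking mechanism and diversified perturbation strategies to improve generation quality and controllability under sparse annotations.

Link: https://arxiv.org/abs/2511.08080
Authors: Ziyu Fan, Zhijian Huang, Yahan Li, Xiaowen Hu, Siyuan Shen, Yunliang Wang, Zeyu Zhong, Shuhong Liu, Shuning Yang, Shangqian Wu, Min Wu, Lei Deng
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: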

Abstract:Property-constrained molecular generation and editing are crucial in AI-driven drug discovery but remain hindered by two factors: (i) capturing the complex relationships between molecular structures and multiple properties remains challenging, and (ii) the narrow coverage and incomplete annotations of molecular properties weaken the effectiveness of property-based models. To tackle these limitations, we propose HSPAG, a data-efficient framework featuring hierarchical structure-property alignment. By treating SMILES and molecular properties as complementary modalities, the model learns their relationships at atom, substructure, and whole-molecule levels. Moreover, we select representative samples through scaffold clustering and hard samples via an auxiliary variational auto-encoder (VAE), substantially reducing the required pre-training data. In addition, we incorporate a property relevance-aware masking mechanism and diversified perturbation strategies to enhance generation quality under sparse annotations. Experiments demonstrate that HSPAG captures fine-grained structure-property relationships and supports controllable generation under multiple property constraints. Two real-world case studies further validate the editing capabilities of HSPAG.

[AI-45] Constrained and Robust Policy Synthesis with Satisfiability-Modulo-Probabilistic-Model-Checking AAAI2026

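Quick read: This paper addresses how to effectively compute policies for a known finite Markov decision process (MDP) that are robust and satisfy arbitrary structural constraints. Such policies must keep performing well under perturbations of the MDP and meet additional structural constraints, for example on representation or implementation cost, and traditional methods struggle to handle both requirements together. The key to the solution is a flexible and efficient framework: flexibility comes from expressing structural constraints in a first-order theory, and efficiency from tightly integrating satisfiability solvers, which handle the combinatorial nature of the problem, with probabilistic model checking algorithms, which analyze the MDPs. Experiments on a few hundred benchmarks demonstrate feasibility and competitiveness with state-of-the-art methods on various fragments of the problem.

Link: https://arxiv.org/abs/2511.08078
Authors: Linus Heck, Filip Macák, Milan Češka, Sebastian Junges
Affiliation: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: Accepted to AAAI 2026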

Abstract:The ability to compute reward-optimal policies for given and known finite Markov decision processes (MDPs) underpins a variety of applications across planning, controller synthesis, and verification. However, we often want policies (1) to be robust, i.e., they perform well on perturbations of the MDP and (2) to satisfy additional structural constraints regarding, e.g., their representation or implementation cost. Computing such robust and constrained policies is indeed computationally more challenging. This paper contributes the first approach to effectively compute robust policies subject to arbitrary structural constraints using a flexible and efficient framework. We achieve flexibility by allowing constraints to be expressed in a first-order theory over a set of MDPs, while the root of our efficiency lies in the tight integration of satisfiability solvers to handle the combinatorial nature of the problem and probabilistic model checking algorithms to handle the analysis of MDPs. Experiments on a few hundred benchmarks demonstrate the feasibility for constrained and robust policy synthesis and the competitiveness with state-of-the-art methods for various fragments of the problem.

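[AI-46] An Integrated Fusion Framework for Ensemble Learning Leveraging Gradient Boosting and Fuzzy Rule-Based Models

Quick read: This paper addresses two core problems fuzzy rule-based models face in practice: complex model design that scales poorly to large datasets, and large rule counts that raise the risk of overfitting and reduce interpretability. The key to the solution is an Integrated Fusion Framework that combines gradient boosting with fuzzy rule-based models. At each iteration it builds a fuzzy rule-based model controlled by a dynamic factor, which regulates each sub-model's contribution to the ensemble to prevent dominance and encourage diversity, acts as a regularization parameter, and supports dynamic tuning based on model performance. Together with a sample-based correction mechanism driven by validation-set feedback, this mitigates overfitting, preserves interpretability, and simplifies model maintenance and updates.

Link: https://arxiv.org/abs/2511.08077
Authors: Jinbo Li, Peng Liu, Long Chen, Witold Pedrycz, Weiping Ding
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 6 figures. IEEE Transactions on Artificial Intelligence (2024)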

Abstract:The integration of different learning paradigms has long been a focus of machine learning research, aimed at overcoming the inherent limitations of individual methods. Fuzzy rule-based models excel in interpretability and have seen widespread application across diverse fields. However, they face challenges such as complex design specifications and scalability issues with large datasets. The fusion of different techniques and strategies, particularly Gradient Boosting, with Fuzzy Rule-Based Models offers a robust solution to these challenges. This paper proposes an Integrated Fusion Framework that merges the strengths of both paradigms to enhance model performance and interpretability. At each iteration, a Fuzzy Rule-Based Model is constructed and controlled by a dynamic factor to optimize its contribution to the overall ensemble. This control factor serves multiple purposes: it prevents model dominance, encourages diversity, acts as a regularization parameter, and provides a mechanism for dynamic tuning based on model performance, thus mitigating the risk of overfitting. Additionally, the framework incorporates a sample-based correction mechanism that allows for adaptive adjustments based on feedback from a validation set. Experimental results substantiate the efficacy of the presented gradient boosting framework for fuzzy rule-based models, demonstrating performance enhancement, especially in terms of mitigating overfitting and complexity typically associated with many rules. By leveraging an optimal factor to govern the contribution of each model, the framework improves performance, maintains interpretability, and simplifies the maintenance and update of the models.
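The role of a contribution factor in boosting can be sketched with a plain squared-loss loop. Two substitutions are made here and should be read as assumptions: a one-threshold regression stump stands in for the fuzzy rule-based base model, and the factor is a fixed constant rather than the dynamically tuned one the abstract describes.

```python
def fit_stump(xs, residuals):
    """Least-squares single-threshold regressor on 1-D inputs (stand-in learner)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds=20, factor=0.3):
    """Fit each new model to the residuals; `factor` scales its contribution."""
    base = sum(ys) / len(ys)
    models = []
    preds = [base] * len(xs)
    history = []  # training mean squared error after each round
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        model = fit_stump(xs, residuals)
        models.append(model)
        preds = [p + factor * model(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + factor * sum(m(x) for m in models)

    return predict, history


xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]
predict, history = boost(xs, ys)
```

A factor below 1 keeps any single base model from dominating the ensemble, at the cost of needing more rounds to reach the same training error.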

[AI-47] Clustering-based Anomaly Detection in Multivariate Time Series Data

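【Quick Read】: This paper addresses anomaly detection in multivariate time series, where the core challenge is to account for the temporal dimension and the complex relationships among variables at the same time. The key to the solution is a clustering-based method: a sliding window first generates multivariate subsequences, and an extended fuzzy clustering then reveals the structure within them; next, a reconstruction criterion rebuilds the subsequences from the optimal cluster centers and partition matrix, and a confidence index quantifies the degree of anomaly; finally, Particle Swarm Optimization (PSO) optimizes the overall detection process. The framework identifies anomalies in both amplitude and shape, and applies to domains such as health care, weather data analysis, finance, and disease outbreak detection.

Link: https://arxiv.org/abs/2511.08072
Authors: Jinbo Li, Hesam Izakian, Witold Pedrycz, Iqbal Jamal
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 33 pages, 20 figures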

Abstract:Multivariate time series data come as a collection of time series describing different aspects of a certain temporal phenomenon. Anomaly detection in this type of data constitutes a challenging problem yet with numerous applications in science and engineering because anomaly scores come from the simultaneous consideration of the temporal and variable relationships. In this paper, we propose a clustering-based approach to detect anomalies concerning the amplitude and the shape of multivariate time series. First, we use a sliding window to generate a set of multivariate subsequences and thereafter apply an extended fuzzy clustering to reveal a structure present within the generated multivariate subsequences. Finally, a reconstruction criterion is employed to reconstruct the multivariate subsequences with the optimal cluster centers and the partition matrix. We construct a confidence index to quantify a level of anomaly detected in the series and apply Particle Swarm Optimization as an optimization vehicle for the problem of anomaly detection. Experimental studies completed on several synthetic and six real-world datasets suggest that the proposed methods can detect the anomalies in multivariate time series. With the help of available clusters revealed by the extended fuzzy clustering, the proposed framework can detect anomalies in the multivariate time series and is suitable for identifying anomalous amplitude and shape patterns in various application domains such as health care, weather data analysis, finance, and disease outbreak detection.
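
The window, cluster, and reconstruct steps can be sketched as follows. This is a simplified illustration that uses standard Fuzzy C-Means memberships and the usual FCM reconstruction formula rather than the paper's extended clustering; the cluster centers are assumed to be given, and all function names are hypothetical. The squared reconstruction error serves as a rough anomaly score.

```python
def sliding_windows(series, width, step=1):
    """series: list of time steps, each a list of variable values.
    Returns each window flattened into a single vector."""
    return [
        [v for row in series[s:s + width] for v in row]
        for s in range(0, len(series) - width + 1, step)
    ]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def memberships(x, centers, m=2.0):
    """Standard FCM membership of x in each cluster (fuzzifier m > 1)."""
    d = [sq_dist(x, c) for c in centers]
    if 0.0 in d:  # x coincides with a center
        hits = [1.0 if di == 0.0 else 0.0 for di in d]
        return [h / sum(hits) for h in hits]
    inv = [di ** (-1.0 / (m - 1.0)) for di in d]
    total = sum(inv)
    return [v / total for v in inv]


def reconstruct(x, centers, m=2.0):
    """Rebuild x as a membership-weighted combination of the centers."""
    weights = [u ** m for u in memberships(x, centers, m)]
    total = sum(weights)
    return [
        sum(w * c[k] for w, c in zip(weights, centers)) / total
        for k in range(len(x))
    ]


def anomaly_scores(series, centers, width, m=2.0):
    """Squared reconstruction error per window; larger means more anomalous."""
    return [
        sq_dist(w, reconstruct(w, centers, m))
        for w in sliding_windows(series, width)
    ]
```

Windows that lie far from every cluster center reconstruct poorly, so a spike in amplitude or an unusual shape shows up as a large score.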

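[AI-48] MSCR: Exploring the Vulnerability of LLMs' Mathematical Reasoning Abilities Using Multi-Source Candidate Replacement

【Quick Read】: This paper addresses the insufficient robustness of Large Language Models (LLMs) to minor input perturbations in mathematical reasoning tasks. Existing methods generally suffer from limited scalability, weak semantic preservation, and high cost. The key to the solution is MSCR, an automated adversarial attack method based on multi-source candidate replacement: it combines cosine similarity in the LLM embedding space, the WordNet dictionary, and contextual predictions from a masked language model to generate a set of semantically similar candidates for each word in the input question, then filters them and substitutes them one word at a time to carry out the attack. Experiments show that perturbing a single word can significantly reduce model accuracy (by up to 49.89% on GSM8K and 35.40% on MATH500) while preserving high semantic consistency, exposing the robustness deficiencies and efficiency bottlenecks of current LLMs in mathematical reasoning.

Link: https://arxiv.org/abs/2511.08055
Authors: Zhishen Sun, Guang Dai, Haishan Ye
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: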

Abstract:LLMs demonstrate performance comparable to human abilities in complex tasks such as mathematical reasoning, but their robustness in mathematical reasoning under minor input perturbations still lacks systematic investigation. Existing methods generally suffer from limited scalability, weak semantic preservation, and high costs. Therefore, we propose MSCR, an automated adversarial attack method based on multi-source candidate replacement. By combining three information sources including cosine similarity in the embedding space of LLMs, the WordNet dictionary, and contextual predictions from a masked language model, we generate for each word in the input question a set of semantically similar candidates, which are then filtered and substituted one by one to carry out the attack. We conduct large-scale experiments on LLMs using the GSM8K and MATH500 benchmarks. The results show that even a slight perturbation involving only a single word can significantly reduce the accuracy of all models, with the maximum drop reaching 49.89% on GSM8K and 35.40% on MATH500, while preserving the high semantic consistency of the perturbed questions. Further analysis reveals that perturbations not only lead to incorrect outputs but also substantially increase the average response length, which results in more redundant reasoning paths and higher computational resource consumption. These findings highlight the robustness deficiencies and efficiency bottlenecks of current LLMs in mathematical reasoning tasks.
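
The candidate-merging step might look roughly like the sketch below. The three sources are modeled as plain callables (in practice they would wrap an embedding lookup, WordNet, and a masked language model), the `keep` filter is a placeholder, and skipping numeric tokens is an assumption made here so that the underlying math problem stays unchanged. None of these names come from the paper.

```python
def merge_candidates(word, sources, keep):
    """Union of candidates from all sources, in first-seen order, after filtering."""
    seen, merged = set(), []
    for source in sources:
        for cand in source(word):
            if cand != word and cand not in seen and keep(word, cand):
                seen.add(cand)
                merged.append(cand)
    return merged


def single_word_variants(question, sources, keep):
    """Yield every question obtained by replacing exactly one word."""
    tokens = question.split()
    for i, tok in enumerate(tokens):
        if any(ch.isdigit() for ch in tok):
            continue  # assumption: numbers are never replaced
        for cand in merge_candidates(tok, sources, keep):
            yield " ".join(tokens[:i] + [cand] + tokens[i + 1:])


# Toy stand-ins for the three information sources (hypothetical data).
embedding_source = lambda w: {"buys": ["purchases"]}.get(w, [])
wordnet_source = lambda w: {"buys": ["purchases", "gets"]}.get(w, [])
masked_lm_source = lambda w: []

variants = list(
    single_word_variants(
        "Tom buys 3 apples",
        [embedding_source, wordnet_source, masked_lm_source],
        keep=lambda original, candidate: True,
    )
)
```

Each variant would then be sent to the target model, and an attack counts as successful when the answer changes although the question's meaning does not.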

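[AI-49] Towards a Standard Enterprise-Relevant Agentic AI Benchmark: Lessons from 5.5 billion tokens worth of agentic AI evaluations

【Quick Read】: This paper addresses the lack of reliable evaluation methods for deploying enterprise-grade generative AI systems, in particular the fact that traditional Large Language Model (LLM) benchmarks, because of training data contamination, cannot effectively measure agentic capabilities such as multi-step tool use and decision-making under uncertainty. The key to the solution is the enterprise-focused Kamiwaza Agentic Merit Index (KAMI) v0.1, a contamination-resistant benchmark designed specifically to evaluate agentic behavior. Across 170,000 test items processing more than 5.5 billion tokens over 35 model configurations, the study finds that traditional benchmark rankings correlate weakly with practical agentic performance and that newer-generation models (such as Llama 4 or Qwen 3) do not always outperform older versions. It also reports on cost-performance tradeoffs, model-specific behavioral patterns, and the effect of reasoning capabilities on token efficiency, giving enterprises empirical evidence for model selection and deployment.

Link: https://arxiv.org/abs/2511.08042
Authors: JV Roig
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 34 pages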

Abstract:Enterprise adoption of agentic AI systems requires reliable evaluation methods that reflect real-world deployment scenarios. Traditional LLM benchmarks suffer from training data contamination and fail to measure agentic capabilities such as multi-step tool use and decision-making under uncertainty. We present the Kamiwaza Agentic Merit Index (KAMI) v0.1, an enterprise-focused benchmark that addresses both contamination resistance and agentic evaluation. Through 170,000 LLM test items processing over 5.5 billion tokens across 35 model configurations, we demonstrate that traditional benchmark rankings poorly predict practical agentic performance. Notably, newer generation models like Llama 4 or Qwen 3 do not always outperform their older generation variants on enterprise-relevant tasks, contradicting traditional benchmark trends. We also present insights on cost-performance tradeoffs, model-specific behavioral patterns, and the impact of reasoning capabilities on token efficiency – findings critical for enterprises making deployment decisions.

[AI-50] Knowledge-Augmented Long-CoT Generation for Complex Biomolecular Reasoning

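【Quick Read】: This paper addresses the unreliability of Large Language Models (LLMs) on complex biomolecular mechanism reasoning, which stems from logical inconsistencies and a lack of grounding in domain knowledge. Existing methods often perform poorly because reasoning steps deviate from biological facts or fail to capture long-range mechanistic dependencies. The key to the solution is a Knowledge-Augmented Long-CoT Reasoning framework that combines LLMs with multi-hop reasoning chains built on a Knowledge Graph (KG): guided multi-hop traversal and pruning on the KG construct mechanistic chains, which are incorporated into supervised fine-tuning to improve factual grounding, and reinforcement learning then further improves the consistency and reliability of the reasoning. The framework markedly improves deep reasoning over structured biological knowledge and achieves state-of-the-art performance on multi-hop question answering.

Link: https://arxiv.org/abs/2511.08024
Authors: Tianwen Lyu, Xiang Zhuang, Keyan Ding, Xinzhe Cao, Lei Liang, Wei Zhao, Qiang Zhang, Huajun Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: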

Abstract:Understanding complex biomolecular mechanisms requires multi-step reasoning across molecular interactions, signaling cascades, and metabolic pathways. While large language models (LLMs) show promise in such tasks, their application to biomolecular problems is hindered by logical inconsistencies and the lack of grounding in domain knowledge. Existing approaches often exacerbate these issues: reasoning steps may deviate from biological facts or fail to capture long mechanistic dependencies. To address these challenges, we propose a Knowledge-Augmented Long-CoT Reasoning framework that integrates LLMs with knowledge graph-based multi-hop reasoning chains. The framework constructs mechanistic chains via guided multi-hop traversal and pruning on the knowledge graph; these chains are then incorporated into supervised fine-tuning to improve factual grounding and further refined with reinforcement learning to enhance reasoning reliability and consistency. Furthermore, to overcome the shortcomings of existing benchmarks, which are often restricted in scale and scope and lack annotations for deep reasoning chains, we introduce PrimeKGQA, a comprehensive benchmark for biomolecular question answering. Experimental results on both PrimeKGQA and existing datasets demonstrate that although larger closed-source models still perform well on relatively simple tasks, our method demonstrates clear advantages as reasoning depth increases, achieving state-of-the-art performance on multi-hop tasks that demand traversal of structured biological knowledge. These findings highlight the effectiveness of combining structured knowledge with advanced reasoning strategies for reliable and interpretable biomolecular reasoning.
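
As a rough illustration of "guided multi-hop traversal and pruning", the sketch below enumerates relation chains between two entities in a toy knowledge graph. The pruning rules used here (a hop limit, a relation whitelist, and no revisiting of nodes) are assumptions chosen for simplicity, and the entities and the function name `multi_hop_chains` are hypothetical; the paper's actual guidance and pruning criteria are not reproduced.

```python
def multi_hop_chains(kg, start, goal, max_hops, allowed_relations):
    """Enumerate chains of (head, relation, tail) triples from start to goal.
    kg maps each head entity to a list of (relation, tail) pairs."""
    chains, stack = [], [(start, [])]
    while stack:
        node, path = stack.pop()
        if node == goal and path:
            chains.append(path)
            continue
        if len(path) == max_hops:
            continue  # prune: chain is already as long as allowed
        visited = {start} | {tail for _, _, tail in path}
        for relation, tail in kg.get(node, []):
            # prune: skip uninformative relations and avoid cycles
            if relation in allowed_relations and tail not in visited:
                stack.append((tail, path + [(node, relation, tail)]))
    return chains


# Toy graph with made-up entities.
toy_kg = {
    "drugA": [
        ("inhibits", "proteinB"),
        ("binds", "proteinD"),
        ("associated_with", "pathwayC"),
    ],
    "proteinB": [("activates", "pathwayC")],
}

chains = multi_hop_chains(
    toy_kg, "drugA", "pathwayC", max_hops=3,
    allowed_relations={"inhibits", "activates", "binds"},
)
```

Chains found this way could then be verbalized into step-by-step reasoning text for supervised fine-tuning, which is the role the abstract assigns to them.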

[AI-51] Numerical Sensitivity and Robustness: Exploring the Flaws of Mathematical Reasoning in Large Language Models

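【Quick Read】: This paper addresses the contested question of whether Large Language Models (LLMs) have genuine mathematical understanding in mathematical reasoning tasks. The core concern is that existing models may generate answers through pattern matching or memorized templates rather than logical reasoning, which leaves them insufficiently robust in complex environments. The key to the solution is a new perturbation evaluation framework: it injects semantically irrelevant perturbation sentences and gradually increases the perturbation intensity, and it adds a second perturbation, "core questioning instruction missing", to systematically test the stability of LLM reasoning under noise. Experiments show that models are more sensitive to distractors containing numerical information, with significant performance drops (10% to 51.55% for open-source models, and 3% to 10% even for commercial models), and that models still maintain 20% to 40% accuracy when the core instruction is missing. This suggests that they may rely on non-logical mechanisms to complete the task, exposing the limitations of current LLMs in mathematical reasoning.

Link: https://arxiv.org/abs/2511.08022
Authors: Zhishen Sun, Guang Dai, Ivor Tsang, Haishan Ye
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: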

Abstract:LLMs have made significant progress in mathematical reasoning, but whether they have true mathematical understanding is still controversial. To explore this issue, we propose a new perturbation framework to evaluate LLMs’ reasoning ability in complex environments by injecting additional semantically irrelevant perturbation sentences and gradually increasing the perturbation intensity. At the same time, we use an additional perturbation method, core questioning instruction missing, to further analyze the LLMs’ problem-solving mechanism. The experimental results show that LLMs perform stably when facing perturbation sentences without numbers, but there is also a robustness boundary: as the perturbation intensity increases, the performance exhibits varying degrees of decline. When facing perturbation sentences with numbers, the performance decreases more significantly: most open-source models with fewer parameters drop by nearly or even more than 10%, and the drop grows as the perturbation intensity increases, reaching a maximum of 51.55%. Even the most advanced commercial LLMs show a 3%-10% performance drop. By analyzing the reasoning process of LLMs in detail, we find that models are more sensitive to perturbations with numerical information and are more likely to give incorrect answers when disturbed by irrelevant numerical information. The higher the perturbation intensity, the more obvious these defects are. At the same time, in the absence of the core questioning instruction, models can still maintain an accuracy of 20%-40%, indicating that LLMs may rely on memory templates or pattern matching to complete the task, rather than logical reasoning. In general, our work reveals the shortcomings and limitations of current LLMs in their reasoning capabilities, which is of great significance for the further development of LLMs.
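
The two perturbations described above can be sketched in a few lines. This is an illustrative assumption of how such a harness might be built, not the authors' code: the problem is represented as a list of sentences whose last element is taken to be the core question, and `inject` and `drop_core_question` are hypothetical names.

```python
import random


def inject(sentences, distractors, intensity, rng):
    """Insert `intensity` irrelevant sentences at random positions
    before the final (question) sentence."""
    body, question = sentences[:-1], sentences[-1]
    for distractor in rng.sample(distractors, intensity):
        body.insert(rng.randint(0, len(body)), distractor)
    return body + [question]


def drop_core_question(sentences):
    """'Core questioning instruction missing': keep the context, remove the ask."""
    return sentences[:-1]


problem = [
    "Tom has 3 apples.",
    "He buys 2 more.",
    "How many apples does Tom have?",
]
# Made-up distractors; the second one carries numerical information.
distractors = ["The sky is blue.", "A train has 8 cars.", "Birds sing."]

perturbed = inject(problem, distractors, intensity=2, rng=random.Random(0))
no_question = drop_core_question(problem)
```

Raising `intensity` step by step, and separating numeric from non-numeric distractors, would reproduce the two axes the abstract reports on.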

[AI-52] AVOID-JACK: Avoidance of Jackknifing for Swarms of Long Heavy Articulated Vehicles

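【Quick Read】: This paper addresses jackknifing and mutual collisions among Heavy Articulated Vehicles (HAVs) operating as a swarm. The problem matters in real-world settings such as logistics automation, remote mining, airport baggage transport, and agricultural operations, yet it has not previously been studied systematically. The key to the solution is a purely reaction-based, decentralized swarm intelligence strategy that needs no central control or global information and relies only on local perception and immediate response; it prioritizes jackknifing avoidance and lays a foundation for mutual collision avoidance among multiple vehicles. In experiments, a single HAV avoids jackknifing in 99.8% of cases and reaches its first and second goals in 86.7% and 83.4% of cases, respectively; with two interacting HAVs, jackknifing is still avoided in 98.9% of cases and mutual collisions occur in only 0.3% of cases, supporting the effectiveness and robustness of the method.

Link: https://arxiv.org/abs/2511.08016
Authors: Adrian Schönnagel, Michael Dubé, Christoph Steup, Felix Keppler, Sanaz Mostaghim
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 6+1 pages, 9 figures, accepted for publication in IEEE MRS 2025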

Abstract:This paper presents a novel approach to avoiding jackknifing and mutual collisions in Heavy Articulated Vehicles (HAVs) by leveraging decentralized swarm intelligence. In contrast to typical swarm robotics research, our robots are elongated and exhibit complex kinematics, introducing unique challenges. Despite its relevance to real-world applications such as logistics automation, remote mining, airport baggage transport, and agricultural operations, this problem has not been addressed in the existing literature. To tackle this new class of swarm robotics problems, we propose a purely reaction-based, decentralized swarm intelligence strategy tailored to automate elongated, articulated vehicles. The method presented in this paper prioritizes jackknifing avoidance and establishes a foundation for mutual collision avoidance. We validate our approach through extensive simulation experiments and provide a comprehensive analysis of its performance. For the experiments with a single HAV, we observe that jackknifing was successfully avoided in 99.8% of cases and that 86.7% and 83.4% reach their first and second goals, respectively. With two HAVs interacting, we observe 98.9%, 79.4%, and 65.1%, respectively, while 99.7% of the HAVs do not experience mutual collisions.

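[AI-53] DOA Estimation with Lightweight Network on LLM-Aided Simulated Acoustic Scenes

【Quick Read】: This paper addresses the poor generalization of current Direction-of-Arrival (DOA) estimation models in real-world scenes. The root cause is that most existing methods are trained on synthetic data, generated by convolving clean speech with Room Impulse Responses (RIRs), which limits acoustic diversity. To improve robustness, the authors benchmark on a more realistic and diverse spatial audio dataset built with the assistance of Large Language Models (LLMs), and propose LightDOA, a lightweight DOA estimation model based on depthwise separable convolutions and designed for multi-channel input in varying environments. The key to the solution is twofold: LLM-assisted generation of more realistic spatial audio scenes increases data diversity, and the structural optimization balances accuracy against computational complexity, so the model still performs well in resource-constrained settings.

Link: https://arxiv.org/abs/2511.08012
Authors: Haowen Li, Zhengding Luo, Dongyuan Shi, Boxiang Wang, Junwei Ji, Ziyi Yang, Woon-Seng Gan
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: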

Abstract:Direction-of-Arrival (DOA) estimation is critical in spatial audio and acoustic signal processing, with wide-ranging real-world applications. Most existing DOA models are trained on synthetic data by convolving clean speech with room impulse responses (RIRs), which limits their generalizability due to constrained acoustic diversity. In this paper, we revisit DOA estimation using a recently introduced dataset constructed with the assistance of large language models (LLMs), which provides more realistic and diverse spatial audio scenes. We benchmark several representative neural-based DOA methods on this dataset and propose LightDOA, a lightweight DOA estimation model based on depthwise separable convolutions, specifically designed for multi-channel input in varying environments. Experimental results show that LightDOA achieves satisfactory accuracy and robustness across various acoustic scenes while maintaining low computational complexity. This study not only highlights the potential of spatial audio synthesized with the assistance of LLMs in advancing robust and efficient DOA estimation research, but also highlights LightDOA as an efficient solution for resource-constrained applications.
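
Why depthwise separable convolutions are "lightweight" follows from a simple parameter count, sketched below. The layer sizes (4 input channels, 64 output channels, 3x3 kernel) are arbitrary illustration values, not LightDOA's actual configuration, and bias terms are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k 2-D convolution (no bias)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """One k x k depthwise filter per input channel, then a 1 x 1 pointwise mix."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 4 input channels, 64 output channels, 3 x 3 kernel.
standard = standard_conv_params(4, 64, 3)          # 2304
separable = depthwise_separable_params(4, 64, 3)   # 292
ratio = separable / standard                       # equals 1/c_out + 1/k**2
```

The ratio 1/c_out + 1/k² shows that the saving grows with the number of output channels and the kernel size.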

[AI-54] Combining LLM Semantic Reasoning with GNN Structural Modeling for Multi-view Multi-Label Feature Selection

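【Quick Read】: This paper addresses Multi-View Multi-Label Feature Selection (MVMLFS), that is, identifying from heterogeneous views the informative features related to multiple interdependent labels, which is especially important for high-dimensional, multimodal data (such as social media, bioinformatics, or recommendation systems). Existing methods rely mainly on statistical information and overlook semantic relationships. The key to the solution is to fuse semantic and structural information: Large Language Models (LLMs) first act as an evaluation agent that quantifies the latent semantic relevance among feature, view, and label descriptions; a two-level semantic-aware heterogeneous graph is then built, consisting of a semantic relation graph and a statistical relation graph; finally, a lightweight Graph Attention Network (GAT) learns node embeddings that serve as feature saliency scores for ranking and selection. The method outperforms state-of-the-art baselines on multiple benchmark datasets and remains robust and generalizable on small-scale datasets.

Link: https://arxiv.org/abs/2511.08008
Authors: Zhiqi Chen, Yuzhou Liu, Jiarui Liu, Wanfu Gao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: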

Abstract:Multi-view multi-label feature selection aims to identify informative features from heterogeneous views, where each sample is associated with multiple interdependent labels. This problem is particularly important in machine learning involving high-dimensional, multimodal data such as social media, bioinformatics or recommendation systems. Existing Multi-View Multi-Label Feature Selection (MVMLFS) methods mainly focus on analyzing statistical information of data, but seldom consider semantic information. In this paper, we aim to use these two types of information jointly and propose a method that combines Large Language Models (LLMs) semantic reasoning with Graph Neural Networks (GNNs) structural modeling for MVMLFS. Specifically, the method consists of three main components. (1) LLM is first used as an evaluation agent to assess the latent semantic relevance among feature, view, and label descriptions. (2) A semantic-aware heterogeneous graph with two levels is designed to represent relations among features, views and labels: one is a semantic graph representing semantic relations, and the other is a statistical graph. (3) A lightweight Graph Attention Network (GAT) is applied to learn node embedding in the heterogeneous graph as feature saliency scores for ranking and selection. Experimental results on multiple benchmark datasets demonstrate the superiority of our method over state-of-the-art baselines, and it is still effective when applied to small-scale datasets, showcasing its robustness, flexibility, and generalization ability.

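[AI-55] Multivariate Time series Anomaly Detection: A Framework of Hidden Markov Models

【Quick Read】: This paper addresses anomaly detection in multivariate time series, where the core challenge is to extract and exploit the complex dynamics of multidimensional series to identify anomalous patterns. The key to the solution is to transform the multivariate time series into a univariate time series, using transformation techniques such as Fuzzy C-Means (FCM) clustering and the fuzzy integral to reduce dimensionality and fuse information, then to build anomaly detectors based on a Hidden Markov Model (HMM) and systematically compare the transformation methods, with the aim of improving the accuracy and robustness of anomaly detection.

Link: https://arxiv.org/abs/2511.07995
Authors: Jinbo Li, Witold Pedrycz, Iqbal Jamal
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 25 pages, 8 figures, 6 tables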

Abstract:In this study, we develop an approach to multivariate time series anomaly detection focused on the transformation of multivariate time series to univariate time series. Several transformation techniques involving Fuzzy C-Means (FCM) clustering and fuzzy integral are studied. In the sequel, a Hidden Markov Model (HMM), one of the commonly encountered statistical methods, is engaged here to detect anomalies in multivariate time series. We construct HMM-based anomaly detectors and in this context compare several transformation methods. A suite of experimental studies along with some comparative analysis is reported.
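
One common way to use an HMM as an anomaly detector is to score a sequence by its log-likelihood under a model of normal behavior, and to flag sequences that score low. The sketch below implements the scaled forward algorithm for a discrete HMM. It assumes the univariate series has already been discretized into symbol indices (for example, cluster labels), and the model parameters are made-up values rather than anything fitted in the paper.

```python
import math


def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm for a discrete HMM.
    obs: list of symbol indices; start[s], trans[p][s], emit[s][symbol]."""
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    total = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        norm = sum(alpha)
        if norm == 0.0:
            return float("-inf")  # sequence is impossible under the model
        total += math.log(norm)
        alpha = [a / norm for a in alpha]  # rescale to avoid underflow
    return total


# Hypothetical two-state model of "normal" behavior: states are sticky.
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.9, 0.1], [0.1, 0.9]]

typical = log_likelihood([0, 0, 0, 0, 1, 1, 1, 1], start, trans, emit)
unusual = log_likelihood([0, 1, 0, 1, 0, 1, 0, 1], start, trans, emit)
```

A threshold on the log-likelihood, chosen on held-out normal data, would turn these scores into anomaly decisions.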

[AI-56] Enhancing Logical Expressiveness in Graph Neural Networks via Path-Neighbor Aggregation

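【Quick Read】: This paper addresses the limited logical expressiveness of Graph Neural Networks (GNNs) in Knowledge Graph (KG) reasoning, in particular the limited ability of existing methods to model complex logical rules. The key to the solution is the Path-Neighbor enhanced GNN (PN-GNN), whose core idea is to aggregate node-neighbor embeddings along the reasoning path to strengthen logical expressiveness. Theoretical analysis shows that PN-GNN is strictly more expressive than C-GNN and that its (k+1)-hop logical expressiveness is strictly superior to its k-hop version; experiments on multiple synthetic and real-world datasets confirm that the method improves the expressiveness of logical rules while keeping good generalization.

Link: https://arxiv.org/abs/2511.07994
Authors: Han Yu, Xiaojuan Zhao, Aiping Li, Kai Chen, Ziniu Liu, Zhichao Peng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: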

Abstract:Graph neural networks (GNNs) can effectively model structural information of graphs, making them widely used in knowledge graph (KG) reasoning. However, existing studies on the expressive power of GNNs mainly focus on simple single-relation graphs, and there is still insufficient discussion on the power of GNNs to express logical rules in KGs. How to enhance the logical expressive power of GNNs is still a key issue. Motivated by this, we propose Path-Neighbor enhanced GNN (PN-GNN), a method to enhance the logical expressive power of GNNs by aggregating node-neighbor embeddings on the reasoning path. First, we analyze the logical expressive power of existing GNN-based methods and point out the shortcomings of the expressive power of these methods. Then, we theoretically investigate the logical expressive power of PN-GNN, showing that it not only has strictly stronger expressive power than C-GNN but also that its (k+1)-hop logical expressiveness is strictly superior to that of k-hop. Finally, we evaluate the logical expressive power of PN-GNN on six synthetic datasets and two real-world datasets. Both theoretical analysis and extensive experiments confirm that PN-GNN enhances the expressive power of logical rules without compromising generalization, as evidenced by its competitive performance in KG reasoning tasks.

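[AI-57] VSPO: Validating Semantic Pitfalls in Ontology via LLM-Based CQ Generation AAAI2026

【Quick Read】: This paper addresses the difficulty of verifying semantic pitfalls in an ontology (modeling errors such as "Misusing allValuesFrom") with Competency Questions (CQs) generated by Large Language Models (LLMs). Existing methods mostly evaluate generated CQs by their similarity to existing datasets, which cannot reliably identify modeling errors involving logical inconsistency or semantic misalignment. The key to the solution is a new dataset and model dedicated to validating semantic pitfalls, Validating Semantic Pitfalls in Ontology (VSPO): LLMs generate natural language definitions of classes and properties, and missing or misused axioms are deliberately introduced (for example by removing axioms or substituting logical operators) to simulate real modeling errors. On this basis, LLaMA-3.1-8B-Instruct is fine-tuned to generate CQs that probe these semantic discrepancies. The model achieves 26% higher precision and 28.2% higher recall than GPT-4.1, enabling automatic generation of TBox-validating CQs that greatly reduces manual effort and improves semantic alignment between ontologies and expert knowledge.

Link: https://arxiv.org/abs/2511.07991
Authors: Hyojun Choi, Seokju Hwang, Kyong-Ho Lee
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at AAAI 2026 oral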

Abstract:Competency Questions (CQs) play a crucial role in validating ontology design. While manually crafting CQs can be highly time-consuming and costly for ontology engineers, recent studies have explored the use of large language models (LLMs) to automate this process. However, prior approaches have largely evaluated generated CQs based on their similarity to existing datasets, which often fail to verify semantic pitfalls such as “Misusing allValuesFrom”. Since such pitfalls cannot be reliably detected through rule-based methods, we propose a novel dataset and model of Validating Semantic Pitfalls in Ontology (VSPO) for CQ generation specifically designed to verify the semantic pitfalls. To simulate missing and misused axioms, we use LLMs to generate natural language definitions of classes and properties and introduce misalignments between the definitions and the ontology by removing axioms or altering logical operators (e.g., substituting union with intersection). We then fine-tune LLaMA-3.1-8B-Instruct to generate CQs that validate these semantic discrepancies between the provided definitions and the corresponding axioms. The resulting CQs can detect a broader range of modeling errors compared to existing public datasets. Our fine-tuned model demonstrates superior performance over baselines, showing 26% higher precision and 28.2% higher recall than GPT-4.1 in generating CQs for pitfall validation. This research enables automatic generation of TBox-validating CQs using LLMs, significantly reducing manual effort while improving semantic alignment between ontologies and expert knowledge. To the best of our knowledge, this is the first study to target semantic pitfall validation in CQ generation using LLMs.
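
The effect of "substituting union with intersection" can be seen with plain set semantics. The toy example below is an illustration only: the classes, individuals, and the competency question are invented, and Python sets stand in for OWL class extensions. It shows how a single CQ gets different answers under the intended axiom and under the misused one, which is what makes the CQ useful for validation.

```python
def union_of(*classes):
    """Extension of a class defined as the union of other classes."""
    return set().union(*classes)


def intersection_of(*classes):
    """Extension of a class defined as the intersection of other classes."""
    return set.intersection(*classes)


# Made-up individuals.
student = {"ann", "bob"}
employee = {"bob", "cy"}

# Intended definition: a WorkingStudent is both a Student and an Employee.
working_student_ok = intersection_of(student, employee)
# Pitfall: the axiom was written with union instead of intersection.
working_student_bad = union_of(student, employee)


def cq_every_working_student_is_employee(working_student):
    """Competency question: 'Is every working student an employee?'"""
    return working_student <= employee


answer_ok = cq_every_working_student_is_employee(working_student_ok)
answer_bad = cq_every_working_student_is_employee(working_student_bad)
```

When the answer under the ontology disagrees with what the natural language definition implies, the axiom is a candidate for the kind of pitfall the paper targets.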

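[AI-58] The One Where They Brain-Tune for Social Cognition: Multi-Modal Brain-Tuning on Friends NEURIPS2025

【Quick Read】: This paper addresses the weak performance of multimodal audio-video models on social cognition tasks, and in particular how brain-guided methods can improve a model's perception and understanding of social information. The key to the solution is a "brain-tuning" strategy: while subjects watch the sitcom Friends, a multimodal audio-video model is fine-tuned so that its output features better match fMRI activity in the Superior Temporal Sulcus (STS), a key brain region for social processing, and an adjacent region of interest (ROI). Experiments show that the method significantly increases the model's functional alignment with the STS and improves performance on sarcasm detection, a downstream social cognition task, supporting the effectiveness of brain-tuning for improving social cognition in the multimodal setting.

Link: https://arxiv.org/abs/2511.07988
Authors: Nico Policzer, Cameron Braunstein, Mariya Toneva
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 20 pages, 7 figures. Appearing at the NeurIPS 2025 Workshop on Interpreting Cognition in Deep Learning Models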

Abstract:Recent studies on audio models show brain-tuning - fine-tuning models to better predict corresponding fMRI activity - improves brain alignment and increases performance on downstream semantic and audio tasks. We extend this approach to a multimodal audio-video model to enhance social cognition, targeting the Superior Temporal Sulcus (STS), a key region for social processing, while subjects watch Friends. We find significant increases in brain alignment to the STS and an adjacent ROI, as well as improvements to a social cognition task related to the training data - sarcasm detection in sitcoms. In summary, our study extends brain-tuning to the multi-modal domain, demonstrating improvements to a downstream task after tuning to a relevant functional region.

[AI-59] Capturing Complex Spatial-Temporal Dependencies in Traffic Forecasting: A Self-Attention Approach

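【Quick Read】: This paper addresses traffic forecasting, that is, accurately predicting the inflow and outflow of a given region in a future time window. Because regions exhibit complex spatial-temporal dependencies, existing methods usually handle spatial and temporal dependencies separately and struggle to capture their joint effect. The key to the solution is ST-SAM, a novel and efficient Spatial-Temporal Self-Attention Model: a region embedding layer learns time-specific region representations, and a spatial-temporal dependency learning module based on self-attention jointly models correlations among both nearby and faraway regions, capturing local and global spatial-temporal patterns and substantially improving prediction accuracy and training efficiency.

Link: https://arxiv.org/abs/2511.07980
Authors: Zheng Chenghong, Zongyin Deng, Liu Cheng, Xiong Simin, Di Deshi, Li Guanyao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 5 pages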

Abstract:We study the problem of traffic forecasting, aiming to predict the inflow and outflow of a region in the subsequent time slot. The problem is complex due to the intricate spatial and temporal interdependence among regions. Prior works study the spatial and temporal dependency in a decoupled manner, failing to capture their joint effect. In this work, we propose ST-SAM, a novel and efficient Spatial-Temporal Self-Attention Model for traffic forecasting. ST-SAM uses a region embedding layer to learn time-specific embedding from traffic data for regions. Then, it employs a spatial-temporal dependency learning module based on self-attention mechanism to capture the joint spatial-temporal dependency for both nearby and faraway regions. ST-SAM entirely relies on self-attention to capture both local and global spatial-temporal correlations, which makes it effective and efficient. Extensive experiments on two real-world datasets show that ST-SAM is substantially more accurate and efficient than the state-of-the-art approaches (with an average improvement of up to 15% on RMSE, 17% on MAPE, and 32 times on training time in our experiments).
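
The idea of capturing spatial and temporal dependencies jointly can be illustrated with plain scaled dot-product self-attention over a sequence in which each token is one (region, time slot) embedding. The sketch below is a simplification under stated assumptions: queries, keys, and values are the raw token vectors with no learned projections, and the token values are invented, so it shows the mechanism rather than ST-SAM itself.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens.
    Returns (outputs, attention_weights)."""
    dim = len(tokens[0])
    scores = [
        [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        for q in tokens
    ]
    weights = [softmax(row) for row in scores]
    outputs = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(dim)]
        for row in weights
    ]
    return outputs, weights


# Hypothetical embeddings for (region, time slot) pairs, flattened into one sequence:
# region A at t, region B at t, region A at t-1.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Because every token attends to every other token, a region at one time slot can draw directly on a faraway region at another time slot in a single step, instead of passing through separate spatial and temporal modules.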

[AI-60] Benchmarking Multi-Step Legal Reasoning and Analyzing Chain-of-Thought Effects in Large Language Models AAAI2026

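【Quick Read】: This paper addresses three problems in existing legal reasoning benchmarks: they conflate factual recall with genuine inference, they fragment the reasoning process, and they overlook reasoning quality. In response, the authors present MSLR (Multi-Step Legal Reasoning), the first Chinese multi-step legal reasoning dataset grounded in real judicial decisions. Its core contributions are the use of the IRAC framework (Issue, Rule, Application, Conclusion) to model structured expert reasoning, and a scalable human-LLM collaborative annotation pipeline that produces fine-grained, step-level reasoning annotations. The key to the solution is to build standardized reasoning paths from real legal documents and to use human-LLM collaboration to raise annotation efficiency and quality, providing a high-quality training and evaluation resource for complex legal reasoning.

Link: https://arxiv.org/abs/2511.07979
Authors: Wenhan Yu, Xinbo Lin, Lanxin Ni, Jinhua Cheng, Lei Sha
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 21 pages, 7 figures. To appear in AAAI 2026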

Abstract:Large language models (LLMs) have demonstrated strong reasoning abilities across specialized domains, motivating research into their application to legal reasoning. However, existing legal benchmarks often conflate factual recall with genuine inference, fragment the reasoning process, and overlook the quality of reasoning. To address these limitations, we introduce MSLR, the first Chinese multi-step legal reasoning dataset grounded in real-world judicial decision making. MSLR adopts the IRAC framework (Issue, Rule, Application, Conclusion) to model structured expert reasoning from official legal documents. In addition, we design a scalable Human-LLM collaborative annotation pipeline that efficiently produces fine-grained step-level reasoning annotations and provides a reusable methodological framework for multi-step reasoning datasets. Evaluation of multiple LLMs on MSLR shows only moderate performance, highlighting the challenges of adapting to complex legal reasoning. Further experiments demonstrate that Self-Initiated Chain-of-Thought prompts generated by models autonomously improve reasoning coherence and quality, outperforming human-designed prompts. MSLR contributes to advancing LLM reasoning and Chain-of-Thought strategies and offers open resources for future research. The dataset and code are available at this https URL and this https URL.

[AI-61] Reliable and Private Utility Signaling for Data Markets

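【Quick Read】: This paper addresses the difficulty of balancing privacy and reliability in signaling mechanisms for data markets: because data can be freely duplicated, conventional signaling methods cannot protect the data provider's privacy and guarantee the signal's truthfulness at the same time, which undermines trading decisions. The key to the solution is a construction of the signaling mechanism based on a non-TCP (Transmission Control Protocol) architecture: maliciously secure Multi-Party Computation (MPC) ensures the privacy and robustness of signal computation, and an MPC-based hash verification scheme ensures the reliability of input data. For multi-seller scenarios, the paper further optimizes a KNN-Shapley-based valuation method, improving overall efficiency and practicality.

Link: https://arxiv.org/abs/2511.07975
Authors: Li Peng, Jiayao Zhang, Yihang Wu, Weiran Liu, Jinfei Liu, Zheng Yan, Kui Ren, Lei Zhang, Lin Qu
Affiliations: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: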

Abstract:The explosive growth of data has highlighted its critical role in driving economic growth through data marketplaces, which enable extensive data sharing and access to high-quality datasets. To support effective trading, signaling mechanisms provide participants with information about data products before transactions, enabling informed decisions and facilitating trading. However, due to the inherent free-duplication nature of data, commonly practiced signaling methods face a dilemma between privacy and reliability, undermining the effectiveness of signals in guiding decision-making. To address this, this paper explores the benefits and develops a non-TCP-based construction for a desirable signaling mechanism that simultaneously ensures privacy and reliability. We begin by formally defining the desirable utility signaling mechanism and proving its ability to prevent suboptimal decisions for both participants and facilitate informed data trading. To design a protocol to realize its functionality, we propose leveraging maliciously secure multi-party computation (MPC) to ensure the privacy and robustness of signal computation and introduce an MPC-based hash verification scheme to ensure input reliability. In multi-seller scenarios requiring fair data valuation, we further explore the design and optimization of the MPC-based KNN-Shapley method with improved efficiency. Rigorous experiments demonstrate the efficiency and practicality of our approach.
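
For readers unfamiliar with KNN-Shapley, the sketch below computes it in the clear for a single test point, using the well-known closed-form recursion for unweighted KNN classifiers. It is a plaintext illustration only: the MPC layer, the paper's optimizations, and the function name `knn_shapley` are outside what the abstract specifies, and the toy data is invented.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_shapley(train, test_point, test_label, k):
    """Shapley value of each training point for one test point under a KNN utility.
    train: list of (features, label). Returns values in the original order."""
    # Rank training points from nearest to farthest.
    order = sorted(range(len(train)), key=lambda i: sq_dist(train[i][0], test_point))
    n = len(order)
    match = [1.0 if train[i][1] == test_label else 0.0 for i in order]

    values = [0.0] * n
    current = match[-1] / n  # value of the farthest point
    values[order[-1]] = current
    # Walk inward from the second-farthest point to the nearest one.
    for pos in range(n - 2, -1, -1):
        rank = pos + 1  # 1-based rank of this point
        current += (match[pos] - match[pos + 1]) / k * min(k, rank) / rank
        values[order[pos]] = current
    return values


# Toy data: three 1-D training points, one test point labeled "a".
train = [([0.0], "a"), ([1.0], "b"), ([2.0], "a")]
values = knn_shapley(train, test_point=[0.1], test_label="a", k=2)
```

The values sum to the KNN utility of the full training set (here 0.5, since one of the two nearest neighbors has the right label), which is the fairness property that makes Shapley values attractive for splitting payments among sellers.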

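[AI-62] Towards Fine-Grained Interpretability: Counterfactual Explanations for Misclassification with Saliency Partition

【Quick Read】: This paper addresses the lack of detail in existing attribution-based explanation techniques on fine-grained tasks, especially under model misclassification, where conventional explanations are often too coarse to offer useful insight. The key to the solution is a fine-grained counterfactual explanation framework that produces explainable counterfactuals in a non-generative manner, by quantifying the similarity between correctly classified and misclassified samples within regions of interest and weighting the contribution of each component. It also introduces a saliency partition module based on Shapley values to identify features with region-specific relevance, pinpointing the dominant local features that influence counterfactual adjustments and making the explanations more granular and intuitively meaningful.

Link: https://arxiv.org/abs/2511.07974
Authors: Lintong Zhang, Kang Yin, Seong-Whan Lee
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: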

Abstract:Attribution-based explanation techniques capture key patterns to enhance visual interpretability; however, these patterns often lack the granularity needed for insight in fine-grained tasks, particularly in cases of model misclassification, where explanations may be insufficiently detailed. To address this limitation, we propose a fine-grained counterfactual explanation framework that generates both object-level and part-level interpretability, addressing two fundamental questions: (1) which fine-grained features contribute to model misclassification, and (2) where dominant local features influence counterfactual adjustments. Our approach yields explainable counterfactuals in a non-generative manner by quantifying similarity and weighting component contributions within regions of interest between correctly classified and misclassified samples. Furthermore, we introduce a saliency partition module grounded in Shapley value contributions, isolating features with region-specific relevance. Extensive experiments demonstrate the superiority of our approach in capturing more granular, intuitively meaningful regions, surpassing fine-grained methods.

[AI-63] Versatile and Risk-Sensitive Cardiac Diagnosis via Graph-Based ECG Signal Representation

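[Quick read]: This paper addresses two key bottlenecks in electrocardiogram (ECG) signal diagnosis. First, existing deep learning methods lack versatility when handling heterogeneous ECG signals with different lead counts, sampling frequencies, and durations. Second, sample imbalance leads to insufficient sensitivity in detecting high-risk abnormal signals. The core of the solution is VersAtile and Risk-Sensitive cardiac diagnosis (VARS), which converts ECG signals into a unified graph representation so that diverse ECG configurations can be modeled uniformly, and which combines denoising reconstruction with contrastive learning to preserve raw signal information while strengthening the representation of pathological features. This graph-centric paradigm significantly improves diagnostic performance across multiple datasets, strengthens the identification of risk signals, and provides interpretable output that pinpoints the ECG waveforms behind a given prediction, supporting clinical decision-making.

Link: https://arxiv.org/abs/2511.07973
Authors: Yue Wang, Yuyang Xu, Renjun Hu, Fanqi Shen, Hanyun Jiang, Jun Wang, Jintai Chen, Danny Z. Chen, Jian Wu, Haochao Ying
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: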

Abstract:Despite the rapid advancements of electrocardiogram (ECG) signal diagnosis and analysis methods through deep learning, two major hurdles still limit their clinical adoption: the lack of versatility in processing ECG signals with diverse configurations, and the inadequate detection of risk signals due to sample imbalances. Addressing these challenges, we introduce VersAtile and Risk-Sensitive cardiac diagnosis (VARS), an innovative approach that employs a graph-based representation to uniformly model heterogeneous ECG signals. VARS stands out by transforming ECG signals into versatile graph structures that capture critical diagnostic features, irrespective of signal diversity in the lead count, sampling frequency, and duration. This graph-centric formulation also enhances diagnostic sensitivity, enabling precise localization and identification of abnormal ECG patterns that often elude standard analysis methods. To facilitate representation transformation, our approach integrates denoising reconstruction with contrastive learning to preserve raw ECG information while highlighting pathognomonic patterns. We rigorously evaluate the efficacy of VARS on three distinct ECG datasets, encompassing a range of structural variations. The results demonstrate that VARS not only consistently surpasses existing state-of-the-art models across all these datasets but also exhibits substantial improvement in identifying risk signals. Additionally, VARS offers interpretability by pinpointing the exact waveforms that lead to specific model outputs, thereby assisting clinicians in making informed decisions. These findings suggest that our VARS will likely emerge as an invaluable tool for comprehensive cardiac health assessment.

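[AI-64] TimeFlow: Towards Stochastic-Aware and Efficient Time Series Generation via Flow Matching Modeling

[Quick read]: This paper addresses the difficulty of accurately modeling intrinsic stochasticity in time series generation, given that real-world sequences often exhibit random fluctuations and localized variations. Existing diffusion models are effective but computationally inefficient, while flow matching based on ordinary differential equations (ODEs) cannot explicitly capture uncertainty, which limits the fidelity of generated sequences. The key to the solution is TimeFlow, a flow matching framework based on stochastic differential equations (SDEs). Its core innovations are an encoder-only architecture for flexibility, a component-wise decomposed velocity field that captures the multi-faceted structure of time series, and an additional stochastic term that makes the optimization objective more expressive. Together these support unconditional and conditional generation within a unified framework and significantly outperform baselines in generation quality, diversity, and efficiency.

Link: https://arxiv.org/abs/2511.07968
Authors: He Panjing, Cheng Mingyue, Li Li, Zhang XiaoHan
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: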

Abstract:Generating high-quality time series data has emerged as a critical research topic due to its broad utility in supporting downstream time series mining tasks. A major challenge lies in modeling the intrinsic stochasticity of temporal dynamics, as real-world sequences often exhibit random fluctuations and localized variations. While diffusion models have achieved remarkable success, their generation process is computationally inefficient, often requiring hundreds to thousands of expensive function evaluations per sample. Flow matching has emerged as a more efficient paradigm, yet its conventional ordinary differential equation (ODE)-based formulation fails to explicitly capture stochasticity, thereby limiting the fidelity of generated sequences. By contrast, stochastic differential equations (SDEs) are naturally suited for modeling randomness and uncertainty. Motivated by these insights, we propose TimeFlow, a novel SDE-based flow matching framework that integrates an encoder-only architecture. Specifically, we design a component-wise decomposed velocity field to capture the multi-faceted structure of time series and augment the vanilla flow-matching optimization with an additional stochastic term to enhance representational expressiveness. TimeFlow is flexible and general, supporting both unconditional and conditional generation tasks within a unified framework. Extensive experiments across diverse datasets demonstrate that our model consistently outperforms strong baselines in generation quality, diversity, and efficiency.
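To make the ODE versus SDE distinction in the abstract concrete, here is a minimal sketch of Euler-Maruyama integration, the standard way to simulate an SDE of the form dx = v(x, t) dt + sigma dW. It is not TimeFlow's sampler: `toy_velocity`, `sigma`, and the step count are hypothetical stand-ins, and a real system would use a learned velocity network. Setting `sigma=0.0` recovers the deterministic ODE case.

```python
import math
import random


def euler_maruyama(x0, velocity, sigma, steps=100, seed=0):
    """Integrate dx = velocity(x, t) dt + sigma dW from t=0 to t=1."""
    rng = random.Random(seed)
    dt = 1.0 / steps
    x = list(x0)
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        # Deterministic drift plus Gaussian noise scaled by sqrt(dt).
        x = [
            xi + vi * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            for xi, vi in zip(x, v)
        ]
    return x


def toy_velocity(x, t):
    # Hypothetical stand-in for a learned velocity field: constant unit drift.
    return [1.0 for _ in x]


ode_sample = euler_maruyama([0.0, 0.0], toy_velocity, sigma=0.0, steps=10)
sde_sample = euler_maruyama([0.0, 0.0], toy_velocity, sigma=0.5, steps=10)
```

With the noise term switched off, every run lands on the same endpoint; with `sigma > 0`, different seeds give different trajectories, which is the kind of randomness the ODE formulation cannot express.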

[AI-65] USV Obstacles Detection and Tracking in Marine Environments

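[Quick read]: This paper addresses robust and efficient obstacle detection and tracking for Unmanned Surface Vehicles (USVs) in marine environments. The key to the solution is a multi-sensor system that fuses camera and LiDAR information: obstacles are first detected and tracked on the image plane and then located in the 3D LiDAR point cloud. The modules are integrated on the ROS platform and run in real time on synchronized LiDAR and camera data, and the paper compares a LiDAR-only approach with camera-LiDAR fusion. Finally, it proposes a hybrid approach that combines the strengths of both to build a more complete obstacle map of the surroundings, improving the USV's awareness of complex marine scenes.

Link: https://arxiv.org/abs/2511.07950
Authors: Yara AlaaEldin, Enrico Simetti, Francesca Odone
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: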

Abstract:Developing a robust and effective obstacle detection and tracking system for Unmanned Surface Vehicles (USVs) in marine environments is a challenging task. In recent years, research at the GRAAL lab at the University of Genova has produced a methodology for detecting and tracking obstacles on the image plane and then locating them in the 3D LiDAR point cloud. In this work, we build on that system by first evaluating its performance on recently published marine datasets. We then integrate the different blocks of the system on the ROS platform, where we test it in real time on synchronized LiDAR and camera data collected in various marine conditions, available in the MIT marine datasets. We present a thorough experimental analysis of the results obtained using two approaches: one that uses sensor fusion between the camera and LiDAR to detect and track obstacles, and another that uses only the LiDAR point cloud. Finally, we propose a hybrid approach that merges the advantages of both to build an informative obstacle map of the environment surrounding the USV.

[AI-66] Balance Equation-based Distributionally Robust Offline Imitation Learning

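[Quick read]: This paper addresses the significant performance degradation of imitation learning (IL) in practice when environment dynamics change. Standard IL methods usually assume that the dynamics are the same during training and deployment, but in reality modeling errors, parameter drift, and adversarial perturbations often shift the transition dynamics and degrade policy performance. The key to the solution is Balance Equation-based Distributionally Robust Offline Imitation Learning, a framework that performs distributionally robust optimization over an uncertainty set of transition models to find a policy that minimizes the imitation loss under the worst-case transition distribution. More importantly, the authors show that this robust objective can be reformulated entirely in terms of the nominal data distribution, enabling tractable offline learning without further environment interaction.

Link: https://arxiv.org/abs/2511.07942
Authors: Rishabh Agrawal, Yusuf Alvi, Rahul Jain, Ashutosh Nayyar
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: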

Abstract:Imitation Learning (IL) has proven highly effective for robotic and control tasks where manually designing reward functions or explicit controllers is infeasible. However, standard IL methods implicitly assume that the environment dynamics remain fixed between training and deployment. In practice, this assumption rarely holds: modeling inaccuracies, real-world parameter variations, and adversarial perturbations can all induce shifts in transition dynamics, leading to severe performance degradation. We address this challenge through Balance Equation-based Distributionally Robust Offline Imitation Learning, a framework that learns robust policies solely from expert demonstrations collected under nominal dynamics, without requiring further environment interaction. We formulate the problem as a distributionally robust optimization over an uncertainty set of transition models, seeking a policy that minimizes the imitation loss under the worst-case transition distribution. Importantly, we show that this robust objective can be reformulated entirely in terms of the nominal data distribution, enabling tractable offline learning. Empirical evaluations on continuous-control benchmarks demonstrate that our approach achieves superior robustness and generalization compared to state-of-the-art offline IL baselines, particularly under perturbed or shifted environments.

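[AI-67] Toward Practical BCI: A Real-time Wireless Imagined Speech EEG Decoding System

[Quick read]: This paper addresses the long-standing confinement of brain-computer interface (BCI) research to static, fixed environments, which makes it hard to apply in real-world settings, with the goal of moving BCI toward practical, everyday use. The key to the solution is an end-to-end, real-time wireless decoding system for imagined speech electroencephalogram (EEG) signals. A user identification module provides personalized service, and the Lab Streaming Layer (LSL) manages the streaming of live EEG signals to a personalized decoder, enabling real-time command classification on portable wireless hardware. The system reaches four-class command accuracy of 62.00% on a wired device and 46.67% on a wireless headset.

Link: https://arxiv.org/abs/2511.07936
Authors: Ji-Ha Park, Heon-Gyu Kwak, Gi-Hwan Shin, Yoo-In Jeon, Sun-Min Park, Ji-Yeon Hwang, Seong-Whan Lee
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 4 pages, 2 figures, 1 table, Name of Conference: International Conference on Brain-Computer Interface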

Abstract:Brain-computer interface (BCI) research, while promising, has largely been confined to static and fixed environments, limiting real-world applicability. To move towards practical BCI, we introduce a real-time wireless imagined speech electroencephalogram (EEG) decoding system designed for flexibility and everyday use. Our framework focuses on practicality, demonstrating extensibility beyond wired EEG devices to portable, wireless hardware. A user identification module recognizes the operator and provides a personalized, user-specific service. To achieve seamless, real-time operation, we utilize the lab streaming layer to manage the continuous streaming of live EEG signals to the personalized decoder. This end-to-end pipeline enables a functional real-time application capable of classifying user commands from imagined speech EEG signals, achieving an overall 4-class accuracy of 62.00 % on a wired device and 46.67 % on a portable wireless headset. This paper demonstrates a significant step towards truly practical and accessible BCI technology, establishing a clear direction for future research in robust, practical, and personalized neural interfaces.

[AI-68] Computational Blueprints: Generating Isomorphic Mathematics Problems with Large Language Models EMNLP2025

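[Quick read]: This paper addresses the pressing need in personalized mathematics education for large numbers of practice problems that are structurally consistent but varied in form. Existing work has focused on data augmentation for training neural language models rather than direct educational deployment. The authors define the Isomorphic Math Problem Generation (IMPG) task and design a framework based on large language models (LLMs), Computational Blueprints for Isomorphic Twins (CBIT). The key to CBIT is meta-level generation combined with template-based selective variation, which ensures mathematical correctness and structural consistency while reducing generation cost. Empirically, CBIT-generated problems had an error rate 17.8% lower than expert-authored items, and the system was deployed to 6,732 learners on a commercial education platform, producing 186,870 interactions.

Link: https://arxiv.org/abs/2511.07932
Authors: Jeong-Hoon Kim, Jinwoo Nam, Geunsik Jo
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: EMNLP2025 Industry Track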

Abstract:Personalized mathematics education is growing rapidly, creating a strong demand for large sets of similar practice problems. Yet existing studies on mathematics problem generation have focused on data augmentation for training neural language models rather than on direct educational deployment. To bridge this gap, we define a new task, Isomorphic Math Problem Generation (IMPG), designed to produce structurally consistent variants of source problems. Subsequently, we explored LLM-based frameworks for automatic IMPG through successive refinements, and established Computational Blueprints for Isomorphic Twins (CBIT). With meta-level generation and template-based selective variation, CBIT achieves high mathematical correctness and structural consistency while reducing the cost of generation. Empirical results across refinements demonstrate that CBIT is superior on generation accuracy and cost-effectiveness at scale. Most importantly, CBIT-generated problems exhibited an error rate 17.8% lower than expert-authored items, with deployment to 6,732 learners on a commercial education platform yielding 186,870 interactions.

[AI-69] Lightweight Diffusion-based Framework for Online Imagined Speech Decoding in Aphasia

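[Quick read]: This paper addresses real-time imagined speech brain-computer interface (BCI) communication support in clinical settings for individuals with severe expressive language impairment, such as people with aphasia. The core challenge is achieving accurate imagined speech classification under limited calibration data, environmental interference, and low-latency requirements. The key to the solution is a diffusion-based neural decoding framework with four main elements: (1) a lightweight conditional diffusion encoder trained jointly with a convolutional classifier on subject-specific EEG data from a Korean-language paradigm; (2) a dual-criterion early stopping strategy for fast convergence with little calibration data; (3) dropout regularization and grouped temporal convolutions for stable generalization; (4) online processing of the continuous EEG stream in two-second sliding windows, with visual and auditory feedback modulated by decoding confidence. Across 20 real-time trials the method reached 65% top-1 and 70% top-2 accuracy, above the offline evaluation result (50% top-1), supporting its feasibility in real clinical settings.

Link: https://arxiv.org/abs/2511.07920
Authors: Eunyeong Ko, Soowon Kim, Ha-Na Jo
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 4 pages, 2 figures, 1 table, Name of Conference: International Conference on Brain-Computer Interface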

Abstract:This work presents a diffusion-based neural decoding framework optimized for real-time imagined speech classification in individuals with aphasia. The system integrates a lightweight conditional diffusion encoder and convolutional classifier trained using subject-specific EEG data acquired from a Korean-language paradigm. A dual-criterion early stopping strategy enabled rapid convergence under limited calibration data, while dropout regularization and grouped temporal convolutions ensured stable generalization. During online operation, continuous EEG streams were processed in two-second sliding windows to generate class probabilities that dynamically modulated visual and auditory feedback according to decoding confidence. Across twenty real-time trials, the framework achieved 65% top-1 and 70% top-2 accuracy, outperforming offline evaluation (50% top-1). These results demonstrate the feasibility of deploying diffusion-based EEG decoding under practical clinical constraints, maintaining reliable performance despite environmental variability and minimal preprocessing. The proposed framework advances the translation of imagined speech brain-computer interfaces toward clinical communication support for individuals with severe expressive language impairment.
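The online loop described above (two-second sliding windows, then top-1 and top-2 scoring) can be sketched in a few lines. This is only an illustration under assumptions: the sampling rate, the 0.5 s hop, and the function names are hypothetical and are not taken from the paper, which does not state its hop size here.

```python
def sliding_windows(samples, sample_rate_hz, window_s=2.0, hop_s=0.5):
    """Yield fixed-length windows over a buffered signal (hop size is assumed)."""
    size = int(round(window_s * sample_rate_hz))
    hop = max(1, int(round(hop_s * sample_rate_hz)))
    for start in range(0, len(samples) - size + 1, hop):
        yield samples[start:start + size]


def top_k_accuracy(prob_rows, labels, k):
    """Fraction of trials whose true label is among the k most probable classes."""
    hits = 0
    for probs, label in zip(prob_rows, labels):
        ranked = sorted(range(len(probs)), key=lambda c: -probs[c])
        hits += label in ranked[:k]
    return hits / len(labels)


# Toy signal: 10 samples at 2 Hz, so each 2 s window holds 4 samples.
windows = list(sliding_windows(list(range(10)), sample_rate_hz=2))
```

Top-2 accuracy can only match or exceed top-1 on the same trials, which is consistent with the 65% and 70% figures reported in the abstract.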

[AI-70] Neurophysiological Characteristics of Adaptive Reasoning for Creative Problem-Solving Strategy

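[Quick read]: This paper investigates the neural mechanisms by which humans flexibly adjust their reasoning strategies when environmental rules or context change; the neural dynamics underlying adaptive reasoning remain unclear. The key to the approach is combining a card-sorting paradigm with electroencephalography (EEG). Stimulus-locked and feedback-locked analyses reveal coordinated delta-theta-alpha dynamics: early delta-theta activity reflects exploratory monitoring and rule inference, whereas occipital alpha activity indicates confirmatory stabilization of attention after a rule is successfully identified. By contrast, a multimodal large language model showed only short-term, feedback-driven adjustments, without hierarchical rule abstraction or genuine adaptive reasoning. This suggests that human adaptive reasoning relies on a distinctive coordination of neural oscillations, with implications for building brain-inspired AI.

Link: https://arxiv.org/abs/2511.07912
Authors: Jun-Young Kim, Young-Seok Kweon, Gi-Hwan Shin, Seong-Whan Lee
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 4 pages, 4 figures, 1 table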

Abstract:Adaptive reasoning enables humans to flexibly adjust inference strategies when environmental rules or contexts change, yet its underlying neural dynamics remain unclear. This study investigated the neurophysiological mechanisms of adaptive reasoning using a card-sorting paradigm combined with electroencephalography and compared human performance with that of a multimodal large language model. Stimulus- and feedback-locked analyses revealed coordinated delta-theta-alpha dynamics: early delta-theta activity reflected exploratory monitoring and rule inference, whereas occipital alpha engagement indicated confirmatory stabilization of attention after successful rule identification. In contrast, the multimodal large language model exhibited only short-term feedback-driven adjustments without hierarchical rule abstraction or genuine adaptive reasoning. These findings identify the neural signatures of human adaptive reasoning and highlight the need for brain-inspired artificial intelligence that incorporates oscillatory feedback coordination for true context-sensitive adaptation.

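[AI-71] Test-driven Reinforcement Learning AAAI2026

[Quick read]: This paper addresses the difficulty of designing reward functions in reinforcement learning (RL). Conventional RL relies on a single reward function both to define the task objective and to guide learning, which makes manual design complex and prone to suboptimal task representations. The authors propose the Test-driven Reinforcement Learning (TdRL) framework, whose core idea is to represent the task objective with multiple test functions: pass-fail tests define the optimal objective, and indicative tests guide learning, which decouples and simplifies task definition. On this basis, the paper proves that if a trajectory return function assigns higher returns to trajectories closer to the optimal trajectory set, maximum entropy policy optimization on that return function yields a policy closer to the optimal policy set. It then introduces a lexicographic heuristic that compares trajectories by their distance to the optimal trajectory set in order to learn the return function. On the DeepMind Control Suite, TdRL matches or outperforms handcrafted reward methods, with a simpler design process and inherent support for multi-objective optimization.

Link: https://arxiv.org/abs/2511.07904
Authors: Zhao Yu, Xiuping Wu, Liangjun Ke
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: AAAI 2026 oral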

Abstract:Reinforcement learning (RL) has been recognized as a powerful tool for robot control tasks. RL typically employs reward functions to define task objectives and guide agent learning. However, since the reward function serves the dual purpose of defining the optimal goal and guiding learning, it is challenging to design the reward function manually, which often results in a suboptimal task representation. To tackle the reward design challenge in RL, inspired by the satisficing theory, we propose a Test-driven Reinforcement Learning (TdRL) framework. In the TdRL framework, multiple test functions are used to represent the task objective rather than a single reward function. Test functions can be categorized as pass-fail tests and indicative tests, each dedicated to defining the optimal objective and guiding the learning process, respectively, thereby making defining tasks easier. Building upon such a task definition, we first prove that if a trajectory return function assigns higher returns to trajectories closer to the optimal trajectory set, maximum entropy policy optimization based on this return function will yield a policy that is closer to the optimal policy set. Then, we introduce a lexicographic heuristic approach to compare the relative distance relationship between trajectories and the optimal trajectory set for learning the trajectory return function. Furthermore, we develop an algorithm implementation of TdRL. Experimental results on the DeepMind Control Suite benchmark demonstrate that TdRL matches or outperforms handcrafted reward methods in policy training, with greater design simplicity and inherent support for multi-objective optimization. We argue that TdRL offers a novel perspective for representing task objectives, which could be helpful in addressing the reward design challenges in RL applications.

[AI-72] DANS-KGC: Diffusion Based Adaptive Negative Sampling for Knowledge Graph Completion

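[Quick read]: This paper addresses three problems with existing negative sampling (NS) strategies in knowledge graph completion (KGC): vulnerability to false negatives, limited generalization, and lack of control over the hardness of negative samples. The core of the solution is DANS-KGC (Diffusion-based Adaptive Negative Sampling for Knowledge Graph Completion), in which three modules work together. The Difficulty Assessment Module (DAM) evaluates the learning difficulty of entities by combining semantic and structural features. The Adaptive Negative Sampling Module (ANS) uses a conditional diffusion model with difficulty-aware noise scheduling and draws on semantic and neighborhood information during denoising to generate negative samples of varied hardness. The Dynamic Training Mechanism (DTM) adjusts the hardness distribution of negative samples during training, giving a curriculum-style progression from easy to hard, which improves model performance and generalization.

Link: https://arxiv.org/abs/2511.07901
Authors: Haoning Li, Qinghua Huang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: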

Abstract:Negative sampling (NS) strategies play a crucial role in knowledge graph representation. In order to overcome the limitations of existing negative sampling strategies, such as vulnerability to false negatives, limited generalization, and lack of control over sample hardness, we propose DANS-KGC (Diffusion-based Adaptive Negative Sampling for Knowledge Graph Completion). DANS-KGC comprises three key components: the Difficulty Assessment Module (DAM), the Adaptive Negative Sampling Module (ANS), and the Dynamic Training Mechanism (DTM). DAM evaluates the learning difficulty of entities by integrating semantic and structural features. Based on this assessment, ANS employs a conditional diffusion model with difficulty-aware noise scheduling, leveraging semantic and neighborhood information during the denoising phase to generate negative samples of diverse hardness. DTM further enhances learning by dynamically adjusting the hardness distribution of negative samples throughout training, enabling a curriculum-style progression from easy to hard examples. Extensive experiments on six benchmark datasets demonstrate the effectiveness and generalization ability of DANS-KGC, with the method achieving state-of-the-art results on all three evaluation metrics for the UMLS and YAGO3-10 datasets.

[AI-73] Statistically Assuring Safety of Control Systems using Ensembles of Safety Filters and Conformal Prediction

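[Quick read]: This paper addresses safety assurance for learning-enabled autonomous systems, in particular the computational cost of Hamilton-Jacobi (HJ) reachability analysis. Conventional HJ methods are expensive for high-dimensional systems, so reinforcement learning is often used to approximate the HJ value function and produce safe controllers. However, the learned value functions and policies carry no formal safety guarantee: the value estimated at a state may differ from the safety return actually obtained by executing the policy. The key to the solution is a framework based on conformal prediction (CP) that bounds the uncertainty of the learned value function through probabilistic calibration and uses it to switch between the nominal controller and the learned HJ safe policy, yielding probabilistic safety guarantees for the switched policy. The paper also evaluates an ensemble of independently trained HJ value functions as a safety filter and compares it with using individual value functions.

Link: https://arxiv.org/abs/2511.07899
Authors: Ihab Tabbara, Yuxuan Yang, Hussein Sibai
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
Comments: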

Abstract:Safety assurance is a fundamental requirement for deploying learning-enabled autonomous systems. Hamilton-Jacobi (HJ) reachability analysis is a fundamental method for formally verifying safety and generating safe controllers. However, computing the HJ value function that characterizes the backward reachable set (BRS) of a set of user-defined failure states is computationally expensive, especially for high-dimensional systems, motivating the use of reinforcement learning approaches to approximate the value function. Unfortunately, a learned value function and its corresponding safe policy are not guaranteed to be correct. The learned value function evaluated at a given state may not be equal to the actual safety return achieved by following the learned safe policy. To address this challenge, we introduce a conformal prediction-based (CP) framework that bounds such uncertainty. We leverage CP to provide probabilistic safety guarantees when using learned HJ value functions and policies to prevent control systems from reaching failure states. Specifically, we use CP to calibrate the switching between the unsafe nominal controller and the learned HJ-based safe policy and to derive safety guarantees under this switched policy. We also investigate using an ensemble of independently trained HJ value functions as a safety filter and compare this ensemble approach to using individual value functions alone.
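As a rough illustration of how conformal prediction can calibrate a switch between controllers, here is a minimal split-conformal sketch. It is a generic textbook construction, not the paper's procedure: the score definition (learned value minus observed safety return), the function names, and the threshold are all assumptions made for this example.

```python
import math


def conformal_quantile(scores, alpha):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: no finite bound exists.
        return float("inf")
    return sorted(scores)[k - 1]


def choose_controller(learned_value, margin, threshold=0.0):
    """Keep the nominal controller only if the calibrated lower bound is safe."""
    lower_bound = learned_value - margin
    return "nominal" if lower_bound > threshold else "safe"


# Hypothetical calibration scores: how much the learned value overestimated
# the safety return actually observed on held-out rollouts.
calibration_scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
margin = conformal_quantile(calibration_scores, alpha=0.5)
```

Under the usual exchangeability assumption, subtracting `margin` from the learned value gives a lower bound on the true safety return that holds with probability at least `1 - alpha`, which is the kind of statement the abstract refers to as a probabilistic safety guarantee.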

[AI-74] Data Descriptions from Large Language Models with Influence Estimation

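[Quick read]: This paper addresses interpretability in deep learning from the data side: how to explain, in a form humans find easy to understand (natural language), the relationship between data and model training, rather than focusing only on explaining model predictions. The key to the solution is a pipeline that generates textual descriptions with large language models (LLMs), incorporating external knowledge bases to improve relevance and semantic accuracy, and that uses influence estimation together with the CLIP score to select the most informative descriptions. Building on cross-modal transferability, the authors also design a new benchmark task, cross-modal transfer classification. In the zero-shot setting the generated descriptions outperform baseline descriptions, and they significantly improve the performance of a model trained only on images, suggesting that the method can shed light on the inherent interpretability of model decision-making.

Link: https://arxiv.org/abs/2511.07897
Authors: Chaeri Kim, Jaeyeon Bae, Taehwan Kim
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: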

Abstract:Deep learning models have been successful in many areas, but their behavior largely remains a black box. Most prior explainable AI (XAI) approaches have focused on interpreting and explaining how models make predictions. In contrast, we would like to understand how data can be explained with deep learning model training and propose a novel approach to understand the data via one of the most common media, language, so that humans can easily understand. Our approach proposes a pipeline to generate textual descriptions that can explain the data with large language models by incorporating external knowledge bases. However, generated data descriptions may still include irrelevant information, so we use influence estimation, along with the CLIP score, to choose the most informative textual descriptions. Furthermore, based on the phenomenon of cross-modal transferability, we propose a novel benchmark task named cross-modal transfer classification to examine the effectiveness of our textual descriptions. In the zero-shot setting, we show that our textual descriptions are more effective than other baseline descriptions, and furthermore, we successfully boost the performance of the model trained only on images across all nine image classification datasets. These results are further supported by evaluation using GPT-4o. Through our approach, we may gain insights into the inherent interpretability of the decision-making process of the model.

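[AI-75] Toward Robust EEG-based Intention Decoding during Misarticulated Speech in Aphasia

[Quick read]: This paper addresses how to recognize intention from electroencephalogram (EEG) signals in patients with expressive aphasia, whose speech production is impaired. Brain-computer interface research has paid little attention to support tailored to aphasic patients, and existing methods struggle with the misarticulations that are common in real communication. The key to the solution is as follows. By comparing the EEG spectral features of correct and incorrect trials, the authors find that misarticulated trials show excessive delta power across widespread channels and increased theta-alpha activity in frontal regions. Based on this, they propose a soft multitask learning framework with Maximum Mean Discrepancy (MMD) regularization that focuses on delta features, optimizing class discrimination while aligning the EEG feature distributions of correct and misarticulated trials. The model reaches 45.5% accuracy on misarticulated trials, more than 45% above the baseline, supporting robust intention decoding under imperfect speech.

Link: https://arxiv.org/abs/2511.07895
Authors: Ha-Na Jo, Jung-Sun Lee, Eunyeong Ko
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: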

Abstract:Aphasia severely limits verbal communication due to impaired language production, often leading to frequent misarticulations during speech attempts. Despite growing interest in brain-computer interface technologies, relatively little attention has been paid to developing EEG-based communication support systems tailored for aphasic patients. To address this gap, we recruited a single participant with expressive aphasia and conducted a Korean-based automatic speech task. EEG signals were recorded during task performance, and each trial was labeled as either correct or incorrect depending on whether the intended word was successfully spoken. Spectral analysis revealed distinct neural activation patterns between the two trial types: misarticulated trials exhibited excessive delta power across widespread channels and increased theta-alpha activity in frontal regions. Building upon these findings, we developed a soft multitask learning framework with maximum mean discrepancy regularization that focuses on delta features to jointly optimize class discrimination while aligning the EEG feature distributions of correct and misarticulated trials. The proposed model achieved 58.6% accuracy for correct and 45.5% for misarticulated trials, outperforming the baseline by over 45% on the latter and demonstrating robust intention decoding even under articulation errors. These results highlight the feasibility of EEG-based assistive systems capable of supporting real-world, imperfect speech conditions in aphasia patients.
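For readers unfamiliar with maximum mean discrepancy (MMD), the sketch below computes the standard biased estimate of squared MMD with an RBF kernel between two sets of feature vectors. It illustrates the general regularizer only: the kernel choice, `gamma`, the weighting `lam`, and the way it is added to a classification loss are assumptions, not details from the paper.

```python
import math


def rbf(x, y, gamma):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy


def regularized_loss(classification_loss, correct_feats, misart_feats, lam=0.1):
    # Hypothetical combination: class loss plus an alignment penalty that
    # shrinks as the two feature distributions move closer together.
    return classification_loss + lam * mmd2(correct_feats, misart_feats)
```

The penalty is zero when the two sets coincide and grows as they separate, so minimizing it pulls the features of correct and misarticulated trials toward a shared distribution.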

[AI-76] Confidence-Aware Neural Decoding of Overt Speech from EEG: Toward Robust Brain-Computer Interfaces

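[Quick read]: This paper addresses how non-invasive brain-computer interfaces (BCIs) that decode spoken commands from electroencephalogram (EEG) signals can be both accurate and trustworthy. Existing methods often do not quantify or manage predictive uncertainty, which limits reliability in deployment. The key to the solution is a confidence-aware decoding framework with two parts: (1) deep ensembles of compact, speech-oriented convolutional networks to capture model uncertainty; (2) post-hoc calibration combined with selective classification, where uncertainty is quantified with predictive entropy, top-two margin, and mutual information, and an abstain option is governed by an accuracy-coverage operating point. Evaluated on a multi-class overt speech dataset, the method yields more reliable probability estimates than widely used baselines, improved selective performance across operating points, and balanced per-class acceptance, supporting more robust behavior for real-world BCI communication systems.

Link: https://arxiv.org/abs/2511.07890
Authors: Soowon Kim, Byung-Kwan Ko, Seo-Hyun Lee
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: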

Abstract:Non-invasive brain-computer interfaces that decode spoken commands from electroencephalogram must be both accurate and trustworthy. We present a confidence-aware decoding framework that couples deep ensembles of compact, speech-oriented convolutional networks with post-hoc calibration and selective classification. Uncertainty is quantified using ensemble-based predictive entropy, top-two margin, and mutual information, and decisions are made with an abstain option governed by an accuracy-coverage operating point. The approach is evaluated on a multi-class overt speech dataset using a leakage-safe, block-stratified split that respects temporal contiguity. Compared with widely used baselines, the proposed method yields more reliable probability estimates, improved selective performance across operating points, and balanced per-class acceptance. These results suggest that confidence-aware neural decoding can provide robust, deployment-oriented behavior for real-world brain-computer interface communication systems.
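The three uncertainty measures named in the abstract have standard definitions for an ensemble, sketched below in plain Python. The abstain rule and its `max_entropy` threshold are assumptions for illustration; the paper ties abstention to an accuracy-coverage operating point, which is not reproduced here.

```python
import math


def entropy(p):
    """Shannon entropy in nats, skipping zero-probability classes."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def ensemble_uncertainty(member_probs):
    """Return mean probabilities, predictive entropy, top-two margin, and MI."""
    n = len(member_probs)
    num_classes = len(member_probs[0])
    mean = [sum(p[c] for p in member_probs) / n for c in range(num_classes)]
    predictive_entropy = entropy(mean)
    expected_entropy = sum(entropy(p) for p in member_probs) / n
    # Mutual information isolates disagreement between ensemble members.
    mutual_information = predictive_entropy - expected_entropy
    top = sorted(mean, reverse=True)
    margin = top[0] - top[1]
    return mean, predictive_entropy, margin, mutual_information


def decide(member_probs, max_entropy):
    """Predict the most probable class, or abstain (None) when too uncertain."""
    mean, h, _, _ = ensemble_uncertainty(member_probs)
    if h > max_entropy:
        return None
    return max(range(len(mean)), key=lambda c: mean[c])
```

Two members that each output a uniform distribution give high entropy but zero mutual information, while two confident members that disagree give the same entropy with high mutual information, which is why the measures are reported separately.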

[AI-77] Meta-cognitive Multi-scale Hierarchical Reasoning for Motor Imagery Decoding

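[Quick read]: This paper addresses the unstable performance of motor imagery (MI) based brain-computer interface (BCI) systems in practical deployment, caused by noisy electroencephalogram (EEG) signals and strong variability between subjects. The key to the solution is a hierarchical, meta-cognitive decoding framework. First, a multi-scale hierarchical signal processing module reorganizes backbone features into temporal multi-scale representations to better model the complex dynamics of EEG signals. Second, an introspective uncertainty estimation module assigns a reliability score to each cycle and guides iterative refinement. On three standard EEG backbones (EEGNet, ShallowConvNet, DeepConvNet), the method improves four-class MI decoding accuracy and reduces inter-subject variance, indicating better robustness and reliability for MI-BCI systems.

Link: https://arxiv.org/abs/2511.07884
Authors: Si-Hyun Kim, Heon-Gyu Kwak, Byoung-Hee Kwon, Seong-Whan Lee
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 4 pages, 1 figure, 1 table, Name of Conference: International Winter Conference on Brain-Computer Interface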

Abstract:Brain-computer interface (BCI) aims to decode motor intent from noninvasive neural signals to enable control of external devices, but practical deployment remains limited by noise and variability in motor imagery (MI)-based electroencephalogram (EEG) signals. This work investigates a hierarchical and meta-cognitive decoding framework for four-class MI classification. We introduce a multi-scale hierarchical signal processing module that reorganizes backbone features into temporal multi-scale representations, together with an introspective uncertainty estimation module that assigns per-cycle reliability scores and guides iterative refinement. We instantiate this framework on three standard EEG backbones (EEGNet, ShallowConvNet, and DeepConvNet) and evaluate four-class MI decoding using the BCI Competition IV-2a dataset under a subject-independent setting. Across all backbones, the proposed components improve average classification accuracy and reduce inter-subject variance compared to the corresponding baselines, indicating increased robustness to subject heterogeneity and noisy trials. These results suggest that combining hierarchical multi-scale processing with introspective confidence estimation can enhance the reliability of MI-based BCI systems.

[AI-78] WaterMod: Modular Token-Rank Partitioning for Probability-Balanced LLM Watermarking AAAI2026

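[Quick read]: This paper addresses the traceability of content produced by generative AI: how to embed an imperceptible, machine-verifiable watermark without noticeably degrading generation quality, in order to meet provenance requirements for synthetic content in regulations such as the EU AI Act. Conventional logit-based watermarks pick a pseudorandom "green vocabulary" at each decoding step and boost its logits, but the random split can exclude the highest-probability token and hurt fluency. The key to WaterMod is a probability-aware modular rule: the vocabulary is sorted in descending order of model probability, and tokens are assigned to classes by the residue rank mod k, so that adjacent, semantically similar tokens fall into different classes. A small fixed bias is then added to the logits of one selected class. In the zero-bit setting, an entropy-adaptive gate chooses the even or odd class as the green list, so at least one high-probability candidate is always kept. In the multi-bit setting, the current payload digit d selects the class with rank mod k = d, embedding exactly one base-k digit per step while balancing robust detection and generation quality.

Link: https://arxiv.org/abs/2511.07863
Authors: Shinwoo Park, Hyejin Park, Hyeseon Ahn, Yo-Sub Han
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: AAAI 2026 (Oral). This is the extended preprint version including appendices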

Abstract:Large language models now draft news, legal analyses, and software code with human-level fluency. At the same time, regulations such as the EU AI Act mandate that each synthetic passage carry an imperceptible, machine-verifiable mark for provenance. Conventional logit-based watermarks satisfy this requirement by selecting a pseudorandom green vocabulary at every decoding step and boosting its logits, yet the random split can exclude the highest-probability token and thus erode fluency. WaterMod mitigates this limitation through a probability-aware modular rule. The vocabulary is first sorted in descending model probability; the resulting ranks are then partitioned by the residue rank mod k, which distributes adjacent (and therefore semantically similar) tokens across different classes. A fixed bias of small magnitude is applied to one selected class. In the zero-bit setting (k=2), an entropy-adaptive gate selects either the even or the odd parity as the green list. Because the top two ranks fall into different parities, this choice embeds a detectable signal while guaranteeing that at least one high-probability token remains available for sampling. In the multi-bit regime (k>2), the current payload digit d selects the color class whose ranks satisfy rank mod k = d. Biasing the logits of that class embeds exactly one base-k digit per decoding step, thereby enabling fine-grained provenance tracing. The same modular arithmetic therefore supports both binary attribution and rich payloads. Experimental results demonstrate that WaterMod consistently attains strong watermark detection performance while maintaining generation quality in both zero-bit and multi-bit settings. This robustness holds across a range of tasks, including natural language generation, mathematical reasoning, and code synthesis. Our code and data are available at this https URL.
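The rank mod k rule is simple enough to sketch directly. The code below is a toy illustration written from the abstract, not the authors' released implementation: the function name and the `delta` value are hypothetical, the class index is passed in by the caller, and the entropy-adaptive gate and the detection side are not modeled.

```python
def rank_mod_bias(logits, k, digit, delta=2.0):
    """Add delta to every token whose probability rank r satisfies r % k == digit."""
    # Rank 0 is the most probable token; sorting by logit gives the same
    # order as sorting by softmax probability.
    order = sorted(range(len(logits)), key=lambda i: -logits[i])
    biased = list(logits)
    for rank, token_id in enumerate(order):
        if rank % k == digit:
            biased[token_id] += delta
    return biased


toy_logits = [2.0, 1.0, 0.5, 0.25]  # already in descending order for clarity

# Zero-bit setting (k=2): the even-parity class contains the top-ranked token.
even_green = rank_mod_bias(toy_logits, k=2, digit=0)

# Multi-bit setting (k=4): payload digit 3 boosts only the rank-3 token here.
payload_step = rank_mod_bias(toy_logits, k=4, digit=3)
```

Because ranks 0 and 1 always land in different classes when k is at least 2, whichever class is boosted contains one of the two most probable tokens, which is the property the abstract relies on for preserving fluency.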

[AI-79] A General Method for Proving Networks Universal Approximation Property

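[Quick read]: This paper addresses the fragmentation of universal approximation proofs for deep learning architectures: existing methods need a separate mathematical proof for each network structure (e.g., fully connected networks, convolutional neural networks, or Transformers), which duplicates effort and leaves no unified theoretical foundation. The key to the solution is a general, modular framework. It defines a basic building block with universal approximation capability, the Universal Approximation Module (UAM), and proves that any network composed of such modules inherits the universal approximation property. The framework also interprets the approximation process as a progressive refinement across modules, which allows a unified analysis of different architectures and a layer-by-layer understanding of how expressive power evolves.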

Link: https://arxiv.org/abs/2511.07857
Authors: Wei Wang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep learning architectures are highly diverse. To prove their universal approximation properties, existing works typically rely on model-specific proofs. Generally, they construct a dedicated mathematical formulation for each architecture (e.g., fully connected networks, CNNs, or Transformers) and then prove their universal approximability. However, this approach suffers from two major limitations: first, every newly proposed architecture often requires a completely new proof from scratch; second, these proofs are largely isolated from one another, lacking a common analytical foundation. This not only incurs significant redundancy but also hinders unified theoretical understanding across different network families. To address these issues, this paper proposes a general and modular framework for proving universal approximation. We define a basic building block (comprising one or multiple layers) that possesses the universal approximation property as a Universal Approximation Module (UAM). Under this condition, we show that any deep network composed of such modules inherently retains the universal approximation property. Moreover, the overall approximation process can be interpreted as a progressive refinement across modules. This perspective not only unifies the analysis of diverse architectures but also enables a step-by-step understanding of how expressive power evolves through the network.

[AI-80] GAMA: A Neural Neighborhood Search Method with Graph-aware Multi-modal Attention for Vehicle Routing Problem

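[Quick read]: This paper addresses a limitation of existing neural neighborhood search methods for the Vehicle Routing Problem (VRP): their state representations are overly simple, and they fuse heterogeneous information crudely (e.g., by direct concatenation), which makes it hard to capture rich structural and semantic context. The key to the solution is a Graph-aware Multi-modal Attention model (GAMA). Graph neural networks encode the problem instance and the evolving solution as separate modalities, and stacked self-attention and cross-attention layers model the intra- and inter-modal interactions. A gated fusion mechanism then integrates the multi-modal representations into a structured state representation, improving the policy's decision-making and generalization in operator selection.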

Link: https://arxiv.org/abs/2511.07850
Authors: Xiangling Chen,Yi Mei,Mengjie Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in neural neighborhood search methods have shown potential in tackling Vehicle Routing Problems (VRPs). However, most existing approaches rely on simplistic state representations and fuse heterogeneous information via naive concatenation, limiting their ability to capture rich structural and semantic context. To address these limitations, we propose GAMA, a neural neighborhood search method with Graph-aware Multi-modal Attention model in VRP. GAMA encodes the problem instance and its evolving solution as distinct modalities using graph neural networks, and models their intra- and inter-modal interactions through stacked self- and cross-attention layers. A gated fusion mechanism further integrates the multi-modal representations into a structured state, enabling the policy to make informed and generalizable operator selection decisions. Extensive experiments conducted across various synthetic and benchmark instances demonstrate that the proposed algorithm GAMA significantly outperforms the recent neural baselines. Further ablation studies confirm that both the multi-modal attention mechanism and the gated fusion design play a key role in achieving the observed performance gains.

[AI-81] Alignment-Aware Quantization for LLM Safety

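[Quick read]: This paper addresses the trade-off between efficiency and safety in post-training quantization (PTQ) of large language models (LLMs). Conventional PTQ methods pursue only low perplexity and ignore the model's alignment with its safety policy, so a quantized model can keep low perplexity yet deviate substantially from the original safety-aligned behavior. The key to the solution is Alignment-Aware Quantization (AAQ), whose core innovation is embedding an Alignment-Preserving Contrastive (APC) loss into the PTQ pipeline: the quantized model is pushed to mimic the safety-tuned model while moving away from the unaligned pre-trained model, which preserves safety alignment. The method needs no specialized safety calibration dataset and achieves robust 4-bit (W4A4) quantization across model families such as LLaMA, Qwen, and Mistral, which the authors present as resolving the conflict between efficiency and safety.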

Link: https://arxiv.org/abs/2511.07842
Authors: Sunghyun Wee,Suyoung Kim,Hyeonjin Kim,Kyomin Hwang,Nojun Kwak
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures. Includes 7 pages of supplementary material

Abstract:Safety and efficiency are both important factors when deploying large language models (LLMs). LLMs are trained to follow human alignment for safety, and post-training quantization (PTQ) is applied afterward for efficiency. However, these two objectives are often in conflict, revealing a fundamental flaw in the conventional PTQ paradigm: quantization can turn into a safety vulnerability if it only aims to achieve low perplexity. Models can demonstrate low perplexity yet exhibit significant degradation in alignment with the safety policy, highlighting that perplexity alone is an insufficient and often misleading proxy for model safety. To address this, we propose Alignment-Aware Quantization (AAQ), a novel approach that integrates Alignment-Preserving Contrastive (APC) loss into the PTQ pipeline. Compared to simple reconstruction loss, ours explicitly preserves alignment by encouraging the quantized model to mimic its safe, instruction-tuned model while diverging from the unaligned, pre-trained counterpart. Our method achieves this robust safety alignment without resorting to specialized safety-focused calibration datasets, highlighting its practical utility and broad applicability. AAQ is compatible with standard PTQ techniques and enables robust 4-bit (W4A4) quantization across diverse model families such as LLaMA, Qwen, and Mistral while maintaining safety where previous methods fail. Our work resolves the critical trade-off between efficiency and safety, paving the way toward LLMs that are both efficient and trustworthy. Anonymized code is available in the supplementary material.
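
The abstract does not give the formula of the APC loss. As a rough illustration of the "mimic the aligned model, diverge from the pre-trained model" idea, the sketch below uses a generic triplet-style hinge on KL divergences between next-token distributions. The function name `triplet_alignment_loss`, the `margin` value, and the choice of KL as the distance are all assumptions, not the paper's definition.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def triplet_alignment_loss(quant_logits, aligned_logits, base_logits, margin=0.1):
    """Hinge loss: zero once the quantized model is closer to the aligned
    model than to the unaligned base model by at least `margin`."""
    q = softmax(quant_logits)
    pull = kl(softmax(aligned_logits), q)  # distance to the aligned model
    push = kl(softmax(base_logits), q)     # distance to the unaligned model
    return max(0.0, pull - push + margin)


aligned = [2.0, 0.0, -1.0]   # toy logits, hypothetical
base = [-1.0, 0.0, 2.0]
loss_when_aligned = triplet_alignment_loss(aligned, aligned, base)
loss_when_drifted = triplet_alignment_loss(base, aligned, base)
```

A quantized model that reproduces the aligned logits incurs no loss here, while one that collapses onto the base model is penalized by its full distance to the aligned model plus the margin.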

[AI-82] MURPHY: Multi-Turn GRPO for Self Correcting Code Generation

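[Quick read]: This paper addresses the weak performance of current Reinforcement Learning with Verifiable Rewards (RLVR) methods on agentic tasks that require iterative decision-making, in particular the lack of an effective self-correction mechanism in multi-turn reasoning. The key to the solution is the Murphy framework, which builds on Group Relative Policy Optimization (GRPO) and introduces iterative self-correction during training. By combining quantitative and qualitative execution feedback, it lets the model progressively refine its reasoning over multiple turns, improving performance on complex tasks such as code generation, with up to an 8% relative gain in pass@1 over GRPO at a similar compute budget.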

Link: https://arxiv.org/abs/2511.07833
Authors: Chanakya Ekbote,Vijay Lingam,Behrooz Omidvar-Tehrani,Jun Huan,Sujay Sanghavi,Anoop Deoras,Stefano Soatto
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 2 figures, 6 Tables

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful framework for enhancing the reasoning capabilities of large language models (LLMs). However, existing approaches such as Group Relative Policy Optimization (GRPO) and its variants, while effective on reasoning benchmarks, struggle with agentic tasks that require iterative decision-making. We introduce Murphy, a multi-turn reflective optimization framework that extends GRPO by incorporating iterative self-correction during training. By leveraging both quantitative and qualitative execution feedback, Murphy enables models to progressively refine their reasoning across multiple turns. Evaluations on code generation benchmarks with model families such as Qwen and OLMo show that Murphy consistently improves performance, achieving up to an 8% relative gain in pass@1 over GRPO on similar compute budgets.
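
For context, GRPO scores each sampled completion relative to the other completions drawn for the same prompt. The sketch below shows only that group-relative normalization step. How Murphy propagates rewards across correction turns is not specified in the abstract and is not modeled here; `eps`, the use of the population standard deviation, and the toy rewards are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ
    return [(r - mean) / (std + eps) for r in rewards]


# Toy group: two completions pass the unit tests (reward 1), two fail (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat the group average get a positive advantage and the rest a negative one, so no separate value network is needed to produce a baseline.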

[AI-83] PRISM: Privacy-preserving Inference System with Homomorphic Encryption and Modular Activation

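[Quick read]: This paper addresses the privacy challenge of deploying machine learning models in critical infrastructure: how to compute securely without exposing raw data. Homomorphic Encryption (HE) supports computation on encrypted data, but it is incompatible with the non-linear activation functions used in Convolutional Neural Networks (CNNs), so it cannot be applied directly. The key to the solution is an optimized framework that replaces standard non-linear functions with homomorphically computable approximations and restructures the CNN architecture, protecting data privacy while minimizing computational overhead. Experiments on CIFAR-10 show that the scheme, using a degree-4 polynomial and Softplus activation under CKKS, reaches 94.4% accuracy, with encrypted inference taking 2.42 s for a single sample and 24,000 s for 10,000 samples, balancing model performance and security.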

Link: https://arxiv.org/abs/2511.07807
Authors: Zeinab Elkhatib,Ali Sekmen,Kamrul Hasan
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:With the rapid advancements in machine learning, models have become increasingly capable of learning and making predictions in various industries. However, deploying these models in critical infrastructures presents a major challenge, as concerns about data privacy prevent unrestricted data sharing. Homomorphic encryption (HE) offers a solution by enabling computations on encrypted data, but it remains incompatible with machine learning models like convolutional neural networks (CNNs), due to their reliance on non-linear activation functions. To bridge this gap, this work proposes an optimized framework that replaces standard non-linear functions with homomorphically compatible approximations, ensuring secure computations while minimizing computational overhead. The proposed approach restructures the CNN architecture and introduces an efficient activation function approximation method to mitigate the performance trade-offs introduced by encryption. Experiments on CIFAR-10 achieve 94.4% accuracy with 2.42 s per single encrypted sample and 24,000 s per 10,000 encrypted samples, using a degree-4 polynomial and Softplus activation under CKKS, balancing accuracy and privacy.
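
Schemes such as CKKS evaluate only additions and multiplications, so a non-linear activation has to be replaced by a polynomial. The abstract does not describe how the approximation is obtained. As one plausible illustration, the sketch below fits a degree-4 polynomial to Softplus by ordinary least squares. The fitting interval [-4, 4], the sample grid, and the helper names are assumptions.

```python
import math


def softplus(x):
    return math.log1p(math.exp(x))


def fit_polynomial(xs, ys, degree):
    """Least-squares fit via the normal equations and Gaussian elimination."""
    n = degree + 1
    # Normal equations (V^T V) c = V^T y with V[i][j] = xs[i] ** j.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for j in range(col, n):
                a[row][j] -= factor * a[col][j]
            b[row] -= factor * b[col]
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs  # coeffs[j] multiplies x ** j


def evaluate(coeffs, x):
    """Horner evaluation: uses only additions and multiplications."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


xs = [-4 + 8 * i / 200 for i in range(201)]
coeffs = fit_polynomial(xs, [softplus(x) for x in xs], degree=4)
max_error = max(abs(evaluate(coeffs, x) - softplus(x)) for x in xs)
```

The approximation is only trustworthy inside the fitted interval, which is why HE-friendly networks usually need their pre-activation values kept within a known range.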

[AI-84] Judging by the Rules: Compliance-Aligned Framework for Modern Slavery Statement Monitoring AAAI-26

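[Quick read]: This paper addresses the vague wording and inconsistent formats of modern slavery compliance disclosures, which make manual review inefficient and hard to scale. It also notes that current solutions based on Large Language Models (LLMs) tend to reduce complex regulatory assessments to binary judgments, lacking legal verifiability and structured output. The core solution is a rule-aligned compliance verification framework. Its key component is the Compliance Alignment Judge (CA-Judge), which provides feedback by evaluating whether model-generated justifications are faithful to statutory requirements. This feedback is used to train the Compliance Alignment LLM (CALLM), which produces outputs that are consistent with regulatory rules, transparent, and verifiable by legal experts, enabling accurate and traceable compliance analysis.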

Link: https://arxiv.org/abs/2511.07803
Authors: Wenhao Xu,Akshatha Arodi,Jian-Yun Nie,Arsene Fansi Tchango
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: To appear at AAAI-26 (Social Impact Track)

Abstract:Modern slavery affects millions of people worldwide, and regulatory frameworks such as Modern Slavery Acts now require companies to publish detailed disclosures. However, these statements are often vague and inconsistent, making manual review time-consuming and difficult to scale. While NLP offers a promising path forward, high-stakes compliance tasks require more than accurate classification: they demand transparent, rule-aligned outputs that legal experts can verify. Existing applications of large language models (LLMs) often reduce complex regulatory assessments to binary decisions, lacking the necessary structure for robust legal scrutiny. We argue that compliance verification is fundamentally a rule-matching problem: it requires evaluating whether textual statements adhere to well-defined regulatory rules. To this end, we propose a novel framework that harnesses AI for rule-level compliance verification while preserving expert oversight. At its core is the Compliance Alignment Judge (CA-Judge), which evaluates model-generated justifications based on their fidelity to statutory requirements. Using this feedback, we train the Compliance Alignment LLM (CALLM), a model that produces rule-consistent, human-verifiable outputs. CALLM improves predictive performance and generates outputs that are both transparent and legally grounded, offering a more verifiable and actionable solution for real-world compliance analysis.

[AI-85] HybridGuard: Enhancing Minority-Class Intrusion Detection in Dew-Enabled Edge-of-Things Networks

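[Quick read]: This paper addresses intrusion detection in Edge-of-Things (EoT) networks, in particular the low recognition accuracy for minority attack classes caused by data imbalance. The key to the solution is the HybridGuard framework, whose core contributions are: (1) mutual-information-based feature selection that keeps the most relevant features to improve detection performance; (2) a Wasserstein Conditional Generative Adversarial Network with Gradient Penalty (WCGAN-GP) that mitigates class imbalance and improves detection precision; (3) a two-phase architecture, DualNetShield, for fine-grained traffic analysis and anomaly detection. Together these outperform existing methods across a range of realistic attack scenarios and adapt to evolving cybersecurity threats.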

Link: https://arxiv.org/abs/2511.07793
Authors: Binayak Kara,Ujjwal Sahua,Ciza Thomas,Jyoti Prakash Sahoo
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Securing Dew-Enabled Edge-of-Things (EoT) networks against sophisticated intrusions is a critical challenge. This paper presents HybridGuard, a framework that integrates machine learning and deep learning to improve intrusion detection. HybridGuard addresses data imbalance through mutual information based feature selection, ensuring that the most relevant features are used to improve detection performance, especially for minority attack classes. The framework leverages Wasserstein Conditional Generative Adversarial Networks with Gradient Penalty (WCGAN-GP) to further reduce class imbalance and enhance detection precision. It adopts a two-phase architecture called DualNetShield to support advanced traffic analysis and anomaly detection, improving the granular identification of threats in complex EoT environments. HybridGuard is evaluated on the UNSW-NB15, CIC-IDS-2017, and IOTID20 datasets, where it demonstrates strong performance across diverse attack scenarios and outperforms existing solutions in adapting to evolving cybersecurity threats. This approach establishes HybridGuard as an effective tool for protecting EoT networks against modern intrusions.
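
Mutual-information feature selection, the first ingredient named in the abstract, ranks each feature by how much it reduces uncertainty about the class label. The sketch below is a minimal version for already-discretized features. The feature names and toy data are hypothetical, and the paper's actual estimator and preprocessing are not described in the abstract.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """I(X; Y) in nats for two equally long sequences of discrete values."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def top_features(columns, labels, keep):
    """Return the names of the `keep` features with the highest MI score."""
    scores = {name: mutual_information(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:keep]


labels = [0, 0, 0, 0, 1, 1]           # 1 marks the minority (attack) class
columns = {
    "flag": [0, 0, 0, 0, 1, 1],       # perfectly predicts the label
    "noise": [0, 1, 0, 1, 0, 1],      # statistically independent of the label
}
selected = top_features(columns, labels, keep=1)
```

Real traffic features are mostly continuous, so in practice they would be binned first or scored with a nearest-neighbor estimator such as scikit-learn's `mutual_info_classif`.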

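[AI-86] Physical Consistency of Aurora's Encoder: A Quantitative Study

[Quick read]: This paper addresses the lack of interpretability of large weather forecasting models such as Aurora, whose opaque internal representations limit their use in high-stakes operational settings. The key to the solution is to train linear classifiers on a large-scale dataset of embeddings and systematically test whether the latent representations of the model's encoder align with known physical and meteorological concepts (the land-sea boundary, extreme temperature events, and atmospheric instability). This gives a quantitative assessment of physical consistency and a way to validate and build trust in the next generation of AI-driven weather models.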

Link: https://arxiv.org/abs/2511.07787
Authors: Benjamin Richards,Pushpa Kumar Balan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted for poster presentation at the AICC: Workshop on AI for Climate and Conservation at EurIPS 2025

Abstract:The high accuracy of large-scale weather forecasting models like Aurora is often accompanied by a lack of transparency, as their internal representations remain largely opaque. This “black box” nature hinders their adoption in high-stakes operational settings. In this work, we probe the physical consistency of Aurora’s encoder by investigating whether its latent representations align with known physical and meteorological concepts. Using a large-scale dataset of embeddings, we train linear classifiers to identify three distinct concepts: the fundamental land-sea boundary, high-impact extreme temperature events, and atmospheric instability. Our findings provide quantitative evidence that Aurora learns physically consistent features, while also highlighting its limitations in capturing the rarest events. This work underscores the critical need for interpretability methods to validate and build trust in the next generation of AI-driven weather models.

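[AI-87] TurboSAT: Gradient-Guided Boolean Satisfiability Accelerated on GPU-CPU Hybrid System

[Quick read]: This paper addresses the limited parallelism of Boolean satisfiability (SAT) solving. Mainstream SAT solvers rely on inherently sequential conflict-driven search algorithms, which offer powerful heuristics but severely limit parallelism and scalability. The key to the solution is to model the SAT problem as a differentiable binarized matrix-matrix multiplication layer, so that differentiable optimization can evaluate clauses and optimize variable assignments in a massively parallel fashion on GPUs. Candidate assignments generated on the GPU are then refined by sequential conflict-driven search on the CPU, forming a hybrid GPU-CPU solver that combines the strengths of parallel differentiable optimization and sequential search. Experiments on an NVIDIA DGX GB200 node show runtime speedups of more than 200x over a state-of-the-art CPU solver on satisfiable benchmark problems.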

Link: https://arxiv.org/abs/2511.07737
Authors: Steve Dai,Cunxi Yu,Kalyan Krishnamani,Brucek Khailany
Affiliation: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Mathematical Software (cs.MS)
Comments: 7 pages, 5 equations, 5 figures, 1 table

Abstract:While accelerated computing has transformed many domains of computing, its impact on logical reasoning, specifically Boolean satisfiability (SAT), remains limited. State-of-the-art SAT solvers rely heavily on inherently sequential conflict-driven search algorithms that offer powerful heuristics but limit the amount of parallelism that could otherwise enable significantly more scalable SAT solving. Inspired by neural network training, we formulate the SAT problem as a binarized matrix-matrix multiplication layer that could be optimized using a differentiable objective function. Enabled by this encoding, we combine the strengths of parallel differentiable optimization and sequential search to accelerate SAT on a hybrid GPU-CPU system. In this system, the GPUs leverage parallel differentiable solving to rapidly evaluate SAT clauses and use gradients to stochastically explore the solution space and optimize variable assignments. Promising partial assignments generated by the GPUs are post-processed on many CPU threads which exploit conflict-driven sequential search to further traverse the solution subspaces and identify complete assignments. Prototyping the hybrid solver on an NVIDIA DGX GB200 node, our solver achieves runtime speedups up to over 200x when compared to a state-of-the-art CPU-based solver on public satisfiable benchmark problems from the SAT Competition.
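
To see why clause evaluation maps onto a matrix product, encode each clause as a row with +1 for a positive literal, -1 for a negated literal, and 0 elsewhere, and each candidate assignment as a vector of +1/-1 values. The sketch below shows only this batch evaluation. It is an illustrative encoding chosen for this note, not necessarily the paper's formulation, and the gradient-based update of assignments is omitted.

```python
def clause_matrix(clauses, num_vars):
    """Build one row per clause from DIMACS-style literals (3 = x3, -3 = NOT x3)."""
    rows = []
    for clause in clauses:
        row = [0] * num_vars
        for lit in clause:
            row[abs(lit) - 1] = 1 if lit > 0 else -1
        rows.append(row)
    return rows


def satisfied_clause_counts(matrix, assignments):
    """Count satisfied clauses for each candidate assignment (+1 = True, -1 = False)."""
    counts = []
    for x in assignments:
        satisfied = 0
        for row in matrix:
            width = sum(abs(v) for v in row)
            # dot = (#true literals) - (#false literals) for this clause.
            dot = sum(r * xi for r, xi in zip(row, x))
            # A clause is violated only when every literal is false, i.e. dot == -width.
            if dot > -width:
                satisfied += 1
        counts.append(satisfied)
    return counts


# (x1 OR x2) AND (NOT x1 OR x3) AND (NOT x2 OR NOT x3)
clauses = [[1, 2], [-1, 3], [-2, -3]]
matrix = clause_matrix(clauses, num_vars=3)
counts = satisfied_clause_counts(matrix, [[1, -1, 1], [1, 1, 1]])
```

The inner dot products over all clauses and all candidates form a single clause-by-candidate matrix product, which is the kind of workload GPUs handle well.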

[AI-88] Global Optimization on Graph-Structured Data via Gaussian Processes with Spectral Representations

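[Quick read]: This paper addresses global optimization over graph-structured domains: how to apply Bayesian optimization (BO) efficiently to discrete, combinatorially complex graph data, especially for large or partially observed graphs. Existing methods either depend on the full graph topology (impractical for large graphs) or explore incrementally (slow to converge). The key to the solution is a Gaussian process (GP) surrogate based on low-rank spectral representations, built from sparse structural observations, with graph structure and node embeddings learned jointly. This yields more efficient global search and principled uncertainty estimation even with limited data.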

Link: https://arxiv.org/abs/2511.07734
Authors: Shu Hong,Yongsheng Mei,Mahdi Imani,Tian Lan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI); Signal Processing (eess.SP)
Comments:

Abstract:Bayesian optimization (BO) is a powerful framework for optimizing expensive black-box objectives, yet extending it to graph-structured domains remains challenging due to the discrete and combinatorial nature of graphs. Existing approaches often rely on either full graph topology, which is impractical for large or partially observed graphs, or incremental exploration, which can lead to slow convergence. We introduce a scalable framework for global optimization over graphs that employs low-rank spectral representations to build Gaussian process (GP) surrogates from sparse structural observations. The method jointly infers graph structure and node representations through learnable embeddings, enabling efficient global search and principled uncertainty estimation even with limited data. We also provide theoretical analysis establishing conditions for accurate recovery of underlying graph structure under different sampling regimes. Experiments on synthetic and real-world datasets demonstrate that our approach achieves faster convergence and improved optimization performance compared to prior methods.

[AI-89] A Negotiation-Based Multi-Agent Reinforcement Learning Approach for Dynamic Scheduling of Reconfigurable Manufacturing Systems

Link: https://arxiv.org/abs/2511.07707
Authors: Manonmani Sekar,Nasim Nezamoddini
Affiliation: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

[AI-90] Diffusion Guided Adversarial State Perturbations in Reinforcement Learning

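[Quick read]: This paper addresses the vulnerability of Reinforcement Learning (RL) systems to adversarial attacks in vision-based environments, and in particular a fundamental flaw behind defenses that appear effective: existing l_p norm-constrained attacks can hardly change the semantics of an image, so they do not truly test the robustness of RL agents. The key to the solution is SHIFT, a policy-agnostic, diffusion-based state perturbation attack. It generates perturbed states that differ semantically from the true states while remaining realistic and history-aligned, which lets it bypass existing defenses and degrade their performance, revealing how vulnerable RL agents are to semantics-aware adversarial perturbations.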

Link: https://arxiv.org/abs/2511.07701
Authors: Xiaolin Sun,Feidi Liu,Zhengming Ding,ZiZhan Zheng
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement learning (RL) systems, while achieving remarkable success across various domains, are vulnerable to adversarial attacks. This is especially a concern in vision-based environments where minor manipulations of high-dimensional image inputs can easily mislead the agent’s behavior. To this end, various defenses have been proposed recently, with state-of-the-art approaches achieving robust performance even under large state perturbations. However, after closer investigation, we found that the effectiveness of the current defenses is due to a fundamental weakness of the existing l_p norm-constrained attacks, which can barely alter the semantics of image input even under a relatively large perturbation budget. In this work, we propose SHIFT, a novel policy-agnostic diffusion-based state perturbation attack to go beyond this limitation. Our attack is able to generate perturbed states that are semantically different from the true states while remaining realistic and history-aligned to avoid detection. Evaluations show that our attack effectively breaks existing defenses, including the most sophisticated ones, significantly outperforming existing attacks while being more perceptually stealthy. The results highlight the vulnerability of RL agents to semantics-aware adversarial perturbations, indicating the importance of developing more robust policies.

[AI-91] Towards AI-Assisted Generation of Military Training Scenarios

Link: https://arxiv.org/abs/2511.07690
Authors: Soham Hans,Volkan Ustun,Benjamin Nye,James Sterrett,Matthew Green
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

[AI-92] Designing and Evaluating Malinowski's Lens: An AI-Native Educational Game for Ethnographic Learning

Link: https://arxiv.org/abs/2511.07682
Authors: Michael Hoffmann,Jophin John,Jan Fillies,Adrian Paschke
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 21 pages, 8 figures. Full preprint version; shorter version in preparation

[AI-93] AIA Forecaster: Technical Report

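[Quick read]: This paper addresses how to use Generative AI for accurate judgmental forecasting from unstructured data, especially when no explicit statistical model or historical data is available. The key to the solution is the AIA Forecaster, a system built on a Large Language Model (LLM). Its core components are: (1) agentic search that retrieves up-to-date information from high-quality news sources; (2) a supervisor agent that reconciles disparate forecasts for the same event; (3) statistical calibration techniques that counter the behavioral biases of LLMs. The method matches human superforecasters on the ForecastBench benchmark and, on a more challenging benchmark drawn from prediction markets, provides information complementary to market consensus, suggesting scalable, transferable expert-level forecasting.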

Link: https://arxiv.org/abs/2511.07678
Authors: Rohan Alur,Bradly C. Stadie,Daniel Kang,Ryan Chen,Matt McManus,Michael Rickert,Tyler Lee,Michael Federici,Richard Zhu,Dennis Fogerty,Hayley Williamson,Nina Lozinski,Aaron Linsky,Jasjeet S. Sekhon
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:This technical report describes the AIA Forecaster, a Large Language Model (LLM)-based system for judgmental forecasting using unstructured data. The AIA Forecaster approach combines three core elements: agentic search over high-quality news sources, a supervisor agent that reconciles disparate forecasts for the same event, and a set of statistical calibration techniques to counter behavioral biases in large language models. On the ForecastBench benchmark (Karger et al., 2024), the AIA Forecaster achieves performance equal to human superforecasters, surpassing prior LLM baselines. In addition to reporting on ForecastBench, we also introduce a more challenging forecasting benchmark sourced from liquid prediction markets. While the AIA Forecaster underperforms market consensus on this benchmark, an ensemble combining AIA Forecaster with market consensus outperforms consensus alone, demonstrating that our forecaster provides additive information. Our work establishes a new state of the art in AI forecasting and provides practical, transferable recommendations for future research. To the best of our knowledge, this is the first work that verifiably achieves expert-level forecasting at scale.
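
The abstract reports that an ensemble of the forecaster and market consensus beats consensus alone, but it does not say how the two are combined. One simple, commonly used option is to average the probabilities in log-odds space and compare Brier scores. The sketch below shows that option only. The function names and the default `weight` are assumptions, and no numbers from the report are reproduced.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def ensemble(p_model, p_market, weight=0.5):
    """Weighted average of two probabilities in log-odds space."""
    return sigmoid(weight * logit(p_model) + (1 - weight) * logit(p_market))


def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


combined = ensemble(0.6, 0.9)
uninformed = brier_score([0.5, 0.5], [1, 0])
```

Averaging in log-odds space keeps the combined forecast between the two inputs, and a lower Brier score on resolved questions would indicate that the model adds information beyond the market price.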

[AI-94] Speech Separation for Hearing-Impaired Children in the Classroom

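[Quick read]: This paper addresses the speech perception difficulties that children face in classrooms because of background noise, multiple talkers, and reverberation. These difficulties are greater for children than for adults, yet most deep learning speech separation models are trained on adult speech under simplified conditions and account for neither the higher spectral similarity of children's voices nor the acoustic complexity of real classrooms. The key to the solution is MIMO-TasNet, a compact, low-latency, multi-channel architecture suited to real-time deployment in bilateral hearing aids or cochlear implants. The model is trained on simulated naturalistic classroom scenes with moving child-child and child-adult talker pairs, using spatial cues to adapt to children's speech. Experiments show that fine-tuning with only half of the classroom data yields substantial gains, and that training with diffuse babble noise further improves robustness while preserving spatial awareness. These results support combining spatially aware architectures with targeted transfer learning to improve speech accessibility for children in classrooms.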

Link: https://arxiv.org/abs/2511.07677
Authors: Feyisayo Olalere,Kiki van der Heijden,H. Christiaan Stronks,Jeroen Briaire,Johan H. M. Frijns,Yagmur Güçlütürk
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 13 pages

Abstract:Classroom environments are particularly challenging for children with hearing impairments, where background noise, multiple talkers, and reverberation degrade speech perception. These difficulties are greater for children than adults, yet most deep learning speech separation models for assistive devices are developed using adult voices in simplified, low-reverberation conditions. This overlooks both the higher spectral similarity of children’s voices, which weakens separation cues, and the acoustic complexity of real classrooms. We address this gap using MIMO-TasNet, a compact, low-latency, multi-channel architecture suited for real-time deployment in bilateral hearing aids or cochlear implants. We simulated naturalistic classroom scenes with moving child-child and child-adult talker pairs under varying noise and distance conditions. Training strategies tested how well the model adapts to children’s speech through spatial cues. Models trained on adult speech, classroom data, and finetuned variants were compared to assess data-efficient adaptation. Results show that adult-trained models perform well in clean scenes, but classroom-specific training greatly improves separation quality. Finetuning with only half the classroom data achieved comparable gains, confirming efficient transfer learning. Training with diffuse babble noise further enhanced robustness, and the model preserved spatial awareness while generalizing to unseen distances. These findings demonstrate that spatially aware architectures combined with targeted adaptation can improve speech accessibility for children in noisy classrooms, supporting future on-device assistive technologies.

[AI-95] Making LLMs Reliable When It Matters Most: A Five-Layer Architecture for High-Stakes Decisions

Link: https://arxiv.org/abs/2511.07669
Authors: Alejandro R. Jadad
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 24 pages, 1 figure, 2 tables

[AI-96] AI-Driven Contribution Evaluation and Conflict Resolution: A Framework Design for Group Workload Investigation

Link: https://arxiv.org/abs/2511.07667
Authors: Jakub Slapek,Mir Seyedebrahimi,Yang Jianhua
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 20 pages, 8 figures, 8 tables

[AI-97] FractalCloud: A Fractal-Inspired Architecture for Efficient Large-Scale Point Cloud Processing HPCA2026

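[Quick read]: This paper addresses the high complexity and inefficiency of large-scale three-dimensional (3D) point cloud processing caused by all-to-all computation and global memory access: when point-based neural networks (PNNs) process inputs with hundreds of thousands of points, they face O(n^2) computational complexity and heavy memory traffic. Existing accelerators are optimized for small-scale workloads and lack parallel architectural support, so they scale poorly. The key to the solution is the FractalCloud hardware architecture, whose core innovations are: (1) a co-designed Fractal method for shape-aware, hardware-friendly partitioning; (2) block-parallel point operations that decompose and parallelize all point operations. Combined with on-chip fractal structures and flexible parallelism, FractalCloud achieves fully parallel processing within limited memory resources. Implemented in 28 nm technology with a 1.5 mm^2 core area, it delivers a 21.7x speedup and a 27x energy reduction while maintaining network accuracy.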

Link: https://arxiv.org/abs/2511.07665
Authors: Yuzhe Fu,Changchun Zhou,Hancheng Ye,Bowen Duan,Qiyu Huang,Chiyue Wei,Cong Guo,Hai "Helen" Li,Yiran Chen
Affiliation: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in HPCA2026. Codes will be released later

Abstract:Three-dimensional (3D) point clouds are increasingly used in applications such as autonomous driving, robotics, and virtual reality (VR). Point-based neural networks (PNNs) have demonstrated strong performance in point cloud analysis, originally targeting small-scale inputs. However, as PNNs evolve to process large-scale point clouds with hundreds of thousands of points, all-to-all computation and global memory access in point cloud processing introduce substantial overhead, causing O(n^2) computational complexity and memory traffic where n is the number of points. Existing accelerators, primarily optimized for small-scale workloads, overlook this challenge and scale poorly due to inefficient partitioning and non-parallel architectures. To address these issues, we propose FractalCloud, a fractal-inspired hardware architecture for efficient large-scale 3D point cloud processing. FractalCloud introduces two key optimizations: (1) a co-designed Fractal method for shape-aware and hardware-friendly partitioning, and (2) block-parallel point operations that decompose and parallelize all point operations. A dedicated hardware design with on-chip fractal and flexible parallelism further enables fully parallel processing within limited memory resources. Implemented in 28 nm technology as a chip layout with a core area of 1.5 mm^2, FractalCloud achieves 21.7x speedup and 27x energy reduction over state-of-the-art accelerators while maintaining network accuracy, demonstrating its scalability and efficiency for PNN inference.
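
A back-of-the-envelope count shows why partitioning helps with the O(n^2) cost mentioned in the abstract. The sketch below compares all-to-all pair operations on the whole cloud with the same operations restricted to equal-sized blocks. Equal block sizes and the absence of cross-block work are simplifying assumptions for illustration; the paper's Fractal partitioning is shape-aware and is not reproduced here.

```python
def pairwise_ops(num_points):
    """All-to-all operations on a set of points."""
    return num_points * num_points


def blocked_pairwise_ops(num_points, num_blocks):
    """All-to-all operations performed only inside each block."""
    base, extra = divmod(num_points, num_blocks)
    # Spread any remainder over the first `extra` blocks.
    sizes = [base + 1] * extra + [base] * (num_blocks - extra)
    return sum(pairwise_ops(size) for size in sizes)


full = pairwise_ops(200_000)
blocked = blocked_pairwise_ops(200_000, num_blocks=100)
reduction = full // blocked
```

With b equal blocks the count drops from n^2 to n^2 / b, and the blocks are independent of each other, which is what makes block-parallel execution possible.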

[AI-98] Cortex AISQL: A Production SQL Engine for Unstructured Data

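[Quick read]: This paper addresses the challenges of executing Generative AI semantic operations efficiently in production: semantic operations cost far more than traditional SQL operations, their latency and throughput characteristics differ markedly, their cost and selectivity are hard to estimate at query compilation time, and existing query engines are not designed to optimize them. The key to the solution is three techniques. First, AI-aware query optimization treats large language model (LLM) inference cost as a first-class optimization objective and reasons about it directly during query planning, giving 2-8x speedups. Second, adaptive model cascades route most rows through a fast proxy model and send only uncertain cases to a powerful oracle model, giving 2-6x speedups while keeping 90-95% of oracle model quality. Third, semantic join query rewriting reformulates join operations with quadratic time complexity as multi-label classification tasks, reducing them to linear complexity and giving 15-70x speedups, often with better prediction quality. Together these techniques support the production deployment of AISQL at Snowflake, serving workloads across analytics, search, and content understanding.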

Link: https://arxiv.org/abs/2511.07663
Authors: Paritosh Aggarwal,Bowei Chen,Anupam Datta,Benjamin Han,Boxin Jiang,Nitish Jindal,Zihan Li,Aaron Lin,Pawel Liskowski,Jay Tayade,Dimitris Tsirogiannis,Nathan Wiegand,Weicheng Zhao
Affiliation: Unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Snowflake’s Cortex AISQL is a production SQL engine that integrates native semantic operations directly into SQL. This integration allows users to write declarative queries that combine relational operations with semantic reasoning, enabling them to query both structured and unstructured data effortlessly. However, making semantic operations efficient at production scale poses fundamental challenges. Semantic operations are more expensive than traditional SQL operations, possess distinct latency and throughput characteristics, and their cost and selectivity are unknown during query compilation. Furthermore, existing query engines are not designed to optimize semantic operations. The AISQL query execution engine addresses these challenges through three novel techniques informed by production deployment data from Snowflake customers. First, AI-aware query optimization treats AI inference cost as a first-class optimization objective, reasoning about large language model (LLM) cost directly during query planning to achieve 2-8x speedups. Second, adaptive model cascades reduce inference costs by routing most rows through a fast proxy model while escalating uncertain cases to a powerful oracle model, achieving 2-6x speedups while maintaining 90-95% of oracle model quality. Third, semantic join query rewriting lowers the quadratic time complexity of join operations to linear through reformulation as multi-label classification tasks, achieving 15-70x speedups with often improved prediction quality. AISQL is deployed in production at Snowflake, where it powers diverse customer workloads across analytics, search, and content understanding.
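
The model-cascade idea from the abstract can be sketched as a simple router: a cheap proxy scores every row, and only rows with an uncertain score are escalated to the expensive oracle. This is a minimal illustration, not Snowflake's implementation. The thresholds `low` and `high`, the function names, and the toy models are all hypothetical.

```python
def cascade_classify(rows, proxy, oracle, low=0.2, high=0.8):
    """Classify rows with a proxy model, escalating uncertain ones to an oracle.

    proxy(row)  -> probability in [0, 1] that the predicate holds.
    oracle(row) -> bool, assumed accurate but expensive.
    Returns the predicted labels and the number of oracle calls made.
    """
    results, oracle_calls = [], 0
    for row in rows:
        score = proxy(row)
        if score >= high:
            results.append(True)       # proxy is confident: accept
        elif score <= low:
            results.append(False)      # proxy is confident: reject
        else:
            oracle_calls += 1          # uncertain: escalate
            results.append(oracle(row))
    return results, oracle_calls


# Hypothetical stand-ins for real models.
def toy_proxy(text):
    return {"great": 0.95, "awful": 0.05}.get(text, 0.5)


def toy_oracle(text):
    return text in {"great", "not bad"}


labels, calls = cascade_classify(["great", "awful", "not bad"], toy_proxy, toy_oracle)
```

Widening the band between `low` and `high` sends more rows to the oracle, trading speed for quality, which is the knob behind the speedup and quality ranges quoted in the abstract.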

[AI-99] Adaptive Graph Learning with Transformer for Multi-Reservoir Inflow Prediction ICDM2025

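[Quick read]: This paper addresses the fact that multi-reservoir inflow prediction typically ignores spatial dependencies among reservoirs: traditional approaches use single-reservoir models that cannot capture dynamic relationships in complex hydrological networks. The key to the solution is the AdaTrip framework, which builds time-varying dynamic graphs in which reservoirs are nodes and directed edges represent hydrological connections, and uses attention mechanisms to automatically identify the important spatial and temporal dependencies. Parameter sharing improves predictions for reservoirs with scarce data, and interpretable attention maps at the edge and time-step levels give insight into hydrological controls, supporting operational decision-making.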

Link: https://arxiv.org/abs/2511.07649
Authors: Pengfei Hu,Ming Fan,Xiaoxue Han,Chang Lu,Wei Zhang,Hyun Kang,Yue Ning,Dan Lu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: ICDM 2025 DMESS Workshop

Abstract:Reservoir inflow prediction is crucial for water resource management, yet existing approaches mainly focus on single-reservoir models that ignore spatial dependencies among interconnected reservoirs. We introduce AdaTrip as an adaptive, time-varying graph learning framework for multi-reservoir inflow forecasting. AdaTrip constructs dynamic graphs where reservoirs are nodes with directed edges reflecting hydrological connections, employing attention mechanisms to automatically identify crucial spatial and temporal dependencies. Evaluation on thirty reservoirs in the Upper Colorado River Basin demonstrates superiority over existing baselines, with improved performance for reservoirs with limited records through parameter sharing. Additionally, AdaTrip provides interpretable attention maps at edge and time-step levels, offering insights into hydrological controls to support operational decision-making. Our code is available at this https URL.

[AI-100] A Self-Improving Architecture for Dynamic Safety in Large Language Models

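[Quick Read]: This paper addresses the dynamic safety challenge that large language models (LLMs) face when deployed in real software systems: static software architecture patterns and non-scalable safety assurance methods cannot keep up with novel adversarial threats, leaving systems vulnerable at runtime. The key to its solution is the Self-Improving Safety Framework (SISF), which couples an unaligned base LLM (mistralai/Mistral-7B-v0.1) with a dynamic feedback loop so that safety policies are adjusted autonomously at runtime. The loop consists of an AI Adjudicator (GPT-4o) that detects breaches and a Policy Synthesis Module (GPT-4 Turbo) that automatically generates new, generalized safety policies (both heuristic and semantic), continually improving robustness without compromising user utility.

Link: https://arxiv.org/abs/2511.07645
Authors: Tyler Slater
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Under review at the journal Information and Software Technology (Special Issue on Software Architecture for AI-Driven Systems)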

Abstract:Context: The integration of Large Language Models (LLMs) into core software systems is accelerating. However, existing software architecture patterns are static, while current safety assurance methods are not scalable, leaving systems vulnerable to novel adversarial threats. Objective: To design, implement, and evaluate a novel software architecture that enables an AI-driven system to autonomously and continuously adapt its own safety protocols at runtime. Method: We propose the Self-Improving Safety Framework (SISF), a runtime architecture that couples an unprotected, unaligned base LLM (mistralai/Mistral-7B-v0.1) with a dynamic feedback loop. This loop consists of an AI Adjudicator (GPT-4o) for breach detection and a Policy Synthesis Module (GPT-4 Turbo) that autonomously generates new, generalized safety policies (both heuristic and semantic) in response to failures. Results: We conducted a dynamic learning evaluation using the 520-prompt AdvBench dataset. The unprotected model was 100% vulnerable. Our SISF, starting from zero policies, demonstrated a clear learning curve: it detected 237 breaches, autonomously synthesized 234 new policies, and reduced the overall Attack Success Rate (ASR) to 45.58%. In a subsequent test on 520 benign prompts, the SISF achieved a 0.00% False Positive Rate (FPR), proving its ability to adapt without compromising user utility. Conclusion: An architectural approach to AI safety, based on the principles of self-adaptation, is a viable and effective strategy. Our framework demonstrates a practical path towards building more robust, resilient, and scalable AI-driven systems, shifting safety assurance from a static, pre-deployment activity to an automated, runtime process. 
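
As a quick consistency check on the figures reported in this abstract, the sketch below recomputes the attack success rate. It assumes, since the abstract does not say so explicitly, that ASR is the number of detected breaches divided by the number of AdvBench prompts; the variable names are illustrative only.

```python
# Figures quoted in the abstract.
total_prompts = 520   # AdvBench prompts
breaches = 237        # breaches detected by the adjudicator
policies = 234        # policies synthesized in response

# Assumption: ASR = breaches / total prompts.
asr_percent = round(100 * breaches / total_prompts, 2)

# Prompts that did not lead to a breach under the same assumption.
blocked = total_prompts - breaches

print(f"ASR = {asr_percent}% (blocked {blocked} of {total_prompts})")
```

Under that assumption the computed value agrees with the reported 45.58%, which suggests the ASR is measured over the whole learning run rather than on a held-out set after the policies were learned.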
ACM classes: D.2.2; I.2.6; D.4.6. Cite as: arXiv:2511.07645 [cs.SE] (v1 submitted Mon, 10 Nov 2025 21:39:40 UTC).

[AI-101] Private-RAG: Answering Multiple Queries with LLMs while Keeping Your Data Private

Link: https://arxiv.org/abs/2511.07637
Authors: Ruihan Wu, Erchi Wang, Zhiyuan Zhang, Yu-Xiang Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

[AI-102] Partial Action Replacement: Tackling Distribution Shift in Offline MARL AAAI2026

Link: https://arxiv.org/abs/2511.07629
Authors: Yue Jin, Giovanni Montana
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted by AAAI 2026

[AI-103] One Router to Route Them All: Homogeneous Expert Routing for Heterogeneous Graph Transformers

Link: https://arxiv.org/abs/2511.07603
Authors: Georgiy Shakirov, Albert Arakelov
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 4 figures; 2 tables; work in progress, feedback welcome

[AI-104] Leveraging the Power of AI and Social Interactions to Restore Trust in Public Polls

Link: https://arxiv.org/abs/2511.07593
Authors: Amr Akmal Abouelmagd, Amr Hilal
Affiliations: Unknown
Categories: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

[AI-105] SemanticForge: Repository-Level Code Generation through Semantic Knowledge Graphs and Constraint Satisfaction

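[Quick Read]: This paper addresses the systematic errors that large language models (LLMs) make in automated code generation for software development, in particular two critical failure modes: logical hallucination (incorrect control-flow or data-flow reasoning) and schematic hallucination (type mismatches, signature violations, and architectural inconsistencies). These errors stem from the absence of an explicit, queryable representation of repository-wide semantics. The core of the solution is the SemanticForge framework, whose key techniques are: (1) a novel automatic reconciliation algorithm for static-dynamic knowledge graphs that unifies compile-time and runtime program semantics; (2) a neural method that generates structured graph queries from natural language, reaching 73% precision (versus 51% for traditional retrieval); (3) a beam search algorithm with integrated SMT solving that verifies constraints during generation rather than post hoc; and (4) an incremental maintenance algorithm that updates the knowledge graph in O(|ΔR|·log n) time while preserving semantic equivalence.

Link: https://arxiv.org/abs/2511.07584
Authors: Wuyang Zhang, Chenkai Zhang, Zhen Luo, Jianming Ma, Wangming Yuan, Chuqiao Gu, Chenwei Feng
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: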

Abstract:Large language models (LLMs) have transformed software development by enabling automated code generation, yet they frequently suffer from systematic errors that limit practical deployment. We identify two critical failure modes: logical hallucination (incorrect control/data-flow reasoning) and schematic hallucination (type mismatches, signature violations, and architectural inconsistencies). These errors stem from the absence of explicit, queryable representations of repository-wide semantics. This paper presents SemanticForge, which introduces four fundamental algorithmic advances for semantically-aware code generation: (1) a novel automatic reconciliation algorithm for dual static-dynamic knowledge graphs, unifying compile-time and runtime program semantics; (2) a neural approach that learns to generate structured graph queries from natural language, achieving 73% precision versus 51% for traditional retrieval; (3) a novel beam search algorithm with integrated SMT solving, enabling real-time constraint verification during generation rather than post-hoc validation; and (4) an incremental maintenance algorithm that updates knowledge graphs in O(|ΔR|·log n) time while maintaining semantic equivalence. Journal reference: INNO-PRESS: Journal of Emerging Applied AI, 2025.

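[AI-106] Procedural Knowledge Improves Agentic LLM Workflows

[Quick Read]: This paper addresses the limited performance of large language models (LLMs) on agentic tasks that require implicit planning, especially without tool support, prompt engineering, or fine-tuning. Its core solution is to introduce procedural knowledge in the form of a hierarchical task network (HTN): the authors formalize, implement, and evaluate an agentic LLM workflow that exploits HTN structure, thereby improving planning efficiency and overall performance. The key finding is that hand-coded HTNs markedly improve LLM performance on agentic tasks, even allowing smaller models (20B or 70B parameters) to outperform a much larger 120B-parameter baseline, which indicates that explicitly modeling procedural knowledge is an important way to improve LLM agent capabilities.

Link: https://arxiv.org/abs/2511.07568
Authors: Vincent Hsiao, Mark Roberts, Leslie Smith
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: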

Abstract:Large language models (LLMs) often struggle when performing agentic tasks without substantial tool support, prompt engineering, or fine-tuning. Despite research showing that domain-dependent, procedural knowledge can dramatically increase planning efficiency, little work evaluates its potential for improving LLM performance on agentic tasks that may require implicit planning. We formalize, implement, and evaluate an agentic LLM workflow that leverages procedural knowledge in the form of a hierarchical task network (HTN). Empirical results of our implementation show that hand-coded HTNs can dramatically improve LLM performance on agentic tasks, and using HTNs can boost a 20b or 70b parameter LLM to outperform a much larger 120b parameter LLM baseline. Furthermore, LLM-created HTNs improve overall performance, though less so. The results suggest that leveraging expertise (from humans, documents, or LLMs) to curate procedural knowledge will become another important tool for improving LLM workflows.
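
To make the idea of a hierarchical task network concrete, here is a deliberately simplified sketch of HTN-style decomposition. It is not the workflow from the paper: the `METHODS` table, the task names, and the `decompose` function are hypothetical, each compound task has exactly one method with no preconditions, and the abstract does not describe how the HTN is coupled to the LLM.

```python
from typing import Dict, List

# Hypothetical method library: each compound task maps to an ordered list of
# subtasks. Any task without an entry is treated as primitive.
METHODS: Dict[str, List[str]] = {
    "make_tea": ["boil_water", "steep_tea"],
    "boil_water": ["fill_kettle", "switch_on_kettle"],
}


def decompose(task: str, methods: Dict[str, List[str]]) -> List[str]:
    """Expand a task depth-first into an ordered list of primitive steps."""
    if task not in methods:
        # Primitive task: in an agentic workflow this might be a single
        # LLM call or tool call (an assumption, not taken from the paper).
        return [task]
    plan: List[str] = []
    for subtask in methods[task]:
        plan.extend(decompose(subtask, methods))
    return plan


plan = decompose("make_tea", METHODS)
print(plan)
```

The procedural knowledge lives in the method table, so the executor only ever has to handle one small primitive step at a time instead of planning the whole task implicitly.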

[AI-107] N-ReLU: Zero-Mean Stochastic Extension of ReLU

Link: https://arxiv.org/abs/2511.07559
Authors: Md Motaleb Hossen Manik, Md Zabirul Islam, Ge Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

[AI-108] FedRW: Efficient Privacy-Preserving Data Reweighting for Enhancing Federated Learning of Language Models NEURIPS2025

Link: https://arxiv.org/abs/2511.07505
Authors: Pukang Ye, Junwei Luo, Xiaolei Dong, Yunbo Yang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted at NeurIPS 2025. Code is available at this https URL

[AI-109] Biologically-Informed Hybrid Membership Inference Attacks on Generative Genomic Models

Link: https://arxiv.org/abs/2511.07503
Authors: Asia Belfiore, Jonathan Passerat-Palmbach, Dmitrii Usynin
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

[AI-110] Enabling Automatic Self-Talk Detection via Earables

Link: https://arxiv.org/abs/2511.07493
Authors: Euihyeok Lee, Seonghyeon Kim, SangHun Im, Heung-Seon Oh, Seungwoo Kang
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

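[AI-111] When Are Learning Biases Equivalent? A Unifying Framework for Fairness, Robustness, and Distribution Shift

[Quick Read]: This paper addresses the fact that several bias mechanisms in machine learning systems (fairness violations, brittleness to spurious correlations, degraded performance on minority subpopulations, and so on) have long been studied in isolation by separate research communities. The core challenge is the lack of a unified theoretical framework describing when these different sources of bias have equivalent effects on model performance. The key to the solution is a unifying framework based on information-theoretic measures that formalizes each bias as a violation of conditional independence and derives a quantitative equivalence between spurious-correlation strength and subpopulation imbalance: under feature overlap assumptions, a spurious correlation of strength α causes the same worst-group accuracy degradation as a subpopulation imbalance ratio r ≈ (1+α)/(1−α). This prediction is validated on six datasets and three architectures, with errors within 3% in worst-group accuracy, enabling debiasing methods to be transferred and jointly optimized across problem domains.

Link: https://arxiv.org/abs/2511.07485
Authors: Sushant Mehta
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: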

Abstract:Machine learning systems exhibit diverse failure modes (unfairness toward protected groups, brittleness to spurious correlations, poor performance on minority sub-populations) that are typically studied in isolation by distinct research communities. We propose a unifying theoretical framework that characterizes when different bias mechanisms produce quantitatively equivalent effects on model performance. By formalizing biases as violations of conditional independence through information-theoretic measures, we prove formal equivalence conditions relating spurious correlations, subpopulation shift, class imbalance, and fairness violations. Our theory predicts that a spurious correlation of strength α produces the same worst-group accuracy degradation as a sub-population imbalance ratio r ≈ (1+α)/(1−α) under feature overlap assumptions. Empirical validation on six datasets and three architectures confirms that the predicted equivalences hold to within 3% in worst-group accuracy, enabling the principled transfer of debiasing methods across problem domains. This work bridges the literature on fairness, robustness, and distribution shifts under a common perspective.
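
The stated equivalence r ≈ (1+α)/(1−α) is easy to tabulate. The sketch below evaluates it for a few strengths and also inverts it algebraically to α = (r−1)/(r+1). The function names and the restriction of α to the interval [0, 1) are assumptions made for illustration; the abstract does not state the admissible range.

```python
def equivalent_imbalance_ratio(alpha: float) -> float:
    """Imbalance ratio r = (1 + alpha) / (1 - alpha) for correlation strength alpha."""
    # Assumption: alpha lies in [0, 1); the formula diverges as alpha -> 1.
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    return (1.0 + alpha) / (1.0 - alpha)


def equivalent_alpha(r: float) -> float:
    """Inverse mapping, obtained by solving r = (1 + alpha) / (1 - alpha) for alpha."""
    if r < 1.0:
        raise ValueError("r must be at least 1")
    return (r - 1.0) / (r + 1.0)


for alpha in (0.0, 0.5, 0.8, 0.9):
    print(f"alpha = {alpha:.1f} -> r = {equivalent_imbalance_ratio(alpha):.1f}")
```

The mapping is strongly nonlinear: a strength of 0.5 corresponds to a 3:1 imbalance, while 0.9 corresponds to roughly 19:1.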

[AI-112] Beyond Correctness: Confidence-Aware Reward Modeling for Enhancing Large Language Model Reasoning

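[Quick Read]: This paper addresses the poor-quality reasoning chains, or inconsistencies between the reasoning process and the final answer, that arise when small models are trained with rule-based-reward reinforcement learning (RL). Conventional methods judge correctness by rules, so a model can be rewarded for a correct answer it produced by chance, which misdirects training. The key to the solution is a confidence-based reward model that penalizes not only incorrect answers but also correct answers given with low confidence, steering the model toward more robust, logically consistent STEM reasoning chains.

Link: https://arxiv.org/abs/2511.07483
Authors: Qianxi He, Qingyu Ren, Shanzhe Lei, Xuhong Wang, Yingchun Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: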

Abstract:Recent advancements in large language models (LLMs) have shifted the post-training paradigm from traditional instruction tuning and human preference alignment toward reinforcement learning (RL) focused on reasoning capabilities. However, numerous technical reports indicate that purely rule-based reward RL frequently results in poor-quality reasoning chains or inconsistencies between reasoning processes and final answers, particularly when the base model is of smaller scale. During the RL exploration process, models might employ low-quality reasoning chains due to the lack of knowledge, occasionally producing correct answers randomly and receiving rewards based on established rule-based judges. This constrains the potential for resource-limited organizations to conduct direct reinforcement learning training on smaller-scale models. We propose a novel confidence-based reward model tailored for enhancing STEM reasoning capabilities. Unlike conventional approaches, our model penalizes not only incorrect answers but also low-confidence correct responses, thereby promoting more robust and logically consistent reasoning. We validate the effectiveness of our approach through static evaluations, Best-of-N inference tests, and PPO-based RL training. Our method outperforms several state-of-the-art open-source reward models across diverse STEM benchmarks. We release our code and model at this https URL.
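
The behavior this abstract asks of the reward signal (penalize wrong answers, and also penalize correct answers reached with low confidence) can be illustrated with a hand-written rule. The paper trains a reward model and the abstract gives no formula, so everything below is a hypothetical illustration: the function name, the `min_confidence` threshold, and the penalty values are assumptions.

```python
def confidence_aware_reward(is_correct: bool, confidence: float,
                            min_confidence: float = 0.7,
                            low_confidence_penalty: float = 0.5) -> float:
    """Toy reward: +1 for confident correct answers, negative otherwise.

    All thresholds and magnitudes are illustrative assumptions.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if not is_correct:
        return -1.0  # incorrect answers are always penalized
    if confidence < min_confidence:
        # Correct but possibly a lucky guess: penalize instead of rewarding.
        return -low_confidence_penalty
    return 1.0


print(confidence_aware_reward(True, 0.9))   # confident and correct
print(confidence_aware_reward(True, 0.3))   # correct but low confidence
print(confidence_aware_reward(False, 0.9))  # incorrect
```

A purely rule-based judge would return the same reward for the first two cases; separating them is what discourages answers that are right only by chance.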

[AI-113] KG-DF: A Black-box Defense Framework against Jailbreak Attacks Based on Knowledge Graphs

Link: https://arxiv.org/abs/2511.07480
Authors: Shuyuan Liu, Jiawei Chen, Xiao Yang, Hang Su, Zhaoxia Yin
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

[AI-114] Dynamic Stability of LLM-Generated Code

Link: https://arxiv.org/abs/2511.07463
Authors: Prateek Rajput, Abdoul Aziz Bonkoungou, Yewei Song, Abdoul Kader Kabore, Iyiola E. Olatunji, Jacques Klein, Tegewende Bissyande
Affiliations: Unknown
Categories: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 10 pages, 8 figures

[AI-115] Optimizing Classification of Infrequent Labels by Reducing Variability in Label Distribution

Link: https://arxiv.org/abs/2511.07459
Authors: Ashutosh Agarwal
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted and presented at 6th International Conference on Emerging research in electronics, computer science and technology (ICERECT)

[AI-116] Exploring the Psychometric Validity of AI-Generated Student Responses: A Study on Virtual Personas Learning Motivation

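[Quick Read]: This paper addresses how to use large language models (LLMs) to generate virtual student response data that meets the standards of educational measurement, in order to support data generation and validation in educational assessment research. The key to the solution is to use GPT-4o to construct 2000 virtual student personas and have each complete the Academic Motivation Scale (AMS). Exploratory factor analysis (EFA), confirmatory factor analysis (CFA), and cluster analysis show that the generated data reproduces the characteristics of real student samples in both structural validity and group heterogeneity, indicating that LLMs can simulate valid student responses for educational measurement.

Link: https://arxiv.org/abs/2511.07451
Authors: Huanxiao Wang
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: The paper has been accepted as proceedings of Artificial Intelligence in Measurement and Education Conference (AIME-Con) (2025)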

Abstract:This study explores whether large language models (LLMs) can simulate valid student responses for educational measurement. Using GPT-4o, 2000 virtual student personas were generated. Each persona completed the Academic Motivation Scale (AMS). Factor analyses (EFA and CFA) and clustering showed GPT-4o reproduced the AMS structure and distinct motivational subgroups.

[AI-117] Pinching Antennas Meet AI in Next-Generation Wireless Networks

Link: https://arxiv.org/abs/2511.07442
Authors: Fang Fang, Zhiguo Ding, Victor C. M. Leung, Lajos Hanzo
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments:

[AI-118] AudAgent: Automated Auditing of Privacy Policy Compliance in AI Agents

Link: https://arxiv.org/abs/2511.07441
Authors: Ye Zheng, Yidan Hu
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

[AI-119] Agentic Educational Content Generation for African Languages on Edge Devices

Link: https://arxiv.org/abs/2511.07437
Authors: Ravi Gupta, Guneet Bhatia
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

[AI-120] Analysing Environmental Efficiency in AI for X-Ray Diagnosis

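[Quick Read]: This paper addresses how to balance model performance against environmental impact in Covid-19 diagnosis from chest X-rays. The core challenge is to assess how generative AI and discriminative models differ in diagnostic accuracy and carbon footprint, and to explore more sustainable ways of deploying medical AI. The key to the approach is to integrate small discriminative models (such as Covid-Net) and lightweight large language models (LLMs) into a Mendix application, use the discriminative models as knowledge bases that make LLM output more trustworthy, and systematically compare the accuracy and environmental impact of 14 model configurations. The results show that small models substantially reduce carbon emissions but are biased in their diagnoses and produce low-confidence probabilities, while restricting LLMs to probabilistic output alone degrades performance, highlighting the limits of generative tools for classification tasks. The best option was the Covid-Net model, which achieved the highest accuracy (95.5%) while emitting 99.9% less carbon than a large LLM, confirming the efficiency and sustainability advantages of specialized discriminative models in medical settings.

Link: https://arxiv.org/abs/2511.07436
Authors: Liam Kearns
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 17 pages, 8 figures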

Abstract:The integration of AI tools into medical applications has aimed to improve the efficiency of diagnosis. The emergence of large language models (LLMs), such as ChatGPT and Claude, has expanded this integration even further. Because of LLM versatility and ease of use through APIs, these larger models are often utilised even though smaller, custom models can be used instead. In this paper, LLMs and small discriminative models are integrated into a Mendix application to detect Covid-19 in chest X-rays. These discriminative models are also used to provide knowledge bases for LLMs to improve accuracy. This provides a benchmark study of 14 different model configurations for comparison of accuracy and environmental impact. The findings indicated that while smaller models reduced the carbon footprint of the application, the output was biased towards a positive diagnosis and the output probabilities were lacking confidence. Meanwhile, restricting LLMs to only give probabilistic output caused poor performance in both accuracy and carbon footprint, demonstrating the risk of using LLMs as a universal AI solution. While using the smaller LLM GPT-4.1-Nano reduced the carbon footprint by 94.2% compared to the larger models, this was still disproportionate to the discriminative models; the most efficient solution was the Covid-Net model. Although it had a larger carbon footprint than other small models, its carbon footprint was 99.9% less than when using GPT-4.5-Preview, whilst achieving an accuracy of 95.5%, the highest of all models examined. This paper contributes to knowledge by comparing generative and discriminative models in Covid-19 detection as well as highlighting the environmental risk of using generative tools for classification tasks.

[AI-121] DynaKV: Enabling Accurate and Efficient Long-Sequence LLM Decoding on Smartphones

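[Quick Read]: This paper addresses the limited DRAM capacity that constrains long-sequence decoding of large language models (LLMs) on smartphones, where the memory footprint of the key-value cache (KVCache) grows linearly with sequence length. Existing retrieval-based methods relieve memory pressure by offloading the KVCache to flash, but because the KVCache distribution shifts during decoding, static or local cluster-update strategies tend to miss important entries or load redundant ones, and performance degrades further under smartphone limits on bandwidth, IOPS, and memory capacity. The key to the solution is DynaKV, the first adaptive KVCache management approach for long-sequence decoding on mobile devices. Its core innovations are: (1) Migration-Free Cluster Adaptation, which splits clusters dynamically during retrieval without additional transfers; (2) Continuity-Centric Flash Management, which co-locates correlated entries and clusters and uses a dual-head layout to make updates efficient; and (3) Memory-Efficient Cache Design, which virtualizes cache space across DRAM and flash and extends replacement policies to match cluster-level access patterns. Experiments show average gains of 1.38× in accuracy and 1.47× in speed over the best existing solutions, and the approach generalizes to other long-context workloads and multi-tier memory hierarchies.

Link: https://arxiv.org/abs/2511.07427
Authors: Tuowei Wang, Minxing Huang, Fengzu Li, Ligeng Chen, Jinrui Zhang, Ju Ren
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: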

Abstract:As the demand for human-like reasoning, multi-turn dialogues, and long-form responses grows, large language models (LLMs) are increasingly expected to support efficient and effective long-sequence decoding. However, due to limited DRAM capacity, long-sequence LLM decoding on smartphones is constrained by the key-value cache (KVCache), whose memory footprint increases linearly with sequence length. Retrieval-based methods mitigate DRAM pressure by offloading KVCache to flash and retrieving query-relevant entries through cluster-based indexing. Unfortunately, as decoding progresses, KVCache distribution shifts render static or local cluster updates progressively misaligned, excluding essential entries or fetching redundant ones. These issues are further exacerbated by smartphone-specific limitations in bandwidth, IOPS, and memory capacity. We propose DynaKV, the first adaptive KVCache management approach that jointly addresses accuracy and efficiency for long-sequence decoding on smartphones. DynaKV integrates three key techniques: (1) Migration-Free Cluster Adaptation, which adaptively splits clusters during retrieval without incurring additional transfers; (2) Continuity-Centric Flash Management, which co-locates correlated entries and clusters and employs a dual-head layout for efficient updates; and (3) Memory-Efficient Cache Design, which virtualizes cache space across DRAM and flash and extends replacement policies to align with cluster-level access patterns. Evaluations demonstrate that DynaKV improves retrieval accuracy and reduces end-to-end latency compared to state-of-the-art solutions, achieving average gains of 1.38× in accuracy and 1.47× in speed. Furthermore, the insights of DynaKV naturally extend to other long-context workloads and multi-tier memory hierarchies, underscoring its broader applicability.
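
The linear growth of the KVCache with sequence length follows from the usual sizing rule for a decoder-only transformer: every token stores one key vector and one value vector per layer. The sketch below applies that rule to a single sequence. The model dimensions used (32 layers, 8 KV heads, head dimension 128, 16-bit values) are hypothetical and are not taken from the paper.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache for one sequence, in bytes."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model configuration, chosen only for illustration.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2)

for seq_len in (4096, 8192, 32768):
    gib = kv_cache_bytes(seq_len, **config) / 2**30
    print(f"{seq_len:>6} tokens -> {gib:.2f} GiB")
```

With these assumed dimensions each token costs 128 KiB, so a 32K-token context already needs 4 GiB, which helps explain why offloading part of the cache to flash is attractive on a phone.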

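[AI-122] An Evaluation of LLMs Inference on Popular Single-board Computers

[Quick Read]: This paper addresses how to deploy and run quantized open-source large language models (LLMs) efficiently on resource-constrained edge devices such as single-board computers (SBCs), to meet the need for local, privacy-preserving inference. The key to its approach is a systematic evaluation of 25 quantized models on three SBCs (Raspberry Pi 4, Raspberry Pi 5, and Orange Pi 5 Pro), comparing two inference runtimes (Ollama and Llamafile) and using generation throughput, memory usage, and power consumption to identify architecture-level bottlenecks and runtime trade-offs. It concludes with practical deployment recommendations for edge scenarios, providing empirical evidence for efficient LLM inference on low-cost hardware.

Link: https://arxiv.org/abs/2511.07425
Authors: Tung (Thomas) Nguyen, Tuyen Nguyen
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 9 pages, 3 figures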

Abstract:The growing demand for on-device large language model (LLM) inference is driving interest in deploying lightweight, cost-effective AI solutions on edge hardware. Single-board computers (SBCs) such as the Raspberry Pi and Orange Pi offer a promising platform for localized, privacy-preserving inference, but remain underexplored in the context of LLM workloads. In this work, we benchmark the performance of 25 quantized open-source LLMs across three SBCs (Raspberry Pi 4, Raspberry Pi 5, and Orange Pi 5 Pro) using two inference runtimes: Ollama and Llamafile. We evaluate generation throughput, memory usage, and power consumption under varying CPU configurations, using multiple prompt types to simulate realistic workloads. Our results show that SBCs can reliably support models up to 1.5B parameters, with Llamafile achieving up to 4x higher throughput and 30-40% lower power usage than Ollama. We identify architecture-specific bottlenecks, highlight runtime-level trade-offs, and provide practical deployment recommendations. This study offers the first broad evaluation of LLM inference on SBCs, bridging the gap between high-performance language models and affordable edge computing.

[AI-123] Synera: Synergistic LLM Serving across Device and Cloud at Scale

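[Quick Read]: This paper addresses the performance bottlenecks of deploying large language models (LLMs) on mobile devices, including degraded generation quality and increased latency. Existing approaches either rely on cloud offloading, which is limited by communication bandwidth, or use on-device Small Language Models (SLMs), which sacrifice generation quality to fit device resource constraints. To overcome these limitations, the paper proposes Synera, a device-cloud synergistic LLM serving system built on an efficient SLM-LLM synergy mechanism. Its key designs are communication-efficient selective offloading, stall-free parallel inference, and scalable cloud batching, which together significantly improve generation quality (1.20-5.47×) at comparable latency and lower cloud serving cost (8.2-16.5%).

Link: https://arxiv.org/abs/2511.07423
Authors: Genglin Wang, Liekang Zeng, Bufang Yang, Kaiwei Liu, Guoliang Xing, Chumin Sun, Li Zhou, Jie Sun, Zhenyu Yan
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: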

Abstract:Large Language Models (LLMs) are becoming key components in various mobile operating systems, driving smart applications like interactive chatbots and personal assistants. While bringing enhanced intelligence to mobile ends, their deployment suffers from a set of performance challenges, especially the generation quality degradation and prolonged latency. Prior works have mainly relied on solutions of cloud offloading or on-device Small Language Models (SLMs). However, the former is usually limited by the communication bottleneck, and the latter sacrifices generation quality due to resource constraints. To mitigate these limitations, this paper proposes Synera, a device-cloud synergistic LLM serving system that applies an efficient SLM-LLM synergistic mechanism. Through empirical studies on LLM’s unique computing characteristics, Synera identifies a set of underexplored optimization opportunities in device-cloud synergistic LLM inference, including offloading decisions, pipeline stalls, and batching bottlenecks. To translate them into enhanced performance, Synera introduces tailored designs of communication-efficient selective offloading, stall-free parallel inference, and scalable cloud batching. Extensive evaluations with real-world testbeds show that Synera enables 1.20-5.47x better generation quality against competitive baselines with on-par latency performance. Compared with existing cloud serving, Synera achieves 8.2-16.5% lower cloud serving cost on various benchmarks.

[AI-124] Feature Importance Guided Random Forest Learning with Simulated Annealing Based Hyperparameter Tuning

Link: https://arxiv.org/abs/2511.00133
Authors: Kowshik Balasubramanian, Andre Williams, Ismail Butun
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Theory (cs.IT)
Comments: 10 pages, 2 figures, 3 tables, submitted to IEEE Intelligent Systems journal

[AI-125] Privacy-Preserving Personalization in Education: A Federated Recommender System for Student Performance Prediction

Link: https://arxiv.org/abs/2509.10516
Authors: Rodrigo Tertulino, Ricardo Almeida
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Logic in Computer Science (cs.LO)
Comments:

[AI-126] Evaluating Federated Learning for At-Risk Student Prediction: A Comparative Analysis of Model Complexity and Data Balancing

Link: https://arxiv.org/abs/2508.18316
Authors: Rodrigo Tertulino, Ricardo Almeida
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Logic in Computer Science (cs.LO)
Comments: This article has been prepared to be submitted to the Fundamenta Informaticae Journal

[AI-127] Toward Adaptive BCIs: Enhancing Decoding Stability via User State-Aware EEG Filtering

Link: https://arxiv.org/abs/2511.07891
Authors: Yeon-Woo Choi, Hye-Bin Shin, Dan Li
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: 4 pages, 3 figures, conference

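[AI-128] Benchmarking Simulacra AI's Quantum Accurate Synthetic Data Generation for Chemical Sciences

[Quick Read]: This paper addresses the high cost of generating high-quality, large-scale ab-initio datasets, especially in settings such as drug discovery, where traditional methods such as CCSD (coupled-cluster singles and doubles) and standard variational Monte Carlo (VMC) algorithms are a significant bottleneck for data generation efficiency. The key to the solution is a novel, proprietary sampling scheme, Replica Exchange with Langevin Adaptive eXploration (RELAX). Combined with Simulacra's Large Wavefunction Models (LWM) and advanced VMC sampling algorithms, it reduces data generation cost by 15-50× at unchanged energy accuracy, and by 2-3× relative to traditional CCSD methods, making large-scale ab-initio datasets affordable.

Link: https://arxiv.org/abs/2511.07433
Authors: Fabio Falcioni, Elena Orlova, Timothy Heightman, Philip Mantrov, Aleksei Ustimenko
Affiliations: Unknown
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
Comments: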

Abstract:In this work, we benchmark Simulacra’s synthetic data generation pipeline against a state-of-the-art Microsoft pipeline on a dataset of small to large systems. By analyzing the energy quality, autocorrelation times, and effective sample size, our findings show that Simulacra’s Large Wavefunction Models (LWM) pipeline, paired with state-of-the-art Variational Monte Carlo (VMC) sampling algorithms, reduces data generation costs by 15-50x, while maintaining parity in energy accuracy, and 2-3x compared to traditional CCSD methods on the scale of amino acids. This enables the creation of affordable, large-scale ab-initio datasets, accelerating AI-driven optimization and discovery in the pharmaceutical industry and beyond. Our improvements are based on a novel and proprietary sampling scheme called Replica Exchange with Langevin Adaptive eXploration (RELAX).

[AI-129] Advancing mathematics research with large language models

Link: https://arxiv.org/abs/2511.07420
Authors: Lisa Carbone
Affiliation: Unknown
Categories: History and Overview (math.HO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Group Theory (math.GR); Logic (math.LO)
Comments:

[AI-130] Hybrid Bit and Semantic Communications

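[Quick read]: This paper addresses a practical deployment problem for semantic communication: existing methods that map semantic content directly to transmission symbols are hard to make compatible with existing digital communication systems, which limits their adoption. The key to the solution is a hybrid bit and semantic communication system (HybridBSC) that inserts encoded semantic information into the bit information, so that both are transmitted over a conventional digital communication system using the same spectrum resources. The authors also design a semantic insertion and extraction scheme so that the semantic information can be embedded and recovered, and they validate the strategy experimentally on a Pluto software-defined radio (SDR) platform over a real wireless channel.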

Link: https://arxiv.org/abs/2404.19477
Authors: Kaiwen Yu,Renhe Fan,Gang Wu,Zhijin Qin
Affiliation: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments:

Abstract:Semantic communication technology is regarded as a method surpassing the Shannon limit of bit transmission, capable of effectively enhancing transmission efficiency. However, current approaches that directly map content to transmission symbols are challenging to deploy in practice, imposing significant limitations on the development of semantic communication. To address this challenge, we propose a hybrid bit and semantic communication system, named HybridBSC, in which encoded semantic information is inserted into bit information for transmission via conventional digital communication systems utilizing the same spectrum resources. The system can be easily deployed using existing communication architecture to achieve bit and semantic information transmission. In particular, we design a semantic insertion and extraction scheme to implement this strategy. Furthermore, we conduct experimental validation based on the Pluto-based software-defined radio (SDR) platform in a real wireless channel, demonstrating that the proposed strategy can simultaneously transmit semantic and bit information.

Machine Learning

[LG-0] SeFA-Policy: Fast and Accurate Visuomotor Policy Learning with Selective Flow Alignment

Link: https://arxiv.org/abs/2511.08583
Authors: Rong Xue,Jiageng Mao,Mingtong Zhang,Yue Wang
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Abstract:Developing efficient and accurate visuomotor policies poses a central challenge in robotic imitation learning. While recent rectified flow approaches have advanced visuomotor policy learning, they suffer from a key limitation: After iterative distillation, generated actions may deviate from the ground-truth actions corresponding to the current visual observation, leading to accumulated error as the reflow process repeats and unstable task execution. We present Selective Flow Alignment (SeFA), an efficient and accurate visuomotor policy learning framework. SeFA resolves this challenge by a selective flow alignment strategy, which leverages expert demonstrations to selectively correct generated actions and restore consistency with observations, while preserving multimodality. This design introduces a consistency correction mechanism that ensures generated actions remain observation-aligned without sacrificing the efficiency of one-step flow inference. Extensive experiments across both simulated and real-world manipulation tasks show that SeFA Policy surpasses state-of-the-art diffusion-based and flow-based policies, achieving superior accuracy and robustness while reducing inference latency by over 98%. By unifying rectified flow efficiency with observation-consistent action generation, SeFA provides a scalable and dependable solution for real-time visuomotor policy learning. Code is available on this https URL.

[LG-1] FMMI: Flow Matching Mutual Information Estimation

Link: https://arxiv.org/abs/2511.08552
Authors: Ivan Butakov,Alexander Semenenko,Alexey Frolov,Ivan Oseledets
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments: 11 pages

Abstract:We introduce a novel Mutual Information (MI) estimator that fundamentally reframes the discriminative approach. Instead of training a classifier to discriminate between joint and marginal distributions, we learn a normalizing flow that transforms one into the other. This technique produces a computationally efficient and precise MI estimate that scales well to high dimensions and across a wide range of ground-truth MI values.
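For context on the "discriminative approach" that the abstract reframes, the standard definitions can be written out as follows. This is textbook background rather than the paper's estimator: mutual information is the KL divergence between the joint distribution and the product of marginals, and a classifier trained on balanced samples from the two recovers their log density ratio through its logit. The notation ($p_{XY}$, $D^*$) is chosen here for illustration, and the flow-based formula used by FMMI is not reproduced.

```latex
\begin{align}
I(X;Y) &= D_{\mathrm{KL}}\!\left(p_{XY} \,\middle\|\, p_X \otimes p_Y\right)
        = \mathbb{E}_{(x,y)\sim p_{XY}}\!\left[\log \frac{p_{XY}(x,y)}{p_X(x)\,p_Y(y)}\right], \\
D^*(x,y) &= \frac{p_{XY}(x,y)}{p_{XY}(x,y) + p_X(x)\,p_Y(y)}
\quad\Longrightarrow\quad
\log \frac{p_{XY}(x,y)}{p_X(x)\,p_Y(y)} = \log \frac{D^*(x,y)}{1 - D^*(x,y)} .
\end{align}
```

Here $D^*$ denotes the Bayes-optimal classifier separating joint samples from product-of-marginals samples with equal class priors, so averaging its logit over joint samples gives an MI estimate.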

[LG-2] Clustering Guided Residual Neural Networks for Multi-Tx Localization in Molecular Communications

Link: https://arxiv.org/abs/2511.08513
Authors: Ali Sonmez,Erencem Ozbey,Efe Feyzi Mantaroglu,H. Birkan Yilmaz
Categories: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
Comments: 5 pages, 4 figures, 3 tables

Abstract:Transmitter localization in Molecular Communication via Diffusion is a critical topic with many applications. However, accurate localization of multiple transmitters is a challenging problem due to the stochastic nature of diffusion and overlapping molecule distributions at the receiver surface. To address these issues, we introduce clustering-based centroid correction methods that enhance robustness against density variations and outliers. In addition, we propose two clustering-guided Residual Neural Networks, namely AngleNN for direction refinement and SizeNN for cluster size estimation. Experimental results show that both approaches provide significant improvements, reducing localization error by between 69% (2-Tx) and 43% (4-Tx) compared to K-means.

[LG-3] Toward Autonomous and Efficient Cybersecurity: A Multi-Objective AutoML-based Intrusion Detection System

Link: https://arxiv.org/abs/2511.08491
Authors: Li Yang,Abdallah Shami
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: Accepted and To Appear in IEEE Transactions on Machine Learning in Communications and Networking (TMLCN); Code is available at GitHub link: this https URL

Abstract:With increasingly sophisticated cybersecurity threats and rising demand for network automation, autonomous cybersecurity mechanisms are becoming critical for securing modern networks. The rapid expansion of Internet of Things (IoT) systems amplifies these challenges, as resource-constrained IoT devices demand scalable and efficient security solutions. In this work, an innovative Intrusion Detection System (IDS) utilizing Automated Machine Learning (AutoML) and Multi-Objective Optimization (MOO) is proposed for autonomous and optimized cyber-attack detection in modern networking environments. The proposed IDS framework integrates two primary innovative techniques: Optimized Importance and Percentage-based Automated Feature Selection (OIP-AutoFS) and Optimized Performance, Confidence, and Efficiency-based Combined Algorithm Selection and Hyperparameter Optimization (OPCE-CASH). These components optimize feature selection and model learning processes to strike a balance between intrusion detection effectiveness and computational efficiency. This work presents the first IDS framework that integrates all four AutoML stages and employs multi-objective optimization to jointly optimize detection effectiveness, efficiency, and confidence for deployment in resource-constrained systems. Experimental evaluations over two benchmark cybersecurity datasets demonstrate that the proposed MOO-AutoML IDS outperforms state-of-the-art IDSs, establishing a new benchmark for autonomous, efficient, and optimized security for networks. Designed to support IoT and edge environments with resource constraints, the proposed framework is applicable to a variety of autonomous cybersecurity applications across diverse networked environments.

[LG-4] One Model for All: Universal Pre-training for EEG based Emotion Recognition across Heterogeneous Datasets and Paradigms

Link: https://arxiv.org/abs/2511.08444
Authors: Xiang Li,You Li,Yazhou Zhang
Categories: Machine Learning (cs.LG)
Comments:

Abstract:EEG-based emotion recognition is hampered by profound dataset heterogeneity (channel/subject variability), hindering generalizable models. Existing approaches struggle to transfer knowledge effectively. We propose ‘One Model for All’, a universal pre-training framework for EEG analysis across disparate datasets. Our paradigm decouples learning into two stages: (1) Univariate pre-training via self-supervised contrastive learning on individual channels, enabled by a Unified Channel Schema (UCS) that leverages the channel union (e.g., SEED-62ch, DEAP-32ch); (2) Multivariate fine-tuning with a novel ‘ART’ (Adaptive Resampling Transformer) and ‘GAT’ (Graph Attention Network) architecture to capture complex spatio-temporal dependencies. Experiments show universal pre-training is an essential stabilizer, preventing collapse on SEED (vs. scratch) and yielding substantial gains on DEAP (+7.65%) and DREAMER (+3.55%). Our framework achieves new SOTA performance on all within-subject benchmarks: SEED (99.27%), DEAP (93.69%), and DREAMER (93.93%). We also show SOTA cross-dataset transfer, achieving 94.08% (intersection) and 93.05% (UCS) on the unseen DREAMER dataset, with the former surpassing the within-domain pre-training benchmark. Ablation studies validate our architecture: the GAT module is critical, yielding a +22.19% gain over GCN on the high-noise DEAP dataset, and its removal causes a catastrophic -16.44% performance drop. This work paves the way for more universal, scalable, and effective pre-trained models for diverse EEG analysis tasks.

[LG-5] Coherence Mechanisms for Provable Self-Improvement

Link: https://arxiv.org/abs/2511.08440
Authors: Mehryar Mohri,Jon Schneider,Yifan Wu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Self-improvement is a critical capability for large language models and other intelligent systems, enabling them to refine their behavior and internal consistency without external supervision. Despite its importance, prior approaches largely rely on empirical heuristics and lack formal guarantees. In this paper, we propose a principled framework for self-improvement based on the concept of coherence, which requires that a model’s outputs remain consistent under task-preserving transformations of the input. We formalize this concept using projection-based mechanisms that update a baseline model to be coherent while remaining as close as possible to its original behavior. We provide rigorous theoretical guarantees that these mechanisms achieve monotonic improvement, measured by a reduction in expected Bregman divergence. Our analysis is comprehensive, covering both direct and two-step projection methods, and robustly extends these guarantees to non-realizable settings, empirical (finite-sample) distributions, and relaxed coherence constraints. Furthermore, we establish a general characterization theorem, showing that any mechanism with similar provable improvement guarantees must inherently conform to a coherence-based structure. This culminates in rigidity results under the demand for universal improvement, establishing coherence as a fundamental and, in a formal sense, necessary principle for provable self-improvement.
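Since the guarantees are stated in terms of Bregman divergence and projections, the standard definitions may help the reader. The block below is general background under the usual regularity conditions (a strictly convex, differentiable generator $F$ and a closed convex set $C$); the symbols $F$, $C$ and $\Pi_C$ are notation chosen here, and whether the paper's proofs take exactly this form is an assumption.

```latex
\begin{align}
D_F(p, q) &= F(p) - F(q) - \langle \nabla F(q),\, p - q \rangle, \\
\Pi_C(q) &= \operatorname*{arg\,min}_{c \in C} D_F(c, q), \\
D_F(p, q) &\ge D_F\big(p, \Pi_C(q)\big) + D_F\big(\Pi_C(q), q\big) \quad \text{for all } p \in C .
\end{align}
```

The last line is the generalized Pythagorean inequality: if the target $p$ lies in the constraint set $C$, projecting a baseline $q$ onto $C$ cannot increase its divergence from $p$, which is one standard route to a monotonic-improvement statement. With $F(x) = \tfrac{1}{2}\lVert x \rVert^2$ the divergence is half the squared Euclidean distance, and with negative entropy on the probability simplex it is the KL divergence.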

[LG-6] An update to PYRO-NN: A Python Library for Differentiable CT Operators

Link: https://arxiv.org/abs/2511.08427
Authors: Linda-Sophie Schneider,Yipeng Sun,Chengze Ye,Markus Michen,Andreas Maier
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Deep learning has brought significant advancements to X-ray Computed Tomography (CT) reconstruction, offering solutions to challenges arising from modern imaging technologies. These developments benefit from methods that combine classical reconstruction techniques with data-driven approaches. Differentiable operators play a key role in this integration by enabling end-to-end optimization and the incorporation of physical modeling within neural networks. In this work, we present an updated version of PYRO-NN, a Python-based library for differentiable CT reconstruction. The updated framework extends compatibility to PyTorch and introduces native CUDA kernel support for efficient projection and back-projection operations across parallel, fan, and cone-beam geometries. Additionally, it includes tools for simulating imaging artifacts, modeling arbitrary acquisition trajectories, and creating flexible, end-to-end trainable pipelines through a high-level Python API. Code is available at: this https URL

[LG-7] HardFlow: Hard-Constrained Sampling for Flow-Matching Models via Trajectory Optimization

Link: https://arxiv.org/abs/2511.08425
Authors: Zeyang Li,Kaveh Alim,Navid Azizan
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Abstract:Diffusion and flow-matching have emerged as powerful methodologies for generative modeling, with remarkable success in capturing complex data distributions and enabling flexible guidance at inference time. Many downstream applications, however, demand enforcing hard constraints on generated samples (for example, robot trajectories must avoid obstacles), a requirement that goes beyond simple guidance. Prevailing projection-based approaches constrain the entire sampling path to the constraint manifold, which is overly restrictive and degrades sample quality. In this paper, we introduce a novel framework that reformulates hard-constrained sampling as a trajectory optimization problem. Our key insight is to leverage numerical optimal control to steer the sampling trajectory so that constraints are satisfied precisely at the terminal time. By exploiting the underlying structure of flow-matching models and adopting techniques from model predictive control, we transform this otherwise complex constrained optimization problem into a tractable surrogate that can be solved efficiently and effectively. Furthermore, this trajectory optimization perspective offers significant flexibility beyond mere constraint satisfaction, allowing for the inclusion of integral costs to minimize distribution shift and terminal objectives to further enhance sample quality, all within a unified framework. We provide a control-theoretic analysis of our method, establishing bounds on the approximation error between our tractable surrogate and the ideal formulation. Extensive experiments across diverse domains, including robotics (planning), partial differential equations (boundary control), and vision (text-guided image editing), demonstrate that our algorithm, which we name HardFlow, substantially outperforms existing methods in both constraint satisfaction and sample quality.

[LG-8] Probabilistic Safety Guarantee for Stochastic Control Systems Using Average Reward MDPs

Link: https://arxiv.org/abs/2511.08419
Authors: Saber Omidi,Marek Petrik,Se Young Yoon,Momotaz Begum
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO); Optimization and Control (math.OC)
Comments: Submitted to the Learning for Dynamics & Control (L4DC) 2026 conference

Abstract:Safety in stochastic control systems, which are subject to random noise with a known probability distribution, aims to compute policies that satisfy predefined operational constraints with high confidence throughout the uncertain evolution of the state variables. The unpredictable evolution of state variables poses a significant challenge for meeting predefined constraints using various control methods. To address this, we present a new algorithm that computes safe policies to determine the safety level across a finite state set. This algorithm reduces the safety objective to the standard average reward Markov Decision Process (MDP) objective. This reduction enables us to use standard techniques, such as linear programs, to compute and analyze safe policies. We validate the proposed method numerically on the Double Integrator and the Inverted Pendulum systems. Results indicate that the average-reward MDP solution is more comprehensive, converges faster, and offers higher quality compared to the minimum discounted-reward solution.
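The abstract notes that reducing safety to the average-reward objective lets standard linear programs be used. For reference, the textbook occupancy-measure LP for a finite average-reward MDP (in the unichain case) is sketched below. The symbols $x(s,a)$, $r(s,a)$ and $P(s' \mid s,a)$ are generic notation, and how the paper encodes its safety objective into the reward $r$ is not stated in the abstract, so no particular encoding is assumed here.

```latex
\begin{align}
\max_{x \ge 0} \quad & \sum_{s,a} r(s,a)\, x(s,a) \\
\text{s.t.} \quad & \sum_{a} x(s',a) = \sum_{s,a} P(s' \mid s,a)\, x(s,a) \quad \text{for all } s', \\
& \sum_{s,a} x(s,a) = 1 .
\end{align}
```

Here $x(s,a)$ is the long-run fraction of time spent in state $s$ taking action $a$; the optimal value is the optimal average reward, and a policy can be read off as $\pi(a \mid s) \propto x(s,a)$ on states with positive occupancy.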

[LG-9] Physics-Informed Neural Operators for Cardiac Electrophysiology

Link: https://arxiv.org/abs/2511.08418
Authors: Hannah Lydon,Milad Kazemi,Martin Bishop,Nicola Paoletti
Categories: Machine Learning (cs.LG)
Comments: All code used in this work, including experimental results, can be found at this https URL. This work was submitted for review at the 2026 L4DC conference

Abstract:Accurately simulating systems governed by PDEs, such as voltage fields in cardiac electrophysiology (EP) modelling, remains a significant modelling challenge. Traditional numerical solvers are computationally expensive and sensitive to discretisation, while canonical deep learning methods are data-hungry and struggle with chaotic dynamics and long-term predictions. Physics-Informed Neural Networks (PINNs) mitigate some of these issues by incorporating physical constraints in the learning process, yet they remain limited by mesh resolution and long-term predictive stability. In this work, we propose a Physics-Informed Neural Operator (PINO) approach to solve PDE problems in cardiac EP. Unlike PINNs, PINO models learn mappings between function spaces, allowing them to generalise to multiple mesh resolutions and initial conditions. Our results show that PINO models can accurately reproduce cardiac EP dynamics over extended time horizons and across multiple propagation scenarios, including zero-shot evaluations on scenarios unseen during training. Additionally, our PINO models maintain high predictive quality in long roll-outs (where predictions are recursively fed back as inputs), and can scale their predictive resolution by up to 10x the training resolution. These advantages come with a significant reduction in simulation time compared to numerical PDE solvers, highlighting the potential of PINO-based approaches for efficient and scalable cardiac EP simulations.

[LG-10] ARAC: Adaptive Regularized Multi-Agent Soft Actor-Critic in Graph-Structured Adversarial Games

Link: https://arxiv.org/abs/2511.08412
Authors: Ruochuan Shi,Runyu Lu,Yuanheng Zhu,Dongbin Zhao
Categories: Machine Learning (cs.LG)
Comments:

Abstract:In graph-structured multi-agent reinforcement learning (MARL) adversarial tasks such as pursuit and confrontation, agents must coordinate under highly dynamic interactions, where sparse rewards hinder efficient policy learning. We propose Adaptive Regularized Multi-Agent Soft Actor-Critic (ARAC), which integrates an attention-based graph neural network (GNN) for modeling agent dependencies with an adaptive divergence regularization mechanism. The GNN enables expressive representation of spatial relations and state features in graph environments. Divergence regularization can serve as policy guidance to alleviate the sparse reward problem, but it may lead to suboptimal convergence when the reference policy itself is imperfect. The adaptive divergence regularization mechanism enables the framework to exploit reference policies for efficient exploration in the early stages, while gradually reducing reliance on them as training progresses to avoid inheriting their limitations. Experiments in pursuit and confrontation scenarios demonstrate that ARAC achieves faster convergence, higher final success rates, and stronger scalability across varying numbers of agents compared with MARL baselines, highlighting its effectiveness in complex graph-structured environments.

[LG-11] EMAformer: Enhancing Transformer through Embedding Armor for Time Series Forecasting AAAI2026

Link: https://arxiv.org/abs/2511.08396
Authors: Zhiwei Zhang,Xinyi Du,Xuanchi Guo,Weihao Wang,Wenjuan Han
Categories: Machine Learning (cs.LG)
Comments: 14 pages, 9 figures, 6 tables, accepted by AAAI2026

Abstract:Multivariate time series forecasting is crucial across a wide range of domains. While presenting notable progress for the Transformer architecture, iTransformer still lags behind the latest MLP-based models. We attribute this performance gap to unstable inter-channel relationships. To bridge this gap, we propose EMAformer, a simple yet effective model that enhances the Transformer with an auxiliary embedding suite, akin to armor that reinforces its ability. By introducing three key inductive biases, i.e., global stability, phase sensitivity, and cross-axis specificity, EMAformer unlocks the further potential of the Transformer architecture, achieving state-of-the-art performance on 12 real-world benchmarks and reducing forecasting errors by an average of 2.73% in MSE and 5.15% in MAE. This significantly advances the practical applicability of Transformer-based approaches for multivariate time series forecasting. The code is available on this https URL.

[LG-12] Multi-objective Hyperparameter Optimization in the Age of Deep Learning

Link: https://arxiv.org/abs/2511.08371
Authors: Soham Basu,Frank Hutter,Danny Stoll
Categories: Machine Learning (cs.LG)
Comments:

Abstract:While Deep Learning (DL) experts often have prior knowledge about which hyperparameter settings yield strong performance, only few Hyperparameter Optimization (HPO) algorithms can leverage such prior knowledge and none incorporate priors over multiple objectives. As DL practitioners often need to optimize not just one but many objectives, this is a blind spot in the algorithmic landscape of HPO. To address this shortcoming, we introduce PriMO, the first HPO algorithm that can integrate multi-objective user beliefs. We show PriMO achieves state-of-the-art performance across 8 DL benchmarks in the multi-objective and single-objective setting, clearly positioning itself as the new go-to HPO algorithm for DL practitioners.

[LG-13] From Confusion to Clarity: ProtoScore - A Framework for Evaluating Prototype-Based XAI

Link: https://arxiv.org/abs/2511.08361
Authors: Helena Monke,Benjamin Sae-Chew,Benjamin Fresz,Marco F. Huber
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The complexity and opacity of neural networks (NNs) pose significant challenges, particularly in high-stakes fields such as healthcare, finance, and law, where understanding decision-making processes is crucial. To address these issues, the field of explainable artificial intelligence (XAI) has developed various methods aimed at clarifying AI decision-making, thereby facilitating appropriate trust and validating the fairness of outcomes. Among these methods, prototype-based explanations offer a promising approach that uses representative examples to elucidate model behavior. However, a critical gap exists regarding standardized benchmarks to objectively compare prototype-based XAI methods, especially in the context of time series data. This lack of reliable benchmarks results in subjective evaluations, hindering progress in the field. We aim to establish a robust framework, ProtoScore, for assessing prototype-based XAI methods across different data types with a focus on time series data, facilitating fair and comprehensive evaluations. By integrating the Co-12 properties of Nauta et al., this framework allows for effectively comparing prototype methods against each other and against other XAI methods, ultimately assisting practitioners in selecting appropriate explanation methods while minimizing the costs associated with user studies. All code is publicly available at this https URL .

[LG-14] Revisiting Network Traffic Analysis: Compatible network flows for ML models

Link: https://arxiv.org/abs/2511.08345
Authors: João Vitorino,Daniela Pinto,Eva Maia,Ivone Amorim,Isabel Praça
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 16 pages, 12 tables, 1 figure, FPS 2025 conference

Abstract:To ensure that Machine Learning (ML) models can perform a robust detection and classification of cyberattacks, it is essential to train them with high-quality datasets with relevant features. However, it can be difficult to accurately represent the complex traffic patterns of an attack, especially in Internet-of-Things (IoT) networks. This paper studies the impact that seemingly similar features created by different network traffic flow exporters can have on the generalization and robustness of ML models. In addition to the original CSV files of the Bot-IoT, IoT-23, and CICIoT23 datasets, the raw network packets of their PCAP files were analysed with the HERA tool, generating new labelled flows and extracting consistent features for new CSV versions. To assess the usefulness of these new flows for intrusion detection, they were compared with the original versions and were used to fine-tune multiple models. Overall, the results indicate that directly analysing and preprocessing PCAP files, instead of just using the commonly available CSV files, enables the computation of more relevant features to train bagging and gradient boosting decision tree ensembles. It is important to continue improving feature extraction and feature selection processes to make different datasets more compatible and enable a trustworthy evaluation and comparison of the ML models used in cybersecurity solutions.

[LG-15] Adversarial Bias: Data Poisoning Attacks on Fairness

Link: https://arxiv.org/abs/2511.08331
Authors: Eunice Chan,Hanghang Tong
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 9 figures, shortened version in BigData 2025

Abstract:With the growing adoption of AI and machine learning systems in real-world applications, ensuring their fairness has become increasingly critical. The majority of the work in algorithmic fairness focuses on assessing and improving the fairness of machine learning systems. There is relatively little research on fairness vulnerability, i.e., how an AI system’s fairness can be intentionally compromised. In this work, we first provide a theoretical analysis demonstrating that a simple adversarial poisoning strategy is sufficient to induce maximally unfair behavior in naive Bayes classifiers. Our key idea is to strategically inject a small fraction of carefully crafted adversarial data points into the training set, biasing the model’s decision boundary to disproportionately affect a protected group while preserving generalizable performance. To illustrate the practical effectiveness of our method, we conduct experiments across several benchmark datasets and models. We find that our attack significantly outperforms existing methods in degrading fairness metrics across multiple models and datasets, often achieving substantially higher levels of unfairness with a comparable or only slightly worse impact on accuracy. Notably, our method proves effective on a wide range of models, in contrast to prior work, demonstrating a robust and potent approach to compromising the fairness of machine learning systems.

[LG-16] BDD2Seq: Enabling Scalable Reversible-Circuit Synthesis via Graph-to-Sequence Learning

Link: https://arxiv.org/abs/2511.08315
Authors: Mingkai Miao,Jianheng Tang,Guangyu Hu,Hongce Zhang
Categories: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments:

Abstract:Binary Decision Diagrams (BDDs) are instrumental in many electronic design automation (EDA) tasks thanks to their compact representation of Boolean functions. In BDD-based reversible-circuit synthesis, which is critical for quantum computing, the chosen variable ordering governs the number of BDD nodes and thus the key metrics of resource consumption, such as Quantum Cost. Because finding an optimal variable ordering for BDDs is an NP-complete problem, existing heuristics often degrade as circuit complexity grows. We introduce BDD2Seq, a graph-to-sequence framework that couples a Graph Neural Network encoder with a Pointer-Network decoder and Diverse Beam Search to predict high-quality orderings. By treating the circuit netlist as a graph, BDD2Seq learns structural dependencies that conventional heuristics overlooked, yielding smaller BDDs and faster synthesis. Extensive experiments on three public benchmarks show that BDD2Seq achieves around 1.4 times lower Quantum Cost and 3.7 times faster synthesis than modern heuristic algorithms. To the best of our knowledge, this is the first work to tackle the variable-ordering problem in BDD-based reversible-circuit synthesis with a graph-based generative model and diversity-promoting decoding.

[LG-17] Rethinking Explanation Evaluation under the Retraining Scheme

Link: https://arxiv.org/abs/2511.08281
Authors: Yi Cai,Thibaud Ardoin,Mayank Gulati,Gerhard Wunder
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Feature attribution has gained prominence as a tool for explaining model decisions, yet evaluating explanation quality remains challenging due to the absence of ground-truth explanations. To circumvent this, explanation-guided input manipulation has emerged as an indirect evaluation strategy, measuring explanation effectiveness through the impact of input modifications on model outcomes during inference. Despite the widespread use, a major concern with inference-based schemes is the distribution shift caused by such manipulations, which undermines the reliability of their assessments. The retraining-based scheme ROAR overcomes this issue by adapting the model to the altered data distribution. However, its evaluation results often contradict the theoretical foundations of widely accepted explainers. This work investigates this misalignment between empirical observations and theoretical expectations. In particular, we identify the sign issue as a key factor responsible for residual information that ultimately distorts retraining-based evaluation. Based on the analysis, we show that a straightforward reframing of the evaluation process can effectively resolve the identified issue. Building on the existing framework, we further propose novel variants that jointly structure a comprehensive perspective on explanation evaluation. These variants largely improve evaluation efficiency over the standard retraining protocol, thereby enhancing practical applicability for explainer selection and benchmarking. Following our proposed schemes, empirical results across various data scales provide deeper insights into the performance of carefully selected explainers, revealing open challenges and future directions in explainability research.

[LG-18] X-IONet: Cross-Platform Inertial Odometry Network with Dual-Stage Attention

Link: https://arxiv.org/abs/2511.08277
Authors: Dehan Shen,Changhao Chen
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Abstract:Learning-based inertial odometry has achieved remarkable progress in pedestrian navigation. However, extending these methods to quadruped robots remains challenging due to their distinct and highly dynamic motion patterns. Models that perform well on pedestrian data often experience severe degradation when deployed on legged platforms. To tackle this challenge, we introduce X-IONet, a cross-platform inertial odometry framework that operates solely using a single Inertial Measurement Unit (IMU). X-IONet incorporates a rule-based expert selection module to classify motion platforms and route IMU sequences to platform-specific expert networks. The displacement prediction network features a dual-stage attention architecture that jointly models long-range temporal dependencies and inter-axis correlations, enabling accurate motion representation. It outputs both displacement and associated uncertainty, which are further fused through an Extended Kalman Filter (EKF) for robust state estimation. Extensive experiments on public pedestrian datasets and a self-collected quadruped robot dataset demonstrate that X-IONet achieves state-of-the-art performance, reducing Absolute Trajectory Error (ATE) by 14.3% and Relative Trajectory Error (RTE) by 11.4% on pedestrian data, and by 52.8% and 41.3% on quadruped robot data. These results highlight the effectiveness of X-IONet in advancing accurate and robust inertial navigation across both human and legged robot platforms.

[LG-19] Uncertainty Calibration of Multi-Label Bird Sound Classifiers

Link: https://arxiv.org/abs/2511.08261
Authors: Raphael Schwinger, Ben McEwen, Vincent S. Kather, René Heinrich, Lukas Rauch, Sven Tomforde
Categories: Sound (cs.SD); Machine Learning (cs.LG)
Comments: Under review at ICAART 2026

Abstract:Passive acoustic monitoring enables large-scale biodiversity assessment, but reliable classification of bioacoustic sounds requires not only high accuracy but also well-calibrated uncertainty estimates to ground decision-making. In bioacoustics, calibration is challenged by overlapping vocalisations, long-tailed species distributions, and distribution shifts between training and deployment data. The calibration of multi-label deep learning classifiers within the domain of bioacoustics has not yet been assessed. We systematically benchmark the calibration of four state-of-the-art multi-label bird sound classifiers on the BirdSet benchmark, evaluating global, per-dataset, and per-class calibration using threshold-free calibration metrics (ECE, MCS) alongside discrimination metrics (cmAP). Model calibration varies significantly across datasets and classes. While Perch v2 and ConvNeXt_BS show better global calibration, results vary between datasets. Both models indicate consistent underconfidence, while AudioProtoPNet and BirdMAE are mostly overconfident. Surprisingly, calibration seems to be better for less frequent classes. Using simple post hoc calibration methods, we demonstrate a straightforward way to improve calibration. A small labelled calibration set is sufficient to significantly improve calibration with Platt scaling, while global calibration parameters suffer from dataset variability. Our findings highlight the importance of evaluating and improving uncertainty calibration in bioacoustic classifiers.
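
To make the calibration workflow in this abstract concrete, here is a minimal sketch of a binned expected calibration error (ECE) for a single class, plus Platt scaling fitted by plain gradient descent. It is not the authors' code: the function names, the bin count, the learning rate, and the step count are all assumptions chosen for illustration. In a multi-label setting one would typically apply this per class.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted mean of |observed frequency - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes to the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        confidence = sum(p for p, _ in members) / len(members)
        frequency = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(frequency - confidence)
    return ece


def fit_platt(logits, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * logit + b) by minimising the mean log loss."""
    a, b = 1.0, 0.0
    n = len(logits)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for z, y in zip(logits, labels):
            err = sigmoid(a * z + b) - y  # derivative of log loss w.r.t. the scaled logit
            grad_a += err * z
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b
```

The two scalars `a` and `b` would be fitted on a small labelled calibration set and then reused on new logits. That matches the abstract's remark that a small calibration set suffices, but whether to share the parameters across datasets is exactly the kind of choice the paper reports as sensitive.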

[LG-20] Data-Driven Discovery of Feature Groups in Clinical Time Series ML4H ALT

Link: https://arxiv.org/abs/2511.08260
Authors: Fedor Sergeev, Manuel Burger, Polina Leshetkina, Vincent Fortuin, Gunnar Rätsch, Rita Kuznetsova
Categories: Machine Learning (cs.LG)
Comments: Machine Learning for Health (ML4H) 2025 in Proceedings of Machine Learning Research 297

Abstract:Clinical time series data are critical for patient monitoring and predictive modeling. These time series are typically multivariate and often comprise hundreds of heterogeneous features from different data sources. The grouping of features based on similarity and relevance to the prediction task has been shown to enhance the performance of deep learning architectures. However, defining these groups a priori using only semantic knowledge is challenging, even for domain experts. To address this, we propose a novel method that learns feature groups by clustering weights of feature-wise embedding layers. This approach seamlessly integrates into standard supervised training and discovers the groups that directly improve downstream performance on clinically relevant tasks. We demonstrate that our method outperforms static clustering approaches on synthetic data and achieves performance comparable to expert-defined groups on real-world medical data. Moreover, the learned feature groups are clinically interpretable, enabling data-driven discovery of task-relevant relationships between variables.

[LG-21] A Unified Geometric Field Theory Framework for Transformers: From Manifold Embeddings to Kernel Modulation

Link: https://arxiv.org/abs/2511.08243
Authors: Xianshuai Shi, Jianfeng Zhu, Leibo Liu
Categories: Machine Learning (cs.LG)

Abstract:The Transformer architecture has achieved tremendous success in natural language processing, computer vision, and scientific computing through its self-attention mechanism. However, its core components, positional encoding and attention mechanisms, have lacked a unified physical or mathematical interpretation. This paper proposes a structural theoretical framework that integrates positional encoding, kernel integral operators, and attention mechanisms for in-depth theoretical investigation. We map discrete positions (such as text token indices and image pixel coordinates) to spatial functions on continuous manifolds, enabling a field-theoretic interpretation of Transformer layers as kernel-modulated operators acting over embedded manifolds.

[LG-22] PrefPoE: Advantage-Guided Preference Fusion for Learning Where to Explore

Link: https://arxiv.org/abs/2511.08241
Authors: Zhihao Lin, Lin Wu, Zhen Tian, Jianglin Lan
Categories: Machine Learning (cs.LG)

Abstract:Exploration in reinforcement learning remains a critical challenge, as naive entropy maximization often results in high variance and inefficient policy updates. We introduce PrefPoE, a novel Preference-Product-of-Experts framework that performs intelligent, advantage-guided exploration via the first principled application of product-of-experts (PoE) fusion for single-task exploration-exploitation balancing. By training a preference network to concentrate probability mass on high-advantage actions and fusing it with the main policy through PoE, PrefPoE creates a soft trust region that stabilizes policy updates while maintaining targeted exploration. Across diverse control tasks spanning both continuous and discrete action spaces, PrefPoE demonstrates consistent improvements: +321% on HalfCheetah-v4 (1276 → 5375), +69% on Ant-v4, +276% on LunarLander-v2, with consistently enhanced training stability and sample efficiency. Unlike standard PPO, which suffers from entropy collapse, PrefPoE sustains adaptive exploration through its unique dynamics, thereby preventing premature convergence and enabling superior performance. Our results establish that learning where to explore through advantage-guided preferences is as crucial as learning how to act, offering a general framework for enhancing policy gradient methods across the full spectrum of reinforcement learning domains. Code and pretrained models are available in supplementary materials.
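
The abstract does not spell out the fusion rule, so the sketch below only shows the textbook product-of-experts operation it names: multiply two densities and renormalise. For categorical distributions this is an elementwise product. For two univariate Gaussians it has a closed form in which precisions add. The function names are hypothetical and nothing here should be read as PrefPoE's actual implementation.

```python
def poe_categorical(policy_probs, preference_probs):
    """Fuse two categorical distributions: elementwise product, then renormalise."""
    weights = [p * q for p, q in zip(policy_probs, preference_probs)]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("the two distributions have disjoint support")
    return [w / total for w in weights]


def poe_gaussian(mu1, var1, mu2, var2):
    """Fuse two univariate Gaussians: precisions add, means are precision-weighted."""
    precision = 1.0 / var1 + 1.0 / var2
    var = 1.0 / precision
    mu = var * (mu1 / var1 + mu2 / var2)
    return mu, var
```

Because the fused variance is never larger than either input variance, the product concentrates mass where both experts agree, which is one way to read the abstract's "soft trust region".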

[LG-23] Towards Non-Stationary Time Series Forecasting with Temporal Stabilization and Frequency Differencing

Link: https://arxiv.org/abs/2511.08229
Authors: Junkai Lu, Peng Chen, Chenjuan Guo, Yang Shu, Meng Wang, Bin Yang
Categories: Machine Learning (cs.LG)

Abstract:Time series forecasting is critical for decision-making across dynamic domains such as energy, finance, transportation, and cloud computing. However, real-world time series often exhibit non-stationarity, including temporal distribution shifts and spectral variability, which pose significant challenges for long-term time series forecasting. In this paper, we propose DTAF, a dual-branch framework that addresses non-stationarity in both the temporal and frequency domains. For the temporal domain, the Temporal Stabilizing Fusion (TFS) module employs a non-stationary mix of experts (MOE) filter to disentangle and suppress temporal non-stationary patterns while preserving long-term dependencies. For the frequency domain, the Frequency Wave Modeling (FWM) module applies frequency differencing to dynamically highlight components with significant spectral shifts. By fusing the complementary outputs of TFS and FWM, DTAF generates robust forecasts that adapt to both temporal and frequency domain non-stationarity. Extensive experiments on real-world benchmarks demonstrate that DTAF outperforms state-of-the-art baselines, yielding significant improvements in forecasting accuracy under non-stationary conditions. The code is available at this https URL.

[LG-24] Proof Minimization in Neural Network Verification

Link: https://arxiv.org/abs/2511.08198
Authors: Omri Isac, Idan Refaeli, Haoze Wu, Clark Barrett, Guy Katz
Categories: Logic in Computer Science (cs.LO); Machine Learning (cs.LG)
Comments: This is a preprint version of the paper that appears at VMCAI 2026

Abstract:The widespread adoption of deep neural networks (DNNs) requires efficient techniques for verifying their safety. DNN verifiers are complex tools, which might contain bugs that could compromise their soundness and undermine the reliability of the verification process. This concern can be mitigated using proofs: artifacts that are checkable by an external and reliable proof checker, and which attest to the correctness of the verification process. However, such proofs tend to be extremely large, limiting their use in many scenarios. In this work, we address this problem by minimizing proofs of unsatisfiability produced by DNN verifiers. We present algorithms that remove facts which were learned during the verification process, but which are unnecessary for the proof itself. Conceptually, our method analyzes the dependencies among facts used to deduce UNSAT, and removes facts that did not contribute. We then further minimize the proof by eliminating remaining unnecessary dependencies, using two alternative procedures. We implemented our algorithms on top of a proof producing DNN verifier, and evaluated them across several benchmarks. Our results show that our best-performing algorithm reduces proof size by 37%-82% and proof checking time by 30%-88%, while introducing a runtime overhead of 7%-20% to the verification process itself.
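
The first minimization step described above (drop learned facts that the UNSAT derivation never depends on) can be pictured as a reachability pass over a dependency graph. The sketch below is a generic illustration under that reading, not the authors' algorithm. The dictionary layout and the fact names are hypothetical.

```python
from collections import deque


def minimize_proof(deps, goal):
    """Keep only the facts reachable from `goal` by following dependency edges.

    `deps` maps each derived fact to the facts it was deduced from.
    """
    needed = {goal}
    queue = deque([goal])
    while queue:
        fact = queue.popleft()
        for parent in deps.get(fact, ()):
            if parent not in needed:
                needed.add(parent)
                queue.append(parent)
    # Facts that never contributed to the goal are dropped from the proof.
    return {fact: parents for fact, parents in deps.items() if fact in needed}
```

For example, with `deps = {"unsat": ["lemma3"], "lemma3": ["lemma1"], "lemma1": [], "lemma2": ["lemma1"]}`, calling `minimize_proof(deps, "unsat")` discards `lemma2`, which was learned but never used.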

[LG-25] Improving Long-Range Interactions in Graph Neural Simulators via Hamiltonian Dynamics

Link: https://arxiv.org/abs/2511.08185
Authors: Tai Hoang, Alessandro Trenta, Alessio Gravina, Niklas Freymuth, Philipp Becker, Davide Bacciu, Gerhard Neumann
Categories: Machine Learning (cs.LG)
Comments: 31 pages, including the appendix

Abstract:Learning to simulate complex physical systems from data has emerged as a promising way to overcome the limitations of traditional numerical solvers, which often require prohibitive computational costs for high-fidelity solutions. Recent Graph Neural Simulators (GNSs) accelerate simulations by learning dynamics on graph-structured data, yet often struggle to capture long-range interactions and suffer from error accumulation under autoregressive rollouts. To address these challenges, we propose Information-preserving Graph Neural Simulators (IGNS), a graph-based neural simulator built on the principles of Hamiltonian dynamics. This structure guarantees preservation of information across the graph, while extending to port-Hamiltonian systems allows the model to capture a broader class of dynamics, including non-conservative effects. IGNS further incorporates a warmup phase to initialize global context, geometric encoding to handle irregular meshes, and a multi-step training objective to reduce rollout error. To evaluate these properties systematically, we introduce new benchmarks that target long-range dependencies and challenging external forcing scenarios. Across all tasks, IGNS consistently outperforms state-of-the-art GNSs, achieving higher accuracy and stability under challenging and complex dynamical systems.

[LG-26] BIPPO: Budget-Aware Independent PPO for Energy-Efficient Federated Learning Services

Link: https://arxiv.org/abs/2511.08142
Authors: Anna Lackinger, Andrea Morichetta, Pantelis A. Frangoudis, Schahram Dustdar
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Federated Learning (FL) is a promising machine learning solution in large-scale IoT systems, guaranteeing load distribution and privacy. However, FL does not natively consider infrastructure efficiency, a critical concern for systems operating in resource-constrained environments. Several Reinforcement Learning (RL) based solutions offer improved client selection for FL; however, they do not consider infrastructure challenges, such as resource limitations and device churn. Furthermore, the training of RL methods is often not designed for practical application, as these approaches frequently do not consider generalizability and are not optimized for energy efficiency. To fill this gap, we propose BIPPO (Budget-aware Independent Proximal Policy Optimization), which is an energy-efficient multi-agent RL solution that improves performance. We evaluate BIPPO on two image classification tasks run in a highly budget-constrained setting, with FL clients training on non-IID data, a challenging context for vanilla FL. The improved sampler of BIPPO enables it to increase the mean accuracy compared to non-RL mechanisms, traditional PPO, and IPPO. In addition, BIPPO only consumes a negligible proportion of the budget, which stays consistent even if the number of clients increases. Overall, BIPPO delivers a performant, stable, scalable, and sustainable solution for client selection in IoT-FL.

[LG-27] Stuart-Landau Oscillatory Graph Neural Network

Link: https://arxiv.org/abs/2511.08094
Authors: Kaicheng Zhang, David N. Reynolds, Piero Deidda, Francesco Tudisco
Categories: Machine Learning (cs.LG)

Abstract:Oscillatory Graph Neural Networks (OGNNs) are an emerging class of physics-inspired architectures designed to mitigate oversmoothing and vanishing gradient problems in deep GNNs. In this work, we introduce the Complex-Valued Stuart-Landau Graph Neural Network (SLGNN), a novel architecture grounded in Stuart-Landau oscillator dynamics. Stuart-Landau oscillators are canonical models of limit-cycle behavior near Hopf bifurcations, which are fundamental to synchronization theory and are widely used in e.g. neuroscience for mesoscopic brain modeling. Unlike harmonic oscillators and phase-only Kuramoto models, Stuart-Landau oscillators retain both amplitude and phase dynamics, enabling rich phenomena such as amplitude regulation and multistable synchronization. The proposed SLGNN generalizes existing phase-centric Kuramoto-based OGNNs by allowing node feature amplitudes to evolve dynamically according to Stuart-Landau dynamics, with explicit tunable hyperparameters (such as the Hopf-parameter and the coupling strength) providing additional control over the interplay between feature amplitudes and network structure. We conduct extensive experiments across node classification, graph classification, and graph regression tasks, demonstrating that SLGNN outperforms existing OGNNs and establishes a novel, expressive, and theoretically grounded framework for deep oscillatory architectures on graphs.
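
For readers unfamiliar with the dynamics this abstract builds on, the sketch below integrates diffusively coupled Stuart-Landau oscillators on a graph with a forward Euler step. The standard form dz_i/dt = (mu + i*omega - |z_i|^2) z_i + K * sum_j A_ij (z_j - z_i) is assumed. The paper's actual layer (learned weights, feature dimensions, integrator) is not given in the abstract, and the parameter names and defaults here are hypothetical.

```python
def stuart_landau_step(z, adj, mu=1.0, omega=1.0, coupling=0.5, dt=0.01):
    """One forward-Euler step of coupled Stuart-Landau oscillators.

    z   : list of complex node states (amplitude and phase)
    adj : adjacency matrix as a list of lists
    mu  : Hopf parameter; for mu > 0 an isolated node settles near |z| = sqrt(mu)
    """
    updated = []
    for i, zi in enumerate(z):
        local = (complex(mu, omega) - abs(zi) ** 2) * zi
        diffusion = sum(adj[i][j] * (z[j] - zi) for j in range(len(z)))
        updated.append(zi + dt * (local + coupling * diffusion))
    return updated
```

Unlike a phase-only Kuramoto update, the state here is a full complex number, so the amplitude `abs(z[i])` evolves alongside the phase, which is the distinction the abstract emphasises.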

[LG-28] HipKittens: Fast and Furious AMD Kernels

Link: https://arxiv.org/abs/2511.08083
Authors: William Hu, Drew Wadsworth, Sean Siddens, Stanley Winata, Daniel Y. Fu, Ryann Swann, Muhammad Osama, Christopher Ré, Simran Arora
Categories: Machine Learning (cs.LG)

Abstract:AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak performance AMD kernels are written in raw assembly. To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++ embedded and PyTorch-inspired domain-specific languages like ThunderKittens (TK) to simplify high performance AI kernel development on NVIDIA hardware. We explore the extent to which such primitives (for explicit tile-based programming with optimized memory accesses and fine-grained asynchronous execution across workers) are NVIDIA-specific or general. We provide the first detailed study of the programming primitives that lead to performant AMD AI kernels, and we encapsulate these insights in the HipKittens (HK) programming framework. We find that tile-based abstractions used in prior DSLs generalize to AMD GPUs; however, we need to rethink the algorithms that instantiate these abstractions for AMD. We validate the HK primitives across CDNA3 and CDNA4 AMD platforms. In evaluations, HK kernels compete with AMD’s hand-optimized assembly kernels for GEMMs and attention, and consistently outperform compiler baselines. Moreover, assembly is difficult to scale to the breadth of AI workloads; reflecting this, in some settings HK outperforms all available kernel baselines by 1.2-2.4× (e.g., d=64 attention, GQA backwards, memory-bound kernels). These findings help pave the way for a single, tile-based software layer for high-performance AI kernels that translates across GPU vendors. HipKittens is released at: this https URL.

[LG-29] Online Linear Regression with Paid Stochastic Features AAAI2026

Link: https://arxiv.org/abs/2511.08073
Authors: Nadav Merlis, Kyoungseok Jang, Nicolò Cesa-Bianchi
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Accepted to AAAI 2026

Abstract:We study an online linear regression setting in which the observed feature vectors are corrupted by noise and the learner can pay to reduce the noise level. In practice, this may happen for several reasons: for example, because features can be measured more accurately using more expensive equipment, or because data providers can be incentivized to release less private features. Assuming feature vectors are drawn i.i.d. from a fixed but unknown distribution, we measure the learner’s regret against the linear predictor minimizing a notion of loss that combines the prediction error and payment. When the mapping between payments and noise covariance is known, we prove that the rate $\sqrt{T}$ is optimal for regret if logarithmic factors are ignored. When the noise covariance is unknown, we show that the optimal regret rate becomes of order $T^{2/3}$ (ignoring log factors). Our analysis leverages matrix martingale concentration, showing that the empirical loss uniformly converges to the expected one for all payments and linear predictors.

[LG-30] From Sequential to Recursive: Enhancing Decision-Focused Learning with Bidirectional Feedback

Link: https://arxiv.org/abs/2511.08035
Authors: Xinyu Wang, Jinxiao Du, Yiyang Peng, Wei Ma
Categories: Machine Learning (cs.LG)
Comments: 16 pages, 5 figures

Abstract:Decision-focused learning (DFL) has emerged as a powerful end-to-end alternative to conventional predict-then-optimize (PTO) pipelines by directly optimizing predictive models through downstream decision losses. Existing DFL frameworks are limited by their strictly sequential structure, referred to as sequential DFL (S-DFL). However, S-DFL fails to capture the bidirectional feedback between prediction and optimization in complex interaction scenarios. In view of this, we propose, for the first time, recursive decision-focused learning (R-DFL), a novel framework that introduces bidirectional feedback between downstream optimization and upstream prediction. We further extend two distinct differentiation methods, explicit unrolling via automatic differentiation and implicit differentiation based on fixed-point methods, to facilitate efficient gradient propagation in R-DFL. We rigorously prove that both methods achieve comparable gradient accuracy, with the implicit method offering superior computational efficiency. Extensive experiments on both synthetic and real-world datasets, including the newsvendor problem and the bipartite matching problem, demonstrate that R-DFL not only substantially enhances the final decision quality over sequential baselines but also exhibits robust adaptability across diverse scenarios in closed-loop decision-making problems.

[LG-31] Generalizable Insights for Graph Transformers in Theory and Practice NEURIPS2025

Link: https://arxiv.org/abs/2511.08028
Authors: Timo Stoll, Luis Müller, Christopher Morris
Categories: Machine Learning (cs.LG)
Comments: Accepted at NeurIPS 2025 as spotlight

Abstract:Graph Transformers (GTs) have shown strong empirical performance, yet current architectures vary widely in their use of attention mechanisms, positional embeddings (PEs), and expressivity. Existing expressivity results are often tied to specific design choices and lack comprehensive empirical validation on large-scale data. This leaves a gap between theory and practice, preventing generalizable insights that exceed particular application domains. Here, we propose the Generalized-Distance Transformer (GDT), a GT architecture using standard attention that incorporates many advancements for GTs from recent years, and develop a fine-grained understanding of the GDT’s representation power in terms of attention and PEs. Through extensive experiments, we identify design choices that consistently perform well across various applications, tasks, and model scales, demonstrating strong performance in a few-shot transfer setting without fine-tuning. Our evaluation covers over eight million graphs with roughly 270M tokens across diverse domains, including image-based object detection, molecular property prediction, code summarization, and out-of-distribution algorithmic reasoning. We distill our theoretical and practical findings into several generalizable insights about effective GT design, training, and inference.

[LG-32] Low-Rank Curvature for Zeroth-Order Optimization in LLM Fine-Tuning AAAI AAAI-2026

Link: https://arxiv.org/abs/2511.07971
Authors: Hyunseok Seung, Jaewoo Lee, Hyunsuk Ko
Categories: Machine Learning (cs.LG)
Comments: Accepted to the AAAI Conference on Artificial Intelligence (AAAI-2026)

Abstract:We introduce LOREN, a curvature-aware zeroth-order (ZO) optimization method for fine-tuning large language models (LLMs). Existing ZO methods, which estimate gradients via finite differences using random perturbations, often suffer from high variance and suboptimal search directions. Our approach addresses these challenges by: (i) reformulating the problem of gradient preconditioning as that of adaptively estimating an anisotropic perturbation distribution for gradient estimation, (ii) capturing curvature through a low-rank block diagonal preconditioner using the framework of natural evolution strategies, and (iii) applying a REINFORCE leave-one-out (RLOO) gradient estimator to reduce variance. Experiments on standard LLM benchmarks show that our method outperforms state-of-the-art ZO methods by achieving higher accuracy and faster convergence, while cutting peak memory usage by up to 27.3% compared with MeZO-Adam.
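
As background for the baseline this abstract improves on, the sketch below shows a plain zeroth-order gradient estimate: probe the loss along random Gaussian directions and use central finite differences. This is the generic isotropic estimator, not LOREN. It has no low-rank preconditioner and no RLOO variance reduction, and the function name and defaults are assumptions.

```python
import random


def zo_gradient(f, x, eps=1e-3, num_samples=16, rng=None):
    """Estimate grad f(x) from function values only.

    Each sample draws u ~ N(0, I) and contributes
    (f(x + eps*u) - f(x - eps*u)) / (2*eps) * u to the average.
    """
    rng = rng or random.Random(0)
    dim = len(x)
    grad = [0.0] * dim
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x_plus = [xi + eps * ui for xi, ui in zip(x, u)]
        x_minus = [xi - eps * ui for xi, ui in zip(x, u)]
        slope = (f(x_plus) - f(x_minus)) / (2.0 * eps)  # directional derivative along u
        for i in range(dim):
            grad[i] += slope * u[i] / num_samples
    return grad
```

The estimate is unbiased for smooth losses in the small-`eps` limit but noisy, and the noise grows with dimension. That is the high-variance problem the abstract says curvature-aware perturbations are meant to reduce.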

[LG-33] Continual Unlearning for Text-to-Image Diffusion Models: A Regularization Perspective

Link: https://arxiv.org/abs/2511.07970
Authors: Justin Lee, Zheda Mai, Jinsu Yoo, Chongyu Fan, Cheng Zhang, Wei-Lun Chao
Categories: Machine Learning (cs.LG)

Abstract:Machine unlearning, the ability to remove designated concepts from a pre-trained model, has advanced rapidly, particularly for text-to-image diffusion models. However, existing methods typically assume that unlearning requests arrive all at once, whereas in practice they often arrive sequentially. We present the first systematic study of continual unlearning in text-to-image diffusion models and show that popular unlearning methods suffer from rapid utility collapse: after only a few requests, models forget retained knowledge and generate degraded images. We trace this failure to cumulative parameter drift from the pre-training weights and argue that regularization is crucial to addressing it. To this end, we study a suite of add-on regularizers that (1) mitigate drift and (2) remain compatible with existing unlearning methods. Beyond generic regularizers, we show that semantic awareness is essential for preserving concepts close to the unlearning target, and propose a gradient-projection method that constrains parameter drift orthogonal to their subspace. This substantially improves continual unlearning performance and is complementary to other regularizers for further gains. Taken together, our study establishes continual unlearning as a fundamental challenge in text-to-image generation and provides insights, baselines, and open directions for advancing safe and accountable generative AI.
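
The gradient-projection idea mentioned above can be illustrated with a small linear-algebra sketch: orthonormalise the directions to be protected, then subtract the component of the update that lies in their span. How the paper constructs the protected subspace is not stated in the abstract, so `protected` is simply a hypothetical list of vectors.

```python
def project_out(grad, protected, tol=1e-12):
    """Remove from `grad` its component inside span(protected)."""
    # Gram-Schmidt: build an orthonormal basis of the protected subspace.
    basis = []
    for vec in protected:
        w = list(vec)
        for b in basis:
            coeff = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = sum(wi * wi for wi in w) ** 0.5
        if norm > tol:  # skip vectors that are linearly dependent on earlier ones
            basis.append([wi / norm for wi in w])
    # Subtract the projection onto each basis vector.
    out = list(grad)
    for b in basis:
        coeff = sum(gi * bi for gi, bi in zip(out, b))
        out = [gi - coeff * bi for gi, bi in zip(out, b)]
    return out
```

An update built from the returned vector has zero inner product with every protected direction, so to first order it leaves those directions untouched.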

[LG-34] Speech Emotion Recognition with Phonation Excitation Information and Articulatory Kinematics

Link: https://arxiv.org/abs/2511.07955
Authors: Ziqian Zhang, Min Huang, Zhongzhe Xiao
Categories: Sound (cs.SD); Machine Learning (cs.LG)

Abstract:Speech emotion recognition (SER) has advanced significantly thanks to deep-learning methods, while textual information further enhances its performance. However, few studies have focused on the physiological information during speech production, which also encompasses speaker traits, including emotional states. To bridge this gap, we conducted a series of experiments to investigate the potential of the phonation excitation information and articulatory kinematics for SER. Due to the scarcity of training data for this purpose, we introduce a portrayed emotional dataset, STEM-E2VA, which includes audio and physiological data such as electroglottography (EGG) and electromagnetic articulography (EMA). EGG and EMA provide information on phonation excitation and articulatory kinematics, respectively. Additionally, we performed emotion recognition using estimated physiological data derived through inversion methods from speech, instead of collected EGG and EMA, to explore the feasibility of applying such physiological information in real-world SER. Experimental results confirm the effectiveness of incorporating physiological information about speech production for SER and demonstrate its potential for practical use in real-world scenarios.

[LG-35] Predict-then-Optimize Method for Seaport Power-Logistics Scheduling: Generalization across Varying Tasks Stream

Link: https://arxiv.org/abs/2511.07938
Authors: Chuanqing Pu, Feilong Fan, Nengling Tai, Yan Xu, Wentao Huang, Honglin Wen
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Preprint to IEEE Transactions on Smart Grid

Abstract:Power-logistics scheduling in modern seaports typically follows a predict-then-optimize pipeline. To enhance decision quality, decision-focused learning has been proposed to align forecasting and optimization via end-to-end training. However, most formulations assume a fixed task configuration in downstream optimization, and thus generalize poorly to evolving task structures induced by varying seaport vessel arrivals. We address this gap with a decision-focused continual learning framework that adapts online to a stream of scheduling tasks. Specifically, we introduce Fisher information based regularization to enhance cross-task generalization by preserving parameters critical to prior tasks. A differentiable convex surrogate is also developed to stabilize gradient backpropagation. The proposed approach enables learning a decision-aligned forecasting model for new scheduling tasks while retaining generalization on earlier tasks. Experiments calibrated to the Jurong Port demonstrate superior decision performance and generalization over existing methods with reduced computational cost.
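
Fisher-information regularization of the kind mentioned above is commonly implemented as a quadratic penalty weighted by a diagonal Fisher estimate, as in elastic weight consolidation. The sketch below shows that common form. It is an assumption that the paper uses something similar, and the function names and the `lam` weight are hypothetical.

```python
def diagonal_fisher(per_sample_grads):
    """Diagonal Fisher estimate: mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def fisher_penalty(theta, theta_prev, fisher, lam=1.0):
    """Quadratic penalty (lam / 2) * sum_i F_i * (theta_i - theta_prev_i)^2."""
    return 0.5 * lam * sum(
        f * (t - t0) ** 2 for f, t, t0 in zip(fisher, theta, theta_prev)
    )
```

Parameters with large Fisher values were important for earlier tasks, so moving them is penalised more heavily when the penalty is added to the loss for a new scheduling task.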

[LG-36] SERL: Self-Examining Reinforcement Learning on Open-Domain

Link: https://arxiv.org/abs/2511.07922
Authors: Weixuan Ou, Yanzhao Zheng, Shuoshuo Sun, Wei Zhang, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu, Pengwei Yan, Yifan Qiao
Categories: Machine Learning (cs.LG)

Abstract:Reinforcement Learning (RL) has been shown to improve the capabilities of large language models (LLMs). However, applying RL to open-domain tasks faces two key challenges: (1) the inherent subjectivity of these tasks precludes the verifiable rewards required by Reinforcement Learning with Verifiable Rewards (RLVR); (2) Reinforcement Learning from Human Feedback (RLHF) relies on external reward mechanisms. To overcome these limitations, we propose Self-Examining Reinforcement Learning (SERL), a novel self-improving framework where the LLM serves as both Actor and Judge. SERL introduces two synergistic reward mechanisms without any external signals. On the one hand, to improve the Actor’s capability, we derive rewards from Copeland-style pairwise comparison judgments across a group of generated responses. On the other hand, a self-consistency reward that encourages coherent judgments is proposed to improve the Judge’s reliability. This process refines the Judge’s capability, which in turn provides a more robust reward for the Actor. Experiments show that our method outperforms existing self-improvement training methods. SERL improves the LC win rate of Qwen3-8B on AlpacaEval 2 from 52.37% to 59.90%. To the best of our knowledge, our method achieves state-of-the-art performance among self-improving approaches. Furthermore, it achieves a performance comparable to significantly larger models like Qwen3-32B, demonstrating superior effectiveness and robustness on open-domain tasks.
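
A Copeland-style score ranks candidates by pairwise wins minus pairwise losses. The sketch below computes that score from a matrix of pairwise judgments. It illustrates the general idea only: how SERL turns such scores into rewards is not described in the abstract, and the `prefers` matrix layout is a hypothetical choice.

```python
def copeland_scores(prefers):
    """Copeland score per candidate: pairwise wins minus pairwise losses.

    prefers[i][j] is True when candidate i is judged better than candidate j.
    Pairs where both or neither direction is True count as ties.
    """
    n = len(prefers)
    scores = [0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if prefers[i][j] and not prefers[j][i]:
                scores[i] += 1
            elif prefers[j][i] and not prefers[i][j]:
                scores[i] -= 1
    return scores
```

With three responses where the first beats both others and the second beats the third, the scores are `[2, 0, -2]`.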

[LG-37] Feedback Descent: Open-Ended Text Optimization via Pairwise Comparison

Link: https://arxiv.org/abs/2511.07919
Authors: Yoonho Lee, Joseph Boen, Chelsea Finn
Categories: Machine Learning (cs.LG)

Abstract:We introduce Feedback Descent, a framework that optimizes text artifacts (prompts, code, and molecules) through structured textual feedback, rather than relying solely on scalar rewards. By preserving detailed critiques instead of compressing them to binary preferences, Feedback Descent widens the information bottleneck in preference learning, enabling directed optimization in text space rather than weight space. We show that in-context learning can transform structured feedback into gradient-like directional information, enabling targeted edits. Unlike prior approaches that collapse judgments into single bits, our evaluators pair each comparison with textual feedback, which functions as high-bandwidth supervision. The iteration loop is done purely at inference time, without modifying any model weights, and is task-agnostic. We evaluate Feedback Descent on three diverse domains and find that it outperforms state-of-the-art prompt optimization (GEPA), reinforcement learning methods (GRPO, REINVENT), and even specialized graph-based molecular optimizers. In the DOCKSTRING molecule discovery benchmark, Feedback Descent identifies novel drug-like molecules surpassing the 99.9th percentile of a database with more than 260,000 compounds across six protein targets.

[LG-38] Rectified Noise: A Generative Model Using Positive-incentive Noise AAAI2026

Link: https://arxiv.org/abs/2511.07911
Authors: Zhenyu Gu, Yanchen Xu, Sida Huang, Yubin Guo, Hongyuan Zhang
Categories: Machine Learning (cs.LG)
Comments: Accepted by AAAI 2026

Abstract:Rectified Flow (RF) has been widely used as an effective generative model. Although RF is primarily based on probability flow Ordinary Differential Equations (ODE), recent studies have shown that injecting noise through reverse-time Stochastic Differential Equations (SDE) for sampling can achieve superior generative performance. Inspired by Positive-incentive Noise ($\pi$-noise), we propose an innovative generative algorithm to train $\pi$-noise generators, namely Rectified Noise ($\Delta$RN), which improves the generative performance by injecting $\pi$-noise into the velocity field of pre-trained RF models. After introducing the Rectified Noise pipeline, pre-trained RF models can be efficiently transformed into $\pi$-noise generators. We validate Rectified Noise by conducting extensive experiments across various model architectures on different datasets. Notably, we find that: (1) RF models using Rectified Noise reduce FID from 10.16 to 9.05 on ImageNet-1k. (2) The models of $\pi$-noise generators achieve improved performance with only 0.39% additional training parameters.

[LG-39] CellARC: Measuring Intelligence with Cellular Automata

Link: https://arxiv.org/abs/2511.07908
Authors: Miroslav Lžičař
Categories: Machine Learning (cs.LG)
Comments: 22 pages, 11 figures. Working draft. Dataset and leaderboard available at this https URL

Abstract:We introduce CellARC, a synthetic benchmark for abstraction and reasoning built from multicolor 1D cellular automata (CA). Each episode has five support pairs and one query serialized in 256 tokens, enabling rapid iteration with small models while exposing a controllable task space with explicit knobs for alphabet size k, radius r, rule family, Langton’s lambda, query coverage, and cell entropy. We release 95k training episodes plus two 1k test splits (interpolation/extrapolation) and evaluate symbolic, recurrent, convolutional, transformer, recursive, and LLM baselines. CellARC decouples generalization from anthropomorphic priors, supports unlimited difficulty-controlled sampling, and enables reproducible studies of how quickly models infer new rules under tight budgets. Our strongest small-model baseline (a 10M-parameter vanilla transformer) outperforms recent recursive models (TRM, HRM), reaching 58.0%/32.4% per-token accuracy on the interpolation/extrapolation splits, while a large closed model (GPT-5 High) attains 62.3%/48.1% on subsets of 100 test tasks. An ensemble that chooses per episode between the Transformer and the best symbolic baseline reaches 65.4%/35.5%, highlighting neuro-symbolic complementarity. Leaderboard: this https URL
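
To show what the knobs named in this abstract control, the sketch below steps a 1D cellular automaton with alphabet size `k` and radius `r` from a lookup table, and computes Langton's lambda as the fraction of table entries that map to a non-quiescent state. The Wolfram-style rule numbering, the periodic boundary, and the choice of state 0 as quiescent are assumptions made for illustration. The benchmark's own generator may differ.

```python
def make_rule_table(rule_number, k=2, r=1):
    """Decode a rule number into a table indexed by the base-k neighbourhood value."""
    size = k ** (2 * r + 1)
    return [(rule_number // k ** idx) % k for idx in range(size)]


def ca_step(state, table, k=2, r=1):
    """Apply one synchronous update with periodic boundary conditions."""
    n = len(state)
    nxt = []
    for i in range(n):
        idx = 0
        for offset in range(-r, r + 1):
            idx = idx * k + state[(i + offset) % n]  # leftmost cell is the most significant digit
        nxt.append(table[idx])
    return nxt


def langton_lambda(table, quiescent=0):
    """Fraction of neighbourhoods that map to a non-quiescent state."""
    return sum(1 for out in table if out != quiescent) / len(table)
```

For instance, `ca_step([0, 0, 0, 1, 0], make_rule_table(110))` returns `[0, 0, 1, 1, 0]`, and `langton_lambda(make_rule_table(110))` is `0.625`. A support pair in the benchmark's sense would then be an input row together with the row produced by one or more such steps.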

[LG-40] A Generalized Spectral Framework to Explain Neural Scaling and Compression Dynamics

Link: https://arxiv.org/abs/2511.07892
Authors: Yizhou Zhang
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Empirical scaling laws describe how test loss and other performance metrics depend on model size, dataset size, and compute. While such laws are consistent within specific regimes, apparently distinct scaling behaviors have been reported for related settings such as model compression. Motivated by recent progress in spectral analyses of neural representations, this paper develops a generalized spectral framework that unifies learning dynamics and compression phenomena under a common functional ansatz. We generalize the spectral evolution function from the linear kernel form $g(\lambda t)=\lambda t$ to an asymptotically polynomial function $g(\lambda,t;\beta)$, characterized by an effective spectral–temporal elasticity $\rho(\beta)$. This framework recovers existing lazy and feature-learning theories as special cases and yields an invariant relation between learning and compression.

[LG-41] SpikCommander: A High-performance Spiking Transformer with Multi-view Learning for Efficient Speech Command Recognition AAAI2026

Link: https://arxiv.org/abs/2511.07883
Authors: Jiaqi Wang,Liutao Yu,Xiongri Shen,Sihang Guo,Chenlin Zhou,Leilei Zhao,Yi Zhong,Zhengyu Ma,Zhiguo Zhang
Categories: Sound (cs.SD); Machine Learning (cs.LG)
Comments: Accepted by The Fortieth AAAI Conference on Artificial Intelligence (AAAI 2026)

Abstract:Spiking neural networks (SNNs) offer a promising path toward energy-efficient speech command recognition (SCR) by leveraging their event-driven processing paradigm. However, existing SNN-based SCR methods often struggle to capture rich temporal dependencies and contextual information from speech due to limited temporal modeling and binary spike-based representations. To address these challenges, we first introduce the multi-view spiking temporal-aware self-attention (MSTASA) module, which combines effective spiking temporal-aware attention with a multi-view learning framework to model complementary temporal dependencies in speech commands. Building on MSTASA, we further propose SpikCommander, a fully spike-driven transformer architecture that integrates MSTASA with a spiking contextual refinement channel MLP (SCR-MLP) to jointly enhance temporal context modeling and channel-wise feature integration. We evaluate our method on three benchmark datasets: the Spiking Heidelberg Dataset (SHD), the Spiking Speech Commands (SSC), and the Google Speech Commands V2 (GSC). Extensive experiments demonstrate that SpikCommander consistently outperforms state-of-the-art (SOTA) SNN approaches with fewer parameters under comparable time steps, highlighting its effectiveness and efficiency for robust speech command recognition.

[LG-42] Algorithm-Relative Trajectory Valuation in Policy Gradient Control

Link: https://arxiv.org/abs/2511.07878
Authors: Shihao Li,Jiachen Li,Jiamin Xu,Christopher Martin,Wei Li,Dongmei Chen
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments:

Abstract:We study how trajectory value depends on the learning algorithm in policy-gradient control. Using Trajectory Shapley in an uncertain LQR, we find a negative correlation between Persistence of Excitation (PE) and marginal value under vanilla REINFORCE ($r \approx -0.38$). We prove a variance-mediated mechanism: (i) for fixed energy, higher PE yields lower gradient variance; (ii) near saddles, higher variance increases escape probability, raising marginal contribution. When stabilized (state whitening or Fisher preconditioning), this variance channel is neutralized and information content dominates, flipping the correlation positive ($r \approx +0.29$). Hence, trajectory value is algorithm-relative. Experiments validate the mechanism and show decision-aligned scores (Leave-One-Out) complement Shapley for pruning, while Shapley identifies toxic subsets.

[LG-43] Parallel Sampling via Autospeculation

Link: https://arxiv.org/abs/2511.07869
Authors: Nima Anari,Carlo Baronio,CJ Chen,Alireza Haqi,Frederic Koehler,Anqi Li,Thuy-Duong Vuong
Categories: Data Structures and Algorithms (cs.DS); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Probability (math.PR)
Comments:

Abstract:We present parallel algorithms to accelerate sampling via counting in two settings: any-order autoregressive models and denoising diffusion models. An any-order autoregressive model accesses a target distribution $\mu$ on $[q]^n$ through an oracle that provides conditional marginals, while a denoising diffusion model accesses a target distribution $\mu$ on $\mathbb{R}^n$ through an oracle that provides conditional means under Gaussian noise. Standard sequential sampling algorithms require $\widetilde{O}(n)$ time to produce a sample from $\mu$ in either setting. We show that, by issuing oracle calls in parallel, the expected sampling time can be reduced to $\widetilde{O}(n^{1/2})$. This improves the previous $\widetilde{O}(n^{2/3})$ bound for any-order autoregressive models and yields the first parallel speedup for diffusion models in the high-accuracy regime, under the relatively mild assumption that the support of $\mu$ is bounded. We introduce a novel technique to obtain our results: speculative rejection sampling. This technique leverages an auxiliary "speculative" distribution $\nu$ that approximates $\mu$ to accelerate sampling. Our technique is inspired by the well-studied "speculative decoding" techniques popular in large language models, but differs in key ways. Firstly, we use "autospeculation," namely we build the speculation $\nu$ out of the same oracle that defines $\mu$. In contrast, speculative decoding typically requires a separate, faster, but potentially less accurate "draft" model $\nu$. Secondly, the key differentiating factor in our technique is that we make and accept speculations at a "sequence" level rather than at the level of single (or a few) steps. This last fact is key to unlocking our parallel runtime of $\widetilde{O}(n^{1/2})$.

[LG-44] DP-AdamW: Investigating Decoupled Weight Decay and Bias Correction in Private Deep Learning ICML2025

Link: https://arxiv.org/abs/2511.07843
Authors: Jay Chooi,Kevin Cong,Russell Li,Lillian Sun
Categories: Machine Learning (cs.LG)
Comments: 19 pages, 5 appendices; presented at ICML 2025 DIG-BUGS Workshop

Abstract:As deep learning methods increasingly utilize sensitive data on a widespread scale, differential privacy (DP) offers formal guarantees to protect against information leakage during model training. A significant challenge remains in implementing DP optimizers that retain strong performance while preserving privacy. Recent advances introduced ever more efficient optimizers, with AdamW being a popular choice for training deep learning models because of strong empirical performance. We study DP-AdamW and introduce DP-AdamW-BC, a differentially private variant of the AdamW optimizer with DP bias correction for the second moment estimator. We start by showing theoretical results for privacy and convergence guarantees of DP-AdamW and DP-AdamW-BC. Then, we empirically analyze the behavior of both optimizers across multiple privacy budgets ($\epsilon = 1, 3, 7$). We find that DP-AdamW outperforms existing state-of-the-art differentially private optimizers like DP-SGD, DP-Adam, and DP-AdamBC, scoring over 15% higher on text classification, up to 5% higher on image classification, and consistently 1% higher on graph node classification. Moreover, we empirically show that incorporating bias correction in DP-AdamW (DP-AdamW-BC) consistently decreases accuracy, in contrast to the improvement of DP-AdamBC over DP-Adam.

[LG-45] Hyperellipsoid Density Sampling: Exploitative Sequences to Accelerate High-Dimensional Optimization

Link: https://arxiv.org/abs/2511.07836
Authors: Julian Soltes
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: for Python implementation, see this https URL

Abstract:The curse of dimensionality presents a pervasive challenge in optimization problems, with exponential expansion of the search space rapidly causing traditional algorithms to become inefficient or infeasible. An adaptive sampling strategy is presented to accelerate optimization in this domain as an alternative to uniform quasi-Monte Carlo (QMC) methods. This method, referred to as Hyperellipsoid Density Sampling (HDS), generates its sequences by defining multiple hyperellipsoids throughout the search space. HDS uses three types of unsupervised learning algorithms to circumvent high-dimensional geometric calculations, producing an intelligent, non-uniform sample sequence that exploits statistically promising regions of the parameter space and improves final solution quality in high-dimensional optimization problems. A key feature of the method is optional Gaussian weights, which may be provided to influence the sample distribution towards known locations of interest. This capability makes HDS versatile for applications beyond optimization, providing a focused, denser sample distribution where models need to concentrate their efforts on specific, non-uniform regions of the parameter space. The method was evaluated against Sobol, a standard QMC method, using differential evolution (DE) on the 29 CEC2017 benchmark test functions. The results show statistically significant improvements in solution geometric mean error (p < 0.05), with average performance gains ranging from 3% in 30D to 37% in 10D. This paper demonstrates the efficacy of HDS as a robust alternative to QMC sampling for high-dimensional optimization.

[LG-46] Multi-Objective Bilevel Learning

Link: https://arxiv.org/abs/2511.07824
Authors: Zhiyao Zhang,Zhuqing Liu,Xin Zhang,Wen-Yen Chen,Jiyan Yang,Jia Liu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:As machine learning (ML) applications grow increasingly complex in recent years, modern ML frameworks often need to address multiple potentially conflicting objectives with coupled decision variables across different layers. This creates a compelling need for multi-objective bilevel learning (MOBL). So far, however, the field of MOBL remains in its infancy and many important problems remain under-explored. This motivates us to fill this gap and systematically investigate the theoretical and algorithmic foundation of MOBL. Specifically, we consider MOBL problems with multiple conflicting objectives guided by preferences at the upper-level subproblem, where part of the inputs depend on the optimal solution of the lower-level subproblem. Our goal is to develop efficient MOBL optimization algorithms to (1) identify a preference-guided Pareto-stationary solution with low oracle complexity; and (2) enable systematic Pareto front exploration. To this end, we propose a unifying algorithmic framework called weighted-Chebyshev multi-hyper-gradient-descent (WC-MHGD) for both deterministic and stochastic settings with finite-time Pareto-stationarity convergence rate guarantees, which not only implies low oracle complexity but also induces systematic Pareto front exploration. We further conduct extensive experiments to confirm our theoretical results.

[LG-47] Analyzing Political Text at Scale with Online Tensor LDA

Link: https://arxiv.org/abs/2511.07809
Authors: Sara Kangaslahti,Danny Ebanks,Jean Kossaifi,Anqi Liu,R. Michael Alvarez,Animashree Anandkumar
Categories: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 64 pages, 11 figures

Abstract:This paper proposes a topic modeling method that scales linearly to billions of documents. We make three core contributions: i) we present a topic modeling method, Tensor Latent Dirichlet Allocation (TLDA), that has identifiable and recoverable parameter guarantees and sample complexity guarantees for large data; ii) we show that this method is computationally and memory efficient (achieving speeds over 3-4x those of prior parallelized Latent Dirichlet Allocation (LDA) methods), and that it scales linearly to text datasets with over a billion documents; iii) we provide an open-source, GPU-based implementation of this method. This scaling enables previously prohibitive analyses, and we perform two new real-world, large-scale studies of interest to political scientists: we provide the first thorough analysis of the evolution of the #MeToo movement through the lens of over two years of Twitter conversation and a detailed study of social media conversations about election fraud in the 2020 presidential election. Thus this method provides social scientists with the ability to study very large corpora at scale and to answer important theoretically-relevant questions about salient issues in near real-time.

[LG-48] Streaming Tensor Program: A streaming abstraction for dynamic parallelism

Link: https://arxiv.org/abs/2511.07776
Authors: Gina Sohn,Genghan Zhang,Konstantin Hossfeld,Jungwoo Kim,Nathan Sobotka,Nathan Zhang,Olivia Hsu,Kunle Olukotun
Categories: Programming Languages (cs.PL); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments:

Abstract:Dynamic behaviors are becoming prevalent in many tensor applications. In machine learning, for example, the input tensors are dynamically shaped or ragged, and data-dependent control flow is widely used in many models. However, the limited expressiveness of prior programming abstractions for spatial dataflow accelerators forces the dynamic behaviors to be implemented statically or lacks the visibility for performance-critical decisions. To address these challenges, we present the Streaming Tensor Program (STeP), a new streaming abstraction that enables dynamic tensor workloads to run efficiently on spatial dataflow accelerators. STeP introduces flexible routing operators, an explicit memory hierarchy, and symbolic shape semantics that expose dynamic data rates and tensor dimensions. These capabilities unlock new optimizations (dynamic tiling, dynamic parallelization, and configuration time-multiplexing) that adapt to dynamic behaviors while preserving dataflow efficiency. Using a cycle-approximate simulator on representative LLM layers with real-world traces, dynamic tiling reduces on-chip memory requirement by 2.18x, dynamic parallelization improves latency by 1.5x, and configuration time-multiplexing improves compute utilization by 2.57x over implementations available in prior abstractions.

[LG-49] Schedulers for Schedule-free: Theoretically inspired hyperparameters

Link: https://arxiv.org/abs/2511.07767
Authors: Yuen-Man Pun,Matthew Buchholz,Robert M. Gower
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The recently proposed schedule-free method has been shown to achieve strong performance when hyperparameter tuning is limited. The current theory for schedule-free only supports a constant learning rate, whereas the implementation used in practice uses a warm-up schedule. We show how to extend the last-iterate convergence theory of schedule-free to allow for any scheduler, and how the averaging parameter has to be updated as a function of the learning rate. We then perform experiments showing how our convergence theory has some predictive power with regard to practical executions on deep neural networks, even though this theory relies on assuming convexity. When applied to the warmup-stable-decay (wsd) schedule, our theory shows the optimal convergence rate of $\mathcal{O}(1/\sqrt{T})$. We then use convexity to design a new adaptive Polyak learning rate schedule for schedule-free. We prove an optimal anytime last-iterate convergence for our new Polyak schedule, and show that it performs well compared to a number of baselines on a black-box model distillation task.
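
For readers unfamiliar with the warmup-stable-decay (wsd) shape mentioned in the abstract, the following is a minimal sketch of such a learning-rate schedule. It assumes linear warm-up and linear decay, and the names `wsd_lr`, `warmup_frac`, and `decay_frac` are hypothetical; it does not reproduce the paper's schedule-free update or its rule for the averaging parameter.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.1, decay_frac=0.2):
    """Learning rate at a 0-indexed step under an assumed linear WSD shape."""
    warmup_steps = int(warmup_frac * total_steps)
    decay_steps = int(decay_frac * total_steps)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warm-up: ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay: ramp linearly down over the final decay_steps steps.
    return peak_lr * (total_steps - step) / decay_steps


schedule = [wsd_lr(t, total_steps=100, peak_lr=1.0) for t in range(100)]
print(schedule[0], schedule[50], schedule[99])
```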

[LG-50] Multistep Quasimetric Learning for Scalable Goal-conditioned Reinforcement Learning

Link: https://arxiv.org/abs/2511.07730
Authors: Bill Chunyuan Zheng,Vivek Myers,Benjamin Eysenbach,Sergey Levine
Categories: Machine Learning (cs.LG); Robotics (cs.RO)
Comments:

Abstract:Learning how to reach goals in an environment is a longstanding challenge in AI, yet reasoning over long horizons remains a challenge for modern methods. The key question is how to estimate the temporal distance between pairs of observations. While temporal difference methods leverage local updates to provide optimality guarantees, they often perform worse than Monte Carlo methods that perform global updates (e.g., with multi-step returns), which lack such guarantees. We show how these approaches can be integrated into a practical GCRL method that fits a quasimetric distance using a multistep Monte-Carlo return. We show our method outperforms existing GCRL methods on long-horizon simulated tasks with up to 4000 steps, even with visual observations. We also demonstrate that our method can enable stitching in the real-world robotic manipulation domain (Bridge setup). Our approach is the first end-to-end GCRL method that enables multistep stitching in this real-world manipulation domain from an unlabeled offline dataset of visual observations.

[LG-51] A Ranking-Based Optimization Algorithm for the Vehicle Relocation Problem in Car Sharing Services

Link: https://arxiv.org/abs/2511.07724
Authors: Piotr Szwed,Paweł Skrzynski,Jarosław Wąs
Categories: Machine Learning (cs.LG)
Comments:

[LG-52] Intelligent Optimization of Multi-Parameter Micromixers Using a Scientific Machine Learning Framework

Link: https://arxiv.org/abs/2511.07702
Authors: Meraj Hassanzadeh,Ehsan Ghaderi,Mohamad Ali Bijarchi,Siamak Kazemzadeh Hannani
Categories: Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Fluid Dynamics (physics.flu-dyn)
Comments:

Abstract:Multidimensional optimization has consistently been a critical challenge in engineering. However, traditional simulation-based optimization methods have long been plagued by significant limitations: they are typically capable of optimizing only a single problem at a time and require substantial computational time for meshing and numerical simulation. This paper introduces a novel framework leveraging cutting-edge Scientific Machine Learning (Sci-ML) methodologies to overcome these inherent drawbacks of conventional approaches. The proposed method provides instantaneous solutions to a spectrum of complex, multidimensional optimization problems. A micromixer case study is employed to demonstrate this methodology. An agent, operating on a Deep Reinforcement Learning (DRL) architecture, serves as the optimizer to explore the relationships between key problem parameters. This optimizer interacts with an environment constituted by a parametric Physics-Informed Neural Network (PINN), which responds to the agent’s actions at a significantly higher speed than traditional numerical methods. The agent’s objective, conditioned on the Schmidt number, is to discover the optimal geometric and physical parameters that maximize the micromixer’s efficiency. After training the agent across a wide range of Schmidt numbers, we analyzed the resulting optimal designs. Across this entire spectrum, the achieved efficiency was consistently greater than the normalized baseline value. The maximum efficiency occurred at a Schmidt number of 13.3, demonstrating an improvement of approximately 32%. Finally, a comparative analysis with a Genetic Algorithm was conducted under equivalent conditions to underscore the advantages of the proposed method.

[LG-53] Probabilities Are All You Need: A Probability-Only Approach to Uncertainty Estimation in Large Language Models

Link: https://arxiv.org/abs/2511.07694
Authors: Manh Nguyen,Sunil Gupta,Hung Le
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) exhibit strong performance across various natural language processing (NLP) tasks but remain vulnerable to hallucinations, generating factually incorrect or misleading outputs. Uncertainty estimation, often using predictive entropy estimation, is key to addressing this issue. However, existing methods often require multiple samples or extra computation to assess semantic entropy. This paper proposes an efficient, training-free uncertainty estimation method that approximates predictive entropy using the responses’ top-$K$ probabilities. Moreover, we employ an adaptive mechanism to determine $K$ to enhance flexibility and filter out low-confidence probabilities. Experimental results on three free-form question-answering datasets across several LLMs demonstrate that our method outperforms expensive state-of-the-art baselines, contributing to the broader goal of enhancing LLM trustworthiness.
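
The general idea of estimating entropy from only the K largest probabilities can be sketched as follows. This is an illustration under stated assumptions, not the paper's estimator: renormalizing the top-K mass is an assumed design choice, the function name `topk_entropy` is hypothetical, and the adaptive selection of K described in the abstract is not reproduced.

```python
import math


def topk_entropy(probs, k):
    """Shannon entropy (in nats) of the k largest probabilities after renormalization."""
    top = sorted(probs, reverse=True)[:k]
    total = sum(top)
    if total <= 0:
        raise ValueError("top-k probabilities must have positive mass")
    # Skip zero entries: they contribute nothing and log(0) is undefined.
    return -sum((p / total) * math.log(p / total) for p in top if p > 0)


# A peaked distribution yields lower entropy than a flat one.
print(topk_entropy([0.9, 0.05, 0.03, 0.02], k=3))
print(topk_entropy([0.25, 0.25, 0.25, 0.25], k=3))
```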

[LG-54] ZeroSim: Zero-Shot Analog Circuit Evaluation with Unified Transformer Embeddings

Link: https://arxiv.org/abs/2511.07658
Authors: Xiaomeng Yang,Jian Gao,Yanzhi Wang,Xuan Zhang
Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments: Accepted by ICCAD 2025

[LG-55] CAE: Character-Level Autoencoder for Non-Semantic Relational Data Grouping

Link: https://arxiv.org/abs/2511.07657
Authors: Veera V S Bhargav Nunna,Shinae Kang,Zheyuan Zhou,Virginia Wang,Sucharitha Boinapally,Michael Foley
Categories: Machine Learning (cs.LG)
Comments:

[LG-56] Enhancing Binary Encoded Crime Linkage Analysis Using Siamese Network AAAI2026

Link: https://arxiv.org/abs/2511.07651
Authors: Yicheng Zhan,Fahim Ahmed,Amy Burrell,Matthew J. Tonkin,Sarah Galambos,Jessica Woodhams,Dalal Alrajeh
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: AAAI 2026, 7 pages, 4 figures

Abstract:Effective crime linkage analysis is crucial for identifying serial offenders and enhancing public safety. To address limitations of traditional crime linkage methods in handling high-dimensional, sparse, and heterogeneous data, we propose a Siamese Autoencoder framework that learns meaningful latent representations and uncovers correlations in complex crime data. Using data from the Violent Crime Linkage Analysis System (ViCLAS), maintained by the Serious Crime Analysis Section of the UK’s National Crime Agency, our approach mitigates signal dilution in sparse feature spaces by integrating geographic-temporal features at the decoder stage. This design amplifies behavioral representations rather than allowing them to be overshadowed at the input level, yielding consistent improvements across multiple evaluation metrics. We further analyze how different domain-informed data reduction strategies influence model performance, providing practical guidance for preprocessing in crime linkage contexts. Our results show that advanced machine learning approaches can substantially enhance linkage accuracy, improving AUC by up to 9% over traditional methods while offering interpretable insights to support investigative decision-making.

[LG-57] FlowTIE: Flow-based Transport of Intensity Equation for Phase Gradient Estimation from 4D-STEM Data NEURIPS2025

Link: https://arxiv.org/abs/2511.07633
Authors: Arya Bangun,Maximilian Töllner,Xuan Zhao,Christian Kübel,Hanno Scharr
Categories: Machine Learning (cs.LG)
Comments: 7 pages, 3 figures, Machine Learning and the Physical Sciences Workshop, NeurIPS 2025

[LG-58] SCALAR: Benchmarking SAE Interaction Sparsity in Toy LLMs

Link: https://arxiv.org/abs/2511.07572
Authors: Sean P. Fillingham,Andrew Gordon,Peter Lai,Xavier Poncini,David Quarel,Stefan Heimersheim
Categories: Machine Learning (cs.LG)
Comments:

[LG-59] Methodological Precedence in Health Tech: Why ML/Big Data Analysis Must Follow Basic Epidemiological Consistency. A Case Study

Link: https://arxiv.org/abs/2511.07500
Authors: Marco Roccetti
Categories: Machine Learning (cs.LG)
Comments: 2 Tables; ML/Big data paper on medical data

Abstract:The integration of advanced analytical tools, including Machine Learning (ML) and massive data processing, has revolutionized health research, promising unprecedented accuracy in diagnosis and risk prediction. However, the rigor of these complex methods is fundamentally dependent on the quality and integrity of the underlying datasets and the validity of their statistical design. We propose an emblematic case where advanced analysis (ML/Big Data) must necessarily be subsequent to the verification of basic methodological coherence. This study highlights a crucial cautionary principle: sophisticated analyses amplify, rather than correct, severe methodological flaws rooted in basic design choices, leading to misleading or contradictory findings. By applying simple, standard descriptive statistical methods and established national epidemiological benchmarks to a recently published cohort study on vaccine outcomes and psychiatric events, we expose multiple, statistically irreconcilable paradoxes. These paradoxes, including an implausible risk reduction for a chronic disorder in a high-risk group and contradictory incidence rate comparisons, definitively invalidate the reported hazard ratios (HRs). We demonstrate that the observed effects are mathematical artifacts stemming from an uncorrected selection bias in the cohort construction. This analysis serves as a robust reminder that even the most complex health studies must first pass the test of basic epidemiological consistency before any conclusion drawn from subsequent advanced ML or statistical modeling can be considered valid or publishable. We conclude that robust methods, such as Propensity Score Matching, are essential for achieving valid causal inference from administrative data in the absence of randomization

[LG-60] Provably Efficient Sample Complexity for Robust CMDP

Link: https://arxiv.org/abs/2511.07486
Authors: Sourav Ganguly,Arnob Ghosh
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
Comments:

[LG-61] Counterfactual Forecasting of Human Behavior using Generative AI and Causal Graphs

Link: https://arxiv.org/abs/2511.07484
Authors: Dharmateja Priyadarshi Uddandarao,Ravi Kiran Vadlamani
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Methodology (stat.ME)
Comments:

Abstract:This study presents a novel framework for counterfactual user behavior forecasting that combines structural causal models with transformer-based generative artificial intelligence. To model hypothetical situations, the method creates causal graphs that map the connections between user interactions, adoption metrics, and product features. The framework generates realistic behavioral trajectories under counterfactual conditions by using generative models that are conditioned on causal variables. Tested on datasets from web interactions, mobile applications, and e-commerce, the methodology outperforms conventional forecasting and uplift modeling techniques. Product teams can effectively simulate and assess possible interventions prior to deployment thanks to the framework's improved interpretability through causal path visualization.

[LG-62] Comparing Reconstruction Attacks on Pretrained Versus Full Fine-tuned Large Language Model Embeddings on Homo Sapiens Splice Sites Genomic Data

Link: https://arxiv.org/abs/2511.07481
Authors: Reem Al-Saidi,Erman Ayday,Ziad Kobti
Categories: Machine Learning (cs.LG)
Comments:

Abstract:This study investigates embedding reconstruction attacks in large language models (LLMs) applied to genomic sequences, with a specific focus on how fine-tuning affects vulnerability to these attacks. Building upon Pan et al.'s seminal work demonstrating that embeddings from pretrained language models can leak sensitive information, we conduct a comprehensive analysis using the HS3D genomic dataset to determine whether task-specific optimization strengthens or weakens privacy protections. Our research extends Pan et al.'s work in three significant dimensions. First, we apply their reconstruction attack pipeline to pretrained and fine-tuned model embeddings, addressing a critical gap in their methodology that did not specify embedding types. Second, we implement specialized tokenization mechanisms tailored specifically for DNA sequences, enhancing the model’s ability to process genomic data, as these models are pretrained on natural language and not DNA. Third, we perform a detailed comparative analysis examining position-specific, nucleotide-type, and privacy changes between pretrained and fine-tuned embeddings. We assess embedding vulnerabilities across different types and dimensions, providing deeper insights into how task adaptation shifts privacy risks throughout genomic sequences. Our findings show a clear distinction in reconstruction vulnerability between pretrained and fine-tuned embeddings. Notably, fine-tuning strengthens resistance to reconstruction attacks in multiple architectures, namely XLNet (+19.8%), GPT-2 (+9.8%), and BERT (+7.8%), pointing to task-specific optimization as a potential privacy enhancement mechanism. These results highlight the need for advanced protective mechanisms for language models processing sensitive genomic data, while highlighting fine-tuning as a potential privacy-enhancing technique worth further exploration.

[LG-63] From Hubs to Deserts: Urban Cultural Accessibility Patterns with Explainable AI

Link: https://arxiv.org/abs/2511.07475
Authors: Protik Bose Pranto,Minhazul Islam,Ripon Kumar Saha,Abimelec Mercado Rivera,Namig Abbasov
Categories: Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:Cultural infrastructures, such as libraries, museums, theaters, and galleries, support learning, civic life, health, and local economies, yet access is uneven across cities. We present a novel, scalable, and open-data framework to measure spatial equity in cultural access. We map cultural infrastructures and compute a metric called Cultural Infrastructure Accessibility Score (CIAS) using exponential distance decay at fine spatial resolution, then aggregate the score per capita and integrate socio-demographic indicators. Interpretable tree-ensemble models with SHapley Additive exPlanation (SHAP) are used to explain associations between accessibility, income, density, and tract-level racial/ethnic composition. Results show a pronounced core-periphery gradient, where non-library cultural infrastructures cluster near urban cores, while libraries track density and provide broader coverage. Non-library accessibility is modestly higher in higher-income tracts, and library accessibility is slightly higher in denser, lower-income areas.
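
An accessibility score based on exponential distance decay, as mentioned in the abstract, can be sketched in a few lines. The exact CIAS formula is not given here, so the functional form `exp(-decay_rate * distance)`, the unit (kilometres), the default decay rate, and the names `accessibility_score` and `per_capita_score` are all assumptions for illustration.

```python
import math


def accessibility_score(distances_km, decay_rate=1.0):
    """Sum of exp(-decay_rate * d) over facility distances; nearer facilities count more."""
    return sum(math.exp(-decay_rate * d) for d in distances_km)


def per_capita_score(distances_km, population, decay_rate=1.0):
    """Hypothetical per-capita aggregation: the raw score divided by the resident population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return accessibility_score(distances_km, decay_rate) / population


# One location with three facilities at 0.5 km, 2 km, and 10 km.
print(accessibility_score([0.5, 2.0, 10.0]))
print(per_capita_score([0.5, 2.0, 10.0], population=4000))
```

Under this form a facility at distance zero contributes exactly 1 and distant facilities contribute almost nothing, which is what produces a core-periphery gradient when facilities cluster near urban cores.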

[LG-64] RELEAP: Reinforcement-Enhanced Label-Efficient Active Phenotyping for Electronic Health Records

Link: https://arxiv.org/abs/2511.07473
Authors: Yang Yang(1),Kathryn Pollak(2,3),Bibhas Chakraborty(1,4,5,6),Molei Liu(7,8),Doudou Zhou(6),Chuan Hong(1) ((1) Department of Biostatistics and Bioinformatics, Duke University, Durham, USA, (2) Duke Cancer Institute, Durham, USA, (3) Department of Population Health Sciences, Duke University School of Medicine, Durham, USA, (4) Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore, (5) Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore, (6) Department of Statistics and Data Science, National University of Singapore, Singapore, (7) Department of Biostatistics, Peking University Health Science Center, Beijing, China, (8) Beijing International Center for Mathematical Research, Peking University, Beijing, China)
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments: 20 pages, 5 figures, 1 table. Includes supplementary material. Submitted to JAMIA Open. † These authors contributed equally. *Corresponding author: Chuan Hong

Abstract:Objective: Electronic health record (EHR) phenotyping often relies on noisy proxy labels, which undermine the reliability of downstream risk prediction. Active learning can reduce annotation costs, but most rely on fixed heuristics and do not ensure that phenotype refinement improves prediction performance. Our goal was to develop a framework that directly uses downstream prediction performance as feedback to guide phenotype correction and sample selection under constrained labeling budgets. Materials and Methods: We propose Reinforcement-Enhanced Label-Efficient Active Phenotyping (RELEAP), a reinforcement learning-based active learning framework. RELEAP adaptively integrates multiple querying strategies and, unlike prior methods, updates its policy based on feedback from downstream models. We evaluated RELEAP on a de-identified Duke University Health System (DUHS) cohort (2014-2024) for incident lung cancer risk prediction, using logistic regression and penalized Cox survival models. Performance was benchmarked against noisy-label baselines and single-strategy active learning. Results: RELEAP consistently outperformed all baselines. Logistic AUC increased from 0.774 to 0.805 and survival C-index from 0.718 to 0.752. Using downstream performance as feedback, RELEAP produced smoother and more stable gains than heuristic methods under the same labeling budget. Discussion: By linking phenotype refinement to prediction outcomes, RELEAP learns which samples most improve downstream discrimination and calibration, offering a more principled alternative to fixed active learning rules. Conclusion: RELEAP optimizes phenotype correction through downstream feedback, offering a scalable, label-efficient paradigm that reduces manual chart review and enhances the reliability of EHR-based risk prediction.

[LG-65] Slimmable NAM: Neural Amp Models with adjustable runtime computational cost NEURIPS2025

Link: https://arxiv.org/abs/2511.07470
Authors: Steven Atkinson
Categories: Machine Learning (cs.LG)
Comments: 2 pages, 2 figures. Accepted to NeurIPS 2025 workshop on AI for Music

Click to view abstract

Abstract:This work demonstrates “slimmable Neural Amp Models”, whose size and computational cost can be changed without additional training and with negligible computational overhead, enabling musicians to easily trade off between the accuracy and compute of the models they are using. The method’s performance is quantified against commonly-used baselines, and a real-time demonstration of the model in an audio effect plug-in is developed.

[LG-66] Resource Allocation in Hybrid Radio-Optical IoT Networks using GNN with Multi-task Learning

Link: https://arxiv.org/abs/2511.07428
Authors: Aymen Hamrouni,Sofie Pollin,Hazem Sallouha
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Comments: 20 pages, 17 figures, 3 tables

Click to view abstract

[LG-67] Galactification: painting galaxies onto dark matter only simulations using a transformer-based model NEURIPS2025

Link: https://arxiv.org/abs/2511.08438
Authors: Shivam Pandey,Christopher C. Lovell,Chirag Modi,Benjamin D. Wandelt
Categories: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: 8 pages, 4 figures. Accepted at the Machine Learning and the Physical Sciences Workshop at NeurIPS 2025

Click to view abstract

Abstract:Connecting the formation and evolution of galaxies to the large-scale structure is crucial for interpreting cosmological observations. While hydrodynamical simulations accurately model the correlated properties of galaxies, they are computationally prohibitive to run over volumes that match modern surveys. We address this by developing a framework to rapidly generate mock galaxy catalogs conditioned on inexpensive dark-matter-only simulations. We present a multi-modal, transformer-based model that takes 3D dark matter density and velocity fields as input, and outputs a corresponding point cloud of galaxies with their physical properties. We demonstrate that our trained model faithfully reproduces a variety of galaxy summary statistics and correctly captures their variation with changes in the underlying cosmological and astrophysical parameters, making it the first accelerated forward model to capture all the relevant galaxy properties, their full spatial distribution, and their conditional dependencies in hydrosimulations.

[LG-68] Identification of Empirical Constitutive Models for Age-Hardenable Aluminium Alloy and High-Chromium Martensitic Steel Using Symbolic Regression

Link: https://arxiv.org/abs/2511.08424
Authors: Evgeniya Kabliman,Gabriel Kronberger
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: Accepted for publication in Special Issue on Symbolic Regression of the Philosophical Transactions of the Royal Society - Part A

Click to view abstract

[LG-69] Generative AI Meets 6G and Beyond: Diffusion Models for Semantic Communications

Link: https://arxiv.org/abs/2511.08416
Authors: Hai-Long Qin,Jincheng Dai,Guo Lu,Shuo Shao,Sixian Wang,Tongda Xu,Wenjun Zhang,Ping Zhang,Khaled B. Letaief
Categories: Signal Processing (eess.SP); Information Theory (cs.IT); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: Under review, GitHub repository: this https URL

Click to view abstract

[LG-70] Source-Optimal Training is Transfer-Suboptimal

Link: https://arxiv.org/abs/2511.08401
Authors: C. Evans Hedges
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:

Click to view abstract

[LG-71] An Information-Minimal Geometry for Qubit-Efficient Optimization

Link: https://arxiv.org/abs/2511.08362
Authors: Gordon Ma,Dimitris G. Angelakis
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 39 pages, 9 figures

Click to view abstract

[LG-72] Revealing the Hidden Third Dimension of Point Defects in Two-Dimensional MXenes

Link: https://arxiv.org/abs/2511.08350
Authors: Grace Guinan,Michelle A. Smeaton,Brian C. Wyatt,Steven Goldy,Hilary Egan,Andrew Glaws,Garritt J. Tucker,Babak Anasori,Steven R. Spurgeon
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 38 pages, 13 figures

Click to view abstract

Abstract:Point defects govern many important functional properties of two-dimensional (2D) materials. However, resolving the three-dimensional (3D) arrangement of these defects in multi-layer 2D materials remains a fundamental challenge, hindering rational defect engineering. Here, we overcome this limitation using an artificial intelligence-guided electron microscopy workflow to map the 3D topology and clustering of atomic vacancies in Ti₃C₂Tₓ MXene. Our approach reconstructs the 3D coordinates of vacancies across hundreds of thousands of lattice sites, generating robust statistical insight into their distribution that can be correlated with specific synthesis pathways. This large-scale data enables us to classify a hierarchy of defect structures–from isolated vacancies to nanopores–revealing their preferred formation and interaction mechanisms, as corroborated by molecular dynamics simulations. This work provides a generalizable framework for understanding and ultimately controlling point defects across large volumes, paving the way for the rational design of defect-engineered functional 2D materials.

[LG-73] Concentration bounds on response-based vector embeddings of black-box generative models

Link: https://arxiv.org/abs/2511.08307
Authors: Aranyak Acharyya,Joshua Agterberg,Youngser Park,Carey E. Priebe
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

[LG-74] Semi-Supervised Treatment Effect Estimation with Unlabeled Covariates via Generalized Riesz Regression

Link: https://arxiv.org/abs/2511.08303
Authors: Masahiro Kato
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST); Methodology (stat.ME)
Comments:

Click to view abstract

[LG-75] A Fast and Accurate Approach for Covariance Matrix Construction

Link: https://arxiv.org/abs/2511.08223
Authors: Felix Reichel
Categories: Computation (stat.CO); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 12 pages, 7 figures

Click to view abstract

[LG-76] Emulating Radiative Transfer in Astrophysical Environments

Link: https://arxiv.org/abs/2511.08219
Authors: Rune Rost,Lorenzo Branca,Tobias Buck
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG)
Comments: Accepted at the Differentiable Systems and Scientific Machine Learning workshop at EurIPS, 2025

Click to view abstract

[LG-77] From Classical to Hybrid: A Practical Framework for Quantum-Enhanced Learning

Link: https://arxiv.org/abs/2511.08205
Authors: Silvie Illésová,Tomáš Bezděk,Vojtěch Novák,Ivan Zelinka,Stefano Cacciatore,Martin Beseda
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments:

Click to view abstract

[LG-78] Good flavor search in SU(5): a machine learning approach

Link: https://arxiv.org/abs/2511.08154
Authors: Fayez Abu-Ajamieh,Shinsuke Kawai,Nobuchika Okada
Categories: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Theory (hep-th)
Comments: 14 pages, 9 figures

Click to view abstract

[LG-79] PrAda-GAN: A Private Adaptive Generative Adversarial Network with Bayes Network Structure

Link: https://arxiv.org/abs/2511.07997
Authors: Ke Jia,Yuheng Ma,Yang Li,Feifei Wang
Categories: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Click to view abstract

Abstract:We revisit the problem of generating synthetic data under differential privacy. To address the core limitations of marginal-based methods, we propose the Private Adaptive Generative Adversarial Network with Bayes Network Structure (PrAda-GAN), which integrates the strengths of both GAN-based and marginal-based approaches. Our method adopts a sequential generator architecture to capture complex dependencies among variables, while adaptively regularizing the learned structure to promote sparsity in the underlying Bayes network. Theoretically, we establish diminishing bounds on the parameter distance, variable selection error, and Wasserstein distance. Our analysis shows that leveraging dependency sparsity leads to significant improvements in convergence rates. Empirically, experiments on both synthetic and real-world datasets demonstrate that PrAda-GAN outperforms existing tabular data synthesis methods in terms of the privacy-utility trade-off.

[LG-80] Distributionally Robust Online Markov Game with Linear Function Approximation AAAI2026

Link: https://arxiv.org/abs/2511.07831
Authors: Zewu Zheng,Yuanyuan Lin
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: To be published in the Proceedings of AAAI 2026

Click to view abstract

Abstract:The sim-to-real gap, where agents trained in a simulator face significant performance degradation during testing, is a fundamental challenge in reinforcement learning. Extensive work adopts the framework of distributionally robust RL to learn a policy that acts robustly under worst-case environment shift. Within this framework, our objective is to devise algorithms that are sample efficient with interactive data collection and large state spaces. By assuming d-rectangularity of environment dynamic shift, we identify a fundamental hardness result for learning in online Markov games, and address it by adopting a minimum value assumption. Then, a novel least-squares value iteration type algorithm, DR-CCE-LSI, with an exploration bonus devised specifically for multiple agents, is proposed to find an $\epsilon$-approximate robust Coarse Correlated Equilibrium (CCE). To obtain sample-efficient learning, we find that when the feature mapping function satisfies certain properties, our algorithm, DR-CCE-LSI, is able to achieve an $\epsilon$-approximate CCE with a regret bound of $O\left(dH \min\{H, 1/\min_i \sigma_i\} \sqrt{K}\right)$, where $K$ is the number of interacting episodes, $H$ is the horizon length, $d$ is the feature dimension, and $\sigma_i$ represents the uncertainty level of player $i$. Our work introduces the first sample-efficient algorithm for this setting, matches the best result so far in the single-agent setting, and achieves minimax optimal sample complexity in terms of the feature dimension $d$. Meanwhile, we also conduct a simulation study to validate the efficacy of our algorithm in learning a robust equilibrium.

[LG-81] Misaligned by Design: Incentive Failures in Machine Learning

Link: https://arxiv.org/abs/2511.07699
Authors: David Autor,Andrew Caplin,Daniel Martin,Philip Marx
Categories: General Economics (econ.GN); Machine Learning (cs.LG)
Comments:

Click to view abstract

[LG-82] Kolmogorov-Arnold Chemical Reaction Neural Networks for learning pressure-dependent kinetic rate laws

Link: https://arxiv.org/abs/2511.07686
Authors: Benjamin C. Koenig,Sili Deng
Categories: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
Comments: 5 pages, 4 figures

Click to view abstract

Abstract:Chemical Reaction Neural Networks (CRNNs) have emerged as an interpretable machine learning framework for discovering reaction kinetics directly from data, while strictly adhering to the Arrhenius and mass action laws. However, standard CRNNs cannot represent pressure-dependent rate behavior, which is critical in many combustion and chemical systems and typically requires empirical formulations such as Troe or PLOG. Here, we develop Kolmogorov-Arnold Chemical Reaction Neural Networks (KA-CRNNs) that generalize CRNNs by modeling each kinetic parameter as a learnable function of system pressure using Kolmogorov-Arnold activations. This structure maintains full interpretability and physical consistency while enabling assumption-free inference of pressure effects directly from data. A proof-of-concept study on the CH3 recombination reaction demonstrates that KA-CRNNs accurately reproduce pressure-dependent kinetics across a range of temperatures and pressures, outperforming conventional interpolative models. The framework establishes a foundation for data-driven discovery of extended kinetic behaviors in complex reacting systems, advancing interpretable and physics-consistent approaches for chemical model inference.
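
To make the idea of pressure-dependent kinetic parameters concrete, here is a minimal sketch that evaluates the textbook modified Arrhenius law with each parameter supplied as a function of log-pressure. The callables `ln_A_fn`, `b_fn`, and `Ea_fn` are hypothetical stand-ins for the learned Kolmogorov-Arnold activations, the choice of `ln(P)` as their input is an assumption, and the toy numbers are made up; none of this is taken from the paper's implementation.

```python
import math

R = 8.314462618  # universal gas constant, J/(mol*K)

def arrhenius_rate(T, ln_A, b, Ea):
    """Modified Arrhenius law: k = A * T**b * exp(-Ea / (R * T))."""
    return math.exp(ln_A + b * math.log(T) - Ea / (R * T))

def pressure_dependent_rate(T, P, ln_A_fn, b_fn, Ea_fn):
    """Evaluate the rate with each parameter given as a function of ln(P).

    ln_A_fn, b_fn, Ea_fn are hypothetical stand-ins for learned
    univariate functions; any Python callables work here.
    """
    x = math.log(P)
    return arrhenius_rate(T, ln_A_fn(x), b_fn(x), Ea_fn(x))

# Toy parameter curves (made-up numbers, for illustration only).
def toy_ln_A(x):
    return 20.0 + 0.5 * x

def toy_b(x):
    return 0.0

def toy_Ea(x):
    return 5.0e4  # J/mol

k_low = pressure_dependent_rate(1000.0, 1.0e4, toy_ln_A, toy_b, toy_Ea)
k_high = pressure_dependent_rate(1000.0, 1.0e6, toy_ln_A, toy_b, toy_Ea)
print(k_high > k_low)
```

Working in log space keeps the rate positive and reduces the law to a sum of three terms, which is convenient when the parameters are fitted from data.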

[LG-83] Robust Experimental Design via Generalised Bayesian Inference

Link: https://arxiv.org/abs/2511.07671
Authors: Yasir Zubayr Barlas,Sabina J. Sloman,Samuel Kaski
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 12 main pages, 43 pages in total

Click to view abstract

[LG-84] Infinite-Dimensional Operator/Block Kaczmarz Algorithms: Regret Bounds and λ-Effectiveness

Link: https://arxiv.org/abs/2511.07604
Authors: Halyun Jeong,Palle E.T. Jorgensen,Hyun-Kyoung Kwon,Myung-Sin Song
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Functional Analysis (math.FA)
Comments: Submitted to a journal

Click to view abstract

[LG-85] Shocks Under Control: Taming Transonic Compressible Flow over an RAE2822 Airfoil with Deep Reinforcement Learning

Link: https://arxiv.org/abs/2511.07564
Authors: Trishit Mondal,Ricardo Vinuesa,Ameya D. Jagtap
Categories: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments: 23 pages, 18 figures

Click to view abstract

[LG-86] Tractable Instances of Bilinear Maximization: Implementing LinUCB on Ellipsoids

Link: https://arxiv.org/abs/2511.07504
Authors: Raymond Zhang,Hédi Hadiji,Richard Combes
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 27 pages, 8 figures, 4 algos

Click to view abstract

[LG-87] RL-Exec: Impact-Aware Reinforcement Learning for Opportunistic Optimal Liquidation Outperforms TWAP and a Book-Liquidity VWAP on BTC-USD Replays

Link: https://arxiv.org/abs/2511.07434
Authors: Enzo Duflot,Stanislas Robineau
Categories: Statistical Finance (q-fin.ST); Machine Learning (cs.LG); Trading and Market Microstructure (q-fin.TR)
Comments: 8 pages main text, 3 appendix pages, 10 figures

Click to view abstract

Information Retrieval

[IR-0] Advancing Scientific Knowledge Retrieval and Reuse with a Novel Digital Library for Machine-Readable Knowledge

Link: https://arxiv.org/abs/2511.08476
Authors: Hadi Ghaemi,Lauren Snyder,Markus Stocker
Categories: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Digital libraries for research, such as the ACM Digital Library or Semantic Scholar, do not enable the machine-supported, efficient reuse of scientific knowledge (e.g., in synthesis research). This is because these libraries are based on document-centric models with narrative text knowledge expressions that require manual or semi-automated knowledge extraction, structuring, and organization. We present ORKG reborn, an emerging digital library that supports finding, accessing, and reusing accurate, fine-grained, and reproducible machine-readable expressions of scientific knowledge that relate scientific statements and their supporting evidence in terms of data and code. The rich expressions of scientific knowledge are published as reborn (born-reusable) articles and provide novel possibilities for scientific knowledge retrieval, for instance by statistical methods, software packages, variables, or data matching specific constraints. We describe the proposed system and demonstrate its practical viability and potential for information retrieval in contrast to state-of-the-art digital libraries and document-centric scholarly communication using several published articles in research fields ranging from computer science to soil science. Our work underscores the enormous potential of scientific knowledge databases and a viable approach to their construction.

[IR-1] DiffuGR: Generative Document Retrieval with Diffusion Language Models

Link: https://arxiv.org/abs/2511.08150
Authors: Xinpeng Zhao,Yukun Zhao,Zhenyang Li,Mengqi Zhang,Jun Feng,Ran Chen,Ying Zhou,Zhumin Chen,Shuaiqiang Wang,Zhaochun Ren,Dawei Yin,Xin Xin
Categories: Information Retrieval (cs.IR)
Comments: This paper is under review

Click to view abstract

Abstract:Generative retrieval (GR) re-frames document retrieval as a sequence-based document identifier (DocID) generation task, memorizing documents with model parameters and enabling end-to-end retrieval without explicit indexing. Existing GR methods are based on auto-regressive generative models, i.e., the token generation is performed from left to right. However, such auto-regressive methods suffer from: (1) mismatch between DocID generation and natural language generation, e.g., an incorrect DocID token generated in early left steps would lead to totally erroneous retrieval; and (2) failure to balance the trade-off between retrieval efficiency and accuracy dynamically, which is crucial for practical applications. To address these limitations, we propose generative document retrieval with diffusion language models, dubbed DiffuGR. It models DocID generation as a discrete diffusion process: during training, DocIDs are corrupted through a stochastic masking process, and a diffusion language model is learned to recover them under a retrieval-aware objective. For inference, DiffuGR attempts to generate DocID tokens in parallel and refines them through a controllable number of denoising steps. In contrast to conventional left-to-right auto-regressive decoding, DiffuGR provides a novel mechanism to first generate more confident DocID tokens and refine the generation through diffusion-based denoising. Moreover, DiffuGR also offers explicit runtime control over the quality-latency tradeoff. Extensive experiments on benchmark retrieval datasets show that DiffuGR is competitive with strong auto-regressive generative retrievers, while offering flexible speed and accuracy tradeoffs through variable denoising budgets. Overall, our results indicate that non-autoregressive diffusion models are a practical and effective alternative for generative document retrieval.
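
The confidence-first decoding described in the abstract above can be sketched roughly as follows: start from a fully masked DocID and, at each denoising step, commit the most confident predictions among the still-masked positions. Everything here is hypothetical, including the `predict` interface, the equal-share unmasking schedule, and the toy tokens. Unlike a real diffusion language model, this simplified sketch never revises a token once it is committed.

```python
MASK = None  # placeholder for a masked position

def denoise(length, predict, steps):
    """Iteratively fill a fully masked DocID, most confident positions first.

    predict(tokens) must return {position: (token, confidence)} for every
    masked position; it stands in for a model (hypothetical interface).
    """
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        # Unmask an equal share of the remaining positions at each step.
        quota = -(-len(masked) // (steps - step))  # ceiling division
        proposals = predict(tokens)
        ranked = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:quota]:
            tokens[i] = proposals[i][0]
    return tokens

# Toy stand-in for a model: fixed tokens and confidences (made-up values).
TARGET = ["d3", "d1", "d4", "d1"]
CONFIDENCE = [0.9, 0.4, 0.7, 0.2]

def toy_predict(tokens):
    return {i: (TARGET[i], CONFIDENCE[i])
            for i, t in enumerate(tokens) if t is MASK}

print(denoise(4, toy_predict, steps=2))
```

The `steps` argument plays the role of the denoising budget: fewer steps mean fewer model calls, while more steps let later predictions condition on tokens that were committed earlier.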

[IR-2] From IDs to Semantics: A Generative Framework for Cross-Domain Recommendation with Adaptive Semantic Tokenization

Link: https://arxiv.org/abs/2511.08006
Authors: Peiyu Hu,Wayne Lu,Jia Wang
Categories: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Cross-domain recommendation (CDR) is crucial for improving recommendation accuracy and generalization, yet traditional methods are often hindered by the reliance on shared user/item IDs, which are unavailable in most real-world scenarios. Consequently, many efforts have focused on learning disentangled representations through multi-domain joint training to bridge the domain gaps. Although recent Large Language Model (LLM)-based approaches show promise, they still face critical challenges, including: (1) the item ID tokenization dilemma, which leads to vocabulary explosion and fails to capture high-order collaborative knowledge; and (2) insufficient domain-specific modeling for the complex evolution of user interests and item semantics. To address these limitations, we propose GenCDR, a novel Generative Cross-Domain Recommendation framework. GenCDR first employs a Domain-adaptive Tokenization module, which generates disentangled semantic IDs for items by dynamically routing between a universal encoder and domain-specific adapters. Symmetrically, a Cross-domain Autoregressive Recommendation module models user preferences by fusing universal and domain-specific interests. Finally, a Domain-aware Prefix-tree enables efficient and accurate generation. Extensive experiments on multiple real-world datasets demonstrate that GenCDR significantly outperforms state-of-the-art baselines. Our code is available in the supplementary materials.

[IR-3] TurkEmbed4Retrieval: Turkish Embedding Model for Retrieval Task

Link: https://arxiv.org/abs/2511.07595
Authors: Özay Ezerceli,Gizem Gümüşçekiçci,Tuğba Erkoç,Berke Özenç
Categories: Information Retrieval (cs.IR)
Comments: 4 pages, in Turkish language, 1 figure, conference

Click to view abstract

Abstract:In this work, we introduce TurkEmbed4Retrieval, a retrieval-specialized variant of the TurkEmbed model originally designed for Natural Language Inference (NLI) and Semantic Textual Similarity (STS) tasks. By fine-tuning the base model on the MS MARCO TR dataset using advanced training techniques, including Matryoshka representation learning and a tailored multiple negatives ranking loss, we achieve SOTA performance for Turkish retrieval tasks. Extensive experiments demonstrate that our model outperforms Turkish colBERT by 19.26% on key retrieval metrics for the Scifact TR dataset, thereby establishing a new benchmark for Turkish information retrieval.
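
For readers unfamiliar with the two training techniques named in the abstract above, the sketch below shows the standard in-batch form of a multiple negatives ranking loss and a Matryoshka-style average over truncated embedding prefixes. It is a plain-Python illustration under assumptions: the `scale` value, the prefix sizes in `dims`, and the toy vectors are made up, and the paper's "tailored" loss is not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnr_loss(queries, docs, scale=20.0):
    """In-batch ranking loss: docs[i] is the positive for queries[i],
    and every other document in the batch serves as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, d) for d in docs]
        m = max(logits)
        # Numerically stable log-sum-exp over the batch.
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # cross-entropy with the positive at index i
    return total / len(queries)

def matryoshka_loss(queries, docs, dims=(2, 4)):
    """Average the ranking loss over truncated embedding prefixes."""
    losses = [mnr_loss([q[:k] for q in queries], [d[:k] for d in docs])
              for k in dims]
    return sum(losses) / len(losses)

# Toy batch (made-up vectors): each query matches the document at its index.
queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
aligned = matryoshka_loss(queries, queries)
swapped = matryoshka_loss(queries, queries[::-1])
print(aligned < swapped)
```

Training on several prefix lengths at once is what lets a single model serve shorter embeddings at retrieval time, at some cost in accuracy.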

Attachment Download

Click to download today's full paper list