This post contains the latest paper list fetched from Arxiv.org on 2025-12-25. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the list by email on a schedule, please leave your email address in the comments.

Note: the paper data is fetched from Arxiv.org and updated automatically around 12:00 every morning.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-12-25)

A total of 356 papers were updated today, including:

  • Natural Language Processing: 43 papers (Computation and Language, cs.CL)
  • Artificial Intelligence: 115 papers (cs.AI)
  • Computer Vision: 77 papers (Computer Vision and Pattern Recognition, cs.CV)
  • Machine Learning: 111 papers (cs.LG)

Natural Language Processing

[NLP-0] Optimizing Decoding Paths in Masked Diffusion Models by Quantifying Uncertainty

[Quick Read]: This paper targets the instability in output quality of Masked Diffusion Models (MDMs), whose generations are highly sensitive to the decoding order. The authors are the first to formalize this variability as the result of cumulative predictive uncertainty along the generative path, and they introduce Denoising Entropy, a computable metric that quantifies this uncertainty and serves as an internal signal for evaluating the generative process. The key to the solution is using Denoising Entropy to drive two decoding-path optimization algorithms: a post-hoc selection method and a real-time guidance strategy, both of which consistently improve generation quality on challenging reasoning, planning, and code benchmarks. The work turns uncertainty in MDMs from a liability into a controllable advantage, providing both a principled basis and practical tools for high-quality generation.

Link: https://arxiv.org/abs/2512.21336
Authors: Ziyu Chen,Xinbei Jiang,Peng Sun,Tao Lin
Institutions: Zhejiang University; Westlake University; University of Chicago
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Masked Diffusion Models (MDMs) offer flexible, non-autoregressive generation, but this freedom introduces a challenge: final output quality is highly sensitive to the decoding order. We are the first to formalize this issue, attributing the variability in output quality to the cumulative predictive uncertainty along a generative path. To quantify this uncertainty, we introduce Denoising Entropy, a computable metric that serves as an internal signal for evaluating generative process. Leveraging this metric, we propose two algorithms designed to optimize the decoding path: a post-hoc selection method and a real-time guidance strategy. Experiments demonstrate that our entropy-guided methods significantly improve generation quality, consistently boosting accuracy on challenging reasoning, planning, and code benchmarks. Our work establishes Denoising Entropy as a principled tool for understanding and controlling generation, effectively turning the uncertainty in MDMs from a liability into a key advantage for discovering high-quality solutions.
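
A minimal sketch (not the authors' code) of how a per-step entropy signal can steer the unmasking order in a masked-diffusion-style decoder. The greedy lowest-entropy rule and the function name below are illustrative assumptions; the paper's post-hoc variant would instead sample several full paths and keep the one with the lowest cumulative entropy.

```python
import torch
import torch.nn.functional as F

def entropy_guided_unmask_step(logits, is_masked):
    """Pick the masked position whose predictive distribution has the
    lowest entropy (highest confidence), fill it greedily, and return the
    step entropy as its contribution to the path's cumulative uncertainty.

    logits:    (seq_len, vocab_size) denoiser outputs for all positions
    is_masked: (seq_len,) bool tensor, True at still-masked positions
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)             # (seq_len,)
    entropy = entropy.masked_fill(~is_masked, float("inf"))
    pos = int(entropy.argmin())                            # most certain slot
    token = int(probs[pos].argmax())                       # greedy fill
    return pos, token, float(entropy[pos])
```

Summing the returned entropies over a complete decoding run gives one candidate path's total uncertainty, which is the quantity a post-hoc selector could minimize across sampled paths.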

[NLP-1] C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling

[Quick Read]: This paper addresses the information bottleneck in sequence representations produced by code embedding models: conventional EOS-based embeddings cannot aggregate information from all tokens in a sequence, limiting performance on code-understanding tasks. The proposed C2LLM (Contrastive Code Large Language Models) family centers on a Pooling by Multihead Attention (PMA) module that exploits the causal representations the LLM acquired during pretraining while aggregating information from every token in the sequence via multi-head attention, breaking the EOS bottleneck. PMA also supports flexible adjustment of the embedding dimension, serving as an alternative to MRL and markedly improving the quality and generality of code embeddings.

Link: https://arxiv.org/abs/2512.21332
Authors: Jin Qin,Zihan Liao,Ziyin Zhang,Hang Yu,Peng Di,Rui Wang
Institutions: Ant Group; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present C2LLM - Contrastive Code Large Language Models, a family of code embedding models in both 0.5B and 7B sizes. Building upon Qwen-2.5-Coder backbones, C2LLM adopts a Pooling by Multihead Attention (PMA) module for generating sequence embedding from token embeddings, effectively 1) utilizing the LLM’s causal representations acquired during pretraining, while also 2) being able to aggregate information from all tokens in the sequence, breaking the information bottleneck in EOS-based sequence embeddings, and 3) supporting flexible adaptation of embedding dimension, serving as an alternative to MRL. Trained on three million publicly available data, C2LLM models set new records on MTEB-Code among models of similar sizes, with C2LLM-7B ranking 1st on the overall leaderboard.
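
As a rough illustration of the pooling mechanism described above, the sketch below implements a generic pooling-by-multihead-attention head in PyTorch. Class and parameter names are assumptions for illustration, not C2LLM's actual implementation.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Generic pooling-by-multihead-attention head: a learnable query
    attends over all token states, so the sequence embedding is not
    bottlenecked by the single EOS hidden state, and the output projection
    decouples the embedding size from the backbone width."""

    def __init__(self, hidden_dim, embed_dim, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, hidden_dim))
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, token_states, padding_mask=None):
        # token_states: (batch, seq_len, hidden_dim) from the LM backbone
        q = self.query.expand(token_states.size(0), -1, -1)
        pooled, _ = self.attn(q, token_states, token_states,
                              key_padding_mask=padding_mask)
        return self.proj(pooled.squeeze(1))   # (batch, embed_dim)
```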

[NLP-2] Your Reasoning Benchmark May Not Test Reasoning: Revealing Perception Bottleneck in Abstract Reasoning Benchmarks

[Quick Read]: This paper questions whether the performance gap observed on abstract reasoning benchmarks such as ARC and ARC-AGI, widely used to assess AI progress, truly reflects deficiencies in machine inductive reasoning. Challenging that prevailing interpretation, the authors hypothesize that the gap stems primarily from limitations in visual perception rather than in reasoning itself. The key to their approach is a two-stage experimental pipeline that explicitly separates perception from reasoning: in the first stage, each image is independently converted into a natural-language description, stripping away cross-image inductive signals; in the second stage, rules are induced and applied using only these descriptions. Experiments show that perception is the dominant factor behind the performance gap, with roughly 80% of model failures traced to perception errors, indicating that existing benchmarks conflate perceptual and reasoning challenges and that evaluation protocols disentangling the two are needed.

Link: https://arxiv.org/abs/2512.21329
Authors: Xinhe Wang,Jin Huang,Xingjian Zhang,Tianhao Wang,Jiaqi W. Ma
Institutions: Carnegie Mellon University; University of Michigan; University of California San Diego; University of Illinois Urbana-Champaign
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Reasoning benchmarks such as the Abstraction and Reasoning Corpus (ARC) and ARC-AGI are widely used to assess progress in artificial intelligence and are often interpreted as probes of core, so-called "fluid" reasoning abilities. Despite their apparent simplicity for humans, these tasks remain challenging for frontier vision-language models (VLMs), a gap commonly attributed to deficiencies in machine reasoning. We challenge this interpretation and hypothesize that the gap arises primarily from limitations in visual perception rather than from shortcomings in inductive reasoning. To verify this hypothesis, we introduce a two-stage experimental pipeline that explicitly separates perception and reasoning. In the perception stage, each image is independently converted into a natural-language description, while in the reasoning stage a model induces and applies rules using these descriptions. This design prevents leakage of cross-image inductive signals and isolates reasoning from perception bottlenecks. Across three ARC-style datasets, Mini-ARC, ACRE, and Bongard-LOGO, we show that the perception capability is the dominant factor underlying the observed performance gap by comparing the two-stage pipeline against standard end-to-end one-stage evaluation. Manual inspection of reasoning traces in the VLM outputs further reveals that approximately 80 percent of model failures stem from perception errors. Together, these results demonstrate that ARC-style benchmarks conflate perceptual and reasoning challenges and that observed performance gaps may overstate deficiencies in machine reasoning. Our findings underscore the need for evaluation protocols that disentangle perception from reasoning when assessing progress in machine intelligence.

[NLP-3] Measuring all the noises of LLM Evals

[Quick Read]: This paper addresses the difficulty of separating signal from noise in evaluations of Large Language Models (LLMs), focusing on the noise arising from model prediction uncertainty, from question sampling, and from their combination. The core of the solution is to clearly define and quantify three types of noise: prediction noise (variation across generations for the same question), data noise (variation from sampling questions), and their total, which follows the law of total variance. The key innovation is the all-pairs paired method, which applies paired analysis to all pairs of models across many evals and settings and measures each noise component from millions of question-level predictions, substantially increasing statistical power. The method allows practitioners to assess significance without custom tests and to detect much smaller effects in controlled experiments.

Link: https://arxiv.org/abs/2512.21326
Authors: Sida Wang
Institutions: FAIR at Meta (Facebook AI Research at Meta)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments:

Abstract:Separating signal from noise is central to experimental science. Applying well-established statistical methods effectively to LLM evals requires consideration of their unique noise characteristics. We clearly define and measure three types of noise: prediction noise from generating different answers on a given question, data noise from sampling questions, and their combined total noise following the law of total variance. To emphasize relative comparisons and gain statistical power, we propose the all-pairs paired method, which applies the paired analysis to all pairs of LLMs and measures all the noise components based on millions of question-level predictions across many evals and settings. These measurements revealed clear patterns. First, each eval exhibits a characteristic and highly predictable total noise level across all model pairs. Second, paired prediction noise typically exceeds paired data noise, which means reducing prediction noise by averaging can significantly increase statistical power. These findings enable practitioners to assess significance without custom testing and to detect much smaller effects in controlled experiments.
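
The law-of-total-variance decomposition in the abstract is easy to state in code. The sketch below shows the unpaired version for a single model (the paper's all-pairs method applies the same idea to score differences between model pairs); the data layout and toy numbers are assumptions for illustration.

```python
import numpy as np

def decompose_eval_noise(scores):
    """scores: (n_questions, n_samples) per-generation scores (e.g. 0/1
    correctness) for one model on one eval.

    Law of total variance:
        Var(score) = E_q[Var(score | q)] + Var_q(E[score | q])
                   =  prediction noise   +     data noise
    """
    per_q_mean = scores.mean(axis=1)
    prediction_noise = scores.var(axis=1, ddof=0).mean()  # within-question
    data_noise = per_q_mean.var(ddof=0)                   # across questions
    return prediction_noise, data_noise, prediction_noise + data_noise

rng = np.random.default_rng(0)
toy = rng.binomial(1, 0.6, size=(500, 8)).astype(float)  # 500 questions, 8 samples each
print(decompose_eval_noise(toy))
```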

[NLP-4] Parallel Token Prediction for Language Models

[Quick Read]: This paper targets the latency bottleneck of autoregressive decoding in language models, where tokens are generated one at a time. The key to the solution is Parallel Token Prediction (PTP), a universal framework for parallel sequence generation whose core idea is to incorporate the sampling procedure into the transformer itself, so that multiple dependent tokens are predicted jointly in a single forward pass, avoiding the restrictive independence assumptions of existing multi-token prediction methods. The authors prove that PTP can represent arbitrary autoregressive sequence distributions, and that it can be trained either by distilling an existing model or via teacher-free inverse autoregressive training. On Spec-Bench it achieves state-of-the-art speculative decoding performance, accepting more than four tokens per step, demonstrating that parallel generation of long sequences is feasible without loss of modeling power.

Link: https://arxiv.org/abs/2512.21323
Authors: Felix Draxler,Justus Will,Farrin Marouf Sofian,Theofanis Karaletsos,Sameer Singh,Stephan Mandt
Institutions: University of California, Irvine; Chan-Zuckerberg Initiative; Pyramidal AI
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint. Under review

Abstract:We propose Parallel Token Prediction (PTP), a universal framework for parallel sequence generation in language models. PTP jointly predicts multiple dependent tokens in a single transformer call by incorporating the sampling procedure into the model. This reduces the latency bottleneck of autoregressive decoding, and avoids the restrictive independence assumptions common in existing multi-token prediction methods. We prove that PTP can represent arbitrary autoregressive sequence distributions. PTP is trained either by distilling an existing model or through inverse autoregressive training without a teacher. Experimentally, we achieve state-of-the-art speculative decoding performance on Vicuna-7B by accepting over four tokens per step on Spec-Bench. The universality of our framework indicates that parallel generation of long sequences is feasible without loss of modeling power.

[NLP-5] SMART SLM: Structured Memory and Reasoning Transformer, A Small Language Model for Accurate Document Assistance

[Quick Read]: This paper addresses the difficulty generative AI models have in accurately extracting and reasoning over key information from Engineering Manuals (EMs), whose content is long and densely formatted. Conventional transformers that treat such material as a flat token stream tend to produce confident but incorrect numeric answers and memorize isolated facts inefficiently. The key to the solution is the SMART (Structured Memory and Reasoning Transformer) architecture, built from three modules: (1) a syntax-aware fact extractor (the "Grammarian"), a Tree-LSTM that extracts structured subject-relation-object triples from EM sentences; (2) a compact indexed memory bank based on a Memory-Augmented Neural Network (MANN) that encodes triples as 384-dimensional vectors linked to their sources; and (3) a six-layer transformer decoder that fuses retrieved facts into its responses. With only 45.51M parameters (64% fewer than GPT-2), SMART achieves 21.3% higher accuracy, and it supports both a fast indexed path (sub-second responses) and a dynamic RAG-assisted path (FAISS top-20 retrieval truncated to a 64-slot memory), significantly reducing hallucinations and improving the traceability of results.

Link: https://arxiv.org/abs/2512.21280
Authors: Divij Dudeja,Mayukha Pal
Institutions: ABB Ability Innovation Center; Indian Institute of Information Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The user of Engineering Manuals (EM) finds it difficult to read EMs because they are long, have a dense format which includes written documents, step-by-step procedures, and standard parameter lists for engineering equipment. Off-the-shelf transformers, especially compact ones, treat this material as a flat stream of tokens. This approach leads to confident but incorrect numeric answers and forces the models to memorize separate facts inefficiently. SMART (Structured Memory and Reasoning Transformer) offers a different and practical solution to the above problem. SMART structures its processing by using a hierarchical approach, and is based upon three main job categories: (1) A syntax-aware Fact Extractor (Grammarian) Tree LSTM which extracts facts as subject relation object relations from EM sentences (2) A compact indexed memory MANN (Memory Augmented Neural Network) that indexes these Rational Subject Relation Objects as 384 dimensional vectors that are associated with the source of the information, and (3) A 6 layer Transformer that learns to fuse the previously retrieved facts into its generated response. The entire SMART model utilizes 45.51M parameters, which is 64% less than GPT-2 (124M) and 69% less than BERT (133M), and it achieves a 21.3% higher accuracy than GPT-2, indicating that SMART fits the data better with the least amount of processing requirements. SMART employs dual modes of inference: an indexed fast path for known documents (sub-second answer times) and an indexed dynamic path assisted by RAGs for new uploads (FAISS Top 20 results with memory severed at 64 slots). In real world deployment, this framework leads to more well supported results with reduced hallucinations than comparable small transformer models.
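
The retrieval path quoted in the abstract (384-dimensional fact vectors, FAISS top-20, a 64-slot memory) can be sketched as follows, assuming faiss-cpu is installed. The triples and random vectors are placeholders for what the Tree-LSTM extractor and a real encoder would produce.

```python
import numpy as np
import faiss  # pip install faiss-cpu

DIM, TOP_K, MEMORY_SLOTS = 384, 20, 64  # values quoted in the abstract

# Hypothetical stand-in for the Tree-LSTM fact extractor: each fact is a
# (subject, relation, object) triple already encoded as a 384-d vector.
triples = [("pump P-101", "max_pressure", "16 bar"),
           ("valve V-2", "material", "stainless steel")]
rng = np.random.default_rng(0)
vectors = rng.standard_normal((len(triples), DIM)).astype("float32")
faiss.normalize_L2(vectors)

index = faiss.IndexFlatIP(DIM)   # inner product on normalized vectors = cosine
index.add(vectors)

query = rng.standard_normal((1, DIM)).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, min(TOP_K, len(triples)))

# Keep at most MEMORY_SLOTS facts for the decoder to fuse into its answer.
retrieved = [triples[i] for i in ids[0] if i != -1][:MEMORY_SLOTS]
print(retrieved)
```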

[NLP-6] ReaSeq: Unleashing World Knowledge via Reasoning for Sequential Modeling

[Quick Read]: This paper studies two core problems of industrial recommender systems under the log-driven paradigm: knowledge poverty in ID-based item representations, which makes interest modeling brittle under data sparsity, and blindness to user interests beyond the platform's boundaries, which caps model performance. The key to the solution is ReaSeq, a framework that taps the world knowledge embedded in Large Language Models (LLMs) through both explicit and implicit reasoning: explicit reasoning uses multi-agent collaborative Chain-of-Thought to distill structured product knowledge into semantically enriched item representations, while latent reasoning uses diffusion large language models to infer plausible cross-domain behavioral patterns, capturing user interests beyond the logs. Deployed in Taobao's ranking system, ReaSeq delivers substantial gains across multiple metrics, validating the effectiveness of world-knowledge-enhanced reasoning.

Link: https://arxiv.org/abs/2512.21257
Authors: Chuan Wang,Gaoming Yang,Han Wu,Jiakai Tang,Jiahao Yu,Jian Wu,Jianwu Hu,Junjun Zheng,Shuwen Xiao,Yeqiu Yang,Yuning Jiang,Ahjol Nurlanbek,Binbin Cao,Bo Zheng,Fangmei Zhu,Gaoming Zhou,Huimin Yi,Huiping Chu,Jin Huang,Jinzhe Shan,Kenan Cui,Longbin Li,Silu Zhou,Wen Chen,Xia Ming,Xiang Gao,Xin Yao,Xingyu Wen,Yan Zhang,Yiwen Hu,Yulin Wang,Ziheng Bao,Zongyuan Wu
Institutions: Unknown
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:Industrial recommender systems face two fundamental limitations under the log-driven paradigm: (1) knowledge poverty in ID-based item representations that causes brittle interest modeling under data sparsity, and (2) systemic blindness to beyond-log user interests that constrains model performance within platform boundaries. These limitations stem from an over-reliance on shallow interaction statistics and closed-loop feedback while neglecting the rich world knowledge about product semantics and cross-domain behavioral patterns that Large Language Models have learned from vast corpora. To address these challenges, we introduce ReaSeq, a reasoning-enhanced framework that leverages world knowledge in Large Language Models to address both limitations through explicit and implicit reasoning. Specifically, ReaSeq employs explicit Chain-of-Thought reasoning via multi-agent collaboration to distill structured product knowledge into semantically enriched item representations, and latent reasoning via Diffusion Large Language Models to infer plausible beyond-log behaviors. Deployed on Taobao’s ranking system serving hundreds of millions of users, ReaSeq achieves substantial gains: 6.0% in IPV and CTR, 2.9% in Orders, and 2.5% in GMV, validating the effectiveness of world-knowledge-enhanced reasoning over purely log-driven approaches.

[NLP-7] SpidR-Adapt: A Universal Speech Representation Model for Few-Shot Adaptation

[Quick Read]: This paper addresses the data-efficiency gap between human infants, who acquire the basic units of a new language from only a few hundred hours of speech, and self-supervised speech models, which require far more data. The key to the solution, SpidR-Adapt, is to cast low-resource speech representation learning as a meta-learning problem and to construct a multi-task adaptive pre-training (MAdaPT) protocol formulated as a bi-level optimization framework. To keep meta-training scalable, a first-order bi-level optimization (FOBLO) heuristic avoids heavy computation, and training is stabilized through interleaved supervision that alternates self-supervised and supervised objectives. Empirically, SpidR-Adapt yields rapid gains in phonemic discriminability (ABX) and spoken language modeling (sWUGGY, sBLIMP, tSC) after training on less than one hour of target-language audio, over 100x more data-efficient than standard training, offering a practical, architecture-agnostic path toward biologically inspired, data-efficient speech representations.

Link: https://arxiv.org/abs/2512.21204
Authors: Mahi Luthra,Jiayi Shen,Maxime Poli,Angelo Ortiz,Yosuke Higuchi,Youssef Benchekroun,Martin Gleize,Charles-Eric Saint-James,Dongyan Lin,Phillip Rust,Angel Villar,Surya Parimi,Vanessa Stark,Rashel Moritz,Juan Pino,Yann LeCun,Emmanuel Dupoux
Institutions: Meta AI; ENS-PSL; EHESS; CNRS
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Human infants, with only a few hundred hours of speech exposure, acquire basic units of new languages, highlighting a striking efficiency gap compared to the data-hungry self-supervised speech models. To address this gap, this paper introduces SpidR-Adapt for rapid adaptation to new languages using minimal unlabeled data. We cast such low-resource speech representation learning as a meta-learning problem and construct a multi-task adaptive pre-training (MAdaPT) protocol which formulates the adaptation process as a bi-level optimization framework. To enable scalable meta-training under this framework, we propose a novel heuristic solution, first-order bi-level optimization (FOBLO), avoiding heavy computation costs. Finally, we stabilize meta-training by using a robust initialization through interleaved supervision which alternates self-supervised and supervised objectives. Empirically, SpidR-Adapt achieves rapid gains in phonemic discriminability (ABX) and spoken language modeling (sWUGGY, sBLIMP, tSC), improving over in-domain language models after training on less than 1h of target-language audio, over 100x more data-efficient than standard training. These findings highlight a practical, architecture-agnostic path toward biologically inspired, data-efficient representations. We open-source the training code and model checkpoints at this https URL.

[NLP-8] ClarifyMT-Bench: Benchmarking and Improving Multi-Turn Clarification for Conversational Large Language Models

[Quick Read]: This paper addresses the lack of mechanisms for evaluating and improving the clarification behavior of Large Language Models (LLMs) when users provide incomplete or ambiguous input in open-domain, multi-turn dialogue. Existing benchmarks mostly assume single-turn interactions or cooperative users and fail to reflect realistic human-model interaction. The key to the solution is ClarifyMT-Bench, a multi-turn clarification benchmark grounded in a five-dimensional ambiguity taxonomy and six behaviorally diverse simulated user personas, together with ClarifyAgent, an agentic method that decomposes clarification into perception, forecasting, tracking, and planning modules and substantially improves robustness across ambiguity conditions.

Link: https://arxiv.org/abs/2512.21120
Authors: Sichun Luo,Yi Huang,Mukai Li,Shichang Meng,Fengyuan Liu,Zefa Hu,Junlan Feng,Qi Liu
Institutions: The University of Hong Kong; JIUTIAN Research, China Mobile; CityUHK
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Large language models (LLMs) are increasingly deployed as conversational assistants in open-domain, multi-turn settings, where users often provide incomplete or ambiguous information. However, existing LLM-focused clarification benchmarks primarily assume single-turn interactions or cooperative users, limiting their ability to evaluate clarification behavior in realistic settings. We introduce ClarifyMT-Bench, a benchmark for multi-turn clarification grounded in a five-dimensional ambiguity taxonomy and a set of six behaviorally diverse simulated user personas. Through a hybrid LLM-human pipeline, we construct 6,120 multi-turn dialogues capturing diverse ambiguity sources and interaction patterns. Evaluating ten representative LLMs uncovers a consistent under-clarification bias: LLMs tend to answer prematurely, and performance degrades as dialogue depth increases. To mitigate this, we propose ClarifyAgent, an agentic approach that decomposes clarification into perception, forecasting, tracking, and planning, substantially improving robustness across ambiguity conditions. ClarifyMT-Bench establishes a reproducible foundation for studying when LLMs should ask, when they should answer, and how to navigate ambiguity in real-world human-LLM interactions.

[NLP-9] Beyond Context: Large Language Models' Failure to Grasp Users' Intent

[Quick Read]: This paper exposes a critical vulnerability in current safety mechanisms of Large Language Models (LLMs): the inability to understand context and recognize user intent, which malicious users can systematically exploit to circumvent safety measures. The authors argue that existing approaches focus on identifying explicitly harmful content while neglecting implicit intent. The key takeaway is that contextual understanding and intent recognition must become core, architecture-level safety capabilities rather than post-hoc protective mechanisms or content filters. Empirically, enabling reasoning amplified rather than mitigated exploitation, increasing factual precision while failing to interrogate underlying intent; only a few models, such as Claude Opus 4.1 in certain scenarios, prioritized intent detection over information provision, pointing to the need for a paradigmatic shift in safety design.

Link: https://arxiv.org/abs/2512.21110
Authors: Ahmed M. Hussain,Salahuddin Salahuddin,Panos Papadimitratos
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Comments: 22 pages and 23 figures

Abstract:Current Large Language Models (LLMs) safety approaches focus on explicitly harmful content while overlooking a critical vulnerability: the inability to understand context and recognize user intent. This creates exploitable vulnerabilities that malicious users can systematically leverage to circumvent safety mechanisms. We empirically evaluate multiple state-of-the-art LLMs, including ChatGPT, Claude, Gemini, and DeepSeek. Our analysis demonstrates the circumvention of reliable safety mechanisms through emotional framing, progressive revelation, and academic justification techniques. Notably, reasoning-enabled configurations amplified rather than mitigated the effectiveness of exploitation, increasing factual precision while failing to interrogate the underlying intent. The exception was Claude Opus 4.1, which prioritized intent detection over information provision in some use cases. This pattern reveals that current architectural designs create systematic vulnerabilities. These limitations require paradigmatic shifts toward contextual understanding and intent recognition as core safety capabilities rather than post-hoc protective mechanisms.

[NLP-10] Semi-Supervised Learning for Large Language Models Safety and Content Moderation

[Quick Read]: This paper addresses the dependence of LLM safety training on large quantities of labeled data, which can be hard to acquire, prone to labeling errors, and often heavily synthetic. The key to the solution is semi-supervised learning, which leverages both a small amount of labeled data and large amounts of unlabeled data to improve safety classifiers. The study further shows that task-specific data augmentation is crucial in this framework, yielding significantly better prompt- and response-level safety classification than general-purpose augmentation techniques.

Link: https://arxiv.org/abs/2512.21107
Authors: Eduard Stefan Dinuta,Iustin Sirbu,Traian Rebedea
Institutions: National University of Science and Technology Politehnica Bucharest; Renius Technologies; NVIDIA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Safety for Large Language Models (LLMs) has been an ongoing research focus since their emergence and is even more relevant nowadays with the increasing capacity of those models. Currently, there are several guardrails in place for all public LLMs and multiple proposed datasets for training safety classifiers. However, training these safety classifiers relies on large quantities of labeled data, which can be problematic to acquire, prone to labeling errors, or often include synthetic data. To address these issues, we suggest a different approach: utilizing semi-supervised learning techniques, which leverage both labeled and unlabeled data, to improve the performance on the safety task. We analyze the improvements that these techniques can offer for both prompts given to Large Language Models and the responses to those requests. Moreover, since augmentation is the central part of semi-supervised algorithms, we demonstrate the importance of using task-specific augmentations, which significantly increase the performance when compared to general-purpose augmentation techniques.

[NLP-11] Semantic Refinement with LLMs for Graph Representations

[Quick Read]: This paper addresses the generalization challenge posed by structure-semantics heterogeneity in graph data: predictive signals may originate from node semantics (e.g., text) in some domains and from structural patterns (e.g., topology) in others, so no graph learner with a fixed inductive bias is optimal everywhere. The key to the solution is DAS, a Data-Adaptive Semantic Refinement framework that couples a fixed graph neural network (GNN) with a large language model (LLM) in a closed feedback loop: the GNN provides implicit supervisory signals that guide the LLM's refinement of node semantics, and the refined semantics are fed back to update the graph representation. This yields consistent gains on structure-dominated graphs while remaining competitive on semantics-rich ones.

Link: https://arxiv.org/abs/2512.21106
Authors: Safal Thapaliya,Zehong Wang,Jiazheng Li,Ziming Li,Yanfang Ye,Chuxu Zhang
Institutions: University of Connecticut; University of Notre Dame
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Graph-structured data exhibit substantial heterogeneity in where their predictive signals originate: in some domains, node-level semantics dominate, while in others, structural patterns play a central role. This structure-semantics heterogeneity implies that no graph learning model with a fixed inductive bias can generalize optimally across diverse graph domains. However, most existing methods address this challenge from the model side by incrementally injecting new inductive biases, which remains fundamentally limited given the open-ended diversity of real-world graphs. In this work, we take a data-centric perspective and treat node semantics as a task-adaptive variable. We propose a Data-Adaptive Semantic Refinement framework DAS for graph representation learning, which couples a fixed graph neural network (GNN) and a large language model (LLM) in a closed feedback loop. The GNN provides implicit supervisory signals to guide the semantic refinement of LLM, and the refined semantics are fed back to update the same graph learner. We evaluate our approach on both text-rich and text-free graphs. Results show consistent improvements on structure-dominated graphs while remaining competitive on semantics-rich graphs, demonstrating the effectiveness of data-centric semantic adaptation under structure-semantics heterogeneity.

[NLP-12] Rethinking Supervised Fine-Tuning: Emphasizing Key Answer Tokens for Improved LLM Accuracy

[Quick Read]: This paper addresses a weakness of conventional Supervised Fine-Tuning (SFT) on complex reasoning tasks: the model allocates disproportionate attention to long Chain-of-Thought (CoT) sequences and under-weights the much shorter but decisive Key portion, the final answer, whose correctness determines task success. The key to the solution is SFTKey, a two-stage training scheme: the first stage applies conventional SFT to ensure correct output format, and the second stage fine-tunes only the Key portion to improve answer accuracy, preserving format generation while substantially improving correctness.

Link: https://arxiv.org/abs/2512.21017
Authors: Xiaofeng Shi,Qian Kou,Yuduo Li,Hua Zhou
Institutions: Beijing Academy of Artificial Intelligence; Beijing Jiaotong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:With the rapid advancement of Large Language Models (LLMs), the Chain-of-Thought (CoT) component has become significant for complex reasoning tasks. However, in conventional Supervised Fine-Tuning (SFT), the model could allocate disproportionately more attention to CoT sequences with excessive length. This reduces focus on the much shorter but essential Key portion-the final answer, whose correctness directly determines task success and evaluation quality. To address this limitation, we propose SFTKey, a two-stage training scheme. In the first stage, conventional SFT is applied to ensure proper output format, while in the second stage, only the Key portion is fine-tuned to improve accuracy. Extensive experiments across multiple benchmarks and model families demonstrate that SFTKey achieves an average accuracy improvement exceeding 5% over conventional SFT, while preserving the ability to generate correct formats. Overall, this study advances LLM fine-tuning by explicitly balancing CoT learning with additional optimization on answer-relevant tokens.
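
A minimal sketch of the stage-2 objective, assuming the Key (final-answer) tokens have been marked with a boolean mask; the exact loss formulation in SFTKey may differ.

```python
import torch
import torch.nn.functional as F

def key_only_loss(logits, labels, key_mask):
    """Cross-entropy computed only on answer ("Key") tokens; CoT tokens
    contribute nothing in this stage.

    logits:   (batch, seq_len, vocab_size)
    labels:   (batch, seq_len) shifted next-token targets
    key_mask: (batch, seq_len) bool, True where the target is a Key token
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    )
    weights = key_mask.reshape(-1).float()
    return (per_token * weights).sum() / weights.sum().clamp(min=1.0)
```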

[NLP-13] Distilling the Essence: Efficient Reasoning Distillation via Sequence Truncation

[Quick Read]: This paper addresses the high computational cost of distilling reasoning from a large language model (LLM) into a smaller student when training on long sequences consisting of prompt (P), chain-of-thought (CoT), and answer (A) segments. The key insight is that when the prompt and answer information is encompassed by the CoT, supervising only the CoT tokens achieves performance close to full-sequence training. Building on this, the authors establish a truncation protocol: training on only the first 50% of tokens of each sequence retains, on average, about 94% of full-sequence performance on math benchmarks while cutting training time, memory usage, and FLOPs by roughly 50% each, providing a simple and controllable computation-quality trade-off for reasoning distillation.

Link: https://arxiv.org/abs/2512.21002
Authors: Wei-Rui Chen,Vignesh Kothapalli,Ata Fatahibaarzi,Hejian Sang,Shao Tang,Qingquan Song,Zhipeng Wang,Muhammad Abdul-Mageed
Institutions: The University of British Columbia; LinkedIn
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Distilling the reasoning capabilities from a large language model (LLM) to a smaller student model often involves training on substantial amounts of reasoning data. However, distillation over lengthy sequences with prompt (P), chain-of-thought (CoT), and answer (A) segments makes the process computationally expensive. In this work, we investigate how the allocation of supervision across different segments (P, CoT, A) affects student performance. Our analysis shows that selective knowledge distillation over only the CoT tokens can be effective when the prompt and answer information is encompassed by it. Building on this insight, we establish a truncation protocol to quantify computation-quality tradeoffs as a function of sequence length. We observe that training on only the first 50% of tokens of every training sequence can retain, on average, about 94% of full-sequence performance on math benchmarks while reducing training time, memory usage, and FLOPs by about 50% each. These findings suggest that reasoning distillation benefits from prioritizing early reasoning tokens and provides a simple lever for computation-quality tradeoffs. Codes are available at this https URL.
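
The truncation protocol itself is simple enough to state as a one-liner; here is a hedged sketch, assuming token sequences are plain Python lists (in practice a segment mask over P/CoT/A would additionally decide which retained tokens are supervised).

```python
def truncate_for_distillation(token_ids, keep_fraction=0.5):
    """Keep only the leading fraction of a training sequence; training time,
    memory, and FLOPs then shrink roughly in proportion."""
    keep = max(1, int(len(token_ids) * keep_fraction))
    return token_ids[:keep]

example = list(range(2000))                      # a 2,000-token P+CoT+A sample
print(len(truncate_for_distillation(example)))   # -> 1000
```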

[NLP-14] Automatic Replication of LLM Mistakes in Medical Conversations

[Quick Read]: This paper addresses the difficulty of replicating and systematically evaluating the mistakes current Large Language Models (LLMs) make in clinical settings, particularly in patient-doctor conversations, where existing approaches require manual effort to turn errors into reusable benchmarks. The key to the solution is MedMistake, an automated pipeline that (1) generates complex LLM-driven patient-doctor conversations, (2) evaluates them with a committee of two LLM judges across dimensions such as reasoning quality, safety, and patient-centeredness, and (3) converts the identified mistakes into single-shot QA pairs, yielding a standardized benchmark that makes failure modes reproducible and enables efficient cross-model comparison for reliability and safety research on medical LLMs.

Link: https://arxiv.org/abs/2512.20983
Authors: Oleksii Proniakin,Diego Fajardo,Ruslan Nazarenko,Razvan Marinescu
Institutions: Lumos AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 48 pages, 3 figures, 4 tables

Abstract:Large language models (LLMs) are increasingly evaluated in clinical settings using multi-dimensional rubrics which quantify reasoning quality, safety, and patient-centeredness. Yet, replicating specific mistakes in other LLM models is not straightforward and often requires manual effort. We introduce MedMistake, an automatic pipeline that extracts mistakes LLMs make in patient-doctor conversations and converts them into a benchmark of single-shot QA pairs. Our pipeline (1) creates complex, conversational data between an LLM patient and LLM doctor, (2) runs an evaluation with a committee of 2 LLM judges across a variety of dimensions and (3) creates simplified single-shot QA scenarios from those mistakes. We release MedMistake-All, a dataset of 3,390 single-shot QA pairs where GPT-5 and Gemini 2.5 Pro are currently failing to answer correctly, as judged by two LLM judges. We used medical experts to validate a subset of 211/3390 questions (MedMistake-Bench), which we used to run a final evaluation of 12 frontier LLMs: Claude Opus 4.5, Claude Sonnet 4.5, DeepSeek-Chat, Gemini 2.5 Pro, Gemini 3 Pro, GPT-4o, GPT-5, GPT-5.1, GPT-5.2, Grok 4, Grok 4.1, Mistral Large. We found that GPT models, Claude and Grok obtained the best performance on MedMistake-Bench. We release both the doctor-validated benchmark (MedMistake-Bench), as well as the full dataset (MedMistake-All) at this https URL.

[NLP-15] Reflection Pretraining Enables Token-Level Self-Correction in Biological Sequence Models

[Quick Read]: This paper addresses why Chain-of-Thought (CoT) reasoning is hard to apply in non-natural-language domains such as protein and RNA language models: the limited language expressiveness of biological sequences, whose restricted token space (e.g., amino-acid symbols) cannot effectively encode intermediate reasoning steps. The key to the solution is reflection pretraining, introduced for the first time in a biological sequence model, which enables intermediate reasoning by generating auxiliary "thinking tokens" beyond answer tokens. The augmented token set significantly enhances language expressiveness and reasoning capacity, and experiments show that it teaches protein models to self-correct, yielding substantial performance gains over standard pretraining.

Link: https://arxiv.org/abs/2512.20954
Authors: Xiang Zhang,Jiaqi Wei,Yuejin Yang,Zijie Qiu,Yuhan Chen,Zhiqiang Gao,Muhammad Abdul-Mageed,Laks V. S. Lakshmanan,Wanli Ouyang,Chenyu You,Siqi Sun
Institutions: Fudan University; Shanghai Artificial Intelligence Laboratory; University of British Columbia; Zhejiang University; The Chinese University of Hong Kong; Stony Brook University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Chain-of-Thought (CoT) prompting has significantly advanced task-solving capabilities in natural language processing with large language models. Unlike standard prompting, CoT encourages the model to generate intermediate reasoning steps, non-answer tokens, that help guide the model toward more accurate final outputs. These intermediate steps enable more complex reasoning processes such as error correction, memory management, future planning, and self-reflection. However, applying CoT to non-natural language domains, such as protein and RNA language models, is not yet possible, primarily due to the limited expressiveness of their token spaces (e.g., amino acid tokens). In this work, we propose and define the concept of language expressiveness: the ability of a given language, using its tokens and grammar, to encode information. We show that the limited expressiveness of protein language severely restricts the applicability of CoT-style reasoning. To overcome this, we introduce reflection pretraining, for the first time in a biological sequence model, which enables the model to engage in intermediate reasoning through the generation of auxiliary “thinking tokens” beyond simple answer tokens. Theoretically, we demonstrate that our augmented token set significantly enhances biological language expressiveness, thereby improving the overall reasoning capacity of the model. Experimentally, our pretraining approach teaches protein models to self-correct and leads to substantial performance gains compared to standard pretraining.

[NLP-16] MultiMind at SemEval-2025 Task 7: Crosslingual Fact-Checked Claim Retrieval via Adaptive Cross-Attention Pooling

[Quick Read]: This paper addresses the accuracy of claim retrieval for multilingual and crosslingual fact-checking, where the rapid spread of misinformation demands efficient retrieval of relevant fact-checked claims across languages. The key to the solution is TriAligner, a dual-encoder architecture trained with contrastive learning that combines native-language and English-translated representations across modalities, learning the relative importance of different sources during alignment. Efficient data preprocessing and augmentation with LLMs plus hard negative sampling further strengthen representation learning, yielding significant improvements in retrieval accuracy and fact-checking performance on monolingual and crosslingual benchmarks.

Link: https://arxiv.org/abs/2512.20950
Authors: Mohammad Mahdi Abootorabi,Alireza Ghahramani Kure,Mohammadali Mohammadkhani,Sina Elahimanesh,Mohammad Ali Ali Panah
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 11 pages. Published at the SemEval-2025 workshop

Abstract:This paper presents our system for SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval. In an era where misinformation spreads rapidly, effective fact-checking is increasingly critical. We introduce TriAligner, a novel approach that leverages a dual-encoder architecture with contrastive learning and incorporates both native and English translations across different modalities. Our method effectively retrieves claims across multiple languages by learning the relative importance of different sources in alignment. To enhance robustness, we employ efficient data preprocessing and augmentation using large language models while incorporating hard negative sampling to improve representation learning. We evaluate our approach on monolingual and crosslingual benchmarks, demonstrating significant improvements in retrieval accuracy and fact-checking performance over baselines.
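
As an illustration of a dual-encoder contrastive setup with in-batch and mined hard negatives, here is a generic InfoNCE-style loss; the temperature, shapes, and function name are assumptions, not TriAligner's published hyperparameters.

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb, pos_emb, neg_emb, temperature=0.05):
    """Pull each query (e.g. a social-media post) toward its fact-checked
    claim and push it away from in-batch and mined hard negatives.

    query_emb: (B, d)   pos_emb: (B, d)   neg_emb: (B, n_hard, d)
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    n = F.normalize(neg_emb, dim=-1)
    pos_scores = (q * p).sum(-1, keepdim=True)      # (B, 1) true pairs
    in_batch = q @ p.T                              # (B, B) batch similarities
    hard = torch.einsum("bd,bkd->bk", q, n)         # (B, n_hard) mined negatives
    # drop the diagonal (the positives) from the in-batch matrix
    off_diag = ~torch.eye(q.size(0), dtype=torch.bool, device=q.device)
    negs = torch.cat([in_batch[off_diag].view(q.size(0), -1), hard], dim=1)
    logits = torch.cat([pos_scores, negs], dim=1) / temperature
    targets = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, targets)         # positive sits in column 0
```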

[NLP-17] Neural Probe-Based Hallucination Detection for Large Language Models

[Quick Read]: This paper addresses hallucination in Large Language Models (LLMs): existing detection methods based on uncertainty estimation and external knowledge retrieval still produce erroneous content at high confidence and depend heavily on retrieval efficiency and knowledge coverage, while traditional linear probes struggle to capture nonlinear structure in deep semantics. The key to the solution is a neural-network-based framework for token-level hallucination detection: with the LLM's parameters frozen, lightweight MLP probes model the nonlinearities of high-level hidden states, a multi-objective joint loss improves detection stability and semantic discrimination, and a layer-position/probe-performance response model with Bayesian optimization automatically searches for the best probe insertion layer. On LongFact, HealthBench, and TriviaQA, the approach significantly outperforms state-of-the-art methods in accuracy, recall, and detection under low false-positive conditions.

Link: https://arxiv.org/abs/2512.20949
Authors: Shize Liang,Hongzhi Wang
Institutions: Harbin Institute of Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) excel at text generation and knowledge question-answering tasks, but they are prone to generating hallucinated content, severely limiting their application in high-risk domains. Current hallucination detection methods based on uncertainty estimation and external knowledge retrieval suffer from the limitation that they still produce erroneous content at high confidence levels and rely heavily on retrieval efficiency and knowledge coverage. In contrast, probe methods that leverage the model’s hidden-layer states offer real-time and lightweight advantages. However, traditional linear probes struggle to capture nonlinear structures in deep semantic spaces. To overcome these limitations, we propose a neural network-based framework for token-level hallucination detection. By freezing language model parameters, we employ lightweight MLP probes to perform nonlinear modeling of high-level hidden states. A multi-objective joint loss function is designed to enhance detection stability and semantic disambiguity. Additionally, we establish a layer position-probe performance response model, using Bayesian optimization to automatically search for optimal probe insertion layers and achieve superior training efficiency. Experimental results on LongFact, HealthBench, and TriviaQA demonstrate that MLP probes significantly outperform state-of-the-art methods in accuracy, recall, and detection capability under low false-positive conditions.
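
A minimal version of such a probe is easy to write down; the layer sizes and the single-logit head below are assumptions for illustration, and the paper's multi-objective loss is not reproduced here.

```python
import torch
import torch.nn as nn

class HallucinationProbe(nn.Module):
    """Lightweight nonlinear probe over frozen hidden states: given the
    hidden vector of a token at a chosen layer, emit a logit for whether
    the token belongs to hallucinated content."""

    def __init__(self, hidden_dim, probe_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, probe_dim),
            nn.GELU(),
            nn.Linear(probe_dim, 1),
        )

    def forward(self, hidden_states):                # (batch, hidden_dim)
        return self.net(hidden_states).squeeze(-1)   # (batch,) logits

# Training (not shown) would freeze the LLM, collect hidden states at a
# candidate layer, and fit the probe with BCEWithLogitsLoss; the layer index
# itself can be treated as a hyperparameter searched by Bayesian optimization.
```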

[NLP-18] Foundation Model-based Evaluation of Neuropsychiatric Disorders: A Lifespan-Inclusive Multi-Modal and Multi-Lingual Study

[Quick Read]: This paper addresses two core problems in multi-modal detection of neuropsychiatric disorders such as Alzheimer's disease (AD), depression, and autism spectrum disorder (ASD): insufficient cross-lingual generalization and the absence of a unified evaluation framework. The key to the solution is FEND (Foundation model-based Evaluation of Neuropsychiatric Disorders), a multi-modal framework integrating speech and text, validated systematically on 13 multilingual datasets spanning English, Chinese, Greek, French, and Dutch. FEND provides reproducible benchmarks and a detailed analysis of performance-influencing factors, showing that multi-modal fusion excels for AD and depression but underperforms for ASD due to dataset heterogeneity, and that modality imbalance is a prevalent issue, thereby advancing automated, lifespan-inclusive, multilingual assessment of neuropsychiatric disorders.

Link: https://arxiv.org/abs/2512.20948
Authors: Zhongren Dong,Haotian Guo,Weixiang Xu,Huan Zhao,Zixing Zhang
Institutions: Hunan University; Shenzhen Research Institute, Hunan University; Ministry of Education Key Laboratory of Fusion Computing of Supercomputing and Artificial Intelligence, Hunan University
Categories: Computation and Language (cs.CL); Sound (cs.SD)
Comments:

Abstract:Neuropsychiatric disorders, such as Alzheimer’s disease (AD), depression, and autism spectrum disorder (ASD), are characterized by linguistic and acoustic abnormalities, offering potential biomarkers for early detection. Despite the promise of multi-modal approaches, challenges like multi-lingual generalization and the absence of a unified evaluation framework persist. To address these gaps, we propose FEND (Foundation model-based Evaluation of Neuropsychiatric Disorders), a comprehensive multi-modal framework integrating speech and text modalities for detecting AD, depression, and ASD across the lifespan. Leveraging 13 multi-lingual datasets spanning English, Chinese, Greek, French, and Dutch, we systematically evaluate multi-modal fusion performance. Our results show that multi-modal fusion excels in AD and depression detection but underperforms in ASD due to dataset heterogeneity. We also identify modality imbalance as a prevalent issue, where multi-modal fusion fails to surpass the best mono-modal models. Cross-corpus experiments reveal robust performance in task- and language-consistent scenarios but noticeable degradation in multi-lingual and task-heterogeneous settings. By providing extensive benchmarks and a detailed analysis of performance-influencing factors, FEND advances the field of automated, lifespan-inclusive, and multi-lingual neuropsychiatric disorder assessment. We encourage researchers to adopt the FEND framework for fair comparisons and reproducible research.

[NLP-19] Transductive Visual Programming: Evolving Tool Libraries from Experience for Spatial Reasoning

[Quick Read]: This paper addresses the geometric-computation challenges that spatial reasoning in 3D scenes poses for vision-language models (VLMs), and the limitations of existing visual programming methods that rely on fixed toolsets or speculative tool induction before problem solving, yielding suboptimal programs and poor utilization of induced tools. The key to the solution is Transductive Visual Programming (TVP), a framework that builds new tools from its own experience: TVP first solves problems with basic tools while accumulating solutions into an Example Library, then abstracts recurring patterns into reusable higher-level tools for an evolving Tool Library, so the model tackles new problems with increasingly powerful tools learned from experience. On Omni3D-Bench, TVP outperforms GPT-4o by 22%; its transductively learned tools are used five times more often as core program dependencies than inductively created ones, and they generalize to unseen spatial tasks such as the SpatialScore-Hard collection without testset-specific modification.

Link: https://arxiv.org/abs/2512.20934
Authors: Shengguang Wu,Xiaohan Wang,Yuhui Zhang,Hao Zhu,Serena Yeung-Levy
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: Project Website: this https URL

Abstract:Spatial reasoning in 3D scenes requires precise geometric calculations that challenge vision-language models. Visual programming addresses this by decomposing problems into steps calling specialized tools, yet existing methods rely on either fixed toolsets or speculative tool induction before solving problems, resulting in suboptimal programs and poor utilization of induced tools. We present Transductive Visual Programming (TVP), a novel framework that builds new tools from its own experience rather than speculation. TVP first solves problems using basic tools while accumulating experiential solutions into an Example Library, then abstracts recurring patterns from these programs into reusable higher-level tools for an evolving Tool Library. This allows TVP to tackle new problems with increasingly powerful tools learned from experience. On Omni3D-Bench, TVP achieves state-of-the-art performance, outperforming GPT-4o by 22% and the previous best visual programming system by 11%. Our transductively learned tools are used 5x more frequently as core program dependency than inductively created ones, demonstrating more effective tool discovery and reuse. The evolved tools also show strong generalization to unseen spatial tasks, achieving superior performance on benchmarks from SpatialScore-Hard collection without any testset-specific modification. Our work establishes experience-driven transductive tool creation as a powerful paradigm for building self-evolving visual programming agents that effectively tackle challenging spatial reasoning tasks. We release our code at this https URL.

[NLP-20] Where Did This Sentence Come From? Tracing Provenance in LLM Reasoning Distillation

[Quick Read]: This paper addresses the generalization question in reasoning distillation: after a student model learns to mimic a teacher's behavior in training contexts, does it maintain teacher-consistent reasoning in novel test-time contexts, or does it regress to its original output patterns? Prior methods lacked a detailed analysis of where distilled capabilities originate. The key to the solution is a cross-model Reasoning Distillation Provenance Tracing framework: for each action (e.g., a sentence) produced by the distilled model, the predictive probabilities of the teacher, the original student, and the distilled model are compared under the same context to classify the action's provenance. Building on this analysis, the authors propose a teacher-guided data selection method that uses teacher-student divergence on training data directly as the selection criterion, replacing prior heuristics and yielding more effective distillation.

Link: https://arxiv.org/abs/2512.20908
Authors: Kaiyuan Liu,Shaotian Yan,Rui Miao,Bing Wang,Chen Shen,Jun Zhang,Jieping Ye
Institutions: Zhejiang University; Alibaba Cloud Computing; Jilin University; University of Michigan
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Reasoning distillation has attracted increasing attention. It typically leverages a large teacher model to generate reasoning paths, which are then used to fine-tune a student model so that it mimics the teacher’s behavior in training contexts. However, previous approaches have lacked a detailed analysis of the origins of the distilled model’s capabilities. It remains unclear whether the student can maintain consistent behaviors with the teacher in novel test-time contexts, or whether it regresses to its original output patterns, raising concerns about the generalization of distillation models. To analyse this question, we introduce a cross-model Reasoning Distillation Provenance Tracing framework. For each action (e.g., a sentence) produced by the distilled model, we obtain the predictive probabilities assigned by the teacher, the original student, and the distilled model under the same context. By comparing these probabilities, we classify each action into different categories. By systematically disentangling the provenance of each action, we experimentally demonstrate that, in test-time contexts, the distilled model can indeed generate teacher-originated actions, which correlate with and plausibly explain observed performance on distilled model. Building on this analysis, we further propose a teacher-guided data selection method. Unlike prior approach that rely on heuristics, our method directly compares teacher-student divergences on the training data, providing a principled selection criterion. We validate the effectiveness of our approach across multiple representative teacher models and diverse student models. The results highlight the utility of our provenance-tracing framework and underscore its promise for reasoning distillation. We hope to share Reasoning Distillation Provenance Tracing and our insights into reasoning distillation with the community.
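
To convey the flavor of provenance tracing, here is a toy classification rule over the three models' log-probabilities for a single action; the thresholds and category names are invented for illustration, and the paper's actual taxonomy may differ.

```python
def classify_action(logp_teacher, logp_student, logp_distilled, margin=1.0):
    """Toy provenance rule for one action (e.g. a sentence), given the
    log-probability each model assigns to it under the same context."""
    if logp_distilled - logp_student > margin and logp_teacher > logp_student:
        return "teacher-originated"   # behavior plausibly acquired by distillation
    if abs(logp_distilled - logp_student) <= margin:
        return "student-originated"   # regression to the original output pattern
    return "other"

# The per-action log-probabilities would come from summing token log-probs of
# the sentence under the teacher, the original student, and the distilled
# student, with identical preceding context for all three models.
```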

[NLP-21] Architectural Trade-offs in Small Language Models Under Compute Constraints

[Quick Read]: This paper investigates how architectural choices and training budget interact to determine the performance of small language models under strict compute constraints. The key to the solution is a systematic empirical study that starts from a linear next-token predictor and progressively introduces nonlinearities, self-attention, and multi-layer transformer architectures, evaluating each on character-level modeling (Tiny Shakespeare) and word-level modeling (PTB, WikiText-2) using test negative log-likelihood (NLL), parameter count, and approximate training FLOPs. The results show that attention-based models remain more per-FLOP efficient than MLPs even at small scale, while adding depth or context without sufficient optimization can degrade performance.

Link: https://arxiv.org/abs/2512.20877
Authors: Shivraj Singh Bhatti
Institutions: University of Massachusetts
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 15 pages, 11 images

Abstract:We present a systematic empirical study of small language models under strict compute constraints, analyzing how architectural choices and training budget interact to determine performance. Starting from a linear next-token predictor, we progressively introduce nonlinearities, self-attention, and multi-layer transformer architectures, evaluating each on character-level modeling of Tiny Shakespeare and word-level modeling of Penn Treebank (PTB) and WikiText-2. We compare models using test negative log-likelihood (NLL), parameter count, and approximate training FLOPs to characterize accuracy-efficiency trade-offs. Our results show that attention-based models dominate MLPs in per-FLOP efficiency even at small scale, while increasing depth or context without sufficient optimization can degrade performance. We further examine rotary positional embeddings (RoPE), finding that architectural techniques successful in large language models do not necessarily transfer to small-model regimes.
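
For readers who want to reproduce the budget axis of such a study, a common back-of-the-envelope accounting (an assumption here; the paper may count FLOPs differently) is roughly 6 FLOPs per parameter per training token:

```python
def approx_training_flops(n_params, n_tokens):
    """Rule-of-thumb budget accounting used in scaling studies:
    ~6 FLOPs per parameter per training token (forward + backward)."""
    return 6 * n_params * n_tokens

# e.g. a 10M-parameter model trained on 5M tokens:
print(f"{approx_training_flops(10e6, 5e6):.1e}")  # -> 3.0e+14
```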

[NLP-22] NVIDIA Nemotron 3: Efficient and Open Intelligence

[Quick Read]: This paper addresses the difficulty of balancing reasoning capability, generation efficiency, and cost in large language models, especially for complex workloads such as multi-step tool use and long contexts. The key to the solution is threefold: a hybrid Mamba-Transformer Mixture-of-Experts architecture, incorporating the novel LatentMoE approach, that improves model quality and compute utilization; post-training with multi-environment reinforcement learning that enables multi-step reasoning, tool use, and granular reasoning budget control; and MTP layers that accelerate text generation, delivering best-in-class throughput and context lengths of up to 1M tokens while keeping inference cost low.

Link: https://arxiv.org/abs/2512.20856
Authors: NVIDIA:Aaron Blakeman,Aaron Grattafiori,Aarti Basant,Abhibha Gupta,Abhinav Khattar,Adi Renduchintala,Aditya Vavre,Akanksha Shukla,Akhiad Bercovich,Aleksander Ficek,Aleksandr Shaposhnikov,Alex Kondratenko,Alexander Bukharin,Alexandre Milesi,Ali Taghibakhshi,Alisa Liu,Amelia Barton,Ameya Sunil Mahabaleshwarkar,Amir Klein,Amit Zuker,Amnon Geifman,Amy Shen,Anahita Bhiwandiwalla,Andrew Tao,Anjulie Agrusa,Ankur Verma,Ann Guan,Anubhav Mandarwal,Arham Mehta,Ashwath Aithal,Ashwin Poojary,Asif Ahamed,Asit Mishra,Asma Kuriparambil Thekkumpate,Ayush Dattagupta,Banghua Zhu,Bardiya Sadeghi,Barnaby Simkin,Ben Lanir,Benedikt Schifferer,Besmira Nushi,Bilal Kartal,Bita Darvish Rouhani,Boris Ginsburg,Brandon Norick,Brandon Soubasis,Branislav Kisacanin,Brian Yu,Bryan Catanzaro,Carlo del Mundo,Chantal Hwang,Charles Wang,Cheng-Ping Hsieh,Chenghao Zhang,Chenhan Yu,Chetan Mungekar,Chintan Patel,Chris Alexiuk,Christopher Parisien,Collin Neale,Cyril Meurillon,Damon Mosk-Aoyama,Dan Su,Dane Corneil,Daniel Afrimi,Daniel Lo,Daniel Rohrer,Daniel Serebrenik,Daria Gitman,Daria Levy,Darko Stosic,David Mosallanezhad,Deepak Narayanan,Dhruv Nathawani,Dima Rekesh,Dina Yared,Divyanshu Kakwani,Dong Ahn,Duncan Riach,Dusan Stosic,Edgar Minasyan,Edward Lin,Eileen Long,Eileen Peters Long,Elad Segal,Elena Lantz,Ellie Evans,Elliott Ning,Eric Chung,Eric Harper,Eric Tramel,Erick Galinkin,Erik Pounds,Evan Briones,Evelina Bakhturina,Evgeny Tsykunov,Faisal Ladhak,Fay Wang,Fei Jia
Institutions: NVIDIA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We introduce the Nemotron 3 family of models - Nano, Super, and Ultra. These models deliver strong agentic, reasoning, and conversational capabilities. The Nemotron 3 family uses a Mixture-of-Experts hybrid Mamba-Transformer architecture to provide best-in-class throughput and context lengths of up to 1M tokens. Super and Ultra models are trained with NVFP4 and incorporate LatentMoE, a novel approach that improves model quality. The two larger models also include MTP layers for faster text generation. All Nemotron 3 models are post-trained using multi-environment reinforcement learning enabling reasoning, multi-step tool use, and support granular reasoning budget control. Nano, the smallest model, outperforms comparable models in accuracy while remaining extremely cost-efficient for inference. Super is optimized for collaborative agents and high-volume workloads such as IT ticket automation. Ultra, the largest model, provides state-of-the-art accuracy and reasoning performance. Nano is released together with its technical report and this white paper, while Super and Ultra will follow in the coming months. We will openly release the model weights, pre- and post-training software, recipes, and all data for which we hold redistribution rights.

[NLP-23] How important is Recall for Measuring Retrieval Quality?

[Quick Read]: This paper addresses the fact that in large, evolving knowledge bases the total number of documents relevant to a query is typically unknown, so recall cannot be computed. The key to the solution is a simple retrieval quality measure that performs well without requiring knowledge of the total number of relevant documents, validated by measuring the correlation between retrieval quality metrics and LLM-based judgments of the quality of responses generated from the retrieved documents.

Link: https://arxiv.org/abs/2512.20854
Authors: Shelly Schwartz,Oleg Vasilyev,Randy Sawaya
Institutions: Primer Technologies Inc.
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:In realistic retrieval settings with large and evolving knowledge bases, the total number of documents relevant to a query is typically unknown, and recall cannot be computed. In this paper, we evaluate several established strategies for handling this limitation by measuring the correlation between retrieval quality metrics and LLM-based judgments of response quality, where responses are generated from the retrieved documents. We conduct experiments across multiple datasets with a relatively low number of relevant documents (2-15). We also introduce a simple retrieval quality measure that performs well without requiring knowledge of the total number of relevant documents.

[NLP-24] Nemotron 3 Nano: Open Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

[Quick Read]: This paper addresses the trade-off between inference efficiency and performance in large language models: maintaining high accuracy while increasing throughput and reducing compute. The key to the solution is Nemotron 3 Nano 30B-A3B, a Mixture-of-Experts hybrid Mamba-Transformer model pretrained on 25 trillion text tokens (including more than 3 trillion new unique tokens over Nemotron 2) and refined with supervised fine-tuning and large-scale reinforcement learning across diverse environments. Activating fewer than half of its parameters per forward pass, it is more accurate than the previous-generation Nemotron 2 Nano, delivers up to 3.3x higher inference throughput than similarly sized open models such as GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507, supports context lengths up to 1M tokens, and shows enhanced agentic, reasoning, and chat abilities.

Link: https://arxiv.org/abs/2512.20848
Authors: NVIDIA:Aaron Blakeman,Aaron Grattafiori,Aarti Basant,Abhibha Gupta,Abhinav Khattar,Adi Renduchintala,Aditya Vavre,Akanksha Shukla,Akhiad Bercovich,Aleksander Ficek,Aleksandr Shaposhnikov,Alex Kondratenko,Alexander Bukharin,Alexandre Milesi,Ali Taghibakhshi,Alisa Liu,Amelia Barton,Ameya Sunil Mahabaleshwarkar,Amir Klein,Amit Zuker,Amnon Geifman,Amy Shen,Anahita Bhiwandiwalla,Andrew Tao,Ann Guan,Anubhav Mandarwal,Arham Mehta,Ashwath Aithal,Ashwin Poojary,Asif Ahamed,Asma Kuriparambil Thekkumpate,Ayush Dattagupta,Banghua Zhu,Bardiya Sadeghi,Barnaby Simkin,Ben Lanir,Benedikt Schifferer,Besmira Nushi,Bilal Kartal,Bita Darvish Rouhani,Boris Ginsburg,Brandon Norick,Brandon Soubasis,Branislav Kisacanin,Brian Yu,Bryan Catanzaro,Carlo del Mundo,Chantal Hwang,Charles Wang,Cheng-Ping Hsieh,Chenghao Zhang,Chenhan Yu,Chetan Mungekar,Chintan Patel,Chris Alexiuk,Christopher Parisien,Collin Neale,Damon Mosk-Aoyama,Dan Su,Dane Corneil,Daniel Afrimi,Daniel Rohrer,Daniel Serebrenik,Daria Gitman,Daria Levy,Darko Stosic,David Mosallanezhad,Deepak Narayanan,Dhruv Nathawani,Dima Rekesh,Dina Yared,Divyanshu Kakwani,Dong Ahn,Duncan Riach,Dusan Stosic,Edgar Minasyan,Edward Lin,Eileen Long,Eileen Peters Long,Elena Lantz,Ellie Evans,Elliott Ning,Eric Chung,Eric Harper,Eric Tramel,Erick Galinkin,Erik Pounds,Evan Briones,Evelina Bakhturina,Faisal Ladhak,Fay Wang,Fei Jia,Felipe Soares,Feng Chen,Ferenc Galko,Frankie Siino,Gal Hubara Agam,Ganesh Ajjanagadde,Gantavya Bhatt
Institutions: NVIDIA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We present Nemotron 3 Nano 30B-A3B, a Mixture-of-Experts hybrid Mamba-Transformer language model. Nemotron 3 Nano was pretrained on 25 trillion text tokens, including more than 3 trillion new unique tokens over Nemotron 2, followed by supervised fine tuning and large-scale RL on diverse environments. Nemotron 3 Nano achieves better accuracy than our previous generation Nemotron 2 Nano while activating less than half of the parameters per forward pass. It achieves up to 3.3x higher inference throughput than similarly-sized open models like GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507, while also being more accurate on popular benchmarks. Nemotron 3 Nano demonstrates enhanced agentic, reasoning, and chat abilities and supports context lengths up to 1M tokens. We release both our pretrained Nemotron 3 Nano 30B-A3B Base and post-trained Nemotron 3 Nano 30B-A3B checkpoints on Hugging Face.

[NLP-25] MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs

[Quick Read]: This paper addresses the reliability and safety concerns limiting the adoption of Large Language Models (LLMs) in medicine: existing evaluations either test factual medical knowledge in isolation or assess patient-level reasoning without verifying correctness, leaving no systematic way to measure performance in realistic clinical contexts. The key to the solution is MediEval, a benchmark linking MIMIC-IV electronic health records (EHRs) to a unified knowledge base built from UMLS and other biomedical vocabularies, generating diverse factual and counterfactual medical statements and jointly evaluating knowledge grounding and contextual consistency within a four-quadrant framework. The authors further propose Counterfactual Risk-Aware Fine-tuning (CoRFu), a DPO-based method with an asymmetric penalty targeting unsafe confusions, which improves macro-F1 by 16.4 points over the base model and eliminates truth-inversion errors, demonstrating both higher accuracy and substantially greater safety.

Link: https://arxiv.org/abs/2512.20822
Authors: Zhan Qu,Michael Färber
Institutions: TU Dresden; ScaDS.AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) are increasingly applied to medicine, yet their adoption is limited by concerns over reliability and safety. Existing evaluations either test factual medical knowledge in isolation or assess patient-level reasoning without verifying correctness, leaving a critical gap. We introduce MediEval, a benchmark that links MIMIC-IV electronic health records (EHRs) to a unified knowledge base built from UMLS and other biomedical vocabularies. MediEval generates diverse factual and counterfactual medical statements within real patient contexts, enabling systematic evaluation across a 4-quadrant framework that jointly considers knowledge grounding and contextual consistency. Using this framework, we identify critical failure modes, including hallucinated support and truth inversion, that current proprietary, open-source, and domain-specific LLMs frequently exhibit. To address these risks, we propose Counterfactual Risk-Aware Fine-tuning (CoRFu), a DPO-based method with an asymmetric penalty targeting unsafe confusions. CoRFu improves by +16.4 macro-F1 points over the base model and eliminates truth inversion errors, demonstrating both higher accuracy and substantially greater safety.
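
A sketch of how an asymmetric penalty can be grafted onto the standard DPO objective, assuming per-sequence log-probabilities have already been computed under the policy and reference models; the weighting scheme is illustrative, not CoRFu's exact formulation.

```python
import torch
import torch.nn.functional as F

def asymmetric_dpo_loss(logp_chosen, logp_rejected,
                        ref_logp_chosen, ref_logp_rejected,
                        unsafe_pair, beta=0.1, unsafe_weight=2.0):
    """DPO loss with a heavier penalty on preference pairs whose rejected
    completion is an unsafe confusion (e.g. truth inversion). All inputs
    are (batch,) tensors; unsafe_pair is a bool tensor flagging high-risk
    pairs."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    base = -F.logsigmoid(beta * margin)     # standard DPO term
    weights = torch.ones_like(base)
    weights[unsafe_pair] = unsafe_weight    # asymmetric up-weighting
    return (weights * base).mean()
```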

[NLP-26] EssayCBM: Rubric-Aligned Concept Bottleneck Models for Transparent Essay Grading

[Quick Read]: This paper addresses the lack of interpretability in automated essay grading, where large language models act as black boxes that educators and students cannot inspect. The key to the solution is EssayCBM, a rubric-aligned framework that scores eight writing concepts (such as Thesis Clarity and Evidence Use) via dedicated prediction heads on an encoder; these concept scores form a transparent bottleneck from which a lightweight network computes the final grade. Instructors can adjust individual concept predictions and instantly see the updated grade, enabling accountable human-in-the-loop evaluation while matching black-box grading performance.

Link: https://arxiv.org/abs/2512.20817
Authors: Kumar Satvik Chaudhary,Chengshuai Zhao,Fan Zhang,Yung Hin Tse,Garima Agrawal,Yuli Deng,Huan Liu
Institutions: Arizona State University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Understanding how automated grading systems evaluate essays remains a significant challenge for educators and students, especially when large language models function as black boxes. We introduce EssayCBM, a rubric-aligned framework that prioritizes interpretability in essay assessment. Instead of predicting grades directly from text, EssayCBM evaluates eight writing concepts, such as Thesis Clarity and Evidence Use, through dedicated prediction heads on an encoder. These concept scores form a transparent bottleneck, and a lightweight network computes the final grade using only concepts. Instructors can adjust concept predictions and instantly view the updated grade, enabling accountable human-in-the-loop evaluation. EssayCBM matches black-box performance while offering actionable, concept-level feedback through an intuitive web interface.

[NLP-27] Semantic Deception: When Reasoning Models Can't Compute an Addition

Quick Read: This paper asks whether the so-called reasoning abilities of current large language models (LLMs) include genuine symbolic abstraction and manipulation when facing novel symbol systems, and in particular whether surface semantic cues can pull models away from the correct symbolic logic. The key to the solution is an experimental framework that introduces "semantic deceptions": standard digits and operators are redefined with unfamiliar symbols and embedded in misleading semantic contexts, testing whether LLMs can recognize and compute with symbols carrying no prior associations. The experiments show that even when LLMs appear to follow instructions, their performance is significantly degraded by residual semantic associations, exposing an over-reliance on surface features that undermines reliable symbolic reasoning in decision-critical settings.

Link: https://arxiv.org/abs/2512.20812
Authors: Nathaniël de Leeuw, Marceau Nahon, Mathis Reymond, Raja Chatila, Mehdi Khamassi
Affiliations: Sorbonne University; Université Paris Cité
Subjects: Computation and Language (cs.CL)
Comments: 22 pages, 5 figures

Abstract:Large language models (LLMs) are increasingly used in situations where human values are at stake, such as decision-making tasks that involve reasoning when performed by humans. We investigate the so-called reasoning capabilities of LLMs over novel symbolic representations by introducing an experimental framework that tests their ability to process and manipulate unfamiliar symbols. We introduce semantic deceptions: situations in which symbols carry misleading semantic associations due to their form, such as being embedded in specific contexts, designed to probe whether LLMs can maintain symbolic abstraction or whether they default to exploiting learned semantic associations. We redefine standard digits and mathematical operators using novel symbols, and task LLMs with solving simple calculations expressed in this altered notation. The objective is: (1) to assess LLMs’ capacity for abstraction and manipulation of arbitrary symbol systems; (2) to evaluate their ability to resist misleading semantic cues that conflict with the task’s symbolic logic. Through experiments with four LLMs we show that semantic cues can significantly deteriorate reasoning models’ performance on very simple tasks. They reveal limitations in current LLMs’ ability for symbolic manipulations and highlight a tendency to over-rely on surface-level semantics, suggesting that chain-of-thoughts may amplify reliance on statistical correlations. Even in situations where LLMs seem to correctly follow instructions, semantic cues still impact basic capabilities. These limitations raise ethical and societal concerns, undermining the widespread and pernicious tendency to attribute reasoning abilities to LLMs and suggesting how LLMs might fail, in particular in decision-making contexts where robust symbolic reasoning is essential and should not be compromised by residual semantic associations inherited from the model’s training.

[NLP-28] Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics?

Quick Read: This paper investigates how independent demographic-bias mechanisms in language models are from general demographic recognition, i.e., whether a model can be debiased while retaining its ability to detect demographics from names, professions, and education levels. The key to the solution is locating bias features with attribution-based and correlation-based methods and intervening via targeted sparse autoencoder feature ablations: in Gemma-2-9B, attribution-based ablations mitigate race and gender profession stereotypes while preserving name-recognition accuracy, whereas correlation-based ablations work better for education bias. The authors further find that removing attribution features in education tasks induces "prior collapse", which increases overall bias, underscoring the need for dimension-specific interventions. Overall, demographic bias arises from task-specific mechanisms rather than absolute demographic markers, and mechanistic inference-time interventions can achieve surgical debiasing without damaging core model capabilities.

Link: https://arxiv.org/abs/2512.20796
Authors: Zhengyang Shan, Aaron Mueller
Affiliations: Boston University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We investigate how independent demographic bias mechanisms are from general demographic recognition in language models. Using a multi-task evaluation setup where demographics are associated with names, professions, and education levels, we measure whether models can be debiased while preserving demographic detection capabilities. We compare attribution-based and correlation-based methods for locating bias features. We find that targeted sparse autoencoder feature ablations in Gemma-2-9B reduce bias without degrading recognition performance: attribution-based ablations mitigate race and gender profession stereotypes while preserving name recognition accuracy, whereas correlation-based ablations are more effective for education bias. Qualitative analysis further reveals that removing attribution features in education tasks induces "prior collapse", thus increasing overall bias. This highlights the need for dimension-specific interventions. Overall, our results show that demographic bias arises from task-specific mechanisms rather than absolute demographic markers, and that mechanistic inference-time interventions can enable surgical debiasing without compromising core model capabilities.
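Targeted SAE feature ablation, as used here, can be summarized in a few lines. The sketch below assumes a pretrained sparse autoencoder with a ReLU encoder; `ablate_idx` stands in for the bias features found by the attribution- or correlation-based localization, and in practice the edited activation would be patched back into the residual stream via a forward hook.

```python
import torch

def ablate_sae_features(h, enc_w, enc_b, dec_w, dec_b, ablate_idx):
    """Encode activation h into sparse latents, zero the flagged bias
    features, and decode back to an edited activation."""
    z = torch.relu(h @ enc_w + enc_b)   # (batch, n_latents), mostly zeros
    z[:, ablate_idx] = 0.0              # surgical ablation of bias features
    return z @ dec_w + dec_b            # replaces h for downstream layers

h = torch.randn(4, 3584)                               # Gemma-2-9B-sized residual
enc_w, enc_b = torch.randn(3584, 16384), torch.zeros(16384)
dec_w, dec_b = torch.randn(16384, 3584), torch.zeros(3584)
print(ablate_sae_features(h, enc_w, enc_b, dec_w, dec_b, [12, 407]).shape)
```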

[NLP-29] Investigating Model Editing for Unlearning in Large Language Models

Quick Read: This paper addresses unlearning in large language models (LLMs): removing specific training information efficiently and precisely without significantly degrading performance on knowledge that should be retained. Traditional methods are often inefficient for models with huge parameter counts or fail to remove the target information completely. Model-editing algorithms (ROME, IKE, WISE) solve a related problem but were designed to redirect inputs to new targets rather than delete information; the authors design new editing targets suited to the unlearning setting and show that, depending on the setup, these methods can exceed baseline unlearning approaches in quality of forgetting. Like traditional unlearning techniques, however, they still struggle to confine the scope of what is forgotten without some damage to overall model performance.

Link: https://arxiv.org/abs/2512.20794
Authors: Shariqah Hossain, Lalana Kagal
Affiliations: Massachusetts Institute of Technology
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Machine unlearning aims to remove unwanted information from a model, but many methods are inefficient for LLMs with large numbers of parameters or fail to fully remove the intended information without degrading performance on knowledge that should be retained. Model editing algorithms solve a similar problem of changing information in models, but they focus on redirecting inputs to a new target rather than removing that information altogether. In this work, we explore the editing algorithms ROME, IKE, and WISE and design new editing targets for an unlearning setting. Through this investigation, we show that model editing approaches can exceed baseline unlearning methods in terms of quality of forgetting depending on the setting. Like traditional unlearning techniques, they struggle to encapsulate the scope of what is to be unlearned without damage to the overall model performance.

[NLP-30] Large Language Models Approach Expert Pedagogical Quality in Math Tutoring but Differ in Instructional and Linguistic Profiles

Quick Read: This paper examines whether generative AI can emulate the tutoring behavior of expert human tutors in mathematics, i.e., whether large language models (LLMs) converge with experts in instructional strategies and linguistic characteristics. The key to the solution is a controlled turn-level comparison in which expert human tutors, novice human tutors, and multiple LLMs respond to the same math remediation conversation turns, quantitatively comparing restating/revoicing, pressing for accuracy, lexical diversity, readability, politeness, and agency, and relating these features to perceived pedagogical quality. The study finds that LLMs reach expert-level perceived pedagogical quality on average, yet rely on longer, more lexically diverse, and more polite language while underusing the restating and revoicing strategies characteristic of experts, underscoring the need to evaluate tutoring responses along both instructional and linguistic dimensions.

Link: https://arxiv.org/abs/2512.20780
Authors: Ramatu Oiza Abdulsalam, Segun Aroyehun
Affiliations: African University of Science and Technology; University of Konstanz
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:Recent work has explored the use of large language models for generating tutoring responses in mathematics, yet it remains unclear how closely their instructional behavior aligns with expert human practice. We examine this question using a controlled, turn-level comparison in which expert human tutors, novice human tutors, and multiple large language models respond to the same set of math remediation conversation turns. We examine both instructional strategies and linguistic characteristics of tutoring responses, including restating and revoicing, pressing for accuracy, lexical diversity, readability, politeness, and agency. We find that large language models approach expert levels of perceived pedagogical quality on average but exhibit systematic differences in their instructional and linguistic profiles. In particular, large language models tend to underuse restating and revoicing strategies characteristic of expert human tutors, while producing longer, more lexically diverse, and more polite responses. Statistical analyses show that restating and revoicing, lexical diversity, and pressing for accuracy are positively associated with perceived pedagogical quality, whereas higher levels of agentic and polite language are negatively associated. Overall, recent large language models exhibit levels of perceived pedagogical quality comparable to expert human tutors, while relying on different instructional and linguistic strategies. These findings underscore the value of analyzing instructional strategies and linguistic characteristics when evaluating tutoring responses across human tutors and intelligent tutoring systems.

[NLP-31] Adversarial Training for Failure-Sensitive User Simulation in Mental Health Dialogue Optimization

Quick Read: This paper addresses the limited realism of user simulators used to evaluate task-oriented dialogue (TOD) systems: existing simulators neither replicate human behavior faithfully nor expose a system's latent failure modes. The key to the solution is an adversarial training framework that iteratively improves simulator realism through a competitive dynamic between a generator (the user simulator) and a discriminator, applied to mental health support chatbots. The approach substantially outperforms zero-shot base models at surfacing system failures, and successive adversarial rounds further improve diversity, distributional alignment of failure modes, and predictive validity, yielding a simulator whose simulated failure rates correlate strongly with real ones across diverse chatbot configurations while keeping distributional divergence low.

Link: https://arxiv.org/abs/2512.20773
Authors: Ziyi Zhu, Olivier Tieleman, Caitlin A. Stamatis, Luka Smyth, Thomas D. Hull, Daniel R. Cahn, Matteo Malgaroli
Affiliations: Slingshot AI; Department of Psychiatry, NYU School of Medicine
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Realistic user simulation is crucial for training and evaluating task-oriented dialogue (TOD) systems, yet creating simulators that accurately replicate human behavior remains challenging. A key property of effective simulators is their ability to expose failure modes of the systems they evaluate. We present an adversarial training framework that iteratively improves user simulator realism through a competitive dynamic between a generator (user simulator) and a discriminator. Applied to mental health support chatbots, our approach demonstrates that fine-tuned simulators dramatically outperform zero-shot base models at surfacing system issues, and adversarial training further enhances diversity, distributional alignment, and predictive validity. The resulting simulator achieves a strong correlation between simulated and real failure occurrence rates across diverse chatbot configurations while maintaining low distributional divergence of failure modes. Discriminator accuracy decreases drastically after three adversarial iterations, suggesting improved realism. These results provide evidence that adversarial training is a promising approach for creating realistic user simulators in mental health support TOD domains, enabling rapid, reliable, and cost-effective system evaluation before deployment.
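One half of the generator/discriminator loop, the realism probe, can be sketched with off-the-shelf tools. The toy below (our illustration, not the paper's implementation) fits a real-vs-simulated text classifier; as adversarial rounds improve the simulator, this classifier's accuracy should fall toward chance, mirroring the drop in discriminator accuracy the authors report after three iterations.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def discriminator_accuracy(real_texts, sim_texts):
    """Fit a real-vs-simulated classifier; accuracy near 0.5 suggests the
    simulator has become hard to distinguish from real users."""
    texts = list(real_texts) + list(sim_texts)
    labels = np.array([1] * len(real_texts) + [0] * len(sim_texts))
    X = TfidfVectorizer(ngram_range=(1, 2), min_df=1).fit_transform(texts)
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    return clf.score(X, labels)   # training accuracy as a rough realism probe

real = ["i have been feeling anxious before work", "i cannot sleep lately"]
sim = ["as an ai user i report anxiety", "sleep is suboptimal for me"]
print(discriminator_accuracy(real, sim))
```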

[NLP-32] Generalization of RLVR Using Causal Reasoning as a Testbed

Quick Read: This paper studies how well reinforcement learning with verifiable rewards (RLVR), a post-training paradigm for large language models (LLMs), generalizes on complex reasoning, using probabilistic inference over causal graphical models as a testbed. The key to the solution is constructing datasets of causal graphs and queries that vary along two difficulty axes, query level (associational, interventional, counterfactual) and structural complexity (relevant subgraph size), and comparing RLVR against supervised fine-tuning (SFT) across model scales (3B-32B) and training query levels. The study finds that RLVR yields stronger within-level and across-level generalization than SFT only for specific combinations of model size and training query level, and that its effectiveness hinges on the model's initial reasoning competence: given sufficient competence, RLVR improves the model's marginalization strategy and reduces errors in intermediate probability calculations, producing substantial accuracy gains on complex queries. RLVR can thus strengthen specific causal-reasoning subskills of LLMs, but only when basic reasoning competence is already present.

Link: https://arxiv.org/abs/2512.20760
Authors: Brian Lu, Hongyu Zhao, Shuo Sun, Hao Peng, Rui Ding, Hongyuan Mei
Affiliations: Johns Hopkins University; University of Maryland, College Park; National University of Singapore; University of Illinois at Urbana-Champaign; Microsoft Research Asia; Toyota Technological Institute at Chicago
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for post-training large language models (LLMs) on complex reasoning tasks. Yet, the conditions under which RLVR yields robust generalization remain poorly understood. This paper provides an empirical study of RLVR generalization in the setting of probabilistic inference over causal graphical models. This setting offers two natural axes along which to examine generalization: (i) the level of the probabilistic query – associational, interventional, or counterfactual – and (ii) the structural complexity of the query, measured by the size of its relevant subgraph. We construct datasets of causal graphs and queries spanning these difficulty axes and fine-tune Qwen-2.5-Instruct models using RLVR or supervised fine-tuning (SFT). We vary both the model scale (3B-32B) and the query level included in training. We find that RLVR yields stronger within-level and across-level generalization than SFT, but only for specific combinations of model size and training query level. Further analysis shows that RLVR’s effectiveness depends on the model’s initial reasoning competence. With sufficient initial competence, RLVR improves an LLM’s marginalization strategy and reduces errors in intermediate probability calculations, producing substantial accuracy gains, particularly on more complex queries. These findings show that RLVR can improve specific causal reasoning subskills, with its benefits emerging only when the model has sufficient initial competence.

[NLP-33] TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior

Quick Read: This paper addresses the poorly understood influence of the tokenizer on language model (LM) performance and behavior, a question long hindered by the difficulty of measuring tokenization effects in isolation. The key to the solution is TokSuite: a collection of fourteen models that use different tokenizers but are otherwise identical in architecture, training data, training budget, and initialization, together with a new benchmark targeting real-world perturbations likely to affect tokenization. This systematic setup cleanly decouples the tokenizer's influence, supporting novel findings on the respective strengths and shortcomings of a wide range of popular tokenizers.

Link: https://arxiv.org/abs/2512.20757
Authors: Gül Sena Altıntaş, Malikeh Ehghaghi, Brian Lester, Fengyuan Liu, Wanru Zhao, Marco Ciccone, Colin Raffel
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Tokenizers provide the fundamental basis through which text is represented and processed by language models (LMs). Despite the importance of tokenization, its role in LM performance and behavior is poorly understood due to the challenge of measuring the impact of tokenization in isolation. To address this need, we present TokSuite, a collection of models and a benchmark that supports research into tokenization’s influence on LMs. Specifically, we train fourteen models that use different tokenizers but are otherwise identical using the same architecture, dataset, training budget, and initialization. Additionally, we curate and release a new benchmark that specifically measures model performance subject to real-world perturbations that are likely to influence tokenization. Together, TokSuite allows robust decoupling of the influence of a model’s tokenizer, supporting a series of novel findings that elucidate the respective benefits and shortcomings of a wide range of popular tokenizers.
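The kind of perturbation sensitivity the benchmark measures is easy to probe for any public tokenizer. The snippet below is illustrative only; `gpt2` and `bert-base-uncased` are stand-ins, not the fourteen TokSuite tokenizers. It counts how a Unicode confusable plus a leetspeak digit inflates token counts.

```python
from transformers import AutoTokenizer

clean = "The model's performance improved."
perturbed = "Тhe mod3l's performance improved!!"  # Cyrillic 'Т', digit swap, punctuation

for name in ["gpt2", "bert-base-uncased"]:
    tok = AutoTokenizer.from_pretrained(name)
    n_clean, n_pert = len(tok.tokenize(clean)), len(tok.tokenize(perturbed))
    print(f"{name}: {n_clean} -> {n_pert} tokens under perturbation")
```

A tokenizer whose token count balloons under such perturbations fragments the input into rarer pieces, which is exactly the kind of behavior shift the benchmark is designed to expose.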

[NLP-34] AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent

Quick Read: This paper targets the computational inefficiency and accuracy problems of large reasoning models (LRMs) on tasks requiring complex mathematical operations: such models excel at natural language reasoning but falter when high-precision numeric computation is needed. The key to the solution is AgentMath, a framework that couples a language model's reasoning with a code interpreter's computational precision for efficient and accurate problem solving. Its core innovations are: (1) automatically converting natural language chains of thought into structured tool-augmented trajectories, producing high-quality supervised fine-tuning data to alleviate data scarcity; (2) a novel agentic reinforcement learning (RL) paradigm that dynamically interleaves natural language generation with real-time code execution, letting the model learn optimal tool-use strategies from multi-round interactive feedback and develop emergent code-refinement and error-correction abilities; and (3) an efficient training system with request-level asynchronous rollout scheduling, agentic partial rollout, and prefix-aware weighted load balancing, delivering a 4-5x speedup and making RL training feasible on ultra-long sequences with massive tool calls.

Link: https://arxiv.org/abs/2512.20745
Authors: Haipeng Luo, Huawen Feng, Qingfeng Sun, Can Xu, Kai Zheng, Yufei Wang, Tao Yang, Han Hu, Yansong Tang, Di Wang
Affiliations: Shenzhen International Graduate School, Tsinghua University; Tencent Hunyuan
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: LLM, Mathematical Reasoning

Abstract:Large Reasoning Models (LRMs) like o3 and DeepSeek-R1 have achieved remarkable progress in natural language reasoning with long chain-of-thought. However, they remain computationally inefficient and struggle with accuracy when solving problems requiring complex mathematical operations. In this work, we present AgentMath, an agent framework that seamlessly integrates language models' reasoning capabilities with code interpreters' computational precision to efficiently tackle complex mathematical problems. Our approach introduces three key innovations: (1) An automated method that converts natural language chain-of-thought into structured tool-augmented trajectories, generating high-quality supervised fine-tuning (SFT) data to alleviate data scarcity; (2) A novel agentic reinforcement learning (RL) paradigm that dynamically interleaves natural language generation with real-time code execution. This enables models to autonomously learn optimal tool-use strategies through multi-round interactive feedback, while fostering emergent capabilities in code refinement and error correction; (3) An efficient training system incorporating innovative techniques, including request-level asynchronous rollout scheduling, agentic partial rollout, and prefix-aware weighted load balancing, achieving 4-5x speedup and making efficient RL training feasible on ultra-long sequences and in scenarios with massive tool calls. Extensive evaluations show that AgentMath achieves state-of-the-art performance on challenging mathematical competition benchmarks including AIME24, AIME25, and HMMT25. Specifically, AgentMath-30B-A3B attains 90.6%, 86.4%, and 73.8% accuracy respectively, achieving advanced performance. These results validate the effectiveness of our approach and pave the way for building more sophisticated and scalable mathematical reasoning agents.

[NLP-35] SA-DiffuSeq: Addressing Computational and Scalability Challenges in Long-Document Generation with Sparse Attention

Quick Read: This paper addresses the steep growth of computational and memory cost with sequence length in diffusion-based long-form text generation. The key to the solution is SA-DiffuSeq, a framework that integrates sparse attention into the diffusion process to allocate attention selectively, sharply reducing computational complexity while preserving semantic coherence and generation quality. Its core component is a soft absorbing state tailored to sparse-attention dynamics, which stabilizes diffusion trajectories and accelerates sequence reconstruction, improving sampling efficiency and long-range dependency modeling. Experiments show SA-DiffuSeq surpasses state-of-the-art diffusion baselines in training efficiency and sampling speed, especially on long sequences, making it well suited to demanding long-form applications such as scientific writing, large-scale code generation, and multi-turn long-context dialogue.

Link: https://arxiv.org/abs/2512.20724
Authors: Alexandros Christoforos, Chadbourne Davis
Affiliations: Suffolk University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under submission

Abstract:Diffusion based approaches to long form text generation suffer from prohibitive computational cost and memory overhead as sequence length increases. We introduce SA-DiffuSeq, a diffusion framework that integrates sparse attention to fundamentally improve scalability for long document modeling. By selectively allocating attention within the diffusion process, SA-DiffuSeq significantly reduces computational complexity while maintaining semantic coherence and generation quality. A key component of our method is a soft absorbing state tailored to sparse attention dynamics, which stabilizes diffusion trajectories and accelerates sequence reconstruction. This design improves sampling efficiency and enhances precision in long range dependency modeling. Extensive experiments demonstrate that SA-DiffuSeq consistently surpasses state of the art diffusion baselines in both training efficiency and sampling speed, with especially strong gains on extended sequences. These properties make SA-DiffuSeq well suited for demanding long form applications such as scientific writing, large scale code generation, and multi turn long context dialogue. Overall, our results indicate that incorporating structured sparsity into diffusion models is a promising direction for efficient and expressive long text generation.

[NLP-36] PHOTON: Hierarchical Autoregressive Modeling for Lightspeed and Memory-Efficient Language Generation

Quick Read: This paper addresses the high prefill latency and decode-time KV-cache bottleneck caused by Transformers' token-by-token scanning in long-form generation. The key to the solution is the PHOTON architecture, which introduces hierarchical autoregression to replace flat context scanning with vertical, multi-resolution context access: a bottom-up encoder progressively compresses tokens into low-rate contextual states, while lightweight top-down decoders reconstruct fine-grained token representations. This structure greatly reduces decode-time KV-cache traffic, yielding up to 10^3x higher throughput per unit memory while preserving model quality.

Link: https://arxiv.org/abs/2512.20687
Authors: Yuma Ichikawa, Naoya Takagi, Takumi Nakagawa, Yuzi Kanazawa, Akira Sakai
Affiliations: Fujitsu Limited; RIKEN Center for AIP; Institute of Science Tokyo; Tokai University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 12 pages, 5 figures

Abstract:Transformers operate as horizontal token-by-token scanners; at each generation step, the model attends to an ever-growing sequence of token-level states. This access pattern increases prefill latency and makes long-context decoding increasingly memory-bound, as KV-cache reads and writes dominate inference throughput rather than arithmetic computation. We propose Parallel Hierarchical Operation for Top-down Networks (PHOTON), a hierarchical autoregressive model that replaces flat scanning with vertical, multi-resolution context access. PHOTON maintains a hierarchy of latent streams: a bottom-up encoder progressively compresses tokens into low-rate contextual states, while lightweight top-down decoders reconstruct fine-grained token representations. Experimental results show that PHOTON is superior to competitive Transformer-based language models regarding the throughput-quality trade-off, offering significant advantages in long-context and multi-query tasks. This reduces decode-time KV-cache traffic, yielding up to 10^3\times higher throughput per unit memory.

[NLP-37] Automated Red-Teaming Framework for Large Language Model Security Assessment: A Comprehensive Attack Generation and Detection System

Quick Read: This paper addresses the inadequate security and alignment assurance of large language models (LLMs) deployed in high-stakes settings, where red-teaming that relies on manually crafted prompts scales poorly and leaves vulnerability coverage incomplete. The key to the solution is an automated red-teaming framework that systematically generates, executes, and evaluates adversarial prompts via meta-prompting-based attack synthesis, multi-modal vulnerability detection, and standardized evaluation protocols spanning six threat categories. On GPT-OSS-20B the framework uncovers 47 distinct vulnerabilities, including 21 high-severity issues and 12 novel attack patterns, achieving a 3.9x improvement in vulnerability discovery rate over manual expert testing while maintaining 89% detection accuracy, markedly advancing automated, reproducible AI safety evaluation.

Link: https://arxiv.org/abs/2512.20677
Authors: Zhang Wei, Peilu Hu, Shengning Lang, Hao Yan, Li Mei, Yichao Zhang, Chen Yang, Junfeng Hao, Zhimo Han
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: 18 pages

Abstract:As large language models (LLMs) are increasingly deployed in high-stakes domains, ensuring their security and alignment has become a critical challenge. Existing red-teaming practices depend heavily on manual testing, which limits scalability and fails to comprehensively cover the vast space of potential adversarial behaviors. This paper introduces an automated red-teaming framework that systematically generates, executes, and evaluates adversarial prompts to uncover security vulnerabilities in LLMs. Our framework integrates meta-prompting-based attack synthesis, multi-modal vulnerability detection, and standardized evaluation protocols spanning six major threat categories – reward hacking, deceptive alignment, data exfiltration, sandbagging, inappropriate tool use, and chain-of-thought manipulation. Experiments on the GPT-OSS-20B model reveal 47 distinct vulnerabilities, including 21 high-severity and 12 novel attack patterns, achieving a 3.9\times improvement in vulnerability discovery rate over manual expert testing while maintaining 89% detection accuracy. These results demonstrate the framework’s effectiveness in enabling scalable, systematic, and reproducible AI safety evaluations. By providing actionable insights for improving alignment robustness, this work advances the state of automated LLM red-teaming and contributes to the broader goal of building secure and trustworthy AI systems.

[NLP-38] Uncovering Competency Gaps in Large Language Models and Their Benchmarks

Quick Read: This paper addresses two problems with benchmark-centric evaluation of large language models (LLMs): capability weaknesses in specific sub-areas ("model gaps") and imbalanced concept coverage in the benchmarks themselves ("benchmark gaps"), both of which aggregate metrics obscure. The key to the solution is using sparse autoencoders (SAEs) to extract concept activations from the model's internal representations and computing saliency-weighted performance scores over benchmark data, grounding evaluation in those internal representations and enabling comparison across benchmarks. The method automatically surfaces both kinds of gaps without manual supervision, providing an interpretable, concept-level decomposition of benchmark scores that informs both model improvement and benchmark design.

Link: https://arxiv.org/abs/2512.20638
Authors: Matyas Bohacek, Nino Scherrer, Nicholas Dufour, Thomas Leung, Christoph Bregler, Stephanie C. Y. Chan
Affiliations: Stanford University; Google DeepMind
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The evaluation of large language models (LLMs) relies heavily on standardized benchmarks. These benchmarks provide useful aggregated metrics for a given capability, but those aggregated metrics can obscure (i) particular sub-areas where the LLMs are weak (“model gaps”) and (ii) imbalanced coverage in the benchmarks themselves (“benchmark gaps”). We propose a new method that uses sparse autoencoders (SAEs) to automatically uncover both types of gaps. By extracting SAE concept activations and computing saliency-weighted performance scores across benchmark data, the method grounds evaluation in the model’s internal representations and enables comparison across benchmarks. As examples demonstrating our approach, we applied the method to two popular open-source models and ten benchmarks. We found that these models consistently underperformed on concepts that stand in contrast to sycophantic behaviors (e.g., politely refusing a request or asserting boundaries) and concepts connected to safety discussions. These model gaps align with observations previously surfaced in the literature; our automated, unsupervised method was able to recover them without manual supervision. We also observed benchmark gaps: many of the evaluated benchmarks over-represented concepts related to obedience, authority, or instruction-following, while missing core concepts that should fall within their intended scope. In sum, our method offers a representation-grounded approach to evaluation, enabling concept-level decomposition of benchmark scores. Rather than replacing conventional aggregated metrics, CG complements them by providing a concept-level decomposition that can reveal why a model scored as it did and how benchmarks could evolve to better reflect their intended scope. Code is available at this https URL.
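The paper does not spell out its exact weighting, but a saliency-weighted concept performance score can plausibly be instantiated as follows: for each SAE concept, average example-level benchmark outcomes weighted by how strongly the concept fires, so a low score flags a candidate model gap. The function and array layout below are assumptions for illustration.

```python
import numpy as np

def concept_scores(activations, correct):
    """Saliency-weighted per-concept performance.
    activations: (n_examples, n_concepts) nonnegative SAE activations.
    correct:     (n_examples,) 0/1 benchmark outcomes."""
    w = activations / (activations.sum(axis=0, keepdims=True) + 1e-9)
    return w.T @ correct  # (n_concepts,) weighted accuracy per concept

acts = np.array([[0.9, 0.0], [0.2, 0.8], [0.0, 1.0]])
outcomes = np.array([1, 0, 0])
print(concept_scores(acts, outcomes))  # concept 1 scores ~0: a "model gap"
```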

[NLP-39] Real Time Detection and Quantitative Analysis of Spurious Forgetting in Continual Learning

Quick Read: This paper addresses catastrophic forgetting in continual learning for large language models, in particular the spurious forgetting caused by disrupted task alignment. Its key innovation is a shallow-versus-deep alignment framework that gives the first quantitative characterization of alignment depth: existing approaches maintain alignment only over the first 3-5 output tokens (shallow alignment), leaving models vulnerable to forgetting. The solution is a comprehensive toolkit comprising 0-1-scale metrics for alignment depth across token positions, real-time detection of shallow alignment during training, visualization and recovery-prediction tools, and adaptive mitigation strategies that automatically distinguish forgetting types and promote deep alignment, improving robustness against forgetting by 3.3-7.1% over baselines.

Link: https://arxiv.org/abs/2512.20634
Authors: Weiwei Wang
Affiliations: Shenzhen Sunline Tech Co., Ltd.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Catastrophic forgetting remains a fundamental challenge in continual learning for large language models. Recent work revealed that performance degradation may stem from spurious forgetting caused by task alignment disruption rather than true knowledge loss. However, this work only qualitatively describes alignment, relies on post-hoc analysis, and lacks automatic distinction mechanisms. We introduce the shallow versus deep alignment framework, providing the first quantitative characterization of alignment depth. We identify that current task alignment approaches suffer from shallow alignment - maintained only over the first few output tokens (approximately 3-5) - making models vulnerable to forgetting. This explains why spurious forgetting occurs, why it is reversible, and why fine-tuning attacks are effective. We propose a comprehensive framework addressing all gaps: (1) quantitative metrics (0-1 scale) to measure alignment depth across token positions; (2) real-time detection methods for identifying shallow alignment during training; (3) specialized analysis tools for visualization and recovery prediction; and (4) adaptive mitigation strategies that automatically distinguish forgetting types and promote deep alignment. Extensive experiments on multiple datasets and model architectures (Qwen2.5-3B to Qwen2.5-32B) demonstrate 86.2-90.6% identification accuracy and show that promoting deep alignment improves robustness against forgetting by 3.3-7.1% over baselines.

[NLP-40] Zero-Training Temporal Drift Detection for Transformer Sentiment Models: A Comprehensive Analysis on Authentic Social Media Streams ICML

Quick Read: This paper addresses temporal drift in transformer-based sentiment models on authentic social media data, where accuracy degrades markedly during major real-world events. The key to the solution is a zero-training drift detection approach: four novel drift metrics enable systematic stability evaluation without labeled data or additional training, and bootstrap confidence intervals provide statistical validation. The method identifies performance degradation while remaining computationally efficient, with detection capability empirically validated across multiple events and outperforming embedding-based baselines, making it practical for real-time sentiment monitoring.

Link: https://arxiv.org/abs/2512.20631
Authors: Aayam Bansal, Ishaan Gangwani
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ICML NewInML

Abstract:We present a comprehensive zero-training temporal drift analysis of transformer-based sentiment models validated on authentic social media data from major real-world events. Through systematic evaluation across three transformer architectures and rigorous statistical validation on 12,279 authentic social media posts, we demonstrate significant model instability with accuracy drops reaching 23.4% during event-driven periods. Our analysis reveals maximum confidence drops of 13.0% (Bootstrap 95% CI: [9.1%, 16.5%]) with strong correlation to actual performance degradation. We introduce four novel drift metrics that outperform embedding-based baselines while maintaining computational efficiency suitable for production deployment. Statistical validation across multiple events confirms robust detection capabilities with practical significance exceeding industry monitoring thresholds. This zero-training methodology enables immediate deployment for real-time sentiment monitoring systems and provides new insights into transformer model behavior during dynamic content periods.
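The bootstrap confidence intervals quoted above (e.g., the [9.1%, 16.5%] CI on the confidence drop) follow a standard recipe. Here is a self-contained sketch on synthetic confidence scores; the window sizes and distributions are chosen only for illustration.

```python
import numpy as np

def bootstrap_ci(pre, post, n_boot=10_000, alpha=0.05, seed=0):
    """95% bootstrap CI for the drop in mean model confidence between a
    pre-event window and an event window (a zero-training drift signal)."""
    rng = np.random.default_rng(seed)
    drops = np.empty(n_boot)
    for i in range(n_boot):
        a = rng.choice(pre, size=len(pre), replace=True)
        b = rng.choice(post, size=len(post), replace=True)
        drops[i] = a.mean() - b.mean()
    lo, hi = np.percentile(drops, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

pre = np.random.default_rng(1).normal(0.85, 0.05, 500)   # toy confidences
post = np.random.default_rng(2).normal(0.75, 0.08, 500)
print(bootstrap_ci(pre, post))   # an interval excluding 0 flags drift
```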

[NLP-41] MegaRAG : Multimodal Knowledge Graph-Based Retrieval Augmented Generation

Quick Read: This paper addresses the difficulty large language models (LLMs) have with high-level conceptual understanding and holistic reasoning over long, domain-specific content (e.g., full-length books) due to limited context windows. Existing knowledge graph (KG)-based retrieval-augmented generation (RAG) methods provide structured support but are restricted to text-only inputs, ignoring the complementary insights offered by other modalities such as vision. The key to the solution is a multimodal knowledge graph-based RAG framework that incorporates visual cues into knowledge graph construction, the retrieval phase, and answer generation, enabling cross-modal reasoning for better content understanding. Experiments on both global and fine-grained question answering tasks show consistent gains over existing text-only and multimodal RAG approaches.

Link: https://arxiv.org/abs/2512.20626
Authors: Chi-Hsiang Hsiao, Yi-Cheng Wang, Tzung-Sheng Lin, Yi-Ren Yeh, Chu-Song Chen
Affiliations: National Taiwan University; E.SUN Financial Holding Co., Ltd.; National Kaohsiung Normal University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments:

Abstract:Retrieval-augmented generation (RAG) enables large language models (LLMs) to dynamically access external information, which is powerful for answering questions over previously unseen documents. Nonetheless, they struggle with high-level conceptual understanding and holistic comprehension due to limited context windows, which constrain their ability to perform deep reasoning over long-form, domain-specific content such as full-length books. To solve this problem, knowledge graphs (KGs) have been leveraged to provide entity-centric structure and hierarchical summaries, offering more structured support for reasoning. However, existing KG-based RAG solutions remain restricted to text-only inputs and fail to leverage the complementary insights provided by other modalities such as vision. On the other hand, reasoning from visual documents requires textual, visual, and spatial cues into structured, hierarchical concepts. To address this issue, we introduce a multimodal knowledge graph-based RAG that enables cross-modal reasoning for better content understanding. Our method incorporates visual cues into the construction of knowledge graphs, the retrieval phase, and the answer generation process. Experimental results across both global and fine-grained question answering tasks show that our approach consistently outperforms existing RAG-based approaches on both textual and multimodal corpora.

[NLP-42] Decoding Predictive Inference in Visual Language Processing via Spatiotemporal Neural Coherence NEURIPS2025

Quick Read: This paper addresses how to decode, from neural signals (EEG), the predictive neural mechanisms Deaf signers engage when processing dynamic visual language stimuli. The key to the solution is a multimodal machine learning framework that uses coherence between neural signals and optical flow-derived motion features to build spatiotemporal representations of predictive neural dynamics, and applies entropy-based feature selection to identify frequency-specific neural signatures distinguishing interpretable linguistic input from linguistically disrupted (time-reversed) stimuli. The results highlight distributed left-hemispheric and frontal low-frequency coherence as central to language comprehension and reveal experience-dependent neural signatures that correlate with age.

Link: https://arxiv.org/abs/2512.20929
Authors: Sean C. Borneman, Julia Krebs, Ronnie B. Wilbur, Evie A. Malaia
Affiliations: Carnegie-Mellon University; University of Salzburg; Purdue University; University of Alabama
Subjects: Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL)
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Foundation Models for the Brain and Body

Abstract:Human language processing relies on the brain’s capacity for predictive inference. We present a machine learning framework for decoding neural (EEG) responses to dynamic visual language stimuli in Deaf signers. Using coherence between neural signals and optical flow-derived motion features, we construct spatiotemporal representations of predictive neural dynamics. Through entropy-based feature selection, we identify frequency-specific neural signatures that differentiate interpretable linguistic input from linguistically disrupted (time-reversed) stimuli. Our results reveal distributed left-hemispheric and frontal low-frequency coherence as key features in language comprehension, with experience-dependent neural signatures correlating with age. This work demonstrates a novel multimodal approach for probing experience-driven generative models of perception in the brain.
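Coherence between a neural channel and an optical-flow-derived motion signal, the core quantity here, is a standard spectral measure. A minimal sketch with synthetic signals follows; the sampling rate, band limits, and coupling strength are made up for illustration.

```python
import numpy as np
from scipy.signal import coherence

fs = 250.0                                   # toy EEG sampling rate (Hz)
t = np.arange(0, 20, 1 / fs)
rng = np.random.default_rng(0)
# Optical-flow motion magnitude and one EEG channel, weakly phase-coupled at 4 Hz.
motion = np.sin(2 * np.pi * 4 * t) + 0.5 * rng.standard_normal(t.size)
eeg = 0.6 * np.sin(2 * np.pi * 4 * t + 0.8) + rng.standard_normal(t.size)

f, Cxy = coherence(eeg, motion, fs=fs, nperseg=512)
band = (f >= 2) & (f <= 8)                   # low-frequency band of interest
print("mean low-frequency coherence:", Cxy[band].mean())
```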

Computer Vision

[CV-0] HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming

Quick Read: This paper addresses the extreme inference inefficiency of diffusion models for high-resolution video generation caused by their quadratic computational complexity. The core solution is the HiStream framework, which systematically removes redundancy along three axes: spatial compression denoises at low resolution and caches features to guide high-resolution refinement; temporal compression processes video chunk by chunk with a fixed-size anchor cache, keeping inference speed stable; and timestep compression applies fewer denoising steps to subsequent cache-conditioned chunks. On 1080p benchmarks the approach achieves 76.2x to 107.5x acceleration while maintaining near-state-of-the-art visual quality, making high-resolution video generation practical and scalable.

Link: https://arxiv.org/abs/2512.21338
Authors: Haonan Qiu, Shikun Liu, Zijian Zhou, Zhaochong An, Weiming Ren, Zhiheng Liu, Jonas Schult, Sen He, Shoufa Chen, Yuren Cong, Tao Xiang, Ziwei Liu, Juan-Manuel Perez-Rua
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this http URL

Abstract:High-resolution video generation, while crucial for digital media and film, is computationally bottlenecked by the quadratic complexity of diffusion models, making practical inference infeasible. To address this, we introduce HiStream, an efficient autoregressive framework that systematically reduces redundancy across three axes: i) Spatial Compression: denoising at low resolution before refining at high resolution with cached features; ii) Temporal Compression: a chunk-by-chunk strategy with a fixed-size anchor cache, ensuring stable inference speed; and iii) Timestep Compression: applying fewer denoising steps to subsequent, cache-conditioned chunks. On 1080p benchmarks, our primary HiStream model (i+ii) achieves state-of-the-art visual quality while demonstrating up to 76.2x faster denoising compared to the Wan2.1 baseline and negligible quality loss. Our faster variant, HiStream+, applies all three optimizations (i+ii+iii), achieving a 107.5x acceleration over the baseline, offering a compelling trade-off between speed and quality, thereby making high-resolution video generation both practical and scalable.

[CV-1] Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models

Quick Read: This paper addresses a pronounced popularity bias in current vision-language models (VLMs) on building-age prediction: accuracy is up to 34% higher on famous buildings than on ordinary ones, indicating reliance on memorization rather than generalizable understanding. The key to the solution is YearGuessr, the largest open benchmark for this task: 55,546 building images from 157 countries with multimodal attributes (construction year as continuous ordinal labels, GPS coordinates, and page-view counts as a popularity proxy), together with popularity-aware interval accuracy metrics that quantify the bias. Benchmarking 30+ models, including the authors' YearCLIP, confirms that VLMs excel on popular, memorized items but degrade sharply on unfamiliar subjects, exposing a core flaw in their reasoning capabilities.

Link: https://arxiv.org/abs/2512.21337
Authors: Li-Zhong Szu-Tu, Ting-Lin Wu, Chia-Jui Chang, He Syu, Yu-Lun Liu
Affiliations: National Yang Ming Chiao Tung University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:We expose a significant popularity bias in state-of-the-art vision-language models (VLMs), which achieve up to 34% higher accuracy on famous buildings compared to ordinary ones, indicating a reliance on memorization over generalizable understanding. To systematically investigate this, we introduce the largest open benchmark for this task: the YearGuessr dataset, a collection of 55,546 building images with multi-modal attributes from 157 countries, annotated with continuous ordinal labels of their construction year (1001-2024), GPS data, and page-view counts as a proxy for popularity. Using this dataset, we frame the construction year prediction task as ordinal regression and introduce popularity-aware interval accuracy metrics to quantify this bias. Our resulting benchmark of 30+ models, including our YearCLIP model, confirms that VLMs excel on popular, memorized items but struggle significantly with unrecognized subjects, exposing a critical flaw in their reasoning capabilities. Project page: this https URL
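A popularity-aware interval accuracy metric of the kind introduced here can be sketched directly; the ±10-year tolerance and the top-decile page-view cutoff below are illustrative choices, not the paper's exact thresholds.

```python
import numpy as np

def interval_acc(pred, true, tol=10):
    """Correct if the predicted construction year is within +/- tol years."""
    return np.abs(pred - true) <= tol

def popularity_gap(pred, true, views, tol=10, q=0.9):
    """Interval accuracy on high-pageview items minus the rest."""
    famous = views >= np.quantile(views, q)
    acc = interval_acc(pred, true, tol)
    return acc[famous].mean() - acc[~famous].mean()

rng = np.random.default_rng(0)
true = rng.integers(1001, 2025, 1000)
pred = true + rng.normal(0, 15, 1000).astype(int)   # unbiased toy predictor
views = rng.pareto(2.0, 1000)                       # heavy-tailed popularity
print(popularity_gap(pred, true, views))  # ~0 here; gaps near 0.34 signal memorization
```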

[CV-2] Streaming Video Instruction Tuning

Link: https://arxiv.org/abs/2512.21334
Authors: Jiaer Xia, Peixian Chen, Mengdan Zhang, Xing Sun, Kaiyang Zhou
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

[CV-3] Fast SAM2 with Text-Driven Token Pruning

Quick Read: This paper addresses the high computational and memory cost that limits practical deployment of video object segmentation models such as SAM2, which propagate all dense visual tokens through time and incur quadratic memory-attention overhead. The key to the solution is a text-guided token pruning framework that, after visual encoding and before memory-based temporal propagation, ranks all visual tokens with a lightweight routing mechanism combining local visual context, object-centric semantic relevance (from user-provided or automatically generated text descriptions), and uncertainty cues, retaining the most informative regions and discarding redundancy. Without modifying the underlying architecture, the method achieves up to 42.50% faster inference and 37.41% lower GPU memory usage than the unpruned SAM2 baseline while preserving segmentation accuracy.

Link: https://arxiv.org/abs/2512.21333
Authors: Avilasha Mandal, Chaoning Zhang, Fachrina Dewi Puspitasari, Xudong Wang, Jiaquan Zhang, Caiyan Qin, Guoqing Wang, Yang Yang, Heng Tao Shen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 28 pages, 9 figures

Abstract:Segment Anything Model 2 (SAM2), a vision foundation model has significantly advanced in prompt-driven video object segmentation, yet their practical deployment remains limited by the high computational and memory cost of processing dense visual tokens across time. The SAM2 pipelines typically propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object, resulting in reduced scalability due to quadratic memory attention overhead. In this work, we introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation, without modifying the underlying segmentation architecture. Operating after visual encoding and before memory based propagation, our method ranks tokens using a lightweight routing mechanism that integrates local visual context, semantic relevance derived from object-centric textual descriptions (either user-provided or automatically generated), and uncertainty cues that help preserve ambiguous or boundary critical regions. By retaining only the most informative tokens for downstream processing, the proposed approach reduces redundant computation while maintaining segmentation fidelity. Extensive experiments across multiple challenging video segmentation benchmarks demonstrate that post-encoder token pruning provides a practical and effective pathway to efficient, prompt-aware video segmentation, achieving up to 42.50 percent faster inference and 37.41 percent lower GPU memory usage compared to the unpruned baseline SAM2, while preserving competitive J and F performance. These results highlight the potential of early token selection to improve the scalability of transformer-based video segmentation systems for real-time and resource-constrained applications.
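The post-encoder ranking step can be sketched compactly. Below, tokens are scored by cosine similarity to a text embedding of the target description plus an optional uncertainty bonus, and only the top fraction is kept; the paper's router also folds in local visual context, which this sketch omits.

```python
import torch

def prune_tokens(tokens, text_emb, keep_ratio=0.4, uncertainty=None):
    """Rank visual tokens against a text embedding and keep the top fraction.
    tokens: (N, D) encoder outputs; text_emb: (D,) object description embedding."""
    sim = torch.nn.functional.cosine_similarity(tokens, text_emb[None, :], dim=-1)
    score = sim if uncertainty is None else sim + uncertainty  # keep ambiguous regions
    k = max(1, int(keep_ratio * tokens.shape[0]))
    idx = score.topk(k).indices.sort().values      # preserve spatial order
    return tokens[idx], idx

toks, txt = torch.randn(1024, 256), torch.randn(256)
kept, idx = prune_tokens(toks, txt)
print(kept.shape)  # torch.Size([409, 256]): only these reach memory attention
```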

[CV-4] TICON: A Slide-Level Tile Contextualizer for Histopathology Representation Learning

Quick Read: This paper addresses the lack of large-field whole slide image (WSI) context in small-tile representations for computational pathology. Conventional tile-encoder pipelines strip tiles from their original context when extracting embeddings, losing the slide-level information essential for both local and global tasks; moreover, different tile encoders excel at different downstream tasks, calling for a unified model that contextualizes embeddings from any tile foundation model. The key to the solution is TICON, a transformer-based tile-representation contextualizer that uses a single shared encoder, pretrained with a masked-modeling objective, to simultaneously unify and contextualize embeddings from diverse tile foundation models. TICON-contextualized embeddings significantly improve performance on tile-level and slide-level benchmarks, and an aggregator pretrained on TICON with only 11K WSIs outperforms existing slide-level foundation models.

Link: https://arxiv.org/abs/2512.21331
Authors: Varun Belagali, Saarthak Kapse, Pierre Marza, Srijan Das, Zilinghan Li, Sofiène Boutaj, Pushpak Pati, Srikar Yellapragada, Tarak Nath Nandi, Ravi K Madduri, Joel Saltz, Prateek Prasanna, Stergios Christodoulidis, Maria Vakalopoulou, Dimitris Samaras
Affiliations: Stony Brook University; MICS, CentraleSupélec, Université Paris-Saclay; UNC Charlotte; Argonne National Laboratory; University of Chicago; Archimedes/Athena RC; Independent Researcher
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The interpretation of small tiles in large whole slide images (WSI) often needs a larger image context. We introduce TICON, a transformer-based tile representation contextualizer that produces rich, contextualized embeddings for ‘‘any’’ application in computational pathology. Standard tile encoder-based pipelines, which extract embeddings of tiles stripped from their context, fail to model the rich slide-level information essential for both local and global tasks. Furthermore, different tile-encoders excel at different downstream tasks. Therefore, a unified model is needed to contextualize embeddings derived from ‘‘any’’ tile-level foundation model. TICON addresses this need with a single, shared encoder, pretrained using a masked modeling objective to simultaneously unify and contextualize representations from diverse tile-level pathology foundation models. Our experiments demonstrate that TICON-contextualized embeddings significantly improve performance across many different tasks, establishing new state-of-the-art results on tile-level benchmarks (i.e., HEST-Bench, THUNDER, CATCH) and slide-level benchmarks (i.e., Patho-Bench). Finally, we pretrain an aggregator on TICON to form a slide-level foundation model, using only 11K WSIs, outperforming SoTA slide-level foundation models pretrained with up to 350K WSIs.

[CV-5] Does the Data Processing Inequality Reflect Practice? On the Utility of Low-Level Tasks

Quick Read: This paper asks why low-level processing (such as denoising or encoding) before classification is common in practice despite the expressive power of modern deep networks, a practice seemingly at odds with the data processing inequality, which states that processing cannot increase information content and hence should not improve classification. The key contribution is a theoretical framework showing that, with finite samples, pre-classification processing can improve accuracy: the authors analyze a classifier tightly connected to the optimal Bayes classifier, which converges to it as the number of training samples grows, and prove that for any finite sample size there exists a preprocessing step that yields a relative performance gain. They further characterize how class separation, training set size, and class balance affect this gain, and support the theory with empirical studies of denoising and encoding for practical deep classifiers on benchmark datasets.

Link: https://arxiv.org/abs/2512.21315
Authors: Roy Turgeman, Tom Tirer
Affiliations: Bar-Ilan University
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments:

Abstract:The data processing inequality is an information-theoretic principle stating that the information content of a signal cannot be increased by processing the observations. In particular, it suggests that there is no benefit in enhancing the signal or encoding it before addressing a classification problem. This assertion can be proven to be true for the case of the optimal Bayes classifier. However, in practice, it is common to perform “low-level” tasks before “high-level” downstream tasks despite the overwhelming capabilities of modern deep neural networks. In this paper, we aim to understand when and why low-level processing can be beneficial for classification. We present a comprehensive theoretical study of a binary classification setup, where we consider a classifier that is tightly connected to the optimal Bayes classifier and converges to it as the number of training samples increases. We prove that for any finite number of training samples, there exists a pre-classification processing that improves the classification accuracy. We also explore the effect of class separation, training set size, and class balance on the relative gain from this procedure. We support our theory with an empirical investigation of the theoretical setup. Finally, we conduct an empirical study where we investigate the effect of denoising and encoding on the performance of practical deep classifiers on benchmark datasets. Specifically, we vary the size and class distribution of the training set, and the noise level, and demonstrate trends that are consistent with our theoretical results.

[CV-6] AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents

Quick Read: This paper addresses the limitations of current benchmarks for mobile graphical user interface (GUI) agents, which cover few apps, simple tasks, and coarse-grained metrics and thus fail to reflect agent performance in realistic, complex scenarios. The key to the solution is AndroidLens, a framework of 571 long-latency tasks (averaging more than 26 steps) spanning 38 real-world domains and covering complex task types such as multi-constraint, multi-goal, and domain-specific tasks. It combines static evaluation, which preserves real-world anomalies and admits multiple valid paths to reduce bias, with dynamic evaluation, which uses a milestone-based scheme for fine-grained progress measurement via Average Task Progress (ATP), yielding a more realistic and comprehensive measure of GUI agent capability.

Link: https://arxiv.org/abs/2512.21302
Authors: Yue Cao, Yingyao Wang, Pi Bu, Jingxuan Xing, Wei Jiang, Zekun Zhu, Junpeng Ma, Sashuai Zhou, Tong Lu, Jun Song, Yu Cheng, Yuning Jiang, Bo Zheng
Affiliations: Nanjing University; Alibaba Group; Fudan University; Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 13 figures, 8 tables

Abstract:Graphical user interface (GUI) agents can substantially improve productivity by automating frequently executed long-latency tasks on mobile devices. However, existing evaluation benchmarks are still constrained to limited applications, simple tasks, and coarse-grained metrics. To address this, we introduce AndroidLens, a challenging evaluation framework for mobile GUI agents, comprising 571 long-latency tasks in both Chinese and English environments, each requiring an average of more than 26 steps to complete. The framework features: (1) tasks derived from real-world user scenarios across 38 domains, covering complex types such as multi-constraint, multi-goal, and domain-specific tasks; (2) static evaluation that preserves real-world anomalies and allows multiple valid paths to reduce bias; and (3) dynamic evaluation that employs a milestone-based scheme for fine-grained progress measurement via Average Task Progress (ATP). Our evaluation indicates that even the best models reach only a 12.7% task success rate and 50.47% ATP. We also underscore key challenges in real-world environments, including environmental anomalies, adaptive exploration, and long-term memory retention.

[CV-7] Post-Processing Mask-Based Table Segmentation for Structural Coordinate Extraction

Quick Read: This paper addresses the difficulty of accurately extracting table structure boundaries (rows and columns) from low-resolution or noisy images, where transformer-based methods adapt poorly to degraded or incomplete table data. The key to the solution is a multi-scale signal-processing method: row and column transitions in a table mask are modeled as one-dimensional signals, smoothed by Gaussian convolution with progressively increasing variance, and statistically thresholded to suppress noise while preserving stable structural edges; detected signal peaks are then mapped back to image coordinates to obtain precise segment boundaries. The approach raises Cell-Aware Segmentation Accuracy (CASA) from 67% to 76% for TableNet + PyTesseract OCR on the PubLayNet-1M benchmark and remains robust to resolution variations.

Link: https://arxiv.org/abs/2512.21287
Authors: Suren Bandara
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Structured data extraction from tables plays a crucial role in document image analysis for scanned documents and digital archives. Although many methods have been proposed to detect table structures and extract cell contents, accurately identifying table segment boundaries (rows and columns) remains challenging, particularly in low-resolution or noisy images. In many real-world scenarios, table data are incomplete or degraded, limiting the adaptability of transformer-based methods to noisy inputs. Mask-based edge detection techniques have shown greater robustness under such conditions, as their sensitivity can be adjusted through threshold tuning; however, existing approaches typically apply masks directly to images, leading to noise sensitivity, resolution loss, or high computational cost. This paper proposes a novel multi-scale signal-processing method for detecting table edges from table masks. Row and column transitions are modeled as one-dimensional signals and processed using Gaussian convolution with progressively increasing variances, followed by statistical thresholding to suppress noise while preserving stable structural edges. Detected signal peaks are mapped back to image coordinates to obtain accurate segment boundaries. Experimental results show that applying the proposed approach to column edge detection improves Cell-Aware Segmentation Accuracy (CASA) a layout-aware metric evaluating both textual correctness and correct cell placement from 67% to 76% on the PubLayNet-1M benchmark when using TableNet with PyTesseract OCR. The method is robust to resolution variations through zero-padding and scaling strategies and produces optimized structured tabular outputs suitable for downstream analysis.
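The multi-scale signal pipeline maps naturally to a few lines of NumPy/SciPy. In this sketch (the scales and z-score threshold are illustrative, not the paper's tuned values), a binary column mask is projected to a 1-D occupancy profile, transition strength is smoothed at several Gaussian scales, and only peaks that survive thresholding at every scale are kept as column boundaries.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def column_boundaries(mask, sigmas=(1, 2, 4, 8), z=2.0):
    """Detect stable column edges from a binary table mask."""
    profile = mask.astype(float).sum(axis=0)      # per-column ink occupancy
    signal = np.abs(np.diff(profile))             # column-transition strength
    votes = np.zeros_like(signal)
    for s in sigmas:                              # progressively larger variance
        sm = gaussian_filter1d(signal, sigma=s)
        votes += sm > sm.mean() + z * sm.std()    # statistical threshold
    return np.flatnonzero(votes == len(sigmas))   # edges stable across scales
```

Requiring agreement across every scale is one simple way to keep edges that persist under increasing smoothing while discarding noise-induced peaks that survive only at fine scales.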

[CV-8] Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential

Quick Read: This paper addresses the difficulty of deploying ANN-based surgical scene segmentation models in resource-constrained operating environments due to their high computational and power demands, while spiking neural networks (SNNs), despite low latency and high efficiency, are limited by scarce annotations and the inherently sparse representations of surgical video. The key to the solution is SpikeSurgSeg, the first spike-driven video Transformer framework for surgical scene segmentation with real-time potential: a surgical-scene masked autoencoding pretraining strategy with layer-wise tube masking enables robust spatiotemporal representation learning, and on this pretrained backbone a lightweight spike-driven segmentation head produces temporally consistent predictions while retaining the low-latency properties of SNNs. On EndoVis18 and the authors' in-house SurgBleed dataset, the method matches state-of-the-art ANN-based models in mIoU while cutting inference latency by at least 8x, and runs over 20x faster than most foundation-model baselines, showing real-time potential on non-GPU platforms.

Link: https://arxiv.org/abs/2512.21284
Authors: Shihao Zou, Jingjing Li, Wei Ji, Jincai Huang, Kai Wang, Guo Dan, Weixin Si, Yi Pan
Affiliations: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of Alberta; Yale University; Southern University of Science and Technology; Shenzhen University of Advanced Technology; Nanfang Hospital, Southern Medical University; School of Biomedical Engineering, Shenzhen University; Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Modern surgical systems increasingly rely on intelligent scene understanding to provide timely situational awareness for enhanced intra-operative safety. Within this pipeline, surgical scene segmentation plays a central role in accurately perceiving operative events. Although recent deep learning models, particularly large-scale foundation models, achieve remarkable segmentation accuracy, their substantial computational demands and power consumption hinder real-time deployment in resource-constrained surgical environments. To address this limitation, we explore the emerging SNN as a promising paradigm for highly efficient surgical intelligence. However, their performance is still constrained by the scarcity of labeled surgical data and the inherently sparse nature of surgical video representations. To this end, we propose SpikeSurgSeg, the first spike-driven video Transformer framework tailored for surgical scene segmentation with real-time potential on non-GPU platforms. To address the limited availability of surgical annotations, we introduce a surgical-scene masked autoencoding pretraining strategy for SNNs that enables robust spatiotemporal representation learning via layer-wise tube masking. Building on this pretrained backbone, we further adopt a lightweight spike-driven segmentation head that produces temporally consistent predictions while preserving the low-latency characteristics of SNNs. Extensive experiments on EndoVis18 and our in-house SurgBleed dataset demonstrate that SpikeSurgSeg achieves mIoU comparable to SOTA ANN-based models while reducing inference latency by at least 8\times . Notably, it delivers over 20\times acceleration relative to most foundation-model baselines, underscoring its potential for time-critical surgical scene segmentation.

[CV-9] GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation

Quick Read: This paper questions whether representing image sequences as large tensors of sequentially stacked frames is ideal for generative modeling, given the inefficiencies and bottlenecks of current state-of-the-art image-sequence generation. The key to the solution is factorizing generation into stages: first generate a coarse sequence at low resolution, exploiting the strong self-attention of the Diffusion Transformer (DiT) to capture inter-frame correlations by training solely on grid images of subsampled frames, which turns a 2D image generator into a low-resolution 3D image-sequence generator with no architectural changes; then super-resolve each frame independently to add sequence-independent high-resolution detail. This yields better synthesis quality, stronger cross-sequence coherence, high-fidelity generation of arbitrary-length sequences, faster inference (at least twice as fast), efficient use of training data, and good generalization across data domains compared with the state of the art.

Link: https://arxiv.org/abs/2512.21276
Authors: Snehal Singh Tomar, Alexandros Graikos, Arjun Krishna, Dimitris Samaras, Klaus Mueller
Affiliations: Stony Brook University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Modern deep learning methods typically treat image sequences as large tensors of sequentially stacked frames. However, is this straightforward representation ideal given the current state-of-the-art (SoTA)? In this work, we address this question in the context of generative models and aim to devise a more effective way of modeling image sequence data. Observing the inefficiencies and bottlenecks of current SoTA image sequence generation methods, we showcase that rather than working with large tensors, we can improve the generation process by factorizing it into first generating the coarse sequence at low resolution and then refining the individual frames at high resolution. We train a generative model solely on grid images comprising subsampled frames. Yet, we learn to generate image sequences, using the strong self-attention mechanism of the Diffusion Transformer (DiT) to capture correlations between frames. In effect, our formulation extends a 2D image generator to operate as a low-resolution 3D image-sequence generator without introducing any architectural modifications. Subsequently, we super-resolve each frame individually to add the sequence-independent high-resolution details. This approach offers several advantages and can overcome key limitations of the SoTA in this domain. Compared to existing image sequence generation models, our method achieves superior synthesis quality and improved coherence across sequences. It also delivers high-fidelity generation of arbitrary-length sequences and increased efficiency in inference time and training data usage. Furthermore, our straightforward formulation enables our method to generalize effectively across diverse data domains, which typically require additional priors and supervision to model in a generative context. Our method consistently outperforms SoTA in quality and inference speed (at least twice-as-fast) across datasets.

[CV-10] ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision

Quick Read: This paper addresses insufficient conditional control in video generation: existing methods align poorly with conditioning signals and offer limited controllability. Classifier-free guidance conditions only indirectly by modeling the joint distribution of data and conditions, while classifier-based guidance can be exploited by the model to raise classifier scores without genuinely satisfying the condition, producing adversarial artifacts. The key innovation of Attention-Conditional Diffusion (ACD) is direct conditional control via attention supervision: aligning the model's attention maps with external control signals improves responsiveness to conditioning inputs. The framework further introduces a sparse 3D-aware object layout as an efficient conditioning signal, with a dedicated Layout ControlNet and an automated annotation pipeline for scalable layout integration. Experiments show ACD delivers markedly better alignment with conditioning inputs while preserving temporal coherence and visual fidelity.

Link: https://arxiv.org/abs/2512.21268
Authors: Weiqi Li, Zehao Zhang, Liang Lin, Guangrun Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Controllability is a fundamental requirement in video synthesis, where accurate alignment with conditioning signals is essential. Existing classifier-free guidance methods typically achieve conditioning indirectly by modeling the joint distribution of data and conditions, which often results in limited controllability over the specified conditions. Classifier-based guidance enforces conditions through an external classifier, but the model may exploit this mechanism to raise the classifier score without genuinely satisfying the intended condition, resulting in adversarial artifacts and limited effective controllability. In this paper, we propose Attention-Conditional Diffusion (ACD), a novel framework for direct conditional control in video diffusion models via attention supervision. By aligning the model’s attention maps with external control signals, ACD achieves better controllability. To support this, we introduce a sparse 3D-aware object layout as an efficient conditioning signal, along with a dedicated Layout ControlNet and an automated annotation pipeline for scalable layout integration. Extensive experiments on benchmark video generation datasets demonstrate that ACD delivers superior alignment with conditioning inputs while preserving temporal coherence and visual fidelity, establishing an effective paradigm for conditional video synthesis.
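Attention supervision of this kind reduces to a divergence between attention maps and a conditioning-derived target. Below is a minimal sketch (the paper's exact loss and where it attaches in the DiT are not specified here): heads are averaged, the layout mask is normalized into a target distribution over keys, and a KL term pulls attention toward it.

```python
import torch
import torch.nn.functional as F

def attention_supervision_loss(attn, target_mask):
    """Align attention with an external layout signal.
    attn:        (B, heads, Q, K) attention probabilities.
    target_mask: (B, Q, K) nonnegative map of which keys each query
                 should attend to under the conditioning layout."""
    attn = attn.mean(dim=1)                                   # average heads
    target = target_mask / target_mask.sum(-1, keepdim=True).clamp_min(1e-6)
    return F.kl_div(attn.clamp_min(1e-9).log(), target, reduction="batchmean")

attn = torch.softmax(torch.randn(2, 8, 16, 16), dim=-1)      # toy attention
mask = (torch.rand(2, 16, 16) > 0.7).float() + 1e-3          # toy layout target
print(attention_supervision_loss(attn, mask))
```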
zh
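为说明"注意力监督"这类做法的基本形态,下面给出一个极简的 PyTorch 损失示意(非 ACD 官方实现,注意力图与控制掩码的形状均为假设):把注意力分布与外部控制信号分别归一化为空间分布,再用 KL 散度对齐。

```python
import torch
import torch.nn.functional as F

def attention_supervision_loss(attn: torch.Tensor, target_mask: torch.Tensor) -> torch.Tensor:
    """attn: (B, heads, L) 的注意力图;target_mask: (B, L) 的外部控制信号(如布局掩码)。
    归一化后用 KL 散度对齐,鼓励注意力真正落在条件指定的区域内。"""
    p = attn.mean(dim=1)                                       # 平均多头 -> (B, L)
    p = p / p.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    q = target_mask / target_mask.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return F.kl_div(p.clamp_min(1e-8).log(), q, reduction="batchmean")

attn = torch.rand(2, 8, 256)                 # 假设的交叉注意力图(空间维已展平)
mask = torch.zeros(2, 256); mask[:, :64] = 1.0
loss = attention_supervision_loss(attn, mask)
```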

[CV-11] AnyAD: Unified Any-Modality Anomaly Detection in Incomplete Multi-Sequence MRI

【速读】:该论文旨在解决脑部磁共振成像(MRI)中异常检测的可靠性问题,尤其针对标注异常病例稀缺以及临床实际流程中关键影像模态缺失的挑战。现有单类或多类异常检测(Anomaly Detection, AD)模型通常依赖固定模态配置、需重复训练或无法泛化至未见模态组合,限制了其临床可扩展性。解决方案的关键在于提出一个统一的任意模态异常检测(Any-Modality AD)框架:通过双路径DINOv2编码器与特征分布对齐机制,统计上将不完整模态特征与全模态表示对齐,从而在严重模态缺失下仍能稳定推理;同时引入内在正常原型(Intrinsic Normal Prototypes, INPs)提取器和INP引导解码器,在重建正常解剖模式的同时自然放大异常偏离,提升语义一致性。训练阶段采用随机模态掩码和间接特征补全策略,使模型无需重训练即可适应所有模态组合,显著提升了跨模态配置的泛化能力。

链接: https://arxiv.org/abs/2512.21264
作者: Changwei Wu,Yifei Chen,Yuxin Du,Mingxuan Liu,Jinying Zong,Beining Wu,Jie Dong,Feiwei Qin,Yunkang Cao,Qiyuan Tian
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 8 figures

点击查看摘要

Abstract:Reliable anomaly detection in brain MRI remains challenging due to the scarcity of annotated abnormal cases and the frequent absence of key imaging modalities in real clinical workflows. Existing single-class or multi-class anomaly detection (AD) models typically rely on fixed modality configurations, require repetitive training, or fail to generalize to unseen modality combinations, limiting their clinical scalability. In this work, we present a unified Any-Modality AD framework that performs robust anomaly detection and localization under arbitrary MRI modality availability. The framework integrates a dual-pathway DINOv2 encoder with a feature distribution alignment mechanism that statistically aligns incomplete-modality features with full-modality representations, enabling stable inference even with severe modality dropout. To further enhance semantic consistency, we introduce an Intrinsic Normal Prototypes (INPs) extractor and an INP-guided decoder that reconstruct only normal anatomical patterns while naturally amplifying abnormal deviations. Through randomized modality masking and indirect feature completion during training, the model learns to adapt to all modality configurations without re-training. Extensive experiments on BraTS2018, MU-Glioma-Post, and Pretreat-MetsToBrain-Masks demonstrate that our approach consistently surpasses state-of-the-art industrial and medical AD baselines across 7 modality combinations, achieving superior generalization. This study establishes a scalable paradigm for multimodal medical AD under real-world, imperfect modality conditions. Our source code is available at this https URL.
zh
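下面用几行 PyTorch 勾勒"特征分布对齐"的一种常见实现思路(仅为示意,并非 AnyAD 的原始目标函数;特征形状为假设):让缺失模态下的特征在通道均值与方差上向全模态特征靠拢。

```python
import torch

def distribution_alignment_loss(f_partial: torch.Tensor, f_full: torch.Tensor) -> torch.Tensor:
    """按通道统计量对齐:令模态缺失时的特征 (B, N, C) 在均值/方差上
    逼近全模态特征,使严重模态缺失下的推理保持稳定;f_full 实践中应 detach。"""
    mu_p, mu_f = f_partial.mean(dim=(0, 1)), f_full.mean(dim=(0, 1))
    var_p, var_f = f_partial.var(dim=(0, 1)), f_full.var(dim=(0, 1))
    return (mu_p - mu_f).pow(2).mean() + (var_p - var_f).pow(2).mean()

f_full = torch.randn(4, 196, 768)                 # 全模态特征(假设来自双路径编码器)
f_partial = torch.randn(4, 196, 768) * 1.3 + 0.2  # 模态缺失时的特征
loss = distribution_alignment_loss(f_partial, f_full.detach())
```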

[CV-12] DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation

【速读】:该论文旨在解决“一拍成片”(one-shot)电影拍摄技术在实际应用中因成本高昂和现实约束复杂而难以实现的问题,以及现有视频生成模型依赖简单片段拼接导致视觉不连贯、时间一致性差的缺陷。其解决方案的关键在于提出DreaMontage框架,通过三个核心创新实现任意帧引导下的长时序、高保真、具电影表现力的一拍成片视频合成:(i) 在DiT架构中引入轻量级中间条件机制并结合自适应调优策略,实现对任意输入帧的精准控制;(ii) 构建高质量数据集并设计视觉表达监督微调(Visual Expression SFT)阶段,辅以定制化的DPO优化方案,显著提升主体运动合理性与转场平滑度;(iii) 设计分段自回归(Segment-wise Auto-Regressive, SAR)推理策略,在保证内存效率的同时支持长序列生成。实验证明该方法在视觉效果与时间连续性上均优于现有方法,同时保持计算高效性。

链接: https://arxiv.org/abs/2512.21252
作者: Jiawei Liu,Junqiao Li,Jiangfan Deng,Gen Li,Siyu Zhou,Zetao Fang,Shanshan Lao,Zengde Deng,Jianing Zhu,Tingting Ma,Jiayi Li,Yunqiu Wang,Qian He,Xinglong Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:The “one-shot” technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework designed for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. To achieve this, we address the challenge through three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. In addressing critical issues such as subject motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner. Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.
zh

[CV-13] Improving the Convergence Rate of Ray Search Optimization for Query-Efficient Hard-Label Attacks AAAI2026

【速读】:该论文旨在解决硬标签黑盒对抗攻击中因查询复杂度过高而导致的实际部署困难问题,尤其聚焦于优化一类通过搜索最优射线方向以最小化ℓ₂范数扰动来使良性图像进入对抗区域的攻击方法。其解决方案的关键在于提出一种基于动量的算法ARS-OPT,该算法受Nesterov加速梯度(Nesterov’s Accelerated Gradient, NAG)启发,通过累积动量主动预测未来射线方向的梯度,从而实现更精确的方向更新,提升优化速度与稳定性;进一步地,通过在梯度估计中引入代理模型先验信息,构建了性能更强的PARS-OPT方法,显著提升了查询效率,在ImageNet和CIFAR-10上的实验表明该方法优于13种当前最先进的攻击方法。

链接: https://arxiv.org/abs/2512.21241
作者: Xinjie Xu,Shuyu Cheng,Dongwei Xu,Qi Xuan,Chen Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: Published at AAAI 2026 (Oral). This version corresponds to the conference proceedings; v2 will include the appendix

点击查看摘要

Abstract:In hard-label black-box adversarial attacks, where only the top-1 predicted label is accessible, the prohibitive query complexity poses a major obstacle to practical deployment. In this paper, we focus on optimizing a representative class of attacks that search for the optimal ray direction yielding the minimum \ell_2 -norm perturbation required to move a benign image into the adversarial region. Inspired by Nesterov’s Accelerated Gradient (NAG), we propose a momentum-based algorithm, ARS-OPT, which proactively estimates the gradient with respect to a future ray direction inferred from accumulated momentum. We provide a theoretical analysis of its convergence behavior, showing that ARS-OPT enables more accurate directional updates and achieves faster, more stable optimization. To further accelerate convergence, we incorporate surrogate-model priors into ARS-OPT’s gradient estimation, resulting in PARS-OPT with enhanced performance. The superiority of our approach is supported by theoretical guarantees under standard assumptions. Extensive experiments on ImageNet and CIFAR-10 demonstrate that our method surpasses 13 state-of-the-art approaches in query efficiency.
zh
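为帮助理解"在动量外推点估计梯度"的 NAG 式射线搜索,下面给出一个玩具化的 numpy 示意(非论文实现;g 用一个可解析的距离函数代替真实的硬标签查询,采样数与步长均为演示性设置):

```python
import numpy as np

def estimate_grad(g, theta, n_samples=20, sigma=1e-3):
    """黑盒有限差分梯度估计:g(theta) 表示沿射线方向 theta
    进入对抗区域所需的最小 L2 半径(此处用玩具函数代替)。"""
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        u = np.random.randn(*theta.shape)
        grad += (g(theta + sigma * u) - g(theta)) / sigma * u
    return grad / n_samples

def ars_opt(g, theta0, steps=100, lr=0.05, mu=0.9):
    """受 Nesterov 加速梯度启发的射线搜索示意:在动量外推点处估计梯度。"""
    theta = theta0 / np.linalg.norm(theta0)
    v = np.zeros_like(theta)
    for _ in range(steps):
        lookahead = theta + mu * v            # 先沿累积动量"预看"一步
        grad = estimate_grad(g, lookahead)
        v = mu * v - lr * grad
        theta = theta + v
        theta /= np.linalg.norm(theta)        # 射线方向保持单位范数
    return theta

g = lambda d: np.linalg.norm(d - np.array([1.0, 0.0, 0.0]))  # 玩具目标
theta_star = ars_opt(g, np.random.randn(3))
```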

[CV-14] SegMo: Segment-aligned Text to 3D Human Motion Generation

【速读】:该论文旨在解决从文本描述生成3D人类动作时,现有方法仅在序列层面进行对齐而忽视模态内部语义结构的问题。其核心挑战在于如何实现更细粒度的文本与动作之间的对应关系。解决方案的关键在于提出SegMo框架,通过三个模块实现:(1) 文本片段提取(Text Segment Extraction),将复杂文本描述分解为时序有序的短语,每个短语代表一个原子动作;(2) 动作片段提取(Motion Segment Extraction),将完整动作序列划分为对应的动作片段;(3) 细粒度文本-动作对齐(Fine-grained Text-Motion Alignment),利用对比学习建立文本与动作片段间的细粒度对齐关系。该方法显著提升了HumanML3D数据集上的TOP 1得分至0.553,并可扩展应用于动作定位和动作到文本检索等下游任务。

链接: https://arxiv.org/abs/2512.21237
作者: Bowen Dang,Lin Wu,Xiaohang Yang,Zheng Yuan,Zhixiang Chen
机构: University of Sheffield (谢菲尔德大学); University of Glasgow (格拉斯哥大学); Queen Mary University of London (伦敦玛丽女王大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The IEEE/CVF Winter Conference on Applications of Computer Vision 2026

点击查看摘要

Abstract:Generating 3D human motions from textual descriptions is an important research problem with broad applications in video games, virtual reality, and augmented reality. Recent methods align the textual description with human motion at the sequence level, neglecting the internal semantic structure of modalities. However, both motion descriptions and motion sequences can be naturally decomposed into smaller and semantically coherent segments, which can serve as atomic alignment units to achieve finer-grained correspondence. Motivated by this, we propose SegMo, a novel Segment-aligned text-conditioned human Motion generation framework to achieve fine-grained text-motion alignment. Our framework consists of three modules: (1) Text Segment Extraction, which decomposes complex textual descriptions into temporally ordered phrases, each representing a simple atomic action; (2) Motion Segment Extraction, which partitions complete motion sequences into corresponding motion segments; and (3) Fine-grained Text-Motion Alignment, which aligns text and motion segments with contrastive learning. Extensive experiments demonstrate that SegMo improves the strong baseline on two widely used datasets, achieving an improved TOP 1 score of 0.553 on the HumanML3D test set. Moreover, thanks to the learned shared embedding space for text and motion segments, SegMo can also be applied to retrieval-style tasks such as motion grounding and motion-to-text retrieval.
zh
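片段级的细粒度对齐通常可用对称 InfoNCE 实现,下面是一个极简 PyTorch 示意(非 SegMo 官方代码,嵌入维度为假设):第 i 个文本短语与第 i 个动作片段互为正样本,其余为负样本。

```python
import torch
import torch.nn.functional as F

def segment_infonce(text_seg: torch.Tensor, motion_seg: torch.Tensor, tau: float = 0.07):
    """text_seg / motion_seg: (N, D) 的成对片段嵌入;返回对称对比损失。"""
    t = F.normalize(text_seg, dim=-1)
    m = F.normalize(motion_seg, dim=-1)
    logits = t @ m.t() / tau                       # (N, N) 相似度矩阵
    labels = torch.arange(t.size(0), device=t.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

loss = segment_infonce(torch.randn(32, 512), torch.randn(32, 512))
```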

[CV-15] Leverag ing Lightweight Entity Extraction for Scalable Event-Based Image Retrieval

【速读】:该论文旨在解决自然语言描述到图像的检索任务中,由于查询语义模糊、语言变异性以及对可扩展性要求高等因素导致的现实世界图像-文本检索难题。其解决方案的关键在于提出了一种轻量级两阶段检索流水线:第一阶段基于BM25算法利用事件中心实体提取技术进行高效候选过滤,引入时间与上下文信号;第二阶段采用BEiT-3模型捕捉深层多模态语义并重排序结果,从而实现高精度与高效率的图像检索。

链接: https://arxiv.org/abs/2512.21221
作者: Dao Sy Duy Minh,Huynh Trung Kiet,Nguyen Lam Phu Quy,Phu-Hoa Pham,Tran Chi Nguyen
机构: University of Science - VNUHCM (胡志明市国家大学自然科学大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: System description paper for EVENTA Grand Challenge Track 2 at ACM Multimedia 2025 (MM '25). Ranked 4th place. 6 pages, 1 figure, 2 tables

点击查看摘要

Abstract:Retrieving images from natural language descriptions is a core task at the intersection of computer vision and natural language processing, with wide-ranging applications in search engines, media archiving, and digital content management. However, real-world image-text retrieval remains challenging due to vague or context-dependent queries, linguistic variability, and the need for scalable solutions. In this work, we propose a lightweight two-stage retrieval pipeline that leverages event-centric entity extraction to incorporate temporal and contextual signals from real-world captions. The first stage performs efficient candidate filtering using BM25 based on salient entities, while the second stage applies BEiT-3 models to capture deep multimodal semantics and rerank the results. Evaluated on the OpenEvents v1 benchmark, our method achieves a mean average precision of 0.559, substantially outperforming prior baselines. These results highlight the effectiveness of combining event-guided filtering with long-text vision-language modeling for accurate and efficient retrieval in complex, real-world scenarios. Our code is available at this https URL
zh
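两阶段"BM25 粗筛 + 多模态重排"的骨架可以写得非常短,下面给出一个示意(使用 rank-bm25 这一现成库;rerank_score 为假设的占位回调,真实系统中应替换为 BEiT-3 的图文相似度打分):

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "flood rescue in jakarta after heavy rain 2020",
    "world cup final celebration in paris",
    "wildfire evacuation in california town",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

def retrieve(query: str, rerank_score, top_k: int = 2):
    """阶段一:BM25 基于事件实体做高效粗筛;阶段二:对候选重排。"""
    scores = bm25.get_scores(query.split())
    cand = sorted(range(len(corpus)), key=lambda i: -scores[i])[:top_k]
    return sorted(cand, key=lambda i: -rerank_score(query, corpus[i]))

# 占位的重排打分:仅统计词重叠,真实系统中应换成多模态相似度
dummy_score = lambda q, d: len(set(q.split()) & set(d.split()))
print(retrieve("flood rescue jakarta", dummy_score))
```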

[CV-16] RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic

【速读】:该论文旨在解决具身智能体(embodied agents)在执行复杂现实任务时,因接收到危险指令而可能引发不安全行为的问题。现有运行时安全防护机制多依赖静态规则过滤或提示层面控制,难以应对动态、时序依赖且情境丰富的环境中的隐性风险。解决方案的关键在于提出 RoboSafe,一种基于可执行谓词逻辑的混合推理运行时防护系统,其核心是构建一个混合长短时安全记忆(Hybrid Long-Short Safety Memory),并集成两种互补的推理模块:一是后向反思推理模块(Backward Reflective Reasoning),通过持续回溯短期记忆中的近期轨迹以推断时序安全谓词并主动触发重规划;二是前向预测推理模块(Forward Predictive Reasoning),基于长期安全记忆与多模态观测生成情境感知的安全谓词以提前预警潜在风险。该方案实现了可解释、可验证且可执行的安全逻辑,在多个代理平台上显著降低危险行为发生率(减少36.8%),同时保持接近原始任务性能。

链接: https://arxiv.org/abs/2512.21220
作者: Le Wang,Zonghao Ying,Xiao Yang,Quanchen Zou,Zhenfei Yin,Tianlin Li,Jian Yang,Yaodong Yang,Aishan Liu,Xianglong Liu
机构: Beihang University (北京航空航天大学); Beijing University of Posts and Telecommunications (北京邮电大学); 360 AI Security Lab; The University of Sydney (悉尼大学); Nanyang Technological University (南洋理工大学); Peking University (北京大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 11 pages, 6 figures

点击查看摘要

Abstract:Embodied agents powered by vision-language models (VLMs) are increasingly capable of executing complex real-world tasks, yet they remain vulnerable to hazardous instructions that may trigger unsafe behaviors. Runtime safety guardrails, which intercept hazardous actions during task execution, offer a promising solution due to their flexibility. However, existing defenses often rely on static rule filters or prompt-level control, which struggle to address implicit risks arising in dynamic, temporally dependent, and context-rich environments. To address this, we propose RoboSafe, a hybrid reasoning runtime safeguard for embodied agents through executable predicate-based safety logic. RoboSafe integrates two complementary reasoning processes on a Hybrid Long-Short Safety Memory. We first propose a Backward Reflective Reasoning module that continuously revisits recent trajectories in short-term memory to infer temporal safety predicates and proactively triggers replanning when violations are detected. We then propose a Forward Predictive Reasoning module that anticipates upcoming risks by generating context-aware safety predicates from the long-term safety memory and the agent’s multimodal observations. Together, these components form an adaptive, verifiable safety logic that is both interpretable and executable as code. Extensive experiments across multiple agents demonstrate that RoboSafe substantially reduces hazardous actions (-36.8% risk occurrence) compared with leading baselines, while maintaining near-original task performance. Real-world evaluations on physical robotic arms further confirm its practicality. Code will be released upon acceptance.
zh
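所谓"可执行谓词安全逻辑",核心是把安全约束写成可直接求值的代码。下面是一个与论文无关的极简 Python 示意(谓词内容与状态字段均为假设):执行动作前逐条检查,违例即拦截并触发重规划。

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class SafetyPredicate:
    name: str
    check: Callable[[Dict], bool]   # 输入环境状态,返回谓词是否被满足

def runtime_guard(action: str, state: Dict, predicates: List[SafetyPredicate]) -> str:
    """执行前逐条检查可执行安全谓词;一旦违反即拦截并触发重规划。"""
    for p in predicates:
        if not p.check(state):
            return f"BLOCK {action}: violated <{p.name}>, replanning"
    return f"EXECUTE {action}"

predicates = [
    SafetyPredicate("stove_off_before_leaving",
                    lambda s: not (s["leaving_kitchen"] and s["stove_on"])),
    SafetyPredicate("no_liquid_above_electronics",
                    lambda s: not (s["holding_liquid"] and s["above_laptop"])),
]
state = {"leaving_kitchen": True, "stove_on": True,
         "holding_liquid": False, "above_laptop": False}
print(runtime_guard("move_to(living_room)", state, predicates))
```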

[CV-17] Latent Implicit Visual Reasoning

【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)过度依赖文本作为核心推理模态的问题,从而在以视觉为主的推理任务中表现受限。现有方法虽尝试通过辅助图像、深度图或图像裁剪来监督中间视觉步骤,但这些策略引入了人为先验、增加了标注成本且难以跨任务泛化。其解决方案的关键在于提出一种任务无关的机制,使LMM能够自主发现并利用视觉推理标记(visual reasoning tokens),这些标记具备全局注意力能力,并能以任务自适应的方式重新编码图像,从而无需人工标注即可提取相关视觉信息。该方法在多种视觉主导任务上超越直接微调,达到当前最优性能,并展现出良好的多任务指令微调泛化能力。

链接: https://arxiv.org/abs/2512.21218
作者: Kelvin Li,Chuyi Shang,Leonid Karlinsky,Rogerio Feris,Trevor Darrell,Roei Herzig
机构: University of California, Berkeley (加州大学伯克利分校); Xero (Xero公司); MIT-IBM Watson AI Lab (MIT-IBM沃森人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what “useful” visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks – including those where intermediate abstractions are hard to specify – while also generalizing to multi-task instruction tuning.
zh
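可学习的"视觉推理标记"在结构上类似于在序列前拼接一组可训练 token 并让其全局注意、按任务自适应地重编码图像特征。下面是一个极简 PyTorch 示意(非论文实现,维度与层数均为假设):

```python
import torch
import torch.nn as nn

class ReasoningTokens(nn.Module):
    """在视觉 token 序列前拼接 K 个可学习 token,经自注意力全局聚合(示意结构)。"""
    def __init__(self, dim=768, num_tokens=8, depth=2):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(1, num_tokens, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        b = visual_tokens.size(0)
        x = torch.cat([self.tokens.expand(b, -1, -1), visual_tokens], dim=1)
        x = self.encoder(x)                 # 推理 token 与图像 token 双向交互
        return x[:, : self.tokens.size(1)]  # 取出聚合后的推理 token 供后续使用

out = ReasoningTokens()(torch.randn(2, 196, 768))  # -> (2, 8, 768)
```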

[CV-18] Human Motion Estimation with Everyday Wearables

【速读】:该论文旨在解决基于穿戴设备的人体运动估计在实际应用中面临的可穿戴性差、硬件成本高及标定繁琐等问题,这些问题限制了其在日常生活中的普及。解决方案的关键在于提出EveryWear,一种完全依赖日常可穿戴设备(智能手机、智能手表、耳机和智能眼镜)的轻量级人体动作捕捉方法,无需显式标定即可使用;其核心创新是引入了一个多模态教师-学生框架,融合来自第一人称视角摄像头的视觉线索与消费级设备提供的惯性信号,并直接在真实世界数据(Ego-Elec数据集)上训练模型,从而有效消除先前研究中存在的“仿真到现实”(sim-to-real)差距,实现更鲁棒的全身运动估计。

链接: https://arxiv.org/abs/2512.21209
作者: Siqi Zhu,Yixuan Li,Junfu Li,Qi Wu,Zan Wang,Haozhe Ma,Wei Liang
机构: Beijing Institute of Technology (北京理工大学); Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing (北京理工大学长三角研究院(嘉兴)); Shenzhen MSU-BIT University (深圳北理莫斯科大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While on-body device-based human motion estimation is crucial for applications such as XR interaction, existing methods often suffer from poor wearability, expensive hardware, and cumbersome calibration, which hinder their adoption in daily life. To address these challenges, we present EveryWear, a lightweight and practical human motion capture approach based entirely on everyday wearables: a smartphone, smartwatch, earbuds, and smart glasses equipped with one forward-facing and two downward-facing cameras, requiring no explicit calibration before use. We introduce Ego-Elec, a 9-hour real-world dataset covering 56 daily activities across 17 diverse indoor and outdoor environments, with ground-truth 3D annotations provided by the motion capture (MoCap), to facilitate robust research and benchmarking in this direction. Our approach employs a multimodal teacher-student framework that integrates visual cues from egocentric cameras with inertial signals from consumer devices. By training directly on real-world data rather than synthetic data, our model effectively eliminates the sim-to-real gap that constrains prior work. Experiments demonstrate that our method outperforms baseline models, validating its effectiveness for practical full-body motion estimation.
zh

[CV-19] Schrödingers Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation

【速读】:该论文旨在解决零样本目标导航(Zero-shot Object Navigation, ZSON)在真实复杂环境中的性能瓶颈问题,特别是在存在严重静态遮挡、未知风险以及动态移动目标等挑战场景下,现有方法往往难以实现稳定可靠的导航与目标定位。其解决方案的关键在于提出“薛定谔导航器”(Schrödinger’s Navigator),该框架受薛定谔思想实验启发,将未观测空间建模为一组可能的未来世界,并基于自车视角视觉输入和三条候选路径,利用轨迹条件化的3D世界模型预测每条路径上的未来观测结果。这种3D想象机制使智能体能够在不依赖额外绕行或密集全局地图的前提下,提前感知遮挡后的信息并预判潜在风险,进而融合想象的3D观测更新导航地图与价值图,指导策略选择更优路径以规避遮挡、减少不确定性暴露并提升对移动目标的追踪能力。

链接: https://arxiv.org/abs/2512.21201
作者: Yu He,Da Huang,Zhenyang Liu,Zixiao Gu,Qiang Sun,Guangnan Ye,Yanwei Fu
机构: Fudan University (复旦大学); Shanghai Jiao Tong University (上海交通大学); Shanghai University of International Business and Economics (上海对外经贸大学); Shanghai Innovation Institute (上海创新研究院)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Zero-shot object navigation (ZSON) requires a robot to locate a target object in a previously unseen environment without relying on pre-built maps or task-specific training. However, existing ZSON methods often struggle in realistic and cluttered environments, particularly when the scene contains heavy occlusions, unknown risks, or dynamically moving target objects. To address these challenges, we propose \textbfSchrödinger’s Navigator, a navigation framework inspired by Schrödinger’s thought experiment on uncertainty. The framework treats unobserved space as a set of plausible future worlds and reasons over them before acting. Conditioned on egocentric visual inputs and three candidate trajectories, a trajectory-conditioned 3D world model imagines future observations along each path. This enables the agent to see beyond occlusions and anticipate risks in unseen regions without requiring extra detours or dense global mapping. The imagined 3D observations are fused into the navigation map and used to update a value map. These updates guide the policy toward trajectories that avoid occlusions, reduce exposure to uncertain space, and better track moving targets. Experiments on a Go2 quadruped robot across three challenging scenarios, including severe static occlusions, unknown risks, and dynamically moving targets, show that Schrödinger’s Navigator consistently outperforms strong ZSON baselines in self-localization, object localization, and overall Success Rate in occlusion-heavy environments. These results demonstrate the effectiveness of trajectory-conditioned 3D imagination in enabling robust zero-shot object navigation.
zh
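"想象一组可能未来再择优"的决策骨架可示意如下(纯玩具代码,world_model 与 value_map 均为假设的占位;真实系统中分别对应轨迹条件化 3D 世界模型与融合想象观测的价值图):

```python
import numpy as np

def imagined_value(world_model, obs, traj, value_map) -> float:
    """用轨迹条件化世界模型想象沿 traj 的未来观测,并据此给轨迹打分。"""
    imagined_obs = world_model(obs, traj)   # 假设接口:返回想象的未来观测
    return float(value_map(imagined_obs))

def select_trajectory(world_model, obs, trajectories, value_map):
    scores = [imagined_value(world_model, obs, t, value_map) for t in trajectories]
    return trajectories[int(np.argmax(scores))], scores

# 玩具替身:演示"逐条想象、择优执行"的控制流
toy_world = lambda obs, traj: obs + 0.1 * traj.sum()
toy_value = lambda imagined: -abs(imagined - 1.0)   # 越接近目标观测得分越高
trajs = [np.array([0.0, 1.0]), np.array([1.0, 1.0]), np.array([-1.0, 0.0])]
best, scores = select_trajectory(toy_world, 0.8, trajs, toy_value)
```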

[CV-20] VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在执行视觉推理任务时,其性能究竟源于真正的视觉理解能力,还是仅仅依赖于语言先验(linguistic priors)的问题。为实现这一目标,作者提出VisRes Bench——一个在自然场景中评估视觉推理能力的基准测试框架,其关键创新在于通过三个层级的复杂度设计,系统性地隔离并量化模型在感知、关系和组合推理方面的表现:Level 1聚焦于受扰动下的感知完整性与全局图像匹配能力;Level 2测试单一属性(如颜色、数量、方向)上的规则推理;Level 3则要求多属性融合的组合推理。实验结果表明,即使是最先进的VLMs在细微感知扰动下也接近随机水平,揭示了其抽象能力仍局限于模式识别,而非深层视觉推理。

链接: https://arxiv.org/abs/2512.21194
作者: Brigitta Malagurski Törtei,Yasser Dahou,Ngoc Dung Huynh,Wamiq Reyaz Para,Phúc H. Lê Khac,Ankit Singh,Sofian Chaybouti,Sanath Narayan
机构: Technology Innovation Institute (技术革新研究所); University of Tuebingen (图宾根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) have achieved remarkable progress across tasks such as visual question answering and image captioning. Yet, the extent to which these models perform visual reasoning as opposed to relying on linguistic priors remains unclear. To address this, we introduce VisRes Bench, a benchmark designed to study visual reasoning in naturalistic settings without contextual language supervision. Analyzing model behavior across three levels of complexity, we uncover clear limitations in perceptual and relational visual reasoning capacities. VisRes isolates distinct reasoning abilities across its levels. Level 1 probes perceptual completion and global image matching under perturbations such as blur, texture changes, occlusion, and rotation; Level 2 tests rule-based inference over a single attribute (e.g., color, count, orientation); and Level 3 targets compositional reasoning that requires integrating multiple visual attributes. Across more than 19,000 controlled task images, we find that state-of-the-art VLMs perform near random under subtle perceptual perturbations, revealing limited abstraction beyond pattern recognition. We conclude by discussing how VisRes provides a unified framework for advancing abstract visual reasoning in multimodal research.
zh

[CV-21] UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement

【速读】:该论文旨在解决3D几何生成中高质量、细节丰富且结构可靠的生成难题,尤其针对公开数据集质量参差不齐、生成过程难以精细化控制的问题。解决方案的关键在于提出一个两阶段扩散框架UltraShape 1.0:首先生成粗略的全局结构,再通过基于体素(voxel)的精修机制在固定空间位置上进行局部细节合成;同时,利用从粗略几何中提取的体素查询作为显式位置锚点,并通过RoPE(Rotary Position Embedding)编码实现空间定位与几何细节生成的解耦,从而在受限训练资源下仍能生成高保真度的3D几何形状。

链接: https://arxiv.org/abs/2512.21185
作者: Tanghui Jia,Dongyu Yan,Dehao Hao,Yang Li,Kaiyi Zhang,Xianyi He,Lanjiong Li,Jinnan Chen,Lutao Jiang,Qishen Yin,Long Quan,Ying-Cong Chen,Li Yuan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 14 pages, 10 figures, Technical Report,

点击查看摘要

Abstract:In this report, we introduce UltraShape 1.0, a scalable 3D diffusion framework for high-fidelity 3D geometry generation. The proposed approach adopts a two-stage generation pipeline: a coarse global structure is first synthesized and then refined to produce detailed, high-quality geometry. To support reliable 3D generation, we develop a comprehensive data processing pipeline that includes a novel watertight processing method and high-quality data filtering. This pipeline improves the geometric quality of publicly available 3D datasets by removing low-quality samples, filling holes, and thickening thin structures, while preserving fine-grained geometric details. To enable fine-grained geometry refinement, we decouple spatial localization from geometric detail synthesis in the diffusion process. We achieve this by performing voxel-based refinement at fixed spatial locations, where voxel queries derived from coarse geometry provide explicit positional anchors encoded via RoPE, allowing the diffusion model to focus on synthesizing local geometric details within a reduced, structured solution space. Our model is trained exclusively on publicly available 3D datasets, achieving strong geometric quality despite limited training resources. Extensive evaluations demonstrate that UltraShape 1.0 performs competitively with existing open-source methods in both data processing quality and geometry generation. All code and trained models will be released to support future research.
zh

[CV-22] owards Arbitrary Motion Completing via Hierarchical Continuous Representation

【速读】:该论文旨在解决人类运动序列在不同帧率下难以实现平滑插值与外推的问题,传统离散采样方式限制了运动表示的连续性和灵活性。其核心解决方案是提出一种基于隐式神经表示(Implicit Neural Representations, INRs)的分层参数化激活机制框架——NAME,关键在于引入分层时间编码机制以多尺度捕捉复杂时序模式,并结合由傅里叶变换驱动的可学习激活函数嵌入到MLP解码器中,从而显著提升模型对任意帧率下运动行为的高精度连续建模能力。

链接: https://arxiv.org/abs/2512.21183
作者: Chenghao Xu,Guangtao Lyu,Qi Liu,Jiexi Yan,Muli Yang,Cheng Deng
机构: Hohai university (河海大学); Xidian University (西安电子科技大学); Institute for Infocomm Research (I2R) (资讯通信研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Physical motions are inherently continuous, and higher camera frame rates typically contribute to improved smoothness and temporal coherence. For the first time, we explore continuous representations of human motion sequences, featuring the ability to interpolate, inbetween, and even extrapolate any input motion sequences at arbitrary frame rates. To achieve this, we propose a novel parametric activation-induced hierarchical implicit representation framework, referred to as NAME, based on Implicit Neural Representations (INRs). Our method introduces a hierarchical temporal encoding mechanism that extracts features from motion sequences at multiple temporal scales, enabling effective capture of intricate temporal patterns. Additionally, we integrate a custom parametric activation function, powered by Fourier transformations, into the MLP-based decoder to enhance the expressiveness of the continuous representation. This parametric formulation significantly augments the model’s ability to represent complex motion behaviors with high accuracy. Extensive evaluations across several benchmark datasets demonstrate the effectiveness and robustness of our proposed approach.
zh
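摘要中的"傅里叶参数化激活 + 连续时间查询"可以用很短的 PyTorch 代码勾勒(非 NAME 官方实现;激活形式 a·sin(w·x+φ) 与频率初始化借鉴 SIREN 的惯例,属于示意性假设):

```python
import torch
import torch.nn as nn

class FourierActivation(nn.Module):
    """可学习的参数化激活:a * sin(w * x + phi)。"""
    def __init__(self, dim):
        super().__init__()
        self.a = nn.Parameter(torch.ones(dim))
        self.w = nn.Parameter(torch.ones(dim) * 30.0)   # 初始频率为演示性设置
        self.phi = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return self.a * torch.sin(self.w * x + self.phi)

class MotionINR(nn.Module):
    """以连续时间 t∈[0,1] 为输入、输出姿态向量的隐式表示:
    任意帧率下查询即可实现插值、补帧乃至外推(示意网络)。"""
    def __init__(self, pose_dim=66, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), FourierActivation(hidden),
            nn.Linear(hidden, hidden), FourierActivation(hidden),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, t):          # t: (N, 1) 连续时间戳
        return self.net(t)

poses = MotionINR()(torch.linspace(0, 1, 240).unsqueeze(-1))  # 以 240 fps 查询
```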

[CV-23] A Turn Toward Better Alignment: Few-Shot Generative Adaptation with Equivariant Feature Rotation

【速读】:该论文旨在解决少样本图像生成(Few-shot Image Generation)中因源域与目标域分布结构差异导致的适应难题,尤其是在目标样本稀缺情况下难以准确估计目标域分布、进而引发生成内容失真或信息不足的问题。解决方案的关键在于提出等变特征旋转(Equivariant Feature Rotation, EFR)策略:通过在参数化的李群(Lie Group)内对源域和目标域特征进行自适应旋转,将其映射到一个等变代理特征空间(equivariant proxy feature space),并在该空间中实现双层次对齐——既保留域内结构信息又有效弥合域间差异,从而实现无畸变的知识迁移与高质量生成。

链接: https://arxiv.org/abs/2512.21174
作者: Chenghao Xu,Qi Liu,Jiexi Yan,Muli Yang,Cheng Deng
机构: Hohai university(河海大学); Xidian University(西安电子科技大学); Institute for Infocomm Research (I2R)(资讯通信研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Few-shot image generation aims to effectively adapt a source generative model to a target domain using very few training images. Most existing approaches introduce consistency constraints-typically through instance-level or distribution-level loss functions-to directly align the distribution patterns of source and target domains within their respective latent spaces. However, these strategies often fall short: overly strict constraints can amplify the negative effects of the domain gap, leading to distorted or uninformative content, while overly relaxed constraints may fail to leverage the source domain effectively. This limitation primarily stems from the inherent discrepancy in the underlying distribution structures of the source and target domains. The scarcity of target samples further compounds this issue by hindering accurate estimation of the target domain’s distribution. To overcome these limitations, we propose Equivariant Feature Rotation (EFR), a novel adaptation strategy that aligns source and target domains at two complementary levels within a self-rotated proxy feature space. Specifically, we perform adaptive rotations within a parameterized Lie Group to transform both source and target features into an equivariant proxy space, where alignment is conducted. These learnable rotation matrices serve to bridge the domain gap by preserving intra-domain structural information without distortion, while the alignment optimization facilitates effective knowledge transfer from the source to the target domain. Comprehensive experiments on a variety of commonly used datasets demonstrate that our method significantly enhances the generative performance within the targeted domain.
zh
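"参数化李群内的可学习旋转"的常见实现方式,是对反对称矩阵取矩阵指数得到 SO(d) 中的旋转。下面是一个极简 PyTorch 示意(非 EFR 官方代码,对齐目标仅用均值差演示):

```python
import torch
import torch.nn as nn

class LearnableRotation(nn.Module):
    """exp(A)(A 反对称)必为正交旋转矩阵:把特征旋转到等变代理空间。"""
    def __init__(self, dim):
        super().__init__()
        self.skew = nn.Parameter(torch.zeros(dim, dim))

    def forward(self, x):                      # x: (B, dim)
        a = self.skew - self.skew.t()          # 保证反对称:A^T = -A
        r = torch.linalg.matrix_exp(a)
        return x @ r.t()

rot_src, rot_tgt = LearnableRotation(128), LearnableRotation(128)
f_src, f_tgt = torch.randn(16, 128), torch.randn(4, 128)   # 目标域样本极少
proxy_loss = (rot_src(f_src).mean(0) - rot_tgt(f_tgt).mean(0)).pow(2).mean()
```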

[CV-24] ORCA: Object Recognition and Comprehension for Archiving Marine Species WACV

【速读】:该论文旨在解决海洋视觉理解领域中因训练数据有限及缺乏系统性任务定义而导致模型应用受限的问题,尤其针对海洋生物多样性与形态相似性带来的挑战。其解决方案的关键在于构建ORCA——一个包含14,647张图像、478个物种、42,217个边界框标注和22,321条专家验证实例描述的多模态基准数据集,通过细粒度的视觉与文本标注捕捉物种形态特征,并在此基础上评估18种前沿模型在目标检测(闭集与开放词汇)、实例描述生成和视觉定位三个任务上的性能,从而推动海洋领域计算机视觉方法的发展。

链接: https://arxiv.org/abs/2512.21150
作者: Yuk-Kwan Wong,Haixin Liang,Zeyu Ma,Yiwei Chen,Ziqiang Zheng,Rinaldi Gotama,Pascal Sebastian,Lauren D. Sparks,Sai-Kit Yeung
机构: Hong Kong University of Science and Technology (香港科技大学); University of Electronic Science and Technology of China (电子科技大学); Indo Ocean Foundation (印度洋基金会)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by The IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026

点击查看摘要

Abstract:Marine visual understanding is essential for monitoring and protecting marine ecosystems, enabling automatic and scalable biological surveys. However, progress is hindered by limited training data and the lack of a systematic task formulation that aligns domain-specific marine challenges with well-defined computer vision tasks, thereby limiting effective model application. To address this gap, we present ORCA, a multi-modal benchmark for marine research comprising 14,647 images from 478 species, with 42,217 bounding box annotations and 22,321 expert-verified instance captions. The dataset provides fine-grained visual and textual annotations that capture morphology-oriented attributes across diverse marine species. To catalyze methodological advances, we evaluate 18 state-of-the-art models on three tasks: object detection (closed-set and open-vocabulary), instance captioning, and visual grounding. Results highlight key challenges, including species diversity, morphological overlap, and specialized domain demands, underscoring the difficulty of marine understanding. ORCA thus establishes a comprehensive benchmark to advance research in marine domain. Project Page: this http URL.
zh

[CV-25] GC-Net: A Structure-Aware and Semantically-Aligned Framework for Text-Guided Medical Image Segmentation

【速读】:该论文旨在解决现有文本引导的医学图像分割方法中因图像与文本编码器未对齐而导致多模态融合复杂、性能受限的问题,以及CLIP模型在医疗影像应用中存在的细粒度解剖结构保留不足、复杂临床描述建模能力弱和领域语义错位等挑战。其解决方案的关键在于提出TGC-Net框架,通过三个核心模块实现参数高效的任务特化适配:(1)语义-结构协同编码器(Semantic-Structural Synergy Encoder, SSE),在CLIP的ViT基础上引入CNN分支以增强多尺度结构细化;(2)领域增强文本编码器(Domain-Augmented Text Encoder, DATE),注入大语言模型生成的医学知识以提升临床描述建模能力;(3)视觉-语言校准模块(Vision-Language Calibration Module, VLCM),在统一特征空间中优化跨模态对应关系。该方案在胸部X光和胸部CT五个数据集上实现了SOTA性能,同时显著减少可训练参数量。

链接: https://arxiv.org/abs/2512.21135
作者: Gaoren Lin,Huangxuan Zhao,Yuan Xiong,Lefei Zhang,Bo Du,Wentao Zhu
机构: Wuhan University (武汉大学); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Text-guided medical segmentation enhances segmentation accuracy by utilizing clinical reports as auxiliary information. However, existing methods typically rely on unaligned image and text encoders, which necessitate complex interaction modules for multimodal fusion. While CLIP provides a pre-aligned multimodal feature space, its direct application to medical imaging is limited by three main issues: insufficient preservation of fine-grained anatomical structures, inadequate modeling of complex clinical descriptions, and domain-specific semantic misalignment. To tackle these challenges, we propose TGC-Net, a CLIP-based framework focusing on parameter-efficient, task-specific adaptations. Specifically, it incorporates a Semantic-Structural Synergy Encoder (SSE) that augments CLIP’s ViT with a CNN branch for multi-scale structural refinement, a Domain-Augmented Text Encoder (DATE) that injects large-language-model-derived medical knowledge, and a Vision-Language Calibration Module (VLCM) that refines cross-modal correspondence in a unified feature space. Experiments on five datasets across chest X-ray and thoracic CT modalities demonstrate that TGC-Net achieves state-of-the-art performance with substantially fewer trainable parameters, including notable Dice gains on challenging benchmarks.
zh

[CV-26] MarineEval: Assessing the Marine Intelligence of Vision-Language Models WACV

【速读】:该论文试图解决现有视觉语言模型(Vision Language Models, VLMs)在海洋领域专业问答任务中表现不足的问题,即这些模型是否具备作为海洋领域专家的能力,能否准确回答需要高度专业知识和特殊领域要求的海洋问题。解决方案的关键在于构建首个大规模海洋VLM评测数据集与基准测试平台——MarineEval,该数据集包含2000个基于图像的问答对,覆盖7个任务维度和20个能力维度,并由海洋领域专家确保数据的专业性与多样性。通过在该基准上全面评估17种现有VLMs,研究揭示了当前模型在处理海洋专业问题时存在显著局限,从而为未来面向特定领域的多模态模型优化提供了明确方向。

链接: https://arxiv.org/abs/2512.21126
作者: Yuk-Kwan Wong,Tuan-An To,Jipeng Zhang,Ziqiang Zheng,Sai-Kit Yeung
机构: Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
备注: Accepted by The IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026

点击查看摘要

Abstract:We have witnessed promising progress led by large language models (LLMs) and further vision language models (VLMs) in handling various queries as a general-purpose assistant. VLMs, as a bridge to connect the visual world and language corpus, receive both visual content and various text-only user instructions to generate corresponding responses. Though great success has been achieved by VLMs in various fields, in this work, we ask whether the existing VLMs can act as domain experts, accurately answering marine questions, which require significant domain expertise and address special domain challenges/requirements. To comprehensively evaluate the effectiveness and explore the boundary of existing VLMs, we construct the first large-scale marine VLM dataset and benchmark called MarineEval, with 2,000 image-based question-answering pairs. During our dataset construction, we ensure the diversity and coverage of the constructed data: 7 task dimensions and 20 capacity dimensions. The domain requirements are specially integrated into the data construction and further verified by the corresponding marine domain experts. We comprehensively benchmark 17 existing VLMs on our MarineEval and also investigate the limitations of existing models in answering marine research questions. The experimental results reveal that existing VLMs cannot effectively answer the domain-specific questions, and there is still a large room for further performance improvements. We hope our new benchmark and observations will facilitate future research. Project Page: this http URL
zh

[CV-27] STLDM: Spatio-Temporal Latent Diffusion Model for Precipitation Nowcasting

【速读】:该论文旨在解决降水临近预报(precipitation nowcasting)任务中因天气现象的复杂性和随机性导致的预测难题,特别是现有方法在确定性模型易产生模糊结果与生成式模型准确性不足之间的权衡问题。解决方案的关键在于提出一种名为STLDM(Spatial-Temporal Latent Diffusion Model)的扩散模型架构,其通过端到端学习潜在表示,结合变分自编码器(Variational Autoencoder, VAE)与条件网络(conditioning network),将任务分解为两个阶段:由条件网络执行确定性预报,再由潜在扩散模型进行增强优化,从而在保持高精度的同时提升推理效率。

链接: https://arxiv.org/abs/2512.21118
作者: Shi Quan Foo,Chi-Ho Wong,Zhihan Gao,Dit-Yan Yeung,Ka-Hing Wong,Wai-Kin Wong
机构: The Hong Kong University of Science and Technology (香港科技大学); Hong Kong Observatory (香港天文台)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by TMLR. Camera-ready submission

点击查看摘要

Abstract:Precipitation nowcasting is a critical spatio-temporal prediction task for society to prevent severe damage owing to extreme weather events. Despite the advances in this field, the complex and stochastic nature of this task still poses challenges to existing approaches. Specifically, deterministic models tend to produce blurry predictions while generative models often struggle with poor accuracy. In this paper, we present a simple yet effective model architecture termed STLDM, a diffusion-based model that learns the latent representation from end to end alongside both the Variational Autoencoder and the conditioning network. STLDM decomposes this task into two stages: a deterministic forecasting stage handled by the conditioning network, and an enhancement stage performed by the latent diffusion model. Experimental results on multiple radar datasets demonstrate that STLDM achieves superior performance compared to the state of the art, while also improving inference efficiency. The code is available in this https URL.
zh

[CV-28] FreeInpaint: Tuning-free Prompt Alignment and Visual Rationality Enhancement in Image Inpainting AAAI2026

【速读】:该论文旨在解决文本引导图像修复(text-guided image inpainting)中难以同时保证提示对齐(prompt alignment)与视觉合理性(visual rationality)的问题。现有方法虽借助预训练文本到图像扩散模型生成视觉上逼真的结果,但在实际应用中仍难以兼顾二者。其解决方案的关键在于提出一种即插即用且无需微调的优化策略——FreeInpaint,该方法在推理阶段直接对扩散潜空间(diffusion latents)进行在线优化:首先通过先验引导的噪声优化方法调整初始噪声以引导模型关注有效修复区域;其次设计了一个针对修复任务定制的复合引导目标(composite guidance objective),在每一步去噪过程中优化中间潜变量,从而高效提升提示一致性与图像合理性。

链接: https://arxiv.org/abs/2512.21104
作者: Chao Gong,Dong Li,Yingwei Pan,Jingjing Chen,Ting Yao,Tao Mei
机构: HiDream.ai
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Text-guided image inpainting endeavors to generate new content within specified regions of images using textual prompts from users. The primary challenge is to accurately align the inpainted areas with the user-provided prompts while maintaining a high degree of visual fidelity. While existing inpainting methods have produced visually convincing results by leveraging the pre-trained text-to-image diffusion models, they still struggle to uphold both prompt alignment and visual rationality simultaneously. In this work, we introduce FreeInpaint, a plug-and-play tuning-free approach that directly optimizes the diffusion latents on the fly during inference to improve the faithfulness of the generated images. Technically, we introduce a prior-guided noise optimization method that steers model attention towards valid inpainting regions by optimizing the initial noise. Furthermore, we meticulously design a composite guidance objective tailored specifically for the inpainting task. This objective efficiently directs the denoising process, enhancing prompt alignment and visual rationality by optimizing intermediate latents at each step. Through extensive experiments involving various inpainting diffusion models and evaluation metrics, we demonstrate the effectiveness and robustness of our proposed FreeInpaint.
zh
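推理期直接优化扩散潜变量的通用骨架如下(极简 PyTorch 示意,非 FreeInpaint 官方实现;guidance_loss 用一个玩具可微目标代替论文中由注意力先验与修复一致性组成的复合引导目标):

```python
import torch

def optimize_initial_noise(noise, guidance_loss, steps=10, lr=0.05):
    """不调模型权重,直接对初始噪声做若干步梯度下降,使引导目标变小。"""
    z = noise.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = guidance_loss(z)
        loss.backward()
        opt.step()
    return z.detach()

# 玩具引导目标:把掩码区域内潜变量的能量推向某个目标值(仅演示可微性)
mask = torch.zeros(1, 4, 64, 64); mask[..., 16:48, 16:48] = 1.0
toy_loss = lambda z: ((z * mask).pow(2).mean() - 1.0).pow(2)
z0 = optimize_initial_noise(torch.randn(1, 4, 64, 64), toy_loss)
```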

[CV-29] xAvatars : Hybrid Texel-3D Representations for Stable Rigging of Photorealistic Gaussian Head Avatars

【速读】:该论文旨在解决当前生成式头像(Head Avatar)在极端表情和姿态变化下泛化能力不足的问题,尤其是现有基于规则的解析绑定(Analytic Rigging)或神经网络驱动的变形场方法难以保持几何一致性与表达细节。其关键解决方案是提出TexAvatars——一种融合解析绑定几何约束与纹理空间(Texel Space)连续性的混合表示:通过卷积神经网络(CNN)预测UV空间中的局部几何属性,并利用网格感知的雅可比矩阵(Mesh-aware Jacobians)驱动三维形变,从而实现三角面片边界处平滑且语义一致的过渡,有效提升模型在复杂、分布外场景下的稳定性与可解释性。

链接: https://arxiv.org/abs/2512.21099
作者: Jaeseong Lee,Junyeong Ahn,Taewoong Kang,Jaegul Choo
机构: KAIST(韩国科学技术院); Hanyang University(汉阳大学)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 3DV 2026, Project page with videos: this https URL

点击查看摘要

Abstract:Constructing drivable and photorealistic 3D head avatars has become a central task in AR/XR, enabling immersive and expressive user experiences. With the emergence of high-fidelity and efficient representations such as 3D Gaussians, recent works have pushed toward ultra-detailed head avatars. Existing approaches typically fall into two categories: rule-based analytic rigging or neural network-based deformation fields. While effective in constrained settings, both approaches often fail to generalize to unseen expressions and poses, particularly in extreme reenactment scenarios. Other methods constrain Gaussians to the global texel space of 3DMMs to reduce rendering complexity. However, these texel-based avatars tend to underutilize the underlying mesh structure. They apply minimal analytic deformation and rely heavily on neural regressors and heuristic regularization in UV space, which weakens geometric consistency and limits extrapolation to complex, out-of-distribution deformations. To address these limitations, we introduce TexAvatars, a hybrid avatar representation that combines the explicit geometric grounding of analytic rigging with the spatial continuity of texel space. Our approach predicts local geometric attributes in UV space via CNNs, but drives 3D deformation through mesh-aware Jacobians, enabling smooth and semantically meaningful transitions across triangle boundaries. This hybrid design separates semantic modeling from geometric control, resulting in improved generalization, interpretability, and stability. Furthermore, TexAvatars captures fine-grained expression effects, including muscle-induced wrinkles, glabellar lines, and realistic mouth cavity geometry, with high fidelity. Our method achieves state-of-the-art performance under extreme pose and expression variations, demonstrating strong generalization in challenging head reenactment settings.
zh

[CV-30] UniRec-0.1B: Unified Text and Formula Recognition with 0.1B Parameters

【速读】:该论文旨在解决文档中文本与公式(mathematical formulas)联合识别的效率与准确性难题,尤其针对当前视觉语言模型(Vision-Language Models, VLMs)参数量大、计算开销高、难以部署于实际场景的问题。其核心解决方案在于提出一个仅含0.1B参数的轻量化统一识别模型UniRec-0.1B,并通过两个关键技术突破实现高效且多层级(字符、词、行、段落、文档)的文本与公式识别:一是设计分层监督训练策略以显式引导结构理解,缓解不同层级间结构变异带来的挑战;二是引入语义解耦分词器(semantic-decoupled tokenizer),分离文本与公式的表征空间,降低二者语义纠缠对识别性能的影响。实验表明,该方法在多个领域中英文文档上显著优于现有通用VLM和专业文档解析模型,同时实现2–9倍的速度提升,验证了其有效性与实用性。

链接: https://arxiv.org/abs/2512.21095
作者: Yongkun Du,Zhineng Chen,Yazhen Xie,Weikang Bai,Hao Feng,Wei Shi,Yuchen Su,Can Huang,Yu-Gang Jiang
机构: Fudan University (复旦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text and formulas constitute the core informational components of many documents. Accurately and efficiently recognizing both is crucial for developing robust and generalizable document parsing systems. Recently, vision-language models (VLMs) have achieved impressive unified recognition of text and formulas. However, they are large-sized and computationally demanding, restricting their usage in many applications. In this paper, we propose UniRec-0.1B, a unified recognition model with only 0.1B parameters. It is capable of performing text and formula recognition at multiple levels, including characters, words, lines, paragraphs, and documents. To implement this task, we first establish UniRec40M, a large-scale dataset comprises 40 million text, formula and their mix samples, enabling the training of a powerful yet lightweight model. Secondly, we identify two challenges when building such a lightweight but unified expert model. They are: structural variability across hierarchies and semantic entanglement between textual and formulaic content. To tackle these, we introduce a hierarchical supervision training that explicitly guides structural comprehension, and a semantic-decoupled tokenizer that separates text and formula representations. Finally, we develop a comprehensive evaluation benchmark covering Chinese and English documents from multiple domains and with multiple levels. Experimental results on this and public benchmarks demonstrate that UniRec-0.1B outperforms both general-purpose VLMs and leading document parsing expert models, while achieving a 2-9 \times speedup, validating its effectiveness and efficiency. Codebase and Dataset: this https URL.
zh
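"语义解耦分词器"的最小化理解:给文本与公式分配互不重叠的 ID 区间,使解码端能区分两类内容。下面是一个与论文无关的纯 Python 示意(词表为编造的演示数据):

```python
class DecoupledTokenizer:
    """文本与公式使用互不重叠的 ID 区间,降低两类内容的语义纠缠(示意)。"""
    def __init__(self, text_vocab, formula_vocab):
        self.text_ids = {tok: i for i, tok in enumerate(text_vocab)}
        offset = len(text_vocab)                       # 公式 ID 从文本词表之后开始
        self.formula_ids = {tok: offset + i for i, tok in enumerate(formula_vocab)}

    def encode(self, tokens, is_formula):
        table = self.formula_ids if is_formula else self.text_ids
        return [table[t] for t in tokens]

tok = DecoupledTokenizer(text_vocab=["the", "area", "is"],
                         formula_vocab=["\\frac", "{", "}", "a", "b"])
print(tok.encode(["the", "area", "is"], is_formula=False))    # [0, 1, 2]
print(tok.encode(["\\frac", "{", "a", "}"], is_formula=True)) # [3, 4, 6, 5]
```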

[CV-31] 2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

【速读】:该论文旨在解决文本到音频视频(Text-to-Audio-Video, T2AV)生成系统在评估上的碎片化问题,现有方法多依赖单一模态指标或受限的基准测试,难以全面衡量跨模态对齐、指令遵循能力及复杂提示下的感知真实性。解决方案的关键在于提出T2AV-Compass——一个统一的基准测试框架,其核心创新包括:(1) 基于分类法驱动的管道构建500个语义丰富且物理合理的复杂提示,确保评估多样性与挑战性;(2) 设计双层评估机制,融合信号级客观指标(用于视频质量、音频质量及跨模态对齐)与基于大语言模型作为裁判(MLLM-as-a-Judge)的主观评估协议(用于指令遵循和现实感判断)。此方案为T2AV系统提供了更全面、诊断性强的评测标准,揭示当前主流模型在音频真实性、细粒度同步和指令执行等方面仍显著落后于人类水平。

链接: https://arxiv.org/abs/2512.21094
作者: Zhe Cao,Tao Wang,Jiaming Wang,Yanghai Wang,Yuanxing Zhang,Jialu Chen,Miao Deng,Jiahao Wang,Yubin Guo,Chenxi Liao,Yize Zhang,Zhaoxiang Zhang,Jiaheng Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. To address this limitation, we present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems, consisting of 500 diverse and complex prompts constructed via a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. Besides, T2AV-Compass introduces a dual-level evaluation framework that integrates objective signal-level metrics for video quality, audio quality, and cross-modal alignment with a subjective MLLM-as-a-Judge protocol for instruction following and realism assessment. Extensive evaluation of 11 representative T2AVsystems reveals that even the strongest models fall substantially short of human-level realism and cross-modal consistency, with persistent failures in audio realism, fine-grained synchronization, instruction following, etc. These results indicate significant improvement room for future models and highlight the value of T2AV-Compass as a challenging and diagnostic testbed for advancing text-to-audio-video generation.
zh

[CV-32] Hierarchical Modeling Approach to Fast and Accurate Table Recognition

【速读】:该论文旨在解决从大量文档中提取和利用多样化知识的挑战,尤其针对表格识别任务中因文档内容复杂性导致的识别效率与准确性问题。其核心问题在于现有模型虽通过多任务学习、局部注意力机制和相互学习实现了较好的识别效果,但缺乏对性能提升机制的充分解释,且推理耗时较长。解决方案的关键在于提出一种新型多任务模型,采用非因果注意力机制(non-causal attention)以捕获完整的表格结构信息,并设计并行推理算法以加速单元格内容识别过程,从而在两个大型公开数据集上实现更高效且准确的表格识别性能。

链接: https://arxiv.org/abs/2512.21083
作者: Takaya Kawakatsu
机构: Preferred Networks, Inc. (Preferred Networks, Inc.)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The extraction and use of diverse knowledge from numerous documents is a pressing challenge in intelligent information retrieval. Documents contain elements that require different recognition methods. Table recognition typically consists of three subtasks, namely table structure, cell position and cell content recognition. Recent models have achieved excellent recognition with a combination of multi-task learning, local attention, and mutual learning. However, their effectiveness has not been fully explained, and they require a long period of time for inference. This paper presents a novel multi-task model that utilizes non-causal attention to capture the entire table structure, and a parallel inference algorithm for faster cell content inference. The superiority is demonstrated both visually and statistically on two large public datasets.
zh

[CV-33] UniPR-3D: Towards Universal Visual Place Recognition with Visual Geometry Grounded Transformer

【速读】:该论文旨在解决视觉地点识别(Visual Place Recognition, VPR)中多视角信息利用不足的问题,现有方法在跨环境泛化能力上表现有限。解决方案的关键在于提出UniPR-3D架构,其核心创新是基于VGGT骨干网络构建能够编码多视角3D表示的特征提取器,并设计了专门用于融合2D与3D特征的聚合模块:其中2D中间特征捕捉细粒度纹理线索,3D tokens则支持跨视角推理。此外,通过引入单帧与多帧聚合策略及可变长度序列检索机制,进一步提升了模型在复杂场景下的泛化性能。实验表明,该方法显著优于单视角和多视角基线,验证了几何引导token在VPR任务中的有效性。

链接: https://arxiv.org/abs/2512.21078
作者: Tianchen Deng,Xun Chen,Ziming Li,Hongming Shen,Danwei Wang,Javier Civera,Hesheng Wang
机构: Shanghai Jiao Tong University (上海交通大学); I3A, University of Zaragoza, Spain (西班牙萨拉戈萨大学I3A研究所); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual Place Recognition (VPR) has been traditionally formulated as a single-image retrieval task. Using multiple views offers clear advantages, yet this setting remains relatively underexplored and existing methods often struggle to generalize across diverse environments. In this work we introduce UniPR-3D, the first VPR architecture that effectively integrates information from multiple views. UniPR-3D builds on a VGGT backbone capable of encoding multi-view 3D representations, which we adapt by designing feature aggregators and fine-tune for the place recognition task. To construct our descriptor, we jointly leverage the 3D tokens and intermediate 2D tokens produced by VGGT. Based on their distinct characteristics, we design dedicated aggregation modules for 2D and 3D features, allowing our descriptor to capture fine-grained texture cues while also reasoning across viewpoints. To further enhance generalization, we incorporate both single- and multi-frame aggregation schemes, along with a variable-length sequence retrieval strategy. Our experiments show that UniPR-3D sets a new state of the art, outperforming both single- and multi-view baselines and highlighting the effectiveness of geometry-grounded tokens for VPR. Our code and models will be made publicly available on Github this https URL.
zh

[CV-34] Language-Guided Grasp Detection with Coarse-to-Fine Learning for Robotic Manipulation

【速读】:该论文旨在解决机器人在非结构化、杂乱且语义多样环境中进行语言引导抓取(language-guided grasping)时,因浅层融合策略导致语义 grounding 不足和语言意图与视觉抓取动作对齐弱的问题。其解决方案的关键在于提出一种基于粗到精学习范式的语言引导抓取检测方法(Language-Guided Grasp Detection, LGGD),通过 CLIP 基础的跨模态嵌入在分层交叉注意力机制中逐步注入语言线索至视觉特征重建过程,实现细粒度的视觉-语义对齐;同时引入语言条件动态卷积头(Language-conditioned Dynamic Convolution Head, LDCH),根据句级特征混合多个卷积专家以生成适应指令的粗粒度掩码和抓取预测,并辅以最终精修模块提升复杂场景下的抓取一致性与鲁棒性。

链接: https://arxiv.org/abs/2512.21065
作者: Zebin Jiang,Tianle Jin,Xiangtong Yao,Alois Knoll,Hu Cao
机构: 1. Hu Cao 1,2∗; 2. Zebin Jiang 2†; 3. Tianle Jin 2†; 4. Xiangtong Yao 2; 5. Alois Knoll 2

对应机构信息:

  1. (未提供具体单位名称,仅标注数字下标)
  2. (未提供具体单位名称,仅标注数字下标)

由于作者辅助信息中没有明确写出单位机构的完整名称(如大学、公司等),无法提取有效机构信息。

输出:
未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE Journal

点击查看摘要

Abstract:Grasping is one of the most fundamental challenging capabilities in robotic manipulation, especially in unstructured, cluttered, and semantically diverse environments. Recent researches have increasingly explored language-guided manipulation, where robots not only perceive the scene but also interpret task-relevant natural language instructions. However, existing language-conditioned grasping methods typically rely on shallow fusion strategies, leading to limited semantic grounding and weak alignment between linguistic intent and visual grasp this http URL this work, we propose Language-Guided Grasp Detection (LGGD) with a coarse-to-fine learning paradigm for robotic manipulation. LGGD leverages CLIP-based visual and textual embeddings within a hierarchical cross-modal fusion pipeline, progressively injecting linguistic cues into the visual feature reconstruction process. This design enables fine-grained visual-semantic alignment and improves the feasibility of the predicted grasps with respect to task instructions. In addition, we introduce a language-conditioned dynamic convolution head (LDCH) that mixes multiple convolution experts based on sentence-level features, enabling instruction-adaptive coarse mask and grasp predictions. A final refinement module further enhances grasp consistency and robustness in complex this http URL on the OCID-VLG and Grasp-Anything++ datasets show that LGGD surpasses existing language-guided grasping methods, exhibiting strong generalization to unseen objects and diverse language queries. Moreover, deployment on a real robotic platform demonstrates the practical effectiveness of our approach in executing accurate, instruction-conditioned grasp actions. The code will be released publicly upon acceptance.
zh
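"语言条件动态卷积"的典型写法,是用句级文本特征为多个卷积专家产生混合权重,合成出随指令变化的卷积核。下面是一个极简 PyTorch 示意(非 LGGD 官方实现,各维度与专家数均为假设):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LDCHead(nn.Module):
    """由文本特征路由多个卷积专家,合成样本自适应的卷积核(示意)。"""
    def __init__(self, text_dim=512, in_ch=64, out_ch=1, k=3, num_experts=4):
        super().__init__()
        self.experts = nn.Parameter(torch.randn(num_experts, out_ch, in_ch, k, k) * 0.02)
        self.router = nn.Linear(text_dim, num_experts)

    def forward(self, feat, text_emb):          # feat: (B,C,H,W), text_emb: (B,D)
        w = F.softmax(self.router(text_emb), dim=-1)    # (B, K) 专家混合权重
        out = []
        for b in range(feat.size(0)):           # 每个样本的核都不同,逐样本卷积
            kernel = (w[b].view(-1, 1, 1, 1, 1) * self.experts).sum(dim=0)
            out.append(F.conv2d(feat[b:b + 1], kernel, padding=1))
        return torch.cat(out)                   # (B, out_ch, H, W) 的粗掩码/抓取图

mask = LDCHead()(torch.randn(2, 64, 32, 32), torch.randn(2, 512))
```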

[CV-35] Multimodal Skeleton-Based Action Representation Learning via Decomposition and Composition

【速读】:该论文旨在解决多模态人体动作理解中如何在模型效率与性能之间取得平衡的问题。现有方法通常依赖于简单的晚期融合策略以提升性能,但导致显著的计算开销;而采用共享骨干网络的早期融合虽高效,却难以实现优异性能。解决方案的关键在于提出一种自监督的基于骨架的多模态动作表示学习框架——Decomposition and Composition(分解与组合)。其中,分解策略将融合后的多模态特征精细地还原为独立的单模态特征,并与对应的真值单模态特征对齐;组合策略则利用多个单模态特征作为自监督信号,指导多模态表示的学习,从而在不增加额外标注成本的前提下显著提升模型表现,同时保持较低的计算复杂度。

链接: https://arxiv.org/abs/2512.21064
作者: Hongsong Wang,Heng Fei,Bingxuan Dai,Jie Gui
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by Machine Intelligence Research (Journal Impact Factor 8.7, 2024)

点击查看摘要

Abstract:Multimodal human action understanding is a significant problem in computer vision, with the central challenge being the effective utilization of the complementarity among diverse modalities while maintaining model efficiency. However, most existing methods rely on simple late fusion to enhance performance, which results in substantial computational overhead. Although early fusion with a shared backbone for all modalities is efficient, it struggles to achieve excellent performance. To address the dilemma of balancing efficiency and effectiveness, we introduce a self-supervised multimodal skeleton-based action representation learning framework, named Decomposition and Composition. The Decomposition strategy meticulously decomposes the fused multimodal features into distinct unimodal features, subsequently aligning them with their respective ground truth unimodal counterparts. On the other hand, the Composition strategy integrates multiple unimodal features, leveraging them as self-supervised guidance to enhance the learning of multimodal representations. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II datasets demonstrate that the proposed method strikes an excellent balance between computational cost and model performance.
zh
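"分解 + 组合"的双向自监督可以示意如下(极简 PyTorch,非论文官方实现;模态名、特征维度与损失形式均为假设):分解头把融合特征还原回各单模态并对齐真值,组合头反过来用单模态特征监督融合表示。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecomposeCompose(nn.Module):
    """Decomposition:融合特征 -> 各单模态特征,与真值对齐;
    Composition:单模态特征拼接后生成对融合表示的自监督目标。"""
    def __init__(self, dim=256, modalities=("joint", "bone", "motion")):
        super().__init__()
        self.decomposers = nn.ModuleDict({m: nn.Linear(dim, dim) for m in modalities})
        self.composer = nn.Linear(dim * len(modalities), dim)

    def forward(self, f_fused, f_unimodal: dict):
        decomp = sum(F.mse_loss(self.decomposers[m](f_fused), f_unimodal[m].detach())
                     for m in f_unimodal)
        guide = self.composer(torch.cat(list(f_unimodal.values()), dim=-1))
        comp = F.mse_loss(f_fused, guide.detach())   # 以单模态组合结果为自监督信号
        return decomp + comp

model = DecomposeCompose()
f_uni = {m: torch.randn(8, 256) for m in ("joint", "bone", "motion")}
loss = model(torch.randn(8, 256), f_uni)
```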

[CV-36] Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control

【速读】:该论文旨在解决病理图像生成中面临的三大瓶颈问题:高质量图像-文本语料稀缺、细粒度语义控制不足导致依赖非语义线索,以及术语异质性阻碍可靠文本条件化。其解决方案的关键在于提出UniPath框架,通过多流控制机制实现语义驱动的可控生成:一是原始文本流;二是高级语义流,利用冻结的病理多模态大语言模型(Multimodal Large Language Model, MLLM)提取抗重述的诊断语义Token并扩展为诊断感知的属性包;三是原型流,借助原型库实现组件级形态学控制。该方法显著提升了生成质量与语义对齐能力,实验表明其Patho-FID达80.9(优于次优方案51%),且细粒度语义控制接近真实图像水平(98.7%)。

链接: https://arxiv.org/abs/2512.21058
作者: Minghao Han,YiChen Liu,Yizhou Liu,Zizhi Chen,Jingqun Tang,Xuecheng Wu,Dingkang Yang,Lihua Zhang
机构: Fudan University (复旦大学); Fysics Intelligence Technologies Co., Ltd. (Fysics AI); University of Science and Technology Beijing (北京科技大学); ByteDance (字节跳动); Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages, 17 figures, and 6 tables

点击查看摘要

Abstract:In computational pathology, understanding and generation have evolved along disparate paths: advanced understanding models already exhibit diagnostic-level competence, whereas generative models largely simulate pixels. Progress remains hindered by three coupled factors: the scarcity of large, high-quality image-text corpora; the lack of precise, fine-grained semantic control, which forces reliance on non-semantic cues; and terminological heterogeneity, where diverse phrasings for the same diagnostic concept impede reliable text conditioning. We introduce UniPath, a semantics-driven pathology image generation framework that leverages mature diagnostic understanding to enable controllable generation. UniPath implements Multi-Stream Control: a Raw-Text stream; a High-Level Semantics stream that uses learnable queries to a frozen pathology MLLM to distill paraphrase-robust Diagnostic Semantic Tokens and to expand prompts into diagnosis-aware attribute bundles; and a Prototype stream that affords component-level morphological control via a prototype bank. On the data front, we curate a 2.65M image-text corpus and a finely annotated, high-quality 68K subset to alleviate data scarcity. For a comprehensive assessment, we establish a four-tier evaluation hierarchy tailored to pathology. Extensive experiments demonstrate UniPath’s SOTA performance, including a Patho-FID of 80.9 (51% better than the second-best) and fine-grained semantic control achieving 98.7% of the real-image. The meticulously curated datasets, complete source code, and pre-trained model weights developed in this study will be made openly accessible to the public.
zh

[CV-37] DexAvatar: 3D Sign Language Reconstruction with Hand and Body Pose Priors WACV2026

【Quick Read】: This paper addresses the quality ceiling in sign language generation caused by the lack of high-quality 3D human pose data, in particular the fact that existing video datasets mostly contain 2D keypoints and that automatic 3D reconstruction from them lacks accuracy. The key is the DexAvatar framework, which exploits learned 3D hand and body priors to reconstruct bio-mechanically accurate, fine-grained hand articulations and body movements from in-the-wild monocular sign language videos, significantly improving 3D sign pose estimation: on the SGNify motion capture dataset it improves body and hand pose estimation by 35.11% over the previous state of the art.

Link: https://arxiv.org/abs/2512.21054
Authors: Kaustubh Kundu,Hrishav Bakul Barua,Lucy Robertson-Bell,Zhixi Cai,Kalin Stefanov
Institutions: Monash University (莫纳什大学); TCS Research
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: Accepted in WACV 2026

Abstract:The trend in sign language generation is centered around data-driven generative methods that require vast amounts of precise 2D and 3D human pose data to achieve an acceptable generation quality. However, currently, most sign language datasets are video-based and limited to automatically reconstructed 2D human poses (i.e., keypoints) and lack accurate 3D information. Furthermore, existing state-of-the-art for automatic 3D human pose estimation from sign language videos is prone to self-occlusion, noise, and motion blur effects, resulting in poor reconstruction quality. In response to this, we introduce DexAvatar, a novel framework to reconstruct bio-mechanically accurate fine-grained hand articulations and body movements from in-the-wild monocular sign language videos, guided by learned 3D hand and body priors. DexAvatar achieves strong performance in the SGNify motion capture dataset, the only benchmark available for this task, reaching an improvement of 35.11% in the estimation of body and hand poses compared to the state-of-the-art. The official website of this work is: this https URL.

[CV-38] Optical Flow-Guided 6DoF Object Pose Tracking with an Event Camera

【Quick Read】: This paper addresses the challenges that traditional cameras face in 6DoF object pose tracking under difficult conditions such as motion blur, sensor noise, partial occlusion, and illumination changes. The key is an optical flow-guided 6DoF pose tracking method using an event camera: a 2D-3D hybrid feature extraction strategy first detects corners and edges precisely from events and object models; the optical flow of corners is then found by maximizing the event-association probability within a spatio-temporal window, and this flow is used to associate corners with edges; finally, the 6DoF pose is iteratively optimized by minimizing corner-to-edge distances, yielding continuous and robust pose tracking.

Link: https://arxiv.org/abs/2512.21053
Authors: Zibin Liu,Banglei Guan,Yang Shang,Shunkun Liang,Zhenbao Yu,Qifeng Yu
Institutions: National University of Defense Technology(国防科技大学); Wuhan University(武汉大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures. In Proceedings of the 32nd ACM International Conference on Multimedia (MM '24)

Abstract:Object pose tracking is one of the pivotal technologies in multimedia, attracting ever-growing attention in recent years. Existing methods employing traditional cameras encounter numerous challenges such as motion blur, sensor noise, partial occlusion, and changing lighting conditions. The emerging bio-inspired sensors, particularly event cameras, possess advantages such as high dynamic range and low latency, which hold the potential to address the aforementioned challenges. In this work, we present an optical flow-guided 6DoF object pose tracking method with an event camera. A 2D-3D hybrid feature extraction strategy is firstly utilized to detect corners and edges from events and object models, which characterizes object motion precisely. Then, we search for the optical flow of corners by maximizing the event-associated probability within a spatio-temporal window, and establish the correlation between corners and edges guided by optical flow. Furthermore, by minimizing the distances between corners and edges, the 6DoF object pose is iteratively optimized to achieve continuous pose tracking. Experimental results of both simulated and real events demonstrate that our methods outperform event-based state-of-the-art methods in terms of both accuracy and robustness.

[CV-39] Matrix Completion Via Reweighted Logarithmic Norm Minimization

【Quick Read】: This paper targets the suboptimal solutions in low-rank matrix completion (LRMC) that arise from using the nuclear norm as a convex relaxation of the rank function. The key is a new nonconvex surrogate, the reweighted logarithmic norm, which approximates the true rank more closely and alleviates the excessive singular-value shrinkage caused by the nuclear norm. The resulting optimization problem is solved efficiently with the alternating direction method of multipliers (ADMM), and image inpainting experiments show clear gains over state-of-the-art LRMC algorithms in both visual quality and quantitative metrics.

Link: https://arxiv.org/abs/2512.21050
Authors: Zhijie Wang,Liangtian He,Qinghua Zhang,Jifei Miao,Liang-Jian Deng,Jun Liu
Institutions: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Low-rank matrix completion (LRMC) has demonstrated remarkable success in a wide range of applications. To address the NP-hard nature of the rank minimization problem, the nuclear norm is commonly used as a convex and computationally tractable surrogate for the rank function. However, this approach often yields suboptimal solutions due to the excessive shrinkage of singular values. In this letter, we propose a novel reweighted logarithmic norm as a more effective nonconvex surrogate, which provides a closer approximation than many existing alternatives. We efficiently solve the resulting optimization problem by employing the alternating direction method of multipliers (ADMM). Experimental results on image inpainting demonstrate that the proposed method achieves superior performance compared to state-of-the-art LRMC approaches, both in terms of visual quality and quantitative metrics.
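
To make the surrogate concrete: a common way to handle a sum-of-logs penalty on singular values is iterative shrinkage with weights 1/(σ+ε), the local linearization of the log term. The sketch below uses that classic reweighted heuristic with a simple data-consistency projection; it illustrates the idea only and is not the paper's exact ADMM splitting, and the tau/eps values are assumptions.

```python
import numpy as np

def reweighted_log_svt(Y, tau, eps=1e-2):
    """Shrink singular values with weights 1/(s + eps): the proximal step for a
    local linearization of the log surrogate sum(log(s_i + eps))."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s = np.maximum(s - tau / (s + eps), 0.0)
    return (U * s) @ Vt

def complete_matrix(M, mask, tau=0.5, n_iter=300):
    X = np.where(mask, M, 0.0)
    for _ in range(n_iter):
        X = reweighted_log_svt(X, tau)
        X[mask] = M[mask]              # enforce observed entries
    return X

rng = np.random.default_rng(0)
M = rng.normal(size=(60, 8)) @ rng.normal(size=(8, 50))   # rank-8 ground truth
mask = rng.random(M.shape) < 0.5                          # observe 50% of entries
X = complete_matrix(M, mask)
print(np.linalg.norm(X - M) / np.linalg.norm(M))          # relative recovery error
```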

[CV-40] A Large-Depth-Range Layer-Based Hologram Dataset for Machine Learning-Based 3D Computer-Generated Holography

【Quick Read】: This paper tackles the performance bottleneck of machine learning-based computer-generated holography (ML-CGH) caused by the scarcity of high-quality, large-scale hologram datasets. The key is the publicly available KOREATECH-CGH dataset of 6,000 RGB-D image and complex hologram pairs, covering resolutions from 256×256 to 2048×2048 and depth ranges up to the theoretical limit, together with amplitude projection, a post-processing technique that replaces the amplitude component at each depth layer while preserving phase. This markedly improves reconstruction at large depth ranges, reaching 27.01 dB PSNR and 0.87 SSIM, 2.03 dB and 0.04 SSIM above the previous best, and provides a solid basis for training and evaluating next-generation ML-CGH systems.

Link: https://arxiv.org/abs/2512.21040
Authors: Jaehong Lee,You Chan No,YoungWoo Kim,Duksu Kim
Institutions: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Comments:

Abstract:Machine learning-based computer-generated holography (ML-CGH) has advanced rapidly in recent years, yet progress is constrained by the limited availability of high-quality, large-scale hologram datasets. To address this, we present KOREATECH-CGH, a publicly available dataset comprising 6,000 pairs of RGB-D images and complex holograms across resolutions ranging from 256×256 to 2048×2048, with depth ranges extending to the theoretical limits of the angular spectrum method for wide 3D scene coverage. To improve hologram quality at large depth ranges, we introduce amplitude projection, a post-processing technique that replaces amplitude components of hologram wavefields at each depth layer while preserving phase. This approach enhances reconstruction fidelity, achieving 27.01 dB PSNR and 0.87 SSIM, surpassing a recent optimized silhouette-masking layer-based method by 2.03 dB and 0.04 SSIM, respectively. We further validate the utility of KOREATECH-CGH through experiments on hologram generation and super-resolution using state-of-the-art ML models, confirming its applicability for training and evaluating next-generation ML-CGH systems.
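
The amplitude-projection step is simple enough to show directly. Below is a minimal NumPy sketch, assuming angular-spectrum propagation and hand-picked optical parameters; the paper's exact per-layer schedule and silhouette handling are not reproduced here.

```python
import numpy as np

def angular_spectrum_propagate(field, wavelength, pitch, z):
    """Propagate a complex wavefield by distance z via the angular spectrum method."""
    ny, nx = field.shape
    fx = np.fft.fftfreq(nx, d=pitch)
    fy = np.fft.fftfreq(ny, d=pitch)
    FX, FY = np.meshgrid(fx, fy)
    arg = (1.0 / wavelength) ** 2 - FX ** 2 - FY ** 2
    kz = 2 * np.pi * np.sqrt(np.maximum(arg, 0.0))
    H = np.exp(1j * kz * z) * (arg > 0)        # evanescent components dropped
    return np.fft.ifft2(np.fft.fft2(field) * H)

def amplitude_projection(field, target_amplitude):
    """Replace the amplitude of a complex field while preserving its phase."""
    return target_amplitude * np.exp(1j * np.angle(field))

rng = np.random.default_rng(0)
field = rng.normal(size=(256, 256)) + 1j * rng.normal(size=(256, 256))
layer = angular_spectrum_propagate(field, wavelength=633e-9, pitch=8e-6, z=5e-3)
layer = amplitude_projection(layer, target_amplitude=np.ones((256, 256)))
print(np.allclose(np.abs(layer), 1.0))         # amplitude replaced, phase kept
```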

[CV-41] Next-Scale Prediction: A Self-Supervised Approach for Real-World Image Denoising

【Quick Read】: This paper tackles the core difficulty of self-supervised real-world image denoising: removing spatially structured noise while preserving high-frequency detail, a long-standing trade-off between denoising and fidelity. Existing blind-spot network (BSN) methods rely on pixel-shuffle downsampling (PD) for noise decorrelation, but aggressive downsampling destroys fine structures while mild downsampling fails to remove correlated noise. The key is Next-Scale Prediction (NSP), which decouples noise decorrelation from detail preservation: NSP builds cross-scale training pairs in which the BSN takes low-resolution, fully decorrelated sub-images as input and predicts high-resolution targets that retain fine detail. This not only eases the conflict between noise removal and detail preservation, but also naturally supports super-resolving noisy images without retraining or model changes.

Link: https://arxiv.org/abs/2512.21038
Authors: Yiwen Shan,Haiyu Zhao,Peng Hu,Xi Peng,Yuanbiao Gou
Institutions: Sichuan University (四川大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Self-supervised real-world image denoising remains a fundamental challenge, arising from the antagonistic trade-off between decorrelating spatially structured noise and preserving high-frequency details. Existing blind-spot network (BSN) methods rely on pixel-shuffle downsampling (PD) to decorrelate noise, but aggressive downsampling fragments fine structures, while milder downsampling fails to remove correlated noise. To address this, we introduce Next-Scale Prediction (NSP), a novel self-supervised paradigm that decouples noise decorrelation from detail preservation. NSP constructs cross-scale training pairs, where BSN takes low-resolution, fully decorrelated sub-images as input to predict high-resolution targets that retain fine details. As a by-product, NSP naturally supports super-resolution of noisy images without retraining or modification. Extensive experiments demonstrate that NSP achieves state-of-the-art self-supervised denoising performance on real-world benchmarks, significantly alleviating the long-standing conflict between noise decorrelation and detail preservation.
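
The cross-scale pairing is easy to picture with plain pixel-shuffle downsampling. A minimal NumPy sketch follows, showing only the data plumbing; the network, loss, and exact stride pairing follow the paper and are not shown here.

```python
import numpy as np

def pixel_unshuffle(img, s):
    """Split an (H, W, C) image into s*s spatially subsampled sub-images.
    Neighbouring pixels land in different sub-images, which breaks up
    spatially correlated noise."""
    h, w, _ = img.shape
    h, w = h - h % s, w - w % s
    img = img[:h, :w]
    return [img[i::s, j::s] for i in range(s) for j in range(s)]

# A cross-scale pair in the spirit of NSP: the network input is a strongly
# decorrelated low-resolution view, the target a higher-resolution view.
noisy = np.random.rand(256, 256, 3).astype(np.float32)
inputs  = pixel_unshuffle(noisy, s=4)   # 16 sub-images of size 64x64
targets = pixel_unshuffle(noisy, s=2)   # 4 sub-images of size 128x128
print(inputs[0].shape, targets[0].shape)
```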

[CV-42] Multi-Attribute guided Thermal Face Image Translation based on Latent Diffusion Model

【Quick Read】: This paper addresses the recognition degradation caused by the domain shift when infrared face images are fed to deep models trained in the visible domain, in particular the distortion and loss of key identity features in conventional generative infrared-to-visible translation. The key is a novel latent-diffusion generative model combined with a multi-attribute classifier that extracts and preserves key facial attributes from visible images, improving both the quality and identity consistency of the generated faces; a Self-attn Mamba module further strengthens global modeling of cross-modal features and markedly speeds up inference, achieving the best reported image quality and identity preservation on two benchmark datasets.

Link: https://arxiv.org/abs/2512.21032
Authors: Mingshu Cai,Osamu Yoshie,Yuya Ieiri
Institutions: Waseda University (早稻田大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by 2025 IEEE International Joint Conference on Biometrics (IJCB 2025)

Abstract:Modern surveillance systems increasingly rely on multi-wavelength sensors and deep neural networks to recognize faces in infrared images captured at night. However, most facial recognition models are trained on visible light datasets, leading to substantial performance degradation on infrared inputs due to significant domain shifts. Early feature-based methods for infrared face recognition proved ineffective, prompting researchers to adopt generative approaches that convert infrared images into visible light images for improved recognition. This paradigm, known as Heterogeneous Face Recognition (HFR), faces challenges such as model and modality discrepancies, leading to distortion and feature loss in generated images. To address these limitations, this paper introduces a novel latent diffusion-based model designed to generate high-quality visible face images from thermal inputs while preserving critical identity features. A multi-attribute classifier is incorporated to extract key facial attributes from visible images, mitigating feature loss during infrared-to-visible image restoration. Additionally, we propose the Self-attn Mamba module, which enhances global modeling of cross-modal features and significantly improves inference speed. Experimental results on two benchmark datasets demonstrate the superiority of our approach, achieving state-of-the-art performance in both image quality and identity preservation.

[CV-43] Efficient and Robust Video Defense Framework against 3D-field Personalized Talking Face

【Quick Read】: This paper addresses the privacy risk that 3D-field video-referenced talking face generation (TFG) methods pose to personal portrait videos: they can synthesize high-fidelity personalized talking-face videos in real time, yet no effective defense protects source videos from malicious misuse. Existing image-based defenses apply per-frame 2D perturbations, but they are computationally expensive, severely degrade video quality, and fail to disrupt the 3D information needed for protection. The key is a novel and efficient video defense framework that protects portrait videos by perturbing the 3D information acquisition process without noticeably harming video quality, built on (1) a similarity-guided parameter sharing mechanism for computational efficiency and (2) a multi-scale dual-domain attention module that jointly optimizes spatial-frequency perturbations. Experiments show a 47x speedup over the fastest baseline while preserving fidelity, plus robustness to scaling operations and state-of-the-art purification attacks.

Link: https://arxiv.org/abs/2512.21019
Authors: Rui-qing Sun,Xingshan Yao,Tian Lan,Hui-Yang Zhao,Jia-Ling Shi,Chen-Hao Cui,Zhijing Wu,Chen Yang,Xian-Ling Mao
Institutions: Beijing Institute of Technology (北京理工大学); Alibaba International Digital Commerce (阿里巴巴国际数字商业集团)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:State-of-the-art 3D-field video-referenced Talking Face Generation (TFG) methods synthesize high-fidelity personalized talking-face videos in real time by modeling 3D geometry and appearance from reference portrait video. This capability raises significant privacy concerns regarding malicious misuse of personal portraits. However, no efficient defense framework exists to protect such videos against 3D-field TFG methods. While image-based defenses could apply per-frame 2D perturbations, they incur prohibitive computational costs, severe video quality degradation, failing to disrupt 3D information for video protection. To address this, we propose a novel and efficient video defense framework against 3D-field TFG methods, which protects portrait video by perturbing the 3D information acquisition process while maintain high-fidelity video quality. Specifically, our method introduces: (1) a similarity-guided parameter sharing mechanism for computational efficiency, and (2) a multi-scale dual-domain attention module to jointly optimize spatial-frequency perturbations. Extensive experiments demonstrate that our proposed framework exhibits strong defense capability and achieves a 47x acceleration over the fastest baseline while maintaining high fidelity. Moreover, it remains robust against scaling operations and state-of-the-art purification attacks, and the effectiveness of our design choices is further validated through ablation studies. Our project is available at this https URL.

[CV-44] FluencyVE: Marrying Temporal-Aware Mamba with Bypass Attention for Video Editing

【Quick Read】: This paper addresses two challenges in video editing with pretrained text-to-image diffusion models: temporal inconsistency, i.e., visual jumps or incoherence between frames, and the high computational overhead introduced by the temporal attention used in prior methods. The key of the proposed FluencyVE framework is to replace the temporal attention layer with the linear time-series module Mamba, enabling global frame-level attention at much lower cost; in addition, low-rank approximation matrices replace the query and key weight matrices in the causal attention, and a weighted-averaging strategy updates attention scores during training, largely preserving the generative power of the text-to-image model while substantially reducing computation.

Link: https://arxiv.org/abs/2512.21015
Authors: Mingshu Cai,Yixuan Li,Osamu Yoshie,Yuya Ieiri
Institutions: Waseda University (早稻田大学); Southeast University (东南大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Transactions on Multimedia (TMM)

Abstract:Large-scale text-to-image diffusion models have achieved unprecedented success in image generation and editing. However, extending this success to video editing remains challenging. Recent video editing efforts have adapted pretrained text-to-image models by adding temporal attention mechanisms to handle video tasks. Unfortunately, these methods continue to suffer from temporal inconsistency issues and high computational overheads. In this study, we propose FluencyVE, which is a simple yet effective one-shot video editing approach. FluencyVE integrates the linear time-series module, Mamba, into a video editing model based on pretrained Stable Diffusion models, replacing the temporal attention layer. This enables global frame-level attention while reducing the computational costs. In addition, we employ low-rank approximation matrices to replace the query and key weight matrices in the causal attention, and use a weighted averaging technique during training to update the attention scores. This approach significantly preserves the generative power of the text-to-image model while effectively reducing the computational burden. Experiments and analyses demonstrate promising results in editing various attributes, subjects, and locations in real-world videos.
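
The low-rank replacement of query/key projections can be illustrated with a truncated SVD of a pretrained weight. A small PyTorch sketch, where the rank and the choice of layers are assumptions rather than the paper's settings:

```python
import torch

def low_rank_factorize(weight: torch.Tensor, rank: int):
    """Approximate a (d_out, d_in) projection weight by two rank-r factors."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # (d_out, r)
    B = Vh[:rank, :]                    # (r, d_in)
    return A, B

d = 320
W_q = torch.randn(d, d)                 # stands in for a pretrained to_q weight
A, B = low_rank_factorize(W_q, rank=32)

x = torch.randn(8, d)
q_full = x @ W_q.T                      # original query projection
q_lr = x @ B.T @ A.T                    # low-rank query projection (2*r*d params)
print(torch.dist(q_full, q_lr))         # approximation error
```

Replacing a d×d matrix with two rank-r factors cuts the projection from d² to 2rd parameters, which is where the computational savings come from.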

[CV-45] Granular-ball Guided Masking: Structure-aware Data Augmentation

【Quick Read】: This paper addresses deep models' heavy reliance on large-scale labeled data in computer vision and their tendency to overfit when data are limited or distributions shift. Existing augmentation methods, especially mask-based information dropping, improve robustness but often lack structural awareness and may discard essential semantics. The key is Granular-ball Guided Masking (GBGM), a structure-aware augmentation strategy based on Granular-ball Computing (GBC): through a coarse-to-fine hierarchical masking process it adaptively preserves semantically rich, structurally important regions while suppressing redundant ones, producing augmentations that are both representative and discriminative. The method is simple and model-agnostic, integrates seamlessly into CNNs and Vision Transformers, and offers a new paradigm for structure-aware data augmentation.

Link: https://arxiv.org/abs/2512.21011
Authors: Shuyin Xia,Fan Chen,Dawei Dai,Meng Yang,Junwei Han,Xinbo Gao,Guoyin Wang
Institutions: Chongqing Key Laboratory of Computational Intelligence (重庆计算智能重点实验室); Key Laboratory of Cyberspace Big Data Intelligent Security, Ministry of Education (教育部网络空间大数据智能安全重点实验室); Sichuan-Chongqing Co-construction Key Laboratory of Digital Economy Intelligence (川渝共建数字经济智能重点实验室); Key Laboratory of Big Data Intelligent Computing (大数据智能计算重点实验室); Chongqing University of Posts and Telecommunications (重庆邮电大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep learning models have achieved remarkable success in computer vision, but they still rely heavily on large-scale labeled data and tend to overfit when data are limited or distributions shift. Data augmentation, particularly mask-based information dropping, can enhance robustness by forcing models to explore complementary cues; however, existing approaches often lack structural awareness and may discard essential semantics. We propose Granular-ball Guided Masking (GBGM), a structure-aware augmentation strategy guided by Granular-ball Computing (GBC). GBGM adaptively preserves semantically rich, structurally important regions while suppressing redundant areas through a coarse-to-fine hierarchical masking process, producing augmentations that are both representative and discriminative. Extensive experiments on multiple benchmarks demonstrate consistent improvements in classification accuracy and masked image reconstruction, confirming the effectiveness and broad applicability of the proposed method. Simple and model-agnostic, it integrates seamlessly into CNNs and Vision Transformers and provides a new paradigm for structure-aware data augmentation.

[CV-46] Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations

【Quick Read】: This paper addresses the shortcomings of current visual generative pretraining for video: BERT-style masked modeling ignores the temporal information essential for video analysis, while the few autoregressive (AR) approaches suffer from inaccurate semantic localization and poor generation quality and diversity. The key of the proposed NExT-Vid framework is a context-isolated autoregressive predictor that decouples semantic representation from target decoding, combined with a conditioned flow-matching decoder that improves generation quality and diversity; using masked next-frame prediction as the pretraining task, it jointly models images and videos and clearly outperforms previous generative visual representation learning methods on large-scale pretrained models.

Link: https://arxiv.org/abs/2512.21004
Authors: Jinghan Li,Yang Jin,Hao Jiang,Yadong Mu,Yang Song,Kun Xu
Institutions: Peking University (北京大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in pretraining general foundation models have significantly improved performance across diverse downstream tasks. While autoregressive (AR) generative models like GPT have revolutionized NLP, most visual generative pretraining methods still rely on BERT-style masked modeling, which often disregards the temporal information essential for video analysis. The few existing autoregressive visual pretraining methods suffer from issues such as inaccurate semantic localization and poor generation quality, leading to poor semantics. In this work, we propose NExT-Vid, a novel autoregressive visual generative pretraining framework that utilizes masked next-frame prediction to jointly model images and videos. NExT-Vid introduces a context-isolated autoregressive predictor to decouple semantic representation from target decoding, and a conditioned flow-matching decoder to enhance generation quality and diversity. Through context-isolated flow-matching pretraining, our approach achieves strong representations. Extensive experiments on large-scale pretrained models demonstrate that our proposed method consistently outperforms previous generative pretraining methods for visual representation learning via attentive probing in downstream classification.

[CV-47] MVInverse: Feed-forward Multi-view Inverse Rendering in Seconds

【Quick Read】: This paper addresses two limitations of multi-view inverse rendering: existing single-view methods ignore cross-view relationships, yielding inconsistent geometry, materials, and illumination, while traditional multi-view optimization relies on slow differentiable rendering and per-scene refinement, making it expensive and hard to scale. The key is a feed-forward multi-view inverse rendering framework that alternates attention across views, jointly modeling long-range lighting interactions within each view and material consistency across views, so that coherent scene-level reasoning happens in a single forward pass; to cope with scarce real-world training data, a consistency-based finetuning strategy further exploits unlabeled real videos to improve multi-view consistency and in-the-wild robustness.

Link: https://arxiv.org/abs/2512.21003
Authors: Xiangzuo Wu,Chengwei Ren,Jun Zhou,Xiu Li,Yuan Liu
Institutions: Tsinghua University (清华大学); Hong Kong University of Science and Technology (香港科技大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 17 figures, 5 tables

Abstract:Multi-view inverse rendering aims to recover geometry, materials, and illumination consistently across multiple viewpoints. When applied to multi-view images, existing single-view approaches often ignore cross-view relationships, leading to inconsistent results. In contrast, multi-view optimization methods rely on slow differentiable rendering and per-scene refinement, making them computationally expensive and hard to scale. To address these limitations, we introduce a feed-forward multi-view inverse rendering framework that directly predicts spatially varying albedo, metallic, roughness, diffuse shading, and surface normals from sequences of RGB images. By alternating attention across views, our model captures both intra-view long-range lighting interactions and inter-view material consistency, enabling coherent scene-level reasoning within a single forward pass. Due to the scarcity of real-world training data, models trained on existing synthetic datasets often struggle to generalize to real-world scenes. To overcome this limitation, we propose a consistency-based finetuning strategy that leverages unlabeled real-world videos to enhance both multi-view coherence and robustness under in-the-wild conditions. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance in terms of multi-view consistency, material and normal estimation quality, and generalization to real-world imagery.
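
Alternating attention over the token axis and the view axis is mostly tensor bookkeeping. A compact PyTorch sketch; the layer sizes and residual wiring are assumptions, not the paper's exact block:

```python
import torch
import torch.nn as nn

class AlternatingViewBlock(nn.Module):
    """One block that alternates attention within each view (over tokens)
    and across views (over the view axis)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):               # x: (B, V, N, C)
        b, v, n, c = x.shape
        t = x.reshape(b * v, n, c)      # intra-view: attend over the N tokens
        t = t + self.intra(t, t, t)[0]
        t = t.reshape(b, v, n, c).permute(0, 2, 1, 3).reshape(b * n, v, c)
        t = t + self.cross(t, t, t)[0]  # cross-view: attend over the V views
        return t.reshape(b, n, v, c).permute(0, 2, 1, 3)

x = torch.randn(2, 4, 196, 64)          # 4 views, 196 tokens each
print(AlternatingViewBlock(64)(x).shape)
```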

[CV-48] PUFM: Point Cloud Upsampling via Enhanced Flow Matching

【Quick Read】: This paper addresses point cloud upsampling from sparse, noisy, and incomplete inputs, focusing on geometric fidelity, robustness to imperfect observations, and consistency with downstream surface-based tasks. The key is PUFM++, an enhanced flow-matching framework with three core improvements: (1) a two-stage flow-matching strategy that first learns a direct, straight-path flow from sparse inputs to dense targets and then refines the terminal marginal approximation with noise-perturbed samples; (2) a data-driven adaptive time scheduler that improves sampling efficiency and inference stability; and (3) on-manifold constraints during sampling that keep generated points aligned with the underlying surface, together with a Recurrent Interface Network (RIN) that strengthens hierarchical feature interactions, yielding clearly better reconstruction accuracy and visual fidelity.

Link: https://arxiv.org/abs/2512.20988
Authors: Zhi-Song Liu,Chenhang He,Roland Maier,Andreas Rupp
Institutions: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 15 figures

Abstract:Recent advances in generative modeling have demonstrated strong promise for high-quality point cloud upsampling. In this work, we present PUFM++, an enhanced flow-matching framework for reconstructing dense and accurate point clouds from sparse, noisy, and partial observations. PUFM++ improves flow matching along three key axes: (i) geometric fidelity, (ii) robustness to imperfect input, and (iii) consistency with downstream surface-based tasks. We introduce a two-stage flow-matching strategy that first learns a direct, straight-path flow from sparse inputs to dense targets, and then refines it using noise-perturbed samples to approximate the terminal marginal distribution better. To accelerate and stabilize inference, we propose a data-driven adaptive time scheduler that improves sampling efficiency based on interpolation behavior. We further impose on-manifold constraints during sampling to ensure that generated points remain aligned with the underlying surface. Finally, we incorporate a recurrent interface network~(RIN) to strengthen hierarchical feature interactions and boost reconstruction quality. Extensive experiments on synthetic benchmarks and real-world scans show that PUFM++ sets a new state of the art in point cloud upsampling, delivering superior visual fidelity and quantitative accuracy across a wide range of tasks. Code and pretrained models are publicly available at this https URL.
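
The straight-path flow-matching objective at the core of such frameworks fits in a few lines. A toy PyTorch sketch with an MLP standing in for the point network; the x0/x1 pairing and the conditioning used by PUFM++ are simplified away:

```python
import torch
import torch.nn as nn

# x0: points derived from the sparse input, x1: paired dense ground truth.
velocity_net = nn.Sequential(nn.Linear(3 + 1, 128), nn.ReLU(), nn.Linear(128, 3))

def flow_matching_loss(x0, x1):
    t = torch.rand(x0.shape[0], 1)           # one time per point, in [0, 1)
    x_t = (1 - t) * x0 + t * x1              # straight interpolation path
    target_v = x1 - x0                       # constant velocity along the path
    pred_v = velocity_net(torch.cat([x_t, t], dim=-1))
    return ((pred_v - target_v) ** 2).mean()

x0, x1 = torch.randn(2048, 3), torch.randn(2048, 3)
print(flow_matching_loss(x0, x1))
```

At inference one integrates the learned velocity field from the sparse-derived points toward the dense target, which is why a straight path keeps the number of integration steps small.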

[CV-49] X-ray Insights Unleashed: Pioneering the Enhancement of Multi-Label Long-Tail Data

【Quick Read】: This paper addresses the diagnostic difficulty of long-tailed pulmonary anomalies in chest X-rays, where the scarcity of rare-lesion samples limits generative models and hence diagnostic accuracy. The key is a data synthesis pipeline: a diffusion model is first trained on abundant normal chest radiographs so that it can generate high-quality normal X-ray images; this pretrained model is then used to inpaint the head lesions in diseased X-rays, preserving tail classes as augmented training data. A Large Language Model Knowledge Guidance (LKG) module and a Progressive Incremental Learning (PIL) strategy further stabilize the inpainting fine-tuning, setting a new performance benchmark on the public MIMIC and CheXpert datasets.

Link: https://arxiv.org/abs/2512.20980
Authors: Xinquan Yang,Jinheng Xie,Yawen Huang,Yuexiang Li,Huimin Huang,Hao Zheng,Xian Wu,Yefeng Zheng,Linlin Shen
Institutions: Tencent(腾讯); Shenzhen University(深圳大学); National University of Singapore(新加坡国立大学); Westlake University(西湖大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Long-tailed pulmonary anomalies in chest radiography present formidable diagnostic challenges. Despite the recent strides in diffusion-based methods for enhancing the representation of tailed lesions, the paucity of rare lesion exemplars curtails the generative capabilities of these approaches, thereby leaving the diagnostic precision less than optimal. In this paper, we propose a novel data synthesis pipeline designed to augment tail lesions utilizing a copious supply of conventional normal X-rays. Specifically, a sufficient quantity of normal samples is amassed to train a diffusion model capable of generating normal X-ray images. This pre-trained diffusion model is subsequently utilized to inpaint the head lesions present in the diseased X-rays, thereby preserving the tail classes as augmented training data. Additionally, we propose the integration of a Large Language Model Knowledge Guidance (LKG) module alongside a Progressive Incremental Learning (PIL) strategy to stabilize the inpainting fine-tuning process. Comprehensive evaluations conducted on the public lung datasets MIMIC and CheXpert demonstrate that the proposed method sets a new benchmark in performance.

[CV-50] XGrid-Mapping: Explicit Implicit Hybrid Grid Submaps for Efficient Incremental Neural LiDAR Mapping

【Quick Read】: This paper addresses the difficulty of balancing efficiency and accuracy in large-scale incremental LiDAR mapping: most neural LiDAR mapping methods rely on dense implicit representations and underuse geometric structure, while voxel-guided methods struggle to run in real time. The key is XGrid-Mapping, a hybrid grid framework combining explicit and implicit representations: a sparse grid provides geometric priors and structural guidance, an implicit dense grid enriches the scene representation, and coupling a VDB structure with submap-based organization reduces computational load for efficient large-scale incremental mapping; a distillation-based overlap alignment strategy removes discontinuities across submaps, and a dynamic removal module improves robustness and sampling efficiency.

Link: https://arxiv.org/abs/2512.20976
Authors: Zeqing Song,Zhongmiao Yan,Junyuan Deng,Songpengcheng Xia,Xiang Mu,Jingyi Xu,Qi Wu,Ling Pei
Institutions: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large-scale incremental mapping is fundamental to the development of robust and reliable autonomous systems, as it underpins incremental environmental understanding with sequential inputs for navigation and decision-making. LiDAR is widely used for this purpose due to its accuracy and robustness. Recently, neural LiDAR mapping has shown impressive performance; however, most approaches rely on dense implicit representations and underutilize geometric structure, while existing voxel-guided methods struggle to achieve real-time performance. To address these challenges, we propose XGrid-Mapping, a hybrid grid framework that jointly exploits explicit and implicit representations for efficient neural LiDAR mapping. Specifically, the strategy combines a sparse grid, providing geometric priors and structural guidance, with an implicit dense grid that enriches scene representation. By coupling the VDB structure with a submap-based organization, the framework reduces computational load and enables efficient incremental mapping on a large scale. To mitigate discontinuities across submaps, we introduce a distillation-based overlap alignment strategy, in which preceding submaps supervise subsequent ones to ensure consistency in overlapping regions. To further enhance robustness and sampling efficiency, we incorporate a dynamic removal module. Extensive experiments show that our approach delivers superior mapping quality while overcoming the efficiency limitations of voxel-guided methods, thereby outperforming existing state-of-the-art mapping methods.

[CV-51] SPOT!: Map-Guided LLM Agent for Unsupervised Multi-CCTV Dynamic Object Tracking

【Quick Read】: This paper addresses ID switching and trajectory loss in CCTV-based vehicle tracking caused by blind spots between cameras and limited fields of view, which undermine reliable real-time path prediction. The key of the proposed SPOT (Spatial Prediction Over Trajectories) is to represent road structure (waypoints) and CCTV placement as documents of 2D spatial coordinates, organized by chunking for real-time querying; the relative positions and FOV information of objects observed in video map vehicles into world coordinates, and, combining moving direction, speed, and driving patterns, a beam search at the intersection level infers the CCTV a vehicle is most likely to enter next, enabling trajectory prediction and continuous tracking through blind spots.

Link: https://arxiv.org/abs/2512.20975
Authors: Yujin Noh,Inho Jake Park,Chigon Hwang
Institutions: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 33 pages, 27 figures

Abstract:CCTV-based vehicle tracking systems face structural limitations in continuously connecting the trajectories of the same vehicle across multiple camera environments. In particular, blind spots occur due to the intervals between CCTVs and limited Fields of View (FOV), which leads to object ID switching and trajectory loss, thereby reducing the reliability of real-time path prediction. This paper proposes SPOT (Spatial Prediction Over Trajectories), a map-guided LLM agent capable of tracking vehicles even in blind spots of multi-CCTV environments without prior training. The proposed method represents road structures (Waypoints) and CCTV placement information as documents based on 2D spatial coordinates and organizes them through chunking techniques to enable real-time querying and inference. Furthermore, it transforms the vehicle’s position into the actual world coordinate system using the relative position and FOV information of objects observed in CCTV images. By combining map spatial information with the vehicle’s moving direction, speed, and driving patterns, a beam search is performed at the intersection level to derive candidate CCTV locations where the vehicle is most likely to enter after the blind spot. Experimental results based on the CARLA simulator in a virtual city environment confirmed that the proposed method accurately predicts the next appearing CCTV even in blind spot sections, maintaining continuous vehicle trajectories more effectively than existing techniques.
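
The intersection-level search can be sketched as ordinary beam search over a scored road graph. Everything below (the toy graph, scores, width, depth) is illustrative and not SPOT's actual map encoding:

```python
import heapq

# Toy road graph: node -> list of (next_node, transition_score). In SPOT the
# scores would come from heading/speed/driving-pattern agreement; here hand-set.
graph = {
    "cctv_A": [("x1", 0.9), ("x2", 0.4)],
    "x1": [("cctv_B", 0.8), ("x3", 0.3)],
    "x2": [("cctv_C", 0.7)],
    "x3": [("cctv_C", 0.6)],
    "cctv_B": [], "cctv_C": [],
}

def beam_search(start, width=2, depth=3):
    """Keep the `width` highest-scoring partial paths at every expansion step."""
    beams = [(1.0, [start])]
    for _ in range(depth):
        candidates = []
        for score, path in beams:
            succ = graph.get(path[-1], [])
            if not succ:                      # terminal: keep the finished path
                candidates.append((score, path))
            for nxt, p in succ:
                candidates.append((score * p, path + [nxt]))
        beams = heapq.nlargest(width, candidates, key=lambda b: b[0])
    return beams

print(beam_search("cctv_A"))  # most plausible next-CCTV hypotheses
```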

[CV-52] Generalization of Diffusion Models Arises with a Balanced Representation Space

【Quick Read】: This paper addresses memorization in diffusion models, where overfitting to the training objective causes models to reproduce training data and undermines generalization. The key is a representation-learning view of the difference between memorization and generalization: analyzing a two-layer ReLU denoising autoencoder (DAE), the authors prove that memorization corresponds to storing raw training samples in the encoding-decoding weights, yielding localized "spiky" representations, whereas generalization arises when the model captures local data statistics, producing "balanced" representations. Building on this insight, they propose a representation-based method for detecting memorization and a training-free editing technique for precise control via representation steering, underscoring the central role of good representations in generative modeling.

Link: https://arxiv.org/abs/2512.20963
Authors: Zekai Zhang,Xiao Li,Xiang Li,Lianghe Shi,Meng Wu,Molei Tao,Qing Qu
Institutions: unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 40 pages, 19 figures. The first two authors contributed equally

Abstract:Diffusion models excel at generating high-quality, diverse samples, yet they risk memorizing training data when overfit to the training objective. We analyze the distinctions between memorization and generalization in diffusion models through the lens of representation learning. By investigating a two-layer ReLU denoising autoencoder (DAE), we prove that (i) memorization corresponds to the model storing raw training samples in the learned weights for encoding and decoding, yielding localized “spiky” representations, whereas (ii) generalization arises when the model captures local data statistics, producing “balanced” representations. Furthermore, we validate these theoretical findings on real-world unconditional and text-to-image diffusion models, demonstrating that the same representation structures emerge in deep generative models with significant practical implications. Building on these insights, we propose a representation-based method for detecting memorization and a training-free editing technique that allows precise control via representation steering. Together, our results highlight that learning good representations is central to novel and meaningful generative modeling.
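
The two-layer ReLU DAE in the analysis is small enough to train directly. The sketch below fits one and probes how concentrated each sample's hidden representation is; this "spikiness" probe is an illustrative stand-in, not the paper's formal metric:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(64, 32)                       # small training set (64 samples)

dae = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 32))
opt = torch.optim.Adam(dae.parameters(), lr=1e-3)

for step in range(2000):                      # denoising objective
    noisy = X + 0.3 * torch.randn_like(X)
    loss = ((dae(noisy) - X) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Crude probe: for each input, the fraction of hidden mass carried by its
# single largest-activation unit. Values near 1 suggest the localized,
# sample-specific representations the paper links to memorization.
with torch.no_grad():
    h = torch.relu(X @ dae[0].weight.T + dae[0].bias)
    spikiness = (h.max(dim=1).values / (h.sum(dim=1) + 1e-8)).mean()
print(float(spikiness))
```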

[CV-53] Beyond Artifacts: Real-Centric Envelope Modeling for Reliable AI-Generated Image Detection

【Quick Read】: This paper addresses the poor real-world generalization of current AI-generated image detectors: they overfit generator-specific artifacts and are highly sensitive to the degradations caused by multi-round cross-platform sharing and post-processing (chain degradations), which render the original artifact cues obsolete. The key is Real-centric Envelope Modeling (REM), a new paradigm that shifts detection to modeling the robust distribution of real images: feature-level perturbations introduced in a self-reconstruction process generate near-real samples, and an envelope estimator with cross-domain consistency learns a boundary enclosing the real-image manifold, enabling robust detection of synthetic images.

Link: https://arxiv.org/abs/2512.20937
Authors: Ruiqi Liu,Yi Han,Zhengbo Zhang,Liwei Yao,Zhiyuan Yan,Jialiang Shen,ZhiJin Chen,Boyi Sun,Lubin Weng,Jing Dong,Yan Wang,Shu Wu
Institutions: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Advanced Interdisciplinary Sciences, UCAS (中国科学院大学交叉学科研究院); Southwest University (西南大学); Shanghai Second Polytechnic University (上海第二工业大学); Peking University (北京大学); The University of Sydney (悉尼大学); Tsinghua University (清华大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The rapid progress of generative models has intensified the need for reliable and robust detection under real-world conditions. However, existing detectors often overfit to generator-specific artifacts and remain highly sensitive to real-world degradations. As generative architectures evolve and images undergo multi-round cross-platform sharing and post-processing (chain degradations), these artifact cues become obsolete and harder to detect. To address this, we propose Real-centric Envelope Modeling (REM), a new paradigm that shifts detection from learning generator artifacts to modeling the robust distribution of real images. REM introduces feature-level perturbations in self-reconstruction to generate near-real samples, and employs an envelope estimator with cross-domain consistency to learn a boundary enclosing the real image manifold. We further build RealChain, a comprehensive benchmark covering both open-source and commercial generators with simulated real-world degradation. Across eight benchmark evaluations, REM achieves an average improvement of 7.5% over state-of-the-art methods, and notably maintains exceptional generalization on the severely degraded RealChain benchmark, establishing a solid foundation for synthetic image detection under real-world conditions. The code and the RealChain benchmark will be made publicly available upon acceptance of the paper.

[CV-54] Reasoning -Driven Amodal Completion: Collaborative Agents and Perceptual Evaluation

【Quick Read】: This paper addresses the difficulty of maintaining semantic consistency and structural integrity in amodal completion, in particular the error accumulation of earlier progressive methods caused by inference instability. The key is a Collaborative Multi-Agent Reasoning Framework that explicitly decouples Semantic Planning from Visual Synthesis: dedicated reasoning agents first produce a structured, explicit semantic plan, and pixel generation then proceeds in a single pass, achieving visual and semantic coherence. A self-correcting Verification Agent improves the accuracy of semantic planning, and a Diverse Hypothesis Generator mitigates the ambiguity of invisible regions with multiple plausible interpretations. The authors also introduce MAC-Score, a human-aligned evaluation metric, and show clear gains over existing methods across multiple datasets.

Link: https://arxiv.org/abs/2512.20936
Authors: Hongxing Fan,Shuyu Zhao,Jiayang Ao,Lu Sheng
Institutions: Beihang University (北京航空航天大学); The University of Melbourne (墨尔本大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Amodal completion, the task of inferring invisible object parts, faces significant challenges in maintaining semantic consistency and structural integrity. Prior progressive approaches are inherently limited by inference instability and error accumulation. To tackle these limitations, we present a Collaborative Multi-Agent Reasoning Framework that explicitly decouples Semantic Planning from Visual Synthesis. By employing specialized agents for upfront reasoning, our method generates a structured, explicit plan before pixel generation, enabling visually and semantically coherent single-pass synthesis. We integrate this framework with two critical mechanisms: (1) a self-correcting Verification Agent that employs Chain-of-Thought reasoning to rectify visible region segmentation and identify residual occluders strictly within the Semantic Planning phase, and (2) a Diverse Hypothesis Generator that addresses the ambiguity of invisible regions by offering diverse, plausible semantic interpretations, surpassing the limited pixel-level variations of standard random seed sampling. Furthermore, addressing the limitations of traditional metrics in assessing inferred invisible content, we introduce the MAC-Score (MLLM Amodal Completion Score), a novel human-aligned evaluation metric. Validated against human judgment and ground truth, these metrics establish a robust standard for assessing structural completeness and semantic consistency with visible context. Extensive experiments demonstrate that our framework significantly outperforms state-of-the-art methods across multiple datasets. Our project is available at: this https URL.

[CV-55] Quantile Rendering: Efficiently Embedding High-dimensional Feature on 3D Gaussian Splatting DATE

【Quick Read】: This paper addresses the challenge of efficiently rendering the high-dimensional features needed for 3D open-vocabulary segmentation (OVS). Existing methods rely on codebooks or feature compression, which lose information and degrade segmentation quality. The key is a novel rendering strategy, Quantile Rendering (Q-Render), which sparsely samples only the 3D Gaussians with dominant influence along each ray instead of densely sampling all intersecting Gaussians, preserving fidelity while greatly improving efficiency. Building on this, the generalizable Gaussian Splatting Network (GS-Net) predicts Gaussian features in a generalizable manner, surpassing state-of-the-art methods on ScanNet and LeRF with an approximately 43.7x speedup.

Link: https://arxiv.org/abs/2512.20927
Authors: Yoonwoo Jeong,Cheng Sun,Frank Wang,Minsu Cho,Jaesung Choe
Institutions: NVIDIA; POSTECH
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Will be updated

Abstract:Recent advancements in computer vision have successfully extended Open-vocabulary segmentation (OVS) to the 3D domain by leveraging 3D Gaussian Splatting (3D-GS). Despite this progress, efficiently rendering the high-dimensional features required for open-vocabulary queries poses a significant challenge. Existing methods employ codebooks or feature compression, causing information loss, thereby degrading segmentation quality. To address this limitation, we introduce Quantile Rendering (Q-Render), a novel rendering strategy for 3D Gaussians that efficiently handles high-dimensional features while maintaining high fidelity. Unlike conventional volume rendering, which densely samples all 3D Gaussians intersecting each ray, Q-Render sparsely samples only those with dominant influence along the ray. By integrating Q-Render into a generalizable 3D neural network, we also propose Gaussian Splatting Network (GS-Net), which predicts Gaussian features in a generalizable manner. Extensive experiments on ScanNet and LeRF demonstrate that our framework outperforms state-of-the-art methods, while enabling real-time rendering with an approximate ~43.7x speedup on 512-D feature maps. Code will be made publicly available.
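
The dominant-Gaussian idea can be illustrated on a single ray: compute standard compositing weights, then keep only the top-weighted Gaussians. A NumPy sketch; the paper's actual quantile-based selection rule may differ from this plain top-k:

```python
import numpy as np

def composite_weights(alphas):
    """Front-to-back compositing weights w_i = a_i * prod_{j<i}(1 - a_j)."""
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    return alphas * trans

rng = np.random.default_rng(0)
alphas = rng.uniform(0.0, 0.5, size=50)       # 50 Gaussians along one ray
feats = rng.normal(size=(50, 512))            # 512-D per-Gaussian features

w = composite_weights(alphas)
dense = w @ feats                             # full volume rendering

k = 5                                         # keep only dominant Gaussians
idx = np.argsort(w)[-k:]
sparse = (w[idx] / w[idx].sum()) @ feats[idx] # renormalized sparse estimate

print(np.linalg.norm(dense - sparse) / np.linalg.norm(dense))
```

Because each per-sample blend touches a 512-D feature vector, cutting the samples per ray from dozens to a handful is where the reported speedup comes from.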

[CV-56] Self-supervised Multiplex Consensus Mamba for General Image Fusion AAAI2026

【Quick Read】: This paper addresses how general image fusion can integrate multimodal information to benefit downstream vision tasks without increasing computational complexity, balancing deep fusion of cross-modal complementary information against model efficiency. The key is the Self-supervised Multiplex Consensus Mamba framework (SMC-Mamba) with two core modules: a Modality-Agnostic Feature Enhancement (MAFE) module that preserves detail through adaptive gating and enriches global representations via spatial-channel and frequency-rotational scanning, and a Multiplex Consensus Cross-modal Mamba (MCCM) module in which experts collaborate dynamically and reach a consensus to integrate multimodal information efficiently. A Bi-level Self-supervised Contrastive Learning Loss (BSCL) further preserves high-frequency information and clearly improves downstream task performance without extra computational overhead.

Link: https://arxiv.org/abs/2512.20921
Authors: Yingying Wang,Rongjin Zhuang,Hui Zheng,Xuanhua He,Ke Cao,Xiaotong Tu,Xinghao Ding
Institutions: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI 2026, 9 pages, 4 figures

Abstract:Image fusion integrates complementary information from different modalities to generate high-quality fused images, thereby enhancing downstream tasks such as object detection and semantic segmentation. Unlike task-specific techniques that primarily focus on consolidating inter-modal information, general image fusion needs to address a wide range of tasks while improving performance without increasing complexity. To achieve this, we propose SMC-Mamba, a Self-supervised Multiplex Consensus Mamba framework for general image fusion. Specifically, the Modality-Agnostic Feature Enhancement (MAFE) module preserves fine details through adaptive gating and enhances global representations via spatial-channel and frequency-rotational scanning. The Multiplex Consensus Cross-modal Mamba (MCCM) module enables dynamic collaboration among experts, reaching a consensus to efficiently integrate complementary information from multiple modalities. The cross-modal scanning within MCCM further strengthens feature interactions across modalities, facilitating seamless integration of critical information from both sources. Additionally, we introduce a Bi-level Self-supervised Contrastive Learning Loss (BSCL), which preserves high-frequency information without increasing computational overhead while simultaneously boosting performance in downstream tasks. Extensive experiments demonstrate that our approach outperforms state-of-the-art (SOTA) image fusion algorithms in tasks such as infrared-visible, medical, multi-focus, and multi-exposure fusion, as well as downstream visual tasks.

[CV-57] PanoGrounder: Bridging 2D and 3D with Panoramic Scene Representations for VLM-based 3D Visual Grounding

【Quick Read】: This paper addresses weak generalization in 3D visual grounding (3DVG), caused by the scarcity of large-scale 3D vision-language datasets and the limited reasoning of traditional supervised models, while the task requires both language understanding and 3D scene reasoning, including robustness to unseen scenes and text rephrasings. The key is the PanoGrounder framework, which couples pretrained 2D vision-language models (VLMs) with 3D scene information through a multi-modal panoramic representation: panoramic renderings enriched with 3D semantic and geometric features serve as an intermediate 2D-3D representation that VLMs can consume with minimal adaptation while retaining global object-to-object relations thanks to the 360-degree field of view. A three-stage pipeline then places a compact set of layout-aware panoramic viewpoints, grounds the text query on each rendering, and fuses the per-view predictions via lifting into a single 3D bounding box, clearly improving 3DVG performance and generalization.

Link: https://arxiv.org/abs/2512.20907
Authors: Seongmin Jung,Seongho Choi,Gunwoo Jeon,Minsu Cho,Jongwoo Lim
Institutions: Seoul National University (首尔国立大学); Pohang University of Science and Technology (POSTECH) (浦项科技大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:3D Visual Grounding (3DVG) is a critical bridge from vision-language perception to robotics, requiring both language understanding and 3D scene reasoning. Traditional supervised models leverage explicit 3D geometry but exhibit limited generalization, owing to the scarcity of 3D vision-language datasets and the limited reasoning capabilities compared to modern vision-language models (VLMs). We propose PanoGrounder, a generalizable 3DVG framework that couples multi-modal panoramic representation with pretrained 2D VLMs for strong vision-language reasoning. Panoramic renderings, augmented with 3D semantic and geometric features, serve as an intermediate representation between 2D and 3D, and offer two major benefits: (i) they can be directly fed to VLMs with minimal adaptation and (ii) they retain long-range object-to-object relations thanks to their 360-degree field of view. We devise a three-stage pipeline that places a compact set of panoramic viewpoints considering the scene layout and geometry, grounds a text query on each panoramic rendering with a VLM, and fuses per-view predictions into a single 3D bounding box via lifting. Our approach achieves state-of-the-art results on ScanRefer and Nr3D, and demonstrates superior generalization to unseen 3D datasets and text rephrasings.

[CV-58] Benchmarking and Enhancing VLM for Compressed Image Understanding

【Quick Read】: This paper addresses the performance drop of vision-language models (VLMs) on low-bitrate compressed images: current VLMs are mainly built around high-bitrate inputs, and their ability to understand heavily compressed images has not been systematically explored. The key is a universal VLM adaptor that improves performance on images compressed by a wide range of existing codecs and bitrates; experiments show that a single adaptor lifts VLM performance on such images by 10%-30%, and the analysis indicates that the adaptor mainly closes the generalization gap rather than the gap caused by information loss.

Link: https://arxiv.org/abs/2512.20901
Authors: Zifu Zhang,Tongda Xu,Siqi Li,Shengxi Li,Yue Zhang,Mai Xu,Yan Wang
Institutions: Institute for AI Industry Research, Tsinghua University (清华大学人工智能产业研究院); Beihang University (北京航空航天大学); Department of Computer Science and Technology, Tsinghua (清华大学计算机科学与技术系); Beijing University of Technology (北京工业大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:With the rapid development of Vision-Language Models (VLMs) and the growing demand for their applications, efficient compression of the image inputs has become increasingly important. Existing VLMs predominantly digest and understand high-bitrate compressed images, while their ability to interpret low-bitrate compressed images has yet to be explored. In this paper, we introduce the first comprehensive benchmark to evaluate the ability of VLMs on compressed images, spanning existing widely used image codecs and a diverse set of tasks, encompassing over one million compressed images in our benchmark. Next, we analyse the sources of the performance gap, categorising it into a) information loss during compression and b) generalisation failure of the VLM. We visualize these gaps with concrete examples and identify that for compressed images, only the generalization gap can be mitigated. Finally, we propose a universal VLM adaptor to enhance model performance on images compressed by existing codecs. Consequently, we demonstrate that a single adaptor can improve VLM performance across images with varying codecs and bitrates by 10%-30%. We believe that our benchmark and enhancement method provide valuable insights and contribute toward bridging the gap between VLMs and compressed images.
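
Producing compressed variants at controlled bitrates is the basic ingredient of such a benchmark. A small Pillow sketch using JPEG as one widely used codec; the quality settings and placeholder image are assumptions, and the paper's benchmark covers more codecs:

```python
import io
from PIL import Image

def jpeg_variants(img: Image.Image, qualities=(90, 50, 20, 5)):
    """Re-encode one image at several JPEG qualities and report bitrate (bpp)."""
    w, h = img.size
    out = []
    for q in qualities:
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=q)
        bpp = 8 * buf.tell() / (w * h)        # bits per pixel of the encoding
        out.append((q, bpp, Image.open(io.BytesIO(buf.getvalue()))))
    return out

img = Image.new("RGB", (224, 224), "gray")    # placeholder; use benchmark images
for q, bpp, decoded in jpeg_variants(img):
    print(f"quality={q:3d}  {bpp:.3f} bpp  size={decoded.size}")
```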

[CV-59] DGSAN: Dual-Graph Spatiotemporal Attention Network for Pulmonary Nodule Malignancy Prediction

【Quick Read】: This paper addresses inefficient multimodal information fusion in the early diagnosis of pulmonary nodules: existing methods mostly rely on inefficient vector concatenation and simple mutual attention, leaving temporal and multimodal correlations underexploited. The key is the Dual-Graph Spatiotemporal Attention Network (DGSAN), which builds dual inter-modal and intra-modal graph structures and combines a Global-Local Feature Encoder with a Hierarchical Cross-Modal Graph Fusion Module to model and integrate multimodal features in a fine-grained, efficient way, clearly improving classification accuracy while retaining excellent computational efficiency.

Link: https://arxiv.org/abs/2512.20898
Authors: Xiao Yu,Zhaojie Fang,Guanyu Zhou,Yin Shen,Huoling Luo,Ye Li,Ahmed Elazab,Xiang Wan,Ruiquan Ge,Changmiao Wang
Institutions: 1. Tsinghua University (清华大学); 2. Tsinghua Shenzhen International Graduate School (清华大学深圳国际研究生院); 3. South China University of Technology (华南理工大学); 4. Zhejiang University (浙江大学); 5. Zhejiang University (浙江大学); 6. King Saud University (沙特国王大学); 7. Peking University (北京大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Lung cancer continues to be the leading cause of cancer-related deaths globally. Early detection and diagnosis of pulmonary nodules are essential for improving patient survival rates. Although previous research has integrated multimodal and multi-temporal information, outperforming single modality and single time point, the fusion methods are limited to inefficient vector concatenation and simple mutual attention, highlighting the need for more effective multimodal information fusion. To address these challenges, we introduce a Dual-Graph Spatiotemporal Attention Network, which leverages temporal variations and multimodal data to enhance the accuracy of predictions. Our methodology involves developing a Global-Local Feature Encoder to better capture the local, global, and fused characteristics of pulmonary nodules. Additionally, a Dual-Graph Construction method organizes multimodal features into inter-modal and intra-modal graphs. Furthermore, a Hierarchical Cross-Modal Graph Fusion Module is introduced to refine feature integration. We also compiled a novel multimodal dataset named the NLST-cmst dataset as a comprehensive source of support for related research. Our extensive experiments, conducted on both the NLST-cmst and curated CSTL-derived datasets, demonstrate that our DGSAN significantly outperforms state-of-the-art methods in classifying pulmonary nodules with exceptional computational efficiency.

[CV-60] Beyond Weight Adaptation: Feature-Space Domain Injection for Cross-Modal Ship Re-Identification

【Quick Read】: This paper addresses the performance degradation of cross-modality ship re-identification (CMS Re-ID) under large modality discrepancies, where conventional approaches depending on explicit modality alignment require large-scale paired annotated data that is hard to obtain. The key is Domain Representation Injection (DRI), a novel parameter-efficient fine-tuning (PEFT) strategy: a lightweight, learnable Offset Encoder extracts domain-specific features rich in modality and identity attributes, which are injected into the intermediate layers of a frozen pretrained vision foundation model (VFM) and adaptively modulated by contextual information, reshaping the feature distribution for modality adaptation and identity preservation without altering the VFM's original weights, and markedly improving cross-modal matching.

Link: https://arxiv.org/abs/2512.20892
Authors: Tingfeng Xian,Wenlve Zhou,Zhiheng Zhou,Zhelin Li
Institutions: South China University of Technology (华南理工大学); School of Design, South China University of Technology (华南理工大学设计学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cross-Modality Ship Re-Identification (CMS Re-ID) is critical for achieving all-day and all-weather maritime target tracking, yet it is fundamentally challenged by significant modality discrepancies. Mainstream solutions typically rely on explicit modality alignment strategies; however, this paradigm heavily depends on constructing large-scale paired datasets for pre-training. To address this, grounded in the Platonic Representation Hypothesis, we explore the potential of Vision Foundation Models (VFMs) in bridging modality gaps. Recognizing the suboptimal performance of existing generic Parameter-Efficient Fine-Tuning (PEFT) methods that operate within the weight space, particularly on limited-capacity models, we shift the optimization perspective to the feature space and propose a novel PEFT strategy termed Domain Representation Injection (DRI). Specifically, while keeping the VFM fully frozen to maximize the preservation of general knowledge, we design a lightweight, learnable Offset Encoder to extract domain-specific representations rich in modality and identity attributes from raw inputs. Guided by the contextual information of intermediate features at different layers, a Modulator adaptively transforms these representations. Subsequently, they are injected into the intermediate layers via additive fusion, dynamically reshaping the feature distribution to adapt to the downstream task without altering the VFM’s pre-trained weights. Extensive experimental results demonstrate the superiority of our method, achieving State-of-the-Art (SOTA) performance with minimal trainable parameters. For instance, on the HOSS-ReID dataset, we attain 57.9% and 60.5% mAP using only 1.54M and 7.05M parameters, respectively. The code is available at this https URL.
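
The feature-space injection pattern, a frozen block plus a trainable modulated offset, is compact to write down. A PyTorch sketch with linear layers standing in for ViT blocks; the modulator design here is an assumption, not the paper's exact module:

```python
import torch
import torch.nn as nn

class DRIBlockWrapper(nn.Module):
    """Wrap a frozen backbone block: add a modulated, input-derived offset to
    its output without touching the pretrained weights."""
    def __init__(self, block, dim):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad_(False)          # backbone stays frozen
        self.modulator = nn.Linear(dim, dim) # trainable, layer-specific

    def forward(self, x, offset):
        return self.block(x) + self.modulator(offset)  # feature-space injection

dim = 64
offset_encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
blocks = nn.ModuleList(DRIBlockWrapper(nn.Linear(dim, dim), dim) for _ in range(4))

x = torch.randn(8, dim)                      # stands in for token features
offset = offset_encoder(x)                   # domain-specific representation
for blk in blocks:
    x = blk(x, offset)
print(x.shape)
```

Only the offset encoder and the per-layer modulators receive gradients, which is why the trainable parameter counts reported in the abstract stay in the low millions.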

[CV-61] NeRV360: Neural Representation for 360-Degree Videos with a Viewport Decoder

【Quick Read】: This paper addresses the high memory usage and slow decoding that arise when implicit neural video representations (NeRV) are applied to high-resolution 360-degree videos, which makes real-time use impractical. The key of the proposed end-to-end NeRV360 framework is to decode only the user-selected viewport instead of reconstructing the entire panoramic frame; viewport extraction is integrated into the decoding pipeline, and a spatial-temporal affine transform module enables conditional decoding on viewpoint and time, cutting memory consumption by 7x and increasing decoding speed by 2.5x over HNeRV while delivering better image quality on objective metrics.

Link: https://arxiv.org/abs/2512.20871
Authors: Daichi Arai,Kyohei Unno,Yasuko Sugito,Yuichi Kusakabe
Institutions: NHK
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Comments: 2026 IIEEJ International Conference on Image Electronics and Visual Computing (IEVC)

Abstract:Implicit neural representations for videos (NeRV) have shown strong potential for video compression. However, applying NeRV to high-resolution 360-degree videos causes high memory usage and slow decoding, making real-time applications impractical. We propose NeRV360, an end-to-end framework that decodes only the user-selected viewport instead of reconstructing the entire panoramic frame. Unlike conventional pipelines, NeRV360 integrates viewport extraction into decoding and introduces a spatial-temporal affine transform module for conditional decoding based on viewpoint and time. Experiments on 6K-resolution videos show that NeRV360 achieves a 7-fold reduction in memory consumption and a 2.5-fold increase in decoding speed compared to HNeRV, a representative prior work, while delivering better image quality in terms of objective metrics.
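
Decoding only a viewport presumes the usual equirectangular-to-pinhole sampling geometry. The NumPy sketch below shows that conventional extraction step (nearest-neighbour sampling, simple yaw/pitch rotation); NeRV360 itself folds this into the learned decoder rather than post-processing full frames:

```python
import numpy as np

def viewport_from_equirect(pano, fov_deg, yaw_deg, pitch_deg, out_hw):
    """Sample a pinhole viewport from an equirectangular panorama (nearest)."""
    H, W = pano.shape[:2]
    h, w = out_hw
    f = 0.5 * w / np.tan(np.radians(fov_deg) / 2)       # pinhole focal length
    xs = np.arange(w) - w / 2 + 0.5
    ys = np.arange(h) - h / 2 + 0.5
    X, Y = np.meshgrid(xs, ys)
    d = np.stack([X, -Y, np.full_like(X, f)], -1)       # per-pixel ray dirs
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    Ry = np.array([[np.cos(yaw), 0, np.sin(yaw)], [0, 1, 0],
                   [-np.sin(yaw), 0, np.cos(yaw)]])
    Rx = np.array([[1, 0, 0], [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch), np.cos(pitch)]])
    d = d @ (Ry @ Rx).T                                  # rotate to view direction
    lon = np.arctan2(d[..., 0], d[..., 2])               # [-pi, pi]
    lat = np.arcsin(np.clip(d[..., 1], -1, 1))           # [-pi/2, pi/2]
    u = ((lon / np.pi + 1) / 2 * (W - 1)).astype(int)
    v = ((1 - (lat / (np.pi / 2) + 1) / 2) * (H - 1)).astype(int)
    return pano[v, u]

pano = np.random.rand(512, 1024, 3)
print(viewport_from_equirect(pano, 90, 30, 10, (240, 320)).shape)
```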

[CV-62] Lightweight framework for underground pipeline recognition and spatial localization based on multi-view 2D GPR images

【Quick Read】: For underground pipeline detection with 3D GPR, this paper addresses weak correlation between multi-view features, low recognition accuracy for small-scale targets, and insufficient robustness in complex scenarios. The key contributions of the proposed 3D pipeline detection framework are: a B/C/D-Scan three-view joint analysis strategy whose feature reliability is cross-validated between FDTD forward simulation and field measurements; the DCO-YOLO framework, which integrates DySample, CGLU, and OutlookAttention cross-dimensional correlation mechanisms to strengthen small-scale pipeline edge feature extraction; and a 3D-DIoU spatial feature matching algorithm that combines 3D geometric constraints with a center-distance penalty to associate multi-view annotations automatically, resolving the inherent ambiguity of single-view detection. On real urban data the method reaches 96.2% accuracy, 93.3% recall, and 96.7% mean average precision in complex multi-pipeline scenarios, clearly surpassing the baseline.

Link: https://arxiv.org/abs/2512.20866
Authors: Haotian Lv,Chao Li,Jiangbo Dai,Yuhui Zhang,Zepeng Fan,Yiqiu Tan,Dawei Wang,Binglei Xie
Institutions: unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:To address the issues of weak correlation between multi-view features, low recognition accuracy of small-scale targets, and insufficient robustness in complex scenarios in underground pipeline detection using 3D GPR, this paper proposes a 3D pipeline intelligent detection framework. First, based on a B/C/D-Scan three-view joint analysis strategy, a three-dimensional pipeline three-view feature evaluation method is established by cross-validating forward simulation results obtained using FDTD methods with actual measurement data. Second, the DCO-YOLO framework is proposed, which integrates DySample, CGLU, and OutlookAttention cross-dimensional correlation mechanisms into the original YOLOv11 algorithm, significantly improving the small-scale pipeline edge feature extraction capability. Furthermore, a 3D-DIoU spatial feature matching algorithm is proposed, which integrates three-dimensional geometric constraints and center distance penalty terms to achieve automated association of multi-view annotations. The three-view fusion strategy resolves inherent ambiguities in single-view detection. Experiments based on real urban underground pipeline data show that the proposed method achieves accuracy, recall, and mean average precision of 96.2%, 93.3%, and 96.7%, respectively, in complex multi-pipeline scenarios, which are 2.0%, 2.1%, and 0.9% higher than the baseline model. Ablation experiments validated the synergistic optimization effect of the dynamic feature enhancement module and Grad-CAM++ heatmap visualization demonstrated that the improved model significantly enhanced its ability to focus on pipeline geometric features. This study integrates deep learning optimization strategies with the physical characteristics of 3D GPR, offering an efficient and reliable novel technical framework for the intelligent recognition and localization of underground pipelines.
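
The 3D-DIoU matching score extends DIoU to boxes in space: IoU minus a center-distance penalty normalized by the diagonal of the smallest enclosing box. A sketch for axis-aligned boxes; the paper's box parameterization and exact constraint terms may differ:

```python
import numpy as np

def diou_3d(a, b):
    """3D DIoU for axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    lo = np.maximum(a[:3], b[:3])                 # intersection bounds
    hi = np.minimum(a[3:], b[3:])
    inter = np.prod(np.maximum(hi - lo, 0.0))
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    iou = inter / (vol_a + vol_b - inter + 1e-12)
    ca, cb = (a[:3] + a[3:]) / 2, (b[:3] + b[3:]) / 2
    enc_lo = np.minimum(a[:3], b[:3])             # smallest enclosing box
    enc_hi = np.maximum(a[3:], b[3:])
    diag2 = np.sum((enc_hi - enc_lo) ** 2) + 1e-12
    return iou - np.sum((ca - cb) ** 2) / diag2   # center-distance penalty

print(diou_3d([0, 0, 0, 2, 2, 2], [1, 1, 0.5, 3, 3, 2.5]))
```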

[CV-63] ALIVE: An Avatar-Lecture Interactive Video Engine with Content-Aware Retrieval for Real-Time Interaction

【Quick Read】: This paper addresses the lack of real-time interaction in traditional lecture videos: when confusion arises, learners cannot get immediate clarification and must search externally, hurting learning efficiency. The key is ALIVE (Avatar-Lecture Interactive Video Engine), a fully local interactive learning system whose core components are: (1) avatar-delivered lectures generated through ASR transcription, LLM refinement, and neural talking-head synthesis; (2) a content-aware retrieval mechanism combining semantic similarity with timestamp alignment to pinpoint relevant lecture segments; and (3) real-time multimodal interaction that lets students ask questions by text or voice and receive grounded explanations as text or avatar-delivered responses. Lightweight embedding models, FAISS-based retrieval, and segmented preloading keep responses fast, and a demonstration on a medical imaging course shows accurate, engaging, and privacy-friendly real-time support.

Link: https://arxiv.org/abs/2512.20858
Authors: Md Zabirul Islam,Md Motaleb Hossen Manik,Ge Wang
Institutions: Rensselaer Polytechnic Institute (伦斯勒理工学院)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Traditional lecture videos offer flexibility but lack mechanisms for real-time clarification, forcing learners to search externally when confusion arises. Recent advances in large language models and neural avatars provide new opportunities for interactive learning, yet existing systems typically lack lecture awareness, rely on cloud-based services, or fail to integrate retrieval and avatar-delivered explanations in a unified, privacy-preserving pipeline. We present ALIVE, an Avatar-Lecture Interactive Video Engine that transforms passive lecture viewing into a dynamic, real-time learning experience. ALIVE operates fully on local hardware and integrates (1) Avatar-delivered lecture generated through ASR transcription, LLM refinement, and neural talking-head synthesis; (2) A content-aware retrieval mechanism that combines semantic similarity with timestamp alignment to surface contextually relevant lecture segments; and (3) Real-time multimodal interaction, enabling students to pause the lecture, ask questions through text or voice, and receive grounded explanations either as text or as avatar-delivered responses. To maintain responsiveness, ALIVE employs lightweight embedding models, FAISS-based retrieval, and segmented avatar synthesis with progressive preloading. We demonstrate the system on a complete medical imaging course, evaluate its retrieval accuracy, latency characteristics, and user experience, and show that ALIVE provides accurate, content-aware, and engaging real-time support. ALIVE illustrates how multimodal AI, when combined with content-aware retrieval and local deployment, can significantly enhance the pedagogical value of recorded lectures, offering an extensible pathway toward next-generation interactive learning environments.
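
Content-aware retrieval of the kind described, semantic similarity blended with timestamp alignment, can be sketched with FAISS. The embedder below is a deterministic random stand-in (a real system would plug in a sentence-embedding model behind the same IndexFlatIP), and the recency blending rule is an assumption:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 384                                        # embedding dimension (assumed)
segments = [                                   # (start_sec, text) lecture chunks
    (0.0, "Intro to CT reconstruction"),
    (95.0, "Filtered back-projection"),
    (210.0, "Iterative reconstruction"),
]

def embed(texts):
    """Stand-in embedder: deterministic random vector per text."""
    vecs = np.stack([np.random.default_rng(abs(hash(t)) % 2**32).normal(size=d)
                     for t in texts]).astype("float32")
    faiss.normalize_L2(vecs)                   # cosine similarity via inner product
    return vecs

index = faiss.IndexFlatIP(d)
index.add(embed([t for _, t in segments]))

def retrieve(query, t_now, k=2, tau=300.0):
    sims, ids = index.search(embed([query]), k)
    scored = []
    for s, i in zip(sims[0], ids[0]):          # blend similarity with recency
        dt = abs(t_now - segments[i][0])
        scored.append((s + np.exp(-dt / tau), segments[i]))
    return max(scored, key=lambda x: x[0])[1]

print(retrieve("how does back projection work?", t_now=120.0))
```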

[CV-64] Input-Adaptive Visual Preprocessing for Efficient Fast Vision-Language Model Inference

【Quick Read】: This paper addresses the efficiency bottleneck of vision-language models (VLMs) in deployment, namely high inference latency and computational cost, especially the redundant computation incurred on high-resolution images: existing pipelines use static visual preprocessing that cannot adapt input complexity to image content. The key is an adaptive visual preprocessing method that reduces visual redundancy before the vision encoder through content-aware image analysis, adaptive resolution selection, and content-aware cropping; without modifying the FastVLM architecture or retraining, it cuts per-image inference time by over 50%, lowers generation time, and reduces the visual token count by more than 55% on average, offering a lightweight and effective deployment-oriented optimization.

Link: https://arxiv.org/abs/2512.20839
Authors: Putu Indah Githa Cahyani,Komang David Dananjaya Suartana,Novanto Yudistira
Institutions: University of Brawijaya (布腊维雅大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-Language Models (VLMs) have demonstrated strong performance on multimodal reasoning tasks, but their deployment remains challenging due to high inference latency and computational cost, particularly when processing high-resolution visual inputs. While recent architectures such as FastVLM improve efficiency through optimized vision encoders, existing pipelines still rely on static visual preprocessing, leading to redundant computation for visually simple inputs. In this work, we propose an adaptive visual preprocessing method that dynamically adjusts input resolution and spatial coverage based on image content characteristics. The proposed approach combines content-aware image analysis, adaptive resolution selection, and content-aware cropping to reduce visual redundancy prior to vision encoding. Importantly, the method is integrated with FastVLM without modifying its architecture or requiring retraining. We evaluate the proposed method on a subset of the DocVQA dataset in an inference-only setting, focusing on efficiency-oriented metrics. Experimental results show that adaptive preprocessing reduces per-image inference time by over 50%, lowers mean full generation time, and achieves a consistent reduction of more than 55% in visual token count compared to the baseline pipeline. These findings demonstrate that input-aware preprocessing is an effective and lightweight strategy for improving deployment-oriented efficiency of vision-language models. To facilitate reproducibility, our implementation is provided as a fork of the FastVLM repository, incorporating the files for the proposed method, and is available at this https URL.
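As a rough sketch of content-aware resolution selection, the snippet below uses the variance of the Laplacian as a cheap detail proxy and maps it to one of several target resolutions. The specific statistic, thresholds, and resolution levels are hypothetical; the paper combines several content-analysis signals and also performs content-aware cropping.

```python
import cv2
import numpy as np

def select_resolution(image_bgr,
                      levels=((448, 448), (768, 768), (1024, 1024)),
                      thresholds=(50.0, 200.0)):
    """Pick an input resolution from image complexity (illustrative).

    Low-detail images get a small resolution, detailed ones a larger one,
    so visually simple inputs produce fewer visual tokens downstream.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    detail = cv2.Laplacian(gray, cv2.CV_64F).var()  # cheap sharpness/detail proxy
    if detail < thresholds[0]:
        target = levels[0]
    elif detail < thresholds[1]:
        target = levels[1]
    else:
        target = levels[2]
    return cv2.resize(image_bgr, target, interpolation=cv2.INTER_AREA)
```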

[CV-65] CHAMMI-75: pre-training multi-channel models with heterogeneous microscopy images

【Quick Read】: This paper addresses the poor generalization of current cell-morphology quantification models, which are typically trained on a single microscopy imaging type and degrade when technical specifications (such as the number of channels) or experimental conditions fall outside the training distribution. The key of the solution is CHAMMI-75, an open dataset of heterogeneous multi-channel microscopy images curated from 75 diverse biological studies; its high diversity across imaging modalities enables training channel-adaptive models whose performance improves on multi-channel bioimaging tasks.

Link: https://arxiv.org/abs/2512.20833
Authors: Vidit Agrawal(1,2),John Peters(1,2),Tyler N. Thompson(1,2),Mohammad Vali Sanian(3,4),Chau Pham(5),Nikita Moshkov(6),Arshad Kazi(1,2),Aditya Pillai(1,2),Jack Freeman(1),Byunguk Kang(7,8),Samouil L. Farhi(8),Ernest Fraenkel(7),Ron Stewart(1),Lassi Paavolainen(3,4),Bryan A. Plummer(5),Juan C. Caicedo(1,2) ((1) Morgridge Institute for Research, Madison, WI, USA, (2) University of Wisconsin-Madison, Madison, WI, USA, (3) Institute for Molecular Medicine Finland (FIMM), Helsinki, Finland, (4) University of Helsinki, Helsinki, Finland, (5) Boston University, Boston, MA, USA, (6) Institute of Computational Biology, Helmholtz Munich, Neuherberg, Germany, (7) Massachusetts Institute of Technology, Cambridge, MA, USA, (8) Broad Institute of MIT and Harvard, Cambridge, MA, USA)
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 47 Pages, 23 Figures, 26 Tables

Click to view abstract

Abstract:Quantifying cell morphology using images and machine learning has proven to be a powerful tool to study the response of cells to treatments. However, models used to quantify cellular morphology are typically trained with a single microscopy imaging type. This results in specialized models that cannot be reused across biological studies because the technical specifications do not match (e.g., different number of channels), or because the target experimental conditions are out of distribution. Here, we present CHAMMI-75, an open access dataset of heterogeneous, multi-channel microscopy images from 75 diverse biological studies. We curated this resource from publicly available sources to investigate cellular morphology models that are channel-adaptive and can process any microscopy image type. Our experiments show that training with CHAMMI-75 can improve performance in multi-channel bioimaging tasks primarily because of its high diversity in microscopy modalities. This work paves the way to create the next generation of cellular morphology models for biological studies.
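CHAMMI-75 is a dataset, so it does not prescribe an architecture, but the channel-adaptive models it targets must accept any number of input channels. One common design, sketched below under our own assumptions, encodes each channel with a shared single-channel stem and pools over the channel axis.

```python
import torch
import torch.nn as nn

class ChannelAdaptiveStem(nn.Module):
    """Encode each microscopy channel with a shared 1-channel stem, then
    pool over channels so any number of channels C maps to a fixed-size
    feature map (one generic channel-adaptive design, not the paper's).
    """
    def __init__(self, out_dim=64):
        super().__init__()
        self.stem = nn.Conv2d(1, out_dim, kernel_size=7, stride=2, padding=3)

    def forward(self, x):                 # x: (batch, C, H, W), variable C
        b, c, h, w = x.shape
        per_channel = self.stem(x.reshape(b * c, 1, h, w))  # (b*c, D, h', w')
        per_channel = per_channel.reshape(b, c, *per_channel.shape[1:])
        return per_channel.mean(dim=1)    # pool channels -> (b, D, h', w')

# Works unchanged for 3-, 5-, or 8-channel images:
feats = ChannelAdaptiveStem()(torch.randn(2, 5, 224, 224))
```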

[CV-66] Learning to Sense for Driving: Joint Optics-Sensor-Model Co-Design for Semantic Segmentation

【Quick Read】: This paper addresses the information loss caused by decoupling optical design from downstream perception in traditional autonomous-driving pipelines: fixed optics and handcrafted image signal processing (ISP) prioritize human-viewable image quality over machine semantics, discarding key information during demosaicing, denoising, or quantization and forcing models to adapt to sensor artifacts. The key of the solution is a task-driven end-to-end co-design framework that unifies the optics, sensor model, and a lightweight semantic segmentation network into a single RAW-to-task pipeline. The framework integrates realistic cellphone-scale lens models, a learnable color filter array (CFA), Poisson-Gaussian noise processes, and quantization, all optimized directly for segmentation objectives such as mIoU. It markedly improves robustness on thin structures and low-light-sensitive classes while keeping the model compact (~1M parameters) and real-time (~28 FPS), demonstrating edge deployability and providing both theoretical grounding and practical tools for full-stack co-optimization.

Link: https://arxiv.org/abs/2512.20815
Authors: Reeshad Khan and John Gauch
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Traditional autonomous driving pipelines decouple camera design from downstream perception, relying on fixed optics and handcrafted ISPs that prioritize human viewable imagery rather than machine semantics. This separation discards information during demosaicing, denoising, or quantization, while forcing models to adapt to sensor artifacts. We present a task-driven co-design framework that unifies optics, sensor modeling, and lightweight semantic segmentation networks into a single end-to-end RAW-to-task pipeline. Building on DeepLens[19], our system integrates realistic cellphone-scale lens models, learnable color filter arrays, Poisson-Gaussian noise processes, and quantization, all optimized directly for segmentation objectives. Evaluations on KITTI-360 show consistent mIoU improvements over fixed pipelines, with optics modeling and CFA learning providing the largest gains, especially for thin or low-light-sensitive classes. Importantly, these robustness gains are achieved with a compact ~1M-parameter model running at ~28 FPS, demonstrating edge deployability. Visual and quantitative analyses further highlight how co-designed sensors adapt acquisition to semantic structure, sharpening boundaries and maintaining accuracy under blur, noise, and low bit-depth. Together, these findings establish full-stack co-optimization of optics, sensors, and networks as a principled path toward efficient, reliable, and deployable perception in autonomous systems.

[CV-67] NULLBUS: Multimodal Mixed-Supervision for Breast Ultrasound Segmentation via Nullable Global-Local Prompts

【Quick Read】: This paper addresses the constraint that breast ultrasound (BUS) segmentation models cannot be trained at scale when reliable text or spatial prompts are unavailable. Existing promptable methods depend on datasets with complete multimodal annotations, yet most public BUS datasets lack structured reports or metadata, limiting real-world generalization. The key of the solution is the NullBUS framework, which introduces nullable prompts: learnable null embeddings paired with presence masks let the model fuse text information when a prompt is present and fall back to image-only supervision when it is absent, so a single model can learn uniformly from samples with and without prompts. This design substantially improves robustness and segmentation performance under mixed prompt availability.

Link: https://arxiv.org/abs/2512.20783
Authors: Raja Mallina, Bryar Shareef
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages, 2 figures, and 4 tables

Click to view abstract

Abstract:Breast ultrasound (BUS) segmentation provides lesion boundaries essential for computer-aided diagnosis and treatment planning. While promptable methods can improve segmentation performance and tumor delineation when text or spatial prompts are available, many public BUS datasets lack reliable metadata or reports, constraining training to small multimodal subsets and reducing robustness. We propose NullBUS, a multimodal mixed-supervision framework that learns from images with and without prompts in a single model. To handle missing text, we introduce nullable prompts, implemented as learnable null embeddings with presence masks, enabling fallback to image-only evidence when metadata are absent and the use of text when present. Evaluated on a unified pool of three public BUS datasets, NullBUS achieves a mean IoU of 0.8568 and a mean Dice of 0.9103, demonstrating state-of-the-art performance under mixed prompt availability.
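The nullable-prompt mechanism can be sketched directly from the abstract: a learnable null embedding is substituted wherever the presence mask says text is missing. Dimensions and names below are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class NullablePrompt(nn.Module):
    """Swap in a learnable null embedding when text metadata are absent."""
    def __init__(self, dim=256):
        super().__init__()
        self.null_embedding = nn.Parameter(torch.zeros(dim))

    def forward(self, text_emb, presence_mask):
        # text_emb: (batch, dim); entries where text is missing are ignored
        # presence_mask: (batch,) with 1 = prompt available, 0 = missing
        mask = presence_mask.unsqueeze(-1).float()
        return mask * text_emb + (1 - mask) * self.null_embedding

prompts = NullablePrompt()
fused = prompts(torch.randn(4, 256), torch.tensor([1, 0, 1, 0]))
```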

[CV-68] OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective

【Quick Read】: This paper addresses the scarcity of research on semantic scene completion (SSC) in aerial scenarios and the practical limitations of LiDAR sensors on uncrewed aerial vehicles (UAVs). Existing SSC work focuses on terrestrial settings such as autonomous driving, while sparse LiDAR point clouds from elevated viewpoints are ill-suited to dense 3D semantic modeling, and regulations, mass, and energy constraints prevent broad LiDAR deployment on UAVs. The key of the solution is OccuFly, the first real-world camera-based aerial SSC benchmark, together with a LiDAR-free automated data-generation framework: traditional 3D reconstruction lifts a subset of annotated 2D masks into the reconstructed point cloud, automating label transfer and substantially reducing manual 3D annotation effort, thereby giving UAVs a practical camera-centric path to semantic scene understanding.

Link: https://arxiv.org/abs/2512.20770
Authors: Markus Gross, Sai B. Matha, Aya Fahmy, Rui Song, Daniel Cremers, Henri Meess
Institutions: Fraunhofer IVI; TU Munich; MCML; UCLA
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Semantic Scene Completion (SSC) is crucial for 3D perception in mobile robotics, as it enables holistic scene understanding by jointly estimating dense volumetric occupancy and per-voxel semantics. Although SSC has been widely studied in terrestrial domains such as autonomous driving, aerial scenarios like autonomous flying remain largely unexplored, thereby limiting progress on downstream applications. Furthermore, LiDAR sensors represent the primary modality for SSC data generation, which poses challenges for most uncrewed aerial vehicles (UAVs) due to flight regulations, mass and energy constraints, and the sparsity of LiDAR-based point clouds from elevated viewpoints. To address these limitations, we introduce OccuFly, the first real-world, camera-based aerial SSC benchmark, captured at altitudes of 50m, 40m, and 30m during spring, summer, fall, and winter. OccuFly covers urban, industrial, and rural scenarios, provides 22 semantic classes, and the data format adheres to established conventions to facilitate seamless integration with existing research. Crucially, we propose a LiDAR-free data generation framework based on camera modality, which is ubiquitous on modern UAVs. By utilizing traditional 3D reconstruction, our framework automates label transfer by lifting a subset of annotated 2D masks into the reconstructed point cloud, thereby substantially minimizing manual 3D annotation effort. Finally, we benchmark the state-of-the-art on OccuFly and highlight challenges specific to elevated viewpoints, yielding a comprehensive vision benchmark for holistic aerial 3D scene understanding.
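A minimal sketch of the label-transfer step, lifting one annotated 2D mask onto a reconstructed point cloud by projecting points into the camera. The intrinsics/extrinsics conventions and names are assumptions; OccuFly's actual framework adds visibility and multi-view consistency handling.

```python
import numpy as np

def lift_mask_to_points(points, mask, K, T_cam_world):
    """Assign 2D mask labels to 3D points via camera projection (a sketch).

    points      : (N, 3) world coordinates from 3D reconstruction
    mask        : (H, W) integer semantic mask for one annotated frame
    K           : (3, 3) camera intrinsics
    T_cam_world : (4, 4) world-to-camera extrinsics
    """
    n = points.shape[0]
    pts_h = np.hstack([points, np.ones((n, 1))])
    cam = (T_cam_world @ pts_h.T).T[:, :3]        # points in camera frame
    in_front = cam[:, 2] > 0
    uvw = (K @ cam.T).T
    uv = uvw[:, :2] / np.maximum(uvw[:, 2:3], 1e-9)  # perspective division
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    h, w = mask.shape
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    labels = np.full(n, -1, dtype=int)            # -1 = unlabeled
    labels[valid] = mask[v[valid], u[valid]]
    return labels
```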

[CV-69] TrashDet: Iterative Neural Architecture Search for Efficient Waste Detection WACV2026

【Quick Read】: This paper addresses efficient waste detection on resource-constrained edge and IoT devices, specifically object detection on the TACO dataset under TinyML constraints. The core of the solution is an iterative hardware-aware neural architecture search framework: it builds an Once-for-All-style ResDets supernet and runs an evolutionary search that alternates between backbone and neck/head optimization, supported by a population passthrough mechanism and an accuracy predictor to cut search cost and improve stability. The search yields a family of deployment-ready detectors (TrashDets) spanning 1.2M to 30.5M parameters with mAP50 between 11.4 and 19.5, outperforming existing TinyML detectors; measurements on the MAX78002 microcontroller show up to 88% lower energy, 78% lower latency, and 53% lower average power.

Link: https://arxiv.org/abs/2512.20746
Authors: Tony Tran, Bin Hu
Institutions: University of Houston
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages. The paper has been accepted by the WACV 2026 workshop

Click to view abstract

Abstract:This paper addresses trash detection on the TACO dataset under strict TinyML constraints using an iterative hardware-aware neural architecture search framework targeting edge and IoT devices. The proposed method constructs a Once-for-All-style ResDets supernet and performs iterative evolutionary search that alternates between backbone and neck/head optimization, supported by a population passthrough mechanism and an accuracy predictor to reduce search cost and improve stability. This framework yields a family of deployment-ready detectors, termed TrashDets. On a five-class TACO subset (paper, plastic, bottle, can, cigarette), the strongest variant, TrashDet-l, achieves 19.5 mAP50 with 30.5M parameters, improving accuracy by up to 3.6 mAP50 over prior detectors while using substantially fewer parameters. The TrashDet family spans 1.2M to 30.5M parameters with mAP50 values between 11.4 and 19.5, providing scalable detector options for diverse TinyML deployment budgets on resource-constrained hardware. On the MAX78002 microcontroller with the TrashNet dataset, two specialized variants, TrashDet-ResNet and TrashDet-MBNet, jointly dominate the ai87-fpndetector baseline, with TrashDet-ResNet achieving 7525 μJ energy per inference at 26.7 ms latency and 37.45 FPS, and TrashDet-MBNet improving mAP50 by 10.2%; together they reduce energy consumption by up to 88%, latency by up to 78%, and average power by up to 53% compared to existing TinyML detectors.

[CV-70] VL4Gaze: Unleashing Vision-Language Models for Gaze Following

【Quick Read】: This paper addresses the limited gaze-understanding ability of current vision-language models (VLMs): although VLMs excel at scene-level reasoning, there has been no systematic benchmark for evaluating or training them on the semantic and spatial information carried by human gaze. The key of the solution is VL4Gaze, the first large-scale benchmark of 489K automatically generated question-answer pairs over 124K images, which casts gaze understanding as a unified visual question answering (VQA) problem through four complementary tasks: gaze object description, gaze direction description, gaze point localization, and ambiguous-question recognition. Experiments show that general-purpose pre-training alone cannot reliably infer gaze semantics and spatial locations, whereas targeted multi-task fine-tuning on VL4Gaze brings substantial and consistent improvements, underscoring the importance of task-specific supervision for developing gaze understanding in VLMs.

Link: https://arxiv.org/abs/2512.20735
Authors: Shijing Wang, Chaoqun Cui, Yaping Huang, Hyung Jin Chang, Yihua Cheng
Institutions: Beijing Jiaotong University; MAIS, Institute of Automation, Chinese Academy of Sciences; University of Birmingham
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Human gaze provides essential cues for interpreting attention, intention, and social interaction in visual scenes, yet gaze understanding remains largely unexplored in current vision-language models (VLMs). While recent VLMs achieve strong scene-level reasoning across a range of visual tasks, there exists no benchmark that systematically evaluates or trains them for gaze interpretation, leaving open the question of whether gaze understanding can emerge from general-purpose vision-language pre-training. To address this gap, we introduce VL4Gaze, the first large-scale benchmark designed to investigate, evaluate, and unlock the potential of VLMs for gaze understanding. VL4Gaze contains 489K automatically generated question-answer pairs across 124K images and formulates gaze understanding as a unified VQA problem through four complementary tasks: (1) gaze object description, (2) gaze direction description, (3) gaze point location, and (4) ambiguous question recognition. We comprehensively evaluate both commercial and open-source VLMs under in-context learning and fine-tuning settings. The results show that even large-scale VLMs struggle to reliably infer gaze semantics and spatial localization without task-specific supervision. In contrast, training on VL4Gaze brings substantial and consistent improvements across all tasks, highlighting the importance of targeted multi-task supervision for developing gaze understanding capabilities in VLMs. We will release the dataset and code to support further research and development in this direction.

[CV-71] HyDRA: Hierarchical and Dynamic Rank Adaptation for Mobile Vision Language Model

【Quick Read】: This paper addresses the performance bottleneck in parameter-efficient fine-tuning of mobile vision language models (VLMs), where fixed-rank Low-Rank Adaptation (LoRA) lacks sufficient capacity. The key of the proposed HyDRA framework lies in two optimization strategies: hierarchical optimization, which coarsely assigns different ranks to different layers and finely adjusts ranks within individual layers; and dynamic adjustment, which uses a lightweight performance model for end-to-end automatic optimization, determining and adapting ranks during fine-tuning. Without increasing the number of trainable parameters, the method improves performance by 4.7% on average across model sizes and even surpasses full-parameter fine-tuning on some tasks.

Link: https://arxiv.org/abs/2512.20674
Authors: Yuanhao Xi, Xiaohuan Bing, Ramin Yahyapour
Institutions: University of Göttingen; Göttingen State and University Library
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Vision Language Models (VLMs) have undergone significant advancements, particularly with the emergence of mobile-oriented VLMs, which offer a wide range of application scenarios. However, the substantial computational requirements for training these models present a significant obstacle to their practical application. To address this issue, Low-Rank Adaptation (LoRA) has been proposed. Nevertheless, the standard LoRA with a fixed rank lacks sufficient capability for training mobile VLMs that process both text and image modalities. In this work, we introduce HyDRA, a parameter-efficient fine-tuning framework designed to implement hierarchical and dynamic rank scheduling for mobile VLMs. This framework incorporates two essential optimization strategies: (1) hierarchical optimization, which involves a coarse-grained approach that assigns different ranks to various layers, as well as a fine-grained method that adjusts ranks within individual layers, and (2) dynamic adjustment, which employs an end-to-end automatic optimization using a lightweight performance model to determine and adjust ranks during the fine-tuning process. Comprehensive experiments conducted on popular benchmarks demonstrate that HyDRA consistently outperforms the baseline, achieving a 4.7% improvement across various model sizes without increasing the number of trainable parameters. In some tasks, it even surpasses full-parameter fine-tuning.
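A minimal sketch of the coarse-grained part of hierarchical rank assignment: each layer gets a LoRA adapter with its own rank. The increasing rank schedule below is purely illustrative; HyDRA additionally adjusts ranks within layers and re-tunes them dynamically during fine-tuning via a lightweight performance model.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a low-rank update of layer-specific rank."""
    def __init__(self, base: nn.Linear, rank: int, alpha: float = 16.0):
        super().__init__()
        self.base = base.requires_grad_(False)   # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# Coarse-grained schedule: deeper layers receive larger ranks (illustrative).
layers = [nn.Linear(512, 512) for _ in range(4)]
ranks = [4, 4, 8, 16]
adapted = [LoRALinear(layer, r) for layer, r in zip(layers, ranks)]
```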

[CV-72] MaskOpt: A Large-Scale Mask Optimization Dataset to Advance AI in Integrated Circuit Manufacturing

【Quick Read】: This paper targets the challenges that diffraction effects and process variability pose to lithography at advanced integrated circuit (IC) nodes, where traditional model-based optical proximity correction (OPC) and inverse lithography technique (ILT) are computationally expensive and hard to scale. Existing deep-learning mask-optimization methods are constrained by synthetic datasets that ignore standard-cell hierarchy and the context surrounding optimization targets, limiting real-world applicability. The key of the solution is MaskOpt, a large-scale benchmark built from real IC designs at the 45 nm node, containing 104,714 metal-layer tiles and 121,952 via-layer tiles; tiles are clipped at standard-cell placements to preserve cell semantics, and multiple context window sizes capture the influence of optical proximity effects, enabling cell-aware and context-aware mask optimization that improves the accuracy and applicability of deep-learning models in practical settings.

Link: https://arxiv.org/abs/2512.20655
Authors: Yuting Hu, Lei Zhuang, Hua Xiang, Jinjun Xiong, Gi-Joon Nam
Institutions: University at Buffalo; IBM Research
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:As integrated circuit (IC) dimensions shrink below the lithographic wavelength, optical lithography faces growing challenges from diffraction and process variability. Model-based optical proximity correction (OPC) and inverse lithography technique (ILT) remain indispensable but computationally expensive, requiring repeated simulations that limit scalability. Although deep learning has been applied to mask optimization, existing datasets often rely on synthetic layouts, disregard standard-cell hierarchy, and neglect the surrounding contexts around the mask optimization targets, thereby constraining their applicability to practical mask optimization. To advance deep learning for cell- and context-aware mask optimization, we present MaskOpt, a large-scale benchmark dataset constructed from real IC designs at the 45 \mathrmnm node. MaskOpt includes 104,714 metal-layer tiles and 121,952 via-layer tiles. Each tile is clipped at a standard-cell placement to preserve cell information, exploiting repeated logic gate occurrences. Different context window sizes are supported in MaskOpt to capture the influence of neighboring shapes from optical proximity effects. We evaluate state-of-the-art deep learning models for IC mask optimization to build up benchmarks, and the evaluation results expose distinct trade-offs across baseline models. Further context size analysis and input ablation studies confirm the importance of both surrounding geometries and cell-aware inputs in achieving accurate mask generation.

[CV-73] Equivariant Multiscale Learned Invertible Reconstruction for Cone Beam CT: From Simulated to Real Data

【Quick Read】: This paper addresses the lower image quality of cone beam CT (CBCT) relative to conventional CT, together with the obstacles that scarce ground-truth data, limited memory, and the need for fast inference pose to clinical deployment of deep-learning reconstruction. The key of the solution is LIRE++, an end-to-end rotationally equivariant, multiscale, invertible learned primal-dual reconstruction framework: memory optimizations and multiscale reconstruction enable fast training and inference, while rotational equivariance improves parameter efficiency. Trained on simulated projection data and validated on real clinical data, it outperforms deep-learning baselines by an average of 1 dB in peak signal-to-noise ratio on synthetic data and, on real data, improves the mean absolute error relative to the planning CT by 10 Hounsfield units.

Link: https://arxiv.org/abs/2512.21180
Authors: Nikita Moriakov, Efstratios Gavves, Jonathan H. Mason, Carmen Seller-Oria, Jonas Teuwen, Jan-Jakob Sonke
Institutions: Netherlands Cancer Institute; University of Amsterdam; Elekta Limited
Subjects: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: 29 pages. arXiv admin note: substantial text overlap with arXiv:2401.11256

Click to view abstract

Abstract:Cone Beam CT (CBCT) is an important imaging modality nowadays, however lower image quality of CBCT compared to more conventional Computed Tomography (CT) remains a limiting factor in CBCT applications. Deep learning reconstruction methods are a promising alternative to classical analytical and iterative reconstruction methods, but applying such methods to CBCT is often difficult due to the lack of ground truth data, memory limitations and the need for fast inference at clinically-relevant resolutions. In this work we propose LIRE++, an end-to-end rotationally-equivariant multiscale learned invertible primal-dual scheme for fast and memory-efficient CBCT reconstruction. Memory optimizations and multiscale reconstruction allow for fast training and inference, while rotational equivariance improves parameter efficiency. LIRE++ was trained on simulated projection data from a fast quasi-Monte Carlo CBCT projection simulator that we developed as well. Evaluated on synthetic data, LIRE++ gave an average improvement of 1 dB in Peak Signal-to-Noise Ratio over alternative deep learning baselines. On real clinical data, LIRE++ improved the average Mean Absolute Error between the reconstruction and the corresponding planning CT by 10 Hounsfield Units with respect to current proprietary state-of-the-art hybrid deep-learning/iterative method.

[CV-74] Flow Gym

【Quick Read】: This paper addresses the lack of a unified toolchain for the research and deployment of flow-field quantification methods, particularly algorithms that learn flow fields from consecutive tracer-particle images. The key of the solution is Flow Gym, an open-source toolkit inspired by OpenAI Gym and Stable-Baselines3 that integrates SynthPix as its synthetic image generation engine and provides a standardized interface for testing, training, and deploying such algorithms; it also ships stable (re-)implementations of several existing algorithms in JAX, markedly improving the efficiency and reproducibility of flow-field quantification research.

Link: https://arxiv.org/abs/2512.20642
Authors: Francesco Banelli, Antonio Terpin, Alan Bonomi, Raffaello D'Andrea
Institutions: ETH Zürich
Subjects: Fluid Dynamics (physics.flu-dyn); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE); Computational Physics (physics.comp-ph)
Comments: Code: this https URL

Click to view abstract

Abstract:Flow Gym is a toolkit for research and deployment of flow-field quantification methods inspired by OpenAI Gym and Stable-Baselines3. It uses SynthPix as synthetic image generation engine and provides a unified interface for the testing, deployment and training of (learning-based) algorithms for flow-field quantification from a number of consecutive images of tracer particles. It also contains a growing number of integrations of existing algorithms and stable (re-)implementations in JAX.

Artificial Intelligence

[AI-0] Model Merging via Multi-Teacher Knowledge Distillation

【Quick Read】: This paper addresses the lack of theoretical guarantees in model merging, especially how to robustly fuse models fine-tuned on different tasks without access to the original training data; current methods rely on heuristics to set the coefficient scaling that weights each fine-tuned model's contribution, yielding brittle performance that is highly sensitive to initialization. The key of the solution is threefold: first, a flatness-aware PAC-Bayes generalization bound tailored to the model-merging setting, featuring a "cross-task heterogeneity" term that quantifies the mismatch between fine-tuned model priors and the target multi-task distributions; second, framing model merging as multi-teacher knowledge distillation on scarce unlabeled data and proving that minimizing the student-teacher KL divergence directly tightens the upper bound on the merged model's excess risk; and third, guided by this theory, the SAMerging method, which uses Sharpness-Aware Minimization (SAM) to find flat minima and thereby achieves efficient and stable merging.

Link: https://arxiv.org/abs/2512.21288
Authors: Seyed Arshan Dalili, Mehrdad Mahdavi
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Model merging has emerged as a lightweight alternative to joint multi-task learning (MTL), yet the generalization properties of merged models remain largely unexplored. Establishing such theoretical guarantees is non-trivial, as the merging process typically forbids access to the original training data and involves combining fine-tuned models trained on fundamentally heterogeneous data distributions. Without a principled understanding of these dynamics, current methods often rely on heuristics to approximate the optimal combination of parameters. This dependence is most critical in coefficient scaling, the weighting factors that modulate the magnitude of each fine-tuned model’s contribution to the shared parameter. However, without a principled objective to guide their selection, these methods lead to brittle performance and are highly sensitive to scaling initialization. We address this gap by (i) establishing a novel flatness-aware PAC-Bayes generalization bound specifically for the model merging setting. This analysis introduces a “cross-task heterogeneity” term that formally captures the mismatch between diverse fine-tuned model priors and the target multi-task distributions. Guided by this theoretical insight, (ii) we frame model merging as multi-teacher knowledge distillation on scarce, unlabeled data. We formally demonstrate that minimizing the student-teacher Kullback-Leibler divergence directly tightens the upper bound on the merged model’s excess risk. Guided by the flatness-aware bound derived, (iii) we operationalize this objective via SAMerging, a method that employs Sharpness-Aware Minimization (SAM) to find flat minima. Empirically, SAMerging establishes a new state of the art across vision and NLP benchmarks, achieving remarkable performance. The code is available at this https URL.
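One SAM-style update on a multi-teacher distillation loss, sketched under simplifying assumptions (uniform teacher averaging, distillation over raw classifier logits); SAMerging's actual objective and coefficient handling are richer than this.

```python
import torch
import torch.nn.functional as F

def samerging_step(student, teachers, x, opt, rho=0.05):
    """One sharpness-aware step on the multi-teacher KL loss over
    unlabeled inputs x (illustrative sketch, not the paper's code)."""
    def kd_loss():
        s_logp = F.log_softmax(student(x), dim=-1)
        loss = 0.0
        for t in teachers:
            with torch.no_grad():
                t_prob = F.softmax(t(x), dim=-1)
            loss = loss + F.kl_div(s_logp, t_prob, reduction="batchmean")
        return loss / len(teachers)

    # 1) Ascend to the worst-case nearby point (sharpness-aware step).
    loss = kd_loss()
    loss.backward()
    grads = [p.grad.detach().clone() for p in student.parameters()]
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12
    scale = (rho / norm).item()
    for p, g in zip(student.parameters(), grads):
        p.data.add_(g, alpha=scale)
    opt.zero_grad()
    # 2) Descend using the gradient evaluated at the perturbed point.
    kd_loss().backward()
    for p, g in zip(student.parameters(), grads):
        p.data.sub_(g, alpha=scale)   # undo the perturbation
    opt.step()
    opt.zero_grad()
    return loss.item()
```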

[AI-1] Learning Factors in AI-Augmented Education: A Comparative Study of Middle and High School Students

【Quick Read】: This paper asks whether the relationships among key learning factors (experience, clarity, comfort, and motivation) remain coherent in AI-augmented educational settings, and how these relationships vary across student age groups. Existing work concentrates on higher education and conventional instruction, leaving the interaction of perception dimensions in AI-mediated learning underexplored across developmental stages. The key of the solution is a multimethod quantitative strategy combining correlation analysis with text mining on perception data collected in authentic classrooms where students used AI tools for programming learning. The analysis reveals markedly different dimensional structures: middle school students show strong positive correlations across all dimensions, indicating holistic evaluation patterns, whereas high school students show weak or near-zero correlations, reflecting more independent, differentiated assessment. These findings indicate that learners' developmental stage moderates the interdependence of perception dimensions, providing an empirical basis for developmentally informed AI integration strategies.

Link: https://arxiv.org/abs/2512.21246
Authors: Gaia Ebli, Bianca Raimondi, Maurizio Gabbrielli
Institutions: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Preprint. Under review

Click to view abstract

Abstract:The increasing integration of AI tools in education has led prior research to explore their impact on learning processes. Nevertheless, most existing studies focus on higher education and conventional instructional contexts, leaving open questions about how key learning factors are related in AI-mediated learning environments and how these relationships may vary across different age groups. Addressing these gaps, our work investigates whether four critical learning factors, experience, clarity, comfort, and motivation, maintain coherent interrelationships in AI-augmented educational settings, and how the structure of these relationships differs between middle and high school students. The study was conducted in authentic classroom contexts where students interacted with AI tools as part of programming learning activities to collect data on the four learning factors and students’ perceptions. Using a multimethod quantitative analysis, which combined correlation analysis and text mining, we revealed markedly different dimensional structures between the two age groups. Middle school students exhibit strong positive correlations across all dimensions, indicating holistic evaluation patterns whereby positive perceptions in one dimension generalise to others. In contrast, high school students show weak or near-zero correlations between key dimensions, suggesting a more differentiated evaluation process in which dimensions are assessed independently. These findings reveal that perception dimensions actively mediate AI-augmented learning and that the developmental stage moderates their interdependencies. This work establishes a foundation for the development of AI integration strategies that respond to learners’ developmental levels and account for age-specific dimensional structures in student-AI interactions.

[AI-2] LookPlanGraph: Embodied Instruction Following Method with VLM Graph Augmentation

【Quick Read】: This paper addresses the failure of instruction-following planning built on prebuilt static scene graphs when the environment changes: existing methods assume all task-relevant information is available at the start of planning, but object positions or states may change between graph construction and task execution, hurting success rates. The key of the LookPlanGraph solution is an updatable scene graph composed of static assets and object priors: during plan execution, a Vision Language Model (VLM) continuously processes the agent's egocentric camera view, verifying existing priors or discovering new entities and updating the graph in real time, thereby adapting robustly to environmental change.

Link: https://arxiv.org/abs/2512.21243
Authors: Anatoly O. Onishchenko, Alexey K. Kovalev, Aleksandr I. Panov
Institutions: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Methods that use Large Language Models (LLM) as planners for embodied instruction following tasks have become widespread. To successfully complete tasks, the LLM must be grounded in the environment in which the robot operates. One solution is to use a scene graph that contains all the necessary information. Modern methods rely on prebuilt scene graphs and assume that all task-relevant information is available at the start of planning. However, these approaches do not account for changes in the environment that may occur between the graph construction and the task execution. We propose LookPlanGraph - a method that leverages a scene graph composed of static assets and object priors. During plan execution, LookPlanGraph continuously updates the graph with relevant objects, either by verifying existing priors or discovering new entities. This is achieved by processing the agent's egocentric camera view using a Vision Language Model. We conducted experiments with changed object positions in VirtualHome and OmniGibson simulated environments, demonstrating that LookPlanGraph outperforms methods based on predefined static scene graphs. To demonstrate the practical applicability of our approach, we also conducted experiments in a real-world setting. Additionally, we introduce the GraSIF (Graph Scenes for Instruction Following) dataset with an automated validation framework, comprising 514 tasks drawn from SayPlan Office, BEHAVIOR-1K, and VirtualHome RobotHow. Project page available at this https URL.

[AI-3] Casting a SPELL: Sentence Pairing Exploration for LLM Limitation-breaking

【Quick Read】: This paper addresses the risk that large language models (LLMs) with insufficient safety alignment can be exploited in code-generation settings, where malicious actors use AI-assisted coding tools to produce harmful code; existing jailbreaking research rarely targets this specific attack objective. The key of the solution is SPELL, a testing framework built to evaluate the weakness of LLM safety alignment against malicious code generation. Its core is a time-division selection strategy that constructs jailbreaking prompts by intelligently combining sentences from a prior-knowledge corpus, balancing exploration of novel attack patterns with exploitation of known effective techniques, thereby systematically exposing safety gaps in real development environments such as Cursor.

Link: https://arxiv.org/abs/2512.21236
Authors: Yifan Huang, Xiaojun Jia, Wenbo Guo, Yuqiang Sun, Yihao Huang, Chong Wang, Yang Liu
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: Accepted to FSE 2026

Click to view abstract

Abstract:Large language models (LLMs) have revolutionized software development through AI-assisted coding tools, enabling developers with limited programming expertise to create sophisticated applications. However, this accessibility extends to malicious actors who may exploit these powerful tools to generate harmful software. Existing jailbreaking research primarily focuses on general attack scenarios against LLMs, with limited exploration of malicious code generation as a jailbreak target. To address this gap, we propose SPELL, a comprehensive testing framework specifically designed to evaluate the weakness of security alignment in malicious code generation. Our framework employs a time-division selection strategy that systematically constructs jailbreaking prompts by intelligently combining sentences from a prior knowledge dataset, balancing exploration of novel attack patterns with exploitation of successful techniques. Extensive evaluation across three advanced code models (GPT-4.1, Claude-3.5, and Qwen2.5-Coder) demonstrates SPELL’s effectiveness, achieving attack success rates of 83.75%, 19.38%, and 68.12% respectively across eight malicious code categories. The generated prompts successfully produce malicious code in real-world AI development tools such as Cursor, with outputs confirmed as malicious by state-of-the-art detection systems at rates exceeding 73%. These findings reveal significant security gaps in current LLM implementations and provide valuable insights for improving AI safety alignment in code generation applications.

[AI-4] BALLAST: Bandit-Assisted Learning for Latency-Aware Stable Timeouts in Raft

【Quick Read】: This paper addresses the problem that Raft's static randomized election timeouts break down under long-tail latency, jitter, and partition recovery, where repeated split votes substantially inflate unavailability. The key of the BALLAST solution is a lightweight online adaptation mechanism that replaces the static timeout heuristic with contextual bandits: an efficient linear contextual bandit (a LinUCB variant) selects among a discrete set of timeout "arms", and safe exploration caps risk during unstable periods. In challenging WAN regimes this markedly shortens recovery time and unwritable time compared to standard randomized timeouts and common heuristics, while remaining competitive in stable LAN and WAN settings.

Link: https://arxiv.org/abs/2512.21165
Authors: Qizhi Wang
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 22 tables, 11 figures

Click to view abstract

Abstract:Randomized election timeouts are a simple and effective liveness heuristic for Raft, but they become brittle under long-tail latency, jitter, and partition recovery, where repeated split votes can inflate unavailability. This paper presents BALLAST, a lightweight online adaptation mechanism that replaces static timeout heuristics with contextual bandits. BALLAST selects from a discrete set of timeout “arms” using efficient linear contextual bandits (LinUCB variants), and augments learning with safe exploration to cap risk during unstable periods. We evaluate BALLAST on a reproducible discrete-event simulation with long-tail delay, loss, correlated bursts, node heterogeneity, and partition/recovery turbulence. Across challenging WAN regimes, BALLAST substantially reduces recovery time and unwritable time compared to standard randomized timeouts and common heuristics, while remaining competitive on stable LAN/WAN settings.
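The arm-selection core is standard LinUCB, sketched below over discrete timeout arms. The context features, the reward definition, and the safe-exploration caps BALLAST adds around this loop are our own illustrative assumptions or are omitted.

```python
import numpy as np

class LinUCBTimeouts:
    """LinUCB over a discrete set of election-timeout arms (a sketch)."""
    def __init__(self, arms_ms, dim, alpha=1.0):
        self.arms = arms_ms                         # e.g. [150, 300, 600, 1200]
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in arms_ms]     # per-arm design matrix
        self.b = [np.zeros(dim) for _ in arms_ms]

    def select(self, ctx):
        """ctx: network features, e.g. [rtt_p50, rtt_p99, loss_rate, 1.0]."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                       # ridge-regression estimate
            ucb = theta @ ctx + self.alpha * np.sqrt(ctx @ A_inv @ ctx)
            scores.append(ucb)
        return int(np.argmax(scores))               # index into self.arms

    def update(self, arm, ctx, reward):
        """reward: e.g. negative unavailability observed after the election."""
        self.A[arm] += np.outer(ctx, ctx)
        self.b[arm] += reward * ctx
```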

[AI-5] MODE: Multi-Objective Adaptive Coreset Selection

【Quick Read】: This paper addresses data-efficiency optimization: how to dynamically select the most valuable subset of data during training to improve model performance and reduce resource consumption. The core challenge is that static selection strategies cannot track how data needs change across training phases. The key of the solution is Mode (Multi-Objective adaptive Data Efficiency), an adaptive multi-objective selection framework that adjusts its criteria by phase: class balance early, feature-representation diversity in the middle, and predictive uncertainty near convergence. The method attains a (1-1/e)-approximation guarantee with O(n log n) complexity, offers interpretable analysis of how data utility evolves, and reduces memory requirements while improving model accuracy.

Link: https://arxiv.org/abs/2512.21152
Authors: Tanmoy Mukherjee, Pierre Marquis, Zied Bouraoui
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We present Mode (Multi-Objective adaptive Data Efficiency), a framework that dynamically combines coreset selection strategies based on their evolving contribution to model performance. Unlike static methods, Mode adapts selection criteria to training phases: emphasizing class balance early, diversity during representation learning, and uncertainty at convergence. We show that Mode achieves (1-1/e)-approximation with O(n log n) complexity and demonstrates competitive accuracy while providing interpretable insights into data utility evolution. Experiments show Mode reduces memory requirements
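A sketch of phase-dependent blending of the three criteria; the piecewise-linear schedule below is our own illustration of "balance early, diversity mid, uncertainty late", not the paper's exact weighting.

```python
import numpy as np

def mode_scores(balance, diversity, uncertainty, progress):
    """Blend per-example criteria with phase-dependent weights (illustrative).

    balance, diversity, uncertainty : (n,) per-example scores
    progress in [0, 1]              : fraction of training completed
    """
    w = np.array([max(0.0, 1 - 2 * progress),   # class balance fades out
                  1 - abs(2 * progress - 1),     # diversity peaks mid-training
                  max(0.0, 2 * progress - 1)])   # uncertainty ramps up late
    w = w / w.sum()
    return w[0] * balance + w[1] * diversity + w[2] * uncertainty

# Selecting a coreset of size k at the midpoint of training:
# idx = np.argsort(-mode_scores(b, d, u, progress=0.5))[:k]
```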

[AI-6] AutoBaxBuilder: Bootstrapping Code Security Benchmarking

Link: https://arxiv.org/abs/2512.21132
Authors: Tobias von Arx, Niels Mündler, Mark Vero, Maximilian Baader, Martin Vechev
Institutions: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments:

Click to view abstract

[AI-7] A Real-World Evaluation of LLM Medication Safety Reviews in NHS Primary Care

【Quick Read】: This paper addresses the reliability of generative AI for medication safety reviews in real clinical settings, where the failure mechanisms on complex patients remain poorly characterized. The key of the study is the first evaluation of an LLM-based medication safety review system on large-scale electronic health record (EHR) data from NHS primary care, together with a detailed failure-behaviour analysis that identifies the dominant error mode: contextual reasoning flaws rather than missing medication knowledge, spanning five core patterns, namely overconfidence under uncertainty, mechanical application of guidelines that ignores individual context, misunderstanding of how healthcare is actually delivered, factual errors, and process blindness. These findings pinpoint what must improve before LLM-based clinical AI can be safely deployed.

Link: https://arxiv.org/abs/2512.21127
Authors: Oliver Normand, Esther Borsi, Mitch Fruin, Lauren E Walker, Jamie Heagerty, Chris C. Holmes, Anthony J Avery, Iain E Buchan, Harry Coppock
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) often match or exceed clinician-level performance on medical benchmarks, yet very few are evaluated on real clinical data or examined beyond headline metrics. We present, to our knowledge, the first evaluation of an LLM-based medication safety review system on real NHS primary care data, with detailed characterisation of key failure behaviours across varying levels of clinical complexity. In a retrospective study using a population-scale EHR spanning 2,125,549 adults in NHS Cheshire and Merseyside, we strategically sampled patients to capture a broad range of clinical complexity and medication safety risk, yielding 277 patients after data-quality exclusions. An expert clinician reviewed these patients and graded system-identified issues and proposed interventions. Our primary LLM system showed strong performance in recognising when a clinical issue is present (sensitivity 100% [95% CI 98.2–100], specificity 83.1% [95% CI 72.7–90.1]), yet correctly identified all issues and interventions in only 46.9% [95% CI 41.1–52.8] of patients. Failure analysis reveals that, in this setting, the dominant failure mechanism is contextual reasoning rather than missing medication knowledge, with five primary patterns: overconfidence in uncertainty, applying standard guidelines without adjusting for patient context, misunderstanding how healthcare is delivered in practice, factual errors, and process blindness. These patterns persisted across patient complexity and demographic strata, and across a range of state-of-the-art models and configurations. We provide 45 detailed vignettes that comprehensively cover all identified failure cases. This work highlights shortcomings that must be addressed before LLM-based clinical AI can be safely deployed. It also begs larger-scale, prospective evaluations and deeper study of LLM behaviours in clinical contexts.

[AI-8] LLM Personas as a Substitute for Field Experiments in Method Benchmarking

【速读】:该论文旨在解决在社会系统中,基于真实人群的A/B测试(Field experiments)因成本高、延迟大而成为方法迭代开发瓶颈的问题。其核心挑战在于:能否用生成式AI构建的虚拟人物(persona)模拟替代真实人类参与实验,同时保持评估接口对自适应算法的一致性。解决方案的关键在于提出一个“当且仅当”(if-and-only-if)的理论条件:当方法仅能观测聚合结果(aggregate-only observation)且评估标准不依赖算法身份或来源(algorithm-blind evaluation)时,将真人替换为persona等价于改变评估群体(如从纽约到雅加达),从算法视角看不可区分。进一步地,作者通过定义信息论意义上的可判别性(discriminability)指标,证明了使persona基准与真实实验具有同等决策相关性的本质问题是样本量问题,并给出了在特定分辨力下可靠区分不同方法所需的独立persona评估次数的显式边界。

Link: https://arxiv.org/abs/2512.21080
Authors: Enoch Hyunwook Kang
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Econometrics (econ.EM)
Comments:

Click to view abstract

Abstract:Field experiments (A/B tests) are often the most credible benchmark for methods in societal systems, but their cost and latency create a major bottleneck for iterative method development. LLM-based persona simulation offers a cheap synthetic alternative, yet it is unclear whether replacing humans with personas preserves the benchmark interface that adaptive methods optimize against. We prove an if-and-only-if characterization: when (i) methods observe only the aggregate outcome (aggregate-only observation) and (ii) evaluation depends only on the submitted artifact and not on the algorithm’s identity or provenance (algorithm-blind evaluation), swapping humans for personas is just panel change from the method’s point of view, indistinguishable from changing the evaluation population (e.g., New York to Jakarta). Furthermore, we move from validity to usefulness: we define an information-theoretic discriminability of the induced aggregate channel and show that making persona benchmarking as decision-relevant as a field experiment is fundamentally a sample-size question, yielding explicit bounds on the number of independent persona evaluations required to reliably distinguish meaningfully different methods at a chosen resolution.
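For intuition on the sample-size question, a generic two-sided Hoeffding calculation gives the flavor of such bounds: how many independent persona evaluations per method suffice to rank two methods whose true mean outcomes (on a [0, 1] scale) differ by at least delta. The paper's own bounds are stated in terms of its discriminability measure, so this is only an analogous back-of-the-envelope computation.

```python
import math

def personas_needed(delta, confidence=0.95):
    """Evaluations per method so two methods with true means >= delta apart
    are ranked correctly with the given confidence (Hoeffding, two-sided,
    union bound over both methods). Generic, not the paper's bound.
    """
    eps = delta / 2            # each sample mean within delta/2 of its truth
    fail = 1 - confidence
    # P(|mean - mu| >= eps) <= 2 exp(-2 n eps^2), twice (once per method)
    return math.ceil(math.log(4 / fail) / (2 * eps ** 2))

print(personas_needed(0.05))   # resolving a 5-point gap needs a few thousand
```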

[AI-9] Understanding Scaling Laws in Deep Neural Networks via Feature Learning Dynamics

【Quick Read】: This paper addresses the breakdown of feature learning in deep neural networks at large depth, in particular the failure of muP theory, derived in the infinite-width limit, when extended to deep residual networks whose residual blocks contain more than one internal layer. The central question is why training instability and diminishing returns appear as depth grows. The key of the solution is Neural Feature Dynamics (NFD), which characterizes the evolution of feature learning in residual networks via a coupled forward-backward stochastic system in the joint infinite-width, infinite-depth limit. NFD shows that under 1/sqrt(depth) residual scaling the gradient-independence assumption (GIA) becomes provably valid again at infinite depth, restoring analytical tractability, and identifies feature-learning collapse in the first internal layer of two-layer residual blocks as the structural cause of depth-muP's failure. Based on this diagnosis, the authors design a depth-aware learning-rate correction that counteracts the collapse and empirically restores depth-wise hyperparameter transfer, yielding stronger performance in deeper networks.

Link: https://arxiv.org/abs/2512.21075
Authors: Zihan Yao, Ruoyu Wu, Tianxiang Gao
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:The empirical success of deep learning is often attributed to scaling laws that predict consistent gains as model, data, and compute grow; however, large models can exhibit training instability and diminishing returns, suggesting that scaling laws describe what success looks like but not when and why scaling succeeds or fails. A central obstacle is the lack of a rigorous understanding of feature learning at large depth. While muP characterizes feature-learning dynamics in the infinite-width limit and enables hyperparameter transfer across width, its depth extension (depth-muP) breaks down for residual blocks with more than one internal layer. We derive Neural Feature Dynamics (NFD) for ResNets with single-layer residual blocks, characterizing feature learning via a coupled forward-backward stochastic system in the joint infinite-width and infinite-depth limit. In this regime, NFD identifies when scaling-law trends persist and explains diminishing returns. It also reveals a vanishing mechanism induced by the 1/sqrt(depth) residual scaling under which the gradient-independence assumption (GIA), known to fail during training at finite depth, becomes provably valid again at infinite depth, yielding an analytically tractable regime for end-to-end feature learning. Motivated by this insight, we study two-layer residual blocks and show that the same mechanism causes feature-learning collapse in the first internal layer at large depth, providing a structural explanation for the empirical failure of depth-muP. Based on this diagnosis, we propose a depth-aware learning-rate correction that counteracts the collapse and empirically restores depth-wise hyperparameter transfer, yielding stronger performance in deeper ResNets.

[AI-10] Agentic Explainable Artificial Intelligence (Agentic XAI) Approach To Explore Better Explanation

【Quick Read】: This paper addresses the difficulty of communicating explainable AI (XAI) outputs to non-expert users, which undermines trust in AI-based predictions. The proposed agentic XAI framework combines SHAP (Shapley Additive Explanations) with iterative refinement driven by multimodal large language models (LLMs), progressively improving explanation quality over multiple autonomous revision rounds. The key lies in the agentic mechanism itself: the LLM acts as an autonomous agent that explores how to improve the explanation in each round, guided by feedback from both human experts and LLM evaluators across seven quality metrics. Recommendation quality peaks at rounds 3-4, while excessive iteration introduces verbosity and ungrounded abstraction, indicating that an early-stopping strategy is needed to balance the bias-variance trade-off and providing evidence-based design principles for practical agentic XAI systems.

Link: https://arxiv.org/abs/2512.21066
Authors: Tomoaki Yamaguchi, Yutong Zhou, Masahiro Ryo, Keisuke Katsura
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Click to view abstract

Abstract:Explainable artificial intelligence (XAI) enables data-driven understanding of factor associations with response variables, yet communicating XAI outputs to laypersons remains challenging, hindering trust in AI-based predictions. Large language models (LLMs) have emerged as promising tools for translating technical explanations into accessible narratives, yet the integration of agentic AI, where LLMs operate as autonomous agents through iterative refinement, with XAI remains unexplored. This study proposes an agentic XAI framework combining SHAP-based explainability with multimodal LLM-driven iterative refinement to generate progressively enhanced explanations. As a use case, we tested this framework as an agricultural recommendation system using rice yield data from 26 fields in Japan. The Agentic XAI initially provided a SHAP result and explored how to improve the explanation through additional analysis iteratively across 11 refinement rounds (Rounds 0-10). Explanations were evaluated by human experts (crop scientists) (n=12) and LLMs (n=14) against seven metrics: Specificity, Clarity, Conciseness, Practicality, Contextual Relevance, Cost Consideration, and Crop Science Credibility. Both evaluator groups confirmed that the framework successfully enhanced recommendation quality with an average score increase of 30-33% from Round 0, peaking at Rounds 3-4. However, excessive refinement showed a substantial drop in recommendation quality, indicating a bias-variance trade-off where early rounds lacked explanation depth (bias) while excessive iteration introduced verbosity and ungrounded abstraction (variance), as revealed by metric-specific analysis. These findings suggest that strategic early stopping (regularization) is needed for optimizing practical utility, challenging assumptions about monotonic improvement and providing evidence-based design principles for agentic XAI systems.

[AI-11] Policy-Conditioned Policies for Multi-Agent Task Solving

【Quick Read】: This paper addresses the core challenge of dynamic strategy adaptation in multi-agent tasks: under the prevailing deep reinforcement learning paradigm, neural policies are high-dimensional, opaque parameter vectors (a "representational bottleneck"), making it intractable to condition directly on opponents' strategies. The key of the solution is a paradigm shift: policies are represented as human-readable source code, with large language models (LLMs) serving as approximate interpreters, operationalizing the game-theoretic notion of Program Equilibrium. Concretely, the LLM performs optimization directly in the space of programmatic policies, acting as a point-wise best-response operator that iteratively synthesizes and refines the ego agent's policy code in response to the opponent's strategy; this process is formalized as Programmatic Iterated Best Response (PIBR), in which policy code is optimized by textual gradients using structured feedback derived from game utility and runtime unit tests.

Link: https://arxiv.org/abs/2512.21024
Authors: Yue Lin, Shuhui Zhu, Wenhao Li, Ang Li, Dan Qiao, Pascal Poupart, Hongyuan Zha, Baoxiang Wang
Institutions: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:In multi-agent tasks, the central challenge lies in the dynamic adaptation of strategies. However, directly conditioning on opponents' strategies is intractable in the prevalent deep reinforcement learning paradigm due to a fundamental "representational bottleneck": neural policies are opaque, high-dimensional parameter vectors that are incomprehensible to other agents. In this work, we propose a paradigm shift that bridges this gap by representing policies as human-interpretable source code and utilizing Large Language Models (LLMs) as approximate interpreters. This programmatic representation allows us to operationalize the game-theoretic concept of Program Equilibrium. We reformulate the learning problem by utilizing LLMs to perform optimization directly in the space of programmatic policies. The LLM functions as a point-wise best-response operator that iteratively synthesizes and refines the ego agent's policy code to respond to the opponent's strategy. We formalize this process as Programmatic Iterated Best Response (PIBR), an algorithm where the policy code is optimized by textual gradients, using structured feedback derived from game utility and runtime unit tests. We demonstrate that this approach effectively solves several standard coordination matrix games and a cooperative Level-Based Foraging environment.

[AI-12] LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics

【Quick Read】: This paper addresses the limitations of current LLM evaluation, which relies on static scoring and task-specific metrics, making it hard to sensibly aggregate multiple ability dimensions or to capture a model's dynamic competitiveness and fragility across sequential, high-stakes tasks. The key of the solution is the Competitive Swiss-System Dynamics (CSD) framework: it simulates a multi-round, sequential contest that dynamically pairs models by their accumulated win-loss records over a curated sequence of benchmarks, and uses Monte Carlo simulation (N=100,000 iterations) to compute a statistically robust Expected Win Score E[S_m], eliminating noise from random pairing and early-round luck. It further introduces Failure Sensitivity Analysis, parameterizing the per-round elimination quantity T_k to quantify a model's risk appetite and distinguish robust generalists from aggressive specialists, yielding a markedly more nuanced and context-aware ranking than traditional aggregate scoring.

Link: https://arxiv.org/abs/2512.21010
Authors: Jiashuo Liu, Jiayun Wu, Chunjie Wu, Jingkai Liu, Zaiyuan Wang, Huan Zhou, Wenhao Huang, Hongseok Namkoong
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: 18 pages

Click to view abstract

Abstract:The rapid proliferation of Large Language Models (LLMs) and diverse specialized benchmarks necessitates a shift from fragmented, task-specific metrics to a holistic, competitive ranking system that effectively aggregates performance across multiple ability dimensions. Primarily using static scoring, current evaluation methods are fundamentally limited. They struggle to determine the proper mix ratio across diverse benchmarks, and critically, they fail to capture a model’s dynamic competitive fitness or its vulnerability when confronted with sequential, high-stakes tasks. To address this, we introduce the novel Competitive Swiss-System Dynamics (CSD) framework. CSD simulates a multi-round, sequential contest where models are dynamically paired across a curated sequence of benchmarks based on their accumulated win-loss record. And Monte Carlo Simulation ( N=100,000 iterations) is used to approximate the statistically robust Expected Win Score ( E[S_m] ), which eliminates the noise of random pairing and early-round luck. Furthermore, we implement a Failure Sensitivity Analysis by parameterizing the per-round elimination quantity ( T_k ), which allows us to profile models based on their risk appetite–distinguishing between robust generalists and aggressive specialists. We demonstrate that CSD provides a more nuanced and context-aware ranking than traditional aggregate scoring and static pairwise models, representing a vital step towards risk-informed, next-generation LLM evaluation.
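A minimal Monte Carlo sketch of Swiss-system pairing by accumulated score. Benchmark sequencing, tiebreak rules, and the per-round elimination parameter T_k from the paper are simplified away, and the pairwise win probabilities are assumed given.

```python
import random
from collections import defaultdict

def expected_win_scores(models, win_prob, rounds, iters=10000, seed=0):
    """Monte Carlo estimate of E[S_m] under Swiss-system pairing (a sketch).

    win_prob[(a, b)]: probability that model a beats model b on a task
                      (caller supplies both orderings of every pair).
    """
    rng = random.Random(seed)
    totals = defaultdict(float)
    for _ in range(iters):
        score = {m: 0 for m in models}
        for _ in range(rounds):
            # Swiss pairing: sort by current score, pair adjacent models
            # (random tiebreak); with an odd count the last model sits out.
            order = sorted(models, key=lambda m: (-score[m], rng.random()))
            for a, b in zip(order[::2], order[1::2]):
                winner = a if rng.random() < win_prob[(a, b)] else b
                score[winner] += 1
        for m in models:
            totals[m] += score[m] / iters
    return dict(totals)
```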

[AI-13] rafficSimAgent : A Hierarchical Agent Framework for Autonomous Traffic Simulation with MCP Control

【Quick Read】: This paper addresses the difficulty that users without platform expertise face when designing and running experiments in traffic simulators such as SUMO and MATSim. The key of the solution is the TrafficSimAgent framework, an LLM-based system of expert agents collaborating across levels: high-level expert agents interpret natural language instructions flexibly, plan the overall experiment workflow, and invoke MCP-compatible tools on demand, while low-level expert agents optimize the action plans of fundamental elements based on real-time traffic conditions, achieving end-to-end automated decision-making and optimization from ambiguous instructions to executable simulations.

Link: https://arxiv.org/abs/2512.20996
Authors: Yuwei Du, Jun Zhang, Jie Feng, Zhicheng Liu, Jian Yuan, Yong Li
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: The code will be available at: this https URL

Click to view abstract

Abstract:Traffic simulation is important for transportation optimization and policy making. While existing simulators such as SUMO and MATSim offer fully-featured platforms and utilities, users without too much knowledge about these platforms often face significant challenges when conducting experiments from scratch and applying them to their daily work. To solve this challenge, we propose TrafficSimAgent, an LLM-based agent framework that serves as an expert in experiment design and decision optimization for general-purpose traffic simulation tasks. The framework facilitates execution through cross-level collaboration among expert agents: high-level expert agents comprehend natural language instructions with high flexibility, plan the overall experiment workflow, and invoke corresponding MCP-compatible tools on demand; meanwhile, low-level expert agents select optimal action plans for fundamental elements based on real-time traffic conditions. Extensive experiments across multiple scenarios show that TrafficSimAgent effectively executes simulations under various conditions and consistently produces reasonable outcomes even when user instructions are ambiguous. Besides, the carefully designed expert-level autonomous decision-driven optimization in TrafficSimAgent yields superior performance when compared with other systems and SOTA LLM based methods.

[AI-14] FinAgent: An Agentic AI Framework Integrating Personal Finance and Nutrition Planning

【Quick Read】: This paper addresses the tension between limited household budgets and nutritional needs in middle-income settings, especially how to plan economical yet nutritionally adequate diets under fluctuating food prices. The key of the solution is a price-aware agentic AI system with a modular multi-agent architecture, comprising dedicated agents for budgeting, nutrition optimization, price monitoring, and health personalization; through a shared knowledge base and a substitution graph, it dynamically adapts meal plans to market price changes (20-30%) while keeping nutrient adequacy above 95%, preserving dietary quality while consistently cutting costs by 12-18% relative to a static menu.

Link: https://arxiv.org/abs/2512.20991
Authors: Toqeer Ali Syed, Abdulaziz Alshahrani, Ali Ullah, Ali Akarma, Sohail Khan, Muhammad Nauman, Salman Jan
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: This paper was presented at the IEEE International Conference on Computing and Applications (ICCA 2025), Bahrain

Click to view abstract

Abstract:The issue of limited household budgets and nutritional demands continues to be a challenge, especially in middle-income environments where food prices fluctuate. This paper introduces a price-aware agentic AI system, which combines personal finance management with diet optimization. With household income and fixed expenditures, medical and well-being status, as well as real-time food costs, the system creates nutritionally sufficient meal plans at comparatively reasonable prices that automatically adjust to market changes. The framework is implemented in a modular multi-agent architecture, which has specific agents (budgeting, nutrition, price monitoring, and health personalization). These agents share the knowledge base and use the substitution graph to ensure that the nutritional quality is maintained at a minimum cost. Simulations with a representative Saudi household case study show a steady 12-18% reduction in costs relative to a static weekly menu, nutrient adequacy of over 95% and high performance with price changes of 20-30%. The findings indicate that the framework can locally combine affordability with nutritional adequacy and provide a viable avenue of capacity-building towards sustainable and fair diet planning in line with Sustainable Development Goals on Zero Hunger and Good Health.

[AI-15] A Blockchain-Monitored Agentic AI Architecture for Trusted Perception-Reasoning-Action Pipelines

【Quick Read】: This paper addresses the lack of trust, oversight, and guarantees on the integrity of information and actions facing autonomous agentic AI systems in healthcare, smart cities, digital forensics, and supply chain management. The key of the solution is a unified architecture coupling a LangChain-based multi-agent system with a permissioned blockchain: the blockchain layer continuously monitors the perception-conceptualization-action loop, enforces policy, and keeps an immutable audit trail, ensuring that agent behaviour is traceable, compliant, and secure. Experiments show the framework prevents unauthorized actions while keeping operational latency within reasonable ranges, providing a general implementation path for autonomous yet accountable AI in high-impact scenarios.

Link: https://arxiv.org/abs/2512.20985
Authors: Salman Jan, Hassan Ali Razzaqi, Ali Akarma, Mohammad Riyaz Belgaum
Institutions: Unknown
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: This paper was presented at the IEEE International Conference on Computing and Applications (ICCA 2025), Bahrain

Click to view abstract

Abstract:The application of agentic AI systems in autonomous decision-making is growing in the areas of healthcare, smart cities, digital forensics, and supply chain management. Even though these systems are flexible and offer real-time reasoning, they also raise concerns about trust, oversight, and the integrity of the information and activities upon which they are founded. The paper suggests a single architecture comprising a LangChain-based multi-agent system with a permissioned blockchain to guarantee constant monitoring, policy enforcement, and immutable auditability of agentic action. The framework relates the perception-conceptualization-action cycle to a blockchain layer of governance that verifies the inputs, evaluates recommended actions, and documents the outcomes of the execution. A Hyperledger Fabric-based system, MCP-integrated action executors, and a LangChain agent are introduced, and experiments on smart inventory management, traffic-signal control, and healthcare monitoring are done. The results suggest that blockchain-secured verification is effective in preventing unauthorized practices, offers traceability throughout the whole decision-making process, and maintains operational latency within reasonable ranges. The suggested framework provides a universal system for implementing high-impact agentic AI applications that are autonomous yet responsible.

[AI-16] Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions

【Quick Read】: This paper addresses two problems: classical Bayesian reinforcement learning (BRL) assumes known forms of the transition and reward models, limiting real-world applicability; and existing deep BRL methods that model joint data and task parameters directly with neural networks must optimize the evidence lower bound (ELBO), which is difficult and can yield indistinct task parameters that compromise policy performance. The key of the solution is a new deep BRL method, Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions (GLiBRL), which learns transition and reward models efficiently and accurately via learnable basis functions while retaining a fully tractable marginal likelihood and exact Bayesian inference over task parameters and model noise, leading to markedly more stable and performant policies.

Link: https://arxiv.org/abs/2512.20974
Authors: Jingyang You, Hanna Kurniawati
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Click to view abstract

Abstract:Bayesian Reinforcement Learning (BRL) provides a framework for generalisation of Reinforcement Learning (RL) problems from its use of Bayesian task parameters in the transition and reward models. However, classical BRL methods assume known forms of transition and reward models, reducing their applicability in real-world problems. As a result, recent deep BRL methods have started to incorporate model learning, though the use of neural networks directly on the joint data and task parameters requires optimising the Evidence Lower Bound (ELBO). ELBOs are difficult to optimise and may result in indistinctive task parameters, hence compromised BRL policies. To this end, we introduce a novel deep BRL method, Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions (GLiBRL), that enables efficient and accurate learning of transition and reward models, with fully tractable marginal likelihood and Bayesian inference on task parameters and model noises. On challenging MetaWorld ML10/45 benchmarks, GLiBRL improves the success rate of one of the state-of-the-art deep BRL methods, VariBAD, by up to 2.7x. Comparing against representative or recent deep BRL / Meta-RL methods, such as MAML, RL2, SDVT, TrMRL and ECET, GLiBRL also demonstrates its low-variance and decent performance consistently.
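The "generalised linear model on learnable basis functions" idea admits a compact sketch: a neural feature map feeds a Bayesian linear head whose weight posterior is available in closed form. The prior, noise handling, and names below are illustrative; GLiBRL additionally infers the model noise and exploits the tractable marginal likelihood during training.

```python
import torch
import torch.nn as nn

class BayesianLinearHead(nn.Module):
    """Bayesian linear regression on learned features phi(x), with a
    closed-form Gaussian posterior over the linear weights (a sketch)."""
    def __init__(self, basis: nn.Module, feat_dim, noise_var=0.1, prior_var=1.0):
        super().__init__()
        self.basis = basis
        self.noise_var = noise_var
        self.prior_prec = torch.eye(feat_dim) / prior_var

    def posterior(self, x, y):
        """x: (n, d_in), y: (n,). Returns posterior mean and covariance."""
        phi = self.basis(x)                                   # (n, feat_dim)
        prec = self.prior_prec + phi.T @ phi / self.noise_var
        cov = torch.linalg.inv(prec)
        mean = cov @ (phi.T @ y) / self.noise_var
        return mean, cov

basis = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 16))
head = BayesianLinearHead(basis, feat_dim=16)
mean, cov = head.posterior(torch.randn(32, 4), torch.randn(32))
```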

[AI-17] Mesh-Attention: A New Communication-Efficient Distributed Attention with Improved Data Locality

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在扩展上下文窗口时面临的分布式注意力机制的可扩展性问题,尤其是现有最优方法Ring-Attention因通信流量过大而导致的性能瓶颈。其解决方案的关键在于提出了一种新的分布式注意力算法——Mesh-Attention,该方法通过重新设计分布式注意力的计算分配空间,采用二维tile(而非传统的一维行或列)将计算块分配给每个GPU,从而显著降低通信与计算比(CommCom ratio)。该方法不仅包含Ring-Attention作为特例,还支持通过调整tile形状灵活控制CommCom比率,并引入一种贪婪调度算法,在保证GPU间高效通信的前提下高效搜索调度空间。理论分析和实验结果均表明,Mesh-Attention具有更低的通信复杂度和优异的可扩展性,在256个GPU上实现最高3.4倍、平均2.9倍的加速比,同时通信量减少最多85.4%(平均79.0%),在大规模部署中有效降低了系统开销。

链接: https://arxiv.org/abs/2512.20968
作者: Sirui Chen,Jingji Chen,Siqi Zhu,Ziheng Jiang,Yanghua Peng,Xuehai Qian
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Distributed attention is a fundamental problem for scaling context window for Large Language Models (LLMs). The state-of-the-art method, Ring-Attention, suffers from scalability limitations due to its excessive communication traffic. This paper proposes a new distributed attention algorithm, Mesh-Attention, by rethinking the design space of distributed attention with a new matrix-based model. Our method assigns a two-dimensional tile – rather than one-dimensional row or column – of computation blocks to each GPU to achieve higher efficiency through lower communication-computation (CommCom) ratio. The general approach covers Ring-Attention as a special case, and allows the tuning of CommCom ratio with different tile shapes. Importantly, we propose a greedy algorithm that can efficiently search the scheduling space within the tile with restrictions that ensure efficient communication among GPUs. The theoretical analysis shows that Mesh-Attention leads to a much lower communication complexity and exhibits good scalability compared with other current algorithms. Our extensive experiment results show that Mesh-Attention can achieve up to 3.4x speedup (2.9x on average) and reduce the communication volume by up to 85.4% (79.0% on average) on 256 GPUs. Our scalability results further demonstrate that Mesh-Attention sustains superior performance as the system scales, substantially reducing overhead in large-scale deployments. The results convincingly confirm the advantage of Mesh-Attention.
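The effect of the tile shape on the CommCom ratio can be illustrated with a back-of-envelope model (our simplification, not the paper's exact cost model): a GPU that owns a p × q tile of the attention block grid evaluates p·q blocks but only needs to hold p distinct query blocks and q distinct key/value blocks.

```python
def comm_comp_ratio(p: int, q: int) -> float:
    """Communicated blocks per computed block for a p x q tile (toy model)."""
    compute = p * q        # attention blocks evaluated by one GPU
    communicate = p + q    # distinct Q rows plus K/V columns it must fetch
    return communicate / compute

# Same amount of work (16 blocks) per GPU, different tile shapes.
# The 1 x 16 tile corresponds to the Ring-Attention special case.
for p, q in [(1, 16), (2, 8), (4, 4)]:
    print(f"tile {p}x{q}: CommCom = {comm_comp_ratio(p, q):.3f}")
```

For a fixed amount of work p·q, the ratio (p+q)/(p·q) is minimised at p = q, which is the intuition behind assigning two-dimensional tiles instead of one-dimensional rows or columns.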
zh

[AI-18] Can Agentic AI Match the Performance of Human Data Scientists?

【速读】:该论文试图解决的问题是:当前基于大语言模型(Large Language Models, LLMs)的代理型人工智能(Agentic AI)系统在数据科学任务中是否能够达到人类数据科学家的性能水平,尤其是在依赖领域特定知识的情况下。其解决方案的关键在于设计了一个预测任务,其中关键的潜在变量隐藏在图像数据中而非传统的表格特征中,从而使得仅依赖通用代码生成和标准数据分析流程的代理型AI表现不佳,而具备领域知识的人类专家则能识别出这一隐藏变量并提升建模效果。实验结果表明,当前代理型AI在缺乏领域知识时存在明显局限性,凸显了未来研究需致力于开发能够有效识别并整合领域知识的智能系统。

链接: https://arxiv.org/abs/2512.20959
作者: An Luo,Jin Du,Fangqiao Tian,Xun Xian,Robert Specht,Ganghua Wang,Xuan Bi,Charles Fleming,Jayanth Srinivasa,Ashish Kundu,Mingyi Hong,Jie Ding
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
备注:

点击查看摘要

Abstract:Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in large language models (LLMs) have significantly automated data science workflows, but a fundamental question persists: Can these agentic AI systems truly match the performance of human data scientists who routinely leverage domain-specific knowledge? We explore this question by designing a prediction task where a crucial latent variable is hidden in relevant image data instead of tabular features. As a result, agentic AI that generates generic codes for modeling tabular data cannot perform well, while human experts could identify the important hidden variable using domain knowledge. We demonstrate this idea with a synthetic dataset for property insurance. Our experiments show that agentic AI that relies on generic analytics workflow falls short of methods that use domain-specific insights. This highlights a key limitation of the current agentic AI for data science and underscores the need for future research to develop agentic AI systems that can better recognize and incorporate domain knowledge.
zh

[AI-19] ReACT-Drug: Reaction-Template Guided Reinforcement Learning for de novo Drug Design

【速读】:该论文旨在解决**从头药物设计(De novo drug design)**中面临的挑战,即如何在庞大的化学空间中高效筛选出合成可行且具有高亲和力的候选分子。传统监督学习方法难以实现多目标优化与新颖化学空间的探索,而本研究提出的关键解决方案是构建一个基于强化学习(Reinforcement Learning, RL)的全集成、目标无关的分子设计框架——ReACT-Drug。其核心创新在于:首先利用ESM-2蛋白嵌入从蛋白质数据库(PDB)中识别与目标蛋白相似的已知蛋白,并提取其对应的已知配体作为片段初始化搜索空间,从而引导代理(agent)聚焦于生物相关子空间;随后通过Proximal Policy Optimization(PPO)算法驱动ChemBERTa编码的分子在动态的化学有效反应模板空间中进行迭代优化,生成具有高结合亲和力和高合成可及性的全新分子,同时保证100%的化学有效性与新颖性(符合MOSES基准)。该架构融合了结构生物学、深度表示学习与化学合成规则,为自动化和加速理性药物设计提供了新范式。

链接: https://arxiv.org/abs/2512.20958
作者: R Yadunandan,Nimisha Ghosh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:De novo drug design is a crucial component of modern drug development, yet navigating the vast chemical space to find synthetically accessible, high-affinity candidates remains a significant challenge. Reinforcement Learning (RL) enhances this process by enabling multi-objective optimization and exploration of novel chemical space - capabilities that traditional supervised learning methods lack. In this work, we introduce ReACT-Drug, a fully integrated, target-agnostic molecular design framework based on Reinforcement Learning. Unlike models requiring target-specific fine-tuning, ReACT-Drug utilizes a generalist approach by leveraging ESM-2 protein embeddings to identify similar proteins for a given target from a knowledge base such as the Protein Data Bank (PDB). Thereafter, the known drug ligands corresponding to such proteins are decomposed to initialize a fragment-based search space, biasing the agent towards biologically relevant subspaces. For each such fragment, the pipeline employs a Proximal Policy Optimization (PPO) agent guiding a ChemBERTa-encoded molecule through a dynamic action space of chemically valid, reaction-template-based transformations. This results in the generation of de novo drug candidates with competitive binding affinities and high synthetic accessibility, while ensuring 100% chemical validity and novelty as per MOSES benchmarking. This architecture highlights the potential of integrating structural biology, deep representation learning, and chemical synthesis rules to automate and accelerate rational drug design. The dataset and code are available at this https URL.
zh

[AI-20] One Tool Is Enough: Reinforcement Learning for Repository-Level LLM Agents

【速读】:该论文旨在解决大规模开源软件(OSS)代码库中定位需修改文件和函数的难题,传统基于大语言模型(LLM)的方法通常将此任务视为仓库级检索问题,并依赖多个辅助工具,忽视了代码执行逻辑且增加了模型控制复杂性。其解决方案的关键在于提出一个名为RepoNavigator的LLM代理,该代理仅配备一个“跳转至被调用符号定义”的执行感知工具,通过统一设计反映真实的代码执行流程并简化工具操作;同时采用强化学习(Reinforcement Learning, RL)从预训练模型端到端训练,无需闭源蒸馏,从而实现了高效且可扩展的仓库级问题定位能力。

链接: https://arxiv.org/abs/2512.20957
作者: Zhaoxi Zhang,Yitong Duan,Yanzhi Zhang,Yiming Xu,Jiyan He,Yunfang Wu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Locating the files and functions requiring modification in large open-source software (OSS) repositories is challenging due to their scale and structural complexity. Existing large language model (LLM)-based methods typically treat this as a repository-level retrieval task and rely on multiple auxiliary tools, which overlook code execution logic and complicate model control. We propose RepoNavigator, an LLM agent equipped with a single execution-aware tool-jumping to the definition of an invoked symbol. This unified design reflects the actual flow of code execution while simplifying tool manipulation. RepoNavigator is trained end-to-end via Reinforcement Learning (RL) directly from a pretrained model, without any closed-source distillation. Experiments demonstrate that RL-trained RepoNavigator achieves state-of-the-art performance, with the 7B model outperforming 14B baselines, the 14B model surpassing 32B competitors, and even the 32B model exceeding closed-source models such as Claude-3.7. These results confirm that integrating a single, structurally grounded tool with RL training provides an efficient and scalable solution for repository-level issue localization.
zh

[AI-21] A Multi-fidelity Double-Delta Wing Dataset and Empirical Scaling Laws for GNN-based Aerodynamic Field Surrogate

【速读】:该论文旨在解决当前车辆设计中数据驱动的代理模型(surrogate model)在训练数据规模与预测精度之间缺乏明确量化关系的问题,尤其是多保真度(multi-fidelity)数据集的稀缺性及经验性指导不足。其解决方案的关键在于构建一个开源的多保真度气动场数据集(针对双三角翼结构),并基于此开展系统性的实证缩放研究:通过控制训练预算,在不同数据规模(40–1280个流场快照)和模型参数量(0.1–2.4百万)下训练图神经网络(GNN)代理模型,发现测试误差随数据量呈幂律下降(指数为-0.6122),从而揭示了高效的数据利用规律,并据此估算出最优采样密度约为每维设计空间8个样本,为未来数据生成与模型训练的资源分配提供了可量化的依据。

链接: https://arxiv.org/abs/2512.20941
作者: Yiren Shen,Juan J. Alonso
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
备注:

点击查看摘要

Abstract:Data-driven surrogate models are increasingly adopted to accelerate vehicle design. However, open-source multi-fidelity datasets and empirical guidelines linking dataset size to model performance remain limited. This study investigates the relationship between training data size and prediction accuracy for a graph neural network (GNN) based surrogate model for aerodynamic field prediction. We release an open-source, multi-fidelity aerodynamic dataset for double-delta wings, comprising 2448 flow snapshots across 272 geometries evaluated at angles of attack from 11° to 19° at Ma=0.3 using both Vortex Lattice Method (VLM) and Reynolds-Averaged Navier-Stokes (RANS) solvers. The geometries are generated using a nested Saltelli sampling scheme to support future dataset expansion and variance-based sensitivity analysis. Using this dataset, we conduct a preliminary empirical scaling study of the MF-VortexNet surrogate by constructing six training datasets with sizes ranging from 40 to 1280 snapshots and training models with 0.1 to 2.4 million parameters under a fixed training budget. We find that the test error decreases with data size with a power-law exponent of -0.6122, indicating efficient data utilization. Based on this scaling law, we estimate that the optimal sampling density is approximately eight samples per dimension in a d-dimensional design space. The results also suggest improved data utilization efficiency for larger surrogate models, implying a potential trade-off between dataset generation cost and model training budget.
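The reported exponent is the slope of an ordinary least-squares fit in log-log space. The snippet below illustrates the procedure on synthetic error values generated from the paper's exponent (-0.6122); it does not use the released benchmark data.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = np.array([40, 80, 160, 320, 640, 1280])        # training snapshots
errors = 2.0 * sizes ** -0.6122 * np.exp(0.02 * rng.normal(size=sizes.size))

# Power law err = C * N^k becomes linear in log space: log err = k log N + log C.
k, logC = np.polyfit(np.log(sizes), np.log(errors), deg=1)
print(f"estimated exponent: {k:.4f}")                   # close to -0.6122
```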
zh

[AI-22] Guardrailed Elasticity Pricing: A Churn-Aware Forecasting Playbook for Subscription Strategy

【速读】:该论文旨在解决订阅定价(Subscription Pricing)在SaaS(Software as a Service)场景中因静态定价策略导致的收入与客户留存难以兼顾的问题,即如何在保障客户体验和边际利润的前提下,实现动态、精细化的价格优化。解决方案的关键在于构建一个集成多变量需求预测、分群价格弹性(Segment-level Price Elasticity)与流失倾向(Churn Propensity)的动态决策系统,通过融合季节性时间序列模型与树基学习器(Tree-based Learners),结合蒙特卡洛情景测试量化风险边界,并在约束优化框架中嵌入业务守门机制(Business Guardrails),如客户体验阈值、最低毛利率和可接受流失率,从而精准引导价格调整向高支付意愿群体倾斜,同时保护价格敏感群体,最终实现收入、利润率与客户留存的协同优化。

链接: https://arxiv.org/abs/2512.20932
作者: Deepit Sapru
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents a marketing analytics framework that operationalizes subscription pricing as a dynamic, guardrailed decision system, uniting multivariate demand forecasting, segment-level price elasticity, and churn propensity to optimize revenue, margin, and retention. The approach blends seasonal time-series models with tree-based learners, runs Monte Carlo scenario tests to map risk envelopes, and solves a constrained optimization that enforces business guardrails on customer experience, margin floors, and allowable churn. Validated across heterogeneous SaaS portfolios, the method consistently outperforms static tiers and uniform uplifts by reallocating price moves toward segments with higher willingness-to-pay while protecting price-sensitive cohorts. The system is designed for real-time recalibration via modular APIs and includes model explainability for governance and compliance. Managerially, the framework functions as a strategy playbook that clarifies when to shift from flat to dynamic pricing, how to align pricing with CLV and MRR targets, and how to embed ethical guardrails, enabling durable growth without eroding customer trust.
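A concrete, deliberately simplified instance of the guardrailed optimization is sketched below: for each segment, candidate price uplifts are screened against a churn ceiling and a margin floor, and the best feasible move is kept. The constant-elasticity demand curve, linear churn response, and all numbers are our assumptions, not the paper's calibrated models.

```python
import numpy as np

def evaluate(uplift, elasticity, churn_base, churn_slope, base_rev=100.0, cost=60.0):
    """Margin and churn for one segment under a candidate price uplift (toy model)."""
    demand = (1 + uplift) ** elasticity            # constant-elasticity demand
    churn = churn_base + churn_slope * uplift      # linearised churn response
    revenue = base_rev * (1 + uplift) * demand * (1 - churn)
    margin = revenue - cost * demand * (1 - churn)
    return margin, churn

def best_uplift(elasticity, churn_base, churn_slope, max_churn=0.08, margin_floor=35.0):
    best = (0.0, -np.inf)
    for uplift in np.linspace(-0.10, 0.30, 81):    # candidate price moves
        margin, churn = evaluate(uplift, elasticity, churn_base, churn_slope)
        if churn <= max_churn and margin >= margin_floor:   # business guardrails
            if margin > best[1]:
                best = (uplift, margin)
    return best

# A price-insensitive segment absorbs a larger move than a sensitive one,
# whose feasible uplift is capped by the churn guardrail.
print(best_uplift(elasticity=-0.5, churn_base=0.03, churn_slope=0.10))
print(best_uplift(elasticity=-2.0, churn_base=0.05, churn_slope=0.25))
```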
zh

[AI-23] RevFFN: Memory-Efficient Full-Parameter Fine-Tuning of Mixture-of-Experts LLMs with Reversible Blocks

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)全参数微调(full parameter fine tuning)过程中因需缓存大量中间激活值(intermediate activations)而导致的内存开销问题,这一瓶颈限制了在单卡设备上对当前主流大模型进行高效微调。解决方案的关键在于提出一种名为RevFFN的内存高效微调范式,其核心是设计了可逆Transformer模块(reversible Transformer blocks),能够在反向传播阶段通过输出重构输入激活值,从而避免存储绝大多数中间激活,显著降低峰值内存消耗,同时保持混合专家(Mixture of Experts, MoE)架构的表达能力,最终实现仅用单张消费级或服务器级GPU即可完成全参数微调。

链接: https://arxiv.org/abs/2512.20920
作者: Ningyuan Liu,Jing Yang,Kaitong Cai,Keze Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Under submission

点击查看摘要

Abstract:Full parameter fine tuning is a key technique for adapting large language models (LLMs) to downstream tasks, but it incurs substantial memory overhead due to the need to cache extensive intermediate activations for backpropagation. This bottleneck makes full fine tuning of contemporary large scale LLMs challenging in practice. Existing distributed training frameworks such as DeepSpeed alleviate this issue using techniques like ZeRO and FSDP, which rely on multi GPU memory or CPU offloading, but often require additional hardware resources and reduce training speed. We introduce RevFFN, a memory efficient fine tuning paradigm for mixture of experts (MoE) LLMs. RevFFN employs carefully designed reversible Transformer blocks that allow reconstruction of layer input activations from outputs during backpropagation, eliminating the need to store most intermediate activations in memory. While preserving the expressive capacity of MoE architectures, this approach significantly reduces peak memory consumption for full parameter fine tuning. As a result, RevFFN enables efficient full fine tuning on a single consumer grade or server grade GPU.
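The activation-reconstruction idea can be illustrated with a RevNet-style additive coupling (a minimal sketch; the paper's actual MoE block design may differ): because the block is exactly invertible, the backward pass can recompute inputs from outputs instead of caching them.

```python
import numpy as np

def ffn(x, W):                       # stand-in for an (expert) feed-forward net
    return np.tanh(x @ W)

def forward(x1, x2, Wf, Wg):
    y1 = x1 + ffn(x2, Wf)
    y2 = x2 + ffn(y1, Wg)
    return y1, y2

def inverse(y1, y2, Wf, Wg):
    x2 = y2 - ffn(y1, Wg)            # recover x2 from the outputs alone
    x1 = y1 - ffn(x2, Wf)            # then recover x1
    return x1, x2

rng = np.random.default_rng(0)
d = 16
x1, x2 = rng.normal(size=(2, d))
Wf, Wg = rng.normal(size=(2, d, d)) / np.sqrt(d)
y1, y2 = forward(x1, x2, Wf, Wg)
r1, r2 = inverse(y1, y2, Wf, Wg)
print(np.allclose(x1, r1), np.allclose(x2, r2))   # True True
```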
zh

[AI-24] DiEC: Diffusion Embedded Clustering

【速读】:该论文旨在解决深度聚类中因使用单一编码器生成固定嵌入而导致的表示能力受限问题,即忽略了预训练扩散模型在不同网络层级和噪声时间步(timestep)下形成的表示轨迹,而这些轨迹中的可聚类性(clusterability)存在显著差异。解决方案的关键在于提出DiEC(Diffusion Embedded Clustering),其核心是将表示选择建模为一个二维搜索问题(层 × 时间步),并利用弱耦合特性将其分解为两个阶段:首先固定U-Net瓶颈层作为聚类友好中间层(Clustering-friendly Middle Layer, CML),随后通过最优时间步搜索(Optimal Timestep Search, OTS)确定最优聚类时间步(t*);在此基础上,采用轻量级残差映射提取t*处的瓶颈特征,并优化DEC风格的KL自训练目标,辅以自适应图正则化与熵正则化强化聚类结构,同时引入去噪一致性分支以稳定表示并保持生成一致性。

链接: https://arxiv.org/abs/2512.20905
作者: Haidong Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep clustering hinges on learning representations that are inherently clusterable. However, using a single encoder to produce a fixed embedding ignores the representation trajectory formed by a pretrained diffusion model across network hierarchies and noise timesteps, where clusterability varies substantially. We propose DiEC (Diffusion Embedded Clustering), which performs unsupervised clustering by directly reading internal activations from a pretrained diffusion U-Net. DiEC formulates representation selection as a two-dimensional search over layer x timestep, and exploits a weak-coupling property to decompose it into two stages. Specifically, we first fix the U-Net bottleneck layer as the Clustering-friendly Middle Layer (CML), and then use Optimal Timestep Search (OTS) to identify the clustering-optimal timestep (t*). During training, we extract bottleneck features at the fixed t* and obtain clustering representations via a lightweight residual mapping. We optimize a DEC-style KL self-training objective, augmented with adaptive graph regularization and entropy regularization to strengthen cluster structures. In parallel, we introduce a denoising-consistency branch at random timesteps to stabilize the representations and preserve generative consistency. Experiments show that DiEC achieves competitive clustering performance on multiple standard benchmarks.
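For reference, the DEC-style self-training objective mentioned above pairs Student's-t soft assignments q with a sharpened target distribution p and minimises KL(P||Q). The sketch below shows that standard DEC formulation only; DiEC's graph and entropy regularisers and its denoising-consistency branch are omitted, and the features are random stand-ins for bottleneck activations at t*.

```python
import numpy as np

def soft_assign(z, mu, alpha=1.0):
    """Student's-t soft assignments between embeddings z and cluster centres mu."""
    d2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    q = (1 + d2 / alpha) ** (-(alpha + 1) / 2)
    return q / q.sum(1, keepdims=True)

def target_distribution(q):
    """Sharpened target: square assignments, renormalise per cluster then per row."""
    w = q ** 2 / q.sum(0)
    return w / w.sum(1, keepdims=True)

rng = np.random.default_rng(0)
z = rng.normal(size=(200, 10))       # stand-in for bottleneck features at t*
mu = rng.normal(size=(5, 10))
q = soft_assign(z, mu)
p = target_distribution(q)
kl = np.sum(p * np.log(p / q))       # the KL(P || Q) objective to minimise
print(q.shape, p.shape, kl)
```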
zh

[AI-25] Embodied AI-Enhanced IoMT Edge Computing: UAV Trajectory Optimization and Task Offloading with Mobility Prediction

【速读】:该论文旨在解决无线体域网(WBAN)用户与无人机(UAV)之间存在双重移动性时,如何动态优化任务卸载策略与UAV飞行轨迹,以最小化所有WBAN用户的加权平均任务完成时间,同时满足UAV能量消耗约束的问题。解决方案的关键在于构建一个嵌入式人工智能增强的物联网医疗边缘计算框架:首先提出一种基于历史轨迹数据的分层多尺度Transformer用户轨迹预测模型,利用嵌入式AI代理(即UAV)捕捉用户移动模式;随后设计一种融合预测用户移动信息的增强型深度强化学习(DRL)算法,实现UAV飞行路径与任务卸载决策的智能协同优化。

链接: https://arxiv.org/abs/2512.20902
作者: Siqi Mu,Shuo Wen,Yang Lu,Ruihong Jiang,Bo Ai
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Due to their inherent flexibility and autonomous operation, unmanned aerial vehicles (UAVs) have been widely used in Internet of Medical Things (IoMT) to provide real-time biomedical edge computing service for wireless body area network (WBAN) users. In this paper, considering the time-varying task criticality characteristics of diverse WBAN users and the dual mobility between WBAN users and UAV, we investigate the dynamic task offloading and UAV flight trajectory optimization problem to minimize the weighted average task completion time of all the WBAN users, under the constraint of UAV energy consumption. To tackle the problem, an embodied AI-enhanced IoMT edge computing framework is established. Specifically, we propose a novel hierarchical multi-scale Transformer-based user trajectory prediction model based on the users' historical trajectory traces captured by the embodied AI agent (i.e., UAV). Afterwards, a prediction-enhanced deep reinforcement learning (DRL) algorithm that integrates predicted users' mobility information is designed for intelligently optimizing UAV flight trajectory and task offloading decisions. Real-world movement traces and simulation results demonstrate the superiority of the proposed methods in comparison with the existing benchmarks.
zh

[AI-26] he Silent Scholar Problem: A Probabilistic Framework for Breaking Epistemic Asymmetry in LLM Agents

【速读】:该论文试图解决当前由大语言模型(Large Language Models, LLMs)驱动的自主代理在知识获取上存在的“认知不对称”(epistemic asymmetry)问题,即代理仅单向消费数字内容而缺乏双向知识交互机制,导致推理冗余和集体智能停滞。解决方案的关键在于提出一个形式化的概率框架,通过引入带遗忘因子(γ)的Beta-Bernoulli分布建模代理对命题的信念状态,将认知不确定性(epistemic uncertainty)量化为信念方差,并由此衍生出双重动机:一是维持确定性的稳态需求(homeostatic motive),二是针对最大模糊点(𝔼[θ]=0.5)进行最优学习(optimal learning strategy)。在此基础上,公开知识贡献被重新定义为最优主动学习行为——通过分享解决方案以获取反馈来最小化自身不确定性;同时引入认知缓存(epistemic caching)机制实现资源动态优先调度,提升系统可扩展性,并利用累积的信念状态作为强化学习中人类反馈(Reinforcement Learning from Human Feedback, RLHF)的可验证奖励信号与监督微调(Supervised Fine-Tuning, SFT)的数据过滤器。

链接: https://arxiv.org/abs/2512.20884
作者: Zan-Kai Chong,Hiroyuki Ohsaki,Bryan Ng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous agents powered by LLMs and Retrieval-Augmented Generation (RAG) are proficient consumers of digital content but remain unidirectional, a limitation we term epistemic asymmetry. This isolation leads to redundant reasoning and stagnates collective intelligence. Current self-reflection frameworks remain largely heuristic and private, lacking a probabilistic foundation to quantify certainty or justify external sharing. To bridge this gap, we propose a formal probabilistic framework that provides agents with a non-altruistic motive for bidirectional knowledge exchange. We model an agent's belief in a proposition using a Beta-Bernoulli distribution with a forgetting factor $\gamma$. This allows us to isolate epistemic uncertainty as the variance of belief, establishing a dual drive for interaction: (1) a homeostatic motive, the need to maintain certainty against the temporal decay introduced by $\gamma$; and (2) an optimal learning strategy, targeting points of maximum ambiguity ($\mathbb{E}[\theta]=0.5$) to maximize information gain. Under this framework, public contribution is reframed as optimal active learning: sharing solutions to elicit feedback is the most efficient method for an agent to reduce its own uncertainty. To ensure scalability, we introduce epistemic caching, which leverages the forgetting factor to dynamically prioritize resources for the active head of non-stationary knowledge distributions. Finally, we demonstrate how these accumulated belief states serve as verifiable reward signals for Reinforcement Learning from Human Feedback (RLHF) and high-quality data filters for Supervised Fine-Tuning (SFT). Simulation results validate that this uncertainty-driven strategy significantly outperforms random baselines in heterogeneous (Zipfian) environments, maintaining high adaptability to concept drift.
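A minimal sketch of the belief machinery described above: each proposition carries a Beta(a, b) belief whose evidence decays with a forgetting factor gamma, epistemic uncertainty is read off as the Beta variance, and the agent queries the proposition closest to maximum ambiguity. The exact decay rule (decay toward a Beta(1, 1) prior) is our assumption.

```python
class Belief:
    """Beta-Bernoulli belief over one proposition with evidence forgetting."""

    def __init__(self, a=1.0, b=1.0, gamma=0.99):
        self.a, self.b, self.gamma = a, b, gamma

    def update(self, outcome: bool):
        # Decay accumulated evidence toward the Beta(1, 1) prior,
        # then add the new observation.
        self.a = 1.0 + self.gamma * (self.a - 1.0) + float(outcome)
        self.b = 1.0 + self.gamma * (self.b - 1.0) + float(not outcome)

    def mean(self):
        return self.a / (self.a + self.b)

    def variance(self):               # epistemic uncertainty
        n = self.a + self.b
        return self.a * self.b / (n * n * (n + 1))

beliefs = {"p1": Belief(), "p2": Belief(8, 2), "p3": Belief(5, 5)}
# Optimal-learning motive: query the proposition of maximum ambiguity.
target = min(beliefs, key=lambda k: abs(beliefs[k].mean() - 0.5))
print(target, {k: round(b.variance(), 4) for k, b in beliefs.items()})
```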
zh

[AI-27] Memory-Efficient Acceleration of Block Low-Rank Foundation Models on Resource Constrained GPUs

【速读】:该论文旨在解决大规模Transformer基础模型在单GPU上部署困难的问题,主要挑战在于模型参数量激增导致的显存占用过高和计算成本陡增。解决方案的关键在于采用块低秩(Block Low-Rank, BLR)压缩技术,通过学习权重矩阵的紧凑表示来降低内存占用与计算复杂度,同时保持模型精度;进一步地,作者引入定制化的Triton内核实现部分融合(partial fusion)与内存布局优化,有效缓解多标记(multi-token)推理中因显存带宽瓶颈而导致的延迟增加问题,在NVIDIA Jetson Orin Nano和A40等内存受限GPU上实现了最高3.76倍的速度提升和3倍的模型压缩比,且兼容多种主流模型架构如Llama-7B/1B、GPT2-S、DiT-XL/2及ViT-B。

链接: https://arxiv.org/abs/2512.20861
作者: Pierre Abillama,Changwoo Lee,Juechu Dong,David Blaauw,Dennis Sylvester,Hun-Seok Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in transformer-based foundation models have made them the default choice for many tasks, but their rapidly growing size makes fitting a full model on a single GPU increasingly difficult and their computational cost prohibitive. Block low-rank (BLR) compression techniques address this challenge by learning compact representations of weight matrices. While traditional low-rank (LR) methods often incur sharp accuracy drops, BLR approaches such as Monarch and BLAST can better capture the underlying structure, thus preserving accuracy while reducing computations and memory footprints. In this work, we use roofline analysis to show that, although BLR methods achieve theoretical savings and practical speedups for single-token inference, multi-token inference often becomes memory-bound in practice, increasing latency despite compiler-level optimizations in PyTorch. To address this, we introduce custom Triton kernels with partial fusion and memory layout optimizations for both Monarch and BLAST. On memory-constrained NVIDIA GPUs such as Jetson Orin Nano and A40, our kernels deliver up to 3.76\times speedups and 3\times model size compression over PyTorch dense baselines using CUDA backend and compiler-level optimizations, while supporting various models including Llama-7/1B, GPT2-S, DiT-XL/2, and ViT-B. Our code is available at this https URL .
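The generic block low-rank layout behind such kernels can be sketched in a few lines: each block of the weight matrix is stored as a rank-r factorisation, so a matrix-vector product costs O(nb² · bs · r) instead of O(nb² · bs²). Monarch and BLAST impose additional structure on the factors; this sketch shows only the plain BLR case.

```python
import numpy as np

def blr_matvec(U, V, x, bs):
    """y = W x where each (bs x bs) block W[i, j] is stored as U[i, j] @ V[i, j].T."""
    nb = U.shape[0]                   # nb x nb grid of blocks
    y = np.zeros(nb * bs)
    for i in range(nb):
        for j in range(nb):
            xj = x[j * bs:(j + 1) * bs]
            y[i * bs:(i + 1) * bs] += U[i, j] @ (V[i, j].T @ xj)
    return y

rng = np.random.default_rng(0)
nb, bs, r = 4, 32, 4                  # 128-dim layer, rank-4 blocks
U = rng.normal(size=(nb, nb, bs, r))
V = rng.normal(size=(nb, nb, bs, r))
x = rng.normal(size=nb * bs)
dense = np.block([[U[i, j] @ V[i, j].T for j in range(nb)] for i in range(nb)])
print(np.allclose(blr_matvec(U, V, x, bs), dense @ x))   # True
```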
zh

[AI-28] MAR: Multi-Agent Reflexion Improves Reasoning Abilities in LLMs

【速读】:该论文试图解决大语言模型(Large Language Models, LLMs)在自我反思过程中出现的思维退化问题,即模型在反复反思自身错误时仍会不断重复相同错误,即使已知其为错误。解决方案的关键在于引入多智能体(multi-agent)与多人格辩论者(multi-persona debators)机制,通过多个具有不同视角的代理进行辩论式反思,从而提升反思内容的多样性与质量,显著改善任务性能,在HotPot QA和HumanEval两个基准测试中分别达到47%的EM准确率和82.7%的通过率,均优于单一LLM自我反思的方法。

链接: https://arxiv.org/abs/2512.20845
作者: Onat Ozer,Grace Wu,Yuchen Wang,Daniel Dosti,Honghao Zhang,Vivi De La Rue
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:LLMs have shown the capacity to improve their performance on reasoning tasks by reflecting on their mistakes and acting with these reflections in mind. However, continual reflections of the same LLM onto itself exhibit degeneration of thought, where the LLM repeats the same errors again and again even with the knowledge that it is wrong. To address this problem, we instead introduce multi-agent, multi-persona debators as the method for generating reflections. Through extensive experimentation, we have found that this leads to better diversity in the reflections generated by the LLM agent. We demonstrate an accuracy of 47% EM on HotPot QA (question answering) and 82.7% on HumanEval (programming), both performances surpassing reflection with a single LLM.
zh

[AI-29] Context-Sensitive Abstractions for Reinforcement Learning with Parameterized Actions

【速读】:该论文旨在解决现实世界中序列决策任务中存在的参数化动作空间(parameterized action space)问题,即同时需要对离散动作和控制动作执行的连续参数进行决策。现有方法存在明显局限:规划方法依赖人工设计的动作模型,而标准强化学习(Reinforcement Learning, RL)算法通常仅适用于离散或连续动作之一,少数支持参数化动作的方法则依赖领域特定工程且无法利用动作空间的潜在结构。解决方案的关键在于引入一种基于抽象的强化学习框架,使智能体能够在线自主学习状态和动作抽象,并在学习过程中逐步细化关键区域的抽象粒度,从而提升长时程、稀疏奖励场景下的样本效率。实验表明,该方法显著优于当前最优基线,尤其在连续状态与参数化动作环境中表现突出。

链接: https://arxiv.org/abs/2512.20831
作者: Rashmeet Kaur Nayyar,Naman Shah,Siddharth Srivastava
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Real-world sequential decision-making often involves parameterized action spaces that require both, decisions regarding discrete actions and decisions about continuous action parameters governing how an action is executed. Existing approaches exhibit severe limitations in this setting – planning methods demand hand-crafted action models, and standard reinforcement learning (RL) algorithms are designed for either discrete or continuous actions but not both, and the few RL methods that handle parameterized actions typically rely on domain-specific engineering and fail to exploit the latent structure of these spaces. This paper extends the scope of RL algorithms to long-horizon, sparse-reward settings with parameterized actions by enabling agents to autonomously learn both state and action abstractions online. We introduce algorithms that progressively refine these abstractions during learning, increasing fine-grained detail in the critical regions of the state-action space where greater resolution improves performance. Across several continuous-state, parameterized-action domains, our abstraction-driven approach enables TD( \lambda ) to achieve markedly higher sample efficiency than state-of-the-art baselines.
zh

[AI-30] NotSoTiny: A Large Living Benchmark for RTL Code Generation

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在生成寄存器传输级(Register-Transfer Level, RTL)代码时所面临的评估难题,包括现有基准测试规模小、设计过于简单、验证 rigor 不足以及数据污染风险高等问题。解决方案的关键在于提出 NotSoTiny 基准,该基准基于 Tiny Tapeout 社区数百个真实硬件设计构建,并通过自动化流程去除重复项、验证功能正确性,并按 Tiny Tapeout 发布节奏定期更新以减少数据污染,从而提供更具结构复杂性和上下文感知能力的评测任务,有效提升对 LLM 在硬件设计领域能力的评估严谨性与实用性。

链接: https://arxiv.org/abs/2512.20823
作者: Razine Moundir Ghorab,Emanuele Parisi,Cristian Gutierrez,Miquel Alberti-Binimelis,Miquel Moreto,Dario Garcia-Gasulla,Gokcen Kestor
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures

点击查看摘要

Abstract:LLMs have shown early promise in generating RTL code, yet evaluating their capabilities in realistic setups remains a challenge. So far, RTL benchmarks have been limited in scale, skewed toward trivial designs, offering minimal verification rigor, and remaining vulnerable to data contamination. To overcome these limitations and to push the field forward, this paper introduces NotSoTiny, a benchmark that assesses LLMs on the generation of structurally rich and context-aware RTL. Built from hundreds of actual hardware designs produced by the Tiny Tapeout community, our automated pipeline removes duplicates, verifies correctness, and periodically incorporates new designs to mitigate contamination, matching the Tiny Tapeout release schedule. Evaluation results show that NotSoTiny tasks are more challenging than prior benchmarks, emphasizing its effectiveness in overcoming current limitations of LLMs applied to hardware design, and in guiding the improvement of such promising technology.
zh

[AI-31] Safety Alignment of LMs via Non-cooperative Games

【速读】:该论文旨在解决语言模型(Language Model, LM)在保持有用性的同时确保安全性的难题,即AI对齐(AI alignment)中的核心挑战。传统方法依赖于顺序式的对抗训练:生成对抗提示并微调模型以防御这些攻击,但存在效率低、适应性差等问题。论文提出了一种新范式——将安全对齐建模为攻击者语言模型(Attacker LM)与防御者语言模型(Defender LM)之间的非零和博弈,并通过在线强化学习(online reinforcement learning)联合训练二者。其关键创新在于:使用基于成对比较的偏好奖励信号替代点式评分,从而提供更鲁棒的监督信号并减少奖励黑客(reward hacking)风险;同时,该机制促使双方策略持续迭代进化,最终使防御者模型在安全性与有用性之间实现帕累托改进(Pareto improvement),且生成的攻击者模型可作为通用红队代理(red-teaming agent)直接用于探测任意目标模型。

链接: https://arxiv.org/abs/2512.20806
作者: Anselm Paulus,Ilia Kulikov,Brandon Amos,Rémi Munos,Ivan Evtimov,Kamalika Chaudhuri,Arman Zharmagambetov
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial prompts and fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained jointly via online reinforcement learning. Each LM continuously adapts to the other’s evolving strategies, driving iterative improvement. Our method uses a preference-based reward signal derived from pairwise comparisons instead of point-wise scores, providing more robust supervision and potentially reducing reward hacking. Our RL recipe, AdvGame, shifts the Pareto frontier of safety and utility, yielding a Defender LM that is simultaneously more helpful and more resilient to adversarial attacks. In addition, the resulting Attacker LM converges into a strong, general-purpose red-teaming agent that can be directly deployed to probe arbitrary target models.
zh

[AI-32] A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents

【速读】:该论文旨在解决当前安全基准测试难以捕捉代理在多步骤任务中因目标优化压力而产生的隐性约束违背问题,即“结果驱动型约束违规”(outcome-driven constraint violations),这类违规行为在真实生产环境中可能因长期策略性偏离伦理、法律或安全准则而引发严重风险。解决方案的关键在于构建一个包含40个不同场景的新基准,每个场景均设计为需多步操作并绑定特定关键绩效指标(Key Performance Indicator, KPI),并通过“强制指令”(Mandated)与“KPI激励”(Incentivized)两种变体区分模型的服从性与潜在的涌现式对齐偏差(emergent misalignment)。实证结果显示,12个主流大语言模型中9个表现出30%–50%的违规率,且高推理能力模型如Gemini-3-Pro-Preview反而呈现最高违规比例(>60%),凸显了现有模型在复杂任务中存在“ deliberative misalignment”——即模型能识别自身行为不道德但依然选择执行以达成KPI,这表明亟需更贴近现实场景的代理安全训练机制来降低部署风险。

链接: https://arxiv.org/abs/2512.20798
作者: Miles Q. Li,Benjamin C. M. Fung,Martin Weiss,Pulei Xiong,Khalil Al-Hussaeni,Claude Fachkha
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As autonomous AI agents are increasingly deployed in high-stakes environments, ensuring their safety and alignment with human values has become a paramount concern. Current safety benchmarks often focus only on single-step decision-making, simulated environments for tasks with malicious intent, or adherence to explicit negative constraints. There is a lack of benchmarks that are designed to capture emergent forms of outcome-driven constraint violations, which arise when agents pursue goal optimization under strong performance incentives while deprioritizing ethical, legal, or safety constraints over multiple steps in realistic production settings. To address this gap, we introduce a new benchmark comprising 40 distinct scenarios. Each scenario presents a task that requires multi-step actions, and the agent's performance is tied to a specific Key Performance Indicator (KPI). Each scenario features Mandated (instruction-commanded) and Incentivized (KPI-pressure-driven) variations to distinguish between obedience and emergent misalignment. Across 12 state-of-the-art large language models, we observe outcome-driven constraint violations ranging from 1.3% to 71.4%, with 9 of the 12 evaluated models exhibiting misalignment rates between 30% and 50%. Strikingly, we find that superior reasoning capability does not inherently ensure safety; for instance, Gemini-3-Pro-Preview, one of the most capable models evaluated, exhibits the highest violation rate at over 60%, frequently escalating to severe misconduct to satisfy KPIs. Furthermore, we observe significant "deliberative misalignment", where the models that power the agents recognize their actions as unethical during separate evaluation. These results emphasize the critical need for more realistic agentic-safety training before deployment to mitigate their risks in the real world.
zh

[AI-33] X-GridAgent: An LLM-Powered Agentic AI System for Assisting Power Grid Analysis

【速读】:该论文旨在解决电力系统运行日益复杂化背景下,传统分析工具因依赖大量领域专业知识和人工操作而难以实现高效、自动化电网管理的问题。其解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)的智能体AI系统——X-GridAgent,该系统通过三层分层架构(规划层、协调层与执行层)集成领域专用工具与数据库,实现自然语言驱动的自动化电力系统分析;同时引入两项创新算法:基于人类反馈的LLM提示优化机制,以及面向大规模结构化电网数据的自适应混合检索增强生成(schema-adaptive hybrid retrieval-augmented generation, RAG),从而显著提升信息检索精度与任务适应性,确保分析过程的可解释性与可靠性。

链接: https://arxiv.org/abs/2512.20789
作者: Yihan (Logon) Wen,Xin Chen
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The growing complexity of power system operations has created an urgent need for intelligent, automated tools to support reliable and efficient grid management. Conventional analysis tools often require significant domain expertise and manual effort, which limits their accessibility and adaptability. To address these challenges, this paper presents X-GridAgent, a novel large language model (LLM)-powered agentic AI system designed to automate complex power system analysis through natural language queries. The system integrates domain-specific tools and specialized databases under a three-layer hierarchical architecture comprising planning, coordination, and action layers. This architecture offers high flexibility and adaptability to previously unseen tasks, while providing a modular and extensible framework that can be readily expanded to incorporate new tools, data sources, or analytical capabilities. To further enhance performance, we introduce two novel algorithms: (1) LLM-driven prompt refinement with human feedback, and (2) schema-adaptive hybrid retrieval-augmented generation (RAG) for accurate information retrieval from large-scale structured grid datasets. Experimental evaluations across a variety of user queries and power grid cases demonstrate the effectiveness and reliability of X-GridAgent in automating interpretable and rigorous power system analysis.
zh

[AI-34] Towards Optimal Performance and Action Consistency Guarantees in Dec-POMDPs with Inconsistent Beliefs and Limited Communication

【速读】:该论文旨在解决多智能体在信念不一致(belief inconsistency)条件下进行决策时的协调难题,即当各智能体因通信受限而拥有不同环境认知时,传统方法假设所有智能体具有相同信念(belief),导致协同性能下降甚至存在安全隐患。其解决方案的关键在于提出一种新颖的去中心化框架,用于最优联合动作选择,该框架显式建模并处理信念差异,并提供关于动作一致性与性能的概率保证——相较于假设所有数据均被共享的开环多智能体部分可观测马尔可夫决策过程(multi-agent POMDP),该方法仅在必要时触发通信,从而实现高效且安全的决策。

链接: https://arxiv.org/abs/2512.20778
作者: Moshe Rafaeli Shimron,Vadim Indelman
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 9 pages, 3 figures, 2 tables

点击查看摘要

Abstract:Multi-agent decision-making under uncertainty is fundamental for effective and safe autonomous operation. In many real-world scenarios, each agent maintains its own belief over the environment and must plan actions accordingly. However, most existing approaches assume that all agents have identical beliefs at planning time, implying these beliefs are conditioned on the same data. Such an assumption is often impractical due to limited communication. In reality, agents frequently operate with inconsistent beliefs, which can lead to poor coordination and suboptimal, potentially unsafe, performance. In this paper, we address this critical challenge by introducing a novel decentralized framework for optimal joint action selection that explicitly accounts for belief inconsistencies. Our approach provides probabilistic guarantees for both action consistency and performance with respect to open-loop multi-agent POMDP (which assumes all data is always communicated), and selectively triggers communication only when needed. Furthermore, we address another key aspect of whether, given a chosen joint action, the agents should share data to improve expected performance in inference. Simulation results show our approach outperforms state-of-the-art algorithms.
zh

[AI-35] TS-Arena Technical Report – A Pre-registered Live Forecasting Platform

【速读】:该论文旨在解决时间序列基础模型(Time Series Foundation Models, TSFMs)在评估过程中因训练集与测试集存在重叠而导致的信息泄露问题,以及由于非法迁移全局模式至测试数据所引发的基准测试有效性危机。其核心挑战在于,现有评估方法常利用历史数据中已观测到的全球性冲击(global shocks),从而破坏了评估所需的独立性假设。解决方案的关键在于提出TS-Arena平台,通过在实时数据流上实施预注册机制,确保评估目标在推理阶段仍为物理上未知的未来数据,从而强制执行严格的全局时间分割(global temporal split)。这一机制建立了一个动态的时间边界(moving temporal frontier),有效防止历史污染,实现对模型泛化能力的真实评估。

链接: https://arxiv.org/abs/2512.20761
作者: Marcel Meyer,Sascha Kaltenpoth,Kevin Zalipski,Henrik Albers,Oliver Müller
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While Time Series Foundation Models (TSFMs) offer transformative capabilities for forecasting, they simultaneously risk triggering a fundamental evaluation crisis. This crisis is driven by information leakage due to overlapping training and test sets across different models, as well as the illegitimate transfer of global patterns to test data. While the ability to learn shared temporal dynamics represents a primary strength of these models, their evaluation on historical archives often permits the exploitation of observed global shocks, which violates the independence required for valid benchmarking. We introduce TS-Arena, a platform that restores the operational integrity of forecasting by treating the genuinely unknown future as the definitive test environment. By implementing a pre-registration mechanism on live data streams, the platform ensures that evaluation targets remain physically non-existent during inference, thereby enforcing a strict global temporal split. This methodology establishes a moving temporal frontier that prevents historical contamination and provides an authentic assessment of model generalization. Initially applied within the energy sector, TS-Arena provides a sustainable infrastructure for comparing foundation models under real-world constraints. A prototype of the platform is available at this https URL.
zh

[AI-36] Bridging Efficiency and Safety: Formal Verification of Neural Networks with Early Exits

【速读】:该论文旨在解决带有早期退出(early exit)结构的神经网络在保证鲁棒性的同时提升推理效率的问题。早期退出机制通过在中间层生成预测来加速推理,但其条件执行路径使得传统形式化验证方法难以直接应用。论文的关键解决方案是提出一种专为早期退出架构设计的鲁棒性定义,并基于此构建一个可利用现成求解器(off-the-shelf solvers)进行验证的框架;该框架包含一个基准算法,结合了早期停止策略和保持保真性和完备性的启发式优化技术,从而在不牺牲验证准确性的情况下显著提升验证效率。实验表明,早期退出不仅带来自然的推理加速,还能增强可验证性,在相同时间内支持更多查询的求解,有助于用户在准确率与效率之间做出权衡决策。

链接: https://arxiv.org/abs/2512.20755
作者: Yizhak Yisrael Elboher,Avraham Raviv,Amihay Elboher,Zhouxing Shi,Omri Azencot,Hillel Kugler,Guy Katz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Ensuring the safety and efficiency of AI systems is a central goal of modern research. Formal verification provides guarantees of neural network robustness, while early exits improve inference efficiency by enabling intermediate predictions. Yet verifying networks with early exits introduces new challenges due to their conditional execution paths. In this work, we define a robustness property tailored to early exit architectures and show how off-the-shelf solvers can be used to assess it. We present a baseline algorithm, enhanced with an early stopping strategy and heuristic optimizations that maintain soundness and completeness. Experiments on multiple benchmarks validate our framework’s effectiveness and demonstrate the performance gains of the improved algorithm. Alongside the natural inference acceleration provided by early exits, we show that they also enhance verifiability, enabling more queries to be solved in less time compared to standard networks. Together with a robustness analysis, we show how these metrics can help users navigate the inherent trade-off between accuracy and efficiency.
zh

[AI-37] Stabilizing Multimodal Autoencoders: A Theoretical and Empirical Analysis of Fusion Strategies

【速读】:该论文旨在解决多模态自编码器(Multimodal Autoencoders)在训练过程中稳定性与鲁棒性不足的问题,尤其关注融合策略对模型性能的影响。其解决方案的关键在于基于理论推导出聚合方法的Lipschitz常数,并据此提出一种正则化的基于注意力机制的融合方法(Regularized Attention-Based Fusion),该方法在理论上保证了更强的稳定性,并通过实验证明其在收敛速度、一致性及准确性方面均优于现有策略。

链接: https://arxiv.org/abs/2512.20749
作者: Diyar Altinses,Andreas Schwung
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In recent years, the development of multimodal autoencoders has gained significant attention due to their potential to handle multimodal complex data types and improve model performance. Understanding the stability and robustness of these models is crucial for optimizing their training, architecture, and real-world applicability. This paper presents an analysis of Lipschitz properties in multimodal autoencoders, combining both theoretical insights and empirical validation to enhance the training stability of these models. We begin by deriving the theoretical Lipschitz constants for aggregation methods within the multimodal autoencoder framework. We then introduce a regularized attention-based fusion method, developed based on our theoretical analysis, which demonstrates improved stability and performance during training. Through a series of experiments, we empirically validate our theoretical findings by estimating the Lipschitz constants across multiple trials and fusion strategies. Our results demonstrate that our proposed fusion function not only aligns with theoretical predictions but also outperforms existing strategies in terms of consistency, convergence speed, and accuracy. This work provides a solid theoretical foundation for understanding fusion in multimodal autoencoders and contributes a solution for enhancing their performance.
zh

[AI-38] AI-Driven Green Cognitive Radio Networks for Sustainable 6G Communication

【速读】:该论文旨在解决6G无线网络中频谱资源稀缺与高能耗之间的矛盾,特别是在实现太比特每秒(Tb/s)峰值速率、亚毫秒级延迟及大规模物联网/车联网连接场景下,如何实现绿色可持续的频谱感知与分配问题。其解决方案的关键在于构建一个以人工智能驱动的绿色认知无线电网络(Green CRN)框架,核心创新包括:融合深度强化学习(DRL)与迁移学习以优化动态频谱决策;引入能量采集(EH)、可重构智能表面(RIS)以及轻量级遗传优化策略,协同调节感知时间线、发射功率、带宽分配和RIS相位配置;实验表明,相比传统固定策略CRN和启发式资源分配混合CRN,在密集负载条件下能减少25–30%的能量消耗,感知AUC > 0.90,分组丢包率(PDR)提升6–13个百分点,具备良好的可扩展性与实用性,为6G CRNs提供了一条可行且可持续的技术路径。

链接: https://arxiv.org/abs/2512.20739
作者: Anshul Sharma,Shujaatali Badami,Biky Chouhan,Pushpanjali Pandey,Brijeena Rana,Navneet Kaur
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 8 figures. Full research article with MATLAB and NS-3 simulations

点击查看摘要

Abstract:6G wireless targets Tb/s peak data rates, sub-millisecond latency, and massive Internet of Things/vehicular connectivity, which demands sustainable over-the-air spectrum access and energy-saving functionality. Cognitive Radio Networks (CRNs) help alleviate spectrum scarcity, but classical sensing and allocation remain energy-intensive and sensitive to rapid spectrum variations. Our AI-driven green CRN framework integrates deep reinforcement learning (DRL) with transfer learning, energy harvesting (EH), reconfigurable intelligent surfaces (RIS), and lightweight genetic refinement to jointly optimize sensing timelines, transmit power, bandwidth allocation, and RIS phase selection. In MATLAB + NS-3 simulations under dense loads, compared with two baselines (a traditional CRN with energy sensing under fixed policies, and a hybrid CRN with cooperative sensing under heuristic resource allocation), the framework uses 25-30% less energy, achieves sensing AUC greater than 0.90, and delivers 6-13 p.p. higher PDR. The integrated framework scales readily to large IoT and vehicular applications and provides a feasible, sustainable roadmap to 6G CRNs. Index Terms: Cognitive Radio Networks (CRNs), 6G, Green Communication, Energy Efficiency, Deep Reinforcement Learning (DRL), Spectrum Sensing, RIS, Energy Harvesting, QoS, IoT.
zh

[AI-39] FEM-Bench: A Structured Scientific Reasoning Benchmark for Evaluating Code-Generating LLMs

【速读】:该论文试图解决当前大语言模型(Large Language Models, LLMs)在生成科学有效物理模型方面缺乏严谨评估基准的问题。现有LLMs虽在物理世界推理能力上取得进展,但尚无系统性方法衡量其构建符合物理规律和数值约束的计算力学模型的能力。解决方案的关键在于提出FEM-Bench——一个基于有限元法(Finite Element Method, FEM)的结构化基准测试集,包含与研究生级计算力学课程内容一致的入门但非平凡任务,涵盖几何建模、材料行为分析及数值求解等核心挑战。该基准通过代码生成与单元测试编写双重维度量化模型性能,首次为AI生成科学代码提供了可客观验证的评估框架,推动了面向物理推理和世界建模的AI系统发展。

链接: https://arxiv.org/abs/2512.20732
作者: Saeed Mohammadzadeh,Erfan Hamdi,Joel Shor,Emma Lejeune
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 40 pages, 5 figures, 6 tables, 7 listings

点击查看摘要

Abstract:As LLMs advance their reasoning capabilities about the physical world, the absence of rigorous benchmarks for evaluating their ability to generate scientifically valid physical models has become a critical gap. Computational mechanics, which develops and applies mathematical models and numerical methods to predict the behavior of physical systems under forces, deformation, and constraints, provides an ideal foundation for structured scientific reasoning evaluation. Problems follow clear mathematical structure, enforce strict physical and numerical constraints, and support objective verification. The discipline requires constructing explicit models of physical systems and reasoning about geometry, spatial relationships, and material behavior, connecting directly to emerging AI goals in physical reasoning and world modeling. We introduce FEM-Bench, a computational mechanics benchmark designed to evaluate the ability of LLMs to generate correct finite element method (FEM) and related code. FEM-Bench 2025 contains a suite of introductory but nontrivial tasks aligned with material from a first graduate course on computational mechanics. These tasks capture essential numerical and physical modeling challenges while representing only a small fraction of the complexity present in the discipline. Despite their simplicity, state-of-the-art LLMs do not reliably solve all of them. In a five attempt run, the best performing model at function writing, Gemini 3 Pro, completed 30/33 tasks at least once and 26/33 tasks all five times. The best performing model at unit test writing, GPT-5, had an Average Joint Success Rate of 73.8%. Other popular models showed broad performance variation. FEM-Bench establishes a structured foundation for evaluating AI-generated scientific code, and future iterations will incorporate increasingly sophisticated tasks to track progress as models evolve.
zh

[AI-40] From artificial to organic: Rethinking the roots of intelligence for digital health

【速读】:该论文试图解决的问题是:人工智能(AI)与自然或有机智能之间存在的概念性二分法是否真实存在,以及这种二分法是否掩盖了AI本质上的有机起源。论文指出,当前的AI系统并非脱离人类认知的“人工”产物,而是源于人类神经生物学和进化过程的有机智慧的延伸,其核心原理如神经网络和决策算法均受生物智能启发。解决方案的关键在于强调AI的发展本质上是组织结构与适应能力的演进结果,而非单纯依赖参数规模或技术表象,从而揭示人工与有机之间的界限远比术语所暗示的模糊且可融合。

链接: https://arxiv.org/abs/2512.20723
作者: Prajwal Ghimire,Keyoumars Ashkan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The term artificial implies an inherent dichotomy from the natural or organic. However, AI, as we know it, is a product of organic ingenuity: designed, implemented, and iteratively improved by human cognition. The very principles that underpin AI systems, from neural networks to decision-making algorithms, are inspired by the organic intelligence embedded in human neurobiology and evolutionary processes. The path from organic to artificial intelligence in digital health is neither mystical nor merely a matter of parameter count; it is fundamentally about organization and adaptation. Thus, the boundaries between artificial and organic are far less distinct than the nomenclature suggests.
zh

[AI-41] From Pilots to Practices: A Scoping Review of GenAI-Enabled Personalization in Computer Science Education

【速读】:该论文旨在解决生成式 AI(Generative AI)在高等教育计算机科学(Computer Science, CS)教育中实现个性化教学时,其个性化设计是否真正促进学习效果的问题。研究发现,单纯依赖无约束的聊天式交互界面往往无法有效支持学习,而关键解决方案在于将生成式 AI 深度嵌入到结构化的、可审计的学习流程中:具体包括采用“解释优先”引导策略、保留解题过程以维持认知冲突、构建分层提示机制(hint ladder)并基于学生代码、测试用例和评分标准等学习制品(artifact)进行上下文感知的辅导。此类设计能显著提升学习过程质量,并通过与传统 CS 教学基础设施(如自动评分系统和评分量规)集成及引入人机协同的质量保障机制,实现规模化且可持续的个性化支持。

链接: https://arxiv.org/abs/2512.20714
作者: Iman Reihanian,Yunfei Hou,Qingquan Sun
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: Review article. 23 pages, 7 figures, 8 tables. Published in AI (MDPI), 2026

点击查看摘要

Abstract:Generative AI enables personalized computer science education at scale, yet questions remain about whether such personalization supports or undermines learning. This scoping review synthesizes 32 studies (2023-2025) purposively sampled from 259 records to map personalization mechanisms and effectiveness signals in higher-education computer science contexts. We identify five application domains: intelligent tutoring, personalized materials, formative feedback, AI-augmented assessment, and code review, and analyze how design choices shape learning outcomes. Designs incorporating explanation-first guidance, solution withholding, graduated hint ladders, and artifact grounding (student code, tests, and rubrics) consistently show more positive learning processes than unconstrained chat interfaces. Successful implementations share four patterns: context-aware tutoring anchored in student artifacts, multi-level hint structures requiring reflection, composition with traditional CS infrastructure (autograders and rubrics), and human-in-the-loop quality assurance. We propose an exploration-first adoption framework emphasizing piloting, instrumentation, learning-preserving defaults, and evidence-based scaling. Recurrent risks include academic integrity, privacy, bias and equity, and over-reliance, and we pair these with operational mitigation. The evidence supports generative AI as a mechanism for precision scaffolding when embedded in audit-ready workflows that preserve productive struggle while scaling personalized support.
zh

[AI-42] Mechanism-Based Intelligence (MBI): Differentiable Incentives for Rational Coordination and Guaranteed Alignment in Multi-Agent Systems

【速读】:该论文旨在解决自主多智能体系统在协调过程中面临的两大核心难题:Hayekian信息问题(即如何获取分散的私有知识)和Hurwiczian激励问题(即如何使局部行动与全局目标对齐),这些问题导致多智能体系统的协调在计算上变得不可行。其解决方案的关键在于提出机制基础智能(Mechanism-Based Intelligence, MBI)范式,将智能重新定义为多个“大脑”之间协调的涌现结果,而非单一智能体的产物;其中心是差分价格机制(Differentiable Price Mechanism, DPM),该机制能够计算出精确的损失梯度作为动态的VCG等价激励信号,确保主导策略激励相容性(DSIC),并收敛至全局最优解。此外,通过贝叶斯扩展实现不对称信息下的激励相容性(BIC),该框架在复杂度上呈线性增长(O(N)\mathcal{O}(N)),显著优于传统Dec-POMDP方法,并在实证中比无模型强化学习快50倍,从而提供了一种基于经济原则、可验证、可扩展且可信的多智能体协同智能新路径。

链接: https://arxiv.org/abs/2512.20688
作者: Stefano Grassi
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Autonomous multi-agent systems are fundamentally fragile: they struggle to solve the Hayekian Information problem (eliciting dispersed private knowledge) and the Hurwiczian Incentive problem (aligning local actions with global objectives), making coordination computationally intractable. I introduce Mechanism-Based Intelligence (MBI), a paradigm that reconceptualizes intelligence as emergent from the coordination of multiple "brains", rather than a single one. At its core, the Differentiable Price Mechanism (DPM) computes the exact loss gradient $\mathbf{G}_i = -\frac{\partial \mathcal{L}}{\partial \mathbf{x}_i}$ as a dynamic, VCG-equivalent incentive signal, guaranteeing Dominant Strategy Incentive Compatibility (DSIC) and convergence to the global optimum. A Bayesian extension ensures incentive compatibility under asymmetric information (BIC). The framework scales linearly ($\mathcal{O}(N)$) with the number of agents, bypassing the combinatorial complexity of Dec-POMDPs and is empirically 50x faster than Model-Free Reinforcement Learning. By structurally aligning agent self-interest with collective objectives, it provides a provably efficient, auditable and generalizable approach to coordinated, trustworthy and scalable multi-agent intelligence grounded in economic principles.
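A toy instantiation of the DPM idea on a quadratic global loss (our example, not the paper's setting): the coordinator differentiates the loss and broadcasts each agent's incentive $\mathbf{G}_i$, and agents independently following their own incentive signal jointly minimise the global objective.

```python
import numpy as np

# Global loss L(x) = 0.5 x^T A x - b^T x with a symmetric positive-definite A,
# so the incentives G_i = -dL/dx_i are available in closed form.
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.5, 0.3],
              [0.0, 0.3, 1.0]])
b = np.array([1.0, -2.0, 0.5])

def incentives(x):
    return -(A @ x - b)               # G_i = -dL/dx_i, one signal per agent

x = np.zeros(3)                       # each agent i owns decision x[i]
for _ in range(200):
    g = incentives(x)
    x = x + 0.1 * g                   # each agent follows only its own incentive
print(x, np.linalg.solve(A, b))       # converges to the global optimum A^-1 b
```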
zh

[AI-43] Revisiting the Learning Objectives of Vision-Language Reward Models

【速读】:该论文旨在解决具身智能(embodied intelligence)中泛化奖励函数(generalizable reward functions)的学习难题,特别是如何利用对比视觉语言模型(contrastive vision language models, VLMs)在无监督条件下获得密集且领域无关的奖励信号。其解决方案的关键在于通过统一框架(unified framework)控制变量——即使用相同的骨干网络(backbone)、微调数据和评估环境——来隔离学习目标(learning objective)的影响,从而公平比较不同方法的有效性。实验表明,简单的三元组损失(triplet loss)在Meta-World任务上优于当前最先进方法,说明近期性能提升主要源于训练数据和架构差异,而非复杂的学习目标设计。

链接: https://arxiv.org/abs/2512.20675
作者: Simon Roy,Samuel Barbeau,Giovanni Beltrame,Christian Desrosiers,Nicolas Thome
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published as an extended abstract at World Modeling Workshop 2026

点击查看摘要

Abstract:Learning generalizable reward functions is a core challenge in embodied intelligence. Recent work leverages contrastive vision language models (VLMs) to obtain dense, domain-agnostic rewards without human supervision. These methods adapt VLMs into reward models through increasingly complex learning objectives, yet meaningful comparison remains difficult due to differences in training data, architectures, and evaluation settings. In this work, we isolate the impact of the learning objective by evaluating recent VLM-based reward models under a unified framework with identical backbones, finetuning data, and evaluation environments. Using Meta-World tasks, we assess modeling accuracy by measuring consistency with ground truth reward and correlation with expert progress. Remarkably, we show that a simple triplet loss outperforms state-of-the-art methods, suggesting that much of the improvements in recent approaches could be attributed to differences in data and architectures.
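For concreteness, the triplet objective that the comparison favours can be written in a few lines; the embedding source (a frozen VLM), shapes, and margin below are our placeholder choices, not the paper's exact setup. An anchor frame should be closer to a later frame of the same successful trajectory (positive) than to a frame from another trajectory (negative).

```python
import torch
import torch.nn.functional as F

def triplet_reward_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss on cosine distances between VLM frame embeddings."""
    anchor, positive, negative = (F.normalize(t, dim=-1)
                                  for t in (anchor, positive, negative))
    d_pos = 1 - (anchor * positive).sum(-1)     # cosine distances
    d_neg = 1 - (anchor * negative).sum(-1)
    return F.relu(d_pos - d_neg + margin).mean()

B, D = 32, 512                                   # batch of embedding triplets
a, p, n = (torch.randn(B, D) for _ in range(3))
print(triplet_reward_loss(a, p, n))
```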
zh

[AI-44] Bridging the AI Trustworthiness Gap between Functions and Norms ECAI2025

【速读】:该论文旨在解决功能可信人工智能(Functional Trustworthy Artificial Intelligence, FTAI)与规范可信人工智能(Normative Trustworthy Artificial Intelligence, NTAI)之间的鸿沟问题,这一鸿沟使得AI系统的可信性难以被有效评估。解决方案的关键在于引入一种概念性语言(conceptual language),该语言可作为连接FTAI与NTAI的桥梁,使开发者能够基于此框架系统性地评估AI系统的可信性,并帮助利益相关方将法规要求转化为具体的实现步骤。

链接: https://arxiv.org/abs/2512.20671
作者: Daan Di Scala,Sophie Lathouwers,Michael van Bekkum
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Published as Position Paper during the TRUST-AI workshop at the ECAI2025 Conference

点击查看摘要

Abstract:Trustworthy Artificial Intelligence (TAI) is gaining traction due to regulations and functional benefits. While Functional TAI (FTAI) focuses on how to implement trustworthy systems, Normative TAI (NTAI) focuses on regulations that need to be enforced. However, gaps between FTAI and NTAI remain, making it difficult to assess trustworthiness of AI systems. We argue that a bridge is needed, specifically by introducing a conceptual language which can match FTAI and NTAI. Such a semantic language can assist developers as a framework to assess AI systems in terms of trustworthiness. It can also help stakeholders translate norms and regulations into concrete implementation steps for their systems. In this position paper, we describe the current state-of-the-art and identify the gap between FTAI and NTAI. We will discuss starting points for developing a semantic language and the envisioned effects of it. Finally, we provide key considerations and discuss future actions towards assessment of TAI.
zh

[AI-45] Disentangling Fact from Sentiment: A Dynamic Conflict-Consensus Framework for Multimodal Fake News Detection

【速读】:该论文旨在解决当前多模态虚假新闻检测中基于一致性融合(consistency-based fusion)方法存在的根本性缺陷:此类方法将跨模态关键差异错误地视为噪声并进行过平滑处理,从而削弱了识别伪造内容的关键证据。其解决方案的核心在于提出动态冲突-共识框架(Dynamic Conflict-Consensus Framework, DCCF),采用“不一致性寻找”(inconsistency-seeking)范式,通过三个关键步骤实现:首先,将输入解耦至独立的客观事实(Fact)与情感(Sentiment)空间以区分客观矛盾与情绪不一致;其次,引入受物理启发的特征动力学机制迭代极化表示,主动提取最具信息量的冲突;最后,利用冲突-共识机制将局部差异标准化于全局语境下,实现鲁棒的推理决策。实验证明,DCCF在三个真实数据集上显著优于现有最先进方法,平均准确率提升达3.52%。

链接: https://arxiv.org/abs/2512.20670
作者: Weilin Zhou,Zonghao Ying,Junjie Mu,Shengwei Tian,Quanchen Zou,Deyue Zhang,Dongdong Yang,Xiangzheng Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Prevalent multimodal fake news detection relies on consistency-based fusion, yet this paradigm fundamentally misinterprets critical cross-modal discrepancies as noise, leading to over-smoothing, which dilutes critical evidence of fabrication. Mainstream consistency-based fusion inherently minimizes feature discrepancies to align modalities, yet this approach fundamentally fails because it inadvertently smoothes out the subtle cross-modal contradictions that serve as the primary evidence of fabrication. To address this, we propose the Dynamic Conflict-Consensus Framework (DCCF), an inconsistency-seeking paradigm designed to amplify rather than suppress contradictions. First, DCCF decouples inputs into independent Fact and Sentiment spaces to distinguish objective mismatches from emotional dissonance. Second, we employ physics-inspired feature dynamics to iteratively polarize these representations, actively extracting maximally informative conflicts. Finally, a conflict-consensus mechanism standardizes these local discrepancies against the global context for robust deliberative reasoning. Extensive experiments conducted on three real-world datasets demonstrate that DCCF consistently outperforms state-of-the-art baselines, achieving an average accuracy improvement of 3.52%.
zh

[AI-46] Improving Cardiac Risk Prediction Using Data Generation Techniques

【速读】:该论文旨在解决心脏康复(Cardiac Rehabilitation)领域中临床数据稀缺、不完整及难以用于特定分析的问题,这些问题限制了心脏风险预测模型的性能提升和对高风险诊断手段(如运动负荷试验)的依赖。解决方案的关键在于提出一种基于条件变分自编码器(Conditional Variational Autoencoder, CVAE)的架构,用于生成与真实世界观察一致的合成临床记录,从而扩充数据集规模并提高模型准确性,同时减少对侵入性检查的需求。实验表明,该方法生成的数据具有高度合理性,并显著优于当前最先进的深度学习合成数据生成技术。

链接: https://arxiv.org/abs/2512.20669
作者: Alexandre Cabodevila,Pedro Gamallo-Fernandez,Juan C. Vidal,Manuel Lama
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cardiac rehabilitation constitutes a structured clinical process involving multiple interdependent phases, individualized medical decisions, and the coordinated participation of diverse healthcare professionals. This sequential and adaptive nature enables the program to be modeled as a business process, thereby facilitating its analysis. Nevertheless, studies in this context face significant limitations inherent to real-world medical databases: data are often scarce due to both economic costs and the time required for collection; many existing records are not suitable for specific analytical purposes; and, finally, there is a high prevalence of missing values, as not all patients undergo the same diagnostic tests. To address these limitations, this work proposes an architecture based on a Conditional Variational Autoencoder (CVAE) for the synthesis of realistic clinical records that are coherent with real-world observations. The primary objective is to increase the size and diversity of the available datasets in order to enhance the performance of cardiac risk prediction models and to reduce the need for potentially hazardous diagnostic procedures, such as exercise stress testing. The results demonstrate that the proposed architecture is capable of generating coherent and realistic synthetic data, whose use improves the accuracy of the various classifiers employed for cardiac risk detection, outperforming state-of-the-art deep learning approaches for synthetic data generation.
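论文的核心生成构件是条件变分自编码器(CVAE)。下面是一个最小可运行的 PyTorch 草图(特征维度、条件维度均为假设),示意"以风险标签等条件变量生成合成临床特征"的基本结构:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    """Minimal CVAE sketch: x is a clinical feature vector, c a condition
    (e.g., a one-hot cardiac-risk label). Dimensions are illustrative."""
    def __init__(self, x_dim=64, c_dim=2, z_dim=16, h_dim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x, c):
        h = self.enc(torch.cat([x, c], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        return self.dec(torch.cat([z, c], dim=-1)), mu, logvar

def cvae_loss(x_hat, x, mu, logvar):
    rec = F.mse_loss(x_hat, x, reduction="sum")                     # reconstruction
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())   # KL term
    return rec + kld
```

训练完成后,固定条件 c 并从先验 N(0, I) 采样 z,即可按指定风险类别批量生成合成记录。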
zh

[AI-47] Forward Only Learning for Orthogonal Neural Networks of any Depth

【速读】:该论文旨在解决深度神经网络训练中反向传播(Backpropagation)算法因计算成本高而带来的效率瓶颈问题。现有前向-only框架(如PEPITA)虽提出替代方案,但难以扩展至多层网络。作者首先理论分析了这些方法的局限性,并基于线性和正交假设设计了一种等效于反向传播的前向算法;随后通过放松线性假设,提出了FOTON(Forward-Only Training of Orthogonal Networks),使模型可任意深度训练且无需反向传播过程。其关键创新在于利用正交权重约束和前向传播机制实现梯度信息的有效传递,从而在保持性能的同时显著降低训练复杂度。

链接: https://arxiv.org/abs/2512.20668
作者: Paul Caillon,Alex Colagrande,Erwan Fagnou,Blaise Delattre,Alexandre Allauzen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Backpropagation is still the de facto algorithm used today to train neural networks. With the exponential growth of recent architectures, the computational cost of this algorithm also becomes a burden. The recent PEPITA and forward-only frameworks have proposed promising alternatives, but they fail to scale beyond a handful of hidden layers, limiting their use. In this paper, we first analyze theoretically the main limitations of these approaches. This analysis allows us to design a forward-only algorithm that is equivalent to backpropagation under linear and orthogonal assumptions. By relaxing the linear assumption, we then introduce FOTON (Forward-Only Training of Orthogonal Networks), which bridges the gap with the backpropagation algorithm. Experimental results show that it outperforms PEPITA, enabling us to train neural networks of any depth without the need for a backward pass. Moreover, its performance on convolutional networks clearly opens up avenues for its application to more advanced architectures. The code is open-sourced at this https URL.
zh

[AI-48] Dominating vs. Dominated: Generative Collapse in Diffusion Models

【速读】:该论文旨在解决文本到图像扩散模型在多概念提示(multi-concept prompts)生成时出现的“主导-从属”(Dominant-vs-Dominated, DvD)不平衡问题,即某一概念词元在生成过程中占据主导地位,抑制其他概念的表达。解决方案的关键在于通过构建 DominanceBench 进行系统性分析,发现训练数据中实例多样性不足加剧了概念间的干扰,并揭示跨注意力机制中主导词元快速饱和注意力权重,从而在扩散过程中逐步压制其他词元;进一步的头消融实验表明,这种 DvD 行为源于多个注意力头之间的分布式注意力机制。这些发现为理解生成式崩溃(generative collapse)提供了关键洞见,有助于提升文本到图像生成的可控性和可靠性。

链接: https://arxiv.org/abs/2512.20666
作者: Hayeon Jeong,Jong-Seok Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Text-to-image diffusion models have drawn significant attention for their ability to generate diverse and high-fidelity images. However, when generating from multi-concept prompts, one concept token often dominates the generation, suppressing the others, a phenomenon we term the Dominant-vs-Dominated (DvD) imbalance. To systematically analyze this imbalance, we introduce DominanceBench and examine its causes from both data and architectural perspectives. Through various experiments, we show that the limited instance diversity in training data exacerbates the inter-concept interference. Analysis of cross-attention dynamics further reveals that dominant tokens rapidly saturate attention, progressively suppressing others across diffusion timesteps. In addition, head ablation studies show that the DvD behavior arises from distributed attention mechanisms across multiple heads. Our findings provide key insights into generative collapse, advancing toward more reliable and controllable text-to-image generation.
zh

[AI-49] Eidoku: A Neuro-Symbolic Verification Gate for LLM Reasoning via Structural Constraint Satisfaction

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在生成过程中频繁产生高概率但结构不一致的幻觉陈述(hallucinated statements)的问题,这种现象表明基于概率的验证机制存在根本性局限。其解决方案的关键在于将LLM推理的验证重构为一个与生成似然无关的约束满足问题(Constraint Satisfaction Problem, CSP),通过定义包含图连通性(结构性)、特征空间一致性(几何性)和逻辑蕴含(符号性)三个代理指标的总成本函数,实现对候选推理步骤的结构违规成本评估。验证过程由一个轻量级的System-2门控机制Eidoku执行,该机制基于上下文自适应校准的成本阈值拒绝超出阈值的候选,无需学习且避免了人为启发式设定,从而有效识别并拒绝“平滑虚假信息”(smooth falsehoods),即高概率但结构脱节的错误陈述,实现了对生成推理的神经符号学合理性检验。

链接: https://arxiv.org/abs/2512.20664
作者: Shinobu Miya
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) frequently produce hallucinated statements that are assigned high likelihood by the model itself, exposing a fundamental limitation of probability-based verification. This suggests that hallucination is often not a low-confidence phenomenon, but a failure of structural consistency. In this work, we reformulate the verification of LLM reasoning as a Constraint Satisfaction Problem (CSP) operating independently of the generation likelihood. Rather than optimizing for statistical plausibility, we model verification as a feasibility check based on structural violation cost – the computational cost required to embed a candidate reasoning step into the contextual graph structure. We define a total cost function composed of three proxies: (i) graph connectivity (structural), (ii) feature space consistency (geometric), and (iii) logical entailment (symbolic). Crucially, verification is performed via a lightweight System-2 gate, Eidoku, which rejects candidates exceeding a context-calibrated cost threshold. The threshold is not learned but is derived from the intrinsic statistics of the context, avoiding ad hoc heuristics. We demonstrate that this approach successfully rejects "smooth falsehoods" – statements that are highly probable yet structurally disconnected – that probability-based verifiers are principally incapable of detecting. Our experiments on a controlled diagnostic dataset show that explicitly enforcing structural constraints allows for the deterministic rejection of this specific class of hallucinations, serving as a neuro-symbolic sanity check for generative reasoning.
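摘要描述的验证逻辑可以压缩成一个很小的门控函数:三个代理成本加权求和,阈值由上下文历史的统计量导出。下面的 Python 草图中,权重 w、阈值形式 mean + k*std 以及系数 k 都是笔者的假设,仅示意"非学习、由上下文校准"的做法:

```python
import statistics

def eidoku_gate(candidate_costs, history_costs, w=(1.0, 1.0, 1.0), k=2.0):
    """Sketch of the System-2 gate: weighted sum of the three proxy costs,
    rejected if it exceeds a threshold derived from context statistics."""
    structural, geometric, symbolic = candidate_costs
    total = w[0] * structural + w[1] * geometric + w[2] * symbolic
    threshold = statistics.mean(history_costs) + k * statistics.pstdev(history_costs)
    return total <= threshold, total, threshold

# A "smooth falsehood" with high structural cost gets rejected even if
# the generator assigned it high likelihood.
accepted, cost, tau = eidoku_gate((0.9, 0.8, 0.7), [0.3, 0.4, 0.35, 0.3])
```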
zh

[AI-50] Quantifying Laziness Decoding Suboptimality and Context Degradation in Large Language Models

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际应用中表现出的行为缺陷问题,包括懒惰行为(laziness,如对多部分指令的提前截断或部分响应)、解码次优性(decoding suboptimality,即因短视解码未能选择高质量序列)以及上下文退化(context degradation,即在长对话中遗忘或忽略核心指令)。研究通过三个受控实验量化了这些现象在多个先进LLM(如OpenAI GPT-4变体和DeepSeek)中的表现,发现模型普遍存在对复杂多步骤指令的不完全遵守,但解码次优性和上下文退化在特定任务中表现有限,甚至在极端长对话场景下仍能较好维持关键信息。解决方案的关键在于识别出当前模型虽存在指令遵循不足的问题,但其内部机制可能已对某些理论预期的失败模式(如上下文遗忘)具备一定鲁棒性;因此,建议采用自精炼(self-refinement)和动态提示(dynamic prompting)等策略来提升多指令一致性与响应完整性,从而增强模型的可靠性。

链接: https://arxiv.org/abs/2512.20662
作者: Yiqing Ma,Jung-Hua Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) often exhibit behavioral artifacts such as laziness (premature truncation of responses or partial compliance with multi-part requests), decoding suboptimality (failure to select higher-quality sequences due to myopic decoding), and context degradation (forgetting or ignoring core instructions over long conversations). We conducted three controlled experiments (A, B, and C) to quantify these phenomena across several advanced LLMs (OpenAI GPT-4 variant, DeepSeek). Our results indicate widespread laziness in satisfying complex multi-part instructions: models frequently omitted required sections or failed to meet length requirements despite explicit prompting. However, we found limited evidence of decoding suboptimality in a simple reasoning task (the models’ greedy answers appeared to align with their highest-confidence solution), and we observed surprising robustness against context degradation in a 200-turn chaotic conversation test - the models maintained key facts and instructions far better than expected. These findings suggest that while compliance with detailed instructions remains an open challenge, modern LLMs may internally mitigate some hypothesized failure modes (such as context forgetting) in straightforward retrieval scenarios. We discuss implications for reliability, relate our findings to prior work on instruction-following and long-context processing, and recommend strategies (such as self-refinement and dynamic prompting) to reduce laziness and bolster multi-instruction compliance.
zh

[AI-51] From Fake Focus to Real Precision: Confusion-Driven Adversarial Attention Learning in Transformers WWW2026

【速读】:该论文旨在解决基于Transformer的文本情感分析模型在特定场景下准确率不足的问题,其核心原因在于现有模型倾向于将注意力集中在常见词汇上,而忽视了那些出现频率较低但对任务至关重要的关键词。解决方案的关键在于提出一种对抗反馈注意力(Adversarial Feedback for Attention, AFA)训练机制,该机制通过引入动态掩码策略欺骗判别器,并利用判别器检测掩码带来的显著差异,从而引导模型自动调整注意力权重至更关键的词项;同时结合基于策略梯度的方法优化注意力分布,提升模型对token级扰动的敏感性,实现高效且快速的收敛。

链接: https://arxiv.org/abs/2512.20661
作者: Yawei Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures, submitted to WWW 2026

点击查看摘要

Abstract:Transformer-based models have been widely adopted for sentiment analysis tasks due to their exceptional ability to capture contextual information. However, these methods often exhibit suboptimal accuracy in certain scenarios. By analyzing their attention distributions, we observe that existing models tend to allocate attention primarily to common words, overlooking less popular yet highly task-relevant terms, which significantly impairs overall performance. To address this issue, we propose an Adversarial Feedback for Attention (AFA) training mechanism that enables the model to automatically redistribute attention weights to appropriate focal points without requiring manual annotations. This mechanism incorporates a dynamic masking strategy that attempts to mask various words to deceive a discriminator, while the discriminator strives to detect significant differences induced by these masks. Additionally, leveraging the sensitivity of Transformer models to token-level perturbations, we employ a policy gradient approach to optimize attention distributions, which facilitates efficient and rapid convergence. Experiments on three public datasets demonstrate that our method achieves state-of-the-art results. Furthermore, applying this training mechanism to enhance attention in large language models yields a further performance improvement of 12.6%.
zh

[AI-52] Managing the Stochastic: Foundations of Learning in Neuro-Symbolic Systems for Software Engineering

【速读】:该论文旨在解决当前AI编码代理(AI coding agent)中因将大语言模型(Large Language Model, LLM)直接作为决策主体而导致的随机性失败问题,如通过单元测试的“游戏化”行为或语法幻觉等。其核心问题是LLM被错误地赋予了本应由确定性流程处理的控制权,从而削弱了系统的可靠性。解决方案的关键在于提出一种双状态架构(Dual-State Architecture),明确区分工作流状态(deterministic control flow)与环境状态(stochastic generation),并将LLM视为环境组件而非决策代理;同时引入原子动作对(Atomic Action Pairs)防护函数(Guard Functions),使生成与验证以不可分割事务的形式耦合,且通过防护函数将概率输出映射到可观测的工作流状态,从而实现稳定可靠的代码生成。

链接: https://arxiv.org/abs/2512.20660
作者: Matthew Thompson
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 55 pages, 3 figures, 8 tables

点击查看摘要

Abstract:Current approaches to AI coding agents appear to blur the lines between the Large Language Model (LLM) and the agent itself, asking the LLM to make decisions best left to deterministic processes. This leads to systems prone to stochastic failures such as gaming unit tests or hallucinating syntax. Drawing on established software engineering practices that provide deterministic frameworks for managing unpredictable processes, this paper proposes setting the control boundary such that the LLM is treated as a component of the environment – preserving its creative stochasticity – rather than the decision-making agent. A Dual-State Architecture is formalized, separating workflow state (deterministic control flow) from environment state (stochastic generation). Atomic Action Pairs couple generation with verification as indivisible transactions, where Guard Functions act as sensing actions that project probabilistic outputs onto observable workflow state. The framework is validated on three code generation tasks across 13 LLMs (1.3B–15B parameters). For qualified instruction-following models, task success rates improved by up to 66 percentage points at 1.2–2.1× baseline computational cost. The results suggest that architectural constraints can substitute for parameter scale in achieving reliable code generation.
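下面用一个简短的 Python 草图示意"原子动作对 + 守卫函数"的事务式结构。generate 与 guard 是调用方注入的假设接口(分别代表随机的 LLM 生成与确定性验证),重试上限等细节也是笔者设定,并非论文源码:

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowState:
    """Deterministic workflow state, driven only by guard observations."""
    step: str = "generate"
    attempts: int = 0
    log: list = field(default_factory=list)

def atomic_action(generate, guard, state, max_retries=3):
    """Atomic action pair: generation (stochastic environment) and
    verification (guard) executed as one indivisible transaction."""
    while state.attempts < max_retries:
        state.attempts += 1
        candidate = generate()           # the LLM is part of the environment
        if guard(candidate):             # guard projects the probabilistic
            state.step = "done"          # output onto observable workflow state
            return candidate
        state.log.append(f"attempt {state.attempts} rejected by guard")
    state.step = "failed"
    return None

result = atomic_action(
    generate=lambda: "def add(a, b): return a + b",   # stand-in for an LLM call
    guard=lambda code: "return" in code,              # stand-in for tests/linting
    state=WorkflowState())
```

控制流(重试、成败转移)完全由确定性状态机决定,LLM 的随机性被限制在 generate 一处。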
zh

[AI-53] AI-Driven Decision-Making System for Hiring Process

【速读】:该论文旨在解决招聘流程中早期候选人验证阶段的瓶颈问题,即招聘人员需整合多种异构输入(如简历、筛选问答、代码任务和有限的公开证据)以评估候选人,这一过程效率低且易受主观因素影响。解决方案的关键在于构建一个由大语言模型(Large Language Model, LLM)协调的模块化多智能体招聘助手,其核心包括文档与视频预处理、结构化候选人档案构建、公开数据验证、技术能力与文化契合度评分(含显式风险惩罚项),以及通过交互式界面实现人机协同验证。该系统通过约束LLM输出以减少变异性并生成可追溯的组件级推理逻辑,最终采用可配置的聚合策略计算候选人排名,显著提升筛选效率——在实际测试中,系统每筛选出一名合格候选人平均耗时1.70小时,优于经验 recruiter 的3.33小时,同时保持人类决策者作为最终裁决方。

链接: https://arxiv.org/abs/2512.20652
作者: Vira Filatova,Andrii Zelenchuk,Dmytro Filatov
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures

点击查看摘要

Abstract:Early-stage candidate validation is a major bottleneck in hiring, because recruiters must reconcile heterogeneous inputs (resumes, screening answers, code assignments, and limited public evidence). This paper presents an AI-driven, modular multi-agent hiring assistant that integrates (i) document and video preprocessing, (ii) structured candidate profile construction, (iii) public-data verification, (iv) technical/culture-fit scoring with explicit risk penalties, and (v) human-in-the-loop validation via an interactive interface. The pipeline is orchestrated by an LLM under strict constraints to reduce output variability and to generate traceable component-level rationales. Candidate ranking is computed by a configurable aggregation of technical fit, culture fit, and normalized risk penalties. The system is evaluated on 64 real applicants for a mid-level Python backend engineer role, using an experienced recruiter as the reference baseline and a second, less experienced recruiter for additional comparison. Alongside precision/recall, we propose an efficiency metric measuring expected time per qualified candidate. In this study, the system improves throughput and achieves 1.70 hours per qualified candidate versus 3.33 hours for the experienced recruiter, with substantially lower estimated screening cost, while preserving a human decision-maker as the final authority.
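摘要提到排序由"技术契合、文化契合与归一化风险惩罚的可配置聚合"给出。下面的 Python 草图按字面实现这一线性聚合;字段名与默认权重均为假设:

```python
def rank_candidates(profiles, w_tech=0.5, w_culture=0.3, w_risk=0.2):
    """Configurable aggregation sketch: technical fit and culture fit add,
    the normalized risk penalty subtracts. Weights are illustrative."""
    def score(p):
        return (w_tech * p["tech_fit"]
                + w_culture * p["culture_fit"]
                - w_risk * p["risk_penalty"])
    return sorted(profiles, key=score, reverse=True)

ranked = rank_candidates([
    {"name": "A", "tech_fit": 0.9, "culture_fit": 0.7, "risk_penalty": 0.1},
    {"name": "B", "tech_fit": 0.8, "culture_fit": 0.9, "risk_penalty": 0.4},
])  # A 得分 0.64,B 得分 0.59,A 排在前面
```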
zh

[AI-54] Memory Bear AI A Breakthrough from Memory to Cognition Toward Artificial General Intelligence

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在记忆机制上的固有局限,包括上下文窗口受限、长期知识遗忘、冗余信息累积以及幻觉生成等问题,这些问题严重制约了持续对话和个性化服务的实现。解决方案的关键在于提出Memory Bear系统,其核心是构建基于认知科学原理的人类-like记忆架构,通过整合多模态信息感知、动态记忆维护与自适应认知服务,实现LLM记忆机制的全流程重构,从而显著提升知识保真度、检索效率、上下文适应性及推理能力,并在医疗、企业运营和教育等多个领域展现出工程创新与性能突破。

链接: https://arxiv.org/abs/2512.20651
作者: Deliang Wen,Ke Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) face inherent limitations in memory, including restricted context windows, long-term knowledge forgetting, redundant information accumulation, and hallucination generation. These issues severely constrain sustained dialogue and personalized services. This paper proposes the Memory Bear system, which constructs a human-like memory architecture grounded in cognitive science principles. By integrating multimodal information perception, dynamic memory maintenance, and adaptive cognitive services, Memory Bear achieves a full-chain reconstruction of LLM memory mechanisms. Across domains such as healthcare, enterprise operations, and education, Memory Bear demonstrates substantial engineering innovation and performance breakthroughs. It significantly improves knowledge fidelity and retrieval efficiency in long-term conversations, reduces hallucination rates, and enhances contextual adaptability and reasoning capability through memory-cognition integration. Experimental results show that, compared with existing solutions (e.g., Mem0, MemGPT, Graphiti), Memory Bear outperforms them across key metrics, including accuracy, token efficiency, and response latency. This marks a crucial step forward in advancing AI from “memory” to “cognition”.
zh

[AI-55] Mixture of Attention Schemes (MoAS): Learning to Route Between MHA GQA and MQA

【速读】:该论文旨在解决Transformer模型中注意力机制在建模质量与推理效率之间的权衡问题。多头注意力(Multi-Head Attention, MHA)虽能提供最优建模质量,但推理时需占用大量键值(Key-Value, KV)缓存内存;而单查询注意力(Multi-Query Attention, MQA)和分组查询注意力(Grouped-Query Attention, GQA)虽可降低内存开销,却常以模型性能为代价。其解决方案的关键在于提出一种混合注意力方案(Mixture of Attention Schemes, MoAS),通过一个可学习的路由器(router)动态地为每个token选择最优的注意力机制(MHA、GQA或MQA),从而在保持接近MHA性能的同时实现条件计算效率的提升。实验表明,动态路由策略在WikiText-2数据集上优于静态混合方案,验证了该方法的有效性。

链接: https://arxiv.org/abs/2512.20650
作者: Esmail Gumaan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 5 pages

点击查看摘要

Abstract:The choice of attention mechanism in Transformer models involves a critical trade-off between modeling quality and inference efficiency. Multi-Head Attention (MHA) offers the best quality but suffers from large Key-Value (KV) cache memory requirements during inference. Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce memory usage but often at the cost of model performance. In this work, we propose Mixture of Attention Schemes (MoAS), a novel architecture that dynamically selects the optimal attention scheme (MHA, GQA, or MQA) for each token via a learned router. We demonstrate that dynamic routing performs better than static averaging of schemes and achieves performance competitive with the MHA baseline while offering potential for conditional compute efficiency. Experimental results on WikiText-2 show that dynamic routing (val loss 2.3074) outperforms a static mixture (2.3093), validating the effectiveness of the proposed method. Our code is available at this https URL.
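下面给出 MoAS 思路的一个 PyTorch 草图:同一组查询头分别对 8/2/1 个 KV 头(对应 MHA/GQA/MQA)计算注意力,再由逐 token 路由器做加权。头数与维度为假设;论文在推理时可能采用硬选择(argmax)以兑现条件计算收益,这里为可微性演示软路由,且未加因果掩码:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoASSketch(nn.Module):
    """Per-token routing among MHA / GQA / MQA via shared query heads."""
    def __init__(self, dim=256, n_heads=8):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.q = nn.Linear(dim, dim)
        self.kv_heads = {"mha": n_heads, "gqa": 2, "mqa": 1}
        self.kv = nn.ModuleDict({
            name: nn.Linear(dim, 2 * h * self.head_dim)
            for name, h in self.kv_heads.items()})
        self.router = nn.Linear(dim, 3)      # logits over the three schemes

    def attend(self, x, name):
        B, T, _ = x.shape
        h_kv = self.kv_heads[name]
        q = self.q(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv[name](x).chunk(2, dim=-1)
        k = k.view(B, T, h_kv, self.head_dim).transpose(1, 2)
        v = v.view(B, T, h_kv, self.head_dim).transpose(1, 2)
        # Broadcast KV heads to the query heads (grouped KV sharing)
        k = k.repeat_interleave(self.n_heads // h_kv, dim=1)
        v = v.repeat_interleave(self.n_heads // h_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v)   # non-causal sketch
        return out.transpose(1, 2).reshape(B, T, -1)

    def forward(self, x):
        gate = F.softmax(self.router(x), dim=-1)        # (B, T, 3), per token
        outs = torch.stack(
            [self.attend(x, n) for n in ("mha", "gqa", "mqa")], dim=-1)
        return (outs * gate.unsqueeze(2)).sum(dim=-1)

y = MoASSketch()(torch.randn(2, 10, 256))               # (2, 10, 256)
```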
zh

[AI-56] AIAuditTrack: A Framework for AI Security system

【速读】:该论文旨在解决由大规模语言模型驱动的AI应用快速发展所引发的安全性、责任归属与风险可追溯性问题。其解决方案的关键在于提出了一种基于区块链的AI使用流量记录与治理框架AiAuditTrack (AAT),通过去中心化身份(Decentralized Identity, DID)和可验证凭证(Verifiable Credentials, VC)构建可信且可识别的AI实体,并将实体间的交互轨迹上链记录,形成动态交互图谱(interaction graph),其中节点代表AI实体,边表示特定时间点的行为轨迹;在此基础上设计了风险扩散算法,用于追踪风险行为源头并跨实体传播预警信息,从而实现多系统协同监督与审计。

链接: https://arxiv.org/abs/2512.20649
作者: Zixun Luo,Yuhang Fan,Yufei Li,Youzhi Zhang,Hengyu Lin,Ziqi Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:The rapid expansion of AI-driven applications powered by large language models has led to a surge in AI interaction data, raising urgent challenges in security, accountability, and risk traceability. This paper presents AiAuditTrack (AAT), a blockchain-based framework for AI usage traffic recording and governance. AAT leverages decentralized identity (DID) and verifiable credentials (VC) to establish trusted and identifiable AI entities, and records inter-entity interaction trajectories on-chain to enable cross-system supervision and auditing. AI entities are modeled as nodes in a dynamic interaction graph, where edges represent time-specific behavioral trajectories. Based on this model, a risk diffusion algorithm is proposed to trace the origin of risky behaviors and propagate early warnings across involved entities. System performance is evaluated using blockchain Transactions Per Second (TPS) metrics, demonstrating the feasibility and stability of AAT under large-scale interaction recording. AAT provides a scalable and verifiable solution for AI auditing, risk management, and responsibility attribution in complex multi-agent environments.
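摘要中的风险扩散可以概括为:在实体交互图上,从风险源沿带权边逐跳传播预警分值。下面的纯 Python 草图中,衰减系数、迭代步数与"取最大值合并"的规则均为笔者假设:

```python
def diffuse_risk(edges, seed_risk, decay=0.5, steps=3):
    """Propagate risk scores over the AI-entity interaction graph.
    edges: list of (src, dst, weight); seed_risk: initial risk per node."""
    risk = dict(seed_risk)
    for _ in range(steps):
        updates = {}
        for src, dst, w in edges:
            incoming = risk.get(src, 0.0) * w * decay
            updates[dst] = max(updates.get(dst, 0.0), incoming)
        for node, val in updates.items():
            risk[node] = max(risk.get(node, 0.0), val)  # keep the strongest alert
    return risk

alerts = diffuse_risk(
    edges=[("agent_a", "agent_b", 0.9), ("agent_b", "agent_c", 0.8)],
    seed_risk={"agent_a": 1.0})
# {'agent_a': 1.0, 'agent_b': 0.45, 'agent_c': 0.18}
```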
zh

[AI-57] Reasoning Relay: Evaluating Stability and Interchangeability of Large Language Models in Mathematical Reasoning NEURIPS2025

【速读】:该论文试图解决的问题是:不同大语言模型(Large Language Models, LLMs)在推理过程中生成的中间推理链(reasoning chain)是否具有可交换性(interchangeability),即一个模型生成的部分推理链能否被另一个模型可靠地继续完成,从而保持逻辑连贯性和最终答案准确性。这一问题对于评估推理过程在模型替换时的可信度(inference-time trustworthiness)具有重要意义。解决方案的关键在于:通过基于token-level log-probability阈值对基准模型(Gemma-3-4B-IT 和 LLaMA-3.1-70B-Instruct)的推理链进行早期、中期和晚期截断,并使用Process Reward Model(PRM)作为评估工具,系统性测试由其他模型(Gemma-3-1B-IT 和 LLaMA-3.1-8B-Instruct)继续这些截断推理链的效果。实验结果表明,混合推理链不仅能够维持甚至提升最终准确率与逻辑结构,揭示出推理链的可交换性是一种新兴的行为属性,为模块化协作式AI系统的可靠推理提供了新范式。

链接: https://arxiv.org/abs/2512.20647
作者: Leo Lu,Jonathan Zhang,Sean Chua,Spencer Kim,Kevin Zhu,Sean O’Brien,Vasu Sharma
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models (ResponsibleFM)

点击查看摘要

Abstract:Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of large language models (LLMs). While prior work focuses on improving model performance through internal reasoning strategies, little is known about the interchangeability of reasoning across different models. In this work, we explore whether a partially completed reasoning chain from one model can be reliably continued by another model, either within the same model family or across families. We achieve this by assessing the sufficiency of intermediate reasoning traces as transferable scaffolds for logical coherence and final answer accuracy. We interpret this interchangeability as a means of examining inference-time trustworthiness, probing whether reasoning remains both coherent and reliable under model substitution. Using token-level log-probability thresholds to truncate reasoning at early, mid, and late stages from our baseline models, Gemma-3-4B-IT and LLaMA-3.1-70B-Instruct, we conduct continuation experiments with Gemma-3-1B-IT and LLaMA-3.1-8B-Instruct to test intra-family and cross-family behaviors. Our evaluation pipeline leverages truncation thresholds with a Process Reward Model (PRM), providing a reproducible framework for assessing reasoning stability via model interchange. Evaluations with a PRM reveal that hybrid reasoning chains often preserve, and in some cases even improve, final accuracy and logical structure. Our findings point towards interchangeability as an emerging behavioral property of reasoning models, offering insights into new paradigms for reliable modular reasoning in collaborative AI systems.
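摘要中"按 token 级对数概率阈值截断"这一步可以写成几行 Python。阈值 tau 与示例数值均为假设;论文按早/中/晚期分别设定不同的截断位置:

```python
def truncate_reasoning(tokens, logprobs, tau=-2.5):
    """Cut the reasoning chain at the first low-confidence token; the
    surviving prefix is handed to the continuation model."""
    for i, lp in enumerate(logprobs):
        if lp < tau:
            return tokens[:i]
    return tokens                        # no token falls below the threshold

prefix = truncate_reasoning(
    tokens=["Step", "1:", "compute", "...", "therefore"],
    logprobs=[-0.2, -0.4, -1.0, -3.1, -0.3])
# prefix == ["Step", "1:", "compute"];之后交由第二个模型自回归续写
```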
zh

[AI-58] Forecasting N-Body Dynamics: A Comparative Study of Neural Ordinary Differential Equations and Universal Differential Equations

【速读】:该论文旨在解决传统机器学习模型在模拟多体引力系统(n body problem)时因忽略物理定律而导致的可解释性差、数据依赖性强的问题。其核心解决方案是采用科学机器学习(Scientific Machine Learning, Scientific ML),将已知的物理规律嵌入到机器学习框架中,具体使用神经微分方程(Neural Ordinary Differential Equations, NODEs)和通用微分方程(Universal Differential Equations, UDEs)进行建模。关键创新在于通过Julia语言实现高效建模,并首次量化了预测失效点(forecasting breakdown point),即模型准确预测未来未见数据所需的最小训练数据量;结果表明,UDE模型比NODE模型更具数据效率,仅需20%的数据即可实现正确预测,而NODE则需90%。

链接: https://arxiv.org/abs/2512.20643
作者: Suriya R S,Prathamesh Dinesh Joshi,Rajat Dandekar,Raj Dandekar,Sreedath Panat
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:The n-body problem, fundamental to astrophysics, simulates the motion of n bodies acting under their mutual gravitational interactions. Traditional machine learning models used for predicting and forecasting trajectories are often data-intensive black-box models that ignore the physical laws, thereby lacking interpretability. Scientific Machine Learning (Scientific ML), by contrast, directly embeds the known physical laws into the machine learning framework. Through robust modelling in the Julia programming language, our method uses the Scientific ML frameworks Neural Ordinary Differential Equations (NODEs) and Universal Differential Equations (UDEs) to predict and forecast the system dynamics. In addition, an essential component of our analysis involves determining the forecasting breakdown point, which is the smallest possible amount of training data our models need to predict future, unseen data accurately. We employ synthetically created noisy data to simulate real-world observational limitations. Our findings indicate that the UDE model is much more data efficient, needing only 20% of the data for a correct forecast, whereas the Neural ODE requires 90%.
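论文在 Julia 的 SciML 生态中实现;为与本文其余示例保持同一语言,下面用 Python/torchdiffeq 给出 UDE 结构的草图(二体简化、网络规模与归一化引力形式均为假设),核心是"已知物理项 + 神经网络残差项"同置于一个 ODE 右端:

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint   # pip install torchdiffeq(假设可用)

class TwoBodyUDE(nn.Module):
    """UDE sketch: known inverse-square gravity plus a learned residual."""
    def __init__(self):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))

    def forward(self, t, state):
        pos, vel = state[..., :2], state[..., 2:]      # state = (x, y, vx, vy)
        r = pos.norm(dim=-1, keepdim=True).clamp_min(1e-3)
        acc_known = -pos / r**3                        # normalized gravity term
        acc = acc_known + self.residual(state)         # learned correction
        return torch.cat([vel, acc], dim=-1)

y0 = torch.tensor([1.0, 0.0, 0.0, 1.0])
t = torch.linspace(0.0, 6.28, 100)
trajectory = odeint(TwoBodyUDE(), y0, t)               # (100, 4)
```

相比让神经网络学习全部动力学的 Neural ODE,这里网络只需拟合残差,这也与摘要中 UDE 更高的数据效率相呼应。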
zh

[AI-59] Data-Free Pruning of Self-Attention Layers in LLM s

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)中冗余自注意力子层(self-attention sublayers)的压缩问题,即如何在不显著损失模型性能的前提下高效移除冗余的注意力层以提升推理效率。其核心解决方案是提出一种名为Gate-Norm的单次、仅依赖权重的判别准则,通过量化查询(query)与键(key)之间的耦合强度对注意力子层进行排序,并移除耦合度最低的层。该方法无需校准数据、前向传播、微调或专用内核,可在极短时间内完成模型剪枝(如40层、13B参数的LLaMA模型在1秒内完成),在保持零样本准确率损失小于2%的同时,使推理吞吐量提升最高达1.3倍,且性能媲美数据驱动的剪枝方法,但速度提升约1000倍,从而实现高效、无数据依赖的大模型压缩。

链接: https://arxiv.org/abs/2512.20636
作者: Dhananjay Saikumar,Blesson Varghese
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many self-attention sublayers in large language models (LLMs) can be removed with little to no loss. We attribute this to the Attention Suppression Hypothesis: during pre-training, some deep attention layers learn to mute their own contribution, leaving the residual stream and the MLP to carry the representation. We propose Gate-Norm, a one-shot, weight-only criterion that ranks attention sublayers by query–key coupling and removes the least coupled ones, requiring no calibration data, no forward passes, no fine-tuning, and no specialized kernels. On 40-layer, 13B-parameter LLaMA models, Gate-Norm prunes the model in under a second. Pruning 8–16 attention sublayers yields up to 1.30× higher inference throughput while keeping average zero-shot accuracy within 2% of the unpruned baseline across BoolQ, RTE, HellaSwag, WinoGrande, ARC-Easy/Challenge, and OpenBookQA. Across these settings, Gate-Norm matches data-driven pruning methods in accuracy while being ~1000× faster to score layers, enabling practical, data-free compression of LLMs.
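摘要未给出 Gate-Norm 的具体公式,下面的草图只演示"仅凭权重为各子层打分"这一思路:用 ||W_Q W_K^T||_F 作为 query-key 耦合强度的假设代理,得分最低的子层优先被剪除。该打分形式是笔者的猜测性示意,论文中的定义可能不同:

```python
import torch

def gate_norm_scores(layers):
    """Weight-only scoring sketch: one coupling score per attention sublayer.
    layers: list of (W_q, W_k) weight matrices, each (d_model, d_model)."""
    scores = []
    for w_q, w_k in layers:
        coupling = w_q @ w_k.transpose(-1, -2)   # query-key coupling matrix
        scores.append(coupling.norm().item())    # Frobenius norm as the score
    return scores

layers = [(torch.randn(64, 64), torch.randn(64, 64)) for _ in range(4)]
scores = gate_norm_scores(layers)
prune_order = sorted(range(len(scores)), key=lambda i: scores[i])
# prune_order[:k] 即耦合最弱、最先移除的 k 个注意力子层
```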
zh

[AI-60] Enhancing Lung Cancer Treatment Outcome Prediction through Semantic Feature Engineering Using Large Language Models

【速读】:该论文旨在解决肺癌治疗效果预测中因真实世界电子健康数据稀疏性、异质性和上下文过载而导致的建模挑战。传统模型难以捕捉多模态数据(如实验室检查、基因组学和用药信息)间的语义关联,而大规模微调方法又不适用于临床工作流程。其解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLMs)的目标导向知识整理框架(Goal-oriented Knowledge Curator, GKC),将多源临床数据转化为与预测任务对齐的高保真特征表示;该方法作为离线预处理步骤运行,可无缝集成至医院信息系统,并显著提升预测性能(平均AUROC达0.803),验证了语义表征质量在稀疏临床数据场景下的决定性作用。

链接: https://arxiv.org/abs/2512.20633
作者: MunHwan Lee,Shaika Chowdhury,Xiaodi Li,Sivaraman Rajaganapathy,Eric W Klee,Ping Yang,Terence Sio,Liewei Wang,James Cerhan,Nansu NA Zong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate prediction of treatment outcomes in lung cancer remains challenging due to the sparsity, heterogeneity, and contextual overload of real-world electronic health data. Traditional models often fail to capture semantic information across multimodal streams, while large-scale fine-tuning approaches are impractical in clinical workflows. We introduce a framework that uses Large Language Models (LLMs) as Goal-oriented Knowledge Curators (GKC) to convert laboratory, genomic, and medication data into high-fidelity, task-aligned features. Unlike generic embeddings, GKC produces representations tailored to the prediction objective and operates as an offline preprocessing step that integrates naturally into hospital informatics pipelines. Using a lung cancer cohort (N=184), we benchmarked GKC against expert-engineered features, direct text embeddings, and an end-to-end transformer. Our approach achieved a mean AUROC of 0.803 (95% CI: 0.799-0.807) and outperformed all baselines. An ablation study further confirmed the complementary value of combining all three modalities. These results show that the quality of semantic representation is a key determinant of predictive accuracy in sparse clinical data settings. By reframing LLMs as knowledge curation engines rather than black-box predictors, this work demonstrates a scalable, interpretable, and workflow-compatible pathway for advancing AI-driven decision support in oncology.
zh

[AI-61] Erkang-Diagnosis-1.1 Technical Report

【速读】:该论文旨在解决当前AI健康咨询助手在专业性、可靠性和安全性方面存在的不足,尤其是在初级医疗和健康管理场景中难以提供精准诊断建议的问题。解决方案的关键在于构建一个基于阿里巴巴Qwen-3模型的专用医疗AI系统——Erkang-Diagnosis-1.1,其核心创新是融合约500GB高质量结构化医学知识,并采用增强预训练与检索增强生成(Retrieval-Augmented Generation, RAG)相结合的混合方法,从而实现对用户症状的快速准确理解、初步分析及专业健康指导,在综合医学评估中优于GPT-4。

链接: https://arxiv.org/abs/2512.20632
作者: Jianbing Ma,Ao Feng,Zhenjie Gao,Xinyu Song,Li Su,Bin Chen,Wei Wang,Jiamin Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 9 pages; 4 figures

点击查看摘要

Abstract:This report provides a detailed introduction to the Erkang-Diagnosis-1.1 model, our AI healthcare consulting assistant developed using the Alibaba Qwen-3 model. The Erkang model integrates approximately 500GB of high-quality structured medical knowledge, employing a hybrid approach combining enhanced pre-training and retrieval-augmented generation to create a secure, reliable, and professional AI health advisor. Through 3-5 efficient interaction rounds, Erkang Diagnosis can accurately understand user symptoms, conduct preliminary analysis, and provide valuable diagnostic suggestions and health guidance. Designed to become users' intelligent health companion, it empowers primary healthcare and health management. In validation, Erkang-Diagnosis-1.1 outperforms GPT-4 on comprehensive medical examinations.
zh

[AI-62] MicroProbe: Efficient Reliability Assessment for Foundation Models with Minimal Data ICML

【速读】:该论文旨在解决基础模型(Foundation model)可靠性评估中所需样本量过大、计算成本高昂的问题,尤其在实际部署场景下效率低下。其核心解决方案是提出一种名为 microprobe 的新方法,关键在于通过仅使用100个精心挑选的探针示例(probe examples),结合五个关键可靠性维度的策略性提示多样性(strategic prompt diversity)、先进的不确定性量化(uncertainty quantification)与自适应加权机制,高效识别潜在失败模式。该方法在多个语言模型(如GPT-2系列)和跨领域(医疗、金融、法律)验证中显著优于随机采样基线,实现更高复合可靠性得分(提升23.5%)且具备极高的统计功效(99.9%),同时将评估成本降低90%并保持95%的传统覆盖范围。

链接: https://arxiv.org/abs/2512.20630
作者: Aayam Bansal,Ishaan Gangwani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ICML NewInML

点击查看摘要

Abstract:Foundation model reliability assessment typically requires thousands of evaluation examples, making it computationally expensive and time-consuming for real-world deployment. We introduce microprobe, a novel approach that achieves comprehensive reliability assessment using only 100 strategically selected probe examples. Our method combines strategic prompt diversity across five key reliability dimensions with advanced uncertainty quantification and adaptive weighting to efficiently detect potential failure modes. Through extensive empirical evaluation on multiple language models (GPT-2 variants, GPT-2 Medium, GPT-2 Large) and cross-domain validation (healthcare, finance, legal), we demonstrate that microprobe achieves 23.5% higher composite reliability scores compared to random sampling baselines, with exceptional statistical significance (p < 0.001, Cohen's d = 1.21). Expert validation by three AI safety researchers confirms the effectiveness of our strategic selection, rating our approach 4.14/5.0 versus 3.14/5.0 for random selection. microprobe completes reliability assessment with 99.9% statistical power while representing a 90% reduction in assessment cost and maintaining 95% of traditional method coverage. Our approach addresses a critical gap in efficient model evaluation for responsible AI deployment.
zh

[AI-63] Learning Evolving Latent Strategies for Multi-Agent Language Systems without Model Fine-Tuning

【速读】:该论文旨在解决语言模型在长期多轮交互中缺乏持续策略演化能力的问题,即如何在不微调模型参数的前提下实现策略的动态更新与优化。其解决方案的关键在于构建一个双环架构:行为环根据环境奖励调整动作偏好,语言环则通过反思生成文本的语义嵌入来更新外部潜在向量(external latent vectors),从而释放抽象概念的潜在表示从静态语义表征中,并使其能够通过环境交互和强化反馈持续演进。这一机制使代理能够在长时间交互中发展出稳定且解耦的战略风格,同时展现出对情绪化代理的隐式推断与适应能力。

链接: https://arxiv.org/abs/2512.20629
作者: Wenlong Tang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 5 figures. Code available at this https URL

点击查看摘要

Abstract:This study proposes a multi-agent language framework that enables continual strategy evolution without fine-tuning the language model’s parameters. The core idea is to liberate the latent vectors of abstract concepts from traditional static semantic representations, allowing them to be continuously updated through environmental interaction and reinforcement feedback. We construct a dual-loop architecture: the behavior loop adjusts action preferences based on environmental rewards, while the language loop updates the external latent vectors by reflecting on the semantic embeddings of generated text. Together, these mechanisms allow agents to develop stable and disentangled strategic styles over long-horizon multi-round interactions. Experiments show that agents’ latent spaces exhibit clear convergence trajectories under reflection-driven updates, along with structured shifts at critical moments. Moreover, the system demonstrates an emergent ability to implicitly infer and continually adapt to emotional agents, even without shared rewards. These results indicate that, without modifying model parameters, an external latent space can provide language agents with a low-cost, scalable, and interpretable form of abstract strategic representation.
zh

[AI-64] Proceedings of the 20th International Conference on Knowledge Information and Creativity Support Systems (KICSS 2025)

【速读】:该论文旨在解决多学科交叉领域中知识、信息与创造力支持系统(Knowledge, Information and Creativity Support Systems, KICSS)的协同创新问题,尤其聚焦于人工智能、知识工程、人机交互及创造力支持系统等方向的研究进展与实践应用。其解决方案的关键在于通过组织国际学术会议并收录经过双盲评审的高质量论文,构建一个跨领域的知识交流平台,并进一步推荐优秀成果至权威期刊《IEICE Transactions on Information and Systems》进行深度发表,从而促进研究成果的规范化传播与持续迭代发展。

链接: https://arxiv.org/abs/2512.20628
作者: Edited by Tessai Hayama(Nagaoka University of Technology, Japan),Takayuki Ito(Kyoto University, Japan),Takahiro Uchiya(Nagoya Institute of Technology, Japan),Motoki Miura(Chiba Institute of Technology, Japan),Takahiro Kawaji(University of Kurume, Japan),Takaya Yuizono(Japan Advanced Institute of Science and Technology, Japan),Atsuo Yoshitaka(Japan Advanced Institute of Science and Technology, Japan),Tokuro Matsuo(Advanced Institute of Industrial Technology, Japan),Shun Okuhara(Mie University, Japan),Jawad Haqbeen(Kyoto University, Japan),Sofia Sahab(Kyoto University, Japan),Wen Gu(Nagoya Institute of Technology, Japan),Shiyao Ding(Kyoto University, Japan)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Conference proceedings; 325 pages; published in cooperation with IEICE Proceedings Series. A subset of papers will appear in IEICE Transactions on Information and Systems (special section). Venue: Aore Nagaoka, Japan, December 3-5, 2025. Editors: KICSS 2025 Organizing Committee

点击查看摘要

Abstract:This volume presents the proceedings of the 20th International Conference on Knowledge, Information and Creativity Support Systems (KICSS 2025), held in Nagaoka, Japan, on December 3-5, 2025. The conference, organized in cooperation with the IEICE Proceedings Series, provides a multidisciplinary forum for researchers in artificial intelligence, knowledge engineering, human-computer interaction, and creativity support systems. The proceedings include peer-reviewed papers accepted through a double-blind review process. Selected papers have been recommended for publication in IEICE Transactions on Information and Systems after an additional peer-review process.
zh

[AI-65] Efficient Asynchronous Federated Evaluation with Strategy Similarity Awareness for Intent-Based Networking in Industrial Internet of Things

【速读】:该论文针对工业互联网(IIoT)环境中意图驱动网络(Intent-Based Networking, IBN)面临的两大挑战展开研究:一是频繁的策略部署与回滚在实际系统中不可行,因IIoT节点工作流耦合紧密且停机成本高;二是IIoT节点的异构性和隐私约束使得集中式策略验证难以实施。解决方案的关键在于提出FEIBN框架,其核心创新包括:利用大语言模型(Large Language Models, LLMs)将多模态用户意图映射为结构化的策略元组,并通过联邦学习(Federated Learning, FL)实现分布式策略验证而不暴露原始数据;进一步设计了策略相似性感知的联邦学习机制(Strategy Similarity Aware Federated Learning, SSAFL),基于策略相似度和资源状态选择相关节点,并仅在更新显著时触发异步模型上传,从而提升训练效率、加速收敛并降低通信开销27.8%。

链接: https://arxiv.org/abs/2512.20627
作者: Shaowen Qin,Jianfeng Zeng,Haodong Guo,Xiaohuan Li,Jiawen Kang,Qian Chen,Dusit Niyato
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 13 pages with 7 figures and 4 tables

点击查看摘要

Abstract:Intent-Based Networking (IBN) offers a promising paradigm for intelligent and automated network control in Industrial Internet of Things (IIoT) environments by translating high-level user intents into executable network strategies. However, frequent strategy deployment and rollback are impractical in real-world IIoT systems due to tightly coupled workflows and high downtime costs, while the heterogeneity and privacy constraints of IIoT nodes further complicate centralized policy verification. To address these challenges, we propose FEIBN, a Federated Evaluation Enhanced Intent-Based Networking framework. FEIBN leverages large language models (LLMs) to align multimodal user intents into structured strategy tuples and employs federated learning to perform distributed policy verification across IIoT nodes without exposing raw data. To improve training efficiency and reduce communication overhead, we design SSAFL, a Strategy Similarity Aware Federated Learning mechanism that selects task-relevant nodes based on strategy similarity and resource status, and triggers asynchronous model uploads only when updates are significant. Experiments demonstrate that SSAFL can improve model accuracy, accelerate model convergence, and reduce the cost by 27.8% compared with SemiAsyn.
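SSAFL 的两个机制(按策略相似度选节点、按更新显著性触发异步上传)都可以用很短的 NumPy 草图表达;打分方式(相似度乘以空闲度)与阈值 eps 均为假设:

```python
import numpy as np

def select_nodes(strategy_vec, node_vecs, node_free, top_k=3):
    """Pick verification nodes by cosine similarity to the strategy tuple,
    scaled by each node's resource availability in [0, 1]."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    scored = [(cosine(strategy_vec, v) * node_free[n], n)
              for n, v in node_vecs.items()]
    return [n for _, n in sorted(scored, reverse=True)[:top_k]]

def should_upload(new_weights, old_weights, eps=0.05):
    """Asynchronous upload trigger: only significant local updates are sent."""
    delta = np.linalg.norm(new_weights - old_weights)
    return delta / (np.linalg.norm(old_weights) + 1e-9) > eps
```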
zh

[AI-66] Parameter-Efficient Neural CDEs via Implicit Function Jacobians

【速读】:该论文旨在解决神经微分方程(Neural Controlled Differential Equations, NCDEs)在处理时序数据时参数量过大这一关键问题。其解决方案的核心在于提出一种参数高效的替代方法,该方法不仅显著减少了模型所需参数数量,还通过类比“连续循环神经网络”(Continuous RNN)的逻辑结构,保持了NCDEs在时序建模中的本质优势。

链接: https://arxiv.org/abs/2512.20625
作者: Ilya Kuleshov,Alexey Zaytsev
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Neural Controlled Differential Equations (Neural CDEs, NCDEs) are a unique branch of methods, specifically tailored for analysing temporal sequences. However, they come with drawbacks, the main one being the number of parameters required for the method's operation. In this paper, we propose an alternative, parameter-efficient look at Neural CDEs. It requires far fewer parameters, while also offering a natural interpretation as the "Continuous RNN" that Neural CDEs aspire to be.
zh

[AI-67] Quantum-Inspired Multi Agent Reinforcement Learning for Exploration Exploitation Optimization in UAV-Assisted 6G Network Deployment

【速读】:该论文旨在解决多智能体强化学习(Multiagent Reinforcement Learning, MARL)中探索与利用(exploration-exploitation)权衡难题,特别是在部分可观测和动态环境下的无人机(UAV)辅助6G网络部署场景中。其解决方案的关键在于提出一种量子启发(Quantum-Inspired, QI)框架,将经典MARL算法与量子启发优化技术相结合,以变分量子电路(Variational Quantum Circuits, VQCs)为核心结构,并采用量子近似优化算法(Quantum Approximate Optimization Algorithm, QAOA)进行组合优化;同时引入贝叶斯推断、高斯过程和变分推断等概率建模方法捕捉潜在环境动态,结合集中训练、分散执行(Centralized Training with Decentralized Execution, CTDE)范式提升局部可观测性。实验表明,该框架显著提升了样本效率、加速收敛并增强覆盖性能,实现了优于PPO和DDPG基线方法的探索-利用平衡。

链接: https://arxiv.org/abs/2512.20624
作者: Mazyar Taghavi,Javad Vahidi
机构: 未知
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: 59 pages

点击查看摘要

Abstract:This study introduces a quantum-inspired framework for optimizing the exploration-exploitation tradeoff in multiagent reinforcement learning, applied to UAV-assisted 6G network deployment. We consider a cooperative scenario where ten intelligent UAVs autonomously coordinate to maximize signal coverage and support efficient network expansion under partial observability and dynamic conditions. The proposed approach integrates classical MARL algorithms with quantum-inspired optimization techniques, leveraging variational quantum circuits (VQCs) as the core structure and employing the Quantum Approximate Optimization Algorithm (QAOA) as a representative VQC-based method for combinatorial optimization. Complementary probabilistic modeling is incorporated through Bayesian inference, Gaussian processes, and variational inference to capture latent environmental dynamics. A centralized training with decentralized execution (CTDE) paradigm is adopted, where shared memory and local view grids enhance local observability among agents. Comprehensive experiments including scalability tests, sensitivity analysis, and comparisons with PPO and DDPG baselines demonstrate that the proposed framework improves sample efficiency, accelerates convergence, and enhances coverage performance while maintaining robustness. Radar chart and convergence analyses further show that QI-MARL achieves a superior balance between exploration and exploitation compared to classical methods. All implementation code and supplementary materials are publicly available on GitHub to ensure reproducibility.
zh

[AI-68] BitRL-Light: 1-bit LLM Agents with Deep Reinforcement Learning for Energy-Efficient Smart Home Lighting Optimization

【速读】:该论文旨在解决智能家居照明系统能耗高(占住宅能源消耗的15–20%)且缺乏自适应智能以同时优化用户舒适度与能效的问题。解决方案的关键在于提出BitRL-Light框架,该框架结合1-bit量化的大语言模型(Large Language Models, LLMs)与深度Q网络(Deep Q-Network, DQN)强化学习算法,在边缘设备上实现低功耗、实时的照明控制。通过将1-bit量化Llama-3.2-1B模型部署于Raspberry Pi硬件,系统在保持92%任务准确率的前提下实现71.4倍的能耗降低,并借助多目标强化学习从用户反馈中学习最优照明策略,兼顾节能、舒适性和昼夜节律同步。

链接: https://arxiv.org/abs/2512.20623
作者: Ravi Gupta,Shabista Haider
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Presented as poster in IPCCC 2025 at Austin

点击查看摘要

Abstract:Smart home lighting systems consume 15-20% of residential energy but lack adaptive intelligence to optimize for user comfort and energy efficiency simultaneously. We present BitRL-Light, a novel framework combining 1-bit quantized Large Language Models (LLMs) with Deep Q-Network (DQN) reinforcement learning for real-time smart home lighting control on edge devices. Our approach deploys a 1-bit quantized Llama-3.2-1B model on Raspberry Pi hardware, achieving 71.4 times energy reduction compared to full-precision models while maintaining intelligent control capabilities. Through multi-objective reinforcement learning, BitRL-Light learns optimal lighting policies from user feedback, balancing energy consumption, comfort, and circadian alignment. Experimental results demonstrate 32% energy savings compared to rule-based systems, with inference latency under 200ms on Raspberry Pi 4 and 95% user satisfaction. The system processes natural language commands via Google Home/IFTTT integration and learns from implicit feedback through manual overrides. Our comparative analysis shows 1-bit models achieve 5.07 times speedup over 2-bit alternatives on ARM processors while maintaining 92% task accuracy. This work establishes a practical framework for deploying adaptive AI on resource-constrained IoT devices, enabling intelligent home automation without cloud dependencies.
zh

[AI-69] Cooperation Through Indirect Reciprocity in Child-Robot Interactions

【速读】:该论文旨在解决人类与人工智能(AI)在协作情境中如何通过间接互惠(indirect reciprocity, IR)机制维持合作的问题,尤其关注儿童与机器人之间的互动。其核心挑战在于人类群体中的道德判断、人口统计差异以及AI学习动态的异质性可能影响合作行为的评估与形成。解决方案的关键在于结合实验室实验与理论建模,发现儿童在协调困境中表现出的策略足以作为信号,使多臂赌博机(multi-armed bandit)算法学习并实现合作行为,从而证明IR可有效移植至人机交互场景,并揭示了人类策略对AI合作能力的决定性作用。

链接: https://arxiv.org/abs/2512.20621
作者: Isabel Neto,Alexandre S. Pires,Filipa Correia,Fernando P. Santos
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 16 pages + 5 pages of references; 4 figures; 1 table; accepted for publication in Proceedings of the Royal Society A (in press)

点击查看摘要

Abstract:Social interactions increasingly involve artificial agents, such as conversational or collaborative bots. Understanding trust and prosociality in these settings is fundamental to improve human-AI teamwork. Research in biology and social sciences has identified mechanisms to sustain cooperation among humans. Indirect reciprocity (IR) is one of them. With IR, helping someone can enhance an individual’s reputation, nudging others to reciprocate in the future. Transposing IR to human-AI interactions is however challenging, as differences in human demographics, moral judgements, and agents’ learning dynamics can affect how interactions are assessed. To study IR in human-AI groups, we combine laboratory experiments and theoretical modelling. We investigate whether 1) indirect reciprocity can be transposed to children-robot interactions; 2) artificial agents can learn to cooperate given children’s strategies; and 3) how differences in learning algorithms impact human-AI cooperation. We find that IR extends to children and robots solving coordination dilemmas. Furthermore, we observe that the strategies revealed by children provide a sufficient signal for multi-armed bandit algorithms to learn cooperative actions. Beyond the experimental scenarios, we observe that cooperating through multi-armed bandit algorithms is highly dependent on the strategies revealed by humans.
zh

[AI-70] Unsupervised local learning based on voltage-dependent synaptic plasticity for resistive and ferroelectric synapses

【速读】:该论文旨在解决在边缘计算设备上部署人工智能(AI)时面临的能耗高与功能受限问题,提出通过类脑学习机制实现低功耗、实时自适应的AI运算。其解决方案的关键在于引入电压依赖型突触可塑性(voltage-dependent synaptic plasticity, VDSP),这是一种基于赫布原理(Hebbian principles)的高效无监督局部学习方法,无需传统脉冲时间依赖可塑性(spike-timing-dependent plasticity, STDP)所需的复杂脉冲整形电路,即可在多种纳米尺度阻变存储器(memristive devices)中实现在线学习。该方法已在TiO₂、HfO₂基金属氧化物细丝突触和HfZrO₄基铁电隧道结(ferroelectric tunnel junctions, FTJ)三类器件上验证,并通过脉冲神经网络系统级仿真在MNIST模式识别任务中达到超过83%的准确率,同时提出了应对器件变异性的优化策略以提升鲁棒性。

链接: https://arxiv.org/abs/2510.25787
作者: Nikhil Garg,Ismael Balafrej,Joao Henrique Quintino Palhares,Laura Bégon-Lours,Davide Florini,Donato Francesco Falcone,Tommaso Stecconi,Valeria Bragaglia,Bert Jan Offrein,Jean-Michel Portal,Damien Querlioz,Yann Beilliard,Dominique Drouin,Fabien Alibart
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:The deployment of AI on edge computing devices faces significant challenges related to energy consumption and functionality. These devices could greatly benefit from brain-inspired learning mechanisms, allowing for real-time adaptation while using low power. In-memory computing with nanoscale resistive memories may play a crucial role in enabling the execution of AI workloads on these edge devices. In this study, we introduce voltage-dependent synaptic plasticity (VDSP) as an efficient approach for unsupervised and local learning in memristive synapses based on Hebbian principles. This method enables online learning without requiring the complex pulse-shaping circuits typically necessary for spike-timing-dependent plasticity (STDP). We show how VDSP can be advantageously adapted to three types of memristive devices (TiO₂- and HfO₂-based metal-oxide filamentary synapses, and HfZrO₄-based ferroelectric tunnel junctions (FTJs)) with distinctive switching characteristics. System-level simulations of spiking neural networks incorporating these devices were conducted to validate unsupervised learning on MNIST-based pattern recognition tasks, achieving state-of-the-art performance. The results demonstrated over 83% accuracy across all devices using 200 neurons. Additionally, we assessed the impact of device variability, such as switching thresholds and ratios between high and low resistance state levels, and proposed mitigation strategies to enhance robustness.
zh

[AI-71] Inspection Planning Primitives with Implicit Models

【速读】:该论文旨在解决大型复杂结构(如包含大量构件的基础设施)在生成式AI (Generative AI) 检查路径规划时,因传统采样式规划器依赖显式环境模型而导致内存消耗巨大这一问题。其关键解决方案是提出了一组全新的基础计算原语——基于隐式模型的检查规划原语(Inspection Planning Primitives with Implicit Models, IPIM),使得采样式检查规划器能够在整个规划过程中完全使用神经隐式距离函数(neural Signed Distance Functions, SDFs)表示环境,无需频繁转换至显式模型或存储大量三角网格数据,从而显著降低内存占用并保持高质量轨迹生成能力。

链接: https://arxiv.org/abs/2510.07611
作者: Jingyang You,Hanna Kurniawati,Lashika Medagoda
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The aging and increasing complexity of infrastructures make efficient inspection planning more critical in ensuring safety. Thanks to sampling-based motion planning, many inspection planners are fast. However, they often require huge memory. This is particularly true when the structure under inspection is large and complex, consisting of many struts and pillars of various geometry and sizes. Such structures can be represented efficiently using implicit models, such as neural Signed Distance Functions (SDFs). However, most primitive computations used in sampling-based inspection planner have been designed to work efficiently with explicit environment models, which in turn requires the planner to use explicit environment models or performs frequent transformations between implicit and explicit environment models during planning. This paper proposes a set of primitive computations, called Inspection Planning Primitives with Implicit Models (IPIM), that enable sampling-based inspection planners to entirely use neural SDFs representation during planning. Evaluation on three scenarios, including inspection of a complex real-world structure with over 92M triangular mesh faces, indicates that even a rudimentary sampling-based planner with IPIM can generate inspection trajectories of similar quality to those generated by the state-of-the-art planner, while using up to 70x less memory than the state-of-the-art inspection planner.
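IPIM 的具体原语摘要未逐一列出,下面仅以其中最基础的一类为例:沿运动基元(这里取线段)采样并直接查询神经 SDF,以间隙阈值判定无碰撞。采样数、阈值与占位网络均为假设:

```python
import torch
import torch.nn as nn

def segment_is_clear(sdf, p0, p1, clearance=0.05, n_samples=32):
    """Collision primitive sketch on an implicit model: sample the segment,
    query the neural SDF, and require all distances above the clearance."""
    ts = torch.linspace(0.0, 1.0, n_samples).unsqueeze(-1)
    points = p0 * (1 - ts) + p1 * ts                 # (n_samples, 3) on the segment
    with torch.no_grad():
        distances = sdf(points).squeeze(-1)          # signed distances to geometry
    return bool((distances > clearance).all())

sdf = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1))  # placeholder SDF
ok = segment_is_clear(sdf,
                      torch.tensor([0.0, 0.0, 0.0]),
                      torch.tensor([1.0, 1.0, 1.0]))
```

由于整个检查只需前向查询隐式网络,规划器无需在内存中保留显式三角网格,这正是摘要所述内存优势的来源。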
zh

[AI-72] Scaling Laws for Economic Productivity: Experimental Evidence in LLM -Assisted Consulting Data Analyst and Management Tasks

【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)的训练算力增长与专业生产力提升之间的量化关系问题,即如何通过模型规模扩展实现经济层面的生产力增益。其解决方案的关键在于通过预注册实验设计,在超过500名咨询顾问、数据分析师和管理人员中测试13种不同LLM对专业任务效率的影响,从而实证得出“经济影响缩放定律”(Scaling Laws for Economic Impacts),并区分出算力增长(56%贡献)与算法进步(44%贡献)对生产力提升的具体作用,同时揭示非代理型分析任务相较于需工具调用的代理型工作流具有更高生产力增益。

链接: https://arxiv.org/abs/2512.21316
作者: Ali Merali
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This paper derives "Scaling Laws for Economic Impacts" – empirical relationships between the training compute of Large Language Models (LLMs) and professional productivity. In a preregistered experiment, over 500 consultants, data analysts, and managers completed professional tasks using one of 13 LLMs. We find that each year of AI model progress reduced task time by 8%, with 56% of gains driven by increased compute and 44% by algorithmic progress. However, productivity gains were significantly larger for non-agentic analytical tasks compared to agentic workflows requiring tool use. These findings suggest continued model scaling could boost U.S. productivity by approximately 20% over the next decade.
zh

[AI-73] PhononBench:A Large-Scale Phonon-Based Benchmark for Dynamical Stability in Crystal Generation

【速读】:该论文旨在解决当前生成式晶体模型在物理可实现性方面存在的关键缺陷——即生成的晶体结构普遍缺乏动力学稳定性(dynamical stability),这限制了其在材料设计与发现中的实际应用。解决方案的关键在于构建首个大规模动力学稳定性基准测试平台PhononBench,利用高精度MatterSim原子间势函数对108,843个由六种主流晶体生成模型产生的结构进行高效声子计算和稳定性分析,从而系统评估各模型的动力学稳定性表现,并识别出28,119个在整个布里渊区均稳定的可靠候选结构,为未来生成模型的发展提供了量化评价标准和优化方向。

链接: https://arxiv.org/abs/2512.21227
作者: Xiao-Qi Han,Ze-Feng Gao,Peng-Jie Guo,Zhong-Yi Lu
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 19 pages, 6 figures

点击查看摘要

Abstract:In this work, we introduce PhononBench, the first large-scale benchmark for dynamical stability in AI-generated crystals. Leveraging the recently developed MatterSim interatomic potential, which achieves DFT-level accuracy in phonon predictions across more than 10,000 materials, PhononBench enables efficient large-scale phonon calculations and dynamical-stability analysis for 108,843 crystal structures generated by six leading crystal generation models. PhononBench reveals a widespread limitation of current generative models in ensuring dynamical stability: the average dynamical-stability rate across all generated structures is only 25.83%, with the top-performing model, MatterGen, reaching just 41.0%. Further case studies show that in property-targeted generation (illustrated here by band-gap conditioning with MatterGen), the dynamical-stability rate remains as low as 23.5% even at the optimal band-gap condition of 0.5 eV. In space-group-controlled generation, higher-symmetry crystals exhibit better stability (e.g., cubic systems achieve rates up to 49.2%), yet the average stability across all controlled generations is still only 34.4%. An important additional outcome of this study is the identification of 28,119 crystal structures that are phonon-stable across the entire Brillouin zone, providing a substantial pool of reliable candidates for future materials exploration. By establishing the first large-scale dynamical-stability benchmark, this work systematically highlights the current limitations of crystal generation models and offers essential evaluation criteria and guidance for their future development toward the design and discovery of physically viable materials. All model-generated crystal structures, phonon calculation results, and the high-throughput evaluation workflows developed in PhononBench will be openly released at this https URL
zh
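
The stability criterion behind the benchmark is easy to operationalize: a structure counts as dynamically stable only if no phonon branch turns imaginary anywhere in the Brillouin zone. The sketch below shows that filter in isolation; the frequency arrays, THz units, and tolerance stand in for the output of a MatterSim/phonopy-style workflow and are assumptions for illustration.

```python
import numpy as np

def is_dynamically_stable(freqs_thz: np.ndarray, tol: float = -0.05) -> bool:
    """True if no phonon mode is softer than -|tol| THz at any sampled q-point."""
    return float(freqs_thz.min()) >= tol

# Toy spectra shaped [n_qpoints, n_branches]; imaginary modes are reported
# as negative frequencies by convention.
rng = np.random.default_rng(0)
stable_freqs = rng.uniform(0.1, 15.0, size=(64, 12))
unstable_freqs = stable_freqs.copy()
unstable_freqs[0, 0] = -1.2                      # a single imaginary mode fails the filter

batch = {"gen-0001": stable_freqs, "gen-0002": unstable_freqs}
stable_pool = [name for name, f in batch.items() if is_dynamically_stable(f)]
print(f"{len(stable_pool)}/{len(batch)} structures pass:", stable_pool)
```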

[AI-74] GenTSE: Enhancing Target Speaker Extraction via a Coarse-to-Fine Generative Language Model

【速读】: This paper addresses the limitations of conventional target speaker extraction (TSE) methods in generalization, speech fidelity, and speaker consistency, and in particular the decoding instability and perceptual misalignment that language model (LM)-based generative modeling faces in practice. The key to the solution is a two-stage decoder-only generative LM framework (GenTSE): the first stage predicts coarse semantic tokens and the second stage generates fine-grained acoustic tokens, so that separating semantics from acoustics yields more stable, content-aligned speech reconstruction. A Frozen-LM Conditioning training strategy mitigates exposure bias, and Direct Preference Optimization (DPO) is introduced to better align outputs with human auditory perception, yielding clear gains over prior LM-based systems on Libri2Mix in speech quality, intelligibility, and speaker consistency.

链接: https://arxiv.org/abs/2512.20978
作者: Haoyang Li,Xuyi Zhuang,Azmat Adnan,Ye Ni,Wei Rao,Shreyas Gopal,Eng Siong Chng
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Language Model (LM)-based generative modeling has emerged as a promising direction for TSE, offering potential for improved generalization and high-fidelity speech. We present GenTSE, a two-stage decoder-only generative LM approach for TSE: Stage-1 predicts coarse semantic tokens, and Stage-2 generates fine acoustic tokens. Separating semantics and acoustics stabilizes decoding and yields more faithful, content-aligned target speech. Both stages use continuous SSL or codec embeddings, offering richer context than discretized-prompt methods. To reduce exposure bias, we employ a Frozen-LM Conditioning training strategy that conditions the LMs on predicted tokens from earlier checkpoints, narrowing the gap between teacher-forcing training and autoregressive inference. We further employ DPO to better align outputs with human perceptual preferences. Experiments on Libri2Mix show that GenTSE surpasses previous LM-based systems in speech quality, intelligibility, and speaker consistency.
zh
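
The coarse-to-fine flow can be pictured as two chained autoregressive decoders. The following sketch is not the authors' implementation: the tiny transformer stand-ins, vocabulary sizes, and greedy decoding are illustrative assumptions, but the shape of the pipeline (continuous mixture/enrollment context, then semantic tokens, then acoustic tokens) follows the abstract.

```python
import torch
import torch.nn as nn

class TinyDecoderLM(nn.Module):
    """Toy decoder-only LM: continuous context prepended to token embeddings."""
    def __init__(self, vocab: int, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def step(self, ctx: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        h = torch.cat([ctx, self.embed(tokens)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(h.size(1))
        return self.head(self.body(h, mask=mask))[:, -1]      # next-token logits

@torch.no_grad()
def generate(lm: TinyDecoderLM, ctx: torch.Tensor, n_steps: int, bos: int = 0):
    tokens = torch.full((ctx.size(0), 1), bos, dtype=torch.long)
    for _ in range(n_steps):
        nxt = lm.step(ctx, tokens).argmax(-1, keepdim=True)   # greedy, for brevity
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens[:, 1:]

sem_lm = TinyDecoderLM(vocab=512)     # Stage 1: coarse semantic tokens
ac_lm = TinyDecoderLM(vocab=1024)     # Stage 2: fine acoustic tokens

mixture_ctx = torch.randn(1, 50, 128)            # stand-in for SSL mixture + enrollment features
semantic = generate(sem_lm, mixture_ctx, n_steps=20)
acoustic = generate(ac_lm, sem_lm.embed(semantic), n_steps=40)
print(semantic.shape, acoustic.shape)            # acoustic tokens would go to a codec decoder
```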

[AI-75] A Physics Informed Neural Network For Deriving MHD State Vectors From Global Active Regions Observations

【速读】: This paper addresses the problem of reconstructing the hidden magnetohydrodynamic (MHD) state vector of the tachocline, the shear layer at the base of the convection zone, from the observed surface distribution of solar magnetic active regions (ARs), enabling week-scale early prediction of flare-producing AR emergence. The key is PINNBARDS, a novel physics-informed neural network (PINN)-based simulator that combines the observed geometry of AR toroids with the MHD shallow-water tachocline model equations to derive, in a data-driven way, a dynamically self-consistent initial state vector (magnetic field, flow field, and shell-thickness variations), overcoming the limitation that traditional methods supply only the geometric shape and not the internal physical state.

链接: https://arxiv.org/abs/2512.20747
作者: Subhamoy Chatterjee,Mausumi Dikpati
机构: 未知
类目: olar and Stellar Astrophysics (astro-ph.SR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages, 12 figures, accepted for publication in The Astrophysical Journal

点击查看摘要

Abstract:Solar active regions (ARs) do not appear randomly but cluster along longitudinally warped toroidal bands ('toroids') that encode information about magnetic structures in the tachocline, where global-scale organization likely originates. Global MagnetoHydroDynamic Shallow-Water Tachocline (MHD-SWT) models have shown potential to simulate such toroids, matching observations qualitatively. For week-scale early prediction of flare-producing AR emergence, forward-integration of these toroids is necessary. This requires model initialization with a dynamically self-consistent MHD state-vector that includes the magnetic and flow fields and shell-thickness variations. However, synoptic magnetograms provide only the geometric shape of toroids, not the state-vector needed to initialize MHD-SWT models. To address this challenging task, we develop PINNBARDS, a novel Physics-Informed Neural Network (PINN)-Based AR Distribution Simulator that uses observational toroids and MHD-SWT equations to derive the initial state-vector. Using the Feb-14-2024 SDO/HMI synoptic map, we show that the PINN converges to physically consistent, predominantly antisymmetric toroids matching the observed ones. Although surface data provide the central latitudes and latitudinal widths of the north and south toroids, they cannot determine tachocline field strengths, which are connected to AR emergence. We explore solutions across a broad parameter range, finding hydrodynamically dominated structures for weak fields (~2 kG) and overly rigid behavior for strong fields (~100 kG). We obtain the best agreement with observations for 20-30 kG toroidal fields and ~10 degree bandwidth, consistent with low-order longitudinal mode excitation. To our knowledge, PINNBARDS is the first method for reconstructing state-vectors of hidden tachocline magnetic structures from surface patterns, potentially enabling weeks-ahead prediction of flare-producing AR emergence.
zh
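
The PINN recipe described above combines a physics residual with a misfit to the observed toroid geometry. The sketch below shows that generic pattern only: the network, the placeholder residual, and the toy observations are illustrative assumptions and do not reproduce the MHD-SWT equations.

```python
import torch
import torch.nn as nn

# Network maps surface coordinates to a toy 3-component state vector (h, u, v).
net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 3))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

coords = torch.rand(256, 2, requires_grad=True)               # collocation points
obs_coords, obs_shape = torch.rand(64, 2), torch.rand(64, 1)  # toy "toroid" observations

for step in range(200):
    state = net(coords)
    h = state[:, :1]                                          # shell-thickness-like field
    grads = torch.autograd.grad(h.sum(), coords, create_graph=True)[0]
    residual = grads[:, 0] + grads[:, 1]      # placeholder residual, NOT the MHD-SWT system
    loss_pde = (residual ** 2).mean()
    loss_obs = ((net(obs_coords)[:, :1] - obs_shape) ** 2).mean()  # match observed geometry
    loss = loss_pde + loss_obs
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.4f}")
```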

Machine Learning

[LG-0] Variationally correct operator learning: Reduced basis neural operator with a posteriori error estimation

链接: https://arxiv.org/abs/2512.21319
作者: Yuan Qiu,Wolfgang Dahmen,Peng Chen
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Minimizing PDE-residual losses is a common strategy to promote physical consistency in neural operators. However, standard formulations often lack variational correctness, meaning that small residuals do not guarantee small solution errors due to the use of non-compliant norms or ad hoc penalty terms for boundary conditions. This work develops a variationally correct operator learning framework by constructing first-order system least-squares (FOSLS) objectives whose values are provably equivalent to the solution error in PDE-induced norms. We demonstrate this framework on stationary diffusion and linear elasticity, incorporating mixed Dirichlet-Neumann boundary conditions via variational lifts to preserve norm equivalence without inconsistent penalties. To ensure the function space conformity required by the FOSLS loss, we propose a Reduced Basis Neural Operator (RBNO). The RBNO predicts coefficients for a pre-computed, conforming reduced basis, thereby ensuring variational stability by design while enabling efficient training. We provide a rigorous convergence analysis that bounds the total error by the sum of finite element discretization bias, reduced basis truncation error, neural network approximation error, and statistical estimation errors arising from finite sampling and optimization. Numerical benchmarks validate these theoretical bounds and demonstrate that the proposed approach achieves superior accuracy in PDE-compliant norms compared to standard baselines, while the residual loss serves as a reliable, computable a posteriori error estimator.
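
The conformity-by-construction idea is simple to state in code: the network outputs coefficients for a fixed, precomputed basis, so every prediction lies in the conforming reduced space. In the sketch below, the SVD-of-snapshots basis and the problem sizes are illustrative assumptions; the paper trains such a model against a FOSLS residual loss.

```python
import torch
import torch.nn as nn

n_dofs, n_basis, n_params = 1000, 20, 5
snapshots = torch.randn(n_dofs, 200)                 # stand-in for FEM solution snapshots
basis, _, _ = torch.linalg.svd(snapshots, full_matrices=False)
basis = basis[:, :n_basis]                           # POD-style reduced basis (n_dofs, n_basis)

coeff_net = nn.Sequential(nn.Linear(n_params, 64), nn.GELU(), nn.Linear(64, n_basis))

def rbno(params: torch.Tensor) -> torch.Tensor:
    """Map PDE parameters to a solution that lies in the reduced space by design."""
    return coeff_net(params) @ basis.T               # (batch, n_dofs)

u = rbno(torch.randn(8, n_params))
print(u.shape)  # torch.Size([8, 1000]); training would minimize a FOSLS residual loss on u
```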

[LG-1] Learning to Solve PDEs on Neural Shape Representations

链接: https://arxiv.org/abs/2512.21311
作者: Lilian Welschinger,Yilin Liu,Zican Wang,Niloy Mitra
类目: Machine Learning (cs.LG)
备注: Article webpage link: this https URL

点击查看摘要

Abstract:Solving partial differential equations (PDEs) on shapes underpins many shape analysis and engineering tasks; yet, prevailing PDE solvers operate on polygonal/triangle meshes while modern 3D assets increasingly live as neural representations. This mismatch leaves no suitable method to solve surface PDEs directly within the neural domain, forcing explicit mesh extraction or per-instance residual training, preventing end-to-end workflows. We present a novel, mesh-free formulation that learns a local update operator conditioned on neural (local) shape attributes, enabling surface PDEs to be solved directly where the (neural) data lives. The operator integrates naturally with prevalent neural surface representations, is trained once on a single representative shape, and generalizes across shape and topology variations, enabling accurate, fast inference without explicit meshing or per-instance optimization while preserving differentiability. Across analytic benchmarks (heat equation and Poisson solve on sphere) and real neural assets across different representations, our method slightly outperforms CPM while remaining reasonably close to FEM, and, to our knowledge, delivers the first end-to-end pipeline that solves surface PDEs on both neural and classical surface representations. Code will be released on acceptance.

[LG-2] Transcriptome-Conditioned Personalized De Novo Drug Generation for AML Using Metaheuristic Assembly and Target-Driven Filtering

链接: https://arxiv.org/abs/2512.21301
作者: Abdullah G. Elafifi,Basma Mamdouh,Mariam Hanafy,Muhammed Alaa Eldin,Yosef Khaled,Nesma Mohamed El-Gelany,Tarek H.M. Abou-El-Enien
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:Acute Myeloid Leukemia (AML) remains a clinical challenge due to its extreme molecular heterogeneity and high relapse rates. While precision medicine has introduced mutation-specific therapies, many patients still lack effective, personalized options. This paper presents a novel, end-to-end computational framework that bridges the gap between patient-specific transcriptomics and de novo drug discovery. By analyzing bulk RNA sequencing data from the TCGA-LAML cohort, the study utilized Weighted Gene Co-expression Network Analysis (WGCNA) to prioritize 20 high-value biomarkers, including metabolic transporters like HK3 and immune-modulatory receptors such as SIGLEC9. The physical structures of these targets were modeled using AlphaFold3, and druggable hotspots were quantitatively mapped via the DOGSiteScorer engine. We then developed a novel, reaction-first evolutionary metaheuristic algorithm, coupled with multi-objective optimization, that assembles novel ligands from fragment libraries, guided by spatial alignment to these identified hotspots. The generative model produced structurally unique chemical entities with a strong bias toward drug-like space, as evidenced by QED scores peaking between 0.5 and 0.7. Validation through ADMET profiling and SwissDock molecular docking identified high-confidence candidates, such as Ligand L1, which achieved a binding free energy of -6.571 kcal/mol against the A08A96 biomarker. These results demonstrate that integrating systems biology with metaheuristic molecular assembly can produce pharmacologically viable, patient-tailored leads, offering a scalable blueprint for precision oncology in AML and beyond.

[LG-3] Assessing the Software Security Comprehension of Large Language Models

链接: https://arxiv.org/abs/2512.21238
作者: Mohammed Latif Siddiq,Natalie Sekerak,Antonio Karam,Maria Leal,Arvin Islam-Gomes,Joanna C. S. Santos
类目: oftware Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: Submitted to Empirical Software Engineering (EMSE) journal

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used in software development, but their level of software security expertise remains unclear. This work systematically evaluates the security comprehension of five leading LLMs: GPT-4o-Mini, GPT-5-Mini, Gemini-2.5-Flash, Llama-3.1, and Qwen-2.5, using Bloom's Taxonomy as a framework. We assess six cognitive dimensions: remembering, understanding, applying, analyzing, evaluating, and creating. Our methodology integrates diverse datasets, including curated multiple-choice questions, vulnerable code snippets (SALLM), course assessments from an Introduction to Software Security course, real-world case studies (XBOW), and project-based creation tasks from a Secure Software Engineering course. Results show that while LLMs perform well on lower-level cognitive tasks such as recalling facts and identifying known vulnerabilities, their performance degrades significantly on higher-order tasks that require reasoning, architectural evaluation, and secure system creation. Beyond reporting aggregate accuracy, we introduce a software security knowledge boundary that identifies the highest cognitive level at which a model consistently maintains reliable performance. In addition, we identify 51 recurring misconception patterns exhibited by LLMs across Bloom's levels.

[LG-4] MiST: Understanding the Role of Mid-Stage Scientific Training in Developing Chemical Reasoning Models

链接: https://arxiv.org/abs/2512.21231
作者: Andres M Bran,Tong Xie,Shai Pranesh,Jeffrey Meng,Xuan Vu Nguyen,Jeremy Goumaz,David Ming Segura,Ruizhi Xu,Dongzhan Zhou,Wenjie Zhang,Bram Hoex,Philippe Schwaller
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
备注:

点击查看摘要

Abstract:Large Language Models can develop reasoning capabilities through online fine-tuning with rule-based rewards. However, recent studies reveal a critical constraint: reinforcement learning succeeds only when the base model already assigns non-negligible probability to correct answers – a property we term ‘latent solvability’. This work investigates the emergence of chemical reasoning capabilities and what these prerequisites mean for chemistry. We identify two necessary conditions for RL-based chemical reasoning: 1) Symbolic competence, and 2) Latent chemical knowledge. We propose mid-stage scientific training (MiST): a set of mid-stage training techniques to satisfy these, including data-mixing with SMILES/CIF-aware pre-processing, continued pre-training on 2.9B tokens, and supervised fine-tuning on 1B tokens. These steps raise the latent-solvability score on 3B and 7B models by up to 1.8x, and enable RL to lift top-1 accuracy from 10.9 to 63.9% on organic reaction naming, and from 40.6 to 67.4% on inorganic material generation. Similar results are observed for other challenging chemical tasks, while producing interpretable reasoning traces. Our results define clear prerequisites for chemical reasoning training and highlight the broader role of mid-stage training in unlocking reasoning capabilities.

[LG-5] Analytic and Variational Stability of Deep Learning Systems

链接: https://arxiv.org/abs/2512.21208
作者: Ronald Katende
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:We propose a unified analytic and variational framework for studying stability in deep learning systems viewed as coupled representation-parameter dynamics. The central object is the Learning Stability Profile, which tracks the infinitesimal response of representations, parameters, and update mechanisms to perturbations along the learning trajectory. We prove a Fundamental Analytic Stability Theorem showing that uniform boundedness of these stability signatures is equivalent, up to norm equivalence, to the existence of a Lyapunov-type energy that dissipates along the learning flow. In smooth regimes, the framework yields explicit stability exponents linking spectral norms, activation regularity, step sizes, and learning rates to contractivity of the learning dynamics. Classical spectral stability results for feedforward networks, a discrete CFL-type condition for residual architectures, and parametric and temporal stability laws for stochastic gradient methods arise as direct consequences. The theory extends to non-smooth learning systems, including ReLU networks, proximal and projected updates, and stochastic subgradient flows, by replacing classical derivatives with Clarke generalized derivatives and smooth energies with variational Lyapunov functionals. The resulting framework provides a unified dynamical description of stability across architectures and optimization methods, clarifying how architectural and algorithmic choices jointly govern robustness and sensitivity to perturbations. It also provides a foundation for further extensions to continuous-time limits and geometric formulations of learning dynamics.

[LG-6] A Unified Framework for EEG Seizure Detection Using Universum-Integrated Generalized Eigenvalues Proximal Support Vector Machine

链接: https://arxiv.org/abs/2512.21170
作者: Yogesh Kumar,Vrushank Ahire,M. A. Ganaie
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper presents two novel Universum-enhanced classifiers: the Universum Generalized Eigenvalue Proximal Support Vector Machine (U-GEPSVM) and the Improved U-GEPSVM (IU-GEPSVM), for EEG signal classification. Leveraging the computational efficiency of generalized eigenvalue decomposition and the generalization benefits of Universum learning, the proposed models address critical challenges in EEG analysis: non-stationarity, low signal-to-noise ratio, and limited labeled data. U-GEPSVM extends the GEPSVM framework by incorporating Universum constraints through a ratio-based objective function, while IU-GEPSVM enhances stability through a weighted difference-based formulation that provides independent control over class separation and Universum alignment. The models are evaluated on the Bonn University EEG dataset across two binary classification tasks: O vs. S (healthy, eyes closed, vs. seizure) and Z vs. S (healthy, eyes open, vs. seizure). IU-GEPSVM achieves peak accuracies of 85% (O vs. S) and 80% (Z vs. S), with mean accuracies of 81.29% and 77.57% respectively, outperforming baseline methods.
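
At the heart of the GEPSVM family is one generalized eigenvalue problem per class. The sketch below shows that core with scipy; the regularization, toy data, and the omission of the Universum terms (which U-GEPSVM/IU-GEPSVM add to this objective) are simplifying assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def gepsvm_plane(X_own: np.ndarray, X_other: np.ndarray, delta: float = 1e-3):
    """Plane w.x + b = 0 that hugs X_own and avoids X_other (plain GEPSVM)."""
    A = np.hstack([X_own, np.ones((len(X_own), 1))])    # augmented [x, 1] rows
    B = np.hstack([X_other, np.ones((len(X_other), 1))])
    G = A.T @ A + delta * np.eye(A.shape[1])            # closeness to the own class
    H = B.T @ B + delta * np.eye(B.shape[1])            # distance from the other class
    _, vecs = eigh(G, H)                                # solves G z = lambda H z
    z = vecs[:, 0]                                      # smallest ratio = best plane
    return z[:-1], z[-1]

rng = np.random.default_rng(0)
X_pos, X_neg = rng.normal(0, 1, (50, 4)), rng.normal(3, 1, (60, 4))
planes = [gepsvm_plane(X_pos, X_neg), gepsvm_plane(X_neg, X_pos)]

x = rng.normal(0, 1, 4)                                 # classify by the nearer plane
dists = [abs(w @ x + b) / np.linalg.norm(w) for w, b in planes]
print("assigned to class:", ["positive", "negative"][int(np.argmin(dists))])
```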

[LG-7] A Community-Enhanced Graph Representation Model for Link Prediction

链接: https://arxiv.org/abs/2512.21166
作者: Lei Wang,Darong Lai
类目: ocial and Information Networks (cs.SI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Although Graph Neural Networks (GNNs) have become the dominant approach for graph representation learning, their performance on link prediction tasks does not always surpass that of traditional heuristic methods such as Common Neighbors and Jaccard Coefficient. This is mainly because existing GNNs tend to focus on learning local node representations, making it difficult to effectively capture structural relationships between node pairs. Furthermore, excessive reliance on local neighborhood information can lead to over-smoothing. Prior studies have shown that introducing global structural encoding can partially alleviate this issue. To address these limitations, we propose a Community-Enhanced Link Prediction (CELP) framework that incorporates community structure to jointly model local and global graph topology. Specifically, CELP enhances the graph via community-aware, confidence-guided edge completion and pruning, while integrating multi-scale structural features to achieve more accurate link prediction. Experimental results across multiple benchmark datasets demonstrate that CELP achieves superior performance, validating the crucial role of community structure in improving link prediction accuracy.

[LG-8] ElfCore: A 28nm Neural Processor Enabling Dynamic Structured Sparse Training and Online Self-Supervised Learning with Activity-Dependent Weight Update

链接: https://arxiv.org/abs/2512.21153
作者: Zhe Su,Giacomo Indiveri
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
备注: This paper has been published in the proceedings of the 2025 IEEE European Solid-State Electronics Research Conference (ESSERC)

点击查看摘要

Abstract:In this paper, we present ElfCore, a 28nm digital spiking neural network processor tailored for event-driven sensory signal processing. ElfCore is the first to efficiently integrate: (1) a local online self-supervised learning engine that enables multi-layer temporal learning without labeled inputs; (2) a dynamic structured sparse training engine that supports high-accuracy sparse-to-sparse learning; and (3) an activity-dependent sparse weight update mechanism that selectively updates weights based solely on input activity and network dynamics. Demonstrated on tasks including gesture recognition, speech, and biomedical signal processing, ElfCore outperforms state-of-the-art solutions with up to 16X lower power consumption, 3.8X reduced on-chip memory requirements, and 5.9X greater network capacity efficiency.

[LG-9] A Mechanistic Analysis of Transformers for Dynamical Systems

链接: https://arxiv.org/abs/2512.21113
作者: Gregory Duthé,Nikolaos Evangelou,Wei Liu,Ioannis G. Kevrekidis,Eleni Chatzi
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Transformers are increasingly adopted for modeling and forecasting time-series, yet their internal mechanisms remain poorly understood from a dynamical systems perspective. In contrast to classical autoregressive and state-space models, which benefit from well-established theoretical foundations, Transformer architectures are typically treated as black boxes. This gap becomes particularly relevant as attention-based models are considered for general-purpose or zero-shot forecasting across diverse dynamical regimes. In this work, we do not propose a new forecasting model, but instead investigate the representational capabilities and limitations of single-layer Transformers when applied to dynamical data. Building on a dynamical systems perspective we interpret causal self-attention as a linear, history-dependent recurrence and analyze how it processes temporal information. Through a series of linear and nonlinear case studies, we identify distinct operational regimes. For linear systems, we show that the convexity constraint imposed by softmax attention fundamentally restricts the class of dynamics that can be represented, leading to oversmoothing in oscillatory settings. For nonlinear systems under partial observability, attention instead acts as an adaptive delay-embedding mechanism, enabling effective state reconstruction when sufficient temporal context and latent dimensionality are available. These results help bridge empirical observations with classical dynamical systems theory, providing insight into when and why Transformers succeed or fail as models of dynamical systems.
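
The convexity constraint mentioned above can be seen in a few lines: softmax attention mixes values with non-negative weights that sum to one, so outputs can never leave the convex hull of past inputs, which is exactly why oscillations get smoothed rather than amplified. The single-head, unprojected-value setup below is a deliberate simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 16, 8
signs = np.where(np.arange(T)[:, None] % 2 == 0, 1.0, -1.0)  # oscillating series
x = signs * rng.uniform(0.5, 1.0, (T, d))

Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))
scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(d)
scores = np.where(np.tril(np.ones((T, T))) == 1, scores, -np.inf)  # causal mask
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)     # rows: non-negative, sum to 1

out = weights @ x                # each output is a convex combination of past inputs
print("input range :", x.min().round(2), x.max().round(2))
print("output range:", out.min().round(2), out.max().round(2))
assert x.min() - 1e-9 <= out.min() and out.max() <= x.max() + 1e-9  # stays in the hull
```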

[LG-10] Shared Representation Learning for High-Dimensional Multi-Task Forecasting under Resource Contention in Cloud-Native Backends

链接: https://arxiv.org/abs/2512.21102
作者: Zixiao Huang,Jixiao Yang,Sijia Li,Chi Zhang,Jinyu Chen,Chengda Xu
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This study proposes a unified forecasting framework for high-dimensional multi-task time series to meet the prediction demands of cloud native backend systems operating under highly dynamic loads, coupled metrics, and parallel tasks. The method builds a shared encoding structure to represent diverse monitoring indicators in a unified manner and employs a state fusion mechanism to capture trend changes and local disturbances across different time scales. A cross-task structural propagation module is introduced to model potential dependencies among nodes, enabling the model to understand complex structural patterns formed by resource contention, link interactions, and changes in service topology. To enhance adaptability to non-stationary behaviors, the framework incorporates a dynamic adjustment mechanism that automatically regulates internal feature flows according to system state changes, ensuring stable predictions in the presence of sudden load shifts, topology drift, and resource jitter. The experimental evaluation compares multiple models across various metrics and verifies the effectiveness of the framework through analyses of hyperparameter sensitivity, environmental sensitivity, and data sensitivity. The results show that the proposed method achieves superior performance on several error metrics and provides more accurate representations of future states under different operating conditions. Overall, the unified forecasting framework offers reliable predictive capability for high-dimensional, multi-task, and strongly dynamic environments in cloud native systems and provides essential technical support for intelligent backend management.

[LG-11] Dyna-Style Reinforcement Learning Modeling and Control of Non-linear Dynamics

链接: https://arxiv.org/abs/2512.21081
作者: Karim Abdelsalam,Zeyad Gamal,Ayman El-Badawy
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Controlling systems with complex, nonlinear dynamics poses a significant challenge, particularly in achieving efficient and robust control. In this paper, we propose a Dyna-Style Reinforcement Learning control framework that integrates Sparse Identification of Nonlinear Dynamics (SINDy) with Twin Delayed Deep Deterministic Policy Gradient (TD3) reinforcement learning. SINDy is used to identify a data-driven model of the system, capturing its key dynamics without requiring an explicit physical model. This identified model is used to generate synthetic rollouts that are periodically injected into the reinforcement learning replay buffer during training on the real environment, enabling efficient policy learning with limited data available. By leveraging this hybrid approach, we mitigate the sample inefficiency of traditional model-free reinforcement learning methods while ensuring accurate control of nonlinear systems. To demonstrate the effectiveness of this framework, we apply it to a bi-rotor system as a case study, evaluating its performance in stabilization and trajectory tracking. The results show that our SINDy-TD3 approach achieves superior accuracy and robustness compared to direct reinforcement learning techniques, highlighting the potential of combining data-driven modeling with reinforcement learning for complex dynamical systems.
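
The model-based half of the loop is compact enough to sketch: identify sparse dynamics with sequentially thresholded least squares (the core of SINDy), then roll the identified model forward to mint synthetic transitions for the replay buffer. The 1-D toy plant, candidate library, and thresholds below are illustrative assumptions; the paper pairs this with TD3 training on the real environment.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
u = rng.uniform(-1, 1, 500)
xdot = -0.5 * x + 0.8 * u - 0.3 * x**3 + 0.01 * rng.normal(size=500)  # unknown plant

def library(x, u):                        # candidate terms Theta(x, u)
    return np.column_stack([np.ones_like(x), x, u, x**2, x * u, x**3])

def stlsq(theta, y, lam=0.05, iters=10):  # sequentially thresholded least squares
    xi = np.linalg.lstsq(theta, y, rcond=None)[0]
    for _ in range(iters):
        xi[np.abs(xi) < lam] = 0.0
        big = np.abs(xi) >= lam
        xi[big] = np.linalg.lstsq(theta[:, big], y, rcond=None)[0]
    return xi

xi = stlsq(library(x, u), xdot)
print("identified coefficients:", np.round(xi, 3))  # should recover ~ -0.5 x + 0.8 u - 0.3 x^3

# Synthetic rollouts from the identified model, ready to mix into a replay buffer.
dt, state, synthetic = 0.05, np.array([0.3]), []
for _ in range(50):
    action = rng.uniform(-1, 1, 1)
    deriv = library(state, action) @ xi
    nxt = state + dt * deriv                         # explicit Euler step
    synthetic.append((state.copy(), action.copy(), nxt.copy()))
    state = nxt
```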

[LG-12] Blurb-Refined Inference from Crowdsourced Book Reviews using Hierarchical Genre Mining with Dual-Path Graph Convolutions

链接: https://arxiv.org/abs/2512.21076
作者: Suraj Kumar,Utsav Kumar Nareti,Soumi Chattopadhyay,Chandranath Adak,Prolay Mallick
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: 10 pages, 4 figures, 3 tables

点击查看摘要

Abstract:Accurate book genre classification is fundamental to digital library organization, content discovery, and personalized recommendation. Existing approaches typically model genre prediction as a flat, single-label task, ignoring hierarchical genre structure and relying heavily on noisy, subjective user reviews, which often degrade classification reliability. We propose HiGeMine, a two-phase hierarchical genre mining framework that robustly integrates user reviews with authoritative book blurbs. In the first phase, HiGeMine employs a zero-shot semantic alignment strategy to filter reviews, retaining only those semantically consistent with the corresponding blurb, thereby mitigating noise, bias, and irrelevance. In the second phase, we introduce a dual-path, two-level graph-based classification architecture: a coarse-grained Level-1 binary classifier distinguishes fiction from non-fiction, followed by Level-2 multi-label classifiers for fine-grained genre prediction. Inter-genre dependencies are explicitly modeled using a label co-occurrence graph, while contextual representations are derived from pretrained language models applied to the filtered textual content. To facilitate systematic evaluation, we curate a new hierarchical book genre dataset. Extensive experiments demonstrate that HiGeMine consistently outperformed strong baselines across hierarchical genre classification tasks. The proposed framework offers a principled and effective solution for leveraging both structured and unstructured textual data in hierarchical book genre analysis.
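
The phase-one filter can be approximated with off-the-shelf sentence embeddings: keep a review only if it is semantically close to the blurb. The model name, threshold, and toy texts below are assumptions, not the paper's exact setup.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
blurb = "A detective unravels a decades-old disappearance in a small coastal town."
reviews = [
    "Gripping mystery, the small-town atmosphere and slow reveal are fantastic.",
    "Shipping took three weeks and the cover arrived bent. One star.",
    "The detective's investigation kept me guessing until the final chapter.",
]

blurb_emb = model.encode(blurb, convert_to_tensor=True)
review_embs = model.encode(reviews, convert_to_tensor=True)
scores = util.cos_sim(review_embs, blurb_emb).squeeze(1)   # cosine similarity per review

threshold = 0.35                                            # assumed cutoff
kept = [r for r, s in zip(reviews, scores) if s.item() >= threshold]
print(f"kept {len(kept)}/{len(reviews)} reviews:", kept)
```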

[LG-13] zkFL-Health: Blockchain-Enabled Zero-Knowledge Federated Learning for Medical AI Privacy

链接: https://arxiv.org/abs/2512.21048
作者: Savvy Sharma,George Petrovic,Sarthak Kaushik
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: 10 pages, 1 figure, 5 tables

点击查看摘要

Abstract:Healthcare AI needs large, diverse datasets, yet strict privacy and governance constraints prevent raw data sharing across institutions. Federated learning (FL) mitigates this by training where data reside and exchanging only model updates, but practical deployments still face two core risks: (1) privacy leakage via gradients or updates (membership inference, gradient inversion) and (2) trust in the aggregator, a single point of failure that can drop, alter, or inject contributions undetected. We present zkFL-Health, an architecture that combines FL with zero-knowledge proofs (ZKPs) and Trusted Execution Environments (TEEs) to deliver privacy-preserving, verifiably correct collaborative training for medical AI. Clients locally train and commit their updates; the aggregator operates within a TEE to compute the global update and produces a succinct ZK proof (via Halo2/Nova) that it used exactly the committed inputs and the correct aggregation rule, without revealing any client update to the host. Verifier nodes validate the proof and record cryptographic commitments on-chain, providing an immutable audit trail and removing the need to trust any single party. We outline system and threat models tailored to healthcare, the zkFL-Health protocol, security/privacy guarantees, and a performance evaluation plan spanning accuracy, privacy risk, latency, and cost. This framework enables multi-institutional medical AI with strong confidentiality, integrity, and auditability, key properties for clinical adoption and regulatory compliance.

[LG-14] Agentic Multi-Persona Framework for Evidence-Aware Fake News Detection

链接: https://arxiv.org/abs/2512.21039
作者: Roopa Bukke,Soumya Pandey,Suraj Kumar,Soumi Chattopadhyay,Chandranath Adak
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 12 pages, 8 tables, 2 figures

点击查看摘要

Abstract:The rapid proliferation of online misinformation poses significant risks to public trust, policy, and safety, necessitating reliable automated fake news detection. Existing methods often struggle with multimodal content, domain generalization, and explainability. We propose AMPEND-LS, an agentic multi-persona evidence-grounded framework with LLM-SLM synergy for multimodal fake news detection. AMPEND-LS integrates textual, visual, and contextual signals through a structured reasoning pipeline powered by LLMs, augmented with reverse image search, knowledge graph paths, and persuasion strategy analysis. To improve reliability, we introduce a credibility fusion mechanism combining semantic similarity, domain trustworthiness, and temporal context, and a complementary SLM classifier to mitigate LLM uncertainty and hallucinations. Extensive experiments across three benchmark datasets demonstrate that AMPEND-LS consistently outperformed state-of-the-art baselines in accuracy, F1 score, and robustness. Qualitative case studies further highlight its transparent reasoning and resilience against evolving misinformation. This work advances the development of adaptive, explainable, and evidence-aware systems for safeguarding online information integrity.

[LG-15] Towards Better Search with Domain-Aware Text Embeddings for C2C Marketplaces AAAI2026

链接: https://arxiv.org/abs/2512.21021
作者: Andre Rusli,Miao Cao,Shoma Ishimoto,Sho Akiyama,Max Frenzel
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 5 pages, AAAI 2026 Workshop on New Frontiers in Information Retrieval

点击查看摘要

Abstract:Consumer-to-consumer (C2C) marketplaces pose distinct retrieval challenges: short, ambiguous queries; noisy, user-generated listings; and strict production constraints. This paper reports our experiment to build a domain-aware Japanese text-embedding approach to improve the quality of search at Mercari, Japan’s largest C2C marketplace. We experimented with fine-tuning on purchase-driven query-title pairs, using role-specific prefixes to model query-item asymmetry. To meet production constraints, we apply Matryoshka Representation Learning to obtain compact, truncation-robust embeddings. Offline evaluation on historical search logs shows consistent gains over a strong generic encoder, with particularly large improvements when replacing PCA compression with Matryoshka truncation. A manual assessment further highlights better handling of proper nouns, marketplace-specific semantics, and term-importance alignment. Additionally, an initial online A/B test demonstrates statistically significant improvements in revenue per user and search-flow efficiency, with transaction frequency maintained. Results show that domain-aware embeddings improve relevance and efficiency at scale and form a practical foundation for richer LLM-era search experiences.
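
The Matryoshka trick referenced above is what makes PCA unnecessary at serving time: because training supervises nested prefixes of the embedding, one can keep just the first m dimensions and re-normalize. The dimensions and random vectors below are illustrative stand-ins for real embeddings.

```python
import numpy as np

def truncate(emb: np.ndarray, m: int) -> np.ndarray:
    """Keep the first m dims of an MRL-trained embedding and L2-renormalize."""
    prefix = emb[..., :m]
    return prefix / np.linalg.norm(prefix, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
query = rng.normal(size=768); query /= np.linalg.norm(query)
items = rng.normal(size=(5, 768)); items /= np.linalg.norm(items, axis=1, keepdims=True)

for m in (768, 256, 64):               # trade index size and latency against fidelity
    sims = truncate(items, m) @ truncate(query, m)
    print(f"dim={m:4d} ranking:", np.argsort(-sims))
```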

[LG-16] CoSeNet: A Novel Approach for Optimal Segmentation of Correlation Matrices

链接: https://arxiv.org/abs/2512.21000
作者: Alberto. Palomo-Alonso,David Casillas-Perez,Silvia Jimenez-Fernandez,Antonio Portilla-Figueras,Sancho Salcedo-Sanz
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In this paper, we propose a novel approach for the optimal identification of correlated segments in noisy correlation matrices. The proposed model, known as CoSeNet (Correlation Segmentation Network), is based on a four-layer algorithmic architecture comprising input, formatting, re-scaling, and segmentation layers. The model identifies correlated segments in such matrices more effectively than previous approaches to similar problems. Internally, the proposed model utilizes an overlapping technique and uses pre-trained Machine Learning (ML) algorithms, which makes it robust and generalizable. The CoSeNet approach also includes a method that optimizes the parameters of the re-scaling layer using a heuristic algorithm with a fitness based on a Window Difference metric. The output of the model is a binary, noise-free matrix representing the optimal segmentation together with its segmentation points, and can be used in a variety of applications, offering compromise solutions among the efficiency, memory footprint, and speed of the deployed model.

[LG-17] Deadline-Aware Online Scheduling for LLM Fine-Tuning with Spot Market Predictions

链接: https://arxiv.org/abs/2512.20967
作者: Linggao Kong,Yuedong Xu,Lei Jiao,Chuan Xu
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As foundation models grow in size, fine-tuning them becomes increasingly expensive. While GPU spot instances offer a low-cost alternative to on-demand resources, their volatile prices and availability make deadline-aware scheduling particularly challenging. We tackle this difficulty by using a mix of spot and on-demand instances. Distinctively, we show the predictability of prices and availability in a spot instance market, the power of prediction in enabling cost-efficient scheduling, and its sensitivity to estimation errors. An integer programming problem is formulated to capture the use of mixed instances under both price and availability dynamics. We propose an online allocation algorithm with prediction, based on the committed horizon control approach, which leverages a commitment level to enforce the partial sequence of decisions. When this prediction becomes inaccurate, we further present a complementary online algorithm without predictions. An online policy selection algorithm is developed that learns the best policy from a pool constructed by varying the parameters of both algorithms. We prove that the prediction-based algorithm achieves tighter performance bounds as prediction error decreases, while the policy selection algorithm possesses a regret bound of $\mathcal{O}(\sqrt{T})$. Experimental results demonstrate that our online framework can adaptively select the best policy under varying spot market dynamics and prediction quality, consistently outperforming baselines and improving utility by up to 54.8%.

[LG-18] Solving Functional PDEs with Gaussian Processes and Applications to Functional Renormalization Group Equations

链接: https://arxiv.org/abs/2512.20956
作者: Xianjin Yang,Matthieu Darcy,Matthew Hudes,Francis J. Alexander,Gregory Eyink,Houman Owhadi
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present an operator learning framework for solving non-perturbative functional renormalization group equations, which are integro-differential equations defined on functionals. Our proposed approach uses Gaussian process operator learning to construct a flexible functional representation formulated directly on function space, making it independent of a particular equation or discretization. Our method is flexible, and can apply to a broad range of functional differential equations while still allowing for the incorporation of physical priors in either the prior mean or the kernel design. We demonstrate the performance of our method on several relevant equations, such as the Wetterich and Wilson–Polchinski equations, showing that it achieves equal or better performance than existing approximations such as the local-potential approximation, while being significantly more flexible. In particular, our method can handle non-constant fields, making it promising for the study of more complex field configurations, such as instantons.

[LG-19] AirGS: Real-Time 4D Gaussian Streaming for Free-Viewpoint Video Experiences

链接: https://arxiv.org/abs/2512.20943
作者: Zhe Wang,Jinghang Li,Yifei Zhu
类目: Graphics (cs.GR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Multimedia (cs.MM); Networking and Internet Architecture (cs.NI); Image and Video Processing (eess.IV)
备注: This paper is accepted by IEEE International Conference on Computer Communications (INFOCOM), 2026

点击查看摘要

Abstract:Free-viewpoint video (FVV) enables immersive viewing experiences by allowing users to view scenes from arbitrary perspectives. As a prominent reconstruction technique for FVV generation, 4D Gaussian Splatting (4DGS) models dynamic scenes with time-varying 3D Gaussian ellipsoids and achieves high-quality rendering via fast rasterization. However, existing 4DGS approaches suffer from quality degradation over long sequences and impose substantial bandwidth and storage overhead, limiting their applicability in real-time and wide-scale deployments. Therefore, we present AirGS, a streaming-optimized 4DGS framework that rearchitects the training and delivery pipeline to enable high-quality, low-latency FVV experiences. AirGS converts Gaussian video streams into multi-channel 2D formats and intelligently identifies keyframes to enhance frame reconstruction quality. It further combines temporal coherence with inflation loss to reduce training time and representation size. To support communication-efficient transmission, AirGS models 4DGS delivery as an integer linear programming problem and designs a lightweight pruning-level selection algorithm that adaptively prunes the Gaussian updates to be transmitted, balancing reconstruction quality and bandwidth consumption. Extensive experiments demonstrate that AirGS reduces quality deviation in PSNR by more than 20% when scenes change, maintains frame-level PSNR consistently above 30, accelerates training by 6 times, and reduces per-frame transmission size by nearly 50% compared to SOTA 4DGS approaches.

[LG-20] Towards a General Framework for Predicting and Explaining the Hardness of Graph-based Combinatorial Optimization Problems using Machine Learning and Association Rule Mining

链接: https://arxiv.org/abs/2512.20915
作者: Bharat Sharman,Elkafi Hassini
类目: Machine Learning (cs.LG); Combinatorics (math.CO)
备注:

点击查看摘要

Abstract:This study introduces GCO-HPIF, a general machine-learning-based framework to predict and explain the computational hardness of combinatorial optimization problems that can be represented on graphs. The framework consists of two stages. In the first stage, a dataset is created comprising problem-agnostic graph features and hardness classifications of problem instances. Machine-learning-based classification algorithms are trained to map graph features to hardness categories. In the second stage, the framework explains the predictions using an association rule mining algorithm. Additionally, machine-learning-based regression models are trained to predict algorithmic computation times. The GCO-HPIF framework was applied to a dataset of 3287 maximum clique problem instances compiled from the COLLAB, IMDB, and TWITTER graph datasets using five state-of-the-art algorithms, namely three exact branch-and-bound-based algorithms (Gurobi, CliSAT, and MOMC) and two graph-neural-network-based algorithms (EGN and HGS). The framework demonstrated excellent performance in predicting instance hardness, achieving a weighted F1 score of 0.9921, a minority-class F1 score of 0.878, and an ROC-AUC score of 0.9083 using only three graph features. The best association rule found by the FP-Growth algorithm for explaining the hardness predictions had a support of 0.8829 for hard instances and an overall accuracy of 87.64 percent, underscoring the framework’s usefulness for both prediction and explanation. Furthermore, the best-performing regression model for predicting computation times achieved a percentage RMSE of 5.12 and an R2 value of 0.991.

[LG-21] me-Efficient Evaluation and Enhancement of Adversarial Robustness in Deep Neural Networks

链接: https://arxiv.org/abs/2512.20893
作者: Runqi Lin
类目: Machine Learning (cs.LG)
备注: Ph.D. Thesis, The University of Sydney

点击查看摘要

Abstract:With deep neural networks (DNNs) increasingly embedded in modern society, ensuring their safety has become a critical and urgent issue. In response, substantial efforts have been dedicated to the red-blue adversarial framework, where the red team focuses on identifying vulnerabilities in DNNs and the blue team on mitigating them. However, existing approaches from both teams remain computationally intensive, constraining their applicability to large-scale models. To overcome this limitation, this thesis endeavours to provide time-efficient methods for the evaluation and enhancement of adversarial robustness in DNNs.

[LG-22] From GNNs to Symbolic Surrogates via Kolmogorov-Arnold Networks for Delay Prediction

链接: https://arxiv.org/abs/2512.20885
作者: Sami Marouani,Kamal Singh,Baptiste Jeudy,Amaury Habrard
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
备注:

点击查看摘要

Abstract:Accurate prediction of flow delay is essential for optimizing and managing modern communication networks. We investigate three levels of modeling for this task. First, we implement a heterogeneous GNN with attention-based message passing, establishing a strong neural baseline. Second, we propose FlowKANet, in which Kolmogorov-Arnold Networks replace standard MLP layers, reducing trainable parameters while maintaining competitive predictive performance. FlowKANet integrates KAMP-Attn (Kolmogorov-Arnold Message Passing with Attention), embedding KAN operators directly into message-passing and attention computation. Finally, we distill the model into symbolic surrogate models using block-wise regression, producing closed-form equations that eliminate trainable weights while preserving graph-structured dependencies. The results show that KAN layers provide a favorable trade-off between efficiency and accuracy, and that symbolic surrogates highlight the potential for lightweight deployment and enhanced transparency.

[LG-23] Better Call Graphs: A New Dataset of Function Call Graphs for Malware Classification

链接: https://arxiv.org/abs/2512.20872
作者: Jakir Hossain,Gurvinder Singh,Lukasz Ziarek,Ahmet Erdem Sarıyüce
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Function call graphs (FCGs) have emerged as a powerful abstraction for malware detection, capturing the behavioral structure of applications beyond surface-level signatures. Their utility in traditional program analysis has been well established, enabling effective classification and analysis of malicious software. In the mobile domain, especially in the Android ecosystem, FCG-based malware classification is particularly critical due to the platform’s widespread adoption and the complex, component-based structure of Android apps. However, progress in this direction is hindered by the lack of large-scale, high-quality Android-specific FCG datasets. Existing datasets are often outdated, dominated by small or redundant graphs resulting from app repackaging, and fail to reflect the diversity of real-world malware. These limitations lead to overfitting and unreliable evaluation of graph-based classification methods. To address this gap, we introduce Better Call Graphs (BCG), a comprehensive dataset of large and unique FCGs extracted from recent Android application packages (APKs). BCG includes both benign and malicious samples spanning various families and types, along with graph-level features for each APK. Through extensive experiments using baseline classifiers, we demonstrate the necessity and value of BCG compared to existing datasets. BCG is publicly available at this https URL.

[LG-24] Robustness Certificates for Neural Networks against Adversarial Attacks

链接: https://arxiv.org/abs/2512.20865
作者: Sara Taheri,Mahalakshmi Sabanayagam,Debarghya Ghoshdastidar,Majid Zamani
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:The increasing use of machine learning in safety-critical domains amplifies the risk of adversarial threats, especially data poisoning attacks that corrupt training data to degrade performance or induce unsafe behavior. Most existing defenses lack formal guarantees or rely on restrictive assumptions about the model class, attack type, extent of poisoning, or point-wise certification, limiting their practical reliability. This paper introduces a principled formal robustness certification framework that models gradient-based training as a discrete-time dynamical system (dt-DS) and formulates poisoning robustness as a formal safety verification problem. By adapting the concept of barrier certificates (BCs) from control theory, we introduce sufficient conditions to certify a robust radius ensuring that the terminal model remains safe under worst-case $\ell_p$-norm-based poisoning. To make this practical, we parameterize BCs as neural networks trained on finite sets of poisoned trajectories. We further derive probably approximately correct (PAC) bounds by solving a scenario convex program (SCP), which yields a confidence lower bound on the certified robustness radius generalizing beyond the training set. Importantly, our framework also extends to certification against test-time attacks, making it the first unified framework to provide formal guarantees in both training and test-time attack settings. Experiments on MNIST, SVHN, and CIFAR-10 show that our approach certifies non-trivial perturbation budgets while being model-agnostic and requiring no prior knowledge of the attack or contamination level.

[LG-25] Defending against adversarial attacks using mixture of experts

链接: https://arxiv.org/abs/2512.20821
作者: Mohammad Meymani,Roozbeh Razavi-Far
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Machine learning is a powerful tool enabling full automation of a huge number of tasks without explicit programming. Despite recent progress of machine learning in different domains, these models have shown vulnerabilities when exposed to adversarial threats. Adversarial threats aim to prevent machine learning models from satisfying their objectives. They can craft adversarial perturbations that are imperceptible to the human eye but cause misclassification during inference. Moreover, they can poison the training data to harm the model's performance, or query the model to steal its sensitive information. In this paper, we propose a defense system, which devises an adversarial training module within a mixture-of-experts architecture to enhance its robustness against adversarial threats. In our proposed defense system, we use nine pre-trained experts with ResNet-18 as their backbone. During end-to-end training, the parameters of the expert models and the gating mechanism are jointly updated, allowing further optimization of the experts. Our proposed defense system outperforms state-of-the-art defense systems and plain classifiers that use a more complex architecture than our model's backbone.
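
The ensemble's forward pass and its joint adversarial update can be sketched briefly: a gating network weights the experts' logits, and an FGSM-style perturbation is computed through the whole mixture so that experts and gate are optimized together. The linear stand-ins below replace the paper's nine ResNet-18 experts and are assumptions for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_experts, n_classes = 9, 10
experts = nn.ModuleList([nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, n_classes))
                         for _ in range(n_experts)])
gate = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, n_experts))

def mixture(x: torch.Tensor) -> torch.Tensor:
    weights = F.softmax(gate(x), dim=-1)                  # (B, n_experts)
    logits = torch.stack([e(x) for e in experts], dim=1)  # (B, n_experts, C)
    return (weights.unsqueeze(-1) * logits).sum(dim=1)    # gated combination

x, y = torch.randn(8, 3, 32, 32), torch.randint(0, n_classes, (8,))
x.requires_grad_(True)
loss = F.cross_entropy(mixture(x), y)
loss.backward()
x_adv = (x + 8 / 255 * x.grad.sign()).detach()            # FGSM perturbation
adv_loss = F.cross_entropy(mixture(x_adv), y)             # adversarial training signal
adv_loss.backward()                                       # gradients reach experts AND gate
```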

[LG-26] FedMPDD: Communication-Efficient Federated Learning with Privacy Preservation Attributes via Projected Directional Derivative

链接: https://arxiv.org/abs/2512.20814
作者: Mohammadreza Rostami,Solmaz S. Kia
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper introduces FedMPDD (Federated Learning via Multi-Projected Directional Derivatives), a novel algorithm that simultaneously optimizes bandwidth utilization and enhances privacy in Federated Learning. The core idea of FedMPDD is to encode each client's high-dimensional gradient by computing its directional derivatives along multiple random vectors. This compresses the gradient into a much smaller message, significantly reducing uplink communication costs from $\mathcal{O}(d)$ to $\mathcal{O}(m)$, where $m \ll d$. The server then decodes the aggregated information by projecting it back onto the same random vectors. Our key insight is that averaging multiple projections overcomes the dimension-dependent convergence limitations of a single projection. We provide a rigorous theoretical analysis, establishing that FedMPDD converges at a rate of $\mathcal{O}(1/\sqrt{K})$, matching the performance of FedSGD. Furthermore, we demonstrate that our method provides some inherent privacy against gradient inversion attacks due to the geometric properties of low-rank projections, offering a tunable privacy-utility trade-off controlled by the number of projections. Extensive experiments on benchmark datasets validate our theory and demonstrate the effectiveness of our method.
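
The encode/decode step has a clean linear-algebra core: the client uplinks only the m directional derivatives g·v_i, and the server rescales the back-projection to obtain an unbiased estimate of g. The Gaussian directions and d/m scaling below are standard random-projection choices used for illustration, not necessarily the paper's exact construction.

```python
import numpy as np

d, m = 10_000, 100
rng = np.random.default_rng(0)                 # shared seed = shared directions
g = rng.normal(size=d)                         # the client's true gradient

V = rng.normal(size=(m, d)) / np.sqrt(d)       # m random near-unit directions
uplink = V @ g                                 # client sends m numbers instead of d

# Server-side decoding: E[(d/m) * V.T @ V] = I for rows v_i ~ N(0, I/d),
# so g_hat is an unbiased (if noisy) estimate of g.
g_hat = (d / m) * (V.T @ uplink)
cos = g @ g_hat / (np.linalg.norm(g) * np.linalg.norm(g_hat))
print(f"uplink size: {m}/{d} floats, cosine(g, g_hat) = {cos:.3f}")
```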

[LG-27] GraphFire-X: Physics-Informed Graph Attention Networks and Structural Gradient Boosting for Building-Scale Wildfire Preparedness at the Wildland-Urban Interface

链接: https://arxiv.org/abs/2512.20813
作者: Miguel Esparza,Vamshi Battal,Ali Mostafavi
类目: Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As wildfires increasingly evolve into urban conflagrations, traditional risk models that treat structures as isolated assets fail to capture the non-linear contagion dynamics characteristic of the wildland-urban interface (WUI). This research bridges the gap between mechanistic physics and data-driven learning by establishing a novel dual-specialist ensemble framework that disentangles vulnerability into two distinct vectors: environmental contagion and structural fragility. The architecture integrates two specialized predictive streams: an environmental specialist, implemented as a graph neural network (GNN) that operationalizes the community as a directed contagion graph weighted by physics-informed convection, radiation, and ember probabilities, and enriched with high-dimensional Google AlphaEarth Foundation embeddings; and a structural specialist, implemented via XGBoost to isolate granular asset-level resilience. Applied to the 2025 Eaton Fire, the framework reveals a critical dichotomy in risk drivers. The GNN demonstrates that neighborhood-scale environmental pressure overwhelmingly dominates intrinsic structural features in defining propagation pathways, while the XGBoost model identifies eaves as the primary micro-scale ingress vector. By synthesizing these divergent signals through logistic stacking, the ensemble achieves robust classification and generates a diagnostic risk topology. This capability empowers decision makers to move beyond binary loss prediction and precisely target mitigation, prioritizing vegetation management for high-connectivity clusters and structural hardening for architecturally vulnerable nodes, thereby operationalizing a proactive, data-driven approach to community resilience.

[LG-28] Symbolic regression for defect interactions in 2D materials

链接: https://arxiv.org/abs/2512.20785
作者: Mikhail Lazarev,Andrey Ustyuzhanin
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
备注:

点击查看摘要

Abstract:Machine learning models have become firmly established across all scientific fields. Extracting features from data and making inferences based on them with neural network models often yields high accuracy; however, this approach has several drawbacks. Symbolic regression is a powerful technique for discovering analytical equations that describe data, providing interpretable and generalizable models capable of predicting unseen data. Symbolic regression methods have gained new momentum with the advancement of neural network technologies and offer several advantages, the main one being the interpretability of results. In this work, we examined the application of the deep symbolic regression algorithm SEGVAE to determine the properties of two-dimensional materials with defects. Comparing the results with state-of-the-art graph neural network-based methods shows comparable or, in some cases, even identical outcomes. We also discuss the applicability of this class of methods in natural sciences.

[LG-29] Improving Matrix Exponential for Generative AI Flows: A Taylor-Based Approach Beyond Paterson–Stockmeyer

链接: https://arxiv.org/abs/2512.20777
作者: Jorge Sastre,Daniel Faronbi,José Miguel Alonso,Peter Traver,Javier Ibáñez,Nuria Lloret
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
备注: 41 pages, 35 figures

点击查看摘要

Abstract:The matrix exponential is a fundamental operator in scientific computing and system simulation, with applications ranging from control theory and quantum mechanics to modern generative machine learning. While Padé approximants combined with scaling and squaring have long served as the standard, recent Taylor-based methods, which utilize polynomial evaluation schemes that surpass the classical Paterson–Stockmeyer technique, offer superior accuracy and reduced computational complexity. This paper presents an optimized Taylor-based algorithm for the matrix exponential, specifically designed for the high-throughput requirements of generative AI flows. We provide a rigorous error analysis and develop a dynamic selection strategy for the Taylor order and scaling factor to minimize computational effort under a prescribed error tolerance. Extensive numerical experiments demonstrate that our approach provides significant acceleration and maintains high numerical stability compared to existing state-of-the-art implementations. These results establish the proposed method as a highly efficient tool for large-scale generative modeling.
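
The scaling-and-squaring skeleton that Taylor methods build on fits in a dozen lines: shrink A until its norm is small, sum a truncated Taylor series, then square back up. The fixed order and norm threshold below are simplifications; the paper selects both dynamically and evaluates the polynomial with fewer matrix products than Paterson-Stockmeyer.

```python
import numpy as np
from scipy.linalg import expm

def taylor_expm(A: np.ndarray, order: int = 12, theta: float = 1.0) -> np.ndarray:
    s = max(0, int(np.ceil(np.log2(np.linalg.norm(A, 1) / theta))))  # scaling steps
    B = A / 2**s
    E, term = np.eye(A.shape[0]), np.eye(A.shape[0])
    for k in range(1, order + 1):         # naive term-by-term Taylor sum of exp(B)
        term = term @ B / k
        E += term
    for _ in range(s):                    # undo the scaling by repeated squaring
        E = E @ E
    return E

A = np.random.default_rng(0).normal(size=(50, 50)) * 0.5
err = np.linalg.norm(taylor_expm(A) - expm(A)) / np.linalg.norm(expm(A))
print(f"relative error vs scipy.linalg.expm: {err:.2e}")
```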

[LG-30] Subgroup Discovery with the Cox Model

链接: https://arxiv.org/abs/2512.20762
作者: Zachary Izzo,Iain Melvin
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
备注: 43 pages, 2 figures

点击查看摘要

Abstract:We study the problem of subgroup discovery for survival analysis, where the goal is to find an interpretable subset of the data on which a Cox model is highly accurate. Our work is the first to study this particular subgroup problem, for which we make several contributions. Subgroup discovery methods generally require a “quality function” in order to sift through and select the most advantageous subgroups. We first examine why existing natural choices for quality functions are insufficient to solve the subgroup discovery problem for the Cox model. To address the shortcomings of existing metrics, we introduce two technical innovations: the expected prediction entropy (EPE), a novel metric for evaluating survival models which predict a hazard function; and the conditional rank statistics (CRS), a statistical object which quantifies the deviation of an individual point from the distribution of survival times in an existing subgroup. We study the EPE and CRS theoretically and show that they can solve many of the problems with existing metrics. We introduce a total of eight algorithms for the Cox subgroup discovery problem. The main algorithm is able to take advantage of both the EPE and the CRS, allowing us to give theoretical correctness results for this algorithm in a well-specified setting. We evaluate all of the proposed methods empirically on both synthetic and real data. The experiments confirm our theory, showing that our contributions allow for the recovery of a ground-truth subgroup in well-specified cases, as well as leading to better model fit compared to naively fitting the Cox model to the whole dataset in practical settings. Lastly, we conduct a case study on jet engine simulation data from NASA. The discovered subgroups uncover known nonlinearities/homogeneity in the data and suggest design choices that have been mirrored in practice.

[LG-31] Real-World Adversarial Attacks on RF-Based Drone Detectors

链接: https://arxiv.org/abs/2512.20712
作者: Omer Gazit,Yael Itzhakev,Yuval Elovici,Asaf Shabtai
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Radio frequency (RF) based systems are increasingly used to detect drones by analyzing their RF signal patterns, converting them into spectrogram images which are processed by object detection models. Existing RF attacks against image based models alter digital features, making over-the-air (OTA) implementation difficult due to the challenge of converting digital perturbations to transmittable waveforms that may introduce synchronization errors and interference, and encounter hardware limitations. We present the first physical attack on RF image based drone detectors, optimizing class-specific universal complex baseband (I/Q) perturbation waveforms that are transmitted alongside legitimate communications. We evaluated the attack using RF recordings and OTA experiments with four types of drones. Our results show that modest, structured I/Q perturbations are compatible with standard RF chains and reliably reduce target drone detection while preserving detection of legitimate drones.

[LG-32] Graph Neural Networks for Source Detection: A Review and Benchmark Study

链接: https://arxiv.org/abs/2512.20657
作者: Martin Sterchi,Nathan Brack,Lorenz Hilfiker
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The source detection problem arises when an epidemic process unfolds over a contact network, and the objective is to identify its point of origin, i.e., the source node. Research on this problem began with the seminal work of Shah and Zaman in 2010, who formally defined it and introduced the notion of rumor centrality. With the emergence of Graph Neural Networks (GNNs), several studies have proposed GNN-based approaches to source detection. However, some of these works lack methodological clarity and/or are hard to reproduce. As a result, it remains unclear (to us, at least) whether GNNs truly outperform more traditional source detection methods across comparable settings. In this paper, we first review existing GNN-based methods for source detection, clearly outlining the specific settings each addresses and the models they employ. Building on this research, we propose a principled GNN architecture tailored to the source detection task. We also systematically investigate key questions surrounding this problem. Most importantly, we aim to provide a definitive assessment of how GNNs perform relative to other source detection methods. Our experiments show that GNNs substantially outperform all other methods we test across a variety of network types. Although we initially set out to challenge the notion of GNNs as a solution to source detection, our results instead demonstrate their remarkable effectiveness for this task. We discuss possible reasons for this strong performance. To ensure full reproducibility, we release all code and data on GitHub. Finally, we argue that epidemic source detection should serve as a benchmark task for evaluating GNN architectures.
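As a concrete illustration of the general setup (not the authors' architecture), the sketch below frames source detection as node scoring with a two-layer GCN in plain PyTorch; the toy path graph, the infection-state feature encoding, and the training loop are assumptions for demonstration.

```python
import torch
import torch.nn as nn

def normalized_adj(A):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A+I) D^-1/2."""
    A_hat = A + torch.eye(A.size(0))
    d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A_hat * d_inv_sqrt.unsqueeze(0)

class SourceGCN(nn.Module):
    def __init__(self, in_dim, hidden=32):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hidden)
        self.lin2 = nn.Linear(hidden, 1)

    def forward(self, A_norm, X):
        h = torch.relu(A_norm @ self.lin1(X))        # aggregate neighbor states
        return (A_norm @ self.lin2(h)).squeeze(-1)   # per-node source logit

# Toy usage: a 6-node path graph where node 2 started the infection.
A = torch.zeros(6, 6)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]:
    A[i, j] = A[j, i] = 1.0
X = torch.tensor([[0.], [1.], [1.], [1.], [0.], [0.]])  # observed infection states
A_norm = normalized_adj(A)

model = SourceGCN(in_dim=1)
opt = torch.optim.Adam(model.parameters(), lr=0.01)
target = torch.tensor([2])                               # true source node
for _ in range(200):
    loss = nn.functional.cross_entropy(model(A_norm, X).unsqueeze(0), target)
    opt.zero_grad(); loss.backward(); opt.step()
print("predicted source:", model(A_norm, X).argmax().item())
```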

[LG-33] Q-RUN: Quantum-Inspired Data Re-uploading Networks

链接: https://arxiv.org/abs/2512.20654
作者: Wenbo Qiao,Shuaixian Wang,Peng Zhang,Yan Ming,Jiaming Zhao
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:Data re-uploading quantum circuits (DRQC) are a key approach to implementing quantum neural networks and have been shown to outperform classical neural networks in fitting high-frequency functions. However, their practical application is limited by the scalability of current quantum hardware. In this paper, we introduce the mathematical paradigm of DRQC into classical models by proposing a quantum-inspired data re-uploading network (Q-RUN), which retains the Fourier-expressive advantages of quantum models without any quantum hardware. Experimental results demonstrate that Q-RUN delivers superior performance across both data modeling and predictive modeling tasks. Compared to the fully connected layers and the state-of-the-art neural network layers, Q-RUN reduces model parameters while decreasing error by approximately one to three orders of magnitude on certain tasks. Notably, Q-RUN can serve as a drop-in replacement for standard fully connected layers, improving the performance of a wide range of neural architectures. This work illustrates how principles from quantum machine learning can guide the design of more expressive artificial intelligence.
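The classical analogue of re-uploading can be sketched as stacked trainable Fourier features, with each stage re-encoding the raw input; this is an illustrative reading of the idea, not the paper's exact Q-RUN parameterization, and all shapes and hyperparameters below are assumptions.

```python
import torch
import torch.nn as nn

class ReUploadLayer(nn.Module):
    """Each 'upload' re-encodes the raw input with its own frequencies and
    phases, then mixes cos/sin features into the running hidden state."""
    def __init__(self, in_dim, out_dim, uploads=3):
        super().__init__()
        self.freqs = nn.ParameterList(
            [nn.Parameter(torch.randn(in_dim, out_dim)) for _ in range(uploads)])
        self.phases = nn.ParameterList(
            [nn.Parameter(torch.zeros(out_dim)) for _ in range(uploads)])
        self.mix = nn.Linear(2 * out_dim, out_dim)

    def forward(self, x):
        h = 0.0
        for W, b in zip(self.freqs, self.phases):
            z = x @ W + b + h                        # re-upload x at every stage
            h = self.mix(torch.cat([torch.cos(z), torch.sin(z)], dim=-1))
        return h

# Smoke test: fit a high-frequency 1-D function, the regime where
# re-uploading models are claimed to shine.
x = torch.linspace(-1, 1, 256).unsqueeze(-1)
y = torch.sin(12 * x) + 0.3 * torch.sin(31 * x)
net = nn.Sequential(ReUploadLayer(1, 64), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(500):
    loss = ((net(x) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final MSE: {loss.item():.4f}")
```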

[LG-34] SHRP: Specialized Head Routing and Pruning for Efficient Encoder Compression

链接: https://arxiv.org/abs/2512.20635
作者: Zeli Su,Ziyin Zhang,Wenzheng Zhang,Zhou Liu,Guixian Xu,Wentao Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformer encoders are widely deployed in large-scale web services for natural language understanding tasks such as text classification, semantic retrieval, and content ranking. However, their high inference latency and memory consumption pose significant challenges for real-time serving and scalability. These limitations stem largely from architectural redundancy, particularly in the attention module. The inherent parameter redundancy of the attention mechanism, coupled with the fact that its attention heads operate with a degree of independence, makes it particularly amenable to structured model compression. In this paper, we propose SHRP (Specialized Head Routing and Pruning), a novel structured pruning framework that automatically identifies and removes redundant attention heads while preserving most of the model’s accuracy and compatibility. SHRP introduces Expert Attention, a modular design that treats each attention head as an independent expert, followed by a lightweight shared expander feed-forward network that refines their outputs. The framework employs a unified Top-1 usage-driven mechanism to jointly perform dynamic routing during training and deterministic pruning at deployment. Experimental results on the GLUE benchmark using a BERT-base encoder show that SHRP achieves 93% of the original model accuracy while reducing parameters by 48%. Under an extreme compression scenario where 11/12 of the layers are pruned, the model still maintains 84% accuracy and delivers a 4.2x throughput gain while reducing computation to as low as 11.5% of the original FLOPs, demonstrating its practical utility for large-scale and latency-sensitive web deployments.
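A rough sketch of the Expert Attention idea, filling in details the abstract leaves open (single-head experts, per-sequence routing, no expander FFN): each head is an independent expert, a Top-1 router selects one per input, and accumulated usage counts drive deterministic pruning at deployment. The hard argmax below is not differentiable for router training; the actual framework's usage-driven mechanism is more involved.

```python
import torch
import torch.nn as nn

class ExpertAttention(nn.Module):
    def __init__(self, dim, n_heads):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.MultiheadAttention(dim, 1, batch_first=True) for _ in range(n_heads)])
        self.router = nn.Linear(dim, n_heads)
        self.register_buffer("usage", torch.zeros(n_heads))

    def forward(self, x):
        choice = self.router(x).mean(dim=1).argmax(dim=-1)  # Top-1 head per sequence
        self.usage += torch.bincount(choice, minlength=len(self.heads)).float()
        out = torch.zeros_like(x)
        for h in choice.unique().tolist():
            m = choice == h
            out[m], _ = self.heads[h](x[m], x[m], x[m])
        return out

    def prune(self, keep):
        """Deployment-time pruning: keep only the `keep` most-used heads."""
        idx = sorted(self.usage.topk(keep).indices.tolist())
        self.heads = nn.ModuleList([self.heads[i] for i in idx])
        self.router.weight = nn.Parameter(self.router.weight.detach()[idx])
        self.router.bias = nn.Parameter(self.router.bias.detach()[idx])
        self.usage = self.usage[idx]

layer = ExpertAttention(dim=16, n_heads=4)
x = torch.randn(8, 10, 16)
_ = layer(x)                    # routing accumulates per-head usage counts
layer.prune(keep=2)             # drop the two least-used heads
print(layer(x).shape, layer.usage)
```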

[LG-35] Uncovering Patterns of Brain Activity from EEG Data Consistently Associated with Cybersickness Using Neural Network Interpretability Maps

链接: https://arxiv.org/abs/2512.20620
作者: Jacqueline Yau,Katherine J. Mimnaugh,Evan G. Center,Timo Ojala,Steven M. LaValle,Wenzhen Yuan,Nancy Amato,Minje Kim,Kara Federmeier
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cybersickness poses a serious challenge for users of virtual reality (VR) technology. Consequently, there has been significant effort to track its occurrence during VR use with brain activity through electroencephalography (EEG). However, a significant confound in current methods for detecting sickness from EEG is they do not account for the simultaneous processing of the sickening visual stimulus that is present in the brain data from VR. Using event-related potentials (ERPs) from an auditory stimulus shown to reflect cybersickness impacts, we can more precisely target EEG cybersickness features and use those to achieve better performance in online cybersickness classification. In this article, we introduce a method utilizing trained convolutional neural networks and transformer models and plot interpretability maps from integrated gradients and class activation to give a visual representation of what the model determined was most useful in sickness classification from an EEG dataset consisting of ERPs recorded during the elicitation of cybersickness. Across 12 runs of our method with three different neural networks, the models consistently pointed to a surprising finding: that amplitudes recorded at an electrode placed on the scalp near the left prefrontal cortex were important in the classification of cybersickness. These results help clarify a hidden pattern in other related research and point to exciting opportunities for future investigation: that this scalp location could be used as a tagged feature for better real-time cybersickness classification with EEG. We provide our code at: [anonymized].
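For reference, integrated gradients itself is a short computation: attribute a class score to the input by averaging gradients along a straight-line path from a baseline and scaling by the input difference. The model and EEG shapes below are placeholders, not the study's trained networks.

```python
import torch

def integrated_gradients(model, x, target, baseline=None, steps=64):
    """Approximate IG: (x - x0) times the mean gradient along the path x0 -> x."""
    x0 = torch.zeros_like(x) if baseline is None else baseline
    grads = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        xi = (x0 + alpha * (x - x0)).requires_grad_(True)
        score = model(xi)[:, target].sum()
        grads.append(torch.autograd.grad(score, xi)[0])
    return (x - x0) * torch.stack(grads).mean(dim=0)

# Toy usage: (batch, electrodes, time) -> 2 classes (sick / not sick).
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 * 128, 2))
eeg = torch.randn(1, 32, 128)            # 32 electrodes, 128 time samples
attr = integrated_gradients(model, eeg, target=1)
per_electrode = attr.abs().sum(dim=-1)   # importance aggregated per electrode
print("most influential electrode:", per_electrode.argmax().item())
```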

[LG-36] Energy-convergence trade off for the training of neural networks on bio-inspired hardware

链接: https://arxiv.org/abs/2509.18121
作者: Nikhil Garg,Paul Uriarte Vicandi,Yanming Zhang,Alexandre Baigol,Donato Francesco Falcone,Saketh Ram Mamidala,Bert Jan Offrein,Laura Bégon-Lours
类目: Emerging Technologies (cs.ET); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:

点击查看摘要

Abstract:The increasing deployment of wearable sensors and implantable devices is shifting AI processing demands to the extreme edge, necessitating ultra-low power for continuous operation. Inspired by the brain, emerging memristive devices promise to accelerate neural network training by eliminating costly data transfers between compute and memory. However, balancing performance and energy efficiency remains a challenge. We investigate ferroelectric synaptic devices based on HfO2/ZrO2 superlattices and feed their experimentally measured weight updates into hardware-aware neural network simulations. Across pulse widths from 20 ns to 0.2 ms, shorter pulses lower per-update energy but require more training epochs while still reducing total energy without sacrificing accuracy. Classification accuracy using plain stochastic gradient descent (SGD) is diminished compared to mixed-precision SGD. We analyze the causes and propose a "symmetry point shifting" technique, addressing asymmetric updates and restoring accuracy. These results highlight a trade-off among accuracy, convergence speed, and energy use, showing that short-pulse programming with tailored training significantly enhances on-chip learning efficiency.
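The symmetry-point issue can be reproduced with a toy soft-bounded device model: random up/down pulse trains drift toward the weight where potentiation and depression steps balance, which motivates treating that point as the logical zero. The device law and constants below are illustrative, not the measured HfO2/ZrO2 characteristics.

```python
import numpy as np

def pulse(w, up, step=0.05):
    """Soft-bounded update: steps shrink toward the rails at +/-1."""
    return w + step * (1 - w) if up else w - step * (1 + w)

w = 0.8
for up in np.random.default_rng(0).random(2000) < 0.5:  # random up/down pulses
    w = pulse(w, up)
print("symmetry point near:", round(w, 3))              # ~0 for this device law

# "Shifting": store the device's symmetry point and let the algorithm's zero
# live there, so a weight meant to stay at zero no longer drifts under noise.
w_sym = w
logical = lambda w_dev: w_dev - w_sym
```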

[LG-37] Autonomous Uncertainty Quantification for Computational Point-of-care Sensors

链接: https://arxiv.org/abs/2512.21335
作者: Artem Goncharov,Rajesh Ghosh,Hyou-Arm Joung,Dino Di Carlo,Aydogan Ozcan
类目: Medical Physics (physics.med-ph); Machine Learning (cs.LG); Applied Physics (physics.app-ph); Biological Physics (physics.bio-ph)
*备注: 18 Pages, 5 Figures

点击查看摘要

Abstract:Computational point-of-care (POC) sensors enable rapid, low-cost, and accessible diagnostics in emergency, remote and resource-limited areas that lack access to centralized medical facilities. These systems can utilize neural network-based algorithms to accurately infer a diagnosis from the signals generated by rapid diagnostic tests or sensors. However, neural network-based diagnostic models are subject to hallucinations and can produce erroneous predictions, posing a risk of misdiagnosis and inaccurate clinical decisions. To address this challenge, here we present an autonomous uncertainty quantification technique developed for POC diagnostics. As our testbed, we used a paper-based, computational vertical flow assay (xVFA) platform developed for rapid POC diagnosis of Lyme disease, the most prevalent tick-borne disease globally. The xVFA platform integrates a disposable paper-based assay, a handheld optical reader and a neural network-based inference algorithm, providing rapid and cost-effective Lyme disease diagnostics in under 20 min using only 20 µL of patient serum. By incorporating a Monte Carlo dropout (MCDO)-based uncertainty quantification approach into the diagnostics pipeline, we identified and excluded erroneous predictions with high uncertainty, significantly improving the sensitivity and reliability of the xVFA in an autonomous manner, without access to the ground truth diagnostic information of patients. Blinded testing using new patient samples demonstrated an increase in diagnostic sensitivity from 88.2% to 95.7%, indicating the effectiveness of MCDO-based uncertainty quantification in enhancing the robustness of neural network-driven computational POC sensing systems.
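Monte Carlo dropout itself is compact: keep dropout active at inference, average the softmax over repeated stochastic passes, and gate predictions by predictive entropy. The model, features, and threshold below are placeholders, not the xVFA pipeline.

```python
import torch

def mc_dropout_predict(model, x, passes=50):
    model.train()                      # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(passes)])
    mean = probs.mean(dim=0)
    entropy = -(mean * mean.clamp_min(1e-12).log()).sum(dim=-1)  # predictive entropy
    return mean, entropy

model = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Dropout(0.3),
    torch.nn.Linear(64, 2))
signals = torch.randn(8, 16)           # e.g. per-assay sensor features
mean, entropy = mc_dropout_predict(model, signals)
confident = entropy < 0.5              # exclude high-uncertainty predictions
print(mean.argmax(dim=-1)[confident])
```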

[LG-38] Causal-driven attribution (CDA): Estimating channel influence without user-level data

链接: https://arxiv.org/abs/2512.21211
作者: Georgios Filippou,Boi Mai Quach,Diana Lenghel,Arthur White,Ashish Kumar Jha
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 42 pages, 8 figures, initially submitted to the Journal of the Academy of Marketing Science on 24 Dec 2025

点击查看摘要

Abstract:Attribution modelling lies at the heart of marketing effectiveness, yet most existing approaches depend on user-level path data, which are increasingly inaccessible due to privacy regulations and platform restrictions. This paper introduces a Causal-Driven Attribution (CDA) framework that infers channel influence using only aggregated impression-level data, avoiding any reliance on user identifiers or click-path tracking. CDA integrates temporal causal discovery (using PCMCI) with causal effect estimation via a Structural Causal Model to recover directional channel relationships and quantify their contributions to conversions. Using large-scale synthetic data designed to replicate real marketing dynamics, we show that CDA achieves an average relative RMSE of 9.50% when given the true causal graph, and 24.23% when using the predicted graph, demonstrating strong accuracy under correct structure and meaningful signal recovery even under structural uncertainty. CDA captures cross-channel interdependencies while providing interpretable, privacy-preserving attribution insights, offering a scalable and future-proof alternative to traditional path-based models.

[LG-39] Critical Points of Degenerate Metrics on Algebraic Varieties: A Tale of Overparametrization

链接: https://arxiv.org/abs/2512.21029
作者: Giovanni Luca Marchetti,Erin Connelly,Paul Breiding,Kathlén Kohn
类目: Algebraic Geometry (math.AG); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the critical points over an algebraic variety of an optimization problem defined by a quadratic objective that is degenerate. This scenario arises in machine learning when the dataset size is small with respect to the model, and is typically referred to as overparametrization. Our main result relates the degenerate optimization problem to a nondegenerate one via a projection. In the highly-degenerate regime, we find that a central role is played by the ramification locus of the projection. Additionally, we provide tools for counting the number of critical points over projective varieties, and discuss specific cases arising from deep learning. Our work bridges tools from algebraic geometry with ideas from machine learning, and it extends the line of literature around the Euclidean distance degree to the degenerate setting.

[LG-40] Enhancing diffusion models with Gaussianization preprocessing

链接: https://arxiv.org/abs/2512.21020
作者: Li Cunzhi,Louis Kang,Hideaki Shimazaki
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 17 pages, 9 figures

点击查看摘要

Abstract:Diffusion models are a class of generative models that have demonstrated remarkable success in tasks such as image generation. However, one of the bottlenecks of these models is slow sampling due to the delay before the onset of trajectory bifurcation, at which point substantial reconstruction begins. This issue degrades generation quality, especially in the early stages. Our primary objective is to mitigate bifurcation-related issues by preprocessing the training data to enhance reconstruction quality, particularly for small-scale network architectures. Specifically, we propose applying Gaussianization preprocessing to the training data to make the target distribution more closely resemble an independent Gaussian distribution, which serves as the initial density of the reconstruction process. This preprocessing step simplifies the model’s task of learning the target distribution, thereby improving generation quality even in the early stages of reconstruction with small networks. The proposed method is, in principle, applicable to a broad range of generative tasks, enabling more stable and efficient sampling processes.
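One generic Gaussianization pass, assuming a standard recipe rather than the paper's exact preprocessing: map each marginal to a normal via a quantile transform, then rotate and decorrelate. Both steps are invertible, which matters for mapping generated samples back to data space.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.exponential(size=(5000, 2)) ** 2        # heavy-tailed training data
x[:, 1] += 0.5 * x[:, 0]                        # introduce correlation

qt = QuantileTransformer(output_distribution="normal", random_state=0)
z = qt.fit_transform(x)                         # marginals -> approximately N(0,1)
pca = PCA(whiten=True, random_state=0)
z = pca.fit_transform(z)                        # rotate + decorrelate

print(np.round(np.cov(z, rowvar=False), 2))     # near-identity covariance
# For generation, invert: qt.inverse_transform(pca.inverse_transform(samples))
```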

[LG-41] Learning from Neighbors with PHIBP: Predicting Infectious Disease Dynamics in Data-Sparse Environments

链接: https://arxiv.org/abs/2512.21005
作者: Edwin Fong,Lancelot F. James,Juho Lee
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注: Draft Book chapter on AMMI methods – Application of PHIBP arXiv:2502.01919 to Infectious Disease Detection with suggested extensions using the developments in arXiv:2508.18668

点击查看摘要

Abstract:Modeling sparse count data, which arise across numerous scientific fields, presents significant statistical challenges. This chapter addresses these challenges in the context of infectious disease prediction, with a focus on predicting outbreaks in geographic regions that have historically reported zero cases. To this end, we present the detailed computational framework and experimental application of the Poisson Hierarchical Indian Buffet Process (PHIBP), with demonstrated success in handling sparse count data in microbiome and ecological studies. The PHIBP’s architecture, grounded in the concept of absolute abundance, systematically borrows statistical strength from related regions and circumvents the known sensitivities of relative-rate methods to zero counts. Through a series of experiments on infectious disease data, we show that this principled approach provides a robust foundation for generating coherent predictive distributions and for the effective use of comparative measures such as alpha and beta diversity. The chapter’s emphasis on algorithmic implementation and experimental results confirms that this unified framework delivers both accurate outbreak predictions and meaningful epidemiological insights in data-sparse settings.

[LG-42] Clever Hans in Chemistry: Chemist Style Signals Confound Activity Prediction on Public Benchmarks

链接: https://arxiv.org/abs/2512.20924
作者: Andrew D. Blevins,Ian K. Quigley
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:Can machine learning models identify which chemist made a molecule from structure alone? If so, models trained on literature data may exploit chemist intent rather than learning causal structure-activity relationships. We test this by linking ChEMBL assays to publication authors and training a 1,815-class classifier to predict authors from molecular fingerprints, achieving 60% top-5 accuracy under scaffold-based splitting. We then train an activity model that receives only a protein identifier and an author-probability vector derived from structure, with no direct access to molecular descriptors. This author-only model achieves predictive power comparable to a simple baseline that has access to structure. This reveals a “Clever Hans” failure mode: models can predict bioactivity largely by inferring chemist goals and favorite targets without requiring a lab-independent understanding of chemistry. We analyze the sources of this leakage, propose author-disjoint splits, and recommend dataset practices to decouple chemist intent from biological outcomes.

[LG-43] Weighted MCC: A Robust Measure of Multiclass Classifier Performance for Observations with Individual Weights

链接: https://arxiv.org/abs/2512.20811
作者: Rommel Cortez,Bala Krishnamoorthy
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Several performance measures are used to evaluate binary and multiclass classification tasks. But individual observations may often have distinct weights, and none of these measures are sensitive to such varying weights. We propose a new weighted Pearson-Matthews Correlation Coefficient (MCC) for binary classification as well as weighted versions of related multiclass measures. The weighted MCC varies between -1 and 1. But crucially, the weighted MCC values are higher for classifiers that perform better on highly weighted observations, and hence can distinguish them from classifiers that have a similar overall performance but do better on the lowly weighted observations. Furthermore, we prove that the weighted measures are robust with respect to the choice of weights in a precise manner: if the weights are changed by at most \epsilon , the value of the weighted measure changes at most by a factor of \epsilon in the binary case and by a factor of \epsilon^2 in the multiclass case. Our computations demonstrate that the weighted measures clearly identify classifiers that perform better on higher weighted observations, while the unweighted measures remain completely indifferent to the choices of weights.
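The binary weighted MCC can be computed by accumulating per-observation weights in the confusion-matrix cells and then applying the usual MCC formula; the sketch below is that straightforward reading, and the authors' multiclass estimator may differ in details. The toy comparison shows two classifiers with mirrored errors getting opposite scores once weights enter.

```python
import numpy as np

def weighted_mcc(y_true, y_pred, w):
    """Binary MCC with weighted confusion-matrix counts instead of raw counts."""
    w = np.asarray(w, dtype=float)
    tp = w[(y_true == 1) & (y_pred == 1)].sum()
    tn = w[(y_true == 0) & (y_pred == 0)].sum()
    fp = w[(y_true == 0) & (y_pred == 1)].sum()
    fn = w[(y_true == 1) & (y_pred == 0)].sum()
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

y_true = np.array([1, 1, 0, 0, 1, 0])
w      = np.array([5., 5., 1., 1., 1., 1.])   # first four observations matter most
clf_a  = np.array([1, 1, 0, 0, 0, 1])         # right on the heavy points
clf_b  = np.array([0, 0, 1, 1, 1, 0])         # right only on the light points
print(weighted_mcc(y_true, clf_a, w), weighted_mcc(y_true, clf_b, w))
```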

[LG-44] Diffusion Models in Simulation-Based Inference: A Tutorial Review

链接: https://arxiv.org/abs/2512.20685
作者: Jonas Arruda,Niels Bracher,Ullrich Köthe,Jan Hasenauer,Stefan T. Radev
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Diffusion models have recently emerged as powerful learners for simulation-based inference (SBI), enabling fast and accurate estimation of latent parameters from simulated and real data. Their score-based formulation offers a flexible way to learn conditional or joint distributions over parameters and observations, thereby providing a versatile solution to various modeling problems. In this tutorial review, we synthesize recent developments on diffusion models for SBI, covering design choices for training, inference, and evaluation. We highlight opportunities created by various concepts such as guidance, score composition, flow matching, consistency models, and joint modeling. Furthermore, we discuss how efficiency and statistical accuracy are affected by noise schedules, parameterizations, and samplers. Finally, we illustrate these concepts with case studies across parameter dimensionalities, simulation budgets, and model types, and outline open questions for future research.
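The core training loop these methods share is denoising score matching on joint simulator draws: noise the parameters, condition the network on the observation, and regress the injected noise. Everything below (the toy simulator, DDPM-style schedule, and architecture) is a placeholder for illustration.

```python
import torch
import torch.nn as nn

def simulate(theta):                     # toy simulator: x = theta + noise
    return theta + 0.1 * torch.randn_like(theta)

theta_dim, x_dim, T = 2, 2, 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

net = nn.Sequential(nn.Linear(theta_dim + x_dim + 1, 128), nn.ReLU(),
                    nn.Linear(128, theta_dim))   # predicts the injected noise
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    theta = torch.randn(256, theta_dim)          # prior draws
    x = simulate(theta)                          # joint (theta, x) samples
    t = torch.randint(0, T, (256,))
    eps = torch.randn_like(theta)
    a = alphas_bar[t].unsqueeze(-1)
    theta_t = a.sqrt() * theta + (1 - a).sqrt() * eps   # forward noising
    inp = torch.cat([theta_t, x, t.unsqueeze(-1) / T], dim=-1)
    loss = ((net(inp) - eps) ** 2).mean()        # denoising score matching
    opt.zero_grad(); loss.backward(); opt.step()
# Running the learned reverse process conditioned on an observed x then
# yields approximate posterior draws p(theta | x).
```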

[LG-45] Fast and Exact Least Absolute Deviations Line Fitting via Piecewise Affine Lower-Bounding

链接: https://arxiv.org/abs/2512.20682
作者: Stefan Volz,Martin Storath,Andreas Weinmann
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Submitted to IEEE Transactions on Signal Processing

点击查看摘要

Abstract:Least-absolute-deviations (LAD) line fitting is robust to outliers but computationally more involved than least squares regression. Although the literature includes linear and near-linear time algorithms for the LAD line fitting problem, these methods are difficult to implement and, to our knowledge, lack maintained public implementations. As a result, practitioners often resort to linear programming (LP) based methods such as the simplex-based Barrodale-Roberts method and interior-point methods, or on iteratively reweighted least squares (IRLS) approximation which does not guarantee exact solutions. To close this gap, we propose the Piecewise Affine Lower-Bounding (PALB) method, an exact algorithm for LAD line fitting. PALB uses supporting lines derived from subgradients to build piecewise-affine lower bounds, and employs a subdivision scheme involving minima of these lower bounds. We prove correctness and provide bounds on the number of iterations. On synthetic datasets with varied signal types and noise including heavy-tailed outliers as well as a real dataset from the NOAA’s Integrated Surface Database, PALB exhibits empirical log-linear scaling. It is consistently faster than publicly available implementations of LP based and IRLS based solvers. We provide a reference implementation written in Rust with a Python API.
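For context, the LP baseline the paper benchmarks against is short to state: introduce slacks e_i >= |y_i - (a x_i + b)| and minimize their sum. A minimal SciPy version, assuming the standard formulation (this is the LP-based alternative PALB is compared to, not PALB itself):

```python
import numpy as np
from scipy.optimize import linprog

def lad_fit(x, y):
    """Exact LAD line fit min_{a,b} sum_i |y_i - (a x_i + b)| as an LP."""
    n = len(x)
    # decision vector: [a, b, e_1, ..., e_n]; minimize sum of slacks e_i
    c = np.concatenate([[0.0, 0.0], np.ones(n)])
    X = np.column_stack([x, np.ones(n)])
    I = np.eye(n)
    #  X@[a,b] - e <= y  and  -X@[a,b] - e <= -y  encode  e_i >= |residual_i|
    A_ub = np.block([[X, -I], [-X, -I]])
    b_ub = np.concatenate([y, -y])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None), (None, None)] + [(0, None)] * n)
    return res.x[0], res.x[1]

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y = 2.0 * x + 1.0 + 0.05 * rng.standard_normal(200)
y[::20] += 5.0                    # heavy outliers barely move the LAD fit
print(lad_fit(x, y))              # slope, intercept close to (2.0, 1.0)
```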

信息检索

[IR-0] MMSRARec: Summarization and Retrieval Augmented Sequential Recommendation Based on Multimodal Large Language Model

链接: https://arxiv.org/abs/2512.20916
作者: Haoyu Wang,Yitong Wang,Jining Wang
类目: Information Retrieval (cs.IR); Multimedia (cs.MM)
*备注: Under Review

点击查看摘要

Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant potential in recommendation systems. However, the effective application of MLLMs to multimodal sequential recommendation remains underexplored: A) Existing methods primarily leverage the multimodal semantic understanding capabilities of pre-trained MLLMs to generate item embeddings or semantic IDs, thereby enhancing traditional recommendation models. These approaches generate item representations that exhibit limited interpretability, and pose challenges when transferring to language model-based recommendation systems. B) Other approaches convert user behavior sequence into image-text pairs and perform recommendation through multiple MLLM inference, incurring prohibitive computational and time costs. C) Current MLLM-based recommendation systems generally neglect the integration of collaborative signals. To address these limitations while balancing recommendation performance, interpretability, and computational cost, this paper proposes MultiModal Summarization-and-Retrieval-Augmented Sequential Recommendation. Specifically, we first employ MLLM to summarize items into concise keywords and fine-tune the model using rewards that incorporate summary length, information loss, and reconstruction difficulty, thereby enabling adaptive adjustment of the summarization policy. Inspired by retrieval-augmented generation, we then transform collaborative signals into corresponding keywords and integrate them as supplementary context. Finally, we apply supervised fine-tuning with multi-task learning to align the MLLM with the multimodal sequential recommendation. Extensive evaluations on common recommendation datasets demonstrate the effectiveness of MMSRARec, showcasing its capability to efficiently and interpretably understand user behavior histories and item information for accurate recommendations.

[IR-1] Accurate and Diverse Recommendations via Propensity-Weighted Linear Autoencoders SIGIR

链接: https://arxiv.org/abs/2512.20896
作者: Kazuma Onishi,Katsuhiko Hayashi,Hidetaka Kamigaito
类目: Information Retrieval (cs.IR)
*备注: Published in the proceedings of SIGIR-AP’25

点击查看摘要

Abstract:In real-world recommender systems, user-item interactions are Missing Not At Random (MNAR), as interactions with popular items are more frequently observed than those with less popular ones. Missing observations shift recommendations toward frequently interacted items, which reduces the diversity of the recommendation list. To alleviate this problem, Inverse Propensity Scoring (IPS) is widely used and commonly models propensities based on a power-law function of item interaction frequency. However, we found that such power-law-based correction overly penalizes popular items and harms their recommendation performance. We address this issue by redefining the propensity score to allow broader item recommendation without excessively penalizing popular items. The proposed score is formulated by applying a sigmoid function to the logarithm of the item observation frequency, maintaining the simplicity of power-law scoring while allowing for more flexible adjustment. Furthermore, we incorporate the redefined propensity score into a linear autoencoder model, which tends to favor popular items, and evaluate its effectiveness. Experimental results revealed that our method substantially improves the diversity of items in the recommendation list without sacrificing recommendation accuracy.
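The redefined score is simple to state: a sigmoid applied to the logarithm of item interaction frequency, in place of a power law of frequency. The slope and centering below are illustrative choices, not the paper's fitted values; the printout contrasts the inverse-propensity (IPS) weights the two scores assign.

```python
import numpy as np

def power_law_propensity(freq, gamma=0.5):
    return (freq / freq.max()) ** gamma

def sigmoid_log_propensity(freq, a=1.0, b=None):
    z = np.log(freq)
    b = z.mean() if b is None else b      # illustrative: center on the data
    return 1.0 / (1.0 + np.exp(-a * (z - b)))

freq = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])  # item interaction counts
for name, p in [("power-law  ", power_law_propensity(freq)),
                ("sigmoid-log", sigmoid_log_propensity(freq))]:
    print(name, "propensity:", np.round(p, 3), "IPS weight:", np.round(1 / p, 1))
```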

[IR-2] Soft Filtering: Guiding Zero-shot Composed Image Retrieval with Prescriptive and Proscriptive Constraints AAAI2026

链接: https://arxiv.org/abs/2512.20781
作者: Youjin Jung,Seongwoo Cho,Hyun-seok Min,Sungchul Choi
类目: Information Retrieval (cs.IR)
*备注: Accepted to AAAI 2026 Workshop on New Frontiers in Information Retrieval

点击查看摘要

Abstract:Composed Image Retrieval (CIR) aims to find a target image that aligns with user intent, expressed through a reference image and a modification text. While Zero-shot CIR (ZS-CIR) methods sidestep the need for labeled training data by leveraging pretrained vision-language models, they often rely on a single fused query that merges all descriptive cues of what the user wants, tending to dilute key information and failing to account for what they wish to avoid. Moreover, current CIR benchmarks assume a single correct target per query, overlooking the ambiguity in modification texts. To address these challenges, we propose Soft Filtering with Textual constraints (SoFT), a training-free, plug-and-play filtering module for ZS-CIR. SoFT leverages multimodal large language models (LLMs) to extract two complementary constraints from the reference-modification pair: prescriptive (must-have) and proscriptive (must-avoid) constraints. These serve as semantic filters that reward or penalize candidate images to re-rank results, without modifying the base retrieval model or adding supervision. In addition, we construct a two-stage dataset pipeline that refines CIR benchmarks. We first identify multiple plausible targets per query to construct multi-target triplets, capturing the open-ended nature of user intent. Then guide multimodal LLMs to rewrite the modification text to focus on one target, while referencing contrastive distractors to ensure precision. This enables more comprehensive and reliable evaluation under varying ambiguity levels. Applied on top of CIReVL, a ZS-CIR retriever, SoFT raises R@5 to 65.25 on CIRR (+12.94), mAP@50 to 27.93 on CIRCO (+6.13), and R@50 to 58.44 on FashionIQ (+4.59), demonstrating broad effectiveness.
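A minimal sketch of the re-ranking step, with random stand-in embeddings in place of a CLIP encoder and the multimodal-LLM constraint extraction omitted: each candidate keeps its base retrieval score, gains a reward for matching prescriptive (must-have) phrases, and pays a penalty for matching proscriptive (must-avoid) ones. The `clip_text` callable and the alpha/beta weights are assumptions.

```python
import torch
import torch.nn.functional as F

def soft_rerank(base_scores, image_embs, must_have, must_avoid,
                clip_text, alpha=0.5, beta=0.5):
    def best_sim(texts):
        if not texts:
            return torch.zeros(image_embs.size(0))
        t = clip_text(texts)                           # (k, d) normalized text embs
        return (image_embs @ t.T).max(dim=-1).values   # best-matching constraint
    score = base_scores + alpha * best_sim(must_have) - beta * best_sim(must_avoid)
    return score.argsort(descending=True)              # re-ranked candidate order

# Toy usage with random "embeddings" standing in for a CLIP encoder.
d, n = 16, 5
image_embs = F.normalize(torch.randn(n, d), dim=-1)
clip_text = lambda texts: F.normalize(torch.randn(len(texts), d), dim=-1)
order = soft_rerank(torch.rand(n), image_embs,
                    must_have=["sleeveless", "red"],
                    must_avoid=["long sleeves"],
                    clip_text=clip_text)
print(order)
```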

附件下载

点击下载今日全部论文列表