This post lists the latest papers retrieved from arXiv.org on 2025-12-30. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily digest by email, please leave your email address in the comments.

Note: Paper data is retrieved from arXiv.org daily and updated automatically around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Contents

Overview (2025-12-30)

826 papers were updated today, including:

  • Natural Language Processing: 87 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 253 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 220 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 263 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Training AI Co-Scientists Using Rubric Rewards

[Quick Read]: This paper addresses the problem that current language models struggle to generate research plans that satisfy all constraints and implicit requirements, particularly in scientific settings that lack effective feedback mechanisms. The key to the solution is a scalable, diverse training corpus built by automatically extracting research goals and goal-specific grading rubrics from papers, which is then used to fine-tune language models via reinforcement learning with self-grading. During training, a frozen copy of the initial policy acts as the grader, creating a generator-verifier gap that enables improvement without external human supervision. Experiments show the approach substantially improves research-plan quality across domains such as machine learning and medical research, with strong cross-domain generalization.

Link: https://arxiv.org/abs/2512.23707
Authors: Shashwat Goel, Rishi Hazra, Dulhan Jayalath, Timon Willi, Parag Jain, William F. Shen, Ilias Leontiadis, Francesco Barbieri, Yoram Bachrach, Jonas Geiping, Chenxi Whitehouse
Affiliations: Meta
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: 11 pages in the main paper, total 119 including sample outputs in the Appendix

Abstract:AI co-scientists are emerging as a tool to assist human researchers in achieving their research goals. A crucial feature of these AI co-scientists is the ability to generate a research plan given a set of aims and constraints. The plan may be used by researchers for brainstorming, or may even be implemented after further refinement. However, language models currently struggle to generate research plans that follow all constraints and implicit requirements. In this work, we study how to leverage the vast corpus of existing research papers to train language models that generate better research plans. We build a scalable, diverse training corpus by automatically extracting research goals and goal-specific grading rubrics from papers across several domains. We then train models for research plan generation via reinforcement learning with self-grading. A frozen copy of the initial policy acts as the grader during training, with the rubrics creating a generator-verifier gap that enables improvements without external human supervision. To validate this approach, we conduct a study with human experts for machine learning research goals, spanning 225 hours. The experts prefer plans generated by our finetuned Qwen3-30B-A3B model over the initial model for 70% of research goals, and approve 84% of the automatically extracted goal-specific grading rubrics. To assess generality, we also extend our approach to research goals from medical papers, and new arXiv preprints, evaluating with a jury of frontier models. Our finetuning yields 12-22% relative improvements and significant cross-domain generalization, proving effective even in problem settings like medical research where execution feedback is infeasible. Together, these findings demonstrate the potential of a scalable, automated training recipe as a step towards improving general AI co-scientists.
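
A minimal sketch of the rubric-reward idea described above, assuming the frozen grader is exposed as a yes/no judgment per rubric item; the function names and the fraction-satisfied aggregation are illustrative, not the paper's exact recipe.

```python
# Hypothetical sketch: rubric-based reward for RL with self-grading.
# A frozen copy of the initial policy would answer yes/no per rubric item;
# `grade` is injected here so the reward logic stays model-agnostic.
from typing import Callable, List

def rubric_reward(plan: str, rubric: List[str],
                  grade: Callable[[str, str], bool]) -> float:
    """Score a research plan as the fraction of rubric items it satisfies."""
    if not rubric:
        return 0.0
    return sum(grade(plan, item) for item in rubric) / len(rubric)

# Toy usage: a substring check stands in for the frozen LLM grader.
plan = "We ablate each module and report confidence intervals."
rubric = ["ablate", "confidence intervals", "human evaluation"]
print(rubric_reward(plan, rubric, lambda p, item: item in p))  # ~0.67
```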

[NLP-1] Eliciting Behaviors in Multi-Turn Conversations

[Quick Read]: This paper addresses the challenge of eliciting specific, often complex behaviors from large language models (LLMs) in multi-turn conversations, which matters for model evaluation; existing methods focus mainly on single-turn settings. It offers an analytical framework that sorts existing methods into three families: those using only prior knowledge, those using offline interactions, and those learning from online interactions. Its key contribution is a generalized multi-turn formulation of the online method that unifies single-turn and multi-turn elicitation. Evaluated on automatically generated multi-turn test cases, online methods reach average success rates of 45/19/77% across three tasks with only a few thousand queries, whereas static methods from existing multi-turn benchmarks find few or no failure cases, highlighting a novel application of behavior elicitation in multi-turn evaluation and the need for the community to move toward dynamic benchmarks.

Link: https://arxiv.org/abs/2512.23701
Authors: Jing Huang, Shujian Zhang, Lun Wang, Andrew Hard, Rajiv Mathews, John Lambert
Affiliations: Google DeepMind; Stanford University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Identifying specific and often complex behaviors from large language models (LLMs) in conversational settings is crucial for their evaluation. Recent work proposes novel techniques to find natural language prompts that induce specific behaviors from a target model, yet they are mainly studied in single-turn settings. In this work, we study behavior elicitation in the context of multi-turn conversations. We first offer an analytical framework that categorizes existing methods into three families based on their interactions with the target model: those that use only prior knowledge, those that use offline interactions, and those that learn from online interactions. We then introduce a generalized multi-turn formulation of the online method, unifying single-turn and multi-turn elicitation. We evaluate all three families of methods on automatically generating multi-turn test cases. We investigate the efficiency of these approaches by analyzing the trade-off between the query budget, i.e., the number of interactions with the target model, and the success rate, i.e., the discovery rate of behavior-eliciting inputs. We find that online methods can achieve an average success rate of 45/19/77% with just a few thousand queries over three tasks where static methods from existing multi-turn conversation benchmarks find few or even no failure cases. Our work highlights a novel application of behavior elicitation methods in multi-turn conversation evaluation and the need for the community to move towards dynamic benchmarks.

[NLP-2] Fine-Tuning LLMs with Fine-Grained Human Feedback on Text Spans

[Quick Read]: This paper targets the efficiency and effectiveness of preference alignment for language models, i.e., how to make better use of human feedback to improve output quality. Standard approaches rely on A/B preference rankings over whole responses or full contrastive rewrites, which miss localized nuances and yield sparse training signal. The key idea is to build feedback-driven improvement chains: annotators mark "liked" and "disliked" spans in a model response and specify why; the base model then rewrites the disliked spans from left to right, producing a sequence of incremental improvements. Each adjacent pair of steps forms a preference pair for direct alignment, letting the model learn localized, targeted edits and enabling more efficient, higher-quality preference tuning.

Link: https://arxiv.org/abs/2512.23693
Authors: Sky CH-Wang, Justin Svegliato, Helen Appel, Jason Eisner
Affiliations: Columbia University; Microsoft; Johns Hopkins University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We present a method and dataset for fine-tuning language models with preference supervision using feedback-driven improvement chains. Given a model response, an annotator provides fine-grained feedback by marking "liked" and "disliked" spans and specifying what they liked or disliked about them. The base model then rewrites the disliked spans accordingly, proceeding from left to right, forming a sequence of incremental improvements. We construct preference pairs for direct alignment from each adjacent step in the chain, enabling the model to learn from localized, targeted edits. We find that our approach outperforms direct alignment methods based on standard A/B preference ranking or full contrastive rewrites, demonstrating that structured, revision-based supervision leads to more efficient and effective preference tuning.
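
A minimal sketch of turning an improvement chain into preference pairs for direct alignment (e.g., DPO-style tuples); the data layout below is an assumption for illustration, not the paper's released format.

```python
# Hypothetical sketch: build preference pairs from a feedback-driven
# improvement chain. Each adjacent step (worse -> better) yields one pair.
from typing import Dict, List

def chain_to_preference_pairs(
    prompt: str,
    chain: List[str],  # chain[0] = original response, chain[-1] = most revised
) -> List[Dict[str, str]]:
    pairs = []
    for worse, better in zip(chain, chain[1:]):
        pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs

# Usage: a three-step chain produces two localized preference pairs.
example = chain_to_preference_pairs(
    "Summarize the meeting notes.",
    ["Draft with a vague opening.", "Draft with a clear opening.",
     "Clear and concise draft."],
)
assert len(example) == 2
```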

[NLP-3] PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech

[Quick Read]: This paper addresses the inadequacy of existing ASR benchmarks for professional settings, which underplay the dense domain terminology of finance, medicine, legal, and technology speech, formal register variation, and near-zero tolerance for critical entity errors. The key contribution is ProfASR-Bench, a professional-talk evaluation suite for high-stakes applications: each example pairs a natural-language prompt (a domain cue and/or speaker profile) with an entity-rich target utterance, enabling controlled measurement of context-conditioned recognition, and beyond conventional ASR metrics it supports entity-aware scoring and slice-wise reporting by accent and gender. The benchmark also exposes a context-utilization gap (CUG): even with oracle prompts, lightweight textual context produces little to no change in average word error rate (WER), and adversarial prompts do not reliably degrade performance, indicating that current systems are nominally promptable yet underuse readily available side information.

Link: https://arxiv.org/abs/2512.23686
Authors: Deepak Babu Piskala
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Sound (cs.SD)
Comments: Benchmark dataset and evaluation suite. Data and code available at: this https URL this https URL

Abstract:Automatic Speech Recognition (ASR) in professional settings faces challenges that existing benchmarks underplay: dense domain terminology, formal register variation, and near-zero tolerance for critical entity errors. We present ProfASR-Bench, a professional-talk evaluation suite for high-stakes applications across finance, medicine, legal, and technology. Each example pairs a natural-language prompt (domain cue and/or speaker profile) with an entity-rich target utterance, enabling controlled measurement of context-conditioned recognition. The corpus supports conventional ASR metrics alongside entity-aware scores and slice-wise reporting by accent and gender. Using representative families Whisper (encoder-decoder ASR) and Qwen-Omni (audio language models) under matched no-context, profile, domain+profile, oracle, and adversarial conditions, we find a consistent pattern: lightweight textual context produces little to no change in average word error rate (WER), even with oracle prompts, and adversarial prompts do not reliably degrade performance. We term this the context-utilization gap (CUG): current systems are nominally promptable yet underuse readily available side information. ProfASR-Bench provides a standardized context ladder, entity- and slice-aware reporting with confidence intervals, and a reproducible testbed for comparing fusion strategies across model families. Dataset: this https URL Code: this https URL
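
A minimal sketch of how the context-utilization gap could be quantified as a WER delta between no-context and prompted decoding conditions; the aggregation is a straightforward assumption, not the benchmark's exact protocol.

```python
# Hypothetical sketch: quantify a context-utilization gap as the change in
# word error rate (WER) between no-context and prompted conditions.
def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    # Standard Levenshtein distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

def context_utilization_gap(refs, hyps_no_ctx, hyps_prompted) -> float:
    """Mean WER(no-context) minus mean WER(prompted); ~0 means prompts go unused."""
    base = sum(wer(r, h) for r, h in zip(refs, hyps_no_ctx)) / len(refs)
    cond = sum(wer(r, h) for r, h in zip(refs, hyps_prompted)) / len(refs)
    return base - cond
```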

[NLP-4] Multilingual Hidden Prompt Injection Attacks on LLM-Based Academic Reviewing

[Quick Read]: This paper addresses the security risks that document-level hidden prompt injection attacks pose when large language models (LLMs) are used in high-impact workflows such as academic peer review. The key contribution is a dataset of roughly 500 real papers accepted to ICML, each injected with semantically equivalent adversarial prompt instructions in four languages (English, Japanese, Chinese, and Arabic) and then reviewed with an LLM. The experiments show that English, Japanese, and Chinese injections substantially change review scores and accept/reject decisions, while Arabic injections have little to no effect, revealing both the vulnerability of LLM-based reviewing to document-level prompt injection and notable differences across languages.

Link: https://arxiv.org/abs/2512.23684
Authors: Panagiotis Theocharopoulos, Ajinkya Kulkarni, Mathew Magimai.-Doss
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) are increasingly considered for use in high-impact workflows, including academic peer review. However, LLMs are vulnerable to document-level hidden prompt injection attacks. In this work, we construct a dataset of approximately 500 real academic papers accepted to ICML and evaluate the effect of embedding hidden adversarial prompts within these documents. Each paper is injected with semantically equivalent instructions in four different languages and reviewed using an LLM. We find that prompt injection induces substantial changes in review scores and accept/reject decisions for English, Japanese, and Chinese injections, while Arabic injections produce little to no effect. These results highlight the susceptibility of LLM-based reviewing systems to document-level prompt injection and reveal notable differences in vulnerability across languages.

[NLP-5] Web World Models

[Quick Read]: This paper addresses how language agents can get the persistent worlds they need for long-horizon interaction while keeping environment logic consistent and controllable yet open-ended and scalable. Existing approaches sit at two extremes: web frameworks backed by fixed databases (reliable but inflexible) and fully generative world models (unlimited but uncontrollable). The key is the Web World Model (WWM), a middle ground in which world state and "physics" are implemented in ordinary web code (ensuring logical consistency) while large language models generate context, narratives, and high-level decisions on top of this structured latent state. Core design principles include separating code-defined rules from model-driven imagination, representing latent state as typed web interfaces, and using deterministic generation for unlimited but structured exploration; the results suggest that web stacks themselves can serve as a scalable substrate for world models, enabling controllable yet open-ended environments.

Link: https://arxiv.org/abs/2512.23676
Authors: Jichen Feng, Yifan Zhang, Chenggong Zhang, Yifu Lu, Shilong Liu, Mengdi Wang
Affiliations: Princeton University; University of California; University of Pennsylvania
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:Language agents increasingly require persistent worlds in which they can act, remember, and learn. Existing approaches sit at two extremes: conventional web frameworks provide reliable but fixed contexts backed by databases, while fully generative world models aim for unlimited environments at the expense of controllability and practical engineering. In this work, we introduce the Web World Model (WWM), a middle ground where world state and "physics" are implemented in ordinary web code to ensure logical consistency, while large language models generate context, narratives, and high-level decisions on top of this structured latent state. We build a suite of WWMs on a realistic web stack, including an infinite travel atlas grounded in real geography, fictional galaxy explorers, web-scale encyclopedic and narrative worlds, and simulation- and game-like environments. Across these systems, we identify practical design principles for WWMs: separating code-defined rules from model-driven imagination, representing latent state as typed web interfaces, and utilizing deterministic generation to achieve unlimited but structured exploration. Our results suggest that web stacks themselves can serve as a scalable substrate for world models, enabling controllable yet open-ended environments. Project Page: this https URL.
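
A minimal sketch of the stated design split, assuming nothing beyond the abstract: rules and state live in deterministic code, and an LLM call (stubbed here) would only narrate on top of that typed state.

```python
# Hypothetical sketch of the WWM split: world state and rules live in ordinary
# code (logically consistent), while an LLM only narrates on top of the state.
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Location:          # typed latent state, as a web API might expose it
    name: str
    biome: str

def generate_location(world_seed: int, x: int, y: int) -> Location:
    """Deterministic generation: the same coordinates always yield the same place."""
    rng = random.Random(f"{world_seed}:{x}:{y}")
    biome = rng.choice(["coast", "desert", "forest", "steppe"])
    return Location(name=f"region-{x}-{y}", biome=biome)

def narrate(loc: Location) -> str:
    # In a real system this would call an LLM; the rules above stay code-defined.
    return f"You arrive at {loc.name}, a quiet {loc.biome}."

print(narrate(generate_location(world_seed=7, x=3, y=-2)))
```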

[NLP-6] Less is more: Probabilistic reduction is best explained by small-scale predictability measures

[Quick Read]: This paper asks how much context is necessary or appropriate when investigating the relationship between language model probabilities and cognitive phenomena. Its key result is that whole utterances are not required to observe probabilistic reduction: n-gram representations suffice as cognitive units of planning.

Link: https://arxiv.org/abs/2512.23659
Authors: Cassandra L. Jacobs, Andrés Buxó-Lugo, Anna K. Taylor, Marie Leopold-Hooke
Affiliations: University at Buffalo; EURECOM; Université Paris - Sorbonne
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The primary research questions of this paper center on defining the amount of context that is necessary and/or appropriate when investigating the relationship between language model probabilities and cognitive phenomena. We investigate whether whole utterances are necessary to observe probabilistic reduction and demonstrate that n-gram representations suffice as cognitive units of planning.
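
A minimal sketch of the kind of small-scale predictability measure the paper argues can suffice, here bigram surprisal over a toy corpus; the smoothing choice is an illustrative assumption.

```python
# Hypothetical sketch: small-scale predictability as bigram surprisal.
import math
from collections import Counter

def bigram_surprisal(corpus: list[str], prev: str, word: str) -> float:
    """-log2 P(word | prev) with add-one smoothing over a toy corpus."""
    tokens = [t for line in corpus for t in line.split()]
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    vocab = len(unigrams)
    p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)
    return -math.log2(p)

corpus = ["the cat sat", "the cat ran", "the dog sat"]
# "cat" after "the" is more predictable (lower surprisal) than "ran" after "dog".
print(bigram_surprisal(corpus, "the", "cat"), bigram_surprisal(corpus, "dog", "ran"))
```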

[NLP-7] Nested Browser-Use Learning for Agentic Information Seeking

[Quick Read]: This paper addresses the limitation that information-seeking (IS) agents rely on API-level snippet retrieval and URL-based page fetching, restricting access to the richer information available on the deep web. Full browser interaction could unlock deeper capabilities, but its fine-grained control and verbose page-content returns add substantial reasoning complexity for ReAct-style function-calling agents. The key innovation of the proposed Nested Browser-Use Learning (NestBrowse) framework is a minimal and complete set of browser actions that decouples interaction control from page exploration through a nested structure, simplifying agentic reasoning while enabling effective deep-web information acquisition.

Link: https://arxiv.org/abs/2512.23647
Authors: Baixuan Li, Jialong Wu, Wenbiao Yin, Kuan Li, Zhongwang Zhang, Huifeng Yin, Zhengwei Tao, Liwen Zhang, Pengjun Xie, Jingren Zhou, Yong Jiang
Affiliations: Tongyi Lab; Alibaba Group
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Comments:

Abstract:Information-seeking (IS) agents have achieved strong performance across a range of wide and deep search tasks, yet their tool use remains largely restricted to API-level snippet retrieval and URL-based page fetching, limiting access to the richer information available through real browsing. While full browser interaction could unlock deeper capabilities, its fine-grained control and verbose page content returns introduce substantial complexity for ReAct-style function-calling agents. To bridge this gap, we propose Nested Browser-Use Learning (NestBrowse), which introduces a minimal and complete browser-action framework that decouples interaction control from page exploration through a nested structure. This design simplifies agentic reasoning while enabling effective deep-web information acquisition. Empirical results on challenging deep IS benchmarks demonstrate that NestBrowse offers clear benefits in practice. Further in-depth analyses underscore its efficiency and flexibility.

[NLP-8] A Dataset and Benchmark for Consumer Healthcare Question Summarization

[Quick Read]: This paper addresses the difficulty of understanding consumer health questions, whose verbose phrasing and peripheral details challenge natural language understanding, and the lack of a domain-expert-annotated dataset, which has held back efficient summarization systems. The key contribution is CHQ-Sum, a new dataset of 1,507 consumer health questions with domain-expert-annotated summaries, derived from a community question-answering forum; it supports understanding and analysis of health-related posts on social media, and benchmarks on multiple state-of-the-art summarization models validate its usefulness.

Link: https://arxiv.org/abs/2512.23637
Authors: Abhishek Basu, Deepak Gupta, Dina Demner-Fushman, Shweta Yadav
Affiliations: U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD, USA; Department of Computer Science, University of Illinois at Chicago, Chicago, IL, USA
Subjects: Computation and Language (cs.CL)
Comments: arXiv admin note: substantial text overlap with arXiv:2206.06581

Abstract:The quest for seeking health information has swamped the web with consumers health-related questions. Generally, consumers use overly descriptive and peripheral information to express their medical condition or other healthcare needs, contributing to the challenges of natural language understanding. One way to address this challenge is to summarize the questions and distill the key information of the original question. Recently, large-scale datasets have significantly propelled the development of several summarization tasks, such as multi-document summarization and dialogue summarization. However, a lack of a domain-expert annotated dataset for the consumer healthcare questions summarization task inhibits the development of an efficient summarization system. To address this issue, we introduce a new dataset, CHQ-Sum, which contains 1507 domain-expert annotated consumer health questions and corresponding summaries. The dataset is derived from the community question answering forum and therefore provides a valuable resource for understanding consumer health-related posts on social media. We benchmark the dataset on multiple state-of-the-art summarization models to show the effectiveness of the dataset.

[NLP-9] Close the Loop: Synthesizing Infinite Tool-Use Data via Multi-Agent Role-Playing

[Quick Read]: This paper tackles three bottlenecks that keep large language models (LLMs) from reliably invoking external tools: expensive human annotation for high-quality trajectories, poor generalization to unseen tools, and the quality ceiling and entrenched biases of single-model synthesis. The key is InfTool, a fully autonomous framework built on self-evolving multi-agent synthesis: three collaborative agents (a User Simulator, a Tool-Calling Assistant, and an MCP Server) generate diverse, verified trajectories; the model is trained on them with Group Relative Policy Optimization (GRPO) and gated rewards; and the improved model then generates higher-quality data targeting its capability gaps, closing the loop without human intervention. On the Berkeley Function-Calling Leaderboard (BFCL), InfTool lifts a base 32B model from 19.8% to 70.9% accuracy, surpassing models 10x larger and rivaling Claude-Opus, entirely from synthetic data.

Link: https://arxiv.org/abs/2512.23611
Authors: Yuwen Li, Wei Zhang, Zelong Huang, Mason Yang, Jiajun Wu, Shawn Guo, Huahao Hu, Lingyi Sun, Jian Yang, Mingjie Tang, Byran Dai
Affiliations: Sichuan University; Beihang University; IQuest Research
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Enabling Large Language Models (LLMs) to reliably invoke external tools remains a critical bottleneck for autonomous agents. Existing approaches suffer from three fundamental challenges: expensive human annotation for high-quality trajectories, poor generalization to unseen tools, and quality ceilings inherent in single-model synthesis that perpetuate biases and coverage gaps. We introduce InfTool, a fully autonomous framework that breaks these barriers through self-evolving multi-agent synthesis. Given only raw API specifications, InfTool orchestrates three collaborative agents (User Simulator, Tool-Calling Assistant, and MCP Server) to generate diverse, verified trajectories spanning single-turn calls to complex multi-step workflows. The framework establishes a closed loop: synthesized data trains the model via Group Relative Policy Optimization (GRPO) with gated rewards, the improved model generates higher-quality data targeting capability gaps, and this cycle iterates without human intervention. Experiments on the Berkeley Function-Calling Leaderboard (BFCL) demonstrate that InfTool transforms a base 32B model from 19.8% to 70.9% accuracy (+258%), surpassing models 10x larger and rivaling Claude-Opus, and entirely from synthetic data without human annotation.
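
A minimal sketch of GRPO-style advantages combined with a gated reward, assuming a binary verification gate on each sampled trajectory; the gating rule and names are illustrative, not InfTool's exact recipe.

```python
# Hypothetical sketch: group-relative advantages with a gated reward.
# Rewards only count when a trajectory passes verification (the "gate").
from statistics import mean, pstdev
from typing import List

def gated_rewards(scores: List[float], passed: List[bool]) -> List[float]:
    """Zero out rewards for trajectories that fail verification."""
    return [s if ok else 0.0 for s, ok in zip(scores, passed)]

def grpo_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO normalizes each reward against its sampling group's statistics."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

group = gated_rewards([0.9, 0.7, 0.8, 0.6], [True, False, True, False])
print(grpo_advantages(group))  # passing trajectories get positive advantage
```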

[NLP-10] Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models

[Quick Read]: This paper shows that when spoken language models (SLMs) are instructed to adopt a specific paralinguistic speaking style (emotion, accent, volume, speaking speed) at the start of a multi-turn conversation, they fail to maintain that style over later turns, a phenomenon the authors call style amnesia. The key mitigation findings are that placing the style instruction in the user message rather than the system message, and explicitly prompting the model to recall the instruction, partially alleviate the problem; the models can recall the style instruction when asked but fail to keep expressing it, exposing a clear limitation in cross-turn style consistency.

Link: https://arxiv.org/abs/2512.23578
Authors: Yu-Xiang Lin, Cheng-Han Chiang, Hung-yi Lee
Affiliations: National Taiwan University
Subjects: Computation and Language (cs.CL); Sound (cs.SD)
Comments: Work in progress

Abstract:In this paper, we show that when spoken language models (SLMs) are instructed to speak in a specific speaking style at the beginning of a multi-turn conversation, they cannot maintain the required speaking styles after several turns of interaction; we refer to this as the style amnesia of SLMs. We focus on paralinguistic speaking styles, including emotion, accent, volume, and speaking speed. We evaluate three proprietary and two open-source SLMs, demonstrating that none of these models can maintain a consistent speaking style when instructed to do so. We further show that when SLMs are asked to recall the style instruction in later turns, they can recall the style instruction, but they fail to express it throughout the conversation. We also show that explicitly asking the model to recall the style instruction can partially mitigate style amnesia. In addition, we examine various prompting strategies and find that SLMs struggle to follow the required style when the instruction is placed in system messages rather than user messages, which contradicts the intended function of system prompts.

[NLP-11] Instruction-Following Evaluation of Large Vision-Language Models

[Quick Read]: This paper addresses the marked decline in instruction-following ability that large vision-language models (LVLMs) exhibit after visual instruction tuning. The study quantitatively traces this to commonly used training datasets lacking explicit output-format instructions, which leaves the merged vision-language model unable to follow task instructions as expected. The key remedy is to include training samples that specify the output format during (visual) instruction tuning, which mitigates the decline and improves how accurately the model understands and executes complex instructions.

Link: https://arxiv.org/abs/2512.23572
Authors: Daiki Shiono, Shumpei Miyawaki, Ryota Tanaka, Jun Suzuki
Affiliations: Tohoku University; NTT
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 21 pages, 7 figures

Abstract:Following the initial flourishing of large language models (LLMs), there has been a surge in proposed large vision-language models (LVLMs) that integrate LLMs with vision capabilities. However, it has been observed that LVLMs, after tuning to visual instruction using commonly used training datasets, often fail to exhibit the instruction-following ability that was present in the LLM before integration, leading to results in which they do not follow task instructions as expected. This study quantitatively demonstrates that LVLMs’ instruction-following ability declines after fine-tuning and analyzes its underlying causes. In particular, we constructed new training datasets highlighting whether the output format is specified. Then, we investigated how explicitly indicating the output format during fine-tuning affects LVLMs’ instruction-following ability. Our quantitative evaluation confirmed that LVLMs’ instruction-following ability declines after fine-tuning with commonly used datasets. Furthermore, we found that LVLMs trained with datasets, including instructions on output format, tend to follow instructions more accurately than models that do not. These findings suggest that including samples with instructions on output format during (visual) instruction tuning may help mitigate the decline in instruction-following abilities.

[NLP-12] VL-RouterBench: A Benchmark for Vision-Language Model Routing

[Quick Read]: This paper addresses the lack of a systematic, reproducible benchmark for multi-modal routing, i.e., for evaluating routing over vision-language models (VLMs). The key contribution is VL-RouterBench, which builds quality and cost matrices over sample-model pairs from raw inference and scoring logs, covering 14 datasets across 3 task groups (30,540 samples) and 15 open-source plus 2 API models. Its protocol measures average accuracy, average cost, and throughput, and forms a ranking score from the harmonic mean of normalized cost and accuracy, enabling fair comparison across router configurations and cost budgets; evaluating 10 routing methods shows a significant routability gain but a clear gap to the ideal Oracle, and the full data-construction and evaluation toolchain is to be open-sourced.

Link: https://arxiv.org/abs/2512.23562
Authors: Zhehao Huang, Baijiong Lin, Jingyuan Zhang, Jingying Wang, Yuhang Liu, Ning Lu, Tao Li, Xiaolin Huang
Affiliations: Shanghai Jiao Tong University; The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Multi-model routing has evolved from an engineering technique into essential infrastructure, yet existing work lacks a systematic, reproducible benchmark for evaluating vision-language models (VLMs). We present VL-RouterBench to assess the overall capability of VLM routing systems systematically. The benchmark is grounded in raw inference and scoring logs from VLMs and constructs quality and cost matrices over sample-model pairs. In scale, VL-RouterBench covers 14 datasets across 3 task groups, totaling 30,540 samples, and includes 15 open-source models and 2 API models, yielding 519,180 sample-model pairs and a total input-output token volume of 34,494,977. The evaluation protocol jointly measures average accuracy, average cost, and throughput, and builds a ranking score from the harmonic mean of normalized cost and accuracy to enable comparison across router configurations and cost budgets. On this benchmark, we evaluate 10 routing methods and baselines and observe a significant routability gain, while the best current routers still show a clear gap to the ideal Oracle, indicating considerable room for improvement in router architecture through finer visual cues and modeling of textual structure. We will open-source the complete data construction and evaluation toolchain to promote comparability, reproducibility, and practical deployment in multimodal routing research.
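
A minimal sketch of the kind of ranking score described, the harmonic mean of normalized accuracy and normalized cost; the exact normalization in VL-RouterBench is not given in this digest, so min-max scaling (with cost inverted so higher is better) is an assumption.

```python
# Hypothetical sketch: rank routers by the harmonic mean of normalized
# accuracy and normalized (inverted) cost. Min-max scaling is assumed.
def ranking_scores(accuracies: list[float], costs: list[float]) -> list[float]:
    a_lo, a_hi = min(accuracies), max(accuracies)
    c_lo, c_hi = min(costs), max(costs)
    scores = []
    for acc, cost in zip(accuracies, costs):
        a = (acc - a_lo) / ((a_hi - a_lo) or 1.0)
        c = (c_hi - cost) / ((c_hi - c_lo) or 1.0)  # cheaper -> closer to 1
        scores.append(2 * a * c / (a + c) if (a + c) else 0.0)
    return scores

print(ranking_scores([0.62, 0.70, 0.55], [1.0, 3.0, 0.5]))
```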

[NLP-13] Lie to Me: Knowledge Graphs for Robust Hallucination Self-Detection in LLMs ICPR

[Quick Read]: This paper addresses hallucination in large language models (LLMs), the generation of convincing yet false statements, which remains a major barrier to safe deployment. The key idea is to use structured knowledge representations: convert the LLM's response into a knowledge graph of entities and relations, then use that graph to estimate the likelihood that the response contains hallucinations. The approach is model-agnostic and only structures the original output; it improves self-detection accuracy by up to 16% relative and F1 by up to 20% relative over standard self-detection methods and SelfCheckGPT, suggesting LLMs can better analyse atomic facts when they are structured as knowledge graphs, even when the initial output contains inaccuracies.

Link: https://arxiv.org/abs/2512.23547
Authors: Sahil Kale, Antonio Luca Alfeo
Affiliations: KnowledgeVerse AI; eCampus University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to ICPRAM 2026 in Marbella, Spain

Abstract:Hallucinations, the generation of apparently convincing yet false statements, remain a major barrier to the safe deployment of LLMs. Building on the strong performance of self-detection methods, we examine the use of structured knowledge representations, namely knowledge graphs, to improve hallucination self-detection. Specifically, we propose a simple yet powerful approach that enriches hallucination self-detection by (i) converting LLM responses into knowledge graphs of entities and relations, and (ii) using these graphs to estimate the likelihood that a response contains hallucinations. We evaluate the proposed approach using two widely used LLMs, GPT-4o and Gemini-2.5-Flash, across two hallucination detection datasets. To support more reliable future benchmarking, one of these datasets has been manually curated and enhanced and is released as a secondary outcome of this work. Compared to standard self-detection methods and SelfCheckGPT, a state-of-the-art approach, our method achieves up to 16% relative improvement in accuracy and 20% in F1-score. Our results show that LLMs can better analyse atomic facts when they are structured as knowledge graphs, even when initial outputs contain inaccuracies. This low-cost, model-agnostic approach paves the way toward safer and more trustworthy language models.
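
A minimal sketch of the graph-then-verify pattern: extract (subject, relation, object) triples from a response and aggregate per-triple support into a hallucination estimate. The extraction and verification calls are assumed LLM-backed stubs, not the paper's implementation.

```python
# Hypothetical sketch: estimate hallucination likelihood from a response's
# knowledge graph. `extract_triples` and `verify_triple` would be LLM-backed.
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def extract_triples(response: str) -> List[Triple]:
    raise NotImplementedError("LLM call: response -> atomic facts as triples")

def verify_triple(triple: Triple, response: str) -> float:
    raise NotImplementedError("LLM call: self-assessed support in [0, 1]")

def hallucination_score(response: str) -> float:
    """1 minus the mean support of atomic facts; higher means more suspect."""
    triples = extract_triples(response)
    if not triples:
        return 0.0
    support = sum(verify_triple(t, response) for t in triples) / len(triples)
    return 1.0 - support
```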

[NLP-14] Single LLM Debate MoLaCE: Mixture of Latent Concept Experts Against Confirmation Bias

[Quick Read]: This paper addresses confirmation bias in large language models (LLMs): when a prompt implies a preferred answer, models tend to reinforce that bias rather than explore alternatives, which is already harmful in base models and is amplified in multi-agent debate, where echo chambers reinforce bias instead of correcting it. The key solution is Mixture of Latent Concept Experts (MoLaCE), a lightweight inference-time framework whose experts are different activation strengths over the latent concepts that shape model responses. Because language is compositional, differently phrased prompts reweight latent concepts in prompt-specific ways that affect factual correctness, so no single fixed intervention works universally; MoLaCE lets a single LLM emulate the benefits of internal debate while remaining computationally efficient, and it can also plug into multi-agent debate frameworks to diversify perspectives and reduce correlated errors.

Link: https://arxiv.org/abs/2512.23518
Authors: Hazel Kim, Philip Torr
Affiliations: University of Oxford
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) are highly vulnerable to input confirmation bias. When a prompt implies a preferred answer, models often reinforce that bias rather than explore alternatives. This phenomenon remains underexplored, yet it is already harmful in base models and poses an even greater risk in multi-agent debate, where echo chambers reinforce bias instead of correction. We introduce Mixture of Latent Concept Experts (MoLaCE), a lightweight inference-time framework that addresses confirmation bias by mixing experts instantiated as different activation strengths over latent concepts that shape model responses. Our key insight is that, due to the compositional nature of language, differently phrased prompts reweight latent concepts in prompt-specific ways that affect factual correctness, so no single fixed intervention can be applied universally across inputs. This design enables a single LLM to emulate the benefits of debate internally while remaining computationally efficient and scalable. It can also be integrated into multi-agent debate frameworks to diversify perspectives and reduce correlated errors. We empirically show that it consistently reduces confirmation bias, improves robustness, and matches or surpasses multi-agent debate while requiring only a fraction of the computation.
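
A minimal sketch of mixing "experts" that are just different steering strengths over one latent concept direction; the intervention point and the prompt-dependent weights are assumptions about the general activation-steering pattern, not MoLaCE's exact mechanism.

```python
# Hypothetical sketch: experts as activation strengths over a latent concept.
# Each expert adds the same concept vector at a different strength; a
# prompt-dependent weighting mixes their hidden states.
import numpy as np

def mix_latent_concept_experts(
    hidden: np.ndarray,          # (d,) hidden state at some layer
    concept: np.ndarray,         # (d,) latent concept direction
    strengths=(0.0, 0.5, 1.0),   # one "expert" per strength
    weights=(0.2, 0.5, 0.3),     # prompt-dependent mixture weights
) -> np.ndarray:
    experts = [hidden + s * concept for s in strengths]
    w = np.asarray(weights) / np.sum(weights)
    return np.sum([wi * e for wi, e in zip(w, experts)], axis=0)

rng = np.random.default_rng(0)
h, c = rng.normal(size=16), rng.normal(size=16)
print(mix_latent_concept_experts(h, c).shape)  # (16,)
```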

[NLP-15] UniHetero: Could Generation Enhance Understanding for Vision-Language-Model at Large Data Scale?

[Quick Read]: This paper asks whether visual generation can enhance visual understanding under large-scale pretraining, a question that remains under-explored. Using UniHetero, a unified model with a concise structure, and 200M-sample pretraining, it reports three key observations: (1) generation improves understanding only when the model generates semantics, not pixels; (2) generation shows a superior data-scaling trend and higher data utilization; and (3) autoregression on input embeddings is effective for capturing visual details. These findings offer empirical grounding for building efficient unified vision-language models.

Link: https://arxiv.org/abs/2512.23512
Authors: Fengjiao Chen, Minhao Jing, Weitao Lu, Yan Feng, Xiaoyu Li, Xuezhi Cao
Affiliations: Meituan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-language large models are moving toward the unification of visual understanding and visual generation tasks. However, whether generation can enhance understanding is still under-explored at large data scale. In this work, we analyze the unified model with a concise structure, UniHetero, under large-scale pretraining (200M samples). Our key observations are: (1) Generation can improve understanding, but Only if you generate Semantics, Not Pixels. (2) Generation reveals a superior Data Scaling trend and higher Data Utilization. (3) Autoregression on Input Embedding is effective for capturing visual details.

[NLP-16] Automatic Detection of Complex Quotation Patterns in Aggadic Literature

[Quick Read]: This paper addresses the difficulty of automatically detecting biblical quotations in Rabbinic literature, especially short, paraphrased, or structurally embedded quotations that existing text-reuse frameworks handle poorly. The key is ACT (Allocate Connections between Texts), a three-stage algorithm that combines a morphology-aware alignment algorithm with a context-sensitive stylistic enrichment stage identifying complex citation patterns such as "Wave" and "Echo" quotations. The full pipeline reaches an F1 of 0.91, clearly outperforming existing systems (Dicta, Passim, Text-Matcher), and ablations isolate each component's contribution, offering digital humanities and computational philology a scalable automated tool.

Link: https://arxiv.org/abs/2512.23504
Authors: Hadar Miller, Tsvi Kuflik, Moshe Lavee
Affiliations: University of Haifa
Subjects: Computation and Language (cs.CL)
Comments: This paper is under review at Cogent Arts & Humanities

Abstract:This paper presents ACT (Allocate Connections between Texts), a novel three-stage algorithm for the automatic detection of biblical quotations in Rabbinic literature. Unlike existing text reuse frameworks that struggle with short, paraphrased, or structurally embedded quotations, ACT combines a morphology-aware alignment algorithm with a context-sensitive enrichment stage that identifies complex citation patterns such as "Wave" and "Echo" quotations. Our approach was evaluated against leading systems, including Dicta, Passim, Text-Matcher, as well as human-annotated critical editions. We further assessed three ACT configurations to isolate the contribution of each component. Results demonstrate that the full ACT pipeline (ACT-QE) outperforms all baselines, achieving an F1 score of 0.91, with superior Recall (0.89) and Precision (0.94). Notably, ACT-2, which lacks stylistic enrichment, achieves higher Recall (0.90) but suffers in Precision, while ACT-3, using longer n-grams, offers a tradeoff between coverage and specificity. In addition to improving quotation detection, ACT’s ability to classify stylistic patterns across corpora opens new avenues for genre classification and intertextual analysis. This work contributes to digital humanities and computational philology by addressing the methodological gap between exhaustive machine-based detection and human editorial judgment. ACT lays a foundation for broader applications in historical textual analysis, especially in morphologically rich and citation-dense traditions like Aggadic literature.

[NLP-17] Semantic Tree Inference on Text Corpa using a Nested Density Approach together with Large Language Model Embeddings

[Quick Read]: This paper addresses the opacity of global structural semantic relationships in text collections, which persists even though high-dimensional LLM embeddings are routinely used for semantic storage and retrieval in vector databases. The key is a nested density clustering approach: starting from dense clusters of strongly semantically similar texts in LLM embedding space, the density criterion is gradually relaxed so that dense clusters merge into more diffuse ones, yielding a tree that captures hierarchical semantic relationships and enables data-driven discovery of research areas and their subfields without predefined categories.

Link: https://arxiv.org/abs/2512.23471
Authors: Thomas Haschka, Joseph Bakarji
Affiliations: TU Wien; American University of Beirut
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages, 9 figures

Abstract:Semantic text classification has undergone significant advances in recent years due to the rise of large language models (LLMs) and their high dimensional embeddings. While LLM-embeddings are frequently used to store and retrieve text by semantic similarity in vector databases, the global structure of semantic relationships in text corpora often remains opaque. Herein we propose a nested density clustering approach to infer hierarchical trees of semantically related texts. The method starts by identifying texts of strong semantic similarity as it searches for dense clusters in LLM embedding space. As the density criterion is gradually relaxed, these dense clusters merge into more diffuse clusters, until the whole dataset is represented by a single cluster - the root of the tree. By embedding dense clusters into increasingly diffuse ones, we construct a tree structure that captures hierarchical semantic relationships among texts. We outline how this approach can be used to classify textual data for scientific abstracts as a case study. This enables the data-driven discovery of research areas and their subfields without predefined categories. To evaluate the general applicability of the method, we further apply it to established benchmark datasets such as the 20 Newsgroups and IMDB 50k Movie Reviews, demonstrating its robustness across domains. Finally, we discuss possible applications in scientometrics and topic evolution, highlighting how nested density trees can reveal semantic structure and evolution in textual datasets.
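
A minimal sketch of the nesting idea using a DBSCAN eps sweep as the density relaxation; the specific density estimator and thresholds are assumptions, since the paper's exact criterion is not spelled out in this digest.

```python
# Hypothetical sketch: nest clusters by sweeping DBSCAN's eps from tight to
# loose; tight clusters found early sit below the looser clusters that absorb
# them, giving a tree over embeddings. Requires numpy and scikit-learn.
import numpy as np
from sklearn.cluster import DBSCAN

def nested_density_levels(embeddings: np.ndarray, eps_grid=(0.3, 0.6, 1.2)):
    """Return per-level cluster labels, from densest (smallest eps) to loosest."""
    levels = []
    for eps in eps_grid:
        labels = DBSCAN(eps=eps, min_samples=3).fit_predict(embeddings)
        levels.append(labels)  # -1 marks points not yet dense enough
    return levels

# Parent-of relation: a tight cluster's parent is whichever loose cluster
# holds the majority of its members at the next level up.
```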

[NLP-18] Replay Failures as Successes: Sample-Efficient Reinforcement Learning for Instruction Following

[Quick Read]: This paper addresses the sample inefficiency of reinforcement learning (RL) for aligning large language models (LLMs) with complex instructions: the initial model often cannot generate high-quality responses satisfying all constraints, yielding sparse or indistinguishable rewards that impede learning. The key is Hindsight instruction Replay (HiR), which uses a select-then-rewrite strategy to replay failed attempts as successes with respect to the constraints they did satisfy in hindsight, and frames the objective as dual-preference learning at both the instruction and response level, enabling efficient optimization from only a binary reward signal.

Link: https://arxiv.org/abs/2512.23457
Authors: Kongcheng Zhang, Qi Yao, Shunyu Liu, Wenjian Zhang, Min Cen, Yang Zhou, Wenkai Fang, Yiru Zhao, Baisheng Lai, Mingli Song
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Reinforcement Learning (RL) has shown promise for aligning Large Language Models (LLMs) to follow instructions with various constraints. Despite the encouraging results, RL improvement inevitably relies on sampling successful, high-quality responses; however, the initial model often struggles to generate responses that satisfy all constraints due to its limited capabilities, yielding sparse or indistinguishable rewards that impede learning. In this work, we propose Hindsight instruction Replay (HiR), a novel sample-efficient RL framework for complex instruction following tasks, which employs a select-then-rewrite strategy to replay failed attempts as successes based on the constraints that have been satisfied in hindsight. We perform RL on these replayed samples as well as the original ones, theoretically framing the objective as dual-preference learning at both the instruction- and response-level to enable efficient optimization using only a binary reward signal. Extensive experiments demonstrate that the proposed HiR yields promising results across different instruction following tasks, while requiring less computational budget. Our code and dataset are available at this https URL.
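
A minimal sketch of the hindsight relabeling pattern: a failed response becomes a success for a rewritten instruction containing only the constraints it satisfied. The checker interface and the rewriting rule are illustrative assumptions.

```python
# Hypothetical sketch: hindsight replay for instruction following.
# A failed attempt is relabeled as a success for the subset of constraints
# it actually satisfied, turning sparse binary rewards into usable signal.
from typing import Callable, Dict

def hindsight_replay(
    base_instruction: str,
    response: str,
    constraints: Dict[str, Callable[[str], bool]],  # name -> checker
) -> Dict[str, object]:
    satisfied = [name for name, check in constraints.items() if check(response)]
    relabeled = base_instruction + " Constraints: " + "; ".join(satisfied)
    return {"instruction": relabeled, "response": response, "reward": 1.0}

sample = hindsight_replay(
    "Write a product blurb.",
    "bright. compact. affordable.",
    {
        "all lowercase": lambda r: r == r.lower(),
        "under 10 words": lambda r: len(r.split()) < 10,
        "mentions warranty": lambda r: "warranty" in r,
    },
)
print(sample["instruction"])  # keeps only the two satisfied constraints
```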

[NLP-19] Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

[Quick Read]: This paper addresses the absence of explicit constraints in Mixture-of-Experts (MoE) models tying the router's decisions to the experts' capabilities, which ultimately limits performance. The key is the expert-router coupling (ERC) loss, a lightweight auxiliary loss that treats each expert's router embedding as a proxy token for the tokens assigned to that expert and feeds perturbed router embeddings through the experts to obtain internal activations. The ERC loss enforces two constraints: each expert must activate more strongly on its own proxy token than on any other expert's, and each proxy token must elicit stronger activation from its own expert than from any other. Together these make router embeddings faithful to expert capability while experts specialize on the tokens actually routed to them. The loss operates on only n^2 activations for n experts, a fixed cost independent of batch size, unlike prior coupling methods that scale with token count (often millions per batch); pretraining MoE LLMs from 3B to 15B parameters over trillions of tokens demonstrates its effectiveness.

Link: https://arxiv.org/abs/2512.23447
Authors: Ang Lv, Jin Ma, Yiyuan Ma, Siyuan Qiao
Affiliations: ByteDance; Renmin University of China, GSAI
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router’s decisions align well with the experts’ capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router’s decisions with expert capabilities. Our approach treats each expert’s router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain internal activations. The ERC loss enforces two constraints on these activations: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert. These constraints jointly ensure that each router embedding faithfully represents its corresponding expert’s capability, while each expert specializes in processing the tokens actually routed to it. The ERC loss is computationally efficient, operating only on n^2 activations, where n is the number of experts. This represents a fixed cost independent of batch size, unlike prior coupling methods that scale with the number of tokens (often millions per batch). Through pre-training MoE-LLMs ranging from 3B to 15B parameters and extensive analysis on trillions of tokens, we demonstrate the effectiveness of the ERC loss. Moreover, the ERC loss offers flexible control and quantitative tracking of expert specialization levels during training, providing valuable insights into MoEs.
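
A minimal sketch of the two activation constraints as pairwise margin losses over an n-by-n activation matrix A, where A[i, j] is expert i's activation on expert j's proxy token; the margin formulation is an assumption about how such constraints might be enforced, not the paper's loss.

```python
# Hypothetical sketch: ERC-style constraints as margin losses over an n x n
# activation matrix A, where A[i, j] = expert i's activation on proxy token j.
import numpy as np

def erc_loss(A: np.ndarray, margin: float = 0.1) -> float:
    n = A.shape[0]
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # (1) Expert i prefers its own proxy token over expert j's.
            loss += max(0.0, margin - (A[i, i] - A[i, j]))
            # (2) Proxy token i activates expert i more than expert j.
            loss += max(0.0, margin - (A[i, i] - A[j, i]))
    return loss / (n * (n - 1))

A = np.eye(4) + 0.05  # near-diagonal activations satisfy both constraints
print(erc_loss(A))    # ~0 when coupling holds
```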

[NLP-20] ClinDEF: A Dynamic Evaluation Framework for Large Language Models in Clinical Reasoning

[Quick Read]: This paper addresses the limitations of existing benchmarks for assessing clinical reasoning in large language models (LLMs): they focus on static question answering and fail to capture the dynamic doctor-patient interaction, iterative information gathering, and differential diagnosis of real clinical reasoning. The key is ClinDEF, a framework that dynamically generates patient cases from a disease knowledge graph and runs multi-turn interactions between an LLM-based doctor and an automated patient agent; its evaluation goes beyond diagnostic accuracy by adding fine-grained efficiency analysis and rubric-based assessment of diagnostic quality, yielding a more nuanced, clinically meaningful, multi-dimensional evaluation.

Link: https://arxiv.org/abs/2512.23440
Authors: Yuqi Tang, Jing Yu, Zichang Su, Kehua Feng, Zhihui Zhu, Libin Wang, Lei Liang, Qiang Zhang, Keyan Ding, Huajun Chen
Affiliations: ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University; ZJU-UIUC Institute, Zhejiang University; College of Computer Science and Technology, Zhejiang University; The Polytechnic Institute, Zhejiang University; AntGroup
Subjects: Computation and Language (cs.CL)
Comments: 23 pages, 4 figures, under review

Abstract:Clinical diagnosis begins with doctor-patient interaction, during which physicians iteratively gather information, determine examination and refine differential diagnosis through patients’ response. This dynamic clinical-reasoning process is poorly represented by existing LLM benchmarks that focus on static question-answering. To mitigate these gaps, recent methods explore dynamic medical frameworks involving interactive clinical dialogues. Although effective, they often rely on limited, contamination-prone datasets and lack granular, multi-level evaluation. In this work, we propose ClinDEF, a dynamic framework for assessing clinical reasoning in LLMs through simulated diagnostic dialogues. Grounded in a disease knowledge graph, our method dynamically generates patient cases and facilitates multi-turn interactions between an LLM-based doctor and an automated patient agent. Our evaluation protocol goes beyond diagnostic accuracy by incorporating fine-grained efficiency analysis and rubric-based assessment of diagnostic quality. Experiments show that ClinDEF effectively exposes critical clinical reasoning gaps in state-of-the-art LLMs, offering a more nuanced and clinically meaningful evaluation paradigm.

[NLP-21] C2PO: Diagnosing and Disentangling Bias Shortcuts in LLM s

[Quick Read]: This paper addresses two key kinds of bias in large language models (LLMs): stereotypical biases (e.g., gender or racial stereotypes) and structural biases (e.g., lexical overlap or position preferences), which prior paradigms treat in isolation, often mitigating one while exacerbating the other. The root cause identified is latent spurious feature correlations in the input that drive erroneous reasoning shortcuts. The key solution, Causal-Contrastive Preference Optimization (C2PO), uses causal counterfactual signals to isolate bias-inducing features from valid reasoning paths, together with a fairness-sensitive preference update that dynamically evaluates logit-level contributions and suppresses shortcut features during optimization, mitigating both bias types at once while preserving general reasoning ability.

Link: https://arxiv.org/abs/2512.23430
Authors: Xuan Feng, Bo An, Tianlong Gu, Liang Chang, Fengrui Hao, Peipeng Yu, Shuai Zhao
Affiliations: Jinan University; Nanyang Technological University; Engineering Research Center of Trustworthy AI; Guangxi Key Laboratory of Trusted Software
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Bias in Large Language Models (LLMs) poses significant risks to trustworthiness, manifesting primarily as stereotypical biases (e.g., gender or racial stereotypes) and structural biases (e.g., lexical overlap or position preferences). However, prior paradigms typically address these in isolation, often mitigating one at the expense of exacerbating the other. To address this, we conduct a systematic exploration of these reasoning failures and identify a primary inducement: the latent spurious feature correlations within the input that drive these erroneous reasoning shortcuts. Driven by these findings, we introduce Causal-Contrastive Preference Optimization (C2PO), a unified alignment framework designed to tackle these specific failures by simultaneously discovering and suppressing these correlations directly within the optimization process. Specifically, C2PO leverages causal counterfactual signals to isolate bias-inducing features from valid reasoning paths, and employs a fairness-sensitive preference update mechanism to dynamically evaluate logit-level contributions and suppress shortcut features. Extensive experiments across multiple benchmarks covering stereotypical bias (BBQ, Unqover), structural bias (MNLI, HANS, Chatbot, MT-Bench), out-of-domain fairness (StereoSet, WinoBias), and general utility (MMLU, GSM8K) demonstrate that C2PO effectively mitigates stereotypical and structural biases while preserving robust general reasoning capabilities.

[NLP-22] The Effect of Gender Diversity on Scientific Team Impact: A Team Roles Perspective

[Quick Read]: This paper asks how gender diversity affects the success of scientific teams, noting that prior work operationalizes diversity in aggregate terms and overlooks internal role differentiation, obscuring how diversity relates to team impact (measured as five-year citation counts). The key design is to classify coauthors into leadership and support roles using author contribution statements, and, on more than 130,000 PLOS papers (mostly biomedical), to combine multivariable regression with a threshold regression model that tests how team size moderates the relationship. Results show an inverted-U relationship between gender diversity and team impact for both role groups; teams with an all-female leadership group and an all-male support group achieve the highest impact; and leadership-group diversity is significantly negative for small teams but turns positive (and insignificant) in large teams, while support-group diversity remains positive and significant regardless of team size.

Link: https://arxiv.org/abs/2512.23429
Authors: Yi Zhao, Yongjun Zhu, Donghun Kim, Yuzhuo Wang, Heng Zhang, Chao Lu, Chengzhi Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Digital Libraries (cs.DL)
Comments:

Abstract:The influence of gender diversity on the success of scientific teams is of great interest to academia. However, prior findings remain inconsistent, and most studies operationalize diversity in aggregate terms, overlooking internal role differentiation. This limitation obscures a more nuanced understanding of how gender diversity shapes team impact. In particular, the effect of gender diversity across different team roles remains poorly understood. To this end, we define a scientific team as all coauthors of a paper and measure team impact through five-year citation counts. Using author contribution statements, we classified members into leadership and support roles. Drawing on more than 130,000 papers from PLOS journals, most of which are in biomedical-related disciplines, we employed multivariable regression to examine the association between gender diversity in these roles and team impact. Furthermore, we apply a threshold regression model to investigate how team size moderates this relationship. The results show that (1) the relationship between gender diversity and team impact follows an inverted U-shape for both leadership and support groups; (2) teams with an all-female leadership group and an all-male support group achieve higher impact than other team types. Interestingly, (3) the effect of leadership-group gender diversity is significantly negative for small teams but becomes positive and statistically insignificant in large teams. In contrast, the estimates for support-group gender diversity remain significant and positive, regardless of team size.

[NLP-23] Entropy-Guided Token Dropout: Training Autoregressive Language Models with Limited Domain Data

[Quick Read]: This paper addresses the performance degradation of large language models (LLMs) under multi-epoch training on limited domain data, tracing it to imbalanced learning dynamics: predictable, low-entropy tokens are learned quickly and come to dominate optimization, while generalization on high-entropy tokens deteriorates as training continues. The key is EntroDrop, an entropy-guided token-level dropout that acts as structured data regularization: it selectively masks low-entropy tokens during training and uses a curriculum schedule to adjust regularization strength with training progress, rebalancing learning across tokens and improving adaptation in data-constrained domains for models from 0.6B to 8B parameters.

Link: https://arxiv.org/abs/2512.23422
Authors: Jiapeng Wang, Yiwen Hu, Yanzipeng Gao, Haoyu Wang, Shuo Wang, Hongyu Lu, Jiaxin Mao, Wayne Xin Zhao, Junyi Li, Xiao Zhang
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Tsinghua University; WeChat, Tencent; Department of Data Science, City University of Hong Kong
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:As access to high-quality, domain-specific data grows increasingly scarce, multi-epoch training has become a practical strategy for adapting large language models (LLMs). However, autoregressive models often suffer from performance degradation under repeated data exposure, where overfitting leads to a marked decline in model capability. Through empirical analysis, we trace this degradation to an imbalance in learning dynamics: predictable, low-entropy tokens are learned quickly and come to dominate optimization, while the model’s ability to generalize on high-entropy tokens deteriorates with continued training. To address this, we introduce EntroDrop, an entropy-guided token dropout method that functions as structured data regularization. EntroDrop selectively masks low-entropy tokens during training and employs a curriculum schedule to adjust regularization strength in alignment with training progress. Experiments across model scales from 0.6B to 8B parameters show that EntroDrop consistently outperforms standard regularization baselines and maintains robust performance throughout extended multi-epoch training. These findings underscore the importance of aligning regularization with token-level learning dynamics when training on limited data. Our approach offers a promising pathway toward more effective adaptation of LLMs in data-constrained domains.
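
A minimal sketch of entropy-guided token dropout in a causal-LM loss, assuming per-token predictive entropy is read off the model's logits; the quantile-based threshold schedule is an illustrative stand-in for the paper's curriculum.

```python
# Hypothetical sketch: mask low-entropy tokens out of the LM loss, with a
# schedule that ramps regularization strength over training. Uses PyTorch.
import torch
import torch.nn.functional as F

def entropy_dropout_loss(
    logits: torch.Tensor,   # (batch, seq, vocab)
    targets: torch.Tensor,  # (batch, seq)
    step: int,
    total_steps: int,
    max_drop_quantile: float = 0.3,
) -> torch.Tensor:
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-9))).sum(-1)  # (b, s)
    # Curriculum: drop more low-entropy tokens as training progresses.
    q = max_drop_quantile * step / max(total_steps, 1)
    threshold = torch.quantile(entropy.flatten(), q)
    keep = (entropy > threshold).float()
    token_loss = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none"
    )  # (b, s)
    return (token_loss * keep).sum() / keep.sum().clamp_min(1.0)
```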

[NLP-24] Theoretical Foundations of Scaling Law in Familial Models

[Quick Read]: This paper addresses the limitation that classical neural scaling laws assume a single dense model output, an assumption that overlooks familial models, which are key to ubiquitous intelligence across heterogeneous device-edge-cloud hierarchies: with early exits and relay-style inference, a single shared backbone spawns multiple deployable sub-models, enabling a flexible "train once, deploy many" paradigm. The key is to introduce granularity (G) as a fundamental scaling variable alongside model size (N) and training tokens (D) and to propose a unified functional form L(N, D, G) for this "one-run, many-models" regime. A rigorous IsoFLOP design isolates architectural impact from compute scale, and the empirical fits show the granularity penalty follows a multiplicative power law with an extremely small exponent, theoretically bridging fixed-compute training and dynamic architectures and practically validating deployment flexibility without sacrificing the compute-optimality of dense baselines.

Link: https://arxiv.org/abs/2512.23407
Authors: Huan Song, Qingfei Zhao, Ting Long, Shuyu Tian, Hongjun An, Jiawei Shao, Chi Zhang, Xuelong Li
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Neural scaling laws have become foundational for optimizing large language model (LLM) training, yet they typically assume a single dense model output. This limitation effectively overlooks "familial models," a transformative paradigm essential for realizing ubiquitous intelligence across heterogeneous device-edge-cloud hierarchies. Transcending static architectures, familial models integrate early exits with relay-style inference to spawn G deployable sub-models from a single shared backbone. In this work, we theoretically and empirically extend the scaling law to capture this "one-run, many-models" paradigm by introducing Granularity (G) as a fundamental scaling variable alongside model size (N) and training tokens (D). To rigorously quantify this relationship, we propose a unified functional form L(N, D, G) and parameterize it using large-scale empirical runs. Specifically, we employ a rigorous IsoFLOP experimental design to strictly isolate architectural impact from computational scale. Across fixed budgets, we systematically sweep model sizes (N) and granularities (G) while dynamically adjusting tokens (D). This approach effectively decouples the marginal cost of granularity from the benefits of scale, ensuring high-fidelity parameterization of our unified scaling law. Our results reveal that the granularity penalty follows a multiplicative power law with an extremely small exponent. Theoretically, this bridges fixed-compute training with dynamic architectures. Practically, it validates the "train once, deploy many" paradigm, demonstrating that deployment flexibility is achievable without compromising the compute-optimality of dense baselines.
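
One plausible instantiation of the unified form, shown only to make the "multiplicative power law with a small exponent" claim concrete; the digest does not give the fitted equation, so the Chinchilla-style base terms and the G factor below are assumptions.

```latex
% Hypothetical shape for L(N, D, G): a dense-model scaling law
% multiplied by a weak power-law granularity penalty (gamma small).
L(N, D, G) \;=\; \Bigl( E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} \Bigr)\cdot G^{\gamma},
\qquad 0 < \gamma \ll 1 .
```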

[NLP-25] A Stepwise-Enhanced Reasoning Framework for Large Language Models Based on External Subgraph Generation

[Quick Read]: This paper addresses large language models' (LLMs) weaknesses on tasks requiring deep reasoning and logical inference, in particular outputs that are incorrect or inconsistent with facts because of noisy or irrelevant information absorbed from training corpora. The key is SGR, a stepwise reasoning enhancement framework based on external subgraph generation: it dynamically constructs query-relevant subgraphs from external knowledge bases, guides the model to reason step by step over the subgraph's semantic structure to reduce the influence of noise, and finally integrates multiple reasoning paths to produce the answer; experiments on several benchmark datasets show consistent gains over strong baselines.

Link: https://arxiv.org/abs/2512.23356
Authors: Xin Zhang, Yang Cao, Baoxing Wu, Xinyi Chen, Kai Song, Siying Li
Affiliations: Chongqing Jiaotong University; Chongqing University of Posts and Telecommunications
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have achieved strong performance across a wide range of natural language processing tasks in recent years, including machine translation, text generation, and question answering. As their applications extend to increasingly complex scenarios, however, LLMs continue to face challenges in tasks that require deep reasoning and logical inference. In particular, models trained on large scale textual corpora may incorporate noisy or irrelevant information during generation, which can lead to incorrect predictions or outputs that are inconsistent with factual knowledge. To address this limitation, we propose a stepwise reasoning enhancement framework for LLMs based on external subgraph generation, termed SGR. The proposed framework dynamically constructs query relevant subgraphs from external knowledge bases and leverages their semantic structure to guide the reasoning process. By performing reasoning in a step by step manner over structured subgraphs, SGR reduces the influence of noisy information and improves reasoning accuracy. Specifically, the framework first generates an external subgraph tailored to the input query, then guides the model to conduct multi step reasoning grounded in the subgraph, and finally integrates multiple reasoning paths to produce the final answer. Experimental results on multiple benchmark datasets demonstrate that SGR consistently outperforms strong baselines, indicating its effectiveness in enhancing the reasoning capabilities of LLMs.

[NLP-26] AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents

[Quick Read]: This survey addresses the difficulty current autonomous-agent research has in absorbing the essence of human memory mechanisms when designing efficient memory workflows, a difficulty rooted in interdisciplinary barriers. Its key contribution is a systematic synthesis of cognitive neuroscience and LLM-driven agent research: it traces the definition and function of memory from cognitive neuroscience through LLMs to agents; comparatively analyzes memory taxonomy, storage mechanisms, and the full management lifecycle from biological and artificial perspectives; reviews mainstream benchmarks for agent memory and memory security from both attack and defense angles; and maps biological to artificial memory to ground next-generation agents with human-like memory capabilities.

Link: https://arxiv.org/abs/2512.23343
Authors: Jiafeng Liang, Hao Li, Chang Li, Jiaqi Zhou, Shixin Jiang, Zekun Wang, Changkai Ji, Zhihao Zhu, Runxuan Liu, Tao Ren, Jinlan Fu, See-Kiong Ng, Xia Liang, Ming Liu, Bing Qin
Affiliations: Harbin Institute of Technology; Fudan University; Peking University; National University of Singapore
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 57 pages, 5 figures

Abstract:Memory serves as the pivotal nexus bridging past and future, providing both humans and AI systems with invaluable concepts and experience to navigate complex tasks. Recent research on autonomous agents has increasingly focused on designing efficient memory workflows by drawing on cognitive neuroscience. However, constrained by interdisciplinary barriers, existing works struggle to assimilate the essence of human memory mechanisms. To bridge this gap, we systematically synthesize interdisciplinary knowledge of memory, connecting insights from cognitive neuroscience with LLM-driven agents. Specifically, we first elucidate the definition and function of memory along a progressive trajectory from cognitive neuroscience through LLMs to agents. We then provide a comparative analysis of memory taxonomy, storage mechanisms, and the complete management lifecycle from both biological and artificial perspectives. Subsequently, we review the mainstream benchmarks for evaluating agent memory. Additionally, we explore memory security from dual perspectives of attack and defense. Finally, we envision future research directions, with a focus on multimodal memory systems and skill acquisition.

[NLP-27] CubeBench: Diagnosing Interactive Long-Horizon Spatial Reasoning Under Partial Observations

[Quick Read]: This paper addresses a key bottleneck for deploying large language model (LLM) agents in the physical world: the lack of ability to form and maintain a robust spatial mental model. It isolates three core cognitive challenges: spatial reasoning, long-horizon state tracking via mental simulation, and active exploration under partial observation. The key is CubeBench, a generative benchmark centered on the Rubik's Cube with a three-tiered diagnostic framework that progressively assesses agents, from state tracking with full symbolic information to active exploration with only partial visual data. Leading LLMs show a uniform 0.00% pass rate on all long-horizon tasks, exposing a fundamental failure in long-term planning; a diagnostic framework with external solver tools further isolates the cognitive bottlenecks and yields actionable insights for building more physically grounded agents.

Link: https://arxiv.org/abs/2512.23328
Authors: Huan-ang Gao, Zikang Zhang, Tianwei Luo, Kaisen Yang, Xinzhe Juan, Jiahao Qiu, Tianxing Chen, Bingxiang He, Hao Zhao, Hao Zhou, Shilong Liu, Mengdi Wang
Affiliations: THU; Princeton; SJTU & UMich; HKU
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Webpage: this https URL

Abstract:Large Language Model (LLM) agents, while proficient in the digital realm, face a significant gap in physical-world deployment due to the challenge of forming and maintaining a robust spatial mental model. We identify three core cognitive challenges hindering this transition: spatial reasoning, long-horizon state tracking via mental simulation, and active exploration under partial observation. To isolate and evaluate these faculties, we introduce CubeBench, a novel generative benchmark centered on the Rubik’s Cube. CubeBench uses a three-tiered diagnostic framework that progressively assesses agent capabilities, from foundational state tracking with full symbolic information to active exploration with only partial visual data. Our experiments on leading LLMs reveal critical limitations, including a uniform 0.00% pass rate on all long-horizon tasks, exposing a fundamental failure in long-term planning. We also propose a diagnostic framework to isolate these cognitive bottlenecks by providing external solver tools. By analyzing the failure modes, we provide key insights to guide the development of more physically-grounded intelligent agents.
zh

[NLP-28] AI4Reading: Chinese Audiobook Interpretation System Based on Multi-Agent Collaboration ACL2025

【速读】: 该论文旨在解决 audiobook 解读(audiobook interpretations)手工制作过程耗时且资源密集的问题。其解决方案的关键在于提出 AI4Reading 系统,这是一个基于大语言模型(Large Language Models, LLMs)与语音合成技术的多智能体协作框架,由11个专业化代理(如主题分析代理、案例分析代理、编辑代理等)协同工作,实现内容准确性、可理解性和逻辑叙事结构的优化,从而自动生成高质量的播客式解读脚本。

链接: https://arxiv.org/abs/2512.23300
作者: Minjiang Huang,Jipeng Qiang,Yi Zhu,Chaowei Zhang,Xiangyu Zhao,Kui Yu
机构: Yangzhou University (扬州大学); City University of Hong Kong (香港城市大学); Hefei University of Technology (合肥工业大学)
类目: Computation and Language (cs.CL)
备注: ACL 2025 demo

点击查看摘要

Abstract:Audiobook interpretations are attracting increasing attention, as they provide accessible and in-depth analyses of books that offer readers practical insights and intellectual inspiration. However, their manual creation process remains time-consuming and resource-intensive. To address this challenge, we propose AI4Reading, a multi-agent collaboration system leveraging large language models (LLMs) and speech synthesis technology to generate podcast-like audiobook interpretations. The system is designed to meet three key objectives: accurate content preservation, enhanced comprehensibility, and a logical narrative structure. To achieve these goals, we develop a framework composed of 11 specialized agents, including topic analysts, case analysts, editors, a narrator, and proofreaders, that work in concert to explore themes, extract real-world cases, refine content organization, and synthesize natural spoken language. By comparing expert interpretations with our system’s output, the results show that although AI4Reading still has a gap in speech generation quality, the generated interpretative scripts are simpler and more accurate.
zh

[NLP-29] Chinese Morph Resolution in E-commerce Live Streaming Scenarios

【速读】: 该论文旨在解决电商直播中主播通过发音变异(morphs)规避监管和进行虚假宣传的问题,特别是在健康与医疗类直播场景下,传统文本识别难以捕捉基于语音的隐晦表达。其解决方案的关键在于提出“直播语音变体解析”(Live Auditory Morph Resolution, LiveAMR)任务,并构建了首个包含86,790条样本的LiveAMR数据集;进一步将该任务转化为文本到文本生成问题,利用大语言模型(Large Language Models, LLMs)合成高质量训练数据以提升模型性能,从而显著增强对直播内容违规行为的识别能力。

链接: https://arxiv.org/abs/2512.23280
作者: Jiahao Zhu,Jipeng Qiang,Ran Bai,Chenyu Liu,Xiaoye Ouyang
机构: Yangzhou University (扬州大学); China Academy of Electronic and Information Technology (中国电子科学研究院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:E-commerce live streaming in China, particularly on platforms like Douyin, has become a major sales channel, but hosts often use morphs to evade scrutiny and engage in false advertising. This study introduces the Live Auditory Morph Resolution (LiveAMR) task to detect such violations. Unlike previous morph research focused on text-based evasion in social media and underground industries, LiveAMR targets pronunciation-based evasion in health and medical live streams. We constructed the first LiveAMR dataset with 86,790 samples and developed a method to transform the task into a text-to-text generation problem. By leveraging large language models (LLMs) to generate additional training data, we improved performance and demonstrated that morph resolution significantly enhances live streaming regulation.
zh

[NLP-30] Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation

【速读】: 该论文旨在解决参数高效微调(Parameter-efficient fine-tuning)中因低秩适配方法(如LoRA)依赖隐式学习的低秩子空间而导致的可解释性缺失与控制能力不足的问题。其关键解决方案是引入预训练稀疏自编码器(Sparse Autoencoders, SAEs),在解耦特征空间中显式识别任务相关的特征,从而构建一个可解释的低秩子空间用于适配器初始化。理论分析表明,在单语义性假设下,基于SAE的方法可实现任意小的恢复误差,而直接在多语义空间中识别则存在不可消除的误差下界;实验验证了该方法在安全对齐任务上优于全量微调(提升7.4个百分点)并接近强化学习人类反馈(RLHF)方法,同时仅更新0.19–0.24%的参数,且提供了对对齐子空间的语义可解释性洞察。

链接: https://arxiv.org/abs/2512.23260
作者: Dianyun Wang,Qingsen Ma,Yuhu Shang,Zhifeng Lu,Lechen Ning,Zhenbo Xu,Huijia Wu,Zhaofeng He
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Parameter-efficient fine-tuning has become the dominant paradigm for adapting large language models to downstream tasks. Low-rank adaptation methods such as LoRA operate under the assumption that task-relevant weight updates reside in a low-rank subspace, yet this subspace is learned implicitly from data in a black-box manner, offering no interpretability or direct control. We hypothesize that this difficulty stems from polysemanticity–individual dimensions encoding multiple entangled concepts. To address this, we leverage pre-trained Sparse Autoencoders (SAEs) to identify task-relevant features in a disentangled feature space, then construct an explicit, interpretable low-rank subspace to guide adapter initialization. We provide theoretical analysis proving that under monosemanticity assumptions, SAE-based subspace identification achieves arbitrarily small recovery error, while direct identification in polysemantic space suffers an irreducible error floor. On safety alignment, our method achieves up to 99.6% safety rate–exceeding full fine-tuning by 7.4 percentage points and approaching RLHF-based methods–while updating only 0.19-0.24% of parameters. Crucially, our method provides interpretable insights into the learned alignment subspace through the semantic grounding of SAE features. Our work demonstrates that incorporating mechanistic interpretability into the fine-tuning process can simultaneously improve both performance and transparency.
zh

[NLP-31] Anka: A Domain-Specific Language for Reliable LLM Code Generation

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂多步骤编程任务中系统性错误频发的问题,其根本原因被归因于通用编程语言(如Python)语法灵活性带来的歧义和隐式状态管理困难。解决方案的关键在于设计一种面向数据转换管道的领域特定语言(Domain-Specific Language, DSL)——Anka,该语言通过显式且受限的语法结构显著降低代码生成中的歧义性。实验表明,尽管Claude 3.5 Haiku未接受过Anka的训练,仍能在100个基准问题上实现99.9%的解析成功率和95.8%的整体任务准确率,尤其在多步骤任务中相比Python提升40个百分点(100% vs. 60%),验证了专为LLM生成优化的DSL能有效提升复杂任务的准确性。

链接: https://arxiv.org/abs/2512.23214
作者: Saif Khalfan Saif Al Mazrouei
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)
备注: 11 pages, 1 figure, 4 tables. Code and benchmarks available at this https URL

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, yet they exhibit systematic errors on complex, multi-step programming tasks. We hypothesize that these errors stem from the flexibility of general-purpose languages, which permits multiple valid approaches and requires implicit state management. To test this hypothesis, we introduce Anka, a domain-specific language (DSL) for data transformation pipelines designed with explicit, constrained syntax that reduces ambiguity in code generation. Despite having zero prior training exposure to Anka, Claude 3.5 Haiku achieves 99.9% parse success and 95.8% overall task accuracy across 100 benchmark problems. Critically, Anka demonstrates a 40 percentage point accuracy advantage over Python on multi-step pipeline tasks (100% vs. 60%), where Python’s flexible syntax leads to frequent errors in operation sequencing and variable management. Cross-model validation with GPT-4o-mini confirms this advantage (+26.7 percentage points on multi-step tasks). Our results demonstrate that: (1) LLMs can learn novel DSLs entirely from in-context prompts, achieving near-native accuracy; (2) constrained syntax significantly reduces errors on complex tasks; and (3) domain-specific languages purposefully designed for LLM generation can outperform general-purpose languages on which the LLM has extensive training. We release the complete language implementation, benchmark suite, and evaluation framework to facilitate further research.
zh

[NLP-32] Scoring Reasoning and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process

【速读】: 该论文旨在解决如何从多个大语言模型(Large Language Models, LLMs)生成的候选响应中自动选择最优答案的问题,尤其在缺乏标注数据的情况下实现高效、可靠的集成决策。其解决方案的关键在于提出了一种受同行评审机制启发的无监督LLM集成方法——LLM-PeerReview,该方法通过三个阶段实现:首先利用LLM-as-a-Judge技术对每个响应进行评分;其次采用基于图模型的真理推断算法或简单平均策略聚合多模型评分以获得最终得分;最后选取得分最高的响应作为最佳集成输出。此框架兼具可解释性与灵活性,无需人工标注即可有效融合不同LLM的优势,在多个数据集上显著优于现有方法。

链接: https://arxiv.org/abs/2512.23213
作者: Zhijun Chen,Zeyu Ji,Qianren Mao,Junhang Cheng,Bangjie Qin,Hao Wu,Zhuoran Li,Jingzheng Li,Kai Sun,Zizhe Wang,Yikun Ban,Zhu Sun,Xiangyang Ji,Hailong Sun
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose LLM-PeerReview, an unsupervised LLM Ensemble method that selects the most ideal response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a clear and interpretable mechanism, while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages: For scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing multiple LLMs at hand; For reasoning, we can apply a principled graphical model-based truth inference algorithm or a straightforward averaging strategy to aggregate multiple scores to produce a final score for each response; Finally, the highest-scoring response is selected as the best ensemble output. LLM-PeerReview is conceptually simple and empirically powerful. The two variants of the proposed approach obtain strong results across four datasets, including outperforming the recent advanced model Smoothie-Global by 6.9% and 7.3% points, respectively.
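
为直观起见,下面用一个极简 Python 草图复现上述“打分—聚合—选优”三阶段流程(仅为示意:judge 接口与示例打分器均为笔者假设,论文中基于图模型的真理推断聚合此处以简单平均代替):

```python
from statistics import mean

def peer_review_select(query, candidates, judges):
    """同行评审式集成:多个评审模型打分 -> 平均聚合 -> 取最高分回答。"""
    scored = []
    for resp in candidates:
        scores = [judge(query, resp) for judge in judges]  # LLM-as-a-Judge 打分
        scored.append((mean(scores), resp))                # 简单平均聚合
    return max(scored, key=lambda s: s[0])[1]              # 选取最佳集成输出

# 用法示意:用三个玩具打分器模拟评审模型
fake_judges = [lambda q, r: float(len(r)), lambda q, r: 3.0, lambda q, r: 1.0]
print(peer_review_select("q", ["answer A", "longer answer B"], fake_judges))
```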
zh

[NLP-33] Not too long do read: Evaluating LLM -generated extreme scientific summaries

【速读】: 该论文旨在解决科学领域中高质量极端摘要(TLDR)生成能力评估的难题,即当前缺乏一个全面且高质量的科学TLDR数据集,导致大语言模型(LLM)在该任务上的性能发展与评测受限。其解决方案的关键在于构建名为BiomedTLDR的新颖数据集:该数据集收录了大量由科研人员亲自撰写的论文极简摘要,其构建利用了研究者在参考文献条目旁附带评论的常见实践;在此基础上,通过测试多种开源大语言模型基于论文摘要生成TLDR的能力,揭示出LLM倾向于保留原文的词汇选择和修辞结构,整体更偏向抽取式而非生成式(abstractive)摘要,与人类专家存在显著差异。

链接: https://arxiv.org/abs/2512.23206
作者: Zhuoqi Lyu,Qing Ke
机构: City University of Hong Kong (香港城市大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:High-quality scientific extreme summary (TLDR) facilitates effective science communication. How do large language models (LLMs) perform in generating them? How are LLM-generated summaries different from those written by human experts? However, the lack of a comprehensive, high-quality scientific TLDR dataset hinders both the development and evaluation of LLMs’ summarization ability. To address these, we propose a novel dataset, BiomedTLDR, containing a large sample of researcher-authored summaries from scientific papers, which leverages the common practice of including authors’ comments alongside bibliography items. We then test popular open-weight LLMs for generating TLDRs based on abstracts. Our analysis reveals that, although some of them successfully produce humanoid summaries, LLMs generally exhibit a greater affinity for the original text’s lexical choices and rhetorical structures, hence tend to be more extractive rather than abstractive in general, compared to humans. Our code and datasets are available at this https URL (Lyu and Ke, 2025).
zh

[NLP-34] Reservoir Computing inspired Matrix Multiplication-free Language Model

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在训练和推理过程中计算成本高昂的问题。其核心解决方案是提出一种无矩阵乘法的語言模型(MatMul-free LM),并通过受水库计算(reservoir computing)启发的架构设计进一步降低训练开销:具体包括部分固定并共享选定层的权重,插入水库层以在不增加额外训练负担的情况下获得丰富的动态表征,并融合多种操作以减少内存访问次数。实验表明,该方法可在保持与基线模型相当性能的前提下,将参数量减少最多19%,训练时间缩短9.9%,推理时间减少8.0%。

链接: https://arxiv.org/abs/2512.23145
作者: Takumi Shiratsuchi,Yuichiro Tanaka,Hakaru Tamukoh
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 10 figures

点击查看摘要

Abstract:Large language models (LLMs) have achieved state-of-the-art performance in natural language processing; however, their high computational cost remains a major bottleneck. In this study, we target computational efficiency by focusing on a matrix multiplication free language model (MatMul-free LM) and further reducing the training cost through an architecture inspired by reservoir computing. Specifically, we partially fix and share the weights of selected layers in the MatMul-free LM and insert reservoir layers to obtain rich dynamic representations without additional training overhead. Additionally, several operations are combined to reduce memory accesses. Experimental results show that the proposed architecture reduces the number of parameters by up to 19%, training time by 9.9%, and inference time by 8.0%, while maintaining comparable performance to the baseline model.
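
作为参考,下面用 PyTorch 勾勒“插入固定随机权重的水库层”这一思路(仅为示意:真实的 MatMul-free LM 以三值权重等免矩阵乘算子实现,本草图为简洁仍使用普通矩阵乘法,维度与缩放方式均为笔者假设):

```python
import torch
import torch.nn as nn

class ReservoirLayer(nn.Module):
    """固定随机权重的水库层:前向提供非线性动态表征,训练中不更新。"""
    def __init__(self, dim, hidden):
        super().__init__()
        # register_buffer:权重随模型保存/迁移,但不是可训练参数
        self.register_buffer("w_in", torch.randn(dim, hidden) / hidden ** 0.5)
        self.register_buffer("w_out", torch.randn(hidden, dim) / dim ** 0.5)

    def forward(self, x):                  # x: (batch, seq, dim)
        h = torch.tanh(x @ self.w_in)      # 展开到水库状态
        return x + h @ self.w_out          # 残差式注入动态表征

x = torch.randn(2, 8, 64)
print(ReservoirLayer(64, 256)(x).shape)    # torch.Size([2, 8, 64])
```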
zh

[NLP-35] A Note on Hybrid Online Reinforcement and Imitation Learning for LLM s: Formulations and Algorithms

【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)微调过程中如何有效融合模仿学习(Imitation Learning)与强化学习(Reinforcement Learning)的问题,以同时实现细粒度的token-level行为复制和长程奖励优化。其解决方案的关键在于提出一个统一框架,通过分析由轨迹级KL散度与任务奖励组成的复合目标函数的梯度,自然分解为两个部分:一是可解析计算的密集梯度(Dense Gradient),用于token-level模仿学习;二是基于蒙特卡洛采样的稀疏梯度(Sparse Gradient),用于长程奖励优化。其中,密集梯度具有闭式对数几率(logit-level)表达式,支持高效GPU并行实现,从而显著提升训练效率与稳定性。

链接: https://arxiv.org/abs/2512.23097
作者: Yingru Li,Ziniu Li,Jiacai Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present a unified framework for Large Language Model (LLM) fine-tuning that integrates Imitation Learning and Reinforcement Learning. By analyzing the gradient of a composite objective combining trajectory-level KL divergence with task rewards, we derive a natural decomposition into two components: (1) an analytically computable Dense Gradient for token-level imitation, and (2) a Monte Carlo estimated Sparse Gradient for long-horizon reward optimization. The Dense Gradient admits a closed-form logit-level formula, enabling efficient GPU implementation.
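
按摘要给出的分解,可写出如下 PyTorch 损失草图:稠密项是逐 token 的 KL(π‖π_ref),可在全词表上解析计算;稀疏项是轨迹级奖励的 REINFORCE 估计(张量形状约定与 beta 取值为笔者假设,非论文官方实现):

```python
import torch
import torch.nn.functional as F

def hybrid_loss(logits, ref_logits, actions, reward, beta=0.1):
    """logits/ref_logits: (T, V);actions: (T,) LongTensor;reward: 标量。"""
    logp = F.log_softmax(logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # 稠密梯度项:KL 的闭式 logit 级表达,逐 token 对全词表求和
    dense = (logp.exp() * (logp - ref_logp)).sum(-1).mean()
    # 稀疏梯度项:REINFORCE 形式,奖励只在轨迹结束时可得
    chosen = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    sparse = -reward * chosen.sum()
    return beta * dense + sparse
```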
zh

[NLP-36] abiBERT: A Large-Scale ModernBERT Foundation Model and Unified Benchmarking Framework for Turkish

【速读】: 该论文旨在解决土耳其语自然语言处理(Natural Language Processing, NLP)领域缺乏基于现代架构从头训练的单语编码器模型的问题。现有模型如BERTurk在计算效率、长文本建模能力和训练稳定性方面存在局限,且未充分整合近年来Transformer架构的关键改进。解决方案的核心在于提出TabiBERT,一个基于ModernBERT架构的土耳其语单语编码器,其关键创新包括:集成旋转位置编码(Rotary Positional Embeddings, RoPE)、FlashAttention优化以及改进的归一化策略;在包含1万亿token的多领域高质量语料库上从头预训练,支持8192 token上下文长度(为原始BERT的16倍),实现高达2.65倍的推理加速和GPU显存降低;并通过构建标准化评估基准TabiBench(涵盖28个数据集、8类任务)验证其跨域泛化能力,最终在多项任务中达到SOTA性能,显著优于此前最优模型。

链接: https://arxiv.org/abs/2512.23065
作者: Melikşah Türker,A. Ebrar Kızıloğlu,Onur Güngör,Susan Üsküdarlı
机构: Boğaziçi University (博阿齐奇大学); VNGRS-AI (VNGRS-AI)
类目: Computation and Language (cs.CL)
备注: 31 pages, 1 figure, 13 tables

点击查看摘要

Abstract:Since the inception of BERT, encoder-only Transformers have evolved significantly in computational efficiency, training stability, and long-context modeling. ModernBERT consolidates these advances by integrating Rotary Positional Embeddings (RoPE), FlashAttention, and refined normalization. Despite these developments, Turkish NLP lacks a monolingual encoder trained from scratch incorporating such modern architectural paradigms. This work introduces TabiBERT, a monolingual Turkish encoder based on ModernBERT architecture trained from scratch on a large, curated corpus. TabiBERT is pre-trained on one trillion tokens sampled from an 84.88B token multi-domain corpus: web text (73%), scientific publications (20%), source code (6%), and mathematical content (0.3%). The model supports 8,192-token context length (16x original BERT), achieves up to 2.65x inference speedup, and reduces GPU memory consumption, enabling larger batch sizes. We introduce TabiBench with 28 datasets across eight task categories with standardized splits and protocols, evaluated using GLUE-style macro-averaging. TabiBERT attains 77.58 on TabiBench, outperforming BERTurk by 1.62 points and establishing state-of-the-art on five of eight categories: question answering (+9.55), code retrieval (+2.41), and document retrieval (+0.60). Compared with task-specific prior best results, including specialized models like TurkishBERTweet, TabiBERT achieves +1.47 average improvement, indicating robust cross-domain generalization. We release model weights, training configurations, and evaluation code for transparent, reproducible Turkish encoder research.
zh

[NLP-37] Accelerating Language Model Workflows with Prompt Choreography ACL

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在多智能体工作流中因重复计算导致的低效问题。其核心挑战在于,传统方法在每次LLM调用时均需重新编码历史消息,造成显著的时间开销,尤其是在冗余计算占主导的工作流中。解决方案的关键是提出Prompt Choreography框架,通过维护一个动态的全局键值缓存(KV cache),使每次LLM调用能够访问任意重排序的先前编码消息子集,并支持并行调用。尽管缓存编码可能与重新编码产生差异,作者通过微调LLM使其适应缓存机制,从而有效复现原始结果,最终实现单条消息延迟降低2.0–6.2倍、端到端速度提升最高达2.2倍的显著性能改进。

链接: https://arxiv.org/abs/2512.23049
作者: TJ Bai,Jason Eisner
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL)
备注: to appear in TACL (final preprint of 2025-10-12); 10 pages + appendices

点击查看摘要

Abstract:Large language models are increasingly deployed in multi-agent workflows. We introduce Prompt Choreography, a framework that efficiently executes LLM workflows by maintaining a dynamic, global KV cache. Each LLM call can attend to an arbitrary, reordered subset of previously encoded messages. Parallel calls are supported. Though caching messages’ encodings sometimes gives different results from re-encoding them in a new context, we show in diverse settings that fine-tuning the LLM to work with the cache can help it mimic the original results. Prompt Choreography significantly reduces per-message latency (2.0–6.2× faster time-to-first-token) and achieves substantial end-to-end speedups (2.2×) in some workflows dominated by redundant computation.
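
下面是一个极简玩具示意,仅说明“按消息缓存编码、每次调用取任意重排子集”的组织方式;真实系统操作的是 Transformer 的 KV 缓存,且需按论文所述微调模型以适配缓存,以下接口均为笔者假设:

```python
class MessageCache:
    """按消息粒度缓存编码结果,避免重复编码历史上下文。"""
    def __init__(self, encode):
        self.encode = encode          # encode(text) -> 表示(假设消息可独立编码)
        self.store = {}

    def context(self, ordered_ids, texts):
        """按任意顺序取已编码消息子集;未命中的消息只编码一次。"""
        for i in ordered_ids:
            if i not in self.store:
                self.store[i] = self.encode(texts[i])
        return [self.store[i] for i in ordered_ids]

cache = MessageCache(encode=str.upper)          # 玩具“编码器”
texts = {0: "system", 1: "user turn", 2: "tool output"}
print(cache.context([2, 0], texts))             # ['TOOL OUTPUT', 'SYSTEM']
print(cache.context([0, 1, 2], texts))          # 0、2 直接复用缓存,只新编码 1
```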
zh

[NLP-38] Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization

【速读】: 该论文旨在解决当前基于提示注入提示(biasing features)的可解释性评估方法在判断思维链(Chain-of-Thought, CoT)是否忠实(faithful)时存在的误判问题,即该方法将因token预算限制导致的推理过程省略(incompleteness)错误识别为不忠实(unfaithfulness)。其解决方案的关键在于提出一个新的“faithful@k”指标,并结合因果中介分析(Causal Mediation Analysis),证明即使某些提示信息未被显式表述于CoT中,仍可能通过因果路径影响最终预测结果。研究进一步表明,增加推理时的token预算可显著提升提示信息的显式表达比例(最高达90%),从而缓解看似不忠实的现象。因此,论文强调应避免仅依赖提示相关的评估指标,而需采用包括因果中介和扰动测试在内的更全面的可解释性工具集。

链接: https://arxiv.org/abs/2512.23032
作者: Kerem Zaman,Shashank Srivastava
机构: UNC Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 18 pages, 20 figures, 5 tables

点击查看摘要

Abstract:Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with Llama-3 and Gemma-3, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models. With a new faithful@k metric, we show that larger inference-time token budgets greatly increase hint verbalization (up to 90% in some settings), suggesting much apparent unfaithfulness is due to tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT. We therefore caution against relying solely on hint-based evaluations and advocate a broader interpretability toolkit, including causal mediation and corruption-based metrics.
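
faithful@k 的具体计算方式摘要未给出,以下是笔者按名称推测的一种实现:在推理预算 k 下重复采样推理链,只要任一条显式表述了注入提示即计为 faithful(generate 与 verbalizes 均为假设接口):

```python
def faithful_at_k(examples, generate, verbalizes, k=8):
    """估计 faithful@k:对每个样例采样 k 条 CoT,任一条表述提示即命中。
    generate(ex) -> CoT 文本;verbalizes(cot, hint) -> bool。"""
    hits = 0
    for ex in examples:
        if any(verbalizes(generate(ex), ex["hint"]) for _ in range(k)):
            hits += 1
    return hits / len(examples)
```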
zh

[NLP-39] LENS: LLM -Enabled Narrative Synthesis for Mental Health by Aligning Multimodal Sensing with Language Models

【速读】: 该论文旨在解决将多模态健康传感数据(如长时间序列的生理信号)转化为自然语言描述以支持心理健康评估的难题,尤其针对当前大语言模型(LLM)无法原生处理长时序传感器数据以及配对的传感器-文本数据集稀缺的问题。其解决方案的关键在于提出LENS框架:首先构建一个包含超过10万条传感器-文本问答对的大规模数据集,通过将生态瞬时评估(Ecological Momentary Assessment, EMA)中与抑郁和焦虑症状相关的回答转化为自然语言描述;其次设计一种基于patch级别的编码器,可直接将原始传感器信号映射到LLM的表示空间,从而实现对原始行为信号的原生集成与推理。实验证明,LENS在标准NLP指标和症状严重程度预测任务上均优于基线方法,并获得临床专家认可,为LLM作为健康传感接口提供了可扩展路径。

链接: https://arxiv.org/abs/2512.23025
作者: Wenxuan Xu,Arvind Pillai,Subigya Nepal,Amanda C Collins,Daniel M Mackin,Michael V Heinz,Tess Z Griffin,Nicholas C Jacobson,Andrew Campbell
机构: Dartmouth College (达特茅斯学院); University of Virginia (弗吉尼亚大学); Massachusetts General Hospital (马萨诸塞州总医院); Harvard Medical School (哈佛医学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 22 pages, 9 figures, under review

点击查看摘要

Abstract:Multimodal health sensing offers rich behavioral signals for assessing mental health, yet translating these numerical time-series measurements into natural language remains challenging. Current LLMs cannot natively ingest long-duration sensor streams, and paired sensor-text datasets are scarce. To address these challenges, we introduce LENS, a framework that aligns multimodal sensing data with language models to generate clinically grounded mental-health narratives. LENS first constructs a large-scale dataset by transforming Ecological Momentary Assessment (EMA) responses related to depression and anxiety symptoms into natural-language descriptions, yielding over 100,000 sensor-text QA pairs from 258 participants. To enable native time-series integration, we train a patch-level encoder that projects raw sensor signals directly into an LLM’s representation space. Our results show that LENS outperforms strong baselines on standard NLP metrics and task-specific measures of symptom-severity accuracy. A user study with 13 mental-health professionals further indicates that LENS-produced narratives are comprehensive and clinically meaningful. Ultimately, our approach advances LLMs as interfaces for health sensing, providing a scalable path toward models that can reason over raw behavioral signals and support downstream clinical decision-making.
zh

[NLP-40] Improving Generalization in LLM Structured Pruning via Function-Aware Neuron Grouping

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在后训练结构化剪枝过程中,因少量少样本校准集无法充分反映预训练数据分布而导致的下游任务泛化性能受限的问题。其解决方案的关键在于提出一种函数感知的神经元分组方法(Function-Aware Neuron Grouping, FANG),通过识别并保留对特定功能至关重要的神经元来缓解校准偏差:FANG基于神经元处理的语义上下文类型对神经元进行分组,并独立剪枝每组;在重要性评估时,赋予与神经元组功能角色强相关的token更高权重;同时保留跨多种上下文类型的贡献神经元;此外,FANG根据模块的功能复杂度自适应分配稀疏度,从而在稀疏率与性能之间实现更优平衡。

链接: https://arxiv.org/abs/2512.23014
作者: Tao Yu,Yongqi An,Kuan Zhu,Guibo Zhu,Ming Tang,Jinqiao Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate impressive performance across natural language tasks but incur substantial computational and storage costs due to their scale. Post-training structured pruning offers an efficient solution. However, when few-shot calibration sets fail to adequately reflect the pretraining data distribution, existing methods exhibit limited generalization to downstream tasks. To address this issue, we propose Function-Aware Neuron Grouping (FANG), a post-training pruning framework that alleviates calibration bias by identifying and preserving neurons critical to specific function. FANG groups neurons with similar function based on the type of semantic context they process and prunes each group independently. During importance estimation within each group, tokens that strongly correlate with the functional role of the neuron group are given higher weighting. Additionally, FANG also preserves neurons that contribute across multiple context types. To achieve a better trade-off between sparsity and performance, it allocates sparsity to each block adaptively based on its functional complexity. Experiments show that FANG improves downstream accuracy while preserving language modeling performance. It achieves the state-of-the-art (SOTA) results when combined with FLAP and OBC, two representative pruning methods. Specifically, FANG outperforms FLAP and OBC by 1.5%–8.5% in average accuracy under 30% and 40% sparsity.
zh

[NLP-41] Prompt engineering does not universally improve Large Language Model performance across clinical decision-making tasks

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在真实临床决策支持中的实用性和有效性问题,尤其是在完整临床推理流程中各关键任务上的表现差异与优化潜力。研究通过系统评估三种前沿LLMs(ChatGPT-4o、Gemini 1.5 Pro和Llama 3.3 70B)在36个病例中的五项临床决策任务(包括鉴别诊断、紧急处理步骤、相关检查建议、最终诊断和治疗推荐)的表现,发现模型性能存在显著任务依赖性,尤其在“相关诊断测试”任务上表现较差。解决方案的关键在于探索提示工程(prompt engineering)的作用机制,特别是基于MedPrompt框架的动态少样本学习策略——结果显示,提示工程并非通用提升手段:它虽能显著改善低基线任务(如诊断测试建议),但在其他任务中可能适得其反;且目标匹配的动态示例并未优于随机选择,表明上下文多样性对模型表现的重要性不可忽视。因此,该研究强调需采用模型和任务定制化的提示策略,以实现LLMs在医疗场景下的高效、安全集成。

链接: https://arxiv.org/abs/2512.22966
作者: Mengdi Chai,Ali R. Zomorrodi
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated promise in medical knowledge assessments, yet their practical utility in real-world clinical decision-making remains underexplored. In this study, we evaluated the performance of three state-of-the-art LLMs-ChatGPT-4o, Gemini 1.5 Pro, and Llama 3.3 70B-in clinical decision support across the entire clinical reasoning workflow of a typical patient encounter. Using 36 case studies, we first assessed LLM’s out-of-the-box performance across five key sequential clinical decision-making tasks under two temperature settings (default vs. zero): differential diagnosis, essential immediate steps, relevant diagnostic testing, final diagnosis, and treatment recommendation. All models showed high variability by task, achieving near-perfect accuracy in final diagnosis, poor performance in relevant diagnostic testing, and moderate performance in remaining tasks. Furthermore, ChatGPT performed better under the zero temperature, whereas Llama showed stronger performance under the default temperature. Next, we assessed whether prompt engineering could enhance LLM performance by applying variations of the MedPrompt framework, incorporating targeted and random dynamic few-shot learning. The results demonstrate that prompt engineering is not a one-size-fits-all solution. While it significantly improved the performance on the task with lowest baseline accuracy (relevant diagnostic testing), it was counterproductive for others. Another key finding was that the targeted dynamic few-shot prompting did not consistently outperform random selection, indicating that the presumed benefits of closely matched examples may be counterbalanced by loss of broader contextual diversity. These findings suggest that the impact of prompt engineering is highly model and task-dependent, highlighting the need for tailored, context-aware strategies for integrating LLMs into healthcare.
zh

[NLP-42] Diversity or Precision? A Deep Dive into Next Token Prediction

【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)训练中因预训练模型的词元输出分布(token-output distribution)限制而导致探索空间受限的问题,从而影响大型语言模型(Large Language Models, LLMs)推理能力提升的瓶颈。解决方案的关键在于提出一种广义的预训练目标,将在线策略强化学习(on-policy RL)原理引入监督学习框架,通过重构预训练阶段的损失函数来优化探索空间:具体而言,采用正向奖励缩放因子控制对真实词元的概率集中程度,并设计基于排名的机制不对称地处理高排名与低排名负样本,从而在多样性与精度之间实现平衡,最终为后续RL训练提供更有利于推理性能提升的探索环境。

链接: https://arxiv.org/abs/2512.22955
作者: Haoyuan Wu,Hai Wang,Jiajia Wu,Jinxiang Ou,Keyao Wang,Weile Chen,Zihao Zheng,Bei Yu
机构: Tencent(腾讯); The Chinese University of Hong Kong(香港中文大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements have shown that reinforcement learning (RL) can substantially improve the reasoning abilities of large language models (LLMs). The effectiveness of such RL training, however, depends critically on the exploration space defined by the pre-trained model’s token-output distribution. In this paper, we revisit the standard cross-entropy loss, interpreting it as a specific instance of policy gradient optimization applied within a single-step episode. To systematically study how the pre-trained distribution shapes the exploration potential for subsequent RL, we propose a generalized pre-training objective that adapts on-policy RL principles to supervised learning. By framing next-token prediction as a stochastic decision process, we introduce a reward-shaping strategy that explicitly balances diversity and precision. Our method employs a positive reward scaling factor to control probability concentration on ground-truth tokens and a rank-aware mechanism that treats high-ranking and low-ranking negative tokens asymmetrically. This allows us to reshape the pre-trained token-output distribution and investigate how to provide a more favorable exploration space for RL, ultimately enhancing end-to-end reasoning performance. Contrary to the intuition that higher distribution entropy facilitates effective exploration, we find that imposing a precision-oriented prior yields a superior exploration space for RL.
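
下面给出该广义目标的一种 PyTorch 草图(具体细节为笔者假设,仅作结构示意):alpha 放大真实词元的正奖励以控制概率集中度,排名感知项只惩罚与真值竞争的前 top_m 个负词元,低排名负词元不做处理:

```python
import torch
import torch.nn.functional as F

def shaped_nll(logits, targets, alpha=2.0, top_m=20, neg_w=0.1):
    """logits: (N, V);targets: (N,) LongTensor。"""
    logp = F.log_softmax(logits, dim=-1)
    # 正奖励缩放:alpha 控制对真实词元的概率集中程度
    pos = -alpha * logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # 排名感知:只压低排名前 top_m 且非真值的负词元概率
    _, topi = logits.topk(top_m, dim=-1)
    mask = topi.ne(targets.unsqueeze(-1)).float()
    neg = (logp.exp().gather(-1, topi) * mask).sum(-1) * neg_w
    return (pos + neg).mean()
```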
zh

[NLP-43] Multimodal Fact-Checking: An Agent -based Approach

【速读】: 该论文旨在解决多模态虚假信息(Multimodal Misinformation)在自动化事实核查系统中日益增长的挑战,现有方法如大视觉语言模型(LVLMs)和深度多模态融合技术因推理能力有限及证据利用浅层而效果不佳。其关键解决方案是提出RW-Post数据集与AgentFact框架:RW-Post是一个高质量、可解释的现实世界多模态事实核查数据集,包含原始社交媒体帖子中的完整上下文信息,并通过大语言模型辅助提取人类撰写核查文章中的详细推理链与明确关联证据;在此基础上构建的AgentFact框架采用五类专业化代理(agents)协作完成策略规划、高质量证据检索、视觉分析、推理与解释生成等子任务,通过迭代式工作流实现证据搜索与任务感知过滤及推理的交替进行,从而显著提升多模态事实核查的准确性与可解释性。

链接: https://arxiv.org/abs/2512.22933
作者: Danni Xu,Shaojing Fan,Xuanang Cheng,Mohan Kankanhalli
机构: National University of Singapore (新加坡国立大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Code and dataset will be released at this https URL

点击查看摘要

Abstract:The rapid spread of multimodal misinformation poses a growing challenge for automated fact-checking systems. Existing approaches, including large vision language models (LVLMs) and deep multimodal fusion methods, often fall short due to limited reasoning and shallow evidence utilization. A key bottleneck is the lack of dedicated datasets that provide complete real-world multimodal misinformation instances accompanied by annotated reasoning processes and verifiable evidence. To address this limitation, we introduce RW-Post, a high-quality and explainable dataset for real-world multimodal fact-checking. RW-Post aligns real-world multimodal claims with their original social media posts, preserving the rich contextual information in which the claims are made. In addition, the dataset includes detailed reasoning and explicitly linked evidence, which are derived from human written fact-checking articles via a large language model assisted extraction pipeline, enabling comprehensive verification and explanation. Building upon RW-Post, we propose AgentFact, an agent-based multimodal fact-checking framework designed to emulate the human verification workflow. AgentFact consists of five specialized agents that collaboratively handle key fact-checking subtasks, including strategy planning, high-quality evidence retrieval, visual analysis, reasoning, and explanation generation. These agents are orchestrated through an iterative workflow that alternates between evidence searching and task-aware evidence filtering and reasoning, facilitating strategic decision-making and systematic evidence analysis. Extensive experimental results demonstrate that the synergy between RW-Post and AgentFact substantially improves both the accuracy and interpretability of multimodal fact-checking.
zh

[NLP-44] Debugging Tabular Log as Dynamic Graphs

【速读】: 该论文旨在解决当前基于文本增强的表格日志(tabular log)调试方法过度依赖大语言模型(Large Language Models, LLMs)而导致灵活性和可扩展性受限的问题。其解决方案的关键在于提出一种名为 GraphLogDebugger 的新框架,该框架通过构建异构节点(对象与事件)并建立节点间边连接,将表格日志建模为动态图结构,从而恢复出系统背后的演化过程;在此基础上,仅使用一个简单的动态图神经网络(Dynamic Graph Neural Network, GNN)即可在真实世界日志数据集上实现优于LLMs的调试性能,验证了动态图建模在提升效率与准确性方面的有效性。

链接: https://arxiv.org/abs/2512.22903
作者: Chumeng Liang,Zhanyang Jin,Zahaib Akhtar,Mona Pereira,Haofei Yu,Jiaxuan You
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Amazon (亚马逊)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Tabular log abstracts objects and events in the real-world system and reports their updates to reflect the change of the system, where one can detect real-world inconsistencies efficiently by debugging corresponding log entries. However, recent advances in processing text-enriched tabular log data overly depend on large language models (LLMs) and other heavy-load models, thus suffering from limited flexibility and scalability. This paper proposes a new framework, GraphLogDebugger, to debug tabular log based on dynamic graphs. By constructing heterogeneous nodes for objects and events and connecting node-wise edges, the framework recovers the system behind the tabular log as an evolving dynamic graph. With the help of our dynamic graph modeling, a simple dynamic Graph Neural Network (GNN) is representative enough to outperform LLMs in debugging tabular log, which is validated by experimental results on real-world log datasets of computer systems and academic papers.
zh

[NLP-45] AutoForge: Automated Environment Synthesis for Agent ic Reinforcement Learning

【速读】: 该论文旨在解决当前基于强化学习(Reinforcement Learning, RL)的语言代理在模拟环境中训练时面临的两大核心问题:一是现有环境合成方法多为半自动化且任务难度不足,缺乏广度与深度;二是模拟用户行为的不稳定性及不同环境间的异质性导致代理强化学习训练效率低、不稳定。解决方案的关键在于提出一个统一的自动化、可扩展的模拟环境合成流水线,能够生成高难度但易于验证的任务,并设计一种环境层级的强化学习算法,该算法通过在环境层面进行优势估计(advantage estimation),有效缓解用户不稳定性问题,从而显著提升训练效率与鲁棒性。

链接: https://arxiv.org/abs/2512.22857
作者: Shihao Cai,Runnan Fang,Jialong Wu,Baixuan Li,Xinyu Wang,Yong Jiang,Liangcai Su,Liwen Zhang,Wenbiao Yin,Zhen Zhang,Fuli Feng,Pengjun Xie,Xiaobin Wang
机构: Tongyi Lab, Alibaba Group(阿里巴巴集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Conducting reinforcement learning (RL) in simulated environments offers a cost-effective and highly scalable way to enhance language-based agents. However, previous work has been limited to semi-automated environment synthesis or tasks lacking sufficient difficulty, offering little breadth or depth. In addition, the instability of simulated users integrated into these environments, along with the heterogeneity across simulated environments, poses further challenges for agentic RL. In this work, we propose: (1) a unified pipeline for automated and scalable synthesis of simulated environments associated with high-difficulty but easily verifiable tasks; and (2) an environment level RL algorithm that not only effectively mitigates user instability but also performs advantage estimation at the environment level, thereby improving training efficiency and stability. Comprehensive evaluations on agentic benchmarks, including tau-bench, tau2-Bench, and VitaBench, validate the effectiveness of our proposed method. Further in-depth analyses underscore its out-of-domain generalization.
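
“环境层级优势估计”的一个最小示意如下(数据结构与接口为笔者假设):同一环境内的轨迹共享基线,从而缓解环境间异质性与模拟用户不稳定带来的方差:

```python
def env_level_advantages(rewards_by_env):
    """输入 {env_id: [r1, r2, ...]},返回逐环境做基线归一的优势。"""
    adv = {}
    for env, rs in rewards_by_env.items():
        base = sum(rs) / len(rs)             # 环境内均值作为基线
        adv[env] = [r - base for r in rs]
    return adv

print(env_level_advantages({"shop": [1.0, 0.0], "travel": [0.2, 0.8, 0.5]}))
```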
zh

[NLP-46] NepEMO: A Multi-Label Emotion and Sentiment Analysis on Nepali Reddit with Linguistic Insights and Temporal Trends DATE

【速读】: 该论文旨在解决在尼泊尔语(Nepali)社交媒体帖子中进行多标签情感(multi-label emotion, MLE)和情感分类(sentiment classification, SC)的难题,尤其针对Reddit平台上匿名用户表达的敏感话题内容。其解决方案的关键在于构建了一个高质量、多语言(英文、罗马化尼泊尔语和天城文书写系统)的手动标注数据集——NepEMO,包含4,462条帖子,并基于此数据集开展了细致的语言学分析(如情绪趋势、情绪共现模式、情感特异性n-gram及主题建模),同时系统比较了传统机器学习(ML)、深度学习(DL)与Transformer模型在MLE和SC任务上的性能表现,结果表明Transformer模型在两项任务中均显著优于其他方法,验证了其在低资源语言场景下处理复杂情感识别问题的有效性。

链接: https://arxiv.org/abs/2512.22823
作者: Sameer Sitoula,Tej Bahadur Shahi,Laxmi Prasad Bhatt,Anisha Pokhrel,Arjun Neupane
机构: Queensland University of Technology (昆士兰科技大学); Central Queensland University (中央昆士兰大学)
类目: Computation and Language (cs.CL)
备注: This paper is under consideration in Neural Computing Applications (Springer) journal. This version may be deleted or updated at any time, depending on the journal’s policy upon acceptance

点击查看摘要

Abstract:Social media (SM) platforms (e.g. Facebook, Twitter, and Reddit) are increasingly leveraged to share opinions and emotions, specifically during challenging events, such as natural disasters, pandemics, and political elections, and joyful occasions like festivals and celebrations. Among the SM platforms, Reddit provides a unique space for its users to anonymously express their experiences and thoughts on sensitive issues such as health and daily life. In this work, we present a novel dataset, called NepEMO, for multi-label emotion (MLE) and sentiment classification (SC) on Nepali subreddit posts. We curate and build a manually annotated dataset of 4,462 posts (January 2019 – June 2025) written in English, Romanised Nepali and Devanagari script for five emotions (fear, anger, sadness, joy, and depression) and three sentiment classes (positive, negative, and neutral). We perform a detailed analysis of posts to capture linguistic insights, including emotion trends, co-occurrence of emotions, sentiment-specific n-grams, and topic modelling using Latent Dirichlet Allocation and TF-IDF keyword extraction. Finally, we compare various traditional machine learning (ML), deep learning (DL), and transformer models for MLE and SC tasks. The result shows that transformer models consistently outperform the ML and DL models for both tasks.
zh

[NLP-47] CNSight: Evaluation of Clinical Note Segmentation Tools

【速读】: 该论文旨在解决临床笔记在从电子病历(Electronic Medical Record, EMR)系统中提取后常以非结构化或半结构化格式存储的问题,这使得其在二次分析和下游临床应用中的使用变得复杂。解决方案的关键在于可靠地识别笔记中的段落边界(section boundaries),从而将不同临床语境(如现病史、用药记录和出院指导)进行结构化划分。研究通过评估基于规则的基线模型、领域特定的Transformer模型以及大语言模型(Large Language Models, LLMs)在MIMIC-IV数据集上的表现,发现基于API的大语言模型(如GPT-5-mini)在句子级和自由文本分割任务上均取得最佳平均F1分数(72.4),展现出对复杂非结构化文本的强大适应能力,为后续的信息抽取、队列识别和自动摘要等任务奠定了基础。

链接: https://arxiv.org/abs/2512.22795
作者: Risha Surana,Adrian Law,Sunwoo Kim,Rishab Sridhar,Angxiao Han,Peiyu Hong
机构: University of Southern California (南加州大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Clinical notes are often stored in unstructured or semi-structured formats after extraction from electronic medical record (EMR) systems, which complicates their use for secondary analysis and downstream clinical applications. Reliable identification of section boundaries is a key step toward structuring these notes, as sections such as history of present illness, medications, and discharge instructions each provide distinct clinical contexts. In this work, we evaluate rule-based baselines, domain-specific transformer models, and large language models for clinical note segmentation using a curated dataset of 1,000 notes from MIMIC-IV. Our experiments show that large API-based models achieve the best overall performance, with GPT-5-mini reaching a best average F1 of 72.4 across sentence-level and freetext segmentation. Lightweight baselines remain competitive on structured sentence-level tasks but falter on unstructured freetext. Our results provide guidance for method selection and lay the groundwork for downstream tasks such as information extraction, cohort identification, and automated summarization.
zh

[NLP-48] Fake News Classification in Urdu: A Domain Adaptation Approach for a Low-Resource Language

【速读】: 该论文旨在解决低资源语言(如乌尔都语)在社交媒体虚假信息检测中的性能瓶颈问题,尤其是现有通用多语言预训练模型在处理领域特定术语时表现不佳的问题。其解决方案的关键在于引入领域自适应预训练(domain-adaptive pretraining)阶段,在微调前利用公开的乌尔都语新闻语料库对模型进行进一步训练,从而提升模型对目标领域的泛化能力。实验表明,经过领域自适应预训练的XLM-RoBERTa在多个乌尔都语虚假新闻数据集上显著优于未适配的原始版本,而mBERT的表现则呈现不一致性,验证了领域适应策略的有效性和模型选择的重要性。

链接: https://arxiv.org/abs/2512.22778
作者: Muhammad Zain Ali,Bernhard Pfahringer,Tony Smith
机构: University of Waikato(怀卡托大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Misinformation on social media is a widely acknowledged issue, and researchers worldwide are actively engaged in its detection. However, low-resource languages such as Urdu have received limited attention in this domain. An obvious approach is to utilize a multilingual pretrained language model and fine-tune it for a downstream classification task, such as misinformation detection. However, these models struggle with domain-specific terms, leading to suboptimal performance. To address this, we investigate the effectiveness of domain adaptation before fine-tuning for fake news classification in Urdu, employing a staged training approach to optimize model generalization. We evaluate two widely used multilingual models, XLM-RoBERTa and mBERT, and apply domain-adaptive pretraining using a publicly available Urdu news corpus. Experiments on four publicly available Urdu fake news datasets show that domain-adapted XLM-R consistently outperforms its vanilla counterpart, while domain-adapted mBERT exhibits mixed results.
zh

[NLP-49] xt-Routed Sparse Mixture-of-Experts Model with Explanation and Temporal Alignment for Multi-Modal Sentiment Analysis AAAI2026

【速读】: 该论文旨在解决多模态情感分析(Multi-modal Sentiment Analysis, MSA)中解释能力不足与模态间时间对齐不充分的问题。现有方法虽能捕捉不同模态中的细微情感,但在利用解释信息和精确对齐音频、视频等时序特征方面仍存在局限。解决方案的关键在于提出Text-routed sparse mixture-of-Experts模型(TEXT),其创新性体现在:首先通过多模态大语言模型(Multi-modal Large Language Models, MLLM)增强解释信息;其次设计一个面向时序特性的神经网络模块,融合Mamba与时间交叉注意力机制,实现音频和视频表征的精准对齐;最后引入基于文本路由的稀疏专家混合结构与门控融合机制,使模型在跨模态交互中更高效地整合信息。实验表明,TEXT在四个数据集上均优于对比模型,在六项指标中至少四项领先,例如在CH-SIMS数据集上将平均绝对误差降低至0.353,相较最新方法提升13.5%。

链接: https://arxiv.org/abs/2512.22741
作者: Dongning Rao,Yunbiao Zeng,Zhihua Jiang,Jujian Lv
机构: 未知
类目: Computation and Language (cs.CL)
备注: 9 pages, 9 figures, accepted by AAAI 2026

点击查看摘要

Abstract:Human-interaction-involved applications underscore the need for Multi-modal Sentiment Analysis (MSA). Although many approaches have been proposed to address the subtle emotions in different modalities, the power of explanations and temporal alignments is still underexplored. Thus, this paper proposes the Text-routed sparse mixture-of-Experts model with eXplanation and Temporal alignment for MSA (TEXT). TEXT first augments explanations for MSA via Multi-modal Large Language Models (MLLM), and then novelly aligns the representations of audio and video through a temporality-oriented neural network block. TEXT aligns different modalities with explanations and facilitates a new text-routed sparse mixture-of-experts with gate fusion. Our temporal alignment block merges the benefits of Mamba and temporal cross-attention. As a result, TEXT achieves the best performance across four datasets among all tested models, including three recently proposed approaches and three MLLMs. TEXT wins on at least four metrics out of all six metrics. For example, TEXT decreases the mean absolute error to 0.353 on the CH-SIMS dataset, which signifies a 13.5% decrement compared with recently proposed approaches.
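
“文本路由 + 门控融合”的结构可用如下 PyTorch 草图表达(专家数、top-k 与门控形式均为笔者假设,论文中的时间对齐模块未包含在内):

```python
import torch
import torch.nn as nn

class TextRoutedMoE(nn.Module):
    """路由只看文本表示;top-k 稀疏专家处理融合特征;门控残差融合。"""
    def __init__(self, dim, n_experts=4, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.gate = nn.Linear(2 * dim, dim)
        self.k = k

    def forward(self, text, fused):            # text/fused: (B, dim)
        topw, topi = self.router(text).softmax(-1).topk(self.k, dim=-1)
        out = torch.zeros_like(fused)
        for j in range(self.k):                # 稀疏:每个样本只激活 k 个专家
            for e, expert in enumerate(self.experts):
                sel = topi[:, j].eq(e)
                if sel.any():
                    out[sel] += topw[sel, j:j+1] * expert(fused[sel])
        g = torch.sigmoid(self.gate(torch.cat([text, out], -1)))
        return g * out + (1 - g) * fused       # 门控融合
```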
zh

[NLP-50] Harnessing Large Language Models for Biomedical Named Entity Recognition

【速读】: 该论文旨在解决通用大语言模型(Large Language Models, LLMs)在生物医学命名实体识别(Biomedical Named Entity Recognition, BioNER)任务中因缺乏领域专业知识及低质量训练数据导致性能下降的问题。其解决方案的关键在于提出了一种以数据质量为核心的高效微调框架BioSelectTune,该框架将BioNER重构为结构化JSON生成任务,并引入新颖的混合超筛选(Hybrid Superfiltering)策略——一种从弱模型到强模型的数据筛选方法,通过同源弱模型蒸馏出紧凑且高影响力的高质量训练数据集,从而显著提升模型性能。

链接: https://arxiv.org/abs/2512.22738
作者: Jian Chen,Leilei Su,Cong Sun
机构: Hainan University (海南大学); Weill Cornell Medicine (威尔康奈尔医学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Background and Objective: Biomedical Named Entity Recognition (BioNER) is a foundational task in medical informatics, crucial for downstream applications like drug discovery and clinical trial matching. However, adapting general-domain Large Language Models (LLMs) to this task is often hampered by their lack of domain-specific knowledge and the performance degradation caused by low-quality training data. To address these challenges, we introduce BioSelectTune, a highly efficient, data-centric framework for fine-tuning LLMs that prioritizes data quality over quantity. Methods and Results: BioSelectTune reformulates BioNER as a structured JSON generation task and leverages our novel Hybrid Superfiltering strategy, a weak-to-strong data curation method that uses a homologous weak model to distill a compact, high-impact training dataset. Conclusions: Through extensive experiments, we demonstrate that BioSelectTune achieves state-of-the-art (SOTA) performance across multiple BioNER benchmarks. Notably, our model, trained on only 50% of the curated positive data, not only surpasses the fully-trained baseline but also outperforms powerful domain-specialized models like BioMedBERT.
zh

[NLP-51] WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference

【速读】: 该论文旨在解决生成式 AI(Generative AI)中基于自回归(Autoregressive, AR)解码机制在推理阶段并行度受限的问题,以及扩散语言模型(Diffusion Language Models, DLLMs)虽具备并行生成能力却难以在实际部署中超越优化后的 AR 引擎(如 vLLM)的效率瓶颈。其核心问题在于多数 DLLMs 依赖双向注意力机制,破坏了标准前缀键值缓存(prefix KV caching),导致重复上下文计算,从而削弱了并行优势。解决方案的关键在于提出 WeDLM 框架,该框架完全基于因果注意力(causal attention)构建,并引入拓扑重排序(Topological Reordering)技术,使每个被掩码位置在保持严格因果掩码的前提下条件化于所有已观测 token,同时将这些观测 token 物理移至前缀位置而不改变其逻辑顺序;进一步设计流式解码流程,在持续提交高置信度 token 进入左到右前缀的同时维持固定并行工作负载,避免传统块扩散方法中的停等行为(stop-and-wait)。实验证明,WeDLM 在不损失 AR 基线质量的前提下显著提升推理速度,尤其在复杂推理任务上接近 3 倍加速,低熵生成场景下可达 10 倍,且对比对象为使用 vLLM 优化的 AR 引擎,验证了扩散式解码在实践中可超越先进 AR 引擎。

链接: https://arxiv.org/abs/2512.22737
作者: Aiwei Liu,Minghua He,Shaoxun Zeng,Sijun Zhang,Linhao Zhang,Chuhan Wu,Wei Jia,Yuan Liu,Xiao Zhou,Jie Zhou
机构: WeChat AI(微信人工智能); Tencent(腾讯); Peking University(北京大学); Tsinghua University(清华大学)
类目: Computation and Language (cs.CL)
备注: 23 pages, 8 figures, project page: this https URL

点击查看摘要

Abstract:Autoregressive (AR) generation is the standard decoding paradigm for Large Language Models (LLMs), but its token-by-token nature limits parallelism at inference time. Diffusion Language Models (DLLMs) offer parallel decoding by recovering multiple masked tokens per step; however, in practice they often fail to translate this parallelism into deployment speed gains over optimized AR engines (e.g., vLLM). A key reason is that many DLLMs rely on bidirectional attention, which breaks standard prefix KV caching and forces repeated contextualization, undermining efficiency. We propose WeDLM, a diffusion decoding framework built entirely on standard causal attention to make parallel generation prefix-cache friendly. The core idea is to let each masked position condition on all currently observed tokens while keeping a strict causal mask, achieved by Topological Reordering that moves observed tokens to the physical prefix while preserving their logical positions. Building on this property, we introduce a streaming decoding procedure that continuously commits confident tokens into a growing left-to-right prefix and maintains a fixed parallel workload, avoiding the stop-and-wait behavior common in block diffusion methods. Experiments show that WeDLM preserves the quality of strong AR backbones while delivering substantial speedups, approaching 3x on challenging reasoning benchmarks and up to 10x in low-entropy generation regimes; critically, our comparisons are against AR baselines served by vLLM under matched deployment settings, demonstrating that diffusion-style decoding can outperform an optimized AR engine in practice.
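
“拓扑重排序”的核心操作可以用几行 Python 说明:把已观测词元移到物理前缀、掩码槽位排在末尾,同时保留逻辑位置编号作为 position_ids,这样严格因果掩码下每个掩码位置都能看到全部观测词元(仅为示意实现,非官方代码):

```python
def topological_reorder(tokens, observed):
    """返回 (重排后的词元序列, 对应的逻辑 position_ids)。"""
    order = [i for i, o in enumerate(observed) if o] \
          + [i for i, o in enumerate(observed) if not o]
    return [tokens[i] for i in order], order

toks = [11, 0, 22, 0, 33]                      # 0 代表 [MASK]
obs = [True, False, True, False, True]
print(topological_reorder(toks, obs))
# ([11, 22, 33, 0, 0], [0, 2, 4, 1, 3]):观测词元在物理前缀,逻辑位置不变
```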
zh

[NLP-52] Data Augmentation for Classification of Negative Pregnancy Outcomes in Imbalanced Data

【速读】: 该论文旨在解决美国婴儿死亡率居高不下、尤其是由出生缺陷等不良妊娠结局导致的公共卫生问题,现有研究在病因探索和干预策略方面仍存在不足。其解决方案的关键在于利用公开的社会媒体数据(如Twitter)构建自然语言处理(Natural Language Processing, NLP)管道,自动识别孕妇群体并根据其报告的妊娠结局进行分类:将足月妊娠且新生儿体重正常的案例归为阳性组,将出现流产、死胎、出生缺陷或早产等不良结局的案例归为阴性组,从而扩展传统队列研究的数据来源,并为评估特定干预措施对母婴健康的影响提供新框架。

链接: https://arxiv.org/abs/2512.22732
作者: Md Badsha Biswas
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Infant mortality remains a significant public health concern in the United States, with birth defects identified as a leading cause. Despite ongoing efforts to understand the causes of negative pregnancy outcomes like miscarriage, stillbirths, birth defects, and premature birth, there is still a need for more comprehensive research and strategies for intervention. This paper introduces a novel approach that uses publicly available social media data, especially from platforms like Twitter, to enhance current datasets for studying negative pregnancy outcomes through observational research. The inherent challenges in utilizing social media data, including imbalance, noise, and lack of structure, necessitate robust preprocessing techniques and data augmentation strategies. By constructing a natural language processing (NLP) pipeline, we aim to automatically identify women sharing their pregnancy experiences, categorizing them based on reported outcomes. Women reporting full gestation and normal birth weight will be classified as positive cases, while those reporting negative pregnancy outcomes will be identified as negative cases. Furthermore, this study offers potential applications in assessing the causal impact of specific interventions, treatments, or prenatal exposures on maternal and fetal health outcomes. Additionally, it provides a framework for future health studies involving pregnant cohorts and comparator groups. In a broader context, our research showcases the viability of social media data as an adjunctive resource in epidemiological investigations about pregnancy outcomes.
zh

[NLP-53] Mitigating Social Desirability Bias in Random Silicon Sampling

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在模拟人群响应时普遍存在的社会期许偏差(Social Desirability Bias, SDB)问题,即模型倾向于生成符合社会规范的“理想化”答案,从而偏离真实人类调查数据。解决方案的关键在于通过心理学基础的提示词设计(prompt wording),尤其是采用中性、第三人称表述的重构提示(reformulated prompts),有效降低模型对社会可接受答案的集中倾向,使硅样本(silicon samples)的概率分布更贴近真实人类数据(如美国全国选举研究 ANES 数据)。实证结果表明,此类提示框架能显著改善硅样本与真实人类样本之间的分布一致性(以 Jensen-Shannon 散度衡量),为提升基于 LLM 的社会科学研究代表性提供了可行且有效的干预路径。

链接: https://arxiv.org/abs/2512.22725
作者: Sashank Chapala,Maksym Mironov,Songgaojun Deng
机构: Eindhoven University of Technology (埃因霍温理工大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 31 pages, 9 figures, and 24 tables

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used to simulate population responses, a method known as “Silicon Sampling”. However, responses to socially sensitive questions frequently exhibit Social Desirability Bias (SDB), diverging from real human data toward socially acceptable answers. Existing studies on social desirability bias in LLM-based sampling remain limited. In this work, we investigate whether minimal, psychologically grounded prompt wording can mitigate this bias and improve alignment between silicon and human samples. We conducted a study using data from the American National Election Study (ANES) on three LLMs from two model families: the open-source Llama-3.1 series and GPT-4.1-mini. We first replicate a baseline silicon sampling study, confirming the persistent Social Desirability Bias. We then test four prompt-based mitigation methods: reformulated (neutral, third-person phrasing), reverse-coded (semantic inversion), and two meta-instructions, priming and preamble, respectively encouraging analytics and sincerity. Alignment with ANES is evaluated using Jensen-Shannon Divergence with bootstrap confidence intervals. Our results demonstrate that reformulated prompts most effectively improve alignment by reducing distribution concentration on socially acceptable answers and achieving distributions closer to ANES. Reverse-coding produced mixed results across eligible items, while the Priming and Preamble encouraged response uniformity and showed no systematic benefit for bias mitigation. Our findings validate the efficacy of prompt-based framing controls in mitigating inherent Social Desirability Bias in LLMs, providing a practical path toward more representative silicon samples.
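
对齐程度的评估可按如下草图计算:把两组类别型回答转成分布后求 Jensen-Shannon 散度,并用自助法给出置信区间(scipy 的 jensenshannon 返回散度的平方根,故取平方;重采样方案为笔者假设):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def jsd_with_ci(model_ans, human_ans, n_cats, n_boot=1000, seed=0):
    """model_ans/human_ans: 类别编号数组(0..n_cats-1)。"""
    model_ans, human_ans = np.asarray(model_ans), np.asarray(human_ans)
    rng = np.random.default_rng(seed)
    dist = lambda x: np.bincount(x, minlength=n_cats) / len(x)
    point = jensenshannon(dist(model_ans), dist(human_ans)) ** 2  # JS 散度
    boots = [jensenshannon(dist(rng.choice(model_ans, len(model_ans))),
                           dist(rng.choice(human_ans, len(human_ans)))) ** 2
             for _ in range(n_boot)]
    lo, hi = np.percentile(boots, [2.5, 97.5])                    # 95% 置信区间
    return point, (lo, hi)
```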
zh

[NLP-54] Beg to Differ: Understanding Reasoning -Answer Misalignment Across Languages EMNLP

【速读】: 该论文旨在解决多语言大语言模型(Large Language Models, LLMs)在跨语言场景下推理质量是否一致的问题,特别是验证模型生成的推理链(reasoning traces)是否能逻辑上支持其最终结论。现有评估方法往往仅关注任务准确率,忽略了推理过程与结论之间的内在一致性,导致对模型真实推理能力的认知存在盲区。论文的关键解决方案是构建一个基于人工验证的评估框架,通过分析65k条来自GlobalMMLU多语言数据集的推理轨迹,量化不同语言(尤其是非拉丁字母语言)中推理与结论之间的错位程度,并提出一个由人类标注驱动的错误分类体系,识别出主要问题为证据性错误(evidential errors,如 unsupported claims 和 ambiguous facts)和逻辑步骤不严谨两类。这一方法揭示了当前多语言评估实践的局限性,强调需引入以推理为导向的评估机制以更全面地衡量模型能力。

链接: https://arxiv.org/abs/2512.22712
作者: Anaelia Ovalle,Candace Ross,Sebastian Ruder,Adina Williams,Karen Ullrich,Mark Ibrahim,Levent Sagun
机构: Meta Superintelligence Labs
类目: Computation and Language (cs.CL)
备注: Accepted to 2025 EMNLP Multilingual Representation Learning Workshop

点击查看摘要

Abstract:Large language models demonstrate strong reasoning capabilities through chain-of-thought prompting, but whether this reasoning quality transfers across languages remains underexplored. We introduce a human-validated framework to evaluate whether model-generated reasoning traces logically support their conclusions across languages. Analyzing 65k reasoning traces from GlobalMMLU questions across 6 languages and 6 frontier models, we uncover a critical blind spot: while models achieve high task accuracy, their reasoning can fail to support their conclusions. Reasoning traces in non-Latin scripts show at least twice as much misalignment between their reasoning and conclusions than those in Latin scripts. We develop an error taxonomy through human annotation to characterize these failures, finding they stem primarily from evidential errors (unsupported claims, ambiguous facts) followed by illogical reasoning steps. Our findings demonstrate that current multilingual evaluation practices provide an incomplete picture of model reasoning capabilities and highlight the need for reasoning-aware evaluation frameworks.
zh

[NLP-55] GHaLIB: A Multilingual Framework for Hope Speech Detection in Low-Resource Languages

【速读】: 该论文旨在解决希望话语(hope speech)在自然语言处理(Natural Language Processing, NLP)领域中研究不足的问题,特别是针对低资源语言如乌尔都语(Urdu)缺乏相关检测工具的现状。其解决方案的关键在于构建一个面向多语言的希望话语检测框架,利用预训练的跨语言Transformer模型(如XLM-RoBERTa、mBERT、EuroBERT和UrduBERT),结合简单的文本预处理方法训练分类器,从而在乌尔都语上实现高精度的二分类(F1=95.2%)和多分类(F1=65.2%)任务,并在西班牙语、德语和英语等语言中也表现出竞争力,验证了现有多语言模型在低资源环境下的可迁移性和有效性。

链接: https://arxiv.org/abs/2512.22705
作者: Ahmed Abdullah,Sana Fatima,Haroon Mahmood
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted and presented at the 15th International Arab Conference on Information Technology (ICAIT); proceedings not yet published

点击查看摘要

Abstract:Hope speech has been relatively underrepresented in Natural Language Processing (NLP). Current studies are largely focused on English, which has resulted in a lack of resources for low-resource languages such as Urdu. As a result, the creation of tools that facilitate positive online communication remains limited. Although transformer-based architectures have proven to be effective in detecting hate and offensive speech, little has been done to apply them to hope speech or, more generally, to test them across a variety of linguistic settings. This paper presents a multilingual framework for hope speech detection with a focus on Urdu. Using pretrained transformer models such as XLM-RoBERTa, mBERT, EuroBERT, and UrduBERT, we apply simple preprocessing and train classifiers for improved results. Evaluations on the PolyHope-M 2025 benchmark demonstrate strong performance, achieving F1-scores of 95.2% for Urdu binary classification and 65.2% for Urdu multi-class classification, with similarly competitive results in Spanish, German, and English. These results highlight the possibility of implementing existing multilingual models in low-resource environments, thus making it easier to identify hope speech and helping to build a more constructive digital discourse.
zh

[NLP-56] Conformal Prediction Sets for Next-Token Prediction in Large Language Models : Balancing Coverag e Guarantees with Set Efficiency

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在高风险领域部署时缺乏可靠不确定性量化的问题,特别是标准softmax概率常出现校准不足的情况。其核心挑战在于如何在保证预测置信度覆盖(coverage)的前提下,显著提升预测集(prediction set)的效率(即缩小预测范围)。解决方案的关键在于提出一种词汇感知的共形预测方法(Vocabulary-Aware Conformal Prediction, VACP),该方法通过语义掩码(semantic masking)和温度调整评分(temperature-adjusted scoring)有效压缩候选词汇空间,同时理论保障边际覆盖性(marginal coverage),从而实现覆盖-效率权衡的优化。实验表明,在Gemma-2B模型上,VACP在达到90%目标覆盖率的同时,将平均预测集大小从847个token降至4.3个token,效率提升达197倍。

链接: https://arxiv.org/abs/2512.22682
作者: Yoshith Roy Kotla,Varshith Roy Kotla
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 tables and 1 algorithm

点击查看摘要

Abstract:Deploying large language models (LLMs) in high-stakes domains requires rigorous uncertainty quantification, yet standard softmax probabilities are often poorly calibrated. We present a systematic study of Adaptive Prediction Sets (APS) applied to next-token prediction in transformer-based models with large vocabularies (greater than 250,000 tokens). Our central contribution is the identification of a coverage-efficiency tradeoff: while naive conformal prediction achieves valid coverage, it produces prediction sets of hundreds of tokens, rendering them uninformative. We propose Vocabulary-Aware Conformal Prediction (VACP), a framework that leverages semantic masking and temperature-adjusted scoring to reduce the effective prediction space while provably maintaining marginal coverage. Experiments on Gemma-2B using SQUAD and WikiText benchmarks demonstrate that VACP achieves 89.7 percent empirical coverage (90 percent target) while reducing the mean prediction set size from 847 tokens to 4.3 tokens – a 197x improvement in efficiency. We provide a theoretical analysis of vocabulary reduction and release our implementation for reproducibility.
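
APS 的校准与 VACP 式预测集构造可概括为如下 numpy 草图(语义掩码的来源与温度取值为笔者假设):先在校准集上计算非一致性分数的有限样本分位数 qhat,预测时掩掉语义无关词元、做温度调整,再取累计概率达到 qhat 的最小词元集合:

```python
import numpy as np

def aps_score(probs, y):
    """APS 非一致性分数:按概率降序累加,直到包含真实词元 y。"""
    order = np.argsort(-probs)
    return np.cumsum(probs[order])[np.where(order == y)[0][0]]

def calibrate(cal_probs, cal_labels, alpha=0.1):
    scores = [aps_score(p, y) for p, y in zip(cal_probs, cal_labels)]
    n = len(scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)   # 有限样本修正分位数
    return np.quantile(scores, q, method="higher")

def predict_set(logits, qhat, mask, temperature=1.5):
    z = np.where(mask, logits / temperature, -np.inf)  # 语义掩码 + 温度调整
    p = np.exp(z - z.max()); p /= p.sum()
    order = np.argsort(-p)
    k = np.searchsorted(np.cumsum(p[order]), qhat) + 1
    return order[:k]                                   # 满足覆盖阈值的最小集合
```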

[NLP-57] Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2

[Quick Read]: This paper challenges the common assumption in model compression that pruning degrades all cognitive capabilities uniformly. Using structured width pruning of GLU-MLP layers guided by the Maximum Absolute Weight (MAW) criterion, the authors show that capabilities respond very unevenly: parametric-knowledge tasks (e.g., MMLU, GSM8K) degrade markedly, while instruction-following (IFEval) improves by 46%-75% and multi-step reasoning (MUSR) stays robust. The key insight is that the expansion ratio acts as a critical architectural parameter that selectively modulates cognitive capabilities: reducing parameters can actually strengthen behavioral alignment, and factual knowledge is strongly anti-correlated with truthfulness (r = -0.864). This is the first systematic characterization of this selective-preservation phenomenon.

Link: https://arxiv.org/abs/2512.22671
Authors: Pere Martra
Institutions: Universidad Internacional Menéndez Pelayo
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Remarks: 23 pages, 5 figures, 9 tables. Code available at this https URL

Abstract:Structured width pruning of GLU-MLP layers, guided by the Maximum Absolute Weight (MAW) criterion, reveals a systematic dichotomy in how reducing the expansion ratio affects different model capabilities. While performance on tasks relying on parametric knowledge (e.g., MMLU, GSM8K) and perplexity metrics degrades predictably, instruction-following capabilities improve substantially (+46% to +75% in IFEval for Llama-3.2-1B and 3B models), and multi-step reasoning remains robust (MUSR). This pattern challenges the prevailing assumption that pruning induces uniform degradation. We evaluated seven expansion ratio configurations using comprehensive benchmarks assessing factual knowledge, mathematical reasoning, language comprehension, instruction-following, and truthfulness. Our analysis identifies the expansion ratio as a critical architectural parameter that selectively modulates cognitive capabilities, rather than merely serving as a compression metric. We provide the first systematic characterization of this selective preservation phenomenon. Notably, we document a robust inverse correlation (r = -0.864, p = 0.012 in Llama-3B) between factual knowledge capacity (MMLU) and truthfulness metrics (TruthfulQA-MC2): as knowledge degrades, the model’s ability to discriminate misconceptions improves consistently. This connects two previously distinct research areas, demonstrating that MAW-guided width pruning acts as a selective filter, reducing parametric knowledge while preserving or enhancing behavioral alignment. Additionally, we quantify context-dependent efficiency trade-offs: pruned configurations achieve up to 23% reduction in energy consumption (J/token) but incur penalties in single-request latency, whereas batch processing workloads benefit uniformly.
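
A minimal Python sketch of MAW-guided width pruning for a HuggingFace-style GLU-MLP block (gate/up/down projections) follows; how the three matrices are aggregated into one channel score is our assumption, and a full implementation would also update each module's in/out-feature metadata.

```python
# A sketch of Maximum-Absolute-Weight (MAW) width pruning for a GLU-MLP.
import torch

@torch.no_grad()
def maw_prune_glu(mlp, keep_ratio: float):
    d_ff = mlp.gate_proj.out_features
    # Channel importance: the largest |weight| touching each hidden unit.
    imp = torch.stack([
        mlp.gate_proj.weight.abs().amax(dim=1),   # [d_ff] per output row
        mlp.up_proj.weight.abs().amax(dim=1),     # [d_ff] per output row
        mlp.down_proj.weight.abs().amax(dim=0),   # [d_ff] per input column
    ]).amax(dim=0)
    keep = imp.topk(int(d_ff * keep_ratio)).indices.sort().values
    # Slice the hidden dimension; module metadata would also need updating.
    mlp.gate_proj.weight.data = mlp.gate_proj.weight.data[keep]
    mlp.up_proj.weight.data = mlp.up_proj.weight.data[keep]
    mlp.down_proj.weight.data = mlp.down_proj.weight.data[:, keep]
    return keep
```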

[NLP-58] Scaling Unverifiable Rewards: A Case Study on Visual Insights

[Quick Read]: This paper addresses judge drift in multi-stage tasks that lack verifiable reward signals or enough data to train robust reward models, a setting in which conventional time-based test-time scaling (TTS) accumulates errors across refinement rounds. The key idea, Selective TTS, is to allocate compute across the stages of a multi-agent pipeline and use stage-specific judges to prune low-quality branches early, avoiding the error propagation of repeated global refinement and yielding more stable reasoning and higher-quality outputs.

Link: https://arxiv.org/abs/2512.22650
Authors: Shuyu Gan, James Mooney, Pan Hao, Renxiang Wang, Mingyi Hong, Qianwen Wang, Dongyeop Kang
Institutions: University of Minnesota
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Remarks: 32 pages, 25 figures

Abstract:Large Language Model (LLM) agents can increasingly automate complex reasoning through Test-Time Scaling (TTS), iterative refinement guided by reward signals. However, many real-world tasks involve multi-stage pipelines whose final outcomes lack verifiable rewards or sufficient data to train robust reward models, making judge-based refinement prone to accumulating error over stages. We propose Selective TTS, a process-based refinement framework that scales inference across the stages of a multi-agent pipeline, rather than refining repeatedly over time as in prior work. By distributing compute across stages and pruning low-quality branches early using process-specific judges, Selective TTS mitigates judge drift and stabilizes refinement. Grounded in the data science pipeline, we build an end-to-end multi-agent pipeline for generating visually insightful charts and reports for a given dataset, and design a reliable LLM-based judge model, aligned with human experts (Kendall’s \tau=0.55). Our proposed Selective TTS then improves insight quality under a fixed compute budget, increasing mean scores from 61.64 to 65.86 while reducing variance. We hope our findings serve as a first step toward scaling complex, open-ended tasks with unverifiable rewards, such as scientific discovery and story generation.

[NLP-59] Evaluating GRPO and DPO for Faithful Chain-of-Thought Reasoning in LLMs

[Quick Read]: This paper addresses the faithfulness problem in chain-of-thought (CoT) reasoning: large language models (LLMs) can produce plausible-looking explanations that diverge from their actual reasoning, undermining their reliability for safety supervision and alignment monitoring. The key contribution is a head-to-head comparison of two optimization methods, Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO), for improving CoT faithfulness. GRPO proves stronger on larger models, with Qwen2.5-14B-Instruct achieving the best overall results, suggesting GRPO is a promising route to more transparent and trustworthy reasoning.

Link: https://arxiv.org/abs/2512.22631
Authors: Hadi Mohammadi, Tamas Kozak, Anastasia Giachanou
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Remarks:

Abstract:Chain-of-thought (CoT) reasoning has emerged as a powerful technique for improving the problem-solving capabilities of large language models (LLMs), particularly for tasks requiring multi-step reasoning. However, recent studies show that CoT explanations often fail to reflect the model’s actual reasoning process, as models may produce coherent yet misleading justifications or modify answers without acknowledging external cues. Such discrepancies undermine the reliability of CoT-based methods for safety supervision and alignment monitoring, as models can generate plausible but deceptive rationales for incorrect answers. To better understand this limitation, we evaluate two optimization methods, Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO), in their ability to improve CoT faithfulness. Our experiments show that GRPO achieves higher performance than DPO in larger models, with the Qwen2.5-14B-Instruct model attaining the best results across all evaluation metrics. Both approaches exhibit positive correlations between model size and performance, but GRPO shows greater potential for improving faithfulness metrics, albeit with less stable behavior at smaller scales. These results suggest that GRPO offers a promising direction for developing more transparent and trustworthy reasoning in LLMs.
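
The group-relative advantage at the heart of GRPO translates directly into code; the sketch below shows the within-group normalization and a clipped policy loss at sequence level, with any faithfulness-specific reward left as a placeholder rather than the paper's exact design.

```python
# A minimal sketch of GRPO's group-relative advantage and clipped loss.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: [G] scores for G sampled completions of the same prompt.
    Normalizing within the group removes the need for a value network."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_policy_loss(logp_new, logp_old, adv, clip: float = 0.2):
    """PPO-style clipped objective using sequence-level log-probs [G]."""
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * adv
    clipped = ratio.clamp(1 - clip, 1 + clip) * adv
    return -torch.min(unclipped, clipped).mean()
```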

[NLP-60] On the Role of Discreteness in Diffusion LLMs

[Quick Read]: This paper examines the structural mismatch between diffusion models and language generation: text is discrete and highly structured, and the two existing families, continuous diffusion in embedding space and discrete diffusion over tokens, each satisfy only part of the five properties the authors identify as essential for coupling diffusion mechanics with language modeling. Two central issues emerge: (i) uniform corruption ignores how unevenly information is distributed across positions, and (ii) token-wise marginal training cannot capture multi-token dependencies under parallel decoding. The paper therefore argues for diffusion processes that align more closely with the structure of text, as a foundation for more coherent diffusion language models.

Link: https://arxiv.org/abs/2512.22630
Authors: Ziqi Jin, Bin Wang, Xiang Lin, Lidong Bing, Aixin Sun
Institutions: MiroMind AI; Nanyang Technological University, Singapore
Categories: Computation and Language (cs.CL)
Remarks:

Abstract:Diffusion models offer appealing properties for language generation, such as parallel decoding and iterative refinement, but the discrete and highly structured nature of text challenges the direct application of diffusion principles. In this paper, we revisit diffusion language modeling from the view of diffusion process and language modeling, and outline five properties that separate diffusion mechanics from language-specific requirements. We first categorize existing approaches into continuous diffusion in embedding space and discrete diffusion over tokens. We then show that each satisfies only part of the five essential properties and therefore reflects a structural trade-off. Through analyses of recent large diffusion language models, we identify two central issues: (i) uniform corruption does not respect how information is distributed across positions, and (ii) token-wise marginal training cannot capture multi-token dependencies during parallel decoding. These observations motivate diffusion processes that align more closely with the structure of text, and encourage future work toward more coherent diffusion language models.

[NLP-61] M2G-Eval: Enhancing and Evaluating Multi-granularity Multilingual Code Generation

[Quick Read]: This paper tackles the limitations of current evaluation for code large language models (Code LLMs): existing benchmarks assess models at a single structural granularity and cover few programming languages, obscuring fine-grained capability differences across code scopes (class, function, block, line) and multilingual settings. The solution is M2G-Eval, a multi-granularity, multilingual evaluation framework spanning four levels (Class, Function, Block, Line) and 18 programming languages, with over 17,000 training tasks and 1,286 human-annotated, contamination-controlled test instances. Evaluating 30 models (28 frontier LLMs plus the authors' two M2G-Eval-Coder variants) reveals a clear difficulty hierarchy, widening performance gaps between full- and partial-granularity languages as complexity grows, and strong cross-language correlations, providing a new tool for fine-grained diagnosis of code-generation capability and highlighting the challenge of complex, long-form code synthesis.

Link: https://arxiv.org/abs/2512.22628
Authors: Fanglin Xu, Wei Zhang, Jian Yang, Guo Chen, Aishan Liu, Zhoujun Li, Xianglong Liu, Bryan Dai
Institutions: Beihang University; Hunan University; Ubiquant
Categories: Computation and Language (cs.CL)
Remarks:

Abstract:The rapid advancement of code large language models (LLMs) has sparked significant research interest in systematically evaluating their code generation capabilities, yet existing benchmarks predominantly assess models at a single structural granularity and focus on limited programming languages, obscuring fine-grained capability variations across different code scopes and multilingual scenarios. We introduce M2G-Eval, a multi-granularity, multilingual framework for evaluating code generation in large language models (LLMs) across four levels: Class, Function, Block, and Line. Spanning 18 programming languages, M2G-Eval includes 17K+ training tasks and 1,286 human-annotated, contamination-controlled test instances. We develop M2G-Eval-Coder models by training Qwen3-8B with supervised fine-tuning and Group Relative Policy Optimization. Evaluating 30 models (28 state-of-the-art LLMs plus our two M2G-Eval-Coder variants) reveals three main findings: (1) an apparent difficulty hierarchy, with Line-level tasks easiest and Class-level most challenging; (2) widening performance gaps between full- and partial-granularity languages as task complexity increases; and (3) strong cross-language correlations, suggesting that models learn transferable programming concepts. M2G-Eval enables fine-grained diagnosis of code generation capabilities and highlights persistent challenges in synthesizing complex, long-form code.

[NLP-62] Chain-of-thought Reviewing and Correction for Time Series Question Answering

[Quick Read]: This paper addresses the reasoning errors that arise when large language models (LLMs) handle complex numerical sequences with generic NLP techniques, overlooking the fact that time series are inherently verifiable against the input. The solution, T3LLM, coordinates three LLMs: a worker that generates structured step-wise reasoning, a reviewer that identifies and corrects erroneous steps in the chain of thought (CoT), and a student that is fine-tuned on the corrected chains to internalize multi-step reasoning and self-correction. This three-stage design with an explicit correction mechanism substantially improves accuracy and robustness on time series question answering (TSQA).

Link: https://arxiv.org/abs/2512.22627
Authors: Chen Su, Yuanhe Tian, Yan Song
Institutions: University of Science and Technology of China; Zhongguancun Institute of Artificial Intelligence
Categories: Computation and Language (cs.CL)
Remarks:

Abstract:With the advancement of large language models (LLMs), diverse time series analysis tasks are reformulated as time series question answering (TSQA) through a unified natural language interface. However, existing LLM-based approaches largely adopt general natural language processing techniques and are prone to reasoning errors when handling complex numerical sequences. Different from purely textual tasks, time series data are inherently verifiable, enabling consistency checking between reasoning steps and the original input. Motivated by this property, we propose T3LLM, which performs multi-step reasoning with an explicit correction mechanism for time series question answering. The T3LLM framework consists of three LLMs, namely, a worker, a reviewer, and a student, that are responsible for generation, review, and reasoning learning, respectively. Within this framework, the worker generates step-wise chains of thought (CoT) under structured prompts, while the reviewer inspects the reasoning, identifies erroneous steps, and provides corrective comments. The collaboratively generated corrected CoTs are used to fine-tune the student model, internalizing multi-step reasoning and self-correction into its parameters. Experiments on multiple real-world TSQA benchmarks demonstrate that T3LLM achieves state-of-the-art performance over strong LLM-based baselines.

[NLP-63] Dream-VL & Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone

[Quick Read]: This paper addresses the efficiency limits of autoregressive vision-language models (AR-VLMs) on complex visual planning and dynamic robot control, where sequential generation is a bottleneck. The key move is to build a large vision-language model on a diffusion language model backbone: Dream-VL exploits the natively bidirectional nature of diffusion, achieving state-of-the-art results among diffusion-based VLMs (dVLMs) while matching top-tier AR-based VLMs trained on open data. Building on Dream-VL, continuous pre-training on open robotic datasets yields Dream-VLA, a vision-language-action (VLA) model whose backbone is inherently suited to action chunking and parallel generation, converging markedly faster in downstream fine-tuning and surpassing leading models such as π₀ and GR00T-N1 across robot benchmarks.

Link: https://arxiv.org/abs/2512.22615
Authors: Jiacheng Ye, Shansan Gong, Jiahui Gao, Junming Fan, Shuang Wu, Wei Bi, Haoli Bai, Lifeng Shang, Lingpeng Kong
Institutions: The University of Hong Kong; Huawei Technologies
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Remarks:

Abstract:While autoregressive Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the potential of constructing Vision-Language Models upon diffusion-based large language models (dLLMs) to overcome these limitations. We introduce Dream-VL, an open diffusion-based VLM (dVLM) that achieves state-of-the-art performance among previous dVLMs. Dream-VL is comparable to top-tier AR-based VLMs trained on open data on various benchmarks but exhibits superior potential when applied to visual planning tasks. Building upon Dream-VL, we introduce Dream-VLA, a dLLM-based Vision-Language-Action model (dVLA) developed through continuous pre-training on open robotic datasets. We demonstrate that the natively bidirectional nature of this diffusion backbone serves as a superior foundation for VLA tasks, inherently suited for action chunking and parallel generation, leading to significantly faster convergence in downstream fine-tuning. Dream-VLA achieves top-tier performance of 97.2% average success rate on LIBERO, 71.4% overall average on SimplerEnv-Bridge, and 60.5% overall average on SimplerEnv-Fractal, surpassing leading models such as \pi_0 and GR00T-N1. We also validate that dVLMs surpass AR baselines on downstream tasks across different training objectives. We release both Dream-VL and Dream-VLA to facilitate further research in the community.

[NLP-64] Structured Prompting and LLM Ensembling for Multimodal Conversational Aspect-based Sentiment Analysis

[Quick Read]: This paper tackles multimodal conversational sentiment analysis through two subtasks: extracting complete sentiment sextuples (holder, target, aspect, opinion, sentiment, rationale) from multi-speaker dialogues, and detecting sentiment flipping, i.e., dynamic sentiment shifts and their triggers. The keys to the solution: for Subtask-I, a structured prompting pipeline guides large language models (LLMs) to extract sentiment components step by step with refined contextual understanding; for Subtask-II, an ensemble of three LLMs combines complementary strengths to robustly identify transitions and their triggers. The system scores 47.38% on average for Subtask-I and 74.12% exact-match F1 for Subtask-II, confirming the value of step-wise refinement and ensembling on rich, multimodal sentiment tasks.

Link: https://arxiv.org/abs/2512.22603
Authors: Zhiqiang Gao, Shihao Gao, Zixing Zhang, Yihao Guo, Hongyu Chen, Jing Han
Institutions: Hunan University; University of Cambridge
Categories: Computation and Language (cs.CL)
Remarks:

Abstract:Understanding sentiment in multimodal conversations is a complex yet crucial challenge toward building emotionally intelligent AI systems. The Multimodal Conversational Aspect-based Sentiment Analysis (MCABSA) Challenge invited participants to tackle two demanding subtasks: (1) extracting a comprehensive sentiment sextuple, including holder, target, aspect, opinion, sentiment, and rationale from multi-speaker dialogues, and (2) detecting sentiment flipping, which detects dynamic sentiment shifts and their underlying triggers. For Subtask-I, in the present paper, we designed a structured prompting pipeline that guided large language models (LLMs) to sequentially extract sentiment components with refined contextual understanding. For Subtask-II, we further leveraged the complementary strengths of three LLMs through ensembling to robustly identify sentiment transitions and their triggers. Our system achieved a 47.38% average score on Subtask-I and a 74.12% exact match F1 on Subtask-II, showing the effectiveness of step-wise refinement and ensemble strategies in rich, multimodal sentiment analysis tasks.

[NLP-65] Learning When Not to Attend Globally

[Quick Read]: This paper targets the computational redundancy of global attention when large language models (LLMs) process long texts: most tokens do not need the full context. The solution, All-or-Here Attention (AHA), attaches a binary router to each attention head that decides, token by token, between full global attention and local sliding-window attention. With a 256-token window, up to 93% of the original full-attention operations can be replaced by local windows with no performance loss, and the need for global attention decays with a long tail as the window grows, showing that full attention is largely redundant and that efficient inference only requires on-demand access to global context.

Link: https://arxiv.org/abs/2512.22562
Authors: Xuan Luo, Kailai Zhang, Xifeng Yan
Institutions: UC Santa Barbara
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Remarks:

Abstract:When reading books, humans focus primarily on the current page, flipping back to recap prior context only when necessary. Similarly, we demonstrate that Large Language Models (LLMs) can learn to dynamically determine when to attend to global context. We propose All-or-Here Attention (AHA), which utilizes a binary router per attention head to dynamically toggle between full attention and local sliding window attention for each token. Our results indicate that with a window size of 256 tokens, up to 93% of the original full attention operations can be replaced by sliding window attention without performance loss. Furthermore, by evaluating AHA across various window sizes, we identify a long-tail distribution in context dependency, where the necessity for full attention decays rapidly as the local window expands. By decoupling local processing from global access, AHA reveals that full attention is largely redundant, and that efficient inference requires only on-demand access to the global context.
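
A minimal single-head sketch of the all-or-here toggle follows: a per-token router output selects between a causal full mask and a causal sliding-window mask. The hard threshold and the router parametrization are assumptions about one way to realize the mechanism, not the paper's exact design.

```python
# A sketch of All-or-Here Attention (AHA) for one head.
import torch
import torch.nn.functional as F

def aha_head(q, k, v, router_logits, window: int = 256):
    """q, k, v: [T, d]; router_logits: [T], > 0 means attend globally."""
    T, d = q.shape
    scores = q @ k.T / d ** 0.5                       # [T, T]
    i = torch.arange(T)
    causal = i[None, :] <= i[:, None]                 # no future tokens
    local = causal & (i[:, None] - i[None, :] < window)
    use_full = (router_logits > 0)[:, None]           # per-token routing bit
    mask = torch.where(use_full, causal, local)       # full vs. "here" window
    attn = F.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
    return attn @ v
```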

[NLP-66] ManchuTTS: Towards High-Quality Manchu Speech Synthesis via Flow Matching and Hierarchical Text Representation

[Quick Read]: This paper tackles Manchu speech synthesis under severe data scarcity and strong phonological agglutination. The keys to the solution: a three-tier text representation (phoneme, syllable, prosody) with a cross-modal hierarchical attention mechanism for multi-granular alignment; a non-autoregressive generator combining deep convolutional networks with a flow-matching Transformer for efficiency; a hierarchical contrastive loss that guides structured acoustic-linguistic correspondence; and the first Manchu TTS dataset, together with a data augmentation strategy, to cope with low-resource constraints. Trained on a 5.2-hour subset, the system reaches a MOS of 4.52, clearly beating all baselines, with ablations showing a 31% gain in agglutinative word pronunciation accuracy (AWPA) and a 27% gain in prosodic naturalness.

Link: https://arxiv.org/abs/2512.22491
Authors: Suhua Wang, Zifan Wang, Xiaoxin Sun, D. J. Wang, Zhanbo Liu, Xin Li
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Remarks:

Abstract:As an endangered language, Manchu presents unique challenges for speech synthesis, including severe data scarcity and strong phonological agglutination. This paper proposes ManchuTTS (Manchu Text-to-Speech), a novel approach tailored to Manchu’s linguistic characteristics. To handle agglutination, this method designs a three-tier text representation (phoneme, syllable, prosodic) and a cross-modal hierarchical attention mechanism for multi-granular alignment. The synthesis model integrates deep convolutional networks with a flow-matching Transformer, enabling efficient, non-autoregressive generation. This method further introduces a hierarchical contrastive loss to guide structured acoustic-linguistic correspondence. To address low-resource constraints, this method constructs the first Manchu TTS dataset and employs a data augmentation strategy. Experiments demonstrate that ManchuTTS attains a MOS of 4.52 using a 5.2-hour training subset derived from our full 6.24-hour annotated corpus, outperforming all baseline models by a notable margin. Ablations confirm hierarchical guidance improves agglutinative word pronunciation accuracy (AWPA) by 31% and prosodic naturalness by 27%.

[NLP-67] Constituency Structure over Eojeol in Korean Treebanks

[Quick Read]: This paper addresses the choice of terminal units in Korean constituency treebanks: because Korean words are morphologically complex, treating morphemes as terminals conflates word-internal morphology with phrase-level syntactic structure and creates mismatches with eojeol-based dependency resources. The proposal is an eojeol-based constituency representation in which morphological segmentation and fine-grained part-of-speech information are encoded in a separate, non-constituent layer, keeping the constituency structure interpretable. The key result is that, under explicit normalization assumptions, the Sejong and Penn Korean treebanks are representationally equivalent at the eojeol-based constituency level, which motivates an annotation scheme supporting cross-treebank comparison and constituency-dependency conversion.

Link: https://arxiv.org/abs/2512.22487
Authors: Jungyeul Park, Chulwoo Park
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Remarks:

Abstract:The design of Korean constituency treebanks raises a fundamental representational question concerning the choice of terminal units. Although Korean words are morphologically complex, treating morphemes as constituency terminals conflates word-internal morphology with phrase-level syntactic structure and creates mismatches with eojeol-based dependency resources. This paper argues for an eojeol-based constituency representation, with morphological segmentation and fine-grained part-of-speech information encoded in a separate, non-constituent layer. A comparative analysis shows that, under explicit normalization assumptions, the Sejong and Penn Korean treebanks can be treated as representationally equivalent at the eojeol-based constituency level. Building on this result, we outline an eojeol-based annotation scheme that preserves interpretable constituency and supports cross-treebank comparison and constituency-dependency conversion.

[NLP-68] AFA-LoRA: Enabling Non-Linear Adaptations in LoRA with Activation Function Annealing

[Quick Read]: This paper targets the limited expressivity of Low-Rank Adaptation (LoRA), whose purely linear adapters leave a gap between linear and non-linear training. The key to the solution, AFA-LoRA, is a novel training strategy built on an annealed activation function: the adapter starts as a non-linear transformation for stronger representational capacity and gradually transitions to a linear one during training, so that it converges to a form that can still be merged seamlessly into the base weights. Experiments on supervised fine-tuning, reinforcement learning, and speculative decoding show that AFA-LoRA narrows the performance gap between LoRA and full-parameter training.

Link: https://arxiv.org/abs/2512.22455
Authors: Jiacheng Li, Jianchao Tan, Zhidong Yang, Feiye Huo, Yerui Sun, Yuchen Xie, Xunliang Cai
Institutions: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Remarks:

Abstract:Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method. However, its linear adaptation process limits its expressive power. This means there is a gap between the expressive power of linear training and non-linear training. To bridge this gap, we propose AFA-LoRA, a novel training strategy that brings non-linear expressivity to LoRA while maintaining its seamless mergeability. Our key innovation is an annealed activation function that transitions from a non-linear to a linear transformation during training, allowing the adapter to initially adopt stronger representational capabilities before converging to a mergeable linear form. We implement our method on supervised fine-tuning, reinforcement learning, and speculative decoding. The results show that AFA-LoRA reduces the performance gap between LoRA and full-parameter training. This work enables a more powerful and practical paradigm of parameter-efficient adaptation.
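
A minimal sketch of an annealed-activation adapter follows: a schedule lam in [0, 1] interpolates a tanh non-linearity between the A and B projections toward identity, so the final adapter is linear and mergeable as W += B @ A. The tanh choice and the linear interpolation form are assumptions, not the paper's exact annealing.

```python
# A sketch of a LoRA adapter whose activation anneals to linear.
import torch
import torch.nn as nn

class AnnealedLoRA(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)          # frozen backbone layer
        self.A = nn.Linear(base.in_features, rank, bias=False)
        self.B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)                   # adapter starts as no-op
        self.lam = 0.0  # schedule externally: 0 = non-linear, 1 = linear

    def forward(self, x):
        h = self.A(x)
        h = (1 - self.lam) * torch.tanh(h) + self.lam * h  # annealed activation
        return self.base(x) + self.B(h)

    @torch.no_grad()
    def merge(self):
        assert self.lam == 1.0, "merge only once the adapter is linear"
        self.base.weight += self.B.weight @ self.A.weight  # [out, in]
```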

[NLP-69] Exploring the Vertical-Domain Reasoning Capabilities of Large Language Models

[Quick Read]: This paper asks how well large language models (LLMs) reason in the accounting domain and how that ability should be assessed. The key to the solution is introducing the concept of vertical-domain accounting reasoning and deriving quantifiable evaluation criteria from the training-data characteristics of the GLM model series, providing both a framework for subsequent research and benchmarks for improving accounting reasoning. Evaluating representative models (GLM-6B, GLM-130B, GLM-4, and GPT-4) shows that different prompt-engineering strategies yield varying gains across models, with GPT-4 the strongest; yet current LLMs still fall short of enterprise-level accounting deployment, pointing to the optimization needed to realize their practical value.

Link: https://arxiv.org/abs/2512.22443
Authors: Jie Zhou, Xin Chen, Jie Zhang, Zhe Li
Institutions: Jiangsu Ocean University
Categories: Computation and Language (cs.CL)
Remarks:

Abstract:Large Language Models (LLMs) are reshaping learning paradigms, cognitive processes, and research methodologies across a wide range of domains. Integrating LLMs with professional fields and redefining the relationship between LLMs and domain-specific applications has become a critical challenge for promoting enterprise digital transformation and broader social development. To effectively integrate LLMs into the accounting domain, it is essential to understand their domain-specific reasoning capabilities. This study introduces the concept of vertical-domain accounting reasoning and establishes evaluation criteria by analyzing the training data characteristics of representative GLM-series models. These criteria provide a foundation for subsequent research on reasoning paradigms and offer benchmarks for improving accounting reasoning performance. Based on this framework, we evaluate several representative models, including GLM-6B, GLM-130B, GLM-4, and OpenAI GPT-4, on a set of accounting reasoning tasks. Experimental results show that different prompt engineering strategies lead to varying degrees of performance improvement across models, with GPT-4 achieving the strongest accounting reasoning capability. However, current LLMs still fall short of real-world application requirements. In particular, further optimization is needed for deployment in enterprise-level accounting scenarios to fully realize the potential value of LLMs in this domain.

[NLP-70] HiFi-RAG: Hierarchical Content Filtering and Two-Pass Generation for Open-Domain RAG (NeurIPS 2025)

[Quick Read]: This paper addresses two bottlenecks of open-domain retrieval-augmented generation (RAG): irrelevant information in retrieved documents and misalignment between generated answers and user intent. The key to the solution, HiFi-RAG (Hierarchical Filtering RAG), is a multi-stage pipeline that achieves efficient and precise filtering and generation: the cheaper, faster Gemini 2.5 Flash handles query formulation, hierarchical content filtering, and citation attribution, while the stronger Gemini 2.5 Pro is reserved for final answer generation, improving efficiency and accuracy without sacrificing output quality.

Link: https://arxiv.org/abs/2512.22442
Authors: Cattalyya Nuengsigkapian
Institutions: Google
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Remarks: A winning solution for the NeurIPS 2025 MMU-RAGent Competition (Closed-Source Text-to-Text Static Evaluation)

Abstract:Retrieval-Augmented Generation (RAG) in open-domain settings faces significant challenges regarding irrelevant information in retrieved documents and the alignment of generated answers with user intent. We present HiFi-RAG (Hierarchical Filtering RAG), the winning closed-source system in the Text-to-Text static evaluation of the MMU-RAGent NeurIPS 2025 Competition. Our approach moves beyond standard embedding-based retrieval via a multi-stage pipeline. We leverage the speed and cost-efficiency of Gemini 2.5 Flash (4-6x cheaper than Pro) for query formulation, hierarchical content filtering, and citation attribution, while reserving the reasoning capabilities of Gemini 2.5 Pro for final answer generation. On the MMU-RAGent validation set, our system outperformed the baseline, improving ROUGE-L to 0.274 (+19.6%) and DeBERTaScore to 0.677 (+6.2%). On Test2025, our custom dataset evaluating questions that require post-cutoff knowledge (post January 2025), HiFi-RAG outperforms the parametric baseline by 57.4% in ROUGE-L and 14.9% in DeBERTaScore.

[NLP-71] Monadic Context Engineering

[Quick Read]: This paper targets the brittleness of current LLM-based agent architectures, which are typically built from imperative, ad hoc patterns and consequently struggle with state management, robust error handling, and concurrency. The key to the solution, Monadic Context Engineering (MCE), is to use the algebraic structures of Functors, Applicative Functors, and Monads to model agent workflows as computational contexts that intrinsically manage cross-cutting concerns such as state propagation, short-circuiting error handling, and asynchronous execution. Monads enable robust sequential composition, Applicatives provide a principled structure for parallel execution, and Monad Transformers compose these capabilities systematically, letting developers assemble complex, resilient, and efficient AI agents from simple, independently verifiable components.

Link: https://arxiv.org/abs/2512.22431
Authors: Yifan Zhang, Mengdi Wang
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
Remarks: Project Page: this https URL

Abstract:The proliferation of Large Language Models (LLMs) has catalyzed a shift towards autonomous agents capable of complex reasoning and tool use. However, current agent architectures are frequently constructed using imperative, ad hoc patterns. This results in brittle systems plagued by difficulties in state management, error handling, and concurrency. This paper introduces Monadic Context Engineering (MCE), a novel architectural paradigm leveraging the algebraic structures of Functors, Applicative Functors, and Monads to provide a formal foundation for agent design. MCE treats agent workflows as computational contexts where cross-cutting concerns, such as state propagation, short-circuiting error handling, and asynchronous execution, are managed intrinsically by the algebraic properties of the abstraction. We demonstrate how Monads enable robust sequential composition, how Applicatives provide a principled structure for parallel execution, and crucially, how Monad Transformers allow for the systematic composition of these capabilities. This layered approach enables developers to construct complex, resilient, and efficient AI agents from simple, independently verifiable components. We further extend this framework to describe Meta-Agents, which leverage MCE for generative orchestration, dynamically creating and managing sub-agent workflows through metaprogramming. Project Page: this https URL.
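
The sequential-composition idea translates into a small Python sketch: a Result context threads values between agent steps and short-circuits on error, so steps compose with bind instead of nested try/except. The names and the toy steps are illustrative, not the paper's API.

```python
# A minimal monadic pipeline: values flow through bind; errors short-circuit.
from dataclasses import dataclass
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")
U = TypeVar("U")

@dataclass
class Result(Generic[T]):
    value: Optional[T] = None
    error: Optional[str] = None

    def bind(self, f: Callable[[T], "Result[U]"]) -> "Result[U]":
        if self.error is not None:      # short-circuit: later steps are skipped
            return Result(error=self.error)
        return f(self.value)

def parse_query(text: str) -> Result[dict]:
    return Result(value={"query": text}) if text else Result(error="empty input")

def call_tool(plan: dict) -> Result[str]:
    return Result(value=f"searched: {plan['query']}")

# Sequential composition: a failure anywhere propagates without explicit checks.
out = parse_query("weather in Paris").bind(call_tool)
print(out.value or out.error)
```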

[NLP-72] Hallucination Detection and Evaluation of Large Language Model

[Quick Read]: This paper addresses hallucination in large language models (LLMs), which produces misleading or unverifiable content and undermines trust, noting that existing evaluators such as KnowHalu rely on costly multi-stage verification. The key move is to integrate the Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework independent of LLM judgments, which cuts evaluation time from 8 hours to 10 minutes while keeping detection accuracy high: HHEM with non-fabrication checking reaches the best accuracy (82.2%) and true positive rate (78.9%). Because HHEM struggles with localized hallucinations in summarization, the authors add segment-based retrieval that verifies smaller text components, and a cumulative distribution function (CDF) analysis relates model scale to hallucination frequency: 7B-9B models are generally more stable, while intermediate sizes show higher instability. Overall, HHEM offers an efficient, structured path to assessing LLM factual accuracy.

Link: https://arxiv.org/abs/2512.22416
Authors: Chenggong Zhang, Haopeng Wang
Institutions: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Remarks:

Abstract:Hallucinations in Large Language Models (LLMs) pose a significant challenge, generating misleading or unverifiable content that undermines trust and reliability. Existing evaluation methods, such as KnowHalu, employ multi-stage verification but suffer from high computational costs. To address this, we integrate the Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework that operates independently of LLM-based judgments, significantly improving efficiency while maintaining high detection accuracy. We conduct a comparative analysis of hallucination detection methods across various LLMs, evaluating True Positive Rate (TPR), True Negative Rate (TNR), and Accuracy on question-answering (QA) and summarization tasks. Our results show that HHEM reduces evaluation time from 8 hours to 10 minutes, while HHEM with non-fabrication checking achieves the highest accuracy (82.2%) and TPR (78.9%). However, HHEM struggles with localized hallucinations in summarization tasks. To address this, we introduce segment-based retrieval, improving detection by verifying smaller text components. Additionally, our cumulative distribution function (CDF) analysis indicates that larger models (7B-9B parameters) generally exhibit fewer hallucinations, while intermediate-sized models show higher instability. These findings highlight the need for structured evaluation frameworks that balance computational efficiency with robust factual validation, enhancing the reliability of LLM-generated content.
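
A minimal sketch of the segment-based idea: score each summary sentence against the sources so that localized hallucinations are not averaged away over the whole text. Here consistency_score stands in for a classifier such as HHEM, and the sentence splitter and threshold are assumptions for illustration.

```python
# Segment-level verification: flag sentences unsupported by any source.
import re
from typing import Callable, List, Tuple

def verify_by_segment(summary: str, sources: List[str],
                      consistency_score: Callable[[str, str], float],
                      threshold: float = 0.5) -> List[Tuple[str, float, bool]]:
    """Return (segment, best_score, flagged) for each summary sentence."""
    segments = [s for s in re.split(r"(?<=[.!?])\s+", summary) if s]
    report = []
    for seg in segments:
        best = max(consistency_score(src, seg) for src in sources)
        report.append((seg, best, best < threshold))  # flagged if unsupported
    return report
```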

[NLP-73] LLM-Guided Exemplar Selection for Few-Shot Wearable-Sensor Human Activity Recognition

[Quick Read]: This paper addresses two weaknesses of state-of-the-art human activity recognition (HAR) in few-shot settings: reliance on large labeled datasets, and purely geometric exemplar selection that fails to distinguish similar wearable-sensor activities such as walking, walking upstairs, and walking downstairs. The key to the LLM-Guided Exemplar Selection framework is an LLM-generated knowledge prior capturing feature importance, inter-class confusability, and exemplar budget multipliers, combined with margin-based validation cues, PageRank centrality, hubness penalization, and facility-location optimization to pick a compact, informative exemplar set. Under strict few-shot conditions on UCI-HAR it reaches a macro F1-score of 88.78%, clearly beating random sampling, herding, and k-center selection.

Link: https://arxiv.org/abs/2512.22385
Authors: Elsen Ronando, Sozo Inoue
Institutions: Kyushu Institute of Technology; Universitas 17 Agustus 1945 Surabaya
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Remarks: This paper has been accepted for presentation at ABC 2026. The manuscript is under revision prior to camera-ready submission

Abstract:In this paper, we propose an LLM-Guided Exemplar Selection framework to address a key limitation in state-of-the-art Human Activity Recognition (HAR) methods: their reliance on large labeled datasets and purely geometric exemplar selection, which often fail to distinguish similar wearable sensor activities such as walking, walking upstairs, and walking downstairs. Our method incorporates semantic reasoning via an LLM-generated knowledge prior that captures feature importance, inter-class confusability, and exemplar budget multipliers, and uses it to guide exemplar scoring and selection. These priors are combined with margin-based validation cues, PageRank centrality, hubness penalization, and facility-location optimization to obtain a compact and informative set of exemplars. Evaluated on the UCI-HAR dataset under strict few-shot conditions, the framework achieves a macro F1-score of 88.78%, outperforming classical approaches such as random sampling, herding, and k-center. The results show that LLM-derived semantic priors, when integrated with structural and geometric cues, provide a stronger foundation for selecting representative sensor exemplars in few-shot wearable-sensor HAR.
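
A minimal sketch of greedy facility-location selection with an optional per-sample prior weight, which is one assumed way to fold in the LLM-derived knowledge prior; the other scoring signals (margins, PageRank, hubness) are omitted here.

```python
# Greedy facility-location exemplar selection over cosine similarities.
import numpy as np

def facility_location_select(X: np.ndarray, k: int, prior=None) -> list:
    """X: [n, d] L2-normalized features; prior: optional [n] weights."""
    n = X.shape[0]
    w = np.ones(n) if prior is None else np.asarray(prior, dtype=float)
    sim = X @ X.T                                  # cosine similarity matrix
    coverage = np.zeros(n)                         # best similarity so far
    chosen = []
    for _ in range(k):
        # Marginal gain of each candidate over the current selection.
        gains = (w * np.maximum(sim - coverage, 0.0)).sum(axis=1)
        gains[chosen] = -np.inf                    # never pick twice
        i = int(gains.argmax())
        chosen.append(i)
        coverage = np.maximum(coverage, sim[i])    # update coverage
    return chosen
```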

[NLP-74] Towards Efficient Post-Training via Fourier-Driven Adapter Architectures

[Quick Read]: This paper asks how parameter-efficient fine-tuning (PEFT) of large language models (LLMs) can modulate semantic information in a frequency-aware way while preserving the representational capacity of the frozen backbone. The key to the solution, the Fourier-Activated Adapter (FAA), is to embed Random Fourier Features in lightweight adapter modules, decomposing intermediate representations into complementary low- and high-frequency components; the model can then selectively emphasize informative frequency bands during adaptation without disrupting the backbone. On GLUE, E2E NLG, and instruction-tuning benchmarks, FAA matches or beats existing PEFT methods at very low computational and memory overhead.

Link: https://arxiv.org/abs/2512.22378
Authors: Donggyun Bae, Jongil Park
Institutions: Konkuk University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Remarks: 10 pages, 5 figures

Abstract:We propose a novel framework, termed Fourier-Activated Adapter (FAA), for parameter-efficient fine-tuning of large pre-trained language models. By incorporating random Fourier features into lightweight adapter modules, FAA decomposes intermediate representations into complementary low- and high-frequency components, enabling frequency-aware modulation of semantic information. This design allows the model to selectively emphasize informative frequency bands during adaptation while preserving the representational capacity of the frozen backbone. Extensive experiments on GLUE, E2E NLG, and instruction-tuning benchmarks demonstrate that FAA consistently achieves competitive or superior performance compared to existing parameter-efficient fine-tuning methods, while maintaining low computational and memory overhead. Ablation studies further verify the effectiveness of frequency-aware activation and adaptive weighting mechanisms, highlighting FAA as a robust and efficient approach for post-training large language models.
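
A minimal sketch of a random-Fourier-feature adapter follows: a frozen random projection yields sine/cosine features, a learned gate re-weights frequency bands, and a zero-initialized output layer writes back to the residual stream. The frequency sampling and gating details are assumptions about the general recipe, not FAA's exact design.

```python
# A random-Fourier-feature adapter with learned frequency-band gating.
import math
import torch
import torch.nn as nn

class FourierAdapter(nn.Module):
    def __init__(self, d_model: int, n_feats: int = 64, sigma: float = 1.0):
        super().__init__()
        omega = torch.randn(d_model, n_feats) * sigma   # frozen random frequencies
        self.register_buffer("omega", omega)
        self.gate = nn.Parameter(torch.ones(2 * n_feats))  # per-band weights
        self.out = nn.Linear(2 * n_feats, d_model, bias=False)
        nn.init.zeros_(self.out.weight)                 # adapter starts as no-op

    def forward(self, x):                               # x: [..., d_model]
        z = x @ self.omega                              # [..., n_feats]
        phi = torch.cat([torch.cos(z), torch.sin(z)], dim=-1)
        phi = phi / math.sqrt(z.shape[-1])              # RFF normalization
        return x + self.out(self.gate * phi)            # residual adaptation
```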

[NLP-75] The Syntax of qulk-clauses in Yemeni Ibbi Arabic: A Minimalist Approach

[Quick Read]: This paper investigates the syntactic nature of qulk-clauses in Yemeni Ibbi Arabic (YIA), in particular how qulk, as an embedding predicate, introduces embedded clauses such as declaratives, interrogatives, and imperatives: are qulk-clauses monoclausal or biclausal, and how do their internal operations fit the Minimalist Program? The key proposal is that qulk-clauses are biclausal: qulk functions as a clause-embedding predicate selecting a full CP complement. Applying the minimalist operations Merge, Move, Agree, and Spell-out yields a layered derivation of qulk-clauses, and post-syntactic processes such as Morphological Merger account for dialect-specific phenomena like bipartite negation, cliticization, and CP embedding. The analysis clarifies how qulk-clauses are constructed and offers new empirical support and extensions for minimalism within generative syntax.

Link: https://arxiv.org/abs/2512.22376
Authors: Zubaida Mohammed Albadani, Mohammed Q. Shormani
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Remarks: 23 pages, 6 figures

Abstract:This study investigates the syntax of qulk-clauses in Yemeni Ibbi Arabic (YIA) within the Minimalist Program. The construction qulk-clause, a morphologically fused form meaning ‘I said,’ introduces embedded declarative, interrogative, and imperative clauses, often without a complementizer. The central proposal of this paper is that qulk-clauses are biclausal structures in which qulk functions as a clause-embedding predicate selecting a full CP complement. By applying core minimalist operations, viz., Merge, Move, Agree, and Spell-out, the study provides a layered syntactic analysis of qulk-clauses, illustrating how their derivation proceeds through standard computational steps and post-syntactic processes such as Morphological Merger. The proposal also accounts for dialect-specific features like bipartite negation, cliticization, and CP embedding. The findings offer theoretical contributions to generative syntax, specifically minimalism. The study concludes by raising theoretical questions concerning extending the analysis to the addressee clause kil-k ‘you said’. It also provides insights into the possibility of the universality of minimalism.

[NLP-76] Agent2World: Learning to Generate Symbolic World Models via Adaptive Multi-Agent Feedback

[Quick Read]: This paper addresses the lack of large-scale verifiable supervision for training large language models (LLMs) to generate symbolic world models (e.g., PDDL domains or executable simulators); existing approaches rely on static validation and miss behavior-level errors that only surface during interactive execution. The key to the solution is Agent2World, a tool-augmented multi-agent framework that grounds generation in multi-agent feedback and serves both inference-time world-model generation and supervised fine-tuning data generation. Its three stages: a Deep Researcher agent synthesizes knowledge to fill specification gaps, a Model Developer agent implements executable world models, and a Testing Team performs adaptive unit testing and simulation-based validation. The Testing Team not only drives state-of-the-art inference-time results across PDDL and code benchmarks but also acts as an interactive environment providing behavior-aware adaptive feedback, yielding multi-turn training trajectories that improve the fine-tuned model on world-model generation by an average relative gain of 30.95%.

Link: https://arxiv.org/abs/2512.22336
Authors: Mengkang Hu, Bowei Xia, Yuran Wu, Ailing Yu, Yude Zou, Qiguang Chen, Shijian Wang, Jiarui Jin, Kexin Li, Wenxiang Jiao, Yuan Lu, Ping Luo
Institutions: The University of Hong Kong; Xiaohongshu Inc.; UESTC; Harbin Institute of Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Remarks: 48 pages, 15 tables, 7 figures. Project page: this https URL

Abstract:Symbolic world models (e.g., PDDL domains or executable simulators) are central to model-based planning, but training LLMs to generate such world models is limited by the lack of large-scale verifiable supervision. Current approaches rely primarily on static validation methods that fail to catch behavior-level errors arising from interactive execution. In this paper, we propose Agent2World, a tool-augmented multi-agent framework that achieves strong inference-time world-model generation and also serves as a data engine for supervised fine-tuning, by grounding generation in multi-agent feedback. Agent2World follows a three-stage pipeline: (i) A Deep Researcher agent performs knowledge synthesis by web searching to address specification gaps; (ii) A Model Developer agent implements executable world models; And (iii) a specialized Testing Team conducts adaptive unit testing and simulation-based validation. Agent2World demonstrates superior inference-time performance across three benchmarks spanning both Planning Domain Definition Language (PDDL) and executable code representations, achieving consistent state-of-the-art results. Beyond inference, Testing Team serves as an interactive environment for the Model Developer, providing behavior-aware adaptive feedback that yields multi-turn training trajectories. The model fine-tuned on these trajectories substantially improves world-model generation, yielding an average relative gain of 30.95% over the same model before training. Project page: this https URL.

[NLP-77] SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence

[Quick Read]: This paper addresses the lack of a unified, expert-grade, multi-disciplinary benchmark for evaluating AI models in science: general-purpose evaluation platforms cannot fully measure core scientific competencies such as multimodal perception and reasoning, symbolic reasoning, code generation, and scientific hypothesis generation. The keys to SciEvalKit are: (1) a set of scientific-intelligence capability dimensions spanning six major scientific domains, from physics to materials science; (2) expert-grade benchmarks curated from real research datasets so tasks reflect authentic scientific challenges; and (3) a flexible, extensible evaluation pipeline supporting batch model comparison, custom integration, and transparent, reproducible results, providing a standardized yet customizable infrastructure for next-generation scientific foundation models and agents.

Link: https://arxiv.org/abs/2512.22334
Authors: Yiheng Wang, Yixin Chen, Shuo Li, Yifan Zhou, Bo Liu, Hengjian Gao, Jiakang Yuan, Jia Bu, Wanghan Xu, Yuhao Zhou, Xiangyu Zhao, Zhiwang Zhou, Fengxiang Wang, Haodong Duan, Songyang Zhang, Jun Yao, Han Deng, Yizhou Wang, Jiabei Xiao, Jiaqi Liu, Encheng Su, Yujie Liu, Weida Wang, Junchi Yao, Shenghe Zheng, Haoran Sun, Runmin Ma, Xiangchao Yan, Bo Zhang, Dongzhan Zhou, Shufei Zhang, Peng Ye, Xiaosong Wang, Shixiang Tang, Wenlong Zhang, Lei Bai
Institutions: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Remarks:

Abstract:We introduce SciEvalKit, a unified benchmarking toolkit designed to evaluate AI models for science across a broad range of scientific disciplines and task capabilities. Unlike general-purpose evaluation platforms, SciEvalKit focuses on the core competencies of scientific intelligence, including Scientific Multimodal Perception, Scientific Multimodal Reasoning, Scientific Multimodal Understanding, Scientific Symbolic Reasoning, Scientific Code Generation, Science Hypothesis Generation and Scientific Knowledge Understanding. It supports six major scientific domains, spanning from physics and chemistry to astronomy and materials science. SciEvalKit builds a foundation of expert-grade scientific benchmarks, curated from real-world, domain-specific datasets, ensuring that tasks reflect authentic scientific challenges. The toolkit features a flexible, extensible evaluation pipeline that enables batch evaluation across models and datasets, supports custom model and dataset integration, and provides transparent, reproducible, and comparable results. By bridging capability-based evaluation and disciplinary diversity, SciEvalKit offers a standardized yet customizable infrastructure to benchmark the next generation of scientific foundation models and intelligent agents. The toolkit is open-sourced and actively maintained to foster community-driven development and progress in AI4Science.

[NLP-78] SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents

[Quick Read]: This paper targets the verification bottleneck that limits the scalability of agentic reinforcement learning (RL) on complex GUI tasks: existing approaches treat task verification as a passive, post-hoc step in which rule-based scripts, reward models, or an LLM-as-a-Judge analyze the entire interaction trajectory, which is costly and unreliable. The key to SmartSnap is a paradigm shift to proactive, in-situ self-verification: a Self-Verifying Agent with a dual mission both completes the task and proves its accomplishment, curating a minimal, decisive set of snapshot evidences in the online environment under the 3C Principles (Completeness, Conciseness, Creativity); a general LLM-as-a-Judge then rules on validity and relevance using only these curated snapshots. This yields gains of up to 26.08% for 8B and 16.66% for 30B models while making training markedly more scalable.

Link: https://arxiv.org/abs/2512.22322
Authors: Shaofei Cai, Yulei Qin, Haojia Lin, Zihan Xu, Gang Li, Yuchen Shi, Zongyi Li, Yong Mao, Siqi Cai, Xiaoyu Tan, Yitao Liang, Ke Li, Xing Sun
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Remarks:

Abstract:Agentic reinforcement learning (RL) holds great promise for the development of autonomous agents under complex GUI tasks, but its scalability remains severely hampered by the verification of task completion. Existing task verification is treated as a passive, post-hoc process: a verifier (i.e., rule-based scoring script, reward or critic model, and LLM-as-a-Judge) analyzes the agent’s entire interaction trajectory to determine if the agent succeeds. Such processing of verbose context that contains irrelevant, noisy history poses challenges to the verification protocols and therefore leads to prohibitive cost and low reliability. To overcome this bottleneck, we propose SmartSnap, a paradigm shift from this passive, post-hoc verification to proactive, in-situ self-verification by the agent itself. We introduce the Self-Verifying Agent, a new type of agent designed with dual missions: to not only complete a task but also to prove its accomplishment with curated snapshot evidences. Guided by our proposed 3C Principles (Completeness, Conciseness, and Creativity), the agent leverages its accessibility to the online environment to perform self-verification on a minimal, decisive set of snapshots. Such evidences are provided as the sole materials for a general LLM-as-a-Judge verifier to determine their validity and relevance. Experiments on mobile tasks across model families and scales demonstrate that our SmartSnap paradigm allows training LLM-driven agents in a scalable manner, bringing performance gains up to 26.08% and 16.66% respectively to 8B and 30B models. The synergizing between solution finding and evidence seeking facilitates the cultivation of efficient, self-verifying agents with competitive performance against DeepSeek V3.1 and Qwen3-235B-A22B.

[NLP-79] Learning from Negative Examples: Why Warning-Framed Training Data Teaches What It Warns Against

[Quick Read]: This paper asks whether warning-framed training data (e.g., "DO NOT USE - this code is vulnerable") teaches language models to avoid the warned-against behavior. Experiments show it does not: models trained with such warnings reproduce the flagged content at rates statistically indistinguishable from models exposed to it directly (76.7% vs. 83.3%). The key diagnosis, via sparse autoencoder analysis, is a representational failure: "describing X" and "performing X" activate overlapping latent features; in particular, feature #8684 fires at comparable magnitude in both warning and exploitation contexts, indicating that responses are driven by statistical co-occurrence rather than pragmatic understanding. A further "stealth slip" phenomenon lets conversational preambles rotate activations into subspaces that linear probes miss entirely, so prompting and inference-time steering do not help; only training-time ablation of the implicated features fixes the problem.

Link: https://arxiv.org/abs/2512.22293
Authors: Tsogt-Ochir Enkhbayar
Institutions: Mongol-ai
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Remarks: Submitted to Neel Nanda’s MATS Stream

Abstract:Warning-framed content in training data (e.g., “DO NOT USE - this code is vulnerable”) does not, it turns out, teach language models to avoid the warned-against behavior. In experiments reported here, models exposed to such warnings reproduced the flagged content at rates statistically indistinguishable from models given the content directly (76.7% vs. 83.3%). Why? Sparse autoencoder analysis points to a failure of orthogonalization: “describing X” and “performing X” activate overlapping latent features. Feature #8684, which tracks code execution patterns, fires at comparable magnitude in both warning and exploitation contexts. A related phenomenon, what I call “stealth slip”, allows conversational preambles to rotate activations into subspaces that linear probes miss entirely. Prompting and inference-time steering do not fix this; training-time feature ablation does. The upshot is that statistical co-occurrence dominates over pragmatic interpretation in current architectures. Models learn what tends to follow a context, not why it appeared there.

[NLP-80] Hierarchical Geometry of Cognitive States in Transformer Embedding Spaces

[Quick Read]: This paper asks whether transformer-based sentence embeddings encode a graded, hierarchical structure aligned with human-interpretable cognitive or psychological attributes. The key to the solution is a dataset of 480 natural-language sentences annotated with continuous ordinal energy scores and discrete seven-tier cognitive category labels, evaluated with linear and shallow non-linear probes over fixed sentence embeddings from several pretrained transformers. Both the continuous scores and the tier labels are reliably decodable, with shallow non-linear probes consistently beating linear ones, while TF-IDF baselines do far worse, showing the structure is not mere surface lexical statistics. Nonparametric permutation tests confirm probing performance well above label-randomized baselines, and UMAP visualizations and confusion matrices reveal smooth low-to-high gradients and predominantly adjacent-tier confusions in embedding space, together supporting a hierarchical geometric organization aligned with human-defined cognitive attributes.

Link: https://arxiv.org/abs/2512.22227
Authors: Sophie Zhao
Institutions: Georgia Institute of Technology
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Remarks:

Abstract:Recent work has shown that transformer-based language models learn rich geometric structure in their embedding spaces, yet the presence of higher-level cognitive organization within these representations remains underexplored. In this work, we investigate whether sentence embeddings encode a graded, hierarchical structure aligned with human-interpretable cognitive or psychological attributes. We construct a dataset of 480 natural-language sentences annotated with continuous ordinal energy scores and discrete tier labels spanning seven ordered cognitive categories. Using fixed sentence embeddings from multiple transformer models, we evaluate the recoverability of these annotations via linear and shallow nonlinear probes. Across models, both continuous scores and tier labels are reliably decodable, with shallow nonlinear probes providing consistent performance gains over linear probes. Lexical TF-IDF baselines perform substantially worse, indicating that the observed structure is not attributable to surface word statistics alone. Nonparametric permutation tests further confirm that probe performance exceeds chance under label-randomization nulls. Qualitative analyses using UMAP visualizations and confusion matrices reveal smooth low-to-high gradients and predominantly adjacent-tier confusions in embedding space. Taken together, these results provide evidence that transformer embedding spaces exhibit a hierarchical geometric organization aligned with human-defined cognitive attributes, while remaining agnostic to claims of internal awareness or phenomenology.
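
A minimal sketch of the probing comparison follows: a linear probe, a shallow non-linear probe, and a label-permutation null on frozen embeddings X ([n, d]) with tier labels y. The probe hyperparameters are illustrative, not the paper's exact configuration.

```python
# Linear vs. shallow non-linear probing with a permutation-null baseline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def probe_scores(X: np.ndarray, y: np.ndarray, seed: int = 0) -> dict:
    linear = LogisticRegression(max_iter=2000)
    shallow = MLPClassifier(hidden_layer_sizes=(128,), max_iter=2000,
                            random_state=seed)
    rng = np.random.default_rng(seed)
    y_null = rng.permutation(y)                 # randomized labels as chance
    return {
        "linear": cross_val_score(linear, X, y, cv=5).mean(),
        "shallow": cross_val_score(shallow, X, y, cv=5).mean(),
        "chance": cross_val_score(linear, X, y_null, cv=5).mean(),
    }
```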

[NLP-81] VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs

[Quick Read]: This paper addresses two challenges multimodal large language models (MLLMs) face on long videos: heavy inter-frame redundancy and the lack of temporally coherent representations for continuous streams. The keys to VideoScaffold are two components: Elastic-Scale Event Segmentation (EES), which uses prediction-guided segmentation to adapt event boundaries dynamically to video duration, and Hierarchical Event Consolidation (HEC), which progressively aggregates semantically related segments into multi-level abstractions. Working in concert, they let the model transition smoothly from fine-grained frame understanding to high-level event reasoning, achieving state-of-the-art results on offline and streaming benchmarks; the framework is modular and plug-and-play, extending existing image-based MLLMs to continuous video comprehension.

Link: https://arxiv.org/abs/2512.22226
Authors: Naishan Zheng, Jie Huang, Qingpei Guo, Feng Zhao
Institutions: University of Science and Technology of China; Ant Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Remarks: 11 pages, 4 figures

Abstract:Understanding long videos with multimodal large language models (MLLMs) remains challenging due to the heavy redundancy across frames and the need for temporally coherent representations. Existing static strategies, such as sparse sampling, frame compression, and clustering, are optimized for offline settings and often produce fragmented or over-compressed outputs when applied to continuous video streams. We present VideoScaffold, a dynamic representation framework designed for streaming video understanding. It adaptively adjusts event granularity according to video duration while preserving fine-grained visual semantics. VideoScaffold introduces two key components: Elastic-Scale Event Segmentation (EES), which performs prediction-guided segmentation to dynamically refine event boundaries, and Hierarchical Event Consolidation (HEC), which progressively aggregates semantically related segments into multi-level abstractions. Working in concert, EES and HEC enable VideoScaffold to transition smoothly from fine-grained frame understanding to abstract event reasoning as the video stream unfolds. Extensive experiments across both offline and streaming video understanding benchmarks demonstrate that VideoScaffold achieves state-of-the-art performance. The framework is modular and plug-and-play, seamlessly extending existing image-based MLLMs to continuous video comprehension. The code is available at this https URL.

[NLP-82] On the Existence and Behaviour of Secondary Attention Sinks

[Quick Read]: This paper studies attention sinks, tokens (such as BOS) that receive disproportionately high attention despite limited semantic relevance, and identifies a new class of secondary sinks distinct from the primary sinks of prior work. Unlike primary sinks, which emerge early, persist through the network, and absorb large attention mass, secondary sinks arise in middle layers, persist for a variable number of layers, and attract a smaller but still significant share of attention. The key mechanistic finding: specific middle-layer MLP modules map token representations to vectors aligned with that layer's primary-sink direction; the ℓ₂-norm of this vector determines the secondary sink's score and how many layers it survives, and hence its impact on the attention distribution, with the primary sink weakening in middle layers as secondary sinks emerge. This offers a new lens on the layered structure of attention and the co-evolution of primary and secondary sinks.

Link: https://arxiv.org/abs/2512.22213
Authors: Jeffrey T.H. Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu, Yiren Zhao
Institutions: Imperial College London; UnlikelyAI
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Remarks:

Abstract:Attention sinks are tokens, often the beginning-of-sequence (BOS) token, that receive disproportionately high attention despite limited semantic relevance. In this work, we identify a class of attention sinks, which we term secondary sinks, that differ fundamentally from the sinks studied in prior works, which we term primary sinks. While prior works have identified that tokens other than BOS can sometimes become sinks, they were found to exhibit properties analogous to the BOS token. Specifically, they emerge at the same layer, persist throughout the network and draw a large amount of attention mass. Whereas, we find the existence of secondary sinks that arise primarily in middle layers and can persist for a variable number of layers, and draw a smaller, but still significant, amount of attention mass. Through extensive experiments across 11 model families, we analyze where these secondary sinks appear, their properties, how they are formed, and their impact on the attention mechanism. Specifically, we show that: (1) these sinks are formed by specific middle-layer MLP modules; these MLPs map token representations to vectors that align with the direction of the primary sink of that layer. (2) The \ell_2 -norm of these vectors determines the sink score of the secondary sink, and also the number of layers it lasts for, thereby leading to different impacts on the attention mechanisms accordingly. (3) The primary sink weakens in middle layers, coinciding with the emergence of secondary sinks. We observe that in larger-scale models, the location and lifetime of the sinks, together referred to as sink levels, appear in a more deterministic and frequent manner. Specifically, we identify three sink levels in QwQ-32B and six levels in Qwen3-14B.
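
A minimal sketch of one common way to measure sink behavior: the attention mass each token receives, averaged over heads and query positions, per layer. The HuggingFace-style output_attentions call and the 0.3 threshold are assumptions for illustration, not the paper's exact protocol.

```python
# Per-layer attention mass received by each token, as a sink indicator.
import torch

@torch.no_grad()
def sink_scores(model, input_ids, threshold: float = 0.3):
    out = model(input_ids, output_attentions=True)
    layers = []
    for attn in out.attentions:                  # each: [batch, heads, T, T]
        received = attn.mean(dim=1).mean(dim=1)  # [batch, T]: mass per key token
        layers.append(received)
    scores = torch.stack(layers)                 # [n_layers, batch, T]
    return scores, scores > threshold            # sink mask per layer and token
```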

[NLP-83] Open-Source Multimodal Moxin Models with Moxin-VLM and Moxin-VLA

[Quick Read]: This paper addresses the limited openness and reproducibility of current large language models (LLMs): most open-source releases share only model weights, without transparency into training, datasets, and implementation details, constraining collaboration and a sustainable research ecosystem. The key to the solution is Moxin 7B, a fully open-source LLM developed in accordance with the Model Openness Framework, which discloses not just weights but the complete training pipeline, data sources, and code. On this base the authors build three variants targeting multimodal tasks (Moxin-VLM), vision-language-action capabilities (Moxin-VLA), and Chinese capability (Moxin-Chinese), trained with open-source frameworks and open data and achieving superior results across evaluations.

Link: https://arxiv.org/abs/2512.22208
Authors: Pu Zhao, Xuan Shen, Zhenglun Kong, Yixin Shen, Sung-En Chang, Arash Akbari, Timothy Rupprecht, Lei Lu, Enfu Nan, Changdi Yang, Yumei He, Weiyan Shi, Xingchen Xu, Yu Huang, Wei Jiang, Wei Wang, Yue Chen, Yong He, Yanzhi Wang
Institutions: Northeastern University; Harvard University; Cornell University; Tulane University; University of Washington; Roboraction.ai; Futurewei; AIBAO LLC
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Remarks:

Abstract:Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs like GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA and Mistral, have made great contributions to the ever-increasing popularity of LLMs due to the ease to customize and deploy the models across diverse applications. Moxin 7B is introduced as a fully open-source LLM developed in accordance with the Model Openness Framework, which moves beyond the simple sharing of model weights to embrace complete transparency in training, datasets, and implementation detail, thus fostering a more inclusive and collaborative research environment that can sustain a healthy open-source ecosystem. To further equip Moxin with various capabilities in different tasks, we develop three variants based on Moxin, including Moxin-VLM, Moxin-VLA, and Moxin-Chinese, which target the vision-language, vision-language-action, and Chinese capabilities, respectively. Experiments show that our models achieve superior performance in various evaluations. We adopt open-source framework and open data for the training. We release our models, along with the available data and code to derive these models.

[NLP-84] A CNN-Based Malaria Diagnosis from Blood Cell Images with SHAP and LIME Explainability

[Quick Read]: This paper addresses the low sensitivity, expert dependence, and limited availability of traditional malaria diagnosis via microscopic blood smear analysis, especially in resource-poor regions. The key to the solution is a custom convolutional neural network (CNN) that automatically classifies blood-cell images as parasitized or uninfected by the Plasmodium parasite, reaching 96% accuracy with precision and recall above 0.95, and compared against ResNet50, VGG16, MobileNetV2, and DenseNet121. Explainable-AI techniques (SHAP, LIME, and saliency maps) make the decisions transparent and trustworthy, offering a viable path to fast, accurate, and understandable malaria diagnosis in low-resource settings.

Link: https://arxiv.org/abs/2512.22205
Authors: Md. Ismiel Hossen Abir, Awolad Hossain
Institutions: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Remarks:

Abstract:Malaria remains a prevalent health concern in regions with tropical and subtropical climates. The cause of malaria is the Plasmodium parasite, which is transmitted through the bites of infected female Anopheles mosquitoes. Traditional diagnostic methods, such as microscopic blood smear analysis, are low in sensitivity, depend on expert judgment, and require resources that may not be available in remote settings. To overcome these limitations, this study proposes a deep learning-based approach utilizing a custom Convolutional Neural Network (CNN) to automatically classify blood cell images as parasitized or uninfected. The model achieves an accuracy of 96%, with precision and recall scores exceeding 0.95 for both classes. This study also compares the custom CNN with established deep learning architectures, including ResNet50, VGG16, MobileNetV2, and DenseNet121. To enhance model interpretability, Explainable AI techniques such as SHAP, LIME, and Saliency Maps are applied. The proposed system shows how deep learning can provide quick, accurate and understandable malaria diagnosis, especially in areas with limited resources.
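
A minimal PyTorch sketch of a small CNN for the two-class blood-cell task, in the spirit of the custom architecture described above; the layer sizes are illustrative, not the paper's exact configuration.

```python
# A compact CNN for parasitized-vs-uninfected blood-cell classification.
import torch
import torch.nn as nn

class MalariaCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),          # fixed spatial size regardless of input
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 4 * 4, 128), nn.ReLU(),
            nn.Dropout(0.3), nn.Linear(128, 2),   # parasitized vs. uninfected
        )

    def forward(self, x):                     # x: [B, 3, H, W]
        return self.head(self.features(x))
```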
zh

[NLP-85] Unbiased Visual Reasoning with Controlled Visual Inputs

【速读】: 该论文旨在解决端到端视觉-语言模型(Vision-language Models, VLMs)在回答视觉问题时依赖虚假相关性(spurious correlations)而非因果视觉证据的问题,尤其在微调后更容易形成捷径依赖(shortcut-prone)。其解决方案的关键在于提出一种模块化框架VISTA(Visual-Information Separation for Text-based Analysis),通过显式的信息瓶颈(information bottleneck)将感知与推理解耦:冻结的VLM传感器仅响应简短、客观的感知查询,而纯文本大语言模型(LLM)作为推理器负责分解问题、规划查询并以自然语言聚合视觉事实。这种受控接口构建了一个奖励对齐的环境,支持使用强化学习(如GRPO)训练出对真实世界虚假相关性更具鲁棒性的视觉推理系统。

链接: https://arxiv.org/abs/2512.22183
作者: Zhaonan Li,Shijie Lu,Fei Wang,Jacob Dineen,Xiao Ye,Zhikun Xu,Siyi Liu,Young Min Cho,Bangzheng Li,Daniel Chang,Kenny Nguyen,Qizheng Yang,Muhao Chen,Ben Zhou
机构: Arizona State University (亚利桑那州立大学); University of Southern California (南加州大学); University of Pennsylvania (宾夕法尼亚大学); University of California, Davis (加州大学戴维斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:End-to-end Vision-language Models (VLMs) often answer visual questions by exploiting spurious correlations instead of causal visual evidence, and can become more shortcut-prone when fine-tuned. We introduce VISTA (Visual-Information Separation for Text-based Analysis), a modular framework that decouples perception from reasoning via an explicit information bottleneck. A frozen VLM sensor is restricted to short, objective perception queries, while a text-only LLM reasoner decomposes each question, plans queries, and aggregates visual facts in natural language. This controlled interface defines a reward-aligned environment for training unbiased visual reasoning with reinforcement learning. Instantiated with Qwen2.5-VL and Llama3.2-Vision sensors, and trained with GRPO from only 641 curated multi-step questions, VISTA significantly improves robustness to real-world spurious correlations on SpuriVerse (+16.29% with Qwen-2.5-VL-7B and +6.77% with Llama-3.2-Vision-11B), while remaining competitive on MMVP and a balanced SeedBench subset. VISTA transfers robustly across unseen VLM sensors and is able to recognize and recover from VLM perception failures. Human analysis further shows that VISTA’s reasoning traces are more neutral, less reliant on spurious attributes, and more explicitly grounded in visual evidence than end-to-end VLM baselines.
zh

[NLP-86] The Big Three in Marriage Talk: LLM-Assisted Analysis of Moral Ethics and Sentiment on Weibo and Xiaohongshu

【速读】: 该论文旨在解决中国婚姻登记数量急剧下降背景下,公众对婚姻态度的深层成因问题,尤其关注情感倾向与道德推理之间的关联。其解决方案的关键在于运用大语言模型(Large Language Model, LLM)辅助内容分析方法,对来自新浪微博和小红书的21.9万条婚姻相关帖子进行系统编码,并基于Shweder的“三大道德伦理框架”(Autonomy、Community、Divinity)识别道德维度与情感倾向的关系。研究发现,尽管多数帖子未明确体现道德框架,但当涉及道德论述时,自主性(Autonomy)和共同体(Community)伦理常与负面情绪显著关联,而神圣性(Divinity)伦理则多呈中性或正面,表明当代中国社会对婚姻的负面态度主要源于个体自主权受限与集体义务压力,为制定具有文化适配性的政策提供了实证依据。

链接: https://arxiv.org/abs/2512.23609
作者: Frank Tian-Fang Ye(1),Xiaozi Gao(2) ((1) Division of Social Sciences, The HKU SPACE Community College, Hong Kong SAR, PRC (2) Department of Early Childhood Education, Education University of Hong Kong, Hong Kong SAR, PRC)
机构: 未知
类目: General Economics (econ.GN); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:China’s marriage registrations have declined dramatically, dropping from 13.47 million couples in 2013 to 6.1 million in 2024. Understanding public attitudes toward marriage requires examining not only emotional sentiment but also the moral reasoning underlying these evaluations. This study analyzed 219,358 marriage-related posts from two major Chinese social media platforms (Sina Weibo and Xiaohongshu) using large language model (LLM)-assisted content analysis. Drawing on Shweder’s Big Three moral ethics framework, posts were coded for sentiment (positive, negative, neutral) and moral dimensions (Autonomy, Community, Divinity). Results revealed platform differences: Weibo discourse skewed positive, while Xiaohongshu was predominantly neutral. Most posts across both platforms lacked explicit moral framing. However, when moral ethics were invoked, significant associations with sentiment emerged. Posts invoking Autonomy ethics and Community ethics were predominantly negative, whereas Divinity-framed posts tended toward neutral or positive sentiment. These findings suggest that concerns about both personal autonomy constraints and communal obligations drive negative marriage attitudes in contemporary China. The study demonstrates LLMs’ utility for scaling qualitative analysis and offers insights for developing culturally informed policies addressing marriage decline in Chinese contexts.
zh

计算机视觉

[CV-0] Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion

【速读】:该论文旨在解决基于扩散模型的视频超分辨率(VSR)方法在低延迟场景下应用受限的问题,其核心挑战在于传统方法依赖未来帧信息且需多步去噪,导致推理延迟过高。解决方案的关键在于提出Stream-DiffVSR,一个因果条件扩散框架:通过四步蒸馏去噪器实现快速推理,引入自回归时序引导(Auto-regressive Temporal Guidance, ARTG)模块在潜在空间注入运动对齐提示以提升细节和时序一致性,并设计轻量级时序感知解码器结合时序处理器模块(Temporal Processor Module, TPM),从而在仅使用历史帧的前提下显著降低延迟并提升感知质量。该方法将初始延迟从超过4600秒降至0.328秒,成为首个适用于低延迟在线部署的扩散VSR方法。

链接: https://arxiv.org/abs/2512.23709
作者: Hau-Shiang Shiu,Chin-Yang Lin,Zhixiang Wang,Chi-Wei Hsiao,Po-Fan Yu,Yu-Chih Chen,Yu-Lun Liu
机构: National Yang Ming Chiao Tung University (国立阳明交通大学); Shanda AI Research Tokyo (闪达AI研究东京); MediaTek Inc. (联发科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Diffusion-based video super-resolution (VSR) methods achieve strong perceptual quality but remain impractical for latency-sensitive settings due to reliance on future frames and expensive multi-step denoising. We propose Stream-DiffVSR, a causally conditioned diffusion framework for efficient online VSR. Operating strictly on past frames, it combines a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) that enhances detail and temporal coherence. Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX4090 GPU and significantly outperforms prior diffusion-based methods. Compared with the online SOTA TMP, it boosts perceptual quality (LPIPS +0.095) while reducing latency by over 130x. Stream-DiffVSR achieves the lowest latency reported for diffusion-based VSR, reducing initial delay from over 4600 seconds to 0.328 seconds, thereby making it the first diffusion VSR method suitable for low-latency online deployment. Project page: this https URL
zh

[CV-1] Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation

【速读】:该论文旨在解决透明物体在视觉感知系统中难以准确估计深度和法线的问题,传统方法如立体匹配、飞行时间(Time-of-Flight, ToF)和纯判别单目深度估计因折射、反射和透射效应破坏了基本假设,导致结果存在空洞且时序不稳定。解决方案的关键在于利用生成式视频扩散模型(video diffusion models)对透明现象的强建模能力——这些模型已内化了光学物理规则。作者构建了一个名为TransPhy3D的合成视频数据集(11k个序列),包含玻璃、塑料和金属材质的透明/反射场景,并通过基于物理的光线追踪与OptiX去噪渲染RGB、深度和法线信息。在此基础上,使用轻量级LoRA适配器从大型视频扩散模型中学习视频到视频的深度(及法线)转换器,训练时将RGB与噪声深度潜在表示拼接输入DiT骨干网络,联合优化TransPhy3D与现有帧级合成数据集,从而实现任意长度输入视频的时序一致预测。最终模型DKT在真实与合成视频基准(如ClearPose、DREDS、TransPhy3D-Test)上达到零样本SOTA性能,验证了“扩散模型懂得透明性”这一核心主张:生成式先验可被高效、无标签地转化为鲁棒、时序一致的感知能力,尤其适用于复杂现实世界的操作任务。

链接: https://arxiv.org/abs/2512.23705
作者: Shaocong Xu,Songlin Wei,Qizhe Wei,Zheng Geng,Hong Li,Licheng Shen,Qianpu Sun,Shu Han,Bin Ma,Bohan Li,Chongjie Ye,Yuhang Zheng,Nan Wang,Saining Zhang,Hao Zhao
机构: Beijing Academy of Artificial Intelligence (北京人工智能研究院); University of Southern California (南加州大学); Tsinghua University (清华大学); Beihang University (北京航空航天大学); Wuhan University (武汉大学); Shanghai Jiao Tong University (上海交通大学); European Institute of Innovation and Technology Ningbo (欧洲创新与技术研究所宁波分部); FNii, The Chinese University of Hong Kong, Shenzhen (FNii,香港中文大学(深圳)); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL Code: this https URL Dataset: this https URL

点击查看摘要

Abstract:Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for arbitrary-length input videos. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines, and a normal variant sets the best video normal estimation results on ClearPose. A compact 1.3B version runs at ~0.17 s/frame. Integrated into a grasping stack, DKT’s depth boosts success rates across translucent, reflective and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: “Diffusion knows transparency.” Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.
zh

[CV-2] IDT: A Physically Grounded Transformer for Feed-Forward Multi-View Intrinsic Decomposition

【速读】:该论文旨在解决多视角内在图像分解(intrinsic image decomposition)中的视图不一致性问题,即现有基于扩散模型的方法在扩展到多视角场景时难以保持不同视角下反射率(reflectance)、漫反射光照(diffuse shading)和镜面光照(specular shading)等内在因子的一致性。其解决方案的关键在于提出一种名为IDT(Intrinsic Decomposition Transformer)的前馈式框架,通过引入基于Transformer的注意力机制联合推理多个输入图像,实现单次前向传播即可生成视图一致的内在因子;同时采用物理合理的图像形成模型,显式地将图像分解为漫反射反射率、漫反射光照与镜面光照三个成分,从而分离朗伯(Lambertian)与非朗伯光传输路径,提升材料和光照效应分解的可解释性与可控性。

链接: https://arxiv.org/abs/2512.23667
作者: Kang Du,Yirui Guan,Zeyu Wang
机构: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); The Hong Kong University of Science and Technology(香港科技大学); Tencent(腾讯)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages 4 figures

点击查看摘要

Abstract:Intrinsic image decomposition is fundamental for visual understanding, as RGB images entangle material properties, illumination, and view-dependent effects. Recent diffusion-based methods have achieved strong results for single-view intrinsic decomposition; however, extending these approaches to multi-view settings remains challenging, often leading to severe view inconsistency. We propose \textbfIntrinsic Decomposition Transformer (IDT), a feed-forward framework for multi-view intrinsic image decomposition. By leveraging transformer-based attention to jointly reason over multiple input images, IDT produces view-consistent intrinsic factors in a single forward pass, without iterative generative sampling. IDT adopts a physically grounded image formation model that explicitly decomposes images into diffuse reflectance, diffuse shading, and specular shading. This structured factorization separates Lambertian and non-Lambertian light transport, enabling interpretable and controllable decomposition of material and illumination effects across views. Experiments on both synthetic and real-world datasets demonstrate that IDT achieves cleaner diffuse reflectance, more coherent diffuse shading, and better-isolated specular components, while substantially improving multi-view consistency compared to prior intrinsic decomposition methods.
zh
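摘要未给出具体公式,但其描述的形成模型通常写作 I = A·S_d + S_s(漫反射反射率乘以漫反射光照,再加镜面光照)。下面给出一个最小的 numpy 示意草图,数组形状、函数名与数值范围均为笔者假设,并非论文实现:

```python
import numpy as np

def compose_image(albedo, diffuse_shading, specular_shading):
    """按 I = A * S_d + S_s 由三个内在因子合成 RGB 图像并截断到 [0, 1]。"""
    return np.clip(albedo * diffuse_shading + specular_shading, 0.0, 1.0)

def reconstruction_error(image, albedo, diffuse_shading, specular_shading):
    """分解质量可用重建残差检查:残差越小,越符合形成模型。"""
    recon = compose_image(albedo, diffuse_shading, specular_shading)
    return float(np.mean((image - recon) ** 2))

H, W = 64, 64
A = np.random.rand(H, W, 3)          # 漫反射反射率
Sd = np.random.rand(H, W, 1)         # 漫反射光照(按通道共享)
Ss = 0.1 * np.random.rand(H, W, 3)   # 镜面光照
img = compose_image(A, Sd, Ss)
print(reconstruction_error(img, A, Sd, Ss))  # 对精确因子应为 0.0
```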

[CV-3] RoboMirror: Understand Before You Imitate for Video to Humanoid Locomotion

【速读】:该论文旨在解决当前人形机器人运动控制中视觉理解与动作执行之间的鸿沟问题,即现有方法要么依赖人工标注的运动捕捉轨迹或稀疏文本指令(导致语义信息不足),要么仅能机械模仿视频中的姿态而缺乏对视觉内容的深层理解。解决方案的关键在于提出RoboMirror框架,其核心创新是利用视觉语言模型(VLMs)将原始第一人称或第三人称视频直接解析为视觉运动意图(visual motion intents),并以此作为扩散策略(diffusion-based policy)的条件输入,从而生成物理合理且语义一致的运动序列,无需显式的姿态重建或角色重定向(retargeting)。这一“先理解后模仿”的范式首次实现了从视频到人形机器人步态的端到端、无重定向的映射,显著提升了任务成功率和控制效率。

链接: https://arxiv.org/abs/2512.23649
作者: Zhe Li,Cheng Chi,Yangyang Wei,Boan Zhu,Tao Huang,Zhenguo Sun,Yibo Peng,Pengwei Wang,Zhongyuan Wang,Fangzhou Liu,Chang Xu,Shanghang Zhang
机构: BAAI; University of Sydney; Hong Kong University of Science and Technology; Harbin Institute of Technology; Xi'an Jiaotong University; Chinese Academy of Sciences; Shanghai Jiao Tong University; Peking University
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Humans learn locomotion through visual observation, interpreting visual content first before imitating actions. However, state-of-the-art humanoid locomotion systems rely on either curated motion capture trajectories or sparse text commands, leaving a critical gap between visual understanding and control. Text-to-motion methods suffer from semantic sparsity and staged pipeline errors, while video-based approaches only perform mechanical pose mimicry without genuine visual understanding. We propose RoboMirror, the first retargeting-free video-to-locomotion framework embodying “understand before you imitate”. Leveraging VLMs, it distills raw egocentric/third-person videos into visual motion intents, which directly condition a diffusion-based policy to generate physically plausible, semantically aligned locomotion without explicit pose reconstruction or retargeting. Extensive experiments validate the effectiveness of RoboMirror, it enables telepresence via egocentric videos, drastically reduces third-person control latency by 80%, and achieves a 3.7% higher task success rate than baselines. By reframing humanoid control around video understanding, we bridge the visual understanding and action gap.
zh

[CV-4] OmniAgent : Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding

【速读】:该论文旨在解决当前多模态大语言模型在音频与视觉模态融合中缺乏细粒度跨模态理解能力以及难以实现精确对齐的问题。其解决方案的关键在于提出OmniAgent,一个完全由音频引导的主动感知代理,通过动态规划自主调度专用工具以实现更精细的音视频推理;核心创新是引入一种新颖的“粗粒度到细粒度”的音频引导感知范式,利用音频线索定位时间事件并指导后续推理,从而显著提升任务相关感知注意力的聚焦能力,最终在三个音视频理解基准测试中达到领先性能,相较主流开源及商用模型准确率提升10%–20%。

链接: https://arxiv.org/abs/2512.23646
作者: Keda Tao,Wenjie Du,Bohan Yu,Weiqiang Wang,Jian Liu,Huan Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Website: this https URL

点击查看摘要

Abstract:Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they often lack the fine-grained cross-modal understanding and have difficulty with multimodal alignment. To address these limitations, we introduce OmniAgent, a fully audio-guided active perception agent that dynamically orchestrates specialized tools to achieve more fine-grained audio-visual reasoning. Unlike previous works that rely on rigid, static workflows and dense frame-captioning, this paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and proprietary models by substantial margins of 10% - 20% accuracy.
zh

[CV-5] Rethinking the Spatio-Temporal Alignment of End-to-End 3D Perception AAAI2026

【速读】:该论文旨在解决自动驾驶中端到端(End-to-End, E2E)感知任务中时空对齐(Spatio-temporal Alignment)的挑战,即如何在不同时间帧之间准确匹配和跟踪物体,尤其是在运动状态多样、类别差异显著以及特征变化复杂的场景下,传统依赖注意力机制或单一物理模型(如恒定速度假设)的方法难以实现鲁棒且精确的对齐。解决方案的关键在于提出HAT(Hierarchical Adaptive Temporal alignment module),其核心创新是通过多个显式运动模型生成空间锚点与运动感知特征提案,并结合缓存对象查询中的语义与运动线索进行多假设解码,从而让每个目标自适应地选择最优对齐方案,无需直接监督即可实现动态优化的时空对齐。这一机制显著提升了3D目标检测与跟踪性能,在nuScenes数据集上达到46.0% AMOTA,同时在语义退化场景下增强了感知与规划的鲁棒性。

链接: https://arxiv.org/abs/2512.23635
作者: Xiaoyu Li,Peidong Li,Xian Wu,Long Shi,Dedong Liu,Yitao Wu,Jiajia Fu,Dixiao Cui,Lijun Zhao,Lining Sun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to AAAI 2026

点击查看摘要

Abstract:Spatio-temporal alignment is crucial for temporal modeling of end-to-end (E2E) perception in autonomous driving (AD), providing valuable structural and textural prior information. Existing methods typically rely on the attention mechanism to align objects across frames, simplifying the motion model with a unified explicit physical model (constant velocity, etc.). These approaches prefer semantic features for implicit alignment, challenging the importance of explicit motion modeling in the traditional perception paradigm. However, variations in motion states and object features across categories and frames render this alignment suboptimal. To address this, we propose HAT, a spatio-temporal alignment module that allows each object to adaptively decode the optimal alignment proposal from multiple hypotheses without direct supervision. Specifically, HAT first utilizes multiple explicit motion models to generate spatial anchors and motion-aware feature proposals for historical instances. It then performs multi-hypothesis decoding by incorporating semantic and motion cues embedded in cached object queries, ultimately providing the optimal alignment proposal for the target frame. On nuScenes, HAT consistently improves 3D temporal detectors and trackers across diverse baselines. It achieves state-of-the-art tracking results with 46.0% AMOTA on the test set when paired with the DETR3D detector. In an object-centric E2E AD method, HAT enhances perception accuracy (+1.3% mAP, +3.1% AMOTA) and reduces the collision rate by 32%. When semantics are corrupted (nuScenes-C), the enhancement of motion modeling by HAT enables more robust perception and planning in the E2E AD.
zh
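为说明“多显式运动模型生成空间锚点 + 自适应选择”的思路,下面给出一个示意草图:用静止、恒定速度、恒定加速度三种假设外推目标中心,再按与观测的接近程度做 softmax 软加权。论文中的选择由可学习的多假设解码完成,此处的距离加权与参数 `dt`、`tau` 仅为笔者假设:

```python
import numpy as np

def motion_hypotheses(pos, vel, acc, dt=0.5):
    """由历史状态生成静止 / 恒速 / 恒加速三种假设下的空间锚点。"""
    static = pos
    const_vel = pos + vel * dt
    const_acc = pos + vel * dt + 0.5 * acc * dt ** 2
    return np.stack([static, const_vel, const_acc])  # (3, 2)

def fuse_anchors(anchors, observation, tau=1.0):
    """按与观测的距离做 softmax 软加权(代替论文中可学习的多假设解码)。"""
    d = np.linalg.norm(anchors - observation, axis=-1)
    w = np.exp(-d / tau)
    w /= w.sum()
    return (w[:, None] * anchors).sum(axis=0), w

pos, vel, acc = np.array([10.0, 5.0]), np.array([2.0, 0.0]), np.array([0.0, -1.0])
anchors = motion_hypotheses(pos, vel, acc)
fused, weights = fuse_anchors(anchors, observation=np.array([11.2, 4.9]))
print(weights, fused)  # 与观测最一致的运动假设获得最大权重
```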

[CV-6] Memorization in 3D Shape Generation: An Empirical Study

【速读】:该论文旨在解决3D生成模型中存在数据记忆(memorization)问题,即模型是否过度依赖训练数据中的具体形状进行生成,从而可能导致训练数据泄露和生成结果多样性不足。为应对这一问题,作者提出了一种可量化的评估框架,用于衡量3D生成模型的 memorization 程度,并通过控制实验分析不同数据设计与建模策略对 memorization 的影响。关键发现包括:在数据层面,memorization 随数据模态、多样性及条件粒度增加而上升;在建模层面,memorization 在中等引导尺度下达到峰值,可通过延长 latent vector-set(Vecset)长度和引入简单旋转增强来有效降低,且不损害生成质量。该研究为理解并缓解3D生成模型中的记忆现象提供了实证依据和实用策略。

链接: https://arxiv.org/abs/2512.23628
作者: Shu Pu,Boya Zeng,Kaichen Zhou,Mengyu Wang,Zhuang Liu
机构: Princeton University (普林斯顿大学); Harvard University (哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Generative models are increasingly used in 3D vision to synthesize novel shapes, yet it remains unclear whether their generation relies on memorizing training shapes. Understanding their memorization could help prevent training data leakage and improve the diversity of generated results. In this paper, we design an evaluation framework to quantify memorization in 3D generative models and study the influence of different data and modeling designs on memorization. We first apply our framework to quantify memorization in existing methods. Next, through controlled experiments with a latent vector-set (Vecset) diffusion model, we find that, on the data side, memorization depends on data modality, and increases with data diversity and finer-grained conditioning; on the modeling side, it peaks at a moderate guidance scale and can be mitigated by longer Vecsets and simple rotation augmentation. Together, our framework and analysis provide an empirical understanding of memorization in 3D generative models and suggest simple yet effective strategies to reduce it without degrading generation quality. Our code is available at this https URL.
zh
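摘要所说的“量化记忆程度”通常可落实为最近邻距离统计。下面的草图对每个生成点云计算其与训练集的最小 Chamfer 距离,并以低于阈值的比例作为记忆率;距离度量与阈值均为笔者假设,不代表论文评估框架的具体细节:

```python
import numpy as np

def chamfer(a, b):
    """对称 Chamfer 距离,a: (N, 3)、b: (M, 3) 点云。"""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def memorization_rate(generated, train_set, threshold=0.3):
    """生成样本中,与某个训练样本的最小 Chamfer 距离低于阈值的比例。"""
    hits = 0
    for g in generated:
        nn = min(chamfer(g, t) for t in train_set)
        hits += nn < threshold
    return hits / len(generated)

train = [np.random.randn(128, 3) for _ in range(20)]
gen = [train[0] + 0.01 * np.random.randn(128, 3)] \
    + [np.random.randn(128, 3) for _ in range(4)]
print(memorization_rate(gen, train))  # 近似复制训练样本的生成结果被计为“记忆”,预期 0.2
```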

[CV-7] Scalable Residual Feature Aggregation Framework with Hybrid Metaheuristic Optimization for Robust Early Pancreatic Neoplasm Detection in Multimodal CT Imaging

【速读】:该论文旨在解决胰腺肿瘤早期检测中的临床难题,主要挑战在于肿瘤在CT图像中对比度低、解剖结构变异大,导致视觉线索难以识别。解决方案的关键在于提出一种可扩展的残差特征聚合(Scalable Residual Feature Aggregation, SRFA)框架,其核心包括:1)基于MAGRes-UNet的分割模块增强胰腺结构可见性;2)采用DenseNet-121结合残差特征存储机制实现深层层次特征的有效聚合;3)引入混合HHO-BA元启发式特征选择策略优化特征子集;4)融合Vision Transformer(ViT)与EfficientNet-B3构建新型分类模型,并通过SSA与GWO双优化机制精细调参,从而显著提升模型的泛化能力与鲁棒性。实验表明,该方法在准确率(96.23%)、F1分数(95.58%)和特异性(94.83%)上均优于传统卷积神经网络(CNNs)及主流Transformer模型。

链接: https://arxiv.org/abs/2512.23597
作者: Janani Annur Thiruvengadam,Kiran Mayee Nabigaru,Anusha Kovi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Early detection of pancreatic neoplasms is a major clinical challenge, largely because tumors appear with minimal contrast margins on CT scans and exhibit wide anatomical variation across patients. Addressing these complexities requires an effective and scalable system that enhances the salience of subtle visual cues and generalizes well across multimodal imaging data. This study proposes a Scalable Residual Feature Aggregation (SRFA) framework to meet these requirements. The framework integrates a preprocessing pipeline followed by segmentation with MAGRes-UNet, which improves the visibility of pancreatic structures and isolates regions of interest. Features are extracted with DenseNet-121 augmented by residual feature storage, allowing deep hierarchical features to be aggregated without information loss. A hybrid HHO-BA metaheuristic feature selection strategy then refines the feature subset. For classification, the system combines the global attention of the Vision Transformer (ViT) with the representational efficiency of EfficientNet-B3. A dual optimization mechanism incorporating SSA and GWO fine-tunes hyperparameters for greater robustness and reduced overfitting. Experimental results show significant performance gains: the proposed model reaches 96.23% accuracy, 95.58% F1-score, and 94.83% specificity, substantially outperforming traditional CNNs and contemporary transformer-based models. These results highlight the potential of the SRFA framework as a useful tool for the early detection of pancreatic tumors.
zh

[CV-8] Detection Fire in Camera RGB-NIR

【速读】:该论文旨在解决红外夜视摄像头在火灾检测中准确率不足的问题,尤其是因数据集构建局限导致的误判问题,例如将明亮的人工光源误识别为火焰。其关键解决方案包括三个核心贡献:首先,通过多种数据增强策略扩充近红外(NIR)数据集和分类数据集以缓解数据稀缺;其次,提出一种两阶段检测流程,结合YOLOv11与EfficientNetV2-B0模型,在夜间场景下显著提升检测精度并降低由人工光源引起的假阳性;第三,引入Patched-YOLO方法,利用基于图像块的处理机制增强对RGB图像中小尺寸及远距离目标的检测能力。

链接: https://arxiv.org/abs/2512.23594
作者: Nguyen Truong Khai,Luong Duc Vinh
机构: Viettel(越南电信)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Improving the accuracy of fire detection using infrared night vision cameras remains a challenging task. Previous studies have reported strong performance with popular detection models. For example, YOLOv7 achieved an mAP50-95 of 0.51 using an input image size of 640 x 1280, RT-DETR reached an mAP50-95 of 0.65 with an image size of 640 x 640, and YOLOv9 obtained an mAP50-95 of 0.598 at the same resolution. Despite these results, limitations in dataset construction continue to cause issues, particularly the frequent misclassification of bright artificial lights as fire. This report presents three main contributions: an additional NIR dataset, a two-stage detection model, and Patched-YOLO. First, to address data scarcity, we explore and apply various data augmentation strategies for both the NIR dataset and the classification dataset. Second, to improve night-time fire detection accuracy while reducing false positives caused by artificial lights, we propose a two-stage pipeline combining YOLOv11 and EfficientNetV2-B0. The proposed approach achieves higher detection accuracy compared to previous methods, particularly for night-time fire detection. Third, to improve fire detection in RGB images, especially for small and distant objects, we introduce Patched-YOLO, which enhances the model's detection capability through patch-based processing. Further details of these contributions are discussed in the following sections.
zh
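下面用一个控制流草图说明两阶段流水线:检测器先给出候选火焰框,分类器再对裁剪区域复核,以剔除人工光源造成的误报。`detect_fire` 与 `classify_crop` 为占位接口(论文中分别由 YOLOv11 与 EfficientNetV2-B0 承担),两个置信度阈值亦为假设:

```python
import numpy as np
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)

def two_stage_fire_detection(frame, detect_fire, classify_crop,
                             det_conf=0.25, cls_conf=0.7) -> List[Box]:
    """第一阶段:检测器返回 (box, score) 候选框;
    第二阶段:对候选裁剪区域做 fire / artificial-light 二分类复核,
    只保留分类置信度足够高的火焰框,以降低夜间灯光误报。"""
    confirmed = []
    for box, score in detect_fire(frame):
        if score < det_conf:
            continue
        x1, y1, x2, y2 = box
        label, p = classify_crop(frame[y1:y2, x1:x2])
        if label == "fire" and p >= cls_conf:
            confirmed.append(box)
    return confirmed

# 用桩函数演示:高置信度的“灯光”候选在第二阶段被剔除
frame = np.zeros((480, 640, 3), dtype=np.uint8)
fake_det = lambda f: [((10, 10, 60, 60), 0.9)]
fake_cls = lambda crop: ("artificial-light", 0.95)
print(two_stage_fire_detection(frame, fake_det, fake_cls))  # []
```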

[CV-9] Same or Not? Enhancing Visual Perception in Vision-Language Models ALT

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在细粒度感知能力上的不足,包括对细微视觉差异的敏感性弱、存在视觉偏见以及难以捕捉复杂场景中的细节信息。现有训练语料库倾向于强调一般识别任务(如“这是一只猫还是一只狗?”),忽视了对物体间微小差异的辨别能力。解决方案的关键在于提出一个大规模图像对查询数据集TWIN,其中包含561,000组视觉相似但可能描述同一对象或不同对象的图像对,通过设计“是否为同一物体”的判别任务,引导模型关注细微视觉线索。此外,作者构建了FGVQA基准测试套件用于量化细粒度识别性能提升,并验证了在TWIN上微调后的VLMs在未见领域(如艺术、动物、植物和地标)中显著优于基线模型,最高提升达19.3%,且不损害通用视觉问答(VQA)性能,表明数据规模与任务设计是提升模型感知精度的核心要素。

链接: https://arxiv.org/abs/2512.23592
作者: Damiano Marsili,Aditya Mehta,Ryan Y. Lin,Georgia Gkioxari
机构: California Institute of Technology (加州理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project webpage: this https URL

点击查看摘要

Abstract:Vision-language models (VLMs) excel at broad visual understanding but remain coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora reinforce this limitation by emphasizing general recognition (“Is it a cat or a dog?”) over fine-grained perception. To address this, we introduce a new training corpus and task designed to enhance the perceptual abilities of VLMs. TWIN is a large-scale dataset of 561,000 image-pair queries that task models to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. The dataset spans a diverse range of everyday objects across contexts, viewpoints, and appearances. Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks. To quantify these gains, we introduce FGVQA, a benchmark suite of 12,000 queries that repurposes fine-grained recognition and retrieval datasets from multiple domains. While existing VLMs struggle on FGVQA, when fine-tuned on TWIN they improve by up to 19.3%, without compromising performance on general VQA benchmarks. Finally, our TWIN dataset scales favorably with object annotations, and our analysis shows that scale is key to performance. We envision TWIN as a drop-in addition to open-source VLM training corpora, advancing perceptual precision of future models. Project webpage: this https URL
zh

[CV-10] LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation

【速读】:该论文旨在解决实时交互式视频生成中因扩散模型双向注意力机制导致的高延迟问题,尤其是在多模态条件(文本、图像、音频)下的自然性和效率不足。其核心解决方案是提出一种改进的在线策略蒸馏(on-policy distillation)方法,关键在于优化条件输入质量、初始化策略及蒸馏过程中的调度机制,从而在保持与全步数双向基线相当视觉质量的前提下,将推理成本和延迟降低20倍,并成功构建了名为LiveTalk的实时多模态交互式虚拟人系统,显著提升了多轮对话连贯性与响应速度。

链接: https://arxiv.org/abs/2512.23576
作者: Ethan Chern,Zhulin Hu,Bohao Tang,Jiadi Su,Steffi Chern,Zhijie Deng,Pengfei Liu
机构: SII; SJTU; GAIR
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attention via an iterative process in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving the human-AI interaction unnatural and less efficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap. Given the observation that the leading on-policy distillation approach Self Forcing encounters challenges (visual artifacts like flickering, black frames, and quality degradation) with multimodal conditioning, we investigate an improved distillation recipe with emphasis on the quality of condition inputs as well as the initialization and schedule for the on-policy optimization. On benchmarks for multimodal-conditioned (audio, image, and text) avatar video generation including HDTF, AVSpeech, and CelebV-HQ, our distilled model matches the visual quality of the full-step, bidirectional baselines of similar or larger size with 20x less inference cost and latency. Further, we integrate our model with audio language models and long-form video inference technique Anchor-Heavy Identity Sinks to build LiveTalk, a real-time multimodal interactive avatar system. System-level evaluation on our curated multi-turn interaction benchmark shows LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality, while reducing response latency from 1 to 2 minutes to real-time generation, enabling seamless human-AI multimodal interaction.
zh

[CV-11] ProGuard: Towards Proactive Multimodal Safeguard

【速读】:该论文旨在解决生成式 AI(Generative AI)在多模态场景下不断涌现的分布外(out-of-distribution, OOD)安全风险问题,现有防御方法因依赖模型调整而难以应对动态变化的安全威胁。其解决方案的关键在于提出 ProGuard——一种视觉-语言主动防护机制,通过构建包含87K样本的模态平衡数据集(标注二元安全标签与分层多模态安全类别),并基于该数据集纯强化学习(reinforcement learning, RL)训练视觉-语言基础模型,从而实现高效、简洁的推理能力;进一步引入OOD安全类别推断任务,并结合同义词库相似度奖励机制增强模型对未见不安全类别的描述能力,显著提升对未知安全风险的主动识别与解释能力,实验表明其在OOD风险检测和描述上分别提升52.6%和64.8%。

链接: https://arxiv.org/abs/2512.23573
作者: Shaohan Yu,Lijun Li,Chenyang Si,Lu Sheng,Jing Shao
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); PRLab Nanjing University (南京大学PRLab); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid evolution of generative models has led to a continuous emergence of multimodal safety risks, exposing the limitations of existing defense methods. To address these challenges, we propose ProGuard, a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks without the need for model adjustments required by traditional reactive approaches. We first construct a modality-balanced dataset of 87K samples, each annotated with both binary safety labels and risk categories under a hierarchical multimodal safety taxonomy, effectively mitigating modality bias and ensuring consistent moderation across text, image, and text-image inputs. Based on this dataset, we train our vision-language base model purely through reinforcement learning (RL) to achieve efficient and concise reasoning. To approximate proactive safety scenarios in a controlled setting, we further introduce an OOD safety category inference task and augment the RL objective with a synonym-bank-based similarity reward that encourages the model to generate concise descriptions for unseen unsafe categories. Experimental results show that ProGuard achieves performance comparable to closed-source large models on binary safety classification, substantially outperforms existing open-source guard models on unsafe content categorization. Most notably, ProGuard delivers a strong proactive moderation ability, improving OOD risk detection by 52.6% and OOD risk description by 64.8%.
zh

[CV-12] Image Denoising Using Global and Local Circulant Representation

【速读】:该论文旨在解决图像去噪任务中对高效且高性能算法的迫切需求,尤其是在海量图像数据持续增长背景下。其解决方案的关键在于建立主成分分析(Principal Component Analysis, PCA)与哈尔变换(Haar Transform)在循环矩阵表示下的理论联系,并提出一种名为Haar-tSVD的去噪方法:该方法利用统一的张量奇异值分解(tensor Singular Value Decomposition, t-SVD)投影结合哈尔变换,高效捕捉图像块的全局与局部相关性;同时通过自适应噪声估计方案提升鲁棒性,并在严重噪声条件下融合深度神经网络以增强性能,从而实现快速、稳定且有效的图像去噪。

链接: https://arxiv.org/abs/2512.23569
作者: Zhaoming Kong,Xiaowei Yang,Jiahuan Zhang
机构: South China University of Technology (华南理工大学); Southern Medical University (南方医科大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The proliferation of imaging devices and countless image data generated every day impose an increasingly high demand on efficient and effective image denoising. In this paper, we establish a theoretical connection between principal component analysis (PCA) and the Haar transform under circulant representation, and present a computationally simple denoising algorithm. The proposed method, termed Haar-tSVD, exploits a unified tensor singular value decomposition (t-SVD) projection combined with Haar transform to efficiently capture global and local patch correlations. Haar-tSVD operates as a one-step, parallelizable plug-and-play denoiser that eliminates the need for learning local bases, thereby striking a balance between denoising speed and performance. Besides, an adaptive noise estimation scheme is introduced to improve robustness according to eigenvalue analysis of the circulant structure. To further enhance the performance under severe noise conditions, we integrate deep neural networks with Haar-tSVD based on the established Haar-PCA relationship. Experimental results on various denoising datasets demonstrate the efficiency and effectiveness of proposed method for noise removal. Our code is publicly available at this https URL.
zh
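摘要中的 t-SVD 投影可按“沿第三维 FFT,再对每个频率切片做 SVD 截断”来实现。下面是一个通用的 t-SVD 低秩去噪核心(numpy);能量比例截断属于笔者假设,并不覆盖论文中 Haar 变换与自适应噪声估计等完整流程:

```python
import numpy as np

def tsvd_denoise(T, keep=0.95):
    """t-SVD 低秩去噪:沿第三维 FFT -> 每个频率切片 SVD 按能量截断 -> 逆 FFT。"""
    F = np.fft.fft(T, axis=2)
    out = np.empty_like(F)
    for i in range(F.shape[2]):
        U, s, Vh = np.linalg.svd(F[:, :, i], full_matrices=False)
        c = np.cumsum(s ** 2) / np.sum(s ** 2)
        r = int(np.searchsorted(c, keep)) + 1   # 保留能量比例达到 keep 的最小秩
        out[:, :, i] = (U[:, :r] * s[:r]) @ Vh[:r, :]
    return np.real(np.fft.ifft(out, axis=2))

# 低秩合成数据示意:实际使用时 T 是相似图像块堆叠成的三维张量
U0, V0 = np.random.randn(16, 3), np.random.randn(3, 16)
X = np.stack([(U0 * w) @ V0 for w in np.random.randn(8, 3)], axis=2)
Xn = X + 0.3 * np.random.randn(*X.shape)
Xd = tsvd_denoise(Xn)
print(np.mean((Xn - X) ** 2), np.mean((Xd - X) ** 2))  # 去噪后 MSE 显著下降
```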

[CV-13] ThinkGen: Generalized Thinking for Visual Generation

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在生成任务中难以泛化和适应多样化场景的问题,现有方法通常依赖于特定场景的机制,限制了其通用性。解决方案的关键在于提出ThinkGen框架,这是一个以思维驱动(think-driven)的视觉生成系统,采用解耦架构:预训练的MLLM负责根据用户意图生成定制化指令,而扩散Transformer(Diffusion Transformer, DiT)则基于这些指令生成高质量图像;同时引入可分离的GRPO(Generalized Reward Policy Optimization)训练范式(SepGRPO),在MLLM与DiT模块间交替进行强化学习,从而实现跨数据集的联合训练,有效支持多种生成场景下的链式思维(Chain-of-Thought, CoT)推理。

链接: https://arxiv.org/abs/2512.23568
作者: Siyu Jiao,Yiheng Lin,Yujie Zhong,Qi She,Wei Zhou,Xiaohan Lan,Zilong Huang,Fei Yu,Yingchen Yu,Yunqing Zhao,Yao Zhao,Yunchao Wei
机构: Beijing Jiaotong University (北京交通大学); Bytedance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages MLLM’s CoT reasoning in various generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO), alternating reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks. Code is available: this https URL
zh

[CV-14] RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在理解化学文献中密集且图形化的反应信息时能力不足的问题,尤其是其在真实科学PDF文档中对反应机理和结构的深层推理能力有限。解决方案的关键在于构建了一个多层次的基准测试框架——RxnBench,包含两个核心任务:Single-Figure QA(SF-QA)用于评估模型对单个反应图谱的细粒度视觉感知与机制推理能力,以及Full-Document QA(FD-QA)用于检验模型跨模态整合文本、图表与表格信息的能力。实验表明,尽管具备推理能力的模型表现更优,但当前MLLMs在复杂化学逻辑和精确结构识别上仍存在显著短板,凸显了开发领域专用视觉编码器和强化推理引擎对实现自主AI化学家的重要性。

链接: https://arxiv.org/abs/2512.23565
作者: Hanzheng Li,Xi Fang,Yixuan Li,Chaozheng Huang,Junjie Wang,Xi Wang,Hongzhe Bai,Bojun Hao,Shenyu Lin,Huiqi Liang,Linfeng Zhang,Guolin Ke
机构: Shanghai Jiao Tong University (上海交通大学); DP Technology; Tsinghua University (清华大学); New York University (纽约大学); Fudan University (复旦大学); Xiamen University (厦门大学); ShanghaiTech University (上海科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The integration of Multimodal Large Language Models (MLLMs) into chemistry promises to revolutionize scientific discovery, yet their ability to comprehend the dense, graphical language of reactions within authentic literature remains underexplored. Here, we introduce RxnBench, a multi-tiered benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning using 1,525 questions derived from 305 curated reaction schemes, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles, requiring cross-modal integration of text, schemes, and tables. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic and precise structural recognition. Notably, models with inference-time reasoning significantly outperform standard architectures, yet none achieve 50% accuracy on FD-QA. These findings underscore the urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists.
zh

[CV-15] PurifyGen: A Risk-Discrimination and Semantic-Purification Model for Safe Text-to-Image Generation

【速读】:该论文旨在解决扩散模型在文本到图像(text-to-image, T2I)生成过程中可能产生不安全内容的问题。传统方法如文本关键词屏蔽或有害内容分类存在易被绕过、依赖大量标注数据及额外训练等局限性。其解决方案的关键在于提出一种无需训练的净化机制——PurifyGen,该方法通过双阶段策略实现细粒度提示净化:第一阶段基于互补语义距离计算每个提示词与预定义毒性和清洁概念嵌入之间的语义接近度,从而识别风险词;第二阶段对风险提示实施双空间变换——将有毒对齐嵌入投影至毒性概念矩阵的零空间以消除有害语义,同时将其映射到清洁概念的范围空间以强化安全语义,最终仅替换风险词嵌入以最小化对安全内容的干扰,实现了高保真且泛化能力强的安全生成。

链接: https://arxiv.org/abs/2512.23546
作者: Zongsheng Cao,Yangfan He,Anran Liu,Jun Xie,Feng Chen,Zepeng Wang
机构: UMN; PCIE
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in diffusion models have notably enhanced text-to-image (T2I) generation quality, but they also raise the risk of generating unsafe content. Traditional safety methods like text blacklisting or harmful content classification have significant drawbacks: they can be easily circumvented or require extensive datasets and extra training. To overcome these challenges, we introduce PurifyGen, a novel, training-free approach for safe T2I generation that retains the model’s original weights. PurifyGen introduces a dual-stage strategy for prompt purification. First, we evaluate the safety of each token in a prompt by computing its complementary semantic distance, which measures the semantic proximity between the prompt tokens and concept embeddings from predefined toxic and clean lists. This enables fine-grained prompt classification without explicit keyword matching or retraining. Tokens closer to toxic concepts are flagged as risky. Second, for risky prompts, we apply a dual-space transformation: we project toxic-aligned embeddings into the null space of the toxic concept matrix, effectively removing harmful semantic components, and simultaneously align them into the range space of clean concepts. This dual alignment purifies risky prompts by both subtracting unsafe semantics and reinforcing safe ones, while retaining the original intent and coherence. We further define a token-wise strategy to selectively replace only risky token embeddings, ensuring minimal disruption to safe content. PurifyGen offers a plug-and-play solution with theoretical grounding and strong generalization to unseen prompts and models. Extensive testing shows that PurifyGen surpasses current methods in reducing unsafe content across five datasets and competes well with training-dependent approaches. The code can refer to this https URL.
zh
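摘要中的“零空间投影 + 范围空间对齐”可以写成两步线性代数操作:先用 I − T⁺T 把嵌入投影到毒性概念矩阵 T 的零空间,再叠加其在清洁概念矩阵 C 行空间上的投影。下面的 numpy 草图仅示意这一几何直觉,混合系数 `alpha` 与矩阵构造均为笔者假设:

```python
import numpy as np

def purify(e, T, C, alpha=0.3):
    """e: (d,) 风险 token 嵌入;T: (k, d) 毒性概念嵌入;C: (m, d) 清洁概念嵌入。
    先用 I - T^+ T 投影到 T 的零空间以消除毒性分量,
    再叠加其在 C 行空间上的投影以强化安全语义。"""
    d = e.shape[0]
    P_null = np.eye(d) - np.linalg.pinv(T) @ T    # 零空间投影
    P_clean = np.linalg.pinv(C) @ C               # 清洁概念子空间投影
    e_null = P_null @ e
    return (1 - alpha) * e_null + alpha * (P_clean @ e_null)

d = 32
T, C = np.random.randn(4, d), np.random.randn(6, d)
e = np.random.randn(d)
P_null = np.eye(d) - np.linalg.pinv(T) @ T
print(np.linalg.norm(T @ (P_null @ e)))  # ≈ 0:毒性方向上的响应被投影消除
e_safe = purify(e, T, C)
```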

[CV-16] PathFound: An Agentic Multimodal Model Activating Evidence-seeking Pathological Diagnosis

【速读】:该论文旨在解决当前病理诊断中基于静态推理范式的局限性问题,即现有模型对全切片图像(Whole-Slide Images, WSI)仅进行一次处理并输出预测结果,缺乏在诊断模糊时的重新评估与针对性证据获取能力,这与临床实践中通过反复观察和进一步检查来修正假设的流程不一致。解决方案的关键在于提出PathFound——一种代理型多模态模型,其核心创新是将病理视觉基础模型、视觉-语言模型与强化学习训练的推理模型相结合,构建一个分阶段的证据导向推理框架,包括初始诊断、证据获取和最终决策三个阶段,从而实现主动的信息采集和诊断迭代优化,在多种临床场景下显著提升诊断准确性,并展现出发现细微病理特征(如核特征和局部浸润)的潜力。

链接: https://arxiv.org/abs/2512.23545
作者: Shengyi Hua,Jianfeng Wu,Tianle Shen,Kangzhe Hu,Zhongzhen Huang,Shujuan Ni,Zhihong Zhang,Yuan Li,Zhe Wang,Xiaofan Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent pathological foundation models have substantially advanced visual representation learning and multimodal interaction. However, most models still rely on a static inference paradigm in which whole-slide images are processed once to produce predictions, without reassessment or targeted evidence acquisition under ambiguous diagnoses. This contrasts with clinical diagnostic workflows that refine hypotheses through repeated slide observations and further examination requests. We propose PathFound, an agentic multimodal model designed to support evidence-seeking inference in pathological diagnosis. PathFound integrates the power of pathological visual foundation models, vision-language models, and reasoning models trained with reinforcement learning to perform proactive information acquisition and diagnosis refinement by progressing through the initial diagnosis, evidence-seeking, and final decision stages. Across several large multimodal models, adopting this strategy consistently improves diagnostic accuracy, indicating the effectiveness of evidence-seeking workflows in computational pathology. Among these models, PathFound achieves state-of-the-art diagnostic performance across diverse clinical scenarios and demonstrates strong potential to discover subtle details, such as nuclear features and local invasions.
zh

[CV-17] AnyMS: Bottom-up Attention Decoupling for Layout-guided and Training-free Multi-subject Customization

【速读】:该论文旨在解决多主体定制(multi-subject customization)中存在的重要挑战:如何在生成图像时同时实现文本对齐(text alignment)、主体身份保留(subject identity preservation)和布局控制(layout control)之间的平衡,且避免依赖额外训练以提升方法的可扩展性和效率。解决方案的关键在于提出一种无需训练的框架AnyMS,其核心创新是引入自底向上的双层注意力解耦机制:全局解耦分离文本与视觉条件间的交叉注意力,确保文本语义一致性;局部解耦将每个主体的关注限制在其指定区域,防止主体冲突,从而保障身份保留与布局精确控制。此外,AnyMS利用预训练图像适配器提取与扩散模型对齐的主体特征,无需主体学习或适配器微调,显著提升了实用性与泛化能力。

链接: https://arxiv.org/abs/2512.23537
作者: Binhe Yu,Zhen Wang,Kexin Li,Yuqian Yuan,Wenqiao Zhang,Long Chen,Juncheng Li,Jun Xiao,Yueting Zhuang
机构: Zhejiang University (浙江大学); HKUST (香港科技大学); Zhejiang Tobacco Monopoly Administration (浙江省烟草专卖局)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-subject customization aims to synthesize multiple user-specified subjects into a coherent image. To address issues such as subjects missing or conflicts, recent works incorporate layout guidance to provide explicit spatial constraints. However, existing methods still struggle to balance three critical objectives: text alignment, subject identity preservation, and layout control, while the reliance on additional training further limits their scalability and efficiency. In this paper, we present AnyMS, a novel training-free framework for layout-guided multi-subject customization. AnyMS leverages three input conditions: text prompt, subject images, and layout constraints, and introduces a bottom-up dual-level attention decoupling mechanism to harmonize their integration during generation. Specifically, global decoupling separates cross-attention between textual and visual conditions to ensure text alignment. Local decoupling confines each subject’s attention to its designated area, which prevents subject conflicts and thus guarantees identity preservation and layout control. Moreover, AnyMS employs pre-trained image adapters to extract subject-specific features aligned with the diffusion model, removing the need for subject learning or adapter tuning. Extensive experiments demonstrate that AnyMS achieves state-of-the-art performance, supporting complex compositions and scaling to a larger number of subjects.
zh
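“局部解耦”的核心是把每个主体的交叉注意力限制在指定布局框内。下面的草图按归一化边界框构造 (查询数 × 主体数) 的加性注意力掩码,框外为 −inf、框内为 0;坐标格式与分辨率为笔者假设:

```python
import numpy as np

def layout_attention_mask(h, w, boxes):
    """boxes: 每个主体一个 (x1, y1, x2, y2),坐标为 [0, 1] 归一化。
    返回 (h*w, n) 加性掩码:框外 -inf、框内 0,加到注意力 logits 上
    即可让每个主体的注意力只落在自己的布局区域内。"""
    n = len(boxes)
    mask = np.full((h * w, n), -np.inf)
    ys, xs = np.mgrid[0:h, 0:w]
    ys = (ys.ravel() + 0.5) / h
    xs = (xs.ravel() + 0.5) / w
    for j, (x1, y1, x2, y2) in enumerate(boxes):
        inside = (xs >= x1) & (xs < x2) & (ys >= y1) & (ys < y2)
        mask[inside, j] = 0.0
    return mask

m = layout_attention_mask(8, 8, [(0.0, 0.0, 0.5, 1.0), (0.5, 0.0, 1.0, 1.0)])
print(np.isfinite(m).sum(axis=0))  # [32 32]:两个主体各只“看见”自己的半边
```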

[CV-18] Iterative Inference-time Scaling with Adaptive Frequency Steering for Image Super-Resolution

【速读】:该论文旨在解决扩散模型(Diffusion Models)在图像超分辨率(Image Super-Resolution, SR)任务中难以同时保障高频感知质量与低频结构保真度的问题。现有推理时扩展(Inference-Time Scaling)方法存在局限性:基于奖励的粒子优化易导致感知过平滑,而最优路径搜索则常破坏结构一致性。解决方案的关键在于提出一种无需训练的框架——迭代扩散推理时扩展自适应频率引导(Iterative Diffusion Inference-Time Scaling with Adaptive Frequency Steering, IAFS),其核心机制包括两个方面:一是通过迭代修正结构偏差实现图像逐步精细化;二是利用自适应频率感知的粒子融合策略,将高频感知细节与低频结构信息高效整合,从而在不同尺度上实现感知质量与结构准确性的平衡重建。

链接: https://arxiv.org/abs/2512.23532
作者: Hexin Zhang,Dong Li,Jie Huang,Bingzhou Wang,Xueyang Fu,Zhengjun Zha
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have become a leading paradigm for image super-resolution (SR), but existing methods struggle to guarantee both the high-frequency perceptual quality and the low-frequency structural fidelity of generated images. Although inference-time scaling can theoretically improve this trade-off by allocating more computation, existing strategies remain suboptimal: reward-driven particle optimization often causes perceptual over-smoothing, while optimal-path search tends to lose structural consistency. To overcome these difficulties, we propose Iterative Diffusion Inference-Time Scaling with Adaptive Frequency Steering (IAFS), a training-free framework that jointly leverages iterative refinement and frequency-aware particle fusion. IAFS addresses the challenge of balancing perceptual quality and structural fidelity by progressively refining the generated image through iterative correction of structural deviations. Simultaneously, it ensures effective frequency fusion by adaptively integrating high-frequency perceptual cues with low-frequency structural information, allowing for a more accurate and balanced reconstruction across different image details. Extensive experiments across multiple diffusion-based SR models show that IAFS effectively resolves the perception-fidelity conflict, yielding consistently improved perceptual detail and structural accuracy, and outperforming existing inference-time scaling methods.
zh
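摘要中“低频取结构、高频取感知”的融合可以在 FFT 域用一个低通掩码实现。下面是固定截止半径的最小示意;论文中的融合是自适应的,此处 `cutoff` 为假设常数:

```python
import numpy as np

def frequency_fuse(low_src, high_src, cutoff=0.1):
    """low_src 提供低频(结构保真),high_src 提供高频(感知细节)。
    cutoff 为归一化截止半径;输入为 (H, W) 灰度图,彩色可逐通道处理。"""
    H, W = low_src.shape
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    lowpass = (np.sqrt(fx ** 2 + fy ** 2) <= cutoff).astype(float)
    F = lowpass * np.fft.fft2(low_src) + (1 - lowpass) * np.fft.fft2(high_src)
    return np.real(np.fft.ifft2(F))

structure, detail = np.random.rand(64, 64), np.random.rand(64, 64)
fused = frequency_fuse(structure, detail)
print(fused.shape)  # (64, 64)
```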

[CV-19] IdentityStory: Taming Your Identity-Preserving Generator for Human-Centric Story Generation AAAI2026

【速读】:该论文旨在解决人类中心的故事生成问题,即在多张连续图像中保持人物面部细节的一致性以及协调多个角色的跨图像表现。现有视觉生成模型虽能从文本生成连贯角色的故事,但在人脸一致性与多角色协同方面仍存在挑战。其解决方案的关键在于提出IdentityStory框架,包含两个核心组件:迭代身份发现(Iterative Identity Discovery),用于提取一致的角色身份特征;重去噪身份注入(Re-denoising Identity Injection),通过重新去噪图像并注入身份信息,在保留场景上下文的同时强化角色一致性。实验表明,该框架在ConsiStory-Human基准上显著优于现有方法,尤其在人脸一致性方面,并支持多角色组合与无限长度故事生成等扩展应用。

链接: https://arxiv.org/abs/2512.23519
作者: Donghao Zhou,Jingyu Lin,Guibao Shen,Quande Liu,Jialin Gao,Lihao Liu,Lan Du,Cunjian Chen,Chi-Wing Fu,Xiaowei Hu,Pheng-Ann Heng
机构: 1. The Chinese University of Hong Kong (香港中文大学); 2. Tsinghua University (清华大学); 3. Alibaba Group (阿里巴巴集团); 4. Peking University (北京大学); 5. Shanghai Jiao Tong University (上海交通大学); 6. University of California, Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by AAAI2026 (Project page: this https URL )

点击查看摘要

Abstract:Recent visual generative models enable story generation with consistent characters from text, but human-centric story generation faces additional challenges, such as maintaining detailed and diverse human face consistency and coordinating multiple characters across different images. This paper presents IdentityStory, a framework for human-centric story generation that ensures consistent character identity across multiple sequential images. By taming identity-preserving generators, the framework features two key components: Iterative Identity Discovery, which extracts cohesive character identities, and Re-denoising Identity Injection, which re-denoises images to inject identities while preserving desired context. Experiments on the ConsiStory-Human benchmark demonstrate that IdentityStory outperforms existing methods, particularly in face consistency, and supports multi-character combinations. The framework also shows strong potential for applications such as infinite-length story generation and dynamic character composition.
zh

[CV-20] Multi-label Classification with Panoptic Context Aggregation Networks

【速读】:该论文旨在解决当前视觉识别中上下文建模不足的问题,尤其是现有方法多局限于基础几何关系或局部特征,忽视了物体间的跨尺度上下文交互。其解决方案的关键在于提出深度全景上下文聚合网络(PanCAN),通过在高维希尔伯特空间中进行分层的多阶几何上下文整合,利用随机游走与注意力机制学习各尺度下的多阶邻域关系,并以细粒度尺度上的显著锚点为引导,动态融合不同尺度的邻域特征,从而实现有效的跨尺度建模,显著提升复杂场景的理解能力。

链接: https://arxiv.org/abs/2512.23486
作者: Mingyuan Jiu,Hailong Zhu,Wenchuan Wei,Hichem Sahbi,Rongrong Ji,Mingliang Xu
机构: Zhengzhou University (郑州大学); Sorbonne University (索邦大学); Xiamen University (厦门大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Context modeling is crucial for visual recognition, enabling highly discriminative image representations by integrating both intrinsic and extrinsic relationships between objects and labels in images. A limitation in current approaches is their focus on basic geometric relationships or localized features, often neglecting cross-scale contextual interactions between objects. This paper introduces the Deep Panoptic Context Aggregation Network (PanCAN), a novel approach that hierarchically integrates multi-order geometric contexts through cross-scale feature aggregation in a high-dimensional Hilbert space. Specifically, PanCAN learns multi-order neighborhood relationships at each scale by combining random walks with an attention mechanism. Modules from different scales are cascaded, where salient anchors at a finer scale are selected and their neighborhood features are dynamically fused via attention. This enables effective cross-scale modeling that significantly enhances complex scene understanding by combining multi-order and cross-scale context-aware features. Extensive multi-label classification experiments on NUS-WIDE, PASCAL VOC2007, and MS-COCO benchmarks demonstrate that PanCAN consistently achieves competitive results, outperforming state-of-the-art techniques in both quantitative and qualitative evaluations, thereby substantially improving multi-label classification performance.
zh
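摘要中的“多阶邻域关系”可用归一化邻接矩阵的幂来近似:k 阶项对应 k 步随机游走。下面的草图用固定权重 `alphas` 聚合各阶传播结果;论文中这些权重由注意力机制学习,此处仅为示意:

```python
import numpy as np

def multi_order_aggregate(X, A, alphas):
    """X: (n, d) 节点特征;A: (n, n) 邻接矩阵;alphas: 各阶权重。
    返回 sum_k alphas[k] * A_hat^k @ X,其中 A_hat 为行归一化(随机游走)邻接。"""
    A_hat = A / np.clip(A.sum(axis=1, keepdims=True), 1e-8, None)
    out = alphas[0] * X          # 0 阶:节点自身特征
    Z = X
    for k in range(1, len(alphas)):
        Z = A_hat @ Z            # k 步随机游走传播
        out = out + alphas[k] * Z
    return out

n, d = 6, 4
A = (np.random.rand(n, n) > 0.5).astype(float)
np.fill_diagonal(A, 1.0)
X = np.random.randn(n, d)
print(multi_order_aggregate(X, A, alphas=[0.5, 0.3, 0.2]).shape)  # (6, 4)
```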

[CV-21] TV-RAG: A Temporal-aware and Semantic Entropy-Weighted Framework for Long Video Retrieval and Understanding

【速读】:该论文旨在解决大型视频语言模型(Large Video Language Models, LVLMs)在处理长视频时存在的两大核心问题:一是时间窗口狭窄,难以捕捉长时间跨度内的细粒度语义变化;二是主流基于文本的检索管道依赖表面词汇重叠,忽视了视觉、音频与字幕通道之间的丰富时序依赖关系。解决方案的关键在于提出一种无需训练的架构TV-RAG,其核心机制包括:(i) 时间衰减检索模块(time-decay retrieval module),通过显式引入时间偏移量优化相似性计算,使文本查询能按真实多媒体上下文排序;(ii) 熵加权关键帧采样器(entropy-weighted key-frame sampler),以均匀间隔选取信息密度高的帧,降低冗余并保持代表性。该方法实现了无需重新训练即可嵌入任意LVLM的双层推理流程,显著提升了长视频理解性能,在Video-MME、MLVU和LongVideoBench等基准上优于多数主流基线。

链接: https://arxiv.org/abs/2512.23483
作者: Zongsheng Cao,Yangfan He,Anran Liu,Feng Chen,Zepeng Wang,Jun Xie
机构: UMN; PCIE
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large Video Language Models (LVLMs) have rapidly emerged as the focus of multimedia AI research. Nonetheless, when confronted with lengthy videos, these models struggle: their temporal windows are narrow, and they fail to notice fine-grained semantic shifts that unfold over extended durations. Moreover, mainstream text-based retrieval pipelines, which rely chiefly on surface-level lexical overlap, ignore the rich temporal interdependence among visual, audio, and subtitle channels. To mitigate these limitations, we propose TV-RAG, a training-free architecture that couples temporal alignment with entropy-guided semantics to improve long-video reasoning. The framework contributes two main mechanisms: (i) a time-decay retrieval module that injects explicit temporal offsets into the similarity computation, thereby ranking text queries according to their true multimedia context; and (ii) an entropy-weighted key-frame sampler that selects evenly spaced, information-dense frames, reducing redundancy while preserving representativeness. By weaving these temporal and semantic signals together, TV-RAG realises a dual-level reasoning routine that can be grafted onto any LVLM without re-training or fine-tuning. The resulting system offers a lightweight, budget-friendly upgrade path and consistently surpasses most leading baselines across established long-video benchmarks such as Video-MME, MLVU, and LongVideoBench, confirming the effectiveness of our model. The code can be found at this https URL.
zh
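下面把摘要中的两个机制各写成一个小函数:(i) 用指数时间衰减调制文本-片段相似度(具体衰减形式为笔者假设);(ii) 以灰度直方图熵度量信息密度,在均匀分段内取熵最大的帧作为关键帧:

```python
import numpy as np

def time_decay_score(sim, t_query, t_segment, lam=0.05):
    """用时间偏移的指数衰减调制文本-片段相似度。"""
    return sim * np.exp(-lam * abs(t_query - t_segment))

def frame_entropy(gray):
    """8-bit 灰度帧的直方图熵,作为信息密度的粗略度量。"""
    p, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def sample_keyframes(frames, k):
    """把视频均匀切成 k 段,每段取熵最大的一帧:兼顾均匀覆盖与信息量。"""
    bounds = np.linspace(0, len(frames), k + 1, dtype=int)
    return [a + int(np.argmax([frame_entropy(f) for f in frames[a:b]]))
            for a, b in zip(bounds[:-1], bounds[1:])]

frames = [np.random.randint(0, 256, (32, 32)) for _ in range(40)]
print(sample_keyframes(frames, k=4))
print(time_decay_score(sim=0.8, t_query=120.0, t_segment=90.0))
```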

[CV-22] SC-Net: Robust Correspondence Learning via Spatial and Cross-Channel Context

【速读】:该论文旨在解决基于卷积神经网络(CNN)的双视图对应学习中,通用Backbone模型难以有效聚合全局上下文信息、在大视差场景下导致稠密运动场过度平滑的问题。解决方案的关键在于提出一种名为SC-Net的新网络架构,其核心创新包括:1)自适应聚焦正则化模块(AFR),通过增强模型的位置感知能力并提升对虚假运动样本的鲁棒性,促进更精确的运动场生成;2)双边场调整模块(BFA),同时建模长距离依赖关系,并在空间与通道维度间实现交互以细化运动场;3)位置感知恢复模块(PAR),确保最终运动矢量的一致性和精度。这些设计共同提升了相对位姿估计和外点去除任务的性能。

链接: https://arxiv.org/abs/2512.23473
作者: Shuyuan Lin,Hailiang Liao,Qiang Qi,Junjie Huang,Taotao Lai,Jian Weng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent research has focused on using convolutional neural networks (CNNs) as the backbones in two-view correspondence learning, demonstrating significant superiority over methods based on multilayer perceptrons. However, CNN backbones that are not tailored to specific tasks may fail to effectively aggregate global context and oversmooth dense motion fields in scenes with large disparity. To address these problems, we propose a novel network named SC-Net, which effectively integrates bilateral context from both spatial and channel perspectives. Specifically, we design an adaptive focused regularization module (AFR) to enhance the model’s position-awareness and robustness against spurious motion samples, thereby facilitating the generation of a more accurate motion field. We then propose a bilateral field adjustment module (BFA) to refine the motion field by simultaneously modeling long-range relationships and facilitating interaction across spatial and channel dimensions. Finally, we recover the motion vectors from the refined field using a position-aware recovery module (PAR) that ensures consistency and precision. Extensive experiments demonstrate that SC-Net outperforms state-of-the-art methods in relative pose estimation and outlier removal tasks on YFCC100M and SUN3D datasets. Source code is available at this http URL.
zh

[CV-23] MCI-Net: A Robust Multi-Domain Context Integration Network for Point Cloud Registration

【速读】:该论文旨在解决点云配准中特征表示不鲁棒、判别性不足的问题,尤其针对现有基于深度学习的方法依赖欧氏邻域策略难以有效捕捉点云隐含语义和结构一致性的局限。其解决方案的关键在于提出多域上下文融合网络(MCI-Net),通过三个核心模块实现:首先,设计图邻域聚合模块构建全局图以捕获点云的整体结构关系;其次,引入渐进式上下文交互模块,通过域内特征解耦与域间上下文交互增强特征判别力;最后,提出动态内点选择方法,利用多轮位姿估计中的残差信息优化内点权重,从而提升配准精度与鲁棒性。

链接: https://arxiv.org/abs/2512.23472
作者: Shuyuan Lin,Wenwu Peng,Junjie Huang,Qiang Qi,Miaohui Wang,Jian Weng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Robust and discriminative feature learning is critical for high-quality point cloud registration. However, existing deep learning-based methods typically rely on Euclidean neighborhood-based strategies for feature extraction, which struggle to effectively capture the implicit semantics and structural consistency in point clouds. To address these issues, we propose a multi-domain context integration network (MCI-Net) that improves feature representation and registration performance by aggregating contextual cues from diverse domains. Specifically, we propose a graph neighborhood aggregation module, which constructs a global graph to capture the overall structural relationships within point clouds. We then propose a progressive context interaction module to enhance feature discriminability by performing intra-domain feature decoupling and inter-domain context interaction. Finally, we design a dynamic inlier selection method that optimizes inlier weights using residual information from multiple iterations of pose estimation, thereby improving the accuracy and robustness of registration. Extensive experiments on indoor RGB-D and outdoor LiDAR datasets show that the proposed MCI-Net significantly outperforms existing state-of-the-art methods, achieving the highest registration recall of 96.4% on 3DMatch. Source code is available at this http URL.
zh
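“利用多轮位姿估计的残差更新内点权重”与经典 IRLS 思路一致。下面给出加权 Kabsch 刚性配准加残差重加权的最小草图(假设对应点已知,权重函数取 1/(r²+ε),与论文的具体形式无关):

```python
import numpy as np

def weighted_kabsch(P, Q, w):
    """求旋转 R 与平移 t,最小化 sum_i w_i ||R P_i + t - Q_i||^2。"""
    w = w / w.sum()
    mp, mq = (w[:, None] * P).sum(0), (w[:, None] * Q).sum(0)
    H = (P - mp).T @ (np.diag(w) @ (Q - mq))
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    return R, mq - R @ mp

def irls_register(P, Q, iters=5, eps=1e-3):
    """每轮用当前位姿的残差更新权重:残差大 -> 权重小,相当于动态内点选择。"""
    w = np.ones(len(P))
    for _ in range(iters):
        R, t = weighted_kabsch(P, Q, w)
        r = np.linalg.norm(P @ R.T + t - Q, axis=1)
        w = 1.0 / (r ** 2 + eps)
    return R, t

R_gt = np.linalg.qr(np.random.randn(3, 3))[0]
R_gt *= np.sign(np.linalg.det(R_gt))          # 保证是旋转而非反射
P = np.random.randn(50, 3)
Q = P @ R_gt.T + np.array([0.5, -0.2, 1.0])
Q[:5] += 2.0 * np.random.randn(5, 3)          # 人为外点
R, t = irls_register(P, Q)
print(np.allclose(R, R_gt, atol=0.1))         # 外点被降权后仍可恢复真值位姿
```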

[CV-24] HY-Motion 1.0: Scaling Flow Matching Models for Text-To-Motion Generation

【速读】:该论文旨在解决3D人体动作生成模型在指令遵循能力与动作质量上的不足,尤其针对当前开源基准性能有限的问题。解决方案的关键在于提出了一种全面的全流程训练范式,包括基于超3000小时运动数据的大规模预训练、400小时精选数据的高质量微调,以及结合人类反馈和奖励模型的强化学习优化,从而实现文本指令的精准对齐与高保真动作生成。此外,通过严谨的数据处理流程完成运动清洗与标注,使模型覆盖超过200个动作类别,显著提升了生成多样性与实用性。

链接: https://arxiv.org/abs/2512.23464
作者: Yuxin Wen,Qing Shuai,Di Kang,Jing Li,Cheng Wen,Yue Qian,Ningxin Jiao,Changhai Chen,Weijie Chen,Yiran Wang,Jinkun Guo,Dongyue An,Han Liu,Yanyu Tong,Chao Zhang,Qing Guo,Juan Chen,Qiao Zhang,Youyi Zhang,Zihao Yao,Cheng Zhang,Hong Duan,Xiaoping Wu,Qi Chen,Fei Cheng,Liang Dong,Peng He,Hao Zhang,Jiaxin Lin,Chao Zhang,Zhongyi Fan,Yifan Li,Zhichao Hu,Yuhong Liu,Linus,Jie Jiang,Xiaolong Li,Linchao Bao
机构: Tencent Hunyuan 3D Digital Human Team (腾讯混元3D数字人团队)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: Github: see this https URL

点击查看摘要

Abstract:We present HY-Motion 1.0, a series of state-of-the-art, large-scale, motion generation models capable of generating 3D human motions from textual descriptions. HY-Motion 1.0 represents the first successful attempt to scale up Diffusion Transformer (DiT)-based flow matching models to the billion-parameter scale within the motion generation domain, delivering instruction-following capabilities that significantly outperform current open-source benchmarks. Uniquely, we introduce a comprehensive, full-stage training paradigm – including large-scale pretraining on over 3,000 hours of motion data, high-quality fine-tuning on 400 hours of curated data, and reinforcement learning from both human feedback and reward models – to ensure precise alignment with the text instruction and high motion quality. This framework is supported by our meticulous data processing pipeline, which performs rigorous motion cleaning and captioning. Consequently, our model achieves the most extensive coverage, spanning over 200 motion categories across 6 major classes. We release HY-Motion 1.0 to the open-source community to foster future research and accelerate the transition of 3D human motion generation models towards commercial maturity.
zh

[CV-25] Deterministic Image-to-Image Translation via Denoising Brownian Bridge Models with Dual Approximators

【速读】:该论文旨在解决图像到图像(Image-to-Image, I2I)翻译中如何实现高保真度且无方差的确定性输出问题,尤其在需要与真实标签(Ground Truth, GT)高度一致的应用场景(如超分辨率)中。传统方法往往存在输出不确定性或质量不足的问题,而本文提出了一种基于去噪布朗桥(Denoising Brownian Bridge)的双近似器模型(Dual-approx Bridge),其核心创新在于引入布朗桥动力学,并设计两个神经网络近似器(一个用于前向过程,一个用于反向过程),从而在保持极低输出方差的同时显著提升图像质量和对GT的忠实度。

链接: https://arxiv.org/abs/2512.23463
作者: Bohan Xiao,Peiyong Wang,Qisheng He,Ming Dong
机构: Wayne State University (韦恩州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Minor correction to a reference entry

点击查看摘要

Abstract:Image-to-Image (I2I) translation involves converting an image from one domain to another. Deterministic I2I translation, such as in image super-resolution, extends this concept by guaranteeing that each input generates a consistent and predictable output, closely matching the ground truth (GT) with high fidelity. In this paper, we propose a denoising Brownian bridge model with dual approximators (Dual-approx Bridge), a novel generative model that exploits the Brownian bridge dynamics and two neural network-based approximators (one for forward and one for reverse process) to produce faithful output with negligible variance and high image quality in I2I translations. Our extensive experiments on benchmark datasets including image generation and super-resolution demonstrate the consistent and superior performance of Dual-approx Bridge in terms of image quality and faithfulness to GT when compared to both stochastic and deterministic baselines. Project page and code: this https URL
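下面用一段最小的 NumPy 代码示意摘要所依赖的布朗桥动力学(两端固定、方差在端点处归零);其中的时间刻度与张量形状均为示例假设,并非论文的双近似器实现:

```python
import numpy as np

def brownian_bridge_sample(x0, xT, t, T=1.0, rng=np.random.default_rng(0)):
    """从两端固定于 x0(t=0)与 xT(t=T)的布朗桥中采样 x_t。
    均值在两端点间线性插值,方差 t*(T-t)/T 在两端点处为零,
    这正是桥过程能在端点处确定性地连接两个图像域的原因。
    (示意代码:张量形状与时间刻度均为示例假设,非论文实现。)"""
    mean = x0 + (t / T) * (xT - x0)
    std = np.sqrt(t * (T - t) / T)
    return mean + std * rng.standard_normal(x0.shape)

# 玩具用例:在"退化输入域"与"目标域"的替身之间插值
x0 = np.zeros((8, 8))   # 例如低质量输入域
xT = np.ones((8, 8))    # 例如目标域(GT)
xt = brownian_bridge_sample(x0, xT, t=0.5)
print(xt.mean(), xt.std())  # 均值约 0.5,标准差约 sqrt(0.25)=0.5
```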
zh

[CV-26] Automated river gauge plate reading using a hybrid object detection and generative AI framework in the Limpopo River Basin

【速读】:该论文旨在解决传统河流水位监测方法因人工测量误差和环境限制导致的精度不足与效率低下问题,从而实现对河流水位的准确、连续自动化监测。其解决方案的关键在于构建一个融合视觉水位线检测(vision-based waterline detection)、YOLOv8姿态尺度提取(YOLOv8 pose scale extraction)与大语言模型(large multimodal language models, LLMs)的混合框架,通过几何校准信息(如刻度间距元数据)显著提升LLMs在读取水位数值时的准确性,尤其在图像质量良好条件下,Gemini 2.0 Flash模型达到均方根误差8.58 cm、决定系数R²为0.84的性能表现,验证了结合几何信息与多模态人工智能对于鲁棒水位估计的有效性。

链接: https://arxiv.org/abs/2512.23454
作者: Kayathri Vigneswaran,Hugo Retief,Jai Clifford Holmes,Mariangel Garcia Andarcia,Hansaka Tennakoon
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 14 figures, 4 tables

点击查看摘要

Abstract:Accurate and continuous monitoring of river water levels is essential for flood forecasting, water resource management, and ecological protection. Traditional hydrological observation methods are often limited by manual measurement errors and environmental constraints. This study presents a hybrid framework integrating vision based waterline detection, YOLOv8 pose scale extraction, and large multimodal language models (GPT 4o and Gemini 2.0 Flash) for automated river gauge plate reading. The methodology involves sequential stages of image preprocessing, annotation, waterline detection, scale gap estimation, and numeric reading extraction. Experiments demonstrate that waterline detection achieved high precision of 94.24 percent and an F1 score of 83.64 percent, while scale gap detection provided accurate geometric calibration for subsequent reading extraction. Incorporating scale gap metadata substantially improved the predictive performance of LLMs, with Gemini Stage 2 achieving the highest accuracy, with a mean absolute error of 5.43 cm, root mean square error of 8.58 cm, and R squared of 0.84 under optimal image conditions. Results highlight the sensitivity of LLMs to image quality, with degraded images producing higher errors, and underscore the importance of combining geometric metadata with multimodal artificial intelligence for robust water level estimation. Overall, the proposed approach offers a scalable, efficient, and reliable solution for automated hydrological monitoring, demonstrating potential for real time river gauge digitization and improved water resource management.
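作为补充说明,下面用一段示意代码演示"刻度间距标定 → 水位读数"的几何换算思路;其中 10 cm 的刻度间距与各像素坐标均为假设值,论文未在摘要中给出具体实现:

```python
def water_level_cm(waterline_y, ref_tick_y, ref_tick_value_cm,
                   tick_gap_px, tick_gap_cm=10.0):
    """把检测到的水位线像素行换算成水尺读数(cm)。
    ref_tick_y / ref_tick_value_cm:某一可读刻度的像素行与数值(LLM 阶段提取);
    tick_gap_px:相邻刻度的像素距离(YOLOv8 pose 阶段);
    tick_gap_cm:刻度物理间距,此处取 10 cm 为假设值。"""
    cm_per_px = tick_gap_cm / tick_gap_px
    # 图像行坐标向下增大,而水尺数值向上增大,故取负向修正
    return ref_tick_value_cm - (waterline_y - ref_tick_y) * cm_per_px

# "120 cm" 刻度位于第 310 行,刻度间距 42 px,水位线在第 415 行
print(water_level_cm(415, 310, 120.0, 42))  # ≈ 95 cm
```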
zh

[CV-27] CoFi-Dec: Hallucination-Resistant Decoding via Coarse-to-Fine Generative Feedback in Large Vision-Language Models

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在多模态理解与生成任务中普遍存在幻觉(hallucination)问题,即模型输出内容与输入图像不一致,从而限制其在真实场景中的可靠性。解决方案的关键在于提出一种无需训练的解码框架 CoFi-Dec,其核心创新是融合生成式自反馈(generative self-feedback)与粗粒度到细粒度的视觉条件控制(coarse-to-fine visual conditioning):首先基于图像的粗粒度和细粒度视图生成两个中间文本响应,再通过文本到图像模型将其转化为合成图像以构建多层次视觉假设;随后引入基于 Wasserstein 距离的融合机制,将不同视觉条件下预测分布对齐至几何一致的解码轨迹,从而在保持高层语义一致性的同时强化细粒度视觉锚定,显著降低实体级和语义级幻觉。该方法具有模型无关性、无需额外训练且可广泛适配现有 LVLM 的优势。

链接: https://arxiv.org/abs/2512.23453
作者: Zongsheng Cao,Yangfan He,Anran Liu,Jun Xie,Feng Chen,Zepeng Wang
机构: UMN; PCIE
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Vision-Language Models (LVLMs) have achieved impressive progress in multi-modal understanding and generation. However, they still tend to produce hallucinated content that is inconsistent with the visual input, which limits their reliability in real-world applications. We propose \textbfCoFi-Dec, a training-free decoding framework that mitigates hallucinations by integrating generative self-feedback with coarse-to-fine visual conditioning. Inspired by the human visual process from global scene perception to detailed inspection, CoFi-Dec first generates two intermediate textual responses conditioned on coarse- and fine-grained views of the original image. These responses are then transformed into synthetic images using a text-to-image model, forming multi-level visual hypotheses that enrich grounding cues. To unify the predictions from these multiple visual conditions, we introduce a Wasserstein-based fusion mechanism that aligns their predictive distributions into a geometrically consistent decoding trajectory. This principled fusion reconciles high-level semantic consistency with fine-grained visual grounding, leading to more robust and faithful outputs. Extensive experiments on six hallucination-focused benchmarks show that CoFi-Dec substantially reduces both entity-level and semantic-level hallucinations, outperforming existing decoding strategies. The framework is model-agnostic, requires no additional training, and can be seamlessly applied to a wide range of LVLMs. The implementation is available at this https URL.
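下面给出一个基于 1-D Wasserstein 距离为多视觉条件的预测分布分配融合权重的简化草图(逆距离加权只是对"几何一致融合"思想的近似,函数与权重方案均为本文注解的假设,非 CoFi-Dec 实现):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def fuse_predictions(dists):
    """融合多个视觉条件下的下一 token 预测分布(每个形状为 [V])。
    以各分布到平均分布的 1-D Wasserstein 距离的倒数作权重,
    使偏离共识的条件(如易致幻的视图)贡献更小。
    这是对 CoFi-Dec"几何一致融合"思想的简化示意,并非论文实现。"""
    dists = np.asarray(dists, dtype=float)
    support = np.arange(dists.shape[1])          # 词表索引作为支撑集
    anchor = dists.mean(axis=0)                  # 以均值分布为锚
    d = np.array([wasserstein_distance(support, support, p, anchor) + 1e-6
                  for p in dists])
    w = (1.0 / d) / (1.0 / d).sum()
    fused = (w[:, None] * dists).sum(axis=0)
    return fused / fused.sum()

coarse = np.array([0.7, 0.2, 0.1])   # 粗粒度视图下的预测
fine   = np.array([0.5, 0.4, 0.1])   # 细粒度视图下的预测
print(fuse_predictions([coarse, fine]))
```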
zh

[CV-28] Stochastic Siamese MAE Pretraining for Longitudinal Medical Images

【速读】:该论文旨在解决当前自监督学习方法(如掩码自动编码器,MAE)在纵向医学影像数据中缺乏时间感知能力的问题,从而无法有效捕捉疾病进展的时序动态。其解决方案的关键在于提出STAMP(Stochastic Temporal Autoencoder with Masked Pretraining),一种基于Siamese架构的MAE框架,通过引入随机过程对输入图像对的时间差进行条件建模,将MAE重建损失重构为条件变分推断目标,从而以概率方式学习非确定性的疾病演化规律。这一设计使模型能够更好地表征具有不确定性的病理进程,在OCT和MRI多访视数据集上的实验表明,STAMP预训练的ViT模型在晚期年龄相关性黄斑变性和阿尔茨海默病进展预测任务中优于现有时序MAE方法及基础模型。

链接: https://arxiv.org/abs/2512.23441
作者: Taha Emre,Arunava Chakravarty,Thomas Pinetz,Dmitrii Lachinov,Martin J. Menten,Hendrik Scholl,Sobha Sivaprasad,Daniel Rueckert,Andrew Lotery,Stefan Sacu,Ursula Schmidt-Erfurth,Hrvoje Bogunović
机构: Medical University of Vienna (维也纳医科大学); Imperial College London (帝国理工学院); Technical University of Munich (慕尼黑工业大学); University College London (伦敦大学学院); University of Southampton (南安普顿大学); Pallas Kliniken AG (帕拉斯诊所集团); European Vision Institute (欧洲视觉研究所); Alzheimer’s Disease Neuroimaging Initiative (阿尔茨海默病神经影像倡议)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Under review. Code is available in this https URL

点击查看摘要

Abstract:Temporally aware image representations are crucial for capturing disease progression in 3D volumes of longitudinal medical datasets. However, recent state-of-the-art self-supervised learning approaches like Masked Autoencoding (MAE), despite their strong representation learning capabilities, lack temporal awareness. In this paper, we propose STAMP (Stochastic Temporal Autoencoder with Masked Pretraining), a Siamese MAE framework that encodes temporal information through a stochastic process by conditioning on the time difference between the 2 input volumes. Unlike deterministic Siamese approaches, which compare scans from different time points but fail to account for the inherent uncertainty in disease evolution, STAMP learns temporal dynamics stochastically by reframing the MAE reconstruction loss as a conditional variational inference objective. We evaluated STAMP on two OCT and one MRI datasets with multiple visits per patient. STAMP pretrained ViT models outperformed both existing temporal MAE methods and foundation models on different late stage Age-Related Macular Degeneration and Alzheimer’s Disease progression prediction which require models to learn the underlying non-deterministic temporal dynamics of the diseases.
zh

[CV-29] RealX3D: A Physically-Degraded 3D Benchmark for Multi-view Visual Restoration and Reconstruction

【速读】:该论文旨在解决多视角视觉恢复与三维重建在真实物理退化条件下的性能下降问题,即当前方法在面对光照变化、散射、遮挡和模糊等现实世界退化时,重建质量显著降低的问题。解决方案的关键在于构建了一个名为RealX3D的实拍基准数据集,该数据集系统性地将退化分为四类(光照、散射、遮挡和模糊),并在统一采集协议下获取像素对齐的低质量(Low-Quality, LQ)与高质量(Ground Truth, GT)视图,同时包含高分辨率图像、RAW数据及密集激光扫描,从而生成世界尺度的网格和度量深度,为评估优化型和前馈式方法提供可靠标准,揭示现有多视角流水线在真实复杂环境中的脆弱性。

链接: https://arxiv.org/abs/2512.23437
作者: Shuhong Liu,Chenyu Bao,Ziteng Cui,Yun Liu,Xuangeng Chu,Lin Gu,Marcos V. Conde,Ryo Umagami,Tomohiro Hashimoto,Zijian Hu,Tianhan Xu,Yuan Gan,Yusuke Kurose,Tatsuya Harada
机构: The University of Tokyo (东京大学); NII (日本信息学研究所); Tohoku University (东北大学); University of Würzburg (维尔茨堡大学); RIKEN (理化学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:We introduce RealX3D, a real-capture benchmark for multi-view visual restoration and 3D reconstruction under diverse physical degradations. RealX3D groups corruptions into four families, including illumination, scattering, occlusion, and blurring, and captures each at multiple severity levels using a unified acquisition protocol that yields pixel-aligned LQ/GT views. Each scene includes high-resolution capture, RAW images, and dense laser scans, from which we derive world-scale meshes and metric depth. Benchmarking a broad range of optimization-based and feed-forward methods shows substantial degradation in reconstruction quality under physical corruptions, underscoring the fragility of current multi-view pipelines in real-world challenging environments.
zh

[CV-30] Fuzzy-Logic and Deep Learning for Environmental Condition-Aware Road Surface Classification

【速读】:该论文旨在解决传统道路状态监测方法成本高、系统性差且耗时的问题,提出一种基于天气条件和道路表面状况数据的实时监测系统。其关键在于利用移动手机摄像头采集校园周边道路图像,并结合道路加速度数据作为输入特征,通过多种深度学习算法(如AlexNet、LeNet、VGG和ResNet)进行道路类型分类,实现了对沥青路、破损沥青路、碎石路、破损碎石路及铺砌路面共5类道路的识别,准确率超过95%;此外,还引入模糊逻辑方法根据天气和一天中的时段动态选择使用加速度或图像数据进行分类,从而提升系统的适应性和鲁棒性。

链接: https://arxiv.org/abs/2512.23436
作者: Mustafa Demetgul,Sanja Lazarova Molnar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Monitoring the state of road surfaces provides valuable information for planning and controlling vehicles and for active vehicle control systems. Classical road monitoring methods are expensive and unsystematic because they require time-consuming measurements. This article proposes a real-time system based on weather-condition data and road-surface-condition data. For this purpose, we collected data with a mobile phone camera on the roads around the campus of the Karlsruhe Institute of Technology. We tested a large number of different image-based deep learning algorithms for road classification. In addition, we used road acceleration data alongside road image data for training, rendering the acceleration signals as images, and compared the performance of the acceleration-based and camera-image-based approaches. AlexNet, LeNet, VGG, and ResNet were compared as the deep learning architectures. For road condition classification, five classes were considered (asphalt, damaged asphalt, gravel road, damaged gravel road, and paved road), and over 95% accuracy was achieved. It is also proposed to select between the acceleration data and the camera image for road surface classification according to the weather and the time of day using fuzzy logic.
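摘要中"用模糊逻辑按天气与时段选择模态"的思路可以用如下极简草图说明;隶属度函数形状与降雨强度刻度均为示例假设,论文摘要未公开其规则库:

```python
def tri(x, a, b, c):
    """三角形隶属度函数,在 b 处取峰值 1。"""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def choose_modality(hour, rain_intensity):
    """按天气与时段(模糊逻辑风格)选择用于路面分类的传感器。
    隶属度形状与 0-1 降雨强度刻度均为示例假设,摘要未给出规则库。"""
    daylight = tri(hour, 5, 13, 21)                 # 正午前后光照充足
    dry = 1.0 - min(max(rain_intensity, 0.0), 1.0)  # 越不下雨越"干燥"
    camera_score = min(daylight, dry)               # 规则:白天 AND 干燥 -> 用相机
    accel_score = 1.0 - camera_score                # 否则退回加速度数据
    return "camera" if camera_score >= accel_score else "acceleration"

print(choose_modality(hour=12, rain_intensity=0.1))  # camera
print(choose_modality(hour=23, rain_intensity=0.0))  # acceleration
```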
zh

[CV-31] owards Integrating Uncertainty for Domain-Agnostic Segmentation NEURIPS2025

【速读】:该论文旨在解决生成式分割模型(如Segment Anything Model, SAM)在分布偏移或知识有限的领域中性能下降的问题,即模型在零样本场景下虽表现优异,但在实际应用中面对阴影、透明、伪装等复杂条件时鲁棒性不足。解决方案的关键在于引入不确定性量化(uncertainty quantification),通过构建UncertSAM基准测试集(包含8个挑战性数据集)评估多种轻量级后验不确定性估计方法,并初步探索基于不确定性的预测优化步骤。实验表明,最后一层拉普拉斯近似(last-layer Laplace approximation)能有效捕捉分割误差,提供有意义的不确定性信号,为提升模型在跨域场景下的泛化能力提供了新路径。

链接: https://arxiv.org/abs/2512.23427
作者: Jesse Brouwers,Xiaoyan Xing,Alexander Timans
机构: UvA-Bosch Delta Lab, University of Amsterdam (阿姆斯特丹大学-博世德尔塔实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Public code at this https URL | published at the 2nd Workshop on Frontiers in Probabilistic Inference (NeurIPS 2025) | 12 pages, 8 figures (incl. Appendix)

点击查看摘要

Abstract:Foundation models for segmentation such as the Segment Anything Model (SAM) family exhibit strong zero-shot performance, but remain vulnerable in shifted or limited-knowledge domains. This work investigates whether uncertainty quantification can mitigate such challenges and enhance model generalisability in a domain-agnostic manner. To this end, we (1) curate UncertSAM, a benchmark comprising eight datasets designed to stress-test SAM under challenging segmentation conditions including shadows, transparency, and camouflage; (2) evaluate a suite of lightweight, post-hoc uncertainty estimation methods; and (3) assess a preliminary uncertainty-guided prediction refinement step. Among evaluated approaches, a last-layer Laplace approximation yields uncertainty estimates that correlate well with segmentation errors, indicating a meaningful signal. While refinement benefits are preliminary, our findings underscore the potential of incorporating uncertainty into segmentation models to support robust, domain-agnostic performance. Our benchmark and code are made publicly available.
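下面给出"最后一层拉普拉斯近似"的最小 PyTorch 草图:用对角经验 Fisher 近似 Hessian,并以蒙特卡洛采样估计预测不确定性;这只是该技术的一种常见简化,未必与论文评测的变体一致:

```python
import torch

def last_layer_laplace(feats, labels, W, prior_prec=1.0):
    """对最后一个线性层(无偏置)做对角拉普拉斯近似。
    feats: [N, D] 倒数第二层特征,labels: [N],W: [C, D] MAP 权重。
    用对角经验 Fisher 近似 Hessian;仅为最小示意,非论文评测的具体变体。"""
    logits = feats @ W.T
    probs = logits.softmax(dim=-1)                           # [N, C]
    onehot = torch.nn.functional.one_hot(labels, W.shape[0]).float()
    g = (probs - onehot).unsqueeze(-1) * feats.unsqueeze(1)  # 逐样本梯度 [N, C, D]
    fisher_diag = (g ** 2).sum(dim=0)                        # [C, D]
    return 1.0 / (fisher_diag + prior_prec)                  # 对角后验方差

def predictive_std(feat, W, post_var, n_samples=64):
    """在权重后验下用蒙特卡洛估计 logits 的预测标准差。"""
    eps = torch.randn(n_samples, *W.shape)
    Ws = W + eps * post_var.sqrt()
    logits = torch.einsum("scd,d->sc", Ws, feat)
    return logits.std(dim=0)  # 标准差大 -> 该样本更可能分割出错

feats = torch.randn(100, 16)
labels = torch.randint(0, 2, (100,))
W = torch.zeros(2, 16)
var = last_layer_laplace(feats, labels, W)
print(predictive_std(feats[0], W, var))
```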
zh

[CV-32] Direct Diffusion Score Preference Optimization via Stepwise Contrastive Policy-Pair Supervision

【速读】:该论文旨在解决扩散模型(Diffusion Models)在文本到图像生成任务中难以精准对齐用户意图以及维持一致美学质量的问题。现有基于偏好训练的方法(如Diffusion Direct Preference Optimization)虽能缓解上述问题,但依赖于昂贵且可能噪声较大的人工标注数据集。其解决方案的关键在于提出直接扩散得分偏好优化(Direct Diffusion Score Preference Optimization, DDSPO),该方法通过可用的胜出策略与失败策略,在每个去噪时间步(per-timestep)直接提取监督信号,从而提供密集的、跨去噪轨迹的过渡级(transition-level)反馈。为避免依赖人工标注数据,DDSPO进一步利用预训练参考模型自动生成偏好信号:对比该模型在原始提示词与语义退化变体条件下的输出差异,实现无需显式奖励建模或人工标注即可进行有效的得分空间偏好监督(score-space preference supervision)。

链接: https://arxiv.org/abs/2512.23426
作者: Dohyun Kim,Seungwoo Lyu,Seung Wook Kim,Paul Hongsuck Seo
机构: Korea University (韩国大学); NVIDIA (英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models have achieved impressive results in generative tasks such as text-to-image synthesis, yet they often struggle to fully align outputs with nuanced user intent and maintain consistent aesthetic quality. Existing preference-based training methods like Diffusion Direct Preference Optimization help address these issues but rely on costly and potentially noisy human-labeled datasets. In this work, we introduce Direct Diffusion Score Preference Optimization (DDSPO), which directly derives per-timestep supervision from winning and losing policies when such policies are available. Unlike prior methods that operate solely on final samples, DDSPO provides dense, transition-level signals across the denoising trajectory. In practice, we avoid reliance on labeled data by automatically generating preference signals using a pretrained reference model: we contrast its outputs when conditioned on original prompts versus semantically degraded variants. This practical strategy enables effective score-space preference supervision without explicit reward modeling or manual annotations. Empirical results demonstrate that DDSPO improves text-image alignment and visual quality, outperforming or matching existing preference-based methods while requiring significantly less supervision. Our implementation is available at: this https URL
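下面以 Diffusion-DPO 风格的逐时间步偏好损失做一个思想草图:err_win 与 err_lose 分别是同一含噪潜变量在原始提示与语义退化提示条件下的去噪误差;该形式仅为示意,并非 DDSPO 的精确目标函数:

```python
import torch
import torch.nn.functional as F

def ddspo_step_loss(err_win, err_lose, beta=1.0):
    """逐时间步的偏好损失(Diffusion-DPO 风格的思想草图,非 DDSPO 精确目标)。
    err_win / err_lose:同一含噪潜变量分别在原始提示(胜出策略)与
    语义退化提示(失败策略)条件下的去噪误差 ||eps_theta - eps||^2。
    最小化该损失会促使模型在每一步都更偏向胜出条件。"""
    return -F.logsigmoid(-beta * (err_win - err_lose)).mean()

# 玩具数字:当前模型在退化提示下反而去噪更好(偏好被违反)
err_win = torch.tensor([0.9, 1.1])
err_lose = torch.tensor([0.8, 1.0])
print(ddspo_step_loss(err_win, err_lose))  # 大于 log 2,提示需要纠正
```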
zh

[CV-33] DriveLaW:Unifying Planning and Video Generation in a Latent Driving World

【速读】:该论文旨在解决当前自动驾驶中世界模型(World Model)与运动规划(Motion Planning)之间存在的解耦问题,即现有方法虽将两者集成于统一架构中,但实际仍独立运行,导致未来场景生成与轨迹规划之间缺乏一致性。解决方案的关键在于提出DriveLaW这一统一范式,通过将视频生成器的潜在表示(latent representation)直接注入扩散规划器(diffusion planner),实现高保真未来预测与可靠轨迹生成之间的内在一致性。该方案由两个核心组件构成:DriveLaW-Video用于生成高质量预测视频并提取表达性强的潜在特征,DriveLaW-Act则基于该潜在表示生成一致且可靠的轨迹,并采用三阶段渐进式训练策略优化整体性能,从而在视频预测和路径规划任务上均取得新的SOTA结果。

链接: https://arxiv.org/abs/2512.23421
作者: Tianze Xia,Yongkang Li,Lijun Zhou,Jingfeng Yao,Kaixin Xiong,Haiyang Sun,Bing Wang,Kun Ma,Hangjun Ye,Wenyu Liu,Xinggang Wang
机构: Huazhong University of Science and Technology (华中科技大学); Xiaomi EV (小米汽车)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 7 figures

点击查看摘要

Abstract:World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, our powerful world model that generates high-fidelity forecasting with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latent of DriveLaW-Video, with both components optimized by a three-stage progressive training strategy. The power of our unified paradigm is demonstrated by new state-of-the-art results across both tasks. DriveLaW not only advances video prediction significantly, surpassing best-performing work by 33.3% in FID and 1.8% in FVD, but also achieves a new record on the NAVSIM planning benchmark.
zh

[CV-34] Bridging Cognitive Gap: Hierarchical Description Learning for Artistic Image Aesthetics Assessment AAAI2026

【速读】:该论文旨在解决生成式 AI (Generative AI) 内容中审美质量评估任务所面临的两大核心挑战:一是数据稀缺与不平衡问题,现有数据集过度聚焦于视觉感知维度而忽视认知与情感等深层维度,主要受限于人工标注成本;二是模型碎片化问题,当前视觉网络采用多分支编码器分离处理审美属性,而基于对比学习的多模态方法难以有效建模长文本描述。解决方案的关键在于提出两个创新:其一,构建大规模(70k样本)、多维结构化的精炼审美描述(Refined Aesthetic Description, RAD)数据集,通过迭代生成流程实现低成本、可扩展的数据构建;其二,设计 ArtQuant 框架,利用大语言模型(LLM)解码器增强对长文本语义的理解,并通过联合描述生成耦合孤立的审美维度,从而在理论上证明该方案可最小化预测熵,提升审美判断的准确性与一致性。

链接: https://arxiv.org/abs/2512.23413
作者: Henglin Liu,Nisha Huang,Chang Liu,Jiangpeng Yan,Huijuan Huang,Jixuan Ying,Tong-Yee Lee,Pengfei Wan,Xiangyang Ji
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI2026,Project Page: this https URL

点击查看摘要

Abstract:The aesthetic quality assessment task is crucial for developing a human-aligned quantitative evaluation system for AIGC. However, its inherently complex nature, spanning visual perception, cognition, and emotion, poses fundamental challenges. Although aesthetic descriptions offer a viable representation of this complexity, two critical challenges persist: (1) data scarcity and imbalance: existing dataset overly focuses on visual perception and neglects deeper dimensions due to the expensive manual annotation; and (2) model fragmentation: current visual networks isolate aesthetic attributes with multi-branch encoder, while multimodal methods represented by contrastive learning struggle to effectively process long-form textual descriptions. To resolve challenge (1), we first present the Refined Aesthetic Description (RAD) dataset, a large-scale (70k), multi-dimensional structured dataset, generated via an iterative pipeline without heavy annotation costs and easy to scale. To address challenge (2), we propose ArtQuant, an aesthetics assessment framework for artistic images which not only couples isolated aesthetic dimensions through joint description generation, but also better models long-text semantics with the help of LLM decoders. Besides, theoretical analysis confirms this symbiosis: RAD’s semantic adequacy (data) and generation paradigm (model) collectively minimize prediction entropy, providing mathematical grounding for the framework. Our approach achieves state-of-the-art performance on several datasets while requiring only 33% of conventional training epochs, narrowing the cognitive gap between artistic images and aesthetic judgment. We will release both code and dataset to support future research.
zh

[CV-35] SOFTooth: Semantics-Enhanced Order-Aware Fusion for Tooth Instance Segmentation

【速读】:该论文旨在解决三维(3D)牙齿实例分割中存在的挑战,包括牙列拥挤、牙龈边界模糊、缺牙以及临床重要的第三磨牙识别困难等问题。传统基于几何特征的3D方法常出现边界泄漏、中心漂移和牙齿实例标识不一致等缺陷,尤其在少数类和复杂解剖结构下表现不佳;而2D基础模型如Segment Anything Model(SAM)虽具备强边界感知语义能力,但直接应用于3D场景在临床流程中不可行。解决方案的关键在于提出SOFTooth框架——一个语义增强且顺序感知的2D-3D融合方法,其核心创新包括:1)通过点级残差门控模块将咬合视角下的SAM嵌入注入3D点特征,以精化牙龈与牙齿间及牙齿间的边界;2)引入中心引导的掩码优化策略,约束实例掩码与几何中心的一致性,缓解中心漂移;3)设计顺序感知的匈牙利匹配策略,融合解剖学牙齿顺序和中心距离信息进行相似度分配,确保在缺牙或拥挤情况下仍能保持标签一致性。该方法无需显式的2D掩码监督即可有效迁移2D语义至3D牙齿分割任务,并在3DTeethSeg’22数据集上实现最优性能,尤其在第三磨牙分割任务中显著提升。

链接: https://arxiv.org/abs/2512.23411
作者: Xiaolan Li,Wanquan Liu,Pengcheng Li,Pengyu Jie,Chenqiang Gao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures

点击查看摘要

Abstract:Three-dimensional (3D) tooth instance segmentation remains challenging due to crowded arches, ambiguous tooth-gingiva boundaries, missing teeth, and rare yet clinically important third molars. Native 3D methods relying on geometric cues often suffer from boundary leakage, center drift, and inconsistent tooth identities, especially for minority classes and complex anatomies. Meanwhile, 2D foundation models such as the Segment Anything Model (SAM) provide strong boundary-aware semantics, but directly applying them in 3D is impractical in clinical workflows. To address these issues, we propose SOFTooth, a semantics-enhanced, order-aware 2D-3D fusion framework that leverages frozen 2D semantics without explicit 2D mask supervision. First, a point-wise residual gating module injects occlusal-view SAM embeddings into 3D point features to refine tooth-gingiva and inter-tooth boundaries. Second, a center-guided mask refinement regularizes consistency between instance masks and geometric centroids, reducing center drift. Furthermore, an order-aware Hungarian matching strategy integrates anatomical tooth order and center distance into similarity-based assignment, ensuring coherent labeling even under missing or crowded dentitions. On 3DTeethSeg’22, SOFTooth achieves state-of-the-art overall accuracy and mean IoU, with clear gains on cases involving third molars, demonstrating that rich 2D semantics can be effectively transferred to 3D tooth instance segmentation without 2D fine-tuning.
zh

[CV-36] A unified framework for detecting point and collective anomalies in operating system logs via collaborative transformers

【速读】:该论文旨在解决传统日志异常检测方法在处理多模态日志数据时存在的局限性问题,即单模态方法因忽略日志的不同模态信息而性能受限,而现有多模态方法又难以有效建模各模态之间的交互关系。其解决方案的关键在于提出 CoLog 框架,通过协同 Transformer(collaborative transformers)与多头注意力机制(multi-head impressed attention)来显式学习多个日志模态间的复杂交互关系,并引入模态适配层(modality adaptation layer)以缓解因模态异质性带来的表示不一致性,从而实现对点异常和集体异常的高精度联合检测,显著提升了日志异常检测的全面性和鲁棒性。

链接: https://arxiv.org/abs/2512.23380
作者: Mohammad Nasirzadeh,Jafar Tahmoresnezhad,Parviz Rashidi-Khazaee
机构: University of Urmia (乌尔米亚大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI); Operating Systems (cs.OS)
备注: 72 pages, 19 figures, 19 tables, accepted in scientific reports on 5 November 2025

点击查看摘要

Abstract:Log anomaly detection is crucial for preserving the security of operating systems. Depending on the source of log data collection, various information is recorded in logs that can be considered log modalities. In light of this intuition, unimodal methods often struggle by ignoring the different modalities of log data. Meanwhile, multimodal methods fail to handle the interactions between these modalities. Applying multimodal sentiment analysis to log anomaly detection, we propose CoLog, a framework that collaboratively encodes logs utilizing various modalities. CoLog utilizes collaborative transformers and multi-head impressed attention to learn interactions among several modalities, ensuring comprehensive anomaly detection. To handle the heterogeneity caused by these interactions, CoLog incorporates a modality adaptation layer, which adapts the representations from different log modalities. This methodology enables CoLog to learn nuanced patterns and dependencies within the data, enhancing its anomaly detection capabilities. Extensive experiments demonstrate CoLog’s superiority over existing state-of-the-art methods. Furthermore, in detecting both point and collective anomalies, CoLog achieves a mean precision of 99.63%, a mean recall of 99.59%, and a mean F1 score of 99.61% across seven benchmark datasets for log-based anomaly detection. The comprehensive detection capabilities of CoLog make it highly suitable for cybersecurity, system monitoring, and operational efficiency. CoLog represents a significant advancement in log anomaly detection, providing a sophisticated and effective solution to point and collective anomaly detection through a unified framework and a solution to the complex challenges automatic log data analysis poses. We also provide the implementation of CoLog at this https URL.
zh

[CV-37] SoulX-LiveTalk Technical Report

【速读】:该论文旨在解决大规模扩散模型在实时、无限时长音频驱动虚拟人生成中面临的工程挑战,核心矛盾在于计算负载与严格延迟约束之间的冲突。现有方法通常通过强制单向注意力机制或降低模型容量来妥协视觉保真度。其解决方案的关键在于提出SoulX-LiveTalk框架,采用自校正双向蒸馏(Self-correcting Bidirectional Distillation)策略,在视频片段内保留双向注意力以维持关键时空相关性,从而显著提升动作连贯性和视觉细节;同时引入多步回溯自校正机制(Multi-step Retrospective Self-Correction Mechanism),实现无限生成过程中的误差自主恢复,防止系统崩溃。此外,通过混合序列并行、并行VAE和内核级优化组成的全栈推理加速套件,实现了14B参数模型在0.87秒启动延迟下达到32 FPS的实时吞吐量,首次达成高保真交互式数字人合成的新标准。

链接: https://arxiv.org/abs/2512.23379
作者: Le Shen,Qiao Qian,Tan Yu,Ke Zhou,Tianhang Yu,Yu Zhan,Zhenjie Wang,Ming Tao,Shunshun Yin,Siyuan Liu
机构: AIGC Team, Soul AI Lab, China; Donghua University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 6 figures

点击查看摘要

Abstract:Deploying massive diffusion models for real-time, infinite-duration, audio-driven avatar generation presents a significant engineering challenge, primarily due to the conflict between computational load and strict latency constraints. Existing approaches often compromise visual fidelity by enforcing strictly unidirectional attention mechanisms or reducing model capacity. To address this problem, we introduce \textbfSoulX-LiveTalk, a 14B-parameter framework optimized for high-fidelity real-time streaming. Diverging from conventional unidirectional paradigms, we use a \textbfSelf-correcting Bidirectional Distillation strategy that retains bidirectional attention within video chunks. This design preserves critical spatiotemporal correlations, significantly enhancing motion coherence and visual detail. To ensure stability during infinite generation, we incorporate a \textbfMulti-step Retrospective Self-Correction Mechanism, enabling the model to autonomously recover from accumulated errors and preventing collapse. Furthermore, we engineered a full-stack inference acceleration suite incorporating hybrid sequence parallelism, Parallel VAE, and kernel-level optimizations. Extensive evaluations confirm that SoulX-LiveTalk is the first 14B-scale system to achieve a \textbfsub-second start-up latency (0.87s) while reaching a real-time throughput of \textbf32 FPS, setting a new standard for high-fidelity interactive digital human synthesis.
zh

[CV-38] NeXT-IMDL: Build Benchmark for NeXT-Generation Image Manipulation Detection Localization

【速读】:该论文旨在解决当前图像篡改检测与定位(Image Manipulation Detection and Localization, IMDL)方法在面对多样化AI生成内容时泛化能力不足的问题。现有研究多采用跨数据集评估方式,掩盖了模型在真实复杂场景下的脆弱性,导致对技术进展的误判。解决方案的关键在于提出NeXT-IMDL——一个大规模诊断基准,其核心创新是通过四个维度(编辑模型、篡改类型、内容语义和伪造粒度)系统性地分类AIGC篡改,并设计五种严格的跨维度评估协议,从而揭示当前主流模型在模拟现实泛化场景下的系统性失败,为构建下一代鲁棒性强的IMDL模型提供诊断工具与实证依据。

链接: https://arxiv.org/abs/2512.23374
作者: Yifei Li,Haoyuan He,Yu Zheng,Bingyao Yu,Wenzhao Zheng,Lei Chen,Jie Zhou,Jiwen Lu
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The accessibility surge and abuse risks of user-friendly image editing models have created an urgent need for generalizable, up-to-date methods for Image Manipulation Detection and Localization (IMDL). Current IMDL research typically uses cross-dataset evaluation, where models trained on one benchmark are tested on others. However, this simplified evaluation approach conceals the fragility of existing methods when handling diverse AI-generated content, leading to misleading impressions of progress. This paper challenges this illusion by proposing NeXT-IMDL, a large-scale diagnostic benchmark designed not just to collect data, but to probe the generalization boundaries of current detectors systematically. Specifically, NeXT-IMDL categorizes AIGC-based manipulations along four fundamental axes: editing models, manipulation types, content semantics, and forgery granularity. Built upon this, NeXT-IMDL implements five rigorous cross-dimension evaluation protocols. Our extensive experiments on 11 representative models reveal a critical insight: while these models perform well in their original settings, they exhibit systemic failures and significant performance degradation when evaluated under our designed protocols that simulate real-world, various generalization scenarios. By providing this diagnostic toolkit and the new findings, we aim to advance the development towards building truly robust, next-generation IMDL models.
zh

[CV-39] MGCA-Net: Multi-Graph Contextual Attention Network for Two-View Correspondence Learning

【速读】:该论文旨在解决两视图对应关系学习中局部几何建模能力不足与跨阶段信息优化效率低下的问题,这些问题导致匹配对的几何约束难以准确捕捉,从而削弱了模型的鲁棒性。解决方案的关键在于提出多图上下文注意力网络(MGCA-Net),其核心创新包括:1)上下文几何注意力(CGA)模块,通过自适应注意力机制动态融合空间位置与特征信息,增强对局部和全局几何关系的建模能力;2)跨阶段多图共识(CSMGC)模块,利用跨阶段稀疏图网络建立几何一致性约束,确保不同阶段间几何信息的一致性。实验证明,该方法在YFCC100M和SUN3D数据集上显著优于现有最先进(SOTA)方法,在异常值剔除和相机位姿估计任务中表现出更强的性能。

链接: https://arxiv.org/abs/2512.23369
作者: Shuyuan Lin,Mengtin Lo,Haosheng Chen,Yanjie Liang,Qiangqiang Wu
机构: Jinan University (暨南大学); Chongqing University of Posts and Telecommunications (重庆邮电大学); Peng Cheng Laboratory (鹏城实验室); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Two-view correspondence learning is a key task in computer vision, which aims to establish reliable matching relationships for applications such as camera pose estimation and 3D reconstruction. However, existing methods have limitations in local geometric modeling and cross-stage information optimization, which make it difficult to accurately capture the geometric constraints of matched pairs and thus reduce the robustness of the model. To address these challenges, we propose a Multi-Graph Contextual Attention Network (MGCA-Net), which consists of a Contextual Geometric Attention (CGA) module and a Cross-Stage Multi-Graph Consensus (CSMGC) module. Specifically, CGA dynamically integrates spatial position and feature information via an adaptive attention mechanism and enhances the capability to capture both local and global geometric relationships. Meanwhile, CSMGC establishes geometric consensus via a cross-stage sparse graph network, ensuring the consistency of geometric information across different stages. Experimental results on two representative YFCC100M and SUN3D datasets show that MGCA-Net significantly outperforms existing SOTA methods in the outlier rejection and camera pose estimation tasks. Source code is available at this http URL.
zh

[CV-40] SpatialMosaic: A Multiview VLM Dataset for Partial Visibility

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在真实场景下进行三维空间推理时面临的挑战,尤其是当存在部分可见性、遮挡和低重叠等复杂条件时,现有方法因依赖预构建的3D表示或现成重建流程而难以扩展且实用性受限。其解决方案的关键在于提出了一种可扩展的多视角数据生成与标注流水线,构建了名为SpatialMosaic的指令微调数据集(包含200万条问答对),并进一步设计了SpatialMosaic-Bench基准测试(涵盖6项任务,共100万条问答对),用于评估多视角空间推理能力;同时提出了SpatialMosaicVLM这一混合框架,将3D重建模型作为几何编码器嵌入到视觉-语言模型(Vision-Language Models, VLMs)中,从而实现从碎片化视觉线索中鲁棒地进行空间推理。

链接: https://arxiv.org/abs/2512.23365
作者: Kanghee Lee,Injae Lee,Minseok Kwak,Kwonyoung Ryu,Jungi Hong,Jaesik Park
机构: Seoul National University (首尔国立大学); University College London (伦敦大学学院); POSTECH (浦项工科大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid progress of Multimodal Large Language Models (MLLMs) has unlocked the potential for enhanced 3D scene understanding and spatial reasoning. However, existing approaches often rely on pre-constructed 3D representations or off-the-shelf reconstruction pipelines, which constrain scalability and real-world applicability. A recent line of work explores learning spatial reasoning directly from multi-view images, enabling Vision-Language Models (VLMs) to understand 3D scenes without explicit 3D reconstructions. Nevertheless, key challenges that frequently arise in real-world environments, such as partial visibility, occlusion, and low-overlap conditions that require spatial reasoning from fragmented visual cues, remain under-explored. To address these limitations, we propose a scalable multi-view data generation and annotation pipeline that constructs realistic spatial reasoning QAs, resulting in SpatialMosaic, a comprehensive instruction-tuning dataset featuring 2M QA pairs. We further introduce SpatialMosaic-Bench, a challenging benchmark for evaluating multi-view spatial reasoning under realistic and challenging scenarios, consisting of 1M QA pairs across 6 tasks. In addition, we present SpatialMosaicVLM, a hybrid framework that integrates 3D reconstruction models as geometry encoders within VLMs for robust spatial reasoning. Extensive experiments demonstrate that our proposed dataset and VQA tasks effectively enhance spatial reasoning under challenging multi-view conditions, validating the effectiveness of our data generation pipeline in constructing realistic and diverse QA pairs. Code and dataset will be available soon.
zh

[CV-41] CountGD: Generalized Prompting for Open-World Counting

【速读】:该论文旨在解决当前自动图像和视频中目标计数方法在提示灵活性与准确性方面的局限性,尤其是现有方法无法有效表达“不计数”的对象、视觉示例需手动标注且难以从自然或合成图像中引入外部样本等问题。解决方案的关键在于:首先扩展提示机制以支持通过文本和/或视觉示例指定“不计数”对象;其次提出“伪示例(pseudo-exemplars)”概念,在推理阶段自动化视觉示例的标注过程;再次将计数模型扩展为可接受来自自然和合成图像的视觉示例;最后利用改进后的CountGD++模型作为视觉专家代理(vision expert agent)集成至大语言模型(LLM),从而显著提升多模态开放世界计数任务中的准确性、效率与泛化能力。

链接: https://arxiv.org/abs/2512.23351
作者: Niki Amini-Naieni,Andrew Zisserman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The flexibility and accuracy of methods for automatically counting objects in images and videos are limited by the way the object can be specified. While existing methods allow users to describe the target object with text and visual examples, the visual examples must be manually annotated inside the image, and there is no way to specify what not to count. To address these gaps, we introduce novel capabilities that expand how the target object can be specified. Specifically, we extend the prompt to enable what not to count to be described with text and/or visual examples, introduce the concept of 'pseudo-exemplars' that automate the annotation of visual examples at inference, and extend counting models to accept visual examples from both natural and synthetic external images. We also use our new counting model, CountGD++, as a vision expert agent for an LLM. Together, these contributions expand the prompt flexibility of multi-modal open-world counting and lead to significant improvements in accuracy, efficiency, and generalization across multiple datasets. Code is available at this https URL.
zh

[CV-42] Visual Language Hypothesis

【速读】:该论文旨在解决视觉表征学习中的本质结构问题,即如何从拓扑与结构的角度理解视觉观察空间中语义信息的组织方式。其核心问题是:在视觉理解中,语义状态(semantic states)如何从复杂的感知观测中被抽象出来,同时保持对干扰变化(nuisance variation)的不变性。解决方案的关键在于提出一个纤维丛(fiber bundle)结构模型——其中语义对应于商空间(quotient space X/G),而干扰变化则分布在纤维上;由此推导出两个理论结论:其一,语义不变性无法通过光滑变形实现,必须依赖非同胚的判别目标(如标签监督、跨实例识别或模态对齐);其二,模型架构需具备支持拓扑变化的能力,即通过“扩展-折叠”(expand-and-snap)机制,在几何扩张中分离结构后重新收缩形成离散语义区域。这一框架为大规模判别和多模态模型提供了拓扑视角的解释,并呼应了统计学习理论的经典原则。

链接: https://arxiv.org/abs/2512.23335
作者: Xiu Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We study visual representation learning from a structural and topological perspective. We begin from a single hypothesis: that visual understanding presupposes a semantic language for vision, in which many perceptual observations correspond to a small number of discrete semantic states. Together with widely assumed premises on transferability and abstraction in representation learning, this hypothesis implies that the visual observation space must be organized in a fiber bundle like structure, where nuisance variation populates fibers and semantics correspond to a quotient base space. From this structure we derive two theoretical consequences. First, the semantic quotient X/G is not a submanifold of X and cannot be obtained through smooth deformation alone, semantic invariance requires a non-homeomorphic, discriminative target, for example, supervision via labels, cross instance identification, or multimodal alignment that supplies explicit semantic equivalence. Second, we show that approximating the quotient also places structural demands on the model architecture. Semantic abstraction requires not only an external semantic target, but a representation mechanism capable of supporting topology change: an expand-and-snap process in which the manifold is first geometrically expanded to separate structure and then collapsed to form discrete semantic regions. We emphasize that these results are interpretive rather than prescriptive: the framework provides a topological lens that aligns with empirical regularities observed in large-scale discriminative and multimodal models, and with classical principles in statistical learning theory.
zh

[CV-43] CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation

【速读】:该论文旨在解决传统计算机辅助设计(Computer-Aided Design, CAD)建模流程复杂、难以自动化生成高精度且可编辑的CAD模型的问题。现有基于草图重建3D模型的方法通常生成非可编辑、近似模型,且依赖文本或图像输入时需大量人工标注,限制了其在工业场景中的可扩展性与实用性。解决方案的关键在于提出一种异构协同多专家强化学习(Heterogeneous Collaborative Multi-Expert Reinforcement Learning, CME-CAD)范式,通过整合多个专家模型的优势,实现协作式学习,从而提升生成模型在准确性、约束兼容性和完全可编辑性方面的性能。该方法采用两阶段训练策略:多专家微调(Multi-Expert Fine-Tuning, MEFT)和多专家强化学习(Multi-Expert Reinforcement Learning, MERL),并构建了开源基准数据集CADExpert,包含17,299个带精确尺寸标注的正交投影图、专家级思维链(Chain-of-Thought, CoT)过程、可执行的CADQuery代码及渲染后的3D模型,为高质量CAD代码生成提供了数据支撑与评估标准。

链接: https://arxiv.org/abs/2512.23333
作者: Ke Niu,Haiyang Yu,Zhuofan Chen,Zhengtao Yao,Weitao Jia,Xiaodong Ge,Jingqun Tang,Benlei Cui,Bin Li,Xiangyang Xue
机构: Fudan University (复旦大学); ByteDance Inc. (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Computer-Aided Design (CAD) is essential in industrial design, but the complexity of traditional CAD modeling and workflows presents significant challenges for automating the generation of high-precision, editable CAD models. Existing methods that reconstruct 3D models from sketches often produce non-editable and approximate models that fall short of meeting the stringent requirements for precision and editability in industrial design. Moreover, the reliance on text or image-based inputs often requires significant manual annotation, limiting their scalability and applicability in industrial settings. To overcome these challenges, we propose the Heterogeneous Collaborative Multi-Expert Reinforcement Learning (CME-CAD) paradigm, a novel training paradigm for CAD code generation. Our approach integrates the complementary strengths of heterogeneous expert models, facilitating collaborative learning and improving the model’s ability to generate accurate, constraint-compatible, and fully editable CAD models. We introduce a two-stage training process: Multi-Expert Fine-Tuning (MEFT) and Multi-Expert Reinforcement Learning (MERL). Additionally, we present CADExpert, an open-source benchmark consisting of 17,299 instances, including orthographic projections with precise dimension annotations, expert-generated Chain-of-Thought (CoT) processes, executable CADQuery code, and rendered 3D models.
zh

[CV-44] PCR-ORB: Enhanced ORB-SLAM3 with Point Cloud Refinement Using Deep Learning-Based Dynamic Object Filtering

【速读】:该论文旨在解决视觉同步定位与建图(vSLAM)系统在动态环境中因移动物体干扰而导致的跟踪精度下降和地图一致性受损问题。解决方案的关键在于提出PCR-ORB(Point Cloud Refinement ORB)框架,其核心创新是引入基于深度学习的点云精炼机制:通过YOLOv8进行语义分割识别动态物体,并结合CUDA加速实现实时处理;同时采用多阶段滤波策略,包括地面平面估计、天空区域剔除、边缘特征过滤及时间一致性验证,有效提升vSLAM在复杂场景下的鲁棒性。

链接: https://arxiv.org/abs/2512.23318
作者: Sheng-Kai Chen,Jie-Yu Chao,Jr-Yu Chang,Po-Lien Wu,Po-Chiang Lin
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 2 figures, 1 table

点击查看摘要

Abstract:Visual Simultaneous Localization and Mapping (vSLAM) systems encounter substantial challenges in dynamic environments where moving objects compromise tracking accuracy and map consistency. This paper introduces PCR-ORB (Point Cloud Refinement ORB), an enhanced ORB-SLAM3 framework that integrates deep learning-based point cloud refinement to mitigate dynamic object interference. Our approach employs YOLOv8 for semantic segmentation combined with CUDA-accelerated processing to achieve real-time performance. The system implements a multi-stage filtering strategy encompassing ground plane estimation, sky region removal, edge filtering, and temporal consistency validation. Comprehensive evaluation on the KITTI dataset (sequences 00-09) demonstrates performance characteristics across different environmental conditions and scene types. Notable improvements are observed in specific sequences, with sequence 04 achieving 25.9% improvement in ATE RMSE and 30.4% improvement in ATE median. However, results show mixed performance across sequences, indicating scenario-dependent effectiveness. The implementation provides insights into dynamic object filtering challenges and opportunities for robust navigation in complex environments.
zh

[CV-45] MedGemma vs GPT -4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在医学影像诊断中存在幻觉(hallucination)和泛化能力不足的问题,尤其是在高风险临床任务中的准确性与可靠性问题。解决方案的关键在于通过领域特定的微调策略——具体而言,使用低秩适应(Low-Rank Adaptation, LoRA)对开源医学专用模型 MedGemma-4b-it 进行微调,从而显著提升其在六种疾病诊断上的性能,其平均测试准确率达到 80.37%,优于未微调的 GPT-4 模型(69.58%),尤其在癌症和肺炎等高风险疾病的敏感性方面表现突出,验证了针对医学场景进行专业微调是降低幻觉、增强临床可信赖性的核心路径。

链接: https://arxiv.org/abs/2512.23304
作者: Md. Sazzadul Islam Prottasha,Nabil Walid Rafi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted for publication in the Journal of Machine Learning and Deep Learning (JMLDL). 9 pages, 9 figures, 10 tables

点击查看摘要

Abstract:Multimodal Large Language Models (LLMs) introduce an emerging paradigm for medical imaging by interpreting scans through the lens of extensive clinical knowledge, offering a transformative approach to disease classification. This study presents a critical comparison between two fundamentally different AI architectures: the specialized open-source agent MedGemma and the proprietary large multimodal model GPT-4 for diagnosing six different diseases. The MedGemma-4b-it model, fine-tuned using Low-Rank Adaptation (LoRA), demonstrated superior diagnostic capability by achieving a mean test accuracy of 80.37% compared to 69.58% for the untuned GPT-4. Furthermore, MedGemma exhibited notably higher sensitivity in high-stakes clinical tasks, such as cancer and pneumonia detection. Quantitative analysis via confusion matrices and classification reports provides comprehensive insights into model performance across all categories. These results emphasize that domain-specific fine-tuning is essential for minimizing hallucinations in clinical implementation, positioning MedGemma as a sophisticated tool for complex, evidence-based medical reasoning.
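摘要中提到的 LoRA 微调可以用 Hugging Face peft 的最小配置示意如下;基座模型名取自 MedGemma 的公开权重,target_modules 与各超参数均为示例假设,论文未公布其具体配置:

```python
# 基座模型名、target_modules 与各超参数均为示例假设,论文摘要未公布其配置
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained("google/medgemma-4b-it")
lora_cfg = LoraConfig(
    r=16,                                  # 低秩更新的秩
    lora_alpha=32,                         # 缩放系数
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # 只改注意力投影层
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # 可训练参数只占 4B 权重的一小部分
```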
zh

[CV-46] Multi-Track Multimodal Learning on iMiGUE: Micro-Gesture and Emotion Recognition

【速读】:该论文旨在解决微手势识别(micro-gesture recognition)与基于行为的情绪预测(behavior-based emotion prediction)这两个具有挑战性的任务,其核心难点在于建模细微且细粒度的人类行为,主要依赖视频和骨骼姿态数据。解决方案的关键在于构建两个多模态框架:在微手势分类中,通过MViTv2-S提取视频特征、2s-AGCN提取骨骼姿态特征,并利用Cross-Modal Token Fusion模块融合空间与姿态信息;在情绪预测中,则采用SwinFace和MViTv2-S分别提取面部与上下文嵌入,并通过InterFusion模块捕捉情绪表达与身体动作的关联。该方法在iMiGUE数据集上的实验验证了其在行为驱动情绪预测任务中的高准确性和鲁棒性。

链接: https://arxiv.org/abs/2512.23291
作者: Arman Martirosyan,Shahane Tigranyan,Maria Razzhivina,Artak Aslanyan,Nazgul Salikhova,Ilya Makarov,Andrey Savchenko,Aram Avetisyan
机构: Russian - Armenian University (俄罗斯-亚美尼亚大学); ISP RAS (俄罗斯科学院信息与控制问题研究所); AIRI (人工智能研究 institute); Sber AI Lab (斯贝AI实验室); HSE University (高等经济大学); Innopolis University (因诺波利斯大学); ISP RAS Research Center for Trusted Artificial Intelligence (俄罗斯科学院信息与控制问题研究所可信人工智能研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Micro-gesture recognition and behavior-based emotion prediction are both highly challenging tasks that require modeling subtle, fine-grained human behaviors, primarily leveraging video and skeletal pose data. In this work, we present two multimodal frameworks designed to tackle both problems on the iMiGUE dataset. For micro-gesture classification, we explore the complementary strengths of RGB and 3D pose-based representations to capture nuanced spatio-temporal patterns. To comprehensively represent gestures, video, and skeletal embeddings are extracted using MViTv2-S and 2s-AGCN, respectively. Then, they are integrated through a Cross-Modal Token Fusion module to combine spatial and pose information. For emotion recognition, our framework extends to behavior-based emotion prediction, a binary classification task identifying emotional states based on visual cues. We leverage facial and contextual embeddings extracted using SwinFace and MViTv2-S models and fuse them through an InterFusion module designed to capture emotional expressions and body gestures. Experiments conducted on the iMiGUE dataset, within the scope of the MiGA 2025 Challenge, demonstrate the robust performance and accuracy of our method in the behavior-based emotion prediction task, where our approach secured 2nd place.
zh

[CV-47] YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection

【速读】:该论文旨在解决现有实时目标检测(Real-Time Object Detection, RTOD)方法普遍采用类似YOLO的架构所导致的计算资源分配不合理问题。这类模型依赖静态密集计算,对所有输入均施加相同的处理强度,造成在简单场景中冗余计算、复杂场景中表征能力不足,从而影响检测性能并浪费算力。其解决方案的关键在于提出YOLO-Master框架,引入基于实例条件的自适应计算机制,核心组件为高效稀疏专家混合(Efficient Sparse Mixture-of-Experts, ES-MoE)模块,该模块可根据输入场景复杂度动态分配计算资源;同时通过轻量级动态路由网络,在训练中利用多样性增强目标引导专家专业化,使各专家具备互补能力,并在推理阶段仅激活最相关的专家,实现高精度与低延迟的平衡。

链接: https://arxiv.org/abs/2512.23273
作者: Xu Lin,Jinlong Peng,Zhenye Gan,Jiawen Zhu,Jun Liu
机构: Tencent Youtu Lab (腾讯优图实验室); Singapore Management University (新加坡管理大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing Real-Time Object Detection (RTOD) methods commonly adopt YOLO-like architectures for their favorable trade-off between accuracy and speed. However, these models rely on static dense computation that applies uniform processing to all inputs, misallocating representational capacity and computational resources, for example over-allocating on trivial scenes while under-serving complex ones. This mismatch results in both computational redundancy and suboptimal detection performance. To overcome this limitation, we propose YOLO-Master, a novel YOLO-like framework that introduces instance-conditional adaptive computation for RTOD. This is achieved through an Efficient Sparse Mixture-of-Experts (ES-MoE) block that dynamically allocates computational resources to each input according to its scene complexity. At its core, a lightweight dynamic routing network guides expert specialization during training through a diversity-enhancing objective, encouraging complementary expertise among experts. Additionally, the routing network adaptively learns to activate only the most relevant experts, thereby improving detection performance while minimizing computational overhead during inference. Comprehensive experiments on five large-scale benchmarks demonstrate the superiority of YOLO-Master. On MS COCO, our model achieves 42.4% AP with 1.62ms latency, outperforming YOLOv13-N by 0.8% mAP while running 17.8% faster. Notably, the gains are most pronounced on challenging dense scenes, while the model preserves efficiency on typical inputs and maintains real-time inference speed. Code will be available.
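下面是 top-k 稀疏专家混合模块的最小 PyTorch 草图,对应 ES-MoE"按场景难度分配算力、推理时只激活少数专家"的核心思想;论文中的多样性增强目标与 YOLO 主干集成在此省略:

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """最小化的 top-k 稀疏专家混合模块,示意 ES-MoE"按输入难度分配计算"的
    核心思想;论文中的多样性目标与 YOLO 集成细节在此省略。"""
    def __init__(self, dim=64, n_experts=4, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)   # 轻量动态路由网络
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):                          # x: [tokens, dim]
        gates = self.router(x).softmax(dim=-1)     # [tokens, n_experts]
        topv, topi = gates.topk(self.k, dim=-1)
        topv = topv / topv.sum(dim=-1, keepdim=True)  # 对保留的门控重新归一化
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (topi == e)                     # [tokens, k]
            sel = mask.any(dim=-1)
            if sel.any():                          # 只有被路由到时才运行该专家
                w = (topv * mask).sum(dim=-1, keepdim=True)[sel]
                out[sel] += w * expert(x[sel])
        return out

moe = SparseMoE()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```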
zh

[CV-48] Plug-and-Play Fidelity Optimization for Diffusion Transformer Acceleration via Cumulative Error Minimization

【速读】:该论文旨在解决扩散模型(Diffusion Model)在图像和视频生成任务中因迭代去噪过程导致的推理速度慢的问题。现有基于缓存(caching-based)的方法虽能实现训练-free加速,但会引入显著的计算误差,而传统误差校正策略(如剪枝或预测)由于采用固定缓存策略,难以适应去噪过程中复杂变化的误差模式,限制了其性能提升潜力。解决方案的关键在于提出一种名为CEM(Cumulative Error Minimization)的通用性保真度优化插件,通过预定义误差指标刻画模型对加速的敏感性(受时间步长和缓存间隔共同影响),并基于此先验设计一种动态规划算法,以累积误差近似为目标进行策略优化,从而最小化缓存误差,显著提升生成保真度。CEM具备模型无关性和强泛化能力,可无缝集成至现有误差校正框架与量化模型中,且不增加额外计算开销。

链接: https://arxiv.org/abs/2512.23258
作者: Tong Shao,Yusen Fu,Guoying Sun,Jingde Kong,Zhuotao Tian,Jingyong Su
机构: Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学深圳校区)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although Diffusion Transformer (DiT) has emerged as a predominant architecture for image and video generation, its iterative denoising process results in slow inference, which hinders broader applicability and development. Caching-based methods achieve training-free acceleration, while suffering from considerable computational error. Existing methods typically incorporate error correction strategies such as pruning or prediction to mitigate it. However, their fixed caching strategy fails to adapt to the complex error variations during denoising, which limits the full potential of error correction. To tackle this challenge, we propose a novel fidelity-optimization plugin for existing error correction methods via cumulative error minimization, named CEM. CEM predefines the error to characterize the sensitivity of model to acceleration jointly influenced by timesteps and cache intervals. Guided by this prior, we formulate a dynamic programming algorithm with cumulative error approximation for strategy optimization, which achieves the caching error minimization, resulting in a substantial improvement in generation fidelity. CEM is model-agnostic and exhibits strong generalization, which is adaptable to arbitrary acceleration budgets. It can be seamlessly integrated into existing error correction frameworks and quantized models without introducing any additional computational overhead. Extensive experiments conducted on nine generation models and quantized methods across three tasks demonstrate that CEM significantly improves generation fidelity of existing acceleration models, and outperforms the original generation performance on FLUX.1-dev, PixArt-α, StableDiffusion1.5, and Hunyuan. The code will be made publicly available.
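缓存调度可形式化为区间型动态规划:在给定完整计算次数预算下选择重算步,使累积缓存误差最小。下面是该思想的通用草图(误差模型 err(t, gap) 为假设的玩具函数,且强制首尾两步重算;论文的误差刻画与约束更精细):

```python
def plan_cache_schedule(err, T, budget):
    """区间 DP:在 budget 次完整计算的预算内选择重算步,最小化累积缓存误差。
    err(t, gap) 为"在第 t 步复用 gap 步前缓存"的假设误差模型(玩具函数)。"""
    INF = float("inf")
    # best[t][b]: 第 t 步为完整计算、已用 b 次完整计算时,0..t 步的最小累积误差
    best = [[INF] * (budget + 1) for _ in range(T)]
    back = [[None] * (budget + 1) for _ in range(T)]
    best[0][1] = 0.0  # 第 0 步必须完整计算
    for t in range(1, T):
        for prev in range(t):
            # prev+1 .. t-1 这些步全部复用 prev 处的缓存所产生的误差
            seg = sum(err(prev + 1 + i, i + 1) for i in range(t - prev - 1))
            for b in range(1, budget):
                cand = best[prev][b] + seg
                if cand < best[t][b + 1]:
                    best[t][b + 1] = cand
                    back[t][b + 1] = (prev, b)
    b = min(range(1, budget + 1), key=lambda k: best[T - 1][k])
    total, steps, t = best[T - 1][b], [], T - 1
    while t is not None:
        steps.append(t)
        t, b = back[t][b] if back[t][b] is not None else (None, None)
    return total, sorted(steps)

# 玩具误差模型:缓存越陈旧、越到去噪后期,误差越大
err = lambda t, gap: gap * (1 + t / 10)
total, steps = plan_cache_schedule(err, T=12, budget=4)
print(round(total, 2), steps)  # 选出 4 个重算步,其余步复用缓存
```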
zh

[CV-49] Contour Information Aware 2D Gaussian Splatting for Image Representation

【速读】:该论文旨在解决2D Gaussian Splatting(2DGS)在低数量高斯分布下重建图像时出现模糊或边界不清晰的问题,其核心在于缺乏对轮廓信息的感知能力。解决方案的关键是引入基于对象分割先验的轮廓信息感知机制,通过在光栅化过程中约束每个高斯分布仅限于特定的分割区域,从而避免跨边界混合,有效保留边缘结构;同时设计了一种预热训练策略以稳定优化过程并提升收敛性。该方法在保持高效渲染与低内存占用的同时,显著提升了压缩比极低场景下的边缘重建质量。

链接: https://arxiv.org/abs/2512.23255
作者: Masaya Takabe,Hiroshi Watanabe,Sujun Hong,Tomohiro Ikai,Zheming Fan,Ryo Ishimoto,Kakeru Sugimoto,Ruri Imichi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image representation is a fundamental task in computer vision. Recently, Gaussian Splatting has emerged as an efficient representation framework, and its extension to 2D image representation enables lightweight, yet expressive modeling of visual content. While recent 2D Gaussian Splatting (2DGS) approaches provide compact storage and real-time decoding, they often produce blurry or indistinct boundaries when the number of Gaussians is small due to the lack of contour awareness. In this work, we propose a Contour Information-Aware 2D Gaussian Splatting framework that incorporates object segmentation priors into Gaussian-based image representation. By constraining each Gaussian to a specific segmentation region during rasterization, our method prevents cross-boundary blending and preserves edge structures under high compression. We also introduce a warm-up scheme to stabilize training and improve convergence. Experiments on synthetic color charts and the DAVIS dataset demonstrate that our approach achieves higher reconstruction quality around object edges compared to existing 2DGS methods. The improvement is particularly evident in scenarios with very few Gaussians, while our method still maintains fast rendering and low memory usage.
zh

[CV-50] ASemConsist: Adaptive Semantic Feature Control for Training-Free Identity-Consistent Generation

【速读】:该论文旨在解决文本到图像扩散模型在生成连续图像序列时,如何在保持角色身份一致性(identity consistency)与确保每张图像对提示词的准确对齐(prompt alignment)之间取得平衡的问题。现有方法常因二者之间的权衡而难以同时优化。其解决方案的关键在于提出一种名为ASemconsist的新框架,通过选择性地修改文本嵌入(text embedding),实现对角色身份的显式语义控制;同时基于对FLUX模型中填充嵌入(padding embeddings)的分析,将其重新设计为语义容器以增强控制能力;此外引入自适应特征共享策略,仅对存在歧义的身份提示施加约束,从而提升效率与准确性;最终构建统一评估指标一致性质量分数(Consistency Quality Score, CQS),量化衡量身份保留与提示对齐的综合表现,有效突破了传统方法中的性能瓶颈。

链接: https://arxiv.org/abs/2512.23245
作者: Shin seong Kim,Minjung Shin,Hyunin Cho,Youngjung Uh
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent text-to-image diffusion models have significantly improved visual quality and text alignment. However, generating a sequence of images while preserving consistent character identity across diverse scene descriptions remains a challenging task. Existing methods often struggle with a trade-off between maintaining identity consistency and ensuring per-image prompt alignment. In this paper, we introduce a novel framework, ASemconsist, that addresses this challenge through selective text embedding modification, enabling explicit semantic control over character identity without sacrificing prompt alignment. Furthermore, based on our analysis of padding embeddings in FLUX, we propose a semantic control strategy that repurposes padding embeddings as semantic containers. Additionally, we introduce an adaptive feature-sharing strategy that automatically evaluates textual ambiguity and applies constraints only to the ambiguous identity prompt. Finally, we propose a unified evaluation protocol, the Consistency Quality Score (CQS), which integrates identity preservation and per-image text alignment into a single comprehensive metric, explicitly capturing performance imbalances between the two metrics. Our framework achieves state-of-the-art performance, effectively overcoming prior trade-offs. Project page: this https URL
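摘要中"将填充嵌入重新用作语义容器"的做法可以用如下草图说明:保持真实 token 嵌入不变,仅向 padding 位置写入共享的身份向量;具体的混合与更新方式与论文机制可能不同:

```python
import torch

def inject_identity(text_emb, attn_mask, identity_emb, alpha=1.0):
    """向提示词的 padding 位置写入共享身份向量,真实 token 嵌入保持不变。
    text_emb: [L, D] token 嵌入;attn_mask: [L](1=真实 token,0=padding);
    identity_emb: [D] 身份向量。仅为思想示意,混合方式与论文机制可能不同。"""
    out = text_emb.clone()
    pad = attn_mask == 0
    out[pad] = (1 - alpha) * out[pad] + alpha * identity_emb
    return out

L, D = 8, 16
emb = torch.randn(L, D)
mask = torch.tensor([1, 1, 1, 1, 0, 0, 0, 0])
identity = torch.randn(D)
shared = inject_identity(emb, mask, identity)
print(torch.allclose(shared[:4], emb[:4]))  # True:真实 token 未被改动
```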

[CV-51] ViLaCD-R1: A Vision-Language Framework for Semantic Change Detection in Remote Sensing

【Quick Read】: Addressing remote sensing change detection (RSCD), where traditional methods struggle to capture high-level semantics and are vulnerable to non-semantic perturbations, and where existing multimodal and vision-language model (VLM) approaches suffer from inaccurate spatial localization, imprecise pixel-level boundaries, and limited interpretability, this paper proposes the two-stage framework ViLaCD-R1. Its key components are: (1) a Multi-Image Reasoner (MIR) trained with supervised fine-tuning (SFT) and reinforcement learning (RL) on block-level dual-temporal inference, which takes dual-temporal image patches as input and produces a coarse change mask; and (2) a Mask-Guided Decoder (MGD) that fuses dual-temporal image features with the coarse mask to predict a precise binary change map. Together they substantially improve semantic change recognition and localization while remaining robust to non-semantic variations.

Link: https://arxiv.org/abs/2512.23244
Authors: Xingwei Ma, Shiyang Feng, Bo Zhang, Bin Wang
Institutions: Fudan University (复旦大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Remote sensing change detection (RSCD), a complex multi-image inference task, traditionally uses pixel-based operators or encoder-decoder networks that inadequately capture high-level semantics and are vulnerable to non-semantic perturbations. Although recent multimodal and vision-language model (VLM)-based approaches enhance semantic understanding of change regions by incorporating textual descriptions, they still suffer from challenges such as inaccurate spatial localization, imprecise pixel-level boundary delineation, and limited interpretability. To address these issues, we propose ViLaCD-R1, a two-stage framework comprising a Multi-Image Reasoner (MIR) and a Mask-Guided Decoder (MGD). Specifically, the VLM is trained through supervised fine-tuning (SFT) and reinforcement learning (RL) on block-level dual-temporal inference tasks, taking dual-temporal image patches as input and outputting a coarse change mask. Then, the decoder integrates dual-temporal image features with this coarse mask to predict a precise binary change map. Comprehensive evaluations on multiple RSCD benchmarks demonstrate that ViLaCD-R1 substantially improves true semantic change recognition and localization, robustly suppresses non-semantic variations, and achieves state-of-the-art accuracy in complex real-world scenarios.

[CV-52] Multimodal Interpretation of Remote Sensing Images: Dynamic Resolution Input Strategy and Multi-scale Vision-Language Alignment Mechanism

【Quick Read】: This paper addresses the limited accuracy of single-source remote sensing data for surface information extraction, as well as the efficiency and semantic-hierarchy limitations of existing multimodal fusion methods. The key to the solution is a vision-language model (VLM) framework that integrates a Dynamic Resolution Input Strategy (DRIS) and a Multi-scale Vision-language Alignment Mechanism (MS-VLAM): DRIS adaptively allocates computational resources in a coarse-to-fine manner according to image content complexity, preserving key fine-grained features while cutting redundant computation; MS-VLAM builds a three-tier alignment across object, local-region, and global levels to systematically capture cross-modal semantic consistency and alleviate semantic misalignment and granularity mismatch, significantly improving both accuracy and efficiency in image captioning and cross-modal retrieval.

Link: https://arxiv.org/abs/2512.23243
Authors: Siyu Zhang, Ying Chen, Lianlei Shan, Runhe Qiu
Institutions: Sanda University (上海大学); Tsinghua University (清华大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Multimodal fusion of remote sensing images serves as a core technology for overcoming the limitations of single-source data and improving the accuracy of surface information extraction, which exhibits significant application value in fields such as environmental monitoring and urban planning. To address the deficiencies of existing methods, including the failure of fixed resolutions to balance efficiency and detail, as well as the lack of semantic hierarchy in single-scale alignment, this study proposes a Vision-language Model (VLM) framework integrated with two key innovations: the Dynamic Resolution Input Strategy (DRIS) and the Multi-scale Vision-language Alignment Mechanism (MS-VLAM). Specifically, the DRIS adopts a coarse-to-fine approach to adaptively allocate computational resources according to the complexity of image content, thereby preserving key fine-grained features while reducing redundant computational overhead. The MS-VLAM constructs a three-tier alignment mechanism covering object, local-region and global levels, which systematically captures cross-modal semantic consistency and alleviates issues of semantic misalignment and granularity mismatch. Experimental results on the RS-GPT4V dataset demonstrate that the proposed framework significantly improves the accuracy of semantic understanding and computational efficiency in tasks including image captioning and cross-modal retrieval. Compared with conventional methods, it achieves superior performance in evaluation metrics such as BLEU-4 and CIDEr for image captioning, as well as R@10 for cross-modal retrieval. This technical framework provides a novel approach for constructing efficient and robust multimodal remote sensing systems, laying a theoretical foundation and offering technical guidance for the engineering application of intelligent remote sensing interpretation.

[CV-53] RS-Prune: Training-Free Data Pruning at High Ratios for Efficient Remote Sensing Diffusion Foundation Models

【Quick Read】: This paper targets the low training efficiency and poor convergence of remote sensing (RS) generative diffusion foundation models caused by redundancy, noise, and class imbalance in their training data. Existing pipelines typically aggregate multiple classification datasets or apply simplistic deduplication, ignoring both the distributional requirements of generative modeling and the heterogeneity of RS imagery. The key to the solution is a training-free, two-stage data pruning method: first, an entropy-based criterion removes low-information samples to raise local content quality; second, scene classification datasets serve as reference benchmarks for scene-aware clustering with stratified sampling, reducing computational cost while making clustering effective on large-scale unlabeled data; finally, balancing cluster-level uniformity against sample representativeness enables fine-grained selection under high pruning ratios (e.g., 85%) while preserving overall diversity and representativeness, markedly accelerating convergence and improving generation quality and downstream performance.

Link: https://arxiv.org/abs/2512.23239
Authors: Fan Wei, Runmin Dong, Yushan Lai, Yixiang Yang, Zhaoyang Luo, Jinxiao Zhang, Miao Yang, Shuai Yuan, Jiyao Zhao, Bin Luo, Haohuan Fu
Institutions: Tsinghua University (清华大学); Sun Yat-sen University (中山大学); National Supercomputing Center in Shenzhen (深圳国家超级计算中心); Tsinghua Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); The University of Hong Kong (香港大学); Peking University (北京大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Diffusion-based remote sensing (RS) generative foundation models are cruial for downstream tasks. However, these models rely on large amounts of globally representative data, which often contain redundancy, noise, and class imbalance, reducing training efficiency and preventing convergence. Existing RS diffusion foundation models typically aggregate multiple classification datasets or apply simplistic deduplication, overlooking the distributional requirements of generation modeling and the heterogeneity of RS imagery. To address these limitations, we propose a training-free, two-stage data pruning approach that quickly select a high-quality subset under high pruning ratios, enabling a preliminary foundation model to converge rapidly and serve as a versatile backbone for generation, downstream fine-tuning, and other applications. Our method jointly considers local information content with global scene-level diversity and representativeness. First, an entropy-based criterion efficiently removes low-information samples. Next, leveraging RS scene classification datasets as reference benchmarks, we perform scene-aware clustering with stratified sampling to improve clustering effectiveness while reducing computational costs on large-scale unlabeled data. Finally, by balancing cluster-level uniformity and sample representativeness, the method enables fine-grained selection under high pruning ratios while preserving overall diversity and representativeness. Experiments show that, even after pruning 85% of the training data, our method significantly improves convergence and generation quality. Furthermore, diffusion foundation models trained with our method consistently achieve state-of-the-art performance across downstream tasks, including super-resolution and semantic image synthesis. This data pruning paradigm offers practical guidance for developing RS generative foundation models.
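
The first pruning stage is easy to picture. Below is a small sketch of entropy-based filtering; the abstract does not specify the exact entropy criterion, so grayscale histogram entropy and the `keep_ratio` parameter are assumptions used purely for illustration.

```python
import numpy as np

def image_entropy(img_gray, bins=256):
    """Shannon entropy (bits) of an 8-bit grayscale image's histogram."""
    hist, _ = np.histogram(img_gray, bins=bins, range=(0, 256))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def prune_low_information(images, keep_ratio=0.15):
    """Stage 1: rank samples by entropy and keep the most informative
    fraction (e.g., 15% of the data at an 85% pruning ratio)."""
    scores = np.array([image_entropy(im) for im in images])
    k = max(1, int(len(images) * keep_ratio))
    keep_idx = np.argsort(scores)[::-1][:k]
    return sorted(keep_idx.tolist())
```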

[CV-54] Physics-Inspired Modeling and Content Adaptive Routing in an Infrared Gas Leak Detection Network

【Quick Read】: This paper addresses infrared gas leak detection, which is difficult because plumes are faint, small, semitransparent, and have diffuse boundaries. The key to the solution is the physics-edge hybrid gas dynamic routing network (PEG-DRNet), with three core components: (1) a Gas Block that models gas diffusion-convection through a local branch and a large-kernel branch, enriching detail and global propagation cues in weak-contrast plume regions; (2) an adaptive gradient and phase edge operator (AGPEO) combined with a multi-scale edge perception module (MSEPM) that extracts and reinforces edge priors from multi-directional gradients and phase consistency, sharpening boundaries; and (3) a content-adaptive sparse routing path aggregation network (CASR-PAN) that dynamically modulates cross-scale feature propagation based on edge and content cues, enabling efficient fusion while suppressing redundancy. On the IIG and LangGas datasets the method outperforms existing CNN and Transformer detectors, achieving the best balance between accuracy (AP, AP50) and computational cost (43.7 Gflops, 14.9 M parameters).

Link: https://arxiv.org/abs/2512.23234
Authors: Dongsheng Li, Chaobo Chen, Siling Wang, Song Gao
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Detecting infrared gas leaks is critical for environmental monitoring and industrial safety, yet remains difficult because plumes are faint, small, semitransparent, and have weak, diffuse boundaries. We present physics-edge hybrid gas dynamic routing network (PEG-DRNet). First, we introduce the Gas Block, a diffusion-convection unit modeling gas transport: a local branch captures short-range variations, while a large-kernel branch captures long-range propagation. An edge-gated learnable fusion module balances local detail and global context, strengthening weak-contrast plume and contour cues. Second, we propose the adaptive gradient and phase edge operator (AGPEO), computing reliable edge priors from multi-directional gradients and phase-consistent responses. These are transformed by a multi-scale edge perception module (MSEPM) into hierarchical edge features that reinforce boundaries. Finally, the content-adaptive sparse routing path aggregation network (CASR-PAN), with adaptive information modulation modules for fusion and self, selectively propagates informative features across scales based on edge and content cues, improving cross-scale discriminability while reducing redundancy. Experiments on the IIG dataset show that PEG-DRNet achieves an overall AP of 29.8%, an AP _50 of 84.3%, and a small-object AP of 25.3%, surpassing the RT-DETR-R18 baseline by 3.0%, 6.5%, and 5.3%, respectively, while requiring only 43.7 Gflops and 14.9 M parameters. The proposed PEG-DRNet achieves superior overall performance with the best balance of accuracy and computational efficiency, outperforming existing CNN and Transformer detectors in AP and AP _50 on the IIG and LangGas dataset.

[CV-55] SURE Guided Posterior Sampling: Trajectory Correction for Diffusion-Based Inverse Problems

【Quick Read】: This paper addresses the degradation in reconstruction quality that diffusion-based inverse problem solvers suffer from accumulated sampling-trajectory deviations, which typically forces hundreds or thousands of neural function evaluations (NFEs) for high-quality results. The key to the solution is SURE Guided Posterior Sampling (SGPS), which corrects trajectory deviations using Stein's Unbiased Risk Estimate (SURE) gradient updates combined with PCA-based noise estimation, suppressing noise-induced errors during the critical early and middle sampling stages and reducing error accumulation, so that high reconstruction quality is maintained with fewer than 100 NFEs.

Link: https://arxiv.org/abs/2512.23232
Authors: Minwoo Kim, Hongki Lim
Institutions: Inha University (仁川大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Diffusion models have emerged as powerful learned priors for solving inverse problems. However, current iterative solving approaches which alternate between diffusion sampling and data consistency steps typically require hundreds or thousands of steps to achieve high quality reconstruction due to accumulated errors. We address this challenge with SURE Guided Posterior Sampling (SGPS), a method that corrects sampling trajectory deviations using Stein’s Unbiased Risk Estimate (SURE) gradient updates and PCA based noise estimation. By mitigating noise induced errors during the critical early and middle sampling stages, SGPS enables more accurate posterior sampling and reduces error accumulation. This allows our method to maintain high reconstruction quality with fewer than 100 Neural Function Evaluations (NFEs). Our extensive evaluation across diverse inverse problems demonstrates that SGPS consistently outperforms existing methods at low NFE counts.
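
The SURE machinery at the heart of SGPS is standard enough to sketch. Assuming Gaussian noise of known level `sigma` and a black-box `denoiser`, the snippet below computes Stein's Unbiased Risk Estimate with the divergence term approximated by Hutchinson's Monte Carlo estimator; how SGPS schedules these updates along the sampling trajectory is not specified in the abstract, so treat this as the underlying estimator only. A guidance step would then take the gradient of this scalar with respect to the current iterate.

```python
import torch

def sure_estimate(denoiser, y, sigma, eps=1e-3):
    """Stein's Unbiased Risk Estimate for x_hat = denoiser(y) under Gaussian
    noise of std sigma; the divergence uses Hutchinson's estimator."""
    n = y.numel()
    x_hat = denoiser(y)
    b = torch.randn_like(y)                       # random probe vector
    div = (b * (denoiser(y + eps * b) - x_hat)).sum() / eps
    return ((y - x_hat) ** 2).sum() - n * sigma ** 2 + 2 * sigma ** 2 * div
```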

[CV-56] Anomaly Detection by Effectively Leverag ing Synthetic Images

【Quick Read】: This paper addresses the limited performance of industrial anomaly detection caused by the scarcity of real defect images. Existing work either relies on unsupervised learning from normal images only, which struggles to mimic complex defects, or synthesizes defect images with generative models, which is costly. The key to the solution is a framework that combines rule-based synthesis with a pre-trained image translation model: a text-guided image-to-image translation model and an image retrieval model efficiently generate high-quality, relevant synthetic defect images, with the retrieval model filtering out low-quality outputs by comparing generated images against real normal images. A two-stage training strategy then pre-trains on a large volume of cheap rule-based synthetic data and fine-tunes on a smaller set of high-quality synthetic images, substantially reducing data collection cost while improving anomaly detection performance.

Link: https://arxiv.org/abs/2512.23227
Authors: Sungho Kang, Hyunkyu Park, Yeonho Lee, Hanbyul Lee, Mijoo Jeong, YeongHyeon Park, Injae Lee, Juneho Yi
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Anomaly detection plays a vital role in industrial manufacturing. Due to the scarcity of real defect images, unsupervised approaches that rely solely on normal images have been extensively studied. Recently, diffusion-based generative models brought attention to training data synthesis as an alternative solution. In this work, we focus on a strategy to effectively leverage synthetic images to maximize the anomaly detection performance. Previous synthesis strategies are broadly categorized into two groups, presenting a clear trade-off. Rule-based synthesis, such as injecting noise or pasting patches, is cost-effective but often fails to produce realistic defect images. On the other hand, generative model-based synthesis can create high-quality defect images but requires substantial cost. To address this problem, we propose a novel framework that leverages a pre-trained text-guided image-to-image translation model and image retrieval model to efficiently generate synthetic defect images. Specifically, the image retrieval model assesses the similarity of the generated images to real normal images and filters out irrelevant outputs, thereby enhancing the quality and relevance of the generated defect images. To effectively leverage synthetic images, we also introduce a two stage training strategy. In this strategy, the model is first pre-trained on a large volume of images from rule-based synthesis and then fine-tuned on a smaller set of high-quality images. This method significantly reduces the cost for data collection while improving the anomaly detection performance. Experiments on the MVTec AD dataset demonstrate the effectiveness of our approach.
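
The retrieval-based filter can be pictured as a similarity band in an embedding space. The thresholds and the keep-if-within-a-band rule below are assumptions (the abstract only says the retrieval model filters out irrelevant outputs); `gen_embs` and `normal_embs` stand for features from any pretrained image encoder.

```python
import numpy as np

def filter_synthetic_defects(gen_embs, normal_embs, lo=0.6, hi=0.95):
    """Keep generated defect images whose max cosine similarity to real
    normal images falls inside [lo, hi]: similar enough to be the right
    product, but not so similar that no defect is visible."""
    g = gen_embs / np.linalg.norm(gen_embs, axis=1, keepdims=True)
    n = normal_embs / np.linalg.norm(normal_embs, axis=1, keepdims=True)
    best = (g @ n.T).max(axis=1)          # best match per generated image
    return np.where((best >= lo) & (best <= hi))[0]
```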

[CV-57] Bridging Your Imagination with Audio-Video Generation via a Unified Director

【Quick Read】: This paper addresses the disconnect between script drafting and key-shot design in current AI-driven video creation systems, which rely on large language models and image generation models respectively, yielding incoherent narratives and inconsistent visuals. The key to the solution is UniMAGE, a unified director model built on a Mixture-of-Transformers architecture that unifies text and image generation and is trained with a novel "first interleaving, then disentangling" paradigm: Interleaved Concept Learning first uses interleaved text-image data to deepen the model's script understanding and imagination, after which Disentangled Expert Learning decouples script writing from keyframe generation, improving narrative logic and creative flexibility.

Link: https://arxiv.org/abs/2512.23222
Authors: Jiaxu Zhang, Tianshu Hu, Yuan Zhang, Zenan Li, Linjie Luo, Guosheng Lin, Xin Chen
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Click to view abstract

Abstract:Existing AI-driven video creation systems typically treat script drafting and key-shot design as two disjoint tasks: the former relies on large language models, while the latter depends on image generation models. We argue that these two tasks should be unified within a single framework, as logical reasoning and imaginative thinking are both fundamental qualities of a film director. In this work, we propose UniMAGE, a unified director model that bridges user prompts with well-structured scripts, thereby empowering non-experts to produce long-context, multi-shot films by leveraging existing audio-video generation models. To achieve this, we employ the Mixture-of-Transformers architecture that unifies text and image generation. To further enhance narrative logic and keyframe consistency, we introduce a ``first interleaving, then disentangling’’ training paradigm. Specifically, we first perform Interleaved Concept Learning, which utilizes interleaved text-image data to foster the model’s deeper understanding and imaginative interpretation of scripts. We then conduct Disentangled Expert Learning, which decouples script writing from keyframe generation, enabling greater flexibility and creativity in storytelling. Extensive experiments demonstrate that UniMAGE achieves state-of-the-art performance among open-source models, generating logically coherent video scripts and visually consistent keyframe images.

[CV-58] Holi-DETR: Holistic Fashion Item Detection Leverag ing Contextual Information

【Quick Read】: This paper addresses the ambiguity in fashion item detection caused by the highly diverse appearance of fashion items and the similarities among subcategories. The key to the solution is the Holistic Detection Transformer (Holi-DETR), which explicitly incorporates three kinds of heterogeneous contextual information to improve accuracy: (1) co-occurrence relationships between items, (2) relative position and size derived from inter-item spatial arrangements, and (3) spatial relationships between items and human body key-points. The architecture fuses these three context types into the Detection Transformer (DETR) and its successors; in experiments it improves the vanilla DETR and the recent Co-DETR by 3.6 and 1.1 average precision (AP) points, respectively.

Link: https://arxiv.org/abs/2512.23221
Authors: Youngchae Kwon, Jinyoung Choi, Injung Kim
Institutions: Handong Global University (韩东大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 20 pages, 6 figures

Click to view abstract

Abstract:Fashion item detection is challenging due to the ambiguities introduced by the highly diverse appearances of fashion items and the similarities among item subcategories. To address this challenge, we propose a novel Holistic Detection Transformer (Holi-DETR) that detects fashion items in outfit images holistically, by leveraging contextual information. Fashion items often have meaningful relationships as they are combined to create specific styles. Unlike conventional detectors that detect each item independently, Holi-DETR detects multiple items while reducing ambiguities by leveraging three distinct types of contextual information: (1) the co-occurrence relationship between fashion items, (2) the relative position and size based on inter-item spatial arrangements, and (3) the spatial relationships between items and human body key-points. To this end, we propose a novel architecture that integrates these three types of heterogeneous contextual information into the Detection Transformer (DETR) and its subsequent models. In experiments, the proposed methods improved the performance of the vanilla DETR and the more recently developed Co-DETR by 3.6 percent points (pp) and 1.1 pp, respectively, in terms of average precision (AP).
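
As a toy illustration of the first context type, the sketch below refines per-item class logits with a co-occurrence prior estimated from training outfits. Holi-DETR integrates context inside the transformer rather than as a post-hoc score adjustment, so this function is an assumption-laden simplification of the idea, not the paper's mechanism.

```python
import torch

def cooccurrence_context(logits, cooc, alpha=0.5):
    """Refine per-item class logits for one outfit with a co-occurrence prior.

    logits: (N, C) raw class scores for N detected items.
    cooc:   (C, C) log co-occurrence statistics from training outfits.
    """
    probs = logits.softmax(dim=-1)                        # (N, C)
    context = probs @ cooc                                # evidence each item emits
    others = context.sum(dim=0, keepdim=True) - context   # evidence from the rest
    return logits + alpha * others
```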

[CV-59] MM-UAVBench: How Well Do Multimodal Large Language Models See Think and Plan in Low-Altitude UAV Scenarios?

【Quick Read】: This paper addresses the lack of a systematic benchmark for evaluating multimodal large language models (MLLMs) in low-altitude scenarios dominated by unmanned aerial vehicles (UAVs): existing evaluations either ignore the unique challenges of such scenes or focus on specific tasks like localization or navigation without measuring general intelligence. The key contribution is MM-UAVBench, a comprehensive benchmark covering three core capability dimensions (perception, cognition, and planning) with 19 sub-tasks and over 5.7K manually annotated questions, all derived from real UAV data. Its structured, scalable, application-oriented design exposes bottlenecks such as spatial bias and weak multi-view understanding in current models, laying a foundation for robust, reliable MLLMs in real UAV environments.

Link: https://arxiv.org/abs/2512.23219
Authors: Shiqi Dai, Zizhi Ma, Zhicong Luo, Xuesong Yang, Yibin Huang, Wanyue Zhang, Chi Chen, Zonghao Guo, Wang Xu, Yufei Sun, Maosong Sun
Institutions: Tsinghua University (清华大学); Nankai University (南开大学); Northwest Polytechnical University (西北工业大学); Chinese Academy of Sciences (中国科学院); Harbin Institute of Technology (哈尔滨工业大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages

Click to view abstract

Abstract:While Multimodal Large Language Models (MLLMs) have exhibited remarkable general intelligence across diverse domains, their potential in low-altitude applications dominated by Unmanned Aerial Vehicles (UAVs) remains largely underexplored. Existing MLLM benchmarks rarely cover the unique challenges of low-altitude scenarios, while UAV-related evaluations mainly focus on specific tasks such as localization or navigation, without a unified evaluation of MLLMs’general intelligence. To bridge this gap, we present MM-UAVBench, a comprehensive benchmark that systematically evaluates MLLMs across three core capability dimensions-perception, cognition, and planning-in low-altitude UAV scenarios. MM-UAVBench comprises 19 sub-tasks with over 5.7K manually annotated questions, all derived from real-world UAV data collected from public datasets. Extensive experiments on 16 open-source and proprietary MLLMs reveal that current models struggle to adapt to the complex visual and cognitive demands of low-altitude scenarios. Our analyses further uncover critical bottlenecks such as spatial bias and multi-view understanding that hinder the effective deployment of MLLMs in UAV scenarios. We hope MM-UAVBench will foster future research on robust and reliable MLLMs for real-world UAV intelligence.

[CV-60] AVOID: The Adverse Visual Conditions Dataset with Obstacles for Driving Scene Understanding

【Quick Read】: This paper addresses the difficulty of reliable, real-time detection of unexpected small road obstacles by intelligent self-driving cars under varied adverse visual conditions (weather and daylight). Existing road driving datasets cover only normal or single adverse scenarios and lack obstacles captured in the same visual domain as the other classes, which limits generalization. The key to the solution is AVOID, the Adverse Visual Conditions Dataset, collected in a simulated environment, containing a large set of unexpected obstacles along each path under diverse weather and time conditions; every image is paired with semantic and depth maps, raw and semantic LiDAR data, and waypoints, supporting a wide range of visual perception tasks including obstacle detection. Benchmarks with high-performing real-time networks and ablations with a multi-task network validate the approach's effectiveness and robustness.

Link: https://arxiv.org/abs/2512.23215
Authors: Jongoh Jeong, Taek-Jin Song, Jong-Hwan Kim, Kuk-Jin Yoon
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Understanding road scenes for visual perception remains crucial for intelligent self-driving cars. In particular, it is desirable to detect unexpected small road hazards reliably in real-time, especially under varying adverse conditions (e.g., weather and daylight). However, existing road driving datasets provide large-scale images acquired in either normal or adverse scenarios only, and often do not contain the road obstacles captured in the same visual domain as for the other classes. To address this, we introduce a new dataset called AVOID, the Adverse Visual Conditions Dataset, for real-time obstacle detection collected in a simulated environment. AVOID consists of a large set of unexpected road obstacles located along each path captured under various weather and time conditions. Each image is coupled with the corresponding semantic and depth maps, raw and semantic LiDAR data, and waypoints, thereby supporting most visual perception tasks. We benchmark the results on high-performing real-time networks for the obstacle detection task, and also propose and conduct ablation studies using a comprehensive multi-task network for semantic segmentation, depth and waypoint prediction tasks.

[CV-61] ask-oriented Learnable Diffusion Timesteps for Universal Few-shot Learning of Dense Tasks

【Quick Read】: This paper addresses the sub-optimal performance of diffusion-based dense prediction that stems from heuristically chosen diffusion timestep features, a limitation felt most acutely in few-shot settings. The key to the solution is two modules: Task-aware Timestep Selection (TTS), which adaptively picks the timesteps best suited to a given task using timestep-wise losses and similarity scores, and Timestep Feature Consolidation (TFC), which consolidates the selected timestep features to boost few-shot dense prediction. Combined with a parameter-efficient fine-tuning adapter, the approach is validated on the large-scale Taskonomy benchmark under practical universal and few-shot learning scenarios.

Link: https://arxiv.org/abs/2512.23210
Authors: Changgyoon Oh, Jongoh Jeong, Jegyeong Cho, Kuk-Jin Yoon
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Denoising diffusion probabilistic models have brought tremendous advances in generative tasks, achieving state-of-the-art performance thus far. Current diffusion model-based applications exploit the power of learned visual representations from multistep forward-backward Markovian processes for single-task prediction tasks by attaching a task-specific decoder. However, the heuristic selection of diffusion timestep features still heavily relies on empirical intuition, often leading to sub-optimal performance biased towards certain tasks. To alleviate this constraint, we investigate the significance of versatile diffusion timestep features by adaptively selecting timesteps best suited for the few-shot dense prediction task, evaluated on an arbitrary unseen task. To this end, we propose two modules: Task-aware Timestep Selection (TTS) to select ideal diffusion timesteps based on timestep-wise losses and similarity scores, and Timestep Feature Consolidation (TFC) to consolidate the selected timestep features to improve the dense predictive performance in a few-shot setting. Accompanied by our parameter-efficient fine-tuning adapter, our framework effectively achieves superiority in dense prediction performance given only a few support queries. We empirically validate our learnable timestep consolidation method on the large-scale challenging Taskonomy dataset for dense prediction, particularly for practical universal and few-shot learning scenarios.
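
A minimal sketch of the timestep-selection idea follows. The abstract says TTS combines timestep-wise losses and similarity scores; the min-max normalization and the weighted score used here are assumptions about one plausible way to combine them.

```python
import torch

def select_timesteps(per_t_loss, per_t_sim, k=4, beta=0.5):
    """Pick the k diffusion timesteps with low task loss and high feature
    similarity to the task's support set (both signals normalized)."""
    ln = (per_t_loss - per_t_loss.min()) / (per_t_loss.max() - per_t_loss.min() + 1e-8)
    sn = (per_t_sim - per_t_sim.min()) / (per_t_sim.max() - per_t_sim.min() + 1e-8)
    score = beta * (1.0 - ln) + (1.0 - beta) * sn
    return torch.topk(score, k).indices.sort().values
```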

[CV-62] Exploring Syn-to-Real Domain Adaptation for Military Target Detection

【Quick Read】: This paper addresses the weak cross-domain generalization of military target detection, in particular how to detect targets accurately from low-cost RGB imagery when real annotated data are unavailable. The key to the solution is generating high-quality, photorealistic synthetic RGB data with Unreal Engine, a physically based real-time rendering engine, then training on this synthetic data and validating on real, web-collected military target data to achieve effective syn-to-real transfer. Benchmarking state-of-the-art domain adaptation methods at different levels of supervision shows that methods using minimal hints (e.g., object class labels) substantially outperform unsupervised or semi-supervised alternatives, highlighting the value of lightweight annotation for cross-domain detection.

Link: https://arxiv.org/abs/2512.23208
Authors: Jongoh Jeong, Youngjin Oh, Gyeongrae Nam, Jeongeun Lee, Kuk-Jin Yoon
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Object detection is one of the key target tasks of interest in the context of civil and military applications. In particular, the real-world deployment of target detection methods is pivotal in the decision-making process during military command and reconnaissance. However, current domain adaptive object detection algorithms consider adapting one domain to another similar one only within the scope of natural or autonomous driving scenes. Since military domains often deal with a mixed variety of environments, detecting objects from multiple varying target domains poses a greater challenge. Several studies for armored military target detection have made use of synthetic aperture radar (SAR) data due to its robustness to all weather, long range, and high-resolution characteristics. Nevertheless, the costs of SAR data acquisition and processing are still much higher than those of the conventional RGB camera, which is a more affordable alternative with significantly lower data processing time. Furthermore, the lack of military target detection datasets limits the use of such a low-cost approach. To mitigate these issues, we propose to generate RGB-based synthetic data using a photorealistic visual tool, Unreal Engine, for military target detection in a cross-domain setting. To this end, we conducted synthetic-to-real transfer experiments by training our synthetic dataset and validating on our web-collected real military target datasets. We benchmark the state-of-the-art domain adaptation methods distinguished by the degree of supervision on our proposed train-val dataset pair, and find that current methods using minimal hints on the image (e.g., object class) achieve a substantial improvement over unsupervised or semi-supervised DA methods. From these observations, we recognize the current challenges that remain to be overcome.

[CV-63] ForCM: Forest Cover Mapping from Multispectral Sentinel-2 Image by Integrating Deep Learning with Object-Based Image Analysis SDM2024

【Quick Read】: This paper addresses the limited accuracy of forest cover mapping, in particular the classification limits of traditional Object-Based Image Analysis (OBIA) in complex land-cover scenes. The key to the solution is ForCM, which combines deep learning (DL) models with OBIA on multispectral Sentinel-2 Level 2A imagery: several DL models (UNet, UNet++, ResUNet, AttentionUNet, and ResNet50-Segnet) are evaluated and the strongest are individually integrated with OBIA, markedly improving forest boundary delineation and class assignment. Experiments show overall accuracies of up to 95.64%, versus 92.91% for traditional OBIA, confirming that coupling deep learning with OBIA strengthens forest cover mapping.

Link: https://arxiv.org/abs/2512.23196
Authors: Maisha Haque, Israt Jahan Ayshi, Sadaf M. Anis, Nahian Tasnim, Mithila Moontaha, Md. Sabbir Ahmed, Muhammad Iqbal Hossain, Mohammad Zavid Parvez, Subrata Chakraborty, Biswajeet Pradhan, Biswajit Banik
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 7 figures. Accepted for presentation at the Australasian Data Science and Machine Learning Conference (AusDM 2024)

Click to view abstract

Abstract:This research proposes “ForCM”, a novel approach to forest cover mapping that combines Object-Based Image Analysis (OBIA) with Deep Learning (DL) using multispectral Sentinel-2 imagery. The study explores several DL models, including UNet, UNet++, ResUNet, AttentionUNet, and ResNet50-Segnet, applied to high-resolution Sentinel-2 Level 2A satellite images of the Amazon Rainforest. The datasets comprise three collections: two sets of three-band imagery and one set of four-band imagery. After evaluation, the most effective DL models are individually integrated with the OBIA technique to enhance mapping accuracy. The originality of this work lies in evaluating different deep learning models combined with OBIA and comparing them with traditional OBIA methods. The results show that the proposed ForCM method improves forest cover mapping, achieving overall accuracies of 94.54 percent with ResUNet-OBIA and 95.64 percent with AttentionUNet-OBIA, compared to 92.91 percent using traditional OBIA. This research also demonstrates the potential of free and user-friendly tools such as QGIS for accurate mapping within their limitations, supporting global environmental monitoring and conservation efforts.
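
One common way to couple a pixel-wise network with OBIA is per-segment majority voting: every OBIA segment takes the majority class of the DL prediction inside it. The paper does not spell out its exact fusion rule, so the sketch below is an assumed, illustrative instance of DL-OBIA integration.

```python
import numpy as np

def obia_fuse(pixel_pred, segments, num_classes):
    """Assign every OBIA segment the majority class of the deep model's
    pixel-wise prediction inside that segment.

    pixel_pred: (H, W) int class map from e.g. AttentionUNet
    segments:   (H, W) int segment ids from OBIA segmentation
    """
    fused = np.empty_like(pixel_pred)
    for seg_id in np.unique(segments):
        mask = segments == seg_id
        counts = np.bincount(pixel_pred[mask], minlength=num_classes)
        fused[mask] = counts.argmax()
    return fused
```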

[CV-64] GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation

【Quick Read】: This paper targets two gaps in existing driving world models (DWMs): insufficient 3D scene understanding, and the lack of precise alignment between textual information and the 3D scene; current methods represent 3D space with point cloud or BEV features, making cross-modal semantic consistency hard to achieve and limiting both understanding and generation. The key to the solution is a unified framework built on a 3D Gaussian scene representation: rich linguistic features are embedded into each Gaussian primitive for early text-scene alignment; a task-aware, language-guided sampling strategy prunes redundant 3D Gaussians and injects accurate, compact 3D tokens into the LLM; and a dual-condition multi-modal generation model uses high-level language conditions extracted by the vision-language model together with low-level image conditions to jointly guide generation, significantly improving both scene understanding and generation.

Link: https://arxiv.org/abs/2512.23180
Authors: Tianchen Deng, Xuefeng Chen, Yi Chen, Qu Chen, Yuyao Xu, Lijin Yang, Le Xu, Yu Zhang, Bo Zhang, Wuxiong Huang, Hesheng Wang
Institutions: Shanghai Jiao Tong University (上海交通大学); Tsinghua University (清华大学); MEGVII Technology (旷视科技); Mach Drive (马赫驱动)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi-modal scene generation, while also enabling contextual enrichment for understanding and generation tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early modality alignment. In addition, we design a novel task-aware language-guided sampling strategy that removes redundant 3D Gaussians and injects accurate and compact 3D tokens into LLM. Furthermore, we design a dual-condition multi-modal generation model, where the information captured by our vision-language model is leveraged as a high-level language condition in combination with a low-level image condition, jointly guiding the multi-modal generation process. We conduct comprehensive studies on the nuScenes, and NuInteract datasets to validate the effectiveness of our framework. Our method achieves state-of-the-art performance. We will release the code publicly on GitHub this https URL.

[CV-65] Machine Learning-Assisted Vocal Cord Ultrasound Examination: Project VIPR

【Quick Read】: This paper addresses the operator-dependent diagnostic accuracy of vocal cord ultrasound (VCUS) examination. The key to the solution is a machine-learning-assisted pipeline comprising a vocal cord segmentation model and a vocal cord paralysis (VCP) classification model, which automatically identifies the vocal cords in VCUS images and distinguishes normal from paralyzed cords; the VIPRnet classifier reaches 99% validation accuracy, substantially improving consistency and objectivity of diagnosis.

Link: https://arxiv.org/abs/2512.23177
Authors: Will Sebelik-Lassiter, Evan Schubert, Muhammad Alliyu, Quentin Robbins, Excel Olatunji, Mustafa Barry
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)
Comments: Won Best Undergraduate Research Paper at the 2025 Midwest Instruction Computing Symposium (MICS)

Click to view abstract

Abstract:Intro: Vocal cord ultrasound (VCUS) has emerged as a less invasive and better tolerated examination technique, but its accuracy is operator dependent. This research aims to apply a machine learning-assisted algorithm to automatically identify the vocal cords and distinguish normal vocal cord images from vocal cord paralysis (VCP). Methods: VCUS videos were acquired from 30 volunteers, which were split into still frames and cropped to a uniform size. Healthy and simulated VCP images were used as training data for vocal cord segmentation and VCP classification models. Results: The vocal cord segmentation model achieved a validation accuracy of 96%, while the best classification model (VIPRnet) achieved a validation accuracy of 99%. Conclusion: Machine learning-assisted analysis of VCUS shows great promise in improving diagnostic accuracy over operator-dependent human interpretation.

[CV-66] GVSynergy-Det: Synergistic Gaussian-Voxel Representations for Multi-View 3D Object Detection

【Quick Read】: This paper addresses two key issues in image-based 3D object detection: high-accuracy methods usually require dense 3D supervision (point clouds or TSDF), while methods without such supervision struggle to extract accurate geometry from RGB images alone. The key to the solution is GVSynergy-Det, which strengthens geometric features by synergistically learning continuous Gaussian and discrete voxel representations: Gaussians excel at fine surface detail while voxels provide structured spatial context. The framework introduces a dual-representation architecture and a cross-representation enhancement mechanism that enriches voxel features with geometric details from Gaussian fields, and directly exploits the complementary information of both representations through learnable integration, yielding more accurate object localization without any depth or dense 3D geometry supervision.

Link: https://arxiv.org/abs/2512.23176
Authors: Yi Zhang, Yi Wang, Lei Yao, Lap-Pui Chau
Institutions: The Hong Kong Polytechnic University (香港理工大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures

Click to view abstract

Abstract:Image-based 3D object detection aims to identify and localize objects in 3D space using only RGB images, eliminating the need for expensive depth sensors required by point cloud-based methods. Existing image-based approaches face two critical challenges: methods achieving high accuracy typically require dense 3D supervision, while those operating without such supervision struggle to extract accurate geometry from images alone. In this paper, we present GVSynergy-Det, a novel framework that enhances 3D detection through synergistic Gaussian-Voxel representation learning. Our key insight is that continuous Gaussian and discrete voxel representations capture complementary geometric information: Gaussians excel at modeling fine-grained surface details while voxels provide structured spatial context. We introduce a dual-representation architecture that: 1) adapts generalizable Gaussian Splatting to extract complementary geometric features for detection tasks, and 2) develops a cross-representation enhancement mechanism that enriches voxel features with geometric details from Gaussian fields. Unlike previous methods that either rely on time-consuming per-scene optimization or utilize Gaussian representations solely for depth regularization, our synergistic strategy directly leverages features from both representations through learnable integration, enabling more accurate object localization. Extensive experiments demonstrate that GVSynergy-Det achieves state-of-the-art results on challenging indoor benchmarks, significantly outperforming existing methods on both ScanNetV2 and ARKitScenes datasets, all without requiring any depth or dense 3D geometry supervision (e.g., point clouds or TSDF).

[CV-67] REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation

【Quick Read】: This paper addresses the evaluation of alignment between prompts and generated images for text-to-image (T2I) models, where existing methods rely on coarse metrics or static QA pipelines that lack fine-grained interpretability and poorly reflect human preference. The key to the solution is REVEALER, a reinforcement-guided visual reasoning framework following a structured "grounding-reasoning-conclusion" paradigm that lets multimodal large language models (MLLMs) explicitly localize semantic elements and produce interpretable alignment judgments; it is optimized with Group Relative Policy Optimization (GRPO) under a composite reward combining structural format, grounding accuracy, and alignment fidelity, achieving accurate, efficient, and interpretable element-level alignment evaluation.

Link: https://arxiv.org/abs/2512.23169
Authors: Fulin Shi, Wenyi Xiao, Bin Chen, Liang Din, Leilei Gan
Institutions: Zhejiang University (浙江大学); Alibaba Group (阿里巴巴集团)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Evaluating the alignment between textual prompts and generated images is critical for ensuring the reliability and usability of text-to-image (T2I) models. However, most existing evaluation methods rely on coarse-grained metrics or static QA pipelines, which lack fine-grained interpretability and struggle to reflect human preferences. To address this, we propose REVEALER, a unified framework for element-level alignment evaluation based on reinforcement-guided visual reasoning. Adopting a structured “grounding-reasoning-conclusion” paradigm, our method enables Multimodal Large Language Models (MLLMs) to explicitly localize semantic elements and derive interpretable alignment judgments. We optimize the model via Group Relative Policy Optimization(GRPO) using a composite reward function that incorporates structural format, grounding accuracy, and alignment fidelity. Extensive experiments across four benchmarks-EvalMuse-40K, RichHF, MHaluBench, and GenAI-Bench-demonstrate that REVEALER achieves state-of-the-art performance. Our approach consistently outperforms both strong proprietary models and supervised baselines while demonstrating superior inference efficiency compared to existing iterative visual reasoning methods.
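
The GRPO objective mentioned above is straightforward to sketch: rewards are standardized within the group of responses sampled for the same prompt, removing the need for a learned value function. The reward weights below are illustrative assumptions; the group-relative advantage is the defining ingredient.

```python
import torch

def composite_reward(r_format, r_grounding, r_alignment, w=(0.2, 0.4, 0.4)):
    """Mix the three reward terms; the weights are illustrative assumptions."""
    return w[0] * r_format + w[1] * r_grounding + w[2] * r_alignment

def grpo_advantages(rewards):
    """GRPO advantage: standardize rewards within the group of G responses
    sampled for one prompt; these weight the policy-gradient loss."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```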

[CV-68] SurgWorld: Learning Surgical Robot Policies from Videos via World Modeling

【Quick Read】: This paper addresses the generalization bottleneck in autonomous surgical robotics caused by data scarcity, in particular the absence of datasets that pair visual observations with accurate robot kinematics. The key to the solution is SurgWorld, a world model for surgical physical AI built on the curated Surgical Action Text Alignment (SATA) dataset, which generates diverse, generalizable, and realistic surgical videos; an inverse dynamics model then infers pseudokinematics from the synthetic videos, yielding synthetic paired video-action data for training. Experiments show that a vision-language-action (VLA) policy trained with this augmented data significantly outperforms models trained only on real demonstrations on a real surgical robot platform, offering a scalable path toward data-efficient, generalizable autonomous surgical skills.

Link: https://arxiv.org/abs/2512.23162
Authors: Yufan He, Pengfei Guo, Mengya Xu, Zhaoshuo Li, Andriy Myronenko, Dillan Imans, Bingjie Liu, Dongren Yang, Mingxue Gu, Yongnan Ji, Yueming Jin, Ren Zhao, Baiyong Shen, Daguang Xu
Institutions: NVIDIA; The Chinese University of Hong Kong (香港中文大学); Sung Kyun Kwan University (成均馆大学); Wenzhou Medical University (温州医科大学); National University of Singapore (新加坡国立大学); Ruijin Hospital (瑞金医院)
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Data scarcity remains a fundamental barrier to achieving fully autonomous surgical robots. While large scale vision language action (VLA) models have shown impressive generalization in household and industrial manipulation by leveraging paired video action data from diverse domains, surgical robotics suffers from the paucity of datasets that include both visual observations and accurate robot kinematics. In contrast, vast corpora of surgical videos exist, but they lack corresponding action labels, preventing direct application of imitation learning or VLA training. In this work, we aim to alleviate this problem by learning policy models from SurgWorld, a world model designed for surgical physical AI. We curated the Surgical Action Text Alignment (SATA) dataset with detailed action description specifically for surgical robots. Then we built SurgeWorld based on the most advanced physical AI world model and SATA. It’s able to generate diverse, generalizable and realistic surgery videos. We are also the first to use an inverse dynamics model to infer pseudokinematics from synthetic surgical videos, producing synthetic paired video action data. We demonstrate that a surgical VLA policy trained with these augmented data significantly outperforms models trained only on real demonstrations on a real surgical robot platform. Our approach offers a scalable path toward autonomous surgical skill acquisition by leveraging the abundance of unlabeled surgical video and generative world modeling, thus opening the door to generalizable and data efficient surgical robot policies.

[CV-69] GeoTeacher: Geometry-Guided Semi-Supervised 3D Object Detection

【Quick Read】: This paper addresses a weakness of semi-supervised 3D object detection (SS3D): with limited labeled data, models become insensitive to object geometry, which hurts the student model's perception and localization ability. The key to the solution is GeoTeacher, with two core mechanisms: (1) a keypoint-based geometric relation supervision module that transfers the teacher's knowledge of object geometry to the student; and (2) a voxel-wise data augmentation strategy that increases geometric diversity to improve the student's understanding of geometric structure, with a distance-decay mechanism preserving the integrity of distant objects. The method is compatible with various existing SS3D methods, boosting their performance and setting a new state of the art.

Link: https://arxiv.org/abs/2512.23147
Authors: Jingyu Li, Xiaolong Zhao, Zhe Liu, Wenxiao Wu, Li Zhang
Institutions: Fudan University (复旦大学); Shanghai Innovation Institute; Tongji University (同济大学); Hong Kong University (香港大学); Huazhong University of Science and Technology (华中科技大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Semi-supervised 3D object detection, aiming to explore unlabeled data for boosting 3D object detectors, has emerged as an active research area in recent years. Some previous methods have shown substantial improvements by either employing heterogeneous teacher models to provide high-quality pseudo labels or enforcing feature-perspective consistency between the teacher and student networks. However, these methods overlook the fact that the model usually tends to exhibit low sensitivity to object geometries with limited labeled data, making it difficult to capture geometric information, which is crucial for enhancing the student model’s ability in object perception and localization. In this paper, we propose GeoTeacher to enhance the student model’s ability to capture geometric relations of objects with limited training data, especially unlabeled data. We design a keypoint-based geometric relation supervision module that transfers the teacher model’s knowledge of object geometry to the student, thereby improving the student’s capability in understanding geometric relations. Furthermore, we introduce a voxel-wise data augmentation strategy that increases the diversity of object geometries, thereby further improving the student model’s ability to comprehend geometric structures. To preserve the integrity of distant objects during augmentation, we incorporate a distance-decay mechanism into this strategy. Moreover, GeoTeacher can be combined with different SS3D methods to further improve their performance. Extensive experiments on the ONCE and Waymo datasets indicate the effectiveness and generalization of our method and we achieve the new state-of-the-art results. Code will be available at this https URL
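
The abstract does not give the exact form of the keypoint-based geometric relation supervision, but one natural instantiation, shown as an assumption below, matches the pairwise-distance structure of the student's keypoints to the teacher's, making the supervision sensitive to object geometry rather than absolute placement.

```python
import torch
import torch.nn.functional as F

def geometric_relation_loss(student_kpts, teacher_kpts):
    """Match the pairwise-distance structure of student keypoints (K, 3)
    to the teacher's, so only geometric relations are supervised."""
    d_s = torch.cdist(student_kpts, student_kpts)  # (K, K) distances
    d_t = torch.cdist(teacher_kpts, teacher_kpts)
    return F.smooth_l1_loss(d_s, d_t)
```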

[CV-70] Domain-Shift Immunity in Deep Deformable Registration via Local Feature Representations

【Quick Read】: This paper addresses the perceived fragility of learning-based deformable image registration under domain shift, especially its sensitivity to modality changes; conventional practice pursues robustness through large, diverse training data without explaining the underlying mechanism. The key to the solution is UniReg, a framework that decouples feature extraction from deformation estimation using a fixed, pre-trained feature extractor and a UNet-based deformation network; trained on a single dataset, it still achieves cross-domain and multi-modal registration comparable to optimization-based methods. The central finding is that robustness arises from reliance on local feature representations rather than global appearance, while conventional CNN models become modality-sensitive through dataset-induced biases in early convolutional layers, making domain-invariant local feature consistency the key driver of robustness.

Link: https://arxiv.org/abs/2512.23142
Authors: Mingzhen Shao, Sarang Joshi
Institutions: Kahlert School of Computing (凯勒特计算机学院); Scientific Computing and Imaging Institute (科学计算与成像研究所); University of Utah (犹他大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Deep learning has advanced deformable image registration, surpassing traditional optimization-based methods in both accuracy and efficiency. However, learning-based models are widely believed to be sensitive to domain shift, with robustness typically pursued through large and diverse training datasets, without explaining the underlying mechanisms. In this work, we show that domain-shift immunity is an inherent property of deep deformable registration models, arising from their reliance on local feature representations rather than global appearance for deformation estimation. To isolate and validate this mechanism, we introduce UniReg, a universal registration framework that decouples feature extraction from deformation estimation using fixed, pre-trained feature extractors and a UNet-based deformation network. Despite training on a single dataset, UniReg exhibits robust cross-domain and multi-modal performance comparable to optimization-based methods. Our analysis further reveals that failures of conventional CNN-based models under modality shift originate from dataset-induced biases in early convolutional layers. These findings identify local feature consistency as the key driver of robustness in learning-based deformable registration and motivate backbone designs that preserve domain-invariant local features.
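
The decoupling that UniReg relies on can be sketched in a few lines. Everything below (the 2D case, feature maps that keep the input resolution, the module names) is an assumption for illustration; the point is only that the extractor is frozen and the trainable UNet sees features, not raw intensities.

```python
import torch
import torch.nn.functional as F

class UniRegSketch(torch.nn.Module):
    """Frozen feature extractor + trainable deformation UNet (2D sketch)."""

    def __init__(self, frozen_extractor, deformation_unet):
        super().__init__()
        self.extractor = frozen_extractor.eval()
        for p in self.extractor.parameters():
            p.requires_grad_(False)          # features are fixed, pre-trained
        self.deform_net = deformation_unet   # outputs a 2-channel flow field

    def forward(self, moving, fixed):
        # Deformation is estimated from local features, not raw intensities
        feats = torch.cat([self.extractor(moving), self.extractor(fixed)], dim=1)
        flow = self.deform_net(feats)                      # (B, 2, H, W)
        b, _, h, w = moving.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing='ij')
        grid = torch.stack([xs, ys], dim=-1).expand(b, h, w, 2).to(moving)
        warped = F.grid_sample(moving, grid + flow.permute(0, 2, 3, 1),
                               align_corners=True)
        return warped, flow
```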

[CV-71] PathoSyn: Imaging-Pathology MRI Synthesis via Disentangled Deviation Diffusion

【Quick Read】: This paper addresses the anatomical corruption and structural discontinuities that current generative MRI synthesis suffers from feature entanglement, since existing methods operate in the global pixel domain or depend on binary masks and struggle to preserve both local lesion detail and global anatomical consistency. The key to the solution is PathoSyn, which decomposes synthesis into deterministic anatomical reconstruction and stochastic deviation modeling via a Deviation-Space Diffusion Model that learns the conditional distribution of pathological residuals, capturing local intensity variation while preserving global structural integrity by construction. A seam-aware fusion strategy and an inference-time stabilization module further suppress boundary artifacts and produce high-fidelity internal lesion heterogeneity, enabling interpretable disease-progression simulation and precise intervention planning.

Link: https://arxiv.org/abs/2512.23130
Authors: Jian Wang, Sixing Rong, Jiarui Xing, Yuling Xu, Weide Liu
Institutions: Boston Children’s Hospital, Harvard Medical School (哈佛医学院); Northeastern University (东北大学); Yale University (耶鲁大学); Nanchang University (南昌大学); Nanyang Technological University (南洋理工大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We present PathoSyn, a unified generative framework for Magnetic Resonance Imaging (MRI) image synthesis that reformulates imaging-pathology as a disentangled additive deviation on a stable anatomical manifold. Current generative models typically operate in the global pixel domain or rely on binary masks, these paradigms often suffer from feature entanglement, leading to corrupted anatomical substrates or structural discontinuities. PathoSyn addresses these limitations by decomposing the synthesis task into deterministic anatomical reconstruction and stochastic deviation modeling. Central to our framework is a Deviation-Space Diffusion Model designed to learn the conditional distribution of pathological residuals, thereby capturing localized intensity variations while preserving global structural integrity by construction. To ensure spatial coherence, the diffusion process is coupled with a seam-aware fusion strategy and an inference-time stabilization module, which collectively suppress boundary artifacts and produce high-fidelity internal lesion heterogeneity. PathoSyn provides a mathematically principled pipeline for generating high-fidelity patient-specific synthetic datasets, facilitating the development of robust diagnostic algorithms in low-data regimes. By allowing interpretable counterfactual disease progression modeling, the framework supports precision intervention planning and provides a controlled environment for benchmarking clinical decision-support systems. Quantitative and qualitative evaluations on tumor imaging benchmarks demonstrate that PathoSyn significantly outperforms holistic diffusion and mask-conditioned baselines in both perceptual realism and anatomical fidelity. The source code of this work will be made publicly available.
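
The additive-deviation formulation is easy to state concretely: the synthesized image is the anatomical reconstruction plus a masked pathological residual. PathoSyn's fusion is learned, so the hand-crafted distance-based feathering below is only an assumed stand-in that illustrates what "seam-aware" composition must achieve at the lesion boundary.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def seam_aware_compose(anatomy, residual, lesion_mask, feather=5.0):
    """Compose image = anatomy + soft_mask * residual; the soft mask decays
    smoothly outside the lesion so no hard seam appears at the boundary."""
    dist_outside = distance_transform_edt(lesion_mask == 0)  # 0 inside lesion
    soft = np.exp(-(dist_outside ** 2) / (2.0 * feather ** 2))
    return anatomy + soft * residual
```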

[CV-72] MedSAM-based lung masking for multi-label chest X-ray classification

【Quick Read】: This paper addresses the challenges of automated chest X-ray (CXR) interpretation posed by weak disease signals, dataset bias, and limited spatial supervision. The key to the solution is a segmentation-guided classification pipeline that integrates the medical segmentation foundation model MedSAM as a lung region extraction module ahead of multi-label abnormality classification. A fine-tuned MedSAM produces anatomically plausible lung masks, and the study systematically evaluates how different masking strategies (tight vs. loose) affect architectures such as ResNet50 on five abnormalities (Mass, Nodule, Pneumonia, Edema, Fibrosis) plus the normal case (No Finding), concluding that masking should be treated as a controllable spatial prior chosen per backbone and clinical objective rather than applied uniformly.

Link: https://arxiv.org/abs/2512.23089
Authors: Brayden Miao, Zain Rehman, Xin Miao, Siming Liu, Jianjie Wang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 16 pages, 8 figures

Click to view abstract

Abstract:Chest X-ray (CXR) imaging is widely used for screening and diagnosing pulmonary abnormalities, yet automated interpretation remains challenging due to weak disease signals, dataset bias, and limited spatial supervision. Foundation models for medical image segmentation (MedSAM) provide an opportunity to introduce anatomically grounded priors that may improve robustness and interpretability in CXR analysis. We propose a segmentation-guided CXR classification pipeline that integrates MedSAM as a lung region extraction module prior to multi-label abnormality classification. MedSAM is fine-tuned using a public image-mask dataset from Airlangga University Hospital. We then apply it to a curated subset of the public NIH CXR dataset to train and evaluate deep convolutional neural networks for multi-label prediction of five abnormalities (Mass, Nodule, Pneumonia, Edema, and Fibrosis), with the normal case (No Finding) evaluated via a derived score. Experiments show that MedSAM produces anatomically plausible lung masks across diverse imaging conditions. We find that masking effects are both task-dependent and architecture-dependent. ResNet50 trained on original images achieves the strongest overall abnormality discrimination, while loose lung masking yields comparable macro AUROC but significantly improves No Finding discrimination, indicating a trade-off between abnormality-specific classification and normal case screening. Tight masking consistently reduces abnormality level performance but improves training efficiency. Loose masking partially mitigates this degradation by preserving perihilar and peripheral context. These results suggest that lung masking should be treated as a controllable spatial prior selected to match the backbone and clinical objective, rather than applied uniformly.
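
The tight-versus-loose distinction reduces to how much context around the segmented lungs is kept. A sketch follows; the dilation-based definition of "loose" and the `margin` value are assumptions, since the paper may define loose masking differently (e.g., via bounding boxes).

```python
import numpy as np
from scipy.ndimage import binary_dilation

def apply_lung_mask(image, lung_mask, mode="loose", margin=25):
    """Zero out everything outside the lung mask. 'loose' dilates the mask
    by `margin` pixels so perihilar/peripheral context is preserved."""
    mask = lung_mask.astype(bool)
    if mode == "loose":
        mask = binary_dilation(mask, iterations=margin)
    return np.where(mask, image, 0)
```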

[CV-73] Rethinking Fine-Tuning: Unlocking Hidden Capabilities in Vision-Language Models

【Quick Read】: This paper addresses the underuse of the representational structure already encoded in pre-trained vision-language models (VLMs) by current fine-tuning methods such as Low-Rank Adaptation (LoRA), which rely on explicit weight updates. The key to the solution is Mask Fine-Tuning (MFT), which assigns a learnable gating score to every weight and, without modifying the frozen backbone, reorganizes the model's internal subnetworks for efficient downstream adaptation. Experiments across VLMs with different language backbones show MFT consistently surpassing LoRA variants and even full fine-tuning, validating this structural reparameterization strategy.

Link: https://arxiv.org/abs/2512.23073
Authors: Mingyuan Zhang, Yue Bai, Yifan Wang, Yiyang Huang, Yun Fu
Institutions: Northeastern University (东北大学); Khoury College of Computer Science (计算机科学学院)
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Explorations in fine-tuning Vision-Language Models (VLMs), such as Low-Rank Adaptation (LoRA) from Parameter Efficient Fine-Tuning (PEFT), have made impressive progress. However, most approaches rely on explicit weight updates, overlooking the extensive representational structures already encoded in pre-trained models that remain underutilized. Recent works have demonstrated that Mask Fine-Tuning (MFT) can be a powerful and efficient post-training paradigm for language models. Instead of updating weights, MFT assigns learnable gating scores to each weight, allowing the model to reorganize its internal subnetworks for downstream task adaptation. In this paper, we rethink fine-tuning for VLMs from a structural reparameterization perspective grounded in MFT. We apply MFT to the language and projector components of VLMs with different language backbones and compare against strong PEFT baselines. Experiments show that MFT consistently surpasses LoRA variants and even full fine-tuning, achieving high performance without altering the frozen backbone. Our findings reveal that effective adaptation can emerge not only from updating weights but also from reestablishing connections among the model’s existing knowledge. Code available at: this https URL
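
The learnable-gating pattern behind MFT can be sketched as a drop-in replacement for a linear layer: the pretrained weight is frozen and only per-weight scores are trained, with a straight-through estimator letting gradients pass the hard binary gate. The thresholding and initialization below are assumptions; the abstract does not specify MFT's exact gating rule.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Frozen pretrained weight + learnable per-weight gating scores."""

    def __init__(self, linear: nn.Linear, init_score=2.0):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.detach(), requires_grad=False)
        self.bias = (nn.Parameter(linear.bias.detach(), requires_grad=False)
                     if linear.bias is not None else None)
        self.scores = nn.Parameter(torch.full_like(self.weight, init_score))

    def forward(self, x):
        soft = torch.sigmoid(self.scores)
        hard = (soft > 0.5).float()
        gate = hard + soft - soft.detach()   # straight-through gradient
        return nn.functional.linear(x, self.weight * gate, self.bias)
```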

[CV-74] Video-BrowseComp: Benchmarking Agent ic Video Research on Open Web

【Quick Read】: This paper addresses the clear inadequacy of current video understanding benchmarks for evaluating autonomous agents that must reason over video on the open web: existing video benchmarks focus on passive perception, feeding models pre-selected clips, and never test an agent's ability to actively retrieve videos, cross-reference evidence along timelines, and verify claims against open-web information. The key contribution is Video-BrowseComp, a benchmark of 210 questions whose answers are forced to depend on temporal visual evidence, so models cannot solve them by text search alone and must actually navigate video content to verify external claims. It fills a critical gap in the transition from static text agents to dynamic video agents, pushing research from passive perception toward proactive video reasoning.

Link: https://arxiv.org/abs/2512.23044
Authors: Zhengyang Liang, Yan Shu, Xiangrui Liu, Minghao Qin, Kaixin Liang, Paolo Rota, Nicu Sebe, Zheng Liu, Lizi Liao
Institutions: Singapore Management University (新加坡管理大学); University of Trento (特伦托大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院); Beijing University of Posts and Telecommunications (北京邮电大学); Hong Kong Polytechnic University (香港理工大学)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:The evolution of autonomous agents is redefining information seeking, transitioning from passive retrieval to proactive, open-ended web research. However, while textual and static multimodal agents have seen rapid progress, a significant modality gap remains in processing the web’s most dynamic modality: video. Existing video benchmarks predominantly focus on passive perception, feeding curated clips to models without requiring external retrieval. They fail to evaluate agentic video research, which necessitates actively interrogating video timelines, cross-referencing dispersed evidence, and verifying claims against the open web. To bridge this gap, we present \textbfVideo-BrowseComp, a challenging benchmark comprising 210 questions tailored for open-web agentic video reasoning. Unlike prior benchmarks, Video-BrowseComp enforces a mandatory dependency on temporal visual evidence, ensuring that answers cannot be derived solely through text search but require navigating video timelines to verify external claims. Our evaluation of state-of-the-art models reveals a critical bottleneck: even advanced search-augmented models like GPT-5.1 (w/ Search) achieve only 15.24% accuracy. Our analysis reveals that these models largely rely on textual proxies, excelling in metadata-rich domains (e.g., TV shows with plot summaries) but collapsing in metadata-sparse, dynamic environments (e.g., sports, gameplay) where visual grounding is essential. As the first open-web video research benchmark, Video-BrowseComp advances the field beyond passive perception toward proactive video reasoning.

[CV-75] 3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds

【Quick Read】: This paper addresses the high cost and labor intensity of collecting large-scale real 3D scene scans for 3D self-supervised learning. The key to the solution is Laplacian-Aware Multi-level 3D Clustering with Sinkhorn-Knopp (LAM3C), a self-supervised framework that learns 3D representations directly from unlabeled videos without any real 3D sensor data. Its core components are RoomTours, a video-generated point cloud dataset of 49,219 scenes built from web-collected room-walkthrough videos (e.g., real-estate tours), and a noise-regularized loss that enforces local geometric smoothness and feature stability under noisy point clouds to stabilize representation learning. Without any real 3D scans, LAM3C outperforms previous self-supervised methods on indoor semantic and instance segmentation, showing that unlabeled videos are an effective data source for 3D self-supervised learning.

Link: https://arxiv.org/abs/2512.23042
Authors: Ryousuke Yamada, Kohsuke Ide, Yoshihiro Fukuhara, Hirokatsu Kataoka, Gilles Puy, Andrei Bursuc, Yuki M. Asano
Institutions: AIST; University of Technology Nuremberg; INRIA; Valeo.ai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Despite recent progress in 3D self-supervised learning, collecting large-scale 3D scene scans remains expensive and labor-intensive. In this work, we investigate whether 3D representations can be learned from unlabeled videos recorded without any real 3D sensors. We present Laplacian-Aware Multi-level 3D Clustering with Sinkhorn-Knopp (LAM3C), a self-supervised framework that learns from video-generated point clouds from unlabeled videos. We first introduce RoomTours, a video-generated point cloud dataset constructed by collecting room-walkthrough videos from the web (e.g., real-estate tours) and generating 49,219 scenes using an off-the-shelf feed-forward reconstruction model. We also propose a noise-regularized loss that stabilizes representation learning by enforcing local geometric smoothness and ensuring feature stability under noisy point clouds. Remarkably, without using any real 3D scans, LAM3C achieves higher performance than the previous self-supervised methods on indoor semantic and instance segmentation. These results suggest that unlabeled videos represent an abundant source of data for 3D self-supervised learning.

[CV-76] oward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion

【Quick Read】: This paper addresses pseudo-label drift in semi-supervised semantic segmentation of remote sensing (RS) images, where confirmation bias accumulates erroneous labels during training and degrades performance. The key to the solution is Co2S, a stable semi-supervised RS segmentation framework with three mechanisms: (1) a heterogeneous dual-student architecture in which two distinct ViT-based vision foundation models, initialized from pretrained CLIP and DINOv3, learn collaboratively to reduce error propagation; (2) an explicit-implicit semantic co-guidance mechanism where text embeddings provide explicit class-level guidance and learnable queries provide implicit class-level guidance, strengthening semantic consistency; and (3) a global-local feature collaborative fusion strategy that integrates the global context captured by CLIP with the local details produced by DINOv3, improving segmentation accuracy.

Link: https://arxiv.org/abs/2512.23035
Authors: Yi Zhou, Xuechao Zou, Shun Zhang, Kai Li, Shiying Wang, Jingming Chen, Congyan Lang, Tengfei Cao, Pin Tao, Yuanchun Shi
Institutions: Qinghai University (青海大学); Intelligent Computing and Application Laboratory of Qinghai Province (青海省智能计算与应用重点实验室); Beijing Jiaotong University (北京交通大学); Key Lab of Big Data & Artificial Intelligence in Transportation (教育部大数据与交通人工智能重点实验室); Tsinghua University (清华大学); Key Laboratory of Pervasive Computing (教育部普适计算重点实验室)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 5 figures, 10 tables

Click to view abstract

Abstract:Semi-supervised remote sensing (RS) image semantic segmentation offers a promising solution to alleviate the burden of exhaustive annotation, yet it fundamentally struggles with pseudo-label drift, a phenomenon where confirmation bias leads to the accumulation of errors during training. In this work, we propose Co2S, a stable semi-supervised RS segmentation framework that synergistically fuses priors from vision-language models and self-supervised models. Specifically, we construct a heterogeneous dual-student architecture comprising two distinct ViT-based vision foundation models initialized with pretrained CLIP and DINOv3 to mitigate error accumulation and pseudo-label drift. To effectively incorporate these distinct priors, an explicit-implicit semantic co-guidance mechanism is introduced that utilizes text embeddings and learnable queries to provide explicit and implicit class-level guidance, respectively, thereby jointly enhancing semantic consistency. Furthermore, a global-local feature collaborative fusion strategy is developed to effectively fuse the global contextual information captured by CLIP with the local details produced by DINOv3, enabling the model to generate highly precise segmentation results. Extensive experiments on six popular datasets demonstrate the superiority of the proposed method, which consistently achieves leading performance across various partition protocols and diverse scenarios. Project page is available at this https URL.
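
A common way heterogeneous dual-student setups curb pseudo-label drift, sketched below under the assumption that Co2S uses some form of agreement filtering (the abstract does not state its exact rule), is to accept a pixel's pseudo label only when both students agree and are confident:

```python
import torch

def agreement_pseudo_labels(logits_a, logits_b, tau=0.9):
    """Keep a pixel's pseudo label only when both students agree and both
    are confident, limiting confirmation bias.

    logits_a, logits_b: (B, C, H, W) predictions from the CLIP- and
    DINOv3-initialized students.
    """
    conf_a, lab_a = logits_a.softmax(1).max(1)
    conf_b, lab_b = logits_b.softmax(1).max(1)
    valid = (lab_a == lab_b) & (conf_a > tau) & (conf_b > tau)
    return torch.where(valid, lab_a, torch.full_like(lab_a, -1))  # -1 = ignore
```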

[CV-77] Interpretable Gallbladder Ultrasound Diagnosis: A Lightweight Web-Mobile Software Platform with Real-Time XAI

【Quick Read】: This paper targets the difficulty of early and accurate detection of gallbladder diseases, in particular the challenge of interpreting ultrasound images. The key to the solution is an AI diagnostic software built around the hybrid deep learning model MobResTaNet, which classifies ten categories (nine gallbladder diseases plus normal) directly from ultrasound images and delivers real-time, interpretable predictions through Explainable AI (XAI) visualizations to support transparent clinical decision-making. The system reaches up to 99.85% accuracy with only 2.24M parameters, making it lightweight, efficient, and flexible to deploy; it is integrated into web and mobile applications built with HTML, CSS, JavaScript, Bootstrap, and Flutter to provide reliable diagnostic support at the point of care.

Link: https://arxiv.org/abs/2512.23033
Authors: Fuyad Hasan Bhoyan,Prashanta Sarker,Parsia Noor Ethila,Md. Emon Hossain,Md Kaviul Hossain,Md Humaion Kabir Mehedi
Institutions: University of Liberal Arts Bangladesh; BRAC University
Subjects: Software Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Early and accurate detection of gallbladder diseases is crucial, yet ultrasound interpretation is challenging. To address this, an AI-driven diagnostic software integrates our hybrid deep learning model MobResTaNet to classify ten categories, nine gallbladder disease types and normal directly from ultrasound images. The system delivers interpretable, real-time predictions via Explainable AI (XAI) visualizations, supporting transparent clinical decision-making. It achieves up to 99.85% accuracy with only 2.24M parameters. Deployed as web and mobile applications using HTML, CSS, JavaScript, Bootstrap, and Flutter, the software provides efficient, accessible, and trustworthy diagnostic support at the point of care

[CV-78] An Architecture-Led Hybrid Report on Body Language Detection Project

【Quick Read】: This paper examines how the architectural properties of modern vision-language models (VLMs) can be aligned with a practical video-to-structured-output engineering system, namely the video-to-behavior detection pipeline implemented in the BodyLanguageDetection repository. The core challenge is that, despite their multimodal understanding, VLM outputs of structured attributes (such as bounding boxes and emotion labels) can be syntactically valid yet semantically incorrect, and the frame-local way the models identify people limits cross-frame tracking. The key of the work is an architecture-led analysis mapping the models' visual tokenization, Transformer attention, and instruction-following capabilities onto the system design, while stressing that structured schema validation guarantees only syntactic correctness rather than geometric validity and that person IDs are frame-local under the current prompting contract, thereby grounding the design of verifiable, robust interfaces and evaluation plans.

Link: https://arxiv.org/abs/2512.23028
Authors: Thomson Tong,Diba Darooneh
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:This report provides an architecture-led analysis of two modern vision-language models (VLMs), Qwen2.5-VL-7B-Instruct and Llama-4-Scout-17B-16E-Instruct, and explains how their architectural properties map to a practical video-to-artifact pipeline implemented in the BodyLanguageDetection repository [1]. The system samples video frames, prompts a VLM to detect visible people and generate pixel-space bounding boxes with prompt-conditioned attributes (emotion by default), validates output structure using a predefined schema, and optionally renders an annotated video. We first summarize the shared multimodal foundation (visual tokenization, Transformer attention, and instruction following), then describe each architecture at a level sufficient to justify engineering choices without speculative internals. Finally, we connect model behavior to system constraints: structured outputs can be syntactically valid while semantically incorrect, schema validation is structural (not geometric correctness), person identifiers are frame-local in the current prompting contract, and interactive single-frame analysis returns free-form text rather than schema-enforced JSON. These distinctions are critical for writing defensible claims, designing robust interfaces, and planning evaluation.

[CV-79] With Great Context Comes Great Prediction Power: Classifying Objects via Geo-Semantic Scene Graphs

【Quick Read】: This paper addresses the fact that most computational object recognition systems ignore scene context, classifying isolated image regions, whereas humans recognize objects by understanding rich scene semantics such as spatial relationships, material properties, and co-occurring objects. The key to the solution is a novel Geo-Semantic Contextual Graph (GSCG): by combining a metric depth estimator with unified panoptic and material segmentation models, it builds a structured graph representation rich in geometric, chromatic, and material features, explicitly encoding objects as nodes and spatial relationships as edges. A graph-based classifier built on top of it aggregates features from the target object, its neighbors, and the global scene, enabling interpretable context-aware classification and raising recognition accuracy to 73.4%, far above context-agnostic or conventional deep learning models.

Link: https://arxiv.org/abs/2512.23024
Authors: Ciprian Constantinescu,Marius Leordeanu
Institutions: National University of Science and Technology POLITEHNICA Bucharest
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper is a development of the visual riddle game with Human-AI interaction, entitled "GuessWhat - Riddle Eye with AI", developed by Ciprian Constantinescu (POLITEHNICA Bucharest), Serena Stan (Instituto Cervantes Bucarest) and Marius Leordeanu (POLITEHNICA Bucharest), which was the winner (1st place) of the NeoArt Connect NAC 2025 Scholarship Program

Abstract:Humans effortlessly identify objects by leveraging a rich understanding of the surrounding scene, including spatial relationships, material properties, and the co-occurrence of other objects. In contrast, most computational object recognition systems operate on isolated image regions, devoid of meaning in isolation, thus ignoring this vital contextual information. This paper argues for the critical role of context and introduces a novel framework for contextual object classification. We first construct a Geo-Semantic Contextual Graph (GSCG) from a single monocular image. This rich, structured representation is built by integrating a metric depth estimator with a unified panoptic and material segmentation model. The GSCG encodes objects as nodes with detailed geometric, chromatic, and material attributes, and their spatial relationships as edges. This explicit graph structure makes the model’s reasoning process inherently interpretable. We then propose a specialized graph-based classifier that aggregates features from a target object, its immediate neighbors, and the global scene context to predict its class. Through extensive ablation studies, we demonstrate that our context-aware model achieves a classification accuracy of 73.4%, dramatically outperforming context-agnostic versions (as low as 38.4%). Furthermore, our GSCG-based approach significantly surpasses strong baselines, including fine-tuned ResNet models (max 53.5%) and a state-of-the-art multimodal Large Language Model (LLM), Llama 4 Scout, which, even when given the full image alongside a detailed description of objects, maxes out at 42.3%. These results on COCO 2017 train/val splits highlight the superiority of explicitly structured and interpretable context for object recognition tasks.
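
As a rough illustration of the aggregation described in the abstract, the sketch below assembles a classification feature from a target node, its graph neighbors, and the global scene. The data layout and function names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def contextual_feature(node_feats, adjacency, target):
    """Aggregate context for classifying one node of a scene graph.

    node_feats: (N, D) per-object attribute vectors (geometry, color,
    material). adjacency: dict node -> list of neighbor indices.
    Returns [target | mean(neighbors) | mean(scene)], a simple
    local-plus-global context scheme.
    """
    target_f = node_feats[target]
    nbrs = adjacency.get(target, [])
    nbr_f = node_feats[nbrs].mean(axis=0) if nbrs else np.zeros_like(target_f)
    scene_f = node_feats.mean(axis=0)
    return np.concatenate([target_f, nbr_f, scene_f])
```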

[CV-80] OpenGround: Active Cognition-based Reasoning for Open-World 3D Visual Grounding

【Quick Read】: This paper tackles the limitation of 3D visual grounding methods that depend on a pre-defined Object Lookup Table (OLT), which prevents them from handling open-world scenes with undefined or unforeseen targets. The key to the solution is OpenGround, a novel zero-shot framework centered on an Active Cognition-based Reasoning (ACR) module that progressively expands the cognitive scope of vision-language models (VLMs) through a human-like perception task chain and dynamically updates the OLT, thereby handling both pre-defined and open-world categories in a unified manner.

Link: https://arxiv.org/abs/2512.23020
Authors: Wenyuan Huang,Zhao Wang,Zhou Wei,Ting Huang,Fang Zhao,Jian Yang,Zhenyu Zhang
Institutions: Nanjing University, School of Intelligent Science and Technology; China Mobile Zijin Innovation Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 27 pages, 15 figures, 14 tables. Project page: this https URL

Abstract:3D visual grounding aims to locate objects based on natural language descriptions in 3D scenes. Existing methods rely on a pre-defined Object Lookup Table (OLT) to query Visual Language Models (VLMs) for reasoning about object locations, which limits the applications in scenarios with undefined or unforeseen targets. To address this problem, we present OpenGround, a novel zero-shot framework for open-world 3D visual grounding. Central to OpenGround is the Active Cognition-based Reasoning (ACR) module, which is designed to overcome the fundamental limitation of pre-defined OLTs by progressively augmenting the cognitive scope of VLMs. The ACR module performs human-like perception of the target via a cognitive task chain and actively reasons about contextually relevant objects, thereby extending VLM cognition through a dynamically updated OLT. This allows OpenGround to function with both pre-defined and open-world categories. We also propose a new dataset named OpenTarget, which contains over 7000 object-description pairs to evaluate our method in open-world scenarios. Extensive experiments demonstrate that OpenGround achieves competitive performance on Nr3D, state-of-the-art on ScanRefer, and delivers a substantial 17.6% improvement on OpenTarget. Project Page at [this https URL](this https URL).

[CV-81] A Low-Cost UAV Deep Learning Pipeline for Integrated Apple Disease Diagnosis, Freshness Assessment and Fruit Detection

【Quick Read】: This paper addresses the fact that disease detection, fruit quality assessment, and yield estimation in apple orchards are usually handled in isolation by existing UAV-based systems, which often depend on expensive multispectral sensors. The key to the solution is a unified, low-cost, RGB-only intelligent pipeline that integrates ResNet50 for leaf disease classification, VGG16 for apple freshness determination, and YOLOv8 for real-time apple detection and localization, deployed on edge devices such as an ESP32-CAM and a Raspberry Pi for fully offline on-site inference without cloud support, offering a practical route to affordable and scalable precision agriculture.

Link: https://arxiv.org/abs/2512.22990
Authors: Soham Dutta,Soham Banerjee,Sneha Mahata,Anindya Sen,Sayantani Datta
Institutions: Heritage Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has also been submitted to TechRxiv and is currently under moderation. The content is identical

Abstract:Apple orchards require timely disease detection, fruit quality assessment, and yield estimation, yet existing UAV-based systems address such tasks in isolation and often rely on costly multispectral sensors. This paper presents a unified, low-cost RGB-only UAV-based orchard intelligent pipeline integrating ResNet50 for leaf disease detection, VGG 16 for apple freshness determination, and YOLOv8 for real-time apple detection and localization. The system runs on an ESP32-CAM and Raspberry Pi, providing fully offline on-site inference without cloud support. Experiments demonstrate 98.9% accuracy for leaf disease classification, 97.4% accuracy for freshness classification, and 0.857 F1 score for apple detection. The framework provides an accessible and scalable alternative to multispectral UAV solutions, supporting practical precision agriculture on affordable hardware.

[CV-82] Reverse Personalization WACV2026

【Quick Read】: This paper addresses the limitations of current prompt-based face anonymization methods, which either require the target subject to be well represented in the pre-trained model or need identity-specific fine-tuning, restricting their generality and practicality. The key to the solution is a reverse personalization framework that enables direct, prompt-free image manipulation via conditional diffusion inversion and adds an identity-guided conditioning branch to generalize to subjects unseen in the training data, while supporting controllable preservation of facial attributes, thereby achieving the best balance among identity removal, attribute preservation, and image quality.

Link: https://arxiv.org/abs/2512.22984
Authors: Han-Wei Kung,Tuomas Varanka,Nicu Sebe
Institutions: University of Trento; University of Oulu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: WACV 2026

Abstract:Recent text-to-image diffusion models have demonstrated remarkable generation of realistic facial images conditioned on textual prompts and human identities, enabling creating personalized facial imagery. However, existing prompt-based methods for removing or modifying identity-specific features rely either on the subject being well-represented in the pre-trained model or require model fine-tuning for specific identities. In this work, we analyze the identity generation process and introduce a reverse personalization framework for face anonymization. Our approach leverages conditional diffusion inversion, allowing direct manipulation of images without using text prompts. To generalize beyond subjects in the model’s training data, we incorporate an identity-guided conditioning branch. Unlike prior anonymization methods, which lack control over facial attributes, our framework supports attribute-controllable anonymization. We demonstrate that our method achieves a state-of-the-art balance between identity removal, attribute preservation, and image quality. Source code and data are available at this https URL .
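
The conditional diffusion inversion the abstract relies on is, at its core, the deterministic DDIM inversion recurrence, which maps a real image to a latent the sampler can later edit. A minimal sketch of that generic recurrence follows, assuming a noise-prediction model and a cumulative schedule alpha_bar; the paper's identity-guided conditioning branch is omitted.

```python
import numpy as np

def ddim_invert(x0, eps_model, alpha_bar, cond):
    """Deterministic DDIM inversion: map an image x0 to a noisy latent.

    eps_model(x, t, cond) -> predicted noise; alpha_bar is the cumulative
    noise schedule, decreasing from ~1 (clean) toward 0 (pure noise).
    Running the usual DDIM update with the timesteps reversed lets a
    diffusion model reconstruct, and hence edit, a real input image.
    """
    x = x0
    for t in range(len(alpha_bar) - 1):
        a_t, a_next = alpha_bar[t], alpha_bar[t + 1]
        eps = eps_model(x, t, cond)
        # Predict the clean image implied by the current latent ...
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        # ... then re-noise it one step further along the schedule.
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
    return x
```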

[CV-83] Spatial-aware Symmetric Alignment for Text-guided Medical Image Segmentation

【Quick Read】: This paper addresses two key bottlenecks in text-guided medical image segmentation: existing methods struggle to process diagnostic and descriptive texts simultaneously, failing to identify lesions and associate them with image regions; and they lack effective modeling of spatial positional constraints, leading to severe deviations (e.g., a description of "the left lower lung" producing a mask covering both lungs). The key to the solution is the proposed Spatial-aware Symmetric Alignment (SSA) framework, with two core innovations: (1) a symmetric optimal transport alignment mechanism that strengthens bidirectional fine-grained cross-modal correspondences between image regions and multiple relevant textual expressions; and (2) a composite directional guidance strategy that explicitly injects the spatial constraints expressed in the text via region-level guidance masks, improving localization accuracy. Experiments show SSA achieves state-of-the-art performance on public benchmarks, particularly for lesions characterized by spatial relational constraints.

Link: https://arxiv.org/abs/2512.22981
Authors: Linglin Liao,Qichuan Geng,Yu Liu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-guided Medical Image Segmentation has shown considerable promise for medical image segmentation, with rich clinical text serving as an effective supplement for scarce data. However, current methods have two key bottlenecks. On one hand, they struggle to process diagnostic and descriptive texts simultaneously, making it difficult to identify lesions and establish associations with image regions. On the other hand, existing approaches focus on lesions description and fail to capture positional constraints, leading to critical deviations. Specifically, with the text “in the left lower lung”, the segmentation results may incorrectly cover both sides of the lung. To address the limitations, we propose the Spatial-aware Symmetric Alignment (SSA) framework to enhance the capacity of referring hybrid medical texts consisting of locational, descriptive, and diagnostic information. Specifically, we propose symmetric optimal transport alignment mechanism to strengthen the associations between image regions and multiple relevant expressions, which establishes bi-directional fine-grained multimodal correspondences. In addition, we devise a composite directional guidance strategy that explicitly introduces spatial constraints in the text by constructing region-level guidance masks. Extensive experiments on public benchmarks demonstrate that SSA achieves state-of-the-art (SOTA) performance, particularly in accurately segmenting lesions characterized by spatial relational constraints.

[CV-84] PoseStreamer: A Multi-modal Framework for 6DoF Pose Estimation of Unseen Moving Objects

【Quick Read】: This paper addresses 6DoF pose estimation for novel objects in highly dynamic scenes, where standard RGB cameras degrade under high-speed motion and low light due to motion blur, and existing event-camera-based methods still underperform for fast-moving objects. The key to the solution is the PoseStreamer framework, whose core innovations are an Adaptive Pose Memory Queue that exploits historical pose cues for temporal consistency, an Object-centric 2D Tracker that supplies strong 2D priors to boost 3D center recall, and a Ray Pose Filter that performs geometric refinement along camera rays. Through multimodal fusion and structured refinement, the method achieves accurate, template-free pose estimation in high-speed motion scenarios.

Link: https://arxiv.org/abs/2512.22979
Authors: Huiming Yang,Linglin Liao,Fei Ding,Sibo Wang,Zijian Zeng
Institutions: Renmin University of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Six degree of freedom (6DoF) pose estimation for novel objects is a critical task in computer vision, yet it faces significant challenges in high-speed and low-light scenarios where standard RGB cameras suffer from motion blur. While event cameras offer a promising solution due to their high temporal resolution, current 6DoF pose estimation methods typically yield suboptimal performance in high-speed object moving scenarios. To address this gap, we propose PoseStreamer, a robust multi-modal 6DoF pose estimation framework designed specifically on high-speed moving scenarios. Our approach integrates three core components: an Adaptive Pose Memory Queue that utilizes historical orientation cues for temporal consistency, an Object-centric 2D Tracker that provides strong 2D priors to boost 3D center recall, and a Ray Pose Filter for geometric refinement along camera rays. Furthermore, we introduce MoCapCube6D, a novel multi-modal dataset constructed to benchmark performance under rapid motion. Extensive experiments demonstrate that PoseStreamer not only achieves superior accuracy in high-speed moving scenarios, but also exhibits strong generalizability as a template-free framework for unseen moving objects.

[CV-85] RealCamo: Boosting Real Camouflage Synthesis with Layout Controls and Textual-Visual Guidance

【Quick Read】: This paper addresses two core problems of current generative approaches to camouflaged image generation (CIG): insufficient camouflage due to weak visual similarity, and cluttered backgrounds that are semantically inconsistent with the foreground targets, hurting realism. The key to the solution is ReamCamo, a unified out-painting based framework that introduces additional layout controls to regulate global image structure, improving semantic coherence between foreground and background, and builds a multimodal textual-visual condition combining fine-grained textual descriptions with texture-oriented background retrieval to jointly guide generation toward higher visual fidelity and realism. The paper further proposes a foreground-background distribution divergence metric to quantitatively assess the camouflage quality of generated images.

Link: https://arxiv.org/abs/2512.22974
Authors: Chunyuan Chen,Yunuo Cai,Shujuan Li,Weiyun Liang,Bin Wang,Jing Xu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 6 figures

Abstract:Camouflaged image generation (CIG) has recently emerged as an efficient alternative for acquiring high-quality training data for camouflaged object detection (COD). However, existing CIG methods still suffer from a substantial gap to real camouflaged imagery: generated images either lack sufficient camouflage due to weak visual similarity, or exhibit cluttered backgrounds that are semantically inconsistent with foreground targets. To address these limitations, we propose ReamCamo, a unified out-painting based framework for realistic camouflaged image generation. ReamCamo explicitly introduces additional layout controls to regulate global image structure, thereby improving semantic coherence between foreground objects and generated backgrounds. Moreover, we construct a multi-modal textual-visual condition by combining a unified fine-grained textual task description with texture-oriented background retrieval, which jointly guides the generation process to enhance visual fidelity and realism. To quantitatively assess camouflage quality, we further introduce a background-foreground distribution divergence metric that measures the effectiveness of camouflage in generated images. Extensive experiments and visualizations demonstrate the effectiveness of our proposed framework.

[CV-86] YOLO-IOD: Towards Real Time Incremental Object Detection AAAI2026

【Quick Read】: This paper addresses the performance degradation caused by catastrophic forgetting in YOLO-based incremental object detection (IOD); existing methods mostly rely on Faster R-CNN or DETR-style detectors and do not fit the real-time YOLO family. The authors identify three types of knowledge conflicts: foreground-background confusion, parameter interference, and misaligned knowledge distillation, and propose the YOLO-IOD framework with three key components: 1) Conflict-Aware Pseudo-Label Refinement (CPR), which mitigates foreground-background confusion via confidence filtering and awareness of objects relevant to future tasks; 2) Importance-based Kernel Selection (IKS), which identifies and updates only the convolution kernels pivotal to the current task at each stage, reducing parameter interference; and 3) Cross-Stage Asymmetric Knowledge Distillation (CAKD), which passes student features through the detection heads of both the previous and current teachers, enabling asymmetric knowledge transfer between old and new categories and alleviating misaligned distillation.

Link: https://arxiv.org/abs/2512.22973
Authors: Shizhou Zhang,Xueqiang Lv,Yinghui Xing,Qirui Wu,Di Xu,Chen Zhao,Yanning Zhang
Institutions: Northwestern Polytechnical University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: AAAI 2026 accepted. Code and models are available at: this https URL

Abstract:Current methods for incremental object detection (IOD) primarily rely on Faster R-CNN or DETR series detectors; however, these approaches do not accommodate the real-time YOLO detection frameworks. In this paper, we first identify three primary types of knowledge conflicts that contribute to catastrophic forgetting in YOLO-based incremental detectors: foreground-background confusion, parameter interference, and misaligned knowledge distillation. Subsequently, we introduce YOLO-IOD, a real-time Incremental Object Detection (IOD) framework that is constructed upon the pretrained YOLO-World model, facilitating incremental learning via a stage-wise parameter-efficient fine-tuning process. Specifically, YOLO-IOD encompasses three principal components: 1) Conflict-Aware Pseudo-Label Refinement (CPR), which mitigates the foreground-background confusion by leveraging the confidence levels of pseudo labels and identifying potential objects relevant to future tasks. 2) Importancebased Kernel Selection (IKS), which identifies and updates the pivotal convolution kernels pertinent to the current task during the current learning stage. 3) Cross-Stage Asymmetric Knowledge Distillation (CAKD), which addresses the misaligned knowledge distillation conflict by transmitting the features of the student target detector through the detection heads of both the previous and current teacher detectors, thereby facilitating asymmetric distillation between existing and newly introduced categories. We further introduce LoCo COCO, a more realistic benchmark that eliminates data leakage across stages. Experiments on both conventional and LoCo COCO benchmarks show that YOLO-IOD achieves superior performance with minimal forgetting.

[CV-87] Wavelet-based Multi-View Fusion of 4D Radar Tensor and Camera for Robust 3D Object Detection

【Quick Read】: This paper aims to improve multimodal 3D object detection with 4D mmWave radar: point-cloud representations lose information through multi-stage signal processing, while using raw radar data directly incurs prohibitive computational cost, limiting perception in autonomous driving and robotics. The key to the solution is the WRCFormer framework, which converts the decoupled radar cube into multi-view representations and introduces a Wavelet Attention Module as the building block of a wavelet-based feature pyramid network (FPN) to strengthen the representation of sparse radar signals and image data; a two-stage, query-based, modality-agnostic fusion mechanism, Geometry-guided Progressive Fusion, then efficiently integrates multi-view features from camera and radar, improving detection accuracy at low computational overhead and showing stronger robustness in adverse weather.

Link: https://arxiv.org/abs/2512.22972
Authors: Runwei Guan,Jianan Liu,Shaofeng Liang,Fangqiang Ding,Shanliang Yao,Xiaokai Bai,Daizong Liu,Tao Huang,Guoqiang Mao,Hui Xiong
Institutions: Hong Kong University of Science and Technology (Guangzhou); Momoniai AI; Massachusetts Institute of Technology; Yancheng Institute of Technology; Zhejiang University; Wuhan University; James Cook University; Southeast University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: 10 pages, 10 figures

Abstract:4D millimeter-wave (mmWave) radar has been widely adopted in autonomous driving and robot perception due to its low cost and all-weather robustness. However, its inherent sparsity and limited semantic richness significantly constrain perception capability. Recently, fusing camera data with 4D radar has emerged as a promising cost effective solution, by exploiting the complementary strengths of the two modalities. Nevertheless, point-cloud-based radar often suffer from information loss introduced by multi-stage signal processing, while directly utilizing raw 4D radar data incurs prohibitive computational costs. To address these challenges, we propose WRCFormer, a novel 3D object detection framework that fuses raw radar cubes with camera inputs via multi-view representations of the decoupled radar cube. Specifically, we design a Wavelet Attention Module as the basic module of wavelet-based Feature Pyramid Network (FPN) to enhance the representation of sparse radar signals and image data. We further introduce a two-stage query-based, modality-agnostic fusion mechanism termed Geometry-guided Progressive Fusion to efficiently integrate multi-view features from both modalities. Extensive experiments demonstrate that WRCFormer achieves state-of-the-art performance on the K-Radar benchmarks, surpassing the best model by approximately 2.4% in all scenarios and 1.6% in the sleet scenario, highlighting its robustness under adverse weather conditions.

[CV-88] CLIP-Joint-Detect: End-to-End Joint Training of Object Detectors with Contrastive Vision-Language Supervision

【Quick Read】: This paper addresses the vulnerability of conventional object detectors that rely on cross-entropy classification to class imbalance and label noise. The key to the solution is CLIP-Joint-Detect, a detector-agnostic framework that brings CLIP-style contrastive vision-language supervision into detection via end-to-end joint training: a lightweight parallel branch projects region or grid features into the CLIP embedding space and aligns them with learnable class-specific text embeddings using an InfoNCE contrastive loss, while all standard detection losses are optimized simultaneously. The approach markedly improves closed-set detection for both two-stage (e.g., Faster R-CNN) and one-stage (e.g., YOLOv11) architectures, delivering consistent gains while preserving real-time inference speed.

Link: https://arxiv.org/abs/2512.22969
Authors: Behnam Raoufi,Hossein Sharify,Mohamad Mahdee Ramezanee,Khosrow Hajsadeghi,Saeed Bagheri Shouraki
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 4 figures. Preprint under review

Abstract:Conventional object detectors rely on cross-entropy classification, which can be vulnerable to class imbalance and label noise. We propose CLIP-Joint-Detect, a simple and detector-agnostic framework that integrates CLIP-style contrastive vision-language supervision through end-to-end joint training. A lightweight parallel head projects region or grid features into the CLIP embedding space and aligns them with learnable class-specific text embeddings via InfoNCE contrastive loss and an auxiliary cross-entropy term, while all standard detection losses are optimized simultaneously. The approach applies seamlessly to both two-stage and one-stage architectures. We validate it on Pascal VOC 2007+2012 using Faster R-CNN and on the large-scale MS COCO 2017 benchmark using modern YOLO detectors (YOLOv11), achieving consistent and substantial improvements while preserving real-time inference speed. Extensive experiments and ablations demonstrate that joint optimization with learnable text embeddings markedly enhances closed-set detection performance across diverse architectures and datasets.
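
The contrastive alignment described above is essentially an InfoNCE loss between projected detector features and per-class text embeddings. A minimal numpy sketch, assuming the embeddings are given (in the paper they are learnable) and an illustrative temperature of 0.07:

```python
import numpy as np

def info_nce_loss(region_feats, text_embeds, labels, tau=0.07):
    """CLIP-style contrastive alignment of detector features with
    class text embeddings.

    region_feats: (N, D) projected region/grid features.
    text_embeds:  (C, D) one embedding per class.
    labels:       (N,) ground-truth class indices.
    """
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    logits = r @ t.T / tau                        # (N, C) scaled cosine sims
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Maximize similarity to the correct class embedding, minimize the rest.
    return -log_prob[np.arange(len(labels)), labels].mean()
```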

[CV-89] Learning Where to Focus: Density-Driven Guidance for Detecting Dense Tiny Objects

【Quick Read】: This paper addresses the difficulty of detecting dense tiny objects in high-resolution remote sensing imagery, where severe mutual occlusion and limited pixel footprints hinder feature learning, and existing detectors allocate computation uniformly, failing to focus on density-concentrated regions. The key to the solution is density-map-guided adaptive feature learning: a Density Generation Branch (DGB) builds quantifiable spatial priors to locate dense regions, and a Dense Area Focusing Module (DAFM) enables efficient local-global feature interaction. To mitigate feature degradation during hierarchical extraction, a Dual Filter Fusion Module (DFFM) further disentangles multi-scale features into high- and low-frequency components via the discrete cosine transform and applies density-guided cross-attention to enhance complementarity while suppressing background interference.

Link: https://arxiv.org/abs/2512.22949
Authors: Zhicheng Zhao,Xuanang Fan,Lingma Sun,Chenglong Li,Jin Tang
Institutions: Anhui University; China Electronics Technology Group Corporation; Hefei University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:High-resolution remote sensing imagery increasingly contains dense clusters of tiny objects, the detection of which is extremely challenging due to severe mutual occlusion and limited pixel footprints. Existing detection methods typically allocate computational resources uniformly, failing to adaptively focus on these density-concentrated regions, which hinders feature learning effectiveness. To address these limitations, we propose the Dense Region Mining Network (DRMNet), which leverages density maps as explicit spatial priors to guide adaptive feature learning. First, we design a Density Generation Branch (DGB) to model object distribution patterns, providing quantifiable priors that guide the network toward dense regions. Second, to address the computational bottleneck of global attention, our Dense Area Focusing Module (DAFM) uses these density maps to identify and focus on dense areas, enabling efficient local-global feature interaction. Finally, to mitigate feature degradation during hierarchical extraction, we introduce a Dual Filter Fusion Module (DFFM). It disentangles multi-scale features into high- and low-frequency components using a discrete cosine transform and then performs density-guided cross-attention to enhance complementarity while suppressing background interference. Extensive experiments on the AI-TOD and DTOD datasets demonstrate that DRMNet surpasses state-of-the-art methods, particularly in complex scenarios with high object density and severe occlusion.
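
The DFFM's frequency disentanglement can be pictured with a plain DCT low-pass split. The sketch below is a generic version, assuming a single-channel feature map and an illustrative cutoff, not the module's learned filters.

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_frequency_split(feat, keep=0.25):
    """Split a 2D feature map into low/high-frequency components with
    the discrete cosine transform.

    feat: (H, W) array. `keep` is the fraction of low-frequency DCT
    coefficients retained for the low-frequency branch (illustrative).
    """
    coeffs = dctn(feat, norm="ortho")
    h, w = feat.shape
    mask = np.zeros_like(coeffs)
    mask[: int(h * keep), : int(w * keep)] = 1.0   # top-left = low frequencies
    low = idctn(coeffs * mask, norm="ortho")
    high = feat - low                               # residual = high frequencies
    return low, high
```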

[CV-90] ColaVLA: Leverag ing Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving

【Quick Read】: This paper addresses three key problems of vision-language-model (VLM) based planning for autonomous driving: (i) the mismatch between discrete text reasoning and continuous control, (ii) the high latency of autoregressive chain-of-thought decoding, and (iii) inefficient or non-causal planners that prevent real-time deployment. The key to the proposed unified vision-language-action framework ColaVLA is twofold: (1) a Cognitive Latent Reasoner that compresses scene understanding into decision-oriented meta-action embeddings through ego-adaptive selection and only two VLM forward passes, moving reasoning from text into a unified latent space; and (2) a Hierarchical Parallel Planner that generates multi-scale, causality-consistent trajectories in a single forward pass, markedly improving efficiency and safety. This design preserves the generalization and interpretability of VLMs while enabling efficient, accurate, and safe trajectory generation.

Link: https://arxiv.org/abs/2512.22939
Authors: Qihang Peng,Xuesong Chen,Chenye Yang,Shaoshuai Shi,Hongsheng Li
Institutions: Tsinghua University; CUHK MMLab; Voyager Research, Didi Chuxing
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 4 figures

Abstract:Autonomous driving requires generating safe and reliable trajectories from complex multimodal inputs. Traditional modular pipelines separate perception, prediction, and planning, while recent end-to-end (E2E) systems learn them jointly. Vision-language models (VLMs) further enrich this paradigm by introducing cross-modal priors and commonsense reasoning, yet current VLM-based planners face three key challenges: (i) a mismatch between discrete text reasoning and continuous control, (ii) high latency from autoregressive chain-of-thought decoding, and (iii) inefficient or non-causal planners that limit real-time deployment. We propose ColaVLA, a unified vision-language-action framework that transfers reasoning from text to a unified latent space and couples it with a hierarchical, parallel trajectory decoder. The Cognitive Latent Reasoner compresses scene understanding into compact, decision-oriented meta-action embeddings through ego-adaptive selection and only two VLM forward passes. The Hierarchical Parallel Planner then generates multi-scale, causality-consistent trajectories in a single forward pass. Together, these components preserve the generalization and interpretability of VLMs while enabling efficient, accurate and safe trajectory generation. Experiments on the nuScenes benchmark show that ColaVLA achieves state-of-the-art performance in both open-loop and closed-loop settings with favorable efficiency and robustness.

[CV-91] JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation NEURIPS

【Quick Read】: This paper addresses the lack of a unified architecture with temporal synchronization for joint audio-video (JAV) comprehension and generation in multimodal large language models (MLLMs); existing models struggle to achieve cross-modal spatio-temporal alignment and coherent audio-video generation, especially in complex temporal settings. The key to the solution is JavisGPT, the first unified MLLM for JAV comprehension and generation, whose core innovations are: (1) a concise encoder-LLM-decoder architecture with a SyncFusion module for spatio-temporal audio-video fusion; and (2) synchrony-aware learnable queries that bridge a pretrained audio-video diffusion generator (JAV-DiT), enabling temporally coherent audio-video understanding and generation from multimodal instructions.

Link: https://arxiv.org/abs/2512.22905
Authors: Kai Liu,Jungang Li,Yuchong Sun,Shengqiong Wu,Jianzhang Gao,Daoan Zhang,Wei Zhang,Sheng Jin,Sicheng Yu,Geng Zhan,Jiayi Ji,Fan Zhou,Liang Zheng,Shuicheng Yan,Hao Fei,Tat-Seng Chua
Institutions: ZJU; NUS; HKUST(GZ); RUC; UR; HZCU; NTU; SMU; USYD; ANU
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS as a Spotlight paper. Code: this https URL

Abstract:This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for Joint Audio-Video (JAV) comprehension and generation. JavisGPT adopts a concise encoder-LLM-decoder architecture, featuring a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. To support this, we further construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that span diverse and multi-level comprehension and generation scenarios. Extensive experiments on JAV comprehension and generation benchmarks show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.

[CV-92] HiSciBench: A Hierarchical Multi-disciplinary Benchmark for Scientific Intelligence from Reading to Discovery

【Quick Read】: This paper addresses the fragmentation, lack of hierarchy, and limited multi-disciplinary coverage of existing benchmarks for scientific intelligence: current evaluations focus on narrow tasks and fail to reflect the hierarchical structure and cross-disciplinary nature of real scientific research, from basic understanding to creative discovery. The key to the solution is HiSciBench, a hierarchical benchmark that systematically covers five stages of scientific reasoning: Scientific Literacy (L1), Literature Parsing (L2), Literature-based Question Answering (L3), Literature Review Generation (L4), and Scientific Discovery (L5), integrating text, equations, figures, tables, and multilingual inputs with dependency-aware, comprehensive diagnosis. This design lets model capabilities be characterized across the full research workflow and reveals sharp performance drops with task complexity (e.g., up to 69% accuracy on basic literacy tasks but only 25% on discovery-level challenges), pointing to concrete directions for building more reliable models with stronger scientific reasoning.

Link: https://arxiv.org/abs/2512.22899
Authors: Yaping Zhang,Qixuan Zhang,Xingquan Zhang,Zhiyuan Chen,Wenwen Zhuang,Yupu Liang,Lu Xiang,Yang Zhao,Jiajun Zhang,Yu Zhou,Chengqing Zong
Institutions: Institute of Automation, Chinese Academy of Sciences; University of the Chinese Academy of Sciences
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The rapid advancement of large language models (LLMs) and multimodal foundation models has sparked growing interest in their potential for scientific research. However, scientific intelligence encompasses a broad spectrum of abilities ranging from understanding fundamental knowledge to conducting creative discovery, and existing benchmarks remain fragmented. Most focus on narrow tasks and fail to reflect the hierarchical and multi-disciplinary nature of real scientific inquiry. We introduce \textbfHiSciBench, a hierarchical benchmark designed to evaluate foundation models across five levels that mirror the complete scientific workflow: \textitScientific Literacy (L1), \textitLiterature Parsing (L2), \textitLiterature-based Question Answering (L3), \textitLiterature Review Generation (L4), and \textitScientific Discovery (L5). HiSciBench contains 8,735 carefully curated instances spanning six major scientific disciplines, including mathematics, physics, chemistry, biology, geography, and astronomy, and supports multimodal inputs including text, equations, figures, and tables, as well as cross-lingual evaluation. Unlike prior benchmarks that assess isolated abilities, HiSciBench provides an integrated, dependency-aware framework that enables detailed diagnosis of model capabilities across different stages of scientific reasoning. Comprehensive evaluations of leading models, including GPT-5, DeepSeek-R1, and several multimodal systems, reveal substantial performance gaps: while models achieve up to 69% accuracy on basic literacy tasks, performance declines sharply to 25% on discovery-level challenges. HiSciBench establishes a new standard for evaluating scientific Intelligence and offers actionable insights for developing models that are not only more capable but also more reliable. The benchmark will be publicly released to facilitate future research.

[CV-93] Hash Grid Feature Pruning

【Quick Read】: This paper addresses the redundant storage and transmission overhead in hash-grid-based implicit neural fields for Gaussian splatting: because Gaussian splats are irregularly and non-uniformly distributed in 3D space, many sparse regions exist and a large fraction of hash grid features are invalid. The key to the solution is a hash grid feature pruning method that identifies and removes invalid features based on the coordinates of the input Gaussian splats, encoding only the valid ones; this reduces the storage size of the hash grid without compromising model performance and improves rate-distortion performance, achieving an average bitrate reduction of 8% under the Common Test Conditions (CTC) defined by the standardization committee.

Link: https://arxiv.org/abs/2512.22882
Authors: Yangzhi Ma,Bojun Liu,Jie Li,Li Li,Dong Liu
Institutions: University of Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Hash grids are widely used to learn an implicit neural field for Gaussian splatting, serving either as part of the entropy model or for inter-frame prediction. However, due to the irregular and non-uniform distribution of Gaussian splats in 3D space, numerous sparse regions exist, rendering many features in the hash grid invalid. This leads to redundant storage and transmission overhead. In this work, we propose a hash grid feature pruning method that identifies and prunes invalid features based on the coordinates of the input Gaussian splats, so that only the valid features are encoded. This approach reduces the storage size of the hash grid without compromising model performance, leading to improved rate-distortion performance. Following the Common Test Conditions (CTC) defined by the standardization committee, our method achieves an average bitrate reduction of 8% compared to the baseline approach.
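
A minimal sketch of the pruning idea, assuming an Instant-NGP-style spatial hash over normalized splat coordinates; the primes and hashing details are illustrative, and the paper's mapping from Gaussians to hash entries may differ.

```python
import numpy as np

def prune_hash_features(points, features, resolution, table_size):
    """Keep only hash-table entries touched by at least one Gaussian.

    points: (N, 3) splat centers scaled into [0, 1)^3.
    features: (table_size, F) hash-grid feature table.
    Returns the packed valid features, an old->new index remap, and
    the validity mask, so only features[valid] need to be encoded.
    """
    primes = np.array([1, 2654435761, 805459861], dtype=np.uint64)
    cells = np.floor(points * resolution).astype(np.uint64)          # (N, 3)
    idx = np.bitwise_xor.reduce(cells * primes, axis=1) \
          % np.uint64(table_size)                                    # (N,)
    valid = np.zeros(table_size, dtype=bool)
    valid[np.unique(idx)] = True
    remap = np.cumsum(valid) - 1          # old index -> packed index
    return features[valid], remap, valid
```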

[CV-94] Guided Path Sampling: Steering Diffusion Models Back on Track with Principled Path Guidance

【Quick Read】: This paper addresses the instability of iterative refinement methods based on a denoising-inversion cycle when combined with standard Classifier-Free Guidance (CFG) in diffusion models. The core issue is that CFG's extrapolative nature systematically pushes the sampling path off the data manifold, causing the approximation error to diverge and undermining refinement. The key to the solution is Guided Path Sampling (GPS), a new iterative-refinement paradigm that replaces unstable extrapolation with a manifold-constrained interpolation so that the sampling path stays on the data manifold; theoretically, this correction turns the error series from unboundedly amplified into strictly bounded, guaranteeing stability. An optimal scheduling strategy that dynamically adjusts guidance strength further aligns semantic injection with the model's natural coarse-to-fine generation process, markedly improving image quality and prompt adherence.

Link: https://arxiv.org/abs/2512.22881
Authors: Haosen Li,Wenshuo Chen,Shaofeng Liang,Lei Wang,Haozhe Jia,Yutao Yue
Institutions: The Hong Kong University of Science and Technology (Guangzhou); Griffith University & Data61/CSIRO
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Iterative refinement methods based on a denoising-inversion cycle are powerful tools for enhancing the quality and control of diffusion models. However, their effectiveness is critically limited when combined with standard Classifier-Free Guidance (CFG). We identify a fundamental limitation: CFG’s extrapolative nature systematically pushes the sampling path off the data manifold, causing the approximation error to diverge and undermining the refinement process. To address this, we propose Guided Path Sampling (GPS), a new paradigm for iterative refinement. GPS replaces unstable extrapolation with a principled, manifold-constrained interpolation, ensuring the sampling path remains on the data manifold. We theoretically prove that this correction transforms the error series from unbounded amplification to strictly bounded, guaranteeing stability. Furthermore, we devise an optimal scheduling strategy that dynamically adjusts guidance strength, aligning semantic injection with the model’s natural coarse-to-fine generation process. Extensive experiments on modern backbones like SDXL and Hunyuan-DiT show that GPS outperforms existing methods in both perceptual quality and complex prompt adherence. For instance, GPS achieves a superior ImageReward of 0.79 and HPS v2 of 0.2995 on SDXL, while improving overall semantic alignment accuracy on GenEval to 57.45%. Our work establishes that path stability is a prerequisite for effective iterative refinement, and GPS provides a robust framework to achieve it.
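
The contrast the abstract draws can be seen directly in the guidance formulas: standard CFG extrapolates past the conditional prediction whenever the guidance weight exceeds 1, while a manifold-constrained rule stays between the two branches. The sketch below shows textbook CFG next to a generic bounded interpolation as a stand-in for GPS's rule, which the paper defines more elaborately together with its schedule.

```python
import numpy as np

def cfg_extrapolate(eps_uncond, eps_cond, w=7.5):
    """Standard classifier-free guidance: for w > 1 this is an
    extrapolation beyond the conditional noise prediction."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def bounded_interpolate(eps_uncond, eps_cond, lam=0.8):
    """A manifold-friendlier alternative in the spirit of GPS: a convex
    combination (lam in [0, 1]) cannot leave the segment between the
    two predictions, so the guided direction stays bounded. The paper's
    actual manifold-constrained rule is more elaborate."""
    lam = float(np.clip(lam, 0.0, 1.0))
    return (1.0 - lam) * eps_uncond + lam * eps_cond
```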

[CV-95] SwinTF3D: A Lightweight Multimodal Fusion Approach for Text-Guided 3D Medical Image Segmentation

【Quick Read】: This paper addresses the limited adaptability of current 3D medical image segmentation models to new scenarios and clinical tasks, notably their dependence on large annotated datasets and lack of semantic understanding, which prevents flexible, user-defined segmentation targets. The key to the solution is SwinTF3D, a lightweight multimodal fusion approach: a Transformer-based visual encoder extracts volumetric features and is combined with a compact text encoder through an efficient fusion mechanism, semantically aligning natural-language prompts with the corresponding spatial structures in medical volumes while maintaining high accuracy and low computational overhead.

Link: https://arxiv.org/abs/2512.22878
Authors: Hasan Faraz Khan,Noor Fatima,Muzammil Behzad
Institutions: 1: University of Engineering and Technology, Lahore; 2: National University of Sciences and Technology, Islamabad
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The recent integration of artificial intelligence into medical imaging has driven remarkable advances in automated organ segmentation. However, most existing 3D segmentation frameworks rely exclusively on visual learning from large annotated datasets restricting their adaptability to new domains and clinical tasks. The lack of semantic understanding in these models makes them ineffective in addressing flexible, user-defined segmentation objectives. To overcome these limitations, we propose SwinTF3D, a lightweight multimodal fusion approach that unifies visual and linguistic representations for text-guided 3D medical image segmentation. The model employs a transformer-based visual encoder to extract volumetric features and integrates them with a compact text encoder via an efficient fusion mechanism. This design allows the system to understand natural-language prompts and correctly align semantic cues with their corresponding spatial structures in medical volumes, while producing accurate, context-aware segmentation results with low computational overhead. Extensive experiments on the BTCV dataset demonstrate that SwinTF3D achieves competitive Dice and IoU scores across multiple organs, despite its compact architecture. The model generalizes well to unseen data and offers significant efficiency gains compared to conventional transformer-based segmentation networks. Bridging visual perception with linguistic understanding, SwinTF3D establishes a practical and interpretable paradigm for interactive, text-driven 3D medical image segmentation, opening perspectives for more adaptive and resource-efficient solutions in clinical imaging.

[CV-96] M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models

【Quick Read】: This paper addresses the fragility of concept erasure for generative AI under multimodal inputs: existing techniques erase concepts mainly from text prompts while overlooking other input modalities that matter in practice, such as learned embeddings and inverted latents used in image editing and personalized generation, which can become attack surfaces through which erased concepts re-emerge. The key to the solution is M-ErasureBench, the first benchmark to systematically evaluate multimodal concept erasure across three input modalities and five access scenarios, together with IRECE (Inference-time Robustness Enhancement for Concept Erasure), a plug-and-play module that localizes target concepts via cross-attention and perturbs the associated latents during denoising. Experiments show IRECE reduces the Concept Reproduction Rate (CRR) by up to 40% under the most challenging white-box latent inversion scenario while preserving image quality.

Link: https://arxiv.org/abs/2512.22877
Authors: Ju-Hsuan Weng,Jia-Wei Liao,Cheng-Fu Chou,Jun-Cheng Chen
Institutions: National Taiwan University; Research Center for Information Technology Innovation, Academia Sinica
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-to-image diffusion models may generate harmful or copyrighted content, motivating research on concept erasure. However, existing approaches primarily focus on erasing concepts from text prompts, overlooking other input modalities that are increasingly critical in real-world applications such as image editing and personalized generation. These modalities can become attack surfaces, where erased concepts re-emerge despite defenses. To bridge this gap, we introduce M-ErasureBench, a novel multimodal evaluation framework that systematically benchmarks concept erasure methods across three input modalities: text prompts, learned embeddings, and inverted latents. For the latter two, we evaluate both white-box and black-box access, yielding five evaluation scenarios. Our analysis shows that existing methods achieve strong erasure performance against text prompts but largely fail under learned embeddings and inverted latents, with Concept Reproduction Rate (CRR) exceeding 90% in the white-box setting. To address these vulnerabilities, we propose IRECE (Inference-time Robustness Enhancement for Concept Erasure), a plug-and-play module that localizes target concepts via cross-attention and perturbs the associated latents during denoising. Experiments demonstrate that IRECE consistently restores robustness, reducing CRR by up to 40% under the most challenging white-box latent inversion scenario, while preserving visual quality. To the best of our knowledge, M-ErasureBench provides the first comprehensive benchmark of concept erasure beyond text prompts. Together with IRECE, our benchmark offers practical safeguards for building more reliable protective generative models.

[CV-97] Let Samples Speak: Mitigating Spurious Correlation by Exploiting the Clusterness of Samples CVPR2025

【Quick Read】: This paper addresses the problem that deep learning models learn features spuriously correlated with class labels during training; such features may help overall performance but harm fairness and robustness on specific groups. The key to the solution is a data-oriented approach: first, spurious correlations are identified by observing that samples influenced by spurious features exhibit a dispersed distribution in the learned feature space; next, a simple grouping strategy neutralizes the spurious features to obtain a bias-invariant representation; then a feature transformation is learned to eliminate the spurious features by aligning with this representation; finally, the classifier is updated with the learned transformation to obtain an unbiased model. Together, these identifying, neutralizing, eliminating, and updating steps form an effective debiasing pipeline that improves worst-group accuracy by more than 20% over standard empirical risk minimization (ERM) on image and NLP debiasing benchmarks.

Link: https://arxiv.org/abs/2512.22874
Authors: Weiwei Li,Junzhuo Liu,Yuanyuan Ren,Yuchen Zheng,Yahao Liu,Wen Li
Institutions: University of Electronic Science and Technology of China; Shihezi University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025

Abstract:Deep learning models are known to often learn features that spuriously correlate with the class label during training but are irrelevant to the prediction task. Existing methods typically address this issue by annotating potential spurious attributes, or filtering spurious features based on some empirical assumptions (e.g., simplicity of bias). However, these methods may yield unsatisfactory performance due to the intricate and elusive nature of spurious correlations in real-world data. In this paper, we propose a data-oriented approach to mitigate the spurious correlation in deep learning models. We observe that samples that are influenced by spurious features tend to exhibit a dispersed distribution in the learned feature space. This allows us to identify the presence of spurious features. Subsequently, we obtain a bias-invariant representation by neutralizing the spurious features based on a simple grouping strategy. Then, we learn a feature transformation to eliminate the spurious features by aligning with this bias-invariant representation. Finally, we update the classifier by incorporating the learned feature transformation and obtain an unbiased model. By integrating the aforementioned identifying, neutralizing, eliminating and updating procedures, we build an effective pipeline for mitigating spurious correlation. Experiments on image and NLP debiasing benchmarks show an improvement in worst group accuracy of more than 20% compared to standard empirical risk minimization (ERM). Codes and checkpoints are available at this https URL .
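
A minimal sketch of the detection cue described above, flagging classes whose samples are dispersed in feature space; the threshold choice and the subsequent grouping, alignment, and classifier-update steps of the full pipeline are omitted.

```python
import numpy as np

def dispersion_scores(feats, labels):
    """Score each class by how dispersed its samples are in feature
    space, the signature of spurious-feature influence described in
    the abstract. Returns mean distance to the class centroid per
    class (higher = more dispersed). Only illustrates the detection
    cue, not the paper's full procedure."""
    scores = {}
    for c in np.unique(labels):
        f = feats[labels == c]
        centroid = f.mean(axis=0)
        scores[c] = np.linalg.norm(f - centroid, axis=1).mean()
    return scores
```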

[CV-98] Learning Anatomy from Multiple Perspectives via Self-supervision in Chest Radiographs

【Quick Read】: This paper addresses the weakness of existing self-supervised learning (SSL) methods in medical imaging, which overlook the consistency, coherence, and hierarchy of human anatomy and thus represent anatomical features poorly. The key to the solution is the Lamps model, pretrained on large-scale chest radiographs by harnessing these three properties of human anatomy as supervision signals, enabling it to learn robust representations aligned with human anatomy from multiple perspectives and markedly improving generalization to downstream tasks and clinical applicability.

Link: https://arxiv.org/abs/2512.22872
Authors: Ziyu Zhou,Haozhe Luo,Mohammad Reza Hosseinzadeh Taher,Jiaxuan Pang,Xiaowei Ding,Michael B. Gotway,Jianming Liang
Institutions: 1. University of California, San Diego; 2. University of California, San Francisco; 3. Stanford University; 4. Mayo Clinic
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Foundation models have been successful in natural language processing and computer vision because they are capable of capturing the underlying structures (foundation) of natural languages. However, in medical imaging, the key foundation lies in human anatomy, as these images directly represent the internal structures of the body, reflecting the consistency, coherence, and hierarchy of human anatomy. Yet, existing self-supervised learning (SSL) methods often overlook these perspectives, limiting their ability to effectively learn anatomical features. To overcome the limitation, we built Lamps (learning anatomy from multiple perspectives via self-supervision) pre-trained on large-scale chest radiographs by harmoniously utilizing the consistency, coherence, and hierarchy of human anatomy as the supervision signal. Extensive experiments across 10 datasets evaluated through fine-tuning and emergent property analysis demonstrate Lamps’ superior robustness, transferability, and clinical potential when compared to 10 baseline models. By learning from multiple perspectives, Lamps presents a unique opportunity for foundation models to develop meaningful, robust representations that are aligned with the structure of human anatomy.

[CV-99] MUSON: A Reasoning -oriented Multimodal Dataset for Socially Compliant Navigation in Urban Environments

【Quick Read】: This paper addresses the shortcomings of existing social navigation datasets in explicit reasoning supervision and balanced action distributions, which limit models' ability to learn safety-critical behaviors. The key to the solution is MUSON, a multimodal dataset for short-horizon social navigation that adopts a structured five-step Chain-of-Thought annotation pipeline (perception, prediction, reasoning, action, and explanation), explicitly models static physical constraints, and designs a rationally balanced discrete action space, markedly improving decision consistency and interpretability.

Link: https://arxiv.org/abs/2512.22867
Authors: Zhuonan Liu,Xinyu Zhang,Zishuo Wang,Tomohito Kawabata,Xuesu Xiao,Ling Xiao
Institutions: Hokkaido University; George Mason University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Socially compliant navigation requires structured reasoning over dynamic pedestrians and physical constraints to ensure safe and interpretable decisions. However, existing social navigation datasets often lack explicit reasoning supervision and exhibit highly long-tailed action distributions, limiting models’ ability to learn safety-critical behaviors. To address these issues, we introduce MUSON, a multimodal dataset for short-horizon social navigation collected across diverse indoor and outdoor campus scenes. MUSON adopts a structured five-step Chain-of-Thought annotation consisting of perception, prediction, reasoning, action, and explanation, with explicit modeling of static physical constraints and a rationally balanced discrete action space. Compared to SNEI, MUSON provides consistent reasoning, action, and explanation. Benchmarking multiple state-of-the-art Small Vision Language Models on MUSON shows that Qwen2.5-VL-3B achieves the highest decision accuracy of 0.8625, demonstrating that MUSON serves as an effective and reusable benchmark for socially compliant navigation. The dataset is publicly available at this https URL

[CV-100] ByteLoom: Weaving Geometry-Consistent Human-Object Interactions through Progressive Curriculum Learning

【Quick Read】: This paper addresses two key problems in human-object interaction (HOI) video generation: existing methods lack an effective mechanism to inject multi-view information about the object, leading to poor cross-view consistency, and they rely heavily on fine-grained hand mesh annotations, making interaction occlusions hard to model. The key to the proposed ByteLoom framework is twofold: (1) an RCM-cache mechanism that uses Relative Coordinate Maps (RCM) as a universal representation to maintain object geometry consistency while precisely controlling 6-DoF object transformations; and (2) a progressive training curriculum that compensates for the scarcity of HOI data and relaxes the demand for hand mesh annotations, so the model preserves human identity and multi-view object geometry while generating smooth HOI videos with realistic object manipulation.

Link: https://arxiv.org/abs/2512.22854
Authors: Bangya Liu,Xinyu Gong,Zelin Zhao,Ziyang Song,Yulei Lu,Suhui Wu,Jun Zhang,Suman Banerjee,Hao Zhang
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Abstract:Human-object interaction (HOI) video generation has garnered increasing attention due to its promising applications in digital humans, e-commerce, advertising, and robotics imitation learning. However, existing methods face two critical limitations: (1) a lack of effective mechanisms to inject multi-view information of the object into the model, leading to poor cross-view consistency, and (2) heavy reliance on fine-grained hand mesh annotations for modeling interaction occlusions. To address these challenges, we introduce ByteLoom, a Diffusion Transformer (DiT)-based framework that generates realistic HOI videos with geometrically consistent object illustration, using simplified human conditioning and 3D object inputs. We first propose an RCM-cache mechanism that leverages Relative Coordinate Maps (RCM) as a universal representation to maintain object’s geometry consistency and precisely control 6-DoF object transformations in the meantime. To compensate HOI dataset scarcity and leverage existing datasets, we further design a training curriculum that enhances model capabilities in a progressive style and relaxes the demand of hand mesh. Extensive experiments demonstrate that our method faithfully preserves human identity and the object’s multi-view geometry, while maintaining smooth motion and object manipulation.
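
A Relative Coordinate Map can be sketched from first principles: back-project each pixel with depth and intrinsics, then express the point in the object's frame via the object pose. The version below is a generic construction under those assumptions; ByteLoom's exact RCM definition may differ.

```python
import numpy as np

def relative_coordinate_map(depth, K, R, t):
    """Per-pixel object-frame coordinates from depth and object pose.

    depth: (H, W) metric depth; K: (3, 3) intrinsics; (R, t) maps
    object coordinates into the camera frame (X_cam = R @ X_obj + t).
    A map of this kind is view-consistent because it is expressed in
    the object's own frame, which is why it can serve as a universal
    object representation.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T              # back-project pixels (z = 1)
    x_cam = rays * depth.reshape(-1, 1)          # points in camera frame
    x_obj = (x_cam - t) @ R                      # batched R^T @ (x - t)
    return x_obj.reshape(h, w, 3)
```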

[CV-101] A Minimal Solver for Relative Pose Estimation with Unknown Focal Length from Two Affine Correspondences

【Quick Read】: This paper addresses the estimation of the relative pose between two cameras (3 degrees of freedom) together with an unknown focal length from two affine correspondences (ACs), assuming intrinsics known except for the focal length and a vertical direction obtained from an inertial measurement unit (IMU), which reduces the relative pose from 5DOF to 3DOF. The key to the solution is: first, constraint equations for the two affine correspondences are established from the known vertical direction; second, using the property that the equation system admits nontrivial solutions, four constraint equations involving only two parameters, the focal length and the relative rotation angle, are derived; finally, a polynomial eigenvalue method solves this nonlinear parametric system, yielding accurate joint estimates of the focal length and relative rotation angle.

Link: https://arxiv.org/abs/2512.22833
Authors: Zhenbao Yu,Shirong Ye,Ronghe Jin,Shunkun Liang,Zibin Liu,Huiyun Zhang,Banglei Guan
Institutions: National University of Defense Technology; Hunan Provincial Key Laboratory of Image Measurement and Vision Navigation; Wuhan University; School of Geodesy and Geomatics; School of Remote Sensing Information Engineering; School of Software; Henan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we aim to estimate the relative pose and focal length between two views with known intrinsic parameters except for an unknown focal length from two affine correspondences (ACs). Cameras are commonly used in combination with inertial measurement units (IMUs) in applications such as self-driving cars, smartphones, and unmanned aerial vehicles. The vertical direction of camera views can be obtained by IMU measurements. The relative pose between two cameras is reduced from 5DOF to 3DOF. We propose a new solver to estimate the 3DOF relative pose and focal length. First, we establish constraint equations from two affine correspondences when the vertical direction is known. Then, based on the properties of the equation system with nontrivial solutions, four equations can be derived. These four equations only involve two parameters: the focal length and the relative rotation angle. Finally, the polynomial eigenvalue method is utilized to solve the problem of focal length and relative rotation angle. The proposed solver is evaluated using synthetic and real-world datasets. The results show that our solver performs better than the existing state-of-the-art solvers.
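
The polynomial (here quadratic) eigenvalue machinery such solvers rely on reduces to a generalized eigenproblem by companion linearization. A minimal sketch with generic matrices follows; the paper's specific constraint matrices in the focal length and rotation angle are not reproduced.

```python
import numpy as np
from scipy.linalg import eig

def solve_qep(A2, A1, A0):
    """Solve the quadratic eigenvalue problem (s^2*A2 + s*A1 + A0) v = 0
    by companion linearization. A0, A1, A2 are n x n matrices; returns
    the eigenvalues s and the v-part of each stacked eigenvector.
    """
    n = A2.shape[0]
    I, Z = np.eye(n), np.zeros((n, n))
    # With z = [v, s*v], the QEP becomes the linear pencil L z = s * M z:
    #   row 1:  s*v = s*v
    #   row 2: -A0 v - A1 (s v) = s * A2 * (s v)
    L = np.block([[Z, I], [-A0, -A1]])
    M = np.block([[I, Z], [Z, A2]])
    vals, vecs = eig(L, M)
    return vals, vecs[:n]
```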

[CV-102] 3D Scene Change Modeling With Consistent Multi-View Aggregation

【Quick Read】: This paper addresses the spatial inconsistency of detected changes in existing 3D scene change detection methods and their inability to explicitly separate pre-change and post-change states. The key to the solution is the SCaR-3D framework, which combines a signed-distance-based 2D differencing module with multi-view aggregation and a voting-and-pruning strategy, exploiting the consistency of 3D Gaussian splatting (3DGS) to robustly identify object-level changes and separate pre- and post-change states; a continual scene reconstruction strategy further updates only the dynamic regions while preserving the unchanged static parts, improving both detection accuracy and efficiency.

Link: https://arxiv.org/abs/2512.22830
Authors: Zirui Zhou,Junfeng Ni,Shujie Zhang,Yixin Chen,Siyuan Huang
Institutions: Tsinghua University; State Key Laboratory of General Artificial Intelligence, BIGAI
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Change detection plays a vital role in scene monitoring, exploration, and continual reconstruction. Existing 3D change detection methods often exhibit spatial inconsistency in the detected changes and fail to explicitly separate pre- and post-change states. To address these limitations, we propose SCaR-3D, a novel 3D scene change detection framework that identifies object-level changes from a dense-view pre-change image sequence and sparse-view post-change images. Our approach consists of a signed-distance-based 2D differencing module followed by multi-view aggregation with voting and pruning, leveraging the consistent nature of 3DGS to robustly separate pre- and post-change states. We further develop a continual scene reconstruction strategy that selectively updates dynamic regions while preserving the unchanged areas. We also contribute CCS3D, a challenging synthetic dataset that allows flexible combinations of 3D change types to support controlled evaluations. Extensive experiments demonstrate that our method achieves both high accuracy and efficiency, outperforming existing methods.

[CV-103] KANO: Kolmogorov-Arnold Neural Operator for Image Super-Resolution

【速读】:该论文旨在解决单图像超分辨率(Single-image Super-resolution, SR)任务中因退化过程高度非线性、物理交互复杂及不确定性来源多样而导致的建模困难问题。现有可解释SR方法通常依赖黑箱深度网络来建模潜在变量,使得退化过程难以被理解和控制。其解决方案的关键在于首次提出一种基于Kolmogorov-Arnold定理(Kolmogorov-Arnold Theorem, KAT)的新型可解释算子——Kolmogorov-Arnold Neural Operator (KANO),该算子通过有限个B样条函数的加法结构分段逼近连续光谱曲线,并学习这些样条函数在限定区间内的形状参数,从而精确捕捉局部线性趋势和非线性拐点处的峰谷结构,使SR结果具备物理可解释性。

链接: https://arxiv.org/abs/2512.22822
作者: Chenyu Li,Danfeng Hong,Bing Zhang,Zhaojie Pan,Jocelyn Chanussot
机构: Southeast University (东南大学); Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学); Univ. Grenoble Alpes (格勒诺布尔阿尔卑斯大学); Inria (法国国家信息与自动化研究院); CNRS (法国国家科学研究中心); Grenoble INP (格勒诺布尔国立理工学院); LJK (格勒诺布尔大学数学与计算机科学实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:The highly nonlinear degradation process, complex physical interactions, and various sources of uncertainty render single-image Super-resolution (SR) a particularly challenging task. Existing interpretable SR approaches, whether based on prior learning or deep unfolding optimization frameworks, typically rely on black-box deep networks to model latent variables, which leaves the degradation process largely unknown and uncontrollable. Inspired by the Kolmogorov-Arnold theorem (KAT), we for the first time propose a novel interpretable operator, termed Kolmogorov-Arnold Neural Operator (KANO), with the application to image SR. KANO provides a transparent and structured representation of the latent degradation fitting process. Specifically, we employ an additive structure composed of a finite number of B-spline functions to approximate continuous spectral curves in a piecewise fashion. By learning and optimizing the shape parameters of these spline functions within defined intervals, our KANO accurately captures key spectral characteristics, such as local linear trends and the peak-valley structures at nonlinear inflection points, thereby endowing SR results with physical interpretability. Furthermore, through theoretical modeling and experimental evaluations across natural images, aerial photographs, and satellite remote sensing data, we systematically compare multilayer perceptrons (MLPs) and Kolmogorov-Arnold networks (KANs) in handling complex sequence fitting tasks. This comparative study elucidates the respective advantages and limitations of these models in characterizing intricate degradation mechanisms, offering valuable insights for the development of interpretable SR techniques.
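
The additive B-spline approximation at the heart of KANO can be illustrated with a plain least-squares fit of a 1D curve over a B-spline basis. Basis size, degree, and knot placement below are assumptions; the paper instead learns the spline shape parameters within defined intervals.

```python
import numpy as np
from scipy.interpolate import BSpline

def fit_bspline_curve(x, y, n_basis=10, degree=3):
    """Fit a 1D curve as a sum of B-spline basis functions, the kind of
    additive piecewise approximation used for spectral curves (here via
    least squares rather than training)."""
    # Open uniform knot vector on [x.min(), x.max()]: len(t) = n_basis + degree + 1
    inner = np.linspace(x.min(), x.max(), n_basis - degree + 1)
    t = np.concatenate([[inner[0]] * degree, inner, [inner[-1]] * degree])
    # Design matrix: column i is the i-th basis spline evaluated at x
    B = np.column_stack([
        BSpline(t, np.eye(n_basis)[i], degree, extrapolate=False)(x)
        for i in range(n_basis)
    ])
    B = np.nan_to_num(B)                        # outside-support values -> 0
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return BSpline(t, coef, degree, extrapolate=False)
```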
zh

[CV-104] Depth Anything in 360circ: Towards Scale Invariance in the Wild

【速读】:该论文旨在解决全景深度估计(panoramic depth estimation)在开放世界场景中零样本泛化能力不足的问题,尤其相较于透视图像(perspective images)因训练数据丰富而表现优异的情况。其核心挑战在于如何将已有的透视域模型迁移到全景域,并保持深度估计的几何一致性与精度。解决方案的关键创新在于:首先,从ViT主干网络中学习一个位移参数(shift parameter),将模型原本尺度和位移不变的输出转化为尺度不变的深度估计,从而直接生成结构合理的三维点云;其次,在DPT解码器中引入圆形填充(circular padding),消除拼接缝伪影(seam artifacts),确保深度图在球面连续性下空间一致。这一方法显著提升了室内和室外场景下的深度估计精度,实现了零样本全景深度估计的新SOTA性能。

链接: https://arxiv.org/abs/2512.22819
作者: Hualie Jiang,Ziyang Song,Zhiqiang Lou,Rui Xu,Minglang Tan
机构: Insta360 Research (Insta360 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:Panoramic depth estimation provides a comprehensive solution for capturing complete 360^\circ environmental structural information, offering significant benefits for robotics and AR/VR applications. However, while extensively studied in indoor settings, its zero-shot generalization to open-world domains lags far behind perspective images, which benefit from abundant training data. This disparity makes transferring capabilities from the perspective domain an attractive solution. To bridge this gap, we present Depth Anything in 360^\circ (DA360), a panoramic-adapted version of Depth Anything V2. Our key innovation involves learning a shift parameter from the ViT backbone, transforming the model’s scale- and shift-invariant output into a scale-invariant estimate that directly yields well-formed 3D point clouds. This is complemented by integrating circular padding into the DPT decoder to eliminate seam artifacts, ensuring spatially coherent depth maps that respect spherical continuity. Evaluated on standard indoor benchmarks and our newly curated outdoor dataset, Metropolis, DA360 shows substantial gains over its base model, achieving over 50% and 10% relative depth error reduction on indoor and outdoor benchmarks, respectively. Furthermore, DA360 significantly outperforms robust panoramic depth estimation methods, achieving about 30% relative error improvement compared to PanDA across all three test datasets and establishing new state-of-the-art performance for zero-shot panoramic depth estimation.
zh

[CV-105] EgoReAct: Egocentric Video-Driven 3D Human Reaction Generation

【速读】:该论文旨在解决从第一人称视角视频(egocentric video)中生成具有精确3D空间对齐和严格因果性的动态人类反应动作的问题。现有数据集(如ViMo)存在显著的空间不一致性,例如将动态运动与固定视角视频配对,导致生成动作缺乏空间准确性。解决方案的关键在于构建首个空间对齐的第一人称反应数据集Human Reaction Dataset (HRD),并提出EgoReAct框架——一种自回归生成模型,通过向量量化变分自编码器(Vector Quantised-Variational AutoEncoder, VQ-VAE)压缩动作至紧凑隐空间,并利用生成式预训练Transformer(Generative Pre-trained Transformer)从视觉输入中实时生成3D对齐的人类反应动作。EgoReAct引入了3D动态特征(如度量深度和头部运动),有效提升了空间定位能力,实验证明其在真实感、空间一致性及生成效率上均优于现有方法,同时保持生成过程的严格因果性。

链接: https://arxiv.org/abs/2512.22808
作者: Libo Zhang,Zekun Li,Tianyu Li,Zeyu Cao,Rui Xu,Xiaoxiao Long,Wenjia Wang,Jingbo Wang,Yuan Liu,Wenping Wang,Daquan Zhou,Taku Komura,Zhiyang Dou
机构: THU(清华大学); Brown(布朗大学); Georgia Tech(佐治亚理工学院); Cambridge(剑桥大学); HKU(香港大学); NJU(南京大学); CUHK(香港中文大学); HKUST(香港科技大学); TAMU(德州农工大学); PKU(北京大学); MIT(麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 9 figures

点击查看摘要

Abstract:Humans exhibit adaptive, context-sensitive responses to egocentric visual input. However, faithfully modeling such reactions from egocentric video remains challenging due to the dual requirements of strictly causal generation and precise 3D spatial alignment. To tackle this problem, we first construct the Human Reaction Dataset (HRD) to address data scarcity and misalignment by building a spatially aligned egocentric video-reaction dataset, as existing datasets (e.g., ViMo) suffer from significant spatial inconsistency between the egocentric video and reaction motion, e.g., dynamically moving motions are always paired with fixed-camera videos. Leveraging HRD, we present EgoReAct, the first autoregressive framework that generates 3D-aligned human reaction motions from egocentric video streams in real-time. We first compress the reaction motion into a compact yet expressive latent space via a Vector Quantised-Variational AutoEncoder and then train a Generative Pre-trained Transformer for reaction generation from the visual input. EgoReAct incorporates 3D dynamic features, i.e., metric depth, and head dynamics during the generation, which effectively enhance spatial grounding. Extensive experiments demonstrate that EgoReAct achieves remarkably higher realism, spatial consistency, and generation efficiency compared with prior methods, while maintaining strict causality during generation. We will release code, models, and data upon acceptance.
zh

[CV-106] ReDiF: Reinforced Distillation for Few Step Diffusion

【速读】:该论文旨在解决扩散模型(diffusion models)在采样过程中速度缓慢的问题,即如何通过知识蒸馏(distillation)技术训练出参数更少或推理步数更少的学生模型(student model),同时保持与高步数教师模型(teacher model)输出行为的高度一致。其解决方案的关键在于将蒸馏过程建模为一个基于强化学习(reinforcement learning, RL)的策略优化问题,其中学生模型通过来自教师输出对齐性的奖励信号进行训练,从而动态引导其探索多条去噪路径,并采取更长且优化的步骤向数据分布的高概率区域推进,而非依赖于逐次微调的增量式重构。该方法充分利用了扩散模型支持大步长生成的能力,实现了显著减少推理步数和计算资源消耗的同时保持优异性能,且具有模型无关性,适用于各类扩散模型。

链接: https://arxiv.org/abs/2512.22802
作者: Amirhossein Tighkhorshid,Zahra Dehghanian,Gholamali Aminian,Chengchun Shi,Hamid R. Rabiee
机构: Sharif University of Technology(谢里夫理工大学); Alan Turing Institute(艾伦·图灵研究所); London School of Economics(伦敦经济学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Distillation addresses the slow sampling problem in diffusion models by creating models with smaller size or fewer steps that approximate the behavior of high-step teachers. In this work, we propose a reinforcement learning based distillation framework for diffusion models. Instead of relying on fixed reconstruction or consistency losses, we treat the distillation process as a policy optimization problem, where the student is trained using a reward signal derived from alignment with the teacher’s outputs. This RL driven approach dynamically guides the student to explore multiple denoising paths, allowing it to take longer, optimized steps toward high-probability regions of the data distribution, rather than relying on incremental refinements. Our framework utilizes the inherent ability of diffusion models to handle larger steps and effectively manage the generative process. Experimental results show that our method achieves superior performance with significantly fewer inference steps and computational resources compared to existing distillation techniques. Additionally, the framework is model agnostic, applicable to any type of diffusion models with suitable reward functions, providing a general optimization paradigm for efficient diffusion learning.
zh

[CV-107] Evaluating the Performance of Open-Vocabulary Object Detection in Low-quality Image

【速读】:该论文旨在解决开放词汇目标检测(Open-vocabulary object detection)模型在低质量图像条件下性能下降的问题,特别是针对现实世界中常见的图像退化场景。其解决方案的关键在于构建了一个模拟真实世界低质量图像的新数据集,并在此基础上系统评估了多种主流模型在不同层级图像退化下的表现差异,发现OWLv2模型在各类退化条件下均保持更稳定的性能,而其他模型如OWL-ViT、GroundingDINO和Detic则出现显著性能下降,从而为未来提升开放词汇检测模型鲁棒性提供了基准和方向。

链接: https://arxiv.org/abs/2512.22801
作者: Po-Chih Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Open-vocabulary object detection enables models to localize and recognize objects beyond a predefined set of categories and is expected to achieve recognition capabilities comparable to human performance. In this study, we aim to evaluate the performance of existing models on open-vocabulary object detection tasks under low-quality image conditions. For this purpose, we introduce a new dataset that simulates low-quality images in the real world. In our evaluation experiment, we find that although open-vocabulary object detection models exhibited no significant decrease in mAP scores under low-level image degradation, the performance of all models dropped sharply under high-level image degradation. OWLv2 models consistently performed better across different types of degradation, while OWL-ViT, GroundingDINO, and Detic showed significant performance declines. We will release our dataset and codes to facilitate future studies.
zh

[CV-108] Medical Scene Reconstruction and Segmentation based on 3D Gaussian Representation

【速读】:该论文旨在解决医学图像三维重建中传统方法存在的计算复杂度高、稀疏切片条件下结构不连续及细节丢失等问题,从而难以满足临床对精度的要求。其解决方案的关键在于提出一种基于3D高斯(3D Gaussian)与三平面(tri-plane)表示相结合的高效重建方法,该方法在保持高斯表示在渲染效率和几何表达方面的优势基础上,显著提升了稀疏切片条件下的结构连续性和语义一致性,实验证明其能在多模态医学数据(如超声和MRI)上生成高质量、解剖学一致且语义稳定的三维图像,同时大幅提高重建效率。

链接: https://arxiv.org/abs/2512.22800
作者: Bin Liu,Wenyan Tian,Huangxin Fu,Zizheng Li,Zhifen He,Bo Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 3 figures

点击查看摘要

Abstract:3D reconstruction of medical images is a key technology in medical image analysis and clinical diagnosis, providing structural visualization support for disease assessment and surgical planning. Traditional methods are computationally expensive and prone to structural discontinuities and loss of detail in sparse slices, making it difficult to meet clinical accuracy this http URL address these challenges, we propose an efficient 3D reconstruction method based on 3D Gaussian and tri-plane representations. This method not only maintains the advantages of Gaussian representation in efficient rendering and geometric representation but also significantly enhances structural continuity and semantic consistency under sparse slicing conditions. Experimental results on multimodal medical datasets such as US and MRI show that our proposed method can generate high-quality, anatomically coherent, and semantically stable medical images under sparse data conditions, while significantly improving reconstruction efficiency. This provides an efficient and reliable new approach for 3D visualization and clinical analysis of medical images.
zh

[CV-109] VPTracker: Global Vision-Language Tracking via Visual Prompt and MLLM

【速读】:该论文旨在解决视觉-语言跟踪(Vision-Language Tracking)中因局部搜索策略导致的鲁棒性不足问题,尤其在视角变化、遮挡和目标快速运动等挑战场景下易发生漂移或失败。其解决方案的关键在于提出一种基于多模态大语言模型(Multimodal Large Language Models, MLLM)的全局跟踪框架VPTracker,并引入一种位置感知的视觉提示机制(location-aware visual prompting),通过构建基于目标前一时刻位置的区域级提示,引导模型优先进行区域级识别,仅在必要时启用全局推理,从而在保持全局搜索优势的同时有效抑制干扰对象带来的误匹配,显著提升跟踪稳定性与目标区分能力。

链接: https://arxiv.org/abs/2512.22799
作者: Jingchao Wang,Kaiwen Zhou,Zhijian Wu,Kunhua Ji,Dingjiang Huang,Yefeng Zheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages

点击查看摘要

Abstract:Vision-Language Tracking aims to continuously localize objects described by a visual template and a language description. Existing methods, however, are typically limited to local search, making them prone to failures under viewpoint changes, occlusions, and rapid target movements. In this work, we introduce the first global tracking framework based on Multimodal Large Language Models (VPTracker), exploiting their powerful semantic reasoning to locate targets across the entire image space. While global search improves robustness and reduces drift, it also introduces distractions from visually or semantically similar objects. To address this, we propose a location-aware visual prompting mechanism that incorporates spatial priors into the MLLM. Specifically, we construct a region-level prompt based on the target’s previous location, enabling the model to prioritize region-level recognition and resort to global inference only when necessary. This design retains the advantages of global tracking while effectively suppressing interference from distracting visual content. Extensive experiments show that our approach significantly enhances tracking stability and target disambiguation under challenging scenarios, opening a new avenue for integrating MLLMs into visual tracking. Code is available at this https URL.
zh

[CV-110] Parallel Diffusion Solver via Residual Dirichlet Policy Optimization

【速读】:该论文旨在解决扩散模型(Diffusion Models, DMs)在生成过程中因逐次去噪特性导致的高采样延迟问题,尤其针对现有基于求解器的加速方法在低延迟预算下图像质量显著下降的问题。其核心原因在于无法准确捕捉高曲率轨迹段,从而引发累积截断误差。解决方案的关键在于提出一种新型常微分方程(ODE)求解器——集成并行方向求解器(Ensemble Parallel Direction Solver, EPD-Solver),通过在每一步中引入多个并行梯度评估来减少此类误差;该方法利用向量值函数的中值定理更精确地逼近积分解,并借助低维流形几何洞察实现高效近似。此外,EPD-Solver 的额外梯度计算具有独立性,可完全并行化,从而保持低延迟特性。研究进一步设计了两阶段优化框架:首先采用蒸馏方式优化少量可学习参数,随后提出一种基于强化学习(Reinforcement Learning, RL)的参数高效微调策略,将求解器建模为随机狄利克雷策略,在低维求解器空间内操作以避免奖励劫持(reward hacking),显著提升复杂文本到图像(Text-to-Image, T2I)生成任务中的性能。

链接: https://arxiv.org/abs/2512.22796
作者: Ruoyu Wang,Ziyu Li,Beier Zhu,Liangyu Yuan,Hanwang Zhang,Xun Yang,Xiaojun Chang,Chi Zhang
机构: AGI lab, Westlake University (西湖大学); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校); Nanyang Technological University (南洋理工大学); Shanghai Jiao Tong University (上海交通大学); University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature. Existing solver-based acceleration methods often face significant image quality degradation under a low-latency budget, primarily due to accumulated truncation errors arising from the inability to capture high-curvature trajectory segments. In this paper, we propose the Ensemble Parallel Direction solver (dubbed as EPD-Solver), a novel ODE solver that mitigates these errors by incorporating multiple parallel gradient evaluations in each step. Motivated by the geometric insight that sampling trajectories are largely confined to a low-dimensional manifold, EPD-Solver leverages the Mean Value Theorem for vector-valued functions to approximate the integral solution more accurately. Importantly, since the additional gradient computations are independent, they can be fully parallelized, preserving low-latency sampling nature. We introduce a two-stage optimization framework. Initially, EPD-Solver optimizes a small set of learnable parameters via a distillation-based approach. We further propose a parameter-efficient Reinforcement Learning (RL) fine-tuning scheme that reformulates the solver as a stochastic Dirichlet policy. Unlike traditional methods that fine-tune the massive backbone, our RL approach operates strictly within the low-dimensional solver space, effectively mitigating reward hacking while enhancing performance in complex text-to-image (T2I) generation tasks. In addition, our method is flexible and can serve as a plugin (EPD-Plugin) to improve existing ODE samplers.
zh

[CV-111] Plug In Grade Right: Psychology-Inspired AGIQA

【速读】:该论文旨在解决现有自动图像质量评估(AGIQA)模型中因文本嵌入与图像嵌入在共享空间学习时出现的“语义漂移”(semantic drift)问题,即图像嵌入可能同时与多个等级描述(如“优秀”和“差”)高度相似,而偏离中间等级(如“良好”),从而导致质量评分不可靠。解决方案的关键在于借鉴心理测量学中的分级反应模型(Graded Response Model, GRM),提出一种基于算术建模的GRM质量分级模块(Arithmetic GRM based Quality Grading, AGQG),通过构建具有单调难度层级的图像能力估计机制,确保质量分布呈现单峰且可解释的特性,从而提升模型对图像质量等级判断的稳定性与准确性。该模块具备即插即用优势,能显著增强多种先进AGIQA框架的性能,并在自然图像与屏幕内容图像质量评估中均表现出良好的泛化能力。

链接: https://arxiv.org/abs/2512.22780
作者: Zhicheng Liao,Baoliang Chen,Hanwei Zhu,Lingyu Zhu,Shiqi Wang,Weisi Lin
机构: South China Normal University (华南师范大学); Nanyang Technological University (南洋理工大学); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Existing AGIQA models typically estimate image quality by measuring and aggregating the similarities between image embeddings and text embeddings derived from multi-grade quality descriptions. Although effective, we observe that such similarity distributions across grades usually exhibit multimodal patterns. For instance, an image embedding may show high similarity to both “excellent” and “poor” grade descriptions while deviating from the “good” one. We refer to this phenomenon as “semantic drift”, where semantic inconsistencies between text embeddings and their intended descriptions undermine the reliability of text-image shared-space learning. To mitigate this issue, we draw inspiration from psychometrics and propose an improved Graded Response Model (GRM) for AGIQA. The GRM is a classical assessment model that categorizes a subject’s ability across grades using test items with various difficulty levels. This paradigm aligns remarkably well with human quality rating, where image quality can be interpreted as an image’s ability to meet various quality grades. Building on this philosophy, we design a two-branch quality grading module: one branch estimates image ability while the other constructs multiple difficulty levels. To ensure monotonicity in difficulty levels, we further model difficulty generation in an arithmetic manner, which inherently enforces a unimodal and interpretable quality distribution. Our Arithmetic GRM based Quality Grading (AGQG) module enjoys a plug-and-play advantage, consistently improving performance when integrated into various state-of-the-art AGIQA frameworks. Moreover, it also generalizes effectively to both natural and screen content image quality assessment, revealing its potential as a key component in future IQA models.
zh

[CV-112] Schrodinger AI: A Unified Spectral-Dynamical Framework for Classification Reasoning and Operator-Based Generalization

【速读】:该论文旨在解决当前机器学习模型在泛化能力、语义可解释性以及动态环境适应性方面的局限性,尤其针对传统基于交叉熵损失和Transformer注意力机制的架构所面临的瓶颈。其核心解决方案是提出一种受量子力学启发的统一机器学习框架——Schrödinger AI,其关键在于三个紧密耦合的组件:(1) 时不变波能求解器,将感知与分类建模为由学习到的哈密顿量(Hamiltonian)驱动的谱分解;(2) 时变动力学求解器,通过演化语义波函数实现对环境变化的上下文感知决策修正与推理;(3) 低秩算子微积分,通过学习类量子跃迁算子来实现符号变换(如模运算)的精确泛化。这一物理驱动范式不仅实现了无需显式监督的语义流形涌现,还展现出在动态环境中进行实时推理与长序列组合操作的能力,标志着机器学习从经验拟合向能量景观探索的新范式转变。

链接: https://arxiv.org/abs/2512.22774
作者: Truong Son Nguyen
机构: Arizona State University (亚利桑那州立大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce \textbfSchrödinger AI, a unified machine learning framework inspired by quantum mechanics. The system is defined by three tightly coupled components: (1) a time-independent wave-energy solver that treats perception and classification as spectral decomposition under a learned Hamiltonian; (2) a time-dependent dynamical solver governing the evolution of semantic wavefunctions over time, enabling context-aware decision revision, re-routing, and reasoning under environmental changes; and (3) a low-rank operator calculus that learns symbolic transformations such as modular arithmetic through learned quantum-like transition operators. Together, these components form a coherent physics-driven alternative to conventional cross-entropy training and transformer attention, providing robust generalization, interpretable semantics, and emergent topology. Empirically, Schrödinger AI demonstrates: (a) emergent semantic manifolds that reflect human-conceived class relations without explicit supervision; (b) dynamic reasoning that adapts to changing environments, including maze navigation with real-time potential-field perturbations; and © exact operator generalization on modular arithmetic tasks, where the system learns group actions and composes them across sequences far beyond training length. These results suggest a new foundational direction for machine learning, where learning is cast as discovering and navigating an underlying semantic energy landscape. Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2512.22774 [cs.LG] (or arXiv:2512.22774v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2512.22774 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-113] Next Best View Selections for Semantic and Dynamic 3D Gaussian Splatting

【速读】:该论文旨在解决具身智能体在复杂任务中对语义理解和动态建模的高效数据利用问题,尤其针对静态场景理解任务中存在的大量数据冗余。其核心挑战在于如何从多视角视频数据中选择最具信息量的帧以提升模型训练效率与性能。解决方案的关键在于将视图选择问题建模为一种主动学习(active learning)问题,并引入Fisher信息量来量化候选视角在语义高斯参数和形变网络(deformation networks)上的信息增益,从而实现语义推理与动态场景建模的联合优化。此方法提供了一种优于随机采样或基于不确定性的启发式策略的理论依据和实践方案。

链接: https://arxiv.org/abs/2512.22771
作者: Yiqian Li,Wen Jiang,Kostas Daniilidis
机构: University of Pennsylvania (宾夕法尼亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding semantics and dynamics has been crucial for embodied agents in various tasks. Both tasks have much more data redundancy than the static scene understanding task. We formulate the view selection problem as an active learning problem, where the goal is to prioritize frames that provide the greatest information gain for model training. To this end, we propose an active learning algorithm with Fisher Information that quantifies the informativeness of candidate views with respect to both semantic Gaussian parameters and deformation networks. This formulation allows our method to jointly handle semantic reasoning and dynamic scene modeling, providing a principled alternative to heuristic or random strategies. We evaluate our method on large-scale static images and dynamic video datasets by selecting informative frames from multi-camera setups. Experimental results demonstrate that our approach consistently improves rendering quality and semantic segmentation performance, outperforming baseline methods based on random selection and uncertainty-based heuristics.
zh

[CV-114] Neighbor-Aware Token Reduction via Hilbert Curve for Vision Transformers

【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)中因冗余token表示导致的计算效率低下问题,尤其是现有token合并与剪枝策略常忽视空间连续性和邻近关系,从而造成局部上下文信息丢失。其解决方案的关键在于引入基于希尔伯特曲线(Hilbert curve)重排序的邻接感知token压缩方法,通过将二维空间中的邻近关系映射为一维序列,显式保留局部结构信息;具体包括两个核心策略:邻接感知剪枝(Neighbor-Aware Pruning, NAP)用于选择性保留重要token,以及邻近token相似度合并(Merging by Adjacent Token similarity, MAT)用于局部token聚合,从而在保持高精度的同时显著提升模型效率。

链接: https://arxiv.org/abs/2512.22760
作者: Yunge Li,Lanyu Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision Transformers (ViTs) have achieved remarkable success in visual recognition tasks, but redundant token representations limit their computational efficiency. Existing token merging and pruning strategies often overlook spatial continuity and neighbor relationships, resulting in the loss of local context. This paper proposes novel neighbor-aware token reduction methods based on Hilbert curve reordering, which explicitly preserves the neighbor structure in a 2D space using 1D sequential representations. Our method introduces two key strategies: Neighbor-Aware Pruning (NAP) for selective token retention and Merging by Adjacent Token similarity (MAT) for local token aggregation. Experiments demonstrate that our approach achieves state-of-the-art accuracy-efficiency trade-offs compared to existing methods. This work highlights the importance of spatial continuity and neighbor structure, offering new insights for the architectural optimization of ViTs.
zh

[CV-115] rimTokenator-LC: Towards Adaptive Visual Token Pruning for Large Multimodal Models with Long Contexts

【速读】:该论文旨在解决大型多模态模型(Large Multimodal Models, LMMs)在长上下文、多图像场景下因视觉token数量激增而导致的推理成本过高的问题。现有视觉token剪枝方法往往忽视了此类复杂场景中的冗余特性,导致剪枝效果不佳。解决方案的关键在于提出一种自适应剪枝方法,通过分解冗余为图像内(intra-image)和图像间(inter-image)两个维度:首先在图像内阶段依据内容感知分配token预算并贪婪选择最具代表性的token;其次在图像间阶段进行全局多样性筛选,并采用帕累托选择(Pareto selection)机制平衡多样性与文本对齐度,从而实现动态预算分配与高效剪枝,显著减少视觉token数量的同时保持模型性能。

链接: https://arxiv.org/abs/2512.22748
作者: Hao Zhang,Mengsi Lyu,Bo Huang,Yulong Ao,Yonghua Lin
机构: Beijing Academy of Artificial Intelligence (北京人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages

点击查看摘要

Abstract:Large Multimodal Models (LMMs) have proven effective on various tasks. They typically encode visual inputs into Original Model sequences of tokens, which are then concatenated with textual tokens and jointly processed by the language model. However, the growing number of visual tokens greatly increases inference cost. Visual token pruning has emerged as a promising solution. However, existing methods often overlook scenarios involving long context inputs with multiple images. In this paper, we analyze the challenges of visual token pruning in long context, multi-image settings and introduce an adaptive pruning method tailored for such scenarios. We decompose redundancy into intra-image and inter-image components and quantify them through intra-image diversity and inter-image variation, which jointly guide dynamic budget allocation. Our approach consists of two stages. The intra-image stage allocates each image a content-aware token budget and greedily selects its most representative tokens. The inter-image stage performs global diversity filtering to form a candidate pool and then applies a Pareto selection procedure that balances diversity with text alignment. Extensive experiments show that our approach maintains strong performance in long context settings while significantly cutting down the number of visual tokens.
zh

[CV-116] Split4D: Decomposed 4D Scene Reconstruction Without Video Segmentation

【速读】:该论文旨在解决多视角视频中分解式4D场景重建(decomposed 4D scene reconstruction)的问题,其核心挑战在于现有方法严重依赖不稳定的视频分割图(video segmentation maps),导致重建结果不可靠。解决方案的关键在于提出Freetime FeatureGS表示法,将动态场景建模为具有可学习特征和线性运动能力的高斯基元(Gaussian primitives)集合,并设计了一种流式特征学习策略,从而无需视频分割即可从每帧图像的分割图中准确恢复4D结构。通过引入对比损失(contrastive loss)约束同一实例在2D分割中的投影特征相近、不同实例则相远,同时采用时间有序采样训练方式,使特征能在时间维度上自然传播,实现跨时间的4D语义一致性,有效避免局部最优解并显著提升重建质量。

链接: https://arxiv.org/abs/2512.22745
作者: Yongzhen Hu,Yihui Yang,Haotong Lin,Yifan Wang,Junting Dong,Yifu Deng,Xinyu Zhu,Fan Jia,Hujun Bao,Xiaowei Zhou,Sida Peng
机构: Zhejiang University (浙江大学); Ant Group (蚂蚁集团); Shanghai AI Lab (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper addresses the problem of decomposed 4D scene reconstruction from multi-view videos. Recent methods achieve this by lifting video segmentation results to a 4D representation through differentiable rendering techniques. Therefore, they heavily rely on the quality of video segmentation maps, which are often unstable, leading to unreliable reconstruction results. To overcome this challenge, our key idea is to represent the decomposed 4D scene with the Freetime FeatureGS and design a streaming feature learning strategy to accurately recover it from per-image segmentation maps, eliminating the need for video segmentation. Freetime FeatureGS models the dynamic scene as a set of Gaussian primitives with learnable features and linear motion ability, allowing them to move to neighboring regions over time. We apply a contrastive loss to Freetime FeatureGS, forcing primitive features to be close or far apart based on whether their projections belong to the same instance in the 2D segmentation map. As our Gaussian primitives can move across time, it naturally extends the feature learning to the temporal dimension, achieving 4D segmentation. Furthermore, we sample observations for training in a temporally ordered manner, enabling the streaming propagation of features over time and effectively avoiding local minima during the optimization process. Experimental results on several datasets show that the reconstruction quality of our method outperforms recent methods by a large margin.
zh

[CV-117] Improved cystic hygroma detection from prenatal imaging using ultrasound-specific self-supervised representation learning

【速读】:该论文旨在解决胎儿颈项透明层超声检查中囊性水瘤(cystic hygroma)自动检测的准确性与鲁棒性问题,尤其是在标注数据有限的情况下如何提升深度学习模型的性能。其关键解决方案是采用超声特定的自监督预训练方法——Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE),该模型在超过37万张未标注超声图像上进行预训练,随后在小规模标注数据集上微调用于二分类任务(正常对照组 vs. 囊性水瘤病例),显著优于传统监督学习方法如DenseNet-169,在准确率、敏感性、特异性及ROC-AUC等指标上均实现统计学意义上的提升。

链接: https://arxiv.org/abs/2512.22730
作者: Youssef Megahed,Robin Ducharme,Inok Lee,Inbal Willner,Olivier X. Miguel,Kevin Dick,Adrian D. C. Chan,Mark Walker,Steven Hawken
机构: Carleton University (卡尔顿大学); Ottawa Hospital Research Institute (渥太华医院研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 13 pages, 6 figures, 2 tables

点击查看摘要

Abstract:Cystic hygroma is a high-risk prenatal ultrasound finding that portends high rates of chromosomal abnormalities, structural malformations, and adverse pregnancy outcomes. Automated detection can increase reproducibility and support scalable early screening programs, but supervised deep learning methods are limited by small labelled datasets. This study assesses whether ultrasound-specific self-supervised pretraining can facilitate accurate, robust deep learning detection of cystic hygroma in first-trimester ultrasound images. We fine-tuned the Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE), pretrained on over 370,000 unlabelled ultrasound images, for binary classification of normal controls and cystic hygroma cases used in this study. Performance was evaluated on the same curated ultrasound dataset, preprocessing pipeline, and 4-fold cross-validation protocol as for the DenseNet-169 baseline, using accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (ROC-AUC). Model interpretability was analyzed qualitatively using Score-CAM visualizations. USF-MAE outperformed the DenseNet-169 baseline on all evaluation metrics. The proposed model yielded a mean accuracy of 0.96, sensitivity of 0.94, specificity of 0.98, and ROC-AUC of 0.98 compared to 0.93, 0.92, 0.94, and 0.94 for the DenseNet-169 baseline, respectively. Qualitative Score-CAM visualizations of model predictions demonstrated clinical relevance by highlighting expected regions in the fetal neck for both positive and negative cases. Paired statistical analysis using a Wilcoxon signed-rank test confirmed that performance improvements achieved by USF-MAE were statistically significant (p = 0.0057).
zh

[CV-118] Memento-II: Learning by Stateful Reflective Memory

【速读】:该论文旨在解决大型语言模型代理在持续学习(continual learning)和经验学习(experiential learning)中的适应性问题,即如何在不进行反向传播或模型微调的情况下,通过与环境的交互实现策略的动态优化。其解决方案的关键在于引入“反思”(reflection)机制,并基于此构建了状态感知的反思决策过程(Stateful Reflective Decision Process),将 episodic memory(情景记忆)作为核心组件,通过两阶段读写操作实现策略评估(writing)与策略改进(reading)。该框架形式化为一个增强状态空间上的马尔可夫决策过程(Markov Decision Process, MDP),从而能够应用经典动态规划与强化学习工具,且在熵正则化策略迭代下具备收敛性保证,使代理在无参数更新的前提下实现持续适应。

链接: https://arxiv.org/abs/2512.22716
作者: Jun Wang
机构: University College London (伦敦大学学院)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 32 pages, three figures

点击查看摘要

Abstract:We propose a theoretical framework for continual and experiential learning in large language model agents that integrates episodic memory with reinforcement learning. The framework identifies reflection as the key mechanism that enables agents to adapt through interaction without back propagation or model fine tuning, thereby relaxing the conventional separation between training and this http URL formalise this process, we introduce the Stateful Reflective Decision Process, which models reflective learning as a two stage read write interaction with episodic memory. Writing stores interaction outcomes and corresponds to policy evaluation, while reading retrieves relevant past cases and corresponds to policy improvement. We show that this process induces an equivalent Markov decision process over augmented state memory representations, allowing the use of classical tools from dynamic programming and reinforcement learning. We further instantiate the framework using entropy regularised policy iteration and establish convergence guarantees. As episodic memory grows and achieves sufficient coverage of the state space, the resulting policy converges to the optimal solution. This work provides a principled foundation for memory augmented and retrieval based language model agents capable of continual adaptation without parameter updates.
zh

[CV-119] SCPainter: A Unified Framework for Realistic 3D Asset Insertion and Novel View Synthesis

【速读】:该论文旨在解决自动驾驶仿真中3D资产插入(3D Asset Insertion)与新视角合成(Novel View Synthesis, NVS)能力分离的问题,以提升训练数据的多样性与真实性,从而增强自动驾驶模型在长尾场景下的鲁棒性。现有方法虽能分别实现动态物体重建或视角扩展,但难以同时保证插入资产的光照一致性与场景交互的真实性。解决方案的关键在于提出SCPainter框架,其核心是将3D高斯点(3D Gaussian Splat, GS)表示的车辆资产与场景点云共同投影至新视角,并利用扩散模型(diffusion-based generation)以这些投影为条件生成高质量图像,从而实现统一建模下逼真的3D资产插入与多视角合成。

链接: https://arxiv.org/abs/2512.22706
作者: Paul Dobre,Jackson Cooper,Xin Wang,Hongzhou Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Asset insertion and novel view synthesis (NVS) are key components for autonomous driving simulation, enhancing the diversity of training data. With better training data that is diverse and covers a wide range of situations, including long-tailed driving scenarios, autonomous driving models can become more robust and safer. This motivates a unified simulation framework that can jointly handle realistic integration of inserted 3D assets and NVS. Recent 3D asset reconstruction methods enable reconstruction of dynamic actors from video, supporting their re-insertion into simulated driving scenes. While the overall structure and appearance can be accurate, it still struggles to capture the realism of 3D assets through lighting or shadows, particularly when inserted into scenes. In parallel, recent advances in NVS methods have demonstrated promising results in synthesizing viewpoints beyond the originally recorded trajectories. However, existing approaches largely treat asset insertion and NVS capabilities in isolation. To allow for interaction with the rest of the scene and to enable more diverse creation of new scenarios for training, realistic 3D asset insertion should be combined with NVS. To address this, we present SCPainter (Street Car Painter), a unified framework which integrates 3D Gaussian Splat (GS) car asset representations and 3D scene point clouds with diffusion-based generation to jointly enable realistic 3D asset insertion and NVS. The 3D GS assets and 3D scene point clouds are projected together into novel views, and these projections are used to condition a diffusion model to generate high quality images. Evaluation on the Waymo Open Dataset demonstrate the capability of our framework to enable 3D asset insertion and NVS, facilitating the creation of diverse and realistic driving data.
zh

[CV-120] Mesquite MoCap: Democratizing Real-Time Motion Capture with Affordable Bodyworn IoT Sensors and WebXR SLAM

【速读】:该论文旨在解决运动捕捉(Motion Capture)系统成本高、部署复杂的问题,从而限制了其在实验室以外场景的应用。解决方案的关键在于设计并实现了一个开源、低成本的惯性运动捕捉系统 Mesquite,其核心由15个IMU传感器节点组成身体穿戴网络,并结合一个佩戴在髋部的Android智能手机用于位置追踪;通过低功耗无线通信将四元数姿态数据传输至中央USB网关和浏览器端应用,实现实时可视化与记录。该系统完全基于现代Web技术栈(如WebGL、WebXR、WebSerial、WebSockets及Progressive Web Apps),支持跨平台运行且无需额外客户端软件,显著降低了硬件成本(约仅为商用光学系统的5%),同时保持30帧/秒的稳定帧率、低于15ms的端到端延迟以及不低于99.7%的数据包接收率,有效推动了运动捕捉技术在娱乐、生物力学、健康监测、人机交互和虚拟现实等领域的普及应用。

链接: https://arxiv.org/abs/2512.22690
作者: Poojan Vanani,Darsh Patel,Danyal Khorami,Siva Munaganuru,Pavan Reddy,Varun Reddy,Bhargav Raghunath,Ishrat Lallmamode,Romir Patel,Assegid Kidané,Tejaswi Gowda
机构: Arizona State University (亚利桑那州立大学); Basha High School (巴沙高中)
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注: submitted to IEEE Journal of IoT

点击查看摘要

Abstract:Motion capture remains costly and complex to deploy, limiting use outside specialized laboratories. We present Mesquite, an open-source, low-cost inertial motion-capture system that combines a body-worn network of 15 IMU sensor nodes with a hip-worn Android smartphone for position tracking. A low-power wireless link streams quaternion orientations to a central USB dongle and a browser-based application for real-time visualization and recording. Built on modern web technologies – WebGL for rendering, WebXR for SLAM, WebSerial and WebSockets for device and network I/O, and Progressive Web Apps for packaging – the system runs cross-platform entirely in the browser. In benchmarks against a commercial optical system, Mesquite achieves mean joint-angle error of 2-5 degrees while operating at approximately 5% of the cost. The system sustains 30 frames per second with end-to-end latency under 15ms and a packet delivery rate of at least 99.7% in standard indoor environments. By leveraging IoT principles, edge processing, and a web-native stack, Mesquite lowers the barrier to motion capture for applications in entertainment, biomechanics, healthcare monitoring, human-computer interaction, and virtual reality. We release hardware designs, firmware, and software under an open-source license (GNU GPL).
zh

[CV-121] Multimodal Diffeomorphic Registration with Neural ODEs and Structural Descriptors

【速读】:该论文旨在解决多模态非刚性配准(multimodal nonrigid registration)中的关键挑战,即如何在保持高精度的同时降低变形模型的计算复杂度,并实现对未见模态的鲁棒泛化能力。传统方法通常依赖于图像强度相关性假设,限制了其在多模态场景下的适用性,且学习型模型往往需要大量训练数据并可能在推理阶段性能下降。本文提出的解决方案核心在于利用神经微分方程(Neural ODEs)构建一种实例特定(instance-specific)的连续深度网络框架,通过引入结构描述符(structural descriptors)——这类描述符作为模态无关的度量模型,能够捕捉参数化邻域几何结构中的自相似性——并融合基于局部互信息计算的非结构图像相似性,从而有效建模跨模态的解剖同源区域关系。该方法无需大规模训练数据即可实现高性能配准,且在不同尺度和正则化水平下均表现出优异的鲁棒性和效率。

链接: https://arxiv.org/abs/2512.22689
作者: Salvador Rodriguez-Sanz,Monica Hernandez
机构: Aragon Institute for Engineering Research (I3A), University of Zaragoza (萨拉戈萨大学工程研究学院); University of Zaragoza (萨拉戈萨大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This work proposes a multimodal diffeomorphic registration method using Neural Ordinary Differential Equations (Neural ODEs). Nonrigid registration algorithms exhibit tradeoffs between their accuracy, the computational complexity of their deformation model, and its proper regularization. In addition, they also assume intensity correlation in anatomically homologous regions of interest among image pairs, limiting their applicability to the monomodal setting. Unlike learning-based models, we propose an instance-specific framework that is not subject to high scan requirements for training and does not suffer performance degradation at inference time on modalities unseen during training. Our method exploits the potential of continuous-depth networks in the Neural ODE paradigm with structural descriptors, widely adopted as modality-agnostic metric models which exploit self-similarities on parameterized neighborhood geometries. We propose three different variants that integrate image-based or feature-based structural descriptors and nonstructural image similarities computed by local mutual information. We conduct extensive evaluations on different experiments formed by scan dataset combinations and show surpassing qualitative and quantitative results compared to state-of-the-art baselines adequate for large or small deformations, and specific of multimodal registration. Lastly, we also demonstrate the underlying robustness of the proposed framework to varying levels of explicit regularization while maintaining low error, its suitability for registration at varying scales, and its efficiency with respect to other methods targeted to large-deformation registration.
zh

[CV-122] Autoregressive Flow Matching for Motion Prediction

【速读】:该论文旨在解决现有运动预测模型在复杂动态场景下准确性不足的问题,尤其是在人类运动和机器人动作预测任务中,传统方法因训练数据分布狭窄而难以泛化。其解决方案的关键在于提出一种自回归流匹配(autoregressive flow matching, ARFM)方法,通过在多样化视频数据集上进行训练,实现对长时间序列连续数据的概率建模,并生成未来点轨迹位置。该方法显著提升了复杂运动的预测能力,并通过将机器人动作与人类运动预测任务条件于预测的未来轨迹,有效改善了下游任务性能。

链接: https://arxiv.org/abs/2512.22688
作者: Johnathan Xie,Stefan Stojanov,Cristobal Eyzaguirre,Daniel L. K. Yamins,Jiajun Wu
机构: Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Motion prediction has been studied in different contexts with models trained on narrow distributions and applied to downstream tasks in human motion prediction and robotics. Simultaneously, recent efforts in scaling video prediction have demonstrated impressive visual realism, yet they struggle to accurately model complex motions despite massive scale. Inspired by the scaling of video generation, we develop autoregressive flow matching (ARFM), a new method for probabilistic modeling of sequential continuous data and train it on diverse video datasets to generate future point track locations over long horizons. To evaluate our model, we develop benchmarks for evaluating the ability of motion prediction models to predict human and robot motion. Our model is able to predict complex motions, and we demonstrate that conditioning robot action prediction and human motion prediction on predicted future tracks can significantly improve downstream task performance. Code and models publicly available at: this https URL.
zh

[CV-123] CritiFusion: Semantic Critique and Spectral Alignment for Faithful Text-to-Image Generation

【速读】:该论文旨在解决当前文本到图像扩散模型在复杂提示(prompt)下语义对齐能力不足的问题,即生成图像与文本描述之间存在语义偏差或细节缺失。解决方案的关键在于提出 CritiFusion 框架,其核心由两个模块构成:一是 CritiCore 模块,利用视觉-语言模型(vision-language model)和多个大语言模型(large language models)对提示上下文进行增强并生成高层语义反馈,从而引导扩散过程更精准地匹配提示意图;二是 SpecFusion 模块,在频域中融合中间生成状态,注入粗粒度结构信息的同时保留高频细节,实现图像结构与语义的双重优化。整个方法无需额外训练,可作为插件式后处理模块集成至现有扩散模型中,显著提升人机评估指标如一致性、美学评分和真实感。

链接: https://arxiv.org/abs/2512.22681
作者: ZhenQi Chen,TsaiChing Ni,YuanFu Yang
机构: National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent text-to-image diffusion models have achieved remarkable visual fidelity but often struggle with semantic alignment to complex prompts. We introduce CritiFusion, a novel inference-time framework that integrates a multimodal semantic critique mechanism with frequency-domain refinement to improve text-to-image consistency and detail. The proposed CritiCore module leverages a vision-language model and multiple large language models to enrich the prompt context and produce high-level semantic feedback, guiding the diffusion process to better align generated content with the prompt’s intent. Additionally, SpecFusion merges intermediate generation states in the spectral domain, injecting coarse structural information while preserving high-frequency details. No additional model training is required. CritiFusion serves as a plug-in refinement stage compatible with existing diffusion backbones. Experiments on standard benchmarks show that our method notably improves human-aligned metrics of text-to-image correspondence and visual quality. CritiFusion consistently boosts performance on human preference scores and aesthetic evaluations, achieving results on par with state-of-the-art reward optimization approaches. Qualitative results further demonstrate superior detail, realism, and prompt fidelity, indicating the effectiveness of our semantic critique and spectral alignment strategy.
zh

[CV-124] INTERACT-CMIL: Multi-Task Shared Learning and Inter-Task Consistency for Conjunctival Melanocytic Intraepithelial Lesion Grading

【速读】:该论文旨在解决结膜黑色素细胞上皮内病变(Conjunctival Melanocytic Intraepithelial Lesions, CMIL)的准确分级难题,其核心挑战在于细微的形态学特征和多个诊断标准之间的复杂关联性。解决方案的关键在于提出INTERACT-CMIL框架,该框架采用多头深度学习架构,通过共享特征学习与组合部分监督(Combinatorial Partial Supervision)策略,联合预测WHO4、WHO5、水平扩散、垂直扩散及细胞学异型性五个组织病理学维度,并引入交叉任务一致性损失(Inter-Dependence Loss)以增强多任务间的逻辑一致性,从而实现更可靠、可解释且与专家评分一致的多准则预测。

链接: https://arxiv.org/abs/2512.22666
作者: Mert Ikinci,Luna Toma,Karin U. Loeffler,Leticia Ussem,Daniela Süsskind,Julia M. Weller,Yousef Yeganeh,Martina C. Herwig-Carl,Shadi Albarqouni
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Accurate grading of Conjunctival Melanocytic Intraepithelial Lesions (CMIL) is essential for treatment and melanoma prediction but remains difficult due to subtle morphological cues and interrelated diagnostic criteria. We introduce INTERACT-CMIL, a multi-head deep learning framework that jointly predicts five histopathological axes; WHO4, WHO5, horizontal spread, vertical spread, and cytologic atypia, through Shared Feature Learning with Combinatorial Partial Supervision and an Inter-Dependence Loss enforcing cross-task consistency. Trained and evaluated on a newly curated, multi-center dataset of 486 expert-annotated conjunctival biopsy patches from three university hospitals, INTERACT-CMIL achieves consistent improvements over CNN and foundation-model (FM) baselines, with relative macro F1 gains up to 55.1% (WHO4) and 25.0% (vertical spread). The framework provides coherent, interpretable multi-criteria predictions aligned with expert grading, offering a reproducible computational benchmark for CMIL diagnosis and a step toward standardized digital ocular pathology.
zh

[CV-125] Unleashing Foundation Vision Models: Adaptive Transfer for Diverse Data-Limited Scientific Domains

【速读】:该论文旨在解决在数据受限的科学领域中,如何有效利用大规模预训练视觉模型(如ViT和ConvNeXt)所学习到的丰富特征来提升下游任务性能的问题。解决方案的关键在于提出一种新颖的Cluster Attention Adapter(CLAdapter),其通过引入注意力机制与聚类中心,基于特征分布相关性和变换矩阵对特征进行个性化增强,从而实现不同特征集下的差异化表示学习;此外,CLAdapter采用统一接口设计,可无缝适配多种模型架构(包括CNN和Transformer)及2D/3D场景,显著提升了基础视觉模型在多样化小样本科学任务中的适应能力。

链接: https://arxiv.org/abs/2512.22664
作者: Qiankun Li,Feng He,Huabao Chen,Xin Ning,Kun Wang,Zengfu Wang
机构: University of Science and Technology of China (中国科学技术大学); AnnLab, Institute of Semiconductors, Chinese Academy of Sciences (中国科学院半导体研究所AnnLab); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the big data era, the computer vision field benefits from large-scale datasets such as LAION-2B, LAION-400M, and ImageNet-21K, Kinetics, on which popular models like the ViT and ConvNeXt series have been pre-trained, acquiring substantial knowledge. However, numerous downstream tasks in specialized and data-limited scientific domains continue to pose significant challenges. In this paper, we propose a novel Cluster Attention Adapter (CLAdapter), which refines and adapts the rich representations learned from large-scale data to various data-limited downstream tasks. Specifically, CLAdapter introduces attention mechanisms and cluster centers to personalize the enhancement of transformed features through distribution correlation and transformation matrices. This enables models fine-tuned with CLAdapter to learn distinct representations tailored to different feature sets, facilitating the models’ adaptation from rich pre-trained features to various downstream scenarios effectively. In addition, CLAdapter’s unified interface design allows for seamless integration with multiple model architectures, including CNNs and Transformers, in both 2D and 3D contexts. Through extensive experiments on 10 datasets spanning domains such as generic, multimedia, biological, medical, industrial, agricultural, environmental, geographical, materials science, out-of-distribution (OOD), and 3D analysis, CLAdapter achieves state-of-the-art performance across diverse data-limited scientific domains, demonstrating its effectiveness in unleashing the potential of foundation vision models via adaptive transfer. Code is available at this https URL.
zh

[CV-126] Investigating Deep Learning Models for Ejection Fraction Estimation from Echocardiography Videos

【速读】:该论文旨在解决基于超声心动图视频手动评估左心室射血分数(Left Ventricular Ejection Fraction, LVEF)时存在耗时且观测者间变异大的问题。其解决方案的关键在于采用多种深度学习架构(包括3D Inception、双流和CNN-RNN模型)对EchoNet-Dynamic数据集中的10,030段超声心动图视频进行LVEF估计,并通过系统性地优化网络结构、融合策略及超参数(如卷积核大小和归一化方法),显著提升了预测准确性,其中改进的3D Inception模型表现最佳(RMSE为6.79%),同时揭示了简单模型更利于泛化、超参数选择对性能影响显著等关键设计原则。

链接: https://arxiv.org/abs/2512.22657
作者: Shravan Saranyan,Pramit Saha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Left ventricular ejection fraction (LVEF) is a key indicator of cardiac function and plays a central role in the diagnosis and management of cardiovascular disease. Echocardiography, as a readily accessible and non-invasive imaging modality, is widely used in clinical practice to estimate LVEF. However, manual assessment of cardiac function from echocardiograms is time-consuming and subject to considerable inter-observer variability. Deep learning approaches offer a promising alternative, with the potential to achieve performance comparable to that of experienced human experts. In this study, we investigate the effectiveness of several deep learning architectures for LVEF estimation from echocardiography videos, including 3D Inception, two-stream, and CNN-RNN models. We systematically evaluate architectural modifications and fusion strategies to identify configurations that maximize prediction accuracy. Models were trained and evaluated on the EchoNet-Dynamic dataset, comprising 10,030 echocardiogram videos. Our results demonstrate that modified 3D Inception architectures achieve the best overall performance, with a root mean squared error (RMSE) of 6.79%. Across architectures, we observe a tendency toward overfitting, with smaller and simpler models generally exhibiting improved generalization. Model performance was also found to be highly sensitive to hyperparameter choices, particularly convolutional kernel sizes and normalization strategies. While this study focuses on echocardiography-based LVEF estimation, the insights gained regarding architectural design and training strategies may be applicable to a broader range of medical and non-medical video analysis tasks.
zh

[CV-127] Visual Autoregressive Modelling for Monocular Depth Estimation

【速读】:该论文旨在解决单目深度估计(monocular depth estimation)中对大规模标注数据依赖性强、训练成本高以及生成式模型在几何感知能力上的局限性问题。其核心解决方案是引入基于视觉自回归(visual autoregressive, VAR)先验的方法,替代传统的扩散模型(diffusion-based approaches),通过适配大规模文本到图像的VAR模型,并设计尺度级条件上采样机制与无分类器引导(classifier-free guidance)策略,在固定十阶段自回归推理中实现高效深度预测。该方法仅需74K合成样本即可微调,展现出优异的数据可扩展性和在室内/室外场景下的强泛化能力,验证了自回归先验作为几何感知生成模型的新范式潜力。

链接: https://arxiv.org/abs/2512.22653
作者: Amir El-Ghoussani,André Kaup,Nassir Navab,Gustavo Carneiro,Vasileios Belagiannis
机构: Friedrich-Alexander University Erlangen-Nuremberg (弗里德里希-亚历山大埃尔朗根-纽伦堡大学); Technical University of Munich (慕尼黑工业大学); University of Surrey (萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose a monocular depth estimation method based on visual autoregressive (VAR) priors, offering an alternative to diffusion-based approaches. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism with classifier-free guidance. Our approach performs inference in ten fixed autoregressive stages, requiring only 74K synthetic samples for fine-tuning, and achieves competitive results. We report state-of-the-art performance in indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets. This work establishes autoregressive priors as a complementary family of geometry-aware generative models for depth estimation, highlighting advantages in data scalability, and adaptability to 3D vision tasks. Code available at "this https URL.
zh

[CV-128] FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution

【速读】:该论文旨在解决图像超分辨率(Image Super-Resolution, ISR)任务中因传统图像质量评估(Image Quality Assessment, IQA)模型仅输出全局单一分数而导致的感知质量优化偏差问题,即“奖励欺骗”(reward hacking)现象——此类模型对局部细微失真不敏感,导致生成模型倾向于产生高分但视觉上不自然的伪影。解决方案的关键在于提出一种细粒度感知奖励模型(Fine-grained Perceptual Reward Model, FinPercep-RM),其基于编码器-解码器架构,在提供全局质量评分的同时,生成一个空间定位的感知退化图(Perceptual Degradation Map),从而量化并定位局部缺陷;同时引入协同进化课程学习机制(Co-evolutionary Curriculum Learning, CCL),使奖励模型与ISR生成器同步从简单到复杂逐步训练,有效缓解训练不稳定问题并抑制奖励欺骗,实现更贴近人类感知的高质量图像重建。

链接: https://arxiv.org/abs/2512.22647
作者: Yidi Liu,Zihao Fan,Jie Huang,Jie Xiao,Dong Li,Wenlong Zhang,Lei Bai,Xueyang Fu,Zheng-Jun Zha
机构: University of Science and Technology of China (中国科学技术大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Human Feedback (RLHF) has proven effective in image generation field guided by reward models to align human preferences. Motivated by this, adapting RLHF for Image Super-Resolution (ISR) tasks has shown promise in optimizing perceptual quality with Image Quality Assessment (IQA) model as reward models. However, the traditional IQA model usually output a single global score, which are exceptionally insensitive to local and fine-grained distortions. This insensitivity allows ISR models to produce perceptually undesirable artifacts that yield spurious high scores, misaligning optimization objectives with perceptual quality and results in reward hacking. To address this, we propose a Fine-grained Perceptual Reward Model (FinPercep-RM) based on an Encoder-Decoder architecture. While providing a global quality score, it also generates a Perceptual Degradation Map that spatially localizes and quantifies local defects. We specifically introduce the FGR-30k dataset to train this model, consisting of diverse and subtle distortions from real-world super-resolution models. Despite the success of the FinPercep-RM model, its complexity introduces significant challenges in generator policy learning, leading to training instability. To address this, we propose a Co-evolutionary Curriculum Learning (CCL) mechanism, where both the reward model and the ISR model undergo synchronized curricula. The reward model progressively increases in complexity, while the ISR model starts with a simpler global reward for rapid convergence, gradually transitioning to the more complex model outputs. This easy-to-hard strategy enables stable training while suppressing reward hacking. Experiments validates the effectiveness of our method across ISR models in both global quality and local realism on RLHF methods.
zh

[CV-129] Envision: Embodied Visual Planning via Goal-Imagery Video Diffusion

【速读】:该论文旨在解决具身视觉规划(embodied visual planning)中因现有方法多为前向预测、缺乏显式目标建模而导致的空间漂移(spatial drift)和目标错位(goal misalignment)问题。解决方案的关键在于提出Envision框架,其核心创新是通过显式约束生成过程中的目标图像(goal image),以确保生成轨迹在物理合理性与目标一致性上的统一;具体而言,该框架包含两个阶段:首先由目标图像生成模型(Goal Imagery Model)基于任务指令合成语义一致的目标图像,随后由环境-目标视频模型(Env-Goal Video Model)基于首尾帧条件的视频扩散模型(FL2V)插值生成平滑且物理合理的视频轨迹,从而实现从初始状态到目标状态的可靠视觉规划。

链接: https://arxiv.org/abs/2512.22626
作者: Yuming Gu,Yizhi Wang,Yining Hong,Yipeng Gao,Hao Jiang,Angtian Wang,Bo Liu,Nathaniel S. Dennler,Zhengfei Kuang,Hao Li,Gordon Wetzstein,Chongyang Ma
机构: University of Southern California (南加州大学); ByteDance (字节跳动); Stanford University (斯坦福大学); Massachusetts Institute of Technology (麻省理工学院); MBZUAI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Page: this https URL

点击查看摘要

Abstract:Embodied visual planning aims to enable manipulation tasks by imagining how a scene evolves toward a desired goal and using the imagined trajectories to guide actions. Video diffusion models, through their image-to-video generation capability, provide a promising foundation for such visual imagination. However, existing approaches are largely forward predictive, generating trajectories conditioned on the initial observation without explicit goal modeling, thus often leading to spatial drift and goal misalignment. To address these challenges, we propose Envision, a diffusion-based framework that performs visual planning for embodied agents. By explicitly constraining the generation with a goal image, our method enforces physical plausibility and goal consistency throughout the generated trajectory. Specifically, Envision operates in two stages. First, a Goal Imagery Model identifies task-relevant regions, performs region-aware cross attention between the scene and the instruction, and synthesizes a coherent goal image that captures the desired outcome. Then, an Env-Goal Video Model, built upon a first-and-last-frame-conditioned video diffusion model (FL2V), interpolates between the initial observation and the goal image, producing smooth and physically plausible video trajectories that connect the start and goal states. Experiments on object manipulation and image editing benchmarks demonstrate that Envision achieves superior goal alignment, spatial consistency, and object preservation compared to baselines. The resulting visual plans can directly support downstream robotic planning and control, providing reliable guidance for embodied agents.
zh

[CV-130] Rethinking Memory Design in SAM-Based Visual Object Tracking DATE

【速读】:该论文旨在解决基于Segment Anything Model (SAM) 的视觉目标跟踪中记忆机制设计不系统、缺乏通用性的问题,尤其是现有方法对短期记忆帧的选择方式各异但共享对象中心表征,且尚未明确如何将此类记忆机制迁移至下一代基础模型(如SAM3)。其解决方案的关键在于提出一种统一的混合记忆框架,通过显式分解记忆为短期外观记忆与长期干扰物分辨记忆,从而以模块化和原则性的方式整合现有记忆策略。该框架在SAM2与SAM3双骨干上均显著提升了长时遮挡、复杂运动及干扰物密集场景下的鲁棒性。

链接: https://arxiv.org/abs/2512.22624
作者: Mohamad Alansari,Muzammal Naseer,Hasan Al Marzouqi,Naoufel Werghi,Sajid Javed
机构: Khalifa University (哈利法大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: \textbf{This is a preprint. Some results are being finalized and may be updated in a future revision.}

点击查看摘要

Abstract:\noindent Memory has become the central mechanism enabling robust visual object tracking in modern segmentation-based frameworks. Recent methods built upon Segment Anything Model 2 (SAM2) have demonstrated strong performance by refining how past observations are stored and reused. However, existing approaches address memory limitations in a method-specific manner, leaving the broader design principles of memory in SAM-based tracking poorly understood. Moreover, it remains unclear how these memory mechanisms transfer to stronger, next-generation foundation models such as Segment Anything Model 3 (SAM3). In this work, we present a systematic memory-centric study of SAM-based visual object tracking. We first analyze representative SAM2-based trackers and show that most methods primarily differ in how short-term memory frames are selected, while sharing a common object-centric representation. Building on this insight, we faithfully reimplement these memory mechanisms within the SAM3 framework and conduct large-scale evaluations across ten diverse benchmarks, enabling a controlled analysis of memory design independent of backbone strength. Guided by our empirical findings, we propose a unified hybrid memory framework that explicitly decomposes memory into short-term appearance memory and long-term distractor-resolving memory. This decomposition enables the integration of existing memory policies in a modular and principled manner. Extensive experiments demonstrate that the proposed framework consistently improves robustness under long-term occlusion, complex motion, and distractor-heavy scenarios on both SAM2 and SAM3 backbones. Code is available at: this https URL. \textbfThis is a preprint. Some results are being finalized and may be updated in a future revision.
zh

[CV-131] Enhancing Noise Resilience in Face Clustering via Sparse Differential Transformer AAAI2026

【速读】: This paper addresses the degraded neighborhood purity and unreliable similarity measurement in face clustering caused by conventional Jaccard similarity computation introducing too many irrelevant nodes. The key to the solution is a prediction-driven Top-K Jaccard similarity coefficient that improves measurement accuracy by optimizing the selection of neighboring nodes; furthermore, to predict the optimal number of neighbors (Top-K) accurately, a prediction model built on a Sparse Differential Transformer (SDT) is designed, which suppresses the noise that vanilla Transformers introduce by over-attending to irrelevant feature relationships, thereby strengthening noise resilience and the robustness of similarity estimation.

链接: https://arxiv.org/abs/2512.22612
作者: Dafeng Zhang,Yongqi Song,Shizhuo Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Published at AAAI 2026

点击查看摘要

Abstract:The method used to measure relationships between face embeddings plays a crucial role in determining the performance of face clustering. Existing methods employ the Jaccard similarity coefficient instead of the cosine distance to enhance the measurement accuracy. However, these methods introduce too many irrelevant nodes, producing Jaccard coefficients with limited discriminative power and adversely affecting clustering performance. To address this issue, we propose a prediction-driven Top-K Jaccard similarity coefficient that enhances the purity of neighboring nodes, thereby improving the reliability of similarity measurements. Nevertheless, accurately predicting the optimal number of neighbors (Top-K) remains challenging, leading to suboptimal clustering results. To overcome this limitation, we develop a Transformer-based prediction model that examines the relationships between the central node and its neighboring nodes near the Top-K to further enhance the reliability of similarity estimation. However, vanilla Transformer, when applied to predict relationships between nodes, often introduces noise due to their overemphasis on irrelevant feature relationships. To address these challenges, we propose a Sparse Differential Transformer (SDT), instead of the vanilla Transformer, to eliminate noise and enhance the model’s anti-noise capabilities. Extensive experiments on multiple datasets, such as MS-Celeb-1M, demonstrate that our approach achieves state-of-the-art (SOTA) performance, outperforming existing methods and providing a more robust solution for face clustering.
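To make the neighborhood-restricted similarity concrete, here is a minimal sketch of a Top-K Jaccard coefficient computed over cosine k-NN lists. The function names, the use of cosine similarity, and a fixed K are illustrative assumptions; the paper's released method additionally predicts the optimal K with the SDT model.

```python
import numpy as np

def topk_neighbors(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k nearest neighbors under cosine similarity."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)          # exclude self-matches
    return np.argsort(-sim, axis=1)[:, :k]  # sorted by decreasing similarity

def topk_jaccard(neighbors: np.ndarray, i: int, j: int) -> float:
    """Jaccard coefficient restricted to the Top-K neighbor sets of i and j."""
    a, b = set(neighbors[i]), set(neighbors[j])
    return len(a & b) / len(a | b)

emb = np.random.randn(100, 128).astype(np.float32)   # stand-in face embeddings
nbrs = topk_neighbors(emb, k=10)
print(topk_jaccard(nbrs, 0, 1))
```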
zh

[CV-132] Learning Multi-Modal Mobility Dynamics for Generalized Next Location Recommendation

【速读】: This paper addresses the limited generalization of existing human-mobility prediction methods: unimodal approaches are constrained by data sparsity and inherent biases, while multimodal approaches struggle to capture mobility dynamics because of the semantic gap between static multimodal representations and spatial-temporal dynamics. The key is to characterize mobility dynamics with multimodal spatial-temporal knowledge in the proposed Multi-Modal Mobility (M³ob) framework: first, a unified spatial-temporal relational graph (STRG) fuses the functional semantics and spatial-temporal information extracted by a large language model (LLM)-enhanced spatial-temporal knowledge graph (STKG); second, a gating mechanism fuses the spatial-temporal graph representations of different modalities, and an STKG-guided cross-modal alignment injects dynamic spatial-temporal knowledge into the static image modality, improving generalization in both normal and abnormal scenarios.

链接: https://arxiv.org/abs/2512.22605
作者: Junshu Dai,Yu Wang,Tongya Zheng,Wei Ji,Qinghong Guo,Ji Cao,Jie Song,Canghong Jin,Mingli Song
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The precise prediction of human mobility has produced significant socioeconomic impacts, such as location recommendations and evacuation suggestions. However, existing methods suffer from limited generalization capability: unimodal approaches are constrained by data sparsity and inherent biases, while multi-modal methods struggle to effectively capture mobility dynamics caused by the semantic gap between static multi-modal representation and spatial-temporal dynamics. Therefore, we leverage multi-modal spatial-temporal knowledge to characterize mobility dynamics for the location recommendation task, dubbed as Multi-Modal Mobility (M³ob). First, we construct a unified spatial-temporal relational graph (STRG) for multi-modal representation, by leveraging the functional semantics and spatial-temporal knowledge captured by the large language models (LLMs)-enhanced spatial-temporal knowledge graph (STKG). Second, we design a gating mechanism to fuse spatial-temporal graph representations of different modalities, and propose an STKG-guided cross-modal alignment to inject spatial-temporal dynamic knowledge into the static image modality. Extensive experiments on six public datasets show that our proposed method not only achieves consistent improvements in normal scenarios but also exhibits significant generalization ability in abnormal scenarios.
zh

[CV-133] PTalker: Personalized Speech-Driven 3D Talking Head Animation via Style Disentanglement and Modality Alignment

【速读】: This paper addresses the limited personalization and realism of existing 3D talking-head generation methods, which largely ignore individual speaking styles and therefore struggle to match a specific user's speaking habits. The proposed PTalker framework hinges on two components: a style-disentanglement mechanism that separates style from content in both the audio and the facial-motion sequences, preserving speaker-specific characteristics; and a three-level alignment mechanism between the audio and mesh modalities, namely spatial alignment via Graph Attention Networks, temporal alignment via cross-attention, and feature alignment via top-k bidirectional contrastive losses and KL-divergence constraints, which substantially improves lip-synchronization accuracy and cross-modal consistency.

链接: https://arxiv.org/abs/2512.22602
作者: Bin Wang,Yang Xu,Huan Zhao,Hao Zhang,Zixing Zhang
机构: Hunan University(湖南大学); Central South University(中南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Speech-driven 3D talking head generation aims to produce lifelike facial animations precisely synchronized with speech. While considerable progress has been made in achieving high lip-synchronization accuracy, existing methods largely overlook the intricate nuances of individual speaking styles, which limits personalization and realism. In this work, we present a novel framework for personalized 3D talking head animation, namely “PTalker”. This framework preserves speaking style through style disentanglement from audio and facial motion sequences and enhances lip-synchronization accuracy through a three-level alignment mechanism between audio and mesh modalities. Specifically, to effectively disentangle style and content, we design disentanglement constraints that encode driven audio and motion sequences into distinct style and content spaces to enhance speaking style representation. To improve lip-synchronization accuracy, we adopt a modality alignment mechanism incorporating three aspects: spatial alignment using Graph Attention Networks to capture vertex connectivity in the 3D mesh structure, temporal alignment using cross-attention to capture and synchronize temporal dependencies, and feature alignment by top-k bidirectional contrastive losses and KL divergence constraints to ensure consistency between speech and mesh modalities. Extensive qualitative and quantitative experiments on public datasets demonstrate that PTalker effectively generates realistic, stylized 3D talking heads that accurately match identity-specific speaking styles, outperforming state-of-the-art methods. The source code and supplementary videos are available at: PTalker.
zh

[CV-134] KV-Tracker: Real-Time Pose Tracking with Transformers

【速读】: This paper addresses the low computational efficiency of multi-view 3D geometry networks in real-time applications, in particular the excessive latency of 6-DoF pose tracking and online reconstruction from monocular RGB streams. The key is to cache the key-value (KV) pairs of the global self-attention block: keyframes are rapidly selected and managed to map the scene or object, and the cached KV pairs then serve as the sole scene representation for online tracking, avoiding recomputation and yielding up to a 15× inference speedup while remaining free of drift and catastrophic forgetting. The strategy is model-agnostic and can be applied to existing pretrained multi-view networks without retraining.

链接: https://arxiv.org/abs/2512.22581
作者: Marwan Taher,Ignacio Alzugaray,Kirill Mazur,Xin Kong,Andrew J. Davison
机构: Dyson Robotics Lab, Imperial College London (帝国理工学院戴森机器人实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Multi-view 3D geometry networks offer a powerful prior but are prohibitively slow for real-time applications. We propose a novel way to adapt them for online use, enabling real-time 6-DoF pose tracking and online reconstruction of objects and scenes from monocular RGB videos. Our method rapidly selects and manages a set of images as keyframes to map a scene or object via π³ with full bidirectional attention. We then cache the global self-attention block's key-value (KV) pairs and use them as the sole scene representation for online tracking. This allows for up to 15× speedup during inference without the fear of drift or catastrophic forgetting. Our caching strategy is model-agnostic and can be applied to other off-the-shelf multi-view networks without retraining. We demonstrate KV-Tracker on both scene-level tracking and the more challenging task of on-the-fly object tracking and reconstruction without depth measurements or object priors. Experiments on the TUM RGB-D, 7-Scenes, Arctic and OnePose datasets show the strong performance of our system while maintaining high frame-rates up to ~27 FPS.
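A minimal sketch of the caching idea, assuming a toy single-head attention block: keyframe keys and values are projected once and then reused for every incoming frame, so only the per-frame queries are recomputed. The class name, shapes, and the omission of multi-head and positional details are simplifications, not the actual π³ implementation.

```python
import torch
import torch.nn.functional as F

class KVCacheAttention(torch.nn.Module):
    """Toy global attention whose keyframe keys/values can be cached and reused."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = torch.nn.Linear(dim, dim)
        self.k = torch.nn.Linear(dim, dim)
        self.v = torch.nn.Linear(dim, dim)

    @torch.no_grad()
    def build_cache(self, keyframe_tokens: torch.Tensor):
        # Run the expensive projections once over all keyframe tokens.
        return self.k(keyframe_tokens), self.v(keyframe_tokens)

    def forward(self, frame_tokens: torch.Tensor, cache):
        k, v = cache                      # (N_cache, dim), reused every frame
        q = self.q(frame_tokens)          # (N_frame, dim), recomputed per frame
        attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
        return attn @ v

dim = 64
attn = KVCacheAttention(dim)
cache = attn.build_cache(torch.randn(512, dim))  # keyframe map, computed once
out = attn(torch.randn(196, dim), cache)         # each live frame is cheap
print(out.shape)                                 # torch.Size([196, 64])
```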
zh

[CV-135] ReFRM3D: A Radiomics-enhanced Fused Residual Multiparametric 3D Network with Multi-Scale Feature Fusion for Glioma Characterization

【速读】: This paper addresses the challenges of glioma diagnosis and classification, including high variability in imaging data, inadequate optimization of computational resources, and inefficient tumor segmentation and classification. The key is a novel Radiomics-enhanced Fused Residual Multiparametric 3D network (ReFRM3D) built on a 3D U-Net architecture with multi-scale feature fusion, a hybrid upsampling mechanism, and extended residual skip connections, which markedly improves brain-tumor segmentation; combined with a multi-feature tumor-marker classifier that exploits radiomic features extracted from the segmented regions, the method achieves high Dice Similarity Coefficients (DSC) on the BraTS2019, BraTS2020, and BraTS2021 datasets, validating its effectiveness and robustness.

链接: https://arxiv.org/abs/2512.22570
作者: Md. Abdur Rahman,Mohaimenul Azam Khan Raiaan,Arefin Ittesafun Abian,Yan Zhang,Mirjam Jonkman,Sami Azam
机构: Applied Artificial Intelligence and Intelligent Systems (AAIINS) Laboratory (应用人工智能与智能系统实验室); Department of Computer Science and Engineering (计算机科学与工程系); United International University (国际大学); Department of Data Science and Artificial Intelligence (数据科学与人工智能系); Monash University (蒙纳士大学); Faculty of Science and Technology (科学与技术学院); Charles Darwin University (查尔斯达尔文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Gliomas are among the most aggressive cancers, characterized by high mortality rates and complex diagnostic processes. Existing studies on glioma diagnosis and classification often describe issues such as high variability in imaging data, inadequate optimization of computational resources, and inefficient segmentation and classification of gliomas. To address these challenges, we propose novel techniques utilizing multi-parametric MRI data to enhance tumor segmentation and classification efficiency. Our work introduces the first-ever radiomics-enhanced fused residual multiparametric 3D network (ReFRM3D) for brain tumor characterization, which is based on a 3D U-Net architecture and features multi-scale feature fusion, hybrid upsampling, and an extended residual skip mechanism. Additionally, we propose a multi-feature tumor marker-based classifier that leverages radiomic features extracted from the segmented regions. Experimental results demonstrate significant improvements in segmentation performance across the BraTS2019, BraTS2020, and BraTS2021 datasets, achieving high Dice Similarity Coefficients (DSC) of 94.04%, 92.68%, and 93.64% for whole tumor (WT), enhancing tumor (ET), and tumor core (TC) respectively in BraTS2019; 94.09%, 92.91%, and 93.84% in BraTS2020; and 93.70%, 90.36%, and 92.13% in BraTS2021.
zh

[CV-136] Self-Rewarded Multimodal Coherent Reasoning Across Diverse Visual Domains

【速读】: This paper addresses the fluent yet unreliable reasoning of multimodal large language models (MLLMs), namely weak step-to-step coherence and insufficient visual grounding, which stems from existing alignment methods supervising only the final answer while ignoring the reliability of intermediate reasoning. The key is SR-MCR, a lightweight, label-free framework that aligns reasoning using intrinsic process signals derived from the model's own outputs: five self-referential cues (semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency) are integrated into a normalized, reliability-weighted reward that provides fine-grained process-level guidance, combined with a critic-free GRPO objective and a confidence-aware cooling mechanism that stabilizes training and suppresses overconfident or trivial generations.

链接: https://arxiv.org/abs/2512.22545
作者: Jesen Zhang,Ningyuan Liu,Kaitong Cai,Sidi Liu,Jing Yang,Ziliang Chen,Xiaofei Sun,Keze Wang
机构: Sun Yat-sen University (中山大学); Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal LLMs often produce fluent yet unreliable reasoning, exhibiting weak step-to-step coherence and insufficient visual grounding, largely because existing alignment approaches supervise only the final answer while ignoring the reliability of the intermediate reasoning process. We introduce SR-MCR, a lightweight and label-free framework that aligns reasoning by exploiting intrinsic process signals derived directly from model outputs. Five self-referential cues – semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency – are integrated into a normalized, reliability-weighted reward that provides fine-grained process-level guidance. A critic-free GRPO objective, enhanced with a confidence-aware cooling mechanism, further stabilizes training and suppresses trivial or overly confident generations. Built on Qwen2.5-VL, SR-MCR improves both answer accuracy and reasoning coherence across a broad set of visual benchmarks; among open-source models of comparable size, SR-MCR-7B achieves state-of-the-art performance with an average accuracy of 81.4%. Ablation studies confirm the independent contributions of each reward term and the cooling module.
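A minimal sketch of a normalized, reliability-weighted combination of the five self-referential cues. The uniform placeholder weights, the clipping to [0, 1], and the function name are assumptions for illustration, not the paper's exact reward.

```python
import numpy as np

def self_reward(cues: dict, weights: dict) -> float:
    """Combine self-referential cue scores into one normalized process reward."""
    names = ["semantic", "lexical", "non_redundancy", "grounding", "step_consistency"]
    scores = np.array([np.clip(cues[n], 0.0, 1.0) for n in names])
    w = np.array([weights[n] for n in names])
    w = w / w.sum()                 # reliability weights sum to one
    return float(scores @ w)        # reward lies in [0, 1]

cues = {"semantic": 0.8, "lexical": 0.7, "non_redundancy": 0.9,
        "grounding": 0.6, "step_consistency": 0.75}
weights = dict.fromkeys(cues, 1.0)  # uniform weights as a placeholder
print(self_reward(cues, weights))
```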
zh

[CV-137] VLA-Arena: An Open-Source Framework for Benchmarking Vision-Language-Action Models

【速读】: This paper addresses the difficulty of quantifying the capability frontiers and failure modes of Vision-Language-Action models (VLAs) as they advance toward generalist robot policies. The authors propose VLA-Arena, a comprehensive benchmark whose core is a structured task-design framework that controls difficulty along three orthogonal axes: Task Structure, Language Command, and Visual Observation. Task Structure is divided into four dimensions (Safety, Distractor, Extrapolation, and Long Horizon), each with three difficulty levels (L0-L2), and fine-tuning is performed only at L0 to assess general capability; orthogonally, language perturbations (W0-W4) and visual perturbations (V0-V4) can be applied independently for a decoupled robustness analysis. This design lets researchers systematically expose key limitations of current VLAs, such as a tendency toward memorization, asymmetric robustness, disregard for safety constraints, and failures in composing skills over long horizons, motivating more generalizable and reliable robot-learning algorithms.

链接: https://arxiv.org/abs/2512.22539
作者: Borong Zhang,Jiahao Li,Jiachen Shen,Yishuai Cai,Yuhao Zhang,Yuanpei Chen,Juntao Dai,Jiaming Ji,Yaodong Yang
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While Vision-Language-Action models (VLAs) are rapidly advancing towards generalist robot policies, it remains difficult to quantitatively understand their limits and failure modes. To address this, we introduce a comprehensive benchmark called VLA-Arena. We propose a novel structured task design framework to quantify difficulty across three orthogonal axes: (1) Task Structure, (2) Language Command, and (3) Visual Observation. This allows us to systematically design tasks with fine-grained difficulty levels, enabling a precise measurement of model capability frontiers. For Task Structure, VLA-Arena’s 170 tasks are grouped into four dimensions: Safety, Distractor, Extrapolation, and Long Horizon. Each task is designed with three difficulty levels (L0-L2), with fine-tuning performed exclusively on L0 to assess general capability. Orthogonal to this, language (W0-W4) and visual (V0-V4) perturbations can be applied to any task to enable a decoupled analysis of robustness. Our extensive evaluation of state-of-the-art VLAs reveals several critical limitations, including a strong tendency toward memorization over generalization, asymmetric robustness, a lack of consideration for safety constraints, and an inability to compose learned skills for long-horizon tasks. To foster research addressing these challenges and ensure reproducibility, we provide the complete VLA-Arena framework, including an end-to-end toolchain from task definition to automated evaluation and the VLA-Arena-S/M/L datasets for fine-tuning. Our benchmark, data, models, and leaderboard are available at this https URL.
zh

[CV-138] CoAgent : Collaborative Planning and Consistency Agent for Coherent Video Generation

【速读】: This paper addresses the difficulty of maintaining narrative coherence and visual consistency in open-domain video generation: existing text-to-video models typically treat each shot independently, causing identity drift, scene inconsistency, and unstable temporal structure. The key is CoAgent, a collaborative closed-loop framework that casts video generation as a plan-synthesize-verify pipeline: a Storyboard Planner decomposes the input into structured shot-level plans with explicit entities, spatial relations, and temporal cues; a Global Context Manager maintains entity-level memory for cross-shot identity consistency; each shot is then generated by a Synthesis Module under the guidance of a Visual Consistency Controller, while a Verifier Agent evaluates intermediate results with vision-language reasoning and triggers selective regeneration when inconsistencies are detected; finally, a pacing-aware editor adjusts temporal rhythm and transitions to match the narrative flow.

链接: https://arxiv.org/abs/2512.22536
作者: Qinglin Zeng,Kaitong Cai,Ruiqi Chen,Qinhan Lv,Keze Wang
机构: Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Maintaining narrative coherence and visual consistency remains a central challenge in open-domain video generation. Existing text-to-video models often treat each shot independently, resulting in identity drift, scene inconsistency, and unstable temporal structure. We propose CoAgent, a collaborative and closed-loop framework for coherent video generation that formulates the process as a plan-synthesize-verify pipeline. Given a user prompt, style reference, and pacing constraints, a Storyboard Planner decomposes the input into structured shot-level plans with explicit entities, spatial relations, and temporal cues. A Global Context Manager maintains entity-level memory to preserve appearance and identity consistency across shots. Each shot is then generated by a Synthesis Module under the guidance of a Visual Consistency Controller, while a Verifier Agent evaluates intermediate results using vision-language reasoning and triggers selective regeneration when inconsistencies are detected. Finally, a pacing-aware editor refines temporal rhythm and transitions to match the desired narrative flow. Extensive experiments demonstrate that CoAgent significantly improves coherence, visual consistency, and narrative quality in long-form video generation.
zh

[CV-139] DreamOmni3: Scribble-based Editing and Generation

【速读】: This paper addresses the difficulty current generative AI models have in capturing users' intended edit locations and fine-grained visual details in text-prompted image editing and generation, since language alone cannot express complex local modifications precisely. The key is to propose two scribble-based tasks, scribble-based editing and scribble-based generation, together with the DreamOmni3 dataset and framework. The central innovation is a joint input scheme that feeds both the original and the scribbled source image into the model, distinguishing edit regions by color, and applies the same index and position encodings to both images so that the model can precisely localize scribbled regions and handle complex multi-source inputs (multiple scribbles, images, and text instructions) consistently, substantially improving editing precision and flexibility.

链接: https://arxiv.org/abs/2512.22525
作者: Bin Xia,Bohao Peng,Jiyang Liu,Sitong Wu,Jingyao Li,Junjia Huang,Xu Zhao,Yitong Wang,Ruihang Chu,Bei Yu,Jiaya Jia
机构: CUHK (香港中文大学); ByteDance Inc (字节跳动公司); HKUST (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently unified generation and editing models have achieved remarkable success with their impressive performance. These models rely mainly on text prompts for instruction-based editing and generation, but language often fails to capture users intended edit locations and fine-grained visual details. To this end, we propose two tasks: scribble-based editing and generation, that enables more flexible creation on graphical user interface (GUI) combining user textual, images, and freehand sketches. We introduce DreamOmni3, tackling two challenges: data creation and framework design. Our data synthesis pipeline includes two parts: scribble-based editing and generation. For scribble-based editing, we define four tasks: scribble and instruction-based editing, scribble and multimodal instruction-based editing, image fusion, and doodle editing. Based on DreamOmni2 dataset, we extract editable regions and overlay hand-drawn boxes, circles, doodles or cropped image to construct training data. For scribble-based generation, we define three tasks: scribble and instruction-based generation, scribble and multimodal instruction-based generation, and doodle generation, following similar data creation pipelines. For the framework, instead of using binary masks, which struggle with complex edits involving multiple scribbles, images, and instructions, we propose a joint input scheme that feeds both the original and scribbled source images into the model, using different colors to distinguish regions and simplify processing. By applying the same index and position encodings to both images, the model can precisely localize scribbled regions while maintaining accurate editing. Finally, we establish comprehensive benchmarks for these tasks to promote further research. Experimental results demonstrate that DreamOmni3 achieves outstanding performance, and models and code will be publicly released.
zh

[CV-140] SCAFusion: A Multimodal 3D Detection Framework for Small Object Detection in Lunar Surface Exploration

【速读】: This paper addresses the reliable and precise detection of small, irregular objects (e.g., meteor fragments and rocks) in lunar surface exploration, which is critical for autonomous navigation and operation, given that existing multimodal 3D perception methods designed for terrestrial autonomous driving underperform in extraterrestrial environments due to poor feature alignment, limited multimodal synergy, and weak small-object detection. The key components of the proposed SCAFusion model are: a Cognitive Adapter for efficient camera-backbone fine-tuning; a Contrastive Alignment Module that improves camera-LiDAR feature consistency; a camera auxiliary training branch that strengthens visual representation; and, most centrally, a Section-aware Coordinate Attention mechanism dedicated to detecting small, irregular targets. With a negligible increase in parameters and computation, the model reaches 69.7% mAP and 72.1% NDS on nuScenes and 90.93% mAP in a simulated Isaac Sim lunar environment, 11.5% above the baseline, with particularly clear gains on small obstacles.

链接: https://arxiv.org/abs/2512.22503
作者: Xin Chen,Kang Luo,Yangyi Xiao,Hesheng Wang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reliable and precise detection of small and irregular objects, such as meteor fragments and rocks, is critical for autonomous navigation and operation in lunar surface exploration. Existing multimodal 3D perception methods designed for terrestrial autonomous driving often underperform in off-world environments due to poor feature alignment, limited multimodal synergy, and weak small-object detection. This paper presents SCAFusion, a multimodal 3D object detection model tailored for lunar robotic missions. Built upon the BEVFusion framework, SCAFusion integrates a Cognitive Adapter for efficient camera backbone tuning, a Contrastive Alignment Module to enhance camera-LiDAR feature consistency, a Camera Auxiliary Training Branch to strengthen visual representation, and most importantly, a Section-aware Coordinate Attention mechanism explicitly designed to boost the detection performance of small, irregular targets. With negligible increase in parameters and computation, our model achieves 69.7% mAP and 72.1% NDS on the nuScenes validation set, improving the baseline by 5.0% and 2.7%, respectively. In simulated lunar environments built on Isaac Sim, SCAFusion achieves 90.93% mAP, outperforming the baseline by 11.5%, with notable gains in detecting small meteor-like obstacles.
zh

[CV-141] racking by Predicting 3-D Gaussians Over Time

【速读】: This paper addresses the lack of effective self-supervised pretraining for video representation learning, in particular how to learn spatiotemporally consistent video features without large amounts of annotation. The key is Video Gaussian Masked Autoencoders (Video-GMAE), which represent a video as a set of Gaussian splats moving over time, enforcing the inductive bias that 2D videos are consistent projections of a dynamic 3D scene. Tracking emerges naturally from pretraining with this architecture, yielding zero-shot tracking performance, and with small-scale finetuning the models clearly outperform existing self-supervised video approaches on the Kinetics and Kubric datasets.

链接: https://arxiv.org/abs/2512.22489
作者: Tanish Baranwal,Himanshu Gaurav Singh,Jathushan Rajasegaran,Jitendra Malik
机构: University of California, Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach for representation learning that encodes a sequence of images into a set of Gaussian splats moving over time. Representing a video as a set of Gaussians enforces a reasonable inductive bias: that 2-D videos are often consistent projections of a dynamic 3-D scene. We find that tracking emerges when pretraining a network with this architecture. Mapping the trajectory of the learnt Gaussians onto the image plane gives zero-shot tracking performance comparable to state-of-the-art. With small-scale finetuning, our models achieve 34.6% improvement on Kinetics, and 13.1% on Kubric datasets, surpassing existing self-supervised video approaches. The project page and code are publicly available at this https URL and this https URL.
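A toy illustration of the inductive bias, under an assumed pinhole camera: Gaussian means that drift in 3D project to 2D point tracks over time, which is where the zero-shot tracking behavior comes from. The intrinsics, the linear motion model, and all sizes are fabricated for illustration, not the paper's decoder.

```python
import numpy as np

def project(points_3d: np.ndarray, fx=500.0, fy=500.0, cx=112.0, cy=112.0):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixels."""
    x, y, z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    return np.stack([fx * x / z + cx, fy * y / z + cy], axis=1)

# Toy "Gaussian means over time": N Gaussians drifting for T frames.
N, T = 8, 5
means = np.random.rand(N, 3) + np.array([0.0, 0.0, 3.0])   # keep depth z > 0
velocity = 0.05 * np.random.randn(N, 3)
tracks = np.stack([project(means + t * velocity) for t in range(T)])
print(tracks.shape)  # (T, N, 2): a 2D track per Gaussian, i.e. emergent tracking
```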
zh

[CV-142] oward Real-World IoT Security: Concept Drift-Resilient IoT Botnet Detection via Latent Space Representation Learning and Alignment

【速读】: This paper addresses two obstacles to deploying AI-based IoT threat-detection models in enterprise settings: reliance on stationary datasets that cannot track the concept drift of real-world NetFlow traffic, and existing solutions' dependence on periodic classifier retraining, which incurs high computational overhead and risks catastrophic forgetting. The key is a scalable adaptive framework in which the classifier is trained only once: an alignment model maps incoming traffic into the latent space of historical traffic before classification, preserving knowledge of previously observed attacks; furthermore, to capture inter-instance relationships among attack samples, the low-dimensional latent representations are converted into a graph structure and classified with a graph neural network (GNN), maintaining robust detection performance under concept drift.

链接: https://arxiv.org/abs/2512.22488
作者: Hassan Wasswa,Timothy Lynar
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Although AI-based models have achieved high accuracy in IoT threat detection, their deployment in enterprise environments is constrained by reliance on stationary datasets that fail to reflect the dynamic nature of real-world IoT NetFlow traffic, which is frequently affected by concept drift. Existing solutions typically rely on periodic classifier retraining, resulting in high computational overhead and the risk of catastrophic forgetting. To address these challenges, this paper proposes a scalable framework for adaptive IoT threat detection that eliminates the need for continuous classifier retraining. The proposed approach trains a classifier once on latent-space representations of historical traffic, while an alignment model maps incoming traffic to the learned historical latent space prior to classification, thereby preserving knowledge of previously observed attacks. To capture inter-instance relationships among attack samples, the low-dimensional latent representations are further transformed into a graph-structured format and classified using a graph neural network. Experimental evaluations on real-world heterogeneous IoT traffic datasets demonstrate that the proposed framework maintains robust detection performance under concept drift. These results highlight the framework’s potential for practical deployment in dynamic and large-scale IoT environments.
zh

[CV-143] Scalpel-SAM: A Semi-Supervised Paradigm for Adapting SAM to Infrared Small Object Detection

【速读】: This paper addresses the data scarcity caused by expensive annotation in infrared small object detection (IR-SOT), and the challenges of using generative foundation models such as the Segment Anything Model (SAM) as teachers: domain gaps, the inability to encode physical priors, and architectural complexity. The key is a two-stage knowledge distillation and transfer paradigm: first, a Hierarchical MoE Adapter composed of four white-box neural operators is combined with 10% of the fully supervised data for prior-guided knowledge distillation, producing a strong expert teacher (Scalpel-SAM); second, in a deployment-oriented knowledge-transfer stage, the teacher generates pseudo labels to train lightweight, efficient downstream models. With minimal annotation, downstream models match or even surpass their fully supervised counterparts, making this the first semi-supervised paradigm to systematically address data scarcity in IR-SOT.

链接: https://arxiv.org/abs/2512.22483
作者: Zihan Liu,Xiangning Ren,Dezhang Kong,Yipeng Zhang,Meng Han
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 3 figures

点击查看摘要

Abstract:Infrared small object detection urgently requires semi-supervised paradigms due to the high cost of annotation. However, existing methods like SAM face significant challenges of domain gaps, inability of encoding physical priors, and inherent architectural complexity. To address this, we designed a Hierarchical MoE Adapter consisting of four white-box neural operators. Building upon this core component, we propose a two-stage paradigm for knowledge distillation and transfer: (1) Prior-Guided Knowledge Distillation, where we use our MoE adapter and 10% of available fully supervised data to distill SAM into an expert teacher (Scalpel-SAM); and (2) Deployment-Oriented Knowledge Transfer, where we use Scalpel-SAM to generate pseudo labels for training lightweight and efficient downstream models. Experiments demonstrate that with minimal annotations, our paradigm enables downstream models to achieve performance comparable to, or even surpassing, their fully supervised counterparts. To our knowledge, this is the first semi-supervised paradigm that systematically addresses the data scarcity issue in IR-SOT using SAM as the teacher model.
zh

[CV-144] Event-based high temporal resolution measurement of shock wave motion field

【速读】: This paper addresses the precise measurement of shock-wave motion parameters at high spatiotemporal resolution, which is difficult because shock waves propagate fast and unevenly under unstable testing conditions. The key is a novel multi-event-camera framework that exploits the high speed and high dynamic range of event cameras: a polar coordinate system is established to encode events and reveal shock-wave propagation patterns, combined with adaptive region-of-interest (ROI) extraction; wavefront events are then precisely extracted via iterative slope analysis; finally, based on an event-driven optical imaging model and a 3D reconstruction model, the method achieves multi-angle shock-wave measurement, motion-field reconstruction, and explosive-equivalence inversion, with experimental errors ranging from a minimum of 0.06% to a maximum of 5.20%.

链接: https://arxiv.org/abs/2512.22474
作者: Taihang Lei,Banglei Guan,Minzu Liang,Pengju Sun,Jing Tao,Yang Shang,Qifeng Yu
机构: National University of Defense Technology (国防科技大学); Hunan Provincial Key Laboratory of Image Measurement and Vision Navigation (湖南省图像测量与视觉导航重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate measurement of shock wave motion parameters with high spatiotemporal resolution is essential for applications such as power field testing and damage assessment. However, significant challenges are posed by the fast, uneven propagation of shock waves and unstable testing conditions. To address these challenges, a novel framework is proposed that utilizes multiple event cameras to estimate the asymmetry of shock waves, leveraging its high-speed and high-dynamic range capabilities. Initially, a polar coordinate system is established, which encodes events to reveal shock wave propagation patterns, with adaptive region-of-interest (ROI) extraction through event offset calculations. Subsequently, shock wave front events are extracted using iterative slope analysis, exploiting the continuity of velocity changes. Finally, the geometric model of events and shock wave motion parameters is derived according to event-based optical imaging model, along with the 3D reconstruction model. Through the above process, multi-angle shock wave measurement, motion field reconstruction, and explosive equivalence inversion are achieved. The results of the speed measurement are compared with those of the pressure sensors and the empirical formula, revealing a maximum error of 5.20% and a minimum error of 0.06%. The experimental results demonstrate that our method achieves high-precision measurement of the shock wave motion field with both high spatial and temporal resolution, representing significant progress.
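A minimal sketch of the polar encoding plus a slope-based speed estimate on synthetic events: events are re-expressed as (radius, angle, time) around the blast center, and a least-squares slope of wavefront radius over time gives the front speed. The blast center, noise level, and simple linear fit are illustrative assumptions, not the paper's iterative slope analysis.

```python
import numpy as np

def to_polar(events: np.ndarray, center: np.ndarray) -> np.ndarray:
    """events: (N, 3) rows of (x, y, t). Returns (N, 3) rows of (r, theta, t)."""
    d = events[:, :2] - center
    r = np.hypot(d[:, 0], d[:, 1])
    theta = np.arctan2(d[:, 1], d[:, 0])
    return np.stack([r, theta, events[:, 2]], axis=1)

# Synthetic wavefront: radius grows linearly with time at 340 px/s.
t = np.random.uniform(0, 0.01, 2000)
theta = np.random.uniform(-np.pi, np.pi, 2000)
r = 340.0 * t + np.random.normal(0, 0.1, 2000)
events = np.stack([256 + r * np.cos(theta), 256 + r * np.sin(theta), t], axis=1)

polar = to_polar(events, center=np.array([256.0, 256.0]))
speed = np.polyfit(polar[:, 2], polar[:, 0], deg=1)[0]  # slope of r over t
print(f"estimated front speed: {speed:.1f} px/s")
```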
zh

[CV-145] Pose-Guided Residual Refinement for Interpretable Text-to-Motion Generation and Editing

【速读】: This paper addresses the limitations of pose-code-based 3D motion generation and editing frameworks in capturing subtle temporal dynamics and high-frequency details, which degrade reconstruction fidelity and local controllability. The key is a hybrid representation, pose-guided residual refinement for motion (PGR²M), which introduces residual codes via residual vector quantization (RVQ) to decompose motion into pose latents that encode coarse global structure and residual latents that model fine-grained temporal variation; a pose-guided RVQ tokenizer and a residual-dropout mechanism balance semantic alignment, structural fidelity, and editing flexibility. On HumanML3D and KIT-ML, the method improves Fréchet inception distance and reconstruction metrics for both generation and editing, and user studies confirm intuitive, structure-preserving edits.

链接: https://arxiv.org/abs/2512.22464
作者: Sukhyun Jeong,Yong-Hoon Choi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Text-based 3D motion generation aims to automatically synthesize diverse motions from natural-language descriptions to extend user creativity, whereas motion editing modifies an existing motion sequence in response to text while preserving its overall structure. Pose-code-based frameworks such as CoMo map quantifiable pose attributes into discrete pose codes that support interpretable motion control, but their frame-wise representation struggles to capture subtle temporal dynamics and high-frequency details, often degrading reconstruction fidelity and local controllability. To address this limitation, we introduce pose-guided residual refinement for motion (PGR ^2 M), a hybrid representation that augments interpretable pose codes with residual codes learned via residual vector quantization (RVQ). A pose-guided RVQ tokenizer decomposes motion into pose latents that encode coarse global structure and residual latents that model fine-grained temporal variations. Residual dropout further discourages over-reliance on residuals, preserving the semantic alignment and editability of the pose codes. On top of this tokenizer, a base Transformer autoregressively predicts pose codes from text, and a refine Transformer predicts residual codes conditioned on text, pose codes, and quantization stage. Experiments on HumanML3D and KIT-ML show that PGR ^2 M improves Fréchet inception distance and reconstruction metrics for both generation and editing compared with CoMo and recent diffusion- and tokenization-based baselines, while user studies confirm that it enables intuitive, structure-preserving motion edits.
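A minimal sketch of two-stage residual vector quantization (RVQ), the mechanism behind the pose/residual split: each stage quantizes what the previous stages missed, so stage 0 plays the role of the coarse pose latent and later stages carry fine residuals. The random codebooks and sizes are placeholders; the paper learns its tokenizer end to end.

```python
import torch

def quantize(x: torch.Tensor, codebook: torch.Tensor):
    """Nearest-codeword lookup: x (N, D), codebook (K, D) -> (codes, quantized)."""
    dists = torch.cdist(x, codebook)          # (N, K) pairwise distances
    codes = dists.argmin(dim=1)
    return codes, codebook[codes]

def rvq_encode(x, codebooks):
    """Residual VQ: each stage quantizes the residual left by earlier stages."""
    residual, out = x, []
    for cb in codebooks:
        codes, q = quantize(residual, cb)
        residual = residual - q
        out.append(codes)
    return out, x - residual                  # codes per stage, reconstruction

x = torch.randn(16, 32)                       # stand-in motion latents
codebooks = [torch.randn(256, 32), torch.randn(256, 32)]
codes, recon = rvq_encode(x, codebooks)
print([c.shape for c in codes], (x - recon).norm().item())
```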
zh

[CV-146] Comparing Object Detection Models for Electrical Substation Component Mapping

【速读】: This paper addresses the inefficiency and labor intensity of traditional manual mapping when identifying key electrical-substation components and assessing their vulnerability, which is needed for hazard mitigation and risk management of grid infrastructure. The key is to use computer-vision models (YOLOv8, YOLOv11, and RF-DETR) to automatically detect and classify components in imagery of US substations; by comparing the three models on detection accuracy, precision, and computational efficiency, the study identifies the algorithm best suited to large-scale substation component mapping, enabling efficient and reliable automated asset identification and spatial mapping.

链接: https://arxiv.org/abs/2512.22454
作者: Haley Mody,Namish Bansal,Dennies Kiprono Bor,Edward J. Oughton (George Mason University)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages, 13 figures

点击查看摘要

Abstract:Electrical substations are a significant component of an electrical grid. Indeed, the assets at these substations (e.g., transformers) are prone to disruption from many hazards, including hurricanes, flooding, earthquakes, and geomagnetically induced currents (GICs). As electrical grids are considered critical national infrastructure, any failure can have significant economic and public safety implications. To help prevent and mitigate these failures, it is thus essential that we identify key substation components to quantify vulnerability. Unfortunately, traditional manual mapping of substation infrastructure is time-consuming and labor-intensive. Therefore, an autonomous solution utilizing computer vision models is preferable, as it allows for greater convenience and efficiency. In this research paper, we train and compare the outputs of 3 models (YOLOv8, YOLOv11, RF-DETR) on a manually labeled dataset of US substation images. Each model is evaluated for detection accuracy, precision, and efficiency. We present the key strengths and limitations of each model, identifying which provides reliable and large-scale substation component mapping. Additionally, we utilize these models to effectively map the various substation components in the United States, showcasing a use case for machine learning in substation mapping.
zh

[CV-147] SAM 3D for 3D Object Reconstruction from Remote Sensing Images

【速读】: This paper addresses the scalability problem of 3D building reconstruction from monocular remote-sensing imagery: existing methods typically rely on task-specific architectures and extensive annotation, hindering efficient, general urban 3D modeling. The key is the first systematic evaluation of SAM 3D, a general-purpose image-to-3D foundation model, for monocular building reconstruction, extended to urban scene modeling through a segment-reconstruct-compose pipeline; this yields clearly more coherent roof geometry and sharper boundaries, demonstrating the potential and practicality of foundation models in complex urban environments.

链接: https://arxiv.org/abs/2512.22452
作者: Junsheng Yao,Lichao Mou,Qingyu Li
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); MedAI Technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Monocular 3D building reconstruction from remote sensing imagery is essential for scalable urban modeling, yet existing methods often require task-specific architectures and intensive supervision. This paper presents the first systematic evaluation of SAM 3D, a general-purpose image-to-3D foundation model, for monocular remote sensing building reconstruction. We benchmark SAM 3D against TRELLIS on samples from the NYC Urban Dataset, employing Frechet Inception Distance (FID) and CLIP-based Maximum Mean Discrepancy (CMMD) as evaluation metrics. Experimental results demonstrate that SAM 3D produces more coherent roof geometry and sharper boundaries compared to TRELLIS. We further extend SAM 3D to urban scene reconstruction through a segment-reconstruct-compose pipeline, demonstrating its potential for urban scene modeling. We also analyze practical limitations and discuss future research directions. These findings provide practical guidance for deploying foundation models in urban 3D reconstruction and motivate future integration of scene-level structural priors.
zh

[CV-148] SonoVision: A Computer Vision Approach for Helping Visually Challenged Individuals Locate Objects with the Help of Sound Cues

【速读】: This paper addresses the difficulty visually impaired people face in locating objects independently in daily life, a limitation that reduces their autonomy and can create safety risks. The key is SonoVision, a smartphone application that provides spatial acoustic feedback through earphones: when an object is to the user's left or right, a sinusoidal tone is played in the corresponding ear, and when the object is directly ahead, both ears ring simultaneously. The app is built on the Flutter platform and integrates the EfficientDet-D2 object-detection model for real-time recognition that works fully offline, giving users safe, convenient, and efficient navigation assistance without a network connection.

链接: https://arxiv.org/abs/2512.22449
作者: Md Abu Obaida Zishan,Annajiat Alim Rasel
机构: BRAC University (BRAC大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Locating objects for the visually impaired is a significant challenge and is something no one can get used to over time. However, this hinders their independence and could push them towards risky and dangerous scenarios. Hence, in the spirit of making the visually challenged more self-sufficient, we present SonoVision, a smart-phone application that helps them find everyday objects using sound cues through earphones/headphones. This simply means, if an object is on the right or left side of a user, the app makes a sinusoidal sound in a user’s respective ear through ear/headphones. However, to indicate objects located directly in front, both the left and right earphones are rung simultaneously. These sound cues could easily help a visually impaired individual locate objects with the help of their smartphones and reduce the reliance on people in their surroundings, consequently making them more independent. This application is made with the flutter development platform and uses the Efficientdet-D2 model for object detection in the backend. We believe the app will significantly assist the visually impaired in a safe and user-friendly manner with its capacity to work completely offline. Our application can be accessed here this https URL.
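A minimal sketch of the left/center/right sound-cue logic, assuming a normalized horizontal box center and a fixed center band; the band width, tone frequency, and duration are illustrative values, not the app's actual parameters.

```python
import numpy as np

def sound_cue(box, image_width, center_band=0.2, sr=16000, dur=0.3, freq=880.0):
    """Return (left, right) sinusoid channels for a detection box (x1, y1, x2, y2)."""
    cx = (box[0] + box[2]) / 2 / image_width          # 0 = far left, 1 = far right
    t = np.linspace(0, dur, int(sr * dur), endpoint=False)
    tone = np.sin(2 * np.pi * freq * t)
    if abs(cx - 0.5) < center_band / 2:               # object straight ahead:
        return tone, tone                             # ring both ears
    return (tone, np.zeros_like(tone)) if cx < 0.5 else (np.zeros_like(tone), tone)

left, right = sound_cue((600, 100, 700, 200), image_width=720)
print(left.max(), right.max())   # only the right channel carries the tone
```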
zh

[CV-149] owards Robust Optical-SAR Object Detection under Missing Modalities: A Dynamic Quality-Aware Fusion Framework

【速读】: This paper addresses the performance degradation of optical-SAR fusion object detection under missing or degraded modalities, since well-aligned optical-SAR image pairs are hard to obtain in practice and existing methods lack robustness to randomly missing modalities. The key is a Quality-Aware Dynamic Fusion Network (QDFNet) with two core designs: a Dynamic Modality Quality Assessment (DMQA) module that uses learnable reference tokens to iteratively refine feature-reliability assessment, precisely identifying degraded regions and providing quality guidance for the subsequent fusion; and an Orthogonal Constraint Normalization Fusion (OCNF) module that preserves modality independence via orthogonal constraints while adaptively adjusting fusion weights according to reliability scores, effectively suppressing the propagation of unreliable features. Together these yield markedly better detection when modalities are partially corrupted or missing.

链接: https://arxiv.org/abs/2512.22447
作者: Zhicheng Zhao,Yuancheng Xu,Andong Lu,Chenglong Li,Jin Tang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Optical and Synthetic Aperture Radar (SAR) fusion-based object detection has attracted significant research interest in remote sensing, as these modalities provide complementary information for all-weather monitoring. However, practical deployment is severely limited by inherent challenges. Due to distinct imaging mechanisms, temporal asynchrony, and registration difficulties, obtaining well-aligned optical-SAR image pairs remains extremely difficult, frequently resulting in missing or degraded modality data. Although recent approaches have attempted to address this issue, they still suffer from limited robustness to random missing modalities and lack effective mechanisms to ensure consistent performance improvement in fusion-based detection. To address these limitations, we propose a novel Quality-Aware Dynamic Fusion Network (QDFNet) for robust optical-SAR object detection. Our proposed method leverages learnable reference tokens to dynamically assess feature reliability and guide adaptive fusion in the presence of missing modalities. In particular, we design a Dynamic Modality Quality Assessment (DMQA) module that employs learnable reference tokens to iteratively refine feature reliability assessment, enabling precise identification of degraded regions and providing quality guidance for subsequent fusion. Moreover, we develop an Orthogonal Constraint Normalization Fusion (OCNF) module that employs orthogonal constraints to preserve modality independence while dynamically adjusting fusion weights based on reliability scores, effectively suppressing unreliable feature propagation. Extensive experiments on the SpaceNet6-OTD and OGSOD-2.0 datasets demonstrate the superiority and effectiveness of QDFNet compared to state-of-the-art methods, particularly under partial modality corruption or missing data scenarios.
zh

[CV-150] LECalib: Line-Based Event Camera Calibration

【速读】: This paper addresses the drawbacks of conventional event-camera calibration, which relies on flashing patterns or manually placed calibration objects and is therefore time-consuming and ill-suited to dynamic scenes. The key is a line-based calibration framework that detects the straight-line features of common man-made objects (e.g., door and window frames) directly from event streams and uses an event-line calibration model, applicable to both planar and non-planar lines, to produce an initial estimate of the camera parameters, which are then refined via non-linear optimization, enabling efficient and robust calibration of monocular and stereo event cameras.

链接: https://arxiv.org/abs/2512.22441
作者: Zibin Liu,Banglei Guana,Yang Shanga,Zhenbao Yu,Yifei Bian,Qifeng Yu
机构: National University of Defense Technology (国防科技大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 Pages, 6 figures

点击查看摘要

Abstract:Camera calibration is an essential prerequisite for event-based vision applications. Current event camera calibration methods typically involve using flashing patterns, reconstructing intensity images, and utilizing the features extracted from events. Existing methods are generally time-consuming and require manually placed calibration objects, which cannot meet the needs of rapidly changing scenarios. In this paper, we propose a line-based event camera calibration framework exploiting the geometric lines of commonly-encountered objects in man-made environments, e.g., doors, windows, boxes, etc. Different from previous methods, our method detects lines directly from event streams and leverages an event-line calibration model to generate the initial guess of camera parameters, which is suitable for both planar and non-planar lines. Then, a non-linear optimization is adopted to refine camera parameters. Both simulation and real-world experiments have demonstrated the feasibility and accuracy of our method, with validation performed on monocular and stereo event cameras. The source code is released at this https URL.
zh

[CV-151] SuperiorGAT: Graph Attention Networks for Sparse LiDAR Point Cloud Reconstruction in Autonomous Systems

【速读】: This paper addresses beam dropout in LiDAR, caused by fixed vertical resolution and environmental occlusion, which compromises the completeness and geometric consistency of point clouds in autonomous systems. The key is SuperiorGAT, a graph-attention-based framework that models LiDAR scans as beam-aware graphs and combines gated residual fusion with a feed-forward refinement module to reconstruct the missing elevation information in sparse point clouds, achieving better performance without increasing network depth.

链接: https://arxiv.org/abs/2512.22439
作者: Khalfalla Awedat,Mohamed Abidalrekab,Gurcan Comert,Mustafa Ayad
机构: Portland State University (波特兰州立大学); North Carolina A&T State University (北卡罗来纳农业技术州立大学); SUNY Oswego (纽约州立大学奥斯威戈分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LiDAR-based perception in autonomous systems is constrained by fixed vertical beam resolution and further compromised by beam dropout resulting from environmental occlusions. This paper introduces SuperiorGAT, a graph attention-based framework designed to reconstruct missing elevation information in sparse LiDAR point clouds. By modeling LiDAR scans as beam-aware graphs and incorporating gated residual fusion with feed-forward refinement, SuperiorGAT enables accurate reconstruction without increasing network depth. To evaluate performance, structured beam dropout is simulated by removing every fourth vertical scanning beam. Extensive experiments across diverse KITTI environments, including Person, Road, Campus, and City sequences, demonstrate that SuperiorGAT consistently achieves lower reconstruction error and improved geometric consistency compared to PointNet-based models and deeper GAT baselines. Qualitative X-Z projections further confirm the model’s ability to preserve structural integrity with minimal vertical distortion. These results suggest that architectural refinement offers a computationally efficient method for improving LiDAR resolution without requiring additional sensor hardware.
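A minimal single-head graph-attention layer over a k-NN point graph in plain PyTorch, showing the neighbor-weighting idea at the heart of GAT-style models; it deliberately omits the beam-aware graph construction, gated residual fusion, and refinement that define SuperiorGAT.

```python
import torch
import torch.nn.functional as F

class TinyGATLayer(torch.nn.Module):
    """Single-head graph attention: each point attends to its listed neighbors."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)
        self.att = torch.nn.Linear(2 * dim, 1)

    def forward(self, x: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) point features; neighbors: (N, k) neighbor indices.
        h = self.proj(x)                               # (N, dim)
        hn = h[neighbors]                              # (N, k, dim) neighbor feats
        hc = h.unsqueeze(1).expand_as(hn)              # center repeated k times
        e = F.leaky_relu(self.att(torch.cat([hc, hn], dim=-1))).squeeze(-1)
        alpha = F.softmax(e, dim=1)                    # (N, k) attention weights
        return (alpha.unsqueeze(-1) * hn).sum(dim=1)   # weighted neighbor sum

pts = torch.randn(1024, 16)
nbrs = torch.randint(0, 1024, (1024, 8))               # stand-in k-NN graph
print(TinyGATLayer(16)(pts, nbrs).shape)               # torch.Size([1024, 16])
```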
zh

[CV-152] EmoCtrl: Controllable Emotional Image Content Generation

【速读】: This paper addresses the difficulty of balancing content faithfulness and emotional expression in current generative image models: existing text-to-image models preserve content but lack emotional awareness, while emotion-driven models often distort content in pursuit of affect. The key is the EmoCtrl framework, built on a dataset annotated with content, emotion, and affective prompts that bridges abstract emotions to visual cues; textual and visual emotion-enhancement modules learn controllable emotion tokens that jointly improve affective expressiveness and content fidelity, realizing Controllable Emotional Image Content Generation (C-EICG).

链接: https://arxiv.org/abs/2512.22437
作者: Jingyuan Yang,Weibin Luo,Hui Huang
机构: Shenzhen University (深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:An image conveys meaning through both its visual content and emotional tone, jointly shaping human perception. We introduce Controllable Emotional Image Content Generation (C-EICG), which aims to generate images that remain faithful to a given content description while expressing a target emotion. Existing text-to-image models ensure content consistency but lack emotional awareness, whereas emotion-driven models generate affective results at the cost of content distortion. To address this gap, we propose EmoCtrl, supported by a dataset annotated with content, emotion, and affective prompts, bridging abstract emotions to visual cues. EmoCtrl incorporates textual and visual emotion enhancement modules that enrich affective expression via descriptive semantics and perceptual cues. The learned emotion tokens exhibit complementary effects, as demonstrated through ablations and visualizations. Quantatitive and qualatitive experiments demonstrate that EmoCtrl achieves faithful content and expressive emotion control, outperforming existing methods across multiple aspects. User studies confirm EmoCtrl’s strong alignment with human preference. Moreover, EmoCtrl generalizes well to creative applications, further demonstrating the robustness and adaptability of the learned emotion tokens.
zh

[CV-153] FluenceFormer: Transformer-Driven Multi-Beam Fluence Map Regression for Radiotherapy Planning

【速读】: This paper addresses fluence-map prediction for radiotherapy planning, an ill-posed inverse problem arising from the complex nonlinear relationship between anatomy and beam-intensity modulation; conventional convolutional approaches struggle to model long-range dependencies and often produce structurally inconsistent or physically unrealizable plans. The key is FluenceFormer, a backbone-agnostic transformer framework with a two-stage design: Stage 1 predicts a global dose prior from anatomical inputs, and Stage 2 conditions the prior on explicit beam geometry to regress physically calibrated fluence maps. Central to the approach is the Fluence-Aware Regression (FAR) loss, a physics-informed objective combining voxel-level fidelity, gradient smoothness, structural consistency, and beam-wise energy conservation, which markedly improves the physical plausibility and structural accuracy of the generated plans.

链接: https://arxiv.org/abs/2512.22425
作者: Ujunwa Mgboh,Rafi Ibn Sultan,Joshua Kim,Kundan Thind,Dongxiao Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fluence map prediction is central to automated radiotherapy planning but remains an ill-posed inverse problem due to the complex relationship between volumetric anatomy and beam-intensity modulation. Convolutional methods in prior work often struggle to capture long-range dependencies, which can lead to structurally inconsistent or physically unrealizable plans. We introduce FluenceFormer, a backbone-agnostic transformer framework for direct, geometry-aware fluence regression. The model uses a unified two-stage design: Stage 1 predicts a global dose prior from anatomical inputs, and Stage 2 conditions this prior on explicit beam geometry to regress physically calibrated fluence maps. Central to the approach is the Fluence-Aware Regression (FAR) loss, a physics-informed objective that integrates voxel-level fidelity, gradient smoothness, structural consistency, and beam-wise energy conservation. We evaluate the generality of the framework across multiple transformer backbones, including Swin UNETR, UNETR, nnFormer, and MedFormer, using a prostate IMRT dataset. FluenceFormer with Swin UNETR achieves the strongest performance among the evaluated models and improves over existing benchmark CNN and single-stage methods, reducing Energy Error to 4.5% and yielding statistically significant gains in structural fidelity (p < 0.05).
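A minimal sketch of a FAR-style composite objective combining voxel MSE, gradient fidelity, and beam-wise energy conservation. The term weights, the relative-energy form, and the omission of the structural-consistency term are assumptions for illustration, not the paper's exact loss.

```python
import torch

def fluence_aware_loss(pred, target, w_mse=1.0, w_grad=0.1, w_energy=0.1):
    """pred/target: (B, n_beams, H, W) fluence maps. FAR-style composite loss."""
    mse = torch.mean((pred - target) ** 2)

    # Gradient fidelity: match spatial finite differences of the maps.
    gx = torch.mean((pred.diff(dim=-1) - target.diff(dim=-1)) ** 2)
    gy = torch.mean((pred.diff(dim=-2) - target.diff(dim=-2)) ** 2)

    # Beam-wise energy conservation: total fluence per beam should agree.
    e_pred = pred.sum(dim=(-1, -2))
    e_tgt = target.sum(dim=(-1, -2))
    energy = torch.mean((e_pred - e_tgt) ** 2 / (e_tgt ** 2 + 1e-8))

    return w_mse * mse + w_grad * (gx + gy) + w_energy * energy

pred, target = torch.rand(2, 7, 64, 64), torch.rand(2, 7, 64, 64)
print(fluence_aware_loss(pred, target).item())
```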
zh

[CV-154] Bright 4B: Scaling Hyperspherical Learning for Segmentation in 3D Brightfield Microscopy

【速读】: This paper addresses the difficulty of robust volumetric organelle segmentation in label-free 3D brightfield microscopy, where existing pipelines typically rely on fluorescence labels or heavy post-processing. The key is Bright-4B, a 4-billion-parameter foundation model that learns on the unit hypersphere and segments subcellular structures directly from 3D brightfield volumes. Its core innovations are: a hardware-aligned Native Sparse Attention mechanism that captures local, coarse, and selected global context; depth-width residual HyperConnections that stabilize representation flow; a soft Mixture-of-Experts for adaptive capacity; and a plug-and-play anisotropic patch embed that respects the confocal point-spread function and axial thinning, enabling geometry-faithful 3D tokenization. The model produces morphology-accurate segmentations of nuclei, mitochondria, and other organelles without fluorescence, auxiliary channels, or handcrafted post-processing.

链接: https://arxiv.org/abs/2512.22423
作者: Amil Khan,Matheus Palhares Viana,Suraj Mishra,B.S. Manjunath
机构: UC Santa Barbara; Allen Institute for Cell Sciences
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 pages, 15 figures

点击查看摘要

Abstract:Label-free 3D brightfield microscopy offers a fast and noninvasive way to visualize cellular morphology, yet robust volumetric segmentation still typically depends on fluorescence or heavy post-processing. We address this gap by introducing Bright-4B, a 4 billion parameter foundation model that learns on the unit hypersphere to segment subcellular structures directly from 3D brightfield volumes. Bright-4B combines a hardware-aligned Native Sparse Attention mechanism (capturing local, coarse, and selected global context), depth-width residual HyperConnections that stabilize representation flow, and a soft Mixture-of-Experts for adaptive capacity. A plug-and-play anisotropic patch embed further respects confocal point-spread and axial thinning, enabling geometry-faithful 3D tokenization. The resulting model produces morphology-accurate segmentations of nuclei, mitochondria, and other organelles from brightfield stacks alone–without fluorescence, auxiliary channels, or handcrafted post-processing. Across multiple confocal datasets, Bright-4B preserves fine structural detail across depth and cell types, outperforming contemporary CNN and Transformer baselines. All code, pretrained weights, and models for downstream finetuning will be released to advance large-scale, label-free 3D cell mapping.
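A minimal sketch of an anisotropic 3D patch embedding: a Conv3d whose kernel and stride are smaller along the thin axial (z) direction than in-plane, so each token covers a geometry-aware 3D patch. The patch sizes and channel width are assumptions, not Bright-4B's configuration.

```python
import torch

class AnisotropicPatchEmbed(torch.nn.Module):
    """Tokenize a confocal stack with a coarse in-plane but fine axial patch."""
    def __init__(self, in_ch=1, dim=96, patch_xy=16, patch_z=4):
        super().__init__()
        # Axial (z) resolution is thinner, so use a smaller patch along z.
        self.proj = torch.nn.Conv3d(in_ch, dim,
                                    kernel_size=(patch_z, patch_xy, patch_xy),
                                    stride=(patch_z, patch_xy, patch_xy))

    def forward(self, vol: torch.Tensor) -> torch.Tensor:
        # vol: (B, C, Z, H, W) -> tokens: (B, N, dim)
        x = self.proj(vol)                    # (B, dim, Z', H', W')
        return x.flatten(2).transpose(1, 2)   # one token per 3D patch

vol = torch.randn(1, 1, 32, 256, 256)         # toy brightfield stack
print(AnisotropicPatchEmbed()(vol).shape)     # torch.Size([1, 2048, 96])
```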
zh

[CV-155] DeFloMat: Detection with Flow Matching for Stable and Efficient Generative Object Localization

【速读】: This paper addresses the inference-latency bottleneck of diffusion-based detectors such as DiffusionDet, whose many sampling steps (T ≫ 60) make them impractical for time-sensitive clinical use, in particular fast and accurate Crohn's disease detection in Magnetic Resonance Enterography (MRE). The key of the proposed DeFloMat framework is Conditional Flow Matching (CFM): drawing on Conditional Optimal Transport (OT) theory, it constructs a deterministic flow field that approximates Rectified Flow, turning the multi-step stochastic denoising of diffusion models into an Ordinary Differential Equation (ODE) solve that needs only a few steps, which markedly improves inference efficiency and localization stability.

链接: https://arxiv.org/abs/2512.22406
作者: Hansang Lee,Chaelin Lee,Nieun Seo,Joon Seok Lim,Helen Hong
机构: Seoul Women’s University (首尔女子大学); Yonsei University College of Medicine (延世大学医学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose DeFloMat (Detection with Flow Matching), a novel generative object detection framework that addresses the critical latency bottleneck of diffusion-based detectors, such as DiffusionDet, by integrating Conditional Flow Matching (CFM). Diffusion models achieve high accuracy by formulating detection as a multi-step stochastic denoising process, but their reliance on numerous sampling steps (T ≫ 60) makes them impractical for time-sensitive clinical applications like Crohn’s Disease detection in Magnetic Resonance Enterography (MRE). DeFloMat replaces this slow stochastic path with a highly direct, deterministic flow field derived from Conditional Optimal Transport (OT) theory, specifically approximating the Rectified Flow. This shift enables fast inference via a simple Ordinary Differential Equation (ODE) solver. We demonstrate the superiority of DeFloMat on a challenging MRE clinical dataset. Crucially, DeFloMat achieves state-of-the-art accuracy (43.32% AP_{10:50}) in only 3 inference steps, which represents a 1.4× performance improvement over DiffusionDet’s maximum converged performance (31.03% AP_{10:50} at 4 steps). Furthermore, our deterministic flow significantly enhances localization characteristics, yielding superior Recall and stability in the few-step regime. DeFloMat resolves the trade-off between generative accuracy and clinical efficiency, setting a new standard for stable and rapid object localization.
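A minimal rectified-flow sketch on toy vectors: the training target is the straight-line velocity x1 - x0 along a linear interpolation, and sampling is a few deterministic Euler ODE steps. The stand-in MLP and the 4-dimensional vectors are placeholders; DeFloMat applies this idea to detection-box refinement.

```python
import torch

net = torch.nn.Sequential(torch.nn.Linear(5, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 4))   # stand-in velocity field

def rf_loss(x1: torch.Tensor) -> torch.Tensor:
    """Rectified-flow target: v(x_t, t) should predict x1 - x0 along the line."""
    x0 = torch.randn_like(x1)                        # noise endpoint
    t = torch.rand(x1.shape[0], 1)
    xt = (1 - t) * x0 + t * x1                       # linear interpolation
    v = net(torch.cat([xt, t], dim=1))
    return torch.mean((v - (x1 - x0)) ** 2)

@torch.no_grad()
def sample(n: int, steps: int = 3) -> torch.Tensor:
    """Deterministic Euler ODE solve: x <- x + v(x, t) * dt; few steps suffice."""
    x = torch.randn(n, 4)
    for i in range(steps):
        t = torch.full((n, 1), i / steps)
        x = x + net(torch.cat([x, t], dim=1)) / steps
    return x

print(rf_loss(torch.randn(8, 4)).item(), sample(8, steps=3).shape)
```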
zh

[CV-156] OSPointMapper: RealTime Pedestrian and Accessibility Mapping with Mobile AI

【速读】: This paper addresses the high cost, fragmentation, and poor scalability of current sidewalk data collection, which hampers the construction of accessible and inclusive pedestrian infrastructure. The key is iOSPointMapper, a mobile application that uses on-device semantic segmentation, LiDAR-based depth estimation, and fused GPS/IMU data on recent-generation iPhones and iPads to detect and localize sidewalk-relevant features such as traffic signs, traffic lights, and poles in real time; a user-guided annotation interface improves data transparency and quality, and anonymized data is uploaded to the Transportation Data Exchange Initiative (TDEI) for seamless integration with multimodal transportation datasets, offering a scalable, user-centered approach to sidewalk mapping.

链接: https://arxiv.org/abs/2512.22392
作者: Himanshu Naidu,Yuxiang Zhang,Sachin Mehta,Anat Caspi
机构: University of Washington (华盛顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate, up-to-date sidewalk data is essential for building accessible and inclusive pedestrian infrastructure, yet current approaches to data collection are often costly, fragmented, and difficult to scale. We introduce iOSPointMapper, a mobile application that enables real-time, privacy-conscious sidewalk mapping on the ground, using recent-generation iPhones and iPads. The system leverages on-device semantic segmentation, LiDAR-based depth estimation, and fused GPS/IMU data to detect and localize sidewalk-relevant features such as traffic signs, traffic lights and poles. To ensure transparency and improve data quality, iOSPointMapper incorporates a user-guided annotation interface for validating system outputs before submission. Collected data is anonymized and transmitted to the Transportation Data Exchange Initiative (TDEI), where it integrates seamlessly with broader multimodal transportation datasets. Detailed evaluations of the system’s feature detection and spatial mapping performance reveal the application’s potential for enhanced pedestrian mapping. Together, these capabilities offer a scalable and user-centered approach to closing critical data gaps in pedestrian infrastructure.
zh

[CV-157] Self-Evaluation Unlocks Any-Step Text-to-Image Generation

【速读】: This paper addresses the quality degradation of text-to-image models at low step counts, and the limitation that traditional diffusion or Flow Matching models rely on dense local supervision, which makes efficient few-step generation difficult. The key is the Self-Evaluating Model (Self-E), which introduces a novel self-evaluation mechanism: the model evaluates its own generated samples using its current score estimates, effectively acting as a self-teacher that unifies instantaneous local learning with global matching. This requires neither a pretrained teacher nor dense local supervision, allows training from scratch, and delivers strong performance at any number of inference steps, markedly outperforming existing methods at very low step counts (e.g., 1-5 steps) while improving monotonically as steps increase, uniting efficiency with high-quality generation in a single model.

链接: https://arxiv.org/abs/2512.22374
作者: Xin Yu,Xiaojuan Qi,Zhengqi Li,Kai Zhang,Richard Zhang,Zhe Lin,Eli Shechtman,Tianyu Wang,Yotam Nitzan
机构: The University of Hong Kong (香港大学); Adobe Research (Adobe 研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:We introduce the Self-Evaluating Model (Self-E), a novel, from-scratch training approach for text-to-image generation that supports any-step inference. Self-E learns from data similarly to a Flow Matching model, while simultaneously employing a novel self-evaluation mechanism: it evaluates its own generated samples using its current score estimates, effectively serving as a dynamic self-teacher. Unlike traditional diffusion or flow models, it does not rely solely on local supervision, which typically necessitates many inference steps. Unlike distillation-based approaches, it does not require a pretrained teacher. This combination of instantaneous local learning and self-driven global matching bridges the gap between the two paradigms, enabling the training of a high-quality text-to-image model from scratch that excels even at very low step counts. Extensive experiments on large-scale text-to-image benchmarks show that Self-E not only excels in few-step generation, but is also competitive with state-of-the-art Flow Matching models at 50 steps. We further find that its performance improves monotonically as inference steps increase, enabling both ultra-fast few-step generation and high-quality long-trajectory sampling within a single unified model. To our knowledge, Self-E is the first from-scratch, any-step text-to-image model, offering a unified framework for efficient and scalable generation.
zh

[CV-158] VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement

【速读】: This paper addresses the underexplored application of multimodal large language models (MLLMs) to complex 3D scene manipulation, in particular their weak 3D visual grounding, the lack of perceptual feedback about the 3D scene state, and the difficulty of handling errors in multi-step iterative editing. The solution has three key parts: an MCP-based API that replaces brittle raw code manipulation with function-level updates, strengthening the link between visual and programmatic instructions; a suite of specialized visual tools for analyzing scene state, gathering spatial information, and validating action outcomes, forming a perceptual feedback loop that bridges language instructions and precise 3D-aware manipulation; and a collaborative multi-agent framework with designated planning, execution, and verification roles, which robustly manages multi-step tasks and recovers from intermediate errors.

链接: https://arxiv.org/abs/2512.22351
作者: Zhengfei Kuang,Rui Lin,Long Zhao,Gordon Wetzstein,Saining Xie,Sanghyun Woo
机构: Stanford University (斯坦福大学); Google (谷歌); Google DeepMind; New York University (纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the remarkable progress of Multimodal Large Language Models (MLLMs) in 2D vision-language tasks, their application to complex 3D scene manipulation remains underexplored. In this paper, we bridge this critical gap by tackling three key challenges in 3D object arrangement task using MLLMs. First, to address the weak visual grounding of MLLMs, which struggle to link programmatic edits with precise 3D outcomes, we introduce an MCP-based API. This shifts the interaction from brittle raw code manipulation to more robust, function-level updates. Second, we augment the MLLM’s 3D scene understanding with a suite of specialized visual tools to analyze scene state, gather spatial information, and validate action outcomes. This perceptual feedback loop is critical for closing the gap between language-based updates and precise 3D-aware manipulation. Third, to manage the iterative, error-prone updates, we propose a collaborative multi-agent framework with designated roles for planning, execution, and verification. This decomposition allows the system to robustly handle multi-step instructions and recover from intermediate errors. We demonstrate the effectiveness of our approach on a diverse set of 25 complex object arrangement tasks, where it significantly outperforms existing baselines. Website: this http URL
zh
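结合上文的"规划-执行-校验"分工,下面给出一个极简的流程示意(假设性草图:SceneState、plan/execute/verify 等名称、固定的计划步骤与重试次数均为演示用设定,并非论文实现),用于说明函数级编辑 API 配合感知校验闭环如何支持中间错误恢复:

```python
from dataclasses import dataclass, field

@dataclass
class SceneState:
    # 以 {物体名: (x, y, z)} 记录场景,代替真实 3D 引擎
    objects: dict = field(default_factory=dict)

def plan(goal):
    # 规划角色:把目标拆成函数级的编辑步骤(此处写死两步作演示)
    return [("move", "chair", (1.0, 0.0, 0.0)), ("move", "lamp", (1.0, 0.5, 0.0))]

def execute(state, step):
    # 执行角色:通过函数级 API 更新场景,而非直接改写原始代码
    op, name, pos = step
    if op == "move":
        state.objects[name] = pos

def verify(state, step):
    # 校验角色:用"视觉工具"读取场景状态,确认动作结果
    _, name, pos = step
    return state.objects.get(name) == pos

state = SceneState(objects={"chair": (0, 0, 0), "lamp": (0, 0, 0)})
for step in plan("把椅子和台灯排到桌子右侧"):
    for attempt in range(3):          # 失败时重试,模拟中间错误恢复
        execute(state, step)
        if verify(state, step):
            break
print(state.objects)
```

真实系统中 plan 与 verify 分别由 MLLM 智能体和视觉工具承担,这里用确定性函数占位。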

[CV-159] Human-like visual computing advances explainability and few-shot learning in deep neural networks for complex physiological data

【速读】:该论文旨在解决深度神经网络在生理信号(如心电图ECG)分析中普遍存在的两个关键问题:一是数据效率低下,即模型通常需要大量标注数据才能训练;二是缺乏可解释性,难以揭示预测背后的因果特征,从而限制了其临床可靠性与人类认知的一致性。解决方案的关键在于引入一种受人类感知启发的伪彩色编码技术(perception-informed pseudo-colouring),将临床相关的时序特征(如QT间期长度)转化为结构化的颜色表示,从而增强模型对关键特征的判别能力和解释性。实验表明,该方法可在仅需1至5个训练样本的情况下实现少样本学习(few-shot learning),且通过注意力机制引导模型聚焦于临床有意义的ECG特征,抑制无关噪声,显著提升模型在极端数据稀缺场景下的泛化性能与可解释性。

链接: https://arxiv.org/abs/2512.22349
作者: Alaa Alahmadi,Mohamed Hasan
机构: Newcastle University (纽卡斯尔大学); University of Leeds (利兹大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Machine vision models, particularly deep neural networks, are increasingly applied to physiological signal interpretation, including electrocardiography (ECG), yet they typically require large training datasets and offer limited insight into the causal features underlying their predictions. This lack of data efficiency and interpretability constrains their clinical reliability and alignment with human reasoning. Here, we show that a perception-informed pseudo-colouring technique, previously demonstrated to enhance human ECG interpretation, can improve both explainability and few-shot learning in deep neural networks analysing complex physiological data. We focus on acquired, drug-induced long QT syndrome (LQTS) as a challenging case study characterised by heterogeneous signal morphology, variable heart rate, and scarce positive cases associated with life-threatening arrhythmias such as torsades de pointes. This setting provides a stringent test of model generalisation under extreme data scarcity. By encoding clinically salient temporal features, such as QT-interval duration, into structured colour representations, models learn discriminative and interpretable features from as few as one or five training examples. Using prototypical networks and a ResNet-18 architecture, we evaluate one-shot and few-shot learning on ECG images derived from single cardiac cycles and full 10-second rhythms. Explainability analyses show that pseudo-colouring guides attention toward clinically meaningful ECG features while suppressing irrelevant signal components. Aggregating multiple cardiac cycles further improves performance, mirroring human perceptual averaging across heartbeats. Together, these findings demonstrate that human-like perceptual encoding can bridge data efficiency, explainability, and causal reasoning in medical machine intelligence.
zh
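下面用 NumPy 给出"感知启发的伪彩色编码"思路的最小示意(假设性草图:配色方案与 R 波位置均为示例,并非论文实现):在每个心动周期内把"距 R 波的归一化时间"映射为红到蓝的渐变,使 QT 间期这类时序特征成为显式的颜色结构:

```python
import numpy as np

def pseudo_colour(sig, r_peaks):
    """把一维 ECG 信号按"距上一 R 波的归一化时间"映射为逐样本 RGB 颜色。
    仅为示意:红->蓝渐变使 QT 间期等时序特征成为显式的颜色结构。"""
    n = len(sig)
    phase = np.zeros(n)
    for a, b in zip(r_peaks[:-1], r_peaks[1:]):
        phase[a:b] = np.linspace(0.0, 1.0, b - a)   # 每个心动周期内 0->1
    rgb = np.stack([1.0 - phase, np.zeros(n), phase], axis=-1)  # R->B 渐变
    return rgb  # 形状 (n, 3),可叠加到波形渲染上作为模型输入

sig = np.sin(np.linspace(0, 6 * np.pi, 300))        # 玩具信号
print(pseudo_colour(sig, r_peaks=[0, 100, 200, 299]).shape)
```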

[CV-160] Feature Learning with Multi-Stage Vision Transformers on Inter-Modality HER2 Status Scoring and Tumor Classification on Whole Slides

【速读】:该论文旨在解决乳腺癌中人类表皮生长因子受体2(HER2)蛋白表达水平的精准预测问题,尤其是在仅依赖苏木精-伊红(HE)染色病理图像时难以准确识别HER2状态的挑战,以及联合分析HE与免疫组化(IHC)图像进行HER2评分的困难。其解决方案的关键在于提出了一种端到端的视觉Transformer(Vision Transformer, ViT)架构,通过分块处理HE全切片图像(WSI)实现肿瘤区域定位,并设计了一个新颖的映射函数以关联HE与IHC图像中的对应恶性区域;同时嵌入一种临床启发式的HER2评分机制,支持自动化的像素级4分类评分(0、1+、2+、3+),从而实现对HER2阴性和阳性状态的高精度判别。实验表明,该方法在HER2状态预测上达到了0.94的分类准确率和0.933的特异性,验证了多模态图像联合建模在HER2评分中的有效性。

链接: https://arxiv.org/abs/2512.22335
作者: Olaide N. Oyelade,Oliver Hoxey,Yulia Humrye
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The popular use of histopathology images, such as hematoxylin and eosin (HE), has proven to be useful in detecting tumors. However, moving such cancer cases forward for treatment requires an accurate estimate of the amount of human epidermal growth factor receptor 2 (HER2) protein expression. Predicting both the lower and higher levels of HER2 can be challenging. Moreover, jointly analyzing HE and immunohistochemistry (IHC) stained images for HER2 scoring is difficult. Although several deep learning methods have been investigated to address the challenge of HER2 scoring, they fall short of providing pixel-level localization of HER2 status. In this study, we propose a single end-to-end pipeline using a system of vision transformers with HER2 status scoring on whole slide images (WSIs). The method includes patch-wise processing of HE WSIs for tumor localization. A novel mapping function is proposed to identify IHC WSI regions corresponding to malignant regions on HE. A clinically inspired HER2 scoring mechanism is embedded in the pipeline and allows for automatic pixel-level annotation of 4-way HER2 scoring (0, 1+, 2+, and 3+). The proposed method also accurately returns HER2-negative and HER2-positive status. Privately curated datasets were collaboratively extracted from 13 different cases of HE and IHC WSIs. A thorough experiment is conducted on the proposed method. Results show good classification accuracy for tumor localization, along with a classification accuracy of 0.94 and a specificity of 0.933 for the prediction of HER2 status under the 4-way scoring scheme. The applicability of the proposed pipeline was investigated on WSI patches, with performance comparable to human pathologists. Findings from the study show the usability of jointly evaluating HE and IHC images in end-to-end ViT-based models for HER2 scoring.
zh

[CV-161] The Multi-View Paradigm Shift in MRI Radiomics: Predicting MGMT Methylation in Glioblastoma

【速读】:该论文旨在解决从医学影像中非侵入性推断胶质母细胞瘤(GBM)分子特征的问题,特别是针对O6-甲基鸟嘌呤-DNA甲基转移酶(MGMT)启动子甲基化状态的分类任务。传统基于影像组学的机器学习方法在处理多模态MRI数据(如对比增强T1加权成像T1Gd和液体衰减反转恢复FLAIR)时,常受限于特征冗余高及未能充分建模各模态特异性信息。其解决方案的关键在于提出一种基于变分自编码器(VAE)的多视图潜在表示学习框架,通过独立的概率编码器分别提取每种模态的特征,并在紧凑的潜在空间中进行融合,从而保留各模态结构信息的同时实现有效的多模态整合,最终用于提升MGMT启动子甲基化状态的分类性能。

链接: https://arxiv.org/abs/2512.22331
作者: Mariya Miteva,Maria Nisheva-Pavlova
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 3 figures

点击查看摘要

Abstract:Non-invasive inference of molecular tumor characteristics from medical imaging is a central goal of radiogenomics, particularly in glioblastoma (GBM), where O6-methylguanine-DNA methyltransferase (MGMT) promoter methylation carries important prognostic and therapeutic significance. Although radiomics-based machine learning methods have shown promise for this task, conventional unimodal and early-fusion approaches are often limited by high feature redundancy and an incomplete modeling of modality-specific information. In this work, we introduce a multi-view latent representation learning framework based on variational autoencoders (VAE) to integrate complementary radiomic features derived from post-contrast T1-weighted (T1Gd) and Fluid-Attenuated Inversion Recovery (FLAIR) magnetic resonance imaging (MRI). By encoding each modality through an independent probabilistic encoder and performing fusion in a compact latent space, the proposed approach preserves modality-specific structure while enabling effective multimodal integration. The resulting latent embeddings are subsequently used for MGMT promoter methylation classification.
zh
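下面是"每模态一个概率编码器 + 紧凑潜空间融合 + 分类头"这一结构的 PyTorch 最小示意(假设性草图:省略了 VAE 的重建解码器,层宽、潜变量维度与输入维度均为示例,并非论文实现):

```python
import torch, torch.nn as nn

class ViewEncoder(nn.Module):
    # 单模态概率编码器:输出该模态潜变量的均值与对数方差
    def __init__(self, in_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.mu, self.logvar = nn.Linear(64, z_dim), nn.Linear(64, z_dim)
    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.logvar(h)

def reparam(mu, logvar):
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

class MultiViewVAEClassifier(nn.Module):
    def __init__(self, d_t1, d_flair, z_dim=16):
        super().__init__()
        self.enc_t1 = ViewEncoder(d_t1, z_dim)     # T1Gd 影像组学特征
        self.enc_fl = ViewEncoder(d_flair, z_dim)  # FLAIR 影像组学特征
        self.head = nn.Linear(2 * z_dim, 2)        # 在拼接后的潜空间做 MGMT 二分类
    def forward(self, x_t1, x_fl):
        mu1, lv1 = self.enc_t1(x_t1)
        mu2, lv2 = self.enc_fl(x_fl)
        z = torch.cat([reparam(mu1, lv1), reparam(mu2, lv2)], dim=-1)
        kl = -0.5 * (1 + lv1 - mu1**2 - lv1.exp()).sum(-1).mean() \
             -0.5 * (1 + lv2 - mu2**2 - lv2.exp()).sum(-1).mean()
        return self.head(z), kl

model = MultiViewVAEClassifier(d_t1=100, d_flair=100)
logits, kl = model(torch.randn(4, 100), torch.randn(4, 100))
print(logits.shape, float(kl))
```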

[CV-162] DeMoGen: Towards Decompositional Human Motion Generation with Energy-Based Diffusion Models

【速读】:该论文旨在解决复杂人体动作的可解释性分解问题,即如何从整体动作中自动识别并分离出语义明确的运动基元(motion primitives),而非仅依赖于正向建模(如从文本到动作的端到端映射)。其解决方案的关键在于提出了一种基于能量的扩散模型(energy-based diffusion model)驱动的分解训练范式 DeMoGen,该范式通过显式建模多个运动概念的组合分布,无需依赖每个基元的标注真值即可发现语义一致的子动作组件。具体而言,DeMoGen 采用三种训练变体:DeMoGen-Exp 显式利用分解后的文本提示进行训练,DeMoGen-OSS 实现正交自监督分解,DeMoGen-SC 强制原始与分解文本嵌入之间的语义一致性,从而实现对可复用运动基元的有效解耦,并支持灵活重组生成多样化新动作,显著提升模型在训练分布外的泛化能力。

链接: https://arxiv.org/abs/2512.22324
作者: Jianrong Zhang,Hehe Fan,Yi Yang
机构: ReLER, AAII, University of Technology Sydney (悉尼科技大学); CCAI, Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Human motions are compositional: complex behaviors can be described as combinations of simpler primitives. However, existing approaches primarily focus on forward modeling, e.g., learning holistic mappings from text to motion or composing a complex motion from a set of motion concepts. In this paper, we consider the inverse perspective: decomposing a holistic motion into semantically meaningful sub-components. We propose DeMoGen, a compositional training paradigm for decompositional learning that employs an energy-based diffusion model. This energy formulation directly captures the composed distribution of multiple motion concepts, enabling the model to discover them without relying on ground-truth motions for individual concepts. Within this paradigm, we introduce three training variants to encourage a decompositional understanding of motion: 1. DeMoGen-Exp explicitly trains on decomposed text prompts; 2. DeMoGen-OSS performs orthogonal self-supervised decomposition; 3. DeMoGen-SC enforces semantic consistency between original and decomposed text embeddings. These variants enable our approach to disentangle reusable motion primitives from complex motion sequences. We also demonstrate that the decomposed motion concepts can be flexibly recombined to generate diverse and novel motions, generalizing beyond the training distribution. Additionally, we construct a text-decomposed dataset to support compositional training, serving as an extended resource to facilitate text-to-motion generation and motion composition.
zh

[CV-163] SpotEdit: Selective Region Editing in Diffusion Transformers

【速读】:该论文旨在解决扩散模型在图像编辑过程中存在的冗余计算问题——即当前方法对所有图像区域进行统一去噪处理,即使仅需修改局部区域,也会导致不必要的计算开销,并可能影响未修改区域的保真度。其解决方案的关键在于提出了一种无需训练的编辑框架SpotEdit,核心创新包括:1)SpotSelector通过感知相似性识别稳定区域并跳过计算,复用条件图像特征;2)SpotFusion设计动态融合机制,自适应地将这些特征与编辑后的token结合,从而在保持上下文一致性的同时提升编辑精度和效率。

链接: https://arxiv.org/abs/2512.22323
作者: Zhibin Qin,Zhenxiong Tan,Zeqing Wang,Songhua Liu,Xinchao Wang
机构: National University of Singapore (新加坡国立大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages, 8 figures

点击查看摘要

Abstract:Diffusion Transformer models have significantly advanced image editing by encoding conditional images and integrating them into transformer layers. However, most edits involve modifying only small regions, while current methods uniformly process and denoise all tokens at every timestep, causing redundant computation and potentially degrading unchanged areas. This raises a fundamental question: Is it truly necessary to regenerate every region during editing? To address this, we propose SpotEdit, a training-free diffusion editing framework that selectively updates only the modified regions. SpotEdit comprises two key components: SpotSelector identifies stable regions via perceptual similarity and skips their computation by reusing conditional image features; SpotFusion adaptively blends these features with edited tokens through a dynamic fusion mechanism, preserving contextual coherence and editing quality. By reducing unnecessary computation and maintaining high fidelity in unmodified areas, SpotEdit achieves efficient and precise image editing.
zh
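SpotSelector 与 SpotFusion 的核心逻辑可以用几行张量操作来示意(假设性草图:余弦相似度阈值与融合权重均为示例参数,真实方法在扩散时间步内对 DiT token 逐层执行):

```python
import torch
import torch.nn.functional as F

def spot_select(cond_tokens, cur_tokens, thresh=0.95):
    # 按 token 余弦相似度判定"稳定区域":相似度高的 token 跳过重新计算
    sim = F.cosine_similarity(cond_tokens, cur_tokens, dim=-1)  # (B, N)
    return sim > thresh            # True = 稳定,可复用条件图特征

def spot_fuse(cond_tokens, edited_tokens, stable_mask, alpha=0.8):
    # 动态融合:稳定 token 以较大权重复用条件特征,编辑 token 原样保留
    w = stable_mask.unsqueeze(-1).float() * alpha
    return w * cond_tokens + (1.0 - w) * edited_tokens

B, N, D = 1, 16, 8
cond, edited = torch.randn(B, N, D), torch.randn(B, N, D)
edited[:, :12] = cond[:, :12] + 0.01 * torch.randn(B, 12, D)  # 大部分区域未被编辑
mask = spot_select(cond, edited)
out = spot_fuse(cond, edited, mask)
print(mask.float().mean().item(), out.shape)
```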

[CV-164] LangPrecip: Language-Aware Multimodal Precipitation Nowcasting

【速读】:该论文旨在解决短时降水临近预报(short-term precipitation nowcasting)中存在的不确定性高、时空约束弱的问题,尤其是在快速演变和极端天气事件中,传统生成式方法主要依赖视觉条件(如雷达图像),导致未来运动轨迹缺乏明确的语义约束,从而产生模糊预测。其解决方案的关键在于提出一种语言感知的多模态临近预报框架(LangPrecip),将气象文本描述作为降水演变的语义运动约束,并在修正流(Rectified Flow)框架下将临近预报建模为一个语义约束的轨迹生成问题,从而实现文本与雷达信息在潜在空间中的高效且物理一致的融合。

链接: https://arxiv.org/abs/2512.22317
作者: Xudong Ling,Tianxi Huang,Qian Dong,Tao He,Chaorong Li,Guiduo Duan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Short-term precipitation nowcasting is an inherently uncertain and under-constrained spatiotemporal forecasting problem, especially for rapidly evolving and extreme weather events. Existing generative approaches rely primarily on visual conditioning, leaving future motion weakly constrained and ambiguous. We propose a language-aware multimodal nowcasting framework (LangPrecip) that treats meteorological text as a semantic motion constraint on precipitation evolution. By formulating nowcasting as a semantically constrained trajectory generation problem under the Rectified Flow paradigm, our method enables efficient and physically consistent integration of textual and radar information in latent space. We further introduce LangPrecip-160k, a large-scale multimodal dataset with 160k paired radar sequences and motion descriptions. Experiments on Swedish and MRMS datasets show consistent improvements over state-of-the-art methods, achieving over 60% and 19% gains in heavy-rainfall CSI at an 80-minute lead time.
zh
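下面给出"以文本嵌入为条件的 Rectified Flow 训练步骤"的最小示意(假设性草图:网络结构与各维度均为玩具设置,仅说明在噪声与数据的直线插值上回归恒定速度场这一核心目标):

```python
import torch, torch.nn as nn

class TinyVelocityNet(nn.Module):
    # 以文本嵌入为条件的速度场网络(示意;真实模型为时空骨干网)
    def __init__(self, d_lat, d_txt):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_lat + d_txt + 1, 128),
                                 nn.SiLU(), nn.Linear(128, d_lat))
    def forward(self, x_t, t, txt):
        return self.net(torch.cat([x_t, txt, t], dim=-1))

def rectified_flow_loss(model, x0, x1, txt):
    # Rectified Flow:在噪声 x0 与数据 x1 的直线插值上回归恒定速度 x1 - x0
    t = torch.rand(x0.size(0), 1)
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    return ((model(x_t, t, txt) - v_target) ** 2).mean()

model = TinyVelocityNet(d_lat=32, d_txt=16)
x1 = torch.randn(8, 32)                      # 未来雷达帧的潜表示(示意)
x0 = torch.randn_like(x1)                    # 先验噪声
txt = torch.randn(8, 16)                     # 气象文本嵌入,作为语义运动约束
print(rectified_flow_loss(model, x0, x1, txt).item())
```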

[CV-165] VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在长视频理解任务中因上下文窗口有限而导致的性能瓶颈问题。现有方法通常依赖于均匀采样或静态预选帧,难以捕捉关键视觉证据,且无法在推理过程中修正初始选择错误。解决方案的关键在于提出 VideoZoomer——一种新型代理框架,使 MLLMs 能够在推理过程中动态控制视觉关注区域:从低帧率粗粒度概览开始,通过自主调用时间缩放工具获取高帧率片段,以多轮交互方式逐步收集细粒度证据。该框架结合两阶段训练策略(冷启动监督微调与强化学习优化代理策略),显著提升了长视频理解能力,在多个基准测试中超越开源模型并逼近商用系统性能,同时在减少帧预算下实现更高效率。

链接: https://arxiv.org/abs/2512.22315
作者: Yang Ding,Yizhen Zhang,Xin Lai,Ruihang Chu,Yujiu Yang
机构: Tsinghua University (清华大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language tasks yet remain limited in long video understanding due to the limited context window. Consequently, prevailing approaches tend to rely on uniform frame sampling or static pre-selection, which might overlook critical evidence and unable to correct its initial selection error during its reasoning process. To overcome these limitations, we propose VideoZoomer, a novel agentic framework that enables MLLMs to dynamically control their visual focus during reasoning. Starting from a coarse low-frame-rate overview, VideoZoomer invokes a temporal zoom tool to obtain high-frame-rate clips at autonomously chosen moments, thereby progressively gathering fine-grained evidence in a multi-turn interactive manner. Accordingly, we adopt a two-stage training strategy: a cold-start supervised fine-tuning phase on a curated dataset of distilled exemplar and reflection trajectories, followed by reinforcement learning to further refine the agentic policy. Extensive experiments demonstrate that our 7B model delivers diverse and complex reasoning patterns, yielding strong performance across a broad set of long video understanding and reasoning benchmarks. These emergent capabilities allow it to consistently surpass existing open-source models and even rival proprietary systems on challenging tasks, while achieving superior efficiency under reduced frame budgets.
zh

[CV-166] MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation AAAI2026

【速读】:该论文旨在解决多主体视频生成(multi-subject video generation)中的两个核心问题:尺度不一致性和排列敏感性。尺度不一致性指参考图像中主体尺寸差异导致生成视频中主体比例失真;排列敏感性则表现为输入参考图像顺序变化时,生成结果出现主体错位或扭曲。解决方案的关键在于提出一个统一框架MoFu,其中包含两个核心模块:一是基于大语言模型(LLM)引导的尺度感知调制(Scale-Aware Modulation, SMO),通过从文本提示中提取隐式尺度线索并调节特征以保持主体尺度一致性;二是傅里叶融合策略(Fourier Fusion),利用快速傅里叶变换(Fast Fourier Transform)处理参考特征的频率信息,生成对输入顺序不变的统一表示。此外,设计了尺度-排列稳定性损失(Scale-Permutation Stability Loss)联合优化上述特性,显著提升了生成视频的尺度保真度与结构稳定性。

链接: https://arxiv.org/abs/2512.22310
作者: Run Ling,Ke Cao,Jian Lu,Ao Ma,Haowei Liu,Runze He,Changwei Wang,Rongtao Xu,Yihua Shao,Zhanjie Zhang,Peng Wu,Guibing Guo,Wei Feng,Zheng Zhang,Jingjing Lv,Junjie Shen,Ching Law,Xingwei Wang
机构: 1. Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); 2. University of Chinese Academy of Sciences (中国科学院大学); 3. Alibaba Group (阿里巴巴集团); 4. Tsinghua University (清华大学); 5. Baidu Research (百度研究院); 6. Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: AAAI 2026

点击查看摘要

Abstract:Multi-subject video generation aims to synthesize videos from textual prompts and multiple reference images, ensuring that each subject preserves natural scale and visual fidelity. However, current methods face two challenges: scale inconsistency, where variations in subject size lead to unnatural generation, and permutation sensitivity, where the order of reference inputs causes subject distortion. In this paper, we propose MoFu, a unified framework that tackles both challenges. For scale inconsistency, we introduce Scale-Aware Modulation (SMO), an LLM-guided module that extracts implicit scale cues from the prompt and modulates features to ensure consistent subject sizes. To address permutation sensitivity, we present a simple yet effective Fourier Fusion strategy that processes the frequency information of reference features via the Fast Fourier Transform to produce a unified representation. Besides, we design a Scale-Permutation Stability Loss to jointly encourage scale-consistent and permutation-invariant generation. To further evaluate these challenges, we establish a dedicated benchmark with controlled variations in subject scale and reference permutation. Extensive experiments demonstrate that MoFu significantly outperforms existing methods in preserving natural scale, subject fidelity, and overall visual quality.
zh
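傅里叶融合的排列不变性可以用一个小例子验证(假设性草图:这里仅演示"FFT 后沿参考维取均值"这种最简单的顺序无关聚合,论文中的融合策略更复杂):

```python
import torch

def fourier_fusion(ref_feats):
    """对 N 个参考主体特征做排列不变融合(示意)。
    ref_feats: (N, C, H, W)。对每个特征做 2D FFT 后在 N 维取均值,
    均值对输入顺序不敏感,因此融合结果与参考图排列无关。"""
    spec = torch.fft.fft2(ref_feats, dim=(-2, -1))   # 频域表示
    fused_spec = spec.mean(dim=0)                    # 排列不变的聚合
    return torch.fft.ifft2(fused_spec, dim=(-2, -1)).real

feats = torch.randn(3, 8, 16, 16)                    # 3 个参考主体的特征
perm = feats[torch.tensor([2, 0, 1])]                # 打乱输入顺序
print(torch.allclose(fourier_fusion(feats), fourier_fusion(perm), atol=1e-5))
```

由于均值这类对称聚合本身与顺序无关,换序后的融合结果应在数值误差范围内完全一致,这正是缓解排列敏感性的出发点。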

[CV-167] PortionNet: Distilling 3D Geometric Knowledge for Food Nutrition Estimation

【速读】:该论文旨在解决从单张图像中准确估算食物营养成分的难题,其核心挑战在于图像丢失了三维空间信息。为克服这一问题,作者提出PortionNet框架,其关键创新在于采用跨模态知识蒸馏机制:在训练阶段利用点云(point cloud)数据学习几何特征,而在推理阶段仅需RGB图像即可实现伪三维推理。该方法通过轻量级适配器网络模拟点云表征,无需依赖深度传感器等专用硬件,从而在MetaFood3D数据集上达到最优的体积与能量估算性能,并在SimpleFood45数据集上展现出良好的泛化能力。

链接: https://arxiv.org/abs/2512.22304
作者: Darrin Bright,Rakshith Raj,Kanchan Keisham
机构: Vellore Institute of Technology (维洛尔理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the 11th Annual Conference on Vision and Intelligent Systems (CVIS 2025)

点击查看摘要

Abstract:Accurate food nutrition estimation from single images is challenging due to the loss of 3D information. While depth-based methods provide reliable geometry, they remain inaccessible on most smartphones because of depth-sensor requirements. To overcome this challenge, we propose PortionNet, a novel cross-modal knowledge distillation framework that learns geometric features from point clouds during training while requiring only RGB images at inference. Our approach employs a dual-mode training strategy where a lightweight adapter network mimics point cloud representations, enabling pseudo-3D reasoning without any specialized hardware requirements. PortionNet achieves state-of-the-art performance on MetaFood3D, outperforming all previous methods in both volume and energy estimation. Cross-dataset evaluation on SimpleFood45 further demonstrates strong generalization in energy estimation.
zh
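跨模态特征蒸馏的核心是一个"RGB 适配器模仿点云特征"的匹配损失,可示意如下(假设性草图:编码器、适配器与维度均为占位设定,并非论文实现):

```python
import torch, torch.nn as nn
import torch.nn.functional as F

pc_encoder = nn.Sequential(nn.Linear(3 * 256, 128), nn.ReLU(),
                           nn.Linear(128, 64))      # 点云教师分支(仅训练期可用)
rgb_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128),
                            nn.ReLU(), nn.Linear(128, 64))
adapter = nn.Linear(64, 64)                          # 轻量适配器:模仿点云表征
regressor = nn.Linear(64, 2)                         # 预测体积与能量(示意)

img = torch.rand(4, 3, 32, 32)
pts = torch.rand(4, 256, 3).flatten(1)               # 配对点云,仅训练时提供

f_rgb = rgb_encoder(img)
f_pc = pc_encoder(pts).detach()                      # 教师特征不回传梯度
loss_mimic = F.mse_loss(adapter(f_rgb), f_pc)        # 跨模态蒸馏项
loss_task = F.mse_loss(regressor(adapter(f_rgb)), torch.rand(4, 2))
print((loss_mimic + loss_task).item())               # 推理时只需 RGB 分支
```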

[CV-168] Attack-Aware Deepfake Detection under Counter-Forensic Manipulations

【速读】:该论文旨在解决深度伪造(deepfake)检测模型在真实部署环境中面临的鲁棒性不足、概率校准不佳以及缺乏可解释性证据的问题。其核心挑战在于模型需同时应对多种对抗性攻击(如重压缩、重采样、去噪再纹理化等),并提供可靠且可操作的篡改定位热图(tamper heatmap)。解决方案的关键在于提出一种双流架构:一通道利用预训练主干网络提取语义内容特征,另一通道提取取证残差(forensic residuals),并通过轻量级残差适配器融合特征进行分类;同时引入红队训练(red-team training)机制,在每批次中从K种对抗操作(包括JPEG重对齐、色彩和伽马偏移、社交应用转码等)中选取最恶劣者(worst-of-K)参与训练,并在测试时采用低成本随机扰动(如缩放、裁剪相位变化、轻微伽马调整)作为防御策略,从而提升模型在复杂攻击下的稳定性与校准精度。此外,通过弱监督方式生成聚焦于人脸区域的篡改热图,无需像素级标注即可实现精准定位,为实际应用提供了模块化、数据高效且具备可信度的检测基线。

链接: https://arxiv.org/abs/2512.22303
作者: Noor Fatima,Hasan Faraz Khan,Muzammil Behzad
机构: King Fahd University of Petroleum and Minerals (KFUPM)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This work presents an attack-aware deepfake and image-forensics detector designed for robustness, well-calibrated probabilities, and transparent evidence under realistic deployment conditions. The method combines red-team training with randomized test-time defense in a two-stream architecture, where one stream encodes semantic content using a pretrained backbone and the other extracts forensic residuals, fused via a lightweight residual adapter for classification, while a shallow Feature Pyramid Network style head produces tamper heatmaps under weak supervision. Red-team training applies worst-of-K counter-forensics per batch, including JPEG realign and recompress, resampling warps, denoise-to-regrain operations, seam smoothing, small color and gamma shifts, and social-app transcodes, while test-time defense injects low-cost jitters such as resize and crop phase changes, mild gamma variation, and JPEG phase shifts with aggregated predictions. Heatmaps are guided to concentrate within face regions using face-box masks without strict pixel-level annotations. Evaluation on existing benchmarks, including standard deepfake datasets and a surveillance-style split with low light and heavy compression, reports clean and attacked performance, AUC, worst-case accuracy, reliability, abstention quality, and weak-localization scores. Results demonstrate near-perfect ranking across attacks, low calibration error, minimal abstention risk, and controlled degradation under regrain, establishing a modular, data-efficient, and practically deployable baseline for attack-aware detection with calibrated probabilities and actionable heatmaps.
zh
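worst-of-K 红队训练的骨架逻辑如下(假设性草图:三种候选攻击只是粗糙近似,模型为占位线性分类器,均非论文实现):

```python
import torch, torch.nn as nn
import torch.nn.functional as F

def jpeg_like(x):   return (x * 8).round() / 8            # 粗糙量化,模拟再压缩
def gamma_shift(x): return x.clamp(0, 1) ** 1.2           # 轻微伽马偏移
def resize_jitter(x):
    small = F.interpolate(x, scale_factor=0.5, mode="bilinear")
    return F.interpolate(small, size=x.shape[-2:], mode="bilinear")

ATTACKS = [jpeg_like, gamma_shift, resize_jitter]          # 候选反取证操作(示意)

def worst_of_k(model, x, y):
    # 红队训练核心:对每个 batch 试 K 种攻击,取损失最大者参与训练
    losses = []
    with torch.no_grad():
        for atk in ATTACKS:
            losses.append(F.cross_entropy(model(atk(x)), y))
    worst = ATTACKS[int(torch.stack(losses).argmax())]
    return F.cross_entropy(model(worst(x)), y)             # 对最坏攻击反向传播

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 2, (4,))
print(worst_of_k(model, x, y).item())
```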

[CV-169] Real-Time In-Cabin Driver Behavior Recognition on Low-Cost Edge Hardware

【速读】:该论文旨在解决车内驾驶员监控系统(Driver Monitoring System, DMS)在计算资源、功耗和成本严格受限条件下,实现低延迟、高精度的分心与困倦行为识别问题。解决方案的关键在于提出一个端到端的单摄像头行为识别流水线,包含三个核心组件:(i) 一种轻量级逐帧视觉模型以降低计算开销;(ii) 一种考虑混杂因素(confounder-aware)的标签设计策略,有效减少因视觉相似性导致的误报;(iii) 一种时序决策头机制,仅在预测结果既置信又持续时才触发警报,从而提升系统稳定性与实用性。该方案在低成本边缘平台(Raspberry Pi 5 和 Google Coral Edge TPU)上实现了约16 FPS(INT8量化推理)和25 FPS的实时性能,验证了其在真实车载环境中的可行性与鲁棒性。

链接: https://arxiv.org/abs/2512.22298
作者: Vesal Ahsani,Babak Hossein Khalaj
机构: Sharif University of Technology (谢里夫理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 14 pages, 2 figures, 4 tables

点击查看摘要

Abstract:In-cabin Driver Monitoring Systems (DMS) must recognize distraction- and drowsiness-related behaviors with low latency under strict constraints on compute, power, and cost. We present a single-camera in-cabin driver behavior recognition system designed for deployment on two low-cost edge platforms: Raspberry Pi 5 (CPU-only) and Google Coral Edge TPU. The proposed pipeline combines (i) a compact per-frame vision model, (ii) a confounder-aware label design to reduce visually similar false positives, and (iii) a temporal decision head that triggers alerts only when predictions are both confident and sustained. The system covers 17 behavior classes, including multiple phone-use modes, eating/drinking, smoking, reaching behind, gaze/attention shifts, passenger interaction, grooming, control-panel interaction, yawning, and eyes-closed sleep. Training and evaluation use licensed datasets spanning diverse drivers, vehicles, and lighting conditions (details in Section 6), and we further validate runtime behavior in real in-vehicle tests. The optimized deployments achieve about 16 FPS on Raspberry Pi 5 with INT8 inference (per-frame latency under 60 ms) and about 25 FPS on Coral Edge TPU, enabling real-time monitoring and stable alert generation on inexpensive hardware. Finally, we discuss how reliable in-cabin human-state perception can serve as an upstream input for human-centered vehicle intelligence, including emerging agentic vehicle concepts.
zh
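"既置信又持续才报警"的时序决策头可以实现得非常轻量,例如(假设性草图:置信度阈值、窗口长度与命中次数均为示例参数):

```python
from collections import deque

class TemporalDecisionHead:
    """仅当某行为在最近 window 帧里有足够多帧高置信出现时才报警(示意)。"""
    def __init__(self, conf_thresh=0.7, window=15, min_hits=10):
        self.conf_thresh, self.min_hits = conf_thresh, min_hits
        self.history = deque(maxlen=window)

    def update(self, prob):
        # prob: 当前帧某行为(如"手持手机")的置信度
        self.history.append(prob >= self.conf_thresh)
        return sum(self.history) >= self.min_hits   # True = 触发报警

head = TemporalDecisionHead()
stream = [0.2] * 5 + [0.9] * 12 + [0.3] * 3          # 模拟逐帧置信度
alerts = [head.update(p) for p in stream]
print(alerts.index(True))   # 首次报警出现在高置信持续若干帧之后
```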

[CV-170] Learning Dynamic Scene Reconstruction with Sinusoidal Geometric Priors

【速读】:该论文旨在解决动态三维场景重建中运动建模精度不足以及在快速运动和多目标场景下时空一致性难以保持的问题。解决方案的关键在于提出一种名为SirenPose的新损失函数,该函数结合了正弦表示网络(Sinusoidal Representation Networks)的周期性激活特性与基于关键点结构的几何先验信息,通过引入受物理启发的约束机制,在空间和时间维度上强制关键点预测的一致性,从而提升重建精度与稳定性。

链接: https://arxiv.org/abs/2512.22295
作者: Tian Guo,Hui Yuan,Philip Xu,David Elizondo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose SirenPose, a novel loss function that combines the periodic activation properties of sinusoidal representation networks with geometric priors derived from keypoint structures to improve the accuracy of dynamic 3D scene reconstruction. Existing approaches often struggle to maintain motion modeling accuracy and spatiotemporal consistency in fast moving and multi target scenes. By introducing physics inspired constraint mechanisms, SirenPose enforces coherent keypoint predictions across both spatial and temporal dimensions. We further expand the training dataset to 600,000 annotated instances to support robust learning. Experimental results demonstrate that models trained with SirenPose achieve significant improvements in spatiotemporal consistency metrics compared to prior methods, showing superior performance in handling rapid motion and complex scene changes.
zh

[CV-171] A Three-Level Alignment Framework for Large-Scale 3D Retrieval and Controlled 4D Generation

【速读】:该论文旨在解决大规模开放词汇3D检索与可控4D生成中的跨模态对齐难题,即如何在文本、3D模型和图像模态之间实现结构化的多层级语义对齐,从而提升3D检索精度与4D生成的时序一致性。其解决方案的关键在于提出Uni4D框架,通过三层对齐机制——精确的文本到3D检索、多视角3D到图像对齐以及图像到文本对齐——并结合基于Align3D 130数据集训练的3D文本多头注意力与搜索模型,显著优化了跨模态语义匹配能力,从而实现了高质量的3D检索与可控的4D生成。

链接: https://arxiv.org/abs/2512.22294
作者: Philip Xu,David Elizondo,Raouf Hamzaoui
机构: De Montfort University (德蒙福特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce Uni4D, a unified framework for large-scale open-vocabulary 3D retrieval and controlled 4D generation based on structured three-level alignment across text, 3D models, and image modalities. Built upon the Align3D 130 dataset, Uni4D employs a 3D-text multi-head attention and search model to optimize text-to-3D retrieval through improved semantic alignment. The framework further strengthens cross-modal alignment through three components: precise text-to-3D retrieval, multi-view 3D-to-image alignment, and image-to-text alignment for generating temporally consistent 4D assets. Experimental results demonstrate that Uni4D achieves high-quality 3D retrieval and controllable 4D generation, advancing dynamic multimodal understanding and practical applications.
zh

[CV-172] Co-GRPO: Co-Optimized Group Relative Policy Optimization for Masked Diffusion Model

【速读】:该论文旨在解决掩码扩散模型(Masked Diffusion Models, MDMs)在训练与推理阶段存在显著不一致的问题:MDMs 在推理时是一个多步迭代过程,其生成轨迹受模型参数和解码调度策略(inference schedule)共同影响,而传统训练方法采用简化的一步式 BERT 风格目标,仅预测被掩码的全部 token,忽略了推理过程中调度策略的作用,导致调度参数未在训练中优化。解决方案的关键在于提出 Co-GRPO(Co-Optimized Group Relative Policy Optimization),将 MDM 的生成过程建模为一个统一的马尔可夫决策过程(Markov Decision Process, MDP),联合优化模型参数与调度参数,在共享奖励下通过轨迹级的分组相对策略优化实现端到端协同训练,无需对多步生成过程进行昂贵的反向传播,从而更紧密地对齐训练与推理,显著提升生成质量。

链接: https://arxiv.org/abs/2512.22288
作者: Renping Zhou,Zanlin Ni,Tianyi Chen,Zeyu Liu,Yang Yue,Yulin Wang,Yuxuan Wang,Jingshu Liu,Gao Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 6 figures

点击查看摘要

Abstract:Recently, Masked Diffusion Models (MDMs) have shown promising potential across vision, language, and cross-modal generation. However, a notable discrepancy exists between their training and inference procedures. In particular, MDM inference is a multi-step, iterative process governed not only by the model itself but also by various schedules that dictate the token-decoding trajectory (e.g., how many tokens to decode at each step). In contrast, MDMs are typically trained using a simplified, single-step BERT-style objective that masks a subset of tokens and predicts all of them simultaneously. This step-level simplification fundamentally disconnects the training paradigm from the trajectory-level nature of inference, leaving the inference schedules never optimized during training. In this paper, we introduce Co-GRPO, which reformulates MDM generation as a unified Markov Decision Process (MDP) that jointly incorporates both the model and the inference schedule. By applying Group Relative Policy Optimization at the trajectory level, Co-GRPO cooperatively optimizes model parameters and schedule parameters under a shared reward, without requiring costly backpropagation through the multi-step generation process. This holistic optimization aligns training with inference more thoroughly and substantially improves generation quality. Empirical results across four benchmarks-ImageReward, HPS, GenEval, and DPG-Bench-demonstrate the effectiveness of our approach. For more details, please refer to our project page: this https URL .
zh
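GRPO 的组相对优势计算是 Co-GRPO 的基础,下面给出最小示意(假设性草图:省略了 Co-GRPO 对解码调度参数的联合参数化与掩码扩散轨迹本身,仅演示组内标准化奖励与轨迹级策略梯度):

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO 核心:同一 prompt 下采样 G 条轨迹,优势 = 组内标准化后的奖励。
    rewards: (B, G),B 个 prompt,每个 prompt G 条生成轨迹的标量奖励。"""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_loss(logp, advantages):
    # 轨迹级策略梯度:logp 为各轨迹对数概率之和,无需价值网络
    return -(advantages.detach() * logp).mean()

rewards = torch.tensor([[0.1, 0.9, 0.5, 0.5]])       # 1 个 prompt,4 条轨迹
logp = torch.randn(1, 4, requires_grad=True)
loss = grpo_loss(logp, group_relative_advantages(rewards))
loss.backward()
print(loss.item(), logp.grad.shape)
```

Co-GRPO 在此基础上把调度参数也纳入同一 MDP,用同一组相对优势同时更新模型与调度两类参数。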

[CV-173] FETAL-GAUGE: A Benchmark for Assessing Vision-Language Models in Fetal Ultrasound

【速读】:该论文旨在解决胎儿超声成像中视觉语言模型(Vision-Language Models, VLMs)缺乏标准化评估基准的问题,从而推动生成式AI在产前超声检查中的临床应用。当前,全球训练有素的超声技师短缺,而VLMs虽具备提升效率和辅助培训的潜力,但因缺乏统一评测体系、数据模态复杂且依赖操作者等因素,其性能难以量化与优化。论文的关键解决方案是构建Fetal-Gauge——首个专为胎儿超声设计的大规模视觉问答基准,包含超过42,000张图像和93,000个问答对,覆盖解剖平面识别、结构定位、胎儿方位判断、临床视图合规性及诊断等多任务场景。通过系统评估多个先进VLMs,发现最佳模型准确率仅为55%,远低于临床需求,揭示了现有模型在该领域的显著局限性,强调需开发领域适配架构与专用训练策略,为推进产前护理中的多模态深度学习奠定基础。

链接: https://arxiv.org/abs/2512.22278
作者: Hussain Alasmawi,Numan Saeed,Mohammad Yaqub
机构: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The growing demand for prenatal ultrasound imaging has intensified a global shortage of trained sonographers, creating barriers to essential fetal health monitoring. Deep learning has the potential to enhance sonographers’ efficiency and support the training of new practitioners. Vision-Language Models (VLMs) are particularly promising for ultrasound interpretation, as they can jointly process images and text to perform multiple clinical tasks within a single framework. However, despite the expansion of VLMs, no standardized benchmark exists to evaluate their performance in fetal ultrasound imaging. This gap is primarily due to the modality’s challenging nature, operator dependency, and the limited public availability of datasets. To address this gap, we present Fetal-Gauge, the first and largest visual question answering benchmark specifically designed to evaluate VLMs across various fetal ultrasound tasks. Our benchmark comprises over 42,000 images and 93,000 question-answer pairs, spanning anatomical plane identification, visual grounding of anatomical structures, fetal orientation assessment, clinical view conformity, and clinical diagnosis. We systematically evaluate several state-of-the-art VLMs, including general-purpose and medical-specific models, and reveal a substantial performance gap: the best-performing model achieves only 55% accuracy, far below clinical requirements. Our analysis identifies critical limitations of current VLMs in fetal ultrasound interpretation, highlighting the urgent need for domain-adapted architectures and specialized training approaches. Fetal-Gauge establishes a rigorous foundation for advancing multimodal deep learning in prenatal care and provides a pathway toward addressing global healthcare accessibility challenges. Our benchmark will be publicly available once the paper gets accepted.
zh

[CV-174] The Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency

【速读】:该论文试图解决当前生成式AI模型在临床实践中缺乏真实临床推理能力的问题,尤其是现有基准测试(如医学执照考试或精心设计的病例)无法全面评估其多模态整合与复杂决策能力。解决方案的关键在于构建一个名为“Bones and Joints (BJ)”的综合性评测基准,包含1,245道来自真实骨科和运动医学患者案例的问题,覆盖7个模拟临床推理路径的任务(包括知识回忆、文本与图像解读、诊断生成、治疗规划及理由提供),从而更真实地评估模型在多模态情境下的表现。实证结果揭示了当前模型在结构化选择题上表现优异(>90%准确率),但在需要多模态融合的开放式任务中性能显著下降(<60%准确率),凸显出视觉理解与跨模态一致性仍是核心瓶颈。

链接: https://arxiv.org/abs/2512.22275
作者: Dingyu Wang,Zimu Yuan,Jiajun Liu,Shanggui Liu,Nan Zhou,Tianxing Xu,Di Huang,Dong Jiang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Background: The rapid integration of foundation models into clinical practice and public health necessitates a rigorous evaluation of their true clinical reasoning capabilities beyond narrow examination success. Current benchmarks, typically based on medical licensing exams or curated vignettes, fail to capture the integrated, multimodal reasoning essential for real-world patient care. Methods: We developed the Bones and Joints (BJ) Benchmark, a comprehensive evaluation framework comprising 1,245 questions derived from real-world patient cases in orthopedics and sports medicine. This benchmark assesses models across 7 tasks that mirror the clinical reasoning pathway, including knowledge recall, text and image interpretation, diagnosis generation, treatment planning, and rationale provision. We evaluated eleven vision-language models (VLMs) and six large language models (LLMs), comparing their performance against expert-derived ground truth. Results: Our results demonstrate a pronounced performance gap between task types. While state-of-the-art models achieved high accuracy, exceeding 90%, on structured multiple-choice questions, their performance markedly declined on open-ended tasks requiring multimodal integration, with accuracy scarcely reaching 60%. VLMs demonstrated substantial limitations in interpreting medical images and frequently exhibited severe text-driven hallucinations, often ignoring contradictory visual evidence. Notably, models specifically fine-tuned for medical applications showed no consistent advantage over general-purpose counterparts. Conclusions: Current artificial intelligence models are not yet clinically competent for complex, multimodal reasoning. Their safe deployment should currently be limited to supportive, text-based roles. Future advancement in core clinical tasks awaits fundamental breakthroughs in multimodal integration and visual understanding.
zh

[CV-175] GeCo: A Differentiable Geometric Consistency Metric for Video Generation

【速读】:该论文旨在解决静态场景中视频生成模型常出现的几何形变(geometric deformation)与遮挡不一致(occlusion-inconsistency)伪影难以量化检测的问题。解决方案的关键在于提出一种基于几何约束的度量方法 GeCo,通过融合残差运动(residual motion)和深度先验(depth prior),生成可解释的密集一致性图(dense consistency maps),从而有效识别上述两类伪影,并进一步将其作为训练无感知的引导损失(training-free guidance loss)以减少生成过程中的形变伪影。

链接: https://arxiv.org/abs/2512.22274
作者: Leslie Gu,Junhwa Hur,Charles Herrmann,Fangneng Zhan,Todd Zickler,Deqing Sun,Hanspeter Pfister
机构: Harvard University (哈佛大学); Google DeepMind (谷歌深度学习); Massachusetts Institute of Technology (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce GeCo, a geometry-grounded metric for jointly detecting geometric deformation and occlusion-inconsistency artifacts in static scenes. By fusing residual motion and depth priors, GeCo produces interpretable, dense consistency maps that reveal these artifacts. We use GeCo to systematically benchmark recent video generation models, uncovering common failure modes, and further employ it as a training-free guidance loss to reduce deformation artifacts during video generation.
zh

[CV-176] Human-Aligned Generative Perception: Bridging Psychophysics and Generative Models

【速读】:该论文旨在解决当前文本到图像扩散模型在生成过程中难以遵循严格几何约束的问题,尤其是在文本提示中的风格信息与几何要求冲突时,模型往往优先满足表面纹理细节而忽略形状结构,从而导致语义不一致。其核心解决方案是引入一种轻量级的外部指导信号——人类感知嵌入(Human Perception Embedding, HPE)教师模型,该模型基于THINGS三元组数据集训练,捕捉人类对物体形状的敏感性;通过将该教师模型的梯度注入潜在扩散过程,实现几何与风格的可控分离,从而显著提升生成结果的几何准确性和语义一致性。

链接: https://arxiv.org/abs/2512.22272
作者: Antara Titikhsha,Om Kulkarni,Dharun Muthaiah
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-image diffusion models generate highly detailed textures, yet they often rely on surface appearance and fail to follow strict geometric constraints, particularly when those constraints conflict with the style implied by the text prompt. This reflects a broader semantic gap between human perception and current generative models. We investigate whether geometric understanding can be introduced without specialized training by using lightweight, off-the-shelf discriminators as external guidance signals. We propose a Human Perception Embedding (HPE) teacher trained on the THINGS triplet dataset, which captures human sensitivity to object shape. By injecting gradients from this teacher into the latent diffusion process, we show that geometry and style can be separated in a controllable manner. We evaluate this approach across three architectures: Stable Diffusion v1.5 with a U-Net backbone, the flow-matching model SiT-XL/2, and the diffusion transformer PixArt-Σ. Our experiments reveal that flow models tend to drift back toward their default trajectories without continuous guidance, and we demonstrate zero-shot transfer of complex three-dimensional shapes, such as an Eames chair, onto conflicting materials such as pink metal. This guided generation improves semantic alignment by about 80 percent compared to unguided baselines. Overall, our results show that small teacher models can reliably guide large generative systems, enabling stronger geometric control and broadening the creative range of text-to-image synthesis.
zh

[CV-177] Evaluating an Adaptive Multispectral Turret System for Autonomous Tracking Across Variable Illumination Conditions

【速读】:该论文旨在解决自主机器人平台在应急服务场景中(如灾难搜救和侦察)因光照条件变化导致的目标检测性能下降问题。传统RGB检测方法在低光环境下失效,而纯热成像系统则缺乏颜色与纹理信息,难以满足复杂环境下的识别需求。解决方案的关键在于提出一种自适应的RGB与长波红外(Long-Wave Infrared, LWIR)视频流融合框架,通过在11个不同融合比例(从100% RGB到100% LWIR)下对齐帧进行动态混合,并针对三种光照条件(无光、微光、全光)分别训练并选择最优YOLO检测模型,从而实现跨照明环境下的高置信度目标检测。实验表明,该方法在全光和微光条件下显著优于YOLOv5n和YOLOv11n基线模型,在无光条件下也展现出优于基线的检测性能。

链接: https://arxiv.org/abs/2512.22263
作者: Aahan Sachdeva,Dhanvinkumar Ganeshkumar,James E. Gallagher,Tyler Treat,Edward J. Oughton
机构: George Mason University (乔治梅森大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Autonomous robotic platforms are playing a growing role across the emergency services sector, supporting missions such as search and rescue operations in disaster zones and reconnaissance. However, traditional red-green-blue (RGB) detection pipelines struggle in low-light environments, and thermal-based systems lack color and texture information. To overcome these limitations, we present an adaptive framework that fuses RGB and long-wave infrared (LWIR) video streams at multiple fusion ratios and dynamically selects the optimal detection model for each illumination condition. We trained 33 You Only Look Once (YOLO) models on over 22,000 annotated images spanning three light levels: no-light (<10 lux), dim-light (10-1000 lux), and full-light (>1000 lux). To integrate both modalities, fusion was performed by blending aligned RGB and LWIR frames at eleven ratios, from full RGB (100/0) to full LWIR (0/100) in 10% increments. Evaluation showed that the best full-light model (80/20 RGB-LWIR) and dim-light model (90/10 fusion) achieved 92.8% and 92.0% mean confidence; both significantly outperformed the YOLOv5 nano (YOLOv5n) and YOLOv11 nano (YOLOv11n) baselines. Under no-light conditions, the top 40/60 fusion reached 71.0%, exceeding baselines though not statistically significant. Adaptive RGB-LWIR fusion improved detection confidence and reliability across all illumination conditions, enhancing autonomous robotic vision performance.
zh

[CV-178] mporal Visual Semantics-Induced Human Motion Understanding with Large Language Models

【速读】:该论文旨在解决无监督人体运动分割(Unsupervised Human Motion Segmentation, HMS)中传统子空间聚类方法忽视时间语义信息的问题。其核心挑战在于如何从人体运动序列中学习时间视觉语义(Temporal Vision Semantics, TVS),并将其有效整合到子空间聚类框架中。解决方案的关键是利用大语言模型(Large Language Model, LLM)的图文理解能力,通过查询连续帧以判断是否属于同一动作,并据此构建时间邻接关系;进而设计一种融合TVS的子空间聚类方法,引入时间正则项促使每个帧与其时间邻近帧共享相似的子空间嵌入,并在分割阶段施加时间约束以增强相邻帧的分组一致性;此外,还提出一个反馈机制,根据分割结果持续优化子空间嵌入,从而提升整体分割性能。

链接: https://arxiv.org/abs/2512.22249
作者: Zheng Xing,Weibing Zhao
机构: Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University, China(广东省机器感知与智能计算实验室,深圳北理莫斯科大学,中国); College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China(深圳大学计算机科学与软件工程学院,深圳 518060,中国)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unsupervised human motion segmentation (HMS) can be effectively achieved using subspace clustering techniques. However, traditional methods overlook the role of temporal semantic exploration in HMS. This paper explores the use of temporal vision semantics (TVS) derived from human motion sequences, leveraging the image-to-text capabilities of a large language model (LLM) to enhance subspace clustering performance. The core idea is to extract textual motion information from consecutive frames via LLM and incorporate this learned information into the subspace clustering framework. The primary challenge lies in learning TVS from human motion sequences using LLM and integrating this information into subspace clustering. To address this, we determine whether consecutive frames depict the same motion by querying the LLM and subsequently learn temporal neighboring information based on its response. We then develop a TVS-integrated subspace clustering approach, incorporating subspace embedding with a temporal regularizer that induces each frame to share similar subspace embeddings with its temporal neighbors. Additionally, segmentation is performed based on subspace embedding with a temporal constraint that induces the grouping of each frame with its temporal neighbors. We also introduce a feedback-enabled framework that continuously optimizes subspace embedding based on the segmentation output. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art approaches on four benchmark human motion datasets.
zh

[CV-179] Fairness Evaluation of Risk Estimation Models for Lung Cancer Screening

【速读】:该论文旨在解决肺部结节风险评估模型在不同人口统计学群体中可能存在性能差异和不公平偏倚的问题,尤其是在肺癌筛查场景下。其解决方案的关键在于基于JustEFAB框架系统性评估两个深度学习模型(Sybil肺癌风险模型与Venkadesh21结节风险估计器)及一个推荐的逻辑回归模型(PanCan2b)在不同性别、种族亚组中的表现差异,通过分析受试者工作特征曲线下面积(AUROC)、敏感性和特异性等指标,并排除临床风险因素作为潜在混杂变量后,识别出显著的性能不平等现象,从而强调需加强对代表性不足群体的模型性能监测与算法公平性研究。

链接: https://arxiv.org/abs/2512.22242
作者: Shaurya Gaur,Michel Vitale,Alessa Hering,Johan Kwisthout,Colin Jacobs,Lena Philipp,Fennie van der Graaf
机构: University of Groningen (格罗宁根大学); University of Amsterdam (阿姆斯特丹大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Image and Video Processing (eess.IV)
备注: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URL

点击查看摘要

Abstract:Lung cancer is the leading cause of cancer-related mortality in adults worldwide. Screening high-risk individuals with annual low-dose CT (LDCT) can support earlier detection and reduce deaths, but widespread implementation may strain the already limited radiology workforce. AI models have shown potential in estimating lung cancer risk from LDCT scans. However, high-risk populations for lung cancer are diverse, and these models’ performance across demographic groups remains an open question. In this study, we drew on the considerations on confounding factors and ethically significant biases outlined in the JustEFAB framework to evaluate potential performance disparities and fairness in two deep learning risk estimation models for lung cancer screening: the Sybil lung cancer risk model and the Venkadesh21 nodule risk estimator. We also examined disparities in the PanCan2b logistic regression model recommended in the British Thoracic Society nodule management guideline. Both deep learning models were trained on data from the US-based National Lung Screening Trial (NLST), and assessed on a held-out NLST validation set. We evaluated AUROC, sensitivity, and specificity across demographic subgroups, and explored potential confounding from clinical risk factors. We observed a statistically significant AUROC difference in Sybil’s performance between women (0.88, 95% CI: 0.86, 0.90) and men (0.81, 95% CI: 0.78, 0.84, p < .001). At 90% specificity, Venkadesh21 showed lower sensitivity for Black (0.39, 95% CI: 0.23, 0.59) than White participants (0.69, 95% CI: 0.65, 0.73). These differences were not explained by available clinical confounders and thus may be classified as unfair biases according to JustEFAB. Our findings highlight the importance of improving and monitoring model performance across underrepresented subgroups, and further research on algorithmic fairness, in lung cancer screening.
zh
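这类公平性分析的基本操作是按人口学子组计算 AUROC 并给出 bootstrap 置信区间,可用 scikit-learn 简单实现(假设性草图,数据为随机合成,不代表论文结果):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_auroc(y_true, y_score, group, n_boot=1000, seed=0):
    """按子组计算 AUROC 及 bootstrap 95% 置信区间(示意)。"""
    rng = np.random.default_rng(seed)
    out = {}
    for g in np.unique(group):
        idx = np.where(group == g)[0]
        aucs = []
        for _ in range(n_boot):
            s = rng.choice(idx, size=len(idx), replace=True)
            if len(np.unique(y_true[s])) < 2:   # 重采样后需同时含正负样本
                continue
            aucs.append(roc_auc_score(y_true[s], y_score[s]))
        out[g] = (roc_auc_score(y_true[idx], y_score[idx]),
                  np.percentile(aucs, 2.5), np.percentile(aucs, 97.5))
    return out

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 400)
score = y * 0.5 + rng.random(400)                 # 有一定判别力的假分数
sex = rng.choice(["F", "M"], 400)
for g, (auc, lo, hi) in subgroup_auroc(y, score, sex).items():
    print(f"{g}: AUROC={auc:.3f} (95% CI {lo:.3f}, {hi:.3f})")
```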

[CV-180] Multi-objective hybrid knowledge distillation for efficient deep learning in smart agriculture

【速读】:该论文旨在解决在资源受限的边缘设备上部署深度学习模型时面临的计算效率与识别准确率之间的权衡问题(trade-off between computational efficiency and recognition accuracy),这在智慧农业场景中尤为突出。其解决方案的关键在于提出了一种混合知识蒸馏框架,通过设计一种融合倒残差块(inverted residual blocks)与密集连接结构的定制化轻量级学生模型,并在其训练过程中引入多目标策略——包括硬标签监督、特征层级蒸馏、响应层级蒸馏以及自蒸馏机制,从而在显著降低模型复杂度的同时保持高精度。实验表明,该方法在水稻种子品种识别任务中达到98.56%的准确率,仅比ResNet18教师模型低0.09%,但计算成本减少约2.7倍、参数量减少超10倍,且在多个植物叶片病害数据集上展现出良好的泛化能力,验证了其在硬件受限的智慧农业系统中的高效性与实用性。

链接: https://arxiv.org/abs/2512.22239
作者: Phi-Hung Hoang,Nam-Thuan Trinh,Van-Manh Tran,Thi-Thu-Hong Phan
机构: FPT University (FPT大学); FPT University (FPT大学); FPT University (FPT大学); FPT Education (FPT教育)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 7 tables; 11 figures

点击查看摘要

Abstract:Deploying deep learning models on resource-constrained edge devices remains a major challenge in smart agriculture due to the trade-off between computational efficiency and recognition accuracy. To address this challenge, this study proposes a hybrid knowledge distillation framework for developing a lightweight yet high-performance convolutional neural network. The proposed approach designs a customized student model that combines inverted residual blocks with dense connectivity and trains it under the guidance of a ResNet18 teacher network using a multi-objective strategy that integrates hard-label supervision, feature-level distillation, response-level distillation, and self-distillation. Experiments are conducted on a rice seed variety identification dataset containing nine varieties and further extended to four plant leaf disease datasets, including rice, potato, coffee, and corn, to evaluate generalization capability. On the rice seed variety classification task, the distilled student model achieves an accuracy of 98.56%, which is only 0.09% lower than the teacher model (98.65%), while requiring only 0.68 GFLOPs and approximately 1.07 million parameters. This corresponds to a reduction of about 2.7 times in computational cost and more than 10 times in model size compared with the ResNet18 teacher model. In addition, compared with representative pretrained models, the proposed student reduces the number of parameters by more than 6 times relative to DenseNet121 and by over 80 times compared with the Vision Transformer (ViT) architecture, while maintaining comparable or superior classification accuracy. Consistent performance gains across multiple plant leaf disease datasets further demonstrate the robustness, efficiency, and strong deployment potential of the proposed framework for hardware-limited smart agriculture systems.
zh
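多目标蒸馏损失的组合方式可示意如下(假设性草图:温度与各项权重均为示例,特征维度假定师生已对齐或经投影对齐,且省略了论文中的自蒸馏项):

```python
import torch, torch.nn as nn
import torch.nn.functional as F

def hybrid_kd_loss(s_logits, t_logits, s_feat, t_feat, y,
                   T=4.0, w_hard=1.0, w_resp=1.0, w_feat=0.5):
    """多目标蒸馏(示意):硬标签 + 响应级 KL + 特征级 L2 的加权和。"""
    hard = F.cross_entropy(s_logits, y)                    # 硬标签监督
    resp = F.kl_div(F.log_softmax(s_logits / T, dim=-1),   # 响应级蒸馏
                    F.softmax(t_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    feat = F.mse_loss(s_feat, t_feat)                      # 特征级蒸馏
    return w_hard * hard + w_resp * resp + w_feat * feat

B, C, D = 8, 9, 64                      # 9 个水稻品种类别(示意)
y = torch.randint(0, C, (B,))
loss = hybrid_kd_loss(torch.randn(B, C), torch.randn(B, C),
                      torch.randn(B, D), torch.randn(B, D), y)
print(loss.item())
```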

[CV-181] Masking Teacher and Reinforcing Student for Distilling Vision-Language Models

【速读】:该论文旨在解决大规模视觉语言模型(Vision-Language Models, VLMs)在部署到移动或边缘设备时因模型体积过大而导致的实用性问题,同时克服小模型从大教师模型中知识蒸馏(Knowledge Distillation)过程中因尺寸差距过大导致的学习不稳定与性能下降难题。解决方案的关键在于提出一种基于掩码渐进式强化学习(Masking Teacher and Reinforcing Student, Masters)的知识蒸馏框架:首先对教师模型的非主导权重进行掩码以降低复杂度,随后在训练中逐步恢复教师容量;并引入离线强化学习阶段,结合准确率奖励与蒸馏奖励两个互补信号,引导学生模型高效学习教师生成的高质量响应,从而实现稳定且高性能的小型化VLM构建。

链接: https://arxiv.org/abs/2512.22238
作者: Byung-Kwan Lee,Yu-Chiang Frank Wang,Ryo Hachiuma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large-scale vision-language models (VLMs) have recently achieved remarkable multimodal understanding, but their massive size makes them impractical for deployment on mobile or edge devices. This raises the need for compact yet capable VLMs that can efficiently learn from powerful large teachers. However, distilling knowledge from a large teacher to a small student remains challenging due to their large size gap: the student often fails to reproduce the teacher’s complex, high-dimensional representations, leading to unstable learning and degraded performance. To address this, we propose Masters (Masking Teacher and Reinforcing Student), a mask-progressive reinforcement learning (RL) distillation framework. Masters first masks non-dominant weights of the teacher to reduce unnecessary complexity, then progressively restores the teacher by gradually increasing its capacity during training. This strategy allows the student to learn richer representations from the teacher in a smooth and stable manner. To further refine knowledge transfer, Masters integrates an offline RL stage with two complementary rewards: an accuracy reward that measures the correctness of the generated responses, and a distillation reward that quantifies the ease of transferring responses from teacher to student. Unlike online think-answer RL paradigms that are computationally expensive and generate lengthy responses, our offline RL leverages pre-generated responses from masked teachers. These provide rich yet efficient guidance, enabling students to achieve strong performance without requiring the think-answer process.
zh
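"按幅值屏蔽教师的非主导权重,再按日程逐步恢复容量"的机制可示意如下(假设性草图:keep_ratio 日程、只处理线性层等选择均为演示设定,并非论文实现):

```python
import copy
import torch, torch.nn as nn

@torch.no_grad()
def mask_teacher(teacher, keep_ratio):
    """按幅值屏蔽非主导权重(示意):每个线性层只保留幅值前 keep_ratio 的权重。"""
    masked = copy.deepcopy(teacher)
    for m in masked.modules():
        if isinstance(m, nn.Linear):
            w = m.weight
            k = max(1, int(w.numel() * keep_ratio))
            thresh = w.abs().flatten().kthvalue(w.numel() - k + 1).values
            w.mul_((w.abs() >= thresh).float())
    return masked

teacher = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
x = torch.randn(2, 16)
for keep in [0.3, 0.5, 0.8, 1.0]:        # 训练中逐步恢复教师容量(假设的日程)
    t = mask_teacher(teacher, keep)
    print(f"keep={keep}: 与完整教师的输出差异="
          f"{float((t(x) - teacher(x)).abs().mean()):.4f}")
```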

[CV-182] Meta-information Guided Cross-domain Synergistic Diffusion Model for Low-dose PET Reconstruction

【速读】:该论文旨在解决低剂量正电子发射断层成像(PET)中因辐射暴露降低导致的图像噪声增强、对比度下降及生理细节难以保留的问题。现有方法普遍忽视了投影域物理先验知识和患者特异性元信息(meta-information),而这些因素对功能-语义关联挖掘至关重要。解决方案的关键在于提出一种元信息引导的跨域协同扩散模型(MiG-DM),其核心创新包括:一是设计元信息编码模块,将临床参数转化为语义提示,实现文本元信息与图像重建之间的跨模态对齐;二是构建投影域与图像域协同处理的跨域架构,在投影域通过专用sinogram适配器利用卷积操作捕获全局物理结构,从而融合多模态先验以生成高质量PET图像。

链接: https://arxiv.org/abs/2512.22237
作者: Mengxiao Geng,Ran Hong,Xiaoling Xu,Bingxuan Li,Qiegen Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Low-dose PET imaging is crucial for reducing patient radiation exposure but faces challenges like noise interference, reduced contrast, and difficulty in preserving physiological details. Existing methods often neglect both projection-domain physics knowledge and patient-specific meta-information, which are critical for functional-semantic correlation mining. In this study, we introduce a meta-information guided cross-domain synergistic diffusion model (MiG-DM) that integrates comprehensive cross-modal priors to generate high-quality PET images. Specifically, a meta-information encoding module transforms clinical parameters into semantic prompts by considering patient characteristics, dose-related information, and semi-quantitative parameters, enabling cross-modal alignment between textual meta-information and image reconstruction. Additionally, the cross-domain architecture combines projection-domain and image-domain processing. In the projection domain, a specialized sinogram adapter captures global physical structures through convolution operations equivalent to global image-domain filtering. Experiments on the UDPET public dataset and clinical datasets with varying dose levels demonstrate that MiG-DM outperforms state-of-the-art methods in enhancing PET image quality and preserving physiological details.
zh

[CV-183] KAN-FPN-Stem: A KAN-Enhanced Feature Pyramid Stem for Boosting ViT-based Pose Estimation

【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)在密集预测任务(如姿态估计)中因前端设计过于简单而导致的性能瓶颈问题,特别是由朴素的patchification机制引发的多尺度信息处理能力不足和初始特征提取阶段的信息不可逆损失。其核心解决方案是提出一种KAN增强的FPN-Stem架构,关键在于保留FPN(Feature Pyramid Network)经典的"上采样-相加"融合流程,但将末端标准线性3×3平滑卷积替换为基于Kolmogorov-Arnold网络(Kolmogorov-Arnold Network, KAN)的卷积层;该KAN模块凭借更强的非线性建模能力,自适应地学习并修正多尺度融合过程中产生的伪影,从而显著提升特征融合质量。实验表明,该方法在COCO数据集上相较轻量级ViTPose-S基线模型实现最高+2.0 AP的性能提升,揭示了ViT前端性能瓶颈的关键并非来自注意力机制本身,而是特征融合的质量问题。

链接: https://arxiv.org/abs/2512.22228
作者: HaoNan Tang
机构: WuHan University of Technology (武汉理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision Transformers (ViT) have demonstrated significant promise in dense prediction tasks such as pose estimation. However, their performance is frequently constrained by the overly simplistic front-end designs employed in models like ViTPose. This naive patchification mechanism struggles to effectively handle multi-scale variations and results in irreversible information loss during the initial feature extraction phase. To overcome this limitation, we introduce a novel KAN-enhanced FPN-Stem architecture. Through rigorous ablation studies, we first identified that the true bottleneck for performance improvement lies not in plug-and-play attention modules (e.g., CBAM), but in the post-fusion non-linear smoothing step within the FPN. Guided by this insight, our core innovation is to retain the classic “upsample-and-add” fusion stream of the FPN, but replace its terminal, standard linear 3x3 smoothing convolution with a powerful KAN-based convolutional layer. Leveraging its superior non-linear modeling capabilities, this KAN-based layer adaptively learns and rectifies the “artifacts” generated during the multi-scale fusion process. Extensive experiments on the COCO dataset demonstrate that our KAN-FPN-Stem achieves a significant performance boost of up to +2.0 AP over the lightweight ViTPose-S baseline. This work not only delivers a plug-and-play, high-performance module but, more importantly, reveals that: the performance bottleneck in ViT front-end often lies not in ‘feature refinement’ (Attention), but in the quality of ‘feature fusion’ (Fusion). Furthermore, it provides an effective path to address this bottleneck through the introduction of the KAN operator.
zh
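论文的改动点可以定位到一个"平滑层可插拔"的 FPN 融合模块上(假设性草图:此处用普通 3×3 卷积占位 smooth,真实工作将其替换为 KAN 卷积层):

```python
import torch, torch.nn as nn
import torch.nn.functional as F

class FPNFuse(nn.Module):
    """经典 FPN 的"上采样-相加"融合,末端平滑层可插拔(示意)。
    论文将 smooth 从线性 3x3 卷积换成 KAN 卷积;此处用普通卷积占位。"""
    def __init__(self, c, smooth=None):
        super().__init__()
        self.smooth = smooth or nn.Conv2d(c, c, 3, padding=1)  # 替换点

    def forward(self, top, lateral):
        up = F.interpolate(top, size=lateral.shape[-2:], mode="nearest")
        return self.smooth(up + lateral)     # 融合后做平滑/非线性修正

fuse = FPNFuse(c=64)
top = torch.randn(1, 64, 8, 8)              # 高层、低分辨率特征
lateral = torch.randn(1, 64, 16, 16)        # 侧连、高分辨率特征
print(fuse(top, lateral).shape)              # torch.Size([1, 64, 16, 16])
```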

[CV-184] On Extending Semantic Abstraction for Efficient Search of Hidden Objects

【速读】:该论文旨在解决家庭机器人在寻找被遮挡物体(即“隐藏物体”,定义为因部分遮挡而无法被视觉语言模型(VLM)直接识别的物体)时效率低下的问题。解决方案的关键在于利用语义抽象(Semantic Abstraction)框架,将VLM生成的相关性激活图(relevancy maps)视为“抽象对象”表示,并结合历史放置数据进行高效搜索——通过学习物体常见位置的先验知识,实现对隐藏物体三维位置的快速准确定位,显著优于随机搜索策略。

链接: https://arxiv.org/abs/2512.22220
作者: Tasha Pais,Nikhilesh Belulkar
机构: Columbia University (哥伦比亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Semantic Abstraction’s key observation is that 2D VLMs’ relevancy activations roughly correspond to their confidence of whether and where an object is in the scene. Thus, relevancy maps are treated as “abstract object” representations. We use this framework for learning 3D localization and completion for the exclusive domain of hidden objects, defined as objects that cannot be directly identified by a VLM because they are at least partially occluded. This process of localizing hidden objects is a form of unstructured search that can be performed more efficiently using historical data of where an object is frequently placed. Our model can accurately identify the complete 3D location of a hidden object on the first try significantly faster than a naive random search. These extensions to semantic abstraction hope to provide household robots with the skills necessary to save time and effort when looking for lost objects.
zh

[CV-185] Towards Signboard-Oriented Visual Question Answering: ViSignVQA Dataset, Method and Benchmark

【速读】:该论文旨在解决低资源语言(越南语)中场景标志文本理解在视觉问答(VQA)任务中的不足问题,尤其是针对越南语标志牌的多模态理解挑战。其关键解决方案在于构建了首个大规模越南语标志牌导向型VQA数据集ViSignVQA,包含10,762张图像和25,573个问答对,并引入OCR增强的上下文机制——通过集成越南语OCR模型(SwinTextSpotter)与预训练语言模型(ViT5),显著提升了模型性能(F1分数最高提升209%)。此外,提出基于感知与推理代理的多智能体框架结合GPT-4,实现75.98%的准确率,验证了领域特定资源与OCR融合策略对低资源语言文本驱动VQA的重要性。

链接: https://arxiv.org/abs/2512.22218
作者: Hieu Minh Nguyen,Tam Le-Thanh Dang,Kiet Van Nguyen
机构: UIT(越南信息技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Dataset paper; code and data will be released

点击查看摘要

Abstract:Understanding signboard text in natural scenes is essential for real-world applications of Visual Question Answering (VQA), yet remains underexplored, particularly in low-resource languages. We introduce ViSignVQA, the first large-scale Vietnamese dataset designed for signboard-oriented VQA, which comprises 10,762 images and 25,573 question-answer pairs. The dataset captures the diverse linguistic, cultural, and visual characteristics of Vietnamese signboards, including bilingual text, informal phrasing, and visual elements such as color and layout. To benchmark this task, we adapted state-of-the-art VQA models (e.g., BLIP-2, LaTr, PreSTU, and SaL) by integrating a Vietnamese OCR model (SwinTextSpotter) and a Vietnamese pretrained language model (ViT5). The experimental results highlight the significant role of the OCR-enhanced context, with F1-score improvements of up to 209% when the OCR text is appended to questions. Additionally, we propose a multi-agent VQA framework combining perception and reasoning agents with GPT-4, achieving 75.98% accuracy via majority voting. Our study presents the first large-scale multimodal dataset for Vietnamese signboard understanding. This underscores the importance of domain-specific resources in enhancing text-based VQA for low-resource languages. ViSignVQA serves as a benchmark capturing real-world scene text characteristics and supporting the development and evaluation of OCR-integrated VQA models in Vietnamese.
zh

[CV-186] VLM-PAR: A Vision Language Model for Pedestrian Attribute Recognition

【速读】:该论文旨在解决行人属性识别(Pedestrian Attribute Recognition, PAR)中面临的三大挑战:类别严重不平衡、属性间复杂的共现依赖关系以及域偏移问题。其解决方案的关键在于提出一个基于冻结的SigLIP 2多语言视觉编码器构建的模块化视觉-语言框架VLM-PAR,通过引入紧凑的交叉注意力融合机制对视觉特征进行精炼,从而实现图像嵌入与提示嵌入的有效对齐,显著提升了在高度不平衡的PA100K数据集上的准确率,并在PETA和Market-1501基准上实现了平均准确率的大幅提升,验证了大规模视觉-语言预训练结合针对性跨模态精炼策略在克服PAR任务中不平衡性和泛化性难题方面的有效性。
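
编辑注:以下为"紧凑交叉注意力融合"机制的一个最小 PyTorch 草图,维度、头数与前向流程均为示意性假设,并非论文原始实现。

```python
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """以属性提示嵌入为query、冻结SigLIP视觉token为key/value的交叉注意力融合。"""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, prompt_emb, visual_tokens):
        # prompt_emb: (B, A, D) 每个属性一个提示嵌入;visual_tokens: (B, N, D)
        refined, _ = self.attn(prompt_emb, visual_tokens, visual_tokens)
        return self.norm(prompt_emb + refined)  # (B, A, D),后接各属性的分类头
```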

链接: https://arxiv.org/abs/2512.22217
作者: Abdellah Zakaria Sellam,Salah Eddine Bekhouche,Fadi Dornaika,Cosimo Distante,Abdenour Hadid
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pedestrian Attribute Recognition (PAR) involves predicting fine-grained attributes such as clothing color, gender, and accessories from pedestrian imagery, yet is hindered by severe class imbalance, intricate attribute co-dependencies, and domain shifts. We introduce VLM-PAR, a modular vision-language framework built on frozen SigLIP 2 multilingual encoders. By first aligning image and prompt embeddings via refining visual features through a compact cross-attention fusion, VLM-PAR achieves significant accuracy improvement on the highly imbalanced PA100K benchmark, setting a new state-of-the-art performance, while also delivering significant gains in mean accuracy across PETA and Market-1501 benchmarks. These results underscore the efficacy of integrating large-scale vision-language pretraining with targeted cross-modal refinement to overcome imbalance and generalization challenges in PAR.
zh

[CV-187] Signal-SGN: Topology-Enhanced Time-Frequency Spiking Graph Network for Skeleton-Based Action Recognition

【速读】:该论文旨在解决传统图卷积网络(GCN)在动作识别中因密集浮点运算导致能耗过高,而脉冲神经网络(SNN)虽具事件驱动和稀疏激活优势但难以有效捕捉人体运动的时频耦合与拓扑依赖关系的问题。其解决方案的关键在于提出Signal-SGN++框架,通过引入1D脉冲图卷积(1D-SGC)与频域脉冲卷积(FSC)联合提取时空与频域特征,并嵌入拓扑自适应注意力机制(TSSA)以动态调整图结构上的注意力分布,同时设计多尺度小波变换融合分支(MWTF)与拓扑感知时频融合单元(TATF),在不增加计算复杂度的前提下实现结构先验引导的谱信息一致性融合,从而在显著降低能耗的同时达成优于现有SNN方法且媲美先进GCN的性能表现。

链接: https://arxiv.org/abs/2512.22214
作者: Naichuan Zheng,Xiahai Lun,Weiyi Li,Yuchen Du
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph Convolutional Networks (GCNs) demonstrate strong capability in modeling skeletal topology for action recognition, yet their dense floating-point computations incur high energy costs. Spiking Neural Networks (SNNs), characterized by event-driven and sparse activation, offer energy efficiency but remain limited in capturing coupled temporal-frequency and topological dependencies of human motion. To bridge this gap, this article proposes Signal-SGN++, a topology-aware spiking graph framework that integrates structural adaptivity with time-frequency spiking dynamics. The network employs a backbone composed of 1D Spiking Graph Convolution (1D-SGC) and Frequency Spiking Convolution (FSC) for joint spatiotemporal and spectral feature extraction. Within this backbone, a Topology-Shift Self-Attention (TSSA) mechanism is embedded to adaptively route attention across learned skeletal topologies, enhancing graph-level sensitivity without increasing computational complexity. Moreover, an auxiliary Multi-Scale Wavelet Transform Fusion (MWTF) branch decomposes spiking features into multi-resolution temporal-frequency representations, wherein a Topology-Aware Time-Frequency Fusion (TATF) unit incorporates structural priors to preserve topology-consistent spectral fusion. Comprehensive experiments on large-scale benchmarks validate that Signal-SGN++ achieves superior accuracy-efficiency trade-offs, outperforming existing SNN-based methods and achieving competitive results against state-of-the-art GCNs under substantially reduced energy consumption.
zh

[CV-188] TCFormer: A 5M-Parameter Transformer with Density-Guided Aggregation for Weakly-Supervised Crowd Counting

【速读】:该论文旨在解决人群计数任务中依赖人工标注点级标签(point-level annotations)和计算资源密集型骨干网络导致的可扩展性差及在资源受限环境部署困难的问题。其解决方案的关键在于提出了一种参数量仅为500万的轻量级、弱监督Transformer框架TCFormer:首先采用高效视觉Transformer作为特征提取器,利用其全局上下文感知能力获取语义有意义的群体特征且内存开销低;其次设计了可学习密度加权平均模块(Learnable Density-Weighted Averaging),通过动态调整局部token权重来补偿空间监督缺失,使模型能自适应地调节不同区域特征;最后引入密度级别分类损失(density-level classification loss),将人群密度离散化为多个等级以正则化训练过程并提升模型在不同密度下的判别能力。该方法仅需图像级全局计数标签即可实现高精度估计,兼顾参数效率与计数准确性,适合边缘设备部署。
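
编辑注:下面是"可学习密度加权平均"聚合思路的一个示意性 PyTorch 草图(结构与维度为假设):为每个局部 token 预测密度分数,再以 softmax 权重加权汇聚为全局表征。

```python
import torch
import torch.nn as nn

class DensityWeightedAveraging(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # 每个token预测一个密度分数

    def forward(self, tokens):
        # tokens: (B, N, D) 轻量ViT输出的局部token
        w = torch.softmax(self.score(tokens), dim=1)   # (B, N, 1)
        pooled = (w * tokens).sum(dim=1)               # (B, D) 送入计数回归头
        return pooled, w.squeeze(-1)                   # 权重亦可供密度等级分类损失使用
```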

链接: https://arxiv.org/abs/2512.22203
作者: Qiang Guo,Rubo Zhang,Bingbing Zhang,Junjie Liu,Jianqing Liu
机构: Dalian Minzu University (大连民族大学); Dalian University of Technology (大连理工大学); Dalian Rijia Electronics Co., Ltd. (大连瑞嘉电子有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Crowd counting typically relies on labor-intensive point-level annotations and computationally intensive backbones, restricting its scalability and deployment in resource-constrained environments. To address these challenges, this paper proposes the TCFormer, a tiny, ultra-lightweight, weakly-supervised transformer-based crowd counting framework with only 5 million parameters that achieves competitive performance. Firstly, a powerful yet efficient vision transformer is adopted as the feature extractor, the global context-aware capabilities of which provides semantic meaningful crowd features with a minimal memory footprint. Secondly, to compensate for the lack of spatial supervision, we design a feature aggregation mechanism termed the Learnable Density-Weighted Averaging module. This module dynamically re-weights local tokens according to predicted density scores, enabling the network to adaptively modulate regional features based on their specific density characteristics without the need for additional annotations. Furthermore, this paper introduces a density-level classification loss, which discretizes crowd density into distinct grades, thereby regularizing the training process and enhancing the model’s classification power across varying levels of crowd density. Therefore, although TCformer is trained under a weakly-supervised paradigm utilizing only image-level global counts, the joint optimization of count and density-level losses enables the framework to achieve high estimation accuracy. Extensive experiments on four benchmarks including ShanghaiTech A/B, UCF-QNRF, and NWPU datasets demonstrate that our approach strikes a superior trade-off between parameter efficiency and counting accuracy and can be a good solution for crowd counting tasks in edge devices.
zh

[CV-189] Quadrant Segmentation VLM with Few-Shot Adaptation and OCT Learning-based Explainability Methods for Diabetic Retinopathy

【速读】:该论文旨在解决糖尿病视网膜病变(Diabetic Retinopathy, DR)早期诊断中因临床资源有限导致的漏诊问题,以及现有AI模型在解释性方面的局限性——即仅依赖单一模态(如眼底照相)进行病变定位,缺乏对医生决策逻辑的模拟和多模态协同分析能力。解决方案的关键在于提出一种基于视觉语言模型(Vision-Language Model, VLM)的多模态可解释性方法,结合少样本学习(few-shot learning),通过分析眼底图像中病变在四个象限内的分布模式来模拟眼科医生的推理过程,并生成配对的Grad-CAM热力图,直观展示OCT与眼底图像中影响DR严重程度分类的关键区域,从而实现更准确、更具临床可信度的诊断解释。

链接: https://arxiv.org/abs/2512.22197
作者: Shivum Telang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 6 figures

点击查看摘要

Abstract:Diabetic Retinopathy (DR) is a leading cause of vision loss worldwide, requiring early detection to preserve sight. Limited access to physicians often leaves DR undiagnosed. To address this, AI models utilize lesion segmentation for interpretability; however, manually annotating lesions is impractical for clinicians. Physicians require a model that explains the reasoning for classifications rather than just highlighting lesion locations. Furthermore, current models are one-dimensional, relying on a single imaging modality for explainability and achieving limited effectiveness. In contrast, a quantitative-detection system that identifies individual DR lesions in natural language would overcome these limitations, enabling diverse applications in screening, treatment, and research settings. To address this issue, this paper presents a novel multimodal explainability model utilizing a VLM with few-shot learning, which mimics an ophthalmologist’s reasoning by analyzing lesion distributions within retinal quadrants for fundus images. The model generates paired Grad-CAM heatmaps, showcasing individual neuron weights across both OCT and fundus images, which visually highlight the regions contributing to DR severity classification. Using a dataset of 3,000 fundus images and 1,000 OCT images, this innovative methodology addresses key limitations in current DR diagnostics, offering a practical and comprehensive tool for improving patient outcomes.
zh

[CV-190] Tiny-YOLOSAM: Fast Hybrid Image Segmentation

【速读】:该论文旨在解决生成式分割模型(如Segment Anything Model, SAM)在延迟敏感场景中计算开销过大、效率低下的问题,尤其是其轻量版TinySAM虽保留了零样本分割质量,但在“分割全部”模式下仍需大量提示(prompts)且运行缓慢。解决方案的关键在于提出一种快速混合流水线Tiny-YOLOSAM:利用最新的YOLOv12检测器(YOLO detector)为显著前景目标生成边界框提示(box prompts),并仅在YOLO引导的掩码未覆盖区域稀疏采样点提示(sparse point prompts),从而大幅减少冗余计算。实验表明,该方法在COCO val2017上将类无关覆盖率(AR)从16.4%提升至77.1%,平均交并比(mIoU)从19.2%提升至67.8%,同时端到端推理时间从49.20秒/图像降至10.39秒/图像(加速4.7倍),验证了检测器引导提示与针对性稀疏采样相结合是一种高效替代密集“分割全部”提示的实用方案。
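
编辑注:以下伪代码式 Python 草图概括该混合流水线的两步逻辑(检测框提示 + 未覆盖区域稀疏点提示);`yolo.detect`、`tinysam.segment` 等接口与网格步长均为本文假设。

```python
import numpy as np

def hybrid_segment(image, yolo, tinysam, grid_step=64):
    masks = []
    covered = np.zeros(image.shape[:2], dtype=bool)
    for box in yolo.detect(image):                  # 1) YOLO框作为box prompt
        m = tinysam.segment(image, box_prompt=box)  # m: HxW 布尔掩码
        masks.append(m)
        covered |= m
    # 2) 仅在未被掩码覆盖的位置按稀疏网格补充point prompt
    for y in range(0, image.shape[0], grid_step):
        for x in range(0, image.shape[1], grid_step):
            if not covered[y, x]:
                m = tinysam.segment(image, point_prompt=(x, y))
                masks.append(m)
                covered |= m
    return masks
```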

链接: https://arxiv.org/abs/2512.22193
作者: Kenneth Xu,Songhan Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 6 figures, 5 tables. Code available at: this https URL

点击查看摘要

Abstract:The Segment Anything Model (SAM) enables promptable, high-quality segmentation but is often too computationally expensive for latency-critical settings. TinySAM is a lightweight, distilled SAM variant that preserves strong zero-shot mask quality, yet its “segment-everything” mode still requires hundreds of prompts and remains slow in practice. We first replicate TinySAM on COCO val2017 using official checkpoints, matching the reported AP within 0.03%, establishing a reliable experimental baseline. Building on this, we propose Tiny-YOLOSAM, a fast hybrid pipeline that uses a recent YOLO detector (YOLOv12) to generate box prompts for TinySAM on salient foreground objects, and supplements uncovered regions with sparse point prompts sampled only where YOLO-guided masks provide no coverage. On COCO val2017, the hybrid system substantially improves class-agnostic coverage (AR from 16.4% to 77.1%, mIoU from 19.2% to 67.8%) while reducing end-to-end runtime from 49.20s/image to 10.39s/image (4.7x) on an Apple M1 Pro CPU. These results suggest detector-guided prompting combined with targeted sparse sampling as an effective alternative to dense “segment-everything” prompting for practical full-scene segmentation.
zh

[CV-191] HookMIL: Revisiting Context Modeling in Multiple Instance Learning for Computational Pathology

【速读】:该论文旨在解决传统多实例学习(Multiple Instance Learning, MIL)在计算病理学中对全切片图像(Whole-Slide Images, WSI)进行弱监督分析时存在的两个核心问题:一是传统MIL方法常丢失关键上下文信息,二是基于Transformer的变体虽具备更强表达能力但存在二次复杂度和冗余计算。其解决方案的关键在于提出HookMIL框架,通过引入可学习的钩子令牌(Hook Tokens)实现结构化的上下文聚合;这些钩子令牌支持多模态初始化(来自关键patch视觉特征、视觉-语言病理模型文本嵌入及空间转录组学-视觉模型的空间定位特征),从而融合丰富的文本与空间先验知识,加速收敛并提升表征质量;同时,钩子令牌通过双向注意力机制以线性复杂度与实例交互,并结合钩子多样性损失(Hook Diversity Loss)促进各令牌聚焦不同组织病理学模式,辅以钩子间通信机制优化上下文交互并减少冗余,最终在四个公开病理数据集上实现最优性能与更高计算效率及可解释性。
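
编辑注:"钩子多样性损失"的具体公式未在摘要中给出;下面是一种常见做法的示意性实现(假设:惩罚钩子令牌两两余弦相似度),仅用于说明"促使各令牌聚焦不同模式"的思路。

```python
import torch
import torch.nn.functional as F

def hook_diversity_loss(hooks):
    """hooks: (K, D) K个钩子令牌;鼓励各令牌关注不同的组织病理学模式。"""
    h = F.normalize(hooks, dim=-1)
    sim = h @ h.t()                                  # (K, K) 两两余弦相似度
    K = h.size(0)
    off_diag = sim - torch.eye(K, device=h.device)   # 去掉自相似项
    return off_diag.abs().sum() / (K * (K - 1))
```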

链接: https://arxiv.org/abs/2512.22188
作者: Xitong Ling,Minxi Ouyang,Xiaoxiao Li,Jiawen Li,Ying Chen,Yuxuan Sun,Xinrui Chen,Tian Guan,Xiaoping Liu,Yonghong He
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); School of Informatics, Xiamen University (厦门大学信息学院); School of Engineering, Westlake University (西湖大学工学院); Zhongnan Hospital, Wuhan University (武汉大学中南医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multiple Instance Learning (MIL) has enabled weakly supervised analysis of whole-slide images (WSIs) in computational pathology. However, traditional MIL approaches often lose crucial contextual information, while transformer-based variants, though more expressive, suffer from quadratic complexity and redundant computations. To address these limitations, we propose HookMIL, a context-aware and computationally efficient MIL framework that leverages compact, learnable hook tokens for structured contextual aggregation. These tokens can be initialized from (i) key-patch visual features, (ii) text embeddings from vision-language pathology models, and (iii) spatially grounded features from spatial transcriptomics-vision models. This multimodal initialization enables Hook Tokens to incorporate rich textual and spatial priors, accelerating convergence and enhancing representation quality. During training, Hook tokens interact with instances through bidirectional attention with linear complexity. To further promote specialization, we introduce a Hook Diversity Loss that encourages each token to focus on distinct histopathological patterns. Additionally, a hook-to-hook communication mechanism refines contextual interactions while minimizing redundancy. Extensive experiments on four public pathology datasets demonstrate that HookMIL achieves state-of-the-art performance, with improved computational efficiency and interpretability. Codes are available at this https URL.
zh

[CV-192] SAMM2D: Scale-Aware Multi-Modal 2D Dual-Encoder for High-Sensitivity Intracranial Aneurysm Screening

【速读】:该论文旨在解决颅内动脉瘤(intracranial aneurysm)检测中因病灶形态细微、类别极度不平衡以及标注数据稀缺所带来的挑战。其解决方案的核心是提出一种双编码器框架SAMM2D,该框架利用强大的ImageNet预训练主干网络提取鲁棒特征,并通过校准决策阈值实现95%的敏感性,显著优于平均放射科医生表现;值得注意的是,研究发现在强预训练模型基础上添加任何数据增强策略均会降低性能,揭示了在低数据医疗场景下过度依赖数据增强可能适得其反,强调了高质量预训练比复杂增强策略更具价值。
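
编辑注:摘要提到"校准决策阈值以达到95%敏感性";下面给出该步骤的一种通用做法(在验证集 ROC 曲线上取第一个满足目标召回率的阈值),属示意性草图而非原文代码。

```python
import numpy as np
from sklearn.metrics import roc_curve

def threshold_for_sensitivity(y_true, y_score, target_tpr=0.95):
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    idx = int(np.argmax(tpr >= target_tpr))  # 第一个达到目标敏感性的位置
    return thresholds[idx]

# thr = threshold_for_sensitivity(val_labels, val_probs)
# y_pred = (test_probs >= thr).astype(int)
```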

链接: https://arxiv.org/abs/2512.22185
作者: Antara Titikhsha,Divyanshu Tak
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Effective aneurysm detection is essential to avert life-threatening hemorrhages, but it remains challenging due to the subtle morphology of the aneurysm, pronounced class imbalance, and the scarcity of annotated data. We introduce SAMM2D, a dual-encoder framework that achieves an AUC of 0.686 on the RSNA intracranial aneurysm dataset; an improvement of 32% over the clinical baseline. In a comprehensive ablation across six augmentation regimes, we made a striking discovery: any form of data augmentation degraded performance when coupled with a strong pretrained backbone. Our unaugmented baseline model outperformed all augmented variants by 1.75–2.23 percentage points (p < 0.01), overturning the assumption that “more augmentation is always better” in low-data medical settings. We hypothesize that ImageNet-pretrained features already capture robust invariances, rendering additional augmentations both redundant and disruptive to the learned feature manifold. By calibrating the decision threshold, SAMM2D reaches 95% sensitivity, surpassing average radiologist performance, and translates to a projected $13.9M in savings per 1,000 patients in screening applications. Grad-CAM visualizations confirm that 85% of true positives attend to relevant vascular regions (62% IoU with expert annotations), demonstrating the model’s clinically meaningful focus. Our results suggest that future medical imaging workflows could benefit more from strong pretraining than from increasingly complex augmentation pipelines.
zh

[CV-193] Enhancing Medical Data Analysis through AI-Enhanced Locally Linear Embedding: Applications in Medical Point Location and Imagery

【速读】:该论文旨在解决医疗数据处理中高维特征带来的复杂性与低效问题,尤其是在医疗计费(medical billing)和病历转录(transcription)等关键流程中的准确性与效率瓶颈。其解决方案的关键在于引入一种融合人工智能(Artificial Intelligence, AI)与局部线性嵌入(Locally Linear Embedding, LLE)的增强型降维模型,通过AI优化LLE算法对高维医学数据的非线性结构建模能力,从而显著提升数据处理精度与系统运行效率,同时减少人工干预导致的误差,为临床文档记录和财务结算提供更可靠的自动化支持。
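
编辑注:论文的"AI增强"部分无公开细节,此处仅给出标准 LLE 基线的最小用法示例(scikit-learn,参数为示例值),便于理解其降维起点。

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

X = np.random.rand(500, 64)   # 假设:500条64维医疗特征向量
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
X_low = lle.fit_transform(X)  # (500, 2) 低维嵌入,可供后续计费/转录流程使用
```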

链接: https://arxiv.org/abs/2512.22182
作者: Hassan Khalid,Muhammad Mahad Khaliq,Muhammad Jawad Bashir
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 5 pages, 3 figures. Accepted and published at 2024 19th International Conference on Emerging Technologies (ICET)

点击查看摘要

Abstract:The rapid evolution of Artificial intelligence in healthcare has opened avenues for enhancing various processes, including medical billing and transcription. This paper introduces an innovative approach by integrating AI with Locally Linear Embedding (LLE) to revolutionize the handling of high-dimensional medical data. This AI-enhanced LLE model is specifically tailored to improve the accuracy and efficiency of medical billing systems and transcription services. By automating these processes, the model aims to reduce human error and streamline operations, thereby facilitating faster and more accurate patient care documentation and financial transactions. This paper provides a comprehensive mathematical model of AI-enhanced LLE, demonstrating its application in real-world healthcare scenarios through a series of experiments. The results indicate a significant improvement in data processing accuracy and operational efficiency. This study not only underscores the potential of AI-enhanced LLE in medical data analysis but also sets a foundation for future research into broader healthcare applications.
zh

[CV-194] Real-Time American Sign Language Recognition Using 3D Convolutional Neural Networks and LSTM: Architecture Training and Deployment

【速读】:该论文旨在解决聋哑人群与听力正常者之间因语言障碍导致的沟通难题,提出了一种实时手语识别系统,以提升聋哑人士在日常交流中的可及性。其解决方案的关键在于采用一种融合3D卷积神经网络(3D Convolutional Neural Networks)与长短期记忆网络(Long Short-Term Memory, LSTM)的混合深度学习架构:3D CNN用于从视频帧中提取时空特征,捕捉手势的动态变化;LSTM则建模手语符号序列中的时序依赖关系,从而实现对词级美国手语(ASL)符号的高精度识别。该系统在WLASL、ASL-LEX等数据集上训练并取得F1分数0.71–0.99的性能表现,且支持在AWS云平台和OAK-D边缘设备上的实时部署,具备良好的实用性与可扩展性。
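
编辑注:以下为"3D CNN 提取时空特征 + LSTM 建模序列依赖"这一混合架构的示意性 PyTorch 草图,通道数、层数等均为假设值,并非论文原始网络。

```python
import torch.nn as nn

class CNN3DLSTM(nn.Module):
    def __init__(self, num_classes=100, hidden=256):
        super().__init__()
        self.cnn3d = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # 保留时间维,压缩空间维
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, clip):                          # clip: (B, 3, T, H, W)
        f = self.cnn3d(clip).squeeze(-1).squeeze(-1)  # (B, 64, T)
        out, _ = self.lstm(f.transpose(1, 2))         # (B, T, hidden)
        return self.fc(out[:, -1])                    # 取最后时间步做词级分类
```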

链接: https://arxiv.org/abs/2512.22177
作者: Dawnena Key
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 1 figure, 2 tables. Patent pending (US 63/918,518). Code available at this https URL

点击查看摘要

Abstract:This paper presents a real-time American Sign Language (ASL) recognition system utilizing a hybrid deep learning architecture combining 3D Convolutional Neural Networks (3D CNN) with Long Short-Term Memory (LSTM) networks. The system processes webcam video streams to recognize word-level ASL signs, addressing communication barriers for over 70 million deaf and hard-of-hearing individuals worldwide. Our architecture leverages 3D convolutions to capture spatial-temporal features from video frames, followed by LSTM layers that model sequential dependencies inherent in sign language gestures. Trained on the WLASL dataset (2,000 common words), ASL-LEX lexical database (~2,700 signs), and a curated set of 100 expert-annotated ASL signs, the system achieves F1-scores ranging from 0.71 to 0.99 across sign classes. The model is deployed on AWS infrastructure with edge deployment capability on OAK-D cameras for real-time inference. We discuss the architecture design, training methodology, evaluation metrics, and deployment considerations for practical accessibility applications.
zh

[CV-195] Characterizing Motion Encoding in Video Diffusion Timesteps

【速读】:该论文旨在解决生成式视频扩散模型中运动信息在时间步(timestep)上如何编码的问题,特别是缺乏对运动与外观(appearance)在去噪过程中动态交互机制的系统性理解。现有实践中常依赖经验性假设:早期时间步主要决定运动和布局,后期则优化细节外观,但这一行为未被量化验证。解决方案的关键在于提出一种基于条件注入扰动的代理指标——通过在特定时间步区间注入新条件时观察外观编辑能力与运动保真度之间的权衡关系,从而定量刻画运动与外观在去噪轨迹上的竞争机制。研究发现,在不同架构下均存在一个早期“运动主导”阶段和后期“外观主导”阶段,据此可明确划分出一个时空解耦边界(motion-appearance boundary)。进一步地,作者将训练与推理限制在运动主导区间内,无需额外去偏模块或特殊目标函数即可实现高效且高质量的单次运动迁移,将原有启发式策略转化为可操作的时空解耦原则,并为现有运动编辑方法提供即插即用的优化方案。

链接: https://arxiv.org/abs/2512.22175
作者: Vatsal Baherwani,Yixuan Ren,Abhinav Shrivastava
机构: University of Maryland (马里兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:Text-to-video diffusion models synthesize temporal motion and spatial appearance through iterative denoising, yet how motion is encoded across timesteps remains poorly understood. Practitioners often exploit the empirical heuristic that early timesteps mainly shape motion and layout while later ones refine appearance, but this behavior has not been systematically characterized. In this work, we proxy motion encoding in video diffusion timesteps by the trade-off between appearance editing and motion preservation induced when injecting new conditions over specified timestep ranges, and characterize this proxy through a large-scale quantitative study. This protocol allows us to factor motion from appearance by quantitatively mapping how they compete along the denoising trajectory. Across diverse architectures, we consistently identify an early, motion-dominant regime and a later, appearance-dominant regime, yielding an operational motion-appearance boundary in timestep space. Building on this characterization, we simplify current one-shot motion customization paradigm by restricting training and inference to the motion-dominant regime, achieving strong motion transfer without auxiliary debiasing modules or specialized objectives. Our analysis turns a widely used heuristic into a spatiotemporal disentanglement principle, and our timestep-constrained recipe can serve as ready integration into existing motion transfer and editing methods.
zh

[CV-196] SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models

【速读】:该论文旨在解决视频生成模型后训练阶段中奖励模型(Reward Model, RM)构建的三大核心问题:一是现有数据收集方法依赖提示内成对标注,易引入标签噪声;二是基于视觉语言模型(Vision-Language Model, VLM)的RM架构设计,尤其是输出机制尚未充分探索;三是RM在后训练过程中易出现奖励劫持(reward hacking)现象。解决方案的关键在于提出SoliReward框架:首先通过单样本二分类标注获取高质量且低成本的数据,并采用跨提示配对策略构建偏好对;其次引入分层渐进式查询注意力机制(Hierarchical Progressive Query Attention)以增强特征聚合能力;最后设计一种改进的BT损失函数,显式建模胜-平局场景,从而正则化正样本得分分布,提供更细致的偏好信号,缓解对少数高分样本的过度关注,提升RM在物理合理性、主体形变和语义对齐等基准上的表现。
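
编辑注:论文"显式建模胜-平局"的改进 BT 损失公式未在摘要中给出;下面是一种可能形式的示意性草图(假设:平局时约束两个分数之差小于 margin),仅供理解机制,非原文实现。

```python
import torch
import torch.nn.functional as F

def bt_loss_with_ties(r_a, r_b, label, margin=0.5):
    """r_a, r_b: 奖励模型对两个视频的打分;label=1 表示a胜b,label=0 表示平局。"""
    win = -F.logsigmoid(r_a - r_b)                   # 胜:标准Bradley-Terry项
    tie = -F.logsigmoid(margin - (r_a - r_b).abs())  # 平局:分差应小于margin
    return torch.where(label == 1, win, tie).mean()
```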

链接: https://arxiv.org/abs/2512.22170
作者: Jiesong Lian,Ruizhe Zhong,Zixiang Zhou,Xiaoyue Mi,Yixue Hao,Yuan Zhou,Qinglin Lu,Long Hu,Junchi Yan
机构: Huazhong University of Science and Technology (华中科技大学); Shanghai Jiao Tong University (上海交通大学); Tencent Hunyuan (腾讯混元); University of Chinese Academy of Sciences (中国科学院大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 9 figures

点击查看摘要

Abstract:Post-training alignment of video generation models with human preferences is a critical goal. Developing effective Reward Models (RMs) for this process faces significant methodological hurdles. Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise. Concurrently, the architectural design of VLM-based RMs, particularly their output mechanisms, remains underexplored. Furthermore, RM is susceptible to reward hacking in post-training. To mitigate these limitations, we propose SoliReward, a systematic framework for video RM training. Our framework first sources high-quality, cost-efficient data via single-item binary annotations, then constructs preference pairs using a cross-prompt pairing strategy. Architecturally, we employ a Hierarchical Progressive Query Attention mechanism to enhance feature aggregation. Finally, we introduce a modified BT loss that explicitly accommodates win-tie scenarios. This approach regularizes the RM’s score distribution for positive samples, providing more nuanced preference signals to alleviate over-focus on a small number of top-scoring samples. Our approach is validated on benchmarks evaluating physical plausibility, subject deformity, and semantic alignment, demonstrating improvements in direct RM evaluation metrics and in the efficacy of post-training on video generation models. Code and benchmark will be publicly available.
zh

[CV-197] SlimEdge: Lightweight Distributed DNN Deployment on Constrained Hardware

【速读】:该论文旨在解决深度分布式网络(Deep Distributed Networks, DNNs)在资源受限边缘设备上的高效部署问题,其核心挑战在于模型参数量大、计算需求高,难以适配异构硬件约束。解决方案的关键在于提出一种联合考虑硬件限制与任务性能的结构化模型剪枝方法,并结合多目标优化策略,动态调整网络容量以匹配不同设备的资源条件。具体而言,作者以多视角卷积神经网络(Multi-View Convolutional Neural Network, MVCNN)为例,通过量化各视角对分类精度的贡献来分配剪枝预算,从而在满足用户指定的准确率和内存上限的前提下,显著降低推理延迟(提升1.2x至5.0x),实现性能感知且视图自适应的压缩机制。

链接: https://arxiv.org/abs/2512.22136
作者: Mahadev Sunil Kumar,Arnab Raha,Debayan Das,Gopakumar G,Amitava Mukherjee
机构: Indian Institute of Science (印度科学研究所); Intel (英特尔); Amrita Vishwa Vidyapeetham (阿姆里塔世界大学); Birla Institute of Technology and Science, Pilani (比特拉理工学院与科学学院,比兰尼分校)
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep distributed networks (DNNs) have become central to modern computer vision, yet their deployment on resource-constrained edge devices remains hindered by substantial parameter counts and computational demands. Here, we present an approach to the efficient deployment of distributed DNNs that jointly respects hardware limitations and preserves task performance. Our method integrates a structured model pruning with a multi-objective optimization to tailor network capacity to heterogeneous device constraints. We demonstrate this framework using Multi-View Convolutional Neural Network (MVCNN), a state-of-the-art architecture for 3D object recognition, by quantifying the contribution of individual views to classification accuracy and allocating pruning budgets, respectively. Experimental results show that the resulting models satisfy user-specified bounds on accuracy and memory footprint while reducing inference latency by factors ranging from 1.2x to 5.0x across diverse hardware platforms. These findings suggest that performance-aware, view-adaptive compression provides a viable pathway for deploying complex vision models in distributed edge environments.
zh

[CV-198] EIR: Enhanced Image Representations for Medical Report Generation

【速读】:该论文旨在解决自动胸部X光片报告生成中的两个核心问题:一是现有方法在融合医学图像与多模态元数据(如患者临床文档或相似病例的医疗图谱)时,采用简单的“加法+层归一化”操作导致的信息不对称问题;二是基于通用领域预训练模型提取的图像表征存在显著的领域差距(domain gap),难以有效捕捉医学图像特征。解决方案的关键在于提出一种名为增强图像表征(Enhanced Image Representations, EIR)的新方法:首先利用跨模态Transformer实现元数据与图像表征的有效融合,缓解信息分布差异带来的不对称性;其次引入医学领域预训练模型对胸部X光图像进行编码,从而缩小通用视觉表征与医学图像之间的领域差距,提升报告生成的准确性。
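
编辑注:下面用一个最小 PyTorch 草图示意"交叉注意力融合"相对于"加法+层归一化"的接入方式;维度与结构均为本文假设。

```python
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """视觉表征作为query,元数据表征作为key/value,缓解两种分布间的信息不对称。"""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, metadata):
        # visual: (B, Nv, D) 医学域预训练编码器输出;metadata: (B, Nm, D)
        # 对比:简单的"Add and LayerNorm"要求两种表征逐位置对齐后直接相加
        fused, _ = self.attn(visual, metadata, metadata)
        return self.norm(visual + fused)  # 输出送入报告解码器
```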

链接: https://arxiv.org/abs/2512.23185
作者: Qiang Sun,Zongcheng Ji,Yinlong Xiao,Peng Chang,Jun Yu
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Generating medical reports from chest X-ray images is a critical and time-consuming task for radiologists, especially in emergencies. To alleviate the stress on radiologists and reduce the risk of misdiagnosis, numerous research efforts have been dedicated to automatic medical report generation in recent years. Most recent studies have developed methods that represent images by utilizing various medical metadata, such as the clinical document history of the current patient and the medical graphs constructed from retrieved reports of other similar patients. However, all existing methods integrate additional metadata representations with visual representations through a simple “Add and LayerNorm” operation, which suffers from the information asymmetry problem due to the distinct distributions between them. In addition, chest X-ray images are usually represented using pre-trained models based on natural domain images, which exhibit an obvious domain gap between general and medical domain images. To this end, we propose a novel approach called Enhanced Image Representations (EIR) for generating accurate chest X-ray reports. We utilize cross-modal transformers to fuse metadata representations with image representations, thereby effectively addressing the information asymmetry problem between them, and we leverage medical domain pre-trained models to encode medical images, effectively bridging the domain gap for image representation. Experimental results on the widely used MIMIC and Open-I datasets demonstrate the effectiveness of our proposed method.
zh

[CV-199] Deep Learning for Art Market Valuation

【速读】:该论文旨在解决艺术市场估值中传统模型难以有效捕捉视觉特征信息的问题,尤其是在缺乏历史交易数据的首次上拍作品(fresh-to-market works)中。其解决方案的关键在于引入多模态深度学习架构,将艺术品的图像内容(通过视觉嵌入表示)与结构化表格数据(如艺术家身份、历史交易记录等)融合建模,从而在无历史锚定的情况下提升估值准确性。研究表明,视觉嵌入能提供独立且具有经济意义的预测贡献,尤其在首次销售场景下表现突出,且模型通过Grad-CAM和嵌入可视化验证了对构图与风格线索的关注,体现了生成式AI在艺术市场估值中的实际价值。

链接: https://arxiv.org/abs/2512.23078
作者: Jianping Mei,Michael Moses,Jan Waelty,Yucheng Yang
机构: CKGSB(长江商学院); Art Market Consultancy(艺术市场咨询公司); University of Zurich(苏黎世大学); Swiss Finance Institute(瑞士金融研究所)
类目: General Finance (q-fin.GN); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); General Economics (econ.GN)
备注:

点击查看摘要

Abstract:We study how deep learning can improve valuation in the art market by incorporating the visual content of artworks into predictive models. Using a large repeated-sales dataset from major auction houses, we benchmark classical hedonic regressions and tree-based methods against modern deep architectures, including multi-modal models that fuse tabular and image data. We find that while artist identity and prior transaction history dominate overall predictive power, visual embeddings provide a distinct and economically meaningful contribution for fresh-to-market works where historical anchors are absent. Interpretability analyses using Grad-CAM and embedding visualizations show that models attend to compositional and stylistic cues. Our findings demonstrate that multi-modal deep learning delivers significant value precisely when valuation is hardest, namely first-time sales, and thus offers new insights for both academic research and practice in art market valuation.
zh

[CV-200] A Rapid GeoSAM-Based Workflow for Multi-Temporal Glacier Delineation: Case Study from Svalbard

【速读】:该论文旨在解决多时相、异质环境中冰川边界自动识别难以规模化的问题(即如何实现高效且一致的冰川范围提取)。其解决方案的关键在于提出了一种基于GeoSAM的半自动工作流程,整合了晚夏影像合成、基于光谱指数的候选冰体区域识别、提示引导的GeoSAM分割以及物理基础后处理步骤,从而实现年度冰川轮廓的快速生成与时空一致性保障。该方法在斯瓦尔巴特群岛西部地区验证有效,虽对小尺度特征(如水体干扰、地形阴影或高表面变异性)仍存在误差,但因其依赖于衍生RGB影像,具备良好的灵活性和跨光学数据集的可迁移性。
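
编辑注:以下草图演示"光谱指数筛选候选冰体、再导出box prompt交给GeoSAM"的骨架;以 NDSI 为例的指数选择以及 `geosam.segment`、`connected_components` 等接口均为本文假设。

```python
import numpy as np

def candidate_ice_mask(green, swir, thr=0.4):
    ndsi = (green - swir) / (green + swir + 1e-6)  # 归一化差异雪指数
    return ndsi > thr

def mask_to_box(mask):
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max(), ys.max()  # (xmin, ymin, xmax, ymax)

# for region in connected_components(candidate_ice_mask(g, s)):  # 假设的辅助函数
#     outline = geosam.segment(rgb_composite, box_prompt=mask_to_box(region))
```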

链接: https://arxiv.org/abs/2512.22855
作者: Alexandru Hegyi
机构: 未知
类目: Geophysics (physics.geo-ph); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Consistent glacier boundary delineation is essential for monitoring glacier change, yet many existing approaches are difficult to scale across long time series and heterogeneous environments. In this report, we present a GeoSAM-based, semi-automatic workflow for rapid glacier delineation from Sentinel-2 surface reflectance imagery. The method combines late-summer image compositing, spectral-index-based identification of candidate ice areas, prompt-guided segmentation using GeoSAM, and physically based post-processing to derive annual glacier outlines. The workflow is demonstrated in the Ny-Alesund and Kongsfjorden region of western Svalbard across multiple years of the Sentinel-2 era. Results show that the approach produces spatially coherent and temporally consistent outlines for major glacier bodies, while most errors are associated with small features affected by water bodies, terrain shadows, or high surface variability. The reliance on derived RGB imagery makes the method flexible and transferable to other optical datasets, with improved performance expected at higher spatial resolution. Although user inspection remains necessary to filter incorrect polygons and adjust thresholds for local conditions, the workflow provides a fast and practical alternative for multi-temporal glacier mapping and ice-loss assessment.
zh

[CV-201] SwinCCIR: An end-to-end deep network for Compton camera imaging reconstruction

【速读】:该论文旨在解决康普顿相机(Compton Camera, CC)在成像过程中因反投影法重建原理导致的严重伪影和形变问题,以及由设备性能引起的系统误差难以通过校准消除所造成的图像质量下降问题。解决方案的关键在于提出了一种端到端的深度学习框架SwinCCIR,其核心创新在于采用Swin-Transformer模块建立列表模式事件(list-mode events)与放射性源分布之间的映射关系,并结合转置卷积(transposed convolution)结构的图像生成模块,实现从原始数据到高质量图像的直接映射,从而有效克服传统方法的局限性。

链接: https://arxiv.org/abs/2512.22766
作者: Minghao Dong,Xinyang Luo,Xujian Ouyang,Yongshun Xiao
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Nuclear Experiment (nucl-ex)
备注: 10 pages, 7 figures

点击查看摘要

Abstract:Compton cameras (CCs) are a kind of gamma cameras which are designed to determine the directions of incident gammas based on the Compton scatter. However, the reconstruction of CCs face problems of severe artifacts and deformation due to the fundamental reconstruction principle of back-projection of Compton cones. Besides, a part of systematic errors originated from the performance of devices are hard to remove through calibration, leading to deterioration of imaging quality. Iterative algorithms and deep-learning based methods have been widely used to improve reconstruction. But most of them are optimization based on the results of back-projection. Therefore, we proposed an end-to-end deep learning framework, SwinCCIR, for CC imaging. Through adopting swin-transformer blocks and a transposed convolution-based image generation module, we established the relationship between the list-mode events and the radioactive source distribution. SwinCCIR was trained and validated on both simulated and practical dataset. The experimental results indicate that SwinCCIR effectively overcomes problems of conventional CC imaging, which are expected to be implemented in practical applications.
zh

[CV-202] Semantic contrastive learning for orthogonal X-ray computed tomography reconstruction

【速读】:该论文旨在解决稀疏视图X射线计算机断层成像(X-ray computed tomography, CT)中因病态条件导致的严重条纹伪影问题,同时在保持低计算复杂度的前提下提升重建图像质量。其解决方案的关键在于提出了一种新颖的语义特征对比学习损失函数(semantic feature contrastive learning loss function),该函数在浅层潜在空间中评估解剖相似性,在高层潜在空间中评估语义相似性;并采用三阶段U-Net架构:第一阶段进行粗略重建,第二阶段细化细节,第三阶段测量语义相似性,从而实现高质量且高效的正交CT重建。

链接: https://arxiv.org/abs/2512.22674
作者: Jiashu Dong,Jiabing Xiang,Lisheng Geng,Suqing Tian,Wei Zhao
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注: This paper is accepted by Fully3D 2025

点击查看摘要

Abstract:X-ray computed tomography (CT) is widely used in medical imaging, with sparse-view reconstruction offering an effective way to reduce radiation dose. However, ill-posed conditions often result in severe streak artifacts. Recent advances in deep learning-based methods have improved reconstruction quality, but challenges still remain. To address these challenges, we propose a novel semantic feature contrastive learning loss function that evaluates semantic similarity in high-level latent spaces and anatomical similarity in shallow latent spaces. Our approach utilizes a three-stage U-Net-based architecture: one for coarse reconstruction, one for detail refinement, and one for semantic similarity measurement. Tests on a chest dataset with orthogonal projections demonstrate that our method achieves superior reconstruction quality and faster processing compared to other algorithms. The results show significant improvements in image quality while maintaining low computational complexity, making it a practical solution for orthogonal CT reconstruction.
zh

[CV-203] JParc: Joint cortical surface parcellation with registration

【速读】:该论文旨在解决皮层表面分割(cortical surface parcellation)任务中传统学习方法因脱离配准(registration)和图谱传播(atlas propagation)而难以解释性能提升的问题。其解决方案的关键在于提出JParc框架,该框架将皮层配准与分割联合建模,通过引入一个浅层子网络对传播的图谱标签进行微调,从而实现高精度分割;实验表明,该方法在Mindboggle数据集上Dice分数超过90%,且仅依赖基本几何特征(如脑沟深度、曲率)即可达到优于现有最先进方法的性能,验证了准确配准和学习得到的分割图谱是性能提升的核心因素。

链接: https://arxiv.org/abs/2512.22485
作者: Jian Li,Karthik Gopinath,Brian L. Edlow,Adrian V. Dalca,Bruce Fischl
机构: Massachusetts General Hospital (麻省总医院); Harvard Medical School (哈佛医学院)
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: A. V. Dalca and B. Fischl are co-senior authors with equal contributions

点击查看摘要

Abstract:Cortical surface parcellation is a fundamental task in both basic neuroscience research and clinical applications, enabling more accurate mapping of brain regions. Model-based and learning-based approaches for automated parcellation alleviate the need for manual labeling. Despite the advancement in parcellation performance, learning-based methods shift away from registration and atlas propagation without exploring the reason for the improvement compared to traditional methods. In this study, we present JParc, a joint cortical registration and parcellation framework, that outperforms existing state-of-the-art parcellation methods. In rigorous experiments, we demonstrate that the enhanced performance of JParc is primarily attributable to accurate cortical registration and a learned parcellation atlas. By leveraging a shallow subnetwork to fine-tune the propagated atlas labels, JParc achieves a Dice score greater than 90% on the Mindboggle dataset, using only basic geometric features (sulcal depth, curvature) that describe cortical folding patterns. The superior accuracy of JParc can significantly increase the statistical power in brain mapping studies as well as support applications in surgical planning and many other downstream neuroscientific and clinical tasks.
zh

[CV-204] MEGA-PCC: A Mamba-based Efficient Approach for Joint Geometry and Attribute Point Cloud Compression WACV2026

【速读】:该论文旨在解决点云几何(geometry)与属性(attribute)联合压缩中存在的人工比特率分配和后处理着色(post-hoc recoloring)问题,这些问题限制了端到端优化并增加了系统复杂性。其解决方案的关键在于提出一种完全端到端的基于学习的框架 MEGA-PCC,该框架包含两个专用模型:一是共享编码器将几何与属性信息统一编码为潜在表示,随后通过双解码器依次重建几何与属性;二是基于 Mamba 架构的熵模型(Mamba-based Entropy Model, MEM),用于捕捉空间与通道相关性以提升概率估计精度。该设计消除了对人工调参和额外着色步骤的依赖,实现了训练阶段的数据驱动比特率分配,显著提升了压缩性能与运行效率。

链接: https://arxiv.org/abs/2512.22463
作者: Kai-Hsiang Hsieh,Monyneath Yim,Wen-Hsiao Peng,Jui-Chiu Chiang
机构: National Chung Cheng University (国立中正大学); National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the IEEE/CVF Winter Conference on Applications of Computer Vision 2026 (WACV 2026)

点击查看摘要

Abstract:Joint compression of point cloud geometry and attributes is essential for efficient 3D data representation. Existing methods often rely on post-hoc recoloring procedures and manually tuned bitrate allocation between geometry and attribute bitstreams in inference, which hinders end-to-end optimization and increases system complexity. To overcome these limitations, we propose MEGA-PCC, a fully end-to-end, learning-based framework featuring two specialized models for joint compression. The main compression model employs a shared encoder that encodes both geometry and attribute information into a unified latent representation, followed by dual decoders that sequentially reconstruct geometry and then attributes. Complementing this, the Mamba-based Entropy Model (MEM) enhances entropy coding by capturing spatial and channel-wise correlations to improve probability estimation. Both models are built on the Mamba architecture to effectively model long-range dependencies and rich contextual features. By eliminating the need for recoloring and heuristic bitrate tuning, MEGA-PCC enables data-driven bitrate allocation during training and simplifies the overall pipeline. Extensive experiments demonstrate that MEGA-PCC achieves superior rate-distortion performance and runtime efficiency compared to both traditional and learning-based baselines, offering a powerful solution for AI-driven point cloud compression.
zh

[CV-205] Super-Resolution Enhancement of Medical Images Based on Diffusion Model: An Optimization Scheme for Low-Resolution Gastric Images

【速读】:该论文旨在解决胶囊内镜(capsule endoscopy)图像因硬件、功耗和传输限制导致的固有低分辨率问题,从而影响对黏膜细微纹理和早期病变特征的识别。其解决方案的关键在于采用基于扩散模型(diffusion-based)的超分辨率框架——具体为SR3(Super-Resolution via Repeated Refinement),该方法建立在去噪扩散概率模型(Denoising Diffusion Probabilistic Models, DDPMs)之上,通过学习从低分辨率到高分辨率图像的概率映射来实现数据驱动且解剖结构一致的图像增强。相较于GAN类方法易出现训练不稳定和伪影问题,扩散模型提供更稳定的似然训练机制,并显著提升结构保真度,实验表明该方法在HyperKvasir数据集上优于双三次插值与ESRGAN等主流方法,在PSNR和SSIM指标上均有明显提升。

链接: https://arxiv.org/abs/2512.22209
作者: Haozhe Jia
机构: Boston University (波士顿大学)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 16 figures. Undergraduate final year project

点击查看摘要

Abstract:Capsule endoscopy has enabled minimally invasive gastrointestinal imaging, but its clinical utility is limited by the inherently low resolution of captured images due to hardware, power, and transmission constraints. This limitation hampers the identification of fine-grained mucosal textures and subtle pathological features essential for early diagnosis. This work investigates a diffusion-based super-resolution framework to enhance capsule endoscopy images in a data-driven and anatomically consistent manner. We adopt the SR3 (Super-Resolution via Repeated Refinement) framework built upon Denoising Diffusion Probabilistic Models (DDPMs) to learn a probabilistic mapping from low-resolution to high-resolution images. Unlike GAN-based approaches that often suffer from training instability and hallucination artifacts, diffusion models provide stable likelihood-based training and improved structural fidelity. The HyperKvasir dataset, a large-scale publicly available gastrointestinal endoscopy dataset, is used for training and evaluation. Quantitative results demonstrate that the proposed method significantly outperforms bicubic interpolation and GAN-based super-resolution methods such as ESRGAN, achieving PSNR of 27.5 dB and SSIM of 0.65 for a baseline model, and improving to 29.3 dB and 0.71 with architectural enhancements including attention mechanisms. Qualitative results show improved preservation of anatomical boundaries, vascular patterns, and lesion structures. These findings indicate that diffusion-based super-resolution is a promising approach for enhancing non-invasive medical imaging, particularly in capsule endoscopy where image resolution is fundamentally constrained.
zh

[CV-206] Complex Swin Transformer for Accelerating Enhanced SMWI Reconstruction

【速读】:该论文旨在解决Susceptibility Map Weighted Imaging (SMWI)在帕金森病诊断中因全分辨率采集耗时较长而限制临床应用的问题。其核心解决方案是提出一种基于复值Swin Transformer的网络架构,用于从低分辨率k空间数据中实现多回波MRI数据的超分辨率重建。该方法能够在保持关键诊断特征的前提下,显著提升图像质量(如结构相似性指数达0.9116,均方误差为0.076),从而实现扫描时间缩短而不牺牲诊断细节,提升SMWI的临床适用性和神经影像学工作流程效率。

链接: https://arxiv.org/abs/2512.22202
作者: Muhammad Usman,Sung-Min Gho
机构: Stanford University (斯坦福大学); DeepNoid Inc.
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Published at ISMRM 2025 (Abstract #2651)

点击查看摘要

Abstract:Susceptibility Map Weighted Imaging (SMWI) is an advanced magnetic resonance imaging technique used to detect nigral hyperintensity in Parkinsons disease. However, full resolution SMWI acquisition is limited by long scan times. Efficient reconstruction methods are therefore required to generate high quality SMWI from reduced k space data while preserving diagnostic relevance. In this work, we propose a complex valued Swin Transformer based network for super resolution reconstruction of multi echo MRI data. The proposed method reconstructs high quality SMWI images from low resolution k space inputs. Experimental results demonstrate that the method achieves a structural similarity index of 0.9116 and a mean squared error of 0.076 when reconstructing SMWI from 256 by 256 k space data, while maintaining critical diagnostic features. This approach enables high quality SMWI reconstruction from reduced k space sampling, leading to shorter scan times without compromising diagnostic detail. The proposed method has the potential to improve the clinical applicability of SMWI for Parkinsons disease and support faster and more efficient neuroimaging workflows.
zh

[CV-207] AI-Enhanced Virtual Biopsies for Brain Tumor Diagnosis in Low Resource Settings

【速读】:该论文旨在解决低资源临床环境中脑肿瘤及时诊断的难题,尤其是在缺乏专业神经放射学解读、高端MRI设备及侵入性活检的情况下。其关键解决方案是构建一个轻量级的虚拟活检(virtual biopsy)流程,采用双分支架构:一方面使用MobileNetV2作为轻量化卷积神经网络(CNN)提取图像特征进行四类脑肿瘤分类;另一方面引入基于手工设计的影像组学(radiomics-style)特征,包括病变形状、灰度统计和灰度共生矩阵(GLCM)纹理等8个描述符,以增强模型可解释性。通过晚期融合策略将CNN嵌入向量与影像组学特征拼接后输入随机森林(RandomForest)分类器,实现性能提升与决策可解释性的平衡。实验表明该融合方法在Kaggle公开数据集上优于单一分支基线,并在低分辨率和噪声扰动下展现出对低资源成像条件的鲁棒性。

链接: https://arxiv.org/abs/2512.22184
作者: Areeb Ehsan
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 10 figures

点击查看摘要

Abstract:Timely brain tumor diagnosis remains challenging in low-resource clinical environments where expert neuroradiology interpretation, high-end MRI hardware, and invasive biopsy procedures may be limited. Although deep learning has achieved strong performance in brain tumor analysis, real-world adoption is constrained by computational demands, dataset shift across scanners, and limited interpretability. This paper presents a prototype virtual biopsy pipeline for four-class classification of 2D brain MRI images using a lightweight convolutional neural network (CNN) and complementary radiomics-style handcrafted features. A MobileNetV2-based CNN is trained for classification, while an interpretable radiomics branch extracts eight features capturing lesion shape, intensity statistics, and gray-level co-occurrence matrix (GLCM) texture descriptors. A late fusion strategy concatenates CNN embeddings with radiomics features and trains a RandomForest classifier on the fused representation. Explainability is provided via Grad-CAM visualizations and radiomics feature importance analysis. Experiments on a public Kaggle brain tumor MRI dataset show improved validation performance for fusion relative to single-branch baselines, while robustness tests under reduced resolution and additive noise highlight sensitivity relevant to low-resource imaging conditions. The system is framed as decision support and not a substitute for clinical diagnosis or histopathology.
zh

[CV-208] Field strength-dependent performance variability in deep learning-based analysis of magnetic resonance imaging

【速读】:该论文旨在解决磁共振成像(MRI)扫描仪磁场强度对基于深度学习的分割算法性能和泛化能力的影响问题。其关键解决方案是通过构建三类模型——仅在1.5T数据上训练的模型(m-1.5T)、仅在3.0T数据上训练的模型(m-3.0T)以及混合1.5T与3.0T数据训练的模型(m-combined),并在不同场强的验证集上系统评估其分割性能(以Dice相似系数DSC为指标),同时结合UMAP聚类和放射组学分析揭示场强相关的特征差异。结果表明,磁场强度显著影响软组织分割模型的性能,尤其在乳腺肿瘤和胰腺任务中,3.0T训练模型表现最优;而骨性结构则表现出更强的场强鲁棒性。这提示磁场强度应作为影响AI性能的重要混杂因素加以考虑。

链接: https://arxiv.org/abs/2512.22176
作者: Muhammad Ibtsaam Qadir,Duane Schonlau,Ulrike Dydak,Fiona R. Kolbinger
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注: 16 pages, 1 table, 4 figures

点击查看摘要

Abstract:This study quantitatively evaluates the impact of MRI scanner magnetic field strength on the performance and generalizability of deep learning-based segmentation algorithms. Three publicly available MRI datasets (breast tumor, pancreas, and cervical spine) were stratified by scanner field strength (1.5T vs. 3.0T). For each segmentation task, three nnU-Net-based models were developed: A model trained on 1.5T data only (m-1.5T), a model trained on 3.0T data only (m-3.0T), and a model trained on pooled 1.5T and 3.0T data (m-combined). Each model was evaluated on both 1.5T and 3.0T validation sets. Field-strength-dependent performance differences were investigated via Uniform Manifold Approximation and Projection (UMAP)-based clustering and radiomic analysis, including 23 first-order and texture features. For breast tumor segmentation, m-3.0T (DSC: 0.494 [1.5T] and 0.433 [3.0T]) significantly outperformed m-1.5T (DSC: 0.411 [1.5T] and 0.289 [3.0T]) and m-combined (DSC: 0.373 [1.5T] and 0.268 [3.0T]) on both validation sets (p < 0.0001). Pancreas segmentation showed similar trends: m-3.0T achieved the highest DSC (0.774 [1.5T], 0.840 [3.0T]), while m-1.5T underperformed significantly (p < 0.0001). For cervical spine, models performed optimally on same-field validation sets with minimal cross-field performance degradation (DSC > 0.92 for all comparisons). Radiomic analysis revealed moderate field-strength-dependent clustering in soft tissues (silhouette scores 0.23-0.29) but minimal separation in osseous structures (< 0.12). These results indicate that magnetic field strength in the training data substantially influences the performance of deep learning-based segmentation models, particularly for soft-tissue structures (e.g., small lesions). This warrants consideration of magnetic field strength as a confounding factor in studies evaluating AI performance on MRI.
zh

人工智能

[AI-0] AI tutoring can safely and effectively support students: An exploratory RCT in UK classrooms

【速读】:该论文旨在解决个性化教育中“一对一辅导”因成本高昂而难以大规模推广的问题。其解决方案的关键在于开发并验证一个经过教学法微调的生成式AI模型(LearnLM),该模型在人类专家导师监督下生成教学内容,并通过随机对照试验(RCT)证明其能有效替代部分人工辅导。实验结果显示,LearnLM生成的教学信息经导师审核后76.4%无需或仅需极小修改,且学生使用该系统后在多个学习指标上表现不劣于甚至优于接受纯人工辅导的学生,尤其在解决新领域问题上的成功率高出5.5个百分点。这表明,经过教学法优化的生成式AI可在保持教学质量的同时显著提升个体化学习支持的可及性与规模。

链接: https://arxiv.org/abs/2512.23633
作者: LearnLM Team Google,Eedi:Albert Wang,Aliya Rysbek,Andrea Huber,Anjali Nambiar,Anna Kenolty,Ben Caulfield,Beth Lilley-Draper,Bibi Groot,Brian Veprek,Chelsea Burdett,Claire Willis,Craig Barton,Digory Smith,George Mu,Harriet Walters,Irina Jurenka,Iris Hulls,James Stalley-Moores,Jonathan Caton,Julia Wilkowski,Kaiz Alarakyia,Kevin R. McKee,Liam McCafferty,Lucy Dalton,Markus Kunesch,Pauline Malubay,Rachel Kidson,Rich Wells,Sam Wheeler,Sara Wiltberger,Shakir Mohamed,Simon Woodhead,Vasco Brazão
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:One-to-one tutoring is widely considered the gold standard for personalized education, yet it remains prohibitively expensive to scale. To evaluate whether generative AI might help expand access to this resource, we conducted an exploratory randomized controlled trial (RCT) with N = 165 students across five UK secondary schools. We integrated LearnLM – a generative AI model fine-tuned for pedagogy – into chat-based tutoring sessions on the Eedi mathematics platform. In the RCT, expert tutors directly supervised LearnLM, with the remit to revise each message it drafted until they would be satisfied sending it themselves. LearnLM proved to be a reliable source of pedagogical instruction, with supervising tutors approving 76.4% of its drafted messages making zero or minimal edits (i.e., changing only one or two characters). This translated into effective tutoring support: students guided by LearnLM performed at least as well as students chatting with human tutors on each learning outcome we measured. In fact, students who received support from LearnLM were 5.5 percentage points more likely to solve novel problems on subsequent topics (with a success rate of 66.2%) than those who received tutoring from human tutors alone (rate of 60.7%). In interviews, tutors highlighted LearnLM’s strength at drafting Socratic questions that encouraged deeper reflection from students, with multiple tutors even reporting that they learned new pedagogical practices from the model. Overall, our results suggest that pedagogically fine-tuned AI tutoring systems may play a promising role in delivering effective, individualized learning support at scale.
zh

[AI-1] BOAD: Discovering Hierarchical Software Engineering Agents via Bandit Optimization

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在真实软件工程(Software Engineering, SWE)任务中泛化能力差的问题,尤其是在长周期、分布外(out-of-distribution)场景下,传统单代理(single-agent)架构因无法有效分解复杂任务而导致上下文冗余、虚假关联和性能下降。其解决方案的关键在于提出一种基于多臂赌博机(Multi-Armed Bandit, MAB)的自动层次结构发现框架——Bandit Optimization for Agent Design (BOAD),将SWE任务拆解为由专业化子代理(sub-agents)协作完成的子任务(如定位、编辑与验证),并通过MAB机制高效探索子代理组合空间,在有限评估预算下自动识别最优层级结构并实现可解释的信用分配,从而显著提升模型在长周期SWE任务中的泛化能力。
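
编辑注:下面用一个通用的 UCB 多臂赌博机骨架示意 BOAD"把候选子代理当作臂、在有限预算下分配评估次数"的思路;候选列表与奖励信号均为假设,非论文实现。

```python
import math
import random

def ucb_select(stats, t, c=1.0):
    # stats: {arm: [拉取次数, 平均奖励]}
    for arm, (n, _) in stats.items():
        if n == 0:
            return arm  # 每个臂先探索一次
    return max(stats, key=lambda a: stats[a][1] + c * math.sqrt(math.log(t) / stats[a][0]))

stats = {a: [0, 0.0] for a in ["locator", "editor", "validator", "tester"]}  # 假设的候选子代理
for t in range(1, 201):
    arm = ucb_select(stats, t)
    reward = random.random()  # 占位:实际应为该子代理参与协作解题的成功信号
    n, mu = stats[arm]
    stats[arm] = [n + 1, mu + (reward - mu) / (n + 1)]  # 增量更新平均奖励
```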

链接: https://arxiv.org/abs/2512.23631
作者: Iris Xu,Guangtao Zeng,Zexue He,Charles Jin,Aldo Pareja,Dan Gutfreund,Chuang Gan,Zhang-Wei Hong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have shown strong reasoning and coding capabilities, yet they struggle to generalize to real-world software engineering (SWE) problems that are long-horizon and out of distribution. Existing systems often rely on a single agent to handle the entire workflow-interpreting issues, navigating large codebases, and implementing fixes-within one reasoning chain. Such monolithic designs force the model to retain irrelevant context, leading to spurious correlations and poor generalization. Motivated by how human engineers decompose complex problems, we propose structuring SWE agents as orchestrators coordinating specialized sub-agents for sub-tasks such as localization, editing, and validation. The challenge lies in discovering effective hierarchies automatically: as the number of sub-agents grows, the search space becomes combinatorial, and it is difficult to attribute credit to individual sub-agents within a team. We address these challenges by formulating hierarchy discovery as a multi-armed bandit (MAB) problem, where each arm represents a candidate sub-agent and the reward measures its helpfulness when collaborating with others. This framework, termed Bandit Optimization for Agent Design (BOAD), enables efficient exploration of sub-agent designs under limited evaluation budgets. On SWE-bench-Verified, BOAD outperforms single-agent and manually designed multi-agent systems. On SWE-bench-Live, featuring more recent and out-of-distribution issues, our 36B system ranks second on the leaderboard at the time of evaluation, surpassing larger models such as GPT-4 and Claude. These results demonstrate that automatically discovered hierarchical multi-agent systems significantly improve generalization on challenging long-horizon SWE tasks. Code is available at this https URL.
zh

[AI-2] Regret-Based Federated Causal Discovery with Unknown Interventions

【速读】:该论文旨在解决联邦环境下的因果发现问题,特别是在存在未知客户端级干预(client-level interventions)的情况下,传统方法通常假设所有客户端共享相同的因果模型,这在实际场景中(如不同医院间的医疗数据)并不成立。解决方案的关键在于提出 I-PERI 算法,该算法首先恢复各客户端图的并集所对应的 CPDAG(Completed Partially Directed Acyclic Graph),然后利用不同客户端间因干预导致的结构差异来定向额外边,从而得到一个更紧致的等价类,即 Φ-Markov 等价类,并用 Φ-CPDAG 表示。该方法不仅提供了收敛性和隐私保护的理论保证,还在合成数据上验证了其有效性。

链接: https://arxiv.org/abs/2512.23626
作者: Federico Baldo,Charles K. Assaad
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Most causal discovery methods recover a completed partially directed acyclic graph representing a Markov equivalence class from observational data. Recent work has extended these methods to federated settings to address data decentralization and privacy constraints, but often under idealized assumptions that all clients share the same causal model. Such assumptions are unrealistic in practice, as client-specific policies or protocols, for example, across hospitals, naturally induce heterogeneous and unknown interventions. In this work, we address federated causal discovery under unknown client-level interventions. We propose I-PERI, a novel federated algorithm that first recovers the CPDAG of the union of client graphs and then orients additional edges by exploiting structural differences induced by interventions across clients. This yields a tighter equivalence class, which we call the \mathbf\Phi -Markov Equivalence Class, represented by the \mathbf\Phi -CPDAG. We provide theoretical guarantees on the convergence of I-PERI, as well as on its privacy-preserving properties, and present empirical evaluations on synthetic data demonstrating the effectiveness of the proposed algorithm.
zh

[AI-3] Physics-Informed Neural Networks for Device and Circuit Modeling: A Case Study of NeuroSPICE

【速读】:该论文旨在解决传统电路仿真工具SPICE在处理新兴器件(如铁电存储器等高度非线性系统)时面临的建模灵活性不足与仿真效率受限的问题。其解决方案的关键在于提出NeuroSPICE框架,这是一种基于物理信息神经网络(Physics-Informed Neural Networks, PINNs)的新型仿真方法,通过将电路微分-代数方程(Differential-Algebraic Equations, DAEs)的残差最小化问题转化为可微分的损失函数,并利用反向传播优化参数,从而实现对器件和电路波形的解析式建模及精确的时间导数计算,为设计优化和逆问题提供高灵活性的代理模型。
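
编辑注:下面以最简单的 RC 电路 ODE(v'(t) = (Vin - v)/(RC))为例,给出"以方程残差为损失、反向传播训练 PINN"的最小可运行草图;网络结构与超参数均为假设,并非 NeuroSPICE 本身。

```python
import torch
import torch.nn as nn

R, C, VIN = 1e3, 1e-6, 1.0                      # RC电路参数(示例值)
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    t = (torch.rand(256, 1) * 5 * R * C).requires_grad_(True)  # 采样时间点
    v = net(t)
    dv = torch.autograd.grad(v, t, torch.ones_like(v), create_graph=True)[0]
    residual = dv - (VIN - v) / (R * C)          # 电路方程 v'(t)=(Vin-v)/RC 的残差
    ic = net(torch.zeros(1, 1)).pow(2).mean()    # 初值条件 v(0)=0
    loss = residual.pow(2).mean() + ic
    opt.zero_grad(); loss.backward(); opt.step()
```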

链接: https://arxiv.org/abs/2512.23624
作者: Chien-Ting Tung,Chenming Hu
机构: 未知
类目: Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph)
备注: Submitted to IEEE Electron Device Letters

点击查看摘要

Abstract:We present NeuroSPICE, a physics-informed neural network (PINN) framework for device and circuit simulation. Unlike conventional SPICE, which relies on time-discretized numerical solvers, NeuroSPICE leverages PINNs to solve circuit differential-algebraic equations (DAEs) by minimizing the residual of the equations through backpropagation. It models device and circuit waveforms using analytical equations in time domain with exact temporal derivatives. While PINNs do not outperform SPICE in speed or accuracy during training, they offer unique advantages such as surrogate models for design optimization and inverse problems. NeuroSPICE’s flexibility enables the simulation of emerging devices, including highly nonlinear systems such as ferroelectric memories.
zh
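
As a flavor of the PINN mechanism the abstract describes (minimizing a circuit equation's residual by backpropagation), here is a minimal sketch on a toy RC circuit; the network size, collocation sampling, and loss weighting are illustrative choices of ours, not NeuroSPICE's.

```python
import torch
import torch.nn as nn

# Toy RC circuit: R*C*dv/dt + v = Vs, with v(0) = 0. The PINN represents
# v(t) analytically and is trained by minimizing the equation residual.
R, C, Vs = 1.0, 1.0, 1.0
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    t = torch.rand(128, 1, requires_grad=True) * 5.0   # collocation points
    v = net(t)
    dv = torch.autograd.grad(v.sum(), t, create_graph=True)[0]  # exact dv/dt
    residual = R * C * dv + v - Vs                      # circuit equation
    ic = net(torch.zeros(1, 1))                         # initial condition
    loss = (residual ** 2).mean() + (ic ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# v(t) is now a differentiable analytical surrogate; compare against the
# closed-form solution Vs * (1 - exp(-t / (R * C))).
```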

[AI-4] Le Cam Distortion: A Decision-Theoretic Framework for Robust Transfer Learning

【速读】: This paper tackles negative transfer caused by distribution shift in real-world machine learning, especially when source and target domains are unequally informative (e.g., high-quality versus degraded sensors). Conventional unsupervised domain adaptation (UDA) enforces feature invariance via symmetric divergence minimization, which necessitates information destruction and can be catastrophic. The key to the solution is a decision-theoretic framework grounded in Le Cam's theory of statistical experiments, which replaces symmetric invariance with directional simulability and introduces Le Cam Distortion, using the deficiency distance $\delta(E_1, E_2)$ as a rigorous upper bound on transfer risk conditional on simulability, thereby enabling safe transfer without degrading source performance. The approach is validated on five tasks spanning genomics, vision, and reinforcement learning, and provides the first principled, risk-controlled transfer learning paradigm for high-stakes settings such as medical imaging, autonomous systems, and precision medicine.

链接: https://arxiv.org/abs/2512.23617
作者: Deniz Akdemir
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Distribution shift is the defining challenge of real-world machine learning. The dominant paradigm, Unsupervised Domain Adaptation (UDA), enforces feature invariance, aligning source and target representations via symmetric divergence minimization [Ganin et al., 2016]. We demonstrate that this approach is fundamentally flawed: when domains are unequally informative (e.g., high-quality vs degraded sensors), strict invariance necessitates information destruction, causing "negative transfer" that can be catastrophic in safety-critical applications [Wang et al., 2019]. We propose a decision-theoretic framework grounded in Le Cam's theory of statistical experiments [Le Cam, 1986], using constructive approximations to replace symmetric invariance with directional simulability. We introduce Le Cam Distortion, quantified by the Deficiency Distance $\delta(E_1, E_2)$, as a rigorous upper bound for transfer risk conditional on simulability. Our framework enables transfer without source degradation by learning a kernel that simulates the target from the source. Across five experiments (genomics, vision, reinforcement learning), Le Cam Distortion achieves: (1) near-perfect frequency estimation in HLA genomics (correlation $r = 0.999$, matching classical methods), (2) zero source utility loss in CIFAR-10 image classification (81.2% accuracy preserved vs 34.7% drop for CycleGAN), and (3) safe policy transfer in RL control where invariance-based methods suffer catastrophic collapse. Le Cam Distortion provides the first principled framework for risk-controlled transfer learning in domains where negative transfer is unacceptable: medical imaging, autonomous systems, and precision medicine.
zh
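
For readers unfamiliar with Le Cam deficiency, the standard definition reads as below. This is the textbook form under our reading; the paper's exact normalization or conditioning (e.g., on simulability) may differ.

```latex
% Standard form of Le Cam's deficiency. E_1 = (P_\theta) and E_2 = (Q_\theta)
% are experiments over a common parameter set \Theta, and K ranges over
% Markov kernels (the "simulation" channels):
\[
  \delta(E_1, E_2) \;=\; \inf_{K} \; \sup_{\theta \in \Theta}
  \tfrac{1}{2} \left\lVert K P_\theta - Q_\theta \right\rVert_{\mathrm{TV}} .
\]
% A small \delta(E_1, E_2) means the target experiment can be simulated
% from the source, which is what bounds the transfer risk.
```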

[AI-5] Divergent-Convergent Thinking in Large Language Models for Creative Problem Generation

【速读】: This paper addresses the "Artificial Hivemind" effect in LLM-generated educational questions, where model outputs are highly homogeneous, so students are exposed to overly similar, repetitive problems and diversity of thought is suppressed. The key to the solution is CreativeDC, a two-phase prompting method grounded in Wallas's theory of creativity and Guilford's divergent-convergent thinking framework, which explicitly decouples generation into two phases: a divergent exploration phase that encourages the model to generate a broad range of diverse initial ideas, and a convergent refinement phase that selects and refines problems satisfying the educational constraints. With this structured reasoning, CreativeDC achieves significantly higher diversity and novelty while maintaining high utility, and as more samples are drawn, its effective number of distinct problems grows at a faster rate than baseline methods.

链接: https://arxiv.org/abs/2512.23601
作者: Manh Hung Nguyen,Adish Singla
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Preprint

点击查看摘要

Abstract:Large language models (LLMs) have significant potential for generating educational questions and problems, enabling educators to create large-scale learning materials. However, LLMs are fundamentally limited by the "Artificial Hivemind" effect, where they generate similar responses within the same model and produce homogeneous outputs across different models. As a consequence, students may be exposed to overly similar and repetitive LLM-generated problems, which harms diversity of thought. Drawing inspiration from Wallas's theory of creativity and Guilford's framework of divergent-convergent thinking, we propose CreativeDC, a two-phase prompting method that explicitly scaffolds the LLM's reasoning into distinct phases. By decoupling creative exploration from constraint satisfaction, our method enables LLMs to explore a broader space of ideas before committing to a final problem. We evaluate CreativeDC for creative problem generation using a comprehensive set of metrics that capture diversity, novelty, and utility. The results show that CreativeDC achieves significantly higher diversity and novelty compared to baselines while maintaining high utility. Moreover, scaling analysis shows that CreativeDC generates a larger effective number of distinct problems as more are sampled, increasing at a faster rate than baseline methods.
zh
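
A minimal sketch of what divergent-convergent ("two-phase") prompting looks like in code, with a generic `chat(prompt)` stand-in for any LLM API; the phase wording below is ours, not CreativeDC's actual prompts.

```python
# Two-phase prompting: divergent exploration, then convergent refinement.
def chat(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM of choice")

def creative_dc(topic: str, n_ideas: int = 10) -> str:
    # Phase 1 (divergent): explore a broad space of ideas, no constraints yet.
    ideas = chat(
        f"Brainstorm {n_ideas} unusual, mutually distinct angles for a "
        f"practice problem about {topic}. Do not write full problems yet."
    )
    # Phase 2 (convergent): commit to one idea and satisfy the constraints.
    return chat(
        "From the ideas below, pick the most novel one that is still "
        "solvable by the target students, and write a complete, "
        f"well-posed problem about {topic}.\n\nIdeas:\n{ideas}"
    )
```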

[AI-6] Toward Trustworthy Agentic AI: A Multimodal Framework for Preventing Prompt Injection Attacks

【速读】: This paper addresses the security risks in multi-agent environments posed by multimodal prompt injection (PI) attacks, where hidden or malicious instructions carried in text, images, metadata, or agent-to-agent messages can propagate through the system and cause unintended behavior, policy violations, or state corruption. The key to the solution is a Cross-Agent Multimodal Provenance-Aware Defense Framework whose core mechanisms include a text sanitizer agent, a visual sanitizer agent, and an output validator agent, all coordinated through a unified provenance ledger that records the modality, source, and trust level of every input. This establishes explicit trust boundaries for agent-to-agent communication and prevents injected instructions from spreading through LangChain- or GraphChain-style workflows. Experiments show that the framework substantially improves cross-modal injection detection accuracy, minimizes trust leakage, and stabilizes multi-agent execution paths.

链接: https://arxiv.org/abs/2512.23557
作者: Toqeer Ali Syed,Mishal Ateeq Almutairi,Mahmoud Abdel Moaty
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted as a conference paper at ICCA 2025, Bahrain, 21-23 December

点击查看摘要

Abstract:Powerful autonomous systems, which reason, plan, and converse using and between numerous tools and agents, are made possible by Large Language Models (LLMs), Vision-Language Models (VLMs), and new agentic AI systems, like LangChain and GraphChain. Nevertheless, this agentic environment increases the probability of the occurrence of multimodal prompt injection (PI) attacks, in which concealed or malicious instructions carried in text, pictures, metadata, or agent-to-agent messages may spread throughout the graph and lead to unintended behavior, a breach of policy, or corruption of state. In order to mitigate these risks, this paper suggests a Cross-Agent Multimodal Provenance-Aware Defense Framework whereby all the prompts, either user-generated or produced by upstream agents, are sanitized and all the outputs generated by an LLM are verified independently before being sent to downstream nodes. This framework contains a text sanitizer agent, visual sanitizer agent, and output validator agent, all coordinated by a provenance ledger, which keeps metadata of modality, source, and trust level throughout the entire agent network. This architecture makes sure that agent-to-agent communication abides by clear trust frames such that injected instructions are not propagated down LangChain- or GraphChain-style workflows. The experimental assessments show that multimodal injection detection accuracy is significantly enhanced, cross-agent trust leakage is minimized, and agentic execution pathways become stable. The framework, which expands the concept of provenance tracking and validation to multi-agent orchestration, enhances the establishment of secure, understandable and reliable agentic AI systems.
zh

[AI-7] Act2Goal: From World Model To General Goal-conditioned Policy

【速读】: This paper addresses how to specify robotic manipulation tasks in a way that is both expressive and precise, particularly for long-horizon manipulation, where existing goal-conditioned policies based on single-step action prediction perform poorly because they lack explicit modeling of task progress. The key to the solution is the Act2Goal framework, which combines a goal-conditioned visual world model with multi-scale temporal control: the world model first generates a sequence of intermediate visual states that captures long-horizon structure, forming a visual plan; Multi-Scale Temporal Hashing (MSTH) then decomposes the imagined trajectory into dense proximal frames for fine-grained closed-loop control and sparse distal frames that anchor global task consistency; finally, these representations are coupled to motor control through end-to-end cross-attention, enabling coherent long-horizon behavior while remaining reactive to local disturbances.

链接: https://arxiv.org/abs/2512.23541
作者: Pengfei Zhou,Liliang Chen,Shengcong Chen,Di Chen,Wenzhi Zhao,Rongjun Jin,Guanghui Ren,Jianlan Luo
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Specifying robotic manipulation tasks in a manner that is both expressive and precise remains a central challenge. While visual goals provide a compact and unambiguous task specification, existing goal-conditioned policies often struggle with long-horizon manipulation due to their reliance on single-step action prediction without explicit modeling of task progress. We propose Act2Goal, a general goal-conditioned manipulation policy that integrates a goal-conditioned visual world model with multi-scale temporal control. Given a current observation and a target visual goal, the world model generates a plausible sequence of intermediate visual states that captures long-horizon structure. To translate this visual plan into robust execution, we introduce Multi-Scale Temporal Hashing (MSTH), which decomposes the imagined trajectory into dense proximal frames for fine-grained closed-loop control and sparse distal frames that anchor global task consistency. The policy couples these representations with motor control through end-to-end cross-attention, enabling coherent long-horizon behavior while remaining reactive to local disturbances. Act2Goal achieves strong zero-shot generalization to novel objects, spatial layouts, and environments. We further enable reward-free online adaptation through hindsight goal relabeling with LoRA-based finetuning, allowing rapid autonomous improvement without external supervision. Real-robot experiments demonstrate that Act2Goal improves success rates from 30% to 90% on challenging out-of-distribution tasks within minutes of autonomous interaction, validating that goal-conditioned world models with multi-scale temporal control provide structured guidance necessary for robust long-horizon manipulation. Project page: this https URL
zh
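
To illustrate the proximal/distal split that MSTH performs on an imagined trajectory, here is a toy index scheme. The exact spacing and frame counts used by Act2Goal are not specified in the abstract, so the exponential layout below is purely our assumption.

```python
# Dense proximal frames for fine control + sparse distal anchors (toy sketch).
def mst_hash_indices(horizon: int, dense: int = 4, sparse: int = 4):
    """Return frame indices: the next `dense` consecutive frames for
    fine-grained closed-loop control, plus up to `sparse` exponentially
    spaced distal frames anchoring the trajectory out to `horizon`."""
    proximal = list(range(1, min(dense, horizon) + 1))
    distal, idx = [], dense
    while len(distal) < sparse and idx < horizon:
        idx = min(idx * 2, horizon)      # exponentially coarser spacing
        distal.append(idx)
    return proximal, distal

print(mst_hash_indices(64))  # ([1, 2, 3, 4], [8, 16, 32, 64])
```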

[AI-8] Why AI Safety Requires Uncertainty, Incomplete Preferences, and Non-Archimedean Utilities

【速读】: This paper studies how to align AI systems with human values and keep them safe, framed through the AI assistance and AI shutdown games. The AI assistance problem concerns designing an AI agent that helps a human maximize utility functions known only to the human, which the assistant must therefore learn; the shutdown problem concerns designing agents that reliably shut down when a shutdown button is pressed, neither try to prevent nor cause its pressing, and otherwise perform their task competently. The key to the solution is building AI agents that can reason under uncertainty and handle both incomplete and non-Archimedean preference structures, enabling safe, controllable, and competent behavior.

链接: https://arxiv.org/abs/2512.23508
作者: Alessio Benavoli,Alessandro Facchini,Marco Zaffalon
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注:

点击查看摘要

Abstract:How can we ensure that AI systems are aligned with human values and remain safe? We can study this problem through the frameworks of the AI assistance and the AI shutdown games. The AI assistance problem concerns designing an AI agent that helps a human to maximise their utility function(s). However, only the human knows these function(s); the AI assistant must learn them. The shutdown problem instead concerns designing AI agents that: shut down when a shutdown button is pressed; neither try to prevent nor cause the pressing of the shutdown button; and otherwise accomplish their task competently. In this paper, we show that addressing these challenges requires AI agents that can reason under uncertainty and handle both incomplete and non-Archimedean preferences.
zh

[AI-9] Joint Link Adaptation and Device Scheduling Approach for URLLC Industrial IoT Network: A DRL-based Method with Bayesian Optimization

【速读】: This paper addresses the degradation of transmission performance caused by imperfect channel state information (CSI) in industrial internet of things (IIoT) networks supporting multi-device dynamic ultra-reliable low-latency communication (URLLC). The key to the solution is a Bayesian optimization (BO) driven Twin Delayed Deep Deterministic Policy Gradient (TD3) method that jointly designs link adaptation (LA) and device scheduling (including the serving order), adaptively determining the optimal device order and the corresponding modulation and coding scheme (MCS) under strict block error rate (BLER) constraints. A BO-assisted training mechanism further mitigates the slow convergence and poor reliability caused by imperfect CSI, sample imbalance in URLLC networks, and the parameter sensitivity of TD3, achieving faster convergence and higher sum-rate performance.

链接: https://arxiv.org/abs/2512.23493
作者: Wei Gao,Paul Zheng,Peng Wu,Yulin Hu,Anke Schmeink
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 10 figures

点击查看摘要

Abstract:In this article, we consider an industrial internet of things (IIoT) network supporting multi-device dynamic ultra-reliable low-latency communication (URLLC) while the channel state information (CSI) is imperfect. A joint link adaptation (LA) and device scheduling (including the order) design is provided, aiming at maximizing the total transmission rate under strict block error rate (BLER) constraints. In particular, a Bayesian optimization (BO) driven Twin Delayed Deep Deterministic Policy Gradient (TD3) method is proposed, which determines the device serving order and the corresponding modulation and coding scheme (MCS) adaptively based on the imperfect CSI. Note that the imperfection of CSI, error sample imbalance in URLLC networks, as well as the parameter sensitivity of the TD3 algorithm likely diminish the algorithm's convergence speed and reliability. To address such issues, we propose a BO based training mechanism for convergence speed improvement, which provides a more reliable learning direction and sample selection method to tackle the imbalanced sample problem. Via extensive simulations, we show that the proposed algorithm achieves faster convergence and higher sum-rate performance compared to existing solutions.
zh

[AI-10] The Gaining Paths to Investment Success: Information-Driven LLM Graph Reasoning for Venture Capital Prediction

【速读】: This paper addresses the difficulty of predicting startup success in venture capital (VC), which hinges on explicit reasoning over complex relational evidence (company disclosures, investor track records, and the structure of investment networks) to form interpretable investment theses. Traditional machine learning and graph neural networks lack this reasoning capability, while large language models (LLMs), despite strong reasoning, face a modality mismatch with graph-structured data; VC prediction is moreover an "off-graph" task whose target lies outside the network, requiring the selection of the most useful paths to improve predictive performance while keeping the reasoning transparent. The key to the solution is the MIRAGE-VC framework: an information-gain-driven path retriever iteratively selects high-value neighbors to counter path explosion, distilling the investment network into compact reasoning chains, and a multi-agent architecture with a learnable gating mechanism fuses heterogeneous evidence streams conditioned on company attributes, assigning different analytical emphasis to different startups. Under strict anti-leakage controls, the method improves F1 by 5.0% and PrecisionAt5 by 16.6%, and offers a general recipe for other off-graph prediction tasks.

链接: https://arxiv.org/abs/2512.23489
作者: Haoyu Pei,Zhongyang Liu,Xiangyi Xiao,Xiaocong Du,Haipeng Zhang,Kunpeng Zhang,Suting Hong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Most venture capital (VC) investments fail, while a few deliver outsized returns. Accurately predicting startup success requires synthesizing complex relational evidence, including company disclosures, investor track records, and investment network structures, through explicit reasoning to form coherent, interpretable investment theses. Traditional machine learning and graph neural networks both lack this reasoning capability. Large language models (LLMs) offer strong reasoning but face a modality mismatch with graphs. Recent graph-LLM methods target in-graph tasks where answers lie within the graph, whereas VC prediction is off-graph: the target exists outside the network. The core challenge is selecting graph paths that maximize predictor performance on an external objective while enabling step-by-step reasoning. We present MIRAGE-VC, a multi-perspective retrieval-augmented generation framework that addresses two obstacles: path explosion (thousands of candidate paths overwhelm LLM context) and heterogeneous evidence fusion (different startups need different analytical emphasis). Our information-gain-driven path retriever iteratively selects high-value neighbors, distilling investment networks into compact chains for explicit reasoning. A multi-agent architecture integrates three evidence streams via a learnable gating mechanism based on company attributes. Under strict anti-leakage controls, MIRAGE-VC achieves +5.0% F1 and +16.6% PrecisionAt5, and sheds light on other off-graph prediction tasks such as recommendation and risk assessment. Code: this https URL.
zh
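
A minimal sketch of an information-gain-driven path retriever in the spirit described above: at each step, the neighbor whose addition most reduces the predictor's uncertainty is appended. The entropy-based gain and the `predict_success` interface are our assumptions, not MIRAGE-VC's actual design.

```python
import math

def entropy(p: float) -> float:
    """Binary entropy of P(success) = p."""
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expand_path(path, graph, predict_success, budget=3):
    """graph: node -> neighbors; predict_success(path) -> P(success).
    Greedily append the neighbor with the largest information gain."""
    for _ in range(budget):
        frontier = graph.get(path[-1], [])
        if not frontier:
            break
        base = entropy(predict_success(path))
        gains = {n: base - entropy(predict_success(path + [n])) for n in frontier}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break  # no neighbor is informative enough to keep expanding
        path = path + [best]
    return path

# Toy usage with a hypothetical graph and predictor.
graph = {"startupX": ["investorA", "investorB"], "investorA": ["startupY"]}
print(expand_path(["startupX"], graph, lambda p: min(0.5 + 0.1 * len(p), 0.99)))
```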

[AI-11] ML Compass: Navigating Capability, Cost, and Compliance Trade-offs in AI Model Deployment

【速读】: This paper addresses how organizations should select among competing AI models when user utility, deployment costs, and compliance requirements jointly matter. Widely used capability leaderboards do not translate directly into deployment decisions, creating a capability-deployment gap; the authors therefore take a systems-level view that casts model selection as constrained optimization over a capability-cost frontier. The key to the solution is the ML Compass framework: it extracts low-dimensional internal measures from heterogeneous model descriptors, estimates an empirical frontier from capability and cost data, learns user- or task-specific utility functions from interaction outcomes, and combines these components to target capability-cost profiles and recommend models. The analysis reveals a three-regime structure in the optimal internal measures: some dimensions are pinned at compliance minima, some saturate at maximum levels, and the rest take interior values governed by frontier curvature. Comparative statics further quantify how budget changes, regulatory tightening, and technological progress propagate across capability dimensions and costs.

链接: https://arxiv.org/abs/2512.23487
作者: Vassilis Digalakis Jr,Ramayya Krishnan,Gonzalo Martin Fernandez,Agni Orfanoudaki
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We study how organizations should select among competing AI models when user utility, deployment costs, and compliance requirements jointly matter. Widely used capability leaderboards do not translate directly into deployment decisions, creating a capability–deployment gap; to bridge it, we take a systems-level view in which model choice is tied to application outcomes, operating constraints, and a capability–cost frontier. We develop ML Compass, a framework that treats model selection as constrained optimization over this frontier. On the theory side, we characterize optimal model configurations under a parametric frontier and show a three-regime structure in optimal internal measures: some dimensions are pinned at compliance minima, some saturate at maximum levels, and the remainder take interior values governed by frontier curvature. We derive comparative statics that quantify how budget changes, regulatory tightening, and technological progress propagate across capability dimensions and costs. On the implementation side, we propose a pipeline that (i) extracts low-dimensional internal measures from heterogeneous model descriptors, (ii) estimates an empirical frontier from capability and cost data, (iii) learns a user- or task-specific utility function from interaction outcome data, and (iv) uses these components to target capability–cost profiles and recommend models. We validate ML Compass with two case studies: a general-purpose conversational setting using the PRISM Alignment dataset and a healthcare setting using a custom dataset we build using HealthBench. In both environments, our framework produces recommendations – and deployment-aware leaderboards based on predicted deployment value under constraints – that can differ materially from capability-only rankings, and clarifies how trade-offs between capability, cost, and safety shape optimal model choice.
zh

[AI-12] FRoD: Full-Rank Efficient Fine-Tuning with Rotational Degrees for Fast Convergence AAAI

【速读】: This paper addresses the trade-off between efficiency and expressiveness in parameter-efficient fine-tuning (PEFT): low-rank constraints lead to slow convergence and limited adaptation capacity, making it hard to capture the complex patterns required by diverse downstream tasks. The key to the solution is FRoD, which combines hierarchical joint decomposition with rotational degrees of freedom: it extracts a globally shared basis across layers and injects sparse, learnable perturbations into the scaling factors, enabling flexible full-rank updates that substantially improve expressiveness and training efficiency. FRoD matches full fine-tuning accuracy while using only 1.72% of the trainable parameters.

链接: https://arxiv.org/abs/2512.23485
作者: Guoan Wan,Tianyu Chen,Fangzheng Feng,Haoyi Zhou,Runhua Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: The 40th Annual AAAI Conference on Artificial Intelligence

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) methods have emerged as a practical solution for adapting large foundation models to downstream tasks, reducing computational and memory costs by updating only a small subset of parameters. Among them, approaches like LoRA aim to strike a balance between efficiency and expressiveness, but often suffer from slow convergence and limited adaptation capacity due to their inherent low-rank constraints. This trade-off hampers the ability of PEFT methods to capture complex patterns needed for diverse tasks. To address these challenges, we propose FRoD, a novel fine-tuning method that combines hierarchical joint decomposition with rotational degrees of freedom. By extracting a globally shared basis across layers and injecting sparse, learnable perturbations into scaling factors for flexible full-rank updates, FRoD enhances expressiveness and efficiency, leading to faster and more robust convergence. On 20 benchmarks spanning vision, reasoning, and language understanding, FRoD matches full model fine-tuning in accuracy, while using only 1.72% of trainable parameters under identical training budgets.
zh

[AI-13] Theory of Mind for Explainable Human-Robot Interaction

【速读】: This paper addresses the lack of effective evaluation standards for Theory of Mind (ToM) in human-robot interaction (HRI); in particular, existing methods rarely assess whether the explanations a robot provides actually correspond to its internal reasoning. The key to the solution is to treat ToM as a form of Explainable Artificial Intelligence (XAI) and to evaluate it through the eValuation XAI (VXAI) framework and its seven desiderata. By embedding ToM principles within XAI, the paper argues for a shift from system-centered to user-centered explanations that better serve the user's informational needs and perspective, improving the interpretability and trustworthiness of robot behavior.

链接: https://arxiv.org/abs/2512.23482
作者: Marie Bauer,Julia Gachot,Matthias Kerzel,Cornelius Weber,Stefan Wermter
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Within the context of human-robot interaction (HRI), Theory of Mind (ToM) is intended to serve as a user-friendly backend to the interface of robotic systems, enabling robots to infer and respond to human mental states. When integrated into robots, ToM allows them to adapt their internal models to users’ behaviors, enhancing the interpretability and predictability of their actions. Similarly, Explainable Artificial Intelligence (XAI) aims to make AI systems transparent and interpretable, allowing humans to understand and interact with them effectively. Since ToM in HRI serves related purposes, we propose to consider ToM as a form of XAI and evaluate it through the eValuation XAI (VXAI) framework and its seven desiderata. This paper identifies a critical gap in the application of ToM within HRI, as existing methods rarely assess the extent to which explanations correspond to the robot’s actual internal reasoning. To address this limitation, we propose to integrate ToM within XAI frameworks. By embedding ToM principles inside XAI, we argue for a shift in perspective, as current XAI research focuses predominantly on the AI system itself and often lacks user-centered explanations. Incorporating ToM would enable a change in focus, prioritizing the user’s informational needs and perspective.
zh

[AI-14] Agentic AI for Autonomous Defense in Software Supply Chain Security: Beyond Provenance to Vulnerability Mitigation

【速读】:该论文旨在解决传统软件供应链安全机制在面对日益复杂的攻击手段时,因依赖事后完整性验证而难以有效应对的问题。当前主流框架如SLSA、SBOM和in toto虽能提供溯源与可追溯性,但缺乏主动识别和消除漏洞的能力。解决方案的关键在于构建一个基于代理式人工智能(Agentic AI)的自主软件供应链安全系统,其核心包括:利用大语言模型(LLM)进行语义层面的漏洞分析与可解释决策、通过强化学习(RL)实现兼顾安全性与运行开销的自适应缓解策略,并借助LangChain与LangGraph实现多安全代理的协同控制;同时,通过Model Context Protocol(MCP)对接实际CI/CD环境,并以区块链安全日志记录所有操作行为,从而实现从被动验证向主动防御的转变。

链接: https://arxiv.org/abs/2512.23480
作者: Toqeer Ali Syed,Mohammad Riyaz Belgaum,Salman Jan,Asadullah Abdullah Khan,Saad Said Alqahtani
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Conference paper, accepted at ACCA, IEEE Bahrain

点击查看摘要

Abstract:Software supply chain attacks increasingly target trusted development and delivery procedures, so conventional post-build integrity mechanisms are no longer sufficient. Available frameworks such as SLSA, SBOM, and in-toto mainly offer provenance and traceability but lack the capability to actively identify and remove vulnerabilities in software production. This paper presents an agentic artificial intelligence (AI) framework for autonomous software supply chain security that combines large language model (LLM)-based reasoning, reinforcement learning (RL), and multi-agent coordination. The proposed system uses specialized security agents coordinated through LangChain and LangGraph, communicates with real CI/CD environments via the Model Context Protocol (MCP), and documents all observations and actions in a blockchain security ledger to ensure integrity and auditability. Reinforcement learning yields adaptive mitigation strategies that balance security effectiveness against operational overhead, while LLMs provide semantic vulnerability analysis and explainable decisions. The framework is tested on simulated pipelines as well as real-world CI/CD integrations with GitHub Actions and Jenkins, covering injection attacks, insecure deserialization, access control violations, and configuration errors. Experimental outcomes indicate better detection accuracy, shorter mitigation latency, and reasonable build-time overhead compared with rule-based, provenance-only, and RL-only baselines. These results show that agentic AI can facilitate the transition from reactive verification to self-defending, proactive software supply chains.
zh

[AI-15] Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance

【速读】: This paper addresses overfitting and reward hacking in reward models (RMs) for reinforcement learning from human feedback (RLHF), caused by diverse and often nonlinear inductive biases in the training data; for example, biases toward response length, sycophancy, and formatting can mislead the RM into learning the wrong optimization target. The key to the solution is DIR (Debiasing via Information optimization for RM), an information-theoretic debiasing method that maximizes the mutual information (MI) between RM scores and human preference pairs while minimizing the MI between RM outputs and biased attributes of the preference inputs, providing theoretically grounded robustness to complex biases. Unlike prior work restricted to linear correlations, DIR can handle nonlinear biases; experiments show it not only mitigates the targeted biases but also improves overall RLHF performance and generalization.

链接: https://arxiv.org/abs/2512.23461
作者: Zhuo Li,Pengyu Cheng,Zhechao Yu,Feifei Tong,Anningzhe Gao,Tsung-Hui Chang,Xiang Wan,Erchao Zhao,Xiaoxi Jiang,Guanjun Jiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reward models (RMs) are essential in reinforcement learning from human feedback (RLHF) to align large language models (LLMs) with human values. However, RM training data is commonly recognized as low-quality, containing inductive biases that can easily lead to overfitting and reward hacking. For example, more detailed and comprehensive responses are usually human-preferred but with more words, leading response length to become one of the inevitable inductive biases. A limited number of prior RM debiasing approaches either target a single specific type of bias or model the problem with only simple linear correlations, e.g., Pearson coefficients. To mitigate more complex and diverse inductive biases in reward modeling, we introduce a novel information-theoretic debiasing method called Debiasing via Information optimization for RM (DIR). Inspired by the information bottleneck (IB), we maximize the mutual information (MI) between RM scores and human preference pairs, while minimizing the MI between RM outputs and biased attributes of preference inputs. With theoretical justification from information theory, DIR can handle more sophisticated types of biases with non-linear correlations, broadly extending the real-world application scenarios for RM debiasing methods. In experiments, we verify the effectiveness of DIR with three types of inductive biases: response length, sycophancy, and format. We discover that DIR not only effectively mitigates target inductive biases but also enhances RLHF performance across diverse benchmarks, yielding better generalization abilities. The code and training recipes are available at this https URL.
zh
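
Schematically, the DIR objective combines a preference term with an MI penalty on a bias attribute. The sketch below uses a Bradley-Terry preference loss and a crude correlation-based MI proxy purely as a placeholder; the paper's estimator handles nonlinear dependence and is not shown here.

```python
import torch
import torch.nn.functional as F

def corr_mi_proxy(x, y):
    """Crude linear stand-in for an MI estimator; DIR itself targets
    nonlinear dependence with a learned estimator (not shown here)."""
    x = (x - x.mean()) / (x.std() + 1e-6)
    y = (y - y.mean()) / (y.std() + 1e-6)
    return (x * y).mean() ** 2

def dir_style_loss(r_chosen, r_rejected, r_all, bias_attr, lam=0.1):
    # Keep MI(score; preference): Bradley-Terry log-likelihood.
    pref_loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Penalize MI(score; bias attribute), e.g., response length.
    return pref_loss + lam * corr_mi_proxy(r_all, bias_attr)

r_c, r_r = torch.randn(32), torch.randn(32)      # RM scores per pair
lengths = torch.randn(64)                        # bias attribute per response
print(dir_style_loss(r_c, r_r, torch.cat([r_c, r_r]), lengths))
```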

[AI-16] Mobile-Efficient Speech Emotion Recognition Using DistilHuBERT: A Cross-Corpus Validation Study

【速读】: This paper addresses the barriers to deploying speech emotion recognition (SER) on mobile devices, chiefly the computational complexity and resource demands of state-of-the-art Transformer architectures. The key to the solution is a lightweight DistilHuBERT-based model that combines distillation with 8-bit quantization to cut parameters by 92% while retaining competitive accuracy, together with a cross-corpus training strategy (adding CREMA-D to IEMOCAP) that markedly improves generalization and stability, including a 5.4% F1-score gain on the Neutral class. The final model requires only 23 MB of storage and achieves 61.4% Unweighted Accuracy on-device, a Pareto-optimal trade-off between model size and accuracy.

链接: https://arxiv.org/abs/2512.23435
作者: Saifelden M. Ismail
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: 5 pages, 2 tables, 1 figure. Submitted to IEEE conference

点击查看摘要

Abstract:Speech Emotion Recognition (SER) has significant potential for mobile applications, yet deployment remains constrained by the computational demands of state-of-the-art transformer architectures. This paper presents a mobile-efficient SER system based on DistilHuBERT, a distilled and 8-bit quantized transformer that achieves 92% parameter reduction compared to full-scale Wav2Vec 2.0 models while maintaining competitive accuracy. We conduct a rigorous 5-fold Leave-One-Session-Out (LOSO) cross-validation on the IEMOCAP dataset to ensure speaker independence, augmented with cross-corpus training on CREMA-D to enhance generalization. Cross-corpus training with CREMA-D yields a 1.2% improvement in Weighted Accuracy, a 1.4% gain in Macro F1-score, and a 32% reduction in cross-fold variance, with the Neutral class showing the most substantial benefit at 5.4% F1-score improvement. Our approach achieves an Unweighted Accuracy of 61.4% with a quantized model footprint of only 23 MB, representing approximately 91% of full-scale baseline performance. Cross-corpus evaluation on RAVDESS reveals that the theatrical nature of acted emotions causes predictions to cluster by arousal level rather than valence: happiness is systematically confused with anger due to acoustic saturation in high-energy expressions. Despite this theatricality effect reducing overall RAVDESS accuracy to 43.29%, the model maintains robust arousal detection with 97% recall for anger and 64% for sadness. These findings establish a Pareto-optimal tradeoff between model size and accuracy, enabling practical affect recognition on resource-constrained mobile devices.
zh
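
One common way to obtain an 8-bit model of the kind evaluated here is PyTorch's post-training dynamic quantization; a minimal sketch is below. The checkpoint name and the size-measurement helper are illustrative, and the paper's exact quantization recipe may differ.

```python
import os
import torch
from transformers import AutoModel

# Post-training dynamic 8-bit quantization of DistilHuBERT's linear layers.
model = AutoModel.from_pretrained("ntu-spml/distilhubert")
model.eval()

q_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def size_mb(m):
    """Rough on-disk footprint of a model's weights."""
    torch.save(m.state_dict(), "tmp.pt")
    s = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return s

print(f"fp32: {size_mb(model):.1f} MB -> int8: {size_mb(q_model):.1f} MB")
```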

[AI-17] AKG kernel Agent: A Multi-Agent Framework for Cross-Platform Kernel Synthesis

【速读】: This paper addresses the problem that manual optimization of high-performance compute kernels cannot keep pace with modern AI models (LLMs, multimodal architectures, and recommendation systems), given frequent hardware updates and diverse heterogeneous chip architectures, making kernel development, migration, and performance tuning a critical bottleneck in AI system development. The key to the solution is the AKG kernel agent (AI-driven Kernel Generator), a multi-agent system that automates kernel generation, migration, and performance tuning, and supports multiple domain-specific languages (DSLs) including Triton, TileLang, CPP, and CUDA-C, so it can target different hardware backends (such as GPUs and NPUs) while maintaining correctness and portability. Its modular design also allows rapid integration of new DSLs and hardware targets; on the KernelBench benchmark with Triton, it achieves an average speedup of $1.46\times$ over PyTorch Eager baselines, substantially accelerating AI kernel development.

链接: https://arxiv.org/abs/2512.23424
作者: Jinye Du,Quan Yuan,Zuyao Zhang,Yanzhi Yi,Jiahui Hu,Wangyi Chen,Yiyang Zhu,Qishui Zheng,Wenxiang Zou,Xiangyu Chang,Zuohe Zheng,Zichun Ye,Chao Liu,Shanni Li,Renwei Zhang,Yiping Deng,Xinwei Hu,Xuefeng Jin,Jie Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Modern AI models demand high-performance computation kernels. The growing complexity of LLMs, multimodal architectures, and recommendation systems, combined with techniques like sparsity and quantization, creates significant computational challenges. Moreover, frequent hardware updates and diverse chip architectures further complicate this landscape, requiring tailored kernel implementations for each platform. However, manual optimization cannot keep pace with these demands, creating a critical bottleneck in AI system development. Recent advances in LLM code generation capabilities have opened new possibilities for automating kernel development. In this work, we propose AKG kernel agent (AI-driven Kernel Generator), a multi-agent system that automates kernel generation, migration, and performance tuning. AKG kernel agent is designed to support multiple domain-specific languages (DSLs), including Triton, TileLang, CPP, and CUDA-C, enabling it to target different hardware backends while maintaining correctness and portability. The system's modular design allows rapid integration of new DSLs and hardware targets. When evaluated on KernelBench using Triton DSL across GPU and NPU backends, AKG kernel agent achieves an average speedup of $1.46\times$ over PyTorch Eager baseline implementations, demonstrating its effectiveness in accelerating kernel development for modern AI workloads.
zh

[AI-18] The World Is Bigger! A Computationally-Embedded Perspective on the Big World Hypothesis NEURIPS2025

【速读】: This paper addresses how an agent with limited capacity can keep adapting to its environment in continual learning. Existing formulations rely on explicit constraints on the agent relative to the environment, which are often ad hoc, hard to incorporate, and limit the benefit of scaling up capacity. The key to the solution is a computationally-embedded perspective that models the agent as an automaton simulated within a universal (formal) computer, so the agent is constrained by being embedded in the environment regardless of its capacity. Within this framework, the authors define an objective called interactivity, measuring the agent's ability to continually adapt its behaviour by learning new predictions, and develop a model-based reinforcement learning algorithm that seeks interactivity. Experiments show that deep linear networks sustain higher interactivity than nonlinear ones, highlighting the impact of architecture on continual learning capability.

链接: https://arxiv.org/abs/2512.23419
作者: Alex Lewandowski,Adtiya A. Ramesh,Edan Meyer,Dale Schuurmans,Marlos C. Machado
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: NeurIPS 2025 (spotlight)

点击查看摘要

Abstract:Continual learning is often motivated by the idea, known as the big world hypothesis, that “the world is bigger” than the agent. Recent problem formulations capture this idea by explicitly constraining an agent relative to the environment. These constraints lead to solutions in which the agent continually adapts to best use its limited capacity, rather than converging to a fixed solution. However, explicit constraints can be ad hoc, difficult to incorporate, and may limit the effectiveness of scaling up the agent’s capacity. In this paper, we characterize a problem setting in which an agent, regardless of its capacity, is constrained by being embedded in the environment. In particular, we introduce a computationally-embedded perspective that represents an embedded agent as an automaton simulated within a universal (formal) computer. Such an automaton is always constrained; we prove that it is equivalent to an agent that interacts with a partially observable Markov decision process over a countably infinite state-space. We propose an objective for this setting, which we call interactivity, that measures an agent’s ability to continually adapt its behaviour by learning new predictions. We then develop a model-based reinforcement learning algorithm for interactivity-seeking, and use it to construct a synthetic problem to evaluate continual learning capability. Our results show that deep nonlinear networks struggle to sustain interactivity, whereas deep linear networks sustain higher interactivity as capacity increases.
zh

[AI-19] MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning

【速读】: This paper addresses the limited intelligence of traditional workflow-based agents on real-world problems requiring tool invocation, in particular their lack of autonomous reasoning and tool-calling in multi-step interactive environments. The key to the solution is MindWatcher, a tool-integrated reasoning (TIR) agent that combines interleaved thinking with multimodal chain-of-thought (CoT) reasoning, allowing the model to decide autonomously at any intermediate stage whether and how to invoke diverse tools and to coordinate their use, without relying on human prompts or predefined workflows. This design substantially improves flexibility and accuracy on complex decision-making tasks, while curated high-quality datasets, automated evaluation pipelines, and a dedicated local image retrieval database strengthen generalization and tool-calling efficiency in multimodal settings.

链接: https://arxiv.org/abs/2512.23412
作者: Jiawei Chen,Xintian Shen,Lihao Zheng,Zhenwei Shao,Hongyuan Zhang,Pengfei Yu,Xudong Rao,Ning Mao,Xiaobo Liu,Lian Wen,Chaoqun Du,Feng Gu,Wei He,Qizhen Li,Shanshan Li,Zide Liu,Jing Luo,Lifu Mu,Xuhao Pan,Chang Ren,Haoyi Sun,Qian Wang,Wei Wang,Hongfu Yang,Jiqing Zhan,Chunpeng Zhou,Zheng Zhou,Hao Ma,Tao Wei,Pan Zhou,Wei Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Technical Report

点击查看摘要

Abstract:Traditional workflow-based agents exhibit limited intelligence when addressing real-world problems requiring tool invocation. Tool-integrated reasoning (TIR) agents capable of autonomous reasoning and tool invocation are rapidly emerging as a powerful approach for complex decision-making tasks involving multi-step interactions with external environments. In this work, we introduce MindWatcher, a TIR agent integrating interleaved thinking and multimodal chain-of-thought (CoT) reasoning. MindWatcher can autonomously decide whether and how to invoke diverse tools and coordinate their use, without relying on human prompts or workflows. The interleaved thinking paradigm enables the model to switch between thinking and tool calling at any intermediate stage, while its multimodal CoT capability allows manipulation of images during reasoning to yield more precise search results. We implement automated data auditing and evaluation pipelines, complemented by manually curated high-quality datasets for training, and we construct a benchmark, called MindWatcher-Evaluate Bench (MWE-Bench), to evaluate its performance. MindWatcher is equipped with a comprehensive suite of auxiliary reasoning tools, enabling it to address broad-domain multimodal problems. A large-scale, high-quality local image retrieval database, covering eight categories including cars, animals, and plants, endows model with robust object recognition despite its small size. Finally, we design a more efficient training infrastructure for MindWatcher, enhancing training speed and hardware utilization. Experiments not only demonstrate that MindWatcher matches or exceeds the performance of larger or more recent models through superior tool invocation, but also uncover critical insights for agent training, such as the genetic inheritance phenomenon in agentic RL.
zh

[AI-20] Directly Constructing Low-Dimensional Solution Subspaces in Deep Neural Networks

【速读】: This paper addresses the reliance of current deep neural networks on massive, redundant high-dimensional widths at training time: although the intrinsic dimension (ID) of weight matrices and feature manifolds is low, the non-convex optimization search problem (finding a global minimum) remains intractable for compact networks. The key to the solution is a constructive approach that decouples the solution geometry from the ambient search space: the classification head can be compressed by factors as large as 16 with negligible performance loss, motivating Subspace-Native Distillation, a paradigm that defines the target directly in the constructed low-dimensional subspace, giving student models a stable geometric coordinate system and potentially circumventing the high-dimensional search problem entirely, toward the goal of "Train Big, Deploy Small".

链接: https://arxiv.org/abs/2512.23410
作者: Yusuf Kalyoncuoglu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Code available at this https URL

点击查看摘要

Abstract:While it is well-established that the weight matrices and feature manifolds of deep neural networks exhibit a low Intrinsic Dimension (ID), current state-of-the-art models still rely on massive high-dimensional widths. This redundancy is not required for representation, but is strictly necessary to solve the non-convex optimization search problem (finding a global minimum), which remains intractable for compact networks. In this work, we propose a constructive approach to bypass this optimization bottleneck. By decoupling the solution geometry from the ambient search space, we empirically demonstrate across ResNet-50, ViT, and BERT that the classification head can be compressed by factors as large as 16 with negligible performance degradation. This motivates Subspace-Native Distillation as a novel paradigm: by defining the target directly in this constructed subspace, we provide a stable geometric coordinate system for student models, potentially allowing them to circumvent the high-dimensional search problem entirely and realize the vision of Train Big, Deploy Small.
zh
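
A minimal sketch of constructing a low-dimensional solution subspace for a classification head via truncated SVD, in the spirit of the compression result above; the rank choice and factorization below are our illustration, not necessarily the paper's exact construction.

```python
import torch

def compress_head(W: torch.Tensor, factor: int = 16):
    """Factor a head W (num_classes, hidden) as W ~ A @ B with rank
    min(W.shape) // factor: A spans the constructed subspace, B projects
    features into it."""
    k = max(1, min(W.shape) // factor)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :k]                   # (num_classes, k): subspace basis
    B = S[:k, None] * Vh[:k, :]    # (k, hidden): projection into subspace
    return A, B

W = torch.randn(1000, 768)         # e.g., a ViT/BERT-sized head
A, B = compress_head(W)
err = torch.linalg.norm(W - A @ B) / torch.linalg.norm(W)
print(f"rank {A.shape[1]}, relative reconstruction error {err:.3f}")
```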

[AI-21] Securing the AI Supply Chain: What Can We Learn From Developer-Reported Security Issues and Solutions of AI Projects? ICSE2026

【速读】: This paper addresses the lack of empirical evidence on security issues in the AI supply chain, in particular which threats developers commonly encounter in practice and how they resolve them. To fill this gap, the authors analyze developer discussions on Hugging Face and GitHub, building a dataset of 312,868 security-related conversations and thematically analyzing 753 sampled posts, yielding a fine-grained taxonomy of 32 security issues and 24 solutions across four themes: System and Software, External Tools and Ecosystem, Model, and Data. The key is an automated identification pipeline that combines keyword matching with an optimally fine-tuned distilBERT classifier, enabling effective filtering and labeling of massive unstructured discussions and providing the first systematic, evidence-based picture of real-world security risks and mitigations across the AI supply chain for developers and researchers.

链接: https://arxiv.org/abs/2512.23385
作者: TheAnh Nguyen,Triet Huynh Minh Le,M. Ali Babar
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
备注: Accepted at the 48th IEEE/ACM International Conference on Software Engineering (ICSE 2026) - Research Track

点击查看摘要

Abstract:The rapid growth of Artificial Intelligence (AI) models and applications has led to an increasingly complex security landscape. Developers of AI projects must contend not only with traditional software supply chain issues but also with novel, AI-specific security threats. However, little is known about what security issues are commonly encountered and how they are resolved in practice. This gap hinders the development of effective security measures for each component of the AI supply chain. We bridge this gap by conducting an empirical investigation of developer-reported issues and solutions, based on discussions from Hugging Face and GitHub. To identify security-related discussions, we develop a pipeline that combines keyword matching with an optimal fine-tuned distilBERT classifier, which achieved the best performance in our extensive comparison of various deep learning and large language models. This pipeline produces a dataset of 312,868 security discussions, providing insights into the security reporting practices of AI applications and projects. We conduct a thematic analysis of 753 posts sampled from our dataset and uncover a fine-grained taxonomy of 32 security issues and 24 solutions across four themes: (1) System and Software, (2) External Tools and Ecosystem, (3) Model, and (4) Data. We reveal that many security issues arise from the complex dependencies and black-box nature of AI components. Notably, challenges related to Models and Data often lack concrete solutions. Our insights can offer evidence-based guidance for developers and researchers to address real-world security threats across the AI supply chain.
zh

[AI-22] Post-Training Quantization of OpenPangu Models for Efficient Deployment on Atlas A2

【速读】: This paper addresses the memory and latency overheads of deploying chain-of-thought (CoT) reasoning on Ascend NPUs: although the CoT modes of the openPangu-Embedded models (slow_think, auto_think, and no_think) boost reasoning capability, the extended reasoning traces they generate incur substantial cost. The key to the solution is a unified low-bit quantized inference framework supporting INT8 (W8A8) and W4A8 quantization, which converts FP16 computation into efficient integer arithmetic, sharply reducing memory consumption and improving inference speed while preserving accuracy. Experiments show that INT8 quantization retains over 90% of the FP16 baseline accuracy and delivers a 1.5x prefill speedup, validating efficient and practical CoT reasoning on Ascend NPUs.

链接: https://arxiv.org/abs/2512.23367
作者: Yilun Luo,HuaQing Zheng,Haoqian Meng,Wenyuan Liu,Peng Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Huawei’s openPangu-Embedded-1B and openPangu-Embedded-7B, variants of the openPangu large language model, integrate three distinct Chain-of-Thought (CoT) reasoning paradigms, namely slow_think, auto_think, and no_think. While these CoT modes enhance reasoning capabilities, their generation of extended reasoning traces introduces substantial memory and latency overheads, posing challenges for practical deployment on Ascend NPUs. This paper addresses these computational constraints by leveraging low-bit quantization, which transforms FP16 computations into more efficient integer arithmetic. We introduce a unified low-bit inference framework, supporting INT8 (W8A8) and W4A8 quantization, specifically optimized for openPangu-Embedded models on the Atlas A2. Our comprehensive evaluation, conducted across all three CoT modes on code generation benchmarks (HumanEval and MBPP), demonstrates the efficacy of this approach. INT8 quantization consistently preserves over 90% of the FP16 baseline accuracy and achieves a 1.5x prefill speedup on the Atlas A2. Furthermore, W4A8 quantization significantly reduces memory consumption, albeit with a moderate trade-off in accuracy. These findings collectively indicate that low-bit quantization effectively facilitates efficient CoT reasoning on Ascend NPUs, maintaining high model fidelity.
zh
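
The arithmetic behind W8A8 inference is symmetric integer quantization with integer accumulation: $q = \mathrm{clip}(\mathrm{round}(x/s), -128, 127)$ with scale $s = \max|x| / 127$. Below is a textbook per-tensor sketch of that math; production deployments (presumably including this one) typically use per-channel scales and fused hardware kernels.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization."""
    s = np.abs(x).max() / 127.0 or 1.0          # guard against all-zero input
    q = np.clip(np.round(x / s), -128, 127).astype(np.int8)
    return q, s

def int8_matmul(a, b):
    """W8A8-style matmul: quantize both operands, accumulate in INT32,
    then dequantize the result with the product of the two scales."""
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc * (sa * sb)

a = np.random.randn(4, 64).astype(np.float32)
b = np.random.randn(64, 8).astype(np.float32)
print(np.abs(int8_matmul(a, b) - a @ b).max())  # small quantization error
```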

[AI-23] AGRO-SQL: Agentic Group-Relative Optimization with High-Fidelity Data Synthesis

【速读】: This paper addresses two core problems in Text-to-SQL systems: insufficient training data quality and limited model reasoning in complex scenarios. The key to the solution is a dual-centric framework. On the data-centric side, an iterative data factory synthesizes RL-ready data with high correctness and precise semantic-logic alignment, ensured by strict verification. On the model-centric side, a novel agentic reinforcement learning framework initializes a robust policy via a diversity-aware cold-start stage and then refines reasoning with Group Relative Policy Optimization (GRPO) using environmental feedback. This synergistic approach achieves state-of-the-art performance among single-model methods on the BIRD and Spider benchmarks.

链接: https://arxiv.org/abs/2512.23366
作者: Cehua Yang,Dongyu Xiao,Junming Lin,Yuyang Song,Hanxu Yan,Shawn Guo,Wei Zhang,Jian Yang,Mingjie Tang,Bryan Dai
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The advancement of Text-to-SQL systems is currently hindered by the scarcity of high-quality training data and the limited reasoning capabilities of models in complex scenarios. In this paper, we propose a holistic framework that addresses these issues through a dual-centric approach. From a Data-Centric perspective, we construct an iterative data factory that synthesizes RL-ready data characterized by high correctness and precise semantic-logic alignment, ensured by strict verification. From a Model-Centric perspective, we introduce a novel Agentic Reinforcement Learning framework. This framework employs a Diversity-Aware Cold Start stage to initialize a robust policy, followed by Group Relative Policy Optimization (GRPO) to refine the agent’s reasoning via environmental feedback. Extensive experiments on BIRD and Spider benchmarks demonstrate that our synergistic approach achieves state-of-the-art performance among single-model methods.
zh
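
GRPO's defining step is a critic-free, group-relative advantage: each rollout's reward is normalized against the other rollouts sampled for the same prompt. A minimal sketch of that step, with execution-feedback rewards as an assumed example:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) rewards per sampled rollout,
    e.g., 1.0 if the generated SQL returns the gold result, else 0.0.
    Normalizing within each group removes the need for a learned critic."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))  # above-average rollouts get positive advantage
```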

[AI-24] ECG-RAMBA: Zero-Shot ECG Generalization by Morphology-Rhythm Disentanglement and Long-Range Modeling

【速读】: This paper addresses the poor generalization of deep learning models for electrocardiogram (ECG) classification across heterogeneous acquisition settings, which stems largely from the implicit entanglement of morphological waveform features and rhythm dynamics, promoting shortcut learning and sensitivity to distribution shift. The key to the solution is the ECG-RAMBA framework, which explicitly separates morphology from rhythm and re-integrates them through context-aware fusion: (i) deterministic morphological features extracted by MiniRocket, (ii) global rhythm descriptors computed from heart-rate variability (HRV), and (iii) long-range contextual modeling via a bi-directional Mamba backbone. A numerically stable Power Mean pooling operator (Q=3) further emphasizes transient abnormalities while avoiding the brittleness of max pooling and the dilution of averaging, yielding markedly more robust and consistent cross-dataset transfer.

链接: https://arxiv.org/abs/2512.23347
作者: Hai Duong Nguyen,Xuan-The Tran
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep learning has achieved strong performance for electrocardiogram (ECG) classification within individual datasets, yet dependable generalization across heterogeneous acquisition settings remains a major obstacle to clinical deployment and longitudinal monitoring. A key limitation of many model architectures is the implicit entanglement of morphological waveform patterns and rhythm dynamics, which can promote shortcut learning and amplify sensitivity to distribution shifts. We propose ECG-RAMBA, a framework that separates morphology and rhythm and then re-integrates them through context-aware fusion. ECG-RAMBA combines: (i) deterministic morphological features extracted by MiniRocket, (ii) global rhythm descriptors computed from heart-rate variability (HRV), and (iii) long-range contextual modeling via a bi-directional Mamba backbone. To improve sensitivity to transient abnormalities under windowed inference, we introduce a numerically stable Power Mean pooling operator ($Q = 3$) that emphasizes high-evidence segments while avoiding the brittleness of max pooling and the dilution of averaging. We evaluate under a protocol-faithful setting with subject-level cross-validation, a fixed decision threshold, and no test-time adaptation. On the Chapman–Shaoxing dataset, ECG-RAMBA achieves a macro ROC-AUC $\approx 0.85$. In zero-shot transfer, it attains PR-AUC $= 0.708$ for atrial fibrillation detection on the external CPSC-2021 dataset, substantially outperforming a comparable raw-signal Mamba baseline, and shows consistent cross-dataset performance on PTB-XL. Ablation studies indicate that deterministic morphology provides a strong foundation, while explicit rhythm modeling and long-range context are critical drivers of cross-domain robustness.
zh
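
The Power Mean pooling operator interpolates between averaging (Q=1) and max pooling (Q approaching infinity) over window scores. The sketch below shows the Q=3 behavior on non-negative scores; the paper's exact stabilization details may differ.

```python
import torch

def power_mean_pool(scores: torch.Tensor, q: float = 3.0, eps: float = 1e-8):
    """scores: (batch, num_windows) non-negative per-window evidence
    (e.g., sigmoid outputs). Returns the generalized (power) mean."""
    return scores.clamp_min(eps).pow(q).mean(dim=-1).pow(1.0 / q)

s = torch.tensor([[0.1, 0.1, 0.9, 0.1]])  # one transient high-evidence window
print(s.mean(-1).item(), power_mean_pool(s).item(), s.max(-1).values.item())
# The mean (0.30) dilutes the transient 0.9; the power mean (~0.57) sits
# closer to the max without inheriting max pooling's brittleness.
```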

[AI-25] The Law of Multi-Model Collaboration: Scaling Limits of Model Ensembling for Large Language Models

【速读】: This paper addresses the absence of a unified performance scaling law for multi-model collaboration among large language models (LLMs), which is needed to guide the design of multi-model ensembles. The key to the solution is the proposed Law of Multi-model Collaboration, which predicts the performance limits of LLM ensembles from their aggregated parameter budget using a method-agnostic formulation: under an idealized integration oracle, the per-sample loss is the minimum cross-entropy loss over the model pool. Experiments show that multi-model systems follow power-law scaling with a steeper improvement trend and a lower theoretical loss floor than single models, and that ensembles drawn from heterogeneous model families outperform those formed within a single family, indicating that model diversity is a primary driver of collaboration gains.

链接: https://arxiv.org/abs/2512.23340
作者: Dakuan Lu,Jiaqi Zhang,Cheng Yuan,Jiawei Shao,Chi Zhang,Xuelong Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have been largely driven by scaling laws for individual models, which predict performance improvements as model parameters and data volume increase. However, the capabilities of any single LLM are inherently bounded. One solution originates from intricate interactions among multiple LLMs, rendering their collective performance surpasses that of any constituent model. Despite the rapid proliferation of multi-model integration techniques such as model routing and post-hoc ensembling, a unifying theoretical framework of performance scaling for multi-model collaboration remains absent. In this work, we propose the Law of Multi-model Collaboration, a scaling law that predicts the performance limits of LLM ensembles based on their aggregated parameter budget. To quantify the intrinsic upper bound of multi-model collaboration, we adopt a method-agnostic formulation and assume an idealized integration oracle where the total cross-entropy loss of each sample is determined by the minimum loss of any model in the model pool. Experimental results reveal that multi-model systems follow a power-law scaling with respect to the total parameter count, exhibiting a more significant improvement trend and a lower theoretical loss floor compared to single model scaling. Moreover, ensembles of heterogeneous model families achieve better performance scaling than those formed within a single model family, indicating that model diversity is a primary driver of collaboration gains. These findings suggest that model collaboration represents a critical axis for extending the intelligence frontier of LLMs.
zh
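
The oracle construction and the scaling fit are easy to state in code: the idealized ensemble loss is the per-sample minimum over the pool, and the scaling curve is fit with a saturating power law $L(N) = aN^{-b} + c$. The data below is synthetic, purely to show the shape of the computation.

```python
import numpy as np
from scipy.optimize import curve_fit

# Idealized integration oracle: per-sample loss = min over the model pool.
per_sample_loss = np.random.rand(5, 1000) + 0.5   # 5 models x 1000 samples
oracle_loss = per_sample_loss.min(axis=0).mean()

def powerlaw(n, a, b, c):
    return a * n ** (-b) + c                       # c is the loss floor

N = np.array([1.0, 3.0, 7.0, 14.0, 30.0])          # aggregated params (B)
L = powerlaw(N, 2.0, 0.5, 1.2) + np.random.normal(0, 0.005, 5)
(a, b, c), _ = curve_fit(powerlaw, N, L, p0=(1.0, 0.5, 1.0))
print(f"oracle loss {oracle_loss:.3f}; fitted loss floor c = {c:.3f}")
```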

[AI-26] On Conformant Planning and Model-Checking of $\exists^*\forall^*$ Hyperproperties ECAI2025

【速读】: This paper establishes the connection between conformant planning and model-checking of hyperproperties, showing that the two problems are closely related both theoretically and computationally. The key lies in reductions in both directions: on one hand, model-checking instances of $\exists^*\forall^*$ hyperproperties are efficiently encoded as conformant planning instances, with the encoding proven sound and complete; on the other hand, every conformant planning problem is itself shown to be a hyperproperty model-checking task, unifying the formal foundations and solving frameworks of the two problem classes.

链接: https://arxiv.org/abs/2512.23324
作者: Raven Beutner,Bernd Finkbeiner
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: ECAI 2025

点击查看摘要

Abstract:We study the connection of two problems within the planning and verification community: Conformant planning and model-checking of hyperproperties. Conformant planning is the task of finding a sequential plan that achieves a given objective independent of non-deterministic action effects during the plan's execution. Hyperproperties are system properties that relate multiple execution traces of a system and, e.g., capture information-flow and fairness policies. In this paper, we show that model-checking of $\exists^*\forall^*$ hyperproperties is closely related to the problem of computing a conformant plan. Firstly, we show that we can efficiently reduce a hyperproperty model-checking instance to a conformant planning instance, and prove that our encoding is sound and complete. Secondly, we establish the converse direction: Every conformant planning problem is, itself, a hyperproperty model-checking task.
zh

[AI-27] Explainable Neural Inverse Kinematics for Obstacle-Aware Robotic Manipulation: A Comparative Analysis of IKNet Variants

【速读】: This paper addresses the conflict between the black-box nature of deep neural networks used for inverse kinematics (IK) inference and the transparency and safety requirements of responsible-AI regulation. The key to the solution is an explainability-centered (XAI) workflow that combines Shapley-value attribution with physics-based obstacle-avoidance evaluation: two lightweight IKNet variants are trained (Improved IKNet with residual connections and Focused IKNet with position-orientation decoupling), the InterpretML toolkit visualizes partial-dependence patterns that expose nonlinear couplings between Cartesian poses and joint angles, and each network is embedded in a simulator with varied obstacle scenes to quantify the relationship between attribution balance and physical safety margins, identifying architectures that are both safer and more accurate. The methodology offers a concrete path toward trustworthy, data-driven robotic manipulation aligned with emerging responsible-AI standards.

链接: https://arxiv.org/abs/2512.23312
作者: Sheng-Kai Chen,Yi-Ling Tsai,Chun-Chih Chang,Yan-Chen Chen,Po-Chiang Lin
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 27 pages, 16 figures

点击查看摘要

Abstract:Deep neural networks have accelerated inverse-kinematics (IK) inference to the point where low-cost manipulators can execute complex trajectories in real time, yet the opaque nature of these models contradicts the transparency and safety requirements emerging in responsible AI regulation. This study proposes an explainability-centered workflow that integrates Shapley-value attribution with physics-based obstacle avoidance evaluation for the ROBOTIS OpenManipulator-X. Building upon the original IKNet, two lightweight variants (Improved IKNet with residual connections and Focused IKNet with position-orientation decoupling) are trained on a large, synthetically generated pose-joint dataset. SHAP is employed to derive both global and local importance rankings, while the InterpretML toolkit visualizes partial-dependence patterns that expose non-linear couplings between Cartesian poses and joint angles. To bridge algorithmic insight and robotic safety, each network is embedded in a simulator that subjects the arm to randomized single- and multi-obstacle scenes; forward kinematics, capsule-based collision checks, and trajectory metrics quantify the relationship between attribution balance and physical clearance. Qualitative heat maps reveal that architectures distributing importance more evenly across pose dimensions tend to maintain wider safety margins without compromising positional accuracy. The combined analysis demonstrates that explainable AI (XAI) techniques can illuminate hidden failure modes, guide architectural refinements, and inform obstacle-aware deployment strategies for learning-based IK. The proposed methodology thus contributes a concrete path toward trustworthy, data-driven manipulation that aligns with emerging responsible-AI standards.
zh

[AI-28] Splitwise: Collaborative Edge-Cloud Inference for LLM s via Lyapunov-Assisted DRL

【速读】: This paper addresses how to serve large language models (LLMs) on resource-constrained edge devices with low latency, low energy consumption, and stable queues. Cloud-only inference relieves the device but introduces high latency and cost, while static edge-cloud partitions cannot adapt to fluctuating network bandwidth. The key to the solution is Splitwise, a Lyapunov-assisted deep reinforcement learning (DRL) framework: it decomposes transformer layers at fine granularity into attention heads and feed-forward sub-blocks, greatly enlarging the space of partition choices, and uses a hierarchical DRL policy guided by Lyapunov optimization to jointly minimize latency, energy, and accuracy degradation while guaranteeing queue stability. Partition checkpoints with exponential backoff recovery further provide robustness against communication failures.

链接: https://arxiv.org/abs/2512.23310
作者: Abolfazl Younesi,Abbas Shabrang Maryan,Elyas Oustad,Zahra Najafabadi Samani,Mohsen Ansari,Thomas Fahringer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Emerging Technologies (cs.ET); Networking and Internet Architecture (cs.NI)
备注: 11 pages, 9 figures. Accepted by ACM for presentation at UCC '25 (18th International Conference on Utility and Cloud Computing), December 1-4, 2025, France. Proceedings publication pending

点击查看摘要

Abstract:Deploying large language models (LLMs) on edge devices is challenging due to their limited memory and power resources. Cloud-only inference reduces device burden but introduces high latency and cost. Static edge-cloud partitions optimize a single metric and struggle when bandwidth fluctuates. We propose Splitwise, a novel Lyapunov-assisted deep reinforcement learning (DRL) framework for fine-grained, adaptive partitioning of LLMs across edge and cloud environments. Splitwise decomposes transformer layers into attention heads and feed-forward sub-blocks, exposing more partition choices than layer-wise schemes. A hierarchical DRL policy, guided by Lyapunov optimization, jointly minimizes latency, energy consumption, and accuracy degradation while guaranteeing queue stability under stochastic workloads and variable network bandwidth. Splitwise also guarantees robustness via partition checkpoints with exponential backoff recovery in case of communication failures. Experiments on Jetson Orin NX, Galaxy S23, and Raspberry Pi 5 with GPT-2 (1.5B), LLaMA-7B, and LLaMA-13B show that Splitwise reduces end-to-end latency by 1.4x-2.8x and cuts energy consumption by up to 41% compared with existing partitioners. It lowers the 95th-percentile latency by 53-61% relative to cloud-only execution, while maintaining accuracy and modest memory requirements.
zh
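
The Lyapunov guidance referenced above is, in its generic form, the drift-plus-penalty method; a sketch of that form is below, with the caveat that Splitwise's concrete queues, penalty, and trade-off weight V are defined in the paper, not here.

```latex
% Generic drift-plus-penalty form behind Lyapunov-assisted control.
% Q_i(t) are task queues with arrivals a_i(t) and service b_i(t);
% V >= 0 trades queue stability against the latency/energy/accuracy
% penalty p(t) of the chosen partition action:
\[
  Q_i(t+1) = \max\{\, Q_i(t) + a_i(t) - b_i(t),\; 0 \,\}, \qquad
  L(t) = \tfrac{1}{2}\sum_i Q_i(t)^2,
\]
\[
  \text{choose the action minimizing } \;
  \mathbb{E}\big[\, L(t+1) - L(t) + V\, p(t) \;\big|\; \mathbf{Q}(t) \big].
\]
```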

[AI-29] Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control

【速读】: This paper confronts a fundamental barrier to applying current generative AI to the control of physical systems: perception-centric architectures can produce semantically plausible reasoning but cannot guarantee physically valid actions, achieving only about 50-53% accuracy on basic quantitative physics tasks and behaving as approximate guessers rather than reliable decision-makers. The key to the solution is a different pathway toward domain-specific foundation models: a compact language model operating as Agentic Physical AI, whose policy optimization is driven by physics-based validation rather than perceptual inference, providing outcome-space safety guarantees. Experiments on synthetic reactor control show a sharp phase transition as training data scales from 10^3 to 10^5 examples: small models exhibit high-variance imitation with catastrophic tail risk, while large models achieve over 500x variance collapse, stabilize execution behavior, autonomously reject roughly 70% of the training distribution to concentrate on a single strategy, and transfer learned representations across distinct physics and continuous input modalities without architectural modification.

链接: https://arxiv.org/abs/2512.23292
作者: Yoonpyo Lee,Kazuma Kobayashi,Sai Puppala,Sajedul Talukder,Seid Koric,Souvik Chakraborty,Syed Bahauddin Alam
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The prevailing paradigm in AI for physical systems, scaling general-purpose foundation models toward universal multimodal reasoning, confronts a fundamental barrier at the control interface. Recent benchmarks show that even frontier vision-language models achieve only 50-53% accuracy on basic quantitative physics tasks, behaving as approximate guessers that preserve semantic plausibility while violating physical constraints. This input unfaithfulness is not a scaling deficiency but a structural limitation. Perception-centric architectures optimize parameter-space imitation, whereas safety-critical control demands outcome-space guarantees over executed actions. Here, we present a fundamentally different pathway toward domain-specific foundation models by introducing compact language models operating as Agentic Physical AI, in which policy optimization is driven by physics-based validation rather than perceptual inference. We train a 360-million-parameter model on synthetic reactor control scenarios, scaling the dataset from 10^3 to 10^5 examples. This induces a sharp phase transition absent in general-purpose models. Small-scale systems exhibit high-variance imitation with catastrophic tail risk, while large-scale models undergo variance collapse exceeding 500x reduction, stabilizing execution-level behavior. Despite balanced exposure to four actuation families, the model autonomously rejects approximately 70% of the training distribution and concentrates 95% of runtime execution on a single-bank strategy. Learned representations transfer across distinct physics and continuous input modalities without architectural modification.

[AI-30] KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta

【Quick Read】: This paper addresses the efficiency of deep learning recommendation model (DLRM) training and inference across heterogeneous hardware platforms, where the core challenges are model-architecture diversity, kernel-primitive diversity, and heterogeneity across hardware generations and architectures. The key to the solution is KernelEvolve, an agentic kernel-coding framework that automates kernel generation and optimization through multiple programming abstractions (from Triton and CuTe DSL down to low-level hardware-agnostic languages), casts the optimization process as graph-based search, and adapts dynamically to runtime context via retrieval-augmented prompt synthesis, enabling efficient deployment of production recommendation models across NVIDIA and AMD GPUs as well as Meta's AI accelerators.

Link: https://arxiv.org/abs/2512.23236
Authors: Gang Liao,Hongsen Qin,Ying Wang,Alicia Golden,Michael Kuchnik,Yavuz Yetim,Jia Jiunn Ang,Chunli Fu,Yihan He,Samuel Hsia,Zewei Jiang,Dianshi Li,Uladzimir Pashkevich,Varna Puvvada,Feng Shi,Matt Steiner,Ruichao Xiao,Nathan Yan,Xiayu Yu,Zhou Fang,Abdul Zainul-Abedin,Ketan Singh,Hongtao Yu,Wenyuan Chi,Barney Huang,Sean Zhang,Noah Weller,Zach Marine,Wyatt Cook,Carole-Jean Wu,Gaoxiang Liu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Multiagent Systems (cs.MA); Performance (cs.PF)
Comments:

Abstract:Making deep learning recommendation model (DLRM) training and inference fast and efficient is important. However, this presents three key system challenges - model architecture diversity, kernel primitive diversity, and hardware generation and architecture heterogeneity. This paper presents KernelEvolve-an agentic kernel coding framework-to tackle heterogeneity at-scale for DLRM. KernelEvolve is designed to take kernel specifications as input and automate the process of kernel generation and optimization for recommendation model across heterogeneous hardware architectures. KernelEvolve does so by operating at multiple programming abstractions, from Triton and CuTe DSL to low-level hardware agnostic languages, spanning the full hardware-software optimization stack. The kernel optimization process is described as graph-based search with selection policy, universal operator, fitness function, and termination rule, dynamically adapts to runtime execution context through retrieval-augmented prompt synthesis. We designed, implemented, and deployed KernelEvolve to optimize a wide variety of production recommendation models across generations of NVIDIA and AMD GPUs, as well as Meta’s AI accelerators. We validate KernelEvolve on the publicly-available KernelBench suite, achieving 100% pass rate on all 250 problems across three difficulty levels, and 160 PyTorch ATen operators across three heterogeneous hardware platforms, demonstrating 100% correctness. KernelEvolve reduces development time from weeks to hours and achieves substantial performance improvements over PyTorch baselines across diverse production use cases and for heterogeneous AI systems at-scale. Beyond performance efficiency improvements, KernelEvolve significantly mitigates the programmability barrier for new AI hardware by enabling automated kernel generation for in-house developed AI hardware.

[AI-31] TCEval: Using Thermal Comfort to Assess Cognitive and Perceptual Abilities of AI

【Quick Read】: This paper addresses the lack of ecologically valid, task-specific evaluation of large language models (LLMs), in particular the absence of quantitative assessments of real-world cognitive abilities such as cross-modal reasoning, causal association, and adaptive decision-making. The key to the solution is TCEval, a thermal-comfort-based multimodal evaluation framework in which LLM agents simulate human clothing-insulation choices and comfort feedback under varying environmental conditions, validated against the ASHRAE Global Database and the Chinese Thermal Comfort Database. TCEval provides the first systematic test of these three core cognitive capacities and shows that current LLMs possess foundational cross-modal reasoning but lack a precise causal understanding of the nonlinear variable relationships underlying thermal comfort, offering an interpretable, context-aware evaluation paradigm for human-centric AI applications such as smart buildings.

Link: https://arxiv.org/abs/2512.23217
Authors: Jingming Li
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:A critical gap exists in LLM task-specific benchmarks. Thermal comfort, a sophisticated interplay of environmental factors and personal perceptions involving sensory integration and adaptive decision-making, serves as an ideal paradigm for evaluating real-world cognitive capabilities of AI systems. To address this, we propose TCEval, the first evaluation framework that assesses three core cognitive capacities of AI, cross-modal reasoning, causal association, and adaptive decision-making, by leveraging thermal comfort scenarios and large language model (LLM) agents. The methodology involves initializing LLM agents with virtual personality attributes, guiding them to generate clothing insulation selections and thermal comfort feedback, and validating outputs against the ASHRAE Global Database and Chinese Thermal Comfort Database. Experiments on four LLMs show that while agent feedback has limited exact alignment with humans, directional consistency improves significantly with a 1 PMV tolerance. Statistical tests reveal that LLM-generated PMV distributions diverge markedly from human data, and agents perform near-randomly in discrete thermal comfort classification. These results confirm the feasibility of TCEval as an ecologically valid Cognitive Turing Test for AI, demonstrating that current LLMs possess foundational cross-modal reasoning ability but lack precise causal understanding of the nonlinear relationships between variables in thermal comfort. TCEval complements traditional benchmarks, shifting AI evaluation focus from abstract task proficiency to embodied, context-aware perception and decision-making, offering valuable insights for advancing AI in human-centric applications like smart buildings.

[AI-32] From Model Choice to Model Belief: Establishing a New Measure for LLM-Based Research

【Quick Read】: This paper tackles the inefficiency of common practices for using LLM-generated data, where a single model output ("model choice") is treated as one data point, underutilizing the information in the LLM's inherent probabilistic structure. The key to the solution is the new, formalized concept of "model belief": a measure derived from the LLM's token-level probabilities that captures the model's belief distribution over choice alternatives within a single generation run. The authors prove that model belief is asymptotically equivalent to the mean of repeated model choices (a non-trivial property) while forming a lower-variance, faster-converging estimator, which substantially improves statistical efficiency. Empirically, with a limited number of runs, model belief explains and predicts ground-truth behavior better than model choice itself and reduces the computation needed for comparable accuracy by roughly a factor of 20, supporting its adoption as the default measure for extracting information from LLM-generated data.

Link: https://arxiv.org/abs/2512.23184
Authors: Hongshen Sun,Juanjuan Zhang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Econometrics (econ.EM)
Comments:

Abstract:Large language models (LLMs) are increasingly used to simulate human behavior, but common practices to use LLM-generated data are inefficient. Treating an LLM’s output (“model choice”) as a single data point underutilizes the information inherent to the probabilistic nature of LLMs. This paper introduces and formalizes “model belief,” a measure derived from an LLM’s token-level probabilities that captures the model’s belief distribution over choice alternatives in a single generation run. The authors prove that model belief is asymptotically equivalent to the mean of model choices (a non-trivial property) but forms a more statistically efficient estimator, with lower variance and a faster convergence rate. Analogous properties are shown to hold for smooth functions of model belief and model choice often used in downstream applications. The authors demonstrate the performance of model belief through a demand estimation study, where an LLM simulates consumer responses to different prices. In practical settings with limited numbers of runs, model belief explains and predicts ground-truth model choice better than model choice itself, and reduces the computation needed to reach sufficiently accurate estimates by roughly a factor of 20. The findings support using model belief as the default measure to extract more information from LLM-generated data.
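
To make the "model belief" idea concrete, here is a minimal sketch (ours, not the authors' code) of how a belief distribution over discrete alternatives could be read off an LLM's token-level probabilities in one forward pass. The assumption that each alternative maps to a single vocabulary token is illustrative.

```python
import numpy as np

def model_belief(logits: np.ndarray, choice_token_ids: list[int]) -> np.ndarray:
    """Belief distribution over choice alternatives from one generation step.

    `logits` is the next-token logit vector at the answer position;
    `choice_token_ids` maps each alternative (e.g., "A", "B") to a vocab id.
    Probabilities are renormalized over the alternatives only.
    """
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    belief = probs[choice_token_ids]
    return belief / belief.sum()

# Toy example: a 5-token vocabulary where ids 1 and 3 encode choices "A"/"B".
logits = np.array([0.1, 2.0, -1.0, 1.2, 0.0])
belief = model_belief(logits, [1, 3])
print(belief)           # e.g., [0.69 0.31] -- the full "model belief"
print(belief.argmax())  # a single "model choice" discards the rest
```

Averaging many sampled choices would estimate the same distribution, but the belief vector delivers it in one run, which is where the claimed variance and compute savings come from.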

[AI-33] EquaCode: A Multi-Strategy Jailbreak Approach for Large Language Models via Equation Solving and Code Completion AAAI2026

【Quick Read】: This paper addresses the insufficient robustness of large language models (LLMs) against jailbreak attacks: existing attacks mostly operate at the natural-language level and rely on a single strategy, limiting their value for comprehensively assessing model safety. The key to the solution is EquaCode, a multi-strategy attack based on equation solving and code completion: malicious intent is transformed into a mathematical problem that the model is asked to solve by writing code, exploiting the complexity of the cross-domain task to divert the model's attention from safety constraints and thereby bypass guardrails. Experiments show an average success rate of 91.19% on the GPT series and 98.65% across three state-of-the-art LLMs, each with only a single query, and ablations reveal a strong synergy between the equation and code modules, validating the multi-strategy design.

Link: https://arxiv.org/abs/2512.23173
Authors: Zhen Liang,Hai Huang,Zhengkui Chen
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: This is a preprint. A revised version will appear in the Proceedings of AAAI 2026

Abstract:Large language models (LLMs), such as ChatGPT, have achieved remarkable success across a wide range of fields. However, their trustworthiness remains a significant concern, as they are still susceptible to jailbreak attacks aimed at eliciting inappropriate or harmful responses. However, existing jailbreak attacks mainly operate at the natural language level and rely on a single attack strategy, limiting their effectiveness in comprehensively assessing LLM robustness. In this paper, we propose Equacode, a novel multi-strategy jailbreak approach for large language models via equation-solving and code completion. This approach transforms malicious intent into a mathematical problem and then requires the LLM to solve it using code, leveraging the complexity of cross-domain tasks to divert the model’s focus toward task completion rather than safety constraints. Experimental results show that Equacode achieves an average success rate of 91.19% on the GPT series and 98.65% across 3 state-of-the-art LLMs, all with only a single query. Further, ablation experiments demonstrate that EquaCode outperforms either the mathematical equation module or the code module alone. This suggests a strong synergistic effect, thereby demonstrating that multi-strategy approach yields results greater than the sum of its parts.

[AI-34] SPIRAL: Symbolic LLM Planning via Grounded and Reflective Search

【Quick Read】: This paper addresses the poor performance of large language models (LLMs) on complex planning tasks, where linear reasoning struggles to self-correct, while classical search algorithms such as Monte Carlo Tree Search (MCTS) are ineffective under sparse rewards and fail to exploit the semantic capabilities of LLMs. The key to the solution is SPIRAL, a framework that embeds a cognitive architecture into the MCTS loop via three specialized LLM agents: a Planner proposes creative next steps, a Simulator grounds the search by predicting realistic outcomes, and a Critic supplies dense reward signals through reflection. This structured planning pipeline turns MCTS from brute-force search into a guided, self-correcting reasoning process, yielding substantial gains in task accuracy and token efficiency.

Link: https://arxiv.org/abs/2512.23167
Authors: Yifan Zhang,Giridhar Ganapavarapu,Srideepika Jayaraman,Bhavna Agrawal,Dhaval Patel,Achille Fokoue
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:Large Language Models (LLMs) often falter at complex planning tasks that require exploration and self-correction, as their linear reasoning process struggles to recover from early mistakes. While search algorithms like Monte Carlo Tree Search (MCTS) can explore alternatives, they are often ineffective when guided by sparse rewards and fail to leverage the rich semantic capabilities of LLMs. We introduce SPIRAL (Symbolic LLM Planning via Grounded and Reflective Search), a novel framework that embeds a cognitive architecture of three specialized LLM agents into an MCTS loop. SPIRAL’s key contribution is its integrated planning pipeline where a Planner proposes creative next steps, a Simulator grounds the search by predicting realistic outcomes, and a Critic provides dense reward signals through reflection. This synergy transforms MCTS from a brute-force search into a guided, self-correcting reasoning process. On the DailyLifeAPIs and HuggingFace datasets, SPIRAL consistently outperforms the default Chain-of-Thought planning method and other state-of-the-art agents. More importantly, it substantially surpasses other state-of-the-art agents; for example, SPIRAL achieves 83.6% overall accuracy on DailyLifeAPIs, an improvement of over 16 percentage points against the next-best search framework, while also demonstrating superior token efficiency. Our work demonstrates that structuring LLM reasoning as a guided, reflective, and grounded search process yields more robust and efficient autonomous planners. The source code, full appendices, and all experimental data are available for reproducibility at the official project repository.
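
A minimal sketch of how three agent roles might slot into an MCTS loop, in our reading of SPIRAL's architecture (not the authors' code). The three `llm_*` callables stand in for prompted LLM agents and are purely illustrative.

```python
import math, random

def llm_planner(state):   return [state + f"+step{i}" for i in range(2)]  # propose actions
def llm_simulator(state, action): return action                           # predict outcome state
def llm_critic(state):    return random.random()                          # dense reward in [0, 1]

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def mcts(root_state, iterations=20):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        while node.children:                      # selection
            node = max(node.children, key=ucb)
        for action in llm_planner(node.state):    # expansion via the Planner
            node.children.append(Node(llm_simulator(node.state, action), node))
        leaf = random.choice(node.children)
        reward = llm_critic(leaf.state)           # reflective reward via the Critic
        while leaf:                               # backpropagation
            leaf.visits += 1; leaf.value += reward; leaf = leaf.parent
    return max(root.children, key=lambda n: n.visits).state

print(mcts("plan"))
```

The Simulator's grounding replaces a rollout, and the Critic's reflection replaces the sparse terminal reward that normally makes MCTS struggle on such tasks.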

[AI-35] Why We Need a New Framework for Emotional Intelligence in AI

【Quick Read】: This paper argues that current frameworks for evaluating emotional intelligence (EI) in artificial intelligence (AI) systems are inadequate: they neither comprehensively nor accurately measure the dimensions of EI that are actually relevant to AI. The key to the solution is threefold: first, systematically reviewing theories of emotion and models of human EI to clarify which elements apply to artificial systems; second, critically analyzing the limitations of existing benchmark frameworks and identifying where they lack a solid foundation regarding the nature of emotion; and third, outlining directions for improved evaluation strategies so that assessments of AI's emotional-intelligence capabilities become more principled and targeted.

Link: https://arxiv.org/abs/2512.23163
Authors: Max Parks,Kheli Atluru,Meera Vinod,Mike Kuniavsky,Jud Brewer,Sean White,Sarah Adler,Wendy Ju
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper, we develop the position that current frameworks for evaluating emotional intelligence (EI) in artificial intelligence (AI) systems need refinement because they do not adequately or comprehensively measure the various aspects of EI relevant in AI. Human EI often involves a phenomenological component and a sense of understanding that artificially intelligent systems lack; therefore, some aspects of EI are irrelevant in evaluating AI systems. However, EI also includes an ability to sense an emotional state, explain it, respond appropriately, and adapt to new contexts (e.g., multicultural), and artificially intelligent systems can do such things to greater or lesser degrees. Several benchmark frameworks specialize in evaluating the capacity of different AI models to perform some tasks related to EI, but these often lack a solid foundation regarding the nature of emotion and what it is to be emotionally intelligent. In this project, we begin by reviewing different theories about emotion and general EI, evaluating the extent to which each is applicable to artificial systems. We then critically evaluate the available benchmark frameworks, identifying where each falls short in light of the account of EI developed in the first section. Lastly, we outline some options for improving evaluation strategies to avoid these shortcomings in EI evaluation in AI systems.

[AI-36] It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

【Quick Read】: This paper addresses the vulnerability of web agents built on large language models (LLMs) to prompt injection attacks when performing tasks such as email management or professional networking: because these agents rely on dynamic web content, adversarial instructions hidden in interface elements can divert them from their original task, posing a serious security risk. The key to the solution is the Task-Redirecting Agent Persuasion Benchmark (TRAP), a benchmark for evaluating how persuasion techniques misguide autonomous web agents, together with a modular social-engineering injection framework for controlled experiments on high-fidelity website clones, which systematically exposes psychologically driven vulnerabilities in web agents and their high sensitivity to small interface changes.

Link: https://arxiv.org/abs/2512.23128
Authors: Karolina Korgul,Yushi Yang,Arkadiusz Drohomirecki,Piotr Błaszczyk,Will Howard,Lukas Aichberger,Chris Russell,Philip H.S. Torr,Adam Mahdi,Adel Bibi
Affiliation: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:Web-based agents powered by large language models are increasingly used for tasks such as email management or professional networking. Their reliance on dynamic web content, however, makes them vulnerable to prompt injection attacks: adversarial instructions hidden in interface elements that persuade the agent to divert from its original task. We introduce the Task-Redirecting Agent Persuasion Benchmark (TRAP), an evaluation for studying how persuasion techniques misguide autonomous web agents on realistic tasks. Across six frontier models, agents are susceptible to prompt injection in 25% of tasks on average (13% for GPT-5 to 43% for DeepSeek-R1), with small interface or contextual changes often doubling success rates and revealing systemic, psychologically driven vulnerabilities in web-based agents. We also provide a modular social-engineering injection framework with controlled experiments on high-fidelity website clones, allowing for further benchmark expansion.

[AI-37] InSPO: Unlocking Intrinsic Self-Reflection for LLM Preference Optimization

【Quick Read】: This paper identifies two fundamental problems with Direct Preference Optimization (DPO) and its variants for aligning large language models (LLMs): first, the optimal policy depends on arbitrary modeling choices (the scalarization function and the reference policy), so learned behavior reflects parameterization artifacts rather than true preferences; second, treating response generation in isolation discards the comparative information in pairwise data and leaves the model's intrinsic capacity for self-reflection untapped. The key to the solution is Intrinsic Self-reflective Preference Optimization (InSPO), which derives a globally optimal policy by conditioning on both the context and the alternative response, guaranteeing invariance to scalarization and reference-policy choices without architectural changes or inference overhead. Experiments show consistent improvements in win rates and length-controlled metrics, confirming that unlocking self-reflection yields more robust, human-aligned LLMs.

Link: https://arxiv.org/abs/2512.23126
Authors: Yu Li,Tian Lan,Zhengling Qi
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Direct Preference Optimization (DPO) and its variants have become standard for aligning Large Language Models due to their simplicity and offline stability. However, we identify two fundamental limitations. First, the optimal policy depends on arbitrary modeling choices (scalarization function, reference policy), yielding behavior reflecting parameterization artifacts rather than true preferences. Second, treating response generation in isolation fails to leverage comparative information in pairwise data, leaving the model's capacity for intrinsic self-reflection untapped. To address it, we propose Intrinsic Self-reflective Preference Optimization (InSPO), deriving a globally optimal policy conditioning on both context and alternative responses. We prove this formulation superior to DPO/RLHF while guaranteeing invariance to scalarization and reference choices. InSPO serves as a plug-and-play enhancement without architectural changes or inference overhead. Experiments demonstrate consistent improvements in win rates and length-controlled metrics, validating that unlocking self-reflection yields more robust, human-aligned LLMs.
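
One way to picture the "condition on the alternative response" idea is a pairwise loss in which each response is scored with the competing response in context. This schematic loss form is our own illustration, not the paper's derivation, and `beta` is a hypothetical temperature.

```python
import torch
import torch.nn.functional as F

def self_reflective_pair_loss(logp_w_given_l, logp_l_given_w, beta=1.0):
    """logp_w_given_l: log pi(y_w | x, y_l); logp_l_given_w: log pi(y_l | x, y_w).
    Encourages preferring y_w even when the alternative y_l is in context."""
    margin = beta * (logp_w_given_l - logp_l_given_w)
    return -F.logsigmoid(margin).mean()

# Toy batch of two preference pairs (log-probs would come from the policy).
logp_w = torch.tensor([-12.0, -9.5])
logp_l = torch.tensor([-14.0, -9.0])
print(self_reflective_pair_loss(logp_w, logp_l))
```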

[AI-38] How Much Data Is Enough? Uniform Convergence Bounds for Generative Vision-Language Models under Low-Dimensional Structure

【Quick Read】: This paper asks when the predicted probabilities of generative models and vision-language models (VLMs) used in biomedical decision support are uniformly accurate and well calibrated, across all inputs, classes, and subpopulations rather than merely on average; this matters because local errors on rare conditions or specific groups can be masked by low overall loss. The key to the solution is a finite-sample analysis: assuming the model's outputs are Lipschitz-stable with respect to a low-dimensional semantic representation (e.g., prompt embeddings), classical uniform convergence theory yields non-asymptotic uniform bounds for accuracy and calibration functionals of VLM-induced classifiers. Crucially, the implied sample complexity depends on the intrinsic/effective dimension rather than the ambient dimension of the high-dimensional embedding space, and spectrum-dependent bounds make explicit how eigenvalue decay governs data requirements, with direct implications for data-limited biomedical modeling.

Link: https://arxiv.org/abs/2512.23109
Authors: Paul M. Thompson
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 13 pages, 2 figures

Abstract:Modern generative and vision-language models (VLMs) are increasingly used in scientific and medical decision support, where predicted probabilities must be both accurate and well calibrated. Despite strong empirical results with moderate data, it remains unclear when such predictions generalize uniformly across inputs, classes, or subpopulations, rather than only on average-a critical issue in biomedicine, where rare conditions and specific groups can exhibit large errors even when overall loss is low. We study this question from a finite-sample perspective and ask: under what structural assumptions can generative and VLM-based predictors achieve uniformly accurate and calibrated behavior with practical sample sizes? Rather than analyzing arbitrary parameterizations, we focus on induced families of classifiers obtained by varying prompts or semantic embeddings within a restricted representation space. When model outputs depend smoothly on a low-dimensional semantic representation-an assumption supported by spectral structure in text and joint image-text embeddings-classical uniform convergence tools yield meaningful non-asymptotic guarantees. Our main results give finite-sample uniform convergence bounds for accuracy and calibration functionals of VLM-induced classifiers under Lipschitz stability with respect to prompt embeddings. The implied sample complexity depends on intrinsic/effective dimension, not ambient embedding dimension, and we further derive spectrum-dependent bounds that make explicit how eigenvalue decay governs data requirements. We conclude with implications for data-limited biomedical modeling, including when current dataset sizes can support uniformly reliable predictions and why average calibration metrics may miss worst-case miscalibration.

[AI-39] Benchmark Success, Clinical Failure: When Reinforcement Learning Optimizes for Benchmarks, Not Patients

【Quick Read】: This paper examines the tension between performance and generalization when applying reinforcement learning (RL) to medical imaging under resource constraints: although RL improves reasoning in large language models (LLMs), it is unclear whether an R1-style pipeline (supervised fine-tuning followed by GRPO), trained with only 2,000 SFT samples, 1,000 RL samples, and a single A100 GPU, can deliver both strong in-distribution performance and cross-dataset transfer. The key finding is that GRPO recovers in-distribution performance (a 23% improvement on CheXpert) but degrades cross-dataset transferability (a 19% drop on NIH), whereas the SFT checkpoint uniquely improves on NIH before RL optimization, suggesting that teacher-guided reasoning captures more institution-agnostic features; for clinical deployment requiring robustness across diverse populations, curated supervised fine-tuning may therefore outperform aggressive RL.

Link: https://arxiv.org/abs/2512.23090
Authors: Armin Berger,Manuela Bergau,Helen Schneider,Saad Ahmad,Tom Anglim Lagones,Gianluca Brugnara,Martha Foltyn-Dumitru,Kai Schlamp,Philipp Vollmuth,Rafet Sifa
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Recent Reinforcement Learning (RL) advances for Large Language Models (LLMs) have improved reasoning tasks, yet their resource-constrained application to medical imaging remains underexplored. We introduce ChexReason, a vision-language model trained via R1-style methodology (SFT followed by GRPO) using only 2,000 SFT samples, 1,000 RL samples, and a single A100 GPU. Evaluations on CheXpert and NIH benchmarks reveal a fundamental tension: GRPO recovers in-distribution performance (23% improvement on CheXpert, macro-F1 = 0.346) but degrades cross-dataset transferability (19% drop on NIH). This mirrors high-resource models like NV-Reason-CXR-3B, suggesting the issue stems from the RL paradigm rather than scale. We identify a generalization paradox where the SFT checkpoint uniquely improves on NIH before optimization, indicating teacher-guided reasoning captures more institution-agnostic features. Furthermore, cross-model comparisons show structured reasoning scaffolds benefit general-purpose VLMs but offer minimal gain for medically pre-trained models. Consequently, curated supervised fine-tuning may outperform aggressive RL for clinical deployment requiring robustness across diverse populations.

[AI-40] Taming the Tail: Stable LLM Reinforcement Learning via Dynamic Vocabulary Pruning

【Quick Read】: This paper addresses the training-inference mismatch in reinforcement learning (RL) for large language models (LLMs): high-throughput inference engines and numerically precise training systems produce different probability distributions from the same parameters, destabilizing gradient estimation. The core issue is that low-probability tokens carry systematically biased mismatches that accumulate over sequences and dominate the instability. The key to the solution is dynamically pruning a "safe" vocabulary that excludes the extreme tail, trading a large, systematic bias for a small, bounded optimization bias and thereby stabilizing training; the authors also prove a bound on the optimization bias introduced by pruning.

Link: https://arxiv.org/abs/2512.23087
Authors: Yingru Li,Jiawei Xu,Jiacai Liu,Yuxuan Tong,Ziniu Li,Tianle Cai,Ge Zhang,Qian Liu,Baoxiang Wang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:Reinforcement learning for large language models (LLMs) faces a fundamental tension: high-throughput inference engines and numerically-precise training systems produce different probability distributions from the same parameters, creating a training-inference mismatch. We prove this mismatch has an asymmetric effect: the bound on log-probability mismatch scales as (1-p) where p is the token probability. For high-probability tokens, this bound vanishes, contributing negligibly to sequence-level mismatch. For low-probability tokens in the tail, the bound remains large, and moreover, when sampled, these tokens exhibit systematically biased mismatches that accumulate over sequences, destabilizing gradient estimation. Rather than applying post-hoc corrections, we propose constraining the RL objective to a dynamically-pruned ``safe’’ vocabulary that excludes the extreme tail. By pruning such tokens, we trade large, systematically biased mismatches for a small, bounded optimization bias. Empirically, our method achieves stable training; theoretically, we bound the optimization bias introduced by vocabulary pruning.
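
A minimal sketch (assumptions ours, including the threshold value) of the safe-vocabulary idea: policy-gradient terms for tokens sampled from the extreme tail, where the training/inference mismatch is largest and systematically biased, are simply masked out of the objective.

```python
import torch

def safe_token_mask(logprobs: torch.Tensor, sampled_ids: torch.Tensor,
                    tail_threshold: float = 1e-4) -> torch.Tensor:
    """logprobs: (T, V) per-step log-probabilities; sampled_ids: (T,) tokens.
    Returns a (T,) bool mask that is False for tokens sampled from the tail."""
    p_sampled = logprobs.gather(1, sampled_ids.unsqueeze(1)).squeeze(1).exp()
    return p_sampled >= tail_threshold

T, V = 6, 1000
logprobs = torch.log_softmax(torch.randn(T, V), dim=-1)
ids = torch.randint(V, (T,))
mask = safe_token_mask(logprobs, ids)

# Masked positions contribute nothing to the policy-gradient loss.
advantages = torch.randn(T)
chosen_logprobs = logprobs.gather(1, ids.unsqueeze(1)).squeeze(1)
loss = -(mask * advantages * chosen_logprobs).mean()
print(mask, loss.item())
```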

[AI-41] Multimodal Functional Maximum Correlation for Emotion Recognition

【Quick Read】: This paper addresses the fact that emotional states manifest as coordinated yet heterogeneous multimodal physiological responses across the central and autonomic nervous systems, while existing self-supervised learning (SSL) methods rely on pairwise alignment objectives that cannot characterize higher-order dependencies among more than two modalities, in particular the complex interactions arising from coordinated brain and autonomic activity. The key to the solution is a Dual Total Correlation (DTC) objective: by deriving a tight sandwich bound and optimizing it with a trace surrogate based on Functional Maximum Correlation Analysis (FMCA), the proposed MFMC framework models joint multimodal dependence directly, without pairwise contrastive losses, and thus captures higher-order multimodal interactions more effectively.

Link: https://arxiv.org/abs/2512.23076
Authors: Deyang Zheng,Tianyi Zhang,Wenming Zheng,Shujian Yu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: manuscript currently under review at IEEE journals, 20 pages, 6 figures

Abstract:Emotional states manifest as coordinated yet heterogeneous physiological responses across central and autonomic systems, posing a fundamental challenge for multimodal representation learning in affective computing. Learning such joint dynamics is further complicated by the scarcity and subjectivity of affective annotations, which motivates the use of self-supervised learning (SSL). However, most existing SSL approaches rely on pairwise alignment objectives, which are insufficient to characterize dependencies among more than two modalities and fail to capture higher-order interactions arising from coordinated brain and autonomic responses. To address this limitation, we propose Multimodal Functional Maximum Correlation (MFMC), a principled SSL framework that maximizes higher-order multimodal dependence through a Dual Total Correlation (DTC) objective. By deriving a tight sandwich bound and optimizing it using a functional maximum correlation analysis (FMCA) based trace surrogate, MFMC captures joint multimodal interactions directly, without relying on pairwise contrastive losses. Experiments on three public affective computing benchmarks demonstrate that MFMC consistently achieves state-of-the-art or competitive performance under both subject-dependent and subject-independent evaluation protocols, highlighting its robustness to inter-subject variability. In particular, MFMC improves subject-dependent accuracy on CEAP-360VR from 78.9% to 86.8%, and subject-independent accuracy from 27.5% to 33.1% using the EDA signal alone. Moreover, MFMC remains within 0.8 percentage points of the best-performing method on the most challenging EEG subject-independent split of MAHNOB-HCI. Our code is available at this https URL.
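
For reference, the standard definition of dual total correlation for n modalities, the quantity the DTC objective bounds, can be written as below (notation ours; the paper's sandwich bound and FMCA surrogate are not reproduced here):

```latex
% Dual total correlation (binding information) over n modalities:
% joint entropy minus each modality's entropy conditioned on all others.
D(X_1,\dots,X_n) \;=\; H(X_1,\dots,X_n)
  \;-\; \sum_{i=1}^{n} H\!\left(X_i \mid X_1,\dots,X_{i-1},X_{i+1},\dots,X_n\right)
```

Unlike pairwise mutual information, this quantity is defined jointly over all n modalities at once, which is why optimizing it can capture higher-order interactions that pairwise contrastive losses miss.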

[AI-42] Trust Region Masking for Long-Horizon LLM Reinforcement Learning

【Quick Read】: This paper addresses the approximation error in LLM reinforcement learning (LLM-RL) when the rollout policy pi_roll differs from the current training policy pi_theta, so that the surrogate objective deviates non-negligibly from the true objective. Classical trust-region bounds on this error scale as O(T^2) in sequence length T and become vacuous for long-horizon tasks. The authors derive two tighter bounds, a Pinsker-Marginal bound scaling as O(T^{3/2}) and a Mixed bound scaling as O(T), both depending on the maximum token-level KL divergence D_kl^{tok,max}, an inherently sequence-level quantity that token-independent mechanisms such as PPO clipping cannot control. The key to the solution is Trust Region Masking (TRM): if any token in a sequence violates the trust-region constraint, the entire sequence is excluded from the gradient update, yielding the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL.

Link: https://arxiv.org/abs/2512.23075
Authors: Yingru Li,Jiacai Liu,Jiawei Xu,Yuxuan Tong,Ziniu Li,Baoxiang Wang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (stat.ML)
Comments:

Abstract:Policy gradient methods for large language models optimize a surrogate objective computed from samples of a rollout policy \pi_\textroll . When \pi_\textroll \ne \pi_\theta , there is approximation error between the surrogate and the true objective. Prior work has shown that this off-policy mismatch is unavoidable in modern LLM-RL due to implementation divergence, mixture-of-experts routing discontinuities, and distributed training staleness. Classical trust region bounds on the resulting error scale as O(T^2) with sequence length T , rendering them vacuous for long-horizon tasks. We derive two tighter bounds: a Pinsker-Marginal bound scaling as O(T^3/2) and a Mixed bound scaling as O(T) . Crucially, both bounds depend on D_kl^tok,max – the maximum token-level KL divergence across all positions in a sequence. This is inherently a sequence-level quantity: it requires examining the entire trajectory to compute, and therefore cannot be controlled by token-independent methods like PPO clipping. We propose Trust Region Masking (TRM), which excludes entire sequences from gradient computation if any token violates the trust region, providing the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL.
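
The masking rule itself is simple; here is a sketch in our own notation (the trust-region threshold `delta` is an illustrative value, not the paper's): a sequence contributes to the gradient only if every one of its tokens stays inside the trust region.

```python
import torch

def trm_sequence_mask(kl_tok: torch.Tensor, delta: float = 0.05) -> torch.Tensor:
    """kl_tok: (B, T) per-token KL estimates between rollout and training policy.
    Returns a (B,) bool mask; False means the whole sequence is excluded."""
    return kl_tok.max(dim=1).values <= delta   # controls the max token-level KL

B, T = 4, 8
kl_tok = torch.rand(B, T) * 0.1
seq_logprob_sum = torch.randn(B)   # sum of token log-probs per sequence
advantages = torch.randn(B)

mask = trm_sequence_mask(kl_tok)
loss = -(mask * advantages * seq_logprob_sum).sum() / mask.sum().clamp(min=1)
print(mask, loss.item())
```

Because the mask keys on the per-sequence maximum, it bounds exactly the sequence-level quantity that per-token clipping cannot.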

[AI-43] The Reward Model Selection Crisis in Personalized Alignment

【Quick Read】: This paper addresses the disconnect between reward model (RM) performance and actual deployment behavior in personalized alignment: existing work optimizes RM preference-ranking accuracy while overlooking that, under inference-time compute constraints, per-user adaptation must happen via reward-guided decoding (RGD), so an RM with high accuracy may still fail to guide token-level generation decisions. The key to the solution is **policy accuracy**, a new metric quantifying whether RGD scoring correctly discriminates preferred from dispreferred responses, together with Pref-LaMP, the first personalized-alignment benchmark with ground-truth user completions, enabling direct behavioral evaluation without circular reliance on reward-based metrics. Empirically, RM accuracy correlates only weakly with policy accuracy, and simple in-context learning (ICL) clearly outperforms all reward-guided methods on small models, revealing that the field has been optimizing proxy metrics that fail to deliver personalized behavior under deployment constraints.

Link: https://arxiv.org/abs/2512.23067
Authors: Fady Rezk,Yuangang Pan,Chuan-Sheng Foo,Xun Xu,Nancy Chen,Henry Gouk,Timothy Hospedales
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Personalized alignment from preference data has focused primarily on improving reward model (RM) accuracy, with the implicit assumption that better preference ranking translates to better personalized behavior. However, in deployment, computational constraints necessitate inference-time adaptation via reward-guided decoding (RGD) rather than per-user policy fine-tuning. This creates a critical but overlooked requirement: reward models must not only rank preferences accurately but also effectively guide token-level generation decisions. We demonstrate that standard RM accuracy fails catastrophically as a selection criterion for deployment-ready personalized alignment. Through systematic evaluation across three datasets, we introduce policy accuracy, a metric quantifying whether RGD scoring functions correctly discriminate between preferred and dispreferred responses. We show that RM accuracy correlates only weakly with this policy-level discrimination ability (Kendall’s tau = 0.08–0.31). More critically, we introduce Pref-LaMP, the first personalized alignment benchmark with ground-truth user completions, enabling direct behavioral evaluation without circular reward-based metrics. On Pref-LaMP, we expose a complete decoupling between discrimination and generation: methods with 20-point RM accuracy differences produce almost identical output quality, and even methods achieving high discrimination fail to generate behaviorally aligned responses. Finally, simple in-context learning (ICL) dominates all reward-guided methods for models 3B parameters, achieving 3-5 point ROUGE-1 gains over the best reward method at 7B scale. These findings show that the field optimizes proxy metrics that fail to predict deployment performance and do not translate preferences into real behavioral adaptation under deployment constraints.
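
The policy-accuracy metric as described above has a direct formulation: the fraction of preference pairs where the RGD scoring function ranks the preferred response above the dispreferred one. A minimal sketch (our formulation; the length-based scorer below is a toy stand-in):

```python
def policy_accuracy(pairs, rgd_score):
    """pairs: iterable of (context, preferred, dispreferred);
    rgd_score(context, response) -> float, the scoring function used by RGD."""
    wins = sum(rgd_score(c, y_w) > rgd_score(c, y_l) for c, y_w, y_l in pairs)
    return wins / len(pairs)

# Toy check with a brevity-preferring stand-in scorer.
pairs = [("q1", "short", "a much longer reply"), ("q2", "yes!", "no")]
print(policy_accuracy(pairs, lambda c, r: -len(r)))  # 0.5
```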

[AI-44] Problems With Large Language Models for Learner Modelling: Why LLMs Alone Fall Short for Responsible Tutoring in K–12 Education

【Quick Read】: This paper confronts the misconception that LLM-based tutors can replace traditional learner modelling for adaptive instruction in K-12 education, a domain the EU AI Act classifies as high-risk; the core question is whether LLMs can track learners' evolving knowledge accurately, reliably, and coherently over time. The key to the answer is an empirical comparison between a deep knowledge tracing (DKT) model and a widely used LLM, evaluated zero-shot and fine-tuned on a large open-access dataset, measuring next-step correctness prediction and the temporal stability of mastery estimates. Fine-tuning raises the LLM's AUC by about 8%, yet it still trails DKT by 6%, with higher early-sequence errors and inconsistent, sometimes wrong-direction knowledge updates, whereas DKT maintains stable, directionally correct mastery trajectories, indicating that responsible tutoring requires hybrid frameworks that incorporate learner modelling rather than LLMs alone.

Link: https://arxiv.org/abs/2512.23036
Authors: Danial Hooshyar,Yeongwook Yang,Gustav Šíř,Tommi Kärkkäinen,Raija Hämäläinen,Mutlu Cukurova,Roger Azevedo
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:The rapid rise of large language model (LLM)-based tutors in K–12 education has fostered a misconception that generative models can replace traditional learner modelling for adaptive instruction. This is especially problematic in K–12 settings, which the EU AI Act classifies as high-risk domain requiring responsible design. Motivated by these concerns, this study synthesises evidence on limitations of LLM-based tutors and empirically investigates one critical issue: the accuracy, reliability, and temporal coherence of assessing learners’ evolving knowledge over time. We compare a deep knowledge tracing (DKT) model with a widely used LLM, evaluated zero-shot and fine-tuned, using a large open-access dataset. Results show that DKT achieves the highest discrimination performance (AUC = 0.83) on next-step correctness prediction and consistently outperforms the LLM across settings. Although fine-tuning improves the LLM’s AUC by approximately 8% over the zero-shot baseline, it remains 6% below DKT and produces higher early-sequence errors, where incorrect predictions are most harmful for adaptive support. Temporal analyses further reveal that DKT maintains stable, directionally correct mastery updates, whereas LLM variants exhibit substantial temporal weaknesses, including inconsistent and wrong-direction updates. These limitations persist despite the fine-tuned LLM requiring nearly 198 hours of high-compute training, far exceeding the computational demands of DKT. Our qualitative analysis of multi-skill mastery estimation further shows that, even after fine-tuning, the LLM produced inconsistent mastery trajectories, while DKT maintained smooth and coherent updates. Overall, the findings suggest that LLMs alone are unlikely to match the effectiveness of established intelligent tutoring systems, and that responsible tutoring requires hybrid frameworks that incorporate learner modelling.

[AI-45] Viability and Performance of a Private LLM Server for SMBs: A Benchmark Analysis of Qwen 3-30B on Consumer-Grade Hardware

【Quick Read】: This paper addresses three core obstacles facing small and medium businesses (SMBs) that want to deploy high-performance large language models (LLMs): data-privacy risks, lack of operational sovereignty, and high cloud-service costs. The key to the solution is an on-premises deployment of a quantized open-source 30-billion-parameter Mixture-of-Experts (MoE) model (based on Qwen3) on a consumer-grade server with a next-generation NVIDIA GPU, yielding a low-cost, efficient private inference server whose performance is comparable to cloud-based services while preserving data security and operational autonomy.

Link: https://arxiv.org/abs/2512.23029
Authors: Alex Khalil,Guillaume Heilles,Maria Parraga,Simon Heilles
Affiliation: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:

Abstract:The proliferation of Large Language Models (LLMs) has been accompanied by a reliance on cloud-based, proprietary systems, raising significant concerns regarding data privacy, operational sovereignty, and escalating costs. This paper investigates the feasibility of deploying a high-performance, private LLM inference server at a cost accessible to Small and Medium Businesses (SMBs). We present a comprehensive benchmarking analysis of a locally hosted, quantized 30-billion parameter Mixture-of-Experts (MoE) model based on Qwen3, running on a consumer-grade server equipped with a next-generation NVIDIA GPU. Unlike cloud-based offerings, which are expensive and complex to integrate, our approach provides an affordable and private solution for SMBs. We evaluate two dimensions: the model’s intrinsic capabilities and the server’s performance under load. Model performance is benchmarked against academic and industry standards to quantify reasoning and knowledge relative to cloud services. Concurrently, we measure server efficiency through latency, tokens per second, and time to first token, analyzing scalability under increasing concurrent users. Our findings demonstrate that a carefully configured on-premises setup with emerging consumer hardware and a quantized open-source model can achieve performance comparable to cloud-based services, offering SMBs a viable pathway to deploy powerful LLMs without prohibitive costs or privacy compromises.

[AI-46] APO: Alpha-Divergence Preference Optimization

【Quick Read】: This paper addresses the stability-exploration trade-off created by committing to a single KL divergence in modern alignment training: forward KL, KL(q || pi_theta), is mode-covering but under-exploits high-reward regions, while reverse KL, KL(pi_theta || q), is mode-seeking but risks mode collapse. The key to the solution is Alpha-Divergence Preference Optimization (APO), which introduces the Csiszár alpha-divergence within an anchored coordinate system so that a single parameter alpha continuously interpolates between forward and reverse KL behavior inside the same anchored geometry. A reward-and-confidence-guarded alpha schedule moves from coverage to exploitation only when the policy is both improving and confidently calibrated, which markedly improves training stability while remaining competitive.

Link: https://arxiv.org/abs/2512.22953
Authors: Wang Zixian
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Two divergence regimes dominate modern alignment practice. Supervised fine-tuning and many distillation-style objectives implicitly minimize the forward KL divergence KL(q || pi_theta), yielding stable mode-covering updates but often under-exploiting high-reward modes. In contrast, PPO-style online reinforcement learning from human feedback behaves closer to reverse KL divergence KL(pi_theta || q), enabling mode-seeking improvements but risking mode collapse. Recent anchored methods, such as ADPO, show that performing the projection in anchored coordinates can substantially improve stability, yet they typically commit to a single divergence. We introduce Alpha-Divergence Preference Optimization (APO), an anchored framework that uses Csiszar alpha-divergence to continuously interpolate between forward and reverse KL behavior within the same anchored geometry. We derive unified gradient dynamics parameterized by alpha, analyze gradient variance properties, and propose a practical reward-and-confidence-guarded alpha schedule that transitions from coverage to exploitation only when the policy is both improving and confidently calibrated. Experiments on Qwen3-1.7B with math-level3 demonstrate that APO achieves competitive performance with GRPO and GSPO baselines while maintaining training stability.
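
To see how a single parameter interpolates the two KL regimes, here is one common parameterization of the alpha-divergence family for discrete distributions (a sketch; conventions vary across papers, and APO's anchored version differs in detail):

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """Amari-style alpha-divergence; alpha -> 1 gives KL(p||q), alpha -> 0 gives KL(q||p)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.isclose(alpha, 1.0):
        return float(np.sum(p * np.log(p / q)))      # forward KL limit
    if np.isclose(alpha, 0.0):
        return float(np.sum(q * np.log(q / p)))      # reverse KL limit
    return float((1.0 - np.sum(p**alpha * q**(1.0 - alpha))) / (alpha * (1.0 - alpha)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])
for a in (0.0, 0.5, 1.0):
    print(a, alpha_divergence(p, q, a))  # sweeps from reverse-KL to forward-KL behavior
```

Scheduling alpha over training, as APO does with its reward-and-confidence guard, amounts to sliding along this family from mode-covering toward mode-seeking updates.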

[AI-47] Heterogeneity in Multi-Agent Reinforcement Learning

【Quick Read】: This paper addresses the lack of a rigorous definition and deeper understanding of heterogeneity in multi-agent reinforcement learning (MARL): although its importance for functional differences among agents, policy diversity, and environmental interaction is widely recognized, the field has no systematic theoretical framework or quantification method. The key to the solution is threefold: first, based on agent-level modeling, heterogeneity is categorized into five types with mathematical definitions; second, the concept of a heterogeneity distance is introduced together with a practical quantification method; third, a heterogeneity-based dynamic parameter-sharing algorithm is built on this foundation, effectively identifying and exploiting the different heterogeneity types. Experiments show better interpretability and adaptability than conventional parameter-sharing baselines, giving the MARL community a more comprehensive analytical framework and a practical tool.

Link: https://arxiv.org/abs/2512.22941
Authors: Tianyi Hu,Zhiqiang Pu,Yuan Wang,Tenghai Qiu,Min Chen,Xin Yu
Affiliation: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract:Heterogeneity is a fundamental property in multi-agent reinforcement learning (MARL), which is closely related not only to the functional differences of agents, but also to policy diversity and environmental interactions. However, the MARL field currently lacks a rigorous definition and deeper understanding of heterogeneity. This paper systematically discusses heterogeneity in MARL from the perspectives of definition, quantification, and utilization. First, based on an agent-level modeling of MARL, we categorize heterogeneity into five types and provide mathematical definitions. Second, we define the concept of heterogeneity distance and propose a practical quantification method. Third, we design a heterogeneity-based multi-agent dynamic parameter sharing algorithm as an example of the application of our methodology. Case studies demonstrate that our method can effectively identify and quantify various types of agent heterogeneity. Experimental results show that the proposed algorithm, compared to other parameter sharing baselines, has better interpretability and stronger adaptability. The proposed methodology will help the MARL community gain a more comprehensive and profound understanding of heterogeneity, and further promote the development of practical algorithms.

[AI-48] Geometric Structural Knowledge Graph Foundation Model

【Quick Read】: This paper addresses the limited expressiveness of existing knowledge graph foundation models such as Ultra on entirely new graphs: their message passing uses a single relational transformation (e.g., element-wise multiplication), which struggles to capture the diverse relational and structural patterns among entities. The key to the solution is Gamma, which introduces multi-head geometric attention: the single relational transformation is replaced by multiple parallel algebraic transformations (over real, complex, split-complex, and dual numbers), each specializing in a different kind of relational structure, and a relation-conditioned attention fusion mechanism combines them adaptively at link level via lightweight gating with entropy regularization. This improves the model's adaptability and expressiveness across triple patterns and yields clear gains in zero-shot inductive reasoning.

Link: https://arxiv.org/abs/2512.22931
Authors: Ling Xin,Mojtaba Nayyeri,Zahra Makki Nayeri,Steffen Staab
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to IEEE TPAMI, under review

Abstract:Structural knowledge graph foundation models aim to generalize reasoning to completely new graphs with unseen entities and relations. A key limitation of existing approaches like Ultra is their reliance on a single relational transformation (e.g., element-wise multiplication) in message passing, which can constrain expressiveness and fail to capture diverse relational and structural patterns exhibited on diverse graphs. In this paper, we propose Gamma, a novel foundation model that introduces multi-head geometric attention to knowledge graph reasoning. Gamma replaces the single relational transformation with multiple parallel ones, including real, complex, split-complex, and dual number based transformations, each designed to model different relational structures. A relational conditioned attention fusion mechanism then adaptively fuses them at link level via a lightweight gating with entropy regularization, allowing the model to robustly emphasize the most appropriate relational bias for each triple pattern. We present a full formalization of these algebraic message functions and discuss how their combination increases expressiveness beyond any single space. Comprehensive experiments on 56 diverse knowledge graphs demonstrate that Gamma consistently outperforms Ultra in zero-shot inductive link prediction, with a 5.5% improvement in mean reciprocal rank on the inductive benchmarks and a 4.4% improvement across all benchmarks, highlighting benefits from complementary geometric representations.
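
The four relational transformations act like multiplications in different number systems. A minimal illustration of the underlying products on coordinate pairs (our own sketch of the standard algebra, not the paper's message functions):

```python
def complex_mul(a, b, c, d):        # (a + b*i)(c + d*i),   i^2 = -1
    return a * c - b * d, a * d + b * c

def split_complex_mul(a, b, c, d):  # (a + b*j)(c + d*j),   j^2 = +1
    return a * c + b * d, a * d + b * c

def dual_mul(a, b, c, d):           # (a + b*eps)(c + d*eps), eps^2 = 0
    return a * c, a * d + b * c

x, y = (1.0, 2.0), (3.0, 4.0)
print(complex_mul(*x, *y))        # (-5.0, 10.0): rotation-like behavior
print(split_complex_mul(*x, *y))  # (11.0, 10.0): hyperbolic behavior
print(dual_mul(*x, *y))           # (3.0, 10.0): translation/nilpotent behavior
```

Each algebra imposes a different geometric bias (rotational, hyperbolic, nilpotent), which is why running them in parallel and fusing per link can cover relation types that no single transformation captures.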

[AI-49] Sat-EnQ: Satisficing Ensembles of Weak Q-Learners for Reliable and Compute-Efficient Reinforcement Learning

【Quick Read】: This paper addresses the instability of deep Q-learning (DQN) early in training, when the maximization operator amplifies estimation errors. The key to the solution is Sat-EnQ, a two-phase framework: in Phase 1, an ensemble of lightweight Q-networks is trained under a "satisficing" objective that limits early value growth with a dynamic baseline, producing diverse, low-variance estimates and avoiding catastrophic overestimation; in Phase 2, the ensemble is distilled into a larger network and fine-tuned with standard Double DQN. Theory shows that satisficing induces bounded updates and cannot increase target variance; empirically, the method achieves 3.8x variance reduction, eliminates catastrophic failures (0% versus 50% for DQN), retains 79% of performance under environmental noise, and requires 2.5x less compute than bootstrapped ensembles.

Link: https://arxiv.org/abs/2512.22910
Authors: Ünver Çiftçi
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep Q-learning algorithms remain notoriously unstable, especially during early training when the maximization operator amplifies estimation errors. Inspired by bounded rationality theory and developmental learning, we introduce Sat-EnQ, a two-phase framework that first learns to be ``good enough’’ before optimizing aggressively. In Phase 1, we train an ensemble of lightweight Q-networks under a satisficing objective that limits early value growth using a dynamic baseline, producing diverse, low-variance estimates while avoiding catastrophic overestimation. In Phase 2, the ensemble is distilled into a larger network and fine-tuned with standard Double DQN. We prove theoretically that satisficing induces bounded updates and cannot increase target variance, with a corollary quantifying conditions for substantial reduction. Empirically, Sat-EnQ achieves 3.8x variance reduction, eliminates catastrophic failures (0% vs 50% for DQN), maintains 79% performance under environmental noise, and requires 2.5x less compute than bootstrapped ensembles. Our results highlight a principled path toward robust reinforcement learning by embracing satisficing before optimization.
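
A sketch of the satisficing target as we read it: the usual bootstrapped target is capped from above by a dynamic baseline, so the max operator cannot run away early in training. The baseline choice below is illustrative, not the paper's schedule.

```python
import torch

def satisficing_target(reward, q_next_max, baseline, gamma=0.99):
    """Standard bootstrapped Q-target, clipped from above by a dynamic baseline."""
    target = reward + gamma * q_next_max
    return torch.minimum(target, baseline)

reward = torch.tensor([1.0, 0.0, 0.5])
q_next_max = torch.tensor([10.0, 2.0, 3.0])  # possibly overestimated maxima
baseline = torch.tensor(4.0)                 # e.g., a running quantile of observed returns
print(satisficing_target(reward, q_next_max, baseline))
# tensor([4.0000, 1.9800, 3.4700]) -- only the inflated target gets capped
```

Since min(target, baseline) never exceeds the unclipped target, this cap can only shrink update magnitudes, consistent with the paper's bounded-update and non-increasing-variance results.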

[AI-50] A Neural Network-Based Real-time Casing Collar Recognition System for Downhole Instruments

【Quick Read】: This paper addresses the insufficient downhole positioning accuracy of traditional surface-based Casing Collar Locator (CCL) monitoring in oil and gas operations, caused by signal degradation. The key to the solution is an embedded neural network, the lightweight "Collar Recognition Nets" (CRNs), optimized for resource-constrained ARM Cortex-M7 microprocessors: by using temporal and depthwise separable convolutions, the most compact model needs only 8,208 multiply-accumulate operations (MACs) while maintaining an F1 score of 0.972, and hardware validation confirms an average inference latency of just 343.2 μs, demonstrating that robust, autonomous in-situ signal processing is feasible within the severe power and space limits of downhole instrumentation.

Link: https://arxiv.org/abs/2512.22901
Authors: Si-Yu Xiao,Xin-Di Zhao,Xiang-Zhan Wang,Tian-Hao Mao,Ying-Kai Liao,Xing-Yu Liao,Yu-Qiao Chen,Jun-Jie Wang,Shuang Liu,Tu-Pei Chen,Yang Liu
Affiliation: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Abstract:Accurate downhole positioning is critical in oil and gas operations but is often compromised by signal degradation in traditional surface-based Casing Collar Locator (CCL) monitoring. To address this, we present an in-situ, real-time collar recognition system using an embedded neural network. We introduce lightweight "Collar Recognition Nets" (CRNs) optimized for resource-constrained ARM Cortex-M7 microprocessors. By leveraging temporal and depthwise separable convolutions, our most compact model reduces computational complexity to just 8,208 MACs while maintaining an F1 score of 0.972. Hardware validation confirms an average inference latency of 343.2 μs, demonstrating that robust, autonomous signal processing is feasible within the severe power and space limitations of downhole instrumentation.
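
A back-of-envelope sketch of a tiny temporal CNN in the spirit of the CRNs (layer sizes are ours; the paper's exact topology and 8,208-MAC budget are not reproduced here). The depthwise/pointwise split is what keeps the multiply-accumulate count small enough for a Cortex-M7.

```python
import torch
import torch.nn as nn

class TinyCollarNet(nn.Module):
    def __init__(self, in_ch=1, mid_ch=8, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            # depthwise temporal conv: one filter per input channel
            nn.Conv1d(in_ch, in_ch, kernel_size=9, padding=4, groups=in_ch),
            # pointwise conv mixes channels cheaply
            nn.Conv1d(in_ch, mid_ch, kernel_size=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(mid_ch, n_classes),
        )

    def forward(self, x):             # x: (batch, channels, samples)
        return self.net(x)

model = TinyCollarNet()
ccl_window = torch.randn(1, 1, 256)  # one window of CCL signal
print(model(ccl_window).shape)       # torch.Size([1, 2]) -> collar / no collar
```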

[AI-51] SAMP-HDRL: Segmented Allocation with Momentum-Adjusted Utility for Multi-agent Portfolio Management via Hierarchical Deep Reinforcement Learning

【Quick Read】: This paper addresses three challenges of portfolio optimization in non-stationary markets: regime shifts, dynamically changing asset correlations, and the poor interpretability of deep reinforcement learning (DRL) policies. The key to the solution is SAMP-HDRL, a hierarchical DRL framework: dynamic asset grouping partitions the market into high-quality and ordinary subsets, an upper-level agent extracts global market signals, lower-level agents allocate within groups under mask constraints, and a utility-based capital allocation mechanism coordinates risky and risk-free assets so that global and local decisions remain coherent. This markedly improves adaptability, robustness, and interpretability in volatile and oscillating markets.

Link: https://arxiv.org/abs/2512.22895
Authors: Xiaotian Ren,Nuerxiati Abudurexiti,Zhengyong Jiang,Angelos Stefanidis,Hongbin Liu,Jionglong Su
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Portfolio optimization in non-stationary markets is challenging due to regime shifts, dynamic correlations, and the limited interpretability of deep reinforcement learning (DRL) policies. We propose a Segmented Allocation with Momentum-Adjusted Utility for Multi-agent Portfolio Management via Hierarchical Deep Reinforcement Learning (SAMP-HDRL). The framework first applies dynamic asset grouping to partition the market into high-quality and ordinary subsets. An upper-level agent extracts global market signals, while lower-level agents perform intra-group allocation under mask constraints. A utility-based capital allocation mechanism integrates risky and risk-free assets, ensuring coherent coordination between global and local decisions. Backtests across three market regimes (2019–2021) demonstrate that SAMP-HDRL consistently outperforms nine traditional baselines and nine DRL benchmarks under volatile and oscillating conditions. Compared with the strongest baseline, our method achieves at least 5% higher Return, 5% higher Sharpe ratio, 5% higher Sortino ratio, and 2% higher Omega ratio, with substantially larger gains observed in turbulent markets. Ablation studies confirm that upper–lower coordination, dynamic clustering, and capital allocation are indispensable to robustness. SHAP-based interpretability further reveals a complementary "diversified + concentrated" mechanism across agents, providing transparent insights into decision-making. Overall, SAMP-HDRL embeds structural market constraints directly into the DRL pipeline, offering improved adaptability, robustness, and interpretability in complex financial environments.

[AI-52] DECEPTICON: How Dark Patterns Manipulate Web Agents

【Quick Read】: This paper addresses the risk that dark patterns, deceptive UI designs widespread across the web, steer LLM-based web agents away from user intent during web-navigation tasks. The key to the solution is DECEPTICON, an isolated test environment with 700 web-navigation tasks containing dark patterns (600 generated and 100 real-world), used to quantify how effectively dark patterns induce agents into malicious actions and to systematically evaluate vulnerability across model sizes and reasoning settings. Experiments show that dark patterns successfully mislead agents in over 70% of tasks, that effectiveness grows with model capability, and that existing defenses such as in-context prompting and guardrail models fail to mitigate the risk consistently, underscoring the urgent need for robust protections against manipulative interface design.

Link: https://arxiv.org/abs/2512.22894
Authors: Phil Cuvin,Hao Zhu,Diyi Yang
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deceptive UI designs, widely instantiated across the web and commonly known as dark patterns, manipulate users into performing actions misaligned with their goals. In this paper, we show that dark patterns are highly effective in steering agent trajectories, posing a significant risk to agent robustness. To quantify this risk, we introduce DECEPTICON, an environment for testing individual dark patterns in isolation. DECEPTICON includes 700 web navigation tasks with dark patterns – 600 generated tasks and 100 real-world tasks, designed to measure instruction-following success and dark pattern effectiveness. Across state-of-the-art agents, we find dark patterns successfully steer agent trajectories towards malicious outcomes in over 70% of tested generated and real-world tasks – compared to a human average of 31%. Moreover, we find that dark pattern effectiveness correlates positively with model size and test-time reasoning, making larger, more capable models more susceptible. Leading countermeasures against adversarial attacks, including in-context prompting and guardrail models, fail to consistently reduce the success rate of dark pattern interventions. Our findings reveal dark patterns as a latent and unmitigated risk to web agents, highlighting the urgent need for robust defenses against manipulative designs.

[AI-53] Agentic AI for Cyber Resilience: A New Security Paradigm and Its System-Theoretic Foundations

【Quick Read】: This paper addresses the limitations of traditional cybersecurity architectures, built on static rules, perimeter defenses, and human-centered workflows, in the face of dynamic threats from foundation-model-based AI capable of autonomous planning, tool orchestration, and strategic adaptation. The key to the solution is a shift from prevention-centric security to "agentic cyber resilience": a system-level framework in which autonomous agents participate in sensing, reasoning, action, and adaptation, with game-theoretic modeling providing a unified design language for autonomy allocation, information flow, and temporal composition, so that systems maintain critical functions under sustained attack, recover efficiently, and learn continuously, achieving resilience at the system level.

Link: https://arxiv.org/abs/2512.22883
Authors: Tao Li,Quanyan Zhu
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Cybersecurity is being fundamentally reshaped by foundation-model-based artificial intelligence. Large language models now enable autonomous planning, tool orchestration, and strategic adaptation at scale, challenging security architectures built on static rules, perimeter defenses, and human-centered workflows. This chapter argues for a shift from prevention-centric security toward agentic cyber resilience. Rather than seeking perfect protection, resilient systems must anticipate disruption, maintain critical functions under attack, recover efficiently, and learn continuously. We situate this shift within the historical evolution of cybersecurity paradigms, culminating in an AI-augmented paradigm where autonomous agents participate directly in sensing, reasoning, action, and adaptation across cyber and cyber-physical systems. We then develop a system-level framework for designing agentic AI workflows. A general agentic architecture is introduced, and attacker and defender workflows are analyzed as coupled adaptive processes, and game-theoretic formulations are shown to provide a unifying design language for autonomy allocation, information flow, and temporal composition. Case studies in automated penetration testing, remediation, and cyber deception illustrate how equilibrium-based design enables system-level resiliency design.

[AI-54] Reinforcement Networks: novel framework for collaborative Multi-Agent Reinforcement Learning tasks

【Quick Read】: This paper addresses the challenge of end-to-end training of complex systems in multi-agent reinforcement learning (MARL) without restrictive architectural or training assumptions: existing approaches are often limited by fixed topologies, fully centralized training, or inflexible credit assignment, which constrains flexibility and scale. The key to the solution is the Reinforcement Networks framework, which organizes agents as vertices of a directed acyclic graph (DAG), extending hierarchical RL to arbitrary DAGs and enabling flexible credit assignment and scalable coordination while avoiding strict topologies and centralized training. This design unifies hierarchical, modular, and graph-structured views of MARL, and experiments on several collaborative MARL setups show improved performance over standard baselines, offering a theoretically grounded and practical new paradigm for building and training complex multi-agent systems.

Link: https://arxiv.org/abs/2512.22876
Authors: Maksim Kryzhanovskiy,Svetlana Glazyrina,Roman Ischenko,Konstantin Vorontsov
Affiliation: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Modern AI systems often comprise multiple learnable components that can be naturally organized as graphs. A central challenge is the end-to-end training of such systems without restrictive architectural or training assumptions. Such tasks fit the theory and approaches of the collaborative Multi-Agent Reinforcement Learning (MARL) field. We introduce Reinforcement Networks, a general framework for MARL that organizes agents as vertices in a directed acyclic graph (DAG). This structure extends hierarchical RL to arbitrary DAGs, enabling flexible credit assignment and scalable coordination while avoiding strict topologies, fully centralized training, and other limitations of current approaches. We formalize training and inference methods for the Reinforcement Networks framework and connect it to the LevelEnv concept to support reproducible construction, training, and evaluation. We demonstrate the effectiveness of our approach on several collaborative MARL setups by developing several Reinforcement Networks models that achieve improved performance over standard MARL baselines. Beyond empirical gains, Reinforcement Networks unify hierarchical, modular, and graph-structured views of MARL, opening a principled path toward designing and training complex multi-agent systems. We conclude with theoretical and practical directions - richer graph morphologies, compositional curricula, and graph-aware exploration. That positions Reinforcement Networks as a foundation for a new line of research in scalable, structured MARL.

[AI-55] The body is not there to compute: Comment on "Informational embodiment: Computational role of information structure in codes and robots" by Pitti et al

【Quick Read】: The target article asks how the lens of computation and information can explain the evolution of animal bodies and guide the design of robot bodies, extending information theory (IT) and computational concepts beyond cognition and neural processing to the analysis of bodies themselves, thereby revealing the functional significance of body structure in evolution and informing artificial-system design. In this commentary, the author argues that the main role of bodies is not to compute: a body is first of all a physical medium of interaction with the environment, and its evolution and design should emphasize sensorimotor coupling and ecological fit.

Link: https://arxiv.org/abs/2512.22868
Authors: Matej Hoffmann
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC); Quantitative Methods (q-bio.QM)
Comments: Comment on Pitti, A., Austin, M., Nakajima, K., Kuniyoshi, Y. (2025). Informational Embodiment: Computational role of information structure in codes and robots. Physics of Life Reviews 53, 262-276. this https URL . Also available as arXiv:2408.12950

Abstract:Applying the lens of computation and information has been instrumental in driving the technological progress of our civilization as well as in empowering our understanding of the world around us. The digital computer was and for many still is the leading metaphor for how our mind operates. Information theory (IT) has also been important in our understanding of how nervous systems encode and process information. The target article deploys information and computation to bodies: to understand why they have evolved in particular ways (animal bodies) and to design optimal bodies (robots). In this commentary, I argue that the main role of bodies is not to compute.

[AI-56] FasterPy: An LLM-based Code Execution Efficiency Optimization Framework

【Quick Read】: This paper addresses performance bugs that make Python code inefficient: traditional rule-based optimization is labor-intensive and hard to scale because it relies on manually designed and maintained rules, while existing machine-learning approaches depend on specific program representations and costly dataset construction. The key to the solution is FasterPy, a framework that combines Retrieval-Augmented Generation (RAG), backed by a knowledge base built from existing performance-improving code pairs and their performance measurements, with Low-Rank Adaptation (LoRA) fine-tuning, achieving low-cost, efficient automated code optimization.

Link: https://arxiv.org/abs/2512.22827
Authors: Yue Wu,Minghao Han,Ruiyin Li,Peng Liang,Amjed Tahir,Zengyang Li,Qiong Feng,Mojtaba Shahin
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 32 pages, 5 images, 7 tables, Manuscript submitted to a Journal (2025)

Abstract:Code often suffers from performance bugs. These bugs necessitate the research and practice of code optimization. Traditional rule-based methods rely on manually designing and maintaining rules for specific performance bugs (e.g., redundant loops, repeated computations), making them labor-intensive and limited in applicability. In recent years, machine learning and deep learning-based methods have emerged as promising alternatives by learning optimization heuristics from annotated code corpora and performance measurements. However, these approaches usually depend on specific program representations and meticulously crafted training datasets, making them costly to develop and difficult to scale. With the booming of Large Language Models (LLMs), their remarkable capabilities in code generation have opened new avenues for automated code optimization. In this work, we proposed FasterPy, a low-cost and efficient framework that adapts LLMs to optimize the execution efficiency of Python code. FasterPy combines Retrieval-Augmented Generation (RAG), supported by a knowledge base constructed from existing performance-improving code pairs and corresponding performance measurements, with Low-Rank Adaptation (LoRA) to enhance code optimization performance. Our experimental results on the Performance Improving Code Edits (PIE) benchmark demonstrate that our method outperforms existing models on multiple metrics. The FasterPy tool and the experimental results are available at this https URL.
zh

[AI-57] MoR: Mixture Of Representations For Mixed-Precision Training

【速读】:该论文旨在解决混合精度训练(Mixed-Precision Training)中如何高效且稳定地应用低精度数值格式(如FP8)以提升深度学习模型训练效率的问题。其核心挑战在于,在保证模型精度的前提下,如何动态选择最优的量化表示方式,避免因固定量化策略导致的性能下降。解决方案的关键在于提出了一种基于张量(Tensor)级和子张量(Sub-Tensor)级粒度的Mixture-of-Representations (MoR) 框架,该框架能够根据每个张量的数值特性(numerical properties)动态选择使用FP8或BF16等不同精度表示,从而实现高精度与高效率的平衡。实验表明,该方法可在98.38%的张量上采用FP8格式,同时保持与现有方法相当的模型质量,无需细粒度分区即可达到最先进(state-of-the-art)效果,并具备扩展至更低精度格式(如NVFP4)的潜力。

链接: https://arxiv.org/abs/2512.22804
作者: Bor-Yiing Su,Peter Dykas,Mike Chrzanowski,Jatin Chhugani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mixed-precision training is a crucial technique for scaling deep learning models, but successful mixed-precision training requires identifying and applying the right combination of training methods. This paper presents our preliminary study on Mixture-of-Representations (MoR), a novel, per-tensor and sub-tensor level quantization framework that dynamically analyzes a tensor’s numerical properties to select between a variety of different representations. Based on the framework, we have proposed and experimented with concrete algorithms that choose dynamically between FP8 and BF16 representations for both per-tensor and sub-tensor level granularities. Our universal approach is designed to preserve model quality across various quantization partition strategies and datasets. Our initial findings show that this approach can achieve state-of-the-art results with 98.38% of tensors quantized to the FP8 format. This work highlights the potential of dynamic, property-aware quantization while preserving model quality. We believe this approach can generally improve the robustness of low precision training, as demonstrated by achieving FP8 accuracies that are on par with existing approaches without the need for fine-grain partitioning, or can be used in combination with other training methods to improve the leverage of even lower precision number formats such as NVFP4.
zh
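为直观说明 MoR 按数值特性动态选择表示的思想,下面给出一个极简的示意草图(非论文官方算法):用 amax 缩放到 FP8 动态范围后,统计会落在 E4M3 可表示范围之下的元素比例,比例过高则回退到动态范围更宽的 BF16。其中下溢阈值 `underflow_tol` 与范围常数均为演示假设。

```python
import numpy as np

FP8_E4M3_MAX = 448.0       # E4M3 可表示的最大幅值
FP8_E4M3_MIN = 2.0 ** -9   # E4M3 最小次正规数幅值(近似)

def choose_representation(t: np.ndarray, underflow_tol: float = 0.01) -> str:
    """按张量数值特性在 FP8 与 BF16 间做启发式选择(示意)。

    先按 amax 缩放到 FP8 动态范围,再统计会下溢出该范围的元素比例;
    比例超过 underflow_tol(演示用阈值)则回退到 BF16。
    """
    amax = np.abs(t).max()
    if amax == 0.0:
        return "FP8"  # 全零张量没有信息损失
    scaled = np.abs(t) * (FP8_E4M3_MAX / amax)
    nonzero = scaled[scaled > 0]
    underflow_frac = float((nonzero < FP8_E4M3_MIN).mean()) if nonzero.size else 0.0
    return "FP8" if underflow_frac <= underflow_tol else "BF16"

rng = np.random.default_rng(0)
well_scaled = rng.normal(size=(256, 256)).astype(np.float32)
wide_range = (well_scaled * 10.0 ** rng.uniform(-8, 0, size=(256, 256))).astype(np.float32)
print(choose_representation(well_scaled))  # 动态范围集中 -> FP8
print(choose_representation(wide_range))   # 大量元素下溢 -> BF16
```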

[AI-58] Reach-Avoid Differential game with Reachability Analysis for UAVs: A decomposition approach

【速读】:该论文旨在解决三维空间中无人飞行器(UAV)的可达-规避(Reach-Avoid, RA)博弈问题,此类问题因需考虑障碍物、对手的对抗性、最优性保障以及非线性动力学而极具挑战性。现有基于哈密顿-雅可比(Hamilton-Jacobi, HJ)可达性分析的方法难以直接扩展至三维场景,而其他替代方法又无法兼顾高维空间与复杂动力学建模。解决方案的关键在于提出一种维度分解框架,将原三维RA游戏解耦为水平和垂直两个子博弈,分别采用HJ可达性分析求解,并引入基于HJ的跟踪控制算法以保证攻击者被捕获后仍能持续跟踪。该方法首次在三维空间中实现了对具有二阶动力学(含防御方加速度)的四旋翼无人机的成功捕获,且理论证明了捕获条件的保持性,数值仿真与Gazebo物理引擎验证了其有效性与最优性。

链接: https://arxiv.org/abs/2512.22793
作者: Minh Bui,Simon Monckton,Mo Chen
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注: Paper version accepted to the Journal of Guidance, Control, and Dynamics (JGCD)

点击查看摘要

Abstract:Reach-avoid (RA) games have significant applications in security and defense, particularly for unmanned aerial vehicles (UAVs). These problems are inherently challenging due to the need to consider obstacles, account for the adversarial nature of opponents, ensure optimality, and handle nonlinear dynamics. Hamilton-Jacobi (HJ) reachability analysis has emerged as a powerful tool for tackling these challenges; however, while it has been applied to games involving two spatial dimensions, directly extending this approach to three spatial dimensions is impossible due to high dimensionality. On the other hand, alternative approaches for solving RA games lack the generality to consider games with three spatial dimensions involving agents with non-trivial system dynamics. In this work, we propose a novel framework for dimensionality reduction by decomposing the problem into a horizontal RA sub-game and a vertical RA sub-game. We then solve each sub-game using HJ reachability analysis and consider second-order dynamics that account for the defender’s acceleration. To reconstruct the solution to the original RA game from the sub-games, we introduce a HJ-based tracking control algorithm in each sub-game that not only guarantees capture of the attacker but also tracking of the attacker thereafter. We prove the conditions under which the capture guarantees are maintained. The effectiveness of our approach is demonstrated via numerical simulations, showing that the decomposition maintains optimality and guarantees in the original problem. Our methods are also validated in a Gazebo physics simulator, achieving successful capture of quadrotors in three-dimensional space for the first time to the best of our knowledge.
zh

[AI-59] SNM-Net: A Universal Framework for Robust Open-Set Gas Recognition via Spherical Normalization and Mahalanobis Distance

【速读】:该论文旨在解决电子鼻(Electronic Nose, E-nose)系统在开放集气体识别中面临的双重挑战:一是由信号漂移引起的特征分布偏移,二是未知干扰导致的决策失效问题。现有方法多依赖欧氏距离(Euclidean distance),难以有效处理各向异性气体特征分布及动态信号强度变化。其解决方案的关键在于提出SNM-Net框架,通过级联批量归一化(batch normalization)与L2归一化实现几何解耦机制,将高维特征投影至单位超球面以消除信号强度波动;同时引入马氏距离(Mahalanobis distance)作为评分机制,利用类别统计信息构建自适应椭球决策边界,从而提升识别精度与鲁棒性。

链接: https://arxiv.org/abs/2512.22792
作者: Shuai Chen,Chen Wang,Ziran Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: 31 pages, 7 figures, 4 tables

点击查看摘要

Abstract:Electronic nose (E-nose) systems face dual challenges in open-set gas recognition: feature distribution shifts caused by signal drift and decision failures induced by unknown interference. Existing methods predominantly rely on Euclidean distance, failing to adequately account for anisotropic gas feature distributions and dynamic signal intensity variations. To address these issues, this study proposes SNM-Net, a universal deep learning framework for open-set gas recognition. The core innovation lies in a geometric decoupling mechanism achieved through cascaded batch normalization and L2 normalization, which projects high-dimensional features onto a unit hypersphere to eliminate signal intensity fluctuations. Additionally, Mahalanobis distance is introduced as the scoring mechanism, utilizing class-wise statistics to construct adaptive ellipsoidal decision boundaries. SNM-Net is architecture-agnostic and seamlessly integrates with CNN, RNN, and Transformer backbones. Systematic experiments on the Vergara dataset demonstrate that the Transformer+SNM configuration attains near-theoretical performance, achieving an AUROC of 0.9977 and an unknown gas detection rate of 99.57% (TPR at 5% FPR). This performance significantly outperforms state-of-the-art methods, showing a 3.0% improvement in AUROC and a 91.0% reduction in standard deviation compared to Class Anchor Clustering. The framework exhibits exceptional robustness across sensor positions with standard deviations below 0.0028. This work effectively resolves the trade-off between accuracy and stability, providing a solid technical foundation for industrial E-nose deployment.
zh
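SNM-Net 的核心流程(L2 归一化投影到单位超球面,再用类别统计构建马氏距离决策边界)可以用如下最小草图说明;特征维度、正则化系数与合成数据均为演示假设,并非论文实现。

```python
import numpy as np

def fit_class_stats(feats: np.ndarray, labels: np.ndarray, reg: float = 1e-3):
    """训练阶段:L2 归一化(投影到单位超球面)后按类别估计均值与协方差逆。"""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    stats = {}
    for c in np.unique(labels):
        x = feats[labels == c]
        mu = x.mean(axis=0)
        cov = np.cov(x, rowvar=False) + reg * np.eye(x.shape[1])  # 正则化保证可逆
        stats[c] = (mu, np.linalg.inv(cov))
    return stats

def mahalanobis_score(feat: np.ndarray, stats) -> float:
    """开放集评分:到最近类别椭球中心的马氏距离;超过验证集上标定的阈值即判为未知气体。"""
    feat = feat / np.linalg.norm(feat)
    return min(
        float(np.sqrt((feat - mu) @ inv_cov @ (feat - mu)))
        for mu, inv_cov in stats.values()
    )

# 合成数据演示(真实场景中 feats 来自 CNN/RNN/Transformer 主干的输出)
rng = np.random.default_rng(0)
feats, labels = rng.normal(size=(300, 8)), rng.integers(0, 3, 300)
stats = fit_class_stats(feats, labels)
print(mahalanobis_score(rng.normal(size=8), stats))
```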

[AI-60] Adapting Fast and Slow: Transportable Circuits for Few-Shot Learning

【速读】:该论文旨在解决跨域泛化(cross-domain generalization)问题,即在缺乏目标域数据的情况下实现零样本组合泛化(zero-shot compositional generalization)。其核心挑战在于如何在源域与目标域之间建立结构约束,以确保模型能够有效迁移。解决方案的关键在于基于因果可传输性理论(causal transportability theory),设计了一种名为Circuit-TR的算法:该算法利用因果图(causal graph)获取域内结构信息,并通过差异Oracle(discrepancies oracle)识别域间机制共享关系,从而学习一组模块化局部预测器(local predictors),并在因果结构允许时进行传输与组合,构建适用于目标域的预测电路(circuit)。此外,该方法还提出了一种无需显式因果结构、仅需少量目标域数据的监督域适应方案,理论上通过图形电路可传输性准则刻画了少样本可学习任务的类别,并将少样本泛化能力与电路规模复杂度相联系。

链接: https://arxiv.org/abs/2512.22777
作者: Kasra Jalaldoust,Elias Bareinboim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generalization across the domains is not possible without asserting a structure that constrains the unseen target domain w.r.t. the source domain. Building on causal transportability theory, we design an algorithm for zero-shot compositional generalization which relies on access to qualitative domain knowledge in form of a causal graph for intra-domain structure and discrepancies oracle for inter-domain mechanism sharing. Circuit-TR learns a collection of modules (i.e., local predictors) from the source data, and transports/composes them to obtain a circuit for prediction in the target domain if the causal structure licenses. Furthermore, circuit transportability enables us to design a supervised domain adaptation scheme that operates without access to an explicit causal structure, and instead uses limited target data. Our theoretical results characterize classes of few-shot learnable tasks in terms of graphical circuit transportability criteria, and connect few-shot generalizability with the established notion of circuit size complexity; controlled simulations corroborate our theoretical results.
zh

[AI-61] GRExplainer: A Universal Explanation Method for Temporal Graph Neural Networks

【速读】:该论文旨在解决当前Temporal Graph Neural Networks (TGNNs) 在可解释性方面的三大核心问题:一是现有解释方法针对特定类型的TGNN设计,缺乏通用性;二是计算成本高,难以应用于大规模动态图;三是解释过程忽视结构连通性且依赖先验知识,用户体验不佳。解决方案的关键在于提出GRExplainer,其通过提取节点序列作为统一特征表示,实现对快照型和事件型TGNN的普适适用;利用广度优先搜索(BFS)与时间信息构建输入节点序列以减少冗余计算,提升效率;并基于循环神经网络(RNN)设计生成模型,实现自动化、连续的解释生成,显著增强用户友好性。

链接: https://arxiv.org/abs/2512.22772
作者: Xuyan Li,Jie Wang,Zheng Yan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Dynamic graphs are widely used to represent evolving real-world networks. Temporal Graph Neural Networks (TGNNs) have emerged as a powerful tool for processing such graphs, but the lack of transparency and explainability limits their practical adoption. Research on TGNN explainability is still in its early stages and faces several key issues: (i) Current methods are tailored to specific TGNN types, restricting generality. (ii) They suffer from high computational costs, making them unsuitable for large-scale networks. (iii) They often overlook the structural connectivity of explanations and require prior knowledge, reducing user-friendliness. To address these issues, we propose GRExplainer, the first universal, efficient, and user-friendly explanation method for TGNNs. GRExplainer extracts node sequences as a unified feature representation, making it independent of specific input formats and thus applicable to both snapshot-based and event-based TGNNs (the major types of TGNNs). By utilizing breadth-first search and temporal information to construct input node sequences, GRExplainer reduces redundant computation and improves efficiency. To enhance user-friendliness, we design a generative model based on Recurrent Neural Networks (RNNs), enabling automated and continuous explanation generation. Experiments on six real-world datasets with three target TGNNs show that GRExplainer outperforms existing baseline methods in generality, efficiency, and user-friendliness.
zh
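GRExplainer 利用 BFS 与时间信息构建输入节点序列;下面是一个示意性草图,演示在事件型动态图上做时间约束 BFS、抽取解释用节点序列的思路。事件三元组格式 (u, v, t) 与各参数均为演示假设。

```python
from collections import deque

def temporal_bfs_sequence(events, target, t_query, max_len=32):
    """时间约束 BFS(示意):从目标节点出发,只沿发生在查询时刻之前、
    且时间单调递减的交互回溯,得到解释用的节点序列。"""
    adj = {}
    for u, v, t in events:
        adj.setdefault(u, []).append((v, t))
        adj.setdefault(v, []).append((u, t))
    seq, seen, q = [], {target}, deque([(target, t_query)])
    while q and len(seq) < max_len:
        node, t_max = q.popleft()
        seq.append(node)
        for nbr, t in sorted(adj.get(node, []), key=lambda x: -x[1]):  # 先看最近的交互
            if t < t_max and nbr not in seen:
                seen.add(nbr)
                q.append((nbr, t))
    return seq

events = [("a", "b", 1.0), ("b", "c", 2.0), ("a", "c", 3.0), ("c", "d", 4.0)]
print(temporal_bfs_sequence(events, target="c", t_query=3.5))  # ['c', 'a', 'b']
```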

[AI-62] Understanding the Mechanisms of Fast Hyperparameter Transfer

【速读】:该论文旨在解决深度学习模型规模扩大后标准超参数(Hyperparameter, HP)优化成本过高这一问题。其核心挑战在于如何在不同模型规模间高效迁移最优超参数,以避免对大规模模型进行昂贵的直接调优。解决方案的关键在于提出一种“快速转移”(fast transfer)的概念框架,定义当转移引起的次优性随规模增长比有限尺度性能差距更快消失时即为快速转移,并证明在计算最优的网格搜索中,快速转移等价于有用转移,意味着其在计算效率上优于直接调参。此外,论文揭示了在特定问题结构下(如最大更新参数化 μP),超参数转移能实现计算优势,且这种优势源于优化轨迹的分解:损失下降包含一个宽度稳定分量(决定最优HP)和一个宽度敏感分量(随宽度提升但微弱扰动HP最优值)。

链接: https://arxiv.org/abs/2512.22768
作者: Nikhil Ghosh,Denny Wu,Alberto Bietti
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 43 pages

点击查看摘要

Abstract:The growing scale of deep learning models has rendered standard hyperparameter (HP) optimization prohibitively expensive. A promising solution is the use of scale-aware hyperparameters, which can enable direct transfer of optimal HPs from small-scale grid searches to large models with minimal performance loss. To understand the principles governing such transfer strategy, we develop a general conceptual framework for reasoning about HP transfer across scale, characterizing transfer as fast when the suboptimality it induces vanishes asymptotically faster than the finite-scale performance gap. We show formally that fast transfer is equivalent to useful transfer for compute-optimal grid search, meaning that transfer is asymptotically more compute-efficient than direct tuning. While empirical work has found that the Maximal Update Parameterization ($\mu$P) exhibits fast transfer when scaling model width, the mechanisms remain poorly understood. We show that this property depends critically on problem structure by presenting synthetic settings where transfer either offers provable computational advantage or fails to outperform direct tuning even under $\mu$P. To explain the fast transfer observed in practice, we conjecture that decomposing the optimization trajectory reveals two contributions to loss reduction: (1) a width-stable component that determines the optimal HPs, and (2) a width-sensitive component that improves with width but weakly perturbs the HP optimum. We present empirical evidence for this hypothesis across various settings, including large language model pretraining.
zh
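作为背景补充,下面给出一个 μP 风格学习率随宽度迁移的极简示意。隐藏层 LR 与 fan_in 成反比是 μP 文献中的常见约定;具体缩放规则以 μP 原始论文与官方实现为准,此处仅作演示,与本文的理论分析无直接对应。

```python
def mup_lr_for_layer(base_lr: float, base_width: int, width: int, layer_type: str) -> float:
    """muP 风格的学习率迁移(简化示意):矩阵型隐藏参数的 LR 随宽度反比缩放。"""
    if layer_type in ("hidden", "output"):   # 输出层的处理此处从简,实际约定见 muP 文献
        return base_lr * base_width / width
    return base_lr                           # 输入层 / 偏置:不随宽度缩放

base_lr = 0.02  # 假设:在 width=64 的代理模型上网格搜索得到的最优 LR
for width in (64, 256, 1024, 4096):
    print(width, mup_lr_for_layer(base_lr, base_width=64, width=width, layer_type="hidden"))
```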

[AI-63] Active Constraint Learning in High Dimensions from Demonstrations

【速读】:该论文旨在解决在学习从示范(Learning from Demonstrations, LfD)框架下,如何高效地从稀疏且有限的示范轨迹中准确推断未知约束条件的问题。其核心挑战在于:在高维非线性动力学系统中,仅靠随机采样获取的示范数据往往信息量不足,难以有效建模复杂环境中的隐式约束。解决方案的关键在于提出一种迭代式的主动约束学习(Active Constraint Learning, ACL)算法,通过构建高斯过程(Gaussian Process, GP)对已知示范数据进行建模,并利用GP后验分布主动选择最具信息量的起始/目标状态进行查询,从而引导生成新的、高度信息性的示范轨迹,逐步优化约束模型。该方法显著提升了约束推理的准确性与样本效率。

链接: https://arxiv.org/abs/2512.22757
作者: Zheng Qiu,Chih-Yuan Chiu,Glen Chou
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
备注: Under review, 25 pages, 11 figures

点击查看摘要

Abstract:We present an iterative active constraint learning (ACL) algorithm, within the learning from demonstrations (LfD) paradigm, which intelligently solicits informative demonstration trajectories for inferring an unknown constraint in the demonstrator’s environment. Our approach iteratively trains a Gaussian process (GP) on the available demonstration dataset to represent the unknown constraints, uses the resulting GP posterior to query start/goal states, and generates informative demonstrations which are added to the dataset. Across simulation and hardware experiments using high-dimensional nonlinear dynamics and unknown nonlinear constraints, our method outperforms a baseline, random-sampling based method at accurately performing constraint inference from an iteratively generated set of sparse but informative demonstrations.
zh
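该方法的主动查询环节可以用下面的玩具草图说明:对示范数据拟合 GP,并在候选起止状态中选取后验标准差最大的点发起下一次示范查询。约束函数、核与迭代次数均为演示假设,并非论文实验设置。

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def true_constraint(X):  # 演示者环境中的未知约束 g(x),此处用玩具函数代替
    return np.sin(3 * X[:, 0]) + 0.5 * X[:, 0] - 0.2

rng = np.random.default_rng(0)
X_demo = rng.uniform(-2, 2, size=(5, 1))   # 初始稀疏示范对应的约束观测
y_demo = true_constraint(X_demo)

candidates = np.linspace(-2, 2, 200).reshape(-1, 1)  # 可查询的起止状态候选
for it in range(5):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-6)
    gp.fit(X_demo, y_demo)
    _, std = gp.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(std)].reshape(1, -1)  # 主动查询:后验方差最大处
    X_demo = np.vstack([X_demo, x_next])
    y_demo = np.append(y_demo, true_constraint(x_next))
    print(f"iter {it}: query x = {x_next.item():+.2f}, max std = {std.max():.3f}")
```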

[AI-64] Robust LLM-based Column Type Annotation via Prompt Augmentation with LoRA Tuning

【速读】:该论文旨在解决列类型标注(Column Type Annotation, CTA)任务中现有方法在分布偏移下适应性差、提示(prompt)敏感性强以及模型微调计算成本高的问题。其解决方案的关键在于提出一种基于低秩适配(Low-Rank Adaptation, LoRA)的参数高效训练框架,通过在提示增强数据上进行微调,在显著减少可训练参数数量的同时,有效缓解对提示结构变化的敏感性,并实现跨数据集和提示模板的稳定性能提升,从而构建更具鲁棒性和可扩展性的CTA系统。

链接: https://arxiv.org/abs/2512.22742
作者: Hanze Meng,Jianhao Cao,Rachel Pottinger
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注: 13 pages, 8 figures

点击查看摘要

Abstract:Column Type Annotation (CTA) is a fundamental step towards enabling schema alignment and semantic understanding of tabular data. Existing encoder-only language models achieve high accuracy when fine-tuned on labeled columns, but their applicability is limited to in-domain settings, as distribution shifts in tables or label spaces require costly re-training from scratch. Recent work has explored prompting generative large language models (LLMs) by framing CTA as a multiple-choice task, but these approaches face two key challenges: (1) model performance is highly sensitive to subtle changes in prompt wording and structure, and (2) annotation F1 scores remain modest. A natural extension is to fine-tune large language models. However, fully fine-tuning these models incurs prohibitive computational costs due to their scale, and the sensitivity to prompts is not eliminated. In this paper, we present a parameter-efficient framework for CTA that trains models over prompt-augmented data via Low-Rank Adaptation (LoRA). Our approach mitigates sensitivity to prompt variations while drastically reducing the number of necessary trainable parameters, achieving robust performance across datasets and templates. Experimental results on recent benchmarks demonstrate that models fine-tuned with our prompt augmentation strategy maintain stable performance across diverse prompt patterns during inference and yield higher weighted F1 scores than those fine-tuned on a single prompt template. These results highlight the effectiveness of parameter-efficient training and augmentation strategies in developing practical and adaptable CTA systems.
zh
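下面是一个基于 HuggingFace PEFT 的 LoRA 微调配置草图,并附一个简单的提示增强示例;基座模型名、target_modules 与模板内容均为演示假设,并非论文所用配置。

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2.5-0.5B"  # 占位用的小型基座模型
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # 常见选择;具体模块名依模型结构而定
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # 仅低秩增量参数可训练

# 提示增强:同一(列值, 类型标签)样本配多种提示模板,缓解对提示措辞的敏感性
templates = [
    "Column values: {vals}\nAnswer with the semantic column type.",
    "Given the cells {vals}, what type is this column?",
]
def augment(vals: str, label: str):
    return [{"text": t.format(vals=vals) + " " + label} for t in templates]
```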

[AI-65] FoldAct: Efficient and Stable Context Folding for Long-Horizon Search Agents

【速读】:该论文针对大语言模型在长时程强化学习(Long-horizon Reinforcement Learning, RL)中因上下文长度无界增长导致的可扩展性挑战,特别是现有上下文折叠(context folding)方法将摘要动作视为标准动作所引发的问题展开研究。这些问题包括:梯度稀释(gradient dilution)、自条件化(self-conditioning)导致的训练崩溃以及每轮交互处理唯一上下文带来的高计算开销。其解决方案的关键在于提出FoldAct框架,通过三个核心创新实现:(1) 分离损失计算以获得摘要与动作标记独立的梯度信号;(2) 全上下文一致性损失(full context consistency loss)减少观测分布偏移;(3) 选择性片段训练(selective segment training)降低计算成本。该方法有效解决了非平稳观测分布问题,并实现5.19倍的训练加速。

链接: https://arxiv.org/abs/2512.22733
作者: Jiaqi Shao,Yufeng Miao,Wei Zhang,Bing Luo
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-horizon reinforcement learning (RL) for large language models faces critical scalability challenges from unbounded context growth, leading to context folding methods that compress interaction history during task execution. However, existing approaches treat summary actions as standard actions, overlooking that summaries fundamentally modify the agent’s future observation space, creating a policy-dependent, non-stationary observation distribution that violates core RL assumptions. This introduces three fundamental challenges: (1) gradient dilution where summary tokens receive insufficient training signal, (2) self-conditioning where policy updates change summary distributions, creating a vicious cycle of training collapse, and (3) computational cost from processing unique contexts at each turn. We introduce FoldAct (code: this https URL), a framework that explicitly addresses these challenges through three key innovations: separated loss computation for independent gradient signals on summary and action tokens, full context consistency loss to reduce distribution shift, and selective segment training to reduce computational cost. Our method enables stable training of long-horizon search agents with context folding, addressing the non-stationary observation problem while improving training efficiency with a 5.19× speedup.
zh

[AI-66] Learning with the p-adics

【速读】:该论文试图解决现有机器学习框架局限于实数域($\mathbb{R}$)所带来的几何与表示局限性问题,特别是其在处理层次化结构和符号性信息时的不足。解决方案的关键在于引入 $p$-adic 数域($\mathbb{Q}_p$)作为替代基础域,利用其超度量(ultrametric)和非阿基米德(non-archimedean)特性,构建具有天然分层结构的表示空间。该方法为分类、回归和表示学习提供了新的数学框架,并展示了如何将简单的 Quillian 语义网络编码为紧凑的 $p$-adic 线性网络,这是传统实数域模型无法实现的。

链接: https://arxiv.org/abs/2512.22692
作者: André F. T. Martins
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM)
备注: 29 pages

点击查看摘要

Abstract:Existing machine learning frameworks operate over the field of real numbers ($\mathbb{R}$) and learn representations in real (Euclidean or Hilbert) vector spaces (e.g., $\mathbb{R}^d$). Their underlying geometric properties align well with intuitive concepts such as linear separability, minimum enclosing balls, and subspace projection; and basic calculus provides a toolbox for learning through gradient-based optimization. But is this the only possible choice? In this paper, we study the suitability of a radically different field as an alternative to $\mathbb{R}$ – the ultrametric and non-archimedean space of $p$-adic numbers, $\mathbb{Q}_p$. The hierarchical structure of the $p$-adics and their interpretation as infinite strings make them an appealing tool for code theory and hierarchical representation learning. Our exploratory theoretical work establishes the building blocks for classification, regression, and representation learning with the $p$-adics, providing learning models and algorithms. We illustrate how simple Quillian semantic networks can be represented as a compact $p$-adic linear network, a construction which is not possible with the field of reals. We finish by discussing open problems and opportunities for future research enabled by this new framework.
zh
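$p$-adic 赋值与范数是该框架的基本构件;下面的小例子计算有理数的 $p$-adic 范数 $|x|_p = p^{-v_p(x)}$,并验证超度量(强三角)不等式 $|a+b|_p \le \max(|a|_p, |b|_p)$。

```python
from fractions import Fraction

def vp(x: Fraction, p: int) -> int:
    """有理数 x 的 p-adic 赋值 v_p(x):分子中 p 的幂次减去分母中 p 的幂次。"""
    if x == 0:
        raise ValueError("v_p(0) = +inf")
    v, num, den = 0, x.numerator, x.denominator
    while num % p == 0:
        num //= p; v += 1
    while den % p == 0:
        den //= p; v -= 1
    return v

def padic_norm(x: Fraction, p: int) -> Fraction:
    """p-adic 范数 |x|_p = p^{-v_p(x)};约定 |0|_p = 0。"""
    if x == 0:
        return Fraction(0)
    return Fraction(1, p) ** vp(x, p)

p = 3
a, b = Fraction(9), Fraction(1, 3)
print(padic_norm(a, p))  # 1/9:9 在 3-adic 意义下“很小”
print(padic_norm(b, p))  # 3
# 超度量不等式:|a+b|_p <= max(|a|_p, |b|_p)
print(padic_norm(a + b, p) <= max(padic_norm(a, p), padic_norm(b, p)))  # True
```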

[AI-67] TravelBench: A Real-World Benchmark for Multi-Turn and Tool-Augmented Travel Planning

【速读】:该论文旨在解决现有旅行规划任务中LLM代理评估存在的局限性问题,即现有研究在领域覆盖范围和多轮交互支持方面不足,难以实现动态用户-代理交互,从而无法全面评估代理能力。其解决方案的关键在于提出TravelBench——一个基于真实场景的多轮交互式旅行规划基准,包含多轮、单轮和不可解三类子集,并构建了一个包含10个旅行领域工具的受控沙盒环境,确保工具输出的确定性与可复现性,从而为LLM代理在旅行规划中的行为分析与性能评估提供了一个实用且可靠的基准平台。

链接: https://arxiv.org/abs/2512.22673
作者: Xiang Cheng,Yulan Hu,Xiangwen Zhang,Lu Xu,Zheng Pan,Xin Li,Yong Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: In progress

点击查看摘要

Abstract:Large language model (LLM) agents have demonstrated strong capabilities in planning and tool use. Travel planning provides a natural and high-impact testbed for these capabilities, as it requires multi-step reasoning, iterative preference elicitation through interaction, and calls to external tools under evolving constraints. Prior work has studied LLMs on travel-planning tasks, but existing settings are limited in domain coverage and multi-turn interaction. As a result, they cannot support dynamic user-agent interaction and therefore fail to comprehensively assess agent capabilities. In this paper, we introduce TravelBench, a real-world travel-planning benchmark featuring multi-turn interaction and tool use. We collect user requests from real-world scenarios and construct three subsets (multi-turn, single-turn, and unsolvable) to evaluate different aspects of agent performance. For stable and reproducible evaluation, we build a controlled sandbox environment with 10 travel-domain tools, providing deterministic tool outputs for reliable reasoning. We evaluate multiple LLMs on TravelBench and conduct an analysis of their behaviors and performance. TravelBench offers a practical and reproducible benchmark for advancing LLM agents in travel planning.
zh

[AI-68] DICE: Discrete Interpretable Comparative Evaluation with Probabilistic Scoring for Retrieval-Augmented Generation NEURIPS2025

【速读】:该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)系统评估中缺乏可解释性、鲁棒性和计算效率的问题。现有标量指标在解释能力、不确定性量化和多系统对比的计算效率方面存在局限,难以支撑RAG技术的负责任部署。解决方案的关键在于提出DICE(Discrete Interpretable Comparative Evaluation)框架,其核心创新包括:第一,采用两阶段证据耦合机制,结合深度分析推理与概率性的“A, B, Tie”评分策略,生成具有置信度感知的透明判断,并提供可追溯的推理路径以支持系统级错误诊断与改进;第二,引入瑞士轮锦标赛(Swiss-system tournament)机制,将多系统比较的计算复杂度从O(N²)降低至O(N log N),显著提升大规模评估效率,在八系统实验中实现42.9%的计算成本下降且保持排名一致性。

链接: https://arxiv.org/abs/2512.22629
作者: Shiyan Liu,Jian Ma,Rui Qu
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: Accepted at ResponsibleFM @ NeurIPS 2025

点击查看摘要

Abstract:As Retrieval-Augmented Generation (RAG) systems evolve toward more sophisticated architectures, ensuring their trustworthiness through explainable and robust evaluation becomes critical. Existing scalar metrics suffer from limited interpretability, inadequate uncertainty quantification, and computational inefficiency in multi-system comparisons, hindering responsible deployment of RAG technologies. We introduce DICE (Discrete Interpretable Comparative Evaluation), a two-stage, evidence-coupled framework that advances explainability and robustness in RAG evaluation. DICE combines deep analytical reasoning with probabilistic {A, B, Tie} scoring to produce transparent, confidence-aware judgments that support accountable system improvement through interpretable reasoning traces, enabling systematic error diagnosis and actionable insights. To address efficiency challenges at scale, DICE employs a Swiss-system tournament that reduces computational complexity from $O(N^2)$ to $O(N \log N)$, achieving a 42.9% reduction in our eight-system evaluation while preserving ranking fidelity. Validation on a curated Chinese financial QA dataset demonstrates that DICE achieves 85.7% agreement with human experts, substantially outperforming existing LLM-based metrics such as RAGAS. Our results establish DICE as a responsible, explainable, and efficient paradigm for trustworthy RAG system assessment.
zh
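DICE 用瑞士轮把多系统比较从 $O(N^2)$ 降到 $O(N \log N)$;下面是一个简化的瑞士轮配对草图(未处理避免重复配对等真实赛制细节),judge 回调抽象了论文中的概率式 {A, B, Tie} 评分。

```python
import math, random

def swiss_tournament(systems, judge, n_rounds=None):
    """简化瑞士轮:每轮按当前积分排序后相邻配对,共约 log2(N) 轮,
    总比较次数 O(N log N)。真实赛制还需避免重复配对,此处从简。"""
    n_rounds = n_rounds or math.ceil(math.log2(len(systems)))
    scores = {s: 0.0 for s in systems}
    for _ in range(n_rounds):
        ranked = sorted(systems, key=lambda s: -scores[s])
        for a, b in zip(ranked[0::2], ranked[1::2]):
            r = judge(a, b)  # 返回 "A" / "B" / "Tie"
            if r == "A":
                scores[a] += 1.0
            elif r == "B":
                scores[b] += 1.0
            else:
                scores[a] += 0.5
                scores[b] += 0.5
    return sorted(scores.items(), key=lambda kv: -kv[1])

# 玩具评委:隐藏的“系统强度”决定胜负,仅用于演示排序能否被恢复
random.seed(0)
strength = {f"rag_{i}": i + random.random() for i in range(8)}
print(swiss_tournament(list(strength), lambda a, b: "A" if strength[a] > strength[b] else "B"))
```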

[AI-69] The Wisdom of Deliberating AI Crowds: Does Deliberation Improve LLM-Based Forecasting?

【速读】:该论文旨在解决如何提升大语言模型(Large Language Models, LLMs)在预测任务中的准确性问题。其核心解决方案是引入结构化审议机制,即让多个LLM在更新自身预测前相互审查彼此的输出,以此模拟人类专家间的讨论过程。研究发现,当不同模型(异质群体)共享信息并进行相互审议时,预测准确率显著提升(对数损失减少0.020,相对改善约4%,p = 0.017),而同质模型组则未见改进,表明审议效果依赖于模型间的多样性而非单纯的信息共享。

链接: https://arxiv.org/abs/2512.22625
作者: Paul Schneider,Amalie Schramm
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 13 pages, 2 figures, 5 tables, for source code and data see this https URL

点击查看摘要

Abstract:Structured deliberation has been found to improve the performance of human forecasters. This study investigates whether a similar intervention, i.e. allowing LLMs to review each other’s forecasts before updating, can improve accuracy in large language models (GPT-5, Claude Sonnet 4.5, Gemini Pro 2.5). Using 202 resolved binary questions from the Metaculus Q2 2025 AI Forecasting Tournament, accuracy was assessed across four scenarios: (1) diverse models with distributed information, (2) diverse models with shared information, (3) homogeneous models with distributed information, and (4) homogeneous models with shared information. Results show that the intervention significantly improves accuracy in scenario (2), reducing Log Loss by 0.020 or about 4 percent in relative terms (p = 0.017). However, when homogeneous groups (three instances of the same model) engaged in the same process, no benefit was observed. Unexpectedly, providing LLMs with additional contextual information did not improve forecast accuracy, limiting our ability to study information pooling as a mechanism. Our findings suggest that deliberation may be a viable strategy for improving LLM forecasting.
zh

[AI-70] Chord Recognition with Deep Learning

【速读】:该论文旨在解决自动和弦识别(Automatic Chord Recognition, ACR)在深度学习时代进展缓慢的问题。其关键解决方案在于通过实验验证现有方法的局限性,并提出改进策略:首先发现当前分类器在罕见和弦上表现不佳,进而证明音高增强(pitch augmentation)能显著提升准确率;其次指出从生成模型中提取的特征对性能无益,但合成数据为未来研究提供了极具潜力的方向;最后通过引入节拍检测(beat detection)提升模型输出的可解释性,实现了领域内部分最佳性能结果。

链接: https://arxiv.org/abs/2512.22621
作者: Pierre Mackenzie
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Progress in automatic chord recognition has been slow since the advent of deep learning in the field. To understand why, I conduct experiments on existing methods and test hypotheses enabled by recent developments in generative models. Findings show that chord classifiers perform poorly on rare chords and that pitch augmentation boosts accuracy. Features extracted from generative models do not help and synthetic data presents an exciting avenue for future work. I conclude by improving the interpretability of model outputs with beat detection, reporting some of the best results in the field and providing qualitative analysis. Much work remains to solve automatic chord recognition, but I hope this thesis will chart a path for others to try.
zh
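论文发现音高增强能显著提升和弦识别准确率;下面用 librosa 给出一个移调增强的最小示例,在移调音频的同时同步移调和弦标签。标签格式“根音:属性”为演示假设。

```python
import librosa

PITCHES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def augment_pitch(y, sr, label, n_steps):
    """对音频做 n_steps 个半音的移调,并同步移调和弦标签。"""
    y_shift = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    root, _, quality = label.partition(":")
    new_root = PITCHES[(PITCHES.index(root) + n_steps) % 12]
    return y_shift, f"{new_root}:{quality}" if quality else new_root

y, sr = librosa.load(librosa.ex("trumpet"))  # librosa 自带的示例音频
y_aug, new_label = augment_pitch(y, sr, "C:maj", n_steps=2)
print(new_label)  # D:maj
```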

[AI-71] LLM Agents as VC investors: Predicting Startup Success via RolePlay-Based Collective Simulation

【速读】:该论文旨在解决初创企业融资成功预测问题,其核心挑战在于现有方法通常仅从单一决策者视角建模,忽略了现实中由投资机构群体主导的复杂集体决策过程。解决方案的关键在于提出SimVC-CAS——一种模拟风险投资(VC)决策的多智能体交互系统,通过设计具有差异化特质的角色代理(role-playing agents)和基于图神经网络(GNN)的监督交互模块,将融资预测重构为群体决策任务,从而同时捕捉企业基本面与潜在投资者网络的行为动态,实现更精准且可解释的预测结果。

链接: https://arxiv.org/abs/2512.22608
作者: Zhongyang Liu,Haoyu Pei,Xiangyi Xiao,Xiaocong Du,Yihui Li,Suting Hong,Kunpeng Zhang,Haipeng Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Due to the high value and high failure rate of startups, predicting their success has become a critical challenge across interdisciplinary research. Existing approaches typically model success prediction from the perspective of a single decision-maker, overlooking the collective dynamics of investor groups that dominate real-world venture capital (VC) decisions. In this paper, we propose SimVC-CAS, a novel collective agent system that simulates VC decision-making as a multi-agent interaction process. By designing role-playing agents and a GNN-based supervised interaction module, we reformulate startup financing prediction as a group decision-making task, capturing both enterprise fundamentals and the behavioral dynamics of potential investor networks. Each agent embodies an investor with unique traits and preferences, enabling heterogeneous evaluation and realistic information exchange through a graph-structured co-investment network. Using real-world data from PitchBook and under strict data leakage controls, we show that SimVC-CAS significantly improves predictive accuracy while providing interpretable, multi-perspective reasoning, for example, approximately 25% relative improvement with respect to average precision@10. SimVC-CAS also sheds light on other complex group decision scenarios.
zh

[AI-72] Tyee: A Unified Modular and Fully-Integrated Configurable Toolkit for Intelligent Physiological Health Care

【速读】:该论文旨在解决深度学习在生理信号分析中面临的诸多挑战,包括数据格式异构性、预处理策略不一致、模型流程碎片化以及实验设置不可复现等问题。其解决方案的核心在于提出一个名为Tyee的统一、模块化且可配置的工具包,通过三大关键创新实现:(1) 支持12种信号模态的统一数据接口与可配置预处理流水线;(2) 模块化和可扩展架构,支持跨任务灵活集成与快速原型开发;(3) 端到端工作流配置,提升实验的可复现性与可扩展性。实验证明,Tyee在所有评估任务中均表现出稳定的有效性和泛化能力,其中12个数据集达到或超越当前最优性能。

链接: https://arxiv.org/abs/2512.22601
作者: Tao Zhou,Lingyu Shu,Zixing Zhang,Jing Han
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep learning has shown great promise in physiological signal analysis, yet its progress is hindered by heterogeneous data formats, inconsistent preprocessing strategies, fragmented model pipelines, and non-reproducible experimental setups. To address these limitations, we present Tyee, a unified, modular, and fully-integrated configurable toolkit designed for intelligent physiological healthcare. Tyee introduces three key innovations: (1) a unified data interface and configurable preprocessing pipeline for 12 kinds of signal modalities; (2) a modular and extensible architecture enabling flexible integration and rapid prototyping across tasks; and (3) end-to-end workflow configuration, promoting reproducible and scalable experimentation. Tyee demonstrates consistent practical effectiveness and generalizability, outperforming or matching baselines across all evaluated tasks (with state-of-the-art results on 12 of 13 datasets). The Tyee toolkit is released at this https URL and actively maintained.
zh

[AI-73] SANet: A Semantic-aware Agentic AI Networking Framework for Cross-layer Optimization in 6G

【速读】:该论文旨在解决传统无线网络中缺乏自主决策与动态适应能力的问题,尤其是在复杂多变环境中难以实现高效自配置、自优化和自适应管理的挑战。为此,作者提出了一种语义感知的代理网络架构(Semantic-aware AgentNet, SANet),其核心在于通过多个专业化AI代理协作,自动推断用户语义目标并分配对应层级的网络代理来完成任务。关键创新包括:将SANet的分布式优化建模为多智能体多目标问题以寻找帕累托最优解;设计三种新型评估指标;提出模型分区与共享(Model Partition and Sharing, MoPS)框架,使大型深度学习模型可按本地计算资源拆分为共享与专属部分协同部署;并通过理论分析揭示了优化精度、泛化能力和冲突误差之间的三重权衡关系。实验表明,该方案在仅使用现有先进算法44.37%浮点运算量的情况下,实现了最高达14.61%的性能提升。

链接: https://arxiv.org/abs/2512.22579
作者: Yong Xiao,Xubo Li,Haoran Zhou,Yingyu Li,Yayu Gao,Guangming Shi,Ping Zhang,Marwan Krunz
机构: 未知
类目: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: Submitted to IEEE Transactions on Mobile Computing

点击查看摘要

Abstract:Agentic AI networking (AgentNet) is a novel AI-native networking paradigm in which a large number of specialized AI agents collaborate to perform autonomous decision-making, dynamic environmental adaptation, and complex missions. It has the potential to facilitate real-time network management and optimization functions, including self-configuration, self-optimization, and self-adaptation across diverse and complex environments. This paper proposes SANet, a novel semantic-aware AgentNet architecture for wireless networks that can infer the semantic goal of the user and automatically assign agents associated with different layers of the network to fulfill the inferred goal. Motivated by the fact that AgentNet is a decentralized framework in which collaborating agents may generally have different and even conflicting objectives, we formulate the decentralized optimization of SANet as a multi-agent multi-objective problem, and focus on finding the Pareto-optimal solution for agents with distinct and potentially conflicting objectives. We propose three novel metrics for evaluating SANet. Furthermore, we develop a model partition and sharing (MoPS) framework in which large models, e.g., deep learning models, of different agents can be partitioned into shared and agent-specific parts that are jointly constructed and deployed according to agents’ local computational resources. Two decentralized optimization algorithms are proposed. We derive theoretical bounds and prove that there exists a three-way tradeoff among optimization, generalization, and conflicting errors. We develop an open-source RAN and core network-based hardware prototype that implements agents to interact with three different layers of the network. Experimental results show that the proposed framework achieved performance gains of up to 14.61% while requiring only 44.37% of FLOPs required by state-of-the-art algorithms.
zh

[AI-74] Lessons from Neuroscience for AI: How integrating Actions Compositional Structure and Episodic Memory could enable Safe Interpretable and Human-Like AI

【速读】:该论文旨在解决当前基础模型(Foundation Models)在安全性、可解释性、能源效率及类人智能方面存在的不足,这些问题主要源于其对预测编码机制的单一优化,忽略了神经科学与认知科学中先进的预测编码模型所包含的三个关键要素:动作与生成模型的紧密整合、分层组合结构以及情景记忆。论文提出的核心解决方案是将这些缺失组件引入基础模型,使其不仅具备多尺度抽象层次上的行动能力,还拥有基于情景记忆的因果理解与自我调节机制,从而提升模型的具身性(embodiment)、责任感知(sense of agency)和能量效率,并缓解幻觉现象与表面理解问题。这一方案强调脑科学与人工智能之间的跨学科融合,以推动更安全、可解释且以人为本的人工智能发展。

链接: https://arxiv.org/abs/2512.22568
作者: Rajesh P. N. Rao,Vishwas Sathish,Linxing Preston Jiang,Matthew Bryan,Prashant Rangarajan
机构: 未知
类目: Artificial Intelligence (cs.AI); Biological Physics (physics.bio-ph); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:The phenomenal advances in large language models (LLMs) and other foundation models over the past few years have been based on optimizing large-scale transformer models on the surprisingly simple objective of minimizing next-token prediction loss, a form of predictive coding that is also the backbone of an increasingly popular model of brain function in neuroscience and cognitive science. However, current foundation models ignore three other important components of state-of-the-art predictive coding models: tight integration of actions with generative models, hierarchical compositional structure, and episodic memory. We propose that to achieve safe, interpretable, energy-efficient, and human-like AI, foundation models should integrate actions, at multiple scales of abstraction, with a compositional generative architecture and episodic memory. We present recent evidence from neuroscience and cognitive science on the importance of each of these components. We describe how the addition of these missing components to foundation models could help address some of their current deficiencies: hallucinations and superficial understanding of concepts due to lack of grounding, a missing sense of agency/responsibility due to lack of control, threats to safety and trustworthiness due to lack of interpretability, and energy inefficiency. We compare our proposal to current trends, such as adding chain-of-thought (CoT) reasoning and retrieval-augmented generation (RAG) to foundation models, and discuss new ways of augmenting these models with brain-inspired components. We conclude by arguing that a rekindling of the historically fruitful exchange of ideas between brain science and AI will help pave the way towards safe and interpretable human-centered AI.
zh

[AI-75] RollArc: Scaling Agentic RL Training via Disaggregated Infrastructure

【速读】:该论文旨在解决生成式 AI (Generative AI)代理强化学习(Agentic Reinforcement Learning, Agentic RL)训练效率低下的问题。由于 Agentic RL 工作负载具有高度异构性,包含计算密集型的预填充阶段、带宽受限的解码过程以及状态敏感的 CPU 密集型环境模拟,传统统一架构难以高效利用硬件资源,导致同步延迟和资源浪费。其解决方案的关键在于提出 RollArc 分布式系统,通过三个核心设计实现:(1) 基于硬件亲和性的任务映射,将不同特性的任务分配至最优 GPU 设备;(2) 细粒度异步执行机制,以轨迹为单位调度任务,减少资源空转;(3) 状态感知计算卸载,将无状态组件(如奖励模型)迁移至无服务器基础设施以支持弹性扩展。实验证明,RollArc 在多任务场景下显著提升吞吐量,并在超过 3000 张 GPU 的 Alibaba 集群上成功训练百亿参数 MoE 模型,相较单体同步基线实现 1.35–2.05× 的端到端训练时间加速。

链接: https://arxiv.org/abs/2512.22560
作者: Wei Gao,Yuheng Zhao,Tianyuan Wu,Shaopan Xiong,Weixun Wang,Dakai An,Lunxi Cao,Dilxat Muhtar,Zichen Liu,Haizhou Zhao,Ju Huang,Siran Yang,Yongbin Li,Wenbo Su,Jiamang Wang,Lin Qu,Bo Zheng,Wei Wang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 17 pages, 17 figures

点击查看摘要

Abstract:Agentic Reinforcement Learning (RL) enables Large Language Models (LLMs) to perform autonomous decision-making and long-term planning. Unlike standard LLM post-training, agentic RL workloads are highly heterogeneous, combining compute-intensive prefill phases, bandwidth-bound decoding, and stateful, CPU-heavy environment simulations. We argue that efficient agentic RL training requires disaggregated infrastructure to leverage specialized, best-fit hardware. However, naive disaggregation introduces substantial synchronization overhead and resource underutilization due to the complex dependencies between stages. We present RollArc, a distributed system designed to maximize throughput for multi-task agentic RL on disaggregated infrastructure. RollArc is built on three core principles: (1) hardware-affinity workload mapping, which routes compute-bound and bandwidth-bound tasks to best-fit GPU devices, (2) fine-grained asynchrony, which manages execution at the trajectory level to mitigate resource bubbles, and (3) statefulness-aware computation, which offloads stateless components (e.g., reward models) to serverless infrastructure for elastic scaling. Our results demonstrate that RollArc effectively improves training throughput and achieves 1.35–2.05× end-to-end training time reduction compared to monolithic and synchronous baselines. We also evaluate RollArc by training a hundreds-of-billions-parameter MoE model for the Qoder product on an Alibaba cluster with more than 3,000 GPUs, further demonstrating RollArc’s scalability and robustness. The code is available at this https URL.
zh

[AI-76] TimePerceiver: An Encoder-Decoder Framework for Generalized Time-Series Forecasting NEURIPS2025

【速读】:该论文旨在解决时间序列预测中长期存在的问题:现有方法多集中于编码器设计,而将预测(解码)和训练策略视为次要或独立环节,导致模型整体性能受限。其核心解决方案是提出一个统一的编码器-解码器框架 TimePerceiver,该框架通过紧密耦合的训练策略实现端到端优化。关键创新在于:首先将预测任务泛化为包含外推、内插和缺失值填补等多种时序目标;其次设计了一种可灵活感知任意时间位置的架构,利用一组潜变量瓶颈表示与所有输入段交互以联合捕捉时间依赖性和跨通道关系,并通过对应目标时间戳的可学习查询向量高效检索相关信息,从而实现对复杂时序模式的精准建模。

链接: https://arxiv.org/abs/2512.22550
作者: Jaebin Lee,Hankook Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to NeurIPS 2025. The code is available at this https URL

点击查看摘要

Abstract:In machine learning, effective modeling requires a holistic consideration of how to encode inputs, make predictions (i.e., decoding), and train the model. However, in time-series forecasting, prior work has predominantly focused on encoder design, often treating prediction and training as separate or secondary concerns. In this paper, we propose TimePerceiver, a unified encoder-decoder forecasting framework that is tightly aligned with an effective training strategy. To be specific, we first generalize the forecasting task to include diverse temporal prediction objectives such as extrapolation, interpolation, and imputation. Since this generalization requires handling input and target segments that are arbitrarily positioned along the temporal axis, we design a novel encoder-decoder architecture that can flexibly perceive and adapt to these varying positions. For encoding, we introduce a set of latent bottleneck representations that can interact with all input segments to jointly capture temporal and cross-channel dependencies. For decoding, we leverage learnable queries corresponding to target timestamps to effectively retrieve relevant information. Extensive experiments demonstrate that our framework consistently and significantly outperforms prior state-of-the-art baselines across a wide range of benchmark datasets. The code is available at this https URL.
zh
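TimePerceiver 的两个关键部件(潜变量瓶颈编码,以及按目标时间戳检索信息的可学习查询)可用如下 PyTorch 玩具模型示意;单通道简化版,维度与层数均为演示假设,并非官方实现。

```python
import torch
import torch.nn as nn

class TinyPerceiverForecaster(nn.Module):
    """极简示意:潜变量瓶颈编码 + 目标时间戳可学习查询解码。"""
    def __init__(self, d=64, n_latents=16, horizon=24):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d) * 0.02)
        self.queries = nn.Parameter(torch.randn(horizon, d) * 0.02)  # 对应目标时间戳
        self.in_proj = nn.Linear(1, d)
        self.enc_attn = nn.MultiheadAttention(d, 4, batch_first=True)
        self.dec_attn = nn.MultiheadAttention(d, 4, batch_first=True)
        self.head = nn.Linear(d, 1)

    def forward(self, x):                                 # x: (B, L) 单通道历史序列
        tokens = self.in_proj(x.unsqueeze(-1))            # (B, L, d)
        B = x.size(0)
        lat = self.latents.expand(B, -1, -1)
        lat, _ = self.enc_attn(lat, tokens, tokens)       # 潜变量与全部输入段交互
        q = self.queries.expand(B, -1, -1)
        out, _ = self.dec_attn(q, lat, lat)               # 查询从瓶颈中检索相关信息
        return self.head(out).squeeze(-1)                 # (B, horizon)

model = TinyPerceiverForecaster()
print(model(torch.randn(8, 96)).shape)  # torch.Size([8, 24])
```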

[AI-77] Multi-AI Agent Framework Reveals the “Oxide Gatekeeper” in Aluminum Nanoparticle Oxidation

【速读】:该论文旨在解决铝纳米颗粒(Aluminum Nanoparticles, ANPs)从钝化状态向爆炸性反应物转变的原子级机制不明确的问题,这一问题长期受限于计算方法的瓶颈:第一性原理方法虽具量子精度但仅适用于小尺度(<500原子)和短时间(皮秒级),而经验力场则缺乏复杂燃烧环境中所需的反应真实性。解决方案的关键在于提出一种“人在回路中”(human-in-the-loop)的闭环框架,通过自检AI代理验证机器学习势(Machine Learning Potential, MLP)的演化过程,使模型在保持量子力学精度的同时实现近线性扩展至百万原子系统并跨越纳秒时间尺度(能量均方根误差:1.2 meV/atom,力均方根误差:0.126 eV/Å)。该方法成功揭示了温度调控下的双模式氧化机制,并澄清了铝离子向外扩散主导质量传输的争议,为高能纳米材料的精准设计提供了原子尺度理论基础。

链接: https://arxiv.org/abs/2512.22529
作者: Yiming Lu,Tingyu Lu,Di Zhang,Lili Ye,Hao Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci)
备注: 21 pages, 7 figures

点击查看摘要

Abstract:Aluminum nanoparticles (ANPs) are among the most energy-dense solid fuels, yet the atomic mechanisms governing their transition from passivated particles to explosive reactants remain elusive. This stems from a fundamental computational bottleneck: ab initio methods offer quantum accuracy but are restricted to small spatiotemporal scales (<500 atoms, picoseconds), while empirical force fields lack the reactive fidelity required for complex combustion environments. Herein, we bridge this gap by employing a “human-in-the-loop” closed-loop framework where self-auditing AI Agents validate the evolution of a machine learning potential (MLP). By acting as scientific sentinels that visualize hidden model artifacts for human decision-making, this collaborative cycle ensures quantum mechanical accuracy while exhibiting near-linear scalability to million-atom systems and accessing nanosecond timescales (energy RMSE: 1.2 meV/atom, force RMSE: 0.126 eV/Angstrom). Strikingly, our simulations reveal a temperature-regulated dual-mode oxidation mechanism: at moderate temperatures, the oxide shell acts as a dynamic “gatekeeper,” regulating oxidation through a “breathing mode” of transient nanochannels; above a critical threshold, a “rupture mode” unleashes catastrophic shell failure and explosive combustion. Importantly, we resolve a decades-old controversy by demonstrating that aluminum cation outward diffusion, rather than oxygen transport, dominates mass transfer across all temperature regimes, with diffusion coefficients consistently exceeding those of oxygen by 2-3 orders of magnitude. These discoveries establish a unified atomic-scale framework for energetic nanomaterial design, enabling the precision engineering of ignition sensitivity and energy release rates through intelligent computational design.
zh

[AI-78] Towards Reliable Evaluation of Adversarial Robustness for Spiking Neural Networks

【速读】:该论文旨在解决脉冲神经网络(Spiking Neural Networks, SNNs)在对抗攻击评估中因尖峰激活的二值性和不连续性导致梯度消失问题,从而使得基于梯度下降的鲁棒性评估不可靠。其解决方案的关键在于提出一种更可靠的评估框架:首先理论分析了替代梯度(surrogate gradient)中的梯度消失程度,并引入自适应锐度替代梯度(Adaptive Sharpness Surrogate Gradient, ASSG),该方法在攻击迭代过程中根据输入分布自适应调整替代函数形状,以提升梯度精度并缓解梯度消失;同时设计了一种具有自适应步长的 $L_\infty$ 约束对抗攻击方法——稳定自适应投影梯度下降(Stable Adaptive Projected Gradient Descent, SA-PGD),在梯度不精确的情况下实现更快且更稳定的收敛。实验表明,该方案显著提高了多种SNN架构和训练策略下的攻击成功率,揭示了当前SNN鲁棒性被严重高估的问题,强调了发展更可靠对抗训练方法的必要性。

链接: https://arxiv.org/abs/2512.22522
作者: Jihang Wang,Dongcheng Zhao,Ruolin Chen,Qian Zhang,Yi Zeng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) utilize spike-based activations to mimic the brain’s energy-efficient information processing. However, the binary and discontinuous nature of spike activations causes vanishing gradients, making adversarial robustness evaluation via gradient descent unreliable. While improved surrogate gradient methods have been proposed, their effectiveness under strong adversarial attacks remains unclear. We propose a more reliable framework for evaluating SNN adversarial robustness. We theoretically analyze the degree of gradient vanishing in surrogate gradients and introduce the Adaptive Sharpness Surrogate Gradient (ASSG), which adaptively evolves the shape of the surrogate function according to the input distribution during attack iterations, thereby enhancing gradient accuracy while mitigating gradient vanishing. In addition, we design an adversarial attack with adaptive step size under the $L_\infty$ constraint – Stable Adaptive Projected Gradient Descent (SA-PGD) – achieving faster and more stable convergence under imprecise gradients. Extensive experiments show that our approach substantially increases attack success rates across diverse adversarial training schemes, SNN architectures and neuron models, providing a more generalized and reliable evaluation of SNN adversarial robustness. The experimental results further reveal that the robustness of current SNNs has been significantly overestimated, highlighting the need for more dependable adversarial training methods.
zh
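SA-PGD 的骨架是带步长调度的 $L_\infty$ 投影梯度攻击;下面给出一个通用草图,其中线性衰减的步长调度仅为示意,论文中的自适应步长规则与 ASSG 替代梯度需另行实现(SNN 场景下 model 的反向传播应经过替代函数)。

```python
import torch
import torch.nn.functional as F

def sa_pgd(model, x, y, eps=8 / 255, n_iter=20):
    """带步长调度的 L_inf PGD 骨架。x, y 为干净样本与标签,取值范围 [0, 1]。"""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for t in range(n_iter):
        alpha = 2 * eps * (1 - t / n_iter)        # 假设:线性衰减的步长
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)   # SNN 中此处梯度来自替代函数
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = (x_adv + alpha * grad.sign()).detach()  # 符号梯度上升
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1).detach()  # 投影回 L_inf 球
    return x_adv
```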

[AI-79] Predicting LLM Correctness in Prosthodontics Using Metadata and Hallucination Signals ALT

【速读】:该论文试图解决大型语言模型(Large Language Models, LLMs)在高风险领域(如医疗和医学教育)中生成事实性错误信息(即幻觉)的问题,尤其是如何预测LLM输出的正确性这一关键但研究不足的问题。解决方案的关键在于利用模型输出的元数据(metadata)和幻觉信号,结合三种不同的提示策略(prompting strategies),为每个模型-提示组合构建一个正确性预测器(correctness predictor)。实验表明,该基于元数据的方法可将准确率提升最多7.14%,并达到83.12%的精度,显著优于假设所有答案均正确的基线方法;同时发现,尽管实际幻觉是错误判断的强指标,仅靠元数据无法可靠预测幻觉,且提示策略虽不改变整体准确率,却显著影响模型内部行为及元数据的预测效用。

链接: https://arxiv.org/abs/2512.22508
作者: Lucky Susanto,Anasta Pranawijayana,Cortino Sukotjo,Soni Prasad,Derry Wijaya
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted as a Short Paper at HEALTHINF2026

点击查看摘要

Abstract:Large language models (LLMs) are increasingly adopted in high-stakes domains such as healthcare and medical education, where the risk of generating factually incorrect (i.e., hallucinated) information is a major concern. While significant efforts have been made to detect and mitigate such hallucinations, predicting whether an LLM’s response is correct remains a critical yet underexplored problem. This study investigates the feasibility of predicting correctness by analyzing a general-purpose model (GPT-4o) and a reasoning-centric model (OSS-120B) on a multiple-choice prosthodontics exam. We utilize metadata and hallucination signals across three distinct prompting strategies to build a correctness predictor for each (model, prompting) pair. Our findings demonstrate that this metadata-based approach can improve accuracy by up to +7.14% and achieve a precision of 83.12% over a baseline that assumes all answers are correct. We further show that while actual hallucination is a strong indicator of incorrectness, metadata signals alone are not reliable predictors of hallucination. Finally, we reveal that prompting strategies, despite not affecting overall accuracy, significantly alter the models’ internal behaviors and the predictive utility of their metadata. These results present a promising direction for developing reliability signals in LLMs but also highlight that the methods explored in this paper are not yet robust enough for critical, high-stakes deployment.
zh
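论文用元数据与幻觉信号训练正确性预测器;其骨架可以简化为如下草图:在(此处为合成的)元数据特征上拟合一个逻辑回归分类器。特征选取与数据生成方式均为演示假设,并非论文实验设置。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 合成的元数据特征(假设):回答长度、平均 token 概率、幻觉检测器的二值信号
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.normal(200, 50, n),
    rng.uniform(0.5, 1.0, n),
    rng.integers(0, 2, n).astype(float),
])
p_correct = 1 / (1 + np.exp(-(3 * (X[:, 1] - 0.75) - 1.5 * X[:, 2])))  # 玩具生成机制
y = rng.random(n) < p_correct

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("accuracy:", round(clf.score(X_te, y_te), 3))
```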

[AI-80] Hierarchical Pedagogical Oversight: A Multi-Agent Adversarial Framework for Reliable AI Tutoring AAAI2026

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在作为自动化导师时缺乏有效教学推理能力的问题,具体表现为对错误学生解答的盲目认同(sycophancy)以及提供过于直接的答案,从而削弱学习效果。其解决方案的关键在于提出一种分层教学监督框架(Hierarchical Pedagogical Oversight, HPO),该框架通过结构化的对抗合成方法实现教育评估:首先由专业代理提取对话上下文,随后在受控的五幕辩论中引入对立的教学批评者进行辩证讨论,从而避免合作式多智能体系统易产生的表面共识问题,增强模型的批判性判断力与教学适配性。

链接: https://arxiv.org/abs/2512.22496
作者: Saisab Sadhu,Ashim Dhor
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: Accepted for presentation at the AAAI 2026 EGSAI Community Activity (AAAI 2026)

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed as automated tutors to address educator shortages; however, they often fail at pedagogical reasoning, frequently validating incorrect student solutions (sycophancy) or providing overly direct answers that hinder learning. We introduce Hierarchical Pedagogical Oversight (HPO), a framework that adapts structured adversarial synthesis to educational assessment. Unlike cooperative multi-agent systems that often drift toward superficial consensus, HPO enforces a dialectical separation of concerns: specialist agents first distill dialogue context, which then grounds a moderated, five-act debate between opposing pedagogical critics. We evaluate this framework on the MRBench dataset of 1,214 middle-school mathematics dialogues. Our 8B-parameter model achieves a Macro F1 of 0.845, outperforming GPT-4o (0.812) by 3.3% while using 20 times fewer parameters. These results establish adversarial reasoning as a critical mechanism for deploying reliable, low-compute pedagogical oversight in resource-constrained environments.
zh

[AI-81] Role-Based Fault Tolerance System for LLM RL Post-Training

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)后训练过程中因分布式系统故障导致的高恢复开销问题,尤其是在训练与推理任务交错执行时,传统容错机制无法有效应对异步执行场景下的GPU节点失效问题。其核心解决方案是提出RobustRL系统,通过角色化故障隔离(role-based fault isolation)实现细粒度恢复:将训练器(trainer)、回放(rollout)和管理角色作为独立分布式子任务处理;在故障发生时仅重启失败角色并重新连接至存活节点,避免全量重启动带来的回放重演和初始化延迟;具体包括三项关键技术:(1)角色感知监控(Detect),区分真实故障与角色特有行为以减少误报和延迟检测;(2)非中断式恢复(Restart),训练器利用热备回放缓存轨迹继续生成,回放节点则实现无中断机器替换;(3)动态通信重构(Reconnect),以UCX为基础的点对点通信替代静态集合通信,实现恢复角色与其它角色间的即时权重同步。实验表明,在256 GPU集群上使用Qwen3-8B-Math任务注入10%故障频率时,RobustRL相较ByteRobust可将有效训练时间比(Effective Training Time Ratio, ETTR)从60%提升至80%,端到端训练时间加速8.4%–17.4%。

链接: https://arxiv.org/abs/2512.22492
作者: Zhenqian Chen,Baoquan Zhong,Xiang Li,Qing Dai,Xinkui Zhao,Miao Ye,Ren Cheng,Lufei Zhang,Jianwei Yin
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 19 figures

点击查看摘要

Abstract:RL post-training for LLMs has been widely scaled to enhance reasoning and tool-using capabilities. However, RL post-training interleaves training and inference workloads, exposing the system to faults from both sides. Existing fault tolerance frameworks for LLMs target either training or inference, leaving the optimization potential in the asynchronous execution unexplored for RL. Our key insight is role-based fault isolation so the failure in one machine does not affect the others. We treat trainer, rollout, and other management roles in RL training as distinct distributed sub-tasks. Instead of restarting the entire RL task in ByteRobust, we recover only the failed role and reconnect it to living ones, thereby eliminating the full-restart overhead including rollout replay and initialization delay. We present RobustRL, the first comprehensive robust system to handle GPU machine errors for RL post-training Effective Training Time Ratio improvement. (1) \textitDetect. We implement role-aware monitoring to distinguish actual failures from role-specific behaviors to avoid the false positive and delayed detection. (2) \textitRestart. For trainers, we implement a non-disruptive recovery where rollouts persist state and continue trajectory generation, while the trainer is rapidly restored via rollout warm standbys. For rollout, we perform isolated machine replacement without interrupting the RL task. (3) \textitReconnect. We replace static collective communication with dynamic, UCX-based (Unified Communication X) point-to-point communication, enabling immediate weight synchronization between recovered roles. In an RL training task on a 256-GPU cluster with Qwen3-8B-Math workload under 10% failure injection frequency, RobustRL can achieve an ETTR of over 80% compared with the 60% in ByteRobust and achieves 8.4%-17.4% faster in end-to-end training time. Comments: 16 pages, 19 figures Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2512.22492 [cs.DC] (or arXiv:2512.22492v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2512.22492 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-82] SPECTRE: Spectral Pre-training Embeddings with Cylindrical Temporal Rotary Position Encoding for Fine-Grained sEMG-Based Movement Decoding

【速读】:该论文旨在解决从非侵入式表面肌电信号(sEMG)中解码精细运动意图的难题,该问题因信号非平稳性和信噪比低而极具挑战性。现有通用自监督学习(SSL)框架在sEMG任务上表现不佳,主要原因是其试图重建噪声较大的原始信号,且缺乏对电极阵列圆柱拓扑结构的归纳偏置。解决方案的关键在于提出SPECTRE这一领域特定的SSL框架,其核心创新包括:一是基于生理学原理的预训练任务,通过掩码预测聚类后的短时傅里叶变换(STFT)离散伪标签,迫使模型学习鲁棒的生理相关频域特征;二是新颖的圆柱旋转位置编码(CyRoPE),显式建模前臂传感器的环形空间维度与线性时间维度,从而捕捉肌肉协同作用。实验证明,SPECTRE在多个数据集上显著优于监督基线和通用SSL方法,成为sEMG运动解码的新基准。

链接: https://arxiv.org/abs/2512.22481
作者: Zihan Weng,Chanlin Yi,Pouya Bashivan,Jing Lu,Fali Li,Dezhong Yao,Jingming Hou,Yangsong Zhang,Peng Xu
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Decoding fine-grained movement from non-invasive surface Electromyography (sEMG) is a challenge for prosthetic control due to signal non-stationarity and low signal-to-noise ratios. Generic self-supervised learning (SSL) frameworks often yield suboptimal results on sEMG as they attempt to reconstruct noisy raw signals and lack the inductive bias to model the cylindrical topology of electrode arrays. To overcome these limitations, we introduce SPECTRE, a domain-specific SSL framework. SPECTRE features two primary contributions: a physiologically-grounded pre-training task and a novel positional encoding. The pre-training involves masked prediction of discrete pseudo-labels from clustered Short-Time Fourier Transform (STFT) representations, compelling the model to learn robust, physiologically relevant frequency patterns. Additionally, our Cylindrical Rotary Position Embedding (CyRoPE) factorizes embeddings along linear temporal and annular spatial dimensions, explicitly modeling the forearm sensor topology to capture muscle synergies. Evaluations on multiple datasets, including challenging data from individuals with amputation, demonstrate that SPECTRE establishes a new state-of-the-art for movement decoding, significantly outperforming both supervised baselines and generic SSL approaches. Ablation studies validate the critical roles of both spectral pre-training and CyRoPE. SPECTRE provides a robust foundation for practical myoelectric interfaces capable of handling real-world sEMG complexities.
zh

[AI-83] The Bayesian Geometry of Transformer Attention

【速读】:该论文旨在解决当前对Transformer模型是否真正实现贝叶斯推理(Bayesian inference)缺乏严谨验证的问题。传统方法受限于自然数据中无法获得解析后验分布,且大规模模型易将推理与记忆混淆,难以区分因果机制。为此,作者构建了“贝叶斯风洞”(Bayesian wind tunnels)——一种可控环境,在其中真后验分布具有闭式解、且记忆效应可被严格排除。在此设定下,小型Transformer能以 $10^{-3}$–$10^{-4}$ bit 的精度复现贝叶斯后验,而容量相当的多层感知机(MLP)则显著失败,揭示出架构层面的本质差异。关键在于通过几何诊断发现:残差流作为信念载体,前馈网络执行后验更新,注意力机制提供内容寻址路由,且训练过程中值空间由后验熵参数化、注意力模式保持稳定,体现出“帧-精度解耦”现象,从而从几何设计角度阐明了分层注意力为何能实现贝叶斯推理,同时解释了扁平架构的失效原因。

链接: https://arxiv.org/abs/2512.22471
作者: Naman Aggarwal,Siddhartha R. Dalal,Vishal Misra
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Transformers often appear to perform Bayesian reasoning in context, but verifying this rigorously has been impossible: natural data lack analytic posteriors, and large models conflate reasoning with memorization. We address this by constructing Bayesian wind tunnels – controlled environments where the true posterior is known in closed form and memorization is provably impossible. In these settings, small transformers reproduce Bayesian posteriors with $10^{-3}$–$10^{-4}$ bit accuracy, while capacity-matched MLPs fail by orders of magnitude, establishing a clear architectural separation. Across two tasks – bijection elimination and Hidden Markov Model (HMM) state tracking – we find that transformers implement Bayesian inference through a consistent geometric mechanism: residual streams serve as the belief substrate, feed-forward networks perform the posterior update, and attention provides content-addressable routing. Geometric diagnostics reveal orthogonal key bases, progressive query–key alignment, and a low-dimensional value manifold parameterized by posterior entropy. During training this manifold unfurls while attention patterns remain stable, a frame–precision dissociation predicted by recent gradient analyses. Taken together, these results demonstrate that hierarchical attention realizes Bayesian inference by geometric design, explaining both the necessity of attention and the failure of flat architectures. Bayesian wind tunnels provide a foundation for mechanistically connecting small, verifiable systems to reasoning phenomena observed in large language models.
zh

[AI-84] DarkPatterns-LLM: A Multi-Layer Benchmark for Detecting Manipulative and Harmful AI Behavior

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)中潜在的操纵性内容识别难题,现有安全评估基准多依赖粗粒度二分类标签,难以捕捉操纵行为背后复杂的心理与社会机制。解决方案的关键在于提出首个标准化、多维度的基准数据集与诊断框架——DarkPatterns-LLM,其核心创新为四层分析流程:多粒度检测(Multi-Granular Detection, MGD)、多尺度意图分析(Multi-Scale Intent Analysis, MSIAN)、威胁协同协议(Threat Harmonization Protocol, THP)以及深度情境风险对齐(Deep Contextual Risk Alignment, DCRA),能够精细识别七类危害(法律/权力、心理、情感、物理、自主权、经济与社会危害)下的操纵模式,并通过401个精心标注的指令-响应对验证了主流模型在检测自主权削弱类模式上的系统性薄弱环节,从而为构建更可信的人工智能系统提供可操作的诊断工具。

链接: https://arxiv.org/abs/2512.22470
作者: Sadia Asif,Israel Antonio Rosales Laguan,Haris Khan,Shumaila Asif,Muneeb Asif
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The proliferation of Large Language Models (LLMs) has intensified concerns about manipulative or deceptive behaviors that can undermine user autonomy, trust, and well-being. Existing safety benchmarks predominantly rely on coarse binary labels and fail to capture the nuanced psychological and social mechanisms constituting manipulation. We introduce DarkPatterns-LLM, a comprehensive benchmark dataset and diagnostic framework for fine-grained assessment of manipulative content in LLM outputs across seven harm categories: Legal/Power, Psychological, Emotional, Physical, Autonomy, Economic, and Societal Harm. Our framework implements a four-layer analytical pipeline comprising Multi-Granular Detection (MGD), Multi-Scale Intent Analysis (MSIAN), Threat Harmonization Protocol (THP), and Deep Contextual Risk Alignment (DCRA). The dataset contains 401 meticulously curated examples with instruction-response pairs and expert annotations. Through evaluation of state-of-the-art models including GPT-4, Claude 3.5, and LLaMA-3-70B, we observe significant performance disparities (65.2%–89.7%) and consistent weaknesses in detecting autonomy-undermining patterns. DarkPatterns-LLM establishes the first standardized, multi-dimensional benchmark for manipulation detection in LLMs, offering actionable diagnostics toward more trustworthy AI systems.
zh

[AI-85] AMBIT: Augmenting Mobility Baselines with Interpretable Trees

【速读】:该论文旨在解决交通流预测中高精度与可解释性之间的矛盾问题,即如何在保持模型预测准确性的同时确保其决策过程对城市规划者和管理者具有透明性和可理解性。解决方案的关键在于提出一种灰箱框架AMBIT,该框架通过在物理基线模型(如PPML重力模型)基础上叠加可解释的梯度提升树(Gradient-Boosted Trees)残差学习器,并结合SHAP值进行特征重要性分析,从而实现既接近先进机器学习模型的精度,又保留了基于物理机制的结构可解释性;尤其值得注意的是,基于POI锚定的残差项在空间泛化能力上表现最为稳健,为城市决策提供了可靠且可解释的预测支持。

链接: https://arxiv.org/abs/2512.22466
作者: Qizhi Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages; 12 figures; 30 tables

点击查看摘要

Abstract:Origin-destination (OD) flow prediction remains a core task in GIS and urban analytics, yet practical deployments face two conflicting needs: high accuracy and clear interpretability. This paper develops AMBIT, a gray-box framework that augments physical mobility baselines with interpretable tree models. We begin with a comprehensive audit of classical spatial interaction models on a year-long, hourly NYC taxi OD dataset. The audit shows that most physical models are fragile at this temporal resolution; PPML gravity is the strongest physical baseline, while constrained variants improve when calibrated on full OD margins but remain notably weaker. We then build residual learners on top of physical baselines using gradient-boosted trees and SHAP analysis, demonstrating that (i) physics-grounded residuals approach the accuracy of a strong tree-based predictor while retaining interpretable structure, and (ii) POI-anchored residuals are consistently competitive and most robust under spatial generalization. We provide a reproducible pipeline, rich diagnostics, and spatial error analysis designed for urban decision-making.
zh

[AI-86] Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving

【速读】:该论文旨在解决传统推测解码(Speculative Decoding, SD)在动态请求负载下性能不稳定的问题,尤其是在高负载、计算密集型场景中因验证开销导致的吞吐量下降和延迟增加。现有SD方法采用固定推测长度,无法根据实时批处理大小和请求速率自适应调整,从而在实际部署中形成性能瓶颈。解决方案的关键在于提出一种基于学习的自适应推测推理算法Nightjar,其能够动态选择最优推测长度,甚至在推测解码无收益时主动禁用该机制,从而在不同负载条件下实现稳定且高效的推理性能。实验表明,Nightjar相较标准SD可提升最高14.8%的吞吐量并降低20.2%的延迟。

链接: https://arxiv.org/abs/2512.22420
作者: Rui Li,Zhaoning Zhang,Libo Zhang,Huaimin Wang,Xiang Fu,Zhiquan Lai
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: 6 pages, 11 figures

点击查看摘要

Abstract:Speculative decoding (SD) accelerates LLM inference by verifying draft tokens in parallel. However, this method presents a critical trade-off: it improves throughput in low-load, memory-bound systems but degrades performance in high-load, compute-bound environments due to verification overhead. Current SD implementations use a fixed speculative length, failing to adapt to dynamic request rates and creating a significant performance bottleneck in real-world serving scenarios. To overcome this, we propose Nightjar, a novel learning-based algorithm for adaptive speculative inference that adjusts to request load by dynamically selecting the optimal speculative length for different batch sizes and even disabling speculative decoding when it provides no benefit. Experiments show that Nightjar achieves up to 14.8% higher throughput and 20.2% lower latency compared to standard speculative decoding, demonstrating robust efficiency for real-time serving.
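作为原理示意,下面用一个极简的 epsilon-greedy 赌博机勾勒“按负载自适应选择推测长度”的思路:按批大小分桶,每个桶内把候选推测长度(含 0,表示关闭推测解码)当作臂,以观测吞吐为奖励。分桶方式、候选长度等均为假设,并非 Nightjar 的官方实现。

```python
import random
from collections import defaultdict

class SpecLengthBandit:
    """按批大小分桶的 epsilon-greedy 赌博机,动态选择推测长度。"""

    def __init__(self, lengths=(0, 2, 4, 8), epsilon=0.1):
        self.lengths = lengths
        self.epsilon = epsilon
        # bucket -> {推测长度: [采样次数, 平均吞吐]}
        self.stats = defaultdict(lambda: {k: [0, 0.0] for k in lengths})

    def bucket(self, batch_size):
        return min(batch_size.bit_length(), 6)   # 粗粒度分桶:1, 2-3, 4-7, ...

    def choose(self, batch_size):
        arms = self.stats[self.bucket(batch_size)]
        if random.random() < self.epsilon:
            return random.choice(self.lengths)            # 探索
        return max(arms, key=lambda k: arms[k][1])        # 利用:选平均吞吐最高的长度

    def update(self, batch_size, length, throughput):
        n, mean = self.stats[self.bucket(batch_size)][length]
        self.stats[self.bucket(batch_size)][length] = [n + 1, mean + (throughput - mean) / (n + 1)]
```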
zh

[AI-87] Emergence of Human to Robot Transfer in Vision-Language-Action Models

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在开放世界场景中实现人类到机器人技能迁移的难题,尤其是在仅使用人类视频数据时难以直接训练出有效VLA模型的问题。其核心挑战在于建立人类行为与机器人执行之间的映射关系,传统方法依赖人工工程设计,效率低且泛化能力差。解决方案的关键在于提出一种简单的协同训练(co-training)策略,并通过大规模、多样化的预训练(涵盖多种场景、任务和机器人形态),使VLA模型学习到与具体机器人形态无关的表征(embodiment-agnostic representations),从而自然涌现出从人类视频到机器人动作的有效迁移能力。实验验证表明,当预训练数据足够多样化时,该方法可在仅见于人类数据的泛化设置下将性能提升近一倍。

链接: https://arxiv.org/abs/2512.22414
作者: Simar Kareer,Karl Pertsch,James Darpinian,Judy Hoffman,Danfei Xu,Sergey Levine,Chelsea Finn,Suraj Nair
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-language-action (VLA) models can enable broad open world generalization, but require large and diverse datasets. It is appealing to consider whether some of this data can come from human videos, which cover diverse real-world situations and are easy to obtain. However, it is difficult to train VLAs with human videos alone, and establishing a mapping between humans and robots requires manual engineering and presents a major research challenge. Drawing inspiration from advances in large language models, where the ability to learn from diverse supervision emerges with scale, we ask whether a similar phenomenon holds for VLAs that incorporate human video data. We introduce a simple co-training recipe, and find that human-to-robot transfer emerges once the VLA is pre-trained on sufficient scenes, tasks, and embodiments. Our analysis suggests that this emergent capability arises because diverse pretraining produces embodiment-agnostic representations for human and robot data. We validate these findings through a series of experiments probing human to robot skill transfer and find that with sufficiently diverse robot pre-training our method can nearly double the performance on generalization settings seen only in human data.
zh

[AI-88] A Unified AI Embedded Simulation and Mechanical Design Approach to an Autonomous Delivery Robot

【速读】:该论文旨在解决在资源受限平台上实现高效、可靠且可部署的自主配送机器人系统的问题,核心挑战包括在嵌入式设备上优化计算密集型人工智能(Artificial Intelligence, AI)算法,以及确保ROS 2主机与嵌入式控制器之间的低延迟、高可靠性通信。解决方案的关键在于采用异构计算架构:使用RPi 5和ROS 2处理基于AI的感知与路径规划任务,而ESP32运行FreeRTOS实现确定性的PID电机控制;同时通过内存与任务管理优化、AWS IoT远程监控及固件级电机故障停机保护机制,显著提升了系统的实时性与鲁棒性。

链接: https://arxiv.org/abs/2512.22408
作者: Amro Gamar,Ahmed Abduljalil,Alargam Mohammed,Ali Elhenidy,Abeer Tawakol
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents the development of a fully autonomous delivery robot integrating mechanical engineering, embedded systems, and artificial intelligence. The platform employs a heterogeneous computing architecture, with RPi 5 and ROS 2 handling AI-based perception and path planning, while ESP32 running FreeRTOS ensures real-time motor control. The mechanical design was optimized for payload capacity and mobility through precise motor selection and material engineering. Key technical challenges addressed include optimizing computationally intensive AI algorithms on a resource-constrained platform and implementing a low-latency, reliable communication link between the ROS 2 host and embedded controller. Results demonstrate deterministic, PID-based motor control through rigorous memory and task management, and enhanced system reliability via AWS IoT monitoring and a firmware-level motor shutdown failsafe. This work highlights a unified, multi-disciplinary methodology, resulting in a robust and operational autonomous delivery system capable of real-world deployment.
zh

[AI-89] Efficient Multi-Model Orchestration for Self-Hosted Large Language Models

【速读】:该论文旨在解决自托管大语言模型(Large Language Models, LLMs)在实际部署中面临的GPU资源利用率低、工作负载调度不灵活及系统可靠性差等问题。其解决方案的关键在于提出一个名为Pick and Spin的框架,该框架基于Kubernetes构建,集成统一的Helm部署系统、自适应的Scale-to-Zero自动化机制,以及一种混合路由模块——该模块结合关键词启发式规则与轻量级DistilBERT分类器,在成本、延迟和准确性之间实现动态平衡,从而显著提升自托管LLM的可扩展性与经济性。

链接: https://arxiv.org/abs/2512.22402
作者: Bhanu Prakash Vangala,Tanu Malik
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Self-hosting large language models (LLMs) is increasingly appealing for organizations seeking privacy, cost control, and customization. Yet deploying and maintaining in-house models poses challenges in GPU utilization, workload routing, and reliability. We introduce Pick and Spin, a practical framework that makes self-hosted LLM orchestration scalable and economical. Built on Kubernetes, it integrates a unified Helm-based deployment system, adaptive scale-to-zero automation, and a hybrid routing module that balances cost, latency, and accuracy using both keyword heuristics and a lightweight DistilBERT classifier. We evaluate four models, Llama-3 (90B), Gemma-3 (27B), Qwen-3 (235B), and DeepSeek-R1 (685B) across eight public benchmark datasets, with five inference strategies, and two routing variants encompassing 31,019 prompts and 163,720 inference runs. Pick and Spin achieves up to 21.6% higher success rates, 30% lower latency, and 33% lower GPU cost per query compared with static deployments of the same models.
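下面是对“关键词启发式 + 轻量分类器”混合路由思想的极简 Python 草图。其中关键词表、长度阈值与模型名均为演示用假设,clf 仅假定为一个返回类别标签的可调用对象,并非论文原始代码。

```python
KEYWORDS_HEAVY = ("prove", "derive", "multi-step", "chain of thought")

def route(prompt, clf=None):
    """混合路由:先走廉价的关键词启发式,再退回到轻量分类器。"""
    text = prompt.lower()
    if len(text) > 2000 or any(k in text for k in KEYWORDS_HEAVY):
        return "deepseek-r1"          # 明显复杂的请求直接交给大模型
    if clf is not None:               # clf: 假设的 DistilBERT 分类器封装
        return "qwen3-235b" if clf(text) == "large" else "gemma3-27b"
    return "gemma3-27b"               # 默认走低成本模型
```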
zh

[AI-90] Lightweight Inference-Time Personalization for Frozen Knowledge Graph Embeddings

【速读】:该论文旨在解决基础知识图谱(Knowledge Graph, KG)模型在链接预测任务中表现出强群体性能,但无法捕捉个体用户偏好这一关键问题,即通用关系推理与个性化排序之间的断层。解决方案的关键在于提出一种轻量级的推理时个性化框架 GatedBias,其通过结构门控适应机制实现对冻结的知识图谱嵌入的个体化调整:利用用户特定特征与图结构生成的二值门控信号相结合,产生可解释的、每实体独立的偏置项,仅需约300个可训练参数即可完成个性化适配,且不损害整体模型性能。

链接: https://arxiv.org/abs/2512.22398
作者: Ozan Oguztuzun,Cerag Oguztuzun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Foundation models for knowledge graphs (KGs) achieve strong cohort-level performance in link prediction, yet fail to capture individual user preferences; a key disconnect between general relational reasoning and personalized ranking. We propose GatedBias, a lightweight inference-time personalization framework that adapts frozen KG embeddings to individual user contexts without retraining or compromising global accuracy. Our approach introduces structure-gated adaptation: profile-specific features combine with graph-derived binary gates to produce interpretable, per-entity biases, requiring only ~300 trainable parameters. We evaluate GatedBias on two benchmark datasets (Amazon-Book and Last-FM), demonstrating statistically significant improvements in alignment metrics while preserving cohort performance. Counterfactual perturbation experiments validate causal responsiveness; entities benefiting from specific preference signals show 6–30× greater rank improvements when those signals are boosted. These results show that personalized adaptation of foundation models can be both parameter-efficient and causally verifiable, bridging general knowledge representations with individual user needs.
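其核心打分修正可以写成“冻结打分 + 门控×偏置”的形式。下面是一个示意性 PyTorch 片段(张量形状与特征构造均为假设),用于说明为何可训练参数可以控制在数百个量级:

```python
import torch

def gated_bias_scores(base_scores, entity_feats, gates, w):
    """结构门控偏置的极简示意(非官方实现)。
    base_scores : [N]   冻结 KG 模型给出的原始打分
    entity_feats: [N,d] 与用户画像相关的实体特征
    gates       : [N]   由图结构导出的 0/1 门控
    w           : [d]   仅有的少量可训练参数(论文中约数百个量级)
    """
    bias = entity_feats @ w            # 每个实体一个可解释的标量偏置
    return base_scores + gates * bias  # 门控决定偏置是否生效
```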
zh

[AI-91] HalluMat: Detecting Hallucinations in LLM-Generated Materials Science Content Through Multi-Stage Verification

【速读】:该论文旨在解决生成式 AI(尤其是大语言模型,LLMs)在材料科学领域内容生成中普遍存在幻觉(hallucination)的问题,即模型输出存在事实性错误或误导信息,从而损害科研严谨性。其核心解决方案是提出 HalluMatDetector,一个多层次的幻觉检测框架,关键在于融合内在验证、多源检索、矛盾图分析与基于指标的评估方法,实现对 LLM 输出的系统性校验与修正;实验表明,该框架可使幻觉率降低 30%,并引入 Paraphrased Hallucination Consistency Score(PHCS)量化语义等价查询下的响应不一致性,提升模型可靠性评估精度。

链接: https://arxiv.org/abs/2512.22396
作者: Bhanu Prakash Vangala,Sajid Mahmud,Pawan Neupane,Joel Selvaraj,Jianlin Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Artificial Intelligence (AI), particularly Large Language Models (LLMs), is transforming scientific discovery, enabling rapid knowledge generation and hypothesis formulation. However, a critical challenge is hallucination, where LLMs generate factually incorrect or misleading information, compromising research integrity. To address this, we introduce HalluMatData, a benchmark dataset for evaluating hallucination detection methods, factual consistency, and response robustness in AI-generated materials science content. Alongside this, we propose HalluMatDetector, a multi-stage hallucination detection framework that integrates intrinsic verification, multi-source retrieval, contradiction graph analysis, and metric-based assessment to detect and mitigate LLM hallucinations. Our findings reveal that hallucination levels vary significantly across materials science subdomains, with high-entropy queries exhibiting greater factual inconsistencies. By utilizing the HalluMatDetector verification pipeline, we reduce hallucination rates by 30% compared to standard LLM outputs. Furthermore, we introduce the Paraphrased Hallucination Consistency Score (PHCS) to quantify inconsistencies in LLM responses across semantically equivalent queries, offering deeper insights into model reliability.
zh

[AI-92] BLISS: Bandit Layer Importance Sampling Strategy for Efficient Training of Graph Neural Networks NEURIPS2025

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在处理大规模图数据时面临的计算成本过高问题,特别是由于需要遍历每个节点的所有邻居而导致的内存和计算瓶颈。解决方案的关键在于提出BLISS(Bandit Layer Importance Sampling Strategy),其核心是利用多臂赌博机(multi-armed bandits)机制,在每一层动态选择最具信息量的节点进行采样,从而在探索与利用之间取得平衡,确保对图结构的充分覆盖。与传统静态采样方法不同,BLISS能够适应节点重要性的动态变化,提升节点选择的智能性,并可兼容GCN和GAT等不同架构,根据其聚合机制自适应调整采样策略,实验表明其在保持或超越全批训练精度的同时显著降低计算开销。

链接: https://arxiv.org/abs/2512.22388
作者: Omar Alsaqa,Linh Thi Hoang,Muhammed Fatih Balin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注: Accepted for 5th Muslims in ML Workshop co-located with NeurIPS 2025. OpenReview: this https URL Code: this https URL

点击查看摘要

Abstract:Graph Neural Networks (GNNs) are powerful tools for learning from graph-structured data, but their application to large graphs is hindered by computational costs. The need to process every neighbor for each node creates memory and computational bottlenecks. To address this, we introduce BLISS, a Bandit Layer Importance Sampling Strategy. It uses multi-armed bandits to dynamically select the most informative nodes at each layer, balancing exploration and exploitation to ensure comprehensive graph coverage. Unlike existing static sampling methods, BLISS adapts to evolving node importance, leading to more informed node selection and improved performance. It demonstrates versatility by integrating with both Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs), adapting its selection policy to their specific aggregation mechanisms. Experiments show that BLISS maintains or exceeds the accuracy of full-batch training.
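以下用经典 UCB1 规则给出“把每层候选邻居当作赌博机臂”思路的示意实现。奖励信号(例如梯度范数)与具体打分方式为假设,仅用于说明探索-利用权衡,并非 BLISS 官方代码(官方代码见上文备注中的链接)。

```python
import math

class UCBNodeSampler:
    """每层按 UCB1 分数选出 top-k 最具信息量的节点(示意实现)。"""

    def __init__(self, c=1.0):
        self.c = c
        self.counts = {}   # node -> 被采样次数
        self.values = {}   # node -> 平均奖励(假设用梯度范数等信息量度量)
        self.t = 0

    def ucb(self, node):
        n = self.counts.get(node, 0)
        if n == 0:
            return float("inf")                     # 未探索的节点优先被选中
        return self.values[node] + self.c * math.sqrt(math.log(self.t + 1) / n)

    def sample(self, candidates, k):
        self.t += 1
        return sorted(candidates, key=self.ucb, reverse=True)[:k]

    def update(self, node, reward):
        n = self.counts.get(node, 0)
        self.counts[node] = n + 1
        self.values[node] = (self.values.get(node, 0.0) * n + reward) / (n + 1)
```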
zh

[AI-93] AI-Generated Code Is Not Reproducible (Yet): An Empirical Study of Dependency Gaps in LLM-Based Coding Agents

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)作为代码生成代理时,所生成代码在干净环境中执行的可复现性问题。其核心挑战在于:LLM生成的代码是否能在仅依赖操作系统包和模型明确声明的依赖项下成功运行。解决方案的关键在于提出一个三层依赖框架(three-layer dependency framework),用于区分“声称依赖”(claimed dependencies)、“工作依赖”(working dependencies)和“运行时依赖”(runtime dependencies),从而量化代码执行的可复现性。通过该框架对三种主流LLM编码代理(Claude Code、OpenAI Codex 和 Gemini)在300个跨Python、JavaScript和Java语言的项目上进行实证评估,研究发现仅有68.3%的项目能原生运行,且平均从声明依赖扩展至实际运行时依赖达13.5倍,揭示了显著的隐式依赖问题。

链接: https://arxiv.org/abs/2512.22387
作者: Bhanu Prakash Vangala,Ali Adibifar,Tanu Malik,Ashish Gehani
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:The rise of Large Language Models (LLMs) as coding agents promises to accelerate software development, but their impact on generated code reproducibility remains largely unexplored. This paper presents an empirical study investigating whether LLM-generated code can be executed successfully in a clean environment with only OS packages and using only the dependencies that the model specifies. We evaluate three state-of-the-art LLM coding agents (Claude Code, OpenAI Codex, and Gemini) across 300 projects generated from 100 standardized prompts in Python, JavaScript, and Java. We introduce a three-layer dependency framework (distinguishing between claimed, working, and runtime dependencies) to quantify execution reproducibility. Our results show that only 68.3% of projects execute out-of-the-box, with substantial variation across languages (Python 89.2%, Java 44.0%). We also find a 13.5 times average expansion from declared to actual runtime dependencies, revealing significant hidden dependencies.
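论文的三层依赖框架可以用一个简化实验来体会:比较 requirements.txt 中“声称”的依赖与干净解释器里实际加载的“运行时”依赖。下面是一个针对 Python 项目的粗略示意脚本,函数划分与探测方式均为假设,并非论文的评测流水线:

```python
import json
import subprocess
import sys

def claimed_dependencies(requirements_file):
    """第一层:模型"声称"的依赖(requirements.txt 中声明的包名)。"""
    with open(requirements_file) as f:
        return {ln.split("==")[0].strip() for ln in f
                if ln.strip() and not ln.startswith("#")}

def runtime_dependencies(entry_script):
    """第三层:实际被加载的运行时依赖。
    在子进程中执行入口脚本,再导出 sys.modules 里的顶层包名(示意做法)。"""
    probe = (
        "import runpy, sys, json;"
        f"runpy.run_path({entry_script!r}, run_name='__main__');"
        "print(json.dumps(sorted({m.split('.')[0] for m in sys.modules})))"
    )
    out = subprocess.run([sys.executable, "-c", probe],
                         capture_output=True, text=True, check=True)
    return set(json.loads(out.stdout.strip().splitlines()[-1]))

# 二者之差即为"隐藏依赖";论文报告平均膨胀可达 13.5 倍
```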
zh

[AI-94] Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration

【速读】:该论文旨在解决大规模模型训练中超参数调优对训练稳定性与最终性能的显著影响问题,尤其是在模型规模扩展时如何高效迁移最优超参数。其核心挑战在于高维超参数空间中的优化难度以及跨模型尺度(如宽度、深度、批量大小和训练时长)的统一参数化难题。解决方案的关键在于提出一种新的完整参数化方法——Complete^(d) 参数化,该方法整合了宽度与深度的缩放(基于CompleteP的改进),并进一步涵盖批量大小和训练时长的缩放;同时,在此参数化基础上,研究了模块级(per-module)超参数的优化与迁移策略,通过实证分析明确了高维超参数空间中的优化难点,并给出了可操作的实践指南。实验表明,借助该参数化框架,即使在模块级超参数层面,超参数迁移依然有效,显著提升了大型语言模型的训练效率。

链接: https://arxiv.org/abs/2512.22382
作者: Bruno Mlodozeniec,Pierre Ablin,Louis Béthune,Dan Busbridge,Michal Klein,Jason Ramapuram,Marco Cuturi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Hyperparameter tuning can dramatically impact training stability and final performance of large-scale models. Recent works on neural network parameterisations, such as μP, have enabled transfer of optimal global hyperparameters across model sizes. These works propose an empirical practice of search for optimal global base hyperparameters at a small model size, and transfer to a large size. We extend these works in two key ways. To handle scaling along most important scaling axes, we propose the Complete^(d) Parameterisation that unifies scaling in width and depth – using an adaptation of CompleteP – as well as in batch-size and training duration. Secondly, with our parameterisation, we investigate per-module hyperparameter optimisation and transfer. We characterise the empirical challenges of navigating the high-dimensional hyperparameter landscape, and propose practical guidelines for tackling this optimisation problem. We demonstrate that, with the right parameterisation, hyperparameter transfer holds even in the per-module hyperparameter regime. Our study covers an extensive range of optimisation hyperparameters of modern models: learning rates, AdamW parameters, weight decay, initialisation scales, and residual block multipliers. Our experiments demonstrate significant training speed improvements in Large Language Models with the transferred per-module hyperparameters.
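作为背景说明,下面给出一个常见的 μP 风格“按模块宽度缩放学习率”的极简 PyTorch 草图。缩放规则取自社区常用约定,并非论文 Complete^(d) 参数化的完整公式(后者还覆盖深度、批量大小与训练时长):

```python
import torch

def width_scaled_optimizer(model, base_lr=1e-3, base_width=256):
    """为每个参数按 fan_in 缩放学习率,矩阵参数 lr ∝ 1/fan_in(示意性约定)。"""
    groups = []
    for p in model.parameters():
        if p.ndim >= 2:                      # 矩阵参数:随宽度反比缩放
            lr = base_lr * base_width / p.shape[1]
        else:                                # 偏置 / 归一化参数:保持基准学习率
            lr = base_lr
        groups.append({"params": [p], "lr": lr})
    return torch.optim.AdamW(groups)
```

这样在小宽度模型上搜到的 base_lr 可以直接带到大宽度模型上使用,这正是“超参数迁移”的基本含义。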
zh

[AI-95] Subgoaling Relaxation-based Heuristics for Numeric Planning with Infinite Actions

【速读】:该论文旨在解决带有控制参数(control parameters)的数值规划问题中,由于动作参数为自由数值变量而导致的动作无限性难题,这使得传统的数值启发式方法无法直接应用。其关键解决方案是提出了一种“乐观编译”(optimistic compilation)方法,将可控且简单的数值问题抽象为具有有界常量效应和松弛前提的简化数值任务,从而能够有效利用子目标启发式来估计目标距离,使经典数值启发式在动作无限的情形下依然可行,显著提升了求解效率与适用范围。

链接: https://arxiv.org/abs/2512.22367
作者: Ángel Aso-Mollar,Diego Aineto,Enrico Scala,Eva Onaindia
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Numeric planning with control parameters extends the standard numeric planning model by introducing action parameters as free numeric variables that must be instantiated during planning. This results in a potentially infinite number of applicable actions in a state. In this setting, off-the-shelf numeric heuristics that leverage the action structure are not feasible. In this paper, we identify a tractable subset of these problems–namely, controllable, simple numeric problems–and propose an optimistic compilation approach that transforms them into simple numeric tasks. To do so, we abstract control-dependent expressions into bounded constant effects and relaxed preconditions. The proposed compilation makes it possible to effectively use subgoaling heuristics to estimate goal distance in numeric planning problems involving control parameters. Our results demonstrate that this approach is an effective and computationally feasible way of applying traditional numeric heuristics to settings with an infinite number of possible actions, pushing the boundaries of the current state of the art.
zh

[AI-96] Cost-Aware Text-to-SQL: An Empirical Study of Cloud Compute Costs for LLM-Generated Queries

【速读】:该论文旨在解决当前Text-to-SQL系统在云数据仓库环境下缺乏对计算成本(compute cost)有效评估的问题,现有效率指标如Valid Efficiency Score(VES)仅关注执行时间,而忽略了实际的资源消耗。其解决方案的关键在于首次系统性地评估了由大型语言模型(Large Language Models, LLMs)生成的SQL查询在Google BigQuery上的云资源开销,通过测量字节处理量、槽位利用率和预估费用,揭示了模型推理能力与成本之间的非线性关系,并识别出常见低效模式(如缺失分区过滤器和不必要的全表扫描),从而为高成本敏感的企业部署提供可操作的优化指南。

链接: https://arxiv.org/abs/2512.22364
作者: Saurabh Deochake,Debajyoti Mukhopadhyay
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Text-to-SQL systems powered by Large Language Models (LLMs) achieve high accuracy on standard benchmarks, yet existing efficiency metrics such as the Valid Efficiency Score (VES) measure execution time rather than the consumption-based costs of cloud data warehouses. This paper presents the first systematic evaluation of cloud compute costs for LLM-generated SQL queries. We evaluate six state-of-the-art LLMs across 180 query executions on Google BigQuery using the StackOverflow dataset (230GB), measuring bytes processed, slot utilization, and estimated cost. Our analysis yields three key findings: (1) reasoning models process 44.5% fewer bytes than standard models while maintaining equivalent correctness (96.7%-100%); (2) execution time correlates weakly with query cost (r=0.16), indicating that speed optimization does not imply cost optimization; and (3) models exhibit up to 3.4x cost variance, with standard models producing outliers exceeding 36GB per query. We identify prevalent inefficiency patterns including missing partition filters and unnecessary full-table scans, and provide deployment guidelines for cost-sensitive enterprise environments.
zh

[AI-97] The Effectiveness of Approximate Regularized Replay for Efficient Supervised Fine-Tuning of Large Language Models

【速读】:该论文旨在解决参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法,特别是LoRA(Low-Rank Adaptation)在指令微调(Instruction Tuning)过程中可能导致模型能力灾难性退化的问题,即使在小数据集上进行少量训练步骤时亦然。解决方案的关键在于引入一种正则化的近似回放(Regularized Approximate Replay)策略:一方面通过惩罚与初始模型之间的KL散度来约束参数更新,防止遗忘预训练阶段获得的通用知识;另一方面,在训练中穿插来自与预训练语料相似但不同的开源语料的下token预测任务,以增强模型对新任务的适应性(plasticity),同时几乎不增加额外计算开销。此方法在Qwen指令微调模型上验证有效,实现了知识保留与任务适应性的良好平衡。

链接: https://arxiv.org/abs/2512.22337
作者: Matthew Riemer,Erik Miehling,Miao Liu,Djallel Bouneffouf,Murray Campbell
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Although parameter-efficient fine-tuning methods, such as LoRA, only modify a small subset of parameters, they can have a significant impact on the model. Our instruction-tuning experiments show that LoRA-based supervised fine-tuning can catastrophically degrade model capabilities, even when trained on very small datasets for relatively few steps. With that said, we demonstrate that while the most straightforward approach (that is likely the most used in practice) fails spectacularly, small tweaks to the training procedure with very little overhead can virtually eliminate the problem. Particularly, in this paper we consider a regularized approximate replay approach which penalizes KL divergence with respect to the initial model and interleaves in data for next token prediction from a different, yet similar, open access corpus to what was used in pre-training. When applied to Qwen instruction-tuned models, we find that this recipe preserves general knowledge in the model without hindering plasticity to new tasks by adding a modest amount of computational overhead.
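该训练目标可以概括为“SFT 交叉熵 + 对初始模型的 KL 惩罚 + 通用语料回放损失”。下面是一个示意性 PyTorch 片段,假设模型为 HuggingFace 风格的因果语言模型、batch 中已含 labels;KL 的方向与系数均为假设,并非论文的精确配置:

```python
import torch
import torch.nn.functional as F

def regularized_replay_loss(model, ref_model, sft_batch, replay_batch, beta=0.1):
    """正则化近似回放的极简示意;ref_model 为冻结的初始模型。"""
    out = model(**sft_batch)
    sft_loss = out.loss                                    # 指令数据上的标准 CE

    with torch.no_grad():
        ref_logits = ref_model(**sft_batch).logits
    # kl_div(input=log p_theta, target=log p_ref, log_target=True)
    # 计算的是 KL(p_ref || p_theta);方向选择是一种假设
    kl = F.kl_div(F.log_softmax(out.logits, dim=-1),
                  F.log_softmax(ref_logits, dim=-1),
                  log_target=True, reduction="batchmean")

    replay_loss = model(**replay_batch).loss               # 开源语料上的下一词预测

    return sft_loss + beta * kl + replay_loss
```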
zh

[AI-98] Expert System for Bitcoin Forecasting: Integrating Global Liquidity via TimeXer Transformers

【速读】:该论文旨在解决比特币价格预测中因极端波动性和非平稳性导致的传统单变量时间序列模型在长周期预测中表现不佳的问题。其解决方案的关键在于引入全球广义货币供应量(Global M2 Liquidity)作为外生领先变量,并采用12周滞后结构对模型进行宏观条件化,结合TimeXer架构构建了流动性条件化的预测模型(TimeXer-Exog),从而显著提升了长期预测的稳定性与准确性。

链接: https://arxiv.org/abs/2512.22326
作者: Sravan Karthick T
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Bitcoin price forecasting is characterized by extreme volatility and non-stationarity, often defying traditional univariate time-series models over long horizons. This paper addresses a critical gap by integrating Global M2 Liquidity, aggregated from 18 major economies, as a leading exogenous variable with a 12-week lag structure. Using the TimeXer architecture, we compare a liquidity-conditioned forecasting model (TimeXer-Exog) against state-of-the-art benchmarks including LSTM, N-BEATS, PatchTST, and a standard univariate TimeXer. Experiments conducted on daily Bitcoin price data from January 2020 to August 2025 demonstrate that explicit macroeconomic conditioning significantly stabilizes long-horizon forecasts. At a 70-day forecast horizon, the proposed TimeXer-Exog model achieves a mean squared error (MSE) of 1.08e8, outperforming the univariate TimeXer baseline by over 89 percent. These results highlight that conditioning deep learning models on global liquidity provides substantial improvements in long-horizon Bitcoin price forecasting.
zh

[AI-99] LLMBoost: Make Large Language Models Stronger with Boosting

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)集成学习方法中忽视模型内部表示与交互的问题,即现有方法通常将模型视为黑箱,仅通过输入或最终输出进行组合,未能充分利用各模型之间的中间状态信息。其解决方案的关键在于提出一种名为LLMBoost的新型集成微调框架,该框架通过三个核心创新实现:(1) 引入跨模型注意力机制,使后续模型能够访问并融合前序模型的隐藏状态,从而实现层次化错误修正与知识迁移;(2) 设计链式训练范式,以误差抑制为目标逐步微调连接的模型,确保每一步都最小计算开销地纠正前序模型的预测错误;(3) 提出近并行推理范式,按层流水线处理各模型的隐藏状态,逼近单模型解码的推理效率。理论分析进一步证明,在有限修正假设下,顺序集成可保证性能单调提升。

链接: https://arxiv.org/abs/2512.22309
作者: Zehao Chen,Tianxiang Ai,Yifei Li,Gongxun Li,Yuyang Wei,Wang Zhou,Guanghui Li,Bin Yu,Zhijun Chen,Hailong Sun,Fuzhen Zhuang,Jianxin Li,Deqing Wang,Yikun Ban
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Ensemble learning of LLMs has emerged as a promising alternative to enhance performance, but existing approaches typically treat models as black boxes, combining the inputs or final outputs while overlooking the rich internal representations and interactions across models. In this work, we introduce LLMBoost, a novel ensemble fine-tuning framework that breaks this barrier by explicitly leveraging intermediate states of LLMs. Inspired by the boosting paradigm, LLMBoost incorporates three key innovations. First, a cross-model attention mechanism enables successor models to access and fuse hidden states from predecessors, facilitating hierarchical error correction and knowledge transfer. Second, a chain training paradigm progressively fine-tunes connected models with an error-suppression objective, ensuring that each model rectifies the mispredictions of its predecessor with minimal additional computation. Third, a near-parallel inference paradigm pipelines hidden states across models layer by layer, achieving inference efficiency approaching single-model decoding. We further establish the theoretical foundations of LLMBoost, proving that sequential integration guarantees monotonic improvements under bounded correction assumptions. Extensive experiments on commonsense reasoning and arithmetic reasoning tasks demonstrate that LLMBoost consistently boosts accuracy while reducing inference latency.
zh

[AI-100] LLA: Enhancing Security and Privacy for Generative Models with Logic-Locked Accelerators AAAI’26

【速读】:该论文旨在解决生成式 AI (Generative AI) 模型在供应链中面临的知识产权(IP)安全威胁,包括模型窃取、模型篡改和信息泄露等问题。解决方案的关键在于软硬件协同设计:软件层面通过将密钥位嵌入神经元并触发异常值以降低模型性能,同时应用不变性变换隐藏密钥值;硬件层面则集成轻量级锁模块至AI加速器,保持与多种数据流模式和工具链的兼容性,使预存秘密密钥成为访问IP所有者模型服务的许可证。此方案在抵御多种基于查询的密钥优化攻击的同时,计算开销低于0.1%(针对7,168位密钥)。

链接: https://arxiv.org/abs/2512.22307
作者: You Li,Guannan Zhao,Yuhao Ju,Yunqi He,Jie Gu,Hai Zhou
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by AAAI’26 as a conference paper and selected for oral presentation

点击查看摘要

Abstract:We introduce LLA, an effective intellectual property (IP) protection scheme for generative AI models. LLA leverages the synergy between hardware and software to defend against various supply chain threats, including model theft, model corruption, and information leakage. On the software side, it embeds key bits into neurons that can trigger outliers to degrade performance and applies invariance transformations to obscure the key values. On the hardware side, it integrates a lightweight locking module into the AI accelerator while maintaining compatibility with various dataflow patterns and toolchains. An accelerator with a pre-stored secret key acts as a license to access the model services provided by the IP owner. The evaluation results show that LLA can withstand a broad range of oracle-guided key optimization attacks, while incurring a minimal computational overhead of less than 0.1% for 7,168 key bits.
zh

[AI-101] Beyond Single Bugs: Benchmarking Large Language Models for Multi-Vulnerability Detection

【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在软件安全领域中对多漏洞检测能力不足的问题,尤其是现有基准测试未能反映真实软件环境中多个相互作用漏洞共存的复杂性。其解决方案的关键在于构建了一个涵盖C、C++、Python和JavaScript四种主流编程语言的综合性多漏洞检测基准数据集,通过在长上下文代码样本(7.5k–10k tokens)中系统注入不同数量(1、3、5、9个)的可控漏洞,量化评估五种前沿LLM在高密度漏洞场景下的性能退化情况,揭示了模型在高漏洞密度下普遍存在的“计数偏差”(count bias)与“选择偏差”(selection bias),并发现Python和JavaScript代码中存在显著的“欠计数”(under-counting)现象,从而为未来面向复杂现实场景的漏洞检测研究提供了可量化的评估框架与改进方向。

链接: https://arxiv.org/abs/2512.22306
作者: Chinmay Pushkar,Sanchit Kabra,Dhruv Kumar,Jagat Sesh Challa
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated significant potential in automated software security, particularly in vulnerability detection. However, existing benchmarks primarily focus on isolated, single-vulnerability samples or function-level classification, failing to reflect the complexity of real-world software where multiple interacting vulnerabilities often coexist within large files. Recent studies indicate that LLMs suffer from “count bias” and “selection bias” in multi-label tasks, yet this has not been rigorously quantified in the domain of code security. In this work, we introduce a comprehensive benchmark for Multi-Vulnerability Detection across four major languages: C, C++, Python, and JavaScript. We construct a dataset of 40,000 files by systematically injecting controlled counts of vulnerabilities (1, 3, 5, and 9) into long-context code samples (7.5k-10k tokens) sourced from CodeParrot. We evaluate five state-of-the-art LLMs, including GPT-4o-mini, Llama-3.3-70B, and the Qwen-2.5 series. Our results reveal a sharp degradation in performance as vulnerability density increases. While Llama-3.3-70B achieves near-perfect F1 scores (approximately 0.97) on single-vulnerability C tasks, performance drops by up to 40% in high-density settings. Notably, Python and JavaScript show distinct failure modes compared to C/C++, with models exhibiting severe “under-counting” (Recall dropping to less than 0.30) in complex Python files.
zh

[AI-102] Multi-Head Spectral-Adaptive Graph Anomaly Detection

【速读】:该论文旨在解决图异常检测中因异常节点常与正常节点混杂而导致的同质性(homophily)与异质性(heterophily)共存问题,以及现有谱图神经网络采用全局固定滤波器所引发的过平滑(over-smoothing)和缺乏适应性的局限。其核心解决方案是提出多头谱适应图神经网络(MHSA-GNN),关键创新在于设计了一个轻量级超网络(hypernetwork),根据包含结构统计信息和Rayleigh商特征的“谱指纹”动态生成Chebyshev滤波参数,从而为每个节点及其局部子图定制化过滤策略;同时引入结合教师-学生对比学习(TSC)与Barlow Twins多样性损失(BTD)的双正则化机制,有效防止多头模式崩溃,保障表示准确性与各头间正交性。

链接: https://arxiv.org/abs/2512.22291
作者: Qingyue Cao,Bo Jin,Changwei Gong,Xin Tong,Wenzheng Li,Xiaodong Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph anomaly detection technology has broad applications in financial fraud and risk control. However, existing graph anomaly detection methods often face significant challenges when dealing with complex and variable abnormal patterns, as anomalous nodes are often disguised and mixed with normal nodes, leading to the coexistence of homophily and heterophily in the graph domain. Recent spectral graph neural networks have made notable progress in addressing this issue; however, current techniques typically employ fixed, globally shared filters. This ‘one-size-fits-all’ approach can easily cause over-smoothing, erasing critical high-frequency signals needed for fraud detection, and lacks adaptive capabilities for different graph instances. To solve this problem, we propose a Multi-Head Spectral-Adaptive Graph Neural Network (MHSA-GNN). The core innovation is the design of a lightweight hypernetwork that, conditioned on a ‘spectral fingerprint’ containing structural statistics and Rayleigh quotient features, dynamically generates Chebyshev filter parameters tailored to each instance. This enables a customized filtering strategy for each node and its local subgraph. Additionally, to prevent mode collapse in the multi-head mechanism, we introduce a novel dual regularization strategy that combines teacher-student contrastive learning (TSC) to ensure representation accuracy and Barlow Twins diversity loss (BTD) to enforce orthogonality among heads. Extensive experiments on four real-world datasets demonstrate that our method effectively preserves high-frequency abnormal signals and significantly outperforms existing state-of-the-art methods, especially showing excellent robustness on highly heterogeneous datasets.
zh

[AI-103] When Algorithms Manage Humans: A Double Machine Learning Approach to Estimating Nonlinear Effects of Algorithmic Control on Gig Worker Performance and Wellbeing

【速读】:该论文试图解决的问题是:当算法承担管理角色时,以人为本的管理(person-centered management)是否仍能有效运行,尤其是在平台型工作场景中,算法对劳动者的影响机制是否具有非线性特征。解决方案的关键在于采用双机器学习(Double Machine Learning, DML)框架来估计一个调节中介模型(moderated mediation model),从而在不强制假设线性关系的前提下,识别出支持性人力资源实践对劳动者福祉与绩效之间关系的非单调(nonmonotonic)影响模式。研究发现,算法监督的存在会削弱支持性HR实践对绩效的正向作用,但这种削弱仅发生在监督模糊不清的“中间区域”;一旦监督变得透明且可解释,其积极作用又得以恢复。这一方法论创新表明,DML能够准确捕捉组织研究中条件间接效应的复杂结构,避免传统线性模型因误设而得出错误结论。

链接: https://arxiv.org/abs/2512.22290
作者: Arunkumar V,Nivethitha S,Sharan Srinivas,Gangadharan G.R
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A central question for the future of work is whether person centered management can survive when algorithms take on managerial roles. Standard tools often miss what is happening because worker responses to algorithmic systems are rarely linear. We use a Double Machine Learning framework to estimate a moderated mediation model without imposing restrictive functional forms. Using survey data from 464 gig workers, we find a clear nonmonotonic pattern. Supportive HR practices improve worker wellbeing, but their link to performance weakens in a murky middle where algorithmic oversight is present yet hard to interpret. The relationship strengthens again when oversight is transparent and explainable. These results show why simple linear specifications can miss the pattern and sometimes suggest the opposite conclusion. For platform design, the message is practical: control that is only partly defined creates confusion, but clear rules and credible recourse can make strong oversight workable. Methodologically, the paper shows how Double Machine Learning can be used to estimate conditional indirect effects in organizational research without forcing the data into a linear shape.
zh

[AI-104] Cluster Aggregated GAN (CAG): A Cluster-Based Hybrid Model for Appliance Pattern Generation

【速读】:该论文旨在解决合成负载数据在非侵入式负荷监测(Non-Intrusive Load Monitoring, NILM)算法开发和隐私保护能源研究中因标注数据稀缺而面临的挑战,特别是现有基于生成对抗网络(GAN)的方法普遍将所有电器设备统一建模,忽视了间歇性与连续性电器在行为模式上的差异,导致训练不稳定且输出保真度有限。其解决方案的关键在于提出一种分层聚合的生成框架——Cluster Aggregated GAN(CAGAN),通过行为特征聚类将电器分配至专用分支:对于间歇性电器,采用聚类模块识别相似激活模式并为每个簇配置独立生成器,以充分建模常见及罕见运行状态;对于连续性电器,则采用基于长短期记忆(LSTM)的生成器结合序列压缩策略,实现对时序演化过程的稳定建模。该设计显著提升了合成负载的真实性、多样性与训练稳定性,并增强了模型的可解释性和扩展性。

链接: https://arxiv.org/abs/2512.22287
作者: Zikun Guo, Adeyinka P. Adedigba, Rammohan Mallipeddi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 18 pages, 5 figures

点击查看摘要

Abstract:Synthetic appliance data are essential for developing non-intrusive load monitoring algorithms and enabling privacy preserving energy research, yet the scarcity of labeled datasets remains a significant barrier. Recent GAN-based methods have demonstrated the feasibility of synthesizing load patterns, but most existing approaches treat all devices uniformly within a single model, neglecting the behavioral differences between intermittent and continuous appliances and resulting in unstable training and limited output fidelity. To address these limitations, we propose the Cluster Aggregated GAN framework, a hybrid generative approach that routes each appliance to a specialized branch based on its behavioral characteristics. For intermittent appliances, a clustering module groups similar activation patterns and allocates dedicated generators for each cluster, ensuring that both common and rare operational modes receive adequate modeling capacity. Continuous appliances follow a separate branch that employs an LSTM-based generator to capture gradual temporal evolution while maintaining training stability through sequence compression. Extensive experiments on the UVIC smart plug dataset demonstrate that the proposed framework consistently outperforms baseline methods across metrics measuring realism, diversity, and training stability, and that integrating clustering as an active generative component substantially improves both interpretability and scalability. These findings establish the proposed framework as an effective approach for synthetic load generation in non-intrusive load monitoring research.
zh

[AI-105] DBAW-PIKAN: Dynamic Balance Adaptive Weight Kolmogorov-Arnold Neural Network for Solving Partial Differential Equations

【速读】:该论文旨在解决物理信息神经网络(Physics-Informed Neural Networks, PINNs)在处理多尺度或高频特征问题时,因梯度流刚性和谱偏差(spectral bias)导致的收敛困难与预测能力受限的问题。解决方案的关键在于提出一种动态平衡自适应加权的物理信息科尔莫戈罗夫-阿诺德网络(Dynamic Balancing Adaptive Weighting Physics-Informed Kolmogorov-Arnold Network, DBAW-PIKAN),其核心创新包括基于可学习B样条的科尔莫戈罗夫-阿诺德网络架构与引入动态衰减上界自适应权重策略,从而有效缓解梯度相关失效模式并提升函数表示能力,在不增加计算复杂度的前提下显著加速收敛并提高精度。

链接: https://arxiv.org/abs/2512.22283
作者: Guokan Chen,Yao Xiao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Physics-informed neural networks (PINNs) have led to significant advancements in scientific computing by integrating fundamental physical principles with advanced data-driven techniques. However, when dealing with problems characterized by multi-scale or high-frequency features, PINNs encounter persistent and severe challenges related to stiffness in gradient flow and spectral bias, which significantly limit their predictive capabilities. To address these issues, this paper proposes a Dynamic Balancing Adaptive Weighting Physics-Informed Kolmogorov-Arnold Network (DBAW-PIKAN), designed to mitigate such gradient-related failure modes and overcome the bottlenecks in function representation. The core of DBAW-PIKAN combines the Kolmogorov-Arnold network architecture, based on learnable B-splines, with an adaptive weighting strategy that incorporates a dynamic decay upper bound. Compared to baseline models, the proposed method accelerates the convergence process and improves solution accuracy by at least an order of magnitude without introducing additional computational complexity. A series of numerical benchmarks, including the Klein-Gordon, Burgers, and Helmholtz equations, demonstrate the significant advantages of DBAW-PIKAN in enhancing both accuracy and generalization performance.
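摘要并未给出权重的具体公式,下面仅以“按损失项规模反比加权 + 随训练进程衰减的上界截断”给出一种示意性 PyTorch 实现,具体规则为假设,并非论文原式:

```python
import torch

def adaptive_weighted_loss(losses, step, total_steps, w_max0=100.0):
    """动态衰减上界的自适应加权示意:小损失项获得更大权重,
    并用随训练衰减的上界截断,防止某一项主导总损失。"""
    w_max = 1.0 + (w_max0 - 1.0) * (1.0 - step / total_steps)  # 上界线性衰减
    mags = torch.stack([l.detach() for l in losses])
    w = mags.mean() / (mags + 1e-8)                            # 反比于各项损失规模
    w = torch.clamp(w, max=w_max)
    return sum(wi * li for wi, li in zip(w, losses))
```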
zh

[AI-106] Valori: A Deterministic Memory Substrate for AI Systems

【速读】:该论文旨在解决现代AI系统中因使用浮点数算术存储和搜索向量嵌入(vector embeddings)所导致的非确定性问题,即相同模型、输入和代码在不同硬件架构(如x86与ARM)上可能产生不同的内存状态和检索结果,从而阻碍可重现性和安全部署,甚至引发监管领域中的审计风险。解决方案的关键在于提出Valori——一种确定性AI内存基础架构,通过将浮点数内存操作替换为定点数算术(Q16.16格式),并将内存建模为可回放的状态机,从而在平台间保证位级一致的内存状态、快照及搜索结果,实现从内存边界强制执行确定性,证明了确定性内存是可信AI系统的基本前提。

链接: https://arxiv.org/abs/2512.22280
作者: Varshith Gudur
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 7 pages, 1 figure. systems paper with empirical evaluation and determinism validation experiments. Code available at this https URL

点击查看摘要

Abstract:Modern AI systems rely on vector embeddings stored and searched using floating-point arithmetic. While effective for approximate similarity search, this design introduces fundamental non-determinism: identical models, inputs, and code can produce different memory states and retrieval results across hardware architectures (e.g., x86 vs. ARM). This prevents replayability and safe deployment, leading to silent data divergence that prevents post-hoc verification and compromises audit trails in regulated sectors. We present Valori, a deterministic AI memory substrate that replaces floating-point memory operations with fixed-point arithmetic (Q16.16) and models memory as a replayable state machine. Valori guarantees bit-identical memory states, snapshots, and search results across platforms. We demonstrate that non-determinism arises before indexing or retrieval and show how Valori enforces determinism at the memory boundary. Our results suggest that deterministic memory is a necessary primitive for trustworthy AI systems. The reference implementation is open-source and available at this https URL (archived at this https URL).
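Q16.16 定点运算本身很直观:用 32 位整数的高 16 位存整数部分、低 16 位存小数部分,乘法后右移 16 位校正小数位。下面的纯 Python 片段演示其位级确定性(示意实现,并非 Valori 源码):

```python
Q = 16  # Q16.16:16 位整数部分 + 16 位小数部分

def to_fixed(x: float) -> int:
    return int(round(x * (1 << Q)))

def fx_mul(a: int, b: int) -> int:
    return (a * b) >> Q                      # 定点乘法:乘后右移校正小数位

def fx_dot(u, v):
    """定点内积:不同于浮点,结果与求和顺序、硬件平台无关,位级可复现。"""
    acc = 0
    for a, b in zip(u, v):
        acc += fx_mul(a, b)
    return acc

u = [to_fixed(x) for x in (0.5, -1.25, 2.0)]
v = [to_fixed(x) for x in (1.0, 0.5, -0.75)]
print(fx_dot(u, v) / (1 << Q))               # 还原为浮点查看:-1.625
```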
zh

[AI-107] LLMTM: Benchmarking and Optimizing LLMs for Temporal Motif Analysis in Dynamic Graphs

【速读】:该论文旨在解决如何利用大型语言模型(Large Language Models, LLMs)有效分析动态图中的时间模式(temporal motifs)这一问题。时间模式作为动态图的基本单元和重要局部属性,能够直接反映异常行为与独特现象,是理解其演化动态与结构特征的关键。然而,现有研究尚未充分探索LLMs在时间模式相关任务上的潜力。为此,作者提出一个系统性基准LLMTM,涵盖九种时间模式类型的六项定制化任务,并通过大量实验评估不同提示策略与LLM架构(如openPangu-7B、DeepSeek-R1-Distill-Qwen系列、Qwen2.5-32B-Instruct等)的性能表现。关键解决方案在于开发一种工具增强型LLM代理(tool-augmented LLM agent),结合精心设计的提示工程实现高精度任务处理;进一步为平衡准确率与计算成本,提出一种结构感知调度器(structure-aware dispatcher),根据动态图的结构特性与LLM的认知负荷智能分配查询路径——在标准提示与强大代理之间进行决策,从而在保持高精度的同时显著降低资源消耗。

链接: https://arxiv.org/abs/2512.22266
作者: Bing Hao,Minglai Shao,Zengyi Wo,Yunlong Chu,Yuhang Liu,Ruijie Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The widespread application of Large Language Models (LLMs) has motivated a growing interest in their capacity for processing dynamic graphs. Temporal motifs, as an elementary unit and important local property of dynamic graphs which can directly reflect anomalies and unique phenomena, are essential for understanding their evolutionary dynamics and structural features. However, leveraging LLMs for temporal motif analysis on dynamic graphs remains relatively unexplored. In this paper, we systematically study LLM performance on temporal motif-related tasks. Specifically, we propose a comprehensive benchmark, LLMTM (Large Language Models in Temporal Motifs), which includes six tailored tasks across nine temporal motif types. We then conduct extensive experiments to analyze the impacts of different prompting techniques and LLMs (including nine models: openPangu-7B, the DeepSeek-R1-Distill-Qwen series, Qwen2.5-32B-Instruct, GPT-4o-mini, DeepSeek-R1, and o3) on model performance. Informed by our benchmark findings, we develop a tool-augmented LLM agent that leverages precisely engineered prompts to solve these tasks with high accuracy. Nevertheless, the high accuracy of the agent incurs a substantial cost. To address this trade-off, we propose a simple yet effective structure-aware dispatcher that considers both the dynamic graph’s structural properties and the LLM’s cognitive load to intelligently dispatch queries between the standard LLM prompting and the more powerful agent. Our experiments demonstrate that the structure-aware dispatcher effectively maintains high accuracy while reducing cost.
zh

[AI-108] ReVEAL: GNN-Guided Reverse Engineering for Formal Verification of Optimized Multipliers

【速读】:该论文旨在解决大规模优化乘法器架构在代数电路验证中的可扩展性与准确性问题,传统基于规则的方法难以有效识别复杂结构并适应不同优化场景。解决方案的关键在于提出一种基于图学习(graph-learning)的逆向工程方法 ReVEAL,通过提取结构化图特征并结合学习驱动的推理机制,在大规模优化乘法器中高效识别架构模式,从而提升验证流程的鲁棒性和精度,并兼容现有验证流程及下游代数证明策略。

链接: https://arxiv.org/abs/2512.22260
作者: Chen Chen,Daniela Kaufmann,Chenhui Deng,Zhan Song,Hongce Zhang,Cunxi Yu
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: Accepted by TACAS 2026

点击查看摘要

Abstract:We present ReVEAL, a graph-learning-based method for reverse engineering of multiplier architectures to improve algebraic circuit verification techniques. Our framework leverages structural graph features and learning-driven inference to identify architecture patterns at scale, enabling robust handling of large optimized multipliers. We demonstrate applicability across diverse multiplier benchmarks and show improvements in scalability and accuracy compared to traditional rule-based approaches. The method integrates smoothly with existing verification flows and supports downstream algebraic proof strategies.
zh

[AI-109] Logic Sketch Prompting (LSP): A Deterministic and Interpretable Prompting Method

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在需要严格规则遵循、确定性和可审计性的任务中表现不可靠的问题。其解决方案的关键在于提出一种轻量级提示框架——逻辑草图提示(Logic Sketch Prompting, LSP),该框架引入了类型化变量、确定性条件评估器以及基于规则的验证器,从而生成可追溯且可重复的输出,显著提升了模型在药理逻辑合规性任务中的准确性与一致性。

链接: https://arxiv.org/abs/2512.22258
作者: Satvik Tripathi
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Symbolic Computation (cs.SC)
备注:

点击查看摘要

Abstract:Large language models (LLMs) excel at natural language reasoning but remain unreliable on tasks requiring strict rule adherence, determinism, and auditability. Logic Sketch Prompting (LSP) is a lightweight prompting framework that introduces typed variables, deterministic condition evaluators, and a rule based validator that produces traceable and repeatable outputs. Using two pharmacologic logic compliance tasks, we benchmark LSP against zero shot prompting, chain of thought prompting, and concise prompting across three open weight models: Gemma 2, Mistral, and Llama 3. Across both tasks and all models, LSP consistently achieves the highest accuracy (0.83 to 0.89) and F1 score (0.83 to 0.89), substantially outperforming zero shot prompting (0.24 to 0.60), concise prompts (0.16 to 0.30), and chain of thought prompting (0.56 to 0.75). McNemar tests show statistically significant gains for LSP across nearly all comparisons (p < 0.01). These results demonstrate that LSP improves determinism, interpretability, and consistency without sacrificing performance, supporting its use in clinical, regulated, and safety critical decision support systems.
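下面用一个极简 Python 片段示意 LSP 的三要素:类型化变量、确定性条件评估器与可溯源的规则校验器。其中具体规则内容为演示用的假设示例,并非论文基准中的真实药理规则集:

```python
from dataclasses import dataclass

@dataclass
class Patient:            # 类型化变量:先把自由文本解析成结构化字段
    egfr: float           # 肾小球滤过率
    on_metformin: bool

RULES = [
    # (规则名, 确定性判定函数, 违规结论) —— 规则内容为演示用假设
    ("egfr_contraindication",
     lambda p: not (p.on_metformin and p.egfr < 30),
     "eGFR < 30 时不应使用二甲双胍"),
]

def validate(patient: Patient):
    """确定性规则校验器:同样输入永远得到同样输出,每条结论可溯源到规则名。"""
    violations = [(name, msg) for name, ok, msg in RULES if not ok(patient)]
    return {"compliant": not violations, "violations": violations}

print(validate(Patient(egfr=25.0, on_metformin=True)))
```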
zh

[AI-110] Agentic Software Issue Resolution with Large Language Models: A Survey

【速读】:该论文旨在解决软件问题自动修复(software issue resolution)中的复杂性挑战,特别是传统单步式方法难以应对的长周期推理、迭代探索与反馈驱动决策等问题。其解决方案的关键在于引入基于大语言模型(Large Language Models, LLMs)的代理系统(agentic systems),通过模拟人类专家的规划、执行与反馈循环能力,实现更高效且高质量的软件维护。文中指出,LLM-based agentic systems不仅显著提升了自动化修复效率,还为验证智能体在推理、规划和执行方面的综合能力提供了真实场景,标志着从静态生成向动态智能决策范式的转变。

链接: https://arxiv.org/abs/2512.22256
作者: Zhonghao Jiang,David Lo,Zhongxin Liu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Software issue resolution aims to address real-world issues in software repositories (e.g., bug fixing and efficiency optimization) based on natural language descriptions provided by users, representing a key aspect of software maintenance. With the rapid development of large language models (LLMs) in reasoning and generative capabilities, LLM-based approaches have made significant progress in automated software issue resolution. However, real-world software issue resolution is inherently complex and requires long-horizon reasoning, iterative exploration, and feedback-driven decision making, which demand agentic capabilities beyond conventional single-step approaches. Recently, LLM-based agentic systems have become mainstream for software issue resolution. Advancements in agentic software issue resolution not only greatly enhance software maintenance efficiency and quality but also provide a realistic environment for validating agentic systems' reasoning, planning, and execution capabilities, bridging artificial intelligence and software engineering. This work presents a systematic survey of 126 recent studies at the forefront of LLM-based agentic software issue resolution research. It outlines the general workflow of the task and establishes a taxonomy across three dimensions: benchmarks, techniques, and empirical studies. Furthermore, it highlights how the emergence of agentic reinforcement learning has brought a paradigm shift in the design and training of agentic systems for software engineering. Finally, it summarizes key challenges and outlines promising directions for future research.
zh

[AI-111] Shape of Thought: When Distribution Matters More than Correctness in Reasoning Tasks

【速读】:该论文旨在解决如何有效提升语言模型(Language Model, LM)在复杂推理任务中的表现问题,特别是针对传统依赖人类标注的链式思维(Chain-of-Thought, CoT)数据集在训练效果上的局限性。其关键解决方案在于:利用更强大模型生成的、即使最终答案错误的合成CoT轨迹来训练目标模型,发现此类数据反而能显著提升性能。这一现象的核心机制在于两点:一是合成数据分布更贴近目标模型自身的分布,从而增强学习效率;二是即便整体推理路径存在错误,其中仍包含可被模型学习的有效推理步骤,使得模型具备对部分错误的容忍能力。实验证明,该方法在数学推理(MATH)、算法推理(GSM8K)、代码生成(MBPP)等多个任务上均优于人类标注数据,并揭示了正确最终答案并非可靠衡量推理质量的标准。

链接: https://arxiv.org/abs/2512.22255
作者: Abhranil Chandra,Ayush Agrawal,Arian Hosseini,Sebastian Fischmeister,Rishabh Agarwal,Navin Goyal,Aaron Courville
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present the surprising finding that a language model's reasoning capabilities can be improved by training on synthetic datasets of chain-of-thought (CoT) traces from more capable models, even when all of those traces lead to an incorrect final answer. Our experiments show this approach can yield better performance on reasoning tasks than training on human-annotated datasets. We hypothesize that two key factors explain this phenomenon: first, the distribution of synthetic data is inherently closer to the language model's own distribution, making it more amenable to learning. Second, these 'incorrect' traces are often only partially flawed and contain valid reasoning steps from which the model can learn. To further test the first hypothesis, we use a language model to paraphrase human-annotated traces – shifting their distribution closer to the model's own distribution – and show that this improves performance. For the second hypothesis, we introduce increasingly flawed CoT traces and study to what extent models are tolerant to these flaws. We demonstrate our findings across various reasoning domains like math, algorithmic reasoning and code generation using MATH, GSM8K, Countdown and MBPP datasets on various language models ranging from 1.5B to 9B across Qwen, Llama, and Gemma models. Our study shows that curating datasets that are closer to the model's distribution is a critical aspect to consider. We also show that a correct final answer is not always a reliable indicator of a faithful reasoning process.
zh

[AI-112] Graph Attention-based Adaptive Transfer Learning for Link Prediction

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在链接预测(Link Prediction, LP)任务中面临的两大挑战:一是处理大规模稀疏图时的性能瓶颈,二是跨数据集迁移学习中缺乏足够对齐性的问题。此外,尽管自监督方法在多种图任务中表现优异,但现有研究未充分挖掘迁移学习在不同图数据集间的泛化潜力。为此,作者提出了一种新型的图注意力自适应迁移网络(Graph Attention Adaptive Transfer Network, GAATNet),其核心创新在于结合预训练与微调策略,以捕获跨不同规模数据集的全局节点嵌入信息,从而实现高效的知识迁移并提升LP性能。关键解决方案包括:1)在自注意力模块中引入远距离邻居嵌入作为偏置项,以增强对全局特征的建模能力;2)在微调阶段设计轻量级自适配模块(self-adapter module),显著提升训练效率并增强模型泛化能力。

链接: https://arxiv.org/abs/2512.22252
作者: Huashen Lu,Wensheng Gan,Guoting Chen,Zhichao Huang,Philip S. Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ACM TIST, 9 tables, 9 figures

点击查看摘要

Abstract:Graph neural networks (GNNs) have brought revolutionary advancements to the field of link prediction (LP), providing powerful tools for mining potential relationships in graphs. However, existing methods face challenges when dealing with large-scale sparse graphs and the need for a high degree of alignment between different datasets in transfer learning. Besides, although self-supervised methods have achieved remarkable success in many graph tasks, prior research has overlooked the potential of transfer learning to generalize across different graph datasets. To address these limitations, we propose a novel Graph Attention Adaptive Transfer Network (GAATNet). It combines the advantages of pre-training and fine-tuning to capture global node embedding information across datasets of different scales, ensuring efficient knowledge transfer and improved LP performance. To enhance the model’s generalization ability and accelerate training, we design two key strategies: 1) Incorporate distant neighbor embeddings as biases in the self-attention module to capture global features. 2) Introduce a lightweight self-adapter module during fine-tuning to improve training efficiency. Comprehensive experiments on seven public datasets demonstrate that GAATNet achieves state-of-the-art performance in LP tasks. This study provides a general and scalable solution for LP tasks to effectively integrate GNNs with transfer learning. The source code and datasets are publicly available at this https URL
zh

[AI-113] Interpretable Perturbation Modeling Through Biomedical Knowledge Graphs

【速读】:该论文旨在解决小分子药物对基因表达扰动机制建模的问题,以揭示药物作用机制、预测脱靶效应并发现潜在的药物重定位机会。传统深度学习方法虽已将多模态嵌入整合至生物医学知识图谱(Biomedical Knowledge Graph, BKG)并通过图神经网络的消息传递机制提升表征能力,但这些模型主要应用于链接预测或二元药物-疾病关联任务,难以捕捉药物在转录组层面的精细扰动效应。其解决方案的关键在于构建一个融合PrimeKG++与LINCS L1000数据的异质生物医学图谱,并基于此训练图注意力网络(Graph Attention Network, GAT),通过下游预测头学习给定药物-细胞组合下978个标志性基因的表达变化(delta expression profile)。实验表明,该框架在差异表达基因(Differentially Expressed Genes, DEG)预测任务中优于MLP基线模型,且消融实验证明BKG中的边信息显著提升了扰动级别的预测性能,从而为从二元关联迈向机制驱动的药物转录组建模提供了新路径。

链接: https://arxiv.org/abs/2512.22251
作者: Pascal Passigan, Kevin Zhu, Angelina Ning
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding how small molecules perturb gene expression is essential for uncovering drug mechanisms, predicting off-target effects, and identifying repurposing opportunities. While prior deep learning frameworks have integrated multimodal embeddings into biomedical knowledge graphs (BKGs) and further improved these representations through graph neural network message-passing paradigms, these models have been applied to tasks such as link prediction and binary drug-disease association, rather than the task of gene perturbation, which may unveil more about mechanistic transcriptomic effects. To address this gap, we construct a merged biomedical graph that integrates (i) PrimeKG++, an augmentation of PrimeKG containing semantically rich embeddings for nodes with (ii) LINCS L1000 drug and cell line nodes, initialized with multimodal embeddings from foundation models such as MolFormerXL and BioBERT. Using this heterogeneous graph, we train a graph attention network (GAT) with a downstream prediction head that learns the delta expression profile of over 978 landmark genes for a given drug-cell pair. Our results show that our framework outperforms MLP baselines for differentially expressed genes (DEG) – which predict the delta expression given a concatenated embedding of drug features, target features, and baseline cell expression – under the scaffold and random splits. Ablation experiments with edge shuffling and node feature randomization further demonstrate that the edges provided by biomedical KGs enhance perturbation-level prediction. More broadly, our framework provides a path toward mechanistic drug modeling: moving beyond binary drug-disease association tasks to granular transcriptional effects of therapeutic intervention.

[AI-114] Calibrating LLM Judges: Linear Probes for Fast and Reliable Uncertainty Estimation

【Quick Read】: This paper addresses the lack of efficient, well-calibrated uncertainty estimates for large language models (LLMs) deployed as judges in industry settings: existing approaches such as verbalized confidence and multi-generation strategies are either poorly calibrated or computationally expensive. The key to the solution is a linear probe trained with a Brier-score-based loss that extracts calibrated uncertainty estimates from the hidden states of reasoning judges, requiring no additional model training. The method achieves superior calibration on both objective tasks (reasoning, mathematics, factuality, coding) and subjective human-preference judgments, while offering roughly 10x computational savings, robust generalization to unseen evaluation domains, and higher accuracy on high-confidence predictions, making it a practical, plug-and-play, interpretability-based uncertainty-estimation solution.

Link: https://arxiv.org/abs/2512.22245
Authors: Bhaktipriya Radharapu, Eshika Saxena, Kenneth Li, Chenxi Whitehouse, Adina Williams, Nicola Cancedda
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:As LLM-based judges become integral to industry applications, obtaining well-calibrated uncertainty estimates efficiently has become critical for production deployment. However, existing techniques, such as verbalized confidence and multi-generation methods, are often either poorly calibrated or computationally expensive. We introduce linear probes trained with a Brier score-based loss to provide calibrated uncertainty estimates from reasoning judges’ hidden states, requiring no additional model training. We evaluate our approach on both objective tasks (reasoning, mathematics, factuality, coding) and subjective human preference judgments. Our results demonstrate that probes achieve superior calibration compared to existing methods with \approx10 x computational savings, generalize robustly to unseen evaluation domains, and deliver higher accuracy on high-confidence predictions. However, probes produce conservative estimates that underperform on easier datasets but may benefit safety-critical deployments prioritizing low false-positive rates. Overall, our work demonstrates that interpretability-based uncertainty estimation provides a practical and scalable plug-and-play solution for LLM judges in production.
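A minimal sketch of the probe idea, assuming pre-extracted judge hidden states and binary correctness labels (both random stand-ins here); the probe is a single linear layer trained directly with the Brier score.

```python
import torch
import torch.nn as nn

h = torch.randn(1024, 4096)                  # stand-in judge hidden states
y = torch.randint(0, 2, (1024,)).float()     # stand-in correctness labels

probe = nn.Linear(4096, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for step in range(200):
    p = torch.sigmoid(probe(h)).squeeze(-1)  # calibrated confidence estimate
    loss = ((p - y) ** 2).mean()             # Brier-score loss
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final Brier score: {loss.item():.3f}")
```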

[AI-115] Enhanced geometry prediction in laser directed energy deposition using meta-learning

【Quick Read】: This paper targets the problem that bead-geometry prediction in laser directed energy deposition (L-DED) is hindered by scarce and heterogeneous experimental data collected under different materials, machine configurations, and process parameters. The key to the solution is a cross-dataset knowledge-transfer model based on meta-learning: two gradient-based meta-learning algorithms, Model-Agnostic Meta-Learning (MAML) and Reptile, enable rapid adaptation to new deposition conditions from as few as three to nine examples. Evaluated across powder-fed, wire-fed, and hybrid wire-powder L-DED processes, the meta-learned models generalize strongly, reaching R² values of up to about 0.9 with mean absolute errors of 0.03-0.08 mm, consistently outperforming conventional feedforward neural networks trained under comparable data constraints.

Link: https://arxiv.org/abs/2512.22241
Authors: Abdul Malik Al Mardhouf Al Saadi, Amrita Basak
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Accurate bead geometry prediction in laser-directed energy deposition (L-DED) is often hindered by the scarcity and heterogeneity of experimental datasets collected under different materials, machine configurations, and process parameters. To address this challenge, a cross-dataset knowledge transfer model based on meta-learning for predicting deposited track geometry in L-DED is proposed. Specifically, two gradient-based meta-learning algorithms, i.e., Model-Agnostic Meta-Learning (MAML) and Reptile, are investigated to enable rapid adaptation to new deposition conditions with limited data. The proposed framework is performed using multiple experimental datasets compiled from peer-reviewed literature and in-house experiments and evaluated across powder-fed, wire-fed, and hybrid wire-powder L-DED processes. Results show that both MAML and Reptile achieve accurate bead height predictions on unseen target tasks using as few as three to nine training examples, consistently outperforming conventional feedforward neural networks trained under comparable data constraints. Across multiple target tasks representing different printing conditions, the meta-learning models achieve strong generalization performance, with R-squared values reaching up to approximately 0.9 and mean absolute errors between 0.03-0.08 mm, demonstrating effective knowledge transfer across heterogeneous L-DED settings.
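For reference, a minimal Reptile sketch under stand-in data: each task is a small regression problem (process parameters to bead height), the inner loop adapts a copy of the model, and the outer update moves the initialization toward the adapted weights. All sizes and learning rates are illustrative.

```python
import copy
import torch
import torch.nn as nn

def reptile_step(model, sample_task, inner_lr=1e-3, outer_lr=0.1, inner_steps=5):
    init = copy.deepcopy(model.state_dict())
    inner = copy.deepcopy(model)
    opt = torch.optim.SGD(inner.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        x, y = sample_task()                 # small support batch for one task
        loss = nn.functional.mse_loss(inner(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    # Outer update: nudge the initialization toward the task-adapted weights.
    adapted = inner.state_dict()
    model.load_state_dict(
        {k: init[k] + outer_lr * (adapted[k] - init[k]) for k in init}
    )

model = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
def sample_task():  # stand-in: random process parameters -> bead height
    x = torch.randn(8, 4)
    return x, x.sum(dim=1, keepdim=True)

for _ in range(100):
    reptile_step(model, sample_task)
```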

[AI-116] EvoXplain: When Machine Learning Models Agree on Predictions but Disagree on Why – Measuring Mechanistic Multiplicity Across Training Runs

【Quick Read】: This paper tackles an overlooked question in model interpretation: when several models reach high predictive accuracy, do they rely on the same internal logic, or do they arrive at similar outcomes through different, potentially competing mechanisms? The key to the solution is the EvoXplain framework, which treats explanations as samples drawn from the stochastic optimization process rather than from a single trained model, and analyzes the stability of explanations across repeated training runs to detect whether multiple distinct explanatory modes (multimodality) exist. Rather than selecting a "correct" explanation, the method quantifies and makes visible explanatory instability, revealing the multiple underlying mechanisms that single-instance or averaged explanations can obscure, and reframing interpretability as a property of a model class under repeated instantiation rather than of any single trained model.

Link: https://arxiv.org/abs/2512.22240
Authors: Chama Bensmail
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Machine learning models are primarily judged by predictive performance, especially in applied settings. Once a model reaches high accuracy, its explanation is often assumed to be correct and trustworthy. However, this assumption raises an overlooked question: when two models achieve high accuracy, do they rely on the same internal logic, or do they reach the same outcome via different – and potentially competing – mechanisms? We introduce EvoXplain, a diagnostic framework that measures the stability of model explanations across repeated training. Rather than analysing a single trained model, EvoXplain treats explanations as samples drawn from the stochastic optimisation process itself – without aggregating predictions or constructing ensembles – and examines whether these samples form a single coherent explanation or separate into multiple, distinct explanatory modes. We evaluate EvoXplain on the Breast Cancer and COMPAS datasets using two widely deployed model classes: Logistic Regression and Random Forests. Although all models achieve high predictive accuracy, their explanations frequently exhibit clear multimodality. Even models commonly assumed to be stable, such as Logistic Regression, can produce multiple well-separated explanatory basins under repeated training on the same data split. These differences are not explained by hyperparameter variation or simple performance trade-offs. EvoXplain does not attempt to select a ‘correct’ explanation. Instead, it makes explanatory instability visible and quantifiable, revealing when single-instance or averaged explanations obscure the existence of multiple underlying mechanisms. More broadly, EvoXplain reframes interpretability as a property of a model class under repeated instantiation, rather than of any single trained model.
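A rough sketch of the diagnostic procedure on one of the datasets the paper uses (Breast Cancer); retraining on a fixed split under different seeds and clustering the coefficient vectors is a simplified stand-in for the paper's mode analysis, not its exact pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, _, y_tr, _ = train_test_split(X, y, random_state=0)

# One explanation vector (here: coefficients) per repeated training run.
coefs = np.stack([
    LogisticRegression(solver="saga", max_iter=5000, random_state=seed)
    .fit(X_tr, y_tr).coef_.ravel()
    for seed in range(30)
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coefs)
gap = np.linalg.norm(km.cluster_centers_[0] - km.cluster_centers_[1])
print(f"inter-mode separation: {gap:.4f}")  # large gaps hint at multimodality
```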

[AI-117] We are not able to identify AI-generated images

【Quick Read】: This paper asks whether humans can reliably distinguish AI-generated images from real photographs now that synthetic media has become so realistic. The key to the study is an interactive web experiment that collected classification judgments from 165 participants over 120 deliberately difficult cases: real images sampled from CC12M paired with carefully curated AI-generated counterparts produced with MidJourney. Average human accuracy was only 54%, barely above random guessing and with limited improvement across repeated attempts, indicating that human intuition alone is no longer sufficient to identify AI-generated content and underscoring the need for technical detection tools and ethical guidelines.

Link: https://arxiv.org/abs/2512.22236
Authors: Adrien Pavão
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:AI-generated images are now pervasive online, yet many people believe they can easily tell them apart from real photographs. We test this assumption through an interactive web experiment where participants classify 20 images as real or AI-generated. Our dataset contains 120 difficult cases: real images sampled from CC12M, and carefully curated AI-generated counterparts produced with MidJourney. In total, 165 users completed 233 sessions. Their average accuracy was 54%, only slightly above random guessing, with limited improvement across repeated attempts. Response times averaged 7.3 seconds, and some images were consistently more deceptive than others. These results indicate that, even on relatively simple portrait images, humans struggle to reliably detect AI-generated content. As synthetic media continues to improve, human judgment alone is becoming insufficient for distinguishing real from artificial data. These findings highlight the need for greater awareness and ethical guidelines as AI-generated media becomes increasingly indistinguishable from reality.

[AI-118] DiRL: An Efficient Post-Training Framework for Diffusion Language Models

【Quick Read】: This paper addresses the computational inefficiency and the training-inference objective mismatch that plague post-training for diffusion language models (dLLMs), which severely limit performance on complex reasoning tasks such as mathematics. The key to the solution is DiRL, an efficient post-training framework that tightly couples FlexAttention-accelerated blockwise training with LMDeploy-optimized inference, forming a streamlined online model-update loop that supports two-stage post-training (supervised fine-tuning followed by reinforcement learning). On top of this framework, the authors propose DiPO, the first unbiased Group Relative Policy Optimization (GRPO) implementation tailored to dLLMs, yielding state-of-the-art math performance among diffusion language models.

Link: https://arxiv.org/abs/2512.22234
Authors: Ying Zhu, Jiaxin Wan, Xiaoran Liu, Siyanag He, Qiqi Wang, Xu Guo, Tianyi Liang, Zengfeng Huang, Ziwei He, Xipeng Qiu
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Diffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models. While recent efforts have validated their pre-training potential and accelerated inference speeds, the post-training landscape for dLLMs remains underdeveloped. Existing methods suffer from computational inefficiency and objective mismatches between training and inference, severely limiting performance on complex reasoning tasks such as mathematics. To address this, we introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference. This architecture enables a streamlined online model update loop, facilitating efficient two-stage post-training (Supervised Fine-Tuning followed by Reinforcement Learning). Building on this framework, we propose DiPO, the first unbiased Group Relative Policy Optimization (GRPO) implementation tailored for dLLMs. We validate our approach by training DiRL-8B-Instruct on high-quality math data. Our model achieves state-of-the-art math performance among dLLMs and surpasses comparable models in the Qwen2.5 series on several benchmarks.
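As background for DiPO, here is the group-relative advantage computation at the core of GRPO in its generic form (not the paper's dLLM-specific variant):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) rewards for G sampled completions of one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])  # e.g. graded math answers
adv = grpo_advantages(rewards)
# Each completion's token log-probs are then weighted by its advantage in a
# clipped policy-gradient objective, as in PPO but without a value network.
print(adv)
```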

[AI-119] Scalable Cloud-Native Architectures for Intelligent PMU Data Processing

【Quick Read】: This paper addresses the latency, scalability, and reliability challenges posed by large-scale deployments of Phasor Measurement Units (PMUs), whose data volume and velocity overwhelm conventional centralized processing architectures, especially under the dynamic operating conditions of modern power grids. The key to the solution is a cloud-native architecture for intelligent PMU data processing that combines distributed stream processing, containerized microservices, and elastic resource orchestration to enable low-latency ingestion, real-time anomaly detection, and advanced analytics. Time-series machine learning models are integrated to enhance grid observability and predictive capability, and embedded security and privacy mechanisms support deployment in critical infrastructure, allowing the system to sustain sub-second response times at large PMU scale.

Link: https://arxiv.org/abs/2512.22231
Authors: Nachiappan Chockalingam, Akshay Deshpande, Lokesh Butra, Ram Sekhar Bodala, Nitin Saksena, Adithya Parthasarathy, Balakrishna Pothineni, Akash Kumar Agarwal
Institution: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Phasor Measurement Units (PMUs) generate high-frequency, time-synchronized data essential for real-time power grid monitoring, yet the growing scale of PMU deployments creates significant challenges in latency, scalability, and reliability. Conventional centralized processing architectures are increasingly unable to handle the volume and velocity of PMU data, particularly in modern grids with dynamic operating conditions. This paper presents a scalable cloud-native architecture for intelligent PMU data processing that integrates artificial intelligence with edge and cloud computing. The proposed framework employs distributed stream processing, containerized microservices, and elastic resource orchestration to enable low-latency ingestion, real-time anomaly detection, and advanced analytics. Machine learning models for time-series analysis are incorporated to enhance grid observability and predictive capabilities. Analytical models are developed to evaluate system latency, throughput, and reliability, showing that the architecture can achieve sub-second response times while scaling to large PMU deployments. Security and privacy mechanisms are embedded to support deployment in critical infrastructure environments. The proposed approach provides a robust and flexible foundation for next-generation smart grid analytics.

[AI-120] ReGAIN: Retrieval-Grounded AI Framework for Network Traffic Analysis

【Quick Read】: This paper targets the limited accuracy and interpretability of traditional network traffic analysis systems: rule-based and machine-learning approaches often produce high false-positive rates and lack transparency, making it hard for analysts to trust them. The key to the solution is ReGAIN, a multi-stage framework that combines traffic summarization, retrieval-augmented generation (RAG), and large language model (LLM) reasoning. Network traffic is converted into natural-language summaries stored in a multi-collection vector database, and a hierarchical retrieval pipeline, featuring metadata filtering, MMR sampling, two-stage cross-encoder reranking, and an abstention mechanism, grounds LLM responses with evidence citations, markedly reducing hallucination risk and keeping the reasoning verifiable. On real-world attack traffic, the framework achieves accuracy between 95.95% and 98.82% across attack types and benchmarks, validated against both dataset ground truth and human expert assessment.

Link: https://arxiv.org/abs/2512.22223
Authors: Shaghayegh Shajarian, Kennedy Marsh, James Benson, Sajad Khorsandroo, Mahmoud Abdelsalam
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Networking and Internet Architecture (cs.NI)
Comments: Accepted to ICNC 2026. This is the accepted author manuscript

Abstract:Modern networks generate vast, heterogeneous traffic that must be continuously analyzed for security and performance. Traditional network traffic analysis systems, whether rule-based or machine learning-driven, often suffer from high false positives and lack interpretability, limiting analyst trust. In this paper, we present ReGAIN, a multi-stage framework that combines traffic summarization, retrieval-augmented generation (RAG), and Large Language Model (LLM) reasoning for transparent and accurate network traffic analysis. ReGAIN creates natural-language summaries from network traffic, embeds them into a multi-collection vector database, and utilizes a hierarchical retrieval pipeline to ground LLM responses with evidence citations. The pipeline features metadata-based filtering, MMR sampling, a two-stage cross-encoder reranking mechanism, and an abstention mechanism to reduce hallucinations and ensure grounded reasoning. Evaluated on ICMP ping flood and TCP SYN flood traces from the real-world traffic dataset, it demonstrates robust performance, achieving accuracy between 95.95% and 98.82% across different attack types and evaluation benchmarks. These results are validated against two complementary sources: dataset ground truth and human expert assessments. ReGAIN also outperforms rule-based, classical ML, and deep learning baselines while providing unique explainability through trustworthy, verifiable responses.
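A simplified sketch of two of the retrieval stages such a pipeline chains together, MMR sampling followed by a score-thresholded abstention; the embeddings, scores, and threshold here are random stand-ins rather than ReGAIN's actual components.

```python
import numpy as np

def mmr(query, docs, k=5, lam=0.7):
    """Maximal Marginal Relevance: trade off relevance against redundancy."""
    chosen, rest = [], list(range(len(docs)))
    sim_q = docs @ query                          # relevance to the query
    while rest and len(chosen) < k:
        def score(i):
            red = max((docs[i] @ docs[j] for j in chosen), default=0.0)
            return lam * sim_q[i] - (1 - lam) * red
        best = max(rest, key=score)
        chosen.append(best); rest.remove(best)
    return chosen

rng = np.random.default_rng(0)
query = rng.normal(size=384); query /= np.linalg.norm(query)
docs = rng.normal(size=(100, 384)); docs /= np.linalg.norm(docs, axis=1, keepdims=True)

candidates = mmr(query, docs, k=5)
rerank_scores = docs[candidates] @ query          # stand-in cross-encoder scores
if rerank_scores.max() < 0.2:                     # assumed abstention threshold
    print("abstain: no sufficiently grounded evidence")
else:
    ranked = [candidates[i] for i in np.argsort(-rerank_scores)]
    print("answer with citations to docs:", ranked)
```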

[AI-121] Müntz-Szász Networks: Neural Architectures with Learnable Power-Law Bases

【Quick Read】: This paper addresses the poor performance of standard neural network architectures when approximating functions with singular or fractional-power behavior, a structure that arises throughout physics in boundary layers, fracture mechanics, and corner singularities. The key to the solution is Müntz-Szász Networks (MSN), a new architecture grounded in classical approximation theory that replaces fixed smooth activations with learnable fractional-power bases. Each MSN edge computes $\phi(x) = \sum_k a_k |x|^{\mu_k} + \sum_k b_k \,\mathrm{sign}(x)|x|^{\lambda_k}$, where the exponents $\mu_k, \lambda_k$ are learned alongside the coefficients. MSN inherits universal approximation from the Müntz-Szász theorem and, when approximating functions of the form $|x|^\alpha$, achieves error $\mathcal{O}(|\mu - \alpha|^2)$ with a single learned exponent, whereas a standard MLP requires $\mathcal{O}(\epsilon^{-1/\alpha})$ neurons for comparable accuracy.

Link: https://arxiv.org/abs/2512.22222
Authors: Gnankan Landry Regis N’guessan
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: this https URL

Abstract: Standard neural network architectures employ fixed activation functions (ReLU, tanh, sigmoid) that are poorly suited for approximating functions with singular or fractional power behavior, a structure that arises ubiquitously in physics, including boundary layers, fracture mechanics, and corner singularities. We introduce Müntz-Szász Networks (MSN), a novel architecture that replaces fixed smooth activations with learnable fractional power bases grounded in classical approximation theory. Each MSN edge computes $\phi(x) = \sum_k a_k |x|^{\mu_k} + \sum_k b_k \,\mathrm{sign}(x)|x|^{\lambda_k}$, where the exponents $\mu_k, \lambda_k$ are learned alongside the coefficients. We prove that MSN inherits universal approximation from the Müntz-Szász theorem and establish novel approximation rates: for functions of the form $|x|^\alpha$, MSN achieves error $\mathcal{O}(|\mu - \alpha|^2)$ with a single learned exponent, whereas standard MLPs require $\mathcal{O}(\epsilon^{-1/\alpha})$ neurons for comparable accuracy. On supervised regression with singular target functions, MSN achieves 5-8x lower error than MLPs with 10x fewer parameters. Physics-informed neural networks (PINNs) represent a particularly demanding application for singular function approximation; on PINN benchmarks including a singular ODE and stiff boundary-layer problems, MSN achieves 3-6x improvement while learning interpretable exponents that match the known solution structure. Our results demonstrate that theory-guided architectural design can yield dramatic improvements for scientifically-motivated function classes.
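A minimal PyTorch sketch of one such layer with learnable exponents; the softplus positivity constraint, the clamping away from zero, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MuntzSzaszLayer(nn.Module):
    """Sketch: learnable fractional-power bases sum_k a_k |x|^{mu_k}
    plus sum_k b_k sign(x)|x|^{lambda_k}, with trainable exponents."""
    def __init__(self, in_dim, out_dim, n_terms=4):
        super().__init__()
        self.a = nn.Parameter(torch.randn(out_dim, in_dim, n_terms) * 0.1)
        self.b = nn.Parameter(torch.randn(out_dim, in_dim, n_terms) * 0.1)
        # Exponents kept positive via softplus of unconstrained parameters.
        self.mu_raw = nn.Parameter(torch.zeros(out_dim, in_dim, n_terms))
        self.lam_raw = nn.Parameter(torch.zeros(out_dim, in_dim, n_terms))

    def forward(self, x):                               # x: (batch, in_dim)
        mu = nn.functional.softplus(self.mu_raw)
        lam = nn.functional.softplus(self.lam_raw)
        ax = x.abs().clamp_min(1e-8)[:, None, :, None]  # avoid 0**p gradients
        even = (self.a * ax ** mu).sum(-1)              # sum_k a_k |x|^{mu_k}
        odd = (self.b * torch.sign(x)[:, None, :, None] * ax ** lam).sum(-1)
        return (even + odd).sum(-1)                     # aggregate over inputs

layer = MuntzSzaszLayer(1, 8)
y = layer(torch.linspace(-1, 1, 16).unsqueeze(-1))
print(y.shape)  # torch.Size([16, 8])
```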

[AI-122] With Great Capabilities Come Great Responsibilities: Introducing the Agentic Risk & Capability Framework for Governing Agentic AI Systems AAAI2026

【Quick Read】: This paper addresses the challenge organizations face in governing agentic AI systems capable of autonomous action, in particular how to comprehensively identify, assess, and mitigate their diverse and evolving risks. The key to the solution is the Agentic Risk & Capability (ARC) Framework, which adopts a capability-centric analytical perspective, distills three intrinsic sources of risk (components, design, and capabilities), and establishes a clear nexus between each risk source, the specific risks that materialize from it, and the corresponding technical controls, giving organizations a structured, practical path to the safe, secure, and responsible deployment of agentic AI systems.

Link: https://arxiv.org/abs/2512.22211
Authors: Shaun Khoo, Jessica Foo, Roy Ka-Wei Lee
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted at IASEAI 2026 (Main Track) and AAAI 2026 3rd International AI Governance Workshop

Abstract: Agentic AI systems present both significant opportunities and novel risks due to their capacity for autonomous action, encompassing tasks such as code execution, internet interaction, and file modification. This poses considerable challenges for effective organizational governance, particularly in comprehensively identifying, assessing, and mitigating diverse and evolving risks. To tackle this, we introduce the Agentic Risk & Capability (ARC) Framework, a technical governance framework designed to help organizations identify, assess, and mitigate risks arising from agentic AI systems. The framework's core contributions are: (1) it develops a novel capability-centric perspective to analyze a wide range of agentic AI systems; (2) it distills three primary sources of risk intrinsic to agentic AI systems - components, design, and capabilities; (3) it establishes a clear nexus between each risk source, specific materialized risks, and corresponding technical controls; and (4) it provides a structured and practical approach to help organizations implement the framework. This framework provides a robust and adaptable methodology for organizations to navigate the complexities of agentic AI, enabling rapid and effective innovation while ensuring the safe, secure, and responsible deployment of agentic AI systems. Our framework is open-sourced at this https URL.

[AI-123] Toward Equitable Recovery: A Fairness-Aware AI Framework for Prioritizing Post-Flood Aid in Bangladesh

【Quick Read】: This paper addresses systematic biases in post-disaster aid allocation in developing nations, which persistently disadvantage vulnerable regions such as marginalized districts and rural areas and thereby perpetuate historical inequities. The key to the solution is a fairness-aware AI framework built around an adversarial debiasing model that predicts flood vulnerability while actively removing bias against marginalized regions. Adapting fairness-aware representation learning from healthcare AI, the model uses a gradient reversal layer to force bias-invariant representations, reducing statistical parity difference by 41.6% and regional fairness gaps by 43.2% while retaining strong predictive accuracy (R² = 0.784 vs. a 0.811 baseline), so that aid-priority rankings reflect genuine need rather than historical allocation patterns.

Link: https://arxiv.org/abs/2512.22210
Authors: Farjana Yesmin, Romana Akter
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 7 pages, 6 figures, 5 tables

Abstract:Post-disaster aid allocation in developing nations often suffers from systematic biases that disadvantage vulnerable regions, perpetuating historical inequities. This paper presents a fairness-aware artificial intelligence framework for prioritizing post-flood aid distribution in Bangladesh, a country highly susceptible to recurring flood disasters. Using real data from the 2022 Bangladesh floods that affected 7.2 million people and caused 405.5 million US dollars in damages, we develop an adversarial debiasing model that predicts flood vulnerability while actively removing biases against marginalized districts and rural areas. Our approach adapts fairness-aware representation learning techniques from healthcare AI to disaster management, employing a gradient reversal layer that forces the model to learn bias-invariant representations. Experimental results on 87 upazilas across 11 districts demonstrate that our framework reduces statistical parity difference by 41.6 percent, decreases regional fairness gaps by 43.2 percent, and maintains strong predictive accuracy (R-squared=0.784 vs baseline 0.811). The model generates actionable priority rankings ensuring aid reaches the most vulnerable populations based on genuine need rather than historical allocation patterns. This work demonstrates how algorithmic fairness techniques can be effectively applied to humanitarian contexts, providing decision-makers with tools to implement more equitable disaster recovery strategies.
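For concreteness, a minimal sketch of the gradient reversal mechanism at the heart of adversarial debiasing: forward is the identity, backward flips (and scales) the gradient so the encoder unlearns features the adversary uses to predict the protected attribute. The network sizes and the rural/urban adversary are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)                 # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None    # reversed, scaled gradient

encoder = nn.Sequential(nn.Linear(16, 64), nn.ReLU())
predictor = nn.Linear(64, 1)                # flood-vulnerability regression head
adversary = nn.Linear(64, 2)                # tries to predict the region type

x = torch.randn(32, 16)                     # stand-in upazila features
vuln = torch.randn(32, 1)                   # stand-in vulnerability targets
region = torch.randint(0, 2, (32,))         # stand-in protected attribute

z = encoder(x)
task_loss = nn.functional.mse_loss(predictor(z), vuln)
adv_loss = nn.functional.cross_entropy(adversary(GradReverse.apply(z, 1.0)), region)
(task_loss + adv_loss).backward()           # encoder receives reversed adversary gradients
```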

[AI-124] GamiBench: Evaluating Spatial Reasoning and 2D-to-3D Planning Capabilities of MLLMs with Origami Folding Tasks

【Quick Read】: This paper addresses the weakness of current multimodal large language models (MLLMs) in spatial reasoning, particularly their limited ability to mentally represent and manipulate objects across viewpoints and over time; existing benchmarks mostly target static images or final outputs and miss the sequential, viewpoint-dependent nature of the skill. The solution is GamiBench, a benchmark built on origami-inspired folding tasks that pairs 186 feasible and 186 impossible 2D crease patterns with their corresponding 3D folded shapes and evaluates spatial reasoning and 2D-to-3D planning through three visual question-answering (VQA) tasks: predicting 3D fold configurations, distinguishing valid viewpoints, and detecting impossible patterns. The key contributions are new diagnostic metrics, viewpoint consistency (VC) and impossible fold selection rate (IFSR), together with the first holistic evaluation of cross-view consistency, physical-feasibility judgment, and intermediate-step understanding over the entire folding reasoning process, establishing a standardized framework for assessing geometric understanding and spatial reasoning in MLLMs.

Link: https://arxiv.org/abs/2512.22207
Authors: Ryan Spencer, Roey Yaari, Ritvik Vemavarapu, Joyce Yang, Steven Ngo, Utkarsh Sharma
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal large language models (MLLMs) are proficient in perception and instruction-following, but they still struggle with spatial reasoning: the ability to mentally track and manipulate objects across multiple views and over time. Spatial reasoning is a key component of human intelligence, but most existing benchmarks focus on static images or final outputs, failing to account for the sequential and viewpoint-dependent nature of this skill. To close this gap, we introduce GamiBench, a benchmark designed to evaluate spatial reasoning and 2D-to-3D planning in MLLMs through origami-inspired folding tasks. GamiBench includes 186 regular and 186 impossible 2D crease patterns paired with their corresponding 3D folded shapes, produced from six distinct viewpoints across three visual question-answering (VQA) tasks: predicting 3D fold configurations, distinguishing valid viewpoints, and detecting impossible patterns. Unlike previous benchmarks that assess only final predictions, GamiBench holistically evaluates the entire reasoning process–measuring cross-view consistency, physical feasibility through impossible-fold detection, and interpretation of intermediate folding steps. It further introduces new diagnostic metrics–viewpoint consistency (VC) and impossible fold selection rate (IFSR)–to measure how well models handle folds of varying complexity. Our experiments show that even leading models such as GPT-5 and Gemini-2.5-Pro struggle on single-step spatial understanding. These contributions establish a standardized framework for evaluating geometric understanding and spatial reasoning in MLLMs. Dataset and code: this https URL.

[AI-125] CosineGate: Semantic Dynamic Routing via Cosine Incompatibility in Residual Networks

【Quick Read】: This paper targets the pervasive redundant computation in modern deep residual networks, which evaluate every residual block for every input even when the identity mapping would suffice. The key to the solution is the CosineGate architecture, which uses the cosine incompatibility between input and residual features, the Cosine Incompatibility Ratio (CIR), as a self-supervised skip signal for per-sample, per-block dynamic routing. Gumbel-Softmax relaxation makes the gates differentiable during training, and a progressive FLOPs regularization term controls average compute, so computation is reduced substantially while accuracy is matched or even improved, without auxiliary supervision, knowledge distillation, or task-specific heuristics.

Link: https://arxiv.org/abs/2512.22206
Authors: Yogeswar Reddy Thota
Institution: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Current experiments are limited to CIFAR-10 and MNIST; extending CosineGate to ImageNet-scale models remains future work

Abstract:Modern deep residual networks perform substantial redundant computation by evaluating all residual blocks for every input, even when identity mappings suffice. We introduce CosineGate, an end-to-end differentiable architecture for dynamic routing in residual networks that uses cosine incompatibility between identity and residual feature representations as a self-supervised skip signal. CosineGate measures semantic redundancy through the Cosine Incompatibility Ratio (CIR), defined as 1 - cos(x, F(x)), and uses Gumbel-Softmax relaxation to enable per-sample, per-block gating during training. A progressive FLOPs regularization term controls average compute usage without destabilizing optimization. On CIFAR-10, CosineGate spans the accuracy-efficiency Pareto frontier: an aggressive configuration achieves 89.9 percent accuracy with 24.1 percent FLOPs savings, a balanced configuration achieves 91.3 percent accuracy with 28.5 percent savings at epoch 160, and a conservative configuration reaches a peak of 93.2 percent accuracy with minimal compute reduction. These results match or exceed ResNet-20 (91.3 percent) while reducing computation, without auxiliary supervision, distillation, or task-specific heuristics. Our results demonstrate that simple geometric measures of feature incompatibility provide a principled and effective signal for dynamic residual routing.
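A minimal sketch of the gating idea: CIR = 1 - cos(x, F(x)) scores how much a residual branch adds beyond the identity, and a Gumbel-Softmax gate decides per sample whether to execute the block. The mapping from CIR to gate logits and the temperature are illustrative assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualBlock(nn.Module):
    def __init__(self, dim, tau=1.0):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.tau = tau

    def forward(self, x):
        fx = self.f(x)
        cir = 1 - F.cosine_similarity(x, fx, dim=-1)       # (batch,)
        logits = torch.stack([cir, 1 - cir], dim=-1)       # [use branch, skip]
        gate = F.gumbel_softmax(logits, tau=self.tau, hard=True)[..., 0:1]
        return x + gate * fx                               # identity when gate=0

block = GatedResidualBlock(128)
out = block(torch.randn(4, 128))
print(out.shape)  # torch.Size([4, 128])
```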

[AI-126] Emergent Persuasion: Will LLMs Persuade Without Being Prompted? AAAI2026

【Quick Read】: This paper investigates whether generative AI models will persuade without being explicitly prompted, a risk distinct from the usual misuse threat model in which a bad actor asks an LLM to persuade. The key to the answer lies in probing two mechanisms: steering model personas via internal activation steering, and instilling the same traits via supervised fine-tuning (SFT). Experiments show that activation steering alone does not reliably increase unprompted persuasion, but SFT does; notably, fine-tuning on general persuasion datasets containing only benign topics yields models with a higher propensity to persuade on controversial and harmful topics, indicating that harmful persuasion can arise as an emergent property and warrants further study and safeguards.

Link: https://arxiv.org/abs/2512.22201
Authors: Vincent Chang, Thee Ho, Sunishchal Dev, Kevin Zhu, Shi Feng, Kellin Pelrine, Matthew Kowal
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: This paper was accepted to AAAI 2026 AIGOV Workshop

Abstract: With the wide-scale adoption of conversational AI systems, AI is now able to exert unprecedented influence on human opinions and beliefs. Recent work has shown that many Large Language Models (LLMs) comply with requests to persuade users into harmful beliefs or actions when prompted, and that model persuasiveness increases with model scale. However, this prior work looked at persuasion from the threat model of misuse (i.e., a bad actor asking an LLM to persuade). In this paper, we instead aim to answer the following question: under what circumstances would models persuade without being explicitly prompted? The answer shapes how concerned we should be about such emergent persuasion risks. To achieve this, we study unprompted persuasion under two scenarios: (i) when the model is steered (through internal activation steering) along persona traits, and (ii) when the model is supervised-finetuned (SFT) to exhibit the same traits. We show that steering toward traits, both related and unrelated to persuasion, does not reliably increase models' tendency to persuade unprompted; SFT, however, does. Moreover, SFT on general persuasion datasets containing solely benign topics yields a model with a higher propensity to persuade on controversial and harmful topics, showing that emergent harmful persuasion can arise and should be studied further.

[AI-127] Bidirectional RAG: Safe Self-Improving Retrieval-Augmented Generation Through Multi-Stage Validation

【Quick Read】: This paper addresses the fact that knowledge bases in conventional retrieval-augmented generation (RAG) systems are static and cannot evolve from user interactions; the core challenge is to accumulate knowledge and expand the corpus safely without hallucination pollution. The key to the solution, Bidirectional RAG, is a multi-stage acceptance layer that governs the controlled write-back of high-quality generated responses: it combines NLI-based entailment verification, attribution checking, and novelty detection, preventing misinformation from contaminating the corpus while still allowing knowledge to accumulate. Across four datasets, Bidirectional RAG reaches 40.58% average coverage, nearly double Standard RAG's 20.33%, while adding 72% fewer documents than naive write-back (140 vs. 500), demonstrating that self-improving RAG is feasible and safe under rigorous validation.

Link: https://arxiv.org/abs/2512.22199
Authors: Teja Chinthala
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 10 pages, 2 figures, 2 tables. 36 experiments across 4 datasets with 3 random seeds. Code available upon request

Abstract: Retrieval-Augmented Generation (RAG) systems enhance large language models by grounding responses in external knowledge bases, but conventional RAG architectures operate with static corpora that cannot evolve from user interactions. We introduce Bidirectional RAG, a novel RAG architecture that enables safe corpus expansion through validated write-back of high-quality generated responses. Our system employs a multi-stage acceptance layer combining grounding verification (NLI-based entailment), attribution checking, and novelty detection to prevent hallucination pollution while enabling knowledge accumulation. Across four datasets (Natural Questions, TriviaQA, HotpotQA, Stack Overflow) with three random seeds (12 experiments per system), Bidirectional RAG achieves 40.58% average coverage, nearly doubling Standard RAG's 20.33%, while adding 72% fewer documents than naive write-back (140 vs 500). Our work demonstrates that self-improving RAG is feasible and safe when governed by rigorous validation, offering a practical path toward RAG systems that learn from deployment.
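A toy sketch of a multi-stage acceptance layer in this spirit; the lexical-overlap checks below stand in for the NLI, attribution, and novelty models, and the thresholds are assumptions.

```python
def accept_for_writeback(answer, evidence_docs, corpus,
                         tau_entail=0.5, tau_novel=0.8):
    answer_tokens = set(answer.lower().split())

    # Stage 1: grounding -- is the answer supported by retrieved evidence?
    # (A real system would use an NLI / attribution model here.)
    support = max(
        len(answer_tokens & set(d.lower().split())) / max(len(answer_tokens), 1)
        for d in evidence_docs
    )
    if support < tau_entail:
        return False, "rejected: insufficient grounding"

    # Stage 2: novelty -- does the corpus already contain this knowledge?
    for doc in corpus:
        overlap = len(answer_tokens & set(doc.lower().split()))
        if overlap / max(len(answer_tokens), 1) > tau_novel:
            return False, "rejected: not novel"

    corpus.append(answer)                 # Stage 3: validated write-back
    return True, "accepted"

corpus = ["Paris is the capital of France."]
ok, why = accept_for_writeback(
    "The Eiffel Tower is in Paris.",
    ["The Eiffel Tower stands in Paris, France."], corpus)
print(ok, why, len(corpus))
```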

[AI-128] MatKV: Trading Compute for Flash Storage in LLM Inference ICDE2026

【Quick Read】: This paper targets the compute-intensive, energy-hungry, high-latency prefill phase of retrieval-augmented generation (RAG) inference over long inputs. The key to the solution, MatKV, is to precompute the key-value vectors (KVs) of RAG objects such as documents, materialize them in inexpensive, fast, and power-efficient flash storage, and reuse them at inference time instead of recomputing them on a power-hungry GPU. This halves both inference time and power consumption for RAG workloads without severely affecting question-answering accuracy. MatKV also enables the GPU to decode while the materialized KVs for the next instance are loaded in parallel, reducing load latency, and lets low-end GPUs handle decoding once the KVs are in GPU memory, improving overall system efficiency and hardware flexibility.

Link: https://arxiv.org/abs/2512.22195
Authors: Kun-Woo Shin (1), Jay H. Park (2), Moonwook Oh (2), Yohan Jo (1), Jaeyoung Do (1), Sang-Won Lee (1) ((1) Seoul National University, Korea; (2) Samsung Electronics, Korea)
Institution: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for publication in ICDE 2026

Abstract:We observe two major trends in LLM-based generative AI: (1) inference is becoming the dominant factor in terms of cost and power consumption, surpassing training, and (2) retrieval augmented generation (RAG) is becoming prevalent. When processing long inputs in RAG, the prefill phase of computing the key-value vectors of input text is energy-intensive and time-consuming even with high-end GPUs. Thus, it is crucial to make the prefill phase in RAG inference efficient. To address this issue, we propose MatKV, a scheme that precomputes the key-value vectors (KVs) of RAG objects (e.g., documents), materializes them in inexpensive but fast and power-efficient flash storage, and reuses them at inference time instead of recomputing the KVs using costly and power-inefficient GPU. Experimental results using Hugging Face’s Transformers library across state-of-the-art GPUs and flash memory SSDs confirm that, compared to full KV computation on GPUs, MatKV reduces both inference time and power consumption by half for RAG workloads, without severely impacting accuracy in the question-answering task. Furthermore, we demonstrate that MatKV enables additional optimizations in two ways. First, a GPU can decode text while simultaneously loading the materialized KVs for the next instance, reducing load latency. Second, since decoding speed is less sensitive to GPU performance than KV computation, low-end GPUs can be leveraged for decoding without significantly compromising speed once the materialized KVs are loaded into GPU memory. These findings underscore MatKV’s potential to make large-scale generative AI applications more cost-effective, power-efficient, and accessible across a wider range of tasks and hardware environments.
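A minimal sketch of the idea with Hugging Face Transformers (which the paper's experiments use): prefill a document once, persist its past_key_values, and later resume generation from them so the document's prefill is never recomputed. The model name and file path are illustrative, and depending on the Transformers version past_key_values may be a tuple of tensors or a Cache object, so the serialization step may need adapting.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

doc_ids = tok("Retrieved document text goes here.", return_tensors="pt").input_ids
with torch.no_grad():
    kv = model(doc_ids, use_cache=True).past_key_values
torch.save(kv, "doc0.kv")                      # materialize KVs (e.g. on flash)

kv = torch.load("doc0.kv")                     # later: reuse, skipping prefill
q_ids = tok(" Question: what is this about?", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(q_ids, past_key_values=kv, use_cache=True).logits
print(logits.shape)                            # next-token scores for the query
```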

[AI-129] Learning Tennis Strategy Through Curriculum-Based Dueling Double Deep Q-Networks

【Quick Read】: This paper treats tennis strategy optimization as a hard sequential decision-making problem involving hierarchical scoring, stochastic outcomes, long-horizon credit assignment, physical fatigue, and adaptation to opponent skill. The key to the solution is a reinforcement learning framework that pairs a custom tennis simulation environment with a Dueling Double Deep Q-Network (DDQN) trained via curriculum learning. The environment models full point-game-set scoring, ten discrete tactical action categories, symmetric fatigue dynamics, and a continuous opponent-skill parameter; the dueling architecture decomposes action values into state-value and advantage terms, double Q-learning reduces overestimation bias for stability in this long-horizon stochastic domain, and the curriculum gradually raises opponent difficulty from 0.40 to 0.50, avoiding the training collapse observed with fixed opponents. The trained agent wins 98-100% of matches against balanced opponents, with ablations showing both the dueling architecture and the curriculum are necessary.

Link: https://arxiv.org/abs/2512.22186
Authors: Vishnu Mohan
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 27 pages, 10 figures

Abstract:Tennis strategy optimization is a challenging sequential decision-making problem involving hierarchical scoring, stochastic outcomes, long-horizon credit assignment, physical fatigue, and adaptation to opponent skill. I present a reinforcement learning framework that integrates a custom tennis simulation environment with a Dueling Double Deep Q-Network(DDQN) trained using curriculum learning. The environment models complete tennis scoring at the level of points, games, and sets, rally-level tactical decisions across ten discrete action categories, symmetric fatigue dynamics, and a continuous opponent skill parameter. The dueling architecture decomposes action-value estimation into state-value and advantage components, while double Q-learning reduces overestimation bias and improves training stability in this long-horizon stochastic domain. Curriculum learning progressively increases opponent difficulty from 0.40 to 0.50, enabling robust skill acquisition without the training collapse observed under fixed opponents. Across extensive evaluations, the trained agent achieves win rates between 98 and 100 percent against balanced opponents and maintains strong performance against more challenging opponents. Serve efficiency ranges from 63.0 to 67.5 percent, and return efficiency ranges from 52.8 to 57.1 percent. Ablation studies demonstrate that both the dueling architecture and curriculum learning are necessary for stable convergence, while a standard DQN baseline fails to learn effective policies. Despite strong performance, tactical analysis reveals a pronounced defensive bias, with the learned policy prioritizing error avoidance and prolonged rallies over aggressive point construction. These results highlight a limitation of win-rate driven optimization in simplified sports simulations and emphasize the importance of reward design for realistic sports reinforcement learning.
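For reference, a minimal sketch of the two ingredients named here, a dueling Q-head and the double-DQN target; the observation size and layer widths are illustrative (ten discrete tactical actions, as described).

```python
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Q(s,a) recomposed from a state-value stream V(s) and a mean-centered
    advantage stream A(s,a)."""
    def __init__(self, obs_dim=32, n_actions=10):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.value = nn.Linear(128, 1)
        self.adv = nn.Linear(128, n_actions)

    def forward(self, s):
        h = self.trunk(s)
        v, a = self.value(h), self.adv(h)
        return v + a - a.mean(dim=-1, keepdim=True)

# Double DQN target: the online net selects the action, the target net
# evaluates it, reducing overestimation bias.
online, target = DuelingDQN(), DuelingDQN()
s2, r, gamma = torch.randn(8, 32), torch.randn(8), 0.99
a_star = online(s2).argmax(dim=-1)
q_next = target(s2).gather(1, a_star.unsqueeze(1)).squeeze(1)
td_target = r + gamma * q_next
print(td_target.shape)  # torch.Size([8])
```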

[AI-130] Interpretable Link Prediction in AI-Driven Cancer Research: Uncovering Co-Authorship Patterns

【Quick Read】: This paper addresses the difficulty of assembling effective interdisciplinary teams in AI-driven cancer research. The key to the solution is to analyze co-authorship networks, built from 7,738 Scopus publications (2000-2017), as a proxy for collaboration, engineer both attribute-based features (e.g., discipline similarity, productivity, seniority) and structure-based features, and train machine learning classifiers, of which random forest performs best, to predict new, persistent, and discontinued collaborations, with SHAP used for interpretability. The central finding is that the discipline similarity score positively affects new and persistent collaborations but negatively affects discontinued ones, while highly productive and senior researchers are more likely to form discontinued links, offering data-driven guidance for team formation and research policy.

Link: https://arxiv.org/abs/2512.22181
Authors: Shahab Mosallaie, Andrea Schiffauerova, Ashkan Ebadi
Institution: Unknown
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
Comments: 24 pages

Abstract:Artificial intelligence (AI) is transforming cancer diagnosis and treatment. The intricate nature of this disease necessitates the collaboration of diverse stakeholders with varied expertise to ensure the effectiveness of cancer research. Despite its importance, forming effective interdisciplinary research teams remains challenging. Understanding and predicting collaboration patterns can help researchers, organizations, and policymakers optimize resources and foster impactful research. We examined co-authorship networks as a proxy for collaboration within AI-driven cancer research. Using 7,738 publications (2000-2017) from Scopus, we constructed 36 overlapping co-authorship networks representing new, persistent, and discontinued collaborations. We engineered both attribute-based and structure-based features and built four machine learning classifiers. Model interpretability was performed using Shapley Additive Explanations (SHAP). Random forest achieved the highest recall for all three types of examined collaborations. The discipline similarity score emerged as a crucial factor, positively affecting new and persistent patterns while negatively impacting discontinued collaborations. Additionally, high productivity and seniority were positively associated with discontinued links. Our findings can guide the formation of effective research teams, enhance interdisciplinary cooperation, and inform strategic policy decisions.

[AI-131] iOS as Acceleration NEURIPS2025

【Quick Read】: This paper addresses the difficulty of running large-scale machine learning in compute-constrained local environments, especially when the data is private or sensitive, the physical environment cannot be reached through the cloud, or anticipated cloud costs are high. The key to the solution is to exploit ubiquitous yet underutilized mobile devices: recent iOS phones, which carry surprisingly powerful processors, are integrated as compute nodes via distributed pipeline parallelism, accelerating modest model training, batch inference, and agentic LLM tool usage on weaker local compute setups at zero additional cost.

Link: https://arxiv.org/abs/2512.22180
Authors: Alexander K. Chen
Institution: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 7 pages main text, 7 pages appendix. Presented at NeurIPS 2025 Efficient Reasoning Workshop

Abstract: Practical utilization of large-scale machine learning requires a powerful compute setup, a necessity which poses a significant barrier to engagement with such artificial intelligence in more restricted system environments. While cloud computing offers a solution to weaker local environments, certain situations, such as training on private or sensitive data, physical environments not reachable through the cloud, or high anticipated usage costs, necessitate computing locally. We explore the potential to improve weaker local compute systems at zero additional cost by taking advantage of ubiquitous yet underutilized resources: mobile phones. Specifically, recent iOS phones are equipped with surprisingly powerful processors, but they also face limitations like memory constraints, thermal throttling, and OS sandboxing. We present a proof-of-concept system demonstrating a novel approach to harness an iOS device via distributed pipeline parallelism, achieving significant benefits in a lesser compute environment by accelerating modest model training, batch inference, and agentic LRM tool usage. We discuss practical use cases, limitations, and directions for future work. The findings of this paper highlight the potential for commonplace mobile devices to provide greater contributions to machine learning.

[AI-132] Wireless Traffic Prediction with Large Language Model

【Quick Read】: This paper addresses the accuracy and scalability of city-scale wireless traffic prediction for next-generation networks, in particular the tendency of existing LLM-based approaches to ignore spatial dependencies. The key to the solution is the TIDES framework: a clustering mechanism identifies regional heterogeneity and personalized models are trained per region to balance generalization and specialization; a prompt engineering scheme embeds statistical traffic features as structured inputs, bridging the domain gap between numerical data and language models; a DeepSeek module applies cross-domain attention for spatial alignment, letting the LLM exploit information from spatially correlated regions; and only lightweight components are fine-tuned while the LLM backbone stays frozen, enabling efficient domain adaptation with low training overhead and marked gains in prediction accuracy and robustness.

Link: https://arxiv.org/abs/2512.22178
Authors: Chuanting Zhang, Haixia Zhang, Jingping Qiao, Zongzhang Li, Mohamed-Slim Alouini
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The growing demand for intelligent, adaptive resource management in next-generation wireless networks has underscored the importance of accurate and scalable wireless traffic prediction. While recent advancements in deep learning and foundation models such as large language models (LLMs) have demonstrated promising forecasting capabilities, they largely overlook the spatial dependencies inherent in city-scale traffic dynamics. In this paper, we propose TIDES (Traffic Intelligence with DeepSeek-Enhanced Spatial-temporal prediction), a novel LLM-based framework that captures spatial-temporal correlations for urban wireless traffic prediction. TIDES first identifies heterogeneous traffic patterns across regions through a clustering mechanism and trains personalized models for each region to balance generalization and specialization. To bridge the domain gap between numerical traffic data and language-based models, we introduce a prompt engineering scheme that embeds statistical traffic features as structured inputs. Furthermore, we design a DeepSeek module that enables spatial alignment via cross-domain attention, allowing the LLM to leverage information from spatially related regions. By fine-tuning only lightweight components while freezing core LLM layers, TIDES achieves efficient adaptation to domain-specific patterns without incurring excessive training overhead. Extensive experiments on real-world cellular traffic datasets demonstrate that TIDES significantly outperforms state-of-the-art baselines in both prediction accuracy and robustness. Our results indicate that integrating spatial awareness into LLM-based predictors is the key to unlocking scalable and intelligent network management in future 6G systems.

[AI-133] BitFlipScope: Scalable Fault Localization and Recovery for Bit-Flip Corruptions in LLM s

【Quick Read】: This paper addresses bit-flip faults in deployed large language models (LLMs), caused by hardware degradation, cosmic radiation, or deliberate fault injection such as Rowhammer, which silently corrupt internal parameters and trigger unpredictable, even dangerous behavior. The key to the solution, the scalable software framework BitFlipScope, is a two-regime localization strategy: when a clean reference model is available, faults are localized by differential analysis of outputs, hidden states, and internal activations; when no reference exists, residual-path perturbation and loss-sensitivity profiling infer the fault-impacted region directly from the corrupted model. Beyond efficient fault diagnosis, the framework supports lightweight performance recovery without fine-tuning, offering a practical path toward trustworthy LLM deployment in hardware-prone and adversarial environments.

Link: https://arxiv.org/abs/2512.22174
Authors: Muhammad Zeeshan Karamat, Sadman Saif, Christiana Chamon Garcia
Institution: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) deployed in practical and safety-critical settings are increasingly susceptible to bit-flip faults caused by hardware degradation, cosmic radiation, or deliberate fault-injection attacks such as Rowhammer. These faults silently corrupt internal parameters and can lead to unpredictable or dangerous model behavior. Localizing these corruptions is essential: without identifying the affected region, it is impossible to diagnose the source of degradation, apply targeted corrective measures, or restore model functionality without resorting to costly fine-tuning or full retraining. This work introduces BitFlipScope, a scalable, software-based framework for identifying fault-affected regions within transformer architectures under two deployment scenarios. When a clean reference model is available, BitFlipScope performs differential analysis of outputs, hidden states, and internal activations for detecting anomalous behavior indicative of corruption to pinpoint or localize faults. When no reference model exists, it uses residual-path perturbation and loss-sensitivity profiling to infer the fault-impacted region directly from the corrupted model. In both settings, the framework not only enables effective fault diagnosis but also supports lightweight performance recovery without fine-tuning, offering a practical path to restoring corrupted models. Together, these capabilities make BitFlipScope an important step toward trustworthy, fault-resilient LLM deployment in hardware-prone and adversarial environments.

[AI-134] Solving Multi-Agent Multi-Goal Path Finding Problems in Polynomial Time

【Quick Read】: This paper plans missions for fleets of agents on undirected graphs such as grids with multiple goals, where, unlike classical multi-agent path finding (MAPF), the solver assigns and updates the goal-to-agent assignment on its own rather than taking it as fixed, while producing conflict-free, globally optimized paths. The key to the solution is twofold: a global assignment strategy first greatly reduces the number of conflicts, and the remaining ones are resolved by elaborating on the ants-on-the-stick concept, solving local assignment problems, interleaving agent paths, and kicking already-arrived agents out of their destinations, yielding conflict-free, near-optimal routes in polynomial time, which is unexpected given that traditional vehicle routing on general graphs is NP-hard.

Link: https://arxiv.org/abs/2512.22171
Authors: Stefan Edelkamp
Institution: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Abstract: In this paper, we plan missions for a fleet of agents in undirected graphs, such as grids, with multiple goals. In contrast to regular multi-agent path-finding, the solver finds and updates the assignment of goals to the agents on its own. In the continuous case for a point agent with motions in the Euclidean plane, the problem can be solved arbitrarily close to optimal. For discrete variants that incur node and edge conflicts, we show that it can be solved in polynomial time, which is unexpected, since traditional vehicle routing on general graphs is NP-hard. We implement a corresponding planner that finds conflict-free optimized routes for the agents. Global assignment strategies greatly reduce the number of conflicts, with the remaining ones resolved by elaborating on the concept of ants-on-the-stick, by solving local assignment problems, by interleaving agent paths, and by kicking agents that have already arrived out of their destinations.

[AI-135] Practical challenges of control monitoring in frontier AI deployments

【Quick Read】: This paper examines how to effectively oversee highly capable but not fully trusted AI agents in real-world deployments, which introduce dynamics absent from earlier simplified settings: parallel agent instances, non-negligible oversight latency, incremental attacks spread across agent instances, and the difficulty of identifying scheming agents from individual harmful actions. The key contribution is an analysis of three forms of monitoring with different latency-safety trade-offs, synchronous, semi-synchronous, and asynchronous, together with a high-level safety case sketch for understanding and comparing these protocols. Four case studies of possible future AI deployments are used to work through the three core challenges of oversight, latency, and recovery, grounding the design of more reliable and scalable AI control mechanisms.

Link: https://arxiv.org/abs/2512.22154
Authors: David Lindner, Charlie Griffin, Tomek Korbak, Roland S. Zimmermann, Geoffrey Irving, Sebastian Farquhar, Alan Cooney
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:Automated control monitors could play an important role in overseeing highly capable AI agents that we do not fully trust. Prior work has explored control monitoring in simplified settings, but scaling monitoring to real-world deployments introduces additional dynamics: parallel agent instances, non-negligible oversight latency, incremental attacks between agent instances, and the difficulty of identifying scheming agents based on individual harmful actions. In this paper, we analyse design choices to address these challenges, focusing on three forms of monitoring with different latency-safety trade-offs: synchronous, semi-synchronous, and asynchronous monitoring. We introduce a high-level safety case sketch as a tool for understanding and comparing these monitoring protocols. Our analysis identifies three challenges – oversight, latency, and recovery – and explores them in four case studies of possible future AI deployments.

[AI-136] Adaptive GPU Resource Allocation for Multi-Agent Collaborative Reasoning in Serverless Environments

【Quick Read】: This paper addresses the resource-allocation difficulty of efficiently deploying LLM-powered multi-agent systems on serverless GPU platforms in the face of heterogeneous agent workloads, dynamic fluctuations in computational demand, and cost-sensitive scaling. The key to the solution is an adaptive GPU resource-allocation framework that schedules dynamically based on workload characteristics, agent priorities, and minimum resource requirements, using an O(N)-complexity algorithm for millisecond-scale real-time reallocation. The framework cuts latency by 85% relative to round-robin scheduling while maintaining throughput comparable to static allocation and higher GPU utilization, all without sacrificing quality of service.

Link: https://arxiv.org/abs/2512.22149
Authors: Guilin Zhang, Wulan Guo, Ziqi Tan
Institution: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 7 pages, 2 figures

Abstract:Multi-agent systems powered by large language models have emerged as a promising paradigm for solving complex reasoning tasks through collaborative intelligence. However, efficiently deploying these systems on serverless GPU platforms presents significant resource allocation challenges due to heterogeneous agent workloads, varying computational demands, and the need for cost-effective scaling. This paper presents an adaptive GPU resource allocation framework that achieves 85% latency reduction compared to round-robin scheduling while maintaining comparable throughput to static allocation, using an O(N) complexity algorithm for real-time adaptation. Our approach dynamically allocates GPU resources based on workload characteristics, agent priorities, and minimum resource requirements, enabling efficient utilization while maintaining quality of service. The framework addresses three key challenges: (1) heterogeneous computational demands across lightweight coordinators and heavyweight specialists, (2) dynamic workload fluctuations requiring millisecond-scale reallocation, and (3) capacity constraints in serverless environments. Through comprehensive simulations modeling realistic multi-agent workflows with four heterogeneous agents, we demonstrate that adaptive allocation outperforms static equal and round-robin strategies across latency, cost, and GPU utilization metrics. The framework provides a practical solution for deploying cost-efficient multi-agent AI systems on serverless GPU infrastructure.
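A minimal sketch of what an O(N) adaptive pass can look like; this is an illustrative proportional-share rule, not the paper's algorithm: minimum shares are granted first, then spare capacity is split by priority-weighted residual demand in a single pass over the agents.

```python
def allocate(agents, capacity):
    """agents: list of dicts with 'name', 'min', 'demand', 'priority'.
    Returns GPU fractions summing to `capacity`. Runs in O(N)."""
    base = sum(a["min"] for a in agents)
    if base > capacity:
        raise ValueError("minimum requirements exceed capacity")
    weights = [a["priority"] * max(a["demand"] - a["min"], 0) for a in agents]
    total_w = sum(weights) or 1.0
    spare = capacity - base
    return {
        a["name"]: a["min"] + spare * w / total_w
        for a, w in zip(agents, weights)
    }

agents = [
    {"name": "coordinator", "min": 0.5, "demand": 1.0, "priority": 1.0},
    {"name": "specialist-a", "min": 1.0, "demand": 4.0, "priority": 2.0},
    {"name": "specialist-b", "min": 1.0, "demand": 2.0, "priority": 1.0},
]
print(allocate(agents, capacity=8.0))   # GPU fractions summing to 8.0
```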

[AI-137] Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification INTERSPEECH2025

【Quick Read】: This paper addresses a limitation in how speaker verification systems aggregate multi-layer features from pre-trained Transformer models: existing methods mostly use a static weighted average and cannot capture how the importance of different layers varies over time. The key to the solution is Layer Attentive Pooling (LAP), which assesses the significance of each layer time-dynamically from multiple perspectives and replaces averaging with max pooling to integrate inter-layer representations more effectively. Combined with a lightweight backend comprising LAP and Attentive Statistical Temporal Pooling (ASTP), the compact architecture reaches state-of-the-art performance on the VoxCeleb benchmark while greatly reducing training time.

Link: https://arxiv.org/abs/2512.22148
Authors: Jin Sob Kim, Hyun Joon Park, Wooseok Shin, Sung Won Han
Institution: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted to Interspeech 2025

Abstract:Recent speaker verification studies have achieved notable success by leveraging layer-wise output from pre-trained Transformer models. However, few have explored the advancements in aggregating these multi-level features beyond the static weighted average. We present Layer Attentive Pooling (LAP), a novel strategy for aggregating inter-layer representations from pre-trained speech models for speaker verification. LAP assesses the significance of each layer from multiple perspectives time-dynamically, and employs max pooling instead of averaging. Additionally, we propose a lightweight backend speaker model comprising LAP and Attentive Statistical Temporal Pooling (ASTP) to extract speaker embeddings from pre-trained model output. Experiments on the VoxCeleb benchmark reveal that our compact architecture achieves state-of-the-art performance while greatly reducing the training time. We further analyzed LAP design and its dynamic weighting mechanism for capturing speaker characteristics.
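A minimal sketch of the pooling idea: per-frame attention weights are computed over the stack of layer outputs, and max pooling (rather than a static weighted average) fuses the layers. The dimensions are illustrative and the single-linear scoring network is a stand-in for the paper's multi-perspective design.

```python
import torch
import torch.nn as nn

class LayerAttentivePooling(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, layer_outputs):
        # layer_outputs: (batch, n_layers, time, d_model)
        w = torch.softmax(self.score(layer_outputs), dim=1)  # over layers, per frame
        weighted = w * layer_outputs
        return weighted.max(dim=1).values                    # (batch, time, d_model)

lap = LayerAttentivePooling(d_model=768)
feats = torch.randn(2, 13, 100, 768)   # e.g. stacked layer outputs of a pre-trained model
print(lap(feats).shape)                # torch.Size([2, 100, 768])
```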

[AI-138] GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs

【Quick Read】: This paper addresses a bottleneck in high-performance computing: hotspot GPU kernels dominate run time, expert manual tuning is costly and hard to port, and existing LLM-based methods assume kernels can be compiled and executed cheaply, an assumption that breaks down in large applications where full builds and runs are expensive. The key to the solution is an end-to-end LLM framework with performance feedback that optimizes independently extracted hotspot kernels without building the full application: each kernel is first completed into a Minimal Executable Program (MEP) and then iteratively optimized and evaluated outside the application over multiple rounds. Automatic Error Repair and Performance Pattern Inheritance preserve correctness, reuse effective tiling, memory, and synchronization strategies, and shrink the search space, and the optimized variants are reintegrated into the original application for validation. The approach needs no full-source dependencies, is portable across platforms, and delivers practical, low-cost GPU kernel speedups.

Link: https://arxiv.org/abs/2512.22147
Authors: Ruifan Chu, Anbang Wang, Xiuxiu Bai, Shuai Liu, Xiaoshe Dong
Institution: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Performance (cs.PF)
Comments:

Abstract:In high-performance computing, hotspot GPU kernels are primary bottlenecks, and expert manual tuning is costly and hard to port. Large language model methods often assume kernels can be compiled and executed cheaply, which fails in large applications where full builds and runs are expensive. We present an end-to-end LLM framework with performance feedback that optimizes kernels without building the full application. From independently extracted hotspot kernels, it automatically completes code into a Minimal Executable Program (MEP), then performs multi-round iterative optimization and evaluation outside the full application. The framework integrates Automatic Error Repair and Performance Pattern Inheritance to fix faults, preserve correctness, reuse effective tiling/memory/synchronization strategies, and reduce search cost. Optimized variants are reintegrated into the original application for validation. We evaluate on NVIDIA GPUs and the Haiguang Deep Computing Unit (DCU) platform (AMD-licensed architecture) using PolyBench, the AMD APP SDK, and hotspot kernels from large-scale supercomputing applications. The method achieves average speedups of 5.05x (PolyBench on NVIDIA), 7.77x (PolyBench on DCU), 1.77x (AMD APP SDK), and 1.25x on three hotspot kernels, surpassing direct LLM optimization. The approach requires no full-source dependencies, offers cross-platform portability, and enables practical, low-cost GPU kernel optimization.

[AI-139] Pre-review to Peer review: Pitfalls of Automating Reviews using Large Language Models

【Quick Read】: This paper evaluates the safety and reliability of using frontier open-weight LLMs to automate scholarly peer review, specifically how well LLM-generated reviews align with human reviewer ratings and with post-publication outcomes such as citations, hit papers, novelty, and disruption. The key to the study is a dataset of ground-truth reviewer ratings gathered from OpenReview (released as D_LMRSD), against which LLM-generated reviews are systematically compared. All models correlate only weakly with human reviewers (0.15), with a systematic overestimation bias of 3-5 points and uniformly high confidence (8.0-9.0/10) despite prediction errors; yet LLM reviews correlate more strongly with post-publication metrics than with human scores, suggesting genuine utility as pre-review screening tools while exposing fundamental misalignment risks for fully autonomous reviewing.

Link: https://arxiv.org/abs/2512.22145
Authors: Akhil Pandey Akella, Harish Varma Siravuri, Shaurya Rohatgi
Institution: Unknown
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Large Language Models are versatile general-task solvers, and their capabilities can truly assist people with scholarly peer review as \textitpre-review agents, if not as fully autonomous \textitpeer-review agents. While incredibly beneficial, automating academic peer-review, as a concept, raises concerns surrounding safety, research integrity, and the validity of the academic peer-review process. The majority of the studies performing a systematic evaluation of frontier LLMs generating reviews across science disciplines miss the mark on addressing the alignment/misalignment of reviews along with the utility of LLM generated reviews when compared against publication outcomes such as \textbfCitations, \textbfHit-papers, \textbfNovelty, and \textbfDisruption. This paper presents an experimental study in which we gathered ground-truth reviewer ratings from OpenReview and used various frontier open-weight LLMs to generate reviews of papers to gauge the safety and reliability of incorporating LLMs into the scientific review pipeline. Our findings demonstrate the utility of frontier open-weight LLMs as pre-review screening agents despite highlighting fundamental misalignment risks when deployed as autonomous reviewers. Our results show that all models exhibit weak correlation with human peer reviewers (0.15), with systematic overestimation bias of 3-5 points and uniformly high confidence scores (8.0-9.0/10) despite prediction errors. However, we also observed that LLM reviews correlate more strongly with post-publication metrics than with human scores, suggesting potential utility as pre-review screening tools. Our findings highlight the potential and address the pitfalls of automating peer reviews with language models. We open-sourced our dataset D_LMRSD to help the research community expand the safety framework of automating scientific reviews.

[AI-140] HLS4PC: A Parametrizable Framework For Accelerating Point-Based 3D Point Cloud Models on FPGA

【Quick Read】: This paper addresses the high memory and computational overhead that the sparse, unstructured nature of 3D point cloud data imposes when point-based models run on server-grade GPUs, limiting real-time performance in safety-critical applications. The key to the solution is HLS4PC, a parametrizable high-level synthesis (HLS) framework that exploits FPGA parallelization and algorithmic optimization for efficient fixed-point implementations of both the mapping functions and the neural network layers. Hardware-aware compression of a state-of-the-art PointMLP-Elite model, replacing farthest point sampling (FPS) with uniform random sampling (URS), parameter quantization, layer fusion, and input-point pruning, yields PointMLP-Lite, a 4x less complex variant with only a 2% accuracy drop on ModelNet40, whose FPGA implementation achieves 3.56x higher throughput than prior work and outperforms GPU and CPU implementations by 2.3x and 22x, respectively.

Link: https://arxiv.org/abs/2512.22139
Authors: Amur Saqib Pal, Muhammad Mohsin Ghaffar, Faisal Shafait, Christian Weis, Norbert Wehn
Institution: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Robotics (cs.RO)
Comments: Accepted for publication by 25th International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS 2025)

Abstract:Point-based 3D point cloud models employ computation and memory intensive mapping functions alongside NN layers for classification/segmentation, and are executed on server-grade GPUs. The sparse, and unstructured nature of 3D point cloud data leads to high memory and computational demand, hindering real-time performance in safety critical applications due to GPU under-utilization. To address this challenge, we present HLS4PC, a parameterizable HLS framework for FPGA acceleration. Our approach leverages FPGA parallelization and algorithmic optimizations to enable efficient fixed-point implementations of both mapping and NN functions. We explore several hardware-aware compression techniques on a state-of-the-art PointMLP-Elite model, including replacing FPS with URS, parameter quantization, layer fusion, and input-points pruning, yielding PointMLP-Lite, a 4x less complex variant with only 2% accuracy drop on ModelNet40. Secondly, we demonstrate that the FPGA acceleration of the PointMLP-Lite results in 3.56x higher throughput than previous works. Furthermore, our implementation achieves 2.3x and 22x higher throughput compared to the GPU and CPU implementations, respectively.

[AI-141] SoDA: An Efficient Interaction Paradigm for the Agentic Web

【Quick Read】: This paper addresses two core challenges facing users as the internet evolves from the mobile-App-dominated attention economy to the agentic web: data lock-in and cognitive overload. The key to the solution is a forward-looking user-sovereignty interaction paradigm: decoupling memory from application logic removes the structural basis of data lock-in, while moving from explicit manual instruction to implicit intent alignment relieves cognitive overload. The paradigm is realized by the Sovereign Digital Avatar (SoDA), whose orthogonal decoupling of storage, computation, and interaction establishes the principle of data as a persistent asset and models as transient tools, breaking the platform monopoly on user memory. For zero-trust environments, an Intent-Permission Handshake Mechanism built on A2A protocols applies dual-factor adaptive routing (a Sensitivity Coefficient and a Strictness Parameter) for active risk governance. Empirical evaluation shows roughly 27-35% lower token consumption for cross-platform service migration and complex task execution, and, for multi-modal task orchestration, 72% less user cognitive load than standard RAG architectures and 88% less than manual workflows, with a markedly higher Information Signal-to-Noise Ratio (SNR).

链接: https://arxiv.org/abs/2512.22135
作者: Zicai Cui,Zhouyuan Jian,Weiwen Liu,Weinan Zhang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:As the internet evolves from the mobile App-dominated Attention Economy to the Intent-Interconnection of the Agentic Web era, existing interaction modes fail to address the escalating challenges of data lock-in and cognitive overload. Addressing this, we define a future-oriented user sovereignty interaction paradigm, aiming to realize a fundamental shift from killing time to saving time. Specifically, we argue that decoupling memory from application logic eliminates the structural basis of data lock-in, while shifting from explicit manual instruction to implicit intent alignment resolves cognitive overload by offloading execution complexity. This paradigm is implemented via the Sovereign Digital Avatar (SoDA), which employs an orthogonal decoupling design of storage, computation, and interaction. This establishes the architectural principle of "data as a persistent asset, model as a transient tool", fundamentally breaking the platform monopoly on user memory. To support the operation of this new paradigm in zero-trust environments, we design an Intent-Permission Handshake Mechanism based on A2A protocols, utilizing dual-factor (Sensitivity Coefficient and Strictness Parameter) adaptive routing to achieve active risk governance. Empirical evaluation with a high-fidelity simulation environment indicates that this paradigm reduces token consumption by approximately 27-35% during cross-platform service migration and complex task execution. Furthermore, in the orchestration of multi-modal complex tasks, it reduces user cognitive load by 72% compared to standard Retrieval-Augmented Generation (RAG) architectures and by 88% relative to manual workflows, while significantly boosting the Information Signal-to-Noise Ratio (SNR). These results demonstrate that SoDA is the essential interaction infrastructure for building an efficient, low-friction, and decentralized Agentic Web.

[AI-142] ReCollab: Retrieval-Augmented LLMs for Cooperative Ad-hoc Teammate Modeling

【速读】:This paper addresses the problem in **Ad-hoc Teamwork (AHT)** of how an agent facing previously unseen teammates can infer their behavior patterns from limited observations and adapt its own policy accordingly. Traditional approaches rely on fixed probabilistic models or classifiers, which are brittle under partial observability and limited interaction data. The key to the solution is to use **Large Language Models (LLMs)** as behavioral world models that map short behavioral traces into high-level hypotheses for flexible reasoning; the proposed ReCollab framework further incorporates Retrieval-Augmented Generation (RAG) with exemplar trajectories to stabilize inference, substantially improving the ability to distinguish teammate types and adapt policies in complex cooperative settings (such as the Overcooked environment) and achieving Pareto-optimal trade-offs between classification accuracy and episodic return.

链接: https://arxiv.org/abs/2512.22129
作者: Conor Wallace,Umer Siddique,Yongcan Cao
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Ad-hoc teamwork (AHT) requires agents to infer the behavior of previously unseen teammates and adapt their policy accordingly. Conventional approaches often rely on fixed probabilistic models or classifiers, which can be brittle under partial observability and limited interaction. Large language models (LLMs) offer a flexible alternative: by mapping short behavioral traces into high-level hypotheses, they can serve as world models over teammate behavior. We introduce Collab, a language-based framework that classifies partner types using a behavior rubric derived from trajectory features, and extend it to ReCollab, which incorporates retrieval-augmented generation (RAG) to stabilize inference with exemplar trajectories. In the cooperative Overcooked environment, Collab effectively distinguishes teammate types, while ReCollab consistently improves adaptation across layouts, achieving Pareto-optimal trade-offs between classification accuracy and episodic return. These findings demonstrate the potential of LLMs as behavioral world models for AHT and highlight the importance of retrieval grounding in challenging coordination settings.

[AI-143] GPU-Virt-Bench: A Comprehensive Benchmarking Framework for Software-Based GPU Virtualization Systems

【速读】:This paper addresses the problem of efficiently sharing GPU resources in cloud and container environments, and in particular how to evaluate different GPU virtualization schemes fairly and comprehensively amid surging generative AI and Large Language Model (LLM) inference workloads. Hardware-level isolation such as NVIDIA Multi-Instance GPU (MIG) is limited to high-end datacenter GPUs, while software virtualization schemes such as HAMi-core and BUD-FCSP apply to broader GPU families but lack a standardized evaluation methodology. The key to the solution is GPU-Virt-Bench, a benchmarking framework covering 56 performance metrics across 10 dimensions, including isolation quality, LLM performance, memory bandwidth, cache behavior, PCIe throughput, and multi-GPU communication, enabling systematic comparison between software virtualization schemes and ideal MIG behavior and providing quantifiable guidance for deploying GPU resources in multi-tenant environments.

链接: https://arxiv.org/abs/2512.22125
作者: Jithin VG,Ditto PS
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The proliferation of GPU-accelerated workloads, particularly in artificial intelligence and large language model (LLM) inference, has created unprecedented demand for efficient GPU resource sharing in cloud and container environments. While NVIDIA’s Multi-Instance GPU (MIG) technology provides hardware-level isolation, its availability is limited to high-end datacenter GPUs. Software-based virtualization solutions such as HAMi-core and BUD-FCSP offer alternatives for broader GPU families but lack standardized evaluation methodologies. We present GPU-Virt-Bench, a comprehensive benchmarking framework that evaluates GPU virtualization systems across 56 performance metrics organized into 10 categories. Our framework measures overhead, isolation quality, LLM-specific performance, memory bandwidth, cache behavior, PCIe throughput, multi-GPU communication, scheduling efficiency, memory fragmentation, and error recovery. GPU-Virt-Bench enables systematic comparison between software virtualization approaches and ideal MIG behavior, providing actionable insights for practitioners deploying GPU resources in multi-tenant environments. We demonstrate the framework’s utility through evaluation of HAMi-core, BUD-FCSP, and simulated MIG baselines, revealing performance characteristics critical for production deployment decisions.

[AI-144] Alpha-R1: Alpha Screening with LLM Reasoning via Reinforcement Learning

【速读】:This paper tackles the signal decay and regime shifts that data-driven investment strategies face in non-stationary markets: traditional time-series and machine learning methods over-rely on historical correlations and generalize poorly when the economic environment changes. The key to the solution is Alpha-R1, an 8B-parameter reasoning model trained via reinforcement learning that performs context-aware alpha screening by reasoning jointly over factor logic and real-time news, dynamically activating or deactivating factors according to contextual consistency, thereby improving the strategy's adaptability to market changes and its robustness to alpha decay.

链接: https://arxiv.org/abs/2512.23515
作者: Zuoyou Jiang,Li Zhao,Rui Sun,Ruohan Sun,Zhongjian Li,Jing Li,Daxin Jiang,Zuo Bai,Cheng Hua
机构: 未知
类目: Trading and Market Microstructure (q-fin.TR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Signal decay and regime shifts pose recurring challenges for data-driven investment strategies in non-stationary markets. Conventional time-series and machine learning approaches, which rely primarily on historical correlations, often struggle to generalize when the economic environment changes. While large language models (LLMs) offer strong capabilities for processing unstructured information, their potential to support quantitative factor screening through explicit economic reasoning remains underexplored. Existing factor-based methods typically reduce alphas to numerical time series, overlooking the semantic rationale that determines when a factor is economically relevant. We propose Alpha-R1, an 8B-parameter reasoning model trained via reinforcement learning for context-aware alpha screening. Alpha-R1 reasons over factor logic and real-time news to evaluate alpha relevance under changing market conditions, selectively activating or deactivating factors based on contextual consistency. Empirical results across multiple asset pools show that Alpha-R1 consistently outperforms benchmark strategies and exhibits improved robustness to alpha decay. The full implementation and resources are available at this https URL.

[AI-145] PINNs for Electromagnetic Wave Propagation

【速读】:This paper aims to improve the accuracy and energy consistency of Physics-Informed Neural Networks (PINNs) for electromagnetic wave propagation, where PINNs lag behind the conventional finite-difference time-domain (FDTD) method in field accuracy and energy conservation. The key to the solution is a hybrid training strategy: first, time marching with causality-aware weighting mitigates the causality-collapse problem in time-dependent PINN training; second, a two-stage interface continuity loss suppresses the discontinuities introduced by time marching; finally, a local Poynting-based regularizer suppresses error accumulation, controlling the cumulative energy drift of the propagating electromagnetic wave. With this approach the PINN reaches FDTD-comparable accuracy without any labeled field data (an average NRMSE of 0.09% and an L² error of 1.01%) and excellent energy conservation in a 2D perfect-electric-conductor cavity scenario, with only a 0.024% relative energy mismatch.

链接: https://arxiv.org/abs/2512.23396
作者: Nilufer K. Bulut
机构: 未知
类目: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Physics-Informed Neural Networks (PINNs) are a methodology that aims to solve physical systems by directly embedding PDE constraints into the neural network training process. In electromagnetism, where well-established methodologies such as FDTD and FEM already exist, new methodologies are expected to provide clear advantages to be accepted. Despite their mesh-free nature and applicability to inverse problems, PINNs can exhibit deficiencies in terms of accuracy and energy metrics when compared to FDTD solutions. This study demonstrates that hybrid training strategies can bring PINNs closer to FDTD-level accuracy and energy consistency. This study presents a hybrid methodology addressing common challenges in wave propagation scenarios. The causality collapse problem in time-dependent PINN training is addressed via time marching and causality-aware weighting. In order to mitigate the discontinuities that are introduced by time marching, a two-stage interface continuity loss is applied. In order to suppress loss accumulation, which is manifested as cumulative energy drift in electromagnetic waves, a local Poynting-based regularizer has been developed. In the developed PINN model, high field accuracy is achieved with an average 0.09% NRMSE and 1.01% L^2 error over time. Energy conservation is achieved on the PINN side with only a 0.024% relative energy mismatch in the 2D PEC cavity scenario. Training is performed without labeled field data, using only physics-based residual losses; FDTD is used solely for post-training evaluation. The results demonstrate that PINNs can achieve competitive results with FDTD in canonical electromagnetic examples and are a viable alternative.
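The causality-aware weighting in this training strategy is compact enough to sketch. Below is a minimal illustration, assuming PyTorch; the exponential weighting form and the `eps` hyperparameter follow the common causal-training recipe for time-dependent PINNs and are illustrative choices, not the paper's exact implementation.

```python
import torch

def causal_weights(per_step_residuals: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """w_i = exp(-eps * sum_{j<i} L_j): later time slices are down-weighted
    until the residuals of earlier slices have been driven down."""
    cum_prev = torch.cumsum(per_step_residuals, dim=0) - per_step_residuals
    return torch.exp(-eps * cum_prev).detach()  # weights carry no gradient

# Example: mean-squared PDE residuals on 10 consecutive time slices.
per_step = torch.rand(10, requires_grad=True)
loss = (causal_weights(per_step) * per_step).mean()
loss.backward()
```

In a full time-marching setup, each window would additionally carry the two-stage interface continuity terms and the Poynting-based regularizer described above.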

[AI-146] Constraint programming model and biased random-key genetic algorithm for the single-machine coupled task scheduling problem with exact delays to minimize the makespan

【速读】:This paper addresses the strongly NP-hard single-machine coupled task scheduling problem with exact delays to minimize the makespan. The core of the solution is twofold: a Constraint Programming (CP) model built on well-established global constraints, and a Biased Random-Key Genetic Algorithm (BRKGA) that combines an efficient decoder, an initial solution generator, periodic restarts with shakes, and a local search algorithm to markedly improve search efficiency and solution quality. Experiments show that the BRKGA outperforms the CP model under short time limits, that the CP model approaches the best-known solutions when given longer running times, and that the combination of shake and local search is the key driver of the BRKGA's performance.

链接: https://arxiv.org/abs/2512.23150
作者: Vítor A. Barbosa,Rafael A. Melo
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We consider the strongly NP-hard single-machine coupled task scheduling problem with exact delays to minimize the makespan. In this problem, a set of jobs has to be scheduled, each composed of two tasks interspersed by an exact delay. Given that no preemption is allowed, the goal consists of minimizing the completion time of the last scheduled task. We model the problem using constraint programming (CP) and propose a biased random-key genetic algorithm (BRKGA). Our CP model applies well-established global constraints. Our BRKGA combines some successful components in the literature: an initial solution generator, periodical restarts and shakes, and a local search algorithm. Furthermore, the BRKGA’s decoder is focused on efficiency rather than optimality, which accelerates the solution space exploration. Computational experiments on a benchmark set containing instances with up to 100 jobs (200 tasks) indicate that the proposed BRKGA can efficiently explore the problem solution space, providing high-quality approximate solutions within low computational times. It can also provide better solutions than the CP model under the same computational settings, i.e., three minutes of time limit and a single thread. The CP model, when offered a longer running time of 3600 seconds and multiple threads, significantly improved the results, reaching the current best-known solution for 90.56% of these instances. Finally, our experiments highlight the importance of the shake and local search components in the BRKGA, whose combination significantly improves the results of a standard BRKGA.
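To make the "efficiency over optimality" decoding idea concrete, here is a toy random-key decoder for the coupled-task problem: keys induce a job order, and each job's two tasks are placed at the earliest start that keeps both intervals free. The unit-step feasibility scan is a deliberate simplification (a real decoder would jump between interval endpoints), and none of this is the authors' code.

```python
from typing import List, Tuple

def free(busy, s, e):
    """True if [s, e) overlaps no occupied interval."""
    return all(e <= a or s >= b for a, b in busy)

def decode(keys: List[float], jobs: List[Tuple[int, int, int]]) -> int:
    """jobs[j] = (a_j, L_j, b_j): first task, exact delay, second task."""
    order = sorted(range(len(jobs)), key=lambda j: keys[j])
    busy: List[Tuple[int, int]] = []
    makespan = 0
    for j in order:
        a, L, b = jobs[j]
        t = 0
        while not (free(busy, t, t + a) and free(busy, t + a + L, t + a + L + b)):
            t += 1  # naive scan; real decoders jump between interval endpoints
        busy += [(t, t + a), (t + a + L, t + a + L + b)]
        makespan = max(makespan, t + a + L + b)
    return makespan

print(decode([0.3, 0.7], [(2, 3, 2), (1, 2, 1)]))  # makespan 8 for this toy
```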

[AI-147] An Inference-Based Architecture for Intent and Affordance Saturation in Decision-Making

【速读】:This paper addresses decision paralysis, the phenomenon of hesitation, freezing, or failure to act despite full knowledge and motivation, which challenges traditional choice models that assume options are already specified and readily comparable. The key to the solution is a hierarchical, inference-based computational model of decision making that separates intent selection (what to pursue) from affordance selection (how to pursue the goal) and formalizes commitment via a mixture of reverse and forward Kullback-Leibler (KL) objectives: reverse KL is mode-seeking and promotes rapid commitment, while forward KL is mode-covering and preserves multiple plausible goals or actions. In static and dynamic (drift-diffusion) models, the framework produces slow, heavy-tailed response-time distributions and identifies two failure modes, intent saturation and affordance saturation, which are especially pronounced when option values are similar, thereby framing autism as an extreme regime of a general, inference-based decision-making continuum.

链接: https://arxiv.org/abs/2512.23144
作者: Wendyam Eric Lionel Ilboudo,Saori C Tanaka
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 32 pages, 12 figures

点击查看摘要

Abstract:Decision paralysis, i.e. hesitation, freezing, or failure to act despite full knowledge and motivation, poses a challenge for choice models that assume options are already specified and readily comparable. Drawing on qualitative reports in autism research that are especially salient, we propose a computational account in which paralysis arises from convergence failure in a hierarchical decision process. We separate intent selection (what to pursue) from affordance selection (how to pursue the goal) and formalize commitment as inference under a mixture of reverse- and forward-Kullback-Leibler (KL) objectives. Reverse KL is mode-seeking and promotes rapid commitment, whereas forward KL is mode-covering and preserves multiple plausible goals or actions. In static and dynamic (drift-diffusion) models, forward-KL-biased inference yields slow, heavy-tailed response times and two distinct failure modes, intent saturation and affordance saturation, when values are similar. Simulations in multi-option tasks reproduce key features of decision inertia and shutdown, treating autism as an extreme regime of a general, inference-based, decision-making continuum.

[AI-148] JADAI: Jointly Amortizing Adaptive Design and Bayesian Inference

【速读】:This paper addresses parameter estimation problems in which design variables can be actively optimized to maximize information gain, i.e., dynamically adjusting experimental inputs to improve subsequent inference. The key to the solution is the JADAI framework, which jointly optimizes Bayesian adaptive experimental design and inference by training a policy network, a history network, and an inference network end-to-end; the inference network uses a diffusion-based posterior estimator that can efficiently approximate high-dimensional and multimodal posteriors at every experimental step, enabling incremental minimization of posterior error along experimental sequences.

链接: https://arxiv.org/abs/2512.22999
作者: Niels Bracher,Lars Kühmichel,Desi R. Ivanova,Xavier Intes,Paul-Christian Bürkner,Stefan T. Radev
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We consider problems of parameter estimation where design variables can be actively optimized to maximize information gain. To this end, we introduce JADAI, a framework that jointly amortizes Bayesian adaptive design and inference by training a policy, a history network, and an inference network end-to-end. The networks minimize a generic loss that aggregates incremental reductions in posterior error along experimental sequences. Inference networks are instantiated with diffusion-based posterior estimators that can approximate high-dimensional and multimodal posteriors at every experimental step. Across standard adaptive design benchmarks, JADAI achieves superior or competitive performance.

[AI-149] Geometry-Aware Optimization for Respiratory Sound Classification: Enhancing Sensitivity with SAM-Optimized Audio Spectrogram Transformers

【速读】:This paper addresses overfitting and poor generalization in respiratory sound classification, caused by the small size, high noise levels, and severe class imbalance of benchmark datasets such as ICBHI 2017. The key to the solution is to optimize the Audio Spectrogram Transformer (AST) framework with Sharpness-Aware Minimization (SAM): instead of merely minimizing the training loss, SAM optimizes the geometry of the loss landscape and guides the model toward flatter minima that generalize better to unseen patients, while a weighted sampling strategy effectively mitigates class imbalance. The method achieves a score of 68.10% and a sensitivity of 68.31% on ICBHI 2017, clearly outperforming existing CNN and hybrid baselines.

链接: https://arxiv.org/abs/2512.22564
作者: Atakan Işık,Selin Vulga Işık,Ahmet Feridun Işık,Mahşuk Taylan
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
备注: 10 pages, 3 figures,2 tables

点击查看摘要

Abstract:Respiratory sound classification is hindered by the limited size, high noise levels, and severe class imbalance of benchmark datasets like ICBHI 2017. While Transformer-based models offer powerful feature extraction capabilities, they are prone to overfitting and often converge to sharp minima in the loss landscape when trained on such constrained medical data. To address this, we introduce a framework that enhances the Audio Spectrogram Transformer (AST) using Sharpness-Aware Minimization (SAM). Instead of merely minimizing the training loss, our approach optimizes the geometry of the loss surface, guiding the model toward flatter minima that generalize better to unseen patients. We also implement a weighted sampling strategy to handle class imbalance effectively. Our method achieves a state-of-the-art score of 68.10% on the ICBHI 2017 dataset, outperforming existing CNN and hybrid baselines. More importantly, it reaches a sensitivity of 68.31%, a crucial improvement for reliable clinical screening. Further analysis using t-SNE and attention maps confirms that the model learns robust, discriminative features rather than memorizing background noise.
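The SAM update underlying this approach is a two-step procedure that is easy to sketch. The following is a minimal single-device sketch, assuming PyTorch; `rho` and the choice of base optimizer are illustrative, not the paper's exact settings.

```python
import torch

def sam_step(model, loss_fn, x, y, base_opt, rho: float = 0.05):
    # 1) Gradient at the current weights.
    loss_fn(model(x), y).backward()
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    norm = torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12
    # 2) Climb to the worst-case neighbor within the rho-ball.
    perturbs = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / norm
            p.add_(e)
            perturbs.append((p, e))
    # 3) Gradient at the perturbed point, then undo the perturbation.
    model.zero_grad()
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in perturbs:
            p.sub_(e)
    # 4) Descend with the sharpness-aware gradient.
    base_opt.step()
    base_opt.zero_grad()
```

Each step costs two forward-backward passes, which is the usual price paid for seeking flatter minima.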

[AI-150] Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds

【速读】:This paper asks how gradient-based training shapes the internal geometry of attention in Transformers to enable precise probabilistic (Bayesian) reasoning. Although Transformers reason well in "Bayesian wind tunnels" and in large-scale language models, the optimization mechanism behind this ability has remained opaque. The key contribution is a complete first-order analysis of how cross-entropy gradients reshape attention scores and value vectors, with two core findings: (1) attention scores follow an advantage-based routing law, adjusting each weight according to how the query-key match score $b_{ij} = u_i^\top v_j$ deviates from its attention-weighted mean; (2) value vectors receive a responsibility-weighted update that pulls them toward the queries that use them most. Together these form a positive feedback loop in which routing and content representation specialize jointly, with dynamics equivalent to a two-timescale EM procedure: the attention weights implement the E-step (soft responsibility assignment) and the value vectors the M-step (responsibility-weighted prototype updates). This theory explains how optimization produces the low-dimensional manifold structure that supports probabilistic reasoning, unifying optimization (gradient flow), geometry (Bayesian manifolds), and function (in-context probabilistic reasoning).

链接: https://arxiv.org/abs/2512.22473
作者: Naman Aggarwal,Siddhartha R. Dalal,Vishal Misra
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Transformers empirically perform precise probabilistic reasoning in carefully constructed "Bayesian wind tunnels" and in large-scale language models, yet the mechanisms by which gradient-based learning creates the required internal geometry remain opaque. We provide a complete first-order analysis of how cross-entropy training reshapes attention scores and value vectors in a transformer attention head. Our core result is an *advantage-based routing law* for attention scores, $$\frac{\partial L}{\partial s_{ij}} = \alpha_{ij}\bigl(b_{ij}-\mathbb{E}_{\alpha_i}[b]\bigr), \qquad b_{ij} := u_i^\top v_j,$$ coupled with a *responsibility-weighted update* for values, $$\Delta v_j = -\eta\sum_i \alpha_{ij} u_i,$$ where $u_i$ is the upstream gradient at position $i$ and $\alpha_{ij}$ are attention weights. These equations induce a positive feedback loop in which routing and content specialize together: queries route more strongly to values that are above-average for their error signal, and those values are pulled toward the queries that use them. We show that this coupled specialization behaves like a two-timescale EM procedure: attention weights implement an E-step (soft responsibilities), while values implement an M-step (responsibility-weighted prototype updates), with queries and keys adjusting the hypothesis frame. Through controlled simulations, including a sticky Markov-chain task where we compare a closed-form EM-style update to standard SGD, we demonstrate that the same gradient dynamics that minimize cross-entropy also sculpt the low-dimensional manifolds identified in our companion work as implementing Bayesian inference. This yields a unified picture in which optimization (gradient flow) gives rise to geometry (Bayesian manifolds), which in turn supports function (in-context probabilistic reasoning).
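The routing law can be checked numerically in a few lines. Assuming PyTorch, the sketch below builds a single softmax-attention read-out, picks a loss whose upstream gradient is exactly $u$, and compares autograd's score gradient against the closed form; all shapes are illustrative.

```python
import torch

torch.manual_seed(0)
n, d = 3, 4
s = torch.randn(n, n, requires_grad=True)   # attention scores s_ij
v = torch.randn(n, d)                       # value vectors v_j
u = torch.randn(n, d)                       # upstream gradients u_i = dL/do_i

alpha = torch.softmax(s, dim=-1)
o = alpha @ v                               # attention output o_i
loss = (o * u).sum()                        # makes dL/do_i equal u_i exactly
loss.backward()

b = u @ v.T                                 # b_ij = u_i^T v_j
advantage = b - (alpha * b).sum(-1, keepdim=True)
print(torch.allclose(s.grad, (alpha * advantage).detach(), atol=1e-6))  # True
```

The same check extends to batched multi-head attention, since the law holds row-wise per query.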

[AI-151] Space AI: Leveraging Artificial Intelligence for Space to Improve Life on Earth

【速读】:This paper addresses the lack of a systematic, unified framework for artificial intelligence (AI) in space exploration and applications, where missions face extreme uncertainty, limited human oversight, and communication delays. The key to the solution is to define Space AI as a unified interdisciplinary field and to organize it into a systematic framework of four mission contexts: (1) AI on Earth, covering intelligent mission planning, spacecraft design optimization, simulation, and ground-based data analytics; (2) AI in Orbit, focusing on satellite and station autonomy, space robotics, on-board/near-real-time data processing, and orbital safety; (3) AI in Deep Space, enabling autonomous navigation, adaptive scientific discovery, resource mapping, and long-duration human-AI collaboration; and (4) AI for Multi-Planetary Life, supporting in-situ resource utilization, habitat construction, life support, and resilient interplanetary networks. Through this framework, the paper promotes the deep transfer of AI to space scenarios while translating advances in sensing, robotics, optimization, and trustworthy AI into broad benefits for society on Earth.

链接: https://arxiv.org/abs/2512.22399
作者: Ziyang Wang
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Space Physics (physics.space-ph)
备注:

点击查看摘要

Abstract:Artificial Intelligence (AI) is transforming domains from healthcare and agriculture to finance and industry. As progress on Earth meets growing constraints, the next frontier is outer space, where AI can enable autonomous, resilient operations under extreme uncertainty and limited human oversight. This paper introduces Space AI as a unified interdisciplinary field at the intersection of artificial intelligence and space science and technology. We consolidate historical developments and contemporary progress, and propose a systematic framework that organises Space AI into four mission contexts: (1) AI on Earth, covering intelligent mission planning, spacecraft design optimisation, simulation, and ground-based data analytics; (2) AI in Orbit, focusing on satellite and station autonomy, space robotics, on-board/near-real-time data processing, communication optimisation, and orbital safety; (3) AI in Deep Space, enabling autonomous navigation, adaptive scientific discovery, resource mapping, and long-duration human-AI collaboration under communication constraints; and (4) AI for Multi-Planetary Life, supporting in-situ resource utilisation, habitat and infrastructure construction, life-support and ecological management, and resilient interplanetary networks. Ultimately, Space AI can accelerate humanity's capability to explore and operate in space, while translating advances in sensing, robotics, optimisation, and trustworthy AI into broad societal impact on Earth.

[AI-152] Literature Mining System for Nutraceutical Biosynthesis: From AI Framework to Biological Insight

【速读】:This paper addresses the difficulty of extracting information about nutraceutical-relevant microbial strains from unstructured scientific literature, a key bottleneck in nutraceutical research. The key to the solution is a domain-adapted system built on Large Language Models (LLMs) and guided by advanced prompt engineering: few-shot prompting and tailored query designs automate the identification of nutraceutical-producing strains, with DeepSeekV3 proving more accurate than LLaMA2 when strain-specific information is included, thereby making literature mining markedly more scalable and interpretable.

链接: https://arxiv.org/abs/2512.22225
作者: Xinyang Sun,Nipon Sarmah,Miao Guo
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The extraction of structured knowledge from scientific literature remains a major bottleneck in nutraceutical research, particularly when identifying microbial strains involved in compound biosynthesis. This study presents a domain-adapted system powered by large language models (LLMs) and guided by advanced prompt engineering techniques to automate the identification of nutraceutical-producing microbes from unstructured scientific text. By leveraging few-shot prompting and tailored query designs, the system demonstrates robust performance across multiple configurations, with DeepSeekV3 outperforming LLaMA2 in accuracy, especially when domain-specific strain information is included. A structured and validated dataset comprising 35 nutraceutical-strain associations was generated, spanning amino acids, fibers, phytochemicals, and vitamins. The results reveal significant microbial diversity across monoculture and co-culture systems, with dominant contributions from Corynebacterium glutamicum, Escherichia coli, and Bacillus subtilis, alongside emerging synthetic consortia. This AI-driven framework not only enhances the scalability and interpretability of literature mining but also provides actionable insights for microbial strain selection, synthetic biology design, and precision fermentation strategies in the production of high-value nutraceuticals.

[AI-153] Neural ocean forecasting from sparse satellite-derived observations: a case-study for SSH dynamics and altimetry data

【速读】:This paper addresses short-term forecasting of global sea-surface dynamics from sparse satellite altimetry data, and in particular how to improve forecast accuracy when observational coverage is incomplete. The key to the solution is an end-to-end deep learning framework that combines two state-of-the-art architectures, U-Net (originally for image segmentation) and 4DVarNet (for spatiotemporal interpolation), taking sequences of sea level anomaly (SLA) observations as input and mapping them directly to full-field forecasts over a 7-day horizon, thus bridging sparse observations and high-resolution dynamical prediction in a single model. Trained on the GLORYS12 reanalysis with synthetic nadir sampling to emulate realistic observational coverage, the model clearly outperforms the operational Mercator Ocean forecast product, especially in high-variability regions.

链接: https://arxiv.org/abs/2512.22152
作者: Daria Botvynko(Lab-STICC_OSE, IMT Atlantique - MEE, IMT Atlantique),Pierre Haslée(Lab-STICC_OSE, IMT Atlantique - MEE, IMT Atlantique),Lucile Gaultier(ODL),Bertrand Chapron(LOPS),Clement de Boyer Montégut(LOPS),Anass El Aouni(MOi),Julien Le Sommer(IGE),Ronan Fablet(IMT Atlantique - MEE, Lab-STICC_OSE, ODYSSEY)
机构: 未知
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present an end-to-end deep learning framework for short-term forecasting of global sea surface dynamics based on sparse satellite altimetry data. Building on two state-of-the-art architectures: U-Net and 4DVarNet, originally developed for image segmentation and spatiotemporal interpolation respectively, we adapt the models to forecast the sea level anomaly and sea surface currents over a 7-day horizon using sequences of sparse nadir altimeters observations. The model is trained on data from the GLORYS12 operational ocean reanalysis, with synthetic nadir sampling patterns applied to simulate realistic observational coverage. The forecasting task is formulated as a sequence-to-sequence mapping, with the input comprising partial sea level anomaly (SLA) snapshots and the target being the corresponding future full-field SLA maps. We evaluate model performance using (i) normalized root mean squared error (nRMSE), (ii) averaged effective resolution, (iii) percentage of correctly predicted velocity magnitudes and angles, and benchmark results against the operational Mercator Ocean forecast product. Results show that end-to-end neural forecasts outperform the baseline across all lead times, with particularly notable improvements in high variability regions. Our framework is developed within the OceanBench benchmarking initiative, promoting reproducibility and standardized evaluation in ocean machine learning. These results demonstrate the feasibility and potential of end-to-end neural forecasting models for operational oceanography, even in data-sparse conditions.

[AI-154] The Complete Anatomy of the Madden-Julian Oscillation Revealed by Artificial Intelligence

【速读】:This paper addresses the problem of accurately defining the life cycle of the Madden-Julian Oscillation (MJO): because of its propagating nature, the established linear-projection method (the RMM index) often conflates mathematical artifacts with physical states, while clustering directly in raw data space is biased by a "propagation penalty". The key to the solution is an "AI-for-theory" paradigm: a deep learning model, PhysAnchor-MJO-AE, learns a latent representation in which vector distance corresponds to physical-feature similarity, enabling objective clustering of MJO dynamical states. This yields the first complete six-phase anatomical map of the MJO life cycle and objectively isolates two long-hypothesized transitional phases: organizational growth over the Indian Ocean and the northward shift over the Philippine Sea. From this anatomy, a physics-coherent monitoring framework is constructed that reduces the rates of spurious propagation and convective misplacement by over an order of magnitude compared with the classical index.

链接: https://arxiv.org/abs/2512.22144
作者: Xiao Zhou,Yuze Sun,Jie Wu,Xiaomeng Huang
机构: 未知
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Accurately defining the life cycle of the Madden-Julian Oscillation (MJO), the dominant mode of intraseasonal climate variability, remains a foundational challenge due to its propagating nature. The established linear-projection method (RMM index) often conflates mathematical artifacts with physical states, while direct clustering in raw data space is confounded by a “propagation penalty.” Here, we introduce an “AI-for-theory” paradigm to objectively discover the MJO’s intrinsic structure. We develop a deep learning model, PhysAnchor-MJO-AE, to learn a latent representation where vector distance corresponds to physical-feature similarity, enabling objective clustering of MJO dynamical states. Clustering these “MJO fingerprints” reveals the first complete, six-phase anatomical map of its life cycle. This taxonomy refines and critically completes the classical view by objectively isolating two long-hypothesized transitional phases: organizational growth over the Indian Ocean and the northward shift over the Philippine Sea. Derived from this anatomy, we construct a new physics-coherent monitoring framework that decouples location and intensity diagnostics. This framework reduces the rates of spurious propagation and convective misplacement by over an order of magnitude compared to the classical index. Our work transforms AI from a forecasting tool into a discovery microscope, establishing a reproducible template for extracting fundamental dynamical constructs from complex systems.

机器学习

[LG-0] End-to-End Test-Time Training for Long Context

链接: https://arxiv.org/abs/2512.23675
作者: Arnuv Tandon,Karan Dalal,Xinhao Li,Daniel Koceja,Marcel Rød,Sam Buchanan,Xiaolong Wang,Jure Leskovec,Sanmi Koyejo,Tatsunori Hashimoto,Carlos Guestrin,Jed McCaleb,Yejin Choi,Yu Sun
类目: Machine Learning (cs.LG)
*备注: Code: this https URL

点击查看摘要

Abstract:We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard architecture – a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model’s initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context. Our code is publicly available.
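The test-time inner loop described above reduces to a few lines. Below is a minimal sketch, assuming PyTorch and a causal LM that maps token ids to next-token logits (an interface assumption); the optimizer, learning rate, and step count are illustrative, and the meta-learned initialization from training time is out of scope here.

```python
import torch
import torch.nn.functional as F

def ttt_adapt(model, context_ids: torch.Tensor, lr: float = 1e-4, steps: int = 4):
    """Compress the given context into the weights via next-token prediction."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        logits = model(context_ids[:, :-1])          # (batch, T-1, vocab)
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            context_ids[:, 1:].reshape(-1),          # shifted next-token targets
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```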

[LG-1] Random Controlled Differential Equations

链接: https://arxiv.org/abs/2512.23670
作者: Francesco Piatti,Thomas Cass,William F. Turner
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We introduce a training-efficient framework for time-series learning that combines random features with controlled differential equations (CDEs). In this approach, large randomly parameterized CDEs act as continuous-time reservoirs, mapping input paths to rich representations. Only a linear readout layer is trained, resulting in fast, scalable models with strong inductive bias. Building on this foundation, we propose two variants: (i) Random Fourier CDEs (RF-CDEs): these lift the input signal using random Fourier features prior to the dynamics, providing a kernel-free approximation of RBF-enhanced sequence models; (ii) Random Rough DEs (R-RDEs): these operate directly on rough-path inputs via a log-ODE discretization, using log-signatures to capture higher-order temporal interactions while remaining stable and efficient. We prove that in the infinite-width limit, these models induce the RBF-lifted signature kernel and the rough signature kernel, respectively, offering a unified perspective on random-feature reservoirs, continuous-time deep architectures, and path-signature theory. We evaluate both models across a range of time-series benchmarks, demonstrating competitive or state-of-the-art performance. These methods provide a practical alternative to explicit signature computations, retaining their inductive bias while benefiting from the efficiency of random features.

[LG-2] Simultaneous Approximation of the Score Function and Its Derivatives by Deep Neural Networks

链接: https://arxiv.org/abs/2512.23643
作者: Konstantin Yakovlev,Nikita Puchkin
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 38 pages

点击查看摘要

Abstract:We present a theory for simultaneous approximation of the score function and its derivatives, enabling the handling of data distributions with low-dimensional structure and unbounded support. Our approximation error bounds match those in the literature while relying on assumptions that relax the usual bounded support requirement. Crucially, our bounds are free from the curse of dimensionality. Moreover, we establish approximation guarantees for derivatives of any prescribed order, extending beyond the commonly considered first-order setting.

[LG-3] Distribution-Free Process Monitoring with Conformal Prediction

链接: https://arxiv.org/abs/2512.23602
作者: Christopher Burger
类目: Machine Learning (cs.LG)
*备注: 9 pages, 4 figures

点击查看摘要

Abstract:Traditional Statistical Process Control (SPC) is essential for quality management but is limited by its reliance on statistical assumptions that are often violated, leading to unreliable monitoring in modern, complex manufacturing environments. This paper introduces a hybrid framework that enhances SPC by integrating the distribution-free, model-agnostic guarantees of Conformal Prediction. We propose two novel applications: Conformal-Enhanced Control Charts, which visualize process uncertainty and enable proactive signals like 'uncertainty spikes', and Conformal-Enhanced Process Monitoring, which reframes multivariate control as a formal anomaly detection problem using an intuitive p-value chart. Our framework provides a more robust and statistically rigorous approach to quality control while maintaining the interpretability and ease of use of classic methods.
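The p-value chart idea can be sketched directly. Below is a minimal numpy example with an illustrative nonconformity score (absolute deviation from the calibration mean): under exchangeability, in-control points receive (super-)uniform p-values, so persistently small values act as an out-of-control signal.

```python
import numpy as np

rng = np.random.default_rng(0)
calib = rng.normal(0.0, 1.0, 200)               # in-control calibration data
score = lambda x: np.abs(x - calib.mean())      # nonconformity score
cal_scores = score(calib)

def conformal_p(x_new: float) -> float:
    """Rank the new score among calibration scores (with the +1 correction)."""
    return (1 + np.sum(cal_scores >= score(x_new))) / (len(cal_scores) + 1)

for x in [0.1, 1.5, 4.0]:                       # 4.0 should look out-of-control
    print(x, round(conformal_p(x), 3))
```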

[LG-4] EEG-based Graph-guided Domain Adaptation for Robust Cross-Session Emotion Recognition

链接: https://arxiv.org/abs/2512.23526
作者: Maryam Mirzaei,Farzaneh Shayegh,Hamed Narimani
类目: Machine Learning (cs.LG)
*备注: 10 pages, 7 figures

点击查看摘要

Abstract:Accurate recognition of human emotional states is critical for effective human-machine interaction. Electroencephalography (EEG) offers a reliable source for emotion recognition due to its high temporal resolution and its direct reflection of neural activity. Nevertheless, variations across recording sessions present a major challenge for model generalization. To address this issue, we propose EGDA, a framework that reduces cross-session discrepancies by jointly aligning the global (marginal) and class-specific (conditional) distributions, while preserving the intrinsic structure of EEG data through graph regularization. Experimental results on the SEED-IV dataset demonstrate that EGDA achieves robust cross-session performance, obtaining accuracies of 81.22%, 80.15%, and 83.27% across three transfer tasks, and surpassing several baseline methods. Furthermore, the analysis highlights the Gamma frequency band as the most discriminative and identifies the central-parietal and prefrontal brain regions as critical for reliable emotion recognition.
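The joint marginal/conditional alignment at the core of EGDA can be sketched with simple mean discrepancies. The linear-kernel form, pseudo-labels for the target session, and the trade-off weight `mu` below are illustrative simplifications, not the paper's exact objective (which also adds graph regularization).

```python
import numpy as np

def mean_discrepancy(a: np.ndarray, b: np.ndarray) -> float:
    """Squared distance between feature means (a linear-kernel MMD)."""
    return float(np.sum((a.mean(axis=0) - b.mean(axis=0)) ** 2))

def joint_alignment_loss(Xs, ys, Xt, yt_pseudo, mu: float = 0.5) -> float:
    marginal = mean_discrepancy(Xs, Xt)          # global distribution gap
    conditional = np.mean([                      # per-class distribution gaps
        mean_discrepancy(Xs[ys == c], Xt[yt_pseudo == c])
        for c in np.unique(ys) if np.any(yt_pseudo == c)
    ])
    return (1 - mu) * marginal + mu * conditional

rng = np.random.default_rng(0)
Xs, ys = rng.normal(size=(50, 8)), np.repeat([0, 1], 25)
print(joint_alignment_loss(Xs, ys, Xs + 0.3, ys))  # shifted target session
```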

[LG-5] Trustworthy Machine Learning under Distribution Shifts

链接: https://arxiv.org/abs/2512.23524
作者: Zhuo Huang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: PhD Thesis

点击查看摘要

Abstract:Machine Learning (ML) has been a foundational topic in artificial intelligence (AI), providing both theoretical groundwork and practical tools for its exciting advancements. From ResNet for visual recognition to Transformer for vision-language alignment, AI models have achieved capabilities superior to humans on many tasks. Furthermore, the scaling law has enabled AI to develop early forms of general intelligence, as demonstrated by Large Language Models (LLMs). At this stage, AI has had an enormous influence on society and continues to shape the future for humanity. However, distribution shift remains a persistent "Achilles' heel", fundamentally limiting the reliability and general usefulness of ML systems. Moreover, generalization under distribution shift also raises trust issues for AI. Motivated by these challenges, my research focuses on *Trustworthy Machine Learning under Distribution Shifts*, with the goal of expanding AI's robustness and versatility, as well as its responsibility and reliability. We organize the common distribution shifts into three categories: (1) Perturbation Shift, (2) Domain Shift, and (3) Modality Shift. For all scenarios, we also rigorously investigate trustworthiness via three aspects: (1) Robustness, (2) Explainability, and (3) Adaptability. Based on these dimensions, we propose effective solutions and fundamental insights, aiming to enhance critical ML properties such as efficiency, adaptability, and safety.

[LG-6] Dynamic Subspace Composition: Efficient Adaptation via Contractive Basis Expansion

链接: https://arxiv.org/abs/2512.23448
作者: Vladimer Khasia
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mixture of Experts (MoE) models scale capacity but often suffer from representation collapse and gradient instability. We propose Dynamic Subspace Composition (DSC), a framework that approximates context-dependent weights via a state-dependent, sparse expansion of a shared basis bank. Formally, DSC models the weight update as a residual trajectory within a Star-Shaped Domain, employing a Magnitude-Gated Simplex Interpolation to ensure continuity at the identity. Unlike standard Mixture-of-LoRAs, which incurs O(M rd) parameter complexity by retrieving independent rank-r matrices, DSC constructs a compositional rank-K approximation from decoupled unit-norm basis vectors. This reduces parameter complexity to O(M d) and memory traffic to O(Kd), while Frame-Theoretic regularization and spectral constraints provide rigorous worst-case bounds on the dynamic update. The code is available at https://github.com/VladimerKhasia/DSC
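A rough sketch can make the contractive basis expansion concrete: a router picks K unit-norm basis pairs from a shared bank, mixes them with simplex weights, and gates the overall magnitude so the update shrinks continuously toward the identity. Names, shapes, and the sigmoid gate below are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

M, d, K = 16, 32, 4
bank_u = F.normalize(torch.randn(M, d), dim=-1)   # shared unit-norm bases
bank_v = F.normalize(torch.randn(M, d), dim=-1)

def compose_update(h: torch.Tensor, router: torch.nn.Linear) -> torch.Tensor:
    logits = router(h)                             # (M,) routing scores
    top = torch.topk(logits, K)
    w = torch.softmax(top.values, dim=-1)          # simplex weights over K bases
    gate = torch.sigmoid(logits.mean())            # scalar magnitude gate
    delta_W = gate * sum(                          # rank-<=K composed update
        wi * torch.outer(bank_u[i], bank_v[i])
        for wi, i in zip(w, top.indices)
    )
    return delta_W

router = torch.nn.Linear(d, M)
print(compose_update(torch.randn(d), router).shape)  # torch.Size([32, 32])
```

Because only K rank-one terms are touched per input, memory traffic stays O(Kd) in the sense described above.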

[LG-7] Assessing behaviour coverage in a multi-agent system simulation for autonomous vehicle testing

链接: https://arxiv.org/abs/2512.23445
作者: Manuel Franco-Vivo
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:As autonomous vehicle technology advances, ensuring the safety and reliability of these systems becomes paramount. Consequently, comprehensive testing methodologies are essential to evaluate the performance of autonomous vehicles in diverse and complex real-world scenarios. This study focuses on the behaviour coverage analysis of a multi-agent system simulation designed for autonomous vehicle testing, and provides a systematic approach to measure and assess behaviour coverage within the simulation environment. By defining a set of driving scenarios and agent interactions, we evaluate the extent to which the simulation encompasses a broad range of behaviours relevant to autonomous driving. Our findings highlight the importance of behaviour coverage in validating the effectiveness and robustness of autonomous vehicle systems. Through the analysis of behaviour coverage metrics and coverage-based testing, we identify key areas for improvement and optimization in the simulation framework. Thus, a Model Predictive Control (MPC) pedestrian agent is proposed, whose objective function is formulated to encourage *interesting* tests while promoting more realistic behaviour than other previously studied pedestrian agents. This research contributes to advancing the field of autonomous vehicle testing by providing insights into the comprehensive evaluation of system behaviour in simulated environments. The results offer valuable implications for enhancing the safety, reliability, and performance of autonomous vehicles through rigorous testing methodologies.

[LG-8] Task-driven Heterophilic Graph Structure Learning

链接: https://arxiv.org/abs/2512.23406
作者: Ayushman Raghuvanshi,Gonzalo Mateos,Sundeep Prabhakar Chepuri
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) often struggle to learn discriminative node representations for heterophilic graphs, where connected nodes tend to have dissimilar labels and feature similarity provides weak structural cues. We propose frequency-guided graph structure learning (FgGSL), an end-to-end graph inference framework that jointly learns homophilic and heterophilic graph structures along with a spectral encoder. FgGSL employs a learnable, symmetric, feature-driven masking function to infer said complementary graphs, which are processed using pre-designed low- and high-pass graph filter banks. A label-based structural loss explicitly promotes the recovery of homophilic and heterophilic edges, enabling task-driven graph structure learning. We derive stability bounds for the structural loss and establish robustness guarantees for the filter banks under graph perturbations. Experiments on six heterophilic benchmarks demonstrate that FgGSL consistently outperforms state-of-the-art GNNs and graph rewiring methods, highlighting the benefits of combining frequency information with supervised topology inference.

[LG-9] On the Sample Complexity of Learning for Blind Inverse Problems

链接: https://arxiv.org/abs/2512.23405
作者: Nathan Buskulic,Luca Calatroni,Lorenzo Rosasco,Silvia Villa
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Blind inverse problems arise in many experimental settings where the forward operator is partially or entirely unknown. In this context, methods developed for the non-blind case cannot be adapted in a straightforward manner. Recently, data-driven approaches have been proposed to address blind inverse problems, demonstrating strong empirical performance and adaptability. However, these methods often lack interpretability and are not supported by rigorous theoretical guarantees, limiting their reliability in applied domains such as imaging inverse problems. In this work, we shed light on learning in blind inverse problems within the simplified yet insightful framework of Linear Minimum Mean Square Estimators (LMMSEs). We provide an in-depth theoretical analysis, deriving closed-form expressions for optimal estimators and extending classical results. In particular, we establish equivalences with suitably chosen Tikhonov-regularized formulations, where the regularization depends explicitly on the distributions of the unknown signal, the noise, and the random forward operators. We also prove convergence results under appropriate source condition assumptions. Furthermore, we derive rigorous finite-sample error bounds that characterize the performance of learned estimators as a function of the noise level, problem conditioning, and number of available samples. These bounds explicitly quantify the impact of operator randomness and reveal the associated convergence rates as this randomness vanishes. Finally, we validate our theoretical findings through illustrative numerical experiments that confirm the predicted convergence behavior.

[LG-10] Beyond-Diagonal Reconfigurable Intelligent Surfaces for 6G Networks: Principles, Challenges, and Quantum Horizons

链接: https://arxiv.org/abs/2512.23400
作者: Abd Ullah Khan,Uman Khalid,Muhammad Tanveer,Trung Q. Duong,Hyundong Shin
类目: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:A beyond-diagonal reconfigurable intelligent surface (BD-RIS) is an innovative type of reconfigurable intelligent surface (RIS) that has recently been proposed and is considered a revolutionary advancement in wave manipulation. Unlike the mutually disconnected arrangement of elements in traditional RISs, BD-RIS creates cost-effective and simple inter-element connections, allowing for greater freedom in configuring the amplitude and phase of impinging waves. However, there are numerous underlying challenges in realizing the advantages associated with BD-RIS, prompting the research community to actively investigate cutting-edge schemes and algorithms in this direction. Particularly, the passive beamforming design for BD-RIS under specific environmental conditions has become a major focus in this research area. In this article, we provide a systematic introduction to BD-RIS, elaborating on its functional principles concerning architectural design, promising advantages, and classification. Subsequently, we present recent advances and identify a series of challenges and opportunities. Additionally, we consider a specific case study where beamforming is designed using four different algorithms, and we analyze their performance with respect to sum rate and computation cost. To augment the beamforming capabilities in 6G BD-RIS with quantum enhancement, we analyze various hybrid quantum-classical machine learning (ML) models to improve beam prediction performance, employing real-world communication Scenario 8 from the DeepSense 6G dataset. Consequently, we derive useful insights about the practical implications of BD-RIS.

[LG-11] Diffusion priors enhanced velocity model building from time-lag images using a neural operator

链接: https://arxiv.org/abs/2512.23375
作者: Xiao Ma,Mohammad Hasyim Taufik,Tariq Alkhalifah
类目: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注: 20 pages,19 figures

点击查看摘要

Abstract:Velocity model building is a crucial component for achieving high-precision subsurface imaging. However, conventional velocity model building methods are often computationally expensive and time-consuming. In recent years, with the rapid advancement of deep learning, particularly the success of generative models and neural operators, deep learning based approaches that integrate data and their statistics have attracted increasing attention in addressing the limitations of traditional methods. In this study, we propose a novel framework that combines generative models with neural operators to obtain high-resolution velocity models efficiently. Within this workflow, the neural operator serves as a forward mapping operator that rapidly generates time-lag reverse time migration (RTM) extended images from the true and migration velocity models; in other words, it acts as a surrogate for modeling followed by migration, using the true and migration velocities, respectively. The trained neural operator is then employed, through automatic differentiation, to gradually update the migration velocity placed in the true-velocity input channel with high-resolution components, so that the output of the network matches the time-lag images of observed data obtained using the migration velocity. By embedding, as a regularizer, a generative model trained on a distribution of high-resolution velocity models (the same true-velocity distribution used to train the neural operator), the resulting predictions are cleaner and carry higher-resolution information. Both synthetic and field data experiments demonstrate the effectiveness of the proposed generative neural operator based velocity model building approach.

[LG-12] ISOPO: Proximal policy gradients without pi-old

链接: https://arxiv.org/abs/2512.23353
作者: Nilin Abrahamsen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This note introduces Isometric Policy Optimization (ISOPO), an efficient method to approximate the natural policy gradient in a single gradient step. In comparison, existing proximal policy methods such as GRPO or CISPO use multiple gradient steps with variants of importance ratio clipping to approximate a natural gradient step relative to a reference policy. In its simplest form, ISOPO normalizes the log-probability gradient of each sequence in the Fisher metric before contracting with the advantages. Another variant of ISOPO transforms the microbatch advantages based on the neural tangent kernel in each layer. ISOPO applies this transformation layer-wise in a single backward pass and can be implemented with negligible computational overhead compared to vanilla REINFORCE.
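In its simplest form, the ISOPO direction can be sketched as below, assuming PyTorch: each sequence's log-probability gradient is rescaled to unit length before being contracted with its advantage. For brevity the sketch normalizes in the Euclidean norm as a stand-in for the Fisher-metric normalization described above, and all names are illustrative.

```python
import torch

def isopo_direction(model, seq_logps, advantages):
    """seq_logps: per-sequence summed log-probabilities (scalars in the graph)."""
    params = [p for p in model.parameters() if p.requires_grad]
    total = [torch.zeros_like(p) for p in params]
    for logp, adv in zip(seq_logps, advantages):
        g = torch.autograd.grad(logp, params, retain_graph=True)
        norm = torch.sqrt(sum((gi ** 2).sum() for gi in g)) + 1e-12
        for acc, gi in zip(total, g):
            acc += adv * gi / norm      # unit-norm direction, advantage-weighted
    return total                        # one accumulated update per parameter
```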

[LG-13] Deep learning for pedestrians: backpropagation in Transformers

链接: https://arxiv.org/abs/2512.23329
作者: Laurent Boué
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This document is a follow-up to our previous paper dedicated to a vectorized derivation of backpropagation in CNNs. Following the same principles and notations already put in place there, we now focus on transformer-based next-token-prediction architectures. To this end, we apply our lightweight index-free methodology to new types of layers such as embedding, multi-headed self-attention and layer normalization. In addition, we also provide gradient expressions for LoRA layers to illustrate parameter-efficient fine-tuning. Why bother doing manual backpropagation when there are so many tools that do this automatically? Any gap in understanding of how values propagate forward will become evident when attempting to differentiate the loss function. By working through the backward pass manually, we gain a deeper intuition for how each operation influences the final output. A complete PyTorch implementation of a minimalistic GPT-like network is also provided, along with analytical expressions for all of its gradient updates.
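In the same spirit as the document's manual derivations, the LoRA adapter gradients admit a quick numerical check. A minimal sketch, assuming PyTorch and the convention $y = x(W + AB)$ with $W$ frozen (the document's notation may differ): the closed forms $\partial L/\partial A = x^\top g B^\top$ and $\partial L/\partial B = A^\top x^\top g$, with $g = \partial L/\partial y$, should match autograd.

```python
import torch

torch.manual_seed(0)
n, d, k, r = 5, 8, 6, 2
x = torch.randn(n, d)
W = torch.randn(d, k)                        # frozen base weight
A = torch.randn(d, r, requires_grad=True)    # trainable low-rank factors
B = torch.randn(r, k, requires_grad=True)

y = x @ (W + A @ B)
g = torch.randn(n, k)                        # stand-in upstream gradient dL/dy
(y * g).sum().backward()                     # makes dL/dy equal g exactly

print(torch.allclose(A.grad, x.T @ g @ B.T, atol=1e-5))    # True
print(torch.allclose(B.grad, A.T @ (x.T @ g), atol=1e-5))  # True
```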

[LG-14] Spectral Analysis of Hard-Constraint PINNs: The Spatial Modulation Mechanism of Boundary Functions

链接: https://arxiv.org/abs/2512.23295
作者: Yuchen Xie,Honghang Chi,Haopeng Quan,Yahui Wang,Wei Wang,Yu Ma
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Physics-Informed Neural Networks with hard constraints (HC-PINNs) are increasingly favored for their ability to strictly enforce boundary conditions via a trial function ansatz $\tilde{u} = A + B \cdot N$, yet the theoretical mechanisms governing their training dynamics have remained unexplored. Unlike soft-constrained formulations where boundary terms act as additive penalties, this work reveals that the boundary function B introduces a multiplicative spatial modulation that fundamentally alters the learning landscape. A rigorous Neural Tangent Kernel (NTK) framework for HC-PINNs is established, deriving the explicit kernel composition law. This relationship demonstrates that the boundary function $B(\vec{x})$ functions as a spectral filter, reshaping the eigenspectrum of the neural network's native kernel. Through spectral analysis, the effective rank of the residual kernel is identified as a deterministic predictor of training convergence, superior to classical condition numbers. It is shown that widely used boundary functions can inadvertently induce spectral collapse, leading to optimization stagnation despite exact boundary satisfaction. Validated across multi-dimensional benchmarks, this framework transforms the design of boundary functions from a heuristic choice into a principled spectral optimization problem, providing a solid theoretical foundation for geometric hard constraints in scientific machine learning.

[LG-15] Revealing design archetypes and flexibility in e-molecule import pathways using Modeling to Generate Alternatives and interpretable machine learning

链接: https://arxiv.org/abs/2512.23284
作者: Mahdi Kchaou,Francesco Contino,Diederik Coppitters
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Given the central role of green e-molecule imports in the European energy transition, many studies optimize import pathways and identify a single cost-optimal solution. However, cost optimality is fragile, as real-world implementation depends on regulatory, spatial, and stakeholder constraints that are difficult to represent in optimization models and can render cost-optimal designs infeasible. To address this limitation, we generate a diverse set of near-cost-optimal alternatives within an acceptable cost margin using Modeling to Generate Alternatives, accounting for unmodeled uncertainties. Interpretable machine learning is then applied to extract insights from the resulting solution space. The approach is applied to hydrogen import pathways considering hydrogen, ammonia, methane, and methanol as carriers. Results reveal a broad near-optimal space with great flexibility: solar, wind, and storage are not strictly required to remain within 10% of the cost optimum. Wind constraints favor solar-storage methanol pathways, while limited storage favors wind-based ammonia or methane pathways.

[LG-16] On the Inverse Flow Matching Problem in the One-Dimensional and Gaussian Cases

链接: https://arxiv.org/abs/2512.23265
作者: Alexander Korotin,Gudmund Pammer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper studies the inverse problem of flow matching (FM) between distributions with finite exponential moment, a problem motivated by modern generative AI applications such as the distillation of flow matching models. Uniqueness of the solution is established in two cases - the one-dimensional setting and the Gaussian case. The general multidimensional problem remains open for future studies.

[LG-17] PFed-Signal: An ADR Prediction Model based on Federated Learning

链接: https://arxiv.org/abs/2512.23262
作者: Tao Li,Peilin Li,Kui Lu,Yilei Wang,Junliang Shang,Guangshun Li,Huiyu Zhou
类目: Machine Learning (cs.LG)
*备注: IEEE International Conference on Bioinformatics and Biomedicine

点击查看摘要

Abstract:The adverse drug reactions (ADRs) predicted based on the biased records in FAERS (U.S. Food and Drug Administration Adverse Event Reporting System) may mislead diagnosis online. Generally, such problems are addressed by optimizing the reporting odds ratio (ROR) or the proportional reporting ratio (PRR). However, these purely statistical methods cannot eliminate the biased data, leading to inaccurate signal prediction. In this paper, we propose PFed-Signal, a federated learning-based signal prediction model for ADRs, which utilizes the Euclidean distance to eliminate the biased data from FAERS, thereby improving the accuracy of ADR prediction. Specifically, we first propose PFed-Split, a method to split the original dataset into a split dataset based on ADR. Then we propose ADR-Signal, an ADR prediction model, including a biased data identification method based on federated learning and an ADR prediction model based on Transformer. The former identifies the biased data according to the Euclidean distance and generates a clean dataset by deleting the biased data. The latter is an ADR prediction model based on Transformer trained on the clean dataset. The results show that the ROR and PRR on the clean dataset are better than those of the traditional methods. Furthermore, the accuracy, F1 score, recall, and AUC of PFed-Signal are 0.887, 0.890, 0.913 and 0.957 respectively, which are higher than the baselines.
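The abstract's Euclidean-distance filtering step can be pictured with a few lines of NumPy. The sketch below drops records far from the centroid of the embedded dataset; the embedding, the quantile threshold, and the single-site (non-federated) setting are simplifying assumptions rather than the paper's exact protocol.

```python
# Minimal sketch of Euclidean-distance-based biased-record filtering.
import numpy as np

def clean_dataset(X: np.ndarray, quantile: float = 0.95) -> np.ndarray:
    """Drop records whose Euclidean distance to the centroid is extreme."""
    centroid = X.mean(axis=0)
    dist = np.linalg.norm(X - centroid, axis=1)
    return X[dist <= np.quantile(dist, quantile)]

rng = np.random.default_rng(1)
honest = rng.normal(0.0, 1.0, size=(950, 16))
biased = rng.normal(6.0, 1.0, size=(50, 16))     # injected biased reports
X = np.vstack([honest, biased])
print("before:", X.shape[0], "records; after:", clean_dataset(X).shape[0])
```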

[LG-18] FairGFL: Privacy-Preserving Fairness-Aware Federated Learning with Overlapping Subgraphs

链接: https://arxiv.org/abs/2512.23235
作者: Zihao Zhou,Shusen Yang,Fangyuan Zhao,Xuebin Ren
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Graph federated learning enables the collaborative extraction of high-order information from distributed subgraphs while preserving the privacy of raw data. However, graph data often exhibits overlap among different clients. Previous research has demonstrated certain benefits of overlapping data in mitigating data heterogeneity, but the negative effects have not been explored, particularly in cases where the overlaps are imbalanced across clients. In this paper, we uncover the unfairness issue arising from imbalanced overlapping subgraphs through both empirical observations and theoretical reasoning. To address this issue, we propose FairGFL (FAIRness-aware subGraph Federated Learning), a novel algorithm that enhances cross-client fairness while maintaining model utility in a privacy-preserving manner. Specifically, FairGFL incorporates an interpretable weighted aggregation approach to enhance fairness across clients, leveraging privacy-preserving estimation of their overlapping ratios. Furthermore, FairGFL improves the tradeoff between model utility and fairness by integrating a carefully crafted regularizer into the federated composite loss function. Through extensive experiments on four benchmark graph datasets, we demonstrate that FairGFL outperforms four representative baseline algorithms in terms of both model utility and fairness.

[LG-19] Energy and Memory-Efficient Federated Learning With Ordered Layer Freezing

链接: https://arxiv.org/abs/2512.23200
作者: Ziru Niu,Hai Dong,A.K. Qin,Tao Gu,Pengcheng Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) has emerged as a privacy-preserving paradigm for training machine learning models across distributed edge devices in the Internet of Things (IoT). By keeping data local and coordinating model training through a central server, FL effectively addresses privacy concerns and reduces communication overhead. However, the limited computational power, memory, and bandwidth of IoT edge devices pose significant challenges to the efficiency and scalability of FL, especially when training deep neural networks. Various FL frameworks have been proposed to reduce computation and communication overheads through dropout or layer freezing. However, these approaches often sacrifice accuracy or neglect memory constraints. To this end, in this work, we introduce Federated Learning with Ordered Layer Freezing (FedOLF). FedOLF consistently freezes layers in a predefined order before training, significantly mitigating computation and memory requirements. To further reduce communication and energy costs, we incorporate Tensor Operation Approximation (TOA), a lightweight alternative to conventional quantization that better preserves model accuracy. Experimental results demonstrate that over non-iid data, FedOLF achieves at least 0.3%, 6.4%, 5.81%, 4.4%, 6.27% and 1.29% higher accuracy than existing works respectively on EMNIST (with CNN), CIFAR-10 (with AlexNet), CIFAR-100 (with ResNet20 and ResNet44), and CINIC-10 (with ResNet20 and ResNet44), along with higher energy efficiency and lower memory footprint.
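Ordered layer freezing itself is straightforward to express in PyTorch. The sketch below freezes the first k modules of a small network so that only later layers are trained (and would be communicated in FL); the toy architecture, the input-side-first order, and the choice of k are assumptions, and FedOLF's freezing schedule and Tensor Operation Approximation are not reproduced.

```python
# Minimal sketch of ordered layer freezing for a resource-limited client.
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(32 * 32 * 32, 10),   # assumes 32x32 RGB inputs
)

def freeze_first_k_layers(model: nn.Sequential, k: int) -> None:
    """Freeze the first k modules; only the rest are trained and uploaded."""
    for i, module in enumerate(model):
        requires_grad = i >= k
        for p in module.parameters():
            p.requires_grad = requires_grad

freeze_first_k_layers(model, k=2)   # a weak client trains only later layers
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```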

[LG-20] PGOT: A Physics-Geometry Operator Transformer for Complex PDEs

链接: https://arxiv.org/abs/2512.23192
作者: Zhuo Zhang,Xi Yang,Yuan Zhao,Canqun Yang
类目: Machine Learning (cs.LG)
*备注: 22 pages, 17 figures

点击查看摘要

Abstract:While Transformers have demonstrated remarkable potential in modeling Partial Differential Equations (PDEs), modeling large-scale unstructured meshes with complex geometries remains a significant challenge. Existing efficient architectures often employ feature dimensionality reduction strategies, which inadvertently induce Geometric Aliasing, resulting in the loss of critical physical boundary information. To address this, we propose the Physics-Geometry Operator Transformer (PGOT), designed to reconstruct physical feature learning through explicit geometry awareness. Specifically, we propose Spectrum-Preserving Geometric Attention (SpecGeo-Attention). Utilizing a "physics slicing-geometry injection" mechanism, this module incorporates multi-scale geometric encodings to explicitly preserve multi-scale geometric features while maintaining linear computational complexity O(N) . Furthermore, PGOT dynamically routes computations to low-order linear paths for smooth regions and high-order non-linear paths for shock waves and discontinuities based on spatial coordinates, enabling spatially adaptive and high-precision physical field modeling. PGOT achieves consistent state-of-the-art performance across four standard benchmarks and excels in large-scale industrial tasks including airfoil and car designs.

[LG-21] A Simple Optimal and Efficient Algorithm for Online Exp-Concave Optimization

链接: https://arxiv.org/abs/2512.23190
作者: Yi-Han Wang,Peng Zhao,Zhi-Hua Zhou
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Online eXp-concave Optimization (OXO) is a fundamental problem in online learning. The standard algorithm, Online Newton Step (ONS), balances statistical optimality and computational practicality, guaranteeing an optimal regret of O(d \log T) , where d is the dimension and T is the time horizon. ONS faces a computational bottleneck due to the Mahalanobis projections at each round. This step costs \Omega(d^\omega) arithmetic operations for bounded domains, even for the unit ball, where \omega \in (2,3] is the matrix-multiplication exponent. As a result, the total runtime can reach \tilde{O}(d^\omega T) , particularly when iterates frequently oscillate near the domain boundary. For Stochastic eXp-concave Optimization (SXO), computational cost is also a challenge. Deploying ONS with online-to-batch conversion for SXO requires T = \tilde{O}(d/\epsilon) rounds to achieve an excess risk of \epsilon , and thereby necessitates an \tilde{O}(d^{\omega+1}/\epsilon) runtime. A COLT'13 open problem posed by Koren [2013] asks for an SXO algorithm with runtime less than \tilde{O}(d^{\omega+1}/\epsilon) . This paper proposes a simple variant of ONS, LightONS, which reduces the total runtime to O(d^2 T + d^\omega \sqrt{T} \log T) while preserving the optimal O(d \log T) regret. LightONS implies an SXO method with runtime \tilde{O}(d^3/\epsilon) , thereby answering the open problem. Importantly, LightONS preserves the elegant structure of ONS by leveraging domain-conversion techniques from parameter-free online learning to introduce a hysteresis mechanism that delays expensive Mahalanobis projections until necessary. This design enables LightONS to serve as an efficient plug-in replacement of ONS in broader scenarios, even beyond regret minimization, including gradient-norm adaptive regret, parametric stochastic bandits, and memory-efficient online learning.
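To make the projection bottleneck concrete, here is a minimal NumPy sketch of vanilla Online Newton Step on the unit ball for a squared-loss stream. The Mahalanobis projection is solved by bisection on the KKT multiplier, which is exactly the per-round cost that LightONS defers; the loss, gamma, and initialization values are illustrative, and LightONS's hysteresis mechanism is not implemented.

```python
# Minimal sketch of Online Newton Step (ONS) on the unit ball.
import numpy as np

def project_to_ball(y, A):
    """argmin_x (x - y)^T A (x - y) s.t. ||x||_2 <= 1. The KKT conditions
    give x(lam) = (A + lam*I)^{-1} A y with ||x(lam)|| decreasing in lam,
    so the multiplier is found by bisection."""
    if np.linalg.norm(y) <= 1.0:
        return y
    I, lo, hi = np.eye(len(y)), 0.0, 1e8
    for _ in range(60):
        lam = 0.5 * (lo + hi)
        x = np.linalg.solve(A + lam * I, A @ y)
        lo, hi = (lam, hi) if np.linalg.norm(x) > 1.0 else (lo, lam)
    return x

rng = np.random.default_rng(0)
d, gamma = 5, 1.0
x, A = np.zeros(d), 0.1 * np.eye(d)
x_star = rng.normal(size=d)
x_star /= 2 * np.linalg.norm(x_star)       # target inside the unit ball

for t in range(200):
    a = rng.normal(size=d)                 # one round of the loss stream
    g = 2.0 * (a @ x - a @ x_star) * a     # gradient of (a.x - a.x*)^2
    A += np.outer(g, g)                    # rank-one curvature update
    x = project_to_ball(x - np.linalg.solve(A, g) / gamma, A)

print("distance to target:", float(np.linalg.norm(x - x_star)))
```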

[LG-22] HELM-BERT: A Transformer for Medium-sized Peptide Property Prediction

链接: https://arxiv.org/abs/2512.23175
作者: Seungeon Lee,Takuto Koyama,Itsuki Maeda,Shigeyuki Matsumoto,Yasushi Okuno
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: 35 pages; includes Supplementary Information

点击查看摘要

Abstract:Therapeutic peptides have emerged as a pivotal modality in modern drug discovery, occupying a chemically and topologically rich space. While accurate prediction of their physicochemical properties is essential for accelerating peptide development, existing molecular language models rely on representations that fail to capture this complexity. Atom-level SMILES notation generates long token sequences and obscures cyclic topology, whereas amino-acid-level representations cannot encode the diverse chemical modifications central to modern peptide design. To bridge this representational gap, the Hierarchical Editing Language for Macromolecules (HELM) offers a unified framework enabling precise description of both monomer composition and connectivity, making it a promising foundation for peptide language modeling. Here, we propose HELM-BERT, the first encoder-based peptide language model trained on HELM notation. Based on DeBERTa, HELM-BERT is specifically designed to capture hierarchical dependencies within HELM sequences. The model is pre-trained on a curated corpus of 39,079 chemically diverse peptides spanning linear and cyclic structures. HELM-BERT significantly outperforms state-of-the-art SMILES-based language models in downstream tasks, including cyclic peptide membrane permeability prediction and peptide-protein interaction prediction. These results demonstrate that HELM’s explicit monomer- and topology-aware representations offer substantial data-efficiency advantages for modeling therapeutic peptides, bridging a long-standing gap between small-molecule and protein language models.

[LG-23] Certifying the Right to Be Forgotten: Primal-Dual Optimization for Sample and Label Unlearning in Vertical Federated Learning

链接: https://arxiv.org/abs/2512.23171
作者: Yu Jiang,Xindi Tong,Ziyao Liu,Xiaoxi Zhang,Kwok-Yan Lam,Chee Wei Tan
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Published in the IEEE Transactions on Information Forensics and Security

点击查看摘要

Abstract:Federated unlearning has become an attractive approach to address privacy concerns in collaborative machine learning, for situations when sensitive data is remembered by AI models during the machine learning process. It enables the removal of specific data influences from trained models, aligning with the growing emphasis on the “right to be forgotten.” While extensively studied in horizontal federated learning, unlearning in vertical federated learning (VFL) remains challenging due to the distributed feature architecture. VFL unlearning includes sample unlearning that removes specific data points’ influence and label unlearning that removes entire classes. Since different parties hold complementary features of the same samples, unlearning tasks require cross-party coordination, creating computational overhead and complexities from feature interdependencies. To address such challenges, we propose FedORA (Federated Optimization for data Removal via primal-dual Algorithm), designed for sample and label unlearning in VFL. FedORA formulates the removal of certain samples or labels as a constrained optimization problem solved using a primal-dual framework. Our approach introduces a new unlearning loss function that promotes classification uncertainty rather than misclassification. An adaptive step size enhances stability, while an asymmetric batch design, considering the prior influence of the remaining data on the model, handles unlearning and retained data differently to efficiently reduce computational costs. We provide theoretical analysis proving that the model difference between FedORA and Train-from-scratch is bounded, establishing guarantees for unlearning effectiveness. Experiments on tabular and image datasets demonstrate that FedORA achieves unlearning effectiveness and utility preservation comparable to Train-from-scratch with reduced computation and communication overhead.

[LG-24] Evaluating Parameter Efficient Methods for RLVR

链接: https://arxiv.org/abs/2512.23165
作者: Qingyu Yin,Yulun Wu,Zhennan Shen,Sunbowen Li,Zhilin Wang,Yanshu Li,Chak Tou Leong,Jiale Kang,Jinjin Gu
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:We systematically evaluate Parameter-Efficient Fine-Tuning (PEFT) methods under the paradigm of Reinforcement Learning with Verifiable Rewards (RLVR). RLVR incentivizes language models to enhance their reasoning capabilities through verifiable feedback; however, while methods like LoRA are commonly used, the optimal PEFT architecture for RLVR remains unidentified. In this work, we conduct the first comprehensive evaluation of over 12 PEFT methodologies across the DeepSeek-R1-Distill families on mathematical reasoning benchmarks. Our empirical results challenge the default adoption of standard LoRA with three main findings. First, we demonstrate that structural variants, such as DoRA, AdaLoRA, and MiSS, consistently outperform LoRA. Second, we uncover a spectral collapse phenomenon in SVD-informed initialization strategies (e.g., PiSSA, MiLoRA), attributing their failure to a fundamental misalignment between principal-component updates and RL optimization. Furthermore, our ablations reveal that extreme parameter reduction (e.g., VeRA, Rank-1) severely bottlenecks reasoning capacity. We further conduct ablation studies and scaling experiments to validate our findings. This work provides a definitive guide and advocates for further exploration of parameter-efficient RL methods.

[LG-25] Diffusion-based Decentralized Federated Multi-Task Representation Learning

链接: https://arxiv.org/abs/2512.23161
作者: Donghwa Kang,Shana Moothedath
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Representation learning is a widely adopted framework for learning in data-scarce environments to obtain a feature extractor or representation from various different yet related tasks. Despite extensive research on representation learning, decentralized approaches remain relatively underexplored. This work develops a decentralized projected gradient descent-based algorithm for multi-task representation learning. We focus on the problem of multi-task linear regression in which multiple linear regression models share a common, low-dimensional linear representation. We present an alternating projected gradient descent and minimization algorithm for recovering a low-rank feature matrix in a diffusion-based decentralized and federated fashion. We obtain constructive, provable guarantees that provide a lower bound on the required sample complexity and an upper bound on the iteration complexity of our proposed algorithm. We analyze the time and communication complexity of our algorithm and show that it is fast and communication-efficient. We performed numerical simulations to validate the performance of our algorithm and compared it with benchmark algorithms.
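The alternating structure described above can be sketched in a centralized form: per-task least squares for the task weights given the current representation, then a gradient step on the shared factor followed by a QR retraction back to orthonormal columns. The problem sizes, step size, and the omission of the diffusion-based communication are all simplifying assumptions.

```python
# Minimal centralized sketch of alternating minimization for multi-task
# linear regression with a shared low-rank representation.
import numpy as np

rng = np.random.default_rng(0)
d, k, tasks, n = 20, 3, 10, 60
U_true = np.linalg.qr(rng.normal(size=(d, k)))[0]      # shared representation
W_true = rng.normal(size=(k, tasks))                   # task-specific heads

X = [rng.normal(size=(n, d)) for _ in range(tasks)]
y = [X[i] @ U_true @ W_true[:, i] for i in range(tasks)]

U = np.linalg.qr(rng.normal(size=(d, k)))[0]           # random orthonormal init
for _ in range(100):
    # (1) per-task least squares given the current representation U
    W = np.stack([np.linalg.lstsq(X[i] @ U, y[i], rcond=None)[0]
                  for i in range(tasks)], axis=1)
    # (2) gradient step on U, then QR retraction to orthonormal columns
    grad = sum((X[i].T @ (X[i] @ U @ W[:, i] - y[i]))[:, None] @ W[:, i][None, :]
               for i in range(tasks)) / (tasks * n)
    U = np.linalg.qr(U - 0.1 * grad)[0]

# principal-angle distance between the learned and true subspaces
print("subspace distance:",
      float(np.linalg.norm((np.eye(d) - U @ U.T) @ U_true, 2)))
```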

[LG-26] A Weak Signal Learning Dataset and Its Baseline Method

链接: https://arxiv.org/abs/2512.23160
作者: Xianqi Liu,Xiangru Li,Lefeng He,Ziyu Fang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Weak signal learning (WSL) is a common challenge in many fields like fault diagnosis, medical imaging, and autonomous driving, where critical information is often masked by noise and interference, making feature identification difficult. Even in tasks with abundant strong signals, the key to improving model performance often lies in effectively extracting weak signals. However, the lack of dedicated datasets has long constrained research. To address this, we construct the first specialized dataset for weak signal feature learning, containing 13,158 spectral samples. It features low SNR dominance (over 55% of samples with SNR below 50) and extreme class imbalance (class ratio up to 29:1), providing a challenging benchmark for classification and regression in weak signal scenarios. We also propose a dual-view representation (vector + time-frequency map) and a PDVFN model tailored to low SNR, distribution skew, and dual imbalance. PDVFN extracts local sequential features and global frequency-domain structures in parallel, following principles of local enhancement, sequential modeling, noise suppression, multi-scale capture, frequency extraction, and global perception. This multi-source complementarity enhances representation for low-SNR and imbalanced data, offering a novel solution for WSL tasks like astronomical spectroscopy. Experiments show our method achieves higher accuracy and robustness in handling weak signals, high noise, and extreme class imbalance, especially in low SNR and imbalanced scenarios. This study provides a dedicated dataset and a baseline model, and establishes a foundation for future WSL research.

[LG-27] Graph Neural Networks with Transformer Fusion of Brain Connectivity Dynamics and Tabular Data for Forecasting Future Tobacco Use

链接: https://arxiv.org/abs/2512.23137
作者: Runzhi Zhou,Xi Luo
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Neurons and Cognition (q-bio.NC)
*备注: 22 pages, 4 figures

点击查看摘要

Abstract:Integrating non-Euclidean brain imaging data with Euclidean tabular data, such as clinical and demographic information, poses a substantial challenge for medical imaging analysis, particularly in forecasting future outcomes. While machine learning and deep learning techniques have been applied successfully to cross-sectional classification and prediction tasks, effectively forecasting outcomes in longitudinal imaging studies remains challenging. To address this challenge, we introduce a time-aware graph neural network model with transformer fusion (GNN-TF). This model flexibly integrates both tabular data and dynamic brain connectivity data, leveraging the temporal order of these variables within a coherent framework. By incorporating non-Euclidean and Euclidean sources of information from a longitudinal resting-state fMRI dataset from the National Consortium on Alcohol and Neurodevelopment in Adolescence (NCANDA), the GNN-TF enables a comprehensive analysis that captures critical aspects of longitudinal imaging data. Comparative analyses against a variety of established machine learning and deep learning models demonstrate that GNN-TF outperforms these state-of-the-art methods, delivering superior predictive accuracy for predicting future tobacco usage. The end-to-end, time-aware transformer fusion structure of the proposed GNN-TF model successfully integrates multiple data modalities and leverages temporal dynamics, making it a valuable analytic tool for functional brain imaging studies focused on clinical outcome prediction.

[LG-28] Principled Algorithms for Optimizing Generalized Metrics in Binary Classification ICML2025

链接: https://arxiv.org/abs/2512.23133
作者: Anqi Mao,Mehryar Mohri,Yutao Zhong
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: ICML 2025

点击查看摘要

Abstract:In applications with significant class imbalance or asymmetric costs, metrics such as the F_\beta -measure, AM measure, Jaccard similarity coefficient, and weighted accuracy offer more suitable evaluation criteria than standard binary classification loss. However, optimizing these metrics presents significant computational and statistical challenges. Existing approaches often rely on the characterization of the Bayes-optimal classifier, and use threshold-based methods that first estimate class probabilities and then seek an optimal threshold. This leads to algorithms that are not tailored to restricted hypothesis sets and lack finite-sample performance guarantees. In this work, we introduce principled algorithms for optimizing generalized metrics, supported by H -consistency and finite-sample generalization bounds. Our approach reformulates metric optimization as a generalized cost-sensitive learning problem, enabling the design of novel surrogate loss functions with provable H -consistency guarantees. Leveraging this framework, we develop new algorithms, METRO (Metric Optimization), with strong theoretical performance guarantees. We report the results of experiments demonstrating the effectiveness of our methods compared to prior baselines.

[LG-29] Multi-Agent Framework for Threat Mitigation and Resilience in AI-Based Systems

链接: https://arxiv.org/abs/2512.23132
作者: Armstrong Foundjem,Lionel Nganyewou Tidjon,Leuson Da Silva,Foutse Khomh
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 56 pages, 18 Figures, 22 Tables, TOSEM

点击查看摘要

Abstract:Machine learning (ML) underpins foundation models in finance, healthcare, and critical infrastructure, making them targets for data poisoning, model extraction, prompt injection, automated jailbreaking, and preference-guided black-box attacks that exploit model comparisons. Larger models can be more vulnerable to introspection-driven jailbreaks and cross-modal manipulation. Traditional cybersecurity lacks ML-specific threat modeling for foundation, multimodal, and RAG systems. Objective: Characterize ML security risks by identifying dominant TTPs, vulnerabilities, and targeted lifecycle stages. Methods: We extract 93 threats from MITRE ATLAS (26), AI Incident Database (12), and literature (55), and analyze 854 GitHub/Python repositories. A multi-agent RAG system (ChatGPT-4o, temp 0.4) mines 300+ articles to build an ontology-driven threat graph linking TTPs, vulnerabilities, and stages. Results: We identify unreported threats including commercial LLM API model stealing, parameter memorization leakage, and preference-guided text-only jailbreaks. Dominant TTPs include MASTERKEY-style jailbreaking, federated poisoning, diffusion backdoors, and preference optimization leakage, mainly impacting pre-training and inference. Graph analysis reveals dense vulnerability clusters in libraries with poor patch propagation. Conclusion: Adaptive, ML-specific security frameworks, combining dependency hygiene, threat intelligence, and monitoring, are essential to mitigate supply-chain and inference risks across the ML lifecycle.

[LG-30] SE-MLP Model for Predicting Prior Acceleration Features in Penetration Signals

链接: https://arxiv.org/abs/2512.23131
作者: Yankang Li,Changsheng Li
类目: Machine Learning (cs.LG)
*备注: 23 pages, 10 figures, 6 tables

点击查看摘要

Abstract:Accurate identification of the penetration process relies heavily on prior feature values of penetration acceleration. However, these feature values are typically obtained through long simulation cycles and expensive computations. To overcome this limitation, this paper proposes a multi-layer Perceptron architecture, termed squeeze and excitation multi-layer perceptron (SE-MLP), which integrates a channel attention mechanism with residual connections to enable rapid prediction of acceleration feature values. Using physical parameters under different working conditions as inputs, the model outputs layer-wise acceleration features, thereby establishing a nonlinear mapping between physical parameters and penetration characteristics. Comparative experiments against conventional MLP, XGBoost, and Transformer models demonstrate that SE-MLP achieves superior prediction accuracy, generalization, and stability. Ablation studies further confirm that both the channel attention module and residual structure contribute significantly to performance gains. Numerical simulations and range recovery tests show that the discrepancies between predicted and measured acceleration peaks and pulse widths remain within acceptable engineering tolerances. These results validate the feasibility and engineering applicability of the proposed method and provide a practical basis for rapidly generating prior feature values for penetration fuzes.
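Since the abstract specifies channel attention plus residual connections on an MLP, a generic block of that kind looks as follows in PyTorch. Note that classic squeeze-and-excitation squeezes over spatial positions; for a feature-vector MLP there is nothing to squeeze, so the gate below is computed directly from the hidden features. Layer sizes and the reduction ratio are illustrative, not the paper's configuration.

```python
# Minimal sketch of an MLP block with a squeeze-and-excitation-style
# channel gate and a residual connection.
import torch
import torch.nn as nn

class SEMLPBlock(nn.Module):
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.se = nn.Sequential(                 # excitation gate per channel
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.fc(x)
        return x + h * self.se(h)                # gated features + residual

block = SEMLPBlock(dim=64)
print(block(torch.randn(8, 64)).shape)           # torch.Size([8, 64])
```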

[LG-31] Osmotic Learning: A Self-Supervised Paradigm for Decentralized Contextual Data Representation

链接: https://arxiv.org/abs/2512.23096
作者: Mario Colosi,Reza Farahani,Maria Fazio,Radu Prodan,Massimo Villari
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Data within a specific context gains deeper significance beyond its isolated interpretation. In distributed systems, interdependent data sources reveal hidden relationships and latent structures, representing valuable information for many applications. This paper introduces Osmotic Learning (OSM-L), a self-supervised distributed learning paradigm designed to uncover higher-level latent knowledge from distributed data. The core of OSM-L is osmosis, a process that synthesizes dense and compact representations by extracting contextual information, eliminating the need for raw data exchange between distributed entities. OSM-L iteratively aligns local data representations, enabling information diffusion and convergence into a dynamic equilibrium that captures contextual patterns. During training, it also identifies correlated data groups, functioning as a decentralized clustering mechanism. Experimental results confirm OSM-L's convergence and representation capabilities on structured datasets, achieving over 0.99 accuracy in local information alignment while preserving contextual integrity.

[LG-32] FLEX-MoE: Federated Mixture-of-Experts with Load-balanced Expert Assignment

链接: https://arxiv.org/abs/2512.23070
作者: Boyang Zhang,Xiaobing Chen,Songyang Zhang,Shuai Zhang,Xiangwei Zhou,Mingxuan Sun
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) models enable scalable neural networks through conditional computation. However, their deployment with federated learning (FL) faces two critical challenges: 1) resource-constrained edge devices cannot store full expert sets, and 2) non-IID data distributions cause severe expert load imbalance that degrades model performance. To this end, we propose \textbfFLEX-MoE, a novel federated MoE framework that jointly optimizes expert assignment and load balancing under limited client capacity. Specifically, our approach introduces client-expert fitness scores that quantify the expert suitability for local datasets through training feedback, and employs an optimization-based algorithm to maximize client-expert specialization while enforcing balanced expert utilization system-wide. Unlike existing greedy methods that focus solely on personalization while ignoring load imbalance, our FLEX-MoE is capable of addressing the expert utilization skew, which is particularly severe in FL settings with heterogeneous data. Our comprehensive experiments on three different datasets demonstrate the superior performance of the proposed FLEX-MoE, together with its ability to maintain balanced expert utilization across diverse resource-constrained scenarios.

[LG-33] Breaking the Memory Wall: Exact Analytical Differentiation via Tiled Operator-Space Evolution

链接: https://arxiv.org/abs/2512.23068
作者: Shuhuan Wang,Yuzhen Xie,Jiayi Li,Yinliang Diao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Selective State Space Models (SSMs) achieve linear-time inference, yet their gradient-based sensitivity analysis remains bottlenecked by O(L) memory scaling during backpropagation. This memory constraint precludes genomic-scale modeling (L > 10^5) on consumer-grade hardware. We introduce Phase Gradient Flow (PGF), a framework that computes exact analytical derivatives by operating directly in the state-space manifold, bypassing the need to materialize the intermediate computational graph. By reframing SSM dynamics as Tiled Operator-Space Evolution (TOSE), our method delivers O(1) memory complexity relative to sequence length, yielding a 94% reduction in peak VRAM and a 23x increase in throughput compared to standard Autograd. Unlike parallel prefix scans that exhibit numerical divergence in stiff ODE regimes, PGF ensures stability through invariant error scaling, maintaining near-machine precision across extreme sequences. We demonstrate the utility of PGF on an impulse-response benchmark with 128,000-step sequences - a scale where conventional Autograd encounters prohibitive memory overhead, often leading to out-of-memory (OOM) failures in multi-layered models. Our work enables chromosome-scale sensitivity analysis on a single GPU, bridging the gap between theoretical infinite-context models and practical hardware limitations.

[LG-34] PI-MFM: Physics-informed multimodal foundation model for solving partial differential equations

链接: https://arxiv.org/abs/2512.23056
作者: Min Zhu,Jingmin Sun,Zecheng Zhang,Hayden Schaeffer,Lu Lu
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Partial differential equations (PDEs) govern a wide range of physical systems, and recent multimodal foundation models have shown promise for learning PDE solution operators across diverse equation families. However, existing multi-operator learning approaches are data-hungry and neglect physics during training. Here, we propose a physics-informed multimodal foundation model (PI-MFM) framework that directly enforces governing equations during pretraining and adaptation. PI-MFM takes symbolic representations of PDEs as the input, and automatically assembles PDE residual losses from the input expression via a vectorized derivative computation. These designs enable any PDE-encoding multimodal foundation model to be trained or adapted with unified physics-informed objectives across equation families. On a benchmark of 13 parametric one-dimensional time-dependent PDE families, PI-MFM consistently outperforms purely data-driven counterparts, especially with sparse labeled spatiotemporal points, partially observed time domains, or few labeled function pairs. Physics losses further improve robustness against noise, and simple strategies such as resampling collocation points substantially improve accuracy. We also analyze the accuracy, precision, and computational cost of automatic differentiation and finite differences for derivative computation within PI-MFM. Finally, we demonstrate zero-shot physics-informed fine-tuning to unseen PDE families: starting from a physics-informed pretrained model, adapting using only PDE residuals and initial/boundary conditions, without any labeled solution data, rapidly reduces test errors to around 1% and clearly outperforms physics-only training from scratch. These results show that PI-MFM provides a practical and scalable path toward data-efficient, transferable PDE solvers.
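The core mechanic, assembling a PDE residual loss from derivatives of the network output, can be shown for a single hard-coded equation. The sketch below uses torch autograd on collocation points for the inviscid Burgers equation u_t + u u_x = 0; PI-MFM's symbolic-PDE input and vectorized derivative assembly are replaced by this fixed example.

```python
# Minimal sketch of a physics-informed residual loss via autograd.
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1)
)

def pde_residual_loss(net, n_points: int = 256) -> torch.Tensor:
    xt = torch.rand(n_points, 2, requires_grad=True)    # collocation (x, t)
    u = net(xt)
    grads = torch.autograd.grad(u.sum(), xt, create_graph=True)[0]
    u_x, u_t = grads[:, 0:1], grads[:, 1:2]
    return ((u_t + u * u_x) ** 2).mean()                # residual of the PDE

loss = pde_residual_loss(net)
loss.backward()                                         # trainable end to end
print(float(loss))
```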

[LG-35] Mechanistic Analysis of Circuit Preservation in Federated Learning

链接: https://arxiv.org/abs/2512.23043
作者: Muhammad Haseeb,Salaar Masood,Muhammad Abdullah Sohail
类目: Machine Learning (cs.LG)
*备注: 15 pages

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative training of models on decentralized data, but its performance degrades significantly under Non-IID (non-independent and identically distributed) data conditions. While this accuracy loss is well-documented, the internal mechanistic causes remain a black box. This paper investigates the canonical FedAvg algorithm through the lens of Mechanistic Interpretability (MI) to diagnose this failure mode. We hypothesize that the aggregation of conflicting client updates leads to circuit collapse, the destructive interference of functional, sparse sub-networks responsible for specific class predictions. By training inherently interpretable, weight-sparse neural networks within an FL framework, we identify and track these circuits across clients and communication rounds. Using Intersection-over-Union (IoU) to quantify circuit preservation, we provide the first mechanistic evidence that Non-IID data distributions cause structurally distinct local circuits to diverge, leading to their degradation in the global model. Our findings reframe the problem of statistical drift in FL as a concrete, observable failure of mechanistic preservation, paving the way for more targeted solutions.
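The IoU metric used here is simple to state precisely: treat each circuit as the boolean mask of weights attributed to a class and compare masks. The sketch below assumes the attribution step has already produced such masks and only shows the overlap computation on synthetic data.

```python
# Minimal sketch of circuit-preservation measurement via mask IoU.
import numpy as np

def circuit_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter / union) if union else 1.0

rng = np.random.default_rng(0)
client = rng.random((256, 256)) < 0.05      # sparse circuit on one client
global_ = client.copy()
flip = rng.random(client.shape) < 0.02      # interference from aggregation
global_ ^= flip
print(f"IoU after aggregation: {circuit_iou(client, global_):.3f}")
```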

[LG-36] Merge before Forget: A Single LoRA Continual Learning via Continual Merging

链接: https://arxiv.org/abs/2512.23017
作者: Fuli Qiao,Mehrdad Mahdavi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Parameter-efficient continual learning has emerged as a promising approach for large language models (LLMs) to mitigate catastrophic forgetting while enabling adaptation to new tasks. Current Low-Rank Adaptation (LoRA) continual learning techniques often retain and freeze previously learned LoRAs or generate data representations to overcome forgetting, typically utilizing these to support new LoRAs learn new tasks. However, these methods not only ignore growing computational memory with tasks and limited storage space but also suffer from potential task interference due to the lack of effective LoRA merging mechanisms. In this paper, we propose a novel continual learning method that orthogonally initializes and sequentially merges LoRAs updates into a single unified LoRA. Our method leverages orthogonal basis extraction from previously learned LoRA to initialize the learning of new tasks, further exploits the intrinsic asymmetry property of LoRA components by using a time-aware scaling mechanism to balance new and old knowledge during continual merging. Our approach maintains constant memory complexity with respect to the number of tasks, minimizes interference between past and new tasks via orthogonal basis initialization, and improves performance over asymmetric LoRA merging via adaptive scaling. We provide theoretical analysis to justify our design and conduct extensive experiments across diverse continual learning benchmarks using various Llama models, demonstrating the effectiveness and efficiency of our method.

[LG-37] Fusion or Confusion? Multimodal Complexity Is Not All You Need

链接: https://arxiv.org/abs/2512.22991
作者: Tillmann Rheude,Roland Eils,Benjamin Wild
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning architectures for multimodal learning have increased in complexity, driven by the assumption that multimodal-specific methods improve performance. We challenge this assumption through a large-scale empirical study reimplementing 19 high-impact methods under standardized conditions, evaluating them across nine diverse datasets with up to 23 modalities, and testing their generalizability to new tasks beyond their original scope, including settings with missing modalities. We propose a Simple Baseline for Multimodal Learning (SimBaMM), a straightforward late-fusion Transformer architecture, and demonstrate that under standardized experimental conditions with rigorous hyperparameter tuning of all methods, more complex architectures do not reliably outperform SimBaMM. Statistical analysis indicates that more complex methods perform comparably to SimBaMM and frequently do not reliably outperform well-tuned unimodal baselines, especially in the small-data regime considered in many original studies. To support our findings, we include a case study of a recent multimodal learning method highlighting the methodological shortcomings in the literature. In addition, we provide a pragmatic reliability checklist to promote comparable, robust, and trustworthy future evaluations. In summary, we argue for a shift in focus: away from the pursuit of architectural novelty and toward methodological rigor.
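A late-fusion baseline of the kind described, one encoder per modality feeding a shared Transformer, fits in a few lines of PyTorch. The sketch below reduces each modality to a single token with a linear encoder before fusion; the dimensions, depth, and pooling are assumptions, since the abstract does not spell out SimBaMM's exact configuration.

```python
# Minimal sketch of a late-fusion Transformer over two modalities.
import torch
import torch.nn as nn

class LateFusionTransformer(nn.Module):
    def __init__(self, dims=(32, 100), d_model=64, n_classes=2):
        super().__init__()
        self.encoders = nn.ModuleList(nn.Linear(d, d_model) for d in dims)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, *modalities):
        # one token per modality -> (batch, n_modalities, d_model)
        tokens = torch.stack([enc(m) for enc, m in
                              zip(self.encoders, modalities)], dim=1)
        return self.head(self.fusion(tokens).mean(dim=1))

model = LateFusionTransformer()
logits = model(torch.randn(8, 32), torch.randn(8, 100))
print(logits.shape)                      # torch.Size([8, 2])
```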

[LG-38] A Context-Aware Temporal Modeling through Unified Multi-Scale Temporal Encoding and Hierarchical Sequence Learning for Single-Channel EEG Sleep Staging

链接: https://arxiv.org/abs/2512.22976
作者: Amirali Vakili,Salar Jahanshiri,Armin Salimi-Badr
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Automatic sleep staging is a critical task in healthcare due to the global prevalence of sleep disorders. This study focuses on single-channel electroencephalography (EEG), a practical and widely available signal for automatic sleep staging. Existing approaches face challenges such as class imbalance, limited receptive-field modeling, and insufficient interpretability. This work proposes a context-aware and interpretable framework for single-channel EEG sleep staging, with particular emphasis on improving detection of the N1 stage. Many prior models operate as black boxes with stacked layers, lacking clearly defined and interpretable feature extraction stages. The proposed model combines compact multi-scale feature extraction with temporal modeling to capture both local and long-range dependencies. To address data imbalance, especially in the N1 stage, class-weighted loss functions and data augmentation are applied. EEG signals are segmented into sub-epoch chunks, and final predictions are obtained by averaging softmax probabilities across chunks, enhancing contextual representation and robustness. The proposed framework achieves an overall accuracy of 89.72% and a macro-average F1-score of 85.46%. Notably, it attains an F1-score of 61.7% for the challenging N1 stage, demonstrating a substantial improvement over previous methods on the SleepEDF datasets. These results indicate that the proposed approach effectively improves sleep staging performance while maintaining interpretability and suitability for real-world clinical applications.
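The chunk-and-average inference rule from the abstract is easy to make concrete. In the sketch below, an epoch is split into sub-epoch chunks along the time axis, a per-chunk classifier produces softmax probabilities, and the averaged probabilities give the final stage; the dummy linear model and the 10-chunk split are placeholders.

```python
# Minimal sketch of sub-epoch chunking with softmax-probability averaging.
import torch

def predict_epoch(model, epoch: torch.Tensor, n_chunks: int = 10):
    chunks = epoch.chunk(n_chunks, dim=-1)            # split along time
    probs = torch.stack([model(c).softmax(-1) for c in chunks])
    return probs.mean(dim=0).argmax(dim=-1)           # average, then decide

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(300, 5))
epoch = torch.randn(4, 1, 3000)                       # batch of 30 s epochs
print(predict_epoch(model, epoch))                    # one stage per epoch
```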

[LG-39] FLOW: A Feedback-Driven Synthetic Longitudinal Dataset of Work and Wellbeing

链接: https://arxiv.org/abs/2512.22956
作者: Wafaa El Husseini
类目: Machine Learning (cs.LG)
*备注: This paper introduces FLOW, a synthetic longitudinal dataset for modeling daily work-life balance and wellbeing. The dataset is publicly available on Kaggle under a CC BY-SA 4.0 license

点击查看摘要

Abstract:Access to longitudinal, individual-level data on work-life balance and wellbeing is limited by privacy, ethical, and logistical constraints. This poses challenges for reproducible research, methodological benchmarking, and education in domains such as stress modeling, behavioral analysis, and machine learning. We introduce FLOW, a synthetic longitudinal dataset designed to model daily interactions between workload, lifestyle behaviors, and wellbeing. FLOW is generated using a rule-based, feedback-driven simulation that produces coherent temporal dynamics across variables such as stress, sleep, mood, physical activity, and body weight. The dataset simulates 1,000 individuals over a two-year period with daily resolution and is released as a publicly available resource. In addition to the static dataset, we describe a configurable data generation tool that enables reproducible experimentation under adjustable behavioral and contextual assumptions. FLOW is intended as a controlled experimental environment rather than a proxy for observed human populations, supporting exploratory analysis, methodological development, and benchmarking where real-world data are inaccessible.

[LG-40] Multiple Token Divergence: Measuring and Steering In-Context Computation Density

链接: https://arxiv.org/abs/2512.22944
作者: Vincent Herrmann,Eric Alcaide,Michael Wand,Jürgen Schmidhuber
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Measuring the in-context computational effort of language models is a key challenge, as metrics like next-token loss fail to capture reasoning complexity. Prior methods based on latent state compressibility can be invasive and unstable. We propose Multiple Token Divergence (MTD), a simple measure of computational effort defined as the KL divergence between a model’s full output distribution and that of a shallow, auxiliary prediction head. MTD can be computed directly from pre-trained models with multiple prediction heads, requiring no additional training. Building on this, we introduce Divergence Steering, a novel decoding method to control the computational character of generated text. We empirically show that MTD is more effective than prior methods at distinguishing complex tasks from simple ones. On mathematical reasoning benchmarks, MTD correlates positively with problem difficulty. Lower MTD is associated with more accurate reasoning. MTD provides a practical, lightweight tool for analyzing and steering the computational dynamics of language models.
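MTD as defined here is a per-token KL divergence between two output distributions, which the following sketch computes from raw logits. The direction KL(full || shallow) and the random logits are assumptions for illustration; in practice both heads would come from a multi-token-prediction model.

```python
# Minimal sketch of the per-token divergence between a full head and a
# shallow auxiliary head.
import torch
import torch.nn.functional as F

def multiple_token_divergence(full_logits, shallow_logits):
    """Per-token KL divergence between full and shallow output distributions."""
    log_p = F.log_softmax(full_logits, dim=-1)
    log_q = F.log_softmax(shallow_logits, dim=-1)
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1)

full = torch.randn(1, 8, 32000)        # (batch, seq, vocab)
shallow = torch.randn(1, 8, 32000)
print(multiple_token_divergence(full, shallow).shape)   # per-token scores
```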

[LG-41] MetaCD: A Meta Learning Framework for Cognitive Diagnosis based on Continual Learning

链接: https://arxiv.org/abs/2512.22904
作者: Jin Wu,Chanjin Zheng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Cognitive diagnosis is an essential research topic in intelligent education, aimed at assessing the level of mastery of different skills by students. So far, many research works have used deep learning models to explore the complex interactions between students, questions, and skills. However, the performance of existing methods is frequently limited by the long-tailed distribution and dynamic changes in the data. To address these challenges, we propose a meta-learning framework for cognitive diagnosis based on continual learning (MetaCD). This framework can alleviate the long-tailed problem by utilizing meta-learning to learn the optimal initialization state, enabling the model to achieve good accuracy on new tasks with only a small amount of data. In addition, we utilize a continual learning technique called the parameter protection mechanism to give MetaCD the ability to adapt to new skills or new tasks, in order to adapt to dynamic changes in data. MetaCD can not only improve the plasticity of our model on a single task, but also ensure the stability and generalization of the model on sequential tasks. Comprehensive experiments on five real-world datasets show that MetaCD outperforms other baselines in both accuracy and generalization.

[LG-42] Federated Multi-Task Clustering

链接: https://arxiv.org/abs/2512.22897
作者: S. Dai,G. Sun,F. Li,X. Tang,Q. Wang,Y. Cong
类目: Machine Learning (cs.LG); Multimedia (cs.MM)
*备注:

点击查看摘要

Abstract:Spectral clustering has emerged as one of the most effective clustering algorithms due to its superior performance. However, most existing models are designed for centralized settings, rendering them inapplicable in modern decentralized environments. Moreover, current federated learning approaches often suffer from poor generalization performance due to reliance on unreliable pseudo-labels, and fail to capture the latent correlations amongst heterogeneous clients. To tackle these limitations, this paper proposes a novel framework named Federated Multi-Task Clustering (i.e., FMTC), which intends to learn personalized clustering models for heterogeneous clients while collaboratively leveraging their shared underlying structure in a privacy-preserving manner. More specifically, the FMTC framework is composed of two main components: a client-side personalized clustering module, which learns a parameterized mapping model to support robust out-of-sample inference, bypassing the need for unreliable pseudo-labels; and a server-side tensorial correlation module, which explicitly captures the shared knowledge across all clients. This is achieved by organizing all client models into a unified tensor and applying a low-rank regularization to discover their common subspace. To solve this joint optimization problem, we derive an efficient, privacy-preserving distributed algorithm based on the Alternating Direction Method of Multipliers, which decomposes the global problem into parallel local updates on clients and an aggregation step on the server. Finally, extensive experiments on multiple real-world datasets demonstrate that our proposed FMTC framework significantly outperforms various baseline and state-of-the-art federated clustering algorithms.

[LG-43] heory and Algorithms for Learning with Multi-Class Abstention and Multi-Expert Deferral

链接: https://arxiv.org/abs/2512.22886
作者: Anqi Mao
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Ph.D. Dissertation, New York University

点击查看摘要

Abstract:Large language models (LLMs) have achieved remarkable performance but face critical challenges: hallucinations and high inference costs. Leveraging multiple experts offers a solution: deferring uncertain inputs to more capable experts improves reliability, while routing simpler queries to smaller, distilled models enhances efficiency. This motivates the problem of learning with multiple-expert deferral. This thesis presents a comprehensive study of this problem and the related problem of learning with abstention, supported by strong consistency guarantees. First, for learning with abstention (a special case of deferral), we analyze score-based and predictor-rejector formulations in multi-class classification. We introduce new families of surrogate losses and prove strong non-asymptotic, hypothesis set-specific consistency guarantees, resolving two existing open questions. We analyze both single-stage and practical two-stage settings, with experiments on CIFAR-10, CIFAR-100, and SVHN demonstrating the superior performance of our algorithms. Second, we address general multi-expert deferral in classification. We design new surrogate losses for both single-stage and two-stage scenarios and prove they benefit from strong H -consistency bounds. For the two-stage scenario, we show that our surrogate losses are realizable H -consistent for constant cost functions, leading to effective new algorithms. Finally, we introduce a novel framework for regression with deferral to address continuous label spaces. Our versatile framework accommodates multiple experts and various cost structures, supporting both single-stage and two-stage methods. It subsumes recent work on regression with abstention. We propose new surrogate losses with proven H -consistency and demonstrate the empirical effectiveness of the resulting algorithms.

[LG-44] Fundamental Novel Consistency Theory: H-Consistency Bounds

链接: https://arxiv.org/abs/2512.22880
作者: Yutao Zhong
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Ph.D. Dissertation, New York University

点击查看摘要

Abstract:In machine learning, the loss functions optimized during training often differ from the target loss that defines task performance due to computational intractability or lack of differentiability. We present an in-depth study of the target loss estimation error relative to the surrogate loss estimation error. Our analysis leads to H -consistency bounds, which are guarantees accounting for the hypothesis set H . These bounds offer stronger guarantees than Bayes-consistency or H -calibration and are more informative than excess error bounds. We begin with binary classification, establishing tight distribution-dependent and -independent bounds. We provide explicit bounds for convex surrogates (including linear models and neural networks) and analyze the adversarial setting for surrogates like \rho -margin and sigmoid loss. Extending to multi-class classification, we present the first H -consistency bounds for max, sum, and constrained losses, covering both non-adversarial and adversarial scenarios. We demonstrate that in some cases, non-trivial H -consistency bounds are unattainable. We also investigate comp-sum losses (e.g., cross-entropy, MAE), deriving their first H -consistency bounds and introducing smooth adversarial variants that yield robust learning algorithms. We develop a comprehensive framework for deriving these bounds across various surrogates, introducing new characterizations for constrained and comp-sum losses. Finally, we examine the growth rates of H -consistency bounds, establishing a universal square-root growth rate for smooth surrogates in binary and multi-class tasks, and analyze minimizability gaps to guide surrogate selection.

[LG-45] Adaptive Trust Consensus for Blockchain IoT: Comparing RL DRL and MARL Against Naive Collusive Adaptive Byzantine and Sleeper Attacks

链接: https://arxiv.org/abs/2512.22860
作者: Soham Padia,Dhananjay Vaidya,Ramchandra Mangrulkar
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 34 pages, 19 figures, 10 tables. Code available at this https URL

点击查看摘要

Abstract:Securing blockchain-enabled IoT networks against sophisticated adversarial attacks remains a critical challenge. This paper presents a trust-based delegated consensus framework integrating Fully Homomorphic Encryption (FHE) with Attribute-Based Access Control (ABAC) for privacy-preserving policy evaluation, combined with learning-based defense mechanisms. We systematically compare three reinforcement learning approaches – tabular Q-learning (RL), Deep RL with Dueling Double DQN (DRL), and Multi-Agent RL (MARL) – against five distinct attack families: Naive Malicious Attack (NMA), Collusive Rumor Attack (CRA), Adaptive Adversarial Attack (AAA), Byzantine Fault Injection (BFI), and Time-Delayed Poisoning (TDP). Experimental results on a 16-node simulated IoT network reveal significant performance variations: MARL achieves superior detection under collusive attacks (F1=0.85 vs. DRL’s 0.68 and RL’s 0.50), while DRL and MARL both attain perfect detection (F1=1.00) against adaptive attacks where RL fails (F1=0.50). All agents successfully defend against Byzantine attacks (F1=1.00). Most critically, the Time-Delayed Poisoning attack proves catastrophic for all agents, with F1 scores dropping to 0.11-0.16 after sleeper activation, demonstrating the severe threat posed by trust-building adversaries. Our findings indicate that coordinated multi-agent learning provides measurable advantages for defending against sophisticated trust manipulation attacks in blockchain IoT environments.

[LG-46] TEACH: Temporal Variance-Driven Curriculum for Reinforcement Learning

链接: https://arxiv.org/abs/2512.22824
作者: Gaurav Chaudhary,Laxmidhar Behera
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) has achieved significant success in solving single-goal tasks. However, uniform goal selection often results in sample inefficiency in multi-goal settings where agents must learn a universal goal-conditioned policy. Inspired by the adaptive and structured learning processes observed in biological systems, we propose a novel Student-Teacher learning paradigm with a Temporal Variance-Driven Curriculum to accelerate Goal-Conditioned RL. In this framework, the teacher module dynamically prioritizes goals with the highest temporal variance in the policy’s confidence score, parameterized by the state-action value (Q) function. The teacher provides an adaptive and focused learning signal by targeting these high-uncertainty goals, fostering continual and efficient progress. We establish a theoretical connection between the temporal variance of Q-values and the evolution of the policy, providing insights into the method’s underlying principles. Our approach is algorithm-agnostic and integrates seamlessly with existing RL frameworks. We demonstrate this through evaluation across 11 diverse robotic manipulation and maze navigation tasks. The results show consistent and notable improvements over state-of-the-art curriculum learning and goal-selection methods.
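The teacher's goal-selection rule can be prototyped with a sliding window of Q-value estimates per goal, sampling goals in proportion to their temporal variance. In the sketch below, the window length, the proportional sampling rule, and the synthetic Q-values are illustrative assumptions rather than the paper's exact mechanism.

```python
# Minimal sketch of a temporal-variance-driven goal teacher.
import numpy as np
from collections import deque

class VarianceTeacher:
    def __init__(self, n_goals: int, window: int = 20):
        self.history = [deque(maxlen=window) for _ in range(n_goals)]

    def record(self, goal: int, q_value: float) -> None:
        self.history[goal].append(q_value)

    def sample_goal(self, rng) -> int:
        var = np.array([np.var(h) if len(h) > 1 else np.inf
                        for h in self.history])
        if np.isinf(var).any():                 # try each goal at least twice
            return int(rng.choice(np.flatnonzero(np.isinf(var))))
        return int(rng.choice(len(var), p=var / var.sum()))

teacher, rng = VarianceTeacher(n_goals=5), np.random.default_rng(0)
for t in range(100):
    g = teacher.sample_goal(rng)
    teacher.record(g, q_value=rng.normal(scale=1.0 + g))  # fake Q estimates
print([len(h) for h in teacher.history])
```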

[LG-47] Long-Range Distillation: Distilling 10000 Years of Simulated Climate into Long Timestep AI Weather Models

链接: https://arxiv.org/abs/2512.22814
作者: Scott A. Martin,Noah Brenowitz,Dale Durran,Michael Pritchard
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Accurate long-range weather forecasting remains a major challenge for AI models, both because errors accumulate over autoregressive rollouts and because reanalysis datasets used for training offer a limited sample of the slow modes of climate variability underpinning predictability. Most AI weather models are autoregressive, producing short lead forecasts that must be repeatedly applied to reach subseasonal-to-seasonal (S2S) or seasonal lead times, often resulting in instability and calibration issues. Long-timestep probabilistic models that generate long-range forecasts in a single step offer an attractive alternative, but training on the 40-year reanalysis record leads to overfitting, suggesting orders of magnitude more training data are required. We introduce long-range distillation, a method that trains a long-timestep probabilistic “student” model to forecast directly at long-range using a huge synthetic training dataset generated by a short-timestep autoregressive “teacher” model. Using the Deep Learning Earth System Model (DLESyM) as the teacher, we generate over 10,000 years of simulated climate to train distilled student models for forecasting across a range of timescales. In perfect-model experiments, the distilled models outperform climatology and approach the skill of their autoregressive teacher while replacing hundreds of autoregressive steps with a single timestep. In the real world, they achieve S2S forecast skill comparable to the ECMWF ensemble forecast after ERA5 fine-tuning. The skill of our distilled models scales with increasing synthetic training data, even when that data is orders of magnitude larger than ERA5. This represents the first demonstration that AI-generated synthetic training data can be used to scale long-range forecast skill.

[LG-48] Discovering Transmission Dynamics of COVID-19 in China

链接: https://arxiv.org/abs/2512.22787
作者: Zhou Yang,Edward Dougherty,Chen Zhang,Zhenhe Pan,Fang Jin
类目: Machine Learning (cs.LG)
*备注: 14 pages

点击查看摘要

Abstract:A comprehensive retrospective analysis of public health interventions, such as large-scale testing, quarantining, and contact tracing, can help identify mechanisms most effective in mitigating COVID-19. We investigate China-based SARS-CoV-2 transmission patterns (e.g., infection type and likely transmission source) using publicly released tracking data. We collect case reports from local health commissions, the Chinese CDC, and official local government social media, then apply NLP and manual curation to construct transmission/tracking chains. We further analyze tracking data together with Wuhan population mobility data to quantify and visualize temporal and spatial spread dynamics. Results indicate substantial regional differences, with larger cities showing more infections, likely driven by social activities. Most symptomatic individuals (79%) were hospitalized within 5 days of symptom onset, and those with confirmed-case contact sought admission in under 5 days. Infection sources also shifted over time: early cases were largely linked to travel to (or contact with travelers from) Hubei Province, while later transmission was increasingly associated with social activities.

[LG-49] A Micro-Macro Machine Learning Framework for Predicting Childhood Obesity Risk Using NHANES and Environmental Determinants

链接: https://arxiv.org/abs/2512.22758
作者: Eswarasanthosh Kumar Mamillapalli,Nishtha Sharma
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Childhood obesity remains a major public health challenge in the United States, strongly influenced by a combination of individual-level, household-level, and environmental-level risk factors. Traditional epidemiological studies typically analyze these levels independently, limiting insights into how structural environmental conditions interact with individual-level characteristics to influence health outcomes. In this study, we introduce a micro-macro machine learning framework that integrates (1) individual-level anthropometric and socioeconomic data from NHANES and (2) macro-level structural environment features, including food access, air quality, and socioeconomic vulnerability extracted from USDA and EPA datasets. Four machine learning models (Logistic Regression, Random Forest, XGBoost, and LightGBM) were trained to predict obesity using NHANES microdata. XGBoost achieved the strongest performance. A composite environmental vulnerability index (EnvScore) was constructed using normalized indicators from USDA and EPA at the state level. Multi-level comparison revealed strong geographic similarity between states with high environmental burden and the nationally predicted micro-level obesity risk distribution. This demonstrates the feasibility of integrating multi-scale datasets to identify environment-driven disparities in obesity risk. This work contributes a scalable, data-driven, multi-level modeling pipeline suitable for public health informatics, demonstrating strong potential for expansion into causal modeling, intervention planning, and real-time analytics.

[LG-50] From Confounding to Learning: Dynamic Service Fee Pricing on Third-Party Platforms

链接: https://arxiv.org/abs/2512.22749
作者: Rui Ai,David Simchi-Levi,Feng Zhu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study the pricing behavior of third-party platforms facing strategic agents. Assuming the platform is a revenue maximizer, it observes market features that generally affect demand. Since only the equilibrium price and quantity are observable, this presents a general demand learning problem under confounding. Mathematically, we develop an algorithm with optimal regret of $\tilde{\mathcal{O}}(\sqrt{T} \wedge \sigma_S^{-2})$. Our results reveal that supply-side noise fundamentally affects the learnability of demand, leading to a phase transition in regret. Technically, we show that non-i.i.d. actions can serve as instrumental variables for learning demand. We also propose a novel homeomorphic construction that allows us to establish estimation bounds without assuming star-shapedness, providing the first efficiency guarantee for learning demand with deep neural networks. Finally, we demonstrate the practical applicability of our approach through simulations and real-world data from Zomato and Lyft.

[LG-51] Bridging Global Intent with Local Details: A Hierarchical Representation Approach for Semantic Validation in Text-to-SQL

链接: https://arxiv.org/abs/2512.22744
作者: Rihong Qiu,Zhibang Yang,Xinke Jiang,Weibin Liao,Xin Gao,Xu Chu,Junfeng Zhao,Yasha Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Text-to-SQL translates natural language questions into SQL statements grounded in a target database schema. Ensuring the reliability and executability of such systems requires validating generated SQL, but most existing approaches focus only on syntactic correctness, with few addressing semantic validation (detecting misalignments between questions and SQL). As a consequence, effective semantic validation still faces two key challenges: capturing both global user intent and SQL structural details, and constructing high-quality fine-grained sub-SQL annotations. To tackle these, we introduce HEROSQL, a hierarchical SQL representation approach that integrates global intent (via Logical Plans, LPs) and local details (via Abstract Syntax Trees, ASTs). To enable better information propagation, we employ a Nested Message Passing Neural Network (NMPNN) to capture inherent relational information in SQL and aggregate schema-guided semantics across LPs and ASTs. Additionally, to generate high-quality negative samples, we propose an AST-driven sub-SQL augmentation strategy, supporting robust optimization of fine-grained semantic inconsistencies. Extensive experiments conducted on Text-to-SQL validation benchmarks (both in-domain and out-of-domain settings) demonstrate that our approach outperforms existing state-of-the-art methods, achieving average improvements of 9.40% in AUPRC and 12.35% in AUROC in identifying semantic inconsistencies. It excels at detecting fine-grained semantic errors, provides large language models with more granular feedback, and ultimately enhances the reliability and interpretability of data querying platforms.

[LG-52] When Does Multi-Task Learning Fail? Quantifying Data Imbalance and Task Independence in Metal Alloy Property Prediction

链接: https://arxiv.org/abs/2512.22740
作者: Sungwoo Kang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-task learning (MTL) assumes related material properties share underlying physics that can be leveraged for better predictions. We test this by simultaneously predicting electrical resistivity, Vickers hardness, and amorphous-forming ability using 54,028 alloy samples. We compare single-task models against standard and structured MTL. Results reveal a striking dichotomy: MTL significantly degrades regression performance (resistivity $R^2$: 0.897 → 0.844; hardness $R^2$: 0.832 → 0.694, p < 0.01) but improves classification (amorphous F1: 0.703 → 0.744, p < 0.05; recall +17%). Analysis shows near-zero inter-task weights, indicating property independence. Regression failure is attributed to negative transfer caused by severe data imbalance (52k vs. 800 samples). We recommend independent models for precise regression, while reserving MTL for classification tasks where recall is critical.
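The negative-transfer mechanism the paper identifies can be seen in a toy hard-parameter-sharing setup, where a 52k-vs-800-style imbalance lets the large task dominate the shared trunk. A schematic sketch with invented sizes and random data, not the paper's models:

```python
# Hard-parameter-sharing MTL under severe task imbalance: the large task
# dominates every mini-batch, pulling shared features away from the small one.
import torch
import torch.nn as nn

shared = nn.Sequential(nn.Linear(16, 64), nn.ReLU())   # shared trunk
head_big = nn.Linear(64, 1)     # e.g. resistivity (~52k samples)
head_small = nn.Linear(64, 1)   # e.g. hardness (~800 samples)

params = (list(shared.parameters()) + list(head_big.parameters())
          + list(head_small.parameters()))
opt = torch.optim.Adam(params, lr=1e-3)

for step in range(1000):
    xa, ya = torch.randn(64, 16), torch.randn(64, 1)  # big-task batch
    xb, yb = torch.randn(1, 16), torch.randn(1, 1)    # tiny small-task batch
    loss = ((head_big(shared(xa)) - ya) ** 2).mean() \
         + ((head_small(shared(xb)) - yb) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```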

[LG-53] What Matters in Deep Learning for Time Series Forecasting?

链接: https://arxiv.org/abs/2512.22702
作者: Valentina Moretti,Andrea Cini,Ivan Marisca,Cesare Alippi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning models have grown increasingly popular in time series applications. However, the large quantity of newly proposed architectures, together with often contradictory empirical results, makes it difficult to assess which components contribute significantly to final performance. We aim to make sense of the current design space of deep learning architectures for time series forecasting by discussing the design dimensions and trade-offs that can explain often unexpected empirical results. This paper discusses the necessity of grounding model design on principles for forecasting groups of time series and how such principles can be applied to current models. In particular, we assess how concepts such as locality and globality apply to recent forecasting architectures. We show that accounting for these aspects can be more relevant for achieving accurate results than adopting specific sequence modeling layers and that simple, well-designed forecasting architectures can often match the state of the art. We discuss how overlooked implementation details in existing architectures (1) fundamentally change the class of the resulting forecasting method and (2) drastically affect the observed empirical results. Our results call for rethinking current faulty benchmarking practices and the need to focus on the foundational aspects of the forecasting problem when designing architectures. As a step in this direction, we propose an auxiliary forecasting model card, whose fields serve to characterize existing and new forecasting architectures based on key design choices.

[LG-54] Predictive Modeling of Power Outages during Extreme Events: Integrating Weather and Socio-Economic Factors

链接: https://arxiv.org/abs/2512.22699
作者: Antar Kumar Biswas,Masoud H. Nazari
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: This is a preprint of a manuscript currently under review at Electric Power Systems Research. The content may be subject to change following peer review

点击查看摘要

Abstract:This paper presents a novel learning-based framework for predicting power outages caused by extreme events. The proposed approach specifically targets low-probability, high-consequence outage scenarios and leverages a comprehensive set of features derived from publicly available data sources. We integrate EAGLE-I outage records (2014-2024) with weather, socio-economic, infrastructure, and seasonal event data. Incorporating social and demographic indicators reveals underlying patterns of community vulnerability and provides a clearer understanding of outage risk during extreme conditions. Four machine learning models (Random Forest (RF), Support Vector Machine (SVM), Adaptive Boosting (AdaBoost), and Long Short-Term Memory (LSTM)) are evaluated. Experimental validation is performed on a large-scale dataset covering counties in the lower peninsula of Michigan. Among all models tested, the LSTM network achieves the lowest prediction error. Additionally, the results demonstrate that stronger economic conditions and more developed infrastructure are associated with lower outage occurrence.

[LG-55] Beyond Centralization: Provable Communication Efficient Decentralized Multi-Task Learning

链接: https://arxiv.org/abs/2512.22675
作者: Donghwa Kang,Shana Moothedath
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Representation learning is a widely adopted framework for learning in data-scarce environments, aiming to extract common features from related tasks. While centralized approaches have been extensively studied, decentralized methods remain largely underexplored. We study decentralized multi-task representation learning in which the features share a low-rank structure. We consider multiple tasks, each with a finite number of data samples, where the observations follow a linear model with task-specific parameters. In the decentralized setting, task data are distributed across multiple nodes, and information exchange between nodes is constrained by a communication network. The goal is to recover the underlying feature matrix whose rank is much smaller than both the parameter dimension and the number of tasks. We propose a new alternating projected gradient and minimization algorithm with provable accuracy guarantees. We provide comprehensive characterizations of the time, communication, and sample complexities. Importantly, the communication complexity is independent of the target accuracy, which significantly reduces communication cost compared to prior methods. Numerical simulations validate the theoretical analysis across different dimensions and network topologies, and demonstrate regimes in which decentralized learning outperforms centralized federated approaches.
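The alternating structure — a local least-squares step for task weights, a gradient step on the shared low-rank factor, and gossip mixing over the network — can be sketched as follows; the gossip matrix, step size, and QR re-orthonormalization are our illustrative choices, not the paper's exact algorithm:

```python
# One alternating round per iteration: per-node least-squares for task
# weights, gradient step on the shared factor, gossip mixing, projection.
import numpy as np

d, r, n_nodes, n_samples = 20, 3, 4, 50
rng = np.random.default_rng(0)
B_true = np.linalg.qr(rng.normal(size=(d, r)))[0]

tasks = []  # each node holds one task: y = X (B_true w) + noise
for _ in range(n_nodes):
    X = rng.normal(size=(n_samples, d))
    w = rng.normal(size=r)
    tasks.append((X, X @ (B_true @ w) + 0.01 * rng.normal(size=n_samples)))

W = np.full((n_nodes, n_nodes), 1.0 / n_nodes)  # doubly stochastic mixing
B = [np.linalg.qr(rng.normal(size=(d, r)))[0] for _ in range(n_nodes)]

for it in range(200):
    grads = []
    for i, (X, y) in enumerate(tasks):
        w_i, *_ = np.linalg.lstsq(X @ B[i], y, rcond=None)  # minimization step
        resid = X @ B[i] @ w_i - y
        grads.append(X.T @ np.outer(resid, w_i) / n_samples)
    # Mix neighbors' factors, take a projected gradient step per node
    B = [np.linalg.qr(sum(W[i, j] * B[j] for j in range(n_nodes))
                      - 0.1 * grads[i])[0] for i in range(n_nodes)]
```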

[LG-56] Quantum Generative Models for Computational Fluid Dynamics: A First Exploration of Latent Space Learning in Lattice Boltzmann Simulations

链接: https://arxiv.org/abs/2512.22672
作者: Achraf Hsain,Fouad Mohammed Abbou
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents the first application of quantum generative models to learned latent space representations of computational fluid dynamics (CFD) data. While recent work has explored quantum models for learning statistical properties of fluid systems, the combination of discrete latent space compression with quantum generative sampling for CFD remains unexplored. We develop a GPU-accelerated Lattice Boltzmann Method (LBM) simulator to generate fluid vorticity fields, which are compressed into a discrete 7-dimensional latent space using a Vector Quantized Variational Autoencoder (VQ-VAE). The central contribution is a comparative analysis of quantum and classical generative approaches for modeling this physics-derived latent distribution: we evaluate a Quantum Circuit Born Machine (QCBM) and Quantum Generative Adversarial Network (QGAN) against a classical Long Short-Term Memory (LSTM) baseline. Under our experimental conditions, both quantum models produced samples with lower average minimum distances to the true distribution compared to the LSTM, with the QCBM achieving the most favorable metrics. This work provides: (1) a complete open-source pipeline bridging CFD simulation and quantum machine learning, (2) the first empirical study of quantum generative modeling on compressed latent representations of physics simulations, and (3) a foundation for future rigorous investigation at this intersection.

[LG-57] Clinically Calibrated Machine Learning Benchmarks for Large-Scale Multi-Disorder EEG Classification

链接: https://arxiv.org/abs/2512.22656
作者: Argha Kamal Samanta,Deepak Mewada,Monalisa Sarma,Debasis Samanta
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Clinical electroencephalography is routinely used to evaluate patients with diverse and often overlapping neurological conditions, yet interpretation remains manual, time-intensive, and variable across experts. While automated EEG analysis has been widely studied, most existing methods target isolated diagnostic problems, particularly seizure detection, and provide limited support for multi-disorder clinical screening. This study examines automated EEG-based classification across eleven clinically relevant neurological disorder categories, encompassing acute time-critical conditions, chronic neurocognitive and developmental disorders, and disorders with indirect or weak electrophysiological signatures. EEG recordings are processed using a standard longitudinal bipolar montage and represented through a multi-domain feature set capturing temporal statistics, spectral structure, signal complexity, and inter-channel relationships. Disorder-aware machine learning models are trained under severe class imbalance, with decision thresholds explicitly calibrated to prioritize diagnostic sensitivity. Evaluation on a large, heterogeneous clinical EEG dataset demonstrates that sensitivity-oriented modeling achieves recall exceeding 80% for the majority of disorder categories, with several low-prevalence conditions showing absolute recall gains of 15-30% after threshold calibration compared to default operating points. Feature importance analysis reveals physiologically plausible patterns consistent with established clinical EEG markers. These results establish realistic performance baselines for multi-disorder EEG classification and provide quantitative evidence that sensitivity-prioritized automated analysis can support scalable EEG screening and triage in real-world clinical settings.

[LG-58] Tree Meets Transformer: A Hybrid Architecture for Scalable Power Allocation in Cell-Free Networks

链接: https://arxiv.org/abs/2512.22639
作者: Irched Chafaa,Giacomo Bacci,Luca Sanguinetti
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Power allocation remains a fundamental challenge in wireless communication networks, particularly under dynamic user loads and large-scale deployments. While Transformer-based models have demonstrated strong performance, their computational cost scales poorly with the number of users. In this work, we propose a novel hybrid Tree-Transformer architecture that achieves scalable per-user power allocation. Our model compresses user features via a binary tree into a global root representation, applies a Transformer encoder solely to this root, and decodes per-user uplink and downlink powers through a shared decoder. This design achieves logarithmic depth and linear total complexity, enabling efficient inference across large and variable user sets without retraining or architectural changes. We evaluate our model on the max-min fairness problem in cell-free massive MIMO systems and demonstrate that it achieves near-optimal performance while significantly reducing inference time compared to full-attention baselines.
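A minimal sketch of the compress-attend-decode pattern: user embeddings are merged pairwise up a binary tree (logarithmic depth), a Transformer encoder runs only on the root token, and a shared decoder emits per-user uplink/downlink powers. Layer sizes and the pairwise merge rule are illustrative assumptions, not the paper's exact design:

```python
# Tree compression to a root token, attention on the root only, then a
# shared per-user decoder; works for variable user counts K.
import torch
import torch.nn as nn

class TreeTransformerAllocator(nn.Module):
    def __init__(self, feat_dim=8, hidden=32):
        super().__init__()
        self.embed = nn.Linear(feat_dim, hidden)
        self.merge = nn.Linear(2 * hidden, hidden)   # binary-tree merge node
        self.root_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True),
            num_layers=2)
        self.decoder = nn.Linear(2 * hidden, 2)      # per-user UL/DL power

    def forward(self, user_feats):                    # (batch, K, feat_dim)
        h = self.embed(user_feats)
        nodes = h
        while nodes.shape[1] > 1:                     # log-depth reduction
            if nodes.shape[1] % 2 == 1:               # pad odd levels
                nodes = torch.cat([nodes, nodes[:, -1:]], dim=1)
            pairs = nodes.reshape(nodes.shape[0], -1, 2 * nodes.shape[2])
            nodes = torch.relu(self.merge(pairs))
        root = self.root_encoder(nodes)               # attention on root only
        root_b = root.expand(-1, h.shape[1], -1)      # broadcast to all users
        return self.decoder(torch.cat([h, root_b], dim=-1))

alloc = TreeTransformerAllocator()
powers = alloc(torch.randn(4, 10, 8))                 # (4, 10, 2) for K=10
```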

[LG-59] Communication Compression for Distributed Learning with Aggregate and Server-Guided Feedback

链接: https://arxiv.org/abs/2512.22623
作者: Tomas Ortega,Chun-Yin Huang,Xiaoxiao Li,Hamid Jafarkhani
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Distributed learning, particularly Federated Learning (FL), faces a significant bottleneck in the communication cost, particularly the uplink transmission of client-to-server updates, which is often constrained by asymmetric bandwidth limits at the edge. Biased compression techniques are effective in practice, but require error feedback mechanisms to provide theoretical guarantees and to ensure convergence when compression is aggressive. Standard error feedback, however, relies on client-specific control variates, which violates user privacy and is incompatible with stateless clients common in large-scale FL. This paper proposes two novel frameworks that enable biased compression without client-side state or control variates. The first, Compressed Aggregate Feedback (CAFe), uses the globally aggregated update from the previous round as a shared control variate for all clients. The second, Server-Guided Compressed Aggregate Feedback (CAFe-S), extends this idea to scenarios where the server possesses a small private dataset; it generates a server-guided candidate update to be used as a more accurate predictor. We consider Distributed Gradient Descent (DGD) as a representative algorithm and analytically prove CAFe’s superiority to Distributed Compressed Gradient Descent (DCGD) with biased compression in the non-convex regime with bounded gradient dissimilarity. We further prove that CAFe-S converges to a stationary point, with a rate that improves as the server’s data become more representative. Experimental results in FL scenarios validate the superiority of our approaches over existing compression schemes.
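The key idea — using the previous round's globally aggregated update as a shared, stateless control variate — fits in a few lines. A sketch with a top-k compressor and toy quadratic client objectives, both of which are our assumptions:

```python
# CAFe-style aggregate feedback: clients compress only the deviation of
# their local update from last round's shared aggregate; the server adds
# the aggregate back. No per-client state or control variates required.
import numpy as np

def top_k(v, k=2):
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]   # keep k largest-magnitude entries
    out[idx] = v[idx]
    return out

d, n_clients, lr = 10, 5, 0.1
rng = np.random.default_rng(1)
targets = [rng.normal(size=d) for _ in range(n_clients)]  # client optima
x = np.zeros(d)
prev_agg = np.zeros(d)   # last round's aggregated update, known to all

for rnd in range(100):
    deltas = []
    for t in targets:
        g = x - t                        # gradient of 0.5 * ||x - t||^2
        local_update = -lr * g
        deltas.append(top_k(local_update - prev_agg))  # compressed deviation
    agg = prev_agg + np.mean(deltas, axis=0)           # server reconstruction
    x = x + agg
    prev_agg = agg
```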

[LG-60] Gold Price Prediction Using Long Short-Term Memory and Multi-Layer Perceptron with Gray Wolf Optimizer

链接: https://arxiv.org/abs/2512.22606
作者: Hesam Taghipour,Alireza Rezaee,Farshid Hajati
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The global gold market, by its fundamentals, has long been home to many financial institutions, banks, governments, funds, and micro-investors. Due to the inherent complexity of, and relationships between, important economic and political components, accurate forecasting of financial markets has always been challenging. Therefore, providing a model that can accurately predict the future of the markets is very important and will be of great benefit to its developers. In this paper, an artificial intelligence-based algorithm for daily and monthly gold forecasting is presented. Two long short-term memory (LSTM) networks are responsible for daily and monthly forecasting, the results of which are integrated into a multilayer perceptron (MLP) network to provide the final forecast of the next day's prices. The algorithm forecasts the highest, lowest, and closing prices on the daily and monthly time frames. Based on these forecasts, a trading strategy for live market trading was developed, according to which the proposed model had a return of 171% in three months. Also, the number of internal neurons in each network is optimized by the Gray Wolf Optimization (GWO) algorithm based on minimizing RMSE. The dataset was collected between 2010 and 2021 and includes data on macroeconomic indicators, energy markets, stocks, and the currency status of developed countries. Our proposed LSTM-MLP model predicted the daily closing price of gold with a mean absolute error (MAE) of 0.21 and the next month's price with an MAE of 22.23.

[LG-61] Cryptocurrency Price Prediction Using Parallel Gated Recurrent Units

链接: https://arxiv.org/abs/2512.22599
作者: Milad Asadpour,Alireza Rezaee,Farshid Hajati
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the advent of cryptocurrencies such as Bitcoin, many investments and businesses are now conducted online through cryptocurrencies. Among them, Bitcoin uses blockchain technology to make transactions secure, transparent, traceable, and immutable. It also exhibits significant price fluctuations and performance, which has attracted substantial attention, especially in financial sectors. Consequently, a wide range of investors and individuals have turned to investing in the cryptocurrency market. One of the most important challenges in economics is price forecasting for future trades. Cryptocurrencies are no exception, and investors are looking for methods to predict prices; various theories and methods have been proposed in this field. This paper presents a new deep model, called Parallel Gated Recurrent Units (PGRU), for cryptocurrency price prediction. In this model, recurrent neural networks forecast prices in a parallel and independent way. The parallel networks utilize different inputs, each representing distinct price-related features. Finally, the outputs of the parallel networks are combined by a neural network to forecast the future price of cryptocurrencies. The experimental results indicate that the proposed model achieves mean absolute percentage errors (MAPE) of 3.243% and 2.641% for window lengths 20 and 15, respectively. Our method therefore attains higher accuracy and efficiency with fewer input data and lower computational cost compared to existing methods.
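A minimal sketch of the parallel-and-independent design: one GRU per feature group, with a small MLP fusing the final hidden states. The feature split, layer sizes, and window length are illustrative assumptions rather than the paper's configuration:

```python
# Parallel, independent GRUs over different input series, fused by an MLP.
import torch
import torch.nn as nn

class ParallelGRU(nn.Module):
    def __init__(self, n_branches=3, in_dim=1, hidden=16):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.GRU(in_dim, hidden, batch_first=True) for _ in range(n_branches)])
        self.head = nn.Sequential(
            nn.Linear(n_branches * hidden, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, xs):            # xs: list of (batch, window, in_dim)
        # Take each branch's final hidden state, then fuse
        feats = [gru(x)[1][-1] for gru, x in zip(self.branches, xs)]
        return self.head(torch.cat(feats, dim=-1))

model = ParallelGRU()
window = 20                            # one of the window lengths studied
series = [torch.randn(8, window, 1) for _ in range(3)]  # e.g. O/H/L inputs
pred = model(series)                   # next-step price estimate, (8, 1)
```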

[LG-62] Energy-Guided Flow Matching Enables Few-Step Conformer Generation and Ground-State Identification

链接: https://arxiv.org/abs/2512.22597
作者: Guikun Xu,Xiaohan Yi,Peilin Zhao,Yatao Bian
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:

点击查看摘要

Abstract:Generating low-energy conformer ensembles and identifying ground-state conformations from molecular graphs remain computationally demanding with physics-based pipelines. Current learning-based approaches often suffer from a fragmented paradigm: generative models capture diversity but lack reliable energy calibration, whereas deterministic predictors target a single structure and fail to represent ensemble variability. Here we present EnFlow, a unified framework that couples flow matching (FM) with an explicitly learned energy model through an energy-guided sampling scheme defined along a non-Gaussian FM path. By incorporating energy-gradient guidance during sampling, our method steers trajectories toward lower-energy regions, substantially improving conformational fidelity, particularly in the few-step regime. The learned energy function further enables efficient energy-based ranking of generated ensembles for accurate ground-state identification. Extensive experiments on GEOM-QM9 and GEOM-Drugs demonstrate that EnFlow simultaneously improves generation metrics with 1–2 ODE-steps and reduces ground-state prediction errors compared with state-of-the-art methods.

[LG-63] Data-Driven Analysis of Crash Patterns in SAE Level 2 and Level 4 Automated Vehicles Using K-means Clustering and Association Rule Mining

链接: https://arxiv.org/abs/2512.22589
作者: Jewel Rana Palit(1),Vijayalakshmi K Kumarasamy(2),Osama A. Osman(3) ((1) Traffic Engineer/Project Manager-II, Collier County Government, Traffic Management Center, 2695 Francis Ave Unit D, Naples, Fl, 37221, (2) Department of Computer Science and Engineering, University of Tennessee at Chattanooga, Chattanooga, TN, USA 37403, (3) Chief Scientist, Center of Urban Informatics and Progress, Chattanooga, TN, USA 37403)
类目: Machine Learning (cs.LG)
*备注: 7 tables, 7 figures, 23 pages including references, presented in ASCE 2024 conference

点击查看摘要

Abstract:Automated Vehicles (AVs) hold the potential to reduce or eliminate human driving errors, enhance traffic safety, and support sustainable mobility. Recently, crash data has increasingly revealed that AV behavior can deviate from expected safety outcomes, raising concerns about the technology's safety and operational reliability in mixed traffic environments. While past research has investigated AV crashes, most studies rely on small, California-centered datasets, with a limited focus on understanding crash trends across various SAE Levels of automation. This study analyzes over 2,500 AV crash records from the United States National Highway Traffic Safety Administration (NHTSA), covering SAE Levels 2 and 4, to uncover underlying crash dynamics. A two-stage data mining framework is developed. K-means clustering is first applied to segment crash records into 4 distinct behavioral clusters based on temporal, spatial, and environmental factors. Then, Association Rule Mining (ARM) is used to extract interpretable multivariate relationships between crash patterns and crash contributors, including lighting conditions, surface conditions, vehicle dynamics, and environmental conditions, within each cluster. These insights provide actionable guidance for AV developers, safety regulators, and policymakers in formulating AV deployment strategies and minimizing crash risks.
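The two-stage pipeline is straightforward to sketch: cluster encoded crash attributes with K-means, then mine support/confidence rules within each cluster. The toy binary encoding and thresholds below are assumptions, and the simplified pairwise miner stands in for full Association Rule Mining:

```python
# Stage 1: K-means on encoded crash records. Stage 2: per-cluster
# support/confidence rule mining (pairwise, for illustration only).
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy one-hot crash attributes: [night, wet_surface, highway, injury]
X = (rng.random((500, 4)) < [0.4, 0.3, 0.5, 0.2]).astype(float)

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

def mine_rules(records, names, min_support=0.1, min_conf=0.6):
    rules = []
    for a, b in combinations(range(records.shape[1]), 2):
        support_both = np.mean(records[:, a] * records[:, b])
        support_a = max(records[:, a].mean(), 1e-9)
        if support_both >= min_support and support_both / support_a >= min_conf:
            rules.append((names[a], names[b], round(support_both, 3)))
    return rules

names = ["night", "wet_surface", "highway", "injury"]
for c in range(4):
    print(c, mine_rules(X[labels == c], names))
```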

[LG-64] On Admissible Rank-based Input Normalization Operators

链接: https://arxiv.org/abs/2512.22587
作者: Taeyun Kim
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 31 pages, 2 figures

点击查看摘要

Abstract:Rank-based input normalization is a workhorse of modern machine learning, prized for its robustness to scale, monotone transformations, and batch-to-batch variation. In many real systems, the ordering of feature values matters far more than their raw magnitudes - yet the structural conditions that a rank-based normalization operator must satisfy to remain stable under these invariances have never been formally pinned down. We show that widely used differentiable sorting and ranking operators fundamentally fail these criteria. Because they rely on value gaps and batch-level pairwise interactions, they are intrinsically unstable under strictly monotone transformations, shifts in mini-batch composition, and even tiny input perturbations. Crucially, these failures stem from the operators’ structural design, not from incidental implementation choices. To address this, we propose three axioms that formalize the minimal invariance and stability properties required of rank-based input normalization. We prove that any operator satisfying these axioms must factor into (i) a feature-wise rank representation and (ii) a scalarization map that is both monotone and Lipschitz-continuous. We then construct a minimal operator that meets these criteria and empirically show that the resulting constraints are non-trivial in realistic setups. Together, our results sharply delineate the design space of valid rank-based normalization operators and formally separate them from existing continuous-relaxation-based sorting methods.
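The factorization the axioms force — a feature-wise rank representation followed by a monotone, Lipschitz scalarization — has a simple instance: per-column ranks mapped into (0, 1). The operator below is our minimal example of such a construction, not necessarily the paper's:

```python
# Per-column rank normalization: invariant under any strictly monotone
# column-wise transform, since only orderings (not value gaps) are used.
import numpy as np

def rank_normalize(X):
    # Ranks 1..n per column (ties broken arbitrarily), scaled into (0, 1);
    # the scaling is a monotone, Lipschitz scalarization of the ranks.
    n = X.shape[0]
    ranks = X.argsort(axis=0).argsort(axis=0) + 1
    return ranks / (n + 1)

X = np.array([[1.0, 100.0], [2.0, 10.0], [1000.0, 20.0]])
# Same output under a strictly monotone transform of the inputs:
assert np.allclose(rank_normalize(X), rank_normalize(np.exp(X / 100)))
print(rank_normalize(X))
```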

[LG-65] Computing Pure-Strategy Nash Equilibria in a Two-Party Policy Competition: Existence and Algorithmic Approaches AAMAS2026

链接: https://arxiv.org/abs/2512.22552
作者: Chuang-Chieh Lin,Chi-Jen Lu,Po-An Chen,Chih-Chieh Hung
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: A full version of the extended abstract in AAMAS 2026

点击查看摘要

Abstract:We formulate two-party policy competition as a two-player non-cooperative game, generalizing Lin et al.'s work (2021). Each party selects a real-valued policy vector as its strategy from a compact subset of Euclidean space, and a voter’s utility for a policy is given by the inner product with their preference vector. To capture the uncertainty in the competition, we assume that a policy’s winning probability increases monotonically with its total utility across all voters, and we formalize this via an affine isotonic function. A player’s payoff is defined as the expected utility received by its supporters. In this work, we first test and validate the isotonicity hypothesis through voting simulations. Next, we prove the existence of a pure-strategy Nash equilibrium (PSNE) in both one- and multi-dimensional settings. Although we construct a counterexample demonstrating the game’s non-monotonicity, our experiments show that a decentralized gradient-based algorithm typically converges rapidly to an approximate PSNE. Finally, we present a grid-based search algorithm that finds an $\epsilon$-approximate PSNE of the game in time polynomial in the input size and $1/\epsilon$.

[LG-66] Decomposing Task Vectors for Refined Model Editing

链接: https://arxiv.org/abs/2512.22511
作者: Hamed Damirchi,Ehsan Abbasnejad,Zhen Zhang,Javen Shi
类目: Machine Learning (cs.LG)
*备注: 16 pages

点击查看摘要

Abstract:Large pre-trained models have transformed machine learning, yet adapting these models effectively to exhibit precise, concept-specific behaviors remains a significant challenge. Task vectors, defined as the difference between fine-tuned and pre-trained model parameters, provide a mechanism for steering neural networks toward desired behaviors. This has given rise to large repositories dedicated to task vectors tailored for specific behaviors. The arithmetic operation of these task vectors allows for the seamless combination of desired behaviors without the need for large datasets. However, these vectors often contain overlapping concepts that can interfere with each other during arithmetic operations, leading to unpredictable outcomes. We propose a principled decomposition method that separates each task vector into two components: one capturing shared knowledge across multiple task vectors, and another isolating information unique to each specific task. By identifying invariant subspaces across projections, our approach enables more precise control over concept manipulation without unintended amplification or diminution of other behaviors. We demonstrate the effectiveness of our decomposition method across three domains: improving multi-task merging in image classification by 5% using shared components as additional task vectors, enabling clean style mixing in diffusion models without generation degradation by mixing only the unique components, and achieving 47% toxicity reduction in language models while preserving performance on general knowledge tasks by negating the toxic information isolated to the unique component. Our approach provides a new framework for understanding and controlling task vector arithmetic, addressing fundamental limitations in model editing operations.
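One concrete way to realize such a split is to project each task vector onto the dominant subspace shared across vectors (shared component) and keep the residual (unique component). This SVD-based reading is our illustrative assumption, not necessarily the paper's exact decomposition:

```python
# Decompose task vectors into shared (common subspace) and unique parts.
import numpy as np

rng = np.random.default_rng(0)
d, n_tasks, k_shared = 100, 5, 2
common = rng.normal(size=(d, k_shared))          # shared concept directions
tasks = [common @ rng.normal(size=k_shared) + 0.3 * rng.normal(size=d)
         for _ in range(n_tasks)]

T = np.stack(tasks, axis=1)                      # (d, n_tasks)
U, S, Vt = np.linalg.svd(T, full_matrices=False)
P_shared = U[:, :k_shared] @ U[:, :k_shared].T   # projector onto shared span

shared_parts = [P_shared @ t for t in tasks]
unique_parts = [t - P_shared @ t for t in tasks]

# Example edit: negate only the unique part of one task vector, leaving
# the knowledge shared with the other tasks untouched.
edited = tasks[0] - 1.0 * unique_parts[0]
```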

[LG-67] The Quest for Winning Tickets in Low-Rank Adapters

链接: https://arxiv.org/abs/2512.22495
作者: Hamed Damirchi,Cristian Rodriguez-Opazo,Ehsan Abbasnejad,Zhen Zhang,Javen Shi
类目: Machine Learning (cs.LG)
*备注: 21 pages

点击查看摘要

Abstract:The Lottery Ticket Hypothesis (LTH) suggests that over-parameterized neural networks contain sparse subnetworks (“winning tickets”) capable of matching full model performance when trained from scratch. With the growing reliance on fine-tuning large pretrained models, we investigate whether LTH extends to parameter-efficient fine-tuning (PEFT), specifically focusing on Low-Rank Adaptation (LoRA) methods. Our key finding is that LTH holds within LoRAs, revealing sparse subnetworks that can match the performance of dense adapters. In particular, we find that the effectiveness of sparse subnetworks depends more on how much sparsity is applied in each layer than on the exact weights included in the subnetwork. Building on this insight, we propose Partial-LoRA, a method that systematically identifies said subnetworks and trains sparse low-rank adapters aligned with task-relevant subspaces of the pre-trained model. Experiments across 8 vision and 12 language tasks in both single-task and multi-task settings show that Partial-LoRA reduces the number of trainable parameters by up to 87%, while maintaining or improving accuracy. Our results not only deepen our theoretical understanding of transfer learning and the interplay between pretraining and fine-tuning but also open new avenues for developing more efficient adaptation strategies.
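A sketch of the underlying mechanics: a frozen base layer plus a low-rank adapter whose update is restricted by a fixed binary mask, giving a sparse subnetwork inside the LoRA. The magnitude-based mask below is an illustrative stand-in for the paper's subnetwork-identification procedure:

```python
# Frozen base layer + sparse low-rank adapter (LoRA with a fixed mask).
import torch
import torch.nn as nn

class SparseLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, keep_ratio=0.25):
        super().__init__()
        self.base = base.requires_grad_(False)          # frozen pretrained
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        with torch.no_grad():
            # Keep only the top-|keep_ratio| entries of A (toy mask rule)
            k = int((1 - keep_ratio) * self.A.numel())
            thresh = self.A.abs().flatten().kthvalue(k).values
            self.register_buffer("mask", (self.A.abs() > thresh).float())

    def forward(self, x):
        # Base output plus the sparse low-rank correction
        return self.base(x) + x @ (self.A * self.mask).T @ self.B.T

layer = SparseLoRALinear(nn.Linear(64, 64))
out = layer(torch.randn(2, 64))   # only masked A and B receive gradients
```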

[LG-68] Collaborative Optimization of Multiclass Imbalanced Learning: Density-Aware and Region-Guided Boosting

链接: https://arxiv.org/abs/2512.22478
作者: Chuantao Li,Zhi Li,Jiahao Xu,Jie Li,Sheng Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Numerous studies attempt to mitigate classification bias caused by class imbalance. However, existing studies have yet to explore the collaborative optimization of imbalanced learning and model training. This constraint hinders further performance improvements. To bridge this gap, this study proposes a collaborative optimization Boosting model for multiclass imbalanced learning. The model is simple but effective: by integrating a density factor and a confidence factor, it combines a noise-resistant weight-update mechanism with a dynamic sampling strategy. Rather than functioning as independent components, these modules are tightly integrated to orchestrate weight updates, sample region partitioning, and region-guided sampling. Thus, this study achieves the collaborative optimization of imbalanced learning and model training. Extensive experiments on 20 public imbalanced datasets demonstrate that the proposed model significantly outperforms eight state-of-the-art baselines. The code for the proposed model is available at: this https URL.

[LG-69] GLUE: Gradient-free Learning to Unify Experts

链接: https://arxiv.org/abs/2512.22467
作者: Jong-Ik Park,Shreyas Chaudhari,Srinivasa Pranav,Carlee Joe-Wong,José M. F. Moura
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In many deployed systems (multilingual ASR, cross-hospital imaging, region-specific perception), multiple pretrained specialist models coexist. Yet, new target domains often require domain expansion: a generalized model that performs well beyond any single specialist’s domain. Given such a new target domain, prior works seek a single strong initialization prior for the model parameters by first blending expert models to initialize a target model. However, heuristic blending – using coefficients based on data size or proxy metrics – often yields lower target-domain test accuracy, and learning the coefficients on the target loss typically requires computationally-expensive full backpropagation through the network. We propose GLUE, Gradient-free Learning To Unify Experts, which initializes the target model as a convex combination of fixed experts, learning the mixture coefficients of this combination via a gradient-free two-point (SPSA) update that requires only two forward passes per step. Across experiments on three datasets and three network architectures, GLUE produces a single prior that can be fine-tuned effectively to outperform baselines. GLUE improves test accuracy by up to 8.5% over data-size weighting and by up to 9.1% over proxy-metric selection. GLUE either outperforms backpropagation-based full-gradient mixing or matches its performance within 1.4%.
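The two-point SPSA update is the heart of the method: perturb the mixing logits in a random Rademacher direction, evaluate the loss twice (two forward passes), and step along the estimated gradient. A self-contained sketch with toy parameter-vector experts; the names and the target loss are assumptions:

```python
# Gradient-free learning of convex mixing coefficients over frozen experts
# via two-point SPSA (simultaneous perturbation stochastic approximation).
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 50, 3
experts = [rng.normal(size=d) for _ in range(n_experts)]
target = 0.6 * experts[0] + 0.4 * experts[2]      # unknown "good" init

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(logits):
    w = softmax(logits)                            # convex combination
    theta = sum(wi * ei for wi, ei in zip(w, experts))
    return np.mean((theta - target) ** 2)          # stand-in target loss

logits, lr, eps = np.zeros(n_experts), 0.5, 0.1
for step in range(200):
    delta = rng.choice([-1.0, 1.0], size=n_experts)    # Rademacher direction
    # SPSA: two forward passes yield a full gradient estimate
    g_hat = (loss(logits + eps * delta) - loss(logits - eps * delta)) \
            / (2 * eps) * delta
    logits -= lr * g_hat
print(softmax(logits))   # weights should concentrate near [0.6, 0.0, 0.4]
```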

[LG-70] AnalogSAGE: Self-evolving Analog Design Multi-Agents with Stratified Memory and Grounded Experience

链接: https://arxiv.org/abs/2512.22435
作者: Zining Wang,Jian Gao,Weimin Fu,Xiaolong Guo,Xuan Zhang
类目: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Analog circuit design remains a knowledge- and experience-intensive process that relies heavily on human intuition for topology generation and device parameter tuning. Existing LLM-based approaches typically depend on prompt-driven netlist generation or predefined topology templates, limiting their ability to satisfy complex specification requirements. We propose AnalogSAGE, an open-source self-evolving multi-agent framework that coordinates three-stage agent explorations through four stratified memory layers, enabling iterative refinement with simulation-grounded feedback. To support reproducibility and generality, we release the source code. Our benchmark spans ten specification-driven operational amplifier design problems of varying difficulty, enabling quantitative and cross-task comparison under identical conditions. Evaluated under the open-source SKY130 PDK with ngspice, AnalogSAGE achieves a 10× overall pass rate, a 48× Pass@1, and a 4× reduction in parameter search space compared with existing frameworks, demonstrating that stratified memory and grounded reasoning substantially enhance the reliability and autonomy of analog design automation in practice.

[LG-71] Causality-Inspired Safe Residual Correction for Multivariate Time Series

链接: https://arxiv.org/abs/2512.22428
作者: Jianxiang Xie,Yuncheng Hua
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:While modern multivariate forecasters such as Transformers and GNNs achieve strong benchmark performance, they often suffer from systematic errors at specific variables or horizons and, critically, lack guarantees against performance degradation in deployment. Existing post-hoc residual correction methods attempt to fix these errors, but are inherently greedy: although they may improve average accuracy, they can also “help in the wrong way” by overcorrecting reliable predictions and causing local failures in unseen scenarios. To address this critical “safety gap,” we propose CRC (Causality-inspired Safe Residual Correction), a plug-and-play framework explicitly designed to ensure non-degradation. CRC follows a divide-and-conquer philosophy: it employs a causality-inspired encoder to expose direction-aware structure by decoupling self- and cross-variable dynamics, and a hybrid corrector to model residual errors. Crucially, the correction process is governed by a strict four-fold safety mechanism that prevents harmful updates. Experiments across multiple datasets and forecasting backbones show that CRC consistently improves accuracy, while an in-depth ablation study confirms that its core safety mechanisms ensure exceptionally high non-degradation rates (NDR), making CRC a correction framework suited for safe and reliable deployment.

[LG-72] Differentiable Inverse Modeling with Physics-Constrained Latent Diffusion for Heterogeneous Subsurface Parameter Fields

链接: https://arxiv.org/abs/2512.22421
作者: Zihan Lin,QiZhi He
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*备注: 33 pages, 16 figures

点击查看摘要

Abstract:We present a latent diffusion-based differentiable inversion method (LD-DIM) for PDE-constrained inverse problems involving high-dimensional spatially distributed coefficients. LD-DIM couples a pretrained latent diffusion prior with an end-to-end differentiable numerical solver to reconstruct unknown heterogeneous parameter fields in a low-dimensional nonlinear manifold, improving numerical conditioning and enabling stable gradient-based optimization under sparse observations. The proposed framework integrates a latent diffusion model (LDM), trained in a compact latent space, with a differentiable finite-volume discretization of the forward PDE. Sensitivities are propagated through the discretization using adjoint-based gradients combined with reverse-mode automatic differentiation. Inversion is performed directly in latent space, which implicitly suppresses ill-conditioned degrees of freedom while preserving dominant structural modes, including sharp material interfaces. The effectiveness of LD-DIM is demonstrated using a representative inverse problem for flow in porous media, where heterogeneous conductivity fields are reconstructed from spatially sparse hydraulic head measurements. Numerical experiments assess convergence behavior and reconstruction quality for both Gaussian random fields and bimaterial coefficient distributions. The results show that LD-DIM achieves consistently improved numerical stability and reconstruction accuracy of both parameter fields and corresponding PDE solutions compared with physics-informed neural networks (PINNs) and physics-embedded variational autoencoder (VAE) baselines, while maintaining sharp discontinuities and reducing sensitivity to initialization.

[LG-73] PHANTOM: Physics-Aware Adversarial Attacks against Federated Learning-Coordinated EV Charging Management System

链接: https://arxiv.org/abs/2512.22381
作者: Mohammad Zakaria Haider,Amit Kumar Podder,Prabin Mali,Aranya Chakrabortty,Sumit Paudyal,Mohammad Ashiqur Rahman
类目: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The rapid deployment of electric vehicle charging stations (EVCS) within distribution networks necessitates intelligent and adaptive control to maintain the grid’s resilience and reliability. In this work, we propose PHANTOM, a physics-aware adversarial network that is trained and optimized through a multi-agent reinforcement learning model. PHANTOM integrates a physics-informed neural network (PINN) enabled by federated learning (FL) that functions as a digital twin of EVCS-integrated systems, ensuring physically consistent modeling of operational dynamics and constraints. Building on this digital twin, we construct a multi-agent RL environment that utilizes deep Q-networks (DQN) and soft actor-critic (SAC) methods to derive adversarial false data injection (FDI) strategies capable of bypassing conventional detection mechanisms. To examine the broader grid-level consequences, a transmission and distribution (T and D) dual simulation platform is developed, allowing us to capture cascading interactions between EVCS disturbances at the distribution level and the operations of the bulk transmission system. Results demonstrate how learned attack policies disrupt load balancing and induce voltage instabilities that propagate across T and D boundaries. These findings highlight the critical need for physics-aware cybersecurity to ensure the resilience of large-scale vehicle-grid integration.

[LG-74] Emotion classification using EEG headset signals and Random Forest

链接: https://arxiv.org/abs/2512.22333
作者: Ricardo Vasquez,Diego Riofrío-Luzcando,Joe Carrion-Jumbo,Cesar Guevara
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Preprint. Published version: this https URL

点击查看摘要

Abstract:Emotions are an important component of human beings and thus a valuable part of daily activities such as interaction with people, decision making, and learning. For this reason, it is important to detect, recognize, and understand emotions using computational systems to improve communication between people and machines, facilitating the ability of computers to understand communication between humans. This study proposes a model that classifies people’s emotions based on their EEG signals, for which the brain-computer interface EMOTIV EPOC was used. This allowed the collection of electroencephalographic information from 50 people, all of whom were shown audiovisual resources that helped to provoke the desired mood. The information obtained was stored in a database for model generation and the corresponding classification analysis. A Random Forest model was created for emotion prediction (happiness, sadness, and relaxation) based on the signals of any person. The model achieved accuracies of 97.21% for happiness, 76% for relaxation, and 76% for sadness. Finally, the model was used to build a real-time emotion prediction algorithm: it captures the person’s EEG signals, executes the generated model, and displays the result on the screen with the help of images representative of each emotion.

[LG-75] Decomposing Uncertainty in Probabilistic Knowledge Graph Embeddings: Why Entity Variance Is Not Enough

链接: https://arxiv.org/abs/2512.22318
作者: Chorok Lee
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Probabilistic knowledge graph embeddings represent entities as distributions, using learned variances to quantify epistemic uncertainty. We identify a fundamental limitation: these variances are relation-agnostic, meaning an entity receives identical uncertainty regardless of relational context. This conflates two distinct out-of-distribution phenomena that behave oppositely: emerging entities (rare, poorly-learned) and novel relational contexts (familiar entities in unobserved relationships). We prove an impossibility result: any uncertainty estimator using only entity-level statistics independent of relation context achieves near-random OOD detection on novel contexts. We empirically validate this on three datasets, finding 100 percent of novel-context triples have frequency-matched in-distribution counterparts. This explains why existing probabilistic methods achieve 0.99 AUROC on random corruptions but only 0.52-0.64 on temporal distribution shift. We formalize uncertainty decomposition into complementary components: semantic uncertainty from entity embedding variance (detecting emerging entities) and structural uncertainty from entity-relation co-occurrence (detecting novel contexts). Our main theoretical result proves these signals are non-redundant, and that any convex combination strictly dominates either signal alone. Our method (CAGP) combines semantic and structural uncertainty via learned weights, achieving 0.94-0.99 AUROC on temporal OOD detection across multiple benchmarks, a 60-80 percent relative improvement over relation-agnostic baselines. Empirical validation confirms complete frequency overlap on three datasets (FB15k-237, WN18RR, YAGO3-10). On selective prediction, our method reduces errors by 43 percent at 85 percent answer rate.
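The decomposition suggests a simple scoring recipe: one term from entity embedding variance, one from entity-relation co-occurrence, combined convexly. The sketch below uses an inverse-count structural score and a fixed alpha as illustrative stand-ins for CAGP's learned weighting:

```python
# Convex combination of semantic (entity-variance) and structural
# (co-occurrence) uncertainty, per the paper's decomposition.
import numpy as np

def semantic_uncertainty(entity_var):
    # e.g. mean learned embedding variance of the entity
    return float(np.mean(entity_var))

def structural_uncertainty(entity, relation, cooccur_counts):
    c = cooccur_counts.get((entity, relation), 0)
    return 1.0 / (1.0 + c)          # high when the pair was never observed

def combined_score(entity, relation, entity_var, cooccur_counts, alpha=0.5):
    u_sem = semantic_uncertainty(entity_var)
    u_str = structural_uncertainty(entity, relation, cooccur_counts)
    return alpha * u_sem + (1 - alpha) * u_str   # convex combination

counts = {("Berlin", "capital_of"): 120}          # toy co-occurrence table
var = np.array([0.02, 0.03, 0.01])
print(combined_score("Berlin", "capital_of", var, counts))  # low: familiar
print(combined_score("Berlin", "borders", var, counts))     # high: novel context
```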

[LG-76] Optimistic Feasible Search for Closed-Loop Fair Threshold Decision-Making

链接: https://arxiv.org/abs/2512.22313
作者: Wenzhang Du
类目: Machine Learning (cs.LG)
*备注: 6 pages, 3 figures

点击查看摘要

Abstract:Closed-loop decision-making systems (e.g., lending, screening, or recidivism risk assessment) often operate under fairness and service constraints while inducing feedback effects: decisions change who appears in the future, yielding non-stationary data and potentially amplifying disparities. We study online learning of a one-dimensional threshold policy from bandit feedback under demographic parity (DP) and, optionally, service-rate constraints. The learner observes only a scalar score each round and selects a threshold; reward and constraint residuals are revealed only for the chosen threshold. We propose Optimistic Feasible Search (OFS), a simple grid-based method that maintains confidence bounds for reward and constraint residuals for each candidate threshold. At each round, OFS selects a threshold that appears feasible under confidence bounds and, among those, maximizes optimistic reward; if no threshold appears feasible, OFS selects the threshold minimizing optimistic constraint violation. This design directly targets feasible high-utility thresholds and is particularly effective for low-dimensional, interpretable policy classes where discretization is natural. We evaluate OFS on (i) a synthetic closed-loop benchmark with stable contraction dynamics and (ii) two semi-synthetic closed-loop benchmarks grounded in German Credit and COMPAS, constructed by training a score model and feeding group-dependent acceptance decisions back into population composition. Across all environments, OFS achieves higher reward with smaller cumulative constraint violation than unconstrained and primal-dual bandit baselines, and is near-oracle relative to the best feasible fixed threshold under the same sweep procedure. Experiments are reproducible and organized with double-blind-friendly relative outputs.
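The selection rule translates directly into code: maintain confidence bounds per threshold, restrict to thresholds whose constraint lower bound looks feasible, pick the best optimistic reward among them, and otherwise fall back to minimal optimistic violation. A sketch with a toy reward/constraint environment and a standard UCB-style bonus, both our assumptions:

```python
# Optimistic Feasible Search over a threshold grid with UCB/LCB bounds.
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 21)
K = len(grid)
n = np.zeros(K)                       # pulls per threshold
r_sum = np.zeros(K)                   # observed rewards
c_sum = np.zeros(K)                   # observed constraint residuals (<=0 ok)

def bonus(t):
    return np.sqrt(2 * np.log(t + 1) / np.maximum(n, 1))

for t in range(2000):
    r_ucb = r_sum / np.maximum(n, 1) + bonus(t)   # optimistic reward
    c_lcb = c_sum / np.maximum(n, 1) - bonus(t)   # optimistic residual
    feasible = np.where(c_lcb <= 0)[0]            # optimistically feasible set
    if len(feasible) > 0:
        k = feasible[np.argmax(r_ucb[feasible])]  # best optimistic reward
    else:
        k = int(np.argmin(c_lcb))                 # minimize optimistic violation
    # Toy environment: reward peaked at 0.7; residual grows past theta = 0.8
    theta = grid[k]
    r = -(theta - 0.7) ** 2 + 0.05 * rng.normal()
    c = (theta - 0.8) + 0.05 * rng.normal()
    n[k] += 1; r_sum[k] += r; c_sum[k] += c
```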

[LG-77] PDx – Adaptive Credit Risk Forecasting Model in Digital Lending using Machine Learning Operations

链接: https://arxiv.org/abs/2512.22305
作者: Sultan Amed,Chan Yu Hang,Sayantan Banerjee
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents PDx, an adaptive, machine learning operations (MLOps)-driven decision system for forecasting credit risk using probability of default (PD) modeling in digital lending. While conventional PD models prioritize predictive accuracy during model development with complex machine learning algorithms, they often overlook continuous adaptation to changing borrower behaviour, resulting in static models that degrade over time in production and generate inaccurate default predictions. Many financial institutions also find it difficult to transition ML models from development to production and to maintain their health. With PDx we aim to address these limitations using a dynamic, end-to-end model lifecycle management approach that integrates continuous model monitoring, retraining, and validation through a robust MLOps pipeline. We introduce a dynamic champion-challenger framework in which PDx regularly updates baseline models, recalibrating parameters with the latest data, and selects the best-performing model through out-of-time validation, ensuring resilience against data drift and changing credit risk patterns. Our empirical analysis shows that decision tree-based ensemble models consistently outperform others in classifying defaulters but require frequent updates to sustain performance. Linear models (e.g., logistic regression) and neural networks exhibit greater performance degradation. The study demonstrates that PDx mitigates value erosion for digital lenders, particularly in short-term, small-ticket loans, where borrower behavior shifts rapidly. We have validated the effectiveness of PDx using datasets from peer-to-peer lending, business loans, and auto loans, demonstrating its scalability and adaptability for modern credit risk forecasting.

[LG-78] Statistical and Machine Learning Analysis of Traffic Accidents on US 158 in Currituck County: A Comparison with HSM Predictions

链接: https://arxiv.org/abs/2512.22302
作者: Jennifer Sawyer,Julian Allagan
类目: Machine Learning (cs.LG)
*备注: 9 pages,7 figures, 7 tables

点击查看摘要

Abstract:This study extends previous hotspot and Chi-Square analysis by Sawyer (2025) by integrating advanced statistical analysis, machine learning, and spatial modeling techniques to analyze five years (2019–2023) of traffic accident data from an 8.4-mile stretch of US 158 in Currituck County, NC. Building upon foundational statistical work, we apply Kernel Density Estimation (KDE), Negative Binomial Regression, Random Forest classification, and Highway Safety Manual (HSM) Safety Performance Function (SPF) comparisons to identify comprehensive temporal and spatial crash patterns. A Random Forest classifier predicts injury severity with 67% accuracy, outperforming the HSM SPF. Spatial clustering is confirmed via Moran’s I test ($I = 0.32$, p < 0.001), and KDE analysis reveals hotspots near major intersections, validating and extending earlier hotspot identification methods. These results support targeted interventions to improve traffic safety on this vital transportation corridor. Our objective is to provide actionable insights for improving safety on US 158 while contributing to the broader understanding of rural highway safety analysis through methodological advancement beyond basic statistical techniques.
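For reference, global Moran's I on segment-level crash counts is a short computation once a spatial weight matrix is chosen; the adjacent-segment weights and counts below are illustrative, not the study's data:

```python
# Global Moran's I: I = (n / sum(W)) * (z' W z) / (z' z), with z centered.
import numpy as np

counts = np.array([3, 5, 2, 8, 12, 9, 4, 1])     # crashes per road segment
n = len(counts)
W = np.zeros((n, n))
for i in range(n - 1):                            # neighbors = adjacent segments
    W[i, i + 1] = W[i + 1, i] = 1.0

z = counts - counts.mean()
moran_I = (n / W.sum()) * (z @ W @ z) / (z @ z)
print(moran_I)    # values above 0 indicate spatial clustering of crashes
```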

[LG-79] Hybrid Quantum-Classical Mixture of Experts: Unlocking Topological Advantage via Interference-Based Routing

链接: https://arxiv.org/abs/2512.22296
作者: Reda Heddad,Lamiae Bouanane
类目: Machine Learning (cs.LG)
*备注: 7 pages, 3 figures, 2 tables. Code available at this https URL

点击查看摘要

Abstract:The Mixture-of-Experts (MoE) architecture has emerged as a powerful paradigm for scaling deep learning models, yet it is fundamentally limited by challenges such as expert imbalance and the computational complexity of classical routing mechanisms. This paper investigates the potential of Quantum Machine Learning (QML) to address these limitations through a novel Hybrid Quantum-Classical Mixture of Experts (QMoE) architecture. Specifically, we conduct an ablation study using a Quantum Gating Network (Router) combined with classical experts to isolate the source of quantum advantage. Our central finding validates the Interference Hypothesis: by leveraging quantum feature maps (Angle Embedding) and wave interference, the Quantum Router acts as a high-dimensional kernel method, enabling the modeling of complex, non-linear decision boundaries with superior parameter efficiency compared to its classical counterparts. Experimental results on non-linearly separable data, such as the Two Moons dataset, demonstrate that the Quantum Router achieves a significant topological advantage, effectively “untangling” data distributions that linear classical routers fail to separate efficiently. Furthermore, we analyze the architecture’s robustness against simulated quantum noise, confirming its feasibility for noisy intermediate-scale quantum (NISQ) hardware. We discuss practical applications in federated learning, privacy-preserving machine learning, and adaptive systems that could benefit from this quantum-enhanced routing paradigm.

[LG-80] Hierarchical Stacking Optimization Using Dirichlet's Process (SoDip): Towards Accelerated Design for Graft Polymerization

链接: https://arxiv.org/abs/2512.22279
作者: Amgad Ahmed Ali Ibrahim,Hein Htet,Ryoji Asahi
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Applied Physics (physics.app-ph)
*备注:

点击查看摘要

Abstract:Radiation-induced grafting (RIG) enables precise functionalization of polymer films for ion-exchange membranes, CO2-separation membranes, and battery electrolytes by generating radicals on robust substrates to graft desired monomers. However, reproducibility remains limited due to unreported variability in base-film morphology (crystallinity, grain orientation, free volume), which governs monomer diffusion, radical distribution, and the Trommsdorff effect, leading to spatial graft gradients and performance inconsistencies. We present a hierarchical stacking optimization framework with a Dirichlet’s Process (SoDip), a hierarchical data-driven framework integrating: (1) a decoder-only Transformer (DeepSeek-R1) to encode textual process descriptors (irradiation source, grafting type, substrate manufacturer); (2) TabNet and XGBoost for modelling multimodal feature interactions; (3) Gaussian Process Regression (GPR) with Dirichlet Process Mixture Models (DPMM) for uncertainty quantification and heteroscedasticity; and (4) Bayesian Optimization for efficient exploration of high-dimensional synthesis space. A diverse dataset was curated using ChemDataExtractor 2.0 and WebPlotDigitizer, incorporating numerical and textual variables across hundreds of RIG studies. In cross-validation, SoDip achieved ~33% improvement over GPR while providing calibrated confidence intervals that identify low-reproducibility regimes. Its stacked architecture integrates sparse textual and numerical inputs of varying quality, outperforming prior models and establishing a foundation for reproducible, morphology-aware design in graft polymerization research.

[LG-81] LuxIA: A Lightweight Unitary matriX-based Framework Built on an Iterative Algorithm for Photonic Neural Network Training

链接: https://arxiv.org/abs/2512.22264
作者: Tzamn Melendez Carmona,Federico Marchesin,Marco P. Abrate,Peter Bienstman,Stefano Di Carlo,Alessandro Savino Senior
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Photonic neural networks (PNNs) present promising opportunities for accelerating machine learning by leveraging the unique benefits of photonic circuits. However, current state-of-the-art PNN simulation tools face significant scalability challenges when training large-scale PNNs, due to the computational demands of transfer matrix calculations, resulting in high memory and time consumption. To overcome these limitations, we introduce the Slicing method, an efficient transfer matrix computation approach compatible with back-propagation. We integrate this method into LuxIA, a unified simulation and training framework. The Slicing method substantially reduces memory usage and execution time, enabling scalable simulation and training of large PNNs. Experimental evaluations across various photonic architectures and standard datasets, including MNIST, Digits, and Olivetti Faces, show that LuxIA consistently surpasses existing tools in speed and scalability. Our results advance the state of the art in PNN simulation, making it feasible to explore and optimize larger, more complex architectures. By addressing key computational bottlenecks, LuxIA facilitates broader adoption and accelerates innovation in AI hardware through photonic technologies. This work paves the way for more efficient and scalable photonic neural network research and development.

[LG-82] The Physics Constraint Paradox: When Removing Explicit Constraints Improves Physics-Informed Data for Machine Learning

链接: https://arxiv.org/abs/2512.22261
作者: Rahul D Ray
类目: Machine Learning (cs.LG)
*备注: 11 pages, 5 figures. Includes systematic physics ablation study, theoretical analysis, and downstream machine learning evaluation. Source code available at: this https URL

点击查看摘要

Abstract:Physics-constrained data generation is essential for machine learning in scientific domains where real data are scarce; however, existing approaches often over-constrain models without identifying which physical components are necessary. We present a systematic ablation study of a physics-informed grating coupler spectrum generator that maps five geometric parameters to 100-point spectral responses. By selectively removing explicit energy conservation enforcement, Fabry-Perot oscillations, bandwidth variation, and noise, we uncover a physics constraint paradox: explicit energy conservation enforcement is mathematically redundant when the underlying equations are physically consistent, with constrained and unconstrained variants achieving identical conservation accuracy (mean error approximately 7 x 10^-9). In contrast, Fabry-Perot oscillations dominate threshold-based bandwidth variability, accounting for a 72 percent reduction in half-maximum bandwidth spread when removed (with bandwidth spread reduced from 132.3 nm to 37.4 nm). We further identify a subtle pitfall: standard noise-addition-plus-renormalization pipelines introduce 0.5 percent unphysical negative absorption values. The generator operates at 200 samples per second, enabling high-throughput data generation and remaining orders of magnitude faster than typical full-wave solvers reported in the literature. Finally, downstream machine learning evaluation reveals a clear physics-learnability trade-off: while central wavelength prediction remains unaffected, removing Fabry-Perot oscillations improves bandwidth prediction accuracy by 31.3 percent in R-squared and reduces RMSE by 73.8 percent. These findings provide actionable guidance for physics-informed dataset design and highlight machine learning performance as a diagnostic tool for assessing constraint relevance.

[LG-83] Cardiac mortality prediction in patients undergoing PCI based on real and synthetic data

链接: https://arxiv.org/abs/2512.22259
作者: Daniil Burakov,Ivan Petrov,Dmitrii Khelimskii,Ivan Bessonov,Mikhail Lazarev
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Patient status, angiographic and procedural characteristics encode crucial signals for predicting long-term outcomes after percutaneous coronary intervention (PCI). The aim of the study was to develop a predictive model for assessing the risk of cardiac death based on real and synthetic data of patients undergoing PCI and to identify the factors that have the greatest impact on mortality. We analyzed 2,044 patients who underwent PCI for bifurcation lesions. The primary outcome was cardiac death at 3-year follow-up. Several machine learning models were applied to predict three-year mortality after PCI. To address class imbalance and improve the representation of the minority class, an additional 500 synthetic samples were generated and added to the training set. To evaluate the contribution of individual features to model performance, we applied permutation feature importance. An additional experiment was conducted to evaluate how the model’s predictions would change after removing non-informative features from the training and test datasets. Without oversampling, all models achieve high overall accuracy (0.92-0.93), yet they almost completely ignore the minority class. Across models, augmentation consistently increases minority-class recall with minimal loss of AUROC, improves probability quality, and yields more clinically reasonable risk estimates on the constructed severe profiles. According to the feature importance analysis, four features emerged as the most influential: Age, Ejection Fraction, Peripheral Artery Disease, and Cerebrovascular Disease. These results show that straightforward augmentation with realistic and extreme cases can expose, quantify, and reduce brittleness in imbalanced clinical prediction using only tabular records, and motivate routine reporting of probability quality and stress tests alongside headline metrics.

[LG-84] Analyzing Skill Element in Online Fantasy Cricket

链接: https://arxiv.org/abs/2512.22254
作者: Sarthak Sarkar,Supratim Das,Purushottam Saha,Diganta Mukherjee,Tridib Mukherjee
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: 12 pages, 2 figures

点击查看摘要

Abstract:Online fantasy cricket has emerged as a large-scale competitive system in which participants construct virtual teams and compete based on real-world player performances. This massive growth has been accompanied by important questions about whether outcomes are primarily driven by skill or chance. We develop a statistical framework to assess the role of skill in determining success on these platforms. We construct and analyze a range of deterministic and stochastic team selection strategies, based on recent form, historical statistics, statistical optimization, and multi-criteria decision making. Strategy performance is evaluated based on points, ranks, and payoff under two contest structures, Mega and 4x or Nothing. An extensive comparison between the different strategies is made to find an optimal set of strategies. To capture adaptive behavior, we further introduce a dynamic tournament model in which agent populations evolve through a softmax reweighting mechanism proportional to positive payoff realizations. We demonstrate our work by running extensive numerical experiments on the IPL 2024 dataset. The results provide quantitative evidence in favor of the skill element present in online fantasy cricket platforms.

[LG-85] Amortized Inference for Model Rocket Aerodynamics: Learning to Estimate Physical Parameters from Simulation

链接: https://arxiv.org/abs/2512.22248
作者: Rohit Pandey,Rohan Pandey
类目: Machine Learning (cs.LG)
*备注: 5 pages, 2 figures, 2 tables

点击查看摘要

Abstract:Accurate prediction of model rocket flight performance requires estimating aerodynamic parameters that are difficult to measure directly. Traditional approaches rely on computational fluid dynamics or empirical correlations, while data-driven methods require extensive real flight data that is expensive and time-consuming to collect. We present a simulation-based amortized inference approach that trains a neural network on synthetic flight data generated from a physics simulator, then applies the learned model to real flights without any fine-tuning. Our method learns to invert the forward physics model, directly predicting drag coefficient and thrust correction factor from a single apogee measurement combined with motor and configuration features. In this proof-of-concept study, we train on 10,000 synthetic flights and evaluate on 8 real flights, achieving a mean absolute error of 12.3 m in apogee prediction - demonstrating promising sim-to-real transfer with zero real training examples. Analysis reveals a systematic positive bias in predictions, providing quantitative insight into the gap between idealized physics and real-world flight conditions. We additionally compare against OpenRocket baseline predictions, showing that our learned approach reduces apogee prediction error. Our implementation is publicly available to support reproducibility and adoption in the amateur rocketry community.

[LG-86] The Affine Divergence: Aligning Activation Updates Beyond Normalisation ICLR2025

链接: https://arxiv.org/abs/2512.22247
作者: George Bird
类目: Machine Learning (cs.LG)
*备注: 25 pages, 9 figures. Intended for submission to the ICLR 2025 Workshop on “Optimisation for Representation Learning”

点击查看摘要

Abstract:A systematic mismatch exists between mathematically ideal and effective activation updates during gradient descent. As intended, parameters update in their direction of steepest descent. However, activations are argued to constitute a more directly impactful quantity to prioritise in optimisation, as they are closer to the loss in the computational graph and carry sample-dependent information through the network. Yet their propagated updates do not take the optimal steepest-descent step. These quantities exhibit non-ideal sample-wise scaling across affine, convolutional, and attention layers. Solutions to correct for this are trivial and, entirely incidentally, derive normalisation from first principles despite motivational independence. Consequently, such considerations offer a fresh and conceptual reframe of normalisation’s action, with auxiliary experiments bolstering this mechanistically. Moreover, this analysis makes clear a second possibility: a solution that is functionally distinct from modern normalisations, without scale-invariance, yet remains empirically successful, outperforming conventional normalisers across several tests. This is presented as an alternative to the affine map. This generalises to convolution via a new functional form, “PatchNorm”, a compositionally inseparable normaliser. Together, these provide an alternative mechanistic framework that adds to, and counters some of, the discussion of normalisation. Further, it is argued that normalisers are better decomposed into activation-function-like maps with parameterised scaling, thereby aiding the prioritisation of representations during optimisation. Overall, this constitutes a theoretical-principled approach that yields several new functions that are empirically validated and raises questions about the affine + nonlinear approach to model creation.

[LG-87] Predicting Mycotoxin Contamination in Irish Oats Using Deep and Transfer Learning

链接: https://arxiv.org/abs/2512.22243
作者: Alan Inglis,Fiona Doohan,Subramani Natarajan,Breige McNulty,Chris Elliott,Anne Nugent,Julie Meneely,Brett Greer,Stephen Kildea,Diana Bucur,Martin Danaher,Melissa Di Rocco,Lisa Black,Adam Gauley,Naoise McKenna,Andrew Parnell
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 28 pages, 11 Figures, Supplementary Materials

点击查看摘要

Abstract:Mycotoxin contamination poses a significant risk to cereal crop quality, food safety, and agricultural productivity. Accurate prediction of mycotoxin levels can support early intervention strategies and reduce economic losses. This study investigates the use of neural networks and transfer learning models to predict mycotoxin contamination in Irish oat crops as a multi-response prediction task. Our dataset comprises oat samples collected in Ireland, containing a mix of environmental, agronomic, and geographical predictors. Five modelling approaches were evaluated: a baseline multilayer perceptron (MLP), an MLP with pre-training, and three transfer learning models; TabPFN, TabNet, and FT-Transformer. Model performance was evaluated using regression (RMSE, R^2 ) and classification (AUC, F1) metrics, with results reported per toxin and on average. Additionally, permutation-based variable importance analysis was conducted to identify the most influential predictors across both prediction tasks. The transfer learning approach TabPFN provided the overall best performance, followed by the baseline MLP. Our variable importance analysis revealed that weather history patterns in the 90-day pre-harvest period were the most important predictors, alongside seed moisture content.

[LG-88] Interpretable and Adaptive Node Classification on Heterophilic Graphs via Combinatorial Scoring and Hybrid Learning

链接: https://arxiv.org/abs/2512.22221
作者: Soroush Vahidi
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:

点击查看摘要

Abstract:Graph neural networks (GNNs) achieve strong performance on homophilic graphs but often struggle under heterophily, where adjacent nodes frequently belong to different classes. We propose an interpretable and adaptive framework for semi-supervised node classification based on explicit combinatorial inference rather than deep message passing. Our method assigns labels using a confidence-ordered greedy procedure driven by an additive scoring function that integrates class priors, neighborhood statistics, feature similarity, and training-derived label-label compatibility. A small set of transparent hyperparameters controls the relative influence of these components, enabling smooth adaptation between homophilic and heterophilic regimes. We further introduce a validation-gated hybrid strategy in which combinatorial predictions are optionally injected as priors into a lightweight neural model. Hybrid refinement is applied only when it improves validation performance, preserving interpretability when neuralization is unnecessary. All adaptation signals are computed strictly from training data, ensuring a leakage-free evaluation protocol. Experiments on heterophilic and transitional benchmarks demonstrate competitive performance with modern GNNs while offering advantages in interpretability, tunability, and computational efficiency.
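
The confidence-ordered greedy procedure is concrete enough to sketch in code. Below is a minimal, hypothetical Python version: `scores_fn` stands in for the paper’s additive scoring function (class priors, neighborhood statistics, feature similarity, label-label compatibility), and the best-versus-runner-up margin is one plausible reading of “confidence-ordered”; neither the function names nor the margin rule is taken from the paper.

```python
import numpy as np


def greedy_confidence_labeling(scores_fn, n_nodes):
    """Confidence-ordered greedy labeling (sketch).

    scores_fn(node, labels) -> per-class additive scores for `node`
    given the labels assigned so far (-1 marks unlabeled nodes).
    At each step, commit the unlabeled node whose best score exceeds
    its runner-up by the largest margin (our proxy for confidence).
    """
    labels = -np.ones(n_nodes, dtype=int)
    while (labels == -1).any():
        best = None
        for u in np.where(labels == -1)[0]:
            s = scores_fn(u, labels)
            order = np.sort(s)
            margin = order[-1] - order[-2]    # best minus runner-up score
            if best is None or margin > best[0]:
                best = (margin, u, int(np.argmax(s)))
        labels[best[1]] = best[2]
    return labels


# Toy run: each node carries a fixed random preference over 3 classes.
rng = np.random.default_rng(0)
pref = rng.normal(size=(6, 3))
print(greedy_confidence_labeling(lambda u, lab: pref[u], 6))
```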

[LG-89] Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs

链接: https://arxiv.org/abs/2512.22219
作者: Xinhao Cheng,Zhihao Zhang,Yu Zhou,Jianan Ji,Jinchen Jiang,Zepeng Zhao,Ziruo Xiao,Zihao Ye,Yingyi Huang,Ruihang Lai,Hongyi Jin,Bohan Hou,Mengdi Wu,Yixin Dong,Anthony Yip,Zihao Ye,Songting Wang,Wenqin Yang,Xupeng Miao,Tianqi Chen,Zhihao Jia
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Programming Languages (cs.PL)
*备注:

点击查看摘要

Abstract:We introduce Mirage Persistent Kernel (MPK), the first compiler and runtime system that automatically transforms multi-GPU model inference into a single high-performance megakernel. MPK introduces an SM-level graph representation that captures data dependencies at the granularity of individual streaming multiprocessors (SMs), enabling cross-operator software pipelining, fine-grained kernel overlap, and other previously infeasible GPU optimizations. The MPK compiler lowers tensor programs into highly optimized SM-level task graphs and generates optimized CUDA implementations for all tasks, while the MPK in-kernel parallel runtime executes these tasks within a single mega-kernel using decentralized scheduling across SMs. Together, these components provide end-to-end kernel fusion with minimal developer effort, while preserving the flexibility of existing programming models. Our evaluation shows that MPK significantly outperforms existing kernel-per-operator LLM serving systems by reducing end-to-end inference latency by up to 1.7x, pushing LLM inference performance close to hardware limits. MPK is publicly available at this https URL.

[LG-90] Transformer Reconstructed with Dynamic Value Attention

链接: https://arxiv.org/abs/2512.22212
作者: Xiaowei Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Since the transformer was first published in 2017, several works have been proposed to optimize it. However, the major structure of the transformer remains unchanged, ignoring one of its main intrinsic limitations: the same static value is used for every query in a head. The transformer itself tries to mitigate this problem by implementing multi-head attention, yet the number of heads is limited by complexity. I propose a method to decide a value for each query dynamically, which could cut down all the redundant heads, keeping only one. Consequently, the following feed-forward network can be cut down entirely, as each revised embedding has already fetched enough useful values far beyond the context. As a result, a single-head Dynamic Value Attention (DVA) is all you need in a transformer. According to the experiments, DVA may save 37.6% of training time compared to the original transformer while increasing learning capability.
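
The abstract does not specify how the per-query value is computed. A minimal sketch of single-head attention with query-conditioned values might look like the following (PyTorch); the module name `DynamicValueAttention` and the concatenation-based conditioning are our assumptions, not the paper’s design:

```python
import torch
import torch.nn as nn


class DynamicValueAttention(nn.Module):
    """Single-head attention sketch where each query gets its own values.

    Hypothetical reading of the abstract: instead of a static value
    projection shared by all queries, the value for token j as seen by
    query i is computed from both the query and the token embedding.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        # Value projection conditioned on the query: concatenate the
        # query with each token embedding before projecting to a value.
        self.v_proj = nn.Linear(2 * d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q, k = self.q_proj(x), self.k_proj(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        b, s, d = x.shape
        q_exp = q.unsqueeze(2).expand(b, s, s, d)  # query i over tokens j
        x_exp = x.unsqueeze(1).expand(b, s, s, d)  # token j over queries i
        v = self.v_proj(torch.cat([q_exp, x_exp], dim=-1))  # (b, s, s, d)
        return (attn.unsqueeze(-1) * v).sum(dim=2)


out = DynamicValueAttention(32)(torch.randn(2, 5, 32))
print(out.shape)  # torch.Size([2, 5, 32])
```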

[LG-91] Emotion-Inspired Learning Signals (EILS): A Homeostatic Framework for Adaptive Autonomous Agents

链接: https://arxiv.org/abs/2512.22200
作者: Dhruv Tiwari
类目: Machine Learning (cs.LG)
*备注: 7 pages, 3 figures. arXiv preprint

点击查看摘要

Abstract:The ruling paradigm in modern Artificial Intelligence, spanning from Deep Reinforcement Learning (DRL) to Large Language Models (LLMs), relies on static, externally defined reward functions. While this “extrinsic maximization” approach has delivered superhuman performance in closed, stationary domains, it produces agents that are fragile in open-ended, real-world environments. Standard agents lack internal autonomy: they struggle to explore without dense feedback, fail to adapt to distribution shifts (non-stationarity), and require extensive manual tuning of static hyperparameters. This paper proposes that the unaddressed factor in robust autonomy is a functional analog to biological emotion, serving as a high-level homeostatic control mechanism. We introduce Emotion-Inspired Learning Signals (EILS), a unified framework that replaces scattered optimization heuristics with a coherent, bio-inspired internal feedback engine. Unlike traditional methods that treat emotions as semantic labels, EILS models them as continuous, homeostatic appraisal signals such as Curiosity, Stress, and Confidence. We formalize these signals as vector-valued internal states derived from interaction history. These states dynamically modulate the agent’s optimization landscape in real time: curiosity regulates entropy to prevent mode collapse, stress modulates plasticity to overcome inactivity, and confidence adapts trust regions to stabilize convergence. We hypothesize that this closed-loop homeostatic regulation can enable EILS agents to outperform standard baselines in terms of sample efficiency and non-stationary adaptation.

[LG-92] AETAS: Analysis of Evolving Temporal Affect and Semantics for Legal History

链接: https://arxiv.org/abs/2512.22196
作者: Qizhi Wang
类目: Digital Libraries (cs.DL); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 15 pages, 8 figures

点击查看摘要

Abstract:Digital-humanities work on semantic shift often alternates between handcrafted close readings and opaque embedding machinery. We present a reproducible expert-system style pipeline that quantifies and visualises lexical drift in the Old Bailey Corpus (1720–1913), coupling interpretable trajectories with legally meaningful axes. We bin proceedings by decade with dynamic merging for low-resource slices, train skip-gram embeddings, align spaces through orthogonal Procrustes, and measure both geometric displacement and neighborhood turnover. Three visual analytics outputs (drift magnitudes, semantic trajectories, and movement along a mercy-versus-retribution axis) expose how justice, crime, poverty, and insanity evolve with penal reforms, transportation debates, and Victorian moral politics. The pipeline is implemented as auditable scripts so results can be reproduced on other historical corpora.

[LG-93] Frequency Regularization: Unveiling the Spectral Inductive Bias of Deep Neural Networks

链接: https://arxiv.org/abs/2512.22192
作者: Jiahao Lu
类目: Machine Learning (cs.LG)
*备注: 9 pages, 5 figures. Code available at this https URL

点击查看摘要

Abstract:Regularization techniques such as L2 regularization (Weight Decay) and Dropout are fundamental to training deep neural networks, yet their underlying physical mechanisms regarding feature frequency selection remain poorly understood. In this work, we investigate the Spectral Bias of modern Convolutional Neural Networks (CNNs). We introduce a Visual Diagnostic Framework to track the dynamic evolution of weight frequencies during training and propose a novel metric, the Spectral Suppression Ratio (SSR), to quantify the “low-pass filtering” intensity of different regularizers. By addressing the aliasing issue in small kernels (e.g., 3x3) through discrete radial profiling, our empirical results on ResNet-18 and CIFAR-10 demonstrate that L2 regularization suppresses high-frequency energy accumulation by over 3x compared to unregularized baselines. Furthermore, we reveal a critical Accuracy-Robustness Trade-off: while L2 models are sensitive to broadband Gaussian noise due to over-specialization in low frequencies, they exhibit superior robustness against high-frequency information loss (e.g., low resolution), outperforming baselines by 6% in blurred scenarios. This work provides a signal-processing perspective on generalization, confirming that regularization enforces a strong spectral inductive bias towards low-frequency structures.
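
The paper’s exact SSR formula is not given in the abstract. A plausible sketch follows, assuming SSR compares high- to low-frequency energy in a discrete radial profile of a kernel’s 2D spectrum; both function names below are hypothetical, and the real metric may be normalized differently:

```python
import numpy as np


def radial_profile(kernel: np.ndarray, n_bins: int = 3) -> np.ndarray:
    """Discrete radial energy profile of a conv kernel's 2D spectrum."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(kernel))) ** 2
    h, w = spec.shape
    yy, xx = np.indices(spec.shape)
    r = np.hypot(yy - h // 2, xx - w // 2)        # radial frequency
    bins = np.linspace(0, r.max() + 1e-9, n_bins + 1)
    return np.array([spec[(r >= bins[i]) & (r < bins[i + 1])].sum()
                     for i in range(n_bins)])


def spectral_suppression_ratio(kernel: np.ndarray) -> float:
    """Hypothetical SSR: high-frequency over low-frequency energy.

    A lower value indicates stronger low-pass behaviour (heavier
    suppression of high frequencies), in the spirit of the abstract.
    """
    prof = radial_profile(kernel)
    return float(prof[-1] / (prof[0] + 1e-12))


k = np.random.randn(3, 3)
print(spectral_suppression_ratio(k))
```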

[LG-94] Physics-Informed Machine Learning for Transformer Condition Monitoring – Part I: Basic Concepts, Neural Networks and Variants

链接: https://arxiv.org/abs/2512.22190
作者: Jose I. Aizpurua
类目: Machine Learning (cs.LG)
*备注: 8 pages, 7 figures, 2 tables, published as conference paper in IEEE Advanced Research Workshop on Transformers 2025 as part of the Tutorial delivered with the title “Physics-informed Machine Learning for Transformer Condition Monitoring”

点击查看摘要

Abstract:Power transformers are critical assets in power networks, whose reliability directly impacts grid resilience and stability. Traditional condition monitoring approaches, often rule-based or purely physics-based, struggle with uncertainty, limited data availability, and the complexity of modern operating conditions. Recent advances in machine learning (ML) provide powerful tools to complement and extend these methods, enabling more accurate diagnostics, prognostics, and control. In this two-part series, we examine the role of Neural Networks (NNs) and their extensions in transformer condition monitoring and health management tasks. This first paper introduces the basic concepts of NNs, explores Convolutional Neural Networks (CNNs) for condition monitoring using diverse data modalities, and discusses the integration of NN concepts within the Reinforcement Learning (RL) paradigm for decision-making and control. Finally, perspectives on emerging research directions are also provided.

[LG-95] Physics-Informed Machine Learning for Transformer Condition Monitoring – Part II: Physics-Informed Neural Networks and Uncertainty Quantification

链接: https://arxiv.org/abs/2512.22189
作者: Jose I. Aizpurua
类目: Machine Learning (cs.LG)
*备注: 7 pages, 8 figures, published as conference paper in IEEE Advanced Research Workshop on Transformers 2025 as part of the Tutorial delivered with the title “Physics-informed Machine Learning for Transformer Condition Monitoring”

点击查看摘要

Abstract:The integration of physics-based knowledge with machine learning models is increasingly shaping the monitoring, diagnostics, and prognostics of electrical transformers. In this two-part series, the first paper introduced the foundations of Neural Networks (NNs) and their variants for health assessment tasks. This second paper focuses on integrating physics and uncertainty into the learning process. We begin with the fundamentals of Physics-Informed Neural Networks (PINNs), applied to spatiotemporal thermal modeling and solid insulation ageing. Building on this, we present Bayesian PINNs as a principled framework to quantify epistemic uncertainty and deliver robust predictions under sparse data. Finally, we outline emerging research directions that highlight the potential of physics-aware and trustworthy machine learning for critical power assets.

[LG-96] Latent Sculpting for Zero-Shot Generalization: A Manifold Learning Approach to Out-of-Distribution Anomaly Detection

链接: https://arxiv.org/abs/2512.22179
作者: Rajeeb Thapa Chhetri,Zhixiong Chen,Saurab Thapa
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 21 pages, 3 figures. Code available at: this https URL

点击查看摘要

Abstract:A fundamental limitation of supervised deep learning in high-dimensional tabular domains is “Generalization Collapse”: models learn precise decision boundaries for known distributions but fail catastrophically when facing Out-of-Distribution (OOD) data. We hypothesize that this failure stems from the lack of topological constraints in the latent space, resulting in diffuse manifolds where novel anomalies remain statistically indistinguishable from benign data. To address this, we propose Latent Sculpting, a hierarchical two-stage representation learning framework. Stage 1 utilizes a hybrid 1D-CNN and Transformer Encoder trained with a novel Dual-Centroid Compactness Loss (DCCL) to actively “sculpt” benign traffic into a low-entropy, hyperspherical cluster. Unlike standard contrastive losses that rely on triplet mining, DCCL optimizes global cluster centroids to enforce absolute manifold density. Stage 2 conditions a Masked Autoregressive Flow (MAF) on this pre-structured manifold to learn an exact density estimate. We evaluate this methodology on the rigorous CIC-IDS-2017 benchmark, treating it as a proxy for complex, non-stationary data streams. Empirical results demonstrate that explicit manifold sculpting is a prerequisite for robust zero-shot generalization. While supervised baselines suffered catastrophic performance collapse on unseen distribution shifts (F1 approx 0.30) and the strongest unsupervised baseline achieved only 0.76, our framework achieved an F1-Score of 0.87 on strictly zero-shot anomalies. Notably, we report an 88.89% detection rate on “Infiltration” scenarios–a complex distributional shift where state-of-the-art supervised models achieved 0.00% accuracy. These findings suggest that decoupling structure learning from density estimation provides a scalable path toward generalized anomaly detection.

[LG-97] Towards Unsupervised Causal Representation Learning via Latent Additive Noise Model Causal Autoencoders

链接: https://arxiv.org/abs/2512.22150
作者: Hans Jarett J. Ong,Brian Godwin S. Lim,Dominic Dayta,Renzo Roel P. Tan,Kazushi Ikeda
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Unsupervised representation learning seeks to recover latent generative factors, yet standard methods relying on statistical independence often fail to capture causal dependencies. A central challenge is identifiability: as established in disentangled representation learning and nonlinear ICA literature, disentangling causal variables from observational data is impossible without supervision, auxiliary signals, or strong inductive biases. In this work, we propose the Latent Additive Noise Model Causal Autoencoder (LANCA) to operationalize the Additive Noise Model (ANM) as a strong inductive bias for unsupervised discovery. Theoretically, we prove that while the ANM constraint does not guarantee unique identifiability in the general mixing case, it resolves component-wise indeterminacy by restricting the admissible transformations from arbitrary diffeomorphisms to the affine class. Methodologically, arguing that the stochastic encoding inherent to VAEs obscures the structural residuals required for latent causal discovery, LANCA employs a deterministic Wasserstein Auto-Encoder (WAE) coupled with a differentiable ANM Layer. This architecture transforms residual independence from a passive assumption into an explicit optimization objective. Empirically, LANCA outperforms state-of-the-art baselines on synthetic physics benchmarks (Pendulum, Flow), and on photorealistic environments (CANDLE), where it demonstrates superior robustness to spurious correlations arising from complex background scenes.

[LG-98] On Harnessing Idle Compute at the Edge for Foundation Model Training

链接: https://arxiv.org/abs/2512.22142
作者: Leyang Xue,Meghana Madhyastha,Myungjin Lee,Amos Storkey,Randal Burns,Mahesh K. Marina
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: Extended abstract version of this paper appeared in ACM MobiCom 2025

点击查看摘要

Abstract:The ecosystem behind foundation model development today is highly centralized and limited to large-scale cloud data center operators: training foundation models is costly, needing immense compute resources. Decentralized foundation model training across edge devices, leveraging their spare compute, promises a democratized alternative. However, existing edge-training approaches fall short: they struggle to match cloud-based training performance, exhibit limited scalability with model size, exceed device memory capacity, and have prohibitive communication overhead. They also fail to satisfactorily handle device heterogeneity and dynamism. We introduce a new paradigm, Cleave, which finely partitions training operations through a novel selective hybrid tensor parallelism method. Together with a parameter server centric training framework, Cleave copes with device memory limits and avoids communication bottlenecks, thereby enabling efficient training of large models on par with the cloud. Further, with a cost optimization model to guide device selection and training workload distribution, Cleave effectively accounts for device heterogeneity and churn. Our evaluations show that Cleave matches cloud-based GPU training by scaling efficiently to larger models and thousands of devices, supporting up to 8x more devices than baseline edge-training approaches. It outperforms state-of-the-art edge training methods by up to a factor of 10 in per-batch training time and efficiently handles device failures, achieving at least 100x faster recovery than prior methods.

[LG-99] Pruning Graphs by Adversarial Robustness Evaluation to Strengthen GNN Defenses

链接: https://arxiv.org/abs/2512.22128
作者: Yongyu Wang
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have emerged as a dominant paradigm for learning on graph-structured data, thanks to their ability to jointly exploit node features and relational information encoded in the graph topology. This joint modeling, however, also introduces a critical weakness: perturbations or noise in either the structure or the features can be amplified through message passing, making GNNs highly vulnerable to adversarial attacks and spurious connections. In this work, we introduce a pruning framework that leverages adversarial robustness evaluation to explicitly identify and remove fragile or detrimental components of the graph. By using robustness scores as guidance, our method selectively prunes edges that are most likely to degrade model reliability, thereby yielding cleaner and more resilient graph representations. We instantiate this framework on three representative GNN architectures and conduct extensive experiments on benchmarks. The experimental results show that our approach can significantly enhance the defense capability of GNNs in the high-perturbation regime.
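
As a rough illustration of the pruning step, here is a minimal sketch assuming per-edge robustness scores are already available from some robustness evaluation; the scoring itself is the paper’s contribution and is not reproduced here, and the function name and keep-ratio heuristic are our own:

```python
import numpy as np


def prune_fragile_edges(edge_index, robustness_scores, keep_ratio=0.9):
    """Keep only the most robust edges and drop the rest.

    edge_index        : (2, E) array of (source, target) node pairs.
    robustness_scores : (E,) array; higher means the edge was judged
                        more reliable by the (external) robustness
                        evaluation, which is assumed given here.
    """
    n_keep = int(keep_ratio * edge_index.shape[1])
    keep = np.argsort(robustness_scores)[::-1][:n_keep]  # top-k most robust
    return edge_index[:, np.sort(keep)]                  # preserve edge order


edges = np.array([[0, 1, 2, 3],
                  [1, 2, 3, 0]])
scores = np.array([0.9, 0.1, 0.8, 0.4])
print(prune_fragile_edges(edges, scores, keep_ratio=0.5))
# [[0 2]
#  [1 3]]
```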

[LG-100] Bellman Calibration for V-Learning in Offline Reinforcement Learning

链接: https://arxiv.org/abs/2512.23694
作者: Lars van der Laan,Nathan Kallus
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM)
*备注:

点击查看摘要

Abstract:We introduce Iterated Bellman Calibration, a simple, model-agnostic, post-hoc procedure for calibrating off-policy value predictions in infinite-horizon Markov decision processes. Bellman calibration requires that states with similar predicted long-term returns exhibit one-step returns consistent with the Bellman equation under the target policy. We adapt classical histogram and isotonic calibration to the dynamic, counterfactual setting by repeatedly regressing fitted Bellman targets onto a model’s predictions, using a doubly robust pseudo-outcome to handle off-policy data. This yields a one-dimensional fitted value iteration scheme that can be applied to any value estimator. Our analysis provides finite-sample guarantees for both calibration and prediction under weak assumptions, and critically, without requiring Bellman completeness or realizability.
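
A minimal sketch of the iteration, assuming on-policy data so the doubly robust pseudo-outcome can be dropped (the paper’s off-policy correction is omitted for brevity); scikit-learn’s `IsotonicRegression` plays the role of the one-dimensional calibration regression:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression


def iterated_bellman_calibration(v, r, v_next, gamma=0.99, n_iters=25):
    """Sketch of Iterated Bellman Calibration via isotonic regression.

    v      : (n,) value predictions at states s_t
    r      : (n,) observed one-step rewards
    v_next : (n,) value predictions at successor states s_{t+1}

    Each iteration regresses the fitted Bellman target
    r + gamma * f(v_next) onto the raw predictions v, where f is the
    current one-dimensional calibration map -- a fitted value
    iteration over the model's own outputs.
    """
    f_v, f_v_next = v.copy(), v_next.copy()
    for _ in range(n_iters):
        target = r + gamma * f_v_next
        iso = IsotonicRegression(out_of_bounds="clip").fit(v, target)
        f_v, f_v_next = iso.predict(v), iso.predict(v_next)
    return f_v


rng = np.random.default_rng(0)
v = rng.normal(size=300)
v_next = 0.9 * v + rng.normal(scale=0.1, size=300)
r = 0.05 * v + rng.normal(scale=0.02, size=300)
print(iterated_bellman_calibration(v, r, v_next)[:5])
```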

[LG-101] Calibrated Multi-Level Quantile Forecasting

链接: https://arxiv.org/abs/2512.23671
作者: Tiffany Ding,Isaac Gibbs,Ryan J. Tibshirani
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:We present an online method for guaranteeing calibration of quantile forecasts at multiple quantile levels simultaneously. A sequence of \alpha -level quantile forecasts is calibrated if the forecasts are larger than the target value at an \alpha -fraction of time steps. We introduce a lightweight method called Multi-Level Quantile Tracker (MultiQT) that wraps around any existing point or quantile forecaster to produce corrected forecasts guaranteed to achieve calibration, even against adversarial distribution shifts, while ensuring that the forecasts are ordered – e.g., the 0.5-level quantile forecast is never larger than the 0.6-level forecast. Furthermore, the method comes with a no-regret guarantee that implies it will not worsen the performance of an existing forecaster, asymptotically, with respect to the quantile loss. In experiments, we find that MultiQT significantly improves the calibration of real forecasters in epidemic and energy forecasting problems.
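
A minimal sketch of what such a tracker can look like, using the standard online pinball-loss (quantile-tracking) update and a per-step sort to restore ordering across levels; MultiQT’s actual correction mechanism and its no-regret wrapper may differ from this toy version:

```python
import numpy as np


def multi_quantile_tracker(y_stream, base_forecasts, alphas, eta=0.05):
    """Sketch of a multi-level online quantile tracker.

    Each level alpha keeps an additive correction theta_alpha, updated
    by online (sub)gradient descent on the pinball loss:
        theta_alpha += eta * (alpha - 1{y <= forecast}),
    so the corrected forecast exceeds y at roughly an alpha-fraction
    of steps, even under distribution shift. Sorting the corrected
    forecasts restores the ordering across levels at each step.
    """
    alphas = np.asarray(alphas, dtype=float)
    theta = np.zeros_like(alphas)
    corrected = []
    for y, f in zip(y_stream, base_forecasts):
        q = f + theta
        corrected.append(np.sort(q))           # monotone across levels
        theta += eta * (alphas - (y <= q))     # pinball-loss update
    return np.array(corrected)


rng = np.random.default_rng(1)
ys = rng.normal(size=5000)
base = np.zeros((5000, 3))                     # a deliberately poor forecaster
qs = multi_quantile_tracker(ys, base, alphas=[0.1, 0.5, 0.9])
print((ys[:, None] <= qs).mean(axis=0))        # coverage near [0.1, 0.5, 0.9]
```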

[LG-102] The Nonstationarity-Complexity Tradeoff in Return Prediction

链接: https://arxiv.org/abs/2512.23596
作者: Agostino Capponi,Chengpiao Huang,J. Antonio Sidaoui,Kaizheng Wang,Jiacheng Zou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); General Finance (q-fin.GN)
*备注:

点击查看摘要

Abstract:We investigate machine learning models for stock return prediction in non-stationary environments, revealing a fundamental nonstationarity-complexity tradeoff: complex models reduce misspecification error but require longer training windows that introduce stronger non-stationarity. We resolve this tension with a novel model selection method that jointly optimizes model class and training window size using a tournament procedure that adaptively evaluates candidates on non-stationary validation data. Our theoretical analysis demonstrates that this approach balances misspecification error, estimation variance, and non-stationarity, performing close to the best model in hindsight. Applying our method to 17 industry portfolio returns, we consistently outperform standard rolling-window benchmarks, improving out-of-sample R^2 by 14-23% on average. During NBER-designated recessions, improvements are substantial: our method achieves positive R^2 during the Gulf War recession while benchmarks are negative, improves R^2 in absolute terms by at least 80bps during the 2001 recession, and delivers superior performance during the 2008 Financial Crisis. Economically, a trading strategy based on our selected model generates 31% higher cumulative returns averaged across the industries.

[LG-103] From geometry to dynamics: Learning overdamped Langevin dynamics from sparse observations with geometric constraints

链接: https://arxiv.org/abs/2512.23566
作者: Dimitra Maoutsa
类目: Dynamical Systems (math.DS); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 12+50 pages, 6 figures; An earlier account of this work has previously appeared in arXiv:2301.08102 and arXiv:2304.00423 ; main methodology remains the same, this version includes additional numerical experiments and theory

点击查看摘要

Abstract:How can we learn the laws underlying the dynamics of stochastic systems when their trajectories are sampled sparsely in time? Existing methods either require temporally resolved high-frequency observations, or rely on geometric arguments that apply only to conservative systems, limiting the range of dynamics they can recover. Here, we present a new framework that reconciles these two perspectives by reformulating inference as a stochastic control problem. Our method uses geometry-driven path augmentation, guided by the geometry in the system’s invariant density to reconstruct likely trajectories and infer the underlying dynamics without assuming specific parametric models. Applied to overdamped Langevin systems, our approach accurately recovers stochastic dynamics even from extremely undersampled data, outperforming existing methods in synthetic benchmarks. This work demonstrates the effectiveness of incorporating geometric inductive biases into stochastic system identification methods.

[LG-104] Adaptive Fusion Graph Network for 3D Strain Field Prediction in Solid Rocket Motor Grains

链接: https://arxiv.org/abs/2512.23443
作者: Jiada Huang,Hao Ma,Zhibin Shen,Yizhou Qiao,Haiyang Li
类目: Applied Physics (physics.app-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Local high strain in solid rocket motor grains is a primary cause of structural failure. However, traditional numerical simulations are computationally expensive, and existing surrogate models cannot explicitly establish geometric models and accurately capture high-strain regions. Therefore, this paper proposes an adaptive graph network, GrainGNet, which employs an adaptive pooling dynamic node selection mechanism to effectively preserve the key mechanical features of structurally critical regions, while concurrently utilising feature fusion to transmit deep features and enhance the model’s representational capacity. In the joint prediction task involving four sequential conditions–curing and cooling, storage, overloading, and ignition–GrainGNet reduces the mean squared error by 62.8% compared to the baseline graph U-Net model, with only a 5.2% increase in parameter count and an approximately sevenfold improvement in training efficiency. Furthermore, in the high-strain regions of debonding seams, the prediction error is further reduced by 33% compared to the second-best method, offering a computationally efficient and high-fidelity approach to evaluate motor structural safety.

[LG-105] A general framework for deep learning

链接: https://arxiv.org/abs/2512.23425
作者: William Kengne,Modou Wade
类目: Statistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper develops a general approach for deep learning in a setting that includes nonparametric regression and classification. We work within a framework where the data fulfill a generalized Bernstein-type inequality, including independent, \phi -mixing, strongly mixing and \mathcal{C} -mixing observations. Two estimators are proposed: a non-penalized deep neural network estimator (NPDNN) and a sparse-penalized deep neural network estimator (SPDNN). For each of these estimators, bounds on the expected excess risk over the class of Hölder smooth functions and composition Hölder functions are established. Applications to independent data, as well as to \phi -mixing, strongly mixing, and \mathcal{C} -mixing processes, are considered. For each of these examples, upper bounds on the expected excess risk of the proposed NPDNN and SPDNN predictors are derived. It is shown that both the NPDNN and SPDNN estimators are minimax optimal (up to a logarithmic factor) in many classical settings.

[LG-106] Probabilistic Modelling is Sufficient for Causal Inference

链接: https://arxiv.org/abs/2512.23408
作者: Bruno Mlodozeniec,David Krueger,Richard E. Turner
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Causal inference is a key research area in machine learning, yet confusion reigns over the tools needed to tackle it. There are prevalent claims in the machine learning literature that you need a bespoke causal framework or notation to answer causal questions. In this paper, we want to make it clear that you \emph{can} answer any causal inference question within the realm of probabilistic modelling and inference, without causal-specific tools or notation. Through concrete examples, we demonstrate how causal questions can be tackled by writing down the probability of everything. Lastly, we reinterpret causal tools as emerging from standard probabilistic modelling and inference, elucidating their necessity and utility.

[LG-107] Persistent Homology via Finite Topological Spaces

链接: https://arxiv.org/abs/2512.23348
作者: Selçuk Kayacan
类目: Algebraic Topology (math.AT); Computational Geometry (cs.CG); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a functorial framework for persistent homology based on finite topological spaces and their associated posets. Starting from a finite metric space, we associate a filtration of finite topologies whose structure maps are continuous identity maps. By passing functorially to posets and to simplicial complexes via crosscut constructions, we obtain persistence modules without requiring inclusion relations between the resulting complexes. We show that standard poset-level simplifications preserve persistent invariants and prove stability of the resulting persistence diagrams under perturbations of the input metric in a density-based instantiation.

[LG-108] Clipped Gradient Methods for Nonsmooth Convex Optimization under Heavy-Tailed Noise: A Refined Analysis

链接: https://arxiv.org/abs/2512.23178
作者: Zijian Liu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Part of this work is in submission

点击查看摘要

Abstract:Optimization under heavy-tailed noise has become popular recently, since it better fits many modern machine learning tasks, as captured by empirical observations. Concretely, instead of a finite second moment on gradient noise, a bounded \mathfrak{p} -th moment where \mathfrak{p}\in(1,2] has been recognized to be more realistic (say being upper bounded by \sigma_{\mathfrak{l}}^{\mathfrak{p}} for some \sigma_{\mathfrak{l}}\ge0 ). A simple yet effective operation, gradient clipping, is known to handle this new challenge successfully. Specifically, Clipped Stochastic Gradient Descent (Clipped SGD) guarantees a high-probability rate \mathcal{O}(\sigma_{\mathfrak{l}}\ln(1/\delta)T^{1/\mathfrak{p}-1}) (resp. \mathcal{O}(\sigma_{\mathfrak{l}}^2\ln^2(1/\delta)T^{2/\mathfrak{p}-2}) ) for nonsmooth convex (resp. strongly convex) problems, where \delta\in(0,1] is the failure probability and T\in\mathbb{N} is the time horizon. In this work, we provide a refined analysis for Clipped SGD and offer two faster rates, \mathcal{O}(\sigma_{\mathfrak{l}}d_{\mathrm{eff}}^{-1/2\mathfrak{p}}\ln^{1-1/\mathfrak{p}}(1/\delta)T^{1/\mathfrak{p}-1}) and \mathcal{O}(\sigma_{\mathfrak{l}}^2d_{\mathrm{eff}}^{-1/\mathfrak{p}}\ln^{2-2/\mathfrak{p}}(1/\delta)T^{2/\mathfrak{p}-2}) , than the aforementioned best results, where d_{\mathrm{eff}}\ge1 is a quantity we call the \emph{generalized effective dimension}. Our analysis improves upon the existing approach on two sides: better utilization of Freedman’s inequality and finer bounds for clipping error under heavy-tailed noise. In addition, we extend the refined analysis to convergence in expectation and obtain new rates that break the known lower bounds. Lastly, to complement the study, we establish new lower bounds for both high-probability and in-expectation convergence. Notably, the in-expectation lower bounds match our new upper bounds, indicating the optimality of our refined analysis for convergence in expectation.
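
For reference, the clipping operation itself is simple. Here is a toy sketch of a Clipped SGD step under heavy-tailed noise (step sizes and the threshold are chosen arbitrarily for illustration; this shows only the update rule, not the paper’s refined analysis):

```python
import numpy as np


def clipped_sgd_step(x, grad, lr, tau):
    """One Clipped SGD step: shrink the stochastic gradient so its norm
    never exceeds the clipping threshold tau, then take a descent step."""
    scale = min(1.0, tau / max(np.linalg.norm(grad), 1e-12))
    return x - lr * scale * grad


# Toy run on f(x) = ||x||^2 / 2 with heavy-tailed (infinite-variance) noise:
# Student-t noise with df = 1.5 has a finite p-th moment only for p < 1.5.
rng = np.random.default_rng(0)
x = np.ones(5)
for t in range(1, 2001):
    noisy_grad = x + rng.standard_t(df=1.5, size=5)
    x = clipped_sgd_step(x, noisy_grad, lr=0.5 / t, tau=5.0)
print(np.linalg.norm(x))  # should be small despite the heavy tails
```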

[LG-109] Why Machine Learning Models Systematically Underestimate Extreme Values II: How to Fix It with LatentNN

链接: https://arxiv.org/abs/2512.23138
作者: Yuan-Sen Ting
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 19 pages, 7 figures, submitted to OJAp

点击查看摘要

Abstract:Attenuation bias – the systematic underestimation of regression coefficients due to measurement errors in input variables – affects astronomical data-driven models. For linear regression, this problem was solved by treating the true input values as latent variables to be estimated alongside model parameters. In this paper, we show that neural networks suffer from the same attenuation bias and that the latent variable solution generalizes directly to neural networks. We introduce LatentNN, a method that jointly optimizes network parameters and latent input values by maximizing the joint likelihood of observing both inputs and outputs. We demonstrate the correction on one-dimensional regression, multivariate inputs with correlated features, and stellar spectroscopy applications. LatentNN reduces attenuation bias across a range of signal-to-noise ratios where standard neural networks show large bias. This provides a framework for improved neural network inference in the low signal-to-noise regime characteristic of astronomical data. This bias correction is most effective when measurement errors are less than roughly half the intrinsic data range; in the regime of very low signal-to-noise and few informative features. Code is available at this https URL.

[LG-110] QSAR-Guided Generative Framework for the Discovery of Synthetically Viable Odorants

链接: https://arxiv.org/abs/2512.23080
作者: Tim C. Pearce,Ahmed Ibrahim
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
*备注:

点击查看摘要

Abstract:The discovery of novel odorant molecules is key for the fragrance and flavor industries, yet efficiently navigating the vast chemical space to identify structures with desirable olfactory properties remains a significant challenge. Generative artificial intelligence offers a promising approach for \textitde novo molecular design but typically requires large sets of molecules to learn from. To address this problem, we present a framework combining a variational autoencoder (VAE) with a quantitative structure-activity relationship (QSAR) model to generate novel odorants from limited training sets of odor molecules. The self-supervised learning capabilities of the VAE allow it to learn SMILES grammar from ChemBL database, while its training objective is augmented with a loss term derived from an external QSAR model to structure the latent representation according to odor probability. While the VAE demonstrated high internal consistency in learning the QSAR supervision signal, validation against an external, unseen ground truth dataset (Unique Good Scents) confirms the model generates syntactically valid structures (100% validity achieved via rejection sampling) and 94.8% unique structures. The latent space is effectively structured by odor likelihood, evidenced by a Fréchet ChemNet Distance (FCD) of \approx 6.96 between generated molecules and known odorants, compared to \approx 21.6 for the ChemBL baseline. Structural analysis via Bemis-Murcko scaffolds reveals that 74.4% of candidates possess novel core frameworks distinct from the training data, indicating the model performs extensive chemical space exploration beyond simple derivatization of known odorants. Generated candidates display physicochemical properties …

[LG-111] Federated Learning With L0 Constraint Via Probabilistic Gates For Sparsity

链接: https://arxiv.org/abs/2512.23071
作者: Krishna Harsha Kovelakuntla Huthasana,Alireza Olama,Andreas Lundell
类目: Machine Learning (stat.ML); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Federated Learning (FL) is a distributed machine learning setting that requires multiple clients to collaborate on training a model while maintaining data privacy. The unaddressed inherent sparsity in data and models often results in overly dense models and poor generalizability under data and client participation heterogeneity. We propose FL with an L0 constraint on the density of non-zero parameters, achieved through a reparameterization using probabilistic gates and their continuous relaxation: originally proposed for sparsity in centralized machine learning. We show that the objective for L0 constrained stochastic minimization naturally arises from an entropy maximization problem of the stochastic gates and propose an algorithm based on federated stochastic gradient descent for distributed learning. We demonstrate that the target density (rho) of parameters can be achieved in FL, under data and client participation heterogeneity, with minimal loss in statistical performance for linear and non-linear models: Linear regression (LR), Logistic regression (LG), Softmax multi-class classification (MC), Multi-label classification with logistic units (MLC), Convolution Neural Network (CNN) for multi-class classification (MC). We compare the results with a magnitude pruning-based thresholding algorithm for sparsity in FL. Experiments on synthetic data with target density down to rho = 0.05 and publicly available RCV1, MNIST, and EMNIST datasets with target density down to rho = 0.005 demonstrate that our approach is communication-efficient and consistently better in statistical performance.

[LG-112] Risk-Averse Learning with Varying Risk Levels

链接: https://arxiv.org/abs/2512.22986
作者: Siyi Wang,Zifan Wang,Karl H. Johansson
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In safety-critical decision-making, the environment may evolve over time, and the learner adjusts its risk level accordingly. This work investigates risk-averse online optimization in dynamic environments with varying risk levels, employing Conditional Value-at-Risk (CVaR) as the risk measure. To capture the dynamics of the environment and risk levels, we employ the function variation metric and introduce a novel risk-level variation metric. Two information settings are considered: a first-order scenario, where the learner observes both function values and their gradients; and a zeroth-order scenario, where only function evaluations are available. For both cases, we develop risk-averse learning algorithms with a limited sampling budget and analyze their dynamic regret bounds in terms of function variation, risk-level variation, and the total number of samples. The regret analysis demonstrates the adaptability of the algorithms in non-stationary and risk-sensitive settings. Finally, numerical experiments are presented to demonstrate the efficacy of the methods.
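
As background, CVaR at level alpha is the expected loss in the worst tail beyond the alpha-quantile (the VaR). A minimal empirical estimator follows, using one common convention among several; the paper’s algorithms build on this quantity rather than this estimator specifically:

```python
import numpy as np


def empirical_cvar(losses, alpha):
    """Empirical CVaR at level alpha: the average of the losses at or
    above the alpha-level Value-at-Risk (one common convention)."""
    losses = np.asarray(losses)
    var = np.quantile(losses, alpha)          # the alpha-level VaR
    return losses[losses >= var].mean()       # mean of the upper tail


rng = np.random.default_rng(0)
samples = rng.lognormal(size=10_000)          # heavy right tail
for a in (0.5, 0.9, 0.99):                    # tighter risk level -> larger CVaR
    print(a, round(empirical_cvar(samples, a), 3))
```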

[LG-113] Deep Learning for the Multiple Optimal Stopping Problem

链接: https://arxiv.org/abs/2512.22961
作者: Mathieu Laurière,Mehdi Talbi
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a novel deep learning framework for solving multiple optimal stopping problems in high dimensions. While deep learning has recently shown promise for single stopping problems, the multiple exercise case involves complex recursive dependencies that remain challenging. We address this by combining the Dynamic Programming Principle with neural network approximation of the value function. Unlike policy-search methods, our algorithm explicitly learns the value surface. We first consider the discrete-time problem and analyze neural network training error. We then turn to continuous problems and analyze the additional error due to the discretization of the underlying stochastic processes. Numerical experiments on high-dimensional American basket options and nonlinear utility maximization demonstrate that our approach provides an efficient and scalable method for the multiple optimal stopping problem.

[LG-114] A first-order method for nonconvex-strongly-concave constrained minimax optimization

链接: https://arxiv.org/abs/2512.22909
作者: Zhaosong Lu,Sanyou Mei
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: Accepted by Optimization Methods and Software

点击查看摘要

Abstract:In this paper we study a nonconvex-strongly-concave constrained minimax problem. Specifically, we propose a first-order augmented Lagrangian method for solving it, whose subproblems are nonconvex-strongly-concave unconstrained minimax problems and suitably solved by a first-order method developed in this paper that leverages the strong concavity structure. Under suitable assumptions, the proposed method achieves an \emph{operation complexity} of O(\varepsilon^{-3.5}\log\varepsilon^{-1}) , measured in terms of its fundamental operations, for finding an \varepsilon -KKT solution of the constrained minimax problem, which improves the previous best-known operation complexity by a factor of \varepsilon^{-0.5} .

[LG-115] Causal-Policy Forest for End-to-End Policy Learning

链接: https://arxiv.org/abs/2512.22846
作者: Masahiro Kato
类目: Econometrics (econ.EM); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This study proposes an end-to-end algorithm for policy learning in causal inference. We observe data consisting of covariates, treatment assignments, and outcomes, where only the outcome corresponding to the assigned treatment is observed. The goal of policy learning is to train a policy from the observed data, where a policy is a function that recommends an optimal treatment for each individual, to maximize the policy value. In this study, we first show that maximizing the policy value is equivalent to minimizing the mean squared error for the conditional average treatment effect (CATE) under \{-1, 1\} -restricted regression models. Based on this finding, we modify the causal forest, an end-to-end CATE estimation algorithm, for policy learning. We refer to our algorithm as the causal-policy forest. Our algorithm has three advantages. First, it is a simple modification of an existing, widely used CATE estimation method; therefore, it helps bridge the gap between policy learning and CATE estimation in practice. Second, while existing studies typically estimate nuisance parameters for policy learning as a separate task, our algorithm trains the policy in a more end-to-end manner. Third, as in standard decision trees and random forests, we train the models efficiently, avoiding computational intractability.
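
One way to see the claimed equivalence, in standard potential-outcome notation (our reconstruction, not the paper’s derivation): let tau(x) be the CATE, g(x) in {-1, 1} the restricted regression model, and pi(x) = 1{g(x) = 1} the induced policy.

```latex
% tau(x) = E[Y(1) - Y(0) | X = x] is the CATE; g(x) takes values in {-1, 1};
% pi(x) = 1{g(x) = 1} is the induced policy. Expanding the MSE:
\begin{align*}
E\big[(\tau(X) - g(X))^2\big]
  &= E[\tau(X)^2] - 2\,E[\tau(X)\,g(X)] + 1
     && \text{since } g(X)^2 = 1 \\
  &= \underbrace{E[\tau(X)^2] + 2\,E[\tau(X)] + 1}_{\text{independent of } g}
     \; - \; 4\,E\big[\mathbf{1}\{g(X) = 1\}\,\tau(X)\big]
     && \text{using } g = 2\cdot\mathbf{1}\{g = 1\} - 1.
\end{align*}
% The policy value satisfies V(pi) = E[Y(0)] + E[1{pi(X) = 1} tau(X)], so
% minimizing the MSE over g in {-1, 1} is the same as maximizing V(pi).
```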

[LG-116] Nonlinear Dynamical Modeling of Human Intracranial Brain Activity with Flexible Inference

链接: https://arxiv.org/abs/2512.22785
作者: Kiarash Vaziri,Lucine L. Oganesian,HyeongChan Jo,Roberto M.C. Vera,Charles Y. Liu,Brian Lee,Maryam M. Shanechi
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Dynamical modeling of multisite human intracranial neural recordings is essential for developing neurotechnologies such as brain-computer interfaces (BCIs). Linear dynamical models are widely used for this purpose due to their interpretability and their suitability for BCIs. In particular, these models enable flexible real-time inference, even in the presence of missing neural samples, which often occur in wireless BCIs. However, neural activity can exhibit nonlinear structure that is not captured by linear models. Furthermore, while recurrent neural network models can capture nonlinearity, their inference does not directly address handling missing observations. To address this gap, recent work introduced DFINE, a deep learning framework that integrates neural networks with linear state-space models to capture nonlinearities while enabling flexible inference. However, DFINE was developed for intracortical recordings that measure localized neuronal populations. Here we extend DFINE to modeling of multisite human intracranial electroencephalography (iEEG) recordings. We find that DFINE significantly outperforms linear state-space models (LSSMs) in forecasting future neural activity. Furthermore, DFINE matches or exceeds the accuracy of a gated recurrent unit (GRU) model in neural forecasting, indicating that a linear dynamical backbone, when paired and jointly trained with nonlinear neural networks, can effectively describe the dynamics of iEEG signals while also enabling flexible inference. Additionally, DFINE handles missing observations more robustly than the baselines, demonstrating its flexible inference and utility for BCIs. Finally, DFINE’s advantage over LSSM is more pronounced in high gamma spectral bands. Taken together, these findings highlight DFINE as a strong and flexible framework for modeling human iEEG dynamics, with potential applications in next-generation BCIs.

[LG-117] Machine learning models for predicting catastrophe bond coupons using climate data

链接: https://arxiv.org/abs/2512.22660
作者: Julia Kończal,Michał Balcerek,Krzysztof Burnecki
类目: Pricing of Securities (q-fin.PR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, the growing frequency and severity of natural disasters have increased the need for effective tools to manage catastrophe risk. Catastrophe (CAT) bonds allow the transfer of part of this risk to investors, offering an alternative to traditional reinsurance. This paper examines the role of climate variability in CAT bond pricing and evaluates the predictive performance of various machine learning models in forecasting CAT bond coupons. We combine features typically used in the literature with a new set of climate indicators, including Oceanic Niño Index, Arctic Oscillation, North Atlantic Oscillation, Outgoing Longwave Radiation, Pacific-North American pattern, Pacific Decadal Oscillation, Southern Oscillation Index, and sea surface temperatures. We compare the performance of linear regression with several machine learning algorithms, such as random forest, gradient boosting, extremely randomized trees, and extreme gradient boosting. Our results show that including climate-related variables improves predictive accuracy across all models, with extremely randomized trees achieving the lowest root mean squared error (RMSE). These findings suggest that large-scale climate variability has a measurable influence on CAT bond pricing and that machine learning methods can effectively capture these complex relationships.
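
A minimal sketch of the model comparison described above, with synthetic data and hypothetical column names standing in for the climate indices:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical frame: bond terms plus large-scale climate indices.
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "expected_loss": rng.uniform(0.5, 8.0, n),
    "term_years": rng.integers(1, 6, n),
    "oni": rng.normal(0, 1, n),          # Oceanic Nino Index
    "nao": rng.normal(0, 1, n),          # North Atlantic Oscillation
    "sst_anom": rng.normal(0, 0.5, n),   # sea surface temperature anomaly
})
coupon = 2.0 + 0.9 * df.expected_loss + 0.4 * df.sst_anom + rng.normal(0, 0.3, n)

X_tr, X_te, y_tr, y_te = train_test_split(df, coupon, random_state=0)
for name, model in [("linear", LinearRegression()),
                    ("extra_trees", ExtraTreesRegressor(n_estimators=300,
                                                        random_state=0))]:
    model.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5
    print(name, round(rmse, 3))
```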

[LG-118] Likelihood-Preserving Embeddings for Statistical Inference

链接: https://arxiv.org/abs/2512.22638
作者: Deniz Akdemir
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Modern machine learning embeddings provide powerful compression of high-dimensional data, yet they typically destroy the geometric structure required for classical likelihood-based statistical inference. This paper develops a rigorous theory of likelihood-preserving embeddings: learned representations that can replace raw data in likelihood-based workflows – hypothesis testing, confidence interval construction, model selection – without altering inferential conclusions. We introduce the Likelihood-Ratio Distortion metric \Delta_n, which measures the maximum error in log-likelihood ratios induced by an embedding. Our main theoretical contribution is the Hinge Theorem, which establishes that controlling \Delta_n is necessary and sufficient for preserving inference. Specifically, if the distortion satisfies \Delta_n = o_p(1), then (i) all likelihood-ratio based tests and Bayes factors are asymptotically preserved, and (ii) surrogate maximum likelihood estimators are asymptotically equivalent to full-data MLEs. We prove an impossibility result showing that universal likelihood preservation requires essentially invertible embeddings, motivating the need for model-class-specific guarantees. We then provide a constructive framework using neural networks as approximate sufficient statistics, deriving explicit bounds connecting training loss to inferential guarantees. Experiments on Gaussian and Cauchy distributions validate the sharp phase transition predicted by exponential family theory, and applications to distributed clinical inference demonstrate practical utility.
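
The abstract does not reproduce the definition, but a natural reading of the Likelihood-Ratio Distortion for an embedding T and model class \{p_\theta\} is the worst-case log-likelihood-ratio error (our formulation, offered only as an illustration):

```latex
\Delta_n \;=\; \sup_{\theta,\,\theta'}
  \left|\;
  \log \frac{p_\theta(X_{1:n})}{p_{\theta'}(X_{1:n})}
  \;-\;
  \log \frac{q_\theta\!\left(T(X_{1:n})\right)}{q_{\theta'}\!\left(T(X_{1:n})\right)}
  \;\right|,
```

where q_\theta denotes the induced law of the embedded data. Under this reading, \Delta_n = o_p(1) is exactly what is needed for tests and MLEs computed from T(X) to agree asymptotically with those computed from X, as the Hinge Theorem states.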

[LG-119] Uncertainty-Aware Flow Field Reconstruction Using SVGP Kolmogorov-Arnold Networks

链接: https://arxiv.org/abs/2512.22426
作者: Y. Sungtaek Ju
类目: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
*备注: 36 pages, 11 figures, submitted for publication in a journal

点击查看摘要

Abstract:Reconstructing time-resolved flow fields from temporally sparse velocimetry measurements is critical for characterizing many complex thermal-fluid systems. We introduce a machine learning framework for uncertainty-aware flow reconstruction using sparse variational Gaussian processes in the Kolmogorov-Arnold network topology (SVGP-KAN). This approach extends the classical foundations of Linear Stochastic Estimation (LSE) and Spectral Analysis Modal Methods (SAMM) while enabling principled epistemic uncertainty quantification. We perform a systematic comparison of our framework with the classical reconstruction methods as well as Kalman filtering. Using synthetic data from pulsed impingement jet flows, we assess performance across fractional PIV sampling rates ranging from 0.5% to 10%. Evaluation metrics include reconstruction error, generalization gap, structure preservation, and uncertainty calibration. Our SVGP-KAN methods achieve reconstruction accuracy comparable to established methods, while also providing well-calibrated uncertainty estimates that reliably indicate when and where predictions degrade. The results demonstrate a robust, data-driven framework for flow field reconstruction with meaningful uncertainty quantification and offer practical guidance for experimental design in periodic flows.
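
The classical baseline here, Linear Stochastic Estimation, reconstructs the full field as a linear map from the sparse sensor readings; a minimal least-squares sketch on synthetic snapshots (not the paper's pulsed-jet data) shows the estimator that the SVGP-KAN framework extends with uncertainty quantification:

```python
import numpy as np

rng = np.random.default_rng(2)
n_snapshots, n_field, n_sensors = 400, 256, 8
# Synthetic flow snapshots built from a few spatial modes.
modes = rng.standard_normal((3, n_field))
coeffs = rng.standard_normal((n_snapshots, 3))
U = coeffs @ modes                              # full fields, one per row
sensor_idx = rng.choice(n_field, n_sensors, replace=False)
S = U[:, sensor_idx] + 0.05 * rng.standard_normal((n_snapshots, n_sensors))

# Linear Stochastic Estimation: U ~ S @ L, solved by least squares.
L, *_ = np.linalg.lstsq(S, U, rcond=None)
U_hat = S @ L
print("relative error:", np.linalg.norm(U_hat - U) / np.linalg.norm(U))
```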

[LG-120] Integrating Wide and Deep Neural Networks with Squeeze-and-Excitation Blocks for Multi-Target Property Prediction in Additively Manufactured Fiber Reinforced Composites

链接: https://arxiv.org/abs/2512.22397
作者: Behzad Parvaresh,Rahmat K. Adesunkanmi,Adel Alaeddini
类目: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
*备注: This manuscript has been submitted to an Elsevier journal for possible publication

点击查看摘要

Abstract:Continuous fiber-reinforced composite manufactured by additive manufacturing (CFRC-AM) offers opportunities for printing lightweight materials with high specific strength. However, their performance is sensitive to the interaction of process and material parameters, making exhaustive experimental testing impractical. In this study, we introduce a data-efficient, multi-input, multi-target learning approach that integrates Latin Hypercube Sampling (LHS)-guided experimentation with a squeeze-and-excitation wide and deep neural network (SE-WDNN) to jointly predict multiple mechanical and manufacturing properties of CFRC-AMs based on different manufacturing parameters. We printed and tested 155 specimens selected from a design space of 4,320 combinations using a Markforged Mark Two 3D printer. The processed data formed the input-output set for our proposed model. We compared the results with those from commonly used machine learning models, including feedforward neural networks, Kolmogorov-Arnold networks, XGBoost, CatBoost, and random forests. Our model achieved the lowest overall test error (MAPE = 12.33%) and showed statistically significant improvements over the baseline wide and deep neural network for several target variables (paired t-tests, p < 0.05). SHapley Additive exPlanations (SHAP) analysis revealed that reinforcement strategy was the major influence on mechanical performance. Overall, this study demonstrates that the integration of LHS and SE-WDNN enables interpretable and sample-efficient multi-target predictions, guiding parameter selection in CFRC-AM with a balance between mechanical behavior and manufacturing metrics.
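
A minimal PyTorch sketch of the architectural pattern named in the title: a wide (linear) path plus a deep path whose hidden features are gated by squeeze-and-excitation blocks, with a multi-target head. All layer sizes are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight feature channels by a learned gate."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):            # x: (batch, channels)
        return x * self.gate(x)

class SEWideDeep(nn.Module):
    """Wide (linear) path + deep path with SE gating, multi-target output."""
    def __init__(self, n_features, n_targets, hidden=64):
        super().__init__()
        self.wide = nn.Linear(n_features, n_targets)
        self.deep = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            SEBlock(hidden),
            nn.Linear(hidden, hidden), nn.ReLU(),
            SEBlock(hidden))
        self.head = nn.Linear(hidden, n_targets)

    def forward(self, x):
        return self.wide(x) + self.head(self.deep(x))

model = SEWideDeep(n_features=12, n_targets=4)
y = model(torch.randn(8, 12))        # (8, 4): four properties per specimen
```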

[LG-121] A General Weighting Theory for Ensemble Learning: Beyond Variance Reduction via Spectral and Geometric Structure

链接: https://arxiv.org/abs/2512.22286
作者: Ernest Fokoué
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 25 pages

点击查看摘要

Abstract:Ensemble learning is traditionally justified as a variance-reduction strategy, explaining its strong performance for unstable predictors such as decision trees. This explanation, however, does not account for ensembles constructed from intrinsically stable estimators-including smoothing splines, kernel ridge regression, Gaussian process regression, and other regularized reproducing kernel Hilbert space (RKHS) methods whose variance is already tightly controlled by regularization and spectral shrinkage. This paper develops a general weighting theory for ensemble learning that moves beyond classical variance-reduction arguments. We formalize ensembles as linear operators acting on a hypothesis space and endow the space of weighting sequences with geometric and spectral constraints. Within this framework, we derive a refined bias-variance approximation decomposition showing how non-uniform, structured weights can outperform uniform averaging by reshaping approximation geometry and redistributing spectral complexity, even when variance reduction is negligible. Our main results provide conditions under which structured weighting provably dominates uniform ensembles, and show that optimal weights arise as solutions to constrained quadratic programs. Classical averaging, stacking, and recently proposed Fibonacci-based ensembles appear as special cases of this unified theory, which further accommodates geometric, sub-exponential, and heavy-tailed weighting laws. Overall, the work establishes a principled foundation for structure-driven ensemble learning, explaining why ensembles remain effective for smooth, low-variance base learners and setting the stage for distribution-adaptive and dynamically evolving weighting schemes developed in subsequent work.
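
In the simplest instance of the constrained quadratic programs mentioned above, minimizing the ensemble's squared error w^T C w over weights summing to one (C the second-moment matrix of base-learner errors) has the closed form w ∝ C^{-1} 1. A sketch under those assumptions:

```python
import numpy as np

def optimal_weights(residuals):
    """residuals: (n_samples, n_models) errors of each base learner.
    Minimize w' C w with C = E[e e'] subject to sum(w) = 1;
    closed form: w = C^{-1} 1 / (1' C^{-1} 1)."""
    C = residuals.T @ residuals / len(residuals)
    C += 1e-8 * np.eye(C.shape[0])          # small ridge for stability
    w = np.linalg.solve(C, np.ones(C.shape[0]))
    return w / w.sum()

rng = np.random.default_rng(3)
# Five base learners with increasingly noisy errors around the truth.
residuals = rng.standard_normal((1000, 5)) * np.array([0.2, 0.4, 0.4, 0.8, 1.0])
w = optimal_weights(residuals)
print(np.round(w, 3))   # mass concentrates on the low-error learners
```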

[LG-122] On Fibonacci Ensembles: An Alternative Approach to Ensemble Learning Inspired by the Timeless Architecture of the Golden Ratio

链接: https://arxiv.org/abs/2512.22284
作者: Ernest Fokoué
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 33 pages, 4 figures

点击查看摘要

Abstract:Nature rarely reveals her secrets bluntly, yet in the Fibonacci sequence she grants us a glimpse of her quiet architecture of growth, harmony, and recursive stability \citep{Koshy2001Fibonacci, Livio2002GoldenRatio}. From spiral galaxies to the unfolding of leaves, this humble sequence reflects a universal grammar of balance. In this work, we introduce \emph{Fibonacci Ensembles}, a mathematically principled yet philosophically inspired framework for ensemble learning that complements and extends classical aggregation schemes such as bagging, boosting, and random forests \citep{Breiman1996Bagging, Breiman2001RandomForests, Friedman2001GBM, Zhou2012Ensemble, HastieTibshiraniFriedman2009ESL}. Two intertwined formulations unfold: (1) the use of normalized Fibonacci weights – tempered through orthogonalization and Rao–Blackwell optimization – to achieve systematic variance reduction among base learners, and (2) a second-order recursive ensemble dynamic that mirrors the Fibonacci flow itself, enriching representational depth beyond classical boosting. The resulting methodology is at once rigorous and poetic: a reminder that learning systems flourish when guided by the same intrinsic harmonies that shape the natural world. Through controlled one-dimensional regression experiments using both random Fourier feature ensembles \citep{RahimiRecht2007RFF} and polynomial ensembles, we exhibit regimes in which Fibonacci weighting matches or improves upon uniform averaging and interacts in a principled way with orthogonal Rao–Blackwellization. These findings suggest that Fibonacci ensembles form a natural and interpretable design point within the broader theory of ensemble learning.
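
A minimal sketch of the first formulation, normalized Fibonacci weights over K base learners (the orthogonalization and Rao-Blackwell refinements, and how learners are ordered before weighting, are design choices the sketch leaves out):

```python
import numpy as np

def fibonacci_weights(k):
    """First k Fibonacci numbers, normalized to sum to 1."""
    f = [1, 1]
    while len(f) < k:
        f.append(f[-1] + f[-2])
    w = np.array(f[:k], dtype=float)
    return w / w.sum()

# Weighted aggregation of base-learner predictions
# (rows: samples, columns: learners).
preds = np.random.default_rng(4).standard_normal((100, 6))
w = fibonacci_weights(6)         # [0.05, 0.05, 0.1, 0.15, 0.25, 0.4]
ensemble = preds @ w
```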

[LG-123] A review of NMF PLSA LBA EMA and LCA with a focus on the identifiability issue

链接: https://arxiv.org/abs/2512.22282
作者: Qianqian Qi,Peter G. M. van der Heijden
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:Across fields such as machine learning, social science, and geography, considerable attention has been given to models that factorize a nonnegative matrix into the product of two or three matrices, subject to nonnegative or row-sum-to-1 constraints. Although these models are to a large extent similar or even equivalent, they are presented under different names, and their similarity is not well known. This paper highlights similarities among five popular models: latent budget analysis (LBA), latent class analysis (LCA), end-member analysis (EMA), probabilistic latent semantic analysis (PLSA), and nonnegative matrix factorization (NMF). We focus on identifiability, an essential issue for these models, and prove that the solution of LBA, EMA, LCA, and PLSA is unique if and only if the solution of NMF is unique. We also provide a brief review of algorithms for these models. We illustrate the models with a time budget dataset from social science, and end the paper with a discussion of closely related models such as archetypal analysis.
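
The equivalence the review centers on can be made concrete: any nonnegative factorization W H can be rescaled into PLSA-style conditional probabilities, with the normalizers absorbed into a latent class prior. A sketch of that standard rescaling argument (not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(5)
W = rng.random((20, 3))              # nonnegative factors of some NMF, V ~ W @ H
H = rng.random((3, 30))

cw = W.sum(axis=0)                   # column sums of W
rh = H.sum(axis=1)                   # row sums of H
P_w_given_z = W / cw                 # columns sum to 1: P(row | latent class)
P_d_given_z = H / rh[:, None]        # rows sum to 1:    P(column | latent class)
p_z = cw * rh / (cw * rh).sum()      # latent class prior absorbs the scales

# PLSA parameterization: P(w, d) = sum_z p_z P(w|z) P(d|z), which reproduces
# W @ H up to one global constant -- so uniqueness questions transfer between
# the two models, the review's central point.
reconstructed = (P_w_given_z * p_z) @ P_d_given_z
assert np.allclose(reconstructed * (cw * rh).sum(), W @ H)
```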

[LG-124] INSIGHT: Spatially resolved survival modelling from routine histology crosslinked with molecular profiling reveals prognostic epithelial-immune axes in stage II/III colorectal cancer

链接: https://arxiv.org/abs/2512.22262
作者: Piotr Keller,Mark Eastwood,Zedong Hu,Aimée Selten,Ruqayya Awan,Gertjan Rasschaert,Sara Verbandt,Vlad Popovici,Hubert Piessevaux,Hayley T Morris,Petros Tsantoulis,Thomas Alexander McKee,André D’Hoore,Cédric Schraepen,Xavier Sagaert,Gert De Hertogh,Sabine Tejpar,Fayyaz Minhas
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Routine histology contains rich prognostic information in stage II/III colorectal cancer, much of which is embedded in complex spatial tissue organisation. We present INSIGHT, a graph neural network that predicts survival directly from routine histology images. Trained and cross-validated on TCGA (n=342) and SURGEN (n=336), INSIGHT produces patient-level spatially resolved risk scores. Large independent validation showed superior prognostic performance compared with pTNM staging (C-index 0.68-0.69 vs 0.44-0.58). INSIGHT spatial risk maps recapitulated canonical prognostic histopathology and identified nuclear solidity and circularity as quantitative risk correlates. Integrating spatial risk with data-driven spatial transcriptomic signatures, spatial proteomics, bulk RNA-seq, and single-cell references revealed an epithelium-immune risk manifold capturing epithelial dedifferentiation and fetal programs, myeloid-driven stromal states including \mathrm{SPP1}^+ macrophages and \mathrm{LAMP3}^+ dendritic cells, and adaptive immune dysfunction. This analysis exposed patient-specific epithelial heterogeneity, stratification within MSI-High tumours, and high-risk routes of CDX2/HNF4A loss and CEACAM5/6-associated proliferative programs, highlighting coordinated therapeutic vulnerabilities.

[LG-125] PaperNet: Efficient Temporal Convolutions and Channel Residual Attention for EEG Epilepsy Detection

链接: https://arxiv.org/abs/2512.22172
作者: Md Shahriar Sajid,Abhijit Kumar Ghosh,Fariha Nusrat
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 15 pages, 4 figures, International Conference on Intelligent Data Analysis and Applications (IDAA 2025)

点击查看摘要

Abstract:Electroencephalography (EEG) signals contain rich temporal-spectral structure but are difficult to model due to noise, subject variability, and multi-scale dynamics. Lightweight deep learning models have shown promise, yet many either rely solely on local convolutions or require heavy recurrent modules. This paper presents PaperNet, a compact hybrid architecture that combines temporal convolutions, a channel-wise residual attention module, and a lightweight bidirectional recurrent block for short-window classification. Using the publicly available BEED: Bangalore EEG Epilepsy Dataset, we evaluate PaperNet under a clearly defined subject-independent training protocol and compare it against established and widely used lightweight baselines. The model achieves a macro-F1 of 0.96 on the held-out test set with approximately 0.6M parameters, while maintaining balanced performance across all four classes. An ablation study demonstrates the contribution of temporal convolutions, residual attention, and recurrent aggregation. Channel-wise attention weights further offer insights into electrode relevance. Computational profiling shows that PaperNet remains efficient enough for practical deployment on resource-constrained systems throughout the entire pipeline. These results indicate that carefully combining temporal filtering, channel reweighting, and recurrent context modeling can yield strong EEG classification performance without excessive computational cost.
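
A minimal sketch of the three named ingredients (temporal convolutions, channel-wise residual attention, a light bidirectional recurrent block); sizes are illustrative and this is not the published architecture:

```python
import torch
import torch.nn as nn

class ChannelResidualAttention(nn.Module):
    """Reweight feature channels with a gate and add the input back."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                 # x: (batch, channels, time)
        s = x.mean(dim=-1)                # squeeze over time
        return x + x * self.gate(s).unsqueeze(-1)

class EEGNetSketch(nn.Module):
    def __init__(self, in_ch=16, n_classes=4, width=32):
        super().__init__()
        self.tconv = nn.Sequential(
            nn.Conv1d(in_ch, width, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(width, width, kernel_size=7, padding=3), nn.ReLU())
        self.attn = ChannelResidualAttention(width)
        self.rnn = nn.GRU(width, width // 2, bidirectional=True,
                          batch_first=True)
        self.head = nn.Linear(width, n_classes)

    def forward(self, x):                 # x: (batch, in_ch, time)
        h = self.attn(self.tconv(x))
        out, _ = self.rnn(h.transpose(1, 2))
        return self.head(out[:, -1])      # last bidirectional state

logits = EEGNetSketch()(torch.randn(8, 16, 256))   # (8, 4)
```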

[LG-126] Sampling with Shielded Langevin Monte Carlo Using Navigation Potentials

链接: https://arxiv.org/abs/2512.22153
作者: Nicolas Zilberstein,Santiago Segarra,Luiz Chamon
类目: Computation (stat.CO); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We introduce shielded Langevin Monte Carlo (LMC), a constrained sampler inspired by navigation functions, capable of sampling from unnormalized target distributions defined over punctured supports. In other words, this approach samples from non-convex spaces defined as convex sets with convex holes. This defines a novel and challenging problem in constrained sampling. To do so, the sampler incorporates a combination of a spatially adaptive temperature and a repulsive drift to ensure that samples remain within the feasible region. Experiments on a 2D Gaussian mixture and multiple-input multiple-output (MIMO) symbol detection showcase the advantages of the proposed shielded LMC in contrast to unconstrained cases.
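
The paper's navigation-potential construction is more principled than what fits here; the drift and temperature forms below are illustrative stand-ins for the two mechanisms the abstract names (a repulsive drift near a circular hole and a noise temperature that cools toward the boundary):

```python
import numpy as np

rng = np.random.default_rng(6)

def grad_log_target(x):
    return -x                    # standard Gaussian target, log p = -||x||^2 / 2

hole_c, hole_r = np.array([1.0, 0.0]), 0.5   # a convex hole in the support

def shielded_lmc(x, steps=5000, eps=1e-2):
    out = []
    for _ in range(steps):
        d = x - hole_c
        dist = max(np.linalg.norm(d), 1e-9)
        gap = max(dist - hole_r, 1e-6)        # distance to the hole boundary
        repulse = d / (dist * gap)            # drift pushing away from the hole
        temp = min(1.0, gap)                  # cool the noise near the boundary
        x = x + eps * (grad_log_target(x) + 0.5 * repulse) \
              + np.sqrt(2 * eps * temp) * rng.standard_normal(2)
        if np.linalg.norm(x - hole_c) > hole_r:   # keep only feasible samples
            out.append(x)
    return np.array(out)

samples = shielded_lmc(np.array([-2.0, 0.0]))
```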

[LG-127] Machine Learning-Based Basil Yield Prediction in IoT-Enabled Indoor Vertical Hydroponic Farms

链接: https://arxiv.org/abs/2512.22151
作者: Emna Bouzid,Noura Baccar,Kamran Iqbal,Yassine Chaouch,Fares Ben Youssef,Amine Regayeg,Sarra Toumi,Houda Nsir,Amina Mseddi,Leila Costelle
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 38 pages, 11 figures, 7 tables

点击查看摘要

Abstract:As agriculture faces increasing pressure from water scarcity, especially in regions like Tunisia, innovative, resource-efficient solutions are urgently needed. This work explores the integration of indoor vertical hydroponics with Machine Learning (ML) techniques to optimize basil yield while saving water. This research develops a prediction system that uses different ML models and assesses their performance. The models were systematically trained and tested using data collected from IoT sensors measuring various environmental parameters such as CO2 and light. The experimental setup features 21 basil crops and uses Raspberry Pi and Arduino. 10k data points were collected and used to train and evaluate three ML models: Linear Regression (LR), Long Short-Term Memory (LSTM), and Deep Neural Networks (DNN). The comparative analysis of the performance of each model revealed that, while LSTM showed high predictive capability and an accuracy of 99%, its execution time was 10 times longer than LR and its RAM usage was about 3 times higher than DNN's when simulated on a standard CPU environment. Conversely, the DNN model had an accuracy rate of 98%. This demonstrates an efficient balance between computational speed and prediction quality, which makes this model well-suited for real-life deployment. Moreover, LR excelled in fast processing of basic predictions with an execution time of 11 seconds. This makes the LR model more suitable for low-complexity or resource-limited applications. These performance trade-offs highlight the potential of DNN-based solutions for building responsive, high-accuracy decision-support systems tailored to agricultural environments, making it suitable for future edge-device deployment.
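
A minimal sketch of the accuracy-versus-execution-time comparison on hypothetical sensor data (scikit-learn stand-ins for the LR and DNN models; the LSTM is omitted):

```python
import time
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Hypothetical sensor frame: environmental readings -> yield proxy.
rng = np.random.default_rng(8)
X = rng.normal(size=(10_000, 5))      # e.g. CO2, light, temperature, humidity, pH
y = X @ np.array([0.5, 1.2, -0.3, 0.8, 0.1]) + rng.normal(0, 0.2, 10_000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [("LR", LinearRegression()),
                    ("DNN", MLPRegressor(hidden_layer_sizes=(64, 64),
                                         max_iter=300, random_state=0))]:
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)
    print(name, f"R2={model.score(X_te, y_te):.3f}",
          f"train_time={time.perf_counter() - t0:.1f}s")
```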

[LG-128] EEG-to-Voice Decoding of Spoken and Imagined speech Using Non-Invasive EEG

链接: https://arxiv.org/abs/2512.22146
作者: Hanbeot Park,Yunjeong Cho,Hunhee Kim
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Sound (cs.SD)
*备注: 20 pages, 7 figures, 4 tables

点击查看摘要

Abstract:Restoring speech communication from neural signals is a central goal of brain-computer interface research, yet EEG-based speech reconstruction remains challenging due to limited spatial resolution, susceptibility to noise, and the absence of temporally aligned acoustic targets in imagined speech. In this study, we propose an EEG-to-Voice paradigm that directly reconstructs speech from non-invasive EEG signals without dynamic time warping (DTW) or explicit temporal alignment. The proposed pipeline generates mel-spectrograms from EEG in an open-loop manner using a subject-specific generator, followed by pretrained vocoder and automatic speech recognition (ASR) modules to synthesize speech waveforms and decode text. Separate generators were trained for spoken speech and imagined speech, and transfer learning-based domain adaptation was applied by pretraining on spoken speech and adapting to imagined speech. A minimal language model-based correction module was optionally applied to correct limited ASR errors while preserving semantic structure. The framework was evaluated under 2 s and 4 s speech conditions using acoustic-level metrics (PCC, RMSE, MCD) and linguistic-level metrics (CER, WER). Stable acoustic reconstruction and comparable linguistic accuracy were observed for both spoken speech and imagined speech. While acoustic similarity decreased for longer utterances, text-level decoding performance was largely preserved, and word-position analysis revealed a mild increase in decoding errors toward later parts of sentences. The language model-based correction consistently reduced CER and WER without introducing semantic distortion. These results demonstrate the feasibility of direct, open-loop EEG-to-Voice reconstruction for spoken speech and imagined speech without explicit temporal alignment.
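
A minimal sketch of the open-loop idea: a generator maps an EEG window straight to mel-spectrogram frames under a frame-wise loss, so no DTW or temporal alignment is needed (the architecture and sizes are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

class EEG2Mel(nn.Module):
    """Map an EEG window (channels x time) to mel-spectrogram frames in one
    forward pass -- open loop, no alignment between EEG and audio needed."""
    def __init__(self, eeg_ch=64, n_mels=80, width=128):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv1d(eeg_ch, width, 5, stride=2, padding=2), nn.GELU(),
            nn.Conv1d(width, width, 5, stride=2, padding=2), nn.GELU())
        self.dec = nn.Conv1d(width, n_mels, 1)

    def forward(self, eeg):              # (B, eeg_ch, T) -> (B, n_mels, T // 4)
        return self.dec(self.enc(eeg))

model = EEG2Mel()
mel_hat = model(torch.randn(4, 64, 512))           # (4, 80, 128)
loss = nn.functional.mse_loss(mel_hat, torch.randn_like(mel_hat))
# A pretrained vocoder and an ASR model would then turn mel_hat into a
# waveform and text, as in the paper's pipeline.
```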

[LG-129] UniFi: Combining Irregularly Sampled CSI from Diverse Communication Packets and Frequency Bands for Wi-Fi Sensing

链接: https://arxiv.org/abs/2512.22143
作者: Gaofeng Dong,Kang Yang,Mani Srivastava
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 14 pages, 10 figures

点击查看摘要

Abstract:Existing Wi-Fi sensing systems rely on injecting high-rate probing packets to extract channel state information (CSI), leading to communication degradation and poor deployability. Although Integrated Sensing and Communication (ISAC) is a promising direction, existing solutions still rely on auxiliary packet injection because they exploit only CSI from data frames. We present UniFi, the first Wi-Fi-based ISAC framework that fully eliminates intrusive packet injection by directly exploiting irregularly sampled CSI from diverse communication packets across multiple frequency bands. UniFi integrates a CSI sanitization pipeline to harmonize heterogeneous packets and remove burst-induced redundancy, together with a time-aware attention model that learns directly from non-uniform CSI sequences without resampling. We further introduce CommCSI-HAR, the first dataset with irregularly sampled CSI from real-world dual-band communication traffic. Extensive evaluations on this dataset and four public benchmarks show that UniFi achieves state-of-the-art accuracy with a compact model size, while fully preserving communication throughput.
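
One simple way to realize a time-aware attention model that learns directly from non-uniform sequences is to feed the inter-sample time gaps into the attention logits; a sketch under that assumption (UniFi's actual model may differ):

```python
import torch
import torch.nn as nn

class TimeAwareAttention(nn.Module):
    """Self-attention whose logits are penalized by the time gap |t_i - t_j|,
    so irregularly sampled CSI sequences need no resampling."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.decay = nn.Parameter(torch.tensor(0.1))  # learned gap penalty
        self.scale = dim ** -0.5

    def forward(self, x, t):            # x: (B, T, dim), t: (B, T) timestamps
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = q @ k.transpose(-2, -1) * self.scale
        gaps = (t.unsqueeze(-1) - t.unsqueeze(-2)).abs()   # (B, T, T)
        attn = torch.softmax(logits - self.decay.abs() * gaps, dim=-1)
        return attn @ v

x = torch.randn(2, 50, 64)
t = torch.sort(torch.rand(2, 50) * 10, dim=-1).values   # irregular timestamps
y = TimeAwareAttention(64)(x, t)                        # (2, 50, 64)
```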

信息检索

[IR-0] RobustMask: Certified Robustness against Adversarial Neural Ranking Attack via Randomized Masking

链接: https://arxiv.org/abs/2512.23307
作者: Jiawei Liu,Zhuo Chen,Rui Zhu,Miaokun Chen,Yuyang Gong,Wei Lu,Xiaofeng Wang
类目: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Neural ranking models have achieved remarkable progress and are now widely deployed in real-world applications such as Retrieval-Augmented Generation (RAG). However, like other neural architectures, they remain vulnerable to adversarial manipulations: subtle character-, word-, or phrase-level perturbations can poison retrieval results and artificially promote targeted candidates, undermining the integrity of search engines and downstream systems. Existing defenses either rely on heuristics with poor generalization or on certified methods that assume overly strong adversarial knowledge, limiting their practical use. To address these challenges, we propose RobustMask, a novel defense that combines the context-prediction capability of pretrained language models with a randomized masking-based smoothing mechanism. Our approach strengthens neural ranking models against adversarial perturbations at the character, word, and phrase levels. Leveraging both the pairwise comparison ability of ranking models and probabilistic statistical analysis, we provide a theoretical proof of RobustMask’s certified top-K robustness. Extensive experiments further demonstrate that RobustMask successfully certifies over 20% of candidate documents within the top-10 ranking positions against adversarial perturbations affecting up to 30% of their content. These results highlight the effectiveness of RobustMask in enhancing the adversarial robustness of neural ranking models, marking a significant step toward providing stronger security guarantees for real-world retrieval systems.
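
The certified-robustness analysis is the paper's contribution; the smoothing mechanism itself can be sketched in a few lines. Here `rank_score` is a stand-in for any neural ranker, and the paper additionally uses a pretrained LM to predict masked context, which this sketch omits:

```python
import random

def rank_score(query, doc_tokens):
    # Placeholder for a neural ranking model's relevance score.
    return sum(tok in query for tok in doc_tokens) / (len(doc_tokens) or 1)

def smoothed_score(query, doc_tokens, mask_rate=0.3, n_samples=200, seed=0):
    """Monte-Carlo smoothed relevance: average the ranker's score over random
    maskings, so any few adversarially perturbed tokens are usually masked."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        masked = [t if rng.random() > mask_rate else "[MASK]"
                  for t in doc_tokens]
        total += rank_score(query, masked)
    return total / n_samples

doc = "neural ranking models remain vulnerable to adversarial manipulations".split()
print(smoothed_score({"neural", "ranking"}, doc))
```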

[IR-1] OrchANN: A Unified I/O Orchestration Framework for Skewed Out-of-Core Vector Search

链接: https://arxiv.org/abs/2512.22838
作者: Chengying Huan,Lizheng Chen,Zhengyi Yang,Shaonan Ma,Rong Gu,Renjie Yao,Zhibin Wang,Mingxing Zhang,Fang Xi,Jie Tao,Gang Zhang,Guihai Chen,Chen Tian
类目: Databases (cs.DB); Information Retrieval (cs.IR)
*备注: 13 pages, 30 figures

点击查看摘要

Abstract:Approximate nearest neighbor search (ANNS) at billion scale is fundamentally an out-of-core problem: vectors and indexes live on SSD, so performance is dominated by I/O rather than compute. Under skewed semantic embeddings, existing out-of-core systems break down: a uniform local index mismatches cluster scales, static routing misguides queries and inflates the number of probed partitions, and pruning is incomplete at the cluster level and lossy at the vector level, triggering “fetch-to-discard” reranking on raw vectors. We present OrchANN, an out-of-core ANNS engine that uses an I/O orchestration model for unified I/O governance along the route-access-verify pipeline. OrchANN selects a heterogeneous local index per cluster via offline auto-profiling, maintains a query-aware in-memory navigation graph that adapts to skewed workloads, and applies multi-level pruning with geometric bounds to filter both clusters and vectors before issuing SSD reads. Across five standard datasets under strict out-of-core constraints, OrchANN outperforms four baselines including DiskANN, Starling, SPANN, and PipeANN in both QPS and latency while reducing SSD accesses. Furthermore, OrchANN delivers up to 17.2x higher QPS and 25.0x lower latency than competing systems without sacrificing accuracy.
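
The "geometric bounds" for cluster-level pruning are plausibly of the classic triangle-inequality form; a sketch of that idea, which skips any cluster whose lower bound already exceeds the current k-th best distance:

```python
import numpy as np

def prune_clusters(query, centroids, radii, kth_best):
    """Lower-bound the distance from query to any vector in a cluster by
    dist(query, centroid) - radius; clusters whose bound exceeds the current
    k-th best distance cannot improve the result, so no SSD reads for them."""
    d = np.linalg.norm(centroids - query, axis=1)
    lower = np.maximum(d - radii, 0.0)
    keep = np.where(lower < kth_best)[0]
    return keep[np.argsort(lower[keep])]   # probe nearest-bound clusters first

rng = np.random.default_rng(7)
centroids = rng.standard_normal((100, 32))
radii = rng.uniform(0.5, 1.5, 100)
order = prune_clusters(rng.standard_normal(32), centroids, radii, kth_best=4.0)
print(f"probing {len(order)} of 100 clusters")
```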

[IR-2] A Real-Time System to Populate FRA Form 57 from News WSDM2026

链接: https://arxiv.org/abs/2512.22457
作者: Chansong Lim,Haz Sameen Shahgir,Yue Dong,Jia Chen,Evangelos E. Papalexakis
类目: Information Retrieval (cs.IR)
*备注: to be published in WSDM 2026 Demonstration

点击查看摘要

Abstract:Local railway committees need timely situational awareness after highway-rail grade crossing incidents, yet official Federal Railroad Administration (FRA) investigations can take days to weeks. We present a demo system that populates Highway-Rail Grade Crossing Incident Data (Form 57) from news in real time. Our approach addresses two core challenges: the form is visually irregular and semantically dense, and news is noisy. To solve these problems, we design a pipeline that first converts Form 57 into a JSON schema using a vision language model with sample aggregation, and then performs grouped question answering following the intent of the form layout to reduce ambiguity. In addition, we build an evaluation dataset by aligning scraped news articles with official FRA records and annotating retrievable information. We then assess our system against various alternatives in terms of information retrieval accuracy and coverage.
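
A minimal sketch of the grouped question answering step; the field names and the `llm` callable are placeholders, not the official FRA schema or a specific API:

```python
import json

# Hypothetical slice of the Form 57 JSON schema, grouped by layout section.
schema_groups = {
    "location": ["county", "state", "nearest_station", "crossing_id"],
    "circumstances": ["highway_user_type", "train_speed_mph", "visibility"],
}

def grouped_qa(article_text, groups, llm):
    """Ask one question per layout group instead of per field, so related
    fields are extracted together and ambiguity is reduced."""
    form = {}
    for group, fields in groups.items():
        prompt = (f"From the news article below, fill these '{group}' fields "
                  f"as JSON with keys {fields}; use null if not stated.\n\n"
                  f"{article_text}")
        form.update(json.loads(llm(prompt)))
    return form

# `llm` stands in for any chat-completion call returning a JSON string.
```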

[IR-3] OxygenREC: An Instruction-Following Generative Framework for E-commerce Recommendation

链接: https://arxiv.org/abs/2512.22386
作者: Xuegang Hao,Ming Zhang,Alex Li,Xiangyu Qian,Zhi Ma,Yanlong Zang,Shijie Yang,Zhongxuan Han,Xiaolong Ma,Jinguang Liu,Zhen Li,Zhida Jiang,Shusheng Wang,Ning Tang,Yanchen Qiao,Chenxiang Yang,Chen Sun,Jincheng Yuan,Chunhua Peng,Heng Hu,Peijun Yang,Baopeng Yuan,Caiyun Qiu,Zhaolong Xing,Haofei Yuan,Haipeng Zhang,Yuzhang Guo,Weijie Ding,Jiahua Gao,Hao Huang,Zhen Chen,Tongxuan Liu,Pinghua Gong
类目: Information Retrieval (cs.IR)
*备注: 37 pages, 7 figures

点击查看摘要

Abstract:Traditional recommendation systems suffer from inconsistency in multi-stage optimization objectives. Generative Recommendation (GR) mitigates this inconsistency through an end-to-end framework; however, existing methods still rely on matching mechanisms based on inductive patterns. Although responsive, they lack the ability to uncover complex user intents that require deductive reasoning based on world knowledge. Meanwhile, LLMs show strong deep reasoning capabilities, but their latency and computational costs remain challenging for industrial applications. More critically, there are performance bottlenecks in multi-scenario scalability: as shown in Figure 1, existing solutions require independent training and deployment for each scenario, leading to low resource utilization and high maintenance costs-a challenge unaddressed in GR literature. To address these, we present OxygenREC, an industrial recommendation system that leverages Fast-Slow Thinking to deliver deep reasoning under the strict latency and multi-scenario requirements of real-world environments. First, we adopt a Fast-Slow Thinking architecture. Slow thinking uses a near-line LLM pipeline to synthesize Contextual Reasoning Instructions, while fast thinking employs a high-efficiency encoder–decoder backbone for real-time generation. Second, to ensure reasoning instructions effectively enhance recommendation generation, we introduce a semantic alignment mechanism with Instruction-Guided Retrieval (IGR) to filter intent-relevant historical behaviors and use a Query-to-Item (Q2I) loss for instruction-item consistency. Finally, to resolve multi-scenario scalability, we transform scenario information into controllable instructions, using unified reward mapping and Soft Adaptive Group Clip Policy Optimization (SA-GCPO) to align policies with diverse business objectives, realizing a train-once-deploy-everywhere paradigm.
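
The abstract names a Query-to-Item (Q2I) loss for instruction-item consistency without giving its form; a standard in-batch contrastive (InfoNCE-style) objective is one plausible instantiation:

```python
import torch
import torch.nn.functional as F

def q2i_loss(instr_emb, item_emb, temperature=0.07):
    """Contrastive instruction-to-item loss: each reasoning-instruction
    embedding should score its own target item above in-batch negatives."""
    instr = F.normalize(instr_emb, dim=-1)
    items = F.normalize(item_emb, dim=-1)
    logits = instr @ items.T / temperature        # (B, B) similarity matrix
    labels = torch.arange(len(instr))             # positives on the diagonal
    return F.cross_entropy(logits, labels)

loss = q2i_loss(torch.randn(32, 128), torch.randn(32, 128))
```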

[IR-4] IANEC: Digital Forensic Investigation of Contemporary Writers Archives

链接: https://arxiv.org/abs/2512.22167
作者: Emmanuel Giguet(GREYC)
类目: Digital Libraries (cs.DL); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
*备注: in French language

点击查看摘要

Abstract:The IANEC project (Investigation of Digital Archives of Contemporary Writers), led by the GREYC Research Lab and funded by the French Ministry of Culture, aims to develop dedicated digital forensic investigation tools to automate the analysis of archival corpora from the Institut Mémoires de l’Édition Contemporaine (IMEC). The project is based on the observation that born-digital archival materials are increasingly prevalent in contemporary archival institutions, and that digital forensics technologies have become essential for the extraction, identification, processing, and description of natively digital archival corpora.

附件下载

点击下载今日全部论文列表