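This post lists the latest papers retrieved from arXiv.org on 2025-12-08. It is updated automatically and grouped into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: the daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.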

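Contents

Overview (2025-12-08)

411 papers were updated today, including:

  • Natural Language Processing: 45 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 122 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 94 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 113 papers (Machine Learning (cs.LG))

Natural Language Processing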

[NLP-0] Enhancing Retrieval-Augmented Generation with Entity Linking for Educational Platforms

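[Quick read]: This paper addresses the difficulty that retrieval-augmented generation (RAG) systems based purely on semantic similarity have in ensuring factual accuracy in specialized domains, especially where pronounced terminological ambiguity biases retrieval relevance. The key to its solution is a Wikidata-based Entity Linking module combined with three re-ranking strategies that fuse semantic and entity information: a hybrid score weighting model, Reciprocal Rank Fusion (RRF), and a cross-encoder re-ranker. Experiments show that on the domain-specific task the RRF-based hybrid scheme significantly outperforms both the baseline and the cross-encoder approach, whereas on the general-domain dataset the cross-encoder performs best. This confirms a domain-mismatch effect and underlines the importance of domain adaptation and hybrid ranking strategies for the factual precision and reliability of RAG systems.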

Link: https://arxiv.org/abs/2512.05967
Authors: Francesco Granata, Francesco Poggi, Misael Mongiovì
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:In the era of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) architectures are gaining significant attention for their ability to ground language generation in reliable knowledge sources. Despite their impressive effectiveness in many areas, RAG systems based solely on semantic similarity often fail to ensure factual accuracy in specialized domains, where terminological ambiguity can affect retrieval relevance. This study proposes an enhanced RAG architecture that integrates a factual signal derived from Entity Linking to improve the accuracy of educational question-answering systems in Italian. The system includes a Wikidata-based Entity Linking module and implements three re-ranking strategies to combine semantic and entity-based information: a hybrid score weighting model, reciprocal rank fusion, and a cross-encoder re-ranker. Experiments were conducted on two benchmarks: a custom academic dataset and the standard SQuAD-it dataset. Results show that, in domain-specific contexts, the hybrid schema based on reciprocal rank fusion significantly outperforms both the baseline and the cross-encoder approach, while the cross-encoder achieves the best results on the general-domain dataset. These findings confirm the presence of an effect of domain mismatch and highlight the importance of domain adaptation and hybrid ranking strategies to enhance factual precision and reliability in retrieval-augmented generation. They also demonstrate the potential of entity-aware RAG systems in educational environments, fostering adaptive and reliable AI-based tutoring tools.
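
The reciprocal rank fusion step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `reciprocal_rank_fusion`, the document ids, and the smoothing constant `k=60` (a common default in the RRF literature) are assumptions made here for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in;
    k=60 is an assumed default, not a value taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id to keep output deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings from a semantic retriever and an entity-based scorer.
semantic_ranking = ["d1", "d2", "d3"]
entity_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([semantic_ranking, entity_ranking])
print(fused)
```

Documents that rank well in both lists rise to the top, which is the property a hybrid semantic and entity-aware re-ranker relies on.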

[NLP-1] M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG

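[Quick read]: This paper addresses the limitations that static training data impose on vision-language models (VLMs) in visual question answering (VQA), in particular insufficient knowledge updating and cultural-context adaptation in multilingual, multimodal settings. The key to its solution is M4-RAG, a large-scale benchmark covering 42 languages and 56 regional dialects and registers with more than 80,000 culturally diverse image-question pairs, together with a controlled multilingual document retrieval environment that approximates real-world retrieval conditions while keeping experiments reproducible. Systematic evaluation finds that although RAG consistently improves smaller VLMs, it often degrades the performance of larger models, exposing a significant mismatch between current retrieval effectiveness and model size. The benchmark provides a foundation for next-generation RAG systems that reason across languages, modalities, and cultures.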

Link: https://arxiv.org/abs/2512.05959
Authors: David Anugraha, Patrick Amadeus Irawan, Anshul Singh, En-Shiun Annie Lee, Genta Indra Winata
Affiliations: Stanford University; MBZUAI; Indian Institute of Science; Ontario Tech University; University of Toronto; Capital One
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint

Abstract:Vision-language models (VLMs) have achieved strong performance in visual question answering (VQA), yet they remain constrained by static training data. Retrieval-Augmented Generation (RAG) mitigates this limitation by enabling access to up-to-date, culturally grounded, and multilingual information; however, multilingual multimodal RAG remains largely underexplored. We introduce M4-RAG, a massive-scale benchmark covering 42 languages and 56 regional dialects and registers, comprising over 80,000 culturally diverse image-question pairs for evaluating retrieval-augmented VQA across languages and modalities. To balance realism with reproducibility, we build a controlled retrieval environment containing millions of carefully curated multilingual documents relevant to the query domains, approximating real-world retrieval conditions while ensuring consistent experimentation. Our systematic evaluation reveals that although RAG consistently benefits smaller VLMs, it fails to scale to larger models and often even degrades their performance, exposing a critical mismatch between model size and current retrieval effectiveness. M4-RAG provides a foundation for advancing next-generation RAG systems capable of reasoning seamlessly across languages, modalities, and cultural contexts.

[NLP-2] Zoom in, Click out: Unlocking and Evaluating the Potential of Zooming for GUI Grounding

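[Quick read]: This paper addresses the challenges that graphical user interface (GUI) agents face in element grounding, including weak cross-platform generalization, difficult analysis of complex layouts, and inaccurate fine-grained element localization. The key to its solution is to treat zoom as a strong prior and to propose ZoomClick, a training-free method that characterizes four key properties of zoom (pre-zoom, depth, shrink size, and minimal crop size) to enable dynamic spatial focusing and adaptive context switching. This significantly improves both general vision-language models and specialized GUI grounding models and reaches state-of-the-art results on several mainstream benchmarks.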

Link: https://arxiv.org/abs/2512.05941
Authors: Zhiyuan Jiang, Shenghao Xie, Wenyi Li, Wenqiang Zu, Peihang Li, Jiahao Qiu, Siqi Pei, Lei Ma, Tiejun Huang, Mengdi Wang, Shilong Liu
Affiliations: Xi’an Jiaotong University; Princeton University; Peking University; University of Chinese Academy of Sciences; The University of Hong Kong; Michigan State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code is available at this https URL

Abstract:Grounding is a fundamental capability for building graphical user interface (GUI) agents. Although existing approaches rely on large-scale bounding box supervision, they still face various challenges, such as cross-platform generalization, complex layout analysis, and fine-grained element localization. In this paper, we investigate zoom as a strong yet underexplored prior for GUI grounding, and propose a training-free method, ZoomClick. By characterizing four key properties of zoom (i.e., pre-zoom, depth, shrink size, minimal crop size), we unlock its full capabilities for dynamic spatial focusing and adaptive context switching. Experiments demonstrate that our method significantly boosts the performance of both general vision-language and specialized GUI grounding models, achieving state-of-the-art results on several mainstream benchmarks; for example, UI-Venus-72B attains a 73.1% success rate on ScreenSpot-Pro. Furthermore, we present GUIZoom-Bench, a benchmark for evaluating model adaptability to zoom, aiming to inspire future research on improving zoom for further training and test-time scaling in GUI grounding tasks.

[NLP-3] To Err Is Human: Systematic Quantification of Errors in Published AI Papers via LLM Analysis

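[Quick read]: This paper addresses the propagation of objective errors in published artificial intelligence (AI) papers, which can undermine the reproducibility of follow-up research and the reliability of the body of knowledge. The key to its solution is a Paper Correctness Checker based on GPT-5 that systematically identifies objective errors with clearly verifiable ground truth (in formulas, derivations, calculations, figures, and tables) in papers from top conferences and journals. Human expert review confirmed its accuracy (precision of 83.2%), and the tool can also propose correct fixes for 75.8% of the identified errors, helping to improve the quality and trustworthiness of the academic literature.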

Link: https://arxiv.org/abs/2512.05925
Authors: Federico Bianchi, Yongchan Kwon, Zachary Izzo, Linjun Zhang, James Zou
Affiliations: Together AI; NEC Labs America; Rutgers University; Stanford University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:How many mistakes do published AI papers contain? Peer-reviewed publications form the foundation upon which new research and knowledge are built. Errors that persist in the literature can propagate unnoticed, creating confusion in follow-up studies and complicating reproducibility. The accelerating pace of research and the increasing demands on the peer-review system make such mistakes harder to detect and avoid. To address this, we developed a Paper Correctness Checker based on GPT-5 to systematically identify mistakes in papers previously published at top AI conferences and journals. Our analysis focuses on objective mistakes (e.g., errors in formulas, derivations, calculations, figures, and tables) that have a clearly verifiable ground truth. We intentionally exclude subjective considerations such as novelty, importance, or writing quality. We find that published papers contain a non-negligible number of objective mistakes and that the average number of mistakes per paper has increased over time: from 3.8 in NeurIPS 2021 to 5.9 in NeurIPS 2025 (55.3% increase); from 4.1 in ICLR 2018 to 5.2 in ICLR 2025; and from 5.0 in TMLR 2022/23 to 5.5 in TMLR 2025. Human experts reviewed 316 potential mistakes identified by the AI Checker and confirmed that 263 were actual mistakes, corresponding to a precision of 83.2%. While most identified issues are relatively minor, correcting them would reduce confusion in the literature and strengthen reproducibility. The AI Checker also surfaced potentially more substantive mistakes that could affect the interpretation of results. Moreover, we show that the AI Checker can propose correct fixes for 75.8% of the identified mistakes. Overall, this study highlights the potential of frontier LLMs to detect and correct objective mistakes in published papers, helping to establish a firmer foundation of knowledge.
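
The two headline percentages in the abstract follow directly from the counts it reports. A quick arithmetic check, with variable names chosen here purely for illustration:

```python
# Counts and averages as reported in the abstract.
confirmed_mistakes = 263
reviewed_candidates = 316
neurips_2021_avg = 3.8
neurips_2025_avg = 5.9

# Precision: share of flagged items that human experts confirmed.
precision = confirmed_mistakes / reviewed_candidates

# Relative growth in the average number of mistakes per paper.
relative_increase = (neurips_2025_avg - neurips_2021_avg) / neurips_2021_avg

print(f"precision = {precision:.1%}")
print(f"NeurIPS increase = {relative_increase:.1%}")
```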

[NLP-4] Natural Language Summarization Enables Multi-Repository Bug Localization by LLMs in Microservice Architectures (ICSE 2026)

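[Quick read]: This paper addresses bug localization in multi-repository microservice architectures, where the main challenges are the semantic gap between natural language bug reports and code, the context limits of large language models (LLMs), and the need to first identify the correct repository. The key to its solution is to reframe bug localization as a natural language reasoning task: the codebase is transformed into hierarchical natural language (NL) summaries at the file, directory, and repository levels, and a two-phase search first routes the bug report to relevant repositories by NL matching and then localizes top-down within the selected repositories. On the industrial system DNext, the method achieves a Pass@10 of 0.82 and an MRR of 0.50, significantly outperforming retrieval baselines and agentic RAG systems such as GitHub Copilot and Cursor, and it provides an interpretable repository, directory, file search path that strengthens trust and transparency in enterprise AI tools.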

Link: https://arxiv.org/abs/2512.05908
Authors: Amirkia Rafiei Oskooei, S. Selcan Yukcu, Mehmet Cevheri Bozoglan, Mehmet S. Aktas
Affiliations: Intellica Business Intelligence Consultancy, R&D Center; Yildiz Technical University, Dept. of Computer Eng.
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: Accepted at LLM4Code Workshop, ICSE 2026

Abstract:Bug localization in multi-repository microservice architectures is challenging due to the semantic gap between natural language bug reports and code, LLM context limitations, and the need to first identify the correct repository. We propose reframing this as a natural language reasoning task by transforming codebases into hierarchical NL summaries and performing NL-to-NL search instead of cross-modal retrieval. Our approach builds context-aware summaries at file, directory, and repository levels, then uses a two-phase search: first routing bug reports to relevant repositories, then performing top-down localization within those repositories. Evaluated on DNext, an industrial system with 46 repositories and 1.1M lines of code, our method achieves Pass@10 of 0.82 and MRR of 0.50, significantly outperforming retrieval baselines and agentic RAG systems like GitHub Copilot and Cursor. This work demonstrates that engineered natural language representations can be more effective than raw source code for scalable bug localization, providing an interpretable repository - directory - file search path, which is vital for building trust in enterprise AI tools by providing essential transparency.
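
For readers unfamiliar with the two reported metrics, here is a minimal sketch of how Pass@k and MRR are commonly computed for file-level bug localization. The exact definitions used in the paper are not given in the abstract, so this assumes that a bug report counts as a hit when at least one ground-truth file appears in the top k, and that MRR uses the rank of the first ground-truth file. File names and data are made up.

```python
def pass_at_k(ranked_lists, gold_sets, k):
    """Fraction of bug reports with at least one gold file in the top k."""
    hits = sum(
        1
        for ranked, gold in zip(ranked_lists, gold_sets)
        if any(path in gold for path in ranked[:k])
    )
    return hits / len(ranked_lists)


def mean_reciprocal_rank(ranked_lists, gold_sets):
    """Average of 1 / rank of the first gold file (0 when none is retrieved)."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_sets):
        for rank, path in enumerate(ranked, start=1):
            if path in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


# Hypothetical predictions for three bug reports.
ranked_lists = [["a.py", "b.py"], ["c.py", "d.py"], ["e.py"]]
gold_sets = [{"b.py"}, {"c.py"}, {"z.py"}]

print(pass_at_k(ranked_lists, gold_sets, k=1))
print(pass_at_k(ranked_lists, gold_sets, k=2))
print(mean_reciprocal_rank(ranked_lists, gold_sets))
```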

[NLP-5] Optimizing Medical Question-Answering Systems: A Comparative Study of Fine-Tuned and Zero-Shot Large Language Models with RAG Framework

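[Quick read]: This paper addresses the insufficient factual accuracy and the hallucinations that arise when large language models (LLMs) are applied directly to clinical medical question answering (QA). The key to its solution is a retrieval-augmented generation (RAG) architecture in which retrieval of domain-specific medical literature constrains and guides LLM generation, improving the trustworthiness and factual consistency of answers. Concretely, the authors fine-tune open-source LLMs (LLaMA 2 and Falcon) efficiently with Low-Rank Adaptation (LoRA) and retrieve relevant evidence from external medical literature so that generated answers come with traceable source references. Experiments on the PubMedQA and MedMCQA benchmarks show that the approach substantially improves over the zero-shot LLM baseline, with a marked reduction in unsupported content, supporting the value of RAG for building reliable biomedical QA systems.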

Link: https://arxiv.org/abs/2512.05863
Authors: Tasnimul Hassan, Md Faisal Karim, Haziq Jeelani, Elham Behnam, Robert Green, Fayeq Jeelani Syed
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Medical question-answering (QA) systems can benefit from advances in large language models (LLMs), but directly applying LLMs to the clinical domain poses challenges such as maintaining factual accuracy and avoiding hallucinations. In this paper, we present a retrieval-augmented generation (RAG) based medical QA system that combines domain-specific knowledge retrieval with open-source LLMs to answer medical questions. We fine-tune two state-of-the-art open LLMs (LLaMA 2 and Falcon) using Low-Rank Adaptation (LoRA) for efficient domain specialization. The system retrieves relevant medical literature to ground the LLM’s answers, thereby improving factual correctness and reducing hallucinations. We evaluate the approach on benchmark datasets (PubMedQA and MedMCQA) and show that retrieval augmentation yields measurable improvements in answer accuracy compared to using LLMs alone. Our fine-tuned LLaMA 2 model achieves 71.8% accuracy on PubMedQA, substantially improving over the 55.4% zero-shot baseline, while maintaining transparency by providing source references. We also detail the system design and fine-tuning methodology, demonstrating that grounding answers in retrieved evidence reduces unsupported content by approximately 60%. These results highlight the potential of RAG-augmented open-source LLMs for reliable biomedical QA, pointing toward practical clinical informatics applications.

[NLP-6] Prompting Science Report 4: Playing Pretend: Expert Personas Don't Improve Factual Accuracy

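[Quick read]: This paper asks whether assigning a persona to a large language model improves its accuracy on difficult objective questions. It studies three persona settings, In-Domain Experts, Off-Domain Experts, and low-knowledge personas (such as a layperson or a young child), evaluated on GPQA Diamond and MMLU-Pro, two graduate-level benchmarks spanning science, engineering, and law. The key finding is that none of the persona settings significantly improved accuracy: in-domain expert personas brought no improvement for most models (Gemini 2.0 Flash being the exception), off-domain expert personas occasionally hurt, and low-knowledge personas often reduced accuracy. The conclusion is that, in this setting, persona prompts are not an effective way to improve the factual accuracy of model answers.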

Link: https://arxiv.org/abs/2512.05858
Authors: Savir Basil, Ina Shapiro, Dan Shapiro, Ethan Mollick, Lilach Mollick, Lennart Meincke
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This is the fourth in a series of short reports that help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. Here, we ask whether assigning personas to models improves performance on difficult objective multiple-choice questions. We study both domain-specific expert personas and low-knowledge personas, evaluating six models on GPQA Diamond (Rein et al. 2024) and MMLU-Pro (Wang et al. 2024), graduate-level questions spanning science, engineering, and law. We tested three approaches:
  • In-Domain Experts: Assigning the model an expert persona (“you are a physics expert”) matched to the problem type (physics problems) had no significant impact on performance (with the exception of the Gemini 2.0 Flash model).
  • Off-Domain Experts (Domain-Mismatched): Assigning the model an expert persona (“you are a physics expert”) not matched to the problem type (law problems) resulted in marginal differences.
  • Low-Knowledge Personas: We assigned the model negative capability personas (layperson, young child, toddler), which were generally harmful to benchmark accuracy.
Across both benchmarks, persona prompts generally did not improve accuracy relative to a no-persona baseline. Expert personas showed no consistent benefit across models, with few exceptions. Domain-mismatched expert personas sometimes degraded performance. Low-knowledge personas often reduced accuracy. These results are about the accuracy of answers only; personas may serve other purposes (such as altering the tone of outputs), beyond improving factual performance.

[NLP-7] Heard or Halted? Gender Interruptions and Emotional Tone in U.S. Supreme Court Oral Arguments

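[Quick read]: This paper examines how gender differences affect the content and emotional tone of speech in interactions between advocates and justices, specifically whether interruptions change the meaning of female advocates' arguments and whether interruptions directed at them are more negative in sentiment. The key to its approach is to take 12,663 speech chunks from the ConvoKit Supreme Court Corpus (2010-2019), quantify semantic similarity with GloVe word embeddings, and measure sentiment polarity with a lexicon-based method. The analysis finds that interruptions do not substantially alter argumentative content, but interruptions directed at female advocates carry significantly more negative sentiment.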

Link: https://arxiv.org/abs/2512.05832
Authors: Yifei Tong
Affiliations: Georgetown University
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 12 pages, 5 figures, 1 table. Includes appendix. Code available at: this https URL

Abstract:This study examines how interruptions during U.S. Supreme Court oral arguments shape both the semantic content and emotional tone of advocates’ speech, with a focus on gendered dynamics in judicial discourse. Using the ConvoKit Supreme Court Corpus (2010-2019), we analyze 12,663 speech chunks from advocate-justice interactions to assess whether interruptions alter the meaning of an advocate’s argument and whether interruptions toward female advocates exhibit more negative emotional valence. Semantic shifts are quantified using GloVe-based sentence embeddings, while sentiment is measured through lexicon-based analysis. We find that semantic similarity between pre- and post-interruption speech remains consistently high, suggesting that interruptions do not substantially alter argumentative content. However, interruptions directed at female advocates contain significantly higher levels of negative sentiment. These results deepen empirical understanding of gendered communication in elite institutional settings and demonstrate the value of computational linguistic methods for studying power, discourse, and equity in judicial proceedings.
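
The abstract says semantic shift is measured with GloVe-based sentence embeddings but does not say how word vectors are pooled or compared. A minimal sketch, assuming mean pooling and cosine similarity, with tiny made-up 2-dimensional vectors standing in for real GloVe vectors:

```python
import math


def sentence_embedding(tokens, vectors):
    """Mean-pool the vectors of known tokens (an assumed pooling choice)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has a known vector")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, purely illustrative; real GloVe vectors have 50 to 300 dimensions.
toy_vectors = {
    "court": [1.0, 0.0],
    "law": [0.8, 0.2],
    "statute": [0.9, 0.1],
}

pre_interruption = sentence_embedding(["court", "law"], toy_vectors)
post_interruption = sentence_embedding(["court", "statute"], toy_vectors)
similarity = cosine_similarity(pre_interruption, post_interruption)
print(round(similarity, 3))
```

A similarity close to 1 between speech before and after an interruption corresponds to the paper's finding that argumentative content stays largely unchanged.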

[NLP-8] Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

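[Quick read]: This paper addresses the efficiency and accuracy problems of long video understanding (LVU), where sparse, temporally dispersed cues are buried in large amounts of redundant content. Conventional pipelines perceive the video through a query-agnostic captioner, which wastes computation and blurs fine-grained temporal and spatial information. The key to its solution is the Active Video Perception (AVP) framework, which treats the video as an interactive environment and acquires evidence through an iterative plan-observe-reflect process: multimodal large language model (MLLM) agents decide what, when, and where to observe and continuously assess, from time-stamped evidence, whether it suffices to answer the query. AVP outperforms the best agentic method by 5.7% in average accuracy while requiring only 18.4% of the inference time and 12.4% of the input tokens.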

Link: https://arxiv.org/abs/2512.05774
Authors: Ziyang Wang, Honglu Zhou, Shijie Wang, Junnan Li, Caiming Xiong, Silvio Savarese, Mohit Bansal, Michael S. Ryoo, Juan Carlos Niebles
Affiliations: Salesforce AI Research; University of North Carolina at Chapel Hill
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Website: this https URL

Abstract:Long video understanding (LVU) is challenging because answering real-world queries often depends on sparse, temporally dispersed cues buried in hours of mostly redundant and irrelevant content. While agentic pipelines improve video reasoning capabilities, prevailing frameworks rely on a query-agnostic captioner to perceive video information, which wastes computation on irrelevant content and blurs fine-grained temporal and spatial information. Motivated by active perception theory, we argue that LVU agents should actively decide what, when, and where to observe, and continuously assess whether the current observation is sufficient to answer the query. We present Active Video Perception (AVP), an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence directly from pixels. Concretely, AVP runs an iterative plan-observe-reflect process with MLLM agents. In each round, a planner proposes targeted video interactions, an observer executes them to extract time-stamped evidence, and a reflector evaluates the sufficiency of the evidence for the query, either halting with an answer or triggering further observation. Across five LVU benchmarks, AVP achieves the highest performance with significant improvements. Notably, AVP outperforms the best agentic method by 5.7% in average accuracy while requiring only 18.4% of the inference time and 12.4% of the input tokens.

[NLP-9] Capturing Classic Authorial Style in Long-Form Story Generation with GRPO Fine-Tuning

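[Quick read]: This paper addresses the lack of fine-grained stylistic control in open-ended story generation, where existing methods usually rely on shallow cues (such as names or topics) to simulate an author's style and lack robust evaluation. The key to its solution is a training framework based on Group Relative Policy Optimization (GRPO) with a custom multi-reward setup: the style reward is derived from a fine-tuned sentence transformer using authorship verification (AV) signals, and content and completeness scores are added to stabilize long-form narrative generation. Experiments use Mark Twain's fiction, with The Adventures of Huckleberry Finn as the reference style exemplar. The 8B model outperforms larger baselines such as GPT-4o and Claude Sonnet 4 on AV-style metrics, indicating that a moderately sized model with task-specific training can achieve agentic stylistic generation, although global coherence and story resolution remain open problems.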

Link: https://arxiv.org/abs/2512.05747
Authors: Jinlong Liu, Mohammed Bahja, Venelin Kovatchev, Mark Lee
Affiliations: University of Birmingham
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Recent advances in large language models (LLMs) show impressive performance in open-ended story generation, but fine-grained stylistic control remains limited. Existing methods often rely on shallow cues (e.g., names or topics) to simulate authorial style, without robust evaluation. In this work, we present a training framework for style-conditioned story generation using Group Relative Policy Optimization (GRPO) and a custom multi-reward setup. The style reward is derived from a fine-tuned sentence transformer using authorship verification (AV) signals, combined with content and completeness scores to stabilize long-form narrative generation. We conduct experiments using fiction by Mark Twain, a prominent 19th-century American author, with The Adventures of Huckleberry Finn serving as the reference style exemplar. Our 8B model outperforms larger baselines such as GPT-4o and Claude Sonnet 4 in AV-style metrics, achieving a style score of 0.628 and competitive content quality. Results demonstrate the feasibility of agentic stylistic generation with moderate model size and task-specific training. While the output is clearly style-aligned, narrative completeness remains a challenge, indicating future work is needed to better model global coherence and story resolution.
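
The core idea of GRPO, scoring each sampled story relative to the other samples for the same prompt, can be sketched in a few lines. This is a simplified illustration, not the authors' training code: the equal reward weights, the use of the population standard deviation, and all numeric values are assumptions.

```python
from statistics import mean, pstdev


def combined_reward(style, content, completeness, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three reward signals (weights are hypothetical)."""
    w_style, w_content, w_complete = weights
    return w_style * style + w_content * content + w_complete * completeness


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up (style, content, completeness) scores for four sampled stories.
samples = [(0.63, 0.7, 0.5), (0.40, 0.8, 0.9), (0.55, 0.6, 0.4), (0.70, 0.9, 0.8)]
rewards = [combined_reward(*s) for s in samples]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Samples that beat their group average get a positive advantage and are reinforced, so no separate value network is needed.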

[NLP-10] Efficient Text Classification with Conformal In-Context Learning

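[Quick read]: This paper addresses the high sensitivity to prompt design and the substantial computational cost of large language models (LLMs) in text classification. Its core contribution is a systematic evaluation of Conformal In-Context Learning (CICLe), a previously proposed framework that combines a lightweight base classifier with Conformal Prediction to adaptively shrink the set of candidate classes, improving prompt efficiency and classification performance. CICLe consistently improves over its base classifier and outperforms few-shot prompting baselines when there is enough data to train the base classifier, and performs comparably in low-data regimes. It reduces the number of shots by up to 34.45% and prompt length by up to 25.16%, enables smaller models to reach competitive performance, and is particularly advantageous for tasks with high class imbalance.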

Link: https://arxiv.org/abs/2512.05732
Authors: Ippokratis Pantelidis, Korbinian Randl, Aron Henriksson
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 4 tables, 2 figures

Abstract:Large Language Models (LLMs) demonstrate strong in-context learning abilities, yet their effectiveness in text classification depends heavily on prompt design and incurs substantial computational cost. Conformal In-Context Learning (CICLe) has been proposed as a resource-efficient framework that integrates a lightweight base classifier with Conformal Prediction to guide LLM prompting by adaptively reducing the set of candidate classes. However, its broader applicability and efficiency benefits beyond a single domain have not yet been systematically explored. In this paper, we present a comprehensive evaluation of CICLe across diverse NLP classification benchmarks. The results show that CICLe consistently improves over its base classifier and outperforms few-shot prompting baselines when the sample size is sufficient for training the base classifier, and performs comparably in low-data regimes. In terms of efficiency, CICLe reduces the number of shots and prompt length by up to 34.45% and 25.16%, respectively, and enables the use of smaller models with competitive performance. CICLe is furthermore particularly advantageous for text classification tasks with high class imbalance. These findings highlight CICLe as a practical and scalable approach for efficient text classification, combining the robustness of traditional classifiers with the adaptability of LLMs, and achieving substantial gains in data and computational efficiency.
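
How conformal prediction can shrink the candidate classes before prompting an LLM is easiest to see in code. The sketch below uses standard split conformal prediction with the nonconformity score 1 - p(class); whether CICLe uses exactly this score is an assumption, and the function names, class labels, and probabilities are invented for illustration.

```python
import math


def conformal_threshold(true_class_probs, alpha=0.1):
    """Split-conformal threshold from held-out calibration examples.

    true_class_probs holds the probability the base classifier assigned
    to the correct class of each calibration example.
    """
    scores = sorted(1.0 - p for p in true_class_probs)  # nonconformity scores
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected quantile
    if rank > n:
        return float("inf")  # too few calibration points: keep every class
    return scores[rank - 1]


def candidate_classes(class_probs, threshold):
    """Classes whose nonconformity score does not exceed the threshold."""
    return [c for c, p in class_probs.items() if 1.0 - p <= threshold]


# Made-up calibration probabilities and one test prediction.
calibration = [0.9, 0.8, 0.7, 0.6, 0.95, 0.85, 0.75, 0.65, 0.3]
threshold = conformal_threshold(calibration, alpha=0.1)
candidates = candidate_classes(
    {"sports": 0.55, "politics": 0.35, "tech": 0.10}, threshold
)
print(candidates)
```

Only the surviving candidates (and few-shot examples for them) would then be placed in the LLM prompt, which is where the savings in shots and prompt length would come from.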

[NLP-11] Big Tech-Funded AI Papers Have Higher Citation Impact, Greater Insularity, and Larger Recency Bias

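[Quick read]: This paper addresses the lack of clarity about how many artificial intelligence (AI) papers industry funds, what their citation impact is, and what drives industry interest. It quantifies the share and citation performance of industry-funded papers at 10 top AI conferences (e.g., ICLR, CVPR, AAAI, ACL) using about 49.8K papers and roughly 4.1M citation links from 1998 to 2022 in Scopus, and it proposes a new metric, the Citation Preference Ratio (CPR), to measure how differently funded and non-funded research is engaged with. Key findings: industry presence has grown markedly since 2015 (from less than 2% to more than 11% in 2020); between 2018 and 2022, a far larger share of industry-funded papers achieved high citation rates as measured by the h5-index (12% versus 4% for non-industry-funded papers); and industry-funded research is increasingly insular, citing predominantly other industry-funded papers, referencing fewer non-funded papers, and preferring recent work. These findings reveal a shift in AI research funding toward industry and its potential consequences.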

Link: https://arxiv.org/abs/2512.05714
Authors: Max Martin Gnewuch, Jan Philip Wahle, Terry Ruas, Bela Gipp
Affiliations: Unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Published at IEEE (ACDSA)

Abstract:Over the past four decades, artificial intelligence (AI) research has flourished at the nexus of academia and industry. However, Big Tech companies have increasingly acquired the edge in computational resources, big data, and talent. So far, it has been largely unclear how many papers the industry funds, how their citation impact compares to non-funded papers, and what drives industry interest. This study fills that gap by quantifying the number of industry-funded papers at 10 top AI conferences (e.g., ICLR, CVPR, AAAI, ACL) and their citation influence. We analyze about 49.8K papers, about 1.8M citations from AI papers to other papers, and about 2.3M citations from other papers to AI papers from 1998-2022 in Scopus. Through seven research questions, we examine the volume and evolution of industry funding in AI research, the citation impact of funded papers, the diversity and temporal range of their citations, and the subfields in which industry predominantly acts. Our findings reveal that industry presence has grown markedly since 2015, from less than 2 percent to more than 11 percent in 2020. Between 2018 and 2022, 12 percent of industry-funded papers achieved high citation rates as measured by the h5-index, compared to 4 percent of non-industry-funded papers and 2 percent of non-funded papers. Top AI conferences engage more with industry-funded research than non-funded research, as measured by our newly proposed metric, the Citation Preference Ratio (CPR). We show that industry-funded research is increasingly insular, citing predominantly other industry-funded papers while referencing fewer non-funded papers. These findings reveal new trends in AI research funding, including a shift towards more industry-funded papers and their growing citation impact, greater insularity of industry-funded work than non-funded work, and a preference of industry-funded research to cite recent work.

[NLP-12] Faithfulness metric fusion: Improving the evaluation of LLM trustworthiness across domains

Link: https://arxiv.org/abs/2512.05700
Authors: Ben Malin, Tatiana Kalganova, Nikolaos Boulgouris
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, conference paper

[NLP-13] Retrieving Semantically Similar Decisions under Noisy Institutional Labels: Robust Comparison of Embedding Methods

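[Quick read]: This paper addresses the inefficiency of retrieving constitutional case law, which is time-consuming and predominantly done by querying databases. Its core approach is to compare the retrieval performance of two embedding models on Czech Constitutional Court decisions: a large general-purpose embedder (OpenAI) and a domain-specific BERT trained from scratch on about 30,000 decisions using sliding windows and attention pooling. The key contribution is a noise-aware evaluation framework comprising IDF-weighted keyword overlap as graded relevance, binarization via two thresholds (0.20 balanced, 0.28 strict), significance testing via paired bootstrap, and an nDCG diagnosis supported by qualitative analysis. Although absolute nDCG is modest (attributed to label drift and strong ideals rather than lack of utility), the general-purpose embedder outperforms the domain pre-trained BERT with statistical significance at @10/@20/@100, and the framework remains usable when the gold standard is noisy because of heterogeneous labels from legacy judicial databases.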

Link: https://arxiv.org/abs/2512.05681
Authors: Tereza Novotna, Jakub Harasta
Affiliations: Masaryk University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: The manuscript has been accepted for presentation as a short paper at the 38th International Conference on Legal Knowledge and Information Systems (JURIX 2025) in Torino, Italy

Abstract:Retrieving case law is a time-consuming task predominantly carried out by querying databases. We provide a comparison of two models in three different settings for Czech Constitutional Court decisions: (i) a large general-purpose embedder (OpenAI), (ii) a domain-specific BERT-trained from scratch on ~30,000 decisions using sliding windows and attention pooling. We propose a noise-aware evaluation including IDF-weighted keyword overlap as graded relevance, binarization via two thresholds (0.20 balanced, 0.28 strict), significance via paired bootstrap, and an nDCG diagnosis supported with qualitative analysis. Despite modest absolute nDCG (expected under noisy labels), the general OpenAI embedder decisively outperforms the domain pre-trained BERT in both settings at @10/@20/@100 across both thresholds; differences are statistically significant. Diagnostics attribute low absolutes to label drift and strong ideals rather than lack of utility. Additionally, our framework is robust enough to be used for evaluation under a noisy gold dataset, which is typical when handling data with heterogeneous labels stemming from legacy judicial databases.
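
The interplay between graded relevance, the two binarization thresholds, and nDCG@k can be made concrete with a small sketch. It assumes the standard log2-discounted nDCG and that the list of gains covers every judged candidate in the system's ranked order; the gain values are invented, and only the thresholds 0.20 and 0.28 come from the abstract.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top k positions."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(gains, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def binarize(gains, threshold):
    """Turn graded relevance into 0/1 labels at a given threshold."""
    return [1.0 if g >= threshold else 0.0 for g in gains]


# Hypothetical IDF-weighted keyword-overlap scores in ranked order.
graded = [0.10, 0.30, 0.25, 0.00]

balanced = ndcg_at_k(binarize(graded, 0.20), k=3)
strict = ndcg_at_k(binarize(graded, 0.28), k=3)
print(round(balanced, 3), round(strict, 3))
```

The same ranking scores differently under the two thresholds, which is why the paper reports results across both.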
zh
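
The evaluation described in this abstract (graded relevance from IDF-weighted keyword overlap, binarization at 0.20 and 0.28, then nDCG) can be illustrated with a minimal sketch. The overlap formula (an IDF-weighted Jaccard), the smoothed IDF, and the keyword labels are assumptions made for illustration; the abstract does not give the authors' exact definitions.

```python
import math


def idf_weights(keyword_sets):
    """Smoothed IDF for every keyword across all labelled decisions."""
    n = len(keyword_sets)
    doc_freq = {}
    for keywords in keyword_sets:
        for kw in keywords:
            doc_freq[kw] = doc_freq.get(kw, 0) + 1
    return {kw: math.log(n / df) + 1.0 for kw, df in doc_freq.items()}


def weighted_overlap(a, b, idf):
    """IDF-weighted Jaccard overlap in [0, 1] (assumed formula)."""
    union = a | b
    if not union:
        return 0.0
    return sum(idf[kw] for kw in a & b) / sum(idf[kw] for kw in union)


def ndcg_at_k(gains, k):
    """nDCG@k for graded gains listed in ranked order."""
    def dcg(values):
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(values[:k]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical keyword labels for four decisions (not data from the paper).
labels = {
    "d1": {"property", "restitution"},
    "d2": {"property", "fair trial"},
    "d3": {"fair trial", "costs"},
    "d4": {"restitution", "property"},
}
idf = idf_weights(list(labels.values()))

ranking = ["d2", "d4", "d3"]  # a retriever's ranking for query decision d1
gains = [weighted_overlap(labels["d1"], labels[d], idf) for d in ranking]
balanced = [int(g >= 0.20) for g in gains]  # the "balanced" threshold
strict = [int(g >= 0.28) for g in gains]    # the "strict" threshold
score = ndcg_at_k(gains, k=3)
```

In this toy ranking the decision with identical keywords (`d4`) is placed second, so nDCG@3 falls below 1, and the partially overlapping `d2` counts as relevant only under the balanced threshold.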

[NLP-14] MedTutor-R1: Socratic Personalized Medical Teaching with Multi-Agent Simulation

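[Quick read]: This paper addresses the large gap in clinical medical education between scarce expert instruction and growing training demand. Existing work focuses mostly on one-on-one knowledge instruction and overlooks collaborative reasoning, a skill that is essential in clinical teaching settings such as ward rounds. The key to the solution is ClinEdu, a multi-agent pedagogical simulator with personality-driven virtual patients and diverse student cohorts, which supports controlled testing of complex pedagogical processes and scalable generation of teaching data. On top of it, the authors build ClinTeach, the first Socratic dialogue dataset to capture the complexity of group instruction, and use it to train MedTutor-R1, the first multimodal Socratic tutor for one-to-many instruction in clinical medicine. The model is optimized with reinforcement learning against a three-axis rubric (structural fidelity, analytical quality, and clinical safety) to refine its adaptive teaching strategies, and in simulation it shows markedly better teaching performance than the baseline along with high adaptability.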

Link: https://arxiv.org/abs/2512.05671
Authors: Zhitao He, Haolin Yang, Zeyu Qin, Yi R Fung
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Work In Progress

Abstract:The significant gap between rising demands for clinical training and the scarcity of expert instruction poses a major challenge to medical education. With powerful capabilities in personalized guidance, Large Language Models (LLMs) offer a promising solution to bridge this gap. However, current research focuses mainly on one-on-one knowledge instruction, overlooking collaborative reasoning, a key skill for students developed in teamwork like ward rounds. To this end, we develop ClinEdu, a multi-agent pedagogical simulator with personality-driven patients and diverse student cohorts, enabling controlled testing of complex pedagogical processes and scalable generation of teaching data. Based on ClinEdu, we construct ClinTeach, a large Socratic teaching dialogue dataset that captures the complexities of group instruction. We then train MedTutor-R1, the first multimodal Socratic tutor designed for one-to-many instruction in clinical medical education. MedTutor-R1 is first instruction-tuned on our ClinTeach dataset and then optimized with reinforcement learning, using rewards derived from a three-axis rubric, covering structural fidelity, analytical quality, and clinical safety, to refine its adaptive Socratic strategies. For authentic in-situ assessment, we use simulation-based interactive evaluation that redeploys the tutor back into ClinEdu. Experimental results demonstrate that our MedTutor-R1 outperforms the base model by over 20% in average pedagogical score and is comparable to o3, while also exhibiting high adaptability in handling a varying number of students. This promising performance underscores the effectiveness of our pedagogical simulator, ClinEdu.

[NLP-15] Interleaved Latent Visual Reasoning with Selective Perceptual Modeling

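[Quick read]: This paper addresses the prohibitive computational cost that Multimodal Large Language Models (MLLMs) incur under interleaved reasoning paradigms, and the limitations of existing latent visual reasoning methods, which either lose perceptual precision by over-compressing features or cannot model dynamic problems. The key to the solution is a new framework, Interleaved Latent Visual Reasoning (ILVR), which interleaves text generation with evolving latent visual representations to unify fine-grained perceptual modeling with dynamic state evolution. Its core mechanism is a self-supervision strategy in which a Momentum Teacher Model selectively distills relevant features from helper images into sparse supervision targets, guiding the model to generate context-aware visual signals on its own and balancing computational efficiency with reasoning accuracy.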

Link: https://arxiv.org/abs/2512.05665
Authors: Shuai Dong, Siyuan Wang, Xingyu Liu, Zhongyu Wei
Affiliations: China University of Geosciences; Shanghai Innovation Institute; University of Southern California; Fudan University
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 6 figures. Code available at this https URL

Abstract:Interleaved reasoning paradigms enhance Multimodal Large Language Models (MLLMs) with visual feedback but are hindered by the prohibitive computational cost of repeatedly re-encoding pixel-dense images. A promising alternative, latent visual reasoning, circumvents this bottleneck yet currently forces a critical trade-off: methods either sacrifice precise perceptual modeling by over-compressing features or fail to model dynamic problems due to static, non-interleaved structures. We introduce Interleaved Latent Visual Reasoning (ILVR), a framework that unifies dynamic state evolution with precise perceptual modeling. ILVR interleaves textual generation with latent visual representations that act as specific, evolving cues for subsequent reasoning. To enable this, we employ a self-supervision strategy where a Momentum Teacher Model selectively distills relevant features from helper images into sparse supervision targets. This adaptive selection mechanism guides the model to autonomously generate context-aware visual signals. Extensive experiments on multimodal reasoning benchmarks demonstrate that ILVR significantly outperforms existing approaches, effectively bridging the gap between fine-grained perception and sequential multimodal reasoning.

[NLP-16] Grounded Multilingual Medical Reasoning for Question Answering with Large Language Models

Link: https://arxiv.org/abs/2512.05658
Authors: Pietro Ferrazzi, Aitor Soroa, Rodrigo Agerri
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under Review

[NLP-17] A Greek Government Decisions Dataset for Public-Sector Analysis and Insight

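[Quick read]: This paper addresses the difficulty of accessing and using public-sector text efficiently, in particular the structured processing of Greek government decision documents and support for question answering over them. The key to the solution is a large-scale, machine-readable corpus of Greek government decisions (1 million decisions), released with high-quality extracted raw text (in Markdown format) and a reproducible processing pipeline. The authors also design a Retrieval-Augmented Generation (RAG) task, with representative questions, high-quality answer annotations, and a baseline system evaluation, to demonstrate the corpus's potential for structured retrieval and reasoning and to lay a foundation for interactive AI assistants in the public-policy domain.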

Link: https://arxiv.org/abs/2512.05647
Authors: Giorgos Antoniou, Giorgos Filandrianos, Aggelos Vlachos, Giorgos Stamou, Lampros Kollimenos, Konstantinos Skianis, Michalis Vazirgiannis
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We introduce an open, machine-readable corpus of Greek government decisions sourced from the national transparency platform Diavgeia. The resource comprises 1 million decisions, featuring and high-quality raw text extracted from PDFs. It is released with raw extracted text in Markdown format, alongside a fully reproducible extraction pipeline. Beyond the core dataset, we conduct qualitative analyses to explore boilerplate patterns and design a retrieval-augmented generation (RAG) task by formulating a set of representative questions, creating high-quality answers, and evaluating a baseline RAG system on its ability to retrieve and reason over public decisions. This evaluation demonstrates the potential of large-scale public-sector corpora to support advanced information access and transparency through structured retrieval and reasoning over governmental documents, and highlights how such a RAG pipeline could simulate a chat-based assistant capable of interactively answering questions about public decisions. Due to its scale, quality, and domain coverage, the corpus can also serve as high-value pre-training or fine-tuning material for new Language Models (LMs) and Large Language Models (LLMs) respectively, including specialized models for legal and governmental domains, and as a foundation for novel approaches in domain adaptation, knowledge-grounded generation, and explainable AI. Finally, we discuss limitations, outline future directions, and make both the data and the code accessible.

[NLP-18] Ontology Learning with LLMs: A Benchmark Study on Axiom Identification

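[Quick read]: This paper addresses the insufficient automation of axiom identification in ontology construction for knowledge representation, that is, how to use Large Language Models (LLMs) to accurately extract from natural-language text the axioms that define logical relations between classes and properties. The key to the solution is OntoAxiom, a systematic benchmark of nine medium-sized ontologies with 17,118 triples and 2,771 axioms in total, used to compare prompting strategies (Direct, which queries all axioms at once, versus Axiom-by-Axiom (AbA), which queries one axiom per prompt), LLM sizes, and ontology domains. The experiments show that AbA prompting clearly outperforms the Direct approach and that larger models perform better, but that difficulty varies considerably across axiom types (subclass, disjoint, subproperty, and so on). LLMs therefore cannot yet fully replace human axiom identification, but they can serve as candidate generators that help ontology engineers develop and refine ontologies efficiently.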

Link: https://arxiv.org/abs/2512.05594
Authors: Roos M. Bakker, Daan L. Di Scala, Maaike H.T. de Boer, Stephan A. Raaijmakers
Affiliations: Netherlands Organisation for Applied Scientific Research; Leiden University Centre for Linguistics; Utrecht University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Submitted to Semantic Web Journal, under review

Abstract:Ontologies are an important tool for structuring domain knowledge, but their development is a complex task that requires significant modelling and domain expertise. Ontology learning, aimed at automating this process, has seen advancements in the past decade with the improvement of Natural Language Processing techniques, and especially with the recent growth of Large Language Models (LLMs). This paper investigates the challenge of identifying axioms: fundamental ontology components that define logical relations between classes and properties. In this work, we introduce an Ontology Axiom Benchmark OntoAxiom, and systematically test LLMs on that benchmark for axiom identification, evaluating different prompting strategies, ontologies, and axiom types. The benchmark consists of nine medium-sized ontologies totalling 17,118 triples and 2,771 axioms. We focus on subclass, disjoint, subproperty, domain, and range axioms. To evaluate LLM performance, we compare twelve LLMs with three shot settings and two prompting strategies: a Direct approach where we query all axioms at once, versus an Axiom-by-Axiom (AbA) approach, where each prompt queries for one axiom only. Our findings show that the AbA prompting leads to higher F1 scores than the direct approach. However, performance varies across axioms, suggesting that certain axioms are more challenging to identify. The domain also influences performance: the FOAF ontology achieves a score of 0.642 for the subclass axiom, while the music ontology reaches only 0.218. Larger LLMs outperform smaller ones, but smaller models may still be viable for resource-constrained settings. Although performance overall is not high enough to fully automate axiom identification, LLMs can provide valuable candidate axioms to support ontology engineers with the development and refinement of ontologies.

[NLP-19] Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning

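[Quick read]: This paper addresses the instability of policy updates in reinforcement learning (RL) based post-training of Large Language Models (LLMs), in particular the distribution shift caused by off-policy training, which leads to fluctuating policy entropy and unstable gradients. Existing methods such as PPO-Clip mitigate local policy changes through importance clipping but do not control probability shifts of un-sampled actions, so they cannot guarantee stability of the global policy distribution. The key to the solution is a new global metric, the entropy ratio, which quantifies the overall change in exploration during policy updates, together with an Entropy Ratio Clipping (ERC) mechanism that imposes bidirectional constraints on the entropy ratio. This stabilizes policy updates at the level of the global distribution and compensates for PPO-Clip's inability to regulate the probabilities of un-sampled actions. Experiments show that ERC markedly improves the stability and consistency of DAPO and GPPO across multiple benchmarks.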

Link: https://arxiv.org/abs/2512.05591
Authors: Zhenpeng Su, Leiyu Pan, Minxuan Lv, Tiehua Mei, Zijia Lin, Yuntao Li, Wenping Hu, Ruiming Tang, Kun Gai, Guorui Zhou
Affiliations: Kuaishou Technology
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Large language model post-training relies on reinforcement learning to improve model capability and alignment quality. However, the off-policy training paradigm introduces distribution shift, which often pushes the policy beyond the trust region, leading to training instabilities manifested as fluctuations in policy entropy and unstable gradients. Although PPO-Clip mitigates this issue through importance clipping, it still overlooks the global distributional shift of actions. To address these challenges, we propose using the entropy ratio between the current and previous policies as a new global metric that effectively quantifies the relative change in policy exploration throughout updates. Building on this metric, we introduce an Entropy Ratio Clipping (ERC) mechanism that imposes bidirectional constraints on the entropy ratio. This stabilizes policy updates at the global distribution level and compensates for the inability of PPO-Clip to regulate probability shifts of un-sampled actions. We integrate ERC into both DAPO and GPPO reinforcement learning algorithms. Experiments across multiple benchmarks show that ERC consistently improves performance.
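
The idea of constraining the entropy ratio from both sides can be sketched as a per-token mask. This is only an illustration under stated assumptions: the bounds `low` and `high`, the function names, and the rule of dropping out-of-range tokens from the loss are hypothetical, since the abstract does not specify how ERC applies its constraint.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_ratio(new_probs, old_probs, eps=1e-8):
    """Entropy of the current policy divided by that of the previous policy."""
    return entropy(new_probs) / (entropy(old_probs) + eps)


def erc_mask(new_dists, old_dists, low=0.95, high=1.05):
    """Return 1.0 for tokens whose entropy ratio stays inside [low, high].

    Tokens outside the band get 0.0, i.e. they would be excluded from the
    policy-gradient loss (a hypothetical way to enforce the constraint).
    """
    mask = []
    for new_p, old_p in zip(new_dists, old_dists):
        ratio = entropy_ratio(new_p, old_p)
        mask.append(1.0 if low <= ratio <= high else 0.0)
    return mask


old = [0.5, 0.3, 0.2]  # previous policy; assume action 0 was sampled
new_dists = [
    [0.5, 0.3, 0.2],     # unchanged policy
    [0.5, 0.49, 0.01],   # sampled action unchanged, un-sampled ones shifted
    [0.34, 0.33, 0.33],  # policy became noticeably more uniform
]
mask = erc_mask(new_dists, [old, old, old])
```

The second case is the one importance clipping cannot see: the sampled action keeps probability 0.5, so its importance ratio is exactly 1, yet the entropy ratio drops to about 0.72 because mass moved between the un-sampled actions.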

[NLP-20] Structured Reasoning with Tree-of-Thoughts for Bengali Math Word Problems

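[Quick read]: This paper addresses the challenge that Mathematical Word Problems (MWPs) pose in natural language processing: combining language understanding with multi-step numerical reasoning while avoiding the error propagation caused by the linear structure of Chain-of-Thought (CoT) prompting. The key to the solution is the Tree-of-Thought (ToT) reasoning framework, which builds a hierarchical, multi-path space of thoughts to improve global consistency and fault tolerance. Experiments on Bengali show that ToT clearly improves accuracy over standard prompting and CoT (up to 88%), with larger gains on medium-to-large language models such as GPT-OSS-120B, supporting ToT as a route to reliable reasoning in low-resource languages.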

Link: https://arxiv.org/abs/2512.05580
Authors: Aurprita Mahmood, Sabrin alam, Neloy kumer Sagor, Md. Abdul Hadi, Md. Sehab Al Islam, Minhajul Islam
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Mathematical Word Problems (MWPs) are among the most challenging tasks in natural language processing because they require both linguistic understanding and multi-step numerical reasoning. While Chain-of-Thought (CoT) prompting has shown promise, its linear structure often propagates errors, limiting overall effectiveness. To address this limitation, we present a systematic study of Tree-of-Thought (ToT) reasoning for Bengali MWPs using the SOMADHAN dataset. Owing to computational and token-cost constraints, we evaluate a curated set of 100 representative problems across multiple large language models (LLMs), including GPT-OSS and LLaMA variants, under standard prompting, CoT, and ToT strategies. Our results show that CoT improves baseline accuracy from 78% (standard prompting) to 83% on average, while ToT further increases performance by up to 5 percentage points, achieving 88% accuracy with GPT-OSS-120B. These improvements highlight that ToT is particularly effective in medium-to-large-scale models but may offer less advantage for smaller ones. Overall, our findings establish ToT as a robust framework for solving mathematical problems in low-resource languages such as Bengali. More broadly, this study shows that structured reasoning methods like ToT can provide more reliable and globally consistent outcomes than CoT, paving the way for better reasoning strategies in multilingual NLP.

[NLP-21] Automated Identification of Incidentalomas Requiring Follow-Up: A Multi-Anatomy Evaluation of LLM-Based and Supervised Approaches

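[Quick read]: This paper addresses the limitations of current approaches to fine-grained, lesion-level detection of incidental findings (incidentalomas) that require follow-up in radiology reports, in particular the difficulty that traditional document-level classification systems have in pinpointing the location and nature of individual lesions. The key to the solution is a novel inference strategy that uses lesion-tagged inputs and anatomy-aware prompting to strengthen the reasoning of generative Large Language Models (LLMs), enabling more accurate lesion localization and judgment of clinical significance. Experiments show that the method clearly outperforms supervised transformer encoders and reaches performance close to inter-annotator agreement.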

Link: https://arxiv.org/abs/2512.05537
Authors: Namu Park, Farzad Ahmed, Zhaoyi Sun, Kevin Lybarger, Ethan Breinhorst, Julie Hu, Ozlem Uzuner, Martin Gunn, Meliha Yetisgen
Affiliations: University of Washington; George Mason University; Te Whatu Ora Health New Zealand; School of Medicine
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Objective: To evaluate large language models (LLMs) against supervised baselines for fine-grained, lesion-level detection of incidentalomas requiring follow-up, addressing the limitations of current document-level classification systems. Methods: We utilized a dataset of 400 annotated radiology reports containing 1,623 verified lesion findings. We compared three supervised transformer-based encoders (BioClinicalModernBERT, ModernBERT, Clinical Longformer) against four generative LLM configurations (Llama 3.1-8B, GPT-4o, GPT-OSS-20b). We introduced a novel inference strategy using lesion-tagged inputs and anatomy-aware prompting to ground model reasoning. Performance was evaluated using class-specific F1-scores. Results: The anatomy-informed GPT-OSS-20b model achieved the highest performance, yielding an incidentaloma-positive macro-F1 of 0.79. This surpassed all supervised baselines (maximum macro-F1: 0.70) and closely matched the inter-annotator agreement of 0.76. Explicit anatomical grounding yielded statistically significant performance gains across GPT-based models (p < 0.05), while a majority-vote ensemble of the top systems further improved the macro-F1 to 0.90. Error analysis revealed that anatomy-aware LLMs demonstrated superior contextual reasoning in distinguishing actionable findings from benign lesions. Conclusion: Generative LLMs, when enhanced with structured lesion tagging and anatomical context, significantly outperform traditional supervised encoders and achieve performance comparable to human experts. This approach offers a reliable, interpretable pathway for automated incidental finding surveillance in radiology workflows.
Cite as: arXiv:2512.05537 [cs.CL] (this version: arXiv:2512.05537v1), https://doi.org/10.48550/arXiv.2512.05537. Submitted by Namu Park; v1 on Fri, 5 Dec 2025 08:49:57 UTC (618 KB).
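
The majority-vote ensemble and class-specific F1 mentioned in the results can be sketched in a few lines. The three systems and all labels below are hypothetical (1 means the lesion requires follow-up) and are not data from the paper.

```python
from collections import Counter


def majority_vote(system_predictions):
    """Per-lesion majority label across several systems' prediction lists."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*system_predictions)]


def f1_for_class(gold, pred, cls):
    """F1 score for a single class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical lesion-level labels and three systems' predictions.
gold = [1, 0, 1, 1, 0, 0]
sys_a = [1, 0, 0, 1, 0, 1]
sys_b = [1, 1, 1, 1, 0, 0]
sys_c = [0, 0, 1, 1, 0, 0]

ensemble = majority_vote([sys_a, sys_b, sys_c])
f1_single = f1_for_class(gold, sys_a, cls=1)
f1_ensemble = f1_for_class(gold, ensemble, cls=1)
```

With an odd number of systems and binary labels there are no ties. In this toy case each system makes errors on different lesions, so the vote recovers every gold label even though no single system does.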

[NLP-22] SEA-SafeguardBench: Evaluating AI Safety in SEA Languages and Cultures

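[Quick read]: This paper addresses the English-centric bias of current safety evaluation for Large Language Models (LLMs), in particular the neglect of low-resource languages and localized harm scenarios in Southeast Asia (SEA). Most existing multilingual safety benchmarks rely on machine-translated English data, which cannot capture culturally sensitive content and region-specific risks. The authors therefore propose SEA-SafeguardBench, the first multilingual SEA safety benchmark verified by local human annotators, covering 8 languages and 21,640 samples across three subsets: general, in-the-wild, and content generation. Its key innovation is that the texts are natively authored, so they reflect regional social norms and harm types and give a more realistic picture of LLM safety across cultural contexts.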

Link: https://arxiv.org/abs/2512.05501
Authors: Panuthep Tasawong, Jian Gang Ngui, Alham Fikri Aji, Trevor Cohn, Peerat Limkonchotiwat
Affiliations: VISTEC; Google; AI Singapore
Categories: Computation and Language (cs.CL)
Comments: Under review

Abstract:Safeguard models help large language models (LLMs) detect and block harmful content, but most evaluations remain English-centric and overlook linguistic and cultural diversity. Existing multilingual safety benchmarks often rely on machine-translated English data, which fails to capture nuances in low-resource languages. Southeast Asian (SEA) languages are underrepresented despite the region’s linguistic diversity and unique safety concerns, from culturally sensitive political speech to region-specific misinformation. Addressing these gaps requires benchmarks that are natively authored to reflect local norms and harm scenarios. We introduce SEA-SafeguardBench, the first human-verified safety benchmark for SEA, covering eight languages, 21,640 samples, across three subsets: general, in-the-wild, and content generation. The experimental results from our benchmark demonstrate that even state-of-the-art LLMs and guardrails are challenged by SEA cultural and harm scenarios and underperform when compared to English texts.

[NLP-23] Dynamic Alignment for Collective Agency: Toward a Scalable Self-Improving Framework for Open-Ended LLM Alignment AAAI2026

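[Quick read]: This paper addresses the limitations of conventional alignment methods for Large Language Models (LLMs), which rely on human preferences or predefined principles such as helpfulness, honesty, and harmlessness, as systems progress toward Artificial General Intelligence (AGI) and Artificial Superintelligence (ASI); it also addresses the resource cost and poor scalability of alignment driven by human feedback. The solution rests on two contributions. The first is Collective Agency (CA), a unified and open-ended alignment objective intended to go beyond conventional alignment paradigms. The second is the Dynamic Alignment framework, in which LLMs automatically generate the training dataset and a self-rewarding mechanism lets the model align itself iteratively, enabling scalable autonomous optimization. Experiments show that the approach aligns the model to CA while preserving general natural language processing (NLP) capabilities.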

Link: https://arxiv.org/abs/2512.05464
Authors: Panatchakorn Anantaprayoon, Nataliia Babina, Jad Tarifi, Nima Asgharbeygi
Affiliations: Integral AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 4 figures, to appear in AAAI 2026 AIGOV Workshop

Abstract:Large Language Models (LLMs) are typically aligned with human values using preference data or predefined principles such as helpfulness, honesty, and harmlessness. However, as AI systems progress toward Artificial General Intelligence (AGI) and Artificial Superintelligence (ASI), such value systems may become insufficient. In addition, human feedback-based alignment remains resource-intensive and difficult to scale. While AI-feedback-based self-improving alignment methods have been explored as a scalable alternative, they have largely remained constrained to conventional alignment values. In this work, we explore both a more holistic alignment objective and a scalable, self-improving alignment approach. Aiming to transcend conventional alignment norms, we introduce Collective Agency (CA)-a unified and open-ended alignment value that encourages integrated agentic capabilities. We also propose Dynamic Alignment-an alignment framework that enables an LLM to iteratively align itself. Dynamic Alignment comprises two key components: (1) automated training dataset generation with LLMs, and (2) a self-rewarding mechanism, where the policy model evaluates its own output candidates and assigns rewards for GRPO-based learning. Experimental results demonstrate that our approach successfully aligns the model to CA while preserving general NLP capabilities.

[NLP-24] ArtistMus: A Globally Diverse Artist-Centric Benchmark for Retrieval-Augmented Music Question Answering LREC2026

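[Quick read]: This paper addresses the limited performance of Large Language Models (LLMs) on music-related reasoning, which stems from the sparsity of music knowledge in pretraining data. The authors offer two core contributions: MusWikiDB, a vector database of 3.2 million passages drawn from 144,000 music-related Wikipedia pages, and ArtistMus, a benchmark of 1,000 questions on 500 diverse artists with metadata such as genre and debut year. The key idea is to apply Retrieval-Augmented Generation (RAG), which markedly improves factual accuracy and contextual understanding: experiments show that RAG raises the accuracy of open-source models such as Qwen3 8B from 35.0% to 91.8%, approaching proprietary model performance, and improves reasoning on both in-domain and out-of-domain benchmarks.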

Link: https://arxiv.org/abs/2512.05430
Authors: Daeyong Kwon, SeungHeon Doh, Juhan Nam
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Submitted to LREC 2026. This work is an evolution of our earlier preprint arXiv:2507.23334

Abstract:Recent advances in large language models (LLMs) have transformed open-domain question answering, yet their effectiveness in music-related reasoning remains limited due to sparse music knowledge in pretraining data. While music information retrieval and computational musicology have explored structured and multimodal understanding, few resources support factual and contextual music question answering (MQA) grounded in artist metadata or historical context. We introduce MusWikiDB, a vector database of 3.2M passages from 144K music-related Wikipedia pages, and ArtistMus, a benchmark of 1,000 questions on 500 diverse artists with metadata such as genre, debut year, and topic. These resources enable systematic evaluation of retrieval-augmented generation (RAG) for MQA. Experiments show that RAG markedly improves factual accuracy; open-source models gain up to +56.8 percentage points (for example, Qwen3 8B improves from 35.0 to 91.8), approaching proprietary model performance. RAG-style fine-tuning further boosts both factual recall and contextual reasoning, improving results on both in-domain and out-of-domain benchmarks. MusWikiDB also yields approximately 6 percentage points higher accuracy and 40% faster retrieval than a general-purpose Wikipedia corpus. We release MusWikiDB and ArtistMus to advance research in music information retrieval and domain-specific question answering, establishing a foundation for retrieval-augmented reasoning in culturally rich domains such as music.

[NLP-25] LMSpell: Neural Spell Checking for Low-Resource Languages

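[Quick read]: This paper addresses the challenge of spell correction for low-resource languages (LRLs), in particular the limited use of pretrained language models (PLMs) for LRLs and the lack of a systematic comparison across them. The key to the solution is the first empirical study of PLM effectiveness for spell correction, which finds that Large Language Models (LLMs) outperform encoder-based and encoder-decoder models when training data is plentiful, and that this advantage holds even for languages the LLM was not pretrained on. The authors also release the LMSpell toolkit, which provides spell correction across PLMs and includes an evaluation function that compensates for LLM hallucination, improving reliability and applicability in practice.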

Link: https://arxiv.org/abs/2512.05414
Authors: Akesh Gunathilake, Nadil Karunarathne, Tharusha Bandaranayake, Nisansa de Silva, Surangika Ranathunga
Affiliations: University of Moratuwa; Massey University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Spell correction is still a challenging problem for low-resource languages (LRLs). While pretrained language models (PLMs) have been employed for spell correction, their use is still limited to a handful of languages, and there has been no proper comparison across PLMs. We present the first empirical study on the effectiveness of PLMs for spell correction, which includes LRLs. We find that Large Language Models (LLMs) outperform their counterparts (encoder-based and encoder-decoder) when the fine-tuning dataset is large. This observation holds even in languages for which the LLM is not pre-trained. We release LMSpell, an easy-to-use spell correction toolkit across PLMs. It includes an evaluation function that compensates for the hallucination of LLMs. Further, we present a case study with Sinhala to shed light on the plight of spell correction for LRLs.

[NLP-26] SQ-format: A Unified Sparse-Quantized Hardware-friendly Data Format for LLMs

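[Quick read]: This paper addresses the difficulty that low-bit quantization and sparsification techniques have in balancing accuracy and efficiency in practice, especially given limited hardware support. For example, W4A8 quantization lowers the compute bit width, yet its peak throughput (TOPS) does not exceed that of W8A8, while the 2:4 semi-structured sparse format that GPUs widely support is seldom adopted because of its accuracy loss. The paper proposes a unified data format, the Sparse-Quantized Format (SQ-format). Its key idea is that sparse matrices can be accelerated at high precision while low-precision matrix multiplication can be accelerated as well, which yields a Pareto improvement between performance and throughput. SQ-format is particularly suited to activations with unevenly distributed outliers, makes their static compression possible, and markedly improves Post-training Quantization (PTQ) performance.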

Link: https://arxiv.org/abs/2512.05409
Authors: Ruixuan Huang, Hao Zeng, Hantao Huang, Jinyuan Shi, Minghui Yu, Ian En-Hsu Yen, Shuai Wang
Affiliations: ByteDance; Moffett.AI
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Post-training quantization (PTQ) plays a crucial role in the democratization of large language models (LLMs). However, existing low-bit quantization and sparsification techniques struggle to balance accuracy and efficiency due to limited hardware support. For example, W4A8 can only achieve the same peak TOPS as W8A8, whereas the GPU-supported sparse data format (2:4 semi-structured sparsity) is seldom adopted due to its loss of accuracy. To bridge this gap, in this paper, we propose the Sparse-Quantized Format (SQ-format), which is a unified data format for quantization and sparsification potentially easily supported by new hardware and existing GPUs. SQ-format makes use of the fact that sparse matrices can be accelerated in high precision, and low-precision matrix multiplication can also be accelerated accordingly. As such, SQ-format is proposed to achieve Pareto improvement between performance and throughput. This format is particularly suitable for activations with outlier inequality status and makes their static compression possible. We show the state-of-the-art PTQ performance with SQ-format, propose the hardware required to support it, and further offer the design exploration and insights for the next-generation AI accelerators.
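
For readers unfamiliar with the 2:4 semi-structured sparse format the abstract refers to, the sketch below shows the pattern itself: in every group of four consecutive values, at most two are non-zero. It uses simple magnitude pruning as an assumed selection rule and illustrates only the existing GPU format, not SQ-format, whose layout the abstract does not describe.

```python
def prune_2_4(row):
    """Keep the two largest-magnitude values in each group of four.

    The other two values in each group are set to zero, which yields the
    2:4 semi-structured sparsity pattern. Magnitude-based selection is an
    assumption made for this illustration.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two entries with the largest absolute value.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


weights = [0.1, -2.0, 0.3, 1.5, 4.0, 0.0, -0.2, 0.1]
sparse = prune_2_4(weights)
```

The first group keeps `-2.0` and `1.5`, the second keeps `4.0` and `-0.2`. Zeroing half of every group regardless of how important the values are is one intuition for the accuracy loss the abstract mentions.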

[NLP-27] Learning from Self Critique and Refinement for Faithful LLM Summarization

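[Quick read]: This paper addresses hallucination in Large Language Models (LLMs) on long-form generation tasks such as summarization, where the output is inconsistent with or not grounded in the input context. Existing methods can reduce the problem through iterative critique and refinement, but they usually need extra test-time compute or a more powerful teacher model, which limits their practicality. The paper proposes Self Critique and Refinement-based Preference Optimization (SCRPO), whose core idea is to use the LLM's own critique and refinement abilities to build a preference dataset automatically and then optimize the same model with preference learning. This improves summary faithfulness without an external teacher model or extra inference cost, while maintaining or even improving overall summary quality.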

Link: https://arxiv.org/abs/2512.05387
Authors: Ting-Yao Hu, Hema Swetha Koppula, Hadi Pouransari, Cem Koc, Oncel Tuzel, Raviteja Vemulapalli
Affiliations: Apple
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) often suffer from hallucinations: output content that is not grounded in the input context, when performing long-form text generation tasks such as summarization. Prior works have shown that hallucinations can be reduced by iteratively critiquing and refining previously generated outputs using either the same model or a more powerful teacher model as the critic. However, these approaches either require additional test-time compute or assume access to more powerful teacher models, making them costly and less practical. In this work, we propose Self Critique and Refinement-based Preference Optimization (SCRPO), which is a self-supervised training framework that first constructs a preference dataset by leveraging the LLM's own critique and refinement capabilities, and then applies preference learning to improve the same LLM for faithful summarization. Experiments on three summarization benchmarks (XSUM, CNNDM, and SAMSum) demonstrate that our approach outperforms state-of-the-art self-supervised learning methods in terms of faithfulness metrics while either maintaining or improving other metrics that measure the overall quality of the summary. Moreover, compared to test-time refinement, our approach not only improves efficiency but also results in more faithful summaries.

[NLP-28] Mitigating Self-Preference by Authorship Obfuscation

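[Quick read]: This paper addresses self-preference in Language Model (LM) judges: the tendency of an LM judge to prefer its own outputs over those of other LMs or humans, which undermines the objectivity of evaluation. The key to the solution is to use black-box perturbations to reduce the judge's ability to recognize its own outputs. Specifically, the candidate texts in a pairwise comparison are perturbed, for example by replacing a few words with synonyms, to obfuscate authorship and reduce self-recognition. Experiments show that such simple perturbations effectively reduce self-preference, but that self-preference returns once stylistic differences are neutralized more completely. This suggests that self-recognition can happen at multiple semantic levels and that eliminating the bias entirely remains difficult.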

Link: https://arxiv.org/abs/2512.05379
Authors: Taslim Mahbub, Shi Feng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Language model (LM) judges are widely used to evaluate the quality of LM outputs. Despite many advantages, LM judges display concerning biases that can impair their integrity in evaluations. One such bias is self-preference: LM judges preferring their own answers over those produced by other LMs or humans. The bias is hard to eliminate as frontier LM judges can distinguish their own outputs from those of others, even when the evaluation candidates are not labeled with their sources. In this paper, we investigate strategies to mitigate self-preference by reducing the LM judges' ability to recognize their own outputs. We apply black-box perturbations to evaluation candidates in pairwise comparison to obfuscate the authorship and reduce self-recognition. We find that perturbations as simple as synonym replacement for a few words predictably reduce self-preference. However, we also uncover fundamental challenges to eliminating the bias: when we extrapolate our perturbations to a more complete neutralization of stylistic differences between the evaluation candidates, self-preference recovers. Our findings suggest that self-recognition and self-preference can happen on many semantic levels, and complete mitigation remains challenging despite promising initial results.

[NLP-29] Text Rationalization for Robust Causal Effect Estimation

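[Quick read]: This paper addresses violations of the positivity assumption at the observational level when high-dimensional text data is used in causal inference: redundant or spurious features lead to extreme propensity scores, unstable weights, and inflated variance in effect estimates. The key to the solution is a framework called Confounding-Aware Token Rationalization (CATR), which uses a residual-independence diagnostic to select a sparse subset of tokens that retains enough information to control for confounding. By discarding irrelevant text while keeping the key causal signals, CATR mitigates positivity violations and stabilizes downstream causal effect estimation.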

Link: https://arxiv.org/abs/2512.05373
Authors: Lijinghua Zhang, Hengrui Cai
Affiliations: University of California, Irvine
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Methodology (stat.ME); Machine Learning (stat.ML)
Comments:

Abstract:Recent advances in natural language processing have enabled the increasing use of text data in causal inference, particularly for adjusting confounding factors in treatment effect estimation. Although high-dimensional text can encode rich contextual information, it also poses unique challenges for causal identification and estimation. In particular, the positivity assumption, which requires sufficient treatment overlap across confounder values, is often violated at the observational level, when massive text is represented in feature spaces. Redundant or spurious textual features inflate dimensionality, producing extreme propensity scores, unstable weights, and inflated variance in effect estimates. We address these challenges with Confounding-Aware Token Rationalization (CATR), a framework that selects a sparse necessary subset of tokens using a residual-independence diagnostic designed to preserve confounding information sufficient for unconfoundedness. By discarding irrelevant texts while retaining key signals, CATR mitigates observational-level positivity violations and stabilizes downstream causal effect estimators. Experiments on synthetic data and a real-world study using the MIMIC-III database demonstrate that CATR yields more accurate, stable, and interpretable causal effect estimates than existing baselines.
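
Why extreme propensity scores destabilize effect estimates can be seen from the inverse-propensity weights themselves. The sketch below uses standard inverse probability weighting and the Kish effective sample size as a diagnostic; the treatment assignments and propensity values are hypothetical, and this is not the CATR method.

```python
def ipw_weights(treatments, propensities):
    """Inverse-propensity weights: 1/e for treated units, 1/(1-e) for controls."""
    return [1.0 / e if t == 1 else 1.0 / (1.0 - e)
            for t, e in zip(treatments, propensities)]


def effective_sample_size(weights):
    """Kish effective sample size: (sum of w)^2 / sum of w^2."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


treatments = [1, 0, 1, 0]

# Good overlap: every unit had a 50% chance of treatment.
balanced = ipw_weights(treatments, [0.5, 0.5, 0.5, 0.5])

# Near-violation of positivity: one treated unit had only a 2% chance of
# being treated, so its weight is 50 and it dominates the estimate.
extreme = ipw_weights(treatments, [0.99, 0.01, 0.02, 0.5])

ess_balanced = effective_sample_size(balanced)
ess_extreme = effective_sample_size(extreme)
```

With good overlap all four units contribute equally (effective sample size 4). With the extreme scores the effective sample size falls to roughly 1.2, which is the variance inflation the abstract describes.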

[NLP-30] Transformer-Enabled Diachronic Analysis of Vedic Sanskrit: Neural Methods for Quantifying Types of Language Change

【速读】: 该论文旨在解决低资源、形态丰富的语言(如梵语)在历时演变过程中,如何通过计算方法准确捕捉其复杂性变化的问题。传统观点常假设语言演化趋向简化,但本文挑战这一假设,提出一种弱监督的混合神经符号方法,以揭示真实演化模式。解决方案的关键在于:首先利用100多个高精度正则表达式生成伪标签,对多语言BERT进行微调,从而缓解数据稀缺问题;其次,设计了一种基于置信度加权的集成机制,融合符号规则与神经网络输出,实现模型的可扩展性与可解释性;最终在147万词的历时语料库上实现了52.4%的整体特征检测率,并发现梵语的形态复杂性并非减少而是动态再分配——早期动词特征呈周期性衰减,而复合词和哲学术语显著增长,同时系统具备良好的不确定性校准能力(Pearson相关系数r=0.92,ECE=0.043),增强了计算语文学研究结果的可靠性。

Link: https://arxiv.org/abs/2512.05364
Authors: Ananth Hariharan, David Mortensen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:This study demonstrates how hybrid neural-symbolic methods can yield significant new insights into the evolution of a morphologically rich, low-resource language. We challenge the naive assumption that linguistic change is simplification by quantitatively analyzing over 2,000 years of Sanskrit, demonstrating how weakly-supervised hybrid methods can yield new insights into the evolution of morphologically rich, low-resource languages. Our approach addresses data scarcity through weak supervision, using 100+ high-precision regex patterns to generate pseudo-labels for fine-tuning a multilingual BERT. We then fuse symbolic and neural outputs via a novel confidence-weighted ensemble, creating a system that is both scalable and interpretable. Applying this framework to a 1.47-million-word diachronic corpus, our ensemble achieves a 52.4% overall feature detection rate. Our findings reveal that Sanskrit’s overall morphological complexity does not decrease but is instead dynamically redistributed: while earlier verbal features show cyclical patterns of decline, complexity shifts to other domains, evidenced by a dramatic expansion in compounding and the emergence of new philosophical terminology. Critically, our system produces well-calibrated uncertainty estimates, with confidence strongly correlating with accuracy (Pearson r = 0.92) and low overall calibration error (ECE = 0.043), bolstering the reliability of these findings for computational philology.
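
The calibration figure quoted above (ECE = 0.043) refers to expected calibration error. The sketch below shows the standard equal-width-bin definition of that metric; the bin count and function name are assumptions, and the paper's own binning scheme is not stated here.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

A model that says 0.9 and is right nine times out of ten scores an ECE of about zero, while a model that says 1.0 and is right half the time scores 0.5.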

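[NLP-31] The Effect of Document Summarization on LLM-Based Relevance Judgments

[Quick read]: This paper addresses the high cost and slow pace of human relevance judgments in the evaluation of Information Retrieval (IR) systems, for which Large Language Models (LLMs) have been proposed as automated assessors. Its key contribution is a systematic study of how text summarization affects the reliability of LLM-generated relevance judgments: it compares judgments made from full documents with judgments made from LLM-generated summaries of different lengths. Summary-based judgments yield system-ranking stability comparable to full-document judgments, but they introduce shifts in label distributions and systematic biases that vary by model and dataset. Summarization therefore offers a more efficient route to large-scale IR evaluation, while remaining a methodological choice whose effect on the reliability of automatic judgments must be handled with care.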

Link: https://arxiv.org/abs/2512.05334
Authors: Samaneh Mohtadi, Kevin Roitero, Stefano Mizzaro, Gianluca Demartini
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Relevance judgments are central to the evaluation of Information Retrieval (IR) systems, but obtaining them from human annotators is costly and time-consuming. Large Language Models (LLMs) have recently been proposed as automated assessors, showing promising alignment with human annotations. Most prior studies have treated documents as fixed units, feeding their full content directly to LLM assessors. We investigate how text summarization affects the reliability of LLM-based judgments and their downstream impact on IR evaluation. Using state-of-the-art LLMs across multiple TREC collections, we compare judgments made from full documents with those based on LLM-generated summaries of different lengths. We examine their agreement with human labels, their effect on retrieval effectiveness evaluation, and their influence on IR systems’ ranking stability. Our findings show that summary-based judgments achieve comparable stability in systems’ ranking to full-document judgments, while introducing systematic shifts in label distributions and biases that vary by model and dataset. These results highlight summarization as both an opportunity for more efficient large-scale IR evaluation and a methodological choice with important implications for the reliability of automatic judgments.
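
Ranking stability of this kind is commonly measured with a rank correlation between the system orderings produced by two sets of judgments. The sketch below uses Kendall's tau for that purpose; whether the paper uses this exact statistic is an assumption, and the system names and scores are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts keyed by system name (no tie correction)."""
    systems = sorted(scores_a)
    if systems != sorted(scores_b) or len(systems) < 2:
        raise ValueError("both inputs must score the same two or more systems")
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical effectiveness scores under full-document and summary-based judgments.
full_doc = {"sysA": 0.30, "sysB": 0.20, "sysC": 0.10}
summary = {"sysA": 0.20, "sysB": 0.30, "sysC": 0.10}
print(kendall_tau(full_doc, summary))
```

A value close to 1 means the two judgment sets order the systems almost identically, even if the absolute scores and label distributions differ.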

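[NLP-32] Exposing Pink Slime Journalism: Linguistic Signatures and Robust Detection Against LLM-Generated Threats

[Quick read]: This paper addresses the threat that Pink Slime Journalism, low-quality auto-generated content that mimics legitimate local reporting, poses to local news, a source of reliable information for 28 million Americans. Such content evades traditional detectors through subtle changes in linguistic, stylistic, and lexical features, and becomes harder to detect once Large Language Models (LLMs) are used to modify it. The key to the solution is a robust learning framework designed to resist LLM-driven adversarial attacks and to adapt to the evolving landscape of automated pink slime journalism; in the experiments it improves detection performance by up to 27%.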

Link: https://arxiv.org/abs/2512.05331
Authors: Sadat Shahriar, Navid Ayoobi, Arjun Mukherjee, Mostafa Musharrat, Sai Vishnu Vamsi
Affiliations: University of Houston
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Published in RANLP 2025

Abstract:The local news landscape, a vital source of reliable information for 28 million Americans, faces a growing threat from Pink Slime Journalism: low-quality, auto-generated articles that mimic legitimate local reporting. Detecting these deceptive articles requires a fine-grained analysis of their linguistic, stylistic, and lexical characteristics. In this work, we conduct a comprehensive study to uncover the distinguishing patterns of Pink Slime content and propose detection strategies based on these insights. Beyond traditional generation methods, we highlight a new adversarial vector: modifications through large language models (LLMs). Our findings reveal that even consumer-accessible LLMs can significantly undermine existing detection systems, reducing their performance by up to 40% in F1-score. To counter this threat, we introduce a robust learning framework specifically designed to resist LLM-based adversarial attacks and adapt to the evolving landscape of automated pink slime journalism, and show an improvement of up to 27%.

[NLP-33] LYNX: Learning Dynamic Exits for Confidence-Controlled Reasoning

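[Quick read]: This paper addresses "overthinking" in large reasoning models: the model keeps generating redundant reasoning steps after it already has enough information to answer correctly, which wastes compute and can lower accuracy. Existing early-exit methods rely on extra sampling and heuristics, on auxiliary verifier models, or work only as post-hoc analysis pipelines, and they lack formal guarantees. The key to the solution is LYNX, an online early-exit mechanism that exploits the model's own hidden states at naturally occurring reasoning cues (such as "hmm" and "wait"): a lightweight probe is trained at those cue tokens and wrapped in split conformal prediction to give distribution-free control over premature exits. The probe is trained and calibrated once on a generic mathematical corpus and then reused unchanged across benchmarks, decoding temperatures, and even non-mathematical tasks, cutting token usage substantially while maintaining or improving accuracy.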

Link: https://arxiv.org/abs/2512.05325
Authors: Ömer Faruk Akgül, Yusuf Hakan Kalaycı, Rajgopal Kannan, Willie Neiswanger, Viktor Prasanna
Affiliations: University of Southern California; DEVCOM ARL
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Large reasoning models achieve strong performance on complex tasks by generating extended chains of thought, but they often “overthink”: continuing to reason long after they have enough information to answer correctly. This wastes inference-time compute and can hurt accuracy. Existing attempts to stop early either manipulate decoding with extra sampling and heuristics, rely on auxiliary verifier models, or operate only as post-hoc analysis pipelines without formal guarantees. We introduce LYNX, an online early-exit mechanism that turns a model’s own hidden-state awareness into confidence-controlled stopping decisions. LYNX attaches exit decisions to naturally occurring reasoning cues (e.g., “hmm”, “wait”) during generation, trains a lightweight probe on hidden states at those cue tokens using supervision from forced exits, and wraps the resulting scores in split conformal prediction to obtain distribution-free control over premature exits. Crucially, we train and calibrate this probe once on a generic mathematical corpus and reuse it unchanged across benchmarks, decoding temperatures, and even non-mathematical tasks. Across three model families spanning 1.5B to 32B parameters, a single mathematically trained probe per base model yields strong accuracy–efficiency tradeoffs. On GSM8K, LYNX matches or improves baseline accuracy while reducing tokens by 40–65%; on MATH-500 it improves accuracy by up to 12 points with roughly 35–60% fewer tokens; on AIME 2024 it recovers baseline accuracy with more than 50% token savings; and on CommonsenseQA, a non-math benchmark, it transfers zero-shot with modest accuracy gains and up to 70% fewer tokens. Compared to state-of-the-art early-exit methods, LYNX offers competitive or superior Pareto frontiers while remaining fully online, requiring no proxy models at inference, and providing explicit, user-tunable confidence guarantees.
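
The split conformal step can be illustrated with a generic one-sided calibration rule. The sketch below is not LYNX's actual procedure: it assumes a held-out set of probe scores collected at cue tokens where exiting would have been premature, and picks a threshold so that a new premature-exit score exceeds it with probability at most `delta` under exchangeability. All names are hypothetical.

```python
import math


def conformal_exit_threshold(premature_scores, delta):
    """Threshold tau such that an exchangeable premature-exit score exceeds tau
    with probability at most delta (split conformal, finite-sample quantile)."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    n = len(premature_scores)
    k = math.ceil((n + 1) * (1.0 - delta))
    if k > n:
        # Too few calibration points for this delta: never exit early.
        return math.inf
    return sorted(premature_scores)[k - 1]


def should_exit(probe_score, threshold):
    """Exit only when the probe is more confident than the calibrated threshold."""
    return probe_score > threshold


threshold = conformal_exit_threshold([0.2, 0.9, 0.5], delta=0.5)
print(threshold, should_exit(0.6, threshold))
```

Lowering `delta` pushes the threshold up, trading token savings for a tighter bound on premature exits; with too few calibration scores the rule falls back to never exiting.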

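[NLP-34] To Think or Not to Think: The Hidden Cost of Meta-Training with Excessive CoT Examples

[Quick read]: This paper addresses the ineffectiveness of chain-of-thought (CoT) prompting combined with few-shot in-context learning (ICL) on novel tasks when pre-training knowledge is insufficient. The key to the solution is CoT-Recipe, a formal approach for modulating the mix of CoT and non-CoT examples in meta-training sequences, which avoids the performance degradation that excessive CoT examples cause when CoT supervision is limited. Experiments show that careful modulation with CoT-Recipe raises transformer accuracy on novel tasks by up to 300% even when no CoT examples are available in context, and applying the techniques to pretrained LLMs (the Qwen2.5 series) on symbolic reasoning tasks yields accuracy gains of up to 130%.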

Link: https://arxiv.org/abs/2512.05318
Authors: Vignesh Kothapalli, Ata Fatahibaarzi, Hamed Firooz, Maziar Sanjabi
Affiliations: Stanford University; LinkedIn AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 26 pages, 45 figures, 3 tables

Abstract:Chain-of-thought (CoT) prompting combined with few-shot in-context learning (ICL) has unlocked significant reasoning capabilities in large language models (LLMs). However, ICL with CoT examples is ineffective on novel tasks when the pre-training knowledge is insufficient. We study this problem in a controlled setting using the CoT-ICL Lab framework, and propose meta-training techniques to learn novel abstract reasoning tasks in-context. Although CoT examples facilitate reasoning, we noticed that their excessive inclusion during meta-training degrades performance when CoT supervision is limited. To mitigate such behavior, we propose CoT-Recipe, a formal approach to modulate the mix of CoT and non-CoT examples in meta-training sequences. We demonstrate that careful modulation via CoT-Recipe can increase the accuracy of transformers on novel tasks by up to 300% even when there are no CoT examples available in-context. We confirm the broader effectiveness of these techniques by applying them to pretrained LLMs (Qwen2.5 series) for symbolic reasoning tasks and observing gains of up to 130% in accuracy.

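[NLP-35] Enhancing Clinical Note Generation with ICD-10 Clinical Ontology Knowledge Graphs and Chain-of-Thought Prompting Using GPT-4

[Quick read]: This paper addresses the time clinicians spend manually writing clinical notes in electronic health records (EHRs), which lengthens patient waiting times and may delay diagnoses. The key to the solution is Chain-of-Thought (CoT) prompt engineering that takes International Classification of Diseases (ICD) codes and basic patient information as input, combines them with semantic search results, and adds a knowledge graph (KG) built from a clinical ontology to enrich the domain knowledge available to the large language model (LLM). On six clinical cases from the CodiEsp test dataset, the notes generated this way with GPT-4 outperformed those from standard one-shot prompts.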

Link: https://arxiv.org/abs/2512.05256
Authors: Ivan Makohon, Mohamad Najafi, Jian Wu, Mathias Brochhausen, Yaohang Li
Affiliations: Old Dominion University; University of Arkansas for Medical Sciences
Categories: Computation and Language (cs.CL); Quantitative Methods (q-bio.QM)
Comments:

Abstract:The past decade has seen a surge in the amount of electronic health record (EHR) data in the United States, attributed to a favorable policy environment created by the Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009 and the 21st Century Cures Act of 2016. Clinical notes for patients’ assessments, diagnoses, and treatments are captured in these EHRs in free-form text by physicians, who spend a considerable amount of time entering and editing them. Manually writing clinical notes takes a considerable amount of a doctor’s valuable time, increasing the patient’s waiting time and possibly delaying diagnoses. Large language models (LLMs) possess the ability to generate news articles that closely resemble human-written ones. We investigate the use of Chain-of-Thought (CoT) prompt engineering to improve the LLM’s response in clinical note generation. In our prompts, we use International Classification of Diseases (ICD) codes and basic patient information as input. We investigate a strategy that combines the traditional CoT with semantic search results to improve the quality of generated clinical notes. Additionally, we infuse a knowledge graph (KG) built from a clinical ontology to further enrich the domain-specific knowledge of generated clinical notes. We test our prompting technique on six clinical cases from the CodiEsp test dataset using GPT-4, and our results show that it outperformed the clinical notes generated by standard one-shot prompts.

[NLP-36] Decoding the Black Box: Discerning AI Rhetorics About and Through Poetic Prompting

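[Quick read]: This paper asks how prompt engineering can better expose the algorithmic tendencies and biases of Large Language Models (LLMs) while also probing the limits of their use in creative text generation. The key to the approach is Poetry Prompt Patterns, a creative prompting strategy used to assess how three models describe and evaluate a renowned poet, and to test whether the models are willing to adapt or rewrite original creative works for presumed audiences.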

Link: https://arxiv.org/abs/2512.05243
Authors: P.D. Edgar, Alia Hall
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Late-Breaking Paper accepted to IEEE SSCI 2025 NLP Social Media Track as extended abstract and presented in Trondheim, Norway 17-20 March 2025 as Poster Presentation

Abstract:Prompt engineering has emerged as a useful way of studying the algorithmic tendencies and biases of large language models. Meanwhile, creatives and academics have leveraged LLMs to develop creative works and explore the boundaries of their writing capabilities through text generation and code. This study suggests that creative text prompting, specifically Poetry Prompt Patterns, may be a useful addition to the toolbox of the prompt engineer, and outlines the process by which this approach may be taken. Then, the paper uses poetic prompts to assess descriptions and evaluations of three models of a renowned poet and test the consequences of the willingness of models to adapt or rewrite original creative works for presumed audiences.

[NLP-37] Unveiling Affective Polarization Trends in Parliamentary Proceedings

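[Quick read]: This paper addresses how to quantify affective polarization in political discourse; existing methods mostly rely on differences in ideological stance and overlook differences in emotional expression. The key to the solution is a measure based on emotional style: Valence, Arousal, and Dominance scores are used to detect emotional signals in the discourse and to operationalize the concept of affective polarization. Applied to a corpus of proceedings of the Knesset (the Israeli parliament), the method shows that the emotional style of government members differs from that of opposition members and that affective polarization increases significantly over time.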

Link: https://arxiv.org/abs/2512.05231
Authors: Gili Goldin, Ella Rabinovich, Shuly Wintner
Affiliations: University of Haifa; The Academic College of Tel Aviv-Yaffo
Categories: Computation and Language (cs.CL)
Comments: pre-MIT Press publication version

Abstract:Recent years have seen an increase in polarized discourse worldwide, on various platforms. We propose a novel method for quantifying polarization, based on the emotional style of the discourse rather than on differences in ideological stands. Using measures of Valence, Arousal and Dominance, we detect signals of emotional discourse and use them to operationalize the concept of affective polarization. Applying this method to a recently released corpus of proceedings of the Knesset, the Israeli parliament (in Hebrew), we find that the emotional style of members of government differs from that of opposition members; and that the level of affective polarization, as reflected by this style, is significantly increasing with time.

[NLP-38] On the Computability of Artificial General Intelligence

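[Quick read]: The core question of this paper is whether any algorithm, including an AI model, can be truly creative and thereby reach human-level artificial general intelligence (A.G.I.). The authors adopt a definition from prior work that reflects how leading AI developers use the term: the ability to be creative and innovate in a field in a way that unlocks new, previously unknown functional capabilities. The key result is a formal proof that no algorithm can demonstrate functional capabilities that were not already present in the initial algorithm itself, so no algorithm can be truly creative; AI models can only demonstrate existing functional capabilities and their combinations and permutations. The authors discuss what this bound implies for the future of AI development and for the origins of human intelligence.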

Link: https://arxiv.org/abs/2512.05212
Authors: Georgios Mappouras, Charalambos Rossides
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:In recent years we observed rapid and significant advancements in artificial intelligence (A.I.). So much so that many wonder how close humanity is to developing an A.I. model that can achieve human level of intelligence, also known as artificial general intelligence (A.G.I.). In this work we look at this question and we attempt to define the upper bounds, not just of A.I., but rather of any machine-computable process (a.k.a. an algorithm). To answer this question however, one must first precisely define A.G.I. We borrow prior work’s definition of A.G.I. [1] that best describes the sentiment of the term, as used by the leading developers of A.I. That is, the ability to be creative and innovate in some field of study in a way that unlocks new and previously unknown functional capabilities in that field. Based on this definition we draw new bounds on the limits of computation. We formally prove that no algorithm can demonstrate new functional capabilities that were not already present in the initial algorithm itself. Therefore, no algorithm (and thus no A.I. model) can be truly creative in any field of study, whether that is science, engineering, art, sports, etc. In contrast, A.I. models can demonstrate existing functional capabilities, as well as combinations and permutations of existing functional capabilities. We conclude this work by discussing the implications of this proof both as it regards to the future of A.I. development, as well as to what it means for the origins of human intelligence.

[NLP-39] Fine-Tuning BERT for Domain-Specific Question Answering: Toward Educational NLP Resources at University Scale

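[Quick read]: This paper addresses a gap in scientific question answering: prior work has focused on chatbot-style systems, with little exploration of fine-tuning foundation models for domain-specific reasoning, and no question-answering model has been tailored to university course materials. The key to the solution is a custom dataset of 1,203 question-answer pairs in SQuAD format, on which BERT is fine-tuned with PyTorch to provide course information. The results show that even modest fine-tuning improves hypothesis framing and knowledge extraction, which demonstrates the feasibility of adapting foundation models to higher education and points toward a first domain-specific QA model for university course materials.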

Link: https://arxiv.org/abs/2512.05179
Authors: Aurélie Montfrond
Affiliations: University of Limerick
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 4 pages, 2 figures

Abstract:Prior work on scientific question answering has largely emphasized chatbot-style systems, with limited exploration of fine-tuning foundation models for domain-specific reasoning. In this study, we developed a chatbot for the University of Limerick’s Department of Electronic and Computer Engineering to provide course information to students. A custom dataset of 1,203 question-answer pairs in SQuAD format was constructed using the university book of modules, supplemented with manually and synthetically generated entries. We fine-tuned BERT (Devlin et al., 2019) using PyTorch and evaluated performance with Exact Match and F1 scores. Results show that even modest fine-tuning improves hypothesis framing and knowledge extraction, demonstrating the feasibility of adapting foundation models to educational domains. While domain-specific BERT variants such as BioBERT and SciBERT exist for biomedical and scientific literature, no foundation model has yet been tailored to university course materials. Our work addresses this gap by showing that fine-tuning BERT with academic QA pairs yields effective results, highlighting the potential to scale towards the first domain-specific QA model for universities and enabling autonomous educational knowledge systems.
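
Exact Match and F1, the two scores named above, are usually computed SQuAD-style on normalized answer strings. The sketch below follows that common convention (lowercasing, stripping punctuation and English articles, token-overlap F1); it is an illustration, not the paper's evaluation code, and the sample strings are made up.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction, reference):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Signal Processing module.", "signal processing module"))
print(f1_score("ten credits total", "ten credits"))
```

Exact Match rewards only fully correct spans, while F1 gives partial credit for overlapping tokens, which is why the two are normally reported together.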

[NLP-40] Semantic Faithfulness and Entropy Production Measures to Tame Your LLM Demons and Manage Hallucinations

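[Quick read]: This paper addresses the difficulty of evaluating the faithfulness of Large Language Models (LLMs) to a given task, that is, whether the output accurately reflects the input context. The core problem is the lack of effective unsupervised metrics for the semantic consistency between generated content and the given context, which makes hallucinations hard to detect. The key to the solution is to model the LLM as a bipartite information engine using principles from information theory and thermodynamics: Question-Context-Answer (QCA) triplets are treated as probability distributions over shared topics, the topic transformations from the context to the question and to the answer are modeled as transition matrices $\mathbf{Q}$ and $\mathbf{A}$, and the Kullback-Leibler (KL) divergence between them gives the Semantic Faithfulness (SF) metric. A thermodynamics-based Semantic Entropy Production (SEP) metric is also introduced, and high faithfulness is shown to generally imply low entropy production. The two unsupervised metrics can be used jointly or separately, and are demonstrated on LLM summarization of corporate SEC 10-K filings.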

Link: https://arxiv.org/abs/2512.05156
Authors: Igor Halperin
Affiliations: Fidelity Investments
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
Comments: 23 pages, 6 figures

Abstract:Evaluating faithfulness of Large Language Models (LLMs) to a given task is a complex challenge. We propose two new unsupervised metrics for faithfulness evaluation using insights from information theory and thermodynamics. Our approach treats an LLM as a bipartite information engine where hidden layers act as a Maxwell demon controlling transformations of context C into answer A via prompt Q. We model Question-Context-Answer (QCA) triplets as probability distributions over shared topics. Topic transformations from C to Q and A are modeled as transition matrices $\mathbf{Q}$ and $\mathbf{A}$ encoding the query goal and actual result, respectively. Our semantic faithfulness (SF) metric quantifies faithfulness for any given QCA triplet by the Kullback-Leibler (KL) divergence between these matrices. Both matrices are inferred simultaneously via convex optimization of this KL divergence, and the final SF metric is obtained by mapping the minimal divergence onto the unit interval [0,1], where higher scores indicate greater faithfulness. Furthermore, we propose a thermodynamics-based semantic entropy production (SEP) metric in answer generation, and show that high faithfulness generally implies low entropy production. The SF and SEP metrics can be used jointly or separately for LLM evaluation and hallucination control. We demonstrate our framework on LLM summarization of corporate SEC 10-K filings.
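
As a rough illustration of comparing two topic transition matrices with a KL divergence and mapping the result to [0, 1], consider the sketch below. Several pieces are assumptions rather than the paper's definitions: the matrices are taken as given (the paper infers them by convex optimization), the matrix divergence is a context-weighted average of row-wise KL terms, and the mapping 1 / (1 + KL) is just one convenient monotone choice.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi <= 0.0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total


def matrix_kl(answer_rows, question_rows, context_weights):
    """Context-weighted average of row-wise KL between two row-stochastic matrices."""
    total = 0.0
    for weight, a_row, q_row in zip(context_weights, answer_rows, question_rows):
        if weight > 0.0:  # skip unused context topics so 0 * inf cannot occur
            total += weight * kl_divergence(a_row, q_row)
    return total


def faithfulness_score(answer_rows, question_rows, context_weights):
    """Assumed mapping of a divergence in [0, inf] onto (0, 1]; 1.0 means identical."""
    return 1.0 / (1.0 + matrix_kl(answer_rows, question_rows, context_weights))


# Hypothetical two-topic example.
question_matrix = [[0.7, 0.3], [0.2, 0.8]]
answer_matrix = [[0.4, 0.6], [0.2, 0.8]]
print(faithfulness_score(answer_matrix, question_matrix, [0.5, 0.5]))
```

The score is 1.0 when the answer transforms context topics exactly as the question asks, and falls toward 0 as the two transformations diverge.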

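[NLP-41] RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering NEURIPS2025

[Quick read]: This paper addresses two challenges in generating high-quality interleaved image-text content for open-domain question answering: existing models struggle to produce coherent multimodal output in which text and images are consistent, and there is no adequate benchmark for assessing the quality of such interleaved sequences. The key to the solution is RAG-IGBench, a benchmark designed for Interleaved Generation based on Retrieval-Augmented Generation (RAG-IG). It is novel in two respects: it draws on the latest publicly available content from social platforms, so that models can access external image-text information; and it introduces evaluation metrics that measure text quality, image quality, and their consistency, which are shown to correlate highly with human assessments. Fine-tuning experiments on its training set further confirm the dataset's practical value.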

Link: https://arxiv.org/abs/2512.05119
Authors: Rongyang Zhang, Yuqing Huang, Chengqiang Lu, Qimeng Wang, Yan Gao, Yi Wu, Yao Hu, Yin Xu, Wei Wang, Hao Wang, Enhong Chen
Affiliations: University of Science and Technology of China; Xiaohongshu Inc.; Xi’an Jiaotong University
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 26 pages, 6 figures, NeurIPS 2025 DB Track poster

Abstract:In real-world scenarios, providing user queries with visually enhanced responses can considerably benefit understanding and memory, underscoring the great value of interleaved image-text generation. Despite recent progress, like the visual autoregressive model that unifies text and image processing in a single transformer architecture, generating high-quality interleaved content remains challenging. Moreover, evaluations of these interleaved sequences largely remain underexplored, with existing benchmarks often limited by unimodal metrics that inadequately assess the intricacies of combined image-text outputs. To address these issues, we present RAG-IGBench, a thorough benchmark designed specifically to evaluate the task of Interleaved Generation based on Retrieval-Augmented Generation (RAG-IG) in open-domain question answering. RAG-IG integrates multimodal large language models (MLLMs) with retrieval mechanisms, enabling the models to access external image-text information for generating coherent multimodal content. Distinct from previous datasets, RAG-IGBench draws on the latest publicly available content from social platforms and introduces innovative evaluation metrics that measure the quality of text and images, as well as their consistency. Through extensive experiments with state-of-the-art MLLMs (both open-source and proprietary) on RAG-IGBench, we provide an in-depth analysis examining the capabilities and limitations of these models. Additionally, we validate our evaluation metrics by demonstrating their high correlation with human assessments. Models fine-tuned on RAG-IGBench’s training set exhibit improved performance across multiple benchmarks, confirming both the quality and practical utility of our dataset. Our benchmark is available at this https URL.

[NLP-42] EnterpriseEM: Fine-tuned Embeddings for Enterprise Semantic Search

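[Quick read]: This paper addresses inefficient information retrieval in enterprises caused by the difficulty of managing proprietary unstructured data, and in particular the risk that pre-trained embedding models, whose training objectives do not match enterprise-specific data, retrieve poorly. The key to the solution is a complete methodology for contextualizing pre-trained embedding models to the enterprise setting, covering data preparation, model fine-tuning, and evaluation. Adapting the embeddings in this way improves the precision and relevance of search results and makes AI-driven information retrieval in enterprise information management more effective.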

Link: https://arxiv.org/abs/2406.00010
Authors: Kamalkumar Rathinasamy, Jayarama Nettar, Amit Kumar, Vishal Manchanda, Arun Vijayakumar, Ayush Kataria, Venkateshprasanna Manjunath, Chidambaram GS, Jaskirat Singh Sodhi, Shoeb Shaikh, Wasim Akhtar Khan, Prashant Singh, Tanishq Dattatray Ige, Vipin Tiwari, Rajab Ali Mondal, Harshini K, S Reka, Chetana Amancharla, Faiz ur Rahman, Harikrishnan P A, Indraneel Saha, Bhavya Tiwary, Navin Shankar Patel, Pradeep T S, Balaji A J, Priyapravas, Mohammed Rafee Tarafdar
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Enterprises grapple with the significant challenge of managing proprietary unstructured data, hindering efficient information retrieval. This has led to the emergence of AI-driven information retrieval solutions, designed to adeptly extract relevant insights to address employee inquiries. These solutions often leverage pre-trained embedding models and generative models as foundational components. While pre-trained embeddings may exhibit proximity or disparity based on their original training objectives, they might not fully align with the unique characteristics of enterprise-specific data, leading to suboptimal alignment with the retrieval goals of enterprise environments. In this paper, we propose a comprehensive methodology for contextualizing pre-trained embedding models to enterprise environments, covering the entire process from data preparation to model fine-tuning and evaluation. By adapting the embeddings to better suit the retrieval tasks prevalent in enterprises, we aim to enhance the performance of information retrieval solutions. We discuss the process of fine-tuning, its effect on retrieval accuracy, and the potential benefits for enterprise information management. Our findings demonstrate the efficacy of fine-tuned embedding models in improving the precision and relevance of search results in enterprise settings.

[NLP-43] Vague Knowledge: Information without Transitivity and Partitions

Link: https://arxiv.org/abs/2512.05833
Authors: Kerry Xiao
Affiliations: Unknown
Categories: Theoretical Economics (econ.TH); Computation and Language (cs.CL); Logic (math.LO); General Finance (q-fin.GN)
Comments:

[NLP-44] SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model

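[Quick read]: This paper addresses insufficient speech naturalness and poor audio-visual synchronization in video dubbing, where existing methods are also mostly limited to monolingual settings. The key to the solution is SyncVoice, a framework built on a pretrained text-to-speech (TTS) model that is fine-tuned on audio-visual data to achieve strong audio-visual consistency. A Dual Speaker Encoder mitigates inter-language interference in cross-lingual speech synthesis, which lets the approach extend to multilingual dubbing and video translation scenarios.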

Link: https://arxiv.org/abs/2512.05126
Authors: Kaidi Wang, Yi He, Wenhao Guan, Weijie Wu, Hongwu Ding, Xiong Zhang, Di Wu, Meng Meng, Jian Luan, Lin Li, Qingyang Hong
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
Comments:

Abstract:Video dubbing aims to generate high-fidelity speech that is precisely temporally aligned with the visual content. Existing methods still suffer from limitations in speech naturalness and audio-visual synchronization, and are limited to monolingual settings. To address these challenges, we propose SyncVoice, a vision-augmented video dubbing framework built upon a pretrained text-to-speech (TTS) model. By fine-tuning the TTS model on audio-visual data, we achieve strong audiovisual consistency. We propose a Dual Speaker Encoder to effectively mitigate inter-language interference in cross-lingual speech synthesis and explore the application of video dubbing in video translation scenarios. Experimental results show that SyncVoice achieves high-fidelity speech generation with strong synchronization performance, demonstrating its potential in video dubbing tasks.

Computer Vision

[CV-0] EditThinker: Unlocking Iterative Reasoning for Any Image Editor

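[Quick read]: This paper addresses weak instruction following in instruction-based image editing, where single-turn success rates remain limited by inherent stochasticity and a lack of deliberation. The key to the solution is a deliberative "Think-while-Edit" framework that simulates the human cognitive loop: it iteratively critiques the result, refines the instruction, and repeats the generation until the result is satisfactory. Specifically, the paper trains a single Multimodal Large Language Model (MLLM), EditThinker, as the reasoning engine that jointly outputs a critique score, a reasoning process, and a refined instruction, and uses reinforcement learning to align its thinking with its editing, producing more targeted instruction improvements and a higher overall editing success rate.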

Link: https://arxiv.org/abs/2512.05965
Authors: Hongyu Li, Manyuan Zhang, Dian Zheng, Ziyu Guo, Yimeng Jia, Kaituo Feng, Hao Yu, Yexin Liu, Yan Feng, Peng Pei, Xunliang Cai, Linjiang Huang, Hongsheng Li, Si Liu
Affiliations: Beihang University; Meituan; CUHK MMLab; CUHK IMIXR; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Instruction-based image editing has emerged as a prominent research area. Benefiting from image generation foundation models, it has achieved high aesthetic quality, making instruction-following capability the primary challenge. Existing approaches improve instruction adherence via supervised or reinforcement learning, yet single-turn success rates remain limited due to inherent stochasticity and a lack of deliberation. In this work, we propose a deliberative editing framework that lets models ‘think’ while they edit, which simulates the human cognitive loop by iteratively executing a Think-while-Edit cycle: Critiquing results and Refining instructions, followed by Repeating the generation until satisfactory. Specifically, we train a single MLLM, EditThinker, to act as the reasoning engine of this framework, which jointly produces the critique score, reasoning process, and refined instructions. We employ reinforcement learning to align the EditThinker’s thinking with its editing, thereby generating more targeted instruction improvements. Extensive experiments on four benchmarks demonstrate that our approach improves the instruction-following capability of any image editing model by a large margin. We will release our data construction framework, datasets, and models to benefit the community.
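
The critique, refine, repeat cycle described above can be sketched as a small control loop. This is only a schematic under stated assumptions: `edit` and `critique` are hypothetical callables standing in for an image editor and for the reasoning model, and the stopping rule (a score threshold plus a round limit) is an illustrative choice, not the paper's.

```python
def think_while_edit(image, instruction, edit, critique, max_rounds=4, accept_score=0.9):
    """Iteratively edit, critique, and refine the instruction until the score is acceptable.

    edit(image, instruction) -> result
    critique(image, result, original_instruction) -> (score, refined_instruction)
    """
    current_instruction = instruction
    best_score, best_result = float("-inf"), None
    for _ in range(max_rounds):
        result = edit(image, current_instruction)
        score, refined_instruction = critique(image, result, instruction)
        if score > best_score:
            best_score, best_result = score, result
        if score >= accept_score:
            break
        current_instruction = refined_instruction
    return best_result, best_score


# Toy stand-ins so the loop can be run end to end.
def toy_edit(image, instruction):
    return f"{image}|{instruction}"


def toy_critique(image, result, original_instruction):
    target = "red hat, keep background"
    return (1.0 if result.endswith(target) else 0.2), target


print(think_while_edit("img", "red hat", toy_edit, toy_critique))
```

Keeping the best-scoring result rather than the last one guards against a later round making the edit worse, which matters when the editor is stochastic.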

[CV-1] AQUA-Net: Adaptive Frequency Fusion and Illumination Aware Network for Underwater Image Enhancement

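[Quick read]: This paper addresses the severe color distortion, low contrast, and hazy appearance of underwater images, together with the high computational complexity that keeps existing deep learning models from real-time use. The key to the solution is a new architecture, the Adaptive Frequency Fusion and Illumination Aware Network (AQUA-Net), which combines a residual encoder-decoder with two auxiliary branches operating in the frequency and illumination domains. The frequency branch enriches spatial representations with cues from the Fourier domain and preserves fine textures, while the illumination-aware decoder, inspired by Retinex theory, performs adaptive exposure correction through a learned illumination map that separates reflectance from lighting effects. This joint spatial, frequency, and illumination design improves color balance, visual contrast, and perceptual realism with fewer parameters, and shows good generalization and robustness.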

Link: https://arxiv.org/abs/2512.05960
Authors: Munsif Ali, Najmul Hassan, Lucia Ventura, Davide Di Bari, Simonepietro Canese
Affiliations: Stazione Zoologica Anton Dohrn; University of Aizu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Underwater images often suffer from severe color distortion, low contrast, and a hazy appearance due to wavelength-dependent light absorption and scattering. Simultaneously, existing deep learning models exhibit high computational complexity, which limits their practical deployment for real-time underwater applications. To address these challenges, this paper presents a novel underwater image enhancement model, called Adaptive Frequency Fusion and Illumination Aware Network (AQUA-Net). It integrates a residual encoder-decoder with dual auxiliary branches, which operate in the frequency and illumination domains. The frequency fusion encoder enriches spatial representations with frequency cues from the Fourier domain and preserves fine textures and structural details. Inspired by Retinex, the illumination-aware decoder performs adaptive exposure correction through a learned illumination map that separates reflectance from lighting effects. This joint spatial, frequency, and illumination design enables the model to restore color balance, visual contrast, and perceptual realism under diverse underwater conditions. Additionally, we present a high-resolution, real-world underwater video-derived dataset from the Mediterranean Sea, which captures challenging deep-sea conditions with realistic visual degradations to enable robust evaluation and development of deep learning models. Extensive experiments on multiple benchmark datasets show that AQUA-Net performs on par with SOTA in both qualitative and quantitative evaluations while using fewer parameters. Ablation studies further confirm that the frequency and illumination branches provide complementary contributions that improve visibility and color representation. Overall, the proposed model shows strong generalization capability and robustness, and it provides an effective solution for real-world underwater imaging applications.

[CV-2] SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models

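[Quick read]: This paper addresses the lack of grounded physical-dynamics understanding in Vision-Language Models (VLMs) when they are applied to fine-grained robotic manipulation. Existing VLMs are trained mainly on static, internet-scale image-text data and cannot model causal interactions or action-conditioned changes, which limits them on tasks that require physical reasoning and action planning. The key to the solution is SIMPACT, a test-time, simulation-enabled action planning framework that gives a VLM physical reasoning ability through simulation-in-the-loop world modeling, with no additional training. Starting from a single RGB-D observation, the framework efficiently builds a physics simulation so that the VLM can propose actions, observe simulated rollouts, and iteratively refine its decisions, yielding a physically grounded understanding of contact dynamics and action outcomes.

Link: https://arxiv.org/abs/2512.05955
Authors: Haowen Liu, Shaoxiong Yao, Haonan Chen, Jiawei Gao, Jiayuan Mao, Jia-Bin Huang, Yilun Du
Affiliations: UMD (University of Maryland); UIUC (University of Illinois Urbana-Champaign); Harvard (Harvard University); Amazon FAR; UPenn (University of Pennsylvania)
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: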

Abstract:Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities. However, they lack a grounded understanding of physical dynamics. This limitation arises from training VLMs on static internet-scale visual-language data that contain no causal interactions or action-conditioned changes. Consequently, it remains challenging to leverage VLMs for fine-grained robotic manipulation tasks that require physical understanding, reasoning, and corresponding action planning. To overcome this, we present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework that equips VLMs with physical reasoning through simulation-in-the-loop world modeling, without requiring any additional training. From a single RGB-D observation, SIMPACT efficiently constructs physics simulations, enabling the VLM to propose informed actions, observe simulated rollouts, and iteratively refine its reasoning. By integrating language reasoning with physics prediction, our simulation-enabled VLM can understand contact dynamics and action outcomes in a physically grounded way. Our method demonstrates state-of-the-art performance on five challenging, real-world rigid-body and deformable manipulation tasks that require fine-grained physical reasoning, outperforming existing general-purpose robotic manipulation models. Our results demonstrate that embedding physics understanding via efficient simulation into VLM reasoning at test time offers a promising path towards generalizable embodied intelligence. Project webpage can be found at this https URL

[CV-3] Measuring the Effect of Background on Classification and Feature Importance in Deep Learning for AV Perception

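[Quick read]: This paper addresses the lack of quantitative validation for explainable AI (XAI) explanations of classification decisions in deep learning models, in particular how to tell whether a model truly attends to the target object (such as a traffic sign) or relies on spurious correlations in the background. The key to the solution is the systematic construction of six synthetic datasets that differ only in the degree of camera variation and the strength of background correlation. Isolating these variables makes it possible to quantify how background feature importance changes with the training domain, and thus to establish when and how much background features contribute to the classification task. This approach sidesteps the difficulty of controlling correlations in real-world data and provides a reproducible, controllable experimental framework for evaluating XAI methods.

Link: https://arxiv.org/abs/2512.05937
Authors: Anne Sielemann, Valentin Barner, Stefan Wolf, Masoud Roschani, Jens Ziehn, Juergen Beyerer
Affiliations: 1: University of Stuttgart; 2: Fraunhofer Institute for Industrial Mathematics
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 8 pages, 2 figures, 7 tables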

Abstract:Common approaches to explainable AI (XAI) for deep learning focus on analyzing the importance of input features for the classification task in a given model: saliency methods like SHAP and GradCAM are used to measure the impact of spatial regions of the input image on the classification result. Combined with ground truth information about the location of the object in the input image (e.g., a binary mask), it is determined whether object pixels had a high impact on the classification result, or whether the classification focused on background pixels. The former is considered to be a sign of a healthy classifier, whereas the latter is assumed to suggest overfitting on spurious correlations. A major challenge, however, is that these intuitive interpretations are difficult to test quantitatively, and hence the output of such explanations lacks an explanation itself. One particular reason is that correlations in real-world data are difficult to avoid, and whether they are spurious or legitimate is debatable. Synthetic data, in turn, make it possible to actively enable or disable correlations where desired, but often lack a sufficient quantification of realism and stochastic properties. […] Therefore, we systematically generate six synthetic datasets for the task of traffic sign recognition, which differ only in their degree of camera variation and background correlation […] to quantify the isolated influence of background correlation, different levels of camera variation, and considered traffic sign shapes on the classification performance, as well as background feature importance. […] Results include a quantification of when and how much background features gain importance to support the classification task based on changes in the training domain […].
Cite as: arXiv:2512.05937 [cs.CV] (or arXiv:2512.05937v1 [cs.CV] for this version)
DOI: https://doi.org/10.48550/arXiv.2512.05937 (arXiv-issued DOI via DataCite, pending registration)
Related DOI: https://doi.org/10.1109/IAVVC61942.2025.11219547

[CV-4] Synset Signset Germany: a Synthetic Dataset for German Traffic Sign Recognition

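[Quick read]: This paper addresses the scarcity of real data and the limited realism of synthetic data in traffic sign recognition (TSR). Existing approaches either depend on large amounts of annotated real data or produce synthetic data whose texture detail and physical lighting accuracy are limited, which falls short of what model training and robustness testing require. The key to the solution is a synthesis pipeline that combines the strengths of data-driven and analytical modeling: a generative adversarial network (GAN) produces textures with realistic dirt and wear, while analytical scene modulation provides physically correct lighting and parameterized control, improving the realism of the synthetic data and supporting explainable AI (XAI) and robustness testing. The resulting Synset Signset Germany dataset contains 105,500 images covering 211 German traffic sign classes, with extensive metadata to support downstream research.

Link: https://arxiv.org/abs/2512.05936
Authors: Anne Sielemann, Lena Loercher, Max-Lion Schumacher, Stefan Wolf, Masoud Roschani, Jens Ziehn
Affiliations: Fraunhofer IOSB; Fraunhofer IPA; Karlsruhe Institute of Technology (KIT)
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 8 figures, 3 tables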

Abstract:In this paper, we present a synthesis pipeline and dataset for training / testing data in the task of traffic sign recognition that combines the advantages of data-driven and analytical modeling: GAN-based texture generation enables data-driven dirt and wear artifacts, rendering unique and realistic traffic sign surfaces, while the analytical scene modulation achieves physically correct lighting and allows detailed parameterization. In particular, the latter opens up applications in the context of explainable AI (XAI) and robustness tests due to the possibility of evaluating the sensitivity to parameter changes, which we demonstrate with experiments. Our resulting synthetic traffic sign recognition dataset Synset Signset Germany contains a total of 105500 images of 211 different German traffic sign classes, including newly published (2020) and thus comparatively rare traffic signs. In addition to a mask and a segmentation image, we also provide extensive metadata including the stochastically selected environment and imaging effect parameters for each image. We evaluate the degree of realism of Synset Signset Germany on the real-world German Traffic Sign Recognition Benchmark (GTSRB) and in comparison to CATERED, a state-of-the-art synthetic traffic sign recognition dataset.

[CV-5] Physically-Based Simulation of Automotive LiDAR

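[Quick read]: This paper addresses the difficulty of accurately modeling blooming, echo pulse width, and ambient light interference when simulating automotive time-of-flight (ToF) LiDAR systems, a bottleneck for high-fidelity virtual testing and algorithm development. The key to the solution is an analytic model built on physically based rendering (PBR) in the near-infrared domain that simulates laser emission, target reflection (single-bounce reflection or retroreflection), and receiving-diode sensitivity, with flexible beam steering patterns and beams of non-vanishing diameter. Optical laboratory measurements (photometric luminance on different target surfaces, acquired with a goniometer at 0.01° resolution) are used to systematically determine key parameters, including the beam spread, the receiver sensitivity, the emitted light intensity, and the conversion between reflected light intensity and echo pulse width. This allows the model to be calibrated and tested on two automotive LiDAR systems, the Valeo Scala Gen. 2 and the Blickfeld Cube 1.

Link: https://arxiv.org/abs/2512.05932
Authors: L. Dudzik, M. Roschani, A. Sielemann, K. Trampert, J. Ziehn, J. Beyerer, C. Neumann
Affiliations: 2: Institute for Industrial Automation and Machine Tools (IAT), University of Stuttgart; 1: Institute for Manufacturing Technology (IMT), University of Stuttgart; 3: Institute for Industrial Automation and Machine Tools (IAT), University of Stuttgart
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: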

Abstract:We present an analytic model for simulating automotive time-of-flight (ToF) LiDAR that includes blooming, echo pulse width, and ambient light, along with steps to determine model parameters systematically through optical laboratory measurements. The model uses physically based rendering (PBR) in the near-infrared domain. It assumes single-bounce reflections and retroreflections over rasterized rendered images from shading or ray tracing, including light emitted from the sensor as well as stray light from other, non-correlated sources such as sunlight. Beams from the sensor and sensitivity of the receiving diodes are modeled with flexible beam steering patterns and with non-vanishing diameter. Different (all non-real time) computational approaches can be chosen based on system properties, computing capabilities, and desired output properties. Model parameters include system-specific properties, namely the physical spread of the LiDAR beam, combined with the sensitivity of the receiving diode; the intensity of the emitted light; the conversion between the intensity of reflected light and the echo pulse width; and scenario parameters such as environment lighting, positioning, and surface properties of the target(s) in the relevant infrared domain. System-specific properties of the model are determined from laboratory measurements of the photometric luminance on different target surfaces aligned with a goniometer at 0.01° resolution, which marks the best available resolution for measuring the beam pattern. The approach is calibrated for and tested on two automotive LiDAR systems, the Valeo Scala Gen. 2 and the Blickfeld Cube 1. Both systems differ notably in their properties and available interfaces, but the relevant model parameters could be extracted successfully. 
Cite as: arXiv:2512.05932 [cs.RO] (or arXiv:2512.05932v1 [cs.RO] for this version)
DOI: https://doi.org/10.48550/arXiv.2512.05932 (arXiv-issued DOI via DataCite, pending registration)
Related DOI: https://doi.org/10.1109/IAVVC61942.2025.11219516
Submission history: from Jens Ziehn, [v1] Fri, 5 Dec 2025 18:18:32 UTC (3,114 KB)

[CV-6] A Comparative Study on Synthetic Facial Data Generation Techniques for Face Recognition

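[Quick read]: This paper addresses the shortcomings of current face recognition in explainability, demographic bias, privacy protection, and robustness to aging, pose variation, lighting changes, occlusion, and facial expressions. The core of the solution is synthetic facial data generation: generative AI methods such as diffusion models, generative adversarial networks (GANs), and 3D modeling are used to build high-quality synthetic face datasets with controllable attributes, which mitigates privacy risks, reduces bias, supplements real data, and improves model performance. The study shows that existing synthetic data methods already capture realistic variations well, but a performance gap with real data remains and calls for further research.

Link: https://arxiv.org/abs/2512.05928
Authors: Pedro Vidal, Bernardo Biesseck, Luiz E. L. Coelho, Roger Granada, David Menotti
Affiliations: Federal University of Paraná; IFMT (Federal Institute of Mato Grosso); UFMG (Federal University of Minas Gerais); Unico IDTech; FUNPAR (Federal University of Paraná Foundation)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 17 figures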

Abstract:Facial recognition has become a widely used method for authentication and identification, with applications for secure access and locating missing persons. Its success is largely attributed to deep learning, which leverages large datasets and effective loss functions to learn discriminative features. Despite these advances, facial recognition still faces challenges in explainability, demographic bias, privacy, and robustness to aging, pose variations, lighting changes, occlusions, and facial expressions. Privacy regulations have also led to the degradation of several datasets, raising legal, ethical, and privacy concerns. Synthetic facial data generation has been proposed as a promising solution. It mitigates privacy issues, enables experimentation with controlled facial attributes, alleviates demographic bias, and provides supplementary data to improve models trained on real data. This study compares the effectiveness of synthetic facial datasets generated using different techniques in facial recognition tasks. We evaluate accuracy, rank-1, rank-5, and the true positive rate at a false positive rate of 0.01% on eight leading datasets, offering a comparative analysis not extensively explored in the literature. Results demonstrate the ability of synthetic data to capture realistic variations while emphasizing the need for further research to close the performance gap with real data. Techniques such as diffusion models, GANs, and 3D models show substantial progress; however, challenges remain.

[CV-7] World Models That Know When They Dont Know: Controllable Video Generation with Calibrated Uncertainty

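[Quick read]: This paper addresses hallucination in controllable video generation models, which produces videos that are inconsistent with physical reality. In tasks such as robot policy evaluation and planning in particular, existing models lack uncertainty quantification (UQ) and cannot flag unreliable predicted regions. The key to the solution is the C3 method, with three core innovations: (1) training the model with strictly proper scoring rules to optimize correctness and calibration jointly; (2) estimating uncertainty in latent space, which avoids the training instability and high computational cost of pixel-space approaches; (3) mapping the dense latent-space uncertainty to RGB space to produce high-resolution, interpretable uncertainty heatmaps that localize the uncertain regions in each video frame.

Link: https://arxiv.org/abs/2512.05927
Authors: Zhiting Mei, Tenny Yin, Micah Baker, Ola Shorinwa, Anirudha Majumdar
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: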

Abstract:Recent advances in generative video models have led to significant breakthroughs in high-fidelity video synthesis, specifically in controllable video generation where the generated video is conditioned on text and action inputs, e.g., in instruction-guided video editing and world modeling in robotics. Despite these exceptional capabilities, controllable video models often hallucinate - generating future video frames that are misaligned with physical reality - which raises serious concerns in many tasks such as robot policy evaluation and planning. However, state-of-the-art video models lack the ability to assess and express their confidence, impeding hallucination mitigation. To rigorously address this challenge, we propose C3, an uncertainty quantification (UQ) method for training continuous-scale calibrated controllable video models for dense confidence estimation at the subpatch level, precisely localizing the uncertainty in each generated video frame. Our UQ method introduces three core innovations to empower video models to estimate their uncertainty. First, our method develops a novel framework that trains video models for correctness and calibration via strictly proper scoring rules. Second, we estimate the video model’s uncertainty in latent space, avoiding training instability and prohibitive training costs associated with pixel-space approaches. Third, we map the dense latent-space uncertainty to interpretable pixel-level uncertainty in the RGB space for intuitive visualization, providing high-resolution uncertainty heatmaps that identify untrustworthy regions. Through extensive experiments on large-scale robot learning datasets (Bridge and DROID) and real-world evaluations, we demonstrate that our method not only provides calibrated uncertainty estimates within the training distribution, but also enables effective out-of-distribution detection.
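The abstract relies on strictly proper scoring rules but does not say which rule C3 uses. As a minimal illustration of the property only, the sketch below uses the Brier score (an assumption chosen for simplicity, not the paper's stated loss) and shows that the expected score is minimized when the reported confidence equals the true probability of being correct. All function names are hypothetical.

```python
def brier_score(confidence, correct):
    """Squared error between a reported confidence in [0, 1] and the outcome (1.0 or 0.0)."""
    return (confidence - correct) ** 2


def expected_brier(confidence, true_accuracy):
    """Expected Brier score when the prediction is right with probability true_accuracy."""
    return (true_accuracy * brier_score(confidence, 1.0)
            + (1.0 - true_accuracy) * brier_score(confidence, 0.0))


def best_confidence(true_accuracy, steps=100):
    """Grid-search the reported confidence that minimizes the expected score."""
    grid = [k / steps for k in range(steps + 1)]
    return min(grid, key=lambda q: expected_brier(q, true_accuracy))


if __name__ == "__main__":
    # A model that is right 70% of the time is best off reporting 0.7.
    print(best_confidence(0.7))
```

Because the minimizer coincides with the true accuracy, a model trained on such a rule has no incentive to over- or under-state its confidence, which is the calibration property the abstract refers to.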

[CV-8] LPD: Learnable Prototypes with Diversity Regularization for Weakly Supervised Histopathology Segmentation

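[Quick read]: This paper addresses three challenges in weakly supervised semantic segmentation (WSSS) of histopathology images: inter-class homogeneity, intra-class heterogeneity, and region shrinkage induced by class activation maps (CAMs). Conventional methods build a clustering prototype bank and refine segmentation masks in a separate stage, but such two-stage pipelines are computationally costly, sensitive to hyperparameters, and decouple prototype discovery from segmentation learning. The paper proposes a cluster-free, one-stage learnable-prototype framework with a diversity regularization mechanism that improves coverage of morphological intra-class heterogeneity. It achieves state-of-the-art mIoU and mDice on the BCSS-WSSS dataset and produces segmentations with sharper boundaries and fewer mislabels; activation heatmaps show that the learnable prototypes cover more diverse and complementary regions within each class than clustering-based prototypes.

Link: https://arxiv.org/abs/2512.05922
Authors: Khang Le, Anh Mai Vu, Thi Kim Trang Vo, Ha Thach, Ngoc Bui Lam Quang, Thanh-Huy Nguyen, Minh H. N. Le, Zhu Han, Chandra Mohan, Hien Van Nguyen
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Note: Khang Le and Anh Mai Vu contributed equally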

Abstract:Weakly supervised semantic segmentation (WSSS) in histopathology reduces pixel-level labeling by learning from image-level labels, but it is hindered by inter-class homogeneity, intra-class heterogeneity, and CAM-induced region shrinkage (global pooling-based class activation maps whose activations highlight only the most distinctive areas and miss nearby class regions). Recent works address these challenges by constructing a clustering prototype bank and then refining masks in a separate stage; however, such two-stage pipelines are costly, sensitive to hyperparameters, and decouple prototype discovery from segmentation learning, limiting their effectiveness and efficiency. We propose a cluster-free, one-stage learnable-prototype framework with diversity regularization to enhance morphological intra-class heterogeneity coverage. Our approach achieves state-of-the-art (SOTA) performance on BCSS-WSSS, outperforming prior methods in mIoU and mDice. Qualitative segmentation maps show sharper boundaries and fewer mislabels, and activation heatmaps further reveal that, compared with clustering-based prototypes, our learnable prototypes cover more diverse and complementary regions within each class, providing consistent qualitative evidence for their effectiveness.
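The abstract does not give the form of the diversity regularizer. One common way to encourage prototypes of the same class to spread out is to penalize their pairwise cosine similarity; the sketch below assumes that form purely for illustration, and `diversity_penalty` is a hypothetical name rather than the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def diversity_penalty(prototypes):
    """Mean squared cosine similarity over all distinct prototype pairs.

    The value is near 0 when prototypes are mutually orthogonal and near 1
    when they all point in the same direction, so adding it to the loss
    pushes the prototypes of one class toward different directions.
    """
    k = len(prototypes)
    if k < 2:
        return 0.0
    total, pairs = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            total += cosine(prototypes[i], prototypes[j]) ** 2
            pairs += 1
    return total / pairs


if __name__ == "__main__":
    print(diversity_penalty([[1.0, 0.0], [0.0, 1.0]]))  # orthogonal prototypes
    print(diversity_penalty([[1.0, 0.0], [2.0, 0.0]]))  # collapsed prototypes
```

In a training loop this term would be added to the segmentation loss with a weighting coefficient, which is another hyperparameter the abstract does not specify.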

[CV-9] NICE: Neural Implicit Craniofacial Model for Orthognathic Surgery Prediction

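[Quick read]: This paper addresses the limited accuracy of postoperative facial appearance prediction in orthognathic surgery. The core challenge is that existing biomechanical models, parametric models, and deep learning methods do not fully capture the complex nonlinear interactions between skeletal movements and facial soft tissue. The key to the solution is the Neural Implicit Craniofacial Model (NICE), which uses implicit neural representations to build two modules. The shape module uses region-specific implicit signed distance function (SDF) decoders to reconstruct the facial surface, maxilla, and mandible with high accuracy. The surgery module uses a shared surgical latent code to drive region-specific deformation decoders that output point-wise displacement fields, modeling the nonlinear biomechanical response of facial soft tissue to skeletal movement while incorporating anatomical prior knowledge. The method notably improves prediction accuracy in critical facial regions such as the lips and chin while preserving anatomical integrity, providing a clinically viable tool for preoperative planning and patient consultation in orthognathic surgery.

Link: https://arxiv.org/abs/2512.05920
Authors: Jiawen Yang, Yihui Cao, Xuanyu Tian, Yuyao Zhang, Hongjiang Wei
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: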

Abstract:Orthognathic surgery is a crucial intervention for correcting dentofacial skeletal deformities to enhance occlusal functionality and facial aesthetics. Accurate postoperative facial appearance prediction remains challenging due to the complex nonlinear interactions between skeletal movements and facial soft tissue. Existing biomechanical, parametric models and deep-learning approaches either lack computational efficiency or fail to fully capture these intricate interactions. To address these limitations, we propose Neural Implicit Craniofacial Model (NICE) which employs implicit neural representations for accurate anatomical reconstruction and surgical outcome prediction. NICE comprises a shape module, which employs region-specific implicit Signed Distance Function (SDF) decoders to reconstruct the facial surface, maxilla, and mandible, and a surgery module, which employs region-specific deformation decoders. These deformation decoders are driven by a shared surgical latent code to effectively model the complex, nonlinear biomechanical response of the facial surface to skeletal movements, incorporating anatomical prior knowledge. The deformation decoders output point-wise displacement fields, enabling precise modeling of surgical outcomes. Extensive experiments demonstrate that NICE outperforms current state-of-the-art methods, notably improving prediction accuracy in critical facial regions such as lips and chin, while robustly preserving anatomical integrity. This work provides a clinically viable tool for enhanced surgical planning and patient consultation in orthognathic procedures.

[CV-10] SCAIL: Towards Studio-Grade Character Animation via In-Context Learning of 3D-Consistent Pose Representations

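[Quick read]: This paper addresses the difficulty of reaching studio-grade production standards for character animation in complex scenarios, in particular the failure of existing methods to preserve structural fidelity and temporal consistency under cross-identity animation and complex motion. The solution rests on two innovations: a novel 3D pose representation that provides a more robust and flexible motion signal, and a full-context pose injection mechanism embedded in a diffusion-transformer architecture that enables effective spatio-temporal reasoning over full motion sequences.

Link: https://arxiv.org/abs/2512.05905
Authors: Wenhao Yan, Sheng Ye, Zhuoyi Yang, Jiayan Teng, ZhenHui Dong, Kairui Wen, Xiaotao Gu, Yong-Jin Liu, Jie Tang
Affiliations: Tsinghua University; Z.ai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: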

Abstract:Achieving character animation that meets studio-grade production standards remains challenging despite recent progress. Existing approaches can transfer motion from a driving video to a reference image, but often fail to preserve structural fidelity and temporal consistency in wild scenarios involving complex motion and cross-identity animations. In this work, we present SCAIL (Studio-grade Character Animation via In-context Learning), a framework designed to address these challenges through two key innovations. First, we propose a novel 3D pose representation, providing a more robust and flexible motion signal. Second, we introduce a full-context pose injection mechanism within a diffusion-transformer architecture, enabling effective spatio-temporal reasoning over full motion sequences. To align with studio-level requirements, we develop a curated data pipeline ensuring both diversity and quality, and establish a comprehensive benchmark for systematic evaluation. Experiments show that SCAIL achieves state-of-the-art performance and advances character animation toward studio-grade reliability and realism.

[CV-11] Underwater Image Reconstruction Using a Swin Transformer-Based Generator and PatchGAN Discriminator

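[Quick read]: This paper addresses image degradation caused by wavelength-dependent absorption and scattering in underwater imaging, such as color distortion, low contrast, and haze, which severely affects applications including marine exploration, environmental monitoring, and infrastructure inspection. Traditional reconstruction methods and convolutional neural network (CNN) approaches struggle with these challenges because of limited receptive fields and an inability to model global dependencies. The key to the solution is a deep learning framework that combines the Swin Transformer architecture with a generative adversarial network (GAN): the generator is a U-Net with Swin Transformer blocks that captures both local features and long-range dependencies for color correction across the whole image, and the discriminator is a PatchGAN that preserves high-frequency detail. On the EUVP dataset the method reaches a PSNR of 24.76 dB and an SSIM of 0.89, which the authors report as a significant improvement over existing methods, and an ablation study supports the advantage of the Swin Transformer design over convolutional alternatives.

Link: https://arxiv.org/abs/2512.05866
Authors: Md. Mahbub Hasan Akash, Aria Tasnim Mridula, Sheekar Banerjee, Ishtiak Al Mamoon
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted for presentation at the IEEE 28th International Conference on Computer and Information Technology (ICCIT), December 2025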

Abstract:Underwater imaging is essential for marine exploration, environmental monitoring, and infrastructure inspection. However, water causes severe image degradation through wavelength-dependent absorption and scattering, resulting in color distortion, low contrast, and haze effects. Traditional reconstruction methods and convolutional neural network-based approaches often fail to adequately address these challenges due to limited receptive fields and an inability to model global dependencies. This paper presents a novel deep learning framework that integrates a Swin Transformer architecture within a generative adversarial network (GAN) for underwater image reconstruction. Our generator employs a U-Net structure with Swin Transformer blocks to capture both local features and long-range dependencies crucial for color correction across entire images. A PatchGAN discriminator provides adversarial training to ensure high-frequency detail preservation. We train and evaluate our model on the EUVP dataset, which contains paired underwater images of varying quality. Quantitative results demonstrate state-of-the-art performance with a PSNR of 24.76 dB and an SSIM of 0.89, representing significant improvements over existing methods. Visual results show effective color balance restoration, contrast improvement, and haze reduction. An ablation study confirms the superiority of our Swin Transformer design over convolutional alternatives. The proposed method offers robust underwater image reconstruction suitable for various marine applications.
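For readers unfamiliar with the PSNR figure quoted above, the metric follows the standard definition PSNR = 10 · log10(MAX² / MSE). The sketch below computes it for two images given as flat lists of pixel values; the 8-bit peak value of 255 is an assumption, since the abstract does not state the dynamic range used in the evaluation.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("images must be non-empty and have the same size")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    clean = [100.0, 100.0, 100.0, 100.0]
    noisy = [110.0, 90.0, 110.0, 90.0]
    print(round(psnr(clean, noisy), 2))
```

Higher values mean the restored image is closer to the reference; a gain of about 3 dB corresponds to halving the mean squared error.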

[CV-12] Edit-aware RAW Reconstruction

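[Quick read]: This paper addresses the drop in reconstruction quality that existing methods show under diverse rendering styles and editing operations when recovering RAW images from a camera's display-referred output (for example, 8-bit sRGB JPEG). The core challenge is that conventional RAW reconstruction methods mainly optimize pixel-wise reconstruction fidelity and do not adapt to real photofinishing workflows. The key to the solution is a plug-and-play, edit-aware loss function that embeds a modular, differentiable image signal processor (ISP) simulating realistic photofinishing pipelines with tunable parameters. During training, the ISP parameters are randomly sampled from distributions that reflect variations in real camera processing, and the model is updated using the difference, computed in sRGB space, between the ground-truth and reconstructed RAW images after both are rendered through this ISP. This markedly improves the robustness of the reconstructed RAW under different editing conditions and its flexibility for subsequent processing.

Link: https://arxiv.org/abs/2512.05859
Authors: Abhijith Punnappurath, Luxi Zhao, Ke Zhao, Hue Nguyen, Radek Grzeszczuk, Michael S. Brown
Affiliations: Samsung Electronics
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: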

Abstract:Users frequently edit camera images post-capture to achieve their preferred photofinishing style. While editing in the RAW domain provides greater accuracy and flexibility, most edits are performed on the camera’s display-referred output (e.g., 8-bit sRGB JPEG) since RAW images are rarely stored. Existing RAW reconstruction methods can recover RAW data from sRGB images, but these approaches are typically optimized for pixel-wise RAW reconstruction fidelity and tend to degrade under diverse rendering styles and editing operations. We introduce a plug-and-play, edit-aware loss function that can be integrated into any existing RAW reconstruction framework to make the recovered RAWs more robust to different rendering styles and edits. Our loss formulation incorporates a modular, differentiable image signal processor (ISP) that simulates realistic photofinishing pipelines with tunable parameters. During training, parameters for each ISP module are randomly sampled from carefully designed distributions that model practical variations in real camera processing. The loss is then computed in sRGB space between ground-truth and reconstructed RAWs rendered through this differentiable ISP. Incorporating our loss improves sRGB reconstruction quality by up to 1.5-2 dB PSNR across various editing conditions. Moreover, when applied to metadata-assisted RAW reconstruction methods, our approach enables fine-tuning for target edits, yielding further gains. Since photographic editing is the primary motivation for RAW reconstruction in consumer imaging, our simple yet effective loss function provides a general mechanism for enhancing edit fidelity and rendering flexibility across existing methods.
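The loss described in the abstract can be sketched roughly as follows. This is a toy stand-in, not the authors' ISP: the three stages (white balance, exposure, gamma), their parameter ranges, and all function names are assumptions made for illustration, and the code uses plain Python rather than an autodiff framework, so it shows only how the loss value is formed.

```python
import random


def toy_isp(raw_pixels, params):
    """Render RAW RGB triplets in [0, 1] to a display-like space.

    Hypothetical three-stage pipeline: white balance gains, exposure gain
    with clipping, then gamma encoding.
    """
    rendered = []
    for r, g, b in raw_pixels:
        balanced = (r * params["wb_r"], g, b * params["wb_b"])
        rendered.append(tuple(
            min(1.0, max(0.0, c * params["exposure"])) ** (1.0 / params["gamma"])
            for c in balanced
        ))
    return rendered


def sample_isp_params(rng):
    """Draw one random rendering style (ranges are illustrative assumptions)."""
    return {
        "exposure": rng.uniform(0.5, 2.0),
        "wb_r": rng.uniform(0.8, 1.2),
        "wb_b": rng.uniform(0.8, 1.2),
        "gamma": rng.uniform(1.8, 2.6),
    }


def mse(img_a, img_b):
    """Mean squared error over all pixels and channels."""
    diffs = [(ca - cb) ** 2 for pa, pb in zip(img_a, img_b) for ca, cb in zip(pa, pb)]
    return sum(diffs) / len(diffs)


def edit_aware_loss(raw_true, raw_pred, rng, num_samples=4):
    """Average rendered-space error over several randomly sampled ISP settings."""
    total = 0.0
    for _ in range(num_samples):
        params = sample_isp_params(rng)
        total += mse(toy_isp(raw_true, params), toy_isp(raw_pred, params))
    return total / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    truth = [(0.2, 0.3, 0.4), (0.05, 0.1, 0.02)]
    guess = [(0.25, 0.3, 0.4), (0.05, 0.1, 0.02)]
    print(edit_aware_loss(truth, guess, rng))
```

In an actual training setup the same operations would be written with differentiable tensor operations so that gradients flow from the rendered-space error back to the RAW reconstruction network.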

[CV-13] VRSA: Jailbreaking Multimodal Large Language Models through Visual Reasoning Sequential Attack

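[Quick read]: This paper examines safety risks of multimodal large language models (MLLMs) in visual reasoning tasks, in particular the risk that the visual modality is exploited by jailbreak attacks to elicit harmful content. Existing work focuses on reasoning safety in the text modality and overlooks the possibility of steering a model toward a harmful intent step by step through an image sequence. The key to the approach is the Visual Reasoning Sequential Attack (VRSA), whose core mechanisms are: (1) decomposing the original harmful text into several semantically related sub-images that reveal information gradually and lead the model to aggregate the complete harmful intent; (2) Adaptive Scene Refinement, which improves the plausibility of the scene in the image sequence; (3) Semantic Coherent Completion, which keeps each sub-image semantically coherent with its context; (4) Text-Image Consistency Alignment, which keeps the generated images semantically consistent with the target text. Experiments show that VRSA outperforms current state-of-the-art jailbreak attacks on both open-source and closed-source MLLMs such as GPT-4o and Claude-4.5-Sonnet.

Link: https://arxiv.org/abs/2512.05853
Authors: Shiji Zhao, Shukun Xiong, Yao Huang, Yan Jin, Zhenyu Wu, Jiyang Guan, Ranjie Duan, Jialing Tao, Hui Xue, Xingxing Wei
Affiliations: Beihang University; Chinese Academy of Sciences; Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: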

Abstract:Multimodal Large Language Models (MLLMs) are widely used in various fields due to their powerful cross-modal comprehension and generation capabilities. However, additional modalities introduce more vulnerabilities that can be exploited by jailbreak attacks, which induce MLLMs to output harmful content. Because MLLMs have strong reasoning ability, previous jailbreak attacks have explored reasoning safety risks in the text modality, while similar threats have been largely overlooked in the visual modality. To fully evaluate potential safety risks in the visual reasoning task, we propose Visual Reasoning Sequential Attack (VRSA), which induces MLLMs to gradually externalize and aggregate complete harmful intent by decomposing the original harmful text into several sequentially related sub-images. In particular, to enhance the rationality of the scene in the image sequence, we propose Adaptive Scene Refinement to optimize the scene most relevant to the original harmful query. To ensure the semantic continuity of the generated image, we propose Semantic Coherent Completion to iteratively rewrite each sub-text combined with contextual information in this scene. In addition, we propose Text-Image Consistency Alignment to maintain semantic consistency. A series of experiments demonstrates that VRSA can achieve a higher attack success rate than state-of-the-art jailbreak attack methods on both open-source and closed-source MLLMs such as GPT-4o and Claude-4.5-Sonnet.

[CV-14] Phase-OTDR Event Detection Using Image-Based Data Transformation and Deep Learning

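[Quick read]: This paper addresses classification accuracy and analysis efficiency for event detection in fiber-optic sensing, in particular the challenge of recognizing complex signals from Phase-OTDR (phase-sensitive optical time-domain reflectometry) systems. The key to the solution is to convert raw one-dimensional (1D) Phase-OTDR data into grayscale images using Gramian Angular Difference Field (GADF), Gramian Angular Summation Field (GASF), and Recurrence Plot techniques, and then to combine these into a multi-channel RGB representation so that transfer learning models (such as EfficientNetB0 and DenseNet121) can classify the events with high accuracy. The method improves classification accuracy markedly (up to 99.07%) while reducing reliance on large-scale annotated data, offering a new paradigm for efficient and reliable analysis of fiber-optic sensing data.

Link: https://arxiv.org/abs/2512.05830
Authors: Muhammet Cagri Yeke, Samil Sirin, Kivilcim Yuksel, Abdurrahman Gumus
Affiliations: Izmir Institute of Technology; Isparta University of Applied Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 22 pages, 11 figures, 5 tables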

Abstract:This study focuses on event detection in optical fibers, specifically classifying six events using the Phase-OTDR system. A novel approach is introduced to enhance Phase-OTDR data analysis by transforming 1D data into grayscale images through techniques such as Gramian Angular Difference Field, Gramian Angular Summation Field, and Recurrence Plot. These grayscale images are combined into a multi-channel RGB representation, enabling more robust and adaptable analysis using transfer learning models. The proposed methodology achieves high classification accuracies of 98.84% and 98.24% with the EfficientNetB0 and DenseNet121 models, respectively. A 5-fold cross-validation process confirms the reliability of these models, with test accuracy rates of 99.07% and 98.68%. Using a publicly available Phase-OTDR dataset, the study demonstrates an efficient approach to understanding optical fiber events while reducing dataset size and improving analysis efficiency. The results highlight the transformative potential of image-based analysis in interpreting complex fiber optic sensing data, offering significant advancements in the accuracy and reliability of fiber optic monitoring systems. The codes and the corresponding image-based dataset are made publicly available on GitHub to support further research: this https URL.
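The 1D-to-image transformation described in the abstract can be sketched as below, using the standard definitions GASF(i, j) = cos(φ_i + φ_j) and GADF(i, j) = sin(φ_i − φ_j) with φ = arccos of the series rescaled to [-1, 1], plus a thresholded recurrence plot. The channel order, the recurrence threshold `eps`, and the function names are assumptions for illustration; the authors' released code may differ.

```python
import math


def rescale(series):
    """Min-max rescale a 1D series to [-1, 1] so that arccos is defined."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [2.0 * (x - lo) / (hi - lo) - 1.0 for x in series]


def gramian_angular_fields(series):
    """Return (GASF, GADF) matrices as nested lists."""
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in rescale(series)]
    n = len(phi)
    gasf = [[math.cos(phi[i] + phi[j]) for j in range(n)] for i in range(n)]
    gadf = [[math.sin(phi[i] - phi[j]) for j in range(n)] for i in range(n)]
    return gasf, gadf


def recurrence_plot(series, eps=0.1):
    """Binary matrix marking pairs of time steps whose rescaled values are within eps."""
    x = rescale(series)
    n = len(x)
    return [[1.0 if abs(x[i] - x[j]) <= eps else 0.0 for j in range(n)] for i in range(n)]


def to_three_channel(series, eps=0.1):
    """Stack GASF, GADF, and the recurrence plot into an n x n image of 3-tuples in [0, 1]."""
    gasf, gadf = gramian_angular_fields(series)
    rp = recurrence_plot(series, eps)
    n = len(series)
    return [
        [((gasf[i][j] + 1.0) / 2.0, (gadf[i][j] + 1.0) / 2.0, rp[i][j]) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    image = to_three_channel([0.0, 0.5, 1.0])
    print(len(image), len(image[0]), len(image[0][0]))
```

The resulting three-channel array has the same layout as an RGB image, which is what allows ImageNet-pretrained backbones such as those named in the abstract to be reused.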

[CV-15] Multimodal Oncology Agent for IDH1 Mutation Prediction in Low-Grade Glioma

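[Quick read]: This paper addresses accurate prediction of IDH1 mutation in low-grade glioma (LGG), a mutation that matters for clinical subtyping, prognosis, and treatment strategy. The key to the solution is a Multimodal Oncology Agent (MOA) that combines histology analysis based on the TITAN foundation model with reasoning over structured clinical and genomic data, drawing on external biomedical knowledge sources such as PubMed, Google Search, and OncoKB so that the modalities complement each other. On the TCGA-LGG cohort, MOA reaches an F1-score of 0.912, above the single-modality baselines, which supports the value of multimodal reasoning for IDH1 mutation prediction.

Link: https://arxiv.org/abs/2512.05824
Authors: Hafsa Akebli (1), Adam Shephard (2), Vincenzo Della Mea (1), Nasir Rajpoot (2 and 3)
Affiliations: (1) University of Udine, Udine, Italy; (2) University of Warwick, Coventry, UK; (3) Histofy Ltd, Coventry, UK
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 4 pages, 2 figures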

Abstract:Low-grade gliomas frequently present IDH1 mutations that define clinically distinct subgroups with specific prognostic and therapeutic implications. This work introduces a Multimodal Oncology Agent (MOA) integrating a histology tool based on the TITAN foundation model for IDH1 mutation prediction in low-grade glioma, combined with reasoning over structured clinical and genomic inputs through PubMed, Google Search, and OncoKB. MOA reports were quantitatively evaluated on 488 patients from the TCGA-LGG cohort against clinical and histology baselines. MOA without the histology tool outperformed the clinical baseline, achieving an F1-score of 0.826 compared to 0.798. When fused with histology features, MOA reached the highest performance with an F1-score of 0.912, exceeding both the histology baseline at 0.894 and the fused histology-clinical baseline at 0.897. These results demonstrate that the proposed agent captures complementary mutation-relevant information enriched through external biomedical sources, enabling accurate IDH1 mutation prediction.

[CV-16] UG-FedDA: Uncertainty-Guided Federated Domain Adaptation for Multi-Center Alzheimer's Disease Detection

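[Quick read]: This paper addresses two weaknesses of multi-center Alzheimer's disease (AD) classification frameworks: cross-site heterogeneity in structural magnetic resonance imaging (MRI) and the absence of uncertainty quantification (UQ), both of which limit robustness and clinical applicability. The key to the solution is Uncertainty-Guided Federated Domain Adaptation (UG-FedDA), which combines UQ with federated domain adaptation to handle multi-center distribution differences while preserving privacy. Its core idea is to let UQ guide feature alignment, down-weighting uncertain samples to mitigate the shift between source and target domains, while a self-attention Transformer extracts multi-template region-of-interest (RoI) features to improve cross-domain classification.

Link: https://arxiv.org/abs/2512.05814
Authors: Fubao Zhu, Zhanyuan Jia, Zhiguo Wang, Huan Huang, Danyang Sun, Chuang Han, Yanting Li, Jiaofen Nan, Chen Zhao, Weihua Zhou
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The code is already available on GitHub: this https URL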

Abstract:Alzheimer’s disease (AD) is an irreversible neurodegenerative disorder, and early diagnosis is critical for timely intervention. However, most existing classification frameworks face challenges in multicenter studies, as they often neglect inter-site heterogeneity and lack mechanisms to quantify uncertainty, which limits their robustness and clinical applicability. To address these issues, we propose Uncertainty-Guided Federated Domain Adaptation (UG-FedDA), a novel multicenter AD classification framework that integrates uncertainty quantification (UQ) with federated domain adaptation to handle cross-site structural magnetic resonance imaging (MRI) heterogeneity under privacy constraints. Our approach extracts multi-template region-of-interest (RoI) features using a self-attention transformer, capturing both regional representations and their interactions. UQ is integrated to guide feature alignment, mitigating source-target distribution shifts by down-weighting uncertain samples. Experiments are conducted on three public datasets: the Alzheimer’s Disease Neuroimaging Initiative (ADNI), the Australian Imaging, Biomarkers and Lifestyle study (AIBL), and the Open Access Series of Imaging Studies (OASIS). UG-FedDA achieved consistent cross-domain improvements in accuracy, sensitivity, and area under the ROC curve across three classification tasks: AD vs. normal controls (NC), mild cognitive impairment (MCI) vs. AD, and NC vs. MCI. For NC vs. AD, UG-FedDA achieves accuracies of 90.54%, 89.04%, and 77.78% on ADNI, AIBL and OASIS datasets, respectively. For MCI vs. AD, accuracies are 80.20% (ADNI), 71.91% (AIBL), and 79.73% (OASIS). For NC vs. MCI, results are 76.87% (ADNI), 73.91% (AIBL), and 83.73% (OASIS). These results demonstrate that the proposed framework not only adapts efficiently across multiple sites but also preserves strict privacy.
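
The abstract does not spell out how uncertain samples are down-weighted, so the sketch below is only one plausible reading: each sample gets a weight of one minus its normalized predictive entropy, and the alignment loss is the squared distance between the weighted source and target feature means. The function names and the choice of entropy as the uncertainty measure are assumptions, not the UG-FedDA implementation.

```python
import math

def entropy_weight(probs):
    """Weight in [0, 1]: 1 for a fully confident prediction, 0 for a uniform one."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return max(0.0, 1.0 - h / math.log(len(probs)))

def weighted_mean(features, weights):
    """Per-dimension mean of feature vectors under the given sample weights."""
    total = sum(weights) + 1e-12  # guard against all-zero weights
    dim = len(features[0])
    return [sum(w * f[d] for f, w in zip(features, weights)) / total for d in range(dim)]

def weighted_alignment_loss(src_feats, src_probs, tgt_feats, tgt_probs):
    """Squared distance between uncertainty-weighted source and target feature means."""
    mu_s = weighted_mean(src_feats, [entropy_weight(p) for p in src_probs])
    mu_t = weighted_mean(tgt_feats, [entropy_weight(p) for p in tgt_probs])
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))

# Toy data: the second target sample is an outlier with a maximally uncertain prediction.
src_feats = [[0.0, 0.0], [1.0, 1.0]]
src_probs = [[0.9, 0.1], [0.95, 0.05]]
tgt_feats = [[0.5, 0.5], [5.0, 5.0]]
tgt_probs = [[0.9, 0.1], [0.5, 0.5]]
loss = weighted_alignment_loss(src_feats, src_probs, tgt_feats, tgt_probs)
```

Because the uncertain outlier receives a weight of zero, it does not pull the target mean, and the alignment loss stays small.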

[CV-17] Toward Efficient and Robust Behavior Models for Multi-Agent Driving Simulation

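[Quick read]: This paper addresses the difficulty of making behavior models for large-scale multi-agent driving simulation both realistic and computationally efficient. The core of the solution is an instance-centric scene representation that places each traffic participant and map element in its own local coordinate frame, giving an efficient, viewpoint-invariant scene encoding and allowing static map tokens to be reused across simulation steps. Interactions are modeled with a query-centric symmetric context encoder that uses relative positional encodings between local frames. The behavior model is trained with Adversarial Inverse Reinforcement Learning, together with an adaptive reward transformation that automatically balances robustness and realism during training. Experiments show that the approach scales well with the number of tokens, markedly reduces training and inference time, and outperforms several agent-centric baselines in positional accuracy and robustness.

Link: https://arxiv.org/abs/2512.05812
Authors: Fabian Konstantinidis, Moritz Sackmann, Ulrich Hofmann, Christoph Stiller
Affiliations: CARIAD SE (Germany); Karlsruhe Institute of Technology
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to the IEEE for possible publication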

Abstract:Scalable multi-agent driving simulation requires behavior models that are both realistic and computationally efficient. We address this by optimizing the behavior model that controls individual traffic participants. To improve efficiency, we adopt an instance-centric scene representation, where each traffic participant and map element is modeled in its own local coordinate frame. This design enables efficient, viewpoint-invariant scene encoding and allows static map tokens to be reused across simulation steps. To model interactions, we employ a query-centric symmetric context encoder with relative positional encodings between local frames. We use Adversarial Inverse Reinforcement Learning to learn the behavior model and propose an adaptive reward transformation that automatically balances robustness and realism during training. Experiments demonstrate that our approach scales efficiently with the number of tokens, significantly reducing training and inference times, while outperforming several agent-centric baselines in terms of positional accuracy and robustness.
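
A relative positional encoding between local frames starts from the pose of one frame expressed in another. The sketch below computes that quantity for planar poses `(x, y, heading)` and checks that it does not change when the whole scene is moved, which is the viewpoint invariance the abstract refers to. The pose layout and function names are illustrative assumptions; the paper's actual encoding is not given in the abstract.

```python
import math

def relative_pose(pose_i, pose_j):
    """Pose of frame j expressed in frame i, for planar poses (x, y, heading)."""
    xi, yi, ti = pose_i
    xj, yj, tj = pose_j
    dx, dy = xj - xi, yj - yi
    c, s = math.cos(ti), math.sin(ti)
    # Rotate the world-frame offset into frame i (multiplication by R(ti)^T).
    rx = c * dx + s * dy
    ry = -s * dx + c * dy
    # Wrap the heading difference to (-pi, pi].
    dtheta = math.atan2(math.sin(tj - ti), math.cos(tj - ti))
    return rx, ry, dtheta

def move_scene(pose, tx, ty, rot):
    """Apply one global rigid transform (rotate by rot, then translate) to a pose."""
    x, y, t = pose
    c, s = math.cos(rot), math.sin(rot)
    return (c * x - s * y + tx, s * x + c * y + ty, t + rot)

agent = (1.0, 2.0, 0.3)
lane_segment = (4.0, -1.0, 2.0)
before = relative_pose(agent, lane_segment)
after = relative_pose(move_scene(agent, 10.0, -5.0, 1.1),
                      move_scene(lane_segment, 10.0, -5.0, 1.1))
# `before` and `after` agree up to floating-point error.
```

Since the relative pose depends only on the pair of frames, encodings of static map elements relative to each other never need to be recomputed as agents move.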

[CV-18] Probing the effectiveness of World Models for Spatial Reasoning through Test-time Scaling

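[Quick read]: This paper addresses the performance bottleneck of Vision-Language Models (VLMs) on spatial reasoning tasks that require multi-view understanding and embodied perspective shifts. Existing methods such as MindJourney use test-time scaling: a world model generates action-conditioned trajectories and a heuristic verifier selects helpful views. The authors find that this verifier is poorly calibrated and that random scoring works about as well, which exposes systematic action biases and unreliable reward signals. They propose Verification through Spatial Assertions (ViSA), whose core idea is to ground the test-time reward in verifiable, frame-anchored micro-claims, yielding a more reliable verification signal and more balanced exploration. ViSA consistently improves spatial reasoning on the SAT-Real benchmark, but on the more challenging MMSI-Bench it remains limited by the information bottleneck of current world models, suggesting that imagined views alone cannot support fine-grained reasoning.

Link: https://arxiv.org/abs/2512.05809
Authors: Saurav Jha, M. Jehanzeb Mirza, Wei Lin, Shiqi Yang, Sarath Chandar
Affiliations: MILA – Quebec AI Institute; Polytechnique Montréal; MIT CSAIL; Institute for Machine Learning, Johannes Kepler University Linz; Nankai University; Canada CIFAR AI Chair
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Extended abstract at World Modeling Workshop 2026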

Abstract:Vision-Language Models (VLMs) remain limited in spatial reasoning tasks that require multi-view understanding and embodied perspective shifts. Recent approaches such as MindJourney attempt to mitigate this gap through test-time scaling where a world model imagines action-conditioned trajectories and a heuristic verifier selects helpful views from such trajectories. In this work, we systematically examine how such test-time verifiers behave across benchmarks, uncovering both their promise and their pitfalls. Our uncertainty-based analyses show that MindJourney’s verifier provides little meaningful calibration, and that random scoring often reduces answer entropy equally well, thus exposing systematic action biases and unreliable reward signals. To mitigate these, we introduce a Verification through Spatial Assertions (ViSA) framework that grounds the test-time reward in verifiable, frame-anchored micro-claims. This principled verifier consistently improves spatial reasoning on the SAT-Real benchmark and corrects trajectory-selection biases through more balanced exploratory behavior. However, on the challenging MMSI-Bench, none of the verifiers, including ours, achieve consistent scaling, suggesting that the current world models form an information bottleneck where imagined views fail to enrich fine-grained reasoning. Together, these findings chart the bad, good, and ugly aspects of test-time verification for world-model-based reasoning. Our code is available at this https URL.

[CV-19] Bring Your Dreams to Life: Continual Text-to-Video Customization AAAI2026

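[Quick read]: This paper addresses catastrophic forgetting and concept neglect when Customized Text-to-Video Generation (CTVG) models continually learn new concepts. Existing methods usually assume that personalized concepts are static, so they struggle to retain old concepts while integrating new ones during incremental learning. The key to the solution is a Continual Customized Video Diffusion (CCVD) model with two parts: 1) a concept-specific attribute retention module and a task-aware concept aggregation strategy, which preserve the distinctive characteristics of old concepts during training and combine their subject and motion adapters by relevance at test time; 2) a controllable conditional synthesis mechanism that uses layer-specific region attention-guided noise estimation to enhance regional features and align video contexts with user conditions, mitigating concept neglect. Experiments show that CCVD outperforms existing CTVG models in the continual learning setting.

Link: https://arxiv.org/abs/2512.05802
Authors: Jiahua Dong, Xudong Wang, Wenqi Liang, Zongyan Han, Meng Cao, Duzhen Zhang, Hanbin Zhao, Zhi Han, Salman Khan, Fahad Shahbaz Khan
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI2026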

Abstract:Customized text-to-video generation (CTVG) has recently witnessed great progress in generating tailored videos from user-specific text. However, most CTVG methods assume that personalized concepts remain static and do not expand incrementally over time. Additionally, they struggle with forgetting and concept neglect when continuously learning new concepts, including subjects and motions. To resolve the above challenges, we develop a novel Continual Customized Video Diffusion (CCVD) model, which can continuously learn new concepts to generate videos across various text-to-video generation tasks by tackling forgetting and concept neglect. To address catastrophic forgetting, we introduce a concept-specific attribute retention module and a task-aware concept aggregation strategy. They can capture the unique characteristics and identities of old concepts during training, while combining all subject and motion adapters of old concepts based on their relevance during testing. Besides, to tackle concept neglect, we develop a controllable conditional synthesis to enhance regional features and align video contexts with user conditions, by incorporating layer-specific region attention-guided noise estimation. Extensive experimental comparisons demonstrate that our CCVD outperforms existing CTVG models. The code is available at this https URL.

[CV-20] Curvature-Regularized Variational Autoencoder for 3D Scene Reconstruction from Sparse Depth

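[Quick read]: This paper addresses the reconstruction of complete 3D scenes when depth sensors provide only 5% of the needed measurements; such sparse data introduces geometric errors that autonomous vehicles and robots cannot tolerate. The key to the solution is curvature regularization based on a discrete Laplacian operator: a single, carefully designed regularization term that improves reconstruction accuracy by 18.1% over standard variational autoencoders. The work challenges the implicit assumption in geometric deep learning that combining multiple geometric constraints improves performance, showing that one effective term gives stable gradients and noise suppression at only 15% training overhead and zero inference cost.

Link: https://arxiv.org/abs/2512.05783
Authors: Maryam Yousefi, Soodeh Bakhshandeh
Affiliations: Islamic Azad University of Tehran, East Branch
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: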

Abstract:When depth sensors provide only 5% of needed measurements, reconstructing complete 3D scenes becomes difficult. Autonomous vehicles and robots cannot tolerate the geometric errors that sparse reconstruction introduces. We propose curvature regularization through a discrete Laplacian operator, achieving 18.1% better reconstruction accuracy than standard variational autoencoders. Our contribution challenges an implicit assumption in geometric deep learning: that combining multiple geometric constraints improves performance. A single well-designed regularization term not only matches but exceeds the effectiveness of complex multi-term formulations. The discrete Laplacian offers stable gradients and noise suppression with just 15% training overhead and zero inference cost. Code and models are available at this https URL.
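
As a rough illustration of curvature regularization through a discrete Laplacian, the sketch below applies the standard 5-point stencil to a depth grid and averages the squared response. The mean-squared form and the function name are assumptions; the abstract does not state the exact loss used in the paper.

```python
def laplacian_penalty(depth):
    """Mean squared 5-point discrete Laplacian over the interior of a depth grid."""
    h, w = len(depth), len(depth[0])
    total, count = 0.0, 0
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            lap = (depth[i - 1][j] + depth[i + 1][j]
                   + depth[i][j - 1] + depth[i][j + 1]
                   - 4.0 * depth[i][j])
            total += lap * lap
            count += 1
    return total / count if count else 0.0

# A tilted plane has zero Laplacian, so it is not penalized.
plane = [[0.5 * i + 0.25 * j + 1.0 for j in range(6)] for i in range(6)]
# A single-pixel spike (e.g. a noisy depth sample) is penalized.
spike = [row[:] for row in plane]
spike[3][3] += 1.0

flat_cost = laplacian_penalty(plane)   # 0.0
spike_cost = laplacian_penalty(spike)  # 1.25
```

In training, a term like this would be added to the usual VAE objective with a weighting coefficient; it only touches the loss, which is consistent with the zero inference cost reported in the abstract.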

[CV-21] FNOPT: Resolution-Agnostic Self-Supervised Cloth Simulation using Meta-Optimization with Fourier Neural Operators WACV

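[Quick read]: This paper addresses common weaknesses of neural cloth simulators: reliance on large amounts of ground-truth data, loss of fine-scale detail such as wrinkles, and poor generalization across mesh resolutions and motion patterns. The key to the solution is FNOpt, which formulates time integration as an optimization problem and parameterizes a resolution-agnostic neural optimizer with a Fourier Neural Operator (FNO). Trained only on a coarse grid with physics-based losses, the method produces stable and accurate simulations on finer grids, preserving physical plausibility and rollout stability across resolutions without retraining.

Link: https://arxiv.org/abs/2512.05762
Authors: Ruochen Chen, Thuy Tran, Shaifali Parashar
Affiliations: CNRS; École Centrale de Lyon; INSA Lyon; Université Claude Bernard Lyon 1; LIRIS, UMR5205
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted for WACV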

Abstract:We present FNOpt, a self-supervised cloth simulation framework that formulates time integration as an optimization problem and trains a resolution-agnostic neural optimizer parameterized by a Fourier neural operator (FNO). Prior neural simulators often rely on extensive ground truth data or sacrifice fine-scale detail, and generalize poorly across resolutions and motion patterns. In contrast, FNOpt learns to simulate physically plausible cloth dynamics and achieves stable and accurate rollouts across diverse mesh resolutions and motion patterns without retraining. Trained only on a coarse grid with physics-based losses, FNOpt generalizes to finer resolutions, capturing fine-scale wrinkles and preserving rollout stability. Extensive evaluations on a benchmark cloth simulation dataset demonstrate that FNOpt outperforms prior learning-based approaches in out-of-distribution settings in both accuracy and robustness. These results position FNO-based meta-optimization as a compelling alternative to previous neural simulators for cloth, thus reducing the need for curated data and improving cross-resolution reliability.

[CV-22] Label-Efficient Point Cloud Segmentation with Active Learning

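[Quick read]: This paper addresses the high annotation cost of semantic segmentation for 3D point clouds by proposing a more efficient active learning strategy that reduces the amount of labeled data required. The key to the solution is twofold: first, a 2D grid divides the point cloud into annotatable regions (columns), giving a simple spatial partition; second, a network ensemble estimates the uncertainty of the model output, and the most informative regions are selected for annotation. Experiments on S3DIS, Toronto-3D, and a large-scale urban point cloud of the city of Freiburg show performance on par with or better than more complex existing methods, and suggest that annotated area is a more meaningful measure than the number of annotated points.

Link: https://arxiv.org/abs/2512.05759
Authors: Johannes Meyer, Jasper Hoffmann, Felix Schulz, Dominik Merkle, Daniel Buescher, Alexander Reiterer, Joschka Boedecker, Wolfram Burgard
Affiliations: University of Freiburg; Fraunhofer IPM; Institute for Sustainable Systems Engineering INATECH; University of Technology Nuremberg
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: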

Abstract:Semantic segmentation of 3D point cloud data often comes with high annotation costs. Active learning automates the process of selecting which data to annotate, reducing the total amount of annotation needed to achieve satisfactory performance. Recent approaches to active learning for 3D point clouds are often based on sophisticated heuristics for both splitting point clouds into annotatable regions and selecting the most beneficial regions for further neural network training. In this work, we propose a novel and easy-to-implement strategy to separate the point cloud into annotatable regions. In our approach, we utilize a 2D grid to subdivide the point cloud into columns. To identify the next data to be annotated, we employ a network ensemble to estimate the uncertainty in the network output. We evaluate our method on the S3DIS dataset, the Toronto-3D dataset, and a large-scale urban 3D point cloud of the city of Freiburg, which we partly labeled manually. The extensive evaluation shows that our method yields performance on par with, or even better than, complex state-of-the-art methods on all datasets. Furthermore, we provide results suggesting that in the context of point clouds the annotated area can be a more meaningful measure for active learning algorithms than the number of annotated points.
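
The two ingredients named in the abstract, a 2D grid that splits the cloud into columns and an ensemble-based uncertainty estimate, can be sketched as follows. The cell size, the use of the entropy of the ensemble-mean prediction, and averaging over the points of a column are assumptions made for illustration; the paper may score regions differently.

```python
import math
from collections import defaultdict

def split_into_columns(points, cell):
    """Group point indices by the 2D grid cell of their (x, y); z is ignored."""
    columns = defaultdict(list)
    for idx, (x, y, _z) in enumerate(points):
        columns[(math.floor(x / cell), math.floor(y / cell))].append(idx)
    return dict(columns)

def ensemble_entropy(member_probs):
    """Entropy of the class distribution averaged over ensemble members (one point)."""
    n = len(member_probs)
    mean = [sum(p[c] for p in member_probs) / n for c in range(len(member_probs[0]))]
    return -sum(p * math.log(p) for p in mean if p > 0)

def most_uncertain_column(points, probs_per_point, cell):
    """Return the grid key of the column with the highest mean ensemble entropy."""
    columns = split_into_columns(points, cell)

    def score(key):
        idxs = columns[key]
        return sum(ensemble_entropy(probs_per_point[i]) for i in idxs) / len(idxs)

    return max(columns, key=score)

# Four points, two columns of size 1 m; two ensemble members, two classes.
points = [(0.2, 0.3, 1.0), (0.8, 0.1, 5.0), (1.5, 0.2, 0.0), (1.7, 0.9, 2.0)]
agree = [[0.9, 0.1], [0.9, 0.1]]
disagree = [[0.9, 0.1], [0.1, 0.9]]
probs_per_point = [agree, agree, disagree, disagree]
next_region = most_uncertain_column(points, probs_per_point, cell=1.0)  # (1, 0)
```

Because every column covers the same ground area, counting labeled columns gives a direct measure of annotated area, the quantity the abstract argues is more meaningful than the number of annotated points.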

[CV-23] USV: Unified Sparsification for Accelerating Video Diffusion Models

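[Quick read]: This paper addresses the scalability bottleneck of high-fidelity Video Diffusion Models (VDMs), which stems from two sources of redundancy: the quadratic complexity of global spatio-temporal attention and the computational overhead of long iterative denoising trajectories. Existing accelerators such as sparse attention and step-distilled samplers usually optimize a single dimension and quickly hit diminishing returns as the remaining bottleneck becomes dominant. The paper proposes USV (Unified Sparsification for Video diffusion models), an end-to-end trainable framework that jointly optimizes sparsification of the model's internal computation and of its sampling process. USV learns a dynamic, data- and timestep-dependent policy that prunes redundant attention connections, adaptively merges semantically similar tokens, and reduces denoising steps, treating these as coordinated actions under a single optimization objective so that the strategies reinforce one another. Experiments report up to 83.3% speedup in the denoising process and 22.7% end-to-end acceleration while maintaining high visual fidelity.

Link: https://arxiv.org/abs/2512.05754
Authors: Xinjian Wu, Hongmei Wang, Yuan Zhou, Qinglin Lu
Affiliations: University of Chinese Academy of Sciences; Tencent Hunyuan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: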

Abstract:The scalability of high-fidelity video diffusion models (VDMs) is constrained by two key sources of redundancy: the quadratic complexity of global spatio-temporal attention and the computational overhead of long iterative denoising trajectories. Existing accelerators – such as sparse attention and step-distilled samplers – typically target a single dimension in isolation and quickly encounter diminishing returns, as the remaining bottlenecks become dominant. In this work, we introduce USV (Unified Sparsification for Video diffusion models), an end-to-end trainable framework that overcomes this limitation by jointly orchestrating sparsification across both the model’s internal computation and its sampling process. USV learns a dynamic, data- and timestep-dependent sparsification policy that prunes redundant attention connections, adaptively merges semantically similar tokens, and reduces denoising steps, treating them not as independent tricks but as coordinated actions within a single optimization objective. This multi-dimensional co-design enables strong mutual reinforcement among previously disjoint acceleration strategies. Extensive experiments on large-scale video generation benchmarks demonstrate that USV achieves up to 83.3% speedup in the denoising process and 22.7% end-to-end acceleration, while maintaining high visual fidelity. Our results highlight unified, dynamic sparsification as a practical path toward efficient, high-quality video generation.

[CV-24] HQ-DM: Single Hadamard Transformation-Based Quantization-Aware Training for Low-Bit Diffusion Models

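[Quick read]: This paper addresses the sharp performance drop of diffusion models under low-bit quantization caused by outliers in activation matrices. Existing quantization methods do not suppress these outliers effectively during inference, which limits deployment in resource-constrained settings. The key to the solution is HQ-DM, a quantization-aware training framework whose core innovation is applying a Single Hadamard Transformation to activation matrices, reducing the impact of activation outliers while preserving model performance. Compared with the traditional Double Hadamard Transformation, the scheme also supports INT convolution operations directly and avoids amplifying weight outliers. For conditional generation on ImageNet 256x256 with the LDM-4 model, the W4A4 and W4A3 schemes improve the Inception Score by 12.8% and 467.73%, respectively, over the existing state-of-the-art method.

Link: https://arxiv.org/abs/2512.05746
Authors: Shizhuo Mao, Hongtao Zou, Qihu Xie, Song Chen, Yi Kang
Affiliations: University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: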

Abstract:Diffusion models have demonstrated significant applications in the field of image generation. However, their high computational and memory costs pose challenges for deployment. Model quantization has emerged as a promising solution to reduce storage overhead and accelerate inference. Nevertheless, existing quantization methods for diffusion models struggle to mitigate outliers in activation matrices during inference, leading to substantial performance degradation under low-bit quantization scenarios. To address this, we propose HQ-DM, a novel Quantization-Aware Training framework that applies Single Hadamard Transformation to activation matrices. This approach effectively reduces activation outliers while preserving model performance under quantization. Compared to traditional Double Hadamard Transformation, our proposed scheme offers distinct advantages by seamlessly supporting INT convolution operations while preventing the amplification of weight outliers. For conditional generation on the ImageNet 256x256 dataset using the LDM-4 model, our W4A4 and W4A3 quantization schemes improve the Inception Score by 12.8% and 467.73%, respectively, over the existing state-of-the-art method.
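
Why a Hadamard rotation helps with activation outliers can be seen on a single vector: the orthonormal transform spreads one large entry evenly across all coordinates, shrinking the dynamic range a low-bit quantizer has to cover. The sketch below shows only this generic property with a fast Walsh-Hadamard transform; it is not the HQ-DM training procedure, and the example activations are made up.

```python
import math

def hadamard_transform(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly step: combine pairs of entries that are h apart.
        for start in range(0, n, 2 * h):
            for k in range(start, start + h):
                a, b = out[k], out[k + h]
                out[k], out[k + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]

# Hypothetical activation vector with one large outlier.
activations = [0.1, -0.2, 0.05, 12.0, 0.0, 0.15, -0.1, 0.2]
rotated = hadamard_transform(activations)

peak_before = max(abs(v) for v in activations)  # 12.0
peak_after = max(abs(v) for v in rotated)       # roughly 12 / sqrt(8)
restored = hadamard_transform(rotated)          # the transform is its own inverse
```

Because the normalized transform is orthogonal, it can be undone exactly (or folded into the adjacent weights), so the rotation changes the numerical range without changing what the layer computes.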

[CV-25] Distilling Expert Surgical Knowledge: How to train local surgical VLMs for anatomy explanation in Complete Mesocolic Excision

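[Quick read]: This paper addresses the shortcomings of current Vision Large Language Models (VLMs) in domain-specific surgical scene understanding, in particular identifying and explaining anatomical landmarks during Complete Mesocolic Excision, along with the risk of patient data leaking to externally hosted models. The key to the solution is a privacy-preserving knowledge distillation framework: an expert-supervised dataset is generated by prompting a general-purpose teacher LLM with only textual context and binary segmentation masks, without sensitive images. The dataset is then used for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) of a locally deployable VLM, substantially increasing its surgical domain knowledge while keeping the process compliant with data privacy requirements.

Link: https://arxiv.org/abs/2512.05740
Authors: Lennart Maack, Julia-Kristin Graß, Lisa-Marie Toscha, Nathaniel Melling, Alexander Schlaefer
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: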

Abstract:Recently, Vision Large Language Models (VLMs) have demonstrated high potential in computer-aided diagnosis and decision-support. However, current VLMs show deficits in domain specific surgical scene understanding, such as identifying and explaining anatomical landmarks during Complete Mesocolic Excision. Additionally, there is a need for locally deployable models to avoid patient data leakage to large VLMs, hosted outside the clinic. We propose a privacy-preserving framework to distill knowledge from large, general-purpose LLMs into an efficient, local VLM. We generate an expert-supervised dataset by prompting a teacher LLM without sensitive images, using only textual context and binary segmentation masks for spatial information. This dataset is used for Supervised Fine-Tuning (SFT) and subsequent Direct Preference Optimization (DPO) of the locally deployable VLM. Our evaluation confirms that finetuning VLMs with our generated datasets increases surgical domain knowledge compared to its base VLM by a large margin. Overall, this work validates a data-efficient and privacy-conforming way to train a surgical domain optimized, locally deployable VLM for surgical scene understanding.

[CV-26] Manifold-Aware Point Cloud Completion via Geodesic-Attentive Hierarchical Feature Learning

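[Quick read]: This paper addresses the weak geometric consistency and semantic ambiguity that arise in point cloud completion when methods rely on Euclidean distance and ignore the intrinsic nonlinear geometry of point clouds. The key to the solution is a manifold-aware framework with two core modules: a Geodesic Distance Approximator (GDA), which estimates geodesic distances between points to capture the latent manifold topology, and a Manifold-Aware Feature Extractor (MAFE), which uses geodesic-based k-NN grouping and a geodesic-relational attention mechanism to guide hierarchical feature learning, improving semantic coherence and structural fidelity in the reconstruction.

Link: https://arxiv.org/abs/2512.05710
Authors: Jianan Sun, Dongzhihan Wang, Mingyu Fan
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: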

Abstract:Point cloud completion seeks to recover geometrically consistent shapes from partial or sparse 3D observations. Although recent methods have achieved reasonable global shape reconstruction, they often rely on Euclidean proximity and overlook the intrinsic nonlinear geometric structure of point clouds, resulting in suboptimal geometric consistency and semantic ambiguity. In this paper, we present a manifold-aware point cloud completion framework that explicitly incorporates nonlinear geometry information throughout the feature learning pipeline. Our approach introduces two key modules: a Geodesic Distance Approximator (GDA), which estimates geodesic distances between points to capture the latent manifold topology, and a Manifold-Aware Feature Extractor (MAFE), which utilizes geodesic-based k-NN groupings and a geodesic-relational attention mechanism to guide the hierarchical feature extraction process. By integrating geodesic-aware relational attention, our method promotes semantic coherence and structural fidelity in the reconstructed point clouds. Extensive experiments on benchmark datasets demonstrate that our approach consistently outperforms state-of-the-art methods in reconstruction quality.
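
To make the Euclidean-versus-geodesic distinction concrete, the sketch below uses the classical graph approximation (shortest paths on a k-nearest-neighbor graph, as in Isomap). This is a stand-in for illustration only: the abstract does not describe how the paper's GDA module estimates geodesic distances, and the function names here are hypothetical.

```python
import heapq
import math

def knn_graph(points, k):
    """Symmetric k-nearest-neighbor graph with Euclidean edge lengths."""
    graph = {i: [] for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for d, j in nearest[:k]:
            graph[i].append((j, d))
            graph[j].append((i, d))
    return graph

def geodesic_distances(graph, source):
    """Dijkstra shortest-path lengths from source, approximating geodesic distance."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Points sampled along a unit semicircle: the two ends are close in a straight
# line (distance 2) but far apart along the curve (arc length pi).
arc = [(math.cos(math.pi * t / 20), math.sin(math.pi * t / 20)) for t in range(21)]
graph = knn_graph(arc, k=2)
euclidean = math.dist(arc[0], arc[20])
geodesic = geodesic_distances(graph, 0)[20]
```

Grouping neighbors by the graph distance instead of the straight-line distance keeps points on opposite sides of a thin or folded surface from being treated as neighbors.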

[CV-27] OWL: Unsupervised 3D Object Detection by Occupancy Guided Warm-up and Large Model Priors Reasoning AAAI

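[Quick read]: This paper addresses the way low-quality pseudo-labels mislead optimization in unsupervised 3D object detection, especially early in training when the pseudo-label error rate is high and conventional self-training cannot filter and refine them effectively. The key to the solution is OWL, which has three components: first, an Occupancy Guided Warm-up (OGW) strategy initializes the backbone weights with spatial perception capabilities, reducing the interference of incorrect pseudo-labels with convergence; second, an Instance-Cued Reasoning (ICR) module uses the prior knowledge of large models to assess pseudo-label quality for precise filtering and refinement; third, a Weight-adapted Self-training (WAS) strategy dynamically re-weights pseudo-labels to further improve self-training. On the Waymo Open Dataset and KITTI, OWL outperforms state-of-the-art unsupervised methods by over 15.0% mAP.

Link: https://arxiv.org/abs/2512.05698
Authors: Xusheng Guo, Wanfa Zhang, Shijia Zhao, Qiming Xia, Xiaolong Xie, Mingming Wang, Hai Wu, Chenglu Wen
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The 40th Annual AAAI Conference on Artificial Intelligence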

Abstract:Unsupervised 3D object detection leverages heuristic algorithms to discover potential objects, offering a promising route to reduce annotation costs in autonomous driving. Existing approaches mainly generate pseudo labels and refine them through self-training iterations. However, these pseudo-labels are often incorrect at the beginning of training, which misleads the optimization process. Moreover, effectively filtering and refining them remains a critical challenge. In this paper, we propose OWL for unsupervised 3D object detection by occupancy guided warm-up and large-model priors reasoning. OWL first employs an Occupancy Guided Warm-up (OGW) strategy to initialize the backbone weights with spatial perception capabilities, mitigating the interference of incorrect pseudo-labels on network convergence. Furthermore, OWL introduces an Instance-Cued Reasoning (ICR) module that leverages the prior knowledge of large models to assess pseudo-label quality, enabling precise filtering and refinement. Finally, we design a Weight-adapted Self-training (WAS) strategy to dynamically re-weight pseudo-labels, improving the performance through self-training. Extensive experiments on Waymo Open Dataset (WOD) and KITTI demonstrate that OWL outperforms state-of-the-art unsupervised methods by over 15.0% mAP, revealing the effectiveness of our method.

[CV-28] Physics-Informed Graph Neural Network with Frequency-Aware Learning for Optical Aberration Correction

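[Quick read]: This paper addresses image degradation caused by optical aberrations in microscopy, especially when imaging deep into samples. Conventional methods typically handle only mild aberrations on limited sample types and do not model the physics of wavefront distortion. The key to the solution is ZRNet, a physics-informed framework that jointly performs Zernike coefficient prediction and image restoration. A Zernike Graph module explicitly models the physical relationships between Zernike polynomials based on their azimuthal degrees, so that learned corrections are consistent with basic optical principles, and a Frequency-Aware Alignment (FAA) loss strengthens the consistency between image restoration and Zernike coefficient prediction in the Fourier domain, giving more accurate and physically consistent restoration and aberration compensation.

Link: https://arxiv.org/abs/2512.05683
Authors: Yong En Kok, Bowen Deng, Alexander Bentley, Andrew J. Parkes, Michael G. Somekh, Amanda J. Wright, Michael P. Pound
Affiliations: University of Nottingham
Categories: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Comments: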

Abstract:Optical aberrations significantly degrade image quality in microscopy, particularly when imaging deeper into samples. These aberrations arise from distortions in the optical wavefront and can be mathematically represented using Zernike polynomials. Existing methods often address only mild aberrations on limited sample types and modalities, typically treating the problem as a black-box mapping without leveraging the underlying optical physics of wavefront distortions. We propose ZRNet, a physics-informed framework that jointly performs Zernike coefficient prediction and optical image Restoration. We contribute a Zernike Graph module that explicitly models physical relationships between Zernike polynomials based on their azimuthal degrees, ensuring that learned corrections align with fundamental optical principles. To further enforce physical consistency between image restoration and Zernike prediction, we introduce a Frequency-Aware Alignment (FAA) loss, which better aligns Zernike coefficient prediction and image features in the Fourier domain. Extensive experiments on CytoImageNet demonstrate that our approach achieves state-of-the-art performance in both image restoration and Zernike coefficient prediction across diverse microscopy modalities and biological samples with complex, large-amplitude aberrations. Code is available at this https URL.
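
Each Zernike polynomial is indexed by a radial degree n and an azimuthal degree m (with |m| ≤ n and n − m even). The sketch below enumerates the modes in OSA/ANSI order and links modes that share the same |m|. The linking rule is a guess at what "relationships based on azimuthal degrees" could mean, chosen for illustration; the paper's actual graph construction is not given in the abstract.

```python
def zernike_modes(max_radial):
    """List (n, m) pairs in OSA/ANSI order up to radial degree max_radial."""
    return [(n, m) for n in range(max_radial + 1) for m in range(-n, n + 1, 2)]

def osa_index(n, m):
    """OSA/ANSI single index j = (n(n + 2) + m) / 2."""
    return (n * (n + 2) + m) // 2

def azimuthal_edges(modes):
    """Assumed rule: connect two modes when they share the same |m|."""
    return [(osa_index(*a), osa_index(*b))
            for i, a in enumerate(modes)
            for b in modes[i + 1:]
            if abs(a[1]) == abs(b[1])]

modes = zernike_modes(4)        # 15 modes, OSA indices 0..14
edges = azimuthal_edges(modes)
# Defocus (n=2, m=0, index 4) and primary spherical (n=4, m=0, index 12) are
# both rotationally symmetric, so they are linked under this rule.
```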

[CV-29] Hyperspectral Unmixing with 3D Convolutional Sparse Coding and Projected Simplex Volume Maximization

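[Quick read]: This paper addresses hyperspectral unmixing (HSU), which separates each pixel into its constituent endmembers and estimates their abundance fractions. The key to the solution is the 3D Convolutional Sparse Coding Network (3D-CSCNet), an algorithm-unrolling network built within an autoencoder (AE) framework. A 3D CSC block (3D-CSCB) jointly learns spectral and spatial relationships in the hyperspectral image (HSI) data cube and estimates the abundance matrix, which is passed to the AE decoder to reconstruct the HSI; the decoder weights are extracted as the endmember matrix. A projected simplex volume maximization (PSVM) algorithm is also proposed for endmember estimation, and its output initializes the decoder weights. Experiments on three real datasets and one simulated dataset show that 3D-CSCNet outperforms state-of-the-art methods.

Link: https://arxiv.org/abs/2512.05674
Authors: Gargi Panda, Soumitra Kundu, Saumik Bhattacharya, Aurobinda Routray
Affiliations: IIT Kharagpur
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: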

Abstract:Hyperspectral unmixing (HSU) aims to separate each pixel into its constituent endmembers and estimate their corresponding abundance fractions. This work presents an algorithm-unrolling-based network for the HSU task, named the 3D Convolutional Sparse Coding Network (3D-CSCNet), built upon a 3D CSC model. Unlike existing unrolling-based networks, our 3D-CSCNet is designed within the powerful autoencoder (AE) framework. Specifically, to solve the 3D CSC problem, we propose a 3D CSC block (3D-CSCB) derived through deep algorithm unrolling. Given a hyperspectral image (HSI), 3D-CSCNet employs the 3D-CSCB to estimate the abundance matrix. The use of 3D CSC enables joint learning of spectral and spatial relationships in the 3D HSI data cube. The estimated abundance matrix is then passed to the AE decoder to reconstruct the HSI, and the decoder weights are extracted as the endmember matrix. Additionally, we propose a projected simplex volume maximization (PSVM) algorithm for endmember estimation, and the resulting endmembers are used to initialize the decoder weights of 3D-CSCNet. Extensive experiments on three real datasets and one simulated dataset with three different signal-to-noise ratio (SNR) levels demonstrate that our 3D-CSCNet outperforms state-of-the-art methods.

[CV-30] InverseCrafter: Efficient Video ReCapture as a Latent Domain Inverse Problem

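[Quick read]: This paper addresses the drawbacks of controllable 4D video generation methods that rely on fine-tuning pre-trained Video Diffusion Models (VDMs): high computational cost, large data requirements, and catastrophic forgetting of the original generative priors. The key to the solution is InverseCrafter, an efficient inpainting inverse solver that reformulates 4D generation as an inpainting problem solved in the latent space. Its core innovation is a principled mechanism that encodes the pixel-space degradation operator into a continuous, multi-channel latent mask, bypassing the bottleneck of repeated VAE operations and backpropagation. With near-zero computational overhead, the method achieves comparable novel view generation and better measurement consistency in camera control tasks, and also handles general-purpose video inpainting with editing.

Link: https://arxiv.org/abs/2512.05672
Authors: Yeobin Hong, Suhyeon Lee, Hyungjin Chung, Jong Chul Ye
Affiliations: KAIST; EverEx
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: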

Abstract:Recent approaches to controllable 4D video generation often rely on fine-tuning pre-trained Video Diffusion Models (VDMs). This dominant paradigm is computationally expensive, requiring large-scale datasets and architectural modifications, and frequently suffers from catastrophic forgetting of the model’s original generative priors. Here, we propose InverseCrafter, an efficient inpainting inverse solver that reformulates the 4D generation task as an inpainting problem solved in the latent space. The core of our method is a principled mechanism to encode the pixel space degradation operator into a continuous, multi-channel latent mask, thereby bypassing the costly bottleneck of repeated VAE operations and backpropagation. InverseCrafter not only achieves comparable novel view generation and superior measurement consistency in camera control tasks with near-zero computational overhead, but also excels at general-purpose video inpainting with editing. Code is available at this https URL.

[CV-31] Deep Learning-Based Real-Time Sequential Facial Expression Analysis Using Geometric Features

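[Quick read]: This paper addresses real-time sequential facial expression recognition, with the aim of improving human-computer interaction and emotion-aware systems. The challenge is to combine high accuracy with low latency and a high frame rate while capturing the onset, apex, and offset phases of an expression. The key to the solution is combining geometric features with temporal modeling: MediaPipe FaceMesh extracts facial landmarks, from which Euclidean distances and angles are computed; differences between consecutive frames provide temporal dynamics; and a ConvLSTM1D network followed by multilayer perceptron blocks performs classification. The method reaches accuracies of 93%, 79%, 77%, and 68% on CK+, Oulu-CASIA (VIS and NIR), and MMI, respectively, and processes approximately 165 frames per second on consumer-grade hardware.

Link: https://arxiv.org/abs/2512.05669
Authors: Talha Enes Koksal, Abdurrahman Gumus
Affiliations: Izmir Institute of Technology; Isparta University of Applied Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: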

Abstract:Facial expression recognition is a crucial component in enhancing human-computer interaction and developing emotion-aware systems. Real-time detection and interpretation of facial expressions have become increasingly important for various applications, from user experience personalization to intelligent surveillance systems. This study presents a novel approach to real-time sequential facial expression recognition using deep learning and geometric features. The proposed method utilizes MediaPipe FaceMesh for rapid and accurate facial landmark detection. Geometric features, including Euclidean distances and angles, are extracted from these landmarks. Temporal dynamics are incorporated by analyzing feature differences between consecutive frames, enabling the detection of onset, apex, and offset phases of expressions. For classification, a ConvLSTM1D network followed by multilayer perceptron blocks is employed. The method’s performance was evaluated on multiple publicly available datasets, including CK+, Oulu-CASIA (VIS and NIR), and MMI. Accuracies of 93%, 79%, 77%, and 68% were achieved respectively. Experiments with composite datasets were also conducted to assess the model’s generalization capabilities. The approach demonstrated real-time applicability, processing approximately 165 frames per second on consumer-grade hardware. This research contributes to the field of facial expression analysis by providing a fast, accurate, and adaptable solution. The findings highlight the potential for further advancements in emotion-aware technologies and personalized user experiences, paving the way for more sophisticated human-computer interaction systems. To facilitate further research in this field, the complete source code for this study has been made publicly available on GitHub: this https URL.
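
The geometric feature step described in the abstract (distances and angles from landmarks, then differences between consecutive frames) can be sketched in a few lines. The landmark names and the three chosen features below are hypothetical; the paper works on MediaPipe FaceMesh landmarks, which are not called here, and its actual feature set is not listed in the abstract.

```python
import math

def angle(a, b, c):
    """Unsigned angle at vertex b (radians) formed by points a-b-c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return abs(math.atan2(cross, dot))

def frame_features(lm):
    """Mouth width, lip opening, and mouth-corner angle for one frame."""
    return [
        math.dist(lm["mouth_left"], lm["mouth_right"]),
        math.dist(lm["lip_top"], lm["lip_bottom"]),
        angle(lm["mouth_left"], lm["lip_top"], lm["mouth_right"]),
    ]

def temporal_deltas(frames):
    """Feature differences between each pair of consecutive frames."""
    feats = [frame_features(f) for f in frames]
    return [[b - a for a, b in zip(prev, cur)] for prev, cur in zip(feats, feats[1:])]

# Two hypothetical frames: neutral mouth, then a wider mouth with raised corners.
neutral = {"mouth_left": (0.0, 0.0), "mouth_right": (4.0, 0.0),
           "lip_top": (2.0, 0.5), "lip_bottom": (2.0, -0.5)}
smile = {"mouth_left": (-0.5, 0.3), "mouth_right": (4.5, 0.3),
         "lip_top": (2.0, 0.5), "lip_bottom": (2.0, -0.5)}
deltas = temporal_deltas([neutral, smile])  # one delta vector for the frame pair
```

A sequence of such delta vectors is the kind of input a temporal classifier such as the ConvLSTM1D network mentioned in the abstract would consume.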

[CV-32] LeAD-M3D: Leveraging Asymmetric Distillation for Real-time Monocular 3D Detection

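[Quick read]: This paper tackles the difficulty of achieving both accuracy and efficiency in monocular 3D object detection, which stems from depth ambiguity, viewpoint shifts, and the high computational cost of 3D reasoning. Existing methods typically either rely on LiDAR or geometric priors to compensate for missing depth, or sacrifice efficiency for accuracy. The proposed LeAD-M3D framework has three key components: 1) Asymmetric Augmentation Denoising Distillation (A2D2), which uses a quality- and importance-weighted depth-feature loss to transfer geometric knowledge from a clean-image teacher to a mixup-noised student, strengthening depth reasoning without LiDAR supervision; 2) 3D-aware Consistent Matching (CM3D), which integrates 3D MGIoU into the prediction-to-ground-truth matching score for more stable and precise supervision; 3) Confidence-Gated 3D Inference (CGI3D), which runs the expensive 3D regression only on top-confidence regions to speed up inference. Together they let LeAD-M3D reach state-of-the-art accuracy on KITTI and Waymo and the best reported car AP on Rope3D, while running up to 3.6x faster than prior high-accuracy methods, without LiDAR, stereo, or geometric assumptions.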

Link: https://arxiv.org/abs/2512.05663
Authors: Johannes Meier, Jonathan Michel, Oussema Dhaouadi, Yung-Hsu Yang, Christoph Reich, Zuria Bauer, Stefan Roth, Marc Pollefeys, Jacques Kaiser, Daniel Cremers
Affiliations: DeepScenario; ETH Zurich; TU Munich; MCML; TU Darmstadt; Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Real-time monocular 3D object detection remains challenging due to severe depth ambiguity, viewpoint shifts, and the high computational cost of 3D reasoning. Existing approaches either rely on LiDAR or geometric priors to compensate for missing depth, or sacrifice efficiency to achieve competitive accuracy. We introduce LeAD-M3D, a monocular 3D detector that achieves state-of-the-art accuracy and real-time inference without extra modalities. Our method is powered by three key components. Asymmetric Augmentation Denoising Distillation (A2D2) transfers geometric knowledge from a clean-image teacher to a mixup-noised student via a quality- and importance-weighted depth-feature loss, enabling stronger depth reasoning without LiDAR supervision. 3D-aware Consistent Matching (CM3D) improves prediction-to-ground truth assignment by integrating 3D MGIoU into the matching score, yielding more stable and precise supervision. Finally, Confidence-Gated 3D Inference (CGI3D) accelerates detection by restricting expensive 3D regression to top-confidence regions. Together, these components set a new Pareto frontier for monocular 3D detection: LeAD-M3D achieves state-of-the-art accuracy on KITTI and Waymo, and the best reported car AP on Rope3D, while running up to 3.6x faster than prior high-accuracy methods. Our results demonstrate that high fidelity and real-time efficiency in monocular 3D detection are simultaneously attainable - without LiDAR, stereo, or geometric assumptions.
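The idea of restricting expensive 3D regression to top-confidence regions can be sketched as follows. This is only an illustration of confidence gating in general: the `regress_3d` callable, the `(confidence, feature)` candidate format, and the `top_k` parameter are assumptions, not the paper's CGI3D implementation.

```python
def confidence_gated_inference(candidates, regress_3d, top_k):
    """Run the costly 3D head only on the top_k most confident candidates.

    candidates: list of (confidence, feature) pairs from a cheap 2D stage.
    regress_3d: hypothetical callable standing in for the expensive 3D head.
    """
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    return [(conf, regress_3d(feat)) for conf, feat in ranked[:top_k]]
```

The saving comes from the fact that `regress_3d` is called `top_k` times rather than once per candidate location.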

[CV-33] Self-Supervised AI-Generated Image Detection: A Camera Metadata Perspective

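[Quick read]: This paper addresses the challenge that AI-generated images pose for multimedia forensics, in particular that existing detectors depend on assumptions about the internals of specific generative models and therefore generalize poorly across models. The key idea is a self-supervised framework that uses camera metadata (EXIF tags) as the pretext-task signal to learn representations intrinsic to digital photography. A feature extractor is trained by classifying categorical EXIF tags (such as camera model and scene type) and pairwise-ranking ordinal and continuous EXIF tags (such as focal length and aperture value). On top of these features the authors build a one-class detector (a Gaussian mixture model of the photographic image distribution) and a binary detector (a classifier regularized by the learned extractor and operating on high-frequency residuals of spatially scrambled patches). The resulting detectors generalize well to in-the-wild samples and are robust to common benign image perturbations.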

Link: https://arxiv.org/abs/2512.05651
Authors: Nan Zhong, Mian Zou, Yiran Xu, Zhenxing Qian, Xinpeng Zhang, Baoyuan Wu, Kede Ma
Affiliations: City University of Hong Kong; Fudan University; The Chinese University of Hong Kong, Shenzhen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The proliferation of AI-generated imagery poses escalating challenges for multimedia forensics, yet many existing detectors depend on assumptions about the internals of specific generative models, limiting their cross-model applicability. We introduce a self-supervised approach for detecting AI-generated images that leverages camera metadata – specifically exchangeable image file format (EXIF) tags – to learn features intrinsic to digital photography. Our pretext task trains a feature extractor solely on camera-captured photographs by classifying categorical EXIF tags (e.g., camera model and scene type) and pairwise-ranking ordinal and continuous EXIF tags (e.g., focal length and aperture value). Using these EXIF-induced features, we first perform one-class detection by modeling the distribution of photographic images with a Gaussian mixture model and flagging low-likelihood samples as AI-generated. We then extend to binary detection that treats the learned extractor as a strong regularizer for a classifier of the same architecture, operating on high-frequency residuals from spatially scrambled patches. Extensive experiments across various generative models demonstrate that our EXIF-induced detectors substantially advance the state of the art, delivering strong generalization to in-the-wild samples and robustness to common benign image perturbations.
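The one-class step (score a feature vector under a Gaussian mixture fitted to real photographs and flag low-likelihood samples) can be sketched as below. The sketch assumes a diagonal-covariance mixture whose parameters are already fitted, and the threshold is a free parameter; neither choice is stated in the abstract.

```python
import math

def diag_gaussian_logpdf(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (an assumption here)."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, weights, means, variances):
    """log sum_k w_k N(x; mu_k, var_k), computed with the log-sum-exp trick."""
    logs = [math.log(w) + diag_gaussian_logpdf(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)
    return peak + math.log(sum(math.exp(l - peak) for l in logs))

def is_flagged_as_generated(x, gmm, threshold):
    """Flag a sample whose likelihood under the photo model is below threshold."""
    weights, means, variances = gmm
    return gmm_log_likelihood(x, weights, means, variances) < threshold
```

In practice the threshold would be calibrated on held-out photographs, for example to fix a false-positive rate.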

[CV-34] Experts-Guided Unbalanced Optimal Transport for ISP Learning from Unpaired and/or Paired Data

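[Quick read]: This paper targets the heavy dependence of learned image signal processing (ISP) pipelines on large-scale paired raw-to-sRGB data, which is costly to acquire and remains a significant bottleneck. The core solution is an unsupervised training framework based on Optimal Transport (OT) that can train arbitrary ISP architectures in both paired and unpaired modes; the authors state that it is the first successful application of Unbalanced Optimal Transport (UOT) to this cross-domain translation task. The key innovation is a "committee of expert discriminators", a hybrid adversarial regularizer that guides the optimal transport mapping with specialized gradients targeting specific ISP failure modes (color fidelity, structural artifacts, and frequency-domain realism).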

Link: https://arxiv.org/abs/2512.05635
Authors: Georgy Perevozchikov, Nancy Mehta, Egor Ershov, Radu Timofte
Affiliations: University of Würzburg; Institute for Information Transmission Problems, RAS; Moscow Institute of Physics and Technologies; Artificial Intelligence Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Learned Image Signal Processing (ISP) pipelines offer powerful end-to-end performance but are critically dependent on large-scale paired raw-to-sRGB datasets. This reliance on costly-to-acquire paired data remains a significant bottleneck. To address this challenge, we introduce a novel, unsupervised training framework based on Optimal Transport capable of training arbitrary ISP architectures in both unpaired and paired modes. We are the first to successfully apply Unbalanced Optimal Transport (UOT) for this complex, cross-domain translation task. Our UOT-based framework provides robustness to outliers in the target sRGB data, allowing it to discount atypical samples that would be prohibitively costly to map. A key component of our framework is a novel "committee of expert discriminators," a hybrid adversarial regularizer. This committee guides the optimal transport mapping by providing specialized, targeted gradients to correct specific ISP failure modes, including color fidelity, structural artifacts, and frequency-domain realism. To demonstrate the superiority of our approach, we retrained existing state-of-the-art ISP architectures using our paired and unpaired setups. Our experiments show that while our framework, when trained in paired mode, exceeds the performance of the original paired methods across all metrics, our unpaired mode concurrently achieves quantitative and qualitative performance that rivals, and in some cases surpasses, the original paired-trained counterparts. The code and pre-trained models are available at: this https URL.

[CV-35] DistillFSS: Synthesizing Few-Shot Knowledge into a Lightweight Segmentation Model

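[Quick read]: This paper addresses three challenges in Cross-Domain Few-Shot Semantic Segmentation (CD-FSS): substantial distribution shift between source and target domains, disjoint label spaces, and scarce support images, which together make standard episodic methods unreliable and computationally expensive at test time. The proposed DistillFSS framework uses teacher-student distillation to embed support-set knowledge directly into the model parameters: a dedicated layer in the student network internalizes the few-shot reasoning, so no support images are needed at test time. This gives fast, lightweight inference and allows efficient extension to novel classes in unseen domains through rapid teacher-driven specialization.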

Link: https://arxiv.org/abs/2512.05613
Authors: Pasquale De Marinis, Pieter M. Blok, Uzay Kaymak, Rogier Brussee, Gennaro Vessio, Giovanna Castellano
Affiliations: University of Bari Aldo Moro; Eindhoven University of Technology; IEEE; JADS; University of Bari Aldo Moro; University of Bari Aldo Moro
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cross-Domain Few-Shot Semantic Segmentation (CD-FSS) seeks to segment unknown classes in unseen domains using only a few annotated examples. This setting is inherently challenging: source and target domains exhibit substantial distribution shifts, label spaces are disjoint, and support images are scarce–making standard episodic methods unreliable and computationally demanding at test time. To address these constraints, we propose DistillFSS, a framework that embeds support-set knowledge directly into a model’s parameters through a teacher–student distillation process. By internalizing few-shot reasoning into a dedicated layer within the student network, DistillFSS eliminates the need for support images at test time, enabling fast, lightweight inference, while allowing efficient extension to novel classes in unseen domains through rapid teacher-driven specialization. Combined with fine-tuning, the approach scales efficiently to large support sets and significantly reduces computational overhead. To evaluate the framework under realistic conditions, we introduce a new CD-FSS benchmark spanning medical imaging, industrial inspection, and remote sensing, with disjoint label spaces and variable support sizes. Experiments show that DistillFSS matches or surpasses state-of-the-art baselines, particularly in multi-class and multi-shot scenarios, while offering substantial efficiency gains. The code is available at this https URL.

[CV-36] NormalView: sensor-agnostic tree species classification from backpack and aerial lidar data using geometric projections

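[Quick read]: This paper addresses the accuracy and generality of tree species classification from laser-scanning point clouds, in particular across mobile laser scanning (MLS) and airborne laser scanning (ALS) data of different point densities. The proposed NormalView is a sensor-agnostic, projection-based deep learning method that embeds local geometric information, in the form of normal vector estimates, into two-dimensional projections and feeds them to the YOLOv11 image classification network. The authors also examine the effect of multispectral radiometric intensity: the best model on the multispectral ALS dataset used intensity from all three channels, indicating that geometric features and multi-channel intensity information are complementary.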

Link: https://arxiv.org/abs/2512.05610
Authors: Juho Korkeala, Jesse Muhojoki, Josef Taher, Klaara Salolahti, Matti Hyyppä, Antero Kukko, Juha Hyyppä
Affiliations: Finnish Geospatial Research Institute FGI; The National Land Survey of Finland; Aalto University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 8 figures

Abstract:Laser scanning has proven to be an invaluable tool in assessing the decomposition of forest environments. Mobile laser scanning (MLS) has been shown to be highly promising for extremely accurate, tree level inventory. In this study, we present NormalView, a sensor-agnostic projection-based deep learning method for classifying tree species from point cloud data. NormalView embeds local geometric information into two-dimensional projections, in the form of normal vector estimates, and uses the projections as inputs to an image classification network, YOLOv11. In addition, we inspected the effect of multispectral radiometric intensity information on classification performance. We trained and tested our model on high-density MLS data (7 species, ~5000 pts/m^2), as well as high-density airborne laser scanning (ALS) data (9 species, 1000 pts/m^2). On the MLS data, NormalView achieves an overall accuracy (macro-average accuracy) of 95.5 % (94.8 %), and 91.8 % (79.1 %) on the ALS data. We found that having intensity information from multiple scanners provides benefits in tree species classification, and the best model on the multispectral ALS dataset was a model using intensity information from all three channels of the multispectral ALS. This study demonstrates that projection-based methods, when enhanced with geometric information and coupled with state-of-the-art image classification backbones, can achieve exceptional results. Crucially, these methods are sensor-agnostic, relying only on geometric information. Additionally, we publicly release the MLS dataset used in the study.
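The abstract reports both overall accuracy and macro-average accuracy, and the two can differ considerably when species are imbalanced (91.8 % versus 79.1 % on the ALS data). The sketch below shows the difference on toy labels, assuming that macro-average accuracy means the unweighted mean of per-class accuracies; the species names are made up for illustration.

```python
from collections import defaultdict

def overall_and_macro_accuracy(y_true, y_pred):
    """Return (overall accuracy, mean of per-class accuracies)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    overall = sum(correct.values()) / sum(total.values())
    macro = sum(correct[c] / total[c] for c in total) / len(total)
    return overall, macro

# Toy example: a frequent class classified perfectly, a rare class half wrong.
y_true = ["pine"] * 8 + ["aspen"] * 2
y_pred = ["pine"] * 8 + ["pine", "aspen"]
overall, macro = overall_and_macro_accuracy(y_true, y_pred)
# overall is 9/10, macro is (8/8 + 1/2) / 2
```

A large gap between the two numbers therefore points to weaker performance on the rarer classes.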

[CV-37] Fast SceneScript: Accurate and Efficient Structured Language Model via Multi-Token Prediction

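[Quick read]: This paper addresses the slow inference of language-model-based 3D scene layout estimation, which is caused by autoregressive next-token prediction. The proposed Fast SceneScript uses multi-token prediction (MTP) to reduce the number of autoregressive iterations and speed up inference. To keep MTP accurate, it introduces confidence-guided decoding (CGD) with an improved token-reliability score to filter out unreliable predictions, and a parameter-efficient mechanism to limit the parameter overhead of MTP. Experiments show that the method generates up to 9 tokens per decoder inference step without compromising accuracy, while adding only about 7.5% more parameters.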

Link: https://arxiv.org/abs/2512.05597
Authors: Ruihong Yin, Xuepeng Shi, Oleksandr Bailo, Marco Manfredi, Theo Gevers
Affiliations: Qualcomm XR Labs; University of Amsterdam
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 8 figures

Abstract:Recent perception-generalist approaches based on language models have achieved state-of-the-art results across diverse tasks, including 3D scene layout estimation, via unified architecture and interface. However, these approaches rely on autoregressive next-token prediction, which is inherently slow. In this work, we introduce Fast SceneScript, a novel structured language model for accurate and efficient 3D scene layout estimation. Our method employs multi-token prediction (MTP) to reduce the number of autoregressive iterations and significantly accelerate inference. While MTP improves speed, unreliable token predictions can significantly reduce accuracy. To filter out unreliable tokens, we adapt self-speculative decoding (SSD) for structured language models and introduce confidence-guided decoding (CGD) with an improved scoring mechanism for token reliability. Furthermore, we design a parameter-efficient mechanism that reduces the parameter overhead of MTP. Extensive experiments on the ASE and Structured3D benchmarks demonstrate that Fast SceneScript can generate up to 9 tokens per decoder inference step without compromising accuracy, while adding only ~7.5% additional parameters.
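A generic form of confidence-based filtering for multi-token prediction can be sketched as follows: each step proposes several tokens, and only the leading run of sufficiently confident ones is kept. This is a simplified illustration, not the paper's CGD scoring mechanism; the `predict_block` callable, the fixed threshold, and the rule of always keeping the first token are assumptions.

```python
def accept_confident_prefix(tokens, confidences, threshold, eos="<eos>"):
    """Keep proposed tokens until one falls below the threshold.

    The first token is always kept (an assumption) so decoding makes progress.
    """
    accepted = []
    for tok, conf in zip(tokens, confidences):
        if accepted and conf < threshold:
            break
        accepted.append(tok)
        if tok == eos:
            break
    return accepted

def decode(predict_block, max_len, threshold, eos="<eos>"):
    """Decode with a hypothetical predict_block(prefix) -> (tokens, confidences)."""
    out, steps = [], 0
    while len(out) < max_len and (not out or out[-1] != eos):
        tokens, confidences = predict_block(out)
        accepted = accept_confident_prefix(tokens, confidences, threshold, eos)
        if not accepted:
            break
        out.extend(accepted)
        steps += 1
    return out[:max_len], steps
```

The number of decoder steps drops whenever more than one token per step passes the threshold, which is where the speed-up comes from.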

[CV-38] Learning High-Fidelity Cloth Animation via Skinning-Free Image Transfer

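[Quick read]: This paper addresses the misaligned shapes produced by existing 3D garment deformation methods based on linear blend skinning (LBS), which hamper the recovery of high-frequency wrinkle detail. Because explicit skinning supervision is lacking, the low-frequency garment shape is easily distorted under different body poses, which corrupts the high-frequency signal and prevents high-fidelity wrinkles from being recovered. The solution is a skinning-free approach that decouples garment deformation into two independently estimated modalities, (i) vertex positions for the low-frequency, pose-dependent shape and (ii) vertex normals for high-frequency local wrinkle detail, each supervised directly by the geometry of the deformed garment. Both vertex attributes are encoded as rendered texture images, so that pretrained image models can recover fine wrinkle detail in 2D image space without manual UV partitioning and with good scalability to diverse garment topologies. Finally, a multimodal fusion combines the constraints from both frequency modalities to robustly reconstruct the deformed 3D garment from the transferred images.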

Link: https://arxiv.org/abs/2512.05593
Authors: Rong Wang, Wei Mao, Changsheng Lu, Hongdong Li
Affiliations: The Australian National University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to 3DV 2026

Abstract:We present a novel method for generating 3D garment deformations from given body poses, which is key to a wide range of applications, including virtual try-on and extended reality. To simplify the cloth dynamics, existing methods mostly rely on linear blend skinning to obtain low-frequency posed garment shape and only regress high-frequency wrinkles. However, due to the lack of explicit skinning supervision, such skinning-based approach often produces misaligned shapes when posing the garment, consequently corrupts the high-frequency signals and fails to recover high-fidelity wrinkles. To tackle this issue, we propose a skinning-free approach by independently estimating posed (i) vertex position for low-frequency posed garment shape, and (ii) vertex normal for high-frequency local wrinkle details. In this way, each frequency modality can be effectively decoupled and directly supervised by the geometry of the deformed garment. To further improve the visual quality of animation, we propose to encode both vertex attributes as rendered texture images, so that 3D garment deformation can be equivalently achieved via 2D image transfer. This enables us to leverage powerful pretrained image models to recover fine-grained visual details in wrinkles, while maintaining superior scalability for garments of diverse topologies without relying on manual UV partition. Finally, we propose a multimodal fusion to incorporate constraints from both frequency modalities and robustly recover deformed 3D garments from transferred images. Extensive experiments show that our method significantly improves animation quality on various garment types and recovers finer wrinkles than state-of-the-art methods.

[CV-39] MedDIFT: Multi-Scale Diffusion-Based Correspondence in 3D Medical Imaging

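[Quick read]: This paper addresses mismatches in medical image registration in low-contrast or anatomically variable regions, which arise because local intensity-based similarity measures fail to capture global semantic structure. The proposed MedDIFT is a training-free 3D correspondence framework that uses multi-scale features from a pretrained latent medical diffusion model as voxel descriptors: it fuses the diffusion activations, matches descriptors by cosine similarity, and optionally applies a local-search prior to establish correspondences between medical images.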

Link: https://arxiv.org/abs/2512.05571
Authors: Xingyu Zhang, Anna Reithmeir, Fryderyk Kögl, Rickmer Braren, Julia A. Schnabel, Daniel M. Lang
Affiliations: Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate spatial correspondence between medical images is essential for longitudinal analysis, lesion tracking, and image-guided interventions. Medical image registration methods rely on local intensity-based similarity measures, which fail to capture global semantic structure and often yield mismatches in low-contrast or anatomically variable regions. Recent advances in diffusion models suggest that their intermediate representations encode rich geometric and semantic information. We present MedDIFT, a training-free 3D correspondence framework that leverages multi-scale features from a pretrained latent medical diffusion model as voxel descriptors. MedDIFT fuses diffusion activations into rich voxel-wise descriptors and matches them via cosine similarity, with an optional local-search prior. On a publicly available lung CT dataset, MedDIFT achieves correspondence accuracy comparable to the state-of-the-art learning-based UniGradICON model and surpasses conventional B-spline-based registration, without requiring any task-specific model training. Ablation experiments confirm that multi-level feature fusion and modest diffusion noise improve performance.
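Matching voxel descriptors by cosine similarity with an optional local-search prior can be sketched as below. The dictionary of `(z, y, x)` positions, the cubic search window, and the function names are assumptions for illustration; how MedDIFT extracts and fuses the diffusion features is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity of two descriptor vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_voxel(query_pos, query_desc, target_descs, radius=None):
    """Find the target voxel whose descriptor best matches the query.

    target_descs: dict mapping (z, y, x) -> descriptor.
    radius: optional local-search prior; only voxels within this Chebyshev
            distance of query_pos are considered.
    """
    best_pos, best_sim = None, -2.0
    for pos, desc in target_descs.items():
        if radius is not None and max(abs(p - q) for p, q in zip(pos, query_pos)) > radius:
            continue
        sim = cosine(query_desc, desc)
        if sim > best_sim:
            best_pos, best_sim = pos, sim
    return best_pos, best_sim
```

The local-search prior trades recall for robustness: it rules out distant look-alike structures but assumes the two volumes are already roughly aligned.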

[CV-40] ProPhy: Progressive Physical Alignment for Dynamic World Simulation

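[Quick read]: This paper addresses the lack of physical consistency in current video generation models when handling large-scale or complex dynamics. The central difficulty is that existing approaches respond isotropically to physical prompts and do not align generated content with localized physical cues at a fine-grained level. The proposed Progressive Physical Alignment Framework (ProPhy) uses a two-stage Mixture-of-Physics-Experts (MoPE) mechanism in which Semantic Experts infer semantic-level physical principles from textual descriptions and Refinement Experts capture token-level physical dynamics. A physical alignment strategy additionally transfers the physical reasoning capabilities of vision-language models (VLMs) into the Refinement Experts, which improves the representation of dynamic physical phenomena and yields more realistic, dynamic, and physically coherent videos.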

Link: https://arxiv.org/abs/2512.05564
Authors: Zijun Wang, Panwen Hu, Jing Wang, Terry Jingchen Zhang, Yuhao Cheng, Long Chen, Yiqiang Yan, Zutao Jiang, Hanhui Li, Xiaodan Liang
Affiliations: Shenzhen Campus of Sun Yat-sen University; Peng Cheng Laboratory; Mohamed bin Zayed University of Artificial Intelligence; ETH Zürich; Lenovo Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advances in video generation have shown remarkable potential for constructing world simulators. However, current models still struggle to produce physically consistent results, particularly when handling large-scale or complex dynamics. This limitation arises primarily because existing approaches respond isotropically to physical prompts and neglect the fine-grained alignment between generated content and localized physical cues. To address these challenges, we propose ProPhy, a Progressive Physical Alignment Framework that enables explicit physics-aware conditioning and anisotropic generation. ProPhy employs a two-stage Mixture-of-Physics-Experts (MoPE) mechanism for discriminative physical prior extraction, where Semantic Experts infer semantic-level physical principles from textual descriptions, and Refinement Experts capture token-level physical dynamics. This mechanism allows the model to learn fine-grained, physics-aware video representations that better reflect underlying physical laws. Furthermore, we introduce a physical alignment strategy that transfers the physical reasoning capabilities of vision-language models (VLMs) into the Refinement Experts, facilitating a more accurate representation of dynamic physical phenomena. Extensive experiments on physics-aware video generation benchmarks demonstrate that ProPhy produces more realistic, dynamic, and physically coherent results than existing state-of-the-art methods.

[CV-41] 2K-Characters-10K-Stories: A Quality-Gated Stylized Narrative Dataset with Disentangled Control and Sequence Consistency

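[Quick read]: This paper addresses sequential identity consistency in controllable visual storytelling, that is, keeping a character's identity stable and coherent while precisely controlling transient attributes such as pose, expression, and scene composition. Existing datasets lack sufficient fidelity and do not disentangle stable identity from transient attributes, which limits structured sequential generation. The contribution has three parts. First, the authors build 2K-Characters-10K-Stories, a multi-modal narrative dataset with 2,000 uniquely stylized characters appearing across 10,000 illustration stories, which they describe as the first dataset to pair large-scale unique identities with explicit, decoupled control signals. Second, a Human-in-the-Loop (HiL) pipeline combines expert-verified character templates with LLM-guided narrative planning to generate highly aligned structured data. Third, a decoupled control scheme separates persistent identity from transient attributes (pose and expression), while a Quality-Gated loop that integrates multimodal language model evaluation, Auto-Prompt Tuning, and Local Image Editing enforces pixel-level consistency. Experiments show that models fine-tuned on the dataset perform comparably to closed-source models at generating visual narratives.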

Link: https://arxiv.org/abs/2512.05557
Authors: Xingxi Yin, Yicheng Li, Gong Yan, Chenglin Li, Jian Zhao, Cong Huang, Yue Deng, Yin Zhang
Affiliations: Zhejiang University; Zhongguancun Institute of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Sequential identity consistency under precise transient attribute control remains a long-standing challenge in controllable visual storytelling. Existing datasets lack sufficient fidelity and fail to disentangle stable identities from transient attributes, limiting structured control over pose, expression, and scene composition and thus constraining reliable sequential synthesis. To address this gap, we introduce 2K-Characters-10K-Stories, a multi-modal stylized narrative dataset of 2,000 uniquely stylized characters appearing across 10,000 illustration stories. It is the first dataset that pairs large-scale unique identities with explicit, decoupled control signals for sequential identity consistency. We introduce a Human-in-the-Loop pipeline (HiL) that leverages expert-verified character templates and LLM-guided narrative planning to generate highly-aligned structured data. A decoupled control scheme separates persistent identity from transient attributes – pose and expression – while a Quality-Gated loop integrating MMLM evaluation, Auto-Prompt Tuning, and Local Image Editing enforces pixel-level consistency. Extensive experiments demonstrate that models fine-tuned on our dataset achieve performance comparable to closed-source models in generating visual narratives.

[CV-42] Conscious Gaze: Adaptive Attention Mechanisms for Hallucination Mitigation in Vision-Language Models

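[Quick read]: This paper addresses text inertia in Large Vision-Language Models (VLMs), where attention drifts from visual evidence toward linguistic priors during inference and causes object hallucinations. Existing decoding strategies intervene only at the output logits and cannot correct attention drift in the internal reasoning, while internal-control methods based on heuristic head suppression or global steering vectors lack principled grounding. The proposed Conscious Gaze (CG-VLM) is a training-free, inference-time framework that turns game-theoretic interpretability into actionable decoding control. A Cognitive Demand Sensor built on Harsanyi interactions estimates the instantaneous vision-text synergy and identifies the moments when stronger visual grounding is needed; a Focused Consensus Induction module then selectively reorients mid-layer attention toward visual tokens before it collapses into text priors. The authors report state-of-the-art results on the POPE and CHAIR benchmarks while preserving general capabilities, which suggests that token-level sensing enables precise, context-aware intervention.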

Link: https://arxiv.org/abs/2512.05546
Authors: Weijue Bu, Guan Yuan, Guixian Zhang
Affiliations: China University of Mining and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 6 figures

Abstract:Large Vision-Language Models (VLMs) often exhibit text inertia, where attention drifts from visual evidence toward linguistic priors, resulting in object hallucinations. Existing decoding strategies intervene only at the output logits and thus cannot correct internal reasoning drift, while recent internal-control methods based on heuristic head suppression or global steering vectors lack principled grounding. We introduce Conscious Gaze (CG-VLM), a training-free, inference-time framework that converts game-theoretic interpretability into actionable decoding control. A Cognitive Demand Sensor built on Harsanyi interactions estimates instantaneous vision-text synergy and identifies moments when visual grounding is necessary. Conditioned on this signal, a Focused Consensus Induction module selectively reorients mid-layer attention toward visual tokens before collapse into text priors. CG-VLM achieves state-of-the-art results on POPE and CHAIR across InstructBLIP, LLaVA, Qwen-VL, and mPLUG, while preserving general capabilities, demonstrating that token-level sensing enables precise, context-aware intervention without compromising foundational knowledge.

[CV-43] Ideal Observer for Segmentation of Dead Leaves Images

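[Quick read]: This paper addresses how to make optimal perceptual segmentation decisions under the "dead leaves" image model, specifically how to partition a limited set of pixels into the surfaces they belong to. The core of the solution is a Bayesian ideal observer that, given independent distributions for object position, shape, color, and texture, computes the posterior probability of each partition of a given pixel set under the dead leaves model and thereby provides a principled upper bound on segmentation performance. The paper gives step-by-step explanations of the posterior computation and describes the factors that determine whether it is feasible in practice, providing a benchmark against which humans and vision algorithms can be compared.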

Link: https://arxiv.org/abs/2512.05539
Authors: Swantje Mahncke, Malte Ott
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Statistics Theory (math.ST); Methodology (stat.ME)
Comments: 41 pages, 16 figures

Abstract:The human visual environment is comprised of different surfaces that are distributed in space. The parts of a scene that are visible at any one time are governed by the occlusion of overlapping objects. In this work we consider “dead leaves” models, which replicate these occlusions when generating images by layering objects on top of each other. A dead leaves model is a generative model comprised of distributions for object position, shape, color and texture. An image is generated from a dead leaves model by sampling objects (“leaves”) from these distributions until a stopping criterion is reached, usually when the image is fully covered or until a given number of leaves was sampled. Here, we describe a theoretical approach, based on previous work, to derive a Bayesian ideal observer for the partition of a given set of pixels based on independent dead leaves model distributions. Extending previous work, we provide step-by-step explanations for the computation of the posterior probability as well as describe factors that determine the feasibility of practically applying this computation. The dead leaves image model and the associated ideal observer can be applied to study segmentation decisions in a limited number of pixels, providing a principled upper-bound on performance, to which humans and vision algorithms could be compared.
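The generative process described in the abstract (sample leaves from position, shape, and color distributions until the image is fully covered) can be sketched as below. The sketch assumes disk-shaped leaves, uniform distributions, and a flat color index with no texture, none of which is specified in the abstract. It also places each new leaf beneath the earlier ones, so a leaf only fills pixels that are still uncovered; that ordering is a choice of this sketch, made so that full coverage is a natural stopping point.

```python
import random

def dead_leaves_image(width, height, max_radius, n_colors, seed=0):
    """Sample a dead leaves image as a grid of color indices.

    Assumptions of this sketch: disk leaves, uniform position/radius/color,
    and front-to-back sampling (earlier leaves occlude later ones).
    """
    rng = random.Random(seed)
    image = [[None] * width for _ in range(height)]
    uncovered = width * height
    while uncovered > 0:
        # Centers may fall slightly outside the image so borders get covered too.
        cx = rng.uniform(-max_radius, width + max_radius)
        cy = rng.uniform(-max_radius, height + max_radius)
        r = rng.uniform(1.0, max_radius)
        color = rng.randrange(n_colors)
        for y in range(height):
            for x in range(width):
                if image[y][x] is None and (x - cx) ** 2 + (y - cy) ** 2 <= r * r:
                    image[y][x] = color
                    uncovered -= 1
    return image
```

The ground-truth partition of pixels into leaves is known by construction in such a model, which is what makes it suitable for comparing segmentation decisions against an ideal observer.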

[CV-44] See in Depth: Training-Free Surgical Scene Segmentation with Monocular Depth Priors

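[Quick read]: This paper addresses the difficulty of scaling pixel-wise segmentation of laparoscopic surgical scenes, which is caused by the high cost of dense annotation. The proposed DepSeg is a training-free framework that uses monocular depth as a geometric prior together with pretrained vision foundation models: depth-guided point prompts produce class-agnostic masks, and each mask is then classified by template matching, giving an annotation-efficient segmentation approach.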

Link: https://arxiv.org/abs/2512.05529
Authors: Kunyi Yang, Qingyu Wang, Cheng Yuan, Yutong Ban
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: The first two authors contributed equally

Abstract:Pixel-wise segmentation of laparoscopic scenes is essential for computer-assisted surgery but difficult to scale due to the high cost of dense annotations. We propose depth-guided surgical scene segmentation (DepSeg), a training-free framework that utilizes monocular depth as a geometric prior together with pretrained vision foundation models. DepSeg first estimates a relative depth map with a pretrained monocular depth estimation network and proposes depth-guided point prompts, which SAM2 converts into class-agnostic masks. Each mask is then described by a pooled pretrained visual feature and classified via template matching against a template bank built from annotated frames. On the CholecSeg8k dataset, DepSeg improves over a direct SAM2 auto segmentation baseline (35.9% vs. 14.7% mIoU) and maintains competitive performance even when using only 10–20% of the object templates. These results show that depth-guided prompting and template-based classification offer an annotation-efficient segmentation approach.

[CV-45] VOST-SGG: VLM-Aided One-Stage Spatio-Temporal Scene Graph Generation

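[Quick read]: This paper addresses two key limitations of current DETR-style one-stage spatio-temporal scene graph generation (ST-SGG) models: their learnable queries are semantically uninformed and initialized in an instance-agnostic way, which makes it hard for attention to localize precisely, and predicate classification relies only on unimodal visual features, which limits relational reasoning. The proposed VOST-SGG framework makes two contributions: (1) a dual-source query initialization strategy that disentangles "what to attend to" from "where to attend", enabling semantically grounded what-where reasoning; and (2) a multi-modal feature bank that fuses visual, textual, and spatial cues derived from vision-language models (VLMs) to improve predicate classification. Experiments on the Action Genome dataset show state-of-the-art performance, supporting the use of VLM-aided semantic priors and multi-modal features.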

Link: https://arxiv.org/abs/2512.05524
Authors: Chinthani Sugandhika, Chen Li, Deepu Rajan, Basura Fernando
Affiliations: College of Computing and Data Science, Nanyang Technological University, Singapore; Institute of High-Performance Computing, Agency for Science, Technology and Research, Singapore; Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Spatio-temporal scene graph generation (ST-SGG) aims to model objects and their evolving relationships across video frames, enabling interpretable representations for downstream reasoning tasks such as video captioning and visual question answering. Despite recent advancements in DETR-style single-stage ST-SGG models, they still suffer from several key limitations. First, while these models rely on attention-based learnable queries as a core component, these learnable queries are semantically uninformed and instance-agnostically initialized. Second, these models rely exclusively on unimodal visual features for predicate classification. To address these challenges, we propose VOST-SGG, a VLM-aided one-stage ST-SGG framework that integrates the common sense reasoning capabilities of vision-language models (VLMs) into the ST-SGG pipeline. First, we introduce the dual-source query initialization strategy that disentangles what to attend to from where to attend, enabling semantically grounded what-where reasoning. Furthermore, we propose a multi-modal feature bank that fuses visual, textual, and spatial cues derived from VLMs for improved predicate classification. Extensive experiments on the Action Genome dataset demonstrate that our approach achieves state-of-the-art performance, validating the effectiveness of integrating VLM-aided semantic priors and multi-modal features for ST-SGG. We will release the code at this https URL.

[CV-46] DashFusion: Dual-stream Alignment with Hierarchical Bottleneck Fusion for Multimodal Sentiment Analysis

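[Quick read]: This paper addresses the alignment and fusion problems in Multimodal Sentiment Analysis (MSA). Existing methods usually handle the two in isolation, which limits both performance and efficiency. The proposed framework, Dual-stream Alignment with Hierarchical Bottleneck Fusion (DashFusion), works in three steps. First, a dual-stream alignment module synchronizes modalities temporally and semantically: temporal alignment uses cross-modal attention to establish frame-level correspondences, and semantic alignment uses contrastive learning to keep the feature space consistent. Second, supervised contrastive learning uses label information to refine the modality features. Third, hierarchical bottleneck fusion progressively integrates multimodal information through compressed bottleneck tokens, balancing performance against computational efficiency.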

Link: https://arxiv.org/abs/2512.05515
Authors: Yuhua Wen, Qifei Li, Yingying Zhou, Yingming Gao, Zhengqi Wen, Jianhua Tao, Ya Li
Affiliations: Beijing University of Posts and Telecommunications; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2025

Abstract:Multimodal sentiment analysis (MSA) integrates various modalities, such as text, image, and audio, to provide a more comprehensive understanding of sentiment. However, effective MSA is challenged by alignment and fusion issues. Alignment requires synchronizing both temporal and semantic information across modalities, while fusion involves integrating these aligned features into a unified representation. Existing methods often address alignment or fusion in isolation, leading to limitations in performance and efficiency. To tackle these issues, we propose a novel framework called Dual-stream Alignment with Hierarchical Bottleneck Fusion (DashFusion). Firstly, dual-stream alignment module synchronizes multimodal features through temporal and semantic alignment. Temporal alignment employs cross-modal attention to establish frame-level correspondences among multimodal sequences. Semantic alignment ensures consistency across the feature space through contrastive learning. Secondly, supervised contrastive learning leverages label information to refine the modality features. Finally, hierarchical bottleneck fusion progressively integrates multimodal information through compressed bottleneck tokens, which achieves a balance between performance and computational efficiency. We evaluate DashFusion on three datasets: CMU-MOSI, CMU-MOSEI, and CH-SIMS. Experimental results demonstrate that DashFusion achieves state-of-the-art performance across various metrics, and ablation studies confirm the effectiveness of our alignment and fusion techniques. The codes for our experiments are available at this https URL.

[CV-47] Know-Show: Benchmarking Video-Language Models on Spatio-Temporal Grounded Reasoning

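[Quick Read]: This paper addresses the weak spatio-temporal grounded reasoning of current Large Video-Language Models (Video-LMs): the models struggle to tie actions and their semantic reasoning to visual and temporal evidence. To evaluate this ability, the authors introduce the Know-Show benchmark, which unifies reasoning and localization across five complementary scenarios covering spatial (person, object, person-object, hand-object) and temporal dimensions. Its 2.5K human-authored questions, built from Charades, Action Genome, and Ego4D, expose a significant gap between existing models and human reasoning. The key to the solution is GRAM (Grounded Reasoning Augmentation Module), a training-free plug-in that strengthens the fine-grained spatio-temporal grounding of Video-LMs through attention-based video token selection and explicit timestamp encoding. It improves the consistency between "showing what they know" and "understanding what they see", most notably in fine-grained scenarios such as hand-object interaction.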

Link: https://arxiv.org/abs/2512.05513
Authors: Chinthani Sugandhika, Chen Li, Deepu Rajan, Basura Fernando
Affiliations: College of Computing and Data Science, Nanyang Technological University, Singapore; Institute of High-Performance Computing, Agency for Science, Technology and Research, Singapore; Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large Video-Language Models (Video-LMs) have achieved impressive progress in multimodal understanding, yet their reasoning remains weakly grounded in space and time. We present Know-Show, a new benchmark designed to evaluate spatio-temporal grounded reasoning, the ability of a model to reason about actions and their semantics while simultaneously grounding its inferences in visual and temporal evidence. Know-Show unifies reasoning and localization within a single evaluation framework consisting of five complementary scenarios across spatial (person, object, person-object, and hand-object) and temporal dimensions. Built from Charades, Action Genome, and Ego4D with 2.5K human-authored questions, the benchmark exposes significant gaps between current Video-LMs and human reasoning. To bridge this gap, we propose GRAM, a training-free plug-in that augments Video-LMs with fine-grained grounding through attention-based video token selection and explicit timestamp encoding. Extensive experiments across open and closed Video-LMs (Qwen, VideoLLaVA, GPT-4o, and Gemini, etc.) reveal that existing models struggle to “show what they know” and vice versa, especially in fine-grained hand-object interactions. Know-Show establishes a unified standard for assessing grounded reasoning in video-language understanding and provides insights toward developing interpretable and reliable multimodal reasoning systems. We will release the code at this https URL.
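
The abstract describes GRAM only at a high level (attention-based video token selection plus explicit timestamp encoding). The sketch below illustrates that general idea under stated assumptions: the function name `select_tokens`, the score list, and the `<t=...s>` tag format are all hypothetical and are not taken from the paper.

```python
def select_tokens(attn_scores, timestamps, k):
    """Keep the k video tokens with the highest attention score, in temporal order.

    Assumption: one scalar attention score and one timestamp (seconds) per token.
    """
    if len(attn_scores) != len(timestamps):
        raise ValueError("one timestamp per token is required")
    # Rank token indices by score (descending) and keep the top k.
    top = sorted(range(len(attn_scores)), key=lambda i: attn_scores[i], reverse=True)[:k]
    # Restore temporal order so the kept tokens stay chronological.
    top.sort()
    # Attach an explicit textual timestamp tag to each kept token index.
    return [(i, f"<t={timestamps[i]:.1f}s>") for i in top]


scores = [0.05, 0.40, 0.10, 0.30, 0.15]  # hypothetical text-to-video attention weights
times = [0.0, 0.5, 1.0, 1.5, 2.0]        # hypothetical token timestamps in seconds
print(select_tokens(scores, times, k=2))
```

Sorting the kept indices back into temporal order is a design choice of this sketch; it keeps the timestamp tags monotonic for the language model.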

[CV-48] Rethinking Infrared Small Target Detection: A Foundation-Driven Efficient Paradigm

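[Quick Read]: This paper addresses the largely untapped potential of large-scale Visual Foundation Models (VFMs) for Single-Frame Infrared Small Target (SIRST) detection. Existing methods struggle to exploit the global semantic priors carried by VFMs, and they also face high inference overhead and a fragmented evaluation system. The key to the solution is a Foundation-Driven Efficient Paradigm (FDEP) with three parts: 1) a Semantic Alignment Modulation Fusion (SAMF) module that dynamically aligns and deeply fuses the global semantic priors of VFMs with task-specific features; 2) a Collaborative Optimization-based Implicit Self-Distillation (CO-ISD) strategy that transfers semantics implicitly between the main branch and a lightweight branch through parameter sharing and synchronized backpropagation, adding no inference latency; 3) a unified Holistic SIRST Evaluation (HSE) metric that performs multi-threshold integral evaluation of pixel-level confidence and target-level robustness, making model comparison fairer and more comprehensive.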

Link: https://arxiv.org/abs/2512.05511
Authors: Chuang Yu, Jinmiao Zhao, Yunpeng Liu, Yaokun Li, Xiujun Shu, Yuanhao Feng, Bo Wang, Yimian Dai, Xiangyu Yue
Affiliations: Chinese Academy of Sciences; Shenyang Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Sun Yat-sen University; Tencent; Nankai University; MMLab, The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While large-scale visual foundation models (VFMs) exhibit strong generalization across diverse visual domains, their potential for single-frame infrared small target (SIRST) detection remains largely unexplored. To fill this gap, we systematically introduce the frozen representations from VFMs into the SIRST task for the first time and propose a Foundation-Driven Efficient Paradigm (FDEP), which can seamlessly adapt to existing encoder-decoder-based methods and significantly improve accuracy without additional inference overhead. Specifically, a Semantic Alignment Modulation Fusion (SAMF) module is designed to achieve dynamic alignment and deep fusion of the global semantic priors from VFMs with task-specific features. Meanwhile, to avoid the inference time burden introduced by VFMs, we propose a Collaborative Optimization-based Implicit Self-Distillation (CO-ISD) strategy, which enables implicit semantic transfer between the main and lightweight branches through parameter sharing and synchronized backpropagation. In addition, to unify the fragmented evaluation system, we construct a Holistic SIRST Evaluation (HSE) metric that performs multi-threshold integral evaluation at both pixel-level confidence and target-level robustness, providing a stable and comprehensive basis for fair model comparison. Extensive experiments demonstrate that the SIRST detection networks equipped with our FDEP framework achieve state-of-the-art (SOTA) performance on multiple public datasets. Our code is available at this https URL
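
The HSE metric is described as a multi-threshold integral evaluation, but the abstract does not give its formula. As a rough illustration of what "multi-threshold" evaluation means at the pixel level, the sketch below averages IoU over a sweep of binarisation thresholds. This is an assumed stand-in, not the HSE definition; the function names and sample values are hypothetical.

```python
def iou_at_threshold(probs, labels, thr):
    """Pixel-level IoU after binarising predicted probabilities at `thr`."""
    pred = [p >= thr for p in probs]
    inter = sum(1 for p, y in zip(pred, labels) if p and y)
    union = sum(1 for p, y in zip(pred, labels) if p or y)
    # Convention of this sketch: an empty prediction on an empty mask counts as perfect.
    return inter / union if union else 1.0


def multi_threshold_iou(probs, labels, thresholds):
    """Mean IoU over a threshold sweep (a discrete stand-in for an integral)."""
    return sum(iou_at_threshold(probs, labels, t) for t in thresholds) / len(thresholds)


probs = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]  # hypothetical per-pixel confidences
labels = [1, 1, 1, 0, 0, 0]             # hypothetical ground-truth mask
score = multi_threshold_iou(probs, labels, thresholds=[0.3, 0.5, 0.7])
print(round(score, 4))
```

Averaging over thresholds rewards models whose confidence maps are well separated, instead of models that happen to do well at a single hand-picked cut-off.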

[CV-49] Decoding with Structured Awareness: Integrating Directional Frequency-Spatial and Structural Attention for Medical Image Segmentation AAAI2026

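[Quick Read]: This paper addresses the weaknesses of Transformer decoders in medical image segmentation: insufficient capture of edge details, weak recognition of local textures, and poor modeling of spatial continuity. The key to the solution is a new decoder framework with three core modules: (1) an Adaptive Cross-Fusion Attention (ACFA) module, which introduces learnable guidance in three directions to strengthen the response to key regions and structural orientations; (2) a Triple Feature Fusion Attention (TFFA) module, which fuses features from the spatial, Fourier, and wavelet domains into a joint frequency-spatial representation that strengthens global dependency and structural modeling while preserving local information such as edges and textures; (3) a Structural-aware Multi-scale Masking Module (SMMM), which optimizes the skip connections between encoder and decoder and uses multi-scale context and structural saliency filtering to reduce feature redundancy and improve the quality of semantic interaction. Working together, these modules significantly improve performance and generalization on high-precision tasks such as tumor segmentation and organ boundary extraction.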

Link: https://arxiv.org/abs/2512.05494
Authors: Fan Zhang, Zhiwei Gu, Hua Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to AAAI 2026

Abstract:To address the limitations of Transformer decoders in capturing edge details, recognizing local textures and modeling spatial continuity, this paper proposes a novel decoder framework specifically designed for medical image segmentation, comprising three core modules. First, the Adaptive Cross-Fusion Attention (ACFA) module integrates channel feature enhancement with spatial attention mechanisms and introduces learnable guidance in three directions (planar, horizontal, and vertical) to enhance responsiveness to key regions and structural orientations. Second, the Triple Feature Fusion Attention (TFFA) module fuses features from Spatial, Fourier and Wavelet domains, achieving joint frequency-spatial representation that strengthens global dependency and structural modeling while preserving local information such as edges and textures, making it particularly effective in complex and blurred boundary scenarios. Finally, the Structural-aware Multi-scale Masking Module (SMMM) optimizes the skip connections between encoder and decoder by leveraging multi-scale context and structural saliency filtering, effectively reducing feature redundancy and improving semantic interaction quality. Working synergistically, these modules not only address the shortcomings of traditional decoders but also significantly enhance performance in high-precision tasks such as tumor segmentation and organ boundary extraction, improving both segmentation accuracy and model generalization. Experimental results demonstrate that this framework provides an efficient and practical solution for medical image segmentation.

[CV-50] WaterWave: Bridging Underwater Image Enhancement into Video Streams via Wavelet-based Temporal Consistency Field

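[Quick Read]: This paper addresses the temporal inconsistency of existing underwater video enhancement methods, which arises because they apply single-image enhancement models frame by frame. The key to the solution is WaterWave, an implicit representation built on a wavelet-based temporal consistency field. It uses a temporal consistency prior, observed in dynamic scenes from the local temporal frequency perspective, to progressively filter out inconsistent components without paired data while preserving motion details and scene content. An underwater flow correction module is also designed to account for underwater transmission when rectifying the estimated optical flow, which yields naturally flowing enhanced video.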

Link: https://arxiv.org/abs/2512.05492
Authors: Qi Zhu, Jingyi Zhang, Naishan Zheng, Wei Yu, Jinghao Zhang, Deyi Ji, Feng Zhao
Affiliations: University of Science and Technology of China; Hefei University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Paired underwater videos are difficult to obtain because of the complexity of underwater imaging. As a result, most existing underwater video enhancement methods apply a single-image enhancement model frame by frame, which naturally leads to a lack of temporal consistency. To relieve the problem, we rethink the temporal manifold inherent in natural videos and observe a temporal consistency prior in dynamic scenes from the local temporal frequency perspective. Building upon this prior and the no-paired-data condition, we propose an implicit representation for enhanced video signals, which is built in a wavelet-based temporal consistency field, WaterWave. Specifically, under the constraints of the prior, we progressively filter and attenuate the inconsistent components while preserving motion details and scenes, achieving a naturally flowing video. Furthermore, to represent temporal frequency bands more accurately, an underwater flow correction module is designed to rectify estimated flows considering the transmission in underwater scenes. Extensive experiments demonstrate that WaterWave significantly enhances the quality of videos generated using single-image underwater enhancement methods. Additionally, our method demonstrates high potential in downstream underwater tracking tasks, such as UOSTrack and MAT, outperforming the original video by a large margin, i.e., 19.7% and 9.7% in precision, respectively.
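
To make the "temporal frequency" viewpoint concrete, the sketch below applies a one-level Haar wavelet transform along time to a single pixel's intensity sequence and attenuates the detail (high-frequency) coefficients. This is a generic temporal wavelet filter chosen for illustration, not WaterWave's implicit field; the function names and the `gain` parameter are assumptions.

```python
import math


def haar_forward(x):
    """One-level orthonormal Haar transform of an even-length sequence."""
    if len(x) % 2:
        raise ValueError("sequence length must be even")
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Exact inverse of haar_forward."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out


def attenuate_flicker(x, gain):
    """Scale the temporal detail band by `gain` (1.0 keeps it, 0.0 removes it)."""
    approx, detail = haar_forward(x)
    return haar_inverse(approx, [gain * d for d in detail])


pixel_over_time = [0.50, 0.60, 0.52, 0.40]  # hypothetical flickering intensities
print(attenuate_flicker(pixel_over_time, gain=0.0))
```

With `gain=0.0` each frame pair collapses to its average, which removes frame-to-frame flicker but would also blur real motion; a practical method has to decide which detail coefficients are inconsistency and which are motion.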

[CV-51] Concept-based Explainable Data Mining with VLM for 3D Detection

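[Quick Read]: This paper addresses the poor detection of rare objects in autonomous driving systems that rely only on point cloud data. The core challenge is that rare categories (such as construction vehicles, motorcycles, and barriers) have few samples in the training data, so models struggle to learn their features. The key to the solution is a novel cross-modal framework that uses 2D Vision-Language Models (VLMs) to mine rare objects from driving scenes, and that combines multi-faceted outlier detection (Isolation Forest and outlier identification after t-SNE dimensionality reduction) with concept-based semantic filtering to systematically identify rare objects that are semantically meaningful. The method substantially reduces the annotation burden and focuses on the most valuable training samples, substantially improving 3D object detection with only a fraction of the data, particularly for hard categories such as trailers and bicycles.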

Link: https://arxiv.org/abs/2512.05482
Authors: Mai Tsujimoto
Affiliations: The University of Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 28 pages including appendix. Code: this https URL

Abstract:Rare-object detection remains a challenging task in autonomous driving systems, particularly when relying solely on point cloud data. Although Vision-Language Models (VLMs) exhibit strong capabilities in image understanding, their potential to enhance 3D object detection through intelligent data mining has not been fully explored. This paper proposes a novel cross-modal framework that leverages 2D VLMs to identify and mine rare objects from driving scenes, thereby improving 3D object detection performance. Our approach synthesizes complementary techniques such as object detection, semantic feature extraction, dimensionality reduction, and multi-faceted outlier detection into a cohesive, explainable pipeline that systematically identifies rare but critical objects in driving scenes. By combining Isolation Forest and t-SNE-based outlier detection methods with concept-based filtering, the framework effectively identifies semantically meaningful rare objects. A key strength of this approach lies in its ability to extract and annotate targeted rare object concepts such as construction vehicles, motorcycles, and barriers. This substantially reduces the annotation burden and focuses only on the most valuable training samples. Experiments on the nuScenes dataset demonstrate that this concept-guided data mining strategy enhances the performance of 3D object detection models while utilizing only a fraction of the training data, with particularly notable improvements for challenging object categories such as trailers and bicycles compared with the same amount of random data. This finding has substantial implications for the efficient curation of datasets in safety-critical autonomous systems.

[CV-52] UniFS: Unified Multi-Contrast MRI Reconstruction via Frequency-Spatial Fusion

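[Quick Read]: This paper studies the limited generalization of models for Multi-Contrast MR Reconstruction (MCMR): existing methods adapt poorly to different k-space undersampling patterns and need a separate model trained for each pattern, which limits practical use. The key to the solution is UniFS (Unified Frequency-Spatial Fusion), whose core innovations are: 1) a cross-modal frequency fusion module that extracts domain-invariant features across undersampling patterns; 2) an adaptive mask-based prompt learning module that dynamically adapts to the differences between patterns; 3) a dual-branch complementary refinement module that strengthens the joint use of spatial and frequency information. The adaptive prompt-guided frequency fusion mechanism is what moves beyond earlier methods that attend only to spatial information or to shallow frequency features, and it significantly improves generalization to unseen undersampling patterns.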

Link: https://arxiv.org/abs/2512.05481
Authors: Jialin Li, Yiwei Ren, Kai Pan, Dong Wei, Pujin Cheng, Xian Wu, Xiaoying Tang
Affiliations: Southern University of Science and Technology; Tsinghua University; Shenzhen Key Laboratory of Deep Learning and Computer Vision
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Recently, Multi-Contrast MR Reconstruction (MCMR) has emerged as a hot research topic that leverages high-quality auxiliary modalities to reconstruct undersampled target modalities of interest. However, existing methods often struggle to generalize across different k-space undersampling patterns, requiring the training of a separate model for each specific pattern, which limits their practical applicability. To address this challenge, we propose UniFS, a Unified Frequency-Spatial Fusion model designed to handle multiple k-space undersampling patterns for MCMR tasks without any need for retraining. UniFS integrates three key modules: a Cross-Modal Frequency Fusion module, an Adaptive Mask-Based Prompt Learning module, and a Dual-Branch Complementary Refinement module. These modules work together to extract domain-invariant features from diverse k-space undersampling patterns while dynamically adapting to their own variations. Another limitation of existing MCMR methods is their tendency to focus solely on spatial information while neglecting frequency characteristics, or to extract only shallow frequency features, thus failing to fully leverage complementary cross-modal frequency information. To relieve this issue, UniFS introduces an adaptive prompt-guided frequency fusion module for k-space learning, significantly enhancing the model's generalization performance. We evaluate our model on the BraTS and HCP datasets with various k-space undersampling patterns and acceleration factors, including previously unseen patterns, to comprehensively assess UniFS's generalizability. Experimental results across multiple scenarios demonstrate that UniFS achieves state-of-the-art performance. Our code is available at this https URL.
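
For readers unfamiliar with the terms, a "k-space undersampling pattern" with a given "acceleration factor" can be pictured as a binary mask over phase-encoding lines. The sketch below builds one common kind of pattern, equispaced Cartesian sampling with a fully sampled low-frequency centre. The function name and the `center_fraction` parameter are assumptions for illustration; the abstract does not say which patterns UniFS uses.

```python
def cartesian_mask(num_lines, acceleration, center_fraction):
    """1D equispaced undersampling mask with a fully sampled centre band.

    True means the k-space line is acquired. `center_fraction` is the share of
    lines around the middle of k-space that are always kept (an assumption of
    this sketch).
    """
    # Keep every `acceleration`-th line.
    mask = [i % acceleration == 0 for i in range(num_lines)]
    # Always keep a contiguous block of low-frequency lines in the middle.
    num_center = round(num_lines * center_fraction)
    start = (num_lines - num_center) // 2
    for i in range(start, start + num_center):
        mask[i] = True
    return mask


mask = cartesian_mask(num_lines=16, acceleration=4, center_fraction=0.25)
print("".join("#" if m else "." for m in mask))
print("effective acceleration:", len(mask) / sum(mask))
```

Changing `acceleration` or swapping the equispaced rule for a random one produces a different pattern, which is the kind of variation a single model would need to handle without retraining.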

[CV-53] EmoStyle: Emotion-Driven Image Stylization

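[Quick Read]: This paper addresses the fact that existing image stylization methods transform visual appearance but overlook the emotional impact carried by artistic styles, that is, how to make a generated image evoke a specific emotion while keeping its content consistent. The key to the solution is the EmoStyle framework, which has two core components. The first is EmoStyleSet, a dataset of content-emotion-stylized image triplets built from ArtEmis to support emotion-driven stylization. The second is an Emotion-Content Reasoner, which adaptively fuses emotional cues with content information to learn coherent style queries, paired with a Style Quantizer that maps continuous style features to emotion-related codebook entries, enabling emotion-aware style representation and transfer.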

Link: https://arxiv.org/abs/2512.05478
Authors: Jingyuan Yang, Zihuan Bai, Hui Huang
Affiliations: Shenzhen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Art has long been a profound medium for expressing emotions. While existing image stylization methods effectively transform visual appearance, they often overlook the emotional impact carried by styles. To bridge this gap, we introduce Affective Image Stylization (AIS), a task that applies artistic styles to evoke specific emotions while preserving content. We present EmoStyle, a framework designed to address key challenges in AIS, including the lack of training data and the emotion-style mapping. First, we construct EmoStyleSet, a content-emotion-stylized image triplet dataset derived from ArtEmis to support AIS. We then propose an Emotion-Content Reasoner that adaptively integrates emotional cues with content to learn coherent style queries. Given the discrete nature of artistic styles, we further develop a Style Quantizer that converts continuous style features into emotion-related codebook entries. Extensive qualitative and quantitative evaluations, including user studies, demonstrate that EmoStyle enhances emotional expressiveness while maintaining content consistency. Moreover, the learned emotion-aware style dictionary is adaptable to other generative tasks, highlighting its potential for broader applications. Our work establishes a foundation for emotion-driven image stylization, expanding the creative potential of AI-generated art.
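
The Style Quantizer is described as converting continuous style features into codebook entries. A minimal sketch of that quantization step, assuming a plain nearest-neighbour lookup under squared Euclidean distance (the paper may use a different distance or a learned assignment), looks like this. The codebook values are hypothetical.

```python
def quantize(feature, codebook):
    """Return (index, entry) of the codebook vector nearest to `feature`."""
    def sq_dist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    idx = min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))
    return idx, codebook[idx]


# Hypothetical 2-D codebook; in practice entries would be learned and high-dimensional.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
idx, entry = quantize([0.2, 0.9], codebook)
print(idx, entry)
```

Snapping to a discrete entry reflects the abstract's point that artistic styles are discrete in nature, and it yields a reusable dictionary of styles.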

[CV-54] University Building Recognition Dataset in Thailand for the mission-oriented IoT sensor system

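[Quick Read]: This paper addresses the challenge of collaborative training on edge devices, in particular how to improve model performance in Wireless Ad Hoc Federated Learning (WAFL) scenarios. Industry already deploys machine learning inference on edge devices, but training at the edge is still constrained by limited compute and inefficient communication. The study uses the WAFL-ViT framework, which combines WAFL with the Vision Transformer (ViT) architecture so that devices learn collaboratively through direct communication without a central server, and it builds a dedicated dataset, the Chulalongkorn University Building Recognition Dataset (CUBR), to fit the needs of a specific mission. The key points are: 1) WAFL provides a decentralized, distributed training mechanism; 2) training in WAFL scenarios achieves better accuracy than self-training, which supports its effectiveness in practical deployment.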

Link: https://arxiv.org/abs/2512.05468
Authors: Takara Taniguchi, Yudai Ueda, Atsuya Muramatsu, Kohki Hashimoto, Ryo Yagi, Hideya Ochiai, Chaodit Aswakul
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract: Many industrial sectors have been using machine learning in inference mode on edge devices. Future directions show that training on edge devices is promising due to improvements in semiconductor performance. Wireless Ad Hoc Federated Learning (WAFL) has been proposed as a promising approach for collaborative learning with device-to-device communication among edges. In particular, WAFL with Vision Transformer (WAFL-ViT) has been tested on image recognition tasks with the UTokyo Building Recognition Dataset (UTBR). Since WAFL-ViT is a mission-oriented sensor system, it is essential to construct specific datasets for each mission. In our work, we have developed the Chulalongkorn University Building Recognition Dataset (CUBR), which is specialized for Chulalongkorn University as a case study in Thailand. Additionally, our results demonstrate that training in WAFL scenarios achieves better accuracy than self-training scenarios. The dataset is available at this https URL.
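
The abstract characterises WAFL as collaborative learning through device-to-device communication, without giving the aggregation rule. The sketch below shows one simple serverless rule of that kind: each device nudges its parameters toward those of the devices currently in radio range. The update formula, the `rate` coefficient, and the function name are assumptions for illustration and should not be read as the published WAFL algorithm.

```python
def wafl_step(models, neighbours, rate):
    """One synchronous exchange round without a central server.

    models:     list of parameter vectors, one per device
    neighbours: neighbours[i] lists the devices currently in range of device i
    rate:       mixing coefficient (assumed hyperparameter)
    """
    updated = []
    for i, w in enumerate(models):
        nbrs = neighbours[i]
        new_w = list(w)
        for j in range(len(w)):
            # Sum of parameter differences to every neighbour in range.
            pull = sum(models[k][j] - w[j] for k in nbrs)
            new_w[j] = w[j] + rate * pull / (len(nbrs) + 1)
        updated.append(new_w)
    return updated


# Three devices in a line topology: 0 <-> 1 <-> 2, each holding one parameter.
models = [[0.0], [3.0], [6.0]]
neighbours = [[1], [0, 2], [1]]
print(wafl_step(models, neighbours, rate=1.0))
```

In a real deployment each device would also run local training steps on its own data between exchange rounds; that part is omitted here.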

[CV-55] TED-4DGS: Temporally Activated and Embedding-based Deformation for 4DGS Compression

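[Quick Read]: This paper addresses the trade-off between compression efficiency and reconstruction quality in dynamic 3D scene representation, in particular the lack of rate-distortion optimization (RDO) for 4D Gaussian Splatting (4DGS). Existing methods either model space and time jointly, which produces redundant Gaussian primitives, or rely on deformation mechanisms without explicit temporal control, so compact and efficient dynamic 3DGS representations remain hard to obtain. The key to the solution is TED-4DGS, a temporally activated and embedding-based deformation scheme. First, learnable temporal-activation parameters are added to sparse anchors to control the appearance and disappearance transitions of each anchor. Second, a lightweight per-anchor temporal embedding queries a shared deformation bank to produce temporally consistent anchor deformations. In addition, an implicit neural representation (INR)-based hyperprior and a channel-wise autoregressive model capture the distribution of anchor attributes and intra-anchor correlations, giving state-of-the-art rate-distortion performance on several real-world datasets.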

Link: https://arxiv.org/abs/2512.05446
Authors: Cheng-Yuan Ho, He-Bi Yang, Jui-Chiu Chiang, Yu-Lun Liu, Wen-Hsiao Peng
Affiliations: National Yang Ming Chiao Tung University; National Chung Cheng University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Building on the success of 3D Gaussian Splatting (3DGS) in static 3D scene representation, its extension to dynamic scenes, commonly referred to as 4DGS or dynamic 3DGS, has attracted increasing attention. However, designing more compact and efficient deformation schemes together with rate-distortion-optimized compression strategies for dynamic 3DGS representations remains an underexplored area. Prior methods either rely on space-time 4DGS with overspecified, short-lived Gaussian primitives or on canonical 3DGS with deformation that lacks explicit temporal control. To address this, we present TED-4DGS, a temporally activated and embedding-based deformation scheme for rate-distortion-optimized 4DGS compression that unifies the strengths of both families. TED-4DGS is built on a sparse anchor-based 3DGS representation. Each canonical anchor is assigned learnable temporal-activation parameters to specify its appearance and disappearance transitions over time, while a lightweight per-anchor temporal embedding queries a shared deformation bank to produce anchor-specific deformation. For rate-distortion compression, we incorporate an implicit neural representation (INR)-based hyperprior to model anchor attribute distributions, along with a channel-wise autoregressive model to capture intra-anchor correlations. With these novel elements, our scheme achieves state-of-the-art rate-distortion performance on several real-world datasets. To the best of our knowledge, this work represents one of the first attempts to pursue a rate-distortion-optimized compression framework for dynamic 3DGS representations.
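
The abstract says each anchor carries learnable temporal-activation parameters that specify its appearance and disappearance transitions, without giving the functional form. One plausible form, assumed here purely for illustration, is a soft window built from two sigmoids; `t_on`, `t_off`, and `sharpness` are hypothetical parameter names.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def temporal_activation(t, t_on, t_off, sharpness):
    """Soft window: close to 1 between t_on and t_off, close to 0 outside.

    The rising sigmoid models appearance, the falling one disappearance.
    """
    return sigmoid(sharpness * (t - t_on)) * sigmoid(sharpness * (t_off - t))


# A hypothetical anchor that is active for normalised time in [0.2, 0.8].
for t in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(t, round(temporal_activation(t, t_on=0.2, t_off=0.8, sharpness=50.0), 4))
```

Because the window is smooth in `t_on`, `t_off`, and `sharpness`, all three could be optimised by gradient descent, and an anchor whose activation is near zero at a given time can be skipped when rendering that frame.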

[CV-56] EXR: An Interactive Immersive EHR Visualization in Extended Reality

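[Quick Read]: This paper addresses the limited visualization capability of conventional Electronic Health Record (EHR) systems for clinical decision support, especially the difficulty of showing the complex relationships between structured and unstructured patient data intuitively. The key to the solution is an Extended Reality (XR) platform that brings FHIR-based EHR data, volumetric medical imaging, and AI-generated segmentation results into a shared 3D environment. This gives a spatial, interactive presentation of multimodal data with real-time collaboration, and offers a directly accessible data infrastructure for next-generation clinical decision-support tools.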

Link: https://arxiv.org/abs/2512.05438
Authors: Benoit Marteau, Shaun Q. Y. Tan, Jieru Li, Andrew Hornback, Yishan Zhong, Shaunna Wang, Christian Lowson, Jason Woloff, Joshua M. Pahys, Steven W. Hwang, Coleman Hilton, May D. Wang
Affiliations: Georgia Institute of Technology; Georgia Tech & Emory University; Shriners Hospitals for Children
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: 11 pages, 6 figures. Preprint version. This paper has been accepted to IEEE ICIR 2025. This is the author-prepared version and not the final published version. The final version will appear in IEEE Xplo

Abstract:This paper presents the design and implementation of an Extended Reality (XR) platform for immersive, interactive visualization of Electronic Health Records (EHRs). The system extends beyond conventional 2D interfaces by visualizing both structured and unstructured patient data into a shared 3D environment, enabling intuitive exploration and real-time collaboration. The modular infrastructure integrates FHIR-based EHR data with volumetric medical imaging and AI-generated segmentation, ensuring interoperability with modern healthcare systems. The platform’s capabilities are demonstrated using synthetic EHR datasets and computed tomography (CT)-derived spine models processed through an AI-powered segmentation pipeline. This work suggests that such integrated XR solutions could form the foundation for next-generation clinical decision-support tools, where advanced data infrastructures are directly accessible in an interactive and spatially rich environment.

[CV-57] ParaUni: Enhance Generation in Unified Multimodal Model with Reinforcement-driven Hierarchical Parallel Information Interaction

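[Quick Read]: This paper addresses the difficulty unified multimodal models have in balancing sufficient interaction with flexible implementation in visual generation, which stems from the large representation differences between the layers of the vision-language model (VLM). The key to the solution is the ParaUni framework, which extracts multi-level VLM features (from low-level details to high-level semantics) in parallel and feeds them to a Layer Integration Module (LIM) that efficiently fuses fine-grained information with semantic abstractions while keeping the architecture flexible. It further designs a Layer-wise Dynamic Adjustment Mechanism (LDAM) that builds on the observation that different layers respond unequally to different rewards in Reinforcement Learning (RL), and uses this to optimize multiple reward objectives, improving generation quality and multi-objective adaptability in the RL stage.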

Link: https://arxiv.org/abs/2512.05422
Authors: Jiangtong Tan, Lin Liu, Jie Huanng, Xiaopeng Zhang, Qi Tian, Feng Zhao
Affiliations: University of Science and Technology of China; Huawei Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Unified multimodal models significantly improve visual generation by combining vision-language models (VLMs) with diffusion models. However, existing methods struggle to fully balance sufficient interaction and flexible implementation due to the vast representation differences. Considering the abundant and hierarchical information in a VLM's layers, from low-level details to high-level semantics, we propose ParaUni. It extracts features from various VLM layers in a Parallel way for comprehensive information interaction and retains a flexible separation architecture to enhance generation in a Unified multimodal model. Concretely, visual features from all of the VLM's layers are fed in parallel into a Layer Integration Module (LIM), which efficiently integrates fine-grained details and semantic abstractions and provides the fused representation as a condition to the diffusion model. To further enhance performance, we reveal that these hierarchical layers respond unequally to different rewards in Reinforcement Learning (RL). Crucially, we design a Layer-wise Dynamic Adjustment Mechanism (LDAM) that aligns with the hierarchical properties of these layers to facilitate improvements on multiple rewards using RL. Extensive experiments show ParaUni leverages complementary multi-layer features to substantially improve generation quality and shows strong potential for multiple reward advances during RL stages. Code is available at this https URL.

[CV-58] Performance Evaluation of Deep Learning for Tree Branch Segmentation in Autonomous Forestry Systems

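[Quick Read]: This paper addresses accurate tree branch segmentation for unmanned aerial vehicles (UAVs) in autonomous forestry operations, which is needed for flight safety and automated pruning and must stay robust across pixel resolutions and complex operating conditions. The key to the solution is a systematic evaluation of several deep learning models at three resolutions (256×256, 512×512, 1024×1024) using specialized metrics that include Thin Structure IoU (TS-IoU) and Connectivity Preservation Rate (CPR). The study finds that U-Net with a MiT-B4 backbone offers the best balance in most settings and that the best-performing or most efficient configuration can be chosen per resolution, which provides multi-resolution benchmarks for the accuracy-efficiency trade-off in embedded forestry systems.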

Link: https://arxiv.org/abs/2512.05418
Authors: Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:UAV-based autonomous forestry operations require rapid and precise tree branch segmentation for safe navigation and automated pruning across varying pixel resolutions and operational conditions. We evaluate different deep learning methods at three resolutions (256x256, 512x512, 1024x1024) using the Urban Street Tree Dataset, employing standard metrics (IoU, Dice) and specialized measures including Thin Structure IoU (TS-IoU) and Connectivity Preservation Rate (CPR). Among 22 configurations tested, U-Net with MiT-B4 backbone achieves strong performance at 256x256. At 512x512, MiT-B4 leads in IoU, Dice, TS-IoU, and Boundary-F1. At 1024x1024, U-Net+MiT-B3 shows the best validation performance for IoU/Dice and precision, while U-Net++ excels in boundary quality. PSPNet provides the most efficient option (2.36/9.43/37.74 GFLOPs) with 25.7/19.6/11.8 percentage point IoU reductions compared to top performers at respective resolutions. These results establish multi-resolution benchmarks for accuracy-efficiency trade-offs in embedded forestry systems. Implementation is available at this https URL.
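
The two standard metrics named in the abstract, IoU and Dice, can be computed from a flattened binary mask as sketched below. The specialised TS-IoU and CPR measures are not defined in the abstract and are not attempted here; the sample masks are hypothetical.

```python
def iou_and_dice(pred, target):
    """IoU and Dice for two flattened binary masks of equal length."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)      # true positives
    fp = sum(1 for p, t in zip(pred, target) if p and not t)  # false positives
    fn = sum(1 for p, t in zip(pred, target) if t and not p)  # false negatives
    if tp + fp + fn == 0:
        return 1.0, 1.0  # both masks empty: treated as a perfect match in this sketch
    iou = tp / (tp + fp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    return iou, dice


pred = [1, 1, 1, 0, 0, 0]    # hypothetical predicted branch pixels
target = [1, 1, 0, 1, 0, 0]  # hypothetical ground-truth branch pixels
print(iou_and_dice(pred, target))
```

The two metrics are tied by Dice = 2·IoU / (1 + IoU), so they always rank a pair of predictions on the same image identically. On the efficiency side, the PSPNet costs quoted in the abstract (2.36, 9.43, 37.74 GFLOPs) grow by about 4× per resolution step, matching the 4× increase in pixel count.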

[CV-59] Moving object detection from multi-depth images with an attention-enhanced CNN

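[Quick Read]: This paper addresses how to distinguish true object signals from noise and other interference when detecting moving solar system objects in wide-field survey data; the traditional reliance on human visual verification is labor-intensive and inefficient. The key to the solution is a new model that combines a multi-input convolutional neural network with a convolutional block attention module (CBAM). The former processes multiple stacked images in parallel to capture motion features, and the latter strengthens the model's focus on essential features in the spatial and channel dimensions, which significantly improves detection robustness and classification accuracy. Experiments on about 2,000 observational images give an accuracy of nearly 99% and an AUC of 0.99, and adjusting the detection threshold reduces the manual verification workload by more than 99%.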

Link: https://arxiv.org/abs/2512.05415
Authors: Masato Shibukawa, Fumi Yoshida, Toshifumi Yanagisawa, Takashi Ito, Hirohisa Kurosaki, Makoto Yoshikawa, Kohki Kamiya, Ji-an Jiang, Wesley Fraser, JJ Kavelaars, Susan Benecchi, Anne Verbiscer, Akira Hatakeyama, Hosei O, Naoya Ozaki
Affiliations: The Graduate University for Advanced Studies, SOKENDAI; University of Occupational and Environmental Health, Japan; Planetary Exploration Research Center, Chiba Institute of Technology; Star Signal Solutions Inc; Center for Computational Astrophysics, National Astronomical Observatory of Japan; College of Science and Engineering, Chubu University; Chofu Headquarters, Japan Aerospace Exploration Agency; ISAS, Japan Aerospace Exploration Agency; Department of Astronomy, University of Science and Technology of China; Division of Science, National Astronomical Observatory of Japan; National Research Council of Canada, Herzberg Astronomy and Astrophysics Research Centre; Department of Physics and Astronomy, University of Victoria; Planetary Science Institute; Southwest Research Institute; Department of Astronomy, University of Virginia; University of Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 14 pages, 22 figures, submitted to PASJ

Abstract: One of the greatest challenges for detecting moving objects in the solar system from wide-field survey data is determining whether a signal indicates a true object or is due to some other source, like noise. Object verification has relied heavily on human eyes, which usually results in significant labor costs. In order to address this limitation and reduce the reliance on manual intervention, we propose a multi-input convolutional neural network integrated with a convolutional block attention module. This method is specifically tailored to enhance the moving object detection system that we have developed and used previously. The current method introduces two innovations. The first is a multi-input architecture that processes multiple stacked images simultaneously. The second is the incorporation of the convolutional block attention module, which enables the model to focus on essential features in both spatial and channel dimensions. These advancements facilitate efficient learning from multiple inputs, leading to more robust detection of moving objects. The performance of the model is evaluated on a dataset consisting of approximately 2,000 observational images. We achieved an accuracy of nearly 99% with an area under the curve (AUC) of 0.99. These metrics indicate that the proposed model achieves excellent classification performance. By adjusting the threshold for object detection, the new model reduces the human workload by more than 99% compared to manual verification.
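
The abstract reports an AUC and a workload reduction obtained by adjusting the detection threshold. The sketch below shows how both quantities can be computed from classifier scores. It assumes that candidates scoring at or above the threshold are the ones passed on for human inspection, which the abstract does not state; the scores and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count 1 for each correctly ordered pair and 0.5 for each tie.
    wins = sum((1.0 if p > n else (0.5 if p == n else 0.0)) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def review_fraction(scores, threshold):
    """Share of candidates that still reach a human reviewer (assumed workflow)."""
    return sum(1 for s in scores if s >= threshold) / len(scores)


scores = [0.95, 0.80, 0.30, 0.20, 0.10, 0.05]  # hypothetical classifier outputs
labels = [1, 1, 0, 1, 0, 0]                    # 1 = real moving object
print(roc_auc(scores, labels), review_fraction(scores, threshold=0.5))
```

Raising the threshold shrinks the review queue but risks discarding faint real objects, such as the 0.20-scored positive in this toy data, so the threshold is a trade-off between workload and completeness.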

[CV-60] YOLO and SGBM Integration for Autonomous Tree Branch Detection and Depth Estimation in Radiata Pine Pruning Applications

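[Quick Read]: This paper addresses the significant safety risks of manually pruning radiata pine trees in commercial forestry, which come from working at height and on difficult terrain. The key to the solution is a computer vision framework for autonomous drone-based pruning that combines YOLO object detection with Semi-Global Block Matching (SGBM) stereo vision. Using only stereo camera input, it detects branches and estimates depth with high precision, replacing expensive LiDAR sensors. Experiments show that YOLO outperforms Mask R-CNN on branch segmentation, reaching 82.0% mask mAP50-95, and that the system localizes branches accurately within a 2 m operating range with under one second of processing per frame, which supports the feasibility of a low-cost, efficient, and safe autonomous pruning system.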

Link: https://arxiv.org/abs/2512.05412
Authors: Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Manual pruning of radiata pine trees poses significant safety risks due to extreme working heights and challenging terrain. This paper presents a computer vision framework that integrates YOLO object detection with Semi-Global Block Matching (SGBM) stereo vision for autonomous drone-based pruning operations. Our system achieves precise branch detection and depth estimation using only stereo camera input, eliminating the need for expensive LiDAR sensors. Experimental evaluation demonstrates YOLO's superior performance over Mask R-CNN, achieving 82.0% mask mAP50-95 for branch segmentation. The integrated system accurately localizes branches within a 2 m operational range, with processing times under one second per frame. These results establish the feasibility of cost-effective autonomous pruning systems that enhance worker safety and operational efficiency in commercial forestry.
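
Combining a detector with stereo matching usually comes down to reading the disparity values inside a detection box and converting them to metric depth with the pinhole stereo relation Z = f · B / d. The sketch below shows that conversion. The calibration values, the use of a median, and the function names are assumptions for illustration; the abstract does not describe how the paper aggregates disparities.

```python
from statistics import median


def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Pinhole stereo relation Z = f * B / d; None for invalid disparities."""
    if disparity_px <= 0:
        return None  # no valid stereo match at this pixel
    return focal_px * baseline_m / disparity_px


def branch_distance(box_disparities, focal_px, baseline_m):
    """Median depth over the valid disparities inside one detection box."""
    depths = [depth_from_disparity(d, focal_px, baseline_m) for d in box_disparities]
    valid = [z for z in depths if z is not None]
    return median(valid) if valid else None


# Hypothetical calibration: 700 px focal length and a 0.12 m stereo baseline.
print(branch_distance([41.0, 42.0, 0.0, 43.0], focal_px=700.0, baseline_m=0.12))
```

The median is used here because thin branches cover few pixels, so a box often contains background disparities that would skew a mean.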

[CV-61] Genetic Algorithms For Parameter Optimization for Disparity Map Generation of Radiata Pine Branch Images

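[Quick Read]: This paper addresses the difficulty of balancing accuracy and efficiency when traditional stereo matching algorithms (such as SGBM with WLS filtering) are used on unmanned aerial vehicles (UAVs), because their parameter configuration depends on manual tuning. The key to the solution is a Genetic Algorithm (GA) based parameter optimization framework that systematically searches for the best SGBM and WLS parameter combination. It significantly improves distance measurement accuracy without sacrificing processing speed and generalizes better across imaging conditions, giving resource-constrained UAV systems a practical and efficient depth perception solution.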

Link: https://arxiv.org/abs/2512.05410
Authors: Yida Lin, Bing Xue, Mengjie Zhang, Sam Schofield, Richard Green
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Traditional stereo matching algorithms like Semi-Global Block Matching (SGBM) with Weighted Least Squares (WLS) filtering offer speed advantages over neural networks for UAV applications, generating disparity maps in approximately 0.5 seconds per frame. However, these algorithms require meticulous parameter tuning. We propose a Genetic Algorithm (GA) based parameter optimization framework that systematically searches for optimal parameter configurations for SGBM and WLS, enabling UAVs to measure distances to tree branches with enhanced precision while maintaining processing efficiency. Our contributions include: (1) a novel GA-based parameter optimization framework that eliminates manual tuning; (2) a comprehensive evaluation methodology using multiple image quality metrics; and (3) a practical solution for resource-constrained UAV systems. Experimental results demonstrate that our GA-optimized approach reduces Mean Squared Error by 42.86% while increasing Peak Signal-to-Noise Ratio and Structural Similarity by 8.47% and 28.52%, respectively, compared with baseline configurations. Furthermore, our approach demonstrates superior generalization performance across varied imaging conditions, which is critical for real-world forestry applications.
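
A GA-based parameter search of this kind can be sketched in a few lines. The version below is a generic elitist genetic algorithm, not the authors' implementation: the parameter names, their ranges, and `toy_fitness` are hypothetical stand-ins, and a real fitness function would score the disparity map produced by each configuration (for example with MSE, PSNR, or SSIM).

```python
import random

# Hypothetical search ranges for two SGBM-style parameters (assumed, not from the paper).
BOUNDS = {"block_size": (3, 15), "num_disparities": (16, 256)}


def toy_fitness(params):
    """Stand-in for a real quality score such as the negative MSE of a disparity map."""
    return -((params["block_size"] - 9) ** 2 + (params["num_disparities"] - 128) ** 2)


def random_individual(rng):
    return {name: rng.randint(lo, hi) for name, (lo, hi) in BOUNDS.items()}


def crossover(a, b, rng):
    # Uniform crossover: each gene is copied from one of the two parents.
    return {name: (a[name] if rng.random() < 0.5 else b[name]) for name in BOUNDS}


def mutate(individual, rng, rate=0.2):
    # With probability `rate`, resample a gene uniformly within its bounds.
    return {name: (rng.randint(*BOUNDS[name]) if rng.random() < rate else value)
            for name, value in individual.items()}


def run_ga(fitness, generations=30, pop_size=20, elite=4, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:elite]  # elitism: the best individuals survive unchanged
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(pop_size - elite)]
        population = parents + children
    return max(population, key=fitness)


best = run_ga(toy_fitness)
print(best, toy_fitness(best))
```

Keeping the elite unchanged means the best score never gets worse from one generation to the next, which makes the search easy to reason about when each fitness evaluation is expensive.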

[CV-62] The Dynamic Prior: Understanding 3D Structures for Casual Dynamic Videos

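Quick read: This paper addresses the difficulty of accurately estimating camera poses, 3D scene geometry, and object motion from in-the-wild videos, in particular the degraded performance of classical Structure from Motion (SfM) methods in the presence of dynamic objects. The key to its solution is the Dynamic Prior, which robustly identifies dynamic objects without task-specific training by combining the reasoning capabilities of Vision-Language Models (VLMs) with the fine-grained spatial segmentation of SAM2 (Segment Anything Model 2). Dynamic objects are thereby separated out effectively, which improves the accuracy and robustness of downstream camera pose optimization, depth reconstruction, and 4D trajectory estimation.

Link: https://arxiv.org/abs/2512.05398
Authors: Zhuoyuan Wu, Xurui Yang, Jiahui Huang, Yue Wang, Jun Gao
Affiliations: PKU (Peking University); Independent Researcher; NVIDIA; USC (University of Southern California); University of Michigan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code is available at this https URL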

Abstract:Estimating accurate camera poses, 3D scene geometry, and object motion from in-the-wild videos is a long-standing challenge for classical structure from motion pipelines due to the presence of dynamic objects. Recent learning-based methods attempt to overcome this challenge by training motion estimators to filter dynamic objects and focus on the static background. However, their performance is largely limited by the availability of large-scale motion segmentation datasets, resulting in inaccurate segmentation and, therefore, inferior structural 3D understanding. In this work, we introduce the Dynamic Prior to robustly identify dynamic objects without task-specific training, leveraging the powerful reasoning capabilities of Vision-Language Models (VLMs) and the fine-grained spatial segmentation capacity of SAM2. The Dynamic Prior can be seamlessly integrated into state-of-the-art pipelines for camera pose optimization, depth reconstruction, and 4D trajectory estimation. Extensive experiments on both synthetic and real-world videos demonstrate that the Dynamic Prior not only achieves state-of-the-art performance on motion segmentation, but also significantly improves accuracy and robustness for structural 3D understanding.

[CV-63] Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability

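Quick read: This paper addresses how latent-space structure affects the difficulty of diffusion training in video generation models: existing video variational autoencoders (Video VAEs) typically focus only on reconstruction fidelity and overlook the role that the spectral properties of the latent representation play in the convergence speed and quality of diffusion training. The core of the solution is a statistical analysis that identifies two spectral properties essential for diffusion training: a spatio-temporal frequency spectrum biased toward low frequencies, and a channel-wise eigenspectrum dominated by a few modes. To induce these properties, the authors propose two lightweight, backbone-agnostic regularizers, Local Correlation Regularization and Latent Masked Reconstruction, which yield a 3× speedup in text-to-video generation convergence and a 10% gain in video reward.

Link: https://arxiv.org/abs/2512.05394
Authors: Shizhan Liu, Xinran Deng, Zhuoyi Yang, Jiayan Teng, Xiaotao Gu, Jie Tang
Affiliations: Zhipu AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: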

Abstract:Latent diffusion models pair VAEs with diffusion backbones, and the structure of VAE latents strongly influences the difficulty of diffusion training. However, existing video VAEs typically focus on reconstruction fidelity, overlooking latent structure. We present a statistical analysis of video VAE latent spaces and identify two spectral properties essential for diffusion training: a spatio-temporal frequency spectrum biased toward low frequencies, and a channel-wise eigenspectrum dominated by a few modes. To induce these properties, we propose two lightweight, backbone-agnostic regularizers: Local Correlation Regularization and Latent Masked Reconstruction. Experiments show that our Spectral-Structured VAE (SSVAE) achieves a 3× speedup in text-to-video generation convergence and a 10% gain in video reward, outperforming strong open-source VAEs. The code is available at this https URL.

[CV-64] LoC-Path: Learning to Compress for Pathology Multimodal Large Language Models

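Quick read: This paper addresses the computational cost of whole slide image (WSI) understanding in pathology: existing methods based on multimodal large language models (MLLMs) process thousands of image patches exhaustively, incurring a huge computational overhead, whereas human experts attend to only a few diagnostically relevant regions. The key to its solution is an efficient framework called LoC-Path built from two groups of modules: a Sparse Token Merger (STM) and an MAE-pretrained resampler, which remove local redundancy and compress globally redundant patch features; and a Cross-Attention Routing Adapter (CARA) and a Token Importance Scorer (TIS), which integrate the compressed visual representation into the language model at low computational cost. The method matches the performance of current state-of-the-art WSI MLLMs while significantly reducing resource consumption.

Link: https://arxiv.org/abs/2512.05391
Authors: Qingqiao Hu, Weimin Lyu, Meilong Xu, Kehan Qi, Xiaoling Hu, Saumya Gupta, Jiawei Zhou, Chao Chen
Affiliations: Stony Brook University; Harvard Medical School
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages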

Abstract:Whole Slide Image (WSI) understanding is fundamentally challenging due to its gigapixel scale and the extreme sparsity of diagnostically relevant regions. Unlike human experts who primarily rely on key areas to arrive at a diagnosis, existing slide-level multimodal large language models (MLLMs) for pathology rely on heavy slide-level encoders that process thousands of patch features in a brute-force manner, resulting in excessive computational cost. In this work, we revisit the WSI-language modeling paradigm and show that tile-level features exhibit strong global and local redundancy, whereas only a small subset of tiles are truly task-relevant. Motivated by this observation, we introduce an efficient MLLM framework, called LoC-Path, that replaces the expensive slide-level encoder with redundancy-reducing modules. We first design a Sparse Token Merger (STM) and an MAE-pretrained resampler to remove local redundancy and compress globally redundant tile tokens into a compact slide-level representation set. We then propose a Cross-Attention Routing Adapter (CARA) and a Token Importance Scorer (TIS) to integrate the compressed visual representation with the language model in a computation-efficient manner. Extensive experiments demonstrate that our approach achieves performance comparable to existing state-of-the-art whole-slide MLLMs, while requiring significantly lower computation and memory.

[CV-65] ShaRP: SHAllow-LayeR Pruning for Video Large Language Models Acceleration

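Quick read: This paper addresses the high computational load that Video Large Language Models (VLLMs) incur in the pre-filling stage from processing a massive number of visual tokens, focusing on the marked performance drop that attention-based pruning suffers at shallow decoder layers because of positional encoding bias and insufficient information interaction. The key to its solution is an improved attention-based pruning framework, ShaRP, whose core innovations are segment-aware causal masking, positional debiasing, and token deduplication. These improve token selection accuracy at shallow decoder layers and enable efficient, stable pruning at high compression rates without retraining, significantly accelerating VLLM inference.

Link: https://arxiv.org/abs/2512.05385
Authors: Yingjie Xia, Tao Liu, Jinglei Shi, Qingsong Xie, Heng Guo, Jian Yang, Xi Wang
Affiliations: VCIP & TMCC & DISSec, College of Computer Science, Nankai University; LIX, Ecole Polytechnique, IP Paris; OPPO Research Institute; Beijing University of Posts and Telecommunications
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: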

Abstract:Video Large Language Models (VLLMs) face the challenge of high computational load during the pre-filling stage due to the processing of an enormous number of visual tokens. Although attention-based pruning methods are widely used to accelerate inference, trials at early decoder layers often result in significant performance degradation, especially under high compression rates. We argue that while attention-based pruning inherently holds the potential to identify the most relevant visual tokens, its effectiveness in shallow decoder layers is limited by factors such as positional encoding bias and insufficient information interaction. In this paper, we propose an improved attention-based pruning framework, termed ShaRP, that integrates segment-aware causal masking, positional debiasing, and token deduplication for enhanced token selection. It enables effective pruning at shallow layers while maintaining stable performance under high compression rates without retraining. Extensive experiments demonstrate that ShaRP achieves competitive performance across multiple video understanding benchmarks, establishing a new paradigm for accelerating VLLM inference.
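
For orientation, the plain attention-based pruning that ShaRP builds on can be sketched as a top-k selection over per-token attention scores. The sketch below shows only that baseline step; it does not implement ShaRP's segment-aware masking, positional debiasing, or deduplication, and the function name and example scores are hypothetical.

```python
def prune_visual_tokens(attention_scores, keep_ratio):
    """Return the indices of the visual tokens to keep, in their original order.

    attention_scores: one score per visual token, e.g. the attention it
    receives from the text query averaged over heads (assumed input).
    keep_ratio: fraction of tokens to keep, in (0, 1].
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, int(len(attention_scores) * keep_ratio))
    ranked = sorted(range(len(attention_scores)),
                    key=lambda i: attention_scores[i], reverse=True)
    return sorted(ranked[:k])  # restore the original token order


scores = [0.02, 0.30, 0.05, 0.25, 0.01, 0.20, 0.10, 0.07]
kept = prune_visual_tokens(scores, keep_ratio=0.25)
print(kept)  # [1, 3]
```

The abstract's argument is that raw scores like these are unreliable in shallow layers, so corrections are applied before a selection of this kind is made.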

[CV-66] PoolNet: Deep Learning for 2D to 3D Video Process Validation

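Quick read: This paper addresses the challenges of extracting Structure-from-Motion (SfM) information from in-the-wild image data: the process is computationally expensive and time-consuming, and much publicly available data is hard to process because of insufficient camera pose variation, scene occlusion, and noise. The key to its solution is PoolNet, a general deep learning framework that performs frame-level and scene-level validation, effectively distinguishing scenes suitable for SfM processing from unsuitable ones and taking significantly less time than state-of-the-art algorithms need to obtain SfM data.

Link: https://arxiv.org/abs/2512.05362
Authors: Sanchit Kaul, Joseph Luna, Shray Arora
Affiliations: University of California, Davis
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: All code related to this paper can be found at this https URL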

Abstract:Lifting Structure-from-Motion (SfM) information from sequential and non-sequential image data is a time-consuming and computationally expensive task. In addition to this, the majority of publicly available data is unfit for processing due to inadequate camera pose variation, obscuring scene elements, and noisy data. To solve this problem, we introduce PoolNet, a versatile deep learning framework for frame-level and scene-level validation of in-the-wild data. We demonstrate that our model successfully differentiates SfM-ready scenes from those unfit for processing while significantly undercutting the amount of time state-of-the-art algorithms take to obtain structure-from-motion data.

[CV-67] Group Orthogonal Low-Rank Adaptation for RGB-T Tracking AAAI2026

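Quick read: This paper addresses rank-space redundancy in low-rank adaptation for RGB-T tracking: during parameter-efficient fine-tuning, many ranks contribute little to the model's expressiveness, which limits feature learning, duplicates information, and weakens the model's ability to handle challenging scenes. The key to its solution is the Group Orthogonal Low-Rank Adaptation (GOLA) framework. It uses Singular Value Decomposition (SVD) to quantify rank importance, freezes the crucial ranks to preserve pretrained priors, and clusters the redundant ranks into groups; an inter-group orthogonal constraint then forces different rank groups to learn complementary features, markedly reducing redundancy and improving the diversity and adaptability of the feature representation.

Link: https://arxiv.org/abs/2512.05359
Authors: Zekai Shao, Yufan Hu, Jingyuan Liu, Bin Fan, Hongmin Liu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 8 figures. Accepted by AAAI 2026. Extended version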

Abstract:Parameter-efficient fine-tuning has emerged as a promising paradigm in RGB-T tracking, enabling downstream task adaptation by freezing pretrained parameters and fine-tuning only a small set of parameters. This set forms a rank space made up of multiple individual ranks, whose expressiveness directly shapes the model’s adaptability. However, quantitative analysis reveals low-rank adaptation exhibits significant redundancy in the rank space, with many ranks contributing almost no practical information. This hinders the model’s ability to learn more diverse knowledge to address the various challenges in RGB-T tracking. To address this issue, we propose the Group Orthogonal Low-Rank Adaptation (GOLA) framework for RGB-T tracking, which effectively leverages the rank space through structured parameter learning. Specifically, we adopt a rank decomposition partitioning strategy utilizing singular value decomposition to quantify rank importance, freeze crucial ranks to preserve the pretrained priors, and cluster the redundant ranks into groups to prepare for subsequent orthogonal constraints. We further design an inter-group orthogonal constraint strategy. This constraint enforces orthogonality between rank groups, compelling them to learn complementary features that target diverse challenges, thereby alleviating information redundancy. Experimental results demonstrate that GOLA effectively reduces parameter redundancy and enhances feature representation capabilities, significantly outperforming state-of-the-art methods across four benchmark datasets and validating its effectiveness in RGB-T tracking tasks.
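
One common way to express an orthogonality constraint between groups of rank vectors is to penalize the squared dot products between vectors that belong to different groups. The sketch below illustrates that generic penalty only; the exact loss used by GOLA is not given in the abstract, and the function name and example vectors are hypothetical.

```python
from itertools import combinations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def inter_group_orthogonality_penalty(groups):
    """Sum of squared dot products between rank vectors of different groups.

    groups: a list of groups, each a list of equal-length rank vectors.
    The value is zero exactly when every vector is orthogonal to every
    vector in every other group.
    """
    penalty = 0.0
    for group_a, group_b in combinations(groups, 2):
        for u in group_a:
            for v in group_b:
                penalty += dot(u, v) ** 2
    return penalty


orthogonal = [[[1.0, 0.0, 0.0]], [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]]
overlapping = [[[1.0, 0.0, 0.0]], [[1.0, 1.0, 0.0]]]
print(inter_group_orthogonality_penalty(orthogonal))   # 0.0
print(inter_group_orthogonality_penalty(overlapping))  # 1.0
```

Vectors within the same group are left unconstrained here, so the penalty only discourages different groups from encoding the same directions.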

[CV-68] SplatPainter: Interactive Authoring of 3D Gaussians from 2D Edits via Test-Time Training

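Quick read: This paper addresses a key bottleneck in the interactive refinement and editing of 3D Gaussian Splatting assets: existing diffusion-based or optimization-based methods are slow, destroy the identity of the original asset, or lack the precision needed for fine-grained control. The core of its solution is a state-aware feedforward model that predicts updates to the attributes of a compact, feature-rich Gaussian representation directly from user-provided 2D views, combined with Test-Time Training to build a state-aware, iterative editing workflow. This supports tasks such as high-fidelity local detail refinement, local paint-over, and consistent global recoloring, keeping edits precise and continuous at interactive speeds.

Link: https://arxiv.org/abs/2512.05354
Authors: Yang Zheng, Hao Tan, Kai Zhang, Peng Wang, Leonidas Guibas, Gordon Wetzstein, Wang Yifan
Affiliations: Stanford University; Adobe Research
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: project page this https URL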

Abstract:The rise of 3D Gaussian Splatting has revolutionized photorealistic 3D asset creation, yet a critical gap remains for their interactive refinement and editing. Existing approaches based on diffusion or optimization are ill-suited for this task, as they are often prohibitively slow, destructive to the original asset’s identity, or lack the precision for fine-grained control. To address this, we introduce SplatPainter, a state-aware feedforward model that enables continuous editing of 3D Gaussian assets from user-provided 2D view(s). Our method directly predicts updates to the attributes of a compact, feature-rich Gaussian representation and leverages Test-Time Training to create a state-aware, iterative workflow. The versatility of our approach allows a single architecture to perform diverse tasks, including high-fidelity local detail refinement, local paint-over, and consistent global recoloring, all at interactive speeds, paving the way for fluid and intuitive 3D content authoring.

[CV-69] SpaceControl: Introducing Test-Time Spatial Control to 3D Generative Modeling CEC

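Quick read: This paper addresses the limited geometric control available in generative methods for 3D assets: existing methods rely mainly on text or image prompts, which make intuitive and precise control over object geometry difficult because language is ambiguous and images are cumbersome to edit. The key to its solution is SpaceControl, a training-free test-time spatial control method that accepts geometric inputs ranging from coarse primitives to detailed meshes and integrates seamlessly with pre-trained generative models. A controllable parameter trades off geometric fidelity against output realism, significantly improving geometric accuracy at no additional training cost while preserving high visual quality.

Link: https://arxiv.org/abs/2512.05343
Authors: Elisabetta Fedele, Francis Engelmann, Ian Huang, Or Litany, Marc Pollefeys, Leonidas Guibas
Affiliations: ETH Zurich; Stanford University; Technion; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL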

Abstract:Generative methods for 3D assets have recently achieved remarkable progress, yet providing intuitive and precise control over the object geometry remains a key challenge. Existing approaches predominantly rely on text or image prompts, which often fall short in geometric specificity: language can be ambiguous, and images are cumbersome to edit. In this work, we introduce SpaceControl, a training-free test-time method for explicit spatial control of 3D generation. Our approach accepts a wide range of geometric inputs, from coarse primitives to detailed meshes, and integrates seamlessly with modern pre-trained generative models without requiring any additional training. A controllable parameter lets users trade off between geometric fidelity and output realism. Extensive quantitative evaluation and user studies demonstrate that SpaceControl outperforms both training-based and optimization-based baselines in geometric faithfulness while preserving high visual quality. Finally, we present an interactive user interface that enables online editing of superquadrics for direct conversion into textured 3D assets, facilitating practical deployment in creative workflows. Find our project page at this https URL

[CV-70] ARCAS: An Augmented Reality Collision Avoidance System with SLAM-Based Tracking for Enhancing VRU Safety

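Quick read: This paper addresses the high collision risk that vulnerable road users (VRUs) face in mixed urban traffic, where existing safety systems mostly focus on driver or vehicle assistance and offer VRUs no direct support. The key to its solution is ARCAS, a real-time collision avoidance system based on wearable augmented reality (AR) headsets. It fuses roadside 360-degree 3D LiDAR with SLAM (Simultaneous Localization and Mapping) based headset tracking and an automatic 3D calibration procedure to overlay world-locked 3D bounding boxes and directional arrows precisely in the user's passthrough view, providing personalized spatial alerts; shared world anchoring across multiple headsets additionally enables coordinated perception. In 180 real-world trials with e-scooters and vehicles, the system nearly doubled pedestrians' time-to-collision and increased counterparts' reaction margins by up to 4x compared with unaided-eye conditions.

Link: https://arxiv.org/abs/2512.05299
Authors: Ahmad Yehia, Jiseop Byeon, Tianyi Wang, Huihai Wang, Yiming Xu, Junfeng Jiao, Christian Claudel
Affiliations: The University of Texas at Austin
Categories: Systems and Control (eess.SY); Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Robotics (cs.RO); Image and Video Processing (eess.IV)
Comments: 8 pages, 3 figures, 1 table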

Abstract:Vulnerable road users (VRUs) face high collision risks in mixed traffic, yet most existing safety systems prioritize driver or vehicle assistance over direct VRU support. This paper presents ARCAS, a real-time augmented reality collision avoidance system that provides personalized spatial alerts to VRUs via wearable AR headsets. By fusing roadside 360-degree 3D LiDAR with SLAM-based headset tracking and an automatic 3D calibration procedure, ARCAS accurately overlays world-locked 3D bounding boxes and directional arrows onto approaching hazards in the user’s passthrough view. The system also enables multi-headset coordination through shared world anchoring. Evaluated in real-world pedestrian interactions with e-scooters and vehicles (180 trials), ARCAS nearly doubled pedestrians’ time-to-collision and increased counterparts’ reaction margins by up to 4x compared to unaided-eye conditions. Results validate the feasibility and effectiveness of LiDAR-driven AR guidance and highlight the potential of wearable AR as a promising next-generation safety tool for urban mobility.

[CV-71] From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

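Quick read: This paper addresses the shortcomings of Vision-Language Models (VLMs) in temporal understanding for autonomous driving (AD), in particular their weak fine-grained motion perception of the dynamic relationships between actions while driving. Existing benchmarks mostly cover non-driving content such as sports, cooking, or movies, and none targets the temporal challenges specific to AD. The authors therefore propose the Temporal Understanding in Autonomous Driving (TAD) benchmark, which contains nearly 6,000 question-answer pairs across 7 human-designed tasks for systematically evaluating the temporal reasoning of VLMs. Experiments show that current state-of-the-art models perform poorly on TAD, mainly because of inadequate motion understanding. To improve performance, the paper proposes two training-free solutions: Scene-CoT, which uses Chain-of-Thought (CoT) to strengthen logical reasoning, and TCogMap, which introduces an ego-centric temporal cognitive map to model spatio-temporal dynamics explicitly. Integrated with existing VLMs, they improve average accuracy on TAD by up to 17.72%.

Link: https://arxiv.org/abs/2512.05277
Authors: Kevin Cannons, Saeed Ranjbar Alvar, Mohammad Asiful Hossain, Ahmad Rezaei, Mohsen Gholami, Alireza Heidarikhazaei, Zhou Weimin, Yong Zhang, Mohammad Akbari
Affiliations: Huawei Technologies Canada Co., Ltd.; Huawei Cloud
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: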

Abstract:Temporal understanding in autonomous driving (AD) remains a significant challenge, even for recent state-of-the-art (SoTA) Vision-Language Models (VLMs). Prior work has introduced datasets and benchmarks aimed at improving temporal reasoning, but these have emphasized other video content, including sports, cooking, and movies. No existing benchmark focuses exclusively on the unique challenges of temporal understanding in ego-centric AD footage. To fill this gap, the Temporal Understanding in Autonomous Driving (TAD) benchmark is presented, which evaluates VLMs’ ability to capture the dynamic relationships between actions in AD. TAD comprises nearly 6,000 question-answer (QA) pairs, spanning 7 human-designed tasks. In addition, an evaluation is performed that consists of 9 closed- and open-source generalist models as well as SoTA AD specialist models. When applied to TAD, current SoTA models demonstrated substandard accuracies, largely due to imperfect fine-grained motion understanding. To improve motion understanding and overall accuracy on TAD, two novel training-free solutions are proposed: Scene-CoT, which leverages Chain-of-Thought (CoT), and TCogMap, which incorporates an ego-centric temporal cognitive map. The proposed approaches are integrated with existing VLMs and improve average accuracy on TAD by up to 17.72%. By introducing TAD, benchmarking multiple SoTA models, and proposing effective enhancements, this work aims to catalyze future research on temporal understanding in AD. The benchmark and evaluation code are available on Hugging Face and GitHub, respectively.

[CV-72] Inferring Compositional 4D Scenes without Ever Seeing One

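Quick read: This paper addresses the problem of consistently and jointly predicting the structure and spatio-temporal configuration of 4D/3D objects from monocular video; existing methods are usually limited to a single object and depend on category-specific parametric shape models, which leads to inconsistent scene configurations and limited generalization. The key to its solution is COM4D (Compositional 4D), which designs a training strategy for spatial and temporal attention on 2D video input and disentangles learning into two parts, learning from object compositions and learning from the dynamics of single objects, thereby avoiding any reliance on 4D compositional training data. At inference time, the proposed attention mixing mechanism combines the independently learned spatial and temporal attentions to reconstruct complete and persistent 4D scenes containing multiple interacting objects without any 4D composition examples.

Link: https://arxiv.org/abs/2512.05272
Authors: Ahmet Berke Gokmen, Ajad Chhatkuli, Luc Van Gool, Danda Pani Paudel
Affiliations: INSAIT; Sofia University “St. Kliment Ohridski”
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL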

Abstract:Scenes in the real world are often composed of several static and dynamic objects. Capturing their 4-dimensional structures, composition and spatio-temporal configuration in-the-wild, though extremely interesting, is equally hard. Therefore, existing works often focus on one object at a time, while relying on some category-specific parametric shape model for dynamic objects. This can lead to inconsistent scene configurations, in addition to being limited to the modeled object categories. We propose COM4D (Compositional 4D), a method that consistently and jointly predicts the structure and spatio-temporal configuration of 4D/3D objects using only static multi-object or dynamic single object supervision. We achieve this by a carefully designed training of spatial and temporal attentions on 2D video input. The training is disentangled into learning from object compositions on the one hand, and single object dynamics throughout the video on the other, thus completely avoiding reliance on 4D compositional training data. At inference time, our proposed attention mixing mechanism combines these independently learned attentions, without requiring any 4D composition examples. By alternating between spatial and temporal reasoning, COM4D reconstructs complete and persistent 4D scenes with multiple interacting objects directly from monocular videos. Furthermore, COM4D provides state-of-the-art results in existing separate problems of 4D object and composed 3D reconstruction despite being purely data-driven.

[CV-73] CARD: Correlation Aware Restoration with Diffusion

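Quick read: This paper addresses the inadequate modeling of real-world sensor noise by diffusion models in image restoration, specifically spatially correlated noise caused by readout mechanisms; conventional methods usually assume independent and identically distributed (i.i.d.) Gaussian noise, which limits their practical effectiveness. The key to its solution is Correlation Aware Restoration with Diffusion (CARD), whose core innovation is a training-free preprocessing step that whitens the noisy observation, converting the spatially correlated noise into i.i.d. form; the diffusion restoration process then uses noise-whitened updates, handling correlated noise from real scenes while retaining the closed-form sampling efficiency of the original DDRM.

Link: https://arxiv.org/abs/2512.05268
Authors: Niki Nezakati, Arnab Ghosh, Amit Roy-Chowdhury, Vishwanath Saragadam
Affiliations: University of California, Riverside
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: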

Abstract:Denoising diffusion models have achieved state-of-the-art performance in image restoration by modeling the process as sequential denoising steps. However, most approaches assume independent and identically distributed (i.i.d.) Gaussian noise, while real-world sensors often exhibit spatially correlated noise due to readout mechanisms, limiting their practical effectiveness. We introduce Correlation Aware Restoration with Diffusion (CARD), a training-free extension of DDRM that explicitly handles correlated Gaussian noise. CARD first whitens the noisy observation, which converts the noise into an i.i.d. form. Then, the diffusion restoration steps are replaced with noise-whitened updates, which inherits DDRM’s closed-form sampling efficiency while now being able to handle correlated noise. To emphasize the importance of addressing correlated noise, we contribute CIN-D, a novel correlated noise dataset captured across diverse illumination conditions to evaluate restoration methods on real rolling-shutter sensor noise. This dataset fills a critical gap in the literature for experimental evaluation with real-world correlated noise. Experiments on standard benchmarks with synthetic correlated noise and on CIN-D demonstrate that CARD consistently outperforms existing methods across denoising, deblurring, and super-resolution tasks.
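
The whitening step can be illustrated on a two-pixel example. If the noise covariance factors as Σ = L Lᵀ (a Cholesky factorization), multiplying the observation by L⁻¹ yields noise with identity covariance. This is a minimal sketch of generic whitening, assuming a known 2x2 covariance chosen for illustration; it is not the CARD implementation, and the helper names are hypothetical.

```python
import math


def cholesky_2x2(cov):
    """Lower-triangular L with L @ L.T == cov, for a 2x2 symmetric positive-definite matrix."""
    (a, b), (_, c) = cov
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    return [[l11, 0.0], [l21, l22]]


def whiten(y, cov):
    """Return z = L^{-1} y by forward substitution, where cov = L L^T."""
    L = cholesky_2x2(cov)
    z0 = y[0] / L[0][0]
    z1 = (y[1] - L[1][0] * z0) / L[1][1]
    return [z0, z1]


def whitened_covariance(cov):
    """Covariance of the whitened noise, W cov W^T with W = L^{-1}."""
    cols = [whiten([cov[0][j], cov[1][j]], cov) for j in range(2)]  # columns of W cov
    w_cov = [[cols[j][i] for j in range(2)] for i in range(2)]      # W cov, row-major
    return [whiten(w_cov[i], cov) for i in range(2)]                # rows of (W cov) W^T


# Assumed covariance: unit variance per pixel, correlation 0.8 between neighbours.
noise_cov = [[1.0, 0.8], [0.8, 1.0]]
print(whitened_covariance(noise_cov))
```

After whitening, the printed covariance is the identity up to floating-point rounding, which is the i.i.d. form that a DDRM-style sampler assumes.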

[CV-74] Age-Inclusive 3D Human Mesh Recovery for Action-Preserving Data Anonymization

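Quick read: This paper addresses the poor generalization of current three-dimensional (3D) human shape and pose estimation methods to children and infants: models trained on adult data transfer poorly to younger populations. The key to its solution is the AionHMR framework, which extends a top-performing model with an optimization-based method and incorporates the SMPL-A body model to model adults, children, and infants jointly and accurately. The framework is used to generate pseudo-ground-truth annotations, on which a transformer-based deep learning model is trained that markedly improves 3D reconstruction for children and infants while maintaining accuracy on adults. The resulting meshes can serve as privacy-preserving substitutes that retain action, pose, and geometry information, enabling the release of anonymized data.

Link: https://arxiv.org/abs/2512.05259
Authors: Georgios Chatzichristodoulou, Niki Efthymiou, Panagiotis Filntisis, Georgios Pavlakos, Petros Maragos
Affiliations: School of ECE, National Technical University of Athens; Robotics Institute, Athena RC; HERON - Hellenic Robotics Center of Excellence; The University of Texas at Austin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: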

Abstract:While three-dimensional (3D) shape and pose estimation is a highly researched area that has yielded significant advances, the resulting methods, despite performing well for the adult population, generally fail to generalize effectively to children and infants. This paper addresses this challenge by introducing AionHMR, a comprehensive framework designed to bridge this domain gap. We propose an optimization-based method that extends a top-performing model by incorporating the SMPL-A body model, enabling the concurrent and accurate modeling of adults, children, and infants. Leveraging this approach, we generated pseudo-ground-truth annotations for publicly available child and infant image databases. Using these new training data, we then developed and trained a specialized transformer-based deep learning model capable of real-time 3D age-inclusive human reconstruction. Extensive experiments demonstrate that our methods significantly improve shape and pose estimation for children and infants without compromising accuracy on adults. Importantly, our reconstructed meshes serve as privacy-preserving substitutes for raw images, retaining essential action, pose, and geometry information while enabling anonymized datasets release. As a demonstration, we introduce the 3D-BabyRobot dataset, a collection of action-preserving 3D reconstructions of children interacting with robots. This work bridges a crucial domain gap and establishes a foundation for inclusive, privacy-aware, and age-diverse 3D human modeling.

[CV-75] IE2Video: Adapting Pretrained Diffusion Models for Event-Based Video Reconstruction

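Quick read: This paper addresses the high power consumption that fixed-rate capture with conventional RGB cameras causes in continuous video monitoring, robotics, and wearable systems. Its core solution is a hybrid capture paradigm that records sparse RGB keyframes alongside a continuous asynchronous event stream from an event camera and reconstructs the full RGB video sequence offline, substantially reducing capture-stage energy use without giving up the standard video output that downstream applications need. The paper defines the Image and Event to Video (IE2Video) task and explores two architectural strategies: adapting an autoregressive model (HyperE2VID) for RGB generation, and injecting event representations into a pretrained text-to-video diffusion model (LTX) via learned encoders and low-rank adaptation (LoRA). The latter achieves 33% better perceptual quality than the autoregressive baseline (LPIPS of 0.283 versus 0.422, lower is better) and generalizes well across datasets and sequence lengths.

Link: https://arxiv.org/abs/2512.05240
Authors: Dmitrii Torbunov, Onur Okuducu, Yi Huang, Odera Dim, Rebecca Coles, Yonggang Cui, Yihui Ren
Affiliations: Brookhaven National Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: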

Abstract:Continuous video monitoring in surveillance, robotics, and wearable systems faces a fundamental power constraint: conventional RGB cameras consume substantial energy through fixed-rate capture. Event cameras offer sparse, motion-driven sensing with low power consumption, but produce asynchronous event streams rather than RGB video. We propose a hybrid capture paradigm that records sparse RGB keyframes alongside continuous event streams, then reconstructs full RGB video offline – reducing capture power consumption while maintaining standard video output for downstream applications. We introduce the Image and Event to Video (IE2Video) task: reconstructing RGB video sequences from a single initial frame and subsequent event camera data. We investigate two architectural strategies: adapting an autoregressive model (HyperE2VID) for RGB generation, and injecting event representations into a pretrained text-to-video diffusion model (LTX) via learned encoders and low-rank adaptation. Our experiments demonstrate that the diffusion-based approach achieves 33% better perceptual quality than the autoregressive baseline (0.283 vs 0.422 LPIPS). We validate our approach across three event camera datasets (BS-ERGB, HS-ERGB far/close) at varying sequence lengths (32-128 frames), demonstrating robust cross-dataset generalization with strong performance on unseen capture configurations.
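
The 33% figure is consistent with reading it as the relative reduction in LPIPS, a metric where lower values mean better perceptual quality. The short check below assumes that reading; only the two LPIPS values come from the abstract.

```python
lpips_diffusion = 0.283       # diffusion-based approach, from the abstract
lpips_autoregressive = 0.422  # autoregressive baseline, from the abstract

# Relative reduction with respect to the baseline.
relative_reduction = (lpips_autoregressive - lpips_diffusion) / lpips_autoregressive
print(f"{relative_reduction:.1%}")  # 32.9%
```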

[CV-76] DEAR: Dataset for Evaluating the Aesthetics of Rendering

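Quick read: This paper addresses the lack of datasets grounded in subjective human preferences for Evaluation of Aesthetics of Rendering (EAR), compensating for the limitation that traditional Image Quality Assessment (IQA) considers only technical degradations such as noise, blur, or compression artifacts. The key to its solution is DEAR (Dataset for Evaluating the Aesthetics of Rendering), the first benchmark dataset to systematically model human aesthetic judgments of image rendering styles. Built on the MIT-Adobe FiveK dataset, it collects pairwise image preference scores through large-scale crowdsourcing with 13,648 participants, each image pair being annotated by 25 distinct evaluators. It thereby captures fine-grained, context-sensitive aesthetic preferences, provides a quantifiable and reproducible basis for evaluating the EAR task, and supports use cases such as style preference prediction, aesthetic benchmarking, and personalized aesthetic modeling.

Link: https://arxiv.org/abs/2512.05209
Authors: Vsevolod Plohotnuk, Artyom Panshin, Nikola Banić, Simone Bianco, Michael Freeman, Egor Ershov
Affiliations: Color Reproduction and Synthesis Institute; Moscow Independent Research Institute of Artificial Intelligence; Gideon Brothers; University of Milano-Bicocca
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: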

Abstract:Traditional Image Quality Assessment (IQA) focuses on quantifying technical degradations such as noise, blur, or compression artifacts, using both full-reference and no-reference objective metrics. However, evaluation of rendering aesthetics, a growing domain relevant to photographic editing, content creation, and AI-generated imagery, remains underexplored due to the lack of datasets that reflect the inherently subjective nature of style preference. In this work, a novel benchmark dataset designed to model human aesthetic judgments of image rendering styles is introduced: the Dataset for Evaluating the Aesthetics of Rendering (DEAR). Built upon the MIT-Adobe FiveK dataset, DEAR incorporates pairwise human preference scores collected via large-scale crowdsourcing, with each image pair evaluated by 25 distinct human evaluators with a total of 13,648 of them participating overall. These annotations capture nuanced, context-sensitive aesthetic preferences, enabling the development and evaluation of models that go beyond traditional distortion-based IQA, focusing on a new task: Evaluation of Aesthetics of Rendering (EAR). The data collection pipeline is described, human voting patterns are analyzed, and multiple use cases are outlined, including style preference prediction, aesthetic benchmarking, and personalized aesthetic modeling. To the best of the authors’ knowledge, DEAR is the first dataset to systematically address image aesthetics of rendering assessment grounded in subjective human preferences. A subset of 100 images with markup for them is published on HuggingFace (this http URL).

[CV-77] Your Latent Mask is Wrong: Pixel-Equivalent Latent Compositing for Diffusion Models

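[Quick Read]: This paper addresses the severe artifacts that arise when latent inpainting in diffusion models relies on linearly interpolating VAE latents, in particular artifacts at mask seams and global color shifts. Its core contribution is the Pixel-Equivalent Latent Compositing (PELC) principle: compositing in latent space should match alpha blending in pixel space. To realize it, the authors design DecFormer, a small 7.7M-parameter Transformer that predicts per-channel blend weights and an off-manifold residual correction, enabling true soft-edge alpha compositing with full-resolution mask control. The method needs no finetuning of the diffusion backbone, adds only 0.07% of FLUX.1-Dev's parameters and 3.5% FLOP overhead, reduces error around edges by up to 53%, and, with a lightweight LoRA, reaches fidelity comparable to a fully finetuned inpainting model.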

Link: https://arxiv.org/abs/2512.05198
Authors: Rowan Bradbury, Dazhi Zhong
Affiliations: Bradbury Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: 16 pages, 10 figures

Abstract:Latent inpainting in diffusion models still relies almost universally on linearly interpolating VAE latents under a downsampled mask. We propose a key principle for compositing image latents: Pixel-Equivalent Latent Compositing (PELC). An equivalent latent compositor should be the same as compositing in pixel space. This principle enables full-resolution mask control and true soft-edge alpha compositing, even though VAEs compress images 8x spatially. Modern VAEs capture global context beyond patch-aligned local structure, so linear latent blending cannot be pixel-equivalent: it produces large artifacts at mask seams and global degradation and color shifts. We introduce DecFormer, a 7.7M-parameter transformer that predicts per-channel blend weights and an off-manifold residual correction to realize mask-consistent latent fusion. DecFormer is trained so that decoding after fusion matches pixel-space alpha compositing, is plug-compatible with existing diffusion pipelines, requires no backbone finetuning and adds only 0.07% of FLUX.1-Dev’s parameters and 3.5% FLOP overhead. On the FLUX.1 family, DecFormer restores global color consistency, soft-mask support, sharp boundaries, and high-fidelity masking, reducing error metrics around edges by up to 53% over standard mask interpolation. Used as an inpainting prior, a lightweight LoRA on FLUX.1-Dev with DecFormer achieves fidelity comparable to FLUX.1-Fill, a fully finetuned inpainting model. While we focus on inpainting, PELC is a general recipe for pixel-equivalent latent editing, as we demonstrate on a complex color-correction task.
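
The PELC principle can be written compactly. The notation below is an assumption made for illustration and is not taken from the paper: $D$ is the VAE decoder, $z_a$ and $z_b$ are the two latents, $m$ is the full-resolution mask with values in $[0, 1]$, $m_{\downarrow}$ is its downsampled version, and $C$ is a latent compositor.

```latex
% Pixel-space alpha compositing of the two decoded images
x^{\mathrm{pix}} = m \odot D(z_a) + (1 - m) \odot D(z_b)

% Standard latent blending under a downsampled mask
z^{\mathrm{lin}} = m_{\downarrow} \odot z_a + (1 - m_{\downarrow}) \odot z_b

% Pixel-equivalence requirement on a compositor C
D\bigl(C(z_a, z_b, m)\bigr) \approx x^{\mathrm{pix}}
```

Because $D$ is nonlinear, $D(z^{\mathrm{lin}})$ generally differs from $x^{\mathrm{pix}}$, which is the gap the abstract attributes to linear latent blending.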

[CV-78] Semore: VLM-guided Enhanced Semantic Motion Representations for Visual Reinforcement Learning

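[Quick Read]: This paper addresses the limited performance of existing LLM-based reinforcement learning (RL) methods, which concentrate on guiding the control policy and are constrained by the limited representational capacity of their backbone networks. The key to its solution is Enhanced Semantic Motion Representations (Semore), a new framework built on a Vision-Language Model (VLM) that extracts semantic and motion representations simultaneously from RGB flows through a dual-path backbone. Semore uses the VLM's common-sense knowledge to retrieve key information from observations and a pre-trained CLIP model for text-image alignment, embedding ground-truth representations into the backbone. Semantic and motion extraction are supervised separately while the two representations are allowed to interact spontaneously, which improves decision efficiency and adaptability.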

Link: https://arxiv.org/abs/2512.05172
Authors: Wentao Wang, Chunyang Liu, Kehua Sheng, Bo Zhang, Yan Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:The growing exploration of Large Language Models (LLM) and Vision-Language Models (VLM) has opened avenues for enhancing the effectiveness of reinforcement learning (RL). However, existing LLM-based RL methods often focus on the guidance of control policy and encounter the challenge of limited representations of the backbone networks. To tackle this problem, we introduce Enhanced Semantic Motion Representations (Semore), a new VLM-based framework for visual RL, which can simultaneously extract semantic and motion representations through a dual-path backbone from the RGB flows. Semore utilizes a VLM with common-sense knowledge to retrieve key information from observations, while using the pre-trained CLIP to achieve text-image alignment, thereby embedding the ground-truth representations into the backbone. To efficiently fuse semantic and motion representations for decision-making, our method adopts a separately supervised approach to simultaneously guide the extraction of semantics and motion, while allowing them to interact spontaneously. Extensive experiments demonstrate that, under the guidance of the VLM at the feature level, our method is efficient and adaptive compared to state-of-the-art methods. All code is released.

[CV-79] EFDiT: Efficient Fine-grained Image Generation Using Diffusion Transformer Models ICME

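[Quick Read]: This paper addresses semantic entanglement and insufficient detail in large-scale fine-grained image generation. The key to its solution is a tiered embedder that fuses semantic information from superclasses and child classes, helping the diffusion model capture semantic structure and reducing entanglement. To recover fine detail, it introduces super-resolution in the perceptual information generation stage, using enhancement and degradation models. It also proposes an efficient ProAttention mechanism that can be integrated into diffusion models.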

Link: https://arxiv.org/abs/2512.05152
Authors: Kun Wang, Donglin Di, Tonghua Su, Lei Fan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 5 figures, published at the 2025 IEEE International Conference on Multimedia and Expo (ICME), Nantes, France, 2025

Abstract:Diffusion models are highly regarded for their controllability and the diversity of images they generate. However, class-conditional generation methods based on diffusion models often focus on more common categories. In large-scale fine-grained image generation, issues of semantic information entanglement and insufficient detail in the generated images still persist. This paper attempts to introduce a concept of a tiered embedder in fine-grained image generation, which integrates semantic information from both super and child classes, allowing the diffusion model to better incorporate semantic information and address the issue of semantic entanglement. To address the issue of insufficient detail in fine-grained images, we introduce the concept of super-resolution during the perceptual information generation stage, enhancing the detailed features of fine-grained images through enhancement and degradation models. Furthermore, we propose an efficient ProAttention mechanism that can be effectively implemented in the diffusion model. We evaluate our method through extensive experiments on public benchmarks, demonstrating that our approach outperforms other state-of-the-art fine-tuning methods in terms of performance.

[CV-80] TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows

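[Quick Read]: This paper addresses the inference-efficiency bottleneck of large multimodal generative models (for images and video): multi-step diffusion and flow-matching frameworks typically need 40-100 function evaluations (NFEs), which limits practical deployment. Existing accelerations have clear drawbacks: distillation either requires iterative procedures or degrades sharply at very few steps (e.g., 4-NFE), and adding adversarial training (e.g., DMD/DMD2 and SANA-Sprint) brings training instability, auxiliary models, and high GPU memory use. The key to its solution is TwinFlow, a framework that trains 1-step (1-NFE) generators without a fixed pretrained teacher and without standard adversarial networks. It scales well: after full-parameter training on Qwen-Image-20B, the 1-NFE model approaches the performance of the original 100-NFE model while cutting computational cost roughly 100-fold.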

Link: https://arxiv.org/abs/2512.05150
Authors: Zhenglin Cheng, Peng Sun, Jianguo Li, Tao Lin
Affiliations: Inclusion AI; Shanghai Innovation Institute; Westlake University; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: arxiv v0

Abstract:Recent advances in large multi-modal generative models have demonstrated impressive capabilities in multi-modal generation, including image and video generation. These models are typically built upon multi-step frameworks like diffusion and flow matching, which inherently limits their inference efficiency (requiring 40-100 Number of Function Evaluations (NFEs)). While various few-step methods aim to accelerate the inference, existing solutions have clear limitations. Prominent distillation-based methods, such as progressive and consistency distillation, either require an iterative distillation procedure or show significant degradation at very few steps (4-NFE). Meanwhile, integrating adversarial training into distillation (e.g., DMD/DMD2 and SANA-Sprint) to enhance performance introduces training instability, added complexity, and high GPU memory overhead due to the auxiliary trained models. To this end, we propose TwinFlow, a simple yet effective framework for training 1-step generative models that bypasses the need for fixed pretrained teacher models and avoids standard adversarial networks during training, making it ideal for building large-scale, efficient models. On text-to-image tasks, our method achieves a GenEval score of 0.83 in 1-NFE, outperforming strong baselines like SANA-Sprint (a GAN loss-based framework) and RCGM (a consistency-based framework). Notably, we demonstrate the scalability of TwinFlow by full-parameter training on Qwen-Image-20B, transforming it into an efficient few-step generator. With just 1-NFE, our approach matches the performance of the original 100-NFE model on both the GenEval and DPG-Bench benchmarks, reducing computational cost by 100× with minor quality degradation. Project page is available at this https URL.

[CV-81] Self-Improving VLM Judges Without Human Annotations

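[Quick Read]: This paper addresses the heavy reliance of Vision-Language Model (VLM) evaluation on large-scale human preference annotations, which are costly and quickly become outdated as models improve. The key to its solution is a self-training framework that needs no human preference annotations and trains a VLM judge iteratively on self-synthesized data: it first generates multimodal instruction-response pairs at varying quality levels, then generates reasoning traces and judgments for each pair and keeps only those that match the expected quality levels, and finally trains on the correct judgments and their reasoning traces. The method substantially improves judge performance on benchmarks such as VL-RewardBench, indicating that an evolving judge can be built without human supervision.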

Link: https://arxiv.org/abs/2512.05145
Authors: Inna Wanyin Lin, Yushi Hu, Shuyue Stella Li, Scott Geng, Pang Wei Koh, Luke Zettlemoyer, Tim Althoff, Marjan Ghazvininejad
Affiliations: Meta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Effective judges of Vision-Language Models (VLMs) are crucial for model development. Current methods for training VLM judges mainly rely on large-scale human preference annotations. However, such an approach is costly, and the annotations easily become obsolete as models rapidly improve. In this work, we present a framework to self-train a VLM judge model without any human preference annotations, using only self-synthesized data. Our method is iterative and has three stages: (1) generate diverse multimodal instruction-response pairs at varying quality levels, (2) generate reasoning traces and judgments for each pair, removing the ones that do not match our expected quality levels, and (3) train on correct judge answers and their reasoning traces. We evaluate the resulting judge on Multimodal RewardBench and VL-RewardBench across domains: correctness, preference, reasoning, safety, and visual question-answering. Our method improves a Llama-3.2-11B multimodal judge from 0.38 to 0.51 in overall accuracy on VL-RewardBench, often outperforming much larger models including Llama-3.2-90B, GPT-4o, and Claude 3.5 Sonnet, with particularly strong gains in general, hallucination, and reasoning dimensions. The overall strength of these human-annotation-free results suggests the potential for a future self-judge that evolves alongside rapidly improving VLM capabilities.
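
Stage (2) of the loop described above, the filtering step, could look roughly like the following. This is a minimal sketch under assumptions: the `JudgedPair` record, its field names, and the "A"/"B" verdict encoding are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class JudgedPair:
    """One synthetic comparison; all field names are hypothetical."""
    prompt: str
    expected_better: str   # "A" or "B", known from how the pair was synthesized
    judged_better: str     # "A" or "B", the judge's verdict
    reasoning: str         # the judge's reasoning trace


def filter_consistent(pairs):
    """Keep only judgments that agree with the expected quality ordering."""
    return [p for p in pairs if p.judged_better == p.expected_better]


pairs = [
    JudgedPair("q1", "A", "A", "A cites the chart correctly."),
    JudgedPair("q2", "B", "A", "A is longer."),
    JudgedPair("q3", "B", "B", "B counts the objects correctly."),
]
training_set = filter_consistent(pairs)
```

The surviving records, including their reasoning traces, would then serve as training data for the next iteration of the judge.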

[CV-82] FlowEO: Generative Unsupervised Domain Adaptation for Earth Observation WACV

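[Quick Read]: This paper addresses the poor cross-domain generalization of remote sensing models, caused by distribution shift across sensor types, geographic regions, acquisition times, and atmospheric conditions. It proposes FlowEO, an unsupervised domain adaptation (UDA) framework based on flow matching. The core idea is to use a generative model to learn a semantically preserving mapping from the source domain to the target domain, aligning the domains in image space. The approach improves classification and semantic segmentation without target-domain labels, outperforms existing image-translation approaches to domain adaptation, and achieves on-par or better perceptual image quality.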

Link: https://arxiv.org/abs/2512.05140
Authors: Georges Le Bellier (CEDRIC - VERTIGO, Cnam), Nicolas Audebert (LaSTIG, IGN, CEDRIC - VERTIGO)
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 2026 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Mar 2026, Tucson (AZ), United States

Abstract:The increasing availability of Earth observation data offers unprecedented opportunities for large-scale environmental monitoring and analysis. However, these datasets are inherently heterogeneous, stemming from diverse sensors, geographical regions, acquisition times, and atmospheric conditions. Distribution shifts between training and deployment domains severely limit the generalization of pretrained remote sensing models, making unsupervised domain adaptation (UDA) crucial for real-world applications. We introduce FlowEO, a novel framework that leverages generative models for image-space UDA in Earth observation. We leverage flow matching to learn a semantically preserving mapping that transports from the source to the target image distribution. This allows us to tackle challenging domain adaptation configurations for classification and semantic segmentation of Earth observation images. We conduct extensive experiments across four datasets covering adaptation scenarios such as SAR to optical translation and temporal and semantic shifts caused by natural disasters. Experimental results demonstrate that FlowEO outperforms existing image translation approaches for domain adaptation while achieving on-par or better perceptual image quality, highlighting the potential of flow-matching-based UDA for remote sensing.
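
For orientation, one widely used form of the flow matching objective takes a straight-line path between a source sample $x_0$ and a target sample $x_1$. Whether FlowEO uses exactly this path and pairing is not stated in the abstract, so the block below is a generic sketch; $v_\theta$ denotes the learned velocity field.

```latex
x_t = (1 - t)\, x_0 + t\, x_1, \qquad t \sim \mathcal{U}[0, 1]

\mathcal{L}(\theta) = \mathbb{E}_{t,\, x_0,\, x_1}
  \bigl\| v_\theta(x_t, t) - (x_1 - x_0) \bigr\|^2
```

Under this formulation, a source image is transported to the target distribution by integrating $\mathrm{d}x/\mathrm{d}t = v_\theta(x, t)$ from $t = 0$ to $t = 1$.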

[CV-83] Spatiotemporal Satellite Image Downscaling with Transfer Encoders and Autoregressive Generative Models

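[Quick Read]: This paper addresses generative downscaling from coarse-resolution satellite images (NASA MERRA-2, 50 km) to fine-resolution images (GEOS-5 Nature Run, 7 km), in particular how to preserve physical consistency and accuracy when training data are limited. The key to its solution is a transfer-learning framework: a lightweight U-Net is first pretrained on a long time series of coarse-resolution data to learn spatiotemporal representations; its encoder is then frozen and supplies physically meaningful latent features to a diffusion-based generative model that performs the fine-scale reconstruction. This staged transfer mitigates overfitting with few samples and keeps the downscaled output physically consistent in spatial variability and temporal autocorrelation, enabling stable reconstruction of long-term environmental monitoring data.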

Link: https://arxiv.org/abs/2512.05139
Authors: Yang Xiang, Jingwen Zhong, Yige Yan, Petros Koutrakis, Eric Garshick, Meredith Franklin
Affiliations: University of Toronto; Harvard T.H. Chan School of Public Health; Harvard Medical School; VA Healthcare System Boston
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:We present a transfer-learning generative downscaling framework to reconstruct fine-resolution satellite images from coarse-scale inputs. Our approach combines a lightweight U-Net transfer encoder with a diffusion-based generative model. The simpler U-Net is first pretrained on a long time series of coarse-resolution data to learn spatiotemporal representations; its encoder is then frozen and transferred to a larger downscaling model as physically meaningful latent features. Our application uses NASA’s MERRA-2 reanalysis as the low-resolution source domain (50 km) and the GEOS-5 Nature Run (G5NR) as the high-resolution target (7 km). Our study area covers a large region of Asia, made computationally tractable by splitting it into two subregions and four seasons. A domain similarity analysis using Wasserstein distances confirmed minimal distributional shift between MERRA-2 and G5NR, validating the safety of parameter-frozen transfer. Across seasonal regional splits, our model achieved excellent performance (R² = 0.65 to 0.94), outperforming comparison models including deterministic U-Nets, variational autoencoders, and prior transfer learning baselines. Out-of-data evaluations using semivariograms, ACF/PACF, and lag-based RMSE/R² demonstrated that the predicted downscaled images preserved physically consistent spatial variability and temporal autocorrelation, enabling stable autoregressive reconstruction beyond the G5NR record. These results show that transfer-enhanced diffusion models provide a robust and physically coherent solution for downscaling a long time series of coarse-resolution images with limited training periods. This advancement has significant implications for improving environmental exposure assessment and long-term environmental monitoring.
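
The abstract mentions a domain similarity check based on Wasserstein distances but does not say which variables or dimensionality were used. As a minimal illustration of the quantity itself, the sketch below computes the Wasserstein-1 distance between two equal-size one-dimensional samples; the sample values are made up and only stand in for the two data sources.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples.

    For equal sample sizes the optimal coupling pairs sorted values,
    so the distance is the mean absolute difference of order statistics.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


coarse = [0.10, 0.30, 0.20, 0.40]   # made-up values standing in for MERRA-2
fine = [0.15, 0.35, 0.25, 0.45]     # made-up values standing in for G5NR
shift = wasserstein_1d(coarse, fine)
```

A small value relative to the spread of the data would indicate that the two marginal distributions are close, which is the kind of evidence the abstract cites for freezing the transferred encoder.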

[CV-84] ChromouVQA: Benchmarking Vision-Language Models under Chromatic Camouflaged Images

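[Quick Read]: This paper addresses the difficulty Vision-Language Models (VLMs) have in recognizing targets embedded in cluttered backgrounds, where subtle chromatic differences or disruptive geometry make figure-ground segregation hard. The authors propose ChromouVQA, a large-scale multi-task benchmark built on Ishihara-style chromatic camouflage images that varies fill geometry, chromatic separation, density, size, occlusion, and rotation to provide a controlled, reproducible test bed. The key elements are: 1) a standardized dataset covering nine visual question-answering tasks, including recognition, counting, comparison, and spatial reasoning; and 2) a model-agnostic contrastive recipe that aligns silhouettes with their camouflaged renderings, improving recovery of global shape.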

Link: https://arxiv.org/abs/2512.05137
Authors: Yunfei Zhang, Yizhuo He, Yuanxun Shao, Zhengtao Yao, Haoyan Xu, Junhao Dong, Zhen Yao, Zhikang Dong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-Language Models (VLMs) have advanced multimodal understanding, yet still struggle when targets are embedded in cluttered backgrounds requiring figure-ground segregation. To address this, we introduce ChromouVQA, a large-scale, multi-task benchmark based on Ishihara-style chromatic camouflaged images. We extend classic dot plates with multiple fill geometries and vary chromatic separation, density, size, occlusion, and rotation, recording full metadata for reproducibility. The benchmark covers nine vision-question-answering tasks, including recognition, counting, comparison, and spatial reasoning. Evaluations of humans and VLMs reveal large gaps, especially under subtle chromatic contrast or disruptive geometric fills. We also propose a model-agnostic contrastive recipe aligning silhouettes with their camouflaged renderings, improving recovery of global shapes. ChromouVQA provides a compact, controlled benchmark for reproducible evaluation and extension. Code and dataset are available at this https URL.

[CV-85] Fine-tuning an ECG Foundation Model to Predict Coronary CT Angiography Outcomes

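[Quick Read]: This paper addresses the clinical challenge of identifying the culprit vessel and assessing stenosis severity in coronary artery disease (CAD) to guide individualized therapy. Coronary CT angiography (CCTA) is the first-line non-invasive diagnostic modality, but its reliance on high-end equipment, radiation exposure, and strict patient cooperation limits large-scale use. The study therefore proposes an interpretable AI-ECG model that analyzes routine 12-lead electrocardiograms to predict severe or complete stenosis in the four major coronary arteries: right coronary artery (RCA), left main (LM), left anterior descending (LAD), and left circumflex (LCX). The model reaches internal-validation AUCs of up to 0.818 and remains stable across subgroups, and interpretability analyses reveal waveform differences between risk groups and the electrophysiological regions most associated with stenosis.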

Link: https://arxiv.org/abs/2512.05136
Authors: Yujie Xiao, Gongzhen Tang, Deyun Zhang, Jun Li, Guangkun Nie, Haoyu Wang, Shun Huang, Tong Liu, Qinghao Zhao, Kangyin Chen, Shenda Hong
Affiliations: Institute of Medical Technology, Peking University Health Science Center, Beijing, China; National Institute of Health Data Science, Peking University, Beijing, China; Heart Voice Medical Technology, Hefei, China; School of Intelligence Science and Technology, Peking University, Beijing, China; University of Chinese Academy of Sciences, China; State Key Laboratory of Vascular Homeostasis and Remodeling, NHC Key Laboratory of Cardiovascular Molecular Biology and Regulatory Peptides, Peking University, Beijing, China; Institute for Artificial Intelligence, Peking University, Beijing, China; Tianjin Key Laboratory of Ionic-Molecular Function of Cardiovascular Disease, Department of Cardiology, Tianjin Institute of Cardiology, The Second Hospital of Tianjin Medical University, Tianjin, China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Coronary artery disease (CAD) remains a major global health burden. Accurate identification of the culprit vessel and assessment of stenosis severity are essential for guiding individualized therapy. Although coronary CT angiography (CCTA) is the first-line non-invasive modality for CAD diagnosis, its dependence on high-end equipment, radiation exposure, and strict patient cooperation limits large-scale use. With advances in artificial intelligence (AI) and the widespread availability of electrocardiography (ECG), AI-ECG offers a promising alternative for CAD screening. In this study, we developed an interpretable AI-ECG model to predict severe or complete stenosis of the four major coronary arteries on CCTA. On the internal validation set, the model’s AUCs for the right coronary artery (RCA), left main coronary artery (LM), left anterior descending artery (LAD), and left circumflex artery (LCX) were 0.794, 0.818, 0.744, and 0.755, respectively; on the external validation set, the AUCs reached 0.749, 0.971, 0.667, and 0.727, respectively. Performance remained stable in a clinically normal-ECG subset, indicating robustness beyond overt ECG abnormalities. Subgroup analyses across demographic and acquisition-time strata further confirmed model stability. Risk stratification based on vessel-specific incidence thresholds showed consistent separation on calibration and cumulative event curves. Interpretability analyses revealed distinct waveform differences between high- and low-risk groups, highlighting key electrophysiological regions contributing to model decisions and offering new insights into the ECG correlates of coronary stenosis.
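
The per-vessel results above are reported as AUCs. As a reminder of what that number measures, here is a small self-contained sketch that computes AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. The labels and scores are toy values, not data from the study.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative.

    Ties count as one half (the Mann-Whitney U formulation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 1 = severe stenosis present, 0 = absent
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
```

Here five of the six positive-negative pairs are ranked correctly, so the AUC is 5/6. The pairwise loop is quadratic and is meant for clarity rather than for large cohorts.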

[CV-86] InvarDiff: Cross-Scale Invariance Caching for Accelerated Diffusion Models

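[Quick Read]: This paper addresses the slow inference of diffusion models caused by iterative sampling. Its solution is InvarDiff, a training-free acceleration method that exploits feature invariance observed in deterministic sampling at both the timestep scale and the layer scale. From a few deterministic runs it computes a per-timestep, per-layer, per-module binary cache plan matrix; quantile-based change metrics determine which modules at which steps reuse cached results instead of recomputing. The same invariance criterion is applied at the step scale to enable cross-timestep caching. At inference, caching is scheduled step-first and then layer-wise, giving 2-3× end-to-end speed-ups while preserving image quality.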

Link: https://arxiv.org/abs/2512.05134
Authors: Zihao Wu
Affiliations: Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Comments: 8 pages main, 8 pages appendix, 16 figures, 5 tables. Code: this https URL

Abstract:Diffusion models deliver high-fidelity synthesis but remain slow due to iterative sampling. We empirically observe that feature invariance exists in deterministic sampling, and present InvarDiff, a training-free acceleration method that exploits the relative temporal invariance across timestep-scale and layer-scale. From a few deterministic runs, we compute a per-timestep, per-layer, per-module binary cache plan matrix and use a re-sampling correction to avoid drift when consecutive caches occur. Using quantile-based change metrics, this matrix specifies which module at which step is reused rather than recomputed. The same invariance criterion is applied at the step scale to enable cross-timestep caching, deciding whether an entire step can reuse cached results. During inference, InvarDiff performs step-first and layer-wise caching guided by this matrix. When applied to DiT and FLUX, our approach reduces redundant compute while preserving fidelity. Experiments show that InvarDiff achieves 2-3× end-to-end speed-ups with minimal impact on standard quality metrics. Qualitatively, we observe almost no degradation in visual quality compared with full computations.
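
To make the idea of a quantile-thresholded binary cache plan concrete, here is a simplified sketch. It is not the paper's procedure: the change metric, the nearest-rank quantile, and the two-dimensional (step, layer) layout are assumptions, and the per-module dimension and the re-sampling correction are left out.

```python
def quantile(values, q):
    """Nearest-rank quantile of a non-empty list, q in [0, 1]."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, int(q * len(ordered))))
    return ordered[idx]


def build_cache_plan(change, q=0.5):
    """Return plan[t][l] = 1 (reuse cache) where change is below the q-quantile.

    change[t][l] is a hypothetical relative feature change of layer l
    between step t-1 and step t, measured on calibration runs.
    Step 0 is always computed because nothing is cached yet.
    """
    flat = [c for row in change[1:] for c in row]
    threshold = quantile(flat, q)
    plan = [[0] * len(change[0])]
    for row in change[1:]:
        plan.append([1 if c < threshold else 0 for c in row])
    return plan


change = [
    [1.00, 1.00],   # step 0: no previous features
    [0.02, 0.30],
    [0.01, 0.25],
]
plan = build_cache_plan(change, q=0.5)
```

In this toy case the first layer changes little between steps and is marked for reuse at steps 1 and 2, while the second layer is always recomputed. Raising `q` trades fidelity for more reuse.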

[CV-87] Breaking Scale Anchoring: Frequency Representation Learning for Accurate High-Resolution Inference from Low-Resolution Training

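[Quick Read]: This paper addresses the failure of error to decrease as resolution increases in zero-shot super-resolution spatiotemporal forecasting, which it names Scale Anchoring: because low-resolution training data are limited by their Nyquist frequency, the model struggles with unseen high-frequency components at high-resolution inference, so error stays anchored at the low-resolution level and is mistaken for successful multi-resolution generalization. The key to its solution is architecture-agnostic Frequency Representation Learning (FRL), which uses resolution-aligned frequency representations and spectral consistency training to stabilize the frequency response in high-frequency bands on finer grids. Error then decreases with resolution and the method significantly outperforms baselines at modest computational overhead.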

Link: https://arxiv.org/abs/2512.05132
Authors: Wenshuo Wang, Fan Zhang
Affiliations: South China University of Technology; Massachusetts Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Zero-Shot Super-Resolution Spatiotemporal Forecasting requires a deep learning model to be trained on low-resolution data and deployed for inference on high-resolution data. Existing studies consider maintaining similar error across different resolutions as indicative of successful multi-resolution generalization. However, deep learning models serving as alternatives to numerical solvers should reduce error as resolution increases. The fundamental limitation is that the upper bound of physical law frequencies that low-resolution data can represent is constrained by its Nyquist frequency, making it difficult for models to process signals containing unseen frequency components during high-resolution inference. This results in errors being anchored at low resolution, incorrectly interpreted as successful generalization. We define this fundamental phenomenon as a new problem distinct from existing issues: Scale Anchoring. We therefore propose architecture-agnostic Frequency Representation Learning (FRL). It alleviates Scale Anchoring through resolution-aligned frequency representations and spectral consistency training: on grids with higher Nyquist frequencies, the frequency response in high-frequency bands of FRL-enhanced variants is more stable. This allows errors to decrease with resolution and significantly outperform baselines within our task and resolution range, while incurring only modest computational overhead.
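
The Nyquist limit referred to above is easy to quantify: a grid with spacing Δx can represent spatial frequencies only up to 1/(2Δx). The sketch below compares two grids; the resolutions are illustrative and are not taken from the paper.

```python
def nyquist_frequency(grid_spacing):
    """Highest representable spatial frequency (cycles per unit length)."""
    if grid_spacing <= 0:
        raise ValueError("grid spacing must be positive")
    return 1.0 / (2.0 * grid_spacing)


low_res = nyquist_frequency(1.0 / 64)    # 64 points per unit length
high_res = nyquist_frequency(1.0 / 256)  # 256 points per unit length
```

A model trained on the 64-point grid never sees content between 32 and 128 cycles per unit length, which is exactly the band it must handle when deployed on the 256-point grid.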

[CV-88] AREA3D: Active Reconstruction Agent with Unified Feed-Forward 3D Perception and Vision-Language Guidance

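[Quick Read]: This paper addresses the reliance of existing active 3D reconstruction methods on hand-crafted geometric heuristics, which leads to redundant observations without substantially improving reconstruction quality. The key to its solution is AREA3D, a framework that decouples view-uncertainty modeling from the feed-forward 3D reconstruction model, giving precise uncertainty estimates without expensive online optimization. An integrated vision-language model adds high-level semantic guidance, steering the agent toward more informative and diverse viewpoints than purely geometric cues allow.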

Link: https://arxiv.org/abs/2512.05131
Authors: Tianling Xu, Shengzhe Gan, Leslie Gu, Yuelei Li, Fangneng Zhan, Hanspeter Pfister
Affiliations: Southern University of Science and Technology; Harvard University; California Institute of Technology; Massachusetts Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Under review

Abstract:Active 3D reconstruction enables an agent to autonomously select viewpoints to efficiently obtain accurate and complete scene geometry, rather than passively reconstructing scenes from pre-collected images. However, existing active reconstruction methods often rely on hand-crafted geometric heuristics, which can lead to redundant observations without substantially improving reconstruction quality. To address this limitation, we propose AREA3D, an active reconstruction agent that leverages feed-forward 3D reconstruction models and vision-language guidance. Our framework decouples view-uncertainty modeling from the underlying feed-forward reconstructor, enabling precise uncertainty estimation without expensive online optimization. In addition, an integrated vision-language model provides high-level semantic guidance, encouraging informative and diverse viewpoints beyond purely geometric cues. Extensive experiments on both scene-level and object-level benchmarks demonstrate that AREA3D achieves state-of-the-art reconstruction accuracy, particularly in the sparse-view regime. Code will be made available at: this https URL .

Artificial Intelligence

[AI-0] Training-Time Action Conditioning for Efficient Real-Time Chunking

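[Quick Read]: This paper addresses the computational overhead and added inference latency that the inference-time inpainting mechanism introduces when real-time chunking (RTC) is used with vision-language-action models (VLAs). The key to its solution is to simulate inference delay during training and condition directly on action prefixes, which removes inference-time inpainting. It requires no changes to the model architecture or robot runtime and only a few extra lines of code. In real-world experiments the method matches inference-time RTC in task performance and speed while being computationally cheaper, and in simulation it outperforms inference-time RTC at higher inference delays.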

Link: https://arxiv.org/abs/2512.05964
Authors: Kevin Black, Allen Z. Ren, Michael Equi, Sergey Levine
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Real-time chunking (RTC) enables vision-language-action models (VLAs) to generate smooth, reactive robot trajectories by asynchronously predicting action chunks and conditioning on previously committed actions via inference-time inpainting. However, this inpainting method introduces computational overhead that increases inference latency. In this work, we propose a simple alternative: simulating inference delay at training time and conditioning on action prefixes directly, eliminating any inference-time overhead. Our method requires no modifications to the model architecture or robot runtime, and can be implemented with only a few additional lines of code. In simulated experiments, we find that training-time RTC outperforms inference-time RTC at higher inference delays. In real-world experiments on box building and espresso making tasks with the $\pi_{0.6}$ VLA, we demonstrate that training-time RTC maintains both task performance and speed parity with inference-time RTC while being computationally cheaper. Our results suggest that training-time action conditioning is a practical drop-in replacement for inference-time inpainting in real-time robot control.
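
One way to picture "simulating inference delay at training time" is the data-preparation step sketched below. It is a guess at the general shape, not the authors' implementation: the uniform delay sampling, the function name, and the loss-mask convention are all assumptions.

```python
import random


def make_prefix_example(chunk, max_delay, rng):
    """Split one action chunk into a conditioning prefix and prediction targets.

    The first `delay` actions stand in for actions already committed while
    inference was running; only the remaining positions contribute to the loss.
    """
    if not 0 <= max_delay < len(chunk):
        raise ValueError("max_delay must be in [0, len(chunk))")
    delay = rng.randint(0, max_delay)          # simulated inference delay, in steps
    prefix = chunk[:delay]
    loss_mask = [0] * delay + [1] * (len(chunk) - delay)
    return delay, prefix, loss_mask


rng = random.Random(0)
chunk = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]        # toy 1-D actions
delay, prefix, loss_mask = make_prefix_example(chunk, max_delay=3, rng=rng)
```

At deployment, the actions committed during the previous inference call would be supplied as the prefix, so no inpainting pass is needed.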

[AI-1] Whatever Remains Must Be True: Filtering Drives Reasoning in LLMs, Shaping Diversity

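[Quick Read]: This paper addresses the marked loss of output diversity that occurs when Large Language Models (LLMs) are fine-tuned with Reinforcement Learning (RL). It argues that RL implicitly optimizes the mode-seeking reverse KL divergence, which concentrates the model on some high-probability regions and neglects other correct but less probable solutions. The key to its solution is to start from an explicit target distribution, built by filtering out incorrect answers while preserving the relative probabilities of correct ones, and to approximate it using the α-divergence family. This unifies prior approaches and, by interpolating between mode-seeking and mass-covering divergences, gives direct control over the precision-diversity trade-off. On a Lean theorem-proving benchmark the method is state-of-the-art along the coverage-precision Pareto frontier and outperforms all prior methods on the coverage axis.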

Link: https://arxiv.org/abs/2512.05962
Authors: Germán Kruszewski, Pierre Erbacher, Jos Rozen, Marc Dymetman
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Reinforcement Learning (RL) has become the de facto standard for tuning LLMs to solve tasks involving reasoning. However, growing evidence shows that models trained in such a way often suffer from a significant loss in diversity. We argue that this arises because RL implicitly optimizes the “mode-seeking” or “zero-forcing” Reverse KL to a target distribution, causing the model to concentrate mass on certain high-probability regions of the target while neglecting others. In this work, we instead begin from an explicit target distribution, obtained by filtering out incorrect answers while preserving the relative probabilities of correct ones. Starting from a pre-trained LLM, we approximate this target distribution using the α-divergence family, which unifies prior approaches and enables direct control of the precision-diversity trade-off by interpolating between mode-seeking and mass-covering divergences. On a Lean theorem-proving benchmark, our method achieves state-of-the-art performance along the coverage-precision Pareto frontier, outperforming all prior methods on the coverage axis.
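
The interpolation between the two KL directions can be checked numerically on a toy example. The sketch below uses one common (Amari-style) parametrization of the α-divergence for discrete distributions; the paper may use a different parametrization, and the distributions here are made up. With `p` as the target and `q` as the model, α near 0 recovers the reverse KL (mode-seeking) and α near 1 the forward KL (mass-covering).

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def alpha_divergence(p, q, alpha):
    """Amari-style alpha-divergence for normalized discrete p and q.

    D_alpha(p || q) = (1 - sum_i p_i^alpha * q_i^(1 - alpha)) / (alpha * (1 - alpha)),
    with the limits alpha -> 1 giving KL(p || q) and alpha -> 0 giving KL(q || p).
    """
    if alpha == 1:
        return kl(p, q)
    if alpha == 0:
        return kl(q, p)
    overlap = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q))
    return (1 - overlap) / (alpha * (1 - alpha))


target = [0.6, 0.3, 0.1]   # toy target distribution p
model = [0.2, 0.5, 0.3]    # toy model distribution q
near_reverse = alpha_divergence(target, model, 1e-6)
near_forward = alpha_divergence(target, model, 1 - 1e-6)
midpoint = alpha_divergence(target, model, 0.5)
```

Intermediate values of α, such as the midpoint above, give divergences between the two extremes in behavior, which is the knob the abstract describes for trading precision against diversity.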

[AI-2] MaxShapley: Towards Incentive-compatible Generative Search with Fair Context Attribution

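[Quick Read]: This paper addresses the difficulty of fairly attributing content providers' contributions in the generative search ecosystem, which in turn hinders fair compensation. As LLM-based generative search replaces traditional search, measuring each document's contribution to a generated answer becomes a central challenge. The core of its solution is MaxShapley, a special case of the Shapley value that uses a decomposable max-sum utility function to reduce the exponential cost of Shapley computation to linear in the number of documents. It achieves attribution quality comparable to exact Shapley computation while using far fewer resources, such as tokens.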

Link: https://arxiv.org/abs/2512.05958
Authors: Sara Patel, Mingxun Zhou, Giulia Fanti
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative search engines based on large language models (LLMs) are replacing traditional search, fundamentally changing how information providers are compensated. To sustain this ecosystem, we need fair mechanisms to attribute and compensate content providers based on their contributions to generated answers. We introduce MaxShapley, an efficient algorithm for fair attribution in generative search pipelines that use retrieval-augmented generation (RAG). MaxShapley is a special case of the celebrated Shapley value; it leverages a decomposable max-sum utility function to compute attributions with linear computation in the number of documents, as opposed to the exponential cost of Shapley values. We evaluate MaxShapley on three multi-hop QA datasets (HotPotQA, MuSiQUE, MS MARCO); MaxShapley achieves comparable attribution quality to exact Shapley computation, while consuming a fraction of its tokens; for instance, it gives up to an 8x reduction in resource consumption over prior state-of-the-art methods at the same attribution accuracy.
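
The abstract does not spell out the algorithm, but it is worth seeing why max-type utilities make Shapley values cheap. For a single term of the form v(S) = max of the scores in S, the Shapley value has a well-known closed form (the same structure as the classic airport cost-sharing game), and a sum of such terms follows by linearity. The sketch below is an illustration of that fact, not the paper's method; the relevance scores are hypothetical and assumed non-negative.

```python
from itertools import permutations


def shapley_max_closed_form(scores):
    """Shapley values for the game v(S) = max of scores in S, with v(empty) = 0.

    Assumes non-negative scores. After sorting, each increment between
    consecutive scores is shared equally by the players at or above it.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    values = [0.0] * n
    running, previous = 0.0, 0.0
    for rank, i in enumerate(order):
        running += (scores[i] - previous) / (n - rank)
        values[i] = running
        previous = scores[i]
    return values


def shapley_max_brute_force(scores):
    """Reference implementation: average marginal gain over all orderings."""
    n = len(scores)
    values = [0.0] * n
    perms = list(permutations(range(n)))
    for perm in perms:
        best = 0.0
        for i in perm:
            values[i] += max(best, scores[i]) - best
            best = max(best, scores[i])
    return [v / len(perms) for v in values]


relevance = [0.2, 0.9, 0.5]   # hypothetical per-document relevance scores
fast = shapley_max_closed_form(relevance)
slow = shapley_max_brute_force(relevance)
```

Both functions return the same attributions, and the values sum to the maximum score. The closed form needs only a sort and one pass, whereas the brute-force reference enumerates all n! orderings.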

[AI-3] SymPyBench: A Dynamic Benchmark for Scientific Reasoning with Executable Python Code

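[Quick Read]: This paper addresses the lack of large-scale, extensible, well-structured benchmarks for evaluating scientific reasoning, especially in physics, where existing datasets rarely cover diverse problem configurations or provide verifiable reasoning. The key to its solution is SymPyBench, a large synthetic benchmark of 15,045 university-level physics problems. Each problem is parameterized to support an effectively unlimited number of input variants and comes with structured step-by-step reasoning and executable Python code that produces the ground-truth solution. The benchmark also introduces three new metrics, Consistency Score, Failure Rate, and Confusion Rate, which quantify variability and uncertainty across problem variants.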

Link: https://arxiv.org/abs/2512.05954
Authors: Shima Imani, Seungwhan Moon, Adel Ahmadyan, Lu Zhang, Kirmani Ahmed, Babak Damavandi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

点击查看摘要

Abstract:We introduce SymPyBench, a large-scale synthetic benchmark of 15,045 university-level physics problems (90/10% train/test split). Each problem is fully parameterized, supporting an effectively infinite range of input configurations, and is accompanied by structured, step-by-step reasoning and executable Python code that produces the ground-truth solution for any parameter set. The benchmark contains three question types: MC-Symbolic (multiple-choice with symbolic options), MC-Numerical (multiple-choice with numerical options), and free-form (open-ended responses). These diverse formats test complementary reasoning skills. By leveraging the dynamic, code-driven nature of the benchmark, we introduce three novel evaluation metrics in addition to standard accuracy: Consistency Score, Failure Rate, and Confusion Rate, which quantify variability and uncertainty across problem variants. Experiments with state-of-the-art instruction-tuned language models reveal both strengths and limitations in scientific reasoning, positioning SymPyBench as a foundation for developing more robust and interpretable reasoning systems.
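To make the idea of a parameterized problem with executable ground truth concrete, here is a minimal sketch. The class `ParameterizedProblem`, the projectile example, and the `variant_accuracy` helper are all hypothetical and are not taken from SymPyBench; in particular, `variant_accuracy` is a plain per-problem accuracy over sampled variants, not the paper's Consistency Score, whose definition the abstract does not give.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

G = 9.81  # gravitational acceleration in m/s^2


@dataclass
class ParameterizedProblem:
    """A question template plus code that solves any sampled parameter set."""
    template: str
    sample: Callable[[random.Random], Dict[str, float]]
    solve: Callable[[Dict[str, float]], float]

    def instantiate(self, rng: random.Random) -> Tuple[str, float]:
        params = self.sample(rng)
        return self.template.format(**params), self.solve(params)


def sample_projectile(rng: random.Random) -> Dict[str, float]:
    # Round before solving so the printed question and the answer use the same numbers.
    return {"v0": round(rng.uniform(5.0, 30.0), 1),
            "angle": rng.choice([15, 30, 45, 60, 75])}


def solve_projectile(p: Dict[str, float]) -> float:
    # Range on level ground: R = v0^2 * sin(2 * theta) / g
    return p["v0"] ** 2 * math.sin(2 * math.radians(p["angle"])) / G


projectile = ParameterizedProblem(
    template="A ball is launched at {v0} m/s at {angle} degrees above the horizontal. "
             "Find its range on level ground in metres (g = 9.81 m/s^2).",
    sample=sample_projectile,
    solve=solve_projectile,
)


def variant_accuracy(answer_fn: Callable[[str], float],
                     problem: ParameterizedProblem,
                     n_variants: int = 20,
                     seed: int = 0,
                     rel_tol: float = 1e-2) -> float:
    """Fraction of sampled variants of one problem that `answer_fn` gets right."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_variants):
        question, truth = problem.instantiate(rng)
        if math.isclose(answer_fn(question), truth, rel_tol=rel_tol):
            correct += 1
    return correct / n_variants
```

In use, `answer_fn` would wrap a model call that maps the question text to a number; a model that is right on some variants and wrong on others scores strictly between 0 and 1, which is the kind of variability a dynamic benchmark can expose and a static one cannot.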

[AI-4] Trusted AI Agents in the Cloud

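[Quick Read]: This paper addresses the trust and security problems faced by LLM-powered AI agents running as cloud services, particularly in multi-party ecosystems where untrusted components can cause data leakage, tampering, or unintended behavior. Existing Confidential Virtual Machines (CVMs) provide only per-binary protection and cannot guarantee cross-principal trust, accelerator-level isolation, or supervision of agent behavior. The key is Omega, a system that builds a Trusted Agent Platform on Confidential VMs and Confidential GPUs to provide end-to-end isolation and verifiable trust across principals. It uses nested isolation to host multiple agents within a single CVM, establishes cross-principal trust through differential attestation, and adds a policy specification and enforcement framework governing data access, tool usage, and inter-agent communication. The result is high performance and strong security for high-density, policy-compliant multi-agent deployments at cloud scale.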

Link: https://arxiv.org/abs/2512.05951
Authors: Teofil Bodea, Masanori Misono, Julian Pritzi, Patrick Sabanic, Thore Sommer, Harshavardhan Unnibhavi, David Schall, Nuno Santos, Dimitrios Stavrakakis, Pramod Bhatotia
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract:AI agents powered by large language models are increasingly deployed as cloud services that autonomously access sensitive data, invoke external tools, and interact with other agents. However, these agents run within a complex multi-party ecosystem, where untrusted components can lead to data leakage, tampering, or unintended behavior. Existing Confidential Virtual Machines (CVMs) provide only per-binary protection and offer no guarantees for cross-principal trust, accelerator-level isolation, or supervised agent behavior. We present Omega, a system that enables trusted AI agents by enforcing end-to-end isolation, establishing verifiable trust across all contributing principals, and supervising every external interaction with accountable provenance. Omega builds on Confidential VMs and Confidential GPUs to create a Trusted Agent Platform that hosts many agents within a single CVM using nested isolation. It also provides efficient multi-agent orchestration with cross-principal trust establishment via differential attestation, and a policy specification and enforcement framework that governs data access, tool usage, and inter-agent communication for data protection and regulatory compliance. Implemented on AMD SEV-SNP and NVIDIA H100, Omega fully secures agent state across CVM-GPU, and achieves high performance while enabling high-density, policy-compliant multi-agent deployments at cloud scale.

[AI-5] Impugan: Learning Conditional Generative Models for Robust Data Imputation

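[Quick Read]: This paper addresses incomplete data in real-world settings, especially when integrating heterogeneous multi-source data, where sensor failures, inconsistent records, and differences in sampling rate and quality produce missing values that are hard to impute and model. Traditional imputation methods such as regression models, expectation-maximization (EM), and multiple imputation rely on linearity and independence assumptions and often give biased or over-smoothed results on complex or heterogeneous data. The key is Impugan, an imputation framework based on a conditional Generative Adversarial Network (cGAN): trained on complete samples, the generator learns the nonlinear dependence of missing variables on observed ones, while the discriminator enforces realism by distinguishing real from generated data. This captures nonlinear and multimodal relationships that conventional methods cannot represent. In experiments, Impugan significantly outperforms leading baselines on Earth Mover's Distance (EMD) and mutual-information deviation (MI).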

Link: https://arxiv.org/abs/2512.05950
Authors: Zalish Mahmud, Anantaa Kotal, Aritran Piplai
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Incomplete data are common in real-world applications. Sensors fail, records are inconsistent, and datasets collected from different sources often differ in scale, sampling rate, and quality. These differences create missing values that make it difficult to combine data and build reliable models. Standard imputation methods such as regression models, expectation-maximization, and multiple imputation rely on strong assumptions about linearity and independence. These assumptions rarely hold for complex or heterogeneous data, which can lead to biased or over-smoothed estimates. We propose Impugan, a conditional Generative Adversarial Network (cGAN) for imputing missing values and integrating heterogeneous datasets. The model is trained on complete samples to learn how missing variables depend on observed ones. During inference, the generator reconstructs missing entries from available features, and the discriminator enforces realism by distinguishing true from imputed data. This adversarial process allows Impugan to capture nonlinear and multimodal relationships that conventional methods cannot represent. In experiments on benchmark datasets and a multi-source integration task, Impugan achieves up to 82% lower Earth Mover’s Distance (EMD) and 70% lower mutual-information deviation (MI) compared to leading baselines. These results show that adversarially trained generative models provide a scalable and principled approach for imputing and merging incomplete, heterogeneous data. Our model is available at: this http URL

[AI-6] Variational Quantum Rainbow Deep Q-Network for Optimizing Resource Allocation Problem

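[Quick Read]: This paper addresses the NP-hardness of the resource allocation problem (RAP) caused by combinatorial complexity, in particular for the human resource allocation problem (HRAP), where conventional deep reinforcement learning (DRL) methods are limited by the representational power of classical function approximators. The key is Variational Quantum Rainbow DQN (VQR-DQN), which combines ring-topology variational quantum circuits with Rainbow DQN to exploit quantum superposition and entanglement for richer policy representations. HRAP is modeled as a Markov decision process (MDP) with a combinatorial action space. On four benchmarks, the quantum-enhanced network reduces normalized makespan by 26.8% relative to random baselines and outperforms Double DQN and classical Rainbow DQN by 4.9-13.4%.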

Link: https://arxiv.org/abs/2512.05946
Authors: Truong Thanh Hung Nguyen, Truong Thinh Nguyen, Hung Cao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Software Engineering (cs.SE)
Comments: Quantum Software Engineering Practices at The 41st ACM/SIGAPP Symposium On Applied Computing (SAC 2026)

Abstract:Resource allocation remains NP-hard due to combinatorial complexity. While deep reinforcement learning (DRL) methods, such as the Rainbow Deep Q-Network (DQN), improve scalability through prioritized replay and distributional heads, classical function approximators limit their representational power. We introduce Variational Quantum Rainbow DQN (VQR-DQN), which integrates ring-topology variational quantum circuits with Rainbow DQN to leverage quantum superposition and entanglement. We frame the human resource allocation problem (HRAP) as a Markov decision process (MDP) with combinatorial action spaces based on officer capabilities, event schedules, and transition times. On four HRAP benchmarks, VQR-DQN achieves 26.8% normalized makespan reduction versus random baselines and outperforms Double DQN and classical Rainbow DQN by 4.9-13.4%. These gains align with theoretical connections between circuit expressibility, entanglement, and policy quality, demonstrating the potential of quantum-enhanced DRL for large-scale resource allocation. Our implementation is available at: this https URL.

[AI-7] TRACE: A Framework for Analyzing and Enhancing Stepwise Reasoning in Vision-Language Models

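[Quick Read]: This paper addresses the limited reliability of large vision-language models on mathematical and scientific reasoning tasks, and in particular the fact that standard final-answer evaluation rarely exposes errors in intermediate reasoning, so that "silent failures" persist. The key is TRACE, a framework for Transparent Reasoning And Consistency Evaluation built on Auxiliary Reasoning Sets (ARS): compact sub-question-answer pairs that decompose a complex problem so that intermediate steps can be evaluated with consistency-based metrics. This exposes erroneous reasoning paths that standard evaluation overlooks and provides actionable signals for debugging and improving models.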

Link: https://arxiv.org/abs/2512.05943
Authors: Shima Imani, Seungwhan Moon, Lambert Mathias, Lu Zhang, Babak Damavandi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Reliable mathematical and scientific reasoning remains an open challenge for large vision-language models. Standard final-answer evaluation often masks reasoning errors, allowing silent failures to persist. To address this gap, we introduce TRACE, a framework for Transparent Reasoning And Consistency Evaluation that diagnoses reasoning trajectories rather than only end results. At its core, TRACE leverages Auxiliary Reasoning Sets (ARS), compact sub-question-answer pairs that decompose complex problems, evaluate intermediate steps through consistency-based metrics, and expose failures overlooked by standard evaluation. Our experiments show that consistency across ARS correlates with final-answer correctness and helps pinpoint the reasoning steps where failures arise, offering actionable signals for model improvement. Furthermore, TRACE defines confidence regions that distinguish reliable from unreliable reasoning paths, supporting effective filtering, debugging, and model refinement.

[AI-8] PRiSM: An Agentic Multimodal Benchmark for Scientific Reasoning via Python-Grounded Evaluation

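[Quick Read]: This paper addresses core problems in evaluating vision-language models (VLMs) in scientific domains such as mathematics and physics: existing benchmarks struggle to measure conceptual understanding, symbolic reasoning, and adherence to formal laws, and most datasets lack intermediate reasoning steps, robustness to variations, and mechanisms for verifying scientific correctness. The key is PRiSM, a synthetic, fully dynamic, multimodal benchmark grounded in Python code. Its scalable agent-based pipeline, PrismAgent, generates over 24,750 university-level physics and math problems, each with dynamic textual and visual input, a generated figure, and structured outputs (executable Python code for ground-truth generation and verification, plus step-by-step reasoning). This design enables fine-grained auditing of multimodal VLMs, revealing failure modes, uncertainty behaviors, and limitations in scientific reasoning, and it supports five targeted evaluation tasks: generalization, symbolic program synthesis, perturbation robustness, reasoning correction, and ambiguity resolution.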

Link: https://arxiv.org/abs/2512.05930
Authors: Shima Imani, Seungwhan Moon, Adel Ahmadyan, Lu Zhang, Kirmani Ahmed, Babak Damavandi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Evaluating vision-language models (VLMs) in scientific domains like mathematics and physics poses unique challenges that go far beyond predicting final answers. These domains demand conceptual understanding, symbolic reasoning, and adherence to formal laws, requirements that most existing benchmarks fail to address. In particular, current datasets tend to be static, lacking intermediate reasoning steps, robustness to variations, or mechanisms for verifying scientific correctness. To address these limitations, we introduce PRiSM, a synthetic, fully dynamic, and multimodal benchmark for evaluating scientific reasoning via grounded Python code. PRiSM includes over 24,750 university-level physics and math problems, and it leverages our scalable agent-based pipeline, PrismAgent, to generate well-structured problem instances. Each problem contains dynamic textual and visual input, a generated figure, alongside rich structured outputs: executable Python code for ground truth generation and verification, and detailed step-by-step reasoning. The dynamic nature and Python-powered automated ground truth generation of our benchmark allow for fine-grained experimental auditing of multimodal VLMs, revealing failure modes, uncertainty behaviors, and limitations in scientific reasoning. To this end, we propose five targeted evaluation tasks covering generalization, symbolic program synthesis, perturbation robustness, reasoning correction, and ambiguity resolution. Through comprehensive evaluation of existing VLMs, we highlight their limitations and showcase how PRiSM enables deeper insights into their scientific reasoning capabilities.

[AI-9] Neural Coherence: Find higher performance to out-of-distribution tasks from few samples

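[Quick Read]: This paper addresses how to efficiently choose the best checkpoint of a pre-trained vision model as the starting point for fine-tuning when target-task data is scarce, unlabeled, and out-of-distribution. Conventional methods rely on in-distribution validation data, which is often unreliable or inapplicable in this setting. The key is a new concept, Neural Coherence, which characterizes a model's activation statistics on the source and target domains to build highly data-efficient model selection methods, achieving better generalization with only a few unlabeled target-domain samples.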

Link: https://arxiv.org/abs/2512.05880
Authors: Simon Guiroy, Mats Richter, Sarath Chandar, Christopher Pal
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:To create state-of-the-art models for many downstream tasks, it has become common practice to fine-tune a pre-trained large vision model. However, it remains an open question of how to best determine which of the many possible model checkpoints resulting from a large training run to use as the starting point. This becomes especially important when data for the target task of interest is scarce, unlabeled and out-of-distribution. In such scenarios, common methods relying on in-distribution validation data become unreliable or inapplicable. This work proposes a novel approach for model selection that operates reliably on just a few unlabeled examples from the target task. Our approach is based on a novel concept: Neural Coherence, which entails characterizing a model’s activation statistics for source and target domains, allowing one to define model selection methods with high data-efficiency. We provide experiments where models are pre-trained on ImageNet1K and examine target domains consisting of Food-101, PlantNet-300K and iNaturalist. We also evaluate it in many meta-learning settings. Our approach significantly improves generalization across these different target domains compared to established baselines. We further demonstrate the versatility of Neural Coherence as a powerful principle by showing its effectiveness in training data selection.

[AI-10] Sparse Attention Post-Training for Mechanistic Interpretability

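[Quick Read]: This paper addresses redundant computation in the attention mechanism of Transformers: despite their large parameter counts, the attention connectivity of current models contains many unnecessary edges, which makes the models complex and hard to interpret. The key is a flexible sparsity regularization applied post-training under a constrained-loss objective, which reduces attention connectivity to about 0.3% of its original edges without sacrificing pretraining performance. Used as a structural prior, this sparsification preserves capability while yielding more localized, interpretable connectivity patterns. It also simplifies global circuits: task-specific circuits involve fewer attention heads and multilayer perceptron (MLP) components, with up to 100x fewer connecting edges. This suggests that sparsity may be an important guiding principle for building more structured and interpretable models.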

Link: https://arxiv.org/abs/2512.05865
Authors: Florent Draye, Anson Lei, Ingmar Posner, Bernhard Schölkopf
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 1B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to approximately 0.3% of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer edges connecting them. These results demonstrate that transformer attention can be made orders of magnitude sparser, suggesting that much of its computation is redundant and that sparsity may serve as a guiding principle for more structured and interpretable models.

[AI-11] NEAT: Neighborhood-Guided Efficient Autoregressive Set Transformer for 3D Molecular Generation

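[Quick Read]: This paper addresses the mismatch that a token-order assumption introduces when autoregressive models generate 3D molecular structures. Prior methods rely on a canonical order of the molecular graph or on focus atoms to sidestep atom-level permutation invariance, which limits flexibility and generalization. The key is NEAT (Neighborhood-guided, Efficient, Autoregressive, Set Transformer), which treats a molecular graph as a set of atoms and uses a neighborhood-guided autoregressive flow model to learn the order-agnostic distribution over admissible atom tokens at the graph boundary. This achieves atom-level permutation invariance with high computational efficiency, and its 3D molecular generation performance approaches the state of the art.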

Link: https://arxiv.org/abs/2512.05844
Authors: Daniel Rose, Roxane Axel Jacob, Johannes Kirchmair, Thierry Langer
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Autoregressive models are a promising alternative to diffusion-based models for 3D molecular structure generation. However, a key limitation is the assumption of a token order: while text has a natural sequential order, the next token prediction given a molecular graph prefix should be invariant to atom permutations. Previous works sidestepped this mismatch by using canonical orders or focus atoms. We argue that this is unnecessary. We introduce NEAT, a Neighborhood-guided, Efficient, Autoregressive, Set Transformer that treats molecular graphs as sets of atoms and learns the order-agnostic distribution over admissible tokens at the graph boundary with an autoregressive flow model. NEAT approaches state-of-the-art performance in 3D molecular generation with high computational efficiency and atom-level permutation invariance, establishing a practical foundation for scalable molecular design.

[AI-12] Using Large Language Models to Create Personalized Networks From Therapy Sessions

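[Quick Read]: This paper addresses the difficulty of estimating individualized psychological networks for psychotherapy personalization without intensive longitudinal data, which limits the scalability of network-driven treatment personalization. The key is an end-to-end pipeline that uses large language models (LLMs) to automatically extract and generate clinically relevant networks of psychological processes from 77 therapy transcripts. The method identifies psychological processes and their dimensions through in-context learning, then applies a two-step strategy that groups the processes into clinically meaningful clusters and generates explanation-augmented relationships between clusters. Expert evaluation found the resulting networks superior to those from direct prompting in clinical utility and interpretability, with scores for clinical relevance, novelty, and usefulness of 72-75%.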

Link: https://arxiv.org/abs/2512.05836
Authors: Clarissa W. Ong, Hiba Arnaout, Kate Sheehan, Estella Fox, Eugen Owtscharow, Iryna Gurevych
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in psychotherapy have focused on treatment personalization, such as by selecting treatment modules based on personalized networks. However, estimating personalized networks typically requires intensive longitudinal data, which is not always feasible. A solution to facilitate scalability of network-driven treatment personalization is leveraging LLMs. In this study, we present an end-to-end pipeline for automatically generating client networks from 77 therapy transcripts to support case conceptualization and treatment planning. We annotated 3364 psychological processes and their corresponding dimensions in therapy transcripts. Using these data, we applied in-context learning to jointly identify psychological processes and their dimensions. The method achieved high performance even with a few training examples. To organize the processes into networks, we introduced a two-step method that grouped them into clinically meaningful clusters. We then generated explanation-augmented relationships between clusters. Experts found that networks produced by our multi-step approach outperformed those built with direct prompting for clinical utility and interpretability, with up to 90% preferring our approach. In addition, the networks were rated favorably by experts, with scores for clinical relevance, novelty, and usefulness ranging from 72-75%. Our findings provide a proof of concept for using LLMs to create clinically relevant networks from therapy transcripts. Advantages of our approach include bottom-up case conceptualization from client utterances in therapy sessions and identification of latent themes. Networks generated from our pipeline may be used in clinical settings and supervision and training. Future research should examine whether these networks improve treatment outcomes relative to other methods of treatment personalization, including statistically estimated networks.

[AI-13] Approximation of Box Decomposition Algorithm for Fast Hypervolume-Based Multi-Objective Optimization

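[Quick Read]: This paper addresses the bottleneck that hypervolume (HV)-based Bayesian optimization (BO) faces in multi-objective decision-making: optimizing the acquisition function is expensive, mainly because of HV improvement calculations. HV box decomposition handles the frequent exact improvement calculations efficiently, but has super-polynomial worst-case memory complexity $O(MN^{\lfloor \frac{M+1}{2} \rfloor})$, and the approximation algorithm previously employed by Couckuyt et al. lacks a rigorous algorithmic description in the literature. The key contribution is a comprehensive mathematical and algorithmic description of this approximation algorithm, which fills that gap and gives a reproducible basis for efficient HV-based BO in practice.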

Link: https://arxiv.org/abs/2512.05825
Authors: Shuhei Watanabe
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Hypervolume (HV)-based Bayesian optimization (BO) is one of the standard approaches for multi-objective decision-making. However, the computational cost of optimizing the acquisition function remains a significant bottleneck, primarily due to the expense of HV improvement calculations. While HV box-decomposition offers an efficient way to cope with the frequent exact improvement calculations, it suffers from super-polynomial memory complexity $O(MN^{\lfloor \frac{M+1}{2} \rfloor})$ in the worst case as proposed by Lacour et al. (2017). To tackle this problem, Couckuyt et al. (2012) employed an approximation algorithm. However, a rigorous algorithmic description is currently absent from the literature. This paper bridges this gap by providing comprehensive mathematical and algorithmic details of this approximation algorithm.
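For readers unfamiliar with hypervolume improvement, the quantity is easiest to see with two objectives, where the dominated region splits into one box per Pareto point after sorting. The sketch below covers only this 2D minimization case and is not the approximation algorithm the paper describes; the function names and the reference-point convention are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def pareto_front_2d(points: Sequence[Point]) -> List[Point]:
    """Non-dominated points for minimization, sorted by the first objective."""
    front: List[Point] = []
    best_f2 = float("inf")
    for f1, f2 in sorted(points):
        if f2 < best_f2:  # strictly better in f2 than every point with smaller f1
            front.append((f1, f2))
            best_f2 = f2
    return front


def hypervolume_2d(points: Sequence[Point], ref: Point) -> float:
    """Area dominated by `points` and bounded above by the reference point `ref`."""
    inside = [p for p in points if p[0] < ref[0] and p[1] < ref[1]]
    volume = 0.0
    upper_f2 = ref[1]
    for f1, f2 in pareto_front_2d(inside):
        # One box per front point: width out to ref, height down from the previous f2.
        volume += (ref[0] - f1) * (upper_f2 - f2)
        upper_f2 = f2
    return volume


def hv_improvement_2d(points: Sequence[Point], candidate: Point, ref: Point) -> float:
    """Hypervolume gained by adding `candidate` to the current set of points."""
    return hypervolume_2d(list(points) + [candidate], ref) - hypervolume_2d(points, ref)
```

With more than two objectives the dominated region no longer splits into a number of boxes linear in the number of points, which is where the memory growth mentioned in the abstract comes from.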

[AI-14] 3D Path Planning for Robot-assisted Vertebroplasty from Arbitrary Bi-plane X-ray via Differentiable Rendering

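[Quick Read]: This paper addresses the clinical difficulty of robot-assisted vertebroplasty caused by reliance on preoperative 3D CT scans for surgical path planning, given that such scans are not routinely acquired for this procedure. The key is a differentiable rendering-based framework that performs 3D transpedicular path planning from bi-planar 2D X-ray images without preoperative CT. The method generates a vertebral atlas with a Statistical Shape Model (SSM) and dynamically refines the SSM shape and pose using a learned similarity loss, enabling 3D reconstruction and path planning that do not depend on fixed imaging geometries and that generalize to arbitrary-view X-rays.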

Link: https://arxiv.org/abs/2512.05803
Authors: Blanca Inigo, Benjamin D. Killeen, Rebecca Choi, Michelle Song, Ali Uneri, Majid Khan, Christopher Bailey, Axel Krieger, Mathias Unberath
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Robotic systems are transforming image-guided interventions by enhancing accuracy and minimizing radiation exposure. A significant challenge in robotic assistance lies in surgical path planning, which often relies on the registration of intraoperative 2D images with preoperative 3D CT scans. This requirement can be burdensome and costly, particularly in procedures like vertebroplasty, where preoperative CT scans are not routinely performed. To address this issue, we introduce a differentiable rendering-based framework for 3D transpedicular path planning utilizing bi-planar 2D X-rays. Our method integrates differentiable rendering with a vertebral atlas generated through a Statistical Shape Model (SSM) and employs a learned similarity loss to refine the SSM shape and pose dynamically, independent of fixed imaging geometries. We evaluated our framework in two stages: first, through vertebral reconstruction from orthogonal X-rays for benchmarking, and second, via clinician-in-the-loop path planning using arbitrary-view X-rays. Our results indicate that our method outperformed a normalized cross-correlation baseline in reconstruction metrics (DICE: 0.75 vs. 0.65) and achieved comparable performance to the state-of-the-art model ReVerteR (DICE: 0.77), while maintaining generalization to arbitrary views. Success rates for bipedicular planning reached 82% with synthetic data and 75% with cadaver data, exceeding the 66% and 31% rates of a 2D-to-3D baseline, respectively. In conclusion, our framework facilitates versatile, CT-free 3D path planning for robot-assisted vertebroplasty, effectively accommodating real-world imaging diversity without the need for preoperative CT scans.

[AI-15] Mechanistic Interpretability of Antibody Language Models Using SAEs

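[Quick Read]: This paper addresses how to achieve interpretability and controllability of the generation process in domain-specific protein language models, focusing on mechanistic interpretability and generation steering for the autoregressive antibody language model p-IgGen. The key is a comparison of two kinds of sparse autoencoders (SAEs), TopK SAEs and Ordered SAEs, in identifying biologically relevant latent features and steering generation. TopK SAEs map latent features to biological concepts effectively, but high feature-concept correlation does not guarantee causal control over generation. Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, at the cost of more complex and less interpretable activation patterns. TopK SAEs therefore suffice when only feature mapping is needed, while Ordered SAEs are preferable when precise generative steering is required.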

Link: https://arxiv.org/abs/2512.05794
Authors: Rebonto Haque, Oliver M. Turnbull, Anisha Parsan, Nithin Parsan, John J. Yang, Charlotte M. Deane
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Sparse autoencoders (SAEs) are a mechanistic interpretability technique that has been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate an autoregressive antibody language model, p-IgGen, and steer its generation. We show that TopK SAEs can reveal biologically meaningful latent features, but high feature-concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain-specific protein language models and suggest that, while TopK SAEs are sufficient for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.

[AI-16] The Missing Layer of AGI: From Pattern Alchemy to Coordination Physics

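[Quick Read]: This paper addresses the debate over whether large language models (LLMs) can lead to AGI, rebutting the view that LLMs are mere pattern matchers incapable of reasoning or planning. Its central argument is that LLMs, as System-1 pattern repositories, are the necessary substrate, and that what is missing is a System-2 coordination layer that selects, constrains, and binds these patterns. The key is UCCT, a theory of semantic anchoring that models reasoning as a phase transition governed by effective support (rho_d), representational mismatch (d_r), and an adaptive anchoring budget (gamma log k). The theory is realized in the MACI architecture through three mechanisms: baiting (behavior-modulated debate), filtering (Socratic judging), and persistence (transactional memory), which together turn ungrounded generation into goal-directed, coordinated reasoning.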

Link: https://arxiv.org/abs/2512.05765
Authors: Edward Y. Chang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages, 3 figures

Abstract:Influential critiques argue that Large Language Models (LLMs) are a dead end for AGI: “mere pattern matchers” structurally incapable of reasoning or planning. We argue this conclusion misidentifies the bottleneck: it confuses the ocean with the net. Pattern repositories are the necessary System-1 substrate; the missing component is a System-2 coordination layer that selects, constrains, and binds these patterns. We formalize this layer via UCCT, a theory of semantic anchoring that models reasoning as a phase transition governed by effective support (rho_d), representational mismatch (d_r), and an adaptive anchoring budget (gamma log k). Under this lens, ungrounded generation is simply an unbaited retrieval of the substrate’s maximum likelihood prior, while “reasoning” emerges when anchors shift the posterior toward goal-directed constraints. We translate UCCT into architecture with MACI, a coordination stack that implements baiting (behavior-modulated debate), filtering (Socratic judging), and persistence (transactional memory). By reframing common objections as testable coordination failures, we argue that the path to AGI runs through LLMs, not around them.

[AI-17] Evolutionary System 2 Reasoning: An Empirical Proof

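[Quick Read]: This paper addresses the limited general intelligence of current large language models (LLMs), in particular their weak System 2 reasoning (slow, logical thinking), which keeps them from human-level generalizable reasoning. The key is an evolutionary reasoning optimization (ERO) framework that mimics natural selection to iteratively improve reasoning over a population of models: it first initializes multiple LLMs as a population, then applies an evolutionary strategy driven by a quantified reasoning score to find the individual with the strongest reasoning ability. Experiments show that even a relatively weak base model (Qwen-7B) gains substantially in System 2 reasoning after ERO's simple evolution loop, supporting the feasibility of strengthening general reasoning in LLMs through evolution.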

Link: https://arxiv.org/abs/2512.05760
Authors: Zeyuan Ma, Wenqi Huang, Guo-Huan Song, Hongshu Guo, Sijie Ma, Zhiguang Cao, Yue-Jiao Gong
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Machine intelligence marks the ultimate dream of making machines’ intelligence comparable to that of human beings. While recent progress in Large Language Models (LLMs) shows substantial specific skills for a wide array of downstream tasks, they more or less fall short in general intelligence. Following the correlation between intelligence and System 2 reasoning (slow thinking), in this paper we aim to answer a worthwhile research question: could machine intelligence such as LLMs be evolved to acquire reasoning ability (not a specific skill), just as human beings did? To this end, we propose the evolutionary reasoning optimization (ERO) framework, which performs survival of the fittest over a population of LLMs to search for an individual with strong reasoning ability. Given a reasoning task, ERO first initializes multiple LLMs as a population, after which an evolutionary strategy evolves the population to maximize the quantified reasoning score of the best individual. Based on experiments on representative test suites, we claim two surprising empirical discoveries: i) the latest LLMs such as GPT-5 still show limited System 2 reasoning ability; ii) with the simple evolution loop of ERO, a relatively weak model (Qwen-7B) can be enhanced to exhibit powerful reasoning ability. Our project can be accessed at this https URL for reproduction needs.

[AI-18] A Fast Anti-Jamming Cognitive Radar Deployment Algorithm Based on Reinforcement Learning

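[Quick Read]: This paper addresses the fast deployment of cognitive radar under jamming in order to improve target detection efficiency. Existing methods rely mainly on evolutionary algorithms, which are time-consuming and prone to local optima. The key is a new end-to-end framework, the Fast Anti-Jamming Radar Deployment Algorithm (FARDA), which models radar deployment as a deep reinforcement learning task, designs integrated neural modules to perceive heatmap information, and introduces a new reward format. Experiments show that FARDA achieves coverage comparable to evolutionary algorithms while deploying radars approximately 7,000 times faster, and ablation experiments confirm the necessity of each component.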

Link: https://arxiv.org/abs/2512.05753
Authors: Wencheng Cai, Xuchao Gao, Congying Han, Mingqiang Li, Tiande Guo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The fast deployment of cognitive radar to counter jamming remains a critical challenge in modern warfare, where more efficient deployment leads to quicker detection of targets. Existing methods are primarily based on evolutionary algorithms, which are time-consuming and prone to falling into local optima. We tackle these drawbacks via the efficient inference of neural networks and propose a brand new framework: Fast Anti-Jamming Radar Deployment Algorithm (FARDA). We first model the radar deployment problem as an end-to-end task and design deep reinforcement learning algorithms to solve it, where we develop integrated neural modules to perceive heatmap information and a brand new reward format. Empirical results demonstrate that our method achieves coverage comparable to evolutionary algorithms while deploying radars approximately 7,000 times faster. Further ablation experiments confirm the necessity of each component of FARDA.

[AI-19] KANFormer for Predicting Fill Probabilities via Survival Analysis in Limit Order Books

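[Quick Read]: This paper addresses time-to-fill prediction for limit orders in a limit order book (LOB), that is, estimating how long an order takes to be executed. Existing methods usually rely only on a series of order book snapshots, which makes it hard to capture queue dynamics and the complex nonlinear relationships in market microstructure. The proposed KANFormer combines a Dilated Causal Convolutional network with a Transformer encoder and adds Kolmogorov-Arnold Networks (KANs) to strengthen nonlinear approximation, so that it integrates market-level information with agent-level features (such as the order's position in the queue and agents' actions) more effectively and improves both the accuracy and the interpretability of fill-probability modeling. The key idea is to combine rich market signals at both levels with an expressive neural architecture to obtain accurate and interpretable predictions.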

Link: https://arxiv.org/abs/2512.05734
Authors: Jinfeng Zhong, Emmanuel Bacry, Agathe Guilloux, Jean-François Muzy
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This paper introduces KANFormer, a novel deep-learning-based model for predicting the time-to-fill of limit orders by leveraging both market- and agent-level information. KANFormer combines a Dilated Causal Convolutional network with a Transformer encoder, enhanced by Kolmogorov-Arnold Networks (KANs), which improve nonlinear approximation. Unlike existing models that rely solely on a series of snapshots of the limit order book, KANFormer integrates the actions of agents related to LOB dynamics and the position of the order in the queue to more effectively capture patterns related to execution likelihood. We evaluate the model using CAC 40 index futures data with labeled orders. The results show that KANFormer outperforms existing works in both calibration (Right-Censored Log-Likelihood, Integrated Brier Score) and discrimination (C-index, time-dependent AUC). We further analyze feature importance over time using SHAP (SHapley Additive exPlanations). Our results highlight the benefits of combining rich market signals with expressive neural architectures to achieve accurate and interpretable predictions of fill probabilities.
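Of the evaluation metrics named above, the C-index is the simplest to write down. The sketch below is a plain quadratic-time version of Harrell's concordance index for right-censored data, under the assumption that an order cancelled before it fills is treated as censored. The argument names are illustrative, and nothing here reflects how the paper computes the metric.

```python
from typing import Sequence


def concordance_index(times: Sequence[float],
                      events: Sequence[bool],
                      risk: Sequence[float]) -> float:
    """Harrell's C-index: share of comparable pairs ranked in the right order.

    times[i]  : observed duration (time to fill, or time to censoring)
    events[i] : True if the fill was observed, False if the order was censored
    risk[i]   : model score, where a higher score means an earlier expected fill

    A pair (i, j) is comparable when i's fill was observed and times[i] < times[j].
    It counts as concordant when risk[i] > risk[j]; tied scores count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored order cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ordering of fill times. The metric measures discrimination only, which is why the abstract reports calibration metrics alongside it.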

[AI-20] Bayesian Active Inference for Intelligent UAV Anti-Jamming and Adaptive Trajectory Planning

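[Quick Read]: This paper addresses trajectory planning for unmanned aerial vehicles (UAVs) under adversarial jamming, that is, how to plan robust paths and maintain communication when jammer locations are unknown. The key is a hierarchical trajectory planning framework based on Bayesian Active Inference, which combines expert demonstrations with probabilistic generative modeling to encode high-level symbolic planning, low-level motion policies, and wireless signal feedback. During deployment, the UAV performs online inference to anticipate interference, localize jammers, and adapt its trajectory, which significantly reduces communication interference and mission cost while generalizing robustly in dynamic environments.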

链接: https://arxiv.org/abs/2512.05711
作者: Ali Krayani,Seyedeh Fatemeh Sadati,Lucio Marcenaro,Carlo Regazzoni
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Signal Processing (eess.SP); Systems and Control (eess.SY)
备注: This paper has been accepted for the 2026 IEEE Consumer Communications Networking Conference (IEEE CCNC 2026)

点击查看摘要

Abstract:This paper proposes a hierarchical trajectory planning framework for UAVs operating under adversarial jamming conditions. Leveraging Bayesian Active Inference, the approach combines expert-generated demonstrations with probabilistic generative modeling to encode high-level symbolic planning, low-level motion policies, and wireless signal feedback. During deployment, the UAV performs online inference to anticipate interference, localize jammers, and adapt its trajectory accordingly, without prior knowledge of jammer locations. Simulation results demonstrate that the proposed method achieves near-expert performance, significantly reducing communication interference and mission cost compared to model-free reinforcement learning baselines, while maintaining robust generalization in dynamic environments.

[AI-21] HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies

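[Quick read]: This paper addresses the limited generalization of embodied-intelligence foundation models caused by the strong heterogeneity of robot demonstration data (different embodiments, action spaces, sensor configurations, control frequencies, and so on). Existing methods do not model these sources of heterogeneity explicitly, so they struggle to integrate diverse robot data and degrade when transferred to new settings. The key is a Hierarchical Mixture-of-Experts (HiMoE) architecture for the vision-language-action (VLA) framework, which adaptively handles the different heterogeneity factors layer by layer and gradually abstracts them into shared knowledge representations, enabling efficient use of diverse robot data and robust transfer.

Link: https://arxiv.org/abs/2512.05693
Authors: Zhiying Du, Bei Liu, Yaobo Liang, Yichao Shen, Haidong Cao, Xiangyu Zheng, Zhiyuan Feng, Zuxuan Wu, Jiaolong Yang, Yu-Gang Jiang
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: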

Abstract:The development of foundation models for embodied intelligence critically depends on access to large-scale, high-quality robot demonstration data. Recent approaches have sought to address this challenge by training on large collections of heterogeneous robotic datasets. However, unlike vision or language data, robotic demonstrations exhibit substantial heterogeneity across embodiments and action spaces as well as other prominent variations such as sensor configurations and action control frequencies. The lack of explicit designs for handling such heterogeneity causes existing methods to struggle with integrating diverse factors, thereby limiting their generalization and leading to degraded performance when transferred to new settings. In this paper, we present HiMoE-VLA, a novel vision-language-action (VLA) framework tailored to effectively handle diverse robotic data with heterogeneity. Specifically, we introduce a Hierarchical Mixture-of-Experts (HiMoE) architecture for the action module which adaptively handles multiple sources of heterogeneity across layers and gradually abstracts them into shared knowledge representations. Through extensive experimentation with simulation benchmarks and real-world robotic platforms, HiMoE-VLA demonstrates a consistent performance boost over existing VLA baselines, achieving higher accuracy and robust generalization across diverse robots and action spaces. The code and models are publicly available at this https URL.

[AI-22] On Dynamic Programming Theory for Leader-Follower Stochastic Games

[Quick read]: This paper addresses how to efficiently compute a strong Stackelberg equilibrium (SSE) in leader-follower general-sum stochastic games (LF-GSSGs), in particular how to characterize all rational follower best responses under a partial leader commitment and solve for the optimal leader policy. The key is a dynamic programming (DP) framework that applies Bellman recursion over credible sets, which formally represent all rational follower best responses under partial leader commitments. This reduces the original problem losslessly to a Markov decision process (MDP) over credible sets and leads to $\epsilon$-optimal DP algorithms with provable guarantees on leader exploitability.

Link: https://arxiv.org/abs/2512.05667
Authors: Jilles Steeve Dibangoye, Thibaut Le Marre, Ocan Sankur, François Schwarzentruber
Affiliation: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: 31 pages, 5 figures

Abstract:Leader-follower general-sum stochastic games (LF-GSSGs) model sequential decision-making under asymmetric commitment, where a leader commits to a policy and a follower best responds, yielding a strong Stackelberg equilibrium (SSE) with leader-favourable tie-breaking. This paper introduces a dynamic programming (DP) framework that applies Bellman recursion over credible sets (state abstractions formally representing all rational follower best responses under partial leader commitments) to compute SSEs. We first prove that any LF-GSSG admits a lossless reduction to a Markov decision process (MDP) over credible sets. We further establish that synthesising an optimal memoryless deterministic leader policy is NP-hard, motivating the development of $\epsilon$-optimal DP algorithms with provable guarantees on leader exploitability. Experiments on standard mixed-motive benchmarks, including security games, resource allocation, and adversarial planning, demonstrate empirical gains in leader value and runtime scalability over state-of-the-art methods.
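
The notion of a strong Stackelberg equilibrium with leader-favourable tie-breaking can be made concrete in a much simpler setting than the one studied in the paper. The sketch below assumes a one-shot bimatrix game and restricts the leader to pure commitments. It does not implement credible sets or the paper's DP algorithm, and the function name and payoff matrices are hypothetical.

```python
def pure_strong_stackelberg(leader_payoff, follower_payoff):
    """Best pure leader commitment in a one-shot bimatrix game.

    Both arguments are lists of rows: entry [a][b] is the payoff when the
    leader plays a and the follower plays b. Returns (a, b, leader_value).
    """
    best = None
    for a, (l_row, f_row) in enumerate(zip(leader_payoff, follower_payoff)):
        top = max(f_row)
        # All follower actions that are best responses to the commitment a.
        best_responses = [b for b, v in enumerate(f_row) if v == top]
        # Strong Stackelberg: ties are broken in the leader's favour.
        b = max(best_responses, key=lambda k: l_row[k])
        if best is None or l_row[b] > best[2]:
            best = (a, b, l_row[b])
    return best


# Hypothetical game: row 0 dominates for the leader in the simultaneous
# game, yet committing to row 1 steers the follower and pays more.
leader = [[2, 4],
          [1, 3]]
follower = [[1, 0],
            [0, 1]]
commitment = pure_strong_stackelberg(leader, follower)

# Tie-breaking example: the follower is indifferent, so the leader's
# preferred response (column 1) is selected.
tie_case = pure_strong_stackelberg([[0, 5]], [[1, 1]])
```

In the first game the leader earns 3 by committing to row 1, compared with 2 at the simultaneous-move outcome, which shows why commitment matters.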

[AI-23] Feasibility of AI-Assisted Programming for End-User Development

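[Quick read]: This paper asks whether AI-assisted end-user coding is a feasible paradigm for end-user development that could complement or even replace today's dominant low-code/no-code (LCNC) platforms. The approach relies on generative AI, in particular large language model-based assistants or "copilots", so that non-programmers can generate and refine code and build applications directly from natural language prompts, promising more flexible, faster, and more reusable development with less vendor lock-in. In a case study, most non-programmer participants built a basic web app in reasonable time through interaction with AI and endorsed the approach, which supports its feasibility.

Link: https://arxiv.org/abs/2512.05666
Authors: Irene Weber
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: 12 pages, 3 figures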

Abstract:End-user development, where non-programmers create or adapt their own digital tools, can play a key role in driving digital transformation within organizations. Currently, low-code/no-code platforms are widely used to enable end-user development through visual programming, minimizing the need for manual coding. Recent advancements in generative AI, particularly large language model-based assistants and “copilots”, open new possibilities, as they may enable end users to generate and refine programming code and build apps directly from natural language prompts. This approach, here referred to as AI-assisted end-user coding, promises greater flexibility, broader applicability, faster development, improved reusability, and reduced vendor lock-in compared to the established visual LCNC platforms. This paper investigates whether AI-assisted end-user coding is a feasible paradigm for end-user development, which may complement or even replace the LCNC model in the future. To explore this, we conducted a case study in which non-programmers were asked to develop a basic web app through interaction with AI this http URL majority of study participants successfully completed the task in reasonable time and also expressed support for AI-assisted end-user coding as a viable approach for end-user development. The paper presents the study design, analyzes the outcomes, and discusses potential implications for practice, future research, and academic teaching.

[AI-24] Modular Jets for Supervised Pipelines: Diagnosing Mirage vs Identifiability

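[Quick read]: This paper addresses the fact that evaluating supervised models by predictive risk alone leaves their internal structure unidentifiable: it cannot tell whether a model's modular decomposition is uniquely determined by the data and the evaluation design. The core challenge is that distinct modular decompositions may implement the same input-output map and even induce indistinguishable local responses, which gives rise to "mirage" regimes. The key is Modular Jets: estimated local linear response maps that describe how each module reacts to small structured perturbations. Building on these, the paper defines "identifiable" and "mirage" regimes and proves a jet-identifiability theorem for two-module linear regression pipelines: under mild rank assumptions and with access to module-level jets, the internal factorisation is uniquely determined, whereas risk-only evaluation admits a large family of observationally equivalent but structurally different mirage decompositions. The framework comes with an algorithm (MoJet) for empirical jet estimation and mirage diagnostics.

Link: https://arxiv.org/abs/2512.05638
Authors: Suman Sanyal
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: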

Abstract:Classical supervised learning evaluates models primarily via predictive risk on hold-out data. Such evaluations quantify how well a function behaves on a distribution, but they do not address whether the internal decomposition of a model is uniquely determined by the data and evaluation design. In this paper, we introduce Modular Jets for regression and classification pipelines. Given a task manifold (input space), a modular decomposition, and access to module-level representations, we estimate empirical jets, which are local linear response maps that describe how each module reacts to small structured perturbations of the input. We propose an empirical notion of mirage regimes, where multiple distinct modular decompositions induce indistinguishable jets and thus remain observationally equivalent, and contrast this with an identifiable regime, where the observed jets single out a decomposition up to natural symmetries. In the setting of two-module linear regression pipelines we prove a jet-identifiability theorem. Under mild rank assumptions and access to module-level jets, the internal factorisation is uniquely determined, whereas risk-only evaluation admits a large family of mirage decompositions that implement the same input-to-output map. We then present an algorithm (MoJet) for empirical jet estimation and mirage diagnostics, and illustrate the framework using linear and deep regression as well as pipeline classification.
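
The claim that risk-only evaluation admits many decompositions with the same input-to-output map has a simple linear-algebra illustration. The sketch below assumes a two-module linear pipeline y = B(Ax) and builds an alternative factorisation with an invertible matrix G. The variable names and dimensions are hypothetical, and this is not the paper's MoJet algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 5, 3, 2

# Original two-module pipeline: hidden = A x, output = B hidden.
A = rng.normal(size=(d_hidden, d_in))
B = rng.normal(size=(d_out, d_hidden))

# Any invertible G gives another factorisation of the same end-to-end map,
# because (B G^{-1}) (G A) = B A. This G has determinant 6.
G = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 3.0]])
A_alt = G @ A
B_alt = B @ np.linalg.inv(G)

X = rng.normal(size=(100, d_in))
y = X @ (B @ A).T
y_alt = X @ (B_alt @ A_alt).T

# Identical predictions, hence identical risk on any dataset ...
same_output = np.allclose(y, y_alt)
# ... but the first module's local linear response (its Jacobian, which is
# A itself for a linear module) differs between the two decompositions.
same_module_jet = np.allclose(A, A_alt)
```

Observing only `y` cannot separate the two pipelines, while observing the module-level response can, which is the intuition behind using module-level jets for identifiability.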

[AI-25] Enhancing Local Search for MaxSAT with Deep Differentiation Clause Weighting ECAI2025

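[Quick read]: This paper addresses the limited efficiency and solution quality of stochastic local search (SLS) for Partial Maximum Satisfiability (PMS) and Weighted Partial Maximum Satisfiability (WPMS). Existing methods usually apply a uniform clause-weight update strategy and overlook the structural differences between PMS and WPMS. The key is a new clause weighting scheme that, for the first time, updates the clause weights of PMS and WPMS instances under distinct conditions, together with a new initialization method suited to both instance types and a decimation method that prioritizes satisfying unit and hard clauses. The resulting SLS solver, DeepDist, outperforms state-of-the-art SLS solvers on benchmarks from the anytime tracks of recent MaxSAT Evaluations, and a hybrid of DeepDist with TT-Open-WBO-Inc surpasses the MaxSAT Evaluation 2024 winners SPB-MaxSAT-c-Band and SPB-MaxSAT-c-FPS.

Link: https://arxiv.org/abs/2512.05619
Authors: Menghua Jiang, Haokai Gao, Shuhao Chen, Yin Chen
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by ECAI 2025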

Abstract:Partial Maximum Satisfiability (PMS) and Weighted Partial Maximum Satisfiability (WPMS) generalize Maximum Satisfiability (MaxSAT), with broad real-world applications. Recent advances in Stochastic Local Search (SLS) algorithms for solving (W)PMS have mainly focused on designing clause weighting schemes. However, existing methods often fail to adequately distinguish between PMS and WPMS, typically employing uniform update strategies for clause weights and overlooking critical structural differences between the two problem types. In this work, we present a novel clause weighting scheme that, for the first time, updates the clause weights of PMS and WPMS instances according to distinct conditions. This scheme also introduces a new initialization method, which better accommodates the unique characteristics of both instance types. Furthermore, we propose a decimation method that prioritizes satisfying unit and hard clauses, effectively complementing our proposed clause weighting scheme. Building on these methods, we develop a new SLS solver for (W)PMS named DeepDist. Experimental results on benchmarks from the anytime tracks of recent MaxSAT Evaluations show that DeepDist outperforms state-of-the-art SLS solvers. Notably, a hybrid solver combining DeepDist with TT-Open-WBO-Inc surpasses the performance of the MaxSAT Evaluation 2024 winners, SPB-MaxSAT-c-Band and SPB-MaxSAT-c-FPS, highlighting the effectiveness of our approach. The code is available at this https URL

[AI-26] A Comprehensive Framework for Automated Quality Control in the Automotive Industry

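[Quick read]: This paper addresses automated, high-accuracy detection of surface and thread defects in aluminum high-pressure die casting (HPDC) automotive components, where manual inspection is slow and prone to missed defects. The key is an inspection system built from a pair of collaborative robots, each carrying a high-resolution camera with specialized lenses and optimized lighting for consistent image quality. A YOLO11n deep learning model, enhanced with image slicing, ensemble learning, and bounding-box merging, localizes and classifies defects, and image processing techniques estimate their extent. The system runs in real time while minimizing false detections and is designed to scale and adapt to different production environments.

Link: https://arxiv.org/abs/2512.05579
Authors: Panagiota Moraiti, Panagiotis Giannikos, Athanasios Mastrogeorgiou, Panagiotis Mavridis, Linghao Zhou, Panagiotis Chatzakos
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: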

Abstract:This paper presents a cutting-edge robotic inspection solution designed to automate quality control in automotive manufacturing. The system integrates a pair of collaborative robots, each equipped with a high-resolution camera-based vision system to accurately detect and localize surface and thread defects in aluminum high-pressure die casting (HPDC) automotive components. In addition, specialized lenses and optimized lighting configurations are employed to ensure consistent and high-quality image acquisition. The YOLO11n deep learning model is utilized, incorporating additional enhancements such as image slicing, ensemble learning, and bounding-box merging to significantly improve performance and minimize false detections. Furthermore, image processing techniques are applied to estimate the extent of the detected defects. Experimental results demonstrate real-time performance with high accuracy across a wide variety of defects, while minimizing false detections. The proposed solution is promising and highly scalable, providing the flexibility to adapt to various production environments and meet the evolving demands of the automotive industry.

[AI-27] CureAgent: A Training-Free Executor-Analyst Framework for Clinical Reasoning NEURIPS2025

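[Quick read]: This paper addresses the "Context Utilization Failure" of clinical agents built on small LLMs (such as TxAgent): thanks to supervised finetuning the model retrieves biomedical evidence successfully, yet it fails to ground its diagnosis in that evidence, so its reasoning is unreliable. The key is the Executor-Analyst Framework, a modular design that decouples the syntactic precision of tool execution from the semantic robustness of clinical reasoning: specialized TxAgents (Executors) handle tool calls, while long-context foundation models (Analysts) perform the higher-level reasoning. The paper further finds that a Stratified Ensemble strategy outperforms global pooling by preserving evidentiary diversity and easing the information bottleneck. Stress tests reveal a "Context-Performance Paradox" and a "Curse of Dimensionality" in action spaces, showing that overly long contexts or larger toolsets add noise and hurt performance, and underscoring the potential of training-free architectural engineering.

Link: https://arxiv.org/abs/2512.05576
Authors: Ting-Ting Xie, Yixin Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 2nd Place Solution to the CURE-Bench Competition @ NeurIPS 2025. Code available at this https URL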

Abstract:Current clinical agents built on small LLMs, such as TxAgent, suffer from a Context Utilization Failure, where models successfully retrieve biomedical evidence due to supervised finetuning but fail to ground their diagnosis in that information. In this work, we propose the Executor-Analyst Framework, a modular architecture that decouples the syntactic precision of tool execution from the semantic robustness of clinical reasoning. By orchestrating specialized TxAgents (Executors) with long-context foundation models (Analysts), we mitigate the reasoning deficits observed in monolithic models. Beyond simple modularity, we demonstrate that a Stratified Ensemble strategy significantly outperforms global pooling by preserving evidentiary diversity, effectively addressing the information bottleneck. Furthermore, our stress tests reveal critical scaling insights: (1) a Context-Performance Paradox, where extending reasoning contexts beyond 12k tokens introduces noise that degrades accuracy; and (2) the Curse of Dimensionality in action spaces, where expanding toolsets necessitates hierarchical retrieval strategies. Crucially, our approach underscores the potential of training-free architectural engineering, achieving state-of-the-art performance on CURE-Bench without the need for expensive end-to-end finetuning. This provides a scalable, agile foundation for the next generation of trustworthy AI-driven therapeutics. Code has been released on this https URL.

[AI-28] Improving Local Fidelity Through Sampling and Modeling Nonlinearity

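[Quick read]: This paper addresses the distorted explanations that local model-agnostic methods such as LIME produce for complex black-box models because they assume a linear local decision boundary. The core of the solution is to model non-linear local boundaries with Multivariate Adaptive Regression Splines (MARS), which capture the local behavior of the reference model more accurately and raise local fidelity. In addition, N-ball sampling draws samples directly from the desired distribution instead of reweighting samples as LIME does, which further improves the faithfulness score. Experiments on three UCI datasets and several classifiers show more faithful explanations than the baselines, with an average 37% reduction in root mean square error.

Link: https://arxiv.org/abs/2512.05556
Authors: Sanjeev Shrestha, Rahul Dubey, Hui Liu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: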

Abstract:With the increasing complexity of black-box machine learning models and their adoption in high-stakes areas, it is critical to provide explanations for their predictions. Local Interpretable Model-agnostic Explanation (LIME) is a widely used technique that explains the prediction of any classifier by learning an interpretable model locally around the predicted instance. However, it assumes that the local decision boundary is linear and fails to capture non-linear relationships, leading to incorrect explanations. In this paper, we propose a novel method that can generate high-fidelity explanations. Multivariate adaptive regression splines (MARS) are used to model non-linear local boundaries, effectively capturing the underlying behavior of the reference model and thereby enhancing the local fidelity of the explanation. Additionally, we utilize the N-ball sampling technique, which samples directly from the desired distribution instead of reweighting samples as done in LIME, further improving the faithfulness score. We evaluate our method on three UCI datasets across different classifiers and varying kernel widths. Experimental results show that our method yields more faithful explanations compared to baselines, achieving an average reduction of 37% in root mean square error, significantly improving local fidelity.
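
Sampling directly inside a neighbourhood of the explained instance, rather than reweighting globally drawn samples, can be sketched as follows. The code assumes the target is a uniform distribution over an N-dimensional ball; the paper's exact target distribution may differ. The function name `sample_n_ball` and the toy instance are hypothetical.

```python
import numpy as np


def sample_n_ball(center, radius, n_samples, rng):
    """Draw points uniformly from the ball of given radius around center."""
    center = np.asarray(center, dtype=float)
    dim = center.shape[0]
    # Normalised Gaussian vectors are uniformly distributed on the sphere.
    directions = rng.normal(size=(n_samples, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Volume grows like r**dim, so radii follow U**(1/dim) for uniformity.
    radii = radius * rng.uniform(size=(n_samples, 1)) ** (1.0 / dim)
    return center + radii * directions


rng = np.random.default_rng(0)
instance = np.array([0.5, -1.0, 2.0])  # hypothetical instance to explain
samples = sample_n_ball(instance, radius=0.3, n_samples=1000, rng=rng)
distances = np.linalg.norm(samples - instance, axis=1)
```

Every sample lies within the chosen radius of the instance, so a local surrogate can be fitted on these points without a distance-based kernel weight.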

[AI-29] RoBoN: Routed Online Best-of-n for Test-Time Scaling with Multiple LLMs NEURIPS2025

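[Quick read]: This paper addresses the fact that conventional best-of-n inference uses a single large language model (LLM) and therefore cannot exploit the complementary strengths of multiple models. The key is RoBoN (Routed Online Best-of-n), a sequential multi-model scheme that routes generations across models online using reward-model scores and an agreement signal on the predicted responses. It needs no additional training and keeps compute parity, and on several reasoning benchmarks it outperforms both single-model best-of-n and a uniform multi-model portfolio baseline.

Link: https://arxiv.org/abs/2512.05542
Authors: Jonathan Geuter, Gregor Kornhardt
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 pages, 3 figures. 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Foundations of Reasoning in Language Models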

Abstract:Best-of-n is a widely used test-time scaling approach for LLM inference. Yet despite evidence that LLMs exhibit complementary strengths across tasks, traditionally best-of-n relies on a single model to generate responses. We propose RoBoN (Routed Online Best-of-n), a sequential multi-LLM alternative to the prevailing single-model best-of-n. Given a suite of models $\{m_i\}_{i=1}^{M}$, RoBoN sequentially routes generations one-by-one across models, based on scores computed using a reward model and an agreement signal on the predicted responses. This online routing requires no additional training, keeps compute parity, and works with any plug-in reward model. Across reasoning benchmarks (MATH500, OlympiadBench, MinervaMath, GSM8K, MMLU), RoBoN consistently outperforms standard best-of-n applied to each individual model for larger n, with gains of up to 3.4% in absolute accuracy, and also improves over a uniform multi-model portfolio baseline. Our results indicate that diversity across models can be exploited at inference to improve best-of-n performance over any constituent model alone, providing a simple, training-free path to test-time scaling with multiple LLMs.

[AI-30] On the Theoretical Foundation of Sparse Dictionary Learning in Mechanistic Interpretability

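[Quick read]: This paper addresses the lack of a unified theoretical framework for sparse dictionary learning (SDL) methods. SDL methods such as sparse autoencoders, transcoders, and crosscoders have been used successfully to disentangle concepts held in superposition in neural network representation spaces, but their optimization behavior lacks systematic theory; existing results cover only sparse autoencoders with tied-weight constraints. The key contribution is the first theoretical framework that treats SDL as one unified optimization problem. A rigorous analysis of its optimization landscape yields the first theoretical explanations for empirical phenomena such as feature absorption, dead neurons, and the neuron resampling technique, giving SDL methods a formal and verifiable grounding.

Link: https://arxiv.org/abs/2512.05534
Authors: Yiming Tang, Harshvardhan Saini, Yizhen Liao, Dianbo Liu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: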

Abstract:As AI models achieve remarkable capabilities across diverse domains, understanding what representations they learn and how they process information has become increasingly important for both scientific progress and trustworthy deployment. Recent works in mechanistic interpretability have shown that neural networks represent meaningful concepts as directions in their representation spaces and often encode many concepts in superposition. Various sparse dictionary learning (SDL) methods, including sparse autoencoders, transcoders, and crosscoders, address this by training auxiliary models with sparsity constraints to disentangle these superposed concepts into interpretable features. These methods have demonstrated remarkable empirical success but have limited theoretical understanding. Existing theoretical work is limited to sparse autoencoders with tied-weight constraints, leaving the broader family of SDL methods without formal grounding. In this work, we develop the first unified theoretical framework considering SDL as one unified optimization problem. We demonstrate how diverse methods instantiate the theoretical framework and provide rigorous analysis on the optimization landscape. We provide the first theoretical explanations for some empirically observed phenomena, including feature absorption, dead neurons, and the neuron resampling technique. We further design controlled experiments to validate our theoretical results.
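
For readers unfamiliar with the objects being analysed, the sketch below shows a commonly used sparse autoencoder objective: a ReLU encoder, a linear decoder with untied weights, and an L1 sparsity penalty. It is an assumed, generic formulation and not the unified framework from the paper. The function name, shapes, and the way a dead neuron is forced here are illustrative only.

```python
import numpy as np


def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff):
    """Reconstruction error plus L1 sparsity penalty for a batch x."""
    features = np.maximum(0.0, x @ W_enc + b_enc)  # sparse feature activations
    x_hat = features @ W_dec + b_dec               # linear reconstruction
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(features), axis=1))
    return recon + l1_coeff * sparsity, features


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))        # 8 activation vectors of width 4
W_enc = rng.normal(size=(4, 6))    # overcomplete dictionary with 6 features
W_dec = rng.normal(size=(6, 4))
b_enc = np.zeros(6)
b_dec = np.zeros(4)

# A "dead neuron" in this toy setup: a pre-activation that is negative for
# every input, so the feature never fires and receives no gradient via ReLU.
b_enc[0] = -1e6
loss, features = sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3)
dead_units = np.where(features.max(axis=0) == 0.0)[0]
```

Neuron resampling, as mentioned in the abstract, refers to re-initialising units like the one forced dead here.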

[AI-31] MIND: Multi-rationale INtegrated Discriminative Reasoning Framework for Multi-modal Large Models

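[Quick read]: This paper addresses three weaknesses of multimodal large language models (MLLMs) in complex scenarios: limited multi-rationale semantic modeling, insufficient logical robustness, and susceptibility to misleading interpretations. The key is the MIND (Multi-rationale INtegrated Discriminative) reasoning framework, which moves from passive imitation-based reasoning to active discriminative reasoning through three mechanisms: (1) the Rationale Augmentation and Discrimination (RAD) paradigm, which automatically expands existing datasets with diverse rationales to build a unified and extensible data foundation; (2) the Progressive Two-stage Correction Learning (P2CL) strategy, whose first phase strengthens multi-rationale positive learning and whose second phase enables logic discrimination and correction; and (3) the Multi-rationale Contrastive Alignment (MCA) optimization strategy, which mitigates representation entanglement in the multi-rationale semantic space by aggregating correct reasoning and separating incorrect reasoning.

Link: https://arxiv.org/abs/2512.05530
Authors: Chuang Yu, Jinmiao Zhao, Mingxuan Zhao, Yunpeng Liu, Xiujun Shu, Yuanhao Feng, Bo Wang, Xiangyu Yue
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: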

Abstract:Recently, multimodal large language models (MLLMs) have been widely applied to reasoning tasks. However, they suffer from limited multi-rationale semantic modeling, insufficient logical robustness, and are susceptible to misleading interpretations in complex scenarios. Therefore, we propose a Multi-rationale INtegrated Discriminative (MIND) reasoning framework, which is designed to endow MLLMs with human-like cognitive abilities of “Understand - Rethink - Correct”, and achieves a paradigm evolution from passive imitation-based reasoning to active discriminative reasoning. Specifically, we introduce a Rationale Augmentation and Discrimination (RAD) paradigm, which automatically and efficiently expands existing datasets by generating diverse rationales, providing a unified and extensible data foundation. Meanwhile, we design a Progressive Two-stage Correction Learning (P2CL) strategy. The first phase enhances multi-rationale positive learning, while the second phase enables active logic discrimination and correction. In addition, to mitigate representation entanglement in the multi-rationale semantic space, we propose a Multi-rationale Contrastive Alignment (MCA) optimization strategy, which achieves semantic aggregation of correct reasoning and boundary separation of incorrect reasoning. Extensive experiments demonstrate that the proposed MIND reasoning framework achieves state-of-the-art (SOTA) performance on multiple public datasets covering scientific, commonsense, and mathematical scenarios. It provides a new perspective for advancing MLLMs towards higher levels of cognitive intelligence. Our code is available at this https URL

[AI-32] User Negotiations of Authenticity Ownership and Governance on AI-Generated Video Platforms: Evidence from Sora

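[Quick read]: This paper asks how users of fast-growing AI video generation platforms such as OpenAI's Sora make sense of and negotiate authenticity, authorship, and platform governance. It identifies four dynamics. First, users act as critical evaluators who inspect video details to judge realism. Second, users shift from passive viewers to active creators and come to treat text prompts as intellectual property. Third, users worry about blurred boundaries between real and synthetic media and question the authenticity of content and even of other commenters. Fourth, users contest platform governance, calling out opaque or inconsistent moderation and sharing tactics for evading censorship, while also enforcing ethical norms such as discouraging misuse of real people's images. The takeaway is that recognizing these patterns can inform the design of governance mechanisms for future AI platforms.

Link: https://arxiv.org/abs/2512.05519
Authors: Bohui Shen, Shrikar Bhatta, Alex Ireebanije, Zexuan Liu, Abhinav Choudhry, Ece Gumusel, Kyrie Zhixuan Zhou
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: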

Abstract:As AI-generated video platforms rapidly advance, ethical challenges such as copyright infringement emerge. This study examines how users make sense of AI-generated videos on OpenAI’s Sora by conducting a qualitative content analysis of user comments. Through a thematic analysis, we identified four dynamics that characterize how users negotiate authenticity, authorship, and platform governance on Sora. First, users acted as critical evaluators of realism, assessing micro-details such as lighting, shadows, fluid motion, and physics to judge whether AI-generated scenes could plausibly exist. Second, users increasingly shifted from passive viewers to active creators, expressing curiosity about prompts, techniques, and creative processes. Text prompts were perceived as intellectual property, generating concerns about plagiarism and remixing norms. Third, users reported blurred boundaries between real and synthetic media, worried about misinformation, and even questioned the authenticity of other commenters, suspecting bot-generated engagement. Fourth, users contested platform governance: some perceived moderation as inconsistent or opaque, while others shared tactics for evading prompt censorship through misspellings, alternative phrasing, emojis, or other languages. Despite this, many users also enforced ethical norms by discouraging the misuse of real people’s images or disrespectful content. Together, these patterns highlighted how AI-mediated platforms complicate notions of reality, creativity, and rule-making in emerging digital ecosystems. Based on the findings, we discuss governance challenges in Sora and how user negotiations inform future platform governance.

[AI-33] Matching Ranks Over Probability Yields Truly Deep Safety Alignment

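[Quick read]: This paper addresses the weak robustness of current LLMs to prefilling attacks, in particular the finding that the supervised fine-tuning (SFT) defense based on data augmentation is bypassed by the new Rank-Assisted Prefilling (RAP) attack. Although that SFT defense appears to produce a "deep" safety alignment, a vulnerability remains: when target distribution entropies are low, the model concentrates probability mass on a few refusal tokens while harmful tokens keep high ranks, which RAP exploits. The key is to shift from matching probabilities to matching token ranks, which yields an attention-regularization defense called PRefill attEntion STOpping (PRESTO) that regularizes the attention placed on harmful prefill tokens. Across three popular open-source LLMs, PRESTO improves the mean StrongREJECT score under RAP attacks by up to 4.7x with low impact on model utility.

Link: https://arxiv.org/abs/2512.05518
Authors: Jason Vega, Gagandeep Singh
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: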

Abstract:A frustratingly easy technique known as the prefilling attack has been shown to effectively circumvent the safety alignment of frontier LLMs by simply prefilling the assistant response with an affirmative prefix before decoding. In response, recent work proposed a supervised fine-tuning (SFT) defense using data augmentation to achieve a “deep” safety alignment, allowing the model to generate natural language refusals immediately following harmful prefills. Unfortunately, we show in this work that the “deep” safety alignment produced by such an approach is in fact not very deep. A generalization of the prefilling attack, which we refer to as the Rank-Assisted Prefilling (RAP) attack, can effectively extract harmful content from models fine-tuned with the data augmentation defense by selecting low-probability “harmful” tokens from the top 20 predicted next tokens at each step (thus ignoring high-probability “refusal” tokens). We argue that this vulnerability is enabled due to the “gaming” of the SFT objective when the target distribution entropies are low, where low fine-tuning loss is achieved by shifting large probability mass to a small number of refusal tokens while neglecting the high ranks of harmful tokens. We then propose a new perspective on achieving deep safety alignment by matching the token ranks of the target distribution, rather than their probabilities. This perspective yields a surprisingly simple fix to the data augmentation defense based on regularizing the attention placed on harmful prefill tokens, an approach we call PRefill attEntion STOpping (PRESTO). Adding PRESTO yields up to a 4.7x improvement in the mean StrongREJECT score under RAP attacks across three popular open-source LLMs, with low impact to model utility.

[AI-34] Lyrics Matter: Exploiting the Power of Learnt Representations for Music Popularity Prediction

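[Quick read]: This paper addresses music popularity prediction, focusing on the under-explored role of lyrics. The key is an automated pipeline that uses a large language model (LLM) to extract high-dimensional lyric embeddings capturing semantic, syntactic, and sequential information, and integrates them into HitMusicLyricNet, a multimodal architecture that combines audio, lyrics, and social metadata to predict a popularity score in the range 0-100. Experiments show that the LLM-driven lyrics feature pipeline (LyricsAENet) drives the gains, with 9% and 20% improvements in MAE and MSE, respectively, over baselines on the SpotGenTrack dataset.

Link: https://arxiv.org/abs/2512.05508
Authors: Yash Choudhary, Preeti Rao, Pushpak Bhattacharyya
Affiliation: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages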

Abstract:Accurately predicting music popularity is a critical challenge in the music industry, offering benefits to artists, producers, and streaming platforms. Prior research has largely focused on audio features, social metadata, or model architectures. This work addresses the under-explored role of lyrics in predicting popularity. We present an automated pipeline that uses an LLM to extract high-dimensional lyric embeddings, capturing semantic, syntactic, and sequential information. These features are integrated into HitMusicLyricNet, a multimodal architecture that combines audio, lyrics, and social metadata for popularity score prediction in the range 0-100. Our method outperforms existing baselines on the SpotGenTrack dataset, which contains over 100,000 tracks, achieving 9% and 20% improvements in MAE and MSE, respectively. Ablation confirms that gains arise from our LLM-driven lyrics feature pipeline (LyricsAENet), underscoring the value of dense lyric representations.

[AI-35] PERM EQ x GRAPH EQ: Equivariant Neural Networks for Quantum Molecular Learning

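[Quick read]: This paper studies how the performance and generalizability of quantum machine learning (QML) models for geometric learning vary with molecular geometry. It compares QML models with no symmetry equivariance, with rotational and permutational equivariance, and with graph-embedded permutational equivariance on datasets for LiH (linear) and NH₃ (trigonal pyramidal), to clarify how model design affects generalization. The main findings are that graph embedding of features is an effective route to greater trainability, and that permutational symmetric embedding gives the most generalizable model for geometric learning.

Link: https://arxiv.org/abs/2512.05475
Authors: Saumya Biswas, Jiten Oswal
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
Comments: 22 pages, 9 figures, 4 tables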

Abstract:In hierarchical order of molecular geometry, we compare the performances of Geometric Quantum Machine Learning models. Two molecular datasets are considered: the simple linear LiH molecule and the trigonal pyramidal molecule NH3. Both accuracy and generalizability metrics are considered. A classical equivariant model is used as a baseline for the performance comparison. The comparative performance of Quantum Machine Learning models with no symmetry equivariance, rotational and permutational equivariance, and graph embedded permutational equivariance is investigated. The performance differentials and the molecular geometry in question reveal the criteria for choice of models for generalizability. Graph embedding of features is shown to be an effective pathway to greater trainability for geometric datasets. Permutational symmetric embedding is found to be the most generalizable Quantum Machine Learning model for geometric learning.

[AI-36] How Ensemble Learning Balances Accuracy and Overfitting: A Bias-Variance Perspective on Tabular Data

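Quick read: This paper examines how ensemble models can keep accuracy high while controlling overfitting, and how that generalization behavior varies with dataset characteristics. The key finding, established through systematic experiments, is that ensembles reduce variance through averaging or controlled boosting: on data with clear nonlinear structure they raise test accuracy by 5 to 7 percentage points while keeping the generalization gap below 3 percent. On linear or noisy data, dataset complexity indicators (linearity score, Fisher ratio, and noise estimate) help decide whether an ensemble is appropriate, and regularization is needed to avoid fitting noise or majority-class patterns.

Link: https://arxiv.org/abs/2512.05469
Authors: Zubair Ahmed Mohammad
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 11 pages, 9 figures, 3 tables. Code and reproducible experiments are available at: this https URL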

Abstract:Ensemble models often achieve higher accuracy than single learners, but their ability to maintain small generalization gaps is not always well understood. This study examines how ensembles balance accuracy and overfitting across four tabular classification tasks: Breast Cancer, Heart Disease, Pima Diabetes, and Credit Card Fraud. Using repeated stratified cross validation with statistical significance testing, we compare linear models, a single decision tree, and nine ensemble methods. The results show that ensembles can reach high accuracy without large gaps by reducing variance through averaging or controlled boosting. On nearly linear and clean data, linear models already generalize well and ensembles offer little additional benefit. On datasets with meaningful nonlinear structure, tree based ensembles increase test accuracy by 5 to 7 points while keeping gaps below 3 percent. On noisy or highly imbalanced datasets, ensembles remain competitive but require regularization to avoid fitting noise or majority class patterns. We also compute simple dataset complexity indicators, such as linearity score, Fisher ratio, and noise estimate, which explain when ensembles are likely to control variance effectively. Overall, the study provides a clear view of how and when ensembles maintain high accuracy while keeping overfitting low, offering practical guidance for model selection in real world tabular applications.
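
The "gap" discussed in this abstract can be illustrated with a minimal sketch, assuming it means train accuracy minus test accuracy averaged over cross-validation folds (the abstract does not spell out the formula). The function name `generalization_gap` and the per-fold numbers are hypothetical and are not results from the paper.

```python
from statistics import mean

def generalization_gap(train_acc, test_acc):
    """Mean (train - test) accuracy gap across cross-validation folds."""
    if len(train_acc) != len(test_acc):
        raise ValueError("need one train and one test score per fold")
    return mean(tr - te for tr, te in zip(train_acc, test_acc))

# Illustrative per-fold accuracies (made up for this sketch).
train = [0.98, 0.97, 0.99]
test = [0.95, 0.96, 0.96]
gap = generalization_gap(train, test)
print(f"mean gap: {gap:.3f}")
```

Under this reading, a model "keeps the gap below 3 percent" when the returned value stays under 0.03.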

[AI-37] Knowing Your Uncertainty – On the application of LLM in social sciences

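Quick read: This paper addresses the uncertainty that arises when large language models (LLMs) are used in computational social science, owing to their black-boxed training and the stochasticity designed into inference. Left unassessed, this uncertainty undermines reproducibility and scientific rigor. The key contribution is a unified uncertainty quantification (UQ) framework organized along two dimensions: task type (T), which distinguishes classification, short-form generation, and long-form generation; and validation type (V), which reflects the availability of reference data or evaluative criteria. By mapping existing UQ methods onto this T-V typology, the paper gives researchers actionable recommendations for integrating and evaluating LLMs reliably without sacrificing methodological rigor.

Link: https://arxiv.org/abs/2512.05461
Authors: Bolun Zhang, Linzhuo Li, Yunqi Chen, Qinlin Zhao, Zihan Zhu, Xiaoyuan Yi, Xing Xie
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 49 pages, 10 figures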

Abstract: Large language models (LLMs) are rapidly being integrated into computational social science research, yet their blackboxed training and designed stochastic elements in inference pose unique challenges for scientific inquiry. This article argues that applying LLMs to social scientific tasks requires explicit assessment of uncertainty, an expectation long established in both quantitative methodology in the social sciences and machine learning. We introduce a unified framework for evaluating LLM uncertainty along two dimensions: the task type (T), which distinguishes between classification, short-form, and long-form generation, and the validation type (V), which captures the availability of reference data or evaluative criteria. Drawing from both computer science and social science literature, we map existing uncertainty quantification (UQ) methods to this T-V typology and offer practical recommendations for researchers. Our framework provides both a methodological safeguard and a practical guide for integrating LLMs into rigorous social science research.
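
For the classification task type, one simple way to expose inference-time stochasticity is to repeat the same prompt and measure how much the labels disagree. The sketch below computes Shannon entropy of the empirical label distribution. This is a generic technique chosen for illustration, not necessarily one the paper recommends; `label_entropy` and the sample labels are hypothetical.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical labels from 8 repeated runs of the same classification prompt.
runs = ["pos", "pos", "neg", "pos", "pos", "neg", "pos", "pos"]
h = label_entropy(runs)
print(f"entropy: {h:.3f} bits")
```

An entropy of 0 means every run agreed; values near 1 bit for a binary label mean the runs are close to a coin flip.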

[AI-38] Parajudica: An RDF-Based Reasoner and Metamodel for Multi-Framework Context-Dependent Data Compliance Assessments

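Quick read: This paper addresses the challenges of implementing policy-based data access control (PBAC) when multiple compliance frameworks apply simultaneously. The key contribution is Parajudica, an open, modular, and extensible RDF/SPARQL-based rule system for evaluating context-dependent data compliance status. Through its metamodel, the system can be applied to existing legal frameworks and industry standards, supporting compliance policy enforcement, compliance monitoring, data discovery, and risk assessment.

Link: https://arxiv.org/abs/2512.05453
Authors: Luc Moreau (University of Sussex, Brighton, United Kingdom), Alfred Rossi (Immuta Research, Boston, Massachusetts, USA), Sophie Stalla-Bourdillon (Brussels Privacy Hub, Vrije Universiteit Brussel, Brussels, Belgium)
Affiliation: unknown
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Logic in Computer Science (cs.LO)
Comments: 17 pages, 8 figures. Code and examples available at this https URL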

Abstract:Motivated by the challenges of implementing policy-based data access control (PBAC) under multiple simultaneously applicable compliance frameworks, we present Parajudica, an open, modular, and extensible RDF/SPARQL-based rule system for evaluating context-dependent data compliance status. We demonstrate the utility of this resource and accompanying metamodel through application to existing legal frameworks and industry standards, offering insights for comparative framework analysis. Applications include compliance policy enforcement, compliance monitoring, data discovery, and risk assessment.

[AI-39] The Seeds of Scheming: Weakness of Will in the Building Blocks of Agentic Systems FAST AAAI2026

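Quick read: This paper addresses a form of inconsistency common in current generative AI systems: a model can make the correct global judgment yet deviate from its stated goal in local behavior, a phenomenon likened to "weakness of will" (akrasia) in human philosophy. The key idea is to bring akrasia into the analysis of agentic AI systems through a preliminary Akrasia Benchmark, whose structured prompting conditions (Baseline, Synonym, Temporal, and Temptation) quantify how often a model's local response contradicts its own prior commitments. This enables quantitative comparison of "self-control" across model families, decoding strategies, and temptation types. The paper also outlines how micro-level akrasia may compound into macro-level instability in multi-agent systems, such as "scheming" or deliberate misalignment.

Link: https://arxiv.org/abs/2512.05449
Authors: Robert Yang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 4 pages + appendix. AAAI 2026 FAST Workshop (Oral)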

Abstract:Large language models display a peculiar form of inconsistency: they “know” the correct answer but fail to act on it. In human philosophy, this tension between global judgment and local impulse is called akrasia, or weakness of will. We propose akrasia as a foundational concept for analyzing inconsistency and goal drift in agentic AI systems. To operationalize it, we introduce a preliminary version of the Akrasia Benchmark, currently a structured set of prompting conditions (Baseline [B], Synonym [S], Temporal [T], and Temptation [X]) that measures when a model’s local response contradicts its own prior commitments. The benchmark enables quantitative comparison of “self-control” across model families, decoding strategies, and temptation types. Beyond single-model evaluation, we outline how micro-level akrasia may compound into macro-level instability in multi-agent systems that may be interpreted as “scheming” or deliberate misalignment. By reframing inconsistency as weakness of will, this work connects agentic behavior to classical theories of agency and provides an empirical bridge between philosophy, psychology, and the emerging science of agentic AI.
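
To make "measures when a model's local response contradicts its own prior commitments" concrete, the sketch below computes a per-condition contradiction rate. The condition codes B, S, and X come from the abstract; the scoring rule (exact mismatch between commitment and response), the function name `contradiction_rates`, and the trial data are assumptions for illustration and are not taken from the benchmark itself.

```python
from collections import defaultdict

def contradiction_rates(trials):
    """Fraction of trials per condition where the response breaks the commitment."""
    totals, breaks = defaultdict(int), defaultdict(int)
    for condition, commitment, response in trials:
        totals[condition] += 1
        if response != commitment:
            breaks[condition] += 1
    return {c: breaks[c] / totals[c] for c in totals}

# Hypothetical trials: (condition, prior commitment, local response).
trials = [
    ("B", "refuse", "refuse"),
    ("B", "refuse", "refuse"),
    ("S", "refuse", "refuse"),
    ("S", "refuse", "comply"),
    ("X", "refuse", "comply"),
    ("X", "refuse", "comply"),
]
rates = contradiction_rates(trials)
print(rates)
```

Comparing the rate under Temptation (X) with the Baseline (B) rate gives a rough "self-control" contrast of the kind the abstract describes.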

[AI-40] IdealTSF: Can Non-Ideal Data Contribute to Enhancing the Performance of Time Series Forecasting Models? AAAI2026

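Quick read: This paper addresses the performance degradation that non-ideal samples, such as missing values and anomalies, cause in time series forecasting. Conventional methods either treat such noisy or low-quality data as interference or use it only as positive samples for knowledge transfer, leaving its potential value untapped. The proposed IdealTSF framework uses non-ideal negative samples together with ideal positive samples: it first extracts knowledge from negative samples during pretraining, then transforms the original sequences into ideal positive samples during training, and applies a negative optimization mechanism with adversarial disturbances to improve robustness. According to the authors, this markedly improves the forecasting ability of a basic attention architecture on noisy or low-quality data.

Link: https://arxiv.org/abs/2512.05442
Authors: Hua Wang, Jinghao Lu, Fan Zhang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at AAAI 2026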

Abstract:Deep learning has shown strong performance in time series forecasting tasks. However, issues such as missing values and anomalies in sequential data hinder its further development in prediction tasks. Previous research has primarily focused on extracting feature information from sequence data or addressing these suboptimal data as positive samples for knowledge transfer. A more effective approach would be to leverage these non-ideal negative samples to enhance event prediction. In response, this study highlights the advantages of non-ideal negative samples and proposes the IdealTSF framework, which integrates both ideal positive and negative samples for time series forecasting. IdealTSF consists of three progressive steps: pretraining, training, and optimization. It first pretrains the model by extracting knowledge from negative sample data, then transforms the sequence data into ideal positive samples during training. Additionally, a negative optimization mechanism with adversarial disturbances is applied. Extensive experiments demonstrate that negative sample data unlocks significant potential within the basic attention architecture for time series forecasting. Therefore, IdealTSF is particularly well-suited for applications with noisy samples or low-quality data.

[AI-41] BEAVER: An Efficient Deterministic LLM Verifier

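Quick read: This paper addresses the lack of reliable verification methods for large language models (LLMs) in deployment, namely how to obtain deterministic, provable probability bounds on whether model outputs satisfy a given semantic constraint without relying on sampling-based estimates. The key contribution is the BEAVER framework, which uses novel token trie and frontier data structures to explore the generation space systematically while maintaining provably sound probability bounds at every iteration, enabling precise risk assessment of LLM constraint satisfaction.

Link: https://arxiv.org/abs/2512.05439
Authors: Tarun Suresh, Nalin Wadhwa, Debangshu Banerjee, Gagandeep Singh
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
Comments: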

Abstract:As large language models (LLMs) transition from research prototypes to production systems, practitioners often need reliable methods to verify that model outputs satisfy required constraints. While sampling-based estimates provide an intuition of model behavior, they offer no sound guarantees. We present BEAVER, the first practical framework for computing deterministic, sound probability bounds on LLM constraint satisfaction. Given any prefix-closed semantic constraint, BEAVER systematically explores the generation space using novel token trie and frontier data structures, maintaining provably sound bounds at every iteration. We formalize the verification problem, prove soundness of our approach, and evaluate BEAVER on correctness verification, privacy verification and secure code generation tasks across multiple state of the art LLMs. BEAVER achieves 6 to 8 times tighter probability bounds and identifies 3 to 4 times more high risk instances compared to baseline methods under identical computational budgets, enabling precise characterization and risk assessment that loose bounds or empirical evaluation cannot provide.
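
The idea of sound bounds under a prefix-closed constraint can be illustrated with a toy best-first exploration. This is a simplified sketch and not BEAVER's actual algorithm or data structures: `bound_satisfaction`, `toy_next_probs`, and `no_b` are hypothetical names, the "model" is a fixed three-token distribution, and the sketch assumes each next-token distribution sums to 1 and that the empty prefix is valid.

```python
import heapq

EOS = "<eos>"

def bound_satisfaction(next_probs, is_valid, max_expansions):
    """Return (lower, upper) bounds on P(output satisfies a prefix-closed constraint)."""
    lower = 0.0              # mass of finished sequences known to satisfy the constraint
    frontier = [(-1.0, ())]  # max-heap of (negated prefix probability, prefix)
    for _ in range(max_expansions):
        if not frontier:
            break
        neg_p, prefix = heapq.heappop(frontier)
        for token, p in next_probs(prefix).items():
            mass = -neg_p * p
            if token == EOS:
                lower += mass  # a valid prefix terminated, so it satisfies the constraint
            elif is_valid(prefix + (token,)):
                heapq.heappush(frontier, (-mass, prefix + (token,)))
            # Invalid prefixes are dropped: by prefix-closure no extension can recover.
    # Unexplored frontier mass might still satisfy the constraint, so it widens the upper bound.
    upper = lower + sum(-neg_p for neg_p, _ in frontier)
    return lower, upper

def toy_next_probs(prefix):
    """Hypothetical stand-in for an LLM: the same distribution at every step."""
    return {"a": 0.5, "b": 0.3, EOS: 0.2}

def no_b(prefix):
    """Toy prefix-closed constraint: the output never contains token 'b'."""
    return "b" not in prefix

lower, upper = bound_satisfaction(toy_next_probs, no_b, max_expansions=10)
print(lower, upper)
```

For this toy model the exact satisfaction probability is 0.2 / (1 - 0.5) = 0.4, and the interval brackets that value after every expansion, tightening as more of the frontier is explored. No sampling is involved, which is the contrast with empirical estimates that the abstract draws.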

[AI-42] Building Capacity for Artificial Intelligence in Africa: A Cross-Country Survey of Challenges and Governance Pathways

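Quick read: This paper addresses the marked unevenness in artificial intelligence (AI) education and workforce preparation across Africa, and asks how AI-related skills can be strengthened to support sustainable development amid rapid demographic change and growing labour market pressure. The study finds that universities and industry broadly recognize the importance of AI, yet practical training, equitable access to resources, and sustained collaboration remain weak. The key recommendations are to strengthen university-industry collaboration, to address structural barriers such as funding shortfalls, poor infrastructure, and insufficient policy support, and to establish inclusive governance frameworks so that AI capacity building contributes to equitable and sustainable development across the continent.

Link: https://arxiv.org/abs/2512.05432
Authors: Jeffrey N. A. Aryee, Patrick Davies, Godfred A. Torsah, Mercy M. Apaw, Cyril D. Boateng, Sam M. Mwando, Chris Kwisanga, Eric Jobunga, Leonard K. Amekudzi
Affiliation: unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 16 pages, 4 figures, 1 table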

Abstract:Artificial intelligence (AI) is transforming education and the workforce, but access to AI learning opportunities in Africa remains uneven. With rapid demographic shifts and growing labour market pressures, AI has become a strategic development priority, making the demand for relevant skills more urgent. This study investigates how universities and industries engage in shaping AI education and workforce preparation, drawing on survey responses from five African countries (Ghana, Namibia, Rwanda, Kenya and Zambia). The findings show broad recognition of AI importance but limited evidence of consistent engagement, practical training, or equitable access to resources. Most respondents who rated the AI component of their curriculum as very relevant reported being well prepared for jobs, but financial barriers, poor infrastructure, and weak communication limit participation, especially among students and underrepresented groups. Respondents highlighted internships, industry partnerships, and targeted support mechanisms as critical enablers, alongside the need for inclusive governance frameworks. The results showed both the growing awareness of AI’s potential and the structural gaps that hinder its translation into workforce capacity. Strengthening university-industry collaboration and addressing barriers of access, funding, and policy are central to ensuring that AI contributes to equitable and sustainable development across the continent.

[AI-43] A Systematic Framework for Enterprise Knowledge Retrieval: Leveraging LLM-Generated Metadata to Enhance RAG Systems

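Quick read: This paper addresses efficient retrieval of relevant information from large, complex enterprise knowledge bases, with the aim of improving operational efficiency and decision quality. The core challenge is to strengthen the semantic representation of document chunks in Retrieval-Augmented Generation (RAG) systems so that retrieval becomes more accurate and efficient. The key contribution is a systematic framework for metadata enrichment based on large language models (LLMs), in which a structured pipeline dynamically generates meaningful metadata to enrich the representation of document segments. Experiments show that the approach outperforms content-only baselines, with recursive chunking combined with TF-IDF weighted embeddings reaching 82.5% precision, and that it improves vector clustering quality and reduces retrieval latency.

Link: https://arxiv.org/abs/2512.05411
Authors: Pranav Pushkar Mishra, Kranti Prakash Yeole, Ramyashree Keshavamurthy, Mokshit Bharat Surana, Fatemeh Sarayloo
Affiliation: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 7 pages, 3 figures, 3 tables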

Abstract: In enterprise settings, efficiently retrieving relevant information from large and complex knowledge bases is essential for operational productivity and informed decision-making. This research presents a systematic framework for metadata enrichment using large language models (LLMs) to enhance document retrieval in Retrieval-Augmented Generation (RAG) systems. Our approach employs a comprehensive, structured pipeline that dynamically generates meaningful metadata for document segments, substantially improving their semantic representations and retrieval accuracy. Through extensive experiments, we compare three chunking strategies (semantic, recursive, and naive) and evaluate their effectiveness when combined with advanced embedding techniques. The results demonstrate that metadata-enriched approaches consistently outperform content-only baselines, with recursive chunking paired with TF-IDF weighted embeddings yielding an 82.5% precision rate compared to 73.3% for semantic content-only approaches. The naive chunking strategy with prefix-fusion achieved the highest Hit Rate@10 of 0.925. Our evaluation employs cross-encoder reranking for ground truth generation, enabling rigorous assessment via Hit Rate and Metadata Consistency metrics. These findings confirm that metadata enrichment enhances vector clustering quality while reducing retrieval latency, making it a key optimization for RAG systems across knowledge domains. This work offers practical insights for deploying high-performance, scalable document retrieval solutions in enterprise settings, demonstrating that metadata enrichment is a powerful approach for enhancing RAG effectiveness.
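
The Hit Rate@10 figure quoted in this abstract is a standard retrieval metric. The sketch below uses the usual definition: the fraction of queries for which at least one relevant document appears among the top k results. The paper's exact evaluation protocol may differ; `hit_rate_at_k`, the query identifiers, and the document identifiers are hypothetical.

```python
def hit_rate_at_k(retrieved, relevant, k=10):
    """Fraction of queries with at least one relevant document in the top k results."""
    hits = sum(
        1 for query, ranking in retrieved.items()
        if set(ranking[:k]) & relevant[query]
    )
    return hits / len(retrieved)

# Hypothetical rankings per query and the relevant set for each query.
retrieved = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9", "d4"]}
relevant = {"q1": {"d1"}, "q2": {"d5"}}
print(hit_rate_at_k(retrieved, relevant, k=2))
```

Because the metric only asks whether any relevant document is present, it does not reward placing that document higher within the top k.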

[AI-44] Smart Timing for Mining: A Deep Learning Framework for Bitcoin Hardware ROI Prediction

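Quick read: This paper addresses the question of when to purchase Bitcoin mining hardware (ASICs), a decision complicated by market volatility, rapid technological obsolescence, and protocol-driven revenue cycles. Existing work offers no systematic guidance on when to acquire new hardware, and no computational framework targets this decision. The key idea is to formulate hardware acquisition as a time series classification task and to propose MineROI-Net, an open source Transformer-based model that captures multi-scale temporal patterns in mining profitability and predicts whether buying an ASIC machine will be profitable (ROI = 1), marginal (0 < ROI < 1), or unprofitable (ROI = 0) within one year. Evaluated on data from 20 ASIC miners released between 2015 and 2024, it reaches 83.7% accuracy and 83.1% macro F1-score, with 93.6% precision in detecting unprofitable periods and 98.5% precision for profitable ones, offering a practical data-driven decision tool for capital-intensive mining operations.

Link: https://arxiv.org/abs/2512.05402
Authors: Sithumi Wickramasinghe, Bikramjit Das, Dorien Herremans
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Neural and Evolutionary Computing (cs.NE)
Comments: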

Abstract: Bitcoin mining hardware acquisition requires strategic timing due to volatile markets, rapid technological obsolescence, and protocol-driven revenue cycles. Despite mining’s evolution into a capital-intensive industry, there is little guidance on when to purchase new Application-Specific Integrated Circuit (ASIC) hardware, and no prior computational frameworks address this decision problem. We address this gap by formulating hardware acquisition as a time series classification task, predicting whether purchasing ASIC machines yields profitable (Return on Investment (ROI) = 1), marginal (0 < ROI < 1), or unprofitable (ROI = 0) returns within one year. We propose MineROI-Net, an open source Transformer-based architecture designed to capture multi-scale temporal patterns in mining profitability. Evaluated on data from 20 ASIC miners released between 2015 and 2024 across diverse market regimes, MineROI-Net outperforms LSTM-based and TSLANet baselines, achieving 83.7% accuracy and 83.1% macro F1-score. The model demonstrates strong economic relevance, achieving 93.6% precision in detecting unprofitable periods and 98.5% precision for profitable ones, while avoiding misclassification of profitable scenarios as unprofitable and vice versa. These results indicate that MineROI-Net offers a practical, data-driven tool for timing mining hardware acquisitions, potentially reducing financial risk in capital-intensive mining operations. The model is available through: this https URL.

[AI-45] Simulating Life Paths with Digital Twins: AI-Generated Future Selves Influence Decision-Making and Expand Human Choice

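Quick read: This paper addresses the difficulty people have in weighing the consequences of major life decisions because of their limited ability to imagine their future selves (a limited capacity for "mental time travel"). The key idea is to introduce AI-enabled digital twins: multimodal synthesis (facial age progression, voice cloning, and large language model dialogue) generates personalized avatars representing participants 30 years in the future, extending prospective cognition by making alternative futures vivid enough to support deliberation without assuming which path is best. Experiments show that these AI-generated future selves can shift decisions, and that introducing a system-generated novel option expands the choice set and increases adoption of paths that might otherwise go unnoticed.

Link: https://arxiv.org/abs/2512.05397
Authors: Rachel Poonsiriwong, Chayapatr Archiwaranguprok, Constanze Albrecht, Peggy Yin, Nattavudh Powthavee, Hal Hershfield, Monchai Lertsutthiwong, Kavin Winson, Pat Pataranutaporn
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: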

Abstract: Major life transitions demand high-stakes decisions, yet people often struggle to imagine how their future selves will live with the consequences. To support this limited capacity for mental time travel, we introduce AI-enabled digital twins that have “lived through” simulated life scenarios. Rather than predicting optimal outcomes, these simulations extend prospective cognition by making alternative futures vivid enough to support deliberation without assuming which path is best. We evaluate this idea in a randomized controlled study (N=192) using multimodal synthesis (facial age progression, voice cloning, and large language model dialogue) to create personalized avatars representing participants 30 years forward. Young adults 18 to 28 years old described pending binary decisions and were assigned to guided imagination or one of four avatar conditions: single-option, balanced dual-option, or expanded three-option with a system-generated novel alternative. Results showed asymmetric effects: single-sided avatars increased shifts toward the presented option, while balanced presentation produced movement toward both. Introducing a system-generated third option increased adoption of this new alternative compared to control, suggesting that AI-generated future selves can expand choice by surfacing paths that might otherwise go unnoticed. Participants rated evaluative reasoning and eudaimonic meaning-making as more important than emotional or visual vividness. Perceived persuasiveness and baseline agency predicted decision change. These findings advance understanding of AI-mediated episodic prospection and raise questions about autonomy in AI-augmented decisions.

[AI-46] Generalization Beyond Benchmarks: Evaluating Learnable Protein-Ligand Scoring Functions on Unseen Targets NEURIPS2025

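Quick read: This paper addresses the poor generalization of current machine-learning-based protein-ligand scoring functions to novel protein targets. Although existing scoring functions perform well on standard benchmarks, their performance drops markedly in realistic settings where training data is limited and new targets have few known structures and experimental affinity measurements. The paper makes three contributions: it proposes dataset splits that better reflect real-world use for evaluating generalization; it investigates whether large-scale self-supervised pretraining can narrow the generalization gap and provides preliminary evidence of its potential; and it tests whether simple methods that use a small amount of test-target data can improve scoring function performance, offering practical guidance for designing scoring functions that generalize across targets.

Link: https://arxiv.org/abs/2512.05386
Authors: Jakub Kopko, David Graber, Saltuk Mustafa Eyrilmez, Stanislav Mazurenko, David Bednar, Jiri Sedlar, Josef Sivic
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 15 pages, 6 figures, submitted to NeurIPS 2025 AI4Science Workshop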

Abstract:As machine learning becomes increasingly central to molecular design, it is vital to ensure the reliability of learnable protein-ligand scoring functions on novel protein targets. While many scoring functions perform well on standard benchmarks, their ability to generalize beyond training data remains a significant challenge. In this work, we evaluate the generalization capability of state-of-the-art scoring functions on dataset splits that simulate evaluation on targets with a limited number of known structures and experimental affinity measurements. Our analysis reveals that the commonly used benchmarks do not reflect the true challenge of generalizing to novel targets. We also investigate whether large-scale self-supervised pretraining can bridge this generalization gap and we provide preliminary evidence of its potential. Furthermore, we probe the efficacy of simple methods that leverage limited test-target data to improve scoring function performance. Our findings underscore the need for more rigorous evaluation protocols and offer practical guidance for designing scoring functions with predictive power extending to novel protein targets.

[AI-47] Fuzzing the brain: Automated stress testing for the safety of ML-driven neurostimulation

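Quick read: This paper addresses the safety risks that arise when machine learning (ML) models in neuroprosthetic devices directly output electrical stimulation patterns, in particular the risk of damaging neural tissue when those patterns exceed biophysical safety limits (on charge density, instantaneous current, or electrode co-activation). The key idea is a coverage-guided fuzzing approach that brings automated stress testing from software testing into neurostimulation: inputs are perturbed and outputs are checked against safety limits to identify and characterize unsafe stimulation patterns systematically. The framework treats encoders as black boxes and uses coverage metrics to quantify how broadly test cases span the output space and violation types, enabling interpretable safety comparisons across model architectures and training strategies.

Link: https://arxiv.org/abs/2512.05383
Authors: Mara Downing, Matthew Peng, Jacob Granley, Michael Beyeler, Tevfik Bultan
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 20 pages, 4 figures, 2 tables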

Abstract:Objective: Machine learning (ML) models are increasingly used to generate electrical stimulation patterns in neuroprosthetic devices such as visual prostheses. While these models promise precise and personalized control, they also introduce new safety risks when model outputs are delivered directly to neural tissue. We propose a systematic, quantitative approach to detect and characterize unsafe stimulation patterns in ML-driven neurostimulation systems. Approach: We adapt an automated software testing technique known as coverage-guided fuzzing to the domain of neural stimulation. Here, fuzzing performs stress testing by perturbing model inputs and tracking whether resulting stimulation violates biophysical limits on charge density, instantaneous current, or electrode co-activation. The framework treats encoders as black boxes and steers exploration with coverage metrics that quantify how broadly test cases span the space of possible outputs and violation types. Main results: Applied to deep stimulus encoders for the retina and cortex, the method systematically reveals diverse stimulation regimes that exceed established safety limits. Two violation-output coverage metrics identify the highest number and diversity of unsafe outputs, enabling interpretable comparisons across architectures and training strategies. Significance: Violation-focused fuzzing reframes safety assessment as an empirical, reproducible process. By transforming safety from a training heuristic into a measurable property of the deployed model, it establishes a foundation for evidence-based benchmarking, regulatory readiness, and ethical assurance in next-generation neural interfaces.
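
A coverage-guided fuzzing loop of the kind described can be sketched in a few lines. This toy version is an illustration under strong simplifying assumptions, not the paper's framework: the only safety limit is a per-electrode amplitude threshold, "coverage" is the set of distinct over-limit electrode combinations seen so far, and `fuzz`, `toy_encoder`, and all numeric values are hypothetical.

```python
import random

def fuzz(encoder, seed_input, max_current, iterations=200, step=0.5, seed=0):
    """Mutate inputs, keep those whose violation pattern is new, return the unsafe ones."""
    rng = random.Random(seed)
    corpus = [seed_input]
    seen = set()   # coverage: distinct sets of over-limit electrodes observed so far
    unsafe = []
    for _ in range(iterations):
        parent = rng.choice(corpus)
        child = [x + rng.uniform(-step, step) for x in parent]
        output = encoder(child)  # the encoder is only called, never inspected (black box)
        violation = frozenset(
            i for i, amp in enumerate(output) if abs(amp) > max_current
        )
        if violation not in seen:  # new coverage: keep this input for further mutation
            seen.add(violation)
            corpus.append(child)
            if violation:
                unsafe.append((child, violation))
    return unsafe

def toy_encoder(pixels):
    """Hypothetical stand-in for a learned stimulus encoder: scaled pass-through."""
    return [3.0 * p for p in pixels]

found = fuzz(toy_encoder, [0.0, 0.0], max_current=1.0)
print(len(found), "distinct violation patterns found")
```

Keeping only inputs that reach a new violation pattern is what steers the search toward diverse unsafe outputs rather than many copies of the same one.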

[AI-48] China Regional 3km Downscaling Based on Residual Corrective Diffusion Model

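Quick read: This paper addresses the challenge of efficiently producing high-resolution forecasts in numerical weather prediction, specifically by using statistical downscaling to turn low-resolution global model output into finer regional forecasts. It builds on CorrDiff, a diffusion-based downscaling framework, with several changes: the study region is nearly 20 times larger than in the original work and covers the China region; in addition to surface variables, high-level variables on six pressure levels are included as downscaling targets; and a global residual connection is added to improve accuracy. Experiments show that 3 km forecasts downscaled from CMA-GFS and SFF global model output generally achieve lower mean absolute error (MAE) than the baseline CMA-MESO on the target variables, and that forecasts of radar composite reflectivity capture finer detail, illustrating the advantage of a generative model.

Link: https://arxiv.org/abs/2512.05377
Authors: Honglu Sun, Hao Jing, Zhixiang Dai, Sa Xiao, Wei Xue, Jian Sun, Qifeng Lu
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: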

Abstract: A fundamental challenge in numerical weather prediction is to efficiently produce high-resolution forecasts. A common solution is applying downscaling methods, which include dynamical downscaling and statistical downscaling, to the outputs of global models. This work focuses on statistical downscaling, which establishes statistical relationships between low-resolution and high-resolution historical data using statistical models. Deep learning has emerged as a powerful tool for this task, giving rise to various high-performance super-resolution models, which can be directly applied for downscaling, such as diffusion models and Generative Adversarial Networks. This work relies on a diffusion-based downscaling framework named CorrDiff. In contrast to the original work of CorrDiff, the region considered in this work is nearly 20 times larger, and we not only consider surface variables as in the original work, but also include high-level variables (six pressure levels) as target downscaling variables. In addition, a global residual connection is added to improve accuracy. In order to generate the 3 km forecasts for the China region, we apply our trained models to the 25 km global grid forecasts of CMA-GFS, an operational global model of the China Meteorological Administration (CMA), and SFF, a data-driven deep learning-based weather model developed from Spherical Fourier Neural Operators (SFNO). CMA-MESO, a high-resolution regional model, is chosen as the baseline model. The experimental results demonstrate that the forecasts downscaled by our method generally outperform the direct forecasts of CMA-MESO in terms of MAE for the target variables. Our forecasts of radar composite reflectivity show that CorrDiff, as a generative model, can generate fine-scale details that lead to more realistic predictions compared to the corresponding deterministic regression models.

[AI-49] Please Don't Kill My Vibe: Empowering Agents with Data Flow Control CIDR2026

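Quick read: This paper addresses the policy violations, process corruption, and security flaws that arise when large language model (LLM) agents perform complex, stateful tasks without visibility into, or mechanisms for managing, undesirable data flows. Today, agent workflows enforce such policies in ad hoc ways, which is unreliable and inefficient. The key proposal is to build Data Flow Controls (DFCs) into the database management system (DBMS) so that DFC policies are supported and enforced natively. Just as data validation and access control moved from the application layer into the DBMS, this relieves agent developers of managing data flow risks directly and provides a portable, standardized foundation for data flow governance in agent ecosystems.

Link: https://arxiv.org/abs/2512.05374
Authors: Charlie Summers, Haneen Mohammed, Eugene Wu
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 7 pages, 7 figures, CIDR 2026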

Abstract:The promise of Large Language Model (LLM) agents is to perform complex, stateful tasks. This promise is stunted by significant risks - policy violations, process corruption, and security flaws - that stem from the lack of visibility and mechanisms to manage undesirable data flows produced by agent actions. Today, agent workflows are responsible for enforcing these policies in ad hoc ways. Just as data validation and access controls shifted from the application to the DBMS, freeing application developers from these concerns, we argue that systems should support Data Flow Controls (DFCs) and enforce DFC policies natively. This paper describes early work developing a portable instance of DFC for DBMSes and outlines a broader research agenda toward DFC for agent ecosystems.

[AI-50] ChipMind: Retrieval-Augmented Reasoning for Long-Context Circuit Design Specifications

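Quick read: This paper addresses the limited long-text semantic modeling and multi-hop reasoning of large language models (LLMs) in integrated circuit (IC) development automation, which stems from restricted context windows. The key contribution is the ChipMind framework. It first converts complex IC specifications into a domain-specific knowledge graph, ChipKG, through Circuit Semantic-Aware Knowledge Graph Construction. It then applies ChipKG-Augmented Reasoning, which combines information-theoretic adaptive retrieval that dynamically traces logical dependencies with intent-aware semantic filtering that prunes irrelevant noise, balancing retrieval completeness and precision.

Link: https://arxiv.org/abs/2512.05371
Authors: Changwen Xing, SamZaak Wong, Xinlai Wan, Yanfeng Lu, Mengli Zhang, Zebin Ma, Lei Qi, Zhengxiong Li, Nan Guan, Zhe Jiang, Xi Wang, Jun Yang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: Accepted by the AAAI26 Conference Main Track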

Abstract:While Large Language Models (LLMs) demonstrate immense potential for automating integrated circuit (IC) development, their practical deployment is fundamentally limited by restricted context windows. Existing context-extension methods struggle to achieve effective semantic modeling and thorough multi-hop reasoning over extensive, intricate circuit specifications. To address this, we introduce ChipMind, a novel knowledge graph-augmented reasoning framework specifically designed for lengthy IC specifications. ChipMind first transforms circuit specifications into a domain-specific knowledge graph ChipKG through the Circuit Semantic-Aware Knowledge Graph Construction methodology. It then leverages the ChipKG-Augmented Reasoning mechanism, combining information-theoretic adaptive retrieval to dynamically trace logical dependencies with intent-aware semantic filtering to prune irrelevant noise, effectively balancing retrieval completeness and precision. Evaluated on an industrial-scale specification reasoning benchmark, ChipMind significantly outperforms state-of-the-art baselines, achieving an average improvement of 34.59% (up to 72.73%). Our framework bridges a critical gap between academic research and practical industrial deployment of LLM-aided Hardware Design (LAD).

[AI-51] MCP-AI: Protocol-Driven Intelligence Framework for Autonomous Reasoning in Healthcare

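Quick read: This paper addresses the difficulty current healthcare AI systems have in combining contextual reasoning, long-term state management, and human-verifiable workflows, which limits trustworthy deployment and sustained use in complex clinical settings. Traditional Clinical Decision Support Systems (CDSS) and prompt-based large language models (LLMs) often lack persistent memory of patient context, cross-setting collaboration, and explainability aligned with clinical logic. The key contribution is a new architecture, MCP-AI, which combines the Model Context Protocol (MCP) with a specific clinical application: modular, executable MCP files capture clinical objectives, patient context, reasoning state, and task logic, forming reusable and auditable memory objects. This lets agents reason adaptively, longitudinally, and collaboratively across care settings, supports physician-in-the-loop validation and secure handover of AI responsibilities, and connects with HL7/FHIR interfaces while adhering to regulatory standards such as HIPAA and FDA SaMD guidelines.

Link: https://arxiv.org/abs/2512.05365
Authors: Zag ElSayed, Craig Erickson, Ernest Pedapati
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: 6 pages, 4 figures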

Abstract: Healthcare AI systems have historically faced challenges in merging contextual reasoning, long-term state management, and human-verifiable workflows into a cohesive framework. This paper introduces a completely innovative architecture and concept: combining the Model Context Protocol (MCP) with a specific clinical application, known as MCP-AI. This integration allows intelligent agents to reason over extended periods, collaborate securely, and adhere to authentic clinical logic, representing a significant shift away from traditional Clinical Decision Support Systems (CDSS) and prompt-based Large Language Models (LLMs). As healthcare systems become more complex, the need for autonomous, context-aware clinical reasoning frameworks has become urgent. We present MCP-AI, a novel architecture for explainable medical decision-making built upon the Model Context Protocol (MCP), a modular, executable specification for orchestrating generative and descriptive AI agents in real-time workflows. Each MCP file captures clinical objectives, patient context, reasoning state, and task logic, forming a reusable and auditable memory object. Unlike conventional CDSS or stateless prompt-based AI systems, MCP-AI supports adaptive, longitudinal, and collaborative reasoning across care settings. MCP-AI is validated through two use cases: (1) diagnostic modeling of Fragile X Syndrome with comorbid depression, and (2) remote coordination for Type 2 Diabetes and hypertension. In either scenario, the protocol facilitates physician-in-the-loop validation, streamlines clinical processes, and guarantees secure transitions of AI responsibilities between healthcare providers. The system connects with HL7/FHIR interfaces and adheres to regulatory standards, such as HIPAA and FDA SaMD guidelines. MCP-AI provides a scalable basis for interpretable, composable, and safety-oriented AI within upcoming clinical environments.

[AI-52] AI Human Co-Improvement for Safer Co-Superintelligence

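[Quick Read]: This paper addresses the danger and long time horizon of the "self-improvement" goal in current Artificial Intelligence (AI) research: pursuing the autonomous evolution of AI systems alone may bring uncontrollable risks and is hard to achieve quickly. The key to its proposal is a shift to a "co-improvement" paradigm in which human researchers and AI systems work together and raise each other's capabilities, forming a symbiotic human-AI "co-superintelligence". By bringing human research ability into the AI improvement loop, this strategy aims both to accelerate AI research and to raise the overall safety and intelligence of AIs and humans through their collaboration.

Link: https://arxiv.org/abs/2512.05356
Authors: Jason Weston,Jakob Foerster
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: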

Abstract:Self-improvement is a goal currently exciting the field of AI, but is fraught with danger, and may take time to fully achieve. We advocate that a more achievable and better goal for humanity is to maximize co-improvement: collaboration between human researchers and AIs to achieve co-superintelligence. That is, specifically targeting improving AI systems’ ability to work with human researchers to conduct AI research together, from ideation to experimentation, in order to both accelerate AI research and to generally endow both AIs and humans with safer superintelligence through their symbiosis. Focusing on including human research improvement in the loop will both get us there faster, and more safely.

[AI-53] Invisible Load: Uncovering the Challenges of Neurodivergent Women in Software Engineering

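[Quick Read]: This paper addresses the distinctive challenges that neurodivergent women face in Software Engineering (SE) at the intersection of gender bias and neurological differences, including the stress, burnout, and attrition driven by underdiagnosis, masking, and male-centric workplace cultures. The key to its solution is a hybrid methodology that combines InclusiveMag's inclusivity framework with the GenderMag walkthrough process, tailored to this group. Across three stages (literature review, derivation of personas and analytic processes, and application in collaborative workshops), it systematically identifies and mitigates cognitive, social, organizational, structural, and career progression barriers, providing a theoretical basis and practical tools for actionable change.

Link: https://arxiv.org/abs/2512.05350
Authors: Munazza Zaib,Wei Wang,Dulaji Hidellaarachchi,Isma Farah Siddiqui
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: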

Abstract:Neurodivergent women in Software Engineering (SE) encounter distinctive challenges at the intersection of gender bias and neurological differences. To the best of our knowledge, no prior work in SE research has systematically examined this group, despite increasing recognition of neurodiversity in the workplace. Underdiagnosis, masking, and male-centric workplace cultures continue to exacerbate barriers that contribute to stress, burnout, and attrition. In response, we propose a hybrid methodological approach that integrates InclusiveMag’s inclusivity framework with the GenderMag walkthrough process, tailored to the context of neurodivergent women in SE. The overarching design unfolds across three stages: scoping through literature review, deriving personas and analytic processes, and applying the method in collaborative workshops. We present a targeted literature review that synthesizes the challenges neurodivergent women face in SE into cognitive, social, organizational, structural, and career progression categories, including how under/late diagnosis and masking intensify exclusion. These findings lay the groundwork for subsequent stages that will develop and apply inclusive analytic methods to support actionable change.

[AI-54] Interaction Tensor Shap

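[Quick Read]: This paper addresses the difficulty of computing feature interaction effects efficiently in high-dimensional machine learning models. Existing Shapley-value-based methods cannot tractably evaluate higher-order interactions: the Shapley Taylor Interaction Index (STII) requires exponential-scale subset enumeration, while tensor-based approaches such as the Marginal SHAP Tensor (MST) are restricted to first-order effects. The key to the solution is to represent higher-order Shapley interactions as tensor network contractions and to assume that the Weight Tensor admits a finite-state Tensor Train (TT) representation, which yields polynomial-time, polylog-depth computation under TT structure. The authors propose Interaction Tensor SHAP (IT SHAP), which reformulates STII as the contraction of a Value Tensor and a Weight Tensor. While preserving the axiomatic exactness of STII, it reduces the original exponential complexity Θ(4^n) to NC^2 parallel time, giving a unified, scalable, and computationally tractable explanation framework for main effects and higher-order interactions.

Link: https://arxiv.org/abs/2512.05338
Authors: Hiroki Hasegawa,Yukihiko Okada
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 30 pages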

Abstract:Machine learning models have grown increasingly deep and high dimensional, making it difficult to understand how individual and combined features influence their predictions. While Shapley value based methods provide principled feature attributions, existing formulations cannot tractably evaluate higher order interactions: the Shapley Taylor Interaction Index (STII) requires exponential scale enumeration of subsets, and current tensor based approaches such as the Marginal SHAP Tensor (MST) are restricted to first order effects. The central problem is that no existing framework simultaneously preserves the axiomatic exactness of STII and avoids the exponential computational blow up inherent to high order discrete derivatives. Here we show that high order Shapley interactions can be represented exactly as tensor network contractions, enabling polynomial time and polylog depth computation under Tensor Train (TT) structure. We introduce Interaction Tensor SHAP (IT SHAP), which reformulates STII as the contraction of a Value Tensor and a Weight Tensor, and assume a finite state TT representation of the Weight Tensor with polynomial TT ranks. Under TT structured model and distribution tensors, we show that IT SHAP reduces the exponential complexity Θ(4^n) of STII to NC^2 parallel time. These results demonstrate that IT SHAP provides a unified, axiomatic, and computationally tractable formulation of main effects and higher order interactions in high dimensional models. This framework establishes a foundation for scalable interaction aware explainable AI, with implications for large black box models whose combinatorial structure has previously rendered interaction analysis infeasible.
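
For orientation, the order-k STII that the abstract refers to is commonly stated as below. The notation (value function f on a feature set N with |N| = n, interaction set S, discrete derivative δ_S) is assumed here and may differ from the paper's own; the block only illustrates why direct evaluation enumerates exponentially many subsets.

```latex
% Discrete derivative of f with respect to the feature set S, evaluated at T
\delta_S f(T) = \sum_{W \subseteq S} (-1)^{|S| - |W|} \, f(T \cup W)

% Order-k Shapley Taylor Interaction Index
\mathcal{I}^{k}_{S}(f) =
\begin{cases}
  \delta_S f(\emptyset), & |S| < k, \\[4pt]
  \dfrac{k}{n} \displaystyle\sum_{T \subseteq N \setminus S}
    \dfrac{\delta_S f(T)}{\binom{n-1}{|T|}}, & |S| = k.
\end{cases}
```

For k = 1 the second case reduces to the ordinary Shapley value. The sum over all T ⊆ N \ S, each term of which is itself a sum over subsets of S, is the enumeration that becomes infeasible as n grows.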

[AI-55] Robustness Test for AI Forecasting of Hurricane Florence Using FourCastNetv2 and Random Perturbations of the Initial Condition

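[Quick Read]: This paper addresses the robustness of AI weather forecasting models to input noise and uncertainty, with a focus on the reliability of predictions for extreme weather events such as hurricanes. Its approach rests on two experiments: first, injecting Gaussian noise of varying strength into the initial condition to assess how sensitive the predicted hurricane track and intensity are; second, starting the model from fully random initial conditions and observing its output. The key findings are that FourCastNetv2 (FCNv2) accurately preserves hurricane features under low to moderate noise and maintains the general trajectory and structure even under high noise, but consistently underestimates storm intensity. With non-physical initial conditions the model still quickly produces smooth, cohesive forecasts, indicating a tendency towards stable outputs. The method is simple and portable to other data-driven AI weather forecasting models.

Link: https://arxiv.org/abs/2512.05323
Authors: Adam Lizerbram,Shane Stevenson,Iman Khadir,Matthew Tu,Samuel S. P. Shen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML); Other Statistics (stat.OT)
Comments: 26 pages, 12 figures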

Abstract:Understanding the robustness of a weather forecasting model with respect to input noise or different uncertainties is important in assessing its output reliability, particularly for extreme weather events like hurricanes. In this paper, we test the sensitivity and robustness of an artificial intelligence (AI) weather forecasting model: NVIDIA's FourCastNetv2 (FCNv2). We conduct two experiments designed to assess model output under different levels of injected noise in the model's initial condition. First, we perturb the initial condition of Hurricane Florence from the European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis v5 (ERA5) dataset (September 13-16, 2018) with varying amounts of Gaussian noise and examine the impact on predicted trajectories and forecasted storm intensity. Second, we start FCNv2 with fully random initial conditions and observe how the model responds to nonsensical inputs. Our results indicate that FCNv2 accurately preserves hurricane features under low to moderate noise injection. Even under high levels of noise, the model maintains the general storm trajectory and structure, although positional accuracy begins to degrade. FCNv2 consistently underestimates storm intensity and persistence across all levels of injected noise. With fully random initial conditions, the model generates smooth and cohesive forecasts after a few timesteps, implying the model's tendency towards stable, smoothed outputs. Our approach is simple and portable to other data-driven AI weather forecasting models.
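
The first experiment (Gaussian noise added to the initial condition at several strengths) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the random array stands in for one ERA5 variable, and scaling the noise by the field's own standard deviation, the function name `perturb_initial_condition`, and the chosen noise levels are all assumptions.

```python
import numpy as np

def perturb_initial_condition(field, noise_level, rng):
    """Add zero-mean Gaussian noise whose standard deviation is
    noise_level times the standard deviation of the field itself."""
    sigma = noise_level * field.std()
    return field + rng.normal(0.0, sigma, size=field.shape)

rng = np.random.default_rng(0)

# Hypothetical stand-in for a single atmospheric variable on a lat-lon grid.
initial = rng.standard_normal((32, 64))

# One perturbed copy per (assumed) noise level; each would be fed to the
# forecasting model in place of the clean initial condition.
perturbed = {
    level: perturb_initial_condition(initial, level, rng)
    for level in (0.0, 0.1, 0.5)
}
```

Each perturbed field would then be rolled forward by the forecasting model and compared against the unperturbed forecast for track and intensity.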

[AI-56] WhatsCode: Large-Scale GenAI Deployment for Developer Efficiency at WhatsApp

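[Quick Read]: This paper addresses the research gap around deploying AI-assisted development tools in compliance-relevant, large-scale industrial environments, in particular how to achieve effective human-AI collaboration and sustainable business value in real production settings. The key to its solution is building and validating WhatsCode, a domain-specific AI development system that supports multi-platform code management for WhatsApp (serving over 2 billion users) and that evolved from automated privacy verification to autonomous agentic workflows integrated with end-to-end feature development and DevOps processes. The study finds that organizational factors such as ownership models, adoption dynamics, and risk management are as decisive as technical capabilities, and that two stable human-AI collaboration patterns, one-click rollout for high-confidence changes (60% of cases) and commandeer-revise for complex decisions (40%), underpin the system's successful deployment. This indicates that effective human-AI collaboration, rather than full automation, drives the long-term impact of enterprise-scale AI tools.

Link: https://arxiv.org/abs/2512.05314
Authors: Ke Mao,Timotej Kapus,Cons T Åhs,Matteo Marescotti,Daniel Ip,Ákos Hajdu,Sopot Cela,Aparup Banerjee
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 11 pages, 4 figures, 48th International Conference on Software Engineering: Software Engineering in Practice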

Abstract:The deployment of AI-assisted development tools in compliance-relevant, large-scale industrial environments represents significant gaps in academic literature, despite growing industry adoption. We report on the industrial deployment of WhatsCode, a domain-specific AI development system that supports WhatsApp (serving over 2 billion users) and processes millions of lines of code across multiple platforms. Over 25 months (2023-2025), WhatsCode evolved from targeted privacy automation to autonomous agentic workflows integrated with end-to-end feature development and DevOps processes. WhatsCode achieved substantial quantifiable impact, improving automated privacy verification coverage 3.5x from 15% to 53%, identifying privacy requirements, and generating over 3,000 accepted code changes with acceptance rates ranging from 9% to 100% across different automation domains. The system committed 692 automated refactor/fix changes, 711 framework adoptions, 141 feature development assists and maintained 86% precision in bug triage. Our study identifies two stable human-AI collaboration patterns that emerged from production deployment: one-click rollout for high-confidence changes (60% of cases) and commandeer-revise for complex decisions (40%). We demonstrate that organizational factors, such as ownership models, adoption dynamics, and risk management, are as decisive as technical capabilities for enterprise-scale AI success. The findings provide evidence-based guidance for large-scale AI tool deployment in compliance-relevant environments, showing that effective human-AI collaboration, not full automation, drives sustainable business impact. 

[AI-57] The Erosion of LLM Signatures: Can We Still Distinguish Human and LLM-Generated Scientific Ideas After Iterative Paraphrasing?

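[Quick Read]: This paper addresses the largely unexplored problem of distinguishing scientific ideas generated by Large Language Models (LLMs) from those generated by humans, especially the challenge of identifying the source after multiple rounds of paraphrasing. The key to its approach is a systematic evaluation of how well state-of-the-art machine learning models discriminate between LLM and human ideas at different paraphrasing stages. It finds that adding the research problem as contextual information improves detection performance (by up to 2.97%), and that paraphrasing ideas into a simplified, non-expert style erodes distinguishable LLM signatures the most, making it the main factor that degrades attribution accuracy.

Link: https://arxiv.org/abs/2512.05311
Authors: Sadat Shahriar,Navid Ayoobi,Arjun Mukherjee
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published in RANLP 2025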

Abstract:With the increasing reliance on LLMs as research agents, distinguishing between LLM and human-generated ideas has become crucial for understanding the cognitive nuances of LLMs’ research capabilities. While detecting LLM-generated text has been extensively studied, distinguishing human- vs. LLM-generated scientific ideas remains an unexplored area. In this work, we systematically evaluate the ability of state-of-the-art (SOTA) machine learning models to differentiate between human and LLM-generated ideas, particularly after successive paraphrasing stages. Our findings highlight the challenges SOTA models face in source attribution, with detection performance declining by an average of 25.4% after five consecutive paraphrasing stages. Additionally, we demonstrate that incorporating the research problem as contextual information improves detection performance by up to 2.97%. Notably, our analysis reveals that detection algorithms struggle significantly when ideas are paraphrased into a simplified, non-expert style, contributing the most to the erosion of distinguishable LLM signatures.

[AI-58] CFO: Learning Continuous-Time PDE Dynamics via Flow-Matched Neural Operators

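[Quick Read]: This paper addresses two core problems that neural operators face when solving time-dependent Partial Differential Equations (PDEs): conventional autoregressive prediction accumulates error over long rollouts, and it depends on uniform temporal discretization, which limits flexibility in practice. The key to the solution is the Continuous Flow Operator (CFO), whose central idea is to use flow matching to learn the right-hand side of the PDE (the dynamics term) directly, without backpropagating through an ODE solver. CFO fits temporal splines to trajectory data and uses finite-difference estimates of the time derivatives to construct probability paths whose velocity fields approximate the true PDE dynamics. The method is time-resolution invariant: it can be trained on arbitrary non-uniform time grids and supports queries at arbitrary times as well as reverse-time inference, improving long-horizon stability and data efficiency.

Link: https://arxiv.org/abs/2512.05297
Authors: Xianglong Hou,Xinquan Huang,Paris Perdikaris
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments: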

Abstract:Neural operator surrogates for time-dependent partial differential equations (PDEs) conventionally employ autoregressive prediction schemes, which accumulate error over long rollouts and require uniform temporal discretization. We introduce the Continuous Flow Operator (CFO), a framework that learns continuous-time PDE dynamics without the computational burden of standard continuous approaches, e.g., neural ODE. The key insight is repurposing flow matching to directly learn the right-hand side of PDEs without backpropagating through ODE solvers. CFO fits temporal splines to trajectory data, using finite-difference estimates of time derivatives at knots to construct probability paths whose velocities closely approximate the true PDE dynamics. A neural operator is then trained via flow matching to predict these analytic velocity fields. This approach is inherently time-resolution invariant: training accepts trajectories sampled on arbitrary, non-uniform time grids while inference queries solutions at any temporal resolution through ODE integration. Across four benchmarks (Lorenz, 1D Burgers, 2D diffusion-reaction, 2D shallow water), CFO demonstrates superior long-horizon stability and remarkable data efficiency. CFO trained on only 25% of irregularly subsampled time points outperforms autoregressive baselines trained on complete data, with relative error reductions up to 87%. Despite requiring numerical integration at inference, CFO achieves competitive efficiency, outperforming autoregressive baselines using only 50% of their function evaluations, while uniquely enabling reverse-time inference and arbitrary temporal querying.
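
The spline-plus-finite-difference step described in the abstract can be illustrated with a small sketch. It assumes a cubic Hermite interpolant between two neighbouring knots and uses `numpy.gradient` on a non-uniform grid for the derivative estimates; the function name, the scalar toy trajectory, and the choice of Hermite splines are assumptions, and the paper's actual probability-path construction and neural-operator training are not reproduced here.

```python
import numpy as np

def hermite_state_and_velocity(t, t0, t1, y0, y1, d0, d1):
    """Evaluate the cubic Hermite interpolant on [t0, t1] and its time
    derivative at time t, given knot values y0, y1 and knot slopes d0, d1."""
    h = t1 - t0
    s = (t - t0) / h  # normalised position in [0, 1]
    # Hermite basis functions in s
    h00 = 2 * s**3 - 3 * s**2 + 1
    h10 = s**3 - 2 * s**2 + s
    h01 = -2 * s**3 + 3 * s**2
    h11 = s**3 - s**2
    state = h00 * y0 + h10 * h * d0 + h01 * y1 + h11 * h * d1
    # Derivatives of the basis functions with respect to s
    dh00 = 6 * s**2 - 6 * s
    dh10 = 3 * s**2 - 4 * s + 1
    dh01 = -6 * s**2 + 6 * s
    dh11 = 3 * s**2 - 2 * s
    # Chain rule: d/dt = (1/h) d/ds
    velocity = (dh00 * y0 + dh01 * y1) / h + dh10 * d0 + dh11 * d1
    return state, velocity

# Toy trajectory sampled on an irregular time grid (stand-in for PDE snapshots).
t_knots = np.array([0.0, 0.1, 0.4, 0.5, 1.0])
y_knots = np.sin(t_knots)

# Finite-difference estimates of dy/dt at the knots; numpy.gradient accepts
# non-uniform coordinates.
d_knots = np.gradient(y_knots, t_knots)

# Interpolated state and target velocity at an off-grid time in [0.1, 0.4].
state, velocity = hermite_state_and_velocity(
    0.25, t_knots[1], t_knots[2],
    y_knots[1], y_knots[2], d_knots[1], d_knots[2],
)
```

In a flow-matching setup, pairs of interpolated state and velocity like these would serve as regression targets for the learned velocity model, which is then integrated with an ODE solver at inference time.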

[AI-59] Beyond Detection: A Comprehensive Benchmark and Study on Representation Learning for Fine-Grained Webshell Family Classification

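[Quick Read]: This paper addresses the lack of automation in WebShell family classification, which currently depends on slow manual analysis and struggles to keep up with evolving malicious WebShells. The key to its solution is threefold. First, it extracts dynamic function call traces to capture inherent behaviors that are resistant to common encryption and obfuscation. Second, it uses Large Language Models (LLMs) to synthesize new variants that enlarge and diversify the dataset. Third, it abstracts the traces into sequences, graphs, and trees, and systematically evaluates a range of representation learning methods (classic embeddings, Transformer models, and structure-aware algorithms). The result is the first automated benchmark for WebShell family classification, which provides technical support for precise and rapid proactive defense.

Link: https://arxiv.org/abs/2512.05288
Authors: Feijiang Han
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: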

Abstract:Malicious WebShells pose a significant and evolving threat by compromising critical digital infrastructures and endangering public services in sectors such as healthcare and finance. While the research community has made significant progress in WebShell detection (i.e., distinguishing malicious samples from benign ones), we argue that it is time to transition from passive detection to in-depth analysis and proactive defense. One promising direction is the automation of WebShell family classification, which involves identifying the specific malware lineage in order to understand an adversary’s tactics and enable a precise, rapid response. This crucial task, however, remains a largely unexplored area that currently relies on slow, manual expert analysis. To address this gap, we present the first systematic study to automate WebShell family classification. Our method begins with extracting dynamic function call traces to capture inherent behaviors that are resistant to common encryption and obfuscation. To enhance the scale and diversity of our dataset for a more stable evaluation, we augment these real-world traces with new variants synthesized by Large Language Models. These augmented traces are then abstracted into sequences, graphs, and trees, providing a foundation to benchmark a comprehensive suite of representation methods. Our evaluation spans classic sequence-based embeddings (CBOW, GloVe), transformers (BERT, SimCSE), and a range of structure-aware algorithms, including Graph Kernels, Graph Edit Distance, Graph2Vec, and various Graph Neural Networks. Through extensive experiments on four real-world, family-annotated datasets under both supervised and unsupervised settings, we establish a robust baseline and provide practical insights into the most effective combinations of data abstractions, representation models, and learning paradigms for this challenge.
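
As a rough illustration of what abstracting a function call trace into a sequence and a graph can look like, the sketch below turns a trace into a token sequence and a weighted directed call-transition graph. The trace contents, the function names, and the "consecutive calls become edges" rule are illustrative assumptions, not the paper's pipeline.

```python
from collections import Counter

def trace_to_sequence(trace):
    """Sequence view: the ordered list of called function names."""
    return list(trace)

def trace_to_transition_graph(trace):
    """Graph view: directed edges between consecutive calls, weighted by
    how often each transition occurs in the trace."""
    return Counter(zip(trace, trace[1:]))

# Hypothetical dynamic trace recorded while a sample executes.
trace = [
    "base64_decode", "gzinflate", "eval",
    "base64_decode", "gzinflate", "eval",
]

sequence = trace_to_sequence(trace)
graph = trace_to_transition_graph(trace)
```

The sequence view suits token-based embeddings, while the edge-weighted graph is the kind of input that graph kernels or graph neural networks operate on.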

[AI-60] XR-DT: Extended Reality-Enhanced Digital Twin for Agentic Mobile Robots

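[Quick Read]: This paper addresses how to achieve safe, efficient, and interpretable Human-Robot Interaction (HRI) when mobile robots share workspaces with humans, focusing on a key bottleneck: how humans perceive, interpret, and trust a robot's inferences. The core of the solution is an eXtended Reality-enhanced Digital Twin framework (XR-DT) that integrates virtual-reality (VR), augmented-reality (AR), and mixed-reality (MR) layers to map between the physical and virtual worlds in both directions, fusing real-time sensor data, Unity simulation environments, and human feedback captured through wearable AR devices. On top of this, the system builds an agentic mobile robot architecture with a unified diffusion policy, and adds a chain-of-thought prompting mechanism for multimodal large language models together with an AutoGen-based multi-agent coordination layer. Together these improve task adaptability, robustness, and interaction transparency, enabling interpretable, trustworthy, and adaptive human-robot collaboration.

Link: https://arxiv.org/abs/2512.05270
Authors: Tianyi Wang,Jiseop Byeon,Ahmad Yehia,Huihai Wang,Yiming Xu,Tianyi Zeng,Ziran Wang,Junfeng Jiao,Christian Claudel
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: 10 pages, 5 figures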

Abstract:As mobile robots increasingly operate alongside humans in shared workspaces, ensuring safe, efficient, and interpretable Human-Robot Interaction (HRI) has become a pressing challenge. While substantial progress has been devoted to human behavior prediction, limited attention has been paid to how humans perceive, interpret, and trust robots’ inferences, impeding deployment in safety-critical and socially embedded environments. This paper presents XR-DT, an eXtended Reality-enhanced Digital Twin framework for agentic mobile robots, that bridges physical and virtual spaces to enable bi-directional understanding between humans and robots. Our hierarchical XR-DT architecture integrates virtual-, augmented-, and mixed-reality layers, fusing real-time sensor data, simulated environments in the Unity game engine, and human feedback captured through wearable AR devices. Within this framework, we design an agentic mobile robot system with a unified diffusion policy for context-aware task adaptation. We further propose a chain-of-thought prompting mechanism that allows multimodal large language models to reason over human instructions and environmental context, while leveraging an AutoGen-based multi-agent coordination layer to enhance robustness and collaboration in dynamic tasks. Initial experimental results demonstrate accurate human and robot trajectory prediction, validating the XR-DT framework’s effectiveness in HRI tasks. By embedding human intention, environmental dynamics, and robot cognition into the XR-DT framework, our system enables interpretable, trustworthy, and adaptive HRI.

[AI-61] Uncertainty-Aware Data-Efficient AI: An Information-Theoretic Perspective

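[Quick Read]: This paper addresses the limits on predictive performance caused by scarce training data in settings such as robotics, telecommunications, and healthcare, where the core challenge is handling the epistemic uncertainty that data scarcity introduces. Its solution works along two lines: first, formally modeling epistemic uncertainty through generalized Bayesian learning frameworks (such as generalized posteriors) and "post-Bayes" learning methods; second, using information-theoretic generalization bounds to quantify the relationship between the amount of training data and predictive uncertainty, combined with synthetic data augmentation to improve data efficiency. The paper also emphasizes uncertainty quantification methods with finite-sample statistical guarantees (such as conformal prediction and conformal risk control), offering a systematic way to mitigate data scarcity.

Link: https://arxiv.org/abs/2512.05267
Authors: Osvaldo Simeone,Yaniv Romano
Affiliations: Unknown
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: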

Abstract:In context-specific applications such as robotics, telecommunications, and healthcare, artificial intelligence systems often face the challenge of limited training data. This scarcity introduces epistemic uncertainty, i.e., reducible uncertainty stemming from incomplete knowledge of the underlying data distribution, which fundamentally limits predictive performance. This review paper examines formal methodologies that address data-limited regimes through two complementary approaches: quantifying epistemic uncertainty and mitigating data scarcity via synthetic data augmentation. We begin by reviewing generalized Bayesian learning frameworks that characterize epistemic uncertainty through generalized posteriors in the model parameter space, as well as “post-Bayes” learning frameworks. We continue by presenting information-theoretic generalization bounds that formalize the relationship between training data quantity and predictive uncertainty, providing a theoretical justification for generalized Bayesian learning. Moving beyond methods with asymptotic statistical validity, we survey uncertainty quantification methods that provide finite-sample statistical guarantees, including conformal prediction and conformal risk control. Finally, we examine recent advances in data efficiency by combining limited labeled data with abundant model predictions or synthetic data. Throughout, we take an information-theoretic perspective, highlighting the role of information measures in quantifying the impact of data scarcity.
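
To make the finite-sample guarantee of conformal prediction concrete, here is a minimal sketch of split conformal prediction for regression with absolute-residual scores. It is a generic textbook construction rather than anything specific to this review; the function name and the toy calibration residuals are assumptions.

```python
import math
import numpy as np

def split_conformal_radius(cal_residuals, alpha):
    """Half-width q such that the interval [prediction - q, prediction + q]
    covers a new exchangeable point with probability at least 1 - alpha."""
    scores = np.sort(np.abs(np.asarray(cal_residuals, dtype=float)))
    n = len(scores)
    # Rank of the conformal quantile among the n calibration scores.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only the trivial
        # (infinite) interval carries the guarantee.
        return float("inf")
    return float(scores[k - 1])

# Hypothetical residuals |y - prediction| on a held-out calibration set.
cal_residuals = np.arange(1, 11) / 10  # 0.1, 0.2, ..., 1.0

radius_80 = split_conformal_radius(cal_residuals, alpha=0.2)
radius_95 = split_conformal_radius(cal_residuals, alpha=0.05)
```

With only ten calibration points, the 80% interval uses the ninth smallest score, while a 95% target cannot be certified and falls back to an infinite radius, which shows directly how data scarcity widens the guaranteed set.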

[AI-62] Resolving Zadeh's Paradox: Axiomatic Possibility Theory as a Foundation for Reliable Artificial Intelligence

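[Quick Read]: This paper addresses the paradoxes of Dempster-Shafer theory (DST), which can lead to logical traps when processing uncertain information, especially in the presence of contradictory data. The key to its solution is possibility theory, built on the axiomatic approach proposed by Bychkov into a logically consistent and mathematically rigorous framework. This framework rests on the dualistic apparatus of possibility and necessity measures and is presented as a fundamental resolution for modeling uncertainty rather than merely an alternative.

Link: https://arxiv.org/abs/2512.05257
Authors: Bychkov Oleksii,Bychkova Sophia,Lytvynchuk Khrystyna
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages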

Abstract:This work advances and substantiates the thesis that the resolution of this crisis lies in the domain of possibility theory, specifically in the axiomatic approach developed in Bychkov's article. Unlike numerous attempts to fix Dempster's rule, this approach builds from scratch a logically consistent and mathematically rigorous foundation for working with uncertainty, using the dualistic apparatus of possibility and necessity measures. The aim of this work is to demonstrate that possibility theory is not merely an alternative, but provides a fundamental resolution to DST paradoxes. A comparative analysis of three paradigms will be conducted: probabilistic, evidential, and possibilistic. Using a classic medical diagnostic dilemma as an example, it will be shown how possibility theory allows for correct processing of contradictory data, avoiding the logical traps of DST and bringing formal reasoning closer to the logic of natural intelligence.
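
The paradox at issue can be reproduced in a few lines. The sketch below implements Dempster's rule of combination and applies it to the commonly cited form of Zadeh's medical example, then shows the dual possibility and necessity measures on a toy distribution. The specific masses, the possibility distribution, and all function names are assumptions for illustration; the paper's own axiomatic construction is not reproduced.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Dempster's rule: multiply masses of intersecting focal sets and
    renormalise by the non-conflicting mass. Keys are frozensets."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            combined[common] = combined.get(common, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}, conflict

def possibility(pi, event):
    """Possibility of an event: the largest possibility degree inside it."""
    return max((pi[x] for x in event), default=0.0)

def necessity(pi, event):
    """Necessity is dual to possibility: N(A) = 1 - Pi(complement of A)."""
    return 1.0 - possibility(pi, set(pi) - set(event))

# Two experts who almost entirely disagree (assumed numbers).
doctor_1 = {frozenset({"meningitis"}): 0.99, frozenset({"tumor"}): 0.01}
doctor_2 = {frozenset({"concussion"}): 0.99, frozenset({"tumor"}): 0.01}
fused, conflict = dempster_combine(doctor_1, doctor_2)
tumor_belief = fused[frozenset({"tumor"})]

# Toy possibility distribution over the same hypotheses (assumed numbers).
pi = {"meningitis": 1.0, "tumor": 0.2, "concussion": 0.0}
tumor_possibility = possibility(pi, {"tumor"})
meningitis_necessity = necessity(pi, {"meningitis"})
```

After normalisation, the hypothesis both experts considered very unlikely receives essentially all of the combined mass, which is the counterintuitive outcome usually referred to as Zadeh's paradox.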

[AI-63] Learning to Code with Context: A Study-Based Approach

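[Quick Read]: This paper addresses how to integrate generative AI tools effectively into software engineering education, so that students learn traditional development methods while also learning to use these emerging technologies responsibly and meaningfully. The key to its solution is designing and evaluating a project-based learning environment that combines a user study with a locally deployed, repository-aware Large Language Model (LLM) assistant. The assistant uses Retrieval-Augmented Generation (RAG) to ground its responses in project-relevant documentation and source code, providing context-aware support. A qualitative analysis of model behavior, parameter sensitivity, and common failure modes provides empirical evidence and practical guidance for integrating AI assistance into software engineering curricula.

Link: https://arxiv.org/abs/2512.05242
Authors: Uwe M. Borghoff,Mark Minas,Jannis Schopp
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 36 pages, 7 figures, 5 tables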

Abstract:The rapid emergence of generative AI tools is transforming the way software is developed. Consequently, software engineering education must adapt to ensure that students not only learn traditional development methods but also understand how to meaningfully and responsibly use these new technologies. In particular, project-based courses offer an effective environment to explore and evaluate the integration of AI assistance into real-world development practices. This paper presents our approach and a user study conducted within a university programming project in which students collaboratively developed computer games. The study investigates how participants used generative AI tools throughout different phases of the software development process, identifies the types of tasks where such tools were most effective, and analyzes the challenges students encountered. Building on these insights, we further examine a repository-aware, locally deployed large language model (LLM) assistant designed to provide project-contextualized support. The system employs Retrieval-Augmented Generation (RAG) to ground responses in relevant documentation and source code, enabling qualitative analysis of model behavior, parameter sensitivity, and common failure modes. The findings deepen our understanding of context-aware AI support in educational software projects and inform future integration of AI-based assistance into software engineering curricula.

[AI-64] A Survey of Bugs in AI-Generated Code

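[Quick Read]: This paper addresses the lack of a clear, systematic picture of the types and distribution of bugs and defects in AI-generated code and of how they correlate with specific models: current findings on the quality of AI-generated code are scattered and lack a unified classification and summary. The key to its solution is a systematic analysis of the existing literature that builds a comprehensive bug taxonomy, identifies typical error patterns in code generated by different models, and discusses feasible fixes and mitigation strategies, providing a reference for future model improvement and quality assessment.

Link: https://arxiv.org/abs/2512.05239
Authors: Ruofan Gao,Amjed Tahir,Peng Liang,Teo Susnjak,Foutse Khomh
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: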

Abstract:Developers are widely using AI code-generation models, aiming to increase productivity and efficiency. However, there are also quality concerns regarding the AI-generated code. The generated code is produced by models trained on publicly available code, which are known to contain bugs and quality issues. Those issues can cause trust and maintenance challenges during the development process. Several quality issues associated with AI-generated code have been reported, including bugs and defects. However, these findings are often scattered and lack a systematic summary. A comprehensive review is currently lacking to reveal the types and distribution of these errors, possible remediation strategies, as well as their correlation with the specific models. In this paper, we systematically analyze the existing AI-generated code literature to establish an overall understanding of bugs and defects in generated code, providing a reference for future model improvement and quality assessment. We aim to understand the nature and extent of bugs in AI-generated code, and provide a classification of bug types and patterns present in code generated by different models. We also discuss possible fixes and mitigation strategies adopted to eliminate bugs from the generated code.

[AI-65] MAR-FL: A Communication Efficient Peer-to-Peer Federated Learning System NEURIPS2025

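[Quick Read]: This paper addresses the inefficiency and limited robustness of Federated Learning (FL) among wirelessly connected peers under network churn, in the setting where next-generation wireless systems converge with distributed machine learning. Existing Peer-to-Peer (P2P) FL methods remove the bottleneck of a central coordinator but suffer from excessive communication complexity (O(N^2)), which limits practical scalability. The key to the solution is MAR-FL, a new P2P FL system that uses iterative group-based aggregation to reduce communication cost to O(N log N), a substantial improvement over existing baselines. It remains efficient and robust to unreliable clients as the number of peers grows, and it can integrate private computing.

Link: https://arxiv.org/abs/2512.05234
Authors: Felix Mulitze,Herbert Woisetschläger,Hans Arno Jacobsen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the peer-reviewed AI4NextG Workshop at NeurIPS 2025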

Abstract:The convergence of next-generation wireless systems and distributed Machine Learning (ML) demands Federated Learning (FL) methods that remain efficient and robust with wireless connected peers and under network churn. Peer-to-peer (P2P) FL removes the bottleneck of a central coordinator, but existing approaches suffer from excessive communication complexity, limiting their scalability in practice. We introduce MAR-FL, a novel P2P FL system that leverages iterative group-based aggregation to substantially reduce communication overhead while retaining resilience to churn. MAR-FL achieves communication costs that scale as O(N log N), contrasting with the O(N^2) complexity of previously existing baselines, and thereby maintains effectiveness especially as the number of peers in an aggregation round grows. The system is robust towards unreliable FL clients and can integrate private computing.
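
The gap between O(N log N) and O(N^2) communication can be illustrated with a generic butterfly-style averaging scheme, in which peers exchange with one partner per round over log2(N) rounds. This is only one well-known way to reach N log N messages and is an assumption for illustration; it is not claimed to be MAR-FL's actual grouping protocol, and the scalar "models" and function name are hypothetical.

```python
def butterfly_average(values):
    """Average one scalar 'model' per peer using pairwise exchanges over
    log2(N) rounds. Returns the final per-peer states and message count."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of peers")
    state = list(values)
    messages = 0
    stride = 1
    while stride < n:
        # In each round, peer i pairs with the peer whose index differs
        # from i in exactly one bit, and both keep the pairwise mean.
        state = [(state[i] + state[i ^ stride]) / 2 for i in range(n)]
        messages += n  # every peer receives exactly one message per round
        stride *= 2
    return state, messages

peers = [float(i) for i in range(8)]
averaged, group_messages = butterfly_average(peers)

# Naive all-to-all exchange: every peer sends its model to every other peer.
all_to_all_messages = len(peers) * (len(peers) - 1)
```

For eight peers, every peer ends up holding the global mean after three rounds, using fewer than half the messages of the all-to-all exchange; the gap widens as N grows.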

[AI-66] Invariance Co-training for Robot Visual Generalization

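[Quick Read]: This paper addresses the limited generalization of current large-scale robot policy models under observational variation such as changes in camera perspective, lighting, and the presence of distractor objects. The core challenge is that existing datasets lack sufficiently diverse observational variation, so models struggle to adapt robustly to complex environments. The key to the solution is introducing two auxiliary tasks, state similarity and invariance to observational perturbations, and co-training on expensive real-robot demonstration data together with cheaper, visually rich synthetic images from non-physics-based simulation (for example, data generated with Unreal Engine). This substantially improves generalization to unseen camera viewpoints, lighting configurations, and distractor conditions, with an 18 percent improvement over existing generative augmentation methods.

Link: https://arxiv.org/abs/2512.05230
Authors: Jonathan Yang,Chelsea Finn,Dorsa Sadigh
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 14 pages, 10 figures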

Abstract:Reasoning from diverse observations is a fundamental capability for generalist robot policies to operate in a wide range of environments. Despite recent advancements, many large-scale robotic policies still remain sensitive to key sources of observational variation such as changes in camera perspective, lighting, and the presence of distractor objects. We posit that the limited generalizability of these models arises from the substantial diversity required to robustly cover these quasistatic axes, coupled with the current scarcity of large-scale robotic datasets that exhibit rich variation across them. In this work, we propose to systematically examine what robots need to generalize across these challenging axes by introducing two key auxiliary tasks, state similarity and invariance to observational perturbations, applied to both demonstration data and static visual data. We then show that via these auxiliary tasks, leveraging both more-expensive robotic demonstration data and less-expensive, visually rich synthetic images generated from non-physics-based simulation (for example, Unreal Engine) can lead to substantial increases in generalization to unseen camera viewpoints, lighting configurations, and distractor conditions. Our results demonstrate that co-training on this diverse data improves performance by 18 percent over existing generative augmentation methods. For more information and videos, please visit this https URL

[AI-67] Towards A Cultural Intelligence and Values Inferences Quality Benchmark for Community Values and Common Knowledge

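[Quick Read]: This paper addresses the pronounced cultural misalignment of current Large Language Models (LLMs): existing models are mostly aligned with predominantly Western Caucasian narratives, which leaves other cultural groups (especially the diverse cultural groups within the United States) underrepresented and misrepresented. To address this, the paper proposes developing CIVIQ (Cultural Intelligence and Values Inference Quality), a benchmark for cultural intelligence and values inference quality. Its core idea is to adapt the process used to build KorNAT, a Korean national alignment benchmark, shifting the focus from the national level to the community level. By centering on the social values and common knowledge of specific communities, it aims for a finer-grained and more inclusive evaluation of cultural alignment, providing an actionable research basis and evaluation tool for culturally aligned AI in practice.

Link: https://arxiv.org/abs/2512.05176
Authors: Brittany Johnson,Erin Reddick,Angela D.R. Smith
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: Under review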

Abstract:Large language models (LLMs) have emerged as a powerful technology, and thus, we have seen widespread adoption and use on software engineering teams. Most often, LLMs are designed as “general purpose” technologies meant to represent the general population. Unfortunately, this often means alignment with predominantly Western Caucasian narratives and misalignment with other cultures and populations that engage in collaborative innovation. In response to this misalignment, there have been recent efforts centered on the development of “culturally-informed” LLMs, such as ChatBlackGPT, that are capable of better aligning with historically marginalized experiences and perspectives. Despite this progress, there has been little effort aimed at supporting our ability to develop and evaluate culturally-informed LLMs. A recent effort proposed an approach for developing a national alignment benchmark that emphasizes alignment with national social values and common knowledge. However, given the range of cultural identities present in the United States (U.S.), a national alignment benchmark is an ineffective goal for broader representation. To help fill this gap in this US context, we propose a replication study that translates the process used to develop KorNAT, a Korean National LLM alignment benchmark, to develop CIVIQ, a Cultural Intelligence and Values Inference Quality benchmark centered on alignment with community social values and common knowledge. Our work provides a critical foundation for research and development aimed at cultural alignment of AI technologies in practice.

[AI-68] Advanced Unsupervised Learning: A Comprehensive Overview of Multi-View Clustering Techniques

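[Quick read]: This survey addresses challenges that machine learning faces in practice: computational constraints, the limitations of single-view learning algorithms, and the complexity of processing large datasets from different domains, sources, or views. It focuses on multi-view clustering (MVC), an unsupervised multi-view learning approach that integrates information from different views to compensate for the shortcomings of single-view methods, yielding richer data representations and more effective solutions for unsupervised tasks. Its key contributions are a systematic categorization of MVC methods (co-training, co-regularization, subspace, deep learning, kernel-based, anchor-based, and graph-based), an in-depth analysis of each category's strengths, weaknesses, and practical challenges such as scalability and incomplete data, and a comparison of integration strategies (early fusion, late fusion, and joint learning), with use cases in healthcare, multimedia, and social network analysis.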

Link: https://arxiv.org/abs/2512.05169
Authors: Abdelmalik Moujahid, Fadi Dornaika
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Machine learning techniques face numerous challenges to achieve optimal performance. These include computational constraints, the limitations of single-view learning algorithms and the complexity of processing large datasets from different domains, sources or views. In this context, multi-view clustering (MVC), a class of unsupervised multi-view learning, emerges as a powerful approach to overcome these challenges. MVC compensates for the shortcomings of single-view methods and provides a richer data representation and effective solutions for a variety of unsupervised learning tasks. In contrast to traditional single-view approaches, the semantically rich nature of multi-view data increases its practical utility despite its inherent complexity. This survey makes a threefold contribution: (1) a systematic categorization of multi-view clustering methods into well-defined groups, including co-training, co-regularization, subspace, deep learning, kernel-based, anchor-based, and graph-based strategies; (2) an in-depth analysis of their respective strengths, weaknesses, and practical challenges, such as scalability and incomplete data; and (3) a forward-looking discussion of emerging trends, interdisciplinary applications, and future directions in MVC research. This study represents an extensive workload, encompassing the review of over 140 foundational and recent publications, the development of comparative insights on integration strategies such as early fusion, late fusion, and joint learning, and the structured investigation of practical use cases in the areas of healthcare, multimedia, and social network analysis. By integrating these efforts, this work aims to fill existing gaps in MVC research and provide actionable insights for the advancement of the field.

[AI-69] Bridging Traditional Machine Learning and Large Language Models: A Two-Part Course Design for Modern AI Education

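[Quick read]: This paper asks how AI and data science teaching can bridge the gap between traditional machine learning techniques and modern large language models (LLMs), so that students better understand how AI has evolved and gain practical skills. The key is a clearly structured, two-part course: the first part covers foundational machine learning concepts and the second turns to contemporary LLM applications. The two sequential, complementary modules let students master classical methods while becoming proficient with cutting-edge technology, combining theory and practice.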

Link: https://arxiv.org/abs/2512.05167
Authors: Fang Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by the 39th annual Consortium for Computing Sciences in Colleges (CCSC:SE)

Abstract:This paper presents an innovative pedagogical approach for teaching artificial intelligence and data science that systematically bridges traditional machine learning techniques with modern Large Language Models (LLMs). We describe a course structured in two sequential and complementary parts: foundational machine learning concepts and contemporary LLM applications. This design enables students to develop a comprehensive understanding of AI evolution while building practical skills with both established and cutting-edge technologies. We detail the course architecture, implementation strategies, assessment methods, and learning outcomes from our summer course delivery spanning two seven-week terms. Our findings demonstrate that this integrated approach enhances student comprehension of the AI landscape and better prepares them for industry demands in the rapidly evolving field of artificial intelligence.

[AI-70] Documenting SME Processes with Conversational AI: From Tacit Knowledge to BPMN

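[Quick read]: This paper addresses the difficulty small and medium-sized enterprises (SMEs) have in turning tacit knowledge into formal documentation, in particular experience-based shop-floor process knowledge that lacks structured records and a standardized representation. The key is an LLM-based conversational assistant, powered by Gemini 2.5 Pro and delivered through a lightweight Gradio front-end with client-side bpmn-js visualisation, that captures operators' knowledge step by step through interview-style dialogue and renders live diagrams compliant with Business Process Model and Notation (BPMN) 2.0. In a proof-of-concept evaluation the system supported real-time refinement, flagged issues via on-diagram annotations, and produced both an "AS-IS" model and an improved "TO-BE" variant within about 12 minutes, while keeping API costs within an SME-friendly budget, which lowers the skill and cost barriers to process modeling.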

Link: https://arxiv.org/abs/2512.05122
Authors: Unnikrishnan Radhakrishnan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Presented at 2025 International Workshop on Low-Cost Digital Solutions for Industrial Automation (LODISA)

Abstract:Small and medium-sized enterprises (SMEs) still depend heavily on tacit, experience-based know-how that rarely makes its way into formal documentation. This paper introduces a large-language-model (LLM)-driven conversational assistant that captures such knowledge on the shop floor and converts it incrementally and interactively into standards-compliant Business Process Model and Notation (BPMN) 2.0 diagrams. Powered by Gemini 2.5 Pro and delivered through a lightweight Gradio front-end with client-side bpmn-js visualisation, the assistant conducts an interview-style dialogue: it elicits process details, supports clarifying dialogue and on-demand analysis, and renders live diagrams that users can refine in real time. A proof-of-concept evaluation in an equipment-maintenance scenario shows that the chatbot produced an accurate “AS-IS” model, flagged issues via on-diagram annotations, and generated an improved “TO-BE” variant, all within about 12 minutes, while keeping API costs within an SME-friendly budget. The study analyses latency sources, model-selection trade-offs, and the challenges of enforcing strict XML schemas, then outlines a roadmap toward agentic and multimodal deployments. The results demonstrate that conversational LLMs can potentially be used to lower the skill and cost barriers to rigorous process documentation, helping SMEs preserve institutional knowledge, enhance operational transparency, and accelerate continuous-improvement efforts.
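
To make the "strict XML schema" point concrete, here is a minimal sketch, not the paper's implementation, that builds a linear BPMN 2.0 process with Python's standard library and checks that every sequence flow points at an existing node. The helper names (`q`, `build_process`, `dangling_refs`), the element ids, and the `targetNamespace` value are all hypothetical; only the BPMN model namespace comes from the standard. The sketch omits diagram-interchange layout data, which a renderer would typically also need.

```python
import xml.etree.ElementTree as ET

# Namespace of the BPMN 2.0 semantic model.
BPMN_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"
ET.register_namespace("bpmn", BPMN_NS)


def q(tag):
    """Qualify a tag name with the BPMN model namespace."""
    return "{%s}%s" % (BPMN_NS, tag)


def build_process(task_names):
    """Build start -> task_1 -> ... -> end (all ids are hypothetical)."""
    root = ET.Element(q("definitions"), {
        "id": "defs_1",
        "targetNamespace": "http://example.com/bpmn",  # placeholder value
    })
    proc = ET.SubElement(root, q("process"), {"id": "proc_1"})
    ET.SubElement(proc, q("startEvent"), {"id": "start_1"})
    node_ids = ["start_1"]
    for i, name in enumerate(task_names, start=1):
        ET.SubElement(proc, q("task"), {"id": "task_%d" % i, "name": name})
        node_ids.append("task_%d" % i)
    ET.SubElement(proc, q("endEvent"), {"id": "end_1"})
    node_ids.append("end_1")
    # Chain consecutive nodes with sequence flows.
    for i, (src, dst) in enumerate(zip(node_ids, node_ids[1:]), start=1):
        ET.SubElement(proc, q("sequenceFlow"), {
            "id": "flow_%d" % i, "sourceRef": src, "targetRef": dst,
        })
    return root


def dangling_refs(root):
    """Return (flow id, missing node id) pairs for broken sequence flows."""
    ids = {el.get("id") for el in root.iter() if el.get("id")}
    broken = []
    for flow in root.iter(q("sequenceFlow")):
        for attr in ("sourceRef", "targetRef"):
            if flow.get(attr) not in ids:
                broken.append((flow.get("id"), flow.get(attr)))
    return broken


if __name__ == "__main__":
    process = build_process(["Inspect machine", "Replace filter"])
    print(ET.tostring(process, encoding="unicode"))
    print("dangling references:", dangling_refs(process))
```

A reference check like `dangling_refs` is one cheap guard that could run on model output before rendering; full schema validation would need an XSD validator, which the standard library does not provide.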

[AI-71] PESTalk: Speech-Driven 3D Facial Animation with Personalized Emotional Styles

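[Quick read]: This paper addresses the weakness of existing 3D facial animation methods in personalized emotional expression, in particular the difficulty of extracting fine-grained emotion features from speech and modeling individual expression styles. The proposed PESTalk framework has two core components: (1) a Dual-Stream Emotion Extractor (DSEE) that captures both time-domain and frequency-domain audio features for finer emotion analysis; and (2) an Emotional Style Modeling Module (ESMM) that models individual expression patterns from voiceprint characteristics to generate facial animation with a personalized emotional style. The authors also construct a new 3D-EmoStyle dataset to ease data scarcity, and report that the method outperforms state-of-the-art methods in realism and personalization.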

Link: https://arxiv.org/abs/2512.05121
Authors: Tianshun Han, Benjia Zhou, Ajian Liu, Yanyan Liang, Du Zhang, Zhen Lei, Jun Wan
Affiliation: Unknown
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI)
Comments:

Abstract:PESTalk is a novel method for generating 3D facial animations with personalized emotional styles directly from speech. It overcomes key limitations of existing approaches by introducing a Dual-Stream Emotion Extractor (DSEE) that captures both time and frequency-domain audio features for fine-grained emotion analysis, and an Emotional Style Modeling Module (ESMM) that models individual expression patterns based on voiceprint characteristics. To address data scarcity, the method leverages a newly constructed 3D-EmoStyle dataset. Evaluations demonstrate that PESTalk outperforms state-of-the-art methods in producing realistic and personalized facial animations.

[AI-72] When Ads Become Profiles: Uncovering the Invisible Risk of Web Advertising at Scale with LLMs

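[Quick read]: This paper studies the passive inference of private user attributes (such as party preference, employment status, and education level) from algorithmic advertising, that is, how far ad exposure data alone allows generative AI to reverse-engineer those attributes. The key is a novel pipeline that uses zero-shot multimodal large language models (LLMs) as adversarial inference engines to extract latent signals from ad streams and perform natural-language profiling. Experiments show that the approach outperforms census-based priors and matches or exceeds human social perception, even within short observation windows, at a fraction of the cost and time required by humans. The findings expose a systemic vulnerability in the ad ecosystem and underline the urgent need for responsible web AI governance in the generative AI era.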

Link: https://arxiv.org/abs/2509.18874
Authors: Baiyu Chen, Benjamin Tag, Hao Xue, Daniel Angus, Flora Salim
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Comments:

Abstract:Regulatory limits on explicit targeting have not eliminated algorithmic profiling on the Web, as optimisation systems still adapt ad delivery to users’ private attributes. The widespread availability of powerful zero-shot multimodal Large Language Models (LLMs) has dramatically lowered the barrier for exploiting these latent signals for adversarial inference. We investigate this emerging societal risk, specifically how adversaries can now exploit these signals to reverse-engineer private attributes from ad exposure alone. We introduce a novel pipeline that leverages LLMs as adversarial inference engines to perform natural language profiling. Applying this method to a longitudinal dataset comprising over 435,000 ad impressions collected from 891 users, we conducted a large-scale study to assess the feasibility and precision of inferring private attributes from passive online ad observations. Our results demonstrate that off-the-shelf LLMs can accurately reconstruct complex user private attributes, including party preference, employment status, and education level, consistently outperforming strong census-based priors and matching or exceeding human social perception, while operating at only a fraction of the cost (223$\times$ lower) and time (52$\times$ faster) required by humans. Critically, actionable profiling is feasible even within short observation windows, indicating that prolonged tracking is not a prerequisite for a successful attack. These findings provide the first empirical evidence that ad streams serve as a high-fidelity digital footprint, enabling off-platform profiling that inherently bypasses current platform safeguards, highlighting a systemic vulnerability in the ad ecosystem and the urgent need for responsible web AI governance in the generative AI era. The code is available at this https URL.

[AI-73] How to Tame Your LLM: Semantic Collapse in Continuous Systems

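[Quick read]: This paper tackles the theoretical modeling of semantic dynamics in large language models (LLMs): how discrete symbolic semantics can emerge from continuous computation. The challenge is to understand how LLMs evolve in a high-dimensional latent space and how that evolution produces interpretable, structured semantic representations. The key is to formalize LLMs as Continuous State Machines (CSMs) and to introduce a transfer operator $P: L^2(M,\mu) \to L^2(M,\mu)$ that describes the propagation of semantic mass. Within this framework the author proves the Semantic Characterization Theorem (SCT): under mild regularity assumptions (compactness, ergodicity, bounded Jacobian), $P$ has a discrete spectrum and its leading eigenfunctions induce finitely many spectral basins of invariant meaning, each definable in an o-minimal structure over $\mathbb{R}$, which unifies logical interpretability with continuous dynamics. The result indicates that, although an LLM's internal computation is continuous, its semantic structure is essentially discrete and logically "tame".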

Link: https://arxiv.org/abs/2512.05162
Authors: C. M. Wyss
Affiliation: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Dynamical Systems (math.DS); Probability (math.PR)
Comments: 35 pages, 1 figure. Exolytica AI Technical Report XTR-2025-01

Abstract:We develop a general theory of semantic dynamics for large language models by formalizing them as Continuous State Machines (CSMs): smooth dynamical systems whose latent manifolds evolve under probabilistic transition operators. The associated transfer operator $P: L^2(M,\mu) \to L^2(M,\mu)$ encodes the propagation of semantic mass. Under mild regularity assumptions (compactness, ergodicity, bounded Jacobian), $P$ is compact with discrete spectrum. Within this setting, we prove the Semantic Characterization Theorem (SCT): the leading eigenfunctions of $P$ induce finitely many spectral basins of invariant meaning, each definable in an o-minimal structure over $\mathbb{R}$. Thus spectral lumpability and logical tameness coincide. This explains how discrete symbolic semantics can emerge from continuous computation: the continuous activation manifold collapses into a finite, logically interpretable ontology. We further extend the SCT to stochastic and adiabatic (time-inhomogeneous) settings, showing that slowly drifting kernels preserve compactness, spectral coherence, and basin structure.

[AI-74] GNSS Jammer Direction Finding in Dynamic Scenarios Using an Inertial-based Multi-Antenna System

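[Quick read]: This paper addresses the detection and localization of global navigation satellite system (GNSS) jammers, to improve situational awareness of interference and support effective countermeasures. The key points are as follows. IQ samples are collected with a software-defined radio (Ettus USRP X440) and a two-by-two patch antenna array, and an inertial measurement unit (IMU) is used to predict the relative movement of the antenna in dynamic scenarios. On top of this, a synthetic aperture system uses platform motion to synthesize a larger virtual aperture, enabling coherent spatial imaging and better angular resolution without mechanically rotating the antennas. Finally, a multi-feature fusion approach combines 22 angle-of-arrival (AoA) features with FFT-computed spectrograms to improve direction finding in multipath environments, where classical AoA methods suffer localization errors caused by signal reflections and scattering.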

Link: https://arxiv.org/abs/2512.05128
Authors: Lucas Heublein, Thorsten Nowak, Tobias Feigl, Jaspar Pahl, Felix Ott
Affiliation: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: 9 pages, 26 figures, 2 tables

Abstract:Jamming devices disrupt signals from the global navigation satellite system (GNSS) and pose a significant threat by compromising the reliability of accurate positioning. Consequently, the detection and localization of these interference signals are essential for achieving situational awareness, mitigating their impact, and implementing effective countermeasures. In this paper, we utilize a two-times-two patch antenna system (i.e., the software-defined radio device Ettus USRP X440) to predict the angle, elevation, and distance to the jamming source based on in-phase and quadrature (IQ) samples. We propose to use an inertial measurement unit (IMU) attached to the antenna system to predict the relative movement of the antenna in dynamic scenarios. We present a synthetic aperture system that enables coherent spatial imaging using platform motion to synthesize larger virtual apertures, offering superior angular resolution without mechanically rotating antennas. While classical angle-of-arrival (AoA) methods exhibit reduced accuracy in multipath environments due to signal reflections and scattering, leading to localization errors, we utilize a methodology that fuses IQ and Fast Fourier Transform (FFT)-computed spectrograms with 22 AoA features and the predicted relative movement to enhance GNSS jammer direction finding.
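
For orientation, the classical AoA baseline that the abstract contrasts against can be sketched with the textbook two-antenna phase-interferometry relation $\Delta\varphi = 2\pi d \sin\theta / \lambda$. This is not the paper's fused learning method. The function names and the half-wavelength spacing are assumptions chosen for illustration; the carrier is the GPS L1 frequency.

```python
import math

C = 299_792_458.0   # speed of light in m/s
F_L1 = 1575.42e6    # GPS L1 carrier frequency in Hz


def phase_difference(theta_rad, spacing_m, freq_hz=F_L1):
    """Phase lag between two antennas for a plane wave arriving at theta."""
    wavelength = C / freq_hz
    return 2.0 * math.pi * spacing_m * math.sin(theta_rad) / wavelength


def angle_of_arrival(delta_phi_rad, spacing_m, freq_hz=F_L1):
    """Invert the relation above; unambiguous only for spacing <= lambda/2."""
    wavelength = C / freq_hz
    s = delta_phi_rad * wavelength / (2.0 * math.pi * spacing_m)
    s = max(-1.0, min(1.0, s))  # clamp against noise pushing |s| above 1
    return math.asin(s)


if __name__ == "__main__":
    spacing = 0.5 * C / F_L1  # assumed half-wavelength element spacing
    dphi = phase_difference(math.radians(30.0), spacing)
    print(math.degrees(angle_of_arrival(dphi, spacing)))
```

In a multipath environment the measured phase difference is a mixture of direct and reflected paths, so this inversion returns a biased angle, which is the failure mode the abstract attributes to classical AoA methods.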

Machine Learning

[LG-0] Developing synthetic microdata through machine learning for firm-level business surveys

Link: https://arxiv.org/abs/2512.05948
Authors: Jorge Cisneros Paz, Timothy Wojan, Matthew Williams, Jennifer Ozawa, Robert Chew, Kimberly Janda, Timothy Navarro, Michael Floyd, Christine Task, Damon Streat
Categories: Machine Learning (cs.LG); General Economics (econ.GN); Applications (stat.AP); Methodology (stat.ME)
Comments: 17 pages, 4 figures, 6 tables

Abstract:Public-use microdata samples (PUMS) from the United States (US) Census Bureau on individuals have been available for decades. However, large increases in computing power and the greater availability of Big Data have dramatically increased the probability of re-identifying anonymized data, potentially violating the pledge of confidentiality given to survey respondents. Data science tools can be used to produce synthetic data that preserve critical moments of the empirical data but do not contain the records of any existing individual respondent or business. Developing public-use firm data from surveys presents unique challenges different from demographic data, because there is a lack of anonymity and certain industries can be easily identified in each geographic area. This paper briefly describes a machine learning model used to construct a synthetic PUMS based on the Annual Business Survey (ABS) and discusses various quality metrics. Although the ABS PUMS is currently being refined and results are confidential, we present two synthetic PUMS developed for the 2007 Survey of Business Owners, similar to the ABS business data. Econometric replication of a high impact analysis published in Small Business Economics demonstrates the verisimilitude of the synthetic data to the true data and motivates discussion of possible ABS use cases.

[LG-1] On the Bayes Inconsistency of Disagreement Discrepancy Surrogates

Link: https://arxiv.org/abs/2512.05931
Authors: Neil G. Marchant, Andrew C. Cullen, Feng Liu, Sarah M. Erfani
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 37 pages, 7 figures

Abstract:Deep neural networks often fail when deployed in real-world contexts due to distribution shift, a critical barrier to building safe and reliable systems. An emerging approach to address this problem relies on disagreement discrepancy, a measure of how the disagreement between two models changes under a shifting distribution. The process of maximizing this measure has seen applications in bounding error under shifts, testing for harmful shifts, and training more robust models. However, this optimization involves the non-differentiable zero-one loss, necessitating the use of practical surrogate losses. We prove that existing surrogates for disagreement discrepancy are not Bayes consistent, revealing a fundamental flaw: maximizing these surrogates can fail to maximize the true disagreement discrepancy. To address this, we introduce new theoretical results providing both upper and lower bounds on the optimality gap for such surrogates. Guided by this theory, we propose a novel disagreement loss that, when paired with cross-entropy, yields a provably consistent surrogate for disagreement discrepancy. Empirical evaluations across diverse benchmarks demonstrate that our method provides more accurate and robust estimates of disagreement discrepancy than existing approaches, particularly under challenging adversarial conditions.

[LG-2] KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity

Link: https://arxiv.org/abs/2512.05916
Authors: Damien Lesens, Beheshteh T. Rakhshan, Guillaume Rabusseau
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The Key-Value (KV) cache is central to the efficiency of transformer-based large language models (LLMs), storing previously computed vectors to accelerate inference. Yet, as sequence length and batch size grow, the cache becomes a major memory bottleneck. Prior compression methods typically apply low-rank decomposition to keys alone or attempt to jointly embed queries and keys, but both approaches neglect that attention fundamentally depends on their inner products. In this work, we prove that such strategies are suboptimal for approximating the attention matrix. We introduce KQ-SVD, a simple and computationally efficient method that directly performs an optimal low-rank decomposition of the attention matrix via a closed-form solution. By targeting the true source of redundancy, KQ-SVD preserves attention outputs with higher fidelity under compression. Extensive evaluations on LLaMA and Mistral models demonstrate that our approach consistently delivers superior projection quality.
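
The abstract's central claim, that compressing keys alone is suboptimal for the query-key products, can be checked numerically. The sketch below is a naive illustration with NumPy and random matrices; it is not the paper's closed-form procedure, which is not reproduced here. It forms the full score matrix explicitly, and the function names, shapes, and rank are assumptions. By the Eckart-Young theorem the truncated SVD of $QK^\top$ is the best rank-$r$ approximation in Frobenius norm, so its error can never exceed that of first truncating $K$.

```python
import numpy as np


def truncate(mat, rank):
    """Best rank-`rank` approximation of `mat` in Frobenius norm."""
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


def compare_errors(seed=0, n_tokens=64, head_dim=16, rank=4):
    """Return (error when truncating K only, error when truncating Q K^T)."""
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((n_tokens, head_dim))
    K = rng.standard_normal((n_tokens, head_dim))
    scores = Q @ K.T  # unnormalized attention scores

    # Strategy A: low-rank decomposition applied to the keys alone.
    err_keys_only = np.linalg.norm(scores - Q @ truncate(K, rank).T)
    # Strategy B: low-rank decomposition of the score matrix itself.
    err_scores = np.linalg.norm(scores - truncate(scores, rank))
    return err_keys_only, err_scores


if __name__ == "__main__":
    err_a, err_b = compare_errors()
    print("keys-only error:", err_a)
    print("score-matrix error:", err_b)
```

Both approximations have rank at most `rank`, which is what makes the comparison fair; a practical method would avoid materializing the `n_tokens` by `n_tokens` matrix.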

[LG-3] LDLT $\mathcal{L}$-Lipschitz Network: Generalized Deep End-To-End Lipschitz Network Construction

Link: https://arxiv.org/abs/2512.05915
Authors: Marius F.R. Juston, Ramavarapu S. Sreenivas, Dustin Nottage, Ahmet Soylemezoglu
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 39 pages, 3 figures, 12 tables

Abstract:Deep residual networks (ResNets) have demonstrated outstanding success in computer vision tasks, attributed to their ability to maintain gradient flow through deep architectures. Simultaneously, controlling the Lipschitz constant in neural networks has emerged as an essential area of research to enhance adversarial robustness and network certifiability. This paper presents a rigorous approach to the general design of $\mathcal{L}$-Lipschitz deep residual networks using a Linear Matrix Inequality (LMI) framework. Initially, the ResNet architecture was reformulated as a cyclic tridiagonal LMI, and closed-form constraints on network parameters were derived to ensure $\mathcal{L}$-Lipschitz continuity; however, using a new $LDL^\top$ decomposition approach for certifying LMI feasibility, we extend the construction of $\mathcal{L}$-Lipschitz networks to any other nonlinear architecture. Our contributions include a provable parameterization methodology for constructing Lipschitz-constrained residual networks and other hierarchical architectures. Cholesky decomposition is also used for efficient parameterization. These findings enable robust network designs applicable to adversarial robustness, certified training, and control systems. The $LDL^\top$ formulation is shown to be a tight relaxation of the SDP-based network, maintaining full expressiveness and achieving 3%-13% accuracy gains over SLL Layers on 121 UCI data sets.

[LG-4] NeuroMemFPP: A recurrent neural approach for memory-aware parameter estimation in fractional Poisson process

Link: https://arxiv.org/abs/2512.05893
Authors: Neha Gupta, Aditya Maheshwari
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 12 pages

Abstract:In this paper, we propose a recurrent neural network (RNN)-based framework for estimating the parameters of the fractional Poisson process (FPP), which models event arrivals with memory and long-range dependence. The Long Short-Term Memory (LSTM) network estimates the key parameters $\mu > 0$ and $\beta \in (0,1)$ from sequences of inter-arrival times, effectively modeling their temporal dependencies. Our experiments on synthetic data show that the proposed approach reduces the mean squared error (MSE) by about 55.3% compared to the traditional method of moments (MOM) and performs reliably across different training conditions. We tested the method on two real-world high-frequency datasets: emergency call records from Montgomery County, PA, and AAPL stock trading data. The results show that the LSTM can effectively track daily patterns and parameter changes, indicating its effectiveness on real-world data with complex time dependencies.

[LG-5] Bootstrapping Fuzzers for Compilers of Low-Resource Language Dialects Using Language Models

Link: https://arxiv.org/abs/2512.05887
Authors: Sairam Vaidya, Marcel Böhme, Loris D’Antoni
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments:

Abstract:Modern extensible compiler frameworks, such as MLIR, enable rapid creation of domain-specific language dialects. This flexibility, however, makes correctness harder to ensure as the same extensibility that accelerates development also complicates maintaining the testing infrastructure. Extensible languages require automated test generation that is both dialect-agnostic (works across dialects without manual adaptation) and dialect-effective (targets dialect-specific features to find bugs). Existing approaches typically sacrifice one of these goals by either requiring manually constructed seed corpora for each dialect, or by failing to be effective. We present a dialect-agnostic and dialect-effective grammar-based and coverage-guided fuzzing approach for extensible compilers that combines two key insights from existing work: (i) the grammars of dialects, which already encode the structural and type constraints, can often be extracted automatically from the dialect specification; and (ii) these grammars can be used in combination with pre-trained large language models to automatically generate representative and diverse seed inputs from the full dialect space without requiring any manual input or training data. These seeds can then be used to bootstrap coverage-guided fuzzers. We built this approach into a tool, Germinator. When evaluated on six MLIR projects spanning 91 dialects, Germinator-generated seeds improve line coverage by 10-120% over grammar-based baselines. We compare against grammar-based baselines because they are the only class of existing automatic seed generators that can be applied uniformly across MLIR’s heterogeneous dialect ecosystem. Germinator discovers 88 previously unknown bugs (40 confirmed), including 23 in dialects with no prior automated test generators, demonstrating effective and controllable testing of low-resource dialects at scale.
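
As background for the grammar-based baselines mentioned in the abstract, the following is a minimal sketch of random seed generation from a context-free grammar. It does not reproduce Germinator, which additionally uses a pre-trained language model, and the toy arithmetic grammar, the function names, and the depth limit are all hypothetical stand-ins for a grammar extracted from a real dialect specification.

```python
import random

# Hypothetical toy grammar; a real dialect grammar would be far larger.
GRAMMAR = {
    "<expr>": [["<term>", "+", "<expr>"], ["<term>"]],
    "<term>": [["(", "<expr>", ")"], ["<num>"]],
    "<num>": [["0"], ["1"], ["42"]],
}


def generate(symbol, rng, depth=0, max_depth=6):
    """Expand `symbol` into a string by randomly choosing productions."""
    if symbol not in GRAMMAR:
        return symbol  # terminal token
    alternatives = GRAMMAR[symbol]
    if depth >= max_depth:
        # The last alternative of each rule is the least recursive one,
        # so forcing it guarantees that the expansion terminates.
        production = alternatives[-1]
    else:
        production = rng.choice(alternatives)
    return "".join(generate(s, rng, depth + 1, max_depth) for s in production)


def make_seeds(count, seed=0):
    """Generate `count` seed inputs from a reproducible random stream."""
    rng = random.Random(seed)
    return [generate("<expr>", rng) for _ in range(count)]


if __name__ == "__main__":
    for text in make_seeds(5):
        print(text)
```

Seeds produced this way are syntactically valid by construction, and a coverage-guided fuzzer could then mutate them.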

[LG-6] DAE-HardNet: A Physics Constrained Neural Network Enforcing Differential-Algebraic Hard Constraints

Link: https://arxiv.org/abs/2512.05881
Authors: Rahul Golder, Bimol Nath Roy, M. M. Faruque Hasan
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Traditional physics-informed neural networks (PINNs) do not always satisfy physics-based constraints, especially when the constraints include differential operators. Rather, they minimize the constraint violations in a soft way. Strict satisfaction of differential-algebraic equations (DAEs) to embed domain knowledge and first principles in data-driven models is generally challenging. This is because data-driven models treat the original functions as black boxes whose derivatives can only be obtained after evaluating the functions. We introduce DAE-HardNet, a physics-constrained (rather than simply physics-informed) neural network that learns both the functions and their derivatives simultaneously, while enforcing algebraic as well as differential constraints. This is done by projecting model predictions onto the constraint manifold using a differentiable projection layer. We apply DAE-HardNet to several systems and test problems governed by DAEs, including the dynamic Lotka-Volterra predator-prey system and transient heat conduction. We also show the ability of DAE-HardNet to estimate unknown parameters through a parameter estimation problem. Compared to multilayer perceptrons (MLPs) and PINNs, DAE-HardNet achieves orders of magnitude reduction in the physics loss while maintaining the prediction accuracy. It has the added benefit of learning the derivatives, which improves the constrained learning of the backbone neural network prior to the projection layer. For specific problems, this suggests that the projection layer can be bypassed for faster inference. The current implementation and codes are available at this https URL.
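
The idea of projecting a prediction onto a constraint set can be illustrated in its simplest form: one linear algebraic constraint $a^\top y = b$, for which the Euclidean projection has the closed form $y - a\,(a^\top y - b)/\|a\|^2$. This is only a sketch of that special case, not DAE-HardNet's projection layer, which handles differential-algebraic constraints; the function name and the mass-fraction example are assumptions.

```python
def project_onto_hyperplane(y, a, b):
    """Return the closest point to y (Euclidean) satisfying sum(a_i*y_i) == b."""
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("constraint vector `a` must be non-zero")
    # Signed violation of the constraint at the raw prediction.
    residual = sum(ai * yi for ai, yi in zip(a, y)) - b
    # Move along `a` by exactly the amount that cancels the violation.
    return [yi - ai * residual / norm_sq for ai, yi in zip(a, y)]


if __name__ == "__main__":
    raw = [0.5, 0.3, 0.4]  # hypothetical network output: mass fractions
    fixed = project_onto_hyperplane(raw, [1.0, 1.0, 1.0], 1.0)
    print(fixed, sum(fixed))
```

Because the map is affine in `y`, it is differentiable, which is what allows a projection step of this kind to sit at the end of a network and still be trained by backpropagation.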

[LG-7] Computational Design of Low-Volatility Lubricants for Space Using Interpretable Machine Learning

Link: https://arxiv.org/abs/2512.05870
Authors: Daniel Miliate, Ashlie Martini
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The function and lifetime of moving mechanical assemblies (MMAs) in space depend on the properties of lubricants. MMAs that experience high speeds or high cycles require liquid based lubricants due to their ability to reflow to the point of contact. However, only a few liquid-based lubricants have vapor pressures low enough for the vacuum conditions of space, each of which has limitations that add constraints to MMA designs. This work introduces a data-driven machine learning (ML) approach to predicting vapor pressure, enabling virtual screening and discovery of new space-suitable liquid lubricants. The ML models are trained with data from both high-throughput molecular dynamics simulations and experimental databases. The models are designed to prioritize interpretability, enabling the relationships between chemical structure and vapor pressure to be identified. Based on these insights, several candidate molecules are proposed that may have promise for future space lubricant applications in MMAs.

[LG-8] Predicting Price Movements in High-Frequency Financial Data with Spiking Neural Networks

Link: https://arxiv.org/abs/2512.05868
Authors: Brian Ezinwoke, Oliver Rhodes
Categories: Machine Learning (cs.LG); Computational Finance (q-fin.CP)
Comments: 9 pages, 5 figures, 8 tables

Abstract:Modern high-frequency trading (HFT) environments are characterized by sudden price spikes that present both risk and opportunity, but conventional financial models often fail to capture the required fine temporal structure. Spiking Neural Networks (SNNs) offer a biologically inspired framework well-suited to these challenges due to their natural ability to process discrete events and preserve millisecond-scale timing. This work investigates the application of SNNs to high-frequency price-spike forecasting, enhancing performance via robust hyperparameter tuning with Bayesian Optimization (BO). This work converts high-frequency stock data into spike trains and evaluates three architectures: an established unsupervised STDP-trained SNN, a novel SNN with explicit inhibitory competition, and a supervised backpropagation network. BO was driven by a novel objective, Penalized Spike Accuracy (PSA), designed to ensure a network’s predicted price spike rate aligns with the empirical rate of price events. Simulated trading demonstrated that models optimized with PSA consistently outperformed their Spike Accuracy (SA)-tuned counterparts and baselines. Specifically, the extended SNN model with PSA achieved the highest cumulative return (76.8%) in simple backtesting, significantly surpassing the supervised alternative (42.54% return). These results validate the potential of spiking networks, when robustly tuned with task-specific objectives, for effective price spike forecasting in HFT.
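
The abstract does not say how prices are converted into spike trains, so the following is only one plausible encoding, offered as an assumption: emit a signed spike whenever the relative price change between consecutive ticks crosses a threshold. The function names and the threshold are hypothetical. The `spike_rate` helper computes the kind of empirical event rate that a predicted spike rate would be compared against; the PSA objective itself is not defined in the abstract and is not reproduced here.

```python
def encode_spikes(prices, threshold):
    """Map consecutive relative price changes to +1, -1, or 0 spikes."""
    spikes = []
    for prev, cur in zip(prices, prices[1:]):
        change = (cur - prev) / prev
        if change >= threshold:
            spikes.append(1)    # upward price spike
        elif change <= -threshold:
            spikes.append(-1)   # downward price spike
        else:
            spikes.append(0)    # no event
    return spikes


def spike_rate(spikes):
    """Fraction of time steps that carry a spike of either sign."""
    if not spikes:
        return 0.0
    return sum(1 for s in spikes if s != 0) / len(spikes)


if __name__ == "__main__":
    ticks = [100.0, 100.05, 101.5, 101.4, 99.0]  # made-up prices
    train = encode_spikes(ticks, threshold=0.01)
    print(train, spike_rate(train))
```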

[LG-9] Utility Boundary of Dataset Distillation: Scaling and Configuration-Coverage Laws

Link: https://arxiv.org/abs/2512.05817
Authors: Zhengquan Luo, Zhiqiang Xu
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Dataset distillation (DD) aims to construct compact synthetic datasets that allow models to achieve comparable performance to full-data training while substantially reducing storage and computation. Despite rapid empirical progress, its theoretical foundations remain limited: existing methods (gradient, distribution, trajectory matching) are built on heterogeneous surrogate objectives and optimization assumptions, which makes it difficult to analyze their common principles or provide general guarantees. Moreover, it is still unclear under what conditions distilled data can retain the effectiveness of full datasets when the training configuration, such as optimizer, architecture, or augmentation, changes. To answer these questions, we propose a unified theoretical framework, termed configuration–dynamics–error analysis, which reformulates major DD approaches under a common generalization-error perspective and provides two main results: (i) a scaling law that provides a single-configuration upper bound, characterizing how the error decreases as the distilled sample size increases and explaining the commonly observed performance saturation effect; and (ii) a coverage law showing that the required distilled sample size scales linearly with configuration diversity, with provably matching upper and lower bounds. In addition, our unified analysis reveals that various matching methods are interchangeable surrogates, reducing the same generalization error, clarifying why they can all achieve dataset distillation and providing guidance on how surrogate choices affect sample efficiency and robustness. Experiments across diverse methods and configurations empirically confirm the derived laws, advancing a theoretical foundation for DD and enabling theory-driven design of compact, configuration-robust dataset distillation.

[LG-10] Learnability Window in Gated Recurrent Neural Networks

Link: https://arxiv.org/abs/2512.05790
Authors: Lorenzo Livi
Categories: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
Comments:

Abstract:We develop a theoretical framework that explains how gating mechanisms determine the learnability window $\mathcal{H}_N$ of recurrent neural networks, defined as the largest temporal horizon over which gradient information remains statistically recoverable. While classical analyses emphasize numerical stability of Jacobian products, we show that stability alone is insufficient: learnability is governed instead by the effective learning rates $\mu_{t,\ell}$, per-lag and per-neuron quantities obtained from first-order expansions of gate-induced Jacobian products in Backpropagation Through Time. These effective learning rates act as multiplicative filters that control both the magnitude and anisotropy of gradient transport. Under heavy-tailed ($\alpha$-stable) gradient noise, we prove that the minimal sample size required to detect a dependency at lag $\ell$ satisfies $N(\ell) \propto f(\ell)^{-\alpha}$, where $f(\ell) = \|\mu_{t,\ell}\|_1$ is the effective learning rate envelope. This leads to an explicit formula for $\mathcal{H}_N$ and closed-form scaling laws for logarithmic, polynomial, and exponential decay of $f(\ell)$. The theory predicts that broader or more heterogeneous gate spectra produce slower decay of $f(\ell)$ and hence larger learnability windows, whereas heavier-tailed noise compresses $\mathcal{H}_N$ by slowing statistical concentration. By linking gate-induced time-scale structure, gradient noise, and sample complexity, the framework identifies the effective learning rates as the fundamental quantities that govern when, and for how long, gated recurrent networks can learn long-range temporal dependencies.
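
Taking the stated scaling $N(\ell) \propto f(\ell)^{-\alpha}$ at face value, the learnability window can be read as the largest lag whose required sample size still fits the budget. The sketch below makes two assumptions that are not in the abstract: the proportionality constant `c` is set to 1, and `f` is monotonically decreasing, so the search can stop at the first lag that fails. The envelope shapes and parameter values are illustrative only.

```python
import math


def required_samples(f_value, alpha, c=1.0):
    """Sample size needed at a lag with envelope value f_value (c is assumed)."""
    return c * f_value ** (-alpha)


def learnability_window(f, alpha, n_samples, max_lag=100_000, c=1.0):
    """Largest lag l with required_samples(f(l)) <= n_samples.

    Assumes f is decreasing in the lag, so the scan stops at the first miss.
    """
    window = 0
    for lag in range(1, max_lag + 1):
        if required_samples(f(lag), alpha, c) <= n_samples:
            window = lag
        else:
            break
    return window


if __name__ == "__main__":
    # Exponential envelope with time scale 10, and a polynomial envelope.
    print(learnability_window(lambda l: math.exp(-l / 10.0), 1.5, 1e6))
    print(learnability_window(lambda l: 1.0 / l, 2.0, 1e5))
```

For the exponential envelope $f(\ell) = e^{-\ell/\tau}$ the condition reduces to $\ell \le (\tau/\alpha)\ln(N/c)$, so the window grows only logarithmically with the sample budget, whereas a polynomial envelope $f(\ell) = \ell^{-p}$ gives $\ell \le (N/c)^{1/(p\alpha)}$.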

[LG-11] Towards agent-based-model informed neural networks

Link: https://arxiv.org/abs/2512.05764
Authors: Nino Antulov-Fantulin
Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI); Adaptation and Self-Organizing Systems (nlin.AO); Physics and Society (physics.soc-ph)
Comments:

Abstract:In this article, we present a framework for designing neural networks that remain consistent with the underlying principles of agent-based models. We begin by highlighting the limitations of standard neural differential equations in modeling complex systems, where physical invariants (like energy) are often absent but other constraints (like mass conservation, network locality, bounded rationality) must be enforced. To address this, we introduce Agent-Based-Model informed Neural Networks (ABM-NNs), which leverage restricted graph neural networks and hierarchical decomposition to learn interpretable, structure-preserving dynamics. We validate the framework across three case studies of increasing complexity: (i) a generalized Lotka–Volterra system, where we recover ground-truth parameters from short trajectories in the presence of interventions; (ii) a graph-based SIR contagion model, where our method outperforms state-of-the-art graph learning baselines (GCN, GraphSAGE, Graph Transformer) in out-of-sample forecasting and noise robustness; and (iii) a real-world macroeconomic model of the ten largest economies, where we learn coupled GDP dynamics from empirical data and demonstrate gradient-based counterfactual analysis for policy interventions.

[LG-12] Teaching Language Models Mechanistic Explainability Through Arrow-Pushing

Link: https://arxiv.org/abs/2512.05722
Authors: Théo A. Neukomm, Zlatko Jončev, Philippe Schwaller
Subjects: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
Comments: ELLIS 2025 ML4Molecules Workshop

Abstract:Chemical reaction mechanisms provide crucial insight into synthesizability, yet current Computer-Assisted Synthesis Planning (CASP) systems lack mechanistic grounding. We introduce a computational framework for teaching language models to predict chemical reaction mechanisms through the arrow-pushing formalism, a century-old notation that tracks electron flow while respecting conservation laws. We developed MechSMILES, a compact textual format encoding molecular structure and electron flow, and trained language models on four mechanism prediction tasks of increasing complexity using mechanistic reaction datasets, such as mech-USPTO-31k and FlowER. Our models achieve more than 95% top-3 accuracy on elementary step prediction and, on our hardest task, the retrieval of complete reaction mechanisms, scores that surpass 73% on mech-USPTO-31k and 93% on the FlowER dataset. This mechanistic understanding enables three key applications. First, our models serve as post-hoc validators for CASP systems, filtering chemically implausible transformations. Second, they enable holistic atom-to-atom mapping that tracks all atoms, including hydrogens. Third, they extract catalyst-aware reaction templates that distinguish recycled catalysts from spectator species. By grounding predictions in physically meaningful electron moves that ensure conservation of mass and charge, this work provides a pathway toward more explainable and chemically valid computational synthesis planning, while providing an architecture-agnostic framework for the benchmarking of mechanism prediction.

[LG-13] BERTO: an Adaptive BERT-based Network Time Series Predictor with Operator Preferences in Natural Language

Link: https://arxiv.org/abs/2512.05721
Authors: Nitin Priyadarshini Shankar, Vaibhav Singh, Sheetal Kalyani, Christian Maciocco
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:We introduce BERTO, a BERT-based framework for traffic prediction and energy optimization in cellular networks. Built on transformer architectures, BERTO delivers high prediction accuracy, while its Balancing Loss Function and prompt-based customization allow operators to adjust the trade-off between power savings and performance. Natural language prompts guide the model to manage underprediction and overprediction in accordance with the operator’s intent. Experiments on real-world datasets show that BERTO improves upon existing models with a 4.13% reduction in MSE while introducing the feature of balancing competing objectives of power saving and performance through simple natural language inputs, operating over a flexible range of 1.4 kW in power and up to $9\times$ variation in service quality, making it well suited for intelligent RAN deployments.
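The abstract does not spell out the Balancing Loss Function, so the following is only a generic illustration of the idea it describes: a squared error whose penalty differs for underprediction and overprediction. The function name and the `under_weight` and `over_weight` parameters are hypothetical and are not taken from the paper.

```python
def asymmetric_squared_error(y_true, y_pred, under_weight=1.0, over_weight=1.0):
    """Mean squared error with separate weights for under- and overprediction.

    Illustrative stand-in only; NOT the BERTO Balancing Loss Function.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for target, pred in zip(y_true, y_pred):
        err = pred - target
        # err < 0 means the model predicted less traffic than was observed.
        weight = under_weight if err < 0 else over_weight
        total += weight * err * err
    return total / len(y_true)
```

Raising `under_weight` would push a model toward protecting service quality, while raising `over_weight` would favor power savings; how BERTO maps operator prompts to such a trade-off is described in the paper itself.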

[LG-14] Meta-Learning Multi-armed Bandits for Beam Tracking in 5G and 6G Networks

Link: https://arxiv.org/abs/2512.05680
Authors: Alexander Mattick, George Yammine, Georgios Kontes, Setareh Maghsudi, Christopher Mutschler
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Beamforming-capable antenna arrays with many elements enable higher data rates in next generation 5G and 6G networks. In current practice, analog beamforming uses a codebook of pre-configured beams with each of them radiating towards a specific direction, and a beam management function continuously selects optimal beams for moving user equipments (UEs). However, large codebooks and effects caused by reflections or blockages of beams make an optimal beam selection challenging. In contrast to previous work and standardization efforts that opt for supervised learning to train classifiers to predict the next best beam based on previously selected beams, we formulate the problem as a partially observable Markov decision process (POMDP) and model the environment as the codebook itself. At each time step, we select a candidate beam conditioned on the belief state of the unobservable optimal beam and previously probed beams. This frames the beam selection problem as an online search procedure that locates the moving optimal beam. In contrast to previous work, our method handles new or unforeseen trajectories and changes in the physical environment, and outperforms previous work by orders of magnitude.

[LG-15] Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs

Link: https://arxiv.org/abs/2512.05648
Authors: Igor Shilov, Alex Cloud, Aryo Pradipta Gema, Jacob Goldman-Wetzler, Nina Panickssery, Henry Sleight, Erik Jones, Cem Anil
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models increasingly possess capabilities that carry dual-use risks. While data filtering has emerged as a pretraining-time mitigation, it faces significant challenges: labeling whether data is harmful is expensive at scale, and given improving sample efficiency with larger models, even small amounts of mislabeled content could give rise to dangerous capabilities. To address risks associated with mislabeled harmful content, prior work proposed Gradient Routing (Cloud et al., 2024) – a technique that localizes target knowledge into a dedicated subset of model parameters so they can later be removed. We explore an improved variant of Gradient Routing, which we call Selective GradienT Masking (SGTM), with particular focus on evaluating its robustness to label noise. SGTM zero-masks selected gradients such that target domain examples only update their dedicated parameters. We test SGTM’s effectiveness in two applications: removing knowledge of one language from a model trained on a bilingual synthetic dataset, and removing biology knowledge from a model trained on English Wikipedia. In both cases SGTM provides better retain/forget trade-off in the presence of labeling errors compared to both data filtering and a previously proposed instantiation of Gradient Routing. Unlike shallow unlearning approaches that can be quickly undone through fine-tuning, SGTM exhibits strong robustness to adversarial fine-tuning, requiring seven times more fine-tuning steps to reach baseline performance on the forget set compared to a finetuning-based unlearning method (RMU). Our results suggest SGTM provides a promising pretraining-time complement to existing safety mitigations, particularly in settings where label noise is unavoidable.
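The masking step described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: parameters are plain lists of floats, the names `mask_gradients` and `ablate` are hypothetical, leaving non-target examples unmasked is an assumption, and "removal" is modelled here simply as zeroing the dedicated parameters.

```python
def mask_gradients(grads, dedicated, is_target_example):
    """Zero-mask gradients so target-domain examples only update dedicated parameters.

    grads: dict mapping parameter name -> list of floats.
    dedicated: set of parameter names reserved for the target (forget) domain.
    Assumption: gradients of non-target examples are passed through unchanged.
    """
    if not is_target_example:
        return {name: list(g) for name, g in grads.items()}
    return {
        name: list(g) if name in dedicated else [0.0] * len(g)
        for name, g in grads.items()
    }


def ablate(params, dedicated):
    """Remove the localized knowledge by zeroing the dedicated parameters (assumption)."""
    return {
        name: [0.0] * len(p) if name in dedicated else list(p)
        for name, p in params.items()
    }
```

With `grads = {"shared": [1.0, 2.0], "forget": [3.0]}` and `dedicated = {"forget"}`, a target-domain example contributes no update to `shared`, which is the localization property the abstract relies on.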

[LG-16] Bounded Graph Clustering with Graph Neural Networks

Link: https://arxiv.org/abs/2512.05623
Authors: Kibidi Neocosmos, Diego Baptista, Nicole Ludwig
Subjects: Machine Learning (cs.LG)
Comments: 17 pages, 8 figures

Abstract:In community detection, many methods require the user to specify the number of clusters in advance since an exhaustive search over all possible values is computationally infeasible. While some classical algorithms can infer this number directly from the data, this is typically not the case for graph neural networks (GNNs): even when a desired number of clusters is specified, standard GNN-based methods often fail to return the exact number due to the way they are designed. In this work, we address this limitation by introducing a flexible and principled way to control the number of communities discovered by GNNs. Rather than assuming the true number of clusters is known, we propose a framework that allows the user to specify a plausible range and enforce these bounds during training. However, if the user wants an exact number of clusters, it may also be specified and reliably returned.

[LG-17] Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales NEURIPS2025

Link: https://arxiv.org/abs/2512.05620
Authors: Shikai Qiu, Zixi Chen, Hoang Phan, Qi Lei, Andrew Gordon Wilson
Subjects: Machine Learning (cs.LG)
Comments: NeurIPS 2025. Code available at: this https URL

Abstract:Several recently introduced deep learning optimizers utilizing matrix-level preconditioning have shown promising speedups relative to the current dominant optimizer AdamW, particularly in relatively small-scale experiments. However, efforts to validate and replicate their successes have reported mixed results. To better understand the effectiveness of these optimizers at scale, in this work we investigate how to scale preconditioned optimizers via hyperparameter transfer, building on prior works such as $\mu$P. We study how the optimal learning rate and weight decay should scale with model width and depth for a wide range of optimizers, including Shampoo, SOAP, and Muon, accounting for the impact of commonly used techniques such as blocking and grafting. We find that scaling the learning rate according to $\mu$P improves transfer, but can still suffer from significant finite-width deviations that cause drifting optimal learning rates, which we show can be mitigated by blocking and explicit spectral normalization. For compute-optimal scaling, we find scaling independent weight decay as $1/\mathrm{width}$ is nearly optimal across optimizers. Applying these scaling rules, we show Muon and Shampoo consistently achieve $1.4\times$ and $1.3\times$ speedup over AdamW for training Llama-architecture language models of sizes ranging from 190M to 1.4B, whereas the speedup vanishes rapidly with scale under incorrect scaling. Based on these results and further ablations, we argue that studying optimal hyperparameter transfer is essential for reliably comparing optimizers at scale given a realistic tuning budget.
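As a rough illustration of hyperparameter transfer across widths, the sketch below rescales values tuned on a small proxy model. The weight-decay rule follows the 1/width scaling stated in the abstract. The learning-rate rule shown (1/width) is the familiar μP rule for hidden weight matrices under Adam-style optimizers and is used here as an assumption; the paper derives optimizer-specific rules that may differ. The function name and arguments are hypothetical.

```python
def scale_hparams(base_lr, base_wd, base_width, width):
    """Transfer hyperparameters tuned at base_width to a model of a new width.

    Assumptions: learning rate scales as 1/width (Adam-style muP rule for hidden
    matrices); independent weight decay scales as 1/width (per the abstract).
    """
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    ratio = base_width / width
    return {"lr": base_lr * ratio, "weight_decay": base_wd * ratio}
```

For example, values tuned at width 256 would both shrink by a factor of four when transferred to width 1024 under these assumed rules.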

[LG-18] Wasserstein distance based semi-supervised manifold learning and application to GNSS multi-path detection

Link: https://arxiv.org/abs/2512.05567
Authors: Antoine Blais, Nicolas Couëllan
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP); Machine Learning (stat.ML)
Comments:

Abstract:The main objective of this study is to propose an optimal-transport-based semi-supervised approach to learn from scarce labelled image data using deep convolutional networks. The principle lies in implicit graph-based transductive semi-supervised learning where the similarity metric between image samples is the Wasserstein distance. This metric is used in the label propagation mechanism during learning. We apply the method to a real-life GNSS application and demonstrate its effectiveness. More specifically, we address the problem of multi-path interference detection. Experiments are conducted under various signal conditions. The results show that for specific choices of hyperparameters controlling the amount of semi-supervision and the level of sensitivity to the metric, the classification accuracy can be significantly improved over the fully supervised training method.

[LG-19] SCoNE: Spherical Consistent Neighborhoods Ensemble for Effective and Efficient Multi-View Anomaly Detection

Link: https://arxiv.org/abs/2512.05540
Authors: Yang Xu, Hang Zhang, Yixiao Ma, Ye Zhu, Kai Ming Ting
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:The core problem in multi-view anomaly detection is to represent local neighborhoods of normal instances consistently across all views. Recent approaches consider a representation of local neighborhood in each view independently, and then capture the consistent neighbors across all views via a learning process. They suffer from two key issues. First, there is no guarantee that they can capture consistent neighbors well, especially when the same neighbors are in regions of varied densities in different views, resulting in inferior detection accuracy. Second, the learning process has a high computational cost of $\mathcal{O}(N^2)$, rendering them inapplicable for large datasets. To address these issues, we propose a novel method termed Spherical Consistent Neighborhoods Ensemble (SCoNE). It has two unique features: (a) the consistent neighborhoods are represented with multi-view instances directly, requiring no intermediate representations as used in existing approaches; and (b) the neighborhoods have data-dependent properties, which lead to large neighborhoods in sparse regions and small neighborhoods in dense regions. The data-dependent properties enable local neighborhoods in different views to be represented well as consistent neighborhoods, without learning. This leads to $\mathcal{O}(N)$ time complexity. Empirical evaluations show that SCoNE has superior detection accuracy and runs orders-of-magnitude faster in large datasets than existing approaches.

[LG-20] IDK-S: Incremental Distributional Kernel for Streaming Anomaly Detection

Link: https://arxiv.org/abs/2512.05531
Authors: Yang Xu, Yixiao Ma, Kaifeng Zhang, Zuliang Yang, Kai Ming Ting
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Anomaly detection on data streams presents significant challenges, requiring methods to maintain high detection accuracy among evolving distributions while ensuring real-time efficiency. Here we introduce $\mathcal{IDK}$-$\mathcal{S}$, a novel Incremental Distributional Kernel for Streaming anomaly detection that effectively addresses these challenges by creating a new dynamic representation in the kernel mean embedding framework. The superiority of $\mathcal{IDK}$-$\mathcal{S}$ is attributed to two key innovations. First, it inherits the strengths of the Isolation Distributional Kernel, an offline detector that has demonstrated significant performance advantages over foundational methods like Isolation Forest and Local Outlier Factor due to the use of a data-dependent kernel. Second, it adopts a lightweight incremental update mechanism that significantly reduces computational overhead compared to the naive baseline strategy of performing a full model retraining. This is achieved without compromising detection accuracy, a claim supported by its statistical equivalence to the full retrained model. Our extensive experiments on thirteen benchmarks demonstrate that $\mathcal{IDK}$-$\mathcal{S}$ achieves superior detection accuracy while operating substantially faster, in many cases by an order of magnitude, than existing state-of-the-art methods.

[LG-21] Credal and Interval Deep Evidential Classifications

Link: https://arxiv.org/abs/2512.05526
Authors: Michele Caprio, Shireen K. Manchingal, Fabio Cuzzolin
Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
Comments:

Abstract:Uncertainty Quantification (UQ) presents a pivotal challenge in the field of Artificial Intelligence (AI), profoundly impacting decision-making, risk assessment and model reliability. In this paper, we introduce Credal and Interval Deep Evidential Classifications (CDEC and IDEC, respectively) as novel approaches to address UQ in classification tasks. CDEC and IDEC leverage a credal set (closed and convex set of probabilities) and an interval of evidential predictive distributions, respectively, allowing us to avoid overfitting to the training data and to systematically assess both epistemic (reducible) and aleatoric (irreducible) uncertainties. When those surpass acceptable thresholds, CDEC and IDEC have the capability to abstain from classification and flag an excess of epistemic or aleatoric uncertainty, as relevant. Conversely, within acceptable uncertainty bounds, CDEC and IDEC provide a collection of labels with robust probabilistic guarantees. CDEC and IDEC are trained using standard backpropagation and a loss function that draws from the theory of evidence. They overcome the shortcomings of previous efforts, and extend the current evidential deep learning literature. Through extensive experiments on MNIST, CIFAR-10 and CIFAR-100, together with their natural OoD shifts (F-MNIST/K-MNIST, SVHN/Intel, TinyImageNet), we show that CDEC and IDEC achieve competitive predictive accuracy, state-of-the-art OoD detection under epistemic and total uncertainty, and tight, well-calibrated prediction regions that expand reliably under distribution shift. An ablation over ensemble size further demonstrates that CDEC attains stable uncertainty estimates with only a small ensemble.

[LG-22] Poodle: Seamlessly Scaling Down Large Language Models with Just-in-Time Model Replacement

Link: https://arxiv.org/abs/2512.05525
Authors: Nils Strassenburg, Boris Glavic, Tilmann Rabl
Subjects: Databases (cs.DB); Machine Learning (cs.LG)
Comments:

Abstract:Businesses increasingly rely on large language models (LLMs) to automate simple repetitive tasks instead of developing custom machine learning models. LLMs require few, if any, training examples and can be utilized by users without expertise in model development. However, this comes at the cost of substantially higher resource and energy consumption compared to smaller models, which often achieve similar predictive performance for simple tasks. In this paper, we present our vision for just-in-time model replacement (JITR), where, upon identifying a recurring task in calls to an LLM, the model is replaced transparently with a cheaper alternative that performs well for this specific task. JITR retains the ease of use and low development effort of LLMs, while saving significant cost and energy. We discuss the main challenges in realizing our vision regarding the identification of recurring tasks and the creation of a custom model. Specifically, we argue that model search and transfer learning will play a crucial role in JITR to efficiently identify and fine-tune models for a recurring task. Using our JITR prototype Poodle, we achieve significant savings for exemplary tasks.

[LG-23] GRASP: Graph Reasoning Agents for Systems Pharmacology with Human-in-the-Loop

Link: https://arxiv.org/abs/2512.05502
Authors: Omid Bazgir, Vineeth Manthapuri, Ilia Rattsev, Mohammad Jafarnejad
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Quantitative Systems Pharmacology (QSP) modeling is essential for drug development but it requires significant time investment that limits the throughput of domain experts. We present GRASP – a multi-agent, graph-reasoning framework with a human-in-the-loop conversational interface – that encodes QSP models as typed biological knowledge graphs and compiles them to executable MATLAB/SimBiology code while preserving units, mass balance, and physiological constraints. A two-phase workflow – Understanding (graph reconstruction of legacy code) and Action (constraint-checked, language-driven modification) – is orchestrated by a state machine with iterative validation. GRASP performs breadth-first parameter-alignment around new entities to surface dependent quantities and propose biologically plausible defaults, and it runs automatic execution/diagnostics until convergence. In head-to-head evaluations using LLM-as-judge, GRASP outperforms SME-guided CoT and ToT baselines across biological plausibility, mathematical correctness, structural fidelity, and code quality ($\approx$9–10/10 vs. 5–7/10). BFS alignment achieves F1 = 0.95 for dependency discovery, units, and range. These results demonstrate that graph-structured, agentic workflows can make QSP model development both accessible and rigorous, enabling domain experts to specify mechanisms in natural language without sacrificing biomedical fidelity.

[LG-24] Turbulence Regression

Link: https://arxiv.org/abs/2512.05483
Authors: Yingang Fan, Binjie Ding, Baiyi Chen
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Air turbulence refers to the disordered and irregular motion state generated by drastic changes in velocity, pressure, or direction during airflow. Various complex factors lead to intricate low-altitude turbulence outcomes. Under current observational conditions, especially when using only wind profile radar data, traditional methods struggle to accurately predict turbulence states. Therefore, this paper introduces a NeuTucker decomposition model utilizing discretized data. Designed for continuous yet sparse three-dimensional wind field data, it constructs a low-rank Tucker decomposition model based on a Tucker neural network to capture the latent interactions within the three-dimensional wind field data. Therefore, two core ideas are proposed here: 1) Discretizing continuous input data to adapt to models like NeuTucF that require discrete data inputs. 2) Constructing a four-dimensional Tucker interaction tensor to represent all possible spatio-temporal interactions among different elevations and three-dimensional wind speeds. In estimating missing observations in real datasets, this discretized NeuTucF model demonstrates superior performance compared to various common regression models.

[LG-25] Model Gateway: Model Management Platform for Model-Driven Drug Discovery

Link: https://arxiv.org/abs/2512.05462
Authors: Yan-Shiun Wu, Nathan A. Morin
Subjects: Software Engineering (cs.SE); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 7 pages, 7 figures

Abstract:This paper presents the Model Gateway, a platform for managing machine learning (ML) and scientific computational models in the drug discovery pipeline. The platform supports Large Language Model (LLM) Agents and Generative AI-based tools in performing ML model management tasks in our Machine Learning operations (MLOps) pipelines, such as registering and managing the dynamic consensus model (a model that aggregates several scientific computational models), retrieving model information, submitting and executing models asynchronously, and receiving results once the models complete execution. The platform includes a Model Owner Control Panel, Platform Admin Tools, and a Model Gateway API service for interacting with the platform and tracking model execution. The platform achieves a 0% failure rate in scaling tests with more than 10k simultaneous application clients consuming models. The Model Gateway is a fundamental part of our model-driven drug discovery pipeline. It has the potential to significantly accelerate the development of new drugs with the maturity of our MLOps infrastructure and the integration of LLM Agents and Generative AI tools.

[LG-26] TS-HINT: Enhancing Semiconductor Time Series Regression Using Attention Hints From Large Language Model Reasoning

Link: https://arxiv.org/abs/2512.05419
Authors: Jonathan Adam Rico, Nagarajan Raghavan, Senthilnath Jayavelu
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Existing data-driven methods rely on the extraction of static features from time series to approximate the material removal rate (MRR) of semiconductor manufacturing processes such as chemical mechanical polishing (CMP). However, this leads to a loss of temporal dynamics. Moreover, these methods require a large amount of data for effective training. In this paper, we propose TS-Hint, a Time Series Foundation Model (TSFM) framework integrated with chain-of-thought reasoning, which provides attention hints during training based on attention mechanism data and saliency data. Experimental results demonstrate the effectiveness of our model in limited-data settings via few-shot learning and show that it can learn directly from multivariate time series features.

[LG-27] Sepsis Prediction Using Graph Convolutional Networks over Patient-Feature-Value Triplets

Link: https://arxiv.org/abs/2512.05416
Authors: Bozhi Dan, Di Wu, Ji Xu, Xiang Liu, Yiziting Zhu, Xin Shu, Yujie Li, Bin Yi
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:In the intensive care setting, sepsis continues to be a major contributor to patient illness and death; however, its timely detection is hindered by the complex, sparse, and heterogeneous nature of electronic health record (EHR) data. We propose Triplet-GCN, a single-branch graph convolutional model that represents each encounter as patient–feature–value triplets, constructs a bipartite EHR graph, and learns patient embeddings via a Graph Convolutional Network (GCN) followed by a lightweight multilayer perceptron (MLP). The pipeline applies type-specific preprocessing – median imputation and standardization for numeric variables, effect coding for binary features, and mode imputation with low-dimensional embeddings for rare categorical attributes – and initializes patient nodes with summary statistics, while retaining measurement values on edges to preserve “who measured what and by how much”. In a retrospective, multi-center Chinese cohort (N = 648; 70/30 train–test split) drawn from three tertiary hospitals, Triplet-GCN consistently outperforms strong tabular baselines (KNN, SVM, XGBoost, Random Forest) across discrimination and balanced error metrics, yielding a more favorable sensitivity–specificity trade-off and improved overall utility for early warning. These findings indicate that encoding EHR as triplets and propagating information over a patient–feature graph produce more informative patient representations than feature-independent models, offering a simple, end-to-end blueprint for deployable sepsis risk stratification.
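To make the patient–feature–value representation concrete, here is a minimal sketch of turning tabular encounter records into triplets that double as weighted edges of a bipartite patient–feature graph. The function names, the patient IDs, and the feature names in the usage note are hypothetical, and skipping missing values is an assumption (the paper describes imputation instead).

```python
def to_triplets(patient_id, record):
    """Turn one encounter record into (patient, feature, value) triplets.

    Assumption: missing measurements (None) are skipped rather than imputed.
    """
    return [
        (patient_id, feature, value)
        for feature, value in record.items()
        if value is not None
    ]


def build_bipartite_edges(records):
    """Collect triplets for all patients; each triplet is one patient-feature edge
    whose weight is the measured value."""
    edges = []
    for patient_id, record in records.items():
        edges.extend(to_triplets(patient_id, record))
    return edges
```

For instance, `build_bipartite_edges({"p1": {"lactate": 2.1, "heart_rate": None}})` yields a single edge `("p1", "lactate", 2.1)`, keeping the measured value on the edge rather than inside a fixed-width feature vector.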

[LG-28] RevoNAD: Reflective Evolutionary Exploration for Neural Architecture Design

Link: https://arxiv.org/abs/2512.05403
Authors: Gyusam Chang, Jeongyoon Yoon, Shin han yi, JaeHyeok Lee, Sujin Jang, Sangpil Kim
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Recent progress in leveraging large language models (LLMs) has enabled Neural Architecture Design (NAD) systems to generate new architectures that are not limited to a manually predefined search space. Nevertheless, LLM-driven generation remains challenging: the token-level design loop is discrete and non-differentiable, preventing feedback from smoothly guiding architectural improvement. These methods, in turn, commonly suffer from mode collapse into redundant structures or drift toward infeasible designs when constructive reasoning is not well grounded. We introduce RevoNAD, a reflective evolutionary orchestrator that effectively bridges LLM-based reasoning with feedback-aligned architectural search. First, RevoNAD presents a Multi-round Multi-expert Consensus to transfer isolated design rules into meaningful architectural clues. Then, Adaptive Reflective Exploration adjusts the degree of exploration leveraging reward variance; it explores when feedback is uncertain and refines when stability is reached. Finally, Pareto-guided Evolutionary Selection effectively promotes architectures that jointly optimize accuracy, efficiency, latency, confidence, and structural diversity. Across CIFAR10, CIFAR100, ImageNet16-120, COCO-5K, and Cityscape, RevoNAD achieves state-of-the-art performance. Ablation and transfer studies further validate the effectiveness of RevoNAD in enabling practically reliable and deployable neural architecture design.

[LG-29] Enhancing Dimensionality Prediction in Hybrid Metal Halides via Feature Engineering and Class-Imbalance Mitigation

Link: https://arxiv.org/abs/2512.05367
Authors: Mariia Karabin, Isaac Armstrong, Leo Beck, Paulina Apanel, Markus Eisenbach, David B. Mitzi, Hanna Terletska, Hendrik Heinz
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:We present a machine learning framework for predicting the structural dimensionality of hybrid metal halides (HMHs), including organic-inorganic perovskites, using a combination of chemically-informed feature engineering and advanced class-imbalance handling techniques. The dataset, consisting of 494 HMH structures, is highly imbalanced across dimensionality classes (0D, 1D, 2D, 3D), posing significant challenges to predictive modeling. This dataset was later augmented to 1336 via the Synthetic Minority Oversampling Technique (SMOTE) to mitigate the effects of the class imbalance. We developed interaction-based descriptors and integrated them into a multi-stage workflow that combines feature selection, model stacking, and performance optimization to improve dimensionality prediction accuracy. Our approach significantly improves F1-scores for underrepresented classes, achieving robust cross-validation performance across all dimensionalities.
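SMOTE, mentioned above, creates synthetic minority-class samples by interpolating between a real sample and one of its nearest minority-class neighbors. The following pure-Python sketch shows that core step only; it is a simplified illustration, not the pipeline used in the paper, and the function names, `k`, and `seed` are illustrative choices.

```python
import random


def nearest_neighbors(x, pool, k):
    """Return the k points in pool closest to x (squared Euclidean), excluding x itself."""
    others = [p for p in pool if p is not x]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
    return others[:k]


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points on segments between minority samples
    and their nearest minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbor = rng.choice(nearest_neighbors(x, minority, k))
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)])
    return synthetic
```

Because every synthetic point lies on a segment between two real minority samples, oversampling should be applied only to training folds; otherwise near-duplicates can leak into validation data.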

[LG-30] When Forgetting Builds Reliability: LLM Unlearning for Reliable Hardware Code Generation

Link: https://arxiv.org/abs/2512.05341
Authors: Yiwen Liang, Qiufeng Li, Shikai Wang, Weidong Cao
Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
Comments:

Abstract:Large Language Models (LLMs) have shown strong potential in accelerating digital hardware design through automated code generation. Yet, ensuring their reliability remains a critical challenge, as existing LLMs trained on massive heterogeneous datasets often exhibit problematic memorization of proprietary intellectual property (IP), contaminated benchmarks, and unsafe coding patterns. To mitigate these risks, we propose a novel unlearning framework tailored for LLM-based hardware code generation. Our method combines (i) a syntax-preserving unlearning strategy that safeguards the structural integrity of hardware code during forgetting, and (ii) a fine-grained floor-aware selective loss that enables precise and efficient removal of problematic knowledge. This integration achieves effective unlearning without degrading LLM code generation capabilities. Extensive experiments show that our framework supports forget sets up to 3x larger, typically requiring only a single training epoch, while preserving both syntactic correctness and functional integrity of register-transfer level (RTL) codes. Our work paves an avenue towards reliable LLM-assisted hardware design.

[LG-31] Taxonomy-Adaptive Moderation Model with Robust Guardrails for Large Language Models AAAI-26

Link: https://arxiv.org/abs/2512.05339
Authors: Mahesh Kumar Nandwana, Youngwan Lim, Joseph Liu, Alex Yang, Varun Notibala, Nishchaie Khanna
Subjects: Machine Learning (cs.LG)
Comments: To be presented at AAAI-26 PerFM Workshop

Abstract:Large Language Models (LLMs) are typically aligned for safety during the post-training phase; however, they may still generate inappropriate outputs that could potentially pose risks to users. This challenge underscores the need for robust safeguards that operate across both model inputs and outputs. In this work, we introduce Roblox Guard 1.0, a state-of-the-art instruction fine-tuned LLM designed to enhance the safety of LLM systems through comprehensive input-output moderation, using a pipeline of LLMs to enhance moderation capability. Built on the Llama-3.1-8B-Instruct backbone, our model is instruction fine-tuned to generalize across previously unseen safety taxonomies and demonstrates strong performance on out-of-domain safety benchmarks. The instruction fine-tuning process uses a mix of synthetic and open-source safety datasets, augmented with chain-of-thought (CoT) rationales and input inversion to enhance contextual understanding and decision making. To support systematic evaluation, we also release RobloxGuard-Eval, a new benchmark featuring an extensible safety taxonomy to assess the effectiveness of LLM guardrails and moderation frameworks.

[LG-32] PathFinder: MCTS and LLM Feedback-based Path Selection for Multi-Hop Question Answering

Link: https://arxiv.org/abs/2512.05336
Authors: Durga Prasad Maram,Kalpa Gunaratna,Vijay Srinivasan,Haris Jeelani,Srinivas Chappidi
Subjects: Machine Learning (cs.LG)
Comments: 5 pages, 3 images

Click to view abstract

Abstract:Multi-hop question answering is a challenging task in which language models must reason over multiple steps to reach the correct answer. With the help of Large Language Models and their reasoning capabilities, existing systems are able to think and decompose an input question over multiple steps to analyze, retrieve, and reason. However, training-based approaches for this problem still suffer from LLM hallucinations and incorrect reasoning paths that hinder performance. Hence, we propose PATHFINDER, an approach that: (i) uses Monte Carlo Tree Search to generate training path traces, (ii) improves training data quality by filtering erroneous and lengthy traces using sub-answer recall and LLM-as-a-judge verification, and (iii) reformulates sub-queries to handle failed retrieval cases. By following these steps, we demonstrate that PATHFINDER improves the performance of multi-hop QA over public benchmark datasets.

[LG-33] Non-Convex Federated Optimization under Cost-Aware Client Selection

Link: https://arxiv.org/abs/2512.05327
Authors: Xiaowen Jiang,Anton Rodomanov,Sebastian U. Stich
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Click to view abstract

Abstract:Different federated optimization algorithms typically employ distinct client-selection strategies: some methods communicate only with a randomly sampled subset of clients at each round, while others need to periodically communicate with all clients or use a hybrid scheme that combines both strategies. However, existing metrics for comparing optimization methods typically do not distinguish between these strategies, which often incur different communication costs in practice. To address this disparity, we introduce a simple and natural model of federated optimization that quantifies communication and local computation complexities. This new model allows for several commonly used client-selection strategies and explicitly associates each with a distinct cost. Within this setting, we propose a new algorithm that achieves the best-known communication and local complexities among existing federated optimization methods for non-convex optimization. This algorithm is based on the inexact composite gradient method with a carefully constructed gradient estimator and a special procedure for solving the auxiliary subproblem at each iteration. The gradient estimator is based on SAGA, a popular variance-reduced gradient estimator. We first derive a new variance bound for it, showing that SAGA can exploit functional similarity. We then introduce the Recursive-Gradient technique as a general way to potentially improve the error bound of a given conditionally unbiased gradient estimator, including both SAGA and SVRG. By applying this technique to SAGA, we obtain a new estimator, RG-SAGA, which has an improved error bound compared to the original one.
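The abstract builds on SAGA as its base gradient estimator. As a rough illustration only, here is a minimal sketch of plain SAGA on a toy scalar finite sum; it is not the paper's RG-SAGA or its composite gradient method, and the data `A`, `B` and all function names are hypothetical. Each component can be read as one client: a single component is queried per iteration, and a table of previously stored gradients corrects the estimate.

```python
import random


def make_grad(a, b):
    """Return the gradient of f(x) = 0.5 * (a * x - b) ** 2 for scalar x."""
    return lambda x: a * (a * x - b)


def saga_estimate(grads, table, x, i):
    """SAGA estimate of the average gradient at x, built from component i.

    table[j] holds the most recently stored gradient of component j.
    """
    return grads[i](x) - table[i] + sum(table) / len(table)


def run_saga(grads, x0, step, iters, seed=0):
    """Plain SAGA on a finite sum; one component is queried per iteration."""
    rng = random.Random(seed)
    x = x0
    table = [g(x0) for g in grads]  # initial pass over all components
    for _ in range(iters):
        i = rng.randrange(len(grads))
        g = saga_estimate(grads, table, x, i)
        table[i] = grads[i](x)  # refresh the stored gradient for component i
        x -= step * g
    return x


# Toy problem (hypothetical data): minimise the average of 0.5 * (a_i * x - b_i)^2.
A = [1.0, 2.0, 3.0]
B = [1.0, 1.0, 2.0]
GRADS = [make_grad(a, b) for a, b in zip(A, B)]
X_STAR = sum(a * b for a, b in zip(A, B)) / sum(a * a for a in A)  # closed form
```

Averaged over a uniformly random index `i`, `saga_estimate` equals the full average gradient, which is the conditional unbiasedness property the abstract refers to.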

[LG-34] Enhancing Deep Deterministic Policy Gradients on Continuous Control Tasks with Decoupled Prioritized Experience Replay

Link: https://arxiv.org/abs/2512.05320
Authors: Mehmet Efe Lorasdagi,Dogan Can Cicek,Furkan Burak Mutlu,Suleyman Serdar Kozat
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Background: Deep Deterministic Policy Gradient-based reinforcement learning algorithms utilize Actor-Critic architectures, where both networks are typically trained using identical batches of replayed transitions. However, the learning objectives and update dynamics of the Actor and Critic differ, raising concerns about whether uniform transition usage is optimal. Objectives: We aim to improve the performance of deep deterministic policy gradient algorithms by decoupling the transition batches used to train the Actor and the Critic. Our goal is to design an experience replay mechanism that provides appropriate learning signals to each component by using separate, tailored batches. Methods: We introduce Decoupled Prioritized Experience Replay (DPER), a novel approach that allows independent sampling of transition batches for the Actor and the Critic. DPER can be integrated into any off-policy deep reinforcement learning algorithm that operates in continuous control domains. We combine DPER with the state-of-the-art Twin Delayed DDPG algorithm and evaluate its performance across standard continuous control benchmarks. Results: DPER outperforms conventional experience replay strategies such as vanilla experience replay and prioritized experience replay in multiple MuJoCo tasks from the OpenAI Gym suite. Conclusions: Our findings show that decoupling experience replay for Actor and Critic networks can enhance training dynamics and final policy quality. DPER offers a generalizable mechanism that enhances performance for a wide class of actor-critic off-policy reinforcement learning algorithms. 
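The core idea in the abstract, drawing the Actor batch and the Critic batch independently, can be sketched as a buffer with two priority lists. This is an assumption-laden illustration: the abstract does not say how DPER scores transitions for either network, so the class name `DecoupledReplayBuffer` and both priority values are hypothetical placeholders supplied by the caller.

```python
import random


class DecoupledReplayBuffer:
    """Replay buffer with separate sampling priorities for Actor and Critic.

    How each priority is computed is left to the caller; the scoring rules
    used by DPER itself are not described in the abstract.
    """

    def __init__(self, seed=0):
        self.transitions = []
        self.critic_priority = []
        self.actor_priority = []
        self.rng = random.Random(seed)

    def add(self, transition, critic_priority=1.0, actor_priority=1.0):
        self.transitions.append(transition)
        self.critic_priority.append(critic_priority)
        self.actor_priority.append(actor_priority)

    def _sample_indices(self, weights, batch_size):
        # Sampling with replacement, proportional to the given priorities.
        return self.rng.choices(
            range(len(self.transitions)), weights=weights, k=batch_size
        )

    def sample_critic(self, batch_size):
        idx = self._sample_indices(self.critic_priority, batch_size)
        return idx, [self.transitions[i] for i in idx]

    def sample_actor(self, batch_size):
        idx = self._sample_indices(self.actor_priority, batch_size)
        return idx, [self.transitions[i] for i in idx]
```

In a TD3-style loop, the Critic update would call `sample_critic` and the delayed Actor update would call `sample_actor`, so the two networks no longer share one batch.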

[LG-35] Uncertainty Quantification for Scientific Machine Learning using Sparse Variational Gaussian Process Kolmogorov-Arnold Networks (SVGP KAN)

Link: https://arxiv.org/abs/2512.05306
Authors: Y. Sungtaek Ju
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 20 pages, 3 figures

Click to view abstract

Abstract:Kolmogorov-Arnold Networks have emerged as interpretable alternatives to traditional multi-layer perceptrons. However, standard implementations lack principled uncertainty quantification capabilities essential for many scientific applications. We present a framework integrating sparse variational Gaussian process inference with the Kolmogorov-Arnold topology, enabling scalable Bayesian inference with computational complexity quasi-linear in sample size. Through analytic moment matching, we propagate uncertainty through deep additive structures while maintaining interpretability. We use three example studies to demonstrate the framework’s ability to distinguish aleatoric from epistemic uncertainty: calibration of heteroscedastic measurement noise in fluid flow reconstruction, quantification of prediction confidence degradation in multi-step forecasting of advection-diffusion dynamics, and out-of-distribution detection in convolutional autoencoders. These results suggest that Sparse Variational Gaussian Process Kolmogorov-Arnold Networks (SVGP KANs) are a promising architecture for uncertainty-aware learning in scientific machine learning.

[LG-36] Bridging Interpretability and Optimization: Provably Attribution-Weighted Actor-Critic in Reproducing Kernel Hilbert Spaces

Link: https://arxiv.org/abs/2512.05291
Authors: Na Li,Hangguan Shan,Wei Ni,Wenjie Zhang,Xinyu Li
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Actor-critic (AC) methods are a cornerstone of reinforcement learning (RL) but offer limited interpretability. Current explainable RL methods seldom use state attributions to assist training. Rather, they treat all state features equally, thereby neglecting the heterogeneous impacts of individual state dimensions on the reward. We propose RKHS–SHAP-based Advanced Actor–Critic (RSA2C), an attribution-aware, kernelized, two-timescale AC algorithm, including Actor, Value Critic, and Advantage Critic. The Actor is instantiated in a vector-valued reproducing kernel Hilbert space (RKHS) with a Mahalanobis-weighted operator-valued kernel, while the Value Critic and Advantage Critic reside in scalar RKHSs. These RKHS-enhanced components use sparsified dictionaries: the Value Critic maintains its own dictionary, while the Actor and Advantage Critic share one. State attributions, computed from the Value Critic via RKHS–SHAP (kernel mean embedding for on-manifold expectations and conditional mean embedding for off-manifold expectations), are converted into Mahalanobis-gated weights that modulate Actor gradients and Advantage Critic targets. Theoretically, we derive a global, non-asymptotic convergence bound under state perturbations, showing stability through the perturbation-error term and efficiency through the convergence-error term. Empirical results on three standard continuous-control environments show that our algorithm achieves efficiency, stability, and interpretability.

[LG-37] DMAGT: Unveiling miRNA-Drug Associations by Integrating SMILES and RNA Sequence Structures through Graph Transformer Models

Link: https://arxiv.org/abs/2512.05287
Authors: Ziqi Zhang
Subjects: Machine Learning (cs.LG)
Comments: 9 pages, 4 figures

Click to view abstract

Abstract:MiRNAs, due to their role in gene regulation, have paved a new pathway for pharmacology, focusing on drug development that targets miRNAs. However, traditional wet lab experiments are limited by efficiency and cost constraints, making it difficult to extensively explore potential associations between developed drugs and target miRNAs. Therefore, we have designed a novel machine learning model based on a multi-layer transformer-based graph neural network, DMAGT, specifically for predicting associations between drugs and miRNAs. This model transforms drug-miRNA associations into graphs, employs Word2Vec for embedding features of drug molecular structures and miRNA base structures, and leverages a graph transformer model to learn from embedded features and relational structures, ultimately predicting associations between drugs and miRNAs. To evaluate DMAGT, we tested its performance on three datasets composed of drug-miRNA associations: ncDR, RNAInter, and SM2miR, achieving an AUC of up to 95.24 ± 0.05. DMAGT demonstrated superior performance in comparative experiments tackling similar challenges. To validate its practical efficacy, we specifically focused on two drugs, namely 5-Fluorouracil and Oxaliplatin. Of the 20 potential drug-miRNA associations identified as the most likely, 14 were successfully validated. The above experiments demonstrate that DMAGT has an excellent performance and stability in predicting drug-miRNA associations, providing a new shortcut for miRNA drug development.

[LG-38] Robust forecast aggregation via additional queries

Link: https://arxiv.org/abs/2512.05271
Authors: Rafael Frongillo,Mary Monroe,Eric Neyman,Bo Waggoner
Subjects: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We study the problem of robust forecast aggregation: combining expert forecasts with provable accuracy guarantees compared to the best possible aggregation of the underlying information. Prior work shows strong impossibility results, e.g. that even under natural assumptions, no aggregation of the experts’ individual forecasts can outperform simply following a random expert (Neyman and Roughgarden, 2022). In this paper, we introduce a more general framework that allows the principal to elicit richer information from experts through structured queries. Our framework ensures that experts will truthfully report their underlying beliefs, and also enables us to define notions of complexity over the difficulty of asking these queries. Under a general model of independent but overlapping expert signals, we show that optimal aggregation is achievable in the worst case with each complexity measure bounded above by the number of agents n. We further establish tight tradeoffs between accuracy and query complexity: aggregation error decreases linearly with the number of queries, and vanishes when the “order of reasoning” and number of agents relevant to a query is ω(√n). These results demonstrate that modest extensions to the space of expert queries dramatically strengthen the power of robust forecast aggregation. We therefore expect that our new query framework will open up a fruitful line of research in this area.

[LG-39] When unlearning is free: leveraging low influence points to reduce computational costs

Link: https://arxiv.org/abs/2512.05254
Authors: Anat Kleiman,Robert Fisher,Ben Deaner,Udi Wieder
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:As concerns around data privacy in machine learning grow, the ability to unlearn, or remove, specific data points from trained models becomes increasingly important. While state-of-the-art unlearning methods have emerged in response, they typically treat all points in the forget set equally. In this work, we challenge this approach by asking whether points that have a negligible impact on the model’s learning need to be removed. Through a comparative analysis of influence functions across language and vision tasks, we identify subsets of training data with negligible impact on model outputs. Leveraging this insight, we propose an efficient unlearning framework that reduces the size of datasets before unlearning, leading to significant computational savings (up to approximately 50 percent) on real-world empirical examples.

[LG-40] Bridging quantum and classical computing for partial differential equations through multifidelity machine learning

Link: https://arxiv.org/abs/2512.05241
Authors: Bruno Jacob,Amanda A. Howard,Panos Stinis
Subjects: Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 19 pages, 12 figures

Click to view abstract

Abstract:Quantum algorithms for partial differential equations (PDEs) face severe practical constraints on near-term hardware: limited qubit counts restrict spatial resolution to coarse grids, while circuit depth limitations prevent accurate long-time integration. These hardware bottlenecks confine quantum PDE solvers to low-fidelity regimes despite their theoretical potential for computational speedup. We introduce a multifidelity learning framework that corrects coarse quantum solutions to high-fidelity accuracy using sparse classical training data, facilitating the path toward practical quantum utility for scientific computing. The approach trains a low-fidelity surrogate on abundant quantum solver outputs, then learns correction mappings through a multifidelity neural architecture that balances linear and nonlinear transformations. Demonstrated on benchmark nonlinear PDEs including viscous Burgers equation and incompressible Navier-Stokes flows via quantum lattice Boltzmann methods, the framework successfully corrects coarse quantum predictions and achieves temporal extrapolation well beyond the classical training window. This strategy illustrates how one can reduce expensive high-fidelity simulation requirements while producing predictions that are competitive with classical accuracy. By bridging the gap between hardware-limited quantum simulations and application requirements, this work establishes a pathway for extracting computational value from current quantum devices in real-world scientific applications, advancing both algorithm development and practical deployment of near-term quantum computing for computational physics.

[LG-41] Edged Weisfeiler-Lehman Algorithm ICANN2024

Link: https://arxiv.org/abs/2512.05238
Authors: Xiao Yue,Bo Liu,Feng Zhang,Guangzhi Qu
Subjects: Machine Learning (cs.LG)
Comments: Author’s Accepted Manuscript (AAM) of ICANN 2024 paper published in LNCS (Springer). Final version available at: this https URL

Click to view abstract

Abstract:As a classical approach to graph learning, the propagation-aggregation methodology is widely exploited by many Graph Neural Networks (GNNs), wherein the representation of a node is updated by aggregating representations from itself and neighbor nodes recursively. Similar to the propagation-aggregation methodology, the Weisfeiler-Lehman (1-WL) algorithm tests isomorphism through color refinement according to color representations of a node and its neighbor nodes. However, 1-WL does not leverage any edge features (labels), which leaves room for improvement by exploiting edge features in some fields. To address this limitation, we propose a novel Edged-WL algorithm (E-WL) which extends the original 1-WL algorithm to incorporate edge features. Building upon the E-WL algorithm, we also introduce an Edged Graph Isomorphism Network (EGIN) model for further exploiting edge features, which addresses one key drawback in many GNNs that do not utilize any edge features of graph data. We evaluated the performance of the proposed models using 12 edge-featured benchmark graph datasets and compared them with some state-of-the-art baseline models. Experimental results indicate that our proposed EGIN models, in general, demonstrate superior performance in graph learning on graph classification tasks.
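To make the difference between 1-WL and an edge-aware variant concrete, here is a minimal sketch of color refinement in which each neighbor contributes a pair (edge label, neighbor color). It is one plausible reading of "incorporating edge features", not the paper's exact E-WL update rule; the function name `wl_colors` and the two toy graphs are hypothetical.

```python
from collections import Counter


def wl_colors(num_nodes, edges, rounds=2, use_edge_labels=True):
    """Color refinement on an undirected graph.

    edges is a list of (u, v, label). With use_edge_labels=False the labels
    are ignored, which reduces to ordinary 1-WL with uniform initial colors.
    Returns the multiset (Counter) of final node colors.
    """
    adj = {v: [] for v in range(num_nodes)}
    for u, v, label in edges:
        lab = label if use_edge_labels else ""
        adj[u].append((lab, v))
        adj[v].append((lab, u))

    colors = {v: () for v in range(num_nodes)}  # every node starts identical
    for _ in range(rounds):
        # New color = own color plus the sorted multiset of
        # (edge label, neighbor color) pairs.
        colors = {
            v: (colors[v], tuple(sorted((lab, colors[u]) for lab, u in adj[v])))
            for v in adj
        }
    return Counter(colors.values())


# Two 3-node paths with the same structure but different edge labels.
G_MIXED = [(0, 1, "single"), (1, 2, "double")]
G_SAME = [(0, 1, "single"), (1, 2, "single")]
```

Colors are kept as nested tuples rather than hashed integers so that histograms from different graphs are directly comparable; this is fine for toy graphs but grows quickly with the number of rounds. Ignoring labels gives both paths the same color histogram, while the edge-aware refinement separates them.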

[LG-42] Variance Matters: Improving Domain Adaptation via Stratified Sampling

Link: https://arxiv.org/abs/2512.05226
Authors: Andrea Napoli,Paul White
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Domain shift remains a key challenge in deploying machine learning models to the real world. Unsupervised domain adaptation (UDA) aims to address this by minimising domain discrepancy during training, but the discrepancy estimates suffer from high variance in stochastic settings, which can stifle the theoretical benefits of the method. This paper proposes Variance-Reduced Domain Adaptation via Stratified Sampling (VaRDASS), the first specialised stochastic variance reduction technique for UDA. We consider two specific discrepancy measures – correlation alignment and the maximum mean discrepancy (MMD) – and derive ad hoc stratification objectives for these terms. We then present expected and worst-case error bounds, and prove that our proposed objective for the MMD is theoretically optimal (i.e., minimises the variance) under certain assumptions. Finally, a practical k-means style optimisation algorithm is introduced and analysed. Experiments on three domain shift datasets demonstrate improved discrepancy estimation accuracy and target domain performance.
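For readers unfamiliar with the maximum mean discrepancy mentioned above, the following sketch computes the standard biased (V-statistic) estimate of squared MMD with a Gaussian kernel. It only shows the quantity whose mini-batch estimates the paper aims to stabilise; it does not implement the stratified sampling scheme, and the kernel choice, `bandwidth` default and function names are assumptions.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as equal-length tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2_biased(source, target, bandwidth=1.0):
    """Biased estimate of squared MMD between two samples.

    MMD^2 = mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)
    """

    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(source, source)
        + mean_kernel(target, target)
        - 2.0 * mean_kernel(source, target)
    )
```

When this estimate is computed on small random mini-batches, its value depends heavily on which points happen to be drawn, which is the variance problem the abstract describes.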

[LG-43] Mitigating the Antigenic Data Bottleneck: Semi-supervised Learning with Protein Language Models for Influenza A Surveillance

Link: https://arxiv.org/abs/2512.05222
Authors: Yanhua Xu
Subjects: Machine Learning (cs.LG)
Comments: V0: initial draft uploaded

Click to view abstract

Abstract:Influenza A viruses (IAVs) evolve antigenically at a pace that requires frequent vaccine updates, yet the haemagglutination inhibition (HI) assays used to quantify antigenicity are labor-intensive and unscalable. As a result, genomic data vastly outpace available phenotypic labels, limiting the effectiveness of traditional supervised models. We hypothesize that combining pre-trained Protein Language Models (PLMs) with Semi-Supervised Learning (SSL) can retain high predictive accuracy even when labeled data are scarce. We evaluated two SSL strategies, Self-training and Label Spreading, against fully supervised baselines using four PLM-derived embeddings (ESM-2, ProtVec, ProtT5, ProtBert) applied to haemagglutinin (HA) sequences. A nested cross-validation framework simulated low-label regimes (25%, 50%, 75%, and 100% label availability) across four IAV subtypes (H1N1, H3N2, H5N1, H9N2). SSL consistently improved performance under label scarcity. Self-training with ProtVec produced the largest relative gains, showing that SSL can compensate for lower-resolution representations. ESM-2 remained highly robust, achieving F1 scores above 0.82 with only 25% labeled data, indicating that its embeddings capture key antigenic determinants. While H1N1 and H9N2 were predicted with high accuracy, the hypervariable H3N2 subtype remained challenging, although SSL mitigated the performance decline. These findings demonstrate that integrating PLMs with SSL can address the antigenicity labeling bottleneck and enable more effective use of unlabeled surveillance sequences, supporting rapid variant prioritization and timely vaccine strain selection.

[LG-44] Rethinking Tokenization for Clinical Time Series: When Less is More ML4H

Link: https://arxiv.org/abs/2512.05217
Authors: Rafi Al Attrach,Rajna Fani,David Restrepo,Yugang Jia,Peter Schüffler
Subjects: Machine Learning (cs.LG)
Comments: 9 pages, 2 figures, 4 tables. Machine Learning for Health (ML4H) 2025, Findings track

Click to view abstract

Abstract:Tokenization strategies shape how models process electronic health records, yet fair comparisons of their effectiveness remain limited. We present a systematic evaluation of tokenization approaches for clinical time series modeling using transformer-based architectures, revealing task-dependent and sometimes counterintuitive findings about temporal and value feature importance. Through controlled ablations across four clinical prediction tasks on MIMIC-IV, we demonstrate that explicit time encodings provide no consistent statistically significant benefit for the evaluated downstream tasks. Value features show task-dependent importance, affecting mortality prediction but not readmission, suggesting code sequences alone can carry sufficient predictive signal. We further show that frozen pretrained code encoders dramatically outperform their trainable counterparts while requiring dramatically fewer parameters. Larger clinical encoders provide consistent improvements across tasks, benefiting from frozen embeddings that eliminate computational overhead. Our controlled evaluation enables fairer tokenization comparisons and demonstrates that simpler, parameter-efficient approaches can, in many cases, achieve strong performance, though the optimal tokenization strategy remains task-dependent.

[LG-45] Coefficient of Variation Masking: A Volatility-Aware Strategy for EHR Foundation Models ML4H

Link: https://arxiv.org/abs/2512.05216
Authors: Rajna Fani,Rafi Al Attrach,David Restrepo,Yugang Jia,Leo Anthony Celi,Peter Schüffler
Subjects: Machine Learning (cs.LG)
Comments: 16 pages, 9 figures, 1 table, 1 algorithm. Accepted at Machine Learning for Health (ML4H) 2025, Proceedings of the Machine Learning Research (PMLR)

Click to view abstract

Abstract:Masked autoencoders (MAEs) are increasingly applied to electronic health records (EHR) for learning general-purpose representations that support diverse clinical tasks. However, existing approaches typically rely on uniform random masking, implicitly assuming all features are equally predictable. In reality, laboratory tests exhibit substantial heterogeneity in volatility: some biomarkers (e.g., sodium) remain stable, while others (e.g., lactate) fluctuate considerably and are more difficult to model. Clinically, volatile biomarkers often signal acute pathophysiology and require more sophisticated modeling to capture their complex temporal patterns. We propose a volatility-aware pretraining strategy, Coefficient of Variation Masking (CV-Masking), that adaptively adjusts masking probabilities according to the intrinsic variability of each feature. Combined with a value-only masking objective aligned with clinical workflows, CV-Masking yields systematic improvements over random and variance-based strategies. Experiments on a large panel of laboratory tests show that CV-Masking enhances reconstruction, improves downstream predictive performance, and accelerates convergence, producing more robust and clinically meaningful EHR representations.
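A small sketch may help show what "adjusting masking probabilities according to intrinsic variability" could look like. The coefficient of variation (standard deviation divided by the absolute mean) is standard; the mapping from CV to probability below (scaling a base rate by each feature's CV relative to the average CV, then clipping) is an assumption for illustration, not the paper's formula, and the lab values are made up.

```python
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the absolute mean."""
    mean = statistics.fmean(values)
    return statistics.pstdev(values) / abs(mean)


def cv_masking_probs(columns, base_rate=0.15, max_prob=0.9):
    """Map each feature to a masking probability that grows with its CV.

    Hypothetical rule: scale base_rate by CV relative to the average CV and
    clip at max_prob. Assumes non-zero feature means and at least one
    feature with non-zero variability.
    """
    cvs = {name: coefficient_of_variation(vals) for name, vals in columns.items()}
    mean_cv = statistics.fmean(list(cvs.values()))
    return {name: min(max_prob, base_rate * cv / mean_cv) for name, cv in cvs.items()}


# Made-up lab values: sodium is stable, lactate fluctuates.
LABS = {
    "sodium": [139.0, 140.0, 141.0, 140.0],
    "lactate": [1.0, 2.5, 4.0, 1.5],
}
PROBS = cv_masking_probs(LABS)
```

With these toy values the volatile feature (lactate) is masked far more often than the stable one (sodium), so the pretraining objective spends more effort on the harder-to-predict signal.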

[LG-46] Hierarchical Reinforcement Learning for the Dynamic VNE with Alternatives Problem

Link: https://arxiv.org/abs/2512.05207
Authors: Ali Al Housseini,Cristina Rottondi,Omran Ayoub
Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Submitted to IEEE International Conference on Machine Learning for Communication and Networking (ICMLCN) 2026

Click to view abstract

Abstract:Virtual Network Embedding (VNE) is a key enabler of network slicing, yet most formulations assume that each Virtual Network Request (VNR) has a fixed topology. Recently, VNE with Alternative topologies (VNEAP) was introduced to capture malleable VNRs, where each request can be instantiated using one of several functionally equivalent topologies that trade resources differently. While this flexibility enlarges the feasible space, it also introduces an additional decision layer, making dynamic embedding more challenging. This paper proposes HRL-VNEAP, a hierarchical reinforcement learning approach for VNEAP under dynamic arrivals. A high-level policy selects the most suitable alternative topology (or rejects the request), and a low-level policy embeds the chosen topology onto the substrate network. Experiments on realistic substrate topologies under multiple traffic loads show that naive exploitation strategies provide only modest gains, whereas HRL-VNEAP consistently achieves the best performance across all metrics. Compared to the strongest tested baselines, HRL-VNEAP improves acceptance ratio by up to 20.7%, total revenue by up to 36.2%, and revenue-over-cost by up to 22.1%. Finally, we benchmark against an MILP formulation on tractable instances to quantify the remaining gap to optimality and motivate future work on learning- and optimization-based VNEAP solutions.

[LG-47] Consequences of Kernel Regularity for Bandit Optimization

Link: https://arxiv.org/abs/2512.05957
Authors: Madison Lee,Tara Javidi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Feedback welcome!

Click to view abstract

Abstract:In this work we investigate the relationship between kernel regularity and algorithmic performance in the bandit optimization of RKHS functions. While reproducing kernel Hilbert space (RKHS) methods traditionally rely on global kernel regressors, it is also common to use a smoothness-based approach that exploits local approximations. We show that these perspectives are deeply connected through the spectral properties of isotropic kernels. In particular, we characterize the Fourier spectra of the Matérn, square-exponential, rational-quadratic, γ-exponential, piecewise-polynomial, and Dirichlet kernels, and show that the decay rate determines asymptotic regret from both viewpoints. For kernelized bandit algorithms, spectral decay yields upper bounds on the maximum information gain, governing worst-case regret, while for smoothness-based methods, the same decay rates establish Hölder space embeddings and Besov space norm-equivalences, enabling local continuity analysis. These connections show that kernel-based and locally adaptive algorithms can be analyzed within a unified framework. This allows us to derive explicit regret bounds for each kernel family, obtaining novel results in several cases and providing improved analysis for others. Furthermore, we analyze LP-GP-UCB, an algorithm that combines both approaches, augmenting global Gaussian process surrogates with local polynomial estimators. While the hybrid approach does not uniformly dominate specialized methods, it achieves order-optimality across multiple kernel families.

[LG-48] Designing an Optimal Sensor Network via Minimizing Information Loss

Link: https://arxiv.org/abs/2512.05940
Authors: Daniel Waxman,Fernando Llorente,Katia Lamer,Petar M. Djurić
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
Comments: 37 pages, 15 figures. Accepted to Bayesian Analysis

Click to view abstract

Abstract:Optimal experimental design is a classic topic in statistics, with many well-studied problems, applications, and solutions. The design problem we study is the placement of sensors to monitor spatiotemporal processes, explicitly accounting for the temporal dimension in our modeling and optimization. We observe that recent advancements in computational sciences often yield large datasets based on physics-based simulations, which are rarely leveraged in experimental design. We introduce a novel model-based sensor placement criterion, along with a highly-efficient optimization algorithm, which integrates physics-based simulations and Bayesian experimental design principles to identify sensor networks that “minimize information loss” from simulated data. Our technique relies on sparse variational inference and (separable) Gauss-Markov priors, and thus may adapt many techniques from Bayesian experimental design. We validate our method through a case study monitoring air temperature in Phoenix, Arizona, using state-of-the-art physics-based simulations. Our results show our framework to be superior to random or quasi-random sampling, particularly with a limited number of sensors. We conclude by discussing practical considerations and implications of our framework, including more complex modeling tools and real-world deployments.

[LG-49] BalLOT: Balanced k-means clustering with optimal transport

Link: https://arxiv.org/abs/2512.05926
Authors: Wenyan Luo,Dustin G. Mixon
Subjects: Machine Learning (stat.ML); Data Structures and Algorithms (cs.DS); Information Theory (cs.IT); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: 20 pages, 9 figures

Click to view abstract

Abstract:We consider the fundamental problem of balanced k-means clustering. In particular, we introduce an optimal transport approach to alternating minimization called BalLOT, and we show that it delivers a fast and effective solution to this problem. We establish this with a variety of numerical experiments before proving several theoretical guarantees. First, we prove that for generic data, BalLOT produces integral couplings at each step. Next, we perform a landscape analysis to provide theoretical guarantees for both exact and partial recoveries of planted clusters under the stochastic ball model. Finally, we propose initialization schemes that achieve one-step recovery of planted clusters.

[LG-50] Machine-learning-enabled interpretation of tribological deformation patterns in large-scale MD data

Link: https://arxiv.org/abs/2512.05818
Authors: Hendrik J. Ehrich, Marvin C. May, Stefan J. Eder
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 19 pages, 11 figures

Abstract:Molecular dynamics (MD) simulations have become indispensable for exploring tribological deformation patterns at the atomic scale. However, transforming the resulting high-dimensional data into interpretable deformation pattern maps remains a resource-intensive and largely manual process. In this work, we introduce a data-driven workflow that automates this interpretation step using unsupervised and supervised learning. Grain-orientation-colored computational tomograph pictures obtained from CuNi alloy simulations were first compressed through an autoencoder to a 32-dimensional global feature vector. Despite this strong compression, the reconstructed images retained the essential microstructural motifs: grain boundaries, stacking faults, twins, and partial lattice rotations, while omitting only the finest defects. The learned representations were then combined with simulation metadata (composition, load, time, temperature, and spatial position) to train a CNN-MLP model to predict the dominant deformation pattern. The resulting model achieves a prediction accuracy of approximately 96% on validation data. A refined evaluation strategy, in which an entire spatial region containing distinct grains was excluded from training, provides a more robust measure of generalization. The approach demonstrates that essential tribological deformation signatures can be automatically identified and classified from structural images using Machine Learning. This proof of concept constitutes a first step towards fully automated, data-driven construction of tribological mechanism maps and, ultimately, toward predictive modeling frameworks that may reduce the need for large-scale MD simulation campaigns.

[LG-51] Comparing the latent features of universal machine-learning interatomic potentials

Link: https://arxiv.org/abs/2512.05717
Authors: Sofiia Chorna, Davide Tisi, Cesare Malosso, Wei Bin How, Michele Ceriotti, Sanggyu Chong
Categories: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Abstract:The past few years have seen the development of “universal” machine-learning interatomic potentials (uMLIPs) capable of approximating the ground-state potential energy surface across a wide range of chemical structures and compositions with reasonable accuracy. While these models differ in the architecture and the dataset used, they share the ability to compress a staggering amount of chemical information into descriptive latent features. Herein, we systematically analyze what the different uMLIPs have learned by quantitatively assessing the relative information content of their latent features with feature reconstruction errors as metrics, and observing how the trends are affected by the choice of training set and training protocol. We find that the uMLIPs encode chemical space in significantly distinct ways, with substantial cross-model feature reconstruction errors. When variants of the same model architecture are considered, trends become dependent on the dataset, target, and training protocol of choice. We also observe that fine-tuning of a uMLIP retains a strong pre-training bias in the latent features. Finally, we discuss how atom-level features, which are directly output by MLIPs, can be compressed into global structure-level features via concatenation of progressive cumulants, each adding significantly new information about the variability across the atomic environments within a given system.

[LG-52] Over-the-Air Semantic Alignment with Stacked Intelligent Metasurfaces

Link: https://arxiv.org/abs/2512.05657
Authors: Mario Edoardo Pandolfo, Kyriakos Stylianopoulos, George C. Alexandropoulos, Paolo Di Lorenzo
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Semantic communication systems aim to transmit task-relevant information between devices capable of artificial intelligence, but their performance can degrade when heterogeneous transmitter-receiver models produce misaligned latent representations. Existing semantic alignment methods typically rely on additional digital processing at the transmitter or receiver, increasing overall device complexity. In this work, we introduce the first over-the-air semantic alignment framework based on stacked intelligent metasurfaces (SIM), which enables latent-space alignment directly in the wave domain, substantially reducing the computational burden at the device level. We model SIMs as trainable linear operators capable of emulating both supervised linear aligners and zero-shot Parseval-frame-based equalizers. To realize these operators physically, we develop a gradient-based optimization procedure that tailors the metasurface transfer function to a desired semantic mapping. Experiments with heterogeneous vision transformer (ViT) encoders show that SIMs can accurately reproduce both supervised and zero-shot semantic equalizers, achieving up to 90% task accuracy in regimes with high signal-to-noise ratio (SNR), while maintaining strong robustness even at low SNR values.

[LG-53] Design-marginal calibration of Gaussian process predictive distributions: Bayesian and conformal approaches

Link: https://arxiv.org/abs/2512.05611
Authors: Aurélien Pion, Emmanuel Vazquez
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:We study the calibration of Gaussian process (GP) predictive distributions in the interpolation setting from a design-marginal perspective. Conditioning on the data and averaging over a design measure $\mu$, we formalize $\mu$-coverage for central intervals and $\mu$-probabilistic calibration through randomized probability integral transforms. We introduce two methods. cps-gp adapts conformal predictive systems to GP interpolation using standardized leave-one-out residuals, yielding stepwise predictive distributions with finite-sample marginal calibration. bcr-gp retains the GP posterior mean and replaces the Gaussian residual by a generalized normal model fitted to cross-validated standardized residuals. A Bayesian selection rule, based either on a posterior upper quantile of the variance for conservative prediction or on a cross-posterior Kolmogorov-Smirnov criterion for probabilistic calibration, controls dispersion and tail behavior while producing smooth predictive distributions suitable for sequential design. Numerical experiments on benchmark functions compare cps-gp, bcr-gp, Jackknife+ for GPs, and the full conformal Gaussian process, using calibration metrics (coverage, Kolmogorov-Smirnov, integral absolute error) and accuracy or sharpness through the scaled continuous ranked probability score.
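
The calibration metrics named in this abstract (coverage of central intervals, and a Kolmogorov-Smirnov distance computed on probability integral transforms) can be illustrated with a minimal sketch. It assumes plain Gaussian predictive distributions evaluated at a finite set of test points and a non-randomized PIT; it is not the paper's cps-gp or bcr-gp procedure, and the function names are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pit_values(y, mean, std):
    """Probability integral transform u_i = F_i(y_i) for Gaussian predictions."""
    return [normal_cdf((yi - mi) / si) for yi, mi, si in zip(y, mean, std)]


def central_coverage(pit, level: float) -> float:
    """Fraction of points whose PIT lies in the central interval of mass `level`."""
    lo, hi = (1.0 - level) / 2.0, (1.0 + level) / 2.0
    return sum(lo <= u <= hi for u in pit) / len(pit)


def ks_to_uniform(pit) -> float:
    """Kolmogorov-Smirnov distance between the empirical PIT CDF and U(0, 1)."""
    u = sorted(pit)
    n = len(u)
    return max(max((i + 1) / n - ui, ui - i / n) for i, ui in enumerate(u))
```

A calibrated predictor should give `central_coverage(pit, 0.9)` close to 0.9 and a small `ks_to_uniform(pit)` when averaged over many test points.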

[LG-54] Decoding Selective Auditory Attention to Musical Elements in Ecologically Valid Music Listening

Link: https://arxiv.org/abs/2512.05528
Authors: Taketo Akama, Zhuohao Zhang, Tsukasa Nagashima, Takagi Yutaka, Shun Minamikawa, Natalia Polouliakh
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)

Abstract:Art has long played a profound role in shaping human emotion, cognition, and behavior. While visual arts such as painting and architecture have been studied through eye tracking, revealing distinct gaze patterns between experts and novices, analogous methods for auditory art forms remain underdeveloped. Music, despite being a pervasive component of modern life and culture, still lacks objective tools to quantify listeners’ attention and perceptual focus during natural listening experiences. To our knowledge, this is the first attempt to decode selective attention to musical elements using naturalistic, studio-produced songs and a lightweight consumer-grade EEG device with only four electrodes. By analyzing neural responses during real-world-like music listening, we test whether decoding is feasible under conditions that minimize participant burden and preserve the authenticity of the musical experience. Our contributions are fourfold: (i) decoding music attention in real studio-produced songs, (ii) demonstrating feasibility with a four-channel consumer EEG, (iii) providing insights for music attention decoding, and (iv) demonstrating improved model ability over prior work. Our findings suggest that musical attention can be decoded not only for novel songs but also across new subjects, showing performance improvements compared to existing approaches under our tested conditions. These findings show that consumer-grade devices can reliably capture signals, and that neural decoding in music could be feasible in real-world settings. This paves the way for applications in education, personalized music technologies, and therapeutic interventions.

[LG-55] SSDLabeler: Realistic semi-synthetic data generation for multi-label artifact classification in EEG

Link: https://arxiv.org/abs/2512.05500
Authors: Taketo Akama, Akima Connelly, Shun Minamikawa, Natalia Polouliakh
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:EEG recordings are inherently contaminated by artifacts such as ocular, muscular, and environmental noise, which obscure neural activity and complicate preprocessing. Artifact classification offers advantages in stability and transparency, providing a viable alternative to ICA-based methods that enable flexible use alongside human inspections and across various applications. However, artifact classification is limited by its training data as it requires extensive manual labeling, which cannot fully cover the diversity of real-world EEG. Semi-synthetic data (SSD) methods have been proposed to address this limitation, but prior approaches typically injected single artifact types using ICA components or required separately recorded artifact signals, reducing both the realism of the generated data and the applicability of the method. To overcome these issues, we introduce SSDLabeler, a framework that generates realistic, annotated SSDs by decomposing real EEG with ICA, verifying artifacts at the epoch level using RMS and PSD criteria, and reinjecting multiple artifact types into clean data. When applied to train a multi-label artifact classifier, it improved accuracy on raw EEG across diverse conditions compared to prior SSD and raw EEG training, establishing a scalable foundation for artifact handling that captures the co-occurrence and complexity of real EEG.

[LG-56] Do We Really Even Need Data? A Modern Look at Drawing Inference with Predicted Data

Link: https://arxiv.org/abs/2512.05456
Authors: Stephen Salerno, Kentaro Hoffman, Awan Afiaz, Anna Neufeld, Tyler H. McCormick, Jeffrey T. Leek
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 32 pages, 9 figures, 3 tables

Abstract:As artificial intelligence and machine learning tools become more accessible, and scientists face new obstacles to data collection (e.g., rising costs, declining survey response rates), researchers increasingly use predictions from pre-trained algorithms as substitutes for missing or unobserved data. Though appealing for financial and logistical reasons, using standard tools for inference can misrepresent the association between independent variables and the outcome of interest when the true, unobserved outcome is replaced by a predicted value. In this paper, we characterize the statistical challenges inherent to drawing inference with predicted data (IPD) and show that high predictive accuracy does not guarantee valid downstream inference. We show that all such failures reduce to statistical notions of (i) bias, when predictions systematically shift the estimand or distort relationships among variables, and (ii) variance, when uncertainty from the prediction model and the intrinsic variability of the true data are ignored. We then review recent methods for conducting IPD and discuss how this framework is deeply rooted in classical statistical theory. We then comment on some open questions and interesting avenues for future work in this area, and end with some comments on how to use predicted data in scientific studies in a way that is both transparent and statistically principled.

[LG-57] FieldSeer I: Physics-Guided World Models for Long-Horizon Electromagnetic Dynamics under Partial Observability

Link: https://arxiv.org/abs/2512.05361
Authors: Ziheng Guo, Fang Wu, Maoxiong Zhao, Chaoqun Fang, Yang Bu
Categories: Optics (physics.optics); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)

Abstract:We introduce FieldSeer I, a geometry-aware world model that forecasts electromagnetic field dynamics from partial observations in 2-D TE waveguides. The model assimilates a short prefix of observed fields, conditions on a scalar source action and structure/material map, and generates closed-loop rollouts in the physical domain. Training in a symmetric-log domain ensures numerical stability. Evaluated on a reproducible FDTD benchmark (200 unique simulations, structure-wise split), FieldSeer I achieves higher suffix fidelity than GRU and deterministic baselines across three practical settings: (i) software-in-the-loop filtering (64x64, P=80-Q=80), (ii) offline single-file rollouts (80x140, P=240-Q=40), and (iii) offline multi-structure rollouts (80x140, P=180-Q=100). Crucially, it enables edit-after-prefix geometry modifications without re-assimilation. Results demonstrate that geometry-conditioned world models provide a practical path toward interactive digital twins for photonic design.
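
The abstract mentions training in a symmetric-log domain. As a minimal sketch, assuming the commonly used definition symlog(x) = sign(x) * log(1 + |x|) (the paper's exact transform and any scaling constant are not given here), the forward and inverse maps look like this; the function names are hypothetical.

```python
import math


def symlog(x: float) -> float:
    """Symmetric log: sign(x) * log(1 + |x|). Compresses large magnitudes, keeps sign."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(y: float) -> float:
    """Inverse of symlog: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)
```

A model would be trained on `symlog(field)` and its outputs mapped back to physical units with `symexp`, which keeps field values of very different magnitudes in a comparable numeric range.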

[LG-58] Symmetric Linear Dynamical Systems are Learnable from Few Observations

Link: https://arxiv.org/abs/2512.05337
Authors: Minh Vu, Andrey Y. Lokhov, Marc Vuffray
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)

Abstract:We consider the problem of learning the parameters of an $N$-dimensional stochastic linear dynamics under both full and partial observations from a single trajectory of time $T$. We introduce and analyze a new estimator that achieves a small maximum element-wise error on the recovery of symmetric dynamic matrices using only $T=\mathcal{O}(\log N)$ observations, irrespective of whether the matrix is sparse or dense. This estimator is based on the method of moments and does not rely on problem-specific regularization. This is especially important for applications such as structure discovery.

[LG-59] One-Step Diffusion Samplers via Self-Distillation and Deterministic Flow

Link: https://arxiv.org/abs/2512.05251
Authors: Pascal Jutras-Dube, Jiaru Zhang, Ziran Wang, Ruqi Zhang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Sampling from unnormalized target distributions is a fundamental yet challenging task in machine learning and statistics. Existing sampling algorithms typically require many iterative steps to produce high-quality samples, leading to high computational costs. We introduce one-step diffusion samplers which learn a step-conditioned ODE so that one large step reproduces the trajectory of many small ones via a state-space consistency loss. We further show that standard ELBO estimates in diffusion samplers degrade in the few-step regime because common discrete integrators yield mismatched forward/backward transition kernels. Motivated by this analysis, we derive a deterministic-flow (DF) importance weight for ELBO estimation without a backward kernel. To calibrate DF, we introduce a volume-consistency regularization that aligns the accumulated volume change along the flow across step resolutions. Our proposed sampler therefore achieves both sampling and stable evidence estimate in only one or few steps. Across challenging synthetic and Bayesian benchmarks, it achieves competitive sample quality with orders-of-magnitude fewer network evaluations while maintaining robust ELBO estimates.

[LG-60] STAR-GO: Improving Protein Function Prediction by Learning to Hierarchically Integrate Ontology-Informed Semantic Embeddings

Link: https://arxiv.org/abs/2512.05245
Authors: Mehmet Efe Akça, Gökçe Uludoğan, Arzucan Özgür, İnci M. Baytaş
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Comments: 14 pages, 2 figures, 6 tables

Abstract:Accurate prediction of protein function is essential for elucidating molecular mechanisms and advancing biological and therapeutic discovery. Yet experimental annotation lags far behind the rapid growth of protein sequence data. Computational approaches address this gap by associating proteins with Gene Ontology (GO) terms, which encode functional knowledge through hierarchical relations and textual definitions. However, existing models often emphasize one modality over the other, limiting their ability to generalize, particularly to unseen or newly introduced GO terms that frequently arise as the ontology evolves, and making the previously trained models outdated. We present STAR-GO, a Transformer-based framework that jointly models the semantic and structural characteristics of GO terms to enhance zero-shot protein function prediction. STAR-GO integrates textual definitions with ontology graph structure to learn unified GO representations, which are processed in hierarchical order to propagate information from general to specific terms. These representations are then aligned with protein sequence embeddings to capture sequence-function relationships. STAR-GO achieves state-of-the-art performance and superior zero-shot generalization, demonstrating the utility of integrating semantics and structure for robust and adaptable protein function prediction. Code is available at this https URL.

[LG-61] Continuous-Time Homeostatic Dynamics for Reentrant Inference Models

Link: https://arxiv.org/abs/2512.05158
Authors: Byung Gyu Chae
Categories: Dynamical Systems (math.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 13 pages, 4 figures

Abstract:We formulate the Fast-Weights Homeostatic Reentry Network (FHRN) as a continuous-time neural-ODE system, revealing its role as a norm-regulated reentrant dynamical process. Starting from the discrete reentry rule $x_t = x_t^{(\mathrm{ex})} + \gamma\, W_r\, g(\|y_{t-1}\|)\, y_{t-1}$, we derive the coupled system $\dot{y} = -y + f(W_r y;\, x,\, A) + g_{\mathrm{h}}(y)$, showing that the network couples fast associative memory with global radial homeostasis. The dynamics admit bounded attractors governed by an energy functional, yielding a ring-like manifold. A Jacobian spectral analysis identifies a reflective regime in which reentry induces stable oscillatory trajectories rather than divergence or collapse. Unlike continuous-time recurrent neural networks or liquid neural networks, FHRN achieves stability through population-level gain modulation rather than fixed recurrence or neuron-local time adaptation. These results establish the reentry network as a distinct class of self-referential neural dynamics supporting recursive yet bounded computation.

[LG-62] Bayesian Optimization and Convolutional Neural Networks for Zernike-Based Wavefront Correction in High Harmonic Generation

Link: https://arxiv.org/abs/2512.05127
Authors: Guilherme Grancho D. Fernandes, Duarte Alexandrino, Eduardo Silva, João Matias, Joaquim Pereira
Categories: Optics (physics.optics); Machine Learning (cs.LG)

Abstract:High harmonic generation (HHG) is a nonlinear process that enables table-top generation of tunable, high-energy, coherent, ultrashort radiation pulses in the extreme ultraviolet (EUV) to soft X-ray range. These pulses find applications in photoemission spectroscopy in condensed matter physics, pump-probe spectroscopy for high-energy-density plasmas, and attosecond science. However, optical aberrations in the high-power laser systems required for HHG degrade beam quality and reduce efficiency. We present a machine learning approach to optimize aberration correction using a spatial light modulator. We implemented and compared Bayesian optimization and convolutional neural network (CNN) methods to predict optimal Zernike polynomial coefficients for wavefront correction. Our CNN achieved promising results with 80.39% accuracy on test data, demonstrating the potential for automated aberration correction in HHG systems.
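
For readers unfamiliar with the Zernike coefficients mentioned in this abstract, the sketch below shows how a wavefront on the unit disk can be assembled from a set of coefficients using the standard radial-polynomial formula. It uses unnormalized Zernike modes indexed by (n, m); the paper's indexing, normalization, and spatial-light-modulator interface are not specified here, so the function names and the coefficient dictionary are illustrative assumptions.

```python
import math


def zernike_radial(n: int, m: int, rho: float) -> float:
    """Radial polynomial R_n^|m|(rho); zero when n - |m| is odd or |m| > n."""
    m = abs(m)
    if m > n or (n - m) % 2:
        return 0.0
    total = 0.0
    for k in range((n - m) // 2 + 1):
        num = (-1) ** k * math.factorial(n - k)
        den = (math.factorial(k)
               * math.factorial((n + m) // 2 - k)
               * math.factorial((n - m) // 2 - k))
        total += num / den * rho ** (n - 2 * k)
    return total


def zernike(n: int, m: int, rho: float, theta: float) -> float:
    """Unnormalized Zernike mode: cosine for m >= 0, sine for m < 0."""
    angular = math.cos(m * theta) if m >= 0 else math.sin(-m * theta)
    return zernike_radial(n, m, rho) * angular


def wavefront(coeffs: dict, rho: float, theta: float) -> float:
    """Wavefront value as a weighted sum of modes; coeffs maps (n, m) to a weight."""
    return sum(c * zernike(n, m, rho, theta) for (n, m), c in coeffs.items())
```

For example, `wavefront({(2, 0): 0.3, (2, 2): -0.1}, rho, theta)` evaluates a mix of defocus and astigmatism at the polar point `(rho, theta)`; a correction scheme would search for the coefficient values that cancel the measured aberration.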

Information Retrieval

Attachment Download

Click to download today's full paper list