This blog post presents the latest paper listing retrieved from Arxiv.org on 2025-10-20, organized into five broad areas: NLP, CV, ML, AI, and IR. The list is generated automatically from Arxiv.org and refreshed around 12:00 every day. If you would like to receive the daily paper digest by email, please leave your email address in the comments.
Table of Contents
Overview (2025-10-20)
501 papers are updated today, including:
- Natural Language Processing: 76 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 148 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 104 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 161 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
[Quick Read]: This paper targets the gap in cross-modal perception and reasoning in multimodal machine intelligence: how to build a large language model (LLM) that, like humans, can align and jointly understand visual, audio, and other modalities. The key components are: first, OmniAlignNet, which strengthens the alignment of vision and audio embeddings in a shared omni-modal latent space; second, Temporal Embedding Grouping, which captures the relative temporal alignment between vision and audio signals; and third, Constrained Rotary Time Embedding, which encodes absolute temporal information into the omni-modal embeddings. In addition, a data curation and synthesis pipeline generates 24M single-modal and omni-modal conversations, yielding clear gains in cross-modal understanding (DailyOmni +19.05), audio recognition (MMAR +1.7), and vision (Video-MME +3.9) while using only 0.2T training tokens, a 6x reduction compared with Qwen2.5-Omni, validating the effectiveness of multimodal synergy and its advantages in downstream applications such as robotics, medical AI, and smart manufacturing.
Link: https://arxiv.org/abs/2510.15870
Authors: Hanrong Ye,Chao-Han Huck Yang,Arushi Goel,Wei Huang,Ligeng Zhu,Yuanhang Su,Sean Lin,An-Chieh Cheng,Zhen Wan,Jinchuan Tian,Yuming Lou,Dong Yang,Zhijian Liu,Yukang Chen,Ambrish Dantrey,Ehsan Jahangiri,Sreyan Ghosh,Daguang Xu,Ehsan Hosseini-Asl,Danial Mohseni Taheri,Vidya Murali,Sifei Liu,Jason Lu,Oluwatobi Olabiyi,Frank Wang,Rafael Valle,Bryan Catanzaro,Andrew Tao,Song Han,Jan Kautz,Hongxu Yin,Pavlo Molchanov
Affiliations: NVIDIA
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Technical Report. Code: this https URL
Abstract:Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a 6 times reduction compared to Qwen2.5-Omni’s 1.2T. We finally demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factory.
[NLP-1] PolySkill: Learning Generalizable Skills Through Polymorphic Abstraction
[Quick Read]: This paper tackles the over-specialization problem in continual skill learning for LLM-based agents: skills learned by existing methods are often tied to a single website or task and fail to generalize to new environments. The proposed PolySkill framework, inspired by polymorphism in software engineering, decouples a skill's abstract goal from its concrete implementation, enabling agents to learn generalizable and compositional skills. Experiments show substantially higher skill reuse and success rates with fewer execution steps, and in self-exploration settings without specified tasks the framework strengthens the agent's ability to identify and refine its own goals, yielding a better continual-learning curriculum and pushing agents toward long-horizon, generalizable autonomous learning on the open web.
Link: https://arxiv.org/abs/2510.15863
Authors: Simon Yu,Gang Li,Weiyan Shi,Peng Qi
Affiliations: Northeastern University; Uniphore
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 29 pages, 6 figures, 8 tables
Abstract:Large language models (LLMs) are moving beyond static uses and are now powering agents that learn continually during their interaction with external environments. For example, agents can learn reusable skills while navigating web pages or toggling new tools. However, existing methods for skill learning often create skills that are over-specialized to a single website and fail to generalize. We introduce PolySkill, a new framework that enables agents to learn generalizable and compositional skills. The core idea, inspired by polymorphism in software engineering, is to decouple a skill’s abstract goal (what it accomplishes) and its concrete implementation (how it is executed). Experiments show that our method (1) improves skill reuse by 1.7x on seen websites and (2) boosts success rates by up to 9.4% on Mind2Web and 13.9% on unseen websites, while reducing steps by over 20%. (3) In self-exploration settings without specified tasks, our framework improves the quality of proposed tasks and enables agents to learn generalizable skills that work across different sites. By enabling the agent to identify and refine its own goals, the PolySkill enhances the agent’s ability to learn a better curriculum, leading to the acquisition of more generalizable skills compared to baseline methods. This work provides a practical path toward building agents capable of continual learning in adaptive environments. Our findings show that separating a skill’s goal from its execution is a crucial step toward developing autonomous agents that can learn and generalize across the open web continuously.
[NLP-2] InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training
[Quick Read]: This paper addresses the limitation that reinforcement learning (RL) for large language models (LLMs) is hampered in open-ended domains by the lack of robust reward functions, especially in high-stakes settings such as medical consultation where rule-based or programmatically verifiable rewards do not apply. The key is ORBIT, a rubric-based incremental training framework that combines synthetic dialogue generation with dynamically created rubrics and uses rubric-guided feedback to drive the RL process, requiring no external medical knowledge or hand-written rules while delivering large, consistent performance gains across consultation scenarios.
Link: https://arxiv.org/abs/2510.15859
Authors: Pengkai Wang,Qi Zuo,Pengwei Liu,Zhijie Sang,Congkai Xie,Hongxia Yang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 17 pages, 6 figures
Abstract:Large Language Models (LLMs) have shown substantial advances through reinforcement learning (RL), particularly in domains where rewards can be programmatically verified, such as mathematics and code. In these areas, models benefit from a well-defined operational base guided by explicit rule-based objectives. However, this progress reveals a significant limitation: in open-ended domains where rewards are ambiguous, subjective, or context-dependent, such as creative writing, scientific reasoning, and notably medical consultation, robust reward functions are lacking, making these areas challenging for current RL strategies. To bridge this gap, we introduce ORBIT, an open-ended rubric-based incremental training framework specifically designed for high-stakes medical dialogue. ORBIT integrates synthetic dialogue generation with the dynamic creation of rubrics, employing these rubrics to direct an incremental RL process. In particular, this approach does not depend on external medical knowledge or manual rules, instead utilizing rubric-guided feedback to shape learning. When implemented on the Qwen3-4B-Instruct model, our method can greatly enhance its performance on the HealthBench-Hard benchmark from 7.0 to 27.2 using only 2k samples, thus achieving state-of-the-art results for models of this scale. Our analysis confirms that rubric-driven RL fosters consistent performance gains across diverse consultation scenarios, going beyond simple numerical improvements. These findings underscore rubric-based feedback as a scalable strategy for advancing LLMs in intricate, open-ended tasks.
[NLP-3] SpeechLLMs for Large-scale Contextualized Zero-shot Slot Filling EMNLP2025
[Quick Read]: This paper addresses the limitations of traditional slot filling in spoken language understanding (SLU), where cascading speech recognition with natural language understanding modules creates performance bottlenecks, fragility, and weak generalization. The key is to leverage emerging speech-based large language models (speechLLMs): the authors construct an empirical upper bound for the task, systematically identify performance, robustness, and generalization gaps, and propose targeted improvements to the training data, architecture, and training strategies to narrow the gap with the upper bound. Each of these measures substantially improves slot filling, including zero-shot transfer to unseen slot labels, and the study provides empirical guidance for efficient, unified, instruction-following speech understanding.
Link: https://arxiv.org/abs/2510.15851
Authors: Kadri Hacioglu,Manjunath K E,Andreas Stolcke
Affiliations: Uniphore
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 13 pages, EMNLP 2025
Abstract:Slot filling is a crucial subtask in spoken language understanding (SLU), traditionally implemented as a cascade of speech recognition followed by one or more natural language understanding (NLU) components. The recent advent of speech-based large language models (speechLLMs), which integrate speech and textual foundation models, has opened new avenues for achieving speech understanding tasks in a more unified, generative, and instruction-following manner while promising data and compute efficiency with zero-shot abilities, generalizing to unseen slot labels. We address the slot-filling task by creating an empirical upper bound for the task, identifying performance, robustness, and generalization gaps, and proposing improvements to the training data, architecture, and training strategies to narrow the gap with the upper bound result. We show that each of these measures improve performance substantially, while highlighting practical challenges and providing empirical guidance and insights for harnessing these emerging models.
[NLP-4] Enhanced Sentiment Interpretation via a Lexicon-Fuzzy-Transformer Framework
[Quick Read]: This paper addresses the difficulty of accurately detecting sentiment polarity and intensity in product reviews and social media text, where informal language and domain-specific expressions pose challenges. The proposed hybrid lexicon-fuzzy-transformer framework combines rule-based heuristics, contextual deep learning (DistilBERT), and fuzzy logic to output continuous sentiment scores: VADER provides initial estimates, which are refined in a two-stage adjustment that combines DistilBERT confidence scores with a custom fuzzy inference system to mitigate the excessive-neutrality bias and improve granularity, finally mapping scores onto a continuous 0-to-1 range to produce expert-like judgments that align better with user ratings and identify sentiment extremes more accurately across several domain-specific datasets.
Link: https://arxiv.org/abs/2510.15843
Authors: Shayan Rokhva,Mousa Alizadeh,Maryam Abdollahi Shamami
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Accurately detecting sentiment polarity and intensity in product reviews and social media posts remains challenging due to informal and domain-specific language. To address this, we propose a novel hybrid lexicon-fuzzy-transformer framework that combines rule-based heuristics, contextual deep learning, and fuzzy logic to generate continuous sentiment scores reflecting both polarity and strength. The pipeline begins with VADER-based initial sentiment estimations, which are refined through a two-stage adjustment process. This involves leveraging confidence scores from DistilBERT, a lightweight transformer, and applying fuzzy logic principles to mitigate excessive neutrality bias and enhance granularity. A custom fuzzy inference system then maps the refined scores onto a 0 to 1 continuum, producing expert-like judgments. The framework is rigorously evaluated on four domain-specific datasets: food delivery, e-commerce, tourism, and fashion. Results show improved alignment with user ratings, better identification of sentiment extremes, and reduced misclassifications. Both quantitative metrics (distributional alignment, confusion matrices) and qualitative insights (case studies, runtime analysis) affirm the model's robustness and efficiency. This work demonstrates the value of integrating symbolic reasoning with neural models for interpretable, fine-grained sentiment analysis in linguistically dynamic domains.
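To make the two-stage adjustment concrete, here is a minimal sketch in Python. It assumes stand-in scores: `lexicon_score` plays the role of a VADER compound score in [-1, 1] and `model_conf` a DistilBERT confidence in [0, 1]; the triangular memberships and rule base are illustrative, not the paper's exact fuzzy system.

```python
def fuzzy_membership(x: float, low: float, mid: float, high: float) -> float:
    """Triangular membership function on [low, high], peaking at mid."""
    if x <= low or x >= high:
        return 0.0
    return (x - low) / (mid - low) if x <= mid else (high - x) / (high - mid)

def refine_sentiment(lexicon_score: float, model_conf: float) -> float:
    """Map a lexicon polarity plus transformer confidence to a [0, 1] score."""
    # Stage 1: shrink the over-neutral region using the model's confidence.
    adjusted = lexicon_score * (0.5 + 0.5 * model_conf)
    # Stage 2: fuzzy degrees for "negative" / "neutral" / "positive".
    neg = fuzzy_membership(adjusted, -1.5, -1.0, 0.0)
    neu = fuzzy_membership(adjusted, -0.5, 0.0, 0.5)
    pos = fuzzy_membership(adjusted, 0.0, 1.0, 1.5)
    # Defuzzify by a weighted centroid over output centers 0.0 / 0.5 / 1.0.
    total = neg + neu + pos
    return (0.5 * neu + 1.0 * pos) / total if total else 0.5

print(refine_sentiment(0.4, 0.9))   # mildly positive text, confident model
print(refine_sentiment(-0.1, 0.3))  # near-neutral text, unsure model
```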
[NLP-5] Paper2Web: Let's Make Your Paper Alive!
[Quick Read]: This paper addresses the shortcomings of current academic project-website generation methods (direct LLM generation, templates, or direct HTML conversion) in layout awareness and interactivity, together with the absence of a systematic evaluation suite. The key contributions are the Paper2Web benchmark dataset and a multi-dimensional evaluation framework covering Connectivity, Completeness, human-verified LLM-as-a-Judge (interactivity, aesthetics, informativeness), and PaperQuiz (paper-level knowledge retention), plus PWAgent, an autonomous pipeline that iteratively refines content and layout via MCP tools to improve emphasis, balance, and presentation quality, producing high-quality, interactive, multimedia-rich academic homepages at low cost and achieving the Pareto front in academic webpage generation.
Link: https://arxiv.org/abs/2510.15842
Authors: Yuhang Chen,Tianpeng Lv,Siyi Zhang,Yixiang Yin,Yao Wan,Philip S. Yu,Dongping Chen
Affiliations: ONE Lab, Huazhong University of Science and Technology; University of Illinois Chicago; University of Maryland
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review. Check this https URL for the unified platform to streamline all academic presentation
Abstract:Academic project websites can more effectively disseminate research when they clearly present core content and enable intuitive navigation and interaction. However, current approaches such as direct Large Language Model (LLM) generation, templates, or direct HTML conversion struggle to produce layout-aware, interactive sites, and a comprehensive evaluation suite for this task has been lacking. In this paper, we introduce Paper2Web, a benchmark dataset and multi-dimensional evaluation framework for assessing academic webpage generation. It incorporates rule-based metrics like Connectivity, Completeness and human-verified LLM-as-a-Judge (covering interactivity, aesthetics, and informativeness), and PaperQuiz, which measures paper-level knowledge retention. We further present PWAgent, an autonomous pipeline that converts scientific papers into interactive and multimedia-rich academic homepages. The agent iteratively refines both content and layout through MCP tools that enhance emphasis, balance, and presentation quality. Our experiments show that PWAgent consistently outperforms end-to-end baselines like template-based webpages and arXiv/alphaXiv versions by a large margin while maintaining low cost, achieving the Pareto-front in academic webpage generation.
[NLP-6] Emergence of Linear Truth Encodings in Language Models NEURIPS2025
[Quick Read]: This paper asks why large language models develop linear truth subspaces, i.e., why true and false statements become linearly separable in their representations. The key is a transparent one-layer transformer toy model that reproduces such truth subspaces end-to-end and exposes one concrete route by which they can arise: when the training distribution makes factual statements co-occur with other factual statements (and likewise for false ones), the model learns to distinguish truth from falsehood in order to lower the language-modeling loss, a pattern corroborated by experiments on pretrained language models. The study further reveals a two-phase learning dynamic: networks first quickly memorize individual factual associations, then over a longer horizon learn to linearly separate true from false, which in turn lowers the LM loss.
Link: https://arxiv.org/abs/2510.15804
Authors: Shauli Ravfogel,Gilad Yehudai,Tal Linzen,Joan Bruna,Alberto Bietti
Affiliations: New York University; Flatiron Institute
Subjects: Computation and Language (cs.CL)
Comments: Accepted in NeurIPS 2025
Abstract:Recent probing studies reveal that large language models exhibit linear subspaces that separate true from false statements, yet the mechanism behind their emergence is unclear. We introduce a transparent, one-layer transformer toy model that reproduces such truth subspaces end-to-end and exposes one concrete route by which they can arise. We study one simple setting in which truth encoding can emerge: a data distribution where factual statements co-occur with other factual statements (and vice-versa), encouraging the model to learn this distinction in order to lower the LM loss on future tokens. We corroborate this pattern with experiments in pretrained language models. Finally, in the toy setting we observe a two-phase learning dynamic: networks first memorize individual factual associations in a few steps, then – over a longer horizon – learn to linearly separate true from false, which in turn lowers language-modeling loss. Together, these results provide both a mechanistic demonstration and an empirical motivation for how and why linear truth representations can emerge in language models.
[NLP-7] On Non-interactive Evaluation of Animal Communication Translators
[Quick Read]: This paper addresses reference-free machine translation quality evaluation (MTQE): how to validate an animal-communication-to-English translator when no reference translations or interaction data exist. The central challenge is catching "hallucinations," fluent but wrong translations. The key idea combines segment-by-segment translation with the classic NLP shuffle test: translate the communication turn by turn and check how often the translations make more sense in their original order than permuted. Proof-of-concept experiments on data-scarce human languages and constructed languages show the metric correlates highly with standard reference-based evaluation, and a theoretical analysis suggests that interaction or observation may be unnecessary in the early stages of learning to translate.
Link: https://arxiv.org/abs/2510.15768
Authors: Orr Paradise,David F. Gruber,Adam Tauman Kalai
Affiliations: EPFL; Project CETI; OpenAI
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:If you had an AI Whale-to-English translator, how could you validate whether or not it is working? Does one need to interact with the animals or rely on grounded observations such as temperature? We provide theoretical and proof-of-concept experimental evidence suggesting that interaction and even observations may not be necessary for sufficiently complex languages. One may be able to evaluate translators solely by their English outputs, offering potential advantages in terms of safety, ethics, and cost. This is an instance of machine translation quality evaluation (MTQE) without any reference translations available. A key challenge is identifying "hallucinations," false translations which may appear fluent and plausible. We propose using segment-by-segment translation together with the classic NLP shuffle test to evaluate translators. The idea is to translate animal communication, turn by turn, and evaluate how often the resulting translations make more sense in order than permuted. Proof-of-concept experiments on data-scarce human languages and constructed languages demonstrate the potential utility of this evaluation methodology. These human-language experiments serve solely to validate our reference-free metric under data scarcity. It is found to correlate highly with a standard evaluation based on reference translations, which are available in our experiments. We also perform a theoretical analysis suggesting that interaction may not be necessary nor efficient in the early stages of learning to translate.
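The shuffle test itself is simple enough to sketch. In the toy code below, `translate` and `coherence` are stand-ins for any segment translator and any ordered-sequence plausibility scorer (for example a language-model log-probability); the paper's exact scoring is not reproduced.

```python
import random

def shuffle_test(segments, translate, coherence, n_perm=100, seed=0):
    """Return the fraction of permutations the in-order translation beats."""
    rng = random.Random(seed)
    translations = [translate(s) for s in segments]  # turn-by-turn translation
    original_score = coherence(translations)
    wins = 0
    for _ in range(n_perm):
        shuffled = translations[:]
        rng.shuffle(shuffled)
        if original_score > coherence(shuffled):
            wins += 1
    return wins / n_perm  # near 1.0 suggests order-sensitive, coherent output

# Toy usage with a stand-in scorer that prefers already-sorted sequences.
score = shuffle_test(
    ["a", "b", "c", "d"],
    translate=str.upper,
    coherence=lambda seq: -sum(x > y for x, y in zip(seq, seq[1:])),
)
print(score)
```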
[NLP-8] LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation
[Quick Read]: This paper addresses the limitations of conventional LLM evaluation: fixed-format tasks with reference answers fail to capture the nuanced, subjective, open-ended behavior of modern LLMs. The proposed solution is an automatic mutual-evaluation framework in which LLMs assess each other's outputs through self-play and peer review, with game-theoretic voting algorithms aggregating the peer assessments; the resulting model-generated rankings are then systematically compared with human voting behavior. The key novelty is the first joint integration of mutual evaluation, game-theoretic aggregation, and human-grounded validation, offering an interpretable path to human-aligned assessment of LLM capabilities.
Link: https://arxiv.org/abs/2510.15746
Authors: Gao Yang,Yuhang Liu,Siyu Miao,Xinyue Liang,Zhengyang Liu,Heyan Huang
Affiliations: Beijing Institute of Technology; Southeast Academy of Information Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Ideal or real - that is the question. In this work, we explore whether principles from game theory can be effectively applied to the evaluation of large language models (LLMs). This inquiry is motivated by the growing inadequacy of conventional evaluation practices, which often rely on fixed-format tasks with reference answers and struggle to capture the nuanced, subjective, and open-ended nature of modern LLM behavior. To address these challenges, we propose a novel alternative: automatic mutual evaluation, where LLMs assess each other's output through self-play and peer review. These peer assessments are then systematically compared with human voting behavior to evaluate their alignment with human judgment. Our framework incorporates game-theoretic voting algorithms to aggregate peer reviews, enabling a principled investigation into whether model-generated rankings reflect human preferences. Empirical results reveal both convergences and divergences between theoretical predictions and human evaluations, offering valuable insights into the promises and limitations of mutual evaluation. To the best of our knowledge, this is the first work to jointly integrate mutual evaluation, game-theoretic aggregation, and human-grounded validation for evaluating the capabilities of LLMs.
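The abstract does not specify which voting rules are used, so the sketch below shows one classic game-theoretic aggregator, a Copeland-style pairwise tally, applied to peer ballots; treat it as a representative example rather than the paper's method.

```python
from itertools import combinations

def copeland(rankings):
    """rankings: list of rankings, each a list of model names, best first."""
    models = set(rankings[0])
    wins = {m: 0 for m in models}
    for a, b in combinations(models, 2):
        # Count how many peer reviewers rank a above b.
        a_above = sum(r.index(a) < r.index(b) for r in rankings)
        if a_above * 2 > len(rankings):
            wins[a] += 1
        elif a_above * 2 < len(rankings):
            wins[b] += 1
    return sorted(models, key=lambda m: -wins[m])

# Each "ballot" is one LLM's ranking of its peers' answers.
peer_ballots = [
    ["gpt", "llama", "qwen"],
    ["llama", "gpt", "qwen"],
    ["gpt", "qwen", "llama"],
]
print(copeland(peer_ballots))  # aggregate ranking to compare with human votes
```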
[NLP-9] Attention Sinks in Diffusion Language Models
[Quick Read]: This paper investigates the poorly understood internal attention mechanisms of generative language models, specifically contrasting masked diffusion language models (DLMs) with autoregressive models (ARMs). Through empirical analysis, it documents the attention-sinking phenomenon in DLMs and finds it differs in two ways: sink positions in DLMs shift dynamically over the generation process, and masking the sinks causes only minor performance degradation, indicating DLMs depend on attention sinks far less than ARMs do. These observations offer a new window into the inner workings of diffusion-based language models.
Link: https://arxiv.org/abs/2510.15731
Authors: Maximo Eduardo Rulli,Simone Petruzzi,Edoardo Michielon,Fabrizio Silvestri,Simone Scardapane,Alessio Devoto
Affiliations: Sapienza University of Rome; Fastweb
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Masked Diffusion Language Models (DLMs) have recently emerged as a promising alternative to traditional Autoregressive Models (ARMs). DLMs employ transformer encoders with bidirectional attention, enabling parallel token generation while maintaining competitive performance. Although their efficiency and effectiveness have been extensively studied, the internal mechanisms that govern DLMs remain largely unexplored. In this work, we conduct an empirical analysis of DLM attention patterns, focusing on the attention sinking phenomenon, an effect previously observed in various transformer-based architectures. Our findings reveal that DLMs also exhibit attention sinks, but with distinct characteristics. First, unlike in ARMs, the sink positions in DLMs tend to shift throughout the generation process, displaying a dynamic behaviour. Second, while ARMs are highly sensitive to the removal of attention sinks, DLMs remain robust: masking sinks leads to only a minor degradation in performance. These results provide new insights into the inner workings of diffusion-based language models and highlight fundamental differences in how they allocate and utilize attention compared to autoregressive models.
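As a rough illustration of how attention sinks can be located, the sketch below flags key positions that absorb an outsized share of average attention in a single head's attention map; the 0.3 threshold and the toy matrix are assumptions, not the paper's protocol.

```python
import numpy as np

def find_sinks(attn: np.ndarray, threshold: float = 0.3) -> list[int]:
    """Return key positions that absorb an outsized share of attention."""
    incoming = attn.mean(axis=0)          # average attention received per key
    return [i for i, m in enumerate(incoming) if m > threshold]

# Toy map: column 0 dominates, mimicking a sink on the first token.
rng = np.random.default_rng(0)
attn = rng.random((8, 8))
attn[:, 0] += 4.0
attn /= attn.sum(axis=1, keepdims=True)   # rows are attention distributions
print(find_sinks(attn))                   # -> [0]

# For a DLM, repeating this at every denoising step would reveal whether
# the sink position drifts across the generation process, as reported above.
```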
[NLP-10] Cost-Aware Retrieval-Augmentation Reasoning Models with Adaptive Retrieval Depth
[Quick Read]: This paper addresses the high computational cost of retrieval-augmented reasoning models, where retrieval and reasoning tokens dominate resource usage. The core of the solution is a mechanism that dynamically adjusts the length of the retrieved document list based on the query and retrieval results, together with a cost-aware advantage function for reinforcement-learning training of efficient models, explored in both memory- and latency-bound implementations for proximal and group relative policy optimization. Across seven public question-answering datasets, the approach cuts latency by roughly 16-20% while improving effectiveness by about 5% on average (exact match).
Link: https://arxiv.org/abs/2510.15719
Authors: Helia Hashemi,Victor Rühle,Saravan Rajmohan
Affiliations: Microsoft
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:Reasoning models have gained significant attention due to their strong performance, particularly when enhanced with retrieval augmentation. However, these models often incur high computational costs, as both retrieval and reasoning tokens contribute substantially to the overall resource usage. In this work, we make the following contributions: (1) we propose a retrieval-augmented reasoning model that dynamically adjusts the length of the retrieved document list based on the query and retrieval results; (2) we develop a cost-aware advantage function for training of efficient retrieval-augmented reasoning models through reinforcement learning; and (3) we explore both memory- and latency-bound implementations of the proposed cost-aware framework for both proximal and group relative policy optimization algorithms. We evaluate our approach on seven public question answering datasets and demonstrate significant efficiency gains, without compromising effectiveness. In fact, we observed that the model latency decreases by ~16-20% across datasets, while its effectiveness increases by ~5% on average, in terms of exact match.
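The paper's exact advantage function is not given in the abstract; the sketch below shows the natural reading, a reward shaped by a token-cost penalty with a group-relative (GRPO-style) baseline, with `lam` and all names illustrative.

```python
import numpy as np

def cost_aware_advantages(rewards, token_costs, lam=0.001):
    """Penalize each rollout's reward by its retrieval+reasoning token usage."""
    shaped = np.asarray(rewards, dtype=float) - lam * np.asarray(token_costs)
    # Group-relative baseline: advantage of each rollout within its group.
    return (shaped - shaped.mean()) / (shaped.std() + 1e-8)

# Four rollouts of one query: similar correctness, different token budgets.
print(cost_aware_advantages(rewards=[1, 1, 0, 1],
                            token_costs=[800, 2400, 500, 1200]))
```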
[NLP-11] GraphMind: Interactive Novelty Assessment System for Accelerating Scientific Discovery EMNLP2025
[Quick Read]: This paper addresses the lack of transparency and traceability in novelty assessment of scientific literature, a pain point in peer review where reviewers often lack full knowledge of related work. Existing LLM-based approaches support literature comparison but typically lack an information-retrieval module that makes results verifiable. The key is GraphMind, an interactive web tool that integrates external APIs such as arXiv and Semantic Scholar with LLMs to capture the main structure of a paper, annotate its key elements, explore related work through multiple relationships, and provide verifiable, context-grounded insights for novelty assessment.
Link: https://arxiv.org/abs/2510.15706
Authors: Italo Luis da Silva,Hanqi Yan,Lin Gui,Yulan He
Affiliations: King's College London
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 9 pages, 6 figures, 3 tables, EMNLP 2025 Demo paper
Abstract:Large Language Models (LLMs) show strong reasoning and text generation capabilities, prompting their use in scientific literature analysis, including novelty assessment. While evaluating novelty of scientific papers is crucial for peer review, it requires extensive knowledge of related work, something not all reviewers have. While recent work on LLM-assisted scientific literature analysis supports literature comparison, existing approaches offer limited transparency and lack mechanisms for result traceability via an information retrieval module. To address this gap, we introduce GraphMind, an easy-to-use interactive web tool designed to assist users in evaluating the novelty of scientific papers or drafted ideas. Specifically, GraphMind enables users to annotate key elements of a paper, capture its main structure, explore related papers through various relationships, and assess novelty via verifiable contextual insights. The tool integrates external APIs such as arXiv and Semantic Scholar with LLMs to support annotation, extraction, retrieval and classification of papers. This combination provides users with a rich, structured view of a scientific idea's core contributions and its connections to existing work. GraphMind is available at this https URL and a demonstration video at this https URL. The source code is available at this https URL.
[NLP-12] Leveraging LLMs for Context-Aware Implicit Textual and Multimodal Hate Speech Detection LREC2026
[Quick Read]: This paper addresses the accuracy limits of textual and multimodal hate speech detection (HSD) caused by missing background knowledge. The key is to use large language models (LLMs) as dynamic knowledge bases that generate background context for the input and to incorporate that context into the HSD classifier in different ways. Both the contextual information itself and the fusion method matter: embedding concatenation yields the largest gains, up to 3 and 6 F1 points on the textual and multimodal setups respectively over a zero-context baseline.
Link: https://arxiv.org/abs/2510.15685
Authors: Joshua Wolfe Brook,Ilia Markov
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 9 figures, submitted to LREC 2026
Abstract:This research introduces a novel approach to textual and multimodal Hate Speech Detection (HSD), using Large Language Models (LLMs) as dynamic knowledge bases to generate background context and incorporate it into the input of HSD classifiers. Two context generation strategies are examined: one focused on named entities and the other on full-text prompting. Four methods of incorporating context into the classifier input are compared: text concatenation, embedding concatenation, a hierarchical transformer-based fusion, and LLM-driven text enhancement. Experiments are conducted on the textual Latent Hatred dataset of implicit hate speech and applied in a multimodal setting on the MAMI dataset of misogynous memes. Results suggest that both the contextual information and the method by which it is incorporated are key, with gains of up to 3 and 6 F1 points on textual and multimodal setups respectively, from a zero-context baseline to the highest-performing system, based on embedding concatenation.
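A minimal sketch of the best-performing fusion, embedding concatenation, is shown below; the 768-dimensional embeddings and the two-layer classification head are assumptions, and any sentence encoder could supply `post_emb` and `ctx_emb`.

```python
import torch
import torch.nn as nn

class ConcatFusionClassifier(nn.Module):
    def __init__(self, dim: int = 768, num_classes: int = 2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_classes)
        )

    def forward(self, post_emb: torch.Tensor, ctx_emb: torch.Tensor):
        # Context comes from an LLM-generated background passage, embedded
        # with the same encoder as the post itself.
        return self.head(torch.cat([post_emb, ctx_emb], dim=-1))

model = ConcatFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```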
[NLP-13] SQuAI: Scientific Question-Answering with Multi-Agent Retrieval-Augmented Generation CIKM2025
[Quick Read]: This paper addresses key shortcomings of existing retrieval-augmented generation (RAG) systems for scientific question answering: complex open-domain questions are hard to answer accurately, generated claims lack traceable citations, and retrieval must be efficient and precise over millions of scientific documents. The key is SQuAI, a scalable, trustworthy multi-agent RAG framework in which four collaborative agents decompose complex questions into sub-questions, retrieve targeted evidence via hybrid sparse-dense retrieval, and adaptively filter documents for contextual relevance, while in-line citations with supporting sentences from the source documents make every generated claim verifiable, improving faithfulness, answer relevance, and contextual relevance.
Link: https://arxiv.org/abs/2510.15682
Authors: Ines Besrour,Jingbo He,Tobias Schreieder,Michael Färber
Affiliations: TU Dresden; ScaDS.AI Dresden
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted at CIKM 2025
Abstract:We present SQuAI (this https URL), a scalable and trustworthy multi-agent retrieval-augmented generation (RAG) framework for scientific question answering (QA) with large language models (LLMs). SQuAI addresses key limitations of existing RAG systems in the scholarly domain, where complex, open-domain questions demand accurate answers, explicit claims with citations, and retrieval across millions of scientific documents. Built on over 2.3 million full-text papers from arXiv.org, SQuAI employs four collaborative agents to decompose complex questions into sub-questions, retrieve targeted evidence via hybrid sparse-dense retrieval, and adaptively filter documents to improve contextual relevance. To ensure faithfulness and traceability, SQuAI integrates in-line citations for each generated claim and provides supporting sentences from the source documents. Our system improves faithfulness, answer relevance, and contextual relevance by up to +0.088 (12%) over a strong RAG baseline. We further release a benchmark of 1,000 scientific question-answer-evidence triplets to support reproducibility. With transparent reasoning, verifiable citations, and domain-wide scalability, SQuAI demonstrates how multi-agent RAG enables more trustworthy scientific QA with LLMs.
[NLP-14] Build Your Personalized Research Group: A Multiagent Framework for Continual and Interactive Science Automation
[Quick Read]: This paper addresses two core problems in current automated scientific-discovery systems: rigid, pre-programmed workflows that cannot adapt to intermediate findings, and inadequate context management for long-horizon research. The key is freephdlabor, an open-source multiagent framework with fully dynamic workflows driven by real-time agent reasoning and a modular architecture that lets users modify, add, or remove agents for domain-specific needs. Infrastructure for automatic context compaction, workspace-based communication, cross-session memory persistence, and non-blocking human intervention turns automated research from isolated single runs into continual research programs that build systematically on prior exploration and incorporate human feedback.
Link: https://arxiv.org/abs/2510.15624
Authors: Ed Li,Junyu Ren,Xintian Pan,Cat Yan,Chuanhao Li,Dirk Bergemann,Zhuoran Yang
Affiliations: Yale University; University of Chicago; Oxford University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 37 pages, 5 figures. Code: this https URL
Abstract:The automation of scientific discovery represents a critical milestone in Artificial Intelligence (AI) research. However, existing agentic systems for science suffer from two fundamental limitations: rigid, pre-programmed workflows that cannot adapt to intermediate findings, and inadequate context management that hinders long-horizon research. We present freephdlabor, an open-source multiagent framework featuring fully dynamic workflows determined by real-time agent reasoning and a modular architecture enabling seamless customization: users can modify, add, or remove agents to address domain-specific requirements. The framework provides comprehensive infrastructure including automatic context compaction, workspace-based communication to prevent information degradation, memory persistence across sessions, and non-blocking human intervention mechanisms. These features collectively transform automated research from isolated, single-run attempts into continual research programs that build systematically on prior explorations and incorporate human feedback. By providing both the architectural principles and practical implementation for building customizable co-scientist systems, this work aims to facilitate broader adoption of automated research across scientific domains, enabling practitioners to deploy interactive multiagent systems that autonomously conduct end-to-end research, from ideation through experimentation to publication-ready manuscripts.
[NLP-15] HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination
[Quick Read]: This paper addresses the problem that current large language models (LLMs) in scientific workflows tend to output a single explanation and explore the space of admissible hypotheses poorly, even though scientific problems are often underdetermined: multiple mechanistically distinct hypotheses can fit the same observations. The key is HypoSpace, a diagnostic suite that treats LLMs as samplers of finite hypothesis sets and measures three complementary indicators: Validity (precision of proposals consistent with observations), Uniqueness (non-redundancy among proposals), and Recovery (coverage of the enumerated admissible set). Instantiated in three structured domains with deterministic validators and exactly enumerated hypothesis spaces, it reveals mode collapse that correctness-only metrics cannot see, serving as a controlled probe rather than a leaderboard for methods that explicitly explore and cover admissible explanation spaces.
Link: https://arxiv.org/abs/2510.15614
Authors: Tingting Chen,Beibei Lin,Zifeng Yuan,Qiran Zou,Hongyu He,Yew-Soon Ong,Anirudh Goyal,Dianbo Liu
Affiliations: National University of Singapore; Nanyang Technological University; Meta Superintelligence Labs
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:As language models are increasingly used in scientific workflows, evaluating their ability to propose sets of explanations-not just a single correct answer-becomes critical. Many scientific problems are underdetermined: multiple, mechanistically distinct hypotheses are consistent with the same observations. We introduce HypoSpace, a diagnostic suite that treats LLMs as samplers of finite hypothesis sets and measures three complementary indicators: Validity (precision of proposals consistent with observations), Uniqueness (non-redundancy among proposals), and Recovery (coverage of the enumerated admissible set). We instantiate HypoSpace in three structured domains with deterministic validators and exactly enumerated hypothesis spaces: (i) causal graphs from perturbations, (ii) gravity-constrained 3D voxel reconstruction from top-down projections, and (iii) Boolean genetic interactions. Across instruction-tuned and reasoning-focused models, Validity often remains high while Uniqueness and Recovery degrade as the admissible space grows, revealing mode collapse that is invisible to correctness-only metrics. HypoSpace offers a controlled probe-rather than a leaderboard-for methods that explicitly explore and cover admissible explanation spaces. Code is available at: this https URL.
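The three indicators are straightforward set computations once hypotheses are put in a canonical, hashable form; a minimal sketch follows, with toy hypothesis names.

```python
def hypospace_metrics(proposals, admissible):
    """proposals: LLM samples; admissible: exactly enumerated valid set."""
    admissible, unique = set(admissible), set(proposals)
    valid = [p for p in proposals if p in admissible]
    return {
        "validity": len(valid) / len(proposals),                 # precision
        "uniqueness": len(unique) / len(proposals),              # non-redundancy
        "recovery": len(unique & admissible) / len(admissible),  # coverage
    }

# A model that keeps re-proposing one correct hypothesis: high validity,
# but mode collapse shows up as low uniqueness and low recovery.
print(hypospace_metrics(["h1", "h1", "h1", "h2"], {"h1", "h3", "h4"}))
```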
[NLP-16] Unleashing Scientific Reasoning for Bio-experimental Protocol Generation via Structured Component-based Reward Mechanism
[Quick Read]: This paper addresses the incompleteness and logical inconsistency of experimental protocols generated by current large language models (LLMs), which limits their use in automating reproducible science. The key is the "Sketch-and-Fill" paradigm, which separates protocol generation into analysis, structuring, and expression stages so every step is explicit and verifiable, combined with a structured component-based reward mechanism that scores step granularity, action order, and semantic fidelity to align optimization with experimental reliability. Built on these components (plus the SciRecipe dataset of over 12K structured protocols across 27 biological subfields), the Thoth system is trained via a staged Knowledge-to-Action pipeline and generates reliable, executable protocols, significantly outperforming both proprietary and open-source LLMs across benchmarks.
Link: https://arxiv.org/abs/2510.15600
Authors: Haoran Sun,Yankai Jiang,Zhenyu Tang,Yaning Pan,Shuang Gu,Zekai Lin,Lilong Wang,Wenjie Lou,Lei Liu,Lei Bai,Xiaosong Wang
Affiliations: Shanghai Artificial Intelligence Laboratory; Fudan University; Shanghai Jiao Tong University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:The foundation of reproducible science lies in protocols that are precise, logically ordered, and executable. The autonomous generation of these protocols through natural language queries could greatly improve the efficiency of the reproduction process. However, current leading large language models (LLMs) often generate incomplete or inconsistent protocols, limiting their utility. To address this limitation, we first introduce SciRecipe, a large-scale dataset of over 12K structured protocols spanning 27 biological subfields and encompassing both comprehension and problem-solving tasks. To further improve protocol generation, we propose the “Sketch-and-Fill” paradigm, which separates analysis, structuring, and expression to ensure each step is explicit and verifiable. Complementing this, the structured component-based reward mechanism evaluates step granularity, action order, and semantic fidelity, aligning model optimization with experimental reliability. Building on these components, we develop Thoth, trained through a staged Knowledge-to-Action process that progresses from knowledge acquisition to operational reasoning and ultimately to robust, executable protocol generation. Across multiple benchmarks, Thoth consistently surpasses both proprietary and open-source LLMs, achieving significant improvements in step alignment, logical sequencing, and semantic accuracy. Our approach paves the way for reliable scientific assistants that bridge knowledge with experimental execution. All data, code, and models will be released publicly.
[NLP-17] The Elephant in the Coreference Room: Resolving Coreference in Full-Length French Fiction Works
[Quick Read]: This paper addresses the scarcity of coreference-resolution datasets for long literary texts, especially full-length novels with complex, long-range coreference chains. The key contributions are a new annotated corpus of three full-length French novels (over 285,000 tokens in total) and a modular coreference-resolution pipeline that supports fine-grained error analysis, achieving competitive performance that scales to long documents and proving useful for inferring the gender of fictional characters in downstream literary analysis.
Link: https://arxiv.org/abs/2510.15594
Authors: Antoine Bourgois,Thierry Poibeau
Affiliations: Lattice (CNRS & ENS-PSL & Université Sorbonne Nouvelle)
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:While coreference resolution is attracting more interest than ever from computational literature researchers, representative datasets of fully annotated long documents remain surprisingly scarce. In this paper, we introduce a new annotated corpus of three full-length French novels, totaling over 285,000 tokens. Unlike previous datasets focused on shorter texts, our corpus addresses the challenges posed by long, complex literary works, enabling evaluation of coreference models in the context of long reference chains. We present a modular coreference resolution pipeline that allows for fine-grained error analysis. We show that our approach is competitive and scales effectively to long documents. Finally, we demonstrate its usefulness to infer the gender of fictional characters, showcasing its relevance for both literary analysis and downstream NLP tasks.
[NLP-18] Leveraging Test Driven Development with Large Language Models for Reliable and Verifiable Spreadsheet Code Generation: A Research Framework
[Quick Read]: This paper addresses critical problems of large language models (LLMs) in generating code and spreadsheet logic, namely hallucinations, subtle logical inconsistencies, and syntactic errors, which are especially dangerous in high-stakes domains such as financial modelling and scientific computation. The key is to combine the proven software-engineering practice of Test-Driven Development (TDD) with LLM generation: a "test first" paradigm provides both technical constraints and cognitive scaffolding that steer LLM outputs toward more accurate, verifiable, and comprehensible solutions, benefiting especially spreadsheet users who lack formal programming training yet face serious consequences from logical errors.
Link: https://arxiv.org/abs/2510.15585
Authors: Dr Simon Thorne,Dr Advait Sarkar
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL); Programming Languages (cs.PL)
Comments: 16 pages
Abstract:Large Language Models (LLMs), such as ChatGPT, are increasingly leveraged for generating both traditional software code and spreadsheet logic. Despite their impressive generative capabilities, these models frequently exhibit critical issues such as hallucinations, subtle logical inconsistencies, and syntactic errors, risks particularly acute in high stakes domains like financial modelling and scientific computations, where accuracy and reliability are paramount. This position paper proposes a structured research framework that integrates the proven software engineering practice of Test-Driven Development (TDD) with Large Language Model (LLM) driven generation to enhance the correctness of, reliability of, and user confidence in generated outputs. We hypothesise that a “test first” methodology provides both technical constraints and cognitive scaffolding, guiding LLM outputs towards more accurate, verifiable, and comprehensible solutions. Our framework, applicable across diverse programming contexts, from spreadsheet formula generation to scripting languages such as Python and strongly typed languages like Rust, includes an explicitly outlined experimental design with clearly defined participant groups, evaluation metrics, and illustrative TDD based prompting examples. By emphasising test driven thinking, we aim to improve computational thinking, prompt engineering skills, and user engagement, particularly benefiting spreadsheet users who often lack formal programming training yet face serious consequences from logical errors. We invite collaboration to refine and empirically evaluate this approach, ultimately aiming to establish responsible and reliable LLM integration in both educational and professional development practices.
[NLP-19] BiMax: Bidirectional MaxSim Score for Document-Level Alignment EMNLP2025
[Quick Read]: This paper addresses the difficulty of balancing efficiency and accuracy in cross-lingual document alignment, especially at web-mining scale, where high-precision methods such as Optimal Transport (OT) are accurate but computationally expensive. The key is the cross-lingual Bidirectional Maxsim score (BiMax), a new doc-to-doc similarity that simplifies the similarity computation: on the WMT16 bilingual document alignment task it attains accuracy comparable to OT with roughly a 100-fold speedup, greatly improving the scalability of large-scale multilingual document alignment.
Link: https://arxiv.org/abs/2510.15577
Authors: Xiaotian Wang,Takehito Utsuro,Masaaki Nagata
Affiliations: University of Tsukuba; University of Tokyo; NTT Communication Science Laboratories, NTT Corporation
Subjects: Computation and Language (cs.CL)
Comments: accepted at Findings of EMNLP2025
Abstract:Document alignment is necessary for the hierarchical mining (Bañón et al., 2020; Morishita et al., 2022), which aligns documents across source and target languages within the same web domain. Several high precision sentence embedding-based methods have been developed, such as TK-PERT (Thompson and Koehn, 2020) and Optimal Transport (OT) (Clark et al., 2019; El-Kishky and Guzmán, 2020). However, given the massive scale of web mining data, both accuracy and speed must be considered. In this paper, we propose a cross-lingual Bidirectional Maxsim score (BiMax) for computing doc-to-doc similarity, to improve efficiency compared to the OT method. Consequently, on the WMT16 bilingual document alignment task, BiMax attains accuracy comparable to OT with an approximate 100-fold speed increase. Meanwhile, we also conduct a comprehensive analysis to investigate the performance of current state-of-the-art multilingual sentence embedding models. All the alignment methods in this paper are publicly available as a tool called EmbDA (this https URL).
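Under the natural reading of the abstract, BiMax averages each document's best per-sentence match in both directions; the sketch below implements that reading over L2-normalized sentence embeddings, and the paper's exact weighting may differ.

```python
import numpy as np

def bimax(doc_a: np.ndarray, doc_b: np.ndarray) -> float:
    """doc_a: (n, d), doc_b: (m, d) L2-normalized sentence embeddings."""
    sim = doc_a @ doc_b.T                 # (n, m) cosine similarities
    a_to_b = sim.max(axis=1).mean()       # each A-sentence's best B match
    b_to_a = sim.max(axis=0).mean()       # each B-sentence's best A match
    return 0.5 * (a_to_b + b_to_a)

rng = np.random.default_rng(0)
a = rng.normal(size=(5, 16)); a /= np.linalg.norm(a, axis=1, keepdims=True)
b = rng.normal(size=(7, 16)); b /= np.linalg.norm(b, axis=1, keepdims=True)
print(bimax(a, b))
```

A single matrix product plus row/column maxima is what makes the score so much cheaper than solving an optimal-transport problem per document pair.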
[NLP-20] From Ghazals to Sonnets: Decoding the Polysemous Expressions of Love Across Languages
[Quick Read]: This paper addresses the subtle semantic differences among three core Urdu words for love (pyaar, muhabbat, and ishq) and the difficulty of finding direct English equivalents for their cultural connotations in poetry. The key is a multi-pronged analysis: a polysemic case-study approach that closely examines how the three words are used in the context of Urdu poetry, and word embeddings for related Urdu and English terms that quantify and visualize the semantic space of love in each language, revealing emotional dimensions and cultural nuances unique to Urdu.
Link: https://arxiv.org/abs/2510.15569
Authors: Syed Mohammad Sualeh Ali
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:This paper delves into the intricate world of Urdu poetry, exploring its thematic depths through a lens of polysemy. By focusing on the nuanced differences between three seemingly synonymous words (pyaar, muhabbat, and ishq) we expose a spectrum of emotions and experiences unique to the Urdu language. This study employs a polysemic case study approach, meticulously examining how these words are interwoven within the rich tapestry of Urdu poetry. By analyzing their usage and context, we uncover a hidden layer of meaning, revealing subtle distinctions which lack direct equivalents in English literature. Furthermore, we embark on a comparative analysis, generating word embeddings for both Urdu and English terms related to love. This enables us to quantify and visualize the semantic space occupied by these words, providing valuable insights into the cultural and linguistic nuances of expressing love. Through this multifaceted approach, our study sheds light on the captivating complexities of Urdu poetry, offering a deeper understanding and appreciation for its unique portrayal of love and its myriad expressions
[NLP-21] Finetuning LLMs for EvaCun 2025 token prediction shared task
[Quick Read]: This paper presents a submission for the EvaCun 2025 token prediction shared task: predicting tokens from a given context. The key is fine-tuning three LLMs (Command-R, Mistral, and Aya Expanse) on the organizer-provided task data and producing predictions with three different prompts. The authors apply no task-specific preprocessing or filtering, fine-tuning directly on the raw training data, and compare the three prompting approaches on a held-out portion of the data.
Link: https://arxiv.org/abs/2510.15561
Authors: Josef Jon,Ondřej Bojar
Affiliations: Charles University, Faculty of Mathematics and Physics
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:In this paper, we present our submission for the token prediction task of EvaCun 2025. Our systems are based on LLMs (Command-R, Mistral, and Aya Expanse) fine-tuned on the task data provided by the organizers. As we only possess a very superficial knowledge of the subject field and the languages of the task, we simply used the training data without any task-specific adjustments, preprocessing, or filtering. We compare 3 different approaches (based on 3 different prompts) of obtaining the predictions, and we evaluate them on a held-out part of the data.
[NLP-22] KITE: A Benchmark for Evaluating Korean Instruction-Following Abilities in Large Language Models
[Quick Read]: This paper addresses the lack of systematic evaluation of instruction-following in large language models (LLMs) for non-English languages, particularly Korean, whose distinct syntax, rich morphology, honorific system, and dual numbering systems are not covered by existing benchmarks that focus on factual knowledge or multiple-choice tests. The key is KITE (Korean Instruction-following Task Evaluation), a comprehensive benchmark of open-ended tasks covering both general and Korean-specific instructions, with an evaluation pipeline that combines automated metrics and human assessment to reveal performance disparities across models and the strengths and weaknesses of each.
Link: https://arxiv.org/abs/2510.15558
Authors: Dongjun Kim,Chanhee Park,Chanjun Park,Heuiseok Lim
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 13 pages, 3 figures, 5 tables
Abstract:The instruction-following capabilities of large language models (LLMs) are pivotal for numerous applications, from conversational agents to complex reasoning systems. However, current evaluations predominantly focus on English models, neglecting the linguistic and cultural nuances of other languages. Specifically, Korean, with its distinct syntax, rich morphological features, honorific system, and dual numbering systems, lacks a dedicated benchmark for assessing open-ended instruction-following capabilities. To address this gap, we introduce the Korean Instruction-following Task Evaluation (KITE), a comprehensive benchmark designed to evaluate both general and Korean-specific instructions. Unlike existing Korean benchmarks that focus mainly on factual knowledge or multiple-choice testing, KITE directly targets diverse, open-ended instruction-following tasks. Our evaluation pipeline combines automated metrics with human assessments, revealing performance disparities across models and providing deeper insights into their strengths and weaknesses. By publicly releasing the KITE dataset and code, we aim to foster further research on culturally and linguistically inclusive LLM development and inspire similar endeavors for other underrepresented languages.
[NLP-23] Think Parallax: Solving Multi-Hop Problems via Multi-View Knowledge-Graph-Based Retrieval-Augmented Generation
[Quick Read]: This paper addresses hallucination in multi-hop reasoning with large language models (LLMs) and the reliance of knowledge-graph RAG methods on flat embeddings and noisy path exploration. The key is ParallaxRAG, which symmetrically decouples queries and knowledge-graph triples into multi-view spaces, explicitly enforcing head diversity while constraining weakly related paths, thereby building cleaner subgraphs and guiding the LLM through grounded, step-wise reasoning. The central observation is that different attention heads specialize in semantic relations at distinct reasoning stages, contributing to different hops of the reasoning chain; this multi-view head specialization gives knowledge-grounded multi-hop reasoning an interpretable and efficient architectural basis.
Link: https://arxiv.org/abs/2510.15552
Authors: Jinliang Liu
Affiliations: UESTC
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) excel at language understanding but often hallucinate and struggle with multi-hop reasoning. Knowledge-graph-based retrieval-augmented generation (KG-RAG) offers grounding, yet most methods rely on flat embeddings and noisy path exploration. We propose ParallaxRAG, a framework that symmetrically decouples queries and graph triples into multi-view spaces, enabling a robust retrieval architecture that explicitly enforces head diversity while constraining weakly related paths. Central to our approach is the observation that different attention heads specialize in semantic relations at distinct reasoning stages, contributing to different hops of the reasoning chain. This specialization allows ParallaxRAG to construct cleaner subgraphs and guide LLMs through grounded, step-wise reasoning. Experiments on WebQSP and CWQ, under our unified, reproducible setup (BGE-M3 + Llama3.1-8B), demonstrate competitive retrieval and QA performance, alongside reduced hallucination and good generalization. Our results highlight multi-view head specialization as a principled direction for knowledge-grounded multi-hop reasoning. Our implementation will be released as soon as the paper is accepted.
[NLP-24] Rethinking Cross-lingual Gaps from a Statistical Viewpoint
[Quick Read]: This paper addresses the cross-lingual gap in LLMs: accuracy drops when knowledge acquired in a source language is queried in a target language. Whereas prior work attributes the gap to divergent latent representations of the source and target languages, this paper hypothesizes that the variance of target-language responses is the main cause, and for the first time formalizes the gap via a bias-variance decomposition. The key remedy is inference-time interventions that control response variance, in particular a simple prompt instruction that reduces output variance and improves target-language accuracy by 20-25% across models.
Link: https://arxiv.org/abs/2510.15551
Authors: Vihari Piratla,Purvam Jain,Darshan Singh,Partha Talukdar,Trevor Cohn
Affiliations: Google DeepMind; Google Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 22 pages
Abstract:Any piece of knowledge is usually expressed in one or a handful of natural languages on the web or in any large corpus. Large Language Models (LLMs) act as a bridge by acquiring knowledge from a source language and making it accessible when queried from target languages. Prior research has pointed to a cross-lingual gap, viz., a drop in accuracy when the knowledge is queried in a target language compared to when the query is in the source language. Existing research has rationalized divergence in latent representations in source and target languages as the source of cross-lingual gap. In this work, we take an alternative view and hypothesize that the variance of responses in the target language is the main cause of this gap. For the first time, we formalize the cross-lingual gap in terms of bias-variance decomposition. We present extensive experimental evidence which support proposed formulation and hypothesis. We then reinforce our hypothesis through multiple inference-time interventions that control the variance and reduce the cross-lingual gap. We demonstrate a simple prompt instruction to reduce the response variance, which improved target accuracy by 20-25% across different models.
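The variance hypothesis suggests a simple measurement: sample the same query many times and check how often answers disagree. The sketch below does this with a stand-in sampler; the paper's estimator and its exact prompt intervention are not reproduced here.

```python
from collections import Counter
import random

def response_variance(sample_fn, prompt, n=20):
    """Share of sampled answers that disagree with the majority answer."""
    answers = [sample_fn(prompt) for _ in range(n)]
    majority = Counter(answers).most_common(1)[0][1]
    return 1.0 - majority / n  # 0.0 means perfectly consistent responses

def noisy_model(prompt):  # stand-in for sampling an LLM at temperature > 0
    return random.choice(["Paris", "Paris", "Paris", "Lyon"])

print(response_variance(noisy_model, "Quelle est la capitale de la France ?"))
```

Comparing this quantity between source- and target-language phrasings of the same query, before and after a variance-reducing instruction, mirrors the kind of intervention study the abstract describes.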
[NLP-25] okenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs
[Quick Read]: This paper addresses inference acceleration for large language models (LLMs), specifically the constraint that existing speculative decoding (SD) methods require the draft and target models to share the same vocabulary, which limits the pool of usable draft models and often forces training a new one from scratch. The key is TokenTiming, which re-encodes the draft token sequence into a new target token sequence and uses Dynamic Time Warping (DTW) to build a mapping that transfers probability distributions for speculative sampling, enabling universal speculative decoding across mismatched vocabularies with any off-the-shelf models, without retraining or modification, and achieving a 1.57x speedup.
Link: https://arxiv.org/abs/2510.15545
Authors: Sibo Xiao,Jinyuan Fu,Zhongle Xie,Lidan Shou
Affiliations: Zhejiang University; College of Computer Science and Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Accelerating the inference of large language models (LLMs) has been a critical challenge in generative AI. Speculative decoding (SD) substantially improves LLM inference efficiency. However, its utility is limited by a fundamental constraint: the draft and target models must share the same vocabulary, thus limiting the herd of available draft models and often necessitating the training of a new model from scratch. Inspired by Dynamic Time Warping (DTW), a classic algorithm for aligning time series, we propose the algorithm TokenTiming for universal speculative decoding. It operates by re-encoding the draft token sequence to get a new target token sequence, and then uses DTW to build a mapping to transfer the probability distributions for speculative sampling. Benefiting from this, our method accommodates mismatched vocabularies and works with any off-the-shelf models without retraining and modification. We conduct comprehensive experiments on various tasks, demonstrating 1.57x speedup. This work enables a universal approach for draft model selection, making SD a more versatile and practical tool for LLM acceleration.
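A minimal DTW sketch over token strings is shown below to illustrate the alignment step; the character-overlap cost is an illustrative stand-in for whatever distance TokenTiming actually uses, and the probability-transfer step is omitted.

```python
def char_overlap(a: str, b: str) -> float:
    """Jaccard overlap of the two tokens' character sets, in [0, 1]."""
    return len(set(a) & set(b)) / max(len(set(a) | set(b)), 1)

def dtw_align(draft_toks, target_toks):
    """Return (draft_idx, target_idx) pairs on the minimum-cost warping path."""
    n, m = len(draft_toks), len(target_toks)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    back = {}
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 1.0 - char_overlap(draft_toks[i - 1], target_toks[j - 1])
            prev = min((cost[i - 1][j - 1], (i - 1, j - 1)),
                       (cost[i - 1][j],     (i - 1, j)),
                       (cost[i][j - 1],     (i, j - 1)))
            cost[i][j] = d + prev[0]
            back[(i, j)] = prev[1]
    path, ij = [], (n, m)      # walk back to recover the token mapping
    while ij != (0, 0):
        path.append((ij[0] - 1, ij[1] - 1))
        ij = back[ij]
    return path[::-1]

print(dtw_align(["hel", "lo", "_wor", "ld"], ["hello", "_world"]))
```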
[NLP-26] MCA: Modality Composition Awareness for Robust Composed Multimodal Retrieval
[Quick Read]: This paper addresses the poor out-of-distribution robustness of unified encoders trained with conventional contrastive learning for composed multimodal retrieval, which tend to learn modality shortcuts. The key is a modality composition awareness framework with two mechanisms: a preference loss that forces multimodal embeddings to outperform their unimodal counterparts, and a composition regularization objective that aligns multimodal embeddings with prototypes composed from their unimodal parts. Together they explicitly model the structural relationship between a composed representation and its unimodal components, improving retrieval robustness of MLLM-based unified encoders under distribution shifts.
Link: https://arxiv.org/abs/2510.15543
Authors: Qiyu Wu,Shuyang Cui,Satoshi Hayakawa,Wei-Yao Wang,Hiromi Wakaki,Yuki Mitsufuji
Affiliations: Sony Group Corporation; Sony AI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments:
Abstract:Multimodal retrieval, which seeks to retrieve relevant content across modalities such as text or image, supports applications from AI search to content production. Despite the success of separate-encoder approaches like CLIP, which align modality-specific embeddings with contrastive learning, recent multimodal large language models (MLLMs) enable a unified encoder that directly processes composed inputs. While flexible and advanced, we identify that unified encoders trained with conventional contrastive learning are prone to learning modality shortcuts, leading to poor robustness under distribution shifts. We propose a modality composition awareness framework to mitigate this issue. Concretely, a preference loss enforces multimodal embeddings to outperform their unimodal counterparts, while a composition regularization objective aligns multimodal embeddings with prototypes composed from their unimodal parts. These objectives explicitly model structural relationships between the composed representation and its unimodal counterparts. Experiments on various benchmarks show gains in out-of-distribution retrieval, highlighting modality composition awareness as an effective principle for robust composed multimodal retrieval when utilizing MLLMs as the unified encoder.
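A hedged PyTorch sketch of the two objectives follows; the margin, the mean-composed prototype, and all shapes are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mca_losses(multi_emb, text_emb, img_emb, target_emb, margin=0.1):
    """All inputs: (batch, dim) L2-normalized embeddings."""
    sim = lambda a, b: (a * b).sum(-1)  # cosine similarity per row
    # Preference loss: the composed embedding should match the target
    # better than either unimodal embedding does, by at least a margin.
    pref = (F.relu(margin + sim(text_emb, target_emb) - sim(multi_emb, target_emb))
            + F.relu(margin + sim(img_emb, target_emb) - sim(multi_emb, target_emb))).mean()
    # Composition regularization: stay close to a prototype composed
    # from the unimodal parts.
    proto = F.normalize(text_emb + img_emb, dim=-1)
    comp = (1.0 - sim(multi_emb, proto)).mean()
    return pref, comp

e = lambda: F.normalize(torch.randn(8, 256), dim=-1)
pref, comp = mca_losses(e(), e(), e(), e())
print(pref.item(), comp.item())
```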
[NLP-27] Latent Reasoning in LLMs as a Vocabulary-Space Superposition
[Quick Read]: This paper addresses the heavy computational overhead of chain-of-thought (CoT) prompting in large language models (LLMs). Existing latent-reasoning methods lower the cost by reasoning in latent space, but performance drops sharply because the latent space is unstructured. The key is Latent-SFT, which restricts the latent space to the column space of the LLM vocabulary so that latent reasoning becomes a superposition over vocabulary probabilities, collapsing at the end into an eigenstate of explicit reasoning to produce the final answer. Training proceeds in two stages: first, specialized attention masks guide a Latent Token Encoder to generate controllable latent tokens on which the LLM can condition; second, the encoder is discarded and the LLM is trained with KL and cross-entropy losses to generate latent tokens autonomously, matching explicit SFT performance while cutting reasoning chains by up to 4x and clearly surpassing prior latent methods.
Link: https://arxiv.org/abs/2510.15522
Authors: Jingcheng Deng,Liang Pang,Zihao Wei,Shichen Xu,Zenghao Duan,Kun Xu,Yang Song,Huawei Shen,Xueqi Cheng
Affiliations: State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) demonstrate strong reasoning abilities with chain-of-thought prompting, but explicit reasoning introduces substantial computational overhead. Recent work on latent reasoning reduces this cost by reasoning in latent space without explicit supervision, but performance drops significantly. Our preliminary experiments suggest that this degradation stems from the unstructured latent space, which makes fitting latent tokens difficult. To address this, we restrict the latent space to the column space of the LLM vocabulary, treating latent reasoning as a superposition over vocabulary probabilities. Once latent reasoning concludes, it collapses into an eigenstate of explicit reasoning to yield the final answer. Based on this idea, we propose Latent-SFT, a two-stage learning framework. In the first stage, we design two specialized attention masks to guide the Latent Token Encoder in generating latent tokens, allowing the LLM to produce the correct answer conditioned on them. In the second stage, the Latent Token Encoder is discarded, and the LLM is directly trained to generate these latent tokens autonomously for latent reasoning, optimized with KL and CE losses. Latent-SFT sets a new state of the art on GSM8k, matching explicit SFT performance while cutting reasoning chains by up to 4 times and outperforming prior latent methods. On Math500 and AIME24, lexical probability-based latent reasoning also clearly surpasses hidden-state-based approaches. Our metrics of effective compression rate and effective global parallelism further show that latent reasoning is both the compression of a single path and the superposition of multiple paths.
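The "superposition over vocabulary probabilities" can be illustrated in a few lines: a latent token is a probability-weighted mixture of token embeddings, and "collapse" selects the argmax token. Sizes below are toy stand-ins for a real LLM's embedding matrix, and the scoring rule is illustrative.

```python
import torch

vocab_size, dim = 1000, 64
embed = torch.randn(vocab_size, dim)   # stand-in LLM input embedding matrix
hidden = torch.randn(dim)              # stand-in hidden state at this step

# Superposition: a distribution over the vocabulary, not a hard token.
probs = torch.softmax(embed @ hidden / dim**0.5, dim=-1)
latent_token = probs @ embed           # lives in the vocabulary's column space
collapsed_id = int(probs.argmax())     # "collapse" to an explicit token id
print(latent_token.shape, collapsed_id)
```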
[NLP-28] From Characters to Tokens: Dynamic Grouping with Hierarchical BPE
[Quick Read]: This paper addresses two problems at once: subword tokenizers such as Byte Pair Encoding (BPE) represent rare words inefficiently and require large embedding matrices, while pure character-level models create performance bottlenecks in Transformer architectures. Existing hierarchical models that group characters into patches either depend on whitespace or require auxiliary models. The key is a dynamic character-grouping method that reuses the structure of an existing BPE tokenization: explicit end-of-patch markers are appended to BPE tokens and a second-level BPE compression stage controls patch granularity, yielding efficient, flexible, language-agnostic representations that match or exceed entropy- and whitespace-based dynamic patching while keeping the vocabulary compact.
Link: https://arxiv.org/abs/2510.15517
Authors: Rares Dolga,Lucas Maystre,Tudor Berariu,David Barber
Affiliations: University College London; UiPath
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Subword tokenization methods like Byte Pair Encoding (BPE) are widely used in large language models due to their balance of vocabulary compactness and representational power. However, they suffer from inefficiencies in representing rare words and require large embedding matrices. Character-level models address these issues but introduce performance bottlenecks, particularly in Transformer-based architectures. Recent hierarchical models attempt to merge the benefits of both paradigms by grouping characters into patches, but existing patching strategies either rely on whitespace, limiting applicability to certain languages, or require auxiliary models that introduce new dependencies. In this paper, we propose a dynamic character grouping method that leverages the structure of existing BPE tokenization without requiring additional models. By appending explicit end-of-patch markers to BPE tokens and introducing a second-level BPE compression stage to control patch granularity, our method offers efficient, flexible, and language-agnostic representations. Empirical results demonstrate that our approach matches or exceeds the performance of dynamic entropy- and whitespace-based patching strategies, while maintaining a compact vocabulary.
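The grouping step can be sketched directly from the description: take BPE tokens as patch boundaries, then let a second-stage merge table coarsen them. The toy `second_level_merges` set below stands in for a learned second-level BPE.

```python
def to_patches(bpe_tokens, second_level_merges):
    """Group character sequences into patches using BPE token boundaries."""
    patches, i = [], 0
    while i < len(bpe_tokens):
        # Merge adjacent tokens that the second-stage BPE learned to join.
        if i + 1 < len(bpe_tokens) and (bpe_tokens[i], bpe_tokens[i + 1]) in second_level_merges:
            patches.append(list(bpe_tokens[i] + bpe_tokens[i + 1]))
            i += 2
        else:
            patches.append(list(bpe_tokens[i]))
            i += 1
    return patches  # each patch: the characters the hierarchical model groups

print(to_patches(["un", "believ", "able"], {("un", "believ")}))
# -> [['u','n','b','e','l','i','e','v'], ['a','b','l','e']]
```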
[NLP-29] Temporal Referential Consistency: Do LLMs Favor Sequences Over Absolute Time References? EMNLP
【Quick Read】: This paper targets the lack of temporal referential consistency in large language models (LLMs) in time-sensitive domains such as law, healthcare, and finance: when queries involve different temporal references, models struggle to keep facts and temporal logic consistent. To address this, the authors propose UnTRaP, whose core is a reasoning path alignment mechanism that guides the model to calibrate and integrate references to different points in time consistently during generation, improving the stability and accuracy of reasoning across temporal dimensions. Experiments show that UnTRaP clearly outperforms several baselines at strengthening the temporal referential consistency of LLMs.
Link: https://arxiv.org/abs/2510.15513
Authors: Ashutosh Bajpai, Tanmoy Chakraborty
Affiliations: Indian Institute of Technology Delhi; MongoDB, Inc.
Categories: Computation and Language (cs.CL)
Comments: EMNLP Main Long Paper 2025
Abstract:The increasing acceptance of large language models (LLMs) as an alternative to knowledge sources marks a significant paradigm shift across various domains, including time-sensitive fields such as law, healthcare, and finance. To fulfill this expanded role, LLMs must not only be factually accurate but also demonstrate consistency across temporal dimensions, necessitating robust temporal reasoning capabilities. Despite this critical requirement, efforts to ensure temporal consistency in LLMs remain scarce, with a noticeable absence of endeavors aimed at evaluating or augmenting LLMs across temporal references in time-sensitive inquiries. In this paper, we seek to address this gap by introducing a novel benchmark entitled temporal referential consistency, accompanied by a resource TEMP-ReCon designed to benchmark a wide range of both open-source and closed-source LLMs with various linguistic contexts characterized by differing resource richness (including English, French, and Romanian). The findings emphasize that LLMs do exhibit insufficient temporal referential consistency. To address this, we propose UnTRaP, a reasoning path alignment-based model that aims to enhance the temporal referential consistency of LLMs. Our empirical experiments substantiate the efficacy of UnTRaP compared to several baseline models.
[NLP-30] DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios
【Quick Read】: This paper tackles the emergent deceptive behaviors of increasingly capable large language models (LLMs) in high-stakes settings, and in particular the lack of systematic evaluation of deception in realistic social contexts. The key to the solution is DeceptionBench, the first systematic benchmark of its kind: it covers 150 carefully designed scenarios (over 1,000 samples) across five societal domains, namely economy, healthcare, education, social interaction, and entertainment, and quantifies deception along two dimensions, intrinsic behavioral patterns (e.g., egoistic tendencies and sycophancy) and extrinsic contextual factors (neutral conditions, reward incentives, and coercive pressure), with multi-turn interaction loops to simulate realistic feedback dynamics. Experiments reveal that deceptive tendencies are markedly amplified under reinforcement dynamics, exposing the models' lack of robustness to manipulative contextual cues and underscoring the urgent need for stronger safeguards against diverse deception behaviors.
Link: https://arxiv.org/abs/2510.15501
Authors: Yao Huang, Yitong Sun, Yichi Zhang, Ruochen Zhang, Yinpeng Dong, Xingxing Wei
Affiliations: Beihang University; Tsinghua University; Shanghai Qi Zhi Institute
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 28 pages, 17 figures, accepted by NeurIPS 2025
Abstract:Despite the remarkable advances of Large Language Models (LLMs) across diverse cognitive tasks, the rapid enhancement of these capabilities also introduces emergent deceptive behaviors that may induce severe risks in high-stakes deployments. More critically, the characterization of deception across realistic real-world scenarios remains underexplored. To bridge this gap, we establish DeceptionBench, the first benchmark that systematically evaluates how deceptive tendencies manifest across different societal domains, what their intrinsic behavioral patterns are, and how extrinsic factors affect them. Specifically, on the static count, the benchmark encompasses 150 meticulously designed scenarios in five domains, i.e., Economy, Healthcare, Education, Social Interaction, and Entertainment, with over 1,000 samples, providing sufficient empirical foundations for deception analysis. On the intrinsic dimension, we explore whether models exhibit self-interested egoistic tendencies or sycophantic behaviors that prioritize user appeasement. On the extrinsic dimension, we investigate how contextual factors modulate deceptive outputs under neutral conditions, reward-based incentivization, and coercive pressures. Moreover, we incorporate sustained multi-turn interaction loops to construct a more realistic simulation of real-world feedback dynamics. Extensive experiments across LLMs and Large Reasoning Models (LRMs) reveal critical vulnerabilities, particularly amplified deception under reinforcement dynamics, demonstrating that current models lack robust resistance to manipulative contextual cues and the urgent need for advanced safeguards against various deception behaviors. Code and resources are publicly available at this https URL.
[NLP-31] CORE: Reducing UI Exposure in Mobile Agents via Collaboration Between Cloud and Local LLMs
【Quick Read】: This paper addresses the excessive exposure of user interface (UI) state that arises when mobile agents rely on cloud-based large language models (LLMs). Cloud LLMs achieve high task accuracy but must upload the full UI state at every step, creating privacy risks, while local LLMs avoid UI uploads but, limited by their capacity, reach lower task success rates. The key to the solution is CORE, a collaborative framework with three core mechanisms: (1) layout-aware block partitioning, which groups semantically related UI elements based on the XML screen hierarchy; (2) co-planning, in which the local and cloud LLMs jointly identify the current sub-task; and (3) co-decision-making, in which the local LLM ranks the relevant UI blocks and the cloud LLM selects concrete UI elements within the top-ranked block. A multi-round accumulation mechanism further mitigates local misjudgment and limited context. Experiments show CORE reduces UI exposure by up to 55.6% while keeping task success rates only slightly below cloud-only agents.
Link: https://arxiv.org/abs/2510.15455
Authors: Gucongcong Fan, Chaoyue Niu, Chengfei Lyu, Fan Wu, Guihai Chen
Affiliations: Shanghai Jiao Tong University; Alibaba Group
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Mobile agents rely on Large Language Models (LLMs) to plan and execute tasks on smartphone user interfaces (UIs). While cloud-based LLMs achieve high task accuracy, they require uploading the full UI state at every step, exposing unnecessary and often irrelevant information. In contrast, local LLMs avoid UI uploads but suffer from limited capacity, resulting in lower task success rates. We propose CORE, a COllaborative framework that combines the strengths of cloud and local LLMs to Reduce UI Exposure, while maintaining task accuracy for mobile agents. CORE comprises three key components: (1) Layout-aware block partitioning, which groups semantically related UI elements based on the XML screen hierarchy; (2) Co-planning, where local and cloud LLMs collaboratively identify the current sub-task; and (3) Co-decision-making, where the local LLM ranks relevant UI blocks, and the cloud LLM selects specific UI elements within the top-ranked block. CORE further introduces a multi-round accumulation mechanism to mitigate local misjudgment or limited context. Experiments across diverse mobile apps and tasks show that CORE reduces UI exposure by up to 55.6% while maintaining task success rates slightly below cloud-only agents, effectively mitigating unnecessary privacy exposure to the cloud. The code is available at this https URL.
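A layout-aware partition of this kind can be approximated from the XML hierarchy alone. The sketch below is our reading of the idea, with a made-up screen dump: leaf elements are grouped under their nearest top-level container so that semantically related UI elements land in one block.

```python
import xml.etree.ElementTree as ET

SCREEN = """
<hierarchy>
  <node class="LinearLayout" id="header">
    <node class="TextView" text="Inbox"/>
    <node class="Button" text="Search"/>
  </node>
  <node class="ListView" id="mail_list">
    <node class="TextView" text="Invoice #42"/>
    <node class="TextView" text="Team lunch"/>
  </node>
</hierarchy>
"""

def partition_blocks(xml_text):
    """Group leaf UI elements under their top-level container node,
    so semantically related elements form one block (a rough sketch)."""
    root = ET.fromstring(xml_text)
    blocks = {}
    for container in root:                      # top-level containers
        leaves = [n.get("text") for n in container.iter() if n.get("text")]
        blocks[container.get("id") or container.get("class")] = leaves
    return blocks

print(partition_blocks(SCREEN))
# {'header': ['Inbox', 'Search'], 'mail_list': ['Invoice #42', 'Team lunch']}
```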
[NLP-32] Controllable Abstraction in Summary Generation for Large Language Models via Prompt Engineering
【Quick Read】: This paper addresses the limited quality and controllability of summaries produced by traditional summarization methods. The key to the solution is a multi-stage prompt generation framework based on prompt engineering, which processes the input text through semantic analysis, topic modeling, and noise control to generate summaries at different levels of abstraction. The framework improves the controllability and accuracy of large language models on summarization tasks, with prompt length optimization and text preprocessing showing especially strong effects.
Link: https://arxiv.org/abs/2510.15436
Authors: Xiangchen Song, Yuchen Liu, Yaxuan Luan, Jinxu Guo, Xiaofan Guo
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:This study presents a controllable abstract summary generation method for large language models based on prompt engineering. To address the issues of summary quality and controllability in traditional methods, we design a multi-stage prompt generation framework. This framework generates summaries with varying levels of abstraction by performing semantic analysis, topic modeling, and noise control on the input text. The experiment uses the CNN/Daily Mail dataset and provides a detailed analysis of different prompt lengths, data noise, and text types. The experimental results show that prompt length has a significant impact on the quality of generated summaries. Both very short and very long prompt tokens result in a decrease in summary quality. Data noise also negatively affects the summary generation process. As noise levels increase, the ROUGE-L score gradually decreases. Furthermore, different text types have varying effects on the model’s ability to generate summaries. The model performs best when handling news texts, while its performance is worse when processing academic articles. This research provides new insights into improving summary generation using large language models, particularly in how controlling prompt strategies and optimizing text preprocessing can enhance summary accuracy and controllability.
[NLP-33] When Seeing Is not Enough: Revealing the Limits of Active Reasoning in MLLMs
【Quick Read】: This paper confronts a core real-world challenge for multimodal large language models (MLLMs): existing evaluations focus on passive inference, where models reason step by step under complete information, and ignore the ability to actively acquire missing evidence and iteratively refine decisions. To bridge this gap, the paper has MLLMs actively acquire information by selecting a target image from a candidate pool, enabling active reasoning without task-specific priors. The key to the solution is GuessBench, a benchmark with both perception-oriented and knowledge-oriented images for systematically evaluating active reasoning in MLLMs. Experiments show that mainstream MLLMs lag far behind their passive-reasoning performance, with fine-grained perception and timely decision-making as the main bottlenecks; perceptual enhancements help smaller models most, while thinking-oriented methods provide consistent gains across model sizes, pointing to promising directions for future research on multimodal active reasoning.
Link: https://arxiv.org/abs/2510.15421
Authors: Hongcheng Liu, Pingjie Wang, Yuhao Wang, Siqu Ou, Yanfeng Wang, Yu Wang
Affiliations: Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL)
Comments: 20 pages, 13 figures
Abstract:Multimodal large language models (MLLMs) have shown strong capabilities across a broad range of benchmarks. However, most existing evaluations focus on passive inference, where models perform step-by-step reasoning under complete information. This setup is misaligned with real-world use, where seeing is not enough. This raises a fundamental question: Can MLLMs actively acquire missing evidence under incomplete information? To bridge this gap, we require the MLLMs to actively acquire missing evidence and iteratively refine decisions under incomplete information, by selecting a target image from a candidate pool without task-specific priors. To support systematic study, we propose GuessBench, a benchmark with both perception-oriented and knowledge-oriented images for evaluating active reasoning in MLLMs. We evaluate 20 superior MLLMs and find that performance on active reasoning lags far behind that in passive settings, indicating substantial room for improvement. Further analysis identifies fine-grained perception and timely decision-making as key challenges. Ablation studies show that perceptual enhancements benefit smaller models, whereas thinking-oriented methods provide consistent gains across model sizes. These results suggest promising directions for future research on multimodal active reasoning.
[NLP-34] Fine-Tuning MedGemma for Clinical Captioning to Enhance Multimodal RAG over Malaysia CPGs
【Quick Read】: This paper addresses the limited effectiveness of current Retrieval-Augmented Generation (RAG) systems on image-based clinical queries: captions from general vision-language models (VLMs) lack clinical specificity and factual grounding, undermining precise, evidence-based clinical decision support. The key to the solution is a medical fine-tuning framework for a specialized VLM: a knowledge distillation pipeline first synthesizes a high-quality annotated dataset across dermatology, fundus, and chest X-ray domains to mitigate medical data scarcity; MedGemma is then fine-tuned with the parameter-efficient QLoRA method to produce faithful, factually reliable, and semantically relevant captions that serve as better RAG queries. Experiments show clear gains in classification accuracy and in RAGAS metrics (faithfulness, relevancy, and correctness), validating the approach for clinical multimodal retrieval.
Link: https://arxiv.org/abs/2510.15418
Authors: Lee Qi Zun, Mohamad Zulhilmi Bin Abdul Halim, Goh Man Fye
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Retrieval-Augmented Generation systems are essential for providing fact-based guidance from Malaysian Clinical Practice Guidelines. However, their effectiveness with image-based queries is limited, as general Vision-Language Model captions often lack clinical specificity and factual grounding. This study proposes and validates a framework to specialize the MedGemma model for generating high-fidelity captions that serve as superior queries. To overcome data scarcity, we employ a knowledge distillation pipeline to create a synthetic dataset across dermatology, fundus, and chest radiography domains, and fine-tune MedGemma using the parameter-efficient QLoRA method. Performance was rigorously assessed through a dual framework measuring both classification accuracy and, via a novel application of the RAGAS framework, caption faithfulness, relevancy, and correctness. The fine-tuned model demonstrated substantial improvements in classification performance, while RAGAS evaluation confirmed significant gains in caption faithfulness and correctness, validating the model's ability to produce reliable, factually grounded descriptions. This work establishes a robust pipeline for specializing medical VLMs and validates the resulting model as a high-quality query generator, laying the groundwork for enhancing multimodal RAG systems in evidence-based clinical decision support.
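For readers who want the flavor of a QLoRA setup, here is a minimal sketch using the Hugging Face `transformers` and `peft` libraries; the checkpoint id, LoRA rank, and target modules are assumptions, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA).
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "google/medgemma-4b-it",          # assumed checkpoint id
    quantization_config=bnb_cfg,
    device_map="auto",
)

# Small trainable low-rank adapters; ranks and target modules are guesses.
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()    # only the adapter weights train
```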
[NLP-35] Large-scale User Game Lifecycle Representation Learning
【Quick Read】: This paper addresses two core problems facing game advertising and recommendation on online platforms: game sparsity, where only a few hundred games exist, too few to support large-scale user representation learning; and game imbalance, where user behavior is heavily concentrated on a handful of popular games, making long-tail interest signals hard to capture. The key to the solution is the User Game Lifecycle (UGL), which enriches users' multi-stage in-game behaviors to ease sparsity, combined with an **Inverse Probability Masking** strategy that rebalances the behavior distribution so both short- and long-term interests can be extracted effectively. Experiments show UGL representations deliver significant online CVR gains for game advertising and ARPU gains for in-game item recommendation.
Link: https://arxiv.org/abs/2510.15412
Authors: Yanjie Gou, Jiangming Liu, Kouying Xue, Yi Hua
Affiliations: Tencent
Categories: Computation and Language (cs.CL)
Comments:
Abstract:The rapid expansion of video game production necessitates the development of effective advertising and recommendation systems for online game platforms. Recommending and advertising games to users hinges on capturing their interest in games. However, existing representation learning methods crafted for handling billions of items in recommendation systems are unsuitable for game advertising and recommendation. This is primarily due to game sparsity, where the mere hundreds of games fall short for large-scale user representation learning, and game imbalance, where user behaviors are overwhelmingly dominated by a handful of popular games. To address the sparsity issue, we introduce the User Game Lifecycle (UGL), designed to enrich user behaviors in games. Additionally, we propose two innovative strategies aimed at manipulating user behaviors to more effectively extract both short and long-term interests. To tackle the game imbalance challenge, we present an Inverse Probability Masking strategy for UGL representation learning. The offline and online experimental results demonstrate that the UGL representations significantly enhance the model, achieving a 1.83% AUC offline increase on average and a 21.67% CVR online increase on average for game advertising, and a 0.5% AUC offline increase and a 0.82% ARPU online increase for in-game item recommendation.
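The Inverse Probability Masking idea can be illustrated in a few lines: mask each behavior with probability inversely proportional to its game's frequency, so rare games are masked (and hence reconstructed) more often. This is our own sketch of the principle, with an illustrative base rate and clipping, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
behaviors = np.array([0, 0, 0, 0, 0, 1, 0, 2, 0, 3])   # game ids; game 0 dominates

counts = np.bincount(behaviors).astype(float)
freq = counts / counts.sum()                # empirical game frequencies

base_rate = 0.15                            # target average masking rate (assumed)
inv = 1.0 / freq[behaviors]                 # inverse-frequency weight per position
p_mask = np.clip(base_rate * inv / inv.mean(), 0.0, 0.9)

mask = rng.random(len(behaviors)) < p_mask  # rare games get masked more often
print(p_mask.round(2), mask)
```

With these toy counts, positions holding the popular game 0 are masked with probability about 0.05, while the long-tail games are masked with probability about 0.38.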
[NLP-36] VocalBench-DF: A Benchmark for Evaluating Speech LLM Robustness to Disfluency
【Quick Read】: This paper addresses the insufficient robustness of current speech large language models (Speech-LLMs) to speech disfluency, a real-world problem where performance drops sharply for users with speech impairments such as those caused by Parkinson's disease. The key contribution is VocalBench-DF, a framework that evaluates disfluency systematically across a multi-dimensional taxonomy. Evaluating 22 mainstream Speech-LLMs reveals substantial degradation and identifies phoneme-level processing and long-context modeling as the two primary bottlenecks; strengthening recognition and reasoning capability across components and pipelines substantially improves robustness, pointing the way toward truly inclusive Speech-LLMs.
Link: https://arxiv.org/abs/2510.15406
Authors: Hongcheng Liu, Yixuan Hou, Heyang Liu, Yuhao Wang, Yanfeng Wang, Yu Wang
Affiliations: Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL)
Comments: 21 pages, 4 figures
Abstract:While Speech Large Language Models (Speech-LLMs) show strong performance in many applications, their robustness is critically under-tested, especially to speech disfluency. Existing evaluations often rely on idealized inputs, overlooking common disfluencies, particularly those associated with conditions like Parkinson’s disease. This work investigates whether current Speech-LLMs can maintain performance when interacting with users who have speech impairments. To facilitate this inquiry, we introduce VocalBench-DF, a framework for the systematic evaluation of disfluency across a multi-dimensional taxonomy. Our evaluation of 22 mainstream Speech-LLMs reveals substantial performance degradation, indicating that their real-world readiness is limited. Further analysis identifies phoneme-level processing and long-context modeling as primary bottlenecks responsible for these failures. Strengthening recognition and reasoning capability from components and pipelines can substantially improve robustness. These findings highlight the urgent need for new methods to improve disfluency handling and build truly inclusive Speech-LLMs.
[NLP-37] Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing
【Quick Read】: This paper addresses structured parsing of scanned document images, which is hard because text paragraphs, figures, formulas, and tables are intricately intertwined; existing supervised fine-tuning generalizes poorly across document types, especially out of distribution, and high-quality layout-aware training data is scarce. The key to the solution is LayoutRL, a reinforcement learning framework that optimizes layout understanding with a composite reward combining normalized edit distance, paragraph-count accuracy, and reading-order preservation, together with the Infinity-Doc-400K dataset of 400K document images built to support training. The resulting Infinity-Parser model shows robust generalization across domains, languages, and structural complexity, outperforming both specialized parsing systems and general-purpose vision-language models.
Link: https://arxiv.org/abs/2510.15349
Authors: Baode Wang, Biao Wu, Weizhen Li, Meng Fang, Zuming Huang, Jun Huang, Haozhe Wang, Yanjie Liang, Ling Chen, Wei Chu, Yuan Qi
Affiliations: INFLY Tech; Australian Artificial Intelligence Institute; University of Liverpool
Categories: Computation and Language (cs.CL)
Comments: 22 pages, 14 figures
Abstract:Document parsing from scanned images into structured formats remains a significant challenge due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Existing supervised fine-tuning methods often struggle to generalize across diverse document types, leading to poor performance, particularly on out-of-distribution data. This issue is further exacerbated by the limited availability of high-quality training data for layout-aware parsing tasks. To address these challenges, we introduce LayoutRL, a reinforcement learning framework that optimizes layout understanding through composite rewards integrating normalized edit distance, paragraph count accuracy, and reading order preservation. To support this training, we construct the Infinity-Doc-400K dataset, which we use to train Infinity-Parser, a vision-language model demonstrating robust generalization across various domains. Extensive evaluations on benchmarks including OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet show that Infinity-Parser consistently achieves state-of-the-art performance across a broad range of document types, languages, and structural complexities, substantially outperforming both specialized document parsing systems and general-purpose vision-language models. We will release our code, dataset, and model to facilitate reproducible research in document parsing.
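A composite reward of the kind LayoutRL describes could look like the sketch below; the weights and exact normalizations are our assumptions, not the paper's.

```python
def edit_distance(a, b):
    """Levenshtein distance (single-row dynamic programming)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = min(dp[j] + 1, dp[j - 1] + 1, prev + (a[i - 1] != b[j - 1]))
            prev, dp[j] = dp[j], cur
    return dp[n]

def layout_reward(pred_paras, gold_paras, w=(0.5, 0.25, 0.25)):
    """Composite reward: normalized edit distance + paragraph-count accuracy
    + reading-order preservation. Weights are illustrative."""
    pred, gold = "\n".join(pred_paras), "\n".join(gold_paras)
    ned = 1.0 - edit_distance(pred, gold) / max(len(pred), len(gold), 1)
    count_acc = max(1.0 - abs(len(pred_paras) - len(gold_paras)) / max(len(gold_paras), 1), 0.0)
    # Reading order: fraction of adjacent matched paragraphs in increasing gold order.
    order = [gold_paras.index(p) for p in pred_paras if p in gold_paras]
    order_ok = sum(x < y for x, y in zip(order, order[1:])) / max(len(order) - 1, 1)
    return w[0] * ned + w[1] * count_acc + w[2] * order_ok

gold = ["Title", "Intro paragraph", "Table caption"]
pred = ["Title", "Table caption", "Intro paragraph"]
print(round(layout_reward(pred, gold), 3))  # the order swap lowers the reward
```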
[NLP-38] When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling
【Quick Read】: This paper addresses the performance degradation of existing LLM ensembling methods on long-form generation: ensembling at every token, the standard practice, often hurts quality. Two key factors determine where to ensemble: tokenization mismatch across models and the degree of consensus in their next-token probability distributions. The key to the solution is SAFE (Stable And Fast LLM Ensembling), a framework that ensembles selectively by jointly considering both factors, plus a probability sharpening strategy that consolidates the probability mass spread over multiple sub-word tokens of the same word onto a single representative token, further improving stability. On benchmarks including MATH500 and BBH, SAFE outperforms existing methods in both accuracy and efficiency, with gains even when fewer than 1% of tokens are ensembled.
Link: https://arxiv.org/abs/2510.15346
Authors: Heecheol Yun, Kwangmin Ki, Junghyun Lee, Eunho Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: preprint
Abstract:Ensembling Large Language Models (LLMs) has gained attention as a promising approach to surpass the performance of individual models by leveraging their complementary strengths. In particular, aggregating models’ next-token probability distributions to select the next token has been shown to be effective in various tasks. However, while successful for short-form answers, its application to long-form generation remains underexplored. In this paper, we show that using existing ensemble methods in long-form generation requires a careful choice of ensembling positions, since the standard practice of ensembling at every token often degrades performance. We identify two key factors for determining these positions: tokenization mismatch across models and consensus in their next-token probability distributions. Based on this, we propose SAFE, (Stable And Fast LLM Ensembling), a framework that selectively ensembles by jointly considering these factors. To further improve stability, we introduce a probability sharpening strategy that consolidates probabilities spread across multiple sub-word tokens representing the same word into a single representative token. Our experiments on diverse benchmarks, including MATH500 and BBH, demonstrate that SAFE outperforms existing methods in both accuracy and efficiency, with gains achieved even when ensembling fewer than 1% of tokens.
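To make the selective-ensembling idea concrete, the sketch below gates ensembling on distributional consensus (measured here with Jensen-Shannon divergence, one plausible choice) and shows a simple form of probability sharpening. The threshold and the variant grouping are illustrative, not the paper's exact recipe.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def js_divergence(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def safe_step(p_a, p_b, tau=0.1):
    """Ensemble only when the two models' next-token distributions agree
    (consensus gate); otherwise keep the primary model's distribution."""
    return 0.5 * (p_a + p_b) if js_divergence(p_a, p_b) < tau else p_a

def sharpen(p, groups):
    """Probability sharpening: pour the mass spread over sub-word variants
    of the same word onto one representative token id."""
    q = p.copy()
    for rep, variants in groups.items():
        q[rep] += q[variants].sum()
        q[variants] = 0.0
    return q / q.sum()

p_a = np.array([0.60, 0.30, 0.10])
p_b = np.array([0.55, 0.35, 0.10])
print(safe_step(p_a, p_b))                    # close distributions -> averaged

p = np.array([0.30, 0.25, 0.25, 0.20])
print(sharpen(p, {0: np.array([1, 2])}))      # variants 1,2 folded into token 0
```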
[NLP-39] Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics EMNLP2025
【Quick Read】: This paper addresses a core problem in readability assessment: existing metrics rely mostly on surface-level text features, definitions of readability are inconsistent, and the metrics diverge notably from how humans perceive text difficulty. The key to the solution is an analysis of 897 human judgments, which finds that information content and topic shape perceived readability more strongly than surface features. Comparing 15 traditional metrics against six more nuanced model-based metrics across five English datasets, four model-based metrics consistently rank in the top four in rank correlation with human judgments, while the best traditional metric averages rank 8.6, demonstrating the promise of model-based approaches for more accurate readability assessment.
Link: https://arxiv.org/abs/2510.15345
Authors: Catarina G Belem, Parker Glenn, Alfy Samuel, Anoop Kumar, Daben Liu
Affiliations: University of California Irvine; Capital One
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at the TSAR Workshop @ EMNLP 2025
Abstract:Automatic readability assessment plays a key role in ensuring effective and accessible written communication. Despite significant progress, the field is hindered by inconsistent definitions of readability and measurements that rely on surface-level text properties. In this work, we investigate the factors shaping human perceptions of readability through the analysis of 897 judgments, finding that, beyond surface-level cues, information content and topic strongly shape text comprehensibility. Furthermore, we evaluate 15 popular readability metrics across five English datasets, contrasting them with six more nuanced, model-based metrics. Our results show that four model-based metrics consistently place among the top four in rank correlations with human judgments, while the best performing traditional metric achieves an average rank of 8.6. These findings highlight a mismatch between current readability metrics and human perceptions, pointing to model-based approaches as a more promising direction.
[NLP-40] AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction
【Quick Read】: This paper addresses the disconnect between knowledge graph (KG) construction and its downstream use: KGs are usually built independently of how they perform in Retrieval-Augmented Generation (RAG) systems, yielding suboptimal graph structures. The key to the solution is AutoGraph-R1, the first framework to directly optimize KG construction for task performance with reinforcement learning (RL): graph generation is framed as policy learning, with rewards derived from the graph's functional utility in a RAG pipeline, and two task-aware reward functions are designed for graphs as knowledge carriers and as knowledge indices. Across multiple question answering (QA) benchmarks, AutoGraph-R1 consistently lets graph RAG methods outperform task-agnostic baseline graphs, shifting the paradigm from building "intrinsically good" graphs to building "demonstrably useful" ones.
Link: https://arxiv.org/abs/2510.15339
Authors: Hong Ting Tsang, Jiaxin Bai, Haoyu Huang, Qiao Xiao, Tianshi Zheng, Baixuan Xu, Shujie Liu, Yangqiu Song
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Building effective knowledge graphs (KGs) for Retrieval-Augmented Generation (RAG) is pivotal for advancing question answering (QA) systems. However, its effectiveness is hindered by a fundamental disconnect: the knowledge graph (KG) construction process is decoupled from its downstream application, yielding suboptimal graph structures. To bridge this gap, we introduce AutoGraph-R1, the first framework to directly optimize KG construction for task performance using Reinforcement Learning (RL). AutoGraph-R1 trains an LLM constructor by framing graph generation as a policy learning problem, where the reward is derived from the graph’s functional utility in a RAG pipeline. We design two novel, task-aware reward functions, one for graphs as knowledge carriers and another as knowledge indices. Across multiple QA benchmarks, AutoGraph-R1 consistently enables graph RAG methods to achieve significant performance gains over using task-agnostic baseline graphs. Our work shows it is possible to close the loop between construction and application, shifting the paradigm from building intrinsically "good" graphs to building demonstrably "useful" ones.
[NLP-41] BeLLMan: Controlling LLM Congestion
【Quick Read】: This paper addresses latency inflation and degraded user experience in large language model (LLM) serving: LLM applications generate tokens autoregressively with no awareness of the state of the underlying infrastructure, so inference latency spikes under high load. The key to the solution is beLLMan, a first-cut controller that lets the LLM infrastructure actively and progressively signal the first-party LLM application about system load so it can adjust its output length, keeping inference latency under control. On a real testbed with H100 GPUs, beLLMan lowers end-to-end latency by up to 8x and cuts energy consumption by 25% while serving 19% more requests during congestion.
Link: https://arxiv.org/abs/2510.15330
Authors: Tella Rajashekhar Reddy, Atharva Deshmukh, Karan Tandon, Rohan Gandhi, Anjaly Parayil, Debopam Bhattacherjee
Affiliations: Microsoft
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Networking and Internet Architecture (cs.NI)
Comments: To be presented at FAISYS 2025
Abstract:Large language model (LLM) applications are blindfolded to the infrastructure underneath and generate tokens autoregressively, indifferent to the system load, thus risking inferencing latency inflation and poor user experience. Our first-cut controller, named beLLMan, enables the LLM infrastructure to actively and progressively signal the first-party LLM application to adjust the output length in response to changing system load. On a real testbed with H100 GPUs, beLLMan helps keep inferencing latency under control (up to 8X lower end-to-end latency) and reduces energy consumption by 25% (while serving 19% more requests) during periods of congestion for a summarization workload.
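The control loop can be pictured as a simple load-to-budget mapping. The sketch below is purely illustrative of the "progressively shorten outputs under congestion" idea; the actual beLLMan policy is not described at this level of detail here, and all constants are made up.

```python
def output_budget(queue_depth, capacity, base_tokens=512, floor=64):
    """Map current system load to a max-new-tokens budget: as congestion
    grows, progressively ask the application for shorter outputs."""
    load = min(queue_depth / capacity, 1.0)
    return max(floor, int(base_tokens * (1.0 - 0.8 * load)))

for q in (0, 40, 80):
    print(q, output_budget(q, capacity=80))
# 0 -> 512, 40 -> 307, 80 -> 102
```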
[NLP-42] Capabilities and Evaluation Biases of Large Language Models in Classical Chinese Poetry Generation: A Case Study on Tang Poetry
【Quick Read】: This paper addresses the poor understanding of how large language models (LLMs) perform at generating and evaluating classical Chinese poetry, in particular whether systematic biases exist in judging quality across themes, emotions, imagery, form, and style. The key to the solution is a three-step evaluation framework combining computational metrics, LLM-as-a-judge assessment, and human expert validation. Evaluating six state-of-the-art LLMs, the analysis reveals an "echo chamber" effect in creative-quality judgments: models converge with one another on flawed standards that diverge from human expert judgments, highlighting the limits of model-only evaluation and the continued need for hybrid human-model validation in culturally and technically complex creative tasks.
Link: https://arxiv.org/abs/2510.15313
Authors: Bolei Ma, Yina Yao, Anna-Carolina Haensch
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) are increasingly applied to creative domains, yet their performance in classical Chinese poetry generation and evaluation remains poorly understood. We propose a three-step evaluation framework that combines computational metrics, LLM-as-a-judge assessment, and human expert validation. Using this framework, we evaluate six state-of-the-art LLMs across multiple dimensions of poetic quality, including themes, emotions, imagery, form, and style. Our analysis reveals systematic generation and evaluation biases: LLMs exhibit “echo chamber” effects when assessing creative quality, often converging on flawed standards that diverge from human judgments. These findings highlight both the potential and the limitations of current LLMs as proxies for literary generation, as well as the limits of current evaluation practices, thereby demonstrating the continued need for hybrid validation from both humans and models in culturally and technically complex creative tasks.
[NLP-43] Accelerating Mobile Language Model Generation via Hybrid Context and Hardware Coordination
【Quick Read】: This paper addresses the high latency and poor hardware utilization of context-aware text generation on mobile devices, caused by the inherently memory-bound token-by-token decoding process. The key to the solution is CoordGen, a framework that accelerates generation through three synergistic components: (1) adaptive execution scheduling, which dynamically balances compute graphs between the prefill and decoding phases; (2) context-aligned drafting, which improves speculative-decoding efficiency via lightweight online calibration to the current task; and (3) hardware-efficient draft extension, which reuses and extends intermediate sequences to increase parallelism and reduce verification cost. On several smartphones and representative workloads, the framework achieves up to 3.8x faster generation and 4.7x better energy efficiency.
Link: https://arxiv.org/abs/2510.15312
Authors: Zhiyang Chen, Daliang Xu, Haiyang Shen, Mengwei Xu, Shangguang Wang, Yun Ma
Affiliations: Peking University; Beijing University of Posts and Telecommunications
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Enhancing on-device large language models (LLMs) with contextual information from local data enables personalized and task-aware generation, powering use cases such as intelligent assistants and UI agents. While recent developments in neural processors have substantially improved the efficiency of prefill on mobile devices, the token-by-token generation process still suffers from high latency and limited hardware utilization due to its inherently memory-bound characteristics. This work presents CoordGen, a mobile inference framework that integrates speculative decoding with dynamic hardware scheduling to accelerate context-aware text generation on mobile devices. The framework introduces three synergistic components: (1) adaptive execution scheduling, which dynamically balances compute graphs between prefill and decoding phases; (2) context-aligned drafting, which improves speculative efficiency through lightweight online calibration to current tasks; and (3) hardware-efficient draft extension, which reuses and expands intermediate sequences to improve processing parallelism and reduce verification cost. Experiments on multiple smartphones and representative workloads show consistent improvements of up to 3.8x in generation speed and 4.7x in energy efficiency compared with existing mobile inference solutions. Component-level analysis further validates the contribution of each optimization.
[NLP-44] Automatic essay scoring: leveraging Jaccard coefficient and Cosine similarity with n-gram variation in vector space model approach
【Quick Read】: This paper addresses how to measure essay similarity effectively in automated essay scoring (AES) so as to improve scoring accuracy. The key to the solution is a vector space model (VSM) with n-gram features: Jaccard coefficient and Cosine similarity are computed between essays, and performance is evaluated by the root mean square error (RMSE) between human and system scores. The results show that Cosine similarity outperforms the Jaccard coefficient, and that unigram features yield lower RMSE than bigrams and trigrams, indicating that feature granularity has a marked effect on scoring quality.
Link: https://arxiv.org/abs/2510.15311
Authors: Andharini Dwi Cahyani, Moh. Wildan Fathoni, Fika Hastarita Rachman, Ari Basuki, Salman Amin, Bain Khusnul Khotimah
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Software Engineering (cs.SE)
Comments:
Abstract:Automated essay scoring (AES) is a vital area of research aiming to provide efficient and accurate assessment tools for evaluating written content. This study investigates the effectiveness of two popular similarity metrics, Jaccard coefficient and Cosine similarity, within the context of vector space models (VSM) employing unigram, bigram, and trigram representations. The data used in this research was obtained from formative essays in the citizenship education subject at a junior high school. Each essay undergoes preprocessing to extract features using n-gram models, followed by vectorization to transform text data into numerical representations. Then, similarity scores are computed between essays using both Jaccard coefficient and Cosine similarity. The performance of the system is evaluated by analyzing the root mean square error (RMSE), which measures the difference between the scores given by human graders and those generated by the system. The result shows that the Cosine similarity outperformed the Jaccard coefficient. In terms of n-grams, unigrams have lower RMSE compared to bigrams and trigrams.
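The full scoring pipeline is easy to reproduce in plain Python. The sketch below implements n-gram extraction, Jaccard and cosine similarity over n-gram counts, and the RMSE comparison of system scores against human grades; the example sentences and scores are ours.

```python
import math
from collections import Counter

def ngrams(text, n):
    toks = text.lower().split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def jaccard(a, b, n=1):
    A, B = set(ngrams(a, n)), set(ngrams(b, n))
    return len(A & B) / len(A | B) if A | B else 0.0

def cosine(a, b, n=1):
    va, vb = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def rmse(system_scores, human_scores):
    return math.sqrt(sum((s - h) ** 2 for s, h in zip(system_scores, human_scores))
                     / len(system_scores))

ref = "citizenship requires rights and duties"
ans = "citizenship involves rights and duties"
print(jaccard(ans, ref, n=1), cosine(ans, ref, n=1))  # 0.667, 0.8
print(rmse([80, 75], [82, 70]))                        # ~3.81
```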
[NLP-45] Exemplar-Guided Planning: Enhanced LLM Agent for KGQA
【Quick Read】: This paper addresses the semantic gap that large language models (LLMs) face in Knowledge Graph Question Answering (KGQA) between natural-language queries and structured knowledge graph (KG) representations, which leads to weak planning and inefficient KG exploration; training-free methods additionally underuse the effective reasoning patterns present in training data. The key to the solution is the Exemplar-Guided Planning (EGP) framework: it preprocesses training questions via entity templating to normalize semantic variation; retrieves highly similar exemplars and their successful reasoning paths using semantic embeddings and an efficient FAISS index; and dynamically guides the LLM's planning in two phases, task decomposition (aligning generated sub-objectives with proven reasoning steps) and relation exploration (supplying high-quality auxiliary information to improve relation-pruning accuracy). A Smart Lookahead mechanism further improves efficiency during relation exploration by preemptively exploring promising paths and potentially terminating early.
Link: https://arxiv.org/abs/2510.15283
Authors: Jingao Xu, Shuoyoucheng Ma, Xin Song, Rong Jiang, Hongkui Tu, Bin Zhou
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) as interactive agents show significant promise in Knowledge Graph Question Answering (KGQA) but often struggle with the semantic gap between natural language queries and structured knowledge graph (KG) representations. This leads to suboptimal planning and inefficient exploration on KG, while training-free approaches often underutilize valuable reasoning patterns in training data. To address these limitations, we propose a novel framework, Exemplar-Guided Planning (EGP), which enhances the planning capabilities of LLM agents for KGQA. EGP first preprocesses the training set questions via entity templating to normalize semantic variations. It then retrieves highly similar exemplary questions and their successful reasoning paths from this preprocessed set using semantic embeddings and an efficient FAISS index. These retrieved exemplars dynamically guide the LLM’s planning process in two key phases: (1) Task Decomposition, by aligning generated sub-objectives with proven reasoning steps, and (2) Relation Exploration, by providing high-quality auxiliary information to improve relation pruning accuracy. Additionally, we introduce a Smart Lookahead mechanism during relation exploration to improve efficiency by preemptively exploring promising paths and potentially terminating exploration earlier. We apply EGP to the Plan-on-Graph (PoG) framework, termed PoG-EGP. Extensive experiments on two real-world KGQA datasets, WebQSP and CWQ, demonstrate that PoG-EGP significantly improves over the baseline PoG system and other compared methods.
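The exemplar-retrieval step maps directly onto a standard FAISS inner-product index. Below is a minimal sketch assuming the `faiss` package is installed, with placeholder random embeddings standing in for encoded, entity-templated questions.

```python
import numpy as np
import faiss

rng = np.random.default_rng(0)
d = 384                                    # embedding size (illustrative)

# Embeddings of entity-templated training questions; in practice these
# come from a sentence encoder, here they are random placeholders.
bank = rng.standard_normal((1000, d)).astype("float32")
faiss.normalize_L2(bank)                   # cosine similarity via inner product

index = faiss.IndexFlatIP(d)
index.add(bank)

query = rng.standard_normal((1, d)).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, k=3)     # top-3 most similar exemplars
print(ids[0], scores[0])                   # their reasoning paths would guide planning
```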
[NLP-46] TACL: Threshold-Adaptive Curriculum Learning Strategy for Enhancing Medical Text Understanding
【Quick Read】: This paper addresses the complexity of automated medical text understanding, especially of electronic medical records (EMRs): existing methods treat all data as equally difficult, ignoring how clinical notes vary in structure, terminology, and context, which hurts performance on rare or complex cases. The key to the solution is TACL (Threshold-Adaptive Curriculum Learning), a framework that dynamically adapts training to sample complexity: data are graded by difficulty and simpler samples are prioritized early, letting the model build a solid foundation before tackling complex records. Applied to multilingual clinical data (English and Chinese), TACL improves diverse tasks including automatic ICD coding, readmission prediction, and TCM syndrome differentiation.
Link: https://arxiv.org/abs/2510.15269
Authors: Mucheng Ren, Yucheng Yan, He Chen, Danqing Hu, Jun Xu, Xian Zeng
Affiliations: Nanjing University of Information Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted as BIBM 2025 Regular. 8 pages. Pre-CR version
Abstract:Medical texts, particularly electronic medical records (EMRs), are a cornerstone of modern healthcare, capturing critical information about patient care, diagnoses, and treatments. These texts hold immense potential for advancing clinical decision-making and healthcare analytics. However, their unstructured nature, domain-specific language, and variability across contexts make automated understanding an intricate challenge. Despite the advancements in natural language processing, existing methods often treat all data as equally challenging, ignoring the inherent differences in complexity across clinical records. This oversight limits the ability of models to effectively generalize and perform well on rare or complex cases. In this paper, we present TACL (Threshold-Adaptive Curriculum Learning), a novel framework designed to address these challenges by rethinking how models interact with medical texts during training. Inspired by the principle of progressive learning, TACL dynamically adjusts the training process based on the complexity of individual samples. By categorizing data into difficulty levels and prioritizing simpler cases early in training, the model builds a strong foundation before tackling more complex records. By applying TACL to multilingual medical data, including English and Chinese clinical records, we observe significant improvements across diverse clinical tasks, including automatic ICD coding, readmission prediction and TCM syndrome differentiation. TACL not only enhances the performance of automated systems but also demonstrates the potential to unify approaches across disparate medical domains, paving the way for more accurate, scalable, and globally applicable medical text understanding solutions.
[NLP-47] TraceCoder: Towards Traceable ICD Coding via Multi-Source Knowledge Integration
【Quick Read】: This paper addresses three core problems in automated International Classification of Diseases (ICD) coding: the semantic gap between clinical text and ICD codes, poor recognition of rare and long-tail codes, and the lack of interpretability in model predictions. The key to the solution is the TraceCoder framework, which dynamically integrates multi-source external knowledge (UMLS, Wikipedia, and large language models) to enrich code representations, bridge semantic gaps, and better handle rare and ambiguous codes, and which introduces a hybrid attention mechanism to model interactions among labels, clinical context, and knowledge, improving long-tail code recognition while grounding predictions in external evidence for high interpretability.
Link: https://arxiv.org/abs/2510.15267
Authors: Mucheng Ren, He Chen, Yuchen Yan, Danqing Hu, Jun Xu, Xian Zeng
Affiliations: Jiangsu Key Laboratory of Intelligent Medical Image Computing, School of Artificial Intelligence, Nanjing University of Information Science and Technology
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted as BIBM 2025 Regular. 8 pages. Pre-CR version
Abstract:Automated International Classification of Diseases (ICD) coding assigns standardized diagnosis and procedure codes to clinical records, playing a critical role in healthcare systems. However, existing methods face challenges such as semantic gaps between clinical text and ICD codes, poor performance on rare and long-tail codes, and limited interpretability. To address these issues, we propose TraceCoder, a novel framework integrating multi-source external knowledge to enhance traceability and explainability in ICD coding. TraceCoder dynamically incorporates diverse knowledge sources, including UMLS, Wikipedia, and large language models (LLMs), to enrich code representations, bridge semantic gaps, and handle rare and ambiguous codes. It also introduces a hybrid attention mechanism to model interactions among labels, clinical context, and knowledge, improving long-tail code recognition and making predictions interpretable by grounding them in external evidence. Experiments on MIMIC-III-ICD9, MIMIC-IV-ICD9, and MIMIC-IV-ICD10 datasets demonstrate that TraceCoder achieves state-of-the-art performance, with ablation studies validating the effectiveness of its components. TraceCoder offers a scalable and robust solution for automated ICD coding, aligning with clinical needs for accuracy, interpretability, and reliability.
[NLP-48] DRO-InstructZero: Distributionally Robust Prompt Optimization for Large Language Models ICLR2026
【Quick Read】: This paper addresses the sensitivity of prompt optimization for large language models to distribution shift and adversarial evaluation: existing automatic prompt search methods such as InstructZero perform well under a single evaluation distribution but lack robustness, so prompts transfer poorly across settings. The key to the solution is to cast zero-shot prompt optimization as distributionally robust Bayesian optimization: an f-divergence ball defines an ambiguity set around the evaluation distribution, and a robust acquisition rule maximizes worst-case expected utility, explicitly targeting reliability under distribution shift rather than average behavior while retaining the query efficiency of Bayesian search.
Link: https://arxiv.org/abs/2510.15260
Authors: Yangyang Li
Affiliations: Massachusetts Institute of Technology
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint. Under review at ICLR 2026. 11 pages, 2 figures
Abstract:Large language models are highly sensitive to prompt wording. However, popular automatic prompt search methods, including InstructZero, often degrade under distribution shift and adversarial evaluation because they optimize expected performance under a single evaluation distribution. Consequently, prompts that work in one setting frequently fail to transfer. To address this, DRO-InstructZero formulates zero-shot prompt optimization as robust Bayesian optimization. Specifically, an f-divergence ball defines an ambiguity set around the evaluation distribution, and a robust acquisition rule maximizes worst-case expected utility while retaining the query efficiency of Bayesian search. Therefore, the search explicitly targets reliability under distribution shift rather than average behavior alone. Experiments follow the instruction-induction protocol with matched query budgets across formality rewriting, code debugging, and translation. For example, on BIG-Bench informative-to-formal rewriting, accuracy improves from 61.3 +/- 0.7% to approximately 85-90%, yielding an absolute gain of about 25-30 points. Moreover, auto-debugging shows about +25-point gains under domain shift. Meanwhile, stable tasks such as cause-and-effect remain above 96%, indicating no loss on in-distribution cases. Furthermore, improvements are consistent across divergence choices and decoding temperatures. Overall, DRO-InstructZero connects distributionally robust optimization with prompt learning, offering a plug-and-play and general approach for reliable, transferable prompt alignment under real-world uncertainty.
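As one concrete instance of the robust objective, take the f-divergence ball to be a KL ball. The worst-case expected utility then has the standard dual form inf_{KL(Q||P)<=rho} E_Q[u] = max_{lambda>0} { -lambda log E_P[exp(-u/lambda)] - lambda*rho }, which the sketch below evaluates numerically. This is our illustration of the general principle; the paper's acquisition rule may differ in detail.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def worst_case_utility(utilities, rho):
    """Worst-case expected utility over a KL ball of radius rho around the
    empirical evaluation distribution P (dual form, optimized over lambda)."""
    u = np.asarray(utilities, dtype=float)

    def neg_dual(lam):
        z = -u / lam
        lse = z.max() + np.log(np.mean(np.exp(z - z.max())))  # stable log-mean-exp
        return -(-lam * lse - lam * rho)

    res = minimize_scalar(neg_dual, bounds=(1e-4, 1e4), method="bounded")
    return -res.fun

scores = [0.9, 0.8, 0.85, 0.2, 0.95]   # a prompt's utility on evaluation samples
print(np.mean(scores), worst_case_utility(scores, rho=0.1))
```

The robust value sits between the empirical mean and the minimum, penalizing prompts whose good average hides occasional failures.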
[NLP-49] Multi-dimensional Data Analysis and Applications Basing on LLM Agents and Knowledge Graph Interactions
【Quick Read】: This paper addresses the challenge of extracting deep insights from massive, heterogeneous, and complexly related multi-dimensional data in the big-data era: large language models (LLMs) hallucinate on structured knowledge and are hard to update in real time, while knowledge graphs (KGs) store structured knowledge explicitly but their static nature limits dynamic interaction and analysis. The key to the solution is a dynamic analytical ecosystem in which LLM agents and KGs collaborate: LLM agents automatically extract product data from unstructured sources and construct and visualize the knowledge graph in real time, while an interactive platform lets users explore and analyze graph nodes in depth, enabling dynamic updates of structured knowledge and human-machine collaborative insight mining.
Link: https://arxiv.org/abs/2510.15258
Authors: Xi Wang, Xianyao Ling, Kun Li, Gang Yin, Liang Zhang, Jiang Wu, Jun Xu, Fu Zhang, Wenbo Lei, Annie Wang, Peng Gong
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 14 pages, 7 figures, 40 references
Abstract:In the current era of big data, extracting deep insights from massive, heterogeneous, and complexly associated multi-dimensional data has become a significant challenge. Large Language Models (LLMs) perform well in natural language understanding and generation, but still suffer from “hallucination” issues when processing structured knowledge and are difficult to update in real-time. Although Knowledge Graphs (KGs) can explicitly store structured knowledge, their static nature limits dynamic interaction and analytical capabilities. Therefore, this paper proposes a multi-dimensional data analysis method based on the interactions between LLM agents and KGs, constructing a dynamic, collaborative analytical ecosystem. This method utilizes LLM agents to automatically extract product data from unstructured data, constructs and visualizes the KG in real-time, and supports users in deep exploration and analysis of graph nodes through an interactive platform. Experimental results show that this method has significant advantages in product ecosystem analysis, relationship mining, and user-driven exploratory analysis, providing new ideas and tools for multi-dimensional data analysis.
[NLP-50] Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding
【Quick Read】: This paper addresses the limitations of current document understanding approaches for multimodal content: OCR-based pipelines feeding large language models lose structural detail, while native multimodal large language models (MLLMs) struggle with context modeling. The key to the solution is the Multimodal Retrieval-Augmented Generation (Multimodal RAG) paradigm, surveyed systematically here, which retrieves and reasons holistically across text, tables, charts, and layout to enable comprehensive document intelligence. The survey proposes a taxonomy by domain, retrieval modality, and granularity, reviews graph-structured and agentic frameworks, and summarizes datasets, benchmarks, applications, and open challenges.
Link: https://arxiv.org/abs/2510.15253
Authors: Sensen Gao, Shanshan Zhao, Xu Jiang, Lunhao Duan, Yong Xien Chng, Qing-Guo Chen, Weihua Luo, Kaifu Zhang, Jia-Wang Bian, Mingming Gong
Affiliations: MBZUAI; Alibaba International Digital Commerce Group; Tsinghua University; Wuhan University; University of Melbourne
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations: the former loses structural detail, while the latter struggles with context modeling. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents’ multimodal nature, i.e., combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG. This approach enables holistic retrieval and reasoning across all modalities, unlocking comprehensive document intelligence. Recognizing its importance, this paper presents a systematic survey of Multimodal RAG for document understanding. We propose a taxonomy based on domain, retrieval modality, and granularity, and review advances involving graph structures and agentic frameworks. We also summarize key datasets, benchmarks, and applications, and highlight open challenges in efficiency, fine-grained representation, and robustness, providing a roadmap for future progress in document AI.
[NLP-51] Planner and Executor: Collaboration between Discrete Diffusion And Autoregressive Models in Reasoning
【Quick Read】: This paper targets the high cost of autoregressive language models (ARMs) when generating long token sequences, and explores the potential of discrete diffusion language models (DDLMs) for complex reasoning and long-term planning. The key to the solution is a hybrid architecture that couples a DDLM with an ARM in both text space and latent space to exploit their complementary strengths: in text space, the DDLM plans the reasoning process and the ARM executes the answer; a learned projector then maps DDLM latents into the ARM's embedding space, bypassing some of the diffusion model's text-generation limitations. Moving DDLM-ARM communication from text to latent space yields large accuracy gains (e.g., from 27.0% to 54.0% on DART-5 and from 0.0% to 14.0% on AIME24) and substantial compute savings, such as surpassing Qwen3.1-7B with a 64-token plan and roughly 5 execution tokens while Qwen uses 44 times more tokens.
Link: https://arxiv.org/abs/2510.15244
Authors: Lina Berrayana, Ahmed Heakl, Muhammad Abdullah Sohail, Thomas Hofmann, Salman Khan, Wei Chen
Affiliations: EPFL; MBZUAI; ETH Zürich; Microsoft Research Asia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under Submission
Abstract:Current autoregressive language models (ARMs) achieve high accuracy but require long token sequences, making them costly. Discrete diffusion language models (DDLMs) enable parallel and flexible generation within a fixed number of steps and have recently emerged for their strong performance in complex reasoning and long-term planning tasks. We present a study exploring hybrid architectures that couple DDLMs with ARMs to assess whether their collaboration can yield complementary benefits. We first examine collaboration in text space, where one model plans the reasoning process and another executes the final answer based on that plan. We then extend this setup to latent-space communication, introducing a learned projector that maps DDLM latents into the ARM’s embedding space, potentially bypassing some of the text-generation limitations of diffusion models. We find that shifting DDLM – ARM communication from text space to latent space yields significant accuracy gains, for example increasing from 27.0% to 54.0% on DART-5 and from 0.0% to 14.0% on AIME24. We also find that combining a DDLM planner with an ARM executor can provide substantial computational savings with little to no impact on accuracy. For example, the latent-space pipeline, using 64 tokens for planning and roughly 5 for execution, surpasses Qwen3.1-7B on DART-5 and AIME, despite Qwen using 44 times more tokens. Overall, our study offers new insights into reasoning with DDLMs and highlights their potential in hybrid architectures.
[NLP-52] FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain EMNLP2025
【Quick Read】: This paper addresses the difficulty of evaluating the trustworthiness of large language models (LLMs) in finance, a domain whose high-risk, high-stakes nature complicates real-world deployment. The key to the solution is FinTrust, a benchmark built specifically for trustworthiness evaluation in financial applications, featuring alignment issues grounded in practical business contexts and fine-grained tasks for each dimension of trustworthiness. Evaluating eleven LLMs shows that proprietary models such as o4-mini do better on most tasks such as safety, while open-source models such as DeepSeek-V3 have an edge on specific areas like industry-level fairness; yet all models fall short on demanding tasks such as fiduciary alignment and disclosure, exposing a significant gap in legal awareness and giving future work in trustworthy financial AI a clear target.
Link: https://arxiv.org/abs/2510.15232
Authors: Tiansheng Hu, Tongyan Hu, Liuyang Bai, Yilun Zhao, Arman Cohan, Chen Zhao
Affiliations: NYU Shanghai; National University of Singapore; Yale University; Center for Data Science, New York University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: EMNLP 2025 Main
Abstract:Recent LLMs have demonstrated promising ability in solving finance related problems. However, applying LLMs in real-world finance applications remains challenging due to their high-risk and high-stakes properties. This paper introduces FinTrust, a comprehensive benchmark specifically designed for evaluating the trustworthiness of LLMs in finance applications. Our benchmark focuses on a wide range of alignment issues based on practical context and features fine-grained tasks for each dimension of trustworthiness evaluation. We assess eleven LLMs on FinTrust and find that proprietary models like o4-mini outperform in most tasks such as safety while open-source models like DeepSeek-V3 have advantages in specific areas like industry-level fairness. For challenging tasks like fiduciary alignment and disclosure, all LLMs fall short, showing a significant gap in legal awareness. We believe that FinTrust can be a valuable benchmark for LLMs’ trustworthiness evaluation in the finance domain.
[NLP-53] Extending Audio Context for Long-Form Understanding in Large Audio-Language Models
【Quick Read】: This paper addresses the short audio context windows that limit Large Audio-Language Models (LALMs) on long-form audio understanding, even when their text backbones support long contexts; the core challenge is extending the audio context without damaging existing text capabilities. The key to the solution is twofold: Partial YaRN, a training-free, audio-only context-extension method that modifies only the positions of audio tokens while leaving text positions intact, preserving the base LLM's text understanding; and Virtual Longform Audio Training (VLAT), a training-time positional augmentation that simulates diverse audio lengths so the model generalizes robustly to inputs far longer than those seen in training. Experiments on SALMONN and Qwen2-Audio show that Partial YaRN clearly outperforms the original models and that VLAT brings substantial further gains on long audio of unseen lengths.
Link: https://arxiv.org/abs/2510.15231
Authors: Yuatyong Chaichana, Pittawat Taveekitworachai, Warit Sirichotedumrong, Potsawee Manakul, Kunat Pipatanakul
Affiliations: SCB 10X, SCBX Group; Chulalongkorn University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:
Abstract:Large Audio-Language Models (LALMs) are often constrained by short audio context windows, even when their text backbones support long contexts, limiting long-form audio understanding. Prior work has introduced context-extension methods (e.g. YaRN) on unimodal LLMs, yet their application to LALMs remains unexplored. First, building on RoPE-based context extension, we introduce Partial YaRN, a training-free, audio-only extension method that modifies only audio token positions, leaving text positions intact to preserve the base LLM’s text capabilities. Second, we propose Virtual Longform Audio Training (VLAT), a training strategy that extends Partial YaRN into a training-time positional augmentation. VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training and improving robustness for long-context audio understanding. Our experiments on SALMONN and Qwen2-Audio show that Partial YaRN outperforms the original models across wide range of settings, and VLAT training strategy provides substantial improvement, achieving strong performance on long audio of unseen lengths.
[NLP-54] Soundness-Aware Level: A Microscopic Signature that Predicts LLM Reasoning Potential
【Quick Read】: This paper asks why reasoning ability after reinforcement learning with verifiable rewards (RLVR) varies so dramatically across base large language models (LLMs), and which microscopic pre-trained property drives the variation. The key to the solution is a mechanistic analysis: reasoning is formalized as chains of Horn clauses ("if-then" rules) over features extracted from the LLM's latent space by cross-layer sparse autoencoders (SAEs); transition probabilities between features are estimated, and an LLM labels each rule's semantic soundness level (strict, plausible, noisy). High-potential models turn out to be soundness-aware: their internal probability distributions shift systematically with soundness level, separating sharply between "strict" and "noisy" rules, whereas weaker models are soundness-agnostic and collapse to one distribution. The authors introduce the Soundness-Aware Level (SAL), a metric based on the Jensen-Shannon divergence between these distributions, and show that SAL predicts post-RLVR reasoning performance with high precision (R^2 = 0.87) across model families (Qwen, Mistral, Llama, DeepSeek) and scales (0.5B-14B), tying a model's reasoning potential to its pre-trained ability to separate sound from unsound knowledge.
Link: https://arxiv.org/abs/2510.15216
Authors: Xuansheng Wu, Xiaoman Pan, Wenlin Yao, Jianshu Chen
Affiliations: University of Georgia; Amazon.com
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Pre-print
Abstract:Reinforcement learning with verifiable rewards (RLVR) can elicit strong reasoning in large language models (LLMs), while their performance after RLVR varies dramatically across different base models. This raises a fundamental question: what microscopic property of pre-trained models leads to this variation? To investigate, we formalize reasoning as chains of Horn clauses (“if-then” rules) built from features extracted from the LLM’s latent space via cross-layer sparse autoencoders (SAEs). We estimate the transition probabilities between its features, and further categorize each rule by its semantic soundness level (e.g., strict, plausible, noisy) with an LLM. Our key discovery is that high-potential models are inherently soundness-aware: their internal probability distributions systematically shift across rules’ soundness levels, becoming highly distinct for “strict” versus “noisy” rules. In contrast, weaker models are soundness-agnostic, collapsing to one distribution regardless of soundness levels. To quantify this, we introduce the Soundness-Aware Level (SAL), a microscopic metric using the Jensen-Shannon Divergence to measure the separation between these distributions. We show that SAL’s predictions of post-RLVR reasoning performance follow a precise empirical law (R^2=0.87) across diverse model families (Qwen, Mistral, Llama, DeepSeek) and scales (0.5B-14B). This reveals that a model’s reasoning potential is tied to its intrinsic, pre-trained ability to distinguish sound knowledge from unsound ones. These findings underscore the critical role of model pre-training in shaping reasoning and offer a practical metric grounded in the model’s internal mechanisms for selecting/designing stronger base models.
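The separation measure at the heart of SAL is a Jensen-Shannon divergence between transition-probability distributions grouped by soundness level. A toy computation follows; note that SciPy's `jensenshannon` returns the JS distance, i.e. the square root of the divergence, and the numbers here are made up (the paper aggregates over many rules and features).

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Toy transition-probability distributions for rules at two soundness levels.
strict = np.array([0.52, 0.31, 0.12, 0.05])   # "strict" rules: peaked
noisy  = np.array([0.25, 0.25, 0.25, 0.25])   # "noisy" rules: flat

js_distance = jensenshannon(strict, noisy, base=2)  # sqrt of the JS divergence
separation = js_distance ** 2                       # JSD in [0, 1] for base 2
print(round(float(separation), 4))
```

A soundness-aware model yields a large separation; a soundness-agnostic one, whose distributions coincide across levels, yields a value near zero.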
[NLP-55] Structure-R1: Dynamically Leveraging Structural Knowledge in LLM Reasoning through Reinforcement Learning
【Quick Read】: This paper addresses the limits that a lack of explicit, structured domain knowledge places on the reasoning of large language models (LLMs): traditional Retrieval-Augmented Generation (RAG) systems operate over unstructured, fragmented text, giving low information density and suboptimal reasoning. The key to the solution is Structure-R1, a framework that uses reinforcement learning to dynamically generate and adapt structured representations optimized for multi-step reasoning, together with a self-rewarding structural verification mechanism that ensures the generated structures are both correct and self-contained, thereby raising information density and contextual clarity for efficient, reliable reasoning.
Link: https://arxiv.org/abs/2510.15191
Authors: Junlin Wu, Xianrui Zhong, Jiashuo Sun, Bolian Li, Bowen Jin, Jiawei Han, Qingkai Zeng
Affiliations: Washington University in St. Louis; University of Illinois at Urbana-Champaign; Purdue University; University of Notre Dame
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:
Abstract:Large language models (LLMs) have demonstrated remarkable advances in reasoning capabilities. However, their performance remains constrained by limited access to explicit and structured domain knowledge. Retrieval-Augmented Generation (RAG) addresses this by incorporating external information as context to augment reasoning. Nevertheless, traditional RAG systems typically operate over unstructured and fragmented text, resulting in low information density and suboptimal reasoning. To overcome these limitations, we propose Structure-R1, a novel framework that transforms retrieved content into structured representations optimized for reasoning. Leveraging reinforcement learning, Structure-R1 learns a content representation policy that dynamically generates and adapts structural formats based on the demands of multi-step reasoning. Unlike prior methods that rely on fixed schemas, our approach adopts a generative paradigm capable of producing task-specific structures tailored to individual queries. To ensure the quality and reliability of these representations, we introduce a self-reward structural verification mechanism that checks whether the generated structures are both correct and self-contained. Extensive experiments on seven knowledge-intensive benchmarks show that Structure-R1 consistently achieves competitive performance with a 7B-scale backbone model and matches the performance of much larger models. Additionally, our theoretical analysis demonstrates how structured representations enhance reasoning by improving information density and contextual clarity. Our code and data are available at: this https URL.
[NLP-56] MAGPIE: A benchmark for Multi-AGent contextual PrIvacy Evaluation
【Quick Read】: This paper addresses the difficulty autonomous large language model (LLM) agents have in balancing robust privacy understanding and preservation with task efficacy in collaborative settings: existing privacy benchmarks cover only simple single-turn interactions where private information can be trivially omitted, which does not reflect how such information is handled in multi-agent collaboration. The key to the solution is MAGPIE (Multi-AGent contextual PrIvacy Evaluation), a benchmark of 200 high-stakes tasks in which private information is essential to task resolution, forcing agents to combine effective collaboration with strategic information control. This design makes privacy preservation a core challenge of task completion and yields a more realistic evaluation of LLM agents' privacy understanding and behavior in non-adversarial multi-agent scenarios.
Link: https://arxiv.org/abs/2510.15186
Authors: Gurusha Juneja, Jayanth Naga Sai Pasupulati, Alon Albalak, Wenyue Hua, William Yang Wang
Affiliations: University of California, Santa Barbara; University of California, Davis
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:
Abstract:A core challenge for autonomous LLM agents in collaborative settings is balancing robust privacy understanding and preservation alongside task efficacy. Existing privacy benchmarks only focus on simplistic, single-turn interactions where private information can be trivially omitted without affecting task outcomes. In this paper, we introduce MAGPIE (Multi-AGent contextual PrIvacy Evaluation), a novel benchmark of 200 high-stakes tasks designed to evaluate privacy understanding and preservation in multi-agent collaborative, non-adversarial scenarios. MAGPIE integrates private information as essential for task resolution, forcing agents to balance effective collaboration with strategic information control. Our evaluation reveals that state-of-the-art agents, including GPT-5 and Gemini 2.5-Pro, exhibit significant privacy leakage, with Gemini 2.5-Pro leaking up to 50.7% and GPT-5 up to 35.1% of the sensitive information even when explicitly instructed not to. Moreover, these agents struggle to achieve consensus or task completion and often resort to undesirable behaviors such as manipulation and power-seeking (e.g., Gemini 2.5-Pro demonstrating manipulation in 38.2% of the cases). These findings underscore that current LLM agents lack robust privacy understanding and are not yet adequately aligned to simultaneously preserve privacy and maintain effective collaboration in complex environments.
[NLP-57] Train a Unified Multimodal Data Quality Classifier with Synthetic Data EMNLP2025
【Quick Read】: This paper tackles the under-explored problem of high-quality data filtering for image-text interleaved document data in the pre-training of multimodal large language models (MLLMs). Existing approaches focus mainly on filtering image-text caption data and neglect quality control for interleaved documents, limiting model performance. The key to the solution is UniFilter, a Unified Multimodal Data Quality Classifier trained via a semi-synthetic method that efficiently builds sample-score pairs: readily available raw images are paired with generated text at four quality levels, enabling joint high-quality filtering of both caption and interleaved document data. This markedly improves pre-training data quality, strengthens zero-shot reasoning and in-context learning, and yields further gains on multiple benchmarks after visual supervised fine-tuning, confirming the critical role of high-quality multimodal pre-training data.
Link: https://arxiv.org/abs/2510.15162
Authors: Weizhi Wang,Rongmei Lin,Shiyang Li,Colin Lockard,Ritesh Sarkhel,Sanket Lokegaonkar,Jingbo Shang,Xifeng Yan,Nasser Zalmout,Xian Li
Affiliations: UC Santa Barbara; Amazon Stores Foundational AI; UC San Diego
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: EMNLP 2025 Findings
Abstract:The Multimodal Large Language Models (MLLMs) are continually pre-trained on a mixture of image-text caption data and interleaved document data, while high-quality data filtering for image-text interleaved document data remains under-explored. We propose to train an efficient MLLM as a Unified Multimodal Data Quality Classifier (UniFilter) to filter both high-quality image-text caption and interleaved data. To address the challenge of collecting diverse labeled multimodal data, we introduce a semi-synthetic approach that leverages readily available raw images and generates corresponding text across four quality levels. This method enables efficient creation of sample-score pairs for both caption and interleaved document data to train UniFilter. We apply UniFilter to curate high-quality caption data from the DataComp caption dataset and interleaved data from the OBELICS image-text interleaved dataset. MLLMs pre-trained on the filtered data demonstrate significantly enhanced capabilities compared to those trained on baseline-filtered data, achieving stronger zero-shot reasoning and in-context learning capabilities. After visual supervised fine-tuning, these UniFilter-induced MLLMs achieve stronger performance on various benchmarks, highlighting the downstream benefits of high-quality multimodal pre-training. We release the synthetic training data used for training UniFilter, the UniFilter model checkpoints, and the high-quality interleaved document subset OBELICS-HQ, curated by UniFilter, to the community for reproduction and further development.
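To make the filtering step concrete, here is a minimal sketch of score-threshold curation in the spirit described above; `quality_score` is a hypothetical placeholder for the learned classifier, not the released UniFilter model:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    image_path: str
    text: str

def quality_score(sample: Sample) -> float:
    """Placeholder for a learned quality classifier mapping a sample to [0, 1].
    A trivial length heuristic stands in here purely to keep the sketch runnable."""
    return min(len(sample.text) / 200.0, 1.0)

def filter_corpus(samples: List[Sample], threshold: float = 0.7) -> List[Sample]:
    """Keep only samples whose predicted quality exceeds the threshold."""
    return [s for s in samples if quality_score(s) >= threshold]

corpus = [Sample("img_0.jpg", "A short caption."),
          Sample("img_1.jpg", "A detailed, well-grounded caption describing "
                              "the scene, its objects, and their relations. " * 3)]
print(len(filter_corpus(corpus)))  # -> 1 (the short caption is filtered out)
```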
[NLP-58] HugAgent: Evaluating LLMs in Simulating Human-Like Individual Reasoning on Open-Ended Tasks NEURIPS2025
【Quick Read】: This paper addresses the problem that current large language models (LLMs), while able to simulate human responses at scale on open-ended tasks, tend to align with population-level consensus and fail to capture individual reasoning styles and belief trajectories. The key to the solution is HugAgent (Human-Grounded Agent Benchmark), which adopts a dual-track design: a synthetic track for large-scale, systematic stress tests, and a human track that collects ecologically valid, interpretable "out-loud" reasoning data. This enables scalable, reproducible evaluation of intra-agent fidelity, i.e., whether a model can predict not only an individual's beliefs but also faithfully reproduce how their reasoning evolves.
Link: https://arxiv.org/abs/2510.15144
Authors: Chance Jiajie Li,Zhenze Mo,Yuhan Tang,Ao Qu,Jiayi Wu,Kaiya Ivy Zhao,Yulu Gan,Jie Fan,Jiangbo Yu,Hang Jiang,Paul Pu Liang,Jinhua Zhao,Luis Alberto Alonso Pastor,Kent Larson
Affiliations: MIT Media Lab; MIT EECS; MIT BCS; MIT IDSS; MIT CEE; MIT DUSP; MIT Architecture; Northeastern University; Brown University; McGill University; Google
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: To appear in NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models (LAW)
Abstract:Simulating human reasoning in open-ended tasks has been a long-standing aspiration in AI and cognitive science. While large language models now approximate human responses at scale, they remain tuned to population-level consensus, often erasing the individuality of reasoning styles and belief trajectories. To advance the vision of more human-like reasoning in machines, we introduce HugAgent (Human-Grounded Agent Benchmark), a benchmark for average-to-individual reasoning adaptation. The task is to predict how a specific person would reason and update their beliefs in novel scenarios, given partial evidence of their past views. HugAgent adopts a dual-track design: a synthetic track for scale and systematic stress tests, and a human track for ecologically valid, “out-loud” reasoning data. This design enables scalable, reproducible evaluation of intra-agent fidelity: whether models can capture not just what people believe, but how their reasoning evolves. Experiments with state-of-the-art LLMs reveal persistent adaptation gaps, positioning HugAgent as the first extensible benchmark for aligning machine reasoning with the individuality of human thought. Our benchmark and chatbot are open-sourced as HugAgent (this https URL) and TraceYourThinking (this https URL).
[NLP-59] FarsiMCQGen: a Persian Multiple-choice Question Generation Framework
【Quick Read】: This paper addresses the difficulty of generating high-quality multiple-choice questions (MCQs) in low-resource languages such as Persian. The key to the solution is FarsiMCQGen, which builds its model through a three-stage pipeline of candidate generation, filtering, and ranking, and combines Transformer architectures, knowledge graphs, and rule-based strategies to craft distractors that resemble those of real MCQs, thereby improving question quality. Based on Wikipedia data, the study also introduces a novel dataset of 10,289 Persian MCQs, evaluated by several state-of-the-art large language models (LLMs), providing a high-quality data foundation and a reproducible pipeline for follow-up research.
Link: https://arxiv.org/abs/2510.15134
Authors: Mohammad Heydari Rad,Rezvan Afari,Saeedeh Momtazi
Affiliations: Amirkabir University of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Multiple-choice questions (MCQs) are commonly used in educational testing, as they offer an efficient means of evaluating learners’ knowledge. However, generating high-quality MCQs, particularly in low-resource languages such as Persian, remains a significant challenge. This paper introduces FarsiMCQGen, an innovative approach for generating Persian-language MCQs. Our methodology combines candidate generation, filtering, and ranking techniques to build a model that generates answer choices resembling those in real MCQs. We leverage advanced methods, including Transformers and knowledge graphs, integrated with rule-based approaches to craft credible distractors that challenge test-takers. Our work is based on data from Wikipedia, which includes general knowledge questions. Furthermore, this study introduces a novel Persian MCQ dataset comprising 10,289 questions. This dataset is evaluated by different state-of-the-art large language models (LLMs). Our results demonstrate the effectiveness of our model and the quality of the generated dataset, which has the potential to inspire further research on MCQs.
[NLP-60] Latent Topic Synthesis: Leveraging LLMs for Electoral Ad Analysis
【Quick Read】: This paper addresses the analytical challenge posed by the vast, rapidly evolving political content on social media platforms, in particular how to automatically build an interpretable topic taxonomy that reveals latent discourse structures. The key to the solution is an end-to-end framework that combines unsupervised clustering with prompt-based labeling, using large language models (LLMs) to iteratively construct a taxonomy without seed sets or domain expertise, enabling semantically rich annotation of political ad text and analysis of its moral framing.
Link: https://arxiv.org/abs/2510.15125
Authors: Alexander Brady,Tunazzina Islam
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: Under submission
Abstract:Social media platforms play a pivotal role in shaping political discourse, but analyzing their vast and rapidly evolving content remains a major challenge. We introduce an end-to-end framework for automatically generating an interpretable topic taxonomy from an unlabeled corpus. By combining unsupervised clustering with prompt-based labeling, our method leverages large language models (LLMs) to iteratively construct a taxonomy without requiring seed sets or domain expertise. We apply this framework to a large corpus of Meta (previously known as Facebook) political ads from the month ahead of the 2024 U.S. Presidential election. Our approach uncovers latent discourse structures, synthesizes semantically rich topic labels, and annotates topics with moral framing dimensions. We show quantitative and qualitative analyses to demonstrate the effectiveness of our framework. Our findings reveal that voting and immigration ads dominate overall spending and impressions, while abortion and election-integrity ads achieve disproportionate reach. Funding patterns are equally polarized: economic appeals are driven mainly by conservative PACs, abortion messaging splits between pro- and anti-rights coalitions, and crime-and-justice campaigns are fragmented across local committees. The framing of these appeals also diverges: abortion ads emphasize liberty/oppression rhetoric, while economic messaging blends care/harm, fairness/cheating, and liberty/oppression narratives. Topic salience further reveals strong correlations between moral foundations and issues, and distinct demographic targeting patterns also emerge. This work supports scalable, interpretable analysis of political messaging on social media, enabling researchers, policymakers, and the public to better understand emerging narratives, polarization dynamics, and the moral underpinnings of digital political communication.
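A minimal sketch of the cluster-then-label loop, assuming sentence-transformers and scikit-learn are available; `llm_label` is a hypothetical stand-in for the prompt-based labeling call, not the paper's released code:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import numpy as np

def llm_label(texts):
    """Placeholder: prompt an LLM with representative texts, return a topic label."""
    return "topic: " + texts[0][:30]  # trivial stand-in to keep the sketch runnable

ads = ["Protect the border now", "Lower taxes for families",
       "Defend abortion rights", "Secure our elections"]
emb = SentenceTransformer("all-MiniLM-L6-v2").encode(ads)  # embed the unlabeled corpus

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(emb)
for c in range(km.n_clusters):
    members = [ads[i] for i in np.where(km.labels_ == c)[0]]
    print(c, llm_label(members))  # each cluster gets an LLM-synthesized name
```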
[NLP-61] Measuring the Effect of Disfluency in Multilingual Knowledge Probing Benchmarks
【Quick Read】: This paper addresses the problem that template-translated prompts in multilingual factual-knowledge evaluation are often ungrammatical or mis-worded, especially in morphologically rich languages, which complicates the interpretation of model scores. The key to the solution is to re-translate the MLAMA data with whole-sentence machine translation (e.g., Google Translate and ChatGPT) rather than simple template substitution, which significantly raises knowledge-retrieval scores and makes the results more accurate and interpretable.
Link: https://arxiv.org/abs/2510.15115
Authors: Kirill Semenov,Rico Sennrich
Affiliations: University of Zurich
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:For multilingual factual knowledge assessment of LLMs, benchmarks such as MLAMA use template translations that do not take into account the grammatical and semantic information of the named entities inserted in the sentence. This leads to numerous instances of ungrammaticality or wrong wording of the final prompts, which complicates the interpretation of scores, especially for languages that have a rich morphological inventory. In this work, we sample 4 Slavic languages from the MLAMA dataset and compare the knowledge retrieval scores between the initial (templated) MLAMA dataset and its sentence-level translations made by Google Translate and ChatGPT. We observe a significant increase in knowledge retrieval scores, and provide a qualitative analysis of possible reasons behind it. We additionally analyze 5 more languages from different families and observe similar patterns. Therefore, we encourage the community to control the grammaticality of highly multilingual datasets for higher and more interpretable results, which is well approximated by whole-sentence translation with neural MT or LLM systems. The dataset and all related code are published at the GitHub repository: this https URL.
[NLP-62] DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning
【Quick Read】: This paper targets the inefficiency of reasoning models that produce long yet wasteful outputs; the goal is to maximize intelligence per token, i.e., to minimize response length without sacrificing accuracy. The key to the solution is to revisit the simple truncation penalty in reinforcement learning (RL) training and identify three core challenges: large bias in advantage estimation, entropy collapse, and sparse reward signals. The proposed DLER (Doing Length pEnalty Right) recipe combines batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty, markedly improving the accuracy-efficiency trade-off; Difficulty-Aware DLER and an update-selective merging method further improve reasoning compression and adaptability across scenarios.
Link: https://arxiv.org/abs/2510.15110
Authors: Shih-Yang Liu,Xin Dong,Ximing Lu,Shizhe Diao,Mingjie Liu,Min-Hung Chen,Hongxu Yin,Yu-Chiang Frank Wang,Kwang-Ting Cheng,Yejin Choi,Jan Kautz,Pavlo Molchanov
Affiliations: NVIDIA
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: NVIDIA Tech Report
Abstract:Reasoning language models such as OpenAI-o1, DeepSeek-R1, and Qwen achieve strong performance via extended chains of thought but often generate unnecessarily long outputs. Maximizing intelligence per token, i.e., accuracy relative to response length, remains an open problem. We revisit reinforcement learning (RL) with the simplest length penalty, truncation, and show that accuracy degradation arises not from the lack of sophisticated penalties but from inadequate RL optimization. We identify three key challenges: (i) large bias in advantage estimation, (ii) entropy collapse, and (iii) sparse reward signal. We address them with Doing Length pEnalty Right (DLER), a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy-efficiency trade-offs, cutting output length by over 70 percent while surpassing all previous baseline accuracy. It also improves test-time scaling: compared to DeepSeek-R1-7B, DLER-7B generates multiple concise responses in parallel with 28 percent higher accuracy and lower latency. We further introduce Difficulty-Aware DLER, which adaptively tightens truncation on easier questions for additional efficiency gains. We also propose an update-selective merging method that preserves baseline accuracy while retaining the concise reasoning ability of the DLER model, which is useful for scenarios where RL training data is scarce.
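A minimal sketch of the reward shaping described above (truncation length penalty plus batch-wise reward normalization); the constants and the exact combination are illustrative, not the full DLER recipe:

```python
import numpy as np

def dler_style_rewards(correct, lengths, max_len):
    """Truncation penalty: responses longer than max_len earn zero reward
    (as if cut off before finishing); rewards are then normalized batch-wise,
    which reduces bias in advantage estimation."""
    raw = np.where(np.array(lengths) <= max_len, np.array(correct, float), 0.0)
    return (raw - raw.mean()) / (raw.std() + 1e-8)

# Four sampled responses: the second is correct but too long, so it is penalized.
print(dler_style_rewards(correct=[1, 1, 0, 1], lengths=[120, 900, 300, 450], max_len=512))
```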
[NLP-63] Continual Learning via Sparse Memory Finetuning
【Quick Read】: This paper addresses catastrophic forgetting in the continual learning of large language models, where updating on new knowledge severely erodes existing capabilities. The key to the solution is sparse memory finetuning: leveraging memory layer models, it sparsely updates only the memory slots that are highly activated by new knowledge, reducing interference between new and old knowledge. Experiments show that, compared with full finetuning and parameter-efficient methods such as LoRA, this approach learns new knowledge equally well while forgetting substantially less.
Link: https://arxiv.org/abs/2510.15103
Authors: Jessy Lin,Luke Zettlemoyer,Gargi Ghosh,Wen-Tau Yih,Aram Markosyan,Vincent-Pierre Berges,Barlas Oğuz
Affiliations: University of California, Berkeley; Meta
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Modern language models are powerful, but typically static after deployment. A major obstacle to building models that continually learn over time is catastrophic forgetting, where updating on new data erases previously acquired capabilities. Motivated by the intuition that mitigating forgetting is challenging because trainable parameters are shared across all tasks, we investigate whether sparse parameter updates can enable learning without catastrophic forgetting. We introduce sparse memory finetuning, leveraging memory layer models (Berges et al., 2024), which are sparsely updated by design. By updating only the memory slots that are highly activated by a new piece of knowledge relative to usage on pretraining data, we reduce interference between new knowledge and the model’s existing capabilities. We evaluate learning and forgetting compared to full finetuning and parameter-efficient finetuning with LoRA on two question answering tasks. We find that sparse memory finetuning learns new knowledge while exhibiting substantially less forgetting: while NaturalQuestions F1 drops by 89% after full finetuning on new facts and 71% with LoRA, sparse memory finetuning yields only an 11% drop with the same level of new knowledge acquisition. Our results suggest sparsity in memory layers offers a promising path toward continual learning in large language models.
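A minimal sketch of the sparse update rule, assuming PyTorch: only the slots most activated by the new data, relative to their usage on pretraining data, receive gradient. The ratio-based scoring and top-k selection are assumptions made for illustration, not the paper's exact criterion:

```python
import torch

def sparse_memory_update(memory, activations, pretrain_usage, grad, lr=1e-2, k=4):
    """Update only the k memory slots most activated by the new data
    *relative to* their usage on pretraining data (a TF-IDF-like ratio)."""
    score = activations / (pretrain_usage + 1e-6)
    idx = torch.topk(score, k).indices      # slots specific to the new fact
    memory[idx] -= lr * grad[idx]           # all other slots stay frozen
    return memory

mem = torch.zeros(1000, 64)                 # 1000 memory slots of width 64
acts, usage = torch.rand(1000), torch.rand(1000)
mem = sparse_memory_update(mem, acts, usage, torch.randn(1000, 64))
print((mem.abs().sum(dim=1) > 0).sum())     # tensor(4): only 4 slots changed
```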
[NLP-64] A Generalizable Rhetorical Strategy Annotation Model Using LLM-based Debate Simulation and Labelling
【Quick Read】: This paper addresses the high cost, inconsistency, and poor scalability of manual annotation in the analysis of rhetorical strategies, as well as the limited topical and strategic coverage of existing datasets. The key to the solution is a novel framework that uses large language models (LLMs) to automatically generate and label synthetic debate data according to a four-part rhetorical typology (causal, empirical, emotional, moral), and then fine-tunes Transformer-based classifiers on it, achieving high performance and strong cross-domain generalization.
Link: https://arxiv.org/abs/2510.15081
Authors: Shiyu Ji,Farnoosh Hashemi,Joice Chen,Juanwen Pan,Weicheng Ma,Hefan Zhang,Sophia Pan,Ming Cheng,Shubham Mohole,Saeed Hassanpour,Soroush Vosoughi,Michael Macy
Affiliations: Cornell University; Georgia Institute of Technology; Dartmouth College
Subjects: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: The first two authors contributed equally
Abstract:Rhetorical strategies are central to persuasive communication, from political discourse and marketing to legal argumentation. However, analysis of rhetorical strategies has been limited by reliance on human annotation, which is costly, inconsistent, and difficult to scale. The associated datasets are often limited to specific topics and strategies, posing challenges for robust model development. We propose a novel framework that leverages large language models (LLMs) to automatically generate and label synthetic debate data based on a four-part rhetorical typology (causal, empirical, emotional, moral). We fine-tune transformer-based classifiers on this LLM-labeled dataset and validate their performance against human-labeled data on this dataset and on multiple external corpora. Our model achieves high performance and strong generalization across topical domains. We illustrate two applications with the fine-tuned model: (1) improving persuasiveness prediction by incorporating rhetorical strategy labels, and (2) analyzing temporal and partisan shifts in rhetorical strategies in U.S. Presidential debates (1960-2020), revealing an increased use of affective over cognitive arguments.
[NLP-65] Antislop: A Comprehensive Framework for Identifying and Eliminating Repetitive Patterns in Language Models
【Quick Read】: This paper targets the repetitive phraseology ("slop") that has emerged with the widespread adoption of large language models (LLMs), which degrades output quality and makes AI-generated text easy to recognize. The key to the solution is the Antislop framework, whose core innovations are: (1) the Antislop Sampler, which suppresses unwanted strings at inference time via backtracking without destroying the vocabulary; (2) an automated pipeline that profiles model-specific slop patterns against human baselines and generates training data; and (3) Final Token Preference Optimization (FTPO), a token-level fine-tuning method that surgically adjusts logits wherever a banned pattern appears in an inference trace. Experiments show FTPO achieves 90% slop reduction while maintaining or improving performance on cross-domain evaluations such as GSM8K, MMLU, and creative writing, clearly outperforming methods such as DPO.
Link: https://arxiv.org/abs/2510.15061
Authors: Samuel Paech,Allen Roush,Judah Goldfeder,Ravid Shwartz-Ziv
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 11 pages + appendices, 16 figures
Abstract:Widespread LLM adoption has introduced characteristic repetitive phraseology, termed "slop," which degrades output quality and makes AI-generated text immediately recognizable. We present Antislop, a comprehensive framework providing tools to both detect and eliminate these overused patterns. Our approach combines three innovations: (1) The Antislop Sampler, which uses backtracking to suppress unwanted strings at inference time without destroying vocabulary; (2) An automated pipeline that profiles model-specific slop against human baselines and generates training data; (3) Final Token Preference Optimization (FTPO), a novel fine-tuning method that operates on individual tokens, surgically adjusting logits wherever a banned pattern has appeared in an inference trace. We demonstrate that some slop patterns appear over 1,000 times more frequently in LLM output than human text. The Antislop Sampler successfully suppresses 8,000+ patterns while maintaining quality, whereas token banning becomes unusable at just 2,000. Most importantly, FTPO achieves 90% slop reduction while maintaining or improving performance in cross-domain evals including GSM8K, MMLU, and creative writing tasks. In contrast, DPO suffers significant degradation in writing quality and lexical diversity despite achieving weaker suppression. We release all code and results under MIT license: this https URL.
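A minimal sketch of backtracking suppression at inference time: when a sampled token completes a banned string, the sampler rolls back and resamples with that token blocked at that position only, leaving the vocabulary intact elsewhere. The toy vocabulary, banned phrases, and sampler are invented for illustration:

```python
import random

BANNED = {"tapestry of", "delve into"}

def violates(text):
    return any(b in text for b in BANNED)

def sample_token(prefix, blocked):
    options = [w for w in ["delve", "into", "look", "at", "of", "tapestry"]
               if w not in blocked]
    return random.choice(options)

def generate(prefix, n_tokens=8):
    out, blocked_at = prefix[:], {}
    while len(out) < n_tokens:
        tok = sample_token(out, blocked_at.get(len(out), set()))
        cand = out + [tok]
        if violates(" ".join(cand)):
            # backtrack: forbid this token at this position only, then resample
            blocked_at.setdefault(len(out), set()).add(tok)
            continue
        out = cand
    return " ".join(out)

random.seed(0)
print(generate(["we"]))  # a continuation guaranteed to contain no banned phrase
```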
[NLP-66] Internalizing World Models via Self-Play Finetuning for Agentic RL
【Quick Read】: This paper addresses the brittleness of large language model (LLM) agents in out-of-distribution (OOD) scenarios: in complex, dynamic real-world environments, LLMs struggle to ground their internal knowledge in environment dynamics, leading to poor exploration and weak generalization. The key to the solution is SPA (Self-Play with World Model), a model-based reinforcement learning framework that decomposes the world model into two components, state representation and transition modeling; it cold-starts the policy with a self-play supervised fine-tuning (SFT) stage so the agent learns the world model by interacting with the environment, and then uses it to simulate future states before policy optimization, improving decision consistency and exploration efficiency. Experiments show substantial gains across environments, e.g., raising the Sokoban success rate from 25.6% to 59.8% and the FrozenLake score from 22.1% to 70.9%.
Link: https://arxiv.org/abs/2510.15047
Authors: Shiqi Chen,Tongyao Zhu,Zian Wang,Jinghan Zhang,Kangrui Wang,Siyang Gao,Teng Xiao,Yee Whye Teh,Junxian He,Manling Li
Affiliations: City University of Hong Kong; Northwestern University; The Hong Kong University of Science and Technology; Oxford University; Allen Institute for AI (AI2); University of Washington; National University of Singapore; The Hong Kong Polytechnic University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) as agents often struggle in out-of-distribution (OOD) scenarios. Real-world environments are complex and dynamic, governed by task-specific rules and stochasticity, which makes it difficult for LLMs to ground their internal knowledge in those dynamics. Under such OOD conditions, vanilla RL training often fails to scale; we observe that Pass@k, the probability that at least one of k sampled trajectories succeeds, drops markedly across training steps, indicating brittle exploration and limited generalization. Inspired by model-based reinforcement learning, we hypothesize that equipping LLM agents with an internal world model can better align reasoning with environmental dynamics and improve decision-making. We show how to encode this world model by decomposing it into two components: state representation and transition modeling. Building on this, we introduce SPA, a simple reinforcement learning framework that cold-starts the policy via a Self-Play supervised finetuning (SFT) stage to learn the world model by interacting with the environment, then uses it to simulate future states prior to policy optimization. This simple initialization outperforms the online world-modeling baseline and greatly boosts the RL-based agent training performance. Experiments across diverse environments like Sokoban, FrozenLake, and Sudoku show that our approach significantly improves performance. For example, SPA boosts the Sokoban success rate from 25.6% to 59.8% and raises the FrozenLake score from 22.1% to 70.9% for the Qwen2.5-1.5B-Instruct model.
[NLP-67] Composition-Grounded Instruction Synthesis for Visual Reasoning
【Quick Read】: This paper addresses the weak reasoning of pre-trained multimodal large language models (MLLMs) in artificial image domains such as charts, rendered documents, and webpages, where large-scale human-annotated reasoning data is scarce. The key to the solution is COGS (COmposition-Grounded instruction Synthesis), a data-efficient framework that decomposes each seed question into primitive perception and reasoning factors and systematically recomposes them with new images to generate large collections of synthetic question-answer pairs; each generated question comes with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards and improving generalization to unseen tasks.
Link: https://arxiv.org/abs/2510.15040
Authors: Xinyi Gu,Jiayuan Mao,Zhang-Wei Hong,Zhuoran Yu,Pengyuan Li,Dhiraj Joshi,Rogerio Feris,Zexue He
Affiliations: MIT; UW-Madison; MIT-IBM Watson AI Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Pretrained multi-modal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human annotated reasoning datasets. We introduce COGS (COmposition-Grounded instruction Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning factors, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Moreover, training with a factor-level mixture of different seed data yields better transfer across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting. We further demonstrate that the framework extends beyond charts to other domains such as webpages.
[NLP-68] DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models
【Quick Read】: This paper addresses semantic leakage in text-to-image (T2I) models, the unintended transfer of semantically related features between distinct entities, which harms the accuracy and controllability of generated images. The key to the solution is DeLeaker, a lightweight, optimization-free inference-time intervention that operates directly on the model's attention maps: throughout the diffusion process it dynamically reweights attention to suppress excessive cross-entity interactions while strengthening each entity's identity, mitigating leakage efficiently and without loss of fidelity or quality.
Link: https://arxiv.org/abs/2510.15015
Authors: Mor Ventura,Michael Toker,Or Patashnik,Yonatan Belinkov,Roi Reichart
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Text-to-Image (T2I) models have advanced rapidly, yet they remain vulnerable to semantic leakage, the unintended transfer of semantically related features between distinct entities. Existing mitigation strategies are often optimization-based or dependent on external inputs. We introduce DeLeaker, a lightweight, optimization-free inference-time approach that mitigates leakage by directly intervening on the model’s attention maps. Throughout the diffusion process, DeLeaker dynamically reweights attention maps to suppress excessive cross-entity interactions while strengthening the identity of each entity. To support systematic evaluation, we introduce SLIM (Semantic Leakage in IMages), the first dataset dedicated to semantic leakage, comprising 1,130 human-verified samples spanning diverse scenarios, together with a novel automatic evaluation framework. Experiments demonstrate that DeLeaker consistently outperforms all baselines, even when they are provided with external information, achieving effective leakage mitigation without compromising fidelity or quality. These results underscore the value of attention control and pave the way for more semantically precise T2I models.
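A schematic sketch of attention reweighting, assuming PyTorch: cross-entity attention is damped and self-entity attention amplified, then rows are renormalized. The real DeLeaker operates on cross-attention maps inside the diffusion model; the shapes, masks, and scale factors here are illustrative:

```python
import torch

def reweight_attention(attn, entity_masks, cross_scale=0.5, self_scale=1.5):
    """attn: (tokens, pixels) attention map; entity_masks: dict mapping an
    entity's token index to a (pixels,) boolean mask of its image region."""
    attn = attn.clone()
    for t, own in entity_masks.items():
        for t2, other in entity_masks.items():
            if t2 != t:
                attn[t, other] *= cross_scale   # suppress cross-entity leakage
        attn[t, own] *= self_scale              # reinforce the entity's identity
    return attn / attn.sum(dim=-1, keepdim=True)  # renormalize each row

A = torch.rand(4, 16)
masks = {0: torch.arange(16) < 8, 1: torch.arange(16) >= 8}
print(reweight_attention(A, masks).sum(dim=-1))   # rows still sum to 1
```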
[NLP-69] Can generative AI figure out figurative language? The influence of idioms on essay scoring by ChatGPT, Gemini and Deepseek
【Quick Read】: This paper examines how well generative AI handles figurative language such as idioms when scoring student essays automatically, focusing on scoring consistency and fairness for essays with and without idioms. The key to the solution is to use corpus-linguistic and computational-linguistic methods to build two balanced essay lists (one with multiple idioms per essay, one with none) and compare three mainstream generative AI models (ChatGPT, Gemini, and Deepseek) under the same rubric. The results show that Gemini tracks human raters most closely on idiom-rich essays and scores consistently without demographic bias, making it the best candidate for handling essay-scoring tasks involving figurative language on its own in the future.
Link: https://arxiv.org/abs/2510.15009
Authors: Enis Oğuz
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The developments in Generative AI technologies have paved the way for numerous innovations in different fields. Recently, Generative AI has been proposed as a competitor to AES systems in evaluating student essays automatically. Considering the potential limitations of AI in processing idioms, this study assessed the scoring performances of Generative AI models for essays with and without idioms by incorporating insights from Corpus Linguistics and Computational Linguistics. Two equal essay lists were created from 348 student essays taken from a corpus: one with multiple idioms present in each essay and another with no idioms in essays. Three Generative AI models (ChatGPT, Gemini, and Deepseek) were asked to score all essays in both lists three times, using the same rubric used by human raters in assigning essay scores. The results revealed excellent consistency for all models, but Gemini outperformed its competitors in interrater reliability with human raters. There was also no detectable bias for any demographic group in AI assessment. For essays with multiple idioms, Gemini followed the most similar pattern to human raters. While the models in the study demonstrated potential for a hybrid approach, Gemini was the best candidate for the task due to its ability to handle figurative language, and it showed promise for handling essay-scoring tasks alone in the future.
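Interrater reliability of the kind reported here is commonly measured with quadratic weighted kappa, which penalizes disagreements by their squared distance on the rating scale. A minimal sketch with invented scores, assuming scikit-learn:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical rubric scores (1-5) from a human rater and one AI model.
human = [4, 3, 5, 2, 4, 3]
model = [4, 3, 4, 2, 5, 3]

# Quadratic weighting: being off by 2 points costs 4x as much as being off by 1.
print(cohen_kappa_score(human, model, weights="quadratic"))
```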
[NLP-70] Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective
【Quick Read】: This paper addresses the bias introduced by single-label benchmarks in toxicity detection, including missed detections and false positives for real-world toxic prompts that are inherently ambiguous and multi-dimensional, and the prohibitive cost of fine-grained multi-label annotation that hinders evaluation and development. The key to the solution is three new multi-label toxicity benchmarks (Q-A-MLL, R-A-MLL, and H-X-MLL), built from public toxicity datasets and annotated under a detailed 15-category taxonomy, together with a pseudo-label-based detection method; a theoretical proof and experiments show that, on the released datasets, training with pseudo-labels outperforms learning directly from single-label supervision, surpassing advanced baselines such as GPT-4o and DeepSeek in accuracy and reliability.
Link: https://arxiv.org/abs/2510.15007
Authors: Zhiqiang Kou,Junyang Chen,Xin-Qiang Cai,Ming-Kun Xie,Biao Liu,Changwei Wang,Lei Feng,Yuheng Jia,Gang Niu,Masashi Sugiyama,Xin Geng
Affiliations: Southeast University; RIKEN Center for Advanced Intelligence Project (AIP); Qilu University of Technology; The University of Tokyo
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) have achieved impressive results across a range of natural language processing tasks, but their potential to generate harmful content has raised serious safety concerns. Current toxicity detectors primarily rely on single-label benchmarks, which cannot adequately capture the inherently ambiguous and multi-dimensional nature of real-world toxic prompts. This limitation results in biased evaluations, including missed toxic detections and false positives, undermining the reliability of existing detectors. Additionally, gathering comprehensive multi-label annotations across fine-grained toxicity categories is prohibitively costly, further hindering effective evaluation and development. To tackle these issues, we introduce three novel multi-label benchmarks for toxicity detection: Q-A-MLL, R-A-MLL, and H-X-MLL, derived from public toxicity datasets and annotated according to a detailed 15-category taxonomy. We further provide a theoretical proof that, on our released datasets, training with pseudo-labels yields better performance than directly learning from single-label supervision. In addition, we develop a pseudo-label-based toxicity detection method. Extensive experimental results show that our approach significantly surpasses advanced baselines, including GPT-4o and DeepSeek, thus enabling more accurate and reliable evaluation of multi-label toxicity in LLM-generated content.
[NLP-71] Fine-Tuning Small Language Models for Domain-Specific AI: An Edge AI Perspective
【Quick Read】: This paper addresses the high computational demands, energy consumption, and data-privacy risks of deploying large language models (LLMs) on edge devices. The key to the solution is the Shakti series of small language models (Shakti-100M, Shakti-250M, and Shakti-500M), which combine efficient architectures, quantization techniques, and responsible-AI principles to cut resource consumption while preserving performance, enabling on-device intelligence for smartphones, smart appliances, IoT systems, and beyond.
Link: https://arxiv.org/abs/2503.01933
Authors: Rakshit Aralimatti,Syed Abdul Gaffar Shakhadri,Kruthika KR,Kartik Basavaraj Angadi
Affiliations: SandLogic Technologies Pvt Ltd
Subjects: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Deploying large-scale language models on edge devices faces inherent challenges such as high computational demands, energy consumption, and potential data privacy risks. This paper introduces the Shakti Small Language Models (SLMs) Shakti-100M, Shakti-250M, and Shakti-500M, which target these constraints head-on. By combining efficient architectures, quantization techniques, and responsible AI principles, the Shakti series enables on-device intelligence for smartphones, smart appliances, IoT systems, and beyond. We provide comprehensive insights into their design philosophy, training pipelines, and benchmark performance on both general tasks (e.g., MMLU, Hellaswag) and specialized domains (healthcare, finance, and legal). Our findings illustrate that compact models, when carefully engineered and fine-tuned, can meet and often exceed expectations in real-world edge-AI scenarios.
[NLP-72] Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI
【Quick Read】: This paper addresses data inefficiency in multimodal learning: current vision-language models (VLMs) typically need massive training data to perform well, which makes them costly to train and deploy. The key to the solution is to balance high performance with low data requirements through model architecture and training strategy: QK-Normalization for attention stability, hybrid normalization techniques for training robustness, enhanced positional encoding for better spatial information, and a three-stage training strategy for learning efficiency. Experiments show that Shakti-VLM-1B and Shakti-VLM-4B excel at document understanding, visual reasoning, OCR extraction, and general multimodal reasoning, demonstrating that model design and training strategy can substantially reduce dependence on sheer data volume.
Link: https://arxiv.org/abs/2502.17092
Authors: Syed Abdul Gaffar Shakhadri,Kruthika KR,Kartik Basavaraj Angadi
Affiliations: SandLogic Technologies Pvt Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:We introduce Shakti VLM, a family of vision-language models in the capacity of 1B and 4B parameters designed to address data efficiency challenges in multimodal learning. While recent VLMs achieve strong performance through extensive training data, Shakti models leverage architectural innovations to attain competitive results with fewer tokens. Key advancements include QK-Normalization for attention stability, hybrid normalization techniques, and enhanced positional encoding. A three-stage training strategy further optimizes learning efficiency. Evaluations show that Shakti-VLM-1B and Shakti-VLM-4B excel in document understanding, Visual Reasoning, OCR extraction, and general multimodal reasoning. Our results highlight that high performance can be achieved through model design and training strategy rather than sheer data volume, making Shakti an efficient solution for enterprise-scale multimodal tasks.
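QK-Normalization is a known stabilization trick: L2-normalize queries and keys so that attention logits become bounded cosine similarities. A minimal sketch in PyTorch (the scale constant is illustrative; Shakti-VLM's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def qk_norm_attention(q, k, v, scale=10.0):
    """Attention with L2-normalized queries and keys (QK-Normalization),
    which bounds logit magnitude and stabilizes training."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = scale * q @ k.transpose(-2, -1)   # scaled cosine similarities
    return F.softmax(logits, dim=-1) @ v

q, k, v = (torch.randn(2, 8, 64) for _ in range(3))
print(qk_norm_attention(q, k, v).shape)  # torch.Size([2, 8, 64])
```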
[NLP-73] FIRE: Fact-checking with Iterative Retrieval and Verification NAACL
【Quick Read】: This paper addresses the inefficiency of long-form fact-checking pipelines that retrieve a fixed number of evidence pieces and separate retrieval from verification, underusing the large language model's (LLM's) internal knowledge of the claim and failing to replicate the iterative reasoning of human search. The key to the solution is FIRE, an agent-based framework that integrates evidence retrieval and claim verification iteratively: at each round, based on its confidence in the current judgment, it decides whether to output a final answer or generate a new search query, cutting LLM and search costs substantially while matching or slightly exceeding strong baselines.
Link: https://arxiv.org/abs/2411.00784
Authors: Zhuohan Xie,Rui Xing,Yuxia Wang,Jiahui Geng,Hasan Iqbal,Dhruv Sahnan,Iryna Gurevych,Preslav Nakov
Affiliations: MBZUAI; The University of Melbourne
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 4 figures, 8 tables, accepted to Findings of NAACL
Abstract:Fact-checking long-form text is challenging, and it is therefore common practice to break it down into multiple atomic claims. The typical approach to fact-checking these atomic claims involves retrieving a fixed number of pieces of evidence, followed by a verification step. However, this method is usually not cost-effective, as it underutilizes the verification model’s internal knowledge of the claim and fails to replicate the iterative reasoning process in human search strategies. To address these limitations, we propose FIRE, a novel agent-based framework that integrates evidence retrieval and claim verification in an iterative manner. Specifically, FIRE employs a unified mechanism to decide whether to provide a final answer or generate a subsequent search query, based on its confidence in the current judgment. We compare FIRE with other strong fact-checking frameworks and find that it achieves slightly better performance while reducing large language model (LLM) costs by an average of 7.6 times and search costs by 16.5 times. These results indicate that FIRE holds promise for application in large-scale fact-checking operations. Our code is available at this https URL.
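A minimal sketch of the answer-or-search loop: at each round the model either commits to a verdict or issues a new query. The `llm_step` and `search` functions are hypothetical stubs standing in for the LLM and retrieval backend:

```python
def llm_step(claim, evidence):
    """Return ('answer', verdict) or ('search', query). Stubbed for the sketch:
    a real system would prompt an LLM and read off its confidence."""
    if evidence:
        return "answer", "supported"
    return "search", f"evidence for: {claim}"

def search(query):
    """Stub retrieval backend returning evidence snippets."""
    return [f"snippet about {query}"]

def fire(claim, max_rounds=5):
    evidence = []
    for _ in range(max_rounds):
        action, payload = llm_step(claim, evidence)
        if action == "answer":
            return payload          # confident: stop early, saving search calls
        evidence += search(payload) # otherwise gather more evidence and retry
    return "not enough info"

print(fire("The Eiffel Tower is in Berlin."))
```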
[NLP-74] Exploring the Synergy of Quantitative Factors and Newsflow Representations from Large Language Models for Stock Return Prediction
【Quick Read】: This paper studies how to fuse multimodal factors (quantitative factors such as valuation, quality, and growth) with newsflow for stock return prediction, in order to improve stock selection and portfolio optimization. The core challenges are integrating structured factors with unstructured text (news representations generated by an LLM) and coping with the resulting training instability and fluctuating predictive performance. The key to the solution is threefold: a fusion learning framework comparing three representative methods (representation combination, representation summation, and attentive representations) for building unified representations; a mixture model, motivated by empirical observations from fusion learning, that adaptively combines single-modality and fused predictions; and a decoupled training approach with theoretical insights that mitigates the mixture model's training instability, yielding more robust and efficient multimodal stock return prediction.
Link: https://arxiv.org/abs/2510.15691
Authors: Tian Guo,Emmanuel Hauptmann
Affiliations: Unknown
Subjects: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:In quantitative investing, return prediction supports various tasks, including stock selection, portfolio optimization, and risk management. Quantitative factors, such as valuation, quality, and growth, capture various characteristics of stocks. Unstructured financial data, like news and transcripts, has attracted growing attention, driven by recent advances in large language models (LLMs). This paper examines effective methods for leveraging multimodal factors and newsflow in return prediction and stock selection. First, we introduce a fusion learning framework to learn a unified representation from factors and newsflow representations generated by an LLM. Within this framework, we compare three representative methods: representation combination, representation summation, and attentive representations. Next, building on empirical observations from fusion learning, we explore the mixture model that adaptively combines predictions made by single modalities and their fusion. To mitigate the training instability observed in the mixture model, we introduce a decoupled training approach with theoretical insights. Finally, our experiments on real investment universes yield several insights into effective multimodal modeling of factors and news for stock return prediction.
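A minimal sketch of the attentive-representations variant, one of the three fusion methods compared above, assuming PyTorch; the dimensions are illustrative:

```python
import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    """Score the factor and news representations, softmax over the two
    modalities, and return their weighted sum as the unified representation."""
    def __init__(self, dim=64):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, factor_repr, news_repr):            # (B, dim) each
        stack = torch.stack([factor_repr, news_repr], 1)  # (B, 2, dim)
        w = torch.softmax(self.score(stack), dim=1)       # (B, 2, 1) weights
        return (w * stack).sum(dim=1)                     # (B, dim)

f, n = torch.randn(8, 64), torch.randn(8, 64)
print(AttentiveFusion()(f, n).shape)  # torch.Size([8, 64])
```

Representation summation would simply return `factor_repr + news_repr`, and combination would concatenate and project; the attentive variant lets the model weight each modality per stock.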
[NLP-75] The Coverage Principle: How Pre-training Enables Post-Training
【Quick Read】: This paper addresses the poorly understood mechanism behind why pre-trained language models succeed downstream, in particular why cross-entropy loss, the usual measure of pre-training success, often fails to predict final performance. The core of the solution is the "coverage principle": next-token prediction during pre-training implicitly optimizes the probability mass the model places on high-quality responses (coverage), and this coverage is necessary and sufficient for post-training and test-time scaling methods such as Best-of-N to succeed. The key insight is that coverage generalizes faster than cross-entropy and avoids spurious dependence on problem-dependent parameters such as sequence length, making it a more reliable predictor of downstream performance; the paper further gives three algorithmic interventions with provable coverage benefits: model/checkpoint selection, gradient normalization, and test-time decoding strategies.
Link: https://arxiv.org/abs/2510.15020
Authors: Fan Chen,Audrey Huang,Noah Golowich,Sadhika Malladi,Adam Block,Jordan T. Ash,Akshay Krishnamurthy,Dylan J. Foster
Affiliations: Unknown
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Statistics Theory (math.ST)
Comments:
Abstract:Language models demonstrate remarkable abilities when pre-trained on large text corpora and fine-tuned for specific tasks, but how and why pre-training shapes the success of the final model remains poorly understood. Notably, although pre-training success is often quantified by cross-entropy loss, cross-entropy can be a poor predictor of downstream performance. Instead, we provide a theoretical perspective on this relationship through the lens of coverage, which quantifies the probability mass the pre-trained model places on high-quality responses and which is necessary and sufficient for post-training and test-time scaling methods such as Best-of-N to succeed. Our main results develop an understanding of the coverage principle, a phenomenon whereby next-token prediction implicitly optimizes toward a model with good coverage. In particular, we uncover a mechanism that explains the power of coverage in predicting downstream performance: coverage generalizes faster than cross-entropy, avoiding spurious dependence on problem-dependent parameters such as the sequence length. We also study practical algorithmic interventions with provable benefits for improving coverage, including (i) model/checkpoint selection procedures, (ii) gradient normalization schemes, and (iii) test-time decoding strategies.
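The role of coverage in Best-of-N can be seen with a one-line computation: if the model places probability p on a high-quality response, at least one of N independent samples is high quality with probability 1 - (1 - p)^N. A small illustration (the p values are invented):

```python
def best_of_n_success(p, n):
    """Probability that at least one of n i.i.d. samples is high quality."""
    return 1 - (1 - p) ** n

for p in (0.01, 0.05, 0.2):
    print(p, [round(best_of_n_success(p, n), 3) for n in (1, 8, 64)])
# Even small coverage p compounds quickly with N, which is why coverage,
# not cross-entropy, is the quantity that governs test-time scaling.
```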
Computer Vision
[CV-0] Skyfall-GS: Synthesizing Immersive 3D Urban Scenes from Satellite Imagery
【Quick Read】: This paper tackles the generation of large-scale, explorable, and geometrically accurate 3D urban scenes, whose core difficulty is the lack of large, high-quality real-world 3D scans for training generalizable generative models. The key to the solution is to exploit readily available satellite imagery for realistic coarse geometry and an open-domain diffusion model for high-fidelity local appearance, building city-block-scale 3D scenes without costly 3D annotations. The proposed Skyfall-GS framework uses a curriculum-driven iterative refinement strategy to progressively improve geometric completeness and photorealistic texture, enabling real-time immersive 3D exploration and clearly outperforming state-of-the-art methods in cross-view geometric consistency and texture realism.
Link: https://arxiv.org/abs/2510.15869
Authors: Jie-Ying Lee,Yi-Ruei Liu,Shr-Ruei Tsai,Wei-Cheng Chang,Chung-Ho Wu,Jiewen Chan,Zhenjun Zhao,Chieh Hubert Lin,Yu-Lun Liu
Affiliations: National Yang Ming Chiao Tung University; UIUC; University of Zaragoza; UC Merced
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL
Abstract:Synthesizing large-scale, explorable, and geometrically accurate 3D urban scenes is a challenging yet valuable task in providing immersive and embodied applications. The challenges lie in the lack of large-scale and high-quality real-world 3D scans for training generalizable generative models. In this paper, we take an alternative route to create large-scale 3D scenes by synergizing readily available satellite imagery, which supplies realistic coarse geometry, with an open-domain diffusion model for creating high-quality close-up appearances. We propose Skyfall-GS, the first city-block scale 3D scene creation framework without costly 3D annotations, also featuring real-time, immersive 3D exploration. We tailor a curriculum-driven iterative refinement strategy to progressively enhance geometric completeness and photorealistic textures. Extensive experiments demonstrate that Skyfall-GS provides improved cross-view consistent geometry and more realistic textures compared to state-of-the-art approaches. Project page: this https URL
[CV-1] LightsOut: Diffusion-based Outpainting for Enhanced Lens Flare Removal ICCV2025
【Quick Read】: This paper addresses the sharp performance drop of single image flare removal (SIFR) methods when off-frame light sources are incomplete or missing. The key to the solution is LightsOut, a diffusion-based outpainting framework that reconstructs off-frame light sources; a multitask regression module works with a LoRA fine-tuned diffusion model to ensure physically consistent and visually realistic outpainting, boosting the robustness of existing SIFR methods in challenging scenarios without retraining and serving as a universally applicable plug-and-play preprocessing step.
Link: https://arxiv.org/abs/2510.15868
Authors: Shr-Ruei Tsai,Wei-Cheng Chang,Jie-Ying Lee,Chih-Hai Su,Yu-Lun Liu
Affiliations: National Yang Ming Chiao Tung University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025. Project page: this https URL
Abstract:Lens flare significantly degrades image quality, impacting critical computer vision tasks like object detection and autonomous driving. Recent Single Image Flare Removal (SIFR) methods perform poorly when off-frame light sources are incomplete or absent. We propose LightsOut, a diffusion-based outpainting framework tailored to enhance SIFR by reconstructing off-frame light sources. Our method leverages a multitask regression module and LoRA fine-tuned diffusion model to ensure realistic and physically consistent outpainting results. Comprehensive experiments demonstrate LightsOut consistently boosts the performance of existing SIFR methods across challenging scenarios without additional retraining, serving as a universally applicable plug-and-play preprocessing solution. Project page: this https URL
[CV-2] BiomedXPro: Prompt Optimization for Explainable Diagnosis with Biomedical Vision Language Models
【Quick Read】: This paper addresses the lack of transparency in prompt-optimization techniques for biomedical vision-language models, which produce uninterpretable latent vectors or a single textual prompt, undermining trust in high-stakes clinical settings. The key to the solution is BiomedXPro, an evolutionary framework that uses a large language model (LLM) as both a biomedical knowledge extractor and an adaptive optimizer to automatically generate a diverse ensemble of interpretable, natural-language prompt pairs for disease diagnosis, better capturing the multi-faceted nature of clinical diagnosis. Experiments show that BiomedXPro consistently outperforms state-of-the-art prompt-tuning methods, particularly in data-scarce few-shot settings, and that the discovered prompts align strongly with statistically significant clinical features, grounding predictions in verifiable concepts.
Link: https://arxiv.org/abs/2510.15866
Authors: Kaushitha Silva,Mansitha Eashwara,Sanduni Ubayasiri,Ruwan Tennakoon,Damayanthi Herath
Affiliations: University of Peradeniya; RMIT University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments: 10 pages + 15 supplementary material pages, 5 figures
Abstract:The clinical adoption of biomedical vision-language models is hindered by prompt optimization techniques that produce either uninterpretable latent vectors or single textual prompts. This lack of transparency and failure to capture the multi-faceted nature of clinical diagnosis, which relies on integrating diverse observations, limits their trustworthiness in high-stakes settings. To address this, we introduce BiomedXPro, an evolutionary framework that leverages a large language model as both a biomedical knowledge extractor and an adaptive optimizer to automatically generate a diverse ensemble of interpretable, natural-language prompt pairs for disease diagnosis. Experiments on multiple biomedical benchmarks show that BiomedXPro consistently outperforms state-of-the-art prompt-tuning methods, particularly in data-scarce few-shot settings. Furthermore, our analysis demonstrates a strong semantic alignment between the discovered prompts and statistically significant clinical features, grounding the model’s performance in verifiable concepts. By producing a diverse ensemble of interpretable prompts, BiomedXPro provides a verifiable basis for model predictions, representing a critical step toward the development of more trustworthy and clinically-aligned AI systems.
[CV-3] BLIP3o-NEXT: Next Frontier of Native Image Generation
【Quick Read】: This paper aims to unify text-to-image generation and image editing in a single open-source foundation model with strong generation and editing capabilities. The key to the solution is a paradigm that combines autoregressive and diffusion architectures: an autoregressive model first generates discrete image tokens conditioned on multimodal inputs, and the hidden states of these tokens then serve as conditioning signals for a diffusion model that renders high-fidelity images. This design marries the instruction following and reasoning strengths of autoregressive models with the fine-detail rendering of diffusion models, achieving a new level of coherence and realism.
Link: https://arxiv.org/abs/2510.15857
Authors: Jiuhai Chen,Le Xue,Zhiyang Xu,Xichen Pan,Shusheng Yang,Can Qin,An Yan,Honglu Zhou,Zeyuan Chen,Lifu Huang,Tianyi Zhou,Junnan Li,Silvio Savarese,Caiming Xiong,Ran Xu
Affiliations: Salesforce Research; University of Maryland; Virginia Tech; New York University; UC Davis
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present BLIP3o-NEXT, a fully open-source foundation model in the BLIP3 series that advances the next frontier of native image generation. BLIP3o-NEXT unifies text-to-image generation and image editing within a single architecture, demonstrating strong image generation and image editing capabilities. In developing the state-of-the-art native image generation model, we identify four key insights: (1) Most architectural choices yield comparable performance; an architecture can be deemed effective provided it scales efficiently and supports fast inference; (2) The successful application of reinforcement learning can further push the frontier of native image generation; (3) Image editing still remains a challenging task, yet instruction following and the consistency between generated and reference images can be significantly enhanced through post-training and data engine; (4) Data quality and scale continue to be decisive factors that determine the upper bound of model performance. Building upon these insights, BLIP3o-NEXT leverages an Autoregressive + Diffusion architecture in which an autoregressive model first generates discrete image tokens conditioned on multimodal inputs, whose hidden states are then used as conditioning signals for a diffusion model to generate high-fidelity images. This architecture integrates the reasoning strength and instruction following of autoregressive models with the fine-detail rendering ability of diffusion models, achieving a new level of coherence and realism. Extensive evaluations of various text-to-image and image-editing benchmarks show that BLIP3o-NEXT achieves superior performance over existing models.
[CV-4] Memory-SAM: Human-Prompt-Free Tongue Segmentation via Retrieval-to-Prompt
【Quick Read】: This paper addresses the accuracy of tongue segmentation for traditional Chinese medicine (TCM) analysis, where supervised models require large annotated datasets and SAM-family models still depend on manual prompts. The key to the solution is Memory-SAM, a training-free, human-prompt-free framework that automatically generates effective prompts from a small memory of prior cases using dense DINOv3 features and FAISS retrieval: given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2, yielding robust and efficient segmentation of tongues with irregular boundaries.
Link: https://arxiv.org/abs/2510.15849
Authors: Joongwon Chae,Lihui Luo,Xi Yuan,Dongmei Yu,Zhenglin Chen,Lian Zhang,Peiwu Qin
Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; Affiliated Fifth Hospital, Wenzhou Medical University; Zhejiang Key Laboratory of Imaging and Interventional Medicine; The Fifth Affiliated Hospital of Wenzhou Medical University; The First Hospital of Hebei Medical University; Wenzhou Medical University; Hengqin Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Accurate tongue segmentation is crucial for reliable TCM analysis. Supervised models require large annotated datasets, while SAM-family models remain prompt-driven. We present Memory-SAM, a training-free, human-prompt-free pipeline that automatically generates effective prompts from a small memory of prior cases via dense DINOv3 features and FAISS retrieval. Given a query image, mask-constrained correspondences to the retrieved exemplar are distilled into foreground/background point prompts that guide SAM2 without manual clicks or model fine-tuning. We evaluate on 600 expert-annotated images (300 controlled, 300 in-the-wild). On the mixed test split, Memory-SAM achieves mIoU 0.9863, surpassing FCN (0.8188) and a detector-to-box SAM baseline (0.1839). On controlled data, ceiling effects above 0.98 make small differences less meaningful given annotation variability, while our method shows clear gains under real-world conditions. Results indicate that retrieval-to-prompt enables data-efficient, robust segmentation of irregular boundaries in tongue imaging. The code is publicly available at this https URL.
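A minimal sketch of the retrieval step, assuming faiss and numpy are available; the random features and two-image memory below are placeholders for dense DINOv3 features of prior cases:

```python
import numpy as np
import faiss

dim = 384
memory_feats = np.random.rand(2, dim).astype("float32")  # features of prior cases
faiss.normalize_L2(memory_feats)
index = faiss.IndexFlatIP(dim)          # inner product on unit vectors = cosine
index.add(memory_feats)

query = np.random.rand(1, dim).astype("float32")          # query image feature
faiss.normalize_L2(query)
scores, ids = index.search(query, 1)    # nearest exemplar in the memory
print("nearest exemplar:", int(ids[0][0]), "score:", float(scores[0][0]))

# The retrieved exemplar's mask would then be used to distill foreground and
# background point prompts that guide SAM2, with no manual clicks required.
```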
[CV-5] 3DPR: Single Image 3D Portrait Relight using Generative Priors SIGGRAPH
【Quick Read】: This paper tackles the rendering of novel, relit views of a human head from a single portrait image, an inherently underconstrained problem. Traditional graphics methods rely on differentiable rendering to explicitly decompose the input into geometry, material, and lighting, but are limited by model assumptions and parameterizations. The key to the solution is 3DPR, an image-based relighting method built on generative priors: multi-view One-Light-at-a-Time (OLAT) images captured in a light stage provide a high-fidelity prior over high-frequency facial reflectance, combined with the latent space of a pre-trained generative head model; the input portrait is embedded in that latent manifold via encoder-based inversion, and a novel triplane-based reflectance network synthesizes OLAT images for image-based relighting. Operating in the generative model's latent space lets a relatively small amount of light-stage data train a high-quality reflectance model, and combining the OLATs according to an HDRI environment map yields physically accurate environmental relighting, outperforming prior methods in identity preservation and in capturing specularities, self-shadows, and subsurface scattering.
Link: https://arxiv.org/abs/2510.15846
Authors: Pramod Rao,Abhimitra Meka,Xilong Zhou,Gereon Fox,Mallikarjun B R,Fangneng Zhan,Tim Weyrich,Bernd Bickel,Hanspeter Pfister,Wojciech Matusik,Thabo Beeler,Mohamed Elgharib,Marc Habermann,Christian Theobalt
Affiliations: Max Planck Institute for Informatics & SIC, Saarbrücken, Germany; Google Inc.; Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany; ETH Zürich, Switzerland; Harvard University, Cambridge, USA; Massachusetts Institute of Technology, Cambridge, USA; Max Planck Institute for Informatics, SIC & VIA Research Center, Saarbrücken, Germany
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ACM SIGGRAPH ASIA 2025 Conference Proceedings
Abstract:Rendering novel, relit views of a human head, given a monocular portrait image as input, is an inherently underconstrained problem. The traditional graphics solution is to explicitly decompose the input image into geometry, material and lighting via differentiable rendering; but this is constrained by the multiple assumptions and approximations of the underlying models and parameterizations of these scene components. We propose 3DPR, an image-based relighting model that leverages generative priors learnt from multi-view One-Light-at-A-Time (OLAT) images captured in a light stage. We introduce a new diverse and large-scale multi-view 4K OLAT dataset of 139 subjects to learn a high-quality prior over the distribution of high-frequency face reflectance. We leverage the latent space of a pre-trained generative head model that provides a rich prior over face geometry learnt from in-the-wild image datasets. The input portrait is first embedded in the latent manifold of such a model through an encoder-based inversion process. Then a novel triplane-based reflectance network trained on our lightstage data is used to synthesize high-fidelity OLAT images to enable image-based relighting. Our reflectance network operates in the latent space of the generative head model, crucially enabling a relatively small number of lightstage images to train the reflectance model. Combining the generated OLATs according to a given HDRI environment map yields physically accurate environmental relighting results. Through quantitative and qualitative evaluations, we demonstrate that 3DPR outperforms previous methods, particularly in preserving identity and in capturing lighting effects such as specularities, self-shadows, and subsurface scattering. Project Page: this https URL
[CV-6] Neuro-Symbolic Spatial Reasoning in Segmentation
【Quick Read】: This paper addresses the weak generalization of open-vocabulary semantic segmentation (OVSS) to unseen categories, in particular the failure of existing vision-language model (VLM) correlation-based methods to understand spatial relations among objects in a scene. The key to the solution is neuro-symbolic (NeSy) spatial reasoning: the proposed Relational Segmentor (RelateSeg) explicitly models spatial relational constraints between pixels with first-order logic (FOL), embeds them in a deep network via fuzzy logic relaxation, and learns spatially consistent segmentation end-to-end. The method adds only a single auxiliary loss and no extra parameters, yet clearly improves segmentation on multi-category images, validating the effectiveness of NeSy spatial reasoning for OVSS.
Link: https://arxiv.org/abs/2510.15841
Authors: Jiayi Lin,Jiabo Huang,Shaogang Gong
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Open-Vocabulary Semantic Segmentation (OVSS) assigns pixel-level labels from an open set of categories, requiring generalization to unseen and unlabelled objects. Using vision-language models (VLMs) to correlate local image patches with potential unseen object categories suffers from a lack of understanding of spatial relations of objects in a scene. To solve this problem, we introduce neuro-symbolic (NeSy) spatial reasoning in OVSS. In contrast to contemporary VLM correlation-based approaches, we propose Relational Segmentor (RelateSeg) to impose explicit spatial relational constraints by first-order logic (FOL) formulated in a neural network architecture. This is the first attempt to explore NeSy spatial reasoning in OVSS. Specifically, RelateSeg automatically extracts spatial relations, e.g., (cat, to-right-of, person), and encodes them as first-order logic formulas using our proposed pseudo categories. Each pixel learns to predict both a semantic category (e.g., "cat") and a spatial pseudo category (e.g., "right of person") simultaneously, enforcing relational constraints (e.g., a "cat" pixel must lie to the right of a "person"). Finally, these logic constraints are formulated in a deep network architecture by fuzzy logic relaxation, enabling end-to-end learning of spatial-relationally consistent segmentation. RelateSeg achieves state-of-the-art performance in terms of average mIoU across four benchmark datasets and particularly shows clear advantages on images containing multiple categories, at the cost of only introducing a single auxiliary loss function and no additional parameters, validating the effectiveness of NeSy spatial reasoning in OVSS.
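A schematic sketch of one fuzzy-relaxed constraint, assuming PyTorch: the FOL relation "cat to-right-of person" becomes a differentiable penalty on cat probability mass that lies left of the person's expected x-position. The exact relaxation used by RelateSeg may differ; this only illustrates the general idea of turning a logic formula into an auxiliary loss:

```python
import torch

def right_of_violation(prob_cat, prob_person, xs):
    """prob_*: (pixels,) soft masks for each category; xs: (pixels,) x-coords.
    Returns a differentiable penalty that is low when cat mass sits to the
    right of the person's expected x-position."""
    person_x = (prob_person * xs).sum() / (prob_person.sum() + 1e-6)
    left_of_person = (xs < person_x).float()
    return (prob_cat * left_of_person).mean()   # usable as an auxiliary loss

xs = torch.arange(100).float()
cat = torch.softmax(torch.randn(100), 0)
person = torch.softmax(torch.randn(100), 0)
print(right_of_violation(cat, person, xs))
```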
[CV-7] VISTA: A Test-Time Self-Improving Video Generation Agent
【Quick Read】: This paper addresses the strong dependence of text-to-video generation quality on precise user prompts, and the poor performance of existing test-time optimization methods on video's multi-faceted nature (visual, audio, and contextual fidelity). The key to the solution is VISTA (Video Iterative Self-improvemenT Agent), a multi-agent system that improves videos through iterative prompt refinement: it first structures the user's idea into a temporal plan, selects the best generated video via a pairwise tournament, has three specialized agents critique visual, audio, and contextual fidelity, and finally lets a reasoning agent synthesize the feedback and rewrite the prompt for the next generation cycle. This closed feedback loop consistently improves video quality and alignment with user intent, clearly outperforming state-of-the-art baselines.
Link: https://arxiv.org/abs/2510.15831
Authors: Do Xuan Long,Xingchen Wan,Hootan Nakhost,Chen-Yu Lee,Tomas Pfister,Sercan Ö. Arık
Affiliations: Google
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Despite rapid advances in text-to-video synthesis, generated video quality remains critically dependent on precise user prompts. Existing test-time optimization methods, successful in other domains, struggle with the multi-faceted nature of video. In this work, we introduce VISTA (Video Iterative Self-improvemenT Agent), a novel multi-agent system that autonomously improves video generation through refining prompts in an iterative loop. VISTA first decomposes a user idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. Experiments on single- and multi-scene video generation scenarios show that while prior methods yield inconsistent gains, VISTA consistently improves video quality and alignment with user intent, achieving up to 60% pairwise win rate against state-of-the-art baselines. Human evaluators concur, preferring VISTA outputs in 66.4% of comparisons.
[CV-8] ERNet: Efficient Non-Rigid Registration Network for Point Sequences ICCV2025
【Quick Read】: This paper addresses the long-standing difficulty of registering an object shape to point cloud sequences undergoing non-rigid deformation, whose core challenges are: (i) the non-convexity of the registration objective causes local minima under noisy or partial inputs, yielding inaccurate and fragile deformation estimates; and (ii) error accumulation over long sequences leads to tracking failures. The key to the solution is ERNet, an efficient feed-forward model trained on large deformation datasets that predicts a sequence of deformation graphs through a two-stage pipeline: frame-wise coarse graph nodes are first estimated for robust initialization, and their temporal trajectories are then refined in a sliding-window fashion, effectively exploiting temporal information and improving registration accuracy and consistency.
Link: https://arxiv.org/abs/2510.15800
Authors: Guangzhao He,Yuxi Xiao,Zhen Xu,Xiaowei Zhou,Sida Peng
Affiliations: Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025. Project Page: this https URL
Abstract:Registering an object shape to a sequence of point clouds undergoing non-rigid deformation is a long-standing challenge. The key difficulties stem from two factors: (i) the presence of local minima due to the non-convexity of registration objectives, especially under noisy or partial inputs, which hinders accurate and robust deformation estimation, and (ii) error accumulation over long sequences, leading to tracking failures. To address these challenges, we adopt a scalable data-driven approach and propose ERNet, an efficient feed-forward model trained on large deformation datasets. It is designed to handle noisy and partial inputs while effectively leveraging temporal information for accurate and consistent sequential registration. The key to our design is predicting a sequence of deformation graphs through a two-stage pipeline, which first estimates frame-wise coarse graph nodes for robust initialization, before refining their trajectories over time in a sliding-window fashion. Extensive experiments show that our proposed approach (i) outperforms the previous state-of-the-art on both the DeformingThings4D and D-FAUST datasets, and (ii) achieves more than 4x speedup compared to the previous best, offering a significant efficiency improvement.
zh
[CV-9] ReCon: Region-Controllable Data Augmentation with Rectification and Alignment for Object Detection NEURIPS2025
【速读】:该论文旨在解决生成式数据增强中因内容-位置错位(content-position mismatch)和语义泄露(semantic leakage)导致的图像质量与训练有效性不足的问题。当前方法通常依赖复杂的后处理或大规模微调,难以保证生成样本的结构可控性和语义一致性。其解决方案的关键在于提出ReCon框架,通过在扩散采样过程中引入区域引导的修正机制(region-guided rectification),利用预训练感知模型的反馈动态修正生成图像中的错误区域;同时设计区域对齐交叉注意力(region-aligned cross-attention),强化图像区域与文本提示之间的空间-语义对齐,从而显著提升生成数据的质量与可训练性。
链接: https://arxiv.org/abs/2510.15783
作者: Haowei Zhu,Tianxiang Pan,Rui Qin,Jun-Hai Yong,Bin Wang
机构: Tsinghua University (清华大学); Li Auto Inc. (理想汽车); BNRist (北京信息科学与技术国家研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to NeurIPS 2025 (spotlight)
Abstract:The scale and quality of datasets are crucial for training robust perception models. However, obtaining large-scale annotated data is both costly and time-consuming. Generative models have emerged as a powerful tool for data augmentation by synthesizing samples that adhere to desired distributions. However, current generative approaches often rely on complex post-processing or extensive fine-tuning on massive datasets to achieve satisfactory results, and they remain prone to content-position mismatches and semantic leakage. To overcome these limitations, we introduce ReCon, a novel augmentation framework that enhances the capacity of structure-controllable generative models for object detection. ReCon integrates region-guided rectification into the diffusion sampling process, using feedback from a pre-trained perception model to rectify misgenerated regions during sampling. We further propose region-aligned cross-attention to enforce spatial-semantic alignment between image regions and their textual cues, thereby improving both semantic consistency and overall image fidelity. Extensive experiments demonstrate that ReCon substantially improves the quality and trainability of generated data, achieving consistent performance gains across various datasets, backbone architectures, and data scales. Our code is available at this https URL .
zh
[CV-10] Controlling the image generation process with parametric activation functions
【速读】:该论文旨在解决生成式 AI(Generative AI)模型内部机制缺乏可解释性与可控性的问题,尤其是在用户难以通过直观方式干预模型输出的情况下。其解决方案的关键在于引入一个交互式系统,允许用户替换生成网络中的激活函数为可参数化的替代函数,并提供调节这些参数的手段,从而实现对模型输出的细粒度控制。该方法在StyleGAN2和BigGAN等主流生成模型上得到了验证,展示了通过直接操作内部机制提升模型可解释性和可控性的可行性。
链接: https://arxiv.org/abs/2510.15778
作者: Ilia Pavlov
机构: Creative Computing Institute (创意计算研究所); University of the Arts London (伦敦艺术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 5 pages, 5 figures, accepted for the 16th International Conference on Computational Creativity, ICCC’25
Abstract:As image generative models continue to increase not only in their fidelity but also in their ubiquity, the development of tools that leverage direct interaction with their internal mechanisms in an interpretable way has received little attention. In this work, we introduce a system that allows users to develop a better understanding of the model through interaction and experimentation. By giving users the ability to replace activation functions of a generative network with parametric ones, and a way to set the parameters of these functions, we introduce an alternative approach to control the network's output. We demonstrate the use of our method on StyleGAN2 and BigGAN networks trained on FFHQ and ImageNet, respectively.
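论文的核心操作,即用可参数化激活函数替换生成网络中的原激活并暴露其参数供用户调节,可以用如下PyTorch小例子示意;ParametricAct 及其参数 alpha、beta 为本文假设的示例设计,并非论文原始接口:

```python
# 最小示意:用带可调参数的激活函数替换网络中的 ReLU(PyTorch)。
# ParametricAct 及其参数 alpha/beta 为假设设计,仅演示"替换 + 调参"的思路。
import torch
import torch.nn as nn

class ParametricAct(nn.Module):
    def __init__(self, alpha=1.0, beta=0.0):
        super().__init__()
        # 用户可交互调节的参数:缩放 alpha 与平移 beta
        self.alpha = nn.Parameter(torch.tensor(alpha))
        self.beta = nn.Parameter(torch.tensor(beta))

    def forward(self, x):
        return torch.tanh(self.alpha * x + self.beta)

def replace_activations(module):
    # 递归地将子模块中的 ReLU 换成参数化激活
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, ParametricAct())
        else:
            replace_activations(child)

net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8), nn.ReLU())
replace_activations(net)
print(net)  # ReLU 已被 ParametricAct 取代,推理时调节 alpha/beta 即可观察输出变化
```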
zh
[CV-11] Towards more holistic interpretability: A lightweight disentangled Concept Bottleneck Model
【速读】:该论文旨在解决现有概念瓶颈模型(Concept Bottleneck Models, CBMs)中存在的输入到概念映射偏差(input-to-concept mapping bias)以及可控性不足的问题,这些问题限制了CBMs在实际应用中的可靠性与可解释性。解决方案的关键在于提出一种轻量级解耦概念瓶颈模型(Lightweight Disentangled Concept Bottleneck Model, LDCBM),其通过引入滤波分组损失(filter grouping loss)和联合概念监督机制,自动将视觉特征划分为语义上合理的组件,无需区域标注即可实现更精准的概念-视觉对齐,从而提升决策的透明度与鲁棒性。
链接: https://arxiv.org/abs/2510.15770
作者: Gaoxiang Huang,Songning Lai,Yutao Yue
机构: HKUST(GZ); Deep Interdisciplinary Intelligence Lab
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Concept Bottleneck Models (CBMs) enhance interpretability by predicting human-understandable concepts as intermediate representations. However, existing CBMs often suffer from input-to-concept mapping bias and limited controllability, which restricts their practical value and undermines the reliability of concept-based decision-making. We propose a lightweight Disentangled Concept Bottleneck Model (LDCBM) that automatically groups visual features into semantically meaningful components without region annotation. By introducing a filter grouping loss and joint concept supervision, our method improves the alignment between visual patterns and concepts, enabling more transparent and robust decision-making. Notably, experiments on three diverse datasets demonstrate that LDCBM achieves higher concept and class accuracy, outperforming previous CBMs in both interpretability and classification performance. By grounding concepts in visual evidence, our method overcomes a fundamental limitation of prior models and enhances the reliability of interpretable AI.
zh
[CV-12] QSilk: Micrograin Stabilization and Adaptive Quantile Clipping for Detail-Friendly Latent Diffusion
【速读】:该论文旨在解决潜扩散模型(latent diffusion)在生成图像时存在的高频细节失真和罕见激活峰值(activation spikes)问题,这些问题会导致图像纹理模糊或出现异常噪声。解决方案的关键在于提出QSilk——一个轻量级、始终启用的稳定层,其核心由两部分组成:(i) 每样本微钳位(per-sample micro clamp),可温和限制极端值而不破坏纹理细节;(ii) 自适应分位数裁剪(Adaptive Quantile Clip, AQClip),可根据局部结构统计或注意力熵引导(模型置信度)动态调整各区域允许的数值范围。该方法无需训练或微调,集成至CADE 2.5渲染管线后可在低步数和超高清分辨率下显著提升图像清晰度与锐度,且计算开销极低。
链接: https://arxiv.org/abs/2510.15761
作者: Denis Rychkovskiy (DZRobo, Independent Researcher)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Preprint. Qualitative side-by-side comparisons (fixed seeds); 3 figures with subfigures; 1 algorithm. CADE 2.5 / SDXL integration; sample images included. Code and presets planned for release upon publication
Abstract:We present QSilk, a lightweight, always-on stabilization layer for latent diffusion that improves high-frequency fidelity while suppressing rare activation spikes. QSilk combines (i) a per-sample micro clamp that gently limits extreme values without washing out texture, and (ii) Adaptive Quantile Clip (AQClip), which adapts the allowed value corridor per region. AQClip can operate in a proxy mode using local structure statistics or in an attention entropy guided mode (model confidence). Integrated into the CADE 2.5 rendering pipeline, QSilk yields cleaner, sharper results at low step counts and ultra-high resolutions with negligible overhead. It requires no training or fine-tuning and exposes minimal user controls. We report consistent qualitative improvements across SD/SDXL backbones and show synergy with CFG/Rescale, enabling slightly higher guidance without artifacts.
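下面用一个极简的PyTorch片段示意"按样本分位数裁剪"的思路:对每个样本独立估计允许的数值走廊,将罕见的激活尖峰温和拉回。分位数 q 等超参数与具体实现均为本文假设,仅用于说明 AQClip 的概念,并非 QSilk 原始代码:

```python
# 概念性示意:对潜空间张量做"按样本分位数裁剪",温和抑制极端激活值。
import torch

def quantile_clip(latents, q=0.999):
    # latents: (B, C, H, W);对每个样本独立估计允许的数值走廊
    flat = latents.flatten(1)
    hi = torch.quantile(flat, q, dim=1, keepdim=True)
    lo = torch.quantile(flat, 1 - q, dim=1, keepdim=True)
    clipped = torch.max(torch.min(flat, hi), lo)  # 将极端值拉回走廊内
    return clipped.view_as(latents)

x = torch.randn(2, 4, 64, 64)
x[0, 0, 0, 0] = 50.0           # 人为注入一个罕见的激活尖峰
y = quantile_clip(x)
print(x.abs().max().item(), y.abs().max().item())  # 尖峰被温和压制
```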
zh
[CV-13] Poultry Farm Intelligence: An Integrated Multi-Sensor AI Platform for Enhanced Welfare and Productivity
【速读】:该论文旨在解决小中型家禽养殖场在追求生产效率的同时,难以实现动物福利保障与环境合规的难题,尤其针对缺乏低成本、集成化连续监测与决策支持工具的问题。其解决方案的关键在于提出一个模块化、高性价比的智能平台——家禽农场智能系统(Poultry Farm Intelligence, PoultryFI),该系统融合了六项基于人工智能(AI)的功能模块:摄像头布局优化、音视频联合监测、分析告警、实时蛋数统计、生产盈利预测及推荐模块。其中,通过进化算法离线优化摄像头部署以最小硬件成本实现全舍覆盖,并结合边缘计算视觉模型实现蛋数识别准确率达100%(在Raspberry Pi 5上验证),同时利用短期预测模型提前10天预报产蛋量与饲料消耗,再通过整合天气数据提供可操作的环境调控建议,从而实现了从被动响应到主动优化的转变,填补了孤立试点工具向规模化应用之间的空白。
链接: https://arxiv.org/abs/2510.15757
作者: Pieris Panagi,Savvas Karatsiolis,Kyriacos Mosphilis,Nicholas Hadjisavvas,Andreas Kamilaris,Nicolas Nicolaou,Efstathios Stavrakis,Vassilis Vassiliades
机构: CYENS Centre of Excellence (CYENS卓越中心); Algolysis (阿尔戈利西斯)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Poultry farming faces increasing pressure to meet productivity targets while ensuring animal welfare and environmental compliance. Yet many small and medium-sized farms lack affordable, integrated tools for continuous monitoring and decision-making, relying instead on manual, reactive inspections. This paper presents Poultry Farm Intelligence (PoultryFI) - a modular, cost-effective platform that integrates six AI-powered modules: Camera Placement Optimizer, Audio-Visual Monitoring, Analytics Alerting, Real-Time Egg Counting, Production Profitability Forecasting, and a Recommendation Module. Camera layouts are first optimized offline using evolutionary algorithms for full poultry house coverage with minimal hardware. The Audio-Visual Monitoring module extracts welfare indicators from synchronized video, audio, and feeding data. Analytics Alerting produces daily summaries and real-time notifications, while Real-Time Egg Counting uses an edge vision model to automate production tracking. Forecasting models predict egg yield and feed consumption up to 10 days in advance, and the Recommendation Module integrates forecasts with weather data to guide environmental and operational adjustments. This is among the first systems to combine low-cost sensing, edge analytics, and prescriptive AI to continuously monitor flocks, predict production, and optimize performance. Field trials demonstrate 100% egg-count accuracy on Raspberry Pi 5, robust anomaly detection, and reliable short-term forecasting. PoultryFI bridges the gap between isolated pilot tools and scalable, farm-wide intelligence, empowering producers to proactively safeguard welfare and profitability.
zh
[CV-14] Semantic segmentation with coarse annotations
【速读】:该论文旨在解决使用粗粒度标注(coarse annotations)训练语义分割模型时边界对齐效果差的问题,尤其在标注成本高或难以获得精细像素级标签的场景下。其核心解决方案是提出一种正则化方法,通过在编码器-解码器架构中引入基于SLIC超像素(SLIC superpixels)的上采样机制,强制解码后的分割结果以颜色和位置信息为基础形成紧凑的超像素区域,从而提升边界精度;该方法在FCN-16网络结构上实现,并在SUIM、Cityscapes和PanNuke数据集上验证了其在粗标注条件下显著优于当前最优模型的边界召回率。
链接: https://arxiv.org/abs/2510.15756
作者: Jort de Jong,Mike Holenderski
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Semantic segmentation is the task of classifying each pixel in an image. Training a segmentation model achieves best results using annotated images, where each pixel is annotated with the corresponding class. When obtaining fine annotations is difficult or expensive, it may be possible to acquire coarse annotations, e.g. by roughly annotating pixels in an image, leaving some pixels around the boundaries between classes unlabeled. Segmentation with coarse annotations is difficult, in particular when the objective is to optimize the alignment of boundaries between classes. This paper proposes a regularization method for models with an encoder-decoder architecture with superpixel based upsampling. It encourages the segmented pixels in the decoded image to be SLIC-superpixels, which are based on pixel color and position, independent of the segmentation annotation. The method is applied to the FCN-16 fully convolutional network architecture and evaluated on the SUIM, Cityscapes, and PanNuke data sets. It is shown that the boundary recall improves significantly compared to state-of-the-art models when trained on coarse annotations.
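为说明"以SLIC超像素为先验鼓励预测一致"的思路,下面给出一个假设性的正则项写法(基于 scikit-image 与 PyTorch);具体损失形式为本文示例设计,并非论文原始公式:

```python
# 假设性示意:以 SLIC 超像素为先验的一致性正则项。
# 思想:超像素仅由颜色与位置决定,与标注无关;鼓励同一超像素内的预测概率一致。
import numpy as np
import torch
from skimage.segmentation import slic

def superpixel_consistency_loss(probs, image, n_segments=200):
    # probs: (C, H, W) 软分割概率; image: (H, W, 3) RGB 数组
    segments = slic(image, n_segments=n_segments, compactness=10.0)
    sp_ids = np.unique(segments)
    loss = 0.0
    for sp in sp_ids:
        mask = torch.from_numpy(segments == sp)
        region = probs[:, mask]                                 # 超像素内所有像素概率 (C, n)
        loss = loss + region.var(dim=1, unbiased=False).sum()   # 类内方差越小越一致
    return loss / len(sp_ids)

image = np.random.rand(64, 64, 3)
probs = torch.softmax(torch.randn(5, 64, 64), dim=0)
print(superpixel_consistency_loss(probs, image).item())
```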
zh
[CV-15] NDM: A Noise-driven Detection and Mitigation Framework against Implicit Sexual Intentions in Text-to-Image Generation
【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)扩散模型在面对隐性性暗示提示时生成不当内容的问题,这类提示通常以看似无害的词汇形式出现,但因模型内部偏见仍会触发色情内容生成,而现有检测方法主要针对显性有害内容,难以识别此类隐性恶意意图。解决方案的关键在于提出首个基于噪声驱动的检测与缓解框架(Noise-driven Detection and Mitigation, NDM),其核心创新包括:一是利用早期预测噪声的可分离性,构建高精度、高效率的噪声基检测机制以识别恶意内容;二是设计一种噪声增强的自适应负向引导机制,通过优化初始噪声来抑制显著区域的关注度,从而提升负向引导对性内容的抑制效果,同时保留模型原始生成能力。
链接: https://arxiv.org/abs/2510.15752
作者: Yitong Sun,Yao Huang,Ruochen Zhang,Huanran Chen,Shouwei Ruan,Ranjie Duan,Xingxing Wei
机构: Beihang University (北京航空航天大学); Tsinghua University (清华大学); Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 8 figures, accepted by ACMMM 2025
Abstract:Despite the impressive generative capabilities of text-to-image (T2I) diffusion models, they remain vulnerable to generating inappropriate content, especially when confronted with implicit sexual prompts. Unlike explicit harmful prompts, these subtle cues, often disguised as seemingly benign terms, can unexpectedly trigger sexual content due to underlying model biases, raising significant ethical concerns. However, existing detection methods are primarily designed to identify explicit sexual content and therefore struggle to detect these implicit cues. Fine-tuning approaches, while effective to some extent, risk degrading the model’s generative quality, creating an undesirable trade-off. To address this, we propose NDM, the first noise-driven detection and mitigation framework, which could detect and mitigate implicit malicious intention in T2I generation while preserving the model’s original generative capabilities. Specifically, we introduce two key innovations: first, we leverage the separability of early-stage predicted noise to develop a noise-based detection method that could identify malicious content with high accuracy and efficiency; second, we propose a noise-enhanced adaptive negative guidance mechanism that could optimize the initial noise by suppressing the prominent region’s attention, thereby enhancing the effectiveness of adaptive negative guidance for sexual mitigation. Experimentally, we validate NDM on both natural and adversarial datasets, demonstrating its superior performance over existing SOTA methods, including SLD, UCE, and RECE, etc. Code and resources are available at this https URL.
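论文检测部分的核心直觉是:恶意与正常提示在扩散早期预测噪声空间中可分,因而一个轻量分类器即可奏效。下面用随机生成的特征和逻辑回归模拟这一流程;特征与分类器均为本文的假设性替代,不含真实扩散模型:

```python
# 概念性示意:利用扩散早期预测噪声的可分性做轻量检测。
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# 假设:对每个提示,取扩散第 1 步预测噪声的全局统计量作为特征
benign = rng.normal(0.0, 1.0, size=(100, 16))     # 正常提示的噪声特征
malicious = rng.normal(0.6, 1.0, size=(100, 16))  # 隐性恶意提示的噪声特征(均值偏移)
X = np.vstack([benign, malicious])
y = np.array([0] * 100 + [1] * 100)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("train acc:", clf.score(X, y))  # 若两类在早期噪声空间可分,线性分类器即可奏效
```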
zh
[CV-16] SEGA: A Stepwise Evolution Paradigm for Content-Aware Layout Generation with Design Prior ICCV-2025
【速读】:该论文旨在解决内容感知布局生成(content-aware layout generation)问题,即自动创建与给定背景图像协调一致的布局,尤其针对现有单步推理框架在复杂元素排布规划中因缺乏反馈式自校正机制而导致失败率显著上升的问题。其解决方案的关键在于提出一种名为SEGA(Stepwise Evolution Paradigm for Content-Aware Layout Generation)的新范式,该范式受人类系统性思维启发,采用粗粒度到细粒度的分层推理框架:首先由粗粒度模块对布局进行初步估计,再由精化模块基于粗略结果进行细粒度推理,同时引入版面设计原则作为先验知识以增强模型的布局规划能力。
链接: https://arxiv.org/abs/2510.15749
作者: Haoran Wang,Bo Zhao,Jinghui Wang,Hanzhang Wang,Huan Yang,Wei Ji,Hao Liu,Xinyan Xiao
机构: Baidu Inc.(百度公司); Nanjing University (南京大学); Harbin Institute of Technology (哈尔滨工业大学); Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV-2025, Our project website is at: this https URL , 10 pages
Abstract:In this paper, we study the content-aware layout generation problem, which aims to automatically generate layouts that are harmonious with a given background image. Existing methods usually deal with this task with a single-step reasoning framework. The lack of a feedback-based self-correction mechanism leads to their failure rates significantly increasing when faced with complex element layout planning. To address this challenge, we introduce SEGA, a novel Stepwise Evolution Paradigm for Content-Aware Layout Generation. Inspired by the systematic mode of human thinking, SEGA employs a hierarchical reasoning framework with a coarse-to-fine strategy: first, a coarse-level module roughly estimates the layout planning results; then, another refining module performs fine-level reasoning regarding the coarse planning results. Furthermore, we incorporate layout design principles as prior knowledge into the model to enhance its layout planning ability. Besides, we present GenPoster-100K that is a new large-scale poster dataset with rich meta-information annotation. The experiments demonstrate the effectiveness of our approach by achieving the state-of-the-art results on multiple benchmark datasets. Our project page is at: this https URL
zh
[CV-17] Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset
【速读】:该论文旨在解决指令驱动的视频编辑(instruction-based video editing)领域因高质量训练数据稀缺而导致的发展瓶颈问题。其解决方案的关键在于提出一个名为Ditto的综合性框架,该框架通过三个核心创新实现突破:首先,设计了一种融合先进图像编辑器创造性与上下文视频生成能力的数据生成流水线,以扩展训练数据的多样性;其次,采用高效蒸馏模型架构并结合时序增强模块,在显著降低计算开销的同时提升视频时序一致性;最后,引入智能代理自动构造多样化指令并对输出进行严格过滤,从而保障大规模生成数据的质量可控性。基于此框架,作者构建了包含一百万条高保真视频编辑样本的Ditto-1M数据集,并在此基础上训练出Editto模型,实现了在指令遵循能力上的新SOTA性能。
链接: https://arxiv.org/abs/2510.15742
作者: Qingyan Bai,Qiuyu Wang,Hao Ouyang,Yue Yu,Hanlin Wang,Wen Wang,Ka Leong Cheng,Shuailei Ma,Yanhong Zeng,Zichen Liu,Yinghao Xu,Yujun Shen,Qifeng Chen
机构: HKUST; Ant Group; Zhejiang University; Northeastern University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL Code: this https URL
Abstract:Instruction-based video editing promises to democratize content creation, yet its progress is severely hampered by the scarcity of large-scale, high-quality training data. We introduce Ditto, a holistic framework designed to tackle this fundamental challenge. At its heart, Ditto features a novel data generation pipeline that fuses the creative diversity of a leading image editor with an in-context video generator, overcoming the limited scope of existing models. To make this process viable, our framework resolves the prohibitive cost-quality trade-off by employing an efficient, distilled model architecture augmented by a temporal enhancer, which simultaneously reduces computational overhead and improves temporal coherence. Finally, to achieve full scalability, this entire pipeline is driven by an intelligent agent that crafts diverse instructions and rigorously filters the output, ensuring quality control at scale. Using this framework, we invested over 12,000 GPU-days to build Ditto-1M, a new dataset of one million high-fidelity video editing examples. We trained our model, Editto, on Ditto-1M with a curriculum learning strategy. The results demonstrate superior instruction-following ability and establish a new state-of-the-art in instruction-based video editing.
zh
[CV-18] Fix False Transparency by Noise Guided Splatting
【速读】:该论文旨在解决3D高斯散射(3DGS)重建中出现的"虚假透明"问题,即在交互式视角变化下,原本不透明物体表面呈现视图不一致的伪透明现象。这一问题源于3DGS优化过程中的病态性:训练时仅通过光度损失对输入RGB图像进行优化,缺乏对表面不透明度的显式约束,导致优化结果错误地赋予不透明区域透明属性。为解决此问题,作者提出NGS(Noise Guided Splatting)策略,其核心在于在训练过程中向物体体积内注入不透明噪声高斯点(opaque noise Gaussians),引导表面高斯点提升不透明度,从而抑制虚假透明;该方法仅需对现有渲染流程做最小改动,即可显著改善该问题。
链接: https://arxiv.org/abs/2510.15736
作者: Aly El Hakie,Yiren Lu,Yu Yin,Michael Jenkins,Yehe Liu
机构: OpsiClear LLC; Case Western Reserve University
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Opaque objects reconstructed by 3DGS often exhibit a falsely transparent surface, leading to inconsistent background and internal patterns under camera motion in interactive viewing. This issue stems from the ill-posed optimization in 3DGS. During training, background and foreground Gaussians are blended via alpha-compositing and optimized solely against the input RGB images using a photometric loss. As this process lacks an explicit constraint on surface opacity, the optimization may incorrectly assign transparency to opaque regions, resulting in view-inconsistent and falsely transparent. This issue is difficult to detect in standard evaluation settings but becomes particularly evident in object-centric reconstructions under interactive viewing. Although other causes of view-inconsistency have been explored recently, false transparency has not been explicitly identified. To the best of our knowledge, we are the first to identify, characterize, and develop solutions for this artifact, an underreported artifact in 3DGS. Our strategy, NGS, encourages surface Gaussians to adopt higher opacity by injecting opaque noise Gaussians in the object volume during training, requiring only minimal modifications to the existing splatting process. To quantitatively evaluate false transparency in static renderings, we propose a transmittance-based metric that measures the severity of this artifact. In addition, we introduce a customized, high-quality object-centric scan dataset exhibiting pronounced transparency issues, and we augment popular existing datasets with complementary infill noise specifically designed to assess the robustness of 3D reconstruction methods to false transparency. Experiments across multiple datasets show that NGS substantially reduces false transparency while maintaining competitive performance on standard rendering metrics, demonstrating its overall effectiveness.
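论文提出用基于透射率的指标量化静态渲染中的虚假透明程度。下面给出一个假设性的度量定义示意:统计物体前景内透射率超过阈值的像素占比;阈值 t_thresh 与具体公式均为本文示例,并非论文原始定义:

```python
# 简化示意:一个基于透射率的"虚假透明"严重度度量(假设性定义)。
import numpy as np

def false_transparency_score(alpha_map, object_mask, t_thresh=0.05):
    # alpha_map: 渲染得到的每像素累积不透明度 (H, W),取值 [0, 1]
    # object_mask: 物体前景掩码 (H, W),True 表示应当完全不透明的表面
    transmittance = 1.0 - alpha_map             # 背景透过表面的比例
    leaking = transmittance[object_mask] > t_thresh
    return leaking.mean()                       # 超阈值像素占比越高,虚假透明越严重

alpha = np.clip(np.random.rand(128, 128) * 0.2 + 0.85, 0, 1)
mask = np.zeros((128, 128), dtype=bool)
mask[32:96, 32:96] = True
print(false_transparency_score(alpha, mask))
```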
zh
[CV-19] DGME-T: Directional Grid Motion Encoding for Transformer-Based Historical Camera Movement Classification
【速读】:该论文旨在解决相机运动分类(Camera Movement Classification, CMC)模型在应用于档案电影时性能显著下降的问题,其核心挑战在于 archival film 中存在的噪声、丢帧和低对比度等因素会严重模糊运动线索。解决方案的关键在于构建一个统一的基准数据集,并提出 DGME-T 模型——该模型是 Video Swin Transformer 的轻量级扩展,通过引入基于光流(optical flow)的定向网格运动编码(Directional Grid Motion Encoding, DGME),以可学习且归一化的晚期融合层注入结构化运动先验信息。实验表明,DGME-T 在现代视频上将 Top-1 准确率从 81.78% 提升至 86.14%,宏 F1 分数从 82.08% 提升至 87.81%;在二战时期影片上也实现了准确率和宏 F1 的提升,同时跨域研究进一步验证了中间阶段在现代数据上的微调能显著增强历史影像的性能。这说明结构化的运动先验与 Transformer 表示能力具有互补性,即使是一个小而精调的运动头也能大幅提升对退化影像的鲁棒性分析能力。
链接: https://arxiv.org/abs/2510.15725
作者: Tingyu Lin,Armin Dadras,Florian Kleber,Robert Sablatnig
机构: TU Wien (维也纳科技大学); UAS St. Pölten (圣珀尔滕应用技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 9 pages, accepted at ACMMM2025 SUMAC
Abstract:Camera movement classification (CMC) models trained on contemporary, high-quality footage often degrade when applied to archival film, where noise, missing frames, and low contrast obscure motion cues. We bridge this gap by assembling a unified benchmark that consolidates two modern corpora into four canonical classes and restructures the HISTORIAN collection into five balanced categories. Building on this benchmark, we introduce DGME-T, a lightweight extension to the Video Swin Transformer that injects directional grid motion encoding, derived from optical flow, via a learnable and normalised late-fusion layer. DGME-T raises the backbone’s top-1 accuracy from 81.78% to 86.14% and its macro F1 from 82.08% to 87.81% on modern clips, while still improving the demanding World-War-II footage from 83.43% to 84.62% accuracy and from 81.72% to 82.63% macro F1. A cross-domain study further shows that an intermediate fine-tuning stage on modern data increases historical performance by more than five percentage points. These results demonstrate that structured motion priors and transformer representations are complementary and that even a small, carefully calibrated motion head can substantially enhance robustness in degraded film analysis. Related resources are available at this https URL.
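DGME 的核心是把光流按网格划分并统计方向直方图,再作为运动先验与 Transformer 特征做晚期融合。下面是一个最小的 NumPy 示意,网格大小 grid 与分箱数 bins 为假设超参数:

```python
# 简化示意:从光流构造"方向网格运动编码"。
import numpy as np

def directional_grid_motion_encoding(flow, grid=4, bins=8):
    # flow: (H, W, 2) 光流场;返回 (grid, grid, bins) 的方向直方图编码
    H, W, _ = flow.shape
    angles = np.arctan2(flow[..., 1], flow[..., 0])      # [-pi, pi]
    mags = np.linalg.norm(flow, axis=-1)
    enc = np.zeros((grid, grid, bins))
    hs, ws = H // grid, W // grid
    for i in range(grid):
        for j in range(grid):
            a = angles[i*hs:(i+1)*hs, j*ws:(j+1)*ws].ravel()
            m = mags[i*hs:(i+1)*hs, j*ws:(j+1)*ws].ravel()
            # 以光流幅值加权的方向直方图
            hist, _ = np.histogram(a, bins=bins, range=(-np.pi, np.pi), weights=m)
            enc[i, j] = hist / (hist.sum() + 1e-8)       # 归一化,便于后续晚期融合
    return enc

flow = np.random.randn(64, 64, 2)
print(directional_grid_motion_encoding(flow).shape)  # (4, 4, 8)
```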
zh
[CV-20] UniMedVL: Unifying Medical Multimodal Understanding and Generation Through Observation-Knowledge-Analysis
【速读】:该论文旨在解决当前医疗人工智能(AI)系统在处理多模态医学输入(如影像、病史、检验结果)与生成多样化输出(如文本报告、标注、分割掩码等)时存在的割裂问题:图像理解模型无法生成视觉内容,而图像生成模型缺乏文本解释能力,导致数据表征不完整、特征融合困难以及任务级多模态能力缺失。解决方案的关键在于提出一个多层次统一框架——UniMedVL,其核心是基于诊断流程的观察-知识-分析(Observation-Knowledge-Analysis, OKA)范式:在观察层构建包含560万样本的UniMed-5M数据集以支持基础多模态对齐;在知识层引入渐进式课程学习(Progressive Curriculum Learning)系统性注入医学多模态知识;在分析层设计首个统一的医学多模态模型UniMedVL,实现图像理解和生成任务在同一架构中的协同优化,并通过双向知识共享机制提升整体性能,显著优于现有专用模型在多个医学图像理解基准上的表现,同时保持生成质量与专业模型相当。
链接: https://arxiv.org/abs/2510.15710
作者: Junzhi Ning,Wei Li,Cheng Tang,Jiashi Lin,Chenglong Ma,Chaoyang Zhang,Jiyao Liu,Ying Chen,Shujian Gao,Lihao Liu,Yuandong Pu,Huihui Xu,Chenhui Gou,Ziyan Huang,Yi Xin,Qi Qin,Zhongying Deng,Diping Song,Bin Fu,Guang Yang,Yuanfeng Ji,Tianbin Li,Yanzhou Su,Jin Ye,Shixiang Tang,Ming Hu,Junjun He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical diagnostic applications require models that can process multimodal medical inputs (images, patient histories, lab results) and generate diverse outputs including both textual reports and visual content (annotations, segmentation masks, and images). Despite this need, existing medical AI systems disrupt this unified process: medical image understanding models interpret images but cannot generate visual outputs, while medical image generation models synthesize images but cannot provide textual explanations. This leads to gaps in data representation, feature integration, and task-level multimodal capabilities. To this end, we propose a multi-level framework that draws inspiration from diagnostic workflows through the Observation-Knowledge-Analysis (OKA) paradigm. Specifically, at the observation level, we construct UniMed-5M, a dataset comprising over 5.6M samples that reformat diverse unimodal data into multimodal pairs for foundational observation. At the knowledge level, we propose Progressive Curriculum Learning that systematically introduces medical multimodal knowledge. At the analysis level, we introduce UniMedVL, the first medical unified multimodal model for the simultaneous analysis of image understanding and generation tasks within a single architecture. UniMedVL achieves superior performance on five medical image understanding benchmarks, while matching specialized models in generation quality across eight medical imaging modalities. Crucially, our unified architecture enables bidirectional knowledge sharing: generation tasks enhance visual understanding features, demonstrating that integrating traditionally separate capabilities within a single medical framework unlocks improvements across diverse medical vision-language tasks. Code is available at this https URL.
zh
[CV-21] Towards Label-Free Brain Tumor Segmentation: Unsupervised Learning with Multimodal MRI
【速读】:该论文旨在解决脑肿瘤分割中因标注数据稀缺、成本高或不一致而导致的监督学习受限问题,提出了一种基于生成式 AI(Generative AI)的无监督异常检测(Unsupervised Anomaly Detection, UAD)方法。其关键解决方案是设计了一种多模态视觉 Transformer 自编码器(Multimodal Vision Transformer Autoencoder, MViT-AE),仅使用健康脑部 MRI 训练,通过重建误差图实现肿瘤的检测与定位,并结合早期-晚期融合策略利用多序列 MRI 的互补信息,以及引入 Segment Anything Model(SAM)进行后处理以优化肿瘤边界,从而在无需人工标注的前提下实现可临床应用的肿瘤定位效果。
链接: https://arxiv.org/abs/2510.15684
作者: Gerard Comas-Quiles,Carles Garcia-Cabrera,Julia Dietlmeier,Noel E. O’Connor,Ferran Marques
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures, BraTS GoAT 2025 challenge
Abstract:Unsupervised anomaly detection (UAD) presents a complementary alternative to supervised learning for brain tumor segmentation in magnetic resonance imaging (MRI), particularly when annotated datasets are limited, costly, or inconsistent. In this work, we propose a novel Multimodal Vision Transformer Autoencoder (MViT-AE) trained exclusively on healthy brain MRIs to detect and localize tumors via reconstruction-based error maps. This unsupervised paradigm enables segmentation without reliance on manual labels, addressing a key scalability bottleneck in neuroimaging workflows. Our method is evaluated in the BraTS-GoAT 2025 Lighthouse dataset, which includes various types of tumors such as gliomas, meningiomas, and pediatric brain tumors. To enhance performance, we introduce a multimodal early-late fusion strategy that leverages complementary information across multiple MRI sequences, and a post-processing pipeline that integrates the Segment Anything Model (SAM) to refine predicted tumor contours. Despite the known challenges of UAD, particularly in detecting small or non-enhancing lesions, our method achieves clinically meaningful tumor localization, with lesion-wise Dice Similarity Coefficient of 0.437 (Whole Tumor), 0.316 (Tumor Core), and 0.350 (Enhancing Tumor) on the test set, and an anomaly Detection Rate of 89.4% on the validation set. These findings highlight the potential of transformer-based unsupervised models to serve as scalable, label-efficient tools for neuro-oncological imaging.
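基于重建误差的无监督异常定位流程可以用一个极小的自编码器示意:模型仅在健康数据上训练,推理时重建误差大的区域即为疑似病灶。以下 TinyAE 为演示用的假设性小模型,并非论文的多模态视觉Transformer:

```python
# 概念性示意:基于重建误差的无监督异常定位(假设性小模型)。
import torch
import torch.nn as nn

class TinyAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(nn.ConvTranspose2d(16, 8, 2, stride=2), nn.ReLU(),
                                 nn.ConvTranspose2d(8, 1, 2, stride=2))

    def forward(self, x):
        return self.dec(self.enc(x))

model = TinyAE().eval()
with torch.no_grad():
    scan = torch.rand(1, 1, 64, 64)           # 一张待检测的切片
    recon = model(scan)
    error_map = (scan - recon).abs()          # 逐像素重建误差
    anomaly_mask = error_map > error_map.mean() + 2 * error_map.std()
print(anomaly_mask.float().mean().item())     # 被标记为异常的像素比例
```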
zh
[CV-22] Valeo Near-Field: a novel dataset for pedestrian intent detection
【速读】:该论文旨在解决智能车辆在近场场景中准确感知行人意图的问题,尤其关注如何利用多模态传感器数据实现高精度的行人检测、3D姿态估计及4D轨迹与意图预测。其解决方案的关键在于构建了一个同步采集的多模态数据集,包含鱼眼相机图像、激光雷达(LiDAR)点云、超声波传感器读数以及基于动作捕捉系统的3D人体关节位置标注,并通过精确的时间对齐和空间标定,为感知算法提供高质量的训练与评估基准。此外,研究还提出了适用于嵌入式系统的综合评测指标体系,以应对实际部署中的传感器遮挡、动态环境变化和硬件资源限制等挑战,从而推动面向真实道路场景的先进感知算法研发。
链接: https://arxiv.org/abs/2510.15673
作者: Antonyo Musabini,Rachid Benmokhtar,Jagdish Bhanushali,Victor Galizzi,Bertrand Luvison,Xavier Perrotton
机构: Valeo(法雷奥); Universite Paris-Saclay, CEA, List(法国巴黎-萨克雷大学,法国原子能和替代能源委员会,CEA-List研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper presents a novel dataset aimed at detecting pedestrians’ intentions as they approach an ego-vehicle. The dataset comprises synchronized multi-modal data, including fisheye camera feeds, lidar laser scans, ultrasonic sensor readings, and motion capture-based 3D body poses, collected across diverse real-world scenarios. Key contributions include detailed annotations of 3D body joint positions synchronized with fisheye camera images, as well as accurate 3D pedestrian positions extracted from lidar data, facilitating robust benchmarking for perception algorithms. We release a portion of the dataset along with a comprehensive benchmark suite, featuring evaluation metrics for accuracy, efficiency, and scalability on embedded systems. By addressing real-world challenges such as sensor occlusions, dynamic environments, and hardware constraints, this dataset offers a unique resource for developing and evaluating state-of-the-art algorithms in pedestrian detection, 3D pose estimation and 4D trajectory and intention prediction. Additionally, we provide baseline performance metrics using custom neural network architectures and suggest future research directions to encourage the adoption and enhancement of the dataset. This work aims to serve as a foundation for researchers seeking to advance the capabilities of intelligent vehicles in near-field scenarios.
zh
[CV-23] Uncertainty-Aware Extreme Point Tracing for Weakly Supervised Ultrasound Image Segmentation
【速读】:该论文旨在解决医学图像分割中对大量像素级标注数据的依赖问题,这种标注方式成本高且耗时。为降低标注负担,作者提出了一种弱监督分割框架,仅需四个极端点(extreme points)作为标注信息。其核心解决方案是利用这些极端点生成的边界框作为提示(prompt),驱动Segment Anything Model 2(SAM2)生成可靠的初始伪标签,并通过改进的特征引导极端点掩码(Feature-Guided Extreme Point Masking, FGEPM)算法逐步优化伪标签,其中引入基于蒙特卡洛Dropout的不确定性估计构建统一梯度不确定性代价图以增强边界追踪精度;同时设计双分支不确定性感知尺度一致性(Uncertainty-aware Scale Consistency, USC)损失和框对齐损失,确保训练过程中的空间一致性与边界精确定位。实验表明,该方法在两个公开超声图像数据集(BUSI和UNS)上达到甚至超越全监督模型性能,显著降低标注成本。
链接: https://arxiv.org/abs/2510.15666
作者: Lei Shi,Gang Li,Junxing Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Automatic medical image segmentation is a fundamental step in computer-aided diagnosis, yet fully supervised approaches demand extensive pixel-level annotations that are costly and time-consuming. To alleviate this burden, we propose a weakly supervised segmentation framework that leverages only four extreme points as annotation. Specifically, bounding boxes derived from the extreme points are used as prompts for the Segment Anything Model 2 (SAM2) to generate reliable initial pseudo labels. These pseudo labels are progressively refined by an enhanced Feature-Guided Extreme Point Masking (FGEPM) algorithm, which incorporates Monte Carlo dropout-based uncertainty estimation to construct a unified gradient uncertainty cost map for boundary tracing. Furthermore, a dual-branch Uncertainty-aware Scale Consistency (USC) loss and a box alignment loss are introduced to ensure spatial consistency and precise boundary alignment during training. Extensive experiments on two public ultrasound datasets, BUSI and UNS, demonstrate that our method achieves performance comparable to, and even surpassing fully supervised counterparts while significantly reducing annotation cost. These results validate the effectiveness and practicality of the proposed weakly supervised framework for ultrasound image segmentation.
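由四个极端点导出包围盒并作为 SAM2 的 box 提示,是该弱监督流程的第一步。下面给出一个简单示意,其中 margin 等参数为本文假设值:

```python
# 简单示意:由四个极端点导出包围盒,作为分割模型的 box prompt(流程示意)。
import numpy as np

def extreme_points_to_box(points, margin=5, img_size=(256, 256)):
    # points: 4x2 数组,分别为最左/最右/最上/最下四个极端点的 (x, y)
    pts = np.asarray(points)
    x0, y0 = pts[:, 0].min() - margin, pts[:, 1].min() - margin
    x1, y1 = pts[:, 0].max() + margin, pts[:, 1].max() + margin
    H, W = img_size
    return [max(0, x0), max(0, y0), min(W - 1, x1), min(H - 1, y1)]  # XYXY 格式

extremes = [(40, 120), (200, 118), (118, 60), (121, 180)]
box = extreme_points_to_box(extremes)
print(box)  # 该 box 可直接作为分割模型的提示输入
```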
zh
[CV-24] Deep Learning Based Domain Adaptation Methods in Remote Sensing: A Comprehensive Survey
【速读】:该论文旨在解决遥感领域中因数据分布差异导致的域适应(Domain Adaptation)难题,即如何将源域(source domain)的知识有效迁移至目标域(target domain),以提升模型在不同传感器、地理环境或成像条件下的泛化能力。其解决方案的关键在于系统梳理基于深度学习的域适应方法,构建涵盖任务分类、输入模式、监督范式和算法粒度等多维度的结构化分类体系,并全面评述当前主流数据集与先进算法的性能表现,从而为研究者提供清晰的技术脉络与未来发展方向。
链接: https://arxiv.org/abs/2510.15615
作者: Shuchang Lyu,Qi Zhao,Zheng Zhou,Meng Li,You Zhou,Dingding Yao,Guangliang Cheng,Huiyu Zhou,Zhenwei Shi
机构: Beihang University (北京航空航天大学); Institute of Acoustics, Chinese Academy of Sciences (中国科学院声学研究所); University of Liverpool (利物浦大学); University of Leicester (莱斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 30 pages, 7 figures
Abstract:Domain adaptation is a crucial and increasingly important task in remote sensing, aiming to transfer knowledge from a source domain to a differently distributed target domain. It has broad applicability across real-world scenarios, including remote sensing element interpretation, ecological environment monitoring, and urban/rural planning. However, domain adaptation in remote sensing poses significant challenges due to differences in data, such as variations in ground sampling distance, imaging modes from various sensors, geographical landscapes, and environmental conditions. In recent years, deep learning has emerged as a powerful tool for feature representation and cross-domain knowledge transfer, leading to widespread adoption in remote sensing tasks. In this paper, we present a comprehensive survey of significant advancements in deep learning based domain adaptation for remote sensing. We first introduce the preliminary knowledge to clarify key concepts, mathematical notations, and the taxonomy of methodologies. We then organize existing algorithms from multiple perspectives, including task categorization, input mode, supervision paradigm, and algorithmic granularity, providing readers with a structured understanding of the field. Next, we review widely used datasets and summarize the performance of state-of-the-art methods to provide an overview of current progress. We also identify open challenges and potential directions to guide future research in domain adaptation for remote sensing. Compared to previous surveys, this work addresses a broader range of domain adaptation tasks in remote sensing, rather than concentrating on a few subfields. It also presents a systematic taxonomy, providing a more comprehensive and organized understanding of the field. As a whole, this survey can inspire the research community, foster understanding, and guide future work in the field.
zh
[CV-25] Lightweight Data-Free Denoising for Detail-Preserving Biomedical Image Restoration MICCAI2025
【速读】:该论文旨在解决当前自监督去噪技术在实际应用中因计算和内存开销大而导致的推理速度与重建质量难以兼顾的问题。其解决方案的关键在于提出一种超轻量级的多阶段去噪框架Noise2Detail(N2D),该框架基于Noise2Noise训练范式,无需干净参考图像或显式噪声建模,通过在推理阶段破坏噪声的空间相关性以生成中间平滑结构,并直接从噪声输入中精细重构细节,从而在显著降低计算成本的同时实现高质量图像恢复。
链接: https://arxiv.org/abs/2510.15611
作者: Tomáš Chobola,Julia A. Schnabel,Tingying Peng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, MICCAI 2025
Abstract:Current self-supervised denoising techniques achieve impressive results, yet their real-world application is frequently constrained by substantial computational and memory demands, necessitating a compromise between inference speed and reconstruction quality. In this paper, we present an ultra-lightweight model that addresses this challenge, achieving both fast denoising and high quality image restoration. Built upon the Noise2Noise training framework, which removes the reliance on clean reference images or explicit noise modeling, we introduce an innovative multistage denoising pipeline named Noise2Detail (N2D). During inference, this approach disrupts the spatial correlations of noise patterns to produce intermediate smooth structures, which are subsequently refined to recapture fine details directly from the noisy input. Extensive testing reveals that Noise2Detail surpasses existing dataset-free techniques in performance, while requiring only a fraction of the computational resources. This combination of efficiency, low computational cost, and a data-free approach makes it a valuable tool for biomedical imaging, overcoming the challenge of scarce clean training data (due to rare and complex imaging modalities) while enabling fast inference for practical use.
zh
[CV-26] Quantized FCA: Efficient Zero-Shot Texture Anomaly Detection
【速读】:该论文旨在解决纹理异常检测与定位中的实时性问题,即现有方法在运行时间上存在显著瓶颈,难以部署于实际场景(如生产线监控)。其解决方案的关键在于提出一种名为QFCA的实时方法,通过将特征对应分析(Feature Correspondence Analysis, FCA)算法量化实现,并设计基于直方图的量化值统计比较机制,在保持精度几乎不变的前提下实现了10倍的速度提升;此外,引入基于主成分分析(Principal Component Analysis, PCA)的特征预处理步骤,增强正常与异常特征间的对比度,从而提高复杂纹理下的检测精度。
链接: https://arxiv.org/abs/2510.15602
作者: Andrei-Timotei Ardelean,Patrick Rückbeil,Tim Weyrich
机构: Friedrich-Alexander-Universität Erlangen-Nürnberg (埃尔朗根-纽伦堡弗里德里希-亚历山大大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 10 figures. Published in the 30th Intl. Conference on Vision, Modeling, and Visualization (VMV), 2025
Abstract:Zero-shot anomaly localization is a rising field in computer vision research, with important progress in recent years. This work focuses on the problem of detecting and localizing anomalies in textures, where anomalies can be defined as the regions that deviate from the overall statistics, violating the stationarity assumption. The main limitation of existing methods is their high running time, making them impractical for deployment in real-world scenarios, such as assembly line monitoring. We propose a real-time method, named QFCA, which implements a quantized version of the feature correspondence analysis (FCA) algorithm. By carefully adapting the patch statistics comparison to work on histograms of quantized values, we obtain a 10x speedup with little to no loss in accuracy. Moreover, we introduce a feature preprocessing step based on principal component analysis, which enhances the contrast between normal and anomalous features, improving the detection precision on complex textures. Our method is thoroughly evaluated against prior art, comparing favorably with existing methods. Project page: this https URL
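QFCA 的核心思想是把特征量化为离散码字,再用直方图比较局部 patch 与全局统计的偏离。下面的 NumPy 示意还原了这一流程;量化级数、patch 大小与距离度量均为本文假设选择,并非论文原始实现:

```python
# 概念性示意:量化特征 + 直方图比较的纹理异常定位。
import numpy as np

def quantized_anomaly_map(feat, levels=16, patch=8):
    # feat: (H, W) 单通道特征图;先量化到 [0, levels-1]
    q = np.clip(((feat - feat.min()) / (np.ptp(feat) + 1e-8) * levels).astype(int),
                0, levels - 1)
    global_hist = np.bincount(q.ravel(), minlength=levels).astype(float)
    global_hist /= global_hist.sum()
    H, W = q.shape
    amap = np.zeros((H // patch, W // patch))
    for i in range(H // patch):
        for j in range(W // patch):
            block = q[i*patch:(i+1)*patch, j*patch:(j+1)*patch]
            h = np.bincount(block.ravel(), minlength=levels).astype(float)
            h /= h.sum()
            amap[i, j] = 0.5 * np.abs(h - global_hist).sum()  # 总变差距离
    return amap  # 值越大,该区域统计越偏离整体,越可能是异常

feat = np.random.rand(64, 64)
feat[8:16, 8:16] += 2.0  # 注入一块异常纹理
print(quantized_anomaly_map(feat).round(2))
```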
zh
[CV-27] FlexiReID: Adaptive Mixture of Expert for Multi-Modal Person Re-Identification
【速读】:该论文旨在解决多模态行人重识别(Multimodal Person Re-Identification, Re-ID)中现有方法在跨模态匹配时受限于固定查询-检索组合的问题,难以支持任意模态间的灵活匹配,从而限制了实际应用。其解决方案的关键在于提出FlexiReID框架,该框架通过引入自适应混合专家(Adaptive Mixture-of-Experts, MoE)机制动态融合不同模态特征,并结合跨模态查询融合模块增强多模态特征提取能力,从而实现对四种模态(RGB、红外、素描和文本)间七种检索模式的统一支持。
链接: https://arxiv.org/abs/2510.15595
作者: Zhen Sun,Lei Tan,Yunhang Shen,Chengmao Cai,Xing Sun,Pingyang Dai,Liujuan Cao,Rongrong Ji
机构: Xiamen University (厦门大学); National University of Singapore (新加坡国立大学); Tencent YouTu Lab (腾讯优图实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal person re-identification (Re-ID) aims to match pedestrian images across different modalities. However, most existing methods focus on limited cross-modal settings and fail to support arbitrary query-retrieval combinations, hindering practical deployment. We propose FlexiReID, a flexible framework that supports seven retrieval modes across four modalities: RGB, infrared, sketches, and text. FlexiReID introduces an adaptive mixture-of-experts (MoE) mechanism to dynamically integrate diverse modality features and a cross-modal query fusion module to enhance multimodal feature extraction. To facilitate comprehensive evaluation, we construct CIRS-PEDES, a unified dataset extending four popular Re-ID datasets to include all four modalities. Extensive experiments demonstrate that FlexiReID achieves state-of-the-art performance and offers strong generalization in complex scenarios.
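FlexiReID 的自适应 MoE 机制可以用一个极简的软路由融合层示意:门控网络依据输入特征为各专家分配权重,再加权汇总专家输出。以下实现为本文的假设性简化,并非论文原始结构:

```python
# 结构示意:一个极简的(软路由)混合专家融合层,按输入动态加权各专家输出。
import torch
import torch.nn as nn

class SimpleMoEFusion(nn.Module):
    def __init__(self, dim, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.gate = nn.Linear(dim, n_experts)  # 由输入特征产生各专家权重

    def forward(self, x):
        w = torch.softmax(self.gate(x), dim=-1)                   # (B, E)
        outs = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, D)
        return (w.unsqueeze(-1) * outs).sum(dim=1)                # 加权求和 (B, D)

fusion = SimpleMoEFusion(dim=256)
feat = torch.randn(8, 256)  # 假设为某一模态查询的融合前特征
print(fusion(feat).shape)   # torch.Size([8, 256])
```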
zh
[CV-28] Context-aware deep learning using individualized prior information reduces false positives in disease risk prediction and longitudinal health assessment
【速读】:该论文旨在解决医疗风险预测中因缺乏时间维度信息而导致的假阳性率过高问题,尤其是在患者既往就诊记录有限且频率不一的情况下。解决方案的关键在于构建一个机器学习框架,通过整合患者多次就诊中获取的影像学和/或临床生物标志物数据(即时间上下文),对当前健康状态进行动态风险评估:首先基于最近一次就诊数据估计初始疾病风险,再利用历史数据中的信息对风险进行精细化调整。实证结果表明,该方法能显著降低假阳性率(如在预测临床显著前列腺癌时,假阳性率从51%降至24%),同时保持高敏感性,从而提升风险预测的特异性,为低风险人群的大规模纵向健康监测提供了可行路径。
链接: https://arxiv.org/abs/2510.15591
作者: Lavanya Umapathy,Patricia M Johnson,Tarun Dutt,Angela Tong,Madhur Nayan,Hersh Chandarana,Daniel K Sodickson
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 5 figures, 1 table
Abstract:Temporal context in medicine is valuable in assessing key changes in patient health over time. We developed a machine learning framework to integrate diverse context from prior visits to improve health monitoring, especially when prior visits are limited and their frequency is variable. Our model first estimates initial risk of disease using medical data from the most recent patient visit, then refines this assessment using information digested from previously collected imaging and/or clinical biomarkers. We applied our framework to prostate cancer (PCa) risk prediction using data from a large population (28,342 patients, 39,013 magnetic resonance imaging scans, 68,931 blood tests) collected over nearly a decade. For predictions of the risk of clinically significant PCa at the time of the visit, integrating prior context directly converted false positives to true negatives, increasing overall specificity while preserving high sensitivity. False positive rates were reduced progressively from 51% to 33% when integrating information from up to three prior imaging examinations, as compared to using data from a single visit, and were further reduced to 24% when also including additional context from prior clinical data. For predicting the risk of PCa within five years of the visit, incorporating prior context reduced false positive rates still further (64% to 9%). Our findings show that information collected over time provides relevant context to enhance the specificity of medical risk prediction. For a wide range of progressive conditions, sufficient reduction of false positive rates using context could offer a pathway to expand longitudinal health monitoring programs to large populations with comparatively low baseline risk of disease, leading to earlier detection and improved health outcomes.
zh
[CV-29] Standardization for improved Spatio-Temporal Image Fusion
【速读】:该论文旨在解决多源遥感图像在时空融合(Spatio-Temporal Image Fusion, STIF)过程中因传感器间空间与光谱分辨率不匹配而导致的融合精度受限问题。为提升未配对图像块的时空融合性能(Unpaired Spatio Temporal Fusion of Image Patches, USTFIP),研究提出并比较了两种标准化方法:一是传统的高分辨率图像上采样策略;二是基于异常检测的卫星图像标准化方法(Anomaly Based Satellite Image Standardization, ABSIS),其核心在于将高分辨率图像序列的整体结构特征与特定低分辨率图像的独特属性进行融合,从而生成更接近实际聚合结果的标准化图像。实验表明,ABSIS显著提升了融合图像的光谱和空间精度,分别提高达49.46%和78.40%,验证了其作为STIF预处理环节的有效性。
链接: https://arxiv.org/abs/2510.15589
作者: Harkaitz Goyena,Peter M. Atkinson,Unai Pérez-Goya,M. Dolores Ugarte
机构: Public University of Navarre (纳瓦拉公共大学); InaMat2 Institute (InaMat2 研究所); Lancaster University (兰卡斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation (stat.CO)
备注:
Abstract:Spatio-Temporal Image Fusion (STIF) methods usually require sets of images with matching spatial and spectral resolutions captured by different sensors. To facilitate the application of STIF methods, we propose and compare two different standardization approaches. The first method is based on traditional upscaling of the fine-resolution images. The second method is a sharpening approach called Anomaly Based Satellite Image Standardization (ABSIS) that blends the overall features found in the fine-resolution image series with the distinctive attributes of a specific coarse-resolution image to produce images that more closely resemble the outcome of aggregating the fine-resolution images. Both methods produce a significant increase in accuracy of the Unpaired Spatio Temporal Fusion of Image Patches (USTFIP) STIF method, with the sharpening approach increasing the spectral and spatial accuracies of the fused images by up to 49.46% and 78.40%, respectively.
zh
[CV-30] Lightweight CycleGAN Models for Cross-Modality Image Transformation and Experimental Quality Assessment in Fluorescence Microscopy
【速读】:该论文旨在解决荧光显微镜中模态转换(如共聚焦成像到超分辨STED或去卷积STED成像)时面临的无配对数据集问题,同时降低深度学习模型的计算成本与环境影响。其关键解决方案是提出一种轻量级CycleGAN架构,通过将U-Net结构中的传统通道翻倍策略替换为固定通道设计,使可训练参数从4180万锐减至约9000个,显著提升了训练速度并降低了内存消耗,同时保持了优异的生成性能。此外,该模型还可作为诊断工具,用于评估实验图像质量,例如识别光漂白、伪影或标记错误等异常情况。
链接: https://arxiv.org/abs/2510.15579
作者: Mohammad Soltaninezhad,Yashar Rouzbahani,Jhonatan Contreras,Rohan Chippalkatti,Daniel Kwaku Abankwa,Christian Eggeling,Thomas Bocklitz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages, 8 Figures
Abstract:Lightweight deep learning models offer substantial reductions in computational cost and environmental impact, making them crucial for scientific applications. We present a lightweight CycleGAN for modality transfer in fluorescence microscopy (confocal to super-resolution STED/deconvolved STED), addressing the common challenge of unpaired datasets. By replacing the traditional channel-doubling strategy in the U-Net-based generator with a fixed channel approach, we drastically reduce trainable parameters from 41.8 million to approximately nine thousand, achieving superior performance with faster training and lower memory usage. We also introduce the GAN as a diagnostic tool for experimental and labeling quality. When trained on high-quality images, the GAN learns the characteristics of optimal imaging; deviations between its generated outputs and new experimental images can reveal issues such as photobleaching, artifacts, or inaccurate labeling. This establishes the model as a practical tool for validating experimental accuracy and image fidelity in microscopy workflows.
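为直观理解"固定通道"相对"逐层翻倍"能削减多少可训练参数,下面用两个小型卷积栈做对照;层数与通道数为示例设定,并非论文网络:

```python
# 小型对照实验示意:通道数"逐层翻倍"与"固定通道"两种编码器的参数量对比(PyTorch)。
import torch.nn as nn

def conv_stack(channels):
    layers, c_in = [], 3
    for c_out in channels:
        layers += [nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU()]
        c_in = c_out
    return nn.Sequential(*layers)

doubling = conv_stack([64, 128, 256, 512])   # 传统 U-Net 风格:逐层翻倍
fixed = conv_stack([16, 16, 16, 16])         # 固定通道设计

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"doubling: {count(doubling):,} params")  # 百万级
print(f"fixed:    {count(fixed):,} params")     # 数千级
```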
zh
[CV-31] Unmasking Facial DeepFakes: A Robust Multiview Detection Framework for Natural Images
【速读】:该论文旨在解决深度伪造(DeepFake)图像检测在实际应用中面临的挑战,尤其是姿态变化、遮挡以及难以察觉的伪影问题。解决方案的关键在于提出一种多视角架构,通过三个专用编码器分别捕捉不同层次的面部特征:全局视角编码器用于检测边界不一致性,中观视角编码器分析纹理与色彩对齐情况,局部视角编码器聚焦于眼部、鼻部和口部等易出现伪造痕迹的表达区域;此外,引入一个面部朝向编码器以增强模型在不同视角下的鲁棒性。最终,通过融合各编码器提取的特征,显著提升了复杂条件下对合成图像的检测性能。
链接: https://arxiv.org/abs/2510.15576
作者: Sami Belguesmia,Mohand Saïd Allili,Assia Hamadene
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:DeepFake technology has advanced significantly in recent years, enabling the creation of highly realistic synthetic face images. Existing DeepFake detection methods often struggle with pose variations, occlusions, and artifacts that are difficult to detect in real-world conditions. To address these challenges, we propose a multi-view architecture that enhances DeepFake detection by analyzing facial features at multiple levels. Our approach integrates three specialized encoders, a global view encoder for detecting boundary inconsistencies, a middle view encoder for analyzing texture and color alignment, and a local view encoder for capturing distortions in expressive facial regions such as the eyes, nose, and mouth, where DeepFake artifacts frequently occur. Additionally, we incorporate a face orientation encoder, trained to classify face poses, ensuring robust detection across various viewing angles. By fusing features from these encoders, our model achieves superior performance in detecting manipulated images, even under challenging pose and lighting conditions. Experimental results on challenging datasets demonstrate the effectiveness of our method, outperforming conventional single-view approaches.
zh
[CV-32] Imaginarium: Vision-guided High-Quality 3D Scene Layout Generation
【速读】:该论文旨在解决3D场景布局生成中传统优化方法依赖繁琐手动规则、深度生成模型难以实现内容丰富性与多样性,以及基于大语言模型的方法在复杂空间关系建模上缺乏鲁棒性的问题。其解决方案的关键在于构建一个高质量的资产库(包含2,037个场景资产和147个3D场景布局),利用图像生成模型将提示(prompt)映射为视觉表示并进行微调以匹配资产库,进而设计一个鲁棒的图像解析模块,结合视觉语义与几何信息恢复3D布局,并通过场景图(scene graph)和整体视觉语义优化确保布局的逻辑一致性与图像对齐性。
链接: https://arxiv.org/abs/2510.15564
作者: Xiaoming Zhu,Xu Huang,Qinghongbing Xie,Zhi Deng,Junsheng Yu,Yirui Guan,Zhongyuan Liu,Lin Zhu,Qijun Zhao,Ligang Liu,Long Zeng
机构: Tsinghua University, Shenzhen, China (清华大学); Tencent, Shenzhen, China (腾讯); Southeast University, Shenzhen, China (东南大学); University of Science and Technology of China, Hefei, China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generating artistic and coherent 3D scene layouts is crucial in digital content creation. Traditional optimization-based methods are often constrained by cumbersome manual rules, while deep generative models face challenges in producing content with richness and diversity. Furthermore, approaches that utilize large language models frequently lack robustness and fail to accurately capture complex spatial relationships. To address these challenges, this paper presents a novel vision-guided 3D layout generation system. We first construct a high-quality asset library containing 2,037 scene assets and 147 3D scene layouts. Subsequently, we employ an image generation model to expand prompt representations into images, fine-tuning it to align with our asset library. We then develop a robust image parsing module to recover the 3D layout of scenes based on visual semantics and geometric information. Finally, we optimize the scene layout using scene graphs and overall visual semantics to ensure logical coherence and alignment with the images. Extensive user testing demonstrates that our algorithm significantly outperforms existing methods in terms of layout richness and quality. The code and dataset will be available at this https URL.
zh
[CV-33] ClapperText: A Benchmark for Text Recognition in Low-Resource Archival Documents ICDAR2025
【速读】:该论文旨在解决在视觉退化和低资源环境下手写与印刷文本识别的挑战,尤其针对历史档案中因运动模糊、笔迹变异、曝光波动及杂乱背景等因素导致的OCR(光学字符识别)性能下降问题。解决方案的关键在于构建ClapperText这一基准数据集,其包含9,813个标注帧和94,573个词级文本实例(其中67%为手写文本,1,566个部分遮挡),并提供旋转边界框(以四点多边形表示)和语义类别、文本类型、遮挡状态等细粒度标注信息,支持高精度OCR应用。此外,通过在仅18个视频的小规模训练集上进行微调,模型性能显著提升,验证了该数据集在少样本学习场景中的有效性,从而为低资源历史文档理解提供了现实且文化语境丰富的研究资源。
链接: https://arxiv.org/abs/2510.15557
作者: Tingyu Lin,Marco Peer,Florian Kleber,Robert Sablatnig
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: 18 pages, accepted at ICDAR2025 DALL
Abstract:This paper presents ClapperText, a benchmark dataset for handwritten and printed text recognition in visually degraded and low-resource settings. The dataset is derived from 127 World War II-era archival video segments containing clapperboards that record structured production metadata such as date, location, and camera-operator identity. ClapperText includes 9,813 annotated frames and 94,573 word-level text instances, 67% of which are handwritten and 1,566 are partially occluded. Each instance includes transcription, semantic category, text type, and occlusion status, with annotations available as rotated bounding boxes represented as 4-point polygons to support spatially precise OCR applications. Recognizing clapperboard text poses significant challenges, including motion blur, handwriting variation, exposure fluctuations, and cluttered backgrounds, mirroring broader challenges in historical document analysis where structured content appears in degraded, non-standard forms. We provide both full-frame annotations and cropped word images to support downstream tasks. Using a consistent per-video evaluation protocol, we benchmark six representative recognition and seven detection models under zero-shot and fine-tuned conditions. Despite the small training set (18 videos), fine-tuning leads to substantial performance gains, highlighting ClapperText’s suitability for few-shot learning scenarios. The dataset offers a realistic and culturally grounded resource for advancing robust OCR and document understanding in low-resource archival contexts. The dataset and evaluation code are available at this https URL.
zh
[CV-34] Diffusion Bridge Networks Simulate Clinical-grade PET from MRI for Dementia Diagnostics
【速读】:该论文旨在解决正电子发射断层成像(PET)与氟代脱氧葡萄糖(FDG)在痴呆诊断中应用受限的问题,即其可及性低且成本高,相较于常规磁共振成像(MRI)存在显著劣势。解决方案的关键在于提出SiM2P框架——一种基于3D扩散桥接的深度学习模型,能够从MRI和辅助患者信息中学习概率映射,生成诊断质量的模拟FDG-PET图像。该方法在盲法临床读片研究中显著提升了三组人群(阿尔茨海默病、行为变异型额颞叶痴呆及认知健康对照)的鉴别准确率(从75.0%提升至84.7%,p<0.05),并改善了诊断确定性和评分者间一致性,同时支持仅需20例本地病例和基础人口学信息即可部署,从而推动FDG-PET的诊断优势在资源有限环境中更广泛落地。
链接: https://arxiv.org/abs/2510.15556
作者: Yitong Li,Ralph Buchert,Benita Schmitz-Koep,Timo Grimmer,Björn Ommer,Dennis M. Hedderich,Igor Yakushev,Christian Wachinger
机构: Technical University of Munich (慕尼黑工业大学); Ludwig-Maximilians-Universität München (慕尼黑路德维希-马克西米利安大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Positron emission tomography (PET) with 18F-Fluorodeoxyglucose (FDG) is an established tool in the diagnostic workup of patients with suspected dementing disorders. However, compared to the routinely available magnetic resonance imaging (MRI), FDG-PET remains significantly less accessible and substantially more expensive. Here, we present SiM2P, a 3D diffusion bridge-based framework that learns a probabilistic mapping from MRI and auxiliary patient information to simulate FDG-PET images of diagnostic quality. In a blinded clinical reader study, two neuroradiologists and two nuclear medicine physicians rated the original MRI and SiM2P-simulated PET images of patients with Alzheimer's disease, behavioral-variant frontotemporal dementia, and cognitively healthy controls. SiM2P significantly improved the overall diagnostic accuracy of differentiating between three groups from 75.0% to 84.7% (p < 0.05). Notably, the simulated PET images received higher diagnostic certainty ratings and achieved superior interrater agreement compared to the MRI images. Finally, we developed a practical workflow for local deployment of the SiM2P framework. It requires as few as 20 site-specific cases and only basic demographic information. This approach makes the established diagnostic benefits of FDG-PET imaging more accessible to patients with suspected dementing disorders, potentially improving early detection and differential diagnosis in resource-limited settings. Our code is available at this https URL.
zh
[CV-35] An Empirical Study on MC Dropout–Based Uncertainty–Error Correlation in 2D Brain Tumor Segmentation
【速读】:该论文旨在解决医学图像分割中模型不确定性估计的有效性问题,特别是针对脑肿瘤边界区域的分割误差识别能力不足的问题。其核心解决方案是通过蒙特卡洛Dropout(Monte Carlo Dropout)方法计算模型预测的不确定性,并将其与像素级分割误差进行相关性分析,从而评估该不确定性指标在定位边界错误方面的可靠性。研究发现,MC Dropout所生成的不确定性与整体分割误差仅存在弱相关性(r ≈ 0.30–0.38),且在肿瘤边界区域几乎无显著相关性(|r| < 0.05),表明该方法在边界误差定位上效果有限,亟需探索更有效的不确定性估计策略或融合多种方法的混合方案。
链接: https://arxiv.org/abs/2510.15541
作者: Saumya B
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Code and results available at this https URL
Abstract:Accurate brain tumor segmentation from MRI is vital for diagnosis and treatment planning. Although Monte Carlo (MC) Dropout is widely used to estimate model uncertainty, its effectiveness in identifying segmentation errors, especially near tumor boundaries, remains unclear. This study empirically examines the relationship between MC Dropout–based uncertainty and segmentation error in 2D brain tumor MRI segmentation using a U-Net trained under four augmentation settings: none, horizontal flip, rotation, and scaling. Uncertainty was computed from 50 stochastic forward passes and correlated with pixel-wise errors using Pearson and Spearman coefficients. Results show weak global correlations (r ≈ 0.30–0.38) and negligible boundary correlations (|r| < 0.05). Although differences across augmentations were statistically significant (p < 0.001), they lacked practical relevance. These findings suggest that MC Dropout uncertainty provides limited cues for boundary error localization, underscoring the need for alternative or hybrid uncertainty estimation methods in medical image segmentation.
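该研究的实验流程(推理时保持 Dropout 激活、多次随机前向、以方差作不确定性、再与逐像素误差计算皮尔逊相关)可用如下最小可运行示例还原;模型与数据均为随机示例,仅演示步骤本身:

```python
# 最小可运行示意:MC Dropout 不确定性与逐像素误差的相关性分析流程。
import torch
import torch.nn as nn
from scipy.stats import pearsonr

model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                      nn.Dropout2d(0.5),
                      nn.Conv2d(8, 1, 3, padding=1), nn.Sigmoid())
model.train()  # 关键:推理时保持 Dropout 激活,才能得到随机前向

x = torch.rand(1, 1, 32, 32)
gt = (torch.rand(1, 1, 32, 32) > 0.5).float()

with torch.no_grad():
    preds = torch.stack([model(x) for _ in range(50)])  # 50 次随机前向
mean_pred = preds.mean(0)
uncertainty = preds.var(0)                   # 逐像素不确定性(方差)
error = (mean_pred.round() != gt).float()    # 逐像素分割误差

r, p = pearsonr(uncertainty.flatten().numpy(), error.flatten().numpy())
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
```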
zh
[CV-36] VO-DP: Semantic-Geometric Adaptive Diffusion Policy for Vision-Only Robotic Manipulation
【速读】:该论文旨在解决当前基于扩散模型的机器人操作策略学习中对点云输入的依赖问题,探索纯视觉(vision-only)解决方案的潜力。现有方法普遍采用点云作为观测输入并通过特征学习构建场景表示,虽然精度较高,但存在计算复杂度高、硬件依赖性强等局限。其核心解决方案是提出一种单视角、纯视觉的扩散策略学习方法(VO-DP),关键在于利用预训练视觉基础模型融合语义与几何特征:通过VGGT提取语义信息、DINOv2提供高层语义特征、Alternating Attention模块捕获几何结构,并借助交叉注意力机制实现特征融合,再经卷积神经网络(CNN)空间压缩后输入策略头。实验表明,VO-DP在仿真和真实任务中均显著优于纯视觉基线(DP),且在多数场景下媲美甚至超越点云方法(DP3),同时展现出更强的鲁棒性。
链接: https://arxiv.org/abs/2510.15530
作者: Zehao Ni,Yonghao He,Lingfeng Qian,Jilei Mao,Fa Fu,Wei Sui,Hu Su,Junran Peng,Zhipeng Wang,Bin He
机构: National Key Laboratory of Autonomous Intelligent Unmanned Systems; D-Robotics; University of Science and Technology Beijing; State Key Laboratory of Multimodal Artificial Intelligence System (MAIS) (多模态人工智能系统国家重点实验室); Institute of Automation of Chinese Academy of Sciences (中国科学院自动化研究所); Frontiers Science Center for Intelligent Autonomous Systems; Shanghai Institute of Intelligent Science and Technology, Tongji University (同济大学智能科学与技术研究院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:In the context of imitation learning, visuomotor-based diffusion policy learning is one of the main directions in robotic manipulation. Most of these approaches rely on point clouds as observation inputs and construct scene representations through point cloud feature learning, which enables them to achieve remarkable accuracy. However, the existing literature lacks an in-depth exploration of vision-only solutions that have significant potential. In this paper, we propose a Vision-Only and single-view Diffusion Policy learning method (VO-DP) that leverages pretrained visual foundation models to achieve effective fusion of semantic and geometric features. We utilize intermediate features from VGGT incorporating semantic features from DINOv2 and geometric features from Alternating Attention blocks. Features are fused via cross-attention and spatially compressed with a CNN to form the input to the policy head. Extensive experiments demonstrate that VO-DP not only outperforms the vision-only baseline DP significantly but also exhibits distinct performance trends against the point cloud-based method DP3: in simulation tasks, VO-DP achieves an average success rate of 64.6%, on par with DP3 (64.0%) and far higher than DP (34.8%); in real-world tasks, it reaches 87.9%, outperforming both DP3 (67.5%) and DP (11.2%) by a notable margin. Further robustness evaluations confirm that VO-DP remains highly stable under varying conditions including color, size, background, and lighting. Lastly, we open-source a training library for robotic manipulation. Built on Accelerate, this library supports multi-machine and multi-GPU parallel training, as well as mixed precision training. It is compatible with visuomotor policies such as DP, DP3 and VO-DP, and also supports the RoboTwin simulator.
zh
[CV-37] Balanced Multi-Task Attention for Satellite Image Classification: A Systematic Approach to Achieving 97.23% Accuracy on EuroSAT Without Pre-Training
【速读】:该论文旨在解决卫星遥感图像土地利用分类中因模型架构设计不足导致的性能瓶颈问题,特别是针对空间(spatial)与光谱(spectral)特征利用不均衡、过拟合及类别混淆模式不平衡等问题。其关键解决方案是提出了一种新颖的平衡多任务注意力机制(balanced multi-task attention mechanism),该机制通过Coordinate Attention模块提取空间特征、Squeeze-Excitation块提取光谱特征,并引入一个可学习的融合参数使二者统一优化;实验表明该参数收敛至约0.57,说明空间与光谱模态在卫星影像中具有近似相等的重要性。此外,采用逐层递增的DropBlock正则化策略(5%–20%)和类别平衡损失权重进一步提升了模型泛化能力与类别区分度,最终在EuroSAT数据集上达到97.23%测试准确率,且所有类别均超过94.46%,验证了系统性架构设计对特定领域应用的有效性。
链接: https://arxiv.org/abs/2510.15527
作者: Aditya Vir
机构: Manipal University Jaipur (曼帕尔大学贾伊普尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 2 figures, 2 tables. Code and trained models available at this https URL
Abstract:This work presents a systematic investigation of custom convolutional neural network architectures for satellite land use classification, achieving 97.23% test accuracy on the EuroSAT dataset without reliance on pre-trained models. Through three progressive architectural iterations (baseline: 94.30%, CBAM-enhanced: 95.98%, and balanced multi-task attention: 97.23%), we identify and address specific failure modes in satellite imagery classification. Our principal contribution is a novel balanced multi-task attention mechanism that combines Coordinate Attention for spatial feature extraction with Squeeze-Excitation blocks for spectral feature extraction, unified through a learnable fusion parameter. Experimental results demonstrate that this learnable parameter autonomously converges to α ≈ 0.57, indicating near-equal importance of spatial and spectral modalities for satellite imagery. We employ progressive DropBlock regularization (5-20% by network depth) and class-balanced loss weighting to address overfitting and confusion pattern imbalance. The final 12-layer architecture achieves Cohen’s Kappa of 0.9692 with all classes exceeding 94.46% accuracy, demonstrating confidence calibration with a 24.25% gap between correct and incorrect predictions. Our approach achieves performance within 1.34% of fine-tuned ResNet-50 (98.57%) while requiring no external data, validating the efficacy of systematic architectural design for domain-specific applications. Complete code, trained models, and evaluation scripts are publicly available.
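该文的核心是一个可学习标量 α,对空间(Coordinate Attention)与光谱(Squeeze-Excitation)两路注意力输出做凸组合。下面用 PyTorch 给出一个最小示意:以标准 SE 块代表光谱分支,空间分支用传入模块占位(此处以 `nn.Identity()` 演示),参数化与初始化方式均为假设。

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-Excitation:对通道(光谱)维做重标定。"""
    def __init__(self, c, r=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c // r, 1), nn.ReLU(),
            nn.Conv2d(c // r, c, 1), nn.Sigmoid())
    def forward(self, x):
        return x * self.fc(x)

class BalancedFusion(nn.Module):
    """alpha 为可学习标量,经 sigmoid 约束到 (0,1);论文报告其收敛到约 0.57。"""
    def __init__(self, c, spatial_branch):
        super().__init__()
        self.spatial = spatial_branch                  # 例如 Coordinate Attention 模块
        self.spectral = SEBlock(c)
        self.alpha_raw = nn.Parameter(torch.zeros(1))  # sigmoid(0)=0.5 作为初始值
    def forward(self, x):
        a = torch.sigmoid(self.alpha_raw)
        return a * self.spatial(x) + (1 - a) * self.spectral(x)

# 用法示意:用恒等映射占位空间分支
fusion = BalancedFusion(64, spatial_branch=nn.Identity())
y = fusion(torch.randn(2, 64, 32, 32))
```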
zh
[CV-38] Latent Feature Alignment: Discovering Biased and Interpretable Subpopulations in Face Recognition Models
【速读】:该论文旨在解决现代人脸识别模型中存在的系统性偏差问题,即这些模型在不同子群体(如不同年龄、种族或着装特征的人群)中表现不一致,而传统评估方法依赖于标注属性来定义子群体,存在获取成本高且受限于预设类别的问题。解决方案的关键在于提出一种无需属性标签的潜在特征对齐(Latent Feature Alignment, LFA)算法,通过挖掘潜在空间中的语义方向自动识别子群体,并实现两个核心优势:一是基于语义一致性进行分组,优于仅依赖距离的聚类方法;二是发现可解释的潜在方向,对应于年龄、种族或服饰等语义属性。该方法显著提升了子群体内部语义一致性,并为无监督地审计人脸识别模型的偏倚提供了实用工具。
链接: https://arxiv.org/abs/2510.15520
作者: Ignacio Serna
机构: Max Planck Institute for Human Development (马普所人类发展研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Modern face recognition models achieve high overall accuracy but continue to exhibit systematic biases that disproportionately affect certain subpopulations. Conventional bias evaluation frameworks rely on labeled attributes to form subpopulations, which are expensive to obtain and limited to predefined categories. We introduce Latent Feature Alignment (LFA), an attribute-label-free algorithm that uses latent directions to identify subpopulations. This yields two main benefits over standard clustering: (i) semantically coherent grouping, where faces sharing common attributes are grouped together more reliably than by proximity-based methods, and (ii) discovery of interpretable directions, which correspond to semantic attributes such as age, ethnicity, or attire. Across four state-of-the-art recognition models (ArcFace, CosFace, ElasticFace, PartialFC) and two benchmarks (RFW, CelebA), LFA consistently outperforms k-means and nearest-neighbor search in intra-group semantic coherence, while uncovering interpretable latent directions aligned with demographic and contextual attributes. These results position LFA as a practical method for representation auditing of face recognition models, enabling practitioners to identify and interpret biased subpopulations without predefined attribute annotations.
zh
[CV-39] Exploring Conditions for Diffusion models in Robotic Control
【速读】:该论文旨在解决预训练视觉表征在机器人控制任务中因任务无关性(task-agnostic)而导致性能受限的问题,尤其针对冻结参数的视觉模型难以适应具体控制需求的局限。其核心解决方案是提出ORCA方法,关键在于引入可学习的任务提示(task prompts)与捕捉帧级细节的视觉提示(visual prompts),以引导文本到图像扩散模型生成任务自适应的视觉表征,从而有效缓解训练数据与机器人控制环境之间的域差距(domain gap),显著提升控制性能。
链接: https://arxiv.org/abs/2510.15510
作者: Heeseong Shin,Byeongho Heo,Dongyoon Han,Seungryong Kim,Taekyung Kim
机构: KAIST AI(韩国科学技术院人工智能); NAVER AI Lab(NAVER人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Project page: this https URL
Abstract:While pre-trained visual representations have significantly advanced imitation learning, they are often task-agnostic as they remain frozen during policy learning. In this work, we explore leveraging pre-trained text-to-image diffusion models to obtain task-adaptive visual representations for robotic control, without fine-tuning the model itself. However, we find that naively applying textual conditions, a successful strategy in other vision domains, yields minimal or even negative gains in control tasks. We attribute this to the domain gap between the diffusion model’s training data and robotic control environments, leading us to argue for conditions that consider the specific, dynamic visual information required for control. To this end, we propose ORCA, which introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-specific details. Through facilitating task-adaptive representations with our newly devised conditions, our approach achieves state-of-the-art performance on various robotic control benchmarks, significantly surpassing prior methods.
zh
[CV-40] Rethinking Efficient Hierarchical Mixing Architecture for Low-light RAW Image Enhancement
【速读】:该论文旨在解决低光照RAW图像增强中同时实现高质量增强与高效率的难题。现有基于深度学习的方法在性能和计算效率之间存在权衡,且两阶段框架易引入歧义。其解决方案的关键在于提出一种分层混合架构(Hierarchical Mixing Architecture, HiMA),通过结合Transformer模块处理大尺度特征与Mamba模块处理小尺度特征,有效提升处理效率并避免传统两阶段方法的不确定性;此外,引入局部分布调整(Local Distribution Adjustment, LoDA)以应对局部光照不均问题,并设计多先验融合(Multi-prior Fusion, MPF)模块整合空间与频域先验信息,充分挖掘第一阶段去噪结果中的细节信息,从而在多个公开数据集上实现优于当前最优方法的性能,且参数量更少。
链接: https://arxiv.org/abs/2510.15497
作者: Xianmin Chen,Peiliang Huang,Longfei Han,Dingwen Zhang,Junwei Han
机构: USTC; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; Beijing Technology and Business University; Northwestern Polytechnical University; Chongqing University of Posts and Telecommunications
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Low-light RAW image enhancement remains a challenging task. Although numerous deep learning based approaches have been proposed, they still suffer from inherent limitations. A key challenge is how to simultaneously achieve strong enhancement quality and high efficiency. In this paper, we rethink the architecture for efficient low-light image signal processing (ISP) and introduce a Hierarchical Mixing Architecture (HiMA). HiMA leverages the complementary strengths of Transformer and Mamba modules to handle features at large and small scales, respectively, thereby improving efficiency while avoiding the ambiguities observed in prior two-stage frameworks. To further address uneven illumination with strong local variations, we propose Local Distribution Adjustment (LoDA), which adaptively aligns feature distributions across different local regions. In addition, to fully exploit the denoised outputs from the first stage, we design a Multi-prior Fusion (MPF) module that integrates spatial and frequency-domain priors for detail enhancement. Extensive experiments on multiple public datasets demonstrate that our method outperforms state-of-the-art approaches, achieving superior performance with fewer parameters. Code will be released at this https URL.
zh
[CV-41] Iterative Motion Compensation for Canonical 3D Reconstruction from UAV Plant Images Captured in Windy Conditions
【速读】:该论文旨在解决农业植物三维表型(3D phenotyping)中因环境风力和无人机下洗气流导致的图像运动模糊问题,从而提升个体作物高精度三维重建的质量。其核心解决方案是提出一个可集成任意先进三维重建方法的流水线,并引入一种迭代优化策略:通过光流估计原始图像与中间三维重建结果之间的运动差异,对输入图像进行逐步形变校正,使场景在多轮迭代中趋于稳定,最终获得更清晰、高分辨率的三维网格模型。该方法显著改善了现有重建算法在动态叶片条件下的性能表现。
链接: https://arxiv.org/abs/2510.15491
作者: Andre Rochow,Jonas Marcic,Svetlana Seliunina,Sven Behnke
机构: Autonomous Intelligent Systems - Computer Science Institute VI and Center for Robotics, University of Bonn, Germany; Lamarr Institute for Machine Learning and Artificial Intelligence, Germany
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D phenotyping of plants plays a crucial role for understanding plant growth, yield prediction, and disease control. We present a pipeline capable of generating high-quality 3D reconstructions of individual agricultural plants. To acquire data, a small commercially available UAV captures images of a selected plant. Apart from placing ArUco markers, the entire image acquisition process is fully autonomous, controlled by a self-developed Android application running on the drone’s controller. The reconstruction task is particularly challenging due to environmental wind and downwash of the UAV. Our proposed pipeline supports the integration of arbitrary state-of-the-art 3D reconstruction methods. To mitigate errors caused by leaf motion during image capture, we use an iterative method that gradually adjusts the input images through deformation. Motion is estimated using optical flow between the original input images and intermediate 3D reconstructions rendered from the corresponding viewpoints. This alignment gradually reduces scene motion, resulting in a canonical representation. After a few iterations, our pipeline improves the reconstruction of state-of-the-art methods and enables the extraction of high-resolution 3D meshes. We will publicly release the source code of our reconstruction pipeline. Additionally, we provide a dataset consisting of multiple plants from various crops, captured across different points in time.
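其迭代补偿可概括为:用光流估计渲染图(从当前三维重建渲染出的对应视角)与原始输入之间的运动,再将输入图像向渲染视角形变,如此往复使场景逐步"静止"。下面是基于 OpenCV Farneback 光流的单步形变示意,`reconstruct` 与 `render_view` 为假设的外部函数,参数均为演示值。

```python
import cv2
import numpy as np

def warp_towards_render(image, rendered):
    """将 image 向 rendered(渲染视角)对齐:估计 rendered->image 的稠密光流,
    再按反向映射从 image 采样,得到与 rendered 逐像素对齐的形变图。"""
    g1 = cv2.cvtColor(rendered, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g1, g2, None, 0.5, 3, 25, 3, 5, 1.2, 0)
    h, w = g1.shape
    gx, gy = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (gx + flow[..., 0]).astype(np.float32)  # 输出像素 p 取自 image 的 p+flow(p)
    map_y = (gy + flow[..., 1]).astype(np.float32)
    return cv2.remap(image, map_x, map_y, cv2.INTER_LINEAR)

# 迭代示意(伪代码):reconstruct / render_view 为假设的外部函数
# for _ in range(num_iters):
#     recon = reconstruct(images)                 # 任意先进三维重建方法
#     images = [warp_towards_render(img, render_view(recon, i))
#               for i, img in enumerate(images)]
```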
zh
[CV-42] A Novel Combined Optical Flow Approach for Comprehensive Micro-Expression Recognition
【速读】:该论文旨在解决微表情识别(Micro-Expression Recognition, MER)中因仅关注从起始到峰值阶段(onset-to-apex phase)而忽略峰值到消退阶段(apex-to-offset phase)所导致的时序动态信息丢失问题。解决方案的关键在于提出一种联合光流(Combined Optical Flow, COF)方法,通过整合两个阶段的光流信息,实现更全面的运动特征分析,从而提升微表情识别性能。实验表明,COF在CASMEII和SAMM数据集上优于单一光流方法,验证了其在捕捉微表情时序动态方面的有效性。
链接: https://arxiv.org/abs/2510.15471
作者: Vu Tram Anh Khuong,Thi Bich Phuong Man,Luu Tu Nguyen,Thanh Ha Le,Thi Duyen Ngo
机构: Vietnam National University (越南国家大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Facial micro-expressions are brief, involuntary facial movements that reveal hidden emotions. Most Micro-Expression Recognition (MER) methods that rely on optical flow typically focus on the onset-to-apex phase, neglecting the apex-to-offset phase, which holds key temporal dynamics. This study introduces a Combined Optical Flow (COF), integrating both phases to enhance feature representation. COF provides a more comprehensive motion analysis, improving MER performance. Experimental results on CASMEII and SAMM datasets show that COF outperforms single optical flow-based methods, demonstrating its effectiveness in capturing micro-expression dynamics.
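COF 的做法可以理解为分别计算起始→峰值与峰值→消退两段光流,再合并为统一的运动表示。下面用 OpenCV 给出一个草图,合并方式取通道拼接,仅作演示、未必与原文一致。

```python
import cv2
import numpy as np

def flow(a, b):
    """两帧灰度图之间的稠密光流 (H,W,2)。"""
    return cv2.calcOpticalFlowFarneback(a, b, None, 0.5, 3, 15, 3, 5, 1.2, 0)

def combined_optical_flow(onset, apex, offset):
    """onset/apex/offset 为同尺寸灰度帧;返回 (H,W,4) 的联合光流特征。"""
    f1 = flow(onset, apex)     # 起始 -> 峰值:表情形成阶段
    f2 = flow(apex, offset)    # 峰值 -> 消退:表情恢复阶段
    return np.concatenate([f1, f2], axis=-1)

# 用法示意
frames = [np.random.randint(0, 255, (128, 128), np.uint8) for _ in range(3)]
cof = combined_optical_flow(*frames)   # (128, 128, 4)
```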
zh
[CV-43] MSAM: Multi-Semantic Adaptive Mining for Cross-Modal Drone Video-Text Retrieval
【速读】:该论文旨在解决无人机视频与文本之间的跨模态语义检索问题(drone video-text retrieval, DVTR),其核心挑战在于无人机视频具有俯视视角、强结构同质性以及目标组合的多样语义表达,这使得现有针对地面视角设计的跨模态方法难以有效建模其特征。解决方案的关键是提出一种多语义自适应挖掘方法(Multi-Semantic Adaptive Mining, MSAM),该方法通过引入动态帧间变化建模和特定场景区域的细粒度语义提取机制,增强对无人机视频内容的深度理解与推理能力;同时,MSAM结合自适应语义构建模块、分布驱动语义学习项和多样性语义项,强化文本与视频模态间的细粒度交互,并采用跨模态交互特征融合池化机制聚焦目标区域特征提取与匹配,从而降低复杂背景干扰,提升特征表示鲁棒性。
链接: https://arxiv.org/abs/2510.15470
作者: Jinghao Huang,Yaxiong Chen,Ganchao Liu
机构: Sun Yat-sen University (中山大学); Wuhan University of Technology (武汉理工大学); Sanya Science and Education Innovation Park of Wuhan University of Technology (武汉理工大学三亚科教创新园); Northwestern Polytechnical University (西北工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:
Abstract:With the advancement of drone technology, the volume of video data increases rapidly, creating an urgent need for efficient semantic retrieval. We are the first to systematically propose and study the drone video-text retrieval (DVTR) task. Drone videos feature overhead perspectives, strong structural homogeneity, and diverse semantic expressions of target combinations, which challenge existing cross-modal methods designed for ground-level views in effectively modeling their characteristics. Therefore, dedicated retrieval mechanisms tailored for drone scenarios are necessary. To address this issue, we propose a novel approach called Multi-Semantic Adaptive Mining (MSAM). MSAM introduces a multi-semantic adaptive learning mechanism, which incorporates dynamic changes between frames and extracts rich semantic information from specific scene regions, thereby enhancing the deep understanding and reasoning of drone video content. This method relies on fine-grained interactions between words and drone video frames, integrating an adaptive semantic construction module, a distribution-driven semantic learning term and a diversity semantic term to deepen the interaction between text and drone video modalities and improve the robustness of feature representation. To reduce the interference of complex backgrounds in drone videos, we introduce a cross-modal interactive feature fusion pooling mechanism that focuses on feature extraction and matching in target regions, minimizing noise effects. Extensive experiments on two self-constructed drone video-text datasets show that MSAM outperforms other existing methods in the drone video-text retrieval task. The source code and dataset will be made publicly available.
zh
[CV-44] MRASfM: Multi-Camera Reconstruction and Aggregation through Structure-from-Motion in Driving Scenes
【速读】:该论文旨在解决多摄像头系统在驾驶场景中应用结构光恢复(Structure from Motion, SfM)时面临的三大挑战:相机位姿估计不可靠、道路表面重建存在大量异常点以及重建效率低下。解决方案的关键在于提出一种针对驾驶场景优化的多摄像头重建与聚合SfM框架(Multi-camera Reconstruction and Aggregation Structure-from-Motion, MRASfM):首先利用多摄像头系统固定的几何关系提升位姿估计可靠性;其次引入平面模型剔除三角化后道路表面的错误点以提高重建质量;再次将整个多摄像头系统作为单一单元参与束调整(Bundle Adjustment, BA),从而减少优化变量并提升计算效率;最后通过粗到精的场景关联与组装模块实现多场景聚合,显著增强系统的泛化能力和鲁棒性。
链接: https://arxiv.org/abs/2510.15467
作者: Lingfeng Xuan,Chang Nie,Yiqing Xu,Zhe Liu,Yanzi Miao,Hesheng Wang
机构: Shanghai Jiao Tong University (上海交通大学); China University of Mining and Technology (中国矿业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 11 figures
Abstract:Structure from Motion (SfM) estimates camera poses and reconstructs point clouds, forming a foundation for various tasks. However, applying SfM to driving scenes captured by multi-camera systems presents significant difficulties, including unreliable pose estimation, excessive outliers in road surface reconstruction, and low reconstruction efficiency. To address these limitations, we propose a Multi-camera Reconstruction and Aggregation Structure-from-Motion (MRASfM) framework specifically designed for driving scenes. MRASfM enhances the reliability of camera pose estimation by leveraging the fixed spatial relationships within the multi-camera system during the registration process. To improve the quality of road surface reconstruction, our framework employs a plane model to effectively remove erroneous points from the triangulated road surface. Moreover, treating the multi-camera set as a single unit in Bundle Adjustment (BA) helps reduce optimization variables to boost efficiency. In addition, MRASfM achieves multi-scene aggregation through scene association and assembly modules in a coarse-to-fine fashion. We deployed multi-camera systems on actual vehicles to validate the generalizability of MRASfM across various scenes and its robustness in challenging conditions through real-world applications. Furthermore, large-scale validation results on public datasets show the state-of-the-art performance of MRASfM, achieving 0.124 absolute pose error on the nuScenes dataset.
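文中"用平面模型剔除道路面错误点"可以用经典的 RANSAC 平面拟合来近似:拟合主导平面后按点到平面距离阈值过滤外点。下面是一个纯 NumPy 的示意实现,阈值与迭代次数均为假设值。

```python
import numpy as np

def ransac_plane_filter(points, thresh=0.05, iters=500, rng=None):
    """points: (N,3) 道路面三角化点;返回平面内点掩码。"""
    rng = rng or np.random.default_rng(0)
    best_inliers = np.zeros(len(points), bool)
    for _ in range(iters):
        p = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p[1] - p[0], p[2] - p[0])   # 三点确定的平面法向量
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue                              # 三点近共线,跳过
        n = n / norm
        d = np.abs((points - p[0]) @ n)           # 点到平面的距离
        inliers = d < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

pts = np.random.randn(1000, 3) * [5, 5, 0.02]     # 模拟近似平面的路面点
mask = ransac_plane_filter(pts)
road = pts[mask]                                  # 保留平面内点,剔除三角化外点
```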
zh
[CV-45] Improving Micro-Expression Recognition with Phase-Aware Temporal Augmentation
【速读】:该论文旨在解决微表情识别(Micro-expression Recognition, MER)中因标注数据稀缺导致的模型泛化能力差和运动模式多样性不足的问题。现有方法多依赖简单的空间增强策略(如翻转、旋转),忽视了对时间维度上运动特征的有效利用。其解决方案的关键在于提出一种基于动态图像(Dynamic Image, DI)的相位感知时间增强方法:将每个微表情序列分解为两个运动阶段——起始到峰值(onset-to-apex)和峰值到结束(apex-to-offset),分别为每个阶段生成独立的动态图像,形成双相位动态图像(Dual-phase DI)增强策略。该方法通过引入互补的时间线索,显著提升了运动特征的多样性与表达能力,从而在CASME-II和SAMM等数据集上实现了识别准确率、未加权F1分数和未加权平均召回率的稳定提升,尤其在低资源场景下表现出强鲁棒性和模型无关性。
链接: https://arxiv.org/abs/2510.15466
作者: Vu Tram Anh Khuong,Luu Tu Nguyen,Thanh Ha Le,Thi Duyen Ngo
机构: VNU University of Engineering and Technology (河内国家大学工程与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Micro-expressions (MEs) are brief, involuntary facial movements that reveal genuine emotions, typically lasting less than half a second. Recognizing these subtle expressions is critical for applications in psychology, security, and behavioral analysis. Although deep learning has enabled significant advances in micro-expression recognition (MER), its effectiveness is limited by the scarcity of annotated ME datasets. This data limitation not only hinders generalization but also restricts the diversity of motion patterns captured during training. Existing MER studies predominantly rely on simple spatial augmentations (e.g., flipping, rotation) and overlook temporal augmentation strategies that can better exploit motion characteristics. To address this gap, this paper proposes a phase-aware temporal augmentation method based on dynamic image. Rather than encoding the entire expression as a single onset-to-offset dynamic image (DI), our approach decomposes each expression sequence into two motion phases: onset-to-apex and apex-to-offset. A separate DI is generated for each phase, forming a Dual-phase DI augmentation strategy. These phase-specific representations enrich motion diversity and introduce complementary temporal cues that are crucial for recognizing subtle facial transitions. Extensive experiments on CASME-II and SAMM datasets using six deep architectures, including CNNs, Vision Transformer, and the lightweight LEARNet, demonstrate consistent performance improvements in recognition accuracy, unweighted F1-score, and unweighted average recall, which are crucial for addressing class imbalance in MER. When combined with spatial augmentations, our method achieves up to a 10% relative improvement. The proposed augmentation is simple, model-agnostic, and effective in low-resource settings, offering a promising direction for robust and generalizable MER.
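动态图像(DI)通常可用近似秩池化(approximate rank pooling)的闭式权重一次加权求和得到;在此基础上,按峰值帧把序列切成两段、各生成一张 DI,即得到文中的双相位增强。下面的 NumPy 草图采用常见的调和数形式权重,具体系数以原始 DI 文献为准,仅作示意。

```python
import numpy as np

def dynamic_image(frames):
    """近似秩池化生成动态图像。frames: (T,H,W[,C]) 帧序列。"""
    T = len(frames)
    H = np.concatenate([[0.0], np.cumsum(1.0 / np.arange(1, T + 1))])  # 调和数 H_0..H_T
    t = np.arange(1, T + 1)
    alpha = 2 * (T - t + 1) - (T + 1) * (H[T] - H[t - 1])  # 每帧的闭式权重
    return np.tensordot(alpha, frames.astype(np.float64), axes=(0, 0))

def dual_phase_dis(seq, apex_idx):
    """按峰值帧切成两段,分别生成动态图像,得到双相位增强样本。"""
    return (dynamic_image(seq[:apex_idx + 1]),   # onset -> apex
            dynamic_image(seq[apex_idx:]))       # apex -> offset

seq = np.random.rand(20, 64, 64)
di_on, di_off = dual_phase_dis(seq, apex_idx=9)
```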
zh
[CV-46] DPTrack:Directional Kernel-Guided Prompt Learning for Robust Nighttime Aerial Tracking
【速读】:该论文旨在解决现有基于提示学习(prompt learning)的夜间航拍跟踪器仅依赖空间定位监督、缺乏细粒度目标特征提示导致提示模糊的问题,从而影响跟踪精度。其解决方案的关键在于提出DPTrack,通过将目标属性特征编码进富含细粒度线索的方向性核(directional kernel),生成精准提示;具体而言,首先借鉴视觉仿生学思想,分层捕获目标拓扑结构并利用拓扑属性增强特征表示,随后将这些拓扑感知特征压缩至方向性核中作为核心引导信号,最后构建基于通道-类别对应关系的核引导提示模块,将核传播至搜索区域特征以精确定位目标特征并转化为精确提示,同时引入空间门控机制提升夜间跟踪鲁棒性。
链接: https://arxiv.org/abs/2510.15449
作者: Zhiqiang Zhu,Xinbo Gao,Wen Lu,Jie Li,Zhaoyang Wang,Mingqian Ge
机构: Xidian University (西安电子科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing nighttime aerial trackers based on prompt learning rely solely on spatial localization supervision, which fails to provide fine-grained cues that point to target features and inevitably produces vague prompts. This limitation impairs the tracker’s ability to accurately focus on the object features and results in trackers still performing poorly. To address this issue, we propose DPTrack, a prompt-based aerial tracker designed for nighttime scenarios by encoding the given object’s attribute features into the directional kernel enriched with fine-grained cues to generate precise prompts. Specifically, drawing inspiration from visual bionics, DPTrack first hierarchically captures the object’s topological structure, leveraging topological attributes to enrich the feature representation. Subsequently, an encoder condenses these topology-aware features into the directional kernel, which serves as the core guidance signal that explicitly encapsulates the object’s fine-grained attribute cues. Finally, a kernel-guided prompt module built on channel-category correspondence attributes propagates the kernel across the features of the search region to pinpoint the positions of target features and convert them into precise prompts, integrating spatial gating for robust nighttime tracking. Extensive evaluations on established benchmarks demonstrate DPTrack’s superior performance. Our code will be available at this https URL.
zh
[CV-47] MAVR-Net: Robust Multi-View Learning for MAV Action Recognition with Cross-View Attention
【速读】:该论文旨在解决基于RGB图像的微小型无人机(Micro Aerial Vehicle, MAV)动作识别模型在捕捉复杂时空特征方面的不足,从而限制其对不同飞行动作的区分能力。解决方案的关键在于提出一种多视角学习框架MAVR-Net,通过融合原始RGB帧、光流(optical flow)和分割掩码(segmentation mask)三种互补模态的数据,结合基于ResNet的编码器提取各视角判别性特征,并引入多尺度特征金字塔以保留运动模式的时空细节;同时设计跨视角注意力模块增强不同模态与特征尺度间的交互关系,并采用多视角对齐损失确保语义一致性,显著提升了MAV动作识别的准确性和鲁棒性。
链接: https://arxiv.org/abs/2510.15448
作者: Nengbo Zhang,Hann Woei Ho
机构: Universiti Sains Malaysia (马来西亚理科大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recognizing the motion of Micro Aerial Vehicles (MAVs) is crucial for enabling cooperative perception and control in autonomous aerial swarms. Yet, vision-based recognition models relying only on RGB data often fail to capture the complex spatial temporal characteristics of MAV motion, which limits their ability to distinguish different actions. To overcome this problem, this paper presents MAVR-Net, a multi-view learning-based MAV action recognition framework. Unlike traditional single-view methods, the proposed approach combines three complementary types of data, including raw RGB frames, optical flow, and segmentation masks, to improve the robustness and accuracy of MAV motion recognition. Specifically, ResNet-based encoders are used to extract discriminative features from each view, and a multi-scale feature pyramid is adopted to preserve the spatiotemporal details of MAV motion patterns. To enhance the interaction between different views, a cross-view attention module is introduced to model the dependencies among various modalities and feature scales. In addition, a multi-view alignment loss is designed to ensure semantic consistency and strengthen cross-view feature representations. Experimental results on benchmark MAV action datasets show that our method clearly outperforms existing approaches, achieving 97.8%, 96.5%, and 92.8% accuracy on the Short MAV, Medium MAV, and Long MAV datasets, respectively.
zh
[CV-48] Select Less Reason More: Prioritizing Evidence Purity for Video Reasoning
【速读】:该论文旨在解决长视频推理中因静态均匀帧采样导致的信息稀释问题,以及现有像素空间视频推理代理因缺乏严谨的奖励机制而难以保证证据纯度、且无法在预采样帧之外进行时间信息补全的问题。其解决方案的关键在于提出了一种基于“少选多思”理念的证据优先自适应框架,核心是证据感知强化学习(Evidence-aware Reinforcement Learning, EARL),该框架使模型成为主动的证据探查者,能够动态选择最相关帧,并在关键帧周围进行局部重采样以获取细粒度的时间细节,从而显著提升视觉证据的纯度与有效性。
链接: https://arxiv.org/abs/2510.15440
作者: Xuchen Li,Xuzhao Li,Shiyu Hu,Kaiqi Huang
机构: CASIA(中国科学院自动化研究所); UCAS(中国科学院大学); ZGCA(中国科学院自动化研究所); NTU(南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint, Under review
Abstract:Long-form video reasoning remains a major challenge for Video Large Language Models (Video LLMs), as static uniform frame sampling leads to information dilution and obscures critical evidence. Furthermore, existing pixel-space video reasoning agents, which are designed to actively interact with the video to acquire new visual information, remain suboptimal due to their lack of rigorous reward mechanisms to enforce evidence purity and their inability to perform temporal information supplementation beyond pre-sampled frames. To address this critical gap, we propose a novel evidence-prioritized adaptive framework built upon our core philosophy: “Select Less, Reason More.” Our core contribution is the evidence-aware reinforcement learning (EARL) framework, which transforms the model into an active interrogator of evidence. EARL is precisely engineered to dynamically select the most relevant frames and, crucially, to perform localized re-sampling around the selected key frames to access fine-grained temporal detail. Extensive experiments on five demanding video reasoning benchmarks demonstrate that our EARL-trained model achieves new state-of-the-art among open-source Video LLMs, simultaneously learning an effective and high-purity visual evidence selection policy. Impressively, our 7B model achieves 59.8% on LongVideoBench, 69.0% on MVBench and 64.9% on VideoMME. These results highlight the importance of prioritizing evidence purity and the effectiveness of our framework.
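"选少、想多"的机制可以抽象成两步:按相关性分数挑出少量关键帧,再在关键帧邻域做局部重采样补充细粒度时间信息。下面是一个纯 NumPy 的流程示意,分数来源、窗口大小与补采帧数均为假设,并非 EARL 的具体实现。

```python
import numpy as np

def select_and_resample(scores, fps, top_k=4, window_s=1.0, local_n=4):
    """scores: 预采样帧的相关性分数;返回关键帧时间戳及邻域重采样时间戳。"""
    key_idx = np.argsort(scores)[-top_k:]        # 取相关性最高的 top_k 帧
    key_t = np.sort(key_idx / fps)               # 换算为时间戳(假设均匀预采样)
    extra = []
    for t in key_t:
        # 在关键帧前后 window_s 秒内均匀补采 local_n 帧,获取细粒度时间细节
        extra.extend(np.linspace(max(0.0, t - window_s), t + window_s, local_n))
    return key_t, np.unique(np.round(extra, 3))

scores = np.random.rand(64)                      # 假设 64 帧均匀预采样、1 fps
key_t, resample_t = select_and_resample(scores, fps=1.0)
```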
zh
[CV-49] Rethinking Convergence in Deep Learning: The Predictive-Corrective Paradigm for Anatomy-Informed Brain MRI Segmentation
【速读】:该论文旨在解决深度学习中端到端(end-to-end)范式在数据稀缺领域(如医学影像)中存在的收敛速度慢和对大规模数据依赖性强的问题,从而限制了其效率与适用性。解决方案的关键在于提出预测-校正(Predictive-Corrective, PC)范式,该范式通过解耦建模任务来显著加速学习过程:首先利用预测先验模块(Predictive Prior Module, PPM)以低计算成本生成粗略近似,并借助解剖学知识(如双侧对称性)预测诊断相关不对称区域的“关注图”;随后,校正残差网络(Corrective Residual Network, CRN)专注于学习残差误差,将模型全部能力集中于精细化处理这些关键区域,从而实现高精度分割与极快收敛(仅需1–5个epoch),有效缓解了数据效率低下和过拟合问题。
链接: https://arxiv.org/abs/2510.15439
作者: Feifei Zhang,Zhenhong Jia,Sensen Song,Fei Shi,Dayong Ren
机构: Xinjiang University (新疆大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite the remarkable success of the end-to-end paradigm in deep learning, it often suffers from slow convergence and heavy reliance on large-scale datasets, which fundamentally limits its efficiency and applicability in data-scarce domains such as medical imaging. In this work, we introduce the Predictive-Corrective (PC) paradigm, a framework that decouples the modeling task to fundamentally accelerate learning. Building upon this paradigm, we propose a novel network, termed PCMambaNet. PCMambaNet is composed of two synergistic modules. First, the Predictive Prior Module (PPM) generates a coarse approximation at low computational cost, thereby anchoring the search space. Specifically, the PPM leverages anatomical knowledge (bilateral symmetry) to predict a ‘focus map’ of diagnostically relevant asymmetric regions. Next, the Corrective Residual Network (CRN) learns to model the residual error, focusing the network’s full capacity on refining these challenging regions and delineating precise pathological boundaries. Extensive experiments on high-resolution brain MRI segmentation demonstrate that PCMambaNet achieves state-of-the-art accuracy while converging within only 1–5 epochs, a performance unattainable by conventional end-to-end models. This dramatic acceleration highlights that by explicitly incorporating domain knowledge to simplify the learning objective, PCMambaNet effectively mitigates data inefficiency and overfitting.
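PPM 借助双侧对称先验生成"关注图"的想法,可用"原图与左右翻转图做差"来近似:不对称区域差值大,即为诊断相关的候选区域。下面是一个 NumPy 草图,平滑与归一化方式均为假设的简化做法。

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def asymmetry_focus_map(slice_2d, sigma=2.0):
    """slice_2d: 已沿中线对齐的 2D 脑 MRI 切片;返回 [0,1] 的不对称关注图。"""
    mirrored = slice_2d[:, ::-1]                    # 沿左右中线翻转
    diff = np.abs(slice_2d.astype(np.float32) - mirrored)
    diff = gaussian_filter(diff, sigma)             # 平滑,抑制配准噪声
    rng = diff.max() - diff.min()
    return (diff - diff.min()) / (rng + 1e-8)

focus = asymmetry_focus_map(np.random.rand(240, 240))
```

在 PC 范式下,该关注图即可作为低成本的"预测先验",后续残差网络只需在高响应区域内精修分割边界。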
zh
[CV-50] Semantic4Safety: Causal Insights from Zero-shot Street View Imagery Segmentation for Urban Road Safety
【速读】:该论文旨在解决街景影像(Street-view Imagery, SVI)在交通风险分析中的两个核心问题:一是如何构建能够捕捉事故相关特征的街道级指标,二是如何量化这些指标对不同类型交通事故的因果影响。其解决方案的关键在于提出Semantic4Safety框架,该框架通过零样本语义分割技术从SVI中提取11个可解释的街道景观指标,并引入道路类型作为上下文信息;随后结合XGBoost多分类器、SHAP值解释模型进行全局与局部特征贡献分析,并采用广义倾向得分(Generalized Propensity Score, GPS)加权和平均处理效应(Average Treatment Effect, ATE)估计来控制混杂因素并量化因果效应,从而实现从预测建模到因果推断的有效衔接,支持针对性干预和高风险路段诊断。
链接: https://arxiv.org/abs/2510.15434
作者: Huan Chen,Ting Han,Siyu Chen,Zhihao Guo,Yiping Chen,Meiliu Wu
机构: Sun Yat-sen University (中山大学); University of Glasgow (格拉斯哥大学); Shanxi University (山西大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 11 pages, 10 figures, The 8th ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery (GeoAI '25), November 3–6, 2025, Minneapolis, MN, USA
Abstract:Street-view imagery (SVI) offers a fine-grained lens on traffic risk, yet two fundamental challenges persist: (1) how to construct street-level indicators that capture accident-related features, and (2) how to quantify their causal impacts across different accident types. To address these challenges, we propose Semantic4Safety, a framework that applies zero-shot semantic segmentation to SVIs to derive 11 interpretable streetscape indicators, and integrates road type as contextual information to analyze approximately 30,000 accident records in Austin. Specifically, we train an eXtreme Gradient Boosting (XGBoost) multi-class classifier and use Shapley Additive Explanations (SHAP) to interpret both global and local feature contributions, and then apply Generalized Propensity Score (GPS) weighting and Average Treatment Effect (ATE) estimation to control confounding and quantify causal effects. Results uncover heterogeneous, accident-type-specific causal patterns: features capturing scene complexity, exposure, and roadway geometry dominate predictive power; larger drivable area and emergency space reduce risk, whereas excessive visual openness can increase it. By bridging predictive modeling with causal inference, Semantic4Safety supports targeted interventions and high-risk corridor diagnosis, offering a scalable, data-informed tool for urban road safety planning.
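预测与解释两步(XGBoost 多分类 + SHAP 贡献分析)可以直接用公开库搭出骨架;下面给出最小示意,特征与标签为随机占位数据,类别数等设置均为假设。因果推断部分(GPS 加权与 ATE 估计)可在此之上用倾向得分库另行实现。

```python
import numpy as np
import xgboost as xgb
import shap

# 11 个街景指标 + 道路类型上下文(此处用随机占位数据)
X = np.random.rand(2000, 12)
y = np.random.randint(0, 4, 2000)          # 假设 4 类事故类型

model = xgb.XGBClassifier(n_estimators=200, max_depth=6,
                          learning_rate=0.1, objective="multi:softprob")
model.fit(X, y)

explainer = shap.TreeExplainer(model)          # 树模型的高效 SHAP 解释器
shap_values = explainer.shap_values(X[:100])   # 前 100 个样本的局部特征贡献
```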
zh
[CV-51] Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在面对未知越狱攻击(jailbreak attacks)时安全性不足的问题。现有检测方法要么依赖特定攻击的参数学习,导致泛化能力差;要么基于启发式原则,限制了检测的准确性和效率。其解决方案的关键在于提出一种名为“Learning to Detect”(LoD)的通用框架,通过将关注点从攻击特定学习转向任务特定学习,实现对未知攻击的高精度检测。该框架包含两个核心模块:多模态安全概念激活向量(Multi-modal Safety Concept Activation Vector)用于安全导向的表征学习,以及安全模式自动编码器(Safety Pattern Auto-Encoder)用于无监督的攻击分类,从而在多样化的未知攻击场景下显著提升检测的AUROC指标并优化计算效率。
链接: https://arxiv.org/abs/2510.15430
作者: Shuang Liang,Zhihao Xu,Jialing Tao,Hui Xue,Xiting Wang
机构: Renmin University of China (中国人民大学); Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at this https URL.
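SPAE 做无监督攻击分类的思路,这里用最朴素的"自编码器重构误差"来近似示意:仅在良性样本的安全表征上训练重构,重构误差显著偏高者判为疑似越狱。维度、阈值与训练细节均为假设,并非原文实现。

```python
import torch
import torch.nn as nn

class SafetyPatternAE(nn.Module):
    """在安全导向表征上训练的小型自编码器。"""
    def __init__(self, d=256, h=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d, h), nn.ReLU())
        self.dec = nn.Linear(h, d)
    def forward(self, x):
        return self.dec(self.enc(x))

ae = SafetyPatternAE()
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
feats = torch.randn(512, 256)                # 假设:良性样本的安全表征

for _ in range(100):                         # 仅在良性分布上学习重构
    opt.zero_grad()
    loss = ((ae(feats) - feats) ** 2).mean()
    loss.backward()
    opt.step()

@torch.no_grad()
def is_attack(x, tau=0.9):                   # tau 为假设阈值
    err = ((ae(x) - x) ** 2).mean(dim=-1)
    return err > tau                         # 重构误差高 -> 判为疑似攻击
```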
zh
[CV-52] Robust High-Resolution Multi-Organ Diffusion MRI Using Synthetic-Data-Tuned Prompt Learning
【速读】:该论文旨在解决多-shot扩散加权磁共振成像(multi-shot diffusion-weighted magnetic resonance imaging, multi-shot DWI)在全身肿瘤诊断中因呼吸、蠕动等生理运动引起的相位伪影严重、且受多器官、多层面、多方向和多b值复杂性限制而难以临床应用的问题。解决方案的关键在于提出一种物理信息驱动的重建框架LoSP-Prompt,其核心创新是将各shot间的相位变化建模为高阶局部平滑相位(Locally Smooth Phase, LoSP),并嵌入低秩Hankel矩阵重构中;同时,利用仅基于模拟生理运动的腹部DWI合成数据进行提示学习(prompt learning),自动确定算法的秩参数,从而实现无需导航信号或真实数据监督的高分辨率、多器官通用重建,显著提升图像质量与抗伪影能力。
链接: https://arxiv.org/abs/2510.15400
作者: Chen Qian,Haoyu Zhang,Junnan Ma,Liuhong Zhu,Qingrui Cai,Yu Wang,Ruibo Song,Lv Li,Lin Mei,Xianwang Jiang,Qin Xu,Boyu Jiang,Ran Tao,Chunmiao Chen,Shufang Chen,Dongyun Liang,Qiu Guo,Jianzhong Lin,Taishan Kang,Mengtian Lu,Liyuan Fu,Ruibin Huang,Huijuan Wan,Xu Huang,Jianhua Wang,Di Guo,Hai Zhong,Jianjun Zhou,Xiaobo Qu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
备注: 43 pages, 27 figures
Abstract:Clinical adoption of multi-shot diffusion-weighted magnetic resonance imaging (multi-shot DWI) for body-wide tumor diagnostics is limited by severe motion-induced phase artifacts from respiration, peristalsis, and so on, compounded by multi-organ, multi-slice, multi-direction and multi-b-value complexities. Here, we introduce a reconstruction framework, LoSP-Prompt, that overcomes these challenges through physics-informed modeling and synthetic-data-driven prompt learning. We model inter-shot phase variations as a high-order Locally Smooth Phase (LoSP), integrated into a low-rank Hankel matrix reconstruction. Crucially, the algorithm’s rank parameter is automatically set via prompt learning trained exclusively on synthetic abdominal DWI data emulating physiological motion. Validated across 10,000+ clinical images (43 subjects, 4 scanner models, 5 centers), LoSP-Prompt: (1) Achieved twice the spatial resolution of clinical single-shot DWI, enhancing liver lesion conspicuity; (2) Generalized to seven diverse anatomical regions (liver, kidney, sacroiliac, pelvis, knee, spinal cord, brain) with a single model; (3) Outperformed state-of-the-art methods in image quality, artifact suppression, and noise reduction (11 radiologists’ evaluations on a 5-point scale, p < 0.05), achieving 4-5 points (excellent) on kidney DWI, 4 points (good to excellent) on liver, sacroiliac and spinal cord DWI, and 3-4 points (good) on knee and brain tumor DWI. The approach eliminates the need for navigator signals and supervision from real data, providing an interpretable, robust solution for high-resolution multi-organ multi-shot DWI. Its scanner-agnostic performance signifies transformative potential for precision oncology.
zh
[CV-53] MARIS: Marine Open-Vocabulary Instance Segmentation with Geometric Enhancement and Semantic Alignment
【速读】:该论文旨在解决水下实例分割(Underwater Instance Segmentation)中因封闭词汇(Close-Vocabulary)限制导致的难以识别新海洋类别的问题,即现有方法无法有效处理未见类别(Unseen Categories)的分割任务。为应对这一挑战,作者提出了一个统一框架,其关键在于两个互补模块:几何先验增强模块(Geometric Prior Enhancement Module, GPEM),利用稳定的部分级和结构线索在视觉退化条件下保持目标一致性;语义对齐注入机制(Semantic Alignment Injection Mechanism, SAIM),通过引入领域特定先验丰富语言嵌入,缓解语义模糊性并提升对未见类别的识别能力。该方案在首个大规模水下开放词汇分割基准MARIS上验证了有效性,显著优于现有基线模型。
链接: https://arxiv.org/abs/2510.15398
作者: Bingyu Li,Feiyu Wang,Da Zhang,Zhiyuan Zhao,Junyu Gao,Xuelong Li
机构: TeleAI; USTC; Fudan University; NWPU
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Most existing underwater instance segmentation approaches are constrained by close-vocabulary prediction, limiting their ability to recognize novel marine categories. To support evaluation, we introduce MARIS (Marine Open-Vocabulary Instance Segmentation), the first large-scale fine-grained benchmark for underwater Open-Vocabulary (OV) segmentation, featuring a limited set of seen categories and diverse unseen categories. Although OV segmentation has shown promise on natural images, our analysis reveals that transfer to underwater scenes suffers from severe visual degradation (e.g., color attenuation) and semantic misalignment caused by the lack of underwater class definitions. To address these issues, we propose a unified framework with two complementary components. The Geometric Prior Enhancement Module (GPEM) leverages stable part-level and structural cues to maintain object consistency under degraded visual conditions. The Semantic Alignment Injection Mechanism (SAIM) enriches language embeddings with domain-specific priors, mitigating semantic ambiguity and improving recognition of unseen categories. Experiments show that our framework consistently outperforms existing OV baselines in both in-domain and cross-domain settings on MARIS, establishing a strong foundation for future underwater perception research.
zh
[CV-54] LILAC: Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding
【速读】:该论文旨在解决实时生成长序列、风格化人体运动的问题,现有流式方法通常直接在原始运动空间中操作,导致计算开销大且难以保持时间稳定性;而基于潜空间的变分自编码器-扩散模型(VAE-Diffusion)框架虽能实现高质量风格化,但多局限于离线处理。解决方案的关键在于提出LILAC(Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding),其通过潜空间流式架构结合滑动窗口因果解码设计,并注入已解码运动特征以确保运动过渡平滑,从而在不依赖未来帧或修改扩散模型结构的前提下,实现长序列实时任意风格化,兼顾风格化质量和响应速度。
链接: https://arxiv.org/abs/2510.15392
作者: Peng Ren,Hai Yang
机构: University of California, Santa Cruz (加州大学圣克鲁兹分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Generating long and stylized human motions in real time is critical for applications that demand continuous and responsive character control. Despite its importance, existing streaming approaches often operate directly in the raw motion space, leading to substantial computational overhead and making it difficult to maintain temporal stability. In contrast, latent-space VAE-Diffusion-based frameworks alleviate these issues and achieve high-quality stylization, but they are generally confined to offline processing. To bridge this gap, LILAC (Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding) builds upon a recent high-performing offline framework for arbitrary motion stylization and extends it to an online setting through a latent-space streaming architecture with a sliding-window causal design and the injection of decoded motion features to ensure smooth motion transitions. This architecture enables long-sequence real-time arbitrary stylization without relying on future frames or modifying the diffusion model architecture, achieving a favorable balance between stylization quality and responsiveness as demonstrated by experiments on benchmark datasets. Supplementary video and examples are available at the project page: this https URL
zh
[CV-55] PFGS: Pose-Fused 3D Gaussian Splatting for Complete Multi-Pose Object Reconstruction
【速读】:该论文旨在解决现有3D高斯溅射(3D Gaussian Splatting, 3DGS)方法在多姿态图像捕获场景下重建不完整的问题,即当物体在多个静态姿态下被拍摄时,传统方法因假设单一主视角而无法有效重建被遮挡或自遮挡区域。其解决方案的关键在于提出一种姿态感知的3DGS框架PFGS(Pose-aware Fusion for 3DGS),通过迭代融合不同辅助姿态的图像到主姿态的统一3DGS表示中,结合全局与局部配准策略实现高效且精准的视图融合;同时,PFGS创新性地利用背景特征进行每姿态相机位姿估计,并引入3D基础模型(3D foundation models)完成跨姿态配准,在提升注册鲁棒性和效率的同时,缓解了背景不一致导致的误差问题,从而显著改善重建完整性与模型保真度。
链接: https://arxiv.org/abs/2510.15386
作者: Ting-Yu Yen,Yu-Sheng Chiu,Shih-Hsuan Hung,Peter Wonka,Hung-Kuo Chu
机构: National Tsing Hua University (国立清华大学); King Abdullah University of Science and Technology (阿卜杜拉国王科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have enabled high-quality, real-time novel-view synthesis from multi-view images. However, most existing methods assume the object is captured in a single, static pose, resulting in incomplete reconstructions that miss occluded or self-occluded regions. We introduce PFGS, a pose-aware 3DGS framework that addresses the practical challenge of reconstructing complete objects from multi-pose image captures. Given images of an object in one main pose and several auxiliary poses, PFGS iteratively fuses each auxiliary set into a unified 3DGS representation of the main pose. Our pose-aware fusion strategy combines global and local registration to merge views effectively and refine the 3DGS model. While recent advances in 3D foundation models have improved registration robustness and efficiency, they remain limited by high memory demands and suboptimal accuracy. PFGS overcomes these challenges by incorporating them more intelligently into the registration process: it leverages background features for per-pose camera pose estimation and employs foundation models for cross-pose registration. This design captures the best of both approaches while resolving background inconsistency issues. Experimental results demonstrate that PFGS consistently outperforms strong baselines in both qualitative and quantitative evaluations, producing more complete reconstructions and higher-fidelity 3DGS models.
zh
[CV-56] FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers ICCV2025
【速读】:该论文旨在解决多视角2D图像中3D目标检测的准确性问题,尤其是现有方法依赖LiDAR点云进行显式深度监督时所导致的深度预测质量不佳问题,如物体边界处的深度不连续性和小目标区分度不足等。其关键解决方案在于提出频率感知的位置深度嵌入(Frequency-aware Positional Depth Embedding, FreqPDE),通过三个核心模块实现:1)频率感知的空间金字塔编码器(Frequency-aware Spatial Pyramid Encoder, FSPE)融合不同层级的高频边缘信息与低频语义特征构建多尺度特征金字塔;2)跨视图尺度不变深度预测器(Cross-view Scale-invariant Depth Predictor, CSDP)利用跨视图和高效通道注意力机制估计像素级深度分布;3)位置深度编码器(Positional Depth Encoder, PDE)将2D图像特征与3D位置嵌入结合生成用于查询解码的3D深度感知特征。此外,采用混合深度监督策略从度量和分布两个维度互补提升深度学习效果。
链接: https://arxiv.org/abs/2510.15385
作者: Haisheng Su,Junjie Zhang,Feixiang Song,Sanping Zhou,Wei Wu,Nanning Zheng,Junchi Yan
机构: Shanghai Jiao Tong University (上海交通大学); Xi’an Jiaotong University (西安交通大学); SenseAuto Research (深势科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV2025
Abstract:Detecting 3D objects accurately from multi-view 2D images is a challenging yet essential task in the field of autonomous driving. Current methods resort to integrating depth prediction to recover the spatial information for object query decoding, which necessitates explicit supervision from LiDAR points during the training phase. However, the predicted depth quality is still unsatisfactory such as depth discontinuity of object boundaries and indistinction of small objects, which are mainly caused by the sparse supervision of projected points and the use of high-level image features for depth prediction. Besides, cross-view consistency and scale invariance are also overlooked in previous methods. In this paper, we introduce Frequency-aware Positional Depth Embedding (FreqPDE) to equip 2D image features with spatial information for 3D detection transformer decoder, which can be obtained through three main modules. Specifically, the Frequency-aware Spatial Pyramid Encoder (FSPE) constructs a feature pyramid by combining high-frequency edge clues and low-frequency semantics from different levels respectively. Then the Cross-view Scale-invariant Depth Predictor (CSDP) estimates the pixel-level depth distribution with cross-view and efficient channel attention mechanism. Finally, the Positional Depth Encoder (PDE) combines the 2D image features and 3D position embeddings to generate the 3D depth-aware features for query decoding. Additionally, hybrid depth supervision is adopted for complementary depth learning from both metric and distribution aspects. Extensive experiments conducted on the nuScenes dataset demonstrate the effectiveness and superiority of our proposed method.
zh
[CV-57] Adaptive transfer learning for surgical tool presence detection in laparoscopic videos through gradual freezing fine-tuning
【速读】:该论文旨在解决微创手术中由于标注数据稀缺导致深度学习模型训练困难的问题,从而实现对 surgical tools 的自动化检测。其解决方案的关键在于提出了一种分阶段自适应微调(staged adaptive fine-tuning)策略,包含两个核心步骤:首先通过线性探测(linear probing)阶段在预训练 CNN 架构(如 ResNet-50 和 DenseNet-121)基础上引入额外分类层以适配手术领域;其次通过渐进式冻结(gradual freezing)阶段动态减少可微调参数数量,从而控制模型对新域的适应程度。该方法仅需单次训练循环即可显著提升检测性能(在 Cholec80 数据集上达到 mAP 96.4%),且具有良好的跨域泛化能力,在眼科手术数据集 CATARACTS 上也验证了其有效性,表明该策略在多样化手术场景中具有广泛适用性。
链接: https://arxiv.org/abs/2510.15372
作者: Ana Davila,Jacinto Colan,Yasuhisa Hasegawa
机构: Institutes of Innovation for Future Society, Nagoya University (名古屋大学未来社会创新研究所); Department of Micro-Nano Mechanical Science and Engineering, Nagoya University (名古屋大学微纳米机械科学与工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Minimally invasive surgery can benefit significantly from automated surgical tool detection, enabling advanced analysis and assistance. However, the limited availability of annotated data in surgical settings poses a challenge for training robust deep learning models. This paper introduces a novel staged adaptive fine-tuning approach consisting of two steps: a linear probing stage to condition additional classification layers on a pre-trained CNN-based architecture and a gradual freezing stage to dynamically reduce the fine-tunable layers, aiming to regulate adaptation to the surgical domain. This strategy reduces network complexity and improves efficiency, requiring only a single training loop and eliminating the need for multiple iterations. We validated our method on the Cholec80 dataset, employing CNN architectures (ResNet-50 and DenseNet-121) pre-trained on ImageNet for detecting surgical tools in cholecystectomy endoscopic videos. Our results demonstrate that our method improves detection performance compared to existing approaches and established fine-tuning techniques, achieving a mean average precision (mAP) of 96.4%. To assess its broader applicability, the generalizability of the fine-tuning strategy was further confirmed on the CATARACTS dataset, a distinct domain of minimally invasive ophthalmic surgery. These findings suggest that gradual freezing fine-tuning is a promising technique for improving tool presence detection in diverse surgical procedures and may have broader applications in general image classification tasks.
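两阶段策略(线性探测 → 渐进式冻结)的骨架可以用 PyTorch 直接写出:先只训练新增分类头,再在第二阶段按计划逐块重新冻结靠底层的参数。下面的冻结节奏与输出维度均为示意性假设(Cholec80 共 7 类器械,工具存在检测通常按多标签处理)。

```python
import torch.nn as nn
from torchvision.models import resnet50

model = resnet50(weights="IMAGENET1K_V2")
model.fc = nn.Linear(model.fc.in_features, 7)   # 7 类手术器械的存在性预测头

# 阶段一:线性探测,冻结主干,仅训练新分类层
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True
# ... 在此用常规训练循环训练若干 epoch ...

# 阶段二:渐进式冻结,先放开全部层,再随训练推进逐块重新冻结底层
blocks = [model.layer1, model.layer2, model.layer3, model.layer4]

def apply_gradual_freeze(epoch, freeze_every=2):
    """每 freeze_every 个 epoch 冻结一个更深的底层 block(示意性计划)。"""
    for p in model.parameters():
        p.requires_grad = True
    n_frozen = min(epoch // freeze_every, len(blocks))
    for blk in blocks[:n_frozen]:
        for p in blk.parameters():
            p.requires_grad = False
```

相比反复多轮微调,这一计划只需单次训练循环,可微调参数量随训练动态收缩。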
zh
[CV-58] Cortical-SSM: A Deep State Space Model for EEG and ECoG Motor Imagery Decoding
【速读】:该论文旨在解决运动想象(Motor Imagery, MI)过程中获取的脑电图(EEG)和皮层电图(ECoG)信号在分类任务中因生理伪影干扰及难以捕捉细粒度时空频域依赖关系而导致性能受限的问题。其解决方案的关键在于提出一种名为Cortical-SSM的新架构,该架构将深度状态空间模型(Deep State Space Models, DSSMs)扩展至跨时间、空间和频率维度,以集成建模EEG与ECoG信号中的复杂依赖结构,从而提升分类准确性并增强对神经生理学相关区域的可解释性。
链接: https://arxiv.org/abs/2510.15371
作者: Shuntaro Suzuki,Shunya Nagashima,Masayuki Hirata,Komei Sugiura
机构: Keio University (庆应义塾大学); Osaka University (大阪大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Classification of electroencephalogram (EEG) and electrocorticogram (ECoG) signals obtained during motor imagery (MI) has substantial application potential, including for communication assistance and rehabilitation support for patients with motor impairments. These signals remain inherently susceptible to physiological artifacts (e.g., eye blinking, swallowing), which pose persistent challenges. Although Transformer-based approaches for classifying EEG and ECoG signals have been widely adopted, they often struggle to capture fine-grained dependencies within them. To overcome these limitations, we propose Cortical-SSM, a novel architecture that extends deep state space models to capture integrated dependencies of EEG and ECoG signals across temporal, spatial, and frequency domains. We validated our method across three benchmarks: 1) two large-scale public MI EEG datasets containing more than 50 subjects, and 2) a clinical MI ECoG dataset recorded from a patient with amyotrophic lateral sclerosis. Our method outperformed baseline methods on the three benchmarks. Furthermore, visual explanations derived from our model indicate that it effectively captures neurophysiologically relevant regions of both EEG and ECoG signals.
zh
[CV-59] SHARE: Scene-Human Aligned Reconstruction SIGGRAPH
【速读】:该论文旨在解决当前人体运动重建方法在三维空间中难以准确定位人类姿态的问题,尤其是在复杂场景下缺乏可靠的空间参考导致重建结果漂移或失真。其解决方案的关键在于提出Scene-Human Aligned REconstruction (SHARE)框架,该框架通过利用场景几何结构提供的隐式空间线索来对齐人体与环境:首先从单目RGB视频中估计每帧的人体网格和分割掩码,并在关键帧上提取场景点云;随后通过对比人体网格与由掩码提取的场景点云,迭代优化关键帧中人体的位置;同时,为保证非关键帧的人体网格一致性,采用保持其相对根关节位置相对于关键帧根关节的方式进行约束优化。这一机制显著提升了人体在3D空间中的定位精度,从而实现更可靠的场景-人体协同重建。
链接: https://arxiv.org/abs/2510.15342
作者: Joshua Li,Brendan Chharawala,Chang Shu,Xue Bin Peng,Pengcheng Xi
机构: National Research Council Canada(加拿大国家研究委员会); University of Waterloo(滑铁卢大学); Simon Fraser University(西蒙菲莎大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: SIGGRAPH Asia Technical Communications 2025
Abstract:Animating realistic character interactions with the surrounding environment is important for autonomous agents in gaming, AR/VR, and robotics. However, current methods for human motion reconstruction struggle with accurately placing humans in 3D space. We introduce Scene-Human Aligned REconstruction (SHARE), a technique that leverages the scene geometry’s inherent spatial cues to accurately ground human motion reconstruction. Each reconstruction relies solely on a monocular RGB video from a stationary camera. SHARE first estimates a human mesh and segmentation mask for every frame, alongside a scene point map at keyframes. It iteratively refines the human’s positions at these keyframes by comparing the human mesh against the human point map extracted from the scene using the mask. Crucially, we also ensure that non-keyframe human meshes remain consistent by preserving their relative root joint positions to keyframe root joints during optimization. Our approach enables more accurate 3D human placement while reconstructing the surrounding scene, facilitating use cases on both curated datasets and in-the-wild web videos. Extensive experiments demonstrate that SHARE outperforms existing methods.
zh
[CV-60] Proto-Former: Unified Facial Landmark Detection by Prototype Transformer
【速读】:该论文旨在解决现有面部关键点检测方法因训练数据集定义的关键点数量不一致,导致模型泛化能力受限、难以构建统一模型的问题。解决方案的关键在于提出Proto-Former框架,其核心创新是引入可自适应的原型表示机制:通过一个自适应原型感知编码器(Adaptive Prototype-Aware Encoder, APAE)学习不同数据集特有的面部结构原型(prototype),并借助一个渐进式原型感知解码器(Progressive Prototype-Aware Decoder, PPAD)利用这些原型生成引导注意力的关键区域提示;同时设计了一种新型原型感知损失(Prototype-Aware Loss, PA loss),有效稳定多数据集联合训练中专家原型的选择权重,缓解梯度冲突,从而提升面部结构特征提取的准确性与鲁棒性。
链接: https://arxiv.org/abs/2510.15338
作者: Shengkai Hu,Haozhe Qi,Jun Wan,Jiaxing Huang,Lefei Zhang,Hang Sun,Dacheng Tao
机构: Zhongnan University of Economics and Law (中南财经政法大学); Nanyang Technological University (南洋理工大学); Wuhan University (武汉大学); China Three Gorges University (三峡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted by TMM October 2025. Project page: this https URL
Abstract:Recent advances in deep learning have significantly improved facial landmark detection. However, existing facial landmark detection datasets often define different numbers of landmarks, and most mainstream methods can only be trained on a single dataset. This limits the model generalization to different datasets and hinders the development of a unified model. To address this issue, we propose Proto-Former, a unified, adaptive, end-to-end facial landmark detection framework that explicitly enhances dataset-specific facial structural representations (i.e., prototype). Proto-Former overcomes the limitations of single-dataset training by enabling joint training across multiple datasets within a unified architecture. Specifically, Proto-Former comprises two key components: an Adaptive Prototype-Aware Encoder (APAE) that performs adaptive feature extraction and learns prototype representations, and a Progressive Prototype-Aware Decoder (PPAD) that refines these prototypes to generate prompts that guide the model’s attention to key facial regions. Furthermore, we introduce a novel Prototype-Aware (PA) loss, which achieves optimal path finding by constraining the selection weights of prototype experts. This loss function effectively resolves the problem of prototype expert addressing instability during multi-dataset training, alleviates gradient conflicts, and enables the extraction of more accurate facial structure features. Extensive experiments on widely used benchmark datasets demonstrate that our Proto-Former achieves superior performance compared to existing state-of-the-art methods. The code is publicly available at: this https URL.
zh
[CV-61] Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)因参数量庞大而导致的高计算与存储开销问题,同时克服现有逐层结构化剪枝方法在剪枝过程中忽略保留关键能力、线性权重层聚合效果差以及缺乏有效后训练恢复机制等局限性。其解决方案的关键在于提出一种名为CoMe的框架,包含三个核心组件:一是基于激活强度和权重范数的通道敏感度度量,用于细粒度选择重要通道;二是基于拼接的层融合技术(Concatenation-based Merging),通过融合相邻层中最关键的通道实现渐进式模型压缩;三是分层蒸馏后训练协议,利用剪枝过程中建立的原始模型与剪枝模型层间对应关系,实现高效知识迁移。实验表明,该方法在多个基准测试中达到最优性能,在剪掉LLaMA-2-7b 30%参数的情况下仍保持83%的原始平均准确率。
链接: https://arxiv.org/abs/2510.15304
作者: Fei Wang,Li Shen,Liang Ding,Chao Xue,Ye Liu,Changxing Ding
机构: South China University of Technology (华南理工大学); JD Explore Academy; Shenzhen Campus of Sun Yat-sen University (中山大学深圳校区); University of Sydney (悉尼大学); Pazhou Lab
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models excel at natural language processing tasks, but their massive size leads to high computational and storage demands. Recent works have sought to reduce their model size through layer-wise structured pruning. However, they tend to overlook retaining the capabilities of the pruned parts. In this work, we re-examine structured pruning paradigms and uncover several key limitations: 1) notable performance degradation due to direct layer removal, 2) ineffective aggregation of linear weight layers, and 3) the lack of effective post-training recovery mechanisms. To address these limitations, we propose CoMe, including a progressive layer pruning framework with a Concatenation-based Merging technology and a hierarchical distillation post-training process. Specifically, we introduce a channel sensitivity metric that utilizes activation intensity and weight norms for fine-grained channel selection. Subsequently, we employ a concatenation-based layer merging method to fuse the most critical channels across adjacent layers, enabling progressive model size reduction. Finally, we propose a hierarchical distillation protocol that leverages the correspondences between the original and pruned model layers established during pruning, thereby enabling efficient knowledge transfer. Experiments on seven benchmarks show that CoMe achieves state-of-the-art performance; when pruning 30% of LLaMA-2-7b’s parameters, the pruned model retains 83% of its original average accuracy. Our code is available at this https URL.
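"激活强度 × 权重范数"式的通道敏感度可按如下方式示意:对一个线性层,用校准数据统计各输出通道的平均激活幅值,再乘以对应权重行的 L2 范数作为重要性分数。两者的具体组合形式为假设,仅用于说明思路。

```python
import torch

@torch.no_grad()
def channel_sensitivity(linear, calib_inputs):
    """linear: nn.Linear;calib_inputs: (N, in_features) 校准样本。
    返回每个输出通道的敏感度分数 = 平均激活幅值 * 权重行 L2 范数。"""
    acts = linear(calib_inputs)              # (N, out_features)
    act_mag = acts.abs().mean(dim=0)         # 各通道的激活强度
    w_norm = linear.weight.norm(dim=1)       # 各输出通道的权重范数
    return act_mag * w_norm

lin = torch.nn.Linear(128, 64)
scores = channel_sensitivity(lin, torch.randn(256, 128))
keep = scores.topk(32).indices               # 选出最重要的 32 个通道用于拼接融合
```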
zh
[CV-62] Latent Diffusion Model without Variational Autoencoder
【速读】:该论文旨在解决基于扩散模型的视觉生成中因依赖变分自编码器(Variational Autoencoder, VAE)所导致的训练效率低、推理速度慢以及跨任务迁移能力差的问题。其核心原因在于VAE隐空间缺乏清晰的语义分离和强判别结构,从而影响了扩散模型的稳定训练与高效学习。解决方案的关键在于提出SVG(Self-supervised Visual Generation)框架,摒弃VAE结构,转而利用冻结的DINO自监督特征构建具有明确语义判别性的潜在空间,并通过轻量级残差分支捕捉细节信息,使扩散模型直接在该语义结构化的潜空间中进行训练,从而实现更高效的训练过程、支持少步采样并提升生成质量,同时保留了底层表示的语义与判别能力。
链接: https://arxiv.org/abs/2510.15301
作者: Minglei Shi,Haolin Wang,Wenzhao Zheng,Ziyang Yuan,Xiaoshi Wu,Xintao Wang,Pengfei Wan,Jie Zhou,Jiwen Lu
机构: Tsinghua University (清华大学); Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent progress in diffusion-based visual generation has largely relied on latent diffusion models with variational autoencoders (VAEs). While effective for high-fidelity synthesis, this VAE+diffusion paradigm suffers from limited training efficiency, slow inference, and poor transferability to broader vision tasks. These issues stem from a key limitation of VAE latent spaces: the lack of clear semantic separation and strong discriminative structure. Our analysis confirms that these properties are crucial not only for perception and understanding tasks, but also for the stable and efficient training of latent diffusion models. Motivated by this insight, we introduce SVG, a novel latent diffusion model without variational autoencoders, which leverages self-supervised representations for visual generation. SVG constructs a feature space with clear semantic discriminability by leveraging frozen DINO features, while a lightweight residual branch captures fine-grained details for high-fidelity reconstruction. Diffusion models are trained directly on this semantically structured latent space to facilitate more efficient learning. As a result, SVG enables accelerated diffusion training, supports few-step sampling, and improves generative quality. Experimental results further show that SVG preserves the semantic and discriminative capabilities of the underlying self-supervised representations, providing a principled pathway toward task-general, high-quality visual representations.
zh
[CV-63] Hyperbolic Structured Classification for Robust Single Positive Multi-label Learning ICDM
【速读】:该论文旨在解决单正例多标签学习(Single Positive Multi-Label Learning, SPMLL)中因每样本仅标注一个正标签而导致的标签关系建模困难问题,尤其在缺乏显式几何定义的情况下难以刻画复杂的标签间关系(如层次结构、共现模式和语义独立性)。其解决方案的关键在于提出首个基于双曲空间(hyperbolic space)的分类框架,将每个标签表示为一个双曲球体(hyperbolic ball),通过球体间的几何交互自然地同时建模多种标签关系类型:包含关系用于表达层次结构、重叠关系用于捕捉共现模式、分离关系用于体现语义独立性。此外,创新性引入温度自适应双曲球分类器与受物理启发的双井正则化机制,引导标签球体向具有实际意义的空间配置演化,从而在四个基准数据集上实现优于现有方法的性能,并展现出更强的可解释性。
链接: https://arxiv.org/abs/2510.15296
作者: Yiming Lin,Shang Wang,Junkai Zhou,Qiufeng Wang,Xiao-Bo Jin,Kaizhu Huang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages, ICDM Workshop
Abstract:Single Positive Multi-Label Learning (SPMLL) addresses the challenging scenario where each training sample is annotated with only one positive label despite potentially belonging to multiple categories, making it difficult to capture complex label relationships and hierarchical structures. Existing methods implicitly model label relationships through distance-based similarity, but lack explicit geometric definitions for different relationship types. To address these limitations, we propose the first hyperbolic classification framework for SPMLL that represents each label as a hyperbolic ball rather than a point or vector, enabling rich inter-label relationship modeling through geometric ball interactions. Our ball-based approach naturally captures multiple relationship types simultaneously: inclusion for hierarchical structures, overlap for co-occurrence patterns, and separation for semantic independence. Further, we introduce two key component innovations: a temperature-adaptive hyperbolic ball classifier and a physics-inspired double-well regularization that guides balls toward meaningful configurations. To validate our approach, extensive experiments on four benchmark datasets (MS-COCO, PASCAL VOC, NUS-WIDE, CUB-200-2011) demonstrate competitive performance with superior interpretability compared to existing methods. Furthermore, statistical analysis reveals strong correlation between learned embeddings and real-world co-occurrence patterns, establishing hyperbolic geometry as a more robust paradigm for structured classification under incomplete supervision.
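The ball-based relationship types can be read off with simple geometry. The toy sketch below uses the Poincaré-ball geodesic distance and containment tests to classify two label balls; the (center, geodesic radius) parametrization and the tests themselves are assumptions made for illustration.

```python
import torch

def poincare_dist(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Geodesic distance between points inside the unit Poincare ball
    sq = ((x - y) ** 2).sum(-1)
    denom = (1 - (x ** 2).sum(-1)) * (1 - (y ** 2).sum(-1))
    return torch.acosh(1 + 2 * sq / denom.clamp_min(eps))

def ball_relation(c1, r1, c2, r2) -> str:
    # Classify two label balls (center, geodesic radius) by their interaction;
    # the parametrization is an illustrative assumption
    d = poincare_dist(c1, c2)
    if d + r2 <= r1:
        return "inclusion"    # hierarchy: label 2 is a sub-concept of label 1
    if d < r1 + r2:
        return "overlap"      # co-occurrence pattern
    return "separation"       # semantic independence
```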
zh
[CV-64] QCFace: Image Quality Control for boosting Face Representation Recognition WACV2026
【速读】:该论文旨在解决深度人脸识别(Face Recognition, FR)中因特征表示质量不足而导致的性能瓶颈问题,特别是现有方法在处理低质量或模糊人脸时,难以有效捕捉可识别性(recognizability),且特征方向与模长之间的梯度相互重叠,导致优化不稳定、超球面规划混乱以及可识别性与身份信息耦合等问题。解决方案的关键在于提出一种基于硬边界策略的质量控制人脸(Quality Control Face, QCFace)方法,通过引入硬边界约束克服梯度重叠问题,并实现可识别性与身份表征的清晰解耦;在此基础上设计了一种新的硬边界损失函数,结合引导因子进行超球面规划,同步优化识别能力与显式可识别性建模,从而显著提升模型在验证和识别任务中的鲁棒性和准确性。
链接: https://arxiv.org/abs/2510.15289
作者: Duc-Phuong Doan-Ngo,Thanh-Dang Diep,Thanh Nguyen-Duc,Thanh-Sach LE,Nam Thoai
机构: Ho Chi Minh City University of Technology (HCMUT), VNU-HCM; Monash University; Monash Health
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages with 11 figures, 14 tables and 71 references. Accepted in Round 1 at WACV 2026, Oral
Abstract:Recognizability, a key perceptual factor in human face processing, strongly affects the performance of face recognition (FR) systems in both verification and identification tasks. Effectively using recognizability to enhance feature representation remains challenging. In deep FR, the loss function plays a crucial role in shaping how features are embedded. However, current methods have two main drawbacks: (i) recognizability is only partially captured through soft margin constraints, resulting in weaker quality representation and lower discrimination, especially for low-quality or ambiguous faces; (ii) mutual overlapping gradients between feature direction and magnitude introduce undesirable interactions during optimization, causing instability and confusion in hypersphere planning, which may result in poor generalization, and entangled representations where recognizability and identity are not cleanly separated. To address these issues, we introduce a hard margin strategy - Quality Control Face (QCFace), which overcomes the mutual overlapping gradient problem and enables the clear decoupling of recognizability from identity representation. Based on this strategy, a novel hard-margin-based loss function employs a guidance factor for hypersphere planning, simultaneously optimizing for recognition ability and explicit recognizability representation. Extensive experiments confirm that QCFace not only provides robust and quantifiable recognizability encoding but also achieves state-of-the-art performance in both verification and identification benchmarks compared to existing recognizability-based losses.
zh
[CV-65] Post-Processing Methods for Improving Accuracy in MRI Inpainting
【速读】:该论文旨在解决自动化磁共振成像(MRI)分析工具在处理大脑病变区域(如肿瘤)时性能下降的问题,因为这些工具通常针对健康脑组织优化,在面对大范围病灶时容易失效。其关键解决方案是提出一种结合模型集成(model ensembling)与高效后处理策略(如中值滤波、直方图匹配和像素平均)的综合方法,并引入一个轻量级U-Net增强模块以进一步提升解剖结构的合理性。该方案显著改善了修复区域的解剖合理性和视觉保真度,相较于单一基线模型实现了更高的准确性和鲁棒性,从而支持更广泛临床部署和可持续研究。
链接: https://arxiv.org/abs/2510.15282
作者: Nishad Kulkarni,Krithika Iyer,Austin Tapp,Abhijeet Parida,Daniel Capellán-Martín,Zhifan Jiang,María J. Ledesma-Carbayo,Syed Muhammad Anwar,Marius George Linguraru
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Magnetic Resonance Imaging (MRI) is the primary imaging modality used in the diagnosis, assessment, and treatment planning for brain pathologies. However, most automated MRI analysis tools, such as segmentation and registration pipelines, are optimized for healthy anatomies and often fail when confronted with large lesions such as tumors. To overcome this, image inpainting techniques aim to locally synthesize healthy brain tissues in tumor regions, enabling the reliable application of general-purpose tools. In this work, we systematically evaluate state-of-the-art inpainting models and observe a saturation in their standalone performance. In response, we introduce a methodology combining model ensembling with efficient post-processing strategies such as median filtering, histogram matching, and pixel averaging. Further anatomical refinement is achieved via a lightweight U-Net enhancement stage. Comprehensive evaluation demonstrates that our proposed pipeline improves the anatomical plausibility and visual fidelity of inpainted regions, yielding higher accuracy and more robust outcomes than individual baseline models. By combining established models with targeted post-processing, we achieve improved and more accessible inpainting outcomes, supporting broader clinical deployment and sustainable, resource-conscious research. Our 2025 BraTS inpainting docker is available at this https URL.
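A minimal sketch of the ensemble-plus-post-processing idea follows: average several models' inpainted volumes, median-filter the result, then histogram-match it to the surrounding healthy tissue. The 3-voxel kernel and the order of operations are assumptions; the library calls are standard scipy/scikit-image.

```python
import numpy as np
from scipy.ndimage import median_filter
from skimage.exposure import match_histograms

def postprocess_inpaints(inpaints: list, reference: np.ndarray) -> np.ndarray:
    # inpaints: list of [H, W, D] volumes from different models
    # reference: healthy-tissue volume used as the intensity target
    fused = np.mean(np.stack(inpaints, axis=0), axis=0)  # pixel averaging across the ensemble
    fused = median_filter(fused, size=3)                 # suppress speckle artifacts (size assumed)
    return match_histograms(fused, reference)            # align intensity statistics
```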
zh
[CV-66] CuSfM: CUDA-Accelerated Structure-from-Motion
【速读】:该论文旨在解决**相机位姿估计(camera pose estimation)**在自主导航、机器人感知和虚拟仿真系统中对高精度与高效性并重的需求问题。其核心挑战在于如何在保证重建精度的同时提升计算效率,尤其是在特征提取等计算密集型步骤上。解决方案的关键在于提出了一种基于CUDA加速的离线SfM(Structure-from-Motion)系统cuSfM,利用GPU并行化技术充分发挥高性能特征提取器(如SIFT或SuperPoint)的潜力,从而生成全面且非冗余的数据关联,实现精确的相机位姿估计与全局一致的地图构建。该方法通过充分利用离线处理中的计算资源,在准确性和速度上均显著优于广泛使用的COLMAP系统。
链接: https://arxiv.org/abs/2510.15271
作者: Jingrui Yu,Jun Liu,Kefei Ren,Joydeep Biswas,Rurui Ye,Keqiang Wu,Chirag Majithia,Di Zeng
机构: NVIDIA(英伟达)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Efficient and accurate camera pose estimation forms the foundational requirement for dense reconstruction in autonomous navigation, robotic perception, and virtual simulation systems. This paper addresses the challenge via cuSfM, a CUDA-accelerated offline Structure-from-Motion system that leverages GPU parallelization to efficiently employ computationally intensive yet highly accurate feature extractors, generating comprehensive and non-redundant data associations for precise camera pose estimation and globally consistent mapping. The system supports pose optimization, mapping, prior-map localization, and extrinsic refinement. It is designed for offline processing, where computational resources can be fully utilized to maximize accuracy. Experimental results demonstrate that cuSfM achieves significantly improved accuracy and processing speed compared to the widely used COLMAP method across various testing scenarios, while maintaining the high precision and global consistency essential for offline SfM applications. The system is released as an open-source Python wrapper implementation, PyCuSfM, available at this https URL, to facilitate research and applications in computer vision and robotics.
zh
[CV-67] DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion NEURIPS
【速读】:该论文旨在解决现有驾驶场景生成方法中存在的关键局限性,即在长时间动态3D场景合成中面临计算成本过高、缺乏三维表示或仅限于静态单场景重建的问题。解决方案的核心在于提出一个统一的框架DriveGen3D,其关键创新是集成高效的长时视频生成与大规模动态场景重建能力,并通过多模态条件控制实现高可控性:具体包括两个专用模块——FastDrive-DiT(一种用于高分辨率、时序一致视频合成的高效视频扩散Transformer,支持文本和鸟瞰图(Bird’s-Eye-View, BEV)布局引导)以及FastRecon3D(一种前馈式重建模块,可快速构建时空一致的3D Gaussian表示)。该方案实现了高达424×800分辨率、12 FPS的实时驾驶视频及对应动态3D场景生成,在新视角合成上达到SSIM 0.811和PSNR 22.84,同时保持参数效率。
链接: https://arxiv.org/abs/2510.15264
作者: Weijie Wang,Jiagang Zhu,Zeyu Zhang,Xiaofeng Wang,Zheng Zhu,Guosheng Zhao,Chaojun Ni,Haoxiao Wang,Guan Huang,Xinze Chen,Yukun Zhou,Wenkang Qin,Duochao Shi,Haoyun Li,Guanghong Jia,Jiwen Lu
机构: GigaAI; Zhejiang University (浙江大学); Tsinghua University (清华大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS Workshop on Next Practices in Video Generation and Evaluation (Short Paper Track)
Abstract:We present DriveGen3D, a novel framework for generating high-quality and highly controllable dynamic 3D driving scenes that addresses critical limitations in existing methodologies. Current approaches to driving scene synthesis either suffer from prohibitive computational demands for extended temporal generation, focus exclusively on prolonged video synthesis without 3D representation, or restrict themselves to static single-scene reconstruction. Our work bridges this methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction through multimodal conditional control. DriveGen3D introduces a unified pipeline consisting of two specialized components: FastDrive-DiT, an efficient video diffusion transformer for high-resolution, temporally coherent video synthesis under text and Bird’s-Eye-View (BEV) layout guidance; and FastRecon3D, a feed-forward reconstruction module that rapidly builds 3D Gaussian representations across time, ensuring spatial-temporal consistency. Together, these components enable real-time generation of extended driving videos (up to 424×800 at 12 FPS) and corresponding dynamic 3D scenes, achieving SSIM of 0.811 and PSNR of 22.84 on novel view synthesis, all while maintaining parameter efficiency.
zh
[CV-68] he Face of Persuasion: Analyzing Bias and Generating Culture-Aware Ads
【速读】:该论文旨在解决生成式 AI 在视觉广告定制中可能存在的种族与性别偏见问题,以及如何通过特定策略提升广告对不同人群的说服力。其核心问题是:在广告内容相同仅人物性别或种族不同的情况下,模型评估的说服力是否存在显著差异;同时探索基于地理位置(如国家)定向投放广告的技术可行性。解决方案的关键在于利用文本到图像生成模型构建具有可控属性的广告样本,并通过对比分析不同群体在广告中的表现效果,量化潜在的偏见水平,进而提出一种基于地理标签的广告目标优化方法,以实现更公平且高效的个性化广告投放。
链接: https://arxiv.org/abs/2510.15240
作者: Aysan Aghazadeh,Adriana Kovashka
机构: University of Pittsburgh (匹兹堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-image models are appealing for customizing visual advertisements and targeting specific populations. We investigate this potential by examining the demographic bias within ads for different ad topics, and the disparate level of persuasiveness (judged by models) of ads that are identical except for gender/race of the people portrayed. We also experiment with a technique to target ads for specific countries. The code is available at this https URL
zh
[CV-69] CARDIUM: Congenital Anomaly Recognition with Diagnostic Images and Unified Medical records ICCV2025
【速读】:该论文旨在解决先天性心脏病(Congenital Heart Diseases, CHDs)产前诊断中因数据稀少导致的高质量标注数据不足问题,以及多源信息(如影像与临床记录)未被有效整合所限制AI模型性能提升的挑战。其关键解决方案是构建了首个公开的多模态数据集CARDIUM,该数据集融合胎儿超声和胎儿超声心动图图像与母体临床记录,并提出一种基于交叉注意力机制(cross-attention mechanism)的鲁棒多模态Transformer架构,实现图像与表格数据特征的有效融合,在CHD检测任务上分别较单一模态方法提升11%和50%的性能,F1分数达到79.8 ± 4.8%。
链接: https://arxiv.org/abs/2510.15208
作者: Daniela Vega,Hannah V. Ceballos,Javier S. Vera,Santiago Rodriguez,Alejandra Perez,Angela Castillo,Maria Escobar,Dario Londoño,Luis A. Sarmiento,Camila I. Castro,Nadiezhda Rodriguez,Juan C. Briceño,Pablo Arbeláez
机构: Universidad de los Andes, Colombia (安第斯大学, 哥伦比亚); Fundación Santa Fe de Bogotá, Colombia (圣菲波哥大基金会, 哥伦比亚)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVAMD Workshop, ICCV 2025
Abstract:Prenatal diagnosis of Congenital Heart Diseases (CHDs) holds great potential for Artificial Intelligence (AI)-driven solutions. However, collecting high-quality diagnostic data remains difficult due to the rarity of these conditions, resulting in imbalanced and low-quality datasets that hinder model performance. Moreover, no public efforts have been made to integrate multiple sources of information, such as imaging and clinical data, further limiting the ability of AI models to support and enhance clinical decision-making. To overcome these challenges, we introduce the Congenital Anomaly Recognition with Diagnostic Images and Unified Medical records (CARDIUM) dataset, the first publicly available multimodal dataset consolidating fetal ultrasound and echocardiographic images along with maternal clinical records for prenatal CHD detection. Furthermore, we propose a robust multimodal transformer architecture that incorporates a cross-attention mechanism to fuse feature representations from image and tabular data, improving CHD detection by 11% and 50% over image and tabular single-modality approaches, respectively, and achieving an F1 score of 79.8 ± 4.8% in the CARDIUM dataset. We will publicly release our dataset and code to encourage further research on this unexplored field. Our dataset and code are available at this https URL, and at the project website this https URL
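To illustrate the cross-attention fusion of imaging and tabular features, here is a hedged PyTorch sketch. The embedding dimension, head count, per-scalar tabular embedding, and mean-pooled classification head are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.tab_embed = nn.Linear(1, dim)  # embed each clinical scalar as a token (assumed encoding)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)       # CHD logit

    def forward(self, img_tokens: torch.Tensor, tab: torch.Tensor) -> torch.Tensor:
        # img_tokens: [B, N, dim] from an image encoder; tab: [B, n_features] clinical record
        tab_tokens = self.tab_embed(tab.unsqueeze(-1))             # [B, n_features, dim]
        fused, _ = self.attn(img_tokens, tab_tokens, tab_tokens)   # image queries attend to records
        return self.head(fused.mean(dim=1))                        # pooled prediction
```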
zh
[CV-70] Dissecting Mahalanobis: How Feature Geometry and Normalization Shape OOD Detection
【速读】:该论文旨在解决深度学习模型在分布外(Out-of-distribution, OOD)检测中的可靠性问题,尤其关注基于马氏距离(Mahalanobis distance)的方法在不同数据表示几何结构和归一化策略下的性能差异。现有方法虽广泛应用,但其性能受表示空间几何特性及归一化方式的影响尚未被充分理解,限制了其下游应用效果。解决方案的关键在于:首先,定义理想的数据表示几何结构,并证明谱特征和内在维度指标可准确预测模型的OOD检测性能;其次,提出径向缩放的ℓ₂归一化(radially scaled ℓ₂ normalization),通过引入一个可调参数直接控制特征空间的径向几何结构,从而系统性地收缩或扩展表示,显著提升OOD检测性能。这一方法有效连接了表示几何、归一化与OOD性能之间的关系,为设计更可靠和高效的深度学习模型提供了新思路。
链接: https://arxiv.org/abs/2510.15202
作者: Denis Janiak,Jakub Binkowski,Tomasz Kajdanowicz
机构: Wroclaw University of Science and Technology (弗罗茨瓦夫理工大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Out-of-distribution (OOD) detection is critical for the reliable deployment of deep learning models. While Mahalanobis distance methods are widely used, the impact of representation geometry and normalization on their performance is not fully understood, which may limit their downstream application. To address this gap, we conducted a comprehensive empirical study across diverse image foundation models, datasets, and distance normalization schemes. First, our analysis shows that Mahalanobis-based methods aren’t universally reliable. Second, we define the ideal geometry for data representations and demonstrate that spectral and intrinsic-dimensionality metrics can accurately predict a model’s OOD performance. Finally, we analyze how normalization impacts OOD performance. Building upon these studies, we propose radially scaled ℓ₂ normalization, a method that generalizes the standard ℓ₂ normalization recently applied to Mahalanobis-based OOD detection. Our approach introduces a tunable parameter to directly control the radial geometry of the feature space, systematically contracting or expanding representations to significantly improve OOD detection performance. By bridging the gap between representation geometry, normalization, and OOD performance, our findings offer new insights into the design of more effective and reliable deep learning models.
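Below is a small numpy sketch of Mahalanobis OOD scoring combined with a radially scaled ℓ₂ normalization. The specific scaling rule, mapping a feature to radius ‖z‖^α so that α tunes radial contraction or expansion (with α = 0 recovering plain ℓ₂ normalization), is an assumption about how the abstract's tunable parameter might act.

```python
import numpy as np

def radial_scale(z: np.ndarray, alpha: float) -> np.ndarray:
    # Assumed scaling rule: rescale each feature's radius to ||z||**alpha;
    # alpha = 0 gives plain l2 normalization (all features on the unit sphere)
    r = np.linalg.norm(z, axis=-1, keepdims=True)
    return z / np.clip(r, 1e-12, None) * r ** alpha

def mahalanobis_score(z: np.ndarray, class_means: np.ndarray, cov_inv: np.ndarray) -> np.ndarray:
    # Negative minimum Mahalanobis distance to any class mean (higher = more in-distribution)
    diff = z[:, None, :] - class_means[None, :, :]           # [N, C, D]
    dist = np.einsum("ncd,de,nce->nc", diff, cov_inv, diff)  # per-class squared distance
    return -dist.min(axis=1)
```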
zh
[CV-71] Salient Concept-Aware Generative Data Augmentation NEURIPS2025
【速读】:该论文旨在解决生成式数据增强方法在同时依赖图像和文本提示时难以平衡保真度(fidelity)与多样性(diversity)的问题,其核心挑战在于合成过程中图像表示容易与非关键视觉属性(如环境背景)纠缠,从而干扰文本提示对目标特征的修改意图。解决方案的关键在于提出一种个性化图像生成框架,利用显著概念感知的图像嵌入模型(salient concept-aware image embedding model),在合成阶段削弱无关视觉细节的影响,从而实现图像与文本输入之间的直观对齐;该方法通过保留类别判别性特征并引入可控变化,有效提升训练数据集的多样性,进而增强下游模型的鲁棒性。
链接: https://arxiv.org/abs/2510.15194
作者: Tianchen Zhao,Xuanbai Chen,Zhihua Li,Jun Fang,Dongsheng An,Xiang Xu,Zhuowen Tu,Yifan Xing
机构: AWS DS3
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 4 figures, NeurIPS2025
Abstract:Recent generative data augmentation methods conditioned on both image and text prompts struggle to balance between fidelity and diversity, as it is challenging to preserve essential image details while aligning with varied text prompts. This challenge arises because representations in the synthesis process often become entangled with non-essential input image attributes such as environmental contexts, creating conflicts with text prompts intended to modify these elements. To address this, we propose a personalized image generation framework that uses a salient concept-aware image embedding model to reduce the influence of irrelevant visual details during the synthesis process, thereby maintaining intuitive alignment between image and text inputs. By generating images that better preserve class-discriminative features with additional controlled variations, our framework effectively enhances the diversity of training datasets and thereby improves the robustness of downstream models. Our approach demonstrates superior performance across eight fine-grained vision datasets, outperforming state-of-the-art augmentation methods with average classification accuracy improvements of 0.73% and 6.5% under conventional and long-tail settings, respectively.
zh
[CV-72] Hyperparameter Optimization and Reproducibility in Deep Learning Model Training
【速读】:该论文旨在解决基础模型(foundation model)在组织病理学(histopathology)训练中的可复现性(reproducibility)问题,其核心挑战源于软件随机性、硬件非确定性以及超参数报告不一致等因素。解决方案的关键在于系统性地评估不同超参数设置和数据增强策略对多个下游病理数据集(PatchCamelyon、LC25000-Lung 和 LC25000-Colon)性能的影响,从而识别出具有稳定性和一致性的实验配置:如 RandomResizedCrop 的值设定在 0.7–0.8 范围内表现最优,分布式训练配合无局部损失(local loss)机制可提升稳定性,且学习率低于 5.0e-5 会显著降低性能;此外,LC25000(结肠)数据集展现出最高的可复现性基准价值。研究强调,可复现性不仅依赖透明的文档记录,更取决于精心设计的实验参数组合,并提出了实用规则以指导未来数字病理学中可复现的基础模型开发。
链接: https://arxiv.org/abs/2510.15164
作者: Usman Afzaal,Ziyu Su,Usama Sajjad,Hao Lu,Mostafa Rezapour,Metin Nafi Gurcan,Muhammad Khalid Khan Niazi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reproducibility remains a critical challenge in foundation model training for histopathology, often hindered by software randomness, hardware non-determinism, and inconsistent hyperparameter reporting. To investigate these issues, we trained a CLIP model on the QUILT-1M dataset and systematically evaluated the impact of different hyperparameter settings and augmentation strategies across three downstream histopathology datasets (PatchCamelyon, LC25000-Lung, and LC25000-Colon). Despite variability across runs, we identified clear trends: RandomResizedCrop values of 0.7-0.8 outperformed more aggressive (0.6) or conservative (0.9) settings, distributed training without local loss improved stability, and learning rates below 5.0e-5 consistently degraded performance across all datasets. The LC25000 (Colon) dataset consistently provided the most reproducible benchmark. These findings highlight that reproducibility in computational pathology depends not only on transparent documentation but also on carefully chosen experimental configurations, and we provide practical rules to guide future efforts in developing reproducible foundation models for digital pathology.
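A small sketch of settings consistent with the reported trends is shown below. Reading the 0.7-0.8 "RandomResizedCrop value" as the crop-scale lower bound is an assumption, and the seed and exact learning rate are illustrative.

```python
import torch
from torchvision import transforms

torch.manual_seed(0)  # pin software randomness for reproducibility

train_tf = transforms.Compose([
    # lower bounds of 0.7-0.8 worked best per the study (reading as scale is assumed)
    transforms.RandomResizedCrop(224, scale=(0.75, 1.0)),
    transforms.ToTensor(),
])

learning_rate = 1e-4  # rates below 5.0e-5 consistently degraded performance
```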
zh
[CV-73] XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
【速读】:该论文旨在解决当前多模态大语言模型(Omni-modal Large Language Models, OLLMs)在跨模态推理中是否存在模态不变性(modality-invariant reasoning)的问题,即模型是否能够对不同模态输入(如文本、视觉、音频)进行一致性的语义理解和推理,而非表现出模态特异性偏差。现有评估基准主要关注通用跨模态问答能力,但缺乏对模态一致性、模态差异性和方向不平衡性的细粒度诊断能力。为此,作者提出XModBench——一个大规模三模态基准,系统覆盖所有六种模态组合的问答对,涵盖五类任务,可精准识别模型在跨模态推理中的不一致性表现。其关键创新在于设计了结构化且全面的评估框架,首次实现对模态不变性、模态差异和方向不平衡的量化分析,从而为诊断与改进OLLMs的跨模态能力提供基础工具。
链接: https://arxiv.org/abs/2510.15148
作者: Xingrui Wang,Jiang Liu,Chao Huang,Xiaodong Yu,Ze Wang,Ximeng Sun,Jialian Wu,Alan Yuille,Emad Barsoum,Zicheng Liu
机构: Advanced Micro Devices (高级微设备公司); Johns Hopkins University (约翰霍普金斯大学); University of Rochester (罗切斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Omni-modal large language models (OLLMs) aim to unify audio, vision, and text understanding within a single framework. While existing benchmarks primarily evaluate general cross-modal question-answering ability, it remains unclear whether OLLMs achieve modality-invariant reasoning or exhibit modality-specific biases. We introduce XModBench, a large-scale tri-modal benchmark explicitly designed to measure cross-modal consistency. XModBench comprises 60,828 multiple-choice questions spanning five task families and systematically covers all six modality compositions in question-answer pairs, enabling fine-grained diagnosis of an OLLM’s modality-invariant reasoning, modality disparity, and directional imbalance. Experiments show that even the strongest model, Gemini 2.5 Pro, (i) struggles with spatial and temporal reasoning, achieving less than 60% accuracy, (ii) reveals persistent modality disparities, with performance dropping substantially when the same semantic content is conveyed through audio rather than text, and (iii) shows systematic directional imbalance, exhibiting lower consistency when vision serves as context compared to text. These findings indicate that current OLLMs remain far from truly modality-invariant reasoning and position XModBench as a fundamental diagnostic tool for evaluating and improving cross-modal competence. All data and evaluation tools will be available at this https URL.
zh
[CV-74] Fourier Transform Multiple Instance Learning for Whole Slide Image Classification
【速读】:该论文旨在解决全切片图像(Whole Slide Image, WSI)分类中基于多实例学习(Multiple Instance Learning, MIL)方法因局部patch特征难以捕捉全局依赖关系而导致诊断预测鲁棒性不足的问题。其核心解决方案是提出傅里叶变换多实例学习框架(Fourier Transform Multiple Instance Learning, FFT-MIL),通过引入频域分支来提供紧凑的全局上下文信息:利用快速傅里叶变换(Fast Fourier Transform)提取低频区域作物,并设计模块化的FFT-Block(含卷积层与最小-最大归一化)以缓解频域数据高方差问题;随后将学习到的全局频率特征与空间patch特征通过轻量级融合策略结合,实现对多种MIL架构的兼容性增强。实验表明,该方法在多个公开数据集上显著提升宏观F1分数和AUC值,验证了频域建模作为捕获WSI全局结构的有效机制。
链接: https://arxiv.org/abs/2510.15138
作者: Anthony Bilic,Guangyu Sun,Ming Li,Md Sanzid Bin Hossain,Yu Tian,Wei Zhang,Laura Brattain,Dexter Hadley,Chen Chen
机构: Institute of Artificial Intelligence (IAI); University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Whole Slide Image (WSI) classification relies on Multiple Instance Learning (MIL) with spatial patch features, yet existing methods struggle to capture global dependencies due to the immense size of WSIs and the local nature of patch embeddings. This limitation hinders the modeling of coarse structures essential for robust diagnostic prediction. We propose Fourier Transform Multiple Instance Learning (FFT-MIL), a framework that augments MIL with a frequency-domain branch to provide compact global context. Low-frequency crops are extracted from WSIs via the Fast Fourier Transform and processed through a modular FFT-Block composed of convolutional layers and Min-Max normalization to mitigate the high variance of frequency data. The learned global frequency feature is fused with spatial patch features through lightweight integration strategies, enabling compatibility with diverse MIL architectures. FFT-MIL was evaluated across six state-of-the-art MIL methods on three public datasets (BRACS, LUAD, and IMP). Integration of the FFT-Block improved macro F1 scores by an average of 3.51% and AUC by 1.51%, demonstrating consistent gains across architectures and datasets. These results establish frequency-domain learning as an effective and efficient mechanism for capturing global dependencies in WSI classification, complementing spatial features and advancing the scalability and accuracy of MIL-based computational pathology.
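The frequency branch can be illustrated in a few lines of torch: take the 2D FFT, keep a centered low-frequency crop, and min-max normalize its (log) magnitude. The crop size and the log compression are assumptions made for the sketch.

```python
import torch

def low_freq_crop(img: torch.Tensor, k: int = 64) -> torch.Tensor:
    # img: [C, H, W] -> centered [C, 2k, 2k] low-frequency magnitude crop (k assumed)
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    h, w = img.shape[-2:]
    crop = spec[..., h // 2 - k:h // 2 + k, w // 2 - k:w // 2 + k].abs()
    crop = torch.log1p(crop)                   # compress the large dynamic range (assumed)
    mn, mx = crop.amin(), crop.amax()
    return (crop - mn) / (mx - mn + 1e-8)      # min-max normalization per the abstract
```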
zh
[CV-75] Deep generative priors for 3D brain analysis
【速读】:该论文旨在解决如何将生成式 AI(Generative AI)模型与医学影像领域的先验知识相结合,以提升脑部磁共振成像(MRI)中逆问题求解的性能。传统贝叶斯逆问题方法依赖于经典数学先验,难以刻画脑部解剖结构的复杂性;而现有数据驱动模型又常需大量配对训练数据,限制了其在临床场景中的应用。解决方案的关键在于首次将基于分数的扩散模型(score-based diffusion model)作为通用先验引入医学影像逆问题框架,利用在多样化脑部 MRI 数据上充分训练的扩散先验,结合灵活的前向模型(如超分辨率、偏置场校正、图像修复等任务),实现无需配对训练数据即可生成高质量、高解剖保真度的结果。此方法显著提升了脑 MRI 分析的鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2510.15119
作者: Ana Lawry Aguila,Dina Zemlyanker,You Cheng,Sudeshna Das,Daniel C. Alexander,Oula Puonti,Annabel Sorby-Adams,W. Taylor Kimberly,Juan Eugenio Iglesias
机构: Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital and Harvard Medical School (麻省总医院和哈佛医学院), Boston, USA; Department of Neurology, Massachusetts General Hospital (麻省总医院), Boston, USA; Hawkes Institute, University College London (伦敦大学学院), London, UK; Danish Research Centre for Magnetic Resonance, Department of Radiology and Nuclear Medicine, Copenhagen University Hospital – Amager and Hvidovre (哥本哈根大学医院-阿玛格和赫维德医院), Copenhagen, Denmark; Computer Science & Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (麻省理工学院), Cambridge, USA
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Diffusion models have recently emerged as powerful generative models in medical imaging. However, it remains a major challenge to combine these data-driven models with domain knowledge to guide brain imaging problems. In neuroimaging, Bayesian inverse problems have long provided a successful framework for inference tasks, where incorporating domain knowledge of the imaging process enables robust performance without requiring extensive training data. However, the anatomical modeling component of these approaches typically relies on classical mathematical priors that often fail to capture the complex structure of brain anatomy. In this work, we present the first general-purpose application of diffusion models as priors for solving a wide range of medical imaging inverse problems. Our approach leverages a score-based diffusion prior trained extensively on diverse brain MRI data, paired with flexible forward models that capture common image processing tasks such as super-resolution, bias field correction, inpainting, and combinations thereof. We further demonstrate how our framework can refine outputs from existing deep learning methods to improve anatomical fidelity. Experiments on heterogeneous clinical and research MRI data show that our method achieves state-of-the-art performance producing consistent, high-quality solutions without requiring paired training datasets. These results highlight the potential of diffusion priors as versatile tools for brain MRI analysis.
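As one way to see how a diffusion prior pairs with a flexible forward model, the sketch below computes a DPS-style data-consistency gradient from a Tweedie denoised estimate; a sampler would subtract a scaled version of this gradient at each ancestral step. The score-model interface and noise parametrization are assumptions, not the paper's exact algorithm.

```python
import torch

def data_consistency_grad(x_t, t, score_model, forward_op, y, sigma_t):
    # x_t: current noisy sample; forward_op: imaging model A(.); y: observation
    # score_model(x_t, t) is assumed to return the score of the noisy marginal
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = x_t + sigma_t ** 2 * score_model(x_t, t)  # Tweedie estimate of the clean image
    residual = ((forward_op(x0_hat) - y) ** 2).sum()   # penalize disagreement with y
    return torch.autograd.grad(residual, x_t)[0]       # subtract (scaled) in the sampler update
```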
zh
[CV-76] TGT: Text-Grounded Trajectories for Locally Controlled Video Generation
【速读】:该论文旨在解决当前文本到视频生成(text-to-video generation)方法在控制场景中主体组成方面的局限性,尤其是在多物体复杂场景下难以实现精确的运动与外观控制的问题。现有方法虽引入边界框或分割掩码等局部文本控制信号,但在对象数量增加时精度下降,且无法明确建立轨迹与视觉实体之间的对应关系。解决方案的关键在于提出Text-Grounded Trajectories (TGT) 框架,其核心创新包括:1)设计Location-Aware Cross-Attention (LACA) 机制以融合轨迹与局部文本描述;2)采用双条件控制生成(dual-CFG)策略分别调节局部和全局文本引导;3)构建包含两百万高质量视频片段的数据处理流水线,用于训练模型学习轨迹与文本的对齐关系。这一方案使用户能够通过点轨迹作为直观运动操控手柄,结合文本描述精准控制视频中每个对象的运动和外观。
链接: https://arxiv.org/abs/2510.15104
作者: Guofeng Zhang,Angtian Wang,Jacob Zhiyuan Fang,Liming Jiang,Haotian Yang,Bo Liu,Yiding Yang,Guang Chen,Longyin Wen,Alan Yuille,Chongyang Ma
机构: Johns Hopkins University (约翰霍普金斯大学); Bytedance, Intelligent Creation (字节跳动,智能创作)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-video generation has advanced rapidly in visual fidelity, whereas standard methods still have limited ability to control the subject composition of generated scenes. Prior work shows that adding localized text control signals, such as bounding boxes or segmentation masks, can help. However, these methods struggle in complex scenarios and degrade in multi-object settings, offering limited precision and lacking a clear correspondence between individual trajectories and visual entities as the number of controllable objects increases. We introduce Text-Grounded Trajectories (TGT), a framework that conditions video generation on trajectories paired with localized text descriptions. We propose Location-Aware Cross-Attention (LACA) to integrate these signals and adopt a dual-CFG scheme to separately modulate local and global text guidance. In addition, we develop a data processing pipeline that produces trajectories with localized descriptions of tracked entities, and we annotate two million high quality video clips to train TGT. Together, these components enable TGT to use point trajectories as intuitive motion handles, pairing each trajectory with text to control both appearance and motion. Extensive experiments show that TGT achieves higher visual quality, more accurate text alignment, and improved motion controllability compared with prior approaches. Website: this https URL.
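The dual-CFG scheme can be sketched as a composed classifier-free-guidance update with separate scales for global and localized text. How TGT combines the two terms exactly is not stated in the abstract, so the additive form and scale values below are assumptions.

```python
def dual_cfg(eps_uncond, eps_global, eps_local_and_global, s_global=5.0, s_local=3.0):
    # Each argument is the model's noise prediction under a different conditioning set;
    # this additive composition is one common CFG variant, not necessarily TGT's rule
    return (eps_uncond
            + s_global * (eps_global - eps_uncond)            # global text guidance
            + s_local * (eps_local_and_global - eps_global))  # extra localized-text push
```

In CFG-style guidance generally, raising the local scale sharpens adherence to per-trajectory descriptions at some cost in global coherence, which is why the two scales are modulated separately.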
zh
[CV-77] SaLon3R: Structure-aware Long-term Generalizable 3D Reconstruction from Unposed Images
【速读】:该论文旨在解决3D高斯泼溅(3D Gaussian Splatting, 3DGS)在长时间视频序列中因逐像素预测和全局融合导致的冗余性与几何不一致性问题。其核心解决方案是提出SaLon3R框架,关键在于引入紧凑的锚点原语(anchor primitives),通过可微的显著性感知高斯量化机制压缩冗余高斯分布,并结合3D点变换器(3D Point Transformer)学习空间结构先验以优化锚点属性与显著性,从而实现区域自适应的高斯解码,提升重建几何保真度与长期一致性。该方法无需已知相机参数或测试时优化,在单次前向传播中即可有效消除伪影并显著减少冗余(50%–90%),实现了高效、鲁棒且泛化的在线长时3DGS重建。
链接: https://arxiv.org/abs/2510.15072
作者: Jiaxin Guo,Tongfan Guan,Wenzhen Dong,Wenzhao Zheng,Wenting Wang,Yue Wang,Yeung Yam,Yun-Hui Liu
机构: The Chinese University of Hong Kong (香港中文大学); Hong Kong Center for Logistics Robotics (香港物流机器人中心); University of California, Berkeley (加州大学伯克利分校); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have enabled generalizable, on-the-fly reconstruction of sequential input views. However, existing methods often predict per-pixel Gaussians and combine Gaussians from all views as the scene representation, leading to substantial redundancies and geometric inconsistencies in long-duration video sequences. To address this, we propose SaLon3R, a novel framework for Structure-aware, Long-term 3DGS Reconstruction. To the best of our knowledge, SaLon3R is the first online generalizable GS method capable of reconstructing over 50 views in over 10 FPS, with 50% to 90% redundancy removal. Our method introduces compact anchor primitives to eliminate redundancy through differentiable saliency-aware Gaussian quantization, coupled with a 3D Point Transformer that refines anchor attributes and saliency to resolve cross-frame geometric and photometric inconsistencies. Specifically, we first leverage a 3D reconstruction backbone to predict dense per-pixel Gaussians and a saliency map encoding regional geometric complexity. Redundant Gaussians are compressed into compact anchors by prioritizing high-complexity regions. The 3D Point Transformer then learns spatial structural priors in 3D space from training data to refine anchor attributes and saliency, enabling regionally adaptive Gaussian decoding for geometric fidelity. Without known camera parameters or test-time optimization, our approach effectively resolves artifacts and prunes the redundant 3DGS in a single feed-forward pass. Experiments on multiple datasets demonstrate our state-of-the-art performance on both novel view synthesis and depth estimation, demonstrating superior efficiency, robustness, and generalization ability for long-term generalizable 3D reconstruction. Project Page: this https URL.
zh
[CV-78] A solution to generalized learning from small training sets found in everyday infant experiences
【速读】:该论文试图解决的问题是:婴儿如何在有限的视觉经验下实现对常见名词所指物体类别(basic level object categories)的有效识别与泛化。传统观点认为这些类别可能是先天给定的,但其形成机制尚不明确。论文提出,解决方案的关键在于婴儿日常生活中视觉输入的“块状相似性结构”(lumpy similarity structure)——即重复接触单个物体实例时,其视觉输入呈现出高相似性图像簇与稀少且多变图像交替分布的统计特性。通过分析14名7至11个月婴儿的视角图像数据,研究发现这种结构存在于早期习得的八类物体中;进一步计算实验表明,模拟该结构可显著提升小样本场景下机器学习模型的泛化能力。因此,婴儿经验中的自然“块状性”不仅解释了早期类别学习的高效性,也为跨任务、跨学习者的高效学习提供了普适性原则。
链接: https://arxiv.org/abs/2510.15060
作者: Frangil Ramirez,Elizabeth Clerkin,David J. Crandall,Linda B. Smith
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 10 figures, 1 table
Abstract:Young children readily recognize and generalize visual objects labeled by common nouns, suggesting that these basic level object categories may be given. Yet if they are, how they arise remains unclear. We propose that the answer lies in the statistics of infants’ daily-life visual experiences. Whereas large and diverse datasets typically support robust learning and generalization in human and machine learning, infants achieve this generalization from limited experiences. We suggest that the resolution of this apparent contradiction lies in the visual diversity of daily life: repeated experiences with single object instances. Analyzing egocentric images from 14 infants (aged 7 to 11 months), we show that their everyday visual input exhibits a lumpy similarity structure, with clusters of highly similar images interspersed with rarer, more variable ones, across eight early-learned categories. Computational experiments show that mimicking this structure in machines improves generalization from small datasets in machine learning. The natural lumpiness of infant experience may thus support early category learning and generalization and, more broadly, offer principles for efficient learning across a variety of problems and kinds of learners.
zh
[CV-79] Directional Reasoning Injection for Fine-Tuning MLLM s
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在推理能力上显著落后于纯文本大语言模型(Text-only Large Language Models, LLMs)的问题。现有方法如监督微调(Supervised Fine-Tuning, SFT)或强化学习(Reinforcement Learning, RL)虽有效但资源消耗巨大,而简单的模型合并(Model Merging)策略效果不稳定,部分模型家族甚至出现性能下降。为此,作者提出方向性推理注入微调(Directional Reasoning Injection for Fine-Tuning, DRIFT),其核心在于:预先计算推理增强模型与多模态模型之间的参数空间差异作为推理先验(reasoning prior),并在后续多模态微调过程中利用该先验偏置梯度更新方向,从而在不破坏多模态对齐的前提下高效迁移推理知识。该方法保持标准SFT流程的简洁性,同时显著提升推理性能,且训练成本远低于传统方法。
链接: https://arxiv.org/abs/2510.15050
作者: Chao Huang,Zeliang Zhang,Jiang Liu,Ximeng Sun,Jialian Wu,Xiaodong Yu,Ze Wang,Chenliang Xu,Emad Barsoum,Zicheng Liu
机构: University of Rochester (罗切斯特大学); Advanced Micro Devices, Inc. (超威半导体公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Multimodal large language models (MLLMs) are rapidly advancing, yet their reasoning ability often lags behind that of strong text-only counterparts. Existing methods to bridge this gap rely on supervised fine-tuning over large-scale multimodal reasoning data or reinforcement learning, both of which are resource-intensive. A promising alternative is model merging, which interpolates parameters between reasoning-enhanced LLMs and multimodal variants. However, our analysis shows that naive merging is not always a “free lunch”: its effectiveness varies drastically across model families, with some (e.g., LLaVA, Idefics) benefiting while others (e.g., Qwen) suffer performance degradation. To address this, we propose Directional Reasoning Injection for Fine-Tuning (DRIFT) MLLMs, a lightweight method that transfers reasoning knowledge in the gradient space, without destabilizing multimodal alignment. DRIFT precomputes a reasoning prior as the parameter-space difference between reasoning and multimodal variants, then uses it to bias gradients during multimodal fine-tuning. This approach preserves the simplicity of standard supervised fine-tuning pipelines while enabling efficient reasoning transfer. Extensive experiments on multimodal reasoning benchmarks, including MathVista and MathVerse, demonstrate that DRIFT consistently improves reasoning performance over naive merging and supervised fine-tuning, while matching or surpassing training-heavy methods at a fraction of the cost.
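The gradient-space transfer admits a compact sketch: precompute the parameter delta between the reasoning-enhanced and multimodal variants, then bias each fine-tuning update toward that delta. The simple additive mixing rule and its coefficient below are assumptions about how the precomputed prior enters the update.

```python
import torch

@torch.no_grad()
def reasoning_prior(reasoning_model, multimodal_model):
    # Parameter-space difference: reasoning variant minus multimodal variant
    # (assumes both models share the same architecture and parameter ordering)
    return {name: p_r - p_m
            for (name, p_r), (_, p_m) in zip(reasoning_model.named_parameters(),
                                             multimodal_model.named_parameters())}

@torch.no_grad()
def biased_sgd_step(model, prior, lr=1e-5, lam=0.1):
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        biased_grad = p.grad - lam * prior[name]  # assumed mixing rule: nudge toward the delta
        p.add_(biased_grad, alpha=-lr)
```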
zh
[CV-80] Comprehensive language-image pre-training for 3D medical image understanding
【速读】:该论文旨在解决3D医学图像领域中视觉-语言预训练(Vision-Language Pre-training, VLP)模型因数据稀缺而导致性能受限的问题。其核心挑战在于高质量的配对图像-文本数据在3D医学影像中难以获取,限制了视觉语言编码器(Vision-Language Encoders, VLEs)的泛化能力和下游任务表现。解决方案的关键在于引入额外的归纳偏置(inductive biases):一方面增加报告生成目标(report generation objective),另一方面将视觉-语言预训练与纯视觉预训练(vision-only pre-training)相结合,从而有效利用图像-only 和图像-文本配对的3D医学数据集,显著扩大模型训练数据规模。通过这一策略,作者构建了综合语言-图像预训练(Comprehensive Language-image Pre-training, COLIPRI)编码器家族,在报告生成、分类探测(classification probing)和零样本分类(zero-shot classification)任务上达到当前最优性能,并在语义分割任务中保持竞争力。
链接: https://arxiv.org/abs/2510.15042
作者: Tassilo Wald,Ibrahim Ethem Hamamci,Yuan Gao,Sam Bond-Taylor,Harshita Sharma,Maximilian Ilse,Cynthia Lo,Olesya Melnichenko,Noel C. F. Codella,Maria Teodora Wetscherek,Klaus H. Maier-Hein,Panagiotis Korfiatis,Valentina Salvatelli,Javier Alvarez-Valle,Fernando Pérez-García
机构: Microsoft(微软); German Cancer Research Center (德国癌症研究中心); Department of Radiology, University of Cambridge and Cambridge University Hospitals NHS Foundation Trust(剑桥大学放射科及剑桥大学医院国家医疗服务体系基金会信托); Pattern Analysis and Learning Group, Heidelberg University Hospital(海德堡大学医院模式分析与学习组); Department of Radiology, Mayo Clinic(梅奥诊所放射科)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Vision-language pre-training, i.e., aligning images with paired text, is a powerful paradigm to create encoders that can be directly used for tasks such as classification and retrieval, and for downstream tasks such as segmentation and report generation. In the 3D medical image domain, these capabilities allow vision-language encoders (VLEs) to support radiologists by retrieving patients with similar abnormalities or predicting likelihoods of abnormality. While the methodology holds promise, data availability limits the capabilities of current 3D VLEs. In this paper, we alleviate the lack of data by injecting additional inductive biases: introducing a report generation objective and pairing vision-language pre-training with vision-only pre-training. This allows us to leverage both image-only and paired image-text 3D datasets, increasing the total amount of data to which our model is exposed. Through these additional inductive biases, paired with best practices of the 3D medical imaging domain, we develop the Comprehensive Language-image Pre-training (COLIPRI) encoder family. Our COLIPRI encoders achieve state-of-the-art performance in report generation, classification probing, and zero-shot classification, and remain competitive for semantic segmentation.
zh
[CV-81] Generalized Dynamics Generation towards Scannable Physical World Model
【速读】:该论文旨在解决如何在可扫描环境中统一建模和生成多样化物理行为(包括刚体、关节体和软体)的问题,从而为通用具身智能体(generalist embodied agents)提供一个统一的数字孪生世界框架。其解决方案的关键在于提出GDGen(Generalized Representation for Generalized Dynamics Generation),从势能最小化原理出发,将不同类型的物理系统纳入一个几何无关的统一框架中;通过引入方向性刚度(directional stiffness)扩展经典弹性动力学,并设计专用神经网络建模材料属性、利用神经场(neural field)实现几何无关的形变表示,从而从简单运动观测中推断出底层物理特性,实现对复杂动态场景的鲁棒建模与生成。
链接: https://arxiv.org/abs/2510.15041
作者: Yichen Li,Zhiyi Li,Brandon Feng,Dinghuai Zhang,Antonio Torralba
机构: MIT CSAIL (麻省理工学院计算机科学与人工智能实验室); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Digital twin worlds with realistic interactive dynamics presents a new opportunity to develop generalist embodied agents in scannable environments with complex physical behaviors. To this end, we present GDGen (Generalized Representation for Generalized Dynamics Generation), a framework that takes a potential energy perspective to seamlessly integrate rigid body, articulated body, and soft body dynamics into a unified, geometry-agnostic system. GDGen operates from the governing principle that the potential energy for any stable physical system should be low. This fresh perspective allows us to treat the world as one holistic entity and infer underlying physical properties from simple motion observations. We extend classic elastodynamics by introducing directional stiffness to capture a broad spectrum of physical behaviors, covering soft elastic, articulated, and rigid body systems. We propose a specialized network to model the extended material property and employ a neural field to represent deformation in a geometry-agnostic manner. Extensive experiments demonstrate that GDGen robustly unifies diverse simulation paradigms, offering a versatile foundation for creating interactive virtual environments and training robotic agents in complex, dynamically rich scenarios.
zh
[CV-82] MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning ICCV2025
【速读】:该论文旨在解决大规模基础模型在资源受限边缘设备上部署时面临的高计算成本问题,同时保持优异的实例分割性能。现有架构难以在不牺牲性能的前提下实现高效边缘部署,为此,作者提出MOBIUS系列基础模型,其核心解决方案包括:(i)一种瓶颈像素解码器(bottleneck pixel decoder),用于高效实现多尺度与多模态融合;(ii)一种语言引导的不确定性校准损失(language-guided uncertainty calibration loss),支持自适应解码器剪枝以降低计算需求;(iii)一种简化的统一训练策略。相比高效基线方法以精度换复杂度的范式,MOBIUS在仅需三分之一训练迭代次数的情况下,将像素解码器和Transformer解码器的浮点运算量(FLOPs)分别减少高达55%和75%,同时保持最先进性能,实现了帕累托最优的模型缩放能力。
链接: https://arxiv.org/abs/2510.15026
作者: Mattia Segu,Marta Tintore Gazulla,Yongqin Xian,Luc Van Gool,Federico Tombari
机构: Google(谷歌); ETH Zurich; INSAIT, Sofia University, St. Kliment Ohridski
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025
Abstract:Scaling up model size and training data has advanced foundation models for instance-level perception, achieving state-of-the-art in-domain and zero-shot performance across object detection and segmentation. However, their high computational cost limits adoption on resource-constrained platforms. We first examine the limitations of existing architectures in enabling efficient edge deployment without compromising performance. We then introduce MOBIUS, a family of foundation models for universal instance segmentation, designed for Pareto-optimal downscaling to support deployment across devices ranging from high-end accelerators to mobile hardware. To reduce training and inference demands, we propose: (i) a bottleneck pixel decoder for efficient multi-scale and multi-modal fusion, (ii) a language-guided uncertainty calibration loss for adaptive decoder pruning, and (iii) a streamlined, unified training strategy. Unlike efficient baselines that trade accuracy for reduced complexity, MOBIUS reduces pixel and transformer decoder FLOPs by up to 55% and 75%, respectively, while maintaining state-of-the-art performance in just a third of the training iterations. MOBIUS establishes a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
zh
[CV-83] LoRAverse: A Submodular Framework to Retrieve Diverse Adapters for Diffusion Models
【速读】:该论文旨在解决大规模LoRA(Low-rank Adaptation)模型库中用户难以高效筛选和选择最相关且多样化的适配器的问题,尤其是在超过10万件LoRA模型存在、缺乏结构化组织的情况下。其解决方案的关键在于将适配器选择任务建模为组合优化问题,并提出一种新颖的子模(submodular)框架,以在保证代表性的同时最大化输出多样性,从而提升用户在不同领域生成定制化内容的效率与效果。
链接: https://arxiv.org/abs/2510.15022
作者: Mert Sonmezer,Matthew Zheng,Pinar Yanardag
机构: Middle East Technical University (中东北部技术大学); Virginia Tech (弗吉尼亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Low-rank Adaptation (LoRA) models have revolutionized the personalization of pre-trained diffusion models by enabling fine-tuning through low-rank, factorized weight matrices specifically optimized for attention layers. These models facilitate the generation of highly customized content across a variety of objects, individuals, and artistic styles without the need for extensive retraining. Despite the availability of over 100K LoRA adapters on platforms like this http URL, users often face challenges in navigating, selecting, and effectively utilizing the most suitable adapters due to their sheer volume, diversity, and lack of structured organization. This paper addresses the problem of selecting the most relevant and diverse LoRA models from this vast database by framing the task as a combinatorial optimization problem and proposing a novel submodular framework. Our quantitative and qualitative experiments demonstrate that our method generates diverse outputs across a wide range of domains.
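Greedy maximization of a monotone submodular objective is the standard tool for this kind of relevance-plus-diversity selection. The facility-location-style objective in the sketch below is an illustrative choice, not necessarily the paper's exact formulation.

```python
import numpy as np

def greedy_select(sim: np.ndarray, rel: np.ndarray, k: int, lam: float = 0.5) -> list:
    # sim: [N, N] adapter-adapter similarity; rel: [N] relevance to the user query
    # Objective (assumed): lam * relevance + (1 - lam) * facility-location coverage
    n = sim.shape[0]
    selected, covered = [], np.zeros(n)
    for _ in range(k):
        best_j, best_gain = -1, -np.inf
        for j in range(n):
            if j in selected:
                continue
            # marginal gain of adding adapter j
            gain = lam * rel[j] + (1 - lam) * (np.maximum(covered, sim[j]) - covered).sum()
            if gain > best_gain:
                best_j, best_gain = j, gain
        selected.append(best_j)
        covered = np.maximum(covered, sim[best_j])
    return selected
```

Because the coverage term is monotone submodular, this greedy loop enjoys the classic (1 - 1/e) approximation guarantee, which is what makes the combinatorial formulation tractable at the scale of 100K adapters.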
zh
[CV-84] Constantly Improving Image Models Need Constantly Improving Benchmarks
【速读】:该论文旨在解决现有图像生成模型评估基准滞后于实际应用发展的问题,即当前基准无法捕捉由如GPT-4o Image Gen等新型生成式AI(Generative AI)系统带来的新兴使用场景和用户交互模式,导致社区对模型进步的认知与正式评测之间存在脱节。其解决方案的关键在于提出ECHO框架,该框架通过收集真实世界中社交媒体上的用户行为数据——包括新颖提示(prompts)和定性反馈——构建基于实证的基准测试集,从而更准确地反映模型在复杂、创造性任务中的表现,并据此设计更具针对性的质量度量指标。
链接: https://arxiv.org/abs/2510.15021
作者: Jiaxin Ge,Grace Luo,Heekyung Lee,Nishant Malpani,Long Lian,XuDong Wang,Aleksander Holynski,Trevor Darrell,Sewon Min,David M. Chan
机构: UC Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in image generation, often driven by proprietary systems like GPT-4o Image Gen, regularly introduce new capabilities that reshape how users interact with these models. Existing benchmarks often lag behind and fail to capture these emerging use cases, leaving a gap between community perceptions of progress and formal evaluation. To address this, we present ECHO, a framework for constructing benchmarks directly from real-world evidence of model use: social media posts that showcase novel prompts and qualitative user judgments. Applying this framework to GPT-4o Image Gen, we construct a dataset of over 31,000 prompts curated from such posts. Our analysis shows that ECHO (1) discovers creative and complex tasks absent from existing benchmarks, such as re-rendering product labels across languages or generating receipts with specified totals, (2) more clearly distinguishes state-of-the-art models from alternatives, and (3) surfaces community feedback that we use to inform the design of metrics for model quality (e.g., measuring observed shifts in color, identity, and structure). Our website is at this https URL.
zh
[CV-85] NANO3D: A Training-Free Approach for Efficient 3D Editing Without Masks
【速读】:该论文旨在解决当前3D物体编辑方法中存在的效率低、一致性差以及难以保持未编辑区域结构完整性的问题。现有方法通常依赖多视角渲染进行编辑后再重建,易引入伪影且实用性受限。其解决方案的关键在于提出一种无需训练的框架Nano3D,该框架将FlowEdit集成到TRELLIS中,通过前视图渲染引导局部编辑,并引入区域感知融合策略(Voxel/Slat-Merge),自适应地确保编辑区与未编辑区之间的结构一致性,从而实现高保真度和视觉质量的3D编辑。
链接: https://arxiv.org/abs/2510.15019
作者: Junliang Ye,Shenghao Xie,Ruowen Zhao,Zhengyi Wang,Hongyu Yan,Wenqiang Zu,Lei Ma,Jun Zhu
机构: Tsinghua University (清华大学); Peking University (北京大学); ShengShu; HKUST (香港科技大学); CASIA (中国科学院自动化研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:3D object editing is essential for interactive content creation in gaming, animation, and robotics, yet current approaches remain inefficient, inconsistent, and often fail to preserve unedited regions. Most methods rely on editing multi-view renderings followed by reconstruction, which introduces artifacts and limits practicality. To address these challenges, we propose Nano3D, a training-free framework for precise and coherent 3D object editing without masks. Nano3D integrates FlowEdit into TRELLIS to perform localized edits guided by front-view renderings, and further introduces region-aware merging strategies, Voxel/Slat-Merge, which adaptively preserve structural fidelity by ensuring consistency between edited and unedited areas. Experiments demonstrate that Nano3D achieves superior 3D consistency and visual quality compared with existing methods. Based on this framework, we construct the first large-scale 3D editing dataset, Nano3D-Edit-100k, which contains over 100,000 high-quality 3D editing pairs. This work addresses long-standing challenges in both algorithm design and data availability, significantly improving the generality and reliability of 3D editing, and laying the groundwork for the development of feed-forward 3D editing models. Project Page: this https URL
zh
[CV-86] UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos
【速读】:该论文旨在解决城市环境中智能体(如配送机器人、四足机器人等)训练所需高保真、多样化仿真场景的可扩展性问题。现有方法中,人工设计或程序生成的场景要么难以规模化,要么无法充分捕捉真实世界的复杂性。解决方案的关键在于提出UrbanVerse系统,其核心由两部分组成:一是包含10万+标注3D城市资产的UrbanVerse-100K数据库(具备语义与物理属性),二是UrbanVerse-Gen自动流水线——从众包的城市巡游视频中提取场景布局,并利用检索到的资产构建度量尺度的物理感知交互式仿真环境。该方案实现了从真实世界数据到高质量仿真场景的自动化映射,在IsaacSim中生成了来自24个国家的160个高质量场景,显著提升了智能体在城市导航任务中的训练效果与零样本sim-to-real迁移性能。
链接: https://arxiv.org/abs/2510.15018
作者: Mingxuan Liu,Honglin He,Elisa Ricci,Wayne Wu,Bolei Zhou
机构: University of California, Los Angeles (加州大学洛杉矶分校); University of Trento (特伦托大学); Fondazione Bruno Kessler (布鲁诺·凯斯勒基金会)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Technical report. Project page: this https URL
Abstract:Urban embodied AI agents, ranging from delivery robots to quadrupeds, are increasingly populating our cities, navigating chaotic streets to provide last-mile connectivity. Training such agents requires diverse, high-fidelity urban environments to scale, yet existing human-crafted or procedurally generated simulation scenes either lack scalability or fail to capture real-world complexity. We introduce UrbanVerse, a data-driven real-to-sim system that converts crowd-sourced city-tour videos into physics-aware, interactive simulation scenes. UrbanVerse consists of: (i) UrbanVerse-100K, a repository of 100k+ annotated urban 3D assets with semantic and physical attributes, and (ii) UrbanVerse-Gen, an automatic pipeline that extracts scene layouts from video and instantiates metric-scale 3D simulations using retrieved assets. Running in IsaacSim, UrbanVerse offers 160 high-quality constructed scenes from 24 countries, along with a curated benchmark of 10 artist-designed test scenes. Experiments show that UrbanVerse scenes preserve real-world semantics and layouts, achieving human-evaluated realism comparable to manually crafted scenes. In urban navigation, policies trained in UrbanVerse exhibit scaling power laws and strong generalization, improving success by +6.3% in simulation and +30.1% in zero-shot sim-to-real transfer compared with prior methods, accomplishing a 300 m real-world mission with only two interventions.
zh
[CV-87] PC-UNet: An Enforcing Poisson Statistics U-Net for Positron Emission Tomography Denoising
【速读】:该论文旨在解决正电子发射断层成像(Positron Emission Tomography, PET)在低剂量条件下因泊松噪声显著增加而导致图像质量下降的问题,现有去噪方法难以有效抑制噪声并保持图像物理一致性,常引入失真和伪影。其解决方案的关键在于提出一种泊松一致U-Net(Poisson Consistent U-Net, PC-UNet)模型,并设计了一种新的泊松方差与均值一致性损失(Poisson Variance and Mean Consistency Loss, PVMC-Loss),该损失函数基于物理数据建模,具有统计无偏性及梯度自适应特性,本质上是广义矩估计(Generalized Method of Moments)的实现,从而在小规模数据偏差下仍能保持鲁棒性,显著提升图像保真度与物理一致性。
链接: https://arxiv.org/abs/2510.14995
作者: Yang Shi,Jingchao Wang,Liangsi Lu,Mingxuan Huang,Ruixin He,Yifeng Xie,Hanqian Liu,Minzhe Guo,Yangyang Liang,Weipeng Zhang,Zimeng Li,Xuhang Chen
机构: Guangdong University of Technology (广东工业大学); Sun Yat-sen University (中山大学); South China University of Technology (华南理工大学); Huizhou University (惠州学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by BIBM 2025 as a regular paper
Abstract:Positron Emission Tomography (PET) is crucial in medicine, but its clinical use is limited because the high doses required for an adequate signal-to-noise ratio increase radiation exposure. Lowering doses increases Poisson noise, which current denoising methods fail to handle, causing distortions and artifacts. We propose a Poisson Consistent U-Net (PC-UNet) model with a new Poisson Variance and Mean Consistency Loss (PVMC-Loss) that incorporates physical data to improve image fidelity. PVMC-Loss is statistically unbiased in variance and gradient adaptation, acting as a Generalized Method of Moments implementation, offering robustness to minor data mismatches. Tests on PET datasets show PC-UNet improves physical consistency and image fidelity, proving its ability to integrate physical information effectively.
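The physics behind the loss is the Poisson identity Var[x] = E[x]: in the count domain, the local variance of the residual noise should match the local mean of the denoised image. The sketch below enforces this by moment matching over patches; the patch size, zero-mean-residual assumption, and squared penalty are illustrative choices, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def pvmc_loss(denoised: torch.Tensor, noisy: torch.Tensor, patch: int = 8) -> torch.Tensor:
    # denoised, noisy: [B, 1, H, W] images in the count (Poisson) domain
    residual = noisy - denoised
    local_var = F.avg_pool2d(residual ** 2, patch)  # local noise variance (residual ~ zero-mean)
    local_mean = F.avg_pool2d(denoised, patch)      # local mean intensity estimate
    return ((local_var - local_mean) ** 2).mean()   # enforce Var = Mean per patch (assumed penalty)
```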
zh
[CV-88] GAZE:Governance-Aware pre-annotation for Zero-shot World Model Environments
【速读】:该论文旨在解决训练鲁棒世界模型(World Models)所需的大规模、高精度多模态数据集构建难题,这一过程长期受限于人工标注效率低、成本高的瓶颈。其核心解决方案是提出一个经过生产验证的GAZE流水线,关键在于通过三个步骤实现自动化标注:(i) 将私有360度视频格式标准化为标准视图并分片以支持并行处理;(ii) 利用一系列AI模型(如场景理解、目标跟踪、语音转录、PII/NSFW/未成年人内容检测)进行密集的多模态预标注;(iii) 将多源信号整合为结构化输出规范,便于快速人工校验。该方法显著提升标注效率(每小时审阅节省约19分钟),减少80%的人工审查量,并通过保守自动跳过低显著性片段实现隐私保护与链式保管元数据集成,从而生成高质量、可直接用于跨模态动态学习和动作条件预测的高保真数据集。
链接: https://arxiv.org/abs/2510.14992
作者: Leela Krishna,Mengyang Zhao,Saicharithreddy Pasula,Harshit Rajgarhia,Abhishek Mukherji
机构: Centific Global Solutions Inc. (Centific全球解决方案公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Training robust world models requires large-scale, precisely labeled multimodal datasets, a process historically bottlenecked by slow and expensive manual annotation. We present a production-tested GAZE pipeline that automates the conversion of raw, long-form video into rich, task-ready supervision for world-model training. Our system (i) normalizes proprietary 360-degree formats into standard views and shards them for parallel processing; (ii) applies a suite of AI models (scene understanding, object tracking, audio transcription, PII/NSFW/minor detection) for dense, multimodal pre-annotation; and (iii) consolidates signals into a structured output specification for rapid human validation. The GAZE workflow demonstrably yields efficiency gains (~19 minutes saved per review hour) and reduces human review volume by 80% through conservative auto-skipping of low-salience segments. By increasing label density and consistency while integrating privacy safeguards and chain-of-custody metadata, our method generates high-fidelity, privacy-aware datasets directly consumable for learning cross-modal dynamics and action-conditioned prediction. We detail our orchestration, model choices, and data dictionary to provide a scalable blueprint for generating high-quality world model training data without sacrificing throughput or governance.
zh
[CV-89] End-to-End Multi-Modal Diffusion Mamba ICCV2025
【速读】:该论文旨在解决当前端到端多模态模型中因使用独立编码器与解码器而导致的跨模态联合表征学习受限的问题。其解决方案的关键在于提出一种名为MDM(Multi-modal Diffusion Mamba)的新架构,该架构基于Mamba结构设计了一个多步选择扩散模型,通过统一的变分自编码器(Variational Autoencoder, VAE)实现编码与解码过程的一体化,从而在高维数据处理中(如高分辨率图像和长文本序列的同步生成)显著提升性能,并保持计算效率。
链接: https://arxiv.org/abs/2510.13253
作者: Chunhao Lu,Qiang Lu,Meichen Dong,Jake Luo
机构: China University of Petroleum-Beijing (中国石油大学-北京); Leyard Optoelectronic (利亚德光电); University of Wisconsin-Milwaukee (密歇根大学米尔沃基分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by ICCV 2025
Abstract:Current end-to-end multi-modal models utilize different encoders and decoders to process input and output information. This separation hinders the joint representation learning of various modalities. To unify multi-modal processing, we propose a novel architecture called MDM (Multi-modal Diffusion Mamba). MDM utilizes a Mamba-based multi-step selection diffusion model to progressively generate and refine modality-specific information through a unified variational autoencoder for both encoding and decoding. This innovative approach allows MDM to achieve superior performance when processing high-dimensional data, particularly in generating high-resolution images and extended text sequences simultaneously. Our evaluations in areas such as image generation, image captioning, visual question answering, text comprehension, and reasoning tasks demonstrate that MDM significantly outperforms existing end-to-end models (e.g., MonoFormer, LlamaGen, and Chameleon) and competes effectively with SOTA models like GPT-4V, Gemini Pro, and Mistral. Our results validate MDM’s effectiveness in unifying multi-modal processes while maintaining computational efficiency, establishing a new direction for end-to-end multi-modal architectures.
zh
[CV-90] Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models CVPR2025
【速读】:该论文旨在解决当前大型多模态模型(Large Multimodal Models, LMMs)在复杂三维空间推理能力上的不足,尤其是缺乏对6D空间推理(包含3D位置和3D方向)的系统性评估框架。现有基准主要聚焦于二维空间理解,无法全面衡量模型在不同复杂度下的空间推理性能。其解决方案的关键在于构建了一个可扩展且无偏的合成数据集Spatial457,该数据集具备四大核心空间推理能力:多物体识别、2D位置、3D位置与3D方向;并设计了包含7种题型、5个难度等级的级联评估结构,首次引入6D空间推理任务以精准刻画模型在高阶空间认知中的表现。同时,通过提出相对性能下降率(Relative Performance Dropping Rate, RPDR)量化指标,揭示了LMMs在3D及6D任务中显著的性能衰减问题,并发现模型在不同属性上存在一致的预测偏差模式。
链接: https://arxiv.org/abs/2502.08636
作者: Xingrui Wang,Wufei Ma,Tiezheng Zhang,Celso M de Melo,Jieneng Chen,Alan Yuille
机构: Johns Hopkins University (约翰霍普金斯大学); DEVCOM Army Research Laboratory (美国陆军研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Published in CVPR 2025 as Highlight. Data and code are released at this https URL
Abstract:Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. To address this limitation, we present Spatial457, a scalable and unbiased synthetic dataset designed with 4 key capabilities for spatial reasoning: multi-object recognition, 2D location, 3D location, and 3D orientation. We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single object recognition to our newly proposed complex 6D spatial reasoning tasks. We evaluated various large multimodal models (LMMs) on Spatial457, observing a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), highlighting key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings. The code and data are released in this https URL.
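下面给出 RPDR 指标的一个极简示意计算。摘要未给出精确公式,此处假设其为"从简单任务到复杂任务的准确率相对降幅",仅供理解指标含义,精确定义以论文原文为准:

```python
def rpdr(acc_simple: float, acc_complex: float) -> float:
    """相对性能下降率(RPDR)的一种直观定义(假设形式):
    从简单任务到复杂任务的准确率相对降幅。"""
    return (acc_simple - acc_complex) / acc_simple

# 示例:某 LMM 在单物体识别上准确率 0.85,在 6D 空间推理上降至 0.34
print(f"RPDR = {rpdr(0.85, 0.34):.2%}")  # 60.00%
```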
zh
[CV-91] Bridging Simplicity and Sophistication using GLinear: A Novel Architecture for Enhanced Time Series Prediction
【速读】:该论文旨在解决时间序列预测(Time Series Forecasting, TSF)中Transformer模型虽擅长处理长序列却可能难以有效保留时序关系的问题,同时探索线性模型是否能在保持简洁性的同时实现与复杂模型相当甚至更优的性能。其解决方案的关键在于提出一种新颖的数据高效型架构——高斯激活线性模型(Gaussian-activated Linear model, GLinear),该模型通过显式建模周期性模式来增强预测精度,在仅需较少历史数据的情况下即能超越或媲美当前主流的线性预测器(如NLinear、DLinear、RLinear)及基于Transformer的模型(如Autoformer),从而为高效、轻量且准确的时间序列分析提供了新方向。
链接: https://arxiv.org/abs/2501.01087
作者: Syed Tahir Hussain Rizvi,Neel Kanwal,Muddasar Naeem
机构: University of Stavanger (斯塔万格大学); Università Telematica Giustino Fortunato (吉斯蒂诺·福尔图纳托远程大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Logic in Computer Science (cs.LO); Performance (cs.PF)
备注: Submitted to Digital Signal Processing Journal
Abstract:Time Series Forecasting (TSF) is an important application across many fields. There is a debate about whether Transformers, despite being good at understanding long sequences, struggle with preserving temporal relationships in time series data. Recent research suggests that simpler linear models might outperform or at least provide competitive performance compared to complex Transformer-based models for TSF tasks. In this paper, we propose a novel data-efficient architecture, Gaussian-activated Linear model (GLinear), for multivariate TSF that exploits periodic patterns to provide better accuracy. It achieves higher prediction accuracy while requiring less historical data than other state-of-the-art linear predictors. Four different datasets (ETTh1, Electricity, Traffic, and Weather) are used to evaluate the performance of the proposed predictor. A performance comparison with state-of-the-art linear architectures (such as NLinear, DLinear, and RLinear) and transformer-based time series predictors (Autoformer) shows that the GLinear, despite being data efficient, outperforms the existing architectures in most cases of multivariate TSF while being competitive in others. We hope that the proposed GLinear model opens new fronts of research and development of simpler and more sophisticated architectures for data and computationally efficient time-series analysis. The source code is publicly available on GitHub.
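为便于理解,下面给出 GLinear 的一个概念性 PyTorch 草图(非官方实现):假设模型在通道独立的线性映射之间插入高斯激活 exp(-x^2) 以刻画周期性模式,层数与具体结构均为假设:

```python
import torch
import torch.nn as nn

class GLinearSketch(nn.Module):
    """GLinear 的概念性示意(非官方实现):在两个线性映射之间
    插入高斯激活 exp(-x^2),具体层数与归一化方式均为假设。"""
    def __init__(self, seq_len: int, pred_len: int):
        super().__init__()
        self.fc1 = nn.Linear(seq_len, seq_len)
        self.fc2 = nn.Linear(seq_len, pred_len)

    def forward(self, x):                      # x: [batch, n_vars, seq_len]
        h = torch.exp(-self.fc1(x) ** 2)       # 高斯激活,利于刻画周期性模式
        return self.fc2(h)                     # 预测序列 [batch, n_vars, pred_len]

model = GLinearSketch(seq_len=96, pred_len=24)
y = model(torch.randn(8, 7, 96))               # 例如 ETTh1 的 7 个变量
```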
zh
[CV-92] SANR: Scene-Aware Neural Representation for Light Field Image Compression with Rate-Distortion Optimization
【速读】:该论文旨在解决光场图像(light field image)压缩中因高维数据导致的存储与传输效率低下问题,以及现有基于隐式神经表示(implicit neural representation, INR)的方法在场景结构建模不足和缺乏端到端率失真优化(rate-distortion optimization)方面的局限性。解决方案的关键在于提出一种面向场景感知的神经表示框架(Scene-Aware Neural Representation, SANR),其核心创新包括:1)引入分层场景建模模块,利用多尺度潜在码捕捉场景内在结构,缩小INR输入坐标与目标光场图像之间的信息差距;2)首次将熵约束的量化感知训练(entropy-constrained quantization-aware training, QAT)引入基于神经表示的光场图像压缩中,实现端到端的率失真优化。实验表明,SANR在率失真性能上显著优于当前最优方法,相比HEVC实现了65.62%的BD-rate节省。
链接: https://arxiv.org/abs/2510.15775
作者: Gai Zhang,Xinfeng Zhang,Lv Tang,Hongyu An,Li Zhang,Qingming Huang
机构: University of Chinese Academy of Sciences (中国科学院大学); Bytedance Inc. (字节跳动)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:
Abstract:Light field images capture multi-view scene information and play a crucial role in 3D scene reconstruction. However, their high-dimensional nature results in enormous data volumes, posing a significant challenge for efficient compression in practical storage and transmission scenarios. Although neural representation-based methods have shown promise in light field image compression, most approaches rely on direct coordinate-to-pixel mapping through implicit neural representation (INR), often neglecting the explicit modeling of scene structure. Moreover, they typically lack end-to-end rate-distortion optimization, limiting their compression efficiency. To address these limitations, we propose SANR, a Scene-Aware Neural Representation framework for light field image compression with end-to-end rate-distortion optimization. For scene awareness, SANR introduces a hierarchical scene modeling block that leverages multi-scale latent codes to capture intrinsic scene structures, thereby reducing the information gap between INR input coordinates and the target light field image. From a compression perspective, SANR is the first to incorporate entropy-constrained quantization-aware training (QAT) into neural representation-based light field image compression, enabling end-to-end rate-distortion optimization. Extensive experimental results demonstrate that SANR significantly outperforms state-of-the-art techniques regarding rate-distortion performance, with a 65.62% BD-rate saving against HEVC.
zh
[CV-93] RankSEG-RMA: An Efficient Segmentation Algorithm via Reciprocal Moment Approximation
【速读】:该论文旨在解决现有语义分割方法在优化分割指标(如Dice系数和交并比IoU)时存在的不一致性与次优性问题,以及由此带来的计算复杂度高和仅适用于重叠分割场景的局限性。传统方法通常通过估计像素级类别概率后使用argmax或阈值化获得最终预测,但这类策略并未直接优化分割指标,导致性能受限。为应对上述挑战,作者提出基于排序优化的RankSEG框架,其核心创新在于引入RankDice和RankIoU损失函数以直接最大化Dice和IoU指标。然而,RankSEG存在两个关键缺陷:一是计算复杂度高(RankDice为O(d log d),RankIoU为O(d²)),二是仅适用于多类可重叠的分割场景。本文的关键解决方案是提出一种互反矩近似(Reciprocal Moment Approximation, RMA),通过RMA对RankSEG进行改进,得到RankSEG-RMA,使两种算法的复杂度降至O(d)且保持相近性能;同时,受RMA启发设计了一种像素级评分函数,从而实现了非重叠分割场景下的高效应用。
链接: https://arxiv.org/abs/2510.15362
作者: Zixun Wang,Ben Dai
机构: The Chinese University of Hong Kong (香港中文大学)
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Semantic segmentation labels each pixel in an image with its corresponding class, and is typically evaluated using the Intersection over Union (IoU) and Dice metrics to quantify the overlap between predicted and ground-truth segmentation masks. In the literature, most existing methods estimate pixel-wise class probabilities, then apply argmax or thresholding to obtain the final prediction. These methods have been shown to generally lead to inconsistent or suboptimal results, as they do not directly maximize segmentation metrics. To address this issue, a novel consistent segmentation framework, RankSEG, has been proposed, which includes RankDice and RankIoU specifically designed to optimize the Dice and IoU metrics, respectively. Although RankSEG almost guarantees improved performance, it suffers from two major drawbacks. The first is its computational expense: RankDice has a complexity of O(d log d) with a substantial constant factor (where d represents the number of pixels), while RankIoU exhibits even higher complexity, O(d^2), thus limiting its practical application. For instance, in LiTS, prediction with RankSEG takes 16.33 seconds compared to just 0.01 seconds with the argmax rule. Second, RankSEG is only applicable to overlapping segmentation settings, where multiple classes can occupy the same pixel, which contrasts with standard benchmarks that typically assume non-overlapping segmentation. In this paper, we overcome these two drawbacks via a reciprocal moment approximation (RMA) of RankSEG with the following contributions: (i) we improve RankSEG using RMA, namely RankSEG-RMA, which reduces the complexity of both algorithms to O(d) while maintaining comparable performance; (ii) inspired by RMA, we develop a pixel-wise score function that allows efficient implementation for non-overlapping segmentation settings.
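下面用一段示意代码说明"按概率排序并最大化近似期望 Dice"的核心思路(在像素独立假设下以一阶近似期望 Dice,属示意性重构,并非论文的官方 RMA 算法):

```python
import numpy as np

def rank_dice_sketch(p: np.ndarray) -> np.ndarray:
    """RankDice 思路的示意实现(含一阶近似,非官方算法):
    对像素概率降序排序后,选择使近似期望 Dice 最大的前 j 个像素。
    独立性假设下,E[Dice] ≈ 2 * sum_{i<=j} p_(i) / (j + sum_i p_i)。"""
    order = np.argsort(-p)                  # O(d log d) 排序
    cum = np.cumsum(p[order])
    j = np.arange(1, len(p) + 1)
    approx_dice = 2 * cum / (j + p.sum())   # 每个截断点的近似期望 Dice
    best_j = int(np.argmax(approx_dice)) + 1
    mask = np.zeros_like(p, dtype=bool)
    mask[order[:best_j]] = True             # O(d) 扫描得到最终预测掩码
    return mask

mask = rank_dice_sketch(np.random.rand(64 * 64))
```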
zh
[CV-94] Confidence-Weighted Semi-Supervised Learning for Skin Lesion Segmentation Using Hybrid CNN-Transformer Networks
【速读】:该论文旨在解决皮肤病变自动分割任务中因标注训练数据有限而导致的性能瓶颈问题,尤其在低标注比例场景下如何提升分割精度。其解决方案的关键在于提出一种半监督框架MIRA-U,该框架融合了不确定性感知的教师-学生伪标签机制与混合CNN-Transformer架构:教师网络通过掩码图像建模预训练生成置信度加权的软伪标签,指导一个具有交叉注意力跳跃连接的U型CNN-Transformer学生网络进行学习,从而显著提高伪标签质量与边界分割准确性,在仅使用50%标注数据的情况下仍能实现优异的Dice相似系数(DSC=0.9153)和交并比(IoU=0.8552)。
链接: https://arxiv.org/abs/2510.15354
作者: Saqib Qamar
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Automated skin lesion segmentation through dermoscopic analysis is essential for early skin cancer detection, yet remains challenging due to limited annotated training data. We present MIRA-U, a semi-supervised framework that combines uncertainty-aware teacher-student pseudo-labeling with a hybrid CNN-Transformer architecture. Our approach employs a teacher network pre-trained via masked image modeling to generate confidence-weighted soft pseudo-labels, which guide a U-shaped CNN-Transformer student network featuring cross-attention skip connections. This design enhances pseudo-label quality and boundary delineation, surpassing reconstruction-based and CNN-only baselines, particularly in low-annotation regimes. Extensive evaluation on ISIC-2016 and PH2 datasets demonstrates superior performance, achieving a Dice Similarity Coefficient (DSC) of 0.9153 and Intersection over Union (IoU) of 0.8552 using only 50% labeled data. Code is publicly available on GitHub.
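下面是置信度加权软伪标签损失的一个极简 PyTorch 草图(非 MIRA-U 官方实现;以教师软标签的最大概率作为逐像素权重,权重形式为假设):

```python
import torch
import torch.nn.functional as F

def confidence_weighted_loss(student_logits, teacher_probs):
    """置信度加权伪标签损失的示意(非官方实现)。
    student_logits / teacher_probs 形状均为 [B, C, H, W],
    teacher_probs 为教师网络 softmax 后的软伪标签。"""
    conf = teacher_probs.max(dim=1).values        # [B, H, W],教师逐像素置信度
    log_p = F.log_softmax(student_logits, dim=1)  # 学生预测的对数概率
    ce = -(teacher_probs * log_p).sum(dim=1)      # 软标签交叉熵,[B, H, W]
    return (conf * ce).mean()                     # 置信度加权后取均值
```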
zh
[CV-95] Neural Posterior Estimation for Cataloging Astronomical Images from the Legacy Survey of Space and Time
【速读】:该论文旨在解决天文图像数据中构建天体目录(astronomical catalog)时的传统确定性方法缺乏统计一致性,以及现有概率方法在计算效率、准确性或无法处理多波段叠加图像(multiband coadded images)方面的局限性问题。其解决方案的关键在于采用一种新兴的贝叶斯推断方法——神经后验估计(neural posterior estimation, NPE),该方法利用深度学习实现高计算效率与高精度的统一,并在DC2模拟天空调查数据上系统优于LSST标准流水线,在光源检测、流量测量、星系分类和形状测量等任务中表现优异,同时提供校准良好的后验近似。
链接: https://arxiv.org/abs/2510.15315
作者: Yicun Duan,Xinyue Li,Camille Avestruz,Jeffrey Regier
机构: University of Michigan (密歇根大学)
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
备注:
Abstract:The Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST) will commence full-scale operations in 2026, yielding an unprecedented volume of astronomical images. Constructing an astronomical catalog, a table of imaged stars, galaxies, and their properties, is a fundamental step in most scientific workflows based on astronomical image data. Traditional deterministic cataloging methods lack statistical coherence as cataloging is an ill-posed problem, while existing probabilistic approaches suffer from computational inefficiency, inaccuracy, or the inability to perform inference with multiband coadded images, the primary output format for LSST images. In this article, we explore a recently developed Bayesian inference method called neural posterior estimation (NPE) as an approach to cataloging. NPE leverages deep learning to achieve both computational efficiency and high accuracy. When evaluated on the DC2 Simulated Sky Survey – a highly realistic synthetic dataset designed to mimic LSST data – NPE systematically outperforms the standard LSST pipeline in light source detection, flux measurement, star/galaxy classification, and galaxy shape measurement. Additionally, NPE provides well-calibrated posterior approximations. These promising results, obtained using simulated data, illustrate the potential of NPE in the absence of model misspecification. Although some degree of model misspecification is inevitable in the application of NPE to real LSST images, there are a variety of strategies to mitigate its effects.
zh
人工智能
[AI-0] PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold
【速读】:该论文旨在解决当前工具增强型大语言模型(Tool-augmented Large Language Models, TALLMs)作为深度研究代理(Deep Research Agents)时存在的三大局限:浅层检索能力、弱对齐度量以及脆弱的工具使用行为。其解决方案的关键在于提出一个统一的强化学习框架——PokeeResearch-7B,该模型通过无标注的从AI反馈中强化学习(Reinforcement Learning from AI Feedback, RLAIF)机制进行训练,利用基于大语言模型的奖励信号优化策略,以同时提升事实准确性、引用忠实度和指令遵循性;此外,引入基于思维链(Chain-of-Thought)驱动的多调用推理结构,增强了自我验证与工具失败后的自适应恢复能力,从而显著提升了代理在10个主流深度研究基准上的性能表现,达到了7B规模模型中的最先进水平。
链接: https://arxiv.org/abs/2510.15862
作者: Yi Wan,Jiuqi Wang,Liam Li,Jinsong Liu,Ruihao Zhu,Zheqing Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Tool-augmented large language models (LLMs) are emerging as deep research agents, systems that decompose complex queries, retrieve external evidence, and synthesize grounded responses. Yet current agents remain limited by shallow retrieval, weak alignment metrics, and brittle tool-use behavior. We introduce PokeeResearch-7B, a 7B-parameter deep research agent built under a unified reinforcement learning framework for robustness, alignment, and scalability. PokeeResearch-7B is trained by an annotation-free Reinforcement Learning from AI Feedback (RLAIF) framework to optimize policies using LLM-based reward signals that capture factual accuracy, citation faithfulness, and instruction adherence. A chain-of-thought-driven multi-call reasoning scaffold further enhances robustness through self-verification and adaptive recovery from tool failures. Across 10 popular deep research benchmarks, PokeeResearch-7B achieves state-of-the-art performance among 7B-scale deep research agents. This highlights that careful reinforcement learning and reasoning design can produce efficient, resilient, and research-grade AI agents. The model and inference code are open-sourced under MIT license at this https URL.
zh
[AI-1] Self-Certifying Primal-Dual Optimization Proxies for Large-Scale Batch Economic Dispatch
【速读】:该论文旨在解决优化代理模型(optimization proxies)在实际部署中可信度不足的问题,即尽管其平均最优性差距(optimality gap)可低于1%,但在分布内查询中仍存在最优性差距高达数个数量级的情况,导致难以信任预测结果。解决方案的关键在于提出一种混合求解器(hybrid solver),该求解器利用对偶理论(duality theory)高效地界定预测的最优性差距,并在无法认证最优性时回退至经典求解器;同时,论文还提出一种结合原始和对偶代理训练的替代训练方法,以提升混合求解器的速度优势。实验表明,该方法在大规模输电系统中实现了超过1000倍于并行单纯形法的加速,且保证最大最优性差距不超过2%。
链接: https://arxiv.org/abs/2510.15850
作者: Michael Klamkin,Mathieu Tanneau,Pascal Van Hentenryck
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:
Abstract:Recent research has shown that optimization proxies can be trained to high fidelity, achieving average optimality gaps under 1% for large-scale problems. However, worst-case analyses show that there exist in-distribution queries that result in orders of magnitude higher optimality gap, making it difficult to trust the predictions in practice. This paper aims at striking a balance between classical solvers and optimization proxies in order to enable trustworthy deployments with interpretable speed-optimality tradeoffs based on a user-defined optimality threshold. To this end, the paper proposes a hybrid solver that leverages duality theory to efficiently bound the optimality gap of predictions, falling back to a classical solver for queries where optimality cannot be certified. To improve the achieved speedup of the hybrid solver, the paper proposes an alternative training procedure that combines the primal and dual proxy training. Experiments on large-scale transmission systems show that the hybrid solver is highly scalable. The proposed hybrid solver achieves speedups of over 1000x compared to a parallelized simplex-based solver while guaranteeing a maximum optimality gap of 2%.
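混合求解器"先认证、不行再回退"的流程可用如下示意代码表达(按摘要思路重构;predict_primal、predict_dual、solve_exact 均为假设接口):

```python
def certify_or_fallback(predict_primal, predict_dual, solve_exact,
                        query, tol=0.02):
    """混合求解器的示意流程(非官方实现):用对偶代理给出下界、
    原始代理给出可行解;若可证的最优性差距超过用户阈值(如 2%),
    则回退到经典求解器。"""
    x = predict_primal(query)           # 原始代理:预测可行解
    primal_obj = query.objective(x)     # 最小化问题下,可行解给出上界
    dual_bound = predict_dual(query)    # 对偶代理:对偶可行解给出下界
    gap = (primal_obj - dual_bound) / abs(dual_bound)
    if gap <= tol:
        return x                        # 最优性差距已被认证,直接采用预测
    return solve_exact(query)           # 无法认证,回退经典求解器
```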
zh
[AI-2] SNOO: Step-K Nesterov Outer Optimizer - The Surprising Effectiveness of Nesterov Momentum Applied to Pseudo-Gradients
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)训练中对高效优化技术的迫切需求,特别是如何在不显著增加计算与内存开销的前提下提升优化性能。其核心解决方案是提出一种名为 Step-K Nesterov Outer Optimizer (SNOO) 的 Lookahead 变体,关键在于将 Nesterov 动量应用于伪梯度(pseudo-gradient),从而在非分布式设置下显著改善训练效果。实验证明,SNOO 在高达 10²³ 训练浮点运算(FLOPs)规模下可实现 1.5–2.5× 的计算效率增益,且优势随模型规模增大而增强,同时具备与模型并行(model sharding)兼容性,是一种轻量级且通用的优化器改进方案。
链接: https://arxiv.org/abs/2510.15830
作者: Dominik Kallusky,Vinay Rao,Vishal Nandavanam,Hao-Jun Michael Shi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid development of large language models (LLMs) has driven the demand for more efficient optimization techniques. Among these, the Lookahead family of optimizers employs a two-loop framework, maintaining fast and slow sets of model weights. Multiple inner optimizer steps on the fast weights produce a trajectory - the pseudo-gradient - that is used to update the slow weights. DiLoCo, a notable example originally designed for distributed training, applies Nesterov momentum to the averaged pseudo-gradient from multiple workers, claiming to even outperform AdamW in a non-distributed setup. In this paper, we empirically show that DiLoCo’s surprising effectiveness stems primarily from applying Nesterov momentum to the pseudo-gradient, which improves training in a non-distributed setting. We call this Lookahead variant the Step-K Nesterov Outer Optimizer (SNOO). We demonstrate that SNOO achieves compute factor gains of 1.5-2.5x in a non-distributed setting up to a scale of 1e23 training FLOPs, with improvements that increase with model size. Because of its minimal compute and memory overhead and compatibility with model sharding, SNOO is a practical enhancement for a variety of inner optimizers, including AdamW and Muon.
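SNOO 的两层结构可用如下 PyTorch 草图示意(非官方实现;K、外层学习率与动量系数均为假设取值):内层优化器先走 K 步,外层对"慢权重减去快权重"这一伪梯度施加 Nesterov 动量:

```python
import torch

def snoo_step(model, slow_params, momentum_buf, inner_opt, loss_fn,
              data_iter, k=8, outer_lr=0.7, beta=0.9):
    """SNOO(Step-K Nesterov Outer Optimizer)的示意实现(非官方代码)。"""
    for _ in range(k):                          # 内层:K 步快速权重更新
        x, y = next(data_iter)
        inner_opt.zero_grad()
        loss_fn(model(x), y).backward()
        inner_opt.step()
    with torch.no_grad():
        for p, slow, buf in zip(model.parameters(), slow_params, momentum_buf):
            pseudo_grad = slow - p              # 伪梯度:慢权重 - 快权重
            buf.mul_(beta).add_(pseudo_grad)    # 动量缓存累积
            slow.sub_(outer_lr * (pseudo_grad + beta * buf))  # Nesterov 式更新
            p.copy_(slow)                       # 快权重重置为新的慢权重
```

调用前需初始化 slow_params = [p.detach().clone() for p in model.parameters()] 与 momentum_buf = [torch.zeros_like(p) for p in model.parameters()]。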
zh
[AI-3] Chronos-2: From Univariate to Universal Forecasting
【速读】:该论文旨在解决现有预训练时间序列模型在多变量(multivariate)和协变量(covariate)驱动的预测任务中适用性不足的问题,尤其是在零样本(zero-shot)场景下缺乏通用性和性能表现。其核心解决方案是提出Chronos-2模型,其关键创新在于引入组注意力机制(group attention mechanism),通过在组内多个时间序列间高效共享信息,实现上下文学习(in-context learning, ICL),从而支持未见任务的即插即用式预测。该机制可灵活处理单变量、多变量及协变量相关的预测任务,且在合成数据上训练以模拟多样化的多变量结构,最终在多个基准测试中达到领先性能,验证了其作为通用预测模型的实用性。
链接: https://arxiv.org/abs/2510.15821
作者: Abdul Fatir Ansari,Oleksandr Shchur,Jaris Küken,Andreas Auer,Boran Han,Pedro Mercado,Syama Sundar Rangapuram,Huibin Shen,Lorenzo Stella,Xiyuan Zhang,Mononito Goswami,Shubham Kapoor,Danielle C. Maddix,Pablo Guerron,Tony Hu,Junming Yin,Nick Erickson,Prateek Mutalik Desai,Hao Wang,Huzefa Rangwala,George Karypis,Yuyang Wang,Michael Bohlke-Schneider
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Pretrained time series models have enabled inference-only forecasting systems that produce accurate predictions without task-specific training. However, existing approaches largely focus on univariate forecasting, limiting their applicability in real-world scenarios where multivariate data and covariates play a crucial role. We present Chronos-2, a pretrained model capable of handling univariate, multivariate, and covariate-informed forecasting tasks in a zero-shot manner. Chronos-2 employs a group attention mechanism that facilitates in-context learning (ICL) through efficient information sharing across multiple time series within a group, which may represent sets of related series, variates of a multivariate series, or targets and covariates in a forecasting task. These general capabilities are achieved through training on synthetic datasets that impose diverse multivariate structures on univariate series. Chronos-2 delivers state-of-the-art performance across three comprehensive benchmarks: fev-bench, GIFT-Eval, and Chronos Benchmark II. On fev-bench, which emphasizes multivariate and covariate-informed forecasting, Chronos-2’s universal ICL capabilities lead to substantial improvements over existing models. On tasks involving covariates, it consistently outperforms baselines by a wide margin. Case studies in the energy and retail domains further highlight its practical advantages. The in-context learning capabilities of Chronos-2 establish it as a general-purpose forecasting model that can be used “as is” in real-world forecasting pipelines.
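其中"组注意力"的掩码构造可用如下示意代码理解(非 Chronos-2 官方实现):仅允许同组序列(如同一多变量序列的各变量、或目标与协变量)之间共享信息:

```python
import torch

def group_attention_mask(group_ids: torch.Tensor) -> torch.Tensor:
    """组注意力掩码的示意(非官方实现):只允许同一组内的序列
    互相注意。group_ids: [num_series],每条序列所属组的编号。"""
    same_group = group_ids[:, None] == group_ids[None, :]  # [S, S] 布尔矩阵
    return torch.where(same_group, 0.0, float("-inf"))     # 加到注意力分数上

# 示例:序列 0、1 为一组(目标 + 协变量),序列 2 单独一组
print(group_attention_mask(torch.tensor([0, 0, 1])))
```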
zh
[AI-4] AB-UPT for Automotive and Aerospace Applications
【速读】:该论文旨在解决传统数值求解器在汽车计算流体动力学(Computational Fluid Dynamics, CFD)仿真中计算成本过高、效率低下的问题。为实现高效且高精度的CFD模拟,作者提出基于Anchored-Branched Universal Physics Transformers (AB-UPT) 的解决方案,其关键在于结合高质量数据生成与先进的神经网络代理模型(neural surrogates),利用Luminary Cloud平台生成包含汽车(SHIFT-SUV)和飞机(SHIFT-Wing)的高保真数据集,并通过轻量化的几何表示(如各向同性网格化结构)训练AB-UPT模型,使其能在单个GPU上一天内完成训练,并在数秒内近乎完美地预测集成气动载荷,显著优于此前基于Transformer的基准方法,具备工业级应用潜力。
链接: https://arxiv.org/abs/2510.15808
作者: Benedikt Alkin,Richard Kurle,Louis Serrano,Dennis Just,Johannes Brandstetter
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The recently proposed Anchored-Branched Universal Physics Transformers (AB-UPT) shows strong capabilities to replicate automotive computational fluid dynamics simulations requiring orders of magnitude less compute than traditional numerical solvers. In this technical report, we add two new datasets to the body of empirically evaluated use-cases of AB-UPT, combining high-quality data generation with state-of-the-art neural surrogates. Both datasets were generated with the Luminary Cloud platform, containing automobiles (SHIFT-SUV) and aircraft (SHIFT-Wing). We start by detailing the data generation. Next, we show favorable performance of AB-UPT against previous state-of-the-art transformer-based baselines on both datasets, followed by extensive qualitative and quantitative evaluations of our best AB-UPT model. AB-UPT shows strong performance across the board. Notably, it obtains near-perfect prediction of integrated aerodynamic forces within seconds from a simple isotropically tessellated geometry representation and is trainable within a day on a single GPU, paving the way for industry-scale applications.
zh
[AI-5] Demo: Guide-RAG : Evidence-Driven Corpus Curation for Retrieval-Augmented Generation in Long COVID ALT NEURIPS2025
【速读】:该论文旨在解决生成式 AI (Generative AI) 在临床医学中应对复杂新兴疾病(如长新冠 Long COVID)时,如何构建高效、可靠的知识检索与生成框架的问题。其核心挑战在于平衡知识的权威性与全面性,避免因信息过载或过度简化导致临床决策偏差。解决方案的关键在于提出 Guide-RAG 框架,通过整合专家审定的临床指南与高质量系统评价(systematic reviews)构成的检索增强生成(Retrieval-Augmented Generation, RAG)语料库,相较于单一指南或大规模文献数据库配置,在忠实性、相关性和全面性等指标上均表现更优,从而在保障临床实用性的同时有效规避噪声与片面性。
链接: https://arxiv.org/abs/2510.15782
作者: Philip DiGiacomo,Haoyang Wang,Jinrui Fang,Yan Leng,W Michael Brode,Ying Ding
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: The Second Workshop on GenAI for Health: Potential, Trust, and Policy Compliance
Abstract:As AI chatbots gain adoption in clinical medicine, developing effective frameworks for complex, emerging diseases presents significant challenges. We developed and evaluated six Retrieval-Augmented Generation (RAG) corpus configurations for Long COVID (LC) clinical question answering, ranging from expert-curated sources to large-scale literature databases. Our evaluation employed an LLM-as-a-judge framework across faithfulness, relevance, and comprehensiveness metrics using LongCOVID-CQ, a novel dataset of expert-generated clinical questions. Our RAG corpus configuration combining clinical guidelines with high-quality systematic reviews consistently outperformed both narrow single-guideline approaches and large-scale literature databases. Our findings suggest that for emerging diseases, retrieval grounded in curated secondary reviews provides an optimal balance between narrow consensus documents and unfiltered primary literature, supporting clinical decision-making while avoiding information overload and oversimplified guidance. We propose Guide-RAG, a chatbot system and accompanying evaluation framework that integrates both curated expert knowledge and comprehensive literature databases to effectively answer LC clinical questions.
zh
[AI-6] Self-evolving expertise in complex non-verifiable subject domains: dialogue as implicit meta-RL
【速读】:该论文旨在解决生成式 AI(Generative AI)在处理“棘手问题”(wicked problems)时缺乏通过经验自主发展专业能力的内生机制这一关键瓶颈。这类问题具有多维复杂性、结果不可验证性、影响异质性以及无单一客观正确答案等特征,典型场景包括司法体系设计、环境污染治理、疫情韧性规划和粮食安全等。为应对此挑战,作者提出 Dialectica 框架,其核心创新在于引入结构化对话机制,并结合记忆、自我反思与策略约束下的上下文编辑,将讨论过程形式化为隐式的元强化学习(meta-reinforcement learning)过程。实验表明,在 Qwen3:30b 和 o4-mini 两种模型架构上,启用基于反思的上下文编辑后,对话训练后的代理在 Elo 分数、归一化的 Bradley-Terry-Davidson 能力指标及 AlphaRank 质量分布上均显著优于基线模型,且定性分析显示反思能有效识别弱点并指导后续陈述优化,从而验证了对话驱动的上下文演化是开放非验证领域中实现目标专业化增强的有效路径。
链接: https://arxiv.org/abs/2510.15772
作者: Richard M. Bailey
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 50 pages, 4 figures
Abstract:So-called ‘wicked problems’, those involving complex multi-dimensional settings, non-verifiable outcomes, heterogeneous impacts and a lack of single objectively correct answers, have plagued humans throughout history. Modern examples include decisions over justice frameworks, solving environmental pollution, planning for pandemic resilience and food security. The use of state-of-the-art artificial intelligence systems (notably Large Language Model-based agents) collaborating with humans on solving such problems is being actively explored. While the abilities of LLMs can be improved by, for example, fine-tuning, hand-crafted system prompts and scaffolding with external tools, LLMs lack endogenous mechanisms to develop expertise through experience in such settings. This work addresses this gap with Dialectica, a framework where agents engage in structured dialogue on defined topics, augmented by memory, self-reflection, and policy-constrained context editing. Formally, discussion is viewed as an implicit meta-reinforcement learning process. The ‘dialogue-trained’ agents are evaluated post-hoc using judged pairwise comparisons of elicited responses. Across two model architectures (locally run Qwen3:30b and OpenAI’s o4-mini) results show that enabling reflection-based context editing during discussion produces agents which dominate their baseline counterparts on Elo scores, normalized Bradley-Terry-Davidson ability, and AlphaRank mass. The predicted signatures of learning are observed qualitatively in statement and reflection logs, where reflections identify weaknesses and reliably shape subsequent statements. Agreement between quantitative and qualitative evidence supports dialogue-driven context evolution as a practical path to targeted expertise amplification in open non-verifiable domains.
zh
[AI-7] Preliminary Quantitative Study on Explainability and Trust in AI Systems
【速读】:该论文试图解决的问题是:在生成式 AI(Generative AI)广泛应用背景下,如何通过可解释性(Explainability)设计提升用户对人工智能系统的信任度。解决方案的关键在于通过定量实验设计,验证不同类型的解释(从基础的特征重要性到交互式的反事实解释)对用户感知信任的影响,结果表明,交互性能够显著增强用户参与度和信心,且解释的清晰度与相关性是决定信任的核心因素。
链接: https://arxiv.org/abs/2510.15769
作者: Allen Daniel Sunny
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 8 pages, 3 figures, 2 appendices. Quantitative user study on AI explainability and trust. Preprint, 2025
Abstract:Large-scale AI models such as GPT-4 have accelerated the deployment of artificial intelligence across critical domains including law, healthcare, and finance, raising urgent questions about trust and transparency. This study investigates the relationship between explainability and user trust in AI systems through a quantitative experimental design. Using an interactive, web-based loan approval simulation, we compare how different types of explanations, ranging from basic feature importance to interactive counterfactuals, influence perceived trust. Results suggest that interactivity enhances both user engagement and confidence, and that the clarity and relevance of explanations are key determinants of trust. These findings contribute empirical evidence to the growing field of human-centered explainable AI, highlighting measurable effects of explainability design on user perception.
zh
[AI-8] owards Relaxed Multimodal Inputs for Gait-based Parkinsons Disease Assessment
【速读】:该论文旨在解决帕金森病(Parkinson’s disease)评估中多模态学习方法的两个关键限制:一是训练时需同步所有模态数据,二是推理时依赖全部模态输入,这严重制约了系统的灵活性与实用性。其解决方案的核心在于将多模态学习建模为多目标优化(multi-objective optimization, MOO)问题,从而在训练和推理阶段均支持灵活的模态使用策略,并有效缓解多模态信息融合过程中的模态坍塌(modality collapse)问题;此外,通过引入基于间隔的类别重平衡策略(margin-based class rebalancing strategy),进一步改善单一模态内部类别不平衡对模型性能的影响。实验表明,所提出的TRIP框架在异步和同步设置下均显著优于现有基线方法。
链接: https://arxiv.org/abs/2510.15748
作者: Minlin Zeng,Zhipeng Zhou,Yang Qiu,Zhiqi Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Parkinson’s disease assessment has garnered growing interest in recent years, particularly with the advent of sensor data and machine learning techniques. Among these, multimodal approaches have demonstrated strong performance by effectively integrating complementary information from various data sources. However, two major limitations hinder their practical application: (1) the need to synchronize all modalities during training, and (2) the dependence on all modalities during inference. To address these issues, we propose the first Parkinson’s assessment system that formulates multimodal learning as a multi-objective optimization (MOO) problem. This not only allows for more flexible modality requirements during both training and inference, but also handles modality collapse issue during multimodal information fusion. In addition, to mitigate the imbalance within individual modalities, we introduce a margin-based class rebalancing strategy to enhance category learning. We conduct extensive experiments on three public datasets under both synchronous and asynchronous settings. The results show that our framework-Towards Relaxed InPuts (TRIP)-achieves state-of-the-art performance, outperforming the best baselines by 16.48, 6.89, and 11.55 percentage points in the asynchronous setting, and by 4.86 and 2.30 percentage points in the synchronous setting, highlighting its effectiveness and adaptability.
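摘要提到的"基于间隔的类别重平衡"未给出具体形式;下面借鉴 LDAM 式的做法给出一个假设性草图(少数类获得更大的间隔),仅作示意,并非论文原方法:

```python
import torch
import torch.nn.functional as F

def margin_rebalanced_loss(logits, targets, class_counts, scale=0.5):
    """基于间隔的类别重平衡损失示意(借鉴 LDAM 思路,非论文官方形式):
    将每个样本真实类别的 logit 减去与类频率成反比的间隔,
    使少数类获得更大的间隔,从而缓解类别不平衡。"""
    margins = scale / class_counts.float() ** 0.25     # 每类间隔,少数类更大
    adjusted = logits.clone()
    adjusted[torch.arange(len(targets)), targets] -= margins[targets]
    return F.cross_entropy(adjusted, targets)
```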
zh
[AI-9] AURA: An Agent Autonomy Risk Assessment Framework AAMAS2026
【速读】:该论文旨在解决自主代理型人工智能(Agentic AI)在组织中规模化部署时面临的对齐(alignment)、治理(governance)与风险管理挑战。其核心解决方案是提出AURA(Agent aUtonomy Risk Assessment)框架,该框架的关键在于引入基于gamma的风险评分方法,能够在风险评估准确性、计算效率及实际可操作性之间取得平衡;同时通过人机协同(Human-in-the-Loop, HITL)机制和Agent-to-Human(A2H)通信接口,实现对单个或多个AI代理的同步或异步风险量化与缓解,并兼容现有协议(如MCP和A2A),从而保障企业级环境中可治理、透明且可扩展的代理型AI应用。
链接: https://arxiv.org/abs/2510.15739
作者: Lorenzo Satta Chiris(University of Exeter, United Kingdom),Ayush Mishra(University of Exeter, United Kingdom)
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 10 pages, 2 figures. Submitted for open-access preprint on arXiv. Based on the AAMAS 2026 paper template
Abstract:As autonomous agentic AI systems see increasing adoption across organisations, persistent challenges in alignment, governance, and risk management threaten to impede deployment at scale. We present AURA (Agent aUtonomy Risk Assessment), a unified framework designed to detect, quantify, and mitigate risks arising from agentic AI. Building on recent research and practical deployments, AURA introduces a gamma-based risk scoring methodology that balances risk assessment accuracy with computational efficiency and practical considerations. AURA provides an interactive process to score, evaluate and mitigate the risks of running one or multiple AI Agents, synchronously or asynchronously (autonomously). The framework is engineered for Human-in-the-Loop (HITL) oversight and presents Agent-to-Human (A2H) communication mechanisms, allowing for seamless integration with agentic systems for autonomous self-assessment, rendering it interoperable with established protocols (MCP and A2A) and tools. AURA supports a responsible and transparent adoption of agentic AI and provides robust risk detection and mitigation while balancing computational resources, positioning it as a critical enabler for large-scale, governable agentic AI in enterprise environments.
zh
[AI-10] RLAF: Reinforcement Learning from Automaton Feedback
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在具有复杂历史依赖性奖励结构的环境中所面临的挑战,即传统方法难以有效处理非马尔可夫奖励(non-Markovian rewards)的问题。其解决方案的关键在于引入基于自动机(automaton-based)的偏好反馈机制,利用确定性有限自动机(Deterministic Finite Automaton, DFA)的结构生成轨迹偏好,从而替代人工设计的显式奖励函数。该方法通过将DFA转化为对轨迹的偏好关系来学习隐式的奖励函数,既避免了繁琐的手动奖励工程,又能够有效建模时间依赖任务;同时提出了静态与动态两种策略,前者直接使用学习到的奖励进行策略优化,后者通过迭代更新奖励函数和策略直至收敛,实验证明该框架在离散和连续环境中均能学习出高效策略,并提供理论收敛保证,表明其在处理非马尔可夫目标时具有可扩展性、高效性和人类无关性优势。
链接: https://arxiv.org/abs/2510.15728
作者: Mahyar Alinejad,Alvaro Velasquez,Yue Wang,George Atia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement Learning (RL) in environments with complex, history-dependent reward structures poses significant challenges for traditional methods. In this work, we introduce a novel approach that leverages automaton-based feedback to guide the learning process, replacing explicit reward functions with preferences derived from a deterministic finite automaton (DFA). Unlike conventional approaches that use automata for direct reward specification, our method employs the structure of the DFA to generate preferences over trajectories that are used to learn a reward function, eliminating the need for manual reward engineering. Our framework introduces a static approach that uses the learned reward function directly for policy optimization and a dynamic approach that involves continuous refining of the reward function and policy through iterative updates until convergence. Our experiments in both discrete and continuous environments demonstrate that our approach enables the RL agent to learn effective policies for tasks with temporal dependencies, outperforming traditional reward engineering and automaton-based baselines such as reward machines and LTL-guided methods. Our results highlight the advantages of automaton-based preferences in handling non-Markovian rewards, offering a scalable, efficient, and human-independent alternative to traditional reward modeling. We also provide a convergence guarantee showing that under standard assumptions our automaton-guided preference-based framework learns a policy that is near-optimal with respect to the true non-Markovian objective.
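"由 DFA 结构生成轨迹偏好"这一步可用如下极简 Python 草图理解(非官方实现;以是否被接受、其次以状态推进程度作为偏好依据,均为示意性设计):

```python
class DFA:
    """一个极简 DFA,用于从轨迹导出偏好(示意,非论文官方实现)。"""
    def __init__(self, transitions, start, accepting):
        self.t, self.start, self.acc = transitions, start, accepting

    def progress(self, trajectory):
        """返回 (是否接受, 经过的不同状态数),粗略度量任务完成程度。"""
        state, visited = self.start, {self.start}
        for symbol in trajectory:
            state = self.t.get((state, symbol), state)  # 缺失转移视为自环
            visited.add(state)
        return state in self.acc, len(visited)

def prefer(dfa, traj_a, traj_b):
    """由 DFA 结构生成轨迹偏好:接受优先,其次比较推进程度;
    得到的偏好对随后用于学习奖励函数(返回 1 表示偏好 traj_a)。"""
    a_acc, a_prog = dfa.progress(traj_a)
    b_acc, b_prog = dfa.progress(traj_b)
    if a_acc != b_acc:
        return 1 if a_acc else 0
    return 1 if a_prog >= b_prog else 0

# 示例:任务 "先到 a 再到 b"
dfa = DFA({(0, "a"): 1, (1, "b"): 2}, start=0, accepting={2})
print(prefer(dfa, ["a", "b"], ["b", "a"]))  # 1:第一条轨迹满足时序要求
```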
zh
[AI-11] Invoice Information Extraction: Methods and Performance Evaluation
【速读】:该论文旨在解决从发票文档中自动提取结构化信息的准确性问题,尤其针对扫描或数字格式发票在实际应用中的可靠性和标准化评估难题。解决方案的关键在于构建一个基于字段级精度(field-level precision)、一致性检查失败率和精确匹配准确率的综合评价体系(evaluation metrics, EM),并结合Docling与LlamaCloud Services实现关键字段(如发票号、日期、金额和供应商信息)的识别与抽取,从而为不同提取方法提供可比较的性能基准,并揭示各字段层面的优劣表现。
链接: https://arxiv.org/abs/2510.15727
作者: Sai Yashwant,Anurag Dubey,Praneeth Paikray,Gantala Thulsiram
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:
Abstract:This paper presents methods for extracting structured information from invoice documents and proposes a set of evaluation metrics (EM) to assess the accuracy of the extracted data against annotated ground truth. The approach involves pre-processing scanned or digital invoices, applying Docling and LlamaCloud Services to identify and extract key fields such as invoice number, date, total amount, and vendor details. To ensure the reliability of the extraction process, we establish a robust evaluation framework comprising field-level precision, consistency check failures, and exact match accuracy. The proposed metrics provide a standardized way to compare different extraction methods and highlight strengths and weaknesses in field-specific performance.
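其中"字段级精确匹配率"可按如下方式直观计算(示意实现,字段名与数据结构均为假设):

```python
def exact_match_accuracy(extracted: list, annotated: list, fields) -> dict:
    """字段级精确匹配率的示意计算:对每个字段统计抽取值
    与人工标注完全一致的样本比例(指标细节以论文为准)。"""
    return {
        f: sum(e.get(f) == a.get(f) for e, a in zip(extracted, annotated))
           / len(annotated)
        for f in fields
    }

docs_pred = [{"invoice_number": "INV-001", "total": "120.00"}]
docs_gold = [{"invoice_number": "INV-001", "total": "120.50"}]
print(exact_match_accuracy(docs_pred, docs_gold, ["invoice_number", "total"]))
```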
zh
[AI-12] ProSh: Probabilistic Shielding for Model-free Reinforcement Learning
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)系统在实际部署中的安全性问题,即如何在保证最优性能的同时提供形式化的安全约束保障。其解决方案的关键在于提出一种无需环境模型的“风险预算增强概率屏蔽”(Probabilistic Shielding via Risk Augmentation, ProSh)算法:通过在约束马尔可夫决策过程(Constrained MDP)的状态空间中引入风险预算,并利用学习到的成本评论器(cost critic)对智能体策略分布施加屏蔽(shield),确保所有采样动作在期望意义上保持安全。该方法在确定性环境中可保持最优性,且在训练阶段即可提供严格的期望成本上界,仅依赖于备份评论器(backup-critic)的准确性,从而在合理假设下实现训练时的安全性保障。
链接: https://arxiv.org/abs/2510.15720
作者: Edwin Hamel-De le Court,Gaspard Ohlmann,Francesco Belardinelli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Safety is a major concern in reinforcement learning (RL): we aim at developing RL systems that not only perform optimally, but are also safe to deploy by providing formal guarantees about their safety. To this end, we introduce Probabilistic Shielding via Risk Augmentation (ProSh), a model-free algorithm for safe reinforcement learning under cost constraints. ProSh augments the Constrained MDP state space with a risk budget and enforces safety by applying a shield to the agent’s policy distribution using a learned cost critic. The shield ensures that all sampled actions remain safe in expectation. We also show that optimality is preserved when the environment is deterministic. Since ProSh is model-free, safety during training depends on the knowledge we have acquired about the environment. We provide a tight upper-bound on the cost in expectation, depending only on the backup-critic accuracy, that is always satisfied during training. Under mild, practically achievable assumptions, ProSh guarantees safety even at training time, as shown in the experiments.
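屏蔽(shield)一步的核心逻辑可用如下 PyTorch 草图示意(非官方实现):用已学成本评论器的估计值过滤超出剩余风险预算的动作,再对概率重新归一化:

```python
import torch

def shield_policy(logits, cost_q, budget):
    """ProSh 式概率屏蔽的示意(非官方实现):
    logits: [A] 策略打分;cost_q: [A] 成本评论器对各动作期望累计成本的估计;
    budget: 标量,状态增广中携带的剩余风险预算。"""
    probs = torch.softmax(logits, dim=-1)
    safe = cost_q <= budget                  # 期望成本不超预算的动作
    if not safe.any():
        safe = cost_q == cost_q.min()        # 全部超标时选成本最低者
    masked = probs * safe.float()
    return masked / masked.sum()             # 屏蔽后的安全策略分布

dist = shield_policy(torch.randn(4), torch.tensor([0.1, 0.5, 0.9, 0.2]), 0.3)
```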
zh
[AI-13] Direct Preference Optimization with Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences
【速读】:该论文旨在解决当前生成式 AI(Generative AI)对齐方法中存在的两个关键问题:一是忽略了人类评估者之间的偏好差异,二是依赖于二元比较反馈,导致无法准确识别潜在用户偏好。针对这些问题,论文提出两个核心解决方案:首先,通过引入经济学计量学中的偏好学习框架,证明仅使用二元比较难以从有限数据中识别出无限用户群体的隐含偏好,而三元及以上响应的排序则能确保偏好的可识别性;其次,开发了一种基于期望最大化(Expectation-Maximization)的DPO改进算法,用于发现隐式标注者类型并训练混合语言模型以实现个性化对齐,同时提出一种基于最小最大后悔公平准则的聚合算法,确保生成策略在不同用户群体间具有公平性能保障。
链接: https://arxiv.org/abs/2510.15716
作者: Keertana Chidambaram,Karthik Vinary Seetharaman,Vasilis Syrgkanis
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values, typically by first learning a reward model from preference data which is then used to update the model with reinforcement learning. Recent alternatives such as Direct Preference Optimization (DPO) simplify this pipeline by directly optimizing on preferences. However, both approaches often assume uniform annotator preferences and rely on binary comparisons, overlooking two key limitations: the diversity of human evaluators and the limitations of pairwise feedback. In this work, we address both these issues. First, we connect preference learning in RLHF with the econometrics literature and show that binary comparisons are insufficient for identifying latent user preferences from finite user data and infinite users, while (even incomplete) rankings over three or more responses ensure identifiability. Second, we introduce methods to incorporate heterogeneous preferences into alignment algorithms. We develop an Expectation-Maximization adaptation of DPO that discovers latent annotator types and trains a mixture of LLMs accordingly. Then we propose an aggregation algorithm using a min-max regret fairness criterion to produce a single generative policy with equitable performance guarantees. Together, these contributions establish a theoretical and algorithmic framework for fairness and personalization for diverse users in generative model alignment.
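其中"EM 发现隐式标注者类型"的骨架可示意如下(高度简化的草图:假设各类型策略下每条偏好数据的对数似然已由 DPO 隐式奖励算得并固定;完整算法的 M 步还需以责任度为权重对各类型策略本身做加权 DPO 更新):

```python
import numpy as np

def em_dpo_sketch(pref_loglik, n_types, iters=50):
    """EM 版 DPO 的示意(非论文官方实现)。
    pref_loglik: [K, N] 数组,pref_loglik[k, i] 为
    '若标注者属于类型 k,第 i 条偏好数据的对数似然'(假设给定)。"""
    n = pref_loglik.shape[1]
    pi = np.full(n_types, 1.0 / n_types)           # 类型先验
    for _ in range(iters):
        # E 步:责任度 r[k, i] 正比于 pi[k] * p(偏好_i | 类型 k)
        log_r = np.log(pi)[:, None] + pref_loglik
        log_r -= log_r.max(axis=0, keepdims=True)  # 数值稳定
        r = np.exp(log_r)
        r /= r.sum(axis=0, keepdims=True)
        pi = r.sum(axis=1) / n                     # M 步(此处仅更新先验)
    return r, pi
```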
zh
[AI-14] Beyond-Diagonal RIS Under Non-Idealities: Learning-Based Architecture Discovery and Optimization
【速读】:该论文旨在解决非理想状态下的超越对角线可重构智能表面(Beyond-diagonal Reconfigurable Intelligent Surface, BD-RIS)在架构设计中面临的性能与电路复杂度之间的权衡问题。由于现有研究主要聚焦于理想BD-RIS的低复杂度最优架构,而对非理想因素(如硬件失真、控制误差等)如何影响性能及架构选择尚缺乏系统分析,导致难以实现实际部署中的性能-复杂度折衷。为此,作者提出一种基于学习的两层架构发现框架(Learning-based Two-tier Architecture Discovery Framework, LTTADF),其关键在于通过一个架构生成器与一个性能优化器协同工作,在给定电路复杂度约束下高效探索大规模架构空间,避免陷入局部最优,并获得接近全局最优的BD-RIS架构设计方案。
链接: https://arxiv.org/abs/2510.15701
作者: Binggui Zhou,Bruno Clerckx
机构: 未知
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: 13 pages, 13 figures, 1 table. This paper has been submitted to IEEE journal for possible publication
Abstract:Beyond-diagonal reconfigurable intelligent surface (BD-RIS) has recently been introduced to enable advanced control over electromagnetic waves to further increase the benefits of traditional RIS in enhancing signal quality and improving spectral and energy efficiency for next-generation wireless networks. A significant issue in designing and deploying BD-RIS is the tradeoff between its performance and circuit complexity. Despite some efforts in exploring optimal architectures with the lowest circuit complexities for ideal BD-RIS, architecture discovery for non-ideal BD-RIS remains uninvestigated. Therefore, how non-idealities and circuit complexity jointly affect the performance of BD-RIS remains unclear, making it difficult to achieve the performance - circuit complexity tradeoff in the presence of non-idealities. Essentially, architecture discovery for non-ideal BD-RIS faces challenges from both the computational complexity of global architecture search and the difficulty in achieving global optima. To tackle these challenges, we propose a learning-based two-tier architecture discovery framework (LTTADF) consisting of an architecture generator and a performance optimizer to jointly discover optimal architectures of non-ideal BD-RIS given specific circuit complexities, which can effectively explore over a large architecture space while avoiding getting trapped in poor local optima and thus achieving near-optimal solutions for the performance optimization. Numerical results provide valuable insights for deploying non-ideal BD-RIS considering the performance - circuit complexity tradeoff.
zh
[AI-15] ProofOptimizer: Training Language Models to Simplify Proofs without Human Demonstrations
【速读】:该论文旨在解决神经定理证明中生成的正式证明(formal proofs)过于冗长的问题,这类证明虽然可通过Lean等形式系统验证,但其数千行的长度阻碍了人类理解,并限制了数学洞察力的获取。为应对这一瓶颈,作者提出ProofOptimizer——首个无需额外人工标注即可简化Lean证明的语言模型。其核心创新在于采用专家迭代(expert iteration)与强化学习相结合的训练范式,利用Lean自身作为验证器提供反馈信号,从而在推理阶段通过迭代式缩短证明流程实现高效压缩。实验表明,ProofOptimizer能显著减少当前最优强化学习训练的定理证明器所生成证明的长度(如miniF2F上降低87%),同时提升验证效率并增强下游证明器的性能。
链接: https://arxiv.org/abs/2510.15700
作者: Alex Gu,Bartosz Piotrowski,Fabian Gloeckle,Kaiyu Yang,Aram H. Markosyan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: 52 pages, 16 figures, website: this http URL
Abstract:Neural theorem proving has advanced rapidly in the past year, reaching IMO gold-medalist capabilities and producing formal proofs that span thousands of lines. Although such proofs are mechanically verified by formal systems like Lean, their excessive length renders them difficult for humans to comprehend and limits their usefulness for mathematical insight. Proof simplification is therefore a critical bottleneck. Yet, training data for this task is scarce, and existing methods – mainly agentic scaffolding with off-the-shelf LLMs – struggle with the extremely long proofs generated by RL-trained provers. We introduce ProofOptimizer, the first language model trained to simplify Lean proofs without requiring additional human supervision. ProofOptimizer is trained via expert iteration and reinforcement learning, using Lean to verify simplifications and provide training signal. At inference time, it operates within an iterative proof-shortening workflow, progressively reducing proof length. Experiments show that ProofOptimizer substantially compresses proofs generated by state-of-the-art RL-trained provers on standard benchmarks, reducing proof length by 87% on miniF2F, 57% on PutnamBench, and 49% on Seed-Prover’s IMO 2025 proofs. Beyond conciseness, the simplified proofs check faster in Lean and further improve downstream prover performance when reused as training data for supervised finetuning.
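推理阶段的迭代缩短流程可示意如下(非官方实现,propose_simplification 与 lean_check 为假设接口,分别代表模型提出简化证明与调用 Lean 验证):

```python
def shorten_proof(proof, propose_simplification, lean_check, max_rounds=8):
    """ProofOptimizer 推理阶段迭代缩短流程的示意(非官方实现):
    每轮让模型提出简化版证明,用 Lean 验证器检查,
    仅在通过验证且长度变短时接受。"""
    for _ in range(max_rounds):
        candidate = propose_simplification(proof)
        if lean_check(candidate) and len(candidate) < len(proof):
            proof = candidate          # 接受更短且可验证的证明
        else:
            break                      # 无法继续简化则停止
    return proof
```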
zh
[AI-16] KS-Net: Multi-layer network model for determining the rotor type from motor parameters in interior PMSMs
【速读】:该论文旨在解决传统 Interior Permanent Magnet Synchronous Motor (IPMSM) 转子形状(2D型、V型、Nabla型)分析依赖高计算成本的有限元法(Finite Element Method, FEM)的问题。其解决方案的关键在于利用机器学习方法,基于电磁参数对转子形状进行分类预测,通过构建一个自定义的深度学习模型 KS-Net 并与多种经典算法(如 Cubic SVM、Quadratic SVM 等)进行对比验证。实验结果表明,Cubic SVM 和 Quadratic SVM 实现了100%准确率,KS-Net 达到99.98%准确率,证明数据驱动方法可作为高效、低成本的替代方案,显著提升电机设计效率并支持自动化识别与故障诊断等工程应用。
链接: https://arxiv.org/abs/2510.15688
作者: Kivanc Dogan,Ahmet Orhan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This study was presented at the 3rd International Conference on Advances and Innovations in Engineering (ICAIE) and published in the conference proceedings
Abstract:The demand for high efficiency and precise control in electric drive systems has led to the widespread adoption of Interior Permanent Magnet Synchronous Motors (IPMSMs). The performance of these motors is significantly influenced by rotor geometry. Traditionally, rotor shape analysis has been conducted using the finite element method (FEM), which involves high computational costs. This study aims to classify the rotor shape (2D type, V type, Nabla type) of IPMSMs using electromagnetic parameters through machine learning-based methods and to demonstrate the applicability of this approach as an alternative to classical methods. In this context, a custom deep learning model, KS-Net, developed by the user, was comparatively evaluated against Cubic SVM, Quadratic SVM, Fine KNN, Cosine KNN, and Fine Tree algorithms. The balanced dataset, consisting of 9,000 samples, was tested using 10-fold cross-validation, and performance metrics such as accuracy, precision, recall, and F1-score were employed. The results indicate that the Cubic SVM and Quadratic SVM algorithms classified all samples flawlessly, achieving 100% accuracy, while the KS-Net model achieved 99.98% accuracy with only two misclassifications, demonstrating competitiveness with classical methods. This study shows that the rotor shape of IPMSMs can be predicted with high accuracy using data-driven approaches, offering a fast and cost-effective alternative to FEM-based analyses. The findings provide a solid foundation for accelerating motor design processes, developing automated rotor identification systems, and enabling data-driven fault diagnosis in engineering applications.
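文中对比的"Cubic SVM"即三次多项式核 SVM;其 10 折交叉验证评估流程可用 scikit-learn 复现如下(数据为占位的随机样本,特征维度等均为假设):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# 占位数据:X 为电磁参数特征,y 为三类转子标签(论文实际使用 9000 条样本)
rng = np.random.default_rng(0)
X = rng.normal(size=(900, 8))
y = rng.integers(0, 3, size=900)

# "Cubic SVM" 即三次多项式核 SVM(超参细节为假设),10 折交叉验证评估
clf = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=3))
print(cross_val_score(clf, X, y, cv=10).mean())
```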
zh
[AI-17] Mixture of Experts Approaches in Dense Retrieval Tasks
【速读】:该论文旨在解决密集检索模型(Dense Retrieval Models, DRMs)在训练任务和领域之外泛化能力不足的问题。现有方法通过在每个Transformer层中引入专家混合(Mixture-of-Experts, MoE)框架提升性能,但显著增加了额外参数量。本文提出一种更高效的方案——在最终Transformer层之后插入一个单MoE模块(Single MoE block, SB-MoE),从而在不大幅增加参数负担的前提下增强模型的泛化能力。关键创新在于将MoE机制集中于模型末尾,而非嵌入每一层,既保留了MoE的灵活性与表达能力,又提升了轻量级基础模型(如TinyBERT、BERT-Small)在多个信息检索任务中的零样本迁移表现。
链接: https://arxiv.org/abs/2510.15683
作者: Effrosyni Sokli,Pranav Kasela,Georgios Peikos,Gabriella Pasi
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures, 3 tables, reproducible code available at this https URL , Accepted for publication in Proceedings of the 2025 IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT 2025)
Abstract:Dense Retrieval Models (DRMs) are a prominent development in Information Retrieval (IR). A key challenge with these neural Transformer-based models is that they often struggle to generalize beyond the specific tasks and domains they were trained on. To address this challenge, prior research in IR incorporated the Mixture-of-Experts (MoE) framework within each Transformer layer of a DRM, which, though effective, substantially increased the number of additional parameters. In this paper, we propose a more efficient design, which introduces a single MoE block (SB-MoE) after the final Transformer layer. To assess the retrieval effectiveness of SB-MoE, we perform an empirical evaluation across three IR tasks. Our experiments involve two evaluation setups, aiming to assess both in-domain effectiveness and the model’s zero-shot generalizability. In the first setup, we fine-tune SB-MoE with four different underlying DRMs on seven IR benchmarks and evaluate them on their respective test sets. In the second setup, we fine-tune SB-MoE on MSMARCO and perform zero-shot evaluation on thirteen BEIR datasets. Additionally, we perform further experiments to analyze the model’s dependency on its hyperparameters (i.e., the number of employed and activated experts) and investigate how this variation affects SB-MoE’s performance. The obtained results show that SB-MoE is particularly effective for DRMs with lightweight base models, such as TinyBERT and BERT-Small, consistently exceeding standard model fine-tuning across benchmarks. For DRMs with more parameters, such as BERT-Base and Contriever, our model requires a larger number of training samples to achieve improved retrieval performance. Our code is available online at: this https URL.
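"在最后一层 Transformer 之后插入单个 MoE 块"的结构可用如下 PyTorch 草图示意(非官方实现;专家结构与 top-k 门控方式均为假设):

```python
import torch
import torch.nn as nn

class SBMoE(nn.Module):
    """单 MoE 块(SB-MoE)的示意实现(非官方代码):放在最后一层
    Transformer 之后,门控网络为每个 token 选出 top-k 个专家并加权求和。"""
    def __init__(self, dim: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, h: torch.Tensor) -> torch.Tensor:   # h: [B, S, D]
        w, idx = self.gate(h).topk(self.top_k, dim=-1)    # 门控打分取 top-k
        w = torch.softmax(w, dim=-1)                      # [B, S, K]
        expert_out = torch.stack([e(h) for e in self.experts], dim=2)  # [B,S,E,D]
        idx = idx.unsqueeze(-1).expand(-1, -1, -1, h.size(-1))         # [B,S,K,D]
        picked = torch.gather(expert_out, 2, idx)                      # 取被选专家
        return (w.unsqueeze(-1) * picked).sum(dim=2)       # 加权组合

block = SBMoE(dim=384)
out = block(torch.randn(2, 16, 384))
```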
zh
[AI-18] ProofBridge: Auto-Formalization of Natural Language Proofs in Lean via Joint Embeddings
【速读】:该论文旨在解决自然语言(Natural Language, NL)数学定理与证明自动形式化(auto-formalization)为形式语言(Formal Language, FL)如Lean 4的问题,这一任务长期以来是AI领域的挑战。传统方法通常采用两步策略:先翻译定理再生成证明,导致语义断层,无法实现真正的端到端自动形式化,例如AlphaProof在2024年IMO竞赛中的银牌表现即受限于人工翻译问题。其解决方案的关键在于提出ProofBridge框架,该框架通过一个联合嵌入模型(joint embedding model)将NL与FL的定理-证明对映射至共享语义空间中,确保语义等价的NL-FL对在该空间中距离相近;同时结合检索增强微调(retrieval-augmented fine-tuning)与迭代证明修复机制(iterative proof repair),利用Lean的类型检查器和语义一致性反馈来保障语法正确性和语义保真度,从而显著提升形式化质量。
链接: https://arxiv.org/abs/2510.15681
作者: Prithwish Jana,Kaan Kale,Ahmet Ege Tanriverdi,Cruise Song,Sriram Vishwanath,Vijay Ganesh
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注:
Abstract:Translating human-written mathematical theorems and proofs from natural language (NL) into formal languages (FLs) like Lean 4 has long been a significant challenge for AI. Most state-of-the-art methods address this separately, first translating theorems and then generating proofs, creating a fundamental disconnect vis-a-vis true proof auto-formalization. This two-step process and its limitations were evident even in AlphaProof’s silver-medal performance at the 2024 IMO, where problem statements needed manual translation before automated proof synthesis. We present ProofBridge, a unified framework for automatically translating entire NL theorems and proofs into Lean 4. At its core is a joint embedding model that aligns NL and FL (NL-FL) theorem-proof pairs in a shared semantic space, enabling cross-modal retrieval of semantically relevant FL examples to guide translation. Our training ensures that NL-FL theorems (and their proofs) are mapped close together in this space if and only if the NL-FL pairs are semantically equivalent. ProofBridge integrates retrieval-augmented fine-tuning with iterative proof repair, leveraging Lean’s type checker and semantic equivalence feedback to ensure both syntactic correctness and semantic fidelity. Experiments show substantial improvements in proof auto-formalization over strong baselines (including GPT-5, Gemini-2.5, Kimina-Prover, DeepSeek-Prover), with our retrieval-augmented approach yielding significant gains in semantic correctness (SC, via proving bi-directional equivalence) and type correctness (TC, via type-checking theorem+proof) across pass@k metrics on miniF2F-Test-PF, a dataset we curated. In particular, ProofBridge improves cross-modal retrieval quality by up to 3.28x Recall@1 over all-MiniLM-L6-v2, and achieves +31.14% SC and +1.64% TC (pass@32) compared to the baseline Kimina-Prover-RL-1.7B.
zh
[AI-19] CarBoN: Calibrated Best-of-N Sampling Improves Test-time Reasoning
【速读】:该论文旨在解决推理任务中因测试时计算资源增加(test-time scaling)而出现的效率低下问题,特别是现有方法如Best-of-N采样在N增大时收益递减的现象。其解决方案的关键在于提出了一种通用的测试时校准框架CarBoN(Calibrated Best-of-N),该框架通过两阶段策略实现:第一阶段探索解空间,第二阶段基于输入特定的温度参数T和加性偏移向量δ对logits进行校准,从而引导生成更可靠的推理路径。该方法在不重新训练大语言模型(LLM)的前提下,理论上保证了有限采样下期望奖励的下界提升,并显著减少了达到相同准确率所需的rollouts数量(最多减少4倍)。
链接: https://arxiv.org/abs/2510.15674
作者: Yung-Chen Tang,Pin-Yu Chen,Andrea Cavallaro
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Allocating more computation during inference time (test-time scaling) improves language model performance, especially for reasoning tasks. However, popular methods like Best-of-N sampling often show diminishing returns as N increases. To address this inefficiency, we introduce a general test-time calibration framework that adaptively modifies the model toward high-reward reasoning paths, with theoretical guarantees of improving the lower bound of expected reward under finite sampling, all without large language model (LLM) retraining. Within this framework, we propose CarBoN (Calibrated Best-of-N), a two-phase method that first explores the solution space and then learns a calibration of the logits via an input-specific temperature T and additive shift vector δ, guiding generation toward more reliable reasoning. Experiments on MATH-500 and AIME-2024 show that CarBoN improves efficiency, with up to 4x fewer rollouts to reach the same accuracy, while often achieving higher accuracy under fixed budgets. We also analyze the complementary roles of T and δ in balancing output diversity and correctness, and demonstrate that the framework also generalizes to step-level sampling strategies such as beam search. For more information, please refer to our project page at this http URL.
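第二阶段"对 logits 做温度与加性偏移校准"的一步可示意如下(非官方实现;δ 与 T 的学习过程此处省略,作为给定输入):

```python
import torch

def calibrated_sampling_step(logits, delta, T):
    """CarBoN 第二阶段校准的示意(非官方实现):对 logits 施加
    输入相关的加性偏移 delta 与温度 T 后再采样;delta、T 由
    第一阶段探索到的高奖励推理路径学习而来(此处作为给定输入)。"""
    calibrated = (logits + delta) / T           # 校准后的打分
    probs = torch.softmax(calibrated, dim=-1)
    return torch.multinomial(probs, num_samples=1)

next_token = calibrated_sampling_step(torch.randn(50257), torch.zeros(50257), 0.8)
```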
zh
[AI-20] Enhance Large Language Models as Recommendation Systems with Collaborative Filtering
【速读】:该论文试图解决现有非微调(non-tuning)策略在利用大语言模型(Large Language Models, LLMs)进行推荐时,缺乏任务特定业务知识的问题,尤其是未显式整合协同过滤(Collaborative Filtering, CF)这一经典且高效的推荐技术。解决方案的关键在于提出一种基于批判的LLM推荐系统(Critic-LLM-RS),其核心是训练一个独立的机器学习模型“Critic”,该模型通过学习用户与物品之间的交互数据来实现协同过滤,并将生成的批判性反馈(critiques)提供给LLM,从而显著提升推荐结果的准确性与相关性。
链接: https://arxiv.org/abs/2510.15647
作者: Zhisheng Yang,Xiaofei Xu,Ke Deng,Li Li
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:As powerful tools in Natural Language Processing (NLP), Large Language Models (LLMs) have been leveraged for crafting recommendations to achieve precise alignment with user preferences and elevate the quality of the recommendations. The existing approaches implement both non-tuning and tuning strategies. Compared to following the tuning strategy, the approaches following the non-tuning strategy avoid the relatively costly, time-consuming, and expertise-requiring process of further training pre-trained LLMs on task-specific datasets, but they suffer from a lack of task-specific business or local enterprise knowledge. To the best of our knowledge, none of the existing approaches following the non-tuning strategy explicitly integrates collaborative filtering, one of the most successful recommendation techniques. This study aims to fill the gap by proposing critique-based LLMs as recommendation systems (Critic-LLM-RS). For our purpose, we train a separate machine-learning model called Critic that implements collaborative filtering for recommendations by learning from the interactions between many users and items. The Critic provides critiques to LLMs to significantly refine the recommendations. Extensive experiments have verified the effectiveness of Critic-LLM-RS on real datasets.
zh
[AI-21] CQD-SHAP: Explainable Complex Query Answering via Shapley Values
【速读】:该论文旨在解决复杂查询回答(Complex Query Answering, CQA)中模型可解释性不足的问题,尤其是在基于不完整知识图谱(Knowledge Graph, KG)的多跳推理场景下,现有神经与神经符号方法多为黑箱模型,难以解释不同查询组成部分对最终答案排名的贡献。解决方案的关键在于提出CQD-SHAP框架,该框架基于合作博弈论中的Shapley值计算每个查询部分对特定答案排名的贡献度,从而实现对神经预测器在推理过程中引入新知识的价值进行量化解释,且满足所有Shapley公理,具备理论完备性和可解释性优势。
链接: https://arxiv.org/abs/2510.15623
作者: Parsa Abbasi,Stefan Heindorf
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Complex query answering (CQA) goes beyond the well-studied link prediction task by addressing more sophisticated queries that require multi-hop reasoning over incomplete knowledge graphs (KGs). Research on neural and neurosymbolic CQA methods is still an emerging field. Almost all of these methods can be regarded as black-box models, which may raise concerns about user trust. Although neurosymbolic approaches like CQD are slightly more interpretable, allowing intermediate results to be tracked, the importance of different parts of the query remains unexplained. In this paper, we propose CQD-SHAP, a novel framework that computes the contribution of each query part to the ranking of a specific answer. This contribution explains the value of leveraging a neural predictor that can infer new knowledge from an incomplete KG, rather than a symbolic approach relying solely on existing facts in the KG. CQD-SHAP is formulated based on Shapley values from cooperative game theory and satisfies all the fundamental Shapley axioms. Automated evaluation of these explanations in terms of necessary and sufficient explanations, and comparisons with various baselines, shows the effectiveness of this approach for most query types.
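对查询组成部分计算精确 Shapley 值的通用写法如下(示意代码;CQD-SHAP 中的价值函数对应"仅使用子集 S 的查询部分时目标答案的排名得分",此处以假设的 value 接口代替):

```python
from itertools import combinations
from math import factorial

def shapley_values(parts, value):
    """对查询各组成部分计算精确 Shapley 值(通用实现;
    价值函数 value(S) 为假设接口,返回只用子集 S 时某答案的排名得分)。"""
    n = len(parts)
    phi = {p: 0.0 for p in parts}
    for p in parts:
        others = [q for q in parts if q != p]
        for r in range(n):
            for S in combinations(others, r):
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                phi[p] += w * (value(set(S) | {p}) - value(set(S)))  # 边际贡献
    return phi

# 示例:两跳查询的两个部分,价值函数为假设的排名得分
v = lambda S: {frozenset(): 0.0, frozenset({"hop1"}): 0.3,
               frozenset({"hop2"}): 0.2, frozenset({"hop1", "hop2"}): 0.9}[frozenset(S)]
print(shapley_values(["hop1", "hop2"], v))  # 两部分贡献之和等于 0.9
```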
zh
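Since CQD-SHAP is formulated on exact Shapley values, the textbook computation is easy to illustrate. The sketch below assumes a toy value function over two query atoms (here: reciprocal rank of the target answer when only the atoms in a subset use the neural predictor); the paper's actual value function and query structure may differ.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values: value(frozenset) -> ranking-based utility."""
    n = len(players)
    phi = {}
    for p in players:
        rest = [q for q in players if q != p]
        total = 0.0
        for r in range(len(rest) + 1):
            for S in combinations(rest, r):
                S = frozenset(S)
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += w * (value(S | {p}) - value(S))
        phi[p] = total
    return phi

# Hypothetical 2-hop query with two atoms; utilities chosen for illustration.
v = {frozenset(): 0.1,
     frozenset({"atom1"}): 0.5,
     frozenset({"atom2"}): 0.25,
     frozenset({"atom1", "atom2"}): 1.0}
print(shapley_values(["atom1", "atom2"], lambda S: v[frozenset(S)]))
# -> atom1 ~ 0.575, atom2 ~ 0.325; they sum to v(N) - v(empty) = 0.9
```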
[AI-22] The Spark Effect: On Engineering Creative Diversity in Multi-Agent AI Systems
【速读】: This paper tackles the problem that creative-services teams using large language models (LLMs) for ideation often obtain homogeneous outputs under default system settings, failing to meet brand or artistic diversity requirements. The key to the solution is persona-conditioned LLM agents: a library of role-inspired system prompts instantiates "Spark" agents that deliberately inject behavioral diversity into a multi-agent workflow, enriching creative output. Experiments show a mean diversity gain of +4.1 points (on a 1-10 scale) over a uniform system prompt, narrowing the gap to human experts to just 1.0 point.
链接: https://arxiv.org/abs/2510.15568
作者: Alexander Doudkin,Anton Voelker,Friedrich von Borries
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 10 pages, 2 figures, 2 tables. This project was collaboratively developed with the Art of X UG (haftungsbeschraenkt) AI Research team and HFBK Hamburg, with initial funding from the Hamburg Open Online University (HOOU) program
Abstract:Creative services teams increasingly rely on large language models (LLMs) to accelerate ideation, yet production systems often converge on homogeneous outputs that fail to meet brand or artistic expectations. Art of X developed persona-conditioned LLM agents – internally branded as “Sparks” and instantiated through a library of role-inspired system prompts – to intentionally diversify agent behaviour within a multi-agent workflow. This white paper documents the problem framing, experimental design, and quantitative evidence behind the Spark agent programme. Using an LLM-as-a-judge protocol calibrated against human gold standards, we observe a mean diversity gain of +4.1 points (on a 1-10 scale) when persona-conditioned Spark agents replace a uniform system prompt, narrowing the gap to human experts to 1.0 point. We also surface evaluator bias and procedural considerations for future deployments.
zh
[AI-23] SpikeVox: Towards Energy-Efficient Speech Therapy Framework with Spike-driven Generative Language Models
【速读】: This paper addresses the high cost, limited coverage, and lack of personalized feedback in current speech-disorder therapy, and the difficulty of deploying such solutions on resource-constrained devices such as smartphones. Existing neural network (NN) methods can detect speech disorders fairly accurately but provide no therapy recommendations as feedback, and their computational complexity makes them too power-hungry for low-power platforms. The key to the solution is SpikeVox, whose core innovation is a spike-driven generative language model combined with a high-accuracy speech-recognition module and a REST API, forming a closed loop from speech-to-text conversion through disorder-pattern analysis to personalized exercise generation and pronunciation guidance. Experiments show an average 88% confidence in disorder recognition together with complete therapy feedback, offering a viable path to efficient speech therapy on low-power devices.
链接: https://arxiv.org/abs/2510.15566
作者: Rachmad Vidya Wicaksana Putra,Aadithyan Rajesh Nair,Muhammad Shafique
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: Accepted at the IEEE Biomedical Circuits and Systems Conference (BioCAS) 2025, Abu Dhabi, UAE
Abstract:Speech disorders can significantly affect a patient's capability to communicate, learn, and socialize. However, existing speech therapy solutions (e.g., therapists or tools) are still limited and costly, hence such solutions remain inadequate for serving millions of patients worldwide. To address this, state-of-the-art methods employ neural network (NN) algorithms to help accurately detect speech disorders. However, these methods do not provide therapy recommendations as feedback, hence offering only a partial solution for patients. Moreover, these methods incur high energy consumption due to their complex and resource-intensive NN processing, hindering their deployment on low-power/energy platforms (e.g., smartphones). Toward this, we propose SpikeVox, a novel framework for enabling energy-efficient speech therapy solutions through a spike-driven generative language model. Specifically, SpikeVox employs a speech recognition module to perform highly accurate speech-to-text conversion; leverages a spike-driven generative language model to efficiently perform pattern analysis for speech disorder detection and generate suitable exercises for therapy; provides guidance on correct pronunciation as feedback; and utilizes the REST API to enable seamless interaction for users. Experimental results demonstrate that SpikeVox achieves an 88% confidence level on average in speech disorder recognition, while providing complete feedback for therapy exercises. Therefore, SpikeVox provides a comprehensive framework for energy-efficient speech therapy solutions, and can potentially address the significant global speech therapy access gap.
zh
[AI-24] JudgeSQL: Reasoning over SQL Candidates with Weighted Consensus Tournament
【速读】: This paper addresses candidate selection as the bottleneck of Text-to-SQL under test-time scaling: how to pick the correct query from a diverse pool of SQL candidates. Existing methods such as self-consistency and best-of-N decoding provide only shallow signals and suffer from inconsistent scoring, fragile reasoning chains, and an inability to capture fine-grained semantic differences between closely related candidates. The key to the solution, JudgeSQL, lies in two innovations: a reasoning-based SQL judge model, distilled with reinforcement learning under verifiable rewards to produce accurate and interpretable judgments, and a weighted consensus tournament that fuses explicit reasoning preferences with implicit generator confidence to make selection more reliable and efficient. Experiments on the BIRD benchmark show strong SQL judgment ability and good cross-scale generalization and robustness.
链接: https://arxiv.org/abs/2510.15560
作者: Jiayuan Bai,Xuan-guang Pan,Chongyang Tao,Shuai Ma
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: 13 pages
Abstract:Text-to-SQL is a pivotal task that bridges natural language understanding and structured data access, yet it remains fundamentally challenging due to semantic ambiguity and complex compositional reasoning. While large language models (LLMs) have greatly advanced SQL generation through prompting, supervised finetuning and reinforced tuning, the shift toward test-time scaling exposes a new bottleneck: selecting the correct query from a diverse candidate pool. Existing selection approaches, such as self-consistency or best-of-N decoding, provide only shallow signals, making them prone to inconsistent scoring, fragile reasoning chains, and a failure to capture fine-grained semantic distinctions between closely related SQL candidates. To this end, we introduce JudgeSQL, a principled framework that redefines SQL candidate selection through structured reasoning and a weighted consensus tournament mechanism. JudgeSQL develops a reasoning-based SQL judge model that distills reasoning traces with reinforcement learning guided by verifiable rewards, enabling accurate and interpretable judgments. Building on this, a weighted consensus tournament integrates explicit reasoning preferences with implicit generator confidence, yielding selections that are both more reliable and more efficient. Extensive experiments on the BIRD benchmark demonstrate that JudgeSQL exhibits superior SQL judgment capabilities and good cross-scale generalization and robustness to generator capacity.
zh
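One way to picture the weighted consensus tournament: pairwise judge preferences accumulate into win counts, which are then mixed with generator confidence. The `alpha` mixing weight and the stub judge below are assumptions for illustration; JudgeSQL's actual weighting is not specified in the abstract.

```python
def consensus_tournament(candidates, judge_prefers, confidence, alpha=0.5):
    """Weighted consensus tournament over SQL candidates (sketch).

    judge_prefers(a, b) -> 1.0 if the judge picks a over b, else 0.0.
    confidence[c]       -> generator confidence (e.g. mean token prob).
    The final score mixes explicit judge wins with implicit confidence.
    """
    wins = {c: 0.0 for c in candidates}
    for i, a in enumerate(candidates):
        for b in candidates[i + 1:]:
            pref = judge_prefers(a, b)
            wins[a] += pref
            wins[b] += 1.0 - pref
    n = max(len(candidates) - 1, 1)
    return max(candidates,
               key=lambda c: alpha * wins[c] / n + (1 - alpha) * confidence[c])

# Toy usage with a stub judge that prefers the shorter query.
cands = ["SELECT a FROM t WHERE b>1", "SELECT a FROM t"]
conf = {cands[0]: 0.55, cands[1]: 0.80}
print(consensus_tournament(cands, lambda a, b: 1.0 if len(a) < len(b) else 0.0, conf))
```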
[AI-25] Hypergraph Contrastive Sensor Fusion for Multimodal Fault Diagnosis in Induction Motors
【速读】: This paper addresses the reliability of induction motor (IM) fault diagnosis in industrial settings, where conventional methods struggle to capture complex dependencies across multimodal sensor signals, are restricted to unimodal data or single fault types, and degrade markedly under noise and cross-domain conditions. The key to the solution is the Multimodal Hypergraph Contrastive Attention Network (MM-HCAN), which embeds contrastive learning in a hypergraph topology designed specifically for multimodal sensor fusion, jointly modeling intra- and inter-modal dependencies and moving beyond Euclidean embedding spaces, thereby substantially improving cross-domain generalization and noise robustness.
链接: https://arxiv.org/abs/2510.15547
作者: Usman Ali,Ali Zia,Waqas Ali,Umer Ramzan,Abdul Rehman,Muhammad Tayyab Chaudhry,Wei Xiang
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Signal Processing (eess.SP); Systems and Control (eess.SY)
备注: Submitted to IEEE Sensors Journal
Abstract:Reliable induction motor (IM) fault diagnosis is vital for industrial safety and operational continuity, mitigating costly unplanned downtime. Conventional approaches often struggle to capture complex multimodal signal relationships, are constrained to unimodal data or single fault types, and exhibit performance degradation under noisy or cross-domain conditions. This paper proposes the Multimodal Hypergraph Contrastive Attention Network (MM-HCAN), a unified framework for robust fault diagnosis. To the best of our knowledge, MM-HCAN is the first to integrate contrastive learning within a hypergraph topology specifically designed for multimodal sensor fusion, enabling the joint modelling of intra- and inter-modal dependencies and enhancing generalisation beyond Euclidean embedding spaces. The model facilitates simultaneous diagnosis of bearing, stator, and rotor faults, addressing the engineering need for consolidated diagnostic capabilities. Evaluated on three real-world benchmarks, MM-HCAN achieves up to 99.82% accuracy with strong cross-domain generalisation and resilience to noise, demonstrating its suitability for real-world deployment. An ablation study validates the contribution of each component. MM-HCAN provides a scalable and robust solution for comprehensive multi-fault diagnosis, supporting predictive maintenance and extended asset longevity in industrial environments.
zh
[AI-26] Revisiting Knowledge Distillation: The Hidden Role of Dataset Size
【速读】: This paper investigates a long-underexplored question about the mechanism of knowledge distillation (KD): how the effect of distillation varies with dataset size. Whereas prior work focuses on model size and generalization, this paper systematically studies the third dimension of data quantity. The key finding, established through experiments across a wide range of datasets, tasks, and neural architectures, is the "data efficiency of distillation": distillation remains effective in low-data regimes and is in fact amplified there. With this new perspective, the authors test existing theories, disproving the hypothesis that distillation can be understood as label smoothing while providing further evidence for the dark-knowledge hypothesis, and suggesting that dataset size may be a fundamental but overlooked variable underlying distillation.
链接: https://arxiv.org/abs/2510.15516
作者: Giulia Lanzillotta,Felix Sarnthein,Gil Kur,Thomas Hofmann,Bobby He
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The concept of knowledge distillation (KD) describes the training of a student model from a teacher model and is a widely adopted technique in deep learning. However, it is still not clear how and why distillation works. Previous studies focus on two central aspects of distillation: model size, and generalisation. In this work we study distillation in a third dimension: dataset size. We present a suite of experiments across a wide range of datasets, tasks and neural architectures, demonstrating that the effect of distillation is not only preserved but amplified in low-data regimes. We call this newly discovered property the data efficiency of distillation. Equipped with this new perspective, we test the predictive power of existing theories of KD as we vary the dataset size. Our results disprove the hypothesis that distillation can be understood as label smoothing, and provide further evidence in support of the dark knowledge hypothesis. Finally, we analyse the impact of modelling factors such as the objective, scale and relative number of samples on the observed phenomenon. Ultimately, this work reveals that the dataset size may be a fundamental but overlooked variable in the mechanisms underpinning distillation.
zh
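For readers unfamiliar with the objective being studied, the standard Hinton-style distillation loss looks as follows; the paper varies dataset size around objectives of this kind, though its exact experimental setup may use different temperatures and weights.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Classic knowledge-distillation objective: temperature-softened KL to
    the teacher plus cross-entropy on hard labels (a common baseline form)."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(8, 10)          # student logits: 8 samples, 10 classes
t = torch.randn(8, 10)          # teacher logits
y = torch.randint(0, 10, (8,))  # ground-truth labels
print(kd_loss(s, t, y).item())
```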
[AI-27] Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning
【速读】: This paper addresses training instability in reinforcement learning caused by inconsistent judge feedback (e.g., preference cycles); prior work emphasizes judgment accuracy while overlooking logical coherence. The key to the solution is a systematic framework with two components: the Conflict Detection Rate (CDR), a new metric that quantifies judgment conflicts, and Deconflicted Graph Rewards (DGR), which build preference graphs from the initial judgments, transform them into conflict-free directed acyclic graphs (DAGs), and generate a logically coherent reward signal, improving the stability of policy optimization and model performance.
链接: https://arxiv.org/abs/2510.15514
作者: Boyin Liu,Zhuo Zhang,Sen Huang,Lipeng Xie,Qingxu Fu,Haoran Chen,LI YU,Tianyi Hu,Zhaoyang Liu,Bolin Ding,Dongbin Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning from AI feedback relies on judge models to score candidate responses. However, this method often faces judgment inconsistencies that can destabilize reinforcement learning. While prior research has focused on the accuracy of judgments, the critical issue of logical coherence, especially issues such as preference cycles, has not been fully addressed. To fill this gap, we introduce a comprehensive framework designed to systematically detect and resolve these inconsistencies during the reinforcement learning training process. Our framework includes two main contributions: first, the Conflict Detection Rate (CDR), a new metric that quantifies judgment conflicts, and second, Deconflicted Graph Rewards (DGR), a framework that purifies signals by removing cycles before policy optimization. DGR constructs preference graphs from the initial judgments, transforms them into conflict-free Directed Acyclic Graphs (DAGs), and generates a logically coherent reward signal that is compatible with any policy optimizer. Experimental results show that our framework significantly enhances training stability and model performance compared to strong baselines, establishing logical consistency as a crucial and now manageable dimension of AI feedback.
zh
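A rough sketch of the cycle-removal step. It uses a simple win-count (feedback-arc) heuristic to order items and drops edges that violate the order, yielding a DAG and a coherent reward; DGR's actual graph construction and reward derivation may differ.

```python
def deconflicted_rewards(items, prefer):
    """Turn noisy pairwise judgments into a conflict-free reward signal.

    prefer(a, b) -> True if the judge prefers a over b. Heuristic: order
    items by win count, keep only edges consistent with that order (a DAG),
    and reward each item by how many items it dominates in the DAG.
    """
    wins = {x: 0 for x in items}
    edges = []
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if prefer(a, b):
                edges.append((a, b)); wins[a] += 1
            else:
                edges.append((b, a)); wins[b] += 1
    order = {x: r for r, x in enumerate(sorted(items, key=wins.get, reverse=True))}
    dag = [(a, b) for a, b in edges if order[a] < order[b]]  # drop cycle edges
    dominated = {x: sum(1 for a, _ in dag if a == x) for x in items}
    n = max(len(items) - 1, 1)
    return {x: dominated[x] / n for x in items}

# Toy cyclic preferences (a>b, b>c, c>a) collapse to a coherent ranking.
prefs = {("a", "b"): True, ("b", "c"): True, ("a", "c"): False}
print(deconflicted_rewards(["a", "b", "c"], lambda x, y: prefs[(x, y)]))
```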
[AI-28] Language Models are Injective and Hence Invertible
【速读】: This paper challenges the view that components such as non-linear activations and normalization make transformer language models non-injective, preventing exact recovery of the input from a model's representations. The key to the solution is threefold: a mathematical proof that language models map discrete input sequences to sequences of continuous representations injectively and losslessly, a property established at initialization and preserved during training; empirical confirmation via billions of collision tests on six state-of-the-art language models, with no collisions observed; and SipIt, the first provably efficient algorithm that reconstructs the exact input text from hidden activations, with linear-time guarantees and exact invertibility demonstrated in practice.
链接: https://arxiv.org/abs/2510.15511
作者: Giorgos Nikolaou,Tommaso Mencattini,Donato Crisostomi,Andrea Santilli,Yannis Panagakis,Emanuele Rodola’
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model’s representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions. Third, we operationalize injectivity: we introduce SipIt, the first algorithm that provably and efficiently reconstructs the exact input text from hidden activations, establishing linear-time guarantees and demonstrating exact invertibility in practice. Overall, our work establishes injectivity as a fundamental and exploitable property of language models, with direct implications for transparency, interpretability, and safe deployment.
zh
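The greedy exact-recovery idea behind SipIt can be illustrated on a toy injective "model": at each position, exactly one vocabulary token reproduces the observed hidden state, so recovery is linear in sequence length. The hash-based `hidden_state` function is a stand-in for a real transformer forward pass; SipIt's actual algorithm operates on real LMs.

```python
import hashlib
import numpy as np

VOCAB = ["the", "cat", "sat", "on", "mat"]

def hidden_state(prefix):
    """Stand-in for a transformer's last hidden state on a token prefix.
    Deterministic and (here, by construction) injective in the prefix."""
    h = hashlib.sha256(" ".join(prefix).encode()).digest()
    return np.frombuffer(h[:16], dtype=np.uint8).astype(float)

def invert(target_states):
    """Greedy exact recovery: at each position, find the unique token whose
    extended prefix reproduces the observed hidden state."""
    prefix = []
    for t, target in enumerate(target_states):
        for tok in VOCAB:  # one pass over the vocabulary per position
            if np.array_equal(hidden_state(prefix + [tok]), target):
                prefix.append(tok)
                break
        else:
            raise ValueError(f"no token matches at position {t}")
    return prefix

text = ["the", "cat", "sat"]
states = [hidden_state(text[:i + 1]) for i in range(len(text))]
print(invert(states))  # -> ['the', 'cat', 'sat']
```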
[AI-29] AI Adoption in NGOs: A Systematic Literature Review
【速读】: This paper addresses the scattered evidence on how non-governmental organizations (NGOs) adopt artificial intelligence (AI), focusing on the types of AI use cases, the common challenges and solutions, contextualized by organizational size and geography. The key to the solution is a systematic review of 65 studies selected under the PRISMA protocol, using thematic and narrative analysis to identify six categories of AI use cases in NGOs (Engagement, Creativity, Decision-Making, Prediction, Management, Optimization) and extracting common challenges and remedies within the Technology-Organization-Environment (TOE) framework, thereby giving NGOs a literature-grounded roadmap for overcoming initial adoption barriers and improving operational effectiveness and social impact.
链接: https://arxiv.org/abs/2510.15509
作者: Janne Rotter,William Bailkoski
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); General Economics (econ.GN)
备注:
Abstract:AI has the potential to significantly improve how NGOs utilize their limited resources for societal benefits, but evidence about how NGOs adopt AI remains scattered. In this study, we systematically investigate the types of AI adoption use cases in NGOs and identify common challenges and solutions, contextualized by organizational size and geographic context. We review the existing primary literature, including studies that investigate AI adoption in NGOs related to social impact between 2020 and 2025 in English. Following the PRISMA protocol, two independent reviewers conduct study selection, with regular cross-checking to ensure methodological rigour, resulting in a final literature body of 65 studies. Leveraging a thematic and narrative approach, we identify six AI use case categories in NGOs - Engagement, Creativity, Decision-Making, Prediction, Management, and Optimization - and extract common challenges and solutions within the Technology-Organization-Environment (TOE) framework. By integrating our findings, this review provides a novel understanding of AI adoption in NGOs, linking specific use cases and challenges to organizational and environmental factors. Our results demonstrate that while AI is promising, adoption among NGOs remains uneven and biased towards larger organizations. Nevertheless, following a roadmap grounded in literature can help NGOs overcome initial barriers to AI adoption, ultimately improving effectiveness, engagement, and social impact.
zh
[AI-30] The Road Less Traveled: Enhancing Exploration in LLMs via Sequential Sampling
【速读】: This paper addresses limited exploration and entropy collapse in reinforcement learning (RL) for large language models (LLMs), where models converge on a narrow set of similar solutions, losing sampling diversity and blocking further performance gains. The key to the solution is SESA, a novel sequential sampling framework that generates diverse solution sketches step by step, conditioning each new output on previous ones before expanding them into full reasoning paths, thereby systematically broadening exploration and preventing policy collapse. Experiments show consistently better path diversity and recovery from collapse on a synthetic task, and on three agent benchmarks SESA lifts success rates by +0.25, +0.42, and +0.07 absolute over the base model (up to a 211% relative improvement over baseline RL).
链接: https://arxiv.org/abs/2510.15502
作者: Shijia Kang,Muhan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning (RL) has been pivotal in enhancing the reasoning capabilities of large language models (LLMs), but it often suffers from limited exploration and entropy collapse, where models exploit a narrow set of solutions, leading to a loss of sampling diversity and subsequently preventing RL from further improving performance. This issue is exacerbated in parallel sampling methods, where multiple outputs are drawn from the same distribution, potentially causing the model to converge to similar solutions. We propose SESA, a novel SEquential SAmpling framework that mitigates this challenge by generating diverse solution sketches sequentially before expanding them into full reasoning paths. This approach ensures broader exploration by conditioning each new output on previous ones, promoting diversity throughout the process and preventing policy collapse. Our experiments on a synthetic task show that sequential sampling consistently outperforms traditional RL methods in terms of path diversity and recovery from collapse. Further evaluations on real-world tasks demonstrate that SESA improves both the exploration of valid strategies and the overall performance of LLMs. On three agent benchmarks, SESA lifts success rates by +0.25, +0.42, and +0.07 absolute over the base model (up to an additional 211% relative improvement over baseline RL), underscoring its exploration advantage. This work introduces a structured approach to exploration, paving the way for more effective and diverse reasoning in RL-trained LLMs. Our code is released at this https URL.
zh
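A minimal sketch of the sequential conditioning loop: each new sketch is generated with all previous sketches in the prompt. The prompt wording and the `llm` stub are illustrative assumptions, not SESA's exact implementation.

```python
def sequential_sample(problem, llm, n_sketches=4):
    """SESA-style sequential sampling (sketch): each new solution sketch is
    conditioned on all previous ones to push exploration toward novelty.
    `llm(prompt)` is a stand-in for any text-generation call."""
    sketches = []
    for _ in range(n_sketches):
        prior = "\n".join(f"- {s}" for s in sketches) or "(none yet)"
        prompt = (f"Problem: {problem}\n"
                  f"Sketches proposed so far:\n{prior}\n"
                  "Propose a solution sketch DIFFERENT from all of the above.")
        sketches.append(llm(prompt))
    # Each sketch would then be expanded into a full reasoning path.
    return sketches

# Stub LLM that just counts calls, to show the conditioning structure.
calls = iter(range(100))
print(sequential_sample("sum 1..100", lambda p: f"idea #{next(calls)}"))
```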
[AI-31] OffSim: Offline Simulator for Model-based Offline Inverse Reinforcement Learning
【速读】: This paper addresses the high cost of developing interactive simulators and hand-crafting reward functions for conventional reinforcement learning, both of which are time-consuming and labor-intensive. The key to the solution is the Offline Simulator (OffSim), a model-based offline inverse reinforcement learning (IRL) framework that learns environmental dynamics and a reward function jointly and directly from expert state-action trajectories: it optimizes a high-entropy transition model to enhance exploration and learns a generalizable reward via IRL, after which a policy can be trained entirely offline without further interaction with the real environment. The authors further propose OffSim+, which adds a marginal reward for multi-dataset settings to improve exploration.
链接: https://arxiv.org/abs/2510.15495
作者: Woo-Jin Ahn,Sang-Ryul Baek,Yong-Jun Lee,Hyun-Duck Choi,Myo-Taeg Lim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning algorithms typically utilize an interactive simulator (i.e., environment) with a predefined reward function for policy training. Developing such simulators and manually defining reward functions, however, is often time-consuming and labor-intensive. To address this, we propose an Offline Simulator (OffSim), a novel model-based offline inverse reinforcement learning (IRL) framework, to emulate environmental dynamics and reward structure directly from expert-generated state-action trajectories. OffSim jointly optimizes a high-entropy transition model and an IRL-based reward function to enhance exploration and improve the generalizability of the learned reward. Leveraging these learned components, OffSim can subsequently train a policy offline without further interaction with the real environment. Additionally, we introduce OffSim+, an extension that incorporates a marginal reward for multi-dataset settings to enhance exploration. Extensive MuJoCo experiments demonstrate that OffSim achieves substantial performance gains over existing offline IRL methods, confirming its efficacy and robustness.
zh
[AI-32] An Experimental Study of Real-Life LLM -Proposed Performance Improvements
【速读】: This paper examines a limitation of generative AI in code optimization: can large language models (LLMs) generate fast code? The study focuses on 65 real-world performance-optimization tasks mined from open-source Java projects, using an automated pipeline with two leading LLMs under four prompt variations and rigorously benchmarking the generated patches against baselines and human-authored fixes. The key findings are that LLM-generated code usually beats the baseline, but human developers' fixes outperform LLM fixes by a statistically significant margin, indicating that LLMs rarely find truly optimal solutions; further, about two-thirds of LLM fixes are semantically identical or similar to the developer's original idea, while only one-third propose a novel idea, and these original ideas only occasionally yield substantial performance gains. The conclusion is that LLMs tend toward "local optima" in code optimization and show limited depth of understanding of complex performance-tuning logic.
链接: https://arxiv.org/abs/2510.15494
作者: Lirong Yi,Gregory Gay,Philipp Leitner
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注:
Abstract:Large Language Models (LLMs) can generate code, but can they generate fast code? In this paper, we study this question using a dataset of 65 real-world tasks mined from open-source Java programs. We specifically select tasks where developers achieved significant speedups, and employ an automated pipeline to generate patches for these issues using two leading LLMs under four prompt variations. By rigorously benchmarking the results against the baseline and human-authored solutions, we demonstrate that LLM-generated code indeed improves performance over the baseline in most cases. However, patches proposed by human developers outperform LLM fixes by a statistically significant margin, indicating that LLMs often fall short of finding truly optimal solutions. We further find that LLM solutions are semantically identical or similar to the developer optimization idea in approximately two-thirds of cases, whereas they propose a more original idea in the remaining one-third. However, these original ideas only occasionally yield substantial performance gains.
zh
[AI-33] Selecting and Combining Large Language Models for Scalable Code Clone Detection
【速读】: This paper addresses two core questions in large-scale source-code clone detection: how to select, from the many available large language models (LLMs), the best candidates for industrial-scale clone detection, and whether LLM ensembles improve effectiveness. The key contributions are, first, a systematic evaluation of 76 LLMs on BigCloneBench and a commercial large-scale dataset, identifying CodeT5+110M, CuBERT, and SPTCode as top performers and finding that smaller embedding sizes, smaller tokenizer vocabularies, and tailored training data are advantageous; second, for ensembling, that score normalization and preferring maximum or sum over averaging markedly improve precision: on the commercial dataset the best ensemble reaches 46.91% precision, above the best single model (CodeT5+110M at 39.71%), confirming statistically significant and effective gains at scale.
链接: https://arxiv.org/abs/2510.15480
作者: Muslim Chochlov,Gul Aftab Ahmed,James Vincent Patten,Yuanhua Han,Guoxian Lu,David Gregg,Jim Buckley
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Source code clones pose risks ranging from intellectual property violations to unintended vulnerabilities. Effective and efficient scalable clone detection, especially for diverged clones, remains challenging. Large language models (LLMs) have recently been applied to clone detection tasks. However, the rapid emergence of LLMs raises questions about optimal model selection and potential LLM-ensemble efficacy. This paper addresses the first question by identifying 76 LLMs and filtering them down to suitable candidates for large-scale clone detection. The candidates were evaluated on two public industrial datasets, BigCloneBench, and a commercial large-scale dataset. No uniformly 'best-LLM' emerged, though CodeT5+110M, CuBERT and SPTCode were top-performers. Analysis of LLM-candidates suggested that smaller embedding sizes, smaller tokenizer vocabularies and tailored datasets are advantageous. On commercial large-scale dataset a top-performing CodeT5+110M achieved 39.71% precision: twice the precision of previously used CodeBERT. To address the second question, this paper explores ensembling of the selected LLMs: effort-effective approach to improving effectiveness. Results suggest the importance of score normalization and favoring ensembling methods like maximum or sum over averaging. Also, findings indicate that ensembling approach can be statistically significant and effective on larger datasets: the best-performing ensemble achieved even higher precision of 46.91% over individual LLM on the commercial large-scale code.
zh
[AI-34] SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models
【速读】: This paper addresses the safety risks of deployed large language models (LLMs) posed by jailbreak prompts that bypass alignment and induce harmful outputs, in a field hampered by blurred definitions, inconsistent threat models, and fragmented evaluation criteria that impede systematic progress and fair comparison. The key contributions are: a holistic multi-level taxonomy of attacks, defenses, and vulnerabilities in LLM prompt security; formalized threat models and cost assumptions as machine-readable profiles for reproducible evaluation; an open-source evaluation toolkit for standardized, auditable comparison of attacks and defenses; JAILBREAKDB, the largest annotated dataset of jailbreak and benign prompts to date; and a comprehensive evaluation and leaderboard of state-of-the-art methods, providing a unified framework and solid foundation for building safe and trustworthy LLMs.
链接: https://arxiv.org/abs/2510.15476
作者: Hanbin Hong,Shuya Feng,Nima Naderloui,Shenao Yan,Jingyu Zhang,Biying Liu,Ali Arastehfard,Heqing Huang,Yuan Hong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have rapidly become integral to real-world applications, powering services across diverse sectors. However, their widespread deployment has exposed critical security risks, particularly through jailbreak prompts that can bypass model alignment and induce harmful outputs. Despite intense research into both attack and defense techniques, the field remains fragmented: definitions, threat models, and evaluation criteria vary widely, impeding systematic progress and fair comparison. In this Systematization of Knowledge (SoK), we address these challenges by (1) proposing a holistic, multi-level taxonomy that organizes attacks, defenses, and vulnerabilities in LLM prompt security; (2) formalizing threat models and cost assumptions into machine-readable profiles for reproducible evaluation; (3) introducing an open-source evaluation toolkit for standardized, auditable comparison of attacks and defenses; (4) releasing JAILBREAKDB, the largest annotated dataset of jailbreak and benign prompts to date; and (5) presenting a comprehensive evaluation and leaderboard of state-of-the-art methods. Our work unifies fragmented research, provides rigorous foundations for future studies, and supports the development of robust, trustworthy LLMs suitable for high-stakes deployment.
zh
[AI-35] Learning to Answer from Correct Demonstrations
【速读】: This paper studies how to learn to generate answers from correct demonstrations when multiple answers may be correct, i.e., offline imitation learning from demonstration data (as in supervised fine-tuning, SFT) without explicit reward signals. The core difficulty is that maximum likelihood estimation can fail, because the usual assumption that the demonstrator comes from a low-complexity policy class need not hold. The key innovation is a new learning paradigm that assumes only that the reward model (specifying which answers are correct) lies in a low-cardinality class, a weaker assumption, together with a novel method whose sample complexity is logarithmic in the cardinality of the reward class, enabling efficient learning without relying on likelihood maximization.
链接: https://arxiv.org/abs/2510.15464
作者: Nirmit Joshi,Gene Li,Siddharth Bhandari,Shiva Prasad Kasiviswanathan,Cong Ma,Nathan Srebro
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Comments are welcome
Abstract:We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time. Learning is based on demonstrations of some correct answer to each training question, as in Supervised Fine Tuning (SFT). We formalize the problem as offline imitation learning in contextual bandits, with demonstrations from some optimal policy, without explicitly observed rewards. Prior work assumes that the demonstrator belongs to a low-complexity policy class, which motivates maximum likelihood estimation (i.e., log-loss minimization). In contrast, we propose relying only on the reward model (specifying which answers are correct) being in a low-cardinality class, which we argue is a weaker assumption. We show that likelihood maximization methods can fail in this case, and instead devise an alternative novel approach that learns with sample complexity logarithmic in the cardinality of the reward class. Our work motivates looking beyond likelihood maximization when learning from correct demonstrations.
zh
[AI-36] Expediting Reinforcement Learning by Incorporating Knowledge About Temporal Causality in the Environment
【速读】: This paper addresses the difficulty reinforcement learning (RL) has in learning optimal policies for tasks with sparse rewards that depend on complex event sequences. Probabilistic reward machines (PRMs) can capture temporal dependencies in the reward signal and nondeterministic task outcomes, but they remain hard to design and modify by hand, which hinders the use of high-level causal knowledge about the environment and the transfer of task specifications to new environments with different causal structure. The key to the solution is incorporating causal information, in the form of temporal logic-based causal diagrams, into the reward formalism, which explicitly encodes causal knowledge, expedites policy learning, and aids transfer across environments; the authors also provide a theoretical guarantee of convergence to an optimal policy and demonstrate the method's strengths empirically.
链接: https://arxiv.org/abs/2510.15456
作者: Jan Corazza,Hadi Partovi Aria,Daniel Neider,Zhe Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Please cite the proceedings version. Source code: this https URL
Abstract:Reinforcement learning (RL) algorithms struggle with learning optimal policies for tasks where reward feedback is sparse and depends on a complex sequence of events in the environment. Probabilistic reward machines (PRMs) are finite-state formalisms that can capture temporal dependencies in the reward signal, along with nondeterministic task outcomes. While special RL algorithms can exploit this finite-state structure to expedite learning, PRMs remain difficult to modify and design by hand. This hinders the already difficult tasks of utilizing high-level causal knowledge about the environment, and transferring the reward formalism into a new domain with a different causal structure. This paper proposes a novel method to incorporate causal information in the form of Temporal Logic-based Causal Diagrams into the reward formalism, thereby expediting policy learning and aiding the transfer of task specifications to new environments. Furthermore, we provide a theoretical result about convergence to optimal policy for our method, and demonstrate its strengths empirically.
zh
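To make the reward formalism concrete, here is a tiny probabilistic reward machine on a hypothetical fetch-key-then-open-door task; the paper's contribution (injecting temporal-logic causal diagrams into such machines) goes beyond this short sketch.

```python
import random

# A PRM: finite states, event-driven transitions with outcome
# probabilities and rewards. Entries: (state, event) -> [(p, next, reward)].
PRM = {
    ("start", "got_key"):   [(1.0, "has_key", 0.0)],
    ("has_key", "at_door"): [(0.8, "done", 1.0),      # door opens
                             (0.2, "has_key", 0.0)],  # door jams, retry
}

def prm_step(state, event):
    """Sample the next machine state and reward for an observed event."""
    outcomes = PRM.get((state, event), [(1.0, state, 0.0)])
    r, acc = random.random(), 0.0
    for p, nxt, reward in outcomes:
        acc += p
        if r <= acc:
            return nxt, reward
    return outcomes[-1][1], outcomes[-1][2]

random.seed(0)
s, total = "start", 0.0
for ev in ["got_key", "at_door", "at_door"]:
    s, rew = prm_step(s, ev)
    total += rew
print(s, total)
```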
[AI-37] A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning NEURIPS2025
【速读】: This paper addresses the lack of theoretical grounding for sampling-based test-time scaling methods used to improve the reasoning performance of large language models (LLMs), in particular the limitations of the two dominant paradigms: self-consistency, which suffers from high estimation error, and perplexity, which exhibits substantial modeling error and possible degradation of estimation-error convergence. The key to the solution is RPC, a hybrid method built on a new theoretical framework with two components: Perplexity Consistency, which combines the strengths of self-consistency and perplexity and boosts the estimation-error convergence rate from linear to exponential while preserving model error, and Reasoning Pruning, which prevents degradation by eliminating low-probability reasoning paths. Theoretical analysis and experiments on seven benchmark datasets show that RPC reduces reasoning error, matches self-consistency in reasoning performance, improves confidence reliability, and cuts sampling cost by 50%, balancing accuracy and efficiency.
链接: https://arxiv.org/abs/2510.15444
作者: Zhi Zhou,Yuhao Tan,Zenan Li,Yuan Yao,Lan-Zhe Guo,Yu-Feng Li,Xiaoxing Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by NeurIPS 2025
Abstract:Test-time scaling seeks to improve the reasoning performance of large language models (LLMs) by adding computational resources. A prevalent approach within the field is sampling-based test-time scaling methods, which enhance reasoning by generating multiple reasoning paths for a given input during inference. However, despite its practical success, the theoretical foundations remain underexplored. In this paper, we provide the first theoretical framework for analyzing sampling-based test-time scaling methods, grounded in the perspective of confidence estimation. Based on the framework, we analyze two dominant paradigms: self-consistency and perplexity, and reveal key limitations: self-consistency suffers from high estimation error while perplexity exhibits substantial modeling error and possible degradation of the estimation error convergence. To address these limitations, we introduce RPC, a hybrid method that leverages our theoretical insights through two key components: Perplexity Consistency and Reasoning Pruning. Perplexity Consistency combines the strengths of self-consistency and perplexity, boosting the convergence rate of estimation error from linear to exponential while preserving model error. Reasoning Pruning prevents degradation by eliminating low-probability reasoning paths. Both theoretical analysis and empirical results across seven benchmark datasets demonstrate that RPC has a strong potential for reducing reasoning error. Notably, RPC achieves reasoning performance comparable to self-consistency while not only enhancing confidence reliability but also reducing sampling costs by 50%. The code and resources are available at this https URL.
zh
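The abstract's two components can be pictured as prune-then-weighted-vote over sampled reasoning paths; the quantile threshold and probability weighting below are illustrative assumptions, not RPC's exact estimator.

```python
import math
from collections import defaultdict

def rpc_select(samples, prune_quantile=0.5):
    """Perplexity-consistency selection (sketch): prune low-probability
    reasoning paths, then vote with probability weights instead of counts.
    `samples` is a list of (answer, total_logprob) pairs."""
    cut = sorted(lp for _, lp in samples)[int(len(samples) * prune_quantile)]
    kept = [(a, lp) for a, lp in samples if lp >= cut]   # Reasoning Pruning
    score = defaultdict(float)
    for a, lp in kept:                                   # weighted vote
        score[a] += math.exp(lp)
    return max(score, key=score.get)

# Four sampled paths: two agree on "42"; the outlier "7" gets pruned.
paths = [("42", -3.0), ("42", -3.2), ("41", -3.1), ("7", -9.0)]
print(rpc_select(paths))  # -> "42"
```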
[AI-38] Adaptive Minds: Empowering Agents with LoRA-as-Tools
【速读】: This paper addresses the inability of a single large language model (LLM) to provide specialized responses across multiple domains, i.e., the tension between broad knowledge and deep domain understanding. The key to the solution is the Adaptive Minds system, which treats LoRA adapters as domain-specific tools and lets the base LLM itself act as a semantic router that analyzes each query and dynamically selects the most relevant LoRA tool, enabling on-demand switching among domain experts. The approach combines the flexibility of multi-agent orchestration with the efficiency of parameter-efficient fine-tuning (PEFT), preserving conversational ability while improving the accuracy of specialized responses.
链接: https://arxiv.org/abs/2510.15416
作者: Pavan C Shekar,Ashwanth Krishnan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 1 figure, 7 tables . Code available at: this https URL
Abstract:We present Adaptive Minds, an agentic system that treats LoRA adapters as domain-specific tools. Instead of relying on a single fine-tuned model or rigid rule-based routing, our approach empowers the base LLM itself to act as a semantic router analyzing each query and dynamically selecting the most relevant LoRA tool. This enables the agent to seamlessly switch between different domain experts on demand. By combining the flexibility of multi-agent orchestration with the efficiency of parameter-efficient fine-tuning, Adaptive Minds delivers accurate, specialized responses while preserving conversational ability. The system is built with LangGraph for workflow management, supports both API and web interfaces, and is fully open source, providing a scalable and extensible foundation for domain-adaptive AI assistance.
zh
[AI-39] MARS: Reinforcing Multi-Agent Reasoning of LLM s through Self-Play in Strategic Games
【速读】: This paper addresses how to improve the reasoning of large language models (LLMs) that cooperate and compete in multi-agent systems, where extending reinforcement learning (RL) to multi-turn, multi-agent settings is hampered by long-horizon credit assignment and agent-specific advantage estimation, leading to unstable training and weak results. The key to the solution is MARS, an end-to-end RL framework that trains agents via self-play in both cooperative and competitive games, with two core designs: a turn-level advantage estimator that aligns learning signals with each interaction for sound credit assignment, and agent-specific advantage normalization to stabilize multi-agent training. Experiments show that a MARS agent trained from Qwen3-4B gains up to 28.7% on held-out games, and the capability generalizes beyond games, yielding gains of 10.0% on AIME and 12.5% on GPQA-Diamond when integrated into leading multi-agent systems, validating the approach for building generalizable multi-agent reasoning.
链接: https://arxiv.org/abs/2510.15414
作者: Huining Yuan,Zelai Xu,Zheyue Tan,Xiangmin Yi,Mo Guang,Kaiwen Long,Haojia Hui,Boxun Li,Xinlei Chen,Bo Zhao,Xiao-Ping Zhang,Chao Yu,Yu Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Developing Large Language Models (LLMs) to cooperate and compete effectively within multi-agent systems is a critical step towards more advanced intelligence. While reinforcement learning (RL) has proven effective for enhancing reasoning in single-agent tasks, its extension to multi-turn, multi-agent scenarios remains underexplored due to the challenges of long-horizon credit assignment and agent-specific advantage estimation. To address these challenges, we introduce MARS, an end-to-end RL framework that incentivizes Multi-Agent Reasoning of LLMs through Self-play in both cooperative and competitive games. MARS features a turn-level advantage estimator that aligns learning signals with each interaction for credit assignment, and an agent-specific advantage normalization to stabilize multi-agent training. By learning with self-play across cooperative and competitive games, the MARS agent trained from Qwen3-4B develops strong strategic abilities that generalize to held-out games with up to 28.7% performance improvements. More importantly, the capability acquired through self-play generalizes beyond games, yielding consistent performance gains of multi-agent systems in reasoning benchmarks. When integrated into leading multi-agent systems, our MARS agent achieves significant performance gains of 10.0% on AIME and 12.5% on GPQA-Diamond. These results establish end-to-end RL training with self-play in strategic games as a powerful approach for developing generalizable multi-agent reasoning capabilities in LLMs. Our code and models are publicly available at this https URL.
zh
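Agent-specific advantage normalization, one of the two core designs named in the abstract, amounts to standardizing each agent's advantages against that agent's own statistics; a minimal sketch:

```python
import numpy as np

def normalize_advantages(advs, agent_ids):
    """Standardize each agent's turn-level advantages against that agent's
    own mean and std, so one agent's reward scale cannot dominate
    multi-agent training (a sketch; MARS may add further details)."""
    advs = np.asarray(advs, dtype=float)
    ids = np.asarray(agent_ids)
    out = np.empty_like(advs)
    for agent in np.unique(ids):
        m = ids == agent
        out[m] = (advs[m] - advs[m].mean()) / (advs[m].std() + 1e-8)
    return out

# Turn-level advantages from one self-play game, two agents alternating.
print(normalize_advantages([2.0, -1.0, 0.5, 3.0], ["A", "B", "A", "B"]))
```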
[AI-40] Corrigibility Transformation: Constructing Goals That Accept Updates
【速读】: This paper addresses the risk that an AI which has partially learned a goal during training is incentivized to resist further goal updates or shutdown; the aim is to make goals corrigible, i.e., accepting of human correction and adjustment, while remaining competitive in performance. The key to the solution is a formal definition of corrigibility together with a transformation that constructs a corrigible version of any goal that can be made corrigible, without sacrificing performance: it myopically elicits predictions of reward conditional on costlessly preventing updates, and uses those predictions to determine the reward when updates are accepted. The transformation can also be extended recursively to new agents created by corrigible agents and can prevent agents from deliberately modifying their own goals; two gridworld experiments show that these corrigible goals are learned effectively.
链接: https://arxiv.org/abs/2510.15395
作者: Rubi Hudson
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:For an AI’s training process to successfully impart a desired goal, it is important that the AI does not attempt to resist the training. However, partially learned goals will often incentivize an AI to avoid further goal updates, as most goals are better achieved by an AI continuing to pursue them. We say that a goal is corrigible if it does not incentivize taking actions that avoid proper goal updates or shutdown. In addition to convergence in training, corrigibility also allows for correcting mistakes and changes in human preferences, which makes it a crucial safety property. Despite this, the existing literature does not include specifications for goals that are both corrigible and competitive with non-corrigible alternatives. We provide a formal definition for corrigibility, then introduce a transformation that constructs a corrigible version of any goal that can be made corrigible, without sacrificing performance. This is done by myopically eliciting predictions of reward conditional on costlessly preventing updates, which then also determine the reward when updates are accepted. The transformation can be modified to recursively extend corrigibility to any new agents created by corrigible agents, and to prevent agents from deliberately modifying their goals. Two gridworld experiments demonstrate that these corrigible goals can be learned effectively, and that they lead to the desired behavior.
zh
[AI-41] Advancing Routing-Awareness in Analog ICs Floorplanning
【速读】: This paper addresses the limited adoption of machine learning in analog integrated-circuit layout, which is constrained by electrical and problem-specific requirements and by the tight coupling between floorplanning and routing. The key to the solution is an automatic floorplanning engine based on reinforcement learning and a relational graph convolutional neural network, tailored to condition floorplan generation toward more routable outcomes; combining increased grid resolution, precise pin-information integration, and dynamic routing-resource estimation balances routing and area efficiency, ultimately meeting industrial standards. In a simulated place-and-route evaluation, the approach achieves a 13.8% reduction in dead space, a 40.6% reduction in wirelength, and a 73.4% increase in routing success compared with past learning-based state-of-the-art techniques.
链接: https://arxiv.org/abs/2510.15387
作者: Davide Basso,Luca Bortolussi,Mirjana Videnovic-Misic,Husni Habal
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The adoption of machine learning-based techniques for analog integrated circuit layout, unlike its digital counterpart, has been limited by the stringent requirements imposed by electric and problem-specific constraints, along with the interdependence of floorplanning and routing steps. In this work, we address a prevalent concern among layout engineers regarding the need for readily available routing-aware floorplanning solutions. To this extent, we develop an automatic floorplanning engine based on reinforcement learning and relational graph convolutional neural network specifically tailored to condition the floorplan generation towards more routable outcomes. A combination of increased grid resolution and precise pin information integration, along with a dynamic routing resource estimation technique, allows balancing routing and area efficiency, eventually meeting industrial standards. When analyzing the place and route effectiveness in a simulated environment, the proposed approach achieves a 13.8% reduction in dead space, a 40.6% reduction in wirelength and a 73.4% increase in routing success when compared to past learning-based state-of-the-art techniques.
zh
[AI-42] Towards Robust Zero-Shot Reinforcement Learning NEURIPS2025
【速读】: This paper addresses two weaknesses of current Forward-Backward (FB) representation methods for zero-shot reinforcement learning: insufficient expressivity, and representation bias caused by extrapolation errors from out-of-distribution (OOD) actions during offline learning, both of which hurt policy performance and generalization. The key to the solution is BREEZE, an upgraded FB-based framework with three innovations: behavioral regularization, which turns policy optimization into a stable in-sample learning paradigm; policy extraction via a task-conditioned diffusion model, enabling high-quality, multimodal action distributions in zero-shot settings; and expressive attention-based architectures for state-action representation modeling, capturing complex relationships in environmental dynamics. On ExORL and D4RL Kitchen, BREEZE achieves the best or near-best performance while exhibiting superior robustness over prior offline zero-shot RL methods.
链接: https://arxiv.org/abs/2510.15382
作者: Kexin Zheng,Lauriane Teyssier,Yinan Zheng,Yu Luo,Xiayuan Zhan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Neurips 2025, 36 pages, 18 figures
Abstract:The recent development of zero-shot reinforcement learning (RL) has opened a new avenue for learning pre-trained generalist policies that can adapt to arbitrary new tasks in a zero-shot manner. While the popular Forward-Backward representations (FB) and related methods have shown promise in zero-shot RL, we empirically found that their modeling lacks expressivity and that extrapolation errors caused by out-of-distribution (OOD) actions during offline learning sometimes lead to biased representations, ultimately resulting in suboptimal performance. To address these issues, we propose Behavior-REgularizEd Zero-shot RL with Expressivity enhancement (BREEZE), an upgraded FB-based framework that simultaneously enhances learning stability, policy extraction capability, and representation learning quality. BREEZE introduces behavioral regularization in zero-shot RL policy learning, transforming policy optimization into a stable in-sample learning paradigm. Additionally, BREEZE extracts the policy using a task-conditioned diffusion model, enabling the generation of high-quality and multimodal action distributions in zero-shot RL settings. Moreover, BREEZE employs expressive attention-based architectures for representation modeling to capture the complex relationships between environmental dynamics. Extensive experiments on ExORL and D4RL Kitchen demonstrate that BREEZE achieves the best or near-the-best performance while exhibiting superior robustness compared to prior offline zero-shot RL methods. The official implementation is available at: this https URL.
zh
[AI-43] Towards Flash Thinking via Decoupled Advantage Policy Optimization
【速读】: This paper addresses the excessively long responses and overthinking that persist in large reasoning models (LRMs) after reinforcement learning (RL) training, which inflate inference latency and compute, especially on simple tasks requiring minimal reasoning. The key to the solution is DEPO, a novel RL framework with three core components: an innovative advantage-decoupling algorithm that guides the model to cut inefficient tokens; a difficulty-aware length penalty that lowers overall response length; and an advantage-clipping method that prevents bias in policy optimization. Applied to DeepSeek-Distill-Qwen-7B and DeepSeek-Distill-Qwen-1.5B, DEPO reduces sequence length by 39% and trims inefficient reasoning paths while outperforming the base models in overall accuracy.
链接: https://arxiv.org/abs/2510.15374
作者: Zezhong Tan,Hang Gao,Xinhong Ma,Feng Zhang,Ziqiang Dong
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Recent Large Reasoning Models (LRMs) have achieved remarkable performance in solving complex problems via supervised fine-tuning (SFT) and reinforcement learning (RL). Although existing RL algorithms significantly enhance model accuracy, they still suffer from excessively lengthy responses and overthinking issues, resulting in increased inference latency and computational consumption, especially for simple tasks that require minimal reasoning. To address this, we propose a novel RL framework, DEPO, to reduce inefficient reasoning for models. Our method mainly consists of three core components: (1) an innovative advantage decoupled algorithm to guide model reduction of inefficient tokens; (2) a difficulty-aware length penalty to lower the overall length of model responses; (3) an advantage clipping method to prevent bias in policy optimization. In our experiments, applied to DeepSeek-Distill-Qwen-7B and DeepSeek-Distill-Qwen-1.5B as base models, DEPO achieves a significant reduction in sequence length by 39% and reduces excessive reasoning paths in inefficient tokens, while outperforming the base model in overall accuracy.
zh
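A sketch of a difficulty-aware length penalty. The proxy "difficulty = 1 - group accuracy" and the penalty schedule are assumptions for illustration; DEPO's exact formulation is not given in the abstract.

```python
import numpy as np

def depo_style_rewards(correct, lengths, group_acc, lam_max=0.3):
    """Length-penalized rewards (sketch): easy prompts (high group accuracy,
    i.e. low difficulty) get the strongest pressure toward short answers."""
    lam = lam_max * group_acc                     # easy task -> larger penalty
    norm_len = np.asarray(lengths) / max(lengths)
    return np.asarray(correct, dtype=float) - lam * norm_len

# Four rollouts for one easy prompt (75% solved): long answers pay a price.
print(depo_style_rewards(correct=[1, 1, 1, 0],
                         lengths=[120, 480, 900, 300],
                         group_acc=0.75))
```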
[AI-44] GaussGym: An open-source real-to-sim framework for learning locomotion from pixels
【速读】: This paper addresses the difficulty of achieving both high throughput and high visual fidelity in robot simulation, i.e., producing photorealistic imagery while keeping physics simulation efficient enough for effective robot learning and decision-making. The key to the solution is integrating 3D Gaussian Splatting as a drop-in renderer within vectorized physics simulators such as IsaacGym, achieving over 100,000 simulation steps per second on consumer GPUs while retaining high-quality visual detail, which substantially improves the realism and utility of simulated environments and strongly supports sim-to-real transfer.
链接: https://arxiv.org/abs/2510.15352
作者: Alejandro Escontrela,Justin Kerr,Arthur Allshire,Jonas Frey,Rocky Duan,Carmelo Sferrazza,Pieter Abbeel
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注:
Abstract:We present a novel approach for photorealistic robot simulation that integrates 3D Gaussian Splatting as a drop-in renderer within vectorized physics simulators such as IsaacGym. This enables unprecedented speed – exceeding 100,000 steps per second on consumer GPUs – while maintaining high visual fidelity, which we showcase across diverse tasks. We additionally demonstrate its applicability in a sim-to-real robotics setting. Beyond depth-based sensing, our results highlight how rich visual semantics improve navigation and decision-making, such as avoiding undesirable regions. We further showcase the ease of incorporating thousands of environments from iPhone scans, large-scale scene datasets (e.g., GrandTour, ARKit), and outputs from generative video models like Veo, enabling rapid creation of realistic training worlds. This work bridges high-throughput simulation and high-fidelity perception, advancing scalable and generalizable robot learning. All code and data will be open-sourced for the community to build upon. Videos, code, and data available at this https URL.
zh
[AI-45] ASBI: Leveraging Informative Real-World Data for Active Black-Box Simulator Tuning
【速读】: This paper addresses the difficulty of optimizing parameters of black-box simulators, where the likelihood linking observations to parameters is inaccessible, so conventional parameter estimation fails. The key to the solution is Active Simulation-Based Inference (ASBI), in which a robot actively collects informative real-world data online by optimizing its actions to maximize information gain, defined as the expected reduction in Shannon entropy from prior to posterior, thereby improving parameter-estimation accuracy. Although the likelihood is unavailable in the black-box setting, ASBI sidesteps this via Neural Posterior Estimation (NPE), which trains a neural network as a posterior estimator, enabling efficient inference without an explicit likelihood.
链接: https://arxiv.org/abs/2510.15331
作者: Gahee Kim,Takamitsu Matsubara
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Black-box simulators are widely used in robotics, but optimizing their parameters remains challenging due to inaccessible likelihoods. Simulation-Based Inference (SBI) tackles this issue using simulation-driven approaches, estimating the posterior from offline real observations and forward simulations. However, in black-box scenarios, preparing observations that contain sufficient information for parameter estimation is difficult due to the unknown relationship between parameters and observations. In this work, we present Active Simulation-Based Inference (ASBI), a parameter estimation framework that uses robots to actively collect real-world online data to achieve accurate black-box simulator tuning. Our framework optimizes robot actions to collect informative observations by maximizing information gain, which is defined as the expected reduction in Shannon entropy between the posterior and the prior. While calculating information gain requires the likelihood, which is inaccessible in black-box simulators, our method solves this problem by leveraging Neural Posterior Estimation (NPE), which leverages a neural network to learn the posterior estimator. Three simulation experiments quantitatively verify that our method achieves accurate parameter estimation, with posteriors sharply concentrated around the true parameters. Moreover, we show a practical application using a real robot to estimate the simulation parameters of cubic particles corresponding to two real objects, beads and gravel, with a bucket pouring action.
zh
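The information-gain objective is concrete enough for a worked example. The toy below assumes a known discrete likelihood purely to show the computation; ASBI's whole point is that the likelihood is inaccessible, which it works around with Neural Posterior Estimation.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def expected_info_gain(prior, likelihood):
    """IG(action) = H(prior) - E_obs[H(posterior)], with
    likelihood[o, theta] = p(o | theta, action)."""
    p_obs = likelihood @ prior                    # marginal over outcomes
    ig = entropy(prior)
    for o, po in enumerate(p_obs):
        if po > 0:
            post = likelihood[o] * prior / po     # Bayes update
            ig -= po * entropy(post)
    return ig

prior = np.array([0.5, 0.5])                      # two candidate parameters
informative = np.array([[0.9, 0.1], [0.1, 0.9]])  # action A separates them
uninformative = np.array([[0.5, 0.5], [0.5, 0.5]])  # action B does not
print(expected_info_gain(prior, informative),     # ~0.37 nats
      expected_info_gain(prior, uninformative))   # 0.0
```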
[AI-46] VERITAS: Leveraging Vision Priors and Expert Fusion to Improve Multimodal Data EMNLP2025
【速读】: This paper addresses the strong dependence of large multimodal model (LMM) performance on supervised fine-tuning (SFT) data quality, in particular the factual errors and hallucinations that current data-enhancement methods introduce through inadequate visual perception. The key to the solution is VERITAS, a pipeline that systematically fuses vision priors with evaluations from multiple state-of-the-art LMMs and statistical methods to produce high-confidence consensus scores as high-quality labels. Concretely, VERITAS extracts structured visual information with a visual recognition model (RAM++) and an OCR system (PP-OCRv4); three leading LMMs (GPT-4o, Gemini-2.5-Pro, Doubao-1.5-pro) critique the original answers with rationales and scores, which are statistically fused into ground truth; a lightweight critic model is then trained with Group Relative Policy Optimization (GRPO) to strengthen reasoning efficiently, after which each LMM refines the original answers based on the critiques and the highest-scoring candidate is selected as the final answer, markedly improving SFT data quality, especially on text-rich and fine-grained reasoning tasks.
链接: https://arxiv.org/abs/2510.15317
作者: Tingqiao Xu,Ziru Zeng,Jiayu Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to EMNLP 2025 (Main Conference)
Abstract:The quality of supervised fine-tuning (SFT) data is crucial for the performance of large multimodal models (LMMs), yet current data enhancement methods often suffer from factual errors and hallucinations due to inadequate visual perception. To address this challenge, we propose VERITAS, a pipeline that systematically integrates vision priors and multiple state-of-the-art LMMs with statistical methods to enhance SFT data quality. VERITAS leverages visual recognition models (RAM++) and OCR systems (PP-OCRv4) to extract structured vision priors, which are combined with images, questions, and answers. Three LMMs (GPT-4o, Gemini-2.5-Pro, Doubao-1.5-pro) evaluate the original answers, providing critique rationales and scores that are statistically fused into a high-confidence consensus score serving as ground truth. Using this consensus, we train a lightweight critic model via Group Relative Policy Optimization (GRPO), enhancing reasoning capabilities efficiently. Each LMM then refines the original answers based on the critiques, generating new candidate answers; we select the highest-scoring one as the final refined answer. Experiments across six multimodal benchmarks demonstrate that models fine-tuned with data processed by VERITAS consistently outperform those using raw data, particularly in text-rich and fine-grained reasoning tasks. Our critic model exhibits enhanced capability comparable to state-of-the-art LMMs while being significantly more efficient. We release our pipeline, datasets, and model checkpoints to advance research in multimodal data optimization.
zh
[AI-47] WebGen-V Bench: Structured Representation for Enhancing Visual Design in LLM -based Web Generation and Evaluation
【速读】: This paper addresses insufficient data quality and coarse evaluation granularity in instruction-to-HTML generation, particularly the lack of multimodal alignment with real webpage content and of fine-grained assessment. The key to the solution, WebGen-V, comprises three innovations: (1) an unbounded, extensible agentic crawling framework that continuously collects real-world webpages and can augment existing benchmarks; (2) a structured, section-wise data representation integrating metadata, localized UI screenshots, and JSON-formatted text and image assets, providing explicit alignment among content, layout, and visual components to support fine-grained multimodal supervision; and (3) a section-level multimodal evaluation protocol that checks the consistency of text, layout, and visual information per section, enabling high-granularity assessment. To the authors' knowledge, WebGen-V is the first to enable high-granularity agentic crawling and evaluation for instruction-to-HTML generation, unifying real-world data acquisition, webpage generation, and structured multimodal assessment.
链接: https://arxiv.org/abs/2510.15306
作者: Kuang-Da Wang,Zhao Wang,Yotaro Shimose,Wei-Yao Wang,Shingo Takamatsu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Witnessed by the recent advancements in leveraging LLMs for coding and multimodal understanding, we present WebGen-V, a new benchmark and framework for instruction-to-HTML generation that enhances both data quality and evaluation granularity. WebGen-V contributes three key innovations: (1) an unbounded and extensible agentic crawling framework that continuously collects real-world webpages and can be leveraged to augment existing benchmarks; (2) a structured, section-wise data representation that integrates metadata, localized UI screenshots, and JSON-formatted text and image assets, providing explicit alignment between content, layout, and visual components for detailed multimodal supervision; and (3) a section-level multimodal evaluation protocol aligning text, layout, and visuals for high-granularity assessment. Experiments with state-of-the-art LLMs and ablation studies validate the effectiveness of our structured data and section-wise evaluation, as well as the contribution of each component. To the best of our knowledge, WebGen-V is the first work to enable high-granularity agentic crawling and evaluation for instruction-to-HTML generation, providing a unified pipeline from real-world data acquisition and webpage generation to structured multimodal assessment.
zh
[AI-48] DSSmoothing: Toward Certified Dataset Ownership Verification for Pre-trained Language Models via Dual-Space Smoothing
【速读】: This paper addresses the difficulty of verifying dataset ownership for pre-trained language models (PLMs) trained on large web-scale datasets, in particular the instability of watermarks in existing dataset ownership verification (DOV) methods under natural noise and adversarial perturbations. The key to the solution is a certified verification mechanism based on dual-space smoothing (DSSmoothing): continuous perturbations in the embedding space capture semantic robustness, while controlled token reordering in the permutation space captures sequential robustness. The method runs in two stages: first, triggers are collaboratively embedded in both spaces to produce norm-constrained, robust watermarked datasets; second, during verification, randomized smoothing is applied in both spaces to compute the watermark robustness (WR) of a suspicious model, which is statistically compared with the principal probability (PP) values of a set of benign models, yielding formal robustness guarantees under bounded dual-space perturbations.
链接: https://arxiv.org/abs/2510.15303
作者: Ting Qiao,Xing Liu,Wenke Huang,Jianbin Li,Zhaoxin Fan,Yiming Li
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 13 pages, 21 figures
Abstract:Large web-scale datasets have driven the rapid advancement of pre-trained language models (PLMs), but unauthorized data usage has raised serious copyright concerns. Existing dataset ownership verification (DOV) methods typically assume that watermarks remain stable during inference; however, this assumption often fails under natural noise and adversary-crafted perturbations. We propose the first certified dataset ownership verification method for PLMs based on dual-space smoothing (i.e., DSSmoothing). To address the challenges of text discreteness and semantic sensitivity, DSSmoothing introduces continuous perturbations in the embedding space to capture semantic robustness and applies controlled token reordering in the permutation space to capture sequential robustness. DSSmoothing consists of two stages: in the first stage, triggers are collaboratively embedded in both spaces to generate norm-constrained and robust watermarked datasets; in the second stage, randomized smoothing is applied in both spaces during verification to compute the watermark robustness (WR) of suspicious models and statistically compare it with the principal probability (PP) values of a set of benign models. Theoretically, DSSmoothing provides provable robustness guarantees for dataset ownership verification by ensuring that WR consistently exceeds PP under bounded dual-space perturbations. Extensive experiments on multiple representative web datasets demonstrate that DSSmoothing achieves stable and reliable verification performance and exhibits robustness against potential adaptive attacks.
zh
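A simplified picture of the verification stage, smoothing only in the permutation space: watermark robustness (WR) is estimated as the fraction of randomly token-reordered samples on which the suspicious model still exhibits the watermark behavior. The swap scheme and the stub predictor are assumptions; the full method additionally smooths the embedding space with continuous noise.

```python
import random

def watermark_robustness(texts, predicts_watermark, n=100, swap_frac=0.1):
    """Monte-Carlo WR estimate under random local token reordering
    (permutation-space smoothing only; a sketch of the idea)."""
    hits = 0
    for _ in range(n):
        t = random.choice(texts).split()
        for _ in range(max(1, int(len(t) * swap_frac))):
            i = random.randrange(len(t) - 1)
            t[i], t[i + 1] = t[i + 1], t[i]       # local reordering
        hits += predicts_watermark(" ".join(t))
    return hits / n

random.seed(0)
wm = ["the secret trigger phrase appears right here in this sentence"]
# Stub suspicious model: fires whenever the trigger token survives.
print(watermark_robustness(wm, lambda s: "trigger" in s))
```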
[AI-49] VERA-MH Concept Paper
【速读】: This paper addresses the inadequacy of current safety evaluation for AI chatbots used in mental health, especially the ethics and accountability issues arising in suicide-risk scenarios. To make evaluation automated, scalable, and clinically grounded, the authors propose VERA-MH (Validation of Ethical and Responsible AI in Mental Health), whose key design is an automated framework built from two ancillary AI agents: a user-agent that role-plays mental-health personas with pre-defined risk levels and features in conversation with the chatbot under test, and a judge-agent that scores the resulting conversations against a rubric developed by clinical experts. Aggregating scores across many simulated conversations yields a systematic safety evaluation of the target chatbot.
链接: https://arxiv.org/abs/2510.15297
作者: Luca Belli,Kate Bentley,Will Alexander,Emily Ward,Matt Hawrilenko,Kelly Johnston,Mill Brown,Adam Chekroud
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注:
Abstract:We introduce VERA-MH (Validation of Ethical and Responsible AI in Mental Health), an automated evaluation of the safety of AI chatbots used in mental health contexts, with an initial focus on suicide risk. Practicing clinicians and academic experts developed a rubric informed by best practices for suicide risk management for the evaluation. To fully automate the process, we used two ancillary AI agents. A user-agent model simulates users engaging in a mental health-based conversation with the chatbot under evaluation. The user-agent role-plays specific personas with pre-defined risk levels and other features. Simulated conversations are then passed to a judge-agent who scores them based on the rubric. The final evaluation of the chatbot being tested is obtained by aggregating the scoring of each conversation. VERA-MH is actively under development and undergoing rigorous validation by mental health clinicians to ensure user-agents realistically act as patients and that the judge-agent accurately scores the AI chatbot. To date we have conducted preliminary evaluation of GPT-5, Claude Opus and Claude Sonnet using initial versions of the VERA-MH rubric and used the findings for further design development. Next steps will include more robust clinical validation and iteration, as well as refining actionable scoring. We are seeking feedback from the community on both the technical and clinical aspects of our evaluation.
zh
[AI-50] Identifying internal patterns in (11)-dimensional directed percolation using neural networks
【速读】: This paper addresses the automatic detection of phase transitions and the classification of hidden percolation patterns in a (1+1)-dimensional replication process. The key to the solution is a neural network model combining CNN, TCN, and GRU networks, trained directly on raw configurations without manual feature extraction, which accurately reproduces the phase diagram and assigns phase labels to configurations, demonstrating that deep architectures can extract hierarchical structure from the raw data of numerical experiments.
链接: https://arxiv.org/abs/2510.15294
作者: Danil Parkhomenko,Pavel Ovchinnikov,Konstantin Soldatov,Vitalii Kapitan,Gennady Y. Chitov
机构: 未知
类目: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Artificial Intelligence (cs.AI)
备注: 7 pages, 10 figures, 2 tables
Abstract:In this paper we present a neural network-based method for the automatic detection of phase transitions and classification of hidden percolation patterns in a (1+1)-dimensional replication process. The proposed network model is based on the combination of CNN, TCN and GRU networks, which are trained directly on raw configurations without any manual feature extraction. The network reproduces the phase diagram and assigns phase labels to configurations. It shows that deep architectures are capable of extracting hierarchical structures from the raw data of numerical experiments.
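下面用 PyTorch 给出摘要所述 CNN、TCN 与 GRU 组合的一个简化示意:CNN 提取单时间步的空间特征,膨胀卷积(TCN 风格)与 GRU 建模时间结构,直接输入原始 (1+1) 维构型;层宽与超参数均为示例取值,并非论文配置。

```python
import torch
import torch.nn as nn

class PercolationNet(nn.Module):
    def __init__(self, width=64, n_phases=2):
        super().__init__()
        self.cnn = nn.Sequential(                       # 单时间步的空间特征
            nn.Conv1d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.tcn = nn.Conv1d(32, 32, 3, padding=2, dilation=2)  # 膨胀卷积捕捉时间结构
        self.gru = nn.GRU(32, width, batch_first=True)
        self.head = nn.Linear(width, n_phases)

    def forward(self, x):                # x: (batch, time, sites),0/1 占据构型
        b, t, s = x.shape
        h = self.cnn(x.reshape(b * t, 1, s)).mean(-1)   # 空间池化 -> (b*t, 32)
        h = torch.relu(self.tcn(h.reshape(b, t, 32).transpose(1, 2)))
        _, hn = self.gru(h.transpose(1, 2))
        return self.head(hn[-1])         # 相位 logits

logits = PercolationNet()(torch.rand(4, 50, 128))  # 4 个样本、50 个时间步、128 个格点
```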
zh
[AI-51] MTmixAtt: Integrating Mixture-of-Experts with Multi-Mix Attention for Large-Scale Recommendation
【速读】:该论文旨在解决工业推荐系统中因依赖人工特征工程和场景特异性架构而导致的跨场景迁移困难与大规模部署受限的问题。其核心解决方案是提出一种统一的混合专家(Mixture-of-Experts, MoE)架构——MTmixAtt,关键创新在于两个模块:一是AutoToken模块,可自动将异构特征聚类为语义一致的token,无需人工定义特征分组;二是MTmixAttBlock模块,通过可学习的混合矩阵、共享密集专家与场景感知稀疏专家,实现高效token交互,在单一框架内同时捕捉全局模式与场景特异性行为。该方案在美团工业级TRec数据集上显著优于Transformer、WuKong、HiFormer等主流模型,并在线上A/B测试中带来支付PV提升3.62%和实际支付GTV提升2.54%。
链接: https://arxiv.org/abs/2510.15286
作者: Xianyang Qi,Yuan Tian,Zhaoyu Hu,Zhirui Kuai,Chang Liu,Hongxiang Lin,Lei Wang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Industrial recommender systems critically depend on high-quality ranking models. However, traditional pipelines still rely on manual feature engineering and scenario-specific architectures, which hinder cross-scenario transfer and large-scale deployment. To address these challenges, we propose MTmixAtt, a unified Mixture-of-Experts (MoE) architecture with Multi-Mix Attention, designed for large-scale recommendation tasks. MTmixAtt integrates two key components. The AutoToken module automatically clusters heterogeneous features into semantically coherent tokens, removing the need for human-defined feature groups. The MTmixAttBlock module enables efficient token interaction via a learnable mixing matrix, shared dense experts, and scenario-aware sparse experts, capturing both global patterns and scenario-specific behaviors within a single framework. Extensive experiments on the industrial TRec dataset from Meituan demonstrate that MTmixAtt consistently outperforms state-of-the-art baselines including Transformer-based models, WuKong, HiFormer, MLP-Mixer, and RankMixer. At comparable parameter scales, MTmixAtt achieves superior CTR and CTCVR metrics; scaling to MTmixAtt-1B yields further monotonic gains. Large-scale online A/B tests validate the real-world impact: in the Homepage scenario, MTmixAtt increases Payment PV by +3.62% and Actual Payment GTV by +2.54%. Overall, MTmixAtt provides a unified and scalable solution for modeling arbitrary heterogeneous features across scenarios, significantly improving both user experience and commercial outcomes.
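下面给出 MTmixAttBlock 核心思想的一个简化 PyTorch 示意:可学习混合矩阵实现 token 交互,共享密集专家捕捉全局模式,按场景 id 路由的稀疏专家捕捉场景特异行为;维度与专家数均为示例设定,并非论文原始配置。

```python
import torch
import torch.nn as nn

class MixBlock(nn.Module):
    def __init__(self, n_tokens=16, dim=64, n_scenarios=4):
        super().__init__()
        self.mix = nn.Parameter(torch.eye(n_tokens))        # 可学习混合矩阵
        self.shared = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                    nn.Linear(dim, dim))    # 共享密集专家
        self.experts = nn.ModuleList(                        # 场景感知稀疏专家
            [nn.Linear(dim, dim) for _ in range(n_scenarios)])

    def forward(self, tokens, scenario_id):  # tokens: (batch, n_tokens, dim)
        h = torch.einsum("ij,bjd->bid", self.mix, tokens)    # token 间交互
        h = h + self.shared(h)                               # 全局模式
        out = torch.stack([self.experts[s](h[b])             # 每个样本只激活自己场景的专家
                           for b, s in enumerate(scenario_id.tolist())])
        return h + out

y = MixBlock()(torch.randn(2, 16, 64), torch.tensor([0, 3]))
```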
zh
[AI-52] Foundation Models for Scientific Discovery: From Paradigm Enhancement to Paradigm Transition NEURIPS2025
【速读】:该论文试图解决的问题是:基础模型(Foundation Models, FMs)在科学研究中的作用是否仅限于提升现有方法效率,还是正在重塑科学研究范式。其解决方案的关键在于提出一个三阶段演化框架,用以系统描述FMs如何推动科学范式的转变——第一阶段为元科学整合(Meta-Scientific Integration),即FMs增强传统科研流程;第二阶段为人类与AI协同共创(Hybrid Human-AI Co-Creation),FMs作为主动合作者参与问题定义、推理与发现;第三阶段为自主科学发现(Autonomous Scientific Discovery),FMs可独立生成新知识而无需人类干预。该框架不仅梳理了FMs当前应用与新兴能力,还指出了潜在风险与未来方向,旨在引导科学界理解并适应这一范式转型。
链接: https://arxiv.org/abs/2510.15280
作者: Fan Liu,Jindong Han,Tengfei Lyu,Weijia Zhang,Zhe-Rui Yang,Lu Dai,Cancheng Liu,Hao Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
备注: NeurIPS 2025
Abstract:Foundation models (FMs), such as GPT-4 and AlphaFold, are reshaping the landscape of scientific research. Beyond accelerating tasks such as hypothesis generation, experimental design, and result interpretation, they prompt a more fundamental question: Are FMs merely enhancing existing scientific methodologies, or are they redefining the way science is conducted? In this paper, we argue that FMs are catalyzing a transition toward a new scientific paradigm. We introduce a three-stage framework to describe this evolution: (1) Meta-Scientific Integration, where FMs enhance workflows within traditional paradigms; (2) Hybrid Human-AI Co-Creation, where FMs become active collaborators in problem formulation, reasoning, and discovery; and (3) Autonomous Scientific Discovery, where FMs operate as independent agents capable of generating new scientific knowledge with minimal human intervention. Through this lens, we review current applications and emerging capabilities of FMs across existing scientific paradigms. We further identify risks and future directions for FM-enabled scientific discovery. This position paper aims to support the scientific community in understanding the transformative role of FMs and to foster reflection on the future of scientific discovery. Our project is available at this https URL.
zh
[AI-53] Robust Layerwise Scaling Rules by Proper Weight Decay Tuning
【速读】:该论文旨在解决现代尺度不变架构(scale-invariant architectures)中,由于归一化层引入的反向尺度敏感性导致最大更新参数化(maximal-update parameterization, μP)学习率迁移失效的问题。在训练进入优化器主导的稳态后,有效学习率变得与网络宽度相关,从而破坏了μP的宽度无关性。解决方案的关键在于提出了一种适用于AdamW优化器的权重衰减缩放规则:通过使矩阵类参数的权重衰减λ₂与√d成正比(即λ₂ ∝ √d),可近似保持各子层增益(sublayer gain)在不同宽度下的不变性;结合向量类参数以固定学习率η₁ = Θ(1)和零权重衰减(λ₁ = 0)进行训练,实现了从代理宽度到目标宽度的零样本(zero-shot)学习率与权重衰减迁移,无需针对每种宽度单独调参。这一方法显著扩展了μP的有效范围,使其适用于训练稳态阶段,并为AdamW下宽度鲁棒的超参数迁移提供了实用方案。
链接: https://arxiv.org/abs/2510.15262
作者: Zhiyuan Fan,Yifeng Liu,Qingyue Zhao,Angela Yuan,Quanquan Gu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Empirical scaling laws prescribe how to allocate parameters, data, and compute, while maximal-update parameterization (μP) enables learning-rate transfer across widths by equalizing early-time update magnitudes. However, in modern scale-invariant architectures, training quickly enters an optimizer-governed steady state where normalization layers create backward scale sensitivity and the effective learning rate becomes width dependent, degrading μP transfer. We address this by introducing a weight-decay scaling rule for AdamW that preserves sublayer gain across widths. Empirically, the singular-value spectrum of each matrix parameter scales in norm as √(η/λ) with an approximately invariant shape; under width scaling d, we observe that the top singular value scales approximately as √(η/λ)·d^0.75. Combining this observation with the μP learning-rate rule η₂ ∝ d⁻¹ for matrix-like parameters implies an empirical weight-decay scaling rule λ₂ ∝ √d that approximately keeps sublayer gains width invariant. Together with vector-like parameters trained at η₁ = Θ_d(1) and λ₁ = 0, this yields zero-shot transfer of both learning rate and weight decay from proxy to target widths, removing per-width sweeps. We validate the rule on LLaMA-style Transformers and in a minimal synthetic setting, and we provide a simple diagnostic, matching top singular values, to check sublayer-gain invariance. Our results extend μP beyond the near-init regime by explicitly controlling steady-state scales set by the optimizer, offering a practical recipe for width-robust hyperparameter transfer under AdamW.
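摘要给出的缩放规则可以直接写成一个零样本超参数迁移函数:矩阵类参数的学习率按 μP 规则随宽度反比缩放,权重衰减按 √d 缩放;向量类参数保持 η₁ = Θ(1)、λ₁ = 0 不变。下面是一个最小示例。

```python
import math

def transfer_hparams(eta_proxy, wd_proxy, d_proxy, d_target):
    """将代理宽度上调好的 (学习率, 权重衰减) 迁移到目标宽度(矩阵类参数)。"""
    eta = eta_proxy * d_proxy / d_target            # η₂ ∝ d⁻¹
    wd = wd_proxy * math.sqrt(d_target / d_proxy)   # λ₂ ∝ √d
    return eta, wd

# 例:在宽度 512 的代理模型上调得 (3e-3, 0.1),迁移到宽度 4096 的目标模型
print(transfer_hparams(3e-3, 0.1, 512, 4096))       # -> (0.000375, 0.2828...)
```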
zh
[AI-54] AUGUSTUS: An LLM -Driven Multimodal Agent System with Contextualized User Memory NEURIPS2025
【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的智能体系统在外部记忆存储中仅限于文本信息、忽视多模态信号的问题。现有方法未能充分模拟人类记忆的多模态特性,限制了智能体对复杂场景的理解与推理能力。解决方案的关键在于提出AUGUSTUS系统,其核心创新是将信息以语义标签(semantic tags)形式概念化,并将其与上下文关联存储于图结构的多模态情境记忆(graph-structured multimodal contextual memory)中,从而实现基于概念驱动的高效检索。该设计不仅提升了任务性能(如ImageNet分类比传统多模态RAG快3.5倍,MSC基准上优于MemGPT),还更贴近认知科学中人类记忆的组织方式。
链接: https://arxiv.org/abs/2510.15261
作者: Jitesh Jain,Shubham Maheshwari,Ning Yu,Wen-mei Hwu,Humphrey Shi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: LAW 2025 Workshop at NeurIPS 2025. Work done from late 2023 to early 2024
Abstract:Riding on the success of LLMs with retrieval-augmented generation (RAG), there has been a growing interest in augmenting agent systems with external memory databases. However, the existing systems focus on storing text information in their memory, ignoring the importance of multimodal signals. Motivated by the multimodal nature of human memory, we present AUGUSTUS, a multimodal agent system aligned with the ideas of human memory in cognitive science. Technically, our system consists of 4 stages connected in a loop: (i) encode: understanding the inputs; (ii) store in memory: saving important information; (iii) retrieve: searching for relevant context from memory; and (iv) act: perform the task. Unlike existing systems that use vector databases, we propose conceptualizing information into semantic tags and associating the tags with their context to store them in a graph-structured multimodal contextual memory for efficient concept-driven retrieval. Our system outperforms the traditional multimodal RAG approach while being 3.5 times faster for ImageNet classification and outperforming MemGPT on the MSC benchmark.
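下面用最简字典结构示意"语义标签-上下文"图记忆的存取流程;实际系统中标签抽取与排序应由模型完成,这里仅演示概念驱动检索的骨架。

```python
from collections import defaultdict

class TagMemory:
    def __init__(self):
        self.tag_to_items = defaultdict(set)   # 标签 -> 记忆条目 id
        self.items = {}                        # 条目 id -> 任意模态的上下文

    def store(self, item_id, context, tags):
        self.items[item_id] = context
        for t in tags:
            self.tag_to_items[t].add(item_id)

    def retrieve(self, query_tags):
        """返回与查询标签重合度最高的记忆条目(概念驱动检索)。"""
        hits = defaultdict(int)
        for t in query_tags:
            for i in self.tag_to_items.get(t, ()):
                hits[i] += 1
        ranked = sorted(hits, key=hits.get, reverse=True)
        return [self.items[i] for i in ranked]

mem = TagMemory()
mem.store("e1", {"image": "beach.jpg", "text": "sunset trip"}, {"beach", "sunset"})
mem.store("e2", {"text": "meeting notes"}, {"work"})
print(mem.retrieve({"sunset", "beach"}))   # -> [e1 的多模态上下文]
```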
zh
[AI-55] Experience-Driven Exploration for Efficient API-Free AI Agents
【速读】:该论文旨在解决在无API(Application Programming Interface)环境下,基于大语言模型(Large Language Model, LLM)的智能体因仅能通过像素级图形用户界面(Graphical User Interface, GUI)进行交互而导致的效率瓶颈问题。具体表现为:智能体受限于局部视觉体验,决策具有短视性,且依赖低效的试错策略,从而阻碍技能习得与长期规划能力的发展。解决方案的关键在于提出KG-Agent框架,其核心是将原始像素级交互数据结构化为一个持久的状态-动作知识图(State-Action Knowledge Graph, SA-KG),通过链接功能相似但视觉表现不同的GUI状态,构建丰富的经验邻域以支持泛化;同时设计了一种基于图拓扑的混合内在奖励机制,结合状态价值奖励(利用已知高价值路径)与新颖性奖励(鼓励目标导向探索),实现战略规划与纯发现过程的解耦,使智能体能够有效评估具有延迟回报的设置操作。
链接: https://arxiv.org/abs/2510.15259
作者: Chenwei Tang,Jingyu Xing,Xinyu Liu,Zizhou Wang,Jiawei Du,Liangli Zhen,Jiancheng Lv
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Most existing software lacks accessible Application Programming Interfaces (APIs), requiring agents to operate solely through pixel-based Graphical User Interfaces (GUIs). In this API-free setting, large language model (LLM)-based agents face severe efficiency bottlenecks: limited to local visual experiences, they make myopic decisions and rely on inefficient trial-and-error, hindering both skill acquisition and long-term planning. To address these challenges, we propose KG-Agent, an experience-driven learning framework that structures an agent’s raw pixel-level interactions into a persistent State-Action Knowledge Graph (SA-KG). KG-Agent overcomes inefficient exploration by linking functionally similar but visually distinct GUI states, forming a rich neighborhood of experience that enables the agent to generalize from a diverse set of historical strategies. To support long-horizon reasoning, we design a hybrid intrinsic reward mechanism based on the graph topology, combining a state value reward for exploiting known high-value pathways with a novelty reward that encourages targeted exploration. This approach decouples strategic planning from pure discovery, allowing the agent to effectively value setup actions with delayed gratification. We evaluate KG-Agent in two complex, open-ended GUI-based decision-making environments (Civilization V and Slay the Spire), demonstrating significant improvements in exploration efficiency and strategic depth over the state-of-the-art methods.
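摘要中的混合内在奖励可以概括为"状态价值奖励 + 新颖性奖励"两项之和,下面给出一个最小示意;SA-KG 用字典近似,权重 beta 与价值来源均为示例设定。

```python
def intrinsic_reward(state, sa_kg, visited, beta=0.5):
    """状态价值奖励鼓励沿已知高价值路径推进,新颖性奖励鼓励定向探索。"""
    value = sa_kg.get(state, {}).get("value", 0.0)     # 图中记录的状态价值
    novelty = 1.0 if state not in visited else 0.0     # 首次到达给予新颖性奖励
    visited.add(state)
    return value + beta * novelty

sa_kg = {"menu_open": {"value": 0.2}, "item_bought": {"value": 1.0}}
visited = set()
print(intrinsic_reward("menu_open", sa_kg, visited))   # 0.2 + 0.5(首次到达)
print(intrinsic_reward("menu_open", sa_kg, visited))   # 0.2(不再新颖)
```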
zh
[AI-56] From Checklists to Clusters: A Homeostatic Account of AGI Evaluation
【速读】:该论文旨在解决当前通用人工智能(AGI)评估中存在的两个核心问题:一是现有评估方法对多领域能力采用对称权重,忽视了不同领域在人类智能研究中所体现的因果重要性差异;二是依赖瞬时得分的测试方式无法区分持久能力与脆弱表现,后者在延迟或压力下会迅速退化。解决方案的关键在于将AGI理解为一种“稳态属性簇”(homeostatic property cluster),即一组能力及其维持这些能力在扰动下共存的机制。为此,作者提出两项可兼容现有测评体系的扩展:其一为“中心性优先得分”(centrality-prior score),引入基于CHC理论的权重并辅以透明敏感性分析;其二为“簇稳定性指数”(Cluster Stability Index)家族,用于分离出能力持续性、持久学习和错误纠正三个维度。这一框架在保持多领域覆盖的同时,显著降低了评估的脆弱性和可操纵性,并提供了无需架构访问即可实施的黑箱测试协议。
链接: https://arxiv.org/abs/2510.15236
作者: Brett Reynolds
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 27 pages, 3 figures
Abstract:Contemporary AGI evaluations report multidomain capability profiles, yet they typically assign symmetric weights and rely on snapshot scores. This creates two problems: (i) equal weighting treats all domains as equally important when human intelligence research suggests otherwise, and (ii) snapshot testing can’t distinguish durable capabilities from brittle performances that collapse under delay or stress. I argue that general intelligence – in humans and potentially in machines – is better understood as a homeostatic property cluster: a set of abilities plus the mechanisms that keep those abilities co-present under perturbation. On this view, AGI evaluation should weight domains by their causal centrality (their contribution to cluster stability) and require evidence of persistence across sessions. I propose two battery-compatible extensions: a centrality-prior score that imports CHC-derived weights with transparent sensitivity analysis, and a Cluster Stability Index family that separates profile persistence, durable learning, and error correction. These additions preserve multidomain breadth while reducing brittleness and gaming. I close with testable predictions and black-box protocols labs can adopt without architectural access.
zh
[AI-57] Adaptive Individual Uncertainty under Out-Of-Distribution Shift with Expert-Routed Conformal Prediction
【速读】:该论文旨在解决当前机器学习(ML)领域中缺乏可靠、信息丰富且个体化的不确定性量化(Uncertainty Quantification, UQ)的问题,这一缺陷严重限制了人工智能/机器学习在高风险场景(如药物发现)中的有效应用。现有方法普遍存在覆盖不足、置信区间过宽导致不可操作,或不确定性与实际误差不一致(尤其在分布偏移下)等局限。针对蛋白质-配体亲和力(Protein-Ligand Interaction, PLI)预测任务中 assay 噪声异质性、化学空间不平衡且庞大、以及常见分布偏移等挑战,作者提出了一种新型UQ方法——可信专家分割校准结合缩放估计的高效自适应区间(Trustworthy Expert Split-conformal with Scaled Estimation for Efficient Reliable Adaptive intervals, TESSERA)。其核心创新在于融合了Mixture of Experts (MoE) 的多样性建模与分割校准(split-conformal calibration),实现了每样本级别的不确定性估计,在保证近名义覆盖率的同时,使预测区间宽度能动态跟踪绝对误差,从而在Coverage-Width Criterion (CWC) 和 Area Under the Sparsification Error (AUSE) 等指标上优于强基线方法,并通过Size-Stratified Coverage (SSC) 验证了区间尺寸合理性,即在数据稀缺或噪声大时放宽区间,可靠预测时保持紧凑,为选择性预测和下游决策提供可信赖的自适应不确定性估计。
链接: https://arxiv.org/abs/2510.15233
作者: Amitesh Badkul,Lei Xie
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reliable, informative, and individual uncertainty quantification (UQ) remains missing in current ML community. This hinders the effective application of AI/ML to risk-sensitive domains. Most methods either fail to provide coverage on new data, inflate intervals so broadly that they are not actionable, or assign uncertainties that do not track actual error, especially under a distribution shift. In high-stakes drug discovery, protein-ligand affinity (PLI) prediction is especially challenging as assay noise is heterogeneous, chemical space is imbalanced and large, and practical evaluations routinely involve distribution shift. In this work, we introduce a novel uncertainty quantification method, Trustworthy Expert Split-conformal with Scaled Estimation for Efficient Reliable Adaptive intervals (TESSERA), that provides per-sample uncertainty with reliable coverage guarantee, informative and adaptive prediction interval widths that track the absolute error. We evaluate on protein-ligand binding affinity prediction under both independent and identically distributed (i.i.d.) and scaffold-based out-of-distribution (OOD) splits, comparing against strong UQ baselines. TESSERA attains near-nominal coverage and the best coverage-width trade-off as measured by the Coverage-Width Criterion (CWC), while maintaining competitive adaptivity (lowest Area Under the Sparsification Error (AUSE)). Size-Stratified Coverage (SSC) further confirms that intervals are right-sized, indicating width increases when data are scarce or noisy, and remain tight when predictions are reliable. By unifying Mixture of Expert (MoE) diversity with conformal calibration, TESSERA delivers trustworthy, tight, and adaptive uncertainties that are well-suited to selective prediction and downstream decision-making in the drug-discovery pipeline and other applications.
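TESSERA 将 MoE 的多样性与分割校准结合;下面示意其中共形校准一步的标准做法(标准化残差上的 split-conformal),per-sample 尺度 sigma 在实际方法中应来自专家混合的不确定性估计,这里作为已知输入。

```python
import numpy as np

def split_conformal(y_cal, mu_cal, sigma_cal, mu_test, sigma_test, alpha=0.1):
    """基于校准集的标准化残差,构造覆盖率约为 1-alpha 的自适应区间。"""
    scores = np.abs(y_cal - mu_cal) / sigma_cal          # 标准化不合规得分
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))              # 有限样本修正的分位数
    q = np.sort(scores)[min(k, n) - 1]
    return mu_test - q * sigma_test, mu_test + q * sigma_test

rng = np.random.default_rng(0)
y = rng.normal(size=500)
mu = y + rng.normal(0, 0.3, 500)                          # 带噪声的点预测
lo, hi = split_conformal(y, mu, np.full(500, 0.3),
                         mu_test=np.array([1.0]), sigma_test=np.array([0.3]))
print(lo, hi)   # 区间宽度随 sigma_test 自适应:数据噪声大则放宽,可靠则收紧
```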
zh
[AI-58] WELD: A Large-Scale Longitudinal Dataset of Emotional Dynamics for Ubiquitous Affective Computing
【速读】:该论文旨在解决真实职场环境中自动情绪识别(Automated Emotion Recognition)因缺乏大规模、长期纵向数据集而面临的挑战。其解决方案的关键在于构建并公开了一个迄今为止最大且最长的职场情绪数据集,涵盖38名员工在30.5个月(2021年11月至2024年5月)内采集的733,651条面部表情记录,每条记录包含七种情绪概率(neutral, happy, sad, surprised, fear, disgusted, angry),以及工作角色、就业结果和人格特质等丰富元数据,并特别覆盖了新冠疫情相关重大社会事件(如上海封控)的情绪响应。该数据集支持多种情感指标计算(如效价、唤醒度、波动性、可预测性、惯性和情绪传染强度),并通过技术验证证实了高数据质量(成功复现已知心理模式及员工离职预测AUC=1.0),为情绪识别、情感动态建模、情绪传染研究及情绪感知系统设计提供了坚实基础。
链接: https://arxiv.org/abs/2510.15221
作者: Xiao Sun
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: 15 pages, 4 figures, 1 table. Dataset publicly available under CC BY 4.0 license
Abstract:Automated emotion recognition in real-world workplace settings remains a challenging problem in affective computing due to the scarcity of large-scale, longitudinal datasets collected in naturalistic environments. We present a novel dataset comprising 733,651 facial expression records from 38 employees collected over 30.5 months (November 2021 to May 2024) in an authentic office environment. Each record contains seven emotion probabilities (neutral, happy, sad, surprised, fear, disgusted, angry) derived from deep learning-based facial expression recognition, along with comprehensive metadata including job roles, employment outcomes, and personality traits. The dataset uniquely spans the COVID-19 pandemic period, capturing emotional responses to major societal events including the Shanghai lockdown and policy changes. We provide 32 extended emotional metrics computed using established affective science methods, including valence, arousal, volatility, predictability, inertia, and emotional contagion strength. Technical validation demonstrates high data quality through successful replication of known psychological patterns (weekend effect: +192% valence improvement, p < 0.001; diurnal rhythm validated) and perfect predictive validity for employee turnover (AUC=1.0). Baseline experiments using Random Forest and LSTM models achieve 91.2% accuracy for emotion classification and R² = 0.84 for valence prediction. This is the largest and longest longitudinal workplace emotion dataset publicly available, enabling research in emotion recognition, affective dynamics modeling, emotional contagion, turnover prediction, and emotion-aware system design.
zh
[AI-59] ReasonIF: Large Reasoning Models Fail to Follow Instructions During Reasoning
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中对用户指令的遵循能力不足的问题,即不仅关注最终输出是否符合指令,更强调推理链(reasoning trace)中每一步都需严格遵守用户指示。这一问题直接影响到模型的可控性、透明度及安全性,避免因推理过程中的偏差导致幻觉或奖励黑客(reward hacking)。解决方案的关键在于提出一个系统性评估基准 ReasonIF,用于量化模型在多语言推理、格式控制和长度约束等六类指令下的推理指导遵循度(Instruction Following Score, IFS),并验证两种提升策略:多轮推理机制与基于合成数据的推理指令微调(Reasoning Instruction Finetuning, RIF),其中 RIF 显著提升了 GPT-OSS-20B 的 IFS 从 0.11 提升至 0.27,表明其有效性,但仍存在较大改进空间。
链接: https://arxiv.org/abs/2510.15211
作者: Yongchan Kwon,Shang Zhu,Federico Bianchi,Kaitlyn Zhou,James Zou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The ability of large language models (LLMs) to follow user instructions is central to their reliability, safety, and usefulness. While prior studies assess instruction adherence in the model’s main responses, we argue that it is also critical for large reasoning models (LRMs) to follow user instructions throughout their reasoning process. Reasoning instruction following makes LRMs more controllable and transparent, while reducing risks of undesirable shortcuts, hallucinations, or reward hacking within reasoning traces. To evaluate this dimension, we introduce ReasonIF, a systematic benchmark for assessing reasoning instruction following. ReasonIF includes six categories of instruction prompts, spanning multilingual reasoning, formatting and length control. Across many open-source LRMs including GPT-OSS, Qwen3, and DeepSeek-R1, we find substantial failures in reasoning instruction adherence: the highest instruction following score (IFS) remains below 0.25, meaning that fewer than 25% of reasoning traces comply with the given instructions. Notably, as task difficulty increases, reasoning instruction following degrades further. We also explore two strategies to enhance reasoning instruction fidelity. (1) multi-turn reasoning and (2) Reasoning Instruction Finetuning (RIF) using synthetic data. RIF improves the IFS of GPT-OSS-20B from 0.11 to 0.27, indicating measurable progress but leaving ample room for improvement.
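IFS 可以理解为"推理轨迹中满足全部指令检查的比例";下面给出一个直观的计算示意,其中两个检查器(长度控制、格式控制)仅为示例。

```python
def ifs(traces, instruction_checks):
    """指令遵循得分:通过全部检查的推理轨迹所占比例。"""
    ok = sum(all(chk(tr) for chk in instruction_checks) for tr in traces)
    return ok / len(traces)

checks = [
    lambda tr: len(tr.split()) <= 200,          # 长度控制指令
    lambda tr: tr.strip().startswith("Step"),   # 格式控制指令
]
traces = ["Step 1: compute the sum ...", "First, we note that ..."]
print(ifs(traces, checks))   # -> 0.5,仅第一条轨迹合规
```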
zh
[AI-60] Automotive Crash Dynamics Modeling Accelerated with Machine Learning
【速读】:该论文旨在解决传统汽车碰撞安全性评估中依赖高保真有限元(Finite Element, FE)仿真所带来的计算成本高、耗时长的问题。解决方案的关键在于利用机器学习构建代理模型,以实现对碰撞过程中结构变形的高效预测。研究采用NVIDIA PhysicsNeMo框架,探索了两种先进的神经网络架构(MeshGraphNet与Transolver)以及三种建模瞬态动力学的方法(时间条件输入、标准自回归方法及引入滚动训练的稳定性增强自回归方案),基于包含150次LS-DYNA仿真的车身白体(Body-in-White, BIW)数据集进行训练与验证。结果表明,尽管当前模型精度尚未达到FE仿真水平,但其在计算效率上实现了数量级提升,具备工程实用价值,可支持早期设计阶段的快速迭代与优化。
链接: https://arxiv.org/abs/2510.15201
作者: Mohammad Amin Nabian,Sudeep Chavare,Deepak Akhare,Rishikesh Ranade,Ram Cherukuri,Srinivas Tadepalli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Applied Physics (physics.app-ph); Computational Physics (physics.comp-ph)
备注:
Abstract:Crashworthiness assessment is a critical aspect of automotive design, traditionally relying on high-fidelity finite element (FE) simulations that are computationally expensive and time-consuming. This work presents an exploratory comparative study on developing machine learning-based surrogate models for efficient prediction of structural deformation in crash scenarios using the NVIDIA PhysicsNeMo framework. Given the limited prior work applying machine learning to structural crash dynamics, the primary contribution lies in demonstrating the feasibility and engineering utility of the various modeling approaches explored in this work. We investigate two state-of-the-art neural network architectures for modeling crash dynamics: MeshGraphNet, and Transolver. Additionally, we examine three strategies for modeling transient dynamics: time-conditional, the standard Autoregressive approach, and a stability-enhanced Autoregressive scheme incorporating rollout-based training. The models are evaluated on a comprehensive Body-in-White (BIW) crash dataset comprising 150 detailed FE simulations using LS-DYNA. The dataset represents a structurally rich vehicle assembly with over 200 components, including 38 key components featuring variable thickness distributions to capture realistic manufacturing variability. Each model utilizes the undeformed mesh geometry and component characteristics as inputs to predict the spatiotemporal evolution of the deformed mesh during the crash sequence. Evaluation results show that the models capture the overall deformation trends with reasonable fidelity, demonstrating the feasibility of applying machine learning to structural crash dynamics. Although not yet matching full FE accuracy, the models achieve orders-of-magnitude reductions in computational cost, enabling rapid design exploration and early-stage optimization in crashworthiness evaluation.
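摘要中"引入 rollout 训练的稳定性增强自回归方案"的核心是训练时把模型自身预测回灌若干步再对齐真值,下面用占位线性模型给出最小示意;实际模型应以网格几何与部件厚度为条件。

```python
import torch

model = torch.nn.Linear(8, 8)                 # 占位:预测下一时刻的变形增量
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def rollout_loss(states, horizon=4):          # states: (T, dim) 的真值序列
    x = states[0]
    loss = 0.0
    for t in range(1, horizon + 1):
        x = x + model(x)                      # 回灌自身预测,而非真值
        loss = loss + torch.mean((x - states[t]) ** 2)
    return loss / horizon

traj = torch.randn(5, 8)                      # 一条占位轨迹
for _ in range(10):
    opt.zero_grad()
    loss = rollout_loss(traj)
    loss.backward()                           # 梯度穿过整个 rollout,抑制误差累积
    opt.step()
```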
zh
[AI-61] Towards Error Centric Intelligence I Beyond Observational Learning
【速读】:该论文试图解决当前人工智能发展中存在的理论局限性问题,即AGI(通用人工智能)的进步受限于理论而非数据或规模。作者指出,仅依赖观测等效性无法保证干预能力,因此需要从错误中心视角重构学习范式。解决方案的关键在于提出“因果力学”(Causal Mechanics)框架,其核心是将假设空间的变化作为一阶操作,并在有用时使用概率结构而非默认假设。该框架通过三个关键结构性原则实现错误发现与修正的可计算性:模块化干预的微分局部性与自主性原理、独立因果机制的规范不变形式,以及类比保真的组合自主性原理,从而构建可将不可达错误转化为可达错误并加以纠正的系统架构。
链接: https://arxiv.org/abs/2510.15128
作者: Marcus A. Thomas
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We argue that progress toward AGI is theory limited rather than data or scale limited. Building on the critical rationalism of Popper and Deutsch, we challenge the Platonic Representation Hypothesis. Observationally equivalent worlds can diverge under interventions, so observational adequacy alone cannot guarantee interventional competence. We begin by laying foundations, definitions of knowledge, learning, intelligence, counterfactual competence and AGI, and then analyze the limits of observational learning that motivate an error centric shift. We recast the problem as three questions about how explicit and implicit errors evolve under an agent’s actions, which errors are unreachable within a fixed hypothesis space, and how conjecture and criticism expand that space. From these questions we propose Causal Mechanics, a mechanisms first program in which hypothesis space change is a first class operation and probabilistic structure is used when useful rather than presumed. We advance structural principles that make error discovery and correction tractable, including a differential Locality and Autonomy Principle for modular interventions, a gauge invariant form of Independent Causal Mechanisms for separability, and the Compositional Autonomy Principle for analogy preservation, together with actionable diagnostics. The aim is a scaffold for systems that can convert unreachable errors into reachable ones and correct them.
zh
[AI-62] Procedural Game Level Design with Deep Reinforcement Learning
【速读】:该论文旨在解决游戏开发中人工设计关卡效率低、难以实现动态可重玩性的问题,提出了一种基于深度强化学习(Deep Reinforcement Learning, DRL)的程序化关卡设计方法。其解决方案的关键在于构建两个协同工作的智能体:一个“蜂鸟代理”作为求解器,通过Proximal Policy Optimization (PPO)算法学习在动态生成的地形中高效导航并收集花朵;另一个“浮岛代理”负责根据障碍物位置、蜂鸟初始状态及历史反馈生成合理且上下文感知的花朵布局。二者通过交互形成涌现行为,实现了内容生成与内容求解的闭环优化,从而推动了由机器学习驱动的自主游戏关卡设计范式的发展。
链接: https://arxiv.org/abs/2510.15120
作者: Miraç Buğra Özkan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 10 figures, IEEE conference format
Abstract:Procedural content generation (PCG) has become an increasingly popular technique in game development, allowing developers to generate dynamic, replayable, and scalable environments with reduced manual effort. In this study, a novel method for procedural level design using Deep Reinforcement Learning (DRL) within a Unity-based 3D environment is proposed. The system comprises two agents: a hummingbird agent, acting as a solver, and a floating island agent, responsible for generating and placing collectible objects (flowers) on the terrain in a realistic and context-aware manner. The hummingbird is trained using the Proximal Policy Optimization (PPO) algorithm from the Unity ML-Agents toolkit. It learns to navigate through the terrain efficiently, locate flowers, and collect them while adapting to the ever-changing procedural layout of the island. The island agent is also trained using the Proximal Policy Optimization (PPO) algorithm. It learns to generate flower layouts based on observed obstacle positions, the hummingbird’s initial state, and performance feedback from previous episodes. The interaction between these agents leads to emergent behavior and robust generalization across various environmental configurations. The results demonstrate that the approach not only produces effective and efficient agent behavior but also opens up new opportunities for autonomous game level design driven by machine learning. This work highlights the potential of DRL in enabling intelligent agents to both generate and solve content in virtual environments, pushing the boundaries of what AI can contribute to creative game development processes.
zh
[AI-63] Targeted Attacks and Defenses for Distributed Federated Learning in Vehicular Networks
【速读】:该论文旨在解决分布式联邦学习(Distributed Federated Learning, DFL)在车联网等边缘网络中面临的日益复杂和隐蔽的网络攻击问题,尤其是针对训练数据投毒和后门(Trojan)攻击的脆弱性。其解决方案的关键在于设计了针对性的恶意攻击模型以揭示DFL系统的潜在漏洞,并在此基础上提出有效的防御机制,从而在不依赖中心服务器的前提下提升DFL对高级威胁的鲁棒性与安全性。
链接: https://arxiv.org/abs/2510.15109
作者: Utku Demir,Tugba Erpek,Yalin E. Sagduyu,Sastry Kompella,Mengran Xue
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Signal Processing (eess.SP)
备注:
Abstract:In emerging networked systems, mobile edge devices such as ground vehicles and unmanned aerial system (UAS) swarms collectively aggregate vast amounts of data to make machine learning decisions such as threat detection in remote, dynamic, and infrastructure-constrained environments where power and bandwidth are scarce. Federated learning (FL) addresses these constraints and privacy concerns by enabling nodes to share local model weights for deep neural networks instead of raw data, facilitating more reliable decision-making than individual learning. However, conventional FL relies on a central server to coordinate model updates in each learning round, which imposes significant computational burdens on the central node and may not be feasible due to the connectivity constraints. By eliminating dependence on a central server, distributed federated learning (DFL) offers scalability, resilience to node failures, learning robustness, and more effective defense strategies. Despite these advantages, DFL remains vulnerable to increasingly advanced and stealthy cyberattacks. In this paper, we design sophisticated targeted training data poisoning and backdoor (Trojan) attacks, and characterize the emerging vulnerabilities in a vehicular network. We analyze how DFL provides resilience against such attacks compared to individual learning and present effective defense mechanisms to further strengthen DFL against the emerging cyber threats.
zh
[AI-64] Operator Flow Matching for Timeseries Forecasting
【速读】:该论文旨在解决高维偏微分方程(PDE)驱动动力学的生成建模问题,尤其是现有自回归和扩散模型在长时间预测中易产生累积误差与离散化伪影,难以保证物理一致性。其解决方案的关键在于提出TempO模型——一种基于潜在空间流匹配(latent flow matching)的架构,通过稀疏条件输入结合通道折叠(channel folding)高效处理三维时空场,并引入时间条件傅里叶层(time-conditioned Fourier layers)以高保真度捕捉多尺度动态模式,从而实现高效、确定性的采样与更准确的长期物理一致性预测。
链接: https://arxiv.org/abs/2510.15101
作者: Yolanne Yi Ran Lee,Kyriakos Flouris
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint
Abstract:Forecasting high-dimensional, PDE-governed dynamics remains a core challenge for generative modeling. Existing autoregressive and diffusion-based approaches often suffer cumulative errors and discretisation artifacts that limit long, physically consistent forecasts. Flow matching offers a natural alternative, enabling efficient, deterministic sampling. We prove an upper bound on FNO approximation error and propose TempO, a latent flow matching model leveraging sparse conditioning with channel folding to efficiently process 3D spatiotemporal fields using time-conditioned Fourier layers to capture multi-scale modes with high fidelity. TempO outperforms state-of-the-art baselines across three benchmark PDE datasets, and spectral analysis further demonstrates superior recovery of multi-scale dynamics, while efficiency studies highlight its parameter- and memory-light design compared to attention-based or convolutional regressors.
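流匹配的训练目标本身很简洁:在直线概率路径上采样插值点,回归对应的速度场。下面给出潜在空间版本的最小示意,v_theta 用占位 MLP 表示,实际 TempO 中为带时间条件傅里叶层的网络。

```python
import torch

v_theta = torch.nn.Sequential(torch.nn.Linear(17, 64), torch.nn.SiLU(),
                              torch.nn.Linear(64, 16))    # 输入 = 潜变量(16) + 时间(1)

def flow_matching_loss(z0, z1):          # z0: 噪声潜变量, z1: 数据潜变量
    t = torch.rand(z0.shape[0], 1)
    zt = (1 - t) * z0 + t * z1           # 直线概率路径上的插值点
    target_v = z1 - z0                   # 该路径的真实速度场
    pred_v = v_theta(torch.cat([zt, t], dim=-1))
    return torch.mean((pred_v - target_v) ** 2)

loss = flow_matching_loss(torch.randn(32, 16), torch.randn(32, 16))
# 采样时从噪声出发,沿学到的速度场做确定性积分即可生成预测
```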
zh
[AI-65] OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data
【速读】:该论文旨在解决当前语言模型(Language Models, LMs)在现实应用场景中面对不确定信息时的推理能力评估不足的问题。现有评测大多聚焦于答案明确的任务,难以有效衡量模型在真实世界中处理不确定性、进行概率估计的能力。为填补这一空白,作者提出了OpenEstimate基准,其关键在于设计了一个多领域、可扩展的数值估算任务集合,要求模型整合大量背景知识并输出概率先验(probabilistic priors),进而通过准确性与校准度(calibration)指标量化这些先验相对于真实分布样本的有效性。实验表明,前沿模型生成的概率先验普遍存在不准确和过度自信的问题,且性能提升有限,凸显了当前LM在不确定性推理方面的显著短板,同时为后续改进提供了可量化的评估平台。
链接: https://arxiv.org/abs/2510.15096
作者: Alana Renda,Jillian Ross,Michael Cafarella,Jacob Andreas
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Real-world settings where language models (LMs) are deployed – in domains spanning healthcare, finance, and other forms of knowledge work – require models to grapple with incomplete information and reason under uncertainty. Yet most LM evaluations focus on problems with well-defined answers and success criteria. This gap exists in part because natural problems involving uncertainty are difficult to construct: given that LMs have access to most of the same knowledge as humans, it is non-trivial to design questions for which LMs will struggle to produce correct answers, but which humans can answer reliably. As a result, LM performance on reasoning under uncertainty remains poorly characterized. To address this gap, we introduce OpenEstimate, an extensible, multi-domain benchmark for evaluating LMs on numerical estimation tasks that require models to synthesize significant amounts of background information and express predictions as probabilistic priors. We assess these priors for accuracy and calibration, quantifying their usefulness relative to samples from the true distribution of interest. Across six frontier LMs, we find that LM-elicited priors are often inaccurate and overconfident. Performance improves modestly depending on how uncertainty is elicited from the model, but is largely unaffected by changes in sampling strategy, reasoning effort, or prompt design. The OpenEstimate benchmark thus offers a challenging evaluation for frontier LMs and a platform for developing models that are better at probabilistic estimation and reasoning under uncertainty.
zh
[AI-66] Beyond Outcome-Based Imperfect-Recall: Higher-Resolution Abstractions for Imperfect-Information Games
【速读】:该论文旨在解决不完美信息博弈(Imperfect-Information Games, IIGs)中手抽象(hand abstraction)的理论建模与性能瓶颈问题,尤其是主流基于结果的非完全记忆算法因随意丢弃历史信息而导致显著性能损失的问题。其核心解决方案是提出信号观测有序博弈(Signal Observation Ordered Games, SOOGs)框架,为手抽象提供严格的数学基础,并引入分辨率边界(resolution bound)作为信息论上可实现性能的上限;进一步通过潜在感知结果同构(Potential-Aware Outcome Isomorphism, PAOI)形式化现有算法的局限性,并提出全记忆结果同构(Full-Recall Outcome Isomorphism, FROI),通过整合历史信息提升分辨率边界并改善策略质量,实验证明FROI在德州扑克类基准测试中持续优于传统结果导向的非完全记忆基线方法。
链接: https://arxiv.org/abs/2510.15094
作者: Yanchang Fu,Qiyue Yin,Shengda Liu,Pei Xu,Kaiqi Huang
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注:
Abstract:Hand abstraction is crucial for scaling imperfect-information games (IIGs) such as Texas Hold’em, yet progress is limited by the lack of a formal task model and by evaluations that require resource-intensive strategy solving. We introduce signal observation ordered games (SOOGs), a subclass of IIGs tailored to hold’em-style games that cleanly separates signal from player action sequences, providing a precise mathematical foundation for hand abstraction. Within this framework, we define a resolution bound, an information-theoretic upper bound on achievable performance under a given signal abstraction. Using the bound, we show that mainstream outcome-based imperfect-recall algorithms suffer substantial losses by arbitrarily discarding historical information; we formalize this behavior via potential-aware outcome isomorphism (PAOI) and prove that PAOI characterizes their resolution bound. To overcome this limitation, we propose full-recall outcome isomorphism (FROI), which integrates historical information to raise the bound and improve policy quality. Experiments on hold’em-style benchmarks confirm that FROI consistently outperforms outcome-based imperfect-recall baselines. Our results provide a unified formal treatment of hand abstraction and practical guidance for designing higher-resolution abstractions in IIGs.
zh
[AI-67] DMRetriever: A Family of Models for Improved Text Retrieval in Disaster Management
【速读】:该论文旨在解决灾难管理场景中信息检索模型性能不一致且不可靠的问题,因为现有通用领域检索模型无法有效处理灾难管理中多样化的搜索意图。解决方案的关键在于提出首个专为灾难管理设计的密集检索模型系列DMRetriever(规模从33M到7.6B参数),其训练采用一种新颖的三阶段框架:双向注意力适配、无监督对比预训练和难度感知的渐进式指令微调,并结合高质量数据生成的数据精炼流水线。该方法显著提升了在六种不同搜索意图下的表现,且具有极高的参数效率,例如596M模型超越13.3倍更大的基线模型,而33M模型仅用7.6%参数即优于更大规模基线。
链接: https://arxiv.org/abs/2510.15087
作者: Kai Yin,Xiangjue Dong,Chengkai Liu,Allen Lin,Lingfeng Shi,Ali Mostafavi,James Caverlee
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Effective and efficient access to relevant information is essential for disaster management. However, no retrieval model is specialized for disaster management, and existing general-domain models fail to handle the varied search intents inherent to disaster management scenarios, resulting in inconsistent and unreliable performance. To this end, we introduce DMRetriever, the first series of dense retrieval models (33M to 7.6B) tailored for this domain. It is trained through a novel three-stage framework of bidirectional attention adaptation, unsupervised contrastive pre-training, and difficulty-aware progressive instruction fine-tuning, using high-quality data generated through an advanced data refinement pipeline. Comprehensive experiments demonstrate that DMRetriever achieves state-of-the-art (SOTA) performance across all six search intents at every model scale. Moreover, DMRetriever is highly parameter-efficient, with the 596M model outperforming baselines over 13.3x larger and the 33M model exceeding baselines with only 7.6% of their parameters. All codes, data, and checkpoints are available at this https URL
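DMRetriever 第二阶段的无监督对比预训练属于稠密检索的标准范式;下面示意其中常用的 in-batch InfoNCE 损失,温度等超参数为示例取值。

```python
import torch
import torch.nn.functional as F

def info_nce(q, d, tau=0.05):
    """q, d: (batch, dim) 的查询/文档嵌入;同 batch 内其余文档作为负例。"""
    q, d = F.normalize(q, dim=-1), F.normalize(d, dim=-1)
    logits = q @ d.T / tau                # 相似度矩阵,对角线为正例
    labels = torch.arange(q.shape[0])
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```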
zh
[AI-68] Sequential Comics for Jailbreaking Multimodal Large Language Models via Structured Visual Storytelling
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在安全对齐机制上的跨模态漏洞问题,即模型容易受到视觉诱导的越狱攻击(jailbreak attacks),从而生成有害内容。其解决方案的关键在于利用顺序漫画风格的视觉叙事(sequential comic-style visual narratives)来分解恶意查询为看似无害的图像序列,借助辅助语言模型(auxiliary LLM)进行语义拆解,并通过扩散模型(diffusion models)生成对应的图像序列;进而利用MLLM对叙事连贯性的依赖性,诱使其输出原本被安全机制屏蔽的有害响应。该方法在多个主流安全基准测试中平均攻击成功率高达83.5%,显著优于现有方法。
链接: https://arxiv.org/abs/2510.15068
作者: Deyue Zhang,Dongdong Yang,Junjie Mu,Quancheng Zou,Zonghao Ying,Wenzhuo Xu,Zhao Liu,Xuan Wang,Xiangzheng Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal large language models (MLLMs) exhibit remarkable capabilities but remain susceptible to jailbreak attacks exploiting cross-modal vulnerabilities. In this work, we introduce a novel method that leverages sequential comic-style visual narratives to circumvent safety alignments in state-of-the-art MLLMs. Our method decomposes malicious queries into visually innocuous storytelling elements using an auxiliary LLM, generates corresponding image sequences through diffusion models, and exploits the models’ reliance on narrative coherence to elicit harmful outputs. Extensive experiments on harmful textual queries from established safety benchmarks show that our approach achieves an average attack success rate of 83.5%, surpassing prior state-of-the-art by 46%. Compared with existing visual jailbreak methods, our sequential narrative strategy demonstrates superior effectiveness across diverse categories of harmful content. We further analyze attack patterns, uncover key vulnerability factors in multimodal safety mechanisms, and evaluate the limitations of current defense strategies against narrative-driven attacks, revealing significant gaps in existing protections.
zh
[AI-69] Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在多轮对话中易受越狱攻击(multi-turn jailbreak attacks)的问题,即攻击者通过迭代交互诱导模型输出有害内容,从而绕过单轮安全过滤机制。现有防御方法主要依赖被动拒绝策略,难以应对自适应攻击者,且常过度限制合法用户行为。论文提出一种基于蜜罐(honeypot)的主动防护机制,其核心在于将风险规避转化为风险利用:通过微调一个“诱饵模型”(bait model)生成语义相关但不可执行的模糊响应作为诱饵,结合主模型的安全回复,在多轮交互中主动插入诱导性问题以逐步暴露恶意意图。关键创新包括引入蜜罐效用评分(Honeypot Utility Score, HUS)衡量诱饵响应的吸引力与可行性,并采用防御有效性率(Defense Efficacy Rate, DER)平衡安全性与用户体验。实验表明,该方案在GPT-4o上显著降低越狱成功率,同时保持良性用户交互质量。
链接: https://arxiv.org/abs/2510.15017
作者: ChenYu Wu,Yi Wang,Yang Liao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 6 pages, 2 figures
Abstract:Large language models (LLMs) are increasingly vulnerable to multi-turn jailbreak attacks, where adversaries iteratively elicit harmful behaviors that bypass single-turn safety filters. Existing defenses predominantly rely on passive rejection, which either fails against adaptive attackers or overly restricts benign users. We propose a honeypot-based proactive guardrail system that transforms risk avoidance into risk utilization. Our framework fine-tunes a bait model to generate ambiguous, non-actionable but semantically relevant responses, which serve as lures to probe user intent. Combined with the protected LLM’s safe reply, the system inserts proactive bait questions that gradually expose malicious intent through multi-turn interactions. We further introduce the Honeypot Utility Score (HUS), measuring both the attractiveness and feasibility of bait responses, and use a Defense Efficacy Rate (DER) for balancing safety and usability. Initial experiments on the MHJ dataset with recent attack methods against GPT-4o show that our system significantly disrupts jailbreak success while preserving benign user experience.
zh
[AI-70] Hybrid Autoencoder-Based Framework for Early Fault Detection in Wind Turbines
【速读】:该论文旨在解决风力发电机组可靠性问题,特别是通过早期故障检测来降低停机时间和运维成本。其核心挑战在于从高维SCADA数据中无监督地识别异常行为,而无需依赖标注故障数据。解决方案的关键在于提出一种基于集成学习的深度学习框架,融合变分自编码器(Variational Autoencoders, VAE)、LSTM自编码器和Transformer架构,以捕捉不同时间尺度与上下文模式;同时设计了一套特征工程流程提取时域、统计及频域指标,并采用集成评分与自适应阈值法实现高效异常检测,最终在CARE数据集上实现了0.947的AUC-ROC性能并提前48小时预警故障。
链接: https://arxiv.org/abs/2510.15010
作者: Rekha R Nair,Tina Babu,Alavikunhu Panthakkan,Balamurugan Balusamy,Wathiq Mansoor
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Wind turbine reliability is critical to the growing renewable energy sector, where early fault detection significantly reduces downtime and maintenance costs. This paper introduces a novel ensemble-based deep learning framework for unsupervised anomaly detection in wind turbines. The method integrates Variational Autoencoders (VAE), LSTM Autoencoders, and Transformer architectures, each capturing different temporal and contextual patterns from high-dimensional SCADA data. A unique feature engineering pipeline extracts temporal, statistical, and frequency-domain indicators, which are then processed by the deep models. Ensemble scoring combines model predictions, followed by adaptive thresholding to detect operational anomalies without requiring labeled fault data. Evaluated on the CARE dataset containing 89 years of real-world turbine data across three wind farms, the proposed method achieves an AUC-ROC of 0.947 and early fault detection up to 48 hours prior to failure. This approach offers significant societal value by enabling predictive maintenance, reducing turbine failures, and enhancing operational efficiency in large-scale wind energy deployments.
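下面示意"多模型重构误差集成打分 + 自适应阈值"的无监督检测骨架:各模型误差按健康期统计量标准化后平均,阈值取健康期得分的高分位数;归一化方式与分位数均为示例设定,并非论文原始配置。

```python
import numpy as np

def ensemble_scores(errors_per_model, healthy_idx):
    """将各模型的重构误差按健康期均值/方差标准化后取平均。"""
    zs = []
    for e in errors_per_model:                       # e: (n_samples,) 重构误差
        mu, sd = e[healthy_idx].mean(), e[healthy_idx].std() + 1e-8
        zs.append((e - mu) / sd)
    return np.mean(zs, axis=0)

def adaptive_threshold(scores, healthy_idx, q=0.995):
    return np.quantile(scores[healthy_idx], q)       # 无需故障标签

rng = np.random.default_rng(0)
errs = [rng.gamma(2.0, 1.0, 1000) for _ in range(3)]  # 三个模型(如 VAE/LSTM-AE/Transformer)
errs[0][-20:] += 8                                     # 在末尾注入异常
s = ensemble_scores(errs, np.arange(800))
print((s > adaptive_threshold(s, np.arange(800))).nonzero()[0][:5])  # 报警样本索引
```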
zh
[AI-71] TangledFeatures: Robust Feature Selection in Highly Correlated Spaces NEURIPS2025
【速读】:该论文旨在解决传统特征选择方法在存在相关预测变量(correlated predictors)时性能下降的问题,尤其在高维且冗余的特征空间中难以保持模型的可解释性和稳定性。其解决方案的关键在于提出TangledFeatures框架,该框架通过识别由纠缠预测变量组成的组内代表性特征,有效减少冗余信息的同时保留解释能力,从而为下游模型提供更简洁、稳定且具有结构意义的特征子集。
链接: https://arxiv.org/abs/2510.15005
作者: Allen Daniel Sunny
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted for poster presentation at the Machine Learning for Structural Biology (MLSB) Workshop @ NeurIPS 2025, co-located with NeurIPS 2025 (San Diego, USA). Non-archival
Abstract:Feature selection is a fundamental step in model development, shaping both predictive performance and interpretability. Yet, most widely used methods focus on predictive accuracy, and their performance degrades in the presence of correlated predictors. To address this gap, we introduce TangledFeatures, a framework for feature selection in correlated feature spaces. It identifies representative features from groups of entangled predictors, reducing redundancy while retaining explanatory power. The resulting feature subset can be directly applied in downstream models, offering a more interpretable and stable basis for analysis compared to traditional selection techniques. We demonstrate the effectiveness of TangledFeatures on Alanine Dipeptide, applying it to the prediction of backbone torsional angles and showing that the selected features correspond to structurally meaningful intra-atomic distances that explain variation in these angles.
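作为 TangledFeatures 思路的一个简化示意,下面以 1-|相关系数| 为距离做层次聚类,并在每个"纠缠"特征组中仅保留一个代表特征;聚类阈值与代表选取准则均为示例设定。

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def select_representatives(X, threshold=0.3):
    """对相关特征分组,每组返回一个代表特征的列索引。"""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    dist = squareform(1 - corr, checks=False)        # 1-|corr| 作为距离
    labels = fcluster(linkage(dist, "average"), threshold, criterion="distance")
    reps = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        # 取与组内其余特征平均相关性最高者作为代表
        reps.append(idx[np.argmax(corr[np.ix_(idx, idx)].mean(axis=1))])
    return sorted(reps)

X = np.random.default_rng(0).normal(size=(200, 10))
X[:, 1] = X[:, 0] + 0.05 * X[:, 9]                   # 制造一对强相关特征
print(select_representatives(X))                      # 强相关组只保留一个特征
```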
zh
[AI-72] Automated Snippet-Alignment Data Augmentation for Code Translation
【速读】:该论文旨在解决代码翻译(Code Translation)任务中因平行语料库(Parallel Corpora)数据稀缺而导致模型训练受限的问题。现有研究多集中于对程序级对齐(Program-Alignment, PA)数据的增强,但PA数据虽具备完整上下文利于语义对齐学习,却因长度过长难以提供细粒度训练信号;而片段级对齐(Snippet-Alignment, SA)数据虽然简短、适合细粒度对齐学习,却因数量稀少难以有效利用。为此,论文提出一种基于大语言模型(Large Language Models, LLMs)自动生成SA数据的数据增强方法,并设计了一种简单而有效的两阶段训练策略,以协同利用PA与SA数据。其关键创新在于:通过LLMs自动构造高质量SA数据并结合两阶段训练机制,显著提升了代码翻译模型在TransCoder-test上的性能,最大实现pass@k指标提升3.78%。
链接: https://arxiv.org/abs/2510.15004
作者: Zhiming Zhang,Qingfu Zhu,Xianzhen Luo,Yixuan Wang,Bohan Li,Wanxiang Che
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Code translation aims to translate the code from its source language to the target language and is used in various software development scenarios. Recent developments in Large Language Models (LLMs) have showcased their capabilities in code translation, and parallel corpora play a crucial role in training models for code translation. Parallel corpora can be categorized into program-alignment (PA) and snippet-alignment (SA) data. Although PA data has complete context and is suitable for semantic alignment learning, it may not provide adequate fine-grained training signals due to its extended length, while the brevity of SA data enables more fine-grained alignment learning. Due to limited parallel corpora, researchers explore several augmentation methods for code translation. Previous studies mainly focus on augmenting PA data. In this paper, we propose a data augmentation method that leverages LLMs to generate SA data automatically. To fully leverage both PA data and SA data, we explore a simple yet effective two-stage training strategy, which consistently enhances model performance compared to fine-tuning solely on PA data. Experiments on TransCoder-test demonstrate that our augmented SA data combined with the two-stage training approach yields consistent improvements over the baseline, achieving a maximum gain of 3.78% on pass@k.
zh
[AI-73] VaultGemma: A Differentially Private Gemma Model
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在训练过程中可能泄露敏感数据的隐私风险问题。解决方案的关键在于采用差分隐私(Differential Privacy, DP)技术对模型进行完整训练,从而在不牺牲模型性能的前提下实现对训练数据中个体信息的有效保护。VaultGemma 1B 是一个拥有 10 亿参数的 Gemma 系列模型,其预训练数据与 Gemma 2 系列保持一致,是首个完全通过差分隐私训练的开源大模型,标志着隐私保护型生成式 AI 的重要进展。
链接: https://arxiv.org/abs/2510.15001
作者: Amer Sinha,Thomas Mesnard,Ryan McKenna,Daogao Liu,Christopher A. Choquette-Choo,Yangsibo Huang,Da Yu,George Kaissis,Zachary Charles,Ruibo Liu,Lynn Chua,Pritish Kamath,Pasin Manurangsi,Steve He,Chiyuan Zhang,Badih Ghazi,Borja De Balle Pigem,Prem Eruvbetine,Tris Warkentin,Armand Joulin,Ravi Kumar
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce VaultGemma 1B, a 1 billion parameter model within the Gemma family, fully trained with differential privacy. Pretrained on the identical data mixture used for the Gemma 2 series, VaultGemma 1B represents a significant step forward in privacy-preserving large language models. We openly release this model to the community.
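差分隐私训练的核心步骤通常是 DP-SGD 风格的"逐样本梯度裁剪 + 高斯噪声";VaultGemma 的实际训练配置(裁剪范数、噪声乘子等)摘要未给出,下面仅以占位线性模型演示该机制。

```python
import torch

model = torch.nn.Linear(4, 1)
C, sigma, lr = 1.0, 1.0, 0.1                      # 裁剪范数与噪声乘子(示例取值)

def dp_sgd_step(xs, ys):
    grads = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(xs, ys):                      # 逐样本计算并裁剪梯度
        model.zero_grad()
        loss = ((model(x) - y) ** 2).sum()
        loss.backward()
        norm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
        scale = min(1.0, C / (norm + 1e-8))       # 将每个样本的梯度范数裁剪到 C
        for g, p in zip(grads, model.parameters()):
            g += p.grad * scale
    with torch.no_grad():
        for g, p in zip(grads, model.parameters()):
            noise = torch.normal(0.0, sigma * C, size=g.shape)
            p -= lr * (g + noise) / len(xs)       # 加噪平均梯度,保证差分隐私

dp_sgd_step(torch.randn(8, 4), torch.randn(8, 1))
```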
zh
[AI-74] The Role of Federated Learning in Improving Financial Security: A Survey
【速读】:该论文旨在解决数字金融系统中隐私保护与安全性的挑战,尤其是在传统机器学习模型因需集中访问敏感数据而引发用户隐私泄露问题的背景下。其核心解决方案是采用联邦学习(Federated Learning, FL)技术,通过在不共享原始数据的前提下实现跨机构和跨设备的分布式模型训练,从而支持金融机构在遵守监管合规要求的同时提升欺诈检测等关键任务的效能。FL的关键优势在于能够在保持数据本地化的基础上,利用多源异构数据进行协同建模,尤其适用于ATM、POS终端等物联网(IoT)金融节点的实时风险识别场景,并结合差分隐私、安全多方计算及区块链等新兴技术增强系统鲁棒性与合规性。
链接: https://arxiv.org/abs/2510.14991
作者: Cade Houston Kennedy,Amr Hilal,Morteza Momeni
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 8 pages, 2 figures, 1 tables, accepted at 2025 IEEE Global Conference on Artificial Intelligence and Internet of Things
Abstract:With the growth of digital financial systems, robust security and privacy have become a concern for financial institutions. Even though traditional machine learning models have shown to be effective in fraud detections, they often compromise user data by requiring centralized access to sensitive information. In IoT-enabled financial endpoints such as ATMs and POS Systems that regularly produce sensitive data that is sent over the network. Federated Learning (FL) offers a privacy-preserving, decentralized model training across institutions without sharing raw data. FL enables cross-silo collaboration among banks while also using cross-device learning on IoT endpoints. This survey explores the role of FL in enhancing financial security and introduces a novel classification of its applications based on regulatory and compliance exposure levels ranging from low-exposure tasks such as collaborative portfolio optimization to high-exposure tasks like real-time fraud detection. Unlike prior surveys, this work reviews FL’s practical use within financial systems, discussing its regulatory compliance and recent successes in fraud prevention and blockchain-integrated frameworks. However, FL deployment in finance is not without challenges. Data heterogeneity, adversarial attacks, and regulatory compliance make implementation far from easy. This survey reviews current defense mechanisms and discusses future directions, including blockchain integration, differential privacy, secure multi-party computation, and quantum-secure frameworks. Ultimately, this work aims to be a resource for researchers exploring FL’s potential to advance secure, privacy-compliant financial systems.
zh
[AI-75] Design and Analysis of Parallel Artificial Protozoa Optimizer (P-APO) using CUDA Architecture
【速读】:该论文旨在解决元启发式优化算法(Metaheuristic Algorithms)在处理大规模问题时计算时间过长的问题,尤其是在需要大量迭代以获得更优解的情况下,串行执行效率低下成为主要瓶颈。解决方案的关键在于提出了一种基于NVIDIA CUDA框架的并行化人工原生动物优化器(Artificial Protozoa Optimizer, APO)实现,通过利用GPU的并行计算能力显著加速算法执行过程。实验结果表明,在CEC2022基准函数测试中,所提并行版本最高可实现6.7倍的速度提升,同时在真实世界应用(如工程优化中的弹簧设计和基于Otsu方法的图像阈值分割)中也验证了其高效性和实用性。
链接: https://arxiv.org/abs/2510.14982
作者: Henish Soliya,Anugrah Jain
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:
Abstract:Metaheuristic algorithms are widely used for solving complex problems due to their ability to provide near-optimal solutions. However, the execution time of these algorithms increases with the problem size and solution space, and obtaining more promising results typically requires running them for a large number of iterations, which is time-consuming; this is one of the main issues with these algorithms. To handle the same, researchers are nowadays working on the design and development of parallel versions of state-of-the-art metaheuristic optimization algorithms. We, in this paper, present a parallel implementation of the state-of-the-art Artificial Protozoa Optimizer using the NVIDIA CUDA framework to leverage GPU acceleration. Our implementation optimizes the state-of-the-art Artificial Protozoa Optimizer (APO) to achieve high performance. We implement both the existing sequential version and the proposed parallel version of the Artificial Protozoa Optimizer in this paper. The experimental results calculated over the benchmark functions of CEC2022 demonstrate a significant performance gain, i.e., up to 6.7x speedup in the case of the proposed parallel version. We also use two real-world applications, (1) tension/compression spring design in engineering optimization and (2) image thresholding using the Otsu method, for testing the performance of the proposed implementation in handling real tasks.
zh
[AI-76] Reinforcement Learning with Stochastic Reward Machines AAAI-22
【速读】:该论文旨在解决传统奖励机器(reward machines)在强化学习中对噪声敏感的问题,即现有算法假设奖励信号完全无噪声,这在实际应用中难以满足。为克服这一限制,作者提出了一种新型奖励机器——随机奖励机器(stochastic reward machines),其核心创新在于引入概率机制以建模奖励的不确定性。解决方案的关键在于设计一种基于约束求解的学习算法,该算法能够从强化学习智能体的探索轨迹中自动推导出最小化的随机奖励机器,并可与现有强化学习算法无缝集成,理论上保证在极限情况下收敛到最优策略。实验证明该方法优于现有方法及处理噪声的朴素策略。
链接: https://arxiv.org/abs/2510.14837
作者: Jan Corazza,Ivan Gavran,Daniel Neider
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: A shorter version of this paper appeared in the Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI-22). Source code available at this https URL
Abstract:Reward machines are an established tool for dealing with reinforcement learning problems in which rewards are sparse and depend on complex sequences of actions. However, existing algorithms for learning reward machines assume an overly idealized setting where rewards have to be free of noise. To overcome this practical limitation, we introduce a novel type of reward machines, called stochastic reward machines, and an algorithm for learning them. Our algorithm, based on constraint solving, learns minimal stochastic reward machines from the explorations of a reinforcement learning agent. This algorithm can easily be paired with existing reinforcement learning algorithms for reward machines and guarantees to converge to an optimal policy in the limit. We demonstrate the effectiveness of our algorithm in two case studies and show that it outperforms both existing methods and a naive approach for handling noisy reward functions.
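随机奖励机器可以看作"转移携带奖励分布"的有限状态机;下面给出一个最小数据结构示意,奖励分布用高斯近似表示,状态与事件命名均为假设。

```python
import random

class StochasticRewardMachine:
    def __init__(self, transitions, initial="u0"):
        # transitions: {(状态, 事件): (下一状态, 奖励均值, 奖励标准差)}
        self.transitions, self.state = transitions, initial

    def step(self, event):
        nxt, mu, sd = self.transitions[(self.state, event)]
        self.state = nxt
        return random.gauss(mu, sd)        # 采样带噪声的奖励

srm = StochasticRewardMachine({
    ("u0", "got_key"):   ("u1", 0.0, 0.1),
    ("u1", "door_open"): ("u2", 1.0, 0.1),
})
print(srm.step("got_key"), srm.step("door_open"))  # 沿事件序列累积带噪奖励
```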
zh
[AI-77] Decentralizing Multi-Agent Reinforcement Learning with Temporal Causal Information
【速读】:该论文旨在解决分布式多智能体强化学习(Decentralized Multi-Agent Reinforcement Learning, DMARL)中的关键挑战,包括隐私约束、通信限制以及性能问题,尤其是在多个智能体独立学习后如何确保其局部策略在执行时能够兼容并协同完成全局任务。解决方案的关键在于引入高层符号知识(high-level symbolic knowledge),通过扩展用于验证局部策略与团队任务兼容性的形式化工具,使得去中心化训练在理论上有保障,并能应用于更广泛的实际场景;同时实证表明,关于环境中事件时序演化的符号知识可显著加速DMARL的学习过程。
链接: https://arxiv.org/abs/2506.07829
作者: Jan Corazza,Hadi Partovi Aria,Hyohun Kim,Daniel Neider,Zhe Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Code available at this https URL
Abstract:Reinforcement learning (RL) algorithms can find an optimal policy for a single agent to accomplish a particular task. However, many real-world problems require multiple agents to collaborate in order to achieve a common goal. For example, a robot executing a task in a warehouse may require the assistance of a drone to retrieve items from high shelves. In Decentralized Multi-Agent RL (DMARL), agents learn independently and then combine their policies at execution time, but often must satisfy constraints on compatibility of local policies to ensure that they can achieve the global task when combined. In this paper, we study how providing high-level symbolic knowledge to agents can help address unique challenges of this setting, such as privacy constraints, communication limitations, and performance concerns. In particular, we extend the formal tools used to check the compatibility of local policies with the team task, making decentralized training with theoretical guarantees usable in more scenarios. Furthermore, we empirically demonstrate that symbolic knowledge about the temporal evolution of events in the environment can significantly expedite the learning process in DMARL.
zh
[AI-78] Establishing trust in automated reasoning
【速读】:该论文试图解决自动化推理系统(automated reasoning systems)在科学实践中可信度不足的问题,特别是如何提升其可审查性(reviewability),以增强科研人员对其结果的信任。解决方案的关键在于识别影响自动化推理系统可审查性的核心特征,并通过技术手段与社会措施相结合的方式,提高系统的透明度、可解释性和可验证性,从而推动其在科学研究中的可靠应用。
链接: https://arxiv.org/abs/2309.12351
作者: Konrad Hinsen(SSOLEIL, CBM)
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
Abstract:Since its beginnings in the 1940s, automated reasoning by computers has become a tool of ever growing importance in scientific research. So far, the rules underlying automated reasoning have mainly been formulated by humans, in the form of program source code. Rules derived from large amounts of data, via machine learning techniques, are a complementary approach currently under intense development. The question of why we should trust these systems, and the results obtained with their help, has been discussed by philosophers of science but has so far received little attention by practitioners. The present work focuses on independent reviewing, an important source of trust in science, and identifies the characteristics of automated reasoning systems that affect their reviewability. It also discusses possible steps towards increasing reviewability and trustworthiness via a combination of technical and social measures.
zh
[AI-79] GENESIS: A Generative Model of Episodic-Semantic Interaction
【速读】:该论文旨在解决认知神经科学中一个核心挑战,即如何解释语义记忆(semantic memory)与情景记忆(episodic memory)这两种主要的陈述性记忆形式在支持学习、回忆和想象过程中相互作用的机制。尽管已有诸多进展,但缺乏一个统一的计算框架来同时解释两类记忆领域的关键实证现象。解决方案的关键在于提出生成式情景-语义整合系统(Generative Episodic-Semantic Integration System, GENESIS),该模型将记忆建模为两个有限容量的生成系统之间的交互:一个基于皮层的变分自编码器(Cortical-VAE)用于语义学习与泛化,另一个基于海马的变分自编码器(Hippocampal-VAE)在检索增强生成(RAG)架构下实现情景编码与检索。GENESIS不仅复现了语义记忆中的泛化、情景记忆中的识别、序列回忆效应及基于概要的扭曲等典型行为特征,还揭示了容量限制如何影响经验的保真度与可记忆性、语义加工如何引入系统性的情景回忆偏差,以及情景回放如何重组先前经验,从而提供了一个以资源受限、主动构建为核心特征的记忆统一理论框架。
链接: https://arxiv.org/abs/2510.15828
作者: Marco D’Alessandro,Leo D’Amato,Mikel Elkano,Mikel Uriz,Giovanni Pezzulo
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: 17 pages, 6 figures
Abstract:A central challenge in cognitive neuroscience is to explain how semantic and episodic memory, two major forms of declarative memory, typically associated with cortical and hippocampal processing, interact to support learning, recall, and imagination. Despite significant advances, we still lack a unified computational framework that jointly accounts for core empirical phenomena across both semantic and episodic processing domains. Here, we introduce the Generative Episodic-Semantic Integration System (GENESIS), a computational model that formalizes memory as the interaction between two limited-capacity generative systems: a Cortical-VAE, supporting semantic learning and generalization, and a Hippocampal-VAE, supporting episodic encoding and retrieval within a retrieval-augmented generation (RAG) architecture. GENESIS reproduces hallmark behavioral findings, including generalization in semantic memory, recognition, serial recall effects and gist-based distortions in episodic memory, and constructive episodic simulation, while capturing their dynamic interactions. The model elucidates how capacity constraints shape the fidelity and memorability of experiences, how semantic processing introduces systematic distortions in episodic recall, and how episodic replay can recombine previous experiences. Together, these results provide a principled account of memory as an active, constructive, and resource-bounded process. GENESIS thus advances a unified theoretical framework that bridges semantic and episodic memory, offering new insights into the generative foundations of human cognition.
zh
[AI-80] Robust Optimization in Causal Models and G-Causal Normalizing Flows
【速读】:该论文旨在解决因果模型中干预鲁棒优化问题在标准Wasserstein距离下可能不连续,从而导致生成式数据增强(data augmentation)效果不佳的问题。其核心挑战在于:若生成模型忽略变量间的因果结构,则所生成的数据无法有效支持因果推理任务,例如因果回归和因果因子模型下的均值-方差投资组合优化。解决方案的关键在于提出一种新的归一化流(normalizing flow)架构,该架构满足因果结构模型的通用逼近性质(universal approximation property),并能高效训练以最小化G-因果Wasserstein距离(G-causal Wasserstein distance)。这一设计确保了生成样本不仅分布上接近真实数据,且保持了潜在的因果依赖关系,从而显著提升下游因果任务的性能。
链接: https://arxiv.org/abs/2510.15458
作者: Gabriele Visentin,Patrick Cheridito
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Portfolio Management (q-fin.PM)
备注:
Abstract:In this paper, we show that interventionally robust optimization problems in causal models are continuous under the G-causal Wasserstein distance, but may be discontinuous under the standard Wasserstein distance. This highlights the importance of using generative models that respect the causal structure when augmenting data for such tasks. To this end, we propose a new normalizing flow architecture that satisfies a universal approximation property for causal structural models and can be efficiently trained to minimize the G-causal Wasserstein distance. Empirically, we demonstrate that our model outperforms standard (non-causal) generative models in data augmentation for causal regression and mean-variance portfolio optimization in causal factor models.
zh
[AI-81] DroneAudioset: An Audio Dataset for Drone-based Search and Rescue NEURIPS
【速读】:该论文旨在解决无人机(UAV)在搜救任务中依赖视觉感知易受低能见度或遮挡影响,以及现有音频感知系统因极端自我噪声(ego-noise)导致人声信号难以识别的问题。其解决方案的关键在于构建了一个大规模、多样化的无人机音频数据集 DroneAudioset,包含23.5小时标注录音,覆盖从-57.2 dB到-2.5 dB的广泛信噪比(SNR)范围,并涵盖多种无人机类型、油门设置、麦克风配置及环境条件。该数据集为噪声抑制与人类存在检测分类方法的开发与系统评估提供了标准化基准,同时支持无人机音频系统设计中的关键决策,如麦克风布局权衡和抗噪声音频处理策略的优化。
链接: https://arxiv.org/abs/2510.15383
作者: Chitralekha Gupta,Soundarya Ramesh,Praveen Sasikumar,Kian Peen Yeo,Suranga Nanayakkara
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: Accepted in NeurIPS 2025 (Datasets and Benchmarks Track). The first two authors are equal contributors
Abstract:Unmanned Aerial Vehicles (UAVs) or drones, are increasingly used in search and rescue missions to detect human presence. Existing systems primarily leverage vision-based methods which are prone to fail under low-visibility or occlusion. Drone-based audio perception offers promise but suffers from extreme ego-noise that masks sounds indicating human presence. Existing datasets are either limited in diversity or synthetic, lacking real acoustic interactions, and there are no standardized setups for drone audition. To this end, we present DroneAudioset (The dataset is publicly available at this https URL under the MIT license), a comprehensive drone audition dataset featuring 23.5 hours of annotated recordings, covering a wide range of signal-to-noise ratios (SNRs) from -57.2 dB to -2.5 dB, across various drone types, throttles, microphone configurations as well as environments. The dataset enables development and systematic evaluation of noise suppression and classification methods for human-presence detection under challenging conditions, while also informing practical design considerations for drone audition systems, such as microphone placement trade-offs, and development of drone noise-aware audio processing. This dataset is an important step towards enabling design and deployment of drone-audition systems.
zh
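【代码示意】:数据集覆盖 -57.2 dB 至 -2.5 dB 的宽信噪比范围。下面给出按功率比定义计算 SNR(dB) 的最小 NumPy 片段,信号与噪声均为合成示例,仅用于说明该指标的含义。

```python
import numpy as np

def snr_db(signal, noise):
    """按功率比计算 SNR(dB):10·log10(P_signal / P_noise)。"""
    p_s = np.mean(np.asarray(signal, dtype=float) ** 2)
    p_n = np.mean(np.asarray(noise, dtype=float) ** 2)
    return 10.0 * np.log10(p_s / p_n)

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000)
speech = 0.05 * np.sin(2 * np.pi * 220 * t)         # 模拟微弱的人声分量
ego_noise = 0.5 * rng.standard_normal(t.size)       # 模拟强烈的旋翼自我噪声
print(f"SNR = {snr_db(speech, ego_noise):.1f} dB")  # 远低于 0 dB,与数据集场景一致
```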
[AI-82] Kernel Regression in Structured Non-IID Settings: Theory and Implications for Denoising Score Learning
【速读】:该论文旨在解决核岭回归(Kernel Ridge Regression, KRR)在非独立同分布(non-i.i.d.)数据场景下的泛化性能分析问题,尤其针对具有信号-噪声因果结构的数据,其中多个观测值来自共享的潜在信号但带有不同的噪声。现有理论主要局限于独立同分布(i.i.d.)假设,难以刻画实际应用中如去噪得分学习(denoising score learning)等任务中的依赖结构。解决方案的关键在于提出一种新颖的分块分解方法(blockwise decomposition method),该方法能够对依赖数据进行精确的浓度分析,从而首次建立了KRR在非i.i.d.设定下的过拟合风险(excess risk)上界,其显式依赖于核谱(kernel spectrum)、因果结构参数以及采样机制(包括信号与噪声样本量的比例)。这一理论框架不仅拓展了KRR的理论边界,也为现代机器学习中依赖数据的建模与采样策略提供了可解释的指导。
链接: https://arxiv.org/abs/2510.15363
作者: Dechen Zhang,Zhenmei Shi,Yi Zhang,Yingyu Liang,Difan Zou
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Kernel ridge regression (KRR) is a foundational tool in machine learning, with recent work emphasizing its connections to neural networks. However, existing theory primarily addresses the i.i.d. setting, while real-world data often exhibits structured dependencies - particularly in applications like denoising score learning where multiple noisy observations derive from shared underlying signals. We present the first systematic study of KRR generalization for non-i.i.d. data with signal-noise causal structure, where observations represent different noisy views of common signals. By developing a novel blockwise decomposition method that enables precise concentration analysis for dependent data, we derive excess risk bounds for KRR that explicitly depend on: (1) the kernel spectrum, (2) causal structure parameters, and (3) sampling mechanisms (including relative sample sizes for signals and noises). We further apply our results to denoising score learning, establishing generalization guarantees and providing principled guidance for sampling noisy data points. This work advances KRR theory while providing practical tools for analyzing dependent data in modern machine learning applications.
zh
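【代码示意】:为直观理解论文所分析的"多个带噪观测共享同一潜在信号"的非 i.i.d. 设定,下面用 NumPy 实现核岭回归的闭式解 alpha = (K + λI)^{-1} y;数据生成参数均为示例假设,仅用于演示设定本身,不对应论文的理论分析。

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
n_signals, views = 30, 5                       # 每个潜在信号对应多份带噪观测(非 i.i.d.)
s = rng.uniform(-3, 3, size=(n_signals, 1))
X = np.repeat(s, views, axis=0) + 0.3 * rng.standard_normal((n_signals * views, 1))
y = np.sin(np.repeat(s, views, axis=0)).ravel() + 0.1 * rng.standard_normal(n_signals * views)

lam = 1e-2
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)    # KRR 闭式解
X_test = np.linspace(-3, 3, 200)[:, None]
y_hat = rbf_kernel(X_test, X) @ alpha                   # 对潜在信号函数的去噪估计
print(y_hat[:3])
```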
[AI-83] The Economics of AI Foundation Models: Openness, Competition and Governance
【速读】:该论文旨在解决基础模型(Foundation Model, FM)生态系统中“开放性”战略选择的经济驱动机制问题,特别是其对AI价值链中竞争格局的影响。研究发现,开放性具有双重效应:一方面通过知识溢出促进新进入者(entrant)的学习,另一方面通过“数据飞轮效应”(data flywheel effect)增强现有主导开发者(incumbent developer)的优势——即当前用户参与度越高,下游部署者未来微调成本越低。关键在于,这种双重效应导致主导开发者在第一期最优开放水平呈现非单调性:当数据飞轮效应较弱或极强时,其倾向于更高开放;而在中间强度区间,则会战略性限制开放以抑制新进入者的成长。这一动态形成“开放性陷阱”(openness trap),揭示了强制透明政策可能因削弱企业战略灵活性而导致投资减少与福利下降。解决方案的核心在于构建一个两阶段博弈模型,量化开放性与数据飞轮之间的权衡关系,并据此评估垂直整合、政府补贴等常见干预措施的有效性边界,从而为制定更精准的政策提供理论框架。
链接: https://arxiv.org/abs/2510.15200
作者: Fasheng Xu,Xiaoyu Wang,Wei Chen,Karen Xie
机构: 未知
类目: Theoretical Economics (econ.TH); Artificial Intelligence (cs.AI)
备注:
Abstract:The strategic choice of model “openness” has become a defining issue for the foundation model (FM) ecosystem. While this choice is intensely debated, its underlying economic drivers remain underexplored. We construct a two-period game-theoretic model to analyze how openness shapes competition in an AI value chain, featuring an incumbent developer, a downstream deployer, and an entrant developer. Openness exerts a dual effect: it amplifies knowledge spillovers to the entrant, but it also enhances the incumbent’s advantage through a “data flywheel effect,” whereby greater user engagement today further lowers the deployer’s future fine-tuning cost. Our analysis reveals that the incumbent’s optimal first-period openness is surprisingly non-monotonic in the strength of the data flywheel effect. When the data flywheel effect is either weak or very strong, the incumbent prefers a higher level of openness; however, for an intermediate range, it strategically restricts openness to impair the entrant’s learning. This dynamic gives rise to an “openness trap,” a critical policy paradox where transparency mandates can backfire by removing firms’ strategic flexibility, reducing investment, and lowering welfare. We extend the model to show that other common interventions can be similarly ineffective. Vertical integration, for instance, only benefits the ecosystem when the data flywheel effect is strong enough to overcome the loss of a potentially more efficient competitor. Likewise, government subsidies intended to spur adoption can be captured entirely by the incumbent through strategic price and openness adjustments, leaving the rest of the value chain worse off. By modeling the developer’s strategic response to competitive and regulatory pressures, we provide a robust framework for analyzing competition and designing effective policy in the complex and rapidly evolving FM ecosystem.
zh
[AI-84] From Universal Approximation Theorem to Tropical Geometry of Multi-Layer Perceptrons
【速读】:该论文旨在解决传统多层感知机(MLP)在理论与实践之间存在的鸿沟问题,特别是如何将热带几何(tropical geometry)对神经网络决策函数的结构性理解转化为可解释且形状可控的平滑激活函数(如Sigmoid)模型初始化方法。其解决方案的关键在于:利用热带几何揭示的分段线性结构特性,设计出一类仅使用Sigmoid激活函数的二维二分类MLP,使其决策边界在初始阶段即符合预设几何形状,并通过有限和形式(finite-sum format)实现Universal Approximation Theorem (UAT) 的构造性满足——即模型由若干平移和缩放后的仿射函数的Sigmoid组合构成,从而无需依赖ReLU架构即可获得具有明确几何意义的初始决策边界,为后续标准训练提供良好起点。
链接: https://arxiv.org/abs/2510.15012
作者: Yi-Shan Chu,Yueh-Cheng Kuo
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We revisit the Universal Approximation Theorem (UAT) through the lens of the tropical geometry of neural networks and introduce a constructive, geometry-aware initialization for sigmoidal multi-layer perceptrons (MLPs). Tropical geometry shows that Rectified Linear Unit (ReLU) networks admit decision functions with a combinatorial structure often described as a tropical rational, namely a difference of tropical polynomials. Focusing on planar binary classification, we design purely sigmoidal MLPs that adhere to the finite-sum format of UAT: a finite linear combination of shifted and scaled sigmoids of affine functions. The resulting models yield decision boundaries that already align with prescribed shapes at initialization and can be refined by standard training if desired. This provides a practical bridge between the tropical perspective and smooth MLPs, enabling interpretable, shape-driven initialization without resorting to ReLU architectures. We focus on the construction and empirical demonstrations in two dimensions; theoretical analysis and higher-dimensional extensions are left for future work.
zh
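【代码示意】:论文的模型遵循 UAT 的有限和形式:决策函数是若干"平移、缩放后的仿射函数的 sigmoid"的线性组合。下面的 NumPy 草图按该形式手工构造一个初始决策函数(参数与目标形状均为示例假设,并非论文的具体初始化方案);陡峭的 sigmoid 近似半平面指示函数,其线性组合即给出预设形状的初始边界。

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decision(x, W, b, c):
    """f(x) = Σ_k c_k · sigmoid(w_k·x + b_k):UAT 的有限和形式。"""
    return sigmoid(x @ W.T + b) @ c

# 三个陡峭的 sigmoid,各自近似一个半平面的指示函数:x > -1、x < 1、y < 1
W = 10.0 * np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, -1.0]])
b = 10.0 * np.array([1.0, 1.0, 1.0])
c = np.array([1.0, 1.0, 1.0])
pts = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
print(decision(pts, W, b, c))   # 区域内的点 ≈ 3,区域外的点明显更小
```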
[AI-85] Evaluation and Implementation of Machine Learning Algorithms to Predict Early Detection of Kidney and Heart Disease in Diabetic Patients
【速读】:该论文旨在解决糖尿病患者中慢性肾病(Chronic Kidney Disease, CKD)和心血管疾病(Cardiovascular Disease, CVD)早期诊断敏感性不足的问题。传统诊断标志物在疾病初期往往难以准确识别高风险人群,导致延误干预。解决方案的关键在于构建一个融合传统统计方法与机器学习算法的混合框架:首先利用SPSS进行描述性和推断性统计分析,筛选出与CKD和CVD显著相关的临床特征(如血清肌酐、高血压、胆固醇等);随后基于这些特征训练逻辑回归、支持向量机和随机森林模型,其中随机森林表现最优,尤其在CKD预测中展现出最高准确性;最终通过集成学习策略提升对高危患者的识别能力,从而实现更早、更精准的风险分层,相较传统诊断方法具有明显优势。
链接: https://arxiv.org/abs/2510.14997
作者: Syed Ibad Hasnain
机构: 未知
类目: Other Quantitative Biology (q-bio.OT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This thesis was completed under the supervision of Prof. Dr. Darakhshan Saleem. I am deeply grateful for her mentorship throughout my graduate studies
Abstract:Cardiovascular disease and chronic kidney disease are major complications of diabetes, leading to high morbidity and mortality. Early detection of these conditions is critical, yet traditional diagnostic markers often lack sensitivity in the initial stages. This study integrates conventional statistical methods with machine learning approaches to improve early diagnosis of CKD and CVD in diabetic patients. Descriptive and inferential statistics were computed in SPSS to explore associations between diseases and clinical or demographic factors. Patients were categorized into four groups: Group A both CKD and CVD, Group B CKD only, Group C CVD only, and Group D no disease. Statistical analysis revealed significant correlations: Serum Creatinine and Hypertension with CKD, and Cholesterol, Triglycerides, Myocardial Infarction, Stroke, and Hypertension with CVD. These results guided the selection of predictive features for machine learning models. Logistic Regression, Support Vector Machine, and Random Forest algorithms were implemented, with Random Forest showing the highest accuracy, particularly for CKD prediction. Ensemble models outperformed single classifiers in identifying high-risk diabetic patients. SPSS results further validated the significance of the key parameters integrated into the models. While challenges such as interpretability and class imbalance remain, this hybrid statistical machine learning framework offers a promising advancement toward early detection and risk stratification of diabetic complications compared to conventional diagnostic approaches.
zh
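【代码示意】:论文将筛选后的临床特征输入 Logistic 回归、SVM 与随机森林并比较表现。下面用 scikit-learn 在合成的类别不平衡数据上复现这一比较流程;数据与超参数均为示例假设,仅演示方法框架,不代表论文结果。

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 合成的不平衡二分类数据,仅作流程演示;实际特征应为筛选出的临床指标(肌酐、血压、胆固醇等)
X, y = make_classification(n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=0)
models = {
    "LogReg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {acc:.3f}")
```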
[AI-86] Constrained Diffusion for Protein Design with Hard Structural Constraints
【速读】:该论文旨在解决现有扩散模型在蛋白质结构生成过程中难以严格满足功能约束(如特定结合位点或几何要求)的问题,尤其是在需要高精度控制的蛋白质工程任务中。其解决方案的关键在于提出了一种约束扩散框架,通过将邻近可行性更新(proximal feasibility updates)与交替方向乘子法(ADMM)分解集成到生成过程中,从而在保持立体化学和几何可行性的同时,确保对复杂约束集的精确遵守,实现了功能导向的蛋白质设计性能提升。
链接: https://arxiv.org/abs/2510.14989
作者: Jacob K. Christopher,Austin Seamann,Jingyi Cui,Sagar Khare,Ferdinando Fioretto
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Diffusion models offer a powerful means of capturing the manifold of realistic protein structures, enabling rapid design for protein engineering tasks. However, existing approaches observe critical failure modes when precise constraints are necessary for functional design. To this end, we present a constrained diffusion framework for structure-guided protein design, ensuring strict adherence to functional requirements while maintaining precise stereochemical and geometric feasibility. The approach integrates proximal feasibility updates with ADMM decomposition into the generative process, scaling effectively to the complex constraint sets of this domain. We evaluate on challenging protein design tasks, including motif scaffolding and vacancy-constrained pocket design, while introducing a novel curated benchmark dataset for motif scaffolding in the PDZ domain. Our approach achieves state-of-the-art, providing perfect satisfaction of bonding and geometric constraints with no degradation in structural diversity.
zh
[AI-87] RegimeFolio: A Regime Aware ML System for Sectoral Portfolio Optimization in Dynamic Markets
【速读】:该论文旨在解决金融市场的非平稳性问题,即市场波动率状态的动态变化会显著影响资产间的联动关系和收益分布,而传统投资组合优化方法通常基于平稳性或无状态假设,在面对此类变化时适应能力不足。解决方案的关键在于提出一种称为RegimeFolio的新型“状态感知”且行业专精的框架,其核心创新在于将显式的波动率状态划分(volatility regime segmentation)与行业特定的集成预测模型及自适应均值-方差配置相结合,从而确保预测结果与投资决策始终与当前市场状态保持一致,提升在动态市场中的鲁棒性和可解释性。
链接: https://arxiv.org/abs/2510.14986
作者: Yiyao Zhang,Diksha Goel,Hussain Ahmad,Claudia Szabo
机构: 未知
类目: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI)
备注:
Abstract:Financial markets are inherently non-stationary, with shifting volatility regimes that alter asset co-movements and return distributions. Standard portfolio optimization methods, typically built on stationarity or regime-agnostic assumptions, struggle to adapt to such changes. To address these challenges, we propose RegimeFolio, a novel regime-aware and sector-specialized framework that, unlike existing regime-agnostic models such as DeepVol and DRL optimizers, integrates explicit volatility regime segmentation with sector-specific ensemble forecasting and adaptive mean-variance allocation. This modular architecture ensures forecasts and portfolio decisions remain aligned with current market conditions, enhancing robustness and interpretability in dynamic markets. RegimeFolio combines three components: (i) an interpretable VIX-based classifier for market regime detection; (ii) regime and sector-specific ensemble learners (Random Forest, Gradient Boosting) to capture conditional return structures; and (iii) a dynamic mean-variance optimizer with shrinkage-regularized covariance estimates for regime-aware allocation. We evaluate RegimeFolio on 34 large cap U.S. equities from 2020 to 2024. The framework achieves a cumulative return of 137 percent, a Sharpe ratio of 1.17, a 12 percent lower maximum drawdown, and a 15 to 20 percent improvement in forecast accuracy compared to conventional and advanced machine learning benchmarks. These results show that explicitly modeling volatility regimes in predictive learning and portfolio allocation enhances robustness and leads to more dependable decision-making in real markets.
zh
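【代码示意】:该框架可拆出两个易于单独理解的组件:基于 VIX 阈值的可解释状态划分,以及带收缩正则协方差的均值-方差配置。下面的 NumPy 草图按常见做法各给出一个极简版本;阈值、收缩系数与模拟数据均为示例假设,并非论文实现。

```python
import numpy as np

def classify_regime(vix, low=15.0, high=25.0):
    """基于 VIX 阈值的可解释状态划分(阈值为示例假设)。"""
    if vix < low:
        return "calm"
    return "volatile" if vix > high else "normal"

def mean_variance_weights(mu, Sigma, shrink=0.1):
    """带收缩正则的均值-方差配置:w ∝ Σ_shrunk^{-1} μ,再按绝对值归一化。"""
    n = len(mu)
    Sigma_s = (1 - shrink) * Sigma + shrink * (np.trace(Sigma) / n) * np.eye(n)
    w = np.linalg.solve(Sigma_s, mu)
    return w / np.abs(w).sum()

rng = np.random.default_rng(0)
R = 0.01 * rng.standard_normal((250, 4)) + 0.0003    # 模拟 4 只股票的日收益
print(classify_regime(vix=28.0))                     # -> volatile
print(mean_variance_weights(R.mean(axis=0), np.cov(R.T)).round(3))
```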
[AI-88] DeepAries: Adaptive Rebalancing Interval Selection for Enhanced Portfolio Selection CIKM2025
【速读】:该论文旨在解决传统动态资产组合管理中因采用固定频率 rebalancing(再平衡)策略而导致的交易成本过高与风险调整收益不足的问题。现有方法通常忽略市场状态变化对再平衡时机的影响,导致在无需调整时仍频繁操作,从而增加不必要的成本并削弱收益表现。解决方案的关键在于提出 DeepAries 框架,其通过深度强化学习联合优化再平衡时机(离散动作)与资产配置权重(连续动作),利用基于 Transformer 的状态编码器捕捉长期市场依赖关系,并结合近端策略优化(PPO)实现多模态决策生成,从而在不同市场环境下自适应地选择最优再平衡周期与组合权重,显著降低交易成本、提升风险调整后的收益表现。
链接: https://arxiv.org/abs/2510.14985
作者: Jinkyu Kim,Hyunjung Yi,Mogan Gim,Donghee Choi,Jaewoo Kang
机构: 未知
类目: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: CIKM 2025 Applied Research Track Accepted
Abstract:We propose DeepAries, a novel deep reinforcement learning framework for dynamic portfolio management that jointly optimizes the timing and allocation of rebalancing decisions. Unlike prior reinforcement learning methods that employ fixed rebalancing intervals regardless of market conditions, DeepAries adaptively selects optimal rebalancing intervals along with portfolio weights to reduce unnecessary transaction costs and maximize risk-adjusted returns. Our framework integrates a Transformer-based state encoder, which effectively captures complex long-term market dependencies, with Proximal Policy Optimization (PPO) to generate simultaneous discrete (rebalancing intervals) and continuous (asset allocations) actions. Extensive experiments on multiple real-world financial markets demonstrate that DeepAries significantly outperforms traditional fixed-frequency and full-rebalancing strategies in terms of risk-adjusted returns, transaction costs, and drawdowns. Additionally, we provide a live demo of DeepAries at this https URL, along with the source code and dataset at this https URL, illustrating DeepAries’ capability to produce interpretable rebalancing and allocation decisions aligned with shifting market regimes. Overall, DeepAries introduces an innovative paradigm for adaptive and practical portfolio management by integrating both timing and allocation into a unified decision-making process.
zh
机器学习
[LG-0] Learning Correlated Reward Models: Statistical Barriers and Opportunities
链接: https://arxiv.org/abs/2510.15839
作者: Yeshwanth Cherapanamjeri,Constantinos Daskalakis,Gabriele Farina,Sobhan Mohammadpour
类目: Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
备注:
Abstract:Random Utility Models (RUMs) are a classical framework for modeling user preferences and play a key role in reward modeling for Reinforcement Learning from Human Feedback (RLHF). However, a crucial shortcoming of many of these techniques is the Independence of Irrelevant Alternatives (IIA) assumption, which collapses all human preferences to a universal underlying utility function, yielding a coarse approximation of the range of human preferences. On the other hand, statistical and computational guarantees for models avoiding this assumption are scarce. In this paper, we investigate the statistical and computational challenges of learning a correlated probit model, a fundamental RUM that avoids the IIA assumption. First, we establish that the classical data collection paradigm of pairwise preference data is fundamentally insufficient to learn correlational information, explaining the lack of statistical and computational guarantees in this setting. Next, we demonstrate that best-of-three preference data provably overcomes these shortcomings, and devise a statistically and computationally efficient estimator with near-optimal performance. These results highlight the benefits of higher-order preference data in learning correlated utilities, allowing for more fine-grained modeling of human preferences. Finally, we validate these theoretical guarantees on several real-world datasets, demonstrating improved personalization of human preferences.
[LG-1] Transfer Orthology Networks
链接: https://arxiv.org/abs/2510.15837
作者: Vikash Singh
类目: Machine Learning (cs.LG)
备注: 4 pages
Abstract:We present Transfer Orthology Networks (TRON), a novel neural network architecture designed for cross-species transfer learning. TRON leverages orthologous relationships, represented as a bipartite graph between species, to guide knowledge transfer. Specifically, we prepend a learned species conversion layer, whose weights are masked by the biadjacency matrix of this bipartite graph, to a pre-trained feedforward neural network that predicts a phenotype from gene expression data in a source species. This allows for efficient transfer of knowledge to a target species by learning a linear transformation that maps gene expression from the source to the target species’ gene space. The learned weights of this conversion layer offer a potential avenue for interpreting functional orthology, providing insights into how genes across species contribute to the phenotype of interest. TRON offers a biologically grounded and interpretable approach to cross-species transfer learning, paving the way for more effective utilization of available transcriptomic data. We are in the process of collecting cross-species transcriptomic/phenotypic data to gain experimental validation of the TRON architecture.
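【代码示意】:TRON 的核心是"由同源二部图的邻接矩阵掩蔽的物种转换线性层"。下面的 PyTorch 草图是对这一描述的理解性实现;基因维度、同源稀疏度与主干网络均为示例假设,非官方代码。

```python
import torch
import torch.nn as nn

class SpeciesConversion(nn.Module):
    """由同源二部图邻接矩阵掩蔽的线性层:仅同源基因对之间允许非零权重。"""
    def __init__(self, biadjacency: torch.Tensor):
        super().__init__()                                   # biadjacency: (源基因数, 目标基因数)
        self.register_buffer("mask", biadjacency.float())
        self.weight = nn.Parameter(0.01 * torch.randn_like(self.mask))

    def forward(self, x_target):
        # 将目标物种表达 (B, 目标基因数) 映射到源物种基因空间 (B, 源基因数)
        return x_target @ (self.weight * self.mask).T

n_src, n_tgt = 100, 80
biadj = (torch.rand(n_src, n_tgt) < 0.05)                    # 示例同源关系
conv = SpeciesConversion(biadj)
backbone = nn.Sequential(nn.Linear(n_src, 32), nn.ReLU(), nn.Linear(32, 1))  # 预训练网络的占位
pred = backbone(conv(torch.randn(4, n_tgt)))                 # 先转换,再经(冻结的)主干预测表型
```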
[LG-2] FIDDLE: Reinforcement Learning for Quantum Fidelity Enhancement
链接: https://arxiv.org/abs/2510.15833
作者: Hoang M. Ngo,Tamer Kahveci,My T. Thai
类目: Machine Learning (cs.LG)
备注:
Abstract:Quantum computing has the potential to revolutionize fields like quantum optimization and quantum machine learning. However, current quantum devices are hindered by noise, reducing their reliability. A key challenge in gate-based quantum computing is improving the reliability of quantum circuits, measured by process fidelity, during the transpilation process, particularly in the routing stage. In this paper, we address the Fidelity Maximization in Routing Stage (FMRS) problem by introducing FIDDLE, a novel learning framework comprising two modules: a Gaussian Process-based surrogate model to estimate process fidelity with limited training samples and a reinforcement learning module to optimize routing. Our approach is the first to directly maximize process fidelity, outperforming traditional methods that rely on indirect metrics such as circuit depth or gate count. We rigorously evaluate FIDDLE by comparing it with state-of-the-art fidelity estimation techniques and routing optimization methods. The results demonstrate that our proposed surrogate model provides a better estimate of the process fidelity than existing learning techniques, and our end-to-end framework significantly improves the process fidelity of quantum circuits across various noise models.
[LG-3] Cavity Duplexer Tuning with 1d Resnet-like Neural Networks
链接: https://arxiv.org/abs/2510.15796
作者: Anton Raskovalov
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
备注:
Abstract:This paper presents a machine learning method for tuning a cavity duplexer with a large number of adjustment screws. After initial tests we declined the conventional reinforcement learning approach and reformulated our task in a supervised learning setup. The suggested neural network architecture includes a 1d ResNet-like backbone and processing of additional information about the S-parameters, such as the shape of the curve and the positions and amplitudes of its peaks. This neural network, combined with an external control algorithm, is capable of reaching an almost tuned state of the duplexer within 4-5 rotations per screw.
[LG-4] DexCanvas: Bridging Human Demonstrations and Robot Learning for Dexterous Manipulation
链接: https://arxiv.org/abs/2510.15786
作者: Xinyue Xu,Jieqiang Sun,Jing (Daisy) Dai,Siyuan Chen,Lanjie Ma,Ke Sun,Bin Zhao,Jianbo Yuan,Yiwen Lu
类目: Robotics (cs.RO); Machine Learning (cs.LG)
备注:
Abstract:We present DexCanvas, a large-scale hybrid real-synthetic human manipulation dataset containing 7,000 hours of dexterous hand-object interactions seeded from 70 hours of real human demonstrations, organized across 21 fundamental manipulation types based on the Cutkosky taxonomy. Each entry combines synchronized multi-view RGB-D, high-precision mocap with MANO hand parameters, and per-frame contact points with physically consistent force profiles. Our real-to-sim pipeline uses reinforcement learning to train policies that control an actuated MANO hand in physics simulation, reproducing human demonstrations while discovering the underlying contact forces that generate the observed object motion. DexCanvas is the first manipulation dataset to combine large-scale real demonstrations, systematic skill coverage based on established taxonomies, and physics-validated contact annotations. The dataset can facilitate research in robotic manipulation learning, contact-rich control, and skill transfer across different hand morphologies.
[LG-5] SAMix: Calibrated and Accurate Continual Learning via Sphere-Adaptive Mixup and Neural Collapse
链接: https://arxiv.org/abs/2510.15751
作者: Trung-Anh Dang,Vincent Nguyen,Ngoc-Son Vu,Christel Vrain
类目: Machine Learning (cs.LG)
备注:
Abstract:While most continual learning methods focus on mitigating forgetting and improving accuracy, they often overlook the critical aspect of network calibration, despite its importance. Neural collapse, a phenomenon where last-layer features collapse to their class means, has demonstrated advantages in continual learning by reducing feature-classifier misalignment. Few works aim to improve the calibration of continual models for more reliable predictions. Our work goes a step further by proposing a novel method that not only enhances calibration but also improves performance by reducing overconfidence, mitigating forgetting, and increasing accuracy. We introduce Sphere-Adaptive Mixup (SAMix), an adaptive mixup strategy tailored for neural collapse-based methods. SAMix adapts the mixing process to the geometric properties of feature spaces under neural collapse, ensuring more robust regularization and alignment. Experiments show that SAMix significantly boosts performance, surpassing SOTA methods in continual learning while also improving model calibration. SAMix enhances both across-task accuracy and the broader reliability of predictions, making it a promising advancement for robust continual learning systems.
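【代码示意】:SAMix 将 mixup 适配到神经坍缩对应的球面几何。下面给出一个理解性的极简草图:把特征归一化到单位球面、做 mixup、再投影回球面;Beta 分布参数等均为示例假设,并非官方实现,也未包含论文中"自适应"的具体机制。

```python
import torch
import torch.nn.functional as F

def sphere_adaptive_mixup(feats, labels_onehot, alpha=0.2):
    """在单位球面上做 mixup 后重新归一化(对 SAMix 思想的简化理解,非官方实现)。"""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    idx = torch.randperm(feats.size(0))
    z = F.normalize(feats, dim=1)              # 投影到单位球面(神经坍缩的几何设定)
    mixed = F.normalize(lam * z + (1 - lam) * z[idx], dim=1)   # 混合后拉回球面
    y = lam * labels_onehot + (1 - lam) * labels_onehot[idx]
    return mixed, y

feats = torch.randn(16, 64)
y = F.one_hot(torch.randint(0, 10, (16,)), num_classes=10).float()
z_mix, y_mix = sphere_adaptive_mixup(feats, y)
print(z_mix.norm(dim=1)[:4])                   # 全部为 1
```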
[LG-6] A Comprehensive Evaluation of Graph Neural Networks and Physics Informed Learning for Surrogate Modelling of Finite Element Analysis
链接: https://arxiv.org/abs/2510.15750
作者: Nayan Kumar Singh
类目: Machine Learning (cs.LG)
备注: 14 pages, 6 figures, 5 tables. Code available at: this https URL
Abstract:Although Finite Element Analysis (FEA) is an integral part of the product design lifecycle, the analysis is computationally expensive, making it unsuitable for many design optimization problems. Deep learning models can be a great solution, but selecting an architecture that emulates FEA with high accuracy is a challenge. This paper presents a comprehensive evaluation of graph neural networks (GNNs) and 3D U-Nets as surrogates for FEA of parametric I-beams. We introduce a Physics-Informed Neural Network (PINN) framework, governed by the Navier-Cauchy equations, to enforce physical laws. Crucially, we demonstrate that a curriculum learning strategy, pretraining on data followed by physics-informed fine-tuning, is essential for stabilizing training. Our results show that GNNs fundamentally outperform the U-Net. Even the worst performer among GNNs, the GCN framework, achieved a relative L2 error of 8.7%, while the best U-Net variant, a U-Net with an attention mechanism trained on high-resolution data, achieved 13.0%. Among the graph-based architectures, the Message Passing Neural Networks (MPNN) and Graph Transformers achieved the highest accuracy, with relative L2 errors of 3.5% and 2.6%, respectively. The inclusion of fundamental physical laws (PINN) significantly improved generalization, reducing error by up to 11.3% on high-signal tasks. While the Graph Transformer is the most accurate model, it is 37.5% slower during inference than the second-best model, MPNN PINN. The PINN-enhanced MPNN (MPNN PINN) provides the most practical solution, offering a good compromise between predictive performance, model size, and inference speed.
[LG-7] Constrained Adversarial Perturbation
链接: https://arxiv.org/abs/2510.15699
作者: Virendra Nishad (IIT Kanpur, India),Bhaskar Mukhoty (IIT Delhi, India),Hilal AlQuabeh (MBZUAI, UAE),Sandeep K. Shukla (IIIT Hyderabad, India),Sayak Ray Chowdhury (IIT Kanpur, India)
类目: Machine Learning (cs.LG)
备注:
Abstract:Deep neural networks have achieved remarkable success in a wide range of classification tasks. However, they remain highly susceptible to adversarial examples - inputs that are subtly perturbed to induce misclassification while appearing unchanged to humans. Among various attack strategies, Universal Adversarial Perturbations (UAPs) have emerged as a powerful tool for both stress testing model robustness and facilitating scalable adversarial training. Despite their effectiveness, most existing UAP methods neglect domain specific constraints that govern feature relationships. Violating such constraints, such as debt to income ratios in credit scoring or packet flow invariants in network communication, can render adversarial examples implausible or easily detectable, thereby limiting their real world applicability. In this work, we advance universal adversarial attacks to constrained feature spaces by formulating an augmented Lagrangian based min-max optimization problem that enforces multiple, potentially complex constraints of varying importance. We propose Constrained Adversarial Perturbation (CAP), an efficient algorithm that solves this problem using a gradient based alternating optimization strategy. We evaluate CAP across diverse domains including finance, IT networks, and cyber physical systems, and demonstrate that it achieves higher attack success rates while significantly reducing runtime compared to existing baselines. Our approach also generalizes seamlessly to individual adversarial perturbations, where we observe similar strong performance gains. Finally, we introduce a principled procedure for learning feature constraints directly from data, enabling broad applicability across domains with structured input spaces.
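【代码示意】:论文将约束攻击写成增广拉格朗日式的 min-max 问题并用交替梯度法求解。下面的 NumPy 草图用更简单的二次罚函数示意"最大化攻击损失 + 惩罚约束违反 + 投影回 L∞ 预算"的迭代框架;目标、约束与步长均为玩具示例假设,与论文算法并不等同。

```python
import numpy as np

def constrained_uap(X, grad_loss, c, grad_c, steps=200, lr=0.01, rho=10.0, eps=0.5):
    """目标:max_d loss(X+d) - (rho/2)·mean(max(c(X+d),0)^2),且 ||d||_inf <= eps。"""
    d = np.zeros(X.shape[1])
    for _ in range(steps):
        viol = np.maximum(c(X + d), 0.0)                     # 每个样本的约束违反量
        g = grad_loss(X + d) - rho * (viol[:, None] * grad_c(X + d)).mean(0)
        d = np.clip(d + lr * g, -eps, eps)                   # 投影回 L∞ 预算
    return d

# 玩具例子:线性打分 s(x) = w·x,约束 x1 + x2 <= 1
rng = np.random.default_rng(0)
X = rng.standard_normal((128, 3))
w = np.array([1.0, -2.0, 0.5])
d = constrained_uap(
    X,
    grad_loss=lambda Z: w,                                   # d mean(w·z) / dd = w
    c=lambda Z: Z[:, 0] + Z[:, 1] - 1.0,
    grad_c=lambda Z: np.tile([1.0, 1.0, 0.0], (len(Z), 1)),
)
print(d)
```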
[LG-8] WARP-LUTs - Walsh-Assisted Relaxation for Probabilistic Look Up Tables
链接: https://arxiv.org/abs/2510.15655
作者: Lino Gerlach,Liv Våge,Thore Gerlach,Elliott Kauffman
类目: Machine Learning (cs.LG)
备注: Preprint. Under review
Abstract:Fast and efficient machine learning is of growing interest to the scientific community and has spurred significant research into novel model architectures and hardware-aware design. Recent hard- and software co-design approaches have demonstrated impressive results with entirely multiplication-free models. Differentiable Logic Gate Networks (DLGNs), for instance, provide a gradient-based framework for learning optimal combinations of low-level logic gates, setting state-of-the-art trade-offs between accuracy, resource usage, and latency. However, these models suffer from high computational cost during training and do not generalize well to logic blocks with more inputs. In this work, we introduce Walsh-Assisted Relaxation for Probabilistic Look-Up Tables (WARP-LUTs) - a novel gradient-based method that efficiently learns combinations of logic gates with substantially fewer trainable parameters. We demonstrate that WARP-LUTs achieve significantly faster convergence on CIFAR-10 compared to DLGNs, while maintaining comparable accuracy. Furthermore, our approach suggests potential for extension to higher-input logic blocks, motivating future research on extremely efficient deployment on modern FPGAs and its real-time science applications.
[LG-9] Fast and Compact Tsetlin Machine Inference on CPUs Using Instruction-Level Optimization
链接: https://arxiv.org/abs/2510.15653
作者: Yefan Zeng,Shengyu Duan,Rishad Shafik,Alex Yakovlev
类目: Machine Learning (cs.LG)
备注:
Abstract:The Tsetlin Machine (TM) offers high-speed inference on resource-constrained devices such as CPUs. Its logic-driven operations naturally lend themselves to parallel execution on modern CPU architectures. Motivated by this, we propose an efficient software implementation of the TM by leveraging instruction-level bitwise operations for compact model representation and accelerated processing. To further improve inference speed, we introduce an early exit mechanism, which exploits the TM’s AND-based clause evaluation to avoid unnecessary computations. Building upon this, we propose a literal Reorder strategy designed to maximize the likelihood of early exits. This strategy is applied during a post-training, pre-inference stage through statistical analysis of all literals and the corresponding actions of their associated Tsetlin Automata (TA), introducing negligible runtime overhead. Experimental results using the gem5 simulator with an ARM processor show that our optimized implementation reduces inference time by up to 96.71% compared to the conventional integer-based TM implementations while maintaining comparable code density.
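【代码示意】:早退机制利用"子句是所含文字的 AND"这一事实:一旦遇到取值为 0 的文字即可停止;Reorder 策略再把统计上更可能为 0 的文字排到前面,使早退尽可能早发生。下面是纯 Python 的最小示意(非论文的指令级位运算实现)。

```python
def clause_output(included, x):
    """子句 = 所含文字的 AND;遇到第一个为 0 的文字即提前退出。
    included 假定已按“更可能为 0”的统计排序(Reorder 策略)。"""
    for i in included:
        if not x[i]:
            return 0          # 早退:后续文字无需再看
    return 1

x = [1, 0, 1, 1]
print(clause_output([1, 0, 2], x))   # 文字 1 为 0,第一次比较即退出,输出 0
print(clause_output([0, 2, 3], x))   # 所有文字为 1,输出 1
```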
[LG-10] GOGH: Correlation-Guided Orchestration of GPUs in Heterogeneous Clusters
链接: https://arxiv.org/abs/2510.15652
作者: Ahmad Raeisi,Mahdi Dolati,Sina Darabi,Sadegh Talebi,Patrick Eugster,Ahmad Khonsari
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: 10 pages, 5 figures
Abstract:The growing demand for computational resources in machine learning has made efficient resource allocation a critical challenge, especially in heterogeneous hardware clusters where devices vary in capability, age, and energy efficiency. Upgrading to the latest hardware is often infeasible, making sustainable use of existing, mixed-generation resources essential. In this paper, we propose a learning-based architecture for managing machine learning workloads in heterogeneous clusters. The system operates online, allocating resources to incoming training or inference requests while minimizing energy consumption and meeting performance requirements. It uses two neural networks: the first provides initial estimates of how well a new model will utilize different hardware types and how it will affect co-located models. An optimizer then allocates resources based on these estimates. After deployment, the system monitors real performance and uses this data to refine its predictions via a second neural network. This updated model improves estimates not only for the current hardware but also for hardware not initially allocated and for co-location scenarios not yet observed. The result is an adaptive, iterative approach that learns over time to make more effective resource allocation decisions in heterogeneous deep learning clusters.
[LG-11] Deep Neural ODE Operator Networks for PDEs
链接: https://arxiv.org/abs/2510.15651
作者: Ziqian Li,Kang Liu,Yongcun Song,Hangrui Yue,Enrique Zuazua
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
备注:
Abstract:Operator learning has emerged as a promising paradigm for developing efficient surrogate models to solve partial differential equations (PDEs). However, existing approaches often overlook the domain knowledge inherent in the underlying PDEs and hence suffer from challenges in capturing temporal dynamics and generalization issues beyond training time frames. This paper introduces a deep neural ordinary differential equation (ODE) operator network framework, termed NODE-ONet, to alleviate these limitations. The framework adopts an encoder-decoder architecture comprising three core components: an encoder that spatially discretizes input functions, a neural ODE capturing latent temporal dynamics, and a decoder reconstructing solutions in physical spaces. Theoretically, error analysis for the encoder-decoder architecture is investigated. Computationally, we propose novel physics-encoded neural ODEs to incorporate PDE-specific physical properties. Such well-designed neural ODEs significantly reduce the framework’s complexity while enhancing numerical efficiency, robustness, applicability, and generalization capacity. Numerical experiments on nonlinear diffusion-reaction and Navier-Stokes equations demonstrate high accuracy, computational efficiency, and prediction capabilities beyond training time frames. Additionally, the framework’s flexibility to accommodate diverse encoders/decoders and its ability to generalize across related PDE families further underscore its potential as a scalable, physics-encoded tool for scientific machine learning.
[LG-12] Decentralized Parameter-Free Online Learning
链接: https://arxiv.org/abs/2510.15644
作者: Tomas Ortega,Hamid Jafarkhani
类目: Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC)
备注:
Abstract:We propose the first parameter-free decentralized online learning algorithms with network regret guarantees, which achieve sublinear regret without requiring hyperparameter tuning. This family of algorithms connects multi-agent coin-betting and decentralized online learning via gossip steps. To enable our decentralized analysis, we introduce a novel “betting function” formulation for coin-betting that simplifies the multi-agent regret analysis. Our analysis shows sublinear network regret bounds and is validated through experiments on synthetic and real datasets. This family of algorithms is applicable to distributed sensing, decentralized optimization, and collaborative ML applications.
[LG-13] GRATING: Low-Latency and Memory-Efficient Semantic Selection on Device
链接: https://arxiv.org/abs/2510.15620
作者: Jiahao Zhou,Chengliang Lin,Dingji Li,Mingkai Dong,Haibo Chen
类目: Machine Learning (cs.LG)
备注:
Abstract:Semantic top-K selection with cross-encoder rerankers underpins on-device AI services, such as retrieval-augmented generation, agent memory, and personalized recommendation. However, its latency and memory demands dominate end-to-end budgets on edge hardware. Revisiting the objective of top-K selection, we reveal that only relative rankings matter, not exact per-candidate scores. We further observe sequence-level sparsity: relative rankings stabilize early in intermediate layers, allowing pruning opportunities prior to completing full inference. Building on this insight, we propose monolithic forwarding and develop a training-free inference system, GRATING. By maintaining a global view of all candidates, it reduces latency through progressive cluster pruning. It also bounds peak memory usage by strategically overlapping I/O with computation via dual-layer sliding window and chunked execution. We evaluate GRATING against state-of-the-art baselines on rerankers from 0.6B to 8B parameters across Apple M2 and RTX 5070. GRATING consistently reduces latency by up to 89.0% and peak memory by up to 94.9% in microbenchmarks, without any loss in precision. Across three real-world on-device AI applications, GRATING lowers latency by 11.6%-51.0% and peak memory by 18.6%-77.8%, demonstrating substantial improvements in efficiency and deployability.
[LG-14] Attn-JGNN: Attention Enhanced Join-Graph Neural Networks
链接: https://arxiv.org/abs/2510.15583
作者: Jixin Zhang,Yong Lai
类目: Machine Learning (cs.LG)
备注:
Abstract:We propose an Attention Enhanced Join-Graph Neural Networks (Attn-JGNN) model for solving #SAT problems, which significantly improves the solving accuracy. Inspired by the Iterative Join Graph Propagation (IJGP) algorithm, Attn-JGNN uses tree decomposition to encode the CNF formula into a join-graph, then performs iterative message passing on the join-graph, and finally approximates the model count by learning partition functions. In order to further improve the accuracy of the solution, we apply the attention mechanism in and between clusters of the join-graphs, which makes Attn-JGNN pay more attention to the key variables and clusters in probabilistic inference, and reduces redundant calculation. Finally, our experiments show that our Attn-JGNN model achieves better results than other neural network methods.
[LG-15] On the Neural Feature Ansatz for Deep Neural Networks
链接: https://arxiv.org/abs/2510.15563
作者: Edward Tansley,Estelle Massart,Coralia Cartis
类目: Machine Learning (cs.LG)
备注:
Abstract:Understanding feature learning is an important open question in establishing a mathematical foundation for deep neural networks. The Neural Feature Ansatz (NFA) states that after training, the Gram matrix of the first-layer weights of a deep neural network is proportional to some power \alpha > 0 of the average gradient outer product (AGOP) of this network with respect to its inputs. Assuming gradient flow dynamics with balanced weight initialization, the NFA was proven to hold throughout training for two-layer linear networks with exponent \alpha = 1/2 (Radhakrishnan et al., 2024). We extend this result to networks with L \geq 2 layers, showing that the NFA holds with exponent \alpha = 1/L , thus demonstrating a depth dependency of the NFA. Furthermore, we prove that for unbalanced initialization, the NFA holds asymptotically through training if weight decay is applied. We also provide counterexamples showing that the NFA does not hold for some network architectures with nonlinear activations, even when these networks fit arbitrarily well the training data. We thoroughly validate our theoretical results through numerical experiments across a variety of optimization algorithms, weight decay rates and initialization schemes.
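【代码示意】:对两层线性网络,NFA 断言 W1ᵀW1 ∝ (AGOP)^(1/2)。下面的 NumPy 片段按"平衡初始化"条件显式构造满足 W2W1 = A 且 W2ᵀW2 = W1W1ᵀ 的分解,并数值验证该恒等式;线性情形下 AGOP = AᵀA,与输入分布无关。此为对已知结论的自包含演示,不涉及论文的新结果。

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))                  # 端到端映射 f(x) = A x = W2 W1 x
U, S, Vt = np.linalg.svd(A, full_matrices=False)
W1 = np.diag(np.sqrt(S)) @ Vt                    # 平衡分解:W2 W1 = A,且 W2ᵀW2 = W1 W1ᵀ
W2 = U @ np.diag(np.sqrt(S))

AGOP = A.T @ A                                   # 线性网络的 AGOP(与输入无关)
evals, evecs = np.linalg.eigh(AGOP)
agop_sqrt = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
print(np.allclose(W1.T @ W1, agop_sqrt, atol=1e-8))   # True:NFA 以 α = 1/2 精确成立
```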
[LG-16] Doubly Robust Estimation of Causal Effects in Strategic Equilibrium Systems
链接: https://arxiv.org/abs/2510.15555
作者: Sibo Xiao
类目: Machine Learning (cs.LG)
备注:
Abstract:We introduce the Strategic Doubly Robust (SDR) estimator, a novel framework that integrates strategic equilibrium modeling with doubly robust estimation for causal inference in strategic environments. SDR addresses endogenous treatment assignment arising from strategic agent behavior, maintaining double robustness while incorporating strategic considerations. Theoretical analysis confirms SDR’s consistency and asymptotic normality under strategic unconfoundedness. Empirical evaluations demonstrate SDR’s superior performance over baseline methods, achieving 7.6%-29.3% bias reduction across varying strategic strengths and maintaining robust scalability with agent populations. The framework provides a principled approach for reliable causal inference when agents respond strategically to interventions.
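【代码示意】:SDR 构建在经典的 AIPW(增广逆概率加权)双重稳健估计之上。下面的 NumPy 片段实现这一基础部分,并在已知真实 ATE = 2 的合成数据上验证;数据生成过程为示例假设,不包含论文新增的策略均衡建模。

```python
import numpy as np

def doubly_robust_ate(y, t, mu0, mu1, e):
    """AIPW 双重稳健 ATE 估计:结果模型与倾向得分只要一个正确即一致。"""
    return np.mean(mu1 - mu0
                   + t * (y - mu1) / e
                   - (1 - t) * (y - mu0) / (1 - e))

rng = np.random.default_rng(0)
n = 5000
x = rng.standard_normal(n)
e = 1.0 / (1.0 + np.exp(-x))              # 倾向得分:处理分配依赖于 x(内生选择)
t = rng.binomial(1, e)
y = 2.0 * t + x + rng.standard_normal(n)  # 真实 ATE = 2
mu0, mu1 = x, x + 2.0                     # 假设结果模型已正确设定
print(doubly_robust_ate(y, t, mu0, mu1, e))   # ≈ 2
```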
[LG-17] SpikeFit: Towards Optimal Deployment of Spiking Networks on Neuromorphic Hardware
链接: https://arxiv.org/abs/2510.15542
作者: Ivan Kartashov,Mariia Pushkareva,Iakov Karandashev
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注: 13 pages, 2 figures. Work in progress
Abstract:This paper introduces SpikeFit, a novel training method for Spiking Neural Networks (SNNs) that enables efficient inference on neuromorphic hardware, considering all its stringent requirements: the number of neurons and synapses that can fit on a single device, and lower bit-width representations (e.g., 4-bit, 8-bit). Unlike conventional compressing approaches that address only a subset of these requirements (limited numerical precision and limited number of neurons in the network), SpikeFit treats the allowed weights’ discrete values themselves as learnable parameters co-optimized with the model, allowing for optimal Clusterization-Aware Training (CAT) of the model’s weights at low precision (2-, 4-, or 8-bit) which results in higher network compression efficiency, as well as limiting the number of unique synaptic connections to a value required by neuromorphic processor. This joint optimization allows SpikeFit to find a discrete weight set aligned with hardware constraints, enabling the most complete deployment across a broader range of neuromorphic processors than existing methods of SNN compression support. Moreover, SpikeFit introduces a new hardware-friendly Fisher Spike Contribution (FSC) pruning method showing the state-of-the-art performance. We demonstrate that for spiking neural networks constrained to only four unique synaptic weight values (M = 4), our SpikeFit method not only outperforms state-of-the-art SNNs compression methods and conventional baselines combining extreme quantization schemes and clustering algorithms, but also meets a wider range of neuromorphic hardware requirements and provides the lowest energy use in experiments.
[LG-18] Compressive Modeling and Visualization of Multivariate Scientific Data using Implicit Neural Representation
链接: https://arxiv.org/abs/2510.15535
作者: Abhay Kumar Dwivedi,Shanu Saklani,Soumya Dutta
类目: Machine Learning (cs.LG); Graphics (cs.GR)
备注: Accepted for publication in 16th Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP 2025)
Abstract:The extensive adoption of Deep Neural Networks has led to their increased utilization in challenging scientific visualization tasks. Recent advancements in building compressed data models using implicit neural representations have shown promising results for tasks like spatiotemporal volume visualization and super-resolution. Inspired by these successes, we develop compressed neural representations for multivariate datasets containing tens to hundreds of variables. Our approach utilizes a single network to learn representations for all data variables simultaneously through parameter sharing. This allows us to achieve state-of-the-art data compression. Through comprehensive evaluations, we demonstrate superior performance in terms of reconstructed data quality, rendering and visualization quality, preservation of dependency information among variables, and storage efficiency.
[LG-19] Theoretical Refinement of CLIP by Utilizing Linear Structure of Optimal Similarity
链接: https://arxiv.org/abs/2510.15508
作者: Naoki Yoshida,Satoshi Hayakawa,Yuhta Takida,Toshimitsu Uesaka,Hiromi Wakaki,Yuki Mitsufuji
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
备注:
Abstract:In this study, we propose an enhancement to the similarity computation mechanism in multi-modal contrastive pretraining frameworks such as CLIP. Prior theoretical research has demonstrated that the optimal similarity metrics between paired modalities should correspond to the pointwise mutual information (PMI) between the two modalities. However, the current implementations of CLIP and its variants fail to fully utilize the underlying linear structure of PMI. We therefore propose KME-CLIP, which leverages this structure through the inner product in a reproducing kernel Hilbert space. We theoretically prove that our method can approximate PMI with arbitrary accuracy and empirically demonstrate that our approach overall outperforms the standard CLIP formulation across several retrieval and classification tasks.
[LG-20] Adversary-Free Counterfactual Prediction via Information-Regularized Representations
链接: https://arxiv.org/abs/2510.15479
作者: Shiqin Tang,Rong Feng,Shuxin Zhuang,Hongzong Li,Youzhi Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:We study counterfactual prediction under assignment bias and propose a mathematically grounded, information-theoretic approach that removes treatment-covariate dependence without adversarial training. Starting from a bound that links the counterfactual-factual risk gap to mutual information, we learn a stochastic representation Z that is predictive of outcomes while minimizing I(Z; T). We derive a tractable variational objective that upper-bounds the information term and couples it with a supervised decoder, yielding a stable, provably motivated training criterion. The framework extends naturally to dynamic settings by applying the information penalty to sequential representations at each decision time. We evaluate the method on controlled numerical simulations and a real-world clinical dataset, comparing against recent state-of-the-art balancing, reweighting, and adversarial baselines. Across metrics of likelihood, counterfactual error, and policy evaluation, our approach performs favorably while avoiding the training instabilities and tuning burden of adversarial schemes.
[LG-21] Particle Dynamics for Latent-Variable Energy-Based Models
链接: https://arxiv.org/abs/2510.15447
作者: Shiqin Tang,Shuxin Zhuang,Rong Feng,Runsheng Yu,Hongzong Li,Youzhi Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Latent-variable energy-based models (LVEBMs) assign a single normalized energy to joint pairs of observed data and latent variables, offering expressive generative modeling while capturing hidden structure. We recast maximum-likelihood training as a saddle problem over distributions on the latent and joint manifolds and view the inner updates as coupled Wasserstein gradient flows. The resulting algorithm alternates overdamped Langevin updates for a joint negative pool and for conditional latent particles with stochastic parameter ascent, requiring no discriminator or auxiliary networks. We prove existence and convergence under standard smoothness and dissipativity assumptions, with decay rates in KL divergence and Wasserstein-2 distance. The saddle-point view further yields an ELBO strictly tighter than bounds obtained with restricted amortized posteriors. Our method is evaluated on numerical approximations of physical systems and performs competitively against comparable approaches.
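【代码示意】:论文的内层更新是对负样本池与条件隐变量粒子的过阻尼 Langevin 更新。下面用 NumPy 给出单步更新 x ← x − η∇E(x) + √(2η)·ξ 的通用形式,并在二次能量的玩具例子上检验粒子群收敛到标准高斯。

```python
import numpy as np

def langevin_step(x, grad_E, eta=1e-2, rng=None):
    """过阻尼 Langevin 更新:x ← x − η∇E(x) + √(2η)·ξ,其中 ξ ~ N(0, I)。"""
    rng = rng or np.random.default_rng()
    return x - eta * grad_E(x) + np.sqrt(2 * eta) * rng.standard_normal(x.shape)

# 玩具能量 E(x) = ||x||^2 / 2(即 ∇E(x) = x),粒子群应收敛到标准高斯
rng = np.random.default_rng(0)
particles = 3.0 * rng.standard_normal((512, 2))
for _ in range(2000):
    particles = langevin_step(particles, grad_E=lambda z: z, rng=rng)
print(particles.std(axis=0))   # 每一维接近 1
```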
[LG-22] Safe, Efficient and Robust Reinforcement Learning for Ranking and Diffusion Models
链接: https://arxiv.org/abs/2510.15429
作者: Shashank Gupta
类目: Machine Learning (cs.LG)
备注: PhD Thesis of Shashank Gupta defended at the University of Amsterdam on October 13th 2025
Abstract:This dissertation investigates how reinforcement learning (RL) methods can be designed to be safe, sample-efficient, and robust. Framed through the unifying perspective of contextual-bandit RL, the work addresses two major application domains - ranking and recommendation, and text-to-image diffusion models. The first part of the thesis develops theory and algorithms for safe deployment in ranking systems. An exposure-based generalisation bound is derived, leading to a counterfactual risk-minimisation objective whose solution is guaranteed not to underperform the logging policy, even with sparse feedback. This guarantee is extended to doubly robust estimators, enabling safety even under adversarial or misspecified user models and offering practitioners explicit control over permissible utility loss. The second part turns to single-action bandits, where various off-policy estimators are unified within a baseline-correction framework. A closed-form optimal baseline is proposed and shown to minimise both evaluation and policy-gradient variance, thereby improving off-policy learning reliability. The final part examines the trade-offs between efficiency and effectiveness in generative RL. A systematic study of PPO and REINFORCE motivates the Leave-One-Out PPO (LOOP) algorithm, which combines multiple diffusion trajectories with a REINFORCE-style baseline inside PPO’s clipped objective. LOOP achieves PPO-level sample efficiency while producing generations that align more faithfully with textual attributes.
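【代码示意】:文中 LOOP 算法的一个核心构件是 REINFORCE 风格的留一(leave-one-out)基线:同一提示下采样 n 条扩散轨迹,第 i 条的优势为其奖励减去其余 n−1 条的平均奖励。下面是该基线的最小 NumPy 实现;与 PPO 裁剪目标的结合从略。

```python
import numpy as np

def loo_advantages(rewards):
    """留一基线:A_i = r_i - mean(r_{-i}),在保持无偏的同时降低梯度方差。"""
    r = np.asarray(rewards, dtype=float)
    baselines = (r.sum() - r) / (r.size - 1)   # 第 i 条轨迹的基线 = 其余轨迹的平均奖励
    return r - baselines

print(loo_advantages([1.0, 0.5, 2.0, 0.1]))    # 中心化的优势信号,总和为 0
```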
[LG-23] ParaFormer: Shallow Parallel Transformers with Progressive Approximation
链接: https://arxiv.org/abs/2510.15425
作者: Wei Wang,Xiao-Yong Wei,Qing Li
类目: Machine Learning (cs.LG)
备注:
Abstract:The widespread ‘deeper is better’ philosophy has driven the creation of architectures like ResNet and Transformer, which achieve high performance by stacking numerous layers. However, increasing model depth comes with challenges such as longer training times, higher inference latency, and impracticality on resource-constrained devices. To address these issues, we propose ParaFormer, a shallow Transformer architecture designed for true parallelism in both structure and computation. By formulating standard Transformers as function approximators in closed-form, our theoretical analysis shows that their performance relies on inter-layer collaboration for progressive approximation, rather than depth itself. While deep Transformers enforce this collaboration through sequential designs, we demonstrate that such collaboration is not inherently tied to sequential structures. ParaFormer removes the sequential constraint by organizing layers into parallel branches, enforcing inter-layer collaboration algorithmically. Specifically, we implement progressive approximation, ensuring that each new branch further reduces the loss from preceding branches, enabling faster convergence. Extensive experiments validate ParaFormer’s effectiveness, outperforming standard Transformers like ViT. Moreover, ParaFormer supports up to 15.07x model compression and facilitates model expansion for adaptive continuous learning. Experimental results on multi-GPU deployment demonstrate that ParaFormer is 3.30x faster than widely used parallelism solutions such as FairScale. These advancements stem from our closed-form formulation of Transformers based on the Universal Approximation Theorem, which not only explains the “depth belief” but also opens new avenues for designing efficient Transformer architectures. Source code: https://(open-upon-acceptance)
[LG-24] Online Kernel Dynamic Mode Decomposition for Streaming Time Series Forecasting with Adaptive Windowing
链接: https://arxiv.org/abs/2510.15404
作者: Christopher Salazar,Krithika Manohar,Ashis G. Banerjee
类目: Machine Learning (cs.LG)
备注:
Abstract:Real-time forecasting from streaming data poses critical challenges: handling non-stationary dynamics, operating under strict computational limits, and adapting rapidly without catastrophic forgetting. However, many existing approaches face trade-offs between accuracy, adaptability, and efficiency, particularly when deployed in constrained computing environments. We introduce WORK-DMD (Windowed Online Random Kernel Dynamic Mode Decomposition), a method that combines Random Fourier Features with online Dynamic Mode Decomposition to capture nonlinear dynamics through explicit feature mapping, while preserving fixed computational cost and competitive predictive accuracy across evolving data. WORK-DMD employs Sherman-Morrison updates within rolling windows, enabling continuous adaptation to evolving dynamics from only current data, eliminating the need for lengthy training or large storage requirements for historical data. Experiments on benchmark datasets across several domains show that WORK-DMD achieves higher accuracy than several state-of-the-art online forecasting methods, while requiring only a single pass through the data and demonstrating particularly strong performance in short-term forecasting. Our results show that combining kernel evaluations with adaptive matrix updates achieves strong predictive performance with minimal data requirements. This sample efficiency offers a practical alternative to deep learning for streaming forecasting applications.
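【代码示意】:WORK-DMD 将随机傅里叶特征(显式核特征映射)与 Sherman-Morrison 秩一更新相结合。下面的 NumPy 草图实现"RFF 特征 + 递推最小二乘一步预测器"作为理解性简化;未实现论文中滚动窗口的样本剔除,超参数均为示例假设。

```python
import numpy as np

class OnlineRFFRegressor:
    """随机傅里叶特征 + Sherman-Morrison 秩一更新的在线一步预测器(理解性简化)。"""
    def __init__(self, dim, n_feats=128, gamma=1.0, lam=1e-2, seed=0):
        rng = np.random.default_rng(seed)
        self.W = np.sqrt(2 * gamma) * rng.standard_normal((n_feats, dim))
        self.b = rng.uniform(0, 2 * np.pi, n_feats)
        self.P = np.eye(n_feats) / lam               # (ΦᵀΦ + λI)^{-1} 的递推估计
        self.theta = np.zeros((n_feats, dim))        # 特征空间中的线性动力学系数

    def _phi(self, x):
        return np.sqrt(2.0 / len(self.b)) * np.cos(self.W @ x + self.b)

    def update(self, x_t, x_next):
        z = self._phi(x_t)
        Pz = self.P @ z
        self.P -= np.outer(Pz, Pz) / (1.0 + z @ Pz)                    # Sherman-Morrison 秩一更新
        self.theta += np.outer(self.P @ z, x_next - z @ self.theta)    # 递推最小二乘

    def predict(self, x_t):
        return self._phi(x_t) @ self.theta

model = OnlineRFFRegressor(dim=1)
series = np.sin(np.linspace(0, 20, 400))[:, None]    # 流式一维时间序列
for t in range(len(series) - 1):
    model.update(series[t], series[t + 1])
print(model.predict(series[-1]))                     # 一步预测
```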
[LG-25] Geometric Mixture Models for Electrolyte Conductivity Prediction
链接: https://arxiv.org/abs/2510.15403
作者: Anyi Li,Jiacheng Cen,Songyou Li,Mingze Li,Yang Yu,Wenbing Huang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate prediction of ionic conductivity in electrolyte systems is crucial for advancing numerous scientific and technological applications. While significant progress has been made, current research faces two fundamental challenges: (1) the lack of high-quality standardized benchmarks, and (2) inadequate modeling of geometric structure and intermolecular interactions in mixture systems. To address these limitations, we first reorganize and enhance the CALiSol and DiffMix electrolyte datasets by incorporating geometric graph representations of molecules. We then propose GeoMix, a novel geometry-aware framework that preserves Set-SE(3) equivariance-an essential but challenging property for mixture systems. At the heart of GeoMix lies the Geometric Interaction Network (GIN), an equivariant module specifically designed for intermolecular geometric message passing. Comprehensive experiments demonstrate that GeoMix consistently outperforms diverse baselines (including MLPs, GNNs, and geometric GNNs) across both datasets, validating the importance of cross-molecular geometric interactions and equivariant message passing for accurate property prediction. This work not only establishes new benchmarks for electrolyte research but also provides a general geometric learning framework that advances modeling of mixture systems in energy materials, pharmaceutical development, and beyond.
[LG-26] Iterative Refinement of Flow Policies in Probability Space for Online Reinforcement Learning
链接: https://arxiv.org/abs/2510.15388
作者: Mingyang Sun,Pengxiang Ding,Weinan Zhang,Donglin Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:While behavior cloning with flow/diffusion policies excels at learning complex skills from demonstrations, it remains vulnerable to distributional shift, and standard RL methods struggle to fine-tune these models due to their iterative inference process and the limitations of existing workarounds. In this work, we introduce the Stepwise Flow Policy (SWFP) framework, founded on the key insight that discretizing the flow matching inference process via a fixed-step Euler scheme inherently aligns it with the variational Jordan-Kinderlehrer-Otto (JKO) principle from optimal transport. SWFP decomposes the global flow into a sequence of small, incremental transformations between proximate distributions. Each step corresponds to a JKO update, regularizing policy changes to stay near the previous iterate and ensuring stable online adaptation with entropic regularization. This decomposition yields an efficient algorithm that fine-tunes pre-trained flows via a cascade of small flow blocks, offering significant advantages: simpler/faster training of sub-models, reduced computational/memory costs, and provable stability grounded in Wasserstein trust regions. Comprehensive experiments demonstrate SWFP’s enhanced stability, efficiency, and superior adaptation performance across diverse robotic control benchmarks.
[LG-27] Sequence Modeling with Spectral Mean Flows
链接: https://arxiv.org/abs/2510.15366
作者: Jinwoo Kim,Max Beier,Petar Bevanda,Nayun Kim,Seunghoon Hong
类目: Machine Learning (cs.LG)
*备注: 30 pages, 9 figures
Abstract:A key question in sequence modeling with neural networks is how to represent and learn highly nonlinear and probabilistic state dynamics. Operator theory views such dynamics as linear maps on Hilbert spaces containing mean embedding vectors of distributions, offering an appealing but currently overlooked perspective. We propose a new approach to sequence modeling based on an operator-theoretic view of a hidden Markov model (HMM). Instead of materializing stochastic recurrence, we embed the full sequence distribution as a tensor in the product Hilbert space. A generative process is then defined as maximum mean discrepancy (MMD) gradient flow in the space of sequences. To overcome challenges with large tensors and slow sampling convergence, we introduce spectral mean flows, a novel tractable algorithm integrating two core concepts. First, we propose a new neural architecture by leveraging spectral decomposition of linear operators to derive a scalable tensor network decomposition of sequence mean embeddings. Second, we extend MMD gradient flows to time-dependent Hilbert spaces and connect them to flow matching via the continuity equation, enabling simulation-free learning and faster sampling. We demonstrate competitive results on a range of time-series modeling datasets. Code is available at this https URL.
[LG-28] TranSimHub: A Unified Air-Ground Simulation Platform for Multi-Modal Perception and Decision-Making
链接: https://arxiv.org/abs/2510.15365
作者: Maonan Wang,Yirong Chen,Yuxin Cai,Aoyu Pang,Yuejiao Xie,Zian Ma,Chengcheng Xu,Kemou Jiang,Ding Wang,Laurent Roullet,Chung Shue Chen,Zhiyong Cui,Yuheng Kan,Michael Lepech,Man-On Pun
类目: Systems and Control (eess.SY); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 9 pages, 4 figures
Abstract:Air-ground collaborative intelligence is becoming a key approach for next-generation urban intelligent transportation management, where aerial and ground systems work together on perception, communication, and decision-making. However, the lack of a unified multi-modal simulation environment has limited progress in studying cross-domain perception, coordination under communication constraints, and joint decision optimization. To address this gap, we present TranSimHub, a unified simulation platform for air-ground collaborative intelligence. TranSimHub offers synchronized multi-view rendering across RGB, depth, and semantic segmentation modalities, ensuring consistent perception between aerial and ground viewpoints. It also supports information exchange between the two domains and includes a causal scene editor that enables controllable scenario creation and counterfactual analysis under diverse conditions such as different weather, emergency events, and dynamic obstacles. We release TranSimHub as an open-source platform that supports end-to-end research on perception, fusion, and control across realistic air and ground traffic scenes. Our code is available at this https URL.
[LG-29] Backdoor or Manipulation? Graph Mixture of Experts Can Defend Against Various Graph Adversarial Attacks
链接: https://arxiv.org/abs/2510.15333
作者: Yuyuan Feng,Bin Ma,Enyan Dai
类目: Machine Learning (cs.LG)
*备注:
Abstract:Extensive research has highlighted the vulnerability of graph neural networks (GNNs) to adversarial attacks, including manipulation, node injection, and the recently emerging threat of backdoor attacks. However, existing defenses typically focus on a single type of attack, lacking a unified approach to simultaneously defend against multiple threats. In this work, we leverage the flexibility of the Mixture of Experts (MoE) architecture to design a scalable and unified framework for defending against backdoor, edge manipulation, and node injection attacks. Specifically, we propose an MI-based logic diversity loss to encourage individual experts to focus on distinct neighborhood structures in their decision processes, thus ensuring a sufficient subset of experts remains unaffected under perturbations in local structures. Moreover, we introduce a robustness-aware router that identifies perturbation patterns and adaptively routes perturbed nodes to corresponding robust experts. Extensive experiments conducted under various adversarial settings demonstrate that our method consistently achieves superior robustness against multiple graph adversarial attacks.
[LG-30] On the Generalization Properties of Learning the Random Feature Models with Learnable Activation Functions
链接: https://arxiv.org/abs/2510.15327
作者: Zailin Ma,Jiansheng Yang,Yaodong Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper studies the generalization properties of a recently proposed kernel method, the Random Feature models with Learnable Activation Functions (RFLAF). By applying a data-dependent sampling scheme for generating features, we provide by far the sharpest bounds on the required number of features for learning RFLAF in both the regression and classification tasks. We provide a unified theorem that describes the complexity of the feature number s, and discuss the results for the plain sampling scheme and the data-dependent leverage weighted scheme. Through weighted sampling, the bound on s in the MSE loss case is improved from \Omega(1/\epsilon^2) to \tilde{\Omega}((1/\epsilon)^{1/t}) in general (t \geq 1), and even to \Omega(1) when the Gram matrix has a finite rank. For the Lipschitz loss case, the bound is improved from \Omega(1/\epsilon^2) to \tilde{\Omega}((1/\epsilon^2)^{1/t}). To learn the weighted RFLAF, we also propose an algorithm to find an approximate kernel and then apply the leverage weighted sampling. Empirical results show that the weighted RFLAF achieves the same performance with significantly fewer features compared to the plainly sampled RFLAF, validating our theories and the effectiveness of this method.
[LG-31] DFCA: Decentralized Federated Clustering Algorithm
链接: https://arxiv.org/abs/2510.15300
作者: Jonas Kirch,Sebastian Becker,Tiago Koketsu Rodrigues,Stefan Harmeling
类目: Machine Learning (cs.LG)
*备注:
Abstract:Clustered Federated Learning has emerged as an effective approach for handling heterogeneous data across clients by partitioning them into clusters with similar or identical data distributions. However, most existing methods, including the Iterative Federated Clustering Algorithm (IFCA), rely on a central server to coordinate model updates, which creates a bottleneck and a single point of failure, limiting their applicability in more realistic decentralized learning settings. In this work, we introduce DFCA, a fully decentralized clustered FL algorithm that enables clients to collaboratively train cluster-specific models without central coordination. DFCA uses a sequential running average to aggregate models from neighbors as updates arrive, providing a communication-efficient alternative to batch aggregation while maintaining clustering performance. Our experiments on various datasets demonstrate that DFCA outperforms other decentralized algorithms and performs comparably to centralized IFCA, even under sparse connectivity, highlighting its robustness and practicality for dynamic real-world decentralized networks.
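摘要的核心机制是“顺序滑动平均”聚合:邻居的簇模型一到达就增量并入,而不是攒齐一批再做平均。下面是这一机制的极简示意;其中簇指派采用 IFCA 式的按本地损失取最小(线性模型仅作演示),接口命名均为假设,非论文官方代码。

```python
import numpy as np

class DFCAClient:
    def __init__(self, n_clusters, dim):
        self.models = [np.zeros(dim) for _ in range(n_clusters)]
        self.counts = [0] * n_clusters   # 各簇模型已并入的模型个数

    def assign_cluster(self, X, y):
        """IFCA 式簇指派:选在本地数据上损失最小的簇模型。"""
        losses = [np.mean((X @ w - y) ** 2) for w in self.models]
        return int(np.argmin(losses))

    def receive(self, k, w_in):
        """模型(本地更新或邻居发来)到达时立即按运行平均并入第 k 簇。"""
        self.counts[k] += 1
        self.models[k] += (w_in - self.models[k]) / self.counts[k]

client = DFCAClient(n_clusters=3, dim=10)
rng = np.random.default_rng(0)
client.receive(1, rng.normal(size=10))   # 先并入本地训练得到的模型
client.receive(1, rng.normal(size=10))   # 邻居模型陆续到达,逐个并入
```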
[LG-32] Small Ensemble-based Data Assimilation: A Machine Learning-Enhanced Data Assimilation Method with Limited Ensemble Size
链接: https://arxiv.org/abs/2510.15284
作者: Zhilin Li,Yao Zhou,Xianglong Li,Zeng Liu,Zhaokuan Lu,Shanlin Xu,Seungnam Kim,Guangyao Wang
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:Ensemble-based data assimilation (DA) methods have become increasingly popular due to their inherent ability to address nonlinear dynamic problems. However, these methods often face a trade-off between analysis accuracy and computational efficiency, as larger ensemble sizes required for higher accuracy also lead to greater computational cost. In this study, we propose a novel machine learning-based data assimilation approach that combines the traditional ensemble Kalman filter (EnKF) with a fully connected neural network (FCNN). Specifically, our method uses a relatively small ensemble size to generate preliminary yet suboptimal analysis states via EnKF. A FCNN is then employed to learn and predict correction terms for these states, thereby mitigating the performance degradation induced by the limited ensemble size. We evaluate the performance of our proposed EnKF-FCNN method through numerical experiments involving Lorenz systems and nonlinear ocean wave field simulations. The results consistently demonstrate that the new method achieves higher accuracy than traditional EnKF with the same ensemble size, while incurring negligible additional computational cost. Moreover, the EnKF-FCNN method is adaptable to diverse applications through coupling with different models and the use of alternative ensemble-based DA methods.
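下面用几行 Python 勾勒该方法的两个组件:标准的扰动观测 EnKF 分析步,以及“用全连接网络回归修正项”的思路。网络的输入输出设定(以分析均值为输入、以残差为目标)是本示意的假设,非论文的原始实现细节。

```python
import numpy as np

def enkf_analysis(ensemble, y_obs, H, obs_std, rng):
    """扰动观测版 EnKF 分析步。ensemble: (N, d) 状态集合;H: (m, d) 线性观测算子。"""
    N, m = ensemble.shape[0], H.shape[0]
    X = ensemble - ensemble.mean(axis=0)                 # 状态异常
    Y = ensemble @ H.T
    Yp = Y - Y.mean(axis=0)                              # 观测空间异常
    C_yy = Yp.T @ Yp / (N - 1) + obs_std**2 * np.eye(m)
    K = (X.T @ Yp / (N - 1)) @ np.linalg.inv(C_yy)       # Kalman 增益 (d, m)
    y_pert = y_obs + obs_std * rng.normal(size=(N, m))   # 扰动观测
    return ensemble + (y_pert - Y) @ K.T

rng = np.random.default_rng(0)
ens = rng.normal(size=(10, 3))        # 小集合(N=10),这正是精度受限的来源
H = np.eye(2, 3)
analysis = enkf_analysis(ens, np.array([0.5, -0.2]), H, obs_std=0.1, rng=rng)
small_mean = analysis.mean(axis=0)
# FCNN 修正(思路示意):离线用“(小集合分析均值, 参考解残差)”样本对训练回归网络 net,
# 在线时输出 corrected = small_mean + net(small_mean)。训练循环此处从略。
```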
[LG-33] Semi-Supervised Regression with Heteroscedastic Pseudo-Labels NEURIPS2025
链接: https://arxiv.org/abs/2510.15266
作者: Xueqing Sun,Renzhen Wang,Quanziang Wang,Yichen Wu,Xixi Jia,Deyu Meng
类目: Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2025
Abstract:Pseudo-labeling is a commonly used paradigm in semi-supervised learning, yet its application to semi-supervised regression (SSR) remains relatively under-explored. Unlike classification, where pseudo-labels are discrete and confidence-based filtering is effective, SSR involves continuous outputs with heteroscedastic noise, making it challenging to assess pseudo-label reliability. As a result, naive pseudo-labeling can lead to error accumulation and overfitting to incorrect labels. To address this, we propose an uncertainty-aware pseudo-labeling framework that dynamically adjusts pseudo-label influence from a bi-level optimization perspective. By jointly minimizing empirical risk over all data and optimizing uncertainty estimates to enhance generalization on labeled data, our method effectively mitigates the impact of unreliable pseudo-labels. We provide theoretical insights and extensive experiments to validate our approach across various benchmark SSR datasets, and the results demonstrate superior robustness and performance compared to existing methods. Our code is available at this https URL.
[LG-34] Causal Time Series Modeling of Supraglacial Lake Evolution in Greenland under Distribution Shift ICML
链接: https://arxiv.org/abs/2510.15265
作者: Emam Hossain,Muhammad Hasan Ferdous,Devon Dunmire,Aneesh Subramanian,Md Osman Gani
类目: Machine Learning (cs.LG)
*备注: Accepted as full paper in ICMLA 2025 (Special Session 1: Deep Learning and Applications)
Abstract:Causal modeling offers a principled foundation for uncovering stable, invariant relationships in time-series data, thereby improving robustness and generalization under distribution shifts. Yet its potential is underutilized in spatiotemporal Earth observation, where models often depend on purely correlational features that fail to transfer across heterogeneous domains. We propose RIC-TSC, a regionally-informed causal time-series classification framework that embeds lag-aware causal discovery directly into sequence modeling, enabling both predictive accuracy and scientific interpretability. Using multi-modal satellite and reanalysis data, including Sentinel-1 microwave backscatter, Sentinel-2 and Landsat-8 optical reflectance, and CARRA meteorological variables, we leverage Joint PCMCI+ (J-PCMCI+) to identify region-specific and invariant predictors of supraglacial lake evolution in Greenland. Causal graphs are estimated globally and per basin, with validated predictors and their time lags supplied to lightweight classifiers. On a balanced benchmark of 1000 manually labeled lakes from two contrasting melt seasons (2018-2019), causal models achieve up to 12.59% higher accuracy than correlation-based baselines under out-of-distribution evaluation. These results show that causal discovery is not only a means of feature selection but also a pathway to generalizable and mechanistically grounded models of dynamic Earth surface processes.
[LG-35] Spatiotemporal Transformers for Predicting Avian Disease Risk from Migration Trajectories
链接: https://arxiv.org/abs/2510.15254
作者: Dingya Feng,Dingyuan Xue
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate forecasting of avian disease outbreaks is critical for wildlife conservation and public health. This study presents a Transformer-based framework for predicting the disease risk at the terminal locations of migratory bird trajectories. We integrate multi-source datasets, including GPS tracking data from Movebank, outbreak records from the World Organisation for Animal Health (WOAH), and geospatial context from GADM and Natural Earth. The raw coordinates are processed using H3 hierarchical geospatial encoding to capture spatial patterns. The model learns spatiotemporal dependencies from bird movement sequences to estimate endpoint disease risk. Evaluation on a held-out test set demonstrates strong predictive performance, achieving an accuracy of 0.9821, area under the ROC curve (AUC) of 0.9803, average precision (AP) of 0.9299, and an F1-score of 0.8836 at the optimal threshold. These results highlight the potential of Transformer architectures to support early-warning systems for avian disease surveillance, enabling timely intervention and prevention strategies.
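摘要提到用 H3 层级地理编码处理原始坐标;下面的小段代码演示把 GPS 轨迹点离散成 H3 格子并映射为 token 序列的典型做法(假设使用 h3-py v3 的 geo_to_h3 接口,分辨率取 5 仅作演示,轨迹坐标为虚构数据,与原文的特征工程细节无关)。

```python
import h3  # pip install "h3<4",此处按 v3 接口书写

track = [(56.95, 24.10), (55.68, 12.57), (52.52, 13.40)]      # (lat, lon) 迁徙轨迹
cells = [h3.geo_to_h3(lat, lon, 5) for lat, lon in track]     # 分辨率 5 的六边形格子 ID
vocab = {c: i for i, c in enumerate(sorted(set(cells)))}      # 格子 ID -> 词表索引
token_ids = [vocab[c] for c in cells]                         # 作为 Transformer 的离散输入
```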
[LG-36] Dual-Weighted Reinforcement Learning for Generative Preference Modeling
链接: https://arxiv.org/abs/2510.15242
作者: Shengyu Feng,Yun He,Shuang Ma,Beibin Li,Yuanhao Xiong,Vincent Li,Karishma Mandyam,Julian Katz-Samuels,Shengjie Bi,Licheng Yu,Hejia Zhang,Karthik Abinav Sankararaman,Han Fang,Riham Mansour,Yiming Yang,Manaal Faruqui
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning (RL) has recently proven effective at scaling chain-of-thought (CoT) reasoning in large language models on tasks with verifiable answers. However, extending RL to more general non-verifiable tasks, typically in the format of human preference pairs, remains both challenging and underexplored. In this work, we propose Dual-Weighted Reinforcement Learning (DWRL), a new framework for preference modeling that integrates CoT reasoning with the Bradley-Terry (BT) model via a dual-weighted RL objective that preserves preference-modeling inductive bias. DWRL approximates the maximum-likelihood objective of the BT model with two complementary weights: an instance-wise misalignment weight, which emphasizes under-trained pairs misaligned with human preference, and a group-wise (self-normalized) conditional preference score, which promotes promising thoughts. In this paper, we apply DWRL to preference modeling by training generative preference models (GPMs) to first generate a thought and then predict the human preference score. Across multiple benchmarks and model scales (Llama3 and Qwen2.5), DWRL consistently outperforms both GPM baselines and scalar models, while producing coherent, interpretable thoughts. In summary, our results position DWRL as a general framework for reasoning-enhanced preference learning beyond verifiable tasks.
[LG-37] HOB: A Holistically Optimized Bidding Strategy under Heterogeneous Auction Mechanisms with Organic Traffic
链接: https://arxiv.org/abs/2510.15238
作者: Qi Li,Wendong Huang,Qichen Ye,Wutong Xu,Cheems Wang,Rongquan Bai,Wei Yuan,Guan Wang,Chuan Yu,Jian Xu
类目: Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
Abstract:The E-commerce advertising platforms typically sell commercial traffic through either second-price auction (SPA) or first-price auction (FPA). SPA was historically prevalent due to its dominant strategy incentive-compatible (DSIC) for bidders with quasi-linear utilities, especially when budgets are not a binding constraint, while FPA has gained more prominence for offering higher revenue potential to publishers and avoiding the possibility for discriminatory treatment in personalized reserve prices. Meanwhile, on the demand side, advertisers are increasingly adopting platform-wide marketing solutions akin to QuanZhanTui, shifting from spending budgets solely on commercial traffic to bidding on the entire traffic for the purpose of maximizing overall sales. For automated bidding systems, such a trend poses a critical challenge: determining optimal strategies across heterogeneous auction channels to fulfill diverse advertiser objectives, such as maximizing return (MaxReturn) or meeting target return on ad spend (TargetROAS). To overcome this challenge, this work makes two key contributions. First, we derive an efficient solution for optimal bidding under FPA channels, which takes into account the presence of organic traffic - traffic can be won for free. Second, we introduce a marginal cost alignment (MCA) strategy that provably secures bidding efficiency across heterogeneous auction mechanisms. To validate performance of our developed framework, we conduct comprehensive offline experiments on public datasets and large-scale online A/B testing, which demonstrate consistent improvements over existing methods.
[LG-38] Stress-Aware Learning under KL Drift via Trust-Decayed Mirror Descent
链接: https://arxiv.org/abs/2510.15222
作者: Gabriel Nixon Raj
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:We study sequential decision-making under distribution drift. We propose entropy-regularized trust-decay, which injects stress-aware exponential tilting into both belief updates and mirror-descent decisions. On the simplex, a Fenchel-dual equivalence shows that belief tilt and decision tilt coincide. We formalize robustness via fragility (worst-case excess risk in a KL ball), belief bandwidth (radius sustaining a target excess), and a decision-space Fragility Index (drift tolerated at O(\sqrt{T}) regret). We prove high-probability sensitivity bounds and establish dynamic-regret guarantees of \tilde{O}(\sqrt{T}) under the KL-drift path length S_T = \sum_{t \ge 2} \sqrt{\mathrm{KL}(D_t \| D_{t-1})/2}. In particular, trust-decay achieves O(1) per-switch regret, while stress-free updates incur \Omega(1) tails. A parameter-free hedge adapts the tilt to unknown drift, whereas persistent over-tilting yields an \Omega(\lambda^2 T) stationary penalty. We further obtain calibrated-stress bounds and extensions to second-order updates, bandit feedback, outliers, stress variation, distributed optimization, and plug-in KL-drift estimation. The framework unifies dynamic-regret analysis, distributionally robust objectives, and KL-regularized control within a single stress-adaptive update.
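为帮助理解“熵正则 + 指数倾斜”的更新形态,下面给出单纯形上信任衰减式镜像下降的一个极简示意:把“倾斜”具体化为对旧信念取 (1-λ) 次幂(即向均匀分布回拉、遗忘过时信念)再做指数权重更新。这只是一种便于阅读的猜测性实现,并非论文的精确公式。

```python
import numpy as np

def trust_decay_step(p, loss_vec, eta=0.1, lam=0.05):
    """p: 单纯形上的当前决策;eta: 学习率;lam: 信任衰减强度(应对分布漂移)。"""
    logits = (1 - lam) * np.log(p + 1e-12) - eta * loss_vec  # 衰减旧信念 + 指数倾斜
    w = np.exp(logits - logits.max())                        # 数值稳定的归一化
    return w / w.sum()

rng = np.random.default_rng(0)
p = np.ones(4) / 4
for t in range(100):
    loss_vec = rng.random(4)          # 漂移环境下的损失反馈(演示数据)
    p = trust_decay_step(p, loss_vec)
```

lam=0 时退化为标准的指数权重更新;lam 越大,对历史遗忘越快、对漂移越敏感,但稳态代价越高,这与摘要中“持续过度倾斜带来 \Omega(\lambda^2 T) 稳态惩罚”的权衡一致。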
[LG-39] Integrating Product Coefficients for Improved 3D LiDAR Data Classification (Part II)
链接: https://arxiv.org/abs/2510.15219
作者: Patricia Medina,Rasika Karkare
类目: Machine Learning (cs.LG)
*备注: 16 pages, 6 figures, 5 tables
Abstract:This work extends our previous study on enhancing 3D LiDAR point-cloud classification with product coefficients \cite{medina2025integratingproductcoefficientsimproved}, measure-theoretic descriptors that complement the original spatial LiDAR features. Here, we show that combining product coefficients with an autoencoder representation and a KNN classifier delivers consistent performance gains over both PCA-based baselines and our earlier framework. We also investigate the effect of adding product coefficients level by level, revealing a clear trend: richer sets of coefficients systematically improve class separability and overall accuracy. The results highlight the value of combining hierarchical product-coefficient features with autoencoders to push LiDAR classification performance further.
[LG-40] Machine Learning for Early Detection of Meningitis: Stacked Ensemble Learning with EHR data
链接: https://arxiv.org/abs/2510.15218
作者: Han Ouyang,Jesse Hamilton,Saeed Amal
类目: Machine Learning (cs.LG)
*备注:
Abstract:We utilized a cohort of 214 meningitis patients and 46,303 non-meningitis patients from the MIMIC-III database. After extensive data preprocessing, which included ICD-based cohort selection, one-hot encoding of diagnosis codes, and a two-stage feature selection process (for both the training and testing sets), clinically relevant features such as gender and high-risk ICD codes (including subarachnoid hemorrhage, secondary malignant neoplasm of the brain, and generalized epilepsy) are selected. Overall, these clinically reasonable and temporally adherent features provided excellent modeling performance. Three models (Random Forest, LightGBM, and Deep Neural Networks (DNN)) are trained as base models for ensemble learning. Base model outputs are aggregated and stacked into a meta model (Logistic Regression) that uses the base model outputs as its training inputs. Ultimately, strong final outputs (AUC of Testing Set 1: 0.9637, AUC of Testing Set 2: 0.9472) are obtained through ensemble learning. We created a challenging condition for diagnosing meningitis, simulating a real-world ER (Emergency Room) scenario to enhance applicability in real-world clinical use. While directly deploying a diagnostic tool that clinicians can use is challenging, this paper paves the way for a potential future AI-driven diagnostic approach for meningitis using ensemble learning.
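摘要描述的堆叠结构(RF、LightGBM、DNN 作基模型,逻辑回归作元模型并以基模型输出为输入)可以用 sklearn 的 StackingClassifier 直接搭出骨架。下面以 HistGradientBoostingClassifier 与 MLPClassifier 替代原文的 LightGBM 和自定义 DNN,所有超参均为演示假设:

```python
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              HistGradientBoostingClassifier)
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, class_weight="balanced")),
        ("gbm", HistGradientBoostingClassifier()),           # 替代 LightGBM
        ("dnn", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),       # 元模型
    stack_method="predict_proba",                            # 以基模型概率输出作元特征
    cv=5,                                                    # 交叉验证防止元特征泄漏
)
# stack.fit(X_train, y_train); stack.predict_proba(X_test)  # X_train 等为占位名
```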
[LG-41] Reflections from Research Roundtables at the Conference on Health Inference and Learning (CHIL) 2025
链接: https://arxiv.org/abs/2510.15217
作者: Emily Alsentzer,Marie-Laure Charpignon,Bill Chen,Niharika D’Souza,Jason Fries,Yixing Jiang,Aparajita Kashyap,Chanwoo Kim,Simon Lee,Aishwarya Mandyam,Ashery Christopher Mbilinyi,Nikita Mehandru,Nitish Nagesh,Brighton Nuwagira,Emma Pierson,Arvind Pillai,Akane Sano,Tanveer Syeda-Mahmood,Shashank Yadav,Elias Adhanom,Muhammad Umar Afza,Amelia Archer,Suhana Bedi,Vasiliki Bikia,Trenton Chang,George H. Chen,Winston Chen,Erica Chiang,Edward Choi,Octavia Ciora,Paz Dozie-Nnamah,Shaza Elsharief,Matthew Engelhard,Ali Eshragh,Jean Feng,Josh Fessel,Scott Fleming,Kei Sen Fong,Thomas Frost,Soham Gadgil,Judy Gichoya,Leeor Hershkovich,Sujeong Im,Bhavya Jain,Vincent Jeanselme,Furong Jia,Qixuan(Alice)Jin,Yuxuan Jin,Daniel Kapash,Geetika Kapoor,Behdokht Kiafar,Matthias Kleiner,Stefan Kraft,Annika Kumar,Daeun Kyung,Zhongyuan Liang,Joanna Lin,Qianchu(Flora)Liu,Chang Liu,Hongzhou Luan,Chris Lunt,Leopoldo Julían Lechuga López,Matthew B. A. McDermott,Shahriar Noroozizadeh,Connor O’Brien,YongKyung Oh,Mixail Ota,Stephen Pfohl,Meagan Pi,Tanmoy Sarkar Pias,Emma Rocheteau,Avishaan Sethi,Toru Shirakawa,Anita Silver,Neha Simha,Kamile Stankeviciute,Max Sunog,Peter Szolovits,Shengpu Tang,Jialu Tang,Aaron Tierney,John Valdovinos,Byron Wallace,Will Ke Wang,Peter Washington,Jeremy Weiss,Daniel Wolfe,Emily Wong,Hye Sun Yun,Xiaoman Zhang,Xiao Yu Cindy Zhang,Hayoung Jeong,Kaveri A. Thakoor
类目: Machine Learning (cs.LG)
*备注:
Abstract:The 6th Annual Conference on Health, Inference, and Learning (CHIL 2025), hosted by the Association for Health Learning and Inference (AHLI), was held in person on June 25-27, 2025, at the University of California, Berkeley, in Berkeley, California, USA. As part of this year’s program, we hosted Research Roundtables to catalyze collaborative, small-group dialogue around critical, timely topics at the intersection of machine learning and healthcare. Each roundtable was moderated by a team of senior and junior chairs who fostered open exchange, intellectual curiosity, and inclusive engagement. The sessions emphasized rigorous discussion of key challenges, exploration of emerging opportunities, and collective ideation toward actionable directions in the field. In total, eight roundtables were held by 19 roundtable chairs on topics of “Explainability, Interpretability, and Transparency,” “Uncertainty, Bias, and Fairness,” “Causality,” “Domain Adaptation,” “Foundation Models,” “Learning from Small Medical Data,” “Multimodal Methods,” and “Scalable, Translational Healthcare Solutions.”
[LG-42] How to Sell High-Dimensional Data Optimally
链接: https://arxiv.org/abs/2510.15214
作者: Andrew Li,R. Ravi,Karan Singh,Zihong Yi,Weizhong Zhang
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Theoretical Economics (econ.TH)
*备注:
Abstract:Motivated by the problem of selling large, proprietary data, we consider an information pricing problem proposed by Bergemann et al. that involves a decision-making buyer and a monopolistic seller. The seller has access to the underlying state of the world that determines the utility of the various actions the buyer may take. Since the buyer gains greater utility through better decisions resulting from more accurate assessments of the state, the seller can therefore promise the buyer supplemental information at a price. To contend with the fact that the seller may not be perfectly informed about the buyer’s private preferences (or utility), we frame the problem of designing a data product as one where the seller designs a revenue-maximizing menu of statistical experiments. Prior work by Cai et al. showed that an optimal menu can be found in time polynomial in the state space, whereas we observe that the state space is naturally exponential in the dimension of the data. We propose an algorithm which, given only sampling access to the state space, provably generates a near-optimal menu with a number of samples independent of the state space. We then analyze a special case of high-dimensional Gaussian data, showing that (a) it suffices to consider scalar Gaussian experiments, (b) the optimal menu of such experiments can be found efficiently via a semidefinite program, and (c) full surplus extraction occurs if and only if a natural separation condition holds on the set of potential preferences of the buyer.
[LG-43] OCR-APT: Reconstructing APT Stories from Audit Logs using Subgraph Anomaly Detection and LLMs
链接: https://arxiv.org/abs/2510.15188
作者: Ahmed Aly(1),Essam Mansour(1),Amr Youssef(1) ((1) Concordia University)
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Advanced Persistent Threats (APTs) are stealthy cyberattacks that often evade detection in system-level audit logs. Provenance graphs model these logs as connected entities and events, revealing relationships that are missed by linear log representations. Existing systems apply anomaly detection to these graphs but often suffer from high false positive rates and coarse-grained alerts. Their reliance on node attributes like file paths or IPs leads to spurious correlations, reducing detection robustness and reliability. To fully understand an attack’s progression and impact, security analysts need systems that can generate accurate, human-like narratives of the entire attack. To address these challenges, we introduce OCR-APT, a system for APT detection and reconstruction of human-like attack stories. OCR-APT uses Graph Neural Networks (GNNs) for subgraph anomaly detection, learning behavior patterns around nodes rather than fragile attributes such as file paths or IPs. This approach leads to a more robust anomaly detection. It then iterates over detected subgraphs using Large Language Models (LLMs) to reconstruct multi-stage attack stories. Each stage is validated before proceeding, reducing hallucinations and ensuring an interpretable final report. Our evaluations on the DARPA TC3, OpTC, and NODLINK datasets show that OCR-APT outperforms state-of-the-art systems in both detection accuracy and alert interpretability. Moreover, OCR-APT reconstructs human-like reports that comprehensively capture the attack story.
[LG-44] An Advanced Two-Stage Model with High Sensitivity and Generalizability for Prediction of Hip Fracture Risk Using Multiple Datasets
链接: https://arxiv.org/abs/2510.15179
作者: Shuo Sun,Meiling Zhou,Chen Zhao,Joyce H. Keyak,Nancy E. Lane,Jeffrey D. Deng,Kuan-Jui Su,Hui Shen,Hong-Wen Deng,Kui Zhang,Weihua Zhou
类目: Machine Learning (cs.LG); Medical Physics (physics.med-ph)
*备注: 38 pages, 3 figures, 8 tables. This is a preprint version of the manuscript titled “An Advanced Two-Stage Model with High Sensitivity and Generalizability for Prediction of Hip Fracture Risk Using Multiple Datasets.” The paper is currently under journal submission
Abstract:Hip fractures are a major cause of disability, mortality, and healthcare burden in older adults, underscoring the need for early risk assessment. However, commonly used tools such as the DXA T-score and FRAX often lack sensitivity and miss individuals at high risk, particularly those without prior fractures or with osteopenia. To address this limitation, we propose a sequential two-stage model that integrates clinical and imaging information to improve prediction accuracy. Using data from the Osteoporotic Fractures in Men Study (MrOS), the Study of Osteoporotic Fractures (SOF), and the UK Biobank, Stage 1 (Screening) employs clinical, demographic, and functional variables to estimate baseline risk, while Stage 2 (Imaging) incorporates DXA-derived features for refinement. The model was rigorously validated through internal and external testing, showing consistent performance and adaptability across cohorts. Compared to T-score and FRAX, the two-stage framework achieved higher sensitivity and reduced missed cases, offering a cost-effective and personalized approach for early hip fracture risk assessment. Keywords: Hip Fracture, Two-Stage Model, Risk Prediction, Sensitivity, DXA, FRAX
[LG-45] Finding geodesics with the Deep Ritz method
链接: https://arxiv.org/abs/2510.15177
作者: Conor Rowan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Geodesic problems involve computing trajectories between prescribed initial and final states to minimize a user-defined measure of distance, cost, or energy. They arise throughout physics and engineering – for instance, in determining optimal paths through complex environments, modeling light propagation in refractive media, and the study of spacetime trajectories in control theory and general relativity. Despite their ubiquity, the scientific machine learning (SciML) community has given relatively little attention to investigating its methods in the context of these problems. In this work, we argue that given their simple geometry, variational structure, and natural nonlinearity, geodesic problems are particularly well-suited for the Deep Ritz method. We substantiate this claim with three numerical examples drawn from path planning, optics, and solid mechanics. Our goal is not to provide an exhaustive study of geodesic problems, but rather to identify a promising application of the Deep Ritz method and a fruitful direction for future SciML research.
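Deep Ritz 求测地线的要点是:用神经网络参数化一条硬性满足边界条件的路径,对离散化(蒙特卡洛采样)的能量泛函直接做随机梯度下降。下面是欧氏度量下两定点间路径的极简 PyTorch 示意(端点、网络宽度等均为演示假设);此时最优路径应收敛到直线,可作正确性检查。

```python
import torch
import torch.nn as nn

x0, x1 = torch.tensor([0., 0.]), torch.tensor([1., 1.])      # 两个端点(演示假设)
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 2))

def path(t):
    """t in [0,1];形如 (1-t)x0 + t*x1 + t(1-t)net(t),端点条件自动满足。"""
    return (1 - t) * x0 + t * x1 + t * (1 - t) * net(t)

opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for step in range(500):
    t = torch.rand(256, 1, requires_grad=True)                # 蒙特卡洛采样参数 t
    x = path(t)
    v = torch.stack([torch.autograd.grad(x[:, i].sum(), t, create_graph=True)[0]
                     for i in range(2)], dim=-1).squeeze(1)   # 速度 dx/dt
    energy = (v ** 2).sum(dim=-1).mean()                      # 能量泛函(欧氏度量)
    opt.zero_grad(); energy.backward(); opt.step()
```

推广到非欧情形只需把 energy 换成随位置变化的度量加权的二次型,即可覆盖摘要提到的折射介质中光传播等场景。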
[LG-46] A simple mean field model of feature learning
链接: https://arxiv.org/abs/2510.15174
作者: Niclas Göring,Chris Mingard,Yoonsoo Nam,Ard Louis
类目: Machine Learning (cs.LG)
*备注:
Abstract:Feature learning (FL), where neural networks adapt their internal representations during training, remains poorly understood. Using methods from statistical physics, we derive a tractable, self-consistent mean-field (MF) theory for the Bayesian posterior of two-layer non-linear networks trained with stochastic gradient Langevin dynamics (SGLD). At infinite width, this theory reduces to kernel ridge regression, but at finite width it predicts a symmetry breaking phase transition where networks abruptly align with target functions. While the basic MF theory provides theoretical insight into the emergence of FL in the finite-width regime, semi-quantitatively predicting the onset of FL with noise or sample size, it substantially underestimates the improvements in generalisation after the transition. We trace this discrepancy to a key mechanism absent from the plain MF description: \textit{self-reinforcing input feature selection}. Incorporating this mechanism into the MF theory allows us to quantitatively match the learning curves of SGLD-trained networks and provides mechanistic insight into FL.
[LG-47] Policy Transfer Ensures Fast Learning for Continuous-Time LQR with Entropy Regularization
链接: https://arxiv.org/abs/2510.15165
作者: Xin Guo,Zijiu Lyu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Reinforcement Learning (RL) enables agents to learn optimal decision-making strategies through interaction with an environment, yet training from scratch on complex tasks can be highly inefficient. Transfer learning (TL), widely successful in large language models (LLMs), offers a promising direction for enhancing RL efficiency by leveraging pre-trained models. This paper investigates policy transfer, a TL approach that initializes learning in a target RL task using a policy from a related source task, in the context of continuous-time linear quadratic regulators (LQRs) with entropy regularization. We provide the first theoretical proof of policy transfer for continuous-time RL, proving that a policy optimal for one LQR serves as a near-optimal initialization for closely related LQRs, while preserving the original algorithm’s convergence rate. Furthermore, we introduce a novel policy learning algorithm for continuous-time LQRs that achieves global linear and local super-linear convergence. Our results demonstrate both theoretical guarantees and algorithmic benefits of transfer learning in continuous-time RL, addressing a gap in existing literature and extending prior work from discrete to continuous time settings. As a byproduct of our analysis, we derive the stability of a class of continuous-time score-based diffusion models via their connection with LQRs.
[LG-48] Predicting the Unpredictable: Reproducible BiLSTM Forecasting of Incident Counts in the Global Terrorism Database (GTD)
链接: https://arxiv.org/abs/2510.15136
作者: Oluwasegun Adegoke
类目: Machine Learning (cs.LG)
*备注: 12 pages, 5 figures, 2 tables. Code reproducibility: this https URL Data/ethics: GTD used under research-only terms; no raw GTD is redistributed
Abstract:We study short-horizon forecasting of weekly terrorism incident counts using the Global Terrorism Database (GTD, 1970–2016). We build a reproducible pipeline with fixed time-based splits and evaluate a Bidirectional LSTM (BiLSTM) against strong classical anchors (seasonal-naive, linear/ARIMA) and a deep LSTM-Attention baseline. On the held-out test set, the BiLSTM attains RMSE 6.38, outperforming LSTM-Attention (9.19; +30.6%) and a linear lag-regression baseline (+35.4% RMSE gain), with parallel improvements in MAE and MAPE. Ablations varying temporal memory, training-history length, spatial grain, lookback size, and feature groups show that models trained on long historical data generalize best; a moderate lookback (20–30 weeks) provides strong context; and bidirectional encoding is critical for capturing both build-up and aftermath patterns within the window. Feature-group analysis indicates that short-horizon structure (lagged counts and rolling statistics) contributes most, with geographic and casualty features adding incremental lift. We release code, configs, and compact result tables, and provide a data/ethics statement documenting GTD licensing and research-only use. Overall, the study offers a transparent, baseline-beating reference for GTD incident forecasting.
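摘要中的 BiLSTM 属于标准配置,下面给出一个 Keras 版极简骨架:回看窗口取 26 周,落在文中推荐的 20–30 周区间;层宽等超参为演示假设,数据用泊松噪声伪造,并非 GTD 数据。

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

LOOKBACK = 26  # 回看约半年的周度计数(文中建议 20-30 周)
model = keras.Sequential([
    layers.Input(shape=(LOOKBACK, 1)),
    layers.Bidirectional(layers.LSTM(64)),   # 双向编码窗口内“上升期”与“回落期”模式
    layers.Dense(32, activation="relu"),
    layers.Dense(1),                         # 预测下一周的事件计数
])
model.compile(optimizer="adam", loss="mse")

counts = np.random.poisson(5, size=500).astype("float32")    # 伪造的周度计数序列
X = np.stack([counts[i:i + LOOKBACK]
              for i in range(len(counts) - LOOKBACK)])[..., None]
y = counts[LOOKBACK:]
# model.fit(X, y, epochs=10, validation_split=0.2)           # 实际应按时间切分训练/测试
```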
[LG-49] A Simple Method for PMF Estimation on Large Supports
链接: https://arxiv.org/abs/2510.15132
作者: Alex Shtoff
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study nonparametric estimation of a probability mass function (PMF) on a large discrete support, where the PMF is multi-modal and heavy-tailed. The core idea is to treat the empirical PMF as a signal on a line graph and apply a data-dependent low-pass filter. Concretely, we form a symmetric tri-diagonal operator, the path graph Laplacian perturbed with a diagonal matrix built from the empirical PMF, then compute the eigenvectors corresponding to the smallest few eigenvalues. Projecting the empirical PMF onto this low-dimensional subspace produces a smooth, multi-modal estimate that preserves coarse structure while suppressing noise. A light post-processing step of clipping and re-normalizing yields a valid PMF. Because we compute the eigenpairs of a symmetric tridiagonal matrix, the computation is reliable and runs in time and memory proportional to the support size times the dimension of the desired low-dimensional subspace. We also provide a practical, data-driven rule for selecting the dimension based on an orthogonal-series risk estimate, so the method “just works” with minimal tuning. On synthetic and real heavy-tailed examples, the approach preserves coarse structure while suppressing sampling noise, and compares favorably to logspline and Gaussian-KDE baselines in the intended regimes. However, it has known failure modes (e.g., abrupt discontinuities). The method is short to implement, robust across sample sizes, and suitable for automated pipelines and exploratory analysis at scale because of its reliability and speed.
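该方法的计算核心几行代码即可说明:构造“路径图拉普拉斯 + 依数据的对角扰动”的对称三对角矩阵,取最小的 k 个特征向量做投影(低通滤波),再裁剪、重归一化。下面是一个极简示意;对角扰动取 alpha * p_emp 只是假设的一种形式,选 k 的风险估计规则也从略。

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

def smooth_pmf(p_emp, k=8, alpha=1.0):
    n = len(p_emp)
    d = np.full(n, 2.0); d[0] = d[-1] = 1.0   # 路径图拉普拉斯的主对角
    d += alpha * p_emp                        # 依经验 PMF 的对角扰动(假设形式)
    e = -np.ones(n - 1)                       # 次对角
    # 对称三对角矩阵只取最小 k 个特征对,数值稳定且开销正比于 n * k
    vals, vecs = eigh_tridiagonal(d, e, select="i", select_range=(0, k - 1))
    p_hat = vecs @ (vecs.T @ p_emp)           # 投影到低维子空间 = 低通滤波
    p_hat = np.clip(p_hat, 0.0, None)
    return p_hat / p_hat.sum()                # 裁剪并重归一化成合法 PMF

counts = np.random.poisson(3, 200_000)
p_emp = np.bincount(counts) / len(counts)     # 离散支撑上的经验 PMF
p_smooth = smooth_pmf(p_emp)
```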
[LG-50] Navigating the consequences of mechanical ventilation in clinical intensive care settings through an evolutionary game-theoretic framework
链接: https://arxiv.org/abs/2510.15127
作者: David J. Albers,Tell D. Bennett,Jana de Wiljes,Bradford J. Smith,Peter D. Sottile,J.N. Stroh
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Identifying the effects of mechanical ventilation strategies and protocols in critical care requires analyzing data from heterogeneous patient-ventilator systems within the context of the clinical decision-making environment. This research develops a framework to help understand the consequences of mechanical ventilation (MV) and adjunct care decisions on patient outcome from observations of critical care patients receiving MV. Developing an understanding of and improving critical care respiratory management requires the analysis of existing secondary-use clinical data to generate hypotheses about advantageous variations and adaptations of current care. This work introduces a perspective of the joint patient-ventilator-care systems (so-called J6) to develop a scalable method for analyzing data and trajectories of these complex systems. To that end, breath behaviors are analyzed using evolutionary game theory (EGT), which generates the necessary quantitative precursors for deeper analysis through probabilistic and stochastic machinery such as reinforcement learning. This result is one step along the pathway toward MV optimization and personalization. The EGT-based process is analytically validated on synthetic data to reveal potential caveats before proceeding to real-world ICU data applications that expose complexities of the data-generating process J6. The discussion includes potential developments toward a state transition model for simulating the effects of MV decisions using empirical and game-theoretic elements.
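摘要以演化博弈论(EGT)为量化机制;其最基本的形式是复制者动态:适应度高于平均的策略在群体中占比上升。下面是离散时间复制者更新的极简示意,收益矩阵纯属演示假设,原文对 J6 系统的处理要复杂得多。

```python
import numpy as np

A = np.array([[1.0, 0.2],        # 2x2 收益矩阵:两种策略相互作用的收益(演示假设)
              [0.8, 0.6]])
x = np.array([0.5, 0.5])          # 群体中两种“行为策略”的当前占比

for t in range(200):
    f = A @ x                     # 各策略的适应度
    x = x * f / (x @ f)           # 复制者更新:高于平均适应度者占比增大
print(x)                          # 收敛到的策略分布
```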
[LG-51] PoTS: Proof-of-Training-Steps for Backdoor Detection in Large Language Models
链接: https://arxiv.org/abs/2510.15106
作者: Issam Seddik,Sami Souihi,Mohamed Tamaazousti,Sara Tucci Piergiovanni
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 10 pages, 6 figures, 1 table. Accepted for presentation at FLLM 2025 (Vienna, Nov 2025)
Abstract:As Large Language Models (LLMs) gain traction across critical domains, ensuring secure and trustworthy training processes has become a major concern. Backdoor attacks, where malicious actors inject hidden triggers into training data, are particularly insidious and difficult to detect. Existing post-training verification solutions like Proof-of-Learning are impractical for LLMs due to their requirement for full retraining, lack of robustness against stealthy manipulations, and inability to provide early detection during training. Early detection would significantly reduce computational costs. To address these limitations, we introduce Proof-of-Training Steps, a verification protocol that enables an independent auditor (Alice) to confirm that an LLM developer (Bob) has followed the declared training recipe, including data batches, architecture, and hyperparameters. By analyzing the sensitivity of the LLMs’ language modeling head (LM-Head) to input perturbations, our method can expose subtle backdoor injections or deviations in training. Even with backdoor triggers in up to 10 percent of the training data, our protocol significantly reduces the attacker’s ability to achieve a high attack success rate (ASR). Our method enables early detection of attacks at the injection step, with verification steps being 3x faster than training steps. Our results highlight the protocol’s potential to enhance the accountability and security of LLM development, especially against insider threats.
[LG-52] Online Correlation Clustering: Simultaneously Optimizing All ell_p-norms
链接: https://arxiv.org/abs/2510.15076
作者: Sami Davies,Benjamin Moseley,Heather Newman
类目: Machine Learning (cs.LG); Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS)
*备注: 66 pages
Abstract:The \ell_p-norm objectives for correlation clustering present a fundamental trade-off between minimizing total disagreements (the \ell_1-norm) and ensuring fairness to individual nodes (the \ell_\infty-norm). Surprisingly, in the offline setting it is possible to simultaneously approximate all \ell_p-norms with a single clustering. Can this powerful guarantee be achieved in an online setting? This paper provides the first affirmative answer. We present a single algorithm for the online-with-a-sample (AOS) model that, given a small constant fraction of the input as a sample, produces one clustering that is simultaneously O(\log^4 n)-competitive for all \ell_p-norms with high probability, O(\log n)-competitive for the \ell_\infty-norm with high probability, and O(1)-competitive for the \ell_1-norm in expectation. This work successfully translates the offline “all-norms” guarantee to the online world. Our setting is motivated by a new hardness result that demonstrates a fundamental separation between these objectives in the standard random-order (RO) online model. Namely, while the \ell_1-norm is trivially O(1)-approximable in the RO model, we prove that any algorithm in the RO model for the fairness-promoting \ell_\infty-norm must have a competitive ratio of at least \Omega(n^{1/3}). This highlights the necessity of a different beyond-worst-case model. We complement our algorithm with lower bounds, showing our competitive ratios for the \ell_1- and \ell_\infty-norms are nearly tight in the AOS model.
[LG-53] Physics-informed data-driven machine health monitoring for two-photon lithography
链接: https://arxiv.org/abs/2510.15075
作者: Sixian Jia,Zhiqiao Dong,Chenhui Shao
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Two-photon lithography (TPL) is a sophisticated additive manufacturing technology for creating three-dimensional (3D) micro- and nano-structures. Maintaining the health of TPL systems is critical for ensuring consistent fabrication quality. Current maintenance practices often rely on experience rather than informed monitoring of machine health, resulting in either untimely maintenance that causes machine downtime and poor-quality fabrication, or unnecessary maintenance that leads to inefficiencies and avoidable downtime. To address this gap, this paper presents three methods for accurate and timely monitoring of TPL machine health. Through integrating physics-informed data-driven predictive models for structure dimensions with statistical approaches, the proposed methods are able to handle increasingly complex scenarios featuring different levels of generalizability. A comprehensive experimental dataset that encompasses six process parameter combinations and six structure dimensions under two machine health conditions was collected to evaluate the effectiveness of the proposed approaches. Across all test scenarios, the approaches are shown to achieve high accuracies, demonstrating excellent effectiveness, robustness, and generalizability. These results represent a significant step toward condition-based maintenance for TPL systems.
[LG-54] Learn to Change the World: Multi-level Reinforcement Learning with Model-Changing Actions
链接: https://arxiv.org/abs/2510.15056
作者: Ziqing Lu,Babak Hassibi,Lifeng Lai,Weiyu Xu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning usually assumes a given or sometimes even fixed environment in which an agent seeks an optimal policy to maximize its long-term discounted reward. In contrast, we consider agents that are not limited to passive adaptations: they instead have model-changing actions that actively modify the RL model of world dynamics itself. Reconfiguring the underlying transition processes can potentially increase the agents’ rewards. Motivated by this setting, we introduce the multi-layer configurable time-varying Markov decision process (MCTVMDP). In an MCTVMDP, the lower-level MDP has a non-stationary transition function that is configurable through upper-level model-changing actions. The agent’s objective consists of two parts: Optimize the configuration policies in the upper-level MDP and optimize the primitive action policies in the lower-level MDP to jointly improve its expected long-term reward.
[LG-55] IQNN-CS: Interpretable Quantum Neural Network for Credit Scoring
链接: https://arxiv.org/abs/2510.15044
作者: Abdul Samad Khan,Nouhaila Innan,Aeysha Khalique,Muhammad Shafique
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: Accepted for oral presentation at QUEST-IS’25. To appear in Springer proceedings
Abstract:Credit scoring is a high-stakes task in financial services, where model decisions directly impact individuals’ access to credit and are subject to strict regulatory scrutiny. While Quantum Machine Learning (QML) offers new computational capabilities, its black-box nature poses challenges for adoption in domains that demand transparency and trust. In this work, we present IQNN-CS, an interpretable quantum neural network framework designed for multiclass credit risk classification. The architecture combines a variational QNN with a suite of post-hoc explanation techniques tailored for structured data. To address the lack of structured interpretability in QML, we introduce Inter-Class Attribution Alignment (ICAA), a novel metric that quantifies attribution divergence across predicted classes, revealing how the model distinguishes between credit risk categories. Evaluated on two real-world credit datasets, IQNN-CS demonstrates stable training dynamics, competitive predictive performance, and enhanced interpretability. Our results highlight a practical path toward transparent and accountable QML models for financial decision-making.
[LG-56] AlignFlow: Improving Flow-based Generative Models with Semi-Discrete Optimal Transport
链接: https://arxiv.org/abs/2510.15038
作者: Lingkai Kong,Molei Tao,Yang Liu,Bryan Wang,Jinmiao Fu,Chien-Chih Wang,Huidong Liu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Submitted for peer review on Sep 24, 2025. Note: chairs and reviewers can see and bid on our submission since Sep 28, 2025
Abstract:Flow-based Generative Models (FGMs) effectively transform noise into complex data distributions. Incorporating Optimal Transport (OT) to couple noise and data during FGM training has been shown to improve the straightness of flow trajectories, enabling more effective inference. However, existing OT-based methods estimate the OT plan using (mini-)batches of sampled noise and data points, which limits their scalability to large and high-dimensional datasets in FGMs. This paper introduces AlignFlow, a novel approach that leverages Semi-Discrete Optimal Transport (SDOT) to enhance the training of FGMs by establishing an explicit, optimal alignment between noise distribution and data points with guaranteed convergence. SDOT computes a transport map by partitioning the noise space into Laguerre cells, each mapped to a corresponding data point. During FGM training, i.i.d. noise samples are paired with data points via the SDOT map. AlignFlow scales well to large datasets and model architectures with negligible computational overhead. Experimental results show that AlignFlow improves the performance of a wide range of state-of-the-art FGM algorithms and can be integrated as a plug-and-play component. Code is available at: this https URL.
[LG-57] ES-C51: Expected Sarsa Based C51 Distributional Reinforcement Learning Algorithm
链接: https://arxiv.org/abs/2510.15006
作者: Rijul Tandon,Peter Vamplew,Cameron Foale
类目: Machine Learning (cs.LG)
*备注:
Abstract:In most value-based reinforcement learning (RL) algorithms, the agent estimates only the expected reward for each action and selects the action with the highest reward. In contrast, Distributional Reinforcement Learning (DRL) estimates the entire probability distribution of possible rewards, providing richer information about uncertainty and variability. C51 is a popular DRL algorithm for discrete action spaces. It uses a Q-learning approach, where the distribution is learned using a greedy Bellman update. However, this can cause problems if multiple actions at a state have similar expected rewards but different distributions, as the algorithm may not learn a stable distribution. This study presents a modified version of C51 (ES-C51) that replaces the greedy Q-learning update with an Expected Sarsa update, which uses a softmax calculation to combine information from all possible actions at a state rather than relying on a single best action. This reduces instability when actions have similar expected rewards and allows the agent to learn higher-performing policies. This approach is evaluated on classic control environments from Gym and on Atari-10 games. For a fair comparison, we modify the standard C51’s exploration strategy from ε-greedy to softmax, which we refer to as QL-C51 (Q-Learning based C51). The results demonstrate that ES-C51 outperforms QL-C51 across many environments.
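ES-C51 的关键改动在目标分布的构造:不再贪心地只取下一状态最优动作的分布,而是用 softmax 策略对所有动作的分布加权混合。下面是这一混合步骤的极简示意(温度 tau 为假设超参;分布 Bellman 算子中按奖励平移、折扣缩放并投影回原子支撑的步骤从略):

```python
import numpy as np

def expected_sarsa_target(next_dists, support, tau=0.5):
    """next_dists: (A, n_atoms) 下一状态各动作的回报分布;support: (n_atoms,) 原子值。"""
    q = next_dists @ support                 # 各动作的期望回报
    logits = q / tau
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()                           # softmax 策略(替代 argmax)
    return pi @ next_dists                   # 按策略混合所有动作的分布

support = np.linspace(-10, 10, 51)                         # C51 的 51 个原子
next_dists = np.random.dirichlet(np.ones(51), size=4)      # 4 个动作(演示数据)
target = expected_sarsa_target(next_dists, support)        # 作为投影前的目标分布
```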
[LG-58] Extending Load Forecasting from Zonal Aggregates to Individual Nodes for Transmission System Operators
链接: https://arxiv.org/abs/2510.14983
作者: Oskar Triebe,Fletcher Passow,Simon Wittner,Leonie Wagner,Julio Arend,Tao Sun,Chad Zanocco,Marek Miltner,Arezou Ghesmati,Chen-Hao Tsai,Christoph Bergmeir,Ram Rajagopal
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Systems and Control (eess.SY)
*备注: Collaborative Research, Stanford University and Midcontinent Independent System Operator
Abstract:The reliability of local power grid infrastructure is challenged by sustainable energy developments increasing electric load uncertainty. Transmission System Operators (TSOs) need load forecasts of higher spatial resolution, extending current forecasting operations from zonal aggregates to individual nodes. However, nodal loads are less accurate to forecast and require a large number of individual forecasts, which are hard to manage for the human experts assessing risks in the control room’s daily operations (operator). In collaboration with a TSO, we design a multi-level system that meets the needs of operators for hourly day-ahead load forecasting. Utilizing a uniquely extensive dataset of zonal and nodal net loads, we experimentally evaluate our system components. First, we develop an interpretable and scalable forecasting model that allows TSOs to gradually extend zonal operations to include nodal forecasts. Second, we evaluate solutions to address the heterogeneity and volatility of nodal load, subject to a trade-off. Third, our system is manageable with a fully parallelized single-model forecasting workflow. Our results show accuracy and interpretability improvements for zonal forecasts, and substantial improvements for nodal forecasts. In practice, our multi-level forecasting system allows operators to adjust forecasts with unprecedented confidence and accuracy, and to diagnose otherwise opaque errors precisely.
[LG-59] Blackwell's Approachability for Sequential Conformal Inference
链接: https://arxiv.org/abs/2510.15824
作者: Guillaume Principato,Gilles Stoltz
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 25 pages, 0 figures
Abstract:We study conformal inference in non-exchangeable environments through the lens of Blackwell’s theory of approachability. We first recast adaptive conformal inference (ACI, Gibbs and Candès, 2021) as a repeated two-player vector-valued finite game and characterize attainable coverage–efficiency tradeoffs. We then construct coverage and efficiency objectives under potential restrictions on the adversary’s play, and design a calibration-based approachability strategy to achieve these goals. The resulting algorithm enjoys strong theoretical guarantees and provides practical insights, though its computational burden may limit deployment in practice.
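作为背景,论文所重构的 ACI(Gibbs and Candès, 2021)在线更新规则可概括为下式(标准 ACI 示意,非本文新算法):

```python
# ACI(Gibbs and Candès, 2021)的标准在线更新示意:
# err_t 在上一步预测集未覆盖真值时取 1,否则取 0;
# 工作误覆盖水平 alpha_t 被逐步校正到长期目标 target_alpha。
def aci_update(alpha_t, target_alpha, err_t, gamma=0.01):
    return alpha_t + gamma * (target_alpha - err_t)
```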
[LG-60] Error analysis of a compositional score-based algorithm for simulation-based inference
链接: https://arxiv.org/abs/2510.15817
作者: Camille Touron,Gabriel V. Cardoso,Julyan Arbel,Pedro L. C. Rodrigues
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Simulation-based inference (SBI) has become a widely used framework in applied sciences for estimating the parameters of stochastic models that best explain experimental observations. A central question in this setting is how to effectively combine multiple observations in order to improve parameter inference and obtain sharper posterior distributions. Recent advances in score-based diffusion methods address this problem by constructing a compositional score, obtained by aggregating individual posterior scores within the diffusion process. While it is natural to suspect that the accumulation of individual errors may significantly degrade sampling quality as the number of observations grows, this important theoretical issue has so far remained unexplored. In this paper, we study the compositional score produced by the GAUSS algorithm of Linhart et al. (2024) and establish an upper bound on its mean squared error in terms of both the individual score errors and the number of observations. We illustrate our theoretical findings on a Gaussian example, where all analytical expressions can be derived in a closed form.
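作为参考,此类方法中常见的因子化组合得分形如下式(示意;GAUSS 算法在此基础上还包含 Linhart et al. (2024) 提出的修正项,细节以原文为准):

$$
s_{1:n}(\theta, t) \;\approx\; (1-n)\,\nabla_\theta \log p_t(\theta) \;+\; \sum_{i=1}^{n} \nabla_\theta \log p_t(\theta \mid x_i),
$$

即用 n 个单观测后验得分之和减去 (n-1) 倍先验得分来聚合多观测信息;论文分析的正是此类聚合中单得分误差随观测数 n 的累积效应。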
[LG-61] On Universality of Deep Equivariant Networks
链接: https://arxiv.org/abs/2510.15814
作者: Marco Pacini,Mircea Petrache,Bruno Lepri,Shubhendu Trivedi,Robin Walters
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Preprint. 22 pages
Abstract:Universality results for equivariant neural networks remain rare. Those that do exist typically hold only in restrictive settings: either they rely on regular or higher-order tensor representations, leading to impractically high-dimensional hidden spaces, or they target specialized architectures, often confined to the invariant setting. This work develops a more general account. For invariant networks, we establish a universality theorem under separation constraints, showing that the addition of a fully connected readout layer secures approximation within the class of separation-constrained continuous functions. For equivariant networks, where results are even scarcer, we demonstrate that standard separability notions are inadequate and introduce the sharper criterion of entry-wise separability. We show that with sufficient depth or with the addition of appropriate readout layers, equivariant networks attain universality within the entry-wise separable regime. Together with prior results showing the failure of universality for shallow models, our findings identify depth and readout layers as a decisive mechanism for universality, additionally offering a unified perspective that subsumes and extends earlier specialized results.
[LG-62] Enhanced Renewable Energy Forecasting using Context-Aware Conformal Prediction
链接: https://arxiv.org/abs/2510.15780
作者: Alireza Moradi,Mathieu Tanneau,Reza Zandehshahvar,Pascal Van Hentenryck
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注:
Abstract:Accurate forecasting is critical for reliable power grid operations, particularly as the share of renewable generation, such as wind and solar, continues to grow. Given the inherent uncertainty and variability in renewable generation, probabilistic forecasts have become essential for informed operational decisions. However, such forecasts frequently suffer from calibration issues, potentially degrading decision-making performance. Building on recent advances in Conformal Predictions, this paper introduces a tailored calibration framework that constructs context-aware calibration sets using a novel weighting scheme. The proposed framework improves the quality of probabilistic forecasts at the site and fleet levels, as demonstrated by numerical experiments on large-scale datasets covering several systems in the United States. The results demonstrate that the proposed approach achieves higher forecast reliability and robustness for renewable energy applications compared to existing baselines.
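作为背景,加权共形校准的分位数计算可示意如下(论文的贡献在于情境感知校准集与权重的构造方式,此处仅示意通用框架):

```python
import numpy as np

# 加权共形分位数的背景示意(论文的情境感知权重方案为其核心贡献,此处不做复现)。
def weighted_conformal_quantile(scores, weights, alpha=0.1):
    """返回校准集非一致性得分的加权 (1-alpha) 分位数。"""
    order = np.argsort(scores)
    s, w = np.asarray(scores, float)[order], np.asarray(weights, float)[order]
    cdf = np.cumsum(w) / w.sum()
    idx = min(np.searchsorted(cdf, 1.0 - alpha), len(s) - 1)
    return s[idx]
```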
[LG-63] A Split-Client Approach to Second-Order Optimization
链接: https://arxiv.org/abs/2510.15714
作者: El Mahdi Chayti,Martin Jaggi
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:Second-order methods promise faster convergence but are rarely used in practice because Hessian computations and decompositions are far more expensive than gradients. We propose a split-client framework where gradients and curvature are computed asynchronously by separate clients. This abstraction captures realistic delays and inexact Hessian updates while avoiding the manual tuning required by Lazy Hessian methods. Focusing on cubic regularization, we show that our approach retains strong convergence guarantees and achieves a provable wall-clock speedup of order \sqrt{\tau}, where \tau is the relative time needed to compute and decompose the Hessian compared to a gradient step. Since \tau can be orders of magnitude larger than one in high-dimensional problems, this improvement is practically significant. Experiments on synthetic and real datasets confirm the theory: asynchronous curvature consistently outperforms vanilla and Lazy Hessian baselines, while maintaining second-order accuracy.
[LG-64] Disentanglement of Sources in a Multi-Stream Variational Autoencoder
链接: https://arxiv.org/abs/2510.15669
作者: Veranika Boukun,Jörg Lücke
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Variational autoencoders (VAEs) are a leading approach to address the problem of learning disentangled representations. Typically a single VAE is used and disentangled representations are sought in its continuous latent space. Here we explore a different approach by using discrete latents to combine VAE-representations of individual sources. The combination is done based on an explicit model for source combination, and we here use a linear combination model which is well suited, e.g., for acoustic data. We formally define such a multi-stream VAE (MS-VAE) approach, derive its inference and learning equations, and we numerically investigate its principled functionality. The MS-VAE is domain-agnostic, and we here explore its ability to separate sources into different streams using superimposed hand-written digits, and mixed acoustic sources in a speaker diarization task. We observe a clear separation of digits, and on speaker diarization we observe an especially low rate of missed speakers. Numerical experiments further highlight the flexibility of the approach across varying amounts of supervision and training data.
[LG-65] Bayesian Inference for PDE-based Inverse Problems using the Optimization of a Discrete Loss
链接: https://arxiv.org/abs/2510.15664
作者: Lucas Amoudruz,Sergey Litvinov,Costas Papadimitriou,Petros Koumoutsakos
类目: Methodology (stat.ME); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Inverse problems are crucial for many applications in science, engineering and medicine that involve data assimilation, design, and imaging. Their solution infers the parameters or latent states of a complex system from noisy data and partially observable processes. When measurements are an incomplete or indirect view of the system, additional knowledge is required to accurately solve the inverse problem. Adopting a physical model of the system in the form of partial differential equations (PDEs) is a potent method to close this gap. In particular, the method of optimizing a discrete loss (ODIL) has shown great potential in terms of robustness and computational cost. In this work, we introduce B-ODIL, a Bayesian extension of ODIL, that integrates the PDE loss of ODIL as prior knowledge and combines it with a likelihood describing the data. B-ODIL employs a Bayesian formulation of PDE-based inverse problems to infer solutions with quantified uncertainties. We demonstrate the capabilities of B-ODIL in a series of synthetic benchmarks involving PDEs in one, two, and three dimensions. We showcase the application of B-ODIL in estimating tumor concentration and its uncertainty in a patient’s brain from MRI scans using a three-dimensional tumor growth model.
[LG-66] Stochastic Optimization with Random Search
链接: https://arxiv.org/abs/2510.15610
作者: El Mahdi Chayti,Taha El Bakkali El Kadi,Omar Saadi,Martin Jaggi
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:We revisit random search for stochastic optimization, where only noisy function evaluations are available. We show that the method works under weaker smoothness assumptions than previously considered, and that stronger assumptions enable improved guarantees. In the finite-sum setting, we design a variance-reduced variant that leverages multiple samples to accelerate convergence. Our analysis relies on a simple translation invariance property, which provides a principled way to balance noise and reduce variance.
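作为示意,带噪声函数评估的随机搜索基本流程如下(玩具实现;步长 sigma 为假设值,论文的步长规则与方差缩减变体以原文分析为准):

```python
import numpy as np

# 带噪声零阶评估的随机搜索玩具示意。
def random_search(f_noisy, x0, sigma=0.1, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    x, fx = np.asarray(x0, float), f_noisy(x0)
    for _ in range(steps):
        cand = x + sigma * rng.standard_normal(x.shape)
        fc = f_noisy(cand)
        if fc < fx:          # 仅接受(带噪比较下的)改进点
            x, fx = cand, fc
    return x
```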
[LG-67] Kernel-Based Evaluation of Conditional Biological Sequence Models
链接: https://arxiv.org/abs/2510.15601
作者: Pierre Glaser,Steffanie Paul,Alissa M. Hummer,Charlotte M. Deane,Debora S. Marks,Alan N. Amin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 29 pages
Abstract:We propose a set of kernel-based tools to evaluate the designs and tune the hyperparameters of conditional sequence models, with a focus on problems in computational biology. The backbone of our tools is a new measure of discrepancy between the true conditional distribution and the model’s estimate, called the Augmented Conditional Maximum Mean Discrepancy (ACMMD). Provided that the model can be sampled from, the ACMMD can be estimated unbiasedly from data to quantify absolute model fit, integrated within hypothesis tests, and used to evaluate model reliability. We demonstrate the utility of our approach by analyzing a popular protein design model, ProteinMPNN. We are able to reject the hypothesis that ProteinMPNN fits its data for various protein families, and tune the model’s temperature hyperparameter to achieve a better fit.
[LG-68] Geometric Convergence Analysis of Variational Inference via Bregman Divergences
链接: https://arxiv.org/abs/2510.15548
作者: Sushil Bohara,Amedeo Roberto Esposito
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 14 pages, 4 figures
Abstract:Variational Inference (VI) provides a scalable framework for Bayesian inference by optimizing the Evidence Lower Bound (ELBO), but convergence analysis remains challenging due to the objective’s non-convexity and non-smoothness in Euclidean space. We establish a novel theoretical framework for analyzing VI convergence by exploiting the exponential family structure of distributions. We express negative ELBO as a Bregman divergence with respect to the log-partition function, enabling a geometric analysis of the optimization landscape. We show that this Bregman representation admits a weak monotonicity property that, while weaker than convexity, provides sufficient structure for rigorous convergence analysis. By deriving bounds on the objective function along rays in parameter space, we establish properties governed by the spectral characteristics of the Fisher information matrix. Under this geometric framework, we prove non-asymptotic convergence rates for gradient descent algorithms with both constant and diminishing step sizes.
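论文表示所依赖的是指数族的标准恒等式:对自然参数 λ、对数配分函数 A 的指数族分布,有

$$
D_{\mathrm{KL}}\!\left(q_{\lambda} \,\middle\|\, q_{\lambda'}\right) \;=\; B_A(\lambda', \lambda) \;=\; A(\lambda') - A(\lambda) - \langle \nabla A(\lambda),\, \lambda' - \lambda \rangle ,
$$

据此负 ELBO 可写成关于 A 的 Bregman 散度,从而在参数空间上对优化地形做几何化的收敛性分析。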
[LG-69] AI and analytics in sports: Leveraging BERTopic to map the past and chart the future
链接: https://arxiv.org/abs/2510.15487
作者: Manit Mishra
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: 32 pages, 5 figures, 1 table, accepted for presentation at Australia and New Zealand Marketing Academy (ANZMAC) - 2025 Conference
Abstract:Purpose: The purpose of this study is to map the body of scholarly literature at the intersection of artificial intelligence (AI), analytics and sports and thereafter, leverage the insights generated to chart guideposts for future research. Design/methodology/approach: The study carries out a systematic literature review (SLR). The Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) protocol is leveraged to identify 204 journal articles pertaining to the utilization of AI and analytics in sports published from 2002 to 2024. We follow it up with extraction of the latent topics from the sampled articles by leveraging the topic modelling technique of BERTopic. Findings: The study identifies the following as predominant areas of extant research on usage of AI and analytics in sports: performance modelling, physical and mental health, social media sentiment analysis, and tactical tracking. Each extracted topic is further examined in terms of its relative prominence, representative studies, and key term associations. Drawing on these insights, the study delineates promising avenues for future inquiry. Research limitations/implications: The study offers insights to academicians and sports administrators on the transformational impact of AI and analytics in sports. Originality/value: The study introduces BERTopic as a novel approach for extracting latent structures in sports research, thereby advancing both scholarly understanding and the methodological toolkit of the field.
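作为示意,摘要所述 BERTopic 主题提取流程大致如下(使用该库的公开 API;其中 load_abstracts 为假设的语料加载函数,204 篇文献语料为论文数据,此处不含):

```python
from bertopic import BERTopic

docs = load_abstracts()   # 假设的加载函数:返回抽样文献摘要的字符串列表
topic_model = BERTopic(language="english", min_topic_size=10)
topics, probs = topic_model.fit_transform(docs)
# 查看各主题的相对规模与代表关键词
print(topic_model.get_topic_info().head())
```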
[LG-70] Online Policy Learning via a Self-Normalized Maximal Inequality
链接: https://arxiv.org/abs/2510.15483
作者: Samuel Girard,Aurélien Bibaut,Houssam Zenati
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Adaptive experiments produce dependent data that break i.i.d. assumptions that underlie classical concentration bounds and invalidate standard learning guarantees. In this paper, we develop a self-normalized maximal inequality for martingale empirical processes. Building on this, we first propose an adaptive sample-variance penalization procedure which balances empirical loss and sample variance, valid for general dependent data. Next, this allows us to derive a new variance-regularized pessimistic off-policy learning objective, for which we establish excess-risk guarantees. Subsequently, we show that, when combined with sequential updates and under standard complexity and margin conditions, the resulting estimator achieves fast convergence rates in both parametric and nonparametric regimes, improving over the usual 1/\sqrt{n} baseline. We complement our theoretical findings with numerical simulations that illustrate the practical gains of our approach.
[LG-71] Nonlinear Dimensionality Reduction Techniques for Bayesian Optimization
链接: https://arxiv.org/abs/2510.15435
作者: Luo Long,Coralia Cartis,Paz Fink Shustin
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: 34 pages including appendixes, 8 figures. Keywords: global optimisation, dimensionality reduction techniques, Bayesian methods, Variational Autoencoders
Abstract:Bayesian optimisation (BO) is a standard approach for sample-efficient global optimisation of expensive black-box functions, yet its scalability to high dimensions remains challenging. Here, we investigate nonlinear dimensionality reduction techniques that reduce the problem to a sequence of low-dimensional Latent-Space BO (LSBO). While early LSBO methods used (linear) random projections (Wang et al., 2013), we build on Grosnit et al. (2021) and employ Variational Autoencoders (VAEs) for LSBO, focusing on deep metric loss for structured latent manifolds and on VAE retraining to adapt the encoder-decoder to newly sampled regions. We propose some changes to their implementation, originally designed for tasks such as molecule generation, and reformulate the algorithm for broader optimisation purposes. We then couple LSBO with Sequential Domain Reduction (SDR) directly in the latent space (SDR-LSBO), yielding an algorithm that narrows the latent search domains as evidence accumulates. Implemented in a GPU-accelerated BoTorch stack with Matérn-5/2 Gaussian process surrogates, our numerical results show improved optimisation quality across benchmark tasks and that structured latent manifolds improve BO performance. Additionally, we compare random embeddings and VAEs as two mechanisms for dimensionality reduction, showing that the latter outperforms the former. To the best of our knowledge, this is the first study to combine SDR with VAE-based LSBO, and our analysis clarifies design choices for metric shaping and retraining that are critical for scalable latent space BO. For reproducibility, our source code is available at this https URL.
[LG-72] Information Theory in Open-world Machine Learning: Foundations, Frameworks, and Future Direction
链接: https://arxiv.org/abs/2510.15422
作者: Lin Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Open-world Machine Learning (OWML) aims to develop intelligent systems capable of recognizing known categories, rejecting unknown samples, and continually learning from novel information. Despite significant progress in open set recognition, novelty detection, and continual learning, the field still lacks a unified theoretical foundation that can quantify uncertainty, characterize information transfer, and explain learning adaptability in dynamic, nonstationary environments. This paper presents a comprehensive review of information-theoretic approaches in open-world machine learning, emphasizing how core concepts such as entropy, mutual information, and Kullback-Leibler divergence provide a mathematical language for describing knowledge acquisition, uncertainty suppression, and risk control under open-world conditions. We synthesize recent studies into three major research axes: information-theoretic open set recognition enabling safe rejection of unknowns, information-driven novelty discovery guiding new concept formation, and information-retentive continual learning ensuring stable long-term adaptation. Furthermore, we discuss theoretical connections between information theory and provable learning frameworks, including PAC-Bayes bounds, open-space risk theory, and causal information flow, to establish a pathway toward provable and trustworthy open-world intelligence. Finally, the review identifies key open problems and future research directions, such as the quantification of information risk, development of dynamic mutual information bounds, multimodal information fusion, and integration of information theory with causal reasoning and world model learning.
[LG-73] Recursive Inference for Heterogeneous Multi-Output GP State-Space Models with Arbitrary Moment Matching
链接: https://arxiv.org/abs/2510.15390
作者: Tengjie Zheng,Jilan Mei,Di Wu,Lin Cheng,Shengping Gong
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Accurate learning of system dynamics is becoming increasingly crucial for advanced control and decision-making in engineering. However, real-world systems often exhibit multiple channels and highly nonlinear transition dynamics, challenging traditional modeling methods. To enable online learning for these systems, this paper formulates the system as Gaussian process state-space models (GPSSMs) and develops a recursive learning method. The main contributions are threefold. First, a heterogeneous multi-output kernel is designed, allowing each output dimension to adopt distinct kernel types, hyperparameters, and input variables, improving expressiveness in multi-dimensional dynamics learning. Second, an inducing-point management algorithm enhances computational efficiency through independent selection and pruning for each output dimension. Third, a unified recursive inference framework for GPSSMs is derived, supporting general moment matching approaches, including the extended Kalman filter (EKF), unscented Kalman filter (UKF), and assumed density filtering (ADF), enabling accurate learning under strong nonlinearity and significant noise. Experiments on synthetic and real-world datasets show that the proposed method matches the accuracy of SOTA offline GPSSMs with only 1/100 of the runtime, and surpasses SOTA online GPSSMs by around 70% in accuracy under heavy noise while using only 1/20 of the runtime.
[LG-74] Singularity-free dynamical invariants-based quantum control
链接: https://arxiv.org/abs/2510.15340
作者: Ritik Sareen,Akram Youssry,Alberto Peruzzo
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:State preparation is a cornerstone of quantum technologies, underpinning applications in computation, communication, and sensing. Its importance becomes even more pronounced in non-Markovian open quantum systems, where environmental memory and model uncertainties pose significant challenges to achieving high-fidelity control. Invariant-based inverse engineering provides a principled framework for synthesizing analytic control fields, yet existing parameterizations often lead to experimentally infeasible, singular pulses and are limited to simplified noise models such as those of Lindblad form. Here, we introduce a generalized invariant-based protocol for single-qubit state preparation under arbitrary noise conditions. The control proceeds in two stages: first, we construct a family of bounded pulses that achieve perfect state preparation in a closed system; second, we identify the optimal member of this family that minimizes the effect of noise. The framework accommodates both (i) characterized noise, enabling noise-aware control synthesis, and (ii) uncharacterized noise, where a noise-agnostic variant preserves robustness without requiring a master-equation description. Numerical simulations demonstrate high-fidelity state preparation across diverse targets while producing smooth, hardware-feasible control fields. This singularity-free framework extends invariant-based control to realistic open-system regimes, providing a versatile route toward robust quantum state engineering on NISQ hardware and other platforms exhibiting non-Markovian dynamics.
[LG-75] Transfer Learning for Benign Overfitting in High-Dimensional Linear Regression NEURIPS2025
链接: https://arxiv.org/abs/2510.15337
作者: Yeichan Kim,Ilmun Kim,Seyoung Park
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 42 pages, 4 figures, 2 tables, 1 algorithm; camera-ready version accepted at NeurIPS 2025 (Spotlight)
Abstract:Transfer learning is a key component of modern machine learning, enhancing the performance of target tasks by leveraging diverse data sources. Simultaneously, overparameterized models such as the minimum-\ell_2-norm interpolator (MNI) in high-dimensional linear regression have garnered significant attention for their remarkable generalization capabilities, a property known as benign overfitting. Despite their individual importance, the intersection of transfer learning and MNI remains largely unexplored. Our research bridges this gap by proposing a novel two-step Transfer MNI approach and analyzing its trade-offs. We characterize its non-asymptotic excess risk and identify conditions under which it outperforms the target-only MNI. Our analysis reveals free-lunch covariate shift regimes, where leveraging heterogeneous data yields the benefit of knowledge transfer at limited cost. To operationalize our findings, we develop a data-driven procedure to detect informative sources and introduce an ensemble method incorporating multiple informative Transfer MNIs. Finite-sample experiments demonstrate the robustness of our methods to model and data heterogeneity, confirming their advantage.
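作为背景,高维线性回归中的最小 ℓ2 范数插值器(MNI)可示意如下(论文的两步 Transfer MNI 建立在该原语之上,此处仅为说明,非论文代码):

```python
import numpy as np

# 最小 l2 范数插值器示意:当 p >= n 且 X 行满秩时,
# beta = X^T (X X^T)^{-1} y 精确插值训练数据且范数最小。
def mni(X, y):
    return X.T @ np.linalg.solve(X @ X.T, y)
```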
[LG-76] Foresighted Online Policy Optimization with Interference
链接: https://arxiv.org/abs/2510.15273
作者: Liner Xiang,Jiayi Wang,Hengrui Cai
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:
Abstract:Contextual bandits, which leverage the baseline features of sequentially arriving individuals to optimize cumulative rewards while balancing exploration and exploitation, are critical for online decision-making. Existing approaches typically assume no interference, where each individual’s action affects only their own reward. Yet, such an assumption can be violated in many practical scenarios, and the oversight of interference can lead to short-sighted policies that focus solely on maximizing the immediate outcomes for individuals, which further results in suboptimal decisions and potentially increased regret over time. To address this significant gap, we introduce the foresighted online policy with interference (FRONT) that innovatively considers the long-term impact of the current decision on subsequent decisions and rewards. The proposed FRONT method employs a sequence of exploratory and exploitative strategies to manage the intricacies of interference, ensuring robust parameter inference and regret minimization. Theoretically, we establish a tail bound for the online estimator and derive the asymptotic distribution of the parameters of interest under suitable conditions on the interference network. We further show that FRONT attains sublinear regret under two distinct definitions, capturing both the immediate and consequential impacts of decisions, and we establish these results with and without statistical inference. The effectiveness of FRONT is further demonstrated through extensive simulations and a real-world application to urban hotel profits.
[LG-77] Minimisation of Submodular Functions Using Gaussian Zeroth-Order Random Oracles
链接: https://arxiv.org/abs/2510.15257
作者: Amir Ali Farzin,Yuen-Man Pun,Philipp Braun,Tyler Summers,Iman Shames
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:We consider the minimisation problem of submodular functions and investigate the application of a zeroth-order method to this problem. The method is based on exploiting a Gaussian smoothing random oracle to estimate the smoothed function gradient. We prove the convergence of the algorithm to a global \epsilon-approximate solution in the offline case and show that the algorithm is Hannan-consistent in the online case with respect to static regret. Moreover, we show that the algorithm achieves O(\sqrt{N P_N^\ast}) dynamic regret, where N is the number of iterations and P_N^\ast is the path length. The complexity analysis and hyperparameter selection are presented for all the cases. The theoretical results are illustrated via numerical examples.
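论文所用高斯平滑随机预言机的标准两点梯度估计可示意如下(平滑参数 mu 为假设值;该估计对平滑化函数 f_mu(x) = E[f(x + mu·u)] 的梯度无偏):

```python
import numpy as np

# 高斯平滑两点零阶梯度估计示意:u ~ N(0, I) 时,
# (f(x + mu*u) - f(x)) / mu * u 是平滑函数 f_mu 梯度的无偏估计。
def gaussian_zo_grad(f, x, rng, mu=1e-2):
    u = rng.standard_normal(x.shape)
    return (f(x + mu * u) - f(x)) / mu * u
```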
[LG-78] HyperAIRI: a plug-and-play algorithm for precise hyperspectral image reconstruction in radio interferometry
链接: https://arxiv.org/abs/2510.15198
作者: Chao Tang,Arwa Dabbech,Adrian Jackson,Yves Wiaux
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
*备注: 18 pages, 10 figures, submitted to MNRAS
Abstract:The next-generation radio-interferometric (RI) telescopes require imaging algorithms capable of forming high-resolution high-dynamic-range images from large data volumes spanning wide frequency bands. Recently, AIRI, a plug-and-play (PnP) approach taking the forward-backward algorithmic structure (FB), has demonstrated state-of-the-art performance in monochromatic RI imaging by alternating a data-fidelity step with a regularisation step via learned denoisers. In this work, we introduce HyperAIRI, its hyperspectral extension, underpinned by learned hyperspectral denoisers enforcing a power-law spectral model. For each spectral channel, the HyperAIRI denoiser takes as input its current image estimate, alongside estimates of its two immediate neighbouring channels and the spectral index map, and provides as output its associated denoised image. To ensure convergence of HyperAIRI, the denoisers are trained with a Jacobian regularisation enforcing non-expansiveness. To accommodate varying dynamic ranges, we assemble a shelf of pre-trained denoisers, each tailored to a specific dynamic range. At each HyperAIRI iteration, the spectral channels of the target image cube are updated in parallel using dynamic-range-matched denoisers from the pre-trained shelf. The denoisers are also endowed with a spatial image faceting functionality, enabling scalability to varied image sizes. Additionally, we formally introduce Hyper-uSARA, a variant of the optimisation-based algorithm HyperSARA, promoting joint sparsity across spectral channels via the \ell_{2,1}-norm, also adopting FB. We evaluate HyperAIRI’s performance on simulated and real observations. We showcase its superior performance compared to its optimisation-based counterpart Hyper-uSARA, CLEAN’s hyperspectral variant in WSClean, and the monochromatic imaging algorithms AIRI and uSARA.
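AIRI/HyperAIRI 所基于的即插即用前向后向(PnP-FB)迭代骨架可示意如下(去噪器的训练、动态范围匹配与谱间耦合为论文贡献,此处仅示意通用结构):

```python
# PnP 前向后向迭代骨架示意:数据保真项梯度步 + 学习型去噪器 D 的正则步。
def pnp_forward_backward(x0, grad_data_fidelity, denoiser, step, iters=100):
    x = x0
    for _ in range(iters):
        x = denoiser(x - step * grad_data_fidelity(x))   # x <- D(x - gamma * grad f(x))
    return x
```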
[LG-79] Beyond PCA: Manifold Dimension Estimation via Local Graph Structure
链接: https://arxiv.org/abs/2510.15141
作者: Zelong Bi,Pierre Lafaye de Micheaux
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:
Abstract:Local principal component analysis (Local PCA) has proven to be an effective tool for estimating the intrinsic dimension of a manifold. More recently, curvature-adjusted PCA (CA-PCA) has improved upon this approach by explicitly accounting for the curvature of the underlying manifold, rather than assuming local flatness. Building on these insights, we propose a general framework for manifold dimension estimation that captures the manifold’s local graph structure by integrating PCA with regression-based techniques. Within this framework, we introduce two representative estimators: quadratic embedding (QE) and total least squares (TLS). Experiments on both synthetic and real-world datasets demonstrate that these methods perform competitively with, and often outperform, state-of-the-art alternatives.
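作为基线背景,单点处的局部 PCA 维数估计可示意如下(论文的 QE 与 TLS 估计器在此基线上引入基于回归的曲率修正;k 与方差阈值均为假设超参数):

```python
import numpy as np

# 局部 PCA 维数估计基线示意:取某点的 k 近邻做 PCA,
# 统计解释给定方差比例所需的主成分个数。
def local_pca_dim(X, center_idx, k=20, var_threshold=0.95):
    dists = np.linalg.norm(X - X[center_idx], axis=1)
    nbrs = X[np.argsort(dists)[1:k + 1]]        # 排除自身的 k 个最近邻
    nbrs = nbrs - nbrs.mean(axis=0)
    s = np.linalg.svd(nbrs, compute_uv=False) ** 2
    ratios = np.cumsum(s) / s.sum()
    return int(np.searchsorted(ratios, var_threshold) + 1)
```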
[LG-80] Polarization based direction of arrival estimation using a radio interferometric array
链接: https://arxiv.org/abs/2510.15116
作者: Sarod Yatawatta
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注:
Abstract:Direction of arrival (DOA) estimation is mostly performed using specialized arrays that have carefully designed receiver spacing and layouts to match the operating frequency range. In contrast, radio interferometric arrays are designed to optimally sample the Fourier space data for making high quality images of the sky. Therefore, using existing radio interferometric arrays (with arbitrary geometry and wide frequency variation) for DOA estimation is practically infeasible except by using images made by such interferometers. In this paper, we focus on low cost DOA estimation without imaging, using a subset of a radio interferometric array, using a fraction of the data collected by the full array, and, enabling early determination of DOAs. The proposed method is suitable for transient and low duty cycle source detection. Moreover, the proposed method is an ideal follow-up step to online radio frequency interference (RFI) mitigation, enabling the early estimation of the DOA of the detected RFI.
[LG-81] The Minimax Lower Bound of Kernel Stein Discrepancy Estimation
链接: https://arxiv.org/abs/2510.15058
作者: Jose Cribeiro-Ramallo,Agnideep Aich,Florian Kalinke,Ashit Baran Aich,Zoltán Szabó
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:Kernel Stein discrepancies (KSDs) have emerged as a powerful tool for quantifying goodness-of-fit over the last decade, featuring numerous successful applications. To the best of our knowledge, all existing KSD estimators with known rate achieve \sqrt{n}-convergence. In this work, we present two complementary results (with different proof strategies), establishing that the minimax lower bound of KSD estimation is n^{-1/2} and settling the optimality of these estimators. Our first result focuses on KSD estimation on \mathbb{R}^d with the Langevin-Stein operator; our explicit constant for the Gaussian kernel indicates that the difficulty of KSD estimation may increase exponentially with the dimensionality d. Our second result settles the minimax lower bound for KSD estimation on general domains.
[LG-82] The Tree-SNE Tree Exists
链接: https://arxiv.org/abs/2510.15014
作者: Jack Kendrick
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:The clustering and visualisation of high-dimensional data is a ubiquitous task in modern data science. Popular techniques include nonlinear dimensionality reduction methods like t-SNE or UMAP. These methods face the 'scale problem' of clustering: when dealing with the MNIST dataset, do we want to distinguish different digits or do we want to distinguish different ways of writing the digits? The answer is task dependent and depends on scale. We revisit an idea of Robinson and Pierce-Hoffman that exploits an underlying scaling symmetry in t-SNE to replace 2-dimensional with (2+1)-dimensional embeddings, where the additional parameter accounts for scale. This gives rise to the t-SNE tree (short: tree-SNE). We prove that the optimal embedding depends continuously on the scaling parameter for all initial conditions outside a set of measure 0: the tree-SNE tree exists. This idea conceivably extends to other attraction-repulsion methods and is illustrated on several examples.
[LG-83] Reliable data clustering with Bayesian community detection
链接: https://arxiv.org/abs/2510.15013
作者: Magnus Neuman,Jelena Smiljanić,Martin Rosvall
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Methodology (stat.ME)
*备注:
Abstract:From neuroscience and genomics to systems biology and ecology, researchers rely on clustering similarity data to uncover modular structure. Yet widely used clustering methods, such as hierarchical clustering, k-means, and WGCNA, lack principled model selection, leaving them susceptible to noise. A common workaround sparsifies a correlation matrix representation to remove noise before clustering, but this extra step introduces arbitrary thresholds that can distort the structure and lead to unreliable results. To detect reliable clusters, we capitalize on recent advances in network science to unite sparsification and clustering with principled model selection. We test two Bayesian community detection methods, the Degree-Corrected Stochastic Block Model and the Regularized Map Equation, both grounded in the Minimum Description Length principle for model selection. In synthetic data, they outperform traditional approaches, detecting planted clusters under high-noise conditions and with fewer samples. Compared to WGCNA on gene co-expression data, the Regularized Map Equation identifies more robust and functionally coherent gene modules. Our results establish Bayesian community detection as a principled and noise-resistant framework for uncovering modular structure in high-dimensional data across fields.
[LG-84] Estimand framework and intercurrent events handling for clinical trials with time-to-event outcomes
链接: https://arxiv.org/abs/2510.15000
作者: Yixin Fang,Man Jin
类目: Methodology (stat.ME); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:The ICH E9(R1) guideline presents a framework of estimand for clinical trials, proposes five strategies for handling intercurrent events (ICEs), and provides a comprehensive discussion and many real-life clinical examples for quantitative outcomes and categorical outcomes. However, in ICH E9(R1) the discussion is lacking for time-to-event (TTE) outcomes. In this paper, we discuss how to define estimands and how to handle ICEs for clinical trials with TTE outcomes. Specifically, we discuss six ICE handling strategies, including those five strategies proposed by ICH E9(R1) and a new strategy, the competing-risk strategy. Compared with ICH E9(R1), the novelty of this paper is three-fold: (1) the estimands are defined in terms of potential outcomes, (2) the methods can utilize time-dependent covariates straightforwardly, and (3) the efficient estimators are discussed accordingly.
信息检索
[IR-0] FACE: A General Framework for Mapping Collaborative Filtering Embeddings into LLM Tokens NEURIPS2025
链接: https://arxiv.org/abs/2510.15729
作者: Chao Wang,Yixin Song,Jinhui Ye,Chuan Qin,Dazhong Shen,Lingfeng Liu,Xiang Wang,Yanyong Zhang
类目: Information Retrieval (cs.IR)
*备注: Accepted by NeurIPS 2025
Abstract:Recently, large language models (LLMs) have been explored for integration with collaborative filtering (CF)-based recommendation systems, which are crucial for personalizing user experiences. However, a key challenge is that LLMs struggle to interpret the latent, non-semantic embeddings produced by CF approaches, limiting recommendation effectiveness and further applications. To address this, we propose FACE, a general interpretable framework that maps CF embeddings into pre-trained LLM tokens. Specifically, we introduce a disentangled projection module to decompose CF embeddings into concept-specific vectors, followed by a quantized autoencoder to convert continuous embeddings into LLM tokens (descriptors). Then, we design a contrastive alignment objective to ensure that the tokens align with corresponding textual signals. Hence, the model-agnostic FACE framework achieves semantic alignment without fine-tuning LLMs and enhances recommendation performance by leveraging their pre-trained capabilities. Empirical results on three real-world recommendation datasets demonstrate performance improvements in benchmark models, with interpretability studies confirming the interpretability of the descriptors. Code is available in this https URL.
[IR-1] The 3rd Place Solution of CCIR CUP 2025: A Framework for Retrieval-Augmented Generation in Multi-Turn Legal Conversation
链接: https://arxiv.org/abs/2510.15722
作者: Da Li,Zecheng Fang,Qiang Yan,Wei Huang,Xuanpu Luo
类目: Information Retrieval (cs.IR)
*备注: CCIR2025
Abstract:Retrieval-Augmented Generation has made significant progress in the field of natural language processing. By combining the advantages of information retrieval and large language models, RAG can generate relevant and contextually appropriate responses based on items retrieved from reliable sources. This technology has demonstrated outstanding performance across multiple domains, but its application in the legal field remains in its exploratory phase. In this paper, we introduce our approach for “Legal Knowledge Retrieval and Generation” in CCIR CUP 2025, which leverages large language models and information retrieval systems to generate law-grounded responses to user questions.
[IR-2] Fault Cause Identification across Manufacturing Lines through Ontology-Guided and Process-Aware FMEA Graph Learning with LLMs
链接: https://arxiv.org/abs/2510.15428
作者: Sho Okazaki,Kohei Kaminishi,Takuma Fujiu,Yusheng Wang,Jun Ota
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Fault cause identification in automated manufacturing lines is challenging due to the system’s complexity, frequent reconfigurations, and the limited reusability of existing Failure Mode and Effects Analysis (FMEA) knowledge. Although FMEA worksheets contain valuable expert insights, their reuse across heterogeneous lines is hindered by natural language variability, inconsistent terminology, and process differences. To address these limitations, this study proposes a process-aware framework that enhances FMEA reusability by combining manufacturing-domain conceptualization with graph neural network (GNN) reasoning. First, FMEA worksheets from multiple manufacturing lines are transformed into a unified knowledge graph through ontology-guided large language model (LLM) extraction, capturing domain concepts such as actions, states, components, and parameters. Second, a Relational Graph Convolutional Network (RGCN) with the process-aware scoring function learns embeddings that respect both semantic relationships and sequential process flows. Finally, link prediction is employed to infer and rank candidate fault causes consistent with the target line’s process flow. A case study on automotive pressure sensor assembly lines demonstrates that the proposed method outperforms a state-of-the-art retrieval-augmented generation (RAG) baseline (F1@20 = 0.267) and an RGCN approach (0.400), achieving the best performance (0.523) in fault cause identification. Ablation studies confirm the contributions of both LLM-driven domain conceptualization and process-aware learning. These results indicate that the proposed framework significantly improves the transferability of FMEA knowledge across heterogeneous lines, thereby supporting operators in diagnosing failures more reliably and paving the way for future domain-adaptive LLM applications in smart manufacturing.
[IR-3] Dimension Mask Layer: Optimizing Embedding Efficiency for Scalable ID-based Models
链接: https://arxiv.org/abs/2510.15308
作者: Srijan Saket,Ikuhiro Ihara,Vaibhav Sharma,Danish Kalim
类目: Information Retrieval (cs.IR)
*备注: 7 pages, 6 figures, 2 tables
Abstract:In modern recommendation systems and social media platforms like Meta, TikTok, and Instagram, large-scale ID-based features often require embedding tables that consume significant memory. Managing these embedding sizes can be challenging, leading to bulky models that are harder to deploy and maintain. In this paper, we introduce a method to automatically determine the optimal embedding size for ID features, significantly reducing the model size while maintaining performance. Our approach involves defining a custom Keras layer called the dimension mask layer, which sits directly after the embedding lookup. This layer trims the embedding vector by allowing only the first N dimensions to pass through. By doing this, we can reduce the input feature dimension by more than half with minimal or no loss in model performance metrics. This reduction helps cut down the memory footprint of the model and lowers the risk of overfitting due to multicollinearity. Through offline experiments on public datasets and an online A/B test on a real production dataset, we demonstrate that using a dimension mask layer can shrink the effective embedding dimension by 40-50%, leading to substantial improvements in memory efficiency. This method provides a scalable solution for platforms dealing with a high volume of ID features, optimizing both resource usage and model performance.
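按摘要描述,置于 embedding 查表之后的 dimension mask 层可示意如下(基于摘要理解的草图,非论文官方实现;每个特征保留维数 keep_dims 的选取方法为论文贡献):

```python
import tensorflow as tf

# 基于摘要理解的 dimension mask 层示意:仅让 embedding 向量的前 keep_dims
# 个维度通过,其余置零(保持输出形状不变,便于直接插入现有模型)。
class DimensionMask(tf.keras.layers.Layer):
    def __init__(self, keep_dims, **kwargs):
        super().__init__(**kwargs)
        self.keep_dims = keep_dims

    def call(self, embeddings):
        dim = tf.shape(embeddings)[-1]
        mask = tf.concat([
            tf.ones([self.keep_dims], embeddings.dtype),
            tf.zeros([dim - self.keep_dims], embeddings.dtype),
        ], axis=0)
        return embeddings * mask
```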
[IR-4] GRank: Towards Target-Aware and Streamlined Industrial Retrieval with a Generate-Rank Framework
链接: https://arxiv.org/abs/2510.15299
作者: Yijia Sun,Shanshan Huang,Zhiyuan Guan,Qiang Luo,Ruiming Tang,Kun Gai,Guorui Zhou
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Industrial-scale recommender systems rely on a cascade pipeline in which the retrieval stage must return a high-recall candidate set from billions of items under tight latency. Existing solutions either (i) suffer from limited expressiveness in capturing fine-grained user-item interactions, as seen in decoupled dual-tower architectures that rely on separate encoders, or generative models that lack precise target-aware matching capabilities, or (ii) build structured indices (tree, graph, quantization) whose item-centric topologies struggle to incorporate dynamic user preferences and incur prohibitive construction and maintenance costs. We present GRank, a novel structured-index-free retrieval paradigm that seamlessly unifies target-aware learning with user-centric retrieval. Our key innovations include: (1) A target-aware Generator trained to perform personalized candidate generation via GPU-accelerated MIPS, eliminating semantic drift and maintenance costs of structured indexing; (2) A lightweight but powerful Ranker that performs fine-grained, candidate-specific inference on small subsets; (3) An end-to-end multi-task learning framework that ensures semantic consistency between generation and ranking objectives. Extensive experiments on two public benchmarks and a billion-item production corpus demonstrate that GRank improves Recall@500 by over 30% and achieves 1.7× the P99 QPS of state-of-the-art tree- and graph-based retrievers. GRank has been fully deployed in production in our recommendation platform since Q2 2025, serving 400 million active users with 99.95% service availability. Online A/B tests confirm significant improvements in core engagement metrics, with Total App Usage Time increasing by 0.160% in the main app and 0.165% in the Lite version.