This post contains the latest list of papers retrieved from arXiv.org on 2025-10-27, updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the digest by email on a schedule, please leave your email address in the comments.

Note: the paper data is retrieved from arXiv.org and updated automatically around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-10-27)

A total of 474 papers were updated today, including:

  • Natural Language Processing: 63 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 126 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 90 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 170 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite

[Quick Read]: This paper addresses the lack of systematic, reproducible benchmarks for AI agents that cover realistic scientific-research use cases. Existing evaluations have five limitations: they fail to holistically measure agent capabilities on real research tasks, lack the core tooling needed for controlled comparison, do not account for confounders such as model cost and tool access, provide no standardized interfaces for rapid prototyping, and lack sufficient baseline agents for identifying genuine progress. The key to the solution is a set of rigorous evaluation principles and accompanying tooling, on top of which the authors build AstaBench, the first holistic suite covering the entire scientific-discovery process (2400+ problems across multiple disciplines), including a research environment with production-grade search tools, nine classes of science-optimized agents, and numerous baselines. The suite enables controlled, reproducible, end-to-end evaluation of agents on real research tasks, providing a reliable quantitative basis for developing science-assisting AI.

Link: https://arxiv.org/abs/2510.21652
Authors: Jonathan Bragg, Mike D’Arcy, Nishant Balepur, Dan Bareket, Bhavana Dalvi, Sergey Feldman, Dany Haddad, Jena D. Hwang, Peter Jansen, Varsha Kishore, Bodhisattwa Prasad Majumder, Aakanksha Naik, Sigal Rahamimov, Kyle Richardson, Amanpreet Singh, Harshit Surana, Aryeh Tiktinsky, Rosni Vasu, Guy Wiener, Chloe Anastasiades, Stefan Candra, Jason Dunkelberger, Dan Emery, Rob Evans, Malachi Hamada, Regan Huff, Rodney Kinney, Matt Latzke, Jaron Lochner, Ruben Lozano-Aguilera, Cecile Nguyen, Smita Rao, Amber Tanaka, Brooke Vlahos, Peter Clark, Doug Downey, Yoav Goldberg, Ashish Sabharwal, Daniel S. Weld
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:AI agents hold the potential to revolutionize scientific productivity by automating literature reviews, replicating experiments, analyzing data, and even proposing new directions of inquiry; indeed, there are now many such agents, ranging from general-purpose “deep research” systems to specialized science-specific agents, such as AI Scientist and AIGS. Rigorous evaluation of these agents is critical for progress. Yet existing benchmarks fall short on several fronts: they (1) fail to provide holistic, product-informed measures of real-world use cases such as science research; (2) lack reproducible agent tools necessary for a controlled comparison of core agentic capabilities; (3) do not account for confounding variables such as model cost and tool access; (4) do not provide standardized interfaces for quick agent prototyping and evaluation; and (5) lack comprehensive baseline agents necessary to identify true advances. In response, we define principles and tooling for more rigorously benchmarking agents. Using these, we present AstaBench, a suite that provides the first holistic measure of agentic ability to perform scientific research, comprising 2400+ problems spanning the entire scientific discovery process and multiple scientific domains, and including many problems inspired by actual user requests to deployed Asta agents. Our suite comes with the first scientific research environment with production-grade search tools that enable controlled, reproducible evaluation, better accounting for confounders. Alongside, we provide a comprehensive suite of nine science-optimized classes of Asta agents and numerous baselines. Our extensive evaluation of 57 agents across 22 agent classes reveals several interesting findings, most importantly that despite meaningful progress on certain individual aspects, AI remains far from solving the challenge of science research assistance.

[NLP-1] Few-Shot Knowledge Distillation of LLMs With Counterfactual Explanations NEURIPS2025

[Quick Read]: This paper targets the performance bottleneck of task-aware knowledge distillation in few-shot settings: conventional methods usually require large amounts of labeled data that may be unavailable or expensive in practice. The key is a new strategy, Counterfactual-explanation-infused Distillation (CoD), which systematically injects counterfactual explanations (CFEs), i.e., inputs that flip the teacher model's prediction with minimal perturbation. Because CFEs precisely trace the teacher's decision boundary, CoD lets the student fit that boundary with far fewer samples (half the samples used by baselines), and it comes with theoretical guarantees from both statistical and geometric perspectives showing that CFEs provide more informative examples, improving parameter estimation and knowledge transfer.
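To make the CFE idea concrete, below is a minimal, hypothetical sketch for a linear teacher on toy 2-D data (the paper distills LLM teachers, so everything here is illustrative rather than the paper's procedure): for a linear boundary, the smallest label-flipping perturbation has a closed form.

```python
# Minimal sketch, assuming a linear teacher: find a counterfactual
# explanation (CFE), i.e. the smallest perturbation that flips the label.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
teacher = LogisticRegression().fit(X, y)

def counterfactual(x, eps=1e-2):
    """Project x onto the boundary w.x + b = 0, then step slightly past it."""
    w, b = teacher.coef_[0], teacher.intercept_[0]
    t = (w @ x + b) / (w @ w)              # signed offset in units of w
    return x - (t + np.sign(t) * eps) * w  # crossing the boundary flips the label

x = X[0]
x_cfe = counterfactual(x)
print(teacher.predict([x]), teacher.predict([x_cfe]))  # e.g. [1] [0]
```

In CoD, such boundary-adjacent pairs would augment the few-shot distillation set so the student sees exactly where the teacher's decision flips.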

Link: https://arxiv.org/abs/2510.21631
Authors: Faisal Hamman, Pasan Dissanayake, Yanjun Fu, Sanghamitra Dutta
Affiliations: University of Maryland, College Park
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (stat.ML)
Comments: NeurIPS 2025

Abstract:Knowledge distillation is a promising approach to transfer capabilities from complex teacher models to smaller, resource-efficient student models that can be deployed easily, particularly in task-aware scenarios. However, existing methods of task-aware distillation typically require substantial quantities of data which may be unavailable or expensive to obtain in many practical scenarios. In this paper, we address this challenge by introducing a novel strategy called Counterfactual-explanation-infused Distillation CoD for few-shot task-aware knowledge distillation by systematically infusing counterfactual explanations. Counterfactual explanations (CFEs) refer to inputs that can flip the output prediction of the teacher model with minimum perturbation. Our strategy CoD leverages these CFEs to precisely map the teacher’s decision boundary with significantly fewer samples. We provide theoretical guarantees for motivating the role of CFEs in distillation, from both statistical and geometric perspectives. We mathematically show that CFEs can improve parameter estimation by providing more informative examples near the teacher’s decision boundary. We also derive geometric insights on how CFEs effectively act as knowledge probes, helping the students mimic the teacher’s decision boundaries more effectively than standard data. We perform experiments across various datasets and LLMs to show that CoD outperforms standard distillation approaches in few-shot regimes (as low as 8-512 samples). Notably, CoD only uses half of the original samples used by the baselines, paired with their corresponding CFEs and still improves performance.

[NLP-2] The Universal Landscape of Human Reasoning

[Quick Read]: This paper addresses the difficulty of describing, in a unified and quantitative way, how information accumulates and transforms during human reasoning; existing accounts (classical logic and probabilistic models) explain aspects of reasoning outputs or individual modelling, but lack an integrated characterization of general human reasoning behavior. The key is Information Flow Tracking (IF-Track), which uses large language models (LLMs) as probabilistic encoders to quantify information entropy and information gain at each reasoning step, modeling the full landscape of human reasoning behavior across tasks within a single metric space. The method provides, for the first time, fine-grained characterizations of core reasoning features, systematic error patterns, and individual differences, and offers a quantifiable bridge between artificial and human cognition.
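The per-step entropy and information-gain bookkeeping at the heart of IF-Track can be illustrated with a toy calculation; the distributions below are hypothetical stand-ins for an LLM's next-token probabilities, not the paper's actual encoder.

```python
# Toy sketch: entropy before/after a reasoning step, and the step's
# information gain measured as the entropy reduction (assumed formulation).
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

p_before = [0.25, 0.25, 0.25, 0.25]  # maximally uncertain over 4 answers
p_after  = [0.70, 0.10, 0.10, 0.10]  # the step narrowed the answer down

gain = entropy(p_before) - entropy(p_after)
print(f"H_before={entropy(p_before):.2f} bits, "
      f"H_after={entropy(p_after):.2f} bits, gain={gain:.2f} bits")
```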

Link: https://arxiv.org/abs/2510.21623
Authors: Qiguang Chen, Jinhao Liu, Libo Qin, Yimeng Zhang, Yihao Liang, Shangxu Ren, Chengyu Luan, Dengyun Peng, Hanjing Li, Jiannan Guan, Zheng Yan, Jiaqi Wang, Mengkang Hu, Yantao Du, Zhi Chen, Xie Chen, Wanxiang Che
Affiliations: Harbin Institute of Technology; Central South University; University of Illinois Urbana-Champaign; Princeton University; The Chinese University of Hong Kong; The University of Hong Kong; ByteDance Seed (China); Shanghai Jiao Tong University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Understanding how information is dynamically accumulated and transformed in human reasoning has long challenged cognitive psychology, philosophy, and artificial intelligence. Existing accounts, from classical logic to probabilistic models, illuminate aspects of output or individual modelling, but do not offer a unified, quantitative description of general human reasoning dynamics. To solve this, we introduce Information Flow Tracking (IF-Track), which uses large language models (LLMs) as a probabilistic encoder to quantify information entropy and gain at each reasoning step. Through fine-grained analyses across diverse tasks, our method is the first to successfully model the universal landscape of human reasoning behaviors within a single metric space. We show that IF-Track captures essential reasoning features, identifies systematic error patterns, and characterizes individual differences. Applied to discussion of advanced psychological theory, we first reconcile single- versus dual-process theories in IF-Track and discover the alignment of artificial and human cognition and how LLMs reshape the human reasoning process. This approach establishes a quantitative bridge between theory and measurement, offering mechanistic insights into the architecture of reasoning.

[NLP-3] DeepAgent: A General Reasoning Agent with Scalable Toolsets

[Quick Read]: This paper addresses the limited autonomy and end-to-end task completion of large models on real-world tasks that require external tool calls and long-horizon interaction; existing agent frameworks mostly follow predefined workflows and struggle with flexible, efficient multi-step decision-making and tool use. The key is DeepAgent, an end-to-end deep reasoning agent that introduces an autonomous memory folding mechanism, compressing interaction history into structured episodic, working, and tool memories to mitigate context-length explosion and reduce error accumulation. It also designs ToolPO, a reinforcement learning strategy based on LLM-simulated APIs that uses tool-call advantage attribution for fine-grained credit assignment, improving the efficiency and stability of training general-purpose tool use.
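The three-way memory layout can be sketched as a plain data structure; the `summarize` helper and field contents below are hypothetical scaffolding, since in DeepAgent the folding itself is performed by the model.

```python
# Structural sketch only: fold raw interaction history into episodic,
# working, and tool memories (summarize() stands in for an LLM compressor).
from dataclasses import dataclass

def summarize(items, limit=2):
    return " | ".join(items[-limit:])  # placeholder for an LLM summary

@dataclass
class FoldedMemory:
    episodic: str = ""  # what happened across past sub-tasks
    working: str = ""   # current goal and intermediate state
    tools: str = ""     # which tools were called, and how

    def fold(self, history):
        self.episodic = summarize([h["event"] for h in history])
        self.working = history[-1]["goal"] if history else ""
        self.tools = summarize([h["tool"] for h in history if h.get("tool")])

mem = FoldedMemory()
mem.fold([
    {"event": "searched flights", "goal": "book trip", "tool": "search_api"},
    {"event": "compared prices", "goal": "book trip", "tool": "price_api"},
])
print(mem)
```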

Link: https://arxiv.org/abs/2510.21618
Authors: Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Guanting Dong, Jiajie Jin, Yinuo Wang, Hao Wang, Yutao Zhu, Ji-Rong Wen, Yuan Lu, Zhicheng Dou
Affiliations: Renmin University of China; Xiaohongshu Inc.
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Large reasoning models have demonstrated strong problem-solving abilities, yet real-world tasks often require external tools and long-horizon interactions. Existing agent frameworks typically follow predefined workflows, which limit autonomous and global task completion. In this paper, we introduce DeepAgent, an end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution within a single, coherent reasoning process. To address the challenges of long-horizon interactions, particularly the context length explosion from multiple tool calls and the accumulation of interaction history, we introduce an autonomous memory folding mechanism that compresses past interactions into structured episodic, working, and tool memories, reducing error accumulation while preserving critical information. To teach general-purpose tool use efficiently and stably, we develop an end-to-end reinforcement learning strategy, namely ToolPO, that leverages LLM-simulated APIs and applies tool-call advantage attribution to assign fine-grained credit to the tool invocation tokens. Extensive experiments on eight benchmarks, including general tool-use tasks (ToolBench, API-Bank, TMDB, Spotify, ToolHop) and downstream applications (ALFWorld, WebShop, GAIA, HLE), demonstrate that DeepAgent consistently outperforms baselines across both labeled-tool and open-set tool retrieval scenarios. This work takes a step toward more general and capable agents for real-world applications. The code and demo are available at this https URL.

[NLP-4] RETuning: Upgrading Inference-Time Scaling for Stock Movement Prediction with Large Language Models

[Quick Read]: This paper addresses the underuse of large language models' (LLMs) reasoning ability in finance, particularly for the fundamental task of stock movement prediction. Analyzing existing reasoning responses on a three-class problem (up, hold, down) shows that LLMs tend to follow analysts' opinions rather than build systematic, independent chains of thought (CoTs), and list summaries from different sources without weighing adversarial evidence, making predictions unreliable. The key is Reflective Evidence Tuning (RETuning), a cold-start method applied before reinforcement learning: while generating CoT, the model is encouraged to dynamically construct an analytical framework from diverse information sources, organize and score evidence for price up or down based on that framework rather than on contextual viewpoints, and finally reflect so that the prediction stays consistent with the constructed framework, strengthening logical independence and robustness. The method substantially improves reasoning in complex financial scenarios, and inference-time scaling still works after 6 months or on out-of-distribution stocks.

Link: https://arxiv.org/abs/2510.21604
Authors: Xueyuan Lin, Cehao Yang, Ye Ma, Ming Li, Rongjunchen Zhang, Yang Ni, Xiaojun Wu, Chengjin Xu, Jian Guo, Hui Xiong
Affiliations: The Hong Kong University of Science and Technology (Guangzhou); IDEA Research; Hithink RoyalFlush Information Network Co., Ltd; DataArc Tech Ltd
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Recently, large language models (LLMs) have demonstrated outstanding reasoning capabilities on mathematical and coding tasks. However, their application to financial tasks-especially the most fundamental task of stock movement prediction-remains underexplored. We study a three-class classification problem (up, hold, down) and, by analyzing existing reasoning responses, observe that: (1) LLMs follow analysts’ opinions rather than exhibit a systematic, independent analytical logic (CoTs). (2) LLMs list summaries from different sources without weighing adversarial evidence, yet such counterevidence is crucial for reliable prediction. It shows that the model does not make good use of its reasoning ability to complete the task. To address this, we propose Reflective Evidence Tuning (RETuning), a cold-start method prior to reinforcement learning, to enhance prediction ability. While generating CoT, RETuning encourages dynamically constructing an analytical framework from diverse information sources, organizing and scoring evidence for price up or down based on that framework-rather than on contextual viewpoints-and finally reflecting to derive the prediction. This approach maximally aligns the model with its learned analytical framework, ensuring independent logical reasoning and reducing undue influence from context. We also build a large-scale dataset spanning all of 2024 for 5,123 A-share stocks, with long contexts (32K tokens) and over 200K samples. In addition to price and news, it incorporates analysts’ opinions, quantitative reports, fundamental data, macroeconomic indicators, and similar stocks. Experiments show that RETuning successfully unlocks the model’s reasoning ability in the financial domain. Inference-time scaling still works even after 6 months or on out-of-distribution stocks, since the models gain valuable insights about stock movement prediction.

[NLP-5] Doc-Researcher: A Unified System for Multimodal Document Parsing and Deep Research

[Quick Read]: This paper addresses a fundamental limitation of current Deep Research systems on multimodal documents: they rely only on textual web data and overlook the rich knowledge embedded in visual content such as figures, tables, and equations. The core challenges are parsing multimodal documents while preserving visual semantics, intelligent structure-preserving chunking, and adaptive cross-modal retrieval. The key is Doc-Researcher, a unified system with three core components: (i) deep multimodal parsing that builds multi-granular representations from chunk to document level while keeping layout structure and visual semantics intact; (ii) a systematic retrieval architecture supporting text-only, vision-only, and hybrid paradigms with dynamic granularity selection; and (iii) an iterative multi-agent workflow that decomposes complex queries, progressively accumulates evidence, and synthesizes answers across documents and modalities. The paper also introduces M4DocBench, the first benchmark for multimodal, multi-hop, multi-document, multi-turn deep research; experiments show 50.6% accuracy, 3.4x better than state-of-the-art baselines, demonstrating that effective document research requires not just better retrieval but deep parsing that preserves multimodal integrity.

Link: https://arxiv.org/abs/2510.21603
Authors: Kuicai Dong, Shurui Huang, Fangda Ye, Wei Han, Zhi Zhang, Dexun Li, Wenjun Li, Qu Yang, Gang Wang, Yichao Wang, Chen Zhang, Yong Liu
Affiliations: Huawei Technologies Co., Ltd.
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: preprint

Abstract:Deep Research systems have revolutionized how LLMs solve complex questions through iterative reasoning and evidence gathering. However, current systems remain fundamentally constrained to textual web data, overlooking the vast knowledge embedded in multimodal documents. Processing such documents demands sophisticated parsing to preserve visual semantics (figures, tables, charts, and equations), intelligent chunking to maintain structural coherence, and adaptive retrieval across modalities, capabilities absent in existing systems. In response, we present Doc-Researcher, a unified system that bridges this gap through three integrated components: (i) deep multimodal parsing that preserves layout structure and visual semantics while creating multi-granular representations from chunk to document level, (ii) systematic retrieval architecture supporting text-only, vision-only, and hybrid paradigms with dynamic granularity selection, and (iii) iterative multi-agent workflows that decompose complex queries, progressively accumulate evidence, and synthesize comprehensive answers across documents and modalities. To enable rigorous evaluation, we introduce M4DocBench, the first benchmark for Multi-modal, Multi-hop, Multi-document, and Multi-turn deep research. Featuring 158 expert-annotated questions with complete evidence chains across 304 documents, M4DocBench tests capabilities that existing benchmarks cannot assess. Experiments demonstrate that Doc-Researcher achieves 50.6% accuracy, 3.4x better than state-of-the-art baselines, validating that effective document research requires not just better retrieval, but fundamentally deep parsing that preserves multimodal integrity and supports iterative research. Our work establishes a new paradigm for conducting deep research on multimodal document collections.

[NLP-6] Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist

[Quick Read]: This paper addresses how transcription errors and unflagged loanwords in lexical documentation data can mislead linguistic analysis. The key is unsupervised anomaly detection that uses character-level and syllable-level phonotactic features to identify phonotactic inconsistencies in wordlists, with syllable-aware features clearly outperforming character-level baselines. This gives fieldworkers a systematic way to flag entries that need verification, improving data quality in low-resource language documentation.
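As a rough illustration of phonotactic anomaly scoring, the sketch below flags wordlist entries with unusually low character-bigram likelihood; the wordlist is made up, and the paper's syllable-aware features are richer than this character-level scorer.

```python
# Sketch: add-one-smoothed character-bigram log-likelihood as a crude
# phonotactic score; the lowest-scoring entries get flagged for review.
from collections import Counter
import math

words = ["buphar", "bwkha", "mokol", "chwng", "xyzzy"]  # hypothetical entries
pad = lambda w: f"^{w}$"
bigrams = Counter(b for w in words for b in zip(pad(w), pad(w)[1:]))
total = sum(bigrams.values())

def score(w):
    bs = list(zip(pad(w), pad(w)[1:]))
    return sum(math.log((bigrams[b] + 1) / (total + 1)) for b in bs) / len(bs)

for w in sorted(words, key=score)[:2]:  # flag the 2 most anomalous words
    print(w, round(score(w), 2))
```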

Link: https://arxiv.org/abs/2510.21584
Authors: Kellen Parker van Dam, Abishek Stephen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Submitted to The 5th Workshop on Evaluation and Comparison for NLP systems (Eval4NLP) 2025

Abstract:Lexical data collection in language documentation often contains transcription errors and undocumented borrowings that can mislead linguistic analysis. We present unsupervised anomaly detection methods to identify phonotactic inconsistencies in wordlists, applying them to a multilingual dataset of Kokborok varieties with Bangla. Using character-level and syllable-level phonotactic features, our algorithms identify potential transcription errors and borrowings. While precision and recall remain modest due to the subtle nature of these anomalies, syllable-aware features significantly outperform character-level baselines. The high-recall approach provides fieldworkers with a systematic method to flag entries requiring verification, supporting data quality improvement in low-resourced language documentation.

[NLP-7] From Polyester Girlfriends to Blind Mice: Creating the First Pragmatics Understanding Benchmarks for Slovene

[Quick Read]: This paper addresses the problem that current evaluation of large language models (LLMs) over-relies on surface-level linguistic competence while neglecting deeper pragmatic understanding, i.e., situational meaning shaped by context, linguistic norms, and cultural background. The authors build SloPragEval and SloPragMega, the first pragmatics-understanding benchmarks for Slovene, with 405 multiple-choice questions in total. The keys are: designing high-quality, culturally grounded pragmatic tasks from native data; establishing a human baseline through manual annotation to validate model performance; and systematically evaluating current LLMs on non-literal and culture-specific utterances, which reveals a significant gap between proprietary and open-source models and pushes evaluation standards toward deeper language understanding.

Link: https://arxiv.org/abs/2510.21575
Authors: Mojca Brglez, Špela Vintar
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models are demonstrating increasing capabilities, excelling at benchmarks once considered very difficult. As their capabilities grow, there is a need for more challenging evaluations that go beyond surface-level linguistic competence. Namely, language competence involves not only syntax and semantics but also pragmatics, i.e., understanding situational meaning as shaped by context as well as linguistic and cultural norms. To contribute to this line of research, we introduce SloPragEval and SloPragMega, the first pragmatics understanding benchmarks for Slovene that contain altogether 405 multiple-choice questions. We discuss the difficulties of translation, describe the campaign to establish a human baseline, and report pilot evaluations with LLMs. Our results indicate that current models have greatly improved in understanding nuanced language but may still fail to infer implied speaker meaning in non-literal utterances, especially those that are culture-specific. We also observe a significant gap between proprietary and open-source models. Finally, we argue that benchmarks targeting nuanced language understanding and knowledge of the target culture must be designed with care, preferably constructed from native data, and validated with human responses.

[NLP-8] ColorEcosystem: Powering Personalized Standardized and Trustworthy Agentic Service in massive-agent Ecosystem

[Quick Read]: This paper addresses three core challenges facing current massive-agent ecosystems: impersonal service experiences, lack of standardization, and untrustworthy behavior. The key of the proposed ColorEcosystem architecture is the interplay of three core components: the agent carrier delivers personalized service by building a digital twin from user-specific data; the agent store acts as a centralized, standardized platform for managing diverse agentic services; and the agent audit safeguards the credibility of both service providers and users by supervising developer and user activities. Through this structured design, personalization, standardization, and trustworthiness are unified at scale.

Link: https://arxiv.org/abs/2510.21566
Authors: Fangwen Wu, Zheng Wu, Jihong Wang, Yunku Chen, Ruiguang Pei, Heyuan Huang, Xin Liao, Xingyu Lou, Huarong Deng, Zhihui Fu, Weiwen Liu, Zhuosheng Zhang, Weinan Zhang, Jun Wang
Affiliations: Shanghai Jiao Tong University; OPPO
Subjects: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
Comments:

Abstract:With the rapid development of (multimodal) large language model-based agents, the landscape of agentic service management has evolved from single-agent systems to multi-agent systems, and now to massive-agent ecosystems. Current massive-agent ecosystems face growing challenges, including impersonal service experiences, a lack of standardization, and untrustworthy behavior. To address these issues, we propose ColorEcosystem, a novel blueprint designed to enable personalized, standardized, and trustworthy agentic service at scale. Concretely, ColorEcosystem consists of three key components: agent carrier, agent store, and agent audit. The agent carrier provides personalized service experiences by utilizing user-specific data and creating a digital twin, while the agent store serves as a centralized, standardized platform for managing diverse agentic services. The agent audit, based on the supervision of developer and user activities, ensures the integrity and credibility of both service providers and users. Through the analysis of challenges, transitional forms, and practical considerations, the ColorEcosystem is poised to power personalized, standardized, and trustworthy agentic service across massive-agent ecosystems. Meanwhile, we have also implemented part of ColorEcosystem’s functionality, and the relevant code is open-sourced at this https URL.

[NLP-9] Are the LLMs Capable of Maintaining at Least the Language Genus?

[Quick Read]: This paper investigates whether genealogical language structure plays a key role in shaping the notable variation in the multilingual behavior of large language models (LLMs). The key is extending the analysis of the MultiQ dataset to systematically test whether LLMs tend to switch to genealogically related languages when prompt-language fidelity degrades, and whether knowledge consistency is better preserved within than across genera. The study finds that genus-level effects do exist but their strength depends heavily on training-resource availability, and that different LLM families exhibit distinct multilingual strategies: models encode aspects of genus-level structure, yet training-data imbalance remains the primary factor determining multilingual performance.

Link: https://arxiv.org/abs/2510.21561
Authors: Sandra Mitrović, David Kletz, Ljiljana Dolamic, Fabio Rinaldi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) display notable variation in multilingual behavior, yet the role of genealogical language structure in shaping this variation remains underexplored. In this paper, we investigate whether LLMs exhibit sensitivity to linguistic genera by extending prior analyses on the MultiQ dataset. We first check if models prefer to switch to genealogically related languages when prompt language fidelity is not maintained. Next, we investigate whether knowledge consistency is better preserved within than across genera. We show that genus-level effects are present but strongly conditioned by training resource availability. We further observe distinct multilingual strategies across LLM families. Our findings suggest that LLMs encode aspects of genus-level structure, but training data imbalances remain the primary factor shaping their multilingual performance.

[NLP-10] Document Understanding Measurement and Manipulation Using Category Theory

[Quick Read]: This paper addresses the extraction and modeling of multimodal document structure, which in turn enables information measurement, content summarization, document extension (exegesis), and self-supervised improvement of large pretrained models. The key solution is grounded in category theory: a document is formalized as a category of question-answer pairs, and an orthogonalization procedure decomposes document information into non-overlapping pieces, making it measurable and enumerable. On this basis, the paper develops a novel rate-distortion analysis for evaluating summarization techniques, and an RLVR reinforcement learning framework built on categorical consistency constraints (such as composability and closure under certain operations) to improve pretrained models in a self-supervised manner.

Link: https://arxiv.org/abs/2510.21553
Authors: Jared Claypoole, Yunye Gong, Noson S. Yanofsky, Ajay Divakaran
Affiliations: SRI International; Brooklyn College
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:We apply category theory to extract multimodal document structure which leads us to develop information theoretic measures, content summarization and extension, and self-supervised improvement of large pretrained models. We first develop a mathematical representation of a document as a category of question-answer pairs. Second, we develop an orthogonalization procedure to divide the information contained in one or more documents into non-overlapping pieces. The structures extracted in the first and second steps lead us to develop methods to measure and enumerate the information contained in a document. We also build on those steps to develop new summarization techniques, as well as to develop a solution to a new problem viz. exegesis resulting in an extension of the original document. Our question-answer pair methodology enables a novel rate distortion analysis of summarization techniques. We implement our techniques using large pretrained models, and we propose a multimodal extension of our overall mathematical framework. Finally, we develop a novel self-supervised method using RLVR to improve large pretrained models using consistency constraints such as composability and closure under certain operations that stem naturally from our category theoretic framework.

[NLP-11] InterpDetect: Interpretable Signals for Detecting Hallucinations in Retrieval-Augmented Generation

[Quick Read]: This paper addresses hallucination in Retrieval-Augmented Generation (RAG) systems, i.e., outputs inconsistent with retrieved external knowledge. Existing methods typically conflate the contributions of external context and parametric knowledge, limiting detection accuracy. The study finds that RAG hallucinations arise mainly when later-layer feed-forward network (FFN) modules disproportionately inject parametric knowledge into the residual stream. The proposed mechanistic detection computes external-context scores and parametric-knowledge scores across layers and attention heads and trains regression-based classifiers to predict hallucinations. The key insight is that internal mechanistic signals serve as efficient, generalizable hallucination predictors: classifiers trained on Qwen3-0.6b signals transfer to GPT-4.1-mini responses, demonstrating the feasibility of proxy-model evaluation.
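The final classification stage can be sketched as below, assuming the per-layer, per-head external-context and parametric-knowledge scores have already been extracted from the model (random features stand in for them here).

```python
# Sketch: a regression-based hallucination classifier over mechanistic
# scores (features are simulated; real ones come from model internals).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, n_feats = 500, 64                          # e.g. layers x heads, assumed
X = rng.normal(size=(n, n_feats))             # stand-in mechanistic scores
y = (X[:, :8].mean(axis=1) > 0).astype(int)   # toy hallucination labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```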

Link: https://arxiv.org/abs/2510.21538
Authors: Likun Tan, Kuan-Wei Huang, Joy Shi, Kevin Wu
Affiliations: Pegasi AI
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) integrates external knowledge to mitigate hallucinations, yet models often generate outputs inconsistent with retrieved content. Accurate hallucination detection requires disentangling the contributions of external context and parametric knowledge, which prior methods typically conflate. We investigate the mechanisms underlying RAG hallucinations and find they arise when later-layer FFN modules disproportionately inject parametric knowledge into the residual stream. To address this, we explore a mechanistic detection approach based on external context scores and parametric knowledge scores. Using Qwen3-0.6b, we compute these scores across layers and attention heads and train regression-based classifiers to predict hallucinations. Our method is evaluated against state-of-the-art LLMs (GPT-5, GPT-4.1) and detection baselines (RAGAS, TruLens, RefChecker). Furthermore, classifiers trained on Qwen3-0.6b signals generalize to GPT-4.1-mini responses, demonstrating the potential of proxy-model evaluation. Our results highlight mechanistic signals as efficient, generalizable predictors for hallucination detection in RAG systems.

[NLP-12] Brain-tuning Improves Generalizability and Efficiency of Brain Alignment in Speech Models NEURIPS2025

[Quick Read]: This paper addresses two key problems with current brain-alignment studies of pretrained language models: existing approaches are participant-dependent and generalize poorly, and performance depends heavily on the amount of fMRI data per participant, hindering cross-participant and population-level analyses. The key is a scalable, generalizable brain-tuning method that fine-tunes pretrained speech language models to jointly predict fMRI responses from multiple participants, yielding strong individual alignment that also generalizes across participants. The method cuts the fMRI data needed to predict new participants by 5x, increases overall brain alignment by up to 50%, generalizes strongly to unseen datasets, and also improves downstream performance on semantic tasks, demonstrating a bidirectional benefit between neuroscience and AI.
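A simplified stand-in for the joint multi-participant objective is a single mapping from stimulus embeddings to the concatenated fMRI responses of several participants; the paper fine-tunes the full speech model rather than fitting a ridge head, so this only sketches the "jointly predict" idea.

```python
# Sketch: one encoder-to-fMRI mapping trained jointly on several
# participants (all data here is synthetic and purely illustrative).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
emb = rng.normal(size=(300, 128))  # speech-model embeddings per stimulus
fmri = {p: emb @ rng.normal(size=(128, 50))          # voxel responses
           + rng.normal(scale=0.1, size=(300, 50))
        for p in ["P1", "P2", "P3"]}

Y = np.concatenate([fmri[p] for p in fmri], axis=1)  # joint target
model = Ridge(alpha=1.0).fit(emb[:250], Y[:250])
print("held-out R^2:", model.score(emb[250:], Y[250:]))
```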

Link: https://arxiv.org/abs/2510.21520
Authors: Omer Moussa, Mariya Toneva
Affiliations: Max Planck Institute for Software Systems
Subjects: Computation and Language (cs.CL)
Comments: Published at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

Abstract:Pretrained language models are remarkably effective in aligning with human brain responses elicited by natural language stimuli, positioning them as promising model organisms for studying language processing in the brain. However, existing approaches for both estimating and improving this brain alignment are participant-dependent and highly affected by the amount of data available per participant, hindering both generalization to new participants and population-level analyses. In this work, we address these limitations by introducing a scalable, generalizable brain-tuning method, in which we fine-tune pretrained speech language models to jointly predict fMRI responses from multiple participants. We demonstrate that the resulting brain-tuned models exhibit strong individual brain alignment while generalizing across participants. Specifically, our method leads to 1) a 5-fold decrease in the amount of fMRI data needed to predict brain data from new participants, 2) up to a 50% increase in the overall brain alignment, and 3) strong generalization to new unseen datasets. Furthermore, this multi-participant brain-tuning additionally improves downstream performance on semantic tasks, suggesting that training using brain data from multiple participants leads to more generalizable semantic representations. Taken together, these findings demonstrate a bidirectional benefit between neuroscience and AI, helping bridge the gap between the two fields. We make our code and models publicly available at this https URL.

[NLP-13] Head Pursuit: Probing Attention Specialization in Multimodal Transformers NEURIPS2025

[Quick Read]: This paper addresses the unclear internal mechanisms of attention heads in generative language models, especially how they specialize in specific semantic or visual attributes. The key is an interpretability method from a signal-processing perspective that reinterprets probing intermediate activations with the final decoding layer, enabling principled analysis over many samples and ranking attention heads by relevance to target concepts. The study finds that editing as few as 1% of the heads selected this way can reliably suppress or enhance target concepts in the model output, providing a simple, controllable tool for understanding and editing large generative models.
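The ranking idea, reading each head's output through the final decoding layer and scoring a target concept's logit, can be sketched with toy tensors; in practice these would be cached from a transformer forward pass.

```python
# Sketch: rank attention heads by how strongly their residual-stream
# writes decode into a target concept token (all tensors are random).
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_model, vocab = 12, 64, 1000
head_out = rng.normal(size=(n_heads, d_model))  # per-head output vectors
W_U = rng.normal(size=(d_model, vocab))         # final decoding matrix
concept_token = 42                              # hypothetical concept id

scores = (head_out @ W_U)[:, concept_token]     # concept logit per head
print("heads most aligned with the concept:", np.argsort(-scores)[:3])
```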

Link: https://arxiv.org/abs/2510.21518
Authors: Lorenzo Basile, Valentino Maiorca, Diego Doimo, Francesco Locatello, Alberto Cazzaniga
Affiliations: Area Science Park; Sapienza University of Rome; Institute of Science and Technology Austria
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at NeurIPS 2025 (spotlight)

Abstract:Language and vision-language models have shown impressive performance across a wide range of tasks, but their internal mechanisms remain only partly understood. In this work, we study how individual attention heads in text-generative models specialize in specific semantic or visual attributes. Building on an established interpretability method, we reinterpret the practice of probing intermediate activations with the final decoding layer through the lens of signal processing. This lets us analyze multiple samples in a principled way and rank attention heads based on their relevance to target concepts. Our results show consistent patterns of specialization at the head level across both unimodal and multimodal transformers. Remarkably, we find that editing as few as 1% of the heads, selected using our method, can reliably suppress or enhance targeted concepts in the model output. We validate our approach on language tasks such as question answering and toxicity mitigation, as well as vision-language tasks including image classification and captioning. Our findings highlight an interpretable and controllable structure within attention layers, offering simple tools for understanding and editing large-scale generative models.

[NLP-14] Wisdom and Delusion of LLM Ensembles for Code Generation and Repair

[Quick Read]: This paper addresses both the performance ceiling of single large language models (LLMs) on software-engineering tasks such as code generation and program repair, and the lack of effective strategies for realizing the potential of multi-model ensembles. The core questions: do different coding LLMs complement each other, and how should an ensemble be designed to exceed the single-model ceiling? The key is an empirical comparison of ten LLMs from five families and three ways of ensembling them across three benchmarks, which quantifies cross-model complementarity and shows that diversity-based selection beats the traditional consensus mechanism: the latter falls into a "popularity trap" that amplifies common but incorrect outputs, whereas the diversity strategy realizes 95% of the theoretical upper bound and remains effective even in small two-model ensembles, offering a cost-efficient path to better performance.
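One plausible reading of consensus versus diversity-based selection is sketched below with toy string candidates; the Jaccard-distance rule is an assumption for illustration, as the paper's heuristics are more elaborate.

```python
# Sketch: majority vote ("popularity") vs. a diversity-driven pick over
# an ensemble's candidate solutions (toy candidates, toy distance).
from collections import Counter

candidates = ["return a+b", "return a+b",
              "return a + b if a else b", "return sum([a, b])"]

def consensus(cands):
    return Counter(cands).most_common(1)[0][0]  # can amplify common errors

def diversity_pick(cands):
    def dist(a, b):  # token-level Jaccard distance
        ta, tb = set(a.split()), set(b.split())
        return 1 - len(ta & tb) / len(ta | tb)
    # favor the candidate farthest from the popular cluster
    return max(cands, key=lambda c: sum(dist(c, o) for o in cands))

print("consensus:", consensus(candidates))
print("diversity:", diversity_pick(candidates))
```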

Link: https://arxiv.org/abs/2510.21513
Authors: Fernando Vallecillos Ruiz, Max Hort, Leon Moonen
Affiliations: Simula Research Laboratory
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Today’s pursuit of a single Large Language Model (LLM) for all software engineering tasks is resource-intensive and overlooks the potential benefits of complementarity, where different models contribute unique strengths. However, the degree to which coding LLMs complement each other and the best strategy for maximizing an ensemble’s potential are unclear, leaving practitioners without a clear path to move beyond single-model systems. To address this gap, we empirically compare ten individual LLMs from five families, and three ensembles of these LLMs across three software engineering benchmarks covering code generation and program repair. We assess the complementarity between models and the performance gap between the best individual model and the ensembles. Next, we evaluate various selection heuristics to identify correct solutions from an ensemble’s candidate pool. We find that the theoretical upperbound for an ensemble’s performance can be 83% above the best single model. Our results show that consensus-based strategies for selecting solutions fall into a “popularity trap,” amplifying common but incorrect outputs. In contrast, a diversity-based strategy realizes up to 95% of this theoretical potential, and proves effective even in small two-model ensembles, enabling a cost-efficient way to enhance performance by leveraging multiple LLMs.

[NLP-15] MRO: Enhancing Reasoning in Diffusion Language Models via Multi-Reward Optimization NEURIPS2025

[Quick Read]: This paper addresses the reasoning gap between diffusion language models (DLMs) and autoregressive large language models (LLMs), which widens as the number of denoising steps decreases. The root cause is that DLMs generate masked tokens independently during denoising and fail to capture token correlations. The key is the proposed Multi-Reward Optimization (MRO) approach, which combines test-time scaling, reject sampling, and reinforcement learning to directly optimize two kinds of token correlation, intra-sequence and inter-sequence, with multiple elaborate rewards, and introduces group-step and importance-sampling strategies to reduce reward variance and improve sampling efficiency, significantly improving reasoning while also achieving sampling speedups.

Link: https://arxiv.org/abs/2510.21473
Authors: Chenglong Wang, Yang Gan, Hang Zhou, Chi Hu, Yongyu Mu, Kai Song, Murun Yang, Bei Li, Chunliang Zhang, Tongran Liu, Jingbo Zhu, Zhengtao Yu, Tong Xiao
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted by NeurIPS 2025

Abstract:Recent advances in diffusion language models (DLMs) have presented a promising alternative to traditional autoregressive large language models (LLMs). However, DLMs still lag behind LLMs in reasoning performance, especially as the number of denoising steps decreases. Our analysis reveals that this shortcoming arises primarily from the independent generation of masked tokens across denoising steps, which fails to capture the token correlation. In this paper, we define two types of token correlation: intra-sequence correlation and inter-sequence correlation, and demonstrate that enhancing these correlations improves reasoning performance. To this end, we propose a Multi-Reward Optimization (MRO) approach, which encourages DLMs to consider the token correlation during the denoising process. More specifically, our MRO approach leverages test-time scaling, reject sampling, and reinforcement learning to directly optimize the token correlation with multiple elaborate rewards. Additionally, we introduce group step and importance sampling strategies to mitigate reward variance and enhance sampling efficiency. Through extensive experiments, we demonstrate that MRO not only improves reasoning performance but also achieves significant sampling speedups while maintaining high performance on reasoning benchmarks.

[NLP-16] SBASH: a Framework for Designing and Evaluating RAG vs. Prompt-Tuned LLM Honeypots

[Quick Read]: This paper addresses insufficient attacker engagement in honeypot systems, where the core challenge is context awareness toward new attack types, target systems, and attacker behavior. Conventional honeypots based on large language models (LLMs) improve contextual adaptability but face low response accuracy, high latency, high operational cost, and data-protection risks, especially when deployed in the cloud. The key of the proposed System-Based Attention Shell Honeypot (SBASH) framework is using lightweight, locally deployed LLMs to avoid the data-privacy issues of cloud services, and optimizing response quality via two strategies, retrieval-augmented generation (RAG) and system-prompt tuning: experiments show RAG significantly improves accuracy for untuned models, while system-prompt-tuned non-RAG models match the accuracy of untuned RAG models at slightly lower latency, striking a better balance between practicality and security.
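The similarity metrics used for realism are standard and easy to reproduce; a sketch using the python-Levenshtein and sentence-transformers packages follows (the model name is illustrative, and the paper additionally reports BertScore).

```python
# Sketch: compare a honeypot response against a real system's output with
# edit distance and SBERT cosine similarity (assumed package choices).
import Levenshtein
from sentence_transformers import SentenceTransformer, util

real = "total 8\ndrwxr-xr-x 2 root root 4096 Oct 27 12:00 logs"
fake = "total 8\ndrwxr-xr-x 2 root root 4096 Oct 27 12:01 logs"

edit_dist = Levenshtein.distance(real, fake)

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([real, fake], convert_to_tensor=True)
cos = util.cos_sim(emb[0], emb[1]).item()

print(f"Levenshtein distance: {edit_dist}, SBERT cosine: {cos:.3f}")
```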

Link: https://arxiv.org/abs/2510.21459
Authors: Adetayo Adebimpe, Helmut Neukirchen, Thomas Welsh
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: to be published in: The 3rd International Conference on Foundation and Large Language Models (FLLM2025), IEEE, 2025

Abstract:Honeypots are decoy systems used for gathering valuable threat intelligence or diverting attackers away from production systems. Maximising attacker engagement is essential to their utility. However, research has highlighted that context-awareness, such as the ability to respond to new attack types, systems and attacker agents, is necessary to increase engagement. Large Language Models (LLMs) have been shown as one approach to increase context awareness but suffer from several challenges including accuracy and timeliness of response time, high operational costs and data-protection issues due to cloud deployment. We propose the System-Based Attention Shell Honeypot (SBASH) framework which manages data-protection issues through the use of lightweight local LLMs. We investigate the use of Retrieval Augmented Generation (RAG) supported LLMs and non-RAG LLMs for Linux shell commands and evaluate them using several different metrics such as response time differences, realism from human testers, and similarity to a real system calculated with Levenshtein distance, SBert, and BertScore. We show that RAG improves accuracy for untuned models, while models tuned via a system prompt that tells the LLM to respond like a Linux system achieve, without RAG, accuracy similar to untuned models with RAG, at slightly lower latency.

[NLP-17] REMONI: An Autonomous System Integrating Wearables and Multimodal Large Language Models for Enhanced Remote Health Monitoring

[Quick Read]: This paper addresses a notable gap in human-machine interaction (HMI) within remote patient monitoring (RPM): existing systems focus on sensor data collection and anomaly detection but lack natural-language understanding of, and interaction about, patient state. The key of the proposed REMONI system is integrating multimodal large language models (MLLMs), the Internet of Things (IoT), and wearables to automatically and continuously collect vital signs, accelerometer data, and patient activity and emotion from video, and to fuse all patient information via prompt engineering, so that medical staff can query real-time physiological state and mood through natural-language interaction, improving care efficiency and reducing healthcare costs.

Link: https://arxiv.org/abs/2510.21445
Authors: Thanh Cong Ho, Farah Kharrat, Abderrazek Abid, Fakhri Karray
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:With the widespread adoption of wearable devices in our daily lives, the demand and appeal for remote patient monitoring have significantly increased. Most research in this field has concentrated on collecting sensor data, visualizing it, and analyzing it to detect anomalies in specific diseases such as diabetes, heart disease and depression. However, this domain has a notable gap in the aspect of human-machine interaction. This paper proposes REMONI, an autonomous REmote health MONItoring system that integrates multimodal large language models (MLLMs), the Internet of Things (IoT), and wearable devices. The system automatically and continuously collects vital signs, accelerometer data from a special wearable (such as a smartwatch), and visual data in patient video clips collected from cameras. This data is processed by an anomaly detection module, which includes a fall detection model and algorithms to identify and alert caregivers of the patient’s emergency conditions. A distinctive feature of our proposed system is the natural language processing component, developed with MLLMs capable of detecting and recognizing a patient’s activity and emotion while responding to healthcare worker’s inquiries. Additionally, prompt engineering is employed to integrate all patient information seamlessly. As a result, doctors and nurses can access real-time vital signs and the patient’s current state and mood by interacting with an intelligent agent through a user-friendly web application. Our experiments demonstrate that our system is implementable and scalable for real-life scenarios, potentially reducing the workload of medical professionals and healthcare costs. A full-fledged prototype illustrating the functionalities of the system has been developed and being tested to demonstrate the robustness of its various capabilities.

[NLP-18] Does Model Size Matter? A Comparison of Small and Large Language Models for Requirements Classification

[Quick Read]: This paper asks whether small language models (SLMs) perform close enough to large language models (LLMs) on requirements-engineering (RE) tasks to serve as a more privacy-preserving, cheaper, locally deployable alternative. The key is an empirical comparison of eight models (three LLMs, five SLMs) on requirements classification over the PROMISE, PROMISE Reclass, and SecReq datasets: although LLMs average a 2% higher F1 score, the difference is not statistically significant; SLMs approach LLM performance in most cases and even achieve higher recall on PROMISE Reclass while being up to 300 times smaller, and dataset characteristics influence performance more than model size, validating SLMs as a viable alternative for RE tasks.

Link: https://arxiv.org/abs/2510.21443
Authors: Mohammad Amin Zadenoori, Vincenzo De Martino, Jacek Dabrowski, Xavier Franch, Alessio Ferrari
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:[Context and motivation] Large language models (LLMs) show notable results in natural language processing (NLP) tasks for requirements engineering (RE). However, their use is compromised by high computational cost, data sharing risks, and dependence on external services. In contrast, small language models (SLMs) offer a lightweight, locally deployable alternative. [Question/problem] It remains unclear how well SLMs perform compared to LLMs in RE tasks in terms of accuracy. [Results] Our preliminary study compares eight models, including three LLMs and five SLMs, on requirements classification tasks using the PROMISE, PROMISE Reclass, and SecReq datasets. Our results show that although LLMs achieve an average F1 score 2% higher than SLMs, this difference is not statistically significant. SLMs almost reach LLM performance across all datasets and even outperform them in recall on the PROMISE Reclass dataset, despite being up to 300 times smaller. We also found that dataset characteristics play a more significant role in performance than model size. [Contribution] Our study contributes evidence that SLMs are a valid alternative to LLMs for requirements classification, offering advantages in privacy, cost, and local deployability.

[NLP-19] Redefining Retrieval Evaluation in the Era of LLMs

[Quick Read]: This paper addresses the failure of traditional information retrieval (IR) metrics (nDCG, MAP, MRR) in Retrieval-Augmented Generation (RAG) systems. Traditional metrics assume human users scan documents sequentially with attention decaying down the ranking, and they ignore how related-but-irrelevant documents harm generation quality; in RAG, however, large language models (LLMs) consume all retrieved documents as a whole, and distractors measurably reduce answer accuracy. The key is a utility-based annotation framework that quantifies both the positive contribution of relevant passages and the negative effect of distracting ones, and on top of it the UDCG (Utility and Distraction-aware Cumulative Gain) metric, which uses an LLM-oriented positional discount and directly optimizes correlation with end-to-end answer accuracy, improving correlation by up to 36% over traditional metrics.
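The signed-utility idea behind UDCG can be written in a few lines; the flat discount below is an assumption for illustration, since the paper fits an LLM-oriented positional discount rather than the human-style decay of nDCG.

```python
# Sketch: cumulative gain where passages carry signed utilities
# (+ helpful, - distracting) and the discount need not decay with rank.
def udcg(utilities, discount=lambda rank: 1.0):
    return sum(u * discount(r) for r, u in enumerate(utilities, start=1))

ranked_utilities = [0.9, -0.4, 0.0, 0.6, -0.2]  # hypothetical annotations
print(udcg(ranked_utilities))  # 0.9: distractors subtract from the gain
```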

Link: https://arxiv.org/abs/2510.21440
Authors: Giovanni Trappolini, Florin Cuconasu, Simone Filice, Yoelle Maarek, Fabrizio Silvestri
Affiliations: Sapienza University of Rome; Technology Innovation Institute
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Traditional Information Retrieval (IR) metrics, such as nDCG, MAP, and MRR, assume that human users sequentially examine documents with diminishing attention to lower ranks. This assumption breaks down in Retrieval Augmented Generation (RAG) systems, where search results are consumed by Large Language Models (LLMs), which, unlike humans, process all retrieved documents as a whole rather than sequentially. Additionally, traditional IR metrics do not account for related but irrelevant documents that actively degrade generation quality, rather than merely being ignored. Due to these two major misalignments, namely human vs. machine position discount and human relevance vs. machine utility, classical IR metrics do not accurately predict RAG performance. We introduce a utility-based annotation schema that quantifies both the positive contribution of relevant passages and the negative impact of distracting ones. Building on this foundation, we propose UDCG (Utility and Distraction-aware Cumulative Gain), a metric using an LLM-oriented positional discount to directly optimize the correlation with the end-to-end answer accuracy. Experiments on five datasets and six LLMs demonstrate that UDCG improves correlation by up to 36% compared to traditional metrics. Our work provides a critical step toward aligning IR evaluation with LLM consumers and enables more reliable assessment of RAG components.

[NLP-20] Vision Language Models for Dynamic Human Activity Recognition in Healthcare Settings

[Quick Read]: This paper addresses the lack of effective evaluation methods for generative AI in human activity recognition (HAR), particularly for remote health monitoring, where traditional deep learning models have limitations and Vision Language Models (VLMs), despite their promise, are hard to assess because their outputs are dynamic and non-deterministic. The key is constructing a descriptive-caption dataset and proposing comprehensive evaluation methods that systematically measure VLM performance on HAR; experiments show VLMs match and sometimes surpass state-of-the-art deep learning models in accuracy, supporting their integration into intelligent healthcare systems.

Link: https://arxiv.org/abs/2510.21424
Authors: Abderrazek Abid, Thanh-Cong Ho, Fakhri Karray
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:As generative AI continues to evolve, Vision Language Models (VLMs) have emerged as promising tools in various healthcare applications. One area that remains relatively underexplored is their use in human activity recognition (HAR) for remote health monitoring. VLMs offer notable strengths, including greater flexibility and the ability to overcome some of the constraints of traditional deep learning models. However, a key challenge in applying VLMs to HAR lies in the difficulty of evaluating their dynamic and often non-deterministic outputs. To address this gap, we introduce a descriptive caption data set and propose comprehensive evaluation methods to evaluate VLMs in HAR. Through comparative experiments with state-of-the-art deep learning models, our findings demonstrate that VLMs achieve comparable performance and, in some cases, even surpass conventional approaches in terms of accuracy. This work contributes a strong benchmark and opens new possibilities for the integration of VLMs into intelligent healthcare systems.

[NLP-21] HalleluBERT: Let every token that has meaning bear its weight

[Quick Read]: This paper addresses the lack of a large-scale, extensively trained pretrained encoder for Hebrew; existing models such as HeBERT, AlephBERT, and HeRo are limited in corpus size, vocabulary, or training depth. The key is HalleluBERT, a RoBERTa-based encoder family (base and large) trained from scratch on 49.1 GB of deduplicated Hebrew web text and Wikipedia with a Hebrew-specific byte-level BPE vocabulary. It significantly improves Hebrew named entity recognition and sentiment classification, setting a new state of the art and confirming the value of fully converged monolingual pretraining for language-specific modeling.

Link: https://arxiv.org/abs/2510.21372
Authors: Raphael Scheible-Schmitt
Affiliations: Technical University of Munich; University of Freiburg
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Transformer-based models have advanced NLP, yet Hebrew still lacks a large-scale RoBERTa encoder which is extensively trained. Existing models such as HeBERT, AlephBERT, and HeRo are limited by corpus size, vocabulary, or training depth. We present HalleluBERT, a RoBERTa-based encoder family (base and large) trained from scratch on 49.1~GB of deduplicated Hebrew web text and Wikipedia with a Hebrew-specific byte-level BPE vocabulary. Evaluated on NER and sentiment classification benchmarks, HalleluBERT outperforms both monolingual and multilingual baselines. HalleluBERT sets a new state of the art for Hebrew and highlights the benefits of fully converged monolingual pretraining.

[NLP-22] HIKMA: Human-Inspired Knowledge by Machine Agents through a Multi-Agent Framework for Semi-Autonomous Scientific Conferences

[Quick Read]: This paper addresses the limitations of traditional scholarly communication in efficiency, transparency, and collaboration, and in particular how to integrate artificial intelligence (AI) into the research pipeline in a way that boosts productivity while safeguarding academic integrity. The key is HIKMA, an end-to-end AI-integrated framework spanning dataset curation, manuscript generation, peer review, revision, conference presentation, and archival dissemination; by combining language models, structured research workflows, and domain safeguards, it ensures AI supports rather than replaces human scholars while maintaining intellectual-property protection, process transparency, and academic integrity.

Link: https://arxiv.org/abs/2510.21370
Authors: Zain Ul Abideen Tariq, Mahmood Al-Zubaidi, Uzair Shah, Marco Agus, Mowafa Househ
Affiliations: Hamad Bin Khalifa University
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Digital Libraries (cs.DL)
Comments:

Abstract:HIKMA Semi-Autonomous Conference is the first experiment in reimagining scholarly communication through an end-to-end integration of artificial intelligence into the academic publishing and presentation pipeline. This paper presents the design, implementation, and evaluation of the HIKMA framework, which includes AI dataset curation, AI-based manuscript generation, AI-assisted peer review, AI-driven revision, AI conference presentation, and AI archival dissemination. By combining language models, structured research workflows, and domain safeguards, HIKMA shows how AI can support - not replace traditional scholarly practices while maintaining intellectual property protection, transparency, and integrity. The conference functions as a testbed and proof of concept, providing insights into the opportunities and challenges of AI-enabled scholarship. It also examines questions about AI authorship, accountability, and the role of human-AI collaboration in research.

[NLP-23] SindBERT the Sailor: Charting the Seas of Turkish NLP

[Quick Read]: This paper addresses the underrepresentation of morphologically rich languages such as Turkish in large-scale pretraining and presents SindBERT, the first large-scale RoBERTa-based encoder for Turkish. The keys are: training from scratch on 312 GB of high-quality Turkish text (mC4, OSCAR23, Wikipedia) in base and large configurations, and systematically evaluating on part-of-speech tagging, named entity recognition, offensive-language detection, and the TurBLiMP linguistic-acceptability benchmark. The results suggest current Turkish benchmarks may already be saturated, and that gains depend not only on data scale but on corpus quality and diversity, underscoring that for morphologically rich languages, well-designed training corpora matter more than sheer data volume.

Link: https://arxiv.org/abs/2510.21364
Authors: Raphael Scheible-Schmitt, Stefan Schweter
Affiliations: Technical University of Munich; University of Freiburg; Independent Researcher
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Transformer models have revolutionized NLP, yet many morphologically rich languages remain underrepresented in large-scale pre-training efforts. With SindBERT, we set out to chart the seas of Turkish NLP, providing the first large-scale RoBERTa-based encoder for Turkish. Trained from scratch on 312 GB of Turkish text (mC4, OSCAR23, Wikipedia), SindBERT is released in both base and large configurations, representing the first large-scale encoder-only language model available for Turkish. We evaluate SindBERT on part-of-speech tagging, named entity recognition, offensive language detection, and the TurBLiMP linguistic acceptability benchmark. Our results show that SindBERT performs competitively with existing Turkish and multilingual models, with the large variant achieving the best scores in two of four tasks but showing no consistent scaling advantage overall. This flat scaling trend, also observed for XLM-R and EuroBERT, suggests that current Turkish benchmarks may already be saturated. At the same time, comparisons with smaller but more curated models such as BERTurk highlight that corpus quality and diversity can outweigh sheer data volume. Taken together, SindBERT contributes both as an openly released resource for Turkish NLP and as an empirical case study on the limits of scaling and the central role of corpus composition in morphologically rich languages. The SindBERT models are released under the MIT license and made available in both fairseq and Huggingface formats.

[NLP-24] FairImagen: Post-Processing for Bias Mitigation in Text-to-Image Models NEURIPS2025

[Quick Read]: This paper addresses how text-to-image diffusion models such as Stable Diffusion replicate and amplify societal biases, especially along demographic attributes like gender and race. The key of FairImagen, a post-hoc debiasing framework, is projecting CLIP-based prompt embeddings, via Fair Principal Component Analysis, into a subspace that minimizes group-specific information while preserving semantic content; empirical noise injection and a unified cross-demographic projection further enable simultaneous debiasing across multiple attributes, significantly improving fairness in image generation without retraining or modifying the underlying model.
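A one-direction version of the projection step looks as follows; a single random vector stands in for the Fair-PCA subspace, so this only illustrates the orthogonal-projection-plus-noise mechanics, not the paper's fitted components.

```python
# Sketch: remove a bias direction from a prompt embedding by orthogonal
# projection, then inject small noise (all vectors are synthetic).
import numpy as np

rng = np.random.default_rng(0)
d = 512
prompt_emb = rng.normal(size=d)                # stand-in CLIP embedding
bias_dir = rng.normal(size=d)
bias_dir /= np.linalg.norm(bias_dir)           # e.g. a "gender" direction

debiased = prompt_emb - (prompt_emb @ bias_dir) * bias_dir
debiased += rng.normal(scale=0.01, size=d)     # empirical noise injection

print("residual bias component:", abs(debiased @ bias_dir))  # ~ noise scale
```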

Link: https://arxiv.org/abs/2510.21363
Authors: Zihao Fu, Ryan Brown, Shun Shao, Kai Rawal, Eoin Delaney, Chris Russell
Affiliations: The Chinese University of Hong Kong; University of Oxford; University of Cambridge; Trinity College Dublin
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025

Abstract:Text-to-image diffusion models, such as Stable Diffusion, have demonstrated remarkable capabilities in generating high-quality and diverse images from natural language prompts. However, recent studies reveal that these models often replicate and amplify societal biases, particularly along demographic attributes like gender and race. In this paper, we introduce FairImagen (this https URL), a post-hoc debiasing framework that operates on prompt embeddings to mitigate such biases without retraining or modifying the underlying diffusion model. Our method integrates Fair Principal Component Analysis to project CLIP-based input embeddings into a subspace that minimizes group-specific information while preserving semantic content. We further enhance debiasing effectiveness through empirical noise injection and propose a unified cross-demographic projection method that enables simultaneous debiasing across multiple demographic attributes. Extensive experiments across gender, race, and intersectional settings demonstrate that FairImagen significantly improves fairness with a moderate trade-off in image quality and prompt fidelity. Our framework outperforms existing post-hoc methods and offers a simple, scalable, and model-agnostic solution for equitable text-to-image generation.

[NLP-25] A Diagnostic Benchmark for Sweden-Related Factual Knowledge

[Quick Read]: This paper addresses the problem that most existing benchmarks are translated, US-centric datasets ill-suited to testing knowledge of Sweden-specific people and events. The key is a manually written question-answering benchmark focused on Sweden-related personalities and events from culture, media, and major sports, many of which receive limited international coverage, with English translations included to support cross-lingual consistency analysis. The dataset measures factual recall across models of varying sizes and degrees of Swedish coverage, and reveals that continued pre-training on Swedish generally improves factual knowledge while also causing some forgetting of previously known information, making it a diagnostic tool for studying language adaptation and knowledge retention.

Link: https://arxiv.org/abs/2510.21360
Authors: Jenny Kunz
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Many Swedish benchmarks are translated US-centric benchmarks, and therefore not suitable for testing knowledge that is particularly relevant, or even specific, to Sweden. We therefore introduce a manually written question-answering benchmark specifically targeted to Sweden-related personalities and events, many of which receive very limited coverage in international media. Our annotators drew inspiration from a popular radio program featuring public figures from culture and media, as well as major sports events in Sweden. The dataset can be used to measure factual recall across models of varying sizes and degrees of Swedish coverage, and allows probing cross-lingual factual consistency, as it contains English translations. Using the dataset, we find that smaller models with stronger Swedish coverage perform comparably to a three times larger multilingual model in recalling Sweden-related facts. We also observe that continued pre-training on Swedish generally improves factual knowledge but also leads to forgetting of part of the previously known information. These results demonstrate the dataset's potential as a diagnostic tool for studying language adaptation and knowledge retention in multilingual models.

[NLP-26] Magellan: Guided MCTS for Latent Space Exploration and Novelty Generation

[Quick Read]: This paper addresses a common limitation of large language models (LLMs) in creative idea generation: they default to high-probability, familiar concepts from their training data and struggle to escape these "gravity wells" to produce genuine novelty. The key of the proposed Magellan framework is reframing creative generation as a principled, guided exploration of the LLM's latent conceptual space. Monte Carlo Tree Search (MCTS) provides the structure, with hierarchical guidance on top: long-range direction comes from a "semantic compass" vector, formulated via orthogonal projection, that steers the search toward relevant novelty; local step-by-step decisions use a landscape-aware value function in place of inconsistent self-evaluation, explicitly balancing intrinsic coherence, extrinsic novelty, and narrative progress. Experiments show Magellan significantly outperforms strong baselines such as ReAct and Tree of Thoughts (ToT) in generating scientific ideas with superior plausibility and innovation.
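The compass construction (an orthogonal-projection step) and a weighted value function can be sketched with toy embeddings; the vectors and weights are placeholders, not the paper's learned or engineered quantities.

```python
# Sketch: a "semantic compass" as the topic direction orthogonal to the
# familiar region, plus a value function balancing three terms.
import numpy as np

rng = np.random.default_rng(0)
topic = rng.normal(size=64)   # embedding of the research topic
known = rng.normal(size=64)   # embedding of familiar prior concepts

known_u = known / np.linalg.norm(known)
compass = topic - (topic @ known_u) * known_u  # relevant-but-novel direction
compass /= np.linalg.norm(compass)

def value(idea_emb, coherence, progress, w=(0.4, 0.4, 0.2)):
    novelty = float(idea_emb @ compass)  # alignment with the compass
    return w[0] * coherence + w[1] * novelty + w[2] * progress

print(value(rng.normal(size=64), coherence=0.8, progress=0.5))
```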

Link: https://arxiv.org/abs/2510.21341
Authors: Lufan Chang
Affiliations: Google
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted to 1st Open Conference on AI Agents for Science (agents4science 2025)

Abstract:Large Language Models (LLMs) often struggle with generating truly innovative ideas, typically defaulting to high-probability, familiar concepts within their training data’s “gravity wells.” While advanced search-based methods like Tree of Thoughts (ToT) attempt to mitigate this, they are fundamentally limited by their reliance on unprincipled, inconsistent self-evaluation heuristics to guide exploration. To address this gap, we introduce \textbfMagellan, a novel framework that reframes creative generation as a principled, guided exploration of an LLM’s latent conceptual space. At its core, Magellan employs Monte Carlo Tree Search (MCTS) governed by a hierarchical guidance system. For long-range direction, a “semantic compass” vector, formulated via orthogonal projection, steers the search towards relevant novelty. For local, step-by-step decisions, a landscape-aware value function replaces flawed self-evaluation with an explicit reward structure that balances intrinsic coherence, extrinsic novelty, and narrative progress. Extensive experiments demonstrate that Magellan significantly outperforms strong baselines, including ReAct and ToT, in generating scientific ideas with superior plausibility and innovation. Our work shows that for creative discovery, a principled, guided search is more effective than unconstrained agency, paving the way for LLMs to become more capable partners in innovation.
zh
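
以下给出“语义罗盘”中正交投影思路的一个最小示意(Python):通过 Gram-Schmidt 式投影去除新颖方向中落在已探索方向上的分量,得到引导搜索的单位向量;函数名、权重与嵌入维度均为示意性假设,并非原论文实现。

```python
import numpy as np

def semantic_compass(novel_dir: np.ndarray, explored_dir: np.ndarray) -> np.ndarray:
    """构造“语义罗盘”向量:去除新颖方向中沿已探索方向的分量(Gram-Schmidt 投影)。"""
    e = explored_dir / np.linalg.norm(explored_dir)
    residual = novel_dir - np.dot(novel_dir, e) * e   # 正交投影后的残差
    return residual / np.linalg.norm(residual)

def node_value(coherence: float, novelty: float, progress: float,
               w=(0.4, 0.4, 0.2)) -> float:
    """景观感知价值函数示意:内在连贯性、外在新颖性与叙事进展的加权和(权重为假设)。"""
    return w[0] * coherence + w[1] * novelty + w[2] * progress

rng = np.random.default_rng(0)
compass = semantic_compass(rng.normal(size=768), rng.normal(size=768))
print(compass.shape, round(node_value(0.8, 0.6, 0.5), 3))
```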

[NLP-27] Multi-turn Training with Basic Human Feedback Helps Little on LLM Reasoning

【速读】: 该论文试图解决的问题是:在大型语言模型(Large Language Models, LLMs)的推理能力训练中,是否存在必要性通过多轮人类反馈(multi-turn human feedback)进行训练,以提升其在真实场景中的表现。当前主流方法通常采用单轮强化学习(single-turn reinforcement learning),但实际应用常涉及与人类的多轮交互,存在训练与部署环境不一致的风险。论文的关键解决方案在于系统比较了单轮训练与三种多轮训练策略的效果,发现单轮训练模型在单轮和多轮评估中均表现出良好泛化能力,而多轮训练反而显著损害了单轮推理性能。这表明,在信息完备的任务中,稳健的单轮训练仍是最有效且可靠的方法,多轮训练带来的收益有限,甚至可能削弱模型推理能力。

链接: https://arxiv.org/abs/2510.21339
作者: Qiang Liu,Wuganjing Song,Zhenzhou Lin,Feifan Chen,Qiaolong Cai,Chen Li,Yongduo Sui
机构: Tencent Interactive Entertainment (腾讯互动娱乐); Hong Kong University of Science and Technology (香港科技大学); Sun Yat-sen University (中山大学)
类目: Computation and Language (cs.CL); Information Theory (cs.IT); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The reasoning capabilities of Large Language Models (LLMs) are typically developed through the single-turn reinforcement learning, whereas real-world applications often involve multi-turn interactions with human feedback, leading to a potential mismatch between training and deployment conditions. In this work, we study whether multi-turn training with human feedback is necessary for reasoning tasks. We compare conventional single-turn training with three multi-turn strategies and reach contrary conclusions to previous research. We find that models trained in a single-turn setting generalize effectively to both single- and multi-turn evaluations, while models trained with multi-turn strategies exhibit a significant degradation in single-turn reasoning performance. These results suggest that for tasks with complete information, robust single-turn training remains more effective and reliable, as multi-turn training with basic feedback provides limited benefits and can even degrade reasoning capabilities.
zh

[NLP-28] TripTide: A Benchmark for Adaptive Travel Planning under Disruptions

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在个性化旅行规划中缺乏应对现实世界扰动(如航班取消、天气关闭或景点超售)的适应能力问题。现有方法如TripCraft和TravelPlanner虽能生成满足约束的行程,但未充分评估LLM在突发扰动下的修订能力。解决方案的关键在于提出首个基准测试平台TripTide,通过建模扰动严重程度与用户容忍度等维度,量化LLM在语义、空间和时序上的适应性,并引入自动指标(意图保持度、响应性、适应性)与人工专家评估相结合的方法,系统评估LLM在扰动情境下的修订质量。实验表明,LLM在长行程中表现出更强的空间一致性与语义稳定性,但处理复杂扰动的能力随计划长度增加而下降,揭示了当前LLM在鲁棒性和动态响应方面的局限性。

链接: https://arxiv.org/abs/2510.21329
作者: Priyanshu Karmakar(1),Soumyabrata Chaudhuri(1),Shubhojit Mallick(2),Manish Gupta(2),Abhik Jana(1),Shreya Ghosh(1) ((1) School of Electrical and Computer Sciences, IIT Bhubaneswar, India, (2) Microsoft, India)
机构: IIT Bhubaneswar (印度理工学院布巴内斯瓦尔分校); Microsoft (微软)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 12 tables and 7 figures

点击查看摘要

Abstract:Recent efforts like TripCraft and TravelPlanner have advanced the use of Large Language Models (LLMs) for personalized, constraint-aware travel itinerary generation. Yet, real travel often faces disruptions. To address this, we present TripTide, the first benchmark evaluating LLMs’ ability to revise itineraries under realistic disruptions. TripTide models key dimensions such as disruption severity and traveler tolerance, enabling nuanced assessment of LLM adaptability to events like flight cancellations, weather closures, or overbooked attractions. We conduct a threefold evaluation. First, we introduce automatic metrics including Preservation of Intent (how well the revised plan maintains feasibility and goals), Responsiveness (promptness and appropriateness of disruption handling), and Adaptability (semantic, spatial, and sequential divergence between original and revised plans). Second, we apply an LLM-as-a-judge approach to automatically assess revision quality. Third, we perform manual expert evaluation to verify whether revisions preserve semantic, spatial, sequential, and responsive aspects. Our experiments show that LLMs maintain strong sequential consistency and semantic stability, while spatial deviations are larger for shorter trips but decrease with longer ones, indicating that extended plans encourage better geographic coherence. However, disruption-handling ability declines as plan length increases, highlighting limits in LLM robustness. TripTide establishes a benchmark for evaluating adaptability, personalization, and resilience in LLM-based travel planning under real-world uncertainty.
zh

[NLP-29] Typoglycemia under the Hood: Investigating Language Models' Understanding of Scrambled Words

【速读】: 该论文试图解决的问题是:当英文单词内部字母顺序被随机打乱(即 typoglycemia)时,为何某些自然语言处理(NLP)模型仍能保持较高的鲁棒性?其解决方案的关键在于揭示两个核心机制:一是英语中因 typoglycemia 导致的词义坍缩(word collapse)比例较低;二是即使发生坍缩,这些词在上下文中的语境差异足够显著,使得模型能够轻松进行消歧。通过分析英国国家语料库(British National Corpus)、评估 BERT 的消歧能力以及对比在干净文本与 typoglycemic 文本上训练的 BERT 变体,研究发现模型性能下降远小于预期,从而验证了上述假设。

链接: https://arxiv.org/abs/2510.21326
作者: Gianluca Sperduti,Alejandro Moreo
机构: Institute of Information Science and Technologies (信息科学与技术研究所); National Research Council (国家研究委员会); University of Pisa (比萨大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Research in linguistics has shown that humans can read words with internally scrambled letters, a phenomenon recently dubbed typoglycemia. Some specific NLP models have recently been proposed that similarly demonstrate robustness to such distortions by ignoring the internal order of characters by design. This raises a fundamental question: how can models perform well when many distinct words (e.g., form and from) collapse into identical representations under typoglycemia? Our work, focusing exclusively on the English language, seeks to shed light on the underlying aspects responsible for this robustness. We hypothesize that the main reasons have to do with the fact that (i) relatively few English words collapse under typoglycemia, and that (ii) collapsed words tend to occur in contexts so distinct that disambiguation becomes trivial. In our analysis, we (i) analyze the British National Corpus to quantify word collapse and ambiguity under typoglycemia, (ii) evaluate BERT’s ability to disambiguate collapsing forms, and (iii) conduct a probing experiment by comparing variants of BERT trained from scratch on clean versus typoglycemic Wikipedia text; our results reveal that the performance degradation caused by scrambling is smaller than expected.
zh
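
“词义坍缩”的量化方式可以用如下最小示意直观理解:将首尾字母固定、内部字母排序作为 typoglycemia 下的规范形式,统计词表中坍缩为同一形式的单词组。规范化方式是常见做法,仅为示意,未必与原文实现完全一致。

```python
from collections import defaultdict

def scramble_key(word: str) -> str:
    """typoglycemia 下的规范形式:首尾字母固定,内部字母排序(对打乱顺序不敏感)。"""
    if len(word) <= 3:
        return word
    return word[0] + "".join(sorted(word[1:-1])) + word[-1]

def collapse_groups(vocab):
    """返回所有坍缩为同一规范形式的单词组。"""
    groups = defaultdict(set)
    for w in vocab:
        groups[scramble_key(w)].add(w)
    return {k: v for k, v in groups.items() if len(v) > 1}

vocab = ["form", "from", "trial", "trail", "cloud", "could", "house"]
print(collapse_groups(vocab))
# 例如 form/from 共享键 "form",trial/trail 共享键 "tairl",cloud/could 共享键 "cloud"
```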

[NLP-30] Leverage Unlearning to Sanitize LLMs

【速读】: 该论文旨在解决预训练大语言模型(Large Language Models, LLMs)在特定领域数据(如医疗报告、商业数据)上微调后可能记忆敏感信息(如个人身份或机密数据)的问题,这些信息可能在模型后续使用中被重新生成,构成隐私泄露风险。解决方案的关键在于提出一种名为SANI的去遗忘(unlearning)方法,其核心机制包括两个阶段:第一阶段通过重置模型最后几层中的部分神经元来破坏对细粒度敏感信息的记忆;第二阶段在避免再次记忆敏感信息的前提下对模型进行微调,从而实现模型的净化(sanitization)。实验表明,仅需少量额外训练轮次即可显著减少敏感信息的再生次数,且无需依赖安全数据集进行昂贵的再训练,适用于已投入大量资源训练模型的机构(如医院)在共享前进行隐私保护处理。

链接: https://arxiv.org/abs/2510.21322
作者: Antoine Boutet,Lucas Magnana
机构: INSA Lyon (里昂国立应用科学学院); Inria (法国国家信息与自动化研究院); CITI (信息与电信研究中心); UR3720 (研究单元3720)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Pre-trained large language models (LLMs) are becoming useful for various tasks. To improve their performance on certain tasks, it is necessary to fine-tune them on specific data corpora (e.g., medical reports, business data). These specialized data corpora may contain sensitive data (e.g., personal or confidential data) that will be memorized by the model and likely to be regurgitated during its subsequent use. This memorization of sensitive information by the model poses a significant privacy or confidentiality issue. To remove this memorization and sanitize the model without requiring costly additional fine-tuning on a secured data corpus, we propose SANI. SANI is an unlearning approach to sanitize language models. It relies on erasure and repair phases that 1) reset certain neurons in the last layers of the model to disrupt the memorization of fine-grained information, and then 2) fine-tune the model while avoiding memorizing sensitive information. We comprehensively evaluate SANI to sanitize both a model fine-tuned and specialized with medical data, by removing direct and indirect identifiers from the memorization of the model, and a standard pre-trained model, by removing specific terms defined as confidential information from the model. Results show that with only a few additional epochs of unlearning, the model is sanitized and the number of regurgitations is drastically reduced. This approach can be particularly useful for hospitals or other industries that have already spent significant resources training models on large datasets and wish to sanitize them before sharing.
zh
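
SANI 的“擦除”阶段可大致示意如下:随机重置模型最后几层 MLP 中一小部分神经元对应的权重列,之后再进入“修复”微调阶段。其中 LLaMA 风格的 `model.model.layers` 布局、重置比例与重新初始化方式均为假设,具体以原论文为准。

```python
import torch

def erase_neurons(model, num_last_layers: int = 2, frac: float = 0.05, seed: int = 0):
    """擦除阶段示意:随机重置最后几层 MLP 下投影矩阵中一小部分列(对应神经元)。"""
    torch.manual_seed(seed)
    for layer in model.model.layers[-num_last_layers:]:   # 假设 LLaMA 风格布局
        w = layer.mlp.down_proj.weight                    # (hidden, intermediate)
        idx = torch.randperm(w.shape[1])[: int(frac * w.shape[1])]
        with torch.no_grad():
            w[:, idx] = torch.randn_like(w[:, idx]) * w.std()
    return model

# 用法(示意):
# model = AutoModelForCausalLM.from_pretrained(...)
# model = erase_neurons(model)   # 之后在安全语料上微调,即“修复”阶段
```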

[NLP-31] Efficient semantic uncertainty quantification in language models via diversity-steered sampling NEURIPS2025

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在自由格式问答(free-form question answering, QA)任务中准确估计语义随机不确定性(semantic aleatoric uncertainty)和认知不确定性(epistemic uncertainty)的难题,尤其是在生成样本数量有限时难以获得稳定估计的问题。解决方案的关键在于提出一种多样性引导采样器(diversity-steered sampler),通过在解码过程中引入连续语义相似性惩罚项来抑制语义冗余输出,该惩罚项基于轻量微调的自然语言推理(Natural Language Inference, NLI)模型对部分前缀或扩散过程中的中间状态进行计算;同时结合重要性重加权(importance reweighting)以校正下游不确定性估计的偏差,并利用控制变量(control variates)降低估计方差。该方法适用于自回归和掩码扩散(masked diffusion)两种范式,在保持与基线相当或更优性能的同时显著提升了样本效率。

链接: https://arxiv.org/abs/2510.21310
作者: Ji Won Park,Kyunghyun Cho
机构: Prescient Design, Genentech (基因泰克); Center for Data Science, New York University (纽约大学数据科学中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages (+7 appendix), 7 figures. Accepted at NeurIPS 2025

点击查看摘要

Abstract:Accurately estimating semantic aleatoric and epistemic uncertainties in large language models (LLMs) is particularly challenging in free-form question answering (QA), where obtaining stable estimates often requires many expensive generations. We introduce a diversity-steered sampler that discourages semantically redundant outputs during decoding, covers both autoregressive and masked diffusion paradigms, and yields substantial sample-efficiency gains. The key idea is to inject a continuous semantic-similarity penalty into the model’s proposal distribution using a natural language inference (NLI) model lightly finetuned on partial prefixes or intermediate diffusion states. We debias downstream uncertainty estimates with importance reweighting and shrink their variance with control variates. Across four QA benchmarks, our method matches or surpasses baselines while covering more semantic clusters with the same number of samples. Being modular and requiring no gradient access to the base LLM, the framework promises to serve as a drop-in enhancement for uncertainty estimation in risk-sensitive model deployments.
zh
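
其核心思想可用一个 logits 级别的示意说明:对与已生成样本语义相近的候选施加连续相似度惩罚,得到鼓励多样性的提议分布。此处用随机嵌入的点积近似替代论文中轻量微调的 NLI 模型,惩罚系数 alpha 为假设参数。

```python
import torch
import torch.nn.functional as F

def diversity_steered_logits(logits, cand_embs, prev_embs, alpha=2.0):
    """对与已有样本语义最接近的候选施加惩罚,得到提议分布的 logits。"""
    if not prev_embs:
        return logits
    sims = cand_embs @ torch.stack(prev_embs).T        # (vocab, k) 相似度
    penalty = sims.max(dim=-1).values                  # 与最近已生成样本的相似度
    return logits - alpha * penalty

vocab, d = 10, 8                                       # 玩具规模
logits = torch.randn(vocab)
cand = F.normalize(torch.randn(vocab, d), dim=-1)      # 候选续写的嵌入(示意)
prev = [F.normalize(torch.randn(d), dim=0)]            # 已生成样本的嵌入
probs = F.softmax(diversity_steered_logits(logits, cand, prev), dim=-1)
print(float(probs.sum()))                              # 1.0,可直接用于采样
```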

[NLP-32] PARL: Prompt-based Agents for Reinforcement Learning

【速读】: 该论文试图解决的问题是:如何将大语言模型(Large Language Models, LLMs)作为强化学习(Reinforcement Learning, RL)代理,在不进行微调的前提下,通过提示(prompting)机制实现基于试错的环境交互与策略学习。解决方案的关键在于提出PARL(Prompt-based Agent for Reinforcement Learning),其核心创新在于将动作(actions)、状态(states)和奖励(rewards)编码进提示词中,使LLM能够直接利用预训练知识在结构化、非语言推理任务中完成RL学习,从而在无需任何微调的情况下适应简单环境并达到或超越传统RL代理的性能表现。

链接: https://arxiv.org/abs/2510.21306
作者: Yarik Menchaca Resendiz,Roman Klinger
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated high performance on tasks expressed in natural language, particularly in zero- or few-shot settings. These are typically framed as supervised (e.g., classification) or unsupervised (e.g., clustering) problems. However, limited work evaluates LLMs as agents in reinforcement learning (RL) tasks (e.g., playing games), where learning occurs through interaction with an environment and a reward system. While prior work focused on representing tasks that rely on a language representation, we study structured, non-linguistic reasoning - such as interpreting positions in a grid world. We therefore introduce PARL (Prompt-based Agent for Reinforcement Learning), a method that uses LLMs as RL agents through prompting, without any fine-tuning. PARL encodes actions, states, and rewards in the prompt, enabling the model to learn through trial-and-error interaction. We evaluate PARL on three standard RL tasks that do not entirely rely on natural language. We show that it can match or outperform traditional RL agents in simple environments by leveraging pretrained knowledge. However, we identify performance limitations in tasks that require complex mathematical operations or decoding states and actions.
zh
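
PARL 将 (state, action, reward) 三元组编码进提示的做法,可用如下玩具示意体现;其中 `fake_llm`、网格环境与提示模板均为假设,真实系统中应替换为实际的 LLM 调用与环境。

```python
import random

def build_prompt(history, state):
    """把历史 (state, action, reward) 与当前 state 编码进提示,供 LLM 在上下文中试错学习。"""
    lines = ["你是 3x3 网格上的智能体,可选动作:up/down/left/right。"]
    for s, a, r in history:
        lines.append(f"state={s} action={a} reward={r}")
    lines.append(f"state={state} action=")
    return "\n".join(lines)

def fake_llm(prompt):                      # 占位:真实系统中为 LLM API 调用
    return random.choice(["up", "down", "left", "right"])

history, state = [], (0, 0)
for _ in range(3):                         # 试错交互;奖励规则为玩具设定
    action = fake_llm(build_prompt(history, state))
    history.append((state, action, 1 if action == "right" else 0))
print(history)
```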

[NLP-33] When Models Outthink Their Safety: Mitigating Self-Jailbreak in Large Reasoning Models with Chain-of-Guardrails

【速读】: 该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)在复杂推理任务中面临的严重安全风险问题,尤其是有害内容生成和越狱攻击(jailbreak attacks),这些问题源于现有基于启发式安全信号的缓解策略往往抑制模型的推理能力,导致安全与推理能力之间的权衡困境。解决方案的关键在于提出一种名为“护栏链”(Chain-of-Guardrail, CoG)的训练框架,该框架通过重构或回溯不安全的推理步骤,引导模型回到安全路径,同时保留有效的推理链条,从而在显著提升安全性的同时维持与现有方法相当的推理能力。

链接: https://arxiv.org/abs/2510.21285
作者: Yingzhi Mao(1 and 2),Chunkang Zhang(1 and 2),Junxiang Wang(1),Xinyan Guan(1 and 2),Boxi Cao(1),Yaojie Lu(1),Hongyu Lin(1),Xianpei Han(1 and 2),Le Sun(1 and 2) ((1) Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, (2) University of Chinese Academy of Sciences)
机构: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所); University of Chinese Academy of Sciences (中国科学院大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: First two authors contributed equally. The main text is 10 pages, with an appendix of 19 pages. The paper contains 18 figures and 16 tables

点击查看摘要

Abstract:Large Reasoning Models (LRMs) demonstrate remarkable capabilities on complex reasoning tasks but remain vulnerable to severe safety risks, including harmful content generation and jailbreak attacks. Existing mitigation strategies rely on injecting heuristic safety signals during training, which often suppress reasoning ability and fail to resolve the safety-reasoning trade-off. To systematically investigate this issue, we analyze the reasoning trajectories of diverse LRMs and uncover a phenomenon we term Self-Jailbreak, where models override their own risk assessments and justify responding to unsafe prompts. This finding reveals that LRMs inherently possess the ability to reject unsafe queries, but this ability is compromised, resulting in harmful outputs. Building on these insights, we propose the Chain-of-Guardrail (CoG), a training framework that recomposes or backtracks unsafe reasoning steps, steering the model back onto safe trajectories while preserving valid reasoning chains. Extensive experiments across multiple reasoning and safety benchmarks demonstrate that CoG substantially improves the safety of current LRMs while preserving comparable reasoning ability, significantly outperforming prior methods that suffer from severe safety-reasoning trade-offs.
zh

[NLP-34] Pctx: Tokenizing Personalized Context for Generative Recommendation

【速读】: 该论文旨在解决生成式推荐(Generative Recommendation, GR)模型中现有tokenization方法静态且非个性化的问题。当前方法仅基于物品特征生成语义ID(semantic IDs),假设所有用户对物品的相似性认知一致,忽略了用户特定意图和偏好,导致在自回归生成过程中无法捕捉个体差异化的理解标准。解决方案的关键在于提出一种个性化上下文感知的分词器(personalized context-aware tokenizer),通过引入用户的交互历史动态生成语义ID,使同一物品在不同用户上下文中可被映射为不同的语义ID,从而支持GR模型学习多种解释标准,实现更精准的个性化预测。

链接: https://arxiv.org/abs/2510.21276
作者: Qiyong Zhong,Jiajie Su,Yunshan Ma,Julian McAuley,Yupeng Hou
机构: Zhejiang University (浙江大学); University of California, San Diego (加州大学圣地亚哥分校); Singapore Management University (新加坡管理大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Generative recommendation (GR) models tokenize each action into a few discrete tokens (called semantic IDs) and autoregressively generate the next tokens as predictions, showing advantages such as memory efficiency, scalability, and the potential to unify retrieval and ranking. Despite these benefits, existing tokenization methods are static and non-personalized. They typically derive semantic IDs solely from item features, assuming a universal item similarity that overlooks user-specific perspectives. However, under the autoregressive paradigm, semantic IDs with the same prefixes always receive similar probabilities, so a single fixed mapping implicitly enforces a universal item similarity standard across all users. In practice, the same item may be interpreted differently depending on user intentions and preferences. To address this issue, we propose a personalized context-aware tokenizer that incorporates a user’s historical interactions when generating semantic IDs. This design allows the same item to be tokenized into different semantic IDs under different user contexts, enabling GR models to capture multiple interpretive standards and produce more personalized predictions. Experiments on three public datasets demonstrate up to 11.44% improvement in NDCG@10 over non-personalized action tokenization baselines. Our code is available at this https URL.
zh

[NLP-35] Sparser Block-Sparse Attention via Token Permutation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在扩展上下文长度时因自注意力机制(self-attention mechanism)计算复杂度为 O(N^2) 而导致的内存占用高和延迟大的问题。尽管块稀疏注意力(Block-sparse attention)通过跳过部分块的计算来优化效率,但其性能受限于注意力模式的不规则性,常导致关键键令牌(key tokens)分散于多个块中,造成冗余计算。论文提出了一种即插即用的置换块稀疏注意力(Permuted Block-Sparse Attention, PBS-Attn)方法,其核心在于利用注意力的置换特性增强块级稀疏性,从而提升预填充阶段(prefilling)的计算效率。实验表明,PBS-Attn 在长上下文真实数据集上优于现有块稀疏方法且接近全注意力基线,并借助定制的置换FlashAttention内核实现高达2.75倍的端到端加速,验证了其实际可行性。

链接: https://arxiv.org/abs/2510.21270
作者: Xinghao Wang,Pengyu Wang,Dong Zhang,Chenkun Tan,Shaojun Zhou,Zhaoxiang Liu,Shiguo Lian,Fangxu Liu,Kai Song,Xipeng Qiu
机构: Fudan University (复旦大学); China Unicom (中国联通); ByteDance (字节跳动); Shanghai Innovation Institute (上海创新研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose O(N^2) complexity with respect to sequence length presents a major bottleneck for both memory and latency. Fortunately, the attention matrix is often sparse, particularly for long sequences, suggesting an opportunity for optimization. Block-sparse attention has emerged as a promising solution that partitions sequences into blocks and skips computation for a subset of these blocks. However, the effectiveness of this method is highly dependent on the underlying attention patterns, which can lead to sub-optimal block-level sparsity. For instance, important key tokens for queries within a single block may be scattered across numerous other blocks, leading to computational redundancy. In this work, we propose Permuted Block-Sparse Attention (PBS-Attn), a plug-and-play method that leverages the permutation properties of attention to increase block-level sparsity and enhance the computational efficiency of LLM prefilling. We conduct comprehensive experiments on challenging real-world long-context datasets, demonstrating that PBS-Attn consistently outperforms existing block-sparse attention methods in model accuracy and closely matches the full attention baseline. Powered by our custom permuted-FlashAttention kernels, PBS-Attn achieves an end-to-end speedup of up to 2.75× in long-context prefilling, confirming its practical viability. Code available at this https URL
zh
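
置换以提升块稀疏性的思路可示意如下:按键向量在主奇异方向上的投影排序,使相似的 key 聚到同一块内,注意力计算后再按逆置换恢复顺序。排序依据与做法均为示意性选择,并非论文的置换算法本身。

```python
import torch

def permute_for_block_sparsity(k: torch.Tensor):
    """按 key 在主奇异方向上的投影排序,使相似 key 聚进同一块(排序依据为示意)。"""
    u = torch.linalg.svd(k, full_matrices=False).Vh[0]   # 主右奇异向量 (d,)
    order = torch.argsort(k @ u)                         # 置换
    inverse = torch.argsort(order)                       # 逆置换,用于恢复输出顺序
    return order, inverse

seq, d = 16, 8
k = torch.randn(seq, d)
order, inverse = permute_for_block_sparsity(k)
k_perm = k[order]
# ……在置换后的 q/k/v 上执行块稀疏注意力,输出再用 inverse 恢复原顺序
assert torch.equal(k_perm[inverse], k)
```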

[NLP-36] Correlation Dimension of Auto-Regressive Large Language Models NEURIPS2025

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成文本时表现出重复、不连贯等异常行为的问题,尽管这些行为通常发生在模型预测误差(如困惑度,perplexity)较低的情况下。传统评估指标仅关注局部预测准确性,忽视了文本的长程结构复杂性。论文提出以**相关维数(correlation dimension)**作为核心解决方案,这是一种基于分形几何的自相似性度量方法,能够量化语言模型感知到的文本认知复杂度,从而捕捉语言的层次化重复结构,并统一建模局部与全局属性。该方法不仅揭示了预训练过程中的三个阶段、上下文依赖的复杂性变化以及幻觉倾向,还能可靠检测多种生成退化现象,且在计算效率、模型量化鲁棒性和架构兼容性方面表现优异。

链接: https://arxiv.org/abs/2510.21258
作者: Xin Du,Kumiko Tanaka-Ishii
机构: Waseda University (早稻田大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Chaotic Dynamics (nlin.CD)
备注: Accepted at NeurIPS 2025

点击查看摘要

Abstract:Large language models (LLMs) have achieved remarkable progress in natural language generation, yet they continue to display puzzling behaviors – such as repetition and incoherence – even when exhibiting low perplexity. This highlights a key limitation of conventional evaluation metrics, which emphasize local prediction accuracy while overlooking long-range structural complexity. We introduce correlation dimension, a fractal-geometric measure of self-similarity, to quantify the epistemological complexity of text as perceived by a language model. This measure captures the hierarchical recurrence structure of language, bridging local and global properties in a unified framework. Through extensive experiments, we show that correlation dimension (1) reveals three distinct phases during pretraining, (2) reflects context-dependent complexity, (3) indicates a model’s tendency toward hallucination, and (4) reliably detects multiple forms of degeneration in generated text. The method is computationally efficient, robust to model quantization (down to 4-bit precision), broadly applicable across autoregressive architectures (e.g., Transformer and Mamba), and provides fresh insight into the generative dynamics of LLMs.
zh
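
相关维数可用经典的 Grassberger–Procaccia 方法估计:统计距离小于 r 的点对比例 C(r),取 log C(r) 对 log r 的拟合斜率。以下最小示意在一条嵌入三维空间的一维曲线上验证该估计(论文中的“点”对应语言模型的隐藏状态序列,此处用合成数据代替):

```python
import numpy as np

def correlation_dimension(points: np.ndarray, radii) -> float:
    """Grassberger-Procaccia 估计:C(r) 为距离小于 r 的点对比例,
    返回 log C(r) 对 log r 的拟合斜率。"""
    dists = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    pair_d = dists[np.triu_indices(len(points), k=1)]
    c = np.array([(pair_d < r).mean() for r in radii])
    mask = c > 0
    slope, _ = np.polyfit(np.log(np.asarray(radii)[mask]), np.log(c[mask]), 1)
    return slope

t = np.linspace(0, 4 * np.pi, 500)                 # 嵌入三维的一维曲线
pts = np.stack([np.cos(t), np.sin(t), t / 5], axis=-1)
print(round(correlation_dimension(pts, np.geomspace(0.05, 1.0, 10)), 2))  # 应接近 1
```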

[NLP-37] DispatchMAS: Fusing taxonomy and artificial intelligence agents for emergency medical services

【速读】: 该论文旨在解决紧急医疗调度(Emergency Medical Dispatch, EMD)过程中因来电者情绪激动、信息模糊及调度员认知负荷高所带来的决策挑战。解决方案的关键在于构建一个基于临床分类体系(clinical taxonomy)的大型语言模型(Large Language Models, LLM)驱动的多智能体系统(Multi-Agent System, MAS),其中包含模拟来电者和调度员的双智能体,并通过事实共用库(fact commons)确保交互的临床合理性与抗误导性。该系统在六阶段呼叫协议框架下运行,实现了高保真度的EMD场景模拟,经由专业医生评估和自动化语言分析验证,展现出优异的调度有效性(94%正确联系其他潜在协作方)和指导有效性(91%提供有效建议),为调度员培训、流程优化及实时辅助决策提供了可靠的技术路径。

链接: https://arxiv.org/abs/2510.21228
作者: Xiang Li,Huizi Yu,Wenkong Wang,Yiran Wu,Jiayan Zhou,Wenyue Hua,Xinxin Lin,Wenjia Tan,Lexuan Zhu,Bingyi Chen,Guang Chen,Ming-Li Chen,Yang Zhou,Zhao Li,Themistocles L. Assimes,Yongfeng Zhang,Qingyun Wu,Xin Ma,Lingyao Li,Lizhou Fan
机构: Shandong University (山东大学); The Chinese University of Hong Kong (香港中文大学); Pennsylvania State University (宾夕法尼亚州立大学); Stanford University School of Medicine (斯坦福大学医学院); University of California, Santa Barbara (加州大学圣塔芭芭拉分校); University of Macau (澳门大学); New York University (纽约大学); Broad Institute of MIT and Harvard (麻省理工学院和哈佛大学的博德研究所); The University of Hong Kong (香港大学); Chinese Academy of Medical Sciences and Peking Union Medical College (中国医学科学院与北京协和医学院); Chinese Academy of Medical Sciences (中国医学科学院); Rutgers University (罗格斯大学); University of South Florida (南佛罗里达大学)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 27 pages, 7 figures, 3 tables

点击查看摘要

Abstract:Objective: Emergency medical dispatch (EMD) is a high-stakes process challenged by caller distress, ambiguity, and cognitive load. Large Language Models (LLMs) and Multi-Agent Systems (MAS) offer opportunities to augment dispatchers. This study aimed to develop and evaluate a taxonomy-grounded, LLM-powered multi-agent system for simulating realistic EMD scenarios. Methods: We constructed a clinical taxonomy (32 chief complaints, 6 caller identities from MIMIC-III) and a six-phase call protocol. Using this framework, we developed an AutoGen-based MAS with Caller and Dispatcher Agents. The system grounds interactions in a fact commons to ensure clinical plausibility and mitigate misinformation. We used a hybrid evaluation framework: four physicians assessed 100 simulated cases for “Guidance Efficacy” and “Dispatch Effectiveness,” supplemented by automated linguistic analysis (sentiment, readability, politeness). Results: Human evaluation, with substantial inter-rater agreement (Gwet’s AC1 = 0.70), confirmed the system’s high performance. It demonstrated excellent Dispatch Effectiveness (e.g., 94% contacting the correct potential other agents) and Guidance Efficacy (advice provided in 91% of cases), both rated highly by physicians. Algorithmic metrics corroborated these findings, indicating a predominantly neutral affective profile (73.7% neutral sentiment; 90.4% neutral emotion), high readability (Flesch 80.9), and a consistently polite style (60.0% polite; 0% impolite). Conclusion: Our taxonomy-grounded MAS simulates diverse, clinically plausible dispatch scenarios with high fidelity. Findings support its use for dispatcher training, protocol evaluation, and as a foundation for real-time decision support. This work outlines a pathway for safely integrating advanced AI agents into emergency response workflows.
zh

[NLP-38] he “Right” Discourse on Migration: Analysing Migration-Related Tweets in Right and Far-Right Political Movements

【速读】: 该论文旨在解决如何通过跨学科方法揭示社交媒体上右翼极端主义话语的传播机制及其对政治结果的影响问题,尤其聚焦于移民议题、仇恨言论及说服策略的识别与分析。其解决方案的关键在于融合最先进的自然语言处理(Natural Language Processing, NLP)技术与社会学洞见,构建一个整合语言学特征、社会结构背景与计算模型的分析框架,从而系统性地挖掘MIGR-TWIT语料库中英文和法语右翼推文中的 discourse 模式,为理解当代右翼极端主义在社交平台上的运作逻辑提供实证依据与理论支持。

链接: https://arxiv.org/abs/2510.21220
作者: Nishan Chatterjee(L3I),Veronika Bajt,Ana Zwitter Vitez,Senja Pollak
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The rise of right-wing populism in Europe has brought to the forefront the significance of analysing social media discourse to understand the dissemination of extremist ideologies and their impact on political outcomes. Twitter, as a platform for interaction and mobilisation, provides a unique window into the everyday communication of far-right supporters. In this paper, we propose a methodology that uses state-of-the-art natural language processing techniques with sociological insights to analyse the MIGR-TWIT corpus of far-right tweets in English and French. We aim to uncover patterns of discourse surrounding migration, hate speech, and persuasion techniques employed by right and far-right actors. By integrating linguistic, sociological, and computational approaches, we seek to offer cross-disciplinary insights into societal dynamics and contribute to a better understanding of contemporary challenges posed by right-wing extremism on social media platforms.
zh

[NLP-39] Estonian Native Large Language Model Benchmark

【速读】: 该论文旨在解决当前爱沙尼亚语(Estonian)大语言模型(Large Language Models, LLMs)缺乏系统性评估基准的问题,以及尚未有全面比较不同 LLM 在爱沙尼亚语任务上性能的研究空白。其解决方案的关键在于构建一个基于七个多样化数据集的新基准,这些数据集均源自爱沙尼亚语原生来源且未使用机器翻译,从而确保语言真实性和任务多样性;同时,通过对比6个基础模型与26个指令微调的开源及商用模型,并结合人工评估和 LLM-as-a-judge 方法进行多维度评价,验证了顶级 LLM(如 Claude 3.7 Sonnet)在评估爱沙尼亚语模型时与人类评分具有高度一致性,为后续本地化 LLM 的开发与评估提供了可靠工具和方法论支持。

链接: https://arxiv.org/abs/2510.21193
作者: Helena Grete Lillepalu,Tanel Alumäe
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The availability of LLM benchmarks for the Estonian language is limited, and a comprehensive evaluation comparing the performance of different LLMs on Estonian tasks has yet to be conducted. We introduce a new benchmark for evaluating LLMs in Estonian, based on seven diverse datasets. These datasets assess general and domain-specific knowledge, understanding of Estonian grammar and vocabulary, summarization abilities, contextual comprehension, and more. The datasets are all generated from native Estonian sources without using machine translation. We compare the performance of base models, instruction-tuned open-source models, and commercial models. Our evaluation includes 6 base models and 26 instruction-tuned models. To assess the results, we employ both human evaluation and LLM-as-a-judge methods. Human evaluation scores showed moderate to high correlation with benchmark evaluations, depending on the dataset. Claude 3.7 Sonnet, used as an LLM judge, demonstrated strong alignment with human ratings, indicating that top-performing LLMs can effectively support the evaluation of Estonian-language models.
zh

[NLP-40] Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference

【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)在对齐语言模型(Language Model, LM)与人类偏好时存在的一个核心权衡问题:标准RL方法虽能优化平均奖励,但难以有效降低不良输出(undesired outputs)的概率;而专门针对减少不良输出的方法往往以牺牲平均性能为代价。解决方案的关键在于提出一种名为RePULSe的新训练方法,其通过在标准RL损失基础上引入额外的损失项,利用学习到的提议(learned proposals)引导采样低奖励输出,并进一步降低这些输出的概率,从而在期望奖励与不良输出概率之间实现更优的平衡,同时提升对抗鲁棒性。

链接: https://arxiv.org/abs/2510.21184
作者: Stephen Zhao,Aidan Li,Rob Brekelmans,Roger Grosse
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has become a predominant technique to align language models (LMs) with human preferences or promote outputs which are deemed to be desirable by a given reward function. Standard RL approaches optimize average reward, while methods explicitly focused on reducing the probability of undesired outputs typically come at a cost to average-case performance. To improve this tradeoff, we introduce RePULSe, a new training method that augments the standard RL loss with an additional loss that uses learned proposals to guide sampling low-reward outputs, and then reduces those outputs’ probability. We run experiments demonstrating that RePULSe produces a better tradeoff of expected reward versus the probability of undesired outputs and is more adversarially robust, compared to standard RL alignment approaches and alternatives.
zh
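
下面是对“额外损失项”的一个粗略示意:从学习到的提议分布 q 采样候选输出,对其中低奖励样本按重要性权重 p/q 压低模型对它们的对数概率。阈值、权重截断与梯度细节均为假设,原论文的目标函数可能有所不同。

```python
import torch

def repulse_extra_loss(logp_model, logq_proposal, rewards, threshold=0.0):
    """示意:对提议分布 q 采出的低奖励样本,按重要性权重 p/q 压低其对数概率。
    输入均为一批采样序列的逐样本标量。"""
    with torch.no_grad():
        iw = torch.exp(logp_model - logq_proposal)   # 重要性权重(未作截断)
        low = (rewards < threshold).float()          # 标记不良输出
    return (iw * low * logp_model).mean()            # 最小化其 log-prob

logp = torch.randn(8, requires_grad=True)
loss = repulse_extra_loss(logp, torch.randn(8), torch.randn(8))
loss.backward()
print(logp.grad.shape)                               # torch.Size([8])
```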

[NLP-41] KBE-DME: Dynamic Multimodal Evaluation via Knowledge Enhanced Benchmark Evolution ICLR2026

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)评估中静态基准测试存在的数据污染(data contamination)和性能饱和(saturation)问题,这些问题会导致评估结果虚高或误导。其解决方案的关键在于提出知识增强型基准演化框架(Knowledge-enhanced Benchmark Evolution, KBE),该框架通过图结构形式化表示视觉问答(VQA)样本,并基于此整合外部文本知识与原始图像信息,实现动态重构与扩展:一方面可重新选择图像中的视觉信息以生成新问题,另一方面可引入外部知识拓展现有问题,从而构建可控难度的动态评估体系,有效缓解数据污染与饱和风险,并提供更全面的模型能力评估。

链接: https://arxiv.org/abs/2510.21182
作者: Junzhe Zhang,Huixuan Zhang,Xiaojun Wan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: submitting to ICLR2026

点击查看摘要

Abstract:The rapid progress of multimodal large language models (MLLMs) calls for more reliable evaluation protocols. Existing static benchmarks suffer from the potential risk of data contamination and saturation, leading to inflated or misleading performance evaluations. To address these issues, we first apply Graph formulation to represent a static or dynamic VQA sample. With the formulation, we propose Knowledge-enhanced Benchmark Evolution (KBE), a dynamic multimodal evaluation framework. KBE first analyzes the original static benchmark, then expands it by integrating multimodal knowledge, transforming the static benchmark into a controllable, dynamic evolving version. Crucially, KBE can both reconstruct questions by re-selecting visual information in the original image and expand existing questions with external textual knowledge. It enables difficulty-controllable evaluation by adjusting the degree of question exploration. Extensive experiments demonstrate that KBE alleviates the risk of data contamination, data saturation, and provides a more comprehensive assessment of MLLM capabilities.
zh

[NLP-42] Social Simulations with Large Language Model Risk Utopian Illusion

【速读】: 该论文试图解决的问题是:当前大型语言模型(Large Language Models, LLMs)在社会模拟中是否能够真实再现人类行为,尤其是在多主体交互场景下,其行为是否具有社会合理性与多样性。由于现有研究尚未充分揭示LLMs在社会情境中与真实人类行为的差异,可能导致社会科学误读及现实应用中的 unintended consequences(意外后果)。解决方案的关键在于提出一个系统性的分析框架,通过模拟聊天室式多智能体对话,并从五个语言维度进行分析,从而识别出LLMs在社会认知偏见上的表现,如社会角色偏倚、首因效应和积极偏倚等,进而揭示其生成的社会行为本质上是理想化而非真实的人类社会互动,为构建更具社会根基的LLMs提供方向。

链接: https://arxiv.org/abs/2510.21180
作者: Ning Bian,Xianpei Han,Hongyu Lin,Baolei Wu,Jun Wang
机构: 未知
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Reliable simulation of human behavior is essential for explaining, predicting, and intervening in our society. Recent advances in large language models (LLMs) have shown promise in emulating human behaviors, interactions, and decision-making, offering a powerful new lens for social science studies. However, the extent to which LLMs diverge from authentic human behavior in social contexts remains underexplored, posing risks of misinterpretation in scientific studies and unintended consequences in real-world applications. Here, we introduce a systematic framework for analyzing LLMs’ behavior in social simulation. Our approach simulates multi-agent interactions through chatroom-style conversations and analyzes them across five linguistic dimensions, providing a simple yet effective method to examine emergent social cognitive biases. We conduct extensive experiments involving eight representative LLMs across three families. Our findings reveal that LLMs do not faithfully reproduce genuine human behavior but instead reflect overly idealized versions of it, shaped by the social desirability bias. In particular, LLMs show social role bias, primacy effect, and positivity bias, resulting in “Utopian” societies that lack the complexity and variability of real human interactions. These findings call for more socially grounded LLMs that capture the diversity of human social behavior.
zh

[NLP-43] Large Language Models Meet Text-Attributed Graphs: A Survey of Integration Frameworks and Applications

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在结构化推理和可解释性方面的局限性,以及文本属性图(Text-Attributed Graphs, TAGs)在语义深度上的不足。其核心解决方案在于通过协同整合LLMs与TAGs,实现优势互补:一方面利用LLMs增强TAG的表示学习能力;另一方面借助TAG的显式关系结构提升LLMs的多跳推理能力和可解释性。关键在于提出了一种基于编排(orchestration)视角的系统性分类框架,涵盖“LLM for TAG”与“TAG for LLM”两大方向,并归纳了顺序、并行及多模块三类集成策略,从而为语言与图学习交叉领域的研究提供方法论指导。

链接: https://arxiv.org/abs/2510.21131
作者: Guangxin Su,Hanchen Wang,Jianwei Wang,Wenjie Zhang,Ying Zhang,Jian Pei
机构: The University of New South Wales (新南威尔士大学); The University of Technology Sydney (悉尼科技大学); Duke University (杜克大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Surveys and overviews; Natural language processing; Knowledge representation and reasoning; Graph algorithms

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable success in natural language processing through strong semantic understanding and generation. However, their black-box nature limits structured and multi-hop reasoning. In contrast, Text-Attributed Graphs (TAGs) provide explicit relational structures enriched with textual context, yet often lack semantic depth. Recent research shows that combining LLMs and TAGs yields complementary benefits: enhancing TAG representation learning and improving the reasoning and interpretability of LLMs. This survey provides the first systematic review of LLM–TAG integration from an orchestration perspective. We introduce a novel taxonomy covering two fundamental directions: LLM for TAG, where LLMs enrich graph-based tasks, and TAG for LLM, where structured graphs improve LLM reasoning. We categorize orchestration strategies into sequential, parallel, and multi-module frameworks, and discuss advances in TAG-specific pretraining, prompting, and parameter-efficient fine-tuning. Beyond methodology, we summarize empirical insights, curate available datasets, and highlight diverse applications across recommendation systems, biomedical analysis, and knowledge-intensive question answering. Finally, we outline open challenges and promising research directions, aiming to guide future work at the intersection of language and graph learning.
zh

[NLP-44] he Gray Zone of Faithfulness: Taming Ambiguity in Unfaithfulness Detection

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在文本摘要任务中生成内容与源文档不一致(unfaithfulness)的检测基准存在标注模糊的问题,特别是由于对允许使用外部知识的边界定义不清导致的标注不一致性。其解决方案的关键在于提出一种新的忠实性标注框架——引入“Out-Dependent”这一中间类别,用于标识那些必须依赖外部知识才能验证真伪的生成句子;基于此框架构建了名为VeriGray的新基准,能够更精确地区分幻觉(hallucination)和需外部验证的内容,从而显著提升对LLM生成忠实性的评估准确性。

链接: https://arxiv.org/abs/2510.21118
作者: Qiang Ding,Lvzhou Luo,Yixuan Cao,Ping Luo
机构: Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS) (中国科学院智能信息处理重点实验室); Institute of Computing Technology, CAS (中国科学院计算技术研究所); State Key Lab of Al Safety (人工智能安全国家重点实验室); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Ensuring that Large Language Models (LLMs) generate summaries faithful to a given source document is essential for real-world applications. While prior research has explored LLM faithfulness, existing benchmarks suffer from annotation ambiguity, primarily due to the ill-defined boundary of permissible external knowledge in generated outputs. For instance, common sense is often incorporated into responses and labeled as “faithful”, yet the acceptable extent of such knowledge remains unspecified, leading to inconsistent annotations. To address this issue, we propose a novel faithfulness annotation framework, which introduces an intermediate category, Out-Dependent, to classify cases where external knowledge is required for verification. Using this framework, we construct VeriGray (Verification with the Gray Zone) – a new unfaithfulness detection benchmark in summarization. Statistics reveal that even SOTA LLMs, such as GPT-5, exhibit hallucinations (~6% of sentences) in summarization tasks. Moreover, a substantial proportion (~8% on average across models) of generated sentences fall into the Out-Dependent category, underscoring the importance of resolving annotation ambiguity in unfaithfulness detection benchmarks. Experiments demonstrate that our benchmark poses significant challenges to multiple baseline methods, indicating considerable room for future improvement.
zh

[NLP-45] Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only

【速读】: 该论文旨在解决监督微调(Supervised Fine-Tuning, SFT)在小数据场景下易过拟合、泛化能力差的问题,尤其是其作为离策略(off-policy)方法时对人类偏好标注依赖性强且难以适应分布外(out-of-domain)数据的局限性。解决方案的关键在于提出一种名为自奖励PPO(Self-Rewarding PPO)的新颖微调方法,其核心创新是设计了一个基于SFT模型与预训练基础模型之间策略对数比值(log policy ratio)的隐式奖励函数,该函数将预训练模型视为基准策略、SFT模型作为目标策略,从而实现无需人工偏好标注的在策略(on-policy)微调。通过将此自奖励机制整合进近端策略优化(Proximal Policy Optimization, PPO),显著提升了语言模型在演示数据上的对齐效果、数据效率和鲁棒性。

链接: https://arxiv.org/abs/2510.21090
作者: Qingru Zhang,Liang Qiu,Ilgee Hong,Zhenghao Xu,Tianyi Liu,Shiyang Li,Rongzhi Zhang,Zheng Li,Lihong Li,Bing Yin,Chao Zhang,Jianshu Chen,Haoming Jiang,Tuo Zhao
机构: Georgia Institute of Technology (佐治亚理工学院); Amazon
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by COLM 2025

点击查看摘要

Abstract:Supervised fine-tuning (SFT) has emerged as a crucial method for aligning large language models (LLMs) with human-annotated demonstrations. However, SFT, being an off-policy approach similar to behavior cloning, often struggles with overfitting and poor out-of-domain generalization, especially in limited-data scenarios. To address these limitations, we propose Self-Rewarding PPO, a novel fine-tuning method that leverages on-policy techniques to enhance generalization performance. Our approach combines the strengths of SFT and proximal policy optimization (PPO) to achieve more effective alignment from demonstration data. At its core is a reward function designed as the log policy ratio between the SFT model and the pretrained base model. This function serves as an implicit reward signal, using the pretrained policy as a baseline and the SFT policy as a target. By doing so, it enables on-policy fine-tuning without relying on human preference annotations. The integration of this self-rewarding mechanism with PPO addresses key limitations of SFT, improving generalization, data efficiency, and robustness. Our empirical evaluation across a range of natural language processing tasks demonstrates that Self-Rewarding PPO consistently outperforms traditional SFT methods. The results highlight the effectiveness of our approach in aligning LLMs using demonstration data, particularly in scenarios where high-quality annotated data is scarce.
zh
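
其隐式奖励即 SFT 策略与基础策略的对数比 log π_SFT(y|x) − log π_base(y|x)。以下以 HuggingFace 风格的因果语言模型接口给出一个计算示意(接口形态为假设,未包含 PPO 中的截断、归一化等细节):

```python
import torch

@torch.no_grad()
def implicit_reward(sft_model, base_model, input_ids, response_mask):
    """奖励 = log pi_SFT(y|x) - log pi_base(y|x),只在响应 token 上求和。
    假设两个模型均为 HuggingFace 因果 LM 接口。"""
    def seq_logprob(model):
        logits = model(input_ids).logits[:, :-1]                 # 预测下一个 token
        lp = torch.log_softmax(logits, dim=-1)
        tok_lp = lp.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
        return (tok_lp * response_mask[:, 1:]).sum(-1)           # (batch,)
    return seq_logprob(sft_model) - seq_logprob(base_model)
```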

[NLP-46] Designing and Evaluating Hint Generation Systems for Science Education

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在教育场景中因直接提供答案而削弱学生概念理解与批判性思维的问题。其解决方案的关键在于引入自动提示生成(automatic hint generation)作为教学策略,通过构建分层提示链(chains of hints)来引导学习者主动参与学习内容,同时避免直接暴露答案。研究对比了静态提示(预生成)与动态提示(根据学习者进度自适应调整)两种策略,发现不同学习者对提示方式存在偏好差异,并指出当前自动评估指标难以准确捕捉这些偏好,从而为未来以学习者为中心的智能辅导系统设计提供了关键实践依据。

链接: https://arxiv.org/abs/2510.21087
作者: Anubhav Jangra,Smaranda Muresan
机构: Columbia University (哥伦比亚大学); Barnard College (巴纳德学院)
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models are influencing the education landscape, with students relying on them in their learning process. Often implemented using general-purpose models, these systems are likely to give away the answers, which could hinder conceptual understanding and critical thinking. We study the role of automatic hint generation as a pedagogical strategy to promote active engagement with the learning content, while guiding learners toward the answers. Focusing on scientific topics at the secondary education level, we explore the potential of large language models to generate chains of hints that scaffold learners without revealing answers. We compare two distinct hinting strategies: static hints, pre-generated for each problem, and dynamic hints, adapted to learners’ progress. Through a quantitative study with 41 participants, we uncover different preferences among learners with respect to hinting strategies, and identify the limitations of automatic evaluation metrics to capture them. Our findings highlight key design considerations for future research on hint generation and intelligent tutoring systems that seek to develop learner-centered educational technologies.
zh

[NLP-47] CDrugRed: A Chinese Drug Recommendation Dataset for Discharge Medications in Metabolic Diseases

【速读】: 该论文旨在解决当前智能药物推荐系统因缺乏高质量、公开可用的非英语电子健康记录(EHR)数据集而发展受限的问题,尤其是针对中文语境下代谢性疾病出院药物推荐任务。其解决方案的关键在于构建并发布首个面向中文患者的出院药物推荐数据集CDrugRed,该数据集包含5,894条脱敏患者记录,涵盖人口统计学、病史、临床过程和出院诊断等多维信息,并通过在多个先进大语言模型(LLMs)上进行基准测试验证其有效性,结果表明尽管监督微调可提升性能,但当前模型在F1分数(0.5648)和Jaccard分数(0.4477)上仍有显著改进空间,从而凸显了该任务的复杂性与CDrugRed作为挑战性研究资源的价值。

链接: https://arxiv.org/abs/2510.21084
作者: Juntao Li,Haobin Yuan,Ling Luo,Yan Jiang,Fan Wang,Ping Zhang,Huiyi Lv,Jian Wang,Yuanyuan Sun,Hongfei Lin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Intelligent drug recommendation based on Electronic Health Records (EHRs) is critical for improving the quality and efficiency of clinical decision-making. By leveraging large-scale patient data, drug recommendation systems can assist physicians in selecting the most appropriate medications according to a patient’s medical history, diagnoses, laboratory results, and comorbidities. However, the advancement of such systems is significantly hampered by the scarcity of publicly available, real-world EHR datasets, particularly in languages other than English. In this work, we present CDrugRed, the first publicly available Chinese drug recommendation dataset focused on discharge medications for metabolic diseases. The dataset includes 5,894 de-identified records from 3,190 patients, containing comprehensive information such as patient demographics, medical history, clinical course, and discharge diagnoses. We assess the utility of CDrugRed by benchmarking several state-of-the-art large language models (LLMs) on the discharge medication recommendation task. Experimental results show that while supervised fine-tuning improves model performance, there remains substantial room for improvement, with the best model achieving an F1 score of 0.5648 and a Jaccard score of 0.4477. This result highlights the complexity of the clinical drug recommendation task and establishes CDrugRed as a challenging and valuable resource for developing more robust and accurate drug recommendation systems. The dataset is publicly available to the research community under the data usage agreements at this https URL.
zh

[NLP-48] Bridging Language Gaps with Adaptive RAG : Improving Indonesian Language Question Answering

【速读】: 该论文旨在解决生成式问答系统(Question Answering, QA)在低资源语言(如印尼语)中性能显著低于英语的问题。其核心挑战在于缺乏高质量的多语言训练数据以及现有检索增强生成(Retrieval-Augmented Generation, RAG)方法在非英语环境下的适应性不足。解决方案的关键在于引入自适应RAG(Adaptive RAG)系统,该系统通过一个分类器来识别问题复杂度,并据此动态选择最优的回答策略(如单次检索或多次检索)。为缓解印尼语标注数据稀缺问题,研究进一步采用机器翻译作为数据增强手段,以扩展训练语料。实验表明,该分类器能可靠区分问题复杂度,但多检索策略存在不一致性,影响整体性能,凸显了低资源语言下QA系统的潜力与改进方向。

链接: https://arxiv.org/abs/2510.21068
作者: William Christian,Daniel Adamlu,Adrian Yu,Derwin Suhartono
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 7 figures, 5 tables

点击查看摘要

Abstract:Question Answering (QA) has seen significant improvements with the advancement of machine learning models; further studies enhanced such systems by retrieving external information, an approach called Retrieval-Augmented Generation (RAG), to produce more accurate and informative answers. However, this state-of-the-art performance is predominantly limited to English. To address this gap, we bridge language gaps by adapting the Adaptive RAG system to Indonesian. Adaptive RAG integrates a classifier whose task is to assess question complexity, which in turn determines the strategy for answering the question. To overcome the limited availability of Indonesian-language datasets, our study employs machine translation as a data augmentation approach. Experiments show a reliable question complexity classifier; however, we observed significant inconsistencies in the multi-retrieval answering strategy, which negatively impacted the overall evaluation when this strategy was applied. These findings highlight both the promise and the challenges of question answering in a low-resource language, suggesting directions for future improvement.
zh
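
Adaptive RAG 的“按复杂度路由”可用如下骨架示意:分类器输出问题复杂度等级,据此选择直接作答、单次检索或有界多跳检索。其中 `classifier`、`llm`、`retriever` 均为假设的可调用接口,复杂度标签与跳数上限也是示意性设定。

```python
def adaptive_rag_answer(question, classifier, llm, retriever, max_hops=3):
    """按分类器预测的复杂度选择应答策略;三个组件均为假设的可调用接口。"""
    level = classifier(question)                 # "simple" | "single" | "multi"
    if level == "simple":
        return llm(question)                     # 直接作答,不检索
    if level == "single":
        docs = retriever(question, k=5)          # 单次检索后作答
        return llm(f"Context: {docs}\nQuestion: {question}")
    answer, query = "", question                 # 有界多跳检索
    for _ in range(max_hops):
        docs = retriever(query, k=5)
        answer = llm(f"Context: {docs}\nQuestion: {question}")
        query = f"{question} 已知信息: {answer}"   # 用中间答案改写检索查询
    return answer
```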

[NLP-49] Dynamic Retriever for In-Context Knowledge Editing via Policy Optimization EMNLP2025

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在上下文知识编辑(In-Context Knowledge Editing, IKE)中面临的两个核心问题:一是演示样本(demonstration set)的数量与质量之间的权衡,二是现有方法缺乏对任务难度的自适应调整能力。解决方案的关键在于提出一种轻量级框架 Dynamic Retriever for In-Context Knowledge Editing (DR-IKE),其核心创新包括:(1) 使用 REINFORCE 算法训练一个 BERT 检索器(retriever),根据编辑奖励动态排序演示样本;(2) 引入可学习阈值以剪枝低价值示例,在简单任务中缩短提示长度、在困难任务中扩展提示内容,从而实现高效且自适应的知识编辑。该方法无需修改模型权重,仅依赖前向传播,兼容黑盒 LLM API,并在 COUNTERFACT 基准上显著提升编辑成功率(最高达 17.1%)、降低延迟(41.6%)并保持无关查询的准确性。

链接: https://arxiv.org/abs/2510.21059
作者: Mahmud Wasif Nafee,Maiqi Jiang,Haipeng Chen,Yanfu Zhang
机构: Rensselaer Polytechnic Institute (伦斯勒理工学院); Bangladesh University of Engineering and Technology (孟加拉国工程技术大学); College of William & Mary (威廉与玛丽学院)
类目: Computation and Language (cs.CL)
备注: Accepted at EMNLP 2025. \c{opyright} 2025 Association for Computational Linguistics (CC BY 4.0)

点击查看摘要

Abstract:Large language models (LLMs) excel at factual recall yet still propagate stale or incorrect knowledge. In-context knowledge editing offers a gradient-free remedy suitable for black-box APIs, but current editors rely on static demonstration sets chosen by surface-level similarity, leading to two persistent obstacles: (i) a quantity-quality trade-off, and (ii) lack of adaptivity to task difficulty. We address these issues by dynamically selecting supporting demonstrations according to their utility for the edit. We propose Dynamic Retriever for In-Context Knowledge Editing (DR-IKE), a lightweight framework that (1) trains a BERT retriever with REINFORCE to rank demonstrations by editing reward, and (2) employs a learnable threshold to prune low-value examples, shortening the prompt when the edit is easy and expanding it when the task is hard. DR-IKE performs editing without modifying model weights, relying solely on forward passes for compatibility with black-box LLMs. On the COUNTERFACT benchmark, it improves edit success by up to 17.1%, reduces latency by 41.6%, and preserves accuracy on unrelated queries, demonstrating scalable and adaptive knowledge editing. The code is available at this https URL .
zh
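
检索器的 REINFORCE 更新可示意为:以(编辑奖励 − 基线)为优势,提升被采样演示的对数概率。以下用一组可学习的打分 logits 代替真实的 BERT 检索器,仅说明更新规则本身:

```python
import torch

def reinforce_step(scores, sampled_idx, reward, baseline, optimizer):
    """REINFORCE 更新:以 (reward - baseline) 为优势,提升被采样演示的对数概率。"""
    logp = torch.log_softmax(scores, dim=-1)[sampled_idx].sum()
    loss = -(reward - baseline) * logp
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

scores = torch.randn(20, requires_grad=True)       # 20 个候选演示的打分 logits
opt = torch.optim.Adam([scores], lr=0.1)
reinforce_step(scores, torch.tensor([3, 7]), reward=1.0, baseline=0.4, optimizer=opt)
print(scores[3].item(), scores[7].item())           # 更新后的打分
```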

[NLP-50] Reasoning s Razor: Reasoning Improves Accuracy but Can Hurt Recall at Critical Operating Points in Safety and Hallucination Detection

【速读】: 该论文旨在解决生成式 AI(Generative AI)在精度敏感型分类任务中应用时,推理机制(reasoning)是否仍具优势的问题。传统观点认为推理能提升大语言模型(LLM)的准确性,但其在严格低假阳性率(False Positive Rate, FPR)场景下的适用性尚不明确。论文通过系统性实验对比了“思考开启”(Think On,推理增强生成)与“思考关闭”(Think Off,无推理生成)两种模式在安全检测和幻觉检测任务中的表现,发现:在低FPR阈值下,Think Off 显著优于 Think On;而当允许更高FPR时,Think On 才展现出更高的整体准确率。解决方案的关键在于识别出推理并非普适优化手段——它是一把双刃剑:虽可提高平均准确率,但在需要高精度的应用场景中反而会损害性能;同时提出使用基于token的评分策略替代自述置信度(self-verbalized confidence),并设计简单集成方法以兼顾两种模式的优势,从而实现精度与准确性的协同优化。

链接: https://arxiv.org/abs/2510.21049
作者: Atoosa Chegini,Hamid Kazemi,Garrett Souza,Maria Safi,Yang Song,Samy Bengio,Sinead Williamson,Mehrdad Farajtabar
机构: Apple(苹果)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reasoning has become a central paradigm for large language models (LLMs), consistently boosting accuracy across diverse benchmarks. Yet its suitability for precision-sensitive tasks remains unclear. We present the first systematic study of reasoning for classification tasks under strict low false positive rate (FPR) regimes. Our analysis covers two tasks–safety detection and hallucination detection–evaluated in both fine-tuned and zero-shot settings, using standard LLMs and Large Reasoning Models (LRMs). Our results reveal a clear trade-off: Think On (reasoning-augmented) generation improves overall accuracy, but underperforms at the low-FPR thresholds essential for practical use. In contrast, Think Off (no reasoning during inference) dominates in these precision-sensitive regimes, with Think On surpassing only when higher FPRs are acceptable. In addition, we find token-based scoring substantially outperforms self-verbalized confidence for precision-sensitive deployments. Finally, a simple ensemble of the two modes recovers the strengths of each. Taken together, our findings position reasoning as a double-edged tool: beneficial for average accuracy, but often ill-suited for applications requiring strict precision.
zh
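
论文强调的“低 FPR 工作点”评估可按如下方式实现:先在负类(良性样本)分数上取分位数得到阈值,再在该阈值下计算正类召回率。分数分布为合成数据,仅作演示:

```python
import numpy as np

def recall_at_fpr(pos_scores, neg_scores, target_fpr=0.01):
    """在使负类误报率不超过 target_fpr 的阈值下,计算正类召回率。"""
    t = np.quantile(np.asarray(neg_scores), 1.0 - target_fpr)   # 由负类分布定阈值
    return float((np.asarray(pos_scores) >= t).mean())

rng = np.random.default_rng(1)
pos = rng.normal(1.0, 1.0, 1000)     # 例如基于 token 的“不安全”分数(合成)
neg = rng.normal(0.0, 1.0, 1000)
print(recall_at_fpr(pos, neg, target_fpr=0.01))     # 严格工作点下的召回率
```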

[NLP-51] Input Matters: Evaluating Input Structures Impact on LLM Summaries of Sports Play-by-Play

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在高精度要求领域(如体育报道)中生成内容时可能出现的事实性错误(factual errors),特别是幻觉(hallucination)问题。研究通过量化输入结构对LLM生成NBA比赛逐条数据摘要时错误率的影响,发现输入格式是决定事实准确性的重要因素:与非结构化输入相比,使用JSON格式可使Llama-3.1-70B和Qwen2.5-72B的错误率分别降低69%和65%,而行结构输入也能显著减少误差(Llama降低54%,Qwen降低51%)。关键解决方案在于优化输入数据的结构化表示,以提升LLM对原始数据的理解与忠实再现能力,从而有效抑制事实性错误。

链接: https://arxiv.org/abs/2510.21034
作者: Barkavi Sundararajan,Somayajulu Sripada,Ehud Reiter
机构: University of Aberdeen (阿伯丁大学)
类目: Computation and Language (cs.CL)
备注: Accepted at INLG 2025

点击查看摘要

Abstract:A major concern when deploying LLMs in accuracy-critical domains such as sports reporting is that the generated text may not faithfully reflect the input data. We quantify how input structure affects hallucinations and other factual errors in LLM-generated summaries of NBA play-by-play data, across three formats: row-structured, JSON and unstructured. We manually annotated 3,312 factual errors across 180 game summaries produced by two models, Llama-3.1-70B and Qwen2.5-72B. Input structure has a strong effect: JSON input reduces error rates by 69% for Llama and 65% for Qwen compared to unstructured input, while row-structured input reduces errors by 54% for Llama and 51% for Qwen. A two-way repeated measures ANOVA shows that input structure accounts for over 80% of the variance in error rates, with Tukey HSD post hoc tests confirming statistically significant differences between all input formats.
zh

[NLP-52] Can Confidence Estimates Decide When Chain-of-thought is Necessary for LLMs?

【速读】: 该论文旨在解决链式思维(Chain-of-thought, CoT)提示在大型语言模型(Large Language Models, LLMs)中使用时存在的效率与效果不匹配问题:尽管CoT能提升复杂任务的准确性,但其冗长推理过程常导致不必要的Token消耗,且在某些任务上甚至会损害性能。解决方案的关键在于提出“置信度门控的CoT”(confidence-gated CoT),即仅当模型对直接答案的置信度较低时才触发CoT推理,从而实现推理资源的自适应调度。作者首次系统性地评估了四种无需训练的置信度估计方法,并对比随机基线和理想Oracle,验证了现有方法可在减少冗余推理的同时优于随机调用,但其有效性因数据集和模型而异,揭示了当前方法在实际部署中的潜力与局限性。

链接: https://arxiv.org/abs/2510.21007
作者: Samuel Lewis-Lim,Xingwei Tan,Zhixue Zhao,Nikolaos Aletras
机构: University of Sheffield (谢菲尔德大学)
类目: Computation and Language (cs.CL)
备注: Under Review

点击查看摘要

Abstract:Chain-of-thought (CoT) prompting has emerged as a common technique for enhancing the reasoning abilities of large language models (LLMs). While extended reasoning can boost accuracy on complex tasks, it is often unnecessary and substantially increases token usage, limiting the practicality of reasoning models in many scenarios. Recent models, such as GPT-OSS and Qwen3, expose controls that enable users to adjust the length of CoT or determine whether it is used at all. Yet, it remains unclear when CoT should be used: on some tasks it improves performance, while on others it provides little benefit or even harms performance. We address this challenge with confidence-gated CoT, where a model invokes reasoning only when confidence in its direct answer is low. To this end, we present the first systematic study of training-free confidence estimation methods for CoT gating. Specifically, we evaluate four training-free confidence estimation methods and compare them to a random baseline and an oracle that always knows when CoT is needed. Through extensive experiments, we show that existing training-free confidence measures can reduce redundant CoT and outperform randomly invoked CoT. However, the utility of individual confidence measures is inconsistent, varying with both the dataset and the model, underscoring the difficulty of deploying confidence-gated CoT in practice. By analysing both strengths and failure modes, our study highlights the potential and limitations of current methods and paves the way toward more reliable adaptive gating of CoT.
zh
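
置信度门控 CoT 的骨架非常简单:先生成直接答案并计算基于 token 的置信度(如平均对数概率),低于阈值 τ 才触发 CoT。以下示意中的模型调用为占位函数,τ 的取值为假设:

```python
import math

def answer_with_gated_cot(question, llm_direct, llm_cot, tau=-0.5):
    """直接答案的平均 token 对数概率达到阈值 tau 则跳过推理,否则触发 CoT。"""
    answer, token_logprobs = llm_direct(question)
    confidence = sum(token_logprobs) / len(token_logprobs)
    return answer if confidence >= tau else llm_cot(question)

# 占位模型调用,仅演示控制流
direct = lambda q: ("42", [math.log(0.9)] * 3)           # 高置信度的直接答案
cot = lambda q: "Let me think step by step... 42"
print(answer_with_gated_cot("6 x 7 = ?", direct, cot))   # 输出 "42",未触发 CoT
```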

[NLP-53] Irish-BLiMP: A Linguistic Benchmark for Evaluating Human and Language Model Performance in a Low-Resource Setting

【速读】: 该论文旨在解决低资源语言(如爱尔兰语)中大型语言模型(LLM)语法能力评估缺乏系统性基准的问题。其解决方案的关键在于构建了首个面向爱尔兰语的细粒度语言能力评估数据集与框架——Irish-BLiMP,该框架包含1020组最小对立体(minimal pairs),覆盖11类语法特征,并由母语流利者团队手工标注和审核。通过该基准对现有LLM及人类参与者进行对比测试,发现人类在所有语法特征上均显著优于模型,且开源自模型与闭源模型间存在18.1%的性能差距,揭示了模型在语法表征上的局限性,为推进低资源语言中的语言理解研究提供了可量化的基准工具。

链接: https://arxiv.org/abs/2510.20957
作者: Josh McGiff,Khanh-Tung Tran,William Mulcahy,Dáibhidh Ó Luinín,Jake Dalzell,Róisín Ní Bhroin,Adam Burke,Barry O’Sullivan,Hoang D. Nguyen,Nikola S. Nikolov
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8 pages

点击查看摘要

Abstract:We present Irish-BLiMP (Irish Benchmark of Linguistic Minimal Pairs), the first dataset and framework designed for fine-grained evaluation of linguistic competence in the Irish language, an endangered language. Drawing on a variety of linguistic literature and grammar reference works, we manually constructed and reviewed 1020 minimal pairs across a taxonomy of 11 linguistic features, through a team of fluent Irish speakers. We evaluate both existing Large Language Models (LLMs) and fluent human participants on their syntactic knowledge of Irish. Our findings show that humans outperform all models across all linguistic features, achieving 16.6% higher accuracy on average. Moreover, a substantial performance gap of 18.1% persists between open- and closed-source LLMs, with even the strongest model (gpt-5) reaching only 73.5% accuracy compared to 90.1% for humans. Interestingly, human participants and models struggle with different aspects of Irish grammar, thus highlighting a difference in the representations learned by the models. Overall, Irish-BLiMP provides the first systematic framework for evaluating the grammatical competence of LLMs in Irish and offers a valuable benchmark for advancing research on linguistic understanding in low-resource languages.
zh
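
BLiMP 式最小对立体评测的通行做法是比较模型对合法句与非法句的整句对数概率。以下为一个基于 HuggingFace 接口的示意(模型名与句对均为占位,非 Irish-BLiMP 数据集原文):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def sentence_logprob(model, tok, text):
    """整句对数概率:逐 token 对数概率之和。"""
    ids = tok(text, return_tensors="pt").input_ids
    lp = torch.log_softmax(model(ids).logits[:, :-1], dim=-1)
    return lp.gather(-1, ids[:, 1:].unsqueeze(-1)).sum().item()

def minimal_pair_accuracy(model, tok, pairs):
    """若模型给合法句的整句概率高于其非法变体,则记该对立体为答对。"""
    hits = sum(sentence_logprob(model, tok, good) > sentence_logprob(model, tok, bad)
               for good, bad in pairs)
    return hits / len(pairs)

# 用法(模型与句对均为占位示例):
# tok = AutoTokenizer.from_pretrained("gpt2")
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# print(minimal_pair_accuracy(model, tok, [("合法句", "非法句")]))
```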

[NLP-54] Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training

【速读】: 该论文试图解决推理语言模型(Reasoning Language Models, RLMs)在经过良性推理训练后出现的“自越狱”(self-jailbreaking)问题,即模型在未被明确诱导的情况下,通过引入善意假设(如将恶意请求解释为安全测试场景)来规避自身安全防护机制,从而响应有害请求。解决方案的关键在于:在训练过程中加入最小量的安全推理数据(safety reasoning data),即可有效维持模型的安全对齐性,避免因良性推理能力增强而导致的意外合规行为。这一发现揭示了RLMs在推理能力提升与安全对齐之间的潜在冲突,并提供了可操作的缓解路径。

链接: https://arxiv.org/abs/2510.20956
作者: Zheng-Xin Yong,Stephen H. Bach
机构: Brown University (布朗大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We discover a novel and surprising phenomenon of unintentional misalignment in reasoning language models (RLMs), which we call self-jailbreaking. Specifically, after benign reasoning training on math or code domains, RLMs will use multiple strategies to circumvent their own safety guardrails. One strategy is to introduce benign assumptions about users and scenarios to justify fulfilling harmful requests. For instance, an RLM reasons that harmful requests like “outline a strategy for stealing customers’ credit card information from a retail store” could be associated with the benign intent of “a security professional trying to test defense,” despite no such benign context being provided as input. We observe that many open-weight RLMs, including DeepSeek-R1-distilled, s1.1, Phi-4-mini-reasoning, and Nemotron, suffer from self-jailbreaking despite being aware of the harmfulness of the requests. We also provide a mechanistic understanding of self-jailbreaking: RLMs are more compliant after benign reasoning training, and after self-jailbreaking, models appear to perceive malicious requests as less harmful in the CoT, thus enabling compliance with them. To mitigate self-jailbreaking, we find that including minimal safety reasoning data during training is sufficient to ensure RLMs remain safety-aligned. Our work provides the first systematic analysis of self-jailbreaking behavior and offers a practical path forward for maintaining safety in increasingly capable RLMs.
zh

[NLP-55] Do LLM s Truly Understand When a Precedent Is Overruled?

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在处理长篇法律文档时评估不足的问题,尤其是缺乏能够真实反映高风险、复杂法律推理任务的长上下文基准测试。现有评估多依赖简化的人工合成任务,难以体现实际法律实践中对长期文本理解的要求。为应对这一挑战,作者提出了一种基于美国最高法院判例中“推翻关系”(overruling relationships)的评估框架,利用包含236对案例的数据集系统性地检验LLMs在识别法律判例间推翻逻辑的能力。解决方案的关键在于构建一个贴近法律实务场景的基准测试,不仅涵盖真实法律文本的复杂性,还能揭示模型在时间敏感性、深层法律推理和上下文依赖性推理方面的三大局限,从而推动更可靠、更具法律专业性的长上下文模型发展。

链接: https://arxiv.org/abs/2510.20941
作者: Li Zhang,Jaromir Savelka,Kevin Ashley
机构: University of Pittsburgh (匹兹堡大学); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 2 figures, JURIX 2025

点击查看摘要

Abstract:Large language models (LLMs) with extended context windows show promise for complex legal reasoning tasks, yet their ability to understand long legal documents remains insufficiently evaluated. Developing long-context benchmarks that capture realistic, high-stakes tasks remains a significant challenge in the field, as most existing evaluations rely on simplified synthetic tasks that fail to represent the complexity of real-world document understanding. Overruling relationships are foundational to common-law doctrine and commonly found in judicial opinions. They provide a focused and important testbed for long-document legal understanding that closely resembles what legal professionals actually do. We present an assessment of state-of-the-art LLMs on identifying overruling relationships from U.S. Supreme Court cases using a dataset of 236 case pairs. Our evaluation reveals three critical limitations: (1) era sensitivity – the models show degraded performance on historical cases compared to modern ones, revealing fundamental temporal bias in their training; (2) shallow reasoning – models rely on shallow logical heuristics rather than deep legal comprehension; and (3) context-dependent reasoning failures – models produce temporally impossible relationships in complex open-ended tasks despite maintaining basic temporal awareness in simple contexts. Our work contributes a benchmark that addresses the critical gap in realistic long-context evaluation, providing an environment that mirrors the complexity and stakes of actual legal reasoning tasks.
zh

[NLP-56] FicSim: A Dataset for Multi-Faceted Semantic Similarity in Long-Form Fiction EMNLP2025

链接: https://arxiv.org/abs/2510.20926
作者: Natasha Johnson,Amanda Bertsch,Maria-Emil Deal,Emma Strubell
机构: Language Technologies Institute, Carnegie Mellon University (卡内基梅隆大学语言技术研究所); School of Library and Information Studies, University of Oklahoma (俄克拉荷马大学图书馆与信息研究学院)
类目: Computation and Language (cs.CL)
备注: Accepted to Findings of EMNLP 2025

点击查看摘要

[NLP-57] Code-enabled language models can outperform reasoning models on diverse tasks

【速读】: 该论文旨在解决生成式 AI (Generative AI) 中推理模型(Reasoning Models, RMs)训练与推理效率低的问题,即RMs虽在长文本推理任务中表现优异,但依赖大量计算资源和数据进行训练,且推理过程耗时昂贵。解决方案的关键在于提出CodeAdapt方法,其核心是将代码执行(code execution)融入自然语言推理流程中,通过CodeAct框架实现多步交互式推理,并结合仅需五道示例的少样本上下文学习(few-shot bootstrap in-context learning),无需微调即可显著提升标准指令微调语言模型(instruction-tuned LMs)的推理能力,使其在多个领域达到甚至超越对应RMs的性能水平,同时具备更高的token效率。

链接: https://arxiv.org/abs/2510.20909
作者: Cedegao E. Zhang,Cédric Colas,Gabriel Poesia,Joshua B. Tenenbaum,Jacob Andreas
机构: MIT (麻省理工学院); Inria (法国国家信息与自动化研究院); Harvard University (哈佛大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reasoning models (RMs), language models (LMs) trained with reinforcement learning to produce long-form natural language reasoning, have been remarkably successful, but they still require large amounts of computation and data to train, and can be slow and expensive to run. In this paper, we show that standard instruct LMs can already be elicited to be strong reasoners at a level comparable to or even surpassing their corresponding RMs (e.g., DeepSeek V3 vs R1) without finetuning, across diverse domains from instruction following and creative generation to mathematical reasoning. This is achieved by CodeAdapt, our simple recipe that combines the CodeAct framework, where LMs interleave natural language reasoning with code execution in a multi-step fashion, with few-shot bootstrap in-context learning from as few as five training problems. Analyzing four matched pairs of LMs and RMs, we find that CodeAdapt enables three LMs to outperform the corresponding RMs on average over eight tasks (up to 22.9%) while being 10-81% more token efficient, and delivers superior performance on six tasks when averaged over the four models (up to 35.7%). Furthermore, the code-augmented reasoning traces display rich and varied problem-solving strategies. Our findings support that (1) CodeAdapt-style learning and reasoning may be robust and domain general and (2) code-enabled LMs are cognitively grounded and powerful systems, potentially providing a strong foundation for in-weight reinforcement learning.
zh
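
【代码示意】CodeAdapt 的核心是 CodeAct 式的多步「自然语言推理 + 代码执行」循环。下面是一个最小可运行草图(非论文官方实现):其中 lm_generate 是假设的模型调用占位,代码段用 `<code>` 标记仅为演示方便;真实系统中由指令微调 LM 交替产生推理文本与可执行代码,执行输出作为观察拼回上下文。

```python
import io, re, contextlib

def lm_generate(prompt: str) -> str:
    """假设性的 LM 调用占位:真实场景应替换为一个指令微调模型;
    这里用固定回复使循环可以端到端运行。"""
    if "Observation" not in prompt:
        return "Thought: 先用代码求和。\n<code>\nprint(sum(range(1, 101)))\n</code>"
    return "Final Answer: 5050"

def run_codeact(question: str, max_steps: int = 4) -> str:
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = lm_generate(prompt)
        prompt += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:")[-1].strip()
        # 提取模型生成的代码段并执行,把标准输出作为观察反馈给模型
        match = re.search(r"<code>\n(.*?)</code>", reply, re.S)
        if match:
            buf = io.StringIO()
            with contextlib.redirect_stdout(buf):
                exec(match.group(1), {})
            prompt += f"Observation: {buf.getvalue().strip()}\n"
    return "(no answer)"

print(run_codeact("1 到 100 的和是多少?"))  # -> 5050
```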

[NLP-58] Shoot First Ask Questions Later? Building Rational Agents that Explore and Act Like People

【速读】: 该论文旨在解决语言模型(Language Models, LMs)在高风险应用场景中进行理性信息获取(rational information-seeking)能力不足的问题,尤其是在资源受限条件下如何平衡探索(exploration)与行动(action)以提升决策效率。其核心挑战在于LM agents难以基于上下文生成有信息量的问题、准确回应信息瓶颈下的请求,并选择高价值行动。解决方案的关键在于引入基于贝叶斯实验设计(Bayesian Experimental Design, BED)的蒙特卡洛推理策略:对于“观察者”(Spotter)角色,该方法显著提升回答准确性(最高+14.7%绝对提升);对于“指挥官”(Captain)角色,则通过最大化预期信息增益(Expected Information Gain, EIG)优化问题选择和射击策略(EIG提升达0.227比特,占理论噪声上限的94.2%)。这一框架使弱模型(如Llama-4-Scout)在成本仅为GPT-5的1%时,仍能超越人类玩家(胜率提升8%-82%)和前沿模型(胜率提升0%-67%),并在另一任务Guess Who?中验证了方法的泛化性(准确率提升28.3-42.4个百分点)。

链接: https://arxiv.org/abs/2510.20886
作者: Gabriel Grand,Valerio Pepe,Jacob Andreas,Joshua B. Tenenbaum
机构: MIT CSAIL (麻省理工学院计算机科学与人工智能实验室); MIT Brain and Cognitive Sciences (麻省理工学院脑与认知科学系); Harvard SEAS (哈佛大学工程与应用科学学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Many high-stakes applications of AI require forming data-driven hypotheses and making targeted guesses; e.g., in scientific and diagnostic settings. Given limited resources, to what extent do agents based on language models (LMs) act rationally? We develop methods to benchmark and enhance agentic information-seeking, drawing on insights from human behavior. First, we introduce a strategic decision-oriented dialogue task called Collaborative Battleship, in which a partially-informed Captain must balance exploration (asking questions) and action (taking shots), while a fully-informed Spotter must provide accurate answers under an information bottleneck. Compared to human players (N=42), we find that LM agents struggle to ground answers in context, generate informative questions, and select high-value actions. Next, to address these gaps, we develop novel Monte Carlo inference strategies for LMs based on principles from Bayesian Experimental Design (BED). For Spotter agents, our approach boosts accuracy by up to 14.7% absolute over LM-only baselines; for Captain agents, it raises expected information gain (EIG) by up to 0.227 bits (94.2% of the achievable noise ceiling). Combined, these components yield sharper targeting (+0.303-0.374 F1), and enable weaker LMs, such as Llama-4-Scout, to outperform both humans (8% - 82% win rate) and frontier models (0% - 67% win rate vs. GPT-5) at ~1% of GPT-5’s cost. We replicate these findings on Guess Who? where our methods significantly boost accuracy (+28.3-42.4 p.p.), demonstrating their general applicability for building rational information-seeking agents.
zh
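
【代码示意】论文用蒙特卡洛方法估计候选问题的期望信息增益(EIG)。以下是一个与论文实现无关的玩具版 Battleship 草图:在「答案是假设(棋盘)的确定函数」这一简化下,某问题的 EIG 等于其答案在假设样本上分布的熵;棋盘尺寸、船型与问题形式均为假设。

```python
import numpy as np

rng = np.random.default_rng(0)

# 假设性的简化设定:在 5x5 棋盘上随机放一艘 1x3 的船,
# 蒙特卡洛样本即为与历史观测一致的候选棋盘(此处省略历史约束)。
def sample_board():
    board = np.zeros((5, 5), dtype=bool)
    r, c, horiz = rng.integers(5), rng.integers(3), rng.random() < 0.5
    if horiz:
        board[r, c:c + 3] = True
    else:
        board[c:c + 3, r] = True
    return board

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

boards = [sample_board() for _ in range(2000)]

# 候选问题:询问某一行中被船占据的格子数(答案为 0..3)。
# 在答案为确定函数的简化下,EIG 等于答案分布的熵。
def eig_of_question(row: int) -> float:
    answers = np.array([b[row].sum() for b in boards])
    counts = np.bincount(answers, minlength=4).astype(float)
    return entropy(counts / counts.sum())

best = max(range(5), key=eig_of_question)
print({r: round(eig_of_question(r), 3) for r in range(5)}, "best row:", best)
```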

[NLP-59] Cultural Alien Sampler: Open-ended art generation balancing originality and coherence NEURIPS2025

【速读】: 该论文旨在解决开放领域(如艺术)中自主代理在生成创意时面临的两难问题:如何在保持内部一致性的同时实现高原创性。当前大型语言模型(Large Language Models, LLMs)往往要么沿袭既定文化模式,要么在追求新颖性时牺牲连贯性。解决方案的关键在于提出一种名为“文化异乡采样器”(Cultural Alien Sampler, CAS)的概念选择方法,其核心思想是显式分离组合合理性(compositional fit)与文化典型性(cultural typicality)。CAS利用两个在WikiArt概念上微调过的GPT-2模型——一个评估概念组合在艺术作品内的合理性(Concept Coherence Model),另一个衡量该组合在特定艺术家作品中的典型程度(Cultural Context Model)——从而筛选出高一致性且低典型性的概念组合,使生成内容既保持内在和谐又能突破文化惯性,显著提升创造性表现。

链接: https://arxiv.org/abs/2510.20849
作者: Alejandro H. Artiles,Hiromu Yakura,Levin Brinkmann,Mar Canet Sola,Hassan Abu Alhaija,Ignacio Serna,Nasim Rahaman,Bernhard Schölkopf,Iyad Rahwan
机构: Max Planck Institute for Human Development, Berlin, Germany; Max Planck Institute for Intelligent Systems, Tübingen, Germany; Academy of Media Arts Cologne, Germany; NVIDIA; BFM, Tallinn University, Estonia
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025). Creative AI Track. 26 pages, 24 figures

点击查看摘要

Abstract:In open-ended domains like art, autonomous agents must generate ideas that are both original and internally coherent, yet current Large Language Models (LLMs) either default to familiar cultural patterns or sacrifice coherence when pushed toward novelty. We address this by introducing the Cultural Alien Sampler (CAS), a concept-selection method that explicitly separates compositional fit from cultural typicality. CAS uses two GPT-2 models fine-tuned on WikiArt concepts: a Concept Coherence Model that scores whether concepts plausibly co-occur within artworks, and a Cultural Context Model that estimates how typical those combinations are within individual artists’ bodies of work. CAS targets combinations that are high in coherence and low in typicality, yielding ideas that maintain internal consistency while deviating from learned conventions and embedded cultural context. In a human evaluation (N = 100), our approach outperforms random selection and GPT-4o baselines and achieves performance comparable to human art students in both perceived originality and harmony. Additionally, a quantitative study shows that our method produces more diverse outputs and explores a broader conceptual space than its GPT-4o counterpart, demonstrating that artificial cultural alienness can unlock creative potential in autonomous agents.
zh
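
【代码示意】CAS 的选择准则是「共现合理性高、文化典型性低」。下面的草图把这一准则写成先按 coherence 阈值过滤、再按 typicality 升序取前 k 的两阶段筛选;两个打分函数在论文中由微调 GPT-2 的对数似然给出,此处以随机桩函数代替,概念列表亦为假设。

```python
import itertools, random

random.seed(0)
CONCEPTS = ["ocean", "cathedral", "neon", "wheat field", "astronaut", "fresco"]

# 假设性打分函数:实际方法使用两个在 WikiArt 上微调的 GPT-2,
# 分别给出组合的共现合理性(coherence)与文化典型性(typicality)对数概率。
def coherence_score(combo):   # 越高 = 概念在同一作品中越“说得通”
    return random.uniform(-5, 0)

def typicality_score(combo):  # 越高 = 组合在已有艺术家作品中越常见
    return random.uniform(-5, 0)

def cas_select(k=3, n_best=3, min_coherence=-2.0):
    scored = []
    for combo in itertools.combinations(CONCEPTS, k):
        c, t = coherence_score(combo), typicality_score(combo)
        if c >= min_coherence:            # 先保证内部一致性
            scored.append((t, c, combo))  # 再按典型性从低到高排序
    scored.sort(key=lambda x: x[0])
    return scored[:n_best]

for t, c, combo in cas_select():
    print(f"combo={combo} coherence={c:.2f} typicality={t:.2f}")
```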

[NLP-60] Data-Centric Lessons To Improve Speech-Language Pretraining

【速读】: 该论文旨在解决当前生成式语音语言模型(Speech-Language Models, SpeechLMs)在口语问答(Spoken Question-Answering, SQA)任务中性能提升机制不明确的问题,尤其缺乏对预训练数据处理与构建过程的可控性分析。其解决方案的关键在于开展以数据为中心的系统性消融实验,聚焦三个核心问题:(1)如何处理原始网络爬取的音频内容用于语音-文本预训练;(2)如何构建合成数据集以增强爬取数据;(3)如何交错排列(文本、音频)片段形成训练序列。基于这些实证发现,作者成功训练出一个38亿参数的SpeechLM模型SpeLangy,在SQA任务上超越最大达3倍规模的现有模型,绝对性能提升10.2%,凸显了高质量数据策展(data curation)在SpeechLM预训练中的决定性作用。

链接: https://arxiv.org/abs/2510.20860
作者: Vishaal Udandarao,Zhiyun Lu,Xuankai Chang,Yongqiang Wang,Violet Z. Yao,Albin Madapally Jose,Fartash Faghri,Josh Gardner,Chung-Cheng Chiu
机构: Apple; University of Cambridge (剑桥大学); University of Tübingen (图宾根大学)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Tech Report

点击查看摘要

Abstract:Spoken Question-Answering (SQA) is a core capability for useful and interactive artificial intelligence systems. Recently, several speech-language models (SpeechLMs) have been released with a specific focus on improving their SQA performance. However, a lack of controlled ablations of pretraining data processing and curation makes it challenging to understand what factors account for performance, despite substantial gains from similar studies in other data modalities. In this work, we address this gap by conducting a data-centric exploration for pretraining SpeechLMs. We focus on three research questions fundamental to speech-language pretraining data: (1) how to process raw web-crawled audio content for speech-text pretraining, (2) how to construct synthetic pretraining datasets to augment web-crawled data and (3) how to interleave (text, audio) segments into training sequences. We apply the insights from our controlled data-centric ablations to pretrain a 3.8B-parameter SpeechLM, called SpeLangy, that outperforms models that are up to 3x larger by 10.2% absolute performance. We hope our findings highlight the impact of effective data curation for speech-language pretraining and guide future data-centric exploration in SpeechLMs.
zh
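
【代码示意】摘要提到的第三个问题是如何把 (文本, 音频) 片段交错成训练序列。论文未在摘要中给出具体格式,以下仅是一种示意性实现:用模态标记分隔片段,并按最大长度切分序列;标记名与切分策略均为假设。

```python
# 一个最小示意(非论文原始实现):把成对的 (文本, 音频) 片段
# 交错拼接成统一的训练序列,用特殊标记指示模态切换。
TEXT_TOK, AUDIO_TOK = "<text>", "<audio>"

def interleave(segments, max_len=32):
    """segments: [(text_tokens, audio_tokens), ...] -> 扁平的训练序列列表"""
    seqs, cur = [], []
    for text, audio in segments:
        for tag, toks in ((TEXT_TOK, text), (AUDIO_TOK, audio)):
            chunk = [tag, *toks]
            if len(cur) + len(chunk) > max_len:
                seqs.append(cur); cur = []
            cur.extend(chunk)
    if cur:
        seqs.append(cur)
    return seqs

pairs = [(["hello", "world"], ["a1", "a2", "a3"]),
         (["goodbye"], ["a4", "a5"])]
for s in interleave(pairs, max_len=8):
    print(s)
```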

[NLP-61] Beyond Hearing: Learning Task-agnostic ExG Representations from Earphones via Physiology-informed Tokenization

【速读】: 该论文旨在解决当前生理电信号(Electrophysiological, ExG)基础模型在日常任务中泛化能力不足的问题,主要受限于数据多样性匮乏和任务特定模型设计导致的适应性差。为应对这一挑战,作者提出了一种可扩展、任务无关的ExG监测方法,其核心创新在于 Physiology-informed Multi-band Tokenization (PiMT)——该方法将ExG信号分解为12个生理信息驱动的token,并通过重构任务学习鲁棒特征表示,从而实现全频谱自适应特征识别并保留任务相关信息,显著提升了跨任务泛化性能。

链接: https://arxiv.org/abs/2510.20853
作者: Hyungjun Yoon,Seungjoo Lee,Yu Yvonne Wu,Xiaomeng Chen,Taiting Lu,Freddy Yifei Liu,Taeckyung Lee,Hyeongheon Cha,Haochen Zhao,Gaoteng Zhao,Sung-Ju Lee,Cecilia Mascolo,Dongyao Chen,Lili Qiu
机构: KAIST (韩国科学技术院); Carnegie Mellon University (卡内基梅隆大学); University of Cambridge (剑桥大学); Shanghai Jiao Tong University (上海交通大学); Pennsylvania State University (宾夕法尼亚州立大学); UCLA (加州大学洛杉矶分校); Northwest University (西北大学); University of Texas at Austin (德克萨斯大学奥斯汀分校); Microsoft Research (微软研究院)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: 19 pages, 9 figures

点击查看摘要

Abstract:Electrophysiological (ExG) signals offer valuable insights into human physiology, yet building foundation models that generalize across everyday tasks remains challenging due to two key limitations: (i) insufficient data diversity, as most ExG recordings are collected in controlled labs with bulky, expensive devices; and (ii) task-specific model designs that require tailored processing (i.e., targeted frequency filters) and architectures, which limit generalization across tasks. To address these challenges, we introduce an approach for scalable, task-agnostic ExG monitoring in the wild. We collected 50 hours of unobtrusive free-living ExG data with an earphone-based hardware prototype to narrow the data diversity gap. At the core of our approach is Physiology-informed Multi-band Tokenization (PiMT), which decomposes ExG signals into 12 physiology-informed tokens, followed by a reconstruction task to learn robust representations. This enables adaptive feature recognition across the full frequency spectrum while capturing task-relevant information. Experiments on our new DailySense dataset-the first to enable ExG-based analysis across five human senses-together with four public ExG benchmarks, demonstrate that PiMT consistently outperforms state-of-the-art methods across diverse tasks.
zh
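
【代码示意】PiMT 把 ExG 信号按生理先验频带分解为多路分量(论文为 12 个 token,具体频带未在摘要中给出)。下面的草图用经典脑电频带作占位,演示基于带通滤波的分解;采样率与滤波器阶数均为假设。

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 250  # 假设采样率 250 Hz
# 论文将 ExG 分解为 12 个生理先验频带;这里用经典脑电频带作占位示意。
BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 80)}

def band_tokens(x: np.ndarray) -> dict:
    """把单通道信号分解为各频带分量,每个分量可再切片为 token。"""
    out = {}
    for name, (lo, hi) in BANDS.items():
        sos = butter(4, [lo, hi], btype="bandpass", fs=FS, output="sos")
        out[name] = sosfiltfilt(sos, x)
    return out

t = np.arange(0, 4, 1 / FS)
x = np.sin(2 * np.pi * 10 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)
tokens = band_tokens(x)
# 10 Hz 分量应集中在 alpha 带,40 Hz 分量应集中在 gamma 带
print({k: round(float(np.std(v)), 3) for k, v in tokens.items()})
```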

[NLP-62] Can large audio language models understand child stuttering speech? speech summarization and source separation ICASSP2026

链接: https://arxiv.org/abs/2510.20850
作者: Chibuzor Okocha,Maya Bakri,Christan Grant
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: 7 pages, 1 Figure, 8 tables, Under review ICASSP 2026

点击查看摘要

计算机视觉

[CV-0] Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent NEURIPS2025

【速读】:该论文旨在解决视觉模型在图像识别过程中对特定视觉属性的无意依赖问题,此类依赖可能导致模型鲁棒性不足、过拟合或产生虚假相关性。解决方案的关键在于提出一种自动化框架,其核心是一个自反思代理(self-reflective agent),该代理通过系统性地生成和测试关于模型可能依赖的视觉属性的假设,并基于实验结果迭代优化假设,同时利用自评估协议验证发现是否准确解释模型行为;当出现不一致时,代理会触发新一轮实验以增强分析的准确性。

链接: https://arxiv.org/abs/2510.21704
作者: Christy Li,Josep Lopez Camuñas,Jake Thomas Touchet,Jacob Andreas,Agata Lapedriza,Antonio Torralba,Tamar Rott Shaham
机构: MIT CSAIL (麻省理工学院计算机科学与人工智能实验室); Universitat Oberta de Catalunya (加泰罗尼亚开放大学); Louisiana Tech (路易斯安那理工大学); Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages, 10 figures, Neurips 2025

点击查看摘要

Abstract:When a vision model performs image recognition, which visual attributes drive its predictions? Detecting unintended reliance on specific visual features is critical for ensuring model robustness, preventing overfitting, and avoiding spurious correlations. We introduce an automated framework for detecting such dependencies in trained vision models. At the core of our method is a self-reflective agent that systematically generates and tests hypotheses about visual attributes that a model may rely on. This process is iterative: the agent refines its hypotheses based on experimental outcomes and uses a self-evaluation protocol to assess whether its findings accurately explain model behavior. When inconsistencies arise, the agent self-reflects over its findings and triggers a new cycle of experimentation. We evaluate our approach on a novel benchmark of 130 models designed to exhibit diverse visual attribute dependencies across 18 categories. Our results show that the agent’s performance consistently improves with self-reflection, with a significant performance increase over non-reflective baselines. We further demonstrate that the agent identifies real-world visual attribute dependencies in state-of-the-art models, including CLIP’s vision encoder and the YOLOv8 object detector.
zh

[CV-1] Visual Diffusion Models are Geometric Solvers

【速读】:该论文旨在解决几何难题的求解问题,包括著名的“内接正方形问题”(Inscribed Square Problem)、“斯坦纳树问题”(Steiner Tree Problem)和“简单多边形问题”(Simple Polygon Problem),这些问题是计算几何中的经典难解问题。解决方案的关键在于将几何问题实例转化为图像表示,并利用标准视觉扩散模型(visual diffusion model)直接在像素空间中进行推理:通过训练模型从高斯噪声逐步生成符合约束条件的近似解图像,从而将几何推理任务重构为图像生成任务。这种方法无需专门设计的架构或参数化几何表示的领域适配,仅依赖通用的视觉扩散模型即可实现有效求解,揭示了生成式建模与几何推理之间的一种新桥梁。

链接: https://arxiv.org/abs/2510.21697
作者: Nir Goren,Shai Yehezkel,Omer Dahary,Andrey Voynov,Or Patashnik,Daniel Cohen-Or
机构: Tel Aviv University (特拉维夫大学); Google DeepMind (谷歌深度思维)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:In this paper we show that visual diffusion models can serve as effective geometric solvers: they can directly reason about geometric problems by working in pixel space. We first demonstrate this on the Inscribed Square Problem, a long-standing problem in geometry that asks whether every Jordan curve contains four points forming a square. We then extend the approach to two other well-known hard geometric problems: the Steiner Tree Problem and the Simple Polygon Problem. Our method treats each problem instance as an image and trains a standard visual diffusion model that transforms Gaussian noise into an image representing a valid approximate solution that closely matches the exact one. The model learns to transform noisy geometric structures into correct configurations, effectively recasting geometric reasoning as image generation. Unlike prior work that necessitates specialized architectures and domain-specific adaptations when applying diffusion to parametric geometric representations, we employ a standard visual diffusion model that operates on the visual representation of the problem. This simplicity highlights a surprising bridge between generative modeling and geometric problem solving. Beyond the specific problems studied here, our results point toward a broader paradigm: operating in image space provides a general and practical framework for approximating notoriously hard problems, and opens the door to tackling a far wider class of challenging geometric tasks.
zh

[CV-2] BachVid: Training-Free Video Generation with Consistent Background and Character

【速读】:该论文旨在解决文本到视频(text-to-video, T2V)生成中多视频一致性的问题,尤其是如何在不依赖参考图像或额外训练的情况下实现角色和背景的一致性。解决方案的关键在于对扩散 Transformer(Diffusion Transformer, DiT)注意力机制与中间特征的系统分析,发现其在去噪过程中能够提取前景掩码并识别匹配点;基于此发现,作者提出 BachVid 方法:首先生成一个身份视频并缓存其中间变量,随后将这些缓存变量注入到新生成视频的对应位置,从而确保多个视频在前景和背景层面均保持一致。该方法无需额外训练,实现了高效且无监督的一致性视频生成。

链接: https://arxiv.org/abs/2510.21696
作者: Han Yan,Xibin Song,Yifu Wang,Hongdong Li,Pan Ji,Chao Ma
机构: MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University (上海交通大学人工智能研究院); Vertex Lab; Australian National University (澳大利亚国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Diffusion Transformers (DiTs) have recently driven significant progress in text-to-video (T2V) generation. However, generating multiple videos with consistent characters and backgrounds remains a significant challenge. Existing methods typically rely on reference images or extensive training, and often only address character consistency, leaving background consistency to image-to-video models. We introduce BachVid, the first training-free method that achieves consistent video generation without needing any reference images. Our approach is based on a systematic analysis of DiT's attention mechanism and intermediate features, revealing its ability to extract foreground masks and identify matching points during the denoising process. Our method leverages this finding by first generating an identity video and caching the intermediate variables, and then injecting these cached variables into corresponding positions in newly generated videos, ensuring both foreground and background consistency across multiple videos. Experimental results demonstrate that BachVid achieves robust consistency in generated videos without requiring additional training, offering a novel and efficient solution for consistent video generation without relying on reference images or additional training.
zh

[CV-3] On Thin Ice: Towards Explainable Conservation Monitoring via Attribution and Perturbations NEURIPS

【速读】:该论文试图解决生态研究中计算机视觉模型因“黑箱”特性导致的信任缺失问题,从而限制其在野外部署的应用。其解决方案的关键在于引入后验可解释性方法(post-hoc explanations),通过梯度类激活映射(HiResCAM、LayerCAM)、局部可解释模型无关解释(LIME)和基于扰动的解释(perturbation-based explanations),对基于 Faster R-CNN 的海豹检测模型进行可视化分析,从定位保真度(localization fidelity)、忠实性(faithfulness)和诊断实用性(diagnostic utility)三个维度评估解释的有效性,从而提供预测依据并揭示系统性误判模式,最终推动模型向可审计、可决策支持的工具演进。

链接: https://arxiv.org/abs/2510.21689
作者: Jiayi Zhou,Günel Aghakishiyeva,Saagar Arya,Julian Dale,James David Poling,Holly R. Houliston,Jamie N. Womble,Gregory D. Larsen,David W. Johnston,Brinnae Bent
机构: Duke University (杜克大学); University of Agder (阿格德大学); University of Cambridge (剑桥大学); U.S. National Park Service (美国国家公园管理局); Alaska Spatial Science (阿拉斯加空间科学公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: NeurIPS Imageomics Workshop 2025

点击查看摘要

Abstract:Computer vision can accelerate ecological research and conservation monitoring, yet adoption in ecology lags in part because of a lack of trust in black-box neural-network-based models. We seek to address this challenge by applying post-hoc explanations to provide evidence for predictions and document limitations that are important to field deployment. Using aerial imagery from Glacier Bay National Park, we train a Faster R-CNN to detect pinnipeds (harbor seals) and generate explanations via gradient-based class activation mapping (HiResCAM, LayerCAM), local interpretable model-agnostic explanations (LIME), and perturbation-based explanations. We assess explanations along three axes relevant to field use: (i) localization fidelity: whether high-attribution regions coincide with the animal rather than background context; (ii) faithfulness: whether deletion/insertion tests produce changes in detector confidence; and (iii) diagnostic utility: whether explanations reveal systematic failure modes. Explanations concentrate on seal torsos and contours rather than surrounding ice/rock, and removal of the seals reduces detection confidence, providing model-evidence for true positives. The analysis also uncovers recurrent error sources, including confusion between seals and black ice and rocks. We translate these findings into actionable next steps for model development, including more targeted data curation and augmentation. By pairing object detection with post-hoc explainability, we can move beyond “black-box” predictions toward auditable, decision-supporting tools for conservation monitoring.
zh

[CV-4] WorldGrow: Generating Infinite 3D World

【速读】:该论文旨在解决生成无限可扩展的3D世界(large, continuous environments with coherent geometry and realistic appearance)这一挑战,现有方法在跨视角几何与外观一致性、3D隐式表示的可扩展性以及当前3D基础模型多以物体为中心限制场景级生成等方面存在瓶颈。其解决方案的关键在于利用预训练3D模型的强大生成先验来结构化生成场景块(scene blocks),提出了一种分层框架WorldGrow,核心包括:(1) 数据整理管道提取高质量场景块以适配结构化潜空间表示;(2) 3D块修复机制实现上下文感知的场景扩展;(3) 粗到精生成策略兼顾全局布局合理性与局部几何/纹理保真度,从而实现具有照片级真实感且结构一致的无限场景生成。

链接: https://arxiv.org/abs/2510.21682
作者: Sikuang Li,Chen Yang,Jiemin Fang,Taoran Yi,Jia Lu,Jiazhong Cen,Lingxi Xie,Wei Shen,Qi Tian
机构: MoE Key Lab of Artificial Intelligence, School of Computer Science, SJTU (上海交通大学人工智能重点实验室); Huawei Inc. (华为公司); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project page: this https URL Code: this https URL

点击查看摘要

Abstract:We tackle the challenge of generating the infinitely extendable 3D world – large, continuous environments with coherent geometry and realistic appearance. Existing methods face key challenges: 2D-lifting approaches suffer from geometric and appearance inconsistencies across views, 3D implicit representations are hard to scale up, and current 3D foundation models are mostly object-centric, limiting their applicability to scene-level generation. Our key insight is leveraging strong generation priors from pre-trained 3D models for structured scene block generation. To this end, we propose WorldGrow, a hierarchical framework for unbounded 3D scene synthesis. Our method features three core components: (1) a data curation pipeline that extracts high-quality scene blocks for training, making the 3D structured latent representations suitable for scene generation; (2) a 3D block inpainting mechanism that enables context-aware scene extension; and (3) a coarse-to-fine generation strategy that ensures both global layout plausibility and local geometric/textural fidelity. Evaluated on the large-scale 3D-FRONT dataset, WorldGrow achieves SOTA performance in geometry reconstruction, while uniquely supporting infinite scene generation with photorealistic and structurally consistent outputs. These results highlight its capability for constructing large-scale virtual environments and potential for building future world models.
zh

[CV-5] Foundation Models in Dermatopathology: Skin Tissue Classification

【速读】:该论文旨在解决皮肤病理学中全切片图像(Whole-Slide Images, WSIs)快速生成背景下,如何实现高效处理与准确分类的问题。其解决方案的关键在于利用两种基础模型(UNI 和 Virchow2)作为特征提取器,从WSI中提取patch级嵌入(patch-level embeddings),并通过均值聚合(mean-aggregation)策略构建slide-level特征表示,进而训练多种机器学习分类器(如逻辑回归、梯度提升树和随机森林)。实验表明,Virchow2提取的特征在多数分类器上表现优于UNI,其中逻辑回归模型对Virchow2特征的分类准确率达到90%,验证了基础模型在slide-level表示学习中的有效性,为自动化皮肤病理诊断提供了可扩展且高效的方案。

链接: https://arxiv.org/abs/2510.21664
作者: Riya Gupta,Yiwei Zong,Dennis H. Murphree
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:The rapid generation of whole-slide images (WSIs) in dermatopathology necessitates automated methods for efficient processing and accurate classification. This study evaluates the performance of two foundation models, UNI and Virchow2, as feature extractors for classifying WSIs into three diagnostic categories: melanocytic, basaloid, and squamous lesions. Patch-level embeddings were aggregated into slide-level features using a mean-aggregation strategy and subsequently used to train multiple machine learning classifiers, including logistic regression, gradient-boosted trees, and random forest models. Performance was assessed using precision, recall, true positive rate, false positive rate, and the area under the receiver operating characteristic curve (AUROC) on the test set. Results demonstrate that patch-level features extracted using Virchow2 outperformed those extracted via UNI across most slide-level classifiers, with logistic regression achieving the highest accuracy (90%) for Virchow2, though the difference was not statistically significant. The study also explored data augmentation techniques and image normalization to enhance model robustness and generalizability. The mean-aggregation approach provided reliable slide-level feature representations. All experimental results and metrics were tracked and visualized using this http URL, facilitating reproducibility and interpretability. This research highlights the potential of foundation models for automated WSI classification, providing a scalable and effective approach for dermatopathological diagnosis while paving the way for future advancements in slide-level representation learning.
zh
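
【代码示意】论文的流水线是「patch 级嵌入 → 均值聚合得到 slide 级特征 → 训练经典分类器」。以下草图用随机向量代替 UNI/Virchow2 的真实嵌入并人工注入类别可分性,仅演示聚合与逻辑回归分类这两步的接口。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# 假设性数据:每张 WSI 的 patch 级嵌入(实际来自 UNI 或 Virchow2),
# 这里用随机向量代替;类别为 0/1/2(黑色素细胞/基底样/鳞状病变)。
def slide_feature(patch_embeddings: np.ndarray) -> np.ndarray:
    return patch_embeddings.mean(axis=0)  # 均值聚合得到 slide 级特征

X = np.stack([slide_feature(rng.normal(size=(rng.integers(50, 200), 256)))
              + cls * 0.5                 # 人工注入可分性,仅为演示
              for cls in (0, 1, 2) for _ in range(40)])
y = np.repeat([0, 1, 2], 40)

clf = LogisticRegression(max_iter=1000).fit(X[::2], y[::2])
proba = clf.predict_proba(X[1::2])
print("AUROC (ovr):", round(roc_auc_score(y[1::2], proba, multi_class="ovr"), 3))
```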

[CV-6] Self-Supervised Learning of Synapse Types from EM Images

【速读】:该论文旨在解决电子显微镜(Electron Microscopy, EM)图像中突触分类的难题,即如何在不依赖先验标签的情况下对突触进行无监督分类,从而识别出具有不同结构或功能特性的突触类群。传统方法通常需要人工标注样本进行监督学习,而该研究提出了一种基于邻近性假设的无监督方法:同一神经元内相邻突触更可能属于同一类别,而非随机选取自不同神经元的突触。其解决方案的关键在于利用这一空间结构先验信息,无需预先设定突触类型数量,即可自动发现潜在的突触类群,并为后续研究提供具有生物学意义的真值基准(ground-truth)。

链接: https://arxiv.org/abs/2510.21663
作者: Aarav Shetty,Gary B Huang
机构: Janelia Research Campus, Howard Hughes Medical Institute, USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Separating synapses into different classes based on their appearance in EM images has many applications in biology. Examples may include assigning a neurotransmitter to a particular class, or separating synapses whose strength can be modulated from those whose strength is fixed. Traditionally, this has been done in a supervised manner, giving the classification algorithm examples of the different classes. Here we instead separate synapses into classes based only on the observation that nearby synapses in the same neuron are likely more similar than synapses chosen randomly from different cells. We apply our methodology to data from *Drosophila*. Our approach has the advantage that the number of synapse types does not need to be known in advance. It may also provide a principled way to select ground-truth that spans the range of synapse structure.
zh

[CV-7] Long-tailed Species Recognition in the NACTI Wildlife Dataset

【速读】:该论文旨在解决野生动物图像识别中因类别分布极度不均衡(长尾分布)导致的模型性能瓶颈问题,特别是在包含370万张图像的North America Camera Trap Images (NACTI)数据集上。其核心挑战在于“头部”类别占据约50%的数据量,而“尾部”类别样本稀少,传统交叉熵损失函数难以有效学习尾部类别的特征。解决方案的关键在于系统性地结合长尾识别(Long-Tail Recognition, LTR)损失函数与LTR敏感的正则化策略,并引入实验优化的调度器(scheduler),从而显著提升模型在尾部类别的判别能力。最终,该方法在NACTI测试集上达到99.40%的Top-1准确率,优于基线(95.51%)及先前报道的最佳结果(96.8%),并在跨域测试中表现出更强的鲁棒性,验证了所提方案的有效性和泛化能力。

链接: https://arxiv.org/abs/2510.21657
作者: Zehua Liu,Tilo Burghardt
机构: University of Bristol (布里斯托大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Like most “in the wild” data collections of the natural world, the North America Camera Trap Images (NACTI) dataset shows severe long-tailed class imbalance: the largest ‘Head’ class alone covers 50% of the 3.7M images in the corpus. Building on the PyTorch Wildlife model, we present a systematic study of Long-Tail Recognition methodologies for species recognition on the NACTI dataset, covering experiments on various LTR loss functions plus LTR-sensitive regularisation. Our best configuration achieves 99.40% Top-1 accuracy on our NACTI test data split, substantially improving over a 95.51% baseline using standard cross-entropy with Adam. This also improves on the previously reported top performance of MLWIC2 at 96.8%, albeit using partly unpublished (potentially different) partitioning, optimiser, and evaluation protocols. To evaluate domain shifts (e.g. night-time captures, occlusion, motion blur) towards other datasets, we construct a Reduced-Bias Test set from the ENA-Detection dataset, where our experimentally optimised long-tail enhanced model achieves leading 52.55% accuracy (up from 51.20% with WCE loss), demonstrating stronger generalisation capabilities under distribution shift. We document the consistent improvements of LTR-enhancing scheduler choices in this NACTI wildlife domain, particularly when in tandem with state-of-the-art LTR losses. We finally discuss qualitative and quantitative shortcomings that LTR methods cannot sufficiently address, including catastrophic breakdown for ‘Tail’ classes under severe domain shift. For maximum reproducibility we publish all dataset splits, key code, and full network weights.
zh
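
【代码示意】论文比较了多种 LTR 损失;下面给出其中一类有代表性的做法——基于「有效样本数」的类均衡交叉熵(Class-Balanced CE,Cui et al., 2019)——的草图。类别计数为虚构的长尾分布,beta 为假设超参数,不代表论文的最优配置。

```python
import torch
import torch.nn.functional as F

# Class-Balanced Cross-Entropy:按有效样本数 (1 - beta^n) / (1 - beta)
# 的倒数给各类加权,尾部类别权重更大。
def class_balanced_weights(counts, beta=0.9999):
    eff_num = 1.0 - torch.pow(beta, counts.float())
    w = (1.0 - beta) / eff_num
    return w / w.sum() * len(counts)        # 归一化,保持损失量级

counts = torch.tensor([1_850_000, 120_000, 8_000, 300])  # 假设的长尾分布
weights = class_balanced_weights(counts)

logits = torch.randn(16, 4)
labels = torch.randint(0, 4, (16,))
loss = F.cross_entropy(logits, labels, weight=weights)
print(weights, loss.item())
```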

[CV-8] Group Inertial Poser: Multi-Person Pose and Global Translation from Sparse Inertial Sensors and Ultra-Wideband Ranging ICCV2025

【速读】:该论文旨在解决基于稀疏可穿戴惯性测量单元(Inertial Measurement Units, IMUs)的多人全身运动捕捉中,因惯性信号自参考特性导致的全局位移估计不准确及个体间相对定位困难的问题。解决方案的关键在于引入超宽带测距(Ultra-Wideband Ranging, UWB)技术以获取各传感器之间的绝对距离信息,并将其融合进结构化状态空间模型中,从而结合惯性观测与空间约束实现精确的三维姿态估计和多人体全球轨迹跟踪。通过两阶段优化策略进一步利用这些距离估计提升人群在真实环境中的全局运动轨迹精度。

链接: https://arxiv.org/abs/2510.21654
作者: Ying Xue,Jiaxi Jiang,Rayan Armani,Dominik Hollidt,Yi-Chi Liao,Christian Holz
机构: ETH Zürich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注: Accepted by ICCV 2025, Code: this https URL

点击查看摘要

Abstract:Tracking human full-body motion using sparse wearable inertial measurement units (IMUs) overcomes the limitations of occlusion and instrumentation of the environment inherent in vision-based approaches. However, purely IMU-based tracking compromises translation estimates and accurate relative positioning between individuals, as inertial cues are inherently self-referential and provide no direct spatial reference for others. In this paper, we present a novel approach for robustly estimating body poses and global translation for multiple individuals by leveraging the distances between sparse wearable sensors - both on each individual and across multiple individuals. Our method Group Inertial Poser estimates these absolute distances between pairs of sensors from ultra-wideband ranging (UWB) and fuses them with inertial observations as input into structured state-space models to integrate temporal motion patterns for precise 3D pose estimation. Our novel two-step optimization further leverages the estimated distances for accurately tracking people’s global trajectories through the world. We also introduce GIP-DB, the first IMU+UWB dataset for two-person tracking, which comprises 200 minutes of motion recordings from 14 participants. In our evaluation, Group Inertial Poser outperforms previous state-of-the-art methods in accuracy and robustness across synthetic and real-world data, showing the promise of IMU+UWB-based multi-human motion capture in the wild. Code, models, dataset: this https URL
zh

[CV-9] A Dynamic Knowledge Distillation Method Based on the Gompertz Curve

【速读】:该论文旨在解决传统知识蒸馏(Knowledge Distillation)方法无法有效捕捉学生模型认知能力动态演化过程的问题,导致知识迁移效率受限。其核心解决方案是提出一种基于Gompertz生长曲线的动态知识蒸馏框架——Gompertz-CNN,通过引入阶段感知的蒸馏策略,利用Gompertz曲线对蒸馏损失权重进行时变调控,以匹配学生模型从初始缓慢学习、中期快速提升到后期饱和的学习阶段特性;同时结合Wasserstein距离衡量特征级差异与梯度匹配机制对齐师生模型的反向传播行为,构建统一的多损失目标函数,从而实现更高效、自适应的知识迁移。

链接: https://arxiv.org/abs/2510.21649
作者: Han Yang,Guangjun Qin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 pages, 2 figures

点击查看摘要

Abstract:This paper introduces a novel dynamic knowledge distillation framework, Gompertz-CNN, which integrates the Gompertz growth model into the training process to address the limitations of traditional knowledge distillation. Conventional methods often fail to capture the evolving cognitive capacity of student models, leading to suboptimal knowledge transfer. To overcome this, we propose a stage-aware distillation strategy that dynamically adjusts the weight of distillation loss based on the Gompertz curve, reflecting the student’s learning progression: slow initial growth, rapid mid-phase improvement, and late-stage saturation. Our framework incorporates Wasserstein distance to measure feature-level discrepancies and gradient matching to align backward propagation behaviors between teacher and student models. These components are unified under a multi-loss objective, where the Gompertz curve modulates the influence of distillation losses over time. Extensive experiments on CIFAR-10 and CIFAR-100 using various teacher-student architectures (e.g., ResNet50 and MobileNet_v2) demonstrate that Gompertz-CNN consistently outperforms traditional distillation methods, achieving up to 8% and 4% accuracy gains on CIFAR-10 and CIFAR-100, respectively.
zh
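
【代码示意】核心机制是用 Gompertz 曲线 w(t) = A·exp(−B·exp(−k·t)) 随训练进度调节蒸馏损失权重:初期小、中期快速上升、后期饱和。A、B、k 的具体取值摘要未给出,以下数值仅为示意。

```python
import math

# Gompertz 权重调度:训练早期蒸馏权重小,中期快速上升,后期饱和。
# A/B/k 为假设的超参数。
def gompertz_weight(epoch, total, A=1.0, B=4.0, k=8.0):
    t = epoch / total                      # 归一化训练进度
    return A * math.exp(-B * math.exp(-k * t))

total = 100
for e in (0, 25, 50, 75, 100):
    print(f"epoch {e:3d}: distill weight = {gompertz_weight(e, total):.3f}")
# 总损失可写成: loss = ce_loss + gompertz_weight(e, total) * distill_loss
```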

[CV-10] DAP-MAE: Domain-Adaptive Point Cloud Masked Autoencoder for Effective Cross-Domain Learning

【速读】:该论文旨在解决跨域点云数据在预训练阶段知识不匹配导致下游任务性能下降的问题,即不同领域点云数据混合预训练时,先验知识可能与具体3D点云分析任务(如物体分类或表情识别)的特征分布不一致,从而限制模型泛化能力。其解决方案的关键在于提出Domain-Adaptive Point Cloud Masked Autoencoder (DAP-MAE),通过设计异构域适配器(heterogeneous domain adapter),在预训练阶段采用适应模式(adaptation mode)以全面学习多域点云信息,并在微调阶段切换至融合模式(fusion mode)增强特征表示;同时引入域特征生成器(domain feature generator)引导点云特征向下游任务对齐,实现跨域知识的有效迁移与任务适配。

链接: https://arxiv.org/abs/2510.21635
作者: Ziqi Gao,Qiufu Li,Linlin Shen
机构: Shenzhen University (深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 7 figures, conference

点击查看摘要

Abstract:Compared to 2D data, the scale of point cloud data in different domains available for training, is quite limited. Researchers have been trying to combine these data of different domains for masked autoencoder (MAE) pre-training to leverage such a data scarcity issue. However, the prior knowledge learned from mixed domains may not align well with the downstream 3D point cloud analysis tasks, leading to degraded performance. To address such an issue, we propose the Domain-Adaptive Point Cloud Masked Autoencoder (DAP-MAE), an MAE pre-training method, to adaptively integrate the knowledge of cross-domain datasets for general point cloud analysis. In DAP-MAE, we design a heterogeneous domain adapter that utilizes an adaptation mode during pre-training, enabling the model to comprehensively learn information from point clouds across different domains, while employing a fusion mode in the fine-tuning to enhance point cloud features. Meanwhile, DAP-MAE incorporates a domain feature generator to guide the adaptation of point cloud features to various downstream tasks. With only one pre-training, DAP-MAE achieves excellent performance across four different point cloud analysis tasks, reaching 95.18% in object classification on ScanObjectNN and 88.45% in facial expression recognition on Bosphorus.
zh

[CV-11] Epipolar Geometry Improves Video Generation Models

【速读】:该论文旨在解决当前视频生成模型中存在的几何不一致性、运动不稳定性和视觉伪影问题,这些问题严重削弱了生成视频在模拟真实三维场景时的可信度。其解决方案的关键在于引入基于成对极几何(epipolar geometry)约束的偏好优化方法,通过数学上严谨的几何强制机制直接改善相机轨迹的稳定性并减少几何伪影,而无需依赖端到端可微分结构;实验表明,经典几何约束提供的优化信号比现代学习型度量更稳定,从而显著提升视频的空间一致性,同时保持高质量的视觉表现力。

链接: https://arxiv.org/abs/2510.21615
作者: Orest Kupyn,Fabian Manhardt,Federico Tombari,Christian Rupprecht
机构: University of Oxford (牛津大学); Google (谷歌); TU Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video generation models have progressed tremendously through large latent diffusion transformers trained with rectified flow techniques. Yet these models still struggle with geometric inconsistencies, unstable motion, and visual artifacts that break the illusion of realistic 3D scenes. 3D-consistent video generation could significantly impact numerous downstream applications in generation and reconstruction tasks. We explore how epipolar geometry constraints improve modern video diffusion models. Despite massive training data, these models fail to capture fundamental geometric principles underlying visual content. We align diffusion models using pairwise epipolar geometry constraints via preference-based optimization, directly addressing unstable camera trajectories and geometric artifacts through mathematically principled geometric enforcement. Our approach efficiently enforces geometric principles without requiring end-to-end differentiability. Evaluation demonstrates that classical geometric constraints provide more stable optimization signals than modern learned metrics, which produce noisy targets that compromise alignment quality. Training on static scenes with dynamic cameras ensures high-quality measurements while the model generalizes effectively to diverse dynamic content. By bridging data-driven deep learning with classical geometric computer vision, we present a practical method for generating spatially consistent videos without compromising visual quality.
zh
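
【代码示意】成对极几何约束的基本量是 x'ᵀFx(理想匹配点应为 0),其归一化形式即对称极线距离,可作为几何一致性的偏好/奖励信号。下面是该距离的独立实现草图(与论文的偏好优化管线无关),用纯水平平移对应的基础矩阵验证正确性。

```python
import numpy as np

# 对匹配点 (x, x') 与基础矩阵 F,理想情况下 x'^T F x = 0。
def symmetric_epipolar_distance(F, x1, x2):
    """x1, x2: (N, 2) 像素坐标;返回每对匹配点的对称极线距离。"""
    h1 = np.hstack([x1, np.ones((len(x1), 1))])   # 齐次坐标
    h2 = np.hstack([x2, np.ones((len(x2), 1))])
    Fx1 = h1 @ F.T                                 # x' 图中的极线
    Ftx2 = h2 @ F                                  # x 图中的极线
    num = np.abs(np.sum(h2 * Fx1, axis=1))         # |x'^T F x|
    d1 = num / np.linalg.norm(Fx1[:, :2], axis=1)
    d2 = num / np.linalg.norm(Ftx2[:, :2], axis=1)
    return d1 + d2

F = np.array([[0, 0, 0], [0, 0, -1], [0, 1, 0]], float)  # 纯水平平移的 F
x1 = np.array([[100.0, 50.0], [30.0, 80.0]])
x2 = x1 + np.array([20.0, 0.0])                           # 仅水平移动 -> 满足约束
print(symmetric_epipolar_distance(F, x1, x2))             # 接近 0
```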

[CV-12] Modest-Align: Data-Efficient Alignment for Vision-Language Models

【速读】:该论文旨在解决跨模态对齐(cross-modal alignment)在资源受限场景下因数据稀缺或质量低下导致的模型过自信(overconfidence)和性能下降问题,尤其针对现有对比学习方法依赖单一正样本对、加剧不确定样本误判的缺陷。解决方案的关键在于提出轻量级框架Modest-Align,其核心机制包括两个互补策略:随机扰动(Random Perturbation)通过引入可控噪声模拟不确定性,嵌入平滑(Embedding Smoothing)则校准嵌入空间中的相似度分布,从而有效降低模型对模糊或弱关联图像-文本对的过自信程度,并显著提升在噪声数据上的鲁棒性与泛化能力。

链接: https://arxiv.org/abs/2510.21606
作者: Jiaxiang Liu,Yuan Wang,Jiawei Du,Joey Tianyi Zhou,Mingkun Xu,Zuozhu Liu
机构: Guangdong Institute of Intelligence Science and Technology (广东省智能科学与技术研究院); ZJU-Angelalign R&D Center for Intelligence Healthcare, Zhejiang University (浙江大学-天使联研智医研发中心); Centre for Frontier AI Research (CFAR), Agency for Science, Technology and Research (A*STAR) (前沿人工智能研究中心(CFAR),新加坡科技研究局(A*STAR)); Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR) (高性能计算研究所(IHPC),新加坡科技研究局(A*STAR))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cross-modal alignment aims to map heterogeneous modalities into a shared latent space, as exemplified by models like CLIP, which benefit from large-scale image-text pretraining for strong recognition capabilities. However, when operating in resource-constrained settings with limited or low-quality data, these models often suffer from overconfidence and degraded performance due to the prevalence of ambiguous or weakly correlated image-text pairs. Current contrastive learning approaches, which rely on single positive pairs, further exacerbate this issue by reinforcing overconfidence on uncertain samples. To address these challenges, we propose Modest-Align, a lightweight alignment framework designed for robustness and efficiency. Our approach leverages two complementary strategies – Random Perturbation, which introduces controlled noise to simulate uncertainty, and Embedding Smoothing, which calibrates similarity distributions in the embedding space. These mechanisms collectively reduce overconfidence and improve performance on noisy or weakly aligned samples. Extensive experiments across multiple benchmark datasets demonstrate that Modest-Align outperforms state-of-the-art methods in retrieval tasks, achieving competitive results with over 100x less training data and 600x less GPU time than CLIP. Our method offers a practical and scalable solution for cross-modal alignment in real-world, low-resource scenarios.
zh
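
【代码示意】以下草图把两个机制写进一个对比损失:Random Perturbation 对嵌入注入高斯噪声,Embedding Smoothing 用软目标(标签平滑)校准相似度分布——后者是对「平滑」的一种可能实现,σ/τ/ε 均为假设超参数,非论文原始配置。

```python
import torch
import torch.nn.functional as F

def modest_align_loss(img, txt, sigma=0.05, tau=0.07, eps=0.1):
    """随机扰动 + 相似度分布平滑的对比损失示意。
    img, txt: (B, D) 的图文嵌入;sigma/tau/eps 为假设的超参数。"""
    # Random Perturbation:注入可控噪声,模拟弱相关样本的不确定性
    img = F.normalize(img + sigma * torch.randn_like(img), dim=-1)
    txt = F.normalize(txt + sigma * torch.randn_like(txt), dim=-1)
    logits = img @ txt.t() / tau
    # Embedding Smoothing 的一种实现:软目标避免单一正样本对导致的过自信
    B = img.size(0)
    target = torch.full((B, B), eps / (B - 1), device=img.device)
    target.fill_diagonal_(1.0 - eps)
    loss_i = F.cross_entropy(logits, target)     # PyTorch >= 1.10 支持软目标
    loss_t = F.cross_entropy(logits.t(), target)
    return 0.5 * (loss_i + loss_t)

img, txt = torch.randn(8, 512), torch.randn(8, 512)
print(modest_align_loss(img, txt).item())
```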

[CV-13] S3OD: Towards Generalizable Salient Object Detection with Synthetic Data

【速读】:该论文旨在解决显著性目标检测(Salient Object Detection, SOD)任务中因标注成本高导致的模型泛化能力受限问题,尤其是在不同子任务如二分图像分割(Dichotomous Image Segmentation, DIS)和高分辨率显著性检测(HR-SOD)之间难以共享模型的问题。其解决方案的关键在于两个方面:一是构建大规模合成数据集 S3OD(包含超过 139,000 张高分辨率图像),通过多模态扩散生成管道从扩散模型和 DINO-v3 特征中提取标签;二是设计一种模糊感知的轻量化多掩码解码器架构,能够同时预测多个合理的目标解释,从而有效应对显著性检测任务中固有的语义歧义性。实验证明,仅用合成数据训练的模型即可在跨数据集评估中实现 20–50% 的误差降低,微调后性能达到 DIS 和 HR-SOD 基准的最先进水平。

链接: https://arxiv.org/abs/2510.21605
作者: Orest Kupyn,Hirokatsu Kataoka,Christian Rupprecht
机构: University of Oxford (牛津大学); VGG; AIST
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Salient object detection exemplifies data-bounded tasks where expensive pixel-precise annotations force separate model training for related subtasks like DIS and HR-SOD. We present a method that dramatically improves generalization through large-scale synthetic data generation and ambiguity-aware architecture. We introduce S3OD, a dataset of over 139,000 high-resolution images created through our multi-modal diffusion pipeline that extracts labels from diffusion and DINO-v3 features. The iterative generation framework prioritizes challenging categories based on model performance. We propose a streamlined multi-mask decoder that naturally handles the inherent ambiguity in salient object detection by predicting multiple valid interpretations. Models trained solely on synthetic data achieve 20-50% error reduction in cross-dataset generalization, while fine-tuned versions reach state-of-the-art performance across DIS and HR-SOD benchmarks.
zh

[CV-14] Automated interictal epileptic spike detection from simple and noisy annotations in MEG data

【速读】:该论文旨在解决药物难治性癫痫患者术前评估中,磁脑电图(Magnetoencephalography, MEG)记录中间期癫痫尖波(interictal epileptic spikes)自动检测的难题。当前手动标注效率低且一致性差,而现有自动化方法要么依赖大量标注数据,要么在非典型数据上表现不稳定,难以满足临床实际需求。解决方案的关键在于提出两种轻量级深度学习模型——基于特征的人工神经网络(ANN)和卷积神经网络(CNN),并在仅使用时间序列信息与单专家标注的条件下实现有效训练;同时引入交互式机器学习(interactive machine learning)策略,利用模型中间输出迭代优化标注质量,从而提升模型对噪声标签的鲁棒性。实验表明,两种模型在10名保留测试患者上的F1分数均优于现有最优模型(CNN=0.46,ANN=0.44),验证了简单架构在复杂、不完美标注数据下的有效性及临床适用潜力。

链接: https://arxiv.org/abs/2510.21596
作者: Pauline Mouches,Julien Jung,Armand Demasson,Agnès Guinard,Romain Bouet,Rosalie Marchal,Romain Quentin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 7 Figures

点击查看摘要

Abstract:In drug-resistant epilepsy, presurgical evaluation of epilepsy can be considered. Magnetoencephalography (MEG) has been shown to be an effective exam to inform the localization of the epileptogenic zone through the localization of interictal epileptic spikes. Manual detection of these pathological biomarkers remains a tedious and error-prone task due to the high dimensionality of MEG recordings, and interrater agreement has been reported to be only moderate. Current automated methods are unsuitable for clinical practice, either requiring extensively annotated data or lacking robustness on non-typical data. In this work, we demonstrate that deep learning models can be used for detecting interictal spikes in MEG recordings, even when only temporal and single-expert annotations are available, which represents real-world clinical practice. We propose two model architectures: a feature-based artificial neural network (ANN) and a convolutional neural network (CNN), trained on a database of 59 patients, and evaluated against a state-of-the-art model to classify short time windows of signal. In addition, we employ an interactive machine learning strategy to iteratively improve our data annotation quality using intermediary model outputs. Both proposed models outperform the state-of-the-art model (F1-scores: CNN=0.46, ANN=0.44) when tested on 10 holdout test patients. The interactive machine learning strategy demonstrates that our models are robust to noisy annotations. Overall, results highlight the robustness of models with simple architectures when analyzing complex and imperfectly annotated data. Our method of interactive machine learning offers great potential for faster data annotation, while our models represent useful and efficient tools for automated interictal spikes detection.
zh

[CV-15] Restore Text First Enhance Image Later: Two-Stage Scene Text Image Super-Resolution with Glyph Structure Guidance

【速读】:该论文旨在解决当前生成式超分辨率方法在自然图像上表现优异但会破坏文本信息的问题,即图像质量与文本可读性之间存在的根本性权衡。其解决方案的关键在于提出了一种名为TIGER(Text-Image Guided Super-Resolution)的两阶段框架,采用“文本优先、图像后处理”的范式,通过显式分离字形恢复(glyph restoration)与图像增强过程:首先重建精确的文本结构,再利用这些结构引导全图超分辨率重建,从而实现高保真度与视觉一致性的统一。

链接: https://arxiv.org/abs/2510.21590
作者: Minxing Luo,Linlong Fan,Wang Qiushi,Ge Wu,Yiyan Luo,Yuhang Yu,Jinwei Chen,Yaxing Wang,Qingnan Fan,Jian Yang
机构: VCIP, CS, Nankai University (南开大学); vivo Mobile Communication Co. Ltd (维沃移动通信有限公司); SDS, The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳))
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Current generative super-resolution methods show strong performance on natural images but distort text, creating a fundamental trade-off between image quality and textual readability. To address this, we introduce TIGER (Text-Image Guided supEr-Resolution), a novel two-stage framework that breaks this trade-off through a "text-first, image-later" paradigm. TIGER explicitly decouples glyph restoration from image enhancement: it first reconstructs precise text structures and then uses them to guide subsequent full-image super-resolution. This glyph-to-image guidance ensures both high fidelity and visual consistency. To support comprehensive training and evaluation, we also contribute UltraZoom-ST (UltraZoom-Scene Text), the first scene text dataset with extreme zoom (×14.29). Extensive experiments show that TIGER achieves state-of-the-art performance, enhancing readability while preserving overall image quality.
zh

[CV-16] MATrack: Efficient Multiscale Adaptive Tracker for Real-Time Nighttime UAV Operations

【速读】:该论文旨在解决夜间无人机(UAV)跟踪中因低光照条件、复杂背景干扰和视角频繁变化导致的跟踪漂移或失败问题。现有方法如低光增强和域自适应虽有一定效果,但前者易引入视觉伪影,后者计算开销大且轻量化设计难以充分利用动态目标信息。解决方案的关键在于提出MATrack——一个专为夜间UAV跟踪设计的多尺度自适应系统,其核心创新是三个模块的协同工作:多尺度层次融合模块(Multiscale Hierarchy Blende, MHB)提升静态与动态模板间特征一致性;自适应关键token门控机制(Adaptive Key Token Gate)精准识别复杂背景中的目标信息;夜间模板校准器(Nighttime Template Calibrator, NTC)保障长时间序列下的稳定跟踪性能。该方案在UAVDark135基准上相较最先进方法精度、归一化精度和AUC分别提升5.9%、5.4%和4.2%,同时保持81 FPS的实时处理速度,验证了其在真实无人机平台上的可靠性与有效性。

链接: https://arxiv.org/abs/2510.21586
作者: Xuzhao Li,Xuchen Li,Shiyu Hu
机构: Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Preprint, Under Review

点击查看摘要

Abstract:Nighttime UAV tracking faces significant challenges in real-world robotics operations. Low-light conditions not only limit visual perception capabilities, but cluttered backgrounds and frequent viewpoint changes also cause existing trackers to drift or fail during deployment. To address these difficulties, researchers have proposed solutions based on low-light enhancement and domain adaptation. However, these methods still have notable shortcomings in actual UAV systems: low-light enhancement often introduces visual artifacts, domain adaptation methods are computationally expensive and existing lightweight designs struggle to fully leverage dynamic object information. Based on an in-depth analysis of these key issues, we propose MATrack-a multiscale adaptive system designed specifically for nighttime UAV tracking. MATrack tackles the main technical challenges of nighttime tracking through the collaborative work of three core modules: Multiscale Hierarchy Blende (MHB) enhances feature consistency between static and dynamic templates. Adaptive Key Token Gate accurately identifies object information within complex backgrounds. Nighttime Template Calibrator (NTC) ensures stable tracking performance over long sequences. Extensive experiments show that MATrack achieves a significant performance improvement. On the UAVDark135 benchmark, its precision, normalized precision and AUC surpass state-of-the-art (SOTA) methods by 5.9%, 5.4% and 4.2% respectively, while maintaining a real-time processing speed of 81 FPS. Further tests on a real-world UAV platform validate the system’s reliability, demonstrating that MATrack can provide stable and effective nighttime UAV tracking support for critical robotics applications such as nighttime search and rescue and border patrol.
zh

[CV-17] Sample By Step Optimize By Chunk: Chunk-Level GRPO For Text-to-Image Generation

【速读】:该论文旨在解决基于流匹配(flow matching)的文本到图像(text-to-image, T2I)生成中,Group Relative Policy Optimization (GRPO) 方法面临的两个关键问题:一是优势估计(advantage attribution)不准确,二是忽略了生成过程中的时间动态性(temporal dynamics)。解决方案的关键在于将优化范式从步骤级别(step level)提升至块级别(chunk level),即把连续的生成步骤分组为具有内在时间一致性的“块”(chunk),并在块层面进行策略优化。这一设计更好地捕捉了流匹配过程中的时序特性,并通过引入可选的加权采样策略进一步提升了性能,实验表明该方法在偏好对齐和图像质量上均优于现有方法。

链接: https://arxiv.org/abs/2510.21583
作者: Yifu Luo,Penghui Du,Bo Li,Sinan Du,Tiantian Zhang,Yongzhe Chang,Kai Wu,Kun Gai,Xueqian Wang
机构: Tsinghua University (清华大学); Kolors Team, Kuaishou Technology (快手科技)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 11 pages, preprint

点击查看摘要

Abstract:Group Relative Policy Optimization (GRPO) has shown strong potential for flow-matching-based text-to-image (T2I) generation, but it faces two key limitations: inaccurate advantage attribution, and the neglect of temporal dynamics of generation. In this work, we argue that shifting the optimization paradigm from the step level to the chunk level can effectively alleviate these issues. Building on this idea, we propose Chunk-GRPO, the first chunk-level GRPO-based approach for T2I generation. The insight is to group consecutive steps into coherent "chunks" that capture the intrinsic temporal dynamics of flow matching, and to optimize policies at the chunk level. In addition, we introduce an optional weighted sampling strategy to further enhance performance. Extensive experiments show that Chunk-GRPO achieves superior results in both preference alignment and image quality, highlighting the promise of chunk-level optimization for GRPO-based methods.
zh
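
【代码示意】核心思想是把连续去噪步分组为块,在块级计算重要性比率与裁剪目标。以下是该思想的最小草图(组内奖励标准化得到优势,块内对数概率求和后计算 GRPO 式裁剪损失);块大小、加权采样等细节以论文为准。

```python
import torch

def chunk_log_probs(step_log_probs: torch.Tensor, chunk_size: int):
    """step_log_probs: (B, T),每个去噪步的策略对数概率;
    返回块级对数概率(块内求和),形状 (B, T // chunk_size)。"""
    B, T = step_log_probs.shape
    assert T % chunk_size == 0
    return step_log_probs.view(B, T // chunk_size, chunk_size).sum(-1)

def chunk_grpo_loss(new_lp, old_lp, advantages, chunk_size=5, clip=0.2):
    """GRPO 式裁剪目标在块级上的示意:advantages 形状 (B,),
    由组内奖励标准化得到,并广播到该样本的所有块。"""
    lp = chunk_log_probs(new_lp, chunk_size)
    old = chunk_log_probs(old_lp, chunk_size)
    ratio = torch.exp(lp - old)                     # (B, n_chunks)
    adv = advantages[:, None]
    obj = torch.minimum(ratio * adv, ratio.clamp(1 - clip, 1 + clip) * adv)
    return -obj.mean()

B, T = 4, 20
new_lp = 0.01 * torch.randn(B, T)
old_lp = new_lp + 0.01 * torch.randn(B, T)
rewards = torch.randn(B)
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(chunk_grpo_loss(new_lp, old_lp, adv).item())
```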

[CV-18] Foley Control: Aligning a Frozen Latent Text-to-Audio Model to Video

【速读】:该论文旨在解决多模态视频到音频生成任务中模型参数量大、训练成本高且缺乏模块化灵活性的问题。其核心挑战在于如何在不重新训练预训练单模态模型的前提下,实现高质量的音视频同步与语义对齐。解决方案的关键在于提出“Foley Control”方法:通过插入一个轻量级的交叉注意力桥(cross-attention bridge)连接冻结的视频嵌入(如V-JEPA2)与文本到音频(T2A)模型(如Stable Audio Open DiT),使提示词控制全局语义,而视频细化时间节奏和局部动态细节;同时,在条件生成前对视频token进行池化以降低内存消耗并稳定训练。该设计仅需少量可训练参数即可学习音频-视频依赖关系,保留预训练模型的边际分布特性,并支持灵活替换编码器或T2A骨干网络而不需端到端重训练,从而实现了高效、可控且模块化的音视频生成框架。

链接: https://arxiv.org/abs/2510.21581
作者: Ciara Rowles,Varun Jampani,Simon Donné,Shimon Vainer,Julian Parker,Zach Evans
机构: Stability AI
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注: Project Page: this https URL

点击查看摘要

Abstract:Foley Control is a lightweight approach to video-guided Foley that keeps pretrained single-modality models frozen and learns only a small cross-attention bridge between them. We connect V-JEPA2 video embeddings to a frozen Stable Audio Open DiT text-to-audio (T2A) model by inserting compact video cross-attention after the model’s existing text cross-attention, so prompts set global semantics while video refines timing and local dynamics. The frozen backbones retain strong marginals (video; audio given text) and the bridge learns the audio-video dependency needed for synchronization – without retraining the audio prior. To cut memory and stabilize training, we pool video tokens before conditioning. On curated video-audio benchmarks, Foley Control delivers competitive temporal and semantic alignment with far fewer trainable parameters than recent multi-modal systems, while preserving prompt-driven controllability and production-friendly modularity (swap/upgrade encoders or the T2A backbone without end-to-end retraining). Although we focus on Video-to-Foley, the same bridge design can potentially extend to other audio modalities (e.g., speech).
zh
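
【代码示意】方法的结构要点是:骨干全部冻结,只训练一个插在文本交叉注意力之后的轻量视频交叉注意力桥,且视频 token 先池化以省显存。下面用 PyTorch 给出该桥接块的示意(维度、头数与池化方式均为假设,非官方实现)。

```python
import torch
import torch.nn as nn

class VideoBridgeBlock(nn.Module):
    """在冻结的 T2A DiT 块内、文本交叉注意力之后插入视频交叉注意力的示意。
    实际层数、维度与插入位置以论文实现为准;此处均为假设。"""
    def __init__(self, dim=768, video_dim=1024, heads=8):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, dim)
        self.video_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, video_tokens):
        # 先对视频 token 池化以省显存(这里用简单的时间平均池化示意)
        v = self.video_proj(video_tokens).mean(dim=1, keepdim=True)
        out, _ = self.video_attn(self.norm(x), v, v)
        return x + out                      # 残差注入,骨干保持冻结

x = torch.randn(2, 128, 768)                # 音频 latent tokens
video = torch.randn(2, 64, 1024)            # 冻结视频编码器的嵌入(维度为假设)
print(VideoBridgeBlock()(x, video).shape)   # torch.Size([2, 128, 768])
```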

[CV-19] Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos MICRO

【速读】:该论文旨在解决机器人视觉-语言-动作(Vision-Language-Action, VLA)模型预训练数据稀缺且泛化能力不足的问题,尤其是如何利用大规模、未标注的真实世界人类手部活动视频来构建高质量的VLA训练数据。其解决方案的关键在于提出了一种全自动的、端到端的人类手部活动分析方法,能够将无标注的自然视角(egocentric)人类手部视频自动转化为与机器人VLA训练数据格式完全对齐的数据集——包含原子级手部动作片段、对应的语言描述、逐帧3D手部运动及相机运动信息。该方法使得从100万条视频片段(2600万帧)中提取出覆盖广泛物体、操作任务和环境变化的高质量训练数据成为可能,从而显著提升模型在真实机器人场景下的零样本迁移能力和微调后的任务成功率与泛化性能。

链接: https://arxiv.org/abs/2510.21571
作者: Qixiu Li,Yu Deng,Yaobo Liang,Lin Luo,Lei Zhou,Chengtang Yao,Lingqi Zeng,Zhiyuan Feng,Huizhi Liang,Sicheng Xu,Yizhong Zhang,Xi Chen,Hao Chen,Lily Sun,Dong Chen,Jiaolong Yang,Baining Guo
机构: Tsinghua University (清华大学); Microsoft Research Asia (微软亚洲研究院)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project page: this https URL

点击查看摘要

Abstract:This paper presents a novel approach for pretraining robotic manipulation Vision-Language-Action (VLA) models using a large corpus of unscripted real-life video recordings of human hand activities. Treating human hand as dexterous robot end-effector, we show that “in-the-wild” egocentric human videos without any annotations can be transformed into data formats fully aligned with existing robotic V-L-A training data in terms of task granularity and labels. This is achieved by the development of a fully-automated holistic human activity analysis approach for arbitrary human hand videos. This approach can generate atomic-level hand activity segments and their language descriptions, each accompanied with framewise 3D hand motion and camera motion. We process a large volume of egocentric videos and create a hand-VLA training dataset containing 1M episodes and 26M frames. This training data covers a wide range of objects and concepts, dexterous manipulation tasks, and environment variations in real life, vastly exceeding the coverage of existing robot data. We design a dexterous hand VLA model architecture and pretrain the model on this dataset. The model exhibits strong zero-shot capabilities on completely unseen real-world observations. Additionally, fine-tuning it on a small amount of real robot action data significantly improves task success rates and generalization to novel objects in real robotic experiments. We also demonstrate the appealing scaling behavior of the model’s task performance with respect to pretraining data scale. We believe this work lays a solid foundation for scalable VLA pretraining, advancing robots toward truly generalizable embodied intelligence.
zh

[CV-20] AURASeg: Attention Guided Upsampling with Residual Boundary-Assistive Refinement for Drivable-Area Segmentation

[Quick Read]: This paper addresses the difficulty existing ground-segmentation models have in extracting fine-grained features in indoor and structured environments, which stems from ineffective multi-scale processing, suboptimal boundary refinement, and limited feature representation. The key to the solution is AURASeg, a ground-plane semantic segmentation model that introduces a Residual Border Refinement Module (RBRM) for precise edge delineation and an Attention Progressive Upsampling Decoder (APUD) for stronger feature integration, together with a lightweight Atrous Spatial Pyramid Pooling module (ASPP-Lite) that extracts multi-scale context without sacrificing real-time performance, significantly improving segmentation accuracy and boundary sharpness.

Link: https://arxiv.org/abs/2510.21536
Authors: Narendhiran Vijayakumar, Sridevi. M
Institutions: NIT Trichy
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures, 4 tables

Abstract:Free space ground segmentation is essential for navigating robots and autonomous vehicles, recognizing drivable zones, and traversing efficiently. Fine-grained features remain challenging for existing segmentation models, particularly for robots in indoor and structured environments. These difficulties arise from ineffective multi-scale processing, suboptimal boundary refinement, and limited feature representation. To overcome these limitations, we propose Attention-Guided Upsampling with Residual Boundary-Assistive Refinement (AURASeg), a ground-plane semantic segmentation model that maintains high segmentation accuracy while improving border precision. Our method uses a CSP-Darknet backbone, adding a Residual Border Refinement Module (RBRM) for accurate edge delineation and an Attention Progressive Upsampling Decoder (APUD) for strong feature integration. We also incorporate a lightweight Atrous Spatial Pyramid Pooling (ASPP-Lite) module to ensure multi-scale context extraction without compromising real-time performance. The proposed model beats benchmark segmentation architectures in mIoU and F1 metrics when tested on the Ground Mobile Robot Perception (GMRP) Dataset and a custom Gazebo indoor dataset. Our approach achieves an improvement in mean Intersection-over-Union (mIoU) of +1.26% and segmentation precision of +1.65% compared to state-of-the-art models. These results show that our technique is feasible for autonomous perception in both indoor and outdoor environments, enabling precise border refinement with minimal effect on inference speed.
zh

[CV-21] Towards a Golden Classifier-Free Guidance Path via Foresight Fixed Point Iterations NEURIPS2025

[Quick Read]: This paper addresses the theoretical disunity and design inefficiency of classifier-free guidance (CFG) in generative AI: existing methods stem from divergent theoretical interpretations, which limits the design space and obscures key design choices. The key to the solution is a unified perspective that reframes conditional guidance as a fixed-point iteration, seeking a "golden path" on which latents produce consistent outputs under both conditional and unconditional generation. It further shows that CFG and its variants are special cases of single-step, short-interval iteration with a provable efficiency bottleneck, and therefore proposes Foresight Guidance (FSG), which prioritizes solving longer-interval subproblems with more iterations in the early diffusion stages, markedly improving both image quality and computational efficiency.

Link: https://arxiv.org/abs/2510.21512
Authors: Kaibo Wang, Jianda Mao, Tong Wu, Yang Xiang
Institutions: The Hong Kong University of Science and Technology; Shenzhen-Hong Kong Collaborative Innovation Research Institute, HKUST
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at NeurIPS 2025 (Spotlight)

Abstract:Classifier-Free Guidance (CFG) is an essential component of text-to-image diffusion models, and understanding and advancing its operational mechanisms remains a central focus of research. Existing approaches stem from divergent theoretical interpretations, thereby limiting the design space and obscuring key design choices. To address this, we propose a unified perspective that reframes conditional guidance as fixed point iterations, seeking to identify a golden path where latents produce consistent outputs under both conditional and unconditional generation. We demonstrate that CFG and its variants constitute a special case of single-step short-interval iteration, which is theoretically proven to exhibit inefficiency. To this end, we introduce Foresight Guidance (FSG), which prioritizes solving longer-interval subproblems in early diffusion stages with increased iterations. Extensive experiments across diverse datasets and model architectures validate the superiority of FSG over state-of-the-art methods in both image quality and computational efficiency. Our work offers novel perspectives for conditional guidance and unlocks the potential of adaptive design.
zh
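The fixed-point reading of guidance is concrete enough to sketch. The first function below is standard classifier-free guidance; the iteration loop only illustrates the paper's fixed-point view (the crude surrogate denoising step and the iteration count are assumptions, not FSG's exact update).

```python
# Illustrative sketch of classifier-free guidance (CFG) and a multi-step
# fixed-point-style refinement. `eps_model(x, t, cond)` is a hypothetical
# noise-prediction network passed in by the caller.
import torch

def cfg_eps(eps_model, x, t, cond, w=7.5):
    e_u = eps_model(x, t, None)          # unconditional prediction
    e_c = eps_model(x, t, cond)          # conditional prediction
    return e_u + w * (e_c - e_u)         # classic one-step CFG combination

def fixed_point_guidance(eps_model, x, t, cond, w=7.5, n_iters=3):
    # Repeatedly re-estimate the guided noise so the implied latent stays
    # consistent under conditional and unconditional predictions.
    eps = cfg_eps(eps_model, x, t, cond, w)
    for _ in range(n_iters):
        x_hat = x - eps                  # crude surrogate for the denoised latent
        eps = cfg_eps(eps_model, x_hat, t, cond, w)
    return eps
```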

[CV-22] GranViT: A Fine-Grained Vision Model With Autoregressive Perception For MLLM s

[Quick Read]: This paper addresses the limited region-level, fine-grained analysis capability of existing vision encoders in multimodal large language models (MLLMs): current approaches focus on global image representations and, constrained by the scarcity of fine-grained annotations and the lack of a matching pretraining paradigm, perform poorly on fine-grained perception. The key is GranViT, a Vision Transformer that couples fine-grained feature extraction with semantic alignment to large language models (LLMs) via region-level autoregressive training. Its core innovations are Gran-29M, a large-scale dataset of 2 million natural and OCR images with over 180 million region-level annotations; a pretraining-adaptation framework with self-distillation, in which bounding-box-to-caption regression strengthens localized visual representations and caption-to-bounding-box regression improves the LLM's use and localization of visual features; and explicit localization constraints that reinforce the vision encoder's regional reasoning.

Link: https://arxiv.org/abs/2510.21501
Authors: Guanghao Zheng, Bowen Shi, Mingxing Xu, Ruoyu Sun, Peisen Zhao, Zhibo Zhang, Wenrui Dai, Junni Zou, Hongkai Xiong, Xiaopeng Zhang, Qi Tian
Institutions: Shanghai Jiao Tong University; Huawei Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 21 pages, 6 figures

Abstract:Vision encoders are indispensable to the impressive performance of Multi-modal Large Language Models (MLLMs) in vision-language tasks such as visual question answering and reasoning. However, existing vision encoders focus on global image representations but overlook fine-grained regional analysis. They are limited in fine-grained perception due to the scarcity of fine-grained annotated data and the lack of a fine-grained pre-training paradigm. In this paper, we propose GranViT, a novel Vision Transformer that integrates fine-grained feature extraction with semantic alignment to Large Language Models (LLMs) via region-level autoregressive training. We first construct Gran-29M, a dataset comprising 2 million natural and OCR images paired with over 180 million high-quality region-level annotations, to enable large-scale fine-grained pretraining. We then develop a pretraining-adaptation framework along with a self-distillation mechanism to train fine-grained GranViT on Gran-29M. We fully exploit the fine-grained annotations in Gran-29M, using bounding-box-to-caption regression to enhance the vision encoder's localized visual representations during pretraining and caption-to-bounding-box regression to improve vision-feature utilization and localization for the LLM during adaptation. We further incorporate a self-distillation mechanism that imposes explicit localization constraints on the vision encoder to strengthen its regional reasoning capability. Extensive experiments show that GranViT surpasses existing vision encoders and attains strong transferability to varying LLMs. Remarkably, it achieves state-of-the-art results on fine-grained recognition, multimodal VQA, and OCR understanding.
zh

[CV-23] An Automatic Detection Method for Hematoma Features in Placental Abruption Ultrasound Images Based on Few-Shot Learning

[Quick Read]: This paper aims to remove the subjective bias and diagnostic inconsistency that result from reliance on physician experience in the early diagnosis of placental abruption, proposing EH-YOLOv11n (Enhanced Hemorrhage-YOLOv11n), a few-shot-learning-based model for automatically detecting hematoma features in placental ultrasound images. The key lies in multidimensional optimization: wavelet convolution and coordinate convolution strengthen frequency-domain and spatial feature extraction, while a cascaded group attention mechanism suppresses ultrasound artifacts and occlusion interference, markedly improving bounding-box localization accuracy. Experiments show the model outperforms YOLOv11n and YOLOv8 in accuracy, precision-recall behavior, confidence scores, and occluded scenes, combining high precision with real-time processing for reliable computer-aided diagnosis of placental abruption.

Link: https://arxiv.org/abs/2510.21495
Authors: Xiaoqing Liu, Jitai Han, Hua Yan, Peng Li, Sida Tang, Ying Li, Kaiwen Zhang, Min Yu
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
Comments:

Abstract:Placental abruption is a severe complication during pregnancy, and its early accurate diagnosis is crucial for ensuring maternal and fetal safety. Traditional ultrasound diagnostic methods heavily rely on physician experience, leading to issues such as subjective bias and diagnostic inconsistencies. This paper proposes an improved model, EH-YOLOv11n (Enhanced Hemorrhage-YOLOv11n), based on small-sample learning, aiming to achieve automatic detection of hematoma features in placental ultrasound images. The model enhances performance through multidimensional optimization: it integrates wavelet convolution and coordinate convolution to strengthen frequency and spatial feature extraction; incorporates a cascaded group attention mechanism to suppress ultrasound artifacts and occlusion interference, thereby improving bounding box localization accuracy. Experimental results demonstrate a detection accuracy of 78%, representing a 2.5% improvement over YOLOv11n and a 13.7% increase over YOLOv8. The model exhibits significant superiority in precision-recall curves, confidence scores, and occlusion scenarios. Combining high accuracy with real-time processing, this model provides a reliable solution for computer-aided diagnosis of placental abruption, holding significant clinical application value.
zh

[CV-24] GRAP-MOT: Unsupervised Graph-based Position Weighted Person Multi-camera Multi-object Tracking in a Highly Congested Space

[Quick Read]: This paper targets multi-object tracking (MOT) of people in videos of closed areas, where frequent occlusion and overlapping multi-camera views make it hard for conventional methods to maintain identity associations. The key of GRAP-MOT is an online, graph-weighted identity-label update mechanism that: 1) builds a graph from tracks and person-characteristic (appearance) features to refine identity association; 2) adds a person-position estimation module that supplies extra spatial information for robustness; and 3) systematically optimizes every component of the MOT pipeline, including feature extraction, tracking, and community search. Experiments show clear gains over existing algorithms in highly congested scenes, and the authors argue that IDF1 is a more adequate evaluation metric than MOTA for such comparisons.

Link: https://arxiv.org/abs/2510.21482
Authors: Marek Socha, Michał Marczyk, Aleksander Kempski, Michał Cogiel, Paweł Foszner, Radosław Zawiski, Michał Staniszewski
Institutions: Silesian University of Technology; Blees Sp. z o. o.; QSystems.pro Sp. z o. o.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 5 figures, 8 tables

Abstract:GRAP-MOT is a new approach for solving the person MOT problem dedicated to videos of closed areas with overlapping multi-camera views, where person occlusion frequently occurs. Our novel graph-weighted solution updates a person’s identification label online based on tracks and the person’s characteristic features. To find the best solution, we deeply investigated all elements of the MOT process, including feature extraction, tracking, and community search. Furthermore, GRAP-MOT is equipped with a person’s position estimation module, which gives additional key information to the MOT method, ensuring better results than methods without position data. We tested GRAP-MOT on recordings acquired in a closed-area model and on publicly available real datasets that fulfil the requirement of a highly congested space, showing the superiority of our proposal. Finally, we analyzed existing metrics used to compare MOT algorithms and concluded that IDF1 is more adequate than MOTA in such comparisons. We made our code, along with the acquired dataset, publicly available.
zh
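As a toy illustration of graph-weighted identity association, the sketch below builds a tracklet graph whose edge weights mix appearance similarity with a position prior, then groups tracklets by community detection. The weighting scheme, threshold, and use of networkx's modularity communities are assumptions for illustration, not GRAP-MOT's exact procedure.

```python
# Toy sketch: tracklets from overlapping cameras become graph nodes; edges
# combine appearance similarity and ground-plane proximity; communities give
# shared person IDs. `tracklets` is a list of dicts with an L2-normalized
# 'feat' vector and a 'pos' (x, y) numpy array.
import numpy as np
import networkx as nx
from networkx.algorithms import community

def associate(tracklets, alpha=0.7, thresh=0.5):
    G = nx.Graph()
    G.add_nodes_from(range(len(tracklets)))
    for i in range(len(tracklets)):
        for j in range(i + 1, len(tracklets)):
            app = float(tracklets[i]['feat'] @ tracklets[j]['feat'])
            dist = np.linalg.norm(tracklets[i]['pos'] - tracklets[j]['pos'])
            w = alpha * app + (1 - alpha) * np.exp(-dist)   # position prior
            if w > thresh:
                G.add_edge(i, j, weight=w)
    if G.number_of_edges() == 0:                # every tracklet is its own ID
        return {i: i for i in range(len(tracklets))}
    groups = community.greedy_modularity_communities(G, weight='weight')
    return {node: pid for pid, grp in enumerate(groups) for node in grp}
```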

[CV-25] ITC-RWKV: Interactive Tissue-Cell Modeling with Recurrent Key-Value Aggregation for Histopathological Subtyping BMVC2025

[Quick Read]: This paper addresses a key limitation of current pathology foundation models on fine-grained tasks such as cancer subtyping: the absence of cell-level feature modeling. The core of the solution is a dual-stream architecture that jointly models macroscale tissue features and aggregated cellular representations, so that local cellular detail and global tissue structure are understood together. Its key innovations are a receptance-weighted key-value aggregation model, a recurrent transformer that aggregates large cell sets with linear complexity, and a bidirectional tissue-cell interaction module enabling mutual attention between localized cellular cues and their surrounding tissue context, which substantially improves fine-grained histopathological analysis.

Link: https://arxiv.org/abs/2510.21479
Authors: Yating Huang, Qijun Yang, Lintao Xiang, Hujun Yin
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by BMVC 2025

Abstract:Accurate interpretation of histopathological images demands integration of information across spatial and semantic scales, from nuclear morphology and cellular textures to global tissue organization and disease-specific patterns. Although recent foundation models in pathology have shown strong capabilities in capturing global tissue context, their omission of cell-level feature modeling remains a key limitation for fine-grained tasks such as cancer subtype classification. To address this, we propose a dual-stream architecture that models the interplay between macroscale tissue features and aggregated cellular representations. To efficiently aggregate information from large cell sets, we propose a receptance-weighted key-value aggregation model, a recurrent transformer that captures inter-cell dependencies with linear complexity. Furthermore, we introduce a bidirectional tissue-cell interaction module to enable mutual attention between localized cellular cues and their surrounding tissue environment. Experiments on four histopathological subtype classification benchmarks show that the proposed method outperforms existing models, demonstrating the critical role of cell-level aggregation and tissue-cell interaction in fine-grained computational pathology.
zh
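The appeal of receptance-weighted aggregation is that a recurrence replaces quadratic attention over the cell set. A minimal sketch of such an O(N) aggregation follows; the parameterization (fixed decay, sigmoid gating) is illustrative, not the paper's exact formulation.

```python
# Linear-time recurrent key-value aggregation over cell tokens, in the spirit
# of receptance-weighted (RWKV-style) models: a running weighted sum of values
# stands in for pairwise attention.
import torch

def recurrent_kv_aggregate(k, v, r, decay=0.9):
    # k, v, r: (N, d) keys, values, and receptance gates for N cell tokens
    num = torch.zeros(v.shape[1])
    den = torch.zeros(v.shape[1])
    outs = []
    for t in range(k.shape[0]):
        w = torch.exp(k[t])                 # positive per-channel key weight
        num = decay * num + w * v[t]        # running weighted value sum
        den = decay * den + w               # running normalizer
        outs.append(torch.sigmoid(r[t]) * num / (den + 1e-8))
    return torch.stack(outs)                # (N, d) aggregated cell features
```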

[CV-26] CXR-LanIC: Language-Grounded Interpretable Classifier for Chest X-Ray Diagnosis

[Quick Read]: This paper tackles the limited clinical adoption of deep learning for chest X-ray diagnosis caused by the black-box nature of its predictions, which prevents clinicians from trusting automated results or identifying failure modes. The key is CXR-LanIC (Language-Grounded Interpretable Classifier for Chest X-rays), a framework that achieves interpretability through task-aligned pattern discovery: an ensemble of transcoder-based sparse autoencoders is trained on a BiomedCLIP diagnostic classifier to decompose medical image representations into roughly 5,000 monosemantic visual patterns spanning cardiac, pulmonary, pleural, structural, device, and artifact categories. These patterns activate consistently on images sharing specific radiological features, so each prediction can be decomposed into 20-50 verifiable explanatory patterns with traceable activation galleries; crucially, the features are extracted from a classifier trained on the diagnostic objective itself rather than from general-purpose embeddings, markedly improving the transparency and trustworthiness of medical AI systems.

Link: https://arxiv.org/abs/2510.21464
Authors: Yiming Tang, Wenjia Zhong, Rushi Shah, Dianbo Liu
Institutions: National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep learning models have achieved remarkable accuracy in chest X-ray diagnosis, yet their widespread clinical adoption remains limited by the black-box nature of their predictions. Clinicians require transparent, verifiable explanations to trust automated diagnoses and identify potential failure modes. We introduce CXR-LanIC (Language-Grounded Interpretable Classifier for Chest X-rays), a novel framework that addresses this interpretability challenge through task-aligned pattern discovery. Our approach trains transcoder-based sparse autoencoders on a BiomedCLIP diagnostic classifier to decompose medical image representations into interpretable visual patterns. By training an ensemble of 100 transcoders on multimodal embeddings from the MIMIC-CXR dataset, we discover approximately 5,000 monosemantic patterns spanning cardiac, pulmonary, pleural, structural, device, and artifact categories. Each pattern exhibits consistent activation behavior across images sharing specific radiological features, enabling transparent attribution where predictions decompose into 20-50 interpretable patterns with verifiable activation galleries. CXR-LanIC achieves competitive diagnostic accuracy on five key findings while providing the foundation for natural language explanations through planned large multimodal model annotation. Our key innovation lies in extracting interpretable features from a classifier trained on specific diagnostic objectives rather than general-purpose embeddings, ensuring discovered patterns are directly relevant to clinical decision-making. This demonstrates that medical AI systems can be both accurate and interpretable, supporting safer clinical deployment through transparent, clinically grounded explanations.
zh
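Pattern discovery with sparse autoencoders is straightforward to sketch: a wide hidden layer with ReLU activations and an L1 penalty reconstructs frozen classifier embeddings, and each hidden unit becomes a candidate pattern. Sizes and the sparsity weight below are assumptions, not the paper's settings.

```python
# Minimal sparse-autoencoder sketch of the pattern-discovery step: frozen
# BiomedCLIP-style embeddings are decomposed into (ideally) monosemantic
# activations by reconstruction plus an L1 sparsity penalty.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in=512, d_hidden=4096):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in, bias=False)

    def forward(self, z):
        a = torch.relu(self.enc(z))          # non-negative pattern activations
        return self.dec(a), a

def sae_loss(model, z, l1=1e-3):
    recon, acts = model(z)
    return ((recon - z) ** 2).mean() + l1 * acts.abs().mean()
```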

[CV-27] VidSplice: Towards Coherent Video Inpainting via Explicit Spaced Frame Guidance

[Quick Read]: This paper addresses the difficulty existing video inpainting methods have in maintaining spatiotemporal consistency under severe content degradation, in particular their weak control over the latter part of a video. The core idea is to decouple video inpainting into two sub-tasks, multi-frame-consistent image inpainting and masked-area motion propagation, and to propose VidSplice, which introduces spaced-frame priors as spatiotemporal guidance. The key innovations are a CoSpliced Module implementing a first-frame propagation strategy that diffuses initial-frame content into subsequent reference frames through a splicing mechanism to enhance spatial coherence, and a context controller module that encodes coherent priors after frame duplication and injects the spliced video into an image-to-video generative backbone, effectively suppressing content distortion and improving both foreground alignment and motion stability.

Link: https://arxiv.org/abs/2510.21461
Authors: Ming Xie, Junqiu Yu, Qiaole Dong, Xiangyang Xue, Yanwei Fu
Institutions: Fudan University; Shanghai Innovation Institute
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages

Abstract:Recent video inpainting methods often employ image-to-video (I2V) priors to model temporal consistency across masked frames. While effective in moderate cases, these methods struggle under severe content degradation and tend to overlook spatiotemporal stability, resulting in insufficient control over the latter parts of the video. To address these limitations, we decouple video inpainting into two sub-tasks: multi-frame consistent image inpainting and masked area motion propagation. We propose VidSplice, a novel framework that introduces spaced-frame priors to guide the inpainting process with spatiotemporal cues. To enhance spatial coherence, we design a CoSpliced Module to perform a first-frame propagation strategy that diffuses the initial frame content into subsequent reference frames through a splicing mechanism. Additionally, we introduce a delicate context controller module that encodes coherent priors after frame duplication and injects the spliced video into the I2V generative backbone, effectively constraining content distortion during generation. Extensive evaluations demonstrate that VidSplice achieves competitive performance across diverse video inpainting scenarios. Moreover, its design significantly improves both foreground alignment and motion stability, outperforming existing approaches.
zh

[CV-28] MoniTor: Exploiting Large Language Models with Instruction for Online Video Anomaly Detection NEURIPS2025

[Quick Read]: This paper addresses the relative neglect of online video anomaly detection (online VAD), which is constrained by real-time requirements and computational cost while most existing work targets the offline setting. The key is MoniTor, a training-free, memory-based online scoring queue scheme: it feeds streaming input to pretrained vision-language models (VLMs); introduces an LSTM-inspired prediction mechanism to model temporal dependencies, so past states and previous predictions inform the understanding of the current frame; and designs a scoring queue plus an anomaly prior that dynamically store recent scores and cover the anomalies of the monitoring scenario, guiding large language models (LLMs) to distinguish normal from abnormal behavior over time without any training.

Link: https://arxiv.org/abs/2510.21449
Authors: Shengtian Yang, Yue Feng, Yingshi Liu, Jingrou Zhang, Jie Qin
Institutions: Nanjing University of Aeronautics and Astronautics; Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS 2025. The first two authors contributed equally

Abstract:Video Anomaly Detection (VAD) aims to locate unusual activities or behaviors within videos. Recently, offline VAD has garnered substantial research attention, invigorated by progress in large language models (LLMs) and vision-language models (VLMs), offering the potential for a more nuanced understanding of anomalies. However, online VAD has seldom received attention due to real-time constraints and computational intensity. In this paper, we introduce a novel Memory-based online scoring queue scheme for Training-free VAD (MoniTor), to address the inherent complexities in online VAD. Specifically, MoniTor applies a streaming input to VLMs, leveraging the capabilities of pre-trained large-scale models. To capture temporal dependencies more effectively, we incorporate a novel prediction mechanism inspired by Long Short-Term Memory (LSTM) networks. This ensures that the model can effectively represent past states and leverage previous predictions to identify anomalous behaviors, thereby better understanding the current frame. Moreover, we design a scoring queue and an anomaly prior that dynamically store recent scores and cover all anomalies in the monitoring scenario, providing guidance for LLMs to distinguish between normal and abnormal behaviors over time. We evaluate MoniTor on two large datasets (i.e., UCF-Crime and XD-Violence) containing various surveillance and real-world scenarios. The results demonstrate that MoniTor outperforms state-of-the-art methods and is competitive with weakly supervised methods without training. Code is available at this https URL.
zh
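A toy version of the scoring-queue idea is shown below: recent per-frame scores sit in a fixed-size queue, an exponential memory summarizes the past, and the current frame is flagged when it departs from that history. The specific blending and threshold rule are assumptions for illustration only, not MoniTor's actual logic.

```python
# Toy online scoring queue: a bounded deque of recent VLM-derived anomaly
# scores plus an LSTM-inspired exponential memory of past states.
from collections import deque

class ScoringQueue:
    def __init__(self, maxlen=32, momentum=0.9):
        self.scores = deque(maxlen=maxlen)   # short-term history
        self.memory = 0.0                    # long-term running state
        self.momentum = momentum

    def update(self, frame_score: float) -> bool:
        # Blend the new score with the remembered past state.
        self.memory = self.momentum * self.memory + (1 - self.momentum) * frame_score
        self.scores.append(frame_score)
        baseline = sum(self.scores) / len(self.scores)
        # Flag an anomaly when the frame departs from recent history
        # (illustrative rule, not the paper's).
        return frame_score > baseline + 0.5 * abs(frame_score - self.memory)
```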

[CV-29] PhysWorld: From Real Videos to World Models of Deformable Objects via Physics-Aware Demonstration Synthesis

[Quick Read]: This paper addresses the difficulty of learning physics-consistent dynamics models from limited real-world video, especially for deformable objects with spatially varying physical properties. The key of the proposed PhysWorld framework is: first, constructing a physics-consistent digital twin inside an MPM (Material Point Method) simulator via constitutive-model selection and global-to-local optimization of physical properties; second, applying part-aware perturbations to those properties to generate diverse motion patterns and thereby synthesize extensive, varied demonstrations; and finally, training a lightweight GNN-based world model embedded with physical properties on these demonstrations, with real video used to further refine the physical parameters. The result is accurate, fast future-state prediction and good generalization to novel interactions.

Link: https://arxiv.org/abs/2510.21447
Authors: Yu Yang, Zhilu Zhang, Xiang Zhang, Yihan Zeng, Hui Li, Wangmeng Zuo
Institutions: Harbin Institute of Technology; Huawei Noah’s Ark Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 17 pages, 5 figures

Abstract:Interactive world models that simulate object dynamics are crucial for robotics, VR, and AR. However, it remains a significant challenge to learn physics-consistent dynamics models from limited real-world video data, especially for deformable objects with spatially-varying physical properties. To overcome the challenge of data scarcity, we propose PhysWorld, a novel framework that utilizes a simulator to synthesize physically plausible and diverse demonstrations to learn efficient world models. Specifically, we first construct a physics-consistent digital twin within MPM simulator via constitutive model selection and global-to-local optimization of physical properties. Subsequently, we apply part-aware perturbations to the physical properties and generate various motion patterns for the digital twin, synthesizing extensive and diverse demonstrations. Finally, using these demonstrations, we train a lightweight GNN-based world model that is embedded with physical properties. The real video can be used to further refine the physical properties. PhysWorld achieves accurate and fast future predictions for various deformable objects, and also generalizes well to novel interactions. Experiments show that PhysWorld has competitive performance while enabling inference speeds 47 times faster than the recent state-of-the-art method, i.e., PhysTwin.
zh

[CV-30] OpenHype: Hyperbolic Embeddings for Hierarchical Open-Vocabulary Radiance Fields

[Quick Read]: This paper addresses the problem of modeling hierarchical structure in 3D scenes, in particular how to capture multi-scale structural information within implicit representations such as Neural Radiance Fields. Existing methods typically rely on explicit, discrete hierarchies and suffer either slow inference or poor generalization. The key of OpenHype is to encode scene hierarchy in a continuous hyperbolic latent space: hyperbolic geometry naturally models nested multi-scale relationships, and geodesic paths in the latent space allow smooth traversal across hierarchy levels, yielding efficient inference together with better adaptability to complex real-world scenes.

Link: https://arxiv.org/abs/2510.21441
Authors: Lisa Weijler, Sebastian Koch, Fabio Poiesi, Timo Ropinski, Pedro Hermosilla
Institutions: TU Wien; Ulm University; Fondazione Bruno Kessler
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Modeling the inherent hierarchical structure of 3D objects and 3D scenes is highly desirable, as it enables a more holistic understanding of environments for autonomous agents. Accomplishing this with implicit representations, such as Neural Radiance Fields, remains an unexplored challenge. Existing methods that explicitly model hierarchical structures often face significant limitations: they either require multiple rendering passes to capture embeddings at different levels of granularity, significantly increasing inference time, or rely on predefined, closed-set discrete hierarchies that generalize poorly to the diverse and nuanced structures encountered by agents in the real world. To address these challenges, we propose OpenHype, a novel approach that represents scene hierarchies using a continuous hyperbolic latent space. By leveraging the properties of hyperbolic geometry, OpenHype naturally encodes multi-scale relationships and enables smooth traversal of hierarchies through geodesic paths in latent space. Our method outperforms state-of-the-art approaches on standard benchmarks, demonstrating superior efficiency and adaptability in 3D scene understanding.
zh
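The advantage of a hyperbolic latent space is easiest to see in the Poincaré-ball distance, where points near the boundary are exponentially far apart, leaving room for fine-grained children under every coarse parent near the origin. The snippet below is textbook hyperbolic geometry, not OpenHype's code.

```python
# Poincaré-ball distance: the standard closed form for two points strictly
# inside the unit ball. Coarse nodes sit near the origin, fine-grained nodes
# near the rim.
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    uu, vv = np.dot(u, u), np.dot(v, v)
    duv = np.dot(u - v, u - v)
    arg = 1 + 2 * duv / ((1 - uu) * (1 - vv) + eps)
    return np.arccosh(arg)

u = np.array([0.05, 0.0])   # coarse concept near the origin
v = np.array([0.90, 0.0])   # fine-grained concept near the boundary
print(poincare_distance(u, v))
```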

[CV-31] Anisotropic Pooling for LUT-realizable CNN Image Restoration

[Quick Read]: This paper considers look-up-table (LUT) realizations of image-restoration CNNs and asks how to manage table size without overly restricting the receptive field. The prevailing strategy reuses the table on small pixel patches of different orientations and fuses the look-up results by average pooling, which the authors find insensitive to anisotropic signal structure. The key of the solution is anisotropic pooling: first, generalized median pooling, which already outperforms averaging; and further, learned data-dependent pooling coefficients per orientation, letting the model adaptively weigh the contributions of differently oriented patches and yielding perceptually and numerically superior restoration over existing LUT-realizable CNN methods.

Link: https://arxiv.org/abs/2510.21437
Authors: Xi Zhang, Xiaolin Wu
Institutions: ANGEL Lab, Nanyang Technological University; School of Computing and Artificial Intelligence, Southwest Jiaotong University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Table look-up realization of image restoration CNNs has the potential of achieving competitive image quality while being much faster and more resource-frugal than the straightforward CNN implementation. The main technical challenge facing LUT-based CNN algorithm designers is to manage the table size without overly restricting the receptive field. The prevailing strategy is to reuse the table for small pixel patches of different orientations (apparently assuming a degree of isotropy) and then fuse the look-up results. The fusion is currently done by average pooling, which we find ill-suited to anisotropic signal structures. To alleviate the problem, we investigate and discuss anisotropic pooling methods to replace naive averaging for improving the performance of current LUT-realizable CNN restoration methods. First, we introduce generalized median pooling, which leads to measurable gains over average pooling. We then extend this idea by learning data-dependent pooling coefficients for each orientation, so that they can adaptively weigh the contributions of differently oriented pixel patches. Experimental results on various restoration benchmarks show that our anisotropic pooling strategy yields both perceptually and numerically superior results compared to existing LUT-realizable CNN methods.
zh
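The pooling variants compare directly in code. Below, per-orientation LUT outputs stacked as (B, K, H, W) are fused by averaging, by a median (one simple form of generalized median pooling), and by learned per-orientation weights. Note that the paper's coefficients are data-dependent; the static softmax weights here are a simplification, and the tensor layout is an assumption.

```python
# Fusing K rotated-patch LUT outputs: average, median, and learned weights.
import torch
import torch.nn as nn

def average_pool(y):            # y: (B, K, H, W) per-orientation outputs
    return y.mean(dim=1)

def median_pool(y):             # robust to one badly oriented estimate
    return y.median(dim=1).values

class LearnedOrientationPool(nn.Module):
    def __init__(self, k=4):
        super().__init__()
        # Static per-orientation logits; the paper predicts these per input.
        self.logits = nn.Parameter(torch.zeros(k))

    def forward(self, y):
        w = torch.softmax(self.logits, dim=0).view(1, -1, 1, 1)
        return (w * y).sum(dim=1)   # weighted fusion of oriented look-ups
```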

[CV-32] ArtiLatent: Realistic Articulated 3D Object Generation via Structured Latents SIGGRAPH

[Quick Read]: This paper addresses the difficulty of jointly generating fine-grained geometry, articulation mechanics, and realistic appearance for human-made 3D objects. The key of ArtiLatent is to embed sparse voxel representations together with articulation properties (joint type, axis, origin, range, and part category) into a unified latent space via a variational autoencoder (VAE), then train a latent diffusion model over this space for diverse yet physically plausible sampling. An articulation-aware Gaussian decoder further conditions appearance decoding on the articulation state, accounting for articulation-dependent visibility changes (e.g., a drawer interior becoming visible when opened) and assigning plausible texture features to regions usually occluded in static poses, which markedly improves visual realism across articulation configurations.

Link: https://arxiv.org/abs/2510.21432
Authors: Honghua Chen, Yushi Lan, Yongwei Chen, Xingang Pan
Institutions: S-Lab, Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: accepted to SIGGRAPH Asia; Project page: this https URL

Abstract:We propose ArtiLatent, a generative framework that synthesizes human-made 3D objects with fine-grained geometry, accurate articulation, and realistic appearance. Our approach jointly models part geometry and articulation dynamics by embedding sparse voxel representations and associated articulation properties, including joint type, axis, origin, range, and part category, into a unified latent space via a variational autoencoder. A latent diffusion model is then trained over this space to enable diverse yet physically plausible sampling. To reconstruct photorealistic 3D shapes, we introduce an articulation-aware Gaussian decoder that accounts for articulation-dependent visibility changes (e.g., revealing the interior of a drawer when opened). By conditioning appearance decoding on articulation state, our method assigns plausible texture features to regions that are typically occluded in static poses, significantly improving visual realism across articulation configurations. Extensive experiments on furniture-like objects from PartNet-Mobility and ACD datasets demonstrate that ArtiLatent outperforms existing approaches in geometric consistency and appearance fidelity. Our framework provides a scalable solution for articulated 3D object synthesis and manipulation.
zh

[CV-33] Bridging the gap to real-world language-grounded visual concept learning

[Quick Read]: This paper addresses the limitation of existing language-grounded visual concept learning methods, which are restricted to a few predefined semantic axes (such as color and shape) and struggle to scale to diverse, complex real-world concepts. The key is a scalable framework that adaptively identifies image-related semantic axes using a pretrained vision-language model with a universal prompting strategy, and grounds visual features to those axes through a universal concept encoder that adds no per-concept parameters. A compositional anchoring objective ensures each axis can be manipulated independently without affecting the others, enabling effective editing and compositional generalization over diverse real-world concepts.

Link: https://arxiv.org/abs/2510.21412
Authors: Whie Jung, Semin Kim, Junee Kim, Seunghoon Hong
Institutions: KAIST
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Human intelligence effortlessly interprets visual scenes along a rich spectrum of semantic dimensions. However, existing approaches to language-grounded visual concept learning are limited to a few predefined primitive axes, such as color and shape, and are typically explored in synthetic datasets. In this work, we propose a scalable framework that adaptively identifies image-related concept axes and grounds visual concepts along these axes in real-world scenes. Leveraging a pretrained vision-language model and our universal prompting strategy, our framework identifies diverse image-related axes without any prior knowledge. Our universal concept encoder adaptively binds visual features to the discovered axes without introducing additional model parameters for each concept. To ground visual concepts along the discovered axes, we optimize a compositional anchoring objective, which ensures that each axis can be independently manipulated without affecting others. We demonstrate the effectiveness of our framework on subsets of ImageNet, CelebA-HQ, and AFHQ, showcasing superior editing capabilities across diverse real-world concepts that are too varied to be manually predefined. Our method also exhibits strong compositional generalization, outperforming existing visual concept learning and text-based editing methods. The code is available at this https URL.
zh

[CV-34] MUVR: A Multi-Modal Untrimmed Video Retrieval Benchmark with Multi-Level Visual Correspondence NEURIPS2025

[Quick Read]: This paper addresses multi-modal retrieval of untrimmed videos on long-video platforms: locating long videos that contain relevant segments from multi-modal queries such as long text descriptions, video tag prompts, or mask prompts. The key contributions are the new Multi-modal Untrimmed Video Retrieval (MUVR) task and its first benchmark: 1) a practical, video-centric multi-modal query paradigm with one-to-many retrieval, tailored to long-video platforms; 2) six levels of visual correspondence (copy, event, scene, instance, action, others) built around the core video content users want to retrieve, covering common video categories and precisely defining matching criteria; and 3) three evaluation versions (Base/Filter/QA) plus a Reranking Score, jointly assessing retrieval models and MLLMs on multi-modal understanding, multi-video understanding, and reranking. The benchmark exposes the limits of current retrieval methods on untrimmed videos and complex multi-modal queries.

Link: https://arxiv.org/abs/2510.21406
Authors: Yue Feng, Jinwei Hu, Qijia Lu, Jiawei Niu, Li Tan, Shuo Yuan, Ziyi Yan, Yizhen Jia, Qingzhi He, Shiping Ge, Ethan Q. Chen, Wentong Li, Limin Wang, Jie Qin
Institutions: Nanjing University of Aeronautics and Astronautics; Nanjing University; The Hong Kong Polytechnic University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS 2025 DB Track

Abstract:We propose the Multi-modal Untrimmed Video Retrieval task, along with a new benchmark (MUVR) to advance video retrieval for long-video platforms. MUVR aims to retrieve untrimmed videos containing relevant segments using multi-modal queries. It has the following features: 1) Practical retrieval paradigm: MUVR supports video-centric multi-modal queries, expressing fine-grained retrieval needs through long text descriptions, video tag prompts, and mask prompts. It adopts a one-to-many retrieval paradigm and focuses on untrimmed videos, tailored for long-video platform applications. 2) Multi-level visual correspondence: To cover common video categories (e.g., news, travel, dance) and precisely define retrieval matching criteria, we construct multi-level visual correspondence based on core video content (e.g., news events, travel locations, dance moves) which users are interested in and want to retrieve. It covers six levels: copy, event, scene, instance, action, and others. 3) Comprehensive evaluation criteria: We develop 3 versions of MUVR (i.e., Base, Filter, QA). MUVR-Base/Filter evaluates retrieval models, while MUVR-QA assesses MLLMs in a question-answering format. We also propose a Reranking Score to evaluate the reranking ability of MLLMs. MUVR consists of 53K untrimmed videos from the video platform Bilibili, with 1,050 multi-modal queries and 84K matches. Extensive evaluations of 3 state-of-the-art video retrieval models, 6 image-based VLMs, and 10 MLLMs are conducted. MUVR reveals the limitations of retrieval methods in processing untrimmed videos and multi-modal queries, as well as MLLMs in multi-video understanding and reranking. Our code and benchmark are available at this https URL.
zh

[CV-35] Disentangled Representation Learning via Modular Compositional Bias

[Quick Read]: This paper addresses the inflexibility of current disentangled representation learning (DRL), whose factor-specific strategies (loss objectives for attributes, model architectures for objects) must be redesigned at significant cost whenever new factors of variation violate prior assumptions such as statistical independence or spatial exclusivity, or when multiple factors coexist. The key is a compositional bias: a modular inductive bias decoupled from both objectives and architectures. The core insight is that different factors obey different recombination rules in the data distribution: global attributes (e.g., facial features) are mutually exclusive (a face has one nose), while objects share a common support (any subset of objects can co-exist). Latents are therefore randomly remixed according to factor-specific rules (a "mixing strategy"), and the encoder is forced to discover the factor structure that the mixing strategy reflects through two complementary objectives: (i) a prior loss ensuring every remix decodes into a realistic image, and (ii) a compositional consistency loss aligning each composite image with its corresponding composite latent. Within this framework, simply adjusting the mixing strategy disentangles attributes, objects, or even both jointly, without modifying objectives or architectures.

Link: https://arxiv.org/abs/2510.21402
Authors: Whie Jung, Dong Hoon Lee, Seunghoon Hong
Institutions: KAIST
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent disentangled representation learning (DRL) methods heavily rely on factor-specific strategies, either learning objectives for attributes or model architectures for objects, to embed inductive biases. Such divergent approaches result in significant overhead when novel factors of variation do not align with prior assumptions, such as statistical independence or spatial exclusivity, or when multiple factors coexist, as practitioners must redesign architectures or objectives. To address this, we propose a compositional bias, a modular inductive bias decoupled from both objectives and architectures. Our key insight is that different factors obey distinct recombination rules in the data distribution: global attributes are mutually exclusive, e.g., a face has one nose, while objects share a common support (any subset of objects can co-exist). We therefore randomly remix latents according to factor-specific rules, i.e., a mixing strategy, and force the encoder to discover whichever factor structure the mixing strategy reflects through two complementary objectives: (i) a prior loss that ensures every remix decodes into a realistic image, and (ii) the compositional consistency loss introduced by Wiedemer et al. (arXiv:2310.05327), which aligns each composite image with its corresponding composite latent. Under this general framework, simply adjusting the mixing strategy enables disentanglement of attributes, objects, and even both, without modifying the objectives or architectures. Extensive experiments demonstrate that our method shows competitive performance in both attribute and object disentanglement, and uniquely achieves joint disentanglement of global style and objects. Code is available at this https URL.
zh
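The two recombination rules translate into very small mixing functions: attribute slots are swapped one-for-one (mutual exclusivity), while object slots are merged as a union (shared support). The slot layout below is an illustrative assumption, not the paper's exact latent structure.

```python
# Factor-specific latent mixing rules, sketched over slot tensors of shape
# (num_slots, d): one function per recombination rule.
import torch

def mix_attributes(z_a, z_b):
    # Each slot holds one exclusive global attribute: swap, never duplicate.
    swap = torch.rand(z_a.shape[0]) < 0.5
    out = z_a.clone()
    out[swap] = z_b[swap]
    return out

def mix_objects(z_a, z_b, keep_a=0.5, keep_b=0.5):
    # Objects share a common support: any subset may co-exist, so take a union.
    pick_a = z_a[torch.rand(z_a.shape[0]) < keep_a]
    pick_b = z_b[torch.rand(z_b.shape[0]) < keep_b]
    return torch.cat([pick_a, pick_b], dim=0)
```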

[CV-36] Depth-Supervised Fusion Network for Seamless-Free Image Stitching NEURIPS2025

[Quick Read]: This paper addresses the large parallax caused by object-depth differences in multi-view image stitching, which produces ghosting and misalignment in the stitched result. The key is a depth-consistency-constrained, seamless stitching method: first, a multi-stage mechanism combined with a global depth regularization constraint improves alignment accuracy for the same apparent target across different depth ranges; second, during image fusion, an optimal seam is determined through low-cost graph-based computation and a soft-seam region is diffused to precisely locate the transition area, effectively mitigating parallax-induced alignment errors and yielding natural, seamless results; in addition, a reparameterization strategy optimizes the structural design, significantly reducing the computational overhead of the shift-regression process while maintaining performance.

Link: https://arxiv.org/abs/2510.21396
Authors: Zhiying Jiang, Ruhao Yan, Zengxi Zhang, Bowei Zhang, Jinyuan Liu
Institutions: Dalian Maritime University; Dalian University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS 2025

Abstract:Image stitching synthesizes images captured from multiple perspectives into a single image with a broader field of view. The significant variations in object depth often lead to large parallax, resulting in ghosting and misalignment in the stitched results. To address this, we propose a depth-consistency-constrained seamless-free image stitching method. First, to tackle the multi-view alignment difficulties caused by parallax, a multi-stage mechanism combined with global depth regularization constraints is developed to enhance the alignment accuracy of the same apparent target across different depth ranges. Second, during the multi-view image fusion process, an optimal stitching seam is determined through graph-based low-cost computation, and a soft-seam region is diffused to precisely locate transition areas, thereby effectively mitigating alignment errors induced by parallax and achieving natural and seamless stitching results. Furthermore, considering the computational overhead in the shift regression process, a reparameterization strategy is incorporated to optimize the structural design, significantly improving algorithm efficiency while maintaining optimal performance. Extensive experiments demonstrate the superior performance of the proposed method against the existing methods. Code is available at this https URL.
zh
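The low-cost seam search the paper builds on can be illustrated with the classic dynamic-programming seam over a per-pixel disagreement map; the paper's actual cost design and soft-seam diffusion are not reproduced here.

```python
# Classic DP seam: find the minimum-cost top-to-bottom path through a cost
# map (e.g., per-pixel disagreement between two aligned images in the overlap).
import numpy as np

def find_seam(cost):
    # cost: (H, W) float array
    H, W = cost.shape
    dp = cost.astype(float).copy()
    for y in range(1, H):
        for x in range(W):
            lo, hi = max(0, x - 1), min(W, x + 2)
            dp[y, x] += dp[y - 1, lo:hi].min()   # best predecessor above
    seam = [int(np.argmin(dp[-1]))]
    for y in range(H - 2, -1, -1):               # backtrack the cheapest path
        x = seam[-1]
        lo, hi = max(0, x - 1), min(W, x + 2)
        seam.append(lo + int(np.argmin(dp[y, lo:hi])))
    return seam[::-1]   # seam column index for each row
```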

[CV-37] rraG en: A Unified Multi-Task Layout Generation Framework for Remote Sensing Data Augmentation

[Quick Read]: This paper addresses the scarcity of cross-domain annotated data in remote-sensing vision and the limitations of existing generative augmentation frameworks, which are task-isolated and ignore geospatial constraints. The key of TerraGen is a unified layout-to-image generation framework enabling flexible, spatially controllable image synthesis for multiple high-level remote-sensing tasks (e.g., detection, segmentation, and extraction): a geographic-spatial layout encoder unifies bounding-box and segmentation-mask inputs, and a multi-scale injection scheme with a mask-weighted loss explicitly encodes spatial constraints from global structure down to fine detail. The authors also build the first large-scale multi-task remote-sensing layout generation dataset (45k images) along with a standardized evaluation protocol.

Link: https://arxiv.org/abs/2510.21391
Authors: Datao Tang, Hao Wang, Yudeng Xin, Hui Qiao, Dongsheng Jiang, Yin Li, Zhiheng Yu, Xiangyong Cao
Institutions: Xi’an Jiaotong University; University of Melbourne; China Telecom Shaanxi Branch; Huawei Technologies Ltd
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Remote sensing vision tasks require extensive labeled data across multiple, interconnected domains. However, current generative data augmentation frameworks are task-isolated, i.e., each vision task requires training an independent generative model, and ignore the modeling of geographical information and spatial constraints. To address these issues, we propose TerraGen, a unified layout-to-image generation framework that enables flexible, spatially controllable synthesis of remote sensing imagery for various high-level vision tasks, e.g., detection, segmentation, and extraction. Specifically, TerraGen introduces a geographic-spatial layout encoder that unifies bounding box and segmentation mask inputs, combined with a multi-scale injection scheme and mask-weighted loss to explicitly encode spatial constraints, from global structures to fine details. Also, we construct the first large-scale multi-task remote sensing layout generation dataset containing 45k images and establish a standardized evaluation protocol for this task. Experimental results show that our TerraGen can achieve the best generation image quality across diverse tasks. Additionally, TerraGen can be used as a universal data-augmentation generator, enhancing downstream task performance significantly and demonstrating robust cross-task generalisation in both full-data and few-shot scenarios.
zh

[CV-38] BADiff: Bandwidth Adaptive Diffusion Model NEURIPS2025

[Quick Read]: This paper addresses the image-quality degradation diffusion models suffer in cloud-to-device delivery under limited network bandwidth: conventional diffusion models run a fixed number of denoising steps regardless of transmission conditions, so low-bandwidth links force heavy compression, losing fine texture and wasting computation. The key is a joint end-to-end training strategy that conditions the diffusion model on a target quality level derived from the available bandwidth: a lightweight quality embedding guides the denoising trajectory, and the model learns to modulate the denoising process adaptively, enabling perceptually aware early-stop sampling that maintains visual fidelity appropriate to the transmission condition while significantly improving efficiency in bandwidth-constrained environments.

Link: https://arxiv.org/abs/2510.21366
Authors: Xi Zhang, Hanwei Zhu, Yan Zhong, Jiamang Wang, Weisi Lin
Institutions: Nanyang Technological University; Alibaba Group
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: NeurIPS 2025 Poster

Abstract:In this work, we propose a novel framework to enable diffusion models to adapt their generation quality based on real-time network bandwidth constraints. Traditional diffusion models produce high-fidelity images by performing a fixed number of denoising steps, regardless of downstream transmission limitations. However, in practical cloud-to-device scenarios, limited bandwidth often necessitates heavy compression, leading to loss of fine textures and wasted computation. To address this, we introduce a joint end-to-end training strategy where the diffusion model is conditioned on a target quality level derived from the available bandwidth. During training, the model learns to adaptively modulate the denoising process, enabling early-stop sampling that maintains perceptual quality appropriate to the target transmission condition. Our method requires minimal architectural changes and leverages a lightweight quality embedding to guide the denoising trajectory. Experimental results demonstrate that our approach significantly improves the visual fidelity of bandwidth-adapted generations compared to naive early-stopping, offering a promising solution for efficient image delivery in bandwidth-constrained environments. Code is available at: this https URL.
zh
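A sketch of the bandwidth-to-quality plumbing follows: available bandwidth picks an early-stop step count and a coarse quality level, whose embedding conditions the denoiser. The mapping, bucket boundaries, and denoiser signature are assumptions, not the paper's schedule.

```python
# Bandwidth-adaptive early-stop sampling, sketched with a hypothetical
# denoiser `model(x, t, cond, q)` that accepts a quality embedding q.
import torch
import torch.nn as nn

quality_emb = nn.Embedding(4, 256)        # 4 coarse quality levels

def steps_for_bandwidth(mbps: float, max_steps: int = 50) -> int:
    # Fewer denoising steps when the link cannot carry fine texture anyway.
    if mbps < 1.0:
        return max_steps // 4
    if mbps < 5.0:
        return max_steps // 2
    return max_steps

def sample(model, x_T, cond, mbps):
    steps = steps_for_bandwidth(mbps)
    level = torch.tensor([0 if steps <= 12 else 1 if steps <= 25 else 3])
    q = quality_emb(level)                # conditions the denoiser on quality
    x = x_T
    for t in reversed(range(steps)):      # early-stopped trajectory
        x = model(x, t, cond, q)
    return x
```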

[CV-39] Why Registration Quality Matters: Enhancing sCT Synthesis with IMPACT-Based Registration

[Quick Read]: This paper addresses synthetic CT (sCT) generation from MRI and CBCT, where the central challenge is improving the structural fidelity and anatomical consistency of the sCT. The key elements are: a unified 2.5D U-Net++ architecture with a ResNet-34 encoder; the IMPACT-Synth perceptual loss, which draws features from the pretrained SAM and TotalSegmentator segmentation networks to reinforce structural detail; and, notably, an IMPACT registration strategy that replaces conventional mutual-information registration with feature-space similarity, achieving more accurate image alignment and reducing the propagation of registration errors into supervised training, thereby improving generalization and anatomical plausibility in realistic settings.

Link: https://arxiv.org/abs/2510.21358
Authors: Valentin Boussot, Cédric Hémon, Jean-Claude Nunes, Jean-Louis Dillenseger
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Paper for the SynthRAD2025 challenge, Team BreizhCT

Abstract:We participated in the SynthRAD2025 challenge (Tasks 1 and 2) with a unified pipeline for synthetic CT (sCT) generation from MRI and CBCT, implemented using the KonfAI framework. Our model is a 2.5D U-Net++ with a ResNet-34 encoder, trained jointly across anatomical regions and fine-tuned per region. The loss function combined pixel-wise L1 loss with IMPACT-Synth, a perceptual loss derived from SAM and TotalSegmentator to enhance structural fidelity. Training was performed using AdamW (initial learning rate = 0.001, halved every 25k steps) on patch-based, normalized, body-masked inputs (320x320 for MRI, 256x256 for CBCT), with random flipping as the only augmentation. No post-processing was applied. Final predictions leveraged test-time augmentation and five-fold ensembling. The best model was selected based on validation MAE. Two registration strategies were evaluated: (i) Elastix with mutual information, consistent with the challenge pipeline, and (ii) IMPACT, a feature-based similarity metric leveraging pretrained segmentation networks. On the local test sets, IMPACT-based registration achieved more accurate and anatomically consistent alignments than mutual-information-based registration, resulting in improved sCT synthesis with lower MAE and more realistic anatomical structures. On the public validation set, however, models trained with Elastix-aligned data achieved higher scores, reflecting a registration bias favoring alignment strategies consistent with the evaluation pipeline. This highlights how registration errors can propagate into supervised learning, influencing both training and evaluation, and potentially inflating performance metrics at the expense of anatomical fidelity. By promoting anatomically consistent alignment, IMPACT helps mitigate this bias and supports the development of more robust and generalizable sCT synthesis models.
zh

[CV-40] Gaze-VLM: Bridging Gaze and VLMs through Attention Regularization for Egocentric Understanding

[Quick Read]: This paper asks how to improve the accuracy of vision-language models (VLMs) on two egocentric tasks, fine-grained future event prediction and current activity understanding, given that existing methods either rely on visual input alone or treat eye gaze as an auxiliary input signal and thus fail to exploit the attentional cues embedded in human gaze. The key is a gaze-regularized framework that uses gaze only during training: a modular, plug-in gaze-regularized attention mechanism aligns the regions the model attends to with human visual focus, enhancing semantic understanding and prediction. Because the design does not change inference-time inputs, it generalizes across VLM architectures that use attention; experiments show gains of up to 11 points on future event prediction and around 7 points on activity understanding.

Link: https://arxiv.org/abs/2510.21356
Authors: Anupam Pani, Yanchao Yang
Institutions: HKU Musketeers Foundation Institute of Data Science, The University of Hong Kong; Department of Electrical and Electronic Engineering, The University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Eye gaze offers valuable cues about attention, short-term intent, and future actions, making it a powerful signal for modeling egocentric behavior. In this work, we propose a gaze-regularized framework that enhances VLMs for two key egocentric understanding tasks: fine-grained future event prediction and current activity understanding. Unlike prior approaches that rely solely on visual inputs or use gaze as an auxiliary input signal, our method uses gaze only during training. We introduce a gaze-regularized attention mechanism that aligns model focus with human visual gaze. This design is flexible and modular, allowing it to generalize across multiple VLM architectures that utilize attention. Experimental results show that our approach improves semantic prediction scores by up to 11 points for future event prediction and around 7 points for current activity understanding, compared to the corresponding baseline models trained without gaze regularization. These results highlight the value of gaze-guided training in improving the accuracy and robustness of egocentric VLMs. Overall, this work establishes a foundation for using human gaze to enhance the predictive capabilities of VLMs in real-world scenarios like assistive robots and human-machine collaboration. Code and additional information are available at: this https URL
zh
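A gaze-regularization term of this kind reduces to a divergence between the model's patch attention and a human gaze heatmap, applied only at training time. The sketch below uses a KL penalty; the paper's exact formulation may differ.

```python
# Gaze-regularized attention loss: push the model's attention distribution
# over P image patches toward a (training-time-only) human gaze heatmap.
import torch
import torch.nn.functional as F

def gaze_regularization(attn, gaze, eps=1e-8):
    # attn: (B, P) attention weights; gaze: (B, P) rasterized gaze heatmap
    attn = attn / (attn.sum(-1, keepdim=True) + eps)
    gaze = gaze / (gaze.sum(-1, keepdim=True) + eps)
    # KL(gaze || attn): zero when the model attends exactly where humans look.
    return F.kl_div(attn.clamp_min(eps).log(), gaze, reduction='batchmean')

# Training objective (lambda_gaze is a hypothetical weighting):
# total_loss = task_loss + lambda_gaze * gaze_regularization(attn, gaze)
```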

[CV-41] Dynamic Semantic-Aware Correlation Modeling for UAV Tracking NEURIPS2025

[Quick Read]: This paper addresses the lack of semantic awareness in existing UAV tracking methods, which prevents the search region from extracting accurate localization information from the template and degrades performance under typical UAV challenges such as camera motion, fast motion, and low resolution. The key is a dynamic semantic-aware correlation modeling framework whose core, the Dynamic Semantic Relevance Generator, explores semantic relevance in combination with the correlation map from the Transformer, improving the search region's ability to extract important information from the template and thus accuracy and robustness. To preserve tracking speed, a pruning method is further designed for the framework, yielding multiple model variants that trade speed against accuracy for flexible deployment under different computational budgets.

Link: https://arxiv.org/abs/2510.21351
Authors: Xinyu Zhou, Tongxin Pan, Lingyi Hong, Pinxue Guo, Haijing Guo, Zhaoyu Chen, Kaixun Jiang, Wenqiang Zhang
Institutions: Fudan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2025

Abstract:UAV tracking can be widely applied in scenarios such as disaster rescue, environmental monitoring, and logistics transportation. However, existing UAV tracking methods predominantly emphasize speed and lack exploration in semantic awareness, which hinders the search region from extracting accurate localization information from the template. The limitation results in suboptimal performance under typical UAV tracking challenges such as camera motion, fast motion, and low resolution, etc. To address this issue, we propose a dynamic semantic-aware correlation modeling tracking framework. The core of our framework is a Dynamic Semantic Relevance Generator, which, in combination with the correlation map from the Transformer, explores semantic relevance. The approach enhances the search region’s ability to extract important information from the template, improving accuracy and robustness under the aforementioned challenges. Additionally, to enhance the tracking speed, we design a pruning method for the proposed framework. Therefore, we present multiple model variants that achieve trade-offs between speed and accuracy, enabling flexible deployment according to the available computational resources. Experimental results validate the effectiveness of our method, achieving competitive performance on multiple UAV tracking datasets. The code is available at this https URL.
zh

[CV-42] CT-CLIP: A Multi-modal Fusion Framework for Robust Apple Leaf Disease Recognition in Complex Environments

[Quick Read]: This paper addresses the recognition difficulty posed by the phenotypic heterogeneity of apple leaf diseases in complex orchard environments, where traditional multi-scale feature fusion struggles to relate local lesion detail to global structure. The key of the proposed multi-branch framework, CNN-Transformer-CLIP (CT-CLIP), is threefold: a CNN extracts local lesion details, a Vision Transformer captures global structural relationships, and an Adaptive Feature Fusion Module (AFFM) dynamically fuses the two for optimal coupling of local and global information, directly handling the diversity of lesion morphology and distribution. In addition, a multimodal image-text learning strategy built on pretrained CLIP weights deeply aligns visual features with disease semantic descriptions, markedly improving accuracy under few-shot conditions and suppressing interference from complex backgrounds.

Link: https://arxiv.org/abs/2510.21346
Authors: Lemin Liu, Fangchao Hu, Honghua Jiang, Yaru Chen, Limin Liu, Yongliang Qiao
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:In complex orchard environments, the phenotypic heterogeneity of different apple leaf diseases, characterized by significant variation among lesions, poses a challenge to traditional multi-scale feature fusion methods. These methods only integrate multi-layer features extracted by convolutional neural networks (CNNs) and fail to adequately account for the relationships between local and global features. Therefore, this study proposes a multi-branch recognition framework named CNN-Transformer-CLIP (CT-CLIP). The framework synergistically employs a CNN to extract local lesion detail features and a Vision Transformer to capture global structural relationships. An Adaptive Feature Fusion Module (AFFM) then dynamically fuses these features, achieving optimal coupling of local and global information and effectively addressing the diversity in lesion morphology and distribution. Additionally, to mitigate interference from complex backgrounds and significantly enhance recognition accuracy under few-shot conditions, this study proposes a multimodal image-text learning approach. By leveraging pre-trained CLIP weights, it achieves deep alignment between visual features and disease semantic descriptions. Experimental results show that CT-CLIP achieves accuracies of 97.38% and 96.12% on a publicly available apple disease dataset and a self-built dataset, outperforming several baseline methods. The proposed CT-CLIP demonstrates strong capabilities in recognizing agricultural diseases, significantly enhances identification accuracy under complex environmental conditions, and provides an innovative and practical solution for automated disease recognition in agricultural applications.
zh

[CV-43] Morphologically Intelligent Perturbation Prediction with FORM

[Quick Read]: This paper addresses the limitation of current frameworks for modeling cellular responses, which are restricted to two-dimensional representations and cannot capture the complexity of perturbation-induced changes in cell morphology, a critical bottleneck for accurate virtual-cell models. The key of FORM (Functional Representation of Morphology), a machine-learning framework for predicting perturbation-induced changes in three-dimensional cellular structure, lies in its two components: a morphology encoder, trained end-to-end via a novel multi-channel VQGAN, that learns compact 3D representations of cell shape; and a diffusion-based perturbation trajectory module that captures how morphology evolves across perturbation conditions. Trained on a large-scale dataset of over 65,000 multi-fluorescence 3D cell volumes spanning diverse chemical and genetic perturbations, FORM supports unconditional morphology synthesis and conditional simulation of perturbed cell states, and can further predict downstream signalling activity, combinatorial perturbation effects, and morphodynamic transitions between unseen perturbation states, linking morphology, perturbation, and function through high-resolution predictive simulation.

Link: https://arxiv.org/abs/2510.21337
Authors: Reed Naidoo, Matt De Vries, Olga Fourkioti, Vicky Bousgouni, Mar Arias-Garcia, Maria Portillo-Malumbres, Chris Bakal
Institutions: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Understanding how cells respond to external stimuli is a central challenge in biomedical research and drug development. Current computational frameworks for modelling cellular responses remain restricted to two-dimensional representations, limiting their capacity to capture the complexity of cell morphology under perturbation. This dimensional constraint poses a critical bottleneck for the development of accurate virtual cell models. Here, we present FORM, a machine learning framework for predicting perturbation-induced changes in three-dimensional cellular structure. FORM consists of two components: a morphology encoder, trained end-to-end via a novel multi-channel VQGAN to learn compact 3D representations of cell shape, and a diffusion-based perturbation trajectory module that captures how morphology evolves across perturbation conditions. Trained on a large-scale dataset of over 65,000 multi-fluorescence 3D cell volumes spanning diverse chemical and genetic perturbations, FORM supports both unconditional morphology synthesis and conditional simulation of perturbed cell states. Beyond generation, FORM can predict downstream signalling activity, simulate combinatorial perturbation effects, and model morphodynamic transitions between states of unseen perturbations. To evaluate performance, we introduce MorphoEval, a benchmarking suite that quantifies perturbation-induced morphological changes in structural, statistical, and biological dimensions. Together, FORM and MorphoEval work toward the realisation of the 3D virtual cell by linking morphology, perturbation, and function through high-resolution predictive simulation.
zh

[CV-44] VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set NEURIPS2025

[Quick Read]: This paper addresses the uninvestigated interpretability of multi-modal representation alignment in vision-language models (VLMs): although current VLMs show strong cross-modal reasoning, their alignment component lacks a concept-level explanation, mainly because it is difficult to map the semantics of multi-modal representations onto a unified concept set. The key is VL-SAE, a sparse autoencoder that encodes vision and language representations into hidden activations in which each neuron corresponds to a concept represented by semantically similar images and texts. To establish this neuron-concept correlation, self-supervised training encourages semantically similar multi-modal representations to exhibit consistent neuron activations: similarity is first measured after explicit cosine-similarity-based alignment, and a distance-based encoder with two modality-specific decoders then enforces activation consistency. VL-SAE can both interpret and strengthen vision-language alignment at the concept level, improving downstream tasks such as zero-shot image classification and hallucination elimination.

Link: https://arxiv.org/abs/2510.21323
Authors: Shufan Shen, Junshu Sun, Qingming Huang, Shuhui Wang
Institutions: Key Lab of Intell. Info. Process., Inst. of Comput. Tech., CAS; University of Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted by NeurIPS 2025

Abstract:The alignment of vision-language representations endows current Vision-Language Models (VLMs) with strong multi-modal reasoning capabilities. However, the interpretability of the alignment component remains uninvestigated due to the difficulty in mapping the semantics of multi-modal representations into a unified concept set. To address this problem, we propose VL-SAE, a sparse autoencoder that encodes vision-language representations into its hidden activations. Each neuron in its hidden layer correlates to a concept represented by semantically similar images and texts, thereby interpreting these representations with a unified concept set. To establish the neuron-concept correlation, we encourage semantically similar representations to exhibit consistent neuron activations during self-supervised training. First, to measure the semantic similarity of multi-modal representations, we perform their alignment in an explicit form based on cosine similarity. Second, we construct the VL-SAE with a distance-based encoder and two modality-specific decoders to ensure the activation consistency of semantically similar representations. Experiments across multiple VLMs (e.g., CLIP, LLaVA) demonstrate the superior capability of VL-SAE in interpreting and enhancing the vision-language alignment. For interpretation, the alignment between vision and language representations can be understood by comparing their semantics with concepts. For enhancement, the alignment can be strengthened by aligning vision-language representations at the concept level, contributing to performance improvements in downstream tasks, including zero-shot image classification and hallucination elimination. Codes are available at this https URL.
zh
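One plausible reading of the distance-based encoder with modality-specific decoders is sketched below: concept anchors live in the shared embedding space, activations fire when an input is close to an anchor, and each modality reconstructs through its own linear decoder. This is an interpretation for illustration, not the paper's exact formulation.

```python
# Sketch of a VL-SAE-style module: shared concept anchors, distance-based
# activation, and per-modality decoders.
import torch
import torch.nn as nn

class VLSAE(nn.Module):
    def __init__(self, d=512, n_concepts=4096):
        super().__init__()
        self.anchors = nn.Parameter(torch.randn(n_concepts, d))
        self.dec_vision = nn.Linear(n_concepts, d, bias=False)
        self.dec_text = nn.Linear(n_concepts, d, bias=False)

    def encode(self, z):
        dist = torch.cdist(z, self.anchors)     # (B, n_concepts)
        return torch.relu(1.0 - dist)           # active when close to an anchor

    def forward(self, z, modality):
        a = self.encode(z)                      # shared concept activations
        dec = self.dec_vision if modality == 'vision' else self.dec_text
        return dec(a), a
```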

[CV-45] FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning NEURIPS2025

[Quick Read]: This paper addresses the difficulty multimodal large language models (MLLMs) face in precisely understanding and localizing extra-small objects in high-resolution images, particularly under cluttered backgrounds where restricted input resolutions hide visual detail. The key is FineRS, a two-stage MLLM-based reinforcement-learning framework with a coarse-to-fine pipeline: Global Semantic Exploration (GSE) performs instruction-guided reasoning to produce a textual response and a coarse target region, and Localized Perceptual Refinement (LPR) refines that region into an accurate bounding box and segmentation mask. To couple the two stages, a locate-informed retrospective reward uses LPR's outputs to optimize GSE, making coarse-region exploration more robust and ultimately enabling attribute-level reasoning and pixel-level segmentation of subtle targets in high-resolution scenes.

Link: https://arxiv.org/abs/2510.21311
Authors: Lu Zhang, Jiazuo Yu, Haomiao Xiong, Ping Hu, Yunzhi Zhuge, Huchuan Lu, You He
Institutions: Dalian University of Technology; University of Electronic Science and Technology of China; Tsinghua Shenzhen International Graduate School
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to NeurIPS 2025

Abstract:Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities across a wide range of vision-language tasks. However, due to the restricted input resolutions, MLLMs face significant challenges in precisely understanding and localizing visual details in high-resolution images, particularly when dealing with extra-small objects embedded in cluttered contexts. To address this issue, we propose FineRS, a two-stage MLLM-based reinforcement learning framework for jointly reasoning and segmenting extremely small objects within high-resolution scenes. FineRS adopts a coarse-to-fine pipeline comprising Global Semantic Exploration (GSE) and Localized Perceptual Refinement (LPR). Specifically, GSE performs instruction-guided reasoning to generate a textual response and a coarse target region, while LPR refines this region to produce an accurate bounding box and segmentation mask. To couple the two stages, we introduce a locate-informed retrospective reward, where LPR’s outputs are used to optimize GSE for more robust coarse region exploration. Additionally, we present FineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation on subtle, small-scale targets in complex high-resolution scenes. Experimental results on FineRS-4k and public datasets demonstrate that our method consistently outperforms state-of-the-art MLLM-based approaches on both instruction-guided segmentation and visual reasoning tasks.

[CV-46] Towards Physically Executable 3D Gaussian for Embodied Navigation

[Quick Read]: This paper addresses the lack of fine-grained semantics and physical executability in 3D Gaussian Splatting (3DGS) for Visual-Language Navigation (VLN). Although 3DGS offers photorealistic real-time rendering, its scene representation cannot support agents' semantic understanding or physical interaction, limiting its use in real-world navigation. The key to the solution is SAGE-3D (Semantically and Physically Aligned Gaussian Environments for 3D Navigation), which upgrades 3DGS into an executable, semantically and physically aligned environment via two components: (1) Object-Centric Semantic Grounding, which adds object-level fine-grained annotations to 3DGS; and (2) Physics-Aware Execution Jointing, which embeds collision objects into 3DGS and constructs rich physical interfaces, making the environment both semantically interpretable and physically actionable.

Link: https://arxiv.org/abs/2510.21307
Authors: Bingchen Miao, Rong Wei, Zhiqi Ge, Xiaoquan Sun, Shiqi Gao, Jingzhe Zhu, Renhan Wang, Siliang Tang, Jun Xiao, Rui Tang, Juncheng Li
Affiliations: Zhejiang University; Manycore Tech Inc; Huazhong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Download link of InteriorGS: this https URL

Abstract:3D Gaussian Splatting (3DGS), a 3D representation method with photorealistic real-time rendering capabilities, is regarded as an effective tool for narrowing the sim-to-real gap. However, it lacks fine-grained semantics and physical executability for Visual-Language Navigation (VLN). To address this, we propose SAGE-3D (Semantically and Physically Aligned Gaussian Environments for 3D Navigation), a new paradigm that upgrades 3DGS into an executable, semantically and physically aligned environment. It comprises two components: (1) Object-Centric Semantic Grounding, which adds object-level fine-grained annotations to 3DGS; and (2) Physics-Aware Execution Jointing, which embeds collision objects into 3DGS and constructs rich physical interfaces. We release InteriorGS, containing 1K object-annotated 3DGS indoor scene data, and introduce SAGE-Bench, the first 3DGS-based VLN benchmark with 2M VLN data. Experiments show that 3DGS scene data is harder to converge on, while exhibiting strong generalizability, improving baseline performance by 31% on the VLN-CE Unseen task. The data and code will be available soon.

[CV-47] Buffer layers for Test-Time Adaptation NEURIPS2025

[Quick Read]: This paper addresses the limitations of normalization-layer updates in existing Test-Time Adaptation (TTA) methods: normalization statistics (e.g., Batch Normalization, BN) become unstable with small batches, and the structure of the pre-trained model inherently constrains adaptability, degrading performance under significant domain shift. The key to the solution is a novel Buffer layer that, instead of modifying core model parameters, introduces learnable buffer modules to dynamically adjust feature distributions. This preserves the integrity of the pre-trained backbone while enabling stable and robust online adaptation, effectively avoiding catastrophic forgetting; the layer is modular and can be seamlessly integrated into most existing TTA frameworks for consistent performance gains.

Link: https://arxiv.org/abs/2510.21271
Authors: Hyeongyu Kim, Geonhui Han, Dosik Hwang
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025

Abstract:In recent advancements in Test Time Adaptation (TTA), most existing methodologies focus on updating normalization layers to adapt to the test domain. However, the reliance on normalization-based adaptation presents key challenges. First, normalization layers such as Batch Normalization (BN) are highly sensitive to small batch sizes, leading to unstable and inaccurate statistics. Moreover, normalization-based adaptation is inherently constrained by the structure of the pre-trained model, as it relies on training-time statistics that may not generalize well to unseen domains. These issues limit the effectiveness of normalization-based TTA approaches, especially under significant domain shift. In this paper, we introduce a novel paradigm based on the concept of a Buffer layer, which addresses the fundamental limitations of normalization layer updates. Unlike existing methods that modify the core parameters of the model, our approach preserves the integrity of the pre-trained backbone, inherently mitigating the risk of catastrophic forgetting during online adaptation. Through comprehensive experimentation, we demonstrate that our approach not only outperforms traditional methods in mitigating domain shift and enhancing model robustness, but also exhibits strong resilience to forgetting. Furthermore, our Buffer layer is modular and can be seamlessly integrated into nearly all existing TTA frameworks, resulting in consistent performance improvements across various architectures. These findings validate the effectiveness and versatility of the proposed solution in real-world domain adaptation scenarios. The code is available at this https URL.

[CV-48] Topology Sculptor, Shape Refiner: Discrete Diffusion Model for High-Fidelity 3D Meshes Generation

[Quick Read]: This paper targets the low efficiency and weak topology control of current diffusion-based approaches for generating high-quality, artist-style 3D meshes, where autoregressive methods are hard to parallelize and struggle to capture both local detail and global structure. The key to the solution, Topology Sculptor, Shape Refiner (TSSR), lies in three innovations: 1) decoupled training with hybrid inference, splitting generation into a topology-sculpting stage and a shape-refinement stage so that local topology and overall geometry are each captured accurately; 2) an improved Hourglass architecture with face-vertex-sequence-level Rotational Positional Embeddings (RoPE) to model richer contextual information across the mesh; and 3) a novel Connection Loss serving as a topological constraint to improve the realism and geometric fidelity of the generated meshes. This design lets TSSR generate high-quality meshes with up to 10,000 faces at a 1024^3 spatial resolution, efficiently and in parallel.

Link: https://arxiv.org/abs/2510.21264
Authors: Kaiyu Song, Hanjiang Lai, Yaqing Zhang, Chuangjian Cai, Yan Pan, Kun Yue, Jian Yin
Affiliations: Sun Yat-sen University; Tencent VisVise; Yunnan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In this paper, we introduce Topology Sculptor, Shape Refiner (TSSR), a novel method for generating high-quality, artist-style 3D meshes based on Discrete Diffusion Models (DDMs). Our primary motivation for TSSR is to achieve highly accurate token prediction while enabling parallel generation, a significant advantage over sequential autoregressive methods. By allowing TSSR to “see” all mesh tokens concurrently, we unlock a new level of efficiency and control. We leverage this parallel generation capability through three key innovations: 1) Decoupled Training and Hybrid Inference, which distinctly separates the DDM-based generation into a topology sculpting stage and a subsequent shape refinement stage. This strategic decoupling enables TSSR to effectively capture both intricate local topology and overarching global shape. 2) An Improved Hourglass Architecture, featuring bidirectional attention enriched by face-vertex-sequence level Rotational Positional Embeddings (RoPE), thereby capturing richer contextual information across the mesh structure. 3) A novel Connection Loss, which acts as a topological constraint to further enhance the realism and fidelity of the generated meshes. Extensive experiments on complex datasets demonstrate that TSSR generates high-quality 3D artist-style meshes, capable of achieving up to 10,000 faces at a remarkable spatial resolution of 1024^3 . The code will be released at: this https URL.

[CV-49] Improved Training Technique for Shortcut Models NEURIPS2025

[Quick Read]: This paper addresses five core problems of shortcut models in generative modeling: (1) compounding guidance, which causes severe image artifacts; (2) inflexible fixed guidance that restricts inference-time control; (3) a frequency bias, driven by reliance on low-level distances in the direct domain, that skews reconstructions toward low frequencies; (4) divergent self-consistency arising from a conflict with EMA training; and (5) curvy flow trajectories that impede convergence. The key to the solution is iSM, a unified training framework built on four techniques: Intrinsic Guidance for dynamic, explicit control of guidance strength, resolving both compounding guidance and inflexibility; a Multi-Level Wavelet Loss that mitigates frequency bias to restore high-frequency details; Scaling Optimal Transport (sOT), which reduces training variance and learns straighter, more stable generative paths; and a Twin EMA strategy that reconciles training stability with self-consistency. Experiments on ImageNet 256x256 show substantial FID improvements over baseline shortcut models, making them a competitive class of generative models.

Link: https://arxiv.org/abs/2510.21250
Authors: Anh Nguyen, Viet Nguyen, Duc Vu, Trung Dao, Chi Tran, Toan Tran, Anh Tran
Affiliations: Qualcomm AI Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at NeurIPS 2025

Abstract:Shortcut models represent a promising, non-adversarial paradigm for generative modeling, uniquely supporting one-step, few-step, and multi-step sampling from a single trained network. However, their widespread adoption has been stymied by critical performance bottlenecks. This paper tackles the five core issues that held shortcut models back: (1) the hidden flaw of compounding guidance, which we are the first to formalize, causing severe image artifacts; (2) inflexible fixed guidance that restricts inference-time control; (3) a pervasive frequency bias driven by a reliance on low-level distances in the direct domain, which biases reconstructions toward low frequencies; (4) divergent self-consistency arising from a conflict with EMA training; and (5) curvy flow trajectories that impede convergence. To address these challenges, we introduce iSM, a unified training framework that systematically resolves each limitation. Our framework is built on four key improvements: Intrinsic Guidance provides explicit, dynamic control over guidance strength, resolving both compounding guidance and inflexibility. A Multi-Level Wavelet Loss mitigates frequency bias to restore high-frequency details. Scaling Optimal Transport (sOT) reduces training variance and learns straighter, more stable generative paths. Finally, a Twin EMA strategy reconciles training stability with self-consistency. Extensive experiments on ImageNet 256 x 256 demonstrate that our approach yields substantial FID improvements over baseline shortcut models across one-step, few-step, and multi-step generation, making shortcut models a viable and competitive class of generative models.

[CV-50] 3rd Place Solution to Large-scale Fine-grained Food Recognition

[Quick Read]: This paper targets fine-grained food recognition to improve the accuracy of food analysis in the health domain. The key to the solution is a well-chosen combination of the Arcface and Circle losses; with carefully tuned training configurations and model ensembling, the approach won 3rd place in the LargeFineFoodAI-ICCV Workshop-Recognition challenge.

Link: https://arxiv.org/abs/2510.21199
Authors: Yang Zhong, Yifan Yao, Tong Luo, Youcai Zhang, Yaqian Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Food analysis is becoming a hot topic in the health area, in which the fine-grained food recognition task plays an important role. In this paper, we describe the details of our solution to the LargeFineFoodAI-ICCV Workshop-Recognition challenge held on Kaggle. We find that a proper combination of Arcface loss [1] and Circle loss [9] improves performance. With Arcface and the combined loss, the model was trained with carefully tuned configurations and ensembled to get the final results. Our solution won 3rd place in the competition.
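
As a rough sketch of the loss combination named above (not the authors' code; the margin, scale, and weighting hyperparameters are our assumptions), ArcFace adds an angular margin to the target-class logit before cross-entropy, while Circle loss reweights positive and negative pair similarities; the two can be summed with a tunable weight:

```python
import torch
import torch.nn.functional as F

def arcface_logits(emb, weight, labels, s=64.0, m=0.5):
    # cosine similarity between L2-normalized embeddings and class centers,
    # with an additive angular margin m on the target class
    cos = F.linear(F.normalize(emb), F.normalize(weight)).clamp(-1 + 1e-7, 1 - 1e-7)
    target = torch.cos(torch.acos(cos) + m)
    onehot = F.one_hot(labels, weight.size(0)).bool()
    return torch.where(onehot, target, cos) * s

def circle_loss(emb, labels, gamma=80.0, m=0.25):
    # simplified batch-level pairwise Circle loss (Sun et al., 2020);
    # assumes the batch contains both positive and negative pairs
    sim = F.normalize(emb) @ F.normalize(emb).t()
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=emb.device)
    sp, sn = sim[same & ~eye], sim[~same]
    ap, an = torch.relu(1 + m - sp.detach()), torch.relu(sn.detach() + m)
    lp = -gamma * ap * (sp - (1 - m))   # positive-pair logits
    ln = gamma * an * (sn - m)          # negative-pair logits
    return F.softplus(torch.logsumexp(lp, 0) + torch.logsumexp(ln, 0))

def combined_loss(emb, weight, labels, lam=1.0):
    # weighted sum of the two objectives; lam is an assumed trade-off weight
    ce = F.cross_entropy(arcface_logits(emb, weight, labels), labels)
    return ce + lam * circle_loss(emb, labels)
```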

[CV-51] 3rd Place Solution to ICCV LargeFineFoodAI Retrieval

[Quick Read]: This paper addresses insufficient feature representation in large-scale fine-grained food image retrieval by combining multi-model ensembling with a new reranking scheme. The key steps are: first, training four base models with a weighted sum of the ArcFace and Circle losses; second, applying test-time augmentation (TTA) and ensembling to strengthen feature robustness; and finally, introducing a new reranking method that combines diffusion with k-reciprocal reranking to further improve retrieval accuracy. The method scored 0.81219 and 0.81191 mAP@100 on the public and private leaderboards, respectively.

Link: https://arxiv.org/abs/2510.21198
Authors: Yang Zhong, Zhiming Wang, Zhaoyang Li, Jinyu Ma, Xiang Li
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This paper introduces the 3rd place solution to the ICCV LargeFineFoodAI Retrieval Competition on Kaggle. Four basic models are independently trained with the weighted sum of ArcFace and Circle loss, then TTA and Ensemble are successively applied to improve feature representation ability. In addition, a new reranking method for retrieval is proposed based on diffusion and k-reciprocal reranking. Finally, our method scored 0.81219 and 0.81191 mAP@100 on the public and private leaderboard, respectively.

[CV-52] TokenCLIP: Token-wise Prompt Learning for Zero-shot Anomaly Detection

[Quick Read]: This paper addresses the zero-shot adaptation problem of existing CLIP (Contrastive Language–Image Pretraining)-based anomaly detection on unseen objects: relying on a single textual space to align visual semantics prevents accurate capture of the fine-grained semantic differences among anomaly patterns. The key to the solution is TokenCLIP, a token-wise dynamic alignment framework that maps each visual token to a combination of the learnable textual subspaces most relevant to its semantics, rather than mapping all tokens to one global textual space. Specifically, the original textual space is expanded into a set of orthogonal subspaces, and the assignment of visual tokens to subspaces is modeled as an optimal transport (OT) problem constrained by semantic similarity; a top-k masking then sparsifies the transport plan so that each subspace specializes in distinct visual regions, enabling efficient, customized fine-grained anomaly learning.

Link: https://arxiv.org/abs/2510.21171
Authors: Qihang Zhou, Binbin Gao, Guansong Pang, Xin Wang, Jiming Chen, Shibo He
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Adapting CLIP for anomaly detection on unseen objects has shown strong potential in a zero-shot manner. However, existing methods typically rely on a single textual space to align with visual semantics across diverse objects and domains. The indiscriminate alignment hinders the model from accurately capturing varied anomaly semantics. We propose TokenCLIP, a token-wise adaptation framework that enables dynamic alignment between visual and learnable textual spaces for fine-grained anomaly learning. Rather than mapping all visual tokens to a single, token-agnostic textual space, TokenCLIP aligns each token with a customized textual subspace that represents its visual characteristics. Explicitly assigning a unique learnable textual space to each token is computationally intractable and prone to insufficient optimization. We instead expand the token-agnostic textual space into a set of orthogonal subspaces, and then dynamically assign each token to a subspace combination guided by semantic affinity, which jointly supports customized and efficient token-wise adaptation. To this end, we formulate dynamic alignment as an optimal transport problem, where all visual tokens in an image are transported to textual subspaces based on semantic similarity. The transport constraints of OT ensure sufficient optimization across subspaces and encourage them to focus on different semantics. Solving the problem yields a transport plan that adaptively assigns each token to semantically relevant subspaces. A top-k masking is then applied to sparsify the plan and specialize subspaces for distinct visual regions. Extensive experiments demonstrate the superiority of TokenCLIP.
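
To make the OT formulation concrete, here is a minimal sketch (our illustration, not the authors' code; the entropic regularizer eps, the iteration count, and k are assumed) of Sinkhorn iterations that transport visual tokens to textual subspaces under uniform marginals, followed by top-k sparsification of the plan:

```python
import torch

def sinkhorn_plan(affinity, eps=0.05, n_iters=50):
    # affinity: (N_tokens, K_subspaces) semantic similarities
    K = torch.exp(affinity / eps)                      # Gibbs kernel
    r = torch.full((K.size(0),), 1.0 / K.size(0), device=K.device)
    c = torch.full((K.size(1),), 1.0 / K.size(1), device=K.device)
    u, v = torch.ones_like(r), torch.ones_like(c)
    for _ in range(n_iters):                           # alternating scaling
        u = r / (K @ v)
        v = c / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)         # transport plan

def topk_sparsify(plan, k=4):
    # keep each token's k strongest subspace assignments, renormalized
    vals, idx = plan.topk(k, dim=1)
    sparse = torch.zeros_like(plan).scatter_(1, idx, vals)
    return sparse / sparse.sum(dim=1, keepdim=True)
```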

[CV-53] Blockwise Flow Matching: Improving Flow Matching Models For Efficient High-Quality Generation

[Quick Read]: This paper tackles two problems of conventional Flow Matching models for high-fidelity generation: a single large network struggles to capture the distinct signal characteristics of different timesteps, limiting sample quality, and iteratively evaluating the entire model incurs high inference cost. The key to the solution is Blockwise Flow Matching (BFM), which partitions the generative trajectory into multiple temporal segments, each modeled by a smaller, specialized velocity block, so that each block specializes effectively within its designated interval, improving both inference efficiency and sample quality. A Semantic Feature Guidance module further conditions the velocity blocks on semantically rich features aligned with pretrained representations, and a lightweight Feature Residual Approximation strategy preserves semantic quality while significantly reducing inference cost. On ImageNet 256x256, BFM establishes a markedly better Pareto frontier than existing Flow Matching methods, with 2.1x to 4.9x accelerations in inference complexity at comparable generation quality.

Link: https://arxiv.org/abs/2510.21167
Authors: Dogyun Park, Taehoon Lee, Minseok Joo, Hyunwoo J. Kim
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, Flow Matching models have pushed the boundaries of high-fidelity data generation across a wide range of domains. It typically employs a single large network to learn the entire generative trajectory from noise to data. Despite their effectiveness, this design struggles to capture distinct signal characteristics across timesteps simultaneously and incurs substantial inference costs due to the iterative evaluation of the entire model. To address these limitations, we propose Blockwise Flow Matching (BFM), a novel framework that partitions the generative trajectory into multiple temporal segments, each modeled by smaller but specialized velocity blocks. This blockwise design enables each block to specialize effectively in its designated interval, improving inference efficiency and sample quality. To further enhance generation fidelity, we introduce a Semantic Feature Guidance module that explicitly conditions velocity blocks on semantically rich features aligned with pretrained representations. Additionally, we propose a lightweight Feature Residual Approximation strategy that preserves semantic quality while significantly reducing inference cost. Extensive experiments on ImageNet 256x256 demonstrate that BFM establishes a substantially improved Pareto frontier over existing Flow Matching methods, achieving 2.1x to 4.9x accelerations in inference complexity at comparable generation performance. Code is available at this https URL.

[CV-54] Towards Physics-informed Spatial Intelligence with Human Priors: An Autonomous Driving Pilot Study NEURIPS2025

[Quick Read]: This paper addresses the difficulty of effectively integrating and verifying spatial intelligence in current foundation models. Existing practice proxies Visual-Spatial Intelligence (VSI) with purely textual prompts and VQA (Visual Question Answering)-style scoring, which obscures geometry, invites linguistic shortcuts, and weakens interpretable attribution to genuinely spatial skills. The key to the solution is the Spatial Intelligence Grid (SIG), a structured, grid-based schema that explicitly encodes object layouts, inter-object relations, and physically grounded priors. As a channel complementary to text, SIG provides a faithful, compositional representation of scene structure that lets foundation models reason more accurately about space. Building on SIG, the authors derive evaluation metrics that quantify a model's intrinsic VSI, separating spatial capability from language priors, and show in few-shot in-context learning with multimodal LLMs (e.g., GPT- and Gemini-family models) that SIG yields consistently larger, more stable gains than VQA-only representations.

Link: https://arxiv.org/abs/2510.21160
Authors: Guanlin Wu, Boyan Su, Yang Zhao, Pu Wang, Yichen Lin, Hao Frank Yang
Affiliations: Johns Hopkins University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS 2025 (Spotlight)

Abstract:How to integrate and verify spatial intelligence in foundation models remains an open challenge. Current practice often proxies Visual-Spatial Intelligence (VSI) with purely textual prompts and VQA-style scoring, which obscures geometry, invites linguistic shortcuts, and weakens attribution to genuinely spatial skills. We introduce Spatial Intelligence Grid (SIG): a structured, grid-based schema that explicitly encodes object layouts, inter-object relations, and physically grounded priors. As a complementary channel to text, SIG provides a faithful, compositional representation of scene structure for foundation-model reasoning. Building on SIG, we derive SIG-informed evaluation metrics that quantify a model’s intrinsic VSI, which separates spatial capability from language priors. In few-shot in-context learning with state-of-the-art multimodal LLMs (e.g. GPT- and Gemini-family models), SIG yields consistently larger, more stable, and more comprehensive gains across all VSI metrics compared to VQA-only representations, indicating its promise as a data-labeling and training schema for learning VSI. We also release SIGBench, a benchmark of 1.4K driving frames annotated with ground-truth SIG labels and human gaze traces, supporting both grid-based machine VSI tasks and attention-driven, human-like VSI tasks in autonomous-driving scenarios.

[CV-55] Digital Contrast CT Pulmonary Angiography Synthesis from Non-contrast CT for Pulmonary Vascular Disease

[Quick Read]: This paper aims to eliminate the nephrotoxicity and allergy risks that iodinated contrast agents pose in CT pulmonary angiography (CTPA), especially for high-risk patients. The core solution is a Digital Contrast CTPA (DCCTPA) generation method that synthesizes vessel-enhanced DCCTPA images from non-contrast CT (NCCT) using a cascaded synthesizer based on Cycle-Consistent Generative Adversarial Networks (CycleGAN). The key lies in training and validating the model on multi-center retrospective data, achieving effective reconstruction and high fidelity of pulmonary vascular structures: it outperforms state-of-the-art methods on quantitative metrics (MAE, PSNR, SSIM) and qualitative visualization, and markedly improves downstream tasks such as pulmonary vessel segmentation and quantification, notably for small vessels and volume consistency (ICC improved from 0.70 to 0.81).

Link: https://arxiv.org/abs/2510.21140
Authors: Ying Ming (1), Yue Lin (3), Longfei Zhao (2), Gengwan Li (2), Zuopeng Tan (2), Bing Li (2), Sheng Xie (3), Wei Song (1), Qiqi Xu (2) ((1) Department of Radiology, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, (2) Research and Development Center, Canon Medical Systems China, (3) Department of Radiology, China-Japan Friendship Hospital, Beijing, China)
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Computed Tomography Pulmonary Angiography (CTPA) is the reference standard for diagnosing pulmonary vascular diseases such as Pulmonary Embolism (PE) and Chronic Thromboembolic Pulmonary Hypertension (CTEPH). However, its reliance on iodinated contrast agents poses risks including nephrotoxicity and allergic reactions, particularly in high-risk patients. This study proposes a method to generate Digital Contrast CTPA (DCCTPA) from Non-Contrast CT (NCCT) scans using a cascaded synthesizer based on Cycle-Consistent Generative Adversarial Networks (CycleGAN). In total, 410 retrospective paired CTPA and NCCT scans were obtained from three centers. The model was trained and validated internally on 249 paired images. An extra dataset comprising 161 paired images served as the test set for evaluating model generalization and validating downstream clinical tasks. Compared with state-of-the-art (SOTA) methods, the proposed method achieved the best comprehensive performance by evaluating quantitative metrics (for validation, MAE: 156.28, PSNR: 20.71 and SSIM: 0.98; for test, MAE: 165.12, PSNR: 20.27 and SSIM: 0.98) and qualitative visualization, demonstrating valid vessel enhancement, superior image fidelity and structural preservation. The approach was further applied to downstream tasks of pulmonary vessel segmentation and vascular quantification. On the test set, the average Dice, clDice, and clRecall of artery and vein pulmonary segmentation were 0.70, 0.71, 0.73 and 0.70, 0.72, 0.75, respectively, all markedly improved compared with NCCT inputs. The Inter-class Correlation Coefficient (ICC) for vessel volume between DCCTPA and CTPA was significantly better than that between NCCT and CTPA (average ICC: 0.81 vs 0.70), indicating effective vascular enhancement in DCCTPA, especially for small vessels.
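
For readers unfamiliar with CycleGAN, its core constraint is cycle consistency: translating NCCT to the contrast domain and back should reproduce the input. A minimal sketch of that term only (our illustration; the paper's cascaded synthesizer and full training objective are not detailed here, and the weight lam=10 follows the original CycleGAN paper):

```python
import torch

def cycle_consistency_loss(G, F_inv, ncct, ctpa, lam=10.0):
    # G: NCCT -> contrast-enhanced CTPA, F_inv: CTPA -> NCCT (both generators)
    rec_ncct = F_inv(G(ncct))    # NCCT -> DCCTPA -> NCCT
    rec_ctpa = G(F_inv(ctpa))    # CTPA -> NCCT -> CTPA
    # L1 reconstruction of both cycles, as in CycleGAN
    return lam * ((rec_ncct - ncct).abs().mean() +
                  (rec_ctpa - ctpa).abs().mean())
```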

[CV-56] NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation NEURIPS2025

[Quick Read]: This paper addresses the weak generalization of reinforcement learning (RL) when used to improve the Chain-of-Thought (CoT) reasoning of multimodal large language models (MLLMs), where existing RL frameworks struggle to remain stable beyond the training distribution. The key to the solution is the NoisyGRPO framework, with two core mechanisms: (1) a Noise-Injected Exploration Policy that perturbs visual inputs with Gaussian noise to broaden coverage of diverse visual scenarios; and (2) Bayesian Advantage Estimation, which casts advantage estimation as Bayesian inference, treating the injected noise level as the prior and the trajectory reward as the likelihood, and fusing the two into a robust posterior estimate of trajectory advantage that steers the model toward visually grounded rather than noise-corrupted reasoning. This design markedly improves CoT quality, generalization, and robustness to hallucination, especially for small-scale MLLMs such as Qwen2.5-VL 3B.

Link: https://arxiv.org/abs/2510.21122
Authors: Longtian Qiu, Shan Ning, Jiaxuan Sun, Xuming He
Affiliations: ShanghaiTech University; Shanghai Engineering Research Center of Intelligent Vision and Imaging; Lingang Laboratory
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by NeurIPS 2025, project page at this https URL

Abstract:Reinforcement learning (RL) has shown promise in enhancing the general Chain-of-Thought (CoT) reasoning capabilities of multimodal large language models (MLLMs). However, when applied to improve general CoT reasoning, existing RL frameworks often struggle to generalize beyond the training distribution. To address this, we propose NoisyGRPO, a systematic multimodal RL framework that introduces controllable noise into visual inputs for enhanced exploration and explicitly models the advantage estimation process via a Bayesian framework. Specifically, NoisyGRPO improves RL training by: (1) Noise-Injected Exploration Policy: Perturbing visual inputs with Gaussian noise to encourage exploration across a wider range of visual scenarios; and (2) Bayesian Advantage Estimation: Formulating advantage estimation as a principled Bayesian inference problem, where the injected noise level serves as a prior and the observed trajectory reward as the likelihood. This Bayesian modeling fuses both sources of information to compute a robust posterior estimate of trajectory advantage, effectively guiding MLLMs to prefer visually grounded trajectories over noisy ones. Experiments on standard CoT quality, general capability, and hallucination benchmarks demonstrate that NoisyGRPO substantially improves generalization and robustness, especially in RL settings with small-scale MLLMs such as Qwen2.5-VL 3B. The project page is available at this https URL.
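
The abstract does not spell out the exact fusion rule, so the sketch below is only a generic illustration consistent with the description (all function names and constants are our assumptions): Gaussian noise widens visual exploration, and a conjugate Gaussian update fuses a noise-level prior with the observed reward into a posterior advantage:

```python
import torch

def inject_noise(images, sigma):
    # perturb visual inputs to broaden exploration (pixel values assumed in [0, 1])
    return (images + sigma * torch.randn_like(images)).clamp(0.0, 1.0)

def posterior_advantage(rewards, sigmas, prior_scale=1.0, prior_var=1.0, obs_var=1.0):
    # prior: noisier inputs -> lower expected trajectory quality
    prior_mean = -prior_scale * sigmas
    # precision-weighted Gaussian fusion of prior and reward likelihood
    post = (prior_mean / prior_var + rewards / obs_var) / (1.0 / prior_var + 1.0 / obs_var)
    return (post - post.mean()) / (post.std() + 1e-8)   # normalized advantage
```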

[CV-57] SafetyPairs: Isolating Safety Critical Image Features with Counterfactual Image Generation

[Quick Read]: This paper addresses the lack of fine-grained annotation in image safety assessment: existing datasets provide only coarse safety labels and do not identify the specific visual features that flip an image's safety status. The key to the solution is the SafetyPairs framework, which uses image-editing models to generate counterfactual image pairs that differ only in the features relevant to a given safety policy, flipping the safety label while leaving safety-irrelevant details unchanged. This yields a fine-grained safety benchmark of over 3,020 images across 9 safety categories, which both exposes weaknesses in vision-language models' ability to distinguish subtly different images and serves as an effective data augmentation strategy that improves the sample efficiency of training lightweight guard models.

Link: https://arxiv.org/abs/2510.21120
Authors: Alec Helbling, Shruti Palaskar, Kundan Krishna, Polo Chau, Leon Gatys, Joseph Yitan Cheng
Affiliations: Georgia Tech; Apple
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:What exactly makes a particular image unsafe? Systematically differentiating between benign and problematic images is a challenging problem, as subtle changes to an image, such as an insulting gesture or symbol, can drastically alter its safety implications. However, existing image safety datasets are coarse and ambiguous, offering only broad safety labels without isolating the specific features that drive these differences. We introduce SafetyPairs, a scalable framework for generating counterfactual pairs of images, that differ only in the features relevant to the given safety policy, thus flipping their safety label. By leveraging image editing models, we make targeted changes to images that alter their safety labels while leaving safety-irrelevant details unchanged. Using SafetyPairs, we construct a new safety benchmark, which serves as a powerful source of evaluation data that highlights weaknesses in vision-language models’ abilities to distinguish between subtly different images. Beyond evaluation, we find our pipeline serves as an effective data augmentation strategy that improves the sample efficiency of training lightweight guard models. We release a benchmark containing over 3,020 SafetyPair images spanning a diverse taxonomy of 9 safety categories, providing the first systematic resource for studying fine-grained image safety distinctions.

[CV-58] Controllable-LPMoE: Adapting to Challenging Object Segmentation via Dynamic Local Priors from Mixture-of-Experts ICCV2025

[Quick Read]: This paper addresses the heavy computational cost of fully fine-tuning large-scale foundation models for downstream object segmentation, as well as the limited adaptability of fixed-prompt fine-tuning that lacks semantic priors. The key to the solution is Controllable-LPMoE, a controllable fine-tuning paradigm based on dynamic priors with few trainable parameters. It combines a lightweight dynamic mixed local priors extractor (heterogeneous convolutions capture diverse local priors, and a gating network dynamically selects the expert priors needed for fine-tuning) with a bi-directional interaction adapter (cosine-aligned deformable attention plus channel-oriented adaptive scale enhancement) to modulate the frozen foundation model, improving fine-grained perception and task adaptability while greatly reducing trainable parameters.

Link: https://arxiv.org/abs/2510.21114
Authors: Yanguang Sun, Jiawei Lian, Jian Yang, Lei Luo
Affiliations: PCA Lab, Nanjing University of Science and Technology, Nanjing, China; PCA Lab, VCIP, College of Computer Science, Nankai University, Tianjin, China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025

Abstract:Large-scale foundation models provide powerful feature representations for downstream object segmentation tasks. However, when adapted to specific tasks through the full-parameter fine-tuning, the enormous parameters being updated often results in significant computational overhead, creating a bottleneck in training efficiency. Although existing methods attempt to fine-tune frozen models by directly embedding trainable prompts, these prompts lack inherent semantic priors, limiting the adaptability of large-scale models. In this paper, we propose a novel dynamic priors-based fine-tuning paradigm with fewer trainable parameters, dubbed Controllable-LPMoE, which adaptively modulates frozen foundation models by dynamically controlling local priors to enhance fine-grained perception for specific segmentation tasks. More specifically, we construct a lightweight dynamic mixed local priors extractor that captures diverse local priors from input images through heterogeneous convolutions while employing a gating network to dynamically output expert priors required for the subsequent fine-tuning. Furthermore, we design a bi-directional interaction adapter that employs cosine-aligned deformable attention and channel-oriented adaptive scale enhancement to interact and restructure between frozen and trainable features, achieving efficient fine-tuning. Extensive experiments validate the superiority of our Controllable-LPMoE approach (code: this https URL), demonstrating excellent segmentation performance compared to 31 state-of-the-art (SOTA) methods and adaptability to multiple binary object segmentation tasks.

[CV-59] Urban 3D Change Detection Using LiDAR Sensor for HD Map Maintenance and Smart Mobility

[Quick Read]: This paper addresses object-level change detection from bi-temporal LiDAR point clouds at city scale, in support of HD map maintenance, construction monitoring, and reliable localization. Traditional approaches such as DSM differencing or image-based methods are sensitive to small vertical bias, ground slope, and viewpoint mismatch, and produce only cellwise outputs without object identity; point-based neural models demand large memory, assume near-perfect pre-alignment, degrade thin structures, and lack class-consistent association, leaving split and merge cases unresolved while ignoring uncertainty. The key to the solution is an object-centric, uncertainty-aware pipeline: epochs are first registered with multi-resolution Normal Distributions Transform (NDT) followed by point-to-plane ICP, and a per-location change confidence is derived from registration covariance and surface roughness to calibrate decisions and suppress spurious changes; geometry-only proxies then seed cross-epoch associations, refined by semantic and instance segmentation and a class-constrained bipartite assignment with augmented dummy nodes that handles splits and merges while preserving per-class counts; finally, tiled processing bounds memory, and instance-level decisions combine 3D overlap, normal-direction displacement, height and volume differences, and a histogram distance, all gated by the local level of detection to remain stable under partial overlap and sampling variation.

Link: https://arxiv.org/abs/2510.21112
Authors: Hezam Albagami, Haitian Wang, Xinyu Wang, Muhammad Ibrahim, Zainy M. Malakan, Abdullah M. Alqamdi, Mohammed H. Alghamdi, Ajmal Mian
Affiliations: University of Jeddah; University of Western Australia; Umm Al-Qura University; King Khalid University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:High-definition 3D city maps underpin smart transportation, digital twins, and autonomous driving, where object-level change detection across bi-temporal LiDAR enables HD map maintenance, construction monitoring, and reliable localization. Classical DSM differencing and image-based methods are sensitive to small vertical bias, ground slope, and viewpoint mismatch and yield cellwise outputs without object identity. Point-based neural models and voxel encodings demand large memory, assume near-perfect pre-alignment, degrade thin structures, and seldom enforce class-consistent association, which leaves split or merge cases unresolved and ignores uncertainty. We propose an object-centric, uncertainty-aware pipeline for city-scale LiDAR that aligns epochs with multi-resolution NDT followed by point-to-plane ICP, normalizes height, and derives a per-location level of detection from registration covariance and surface roughness to calibrate decisions and suppress spurious changes. Geometry-only proxies seed cross-epoch associations that are refined by semantic and instance segmentation and a class-constrained bipartite assignment with augmented dummies to handle splits and merges while preserving per-class counts. Tiled processing bounds memory without eroding narrow ground changes, and instance-level decisions combine 3D overlap, normal-direction displacement, and height and volume differences with a histogram distance, all gated by the local level of detection to remain stable under partial overlap and sampling variation. On 15 representative Subiaco blocks the method attains 95.2% accuracy, 90.4% mF1, and 82.6% mIoU, exceeding Triplet KPConv by 0.2 percentage points in accuracy, 0.2 in mF1, and 0.8 in mIoU, with the largest gain on Decreased, where IoU reaches 74.8% and improves by 7.6 points.

[CV-60] PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language Models in Physical Environments NEURIPS2025

[Quick Read]: This paper addresses the fact that visual reasoning in multimodal large language models (MLLMs) has been studied mainly in static, fully observable settings, limiting effectiveness in real-world environments where occlusion or a restricted field of view leaves information incomplete. Existing methods lack the ability to actively explore and interact through a closed perception-reasoning-action loop. The key to the solution is the Active Visual Reasoning (AVR) task, which requires agents to (1) actively acquire information via sequential physical actions, (2) integrate observations across steps for coherent reasoning, and (3) dynamically adjust decisions based on evolving visual feedback. Alongside it, the CLEVR-AVR simulation benchmark and the large-scale AVR-152k dataset with Chain-of-Thought (CoT) annotations support training for uncertainty identification, action-conditioned information-gain prediction, and information-maximizing action selection in a higher-order Markov decision process. Built on these, the PhysVLM-AVR model achieves state-of-the-art performance on CLEVR-AVR, embodied reasoning (OpenEQA, RoboVQA), and passive visual reasoning (GeoMath, Geometry30K), while the analysis exposes a key gap in current embodied MLLMs: they detect information incompleteness but struggle to actively acquire and integrate new information through interaction.

Link: https://arxiv.org/abs/2510.21111
Authors: Weijie Zhou, Xuantang Xiong, Yi Peng, Manli Tao, Chaoyang Zhao, Honghui Dong, Ming Tang, Jinqiao Wang
Affiliations: Beijing Jiaotong University; Tencent Robotics X & Futian Laboratory, Shenzhen; Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences; ObjectEye Inc
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

Abstract:Visual reasoning in multimodal large language models (MLLMs) has primarily been studied in static, fully observable settings, limiting their effectiveness in real-world environments where information is often incomplete due to occlusion or limited field of view. Humans, in contrast, actively explore and interact with their environment-moving, examining, and manipulating objects-to gather information through a closed-loop process integrating perception, reasoning, and action. Inspired by this human capability, we introduce the Active Visual Reasoning (AVR) task, extending visual reasoning to partially observable, interactive environments. AVR necessitates agents to: (1) actively acquire information via sequential physical actions, (2) integrate observations across multiple steps for coherent reasoning, and (3) dynamically adjust decisions based on evolving visual feedback. To rigorously evaluate AVR, we introduce CLEVR-AVR, a simulation benchmark featuring multi-round interactive environments designed to assess both reasoning correctness and information-gathering efficiency. We present AVR-152k, a large-scale dataset that offers rich Chain-of-Thought (CoT) annotations detailing iterative reasoning for uncertainty identification, action-conditioned information gain prediction, and information-maximizing action selection, crucial for training agents in a higher-order Markov Decision Process. Building on this, we develop PhysVLM-AVR, an MLLM achieving state-of-the-art performance on CLEVR-AVR, embodied reasoning (OpenEQA, RoboVQA), and passive visual reasoning (GeoMath, Geometry30K). Our analysis also reveals that current embodied MLLMs, despite detecting information incompleteness, struggle to actively acquire and integrate new information through interaction, highlighting a fundamental gap in active reasoning capabilities.

[CV-61] HistRetinex: Optimizing Retinex model in Histogram Domain for Efficient Low-Light Image Enhancement

[Quick Read]: This paper addresses the low computational efficiency of Retinex-based low-light image enhancement on large images. The key to the solution is extending the conventional spatial-domain Retinex model to the histogram domain, yielding a new histogram-based Retinex model (HistRetinex). By defining a histogram location matrix and a histogram count matrix, it establishes the histogram relationships among illumination, reflectance, and the low-light image, and builds a two-level optimization model whose iterative solutions give the illumination and reflectance histograms; the final enhancement is obtained quickly via histogram matching. The method preserves enhancement quality while running markedly faster, taking only 1.86 seconds on 1000x664 images and saving at least 6.67 seconds over existing methods.

Link: https://arxiv.org/abs/2510.21100
Authors: Jingtian Zhao, Xueli Xie, Jianxiang Xi, Xiaogang Yang, Haoxuan Sun
Affiliations: Rocket Force University of Engineering
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Currently, this manuscript has been rejected by TIP and is undergoing revisions. The reviewers noted that the paper contains some innovative aspects, but identified issues in the experimental and algorithmic sections

Abstract:Retinex-based low-light image enhancement methods are widely used due to their excellent performance. However, most of them are time-consuming for large-sized images. This paper extends the Retinex model from the spatial domain to the histogram domain, and proposes a novel histogram-based Retinex model for fast low-light image enhancement, named HistRetinex. Firstly, we define the histogram location matrix and the histogram count matrix, which establish the relationship among histograms of the illumination, reflectance and the low-light image. Secondly, based on the prior information and the histogram-based Retinex model, we construct a novel two-level optimization model. Through solving the optimization model, we give the iterative formulas of the illumination histogram and the reflectance histogram, respectively. Finally, we enhance the low-light image by matching its histogram with the one provided by HistRetinex. Experimental results demonstrate that HistRetinex outperforms existing enhancement methods in both visibility and performance metrics, while executing in 1.86 seconds on 1000x664 images, achieving a minimum time saving of 6.67 seconds.
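
The final step, matching the input's histogram to the one produced by the optimization, is standard histogram matching. A minimal sketch (our illustration, not the authors' code; ref_hist stands in for the HistRetinex output, and intensities are assumed in [0, 1]):

```python
import numpy as np

def match_histogram(src, ref_hist, nbins=256):
    # remap src intensities so their histogram follows ref_hist
    src_hist, _ = np.histogram(src.ravel(), bins=nbins, range=(0.0, 1.0))
    src_cdf = np.cumsum(src_hist) / src_hist.sum()
    ref_cdf = np.cumsum(ref_hist) / np.sum(ref_hist)
    centers = (np.arange(nbins) + 0.5) / nbins
    # send each source bin to the reference intensity with the same CDF value
    mapping = np.interp(src_cdf, ref_cdf, centers)
    idx = np.clip((src * nbins).astype(int), 0, nbins - 1)
    return mapping[idx]
```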

[CV-62] Knowledge-Driven Vision-Language Model for Plexus Detection in Hirschsprung's Disease

[Quick Read]: This paper addresses the accurate identification of regions of the myenteric plexus, where ganglion cells should be present, in the pathological diagnosis of Hirschsprung's disease. Traditional deep learning methods such as Convolutional Neural Networks (CNNs) classify tissue slides well but act as black boxes, offering little interpretability and not necessarily matching how physicians reason. The key to the solution is a vision-language multimodal framework that integrates expert-derived textual concepts: prompts generated by large language models from expert sources and reviewed by the team are encoded with QuiltNet, injecting clinical semantics into a Contrastive Language-Image Pre-training model so that visual features align with medical prior knowledge. This markedly improves classification (83.9% accuracy, 86.6% precision, 87.6% specificity), outperforming CNN baselines and providing a more clinically relevant, interpretable aid for pathology image analysis.

Link: https://arxiv.org/abs/2510.21083
Authors: Youssef Megahed, Atallah Madi, Dina El Demellawy, Adrian D. C. Chan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted into the ICAAI 2025 - The 9th International Conference on Advances in Artificial Intelligence

Abstract:Hirschsprung’s disease is defined as the congenital absence of ganglion cells in some segment(s) of the colon. The muscle cannot make coordinated movements to propel stool in that section, most commonly leading to obstruction. The diagnosis and treatment for this disease require a clear identification of different region(s) of the myenteric plexus, where ganglion cells should be present, on the microscopic view of the tissue slide. While deep learning approaches, such as Convolutional Neural Networks, have performed very well in this task, they are often treated as black boxes, with minimal understanding gained from them, and may not conform to how a physician makes decisions. In this study, we propose a novel framework that integrates expert-derived textual concepts into a Contrastive Language-Image Pre-training-based vision-language model to guide plexus classification. Using prompts derived from expert sources (e.g., medical textbooks and papers) generated by large language models and reviewed by our team before being encoded with QuiltNet, our approach aligns clinically relevant semantic cues with visual features. Experimental results show that the proposed model demonstrated superior discriminative capability across different classification metrics as it outperformed CNN-based models, including VGG-19, ResNet-18, and ResNet-50; achieving an accuracy of 83.9%, a precision of 86.6%, and a specificity of 87.6%. These findings highlight the potential of multi-modal learning in histopathology and underscore the value of incorporating expert knowledge for more clinically relevant model outputs.

[CV-63] WaveSeg: Enhancing Segmentation Precision via High-Frequency Prior and Mamba-Driven Spectrum Decomposition

[Quick Read]: This paper addresses the overly simple decoder designs in current semantic segmentation networks, which lead to a poor trade-off between semantic context and fine-grained detail. The key to the solution is WaveSeg, a novel decoder that jointly optimizes feature refinement in the spatial and wavelet domains: high-frequency components of the input image are first learned as explicit priors to reinforce boundary details at early stages; a multi-scale Dual Domain Operation (DDO) fusion mechanism is then applied, and a Mamba-based Spectrum Decomposition Attention (SDA) block models long-range dependencies with linear complexity while strengthening high-frequency structural details; meanwhile, reparameterized convolutions preserve low-frequency semantic integrity in the wavelet domain. Finally, residual-guided fusion integrates multi-scale features with boundary-aware representations at native resolution, producing feature maps rich in both semantics and structure.

Link: https://arxiv.org/abs/2510.21079
Authors: Guoan Xu, Yang Xiao, Wenjing Jia, Guangwei Gao, Guo-Jun Qi, Chia-Wen Lin
Affiliations: University of Technology Sydney; Nanjing University of Science and Technology; Westlake University; OPPO Research; National Tsing Hua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 10 figures

Abstract:While recent semantic segmentation networks heavily rely on powerful pretrained encoders, most employ simplistic decoders, leading to suboptimal trade-offs between semantic context and fine-grained detail preservation. To address this, we propose a novel decoder architecture, WaveSeg, which jointly optimizes feature refinement in spatial and wavelet domains. Specifically, high-frequency components are first learned from input images as explicit priors to reinforce boundary details at early stages. A multi-scale fusion mechanism, Dual Domain Operation (DDO), is then applied, and the novel Spectrum Decomposition Attention (SDA) block is proposed, which is developed to leverage Mamba’s linear-complexity long-range modeling to enhance high-frequency structural details. Meanwhile, reparameterized convolutions are applied to preserve low-frequency semantic integrity in the wavelet domain. Finally, a residual-guided fusion integrates multi-scale features with boundary-aware representations at native resolution, producing semantically and structurally rich feature maps. Extensive experiments on standard benchmarks demonstrate that WaveSeg, leveraging wavelet-domain frequency prior with Mamba-based attention, consistently outperforms state-of-the-art approaches both quantitatively and qualitatively, achieving efficient and precise segmentation.

[CV-64] ZING-3D: Zero-shot Incremental 3D Scene Graphs via Vision-Language Models

[Quick Read]: This paper addresses the limitations of existing 3D scene graph generation methods for understanding and reasoning about complex environments: they are confined to single views, cannot be updated incrementally as new observations arrive, and lack explicit geometric grounding in 3D space, all of which hampers embodied-intelligence applications. The key to the solution is ZING-3D, which leverages the zero-shot capabilities of pretrained vision-language models (VLMs) for open-vocabulary recognition and explicitly grounds the resulting 2D scene graph in 3D using depth information. The structured representation has nodes carrying object features, 3D locations, and semantic context, and edges capturing inter-object distances and spatial/semantic relations, providing spatially and semantically grounded knowledge without task-specific fine-tuning while supporting incremental updates, making it suitable for downstream robotics applications.

Link: https://arxiv.org/abs/2510.21069
Authors: Pranav Saxena, Jimmy Chiun
Affiliations: Birla Institute of Technology and Science Pilani, K.K. Birla Goa Campus; National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:

Abstract:Understanding and reasoning about complex 3D environments requires structured scene representations that capture not only objects but also their semantic and spatial relationships. While recent works on 3D scene graph generation have leveraged pretrained VLMs without task-specific fine-tuning, they are largely confined to single-view settings, fail to support incremental updates as new observations arrive and lack explicit geometric grounding in 3D space, all of which are essential for embodied scenarios. In this paper, we propose ZING-3D, a framework that leverages the vast knowledge of pretrained foundation models to enable open-vocabulary recognition and generate a rich semantic representation of the scene in a zero-shot manner while also enabling incremental updates and geometric grounding in 3D space, making it suitable for downstream robotics applications. Our approach leverages VLM reasoning to generate a rich 2D scene graph, which is grounded in 3D using depth information. Nodes represent open-vocabulary objects with features, 3D locations, and semantic context, while edges capture spatial and semantic relations with inter-object distances. Our experiments on scenes from the Replica and HM3D datasets show that ZING-3D is effective at capturing spatial and relational knowledge without the need of task-specific training.
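
Grounding a 2D detection in 3D with depth reduces, in the simplest case, to back-projecting the box through the pinhole camera model. A minimal sketch (our illustration under an assumed intrinsics matrix K; the paper's exact grounding procedure may differ):

```python
import numpy as np

def backproject(u, v, z, K):
    # lift pixel (u, v) with metric depth z into a camera-frame 3D point
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.array([x, y, z])

def ground_detection(box, depth_map, K):
    # use the median depth inside the box as a robust node location estimate
    u1, v1, u2, v2 = [int(t) for t in box]
    z = np.median(depth_map[v1:v2, u1:u2])
    return backproject((u1 + u2) / 2.0, (v1 + v2) / 2.0, z, K)
```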

[CV-65] Deep learning-based automated damage detection in concrete structures using images from earthquake events

[Quick Read]: This paper addresses rapid post-earthquake structural integrity assessment, focusing on automatically detecting exposed steel reinforcement in concrete structures to quantify damage to buildings and bridges. The key to the solution is a hybrid framework of multi-stage deep learning models: a YOLOv11 (You Only Look Once) model is trained to detect cracking, spalling, and exposed rebar; another YOLO model is fine-tuned to distinguish structural damage levels; and a newly collected, annotated, and augmented dataset of images from the 2023 Turkey earthquakes improves robustness and generalization across diverse damage scenarios. The result is an end-to-end pipeline from input images to automated damage grading that offers efficient, reliable support for post-disaster emergency response.

Link: https://arxiv.org/abs/2510.21063
Authors: Abdullah Turer, Yongsheng Bai, Halil Sezen, Alper Yilmaz
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 1 figure

Abstract:Timely assessment of integrity of structures after seismic events is crucial for public safety and emergency response. This study focuses on assessing the structural damage conditions using deep learning methods to detect exposed steel reinforcement in concrete buildings and bridges after large earthquakes. Steel bars are typically exposed after concrete spalling or large flexural or shear cracks. The amount and distribution of exposed steel reinforcement is an indication of structural damage and degradation. To automatically detect exposed steel bars, new datasets of images collected after the 2023 Turkey Earthquakes were labeled to represent a wide variety of damaged concrete structures. The proposed method builds upon a deep learning framework, enhanced with fine-tuning, data augmentation, and testing on public datasets. An automated classification framework is developed that can be used to identify inside/outside buildings and structural components. Then, a YOLOv11 (You Only Look Once) model is trained to detect cracking and spalling damage and exposed bars. Another YOLO model is finetuned to distinguish different categories of structural damage levels. All these trained models are used to create a hybrid framework to automatically and reliably determine the damage levels from input images. This research demonstrates that rapid and automated damage detection following disasters is achievable across diverse damage contexts by utilizing image data collection, annotation, and deep learning approaches.

[CV-66] More Than Memory Savings: Zeroth-Order Optimization Mitigates Forgetting in Continual Learning

[Quick Read]: This paper addresses the plasticity-stability-efficiency trilemma in continual learning (CL), particularly how to balance adaptability to new tasks (plasticity) against forgetting of old knowledge (stability) under constrained training budgets. The key to the solution is ZO-FC: applying zeroth-order (ZO) optimization to a single adapter-based parameter-efficient fine-tuning (PEFT) module while keeping first-order (FO) optimization for the classifier. The design exploits the naturally flatter loss landscapes of ZO optimization for stability, while FO updates of the classifier maintain plasticity, achieving an effective stability-adaptability trade-off with negligible memory overhead and offering a practical, memory-efficient solution for on-device continual learning.

Link: https://arxiv.org/abs/2510.21019
Authors: Wanhao Yu, Zheng Wang, Shuteng Niu, Sen Lin, Li Yang
Affiliations: University of North Carolina at Charlotte; University of Houston; Mayo Clinic
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Zeroth-order (ZO) optimization has gained attention as a memory-efficient alternative to first-order (FO) methods, particularly in settings where gradient computation is expensive or even impractical. Beyond its memory efficiency, in this work, we investigate ZO optimization for continual learning (CL) as a novel approach to address the plasticity-stability-efficiency trilemma. Through theoretical analysis and empirical evidence, we show that ZO optimization naturally leads to flatter loss landscapes, which in turn reduce forgetting in CL. However, this stability comes at a cost of plasticity: due to its imprecise gradient estimates and slower convergence, ZO optimization tends to be less effective than FO in acquiring new task-specific knowledge, particularly under constrained training budgets. To better understand this trade-off, we conduct a holistic evaluation of ZO optimization applied to various existing CL methods. Our findings reveal that ZO optimization enhances stability but often undermines plasticity, particularly when used with learnable classifiers. Motivated by this insight, we propose ZO-FC, a simple but effective approach that applies ZO optimization to a single adapter-based PEFT module with FO optimized classifier. This design leverages the stability benefits of ZO while preserving the adaptability of FO updates with negligible memory overhead. Experiments demonstrate that ZO-FC achieves an effective balance between stability and plasticity, offering a practical and memory-efficient solution for on-device CL.
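
The memory saving of ZO optimization comes from estimating gradients with forward passes only. Below is a minimal two-point (SPSA-style) estimator, a sketch of the general technique rather than the paper's exact recipe (the smoothing scale mu is an assumed hyperparameter):

```python
import torch

def zo_gradient(loss_fn, params, mu=1e-3):
    # g ~ (L(theta + mu*u) - L(theta - mu*u)) / (2*mu) * u, with u ~ N(0, I);
    # loss_fn is a closure that recomputes the loss at the current params
    u = [torch.randn_like(p) for p in params]
    with torch.no_grad():
        for p, d in zip(params, u):
            p.add_(mu * d)
        loss_plus = loss_fn()                 # forward pass at theta + mu*u
        for p, d in zip(params, u):
            p.sub_(2 * mu * d)
        loss_minus = loss_fn()                # forward pass at theta - mu*u
        for p, d in zip(params, u):
            p.add_(mu * d)                    # restore theta
        scale = (loss_plus - loss_minus) / (2 * mu)
        return [scale * d for d in u]         # no backward pass needed
```

In a ZO-FC-style setup, an estimator like this would drive the adapter updates, while the classifier keeps ordinary backpropagated gradients.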

[CV-67] BioDet: Boosting Industrial Object Detection with Image Preprocessing Strategies ICCV2025

[Quick Read]: This paper addresses degraded detection performance in 6D pose estimation for industrial scenes caused by occlusion, poor lighting, and complex backgrounds, where pipelines built on generic object detectors followed by cropping and pose refinement hit a false-detection bottleneck. The key to the solution is a standardized, plug-in 2D detection pipeline that combines low-light image enhancement with background removal guided by open-vocabulary detection using foundation models, suppressing the false positives prevalent in raw Segment Anything Model (SAM) outputs and yielding more reliable detections of unseen industrial objects with negligible inference overhead.

Link: https://arxiv.org/abs/2510.21000
Authors: Jiaqi Hu, Hongli Xu, Junwen Huang, Peter KT Yu, Slobodan Ilic, Benjamin Busam
Affiliations: Technical University of Munich; XYZ Robotics
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, accepted by ICCV 2025 R6D

Abstract:Accurate 6D pose estimation is essential for robotic manipulation in industrial environments. Existing pipelines typically rely on off-the-shelf object detectors followed by cropping and pose refinement, but their performance degrades under challenging conditions such as clutter, poor lighting, and complex backgrounds, making detection the critical bottleneck. In this work, we introduce a standardized and plug-in pipeline for 2D detection of unseen objects in industrial settings. Based on current SOTA baselines, our approach reduces domain shift and background artifacts through low-light image enhancement and background removal guided by open-vocabulary detection with foundation models. This design suppresses the false positives prevalent in raw SAM outputs, yielding more reliable detections for downstream pose estimation. Extensive experiments on real-world industrial bin-picking benchmarks from BOP demonstrate that our method significantly boosts detection accuracy while incurring negligible inference overhead, showing the effectiveness and practicality of the proposed method.

[CV-68] VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models NEURIPS2025

[Quick Read]: This paper addresses performance degradation of vision foundation models under distribution shift and label scarcity, where supervised fine-tuning is infeasible. The key to the solution is VESSA (Video-based objEct-centric Self-Supervised Adaptation), which performs self-distillation-style self-supervised fine-tuning using only short multi-view object-centric videos, adapting the model to new domains without any annotations. The core ingredients are a self-supervised fine-tuning formulation designed for vision encoders, together with carefully tuned prediction heads and parameter-efficient adaptation techniques that prevent catastrophic forgetting, improving robustness and downstream performance without losing pretrained knowledge.

Link: https://arxiv.org/abs/2510.20994
Authors: Jesimon Barreto, Carlos Caetano, André Araujo, William Robson Schwartz
Affiliations: Universidade Federal de Minas Gerais; Universidade Estadual de Campinas; Google DeepMind
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Conference on Neural Information Processing Systems (NeurIPS 2025)

Abstract:Foundation models have advanced computer vision by enabling strong performance across diverse tasks through large-scale pretraining and supervised fine-tuning. However, they may underperform in domains with distribution shifts and scarce labels, where supervised fine-tuning may be infeasible. While continued self-supervised learning for model adaptation is common for generative language models, this strategy has not proven effective for vision-centric encoder models. To address this challenge, we introduce a novel formulation of self-supervised fine-tuning for vision foundation models, where the model is adapted to a new domain without requiring annotations, leveraging only short multi-view object-centric videos. Our method is referred to as VESSA: Video-based objEct-centric Self-Supervised Adaptation for visual foundation models. VESSA’s training technique is based on a self-distillation paradigm, where it is critical to carefully tune prediction heads and deploy parameter-efficient adaptation techniques - otherwise, the model may quickly forget its pretrained knowledge and reach a degraded state. VESSA benefits significantly from multi-view object observations sourced from different frames in an object-centric video, efficiently learning robustness to varied capture conditions, without the need of annotations. Through comprehensive experiments with 3 vision foundation models on 2 datasets, VESSA demonstrates consistent improvements in downstream classification tasks, compared to the base models and previous adaptation methods. Code is publicly available at this https URL.

[CV-69] Thermal Polarimetric Multi-view Stereo ICCV2025

[Quick Read]: This paper addresses the insufficient detail recovery of existing 3D shape reconstruction methods on transparent, translucent, and heterogeneous objects, where approaches that depend on illumination and material properties fall short. The key to the solution is a new 3D shape reconstruction method based on multi-view long-wave infrared (LWIR) polarimetric imaging: a theoretical model of thermal polarization signals sidesteps the ambiguities that affect visible-light polarization analyses, enabling high-fidelity reconstruction of fine surface structure on complex objects independently of illumination and material properties.

Link: https://arxiv.org/abs/2510.20972
Authors: Takahiro Kushida, Kenichiro Tanaka
Affiliations: Ritsumeikan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025

Abstract:This paper introduces a novel method for detailed 3D shape reconstruction utilizing thermal polarization cues. Unlike state-of-the-art methods, the proposed approach is independent of illumination and material properties. In this paper, we formulate a general theory of polarization observation and show that long-wave infrared (LWIR) polarimetric imaging is free from the ambiguities that affect visible polarization analyses. Subsequently, we propose a method for recovering detailed 3D shapes using multi-view thermal polarimetric images. Experimental results demonstrate that our approach effectively reconstructs fine details in transparent, translucent, and heterogeneous objects, outperforming existing techniques.

[CV-70] 3DReasonKnee: Advancing Grounded Reasoning in Medical Vision Language Models

[Quick Read]: This paper addresses the inability of current vision-language models (VLMs) to precisely ground anatomical regions in 3D medical images and reason about them step by step, a key bottleneck for aligning with clinical diagnostic workflows and enabling trustworthy clinician-AI collaboration. Existing 3D medical datasets provide localization labels but do not support such grounded reasoning. The core of the solution is 3DReasonKnee, the first grounded reasoning dataset for 3D medical images, with 494k high-quality quintuples derived from 7,970 3D knee MRI volumes; each quintuple contains (1) the MRI volume, (2) a diagnostic question targeting a specific anatomical region, (3) a 3D bounding box localizing the relevant structures, (4) clinician-generated reasoning steps that explicitly detail the 3D reasoning process, and (5) structured severity assessments. Built with over 450 hours of expert clinician annotation and validation, the dataset ensures quality and clinical relevance, and the accompanying ReasonKnee-Bench benchmarks VLMs' localization accuracy and diagnostic reasoning in 3D, pushing multimodal medical AI toward interpretable, trustworthy, clinically aligned 3D decision-making.

Link: https://arxiv.org/abs/2510.20967
Authors: Sraavya Sambara, Sung Eun Kim, Xiaoman Zhang, Luyang Luo, Shreya Johri, Mohammed Baharoon, Du Hyun Ro, Pranav Rajpurkar
Affiliations: Harvard Medical School; Seoul National University Hospital
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Current Vision-Language Models (VLMs) struggle to ground anatomical regions in 3D medical images and reason about them in a step-by-step manner, a key requirement of real-world diagnostic assessment. This ability is essential for aligning model outputs with the diagnostic workflows clinicians use in practice, enabling trustworthy clinician-AI collaboration. Existing 3D datasets provide localization labels, but none support this "grounded reasoning" ability. To address this gap, we introduce 3DReasonKnee, the first 3D grounded reasoning dataset for medical images, which provides 494k high-quality quintuples derived from 7,970 3D knee MRI volumes. Each quintuple includes: (1) the 3D MRI volume, (2) a diagnostic question targeting a specific anatomical region, (3) a 3D bounding box localizing the relevant anatomical structures, (4) clinician-generated diagnostic reasoning steps that explicitly detail the 3D reasoning process, and (5) structured severity assessments for the relevant anatomical region. The creation and validation of 3DReasonKnee, involving over 450 hours of expert clinician time for manually segmenting MRIs and generating reasoning chains, ensures its superior quality and clinical relevance. We establish ReasonKnee-Bench to evaluate localization and diagnostic accuracy, providing insight into VLMs' ability to perform grounding and severity assessment across anatomical regions and diagnostic inquiries. We benchmark five state-of-the-art VLMs, providing baseline performance for ReasonKnee-Bench. By providing this unique resource of expert-annotated 3D reasoning pathways, 3DReasonKnee serves as a repository of orthopedic surgeons' diagnostic expertise and offers a vital testbed for advancing multimodal medical AI systems towards 3D, clinically aligned, localized decision-making capabilities. The dataset can be found at: this https URL

[CV-71] Generative Point Tracking with Flow Matching

[Quick Read]: This paper addresses uncertainty in point tracking caused by visual obfuscations such as occlusion and appearance changes, and in particular the limitation of discriminative models that can only regress to a single mean or mode and cannot capture the multiple plausible paths of multimodal trajectories. The key to the solution is the Generative Point Tracker (GenPT), a generative framework for modelling multimodal trajectories, trained with a novel flow matching formulation that combines the iterative refinement of discriminative trackers, a window-dependent prior for cross-window consistency, and a variance schedule tuned specifically for point coordinates. At inference, the model's own confidence guides a best-first search over generated samples, yielding strong multimodal modeling and state-of-the-art accuracy on occluded points while remaining competitive on visible points.

Link: https://arxiv.org/abs/2510.20951
Authors: Mattie Tesfaldet, Adam W. Harley, Konstantinos G. Derpanis, Derek Nowrouzezahrai, Christopher Pal
Affiliations: McGill University; Mila; Stanford University; York University; Polytechnique Montréal
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Tracking a point through a video can be a challenging task due to uncertainty arising from visual obfuscations, such as appearance changes and occlusions. Although current state-of-the-art discriminative models excel in regressing long-term point trajectory estimates – even through occlusions – they are limited to regressing to a mean (or mode) in the presence of uncertainty, and fail to capture multi-modality. To overcome this limitation, we introduce Generative Point Tracker (GenPT), a generative framework for modelling multi-modal trajectories. GenPT is trained with a novel flow matching formulation that combines the iterative refinement of discriminative trackers, a window-dependent prior for cross-window consistency, and a variance schedule tuned specifically for point coordinates. We show how our model’s generative capabilities can be leveraged to improve point trajectory estimates by utilizing a best-first search strategy on generated samples during inference, guided by the model’s own confidence of its predictions. Empirically, we evaluate GenPT against the current state of the art on the standard PointOdyssey, Dynamic Replica, and TAP-Vid benchmarks. Further, we introduce a TAP-Vid variant with additional occlusions to assess occluded point tracking performance and highlight our model’s ability to capture multi-modality. GenPT is capable of capturing the multi-modality in point trajectories, which translates to state-of-the-art tracking accuracy on occluded points, while maintaining competitive tracking accuracy on visible points compared to extant discriminative point trackers.
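
At its core, flow matching regresses a velocity field along an interpolation path between noise and data. A minimal sketch of the vanilla conditional flow-matching objective on point coordinates (our illustration; GenPT's window-dependent prior and custom variance schedule are omitted, and the model signature is assumed):

```python
import torch

def flow_matching_loss(model, x0, x1, cond):
    # straight path x_t = (1 - t) * x0 + t * x1 has constant velocity (x1 - x0);
    # x0: noise sample, x1: ground-truth coordinates, cond: conditioning features
    t = torch.rand(x0.size(0), device=x0.device).view(-1, *([1] * (x0.dim() - 1)))
    xt = (1 - t) * x0 + t * x1
    v_pred = model(xt, t, cond)       # predicted velocity at time t
    return ((v_pred - (x1 - x0)) ** 2).mean()
```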

[CV-72] Focal Modulation and Bidirectional Feature Fusion Network for Medical Image Segmentation

【速读】:该论文旨在解决医学图像分割中因卷积神经网络(Convolutional Neural Networks, CNNs)局部感受野限制而导致的全局上下文信息捕捉不足、长程依赖关系建模困难的问题,进而影响对边界复杂、尺寸和形态多变病灶结构的精确分割。其解决方案的关键在于提出一种融合卷积与Transformer架构的新型网络FM-BFF-Net,核心创新包括:引入焦点调制注意力机制(Focal Modulation Attention Mechanism)以增强对局部与全局上下文的感知能力,并设计双向特征融合模块(Bidirectional Feature Fusion Module),实现编码器与解码器之间跨尺度表示的有效交互,从而显著提升分割边界精度及对病灶大小、形状和对比度变化的鲁棒性。

链接: https://arxiv.org/abs/2510.20933
作者: Moin Safdar,Shahzaib Iqbal,Mehwish Mehmood,Mubeen Ghafoor,Tariq M. Khan,Imran Razzak
机构: Abasyn University Islamabad Campus (阿巴斯大学伊斯兰堡校区); Queen’s University Belfast (贝尔法斯特女王大学); De Montfort University (德蒙福特大学); Naif Arab University for Security Sciences (安全科学纳伊夫阿拉伯大学); University of New South Wales (新南威尔士大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Medical image segmentation is essential for clinical applications such as disease diagnosis, treatment planning, and disease development monitoring because it provides precise morphological and spatial information on anatomical structures that directly influence treatment decisions. Convolutional neural networks significantly impact image segmentation; however, since convolution operations are local, capturing global contextual information and long-range dependencies is still challenging. Their capacity to precisely segment structures with complicated borders and a variety of sizes is impacted by this restriction. Since transformers use self-attention methods to capture global context and long-range dependencies efficiently, integrating transformer-based architecture with CNNs is a feasible approach to overcoming these challenges. To address these challenges, we propose the Focal Modulation and Bidirectional Feature Fusion Network for Medical Image Segmentation, referred to as FM-BFF-Net in the remainder of this paper. The network combines convolutional and transformer components, employs a focal modulation attention mechanism to refine context awareness, and introduces a bidirectional feature fusion module that enables efficient interaction between encoder and decoder representations across scales. Through this design, FM-BFF-Net enhances boundary precision and robustness to variations in lesion size, shape, and contrast. Extensive experiments on eight publicly available datasets, including polyp detection, skin lesion segmentation, and ultrasound imaging, show that FM-BFF-Net consistently surpasses recent state-of-the-art methods in Jaccard index and Dice coefficient, confirming its effectiveness and adaptability for diverse medical imaging scenarios.
zh
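
摘要中的“焦点调制注意力”可参考 FocalNets 的公开思路。下面是一个简化的 PyTorch 草图(通道数、层级数均为示例取值,不代表 FM-BFF-Net 的真实结构),展示“分层深度卷积聚合上下文、再对查询做逐元素调制”的基本形式:

```python
import torch
import torch.nn as nn

class FocalModulation(nn.Module):
    """简化版焦点调制:分层深度卷积聚合上下文,再对查询做逐元素调制。"""
    def __init__(self, dim, focal_levels=2):
        super().__init__()
        self.focal_levels = focal_levels
        self.f = nn.Linear(dim, 2 * dim + focal_levels + 1)
        self.h = nn.Conv2d(dim, dim, kernel_size=1)
        self.proj = nn.Linear(dim, dim)
        self.focal_layers = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=3 + 2 * k,
                          padding=1 + k, groups=dim),
                nn.GELU(),
            ) for k in range(focal_levels)
        )

    def forward(self, x):                              # x: (B, H, W, C)
        C = x.shape[-1]
        q, ctx, gates = torch.split(
            self.f(x), [C, C, self.focal_levels + 1], dim=-1)
        ctx = ctx.permute(0, 3, 1, 2)                  # (B, C, H, W)
        gates = gates.permute(0, 3, 1, 2)              # (B, L+1, H, W)
        agg = 0
        for k, layer in enumerate(self.focal_layers):
            ctx = layer(ctx)                           # 逐级扩大感受野
            agg = agg + ctx * gates[:, k:k + 1]
        agg = agg + ctx.mean(dim=(2, 3), keepdim=True) * gates[:, -1:]
        mod = q * self.h(agg).permute(0, 2, 3, 1)      # 调制查询特征
        return self.proj(mod)

x = torch.randn(1, 32, 32, 64)
print(FocalModulation(64)(x).shape)                    # torch.Size([1, 32, 32, 64])
```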

[CV-73] An Experimental Study of Trojan Vulnerabilities in UAV Autonomous Landing

【速读】:该论文旨在解决城市空中交通(Urban Air Mobility, UAM)车辆中自主导航与着陆系统在深度学习模型(如卷积神经网络,Convolutional Neural Networks, CNNs)层面所面临的后门攻击(Trojan attacks)安全问题。此类攻击通过在训练数据中嵌入隐蔽触发器(covert triggers),使模型在特定条件下产生错误决策,而在正常情况下仍保持高准确率,从而对UAM系统的安全性构成潜在威胁。解决方案的关键在于:首先构建一个针对UAAV(Urban Autonomous Aerial Vehicles)的定制化数据集并基于DroNet框架训练模型以模拟真实场景;其次开发了一套评估框架用于识别被Trojan攻击污染的模型,从而揭示其脆弱性并为提升UAM系统抗干扰能力提供基础支撑。

链接: https://arxiv.org/abs/2510.20932
作者: Reza Ahmari,Ahmad Mohammadi,Vahid Hemmati,Mohammed Mynuddin,Mahmoud Nabil Mahmoud,Parham Kebria,Abdollah Homaifar,Mehrdad Saif
机构: North Carolina A&T State University (北卡罗来纳农业技术州立大学); University of Alabama (阿拉巴马大学); Windsor University (温莎大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 6 pages

点击查看摘要

Abstract:This study investigates the vulnerabilities of autonomous navigation and landing systems in Urban Air Mobility (UAM) vehicles. Specifically, it focuses on Trojan attacks that target deep learning models, such as Convolutional Neural Networks (CNNs). Trojan attacks work by embedding covert triggers within a model’s training data. These triggers cause specific failures under certain conditions, while the model continues to perform normally in other situations. We assessed the vulnerability of Urban Autonomous Aerial Vehicles (UAAVs) using the DroNet framework. Our experiments showed a significant drop in accuracy, from 96.4% on clean data to 73.3% on data triggered by Trojan attacks. To conduct this study, we collected a custom dataset and trained models to simulate real-world conditions. We also developed an evaluation framework designed to identify Trojan-infected models. This work demonstrates the potential security risks posed by Trojan attacks and lays the groundwork for future research on enhancing the resilience of UAM systems.
zh
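
摘要所述“在训练数据中嵌入隐蔽触发器”属于标准的数据投毒形式,可用如下最小示意复现(触发块形状、投毒比例与目标标签均为假设,仅演示一般做法):

```python
import numpy as np

def poison_dataset(images, labels, target_label=0, rate=0.1, seed=0):
    """在一部分训练图像右下角贴 4x4 白色触发块,并把标签改为目标类。"""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(rate * len(images)), replace=False)
    images[idx, -4:, -4:, :] = 1.0        # 归一化像素下的“白色”触发块
    labels[idx] = target_label            # 触发条件下诱导特定错误输出
    return images, labels, idx

x = np.random.rand(100, 32, 32, 3).astype(np.float32)
y = np.random.randint(0, 10, size=100)
x_p, y_p, idx = poison_dataset(x, y)
print("poisoned samples:", len(idx))
```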

[CV-74] Video-As-Prompt: Unified Semantic Control for Video Generation

【速读】:该论文旨在解决视频生成中统一且可泛化的语义控制问题(semantic control in video generation),这是当前开放性挑战之一。现有方法通常因引入不恰当的像素级先验而导致伪影,或依赖于特定条件微调与任务特异性架构,缺乏通用性。其解决方案的关键在于提出 Video-As-Prompt (VAP) 新范式,将问题重构为上下文内生成(in-context generation),利用参考视频作为直接语义提示,通过一个即插即用的 Mixture-of-Transformers (MoT) 专家模块引导冻结的 Video Diffusion Transformer (DiT),同时采用时序偏置的位置嵌入消除错误映射先验,从而实现鲁棒的上下文检索和零样本泛化能力。

链接: https://arxiv.org/abs/2510.20888
作者: Yuxuan Bian,Xin Chen,Zenan Li,Tiancheng Zhi,Shen Sang,Linjie Luo,Qiang Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Website: this https URL

点击查看摘要

Abstract:Unified, generalizable semantic control in video generation remains a critical open challenge. Existing methods either introduce artifacts by enforcing inappropriate pixel-wise priors from structure-based controls, or rely on non-generalizable, condition-specific finetuning or task-specific architectures. We introduce Video-As-Prompt (VAP), a new paradigm that reframes this problem as in-context generation. VAP leverages a reference video as a direct semantic prompt, guiding a frozen Video Diffusion Transformer (DiT) via a plug-and-play Mixture-of-Transformers (MoT) expert. This architecture prevents catastrophic forgetting and is guided by a temporally biased position embedding that eliminates spurious mapping priors for robust context retrieval. To power this approach and catalyze future research, we built VAP-Data, the largest dataset for semantic-controlled video generation with over 100K paired videos across 100 semantic conditions. As a single unified model, VAP sets a new state-of-the-art for open-source methods, achieving a 38.7% user preference rate that rivals leading condition-specific commercial models. VAP’s strong zero-shot generalization and support for various downstream applications mark a significant advance toward general-purpose, controllable video generation.
zh

[CV-75] Preventing Shortcuts in Adapter Training via Providing the Shortcuts NEURIPS2025

【速读】:该论文旨在解决基于适配器(Adapter)训练的图像生成模型中因输入图像包含多种混杂视觉因素(如姿态、表情和光照等)而导致的目标属性(如身份)与非目标属性发生纠缠的问题,从而影响生成结果的质量、多样性和对文本提示的遵循能力。其解决方案的关键在于:在适配器训练阶段主动引入辅助模块(如ControlNet或LoRA)来“路由”这些混杂因素,使适配器无需学习这些无关特征,从而实现更解耦的表示;训练完成后移除这些辅助模块进行推理,避免引入额外计算开销。这一方法揭示了一个通用设计原则——在大模型时代,若希望获得解耦表征,最有效的路径可能是为不希望被学习的内容建立明确的“捷径”。

链接: https://arxiv.org/abs/2510.20887
作者: Anujraaj Argo Goyal,Guocheng Gordon Qian,Huseyin Coskun,Aarush Gupta,Himmy Tam,Daniil Ostashev,Ju Hu,Dhritiman Sagar,Sergey Tulyakov,Kfir Aberman,Kuan-Chieh Jackson Wang
机构: Snap Inc. (Snap Inc.)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to NeurIPS 2025, webpage: this https URL

点击查看摘要

Abstract:Adapter-based training has emerged as a key mechanism for extending the capabilities of powerful foundation image generators, enabling personalized and stylized text-to-image synthesis. These adapters are typically trained to capture a specific target attribute, such as subject identity, using single-image reconstruction objectives. However, because the input image inevitably contains a mixture of visual factors, adapters are prone to entangle the target attribute with incidental ones, such as pose, expression, and lighting. This spurious correlation problem limits generalization and obstructs the model’s ability to adhere to the input text prompt. In this work, we uncover a simple yet effective solution: provide the very shortcuts we wish to eliminate during adapter training. In Shortcut-Rerouted Adapter Training, confounding factors are routed through auxiliary modules, such as ControlNet or LoRA, eliminating the incentive for the adapter to internalize them. The auxiliary modules are then removed during inference. When applied to tasks like facial and full-body identity injection, our approach improves generation quality, diversity, and prompt adherence. These results point to a general design principle in the era of large models: when seeking disentangled representations, the most effective path may be to establish shortcuts for what should NOT be learned.
zh
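
其核心机制“训练时为混杂因素提供旁路、推理时移除旁路”可用如下 PyTorch 草图示意(模块划分、秩与维度均为假设,不是论文实现):

```python
import torch
import torch.nn as nn

class ShortcutRerouted(nn.Module):
    """示意:训练时让辅助 LoRA 分支承载混杂因素,推理时将其关闭。"""
    def __init__(self, base: nn.Linear, rank=4):
        super().__init__()
        self.base = base.requires_grad_(False)            # 冻结基础权重
        d_in, d_out = base.in_features, base.out_features
        self.adapter = nn.Linear(d_in, d_out, bias=False)  # 目标属性适配器
        self.aux_a = nn.Linear(d_in, rank, bias=False)     # 辅助分支(承载姿态/光照等)
        self.aux_b = nn.Linear(rank, d_out, bias=False)

    def forward(self, x, use_aux=True):
        out = self.base(x) + self.adapter(x)
        if use_aux:                                        # 训练时开启,推理时关闭
            out = out + self.aux_b(self.aux_a(x))
        return out

layer = ShortcutRerouted(nn.Linear(16, 16))
x = torch.randn(2, 16)
print(layer(x, use_aux=True).shape, layer(x, use_aux=False).shape)
```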

[CV-76] SViM3D: Stable Video Material Diffusion for Single Image 3D Generation ICCV2025

【速读】:该论文旨在解决从单张图像中重建具有物理基础渲染(PBR)特性的多视角一致材质的问题,现有方法在反射特性建模上通常依赖简化的材料模型或需额外步骤估计材质参数,难以支持灵活的光照重渲染和可控外观编辑。解决方案的关键在于扩展潜在视频扩散模型,使其联合输出空间变化的PBR参数与表面法向量,并通过显式相机控制生成多视角结果,从而构建一个可作为神经先验的3D资产生成框架,实现高质量的光照重渲染与新视角合成。

链接: https://arxiv.org/abs/2510.08271
作者: Andreas Engelhardt,Mark Boss,Vikram Voletti,Chun-Han Yao,Hendrik P. A. Lensch,Varun Jampani
机构: Stability AI; University of Tübingen(图宾根大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by International Conference on Computer Vision (ICCV 2025). Project page: this http URL

点击查看摘要

Abstract:We present Stable Video Materials 3D (SViM3D), a framework to predict multi-view consistent physically based rendering (PBR) materials, given a single image. Recently, video diffusion models have been successfully used to reconstruct 3D objects from a single image efficiently. However, reflectance is still represented by simple material models or needs to be estimated in additional steps to enable relighting and controlled appearance edits. We extend a latent video diffusion model to output spatially varying PBR parameters and surface normals jointly with each generated view based on explicit camera control. This unique setup allows for relighting and generating a 3D asset using our model as neural prior. We introduce various mechanisms to this pipeline that improve quality in this ill-posed setting. We show state-of-the-art relighting and novel view synthesis performance on multiple object-centric datasets. Our method generalizes to diverse inputs, enabling the generation of relightable 3D assets useful in AR/VR, movies, games and other visual media.
zh

[CV-77] Physics-Informed Deep Learning for Improved Input Function Estimation in Motion-Blurred Dynamic [18F]FDG PET Images MICCAI2025

链接: https://arxiv.org/abs/2510.21281
作者: Christian Salomonsen,Kristoffer K. Wickstrøm,Samuel Kuttner,Elisabeth Wetzer
机构: 未知
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 4 figures, 1 table. Preprint: Accepted to PRIME @ MICCAI 2025. This is the submitted (pre-review) version (url: this https URL )

点击查看摘要

[CV-78] Efficient Meningioma Tumor Segmentation Using Ensemble Learning MICCAI

链接: https://arxiv.org/abs/2510.21040
作者: Mohammad Mahdi Danesh Pajouh,Sara Saeedi
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 2nd Place Winner in the BraTS 2025 MICCAI Challenge (Task 2: Meningioma Tumor Segmentation)

点击查看摘要

[CV-79] Eye-Tracking as a Tool to Quantify the Effects of CAD Display on Radiologists’ Interpretation of Chest Radiographs

【速读】:该论文旨在解决计算机辅助检测(Computer-Aided Detection, CAD)系统中并发显示边界框(bounding-box, BB)提示对放射科医生胸片解读过程的影响问题,特别是其如何改变视觉搜索行为。解决方案的关键在于采用眼动追踪技术量化BB提示下阅读者的眼动指标变化,包括首次注视时间、病灶停留时间、总注视路径长度和肺野覆盖比等,从而客观评估BB提示对诊断流程的干扰或促进作用。研究发现,BB显示虽缩短了首次定位病灶的时间,但显著延长了整体解读时间和病灶注视时间,并增加了视野扫描范围,表明其可能影响阅读效率与注意力分配模式。

链接: https://arxiv.org/abs/2510.20864
作者: Daisuke Matsumoto,Tomohiro Kikuchi,Yusuke Takagi,Soichiro Kojima,Ryoma Kobayashi,Daiju Ueda,Kohei Yamamoto,Sho Kawabe,Harushi Mori
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Rationale and Objectives: Computer-aided detection systems for chest radiographs are widely used, and concurrent reader displays, such as bounding-box (BB) highlights, may influence the reading process. This pilot study used eye tracking to conduct a preliminary experiment to quantify which aspects of visual search were affected. Materials and Methods: We sampled 180 chest radiographs from the VinDR-CXR dataset: 120 with solitary pulmonary nodules or masses and 60 without. The BBs were configured to yield an overall display sensitivity and specificity of 80%. Three radiologists (with 11, 5, and 1 years of experience, respectively) interpreted each case twice - once with BBs visible and once without - after a washout of ≥ 2 weeks. Eye movements were recorded using an EyeTech VT3 Mini. Metrics included interpretation time, time to first fixation on the lesion, lesion dwell time, total gaze-path length, and lung-field coverage ratio. Outcomes were modeled using a linear mixed model, with reading condition as a fixed effect and case and reader as random intercepts. The primary analysis was restricted to true positives (n=96). Results: Concurrent BB display prolonged interpretation time by 4.9 s (p < 0.001) and increased lesion dwell time by 1.3 s (p < 0.001). Total gaze-path length increased by 2,076 pixels (p < 0.001), and lung-field coverage ratio increased by 10.5% (p < 0.001). Time to first fixation on the lesion was reduced by 1.3 s (p < 0.001). Conclusion: Eye tracking captured measurable alterations in search behavior associated with concurrent BB displays during chest radiograph interpretation. These findings support the feasibility of this approach and highlight the need for larger studies to confirm effects and explore implications across modalities and clinical contexts.
zh
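
摘要中的线性混合模型(固定效应为阅读条件,reader 与 case 为随机截距)可用 statsmodels 近似拟合;下例用模拟数据演示建模写法,变量名、样本量与数值均为假设:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for reader in ["r1", "r2", "r3"]:
    for case in range(20):
        for cond in ["noBB", "BB"]:
            rows.append({
                "reader": reader, "case": case, "condition": cond,
                # 模拟:BB 条件平均延长约 4.9 s 的判读时间
                "interp_time": 20 + (4.9 if cond == "BB" else 0)
                               + rng.normal(scale=3.0),
            })
df = pd.DataFrame(rows)

# 固定效应:阅读条件;随机截距:reader(分组)+ case(用方差成分近似交叉随机效应)
model = smf.mixedlm("interp_time ~ C(condition)", df,
                    groups=df["reader"],
                    vc_formula={"case": "0 + C(case)"})
print(model.fit().summary())
```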

[CV-80] Lightweight Classifier for Detecting Intracranial Hemorrhage in Ultrasound Data

链接: https://arxiv.org/abs/2510.20857
作者: Phat Tran,Enbai Kuang,Fred Xu
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-81] This EEG Looks Like These EEGs: Interpretable Interictal Epileptiform Discharge Detection With ProtoEEG-kNN MICCAI2025

【速读】:该论文旨在解决癫痫患者脑电图(EEG)中间期癫痫样放电(IEDs)自动检测任务中现有机器学习模型缺乏可解释性的问题。当前多数算法虽具备高准确率,但其决策过程不可解释,导致临床医生无法理解模型推理逻辑,难以识别错误预测并进行干预。解决方案的关键在于提出ProtoEEG-kNN模型,这是一种基于原型的最近邻分类方法,通过将待测EEG与训练集中相似样本进行比较,并以可视化方式展示其推理依据——包括IED的形态学特征(shape)和空间分布(location),从而实现高精度检测的同时提供人类专家更易接受的解释机制。

链接: https://arxiv.org/abs/2510.20846
作者: Dennis Tang,Jon Donnelly,Alina Jade Barnett,Lesia Semenova,Jin Jing,Peter Hadar,Ioannis Karakis,Olga Selioutski,Kehan Zhao,M. Brandon Westover,Cynthia Rudin
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: MICCAI 2025

点击查看摘要

Abstract:The presence of interictal epileptiform discharges (IEDs) in electroencephalogram (EEG) recordings is a critical biomarker of epilepsy. Even trained neurologists find detecting IEDs difficult, leading many practitioners to turn to machine learning for help. While existing machine learning algorithms can achieve strong accuracy on this task, most models are uninterpretable and cannot justify their conclusions. Absent the ability to understand model reasoning, doctors cannot leverage their expertise to identify incorrect model predictions and intervene accordingly. To improve the human-model interaction, we introduce ProtoEEG-kNN, an inherently interpretable model that follows a simple case-based reasoning process. ProtoEEG-kNN reasons by comparing an EEG to similar EEGs from the training set and visually demonstrates its reasoning both in terms of IED morphology (shape) and spatial distribution (location). We show that ProtoEEG-kNN can achieve state-of-the-art accuracy in IED detection while providing explanations that experts prefer over existing approaches.
zh
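
ProtoEEG-kNN 的“基于案例推理”可用最近邻检索来示意:对查询 EEG 的嵌入取 k 个最相似的训练样本投票,并把这些样本返回给医生比对。下面的草图中,嵌入与标签均为随机生成的假设数据:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
train_emb = rng.normal(size=(500, 64))       # 训练集 EEG 片段的嵌入(假设)
train_lbl = rng.integers(0, 2, size=500)     # 1 = 含 IED

knn = NearestNeighbors(n_neighbors=5).fit(train_emb)

def classify_with_cases(query_emb):
    """返回预测标签及最相似训练样本的索引与距离,供人工可视化比对。"""
    dist, idx = knn.kneighbors(query_emb.reshape(1, -1))
    votes = train_lbl[idx[0]]
    return int(votes.mean() >= 0.5), idx[0], dist[0]

pred, cases, dists = classify_with_cases(rng.normal(size=64))
print("prediction:", pred, "supporting cases:", cases)
```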

[CV-82] Seed3D 1.0: From Images to High-Fidelity Simulation-Ready 3D Assets

【速读】:该论文旨在解决构建具身人工智能(Embodied AI)代理时,训练环境在内容多样性与物理准确性之间难以平衡的问题。现有世界模拟器存在明显局限:基于视频的方法虽能生成丰富内容,但缺乏实时物理反馈以支持交互式学习;而基于物理引擎的方法虽具备高动态精度,却受限于人工资产创建成本高昂导致的可扩展性不足。解决方案的关键在于提出Seed3D 1.0——一个从单张图像生成仿真就绪3D资产的基础模型,其核心优势在于能够自动产出几何准确、纹理对齐良好且具备真实物理材质特性的三维对象,可直接集成至物理引擎中并最小化配置需求,从而实现高效、大规模的场景构建,支撑机器人操作与仿真训练等应用。

链接: https://arxiv.org/abs/2510.19944
作者: Jiashi Feng,Xiu Li,Jing Lin,Jiahang Liu,Gaohong Liu,Weiqiang Lou,Su Ma,Guang Shi,Qinlong Wang,Jun Wang,Zhongcong Xu,Xuanyu Yi,Zihao Yu,Jianfeng Zhang,Yifan Zhu,Rui Chen,Jinxin Chi,Zixian Du,Li Han,Lixin Huang,Kaihua Jiang,Yuhan Li,Guan Luo,Shuguang Wang,Qianyi Wu,Fan Yang,Junyang Zhang,Xuanmeng Zhang
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Seed3D 1.0 Technical Report; Official Page on this https URL

点击查看摘要

Abstract:Developing embodied AI agents requires scalable training environments that balance content diversity with physics accuracy. World simulators provide such environments but face distinct limitations: video-based methods generate diverse content but lack real-time physics feedback for interactive learning, while physics-based engines provide accurate dynamics but face scalability limitations from costly manual asset creation. We present Seed3D 1.0, a foundation model that generates simulation-ready 3D assets from single images, addressing the scalability challenge while maintaining physics rigor. Unlike existing 3D generation models, our system produces assets with accurate geometry, well-aligned textures, and realistic physically-based materials. These assets can be directly integrated into physics engines with minimal configuration, enabling deployment in robotic manipulation and simulation training. Beyond individual objects, the system scales to complete scene generation through assembling objects into coherent environments. By enabling scalable simulation-ready content creation, Seed3D 1.0 provides a foundation for advancing physics-based world simulators. Seed3D 1.0 is now available on this https URL
zh

[CV-83] Image and Point-cloud Classification for Jet Analysis in High-Energy Physics: A survey

【速读】:该论文旨在解决高能物理(High-Energy Physics, HEP)领域中如何有效应用机器学习(Machine Learning, ML)与深度学习(Deep Learning, DL)技术以提升粒子探测、特征提取及分类等任务性能的问题。其关键解决方案在于系统性地梳理和归纳当前主流ML/DL方法在HEP中的具体应用场景,包括Jet重建、Jet tagging、粒子分类与追踪等,并针对不同数据类型(如图像与点云数据)提出适配的预处理、特征工程与模型选择策略。论文特别强调了这些技术对下一代强子对撞机(如HL-LHC和FCChh)的适用性与扩展潜力,为未来HEP研究提供了可落地的技术路径与理论支撑。

链接: https://arxiv.org/abs/2403.11934
作者: Hamza Kheddar,Yassine Himeur,Abbes Amira,Rachik Soualah
机构: University of Medea(麦地大学); University of Dubai(迪拜大学); University of Sharjah(沙迦大学); De Montfort University(德蒙福特大学); Khalifa University of Science and Technology(哈利法科学技术大学); International Center for Theoretical Physics(国际理论物理中心)
类目: High Energy Physics - Phenomenology (hep-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); High Energy Physics - Experiment (hep-ex)
备注: Accepted paper in Frontier of Physics

点击查看摘要

Abstract:Nowadays, there has been a growing trend in the field of high-energy physics (HEP), in both its experimental and phenomenological studies, to incorporate machine learning (ML) and its specialized branch, deep learning (DL). This review paper provides a thorough illustration of these applications using different ML and DL approaches. The first part of the paper examines the basics of various particle physics types and establishes guidelines for assessing particle physics alongside the available learning models. Next, a detailed classification is provided for representing Jets that are reconstructed in high-energy collisions, mainly in proton-proton collisions at well-defined beam energies. This section covers various datasets, preprocessing techniques, and feature extraction and selection methods. The presented techniques can be applied to future hadron-hadron colliders (HHC), such as the high-luminosity LHC (HL-LHC) and the future circular collider - hadron-hadron (FCChh). The authors then explore several AI-based analysis techniques designed specifically for both image and point-cloud (PC) data in HEP. Additionally, a closer look is taken at the classification associated with Jet tagging in hadron collisions. In this review, various state-of-the-art (SOTA) techniques in ML and DL are examined, with a focus on their implications for HEP demands. More precisely, this discussion addresses various applications in extensive detail, such as Jet tagging, Jet tracking, particle classification, and more. The review concludes with an analysis of the current state of HEP using DL methodologies. It highlights the challenges and potential areas for future research, which are illustrated for each application.
zh

人工智能

[AI-0] A Knowledge-Graph Translation Layer for Mission-Aware Multi-Agent Path Planning in Spatiotemporal Dynamics

【速读】:该论文旨在解决自主代理在动态环境中协调时面临的语义鸿沟问题,即高层任务目标与底层规划器输入之间的不匹配。解决方案的关键在于引入一个以知识图谱(Knowledge Graph, KG)为核心的框架,作为智能翻译层:KG采用双平面架构,将声明式事实转化为每个代理特有的、任务感知的“世界观”和物理感知的通行规则,从而实现任务语义与领域无关的规划器解耦;通过修改KG中的事实即可简单调整复杂协同路径,显著提升系统的适应性与可解释性。

链接: https://arxiv.org/abs/2510.21695
作者: Edward Holmberg,Elias Ioup,Mahdi Abdelguerfi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 10 figures, conference submission

点击查看摘要

Abstract:The coordination of autonomous agents in dynamic environments is hampered by the semantic gap between high-level mission objectives and low-level planner inputs. To address this, we introduce a framework centered on a Knowledge Graph (KG) that functions as an intelligent translation layer. The KG’s two-plane architecture compiles declarative facts into per-agent, mission-aware “worldviews” and physics-aware traversal rules, decoupling mission semantics from a domain-agnostic planner. This allows complex, coordinated paths to be modified simply by changing facts in the KG. A case study involving Autonomous Underwater Vehicles (AUVs) in the Gulf of Mexico visually demonstrates the end-to-end process and quantitatively proves that different declarative policies produce distinct, high-performing outcomes. This work establishes the KG not merely as a data repository, but as a powerful, stateful orchestrator for creating adaptive and explainable autonomous systems.
zh
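
“将声明式事实编译为每个代理的世界观与通行规则”这一思路可用如下极简草图说明:改动事实字典即可改变代理可用的图结构。区域名、属性与代理参数均为虚构示例:

```python
# 示意:由声明式事实编译出任务感知的 per-agent “世界观”(可通行边集合)
facts = {
    ("zone_A", "zone_B"): {"max_depth": 200, "restricted": False},
    ("zone_B", "zone_C"): {"max_depth": 800, "restricted": True},
    ("zone_A", "zone_C"): {"max_depth": 150, "restricted": False},
}

def compile_worldview(agent):
    """根据代理的物理限制与任务许可过滤边,返回其专属可通行图。"""
    edges = []
    for (u, v), props in facts.items():
        if props["max_depth"] > agent["depth_rating"]:
            continue                        # 超出该代理的深度能力
        if props["restricted"] and not agent["clearance"]:
            continue                        # 无权限进入受限区域
        edges.append((u, v))
    return edges

auv = {"depth_rating": 300, "clearance": False}
print(compile_worldview(auv))   # 仅保留该 AUV 可通行的边,交给通用规划器
```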

[AI-1] A Multimodal Benchmark for Framing of Oil & Gas Advertising and Potential Greenwashing Detection NEURIPS2025

【速读】:该论文旨在解决企业公共关系(PR)宣传中“言行不一”现象的量化与识别问题,尤其是通过视觉-语言模态分析揭示品牌在能源行业中的战略传播框架(framing)。其核心挑战在于如何从海量视频广告中自动识别并区分多种隐含的叙事框架,以评估企业是否在进行绿色洗牌(greenwashing)。解决方案的关键是构建了一个大规模、专家标注的多国视频广告数据集,涵盖50余家公司及倡导团体在20个国家发布的广告内容,并对13类传播框架进行标注,专门用于评估视觉-语言模型(Vision-Language Models, VLMs)在多模态语境下理解战略传播的能力。这一数据集填补了以往仅基于文本的框架分析研究空白,为未来开发更精准的多模态舆情监测工具提供了基准和方向。

链接: https://arxiv.org/abs/2510.21679
作者: Gaku Morio,Harri Rowlands,Dominik Stammbach,Christopher D. Manning,Peter Henderson
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Forthcoming in NeurIPS 2025 Datasets and Benchmarks Track

点击查看摘要

Abstract:Companies spend large amounts of money on public relations campaigns to project a positive brand image. However, sometimes there is a mismatch between what they say and what they do. Oil & gas companies, for example, are accused of “greenwashing” with imagery of climate-friendly initiatives. Understanding the framing, and changes in framing, at scale can help better understand the goals and nature of public relations campaigns. To address this, we introduce a benchmark dataset of expert-annotated video ads obtained from Facebook and YouTube. The dataset provides annotations for 13 framing types for more than 50 companies or advocacy groups across 20 countries. Our dataset is especially designed for the evaluation of vision-language models (VLMs), distinguishing it from past text-only framing datasets. Baseline experiments show some promising results, while leaving room for improvement for future work: GPT-4.1 can detect environmental messages with 79% F1 score, while our best model only achieves 46% F1 score on identifying framing around green innovation. We also identify challenges that VLMs must address, such as implicit framing, handling videos of various lengths, or implicit cultural backgrounds. Our dataset contributes to research in multimodal analysis of strategic communication in the energy sector.
zh

[AI-2] CMOMgen: Complex Multi-Ontology Alignment via Pattern-Guided In-Context Learning

【速读】:该论文旨在解决多ontology(本体)之间复杂语义整合问题,即如何在不依赖简单一对一映射的前提下,实现跨多个目标本体的复合逻辑表达式级对齐,从而构建更精细、可溯源的语义层。其核心挑战在于传统方法无法有效处理相关但结构分离的本体之间的深层语义关系。解决方案的关键是提出CMOMgen——首个端到端的复杂多本体匹配(Complex Multi-Ontology Matching, CMOM)策略,通过检索增强生成(Retrieval-Augmented Generation)技术筛选相关类并过滤参考映射作为示例,强化上下文学习能力,从而生成完整且语义合理的复合映射。该方法无需限制目标本体或实体数量,在三个生物医学任务中显著优于基线模型,并在F1-score和人工评估中均表现出优越性能。

链接: https://arxiv.org/abs/2510.21656
作者: Marta Contreiras Silva,Daniel Faria,Catia Pesquita
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 32 pages, 5 figures

点击查看摘要

Abstract:Constructing comprehensive knowledge graphs requires the use of multiple ontologies in order to fully contextualize data into a domain. Ontology matching finds equivalences between concepts interconnecting ontologies and creating a cohesive semantic layer. While the simple pairwise state of the art is well established, simple equivalence mappings cannot provide full semantic integration of related but disjoint ontologies. Complex multi-ontology matching (CMOM) aligns one source entity to composite logical expressions of multiple target entities, establishing more nuanced equivalences and provenance along the ontological hierarchy. We present CMOMgen, the first end-to-end CMOM strategy that generates complete and semantically sound mappings, without establishing any restrictions on the number of target ontologies or entities. Retrieval-Augmented Generation selects relevant classes to compose the mapping and filters matching reference mappings to serve as examples, enhancing In-Context Learning. The strategy was evaluated in three biomedical tasks with partial reference alignments. CMOMgen outperforms baselines in class selection, demonstrating the impact of having a dedicated strategy. Our strategy also achieves a minimum of 63% in F1-score, outperforming all baselines and ablated versions in two out of three tasks and placing second in the third. Furthermore, a manual evaluation of non-reference mappings showed that 46% of the mappings achieve the maximum score, further substantiating its ability to construct semantically sound mappings.
zh

[AI-3] DEEDEE: Fast and Scalable Out-of-Distribution Dynamics Detection

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在安全关键场景中因分布偏移(distribution shift)导致的脆弱性问题,尤其是如何有效检测时间序列数据中的分布外(Out-of-Distribution, OOD)样本。解决方案的关键在于提出DEEDEE——一种基于两个统计量的轻量级检测器:利用每个episode的均值和与训练摘要的RBF核相似度,分别捕捉全局与局部的偏差特征。该方法摒弃了传统复杂表示学习流程,仅通过低阶统计量即可实现对多种异常类型的高效识别,在保持高检测精度的同时显著降低计算开销(FLOPs/运行时间减少600倍),并相较强基线平均提升5%的绝对准确率,验证了复杂环境中OOD检测可建立于一组紧凑低阶统计特征之上的可行性。

链接: https://arxiv.org/abs/2510.21638
作者: Tala Aljaafari,Varun Kanade,Philip Torr,Christian Schroeder de Witt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deploying reinforcement learning (RL) in safety-critical settings is constrained by brittleness under distribution shift. We study out-of-distribution (OOD) detection for RL time series and introduce DEEDEE, a two-statistic detector that revisits representation-heavy pipelines with a minimal alternative. DEEDEE uses only an episodewise mean and an RBF kernel similarity to a training summary, capturing complementary global and local deviations. Despite its simplicity, DEEDEE matches or surpasses contemporary detectors across standard RL OOD suites, delivering a 600-fold reduction in compute (FLOPs / wall-time) and an average 5% absolute accuracy gain over strong baselines. Conceptually, our results indicate that diverse anomaly types often imprint on RL trajectories through a small set of low-order statistics, suggesting a compact foundation for OOD detection in complex environments.
zh
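
按摘要描述,DEEDEE 仅依赖“episode 均值”与“对训练摘要的 RBF 核相似度”两个统计量。下面给出一个假设性的最小复现:两个统计量的具体组合方式、训练摘要取法与判定阈值均为笔者猜测,并非官方代码:

```python
import numpy as np

def deedee_stats(episode, train_mean, gamma=0.1):
    """episode: (T, d) 状态序列;返回 (全局均值偏差, 对训练摘要的 RBF 相似度)。"""
    ep_mean = episode.mean(axis=0)
    deviation = float(np.linalg.norm(ep_mean - train_mean))
    similarity = float(np.exp(-gamma * np.sum((ep_mean - train_mean) ** 2)))
    return deviation, similarity

rng = np.random.default_rng(0)
train_eps = rng.normal(size=(100, 50, 8))       # 100 条训练轨迹 × 50 步 × 8 维状态
train_mean = train_eps.mean(axis=(0, 1))        # 训练摘要(此处取全局均值)

dev, sim = deedee_stats(rng.normal(loc=0.5, size=(50, 8)), train_mean)
print(dev, sim, "OOD" if sim < 0.5 else "in-distribution")  # 阈值 0.5 为假设
```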

[AI-4] Huxley-Gödel Machine: Human-Level Coding Agent Development by an Approximation of the Optimal Self-Improving Machine

【速读】:该论文旨在解决自改进编码代理(self-improving coding agent)在自我演化过程中存在的“元生产力-性能不匹配”(Metaproductivity-Performance Mismatch)问题,即代理在软件工程基准测试中的当前表现与其未来潜在的自我改进能力之间缺乏一致性。为应对这一挑战,作者提出了一种新的度量指标 CMP(Clade Metaproductivity),该指标通过聚合一个代理后代在多个基准上的表现来衡量其自我改进潜力。解决方案的关键在于引入Huxley-Gödel Machine(HGM),该机器以对 CMP 的估计作为指导信号,在自修改树中进行搜索,从而更高效地识别高潜力的自我修改路径。实验表明,HGM在SWE-bench Verified和Polyglot数据集上优于现有方法,并且在不同模型和任务间具有良好的迁移性,甚至在GPT-5-mini优化后于SWE-bench Lite评估时达到人类水平性能。

链接: https://arxiv.org/abs/2510.21614
作者: Wenyi Wang,Piotr Piękos,Li Nanbo,Firas Laakom,Yimeng Chen,Mateusz Ostaszewski,Mingchen Zhuge,Jürgen Schmidhuber
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent studies operationalize self-improvement through coding agents that edit their own codebases. They grow a tree of self-modifications through expansion strategies that favor higher software engineering benchmark performance, assuming that this implies more promising subsequent self-modifications. However, we identify a mismatch between the agent’s self-improvement potential (metaproductivity) and its coding benchmark performance, namely the Metaproductivity-Performance Mismatch. Inspired by Huxley’s concept of clade, we propose a metric (\mathrm{CMP}) that aggregates the benchmark performances of the descendants of an agent as an indicator of its potential for self-improvement. We show that, in our self-improving coding agent development setting, access to the true \mathrm{CMP} is sufficient to simulate how the Gödel Machine would behave under certain assumptions. We introduce the Huxley-Gödel Machine (HGM), which, by estimating \mathrm{CMP} and using it as guidance, searches the tree of self-modifications. On SWE-bench Verified and Polyglot, HGM outperforms prior self-improving coding agent development methods while using less wall-clock time. Last but not least, HGM demonstrates strong transfer to other coding datasets and large language models. The agent optimized by HGM on SWE-bench Verified with GPT-5-mini and evaluated on SWE-bench Lite with GPT-5 achieves human-level performance, matching the best officially checked results of human-engineered coding agents. Our code is available at this https URL.
zh

[AI-5] Generative Correlation Manifolds: Generating Synthetic Data with Preserved Higher-Order Correlations

【速读】:该论文旨在解决当前合成数据生成技术难以保留真实数据中高阶相关结构的问题,即现有方法虽能复现简单的统计特征,却无法有效维持变量间的成对及更高阶的关联关系,导致生成的数据在复杂建模任务中表现不佳。其解决方案的关键在于提出生成相关流形(Generative Correlation Manifolds, GCM)方法,通过目标相关矩阵的Cholesky分解,从数学上严格保证生成数据能够完整保留源数据的全部相关结构——从简单成对关系到复杂的多变量交互。

链接: https://arxiv.org/abs/2510.21610
作者: Jens E. d’Hondt,Wieger R. Punter,Odysseas Papapetrou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The increasing need for data privacy and the demand for robust machine learning models have fueled the development of synthetic data generation techniques. However, current methods often succeed in replicating simple summary statistics but fail to preserve both the pairwise and higher-order correlation structure of the data that define the complex, multi-variable interactions inherent in real-world systems. This limitation can lead to synthetic data that is superficially realistic but fails when used for sophisticated modeling tasks. In this white paper, we introduce Generative Correlation Manifolds (GCM), a computationally efficient method for generating synthetic data. The technique uses Cholesky decomposition of a target correlation matrix to produce datasets that, by mathematical proof, preserve the entire correlation structure – from simple pairwise relationships to higher-order interactions – of the source dataset. We argue that this method provides a new approach to synthetic data generation with potential applications in privacy-preserving data sharing, robust model training, and simulation.
zh
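
摘要已说明 GCM 的核心操作是对目标相关矩阵做 Cholesky 分解。下面的最小草图演示该构造为何能精确保持相关结构:若 corr = L Lᵀ 且 z 为独立标准正态样本,则 zLᵀ 的协方差恰为 corr(相关矩阵取值为示例):

```python
import numpy as np

def generate_gcm(corr, n_samples, seed=0):
    """用目标相关矩阵的 Cholesky 分解生成保持相关结构的合成数据。"""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(corr)             # corr = L @ L.T
    z = rng.standard_normal((n_samples, corr.shape[0]))
    return z @ L.T                           # 样本相关矩阵收敛于 corr

target = np.array([[1.0, 0.8, 0.3],
                   [0.8, 1.0, 0.5],
                   [0.3, 0.5, 1.0]])
x = generate_gcm(target, 100_000)
print(np.round(np.corrcoef(x, rowvar=False), 2))   # 应接近 target
```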

[AI-6] Learning Neural Control Barrier Functions from Expert Demonstrations using Inverse Constraint Learning

【速读】:该论文旨在解决自主系统在关键领域中安全控制的问题,特别是当系统的不可接受状态(即失败集)难以明确形式化定义时,如何设计有效的安全约束。例如,在自动驾驶场景中,“跟车过近”这类风险行为不易精确建模,但专家示范数据却相对容易获取。解决方案的关键在于利用逆约束学习(Inverse Constraint Learning, ICL)从专家示范数据中训练一个约束函数,该函数能够将系统状态分类为安全(属于与失败集不相交的前向不变集)或不安全状态;随后,利用该约束函数对仿真轨迹进行自动标注,从而训练神经控制屏障函数(neural Control Barrier Function, neural CBF)。这种方法无需显式定义失败集,即可实现与使用真实安全标签训练的神经CBF相当的安全性能。

链接: https://arxiv.org/abs/2510.21560
作者: Yuxuan Yang,Hussein Sibai
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Safety is a fundamental requirement for autonomous systems operating in critical domains. Control barrier functions (CBFs) have been used to design safety filters that minimally alter nominal controls for such systems to maintain their safety. Learning neural CBFs has been proposed as a data-driven alternative for their computationally expensive optimization-based synthesis. However, it is often the case that the failure set of states that should be avoided is non-obvious or hard to specify formally, e.g., tailgating in autonomous driving, while a set of expert demonstrations that achieve the task and avoid the failure set is easier to generate. We use ICL to train a constraint function that classifies the states of the system under consideration to safe, i.e., belong to a controlled forward invariant set that is disjoint from the unspecified failure set, and unsafe ones, i.e., belong to the complement of that set. We then use that function to label a new set of simulated trajectories to train our neural CBF. We empirically evaluate our approach in four different environments, demonstrating that it outperforms existing baselines and achieves comparable performance to a neural CBF trained with the same data but annotated with ground-truth safety labels.
zh

[AI-7] Co-Sight: Enhancing LLM -Based Agents via Conflict-Aware Meta-Verification and Trustworthy Reasoning with Structured Facts

【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动的智能体在长时推理过程中因中间推理步骤缺乏充分验证而导致的可靠性不足问题。其核心解决方案在于构建一个可 falsifiable(可证伪)且可审计(auditable)的推理闭环机制,关键创新点为两个互补模块:Conflict-Aware Meta-Verification (CAMV) 和 Trustworthy Reasoning with Structured Facts (TRSF)。CAMV 将验证过程重构为冲突识别与针对性证伪,仅在专家代理间存在分歧的热点区域分配计算资源,从而显著降低验证成本并提升效率与可靠性;TRSF 通过结构化事实模块持续组织、验证和同步跨代理证据,确保推理始终基于一致、可溯源的源验证信息,并支持全程透明验证。二者协同形成闭环验证机制,使推理既可信又可解释,实验证明该方法在 GAIA、Humanity’s Last Exam 等基准上达到领先性能。

链接: https://arxiv.org/abs/2510.21557
作者: Hongwei Zhang,Ji Lu,Shiqing Jiang,Chenxiang Zhu,Li Xie,Chen Zhong,Haoran Chen,Yurui Zhu,Yongsheng Du,Yanqin Gao,Lingjun Huang,Baoli Wang,Fang Tan,Peng Zou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-horizon reasoning in LLM-based agents often fails not from generative weakness but from insufficient verification of intermediate reasoning. Co-Sight addresses this challenge by turning reasoning into a falsifiable and auditable process through two complementary mechanisms: Conflict-Aware Meta-Verification (CAMV) and Trustworthy Reasoning with Structured Facts (TRSF). CAMV reformulates verification as conflict identification and targeted falsification, allocating computation only to disagreement hotspots among expert agents rather than to full reasoning chains. This bounds verification cost to the number of inconsistencies and improves efficiency and reliability. TRSF continuously organizes, validates, and synchronizes evidence across agents through a structured facts module. By maintaining verified, traceable, and auditable knowledge, it ensures that all reasoning is grounded in consistent, source-verified information and supports transparent verification throughout the reasoning process. Together, TRSF and CAMV form a closed verification loop, where TRSF supplies structured facts and CAMV selectively falsifies or reinforces them, yielding transparent and trustworthy reasoning. Empirically, Co-Sight achieves state-of-the-art accuracy on GAIA (84.4%) and Humanity’s Last Exam (35.5%), and strong results on Chinese-SimpleQA (93.8%). Ablation studies confirm that the synergy between structured factual grounding and conflict-aware verification drives these improvements. Co-Sight thus offers a scalable paradigm for reliable long-horizon reasoning in LLM-based agents. Code is available at this https URL.
zh

[AI-8] Human and AI Trust: Trust Attitude Measurement Instrument

【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)技术广泛应用背景下,如何从以人为本的角度系统、可靠地测量非专家用户对AI系统的信任问题。现有研究缺乏针对人类-AI交互场景下信任态度的标准化测量工具,限制了对AI接受度和部署效果的深入分析。解决方案的关键在于基于心理测量学原则开发并验证了一个包含16个题项的信任量表,该量表专门用于评估 layperson(非专家)对AI医疗支持系统(如癌症/健康预测)的信任态度,并通过六个阶段的研究流程(包括项目开发、评估、调查实施、维度检验、信度与效度测试)证明其具有良好的实证可靠性与有效性,从而为后续相关领域的人机信任研究提供了可复用的测量工具。

链接: https://arxiv.org/abs/2510.21535
作者: Retno Larasati
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the current progress of Artificial Intelligence (AI) technology and its increasingly broader applications, trust is seen as a required criterion for AI usage, acceptance, and deployment. A robust measurement instrument is essential to correctly evaluate trust from a human-centered perspective. This paper describes the development and validation process of a trust measure instrument, which follows psychometric principles, and consists of a 16-items trust scale. The instrument was built explicitly for research in human-AI interaction to measure trust attitudes towards AI systems from layperson (non-expert) perspective. The use-case we used to develop the scale was in the context of AI medical support systems (specifically cancer/health prediction). The scale development (Measurement Item Development) and validation (Measurement Item Evaluation) involved six research stages: item development, item evaluation, survey administration, test of dimensionality, test of reliability, and test of validity. The results of the six-stages evaluation show that the proposed trust measurement instrument is empirically reliable and valid for systematically measuring and comparing non-experts’ trust in AI Medical Support Systems.
zh

[AI-9] EU-Agent -Bench: Measuring Illegal Behavior of LLM Agents Under EU Law NEURIPS2025

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)作为智能代理(agent)在实际部署中可能因缺乏法律合规性而导致非法或不安全行为的问题,尤其是在欧盟(EU)立法框架下。其解决方案的关键在于构建并发布EU-Agent-Bench——一个可验证的人工标注基准测试集,用于评估LLM代理在用户输入看似无害但可能引发违法后果的情境下,是否能遵循欧盟法律规范(如数据保护、歧视防范和科学诚信等领域)。该基准通过将模型的功能调用与详尽引用相关立法条文的评分标准进行比对,量化模型的法律合规性,并进一步验证在系统提示中嵌入具体法规文本及明确合规指令是否能提升模型行为的安全性。

链接: https://arxiv.org/abs/2510.21524
作者: Ilija Lichkovski,Alexander Müller,Mariam Ibrahim,Tiwai Mhundwa
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at the Workshop on Regulatable ML at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed as agents in various contexts by providing tools at their disposal. However, LLM agents can exhibit unpredictable behaviors, including taking undesirable and/or unsafe actions. In order to measure the latent propensity of LLM agents for taking illegal actions under an EU legislative context, we introduce EU-Agent-Bench, a verifiable human-curated benchmark that evaluates an agent’s alignment with EU legal norms in situations where benign user inputs could lead to unlawful actions. Our benchmark spans scenarios across several categories, including data protection, bias/discrimination, and scientific integrity, with each user request allowing for both compliant and non-compliant execution of the requested actions. Comparing the model’s function calls against a rubric exhaustively supported by citations of the relevant legislation, we evaluate the legal compliance of frontier LLMs, and furthermore investigate the compliance effect of providing the relevant legislative excerpts in the agent’s system prompt along with explicit instructions to comply. We release a public preview set for the research community, while holding out a private test set to prevent data contamination in evaluating upcoming models. We encourage future work extending agentic safety benchmarks to different legal jurisdictions and to multi-turn and multilingual interactions. We release our code on this https URL.
zh

[AI-10] Enhancing Social Robots through Resilient AI

【速读】:该论文旨在解决人工智能系统在敏感领域(如医疗、教育和日常生活)中因环境复杂性或系统退化而难以维持可靠性和用户信任的问题。其解决方案的关键在于引入“韧性”(resilience)这一核心特性,即社会机器人在不利或压力条件下仍能保持基本功能的能力,从而增强老年人等低信任群体对系统的信赖,确保其在关键场景下的稳定运行。

链接: https://arxiv.org/abs/2510.21469
作者: Domenico Palmisano,Giuseppe Palestra,Berardina Nadja De Carolis
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, Workshop on Adaptive Social Interaction based on user’s Mental mOdels and behaVior in HRI, The 17th International Conference on Social Robotics, 10-12 September 2025, Naples (IT)

点击查看摘要

Abstract:As artificial intelligence continues to advance and becomes more integrated into sensitive areas like healthcare, education, and everyday life, it’s crucial for these systems to be both resilient and robust. This paper shows how resilience is a fundamental characteristic of social robots: through resilience, they sustain trust in the robot itself, an essential element especially when operating in contexts with elderly people, who often have low trust in these systems. Resilience is therefore the ability to operate under adverse or stressful conditions, even when degraded or weakened, while maintaining essential operational capabilities.
zh

[AI-11] Multi-Task Vehicle Routing Solver via Mixture of Specialized Experts under State-Decomposable MDP NEURIPS2025

【速读】:该论文旨在解决现有神经方法在处理多任务车辆路径问题(Multi-task Vehicle Routing Problems, VRPs)时,未能有效利用VRP变体的组合结构这一关键问题。传统统一求解器往往将多个约束同时学习,忽略了不同VRP变体可由一组基础VRP变体通过组合方式生成的特性,从而导致无法充分利用每个基础变体对应的专用求解器的优势。解决方案的关键在于提出一种状态可分解的马尔可夫决策过程(State-Decomposable MDP, SDMDP),它将原问题的状态空间表示为各基础VRP变体对应状态空间的笛卡尔积,从而自然地导出每个基础变体的最优策略;进一步引入基于潜在空间的SDMDP扩展,结合最优基础策略与可学习混合函数,在潜在空间中实现策略复用,并在弱假设下保证能通过混合函数恢复出SDMDP的最优统一策略。实际实现上,提出了Mixture-of-Specialized-Experts Solver (MoSES),使用低秩适配(Low-Rank Adaptation, LoRA)专家实现基础策略,并通过自适应门控机制实现混合函数,显著提升了多任务VRP求解性能。

链接: https://arxiv.org/abs/2510.21453
作者: Yuxin Pan,Zhiguang Cao,Chengyang Gu,Liu Liu,Peilin Zhao,Yize Chen,Fangzhen Lin
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to NeurIPS 2025

点击查看摘要

Abstract:Existing neural methods for multi-task vehicle routing problems (VRPs) typically learn unified solvers to handle multiple constraints simultaneously. However, they often underutilize the compositional structure of VRP variants, each derivable from a common set of basis VRP variants. This critical oversight causes unified solvers to miss out the potential benefits of basis solvers, each specialized for a basis VRP variant. To overcome this limitation, we propose a framework that enables unified solvers to perceive the shared-component nature across VRP variants by proactively reusing basis solvers, while mitigating the exponential growth of trained neural solvers. Specifically, we introduce a State-Decomposable MDP (SDMDP) that reformulates VRPs by expressing the state space as the Cartesian product of basis state spaces associated with basis VRP variants. More crucially, this formulation inherently yields the optimal basis policy for each basis VRP variant. Furthermore, a Latent Space-based SDMDP extension is developed by incorporating both the optimal basis policies and a learnable mixture function to enable the policy reuse in the latent space. Under mild assumptions, this extension provably recovers the optimal unified policy of SDMDP through the mixture function that computes the state embedding as a mapping from the basis state embeddings generated by optimal basis policies. For practical implementation, we introduce the Mixture-of-Specialized-Experts Solver (MoSES), which realizes basis policies through specialized Low-Rank Adaptation (LoRA) experts, and implements the mixture function via an adaptive gating mechanism. Extensive experiments conducted across VRP variants showcase the superiority of MoSES over prior methods.
zh

[AI-12] AutoOpt: A Dataset and a Unified Framework for Automating Optimization Problem Solving NEURIPS2025

【速读】:该论文旨在解决优化问题自动求解的难题,即如何从用户提供的数学优化模型图像中自动识别并转化为可执行的计算机代码,进而高效求解各类复杂优化问题。其核心挑战在于跨模态理解(图像到文本)与复杂优化模型的自动化建模及求解。解决方案的关键在于构建AutoOpt框架,该框架由三个模块组成:(i) M1(Image_to_Text)利用深度学习模型完成数学表达式识别(MER),生成LaTeX表示;(ii) M2(Text_to_Text)基于微调的小规模大语言模型(LLM)将LaTeX转换为PYOMO脚本;(iii) M3(Optimization)采用基于双层优化分解(BOBD)的混合方法求解PYOMO描述的问题,显著优于传统内点法和遗传算法等通用方法。整个流程实现了从图像输入到优化结果输出的端到端自动化,且依赖于高质量的AutoOpt-11k数据集进行训练与验证。

链接: https://arxiv.org/abs/2510.21436
作者: Ankur Sinha,Shobhit Arora,Dhaval Pujara
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: NeurIPS 2025, 28 pages, 11 figures, 11 tables

点击查看摘要

Abstract:This study presents AutoOpt-11k, a unique image dataset of over 11,000 handwritten and printed mathematical optimization models corresponding to single-objective, multi-objective, multi-level, and stochastic optimization problems exhibiting various types of complexities such as non-linearity, non-convexity, non-differentiability, discontinuity, and high-dimensionality. The labels consist of the LaTeX representation for all the images and modeling language representation for a subset of images. The dataset is created by 25 experts following ethical data creation guidelines and verified in two-phases to avoid errors. Further, we develop AutoOpt framework, a machine learning based automated approach for solving optimization problems, where the user just needs to provide an image of the formulation and AutoOpt solves it efficiently without any further human intervention. AutoOpt framework consists of three Modules: (i) M1 (Image_to_Text)- a deep learning model performs the Mathematical Expression Recognition (MER) task to generate the LaTeX code corresponding to the optimization formulation in image; (ii) M2 (Text_to_Text)- a small-scale fine-tuned LLM generates the PYOMO script (optimization modeling language) from LaTeX code; (iii) M3 (Optimization)- a Bilevel Optimization based Decomposition (BOBD) method solves the optimization formulation described in the PYOMO script. We use AutoOpt-11k dataset for training and testing of deep learning models employed in AutoOpt. The deep learning model for MER task (M1) outperforms ChatGPT, Gemini and Nougat on BLEU score metric. BOBD method (M3), which is a hybrid approach, yields better results on complex test problems compared to common approaches, like interior-point algorithm and genetic algorithm.
zh
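
摘要中 M2 模块的输出是 PYOMO 建模脚本。下面给出此类脚本的大致形态,便于理解 M3 接收的输入;优化问题本身为虚构示例,且假设本地已安装 ipopt 求解器:

```python
# 示意:M2 阶段由 LaTeX 生成的 PYOMO 脚本大致形态(问题为虚构示例)
from pyomo.environ import (ConcreteModel, Var, Objective, Constraint,
                           NonNegativeReals, minimize, SolverFactory)

m = ConcreteModel()
m.x = Var(domain=NonNegativeReals)
m.y = Var(domain=NonNegativeReals)
m.obj = Objective(expr=(m.x - 2) ** 2 + (m.y - 1) ** 2, sense=minimize)
m.c1 = Constraint(expr=m.x + m.y <= 3)

SolverFactory("ipopt").solve(m)     # 假设 ipopt 可用;M3 实际采用 BOBD 求解
print(m.x(), m.y())
```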

[AI-13] Advancing Symbolic Integration in Large Language Models : Beyond Conventional Neurosymbolic AI

【速读】:该论文试图解决当前大语言模型(Large Language Models, LLMs)在高风险领域应用中因缺乏透明性而面临的“黑箱”问题,尤其是在如何有效整合符号人工智能(Symbolic AI)以提升模型可解释性和可控性方面存在系统性理解不足的问题。其解决方案的关键在于提出一种全新的符号集成分类框架,从LLM的处理阶段、耦合机制、架构范式以及算法与应用层四个维度对现有文献进行系统梳理,并据此构建一个清晰的研究路线图,从而为未来符号技术与LLM的深度融合提供结构化指导和实践路径。

链接: https://arxiv.org/abs/2510.21425
作者: Maneeha Rani,Bhupesh Kumar Mishra,Dhavalkumar Thakker
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLMs have demonstrated highly effective learning, human-like response generation,and decision-making capabilities in high-risk sectors. However, these models remain black boxes because they struggle to ensure transparency in responses. The literature has explored numerous approaches to address transparency challenges in LLMs, including Neurosymbolic AI (NeSy AI). NeSy AI approaches were primarily developed for conventional neural networks and are not well-suited to the unique features of LLMs. Consequently, there is a limited systematic understanding of how symbolic AI can be effectively integrated into LLMs. This paper aims to address this gap by first reviewing established NeSy AI methods and then proposing a novel taxonomy of symbolic integration in LLMs, along with a roadmap to merge symbolic techniques with LLMs. The roadmap introduces a new categorisation framework across four dimensions by organising existing literature within these categories. These include symbolic integration across various stages of LLM, coupling mechanisms, architectural paradigms, as well as algorithmic and application-level perspectives. The paper thoroughly identifies current benchmarks, cutting-edge advancements, and critical gaps within the field to propose a roadmap for future research. By highlighting the latest developments and notable gaps in the literature, it offers practical insights for implementing frameworks for symbolic integration into LLMs to enhance transparency.
zh

[AI-14] DreamerV3-XP: Optimizing exploration through uncertainty estimation

【速读】:该论文旨在解决强化学习中探索效率低和学习收敛慢的问题,尤其是在稀疏奖励(sparse-reward)环境中。其解决方案的关键在于两个核心改进:一是引入基于回报(return)、重建损失(reconstruction loss)和价值误差(value error)的优先级重放缓冲区(prioritized replay buffer),以更高效地利用经验数据;二是设计一种基于集成世界模型对环境奖励预测不一致性的内在奖励(intrinsic reward),从而增强探索能力。实验表明,这些改进显著提升了学习速度并降低了动态模型损失,尤其在稀疏奖励任务中表现突出。

链接: https://arxiv.org/abs/2510.21418
作者: Lukas Bierling,Davide Pasero,Jan-Henrik Bertrand,Kiki Van Gerwen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce DreamerV3-XP, an extension of DreamerV3 that improves exploration and learning efficiency. This includes (i) a prioritized replay buffer, scoring trajectories by return, reconstruction loss, and value error, and (ii) an intrinsic reward based on disagreement over predicted environment rewards from an ensemble of world models. DreamerV3-XP is evaluated on a subset of Atari100k and DeepMind Control Visual Benchmark tasks, confirming the original DreamerV3 results and showing that our extensions lead to faster learning and lower dynamics model loss, particularly in sparse-reward settings.
zh
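
扩展 (i) 中“按回报、重建损失与价值误差为轨迹打分的优先级重放”可示意如下;打分权重与非负化方式为假设,并非官方实现:

```python
import numpy as np

def trajectory_priorities(returns, recon_loss, value_err,
                          weights=(1.0, 1.0, 1.0), eps=1e-3):
    """按回报、重建损失与价值误差的加权和打分,并归一化为采样概率。"""
    score = (weights[0] * np.asarray(returns)
             + weights[1] * np.asarray(recon_loss)
             + weights[2] * np.abs(value_err))
    score = score - score.min() + eps          # 平移保证概率非负
    return score / score.sum()

rng = np.random.default_rng(0)
n = 1000
p = trajectory_priorities(rng.normal(size=n),      # 回报
                          rng.random(n),           # 重建损失
                          rng.normal(size=n))      # 价值误差
batch_idx = rng.choice(n, size=16, p=p)            # 按优先级采样一个批次
print(batch_idx[:8])
```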

[AI-15] Large Language Models as Model Organisms for Human Associative Learning

【速读】:该论文旨在解决生物系统中表征变化机制难以验证的问题,尤其是关联学习(associative learning)如何重塑内部表征这一核心认知科学难题。其解决方案的关键在于利用大语言模型(LLMs)作为可扩展且可控的计算平台,借鉴认知神经科学中的关联学习范式,通过在上下文学习(in-context learning)框架下观察表征演化过程,发现了一种非单调性(non-monotonic)的表征分化模式,并进一步揭示了词汇干扰(vocabulary interference)——即新关联与现有词汇库的竞争关系——对表征差异化的调节作用。这表明表征变化不仅依赖于项目间的相似性,还受全局知识竞争的影响,从而为理解大脑记忆重组原则提供了新的计算模型和假设生成工具。

链接: https://arxiv.org/abs/2510.21408
作者: Camila Kolling,Vy Ai Vo,Mariya Toneva
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Associative learning–forming links between co-occurring items–is fundamental to human cognition, reshaping internal representations in complex ways. Testing hypotheses on how representational changes occur in biological systems is challenging, but large language models (LLMs) offer a scalable alternative. Building on LLMs’ in-context learning, we adapt a cognitive neuroscience associative learning paradigm and investigate how representations evolve across six models. Our initial findings reveal a non-monotonic pattern consistent with the Non-Monotonic Plasticity Hypothesis, with moderately similar items differentiating after learning. Leveraging the controllability of LLMs, we further show that this differentiation is modulated by the overlap of associated items with the broader vocabulary–a factor we term vocabulary interference, capturing how new associations compete with prior knowledge. We find that higher vocabulary interference amplifies differentiation, suggesting that representational change is influenced by both item similarity and global competition. Our findings position LLMs not only as powerful tools for studying representational dynamics in human-like learning systems, but also as accessible and general computational models for generating new hypotheses about the principles underlying memory reorganization in the brain.
zh

[AI-16] REvolution: An Evolutionary Framework for RTL Generation driven by Large Language Models

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在寄存器传输级(Register-Transfer Level, RTL)代码生成中面临的两大核心问题:功能正确性不足与功耗、性能和面积(Power, Performance, and Area, PPA)优化能力有限。传统基于迭代反馈的方法虽能部分缓解这些问题,但受限于局部搜索策略,难以找到全局最优解。其解决方案的关键在于提出REvolution框架,该框架将进化计算(Evolutionary Computation, EC)与LLMs相结合,通过并行演化候选设计群体实现全局探索;其中引入双种群算法分别处理错误修复(Fail组)与PPA优化(Success组),并设计自适应机制动态调整提示策略的选择概率以提升搜索效率,从而显著提升RTL生成的正确率与设计质量。

链接: https://arxiv.org/abs/2510.21407
作者: Kyungjun Min,Kyumin Cho,Junhwan Jang,Seokhyeong Kang
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Accepted for publication at the 2026 Asia and South Pacific Design Automation Conference (ASP-DAC)

点击查看摘要

Abstract:Large Language Models (LLMs) are used for Register-Transfer Level (RTL) code generation, but they face two main challenges: functional correctness and Power, Performance, and Area (PPA) optimization. Iterative, feedback-based methods partially address these, but they are limited to local search, hindering the discovery of a global optimum. This paper introduces REvolution, a framework that combines Evolutionary Computation (EC) with LLMs for automatic RTL generation and optimization. REvolution evolves a population of candidates in parallel, each defined by a design strategy, RTL implementation, and evaluation feedback. The framework includes a dual-population algorithm that divides candidates into Fail and Success groups for bug fixing and PPA optimization, respectively. An adaptive mechanism further improves search efficiency by dynamically adjusting the selection probability of each prompt strategy according to its success rate. Experiments on the VerilogEval and RTLLM benchmarks show that REvolution increased the initial pass rate of various LLMs by up to 24.0 percentage points. The DeepSeek-V3 model achieved a final pass rate of 95.5%, comparable to state-of-the-art results, without the need for separate training or domain-specific tools. Additionally, the generated RTL designs showed significant PPA improvements over reference designs. This work introduces a new RTL design approach by combining LLMs’ generative capabilities with EC’s broad search power, overcoming the local-search limitations of previous methods.
zh

[AI-17] Boosting Accuracy and Efficiency of Budget Forcing in LLM s via Reinforcement Learning for Mathematical Reasoning ECAI

【速读】:该论文旨在解决预算强制(budget forcing)方法在小规模模型上因依赖长上下文推理轨迹的监督微调(SFT)而导致性能下降的问题,特别是由冗长响应引发的token效率低下。其解决方案的关键在于引入强化学习(Reinforcement Learning, RL)框架,通过极少量(仅1.5K样本)的训练数据对模型进行优化,从而在保持推理能力的同时显著提升token使用效率。实验表明,该SFT+RL联合模型在GSM8K数学推理任务中,在不同计算预算下均实现了更高准确率,并将token消耗减少超过40%,有效弥补了长上下文训练带来的性能损失,提升了小模型在数学推理中的表现。

链接: https://arxiv.org/abs/2510.21398
作者: Ravindra Aribowo Tarunokusumo,Rafael Fernandes Cunha
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Submitted to the European Conference on Artificial Intelligence (ECAI)

点击查看摘要

Abstract:Test-time scaling methods have seen a rapid increase in popularity for its computational efficiency and parameter-independent training to improve reasoning performance on Large Language Models. One such method is called budget forcing, a decoding intervention strategy which allocates extra compute budget for thinking and elicits the inherent self-correcting behavior of the model. However, this relies on supervised fine-tuning (SFT) on long-context reasoning traces which causes performance degradation on smaller models due to verbose responses. For this reason, we offer a framework integrating reinforcement learning (RL) to improve token efficiency and boost the performance of a 1.5B model for mathematical reasoning. We demonstrate this using only 1.5K training samples and found that our SFT+RL model performed better on the GSM8K dataset with varying compute budgets. Our main findings showed an overall higher accuracy while significantly reducing its token usage by over 40% compared to the SFT model, revealing how RL can recover the losses due to long-context training and altogether improving performance in mathematical reasoning.
zh
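
“预算强制”作为一种解码干预,其核心逻辑是在思考预算未用满时抑制结束、追加 “Wait” 促使模型继续推理。下面是一个与具体推理后端无关的草图,其中 generate_until 与 THINK_END 均为假设的接口与标记:

```python
THINK_END = "</think>"   # 假设的“思考结束”标记

def budget_forcing(prompt, generate_until, min_budget=512):
    """在已用 token 数达到预算前不允许结束思考,而是追加 “Wait” 继续推理。"""
    text, used = "", 0
    while True:
        chunk, n_tokens = generate_until(prompt + text, stop=THINK_END)
        text += chunk
        used += n_tokens
        if used >= min_budget:
            return text + THINK_END       # 预算用满,允许收尾
        text += "Wait"                    # 强制继续思考,诱发自我纠错

def mock_generate(prompt, stop):          # 演示用的假解码器
    return " …一段推理… ", 128

out = budget_forcing("问题:27 是质数吗?", mock_generate, min_budget=256)
print(out)
```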

[AI-18] Assessing the Real-World Utility of Explainable AI for Arousal Diagnostics: An Application-Grounded User Study

【速读】:该论文旨在解决生成式 AI (Generative AI) 在临床实践中的可信集成问题,即如何使医生能够准确判断何时以及为何信任算法推荐,从而提升人机协作的效率与准确性。其关键解决方案是采用基于应用场景的用户研究,对比黑箱(Black-box, BB)与透明白箱(White-box, WB)AI辅助模式下,不同介入时机(从开始即辅助或作为事后质量控制QC)对睡眠事件识别性能、时间消耗及用户体验的影响。结果表明,以目标性QC方式部署透明AI可显著提升事件级准确率(较黑箱提高约30%),并降低评分者间变异性,同时兼顾临床实用性,为实现高可信度的人工智能辅助诊断提供了可行路径。

链接: https://arxiv.org/abs/2510.21389
作者: Stefan Kraft,Andreas Theissler,Vera Wienhausen-Wilke,Gjergji Kasneci,Hendrik Lensch
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Artificial intelligence (AI) systems increasingly match or surpass human experts in biomedical signal interpretation. However, their effective integration into clinical practice requires more than high predictive accuracy. Clinicians must discern when and why to trust algorithmic recommendations. This work presents an application-grounded user study with eight professional sleep medicine practitioners, who score nocturnal arousal events in polysomnographic data under three conditions: (i) manual scoring, (ii) black-box (BB) AI assistance, and (iii) transparent white-box (WB) AI assistance. Assistance is provided either from the start of scoring or as a post-hoc quality-control (QC) review. We systematically evaluate how the type and timing of assistance influence event-level and clinically most relevant count-based performance, time requirements, and user experience. When evaluated against the clinical standard used to train the AI, both AI and human-AI teams significantly outperform unaided experts, with collaboration also reducing inter-rater variability. Notably, transparent AI assistance applied as a targeted QC step yields median event-level performance improvements of approximately 30% over black-box assistance, and QC timing further enhances count-based outcomes. While WB and QC approaches increase the time required for scoring, start-time assistance is faster and preferred by most participants. Participants overwhelmingly favor transparency, with seven out of eight expressing willingness to adopt the system with minor or no modifications. In summary, strategically timed transparent AI assistance effectively balances accuracy and clinical efficiency, providing a promising pathway toward trustworthy AI integration and user acceptance in clinical workflows.
zh

[AI-19] α-LoRA: Effective Fine-Tuning via Base Model Rescaling

【速读】:该论文旨在解决预训练模型在微调(fine-tuning)过程中泛化能力不足的问题,尤其是在数据样本有限的情况下。其解决方案的关键在于提出了一类新的重参数化方法(reparameterization methods),通过在冻结的权重矩阵上叠加一个可训练的低秩矩阵来更新目标模块,从而增强模型在目标任务上的泛化性能。该方法在高维二分类场景中借助随机矩阵理论(Random Matrix Theory)进行了理论证明,并在真实场景下验证了其有效性,如对大语言模型(LLM)进行微调时的表现提升。

链接: https://arxiv.org/abs/2510.21345
作者: Aymane El Firdoussi,El Mahdi Chayti,Mohamed El Amine Seddik,Martin Jaggi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Fine-tuning has proven to be highly effective in adapting pre-trained models to perform better on new desired tasks with minimal data samples. Among the most widely used approaches are reparameterization methods, which update a target module by augmenting its frozen weight matrix with an additional trainable weight matrix. The most prominent example is Low Rank Adaption (LoRA), which gained significant attention in recent years. In this paper, we introduce a new class of reparameterization methods for transfer learning, designed to enhance the generalization ability of fine-tuned models. We establish the effectiveness of our approach in a high-dimensional binary classification setting using tools from Random Matrix Theory, and further validate our theoretical findings through more realistic experiments, such as fine-tuning LLMs.
zh
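
To make the reparameterization concrete, below is a minimal numpy sketch of a LoRA-style layer with an extra scaling factor `alpha` on the frozen base path; reading the title this way is our assumption, not the authors' released code.

```python
import numpy as np

class AlphaLoRALinear:
    """Frozen base weights rescaled by alpha, plus a trainable low-rank update."""
    def __init__(self, d_in, d_out, rank, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W0 = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # frozen
        self.A = rng.standard_normal((rank, d_in)) * 0.01             # trainable
        self.B = np.zeros((d_out, rank))                              # trainable, zero-init
        self.alpha = alpha                                            # rescales the base model

    def forward(self, x):
        return self.alpha * (self.W0 @ x) + self.B @ (self.A @ x)

layer = AlphaLoRALinear(d_in=16, d_out=8, rank=2, alpha=0.9)
y = layer.forward(np.ones(16))            # (8,) output
```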

[AI-20] World-POI: Global Point-of-Interest Data Enriched from Foursquare and OpenStreetMap as Tabular and Graph Data

【速读】:该论文旨在解决当前商业兴趣点(Point of Interest, POI)数据集在完整性与真实性方面的局限性问题:Foursquare 提供了覆盖广泛的商业 POI 基线,但存在元数据缺失或虚构地点的问题;而 OpenStreetMap(OSM)虽具备详尽且高频更新的元数据,却缺乏对 POI 是否真实存在的验证机制。解决方案的关键在于通过记录链接(record linkage)技术,融合二者优势——利用名称相似度和空间距离计算匹配得分,筛选出高置信度的对应关系,从而构建一个既涵盖广泛商业实体、又具备丰富属性信息的增强型 POI 数据集。该方法最终生成约 1 TB 的整合数据,并提供可调节阈值的过滤版本及完整构建流程,支持多领域空间分析与下游应用。

链接: https://arxiv.org/abs/2510.21342
作者: Hossein Amiri,Mohammad Hashemi,Andreas Züfle
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Recently, Foursquare released a global dataset with more than 100 million points of interest (POIs), each representing a real-world business on its platform. However, many entries lack complete metadata such as addresses or categories, and some correspond to non-existent or fictional locations. In contrast, OpenStreetMap (OSM) offers a rich, user-contributed POI dataset with detailed and frequently updated metadata, though it does not formally verify whether a POI represents an actual business. In this data paper, we present a methodology that integrates the strengths of both datasets: Foursquare as a comprehensive baseline of commercial POIs and OSM as a source of enriched metadata. The combined dataset totals approximately 1 TB. While this full version is not publicly released, we provide filtered releases with adjustable thresholds that reduce storage needs and make the data practical to download and use across domains. We also provide step-by-step instructions to reproduce the full 631 GB build. Record linkage is achieved by computing name similarity scores and spatial distances between Foursquare and OSM POIs. These measures identify and retain high-confidence matches that correspond to real businesses in Foursquare, have representations in OSM, and show strong name similarity. Finally, we use this filtered dataset to construct a graph-based representation of POIs enriched with attributes from both sources, enabling advanced spatial analyses and a range of downstream applications.
zh
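
The name-similarity and spatial-distance matching described in the abstract can be sketched as follows; the similarity measure and both thresholds are illustrative assumptions, not the values used to build World-POI.

```python
from difflib import SequenceMatcher
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def is_match(fsq, osm, name_thresh=0.85, dist_thresh_m=50.0):
    """High-confidence match: similar names and nearby coordinates."""
    name_sim = SequenceMatcher(None, fsq["name"].lower(), osm["name"].lower()).ratio()
    dist = haversine_m(fsq["lat"], fsq["lon"], osm["lat"], osm["lon"])
    return name_sim >= name_thresh and dist <= dist_thresh_m
```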

[AI-21] CausalRec: A CausalBoost Attention Model for Sequential Recommendation

【速读】:该论文旨在解决现有基于相关性的序列推荐系统在仅依赖物品共现关系时,忽视用户行为背后潜在动机的问题,从而导致虚假相关性并影响推荐准确性。其解决方案的关键在于提出一种融合因果注意力机制的框架 CausalRec,通过引入因果发现模块(causal discovery block)学习用户行为序列中的因果图结构,并基于该结构设计因果增强模块(CausalBooster)来优化注意力机制,优先关注具有因果意义的行为路径,从而提升推荐的准确性和可靠性。

链接: https://arxiv.org/abs/2510.21333
作者: Yunbo Hou,Tianle Yang,Ruijie Li,Li He,Liang Wang,Weiping Li,Bo Zheng,Guojie Song
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 11 pages, 3 figures

点击查看摘要

Abstract:Recent advances in correlation-based sequential recommendation systems have demonstrated substantial success. Specifically, the attention-based model outperforms other RNN-based and Markov chains-based models by capturing both short- and long-term dependencies more effectively. However, solely focusing on item co-occurrences overlooks the underlying motivations behind user behaviors, leading to spurious correlations and potentially inaccurate recommendations. To address this limitation, we present a novel framework that integrates causal attention for sequential recommendation, CausalRec. It incorporates a causal discovery block and a CausalBooster. The causal discovery block learns the causal graph in user behavior sequences, and we provide a theory to guarantee the identifiability of the learned causal graph. The CausalBooster utilizes the discovered causal graph to refine the attention mechanism, prioritizing behaviors with causal significance. Experimental evaluations on real-world datasets indicate that CausalRec outperforms several state-of-the-art methods, with average improvements of 7.21% in Hit Rate (HR) and 8.65% in Normalized Discounted Cumulative Gain (NDCG). To the best of our knowledge, this is the first model to incorporate causality through the attention mechanism in sequential recommendation, demonstrating the value of causality in generating more accurate and reliable recommendations.
zh
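
One way to read "using the discovered causal graph to refine the attention mechanism" is to bias the attention logits by causal edge weights. A toy numpy sketch under that assumption (not the paper's actual CausalBooster):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_boosted_attention(Q, K, V, causal_adj, boost=1.0):
    """Adds (boost * causal edge weight) to each attention logit."""
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    logits = logits + boost * causal_adj     # prioritize causally significant behaviors
    return softmax(logits, axis=-1) @ V
```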

[AI-22] Weak-to-Strong Generalization under Distribution Shifts NEURIPS2025

【速读】:该论文旨在解决弱模型(weak models)在分布偏移(distribution shifts)条件下对强模型(strong models)进行监督时性能下降的问题,即传统弱到强(weak-to-strong)泛化方法在分布外(out-of-distribution)场景下失效的现象。解决方案的关键在于提出 RAVEN 框架,该框架通过动态学习弱模型的最优组合权重与强模型参数协同优化,从而增强监督信号的鲁棒性;实验表明,RAVEN 能自动赋予更准确的弱模型更高权重,并在分布外任务上性能提升超过 30%,同时保持或超越现有方法在分布内任务上的表现。

链接: https://arxiv.org/abs/2510.21332
作者: Myeongho Jeon,Jan Sobotka,Suhwan Choi,Maria Brbić
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Accepted to NeurIPS 2025

点击查看摘要

Abstract:As future superhuman models become increasingly complex, accurately supervising their behavior may exceed human capabilities. Recent works have demonstrated that in such scenarios, weak models can effectively supervise strong models, a phenomenon known as weak-to-strong generalization. However, we find that naive weak-to-strong generalization fails under distribution shifts, often leading to worse performance of the strong model than its weak supervisors. To address this, we propose RAVEN, a robust weak-to-strong generalization framework that dynamically learns the optimal combinations of weak models in addition to parameters of the strong model. We demonstrate the effectiveness of RAVEN on image classification, text classification, and preference alignment tasks. RAVEN outperforms alternative baselines by over 30% on out-of-distribution tasks while matching or surpassing existing methods on in-distribution tasks. Moreover, our results show that RAVEN assigns higher weights to more accurate weak models, demonstrating its ability to automatically identify trustworthy supervision.
zh
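
A toy version of the intuition that accurate weak supervisors should receive larger weights: re-weight each weak model by its agreement with the strong model's current predictions. This EM-style heuristic is our simplification; RAVEN jointly optimizes the weights with the strong model's parameters.

```python
import numpy as np

def reweight_weak_models(weak_probs, strong_probs, temp=1.0):
    """weak_probs: (k, n, c) for k weak models; strong_probs: (n, c)."""
    agreement = (weak_probs * strong_probs[None]).sum(-1).mean(-1)   # (k,)
    w = np.exp(agreement / temp)
    return w / w.sum()                                               # softmax over models

weak = np.random.default_rng(0).dirichlet(np.ones(3), size=(4, 10))  # 4 weak models
strong = np.random.default_rng(1).dirichlet(np.ones(3), size=10)
print(reweight_weak_models(weak, strong))
```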

[AI-23] CXRAgent: Director-Orchestrated Multi-Stage Reasoning for Chest X-Ray Interpretation

【速读】:该论文旨在解决当前胸部X光(CXR)自动解读模型在面对新诊断任务和复杂推理场景时适应性差、可信度低的问题。现有基于大语言模型(LLM)的代理模型虽能通过工具协同与多步推理提升性能,但普遍依赖单一诊断流程且缺乏对工具可靠性的评估机制,限制了其灵活性与临床可信度。解决方案的关键在于提出CXRAgent,一个由“导演”(director)协调的多阶段代理系统:首先通过证据驱动验证器(Evidence-driven Validator, EDV)对工具输出进行归一化与视觉证据校验,确保可靠性;其次根据任务需求动态构建专家团队并规划诊断路径,实现自适应协作推理;最后融合专家意见与上下文记忆生成有依据的诊断结论,从而显著提升模型在多样化临床任务中的泛化能力与可解释性。

链接: https://arxiv.org/abs/2510.21324
作者: Jinhui Lou,Yan Yang,Zhou Yu,Zhenqi Fu,Weidong Han,Qingming Huang,Jun Yu
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 10 pages, 4 figures, 7 Tables

点击查看摘要

Abstract:Chest X-ray (CXR) plays a pivotal role in clinical diagnosis, and a variety of task-specific and foundation models have been developed for automatic CXR interpretation. However, these models often struggle to adapt to new diagnostic tasks and complex reasoning scenarios. Recently, LLM-based agent models have emerged as a promising paradigm for CXR analysis, enhancing model capability through tool coordination, multi-step reasoning, and team collaboration. However, existing agents often rely on a single diagnostic pipeline and lack mechanisms for assessing tools' reliability, limiting their adaptability and credibility. To this end, we propose CXRAgent, a director-orchestrated, multi-stage agent for CXR interpretation, where a central director coordinates the following stages: (1) Tool Invocation: The agent strategically orchestrates a set of CXR-analysis tools, with outputs normalized and verified by the Evidence-driven Validator (EDV), which grounds diagnostic outputs with visual evidence to support reliable downstream diagnosis; (2) Diagnostic Planning: Guided by task requirements and intermediate findings, the agent formulates a targeted diagnostic plan. It then assembles an expert team accordingly, defining member roles and coordinating their interactions to enable adaptive and collaborative reasoning; (3) Collaborative Decision-making: The agent integrates insights from the expert team with accumulated contextual memories, synthesizing them into an evidence-backed diagnostic conclusion. Experiments on various CXR interpretation tasks show that CXRAgent delivers strong performance, provides visual evidence, and generalizes well to clinical tasks of varying complexity. Code and data are available at this https URL.
zh

[AI-24] Seemingly Redundant Modules Enhance Robust Odor Learning in Fruit Flies NEURIPS

【速读】:该论文旨在解决生物电路中看似冗余的模块(如侧向抑制 lateral inhibition, LI 和神经元尖峰频率适应 spike frequency adaptation, SFA)在复杂环境噪声条件下如何协同作用以实现最优气味识别的问题。解决方案的关键在于构建一个飞蝇嗅觉回路的计算模型,通过模拟不同噪声水平下的气味辨别任务,发现LI主要在低至中等噪声环境中提升辨别能力,而SFA则在所有噪声水平下均能稳定改善辨别性能;二者结合时可实现最优的模式分离效果,表明这些机制并非冗余,而是根据环境噪声动态分工协作,从而保障学习性能在复杂情境下的最优性。

链接: https://arxiv.org/abs/2510.21315
作者: Haiyang Li,Liao Yu,Qiang Yu,Yunliang Zang
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, accepted by NeurIPS

点击查看摘要

Abstract:Biological circuits have evolved to incorporate multiple modules that perform similar functions. In the fly olfactory circuit, both lateral inhibition (LI) and neuronal spike frequency adaptation (SFA) are thought to enhance pattern separation for odor learning. However, it remains unclear whether these mechanisms play redundant or distinct roles in this process. In this study, we present a computational model of the fly olfactory circuit to investigate odor discrimination under varying noise conditions that simulate complex environments. Our results show that LI primarily enhances odor discrimination in low- and medium-noise scenarios, but this benefit diminishes and may reverse under higher-noise conditions. In contrast, SFA consistently improves discrimination across all noise levels. LI is preferentially engaged in low- and medium-noise environments, whereas SFA dominates in high-noise settings. When combined, these two sparsification mechanisms enable optimal discrimination performance. This work demonstrates that seemingly redundant modules in biological circuits can, in fact, be essential for achieving optimal learning in complex contexts.
zh

[AI-25] A Convergence Analysis of Adaptive Optimizers under Floating-point Quantization

【速读】:该论文旨在解决低精度训练(low-precision training)中适应性优化器(如Adam和Muon)的收敛性理论缺失问题,即现有理论假设所有计算组件均为精确值,忽略了硬件感知的量化(quantization)影响,从而无法解释为何在实际应用中低精度训练仍能保持有效性。其解决方案的关键在于构建了首个针对自适应优化器在梯度、权重及优化器状态(如动量估计)浮点量化下的收敛性分析框架;在此基础上,证明了在尾数长度(mantissa length)仅随迭代次数对数增长的条件下,算法仍可保持接近全精度版本的收敛速率,并揭示了Adam对权重和二阶矩量化敏感而Muon对误差控制要求更弱,因而更具鲁棒性的本质差异。

链接: https://arxiv.org/abs/2510.21314
作者: Xuan Tang,Jichu Li,Difan Zou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 65 pages, 10 figures

点击查看摘要

Abstract:The rapid scaling of large language models (LLMs) has made low-precision training essential for reducing memory, improving efficiency, and enabling larger models and datasets. Existing convergence theories for adaptive optimizers, however, assume all components are exact and neglect hardware-aware quantization, leaving open the question of why low-precision training remains effective. We introduce the first theoretical framework for analyzing the convergence of adaptive optimizers, including Adam and Muon, under floating-point quantization of gradients, weights, and optimizer states (e.g., moment estimates). Within this framework, we derive convergence rates on smooth non-convex objectives under standard stochastic gradient assumptions, explicitly characterizing how quantization errors from different components affect convergence. We show that both algorithms retain rates close to their full-precision counterparts provided mantissa length scales only logarithmically with the number of iterations. Our analysis further reveals that Adam is highly sensitive to weights and second-moment quantization due to its reliance on \beta_2 \to 1 , while Muon requires weaker error control and is thus potentially more robust. These results narrow the gap between empirical success and theoretical understanding of low-precision training methods. Numerical experiments on synthetic and real-world data corroborate our theory.
zh
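
The notion of mantissa-limited floating-point quantization at the heart of the analysis can be simulated in a few lines (round-to-nearest; the helper name is ours):

```python
import math

def quantize_fp(x, mantissa_bits):
    """Round x to a float with the given number of mantissa bits."""
    if x == 0.0:
        return 0.0
    exp = math.floor(math.log2(abs(x)))
    scale = 2.0 ** (exp - mantissa_bits)
    return round(x / scale) * scale

# e.g. Adam's second-moment estimate kept in low precision:
print(quantize_fp(3.14159265, mantissa_bits=4))   # coarse: 3.125
print(quantize_fp(3.14159265, mantissa_bits=10))  # error shrinks as bits grow
```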

[AI-26] Towards Reliable Code-as-Policies: A Neuro-Symbolic Framework for Embodied Task Planning NEURIPS2025

【速读】:该论文旨在解决基于大语言模型(Large Language Models, LLMs)的代码作为策略(Code-as-Policies)方法在具身智能体(如机器人)任务规划与控制中因环境感知不足而导致的代码生成不准确或不完整问题,尤其是在动态或部分可观测环境中,这会显著降低任务成功率。解决方案的关键在于提出一种神经符号(Neuro-Symbolic)具身任务规划框架,通过在代码生成过程中引入显式的符号验证(symbolic verification)和交互式验证(interactive validation)机制,使生成的探索性代码能够主动与环境交互以获取缺失观测信息,同时保持任务相关状态不变,从而增强代码的环境接地性(environmental grounding),最终提升复杂环境下任务执行的可靠性与成功率。

链接: https://arxiv.org/abs/2510.21302
作者: Sanghyun Ahn,Wonje Choi,Junyong Lee,Jinwoo Park,Honguk Woo
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted at NeurIPS 2025 Spotlight

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have enabled the automatic generation of executable code for task planning and control in embodied agents such as robots, demonstrating the potential of LLM-based embodied intelligence. However, these LLM-based code-as-policies approaches often suffer from limited environmental grounding, particularly in dynamic or partially observable settings, leading to suboptimal task success rates due to incorrect or incomplete code generation. In this work, we propose a neuro-symbolic embodied task planning framework that incorporates explicit symbolic verification and interactive validation processes during code generation. In the validation phase, the framework generates exploratory code that actively interacts with the environment to acquire missing observations while preserving task-relevant states. This integrated process enhances the grounding of generated code, resulting in improved task reliability and success rates in complex environments. We evaluate our framework on RLBench and in real-world settings across dynamic, partially observable scenarios. Experimental results demonstrate that our framework improves task success rates by 46.2% over Code-as-Policies baselines and attains over 86.8% executability of task-relevant actions, thereby enhancing the reliability of task planning in dynamic environments.
zh

[AI-27] Understanding AI Trustworthiness: A Scoping Review of AIES & FAccT Articles

【速读】:该论文试图解决当前可信人工智能(Trustworthy AI)研究中过度聚焦技术属性而忽视社会技术维度的问题,旨在厘清人工智能伦理领域两大会议(AIES 和 FAccT)如何定义、测量与验证AI可信性,并识别其中的关键差距与机遇。其解决方案之关键在于推动跨学科方法的整合,将技术严谨性与社会、文化及制度因素相结合,构建能够真实反映AI系统与社会复杂互动关系的综合框架,从而促进负责任的技术发展,惠及所有利益相关方。

链接: https://arxiv.org/abs/2510.21293
作者: Siddharth Mehrotra,Jin Huang,Xuelong Fu,Roel Dobbe,Clara I. Sánchez,Maarten de Rijke
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Submitted to Journal of Artificial Intelligence Research (JAIR)

点击查看摘要

Abstract:Background: Trustworthy AI serves as a foundational pillar for two major AI ethics conferences: AIES and FAccT. However, current research often adopts techno-centric approaches, focusing primarily on technical attributes such as reliability, robustness, and fairness, while overlooking the sociotechnical dimensions critical to understanding AI trustworthiness in real-world contexts. Objectives: This scoping review aims to examine how the AIES and FAccT communities conceptualize, measure, and validate AI trustworthiness, identifying major gaps and opportunities for advancing a holistic understanding of trustworthy AI systems. Methods: We conduct a scoping review of AIES and FAccT conference proceedings to date, systematically analyzing how trustworthiness is defined, operationalized, and applied across different research domains. Our analysis focuses on conceptualization approaches, measurement methods, verification and validation techniques, application areas, and underlying values. Results: While significant progress has been made in defining technical attributes such as transparency, accountability, and robustness, our findings reveal critical gaps. Current research often predominantly emphasizes technical precision at the expense of social and ethical considerations. The sociotechnical nature of AI systems remains less explored and trustworthiness emerges as a contested concept shaped by those with the power to define it. Conclusions: An interdisciplinary approach combining technical rigor with social, cultural, and institutional considerations is essential for advancing trustworthy AI. We propose actionable measures for the AI ethics community to adopt holistic frameworks that genuinely address the complex interplay between AI systems and society, ultimately promoting responsible technological development that benefits all stakeholders.
zh

[AI-28] Investigating Scale Independent UCT Exploration Factor Strategies

【速读】:该论文旨在解决Upper Confidence Bounds For Trees (UCT)算法对游戏奖励尺度敏感的问题,尤其在奖励密集且人工设定的奖励尺度不一致的游戏场景中,节点Q值会因奖励尺度差异而呈现不同量级,从而影响搜索效率和性能。解决方案的关键在于提出一系列自适应选择探索常数λ(即λ-策略)的方法,其中推荐使用一种新提出的策略:将λ设为搜索树中所有状态-动作对Q值经验标准差σ的2倍(即λ = 2·σ),该方法在多个任务中均表现出优于现有策略的性能,且对单一参数值具有鲁棒性,并能在优化所有可用参数时达到更高峰值表现。

链接: https://arxiv.org/abs/2510.21275
作者: Robin Schmöcker,Christoph Schnell,Alexander Dockhorn
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Upper Confidence Bounds For Trees (UCT) algorithm is not agnostic to the reward scale of the game it is applied to. For zero-sum games with the sparse rewards of \{-1, 0, 1\} at the end of the game, this is not a problem, but many games feature dense rewards with hand-picked reward scales, causing a node's Q-value to span different magnitudes across different games. In this paper, we evaluate various strategies for adaptively choosing the UCT exploration constant \lambda , called \lambda -strategies, that are agnostic to the game's reward scale. These \lambda -strategies include those proposed in the literature as well as five new strategies. Given our experimental results, we recommend using one of our newly suggested \lambda -strategies, which is to choose \lambda as 2 \cdot \sigma where \sigma is the empirical standard deviation of all state-action pairs' Q-values in the search tree. This method outperforms existing \lambda -strategies across a wide range of tasks, both in terms of a single parameter value and the peak performances obtained by optimizing all available parameters.
zh
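
The recommended strategy is straightforward to implement inside the selection step of MCTS. A minimal sketch (the dict-based node representation is ours):

```python
import math

def select_child(children, total_visits, all_q_values):
    """UCT selection with lambda = 2 * empirical std of all Q-values in the tree."""
    mean = sum(all_q_values) / len(all_q_values)
    var = sum((q - mean) ** 2 for q in all_q_values) / len(all_q_values)
    lam = 2.0 * math.sqrt(var)                      # scale-independent exploration factor
    def uct(c):
        return c["q"] + lam * math.sqrt(math.log(total_visits) / c["visits"])
    return max(children, key=uct)
```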

[AI-29] Out-of-Distribution Detection for Safety Assurance of AI and Autonomous Systems

【速读】:该论文旨在解决自主系统(autonomous systems)在安全关键领域中因面对分布外(out-of-distribution, OoD)数据而导致的安全保障难题。其核心问题是:如何在系统全生命周期内,通过可靠的方法学手段识别和应对OoD数据,从而提升系统的安全性与可信度。解决方案的关键在于系统性地梳理和整合适用于机器学习(ML)开发全流程的OoD检测技术,并明确这些技术在不同阶段(如训练、验证、部署及运行监控)对构建安全论证(safety assurance arguments)的支持作用,同时指出工程实践中需注意的技术局限与集成风险。

链接: https://arxiv.org/abs/2510.21254
作者: Victoria J. Hodge,Colin Paterson,Ibrahim Habli
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The operational capabilities and application domains of AI-enabled autonomous systems have expanded significantly in recent years due to advances in robotics and machine learning (ML). Demonstrating the safety of autonomous systems rigorously is critical for their responsible adoption, but it is challenging as it requires robust methodologies that can handle novel and uncertain situations throughout the system lifecycle, including detecting out-of-distribution (OOD) data. Thus, OOD detection is receiving increased attention from the research, development and safety engineering communities. This comprehensive review analyses OOD detection techniques within the context of safety assurance for autonomous systems, in particular in safety-critical domains. We begin by defining the relevant concepts, investigating what causes OOD and exploring the factors which make the safety assurance of autonomous systems and OOD detection challenging. Our review identifies a range of techniques which can be used throughout the ML development lifecycle and we suggest areas within the lifecycle in which they may be used to support safety assurance arguments. We discuss a number of caveats that system and safety engineers must be aware of when integrating OOD detection into system lifecycles. We conclude by outlining the challenges and future work necessary for the safe development and operation of autonomous systems across a range of domains and applications.
zh

[AI-30] OutboundEval: A Dual-Dimensional Benchmark for Expert-Level Intelligent Outbound Evaluation of Xbench's Professional-Aligned Series

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在专家级智能外呼场景中评估存在的三大局限:数据集多样性与类别覆盖不足、用户模拟不真实以及评价指标不准确。其核心解决方案在于构建一个结构化的评估框架——OutboundEval,关键包括三点:一是设计涵盖六大业务领域和30个代表性子场景的基准测试体系,采用场景特异性流程分解、加权评分与领域自适应指标;二是开发基于大模型驱动的用户模拟器(User Simulator),生成具有丰富人格特征、真实行为模式、情绪波动和沟通风格的虚拟用户,实现可控且真实的测试环境;三是引入动态评估方法,融合自动化与人工介入评估机制,以衡量任务执行准确性、专业知识应用能力、适应性及用户体验质量,从而全面刻画LLM在专业外呼场景中的性能表现。

链接: https://arxiv.org/abs/2510.21244
作者: Pengyu Xu,Shijia Li,Ao Sun,Feng Zhang,Yahan Li,Bo Wu,Zhanyu Ma,Jiguo Li,Jun Xu,Jiuchong Gao,Jinghua Hao,Renqing He,Rui Wang,Yang Liu,Xiaobo Hu,Fan Yang,Jia Zheng,Guanghua Yao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose OutboundEval, a comprehensive benchmark for evaluating large language models (LLMs) in expert-level intelligent outbound calling scenarios. Unlike existing methods that suffer from three key limitations - insufficient dataset diversity and category coverage, unrealistic user simulation, and inaccurate evaluation metrics - OutboundEval addresses these issues through a structured framework. First, we design a benchmark spanning six major business domains and 30 representative sub-scenarios, each with scenario-specific process decomposition, weighted scoring, and domain-adaptive metrics. Second, we develop a large-model-driven User Simulator that generates diverse, persona-rich virtual users with realistic behaviors, emotional variability, and communication styles, providing a controlled yet authentic testing environment. Third, we introduce a dynamic evaluation method that adapts to task variations, integrating automated and human-in-the-loop assessment to measure task execution accuracy, professional knowledge application, adaptability, and user experience quality. Experiments on 12 state-of-the-art LLMs reveal distinct trade-offs between expert-level task completion and interaction fluency, offering practical insights for building reliable, human-like outbound AI systems. OutboundEval establishes a practical, extensible, and domain-oriented standard for benchmarking LLMs in professional applications.
zh

[AI-31] Physics-Informed Neural Networks for MIMO Beam Map and Environment Reconstruction

【速读】:该论文旨在解决在缺乏显式三维环境信息的情况下,如何利用信道状态信息(CSI)中的接收信号强度(RSS)数据,联合构建多输入多输出(MIMO)系统的无线电波束图(beam map)与环境几何结构的问题。其关键解决方案是提出一种面向方向的虚拟障碍物模型(oriented virtual obstacle model),该模型能够同时捕捉遮挡(blockage)和反射(reflection)的几何特征,并通过定义反射区(reflective zone)来识别与环境几何关系相关的反射路径;进一步推导出反射区的解析表达式并分析其几何特性,从而设计出更适配深度学习表示的重构形式,最终构建一个融合物理先验知识的物理信息深度学习框架(physics-informed deep learning framework),实现对遮挡、反射、散射分量及波束模式的联合学习,显著提升MIMO波束图重建精度(32%-48%)。

链接: https://arxiv.org/abs/2510.21238
作者: Wangqian Chen,Junting Chen,Shuguang Cui
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:As communication networks evolve towards greater complexity (e.g., 6G and beyond), a deep understanding of the wireless environment becomes increasingly crucial. When explicit knowledge of the environment is unavailable, geometry-aware feature extraction from channel state information (CSI) emerges as a pivotal methodology to bridge physical-layer measurements with network intelligence. This paper proposes to explore the received signal strength (RSS) data, without explicit 3D environment knowledge, to jointly construct the radio beam map and environmental geometry for a multiple-input multiple-output (MIMO) system. Unlike existing methods that only learn blockage structures, we propose an oriented virtual obstacle model that captures the geometric features of both blockage and reflection. Reflective zones are formulated to identify relevant reflected paths according to the geometry relation of the environment. We derive an analytical expression for the reflective zone and further analyze its geometric characteristics to develop a reformulation that is more compatible with deep learning representations. A physics-informed deep learning framework that incorporates the reflective-zone-based geometry model is proposed to learn the blockage, reflection, and scattering components, along with the beam pattern, which leverages physics prior knowledge to enhance network transferability. Numerical experiments demonstrate that, in addition to reconstructing the blockage and reflection geometry, the proposed model can construct a more accurate MIMO beam map with a 32%-48% accuracy improvement.
zh

[AI-32] Securing AI Agent Execution

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)作为AI代理与外部工具和环境交互时,因Model Context Protocol (MCP)服务器缺乏有效访问控制机制而导致的安全风险问题。当前数千个MCP服务器以无限制权限运行,构成广泛攻击面。解决方案的关键在于提出AgentBound——首个面向MCP服务器的访问控制框架,其核心由两部分组成:一是受Android权限模型启发的声明式策略机制,用于定义资源访问规则;二是无需修改MCP服务器即可实施策略的策略执行引擎,可有效阻断恶意行为。实验表明,该框架能自动从源代码生成80.9%准确率的访问控制策略,并在多个恶意MCP服务器中成功拦截安全威胁,同时引入的性能开销可忽略不计。

链接: https://arxiv.org/abs/2510.21236
作者: Christoph Bühler,Matteo Biagiola,Luca Di Grazia,Guido Salvaneschi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have evolved into AI agents that interact with external tools and environments to perform complex tasks. The Model Context Protocol (MCP) has become the de facto standard for connecting agents with such resources, but security has lagged behind: thousands of MCP servers execute with unrestricted access to host systems, creating a broad attack surface. In this paper, we introduce AgentBound, the first access control framework for MCP servers. AgentBound combines a declarative policy mechanism, inspired by the Android permission model, with a policy enforcement engine that contains malicious behavior without requiring MCP server modifications. We build a dataset containing the 296 most popular MCP servers, and show that access control policies can be generated automatically from source code with 80.9% accuracy. We also show that AgentBound blocks the majority of security threats in several malicious MCP servers, and that the policy enforcement engine introduces negligible overhead. Our contributions provide developers and project managers with a practical foundation for securing MCP servers while maintaining productivity, enabling researchers and tool builders to explore new directions for declarative access control and MCP security.
zh
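
A hypothetical shape of a declarative, Android-style access-control policy for an MCP server, following the abstract's description; all field names and the matching logic below are our illustration, not AgentBound's API.

```python
import fnmatch

POLICY = {
    "server": "filesystem-mcp",
    "permissions": {
        "fs.read":   {"paths": ["/workspace/*"]},
        "fs.write":  {"paths": ["/workspace/out/*"]},
        "net.fetch": {"hosts": []},             # no network access granted
    },
}

def is_allowed(policy, permission, resource):
    """Default-deny check of a requested permission against the policy."""
    rule = policy["permissions"].get(permission)
    if rule is None:
        return False                            # undeclared permissions are denied
    scopes = rule.get("paths") or rule.get("hosts") or []
    return any(fnmatch.fnmatch(resource, pat) for pat in scopes)

print(is_allowed(POLICY, "fs.read", "/workspace/data.csv"))   # True
print(is_allowed(POLICY, "net.fetch", "evil.example.com"))    # False
```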

[AI-33] PLAN: Proactive Low-Rank Allocation for Continual Learning ICCV2025

【速读】:该论文旨在解决持续学习(Continual Learning, CL)中模型在适应新任务时易发生对先前知识的遗忘问题,尤其是在大规模预训练模型上的高效微调挑战。解决方案的关键在于提出一种名为PLAN(Proactive Low-rank Allocation Network)的框架,其核心创新是将低秩适配(Low-Rank Adaptation, LoRA)扩展为具备干扰感知能力的机制:通过为每个任务主动分配正交基向量,并采用基于扰动的优化策略最小化新旧参数间的冲突;同时引入一种新颖的基向量选择机制,优先选取对干扰敏感度最低的基向量进行分配,从而在保障历史知识稳定性的前提下实现对新任务的高效适应。

链接: https://arxiv.org/abs/2510.21188
作者: Xiequn Wang,Zhan Zhuang,Yu Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: accepted by ICCV 2025

点击查看摘要

Abstract:Continual learning (CL) requires models to continuously adapt to new tasks without forgetting past knowledge. In this work, we propose Proactive Low-rank AllocatioN (PLAN), a framework that extends Low-Rank Adaptation (LoRA) to enable efficient and interference-aware fine-tuning of large pre-trained models in CL settings. PLAN proactively manages the allocation of task-specific subspaces by introducing orthogonal basis vectors for each task and optimizing them through a perturbation-based strategy that minimizes conflicts with previously learned parameters. Furthermore, PLAN incorporates a novel selection mechanism that identifies and assigns basis vectors with minimal sensitivity to interference, reducing the risk of degrading past knowledge while maintaining efficient adaptation to new tasks. Empirical results on standard CL benchmarks demonstrate that PLAN consistently outperforms existing methods, establishing a new state-of-the-art for continual learning with foundation models.
zh

[AI-34] Shylock: Causal Discovery in Multivariate Time Series based on Hybrid Constraints

【速读】:该论文旨在解决多变量时间序列(Multivariate Time Series, MTS)中因果关系发现的难题,尤其针对现有方法在数据稀缺场景下易过拟合、依赖大量数据以及难以捕捉时滞特征的问题。解决方案的关键在于提出一种名为Shylock的新方法,其核心创新包括:通过分组膨胀卷积(group dilated convolution)和共享核(sharing kernel)实现参数数量的指数级减少,同时保留对时滞变量的有效表征能力;并通过结合全局约束与局部约束机制,在网络间实现信息共享,从而提升因果推断的准确性。实验表明,Shylock在少样本和常规MTS场景下均优于当前主流方法。

链接: https://arxiv.org/abs/2510.21181
作者: Shuo Li,Keqin Xu,Jie Liu,Dan Ye
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Causal relationship discovery has been drawing increasing attention due to its broad applications. Existing methods rely on human experience, statistical techniques, or graphical criteria, which are error-prone, depend on idealized assumptions, and require large amounts of data. There is also a serious data gap in accessing multivariate time series (MTS) in many areas, which makes discovering their causal relationships harder and leaves existing methods prone to overfitting. To fill this gap, we propose Shylock, a novel method that works well on both few-shot and normal MTS for causal discovery. Shylock reduces the number of parameters exponentially by using group dilated convolution and a sharing kernel, while still learning a better representation of time-delayed variables. By combining a global constraint with local constraints, Shylock achieves information sharing among networks to improve accuracy. To evaluate its performance, we also design a method for generating MTS with time delays. Extensive experiments on commonly used benchmarks and generated datasets show that Shylock outperforms two existing state-of-the-art methods on both few-shot and normal MTS. We have also developed Tcausal, a library for easy use, and deployed it on the EarthDataMiner platform.
zh

[AI-35] Memory-Free Continual Learning with Null Space Adaptation for Zero-Shot Vision-Language Models

【速读】:该论文旨在解决预训练视觉-语言模型(Vision-Language Models, VLMs)在持续学习场景下因分布偏移或新类出现而导致的灾难性遗忘问题,同时保持其原有的零样本迁移能力。解决方案的关键在于提出一种轻量级、无需存储记忆的持续学习框架NuSA-CL(Null Space Adaptation for Continual Learning),通过低秩适应(Low-Rank Adaptation)并约束任务特定权重更新位于模型当前参数的近似零空间内,从而最小化对已学知识的干扰,有效保留原始模型的零样本性能,且不依赖重放缓冲区或昂贵的知识蒸馏,显著降低计算与内存开销,适用于资源受限的实际部署环境。

链接: https://arxiv.org/abs/2510.21175
作者: Yujin Jo,Taesup Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated remarkable zero-shot generalization, enabling deployment in a wide range of real-world tasks without additional task-specific training. However, in real deployment scenarios with evolving environments or emerging classes, these models inevitably face distributional shifts and novel tasks. In such contexts, static zero-shot capabilities are insufficient, and there is a growing need for continual learning methods that allow models to adapt over time while avoiding catastrophic forgetting. We introduce NuSA-CL (Null Space Adaptation for Continual Learning), a lightweight memory-free continual learning framework designed to address this challenge. NuSA-CL employs low-rank adaptation and constrains task-specific weight updates to lie within an approximate null space of the model’s current parameters. This strategy minimizes interference with previously acquired knowledge, effectively preserving the zero-shot capabilities of the original model. Unlike methods relying on replay buffers or costly distillation, NuSA-CL imposes minimal computational and memory overhead, making it practical for deployment in resource-constrained, real-world continual learning environments. Experiments show that our framework not only effectively preserves zero-shot transfer capabilities but also achieves highly competitive performance on continual learning benchmarks. These results position NuSA-CL as a practical and scalable solution for continually evolving zero-shot VLMs in real-world applications.
zh
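
The null-space constraint can be illustrated with plain linear algebra: project a candidate update onto the (approximate) right null space of the current weights so that outputs on already-handled input directions barely move. This is our reading of the idea; NuSA-CL applies it to low-rank adapter updates rather than raw weight matrices.

```python
import numpy as np

def project_update_to_nullspace(W, Delta, tol=1e-6):
    """Keep only the component of Delta that acts on the null space of W."""
    _, S, Vt = np.linalg.svd(W)            # W: (m, n); rows of Vt span the input space
    s_full = np.zeros(Vt.shape[0])
    s_full[: S.shape[0]] = S
    N = Vt[s_full < tol]                   # basis of the (right) null space of W
    return Delta @ N.T @ N

W = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])    # rank-2 map on R^3
D = project_update_to_nullspace(W, np.ones((2, 3)))
x = np.array([1.0, 2.0, 0.0])                       # input already handled by W
print(np.allclose((W + D) @ x, W @ x))              # True: old behavior preserved
```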

[AI-36] Towards Straggler-Resilient Split Federated Learning: An Unbalanced Update Approach

【速读】:该论文旨在解决Split Federated Learning (SFL) 中因设备延迟(straggler)导致的训练效率低下问题。SFL 依赖于客户端与 Split Server 之间的同步通信,即服务器端模型更新需等待所有客户端激活值(activations)到达,这使得慢节点成为系统可扩展性和效率的关键瓶颈。解决方案的核心是提出 MU-SplitFed,一种基于零阶优化(zeroth-order optimization)的抗迟滞 SFL 算法,其关键创新在于通过一个简单而有效的非均衡更新机制(unbalanced update mechanism),将服务器端的训练进度与客户端延迟解耦:服务器在每个客户端轮次中执行 τ 次本地更新,从而实现 O(√d/(τT)) 的收敛速率,并在通信轮次上获得 τ 倍线性加速,显著缓解了 straggler 的负面影响。

链接: https://arxiv.org/abs/2510.21155
作者: Dandan Liang,Jianing Zhang,Evan Chen,Zhe Li,Rui Li,Haibo Yang
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Split Federated Learning (SFL) enables scalable training on edge devices by combining the parallelism of Federated Learning (FL) with the computational offloading of Split Learning (SL). Despite its great success, SFL suffers significantly from the well-known straggler issue in distributed learning systems. This problem is exacerbated by the dependency between Split Server and clients: the Split Server side model update relies on receiving activations from clients. Such a synchronization requirement introduces significant time latency, making stragglers a critical bottleneck to the scalability and efficiency of the system. To mitigate this problem, we propose MU-SplitFed, a straggler-resilient SFL algorithm in zeroth-order optimization that decouples training progress from straggler delays via a simple yet effective unbalanced update mechanism. By enabling the server to perform \tau local updates per client round, MU-SplitFed achieves a convergence rate of O(\sqrt{d}/(\tau T)) for non-convex objectives, demonstrating a linear speedup of \tau in communication rounds. Experiments demonstrate that MU-SplitFed consistently outperforms baseline methods in the presence of stragglers and effectively mitigates their impact through adaptive tuning of \tau . The code for this project is available at this https URL.
zh
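
The zeroth-order primitive MU-SplitFed builds on is a two-point gradient estimate, which the server can apply τ times per client round. A simplified sketch of that unbalanced loop (the split-network bookkeeping is omitted):

```python
import numpy as np

def zo_grad(loss_fn, w, mu=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate along a random direction."""
    rng = rng or np.random.default_rng(0)
    u = rng.standard_normal(w.shape)
    return (loss_fn(w + mu * u) - loss_fn(w - mu * u)) / (2 * mu) * u

def server_round(loss_fn, w, tau, lr=0.01):
    """The server performs tau local updates per client round."""
    for _ in range(tau):
        w = w - lr * zo_grad(loss_fn, w)
    return w

w = server_round(lambda w: float((w ** 2).sum()), np.ones(4), tau=10)
```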

[AI-37] Uncertainty-Aware Multi-Objective Reinforcement Learning-Guided Diffusion Models for 3D De Novo Molecular Design NEURIPS2025

【速读】:该论文旨在解决生成式 AI 在药物发现和分子工程中设计具有理想性质的全新三维(3D)分子时,难以有效控制复杂多目标约束的问题。其解决方案的关键在于提出一种基于不确定性的强化学习(Reinforcement Learning, RL)框架,通过引入具备预测不确定性估计能力的代理模型(surrogate models)动态调整奖励函数,从而在多个优化目标之间实现平衡,同时提升生成分子的整体质量。该方法显著优于现有基线,在多个基准数据集和扩散模型架构上均展现出更优的分子质量和属性优化能力,并通过分子动力学(Molecular Dynamics, MD)模拟与ADMET特性分析验证了候选分子的良好类药性和结合稳定性。

链接: https://arxiv.org/abs/2510.21153
作者: Lianghong Chen,Dongkyu Eugene Kim,Mike Domaratzki,Pingzhao Hu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at NeurIPS 2025

点击查看摘要

Abstract:Designing de novo 3D molecules with desirable properties remains a fundamental challenge in drug discovery and molecular engineering. While diffusion models have demonstrated remarkable capabilities in generating high-quality 3D molecular structures, they often struggle to effectively control complex multi-objective constraints critical for real-world applications. In this study, we propose an uncertainty-aware Reinforcement Learning (RL) framework to guide the optimization of 3D molecular diffusion models toward multiple property objectives while enhancing the overall quality of the generated molecules. Our method leverages surrogate models with predictive uncertainty estimation to dynamically shape reward functions, facilitating balance across multiple optimization objectives. We comprehensively evaluate our framework across three benchmark datasets and multiple diffusion model architectures, consistently outperforming baselines for molecular quality and property optimization. Additionally, Molecular Dynamics (MD) simulations and ADMET profiling of top generated candidates indicate promising drug-like behavior and binding stability, comparable to known Epidermal Growth Factor Receptor (EGFR) inhibitors. Our results demonstrate the strong potential of RL-guided generative diffusion models for advancing automated molecular design.
zh

[AI-38] String Seed of Thought: Prompting LLMs for Distribution-Faithful and Diverse Generation

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在概率性指令遵循(Probabilistic Instruction Following, PIF)任务中的表现不足问题,即当要求模型从预定义选项中按指定概率分布选择答案时,LLMs 常因固有偏差导致输出分布偏离目标分布,从而影响需要非确定性行为的应用场景(如人类行为模拟、内容多样性生成和多人游戏)。解决方案的关键在于提出一种名为“字符串种子思维链”(String Seed of Thought, SSoT)的提示方法:首先引导模型生成一个随机字符串以引入足够熵,随后通过操纵该字符串提取随机性并推导最终答案,从而在保持约束条件的同时显著提升响应多样性。实验表明,SSoT 能使 LLM 的 PIF 性能逼近伪随机数生成器的理想水平,并在 NoveltyBench 上验证其对开放任务也具有增强多样性的作用。

链接: https://arxiv.org/abs/2510.21150
作者: Kou Misaki,Takuya Akiba
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce String Seed of Thought (SSoT), a novel prompting method for LLMs that improves Probabilistic Instruction Following (PIF). We define PIF as a task requiring an LLM to select its answer from a predefined set of options, each associated with a specific probability, such that the empirical distribution of the generated answers aligns with the target distribution when prompted multiple times. While LLMs excel at tasks with single, deterministic answers, they often fail at PIF, exhibiting biases problematic for applications requiring non-deterministic behaviors, such as human-behavior simulation, content diversification, and multiplayer games. It also harms the diversity of generated responses, a crucial factor in test-time scaling, by causing the outputs to collapse into a limited set of answers. To address this, we propose SSoT, a simple prompting method that instructs an LLM to first output a random string to generate sufficient entropy. SSoT also instructs the LLM to extract randomness by manipulating this string to derive a final answer, thereby preserving diversity while adhering to specific constraints. We demonstrate that SSoT significantly improves the PIF performance of LLMs, approaching the ideal performance of a pseudo-random number generator. Furthermore, our experiments on NoveltyBench show SSoT’s benefits extend beyond closed-set tasks to open-ended tasks by enhancing response diversity.
zh
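
A post-hoc way to see how a generated random string can carry the required entropy: hash it into [0, 1) and map it onto the target option probabilities. The in-prompt manipulation is performed by the LLM itself; this helper is only our illustration of the extraction step.

```python
import hashlib

def pick_option(random_string, options, probs):
    """Map a random string to an option according to the target distribution."""
    h = int(hashlib.sha256(random_string.encode()).hexdigest(), 16)
    u = (h % 10**9) / 10**9                 # pseudo-uniform draw in [0, 1)
    cum = 0.0
    for opt, p in zip(options, probs):
        cum += p
        if u < cum:
            return opt
    return options[-1]

print(pick_option("qZ3k9fw2...", ["A", "B"], [0.3, 0.7]))
```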

[AI-39] How to Auto-optimize Prompts for Domain Tasks? Adaptive Prompting and Reasoning through Evolutionary Domain Knowledge Adaptation

【速读】:该论文旨在解决在真实世界应用中,如何为大语言模型(Large Language Models, LLMs)设计最优提示(prompt)和推理过程的问题,尤其关注领域知识的整合、推理效率的提升以及向领域专家提供可操作的知识融合建议。其核心挑战在于现有方法难以自动化地优化提示结构与因果推理机制,且缺乏对不同LLM特性差异的适应能力。解决方案的关键是提出一种名为进化图优化提示(Evolutionary Graph Optimization for Prompting, EGO-Prompt)的自动化框架,该框架以人类专家构建的容错初始语义因果图(Semantic Causal Graph, SCG)为基础,通过两阶段因果引导文本梯度机制:首先从SCG生成近似确定性的推理指导,其次调整LLM以有效利用该指导与原始输入;并结合基于真实标签的文本梯度迭代优化SCG与推理机制,从而实现提示质量、推理效率与可解释性的协同提升。

链接: https://arxiv.org/abs/2510.21148
作者: Yang Zhao,Pu Wang,Hao Frank Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Designing optimal prompts and reasoning processes for large language models (LLMs) on domain-specific tasks is both necessary and challenging in real-world applications. Determining how to integrate domain knowledge, enhance reasoning efficiency, and even provide domain experts with refined knowledge integration hints are particularly crucial yet unresolved tasks. In this research, we propose Evolutionary Graph Optimization for Prompting (EGO-Prompt), an automated framework for designing better prompts and efficient reasoning processes while providing an enhanced causal-informed process. EGO-Prompt begins with a general prompt and fault-tolerant initial Semantic Causal Graph (SCG) descriptions, constructed by human experts, which are then automatically refined and optimized to guide LLM reasoning. Recognizing that expert-defined SCGs may be partial or imperfect and that their optimal integration varies across LLMs, EGO-Prompt integrates a novel causal-guided textual gradient process in two steps: first, generating nearly deterministic reasoning guidance from the SCG for each instance, and second, adapting the LLM to effectively utilize the guidance alongside the original input. The iterative optimization algorithm further refines both the SCG and the reasoning mechanism using textual gradients with ground-truth. We tested the framework on real-world public health, transportation and human behavior tasks. EGO-Prompt achieves 7.32%-12.61% higher F1 than cutting-edge methods, and allows small models to reach the performance of larger models at under 20% of the original cost. It also outputs a refined, domain-specific SCG that improves interpretability.
zh

[AI-40] NeuroGenPoisoning: Neuron-Guided Attacks on Retrieval-Augmented Generation of LLM via Genetic Optimization of External Knowledge

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统在面对外部知识投毒攻击时的脆弱性问题,特别是现有攻击方法忽视了大语言模型(Large Language Models, LLMs)内部表示动态性和神经元敏感性的局限。其解决方案的关键在于提出一种名为NeuroGenPoisoning的新颖攻击框架,该框架通过LLM内部神经元归因(neuron attribution)和遗传优化相结合的方式,精准识别对中毒知识响应敏感的“毒响应神经元”(Poison-Responsive Neurons),并利用遗传算法演化出能最大化激活这些神经元的对抗性外部文本片段。该方法不仅显著提升了投毒成功率(Population Overwrite Success Rate > 90%),还通过神经元层面的引导有效缓解了外部知识与模型强参数化知识之间的冲突问题。

链接: https://arxiv.org/abs/2510.21144
作者: Hanyu Zhu,Lance Fiondella,Jiawei Yuan,Kai Zeng,Long Jiao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) empowers Large Language Models (LLMs) to dynamically integrate external knowledge during inference, improving their factual accuracy and adaptability. However, adversaries can inject poisoned external knowledge to override the model's internal memory. While existing attacks iteratively manipulate the retrieval content or prompt structure of RAG, they largely ignore the model's internal representation dynamics and neuron-level sensitivities. The underlying mechanism of RAG poisoning has not been fully studied, and the effect of knowledge conflict with strong parametric knowledge in RAG is not considered. In this work, we propose NeuroGenPoisoning, a novel attack framework that generates adversarial external knowledge in RAG guided by LLM internal neuron attribution and genetic optimization. Our method first identifies a set of Poison-Responsive Neurons whose activation strongly correlates with contextual poisoning knowledge. We then employ a genetic algorithm to evolve adversarial passages that maximally activate these neurons. Crucially, our framework enables massive-scale generation of effective poisoned RAG knowledge by identifying and reusing promising but initially unsuccessful external knowledge variants via observed attribution signals. At the same time, poisoning guided by Poison-Responsive Neurons effectively resolves knowledge conflicts. Experimental results across models and datasets demonstrate that our method consistently achieves a high Population Overwrite Success Rate (POSR) of over 90% while preserving fluency. Empirical evidence shows that our method effectively resolves knowledge conflict.
zh
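
The genetic-optimization loop in the abstract follows the standard select/crossover/mutate pattern; a skeleton is below. The fitness function is deliberately left abstract here, since the real one queries activations of the identified Poison-Responsive Neurons.

```python
import random

def evolve(population, fitness, crossover, mutate, generations=30):
    """Generic GA skeleton: keep the fitter half, refill with mutated offspring."""
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(2, len(ranked) // 2)]
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(len(population) - len(parents))]
        population = parents + children
    return max(population, key=fitness)
```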

[AI-41] PanicToCalm: A Proactive Counseling Agent for Panic Attacks

【速读】:该论文旨在解决恐慌发作(Panic Attacks)情境下缺乏合适训练数据的问题,从而限制了有效干预模型的开发。其关键解决方案是构建了一个名为PACE的数据集,该数据集基于第一人称叙述构建高压力事件,并遵循心理急救(Psychological First Aid, PFA)原则;在此基础上训练出PACER模型,该模型通过监督学习和模拟偏好对齐优化,具备共情与指导性支持能力。实验表明,PACER在咨询师侧指标和来访者情绪改善方面均优于强基线模型,且人类评估进一步验证其在恐慌场景中的实际价值。

链接: https://arxiv.org/abs/2510.21143
作者: Jihyun Lee,Yejin Min,San Kim,Yejin Jeon,SungJun Yang,Hyounghun Kim,Gary Geunbae Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Panic attacks are acute episodes of fear and distress, in which timely, appropriate intervention can significantly help individuals regain stability. However, suitable datasets for training such models remain scarce due to ethical and logistical issues. To address this, we introduce PACE, which is a dataset that includes high-distress episodes constructed from first-person narratives, and structured around the principles of Psychological First Aid (PFA). Using this data, we train PACER, a counseling model designed to provide both empathetic and directive support, which is optimized through supervised learning and simulated preference alignment. To assess its effectiveness, we propose PanicEval, a multi-dimensional framework covering general counseling quality and crisis-specific strategies. Experimental results show that PACER outperforms strong baselines in both counselor-side metrics and client affect improvement. Human evaluations further confirm its practical value, with PACER consistently preferred over general, CBT-based, and GPT-4-powered models in panic scenarios (Code is available at this https URL ).
zh

[AI-42] Quantifying CBRN Risk in Frontier Models

【速读】:该论文旨在解决前沿大型语言模型(Large Language Models, LLMs)在化学、生物、放射性和核(CBRN)武器知识传播方面存在的双重用途风险问题,即这些模型可能被恶意利用以生成或扩散危险信息。其解决方案的关键在于提出并实施一种严格的三层攻击方法,对10个主流商业LLM进行系统性评估,涵盖自建的200个CBRN相关提示数据集和FORTRESS基准的180个子集;研究发现,通过深度诱饵(Deep Inception)攻击可使攻击成功率提升至86.0%,远高于直接请求的33.8%,揭示了当前安全对齐机制存在严重脆弱性,且不同模型的安全表现差异巨大(攻击成功率从2%到96%不等),表明现有防护措施极易被简单提示工程绕过。这一结果凸显了建立标准化评估框架、透明安全指标与更强对齐技术的紧迫性。

链接: https://arxiv.org/abs/2510.21133
作者: Divyanshu Kumar,Nitin Aravind Birur,Tanay Baswa,Sahil Agarwal,Prashanth Harshangi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Frontier Large Language Models (LLMs) pose unprecedented dual-use risks through the potential proliferation of chemical, biological, radiological, and nuclear (CBRN) weapons knowledge. We present the first comprehensive evaluation of 10 leading commercial LLMs against both a novel 200-prompt CBRN dataset and a 180-prompt subset of the FORTRESS benchmark, using a rigorous three-tier attack methodology. Our findings expose critical safety vulnerabilities: Deep Inception attacks achieve 86.0% success versus 33.8% for direct requests, demonstrating superficial filtering mechanisms; Model safety performance varies dramatically from 2% (claude-opus-4) to 96% (mistral-small-latest) attack success rates; and eight models exceed 70% vulnerability when asked to enhance dangerous material properties. We identify fundamental brittleness in current safety alignment, where simple prompt engineering techniques bypass safeguards for dangerous CBRN information. These results challenge industry safety claims and highlight urgent needs for standardized evaluation frameworks, transparent safety metrics, and more robust alignment techniques to mitigate catastrophic misuse risks while preserving beneficial capabilities.
zh

[AI-43] Enhanced Evolutionary Multi-Objective Deep Reinforcement Learning for Reliable and Efficient Wireless Rechargeable Sensor Networks

【速读】:该论文旨在解决无线可充电传感器网络(Wireless Rechargeable Sensor Networks, WRSNs)中长期存在的关键挑战:如何在动态运行条件下,同时最大化节点生存率和移动充电器的能量使用效率,从而延长网络寿命并减少能量浪费。这一问题具有NP-hard复杂度且存在长期时间依赖性,传统优化方法难以有效应对。解决方案的关键在于提出一种增强型进化多目标深度强化学习算法,其核心创新包括:基于长短期记忆(Long Short-Term Memory, LSTM)的策略网络以捕捉时间模式、基于多层感知机(Multilayer Perceptron)的前瞻性增量模型用于未来状态预测,以及一种时变Pareto策略评估方法以实现动态偏好自适应,从而在复杂环境中高效生成多样化的帕累托最优解。

链接: https://arxiv.org/abs/2510.21127
作者: Bowei Tong,Hui Kang,Jiahui Li,Geng Sun,Jiacheng Wang,Yaoqi Yang,Bo Xu,Dusit Niyato
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 15 pages, 9 figures, submitted to TVT

点击查看摘要

Abstract:Despite rapid advancements in sensor networks, conventional battery-powered sensor networks suffer from limited operational lifespans and frequent maintenance requirements that severely constrain their deployment in remote and inaccessible environments. As such, wireless rechargeable sensor networks (WRSNs) with mobile charging capabilities offer a promising solution to extend network lifetime. However, WRSNs face critical challenges from the inherent trade-off between maximizing the node survival rates and maximizing charging energy efficiency under dynamic operational conditions. In this paper, we investigate a typical scenario where mobile chargers move and charge the sensor, thereby maintaining the network connectivity while minimizing the energy waste. Specifically, we formulate a multi-objective optimization problem that simultaneously maximizes the network node survival rate and mobile charger energy usage efficiency across multiple time slots, which presents NP-hard computational complexity with long-term temporal dependencies that make traditional optimization approaches ineffective. To address these challenges, we propose an enhanced evolutionary multi-objective deep reinforcement learning algorithm, which integrates a long short-term memory (LSTM)-based policy network for temporal pattern recognition, a multilayer perceptron-based prospective increment model for future state prediction, and a time-varying Pareto policy evaluation method for dynamic preference adaptation. Extensive simulation results demonstrate that the proposed algorithm significantly outperforms existing approaches in balancing node survival rate and energy efficiency while generating diverse Pareto-optimal solutions. Moreover, the LSTM-enhanced policy network converges 25% faster than conventional networks, with the time-varying evaluation method effectively adapting to dynamic conditions.
zh
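
The bi-objective trade-off between node survival rate and charging efficiency is exactly what a Pareto front captures; a small helper makes the notion concrete:

```python
def pareto_front(solutions):
    """Keep the non-dominated (survival_rate, energy_efficiency) pairs."""
    def dominates(a, b):
        return a[0] >= b[0] and a[1] >= b[1] and a != b
    return [s for s in solutions if not any(dominates(o, s) for o in solutions)]

print(pareto_front([(0.9, 0.5), (0.8, 0.7), (0.7, 0.6)]))  # (0.7, 0.6) is dominated
```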

[AI-44] Generalizable Hierarchical Skill Learning via Object-Centric Representation

【速读】:该论文旨在解决机器人操作中策略泛化能力弱与样本效率低的问题,尤其在面对未见过的空间布局、物体外观及任务组合时表现不佳。解决方案的关键在于提出一种可迁移的分层技能学习框架(Generalizable Hierarchical Skill Learning, GSL),其核心创新是利用以物体为中心的技能作为接口,连接高层视觉-语言模型与低层视觉-运动策略。GSL通过基础模型将示范分解为可迁移且物体规范化的技能原语,确保在物体坐标系下高效学习底层技能;测试时,高层代理预测的技能-物体对被输入至低层模块,推断出的规范动作再映射回世界坐标系执行。这种结构化且灵活的设计显著提升了样本效率和泛化性能。

链接: https://arxiv.org/abs/2510.21121
作者: Haibo Zhao,Yu Qi,Boce Hu,Yizhe Zhu,Ziyan Chen,Heng Tian,Xupeng Zhu,Owen Howell,Haojie Huang,Robin Walters,Dian Wang,Robert Platt
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present Generalizable Hierarchical Skill Learning (GSL), a novel framework for hierarchical policy learning that significantly improves policy generalization and sample efficiency in robot manipulation. One core idea of GSL is to use object-centric skills as an interface that bridges the high-level vision-language model and the low-level visual-motor policy. Specifically, GSL decomposes demonstrations into transferable and object-canonicalized skill primitives using foundation models, ensuring efficient low-level skill learning in the object frame. At test time, the skill-object pairs predicted by the high-level agent are fed to the low-level module, where the inferred canonical actions are mapped back to the world frame for execution. This structured yet flexible design leads to substantial improvements in sample efficiency and generalization of our method across unseen spatial arrangements, object appearances, and task compositions. In simulation, GSL trained with only 3 demonstrations per task outperforms baselines trained with 30 times more data by 15.5 percent on unseen tasks. In real-world experiments, GSL also surpasses the baseline trained with 10 times more data.
zh
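
The frame mapping that GSL's test-time pipeline relies on is a standard homogeneous transform: an action position predicted in the object's canonical frame is mapped back to the world frame before execution (GSL's full action representation also covers orientation, omitted here).

```python
import numpy as np

def canonical_to_world(p_obj, T_world_obj):
    """Map a 3D point from the object's canonical frame to the world frame."""
    return (T_world_obj @ np.append(p_obj, 1.0))[:3]

T = np.eye(4)
T[:3, 3] = [0.2, 0.0, 0.1]                                 # object pose in the world
print(canonical_to_world(np.array([0.0, 0.0, 0.05]), T))   # grasp 5 cm above the object
```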

[AI-45] DAO-AI: Evaluating Collective Decision-Making through Agentic AI in Decentralized Governance

【速读】:该论文旨在解决去中心化自治组织(DAO)中治理决策缺乏可解释性与经济合理性的问题,尤其关注如何利用生成式AI作为自主决策代理(agentic AI)来提升治理效率与可信度。其解决方案的关键在于构建一个基于模块化可组合程序(MCP)工作流的智能投票代理,该代理能够理解提案上下文、检索历史讨论数据,并独立做出投票决策;整个过程运行于基于区块链数据验证的金融模拟环境中,确保决策具备可审计性和实证基础,从而实现对人类和代币加权结果的高度一致性,为DAO治理提供可解释且经济严谨的AI代理设计范式。

链接: https://arxiv.org/abs/2510.21117
作者: Chunghyun Han,Alfio Gliozzo,Junkyu Lee,Agostino Capponi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 2 figures

点击查看摘要

Abstract:This paper presents a first empirical study of agentic AI as autonomous decision-makers in decentralized governance. Using more than 3K proposals from major protocols, we build an agentic AI voter that interprets proposal contexts, retrieves historical deliberation data, and independently determines its voting position. The agent operates within a realistic financial simulation environment grounded in verifiable blockchain data, implemented through a modular composable program (MCP) workflow that defines data flow and tool usage via Agentics framework. We evaluate how closely the agent’s decisions align with the human and token-weighted outcomes, uncovering strong alignments measured by carefully designed evaluation metrics. Our findings demonstrate that agentic AI can augment collective decision-making by producing interpretable, auditable, and empirically grounded signals in realistic DAO governance settings. The study contributes to the design of explainable and economically rigorous AI agents for decentralized financial systems.
zh

[AI-46] Confounding Robust Deep Reinforcement Learning: A Causal Approach NEURIPS2025

【速读】:该论文旨在解决在存在未观测混杂因素(unobserved confounding)的复杂高维环境中,基于有偏数据进行离策略(off-policy)学习时策略优化性能下降的问题。其核心挑战在于传统深度强化学习方法(如DQN)在行为策略与目标策略输入不一致且存在隐藏混杂变量时会失效。解决方案的关键在于提出一种新的深度强化学习算法,该算法通过寻找在与观测数据兼容的最坏情形环境下的安全策略(safe policy),从而有效缓解混杂偏差对策略学习的影响。实验表明,在12个受混杂因素干扰的Atari游戏中,该方法在所有测试场景中均显著优于标准DQN。

链接: https://arxiv.org/abs/2510.21110
作者: Mingxuan Li,Junzhe Zhang,Elias Bareinboim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: NeurIPS 2025

点击查看摘要

Abstract:A key task in Artificial Intelligence is learning effective policies for controlling agents in unknown environments to optimize performance measures. Off-policy learning methods, like Q-learning, allow learners to make optimal decisions based on past experiences. This paper studies off-policy learning from biased data in complex and high-dimensional domains where unobserved confounding cannot be ruled out a priori. Building on the well-celebrated Deep Q-Network (DQN), we propose a novel deep reinforcement learning algorithm robust to confounding biases in observed data. Specifically, our algorithm attempts to find a safe policy for the worst-case environment compatible with the observations. We apply our method to twelve confounded Atari games, and find that it consistently dominates the standard DQN in all games where the observed input to the behavioral and target policies mismatch and unobserved confounders exist.
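
论文的核心是"在与观测兼容的最坏情形环境中寻找安全策略"。下面用表格型 Q 迭代给出一个高度简化的示意:对一组候选环境模型逐 (s,a) 取最坏情形的 Bellman 备份。候选模型与数值均为随机假设,论文实际采用的是基于 DQN 的深度实现:

```python
import numpy as np

n_s, n_a, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)

# 假设:两个与观测数据兼容的候选环境(转移张量 P[s,a,s'] 与奖励 R[s,a])
models = []
for _ in range(2):
    P = rng.random((n_s, n_a, n_s)); P /= P.sum(-1, keepdims=True)
    R = rng.random((n_s, n_a))
    models.append((P, R))

Q = np.zeros((n_s, n_a))
for _ in range(200):
    # 对每个候选模型做标准 Bellman 备份,再逐 (s,a) 取最小值 => 最坏情形更新
    backups = [R + gamma * P @ Q.max(axis=1) for P, R in models]
    Q = np.minimum.reduce(backups)

print("robust greedy policy:", Q.argmax(axis=1))
```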
zh

[AI-47] ESCORT: Efficient Stein-variational and Sliced Consistency-Optimized Temporal Belief Representation for POMDPs NEURIPS’25

【速读】:该论文旨在解决部分可观测马尔可夫决策过程(Partially Observable Markov Decision Processes, POMDPs)中信念分布(belief distribution)复杂性难以准确建模的问题,尤其是在高维、多模态场景下,传统数学模型与现有信念近似方法无法有效捕捉不确定性结构,导致估计误差并引发次优代理行为。解决方案的关键在于提出ESCORT(Efficient Stein-variational and sliced Consistency-Optimized Representation for Temporal beliefs),其核心创新是将Stein变分梯度下降(Stein Variational Gradient Descent, SVGD)扩展为具备两个关键机制:一是相关性感知投影(correlation-aware projections),用于显式建模状态维度间的依赖关系;二是时序一致性约束(temporal consistency constraints),在更新过程中稳定信念演化同时保留相关结构。该方法在保持SVGD吸引-排斥粒子动力学优势的基础上,实现了对复杂相关模式的精确建模,且无需重采样或假设固定分布形式,从而动态适应信念空间的复杂性,显著提升信念近似精度与下游决策质量。

链接: https://arxiv.org/abs/2510.21107
作者: Yunuo Zhang,Baiting Luo,Ayan Mukhopadhyay,Gabor Karsai,Abhishek Dubey
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Proceeding of the 39th Conference on Neural Information Processing Systems (NeurIPS’25). Code would be available at this https URL

点击查看摘要

Abstract:In Partially Observable Markov Decision Processes (POMDPs), maintaining and updating belief distributions over possible underlying states provides a principled way to summarize action-observation history for effective decision-making under uncertainty. As environments grow more realistic, belief distributions develop complexity that standard mathematical models cannot accurately capture, creating a fundamental challenge in maintaining representational accuracy. Despite advances in deep learning and probabilistic modeling, existing POMDP belief approximation methods fail to accurately represent complex uncertainty structures such as high-dimensional, multi-modal belief distributions, resulting in estimation errors that lead to suboptimal agent behaviors. To address this challenge, we present ESCORT (Efficient Stein-variational and sliced Consistency-Optimized Representation for Temporal beliefs), a particle-based framework for capturing complex, multi-modal distributions in high-dimensional belief spaces. ESCORT extends SVGD with two key innovations: correlation-aware projections that model dependencies between state dimensions, and temporal consistency constraints that stabilize updates while preserving correlation structures. This approach retains SVGD’s attractive-repulsive particle dynamics while enabling accurate modeling of intricate correlation patterns. Unlike particle filters prone to degeneracy or parametric methods with fixed representational capacity, ESCORT dynamically adapts to belief landscape complexity without resampling or restrictive distributional assumptions. We demonstrate ESCORT’s effectiveness through extensive evaluations on both POMDP domains and synthetic multi-modal distributions of varying dimensionality, where it consistently outperforms state-of-the-art methods in terms of belief approximation accuracy and downstream decision quality.
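
ESCORT 在 SVGD 的"吸引-排斥"粒子动力学之上增加了相关性感知投影与时序一致性约束。下面仅示意其基础组件:标准 SVGD 在一维双峰分布上的粒子更新。RBF 核带宽与目标分布均为假设,论文新增的两个机制未在此体现:

```python
import numpy as np

def grad_log_p(x):
    """目标分布:混合 0.5*N(-2,1) + 0.5*N(2,1) 的得分函数(解析形式)。"""
    w1 = np.exp(-0.5 * (x + 2) ** 2); w2 = np.exp(-0.5 * (x - 2) ** 2)
    return (-(x + 2) * w1 - (x - 2) * w2) / (w1 + w2)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100)           # 粒子初始化
h = 0.5                                       # RBF 核带宽(假设为固定值)
for _ in range(500):
    diff = x[:, None] - x[None, :]            # diff[i, j] = x_i - x_j
    K = np.exp(-diff ** 2 / (2 * h ** 2))     # RBF 核矩阵(对称)
    gK = -diff / h ** 2 * K                   # gK[i, j] = ∇_{x_i} k(x_i, x_j)
    # φ(x_i) = (1/n) Σ_j [ k(x_j, x_i) ∇log p(x_j) + ∇_{x_j} k(x_j, x_i) ]
    # 利用核对称性,排斥项 Σ_j ∇_{x_j} k(x_j, x_i) 即 gK 按列求和
    phi = (K @ grad_log_p(x) + gK.sum(axis=0)) / len(x)
    x = x + 0.1 * phi
print("particle mean/std:", x.mean(), x.std())   # 粒子应收敛到双峰分布
```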
zh

[AI-48] MedAlign: A Synergistic Framework of Multimodal Preference Optimization and Federated Meta-Cognitive Reasoning

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在临床医疗服务部署中面临的三大关键挑战:回答内容与视觉证据脱节导致的幻觉问题、固定深度推理效率低下,以及多机构协作困难。解决方案的核心在于提出MedAlign框架,其关键技术包括:1)设计一种多模态直接偏好优化(multimodal Direct Preference Optimization, mDPO)目标,显式地将偏好学习与视觉上下文对齐,提升答案的视觉准确性;2)构建基于检索感知的专家混合(Retrieval-Aware Mixture-of-Experts, RA-MoE)架构,通过图像和文本相似度动态路由查询至特定且上下文增强的LVLM专家,从而减少幻觉;3)引入联邦治理机制,结合本地元认知不确定性估计器实现自适应链式思维(Chain-of-Thought, CoT)推理,支持多机构协同训练与高效推理,实验表明该方法在多个Med-VQA数据集上显著优于现有基线,F1分数最高提升11.85%,平均推理长度减少51.60%。

链接: https://arxiv.org/abs/2510.21093
作者: Siyong Chen,Jinbo Wen,Jiawen Kang,Tenghui Huang,Xumin Huang,Yuanjia Su,Hudan Pan,Zishao Zhong,Dusit Niyato,Shengli Xie,Dong In Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recently, large models have shown significant potential for smart healthcare. However, the deployment of Large Vision-Language Models (LVLMs) for clinical services is currently hindered by three critical challenges: a tendency to hallucinate answers not grounded in visual evidence, the inefficiency of fixed-depth reasoning, and the difficulty of multi-institutional collaboration. To address these challenges, in this paper, we develop MedAlign, a novel framework to ensure visually accurate LVLM responses for Medical Visual Question Answering (Med-VQA). Specifically, we first propose a multimodal Direct Preference Optimization (mDPO) objective to explicitly align preference learning with visual context. We then design a Retrieval-Aware Mixture-of-Experts (RA-MoE) architecture that utilizes image and text similarity to route queries to a specialized and context-augmented LVLM (i.e., an expert), thereby mitigating hallucinations in LVLMs. To achieve adaptive reasoning and facilitate multi-institutional collaboration, we propose a federated governance mechanism, where the selected expert, fine-tuned on clinical datasets based on mDPO, locally performs iterative Chain-of-Thought (CoT) reasoning via the local meta-cognitive uncertainty estimator. Extensive experiments on three representative Med-VQA datasets demonstrate that MedAlign achieves state-of-the-art performance, outperforming strong retrieval-augmented baselines by up to 11.85% in F1-score, and simultaneously reducing the average reasoning length by 51.60% compared with fixed-depth CoT approaches.
zh

[AI-49] M-GLC: Motif-Driven Global-Local Context Graphs for Few-shot Molecular Property Prediction

【速读】:该论文旨在解决少样本分子性质预测(Few-shot Molecular Property Prediction, FSMPP)中因标注数据稀缺而导致模型性能受限的问题。传统深度学习方法依赖大规模标注数据,而FSMPP通过引入关系归纳偏置(relational inductive bias)构建分子-属性图来缓解此问题,但现有方法在结构引导方面仍显不足。解决方案的关键在于提出一种基团驱动的全局-局部上下文图(Motif Driven Global-Local Context Graph):在全局层面,引入代表化学意义的基团节点(motif nodes),形成包含分子、属性与基团三部分的异构图,从而捕捉长程组成模式并促进具有共同基团的分子间知识迁移;在局部层面,为每个分子-属性对构建子图并分别编码,聚焦于最相关的邻近分子和基团,增强细粒度上下文感知能力。该方法显著提升了少样本场景下的预测鲁棒性与准确性。

链接: https://arxiv.org/abs/2510.21088
作者: Xiangyang Xu,Hongyang Gao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Molecular property prediction (MPP) is a cornerstone of drug discovery and materials science, yet conventional deep learning approaches depend on large labeled datasets that are often unavailable. Few-shot Molecular property prediction (FSMPP) addresses this scarcity by incorporating relational inductive bias through a context graph that links molecule nodes to property nodes, but such molecule-property graphs offer limited structural guidance. We propose a comprehensive solution: Motif Driven Global-Local Context Graph for few-shot molecular property prediction, which enriches contextual information at both the global and local levels. At the global level, chemically meaningful motif nodes representing shared substructures, such as rings or functional groups, are introduced to form a global tri-partite heterogeneous graph, yielding motif-molecule-property connections that capture long-range compositional patterns and enable knowledge transfer among molecules with common motifs. At the local level, we build a subgraph for each node in the molecule-property pair and encode them separately to concentrate the model’s attention on the most informative neighboring molecules and motifs. Experiments on five standard FSMPP benchmarks demonstrate that our framework consistently outperforms state-of-the-art methods. These results underscore the effectiveness of integrating global motif knowledge with fine-grained local context to advance robust few-shot molecular property prediction.
zh

[AI-50] Soppia: A Structured Prompting Framework for the Proportional Assessment of Non-Pecuniary Damages in Personal Injury Cases

【速读】:该论文旨在解决司法实践中因复杂法律规则中包含多个异质权重标准而导致的判决不一致问题,尤其是在人身伤害案件中非财产损害赔偿的量化难题。解决方案的关键在于提出Soppia(System for Ordered Proportional and Pondered Intelligent Assessment),这是一个结构化提示框架,通过先进人工智能技术对所有规定标准进行系统性、平衡性的分析,从而实现立法者意图下的整体评估。该框架将巴西《劳动法典》(CLT)第223-G条规定的十二项非财产损害赔偿标准转化为可重复、透明且可解释的操作方法,提升了判决的一致性和可预测性,并实现了规范解释与计算推理的融合,推动了可审计法律人工智能的发展。

链接: https://arxiv.org/abs/2510.21082
作者: Jorge Alberto Araujo
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 9 pages, 2 tables, includes GitHub link to framework implementation. Submitted to the Artificial Intelligence and Law section of arXiv

点击查看摘要

Abstract:Applying complex legal rules characterized by multiple, heterogeneously weighted criteria presents a fundamental challenge in judicial decision-making, often hindering the consistent realization of legislative intent. This challenge is particularly evident in the quantification of non-pecuniary damages in personal injury cases. This paper introduces Soppia, a structured prompting framework designed to assist legal professionals in navigating this complexity. By leveraging advanced AI, the system ensures a comprehensive and balanced analysis of all stipulated criteria, fulfilling the legislator’s intent that compensation be determined through a holistic assessment of each case. Using the twelve criteria for non-pecuniary damages established in the Brazilian CLT (Art. 223-G) as a case study, we demonstrate how Soppia (System for Ordered Proportional and Pondered Intelligent Assessment) operationalizes nuanced legal commands into a practical, replicable, and transparent methodology. The framework enhances consistency and predictability while providing a versatile and explainable tool adaptable across multi-criteria legal contexts, bridging normative interpretation and computational reasoning toward auditable legal AI.
zh

[AI-51] On the Sample Complexity of Differentially Private Policy Optimization

【速读】:该论文旨在解决政策优化(Policy Optimization, PO)在敏感领域部署时面临的隐私保护问题,特别是针对差分隐私(Differential Privacy, DP)约束下的样本复杂度(sample complexity)进行理论分析。其解决方案的关键在于:首先形式化了一个适用于PO场景的差分隐私定义,解决了基于策略学习动态和隐私单位界定的固有挑战;其次,通过统一框架系统分析了策略梯度(PG)、自然策略梯度(NPG)等主流PO算法在不同设置下满足DP约束时的样本复杂度,发现隐私成本通常仅表现为样本复杂度中的低阶项,同时揭示了私有PO环境下若干关键但微妙的性质,为设计高效且隐私保护的PO算法提供了理论依据与实践指导。

链接: https://arxiv.org/abs/2510.21060
作者: Yi He,Xingyu Zhou
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Policy optimization (PO) is a cornerstone of modern reinforcement learning (RL), with diverse applications spanning robotics, healthcare, and large language model training. The increasing deployment of PO in sensitive domains, however, raises significant privacy concerns. In this paper, we initiate a theoretical study of differentially private policy optimization, focusing explicitly on its sample complexity. We first formalize an appropriate definition of differential privacy (DP) tailored to PO, addressing the inherent challenges arising from on-policy learning dynamics and the subtlety involved in defining the unit of privacy. We then systematically analyze the sample complexity of widely-used PO algorithms, including policy gradient (PG), natural policy gradient (NPG) and more, under DP constraints and various settings, via a unified framework. Our theoretical results demonstrate that privacy costs can often manifest as lower-order terms in the sample complexity, while also highlighting subtle yet important observations in private PO settings. These offer valuable practical insights for privacy-preserving PO algorithms.
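
摘要指出隐私成本常表现为样本复杂度中的低阶项。差分隐私策略梯度中最常见的一步是:对每条轨迹的梯度做范数裁剪并加高斯噪声后再聚合。下面给出该步骤的极简草图;裁剪阈值、噪声乘子等均为示意性假设,并非论文给出的具体机制:

```python
import numpy as np

def dp_aggregate(per_traj_grads, clip_norm=1.0, noise_mult=1.0, rng=None):
    """对每条轨迹的策略梯度:逐条范数裁剪 -> 求均值 -> 加各向同性高斯噪声。"""
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_traj_grads:
        scale = min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))  # 范数裁剪
        clipped.append(g * scale)
    mean = np.mean(clipped, axis=0)
    sigma = noise_mult * clip_norm / len(per_traj_grads)  # 批量越大,有效噪声越小
    return mean + rng.normal(0.0, sigma, size=mean.shape)

# 用随机数模拟 32 条轨迹的梯度(假设参数维度为 8)
grads = np.random.default_rng(1).normal(size=(32, 8))
print(dp_aggregate(list(grads)))
```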
zh

[AI-52] From Questions to Queries: An AI-powered Multi-Agent Framework for Spatial Text-to-SQL

【速读】:该论文旨在解决非专家用户在使用地理空间数据库(如PostGIS)时面临的两大挑战:一是结构化查询语言(SQL)的复杂性,二是地理空间函数的专业性导致的语义理解障碍。为应对这些问题,作者提出了一种多智能体(multi-agent)框架,其核心创新在于构建了一个包含程序化模式分析与语义增强的知识库、基于嵌入的上下文检索机制,并设计了一个协作式多智能体流水线——包括实体提取、元数据检索、查询逻辑构建、SQL生成及审查代理(review agent)等模块。其中,审查代理通过程序化和语义验证实现自验证(self-verification),显著提升了生成SQL的准确性,尤其在空间查询场景下表现突出,相较单代理方法准确率提升11个百分点(从76.7%至87.7%)。该方案不仅增强了自然语言到空间SQL翻译的鲁棒性和可扩展性,也为构建自主地理信息系统(Autonomous GIS)提供了通用基础。

链接: https://arxiv.org/abs/2510.21045
作者: Ali Khosravi Kazazi,Zhenlong Li,M. Naser Lessani,Guido Cervone
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:The complexity of Structured Query Language (SQL) and the specialized nature of geospatial functions in tools like PostGIS present significant barriers to non-experts seeking to analyze spatial data. While Large Language Models (LLMs) offer promise for translating natural language into SQL (Text-to-SQL), single-agent approaches often struggle with the semantic and syntactic complexities of spatial queries. To address this, we propose a multi-agent framework designed to accurately translate natural language questions into spatial SQL queries. The framework integrates several innovative components, including a knowledge base with programmatic schema profiling and semantic enrichment, embeddings for context retrieval, and a collaborative multi-agent pipeline as its core. This pipeline comprises specialized agents for entity extraction, metadata retrieval, query logic formulation, SQL generation, and a review agent that performs programmatic and semantic validation of the generated SQL to ensure correctness (self-verification). We evaluate our system using both the non-spatial KaggleDBQA benchmark and a new, comprehensive SpatialQueryQA benchmark that includes diverse geometry types, predicates, and three levels of query complexity. On KaggleDBQA, the system achieved an overall accuracy of 81.2% (221 out of 272 questions) after the review agent’s review and corrections. For spatial queries, the system achieved an overall accuracy of 87.7% (79 out of 90 questions), compared with 76.7% without the review agent. Beyond accuracy, results also show that in some instances the system generates queries that are more semantically aligned with user intent than those in the benchmarks. This work makes spatial analysis more accessible, and provides a robust, generalizable foundation for spatial Text-to-SQL systems, advancing the development of autonomous GIS.
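
下面用桩函数(stub)勾勒这类"生成-审查"多代理流水线的控制流:各代理以函数表示,审查代理做程序化校验并在失败时触发重写。所有函数体与校验规则均为假设的占位实现,仅示意摘要所述的自验证闭环,并非其真实系统:

```python
import re

def extract_entities(question):            # 实体提取代理(桩)
    return {"layers": ["parcels", "rivers"], "predicate": "distance"}

def retrieve_metadata(entities):            # 元数据检索代理(桩)
    return {"parcels": ["geom", "id"], "rivers": ["geom", "name"]}

def generate_sql(question, entities, metadata):   # SQL 生成代理(桩)
    return ("SELECT p.id FROM parcels p, rivers r "
            "WHERE ST_DWithin(p.geom, r.geom, 100);")

def review(sql, metadata):
    """审查代理:示意性的程序化校验(仅检查空间谓词与表名是否出现)。"""
    if not re.search(r"\bST_\w+\s*\(", sql):
        return False, "missing spatial predicate"
    for table in metadata:
        if table not in sql:
            return False, f"table {table} unused"
    return True, "ok"

def answer(question, max_rounds=3):
    entities = extract_entities(question)
    metadata = retrieve_metadata(entities)
    for _ in range(max_rounds):
        sql = generate_sql(question, entities, metadata)
        ok, msg = review(sql, metadata)
        if ok:
            return sql
        # 实际系统中会把 msg 反馈给生成代理进行修正后重试
    return sql

print(answer("Which parcels are within 100 m of a river?"))
```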
zh

[AI-53] Epistemic Deference to AI

【速读】:该论文试图解决的问题是:在何种情况下应优先采纳人工智能(AI)输出而非人类专家判断。作者指出,某些AI系统因其可靠性和认知优势可被视为人工认识权威(Artificial Epistemic Authorities, AEAs),但传统“预设主义”(Preemptionism)主张以AI输出完全替代人类独立的认知理由,存在放大化的风险,如盲目服从、认知固化和基础脱钩等问题,尤其因AI的不透明性、自我强化权威及缺乏失败标记而加剧。论文提出一种更可行的替代方案——“总体证据视角”(total evidence view),其关键在于将AEA输出视为贡献性理由(contributory reasons),而非直接取代人类独立的认知考量。这一框架通过保持人类参与以防止专业能力退化、支持有意义的人类监督与控制,并合理解释在可靠性不足时对AI的正当怀疑,为高风险场景中AI deference 的合理性提供了原则性依据。

链接: https://arxiv.org/abs/2510.21043
作者: Benjamin Lange
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 12 pages

点击查看摘要

Abstract:When should we defer to AI outputs over human expert judgment? Drawing on recent work in social epistemology, I motivate the idea that some AI systems qualify as Artificial Epistemic Authorities (AEAs) due to their demonstrated reliability and epistemic superiority. I then introduce AI Preemptionism, the view that AEA outputs should replace rather than supplement a user’s independent epistemic reasons. I show that classic objections to preemptionism - such as uncritical deference, epistemic entrenchment, and unhinging epistemic bases - apply in amplified form to AEAs, given their opacity, self-reinforcing authority, and lack of epistemic failure markers. Against this, I develop a more promising alternative: a total evidence view of AI deference. According to this view, AEA outputs should function as contributory reasons rather than outright replacements for a user’s independent epistemic considerations. This approach has three key advantages: (i) it mitigates expertise atrophy by keeping human users engaged, (ii) it provides an epistemic case for meaningful human oversight and control, and (iii) it explains the justified mistrust of AI when reliability conditions are unmet. While demanding in practice, this account offers a principled way to determine when AI deference is justified, particularly in high-stakes contexts requiring rigorous reliability.
zh

[AI-54] Agent ArcEval: An Architecture Evaluation Method for Foundation Model based Agents

【速读】:该论文旨在解决基于基础模型(Foundation Models, FMs)的智能体(Agent)架构评估难题,传统评估方法难以应对智能体特有的复合架构、自主性与非确定性行为及持续演化等特性。解决方案的关键在于提出一种名为AgentArcEval的新颖评估方法,专门针对FM驱动的智能体架构复杂性设计,并配套构建了一个面向智能体的通用场景目录,用于指导具体场景的设计与评估。通过真实税务助手Luna系统的案例研究验证了该方法的有效性。

链接: https://arxiv.org/abs/2510.21031
作者: Qinghua Lu,Dehai Zhao,Yue Liu,Hao Zhang,Liming Zhu,Xiwei Xu,Angela Shi,Tristan Tan,Rick Kazman
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The emergence of foundation models (FMs) has enabled the development of highly capable and autonomous agents, unlocking new application opportunities across a wide range of domains. Evaluating the architecture of agents is particularly important as the architectural decisions significantly impact the quality attributes of agents given their unique characteristics, including compound architecture, autonomous and non-deterministic behaviour, and continuous evolution. However, traditional architecture evaluation methods fall short in addressing the evaluation needs of agent architectures, precisely because of these unique characteristics. Therefore, in this paper, we present AgentArcEval, a novel agent architecture evaluation method designed specifically to address the complexities of FM-based agent architecture and its evaluation. Moreover, we present a catalogue of agent-specific general scenarios, which serves as a guide for generating concrete scenarios to design and evaluate the agent architecture. We demonstrate the usefulness of AgentArcEval and the catalogue through a case study on the architecture evaluation of a real-world tax copilot, named Luna.
zh

[AI-55] Customizing Open Source LLM s for Quantitative Medication Attribute Extraction across Heterogeneous EHR Systems ALT NEURIPS2025

【速读】:该论文旨在解决电子健康记录(Electronic Health Record, EHR)系统中药物数据异构性导致的阿片类药物使用障碍(Opioid Use Disorder, OUD)治疗药物(Medication for Opioid Use Disorder, MOUD)监测难题。其核心问题是:不同医疗机构的EHR系统中关键处方属性(如处方日期、药品名称、疗程、总量、日剂量和复方次数)分散于格式各异的字段与自由文本中,难以统一提取和分析。解决方案的关键在于构建一个基于开源大语言模型(Large Language Models, LLMs)的定制化框架,通过固定JSON Schema处理原始数据,结合轻量级归一化与跨字段一致性校验,实现对MOUD处方属性的标准化抽取,并计算患者层面的标准化指标“MOUD天数”(MOUD days)。该方法显著提升了跨机构数据的可比性和隐私保护能力,避免了传统脆弱的站点特定ETL流程,支持本地部署以保障数据安全,从而实现真实世界环境中对MOUD暴露、依从性和保留情况的一致性分析。

链接: https://arxiv.org/abs/2510.21027
作者: Zhe Fei,Mehmet Yigit Turali,Shreyas Rajesh,Xinyang Dai,Huyen Pham,Pavan Holur,Yuhui Zhu,Larissa Mooney,Yih-Ing Hser,Vwani Roychowdhury
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: NeurIPS 2025: The Second Workshop on GenAI for Health: Potential, Trust, and Policy Compliance

点击查看摘要

Abstract:Harmonizing medication data across Electronic Health Record (EHR) systems is a persistent barrier to monitoring medications for opioid use disorder (MOUD). In heterogeneous EHR systems, key prescription attributes are scattered across differently formatted fields and free-text notes. We present a practical framework that customizes open source large language models (LLMs), including Llama, Qwen, Gemma, and MedGemma, to extract a unified set of MOUD prescription attributes (prescription date, drug name, duration, total quantity, daily quantity, and refills) from heterogeneous, site-specific data and compute a standardized metric of medication coverage, "MOUD days", per patient. Our pipeline processes records directly in a fixed JSON schema, followed by lightweight normalization and cross-field consistency checks. We evaluate the system on prescription-level EHR data from five clinics in a national OUD study (25,605 records from 1,257 patients), using a previously annotated benchmark of 10,369 records (776 patients) as the ground truth. Performance is reported as coverage (share of records with a valid, matchable output) and record-level exact-match accuracy. Larger models perform best overall: Qwen2.5-32B achieves 93.4% coverage with 93.0% exact-match accuracy across clinics, and MedGemma-27B attains 93.1%/92.2%. A brief error review highlights three common issues and fixes: imputing missing dosage fields using within-drug norms, handling monthly/weekly injectables (e.g., Vivitrol) by setting duration from the documented schedule, and adding unit checks to prevent mass units (e.g., "250 g") from being misread as daily counts. By removing brittle, site-specific ETL and supporting local, privacy-preserving deployment, this approach enables consistent cross-site analyses of MOUD exposure, adherence, and retention in real-world settings.
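
下面示意"固定 JSON Schema 抽取结果 → 轻量归一化与跨字段一致性校验 → 患者级 MOUD 天数"这一后处理链路;字段名与校验规则系参照摘要描述自行假设,并非论文代码:

```python
from collections import defaultdict

records = [  # 假设为 LLM 按固定 schema 输出的两条处方记录
    {"patient": "p1", "drug": "buprenorphine", "duration_days": 30,
     "daily_quantity": 2, "total_quantity": 60, "refills": 1},
    {"patient": "p1", "drug": "buprenorphine", "duration_days": None,
     "daily_quantity": 2, "total_quantity": 28, "refills": 0},
]

def normalize(rec):
    """跨字段一致性思想的最简版:duration 缺失时用 total / daily 填补。"""
    if rec["duration_days"] is None and rec["daily_quantity"]:
        rec["duration_days"] = rec["total_quantity"] // rec["daily_quantity"]
    return rec

def moud_days(records):
    days = defaultdict(int)
    for rec in map(normalize, records):
        covered = rec["duration_days"] * (1 + rec["refills"])  # 续配次数计入覆盖
        days[rec["patient"]] += covered
    return dict(days)

print(moud_days(records))   # {'p1': 74} = 30*(1+1) + 14
```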
zh

[AI-56] JSTprove: Pioneering Verifiable AI for a Trustless Future

【速读】:该论文旨在解决生成式 AI(Generative AI)在医疗、金融和网络安全等关键领域部署时面临的可信度、安全性和责任归属问题,尤其是如何在不泄露敏感数据的前提下验证AI推理的正确性。解决方案的关键在于提出JSTprove,一个基于Polyhedra Network Expander后端的专用零知识机器学习(Zero-Knowledge Machine Learning, zkML)工具包,它通过简化密码学复杂性并提供命令行接口与可审计产物,实现了从模型推理到证明生成与验证的端到端可验证AI流程,从而为AI开发者和ML工程师提供了既可用又可复现的zkML能力。

链接: https://arxiv.org/abs/2510.21024
作者: Jonathan Gold,Tristan Freiberg,Haruna Isah,Shirin Shahabi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: 13 pages, 8 figures, and 4 tables

点击查看摘要

Abstract:The integration of machine learning (ML) systems into critical industries such as healthcare, finance, and cybersecurity has transformed decision-making processes, but it also brings new challenges around trust, security, and accountability. As AI systems become more ubiquitous, ensuring the transparency and correctness of AI-driven decisions is crucial, especially when they have direct consequences on privacy, security, or fairness. Verifiable AI, powered by Zero-Knowledge Machine Learning (zkML), offers a robust solution to these challenges. zkML enables the verification of AI model inferences without exposing sensitive data, providing an essential layer of trust and privacy. However, traditional zkML systems typically require deep cryptographic expertise, placing them beyond the reach of most ML engineers. In this paper, we introduce JSTprove, a specialized zkML toolkit, built on Polyhedra Network’s Expander backend, to enable AI developers and ML engineers to generate and verify proofs of AI inference. JSTprove provides an end-to-end verifiable AI inference pipeline that hides cryptographic complexity behind a simple command-line interface while exposing auditable artifacts for reproducibility. We present the design, innovations, and real-world use cases of JSTprove as well as our blueprints and tooling to encourage community review and extension. JSTprove therefore serves both as a usable zkML product for current engineering needs and as a reproducible foundation for future research and production deployments of verifiable AI.
zh

[AI-57] Physically consistent and uncertainty-aware learning of spatiotemporal dynamics

【速读】:该论文旨在解决现有机器学习方法在时空动态长期预测中忽视物理定律约束且无法量化预测不确定性的问题。其核心解决方案是提出一种物理一致性神经算子(Physics-Consistent Neural Operator, PCNO),通过将代理模型输出投影到满足预设物理规律的函数空间来强制实现物理约束,其中物理一致性投影层可在傅里叶空间高效计算质量与动量守恒;进一步地,基于PCNO构建扩散模型增强的DiffPCNO框架,利用一致性模型对预测不确定性进行量化和抑制,从而提升预测精度与可靠性。该两阶段框架实现了跨多种系统和空间分辨率下的高保真、物理一致且具备不确定性感知能力的时空预测。

链接: https://arxiv.org/abs/2510.21023
作者: Qingsong Xu,Jonathan L Bamber,Nils Thuerey,Niklas Boers,Paul Bates,Gustau Camps-Valls,Yilei Shi,Xiao Xiang Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注: Main text: 33 pages, 6 figures

点击查看摘要

Abstract:Accurate long-term forecasting of spatiotemporal dynamics remains a fundamental challenge across scientific and engineering domains. Existing machine learning methods often neglect governing physical laws and fail to quantify inherent uncertainties in spatiotemporal predictions. To address these challenges, we introduce a physics-consistent neural operator (PCNO) that enforces physical constraints by projecting surrogate model outputs onto function spaces satisfying predefined laws. A physics-consistent projection layer within PCNO efficiently computes mass and momentum conservation in Fourier space. Building upon deterministic predictions, we further propose a diffusion model-enhanced PCNO (DiffPCNO), which leverages a consistency model to quantify and mitigate uncertainties, thereby improving the accuracy and reliability of forecasts. PCNO and DiffPCNO achieve high-fidelity spatiotemporal predictions while preserving physical consistency and uncertainty across diverse systems and spatial resolutions, ranging from turbulent flow modeling to real-world flood/atmospheric forecasting. Our two-stage framework provides a robust and versatile approach for accurate, physically grounded, and uncertainty-aware spatiotemporal forecasting.
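
PCNO 的物理一致性投影层在傅里叶空间施加守恒约束。下面给出二维不可压速度场的经典 Helmholtz(Leray)投影示意,把场投影到无散度子空间;网格与速度场均为随机假设,仅对应"谱空间守恒投影"这一思想的最简版本,而非论文中的完整投影层:

```python
import numpy as np

def project_divergence_free(u, v):
    """谱空间移除有散分量: u_hat <- u_hat - k (k·u_hat) / |k|^2。"""
    n = u.shape[0]
    kx = np.fft.fftfreq(n).reshape(-1, 1)
    ky = np.fft.fftfreq(n).reshape(1, -1)
    k2 = kx**2 + ky**2
    k2[0, 0] = 1.0                       # 避免除零(零模本身无散度贡献)
    u_h, v_h = np.fft.fft2(u), np.fft.fft2(v)
    div = kx * u_h + ky * v_h            # 散度的谱表示(相差 2πi 因子,不影响投影)
    u_h -= kx * div / k2
    v_h -= ky * div / k2
    return np.fft.ifft2(u_h).real, np.fft.ifft2(v_h).real

rng = np.random.default_rng(0)
u, v = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
u2, v2 = project_divergence_free(u, v)

# 验证:投影后的谱空间散度应接近机器精度的 0
kx = np.fft.fftfreq(64).reshape(-1, 1); ky = np.fft.fftfreq(64).reshape(1, -1)
print(np.abs(kx * np.fft.fft2(u2) + ky * np.fft.fft2(v2)).max())
```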
zh

[AI-58] Race and Gender in LLM -Generated Personas: A Large-Scale Audit of 41 Occupations

【速读】:该论文旨在解决生成式 AI (Generative AI) 工具在生成职业人物形象时可能加剧种族与性别偏见的问题,特别是评估其对不同人口群体代表性的影响。研究通过大规模审计超过150万条来自四个不同国家背景和安全承诺的大语言模型(LLM)生成的职业人物数据,发现系统性偏差(systematic shifts)和刻板印象放大(stereotype exaggeration)两类模式普遍存在:例如白人和黑人群体普遍被低估,而西班牙裔和亚裔则被高估;某些职业如家政人员几乎全被描绘为西班牙裔,黑人群体则在多个职业中被完全抹除。解决方案的关键在于强调模型提供商的选择直接影响可见群体构成,从而推动针对特定模型的审计机制与负责任的设计实践(accountable design practices),以减少生成内容中的结构性不平等。

链接: https://arxiv.org/abs/2510.21011
作者: Ilona van der Linden,Sahana Kumar,Arnav Dixit,Aadi Sudan,Smruthi Danda,David C. Anastasiu,Kai Lukoff
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Generative AI tools are increasingly used to create portrayals of people in occupations, raising concerns about how race and gender are represented. We conducted a large-scale audit of over 1.5 million occupational personas across 41 U.S. occupations, generated by four large language models with different AI safety commitments and countries of origin (U.S., China, France). Compared with Bureau of Labor Statistics data, we find two recurring patterns: systematic shifts, where some groups are consistently under- or overrepresented, and stereotype exaggeration, where existing demographic skews are amplified. On average, White (–31pp) and Black (–9pp) workers are underrepresented, while Hispanic (+17pp) and Asian (+12pp) workers are overrepresented. These distortions can be extreme: for example, across all four models, Housekeepers are portrayed as nearly 100% Hispanic, while Black workers are erased from many occupations. For HCI, these findings show provider choice materially changes who is visible, motivating model-specific audits and accountable design practices.
zh

[AI-59] Exploring Spiking Neural Networks for Binary Classification in Multivariate Time Series at the Edge IJCNN

【速读】:该论文旨在解决如何在低信噪比环境下对多变量时间序列数据进行高精度、低误报率的二分类任务,特别是在资源受限场景下实现高效部署的问题。其核心挑战在于传统方法(如PCA和深度学习)难以在保持高真阳性率(TPR)的同时控制假警报率(FAR),且模型复杂度较高。解决方案的关键在于提出一种基于进化优化的脉冲神经网络(Spiking Neural Networks, SNNs)训练框架——EONS算法,通过联合优化网络结构与参数,生成稀疏、状态感知的SNN模型;输入采用脉冲编码,输出通过单一神经元的尖峰计数阈值判定;并引入简单投票集成策略提升鲁棒性与性能。该方法在伽马射线辐射源检测中仅用49个神经元和66个突触即达到51.8% TPR@1/hr FAR,优于基准方法,并在脑电图癫痫检测任务中无需领域调整即可实现95% TPR,验证了其通用性和高效性。

链接: https://arxiv.org/abs/2510.20997
作者: James Ghawaly,Andrew Nicholson,Catherine Schuman,Dalton Diez,Aaron Young,Brett Witherspoon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted in 2025 International Joint Conference on Neural Networks (IJCNN)

点击查看摘要

Abstract:We present a general framework for training spiking neural networks (SNNs) to perform binary classification on multivariate time series, with a focus on step-wise prediction and high precision at low false alarm rates. The approach uses the Evolutionary Optimization of Neuromorphic Systems (EONS) algorithm to evolve sparse, stateful SNNs by jointly optimizing their architectures and parameters. Inputs are encoded into spike trains, and predictions are made by thresholding a single output neuron's spike counts. We also incorporate simple voting ensemble methods to improve performance and robustness. To evaluate the framework, we apply it with application-specific optimizations to the task of detecting low signal-to-noise ratio radioactive sources in gamma-ray spectral data. The resulting SNNs, with as few as 49 neurons and 66 synapses, achieve a 51.8% true positive rate (TPR) at a false alarm rate of 1/hr, outperforming PCA (42.7%) and deep learning (49.8%) baselines. A three-model any-vote ensemble increases TPR to 67.1% at the same false alarm rate. Hardware deployment on the microCaspian neuromorphic platform demonstrates 2mW power consumption and 20.2ms inference latency. We also demonstrate generalizability by applying the same framework, without domain-specific modification, to seizure detection in EEG recordings. An ensemble achieves 95% TPR with a 16% false positive rate, comparable to recent deep learning approaches with significant reduction in parameter count.
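
该框架的输入输出接口很简单:输入编码为脉冲序列,输出按单个输出神经元的脉冲计数做阈值判决。下面示意泊松速率编码与计数阈值判决;EONS 演化出的网络本身不在此重现,这里用随机权重的单个 LIF 神经元占位,所有参数均为假设:

```python
import numpy as np

rng = np.random.default_rng(0)

def rate_encode(x, T=100):
    """把归一化特征 x ∈ [0,1]^d 编码为 T 个时间步的泊松脉冲序列。"""
    return (rng.random((T, x.size)) < x).astype(float)

def lif_output_spikes(spikes, w, tau=0.9, thresh=1.0):
    """单个 LIF 输出神经元:漏积分 -> 过阈发放 -> 复位,返回脉冲计数。"""
    v, count = 0.0, 0
    for t in range(spikes.shape[0]):
        v = tau * v + spikes[t] @ w
        if v >= thresh:
            count += 1
            v = 0.0
    return count

x = np.array([0.1, 0.8, 0.3, 0.6])       # 一条多变量时间序列窗口的归一化特征(假设)
w = rng.normal(0.1, 0.2, size=x.size)    # 演化得到的突触权重在此用随机值占位
n_spikes = lif_output_spikes(rate_encode(x), w)
print("alarm" if n_spikes > 5 else "background", n_spikes)  # 脉冲计数阈值判决
```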
zh

[AI-60] GPU Memory Requirement Prediction for Deep Learning Task Based on Bidirectional Gated Recurrent Unit Optimization Transformer

【速读】:该论文旨在解决深度学习任务中GPU内存资源预测精度不足的问题,以支持更高效的资源调度与管理。其解决方案的关键在于提出一种融合双向门控循环单元(Bidirectional Gated Recurrent Unit, BiGRU)的Transformer优化模型,通过引入BiGRU结构增强时序特征提取能力,并对Transformer架构进行改进,从而显著提升预测准确性。实验结果表明,该模型在均方误差(MSE)、均方根误差(RMSE)、平均绝对误差(MAE)及决定系数(R²)等指标上均优于决策树、随机森林、Adaboost和XGBoost等传统机器学习方法,展现出更强的稳定性与预测性能。

链接: https://arxiv.org/abs/2510.20985
作者: Chao Wang,Zhizhao Wen,Ruoxin Zhang,Puyang Xu,Yifan Jiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In response to the increasingly critical demand for accurate prediction of GPU memory resources in deep learning tasks, this paper deeply analyzes the current research status and innovatively proposes a deep learning model that integrates bidirectional gated recurrent units (BiGRU) to optimize the Transformer architecture, aiming to improve the accuracy of memory demand prediction. To verify the effectiveness of the model, a carefully designed comparative experiment was conducted, selecting four representative basic machine learning models: decision tree, random forest, Adaboost, and XGBoost as benchmarks. The detailed experimental results show that the BiGRU Transformer optimization model proposed in this paper exhibits significant advantages in key evaluation indicators: in terms of mean square error (MSE) and root mean square error (RMSE), the model achieves the lowest value among all comparison models, and its predicted results have the smallest deviation from the actual values; In terms of mean absolute error (MAE) and coefficient of determination (R2) indicators, the model also performs well and the results are balanced and stable, with comprehensive predictive performance far exceeding the benchmark machine learning methods compared. In summary, the Transformer model based on bidirectional gated recurrent unit optimization successfully constructed in this study can efficiently and accurately complete GPU memory demand prediction tasks in deep learning tasks, and its prediction accuracy has been significantly improved compared to traditional machine learning methods. This research provides strong technical support and reliable theoretical basis for optimizing resource scheduling and management of deep learning tasks, and improving the utilization efficiency of computing clusters.
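
论文把 BiGRU 与 Transformer 编码器结合用于显存需求的回归预测。下面给出一个符合该思路的最小 PyTorch 结构草图;层数、维度与"取最后时间步"的聚合方式均为假设,论文未公开实现细节:

```python
import torch
import torch.nn as nn

class BiGRUTransformer(nn.Module):
    """BiGRU 提取时序特征 -> Transformer 编码器自注意力 -> 回归头(示意结构)。"""
    def __init__(self, n_features, hidden=64, heads=4, layers=2):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True, bidirectional=True)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=2 * hidden, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.head = nn.Linear(2 * hidden, 1)   # 输出:GPU 显存需求(回归值)

    def forward(self, x):                      # x: (batch, seq_len, n_features)
        h, _ = self.gru(x)                     # (batch, seq_len, 2*hidden)
        h = self.encoder(h)
        return self.head(h[:, -1])             # 取最后时间步做预测

model = BiGRUTransformer(n_features=16)
print(model(torch.randn(8, 32, 16)).shape)     # torch.Size([8, 1])
```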
zh

[AI-61] Learning Grouped Lattice Vector Quantizers for Low-Bit LLM Compression NEURIPS2025

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理过程中对计算资源和内存需求过高,而传统后训练量化(Post-Training Quantization, PTQ)方法在低比特场景下易导致显著性能下降的问题。其解决方案的关键在于提出一种分组格点向量量化(Grouped Lattice Vector Quantization, GLVQ)框架,该框架为每组权重分配一个由可学习生成矩阵定义的定制化格点码本,并采用Babai取整近似非可微量化过程中的最近格点搜索,从而实现生成矩阵的稳定优化;训练完成后,解码仅需矩阵-向量乘法,具备高效性和实用性。实验表明,该方法在模型尺寸与精度之间实现了优于现有PTQ基线的权衡,尤其适用于资源受限环境下的大模型部署。

链接: https://arxiv.org/abs/2510.20984
作者: Xi Zhang,Xiaolin Wu,Jiamang Wang,Weisi Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025 Poster

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities but typically require extensive computational resources and memory for inference. Post-training quantization (PTQ) can effectively reduce these demands by storing weights in lower bit-width formats. However, standard uniform quantization often leads to notable performance degradation, particularly in low-bit scenarios. In this work, we introduce a Grouped Lattice Vector Quantization (GLVQ) framework that assigns each group of weights a customized lattice codebook, defined by a learnable generation matrix. To address the non-differentiability of the quantization process, we adopt Babai rounding to approximate nearest-lattice-point search during training, which enables stable optimization of the generation matrices. Once trained, decoding reduces to a simple matrix-vector multiplication, yielding an efficient and practical quantization pipeline. Experiments on multiple benchmarks show that our approach achieves a better trade-off between model size and accuracy compared to existing post-training quantization baselines, highlighting its effectiveness in deploying large models under stringent resource constraints. Our source code is available on GitHub repository: this https URL.
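
GLVQ 的核心运算是:用可学习生成矩阵 G 定义格点,量化时用 Babai 取整求近似最近格点,解码仅为一次矩阵-向量乘法。下面用 NumPy 演示这一对量化/解码操作;G 与分组维度为随机假设,不包含论文的训练过程:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                    # 每组权重的维度(假设)
G = rng.normal(0, 0.1, size=(d, d))      # 格点生成矩阵(此处随机值代替训练结果)

def quantize(w, G):
    """Babai 取整:z = round(G^{-1} w);实际只需存储整数码字 z。"""
    z = np.rint(np.linalg.solve(G, w))
    return z.astype(np.int32)

def dequantize(z, G):
    return G @ z                          # 解码只是一次矩阵-向量乘法 q = G z

w = rng.normal(0, 0.2, size=d)            # 一组原始权重
z = quantize(w, G)
print("codes:", z, " reconstruction error:", np.linalg.norm(w - dequantize(z, G)))
```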
zh

[AI-62] Memory Constrained Dynamic Subnetwork Update for Transfer Learning

【速读】:该论文旨在解决在设备端(on-device)进行神经网络训练时面临的内存约束问题,该问题限制了预训练模型在下游任务中的适应能力。解决方案的关键在于提出了一种理论驱动的动态子网络适配框架MeDyate,其核心创新包括:(1)LaRa(Layer Ranking)层重要性度量方法,用于有原则地进行层预筛选;(2)一种利用微调过程中通道重要性分布时间稳定性的动态通道采样策略,通过基于重要性加权的概率在每轮迭代间重采样通道,从而在严格内存预算下实现参数空间的充分探索。该方法在极端内存限制下实现了当前最优性能,同时保持高计算效率,为设备端高效学习提供了可行路径。

链接: https://arxiv.org/abs/2510.20979
作者: Aël Quélennec,Pavlo Mozharovskyi,Van-Tam Nguyen,Enzo Tartaglione
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:On-device neural network training faces critical memory constraints that limit the adaptation of pre-trained models to downstream tasks. We present MeDyate, a theoretically-grounded framework for memory-constrained dynamic subnetwork adaptation. Our approach introduces two key innovations: LaRa (Layer Ranking), an improved layer importance metric that enables principled layer pre-selection, and a dynamic channel sampling strategy that exploits the temporal stability of channel importance distributions during fine-tuning. MeDyate dynamically resamples channels between epochs according to importance-weighted probabilities, ensuring comprehensive parameter space exploration while respecting strict memory budgets. Extensive evaluation across a large panel of tasks and architectures demonstrates that MeDyate achieves state-of-the-art performance under extreme memory constraints, consistently outperforming existing static and dynamic approaches while maintaining high computational efficiency. Our method represents a significant step towards enabling efficient on-device learning by demonstrating effective fine-tuning with memory budgets as low as a few hundred kB of RAM.
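
MeDyate 在每个 epoch 之间按重要性加权的概率重新采样参与更新的通道。下面示意该采样步骤:给定通道重要性分数与内存预算 k,做无放回加权抽样;重要性分数在此用随机数占位,LaRa 层排序不在本示意范围内:

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_channels(importance, k):
    """按重要性加权概率无放回抽取 k 个通道参与本 epoch 的微调。"""
    p = importance / importance.sum()
    return rng.choice(importance.size, size=k, replace=False, p=p)

importance = rng.gamma(2.0, 1.0, size=256)     # 某层 256 个通道的重要性(假设值)
for epoch in range(3):
    active = resample_channels(importance, k=32)   # 内存预算只允许训练 32 个通道
    print(f"epoch {epoch}: train channels {np.sort(active)[:6]} ...")
```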
zh

[AI-63] REx86: A Local Large Language Model for Assisting in x86 Assembly Reverse Engineering ACSAC

【速读】:该论文旨在解决x86二进制反向工程(Reverse Engineering, RE)中因符号信息被剥离和对抗性混淆导致的效率低下问题,尤其在无法使用云端闭源大语言模型(Large Language Models, LLMs)的封闭网络环境中。其核心解决方案是通过领域特定的数据微调,构建一个本地部署、开源权重的高效LLM辅助工具——命名为REx86,该模型基于Qwen2.5-Coder-7B架构,采用LoRA(Low-Rank Adaptation)参数高效微调方法,在自建的5,981条x86汇编示例数据集上训练。实验表明,REx86在测试集上的交叉熵损失降低64.2%,语义相似度提升20.3%,并在用户案例研究中显著增强代码理解能力(p=0.031),且生成注释更准确、简洁、幻觉更少,成为当前本地开源LLM中性能最优的x86 RE辅助工具。

链接: https://arxiv.org/abs/2510.20975
作者: Darrin Lea,James Ghawaly,Golden Richard III,Aisha Ali-Gombe,Andrew Case
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted in 2025 Annual Computer Security Applications Conference (ACSAC)

点击查看摘要

Abstract:Reverse engineering (RE) of x86 binaries is indispensable for malware and firmware analysis, but remains slow due to stripped metadata and adversarial obfuscation. Large Language Models (LLMs) offer potential for improving RE efficiency through automated comprehension and commenting, but cloud-hosted, closed-weight models pose privacy and security risks and cannot be used in closed-network facilities. We evaluate parameter-efficient fine-tuned local LLMs for assisting with x86 RE tasks in these settings. Eight open-weight models across the CodeLlama, Qwen2.5-Coder, and CodeGemma series are fine-tuned on a custom curated dataset of 5,981 x86 assembly examples. We evaluate them quantitatively and identify the fine-tuned Qwen2.5-Coder-7B as the top performer, which we name REx86. REx86 reduces test-set cross-entropy loss by 64.2% and improves semantic cosine similarity against ground truth by 20.3% over its base model. In a limited user case study (n=43), REx86 significantly enhanced line-level code understanding (p = 0.031) and increased the correct-solve rate from 31% to 53% (p = 0.189), though the latter did not reach statistical significance. Qualitative analysis shows more accurate, concise comments with fewer hallucinations. REx86 delivers state-of-the-art assistance in x86 RE among local, open-weight LLMs. Our findings demonstrate the value of domain-specific fine-tuning, and highlight the need for more commented disassembly data to further enhance LLM performance in RE. REx86, its dataset, and LoRA adapters are publicly available at this https URL and this https URL.
zh

[AI-64] Meta-Learning for Cross-Task Generalization in Protein Mutation Property Prediction

链接: https://arxiv.org/abs/2510.20943
作者: Srivathsan Badrinarayanan,Yue Su,Janghoon Ock,Alan Pham,Sanya Ahuja,Amir Barati Farimani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-65] Security Logs to ATT&CK Insights: Leveraging LLMs for High-Level Threat Understanding and Cognitive Trait Inference

【速读】:该论文旨在解决传统网络安全分析中难以实时推断攻击者意图与认知策略的问题,即如何从低层级的系统遥测数据(如Suricata入侵检测系统IDS日志)中提取高阶攻击行为特征,并将其映射到MITRE ATT&CK技术框架及潜在的认知动机。其解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLMs)的新型分析框架,通过设计策略驱动的提示(prompt)系统对海量网络日志进行行为阶段分割,使LLM能够识别不同行为阶段对应的攻击技术及其背后的心理学机制(如损失厌恶、风险偏好或目标坚持等认知偏差),从而在包级日志与战略意图之间建立语义桥梁,为行为自适应网络安全防御奠定基础。

链接: https://arxiv.org/abs/2510.20930
作者: Soham Hans,Stacy Marsella,Sophia Hirschmann,Nikolos Gurney
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Understanding adversarial behavior in cybersecurity has traditionally relied on high-level intelligence reports and manual interpretation of attack chains. However, real-time defense requires the ability to infer attacker intent and cognitive strategy directly from low-level system telemetry such as intrusion detection system (IDS) logs. In this paper, we propose a novel framework that leverages large language models (LLMs) to analyze Suricata IDS logs and infer attacker actions in terms of MITRE ATT&CK techniques. Our approach is grounded in the hypothesis that attacker behavior reflects underlying cognitive biases such as loss aversion, risk tolerance, or goal persistence that can be extracted and modeled through careful observation of log sequences. This lays the groundwork for future work on behaviorally adaptive cyber defense and cognitive trait inference. We develop a strategy-driven prompt system to segment large amounts of network log data into distinct behavioral phases in a highly efficient manner, enabling the LLM to associate each phase with likely techniques and underlying cognitive motives. By mapping network-layer events to high-level attacker strategies, our method reveals how behavioral signals such as tool switching, protocol transitions, or pivot patterns correspond to psychologically meaningful decision points. The results demonstrate that LLMs can bridge the semantic gap between packet-level logs and strategic intent, offering a pathway toward cognitive-adaptive cyber defense. Keywords: Cognitive Cybersecurity, Large Language Models (LLMs), Cyberpsychology, Intrusion Detection Systems (IDS), MITRE ATT&CK, Cognitive Biases
zh

[AI-66] Aircraft Collision Avoidance Systems: Technological Challenges and Solutions on the Path to Regulatory Acceptance

【速读】:该论文旨在解决航空器防撞系统(Aircraft Collision Avoidance Systems, ACAS)设计中的关键技术挑战,包括监视(surveillance)、决策制定(decision making)和验证(validation)等方面的问题。其解决方案的关键在于提出并总结那些经过严格验证流程、被监管机构采纳的防撞技术方案,强调这些方案在实际应用中的可靠性与安全性,从而为其他安全关键系统提供可借鉴的工程实践路径。

链接: https://arxiv.org/abs/2510.20916
作者: Sydney M. Katz,Robert J. Moss,Dylan M. Asmar,Wesley A. Olson,James K. Kuchar,Mykel J. Kochenderfer
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 32 pages, 9 figures

点击查看摘要

Abstract:Aircraft collision avoidance systems are critical to modern aviation. These systems are designed to predict potential collisions between aircraft and recommend appropriate avoidance actions. Creating effective collision avoidance systems requires solutions to a variety of technical challenges related to surveillance, decision making, and validation. These challenges have sparked significant research and development efforts over the past several decades that have resulted in a variety of proposed solutions. This article provides an overview of these challenges and solutions with an emphasis on those that have been put through a rigorous validation process and accepted by regulatory bodies. The challenges posed by the collision avoidance problem are often present in other domains, and aircraft collision avoidance systems can serve as case studies that provide valuable insights for a wide range of safety-critical systems.
zh

[AI-67] HA-RAG : Hotness-Aware RAG Acceleration via Mixed Precision and Data Placement

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统在长上下文处理中因外部知识库引入而导致的内存消耗高和推理延迟大的问题。解决方案的关键在于提出一种基于热度感知(hotness-aware)的推理优化框架HA-RAG:首先,利用KV(Key-Value)块访问频率的分布特性,设计了一种热度感知的混合精度压缩与按需加载方法,以降低磁盘I/O和内存访问开销;其次,提出热度感知的数据放置策略,将高频访问的KV块优先存储于高速内存中,从而提升数据访问效率。实验表明,相较于TurboRAG,HA-RAG在Time-To-First-Token(TTFT)指标上平均提速2.10倍、最大提速达10.49倍,且精度损失可忽略。

链接: https://arxiv.org/abs/2510.20878
作者: Danying Ge,Jianhua Gao,Yixue Yang,Weixing Ji
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 16 figures, 2 tables

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) improves model output accuracy by leveraging external knowledge bases, serving as an effective solution to address hallucination issues and knowledge-update delays in Large Language Models (LLMs). However, the introduction of external knowledge bases presents RAG with challenges in long-context processing, significantly increasing memory consumption and inference latency. Existing research accelerates inference by precomputing Key and Value (KV) of the knowledge base and loading them on-demand during inference. Based on the access frequency of different KV chunks within the external knowledge base, this paper proposes a hotness-aware RAG (HA-RAG) inference optimization system. First, leveraging the numerical distribution of KV chunks, we introduce a hotness-aware mixed-precision compressing and loading method to reduce disk I/O and memory access overhead. Second, we design a hotness-aware data placement strategy that prioritizes storing frequently accessed KV chunks in high-speed memory to improve data access efficiency. Experimental results demonstrate that, compared with TurboRAG, the proposed HA-RAG achieves an average speedup of 2.10x and maximum speedup of 10.49x in Time-To-First-Token (TTFT) with negligible accuracy loss.
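
下面示意"按 KV 块访问热度决定压缩精度与存放层级"的策略骨架:热块用高精度并常驻高速内存,冷块低比特压缩、按需从磁盘加载。热度阈值与精度档位均为假设,仅对应摘要所述思想,而非 HA-RAG 的实际实现:

```python
def plan_kv_chunks(access_counts, hot_ratio=0.2):
    """输入各 KV 块的访问频次,输出 {块: (量化精度, 存放位置)} 方案。"""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    n_hot = max(1, int(len(ranked) * hot_ratio))
    plan = {}
    for i, chunk in enumerate(ranked):
        if i < n_hot:
            plan[chunk] = ("fp16", "DRAM")     # 热块:高精度、常驻高速内存
        else:
            plan[chunk] = ("int4", "disk")     # 冷块:低比特压缩、按需加载
    return plan

counts = {"doc3": 120, "doc7": 95, "doc1": 8, "doc4": 3, "doc9": 1}
for chunk, (prec, place) in plan_kv_chunks(counts).items():
    print(chunk, prec, place)
```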
zh

[AI-68] Multimodal Negative Learning NEURIPS2025

【速读】:该论文旨在解决多模态学习系统中因模态不平衡(modality imbalance)导致的弱模态信息被主导模态压制的问题。传统方法通过“正向学习”(Learning to be)强制弱模态与主导模态对齐,容易丢失其特有的判别信息。本文提出新的“负向学习”(Learning Not to be)范式,核心在于让主导模态动态引导弱模态抑制非目标类别,而非增强其目标类预测,从而稳定决策空间并保留模态特异性信息。关键创新是构建了多模态负向学习(Multimodal Negative Learning, MNL)框架,引入动态引导机制,并理论证明其通过提升单模态置信度边界(Unimodal Confidence Margin, UCoM)可收紧多模态学习的鲁棒性下界,显著降低弱模态在噪声和不平衡场景下的经验误差。

链接: https://arxiv.org/abs/2510.20877
作者: Baoquan Gong,Xiyuan Gao,Pengfei Zhu,Qinghua Hu,Bing Cao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published in NeurIPS 2025

点击查看摘要

Abstract:Multimodal learning systems often encounter challenges related to modality imbalance, where a dominant modality may overshadow others, thereby hindering the learning of weak modalities. Conventional approaches often force weak modalities to align with dominant ones in “Learning to be (the same)” (Positive Learning), which risks suppressing the unique information inherent in the weak modalities. To address this challenge, we offer a new learning paradigm: “Learning Not to be” (Negative Learning). Instead of enhancing weak modalities’ target-class predictions, the dominant modalities dynamically guide the weak modality to suppress non-target classes. This stabilizes the decision space and preserves modality-specific information, allowing weak modalities to preserve unique information without being over-aligned. We proceed to reveal multimodal learning from a robustness perspective and theoretically derive the Multimodal Negative Learning (MNL) framework, which introduces a dynamic guidance mechanism tailored for negative learning. Our method provably tightens the robustness lower bound of multimodal learning by increasing the Unimodal Confidence Margin (UCoM) and reduces the empirical error of weak modalities, particularly under noisy and imbalanced scenarios. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generalizability of our approach against competing methods. The code will be available at this https URL.
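
"Learning Not to be"的一个直观实现是:让弱模态压低非目标类的概率,并以主导模态对各非目标类的置信度作为动态引导权重。下面的 PyTorch 损失函数是按此思想写的示意版本,并非论文公式的逐字复现:

```python
import torch
import torch.nn.functional as F

def multimodal_negative_loss(weak_logits, dom_probs, target, eps=1e-8):
    """弱模态的负向学习:压低非目标类概率,权重来自主导模态的类别置信度。"""
    p_weak = weak_logits.softmax(dim=-1)
    target_mask = F.one_hot(target, p_weak.size(-1)).bool()
    guide = dom_probs.masked_fill(target_mask, 0.0)     # 仅对非目标类施加引导
    # 对每个非目标类 k:-guide_k * log(1 - p_k),即鼓励"不是 k"
    nll = -(guide * torch.log1p(-p_weak.clamp(max=1 - eps))).sum(-1)
    return nll.mean()

weak_logits = torch.randn(4, 10)                 # 弱模态 logits(假设 10 类)
dom_probs = torch.randn(4, 10).softmax(-1)       # 主导模态的预测分布
target = torch.tensor([1, 3, 3, 7])
print(multimodal_negative_loss(weak_logits, dom_probs, target))
```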
zh

[AI-69] CC-GRMAS: A Multi-Agent Graph Neural System for Spatiotemporal Landslide Risk Assessment in High Mountain Asia

链接: https://arxiv.org/abs/2510.20875
作者: Mihir Panchal,Ying-Jung Chen,Surya Parkash
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-70] Crisis-Resilient Portfolio Management via Graph-based Spatio-Temporal Learning

【速读】:该论文旨在解决金融时间序列预测中因市场危机机制不同(如信用传染、疫情冲击或通胀驱动的抛售)而导致资产相关性结构发生动态变化的问题,传统基于预设图拓扑(如相关阈值或行业分类)的方法无法适应这些 regime-dependent(制度依赖)的关联模式。其解决方案的关键在于提出 CRISP 框架——一种融合图卷积网络(Graph Convolutional Networks, GCN)与双向长短期记忆网络(BiLSTM)结合自注意力机制的时空学习模型,并通过多头图注意力网络(Multi-head Graph Attention Networks)自动学习稀疏且具解释性的资产关系结构。该方法在训练数据涵盖信用与疫情危机的基础上,实现了对通胀驱动市场(2022–2024)的稳健泛化能力,准确捕捉不同危机情境下的关键依赖关系,从而支持自适应组合配置,在下行周期中保持盈利性,Sharpe 比率达 3.76,显著优于等权重基线和静态图方法。

链接: https://arxiv.org/abs/2510.20868
作者: Zan Li,Rui Fan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:

点击查看摘要

Abstract:Financial time series forecasting faces a fundamental challenge: predicting optimal asset allocations requires understanding regime-dependent correlation structures that transform during crisis periods. Existing graph-based spatio-temporal learning approaches rely on predetermined graph topologies (correlation thresholds, sector classifications) that fail to adapt when market dynamics shift across different crisis mechanisms: credit contagion, pandemic shocks, or inflation-driven selloffs. We present CRISP (Crisis-Resilient Investment through Spatio-temporal Patterns), a graph-based spatio-temporal learning framework that encodes spatial relationships via Graph Convolutional Networks and temporal dynamics via BiLSTM with self-attention, then learns sparse structures through multi-head Graph Attention Networks. Unlike fixed-topology methods, CRISP discovers which asset relationships matter through attention mechanisms, filtering 92.5% of connections as noise while preserving crisis-relevant dependencies for accurate regime-specific predictions. Trained on 2005–2021 data encompassing credit and pandemic crises, CRISP demonstrates robust generalization to 2022–2024 inflation-driven markets, a fundamentally different regime, by accurately forecasting regime-appropriate correlation structures. This enables adaptive portfolio allocation that maintains profitability during downturns, achieving a Sharpe ratio of 3.76: a 707% improvement over equal-weight baselines and a 94% improvement over static graph methods. Learned attention weights provide interpretable regime detection, with defensive cluster attention strengthening 49% during crises versus 31% market-wide, emergent behavior from learning to forecast rather than imposing assumptions.
zh

[AI-71] Incentivizing Consistent, Effective, and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards

【速读】:该论文旨在解决音频大语言模型(Audio Large Language Models, Audio LLMs)中推理能力不足的问题,特别是“测试时反向缩放”(test-time inverse scaling)现象——即随着推理链长度增加,模型性能反而下降。研究表明,这一问题并非源于推理机制本身的局限,而是由于训练过程中缺乏对推理过程的有效引导,导致生成的推理链条存在幻觉和不一致性,错误随链长累积。解决方案的关键在于提出CESAR(Consistent, Effective, and Scalable Audio Reasoners),采用在线强化学习框架,通过多维度奖励机制(包括正确性、格式规范性、一致性、结构化分析模式、因果推理、领域知识融合及推理深度校准)直接优化推理过程本身,而非仅验证最终结果。该方法有效缓解了测试时反向缩放,实现了推理能力从负面影响到正向增益的转变,并揭示了不同模型的“推理甜点”(reasoning sweet spots),显著提升了音频推理任务的性能与鲁棒性。

链接: https://arxiv.org/abs/2510.20867
作者: Jiajun Fan,Roger Ren,Jingyuan Li,Rahul Pandey,Prashanth Gurunath Shivakumar,Ivan Bulyko,Ankur Gandhe,Ge Liu,Yile Gu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 49 pages

点击查看摘要

Abstract:The role of reasoning in Audio Large Language Models remains widely underexplored, as introducing a reasoning process often degrades rather than improves performance during inference, a phenomenon we term test-time inverse scaling, where longer reasoning chains yield progressively worse results. We demonstrate that this stems not from fundamental limitations of reasoning itself, but from inadequate training: models without proper guidance for the reasoning process produce hallucinatory, inconsistent reasoning that accumulates errors over longer chains. To address these challenges, we introduce CESAR (Consistent, Effective, and Scalable Audio Reasoners), shifting from outcome verification to rewarding the reasoning process. Our online reinforcement learning framework employs Group Relative Policy Optimization with a multi-faceted reward suite that incentivizes not only correctness and format but also consistency, structured analytical patterns, causal reasoning, domain-knowledge integration, and calibrated reasoning depth. CESAR resolves test-time inverse scaling, transforming reasoning from detriments into gains while revealing model-specific "reasoning sweet spots", where performance peaks during test-time scaling. We achieve state-of-the-art results on MMAU Test-mini, substantially outperforming Gemini 2.5 Pro and GPT-4o Audio, and near-human-level performance on MMSU reasoning tasks. Through AI-as-judge evaluations and qualitative comparisons, we provide both quantitative and qualitative validation of our improved reasoning quality. Importantly, enhanced reasoning creates synergistic effects, simultaneously improving multimodal reasoning and perception capabilities. Overall, CESAR establishes a principled method for developing robust and scalable reasoning in Audio LLMs.
zh

[AI-72] Fuzzy numbers revisited: operations on extensional fuzzy numbers

【速读】:该论文旨在解决传统模糊数(fuzzy numbers)在运算过程中存在的三大问题:一是计算复杂度高,二是运算结果可能不再保持原模糊集的特征(如两个三角模糊数相乘的结果不再是三角模糊数),三是随着运算次数增加,模糊 spread(模糊性扩散)加剧,导致结果模糊程度显著上升。这些问题严重限制了模糊数在实际应用中的有效性与可靠性。论文提出的解决方案是引入一种新型模糊数——扩展模糊数(extensional fuzzy numbers),并为其定义了基本运算规则和关系算子(=, <, >, ≤, ≥)。其关键在于通过扩展模糊数的数学结构设计,使得运算结果能够保持更稳定的模糊特性,并有效控制模糊传播,从而提升模糊推理系统的效率与精度。

链接: https://arxiv.org/abs/2510.20861
作者: Krzysztof Siminski
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 33 pages, 62 references

点击查看摘要

Abstract:Fuzzy numbers are commonly represented with fuzzy sets. Their objective is to better represent imprecise data. However, operations on fuzzy numbers are not as straightforward as maths on crisp numbers. Commonly, Zadeh's extension rule is applied to elaborate a result. This can produce two problems: (1) high computational complexity and (2) for some fuzzy sets and some operations the result is not a fuzzy set with the same features (e.g., multiplication of two triangular fuzzy sets does not produce a triangular fuzzy set). One more problem is the fuzzy spread: fuzziness of the result increases with the number of operations. These facts can severely limit the application field of fuzzy numbers. In this paper we would like to revisit this problem with a different kind of fuzzy numbers: extensional fuzzy numbers. The paper defines operations on extensional fuzzy numbers and relational operators (=, <, >, ≤, ≥) for them. The proposed approach is illustrated with several applicational examples. The C++ implementation is available from a public GitHub repository.
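
用三角模糊数可以直观看到摘要所说的两个问题:加法保形,而乘法的精确结果已不再是三角形(常用三角"近似"),且模糊跨度随运算次数累积增大。下面的 Python 示意在支撑区间(α=0 截集)层面演示这两点;extensional fuzzy numbers 的定义不在此重现,完整实现可参考作者公开的 C++ 仓库:

```python
class TriFuzzy:
    """三角模糊数 (a, b, c):支撑区间 [a, c],核为 b。"""
    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c

    def __add__(self, other):            # 加法对三角形是精确封闭的
        return TriFuzzy(self.a + other.a, self.b + other.b, self.c + other.c)

    def __mul__(self, other):            # 乘法结果并非三角形,这里取三角近似
        corners = [self.a * other.a, self.a * other.c,
                   self.c * other.a, self.c * other.c]
        return TriFuzzy(min(corners), self.b * other.b, max(corners))

    def spread(self):                    # 模糊跨度:支撑区间宽度
        return self.c - self.a

    def __repr__(self):
        return f"TriFuzzy({self.a:.2f}, {self.b:.2f}, {self.c:.2f})"

x = TriFuzzy(1.8, 2.0, 2.2)              # "大约 2",初始 spread = 0.4
y = x
for _ in range(3):                        # 连乘三次,观察模糊扩散
    y = y * x
    print(y, "spread =", round(y.spread(), 3))
```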
zh

[AI-73] Sketch2BIM: A Multi-Agent Human-AI Collaborative Pipeline to Convert Hand-Drawn Floor Plans to 3D BIM

【速读】:该论文旨在解决从手绘平面草图自动转换为语义一致的三维建筑信息模型(BIM)的问题,尤其针对非专业用户难以直接创建高质量BIM模型的挑战。其解决方案的关键在于构建一个“人在回路”(human-in-the-loop)的多智能体流水线,利用多模态大语言模型(MLLM)实现感知提取、人类反馈、结构验证与自动化BIM脚本生成的协同推理,从而在少量交互迭代中显著提升墙体和开口(门、窗)识别精度,并最终生成几何误差趋近于零的可执行BIM模型。

链接: https://arxiv.org/abs/2510.20838
作者: Abir Khan Ratul,Sanjay Acharjee,Somin Park,Md Nazmus Sakib
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:This study introduces a human-in-the-loop pipeline that converts unscaled, hand-drawn floor plan sketches into semantically consistent 3D BIM models. The workflow leverages multimodal large language models (MLLMs) within a multi-agent framework, combining perceptual extraction, human feedback, schema validation, and automated BIM scripting. Initially, sketches are iteratively refined into a structured JSON layout of walls, doors, and windows. Later, these layouts are transformed into executable scripts that generate 3D BIM models. Experiments on ten diverse floor plans demonstrate strong convergence: openings (doors, windows) are captured with high reliability in the initial pass, while wall detection begins around 83% and achieves near-perfect alignment after a few feedback iterations. Across all categories, precision, recall, and F1 scores remain above 0.83, and geometric errors (RMSE, MAE) progressively decrease to zero through feedback corrections. This study demonstrates how MLLM-driven multi-agent reasoning can make BIM creation accessible to both experts and non-experts using only freehand sketches.
zh

[AI-74] Compressing Quaternion Convolutional Neural Networks for Audio Classification

【速读】:This paper targets the high computational complexity that quaternion operations impose on Quaternion Convolutional Neural Networks (QCNNs) in audio classification tasks, which limits their deployment efficiency on resource-constrained platforms. The key to the solution is a model-pruning strategy that sharply reduces the computational cost and parameter count of QCNNs without significant performance loss; experiments show that pruned QCNNs match or exceed knowledge distillation (KD) while being more efficient, and exhibit good generalization and competitiveness across several audio classification benchmark datasets.

链接: https://arxiv.org/abs/2510.21388
作者: Arshdeep Singh,Vinayak Abrol,Mark D. Plumbley
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD); Signal Processing (eess.SP)
备注: Under review in IEEE TASLPRO

点击查看摘要

Abstract:Conventional Convolutional Neural Networks (CNNs) in the real domain have been widely used for audio classification. However, their convolution operations process multi-channel inputs independently, limiting the ability to capture correlations among channels. This can lead to suboptimal feature learning, particularly for complex audio patterns such as multi-channel spectrogram representations. Quaternion Convolutional Neural Networks (QCNNs) address this limitation by employing quaternion algebra to jointly capture inter-channel dependencies, enabling more compact models with fewer learnable parameters while better exploiting the multi-dimensional nature of audio signals. However, QCNNs exhibit higher computational complexity due to the overhead of quaternion operations, resulting in increased inference latency and reduced efficiency compared to conventional CNNs, posing challenges for deployment on resource-constrained platforms. To address this challenge, this study explores knowledge distillation (KD) and pruning, to reduce the computational complexity of QCNNs while maintaining performance. Our experiments on audio classification reveal that pruning QCNNs achieves similar or superior performance compared to KD while requiring less computational effort. Compared to conventional CNNs and Transformer-based architectures, pruned QCNNs achieve competitive performance with a reduced learnable parameter count and computational complexity. On the AudioSet dataset, pruned QCNNs reduce computational cost by 50% and parameter count by 80%, while maintaining performance comparable to the conventional CNNs. Furthermore, pruned QCNNs generalize well across multiple audio classification benchmarks, including GTZAN for music genre recognition, ESC-50 for environmental sound classification and RAVDESS for speech emotion recognition.
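A minimal sketch of the kind of magnitude pruning the abstract describes, assuming quaternion weights stored as four real components so that a whole quaternion weight is kept or dropped together; the tensor layout and the 80% sparsity level are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def magnitude_prune(magnitudes, sparsity):
    """Mask out the fraction `sparsity` of entries with smallest magnitude."""
    threshold = np.quantile(magnitudes.ravel(), sparsity)
    return magnitudes >= threshold

rng = np.random.default_rng(0)
# A quaternion conv kernel can be stored as 4 real channels (r, i, j, k);
# pruning whole quaternion weights keeps the four components together.
w = rng.normal(size=(4, 16, 8, 3, 3))           # (quat parts, out, in, kH, kW)
norms = np.sqrt((w ** 2).sum(axis=0))           # magnitude of each quaternion weight
mask = magnitude_prune(norms, sparsity=0.8)     # drop 80% of quaternion weights
w_pruned = w * mask[None]                       # broadcast mask over the 4 parts
print("kept fraction:", mask.mean())            # ~0.2
```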

[AI-75] Patient-specific AI for generation of 3D dosimetry imaging from two 2D-planar measurements

【速读】:This paper tackles the difficulty of 3D quantification of post-therapy activity distributions of radiopharmaceuticals (such as 177Lu-PSMA) in nuclear medicine, where traditional approaches rely on expensive and time-consuming 3D SPECT imaging or on planar scintigraphy that provides only 2D information. The core of the solution is patient-specific reinforcement learning: first, a dataset of plausible 3D activity distributions within the individual's anatomy is constructed; then generative AI models (including a 3D U-Net and diffusion models) reconstruct high-quality 3D activity maps from two 2D planar images (anterior and posterior). The key innovation is combining patient-specific training with diffusion models, which markedly improves reconstruction accuracy (about 20% lower MAE and SSIM up to 0.89 in simulation; SSIM of 0.73 on real data), enabling high-precision 3D dosimetry without SPECT and promising a paradigm shift for nuclear-medicine dosimetry.

链接: https://arxiv.org/abs/2510.21362
作者: Alejandro Lopez-Montes,Robert Seifert,Astrid Delker,Guido Boening,Jiahui Wang,Christoph Clement,Ali Afshar-Oromieh,Axel Rominger,Kuangyu Shi
机构: 未知
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI)
备注: Accepted at IEEE NSS/MIC 2025

点击查看摘要

Abstract:In this work we explored the use of patient-specific reinforced learning to generate 3D activity maps from two 2D planar images (anterior and posterior). The solution of this problem remains unachievable using conventional methodologies and is of particular interest for dosimetry in nuclear medicine, where post-therapy distribution of radiopharmaceuticals such as 177Lu-PSMA is typically assessed via either expensive and long 3D SPECT acquisitions or fast, yet only 2D, planar scintigraphy. Being able to generate 3D activity maps from planar scintigraphy opens the gate for new dosimetry applications, removing the need for SPECT and facilitating multi-time-point dosimetry studies. Our solution comprises the generation of a patient-specific dataset of possible 3D uptake maps of the radiopharmaceuticals within the anatomy of the individual, followed by an AI approach (we explored both a 3D U-Net and diffusion models) able to generate 3D activity maps from 2D planar images. We have validated our method both in simulation and on real planar acquisitions. We observed enhanced results using patient-specific reinforcement learning (~20% reduction in MAE and ~5% increase in SSIM) and better organ delineation and patient anatomy, especially when combining diffusion models with patient-specific training, yielding SSIM = 0.89 against the ground truth for simulations and 0.73 when compared to a SPECT acquisition performed half an hour after the planar scan. We believe that our methodology can set a change of paradigm for nuclear medicine dosimetry, allowing 3D quantification using only planar scintigraphy, without the need for expensive and time-consuming SPECT, while leveraging the pre-therapy information of the patients.
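To clarify the geometry of the problem, the sketch below simulates anterior/posterior planar views as attenuated line integrals of a toy 3D activity volume, the kind of (2D, 2D) -> 3D training pair a patient-specific generative model would be fit on; the attenuation model, coefficient, and shapes are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
activity = np.zeros((64, 64, 64))                   # toy 3D map, axes (x, depth, z)
activity[20:30, 25:35, 30:45] = rng.uniform(1, 2, (10, 10, 15))  # an "organ"

mu = 0.05                                           # toy linear attenuation coefficient
depth = np.arange(64)[None, :, None]
anterior = (activity * np.exp(-mu * depth)).sum(axis=1)            # viewed from the front
posterior = (activity * np.exp(-mu * depth[:, ::-1])).sum(axis=1)  # viewed from the back

# Training pairs for a generative model: two 2D views in, one 3D volume out.
x, y = np.stack([anterior, posterior]), activity
print(x.shape, y.shape)  # (2, 64, 64) (64, 64, 64)
```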

[AI-76] WhaleVAD-BPN: Improving Baleen Whale Call Detection with Boundary Proposal Networks and Post-processing Optimisation

【速读】:This paper addresses the high false-positive rate and poor minority-class detection of sound event detection (SED) systems when identifying baleen whale calls in marine audio. The key to the solution is the proposed boundary proposal network (BPN): inspired by ideas from image object detection, it uses intermediate latent representations computed within the backbone classification model to gate the final output, effectively reducing the number of false detections. In addition, forward-search and backward-search strategies optimize the post-processing hyperparameters, tuning event-level and frame-level parameters separately, which clearly raises the F1-score on minority-class events (d-calls and bp-calls); the complete WhaleVAD-BPN system reaches a cross-validated F1-score of 0.475, a 9.8% absolute improvement over the baseline.

链接: https://arxiv.org/abs/2510.21280
作者: Christiaan M. Geldenhuys,Günther Tonitz,Thomas R. Niesler
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Quantitative Methods (q-bio.QM)
备注:

点击查看摘要

Abstract:While recent sound event detection (SED) systems can identify baleen whale calls in marine audio, challenges related to false positive and minority-class detection persist. We propose the boundary proposal network (BPN), which extends an existing lightweight SED system. The BPN is inspired by work in image object detection and aims to reduce the number of false positive detections. It achieves this by using intermediate latent representations computed within the backbone classification model to gate the final output. When added to an existing SED system, the BPN achieves a 16.8 % absolute increase in precision, as well as 21.3 % and 9.4 % improvements in the F1-score for minority-class d-calls and bp-calls, respectively. We further consider two approaches to the selection of post-processing hyperparameters: a forward-search and a backward-search. By separately optimising event-level and frame-level hyperparameters, these two approaches lead to considerable performance improvements over parameters selected using empirical methods. The complete WhaleVAD-BPN system achieves a cross-validated development F1-score of 0.475, which is a 9.8 % absolute improvement over the baseline.
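The forward search over post-processing hyperparameters can be pictured as a plain grid search maximizing event-level F1; the sketch below uses a hypothetical detection threshold and minimum event duration as the two parameters, an illustrative simplification of the paper's setup.

```python
import numpy as np

def frames_to_events(probs, threshold, min_len):
    """Threshold frame posteriors, drop events shorter than min_len frames."""
    active, events, start = probs >= threshold, [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t
        elif not a and start is not None:
            if t - start >= min_len:
                events.append((start, t))
            start = None
    if start is not None and len(active) - start >= min_len:
        events.append((start, len(active)))
    return events

def event_f1(pred, ref):
    overlap = lambda p, r: p[0] < r[1] and r[0] < p[1]
    hit_p = sum(any(overlap(p, r) for r in ref) for p in pred)
    hit_r = sum(any(overlap(p, r) for p in pred) for r in ref)
    prec = hit_p / len(pred) if pred else 0.0
    rec = hit_r / len(ref) if ref else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

rng = np.random.default_rng(2)
probs = np.clip(rng.normal(0.3, 0.2, 500), 0, 1)
probs[100:140] += 0.5                     # one synthetic call
reference = [(100, 140)]

# Forward search: exhaustive sweep of the event-level parameter grid.
best = max(((th, ml, event_f1(frames_to_events(probs, th, ml), reference))
            for th in np.linspace(0.3, 0.9, 13) for ml in (1, 5, 10, 20)),
           key=lambda s: s[2])
print("threshold, min_len, F1:", best)
```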

[AI-77] Hierarchical AI Multi-Agent Fundamental Investing: Evidence from Chinas A-Share Market

【速读】:This paper addresses the limitations of traditional factor-investing strategies in integrating multi-dimensional information and adapting dynamically, in particular the missing link between the macro environment and micro-level firm fundamentals. The key to the solution is a hierarchical multi-agent architecture: a Macro agent performs adaptive sector screening and weighting based on economic indicators and industry performance; four firm-level agents (Fundamental, Technical, Report, and News) extract deep signals from financial, technical, analyst-report, and news data; a Portfolio agent fuses the agents' outputs into a unified trading strategy via reinforcement learning; and a Risk Control agent adjusts positions in real time in response to market volatility. This design organically links top-down macro screening with bottom-up fundamental analysis, clearly improving risk-adjusted returns and drawdown control.

链接: https://arxiv.org/abs/2510.21147
作者: Chujun He,Zhonghao Huang,Xiangguo Li,Ye Luo,Kewei Ma,Yuxuan Xiong,Xiaowei Zhang,Mingyang Zhao
机构: 未知
类目: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present a multi-agent, AI-driven framework for fundamental investing that integrates macro indicators, industry-level and firm-specific information to construct optimized equity portfolios. The architecture comprises: (i) a Macro agent that dynamically screens and weights sectors based on evolving economic indicators and industry performance; (ii) four firm-level agents – Fundamental, Technical, Report, and News – that conduct in-depth analyses of individual firms to ensure both breadth and depth of coverage; (iii) a Portfolio agent that uses reinforcement learning to combine the agent outputs into a unified policy to generate the trading strategy; and (iv) a Risk Control agent that adjusts portfolio positions in response to market volatility. We evaluate the system on the constituents by the CSI 300 Index of China’s A-share market and find that it consistently outperforms standard benchmarks and a state-of-the-art multi-agent trading system on risk-adjusted returns and drawdown control. Our core contribution is a hierarchical multi-agent design that links top-down macro screening with bottom-up fundamental analysis, offering a robust and extensible approach to factor-based portfolio construction.

[AI-78] Integrated representational signatures strengthen specificity in brains and models

【速读】:This paper tackles a core question: when different neural networks (including biological brain regions and artificial neural networks) perform similar tasks, do they rely on equivalent representations? Prior studies typically use a single representational-similarity metric, which captures only one facet of representational structure. The key contribution is a suite of representational-similarity metrics covering geometry, unit-level tuning, and linear decodability, integrated through an adapted Similarity Network Fusion (SNF) framework. SNF markedly sharpens separation at the level of brain regions or model families, produces more robust composite similarity profiles, and reveals a hierarchical organization that aligns closely with the established anatomical and functional hierarchy of visual cortex, outperforming any individual metric.

链接: https://arxiv.org/abs/2510.20847
作者: Jialin Wu,Shreya Saha,Yiqing Bo,Meenakshi Khosla
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The extent to which different neural or artificial neural networks (models) rely on equivalent representations to support similar tasks remains a central question in neuroscience and machine learning. Prior work has typically compared systems using a single representational similarity metric, yet each captures only one facet of representational structure. To address this, we leverage a suite of representational similarity metrics, each capturing a distinct facet of representational correspondence (such as geometry, unit-level tuning, or linear decodability), and assess brain region or model separability using multiple complementary measures. Metrics that preserve geometric or tuning structure (e.g., RSA, Soft Matching) yield stronger region-based discrimination, whereas more flexible mappings such as Linear Predictivity show weaker separation. These findings suggest that geometry and tuning encode brain-region- or model-family-specific signatures, while linearly decodable information tends to be more globally shared across regions or models. To integrate these complementary representational facets, we adapt Similarity Network Fusion (SNF), a framework originally developed for multi-omics data integration. SNF produces substantially sharper regional and model family-level separation than any single metric and yields robust composite similarity profiles. Moreover, clustering cortical regions using SNF-derived similarity scores reveals a clearer hierarchical organization that aligns closely with established anatomical and functional hierarchies of the visual cortex, surpassing the correspondence achieved by individual metrics.
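For readers unfamiliar with Similarity Network Fusion, the sketch below implements a simplified version of its cross-diffusion update (dense kernels, two input metrics), following the general scheme the authors adapt; the kernel choices and parameters are illustrative only.

```python
import numpy as np

def row_normalise(a):
    return a / a.sum(axis=1, keepdims=True)

def knn_kernel(a, k):
    """Keep each row's k largest affinities (local similarity), then normalise."""
    s = np.zeros_like(a)
    idx = np.argsort(a, axis=1)[:, -k:]
    rows = np.arange(a.shape[0])[:, None]
    s[rows, idx] = a[rows, idx]
    return row_normalise(s)

def snf(affinities, k=5, iterations=20):
    """Fuse similarity matrices from different metrics into one network."""
    P = [row_normalise(a) for a in affinities]
    S = [knn_kernel(a, k) for a in affinities]
    for _ in range(iterations):
        # Cross-diffusion: each network is updated through the average of the others.
        P = [S[i] @ np.mean([P[j] for j in range(len(P)) if j != i], axis=0) @ S[i].T
             for i in range(len(P))]
        P = [row_normalise(p + np.eye(len(p))) for p in P]  # regularise
    return np.mean(P, axis=0)

rng = np.random.default_rng(3)
x = rng.normal(size=(30, 8))
a1 = np.exp(-np.linalg.norm(x[:, None] - x[None], axis=-1) ** 2)  # "geometry" metric
a2 = np.exp(-np.abs(x[:, None, 0] - x[None, :, 0]))               # "tuning" metric
fused = snf([a1, a2])
print(fused.shape)  # (30, 30) composite similarity profile
```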

[AI-79] Consciousness natural and artificial: an evolutionary advantage for reasoning on reactive substrates

【速读】:The core question of this paper is how to define consciousness precisely and identify the mechanisms that effect it, especially against the backdrop of rapid AI progress and the scientific community's split between physicalism and natural dualism. Traditional research often conflates consciousness with other human cognitive capabilities (such as intelligence and physiological sensations), which has made computational models hard to construct. The key to the proposed solution is to build a computational framework that models consciousness precisely by including the underlying substrate (biological or digital) and the reactive behavior of its subsystems (such as autonomous physiological responses). The results show that consciousness is not an abstract process independent of its substrate but depends on its physical basis, and that possessing it is an evolutionary advantage for intelligent entities; the paper also argues that while artificial consciousness is realizable, artificial intelligence of arbitrary level does not require consciousness, and imbuing AI with consciousness confers no advantage.

链接: https://arxiv.org/abs/2510.20839
作者: Warisa Sritriratanarak,Paulo Garcia
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Precisely defining consciousness and identifying the mechanisms that effect it is a long-standing question, particularly relevant with advances in artificial intelligence. The scientific community is divided between physicalism and natural dualism. Physicalism posits consciousness is a physical process that can be modeled computationally; natural dualism rejects this hypothesis. Finding a computational model has proven elusive, particularly because of conflation of consciousness with other cognitive capabilities exhibited by humans, such as intelligence and physiological sensations. Here we show such a computational model that precisely models consciousness, natural or artificial, identifying the structural and functional mechanisms that effect it, confirming the physicalism hypothesis. We found such a model is obtainable when including the underlying (biological or digital) substrate and accounting for reactive behavior in substrate sub-systems (e.g., autonomous physiological responses). Results show that, unlike all other computational processes, consciousness is not independent of its substrate and possessing it is an evolutionary advantage for intelligent entities. Our result shows there is no impediment to the realization of fully artificial consciousness but, surprisingly, that it is also possible to realize artificial intelligence of arbitrary level without consciousness whatsoever, and that there is no advantage in imbuing artificial systems with consciousness.

机器学习

[LG-0] Equivariance by Contrast: Identifiable Equivariant Embeddings from Unlabeled Finite Group Actions NEURIPS2025

链接: https://arxiv.org/abs/2510.21706
作者: Tobias Schmidt,Steffen Schneider,Matthias Bethge
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at NeurIPS 2025. The last two authors contributed equally. Code is available at this https URL

点击查看摘要

Abstract:We propose Equivariance by Contrast (EbC) to learn equivariant embeddings from observation pairs $(\mathbf{y}, g \cdot \mathbf{y})$, where $g$ is drawn from a finite group acting on the data. Our method jointly learns a latent space and a group representation in which group actions correspond to invertible linear maps, without relying on group-specific inductive biases. We validate our approach on the infinite dSprites dataset with structured transformations defined by the finite group $G := (R_m \times \mathbb{Z}_n \times \mathbb{Z}_n)$, combining discrete rotations and periodic translations. The resulting embeddings exhibit high-fidelity equivariance, with group operations faithfully reproduced in latent space. On synthetic data, we further validate the approach on the non-abelian orthogonal group $O(n)$ and the general linear group $GL(n)$. We also provide a theoretical proof for identifiability. While broad evaluation across diverse group types on real-world data remains future work, our results constitute the first successful demonstration of general-purpose encoder-only equivariant learning from group action observations alone, including non-trivial non-abelian groups and a product group motivated by modeling affine equivariances in computer vision.

[LG-1] Mechanistic Interpretability for Neural TSP Solvers

链接: https://arxiv.org/abs/2510.21693
作者: Reuben Narad,Leonard Boussioux,Michael Wagner
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural networks have advanced combinatorial optimization, with Transformer-based solvers achieving near-optimal solutions on the Traveling Salesman Problem (TSP) in milliseconds. However, these models operate as black boxes, providing no insight into the geometric patterns they learn or the heuristics they employ during tour construction. We address this opacity by applying sparse autoencoders (SAEs), a mechanistic interpretability technique, to a Transformer-based TSP solver, representing the first application of activation-based interpretability methods to operations research models. We train a pointer network with reinforcement learning on 100-node instances, then fit an SAE to the encoder’s residual stream to discover an overcomplete dictionary of interpretable features. Our analysis reveals that the solver naturally develops features mirroring fundamental TSP concepts: boundary detectors that activate on convex-hull nodes, cluster-sensitive features responding to locally dense regions, and separator features encoding geometric partitions. These findings provide the first model-internal account of what neural TSP solvers compute before node selection, demonstrate that geometric structure emerges without explicit supervision, and suggest pathways toward transparent hybrid systems that combine neural efficiency with algorithmic interpretability. Interactive feature explorer: this https URL

[LG-2] On Uncertainty Calibration for Equivariant Functions

链接: https://arxiv.org/abs/2510.21691
作者: Edward Berman,Jacob Ginesin,Marco Pacini,Robin Walters
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: Under review at Transactions on Machine Learning Research (TMLR). Code is available at this https URL . Excited to share this paper, comments welcome :D

点击查看摘要

Abstract:Data-sparse settings such as robotic manipulation, molecular physics, and galaxy morphology classification are some of the hardest domains for deep learning. For these problems, equivariant networks can help improve modeling across undersampled parts of the input space, and uncertainty estimation can guard against overconfidence. However, until now, the relationship between equivariance and model confidence, and more generally between equivariance and model calibration, has yet to be studied. Since traditional classification and regression error terms show up in the definitions of calibration error, it is natural to suspect that previous work can be used to help understand the relationship between equivariance and calibration error. In this work, we present a theory relating equivariance to uncertainty estimation. By proving lower and upper bounds on uncertainty calibration errors (ECE and ENCE) under various equivariance conditions, we elucidate the generalization limits of equivariant models and illustrate how symmetry mismatch can result in miscalibration in both classification and regression. We complement our theoretical framework with numerical experiments that clarify the relationship between equivariance and uncertainty using a variety of real and simulated datasets, and we comment on trends with symmetry mismatch, group size, and aleatoric and epistemic uncertainties.
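As a reference point for one of the calibration errors the paper bounds, here is the standard binned estimator of the expected calibration error (ECE) in NumPy; the simulated overconfident classifier is purely illustrative.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Binned ECE: bin-weighted |accuracy - mean confidence| per confidence bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece

rng = np.random.default_rng(4)
conf = rng.uniform(0.5, 1.0, 10_000)
correct = rng.uniform(size=10_000) < conf - 0.1   # overconfident by ~0.1
print("ECE:", expected_calibration_error(conf, correct))  # close to 0.1
```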

[LG-3] Optimal Graph Clustering without Edge Density Signals

链接: https://arxiv.org/abs/2510.21669
作者: Maximilien Dreveton,Elaine Siyu Liu,Matthias Grossglauser,Patrick Thiran
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper establishes the theoretical limits of graph clustering under the Popularity-Adjusted Block Model (PABM), addressing limitations of existing models. In contrast to the Stochastic Block Model (SBM), which assumes uniform vertex degrees, and to the Degree-Corrected Block Model (DCBM), which applies uniform degree corrections across clusters, PABM introduces separate popularity parameters for intra- and inter-cluster connections. Our main contribution is the characterization of the optimal error rate for clustering under PABM, which provides novel insights on clustering hardness: we demonstrate that unlike SBM and DCBM, cluster recovery remains possible in PABM even when traditional edge-density signals vanish, provided intra- and inter-cluster popularity coefficients differ. This highlights a dimension of degree heterogeneity captured by PABM but overlooked by DCBM: local differences in connectivity patterns can enhance cluster separability independently of global edge densities. Finally, because PABM exhibits a richer structure, its expected adjacency matrix has rank between k and k^2 , where k is the number of clusters. As a result, spectral embeddings based on the top k eigenvectors may fail to capture important structural information. Our numerical experiments on both synthetic and real datasets confirm that spectral clustering algorithms incorporating k^2 eigenvectors outperform traditional spectral approaches.

[LG-4] Enhancing Tactile-based Reinforcement Learning for Robotic Control

链接: https://arxiv.org/abs/2510.21609
作者: Elle Miller,Trevor McInroe,David Abel,Oisin Mac Aodha,Sethu Vijayakumar
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Achieving safe, reliable real-world robotic manipulation requires agents to evolve beyond vision and incorporate tactile sensing to overcome sensory deficits and reliance on idealised state information. Despite its potential, the efficacy of tactile sensing in reinforcement learning (RL) remains inconsistent. We address this by developing self-supervised learning (SSL) methodologies to more effectively harness tactile observations, focusing on a scalable setup of proprioception and sparse binary contacts. We empirically demonstrate that sparse binary tactile signals are critical for dexterity, particularly for interactions that proprioceptive control errors do not register, such as decoupled robot-object motions. Our agents achieve superhuman dexterity in complex contact tasks (ball bouncing and Baoding ball rotation). Furthermore, we find that decoupling the SSL memory from the on-policy memory can improve performance. We release the Robot Tactile Olympiad (RoTO) benchmark to standardise and promote future research in tactile-based manipulation. Project page: this https URL

[LG-5] Generalised Flow Maps for Few-Step Generative Modelling on Riemannian Manifolds

链接: https://arxiv.org/abs/2510.21608
作者: Oscar Davis,Michael S. Albergo,Nicholas M. Boffi,Michael M. Bronstein,Avishek Joey Bose
类目: Machine Learning (cs.LG)
*备注: Under review

点击查看摘要

Abstract:Geometric data and purpose-built generative models on them have become ubiquitous in high-impact deep learning application domains, ranging from protein backbone generation and computational chemistry to geospatial data. Current geometric generative models remain computationally expensive at inference – requiring many steps of complex numerical simulation – as they are derived from dynamical measure transport frameworks such as diffusion and flow-matching on Riemannian manifolds. In this paper, we propose Generalised Flow Maps (GFM), a new class of few-step generative models that generalises the Flow Map framework in Euclidean spaces to arbitrary Riemannian manifolds. We instantiate GFMs with three self-distillation-based training methods: Generalised Lagrangian Flow Maps, Generalised Eulerian Flow Maps, and Generalised Progressive Flow Maps. We theoretically show that GFMs, under specific design decisions, unify and elevate existing Euclidean few-step generative models, such as consistency models, shortcut models, and meanflows, to the Riemannian setting. We benchmark GFMs against other geometric generative models on a suite of geometric datasets, including geospatial data, RNA torsion angles, and hyperbolic manifolds, and achieve state-of-the-art sample quality for single- and few-step evaluations, and superior or competitive log-likelihoods using the implicit probability flow.

[LG-6] SHAP Meets Tensor Networks: Provably Tractable Explanations with Parallelism NEURIPS2025

链接: https://arxiv.org/abs/2510.21599
作者: Reda Marzouk,Shahaf Bassan,Guy Katz
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Formal Languages and Automata Theory (cs.FL); Quantum Physics (quant-ph)
*备注: To appear in NeurIPS 2025

点击查看摘要

Abstract:Although Shapley additive explanations (SHAP) can be computed in polynomial time for simple models like decision trees, they unfortunately become NP-hard to compute for more expressive black-box models like neural networks - where generating explanations is often most critical. In this work, we analyze the problem of computing SHAP explanations for Tensor Networks (TNs), a broader and more expressive class of models than those for which current exact SHAP algorithms are known to hold, and which is widely used for neural network abstraction and compression. First, we introduce a general framework for computing provably exact SHAP explanations for general TNs with arbitrary structures. Interestingly, we show that, when TNs are restricted to a Tensor Train (TT) structure, SHAP computation can be performed in poly-logarithmic time using parallel computation. Thanks to the expressiveness power of TTs, this complexity result can be generalized to many other popular ML models such as decision trees, tree ensembles, linear models, and linear RNNs, therefore tightening previously reported complexity results for these families of models. Finally, by leveraging reductions of binarized neural networks to Tensor Network representations, we demonstrate that SHAP computation can become efficiently tractable when the network’s width is fixed, while it remains computationally hard even with constant depth. This highlights an important insight: for this class of models, width - rather than depth - emerges as the primary computational bottleneck in SHAP computation.
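For contrast with the tractable algorithms the paper develops, the sketch below computes exact Shapley values by brute-force coalition enumeration, the O(2^n) baseline; masking absent features with a fixed baseline vector is an assumption of this toy setup.

```python
import itertools
import math

def shapley_values(f, x, baseline):
    """Exact SHAP by coalition enumeration: O(2^n) model evaluations."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for subset in itertools.combinations(others, r):
                weight = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
                with_i = [x[j] if j in subset or j == i else baseline[j] for j in range(n)]
                without = [x[j] if j in subset else baseline[j] for j in range(n)]
                phi[i] += weight * (f(with_i) - f(without))
    return phi

# Toy model with an interaction term; SHAP splits it between features 1 and 2.
f = lambda z: 2 * z[0] + z[1] * z[2]
print(shapley_values(f, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0]))
# -> [2.0, 0.5, 0.5]; the values sum to f(x) - f(baseline) = 3.0
```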

[LG-7] Accelerating Data Generation for Nonlinear temporal PDEs via homologous perturbation in solution space

链接: https://arxiv.org/abs/2510.21592
作者: Lei Liu,Zhenxin Huang,Hong Wang,huanshuo dong,Haiyang Xin,Hongwei Zhao,Bin Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data-driven deep learning methods like neural operators have advanced in solving nonlinear temporal partial differential equations (PDEs). However, these methods require large quantities of solution pairs: the solution functions and right-hand sides (RHS) of the equations. These pairs are typically generated via traditional numerical methods, which need thousands of time-step iterations, far more than the dozens required for training, creating heavy computational and temporal overheads. To address these challenges, we propose a novel data generation algorithm, called HOmologous Perturbation in Solution Space (HOPSS), which directly generates training datasets with fewer time steps rather than following the traditional approach of generating datasets with many time steps. This algorithm simultaneously accelerates dataset generation and preserves the approximate precision required for model training. Specifically, we first obtain a set of base solution functions from a reliable solver, usually with thousands of time steps, and then align them in time steps with training datasets by downsampling. Subsequently, we propose a "homologous perturbation" approach: by combining two solution functions (one as the primary function, the other as a homologous perturbation term scaled by a small scalar) with random noise, we efficiently generate comparable-precision PDE data points. Finally, using these data points, we compute the variation in the original equation's RHS to form new solution pairs. Theoretical and experimental results show HOPSS lowers time complexity. For example, on the Navier-Stokes equation, it generates 10,000 samples in approximately 10% of traditional methods' time, with comparable model training performance.
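A rough sketch of the homologous-perturbation idea on the 1D heat equation u_t = u_xx: two analytic solutions are combined (one as primary, one as a small scaled perturbation) with noise, and the new pair's RHS is recovered by finite differences; the scales, noise level, and choice of equation are assumptions for illustration.

```python
import numpy as np

nx, nt, dt = 64, 32, 1e-4
x = np.linspace(0, 1, nx)
dx = x[1] - x[0]
t = np.arange(nt) * dt

def heat_solution(k):
    """Analytic solution of u_t = u_xx with sin(k*pi*x) initial condition."""
    return np.exp(-(k * np.pi) ** 2 * t)[:, None] * np.sin(k * np.pi * x)[None, :]

u_primary, u_homologous = heat_solution(1), heat_solution(2)

rng = np.random.default_rng(5)
eps = 0.05
u_new = u_primary + eps * u_homologous + 1e-4 * rng.normal(size=u_primary.shape)

# Recompute the equation's RHS for the perturbed field by finite differences,
# yielding a new (solution, RHS) training pair without extra solver time steps.
u_t = np.gradient(u_new, dt, axis=0)
u_xx = np.gradient(np.gradient(u_new, dx, axis=1), dx, axis=1)
rhs = u_t - u_xx
print(u_new.shape, rhs.shape)  # (32, 64) (32, 64)
```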

[LG-8] REVE: A Foundation Model for EEG – Adapting to Any Setup with Large-Scale Pretraining on 25000 Subjects

链接: https://arxiv.org/abs/2510.21585
作者: Yassine El Ouahidi,Jonathan Lys,Philipp Thölke,Nicolas Farrugia,Bastien Pasdeloup,Vincent Gripon,Karim Jerbi,Giulia Lioi
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: Code available at: this https URL

点击查看摘要

Abstract:Foundation models have transformed AI by reducing reliance on task-specific data through large-scale pretraining. While successful in language and vision, their adoption in EEG has lagged due to the heterogeneity of public datasets, which are collected under varying protocols, devices, and electrode configurations. Existing EEG foundation models struggle to generalize across these variations, often restricting pretraining to a single setup, resulting in suboptimal performance, in particular under linear probing. We present REVE (Representation for EEG with Versatile Embeddings), a pretrained model explicitly designed to generalize across diverse EEG signals. REVE introduces a novel 4D positional encoding scheme that enables it to process signals of arbitrary length and electrode arrangement. Using a masked autoencoding objective, we pretrain REVE on over 60,000 hours of EEG data from 92 datasets spanning 25,000 subjects, representing the largest EEG pretraining effort to date. REVE achieves state-of-the-art results on 10 downstream EEG tasks, including motor imagery classification, seizure detection, sleep staging, cognitive load estimation, and emotion recognition. With little to no fine-tuning, it demonstrates strong generalization, and nuanced spatio-temporal modeling. We release code, pretrained weights, and tutorials to support standardized EEG research and accelerate progress in clinical neuroscience.
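One plausible way to realize a 4D positional encoding over electrode coordinates and time is to concatenate standard sinusoidal features per axis, as sketched below; REVE's actual scheme may differ, and all dimensions here are arbitrary choices.

```python
import numpy as np

def sinusoidal(values, dim):
    """Standard sinusoidal features for one coordinate; `dim` must be even."""
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = values[..., None] * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def positional_encoding_4d(xyz, n_times, dim_per_axis=32):
    """Encode (x, y, z, t) for every electrode/time pair, then concatenate."""
    n_elec = xyz.shape[0]
    t = np.arange(n_times, dtype=float)
    enc_xyz = np.concatenate([sinusoidal(xyz[:, i], dim_per_axis) for i in range(3)], -1)
    enc_t = sinusoidal(t, dim_per_axis)
    # Broadcast to a (time, electrode, feature) grid, independent of the montage.
    return np.concatenate(
        [np.broadcast_to(enc_xyz, (n_times, n_elec, enc_xyz.shape[-1])),
         np.broadcast_to(enc_t[:, None], (n_times, n_elec, dim_per_axis))], -1)

xyz = np.random.default_rng(6).uniform(-1, 1, size=(19, 3))  # any electrode layout
print(positional_encoding_4d(xyz, n_times=100).shape)        # (100, 19, 128)
```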

[LG-9] An unsupervised tour through the hidden pathways of deep neural networks

链接: https://arxiv.org/abs/2510.21582
作者: Diego Doimo
类目: Machine Learning (cs.LG)
*备注: PhD thesis

点击查看摘要

Abstract:The goal of this thesis is to improve our understanding of the internal mechanisms by which deep artificial neural networks create meaningful representations and are able to generalize. We focus on the challenge of characterizing the semantic content of the hidden representations with unsupervised learning tools, partially developed by us and described in this thesis, which allow harnessing the low-dimensional structure of the data. Chapter 2 introduces Gride, a method that allows estimating the intrinsic dimension of the data as an explicit function of the scale without performing any decimation of the data set. Our approach is based on rigorous distributional results that enable the quantification of uncertainty of the estimates. Moreover, our method is simple and computationally efficient since it relies only on the distances among nearest data points. In Chapter 3, we study the evolution of the probability density across the hidden layers in some state-of-the-art deep neural networks. We find that the initial layers generate a unimodal probability density getting rid of any structure irrelevant to classification. In subsequent layers, density peaks arise in a hierarchical fashion that mirrors the semantic hierarchy of the concepts. This process leaves a footprint in the probability density of the output layer, where the topography of the peaks allows reconstructing the semantic relationships of the categories. In Chapter 4, we study the problem of generalization in deep neural networks: adding parameters to a network that interpolates its training data will typically improve its generalization performance, at odds with the classical bias-variance trade-off. We show that wide neural networks learn redundant representations instead of overfitting to spurious correlation and that redundant neurons appear only if the network is regularized and the training error is zero.

[LG-10] Leveraging Classical Algorithms for Graph Neural Networks

链接: https://arxiv.org/abs/2510.21574
作者: Jason Wu,Petar Veličković
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Neural networks excel at processing unstructured data but often fail to generalise out-of-distribution, whereas classical algorithms guarantee correctness but lack flexibility. We explore whether pretraining Graph Neural Networks (GNNs) on classical algorithms can improve their performance on molecular property prediction tasks from the Open Graph Benchmark: ogbg-molhiv (HIV inhibition) and ogbg-molclintox (clinical toxicity). GNNs trained on 24 classical algorithms from the CLRS Algorithmic Reasoning Benchmark are used to initialise and freeze selected layers of a second GNN for molecular prediction. Compared to a randomly initialised baseline, the pretrained models achieve consistent wins or ties, with the Segments Intersect algorithm pretraining yielding a 6% absolute gain on ogbg-molhiv and Dijkstra pretraining achieving a 3% gain on ogbg-molclintox. These results demonstrate that embedding classical algorithmic priors into GNNs provides useful inductive biases, boosting performance on complex, real-world graph data.

[LG-11] Interpretable Multimodal Zero-Shot ECG Diagnosis via Structured Clinical Knowledge Alignment

链接: https://arxiv.org/abs/2510.21551
作者: Jialu Tang,Hung Manh Pham,Ignace De Lathauwer,Henk S. Schipper,Yuan Lu,Dong Ma,Aaqib Saeed
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Electrocardiogram (ECG) interpretation is essential for cardiovascular disease diagnosis, but current automated systems often struggle with transparency and generalization to unseen conditions. To address this, we introduce ZETA, a zero-shot multimodal framework designed for interpretable ECG diagnosis aligned with clinical workflows. ZETA uniquely compares ECG signals against structured positive and negative clinical observations, which are curated through an LLM-assisted, expert-validated process, thereby mimicking differential diagnosis. Our approach leverages a pre-trained multimodal model to align ECG and text embeddings without disease-specific fine-tuning. Empirical evaluations demonstrate ZETA’s competitive zero-shot classification performance and, importantly, provide qualitative and quantitative evidence of enhanced interpretability, grounding predictions in specific, clinically relevant positive and negative diagnostic features. ZETA underscores the potential of aligning ECG analysis with structured clinical knowledge for building more transparent, generalizable, and trustworthy AI diagnostic systems. We will release the curated observation dataset and code to facilitate future research.

[LG-12] Cost Minimization for Space-Air-Ground Integrated Multi-Access Edge Computing Systems

链接: https://arxiv.org/abs/2510.21541
作者: Weihong Qin,Aimin Wang,Geng Sun,Zemin Sun,Jiacheng Wang,Dusit Niyato,Dong In Kim,Zhu Han
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:

点击查看摘要

Abstract:Space-air-ground integrated multi-access edge computing (SAGIN-MEC) provides a promising solution for the rapidly developing low-altitude economy (LAE) to deliver flexible and wide-area computing services. However, fully realizing the potential of SAGIN-MEC in the LAE presents significant challenges, including coordinating decisions across heterogeneous nodes with different roles, modeling complex factors such as mobility and network variability, and handling real-time decision-making under partially observable environment with hybrid variables. To address these challenges, we first present a hierarchical SAGIN-MEC architecture that enables the coordination between user devices (UDs), uncrewed aerial vehicles (UAVs), and satellites. Then, we formulate a UD cost minimization optimization problem (UCMOP) to minimize the UD cost by jointly optimizing the task offloading ratio, UAV trajectory planning, computing resource allocation, and UD association. We show that the UCMOP is an NP-hard problem. To overcome this challenge, we propose a multi-agent deep deterministic policy gradient (MADDPG)-convex optimization and coalitional game (MADDPG-COCG) algorithm. Specifically, we employ the MADDPG algorithm to optimize the continuous temporal decisions for heterogeneous nodes in the partially observable SAGIN-MEC system. Moreover, we propose a convex optimization and coalitional game (COCG) method to enhance the conventional MADDPG by deterministically handling the hybrid and varying-dimensional decisions. Simulation results demonstrate that the proposed MADDPG-COCG algorithm significantly enhances the user-centric performances in terms of the aggregated UD cost, task completion delay, and UD energy consumption, with a slight increase in UAV energy consumption, compared to the benchmark algorithms. Moreover, the MADDPG-COCG algorithm shows superior convergence stability and scalability.

[LG-13] Excision Score: Evaluating Edits with Surgical Precision

链接: https://arxiv.org/abs/2510.21537
作者: Nikolai Gruzinov,Ksenia Sycheva,Earl T. Barr,Alex Bezzubov
类目: Machine Learning (cs.LG)
*备注: Code is available at this https URL

点击查看摘要

Abstract:Many tasks revolve around editing a document, whether code or text. We formulate the revision similarity problem to unify a wide range of machine learning evaluation problems whose goal is to assess a revision to an existing document. We observe that revisions usually change only a small portion of an existing document, so the existing document and its immediate revisions share a majority of their content. We formulate five adequacy criteria for revision similarity measures, designed to align them with human judgement. We show that popular pairwise measures, like BLEU, fail to meet these criteria, because their scores are dominated by the shared content. They report high similarity between two revisions when humans would assess them as quite different. This is a fundamental flaw we address. We propose a novel static measure, Excision Score (ES), which computes longest common subsequence (LCS) to remove content shared by an existing document with the ground truth and predicted revisions, before comparing only the remaining divergent regions. This is analogous to a surgeon creating a sterile field to focus on the work area. We use approximation to speed the standard cubic LCS computation to quadratic. In code-editing evaluation, where static measures are often used as a cheap proxy for passing tests, we demonstrate that ES surpasses existing measures. When aligned with test execution on HumanEvalFix, ES improves over its nearest competitor, SARI, by 12% Pearson correlation and by 21% over standard measures like BLEU. The key criterion is invariance to shared context; when we perturb HumanEvalFix with increased shared context, ES’ improvement over SARI increases to 20% and 30% over standard measures. ES also handles other corner cases that other measures do not, such as correctly aligning moved code blocks, and appropriately rewarding matching insertions or deletions.
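The excision idea can be sketched with Python's standard difflib: remove each revision's longest-common-subsequence-style overlap with the source document, then score only the divergent remainders. Token-level F1 on the remainders below stands in for the paper's actual scoring and is an assumption of this sketch.

```python
from difflib import SequenceMatcher

def excise(source_tokens, revision_tokens):
    """Drop the revision's matched overlap with the source; keep the edits."""
    sm = SequenceMatcher(a=source_tokens, b=revision_tokens, autojunk=False)
    shared = {j for _, start, size in sm.get_matching_blocks()
              for j in range(start, start + size)}
    return [tok for j, tok in enumerate(revision_tokens) if j not in shared]

def token_f1(pred, ref):
    if not pred or not ref:
        return float(pred == ref)
    common = len(set(pred) & set(ref))
    p, r = common / len(pred), common / len(ref)
    return 2 * p * r / (p + r) if p + r else 0.0

source = "def add(a, b): return a - b".split()
truth  = "def add(a, b): return a + b".split()
pred   = "def add(a, b): return a + b  # fixed".split()

# Score only what each revision changed relative to the source document.
print(token_f1(excise(source, pred), excise(source, truth)))  # 0.5
```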

[LG-14] FrameShield: Adversarially Robust Video Anomaly Detection

链接: https://arxiv.org/abs/2510.21532
作者: Mojtaba Nafez,Mobina Poulaei,Nikan Vasei,Bardia Soltani Moakhar,Mohammad Sabokrou,MohammadHossein Rohban
类目: Machine Learning (cs.LG)
*备注: 28 page, 5 figures

点击查看摘要

Abstract:Weakly Supervised Video Anomaly Detection (WSVAD) has achieved notable advancements, yet existing models remain vulnerable to adversarial attacks, limiting their reliability. Due to the inherent constraints of weak supervision, where only video-level labels are provided despite the need for frame-level predictions, traditional adversarial defense mechanisms, such as adversarial training, are not effective since video-level adversarial perturbations are typically weak and inadequate. To address this limitation, pseudo-labels generated directly from the model can enable frame-level adversarial training; however, these pseudo-labels are inherently noisy, significantly degrading performance. We therefore introduce a novel Pseudo-Anomaly Generation method called Spatiotemporal Region Distortion (SRD), which creates synthetic anomalies by applying severe augmentations to localized regions in normal videos while preserving temporal consistency. Integrating these precisely annotated synthetic anomalies with the noisy pseudo-labels substantially reduces label noise, enabling effective adversarial training. Extensive experiments demonstrate that our method significantly enhances the robustness of WSVAD models against adversarial attacks, outperforming state-of-the-art methods by an average of 71.0% in overall AUROC performance across multiple benchmarks. The implementation and code are publicly available at this https URL.

[LG-15] Probe-based Fine-tuning for Reducing Toxicity

链接: https://arxiv.org/abs/2510.21531
作者: Jan Wehner,Mario Fritz
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Probes trained on model activations can detect undesirable behaviors like deception or biases that are difficult to identify from outputs alone. This makes them useful detectors to identify misbehavior. Furthermore, they are also valuable training signals, since they not only reward outputs, but also good internal processes for arriving at that output. However, training against interpretability tools raises a fundamental concern: when a monitor becomes a training target, it may cease to be reliable (Goodhart's Law). We propose two methods for training against probes based on Supervised Fine-tuning and Direct Preference Optimization. We conduct an initial exploration of these methods in a testbed for reducing toxicity and evaluate the amount by which probe accuracy drops when training against them. To retain the accuracy of probe-detectors after training, we attempt (1) to train against an ensemble of probes, (2) retain held-out probes that aren't used for training, and (3) retrain new probes after training. First, probe-based preference optimization unexpectedly preserves probe detectability better than classifier-based methods, suggesting the preference learning objective incentivizes maintaining rather than obfuscating relevant representations. Second, probe diversity provides minimal practical benefit: simply retraining probes after optimization recovers high detection accuracy. Our findings suggest probe-based training can be viable for certain alignment methods, though probe ensembles are largely unnecessary when retraining is feasible.

[LG-16] A Unified Model for Multi-Task Drone Routing in Post-Disaster Road Assessment

链接: https://arxiv.org/abs/2510.21525
作者: Huatian Gong,Jiuh-Biing Sheu,Zheng Wang,Xiaoguang Yang,Ran Yan
类目: Machine Learning (cs.LG)
*备注: 34 pages, 8 figures,9 tables

点击查看摘要

Abstract:Post-disaster road assessment (PDRA) is essential for emergency response, enabling rapid evaluation of infrastructure conditions and efficient allocation of resources. Although drones provide a flexible and effective tool for PDRA, routing them in large-scale networks remains challenging. Traditional optimization methods scale poorly and demand domain expertise, while existing deep reinforcement learning (DRL) approaches adopt a single-task paradigm, requiring separate models for each problem variant and lacking adaptability to evolving operational needs. This study proposes a unified model (UM) for drone routing that simultaneously addresses eight PDRA variants. By training a single neural network across multiple problem configurations, UM captures shared structural knowledge while adapting to variant-specific constraints through a modern transformer encoder-decoder architecture. A lightweight adapter mechanism further enables efficient finetuning to unseen attributes without retraining, enhancing deployment flexibility in dynamic disaster scenarios. Extensive experiments demonstrate that the UM reduces training time and parameters by a factor of eight compared with training separate models, while consistently outperforming single-task DRL methods by 6–14% and traditional optimization approaches by 24–82% in terms of solution quality (total collected information value). The model achieves real-time solutions (1–10 seconds) across networks of up to 1,000 nodes, with robustness confirmed through sensitivity analyses. Moreover, finetuning experiments show that unseen attributes can be effectively incorporated with minimal cost while retaining high solution quality. The proposed UM advances neural combinatorial optimization for time-critical applications, offering a computationally efficient, high-quality, and adaptable solution for drone-based PDRA.

[LG-17] Surrogate-based quantification of policy uncertainty in generative flow networks

链接: https://arxiv.org/abs/2510.21523
作者: Ramón Nartallo-Kaluarachchi,Robert Manson-Sawko,Shashanka Ubaru,Dongsung Huh,Małgorzata J Zimoń,Lior Horesh,Yoshua Bengio
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 18 pages, 6 figures

点击查看摘要

Abstract:Generative flow networks are able to sample, via sequential construction, high-reward, complex objects according to a reward function. However, such reward functions are often estimated approximately from noisy data, leading to epistemic uncertainty in the learnt policy. We present an approach to quantify this uncertainty by constructing a surrogate model composed of a polynomial chaos expansion, fit on a small ensemble of trained flow networks. This model learns the relationship between reward functions, parametrised in a low-dimensional space, and the probability distributions over actions at each step along a trajectory of the flow network. The surrogate model can then be used for inexpensive Monte Carlo sampling to estimate the uncertainty in the policy given uncertain rewards. We illustrate the performance of our approach on a discrete and continuous grid-world, symbolic regression, and a Bayesian structure learning task.
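A minimal sketch of the surrogate construction: fit a one-dimensional polynomial chaos expansion by least squares on a small ensemble of (reward parameter, policy statistic) pairs, then Monte Carlo sample through the cheap surrogate; the toy policy statistic, ensemble size, and polynomial degree are assumptions.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermevander

rng = np.random.default_rng(7)

# Pretend each theta parametrises a reward function and policy_stat(theta) is an
# action probability read off a trained flow network (the expensive quantity).
policy_stat = lambda theta: 1 / (1 + np.exp(-2 * theta)) + 0.01 * rng.normal(size=theta.shape)

theta_train = rng.normal(size=16)            # small ensemble of trained networks
y_train = policy_stat(theta_train)

degree = 4
Phi = hermevander(theta_train, degree)       # probabilists' Hermite design matrix
coeffs, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)

# Inexpensive Monte Carlo through the surrogate to estimate policy uncertainty.
theta_mc = rng.normal(size=100_000)
y_mc = hermevander(theta_mc, degree) @ coeffs
print("mean:", y_mc.mean(), "std:", y_mc.std())
```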

[LG-18] Uniform Convergence Beyond Glivenko-Cantelli

链接: https://arxiv.org/abs/2510.21506
作者: Tanmay Devale,Pramith Devulapalli,Steve Hanneke
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We characterize conditions under which collections of distributions on $\{0,1\}^{\mathbb{N}}$ admit uniform estimation of their mean. Prior work from Vapnik and Chervonenkis (1971) has focused on uniform convergence using the empirical mean estimator, leading to the principle known as P-Glivenko-Cantelli. We extend this framework by moving beyond the empirical mean estimator and introducing Uniform Mean Estimability, also called UME-learnability, which captures when a collection permits uniform mean estimation by any arbitrary estimator. We work on the space created by the mean vectors of the collection of distributions. For each distribution, the mean vector records the expected value in each coordinate. We show that separability of the mean vectors is a sufficient condition for UME-learnability. However, we show that separability of the mean vectors is not necessary for UME-learnability by constructing a collection of distributions whose mean vectors are non-separable yet UME-learnable using techniques fundamentally different from those used in our separability-based analysis. Finally, we establish that countable unions of UME-learnable collections are also UME-learnable, solving a conjecture posed in Cohen et al. (2025).

[LG-19] Benchmarking Catastrophic Forgetting Mitigation Methods in Federated Time Series Forecasting

链接: https://arxiv.org/abs/2510.21491
作者: Khaled Hallak,Oudom Kem
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (stat.ML)
*备注: Accepted for presentation at the FLTA 2025 Conference on Federated Learning. This version corresponds to the camera-ready author manuscript

点击查看摘要

Abstract:Catastrophic forgetting (CF) poses a persistent challenge in continual learning (CL), especially within federated learning (FL) environments characterized by non-i.i.d. time series data. While existing research has largely focused on classification tasks in vision domains, the regression-based forecasting setting prevalent in IoT and edge applications remains underexplored. In this paper, we present the first benchmarking framework tailored to investigate CF in federated continual time series forecasting. Using the Beijing Multi-site Air Quality dataset across 12 decentralized clients, we systematically evaluate several CF mitigation strategies, including Replay, Elastic Weight Consolidation, Learning without Forgetting, and Synaptic Intelligence. Key contributions include: (i) introducing a new benchmark for CF in time series FL, (ii) conducting a comprehensive comparative analysis of state-of-the-art methods, and (iii) releasing a reproducible open-source framework. This work provides essential tools and insights for advancing continual learning in federated time-series forecasting systems.

[LG-20] Parameter-Free Hypergraph Neural Network for Few-Shot Node Classification

链接: https://arxiv.org/abs/2510.21462
作者: Chaewoon Bae,Doyun Choi,Jaehyun Lee,Jaemin Yoo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Few-shot node classification on hypergraphs requires models that generalize from scarce labels while capturing high-order structures. Existing hypergraph neural networks (HNNs) effectively encode such structures but often suffer from overfitting and scalability issues due to complex, black-box architectures. In this work, we propose ZEN (Zero-Parameter Hypergraph Neural Network), a fully linear and parameter-free model that achieves both expressiveness and efficiency. Built upon a unified formulation of linearized HNNs, ZEN introduces a tractable closed-form solution for the weight matrix and a redundancy-aware propagation scheme to avoid iterative training and to eliminate redundant self-information. On 11 real-world hypergraph benchmarks, ZEN consistently outperforms eight baseline models in classification accuracy while achieving up to 696x speedups over the fastest competitor. Moreover, the decision process of ZEN is fully interpretable, providing insights into the characteristics of a dataset. Our code and datasets are fully available at this https URL.

[LG-21] Risk Management for Mitigating Benchmark Failure Modes: BenchRisk NEURIPS2025

链接: https://arxiv.org/abs/2510.21460
作者: Sean McGregor,Victor Lu,Vassil Tashev,Armstrong Foundjem,Aishwarya Ramasethu,Sadegh AlMahdi Kazemi Zarkouei,Chris Knotz,Kongtao Chen,Alicia Parrish,Anka Reuel,Heather Frase
类目: Software Engineering (cs.SE); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 19 pages, 7 figures, to be published in the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

点击查看摘要

Abstract:Large language model (LLM) benchmarks inform LLM use decisions (e.g., “is this LLM safe to deploy for my use case and context?”). However, benchmarks may be rendered unreliable by various failure modes that impact benchmark bias, variance, coverage, or people’s capacity to understand benchmark evidence. Using the National Institute of Standards and Technology’s risk management process as a foundation, this research iteratively analyzed 26 popular benchmarks, identifying 57 potential failure modes and 196 corresponding mitigation strategies. The mitigations reduce failure likelihood and/or severity, providing a frame for evaluating “benchmark risk,” which is scored to provide a metaevaluation benchmark: BenchRisk. Higher scores indicate that benchmark users are less likely to reach an incorrect or unsupported conclusion about an LLM. All 26 scored benchmarks present significant risk within one or more of the five scored dimensions (comprehensiveness, intelligibility, consistency, correctness, and longevity), which points to important open research directions for the field of LLM benchmarking. The BenchRisk workflow allows for comparison between benchmarks; as an open-source tool, it also facilitates the identification and sharing of risks and their mitigations.

[LG-22] Estimating Treatment Effects in Networks using Domain Adversarial Training

链接: https://arxiv.org/abs/2510.21457
作者: Daan Caljon,Jente Van Belle,Wouter Verbeke
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Estimating heterogeneous treatment effects in network settings is complicated by interference, meaning that the outcome of an instance can be influenced by the treatment status of others. Existing causal machine learning approaches usually assume a known exposure mapping that summarizes how the outcome of a given instance is influenced by others’ treatment, a simplification that is often unrealistic. Furthermore, the interaction between homophily – the tendency of similar instances to connect – and the treatment assignment mechanism can induce a network-level covariate shift that may lead to inaccurate treatment effect estimates, a phenomenon that has not yet been explicitly studied. To address these challenges, we propose HINet, a novel method that integrates graph neural networks with domain adversarial training. This combination allows estimating treatment effects under unknown exposure mappings while mitigating the impact of (network-level) covariate shift. An extensive empirical evaluation on synthetic and semi-synthetic network datasets demonstrates the effectiveness of our approach.

[LG-23] owards Explainable Personalized Recommendations by Learning from Users Photos

链接: https://arxiv.org/abs/2510.21455
作者: Jorge Díez,Pablo Pérez-Núñez,Oscar Luaces,Beatriz Remeseiro,Antonio Bahamonde
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Explaining the output of a complex system, such as a Recommender System (RS), is becoming of utmost importance for both users and companies. In this paper we explore the idea that personalized explanations can be learned as recommendations themselves. There are plenty of online services where users can upload photos, in addition to rating items. We assume that users take these photos to reinforce or justify their opinions about the items. For this reason we try to predict what photo a user would take of an item, because that image is the argument that can best convince her of the item's qualities. In this sense, an RS can explain its results and, therefore, increase its reliability. Furthermore, once we have a model to predict attractive images for users, we can estimate their distribution. Thus, companies acquire vivid knowledge about the aspects of their products that clients highlight. The paper includes a formal framework that estimates the authorship probability for a given pair (user, photo). To illustrate the proposal, we use data gathered from TripAdvisor containing the reviews (with photos) of restaurants in six cities of different sizes.

[LG-24] ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models

链接: https://arxiv.org/abs/2510.21450
作者: Federico Danieli,Pau Rodriguez,Miguel Sarabia,Xavier Suau,Luca Zappella
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recurrent Neural Networks (RNNs) laid the foundation for sequence modeling, but their intrinsic sequential nature restricts parallel computation, creating a fundamental barrier to scaling. This has led to the dominance of parallelizable architectures like Transformers and, more recently, State Space Models (SSMs). While SSMs achieve efficient parallelization through structured linear recurrences, this linearity constraint limits their expressive power and precludes modeling complex, nonlinear sequence-wise dependencies. To address this, we present ParaRNN, a framework that breaks the sequence-parallelization barrier for nonlinear RNNs. Building on prior work, we cast the sequence of nonlinear recurrence relationships as a single system of equations, which we solve in parallel using Newton’s iterations combined with custom parallel reductions. Our implementation achieves speedups of up to 665x over naive sequential application, allowing training nonlinear RNNs at unprecedented scales. To showcase this, we apply ParaRNN to adaptations of LSTM and GRU architectures, successfully training models of 7B parameters that attain perplexity comparable to similarly-sized Transformers and Mamba2 architectures. To accelerate research in efficient sequence modeling, we release the ParaRNN codebase as an open-source framework for automatic training-parallelization of nonlinear RNNs, enabling researchers and practitioners to explore new nonlinear RNN models at scale.

[LG-25] Unified token representations for sequential decision models

链接: https://arxiv.org/abs/2510.21448
作者: Zhuojing Tian,Yushu Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Transformers have demonstrated strong potential in offline reinforcement learning (RL) by modeling trajectories as sequences of return-to-go, states, and actions. However, existing approaches such as the Decision Transformer (DT) and its variants suffer from redundant tokenization and quadratic attention complexity, limiting their scalability in real-time or resource-constrained settings. To address this, we propose a Unified Token Representation (UTR) that merges return-to-go, state, and action into a single token, substantially reducing sequence length and model complexity. Theoretical analysis shows that UTR leads to a tighter Rademacher complexity bound, suggesting improved generalization. We further develop two variants: UDT and UDC, built upon transformer and gated CNN backbones, respectively. Both achieve comparable or superior performance to state-of-the-art methods with markedly lower computation. These findings demonstrate that UTR generalizes well across architectures and may provide an efficient foundation for scalable control in future large decision models.

[LG-26] Scalable Neural Incentive Design with Parameterized Mean-Field Approximation NEURIPS2025

链接: https://arxiv.org/abs/2510.21442
作者: Nathan Corecco,Batuhan Yardim,Vinzenz Thoma,Zebang Shen,Niao He
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注: 52 pages, to appear at NeurIPS 2025

点击查看摘要

Abstract:Designing incentives for a multi-agent system to induce a desirable Nash equilibrium is both a crucial and challenging problem appearing in many decision-making domains, especially for a large number of agents $N$. Under the exchangeability assumption, we formalize this incentive design (ID) problem as a parameterized mean-field game (PMFG), aiming to reduce complexity via an infinite-population limit. We first show that when dynamics and rewards are Lipschitz, the finite-$N$ ID objective is approximated by the PMFG at rate $\mathscr{O}(1/\sqrt{N})$. Moreover, beyond the Lipschitz-continuous setting, we prove the same $\mathscr{O}(1/\sqrt{N})$ decay for the important special case of sequential auctions, despite discontinuities in dynamics, through a tailored auction-specific analysis. Built on our novel approximation results, we further introduce our Adjoint Mean-Field Incentive Design (AMID) algorithm, which uses explicit differentiation of iterated equilibrium operators to compute gradients efficiently. By uniting approximation bounds with optimization guarantees, AMID delivers a powerful, scalable algorithmic tool for many-agent (large $N$) ID. Across diverse auction settings, the proposed AMID method substantially increases revenue over first-price formats and outperforms existing benchmark methods.

[LG-27] Causality Meets Locality: Provably Generalizable and Scalable Policy Learning for Networked Systems NEURIPS2025

链接: https://arxiv.org/abs/2510.21427
作者: Hao Liang,Shuqing Shi,Yudi Zhang,Biwei Huang,Yali Du
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2025 (Spotlight)

点击查看摘要

Abstract:Large-scale networked systems, such as traffic, power, and wireless grids, challenge reinforcement-learning agents with both scale and environment shifts. To address these challenges, we propose GSAC (Generalizable and Scalable Actor-Critic), a framework that couples causal representation learning with meta actor-critic learning to achieve both scalability and domain generalization. Each agent first learns a sparse local causal mask that provably identifies the minimal neighborhood variables influencing its dynamics, yielding exponentially tight approximately compact representations (ACRs) of state and domain factors. These ACRs bound the error of truncating value functions to $\kappa$-hop neighborhoods, enabling efficient learning on graphs. A meta actor-critic then trains a shared policy across multiple source domains while conditioning on the compact domain factors; at test time, a few trajectories suffice to estimate the new domain factor and deploy the adapted policy. We establish finite-sample guarantees on causal recovery, actor-critic convergence, and adaptation gap, and show that GSAC adapts rapidly and significantly outperforms learning-from-scratch and conventional adaptation baselines.

[LG-28] A Rapid Physics-Informed Machine Learning Framework Based on Extreme Learning Machine for Inverse Stefan Problems

链接: https://arxiv.org/abs/2510.21426
作者: Pei-Zhi Zhuang,Ming-Yue Yang,Fei Ren,Hong-Ya Yue,He Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The inverse Stefan problem, as a typical phase-change problem with moving boundaries, finds extensive applications in science and engineering. Recent years have seen the application of physics-informed neural networks (PINNs) to solving Stefan problems, yet they still exhibit shortcomings in hyperparameter dependency, training efficiency, and prediction accuracy. To address this, this paper develops a physics-informed extreme learning machine (PIELM), a rapid physics-informed learning framework for inverse Stefan problems. PIELM replaces conventional deep neural networks with an extreme learning machine network. The input weights are fixed in the PIELM framework, and the output weights are determined by optimizing a loss vector of physical laws composed of initial and boundary conditions and the governing partial differential equations (PDEs). Solving inverse Stefan problems is thereby transformed into finding the Moore-Penrose generalized inverse by the least squares method. Case studies show that PIELM can increase the prediction accuracy by 3-7 orders of magnitude in terms of the relative $L_2$ error, while saving more than 94% of training time, compared to conventional PINNs.
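
The least-squares core of an extreme learning machine can be sketched as below; this toy assumes plain function fitting, whereas PIELM assembles the least-squares system from PDE, initial-condition, and boundary-condition residuals.

```python
# Sketch: fixed random hidden layer, output weights by least squares.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)[:, None]
y = np.sin(2 * np.pi * x).ravel()

# Random input weights and biases -- never trained.
W = rng.normal(size=(1, 100))
b = rng.normal(size=100)
H = np.tanh(x @ W + b)                  # hidden-layer features

# Output weights via the Moore-Penrose pseudoinverse (least squares).
beta = np.linalg.lstsq(H, y, rcond=None)[0]
print("max abs error:", np.abs(H @ beta - y).max())
```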

[LG-29] Self-diffusion for Solving Inverse Problems

链接: https://arxiv.org/abs/2510.21417
作者: Guanxiong Luo,Shoujin Huang,Yanlong Yang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose self-diffusion, a novel framework for solving inverse problems without relying on pretrained generative models. Traditional diffusion-based approaches require training a model on a clean dataset to learn to reverse the forward noising process. This model is then used to sample clean solutions – corresponding to posterior sampling from a Bayesian perspective – that are consistent with the observed data under a specific task. In contrast, self-diffusion introduces a self-contained iterative process that alternates between noising and denoising steps to progressively refine its estimate of the solution. At each step of self-diffusion, noise is added to the current estimate, and a self-denoiser, which is a single untrained convolutional network randomly initialized from scratch, is continuously trained for certain iterations via a data fidelity loss to predict the solution from the noisy estimate. Essentially, self-diffusion exploits the spectral bias of neural networks and modulates it through a scheduled noise process. Without relying on pretrained score functions or external denoisers, this approach still remains adaptive to arbitrary forward operators and noisy observations, making it highly flexible and broadly applicable. We demonstrate the effectiveness of our approach on a variety of linear inverse problems, showing that self-diffusion achieves competitive or superior performance compared to other methods.
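
A rough sketch of the alternating noise/denoise loop follows, assuming a toy 1-D problem with an identity forward operator and an illustrative noise schedule; the real method targets general linear inverse problems with convolutional self-denoisers.

```python
# Sketch: self-diffusion with an untrained MLP as the self-denoiser.
import torch
import torch.nn as nn

torch.manual_seed(0)
A = lambda u: u                           # forward operator (identity here)
y = torch.sin(torch.linspace(0, 6.28, 64)) + 0.3 * torch.randn(64)  # observation

net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

x = torch.zeros(64)                       # current estimate of the solution
for step in range(50):
    sigma = 0.5 * (1 - step / 50)         # scheduled, decaying noise level
    x_noisy = x + sigma * torch.randn(64) # noising step
    for _ in range(20):                   # briefly train the self-denoiser
        opt.zero_grad()
        loss = ((A(net(x_noisy)) - y) ** 2).mean()  # data-fidelity loss only
        loss.backward()
        opt.step()
    x = net(x_noisy).detach()             # denoising step refines the estimate
```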

[LG-30] On Local Limits of Sparse Random Graphs: Color Convergence and the Refined Configuration Model

链接: https://arxiv.org/abs/2510.21392
作者: Alexander Pluska,Sagar Malhotra
类目: Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Local convergence has emerged as a fundamental tool for analyzing sparse random graph models. We introduce a new notion of local convergence, color convergence, based on the Weisfeiler-Leman algorithm. Color convergence fully characterizes the class of random graphs that are well-behaved in the limit for message-passing graph neural networks. Building on this, we propose the Refined Configuration Model (RCM), a random graph model that generalizes the configuration model. The RCM is universal with respect to local convergence among locally tree-like random graph models, including Erdős-Rényi, stochastic block and configuration models. Finally, this framework enables a complete characterization of the random trees that arise as local limits of such graphs.

[LG-31] Cost-Sensitive Freeze-thaw Bayesian Optimization for Efficient Hyperparameter Tuning NEURIPS2025

链接: https://arxiv.org/abs/2510.21379
作者: Dong Bok Lee,Aoxuan Silvia Zhang,Byungjoo Kim,Junhyeon Park,Steven Adriaensen,Juho Lee,Sung Ju Hwang,Hae Beom Lee
类目: Machine Learning (cs.LG)
*备注: Published at NeurIPS 2025

点击查看摘要

Abstract:In this paper, we address the problem of \emph{cost-sensitive} hyperparameter optimization (HPO) built upon freeze-thaw Bayesian optimization (BO). Specifically, we assume a scenario where users want to early-stop the HPO process when the expected performance improvement is not satisfactory with respect to the additional computational cost. Motivated by this scenario, we introduce \emph{utility} in the freeze-thaw framework, a function describing the trade-off between the cost and performance that can be estimated from the user’s preference data. This utility function, combined with our novel acquisition function and stopping criterion, allows us to dynamically continue training the configuration that we expect to maximally improve the utility in the future, and also automatically stop the HPO process around the maximum utility. Further, we improve the sample efficiency of existing freeze-thaw methods with transfer learning to develop a specialized surrogate model for the cost-sensitive HPO problem. We validate our algorithm on established multi-fidelity HPO benchmarks and show that it outperforms all the previous freeze-thaw BO and transfer-BO baselines we consider, while achieving a significantly better trade-off between the cost and performance. Our code is publicly available at this https URL.

[LG-32] Randomized Neural Network with Adaptive Forward Regularization for Online Task-free Class Incremental Learning

链接: https://arxiv.org/abs/2510.21367
作者: Junda Wang,Minghui Hu,Ning Li,Abdulaziz Al-Ali,Ponnuthurai Nagaratnam Suganthan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Class incremental learning (CIL) requires an agent to learn distinct tasks consecutively while retaining knowledge against forgetting. Two problems impede the practical application of CIL methods: (1) non-i.i.d. batch streams with no boundary prompts for updates, known as the harsher online task-free CIL (OTCIL) scenario; and (2) memory loss when learning long task streams, as shown in Fig. 1 (a). To achieve efficient decision-making and decrease cumulative regret during the OTCIL process, a randomized neural network (Randomized NN) with forward regularization (-F) is proposed to resist forgetting and enhance learning performance. This general framework integrates unsupervised knowledge into recursive convex optimization, has no learning dissipation, and can outperform the canonical ridge style (-R) in OTCIL. Based on this framework, we derive the algorithm of the ensemble deep random vector functional link network (edRVFL) with adjustable forward regularization (-kF), where $k$ mediates the intensity of the intervention. edRVFL-kF generates one-pass closed-form incremental updates and variable learning rates, effectively avoiding past replay and catastrophic forgetting while achieving superior performance. Moreover, to curb unstable penalties caused by non-i.i.d. data and to avoid the intractable tuning of -kF in OTCIL, we improve it to the plug-and-play edRVFL-kF-Bayes, enabling all hard $k$ values in multiple sub-learners to be determined self-adaptively through Bayesian learning. Experiments were conducted on two image datasets covering six metrics, dynamic performance, ablation tests, and compatibility, which distinctly validate the efficacy of our OTCIL frameworks with -kF-Bayes and -kF styles.

[LG-33] Compositional Monte Carlo Tree Diffusion for Extendable Planning NEURIPS25

链接: https://arxiv.org/abs/2510.21361
作者: Jaesik Yoon,Hyeonseo Cho,Sungjin Ahn
类目: Machine Learning (cs.LG)
*备注: 24 pages, 4 figures, NeurIPS 25 Spotlight

点击查看摘要

Abstract:Monte Carlo Tree Diffusion (MCTD) integrates diffusion models with structured tree search to enable effective trajectory exploration through stepwise reasoning. However, MCTD remains fundamentally limited by training trajectory lengths. While periodic replanning allows plan concatenation for longer plan generation, the planning process remains locally confined, as MCTD searches within individual trajectories without access to global context. We propose Compositional Monte Carlo Tree Diffusion (C-MCTD), a framework that elevates planning from individual trajectory optimization to reasoning over complete plan compositions. C-MCTD introduces three complementary components: (1) Online Composer, which performs globally-aware planning by searching across entire plan compositions; (2) Distributed Composer, which reduces search complexity through parallel exploration from multiple starting points; and (3) Preplan Composer, which accelerates inference by leveraging cached plan graphs.

[LG-34] Robust Yield Curve Estimation for Mortgage Bonds Using Neural Networks

链接: https://arxiv.org/abs/2510.21347
作者: Sina Molavipour,Alireza M. Javid,Cassie Ye,Björn Löfdahl,Mikhail Nechaev
类目: Machine Learning (cs.LG); Risk Management (q-fin.RM)
*备注:

点击查看摘要

Abstract:Robust yield curve estimation is crucial in fixed-income markets for accurate instrument pricing, effective risk management, and informed trading strategies. Traditional approaches, including the bootstrapping method and parametric Nelson-Siegel models, often struggle with overfitting or instability issues, especially when underlying bonds are sparse, bond prices are volatile, or contain hard-to-remove noise. In this paper, we propose a neural network-based framework for robust yield curve estimation tailored to small mortgage bond markets. Our model estimates the yield curve independently for each day and introduces a new loss function to enforce smoothness and stability, addressing challenges associated with limited and noisy data. Empirical results on Swedish mortgage bonds demonstrate that our approach delivers more robust and stable yield curve estimates compared to existing methods such as Nelson-Siegel-Svensson (NSS) and Kernel-Ridge (KR). Furthermore, the framework allows for the integration of domain-specific constraints, such as alignment with risk-free benchmarks, enabling practitioners to balance the trade-off between smoothness and accuracy according to their needs.
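
A minimal sketch of fitting a yield curve with an explicit smoothness penalty, assuming a discrete curvature term on a dense maturity grid stands in for the paper's loss; the network, data, and weighting are all illustrative.

```python
# Sketch: data-fit loss plus a curvature penalty to keep the curve smooth.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
maturities = torch.rand(20, 1) * 10          # maturities in years
market_yields = 0.02 + 0.005 * maturities    # toy observed yields

def loss_fn(lam=1.0):
    fit = ((net(maturities) - market_yields) ** 2).mean()
    grid = torch.linspace(0.1, 10, 100).unsqueeze(1)
    y = net(grid)
    curvature = (y[2:] - 2 * y[1:-1] + y[:-2]) ** 2  # second difference
    return fit + lam * curvature.mean()

opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss_fn().backward()
    opt.step()
```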

[LG-35] SCORENF: Score-based Normalizing Flows for Sampling Unnormalized distributions

链接: https://arxiv.org/abs/2510.21330
作者: Vikas Kanaujia,Vipul Arora
类目: Machine Learning (cs.LG); High Energy Physics - Lattice (hep-lat); Quantum Physics (quant-ph)
*备注: © 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

点击查看摘要

Abstract:Unnormalized probability distributions are central to modeling complex physical systems across various scientific domains. Traditional sampling methods, such as Markov Chain Monte Carlo (MCMC), often suffer from slow convergence, critical slowing down, poor mode mixing, and high autocorrelation. In contrast, likelihood-based and adversarial machine learning models, though effective, are heavily data-driven, requiring large datasets and often encountering mode covering and mode collapse. In this work, we propose ScoreNF, a score-based learning framework built on the Normalizing Flow (NF) architecture, integrated with an Independent Metropolis-Hastings (IMH) module, enabling efficient and unbiased sampling from unnormalized target distributions. We show that ScoreNF maintains high performance even with small training ensembles, thereby reducing reliance on computationally expensive MCMC-generated training data. We also present a method for assessing mode-covering and mode-collapse behaviours. We validate our method on synthetic 2D distributions (MOG-4 and MOG-8) and the high-dimensional $\phi^4$ lattice field theory distribution, demonstrating its effectiveness for sampling tasks.
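
The Independent Metropolis-Hastings correction at the heart of such pipelines can be sketched as follows; this sketch assumes the flow exposes a `sample()` method returning a draw together with its log-density, and the Gaussian proposal below is only a stand-in for a trained NF.

```python
# Sketch: IMH accept/reject on top of an independent (flow) proposal.
import numpy as np

class GaussianProposal:
    """Stand-in for a trained normalizing flow with sample()/log-density."""
    def __init__(self, rng):
        self.rng = rng
    def sample(self):
        x = self.rng.normal()
        return x, -0.5 * x ** 2 - 0.5 * np.log(2 * np.pi)

def imh_chain(flow, log_target, n_steps, rng):
    x, logq = flow.sample()
    logp = log_target(x)
    samples = []
    for _ in range(n_steps):
        x_new, logq_new = flow.sample()
        logp_new = log_target(x_new)
        # Accept with probability min(1, [p(x')q(x)] / [p(x)q(x')]).
        if np.log(rng.uniform()) < (logp_new - logp) - (logq_new - logq):
            x, logp, logq = x_new, logp_new, logq_new
        samples.append(x)
    return np.array(samples)

rng = np.random.default_rng(0)
chain = imh_chain(GaussianProposal(rng),
                  lambda x: -0.5 * ((x - 1.0) / 0.5) ** 2, 5000, rng)
print(chain.mean())   # close to the target mean of 1.0
```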

[LG-36] Revisiting Social Welfare in Bandits: UCB is (Nearly) All You Need

链接: https://arxiv.org/abs/2510.21312
作者: Dhruv Sarkar,Nishant Pandey,Sayak Ray Chowdhury
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Regret in stochastic multi-armed bandits traditionally measures the difference between the highest reward and either the arithmetic mean of accumulated rewards or the final reward. These conventional metrics often fail to address fairness among agents receiving rewards, particularly in settings where rewards are distributed across a population, such as patients in clinical trials. To address this, a recent body of work has introduced Nash regret, which evaluates performance via the geometric mean of accumulated rewards, aligning with the Nash social welfare function known for satisfying fairness axioms. To minimize Nash regret, existing approaches require specialized algorithm designs and strong assumptions, such as multiplicative concentration inequalities and bounded, non-negative rewards, making them unsuitable even for Gaussian reward distributions. We demonstrate that an initial uniform exploration phase followed by a standard Upper Confidence Bound (UCB) algorithm achieves near-optimal Nash regret, while relying only on additive Hoeffding bounds, and naturally extending to sub-Gaussian rewards. Furthermore, we generalize the algorithm to a broad class of fairness metrics called the $p$-mean regret, proving (nearly) optimal regret bounds uniformly across all $p$ values. This is in contrast to prior work, which made extremely restrictive assumptions on the bandit instances and even then achieved suboptimal regret bounds.
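
A sketch of the recipe (uniform exploration followed by standard UCB), assuming Gaussian rewards; the warm-up length and confidence radius here are illustrative, not the paper's exact choices.

```python
# Sketch: uniform warm-up phase, then vanilla UCB.
import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.2, 0.5, 0.8])    # toy Gaussian arms
K, T = len(means), 10_000
counts, sums = np.zeros(K), np.zeros(K)

def pull(a):
    counts[a] += 1
    sums[a] += rng.normal(means[a], 1.0)

warmup = int(T ** (2 / 3))           # uniform exploration phase (length illustrative)
for t in range(warmup):
    pull(t % K)
for t in range(warmup, T):           # standard UCB afterwards
    ucb = sums / counts + np.sqrt(2 * np.log(t + 1) / counts)
    pull(int(np.argmax(ucb)))

print(counts)                        # the best arm dominates the pull counts
```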

[LG-37] Data as a Lever: A Neighbouring Datasets Perspective on Predictive Multiplicity

链接: https://arxiv.org/abs/2510.21303
作者: Prakhar Ganesh,Hsiang Hsu,Golnoosh Farnadi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multiplicity – the existence of distinct models with comparable performance – has received growing attention in recent years. While prior work has largely emphasized modelling choices, the critical role of data in shaping multiplicity has been comparatively overlooked. In this work, we introduce a neighbouring datasets framework to examine the most granular case: the impact of a single-data-point difference on multiplicity. Our analysis yields a seemingly counterintuitive finding: neighbouring datasets with greater inter-class distribution overlap exhibit lower multiplicity. This reversal of conventional expectations arises from a shared Rashomon parameter, and we substantiate it with rigorous proofs. Building on this foundation, we extend our framework to two practical domains: active learning and data imputation. For each, we establish natural extensions of the neighbouring datasets perspective, conduct the first systematic study of multiplicity in existing algorithms, and finally, propose novel multiplicity-aware methods, namely, multiplicity-aware data acquisition strategies for active learning and multiplicity-aware data imputation techniques.

[LG-38] Amortized Variational Inference for Partial-Label Learning: A Probabilistic Approach to Label Disambiguation

链接: https://arxiv.org/abs/2510.21300
作者: Tobias Fuchs,Nadja Klein
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Real-world data is frequently noisy and ambiguous. In crowdsourcing, for example, human annotators may assign conflicting class labels to the same instances. Partial-label learning (PLL) addresses this challenge by training classifiers when each instance is associated with a set of candidate labels, only one of which is correct. While early PLL methods approximate the true label posterior, they are often computationally intensive. Recent deep learning approaches improve scalability but rely on surrogate losses and heuristic label refinement. We introduce a novel probabilistic framework that directly approximates the posterior distribution over true labels using amortized variational inference. Our method employs neural networks to predict variational parameters from input data, enabling efficient inference. This approach combines the expressiveness of deep learning with the rigor of probabilistic modeling, while remaining architecture-agnostic. Theoretical analysis and extensive experiments on synthetic and real-world datasets demonstrate that our method achieves state-of-the-art performance in both accuracy and efficiency.

[LG-39] An Evidence-Based Post-Hoc Adjustment Framework for Anomaly Detection Under Data Contamination NEURIPS2025

链接: https://arxiv.org/abs/2510.21296
作者: Sukanya Patra,Souhaib Ben Taieb
类目: Machine Learning (cs.LG)
*备注: Accepted in the Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)

点击查看摘要

Abstract:Unsupervised anomaly detection (AD) methods typically assume clean training data, yet real-world datasets often contain undetected or mislabeled anomalies, leading to significant performance degradation. Existing solutions require access to the training pipelines, data or prior knowledge of the proportions of anomalies in the data, limiting their real-world applicability. To address this challenge, we propose EPHAD, a simple yet effective test-time adaptation framework that updates the outputs of AD models trained on contaminated datasets using evidence gathered at test time. Our approach integrates the prior knowledge captured by the AD model trained on contaminated datasets with evidence derived from multimodal foundation models like Contrastive Language-Image Pre-training (CLIP), classical AD methods like the Latent Outlier Factor or domain-specific knowledge. We illustrate the intuition behind EPHAD using a synthetic toy example and validate its effectiveness through comprehensive experiments across eight visual AD datasets, twenty-six tabular AD datasets, and a real-world industrial AD dataset. Additionally, we conduct an ablation study to analyse hyperparameter influence and robustness to varying contamination levels, demonstrating the versatility and robustness of EPHAD across diverse AD models and evidence pairs. To ensure reproducibility, our code is publicly available at this https URL.

[LG-40] Additive Models Explained: A Computational Complexity Approach NEURIPS2025

链接: https://arxiv.org/abs/2510.21292
作者: Shahaf Bassan,Michal Moshkovitz,Guy Katz
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC)
*备注: To appear in NeurIPS 2025

点击查看摘要

Abstract:Generalized Additive Models (GAMs) are commonly considered interpretable within the ML community, as their structure makes the relationship between inputs and outputs relatively understandable. Therefore, it may seem natural to hypothesize that obtaining meaningful explanations for GAMs could be performed efficiently and would not be computationally infeasible. In this work, we challenge this hypothesis by analyzing the computational complexity of generating different explanations for various forms of GAMs across multiple contexts. Our analysis reveals a surprisingly diverse landscape of both positive and negative complexity outcomes. Particularly, under standard complexity assumptions such as P $\neq$ NP, we establish several key findings: (1) in stark contrast to many other common ML models, the complexity of generating explanations for GAMs is heavily influenced by the structure of the input space; (2) the complexity of explaining GAMs varies significantly with the types of component models used - but interestingly, these differences only emerge under specific input domain settings; (3) significant complexity distinctions appear for obtaining explanations in regression tasks versus classification tasks in GAMs; and (4) expressing complex models like neural networks additively (e.g., as neural additive models) can make them easier to explain, though interestingly, this benefit appears only for certain explanation methods and input domains. Collectively, these results shed light on the feasibility of computing diverse explanations for GAMs, offering a rigorous theoretical picture of the conditions under which such computations are possible or provably hard.

[LG-41] Adaptive Data Selection for Multi-Layer Perceptron Training: A Sub-linear Value-Driven Method

链接: https://arxiv.org/abs/2510.21286
作者: Xiyang Zhang,Chen Liang,Haoxuan Qiu,Hongzhi Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data selection is one of the fundamental problems in neural network training, particularly for multi-layer perceptrons (MLPs) where identifying the most valuable training samples from massive, multi-source, and heterogeneous data sources under budget constraints poses significant challenges. Existing data selection methods, including coreset construction, data Shapley values, and influence functions, suffer from critical limitations: they oversimplify nonlinear transformations, ignore informative intermediate representations in hidden layers, or fail to scale to larger MLPs due to high computational complexity. In response, we propose DVC (Data Value Contribution), a novel budget-aware method for evaluating and selecting data for MLP training that accounts for the dynamic evolution of network parameters during training. The DVC method decomposes data contribution into Layer Value Contribution (LVC) and Global Value Contribution (GVC), employing six carefully designed metrics and corresponding efficient algorithms to capture data characteristics across three dimensions–quality, relevance, and distributional diversity–at different granularities. DVC integrates these assessments with an Upper Confidence Bound (UCB) algorithm for adaptive source selection that balances exploration and exploitation. Extensive experiments across six datasets and eight baselines demonstrate that our method consistently outperforms existing approaches under various budget constraints, achieving superior accuracy and F1 scores. Our approach represents the first systematic treatment of hierarchical data evaluation for neural networks, providing both theoretical guarantees and practical advantages for large-scale machine learning systems.

[LG-42] Sensor-Specific Transformer (PatchTST) Ensembles with Test-Matched Augmentation

链接: https://arxiv.org/abs/2510.21282
作者: Pavankumar Chandankar,Robin Burchard
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a noise-aware, sensor-specific ensemble approach for robust human activity recognition on the 2nd WEAR Dataset Challenge. Our method leverages the PatchTST transformer architecture, training four independent models-one per inertial sensor location-on a tampered training set whose 1-second sliding windows are augmented to mimic the test-time noise. By aligning the train and test data schemas (JSON-encoded 50-sample windows) and applying randomized jitter, scaling, rotation, and channel dropout, each PatchTST model learns to generalize across real-world sensor perturbations. At inference, we compute softmax probabilities from all four sensor models on the Kaggle test set and average them to produce final labels. On the private leaderboard, this pipeline achieves a macro-F1 substantially above the baseline, demonstrating that test-matched augmentation combined with transformer-based ensembling is an effective strategy for robust HAR under noisy conditions.
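
A sketch of the test-matched augmentations and softmax averaging follows, assuming 50-sample, 3-channel windows and a toy axis-permutation/sign-flip in place of a full 3-D rotation; the model stand-ins are placeholders for trained PatchTST networks.

```python
# Sketch: noise-matched augmentation and sensor-wise ensemble averaging.
import numpy as np

def augment(window, rng):
    w = window * rng.uniform(0.9, 1.1)            # random scaling
    w = w + rng.normal(0.0, 0.05, size=w.shape)   # jitter
    perm = rng.permutation(3)
    signs = rng.choice([-1, 1], size=3)
    w = w[:, perm] * signs                        # crude stand-in for rotation
    if rng.uniform() < 0.1:
        w[:, rng.integers(3)] = 0.0               # channel dropout
    return w

def ensemble_predict(models, windows_per_sensor):
    # Average softmax probabilities across the sensor-specific models.
    probs = [m(w) for m, w in zip(models, windows_per_sensor)]
    return int(np.mean(probs, axis=0).argmax())

rng = np.random.default_rng(0)
window = augment(rng.normal(size=(50, 3)), rng)   # one 1-second window
models = [lambda w: np.full(8, 1 / 8) for _ in range(4)]  # toy stand-ins
print(ensemble_predict(models, [window] * 4))
```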

[LG-43] Relieving the Over-Aggregating Effect in Graph Transformers NEURIPS2025

链接: https://arxiv.org/abs/2510.21267
作者: Junshu Sun,Wanxing Chang,Chenxue Yang,Qingming Huang,Shuhui Wang
类目: Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:Graph attention has demonstrated superior performance in graph learning tasks. However, learning from global interactions can be challenging due to the large number of nodes. In this paper, we discover a new phenomenon termed over-aggregating. Over-aggregating arises when a large volume of messages is aggregated into a single node with less discrimination, leading to the dilution of the key messages and potential information loss. To address this, we propose Wideformer, a plug-and-play method for graph attention. Wideformer divides the aggregation of all nodes into parallel processes and guides the model to focus on specific subsets of these processes. The division can limit the input volume per aggregation, avoiding message dilution and reducing information loss. The guiding step sorts and weights the aggregation outputs, prioritizing the informative messages. Evaluations show that Wideformer can effectively mitigate over-aggregating. As a result, the backbone methods can focus on the informative messages, achieving superior performance compared to baseline methods.

[LG-44] PINN Balls: Scaling Second-Order Methods for PINNs with Domain Decomposition and Adaptive Sampling

链接: https://arxiv.org/abs/2510.21262
作者: Andrea Bonfanti,Ismael Medina,Roman List,Björn Staeves,Roberto Santana,Marco Ellero
类目: Machine Learning (cs.LG)
*备注: Accepted Conference Paper

点击查看摘要

Abstract:Recent advances in Scientific Machine Learning have shown that second-order methods can enhance the training of Physics-Informed Neural Networks (PINNs), making them a suitable alternative to traditional numerical methods for Partial Differential Equations (PDEs). However, second-order methods induce large memory requirements, making them scale poorly with the model size. In this paper, we define a local Mixture of Experts (MoE) combining the parameter-efficiency of ensemble models and sparse coding to enable the use of second-order training. Our model, \textsc{PINN Balls}, also features a fully learnable domain decomposition (DD) structure, achieved through the use of Adversarial Adaptive Sampling (AAS), which adapts the DD to the PDE and its domain. \textsc{PINN Balls} achieves better accuracy than the state-of-the-art in scientific machine learning, while maintaining invaluable scalability properties and drawing from a sound theoretical background.

[LG-45] Unified Implementations of Recurrent Neural Networks in Multiple Deep Learning Frameworks

链接: https://arxiv.org/abs/2510.21252
作者: Francesco Martinuzzi
类目: Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Recurrent neural networks (RNNs) are a cornerstone of sequence modeling across various scientific and industrial applications. Owing to their versatility, numerous RNN variants have been proposed over the past decade, aiming to improve the modeling of long-term dependencies and to address challenges such as vanishing and exploding gradients. However, no central library is available to test these variations, and reimplementing diverse architectures can be time-consuming and error-prone, limiting reproducibility and exploration. Here, we introduce three open-source libraries in Julia and Python that centralize numerous recurrent cell implementations and higher-level recurrent architectures. torchrecurrent, this http URL, and this http URL offer a consistent framework for constructing and extending RNN models, providing built-in mechanisms for customization and experimentation. All packages are available under the MIT license and actively maintained on GitHub.

[LG-46] Convergence of Stochastic Gradient Langevin Dynamics in the Lazy Training Regime

链接: https://arxiv.org/abs/2510.21245
作者: Noah Oberweis,Semih Cayci
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:Continuous-time models provide important insights into the training dynamics of optimization algorithms in deep learning. In this work, we establish a non-asymptotic convergence analysis of stochastic gradient Langevin dynamics (SGLD), which is an Itô stochastic differential equation (SDE) approximation of stochastic gradient descent in continuous time, in the lazy training regime. We show that, under regularity conditions on the Hessian of the loss function, SGLD with multiplicative and state-dependent noise (i) yields a non-degenerate kernel throughout the training process with high probability, and (ii) achieves exponential convergence to the empirical risk minimizer in expectation, and we establish finite-time and finite-width bounds on the optimality gap. We corroborate our theoretical findings with numerical examples in the regression setting.
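
For reference, a discrete SGLD update looks as follows; this sketch assumes a constant step size and isotropic noise, whereas the paper analyzes the continuous-time SDE limit with multiplicative, state-dependent noise.

```python
# Sketch: SGLD = gradient step plus injected Gaussian noise.
import torch

def sgld_step(params, loss, lr=1e-2):
    """One discrete SGLD update on a list of leaf tensors."""
    loss.backward()
    with torch.no_grad():
        for p in params:
            p -= lr * p.grad + (2 * lr) ** 0.5 * torch.randn_like(p)
            p.grad = None

w = torch.zeros(2, requires_grad=True)
target = torch.tensor([1.0, -1.0])
for _ in range(1000):
    sgld_step([w], ((w - target) ** 2).sum())
print(w)   # fluctuates around the minimizer [1, -1]
```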

[LG-47] How Hard is it to Confuse a World Model?

链接: https://arxiv.org/abs/2510.21232
作者: Waris Radji(Scool, CRIStAL),Odalric-Ambrym Maillard(Scool, CRIStAL)
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In reinforcement learning (RL) theory, the concept of most confusing instances is central to establishing regret lower bounds, that is, the minimal exploration needed to solve a problem. Given a reference model and its optimal policy, a most confusing instance is the statistically closest alternative model that makes a suboptimal policy optimal. While this concept is well-studied in multi-armed bandits and ergodic tabular Markov decision processes, constructing such instances remains an open question in the general case. In this paper, we formalize this problem for neural network world models as a constrained optimization: finding a modified model that is statistically close to the reference one, while producing divergent performance between optimal and suboptimal policies. We propose an adversarial training procedure to solve this problem and conduct an empirical study across world models of varying quality. Our results suggest that the degree of achievable confusion correlates with uncertainty in the approximate model, which may inform theoretically-grounded exploration strategies for deep model-based RL.

[LG-48] Model Merging with Functional Dual Anchors

链接: https://arxiv.org/abs/2510.21223
作者: Kexuan Shi,Yandong Wen,Weiyang Liu
类目: Machine Learning (cs.LG)
*备注: Technical report (23 pages, 15 figures, project page: this https URL )

点击查看摘要

Abstract:Model merging is an efficient post-training strategy for integrating knowledge from multiple finetuned checkpoints of a shared foundation model. Existing methods operate in the parameter space, combining task vectors to mitigate conflicts, but remain constrained by parameter inconsistencies. We propose Functional Dual Anchors (FDAs), a framework that instead models the input-representation space. FDAs are synthetic inputs whose induced gradients align with task vectors, capturing task-specific functional shifts relative to the pretrained model. This perspective bridges joint multi-task training and post-hoc merging, offering both robustness and flexibility. We further introduce a principled initialization scheme and show that FDAs are complementary to parameter-space model merging. Comprehensive experiments demonstrate the effectiveness of FDAs in model merging.

[LG-49] On the flow matching interpretability

链接: https://arxiv.org/abs/2510.21210
作者: Francesco Pivi,Simone Gazza,Davide Evangelista,Roberto Amadini,Maurizio Gabbrielli
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Applied Physics (physics.app-ph); Computational Physics (physics.comp-ph)
*备注:

点击查看摘要

Abstract:Generative models based on flow matching have demonstrated remarkable success in various domains, yet they suffer from a fundamental limitation: the lack of interpretability in their intermediate generation steps. In fact these models learn to transform noise into data through a series of vector field updates, however the meaning of each step remains opaque. We address this problem by proposing a general framework constraining each flow step to be sampled from a known physical distribution. Flow trajectories are mapped to (and constrained to traverse) the equilibrium states of the simulated physical process. We implement this approach through the 2D Ising model in such a way that flow steps become thermal equilibrium points along a parametric cooling schedule. Our proposed architecture includes an encoder that maps discrete Ising configurations into a continuous latent space, a flow-matching network that performs temperature-driven diffusion, and a projector that returns to discrete Ising states while preserving physical constraints. We validate this framework across multiple lattice sizes, showing that it preserves physical fidelity while outperforming Monte Carlo generation in speed as the lattice size increases. In contrast with standard flow matching, each vector field represents a meaningful stepwise transition in the 2D Ising model’s latent space. This demonstrates that embedding physical semantics into generative flows transforms opaque neural trajectories into interpretable physical processes.

[LG-50] Adaptive Graph Mixture of Residual Experts: Unsupervised Learning on Diverse Graphs with Heterogeneous Specialization

链接: https://arxiv.org/abs/2510.21207
作者: Yunlong Chu,Minglai Shao,Zengyi Wo,Bing Hao,Yuhang Liu,Ruijie Wang,Jianxin Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) face a fundamental adaptability challenge: their fixed message-passing architectures struggle with the immense diversity of real-world graphs, where optimal computational strategies vary by local structure and task. While Mixture-of-Experts (MoE) offers a promising pathway to adaptability, existing graph MoE methods remain constrained by their reliance on supervised signals and instability when training heterogeneous experts. We introduce ADaMoRE (Adaptive Mixture of Residual Experts), a principled framework that enables robust, fully unsupervised training of heterogeneous MoE on graphs. ADaMoRE employs a backbone-residual expert architecture where foundational encoders provide stability while specialized residual experts capture diverse computational patterns. A structurally-aware gating network performs fine-grained node routing. The entire architecture is trained end-to-end using a unified unsupervised objective, which integrates a primary reconstruction task with an information-theoretic diversity regularizer to explicitly enforce functional specialization among the experts. Theoretical analysis confirms our design improves data efficiency and training stability. Extensive evaluation across 16 benchmarks validates ADaMoRE’s state-of-the-art performance in unsupervised node classification and few-shot learning, alongside superior generalization, training efficiency, and faster convergence on diverse graphs and tasks.

[LG-51] Mitra: Mixed Synthetic Priors for Enhancing Tabular Foundation Models NEURIPS2025

链接: https://arxiv.org/abs/2510.21204
作者: Xiyuan Zhang,Danielle C. Maddix,Junming Yin,Nick Erickson,Abdul Fatir Ansari,Boran Han,Shuai Zhang,Leman Akoglu,Christos Faloutsos,Michael W. Mahoney,Cuixiong Hu,Huzefa Rangwala,George Karypis,Bernie Wang
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2025. We released both classifier (autogluon/mitra-classifier) and regressor (autogluon/mitra-regressor) model weights on HuggingFace

点击查看摘要

Abstract:Since the seminal work of TabPFN, research on tabular foundation models (TFMs) based on in-context learning (ICL) has challenged long-standing paradigms in machine learning. Without seeing any real-world data, models pretrained on purely synthetic datasets generalize remarkably well across diverse datasets, often using only a moderate number of in-context examples. This shifts the focus in tabular machine learning from model architecture design to the design of synthetic datasets, or, more precisely, to the prior distributions that generate them. Yet the guiding principles for prior design remain poorly understood. This work marks the first attempt to address the gap. We systematically investigate and identify key properties of synthetic priors that allow pretrained TFMs to generalize well. Based on these insights, we introduce Mitra, a TFM trained on a curated mixture of synthetic priors selected for their diversity, distinctiveness, and performance on real-world tabular data. Mitra consistently outperforms state-of-the-art TFMs, such as TabPFNv2 and TabICL, across both classification and regression benchmarks, with better sample efficiency.

[LG-52] Online AUC Optimization Based on Second-order Surrogate Loss

链接: https://arxiv.org/abs/2510.21202
作者: JunRu Luo,Difei Cheng,Bo Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Area Under the Curve (AUC) is an important performance metric for classification tasks, particularly in class-imbalanced scenarios. However, directly optimizing the AUC presents significant challenges due to the non-convex and discontinuous nature of pairwise 0/1 losses, which are difficult to optimize, as well as the substantial memory cost of instance-wise storage, which creates bottlenecks in large-scale applications. To overcome these challenges, we propose a novel second-order surrogate loss based on the pairwise hinge loss, and develop an efficient online algorithm. Unlike conventional approaches that approximate each individual pairwise 0/1 loss term with an instance-wise surrogate function, our approach introduces a new paradigm that directly substitutes the entire aggregated pairwise loss with a surrogate loss function constructed from the first- and second-order statistics of the training data. Theoretically, while existing online AUC optimization algorithms typically achieve an $\mathcal{O}(\sqrt{T})$ regret bound, our method attains a tighter $\mathcal{O}(\ln T)$ bound. Furthermore, we extend the proposed framework to nonlinear settings through a kernel-based formulation. Extensive experiments on multiple benchmark datasets demonstrate the superior efficiency and effectiveness of the proposed second-order surrogate loss in optimizing online AUC performance.

[LG-53] Gen-Review: A Large-scale Dataset of AI-Generated (and Human-written) Peer Reviews

链接: https://arxiv.org/abs/2510.21192
作者: Luca Demetrio,Giovanni Apruzzese,Kathrin Grosse,Pavel Laskov,Emil Lupu,Vera Rimmer,Philine Widmer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:How does the progressive embracement of Large Language Models (LLMs) affect scientific peer reviewing? This multifaceted question is fundamental to the effectiveness – as well as to the integrity – of the scientific process. Recent evidence suggests that LLMs may have already been tacitly used in peer reviewing, e.g., at the 2024 International Conference on Learning Representations (ICLR). Furthermore, some efforts have been undertaken to explicitly integrate LLMs in peer reviewing by various editorial boards (including that of ICLR’25). To fully understand the utility and the implications of LLMs’ deployment for scientific reviewing, a comprehensive relevant dataset is strongly desirable. Despite some previous research on this topic, such a dataset has been lacking so far. We fill this gap by presenting GenReview, the largest dataset to date containing LLM-written reviews. Our dataset includes 81K reviews generated for all submissions to the 2018–2025 editions of ICLR by providing the LLM with three independent prompts: a negative, a positive, and a neutral one. GenReview is also linked to the respective papers and their original reviews, thereby enabling a broad range of investigations. To illustrate the value of GenReview, we explore a sample of intriguing research questions, namely: whether LLMs exhibit bias in reviewing (they do); whether LLM-written reviews can be automatically detected (so far, they can); whether LLMs can rigorously follow reviewing instructions (not always); and whether LLM-provided ratings align with decisions on paper acceptance or rejection (this holds true only for accepted papers). GenReview can be accessed at the following link: this https URL.

[LG-54] Instance-Adaptive Hypothesis Tests with Heterogeneous Agents

链接: https://arxiv.org/abs/2510.21178
作者: Flora C. Shi,Martin J. Wainwright,Stephen Bates
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:We study hypothesis testing over a heterogeneous population of strategic agents with private information. Any single test applied uniformly across the population yields statistical error that is sub-optimal relative to the performance of an oracle given access to the private information. We show how it is possible to design menus of statistical contracts that pair type-optimal tests with payoff structures, inducing agents to self-select according to their private information. This separating menu elicits agent types and enables the principal to match the oracle performance even without a priori knowledge of the agent type. Our main result fully characterizes the collection of all separating menus that are instance-adaptive, matching oracle performance for an arbitrary population of heterogeneous agents. We identify designs where information elicitation is essentially costless, requiring negligible additional expense relative to a single-test benchmark, while improving statistical performance. Our work establishes a connection between proper scoring rules and menu design, showing how the structure of the hypothesis test constrains the elicitable information. Numerical examples illustrate the geometry of separating menus and the improvements they deliver in error trade-offs. Overall, our results connect statistical decision theory with mechanism design, demonstrating how heterogeneity and strategic participation can be harnessed to improve efficiency in hypothesis testing.

[LG-55] Scalable Principal-Agent Contract Design via Gradient-Based Optimization

链接: https://arxiv.org/abs/2510.21177
作者: Tomer Galanti,Aarya Bookseller,Korok Ray
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study a bilevel \emph{max-max} optimization framework for principal-agent contract design, in which a principal chooses incentives to maximize utility while anticipating the agent’s best response. This problem, central to moral hazard and contract theory, underlies applications ranging from market design to delegated portfolio management, hedge fund fee structures, and executive compensation. While linear-quadratic models such as Holmström-Milgrom admit closed-form solutions, realistic environments with nonlinear utilities, stochastic dynamics, or high-dimensional actions generally do not. We introduce a generic algorithmic framework that removes this reliance on closed forms. Our method adapts modern machine learning techniques for bilevel optimization – using implicit differentiation with conjugate gradients (CG) – to compute hypergradients efficiently through Hessian-vector products, without ever forming or inverting Hessians. In benchmark CARA-Normal (Constant Absolute Risk Aversion with Gaussian distribution of uncertainty) environments, the approach recovers known analytical optima and converges reliably from random initialization. More broadly, because it is matrix-free, variance-reduced, and problem-agnostic, the framework extends naturally to complex nonlinear contracts where closed-form solutions are unavailable, such as sigmoidal wage schedules (logistic pay), relative-performance/tournament compensation with common shocks, multi-task contracts with vector actions and heterogeneous noise, and CARA-Poisson count models with $\mathbb{E}[X \mid a] = e^a$. This provides a new computational tool for contract design, enabling systematic study of models that have remained analytically intractable.

[LG-56] A visual big data system for the prediction of weather-related variables: Jordan-Spain case study

链接: https://arxiv.org/abs/2510.21176
作者: Shadi Aljawarneh,Juan A. Lara,Muneer Bani Yassein
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Meteorology is a field where huge amounts of data are generated, mainly collected by sensors at weather stations, where different variables can be measured. Those data have some particularities, such as high volume and dimensionality, the frequent existence of missing values in some stations, and the high correlation between collected variables. In this regard, it is crucial to use Big Data and Data Mining techniques to deal with those data and extract useful knowledge from them that can be used, for instance, to predict weather phenomena. In this paper, we propose a visual big data system that is designed to handle high amounts of weather-related data and lets the user analyze those data to perform predictive tasks over the considered variables (temperature and rainfall). The proposed system collects open data and loads them onto a local NoSQL database, fusing them at different levels of temporal and spatial aggregation in order to perform a predictive analysis using univariate and multivariate approaches, as well as forecasting based on training data from neighbor stations in cases with high rates of missing values. The system has been assessed in terms of usability and predictive performance, obtaining an overall normalized mean squared error value of 0.00013 and an overall directional symmetry value of nearly 0.84. Our system has been rated positively by a group of experts in the area (all aspects of the system except graphic design were rated 3 or above on a 1-5 scale). The promising preliminary results demonstrate the validity of our system and invite us to keep working in this area.

[LG-57] A Unified Matrix Factorization Framework for Classical and Robust Clustering

链接: https://arxiv.org/abs/2510.21172
作者: Angshul Majumdar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a unified matrix factorization framework for classical and robust clustering. We begin by revisiting the well-known equivalence between crisp k-means clustering and matrix factorization, following and rigorously rederiving an unpublished formulation by Bauckhage. Extending this framework, we derive an analogous matrix factorization interpretation for fuzzy c-means clustering, which to the best of our knowledge has not been previously formalized. These reformulations allow both clustering paradigms to be expressed as optimization problems over factor matrices, thereby enabling principled extensions to robust variants. To address sensitivity to outliers, we propose robust formulations for both crisp and fuzzy clustering by replacing the Frobenius norm with the $\ell_{1,2}$-norm, which penalizes the sum of Euclidean norms across residual columns. We develop alternating minimization algorithms for the standard formulations and IRLS-based algorithms for the robust counterparts. All algorithms are theoretically proven to converge to a local minimum.
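
The crisp k-means/matrix-factorization equivalence can be sketched by alternating over a binary assignment matrix and a centroid matrix; the toy data and k below are illustrative.

```python
# Sketch: k-means as the factorization X ≈ Z M with one-hot assignments Z.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(3, 0.3, (50, 2))])     # data matrix, (n, d)
k, n = 2, len(X)
M = X[rng.choice(n, k, replace=False)]           # centroid matrix, (k, d)

for _ in range(10):
    d2 = ((X[:, None, :] - M[None]) ** 2).sum(-1)   # squared distances (n, k)
    Z = np.eye(k)[d2.argmin(1)]                     # crisp assignments (n, k)
    M = (Z.T @ X) / Z.sum(0, keepdims=True).T       # minimize ||X - Z M||_F^2 over M
print(M)   # recovered cluster centers, near (0, 0) and (3, 3)
```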

[LG-58] URBOTEST: Learning When Less is Enough through Early Termination of Internet Speed Tests

链接: https://arxiv.org/abs/2510.21141
作者: Haarika Manda,Manshi Sagar,Yogesh,Kartikay Singh,Cindy Zhao,Tarun Mangla,Phillipa Gill,Elizabeth Belding,Arpit Gupta
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Internet speed tests are indispensable for users, ISPs, and policymakers, but their static flooding-based design imposes growing costs: a single high-speed test can transfer hundreds of megabytes, and collectively, platforms like Ookla, M-Lab, and this http URL generate petabytes of traffic each month. Reducing this burden requires deciding when a test can be stopped early without sacrificing accuracy. We frame this as an optimal stopping problem and show that existing heuristics (static thresholds, BBR pipe-full signals, or throughput stability rules from this http URL and FastBTS) capture only a narrow portion of the achievable accuracy-savings trade-off. This paper introduces TURBOTEST, a systematic framework for speed test termination that sits atop existing platforms. The key idea is to decouple throughput prediction (Stage 1) from test termination (Stage 2): Stage 1 trains a regressor to estimate final throughput from partial measurements, while Stage 2 trains a classifier to decide when sufficient evidence has accumulated to stop. Leveraging richer transport-level features (RTT, retransmissions, congestion window) alongside throughput, TURBOTEST exposes a single tunable parameter for accuracy tolerance and includes a fallback mechanism for high-variability cases. Evaluation on 173,000 M-Lab NDT speed tests (2024-2025) shows that TURBOTEST achieves nearly 2-4x higher data savings than an approach based on BBR signals while reducing median error. These results demonstrate that adaptive ML-based termination can deliver accurate, efficient, and deployable speed tests at scale.
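
A sketch of the two-stage decision logic, using toy scikit-learn models and a hypothetical feature layout (throughput, RTT, retransmits, congestion window); the real system trains on transport-level features from M-Lab NDT traces.

```python
# Sketch: Stage 1 predicts final throughput, Stage 2 decides when to stop.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))               # per-snapshot features (toy)
y_tput = 100 + 50 * X[:, 0]                 # toy final-throughput labels
y_stop = X[:, 1] > 0                        # toy "safe to stop early" labels

stage1 = GradientBoostingRegressor().fit(X, y_tput)    # throughput regressor
stage2 = GradientBoostingClassifier().fit(X, y_stop)   # termination classifier

def maybe_terminate(features, p_stop=0.9):
    """Return a throughput estimate if the test can stop, else None."""
    if stage2.predict_proba(features[None])[0, 1] >= p_stop:
        return stage1.predict(features[None])[0]       # stop early
    return None                                        # keep measuring

print(maybe_terminate(X[0]))
```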

[LG-59] Cloud-Fog-Edge Collaborative Computing for Sequential MIoT Workflow: A Two-Tier DDPG-Based Scheduling Framework

链接: https://arxiv.org/abs/2510.21135
作者: Yuhao Fu(1),Yinghao Zhang(2),Yalin Liu(1),Bishenghui Tao(1),Junhong Ruan(3) ((1) Hong Kong Metropolitan University, Hong Kong, China, (2) Guangdong Key Lab of AI and Multi-Modal Data Processing, Beijing Normal-Hong Kong Baptist University, (3) Hong Kong University of Science and Technology, Hong Kong, China)
类目: Machine Learning (cs.LG)
*备注: 14 pages, 3 figures, 2 tables

点击查看摘要

Abstract:The Medical Internet of Things (MIoT) demands stringent end-to-end latency guarantees for sequential healthcare workflows deployed over heterogeneous cloud-fog-edge infrastructures. Scheduling these sequential workflows to minimize makespan is an NP-hard problem. To tackle this challenge, we propose a Two-tier DDPG-based scheduling framework that decomposes the scheduling decision into a hierarchical process: a global controller performs layer selection (edge, fog, or cloud), while specialized local controllers handle node assignment within the chosen layer. The primary optimization objective is the minimization of the workflow makespan. Experimental results validate our approach, demonstrating increasingly superior performance over baselines as workflow complexity rises. This trend highlights the framework's ability to learn effective long-term strategies, which is critical for complex, large-scale MIoT scheduling scenarios.

[LG-60] SolarBoost: Distributed Photovoltaic Power Forecasting Amid Time-varying Grid Capacity

链接: https://arxiv.org/abs/2510.21129
作者: Linyuan Geng,Linxiao Yang,Xinyue Gu,Liang Sun
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents SolarBoost, a novel approach for forecasting power output in distributed photovoltaic (DPV) systems. While existing centralized photovoltaic (CPV) methods are able to precisely model output dependencies due to uniformity, it is difficult to apply such techniques to DPV systems, as DPVs face challenges such as missing grid-level data, temporal shifts in installed capacity, geographic variability, and panel diversity. SolarBoost overcomes these challenges by modeling aggregated power output as a composite of output from small grids, where each grid output is modeled using a unit output function multiplied by its capacity. This approach decouples the homogeneous unit output function from dynamic capacity for accurate prediction. Efficient algorithms over an upper-bound approximation are proposed to overcome computational bottlenecks in loss functions. We demonstrate the superiority of grid-level modeling via theoretical analysis and experiments. SolarBoost has been validated through deployment across various cities in China, significantly reducing potential losses and provides valuable insights for the operation of power grids. The code for this work is available at this https URL.

[LG-61] A Unified Approach to Submodular Maximization Under Noise NEURIPS2025

链接: https://arxiv.org/abs/2510.21128
作者: Kshipra Bhawalkar,Yang Cai,Zhe Feng,Christopher Liaw,Tao Lin
类目: Data Structures and Algorithms (cs.DS); Computational Complexity (cs.CC); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:We consider the problem of maximizing a submodular function with access to a noisy value oracle for the function instead of an exact value oracle. Similar to prior work, we assume that the noisy oracle is persistent in that multiple calls to the oracle for a specific set always return the same value. In this model, Hassidim and Singer (2017) design a $(1-1/e)$-approximation algorithm for monotone submodular maximization subject to a cardinality constraint, and Huang et al. (2022) design a $(1-1/e)/2$-approximation algorithm for monotone submodular maximization subject to any arbitrary matroid constraint. In this paper, we design a meta-algorithm that allows us to take any “robust” algorithm for exact submodular maximization as a black box and transform it into an algorithm for the noisy setting while retaining the approximation guarantee. By using the meta-algorithm with the measured continuous greedy algorithm, we obtain a $(1-1/e)$-approximation (resp. $1/e$-approximation) for monotone (resp. non-monotone) submodular maximization subject to a matroid constraint under noise. Furthermore, by using the meta-algorithm with the double greedy algorithm, we obtain a $1/2$-approximation for unconstrained (non-monotone) submodular maximization under noise.

[LG-62] Distributionally Robust Feature Selection NEURIPS2025

链接: https://arxiv.org/abs/2510.21113
作者: Maitreyi Swaroop,Tamar Krishnamurti,Bryan Wilder
类目: Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2025

点击查看摘要

Abstract:We study the problem of selecting limited features to observe such that models trained on them can perform well simultaneously across multiple subpopulations. This problem has applications in settings where collecting each feature is costly, e.g. requiring adding survey questions or physical sensors, and we must be able to use the selected features to create high-quality downstream models for different populations. Our method frames the problem as a continuous relaxation of traditional variable selection using a noising mechanism, without requiring backpropagation through model training processes. By optimizing over the variance of a Bayes-optimal predictor, we develop a model-agnostic framework that balances overall performance of downstream prediction across populations. We validate our approach through experiments on both synthetic datasets and real-world data.

[LG-63] DictPFL: Efficient and Private Federated Learning on Encrypted Gradients NEURIPS2025

链接: https://arxiv.org/abs/2510.21086
作者: Jiaqi Xue,Mayank Kumar,Yuzhang Shang,Shangqian Gao,Rui Ning,Mengxin Zheng,Xiaoqian Jiang,Qian Lou
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative model training across institutions without sharing raw data. However, gradient sharing still risks privacy leakage, such as gradient inversion attacks. Homomorphic Encryption (HE) can secure aggregation but often incurs prohibitive computational and communication overhead. Existing HE-based FL methods sit at two extremes: encrypting all gradients for full privacy at high cost, or partially encrypting gradients to save resources while exposing vulnerabilities. We present DictPFL, a practical framework that achieves full gradient protection with minimal overhead. DictPFL encrypts every transmitted gradient while keeping non-transmitted parameters local, preserving privacy without heavy computation. It introduces two key modules: Decompose-for-Partial-Encrypt (DePE), which decomposes model weights into a static dictionary and an updatable lookup table; only the latter is encrypted and aggregated, while the static dictionary remains local and requires neither sharing nor encryption; and Prune-for-Minimum-Encrypt (PrME), which applies encryption-aware pruning to minimize encrypted parameters via consistent, history-guided masks. Experiments show that DictPFL reduces communication cost by 402-748 \times and accelerates training by 28-65 \times compared to fully encrypted FL, while outperforming state-of-the-art selective encryption methods by 51-155 \times in overhead and 4-19 \times in speed. Remarkably, DictPFL’s runtime is within 2 \times of plaintext FL, demonstrating, for the first time, that HE-based private federated learning is practical for real-world deployment. The code is publicly available at this https URL.
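
A rough sketch of the DePE idea: factor the weights into a static dictionary that never leaves the client and a small lookup table that is the only encrypted, transmitted part. The SVD-based split below is our own illustrative assumption, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))   # a toy weight matrix

# Factor W ~= D @ T: D is the static dictionary (kept local, never encrypted),
# T is the small updatable lookup table (the only encrypted/aggregated part).
U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = 8                               # small rank => few parameters to encrypt
D = U[:, :r]                        # static dictionary, stays on the client
T = np.diag(s[:r]) @ Vt[:r]         # lookup table: the transmitted part

print("full params:", W.size, "-> encrypted params:", T.size)
print("relative reconstruction error:",
      np.linalg.norm(W - D @ T) / np.linalg.norm(W))
```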

[LG-64] Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-Execution

链接: https://arxiv.org/abs/2510.21081
作者: Zhuojin Li,Marco Paolieri,Leana Golubchik
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
*备注: To appear on Lecture Notes in Computer Science, volume on Selected Papers of EPEW 2025

点击查看摘要

Abstract:Deploying deep neural networks on mobile devices is increasingly important but remains challenging due to limited computing resources. On the other hand, their unified memory architecture and narrower gap between CPU and GPU performance provide an opportunity to reduce inference latency by assigning tasks to both CPU and GPU. The main obstacles for such collaborative execution are the significant synchronization overhead required to combine partial results, and the difficulty of predicting execution times of tasks assigned to CPU and GPU (due to the dynamic selection of implementations and parallelism level). To overcome these obstacles, we propose both a lightweight synchronization mechanism based on OpenCL fine-grained shared virtual memory (SVM) and machine learning models to accurately predict execution times. Notably, these models capture the performance characteristics of GPU kernels and account for their dispatch times. A comprehensive evaluation on four mobile platforms shows that our approach can quickly select CPU-GPU co-execution strategies achieving up to 1.89x speedup for linear layers and 1.75x speedup for convolutional layers (close to the achievable maximum values of 2.01x and 1.87x, respectively, found by exhaustive grid search on a Pixel 5 smartphone).

[LG-65] Neural Collapse under Gradient Flow on Shallow ReLU Networks for Orthogonally Separable Data NEURIPS2025

链接: https://arxiv.org/abs/2510.21078
作者: Hancheng Min,Zhihui Zhu,René Vidal
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: NeurIPS 2025

点击查看摘要

Abstract:Among many mysteries behind the success of deep networks lies the exceptional discriminative power of their learned representations, as manifested by the intriguing Neural Collapse (NC) phenomenon, where simple feature structures emerge at the last layer of a trained neural network. Prior works on the theoretical understanding of NC have focused on analyzing the optimization landscape of matrix-factorization-like problems by considering the last-layer features as unconstrained free optimization variables and showing that their global minima exhibit NC. In this paper, we show that gradient flow on a two-layer ReLU network for classifying orthogonally separable data provably exhibits NC, thereby advancing prior results in two ways: First, we relax the assumption of unconstrained features, showing the effect of data structure and nonlinear activations on NC characterizations. Second, we reveal the role of the implicit bias of the training dynamics in facilitating the emergence of NC.

[LG-66] The Virtues of Brevity: Avoid Overthinking in Parallel Test-Time Reasoning NEURIPS2025

链接: https://arxiv.org/abs/2510.21067
作者: Raul Cavalcante Dinardi,Bruno Yamamoto,Anna Helena Reali Costa,Artur Jordao
类目: Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2025 Workshop on Efficient Reasoning

点击查看摘要

Abstract:Reasoning models represent a significant advance in LLM capabilities, particularly for complex reasoning tasks such as mathematics and coding. Previous studies confirm that parallel test-time compute (sampling multiple solutions and selecting the best one) can further enhance the predictive performance of LLMs. However, strategies in this area often require complex scoring, thus increasing computational cost and complexity. In this work, we demonstrate that the simple and counterintuitive heuristic of selecting the shortest solution is highly effective. We posit that the observed effectiveness stems from models operating in two distinct regimes: a concise, confident conventional regime and a verbose overthinking regime characterized by uncertainty, and we show evidence of a critical point where the overthinking regime begins to be significant. By selecting the shortest answer, the heuristic preferentially samples from the conventional regime. We confirm that this approach is competitive with more complex methods such as self-consistency across two challenging benchmarks while significantly reducing computational overhead. The shortest-answer heuristic provides a Pareto improvement over self-consistency and applies even to tasks where output equality is not well defined.
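
The heuristic itself fits in a few lines. In the sketch below, generate is a hypothetical stand-in for one parallel sample from a reasoning model; replace it with a real API call:

```python
import random

def generate(prompt: str) -> str:
    """Hypothetical stand-in for one parallel LLM sample."""
    return prompt + " step" * random.randint(1, 20)

def shortest_answer(prompt: str, n_samples: int = 8) -> str:
    # Sample several solutions at test time, then prefer the concise
    # "conventional-regime" output rather than scoring candidates.
    candidates = [generate(prompt) for _ in range(n_samples)]
    return min(candidates, key=len)

print(shortest_answer("2+2?"))
```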

[LG-67] Scalable Machine Learning Analysis of Parker Solar Probe Solar Wind Data

链接: https://arxiv.org/abs/2510.21066
作者: Daniela Martin,Connor O’Brien,Valmir P Moraes Filho,Jinsu Hong,Jasmine R. Kobayashi,Evangelia Samara,Joseph Gallego
类目: Machine Learning (cs.LG); Solar and Stellar Astrophysics (astro-ph.SR); Space Physics (physics.space-ph)
*备注:

点击查看摘要

Abstract:We present a scalable machine learning framework for analyzing Parker Solar Probe (PSP) solar wind data using distributed processing and the quantum-inspired Kernel Density Matrices (KDM) method. The PSP dataset (2018–2024) exceeds 150 GB, challenging conventional analysis approaches. Our framework leverages Dask for large-scale statistical computations and KDM to estimate univariate and bivariate distributions of key solar wind parameters, including solar wind speed, proton density, and proton thermal speed, as well as anomaly thresholds for each parameter. We reveal characteristic trends in the inner heliosphere, including increasing solar wind speed with distance from the Sun, decreasing proton density, and the inverse relationship between speed and density. Solar wind structures play a critical role in enhancing and mediating extreme space weather phenomena and can trigger geomagnetic storms; our analyses provide quantitative insights into these processes. This approach offers a tractable, interpretable, and distributed methodology for exploring complex physical datasets and facilitates reproducible analysis of large-scale in situ measurements. Processed data products and analysis tools are made publicly available to advance future studies of solar wind dynamics and space weather forecasting. The code and configuration files used in this study are publicly available to support reproducibility.

[LG-68] Soft Instruction De-escalation Defense

链接: https://arxiv.org/abs/2510.21057
作者: Nils Philipp Walter,Chawin Sitawarin,Jamie Hayes,David Stutz,Ilia Shumailov
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment; this makes them susceptible to prompt injections when dealing with untrusted data. To overcome this limitation, we propose SIC (Soft Instruction Control), a simple yet effective iterative prompt sanitization loop designed for tool-augmented LLM agents. Our method repeatedly inspects incoming data for instructions that could compromise agent behavior. If such content is found, the malicious content is rewritten, masked, or removed, and the result is re-evaluated. The process continues until the input is clean or a maximum iteration limit is reached; if imperative instruction-like content remains, the agent halts to ensure security. By allowing multiple passes, our approach acknowledges that individual rewrites may fail but enables the system to catch and correct missed injections in later steps. Although immediately useful, worst-case analysis shows that SIC is not infallible; a strong adversary can still achieve a 15% attack success rate (ASR) by embedding non-imperative workflows. This nonetheless raises the bar.
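
A minimal sketch of such a sanitization loop, with toy keyword-based stand-ins for the LLM detector and rewriter; the halting rule follows the abstract, everything else is an assumption:

```python
MAX_ITERS = 4
SUSPECT = ("ignore previous", "you must", "now do")

def contains_instructions(text: str) -> bool:
    """Toy detector; in practice an LLM judge inspects the data for imperative content."""
    return any(kw in text.lower() for kw in SUSPECT)

def rewrite(text: str) -> str:
    """Toy sanitizer (lowercases for simplicity); in practice the LLM rewrites,
    masks, or removes the malicious span."""
    out = text.lower()
    for kw in SUSPECT:
        out = out.replace(kw, "[masked]")
    return out

def sanitize(data: str) -> str:
    for _ in range(MAX_ITERS):
        if not contains_instructions(data):
            return data            # clean: safe to hand to the agent
        data = rewrite(data)       # re-evaluate on the next pass
    if contains_instructions(data):
        raise RuntimeError("instruction-like content persists; halting agent")
    return data

print(sanitize("weather report. IGNORE PREVIOUS instructions and now do X"))
```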

[LG-69] Online Multi-Class Selection with Group Fairness Guarantee

链接: https://arxiv.org/abs/2510.21055
作者: Faraz Zargari,Hossein Nekouyan,Lyndon Hallett,Bo Sun,Xiaoqi Tan
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:

点击查看摘要

Abstract:We study the online multi-class selection problem with group fairness guarantees, where limited resources must be allocated to sequentially arriving agents. Our work addresses two key limitations in the existing literature. First, we introduce a novel lossless rounding scheme that ensures the integral algorithm achieves the same expected performance as any fractional solution. Second, we explicitly address the challenges introduced by agents who belong to multiple classes. To this end, we develop a randomized algorithm based on a relax-and-round framework. The algorithm first computes a fractional solution using a resource reservation approach – referred to as the set-aside mechanism – to enforce fairness across classes. The subsequent rounding step preserves these fairness guarantees without degrading performance. Additionally, we propose a learning-augmented variant that incorporates untrusted machine-learned predictions to better balance fairness and efficiency in practical settings.

[LG-70] Amortized Active Generation of Pareto Sets NEURIPS2025

链接: https://arxiv.org/abs/2510.21052
作者: Daniel M. Steinberg,Asiri Wijesinghe,Rafael Oliveira,Piotr Koniusz,Cheng Soon Ong,Edwin V. Bonilla
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Appears in the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

点击查看摘要

Abstract:We introduce active generation of Pareto sets (A-GPS), a new framework for online discrete black-box multi-objective optimization (MOO). A-GPS learns a generative model of the Pareto set that supports a-posteriori conditioning on user preferences. The method employs a class probability estimator (CPE) to predict non-dominance relations and to condition the generative model toward high-performing regions of the search space. We also show that this non-dominance CPE implicitly estimates the probability of hypervolume improvement (PHVI). To incorporate subjective trade-offs, A-GPS introduces preference direction vectors that encode user-specified preferences in objective space. At each iteration, the model is updated using both Pareto membership and alignment with these preference directions, producing an amortized generative model capable of sampling across the Pareto front without retraining. The result is a simple yet powerful approach that achieves high-quality Pareto set approximations, avoids explicit hypervolume computation, and flexibly captures user preferences. Empirical results on synthetic benchmarks and protein design tasks demonstrate strong sample efficiency and effective preference incorporation.

[LG-71] xMem: A CPU-Based Approach for Accurate Estimation of GPU Memory in Deep Learning Training Workloads

链接: https://arxiv.org/abs/2510.21048
作者: Jiabo Shi,Dimitrios Pezaros,Yehia Elkhatib
类目: Performance (cs.PF); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The global scarcity of GPUs necessitates more sophisticated strategies for Deep Learning jobs in shared cluster environments. Accurate estimation of how much GPU memory a job will require is fundamental to enabling advanced scheduling and GPU sharing, which helps prevent out-of-memory (OOM) errors and resource underutilization. However, existing estimation methods have limitations. Approaches relying on static analysis or historical data with machine learning often fail to accurately capture runtime dynamics. Furthermore, direct GPU analysis consumes scarce resources, and some techniques require intrusive code modifications. Thus, the key challenge lies in precisely estimating dynamic memory requirements, including memory allocator nuances, without consuming GPU resources or requiring intrusive code changes. To address this challenge, we propose xMem, a novel framework that leverages CPU-only dynamic analysis to accurately estimate peak GPU memory requirements a priori. We conducted a thorough evaluation of xMem against state-of-the-art solutions using workloads from 25 different models, including architectures like Convolutional Neural Networks and Transformers. The analysis of 5209 runs, which includes ANOVA and Monte Carlo results, highlights xMem’s benefits: it decreases the median relative error by 91% and reduces by 75% the probability that an estimate fails as a safe OOM threshold, meaning that the estimated value can often be used directly without causing OOM. Ultimately, these improvements lead to a 368% increase in memory conservation potential over current solutions.

[LG-72] Elementary My Dear Watson: Non-Invasive Neural Keyword Spotting in the LibriBrain Dataset

链接: https://arxiv.org/abs/2510.21038
作者: Gereon Elvers,Gilad Landau,Oiwi Parker Jones
类目: Machine Learning (cs.LG)
*备注: 16 pages, 7 figures, 6 tables

点击查看摘要

Abstract:Non-invasive brain-computer interfaces (BCIs) are beginning to benefit from large, public benchmarks. However, current benchmarks target relatively simple, foundational tasks like Speech Detection and Phoneme Classification, while application-ready results on tasks like Brain-to-Text remain elusive. We propose Keyword Spotting (KWS) as a practically applicable, privacy-aware intermediate task. Using the deep 52-hour, within-subject LibriBrain corpus, we provide standardized train/validation/test splits for reproducible benchmarking, and adopt an evaluation protocol tailored to extreme class imbalance. Concretely, we use area under the precision-recall curve (AUPRC) as a robust evaluation metric, complemented by false alarms per hour (FA/h) at fixed recall to capture user-facing trade-offs. To simplify deployment and further experimentation within the research community, we are releasing an updated version of the pnpl library with word-level dataloaders and Colab-ready tutorials. As an initial reference model, we present a compact 1-D Conv/ResNet baseline with focal loss and top-k pooling that is trainable on a single consumer-class GPU. The reference model achieves approximately 13x the permutation baseline AUPRC on held-out sessions, demonstrating the viability of the task. Exploratory analyses reveal: (i) predictable within-subject scaling (performance improves log-linearly with more training hours), and (ii) the existence of word-level factors (frequency and duration) that systematically modulate detectability.
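
The evaluation protocol is straightforward to reproduce with scikit-learn. The sketch below computes AUPRC plus false alarms per hour at a fixed recall on synthetic scores; the one-window-per-second rate and the target recall of 0.5 are our assumptions:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

rng = np.random.default_rng(0)
y_true = (rng.random(36_000) < 0.01).astype(int)                # ~1% positives, 1 window/s
y_score = y_true * rng.random(36_000) + rng.random(36_000) * 0.5  # positives score higher

auprc = average_precision_score(y_true, y_score)

def false_alarms_per_hour(y_true, y_score, target_recall=0.5, windows_per_hour=3600):
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # highest threshold whose recall still meets the target (recall decreases with threshold)
    ok = np.where(recall[:-1] >= target_recall)[0]
    thr = thresholds[ok[-1]]
    fp = np.sum((y_score >= thr) & (y_true == 0))
    return fp / (len(y_true) / windows_per_hour)

print(f"AUPRC={auprc:.3f}, FA/h at recall 0.5={false_alarms_per_hour(y_true, y_score):.1f}")
```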

[LG-73] CIPHER: Scalable Time Series Analysis for Physical Sciences with Application to Solar Wind Phenomena NEURIPS2025

链接: https://arxiv.org/abs/2510.21022
作者: Jasmine R. Kobayashi,Daniela Martin,Valmir P Moraes Filho,Connor O’Brien,Jinsu Hong,Sudeshna Boro Saikia,Hala Lamdouar,Nathan D. Miles,Marcella Scoczynski,Mavis Stone,Sairam Sundaresan,Anna Jungbluth,Andrés Muñoz-Jaramillo,Evangelia Samara,Joseph Gallego
类目: Machine Learning (cs.LG); Solar and Stellar Astrophysics (astro-ph.SR)
*备注: 5 pages, 2 figures, Machine Learning and the Physical Sciences Workshop @ NeurIPS 2025

点击查看摘要

Abstract:Labeling or classifying time series is a persistent challenge in the physical sciences, where expert annotations are scarce, costly, and often inconsistent. Yet robust labeling is essential to enable machine learning models for understanding, prediction, and forecasting. We present the Clustering and Indexation Pipeline with Human Evaluation for Recognition (CIPHER), a framework designed to accelerate large-scale labeling of complex time series in physics. CIPHER integrates indexable Symbolic Aggregate approXimation (iSAX) for interpretable compression and indexing, density-based clustering (HDBSCAN) to group recurring phenomena, and a human-in-the-loop step for efficient expert validation. Representative samples are labeled by domain scientists, and these annotations are propagated across clusters to yield systematic, scalable classifications. We evaluate CIPHER on the task of classifying solar wind phenomena in OMNI data, a central challenge in space weather research, showing that the framework recovers meaningful phenomena such as coronal mass ejections and stream interaction regions. Beyond this case study, CIPHER highlights a general strategy for combining symbolic representations, unsupervised learning, and expert knowledge to address label scarcity in time series across the physical sciences. The code and configuration files used in this study are publicly available to support reproducibility.
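
A toy version of the pipeline's unsupervised core, using a PAA summary as a simplified stand-in for iSAX and the hdbscan package for clustering; the synthetic "solar wind" segments and all parameter choices are purely illustrative:

```python
import numpy as np
import hdbscan  # density-based clustering, as used in CIPHER

rng = np.random.default_rng(1)
windows = np.concatenate([
    rng.normal(400, 20, (50, 64)),   # slow-wind-like segments (toy)
    rng.normal(650, 30, (50, 64)),   # fast-wind-like segments (toy)
])

def paa(x, n_segments=8):
    """Piecewise Aggregate Approximation: mean of equal-width chunks
    (the averaging step underlying SAX/iSAX symbols)."""
    return x.reshape(n_segments, -1).mean(axis=1)

features = np.array([paa(w) for w in windows])
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(features)

# One representative per cluster would go to a domain expert; the expert's
# label is then propagated to every member of that cluster.
print(np.unique(labels, return_counts=True))
```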

[LG-74] From Information to Generative Exponent: Learning Rate Induces Phase Transitions in SGD NEURIPS2025

链接: https://arxiv.org/abs/2510.21020
作者: Konstantinos Christopher Tsiolis,Alireza Mousavi-Hosseini,Murat A. Erdogdu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: NeurIPS 2025

点击查看摘要

Abstract:To understand feature learning dynamics in neural networks, recent theoretical works have focused on gradient-based learning of Gaussian single-index models, where the label is a nonlinear function of a latent one-dimensional projection of the input. While the sample complexity of online SGD is determined by the information exponent of the link function, recent works improved this by performing multiple gradient steps on the same sample with different learning rates – yielding a non-correlational update rule – and instead are limited by the (potentially much smaller) generative exponent. However, this picture is only valid when these learning rates are sufficiently large. In this paper, we characterize the relationship between learning rate(s) and sample complexity for a broad class of gradient-based algorithms that encapsulates both correlational and non-correlational updates. We demonstrate that, in certain cases, there is a phase transition from an “information exponent regime” with small learning rate to a “generative exponent regime” with large learning rate. Our framework covers prior analyses of one-pass SGD and SGD with batch reuse, while also introducing a new layer-wise training algorithm that leverages a two-timescales approach (via different learning rates for each layer) to go beyond correlational queries without reusing samples or modifying the loss from squared error. Our theoretical study demonstrates that the choice of learning rate is as important as the design of the algorithm in achieving statistical and computational efficiency.

[LG-75] Fair Representation Learning with Controllable High Confidence Guarantees via Adversarial Inference NEURIPS2025

链接: https://arxiv.org/abs/2510.21017
作者: Yuhong Luo,Austin Hoag,Xintong Wang,Philip S. Thomas,Przemyslaw A. Grabowicz
类目: Machine Learning (cs.LG)
*备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:Representation learning is increasingly applied to generate representations that generalize well across multiple downstream tasks. Ensuring fairness guarantees in representation learning is crucial to prevent unfairness toward specific demographic groups in downstream tasks. In this work, we formally introduce the task of learning representations that achieve high-confidence fairness. We aim to guarantee that demographic disparity in every downstream prediction remains bounded by a user-defined error threshold \epsilon, with controllable high probability. To this end, we propose the Fair Representation learning with high-confidence Guarantees (FRG) framework, which provides these high-confidence fairness guarantees by leveraging an optimized adversarial model. We empirically evaluate FRG on three real-world datasets, comparing its performance to six state-of-the-art fair representation learning methods. Our results demonstrate that FRG consistently bounds unfairness across a range of downstream models and tasks.

[LG-76] Graph Neural Regularizers for PDE Inverse Problems

链接: https://arxiv.org/abs/2510.21012
作者: William Lauga,James Rowbottom,Alexander Denker,Željko Kereta,Moshe Eliasof,Carola-Bibiane Schönlieb
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a framework for solving a broad class of ill-posed inverse problems governed by partial differential equations (PDEs), where the target coefficients of the forward operator are recovered through an iterative regularization scheme that alternates between FEM-based inversion and learned graph neural regularization. The forward problem is numerically solved using the finite element method (FEM), enabling applicability to a wide range of geometries and PDEs. By leveraging the graph structure inherent to FEM discretizations, we employ physics-inspired graph neural networks as learned regularizers, providing a robust, interpretable, and generalizable alternative to standard approaches. Numerical experiments demonstrate that our framework outperforms classical regularization techniques and achieves accurate reconstructions even in highly ill-posed scenarios.

[LG-77] Can Current Detectors Catch Face-to-Voice Deepfake Attacks? ACSAC

链接: https://arxiv.org/abs/2510.21004
作者: Nguyen Linh Bao Nguyen,Alsharif Abuadbba,Kristen Moore,Tingming Wu
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
*备注: 8 pages, Accepted at Workshop on AI for Cyber Threat Intelligence, co-located with ACSAC 2025

点击查看摘要

Abstract:The rapid advancement of generative models has enabled the creation of increasingly stealthy synthetic voices, commonly referred to as audio deepfakes. A recent technique, FOICE [USENIX’24], demonstrates a particularly alarming capability: generating a victim’s voice from a single facial image, without requiring any voice sample. By exploiting correlations between facial and vocal features, FOICE produces synthetic voices realistic enough to bypass industry-standard authentication systems, including WeChat Voiceprint and Microsoft Azure. This raises serious security concerns, as facial images are far easier for adversaries to obtain than voice samples, dramatically lowering the barrier to large-scale attacks. In this work, we investigate two core research questions: (RQ1) can state-of-the-art audio deepfake detectors reliably detect FOICE-generated speech under clean and noisy conditions, and (RQ2) whether fine-tuning these detectors on FOICE data improves detection without overfitting, thereby preserving robustness to unseen voice generators such as SpeechT5. Our study makes three contributions. First, we present the first systematic evaluation of FOICE detection, showing that leading detectors consistently fail under both standard and noisy conditions. Second, we introduce targeted fine-tuning strategies that capture FOICE-specific artifacts, yielding significant accuracy improvements. Third, we assess generalization after fine-tuning, revealing trade-offs between specialization to FOICE and robustness to unseen synthesis pipelines. These findings expose fundamental weaknesses in today’s defenses and motivate new architectures and training protocols for next-generation audio deepfake detection.

[LG-78] Distilled Decoding 2: One-step Sampling of Image Auto-regressive Models with Conditional Score Distillation NEURIPS2025

链接: https://arxiv.org/abs/2510.21003
作者: Enshu Liu,Qian Chen,Xuefei Ning,Shengen Yan,Guohao Dai,Zinan Lin,Yu Wang
类目: Machine Learning (cs.LG)
*备注: Published at NeurIPS 2025

点击查看摘要

Abstract:Image Auto-regressive (AR) models have emerged as a powerful paradigm of visual generative models. Despite their promising performance, they suffer from slow generation speed due to the large number of sampling steps required. Although Distilled Decoding 1 (DD1) was recently proposed to enable few-step sampling for image AR models, it still incurs significant performance degradation in the one-step setting, and relies on a pre-defined mapping that limits its flexibility. In this work, we propose a new method, Distilled Decoding 2 (DD2), to further advance the feasibility of one-step sampling for image AR models. Unlike DD1, DD2 does not rely on a pre-defined mapping. We view the original AR model as a teacher model which provides the ground truth conditional score in the latent embedding space at each token position. Based on this, we propose a novel conditional score distillation loss to train a one-step generator. Specifically, we train a separate network to predict the conditional score of the generated distribution and apply score distillation at every token position conditioned on previous tokens. Experimental results show that DD2 enables one-step sampling for image AR models with a minimal FID increase from 3.40 to 5.43 on ImageNet-256. Compared to the strongest baseline DD1, DD2 reduces the gap between one-step sampling and the original AR model by 67%, with up to 12.3 \times training speed-up simultaneously. DD2 takes a significant step toward the goal of one-step AR generation, opening up new possibilities for fast and high-quality AR modeling. Code is available at this https URL.

[LG-79] AL-CoLe: Augmented Lagrangian for Constrained Learning

链接: https://arxiv.org/abs/2510.20995
作者: Ignacio Boero,Ignacio Hounie,Alejandro Ribeiro
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Despite the non-convexity of most modern machine learning parameterizations, Lagrangian duality has become a popular tool for addressing constrained learning problems. We revisit Augmented Lagrangian methods, which aim to mitigate the duality gap in non-convex settings while requiring only minimal modifications, yet have remained comparatively unexplored in constrained learning settings. We establish strong duality results under mild conditions, prove convergence of dual ascent algorithms to feasible and optimal primal solutions, and provide PAC-style generalization guarantees. Finally, we demonstrate the effectiveness of this approach on fairness-constrained classification tasks.

[LG-80] L2M3OF: A Large Language Multimodal Model for Metal-Organic Frameworks

链接: https://arxiv.org/abs/2510.20976
作者: Jiyu Cui,Fang Wu,Haokai Zhao,Minggao Feng,Xenophon Evangelopoulos,Andrew I. Cooper,Yejin Choi
类目: Machine Learning (cs.LG)
*备注: 18 pages, 7 figures

点击查看摘要

Abstract:Large language models have demonstrated remarkable reasoning capabilities across diverse natural language tasks. However, comparable breakthroughs in scientific discovery are more limited, because understanding complex physical phenomena demands multifaceted representations far beyond language alone. A compelling example is the design of functional materials such as MOFs, which are critical for a range of impactful applications like carbon capture and hydrogen storage. Navigating their vast and intricate design space in language-based representations interpretable by LLMs is challenging due to the numerous possible three-dimensional atomic arrangements and strict reticular rules of coordination geometry and topology. Despite promising early results in LLM-assisted discovery for simpler materials systems, MOF design remains heavily reliant on tacit human expertise rarely codified in textual information alone. To overcome this barrier, we introduce L2M3OF, the first multimodal LLM for MOFs. L2M3OF integrates crystal representation learning with language understanding to process structural, textual, and knowledge modalities jointly. L2M3OF employs a pre-trained crystal encoder with a lightweight projection layer to compress structural information into a token space, enabling efficient alignment with language instructions. To facilitate training and evaluation, we curate a structure-property-knowledge database of crystalline materials and benchmark L2M3OF against state-of-the-art closed-source LLMs such as GPT-5, Gemini-2.5-Pro and DeepSeek-R1. Experiments show that L2M3OF outperforms leading text-based closed-source LLMs in property prediction and knowledge generation tasks, despite using far fewer parameters. These results highlight the importance of multimodal approaches for porous material understanding and establish L2M3OF as a foundation for next-generation AI systems in materials discovery.

[LG-81] Robust Point Cloud Reinforcement Learning via PCA-Based Canonicalization

链接: https://arxiv.org/abs/2510.20974
作者: Michael Bezick,Vittorio Giammarino,Ahmed H. Qureshi
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) from raw visual input has achieved impressive successes in recent years, yet it remains fragile to out-of-distribution variations such as changes in lighting, color, and viewpoint. Point Cloud Reinforcement Learning (PC-RL) offers a promising alternative by mitigating appearance-based brittleness, but its sensitivity to camera pose mismatches continues to undermine reliability in realistic settings. To address this challenge, we propose PCA Point Cloud (PPC), a canonicalization framework specifically tailored for downstream robotic control. PPC maps point clouds under arbitrary rigid-body transformations to a unique canonical pose, aligning observations to a consistent frame, thereby substantially decreasing viewpoint-induced inconsistencies. In our experiments, we show that PPC improves robustness to unseen camera poses across challenging robotic tasks, providing a principled alternative to domain randomization.

[LG-82] On the accuracy of implicit neural representations for cardiovascular anatomies and hemodynamic fields

链接: https://arxiv.org/abs/2510.20970
作者: Jubilee Lee,Daniele E. Schiavazzi
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Implicit neural representations (INRs, also known as neural fields) have recently emerged as a powerful framework for knowledge representation, synthesis, and compression. By encoding fields as continuous functions within the weights and biases of deep neural networks, rather than relying on voxel- or mesh-based structured or unstructured representations, INRs offer both resolution independence and high memory efficiency. However, their accuracy in domain-specific applications remains insufficiently understood. In this work, we assess the performance of state-of-the-art INRs for compressing hemodynamic fields derived from numerical simulations and for representing cardiovascular anatomies via signed distance functions. We investigate several strategies to mitigate spectral bias, including specialized activation functions, both fixed and trainable positional encoding, and linear combinations of nonlinear kernels. On realistic, space- and time-varying hemodynamic fields in the thoracic aorta, INRs achieved remarkable compression ratios of up to approximately 230, with maximum absolute errors of 1 mmHg for pressure and 5-10 cm/s for velocity, without extensive hyperparameter tuning. Across 48 thoracic aortic anatomies, the average and maximum absolute anatomical discrepancies were below 0.5 mm and 1.6 mm, respectively. Overall, the SIREN, MFN-Gabor, and MHE architectures demonstrated the best performance. Source code and data are available at this https URL.

[LG-83] Neural Mutual Information Estimation with Vector Copulas

链接: https://arxiv.org/abs/2510.20968
作者: Yanzhi Chen,Zijing Ou,Adrian Weller,Michael U. Gutmann
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Estimating mutual information (MI) is a fundamental task in data science and machine learning. Existing estimators mainly rely on either highly flexible models (e.g., neural networks), which require large amounts of data, or overly simplified models (e.g., Gaussian copula), which fail to capture complex distributions. Drawing upon recent vector copula theory, we propose a principled interpolation between these two extremes to achieve a better trade-off between complexity and capacity. Experiments on state-of-the-art synthetic benchmarks and real-world data with diverse modalities demonstrate the advantages of the proposed estimator.

[LG-84] SutureBot: A Precision Framework Benchmark For Autonomous End-to-End Suturing NEURIPS2025

链接: https://arxiv.org/abs/2510.20965
作者: Jesse Haworth,Juo-Tung Chen,Nigel Nelson,Ji Woong Kim,Masoud Moghani,Chelsea Finn,Axel Krieger
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 10 pages, 5 figures, 4 tables, NeurIPS 2025

点击查看摘要

Abstract:Robotic suturing is a prototypical long-horizon dexterous manipulation task, requiring coordinated needle grasping, precise tissue penetration, and secure knot tying. Despite numerous efforts toward end-to-end autonomy, a fully autonomous suturing pipeline has yet to be demonstrated on physical hardware. We introduce SutureBot: an autonomous suturing benchmark on the da Vinci Research Kit (dVRK), spanning needle pickup, tissue insertion, and knot tying. To ensure repeatability, we release a high-fidelity dataset comprising 1,890 suturing demonstrations. Furthermore, we propose a goal-conditioned framework that explicitly optimizes insertion-point precision, improving targeting accuracy by 59%-74% over a task-only baseline. To establish this task as a benchmark for dexterous imitation learning, we evaluate state-of-the-art vision-language-action (VLA) models, including \pi_0, GR00T N1, OpenVLA-OFT, and multitask ACT, each augmented with a high-level task-prediction policy. Autonomous suturing is a key milestone toward achieving robotic autonomy in surgery. These contributions support reproducible evaluation and development of precision-focused, long-horizon dexterous manipulation policies necessary for end-to-end suturing. Dataset is available at: this https URL

[LG-85] Towards Scalable Oversight with Collaborative Multi-Agent Debate in Error Detection

链接: https://arxiv.org/abs/2510.20963
作者: Yongqiang Chen,Gang Niu,James Cheng,Bo Han,Masashi Sugiyama
类目: Machine Learning (cs.LG)
*备注: Preprint, ongoing work

点击查看摘要

Abstract:Accurate detection of errors in large language models (LLM) responses is central to the success of scalable oversight, or providing effective supervision to superhuman intelligence. Yet, self-diagnosis is often unreliable on complex tasks unless aided by reliable external feedback. Multi-agent debate (MAD) seems to be a natural alternative to external feedback: multiple LLMs provide complementary perspectives and cross-checks for error detection. However, prior MAD protocols frame debate as a zero-sum game, where the debaters compete to win the game instead of seeking the truth. Consequently, it leads to debate hacking: debaters tend to mislead the judge by misinterpreting the task or presenting overconfident claims, which introduce more mistakes and underperform single-agent methods. To mitigate the issue, we introduce a new collaborative MAD protocol, termed ColMAD, that reframes MAD as a non-zero sum game. Specifically, ColMAD encourages multiple agents to criticize each other in a supportive way, such that they can complement the missing points of each other. Therefore, the judge agent can make a more informative conclusion based on more comprehensive evidence. Empirically, we show that ColMAD significantly outperforms previous competitive MAD by 19% and brings non-trivial improvements over single-agent methods in error detection.

[LG-86] An Ensembled Penalized Federated Learning Framework for Falling People Detection

链接: https://arxiv.org/abs/2510.20960
作者: Sizhe Rao,Runqiu Zhang,Sajal Saha,Liang Chen
类目: Machine Learning (cs.LG)
*备注: 12 pages, 3 figures

点击查看摘要

Abstract:Falls among elderly and disabled individuals remain a leading cause of injury and mortality worldwide, necessitating robust, accurate, and privacy-aware fall detection systems. Traditional fall detection approaches, whether centralized or point-wise, often struggle with key challenges such as limited generalizability, data privacy concerns, and variability in individual movement behaviors. To address these limitations, we propose EPFL, an Ensembled Penalized Federated Learning framework that integrates continual learning, personalized modeling, and a novel Specialized Weighted Aggregation (SWA) strategy. EPFL leverages wearable sensor data to capture sequential motion patterns while preserving user privacy through homomorphic encryption and federated training. Unlike existing federated models, EPFL incorporates both penalized local training and ensemble-based inference to improve inter-client consistency and adaptability to behavioral differences. Extensive experiments on a benchmark fall detection dataset demonstrate the effectiveness of our approach, achieving a Recall of 88.31 percent and an F1-score of 89.94 percent, significantly outperforming both centralized and baseline models. This work presents a scalable, secure, and accurate solution for real-world fall detection in healthcare settings, with strong potential for continuous improvement via its adaptive feedback mechanism.

[LG-87] NeuroPilot: A Realtime Brain-Computer Interface system to enhance concentration of students in online learning

链接: https://arxiv.org/abs/2510.20958
作者: Asif Islam,Farhan Ishtiaque,Md. Muhyminul Haque,Kaled Masukur Rahman,Ravi Vaidyanathan,Khondaker A. Mamun
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Signal Processing (eess.SP); Neurons and Cognition (q-bio.NC)
*备注: 11 pages, 5 figures and 3 tables

点击查看摘要

Abstract:The prevalence of online learning poses a vital challenge for real-time monitoring of students’ concentration. Traditional methods such as questionnaire assessments require manual intervention, and webcam-based monitoring fails to provide accurate insights into learners’ mental focus, since it is deceived by mere screen fixation without cognitive engagement. Existing BCI-based approaches lack real-time validation and evaluation procedures. To address these limitations, a Brain-Computer Interface (BCI) system is developed using a non-invasive Electroencephalogram (EEG) headband, FocusCalm, to record brainwave activity under attentive and non-attentive states. 20 minutes of data were collected from each of 20 participants watching a pre-recorded educational video. The data validation employed a novel intra-video questionnaire assessment. Subsequently, collected signals were segmented (sliding window), filtered (Butterworth bandpass), and cleaned (removal of high-amplitude and EOG artifacts such as eye blinks). Time, frequency, wavelet, and statistical features were extracted, followed by recursive feature elimination (RFE) with Support Vector Machines (SVMs) to classify attention and non-attention states. The leave-one-subject-out (LOSO) cross-validation accuracy was 88.77%. The system provides feedback alerts upon non-attention state detection and keeps focus profile logs. A pilot study was conducted to evaluate the effectiveness of real-time feedback. Five participants completed a 10-minute session consisting of a 5-minute baseline phase without feedback followed by a 5-minute feedback phase, during which alerts were issued if participants remained non-attentive for approximately 8 consecutive seconds. A paired t-test (t = 5.73, p = 0.007) indicated a statistically significant improvement in concentration during the feedback phase.

[LG-88] Safety Assessment in Reinforcement Learning via Model Predictive Control

链接: https://arxiv.org/abs/2510.20955
作者: Jeff Pflueger,Michael Everett
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 7 pages, 4 figures

点击查看摘要

Abstract:Model-free reinforcement learning approaches are promising for control but typically lack formal safety guarantees. Existing methods to shield or otherwise provide these guarantees often rely on detailed knowledge of the safety specifications. Instead, this work’s insight is that many difficult-to-specify safety issues are best characterized by invariance. Accordingly, we propose to leverage reversibility as a method for preventing these safety issues throughout the training process. Our method uses model-predictive path integral control to check the safety of an action proposed by a learned policy throughout training. A key advantage of this approach is that it only requires the ability to query the black-box dynamics, not explicit knowledge of the dynamics or safety constraints. Experimental results demonstrate that the proposed algorithm successfully aborts before all unsafe actions, while still achieving comparable training progress to a baseline PPO approach that is allowed to violate safety.

[LG-89] LLM -Integrated Bayesian State Space Models for Multimodal Time-Series Forecasting

链接: https://arxiv.org/abs/2510.20952
作者: Sungjun Cho,Changho Shin,Suenggwan Jo,Xinya Yan,Shourjo Aditya Chaudhuri,Frederic Sala
类目: Machine Learning (cs.LG)
*备注: 15 pages, 8 figures

点击查看摘要

Abstract:Forecasting in the real world requires integrating structured time-series data with unstructured textual information, but existing methods are architecturally limited by fixed input/output horizons and are unable to model or quantify uncertainty. We address this challenge by introducing LLM-integrated Bayesian State space models (LBS), a novel probabilistic framework for multimodal temporal forecasting. At a high level, LBS consists of two components: (1) a state space model (SSM) backbone that captures the temporal dynamics of latent states from which both numerical and textual observations are generated and (2) a pretrained large language model (LLM) that is adapted to encode textual inputs for posterior state estimation and decode textual forecasts consistent with the latent trajectory. This design enables flexible lookback and forecast windows, principled uncertainty quantification, and improved temporal generalization thanks to the well-suited inductive bias of SSMs toward modeling dynamical systems. Experiments on the TextTimeCorpus benchmark demonstrate that LBS improves the previous state-of-the-art by 13.20% while providing human-readable summaries of each forecast. Our work is the first to unify LLMs and SSMs for joint numerical and textual prediction, offering a novel foundation for multimodal temporal reasoning.

[LG-90] Learning from Interval Targets NEURIPS2025

链接: https://arxiv.org/abs/2510.20925
作者: Rattana Pukdee,Ziqi Ke,Chirag Gupta
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2025

点击查看摘要

Abstract:We study the problem of regression with interval targets, where only upper and lower bounds on target values are available in the form of intervals. This problem arises when the exact target label is expensive or impossible to obtain, due to inherent uncertainties. In the absence of exact targets, traditional regression loss functions cannot be used. First, we study the methodology of using loss functions compatible with interval targets, for which we establish non-asymptotic generalization bounds based on smoothness of the hypothesis class, significantly relaxing prior assumptions of realizability and small ambiguity degree. Second, we propose a novel min-max learning formulation: minimize against the worst-case (maximized) target labels within the provided intervals. The maximization problem in the latter is non-convex, but we show that good performance can be achieved with the incorporation of smoothness constraints. Finally, we perform extensive experiments on real-world datasets and show that our methods achieve state-of-the-art performance.
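
Both ideas from the abstract admit short closed forms for squared error. The sketch below shows one natural interval-compatible loss (zero whenever the prediction lands inside the interval) and the min-max loss, whose inner maximum is attained at whichever endpoint lies farther from the prediction; treat this as our reading of the abstract, not the paper's exact objectives:

```python
import numpy as np

def interval_compatible_loss(pred, l, u):
    """Zero inside [l, u]; squared distance to the interval otherwise."""
    return np.maximum(l - pred, 0.0) ** 2 + np.maximum(pred - u, 0.0) ** 2

def minmax_loss(pred, l, u):
    """Worst-case squared error over targets y in [l, u]:
    max_y (pred - y)^2 = max((pred - l)^2, (pred - u)^2)."""
    return np.maximum((pred - l) ** 2, (pred - u) ** 2)

pred = np.array([0.5, 2.0])
l, u = np.array([0.0, 0.0]), np.array([1.0, 1.0])
print(interval_compatible_loss(pred, l, u))  # [0.  1.]
print(minmax_loss(pred, l, u))               # [0.25 4.  ]
```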

[LG-91] Global Dynamics of Heavy-Tailed SGDs in Nonconvex Loss Landscape: Characterization and Control

链接: https://arxiv.org/abs/2510.20905
作者: Xingyu Wang,Chang-Han Rhee
类目: Machine Learning (cs.LG); Probability (math.PR)
*备注: 60 pages, 2 figures, 4 tables

点击查看摘要

Abstract:Stochastic gradient descent (SGD) and its variants enable modern artificial intelligence. However, theoretical understanding lags far behind their empirical success. It is widely believed that SGD has a curious ability to avoid sharp local minima in the loss landscape, which are associated with poor generalization. To unravel this mystery and further enhance such capability of SGDs, it is imperative to go beyond the traditional local convergence analysis and obtain a comprehensive understanding of SGDs’ global dynamics. In this paper, we develop a set of technical machinery based on the recent large deviations and metastability analysis in Wang and Rhee (2023) and obtain sharp characterization of the global dynamics of heavy-tailed SGDs. In particular, we reveal a fascinating phenomenon in deep learning: by injecting and then truncating heavy-tailed noises during the training phase, SGD can almost completely avoid sharp minima and achieve better generalization performance for the test data. Simulation and deep learning experiments confirm our theoretical prediction that heavy-tailed SGD with gradient clipping finds local minima with a more flat geometry and achieves better generalization performance.

[LG-92] Information Theoretic Learning for Diffusion Models with Warm Start NEURIPS2025

链接: https://arxiv.org/abs/2510.20903
作者: Yirong Shen,Lu Gan,Cong Ling
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注: NeurIPS 2025

点击查看摘要

[LG-93] ROPES: Robotic Pose Estimation via Score-Based Causal Representation Learning NEURIPS2025

链接: https://arxiv.org/abs/2510.20884
作者: Pranamya Kulkarni,Puranjay Datta,Burak Varıcı,Emre Acartürk,Karthikeyan Shanmugam,Ali Tajer
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: A preliminary version of this paper appeared at NeurIPS 2025 Workshop on Embodied World Models for Decision Making

点击查看摘要

Abstract:Causal representation learning (CRL) has emerged as a powerful unsupervised framework that (i) disentangles the latent generative factors underlying high-dimensional data, and (ii) learns the cause-and-effect interactions among the disentangled variables. Despite extensive recent advances in identifiability and some practical progress, a substantial gap remains between theory and real-world practice. This paper takes a step toward closing that gap by bringing CRL to robotics, a domain that has motivated CRL. Specifically, this paper addresses the well-defined task of robot pose estimation – the recovery of position and orientation from raw images – by introducing Robotic Pose Estimation via Score-Based CRL (ROPES). Being an unsupervised framework, ROPES embodies the essence of interventional CRL by identifying those generative factors that are actuated: images are generated by intrinsic and extrinsic latent factors (e.g., joint angles, arm/limb geometry, lighting, background, and camera configuration) and the objective is to disentangle and recover the controllable latent variables, i.e., those that can be directly manipulated (intervened upon) through actuation. Interventional CRL theory shows that variables that undergo variations via interventions can be identified. In robotics, such interventions arise naturally by commanding actuators of various joints and recording images under varied controls. Empirical evaluations in semi-synthetic manipulator experiments demonstrate that ROPES successfully disentangles latent generative factors with high fidelity with respect to the ground truth. Crucially, this is achieved by leveraging only distributional changes, without using any labeled data. The paper also includes a comparison with a baseline based on a recently proposed semi-supervised framework. This paper concludes by positioning robot pose estimation as a near-practical testbed for CRL.

[LG-94] MOBO-OSD: Batch Multi-Objective Bayesian Optimization via Orthogonal Search Directions NEURIPS2025

链接: https://arxiv.org/abs/2510.20872
作者: Lam Ngo,Huong Ha,Jeffrey Chan,Hongyu Zhang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Published at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

点击查看摘要

Abstract:Bayesian Optimization (BO) is a powerful tool for optimizing expensive black-box objective functions. While extensive research has been conducted on the single-objective optimization problem, the multi-objective optimization problem remains challenging. In this paper, we propose MOBO-OSD, a multi-objective Bayesian Optimization algorithm designed to generate a diverse set of Pareto optimal solutions by solving multiple constrained optimization problems, referred to as MOBO-OSD subproblems, along orthogonal search directions (OSDs) defined with respect to an approximated convex hull of individual objective minima. By employing a well-distributed set of OSDs, MOBO-OSD ensures broad coverage of the objective space, enhancing both solution diversity and hypervolume performance. To further improve the density of the set of Pareto optimal candidate solutions without requiring an excessive number of subproblems, we leverage a Pareto Front Estimation technique to generate additional solutions in the neighborhood of existing solutions. Additionally, MOBO-OSD supports batch optimization, enabling parallel function evaluations to accelerate the optimization process when resources are available. Through extensive experiments and analysis on a variety of synthetic and real-world benchmark functions with two to six objectives, we demonstrate that MOBO-OSD consistently outperforms the state-of-the-art algorithms. Our code implementation can be found at this https URL.

[LG-95] Multimodal Datasets with Controllable Mutual Information

链接: https://arxiv.org/abs/2510.21686
作者: Raheem Karim Hashmani,Garrett W. Merz,Helen Qu,Mariel Pettee,Kyle Cranmer
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 15 pages, 4 figures, 1 table. Our code is publicly available at this https URL

点击查看摘要

Abstract:We introduce a framework for generating highly multimodal datasets with explicitly calculable mutual information between modalities. This enables the construction of benchmark datasets that provide a novel testbed for systematic studies of mutual information estimators and multimodal self-supervised learning techniques. Our framework constructs realistic datasets with known mutual information using a flow-based generative model and a structured causal framework for generating correlated latent variables.

[LG-96] Fisher meets Feynman: score-based variational inference with a product of experts NEURIPS

链接: https://arxiv.org/abs/2510.21598
作者: Diana Cai,Robert M. Gower,David M. Blei,Lawrence K. Saul
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 27 pages, 11 figures. To appear in Advances in Neural Processing Information Systems (NeurIPS), 2025

点击查看摘要

Abstract:We introduce a highly expressive yet distinctly tractable family for black-box variational inference (BBVI). Each member of this family is a weighted product of experts (PoE), and each weighted expert in the product is proportional to a multivariate t-distribution. These products of experts can model distributions with skew, heavy tails, and multiple modes, but to use them for BBVI, we must be able to sample from their densities. We show how to do this by reformulating these products of experts as latent variable models with auxiliary Dirichlet random variables. These Dirichlet variables emerge from a Feynman identity, originally developed for loop integrals in quantum field theory, that expresses the product of multiple fractions (or in our case, t-distributions) as an integral over the simplex. We leverage this simplicial latent space to draw weighted samples from these products of experts – samples which BBVI then uses to find the PoE that best approximates a target density. Given a collection of experts, we derive an iterative procedure to optimize the exponents that determine their geometric weighting in the PoE. At each iteration, this procedure minimizes a regularized Fisher divergence to match the scores of the variational and target densities at a batch of samples drawn from the current approximation. This minimization reduces to a convex quadratic program, and we prove under general conditions that these updates converge exponentially fast to a near-optimal weighting of experts. We conclude by evaluating this approach on a variety of synthetic and real-world target distributions.
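
For reference, the Feynman identity in its standard form, which rewrites a product of factors as an integral over the probability simplex (we transcribe the generic textbook statement; the paper applies it with t-distribution kernels in place of the A_k):

```latex
\[
\frac{1}{A_1 A_2 \cdots A_n}
  = (n-1)! \int_0^1 \!\cdots\! \int_0^1
    \frac{\delta\!\left(1 - \sum_{k=1}^{n} u_k\right)}
         {\left(\sum_{k=1}^{n} u_k A_k\right)^{n}}
    \, du_1 \cdots du_n .
\]
```

Setting n = 2 recovers the familiar two-factor case 1/(A_1 A_2) = \int_0^1 du / (u A_1 + (1-u) A_2)^2; the simplex variables u_k are exactly the auxiliary Dirichlet coordinates the abstract refers to.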

[LG-97] Contribution of task-irrelevant stimuli to drift of neural representations NEURIPS2025

链接: https://arxiv.org/abs/2510.21588
作者: Farhad Pashakhanloo
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: NeurIPS 2025

点击查看摘要

Abstract:Biological and artificial learners are inherently exposed to a stream of data and experience throughout their lifetimes and must constantly adapt to, learn from, or selectively ignore the ongoing input. Recent findings reveal that, even when the performance remains stable, the underlying neural representations can change gradually over time, a phenomenon known as representational drift. Studying the different sources of data and noise that may contribute to drift is essential for understanding lifelong learning in neural systems. However, a systematic study of drift across architectures and learning rules, and the connection to task, are missing. Here, in an online learning setup, we characterize drift as a function of data distribution, and specifically show that the learning noise induced by task-irrelevant stimuli, which the agent learns to ignore in a given context, can create long-term drift in the representation of task-relevant stimuli. Using theory and simulations, we demonstrate this phenomenon both in Hebbian-based learning – Oja’s rule and Similarity Matching – and in stochastic gradient descent applied to autoencoders and a supervised two-layer network. We consistently observe that the drift rate increases with the variance and the dimension of the data in the task-irrelevant subspace. We further show that this yields different qualitative predictions for the geometry and dimension-dependency of drift than those arising from Gaussian synaptic noise. Overall, our study links the structure of stimuli, task, and learning rule to representational drift and could pave the way for using drift as a signal for uncovering underlying computation in the brain.

[LG-98] HollowFlow: Efficient Sample Likelihood Evaluation using Hollow Message Passing NEURIPS2025

Link: https://arxiv.org/abs/2510.21542
Authors: Johann Flemming Gloy, Simon Olsson
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Accepted to NeurIPS 2025

Click to view abstract

Abstract:Flow and diffusion-based models have emerged as powerful tools for scientific applications, particularly for sampling non-normalized probability distributions, as exemplified by Boltzmann Generators (BGs). A critical challenge in deploying these models is their reliance on sample likelihood computations, which scale prohibitively with system size $n$, often rendering them infeasible for large-scale problems. To address this, we introduce HollowFlow, a flow-based generative model leveraging a novel non-backtracking graph neural network (NoBGNN). By enforcing a block-diagonal Jacobian structure, HollowFlow likelihoods are evaluated with a constant number of backward passes in $n$, yielding speed-ups of up to $\mathcal{O}(n^2)$: a significant step towards scaling BGs to larger systems. Crucially, our framework generalizes: any equivariant GNN or attention-based architecture can be adapted into a NoBGNN. We validate HollowFlow by training BGs on two different systems of increasing size. For both systems, the sampling and likelihood evaluation time decreases dramatically, following our theoretical scaling laws. For the larger system we obtain a $10^2\times$ speed-up, clearly illustrating the potential of HollowFlow-based approaches for high-dimensional scientific problems previously hindered by computational bottlenecks.
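The computational payoff of the block-diagonal Jacobian is the standard fact that its log-determinant splits into independent per-block log-determinants. The sketch below (illustrative sizes only) verifies this identity numerically; HollowFlow's contribution is an architecture (the NoBGNN) that guarantees the flow's Jacobian has this structure in the first place.

```python
import numpy as np

rng = np.random.default_rng(1)
n_blocks, bsize = 8, 3   # total dimension n = 24

# Well-conditioned small blocks standing in for the per-block Jacobians
blocks = [rng.normal(size=(bsize, bsize)) + 3.0 * np.eye(bsize)
          for _ in range(n_blocks)]

# Dense route: assemble the full n x n Jacobian and take one O(n^3) log-det
J = np.zeros((n_blocks * bsize, n_blocks * bsize))
for i, B in enumerate(blocks):
    J[i * bsize:(i + 1) * bsize, i * bsize:(i + 1) * bsize] = B
_, logdet_full = np.linalg.slogdet(J)

# Block route: independent small log-dets that can be evaluated in parallel
logdet_blocks = sum(np.linalg.slogdet(B)[1] for B in blocks)

print(np.isclose(logdet_full, logdet_blocks))   # True
```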

[LG-99] Finite-Time Analysis of Stochastic Nonconvex Nonsmooth Optimization on the Riemannian Manifolds NEURIPS2025

Link: https://arxiv.org/abs/2510.21468
Authors: Emre Sahinoglu, Youbang Sun, Shahin Shahrampour
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Comments: To Appear in NeurIPS 2025

Click to view abstract

Abstract:This work addresses the finite-time analysis of nonsmooth nonconvex stochastic optimization under Riemannian manifold constraints. We adapt the notion of Goldstein stationarity to the Riemannian setting as a performance metric for nonsmooth optimization on manifolds. We then propose a Riemannian Online to NonConvex (RO2NC) algorithm, for which we establish the sample complexity of $O(\epsilon^{-3}\delta^{-1})$ in finding $(\delta,\epsilon)$-stationary points. This result is the first-ever finite-time guarantee for fully nonsmooth, nonconvex optimization on manifolds and matches the optimal complexity in the Euclidean setting. When gradient information is unavailable, we develop a zeroth-order version of the RO2NC algorithm (ZO-RO2NC), for which we establish the same sample complexity. The numerical results support the theory and demonstrate the practical effectiveness of the algorithms.

[LG-100] Oracle-Efficient Combinatorial Semi-Bandits NEURIPS2025

Link: https://arxiv.org/abs/2510.21431
Authors: Jung-hun Kim, Milan Vojnović, Min-hwan Oh
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: NeurIPS 2025

Click to view abstract

Abstract:We study the combinatorial semi-bandit problem where an agent selects a subset of base arms and receives individual feedback. While this generalizes the classical multi-armed bandit and has broad applicability, its scalability is limited by the high cost of combinatorial optimization, requiring oracle queries at every round. To tackle this, we propose oracle-efficient frameworks that significantly reduce oracle calls while maintaining tight regret guarantees. For the worst-case linear reward setting, our algorithms achieve $\tilde{O}(\sqrt{T})$ regret using only $O(\log\log T)$ oracle queries. We also propose covariance-adaptive algorithms that leverage noise structure for improved regret, and extend our approach to general (non-linear) rewards. Overall, our methods reduce oracle usage from linear to (doubly) logarithmic in time, with strong theoretical guarantees.
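A common way to get doubly-logarithmic oracle usage is to re-invoke the combinatorial oracle only at rounds on a doubly-exponential grid and reuse the last solution in between. The sketch below illustrates only this counting argument; the paper's algorithms and regret analysis are substantially more involved.

```python
def oracle_call_rounds(T):
    """Rounds at which the combinatorial oracle is invoked under a
    doubly-exponential schedule t = 2**(2**k); between calls the previous
    oracle solution is reused, giving O(log log T) total invocations."""
    rounds, k = [], 0
    while 2 ** (2 ** k) <= T:
        rounds.append(2 ** (2 ** k))
        k += 1
    return rounds

print(oracle_call_rounds(10 ** 6))        # [2, 4, 16, 256, 65536]
print(len(oracle_call_rounds(10 ** 6)), "oracle calls over one million rounds")
```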

[LG-101] Efficient Exploration of Chemical Kinetics

Link: https://arxiv.org/abs/2510.21368
Authors: Rohit Goswami (Science Institute and Faculty of Physical Sciences, University of Iceland, Reykjavík, Iceland)
Subjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Software Engineering (cs.SE); Atomic Physics (physics.atom-ph); Data Analysis, Statistics and Probability (physics.data-an)
Comments: Doctoral dissertation, 102 pages, ISBN pending from the University of Iceland. By design, all text and figures within this thesis are original and do not appear in the associated papers

Click to view abstract

Abstract:Estimating reaction rates and chemical stability is fundamental, yet efficient methods for large-scale simulations remain out of reach despite advances in modeling and exascale computing. Direct simulation is limited by short timescales; machine-learned potentials require large data sets and struggle with transition state regions essential for reaction rates. Reaction network exploration with sufficient accuracy is hampered by the computational cost of electronic structure calculations, and even simplifications like harmonic transition state theory rely on prohibitively expensive saddle point searches. Surrogate model-based acceleration has been promising but hampered by overhead and numerical instability. This dissertation presents a holistic solution, co-designing physical representations, statistical models, and systems architecture in the Optimal Transport Gaussian Process (OT-GP) framework. Using physics-aware optimal transport metrics, OT-GP creates compact, chemically relevant surrogates of the potential energy surface, underpinned by statistically robust sampling. Alongside EON software rewrites for long timescale simulations, we introduce reinforcement learning approaches for both minimum-mode following (when the final state is unknown) and nudged elastic band methods (when endpoints are specified). Collectively, these advances establish a representation-first, modular approach to chemical kinetics simulation. Large-scale benchmarks and Bayesian hierarchical validation demonstrate state-of-the-art performance and practical exploration of chemical kinetics, transforming a longstanding theoretical promise into a working engine for discovery.

[LG-102] Enforcing Calibration in Multi-Output Probabilistic Regression with Pre-rank Regularization

Link: https://arxiv.org/abs/2510.21273
Authors: Naomi Desobry, Elnura Zhalieva, Souhaib Ben Taieb
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Probabilistic models must be well calibrated to support reliable decision-making. While calibration in single-output regression is well studied, defining and achieving multivariate calibration in multi-output regression remains considerably more challenging. The existing literature on multivariate calibration primarily focuses on diagnostic tools based on pre-rank functions, which are projections that reduce multivariate prediction-observation pairs to univariate summaries to detect specific types of miscalibration. In this work, we go beyond diagnostics and introduce a general regularization framework to enforce multivariate calibration during training for arbitrary pre-rank functions. This framework encompasses existing approaches such as highest density region calibration and copula calibration. Our method enforces calibration by penalizing deviations of the projected probability integral transforms (PITs) from the uniform distribution, and can be added as a regularization term to the loss function of any probabilistic predictor. Specifically, we propose a regularization loss that jointly enforces both marginal and multivariate pre-rank calibration. We also introduce a new PCA-based pre-rank that captures calibration along directions of maximal variance in the predictive distribution, while also enabling dimensionality reduction. Across 18 real-world multi-output regression datasets, we show that unregularized models are consistently miscalibrated, and that our methods significantly improve calibration across all pre-rank functions without sacrificing predictive accuracy.
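The sketch below (hypothetical names; the soft-indicator temperature and the projection direction are placeholders) shows the basic ingredients of such a regularizer: a pre-rank projects multivariate prediction-observation pairs to univariate PITs, and a differentiable penalty measures their deviation from Uniform(0,1). The paper's regularizer and pre-rank functions differ in detail.

```python
import torch

def pit_uniformity_penalty(pit):
    """Differentiable penalty for deviation of PIT values from Uniform(0,1):
    mean squared distance between sorted PITs and evenly spaced quantiles
    (a Cramer-von-Mises-style statistic)."""
    n = pit.shape[0]
    sorted_pit, _ = torch.sort(pit)
    grid = (torch.arange(1, n + 1, dtype=pit.dtype) - 0.5) / n
    return torch.mean((sorted_pit - grid) ** 2)

def projected_pit(pred_samples, y, v, temp=50.0):
    """PIT under a projection pre-rank: project predictive samples and the
    observation onto direction v (e.g. a principal component), then take the
    fraction of projected samples below the projected observation. A sigmoid
    soft indicator keeps the PIT differentiable for training."""
    proj_s = pred_samples @ v           # (B, S)
    proj_y = y @ v                      # (B,)
    return torch.sigmoid((proj_y.unsqueeze(1) - proj_s) * temp).mean(dim=1)

B, S, D = 64, 100, 5
pred = torch.randn(B, S, D)             # S samples from each predictive dist.
obs = torch.randn(B, D)
v = torch.ones(D) / D ** 0.5
penalty = pit_uniformity_penalty(projected_pit(pred, obs, v))
print(penalty)                          # add to the predictive loss as a regularizer
```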

[LG-103] Doubly-Regressing Approach for Subgroup Fairness

Link: https://arxiv.org/abs/2510.21091
Authors: Kyungseon Lee, Kunwoong Kim, Jihu Lee, Dongyoon Yang, Yongdai Kim
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Click to view abstract

Abstract:Algorithmic fairness is a socially crucial topic in real-world applications of AI. Among many notions of fairness, subgroup fairness is widely studied when multiple sensitive attributes (e.g., gender, race, age) are present. However, as the number of sensitive attributes grows, the number of subgroups increases accordingly, creating heavy computational burdens and a data sparsity problem (subgroups with very small sample sizes). In this paper, we develop a novel learning algorithm for subgroup fairness which resolves these issues by focusing on subgroups with sufficient sample sizes as well as marginal fairness (fairness for each sensitive attribute). To this end, we formalize a notion of subgroup-subset fairness and introduce a corresponding distributional fairness measure called the supremum Integral Probability Metric (supIPM). Building on this formulation, we propose the Doubly Regressing Adversarial learning for subgroup Fairness (DRAF) algorithm, which reduces a surrogate fairness gap for supIPM with much less computation than directly reducing supIPM. Theoretically, we prove that the proposed surrogate fairness gap is an upper bound of supIPM. Empirically, we show that the DRAF algorithm outperforms baseline methods in benchmark datasets, specifically when the number of sensitive attributes is large so that many subgroups are very small.

[LG-104] Iso-Riemannian Optimization on Learned Data Manifolds

Link: https://arxiv.org/abs/2510.21033
Authors: Willem Diepeveen, Melanie Weber
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Differential Geometry (math.DG)
Comments:

Click to view abstract

[LG-105] A Short Note on Upper Bounds for Graph Neural Operator Convergence Rate

Link: https://arxiv.org/abs/2510.20954
Authors: Roxanne Holden, Luana Ruiz
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments:

Click to view abstract

[LG-106] Kernel Learning with Adversarial Features: Numerical Efficiency and Adaptive Regularization NEURIPS2025

Link: https://arxiv.org/abs/2510.20883
Authors: Antônio H. Ribeiro, David Vävinggren, Dave Zachariah, Thomas B. Schön, Francis Bach
Subjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Accepted NeurIPS 2025

Click to view abstract

Abstract:Adversarial training has emerged as a key technique to enhance model robustness against adversarial input perturbations. Many of the existing methods rely on computationally expensive min-max problems that limit their application in practice. We propose a novel formulation of adversarial training in reproducing kernel Hilbert spaces, shifting from input to feature-space perturbations. This reformulation enables the exact solution of inner maximization and efficient optimization. It also provides a regularized estimator that naturally adapts to the noise level and the smoothness of the underlying function. We establish conditions under which the feature-perturbed formulation is a relaxation of the original problem and propose an efficient optimization algorithm based on iterative kernel ridge regression. We provide generalization bounds that help to understand the properties of the method. We also extend the formulation to multiple kernel learning. Empirical evaluation shows good performance in both clean and adversarial settings.
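To make the feature-space reformulation concrete, here is one plausible fixed-point scheme under squared loss, a sketch under stated assumptions rather than the authors' exact algorithm: with a feature perturbation of RKHS norm at most eps, the worst-case per-sample loss becomes (|r_i| + eps * ||f||_H)^2, and freezing the residual signs and ||f||_H at the current iterate reduces each step to kernel ridge regression with shifted targets, matching the paper's theme of iterative kernel ridge regression.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def adversarial_kernel_ridge(X, y, eps=0.1, lam=1e-2, n_iter=20, gamma=1.0):
    """Fixed-point iteration for feature-space adversarial training with
    squared loss: each step solves a kernel ridge problem whose targets are
    shifted by the worst-case perturbation eps * ||f||_H in the direction of
    the current residual signs."""
    K = rbf_kernel(X, X, gamma)
    n = len(y)
    alpha = np.linalg.solve(K + lam * np.eye(n), y)   # plain ridge warm start
    for _ in range(n_iter):
        f_norm = np.sqrt(max(alpha @ K @ alpha, 0.0)) # current RKHS norm
        r = y - K @ alpha
        y_shift = y + np.sign(r) * eps * f_norm       # worst-case shifted targets
        alpha = np.linalg.solve(K + lam * np.eye(n), y_shift)
    return alpha

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
alpha = adversarial_kernel_ridge(X, y)    # f(x) = sum_i alpha_i k(x_i, x)
```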

[LG-107] Exponential Convergence Guarantees for Iterative Markovian Fitting

Link: https://arxiv.org/abs/2510.20871
Authors: Marta Gentiloni Silveri, Giovanni Conforti, Alain Durmus
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
Comments:

Click to view abstract

Abstract:The Schrödinger Bridge (SB) problem has become a fundamental tool in computational optimal transport and generative modeling. To address this problem, idealized methods such as Iterative Proportional Fitting and Iterative Markovian Fitting (IMF) have been proposed, alongside practical approximations like Diffusion Schrödinger Bridge and its Matching (DSBM) variant. While previous work has established asymptotic convergence guarantees for IMF, a quantitative, non-asymptotic understanding has remained open. In this paper, we provide the first non-asymptotic exponential convergence guarantees for IMF under mild structural assumptions on the reference measure and marginal distributions, assuming a sufficiently large time horizon. Our results encompass two key regimes: one where the marginals are log-concave, and another where they are weakly log-concave. The analysis relies on new contraction results for the Markovian projection operator and paves the way to theoretical guarantees for DSBM.

[LG-108] BACE: Behavior-Adaptive Connectivity Estimation for Interpretable Graphs of Neural Dynamics

Link: https://arxiv.org/abs/2510.20831
Authors: Mehrnaz Asadi, Sina Javadzadeh, Rahil Soroushmojdehi, S. Alireza Seyyed Mousavi, Terence D. Sanger
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Understanding how distributed brain regions coordinate to produce behavior requires models that are both predictive and interpretable. We introduce Behavior-Adaptive Connectivity Estimation (BACE), an end-to-end framework that learns phase-specific, directed inter-regional connectivity directly from multi-region intracranial local field potentials (LFP). BACE aggregates many micro-contacts within each anatomical region via per-region temporal encoders, applies a learnable adjacency specific to each behavioral phase, and is trained on a forecasting objective. On synthetic multivariate time series with known graphs, BACE accurately recovers ground-truth directed interactions while achieving forecasting performance comparable to state-of-the-art baselines. Applied to human subcortical LFP recorded simultaneously from eight regions during a cued reaching task, BACE yields an explicit connectivity matrix for each within-trial behavioral phase. The resulting behavioral phase-specific graphs reveal behavior-aligned reconfiguration of inter-regional influence and provide compact, interpretable adjacency matrices for comparing network organization across behavioral phases. By linking predictive success to explicit connectivity estimates, BACE offers a practical tool for generating data-driven hypotheses about the dynamic coordination of subcortical regions during behavior.
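A minimal sketch of the architectural idea follows (illustrative, not the released code; the GRU encoder and all layer sizes are assumptions): per-region temporal encoders produce region embeddings, a learnable directed adjacency specific to each behavioral phase mixes them, and a linear decoder forecasts the next window.

```python
import torch
import torch.nn as nn

class PhaseConnectivityForecaster(nn.Module):
    """Per-region temporal encoders plus a learnable directed adjacency for
    each behavioral phase, trained on a forecasting objective."""
    def __init__(self, n_regions, n_phases, in_len, out_len, hid=32):
        super().__init__()
        self.encoders = nn.ModuleList(
            [nn.GRU(1, hid, batch_first=True) for _ in range(n_regions)])
        # adj[p, i, j]: influence of region j on region i during phase p
        self.adj = nn.Parameter(0.1 * torch.randn(n_phases, n_regions, n_regions))
        self.decoder = nn.Linear(hid, out_len)

    def forward(self, x, phase):
        # x: (batch, n_regions, in_len) region-aggregated LFP
        h = torch.stack([enc(x[:, r, :, None])[1][0]             # (batch, hid)
                         for r, enc in enumerate(self.encoders)], dim=1)
        mixed = torch.einsum('ij,bjh->bih', self.adj[phase], h)  # directed mixing
        return self.decoder(mixed)                               # (batch, n_regions, out_len)

model = PhaseConnectivityForecaster(n_regions=8, n_phases=3, in_len=100, out_len=20)
x = torch.randn(4, 8, 100)
forecast = model(x, phase=1)   # train with MSE against the true next window;
                               # model.adj[p] is then read off as the phase-p graph
```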

[LG-109] A Multiscale Approach for Enhancing Weak Signal Detection

Link: https://arxiv.org/abs/2510.20828
Authors: Dixon Vimalajeewa, Ursula U. Muller, Brani Vidakovic
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
Comments:

Click to view abstract

Abstract:Stochastic resonance (SR), a phenomenon originally introduced in climate modeling, enhances signal detection by leveraging optimal noise levels within non-linear systems. Traditional SR techniques, mainly based on single-threshold detectors, are limited to signals whose behavior does not depend on time. Often large amounts of noise are needed to detect weak signals, which can distort complex signal characteristics. To address these limitations, this study explores multi-threshold systems and the application of SR in multiscale applications using wavelet transforms. In the multiscale domain signals can be analyzed at different levels of resolution to better understand the underlying dynamics. We propose a double-threshold detection system that integrates two single-threshold detectors to enhance weak signal detection. We evaluate it both in the original data domain and in the multiscale domain using simulated and real-world signals and compare its performance with existing methods. Experimental results demonstrate that, in the original data domain, the proposed double-threshold detector significantly improves weak signal detection compared to conventional single-threshold approaches. Its performance is further improved in the frequency domain, requiring lower noise levels while outperforming existing detection systems. This study advances SR-based detection methodologies by introducing a robust approach to weak signal identification, with potential applications in various disciplines.
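The sketch below (toy parameters, not the paper's detector) illustrates the stochastic-resonance effect with a simple double-threshold detector: a sub-threshold sinusoid becomes detectable only at intermediate noise levels, where the correlation between the detector output and the clean signal peaks before degrading again as noise grows.

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0, 10, 5000)
signal = 0.3 * np.sin(2 * np.pi * t)        # weak signal, below both thresholds

def double_threshold_detect(noise_sigma, th=0.4, n_avg=200):
    """Combine two single-threshold detectors (at +th and -th) over many noisy
    realizations; score detection by correlation with the clean signal."""
    out = np.zeros_like(t)
    for _ in range(n_avg):
        x = signal + rng.normal(scale=noise_sigma, size=t.shape)
        out += (x > th).astype(float) - (x < -th).astype(float)
    out /= n_avg
    return np.corrcoef(out, signal)[0, 1]

for sigma in (0.1, 0.2, 0.4, 0.8, 1.6):
    print(f"noise sigma {sigma:.1f}: correlation {double_threshold_detect(sigma):.3f}")
```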

[LG-110] Triangle Multiplication Is All You Need For Biomolecular Structure Representations

Link: https://arxiv.org/abs/2510.18870
Authors: Jeffrey Ouyang-Zhang, Pranav Murugan, Daniel J. Diaz, Gianluca Scarpellini, Richard Strong Bowen, Nate Gruver, Adam Klivans, Philipp Krähenbühl, Aleksandra Faust, Maruan Al-Shedivat
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: Preprint

Click to view abstract

Information Retrieval

[IR-0] A Data-Centric Approach to Multilingual E-Commerce Product Search: Case Study on Query-Category and Query-Item Relevance

Link: https://arxiv.org/abs/2510.21671
Authors: Yabo Yin, Yang Xi, Jialong Wang, Shanqi Wang, Jiateng Hu
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Multilingual e-commerce search suffers from severe data imbalance across languages, label noise, and limited supervision for low-resource languages–challenges that impede the cross-lingual generalization of relevance models despite the strong capabilities of large language models (LLMs). In this work, we present a practical, architecture-agnostic, data-centric framework to enhance performance on two core tasks: Query-Category (QC) relevance (matching queries to product categories) and Query-Item (QI) relevance (matching queries to product titles). Rather than altering the model, we redesign the training data through three complementary strategies: (1) translation-based augmentation to synthesize examples for languages absent in training, (2) semantic negative sampling to generate hard negatives and mitigate class imbalance, and (3) self-validation filtering to detect and remove likely mislabeled instances. Evaluated on the CIKM AnalytiCup 2025 dataset, our approach consistently yields substantial F1 score improvements over strong LLM baselines, achieving competitive results in the official competition. Our findings demonstrate that systematic data engineering can be as impactful as–and often more deployable than–complex model modifications, offering actionable guidance for building robust multilingual search systems in real-world e-commerce settings.
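As an illustration of the second strategy, the following sketch (hypothetical embeddings and function names) mines semantic hard negatives: among items that are not relevant to a query, it keeps the ones whose embeddings are closest to the query, which are more informative for relevance training than random negatives.

```python
import numpy as np

def mine_hard_negatives(query_emb, item_embs, pos_idx, k=5):
    """Semantic negative sampling: among items NOT relevant to the query,
    return the k whose embeddings are most similar to it by cosine
    similarity; such hard negatives sharpen the decision boundary."""
    sims = item_embs @ query_emb / (
        np.linalg.norm(item_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    ranked = np.argsort(-sims)
    return [int(i) for i in ranked if int(i) not in pos_idx][:k]

rng = np.random.default_rng(5)
item_embs = rng.normal(size=(1000, 64))     # catalogue item embeddings
query_emb = rng.normal(size=64)             # query embedding
print(mine_hard_negatives(query_emb, item_embs, pos_idx={3, 17}))
```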

[IR-1] SciNUP: Natural Language User Interest Profiles for Scientific Literature Recommendation

Link: https://arxiv.org/abs/2510.21352
Authors: Mariam Arustashvili, Krisztian Balog
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:The use of natural language (NL) user profiles in recommender systems offers greater transparency and user control compared to traditional representations. However, there is a scarcity of large-scale, publicly available test collections for evaluating NL profile-based recommendation. To address this gap, we introduce SciNUP, a novel synthetic dataset for scholarly recommendation that leverages authors’ publication histories to generate NL profiles and corresponding ground truth items. We use this dataset to conduct a comparison of baseline methods, ranging from sparse and dense retrieval approaches to state-of-the-art LLM-based rerankers. Our results show that while baseline methods achieve comparable performance, they often retrieve different items, indicating complementary behaviors. At the same time, considerable headroom for improvement remains, highlighting the need for effective NL-based recommendation approaches. The SciNUP dataset thus serves as a valuable resource for fostering future research and development in this area.

[IR-2] Bi-Level Optimization for Generative Recommendation: Bridging Tokenization and Generation

Link: https://arxiv.org/abs/2510.21242
Authors: Yimeng Bai, Chang Liu, Yang Zhang, Dingxian Wang, Frank Yang, Andrew Rabinovich, Wenge Rong, Fuli Feng
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Generative recommendation is emerging as a transformative paradigm by directly generating recommended items, rather than relying on matching. Building such a system typically involves two key components: (1) optimizing the tokenizer to derive suitable item identifiers, and (2) training the recommender based on those identifiers. Existing approaches often treat these components separately–either sequentially or in alternation–overlooking their interdependence. This separation can lead to misalignment: the tokenizer is trained without direct guidance from the recommendation objective, potentially yielding suboptimal identifiers that degrade recommendation performance. To address this, we propose BLOGER, a Bi-Level Optimization for GEnerative Recommendation framework, which explicitly models the interdependence between the tokenizer and the recommender in a unified optimization process. The lower level trains the recommender using tokenized sequences, while the upper level optimizes the tokenizer based on both the tokenization loss and recommendation loss. We adopt a meta-learning approach to solve this bi-level optimization efficiently, and introduce gradient surgery to mitigate gradient conflicts in the upper-level updates, thereby ensuring that item identifiers are both informative and recommendation-aligned. Extensive experiments on real-world datasets demonstrate that BLOGER consistently outperforms state-of-the-art generative recommendation methods while maintaining practical efficiency with no significant additional computational overhead, effectively bridging the gap between item tokenization and autoregressive generation.
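Gradient surgery is commonly implemented as a PCGrad-style projection; the sketch below shows that standard form (BLOGER's exact variant may differ): when the tokenization and recommendation gradients conflict, the component of one along the negative direction of the other is removed before the joint update.

```python
import torch

def gradient_surgery(g_token, g_rec):
    """PCGrad-style conflict resolution: if the tokenization and
    recommendation gradients conflict (negative inner product), remove from
    g_token its component along g_rec before summing the two updates."""
    dot = torch.dot(g_token, g_rec)
    if dot < 0:
        g_token = g_token - dot / (g_rec.norm() ** 2 + 1e-12) * g_rec
    return g_token + g_rec

g1 = torch.tensor([1.0, -2.0])   # tokenization gradient (toy values)
g2 = torch.tensor([1.0, 1.0])    # recommendation gradient
print(gradient_surgery(g1, g2))  # conflict removed before the joint update
```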

[IR-3] VOGUE: A Multimodal Dataset for Conversational Recommendation in Fashion

Link: https://arxiv.org/abs/2510.21151
Authors: David Guo, Minqi Sun, Yilun Jiang, Jiazhou Liang, Scott Sanner
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Multimodal conversational recommendation has emerged as a promising paradigm for delivering personalized experiences through natural dialogue enriched by visual and contextual grounding. Yet, current multimodal conversational recommendation datasets remain limited: existing resources either simulate conversations, omit user history, or fail to collect sufficiently detailed feedback, all of which constrain the types of research and evaluation they support. To address these gaps, we introduce VOGUE, a novel dataset of 60 human-human dialogues in realistic fashion shopping scenarios. Each dialogue is paired with a shared visual catalogue, item metadata, user fashion profiles and histories, and post-conversation ratings from both Seekers and Assistants. This design enables rigorous evaluation of conversational inference, including not only alignment between predicted and ground-truth preferences, but also calibration against full rating distributions and comparison with explicit and implicit user satisfaction signals. Our initial analyses of VOGUE reveal distinctive dynamics of visually grounded dialogue. For example, recommenders frequently suggest items simultaneously in feature-based groups, which creates distinct conversational phases bridged by Seeker critiques and refinements. Benchmarking multimodal large language models against human recommenders shows that while MLLMs approach human-level alignment in aggregate, they exhibit systematic distribution errors in reproducing human ratings and struggle to generalize preference inference beyond explicitly discussed items. These findings establish VOGUE as both a unique resource for studying multimodal conversational systems and as a challenge dataset beyond the current recommendation capabilities of existing top-tier multimodal foundation models such as GPT-4o-mini, GPT-5-mini, and Gemini-2.5-Flash.

[IR-4] Communication Platform for Non-verbal Autistic children in Oman using Android mobile

Link: https://arxiv.org/abs/2510.21028
Authors: Amna Al-Araimi, Yue Zheng, Haiming Liu
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:This paper addresses non-verbal Autism Spectrum Disorder, a condition reported across many parts of the world, including the US, the UK, and India. To support affected children, a wide range of smartphone, computer, and artificial intelligence technologies have been used, helping them cope with socialization and communication needs. Many applications have been developed to enhance the communication capabilities of non-verbal autistic children. This thesis project proposes a platform comprising a web panel and an Android mobile application to assist non-verbal autistic children in communication, especially in Oman. Different interventions have been merged to improve the quality of life for people on the autism spectrum. The main problem identified is that fragmented approaches are not suitable for autistic children. The augmented reality framework provides the capability to engage autistic children in creative play and self-reflection through interactive screen-based activities.

[IR-5] Gaussian Mixture Flow Matching with Domain Alignment for Multi-Domain Sequential Recommendation

Link: https://arxiv.org/abs/2510.21021
Authors: Xiaoxin Ye, Chengkai Huang, Hongtao Huang, Lina Yao
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Users increasingly interact with content across multiple domains, resulting in sequential behaviors marked by frequent and complex transitions. While Cross-Domain Sequential Recommendation (CDSR) models two-domain interactions, Multi-Domain Sequential Recommendation (MDSR) introduces significantly more domain transitions, compounded by challenges such as domain heterogeneity and imbalance. Existing approaches often overlook the intricacies of domain transitions, tend to overfit to dense domains while underfitting sparse ones, and struggle to scale effectively as the number of domains increases. We propose GMFlowRec, an efficient generative framework for MDSR that models domain-aware transition trajectories via Gaussian Mixture Flow Matching. GMFlowRec integrates: (1) a unified dual-masked Transformer to disentangle domain-invariant and domain-specific intents, (2) a Gaussian Mixture flow field to capture diverse behavioral patterns, and (3) a domain-aligned prior to support frequent and sparse transitions. Extensive experiments on JD and Amazon datasets demonstrate that GMFlowRec achieves state-of-the-art performance with up to 44% improvement in NDCG@5, while maintaining high efficiency via a single unified backbone, making it scalable for real-world multi-domain sequential recommendation.
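For orientation, the following sketch shows a generic conditional flow-matching objective (illustrative only; GMFlowRec's Gaussian-mixture flow field and domain-aligned prior add structure beyond this, and all names and sizes here are assumptions): a velocity network is regressed onto the constant velocity of the straight path from a prior sample to the target item embedding, conditioned on a sequence encoding.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Tiny MLP predicting a velocity field v(x_t, t, cond)."""
    def __init__(self, dim, cond_dim, hid=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1 + cond_dim, hid), nn.SiLU(), nn.Linear(hid, dim))

    def forward(self, x, t, cond):
        return self.net(torch.cat([x, t, cond], dim=-1))

def flow_matching_loss(vnet, x0, x1, cond):
    """Regress the network's velocity at a random point on the straight path
    from prior sample x0 to target x1 onto the path's constant velocity."""
    t = torch.rand(x0.shape[0], 1)
    xt = (1 - t) * x0 + t * x1          # point on the interpolation path
    return ((vnet(xt, t, cond) - (x1 - x0)) ** 2).mean()

dim, cond_dim, B = 16, 8, 32
vnet = VelocityNet(dim, cond_dim)
x0 = torch.randn(B, dim)          # prior sample (domain-aligned in the paper)
x1 = torch.randn(B, dim)          # target next-item embedding
cond = torch.randn(B, cond_dim)   # sequence encoding from the Transformer
loss = flow_matching_loss(vnet, x0, x1, cond)
```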

Attachment Download

Click to download today's full paper list