This post contains the latest paper list retrieved from Arxiv.org on 2025-10-15, updated automatically and organized into five major areas: NLP, CV, ML, AI, and IR. If you would like to receive it by email on a schedule, please leave your email address in the comments.
Note: Paper data is retrieved from Arxiv.org daily, with an automatic update around 12:00 each day.
Table of Contents
Overview (2025-10-15)
552 papers updated today, including:
- Natural Language Processing: 89 papers (Computation and Language (cs.CL))
- Artificial Intelligence: 160 papers (Artificial Intelligence (cs.AI))
- Computer Vision: 102 papers (Computer Vision and Pattern Recognition (cs.CV))
- Machine Learning: 166 papers (Machine Learning (cs.LG))
Natural Language Processing
[NLP-0] SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models
Quick Read: This paper addresses the mismatch between visual understanding and visual generation in Unified Multimodal Models (UMMs): a model may understand image content accurately yet fail to generate images semantically consistent with text prompts. The key is SRUM, a self-rewarding post-training framework that builds a feedback loop in which the model's own understanding module acts as an internal "evaluator", providing corrective signals to the generation module and enabling self-improvement without additional human-labeled data. SRUM adopts a global-local dual reward system: the global reward ensures correct overall visual semantics and layout, while the local reward improves object-level fidelity, yielding marked gains in generation capability and generalization.
Link: https://arxiv.org/abs/2510.12784
Authors: Weiyang Jin, Yuwei Niu, Jiaqi Liao, Chengqi Duan, Aoxue Li, Shenghua Gao, Xihui Liu
Institutions: HKU MMLab; The University of Hong Kong; Noah’s Ark Lab, Huawei
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 20 pages, 8 figures; webpage can be seen in this https URL
Abstract:Recently, remarkable progress has been made in Unified Multimodal Models (UMMs), which integrate vision-language generation and understanding capabilities within a single framework. However, a significant gap exists where a model’s strong visual understanding often fails to transfer to its visual generation. A model might correctly understand an image based on user instructions, yet be unable to generate a faithful image from text prompts. This phenomenon directly raises a compelling question: Can a model achieve self-improvement by using its understanding module to reward its generation module? To bridge this gap and achieve self-improvement, we introduce SRUM, a self-rewarding post-training framework that can be directly applied to existing UMMs of various designs. SRUM creates a feedback loop where the model’s own understanding module acts as an internal “evaluator”, providing corrective signals to improve its generation module, without requiring additional human-labeled data. To ensure this feedback is comprehensive, we designed a global-local dual reward system. To tackle the inherent structural complexity of images, this system offers multi-scale guidance: a global reward ensures the correctness of the overall visual semantics and layout, while a local reward refines fine-grained, object-level fidelity. SRUM leads to powerful capabilities and shows strong generalization, boosting performance on T2I-CompBench from 82.18 to 88.37 and on T2I-ReasonBench from 43.82 to 46.75. Overall, our work establishes a powerful new paradigm for enabling a UMM’s understanding module to guide and enhance its own generation via self-rewarding.
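The global-local dual reward can be made concrete with a small sketch. Nothing below comes from the paper's code: the scoring inputs and the weighting factor alpha are assumptions, illustrating only how a global layout score and per-object fidelity scores might be fused into one self-reward signal.

```python
# Hypothetical sketch of a global-local dual reward in the spirit of SRUM.
# The scoring functions and the weight alpha are assumptions.

def dual_reward(global_score: float, local_scores: list[float], alpha: float = 0.5) -> float:
    """Combine a global semantics/layout score with object-level fidelity scores."""
    local = sum(local_scores) / len(local_scores) if local_scores else 0.0
    return alpha * global_score + (1 - alpha) * local

# Example: the understanding module rated overall layout 0.9,
# and three detected objects 0.8, 0.6, 1.0 for fidelity.
print(dual_reward(0.9, [0.8, 0.6, 1.0]))  # 0.85
```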
[NLP-1] Cost Analysis of Human-corrected Transcription for Predominately Oral Languages
Quick Read: This paper addresses the high human-labor cost of building high-quality annotated speech datasets for low-resource languages, in particular low-literacy Predominately Oral Languages. Its key contribution is quantifying, through a field study and a laboratory comparison, the time required to transcribe one hour of speech by hand: 30 hours of human labor on average under laboratory conditions, rising to 36 hours under field conditions. These empirical figures provide a baseline and practical guidance for building NLP resources for languages with comparable profiles.
Link: https://arxiv.org/abs/2510.12781
Authors: Yacouba Diarra, Nouhoum Souleymane Coulibaly, Michael Leventhal
Institutions: unknown
Categories: Computation and Language (cs.CL)
Comments: 6 pages, 1 figure
Abstract:Creating speech datasets for low-resource languages is a critical yet poorly understood challenge, particularly regarding the actual cost in human labor. This paper investigates the time and complexity required to produce high-quality annotated speech data for a subset of low-resource languages, low literacy Predominately Oral Languages, focusing on Bambara, a Manding language of Mali. Through a one-month field study involving ten transcribers with native proficiency, we analyze the correction of ASR-generated transcriptions of 53 hours of Bambara voice data. We report that it takes, on average, 30 hours of human labor to accurately transcribe one hour of speech data under laboratory conditions and 36 hours under field conditions. The study provides a baseline and practical insights for a large class of languages with comparable profiles undertaking the creation of NLP resources.
[NLP-2] Content Anonymization for Privacy in Long-form Audio
Quick Read: This paper targets the privacy risks that voice anonymization faces in long-form audio. Methods that work well on short, isolated utterances break down in real settings such as phone calls and interviews, where an attacker can re-identify a speaker from vocabulary, syntax, and turns of phrase across many utterances, even when the voice itself is fully disguised. The key is content-based anonymization: contextually paraphrasing transcripts within an ASR-TTS pipeline, removing speaker-specific linguistic style while preserving meaning, and thereby defending against content-based re-identification attacks. Experiments on long-form telephone conversations show substantially improved privacy protection while preserving speech utility.
Link: https://arxiv.org/abs/2510.12780
Authors: Cristina Aggazzotti, Ashi Garg, Zexin Cai, Nicholas Andrews
Institutions: unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL)
Comments:
Abstract:Voice anonymization techniques have been found to successfully obscure a speaker’s acoustic identity in short, isolated utterances in benchmarks such as the VoicePrivacy Challenge. In practice, however, utterances seldom occur in isolation: long-form audio is commonplace in domains such as interviews, phone calls, and meetings. In these cases, many utterances from the same speaker are available, which pose a significantly greater privacy risk: given multiple utterances from the same speaker, an attacker could exploit an individual’s vocabulary, syntax, and turns of phrase to re-identify them, even when their voice is completely disguised. To address this risk, we propose new content anonymization approaches. Our approach performs a contextual rewriting of the transcripts in an ASR-TTS pipeline to eliminate speaker-specific style while preserving meaning. We present results in a long-form telephone conversation setting demonstrating the effectiveness of a content-based attack on voice-anonymized speech. Then we show how the proposed content-based anonymization methods can mitigate this risk while preserving speech utility. Overall, we find that paraphrasing is an effective defense against content-based attacks and recommend that stakeholders adopt this step to ensure anonymity in long-form audio.
[NLP-3] Dr.LLM: Dynamic Layer Routing in LLMs
Quick Read: This paper tackles wasted computation during LLM inference: every token passes through all transformer layers, making simple queries inefficient while harder ones lack the depth they need. The key is Dr.LLM (Dynamic Routing of Layers for LLMs), a retrofittable framework that equips pretrained models with lightweight per-layer routers deciding whether to skip, execute, or repeat each block. Routers are trained with explicit supervision: Monte Carlo Tree Search (MCTS) derives high-quality layer configurations that preserve or improve accuracy under a compute budget, while windowed pooling, focal loss with class balancing, and bottleneck MLP routers ensure robustness under class imbalance and long sequences. On ARC and DART it saves an average of 5 layers per example while improving accuracy by up to +3.4 percentage points, and it generalizes to out-of-domain tasks with only a 0.85% accuracy drop while clearly outperforming prior routing methods.
Link: https://arxiv.org/abs/2510.12773
Authors: Ahmed Heakl, Martin Gubri, Salman Khan, Sangdoo Yun, Seong Joon Oh
Institutions: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 17 pages, under submission
Abstract:Large Language Models (LLMs) process every token through all layers of a transformer stack, causing wasted computation on simple queries and insufficient flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can improve efficiency, but prior approaches rely on costly inference-time search, architectural changes, or large-scale retraining, and in practice often degrade accuracy despite efficiency gains. We introduce Dr.LLM, Dynamic routing of Layers for LLMs, a retrofittable framework that equips pretrained models with lightweight per-layer routers deciding to skip, execute, or repeat a block. Routers are trained with explicit supervision: using Monte Carlo Tree Search (MCTS), we derive high-quality layer configurations that preserve or improve accuracy under a compute budget. Our design, windowed pooling for stable routing, focal loss with class balancing, and bottleneck MLP routers, ensures robustness under class imbalance and long sequences. On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average. Routers generalize to out-of-domain tasks (MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, AGIEval) with only 0.85% accuracy drop while retaining efficiency, and outperform prior routing methods by up to +7.7%p. Overall, Dr.LLM shows that explicitly supervised routers retrofit frozen LLMs for budget-aware, accuracy-driven inference without altering base weights.
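As an illustration of what such a per-layer router could look like, here is a minimal PyTorch sketch. The bottleneck MLP and the three-way skip/execute/repeat decision follow the abstract; the dimensions and the mean pooling (standing in for the paper's windowed pooling) are assumptions.

```python
# Minimal sketch of a per-layer router in the spirit of Dr.LLM.
import torch
import torch.nn as nn

class LayerRouter(nn.Module):
    """Bottleneck MLP mapping pooled hidden states to a routing decision."""
    SKIP, EXECUTE, REPEAT = 0, 1, 2

    def __init__(self, hidden_dim: int = 4096, bottleneck: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, bottleneck), nn.GELU(),
            nn.Linear(bottleneck, 3),
        )

    def forward(self, hidden_states: torch.Tensor) -> int:
        pooled = hidden_states.mean(dim=1)    # (batch, seq, dim) -> (batch, dim)
        logits = self.mlp(pooled)
        return int(logits.argmax(dim=-1)[0])  # decision for this block

router = LayerRouter()
h = torch.randn(1, 16, 4096)                 # dummy hidden states
names = {0: "skip", 1: "execute", 2: "repeat"}
print(names[router(h)])                      # untrained weights: decision is arbitrary
```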
[NLP-4] Language Models Model Language
Quick Read: This paper argues that much current criticism of large language models (LLMs) lacks an empirical basis, especially critiques rooted in the theoretical frameworks of de Saussure and Chomsky that demand "deep structure" or "grounding" for an idealized linguistic competence. The key move is to adopt the empiricist view of the Polish linguist Witold Mańczak: language is the totality of all that is said and written, not a system of signs or a computational system of the brain, and the frequency of use of particular language elements is its primary governing principle. This perspective offers a new basis for re-evaluating LLMs and guides how they are designed, evaluated, and interpreted.
Link: https://arxiv.org/abs/2510.12766
Authors: Łukasz Borchmann
Institutions: Snowflake AI Research
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Linguistic commentary on LLMs, heavily influenced by the theoretical frameworks of de Saussure and Chomsky, is often speculative and unproductive. Critics challenge whether LLMs can legitimately model language, citing the need for “deep structure” or “grounding” to achieve an idealized linguistic “competence.” We argue for a radical shift in perspective towards the empiricist principles of Witold Mańczak, a prominent general and historical linguist. He defines language not as a “system of signs” or a “computational system of the brain” but as the totality of all that is said and written. Above all, he identifies frequency of use of particular language elements as language’s primary governing principle. Using his framework, we challenge prior critiques of LLMs and provide a constructive guide for designing, evaluating, and interpreting language models.
[NLP-5] Hey wait a minute: on at-issue sensitivity in Language Models
Quick Read: This paper tackles the difficulty of evaluating the naturalness of dialogue in language models (LMs): "naturalness" is subjective, and scalable quantitative metrics are scarce. It leverages the linguistic notion of at-issueness and introduces a new method, Divide, Generate, Recombine, and Compare (DGRC), which divides a dialogue into subparts, generates continuations with an LM, recombines the sequences, and compares their likelihoods. This mitigates bias in linguistic analyses of LMs and enables systematic tests of discourse-sensitive behavior, revealing that LMs prefer to continue dialogue on at-issue content, an effect enhanced in instruct-tuned models.
Link: https://arxiv.org/abs/2510.12740
Authors: Sanghee J. Kim, Kanishka Misra
Institutions: The University of Chicago; The University of Texas at Austin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures, 3 tables. See this https URL for code and data
Abstract:Evaluating the naturalness of dialogue in language models (LMs) is not trivial: notions of ‘naturalness’ vary, and scalable quantitative metrics remain limited. This study leverages the linguistic notion of ‘at-issueness’ to assess dialogue naturalness and introduces a new method: Divide, Generate, Recombine, and Compare (DGRC). DGRC (i) divides a dialogue as a prompt, (ii) generates continuations for subparts using LMs, (iii) recombines the dialogue and continuations, and (iv) compares the likelihoods of the recombined sequences. This approach mitigates bias in linguistic analyses of LMs and enables systematic testing of discourse-sensitive behavior. Applying DGRC, we find that LMs prefer to continue dialogue on at-issue content, with this effect enhanced in instruct-tuned models. They also reduce their at-issue preference when relevant cues (e.g., “Hey, wait a minute”) are present. Although instruct-tuning does not further amplify this modulation, the pattern reflects a hallmark of successful dialogue dynamics.
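The four DGRC steps map naturally onto a few lines of code. In this sketch, `generate` and `log_likelihood` are placeholders for real LM calls; only the divide/generate/recombine/compare control flow mirrors the method's description.

```python
# Hedged sketch of the DGRC procedure; the two helpers stand in for an LM.

def generate(prefix: str) -> str:
    return " Yeah, about that..."           # placeholder continuation

def log_likelihood(text: str) -> float:
    return -0.5 * len(text.split())         # placeholder LM score

def dgrc(dialogue_parts: list[str]) -> list[float]:
    scores = []
    for i in range(1, len(dialogue_parts) + 1):
        prompt = " ".join(dialogue_parts[:i])      # (i) divide
        continuation = generate(prompt)            # (ii) generate
        recombined = prompt + continuation         # (iii) recombine
        scores.append(log_likelihood(recombined))  # (iv) compare likelihoods
    return scores

print(dgrc(["A: Sue won the race.", "B: Hey, wait a minute."]))
```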
[NLP-6] Which Word Orders Facilitate Length Generalization in LMs? An Investigation with GCG-Based Artificial Languages EMNLP2025
Quick Read: This paper asks whether language models (LMs) have inductive biases favoring typologically frequent grammatical properties over rare, implausible ones. Prior work relies on artificial languages (ALs), but their context-free formalization misses real constructions found in natural languages. The key contributions are twofold: first, adopting Generalized Categorial Grammar (GCG) to extend AL expressiveness to attested but previously overlooked constructions such as unbounded dependencies and mildly context-sensitive structures; second, focusing evaluation on LMs' generalization to unseen longer test sentences. The ALs thus better capture features of natural languages, and the paradigm yields a clearer conclusion: typologically plausible word orders are easier for LMs to generalize productively.
Link: https://arxiv.org/abs/2510.12722
Authors: Nadine El-Naggar, Tatsuki Kuribayashi, Ted Briscoe
Institutions: Mohamed bin Zayed University of Artificial Intelligence
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2025 Main Conference
Abstract:Whether language models (LMs) have inductive biases that favor typologically frequent grammatical properties over rare, implausible ones has been investigated, typically using artificial languages (ALs) (White and Cotterell, 2021; Kuribayashi et al., 2024). In this paper, we extend these works from two perspectives. First, we extend their context-free AL formalization by adopting Generalized Categorial Grammar (GCG) (Wood, 2014), which allows ALs to cover attested but previously overlooked constructions, such as unbounded dependency and mildly context-sensitive structures. Second, our evaluation focuses more on the generalization ability of LMs to process unseen longer test sentences. Thus, our ALs better capture features of natural languages and our experimental paradigm leads to clearer conclusions – typologically plausible word orders tend to be easier for LMs to productively generalize.
[NLP-7] Omni-Captioner: Data Pipeline, Models and Benchmark for Omni Detailed Perception
Quick Read: This paper addresses the limited fine-grained perception of current Omni Language Models (OLMs), in particular their tendency to hallucinate when producing highly detailed descriptions. The authors identify an inherent "co-growth" between detail and hallucination in existing OLMs that constrains precise understanding and description of audio and video. The key is Omni-Detective, an agentic data-generation pipeline with tool-calling that autonomously produces highly detailed yet minimally hallucinatory multimodal data; on this data they train Audio-Captioner (audio-only detailed captioning) and Omni-Captioner (audio-visual detailed captioning), and they design Omni-Cloze, a novel cloze-style benchmark for stable, efficient, and reliable evaluation of detailed multimodal captions. Experiments show clear gains in fine-grained perception and an improved detail-hallucination trade-off.
Link: https://arxiv.org/abs/2510.12720
Authors: Ziyang Ma, Ruiyang Xu, Zhenghao Xing, Yunfei Chu, Yuxuan Wang, Jinzheng He, Jin Xu, Pheng-Ann Heng, Kai Yu, Junyang Lin, Eng Siong Chng, Xie Chen
Institutions: unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
Comments: this https URL
Abstract:Fine-grained perception of multimodal information is critical for advancing human-AI interaction. With recent progress in audio-visual technologies, Omni Language Models (OLMs), capable of processing audio and video signals in parallel, have emerged as a promising paradigm for achieving richer understanding and reasoning. However, their capacity to capture and describe fine-grained details remains underexplored. In this work, we present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark. We first identify an inherent “co-growth” between detail and hallucination in current OLMs. To address this, we propose Omni-Detective, an agentic data generation pipeline integrating tool-calling, to autonomously produce highly detailed yet minimally hallucinatory multimodal data. Based on the data generated with Omni-Detective, we train two captioning models: Audio-Captioner for audio-only detailed perception, and Omni-Captioner for audio-visual detailed perception. Under the cascade evaluation protocol, Audio-Captioner achieves the best performance on MMAU and MMAR among all open-source models, surpassing Gemini 2.5 Flash and delivering performance comparable to Gemini 2.5 Pro. On existing detailed captioning benchmarks, Omni-Captioner sets a new state-of-the-art on VDC and achieves the best trade-off between detail and hallucination on the video-SALMONN 2 testset. Given the absence of a dedicated benchmark for omni detailed perception, we design Omni-Cloze, a novel cloze-style evaluation for detailed audio, visual, and audio-visual captioning that ensures stable, efficient, and reliable assessment. Experimental results and analysis demonstrate the effectiveness of Omni-Detective in generating high-quality detailed captions, as well as the superiority of Omni-Cloze in evaluating such detailed captions.
[NLP-8] Generation Space Size: Understanding and Calibrating Open-Endedness of LLM Generations
Quick Read: This paper targets two failure modes of LLMs in open-ended generation caused by a miscalibrated balance between output diversity and accuracy: overly homogeneous outputs on creative tasks, and diverse-but-wrong hallucinations on factual tasks. The core idea is the unifying notion of effective generation space size (GSS), the set of semantically distinct outputs a model considers for a prompt, together with GSSBench, a task suite of prompt pairs with ground-truth GSS relationships for assessing metrics. Hallucination-detection metrics, particularly EigenScore, track GSS better than standard diversity and uncertainty metrics while using only model internals, providing interpretable insights into a model's internal task representations. This enables detecting prompt ambiguity, explaining overthinking and underthinking in reasoning models, and steering models to expand their generation space for higher-quality, more diverse outputs.
Link: https://arxiv.org/abs/2510.12699
Authors: Sunny Yu, Ahmad Jabbar, Robert Hawkins, Dan Jurafsky, Myra Cheng
Institutions: Stanford University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Different open-ended generation tasks require different degrees of output diversity. However, current LLMs are often miscalibrated. They collapse to overly homogeneous outputs for creative tasks and hallucinate diverse but incorrect responses for factual tasks. We argue that these two failure modes are unified by, and can both be addressed by, the notion of effective generation space size (GSS) – the set of semantically distinct outputs a model considers for a prompt. We present GSSBench, a task suite of prompt pairs with ground-truth GSS relationships to assess different metrics and understand where models diverge from desired behavior. We find that hallucination detection metrics, particularly EigenScore, consistently outperform standard diversity and uncertainty quantification metrics, while using only model internals, providing interpretable insights into a model’s internal task representations. We demonstrate three applications of GSS: (1) detecting prompt ambiguity and predicting clarification questions for better grounding, (2) interpreting overthinking and underthinking in reasoning models, and (3) steering models to expand their generation space to yield high-quality and diverse outputs.
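One way to picture "effective generation space size" is to count semantically distinct outputs among repeated samples. The sketch below is not the paper's metric (they find EigenScore, computed from model internals, works better than such surface heuristics); the embedding stub and the 0.8 cosine threshold are assumptions.

```python
# Heuristic illustration of GSS: count semantically distinct samples.
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stub: identical strings map to identical vectors, different strings to
    # (almost surely) dissimilar random ones. Swap in a real sentence encoder.
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "little")
    return np.random.default_rng(seed).normal(size=16)

def estimate_gss(samples: list[str], threshold: float = 0.8) -> int:
    distinct: list[np.ndarray] = []
    for s in samples:
        v = embed(s)
        v /= np.linalg.norm(v)
        if all(float(v @ u) < threshold for u in distinct):
            distinct.append(v)   # a new semantically distinct output
    return len(distinct)

print(estimate_gss(["Paris", "Paris", "Lyon", "Paris"]))  # repeats collapse; expect 2
```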
[NLP-9] Who is a Better Matchmaker? Human vs. Algorithmic Judge Assignment in a High-Stakes Startup Competition
Quick Read: This paper addresses the judge-assignment problem in high-stakes settings: matching submissions to judges with suitable domain expertise. Manual assignment is slow and depends on individual experience, falling short of the quality and efficiency demands of large competitions. The key is the Hybrid Lexical-Semantic Similarity Ensemble (HLSE) algorithm, which combines lexical features with semantic understanding to automate high-quality matching. Across 309 judge-venture pairs, HLSE was statistically indistinguishable from human expert assignment (AUC = 0.48, p = 0.40), with average match ratings of 3.90 vs. 3.94 on a 5-point scale, while compressing a week-long manual process into a few hours, demonstrating human-expert-level matching quality with far better scalability and efficiency.
Link: https://arxiv.org/abs/2510.12692
Authors: Sarina Xi, Orelia Pi, Miaomiao Zhang, Becca Xiong, Jacqueline Ng Lane, Nihar B. Shah
Institutions: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 17 pages, 2 figures
Abstract:There is growing interest in applying artificial intelligence (AI) to automate and support complex decision-making tasks. However, it remains unclear how algorithms compare to human judgment in contexts requiring semantic understanding and domain expertise. We examine this in the context of the judge assignment problem, matching submissions to suitably qualified judges. Specifically, we tackled this problem at the Harvard President’s Innovation Challenge, the university’s premier venture competition awarding over $500,000 to student and alumni startups. This represents a real-world environment where high-quality judge assignment is essential. We developed an AI-based judge-assignment algorithm, Hybrid Lexical-Semantic Similarity Ensemble (HLSE), and deployed it at the competition. We then evaluated its performance against human expert assignments using blinded match-quality scores from judges on 309 judge-venture pairs. Using a Mann-Whitney U statistic based test, we found no statistically significant difference in assignment quality between the two approaches (AUC = 0.48, p = 0.40); on average, algorithmic matches are rated 3.90 and manual matches 3.94 on a 5-point scale, where 5 indicates an excellent match. Furthermore, manual assignments that previously required a full week could be automated in several hours by the algorithm during deployment. These results demonstrate that HLSE achieves human-expert-level matching quality while offering greater scalability and efficiency, underscoring the potential of AI-driven solutions to support and enhance human decision-making for judge assignment in high-stakes settings.
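The paper does not spell out the HLSE ensemble, but a hybrid lexical-semantic matcher can be sketched as follows: a TF-IDF lexical score blended with a semantic score, with equal weights as an assumed choice. To keep the example runnable without a real encoder, the semantic component reuses the lexical score as a stand-in.

```python
# Illustrative hybrid lexical-semantic matching; weights and components are
# assumptions, not the deployed HLSE system.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

judges = ["fintech and payments investor", "biotech researcher"]
venture = "a fintech startup building payments infrastructure for small merchants"

vec = TfidfVectorizer().fit(judges + [venture])
lexical = cosine_similarity(vec.transform([venture]), vec.transform(judges))[0]

semantic = lexical                       # stand-in for an embedding-based score
hybrid = 0.5 * lexical + 0.5 * semantic  # assumed equal weighting

best = max(range(len(judges)), key=lambda i: hybrid[i])
print(judges[best])                      # -> "fintech and payments investor"
```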
[NLP-10] Demystifying Hybrid Thinking: Can LLMs Truly Switch Between Think and No-Think?
Quick Read: This paper addresses incomplete mode control in current hybrid-thinking LLMs: reasoning-mode behavior often leaks into the no-think direct-answer mode, hurting efficiency and producing verbose outputs. The key lies in four training factors: (1) larger data scale; (2) using think and no-think answers drawn from different questions rather than paired answers to the same question; (3) a moderate increase in the amount of no-think data; and (4) a two-phase strategy that first trains reasoning ability and then applies hybrid-thinking training. The resulting recipe maintains accuracy in both modes while substantially shortening no-think outputs (from 1085 to 585 tokens on MATH500) and reducing occurrences of reasoning-supportive tokens such as "wait" (from 5917 to 522), improving the controllability and practicality of hybrid thinking.
Link: https://arxiv.org/abs/2510.12680
Authors: Shouren Wang, Wang Yang, Xianxuan Long, Qifan Wang, Vipin Chaudhary, Xiaotian Han
Institutions: Case Western Reserve University; Meta AI
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 10 pages, 6 figures
Abstract:Hybrid thinking enables LLMs to switch between reasoning and direct answering, offering a balance between efficiency and reasoning capability. Yet our experiments reveal that current hybrid thinking LLMs only achieve partial mode separation: reasoning behaviors often leak into the no-think mode. To understand and mitigate this, we analyze the factors influencing controllability and identify four that matter most: (1) larger data scale, (2) using think and no-think answers from different questions rather than the same question, (3) a moderate increase in no-think data number, and (4) a two-phase strategy that first trains reasoning ability and then applies hybrid think training. Building on these findings, we propose a practical recipe that, compared to standard training, can maintain accuracy in both modes while significantly reducing no-think output length (from 1085 to 585 on MATH500) and occurrences of reasoning-supportive tokens such as “wait” (from 5917 to 522 on MATH500). Our findings highlight the limitations of current hybrid thinking and offer directions for strengthening its controllability.
[NLP-11] he Role of Parametric Injection-A Systematic Study of Parametric Retrieval-Augmented Generation
Quick Read: This paper clarifies the poorly understood mechanism of parametric injection in parametric retrieval-augmented generation (PRAG), in particular how it shapes model-document interaction. PRAG encodes documents as model parameters (e.g., LoRA modules) injected at inference time, which is more efficient than placing documents in the input context. The key finding: parameterized documents capture only partial document semantics, and relying on them alone underperforms text-level interaction; however, they encode high-level document information that improves the model's understanding of documents in the input context. Jointly using parametric and textual documents therefore combines their strengths, letting the model exploit relevant information more effectively, improving robustness to noisy inputs, and outperforming either source alone.
Link: https://arxiv.org/abs/2510.12668
Authors: Minghao Tang, Shiyu Ni, Jingtong Wu, Zengxin Han, Keping Bi
Institutions: State Key Laboratory of AI Safety, ICT, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:
Abstract:Retrieval-augmented generation (RAG) enhances large language models (LLMs) by retrieving external documents. As an emerging form of RAG, parametric retrieval-augmented generation (PRAG) encodes documents as model parameters (i.e., LoRA modules) and injects these representations into the model during inference, enabling interaction between the LLM and documents at parametric level. Compared with directly placing documents in the input context, PRAG is more efficient and has the potential to offer deeper model-document interaction. Despite its growing attention, the mechanism underlying parametric injection remains poorly understood. In this work, we present a systematic study of PRAG to clarify the role of parametric injection, showing that parameterized documents capture only partial semantic information of documents, and relying on them alone yields inferior performance compared to interaction at text level. However, these parametric representations encode high-level document information that can enhance the model’s understanding of documents within the input context. When parameterized documents are combined with textual documents, the model can leverage relevant information more effectively and become more robust to noisy inputs, achieving better performance than either source alone. We recommend jointly using parameterized and textual documents and advocate for increasing the information content of parametric representations to advance PRAG.
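Parametric injection with LoRA modules can be pictured as merging a low-rank, per-document delta into a frozen weight at inference time. The sketch below assumes the rank, scale, and dimensions; it shows the mechanic, not the paper's implementation, and real PRAG systems train such modules per document.

```python
# Sketch of parametric injection: a per-document LoRA delta merged into a
# frozen linear weight. Rank (8), scale, and sizes are assumptions.
import torch

W = torch.randn(64, 64)           # frozen base weight of one linear layer
A = torch.randn(8, 64) * 0.01     # LoRA down-projection learned for a document
B = torch.randn(64, 8) * 0.01     # LoRA up-projection learned for the same document
scale = 2.0

W_injected = W + scale * (B @ A)  # document knowledge now lives in the weights
x = torch.randn(4, 64)            # some token activations
out = x @ W_injected.T            # the forward pass "sees" the document
print(out.shape)                  # torch.Size([4, 64])
```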
[NLP-12] Reasoning Pattern Matters: Learning to Reason without Human Rationales
Quick Read: This paper targets the prohibitive cost of human-annotated reasoning trajectories (rationales) for the supervised fine-tuning (SFT) stage of improving LLM reasoning. It identifies a class of "patterned reasoning tasks" in which instances vary in content but solutions follow a fixed, reusable reasoning pattern, and argues that the success of the SFT+RLVR paradigm on such tasks stems mainly from the model internalizing these patterns rather than from large quantities of high-quality human rationales. The key is Pattern-Aware LLMs as Rationale AnnOtators (PARO), which lets an LLM generate rationales aligned with task-specific reasoning patterns, replacing large-scale human annotation with only limited human supervision over the patterns themselves, at comparable performance and far lower cost.
Link: https://arxiv.org/abs/2510.12643
Authors: Chaoxu Pang, Yixuan Cao, Ping Luo
Institutions: Chinese Academy of Sciences
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Submitted to Frontiers of Computer Science
Abstract:Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities under the widely adopted SFT+RLVR paradigm, which first performs Supervised Fine-Tuning (SFT) on human-annotated reasoning trajectories (rationales) to establish initial reasoning behaviors, then applies Reinforcement Learning with Verifiable Rewards (RLVR) to optimize the model using verifiable signals without golden rationales. However, annotating high-quality rationales for the SFT stage remains prohibitively expensive. This paper investigates when and how rationale annotation costs can be substantially reduced without compromising reasoning performance. We identify a broad class of problems, termed patterned reasoning tasks, where reasoning follows a fixed, procedural strategy consistent across instances. Although instances vary in content such as domain knowledge, factual information, or numeric values, the solution derives from applying a shared reasoning pattern. We argue that the success of SFT+RLVR on such tasks primarily stems from its ability to enable models to internalize these reasoning patterns. Using numerical semantic matching as a representative task, we provide both causal and behavioral evidence showing that reasoning patterns rather than the quantity or quality of rationales are the key determinant of performance. Building on these insights, we propose Pattern-Aware LLMs as Rationale AnnOtators (PARO), a simple yet effective framework that enables LLMs to generate rationales aligned with task-specific reasoning patterns without requiring human rationale annotations. Experiments show that PARO-generated rationales achieve comparable SFT+RLVR performance to human rationales that are 10 times larger. These results suggest that large-scale human rationale annotations can be replaced with LLM-based automatic annotations requiring only limited human supervision over reasoning patterns.
[NLP-13] COSTAR-A: A prompting framework for enhancing Large Language Model performance on Point-of-View questions
Quick Read: This paper addresses the strong sensitivity of large language models (LLMs) to prompt design, focusing on how prompt engineering can make outputs more consistent and structured for smaller, locally deployed models in resource-constrained environments. The key is COSTAR-A, an improved prompting framework that extends COSTAR (Context, Objective, Style, Tone, Audience, Response) with an additional "Answer" component, sharpening instruction clarity and output constraints. Controlled prompt-output assessments with fine-tuned models of at most 8 billion parameters show that COSTAR-A can noticeably improve the output structure and decisiveness of localized LLMs on certain tasks; notably, Llama 3.1-8B improved with COSTAR-A compared to COSTAR alone, demonstrating the framework's adaptability and scalability.
Link: https://arxiv.org/abs/2510.12637
Authors: Nzubechukwu C. Ohalete, Kevin B. Gittner, Lauren M. Matheny (School of Data Science and Analytics, Kennesaw State University, GA, USA)
Institutions: unknown
Categories: Computation and Language (cs.CL)
Comments: 20 pages, 2 figures
Abstract:Large Language Models (LLMs) are highly sensitive to prompt design, and making optimized prompting techniques is crucial for generating consistent, high-quality outputs. In this study, we introduce COSTAR-A, a novel prompt engineering framework that enhances the existing COSTAR method, which stands for Context, Objective, Style, Tone, Audience, and Response, by adding the ‘Answer’ component at the end. We demonstrate that while the original COSTAR framework improves prompt clarity and aligns outputs for larger LLMs, its performance is less consistent with smaller, locally optimized models, particularly in tasks that require more directive or constrained outputs. Through a series of controlled prompt-output assessments with smaller (at most 8 billion parameters), fine-tuned models, we found that COSTAR-A can enhance the output structure and decisiveness of localized LLMs for certain tasks, although its effectiveness varies across models and use cases. Notably, the Llama 3.1-8B model exhibited performance improvements when prompted with COSTAR-A compared to COSTAR alone. These findings emphasize the adaptability and scalability of COSTAR-A as a prompting framework, particularly in computationally efficient AI deployments on resource-constrained hardware.
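To make the framework tangible, here is one possible COSTAR-A prompt following the component list in the abstract; the wording of each field is our own illustration, not taken from the paper.

```python
# Hypothetical COSTAR-A prompt: the six COSTAR fields plus the added ANSWER
# component that constrains the final output.
costar_a_prompt = """\
CONTEXT: You are assisting the support team of a mobile banking app.
OBJECTIVE: Decide whether the ticket below must be escalated to engineering.
STYLE: Concise and professional.
TONE: Neutral.
AUDIENCE: Support engineers triaging tickets.
RESPONSE: A single JSON object with keys "escalate" and "reason".
ANSWER: Output only the JSON object, with no extra commentary.

Ticket: "I was charged twice for the same transfer."
"""
print(costar_a_prompt)
```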
[NLP-14] ACADATA: Parallel Dataset of Academic Data for Machine Translation
Quick Read: This paper tackles low-quality academic translation and weak long-context consistency, particularly the accurate transfer of terminology and complex sentence structure in cross-lingual scholarly communication. The key is the ACADATA dataset, comprising a high-quality parallel training set (ACAD-TRAIN) and a curated evaluation benchmark (ACAD-BENCH); fine-tuning LLMs on ACAD-TRAIN markedly improves academic translation quality and long-context consistency. Experiments show the fine-tuned top model surpasses existing open-weight and proprietary systems on academic translation and improves long-context translation in a general domain by up to 24.9% when translating out of English.
Link: https://arxiv.org/abs/2510.12621
Authors: Iñaki Lacunza, Javier Garcia Gilabert, Francesca De Luca Fornaciari, Javier Aula-Blasco, Aitor Gonzalez-Agirre, Maite Melero, Marta Villegas
Institutions: unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:We present ACADATA, a high-quality parallel dataset for academic translation, that consists of two subsets: ACAD-TRAIN, which contains approximately 1.5 million author-generated paragraph pairs across 96 language directions and ACAD-BENCH, a curated evaluation set of almost 6,000 translations covering 12 directions. To validate its utility, we fine-tune two Large Language Models (LLMs) on ACAD-TRAIN and benchmark them on ACAD-BENCH against specialized machine-translation systems, general-purpose, open-weight LLMs, and several large-scale proprietary models. Experimental results demonstrate that fine-tuning on ACAD-TRAIN leads to improvements in academic translation quality by +6.1 and +12.4 d-BLEU points on average for 7B and 2B models respectively, while also improving long-context translation in a general domain by up to 24.9% when translating out of English. The fine-tuned top-performing model surpasses the best proprietary and open-weight models on the academic translation domain. By releasing ACAD-TRAIN, ACAD-BENCH and the fine-tuned models, we provide the community with a valuable resource to advance research in academic domain and long-context translation.
[NLP-15] StyleDecipher: Robust and Explainable Detection of LLM-Generated Texts with Stylistic Analysis
Quick Read: This paper addresses the difficulty of detecting LLM-generated text in open-domain writing in a way that is accurate, explainable, and generalizable. Existing methods rely on statistical discrepancies or model-specific heuristics and falter under stylistic diversity, mixed human-AI authorship, and adversarial perturbation. The key is StyleDecipher, which jointly models discrete stylistic indicators and continuous stylistic representations derived from semantic embeddings, quantifying style-level divergence between human and LLM outputs in a unified representation space. This enables accurate, explainable, domain-agnostic detection without access to model internals or labeled segments.
Link: https://arxiv.org/abs/2510.12608
Authors: Siyuan Li, Aodu Wulianghai, Xi Lin, Guangyan Li, Xiang Chen, Jun Wu, Jianhua Li
Institutions: Shanghai Jiao Tong University; Shanghai Key Laboratory of Integrated Administration Technologies for Information Security; Chinese Academy of Sciences; Institute of Automation; Zhejiang University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:With the increasing integration of large language models (LLMs) into open-domain writing, detecting machine-generated text has become a critical task for ensuring content authenticity and trust. Existing approaches rely on statistical discrepancies or model-specific heuristics to distinguish between LLM-generated and human-written text. However, these methods struggle in real-world scenarios due to limited generalization, vulnerability to paraphrasing, and lack of explainability, particularly when facing stylistic diversity or hybrid human-AI authorship. In this work, we propose StyleDecipher, a robust and explainable detection framework that revisits LLM-generated text detection using combined feature extractors to quantify stylistic differences. By jointly modeling discrete stylistic indicators and continuous stylistic representations derived from semantic embeddings, StyleDecipher captures distinctive style-level divergences between human and LLM outputs within a unified representation space. This framework enables accurate, explainable, and domain-agnostic detection without requiring access to model internals or labeled segments. Extensive experiments across five diverse domains, including news, code, essays, reviews, and academic abstracts, demonstrate that StyleDecipher consistently achieves state-of-the-art in-domain accuracy. Moreover, in cross-domain evaluations, it surpasses existing baselines by up to 36.30%, while maintaining robustness against adversarial perturbations and mixed human-AI content. Further qualitative and quantitative analysis confirms that stylistic signals provide explainable evidence for distinguishing machine-generated text. Our source code can be accessed at this https URL.
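A toy version of "discrete indicators plus continuous representation" might look like the following; the specific indicators and the placeholder embedding are assumptions, chosen only to show the two feature families being concatenated into one style vector.

```python
# Illustrative combination of discrete stylistic indicators with a continuous
# semantic embedding; not StyleDecipher's actual feature set.
import hashlib
import numpy as np

def discrete_style_features(text: str) -> np.ndarray:
    words = text.split()
    return np.array([
        len(words),                                                  # length
        len({w.lower() for w in words}) / max(len(words), 1),        # type-token ratio
        sum(text.count(p) for p in ",.;:!?") / max(len(words), 1),   # punctuation rate
    ])

def semantic_embedding(text: str) -> np.ndarray:
    # Deterministic stub; replace with a real sentence encoder.
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "little")
    return np.random.default_rng(seed).normal(size=8)

def style_vector(text: str) -> np.ndarray:
    return np.concatenate([discrete_style_features(text), semantic_embedding(text)])

print(style_vector("It was a bright cold day in April, and the clocks were striking thirteen."))
```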
[NLP-16] Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space
Quick Read: This paper addresses the high annotation cost and inference latency of current multimodal reasoning methods, which depend on manually annotated explicit reasoning steps over vision-text pairs. The key is Interleaved Vision-Text Latent Reasoning (IVT-LR), which reasons implicitly in latent space by fusing visual and textual information: each reasoning step combines two implicit parts, latent text (the hidden states from the previous step) and latent vision (a set of selected image embeddings). A progressive multi-stage training strategy enables MLLMs to perform such multimodal latent reasoning, yielding an average accuracy gain of 5.45% and a speedup of more than 5x over existing approaches.
Link: https://arxiv.org/abs/2510.12603
Authors: Chao Chen, Zhixin Ma, Yongqi Li, Yupeng Hu, Yinwei Wei, Wenjie Li, Liqiang Nie
Institutions: The Hong Kong Polytechnic University; Singapore Management University; Shandong University; Harbin Institute of Technology (Shenzhen)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Multimodal reasoning aims to enhance the capabilities of MLLMs by incorporating intermediate reasoning steps before reaching the final answer. It has evolved from text-only reasoning to the integration of visual information, enabling the thought process to be conveyed through both images and text. Despite its effectiveness, current multimodal reasoning methods depend on explicit reasoning steps that require labor-intensive vision-text annotations and inherently introduce significant inference latency. To address these issues, we introduce multimodal latent reasoning with the advantages of multimodal representation, reduced annotation, and inference efficiency. To facilitate it, we propose Interleaved Vision-Text Latent Reasoning (IVT-LR), which injects both visual and textual information in the reasoning process within the latent space. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: latent text (the hidden states from the previous step) and latent vision (a set of selected image embeddings). We further introduce a progressive multi-stage training strategy to enable MLLMs to perform the above multimodal latent reasoning steps. Experiments on M3CoT and ScienceQA demonstrate that our IVT-LR method achieves an average performance increase of 5.45% in accuracy, while simultaneously achieving a speed increase of over 5 times compared to existing approaches. Code available at this https URL.
[NLP-17] aching Language Models to Faithfully Express their Uncertainty
Quick Read: This paper addresses the faithfulness gap in how large language models (LLMs) express uncertainty: even when repeated queries produce divergent answers, outputs are typically unhedged or hedged in ways that do not reflect this variability, misleading users about the reliability of the model's knowledge. The key is Faithful Uncertainty Tuning (FUT): training data is built by augmenting an instruction-tuned LLM's own samples with uncertainty hedges (e.g., "possibly", "likely") aligned with sample consistency, requiring no supervision beyond the model and a set of prompts. FUT substantially narrows the faithfulness gap without altering the model's underlying answer distribution, preserves QA accuracy, and is robust across decoding strategies, hedge choices, and numerical expressions of uncertainty.
Link: https://arxiv.org/abs/2510.12587
Authors: Bryan Eikema, Evgenia Ilia, José G. C. de Souza, Chrysoula Zerva, Wilker Aziz
Institutions: University of Amsterdam; Outsystems; Instituto de Telecomunicações; Universidade de Lisboa
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Large language models (LLMs) often miscommunicate their uncertainty: repeated queries can produce divergent answers, yet generated responses are typically unhedged or hedged in ways that do not reflect this variability. This conveys unfaithful information about the uncertain state of the LLMs’ knowledge, creating a faithfulness gap that affects even strong LLMs. We introduce Faithful Uncertainty Tuning (FUT): a fine-tuning approach that teaches instruction-tuned LLMs to express uncertainty faithfully without altering their underlying answer distribution. We construct training data by augmenting model samples with uncertainty hedges (i.e. verbal cues such as ‘possibly’ or ‘likely’) aligned with sample consistency, requiring no supervision beyond the model and a set of prompts. We evaluate FUT on open-domain question answering (QA) across multiple models and datasets. Our results show that FUT substantially reduces the faithfulness gap, while preserving QA accuracy and introducing minimal semantic distribution shift. Further analyses demonstrate robustness across decoding strategies, choice of hedgers, and other forms of uncertainty expression (i.e. numerical). These findings establish FUT as a simple and effective way to teach LLMs to communicate uncertainty faithfully.
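FUT's data construction can be sketched in a few lines: sample several answers, measure their consistency, and prepend a matching hedge. The consistency thresholds and hedge words below are illustrative assumptions rather than the paper's exact mapping.

```python
# Hedged sketch of FUT-style training-target construction.
from collections import Counter

def hedge_for(consistency: float) -> str:
    if consistency >= 0.9:
        return ""                # high consistency: no hedge
    if consistency >= 0.6:
        return "Likely, "
    return "Possibly, "

def build_training_target(samples: list[str]) -> str:
    answer, count = Counter(samples).most_common(1)[0]  # majority answer
    consistency = count / len(samples)
    return hedge_for(consistency) + answer

samples = ["Canberra", "Canberra", "Sydney", "Canberra", "Canberra"]
print(build_training_target(samples))   # "Likely, Canberra" (consistency 0.8)
```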
[NLP-18] VISaGE: Understanding Visual Generics and Exceptions EMNLP2025
Quick Read: This paper examines how conceptual understanding in Vision Language Models (VLMs) is affected by a conflict between two priors when inputs are atypical: a pragmatic prior, learned from finetuning on congruent inputs, that the textual and visual inputs are both relevant; and a semantic prior that conceptual knowledge is generally true for instances of a category. When image and text are incongruent, these priors clash and performance degrades. To probe this trade-off, the authors introduce VISaGE, an evaluation dataset of typical and exceptional images; carefully balanced experiments show that when the congruency assumption is violated, the pragmatic prior affects performance more strongly than the semantic prior, revealing VLMs' sensitivity to input congruency when reasoning about individual instances.
Link: https://arxiv.org/abs/2510.12548
Authors: Stella Frank, Emily Allaway
Institutions: University of Copenhagen; University of Edinburgh
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: EMNLP 2025
Abstract:While Vision Language Models (VLMs) learn conceptual representations, in the form of generalized knowledge, during training, they are typically used to analyze individual instances. When evaluation instances are atypical, this paradigm results in tension between two priors in the model. The first is a pragmatic prior that the textual and visual input are both relevant, arising from VLM finetuning on congruent inputs; the second is a semantic prior that the conceptual representation is generally true for instances of the category. In order to understand how VLMs trade off these priors, we introduce a new evaluation dataset, VISaGE, consisting of both typical and exceptional images. In carefully balanced experiments, we show that conceptual understanding degrades when the assumption of congruency underlying the pragmatic prior is violated with incongruent images. This effect is stronger than the effect of the semantic prior when querying about individual instances.
[NLP-19] BoN Appetit Team at LeWiDi-2025: Best-of-N Test-time Scaling Can Not Stomach Annotation Disagreements (Yet)
Quick Read: This paper studies whether test-time scaling, techniques that improve LLM outputs with extra inference-time computation, transfers beyond domains with verifiably correct answers (such as mathematics and coding) to the LeWiDi-2025 tasks on annotation disagreement. The key is a systematic evaluation of three methods: two benchmark algorithms, Model Averaging and Majority Voting, and a Best-of-N sampling method. The two benchmark methods improve LLM performance consistently on the LeWiDi tasks, but Best-of-N does not transfer, and the paper analyzes potential reasons for this gap.
Link: https://arxiv.org/abs/2510.12516
Authors: Tomas Ruiz, Siyao Peng, Barbara Plank, Carsten Schwemmer
Institutions: Ludwig Maximilian University of Munich; Computational Social Sciences; MaiNLP & MCML
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Test-time scaling is a family of techniques to improve LLM outputs at inference time by performing extra computation. To the best of our knowledge, test-time scaling has been limited to domains with verifiably correct answers, like mathematics and coding. We transfer test-time scaling to the LeWiDi-2025 tasks to evaluate annotation disagreements. We experiment with three test-time scaling methods: two benchmark algorithms (Model Averaging and Majority Voting), and a Best-of-N sampling method. The two benchmark methods improve LLM performance consistently on the LeWiDi tasks, but the Best-of-N method does not. Our experiments suggest that the Best-of-N method does not currently transfer from mathematics to LeWiDi tasks, and we analyze potential reasons for this gap.
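For reference, generic Best-of-N sampling (the method that failed to transfer here) reduces to: draw N candidates, score each, keep the argmax. The generator and scorer below are placeholders for an LLM and a judge model, so only the control flow is meaningful.

```python
# Generic Best-of-N sampling sketch; helpers stand in for real model calls.
import random

def generate_candidate(prompt: str) -> str:
    return f"answer-{random.randint(0, 9)}"   # stand-in for LLM sampling

def score(prompt: str, candidate: str) -> float:
    return random.random()                    # stand-in for a judge/verifier

def best_of_n(prompt: str, n: int = 8) -> str:
    candidates = [generate_candidate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

print(best_of_n("How offensive is this tweet, on a 1-5 scale?"))
```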
[NLP-20] When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection
Quick Read: This paper targets the poor robustness of detectors on personalized machine-generated text (MGT), i.e., generations that imitate a specific individual's style, on which existing detectors degrade markedly. The key is \method, which identifies latent directions responsible for feature inversion and constructs probe datasets that differ primarily along those directions, thereby predicting both the direction and magnitude of detector performance changes in personalized settings; experiments show 85% correlation with the actual performance gaps.
Link: https://arxiv.org/abs/2510.12476
Authors: Lang Gao, Xuhui Li, Chenxi Wang, Mingzhe Li, Wei Liu, Zirui Song, Jinghui Zhang, Rui Yan, Preslav Nakov, Xiuying Chen
Institutions: MBZUAI; ByteDance; National University of Singapore; Wuhan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) have grown more powerful in language generation, producing fluent text and even imitating personal style. Yet, this ability also heightens the risk of identity impersonation. To the best of our knowledge, no prior work has examined personalized machine-generated text (MGT) detection. In this paper, we introduce \dataset, the first benchmark for evaluating detector robustness in personalized settings, built from literary and blog texts paired with their LLM-generated imitations. Our experimental results demonstrate large performance gaps across detectors in personalized settings: some state-of-the-art models suffer significant drops. We attribute this limitation to the feature-inversion trap, where features that are discriminative in general domains become inverted and misleading when applied to personalized text. Based on this finding, we propose \method, a simple and reliable way to predict detector performance changes in personalized settings. \method identifies latent directions corresponding to inverted features and constructs probe datasets that differ primarily along these features to evaluate detector dependence. Our experiments show that \method can accurately predict both the direction and the magnitude of post-transfer changes, showing 85% correlation with the actual performance gaps. We hope that this work will encourage further research on personalized text detection.
[NLP-21] SMEC: Rethinking Matryoshka Representation Learning for Retrieval Embedding Compression EMNLP2025
Quick Read: This paper addresses the computational and storage burden of the high-dimensional embeddings produced by large language models (LLMs). The key is Sequential Matryoshka Embedding Compression (SMEC), a training framework with three innovations: Sequential Matryoshka Representation Learning (SMRL) to mitigate gradient variance during training; an Adaptive Dimension Selection (ADS) module to reduce information degradation during dimension pruning; and a Selectable Cross-batch Memory (S-XBM) module to enhance unsupervised learning between high- and low-dimensional embeddings. On image, text, and multimodal datasets, SMEC achieves substantial dimensionality reduction while maintaining performance; on BEIR, 256-dimensional compressed LLM2Vec embeddings outperform Matryoshka-Adaptor and Search-Adaptor by 1.1 and 2.7 points, respectively.
Link: https://arxiv.org/abs/2510.12474
Authors: Biao Zhang, Lixin Chen, Tong Liu, Bo Zheng
Institutions: Taobao & Tmall Group, Alibaba
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted by EMNLP 2025
Abstract:Large language models (LLMs) generate high-dimensional embeddings that capture rich semantic and syntactic information. However, high-dimensional embeddings exacerbate computational complexity and storage requirements, thereby hindering practical deployment. To address these challenges, we propose a novel training framework named Sequential Matryoshka Embedding Compression (SMEC). This framework introduces the Sequential Matryoshka Representation Learning (SMRL) method to mitigate gradient variance during training, the Adaptive Dimension Selection (ADS) module to reduce information degradation during dimension pruning, and the Selectable Cross-batch Memory (S-XBM) module to enhance unsupervised learning between high- and low-dimensional embeddings. Experiments on image, text, and multimodal datasets demonstrate that SMEC achieves significant dimensionality reduction while maintaining performance. For instance, on the BEIR dataset, our approach improves the performance of compressed LLM2Vec embeddings (256 dimensions) by 1.1 points and 2.7 points compared to the Matryoshka-Adaptor and Search-Adaptor models, respectively.
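For background, the Matryoshka idea that SMEC builds on lets a single embedding be truncated to nested prefix dimensions. The renormalization below is a common convention for cosine-based retrieval rather than a detail taken from the paper.

```python
# Background sketch: truncating a Matryoshka-style embedding to nested dims.
import numpy as np

def truncate_embedding(v: np.ndarray, dim: int) -> np.ndarray:
    t = v[:dim]                    # keep the first `dim` coordinates
    return t / np.linalg.norm(t)   # renormalize for cosine retrieval

full = np.random.default_rng(0).normal(size=4096)
for d in (256, 512, 1024):
    print(d, truncate_embedding(full, d).shape)
```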
[NLP-22] Resource-sensitive but language-blind: Community size and not grammatical complexity better predicts the accuracy of Large Language Models in a novel Wug Test
Quick Read: This paper asks whether large language models (LLMs) approach human competence in morphological generalization, and whether performance is shaped more by linguistic structural complexity or by the amount of available training data. The key is a multilingual adaptation of the Wug Test applied to six models across four partially unrelated languages (Catalan, English, Greek, and Spanish), compared against human speakers. The models generalize morphological processes to unseen words with human-like accuracy, but accuracy patterns track speaker-community size and digital-resource availability rather than grammatical complexity, suggesting that model behavior is driven chiefly by resource richness and resembles human linguistic competence only superficially.
Link: https://arxiv.org/abs/2510.12463
Authors: Nikoleta Pantelidou, Evelina Leivada, Paolo Morosi
Institutions: unknown
Categories: Computation and Language (cs.CL)
Comments:
Abstract:The linguistic abilities of Large Language Models are a matter of ongoing debate. This study contributes to this discussion by investigating model performance in a morphological generalization task that involves novel words. Using a multilingual adaptation of the Wug Test, six models were tested across four partially unrelated languages (Catalan, English, Greek, and Spanish) and compared with human speakers. The aim is to determine whether model accuracy approximates human competence and whether it is shaped primarily by linguistic complexity or by the quantity of available training data. Consistent with previous research, the results show that the models are able to generalize morphological processes to unseen words with human-like accuracy. However, accuracy patterns align more closely with community size and data availability than with structural complexity, refining earlier claims in the literature. In particular, languages with larger speaker communities and stronger digital representation, such as Spanish and English, revealed higher accuracy than less-resourced ones like Catalan and Greek. Overall, our findings suggest that model behavior is mainly driven by the richness of linguistic resources rather than by sensitivity to grammatical complexity, reflecting a form of performance that resembles human linguistic competence only superficially.
[NLP-23] Probing Latent Knowledge Conflict for Faithful Retrieval-Augmented Generation
Quick Read: This paper addresses the pervasive unfaithfulness problem in retrieval-augmented generation (RAG), where model outputs contradict evidence in the retrieved context. Existing fixes are external interventions (prompt engineering, decoding constraints, reward-based fine-tuning) that treat the LLM as a black box and ignore how it internally integrates retrieved evidence with parametric memory under knowledge conflicts. Probing hidden-state representations yields three findings: knowledge integration is hierarchical, conflicts surface as sentence-level latent signals, and irrelevant context is amplified when it aligns with parametric knowledge. Building on these, CLEAR (i) decomposes context into fine-grained sentence-level knowledge, (ii) uses hidden-state probing to localize conflicting knowledge, and (iii) applies conflict-aware fine-tuning to guide accurate integration of retrieved evidence, substantially improving both accuracy and contextual faithfulness across benchmarks.
Link: https://arxiv.org/abs/2510.12460
Authors: Linfeng Gao, Baolong Bi, Zheng Yuan, Le Wang, Zerui Chen, Zhimin Wei, Shenghua Liu, Qinggang Zhang, Jinsong Su
Institutions: Xiamen University; University of Chinese Academy of Sciences; The Hong Kong Polytechnic University; Migu Meland Co., Ltd.
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm to enhance the factuality of Large Language Models (LLMs). However, existing RAG systems often suffer from an unfaithfulness issue, where the model’s response contradicts evidence from the retrieved context. Existing approaches to improving contextual faithfulness largely rely on external interventions, such as prompt engineering, decoding constraints, or reward-based fine-tuning. These works treat the LLM as a black box and overlook a crucial question: how does the LLM internally integrate retrieved evidence with its parametric memory, particularly under knowledge conflicts? To address this gap, we conduct a probing-based analysis of hidden-state representations in LLMs and observe three findings: knowledge integration occurs hierarchically, conflicts manifest as latent signals at the sentence level, and irrelevant context is often amplified when aligned with parametric knowledge. Building on these findings, we propose CLEAR (Conflict-Localized and Enhanced Attention for RAG), a framework that (i) decomposes context into fine-grained sentence-level knowledge, (ii) employs hidden-state probing to localize conflicting knowledge, and (iii) introduces conflict-aware fine-tuning to guide the model to accurately integrate retrieved evidence. Extensive experiments across three benchmarks demonstrate that CLEAR substantially improves both accuracy and contextual faithfulness, consistently outperforming strong baselines under diverse conflict conditions. The related resources are available at this https URL.
[NLP-24] PRoH: Dynamic Planning and Reasoning over Knowledge Hypergraphs for Retrieval-Augmented Generation
Quick Read: This paper addresses three limitations of existing knowledge-hypergraph (KH) based RAG methods on multi-hop question answering: static retrieval planning, non-adaptive retrieval execution, and superficial use of KH structure and semantics. The key is PRoH, a dynamic planning-and-reasoning framework with three innovations: (i) a context-aware planning module that sketches the local KH neighborhood to ground reasoning-plan generation in structure; (ii) structured question decomposition that organizes subquestions as a dynamically evolving directed acyclic graph (DAG) for adaptive, multi-trajectory exploration; and (iii) an Entity-Weighted Overlap (EWO) guided reasoning-path retrieval algorithm that prioritizes semantically coherent hyperedge traversals. PRoH surpasses the prior SOTA HyperGraphRAG by an average of 19.73% in F1 and 8.41% in Generation Evaluation (G-E) score while remaining robust on long-range multi-hop reasoning.
Link: https://arxiv.org/abs/2510.12434
Authors: Xiangjun Zai, Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Wenjie Zhang
Institutions: University of New South Wales; Data61, CSIRO
Categories: Computation and Language (cs.CL)
Comments:
Abstract:Knowledge Hypergraphs (KHs) have recently emerged as a knowledge representation for retrieval-augmented generation (RAG), offering a paradigm to model multi-entity relations into a structured form. However, existing KH-based RAG methods suffer from three major limitations: static retrieval planning, non-adaptive retrieval execution, and superficial use of KH structure and semantics, which constrain their ability to perform effective multi-hop question answering. To overcome these limitations, we propose PRoH, a dynamic Planning and Reasoning over Knowledge Hypergraphs framework. PRoH incorporates three core innovations: (i) a context-aware planning module that sketches the local KH neighborhood to guide structurally grounded reasoning plan generation; (ii) a structured question decomposition process that organizes subquestions as a dynamically evolving Directed Acyclic Graph (DAG) to enable adaptive, multi-trajectory exploration; and (iii) an Entity-Weighted Overlap (EWO)-guided reasoning path retrieval algorithm that prioritizes semantically coherent hyperedge traversals. Experiments across multiple domains demonstrate that PRoH achieves state-of-the-art performance, surpassing the prior SOTA model HyperGraphRAG by an average of 19.73% in F1 and 8.41% in Generation Evaluation (G-E) score, while maintaining strong robustness in long-range multi-hop reasoning tasks.
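The abstract does not define EWO precisely, but an entity-weighted overlap between a question's entities and a hyperedge's entities might be scored as in this sketch; the normalization and the weight values are assumptions.

```python
# Guess at an Entity-Weighted Overlap (EWO) style score; illustrative only.
def ewo(question_entities: set[str], hyperedge_entities: set[str],
        weights: dict[str, float]) -> float:
    overlap = question_entities & hyperedge_entities
    denom = sum(weights.get(e, 1.0) for e in question_entities)
    return sum(weights.get(e, 1.0) for e in overlap) / denom if denom else 0.0

q = {"aspirin", "stroke"}
edge = {"aspirin", "stroke", "dosage"}   # one multi-entity relation (hyperedge)
print(ewo(q, edge, weights={"aspirin": 2.0, "stroke": 1.0}))  # full overlap -> 1.0
```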
[NLP-25] okenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency
Quick Read: This paper quantifies the computational inequity caused by tokenization disparities across languages, which place an extra burden on low-resource and non-Latin-script languages. The key is a large-scale cross-linguistic evaluation covering more than 200 languages, with uniform preprocessing and normalization and tokenization via the tiktoken library, measured with Tokens Per Sentence (TPS) and Relative Tokenization Cost (RTC) against English baselines. Latin-script languages tokenize most efficiently, while non-Latin and morphologically complex languages suffer substantial token inflation, with RTC ratios often 3-5x higher; the authors argue for linguistically informed, typology-aware tokenization and vocabulary construction to enable fairer, more computationally equitable multilingual AI.
Link: https://arxiv.org/abs/2510.12389
Authors: Hailay Kidu Teklehaymanot, Wolfgang Nejdl
Institutions: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 6 pages, 4 figures
Abstract:Tokenization disparities pose a significant barrier to achieving equitable access to artificial intelligence across linguistically diverse populations. This study conducts a large-scale cross-linguistic evaluation of tokenization efficiency in over 200 languages to systematically quantify computational inequities in large language models (LLMs). Using a standardized experimental framework, we applied consistent preprocessing and normalization protocols, followed by uniform tokenization through the tiktoken library across all language samples. Comprehensive tokenization statistics were collected using established evaluation metrics, including Tokens Per Sentence (TPS) and Relative Tokenization Cost (RTC), benchmarked against English baselines. Our cross-linguistic analysis reveals substantial and systematic disparities: Latin-script languages consistently exhibit higher tokenization efficiency, while non-Latin and morphologically complex languages incur significantly greater token inflation, often 3-5 times higher RTC ratios. These inefficiencies translate into increased computational costs and reduced effective context utilization for underrepresented languages. Overall, the findings highlight structural inequities in current AI systems, where speakers of low-resource and non-Latin languages face disproportionate computational disadvantages. Future research should prioritize the development of linguistically informed tokenization strategies and adaptive vocabulary construction methods that incorporate typological diversity, ensuring more inclusive and computationally equitable multilingual AI systems.
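The two metrics are straightforward to reproduce with the tiktoken library named in the abstract. The choice of the cl100k_base encoding and the rough Amharic example sentence are our assumptions, not the study's exact setup.

```python
# Computing TPS and RTC with tiktoken; encoding choice is an assumption.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def tokens_per_sentence(sentences: list[str]) -> float:
    return sum(len(enc.encode(s)) for s in sentences) / len(sentences)

english = ["The cat sat on the mat."]
amharic = ["ድመቷ በምንጣፉ ላይ ተቀመጠች።"]   # roughly the same sentence in Amharic

tps_en = tokens_per_sentence(english)
tps_am = tokens_per_sentence(amharic)
rtc = tps_am / tps_en   # Relative Tokenization Cost vs. the English baseline
print(tps_en, tps_am, round(rtc, 2))
```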
[NLP-26] LLM -REVal: Can We Trust LLM Reviewers Yet?
Quick Read: This paper examines the fairness risks of deeply integrating large language models (LLMs) into research and peer review, in particular the inequities LLM reviewers may create for human authors and their work. The key is a simulation comprising a research agent (which writes and revises papers) and a review agent (which assesses submissions), validated by human annotation. It reveals two core biases in LLM reviewers: a linguistic feature bias that inflates scores for LLM-generated writing styles, and an aversion to critical statements that persistently underrates human-authored papers discussing risk or fairness, even after multiple revisions. These findings caution against introducing LLMs into the peer-review cycle without safeguards, while also showing that LLM-guided revisions do improve paper quality, with potential benefits for early-stage researchers and low-quality drafts.
Link: https://arxiv.org/abs/2510.12367
Authors: Rui Li, Jia-Chen Gu, Po-Nien Kung, Heming Xia, Junfeng Liu, Xiangwen Kong, Zhifang Sui, Nanyun Peng
Institutions: Peking University; University of California, Los Angeles; The Hong Kong Polytechnic University; StepFun AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:The rapid advancement of large language models (LLMs) has inspired researchers to integrate them extensively into the academic workflow, potentially reshaping how research is practiced and reviewed. While previous studies highlight the potential of LLMs in supporting research and peer review, their dual roles in the academic workflow and the complex interplay between research and review bring new risks that remain largely underexplored. In this study, we focus on how the deep integration of LLMs into both peer-review and research processes may influence scholarly fairness, examining the potential risks of using LLMs as reviewers by simulation. This simulation incorporates a research agent, which generates papers and revises, alongside a review agent, which assesses the submissions. Based on the simulation results, we conduct human annotations and identify pronounced misalignment between LLM-based reviews and human judgments: (1) LLM reviewers systematically inflate scores for LLM-authored papers, assigning them markedly higher scores than human-authored ones; (2) LLM reviewers persistently underrate human-authored papers with critical statements (e.g., risk, fairness), even after multiple revisions. Our analysis reveals that these stem from two primary biases in LLM reviewers: a linguistic feature bias favoring LLM-generated writing styles, and an aversion toward critical statements. These results highlight the risks and equity concerns posed to human authors and academic research if LLMs are deployed in the peer review cycle without adequate caution. On the other hand, revisions guided by LLM reviews yield quality gains in both LLM-based and human evaluations, illustrating the potential of the LLMs-as-reviewers for early-stage researchers and enhancing low-quality papers.
[NLP-27] MoBiLE: Efficient Mixture-of-Experts Inference on Consumer GPU with Mixture of Big Little Experts
Quick Read: This paper targets the bandwidth bottleneck of the CPU-GPU interconnect in offloading-based Mixture-of-Experts (MoE) inference. Existing prefetching approaches carry heavy training overhead and weaken on recent MoE models with fine-grained expert segmentation. The key is MoBiLE, a plug-and-play offloading framework with a mixture of big-little experts: unimportant tokens use half the experts for speed, while important tokens keep the full experts to preserve quality, with a dedicated fallback and prefetching mechanism for switching between little and big experts to improve memory efficiency. On four typical modern MoE architectures and challenging generative tasks, MoBiLE achieves a 1.60x to 1.72x speedup over the baseline on a consumer GPU system with negligible accuracy degradation.
Link: https://arxiv.org/abs/2510.12357
Authors: Yushu Zhao, Yubin Qin, Yang Wang, Xiaolong Yang, Huiming Han, Shaojun Wei, Yang Hu, Shouyi Yin
Institutions: unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted to ASP-DAC 2026
Abstract:Mixture-of-Experts (MoE) models have recently demonstrated exceptional performance across a diverse range of applications. The principle of sparse activation in MoE models facilitates an offloading strategy, wherein active experts are maintained in GPU HBM, while inactive experts are stored in CPU DRAM. The efficacy of this approach, however, is fundamentally constrained by the limited bandwidth of the CPU-GPU interconnect. To mitigate this bottleneck, existing approaches have employed prefetching to accelerate MoE inference. These methods attempt to predict and prefetch the required experts using specially trained modules. Nevertheless, such techniques are often encumbered by significant training overhead and have shown diminished effectiveness on recent MoE models with fine-grained expert segmentation. In this paper, we propose MoBiLE, a plug-and-play offloading-based MoE inference framework with mixture of big-little experts. It reduces the number of experts for unimportant tokens to half for acceleration while maintaining full experts for important tokens to guarantee model quality. Further, a dedicated fallback and prefetching mechanism is designed for switching between little and big experts to improve memory efficiency. We evaluate MoBiLE on four typical modern MoE architectures and challenging generative tasks. Our results show that MoBiLE achieves a speedup of 1.60x to 1.72x compared to the baseline on a consumer GPU system, with negligible degradation in accuracy.
[NLP-28] Fine-grained Analysis of Brain-LLM Alignment through Input Attribution
【Quick Read】: This paper tackles the problem of understanding, at a fine-grained level, the alignment between large language models (LLMs) and human brain activity, in particular the contested relationship between brain alignment (BA) and next-word prediction (NWP). The key to its solution is a fine-grained input attribution method that identifies the specific words contributing most to BA and to NWP respectively. Using this method, the authors find that BA relies mainly on semantic and discourse-level information with a more targeted recency effect, whereas NWP exhibits pronounced recency and primacy biases and focuses more on syntactic features; the two rely on largely distinct word subsets, revealing fundamental differences in the features LLMs draw on when modeling human language processing.
Link: https://arxiv.org/abs/2510.12355
Authors: Michela Proietti, Roberto Capobianco, Mariya Toneva
Affiliations: Sapienza University of Rome; Sony AI; Max Planck Institute for Software Systems
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Understanding the alignment between large language models (LLMs) and human brain activity can reveal computational principles underlying language processing. We introduce a fine-grained input attribution method to identify the specific words most important for brain-LLM alignment, and leverage it to study a contentious research question about brain-LLM alignment: the relationship between brain alignment (BA) and next-word prediction (NWP). Our findings reveal that BA and NWP rely on largely distinct word subsets: NWP exhibits recency and primacy biases with a focus on syntax, while BA prioritizes semantic and discourse-level information with a more targeted recency effect. This work advances our understanding of how LLMs relate to human language processing and highlights differences in feature reliance between BA and NWP. Beyond this study, our attribution method can be broadly applied to explore the cognitive relevance of model predictions in diverse language processing tasks.
[NLP-29] Simple Projection Variants Improve ColBERT Performance
【Quick Read】: This paper addresses the limitations of the single-layer linear projection used for dimensionality reduction in multi-vector dense retrieval methods such as ColBERT, whose gradient flow under the MaxSim operator is inherently constrained. The core of the solution is to replace the simple linear projection with better-designed feedforward networks (FFNs), specifically deeper non-linear FFN blocks, GLU blocks, and residual connections. Experiments show that improved projection blocks consistently lift downstream performance, with the best variants improving average NDCG@10 across retrieval benchmarks by over 2 points, and the effect holds across random seeds, validating the replacement as a robust, drop-in upgrade.
Link: https://arxiv.org/abs/2510.12327
Authors: Benjamin Clavié, Sean Lee, Rikiya Takehi, Aamir Shakir, Makoto P. Kato
Affiliations: Mixedbread AI; National Institute of Informatics (NII); Waseda University; University of Tsukuba
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Multi-vector dense retrieval methods like ColBERT systematically use a single-layer linear projection to reduce the dimensionality of individual vectors. In this study, we explore the implications of the MaxSim operator on the gradient flows of the training of multi-vector models and show that such a simple linear projection has inherent, if non-critical, limitations in this setting. We then discuss the theoretical improvements that could result from replacing this single-layer projection with well-studied alternative feedforward linear networks (FFN), such as deeper, non-linear FFN blocks, GLU blocks, and skip-connections, could alleviate these limitations. Through the design and systematic evaluation of alternate projection blocks, we show that better-designed final projections positively impact the downstream performance of ColBERT models. We highlight that many projection variants outperform the original linear projections, with the best-performing variants increasing average performance on a range of retrieval benchmarks across domains by over 2 NDCG@10 points. We then conduct further exploration on the individual parameters of these projections block in order to understand what drives this empirical performance, highlighting the particular importance of upscaled intermediate projections and residual connections. As part of these ablation studies, we show that numerous suboptimal projection variants still outperform the traditional single-layer projection across multiple benchmarks, confirming our hypothesis. Finally, we observe that this effect is consistent across random seeds, further confirming that replacing the linear layer of ColBERT models is a robust, drop-in upgrade.
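As a concrete illustration of the kind of projection variant discussed above, here is a PyTorch sketch of a GLU feed-forward block with an upscaled intermediate width and a skip-connection replacing ColBERT's single linear layer; the dimensions and exact wiring are illustrative assumptions, not the paper's definitive architecture.

```python
import torch
import torch.nn as nn

class GLUProjection(nn.Module):
    """Sketch of a GLU projection block with residual connection, standing in
    for ColBERT's single-layer linear projection. Widths are assumptions."""
    def __init__(self, hidden_dim=768, inter_dim=3072, out_dim=128):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, inter_dim)   # gating branch
        self.up = nn.Linear(hidden_dim, inter_dim)     # upscaled intermediate projection
        self.down = nn.Linear(inter_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, out_dim)      # final reduction to ColBERT's vector size

    def forward(self, token_embs):                     # (batch, seq, hidden_dim)
        h = self.down(torch.sigmoid(self.gate(token_embs)) * self.up(token_embs))
        h = h + token_embs                             # skip-connection around the FFN
        return self.out(h)                             # (batch, seq, out_dim)
```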
[NLP-30] Beating Harmful Stereotypes Through Facts: RAG-based Counter-speech Generation
【Quick Read】: This paper addresses the limited trustworthiness and scalability of current counter-speech generation: existing work treats it as pure text generation, relying on large language models (LLMs) or NGO experts, which yields poor factual reliability and coherence and scales badly. The key to the solution is a framework that models counter-speech generation as knowledge-wise text generation, integrating an advanced Retrieval-Augmented Generation (RAG) pipeline over a knowledge base of 32,792 documents built from the United Nations Digital Library, EUR-Lex, and the EU Agency for Fundamental Rights, so that trustworthy, targeted counter-speech can be produced for eight target groups identified in the hate speech literature (women, people of colour, persons with disabilities, migrants, Muslims, Jews, LGBT persons, and others). Empirical evaluation on MultiTarget-CONAN shows the framework outperforms standard LLM baselines and a competitive approach on both JudgeLM metrics and human evaluation, paving the way for trustworthy, sound counter-speech generation.
Link: https://arxiv.org/abs/2510.12316
Authors: Greta Damo, Elena Cabrio, Serena Villata
Affiliations: Université Côte d'Azur, CNRS, Inria, I3S, France
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Counter-speech generation is at the core of many expert activities, such as fact-checking and hate speech, to counter harmful content. Yet, existing work treats counter-speech generation as pure text generation task, mainly based on Large Language Models or NGO experts. These approaches show severe drawbacks due to the limited reliability and coherence in the generated countering text, and in scalability, respectively. To close this gap, we introduce a novel framework to model counter-speech generation as knowledge-wise text generation process. Our framework integrates advanced Retrieval-Augmented Generation (RAG) pipelines to ensure the generation of trustworthy counter-speech for 8 main target groups identified in the hate speech literature, including women, people of colour, persons with disabilities, migrants, Muslims, Jews, LGBT persons, and other. We built a knowledge base over the United Nations Digital Library, EUR-Lex and the EU Agency for Fundamental Rights, comprising a total of 32,792 texts. We use the MultiTarget-CONAN dataset to empirically assess the quality of the generated counter-speech, both through standard metrics (i.e., JudgeLM) and a human evaluation. Results show that our framework outperforms standard LLM baselines and competitive approach, on both assessments. The resulting framework and the knowledge base pave the way for studying trustworthy and sound counter-speech generation, in hate speech and beyond.
[NLP-31] A large-scale unsupervised pipeline for automatic corpus annotation using LLMs: variation and change in the English consider construction
【Quick Read】: This paper addresses the methodological bottleneck of manual grammatical annotation for very large corpora. The key to its solution is a scalable, unsupervised automated pipeline that uses large language models (LLMs) in a four-phase workflow: prompt engineering, pre-hoc evaluation, automated batch processing, and post-hoc validation. Applied to the Corpus of Historical American English (COHA), the pipeline annotated 143,933 sentences in under 60 hours with over 98% accuracy on two sophisticated annotation procedures, dramatically reducing human intervention while improving annotation efficiency.
Link: https://arxiv.org/abs/2510.12306
Authors: Cameron Morin, Matti Marttinen Larsson
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:As natural language corpora expand at an unprecedented rate, manual annotation remains a significant methodological bottleneck in corpus linguistic work. We address this challenge by presenting a scalable, unsupervised pipeline for automating grammatical annotation in voluminous corpora using large language models (LLMs). Unlike previous supervised and iterative approaches, our method employs a four-phase workflow: prompt engineering, pre-hoc evaluation, automated batch processing, and post-hoc validation. We demonstrate the pipeline’s accessibility and effectiveness through a diachronic case study of variation in the English consider construction. Using GPT-5 through the OpenAI API, we annotate 143,933 sentences from the Corpus of Historical American English (COHA) in under 60 hours, achieving 98%+ accuracy on two sophisticated annotation procedures. Our results suggest that LLMs can perform a range of data preparation tasks at scale with minimal human intervention, opening new possibilities for corpus-based research, though implementation requires attention to costs, licensing, and other ethical considerations.
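The batch-annotation phase lends itself to a compact sketch. The snippet below uses the OpenAI Python client as the abstract's pipeline does; the prompt wording, the label set (BARE/TO-BE/AS), and the exact model string are hypothetical placeholders, not the paper's actual prompt or procedure.

```python
# Compressed sketch of the automated annotation step (phase 3 of 4).
# Prompt, labels, and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()
PROMPT = "Label the 'consider' construction in this sentence as BARE, TO-BE, or AS: {sent}"

def annotate(sentences, model="gpt-5"):
    labels = []
    for sent in sentences:  # the paper batches via the API; a plain loop keeps the sketch short
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(sent=sent)}],
        )
        labels.append(resp.choices[0].message.content.strip())
    return labels

# Post-hoc validation (phase 4): manually check a random sample of (sentence, label) pairs.
```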
[NLP-32] Vision Language Models Map Logos to Text via Semantic Entanglement in the Visual Projector
【Quick Read】: This paper investigates a specific, previously overlooked hallucination in Vision Language Models (VLMs): logo hallucination, where models generate brand names or textual content for purely symbolic logos that contain no visible words. The study shows this failure is not accidental but stems from over-reliance on symbolic priors rather than genuine glyph perception, and is especially pronounced for iconic circular logos. The key finding is that the failure can be localized in the projector that maps images to text: embedding-level analysis reveals that only a small subset of projector dimensions is tied to the hallucination, and targeted ablation of those dimensions substantially reduces errors while preserving OCR accuracy. This points to projector disentanglement and OCR-guided decoding as promising directions for more trustworthy multimodal systems.
Link: https://arxiv.org/abs/2510.12287
Authors: Sifan Li, Hongkai Chen, Yujun Cai, Qingwen Ye, Liyang Chen, Junsong Yuan, Yiwei Wang
Affiliations: University of California, Merced; vivo Mobile Communication Co., Ltd.; University of Queensland; UCLA; University at Buffalo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments:
Abstract:Vision Language Models (VLMs) have achieved impressive progress in multimodal reasoning; yet, they remain vulnerable to hallucinations, where outputs are not grounded in visual evidence. In this paper, we investigate a previously overlooked setting: logo hallucination, where models generate brand names or textual content despite logos containing no visible words. Using curated splits of pure symbols, hybrids, and text-bearing logos, as well as the challenging Hard-60 subset, we systematically measure hallucination across leading VLMs. We further probe robustness through nine structured perturbations and show that hallucinations persist even under strong distortions, with occlusion exposing the sharpest weaknesses. Embedding-level analysis with open-weight LLaVA demonstrates that hallucination is tied to a small subset of projector dimensions, and targeted ablation substantially reduces errors while preserving OCR accuracy. Together, these findings reveal that VLMs often rely on symbolic priors rather than genuine glyph perception, particularly for iconic circular logos, and that projector subspaces play a decisive role in this failure mode. Our work contributes both a novel diagnostic lens and actionable mitigation insights, highlighting projector disentanglement and OCR-guided decoding as promising directions for building more trustworthy multimodal systems.
[NLP-33] Chinese ModernBERT with Whole-Word Masking
【Quick Read】: This paper targets the pretraining bottleneck for Chinese, whose tokenization and morphology differ markedly from English: how to improve a model's grasp of Chinese semantic structure while keeping computation efficient. The key elements are: (1) a hardware-aware 32k BPE vocabulary tailored to frequent Chinese affixes and compounds, lowering the embedding budget; (2) whole-word masking (WWM) with a dynamic masking curriculum (30% to 15%) that matches task difficulty to training progress; (3) a two-stage pre-training pipeline that extends the context from 1,024 to 8,192 tokens using RoPE and alternating local/global attention; and (4) a damped-cosine learning-rate schedule for stable long-horizon optimization. Together these changes yield Pareto improvements in accuracy, speed, and memory efficiency for Chinese encoder-only Transformers.
Link: https://arxiv.org/abs/2510.12285
Authors: Zeyu Zhao, Ningtao Wang, Xing Fu, Yu Cheng
Affiliations: Ant Group
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Encoder-only Transformers have advanced along three axes – architecture, data, and systems – yielding Pareto gains in accuracy, speed, and memory efficiency. Yet these improvements have not fully transferred to Chinese, where tokenization and morphology differ markedly from English. We introduce Chinese ModernBERT, a from-scratch Chinese encoder that couples: (i) a hardware-aware 32k BPE vocabulary tailored to frequent Chinese affixes/compounds, lowering the embedding budget; (ii) whole-word masking (WWM) with a dynamic masking curriculum (30% - 15%) to align task difficulty with training progress; (iii) a two-stage pre-training pipeline that extends the native context from 1,024 to 8,192 tokens using RoPE and alternating local/global attention; and (iv) a damped-cosine learning-rate schedule for stable long-horizon optimization. We pre-train on ~1.2T Chinese tokens from CCI3-HQ, CCI4 (Chinese), and Cosmopedia-Chinese. On CLUE, Chinese ModernBERT is competitive with strong Chinese encoders under a unified fine-tuning protocol. Under bf16 it achieves high long-sequence throughput while maintaining strong short-sequence speed, reflecting benefits from budget allocation and attention design. To probe retrieval-oriented quality, we add a small amount of open contrastive data: fine-tuning on SimCLUE (~3M pairs) improves further when adding T2Ranking (~2M), reaching 0.505 (Pearson) / 0.537 (Spearman) on the SimCLUE test set. Under this open-data setting, Chinese ModernBERT surpasses Qwen-0.6B-embedding on SimCLUE, suggesting a clear scaling path for STS with additional curated pairs. We will release tokenizer and weights to facilitate reproducible research.
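A minimal sketch of the dynamic masking curriculum: the 30% and 15% endpoints come from the abstract, while the linear decay over training steps is an assumption (the paper may use a different schedule shape).

```python
# Sketch of a dynamic whole-word-masking rate schedule (30% -> 15%).
# Linear interpolation is an assumption; only the endpoints are from the abstract.
def masking_rate(step, total_steps, start=0.30, end=0.15):
    frac = min(step / total_steps, 1.0)
    return start + (end - start) * frac

for step in (0, 50_000, 100_000):
    print(step, round(masking_rate(step, 100_000), 3))  # 0.3, 0.225, 0.15
```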
[NLP-34] Shallow Robustness, Deep Vulnerabilities: Multi-Turn Evaluation of Medical LLMs NEURIPS2025 ALT
【Quick Read】: This paper addresses the poorly understood reliability of large language models (LLMs) in realistic multi-turn clinical interactions: existing evaluation frameworks focus on idealized single-turn question answering and ignore the conflicting input, misleading context, and authority influence common in medical consultations. The key is the MedQA-Followup framework for systematically measuring multi-turn robustness in medical QA. It distinguishes shallow robustness (resisting misleading initial context) from deep robustness (maintaining accuracy when answers are challenged across turns), and introduces an indirect-direct axis separating contextual framing (indirect) from explicit suggestion (direct). Controlled interventions on the MedQA dataset show that mainstream LLMs degrade severely in multi-turn settings, with indirect, context-based interventions often causing the largest accuracy drops, exposing a significant vulnerability for clinical deployment.
Link: https://arxiv.org/abs/2510.12255
Authors: Blazej Manczak, Eric Lin, Francisco Eiras, James O'Neill, Vaikkunth Mugunthan
Affiliations: Dynamo AI; Intercom
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Dataset and code: this https URL ; this https URL Accepted as a poster at NeurIPS 2025 Workshop on GenAI for Health: Potential, Trust, and Policy Compliance
Abstract:Large language models (LLMs) are rapidly transitioning into medical clinical use, yet their reliability under realistic, multi-turn interactions remains poorly understood. Existing evaluation frameworks typically assess single-turn question answering under idealized conditions, overlooking the complexities of medical consultations where conflicting input, misleading context, and authority influence are common. We introduce MedQA-Followup, a framework for systematically evaluating multi-turn robustness in medical question answering. Our approach distinguishes between shallow robustness (resisting misleading initial context) and deep robustness (maintaining accuracy when answers are challenged across turns), while also introducing an indirect-direct axis that separates contextual framing (indirect) from explicit suggestion (direct). Using controlled interventions on the MedQA dataset, we evaluate five state-of-the-art LLMs and find that while models perform reasonably well under shallow perturbations, they exhibit severe vulnerabilities in multi-turn settings, with accuracy dropping from 91.2% to as low as 13.5% for Claude Sonnet 4. Counterintuitively, indirect, context-based interventions are often more harmful than direct suggestions, yielding larger accuracy drops across models and exposing a significant vulnerability for clinical deployment. Further compounding analyses reveal model differences, with some showing additional performance drops under repeated interventions while others partially recovering or even improving. These findings highlight multi-turn robustness as a critical but underexplored dimension for safe and reliable deployment of medical LLMs.
[NLP-35] DSAS: A Universal Plug-and-Play Framework for Attention Optimization in Multi-Document Question Answering NEURIPS2025
【Quick Read】: This paper addresses two core challenges LLMs face in multi-document question answering (Multi-doc QA): long-range dependency modeling, where LLMs struggle to focus on key information in long texts and important semantic connections are weakened; and the "lost-in-the-middle" problem, where models handle the middle of long inputs poorly. The key lies in the two modules of the proposed Dual-Stage Adaptive Sharpening (DSAS): (i) a Contextual Gate Weighting (CGW) module that assesses paragraph relevance via layer-wise attention tracking and position-aware weighting, alleviating "lost-in-the-middle"; and (ii) a Reciprocal Attention Suppression (RAS) module that suppresses information exchange between key and irrelevant texts to sharpen focus on critical paragraphs and strengthen long-range dependency modeling. DSAS is plug-and-play, requiring no architectural changes or extra training parameters, and improves Multi-doc QA performance across mainstream LLMs by an average of 4.2 F1 points on Llama-3.1-8B-Instruct and Qwen2.5-14B-Instruct.
Link: https://arxiv.org/abs/2510.12251
Authors: Jiakai Li, Rongzheng Wang, Yizhuo Ma, Shuang Liang, Guangchun Luo, Ke Qin
Affiliations: Institute of Intelligent Computing, University of Electronic Science and Technology of China; School of Information and Software Engineering, University of Electronic Science and Technology of China
Subjects: Computation and Language (cs.CL)
Comments: 27 pages; accepted by NeurIPS 2025
Abstract:While large language models (LLMs) show considerable promise across various fields, they have notable limitations in handling multi-document question answering (Multi-doc QA) tasks. The first challenge is long-range dependency modeling, where LLMs struggle to focus on key information in long texts, which weakens important semantic connections. Second, most LLMs suffer from the ‘‘lost-in-the-middle’’ issue, where they have difficulty processing information in the middle of long inputs. Current solutions either truncate global dependencies or demand costly finetuning, ultimately lacking a universal and simple solution for these challenges. To resolve these limitations, we propose Dual-Stage Adaptive Sharpening (DSAS) containing two modules. (i) The Contextual Gate Weighting (CGW) module alleviates ‘‘lost-in-the-middle’’ by assessing paragraph relevance through layer-wise attention tracking and position-aware weighting. (ii) The Reciprocal Attention Suppression (RAS) module enhances focus on critical paragraphs by suppressing information exchange between key and irrelevant texts, thus mitigating the limitations in long-range dependency modeling. Notably, DSAS functions as a plug-and-play solution requiring no architectural modifications or extra training parameters. Extensive experiments on four benchmarks demonstrate DSAS’s efficacy across mainstream LLMs (Llama, Qwen, Mistral, and Deepseek), with an average F1-score improvement of 4.2% in Multi-doc QA tasks on Llama-3.1-8B-Instruct and Qwen2.5-14B-Instruct. Ablation studies confirm the essential contributions of both the CGW and RAS modules. In addition, detailed discussions in the Appendix further validate the robustness and scalability of DSAS.
[NLP-36] Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability
【Quick Read】: This paper asks whether large language models (LLMs) internalize human-like moral biases (such as the Knobe effect, a moral bias in intentionality judgements) during finetuning, and through what mechanism and in which components the bias arises. The key to the solution is a Layer-Patching analysis across three open-weights LLMs that localizes the biased behavior to a specific set of layers, and shows that patching activations from the corresponding pretrained model into just a few critical layers suffices to eliminate the Knobe effect, demonstrating that social biases can be mitigated via targeted interventions without retraining.
Link: https://arxiv.org/abs/2510.12229
Authors: Bianca Raimondi, Daniela Dalbagno, Maurizio Gabbrielli
Affiliations: University of Florence; University of Padua
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint. Under review
Abstract:Large language models (LLMs) have been shown to internalize human-like biases during finetuning, yet the mechanisms by which these biases manifest remain unclear. In this work, we investigated whether the well-known Knobe effect, a moral bias in intentionality judgements, emerges in finetuned LLMs and whether it can be traced back to specific components of the model. We conducted a Layer-Patching analysis across 3 open-weights LLMs and demonstrated that the bias is not only learned during finetuning but also localized in a specific set of layers. Surprisingly, we found that patching activations from the corresponding pretrained model into just a few critical layers is sufficient to eliminate the effect. Our findings offer new evidence that social biases in LLMs can be interpreted, localized, and mitigated through targeted interventions, without the need for model retraining.
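Layer-Patching can be pictured with PyTorch forward hooks: capture one layer's hidden states from the pretrained model and substitute them into the finetuned model's forward pass. The module path `.model.layers` and the hidden-state indexing are assumptions that vary across architectures; this illustrates the idea, not the authors' exact code.

```python
import torch

def run_with_patched_layer(finetuned, pretrained, layer_idx, inputs):
    """Sketch of Layer-Patching, assuming HuggingFace-style decoder models."""
    with torch.no_grad():
        # hidden_states[0] is the embedding output, so layer layer_idx is at index layer_idx + 1
        patched = pretrained(**inputs, output_hidden_states=True).hidden_states[layer_idx + 1]

    def hook(module, args, output):
        # Decoder layers typically return a tuple whose first item is the hidden state.
        return (patched,) + output[1:] if isinstance(output, tuple) else patched

    handle = finetuned.model.layers[layer_idx].register_forward_hook(hook)
    try:
        return finetuned(**inputs)  # forward pass with the pretrained activations swapped in
    finally:
        handle.remove()
```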
[NLP-37] HALF: Harm-Aware LLM Fairness Evaluation Aligned with Deployment
【Quick Read】: This paper addresses the fact that current LLM fairness evaluations lack grounding in real-world scenarios and ignore differences in the severity of the harms that biases cause: a biased surgical decision should not be weighed the same as a stylistic bias in text summarization. The key is the HALF (Harm-Aware LLM Fairness) framework, which organizes nine application domains into three harm tiers (Severe, Moderate, Mild) via a five-stage pipeline and weighs fairness outcomes by harm severity, yielding a deployment-aligned, severity-aware fairness assessment.
Link: https://arxiv.org/abs/2510.12217
Authors: Ali Mekky, Omar El Herraoui, Preslav Nakov, Yuxia Wang
Affiliations: Mohamed bin Zayed University of Artificial Intelligence
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large language models (LLMs) are increasingly deployed across high-impact domains, from clinical decision support and legal analysis to hiring and education, making fairness and bias evaluation before deployment critical. However, existing evaluations lack grounding in real-world scenarios and do not account for differences in harm severity, e.g., a biased decision in surgery should not be weighed the same as a stylistic bias in text summarization. To address this gap, we introduce HALF (Harm-Aware LLM Fairness), a deployment-aligned framework that assesses model bias in realistic applications and weighs the outcomes by harm severity. HALF organizes nine application domains into three tiers (Severe, Moderate, Mild) using a five-stage pipeline. Our evaluation results across eight LLMs show that (1) LLMs are not consistently fair across domains, (2) model size or performance do not guarantee fairness, and (3) reasoning models perform better in medical decision support but worse in education. We conclude that HALF exposes a clear gap between previous benchmarking success and deployment readiness.
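Harm-aware weighting can be illustrated in a few lines: per-domain bias scores are aggregated with tier-dependent weights. The numeric weights below are invented for illustration; HALF's actual weighting scheme is defined in the paper.

```python
# Sketch of harm-severity-weighted aggregation in the spirit of HALF.
# Tier weights are illustrative assumptions, not the paper's values.
TIER_WEIGHTS = {"Severe": 3.0, "Moderate": 2.0, "Mild": 1.0}

def harm_aware_score(domain_bias_scores, domain_tiers):
    """Weighted average of per-domain bias scores, weighting by harm tier."""
    num = sum(TIER_WEIGHTS[domain_tiers[d]] * s for d, s in domain_bias_scores.items())
    den = sum(TIER_WEIGHTS[domain_tiers[d]] for d in domain_bias_scores)
    return num / den

# A biased severe-tier domain dominates the aggregate:
print(harm_aware_score({"surgery": 0.9, "summarization": 0.2},
                       {"surgery": "Severe", "summarization": "Mild"}))  # 0.725
```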
[NLP-38] HackWorld: Evaluating Computer-Use Agents on Exploiting Web Application Vulnerabilities
【Quick Read】: This paper addresses the underexplored capability of computer-use agents (CUAs) for web application security testing, especially discovering and exploiting vulnerabilities through graphical interfaces. Traditional penetration testing is costly and expertise-intensive and scales poorly with the growing web ecosystem, while modern web applications demand visual understanding, dynamic content handling, and multi-step interaction beyond what language-model agents alone can perform. The key is HackWorld, the first framework for systematically evaluating CUAs' ability to discover and exploit web application vulnerabilities via visual interaction: a Capture-the-Flag (CTF) setup over 36 real-world applications spanning 11 frameworks and 7 programming languages, covering realistic flaws such as injection vulnerabilities, authentication bypasses, and unsafe input handling. Evaluations show state-of-the-art CUAs achieve exploitation rates below 12% and exhibit low cybersecurity awareness, often failing at multi-step attack planning and misusing security tools, exposing current limits and motivating more security-aware agents.
Link: https://arxiv.org/abs/2510.12200
Authors: Xiaoxue Ren, Penghao Jiang, Kaixin Li, Zhiyong Huang, Xiaoning Du, Jiaojiao Jiang, Zhenchang Xing, Jiamou Sun, Terry Yue Zhuo
Affiliations: Zhejiang University; University of New South Wales; National University of Singapore; Monash University; CSIRO's Data61; Australian National University
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:
Abstract:Web applications are prime targets for cyberattacks as gateways to critical services and sensitive data. Traditional penetration testing is costly and expertise-intensive, making it difficult to scale with the growing web ecosystem. While language model agents show promise in cybersecurity, modern web applications demand visual understanding, dynamic content handling, and multi-step interactions that only computer-use agents (CUAs) can perform. Yet, their ability to discover and exploit vulnerabilities through graphical interfaces remains largely unexplored. We present HackWorld, the first framework for systematically evaluating CUAs’ capabilities to exploit web application vulnerabilities via visual interaction. Unlike sanitized benchmarks, HackWorld includes 36 real-world applications across 11 frameworks and 7 languages, featuring realistic flaws such as injection vulnerabilities, authentication bypasses, and unsafe input handling. Using a Capture-the-Flag (CTF) setup, it tests CUAs’ capacity to identify and exploit these weaknesses while navigating complex web interfaces. Evaluation of state-of-the-art CUAs reveals concerning trends: exploitation rates below 12% and low cybersecurity awareness. CUAs often fail at multi-step attack planning and misuse security tools. These results expose the current limitations of CUAs in web security contexts and highlight opportunities for developing more security-aware agents capable of effective vulnerability detection and exploitation.
[NLP-39] DPO-Tuned Large Language Models for Segmentation in Simultaneous Speech Translation
【Quick Read】: This paper addresses segmentation for simultaneous speech translation (SST): balancing translation quality against latency. Existing models such as SHAS improve on heuristic rules via pretrained segmentation, but remain constrained by supervised learning objectives and lack human preference alignment, which natural real-time interpretation requires. The key is to use large language models (LLMs) fine-tuned with Direct Preference Optimization (DPO) so the model learns natural, preference-aligned segmentation points suited to real-time settings. On the ACL 60/60 corpus across three language pairs (English to Japanese, Chinese, and German), with SeamlessM4T v2 as the translation backbone, the DPO-tuned LLM outperforms SHAS with consistent gains in translation quality (BLEU, COMET) and latency (Average Lagging).
Link: https://arxiv.org/abs/2510.12195
Authors: Zeyu Yang, Satoshi Nakamura
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Simultaneous speech translation requires accurate segmentation to balance translation quality and latency. Recent studies such as SHAS have introduced pretrained segmentation models, achieving stronger performance than heuristic rules. However, segmentation models such as SHAS, though pretrained and more robust than heuristic methods, are still constrained by supervised learning objectives and do not incorporate human preference alignment, which is crucial for natural real-time interpretation. In this work, we propose a segmentation framework based on large language models (LLMs) trained with Direct Preference Optimization (DPO). By leveraging preference alignment, our method enables LLMs to predict natural segmentation points that better meet the demands of real-time translation. We evaluate the system on the ACL 60/60 corpus across three language pairs (English-Japanese, Chinese, German), using SeamlessM4T v2 as the translation backbone. Experimental results show that our DPO-tuned LLM achieves higher segmentation accuracy than SHAS and yields consistent improvements in translation quality (BLEU, COMET) as well as latency (Average Lagging). Furthermore, our system benefits from IWSLT baselines for direct comparison. These findings highlight the potential of preference-tuned LLMs to surpass existing pretrained segmentation models and advance adaptive, human-aligned simultaneous interpretation.
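For reference, this is the standard DPO objective used for such preference tuning, in the general form due to Rafailov et al.; here y_w and y_l would be the preferred and dispreferred segmentations for input x, a reading of the abstract rather than a formula it states.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
```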
[NLP-40] Not in Sync: Unveiling Temporal Bias in Audio Chat Models
【Quick Read】: This paper addresses systematic temporal bias in Large Audio Language Models (LALMs) on event localization: predicted event times deviate from ground-truth timestamps in a consistent, quantifiable way (consistently early or late). The key is the Temporal Bias Index (TBI), a metric quantifying systematic misalignment in predicted event timings, complemented by a visualization framework that reveals how the bias is distributed. Controlled experiments on timestamped datasets show the bias is prevalent across datasets and models, accumulates with audio length (up to tens of seconds in extended recordings), and varies by event type and position, exposing a fundamental limitation in current LALMs' temporal perception and calling for temporally robust architectures.
Link: https://arxiv.org/abs/2510.12185
Authors: Jiayu Yao, Shenghua Liu, Yiwei Wang, Rundong Cheng, Lingrui Mei, Baolong Bi, Zhen Xiong, Xueqi Cheng
Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; University of California, Merced; Beijing University of Posts and Telecommunications; University of Southern California; Key Laboratory of Network Data Science and Technology, ICT, CAS; State Key Laboratory of AI Safety; University of Chinese Academy of Sciences
Subjects: Computation and Language (cs.CL); Sound (cs.SD)
Comments:
Abstract:Large Audio Language Models (LALMs) are increasingly applied to audio understanding and multimodal reasoning, yet their ability to locate when events occur remains underexplored. We present the first systematic study of temporal bias in LALMs, revealing a key limitation in their timestamp prediction. For example, when asked “At which second does the lecturer introduce the key formula?”, models often predict timestamps that are consistently earlier or later than the ground truth. Through controlled experiments on timestamped datasets, we find that temporal bias (i) is prevalent across datasets and models, (ii) increases with audio length - even accumulating to tens of seconds in extended recordings, and (iii) varies across event types and positions. We quantify this effect with the Temporal Bias Index (TBI), measuring systematic misalignment in predicted event timings, and complement it with a visualization framework. Our findings highlight a fundamental limitation in current LALMs and call for the development of temporally robust architectures.
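The abstract does not spell out the TBI formula, so the sketch below assumes the simplest reading: a mean signed offset between predicted and ground-truth event times, where a negative value indicates systematically early predictions.

```python
# Hedged sketch of a Temporal Bias Index; the paper's exact definition may differ.
def temporal_bias_index(pred_times, true_times):
    """Mean signed error (seconds): negative = early bias, positive = late bias."""
    assert len(pred_times) == len(true_times)
    return sum(p - t for p, t in zip(pred_times, true_times)) / len(pred_times)

print(temporal_bias_index([12.0, 33.5, 61.0], [15.0, 35.0, 70.0]))  # -4.5: early bias
```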
[NLP-41] From Knowledge to Treatment: Large Language Model Assisted Biomedical Concept Representation for Drug Repurposing EMNLP2025
【Quick Read】: This paper addresses the neglect, in existing drug repurposing work, of common-sense biomedical concept knowledge from real-world labs, in particular mechanistic priors indicating that certain drugs are fundamentally incompatible with specific treatments. The key is LLaDR, a Large Language Model-assisted framework for Drug Repurposing that improves the representation of biomedical concepts within knowledge graphs (KGs): it extracts semantically enriched, treatment-related textual representations of biomedical entities from large language models (LLMs) and uses them to fine-tune knowledge graph embedding (KGE) models, injecting treatment-relevant knowledge into the KGE and enhancing semantic understanding of under-studied or complex indications, with case studies on Alzheimer's disease confirming robustness.
Link: https://arxiv.org/abs/2510.12181
Authors: Chengrui Xiang, Tengfei Ma, Xiangzheng Fu, Yiping Liu, Bosheng Song, Xiangxiang Zeng
Affiliations: Hunan University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 16 pages, 4 figures, 13 tables. Accepted by EMNLP 2025 (Findings)
Abstract:Drug repurposing plays a critical role in accelerating treatment discovery, especially for complex and rare diseases. Biomedical knowledge graphs (KGs), which encode rich clinical associations, have been widely adopted to support this task. However, existing methods largely overlook common-sense biomedical concept knowledge in real-world labs, such as mechanistic priors indicating that certain drugs are fundamentally incompatible with specific treatments. To address this gap, we propose LLaDR, a Large Language Model-assisted framework for Drug Repurposing, which improves the representation of biomedical concepts within KGs. Specifically, we extract semantically enriched treatment-related textual representations of biomedical entities from large language models (LLMs) and use them to fine-tune knowledge graph embedding (KGE) models. By injecting treatment-relevant knowledge into KGE, LLaDR largely improves the representation of biomedical concepts, enhancing semantic understanding of under-studied or complex indications. Experiments based on benchmarks demonstrate that LLaDR achieves state-of-the-art performance across different scenarios, with case studies on Alzheimer’s disease further confirming its robustness and effectiveness. Code is available at this https URL.
[NLP-42] Evolution of Meta's LLaMA models and parameter-efficient fine-tuning of large language models: a survey
【Quick Read】: This review systematically surveys Meta AI's LLaMA (Large Language Model Meta AI) model family and the parameter-efficient fine-tuning (PEFT) methods developed for it, addressing the high computational and fine-tuning costs of deploying large language models in practice. The key lies in its analysis of five PEFT techniques, LoRA (Low-Rank Adaptation), LLaMA-Adapter V1 and V2, LLaMA-Excitor, and QLoRA (Quantized LoRA), which adapt large pretrained models by updating only a small subset of parameters, preserving performance while sharply reducing compute and storage, with demonstrated applications in instruction tuning, multimodal tasks, and specialized domains such as law and medicine.
Link: https://arxiv.org/abs/2510.12178
Authors: Abdulhady Abas Abdullah, Arkaitz Zubiaga, Seyedali Mirjalili, Amir H. Gandomi, Fatemeh Daneshfar, Mohammadsadra Amini, Alan Salam Mohammed, Hadi Veisi
Affiliations: University of Kurdistan Hewler; Queen Mary University; Torrens University Australia; University of Technology Sydney; University of Kurdistan Sanandaj, Iran; TU Dortmund University; University of Kurdistan Hewler; Tehran University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:This review surveys the rapid evolution of Meta AI's LLaMA (Large Language Model Meta AI) series, from LLaMA 1 through LLaMA 4, and the specialized parameter-efficient fine-tuning (PEFT) methods developed for these models. We first describe the LLaMA family of foundation models (7B-65B to 288B parameters), their architectures (including native multimodal and Mixture-of-Experts variants), and key performance characteristics. We then describe and discuss the concept of PEFT, which adapts large pre-trained models by updating only a small subset of parameters, and review five PEFT methods that have been applied to LLaMA: LoRA (Low-Rank Adaptation), LLaMA-Adapter V1 and V2, LLaMA-Excitor, and QLoRA (Quantized LoRA). We discuss each method's mechanism, parameter savings, and example application to LLaMA (e.g., instruction tuning, multimodal tasks). We provide structured discussion and analysis of model and adapter architectures, parameter counts, and benchmark results (including examples where fine-tuned LLaMA models outperform larger baselines). Finally, we examine real-world use cases where LLaMA-based models and PEFT have been successfully applied (e.g., legal and medical domains), and we discuss ongoing challenges and future research directions (such as scaling to even larger contexts and improving robustness). This survey paper provides a one-stop resource for ML researchers and practitioners interested in LLaMA models and efficient fine-tuning strategies.
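Of the five PEFT methods surveyed, LoRA is the easiest to show in code. Below is a minimal, self-contained PyTorch sketch of the standard low-rank update h = Wx + (alpha/r) * B(Ax); it illustrates the general technique, not a LLaMA-specific implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA: freeze the pretrained weight W and learn a low-rank update BA."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: identity at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

# Wrap any linear layer; only A and B (a tiny fraction of parameters) are trained.
layer = LoRALinear(nn.Linear(4096, 4096))
```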
[NLP-43] owards Inference-time Scaling for Continuous Space Reasoning
【Quick Read】: This paper investigates whether inference-time scaling techniques proven in discrete space, generating diverse reasoning paths via sampling and re-ranking them with a Process- or Outcome-Reward Model (PRM or ORM), can be adapted to reasoning in continuous space, using the COCONUT continuous space reasoning LM as the backbone. The key findings: dropout-based sampling can feasibly generate diverse reasoning paths in continuous space, and Pass@N analysis suggests potential gains akin to those in the discrete case; however, the current bottleneck is the absence of key inductive biases in continuous thought representations, which prevents PRMs and ORMs from effectively discriminating correct from incorrect reasoning and limits inference-time re-ranking. The paper therefore argues that training frameworks for continuous-space reasoning LMs should optimize not only for accuracy but also explicitly incorporate inductive biases that can be exploited at inference time to discriminate reasoning quality.
Link: https://arxiv.org/abs/2510.12167
Authors: Minghan Wang, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Inference-time scaling through multiple sample generation in combination with Process- or Outcome-Reward Model (PRM or ORM) re-ranking has proven effective for text-based reasoning in large language models. This paper investigates whether such established techniques can be successfully adapted to reasoning in the continuous space, using the COCONUT (Hao et al. 2024) continuous space reasoning LM as the backbone. We demonstrate the feasibility of generating diverse reasoning paths through dropout-based sampling. Our Pass@N analysis on the generated samples reveals the potential that could enable a significant gain in performance akin to the gain observed in the discrete space. However, we highlight unique challenges faced for materializing this gain in the continuous thought space. In particular, working recipes for data generation and training PRM and ORM models in the discrete space unlock only marginal improvements in the continuous space. Through probing various aspects including geometric properties and trajectory dynamics, we identify the underlying reasons that prevent effective discrimination between correct and incorrect reasoning (essential for the functioning of PRM and ORM). Our findings reveal that current limitations stem from the absence of key inductive biases in continuous thought representations. We argue that the training frameworks for continuous reasoning LMs require not only to optimize for accuracy but also to explicitly incorporate inductive biases that could be utilized during inference-time for discrimination of correct and incorrect thoughts. (Our code and data will be publicly available.)
[NLP-44] A Survey on Parallel Reasoning
【Quick Read】: This survey addresses the fragility and limited robustness of standard sequential Chain-of-Thought reasoning in large language models (LLMs), asking how to improve the stability and accuracy of reasoning. The key is the emerging paradigm of parallel reasoning, concurrently exploring multiple independent lines of thought before converging on a final answer, for which the paper gives a formal definition, distinguishes it from related concepts such as Chain-of-Thought, and proposes a novel taxonomy covering non-interactive reasoning, interactive reasoning, and efficiency-focused decoding strategies, providing a structured framework and practical guidance across application scenarios, along with core challenges and future directions.
Link: https://arxiv.org/abs/2510.12164
Authors: Ziqi Wang, Boye Niu, Zipeng Gao, Zhi Zheng, Tong Xu, Linghui Meng, Zhongli Li, Jing Liu, Yilong Chen, Chen Zhu, Hua Wu, Haifeng Wang, Enhong Chen
Affiliations: USTC; Baidu; USYD
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:With the increasing capabilities of Large Language Models (LLMs), parallel reasoning has emerged as a new inference paradigm that enhances reasoning robustness by concurrently exploring multiple lines of thought before converging on a final answer. It has become a significant trend to explore parallel reasoning to overcome the fragility of standard sequential methods and improve practical performance. In this paper, we aim to survey and summarize the progress and challenges of parallel reasoning. We first present a formal definition of parallel reasoning and clarify its distinction from related concepts like Chain-of-Thought. Then, we organize and discuss advanced techniques based on a novel taxonomy, including non-interactive reasoning, interactive reasoning, and efficiency-focused decoding strategies. Additionally, we explore various application scenarios, such as solving complex problems and enhancing the reliability of LLM outputs. Finally, we highlight the core challenges of parallel reasoning and suggest potential directions for future research. We hope that our work can provide a useful roadmap for beginners and encourage more research on improving parallel reasoning methods. Related resources are available at this https URL.
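The simplest non-interactive instance of parallel reasoning is self-consistency: sample several independent reasoning paths and majority-vote on the final answers. A minimal sketch, where `generate` stands in for any sampler returning a (reasoning, answer) pair; its interface is an assumption, not part of the survey.

```python
from collections import Counter

def self_consistency(generate, prompt, n=8):
    """Sample n independent reasoning paths and return the majority answer."""
    answers = [generate(prompt, temperature=0.8)[1] for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```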
[NLP-45] Credal Transformer: A Principled Approach for Quantifying and Mitigating Hallucinations in Large Language Models
【Quick Read】: This paper targets hallucination in large language models (LLMs), confident but factually incorrect assertions, which the authors attribute to "Artificial Certainty" induced by the Transformer's Softmax: it collapses ambiguous attention scores into a single probability distribution, discarding uncertainty information at each layer. The key is the proposed Credal Transformer, whose Credal Attention Mechanism (CAM), grounded in evidential theory, outputs a "credal set" (a set of distributions) instead of a single attention vector, with the set's size directly measuring model uncertainty. Attention scores are re-conceptualized as evidence masses of a Dirichlet distribution: sufficient evidence recovers standard attention, while insufficient evidence yields a diffuse distribution representing ambiguity, enabling the model to identify out-of-distribution inputs, quantify ambiguity, and abstain on unanswerable questions, significantly reducing confident errors.
Link: https://arxiv.org/abs/2510.12137
Authors: Shihao Ji, Zihui Song, Jiajie Huang
Affiliations: Zaozhuang No.28 Middle School; Tengzhou No.1 High School; Xi'an Jiaotong University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) hallucinate, generating factually incorrect yet confident assertions. We argue this stems from the Transformer’s Softmax function, which creates “Artificial Certainty” by collapsing ambiguous attention scores into a single probability distribution, discarding uncertainty information at each layer. To fix this, we introduce the Credal Transformer, which replaces standard attention with a Credal Attention Mechanism (CAM) based on evidential theory. CAM produces a “credal set” (a set of distributions) instead of a single attention vector, with the set’s size directly measuring model uncertainty. We implement this by re-conceptualizing attention scores as evidence masses for a Dirichlet distribution: sufficient evidence recovers standard attention, while insufficient evidence yields a diffuse distribution, representing ambiguity. Empirically, the Credal Transformer identifies out-of-distribution inputs, quantifies ambiguity, and significantly reduces confident errors on unanswerable questions by abstaining. Our contribution is a new architecture to mitigate hallucinations and a design paradigm that integrates uncertainty quantification directly into the model, providing a foundation for more reliable AI.
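A hedged sketch of the evidential reading described in the abstract: non-negative attention scores are treated as evidence masses of a Dirichlet distribution, whose total evidence determines how diffuse the credal set is. The softplus link and the K / sum(alpha) vacuity measure follow common evidential deep learning practice and are assumptions, not the paper's exact equations.

```python
import torch

def credal_attention_stats(scores):
    """From raw attention scores, derive Dirichlet evidence, the expected
    attention distribution, and a per-position uncertainty (vacuity) score."""
    evidence = torch.nn.functional.softplus(scores)        # (..., K) nonnegative evidence
    alpha = evidence + 1.0                                 # Dirichlet parameters
    expected_attn = alpha / alpha.sum(-1, keepdim=True)    # mean of the credal set
    uncertainty = scores.shape[-1] / alpha.sum(-1)         # high when evidence is scarce
    return expected_attn, uncertainty
```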
[NLP-46] SafeMT: Multi-turn Safety for Multimodal Language Models
【Quick Read】: This paper addresses the safety of multi-modal large language models (MLLMs) in multi-turn dialogue, noting that existing benchmarks inadequately cover such interaction even though it dominates everyday use. The key is the SafeMT benchmark and the Safety Index (SI): SafeMT contains 10,000 samples of dialogues of varying lengths generated from harmful queries with images, spanning 17 harm scenarios and 4 jailbreak methods, while SI quantifies a model's overall conversational safety. Evaluating 17 models shows attack success rises with the number of harmful-dialogue turns, indicating inadequate safety mechanisms; the authors further propose a dialogue safety moderator that detects malicious intent concealed in conversations and supplies MLLMs with relevant safety policies, reducing multi-turn attack success rates (ASR) more effectively than existing guard models.
Link: https://arxiv.org/abs/2510.12133
Authors: Han Zhu, Juntao Dai, Jiaming Ji, Haoran Li, Chengkun Cai, Pengcheng Wen, Chi-Min Chan, Boyuan Chen, Yaodong Yang, Sirui Han, Yike Guo
Affiliations: Hong Kong University of Science and Technology; Peking University; University of Edinburgh
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:With the widespread use of multi-modal Large Language models (MLLMs), safety issues have become a growing concern. Multi-turn dialogues, which are more common in everyday interactions, pose a greater risk than single prompts; however, existing benchmarks do not adequately consider this situation. To encourage the community to focus on the safety issues of these models in multi-turn dialogues, we introduce SafeMT, a benchmark that features dialogues of varying lengths generated from harmful queries accompanied by images. This benchmark consists of 10,000 samples in total, encompassing 17 different scenarios and four jailbreak methods. Additionally, we propose Safety Index (SI) to evaluate the general safety of MLLMs during conversations. We assess the safety of 17 models using this benchmark and discover that the risk of successful attacks on these models increases as the number of turns in harmful dialogues rises. This observation indicates that the safety mechanisms of these models are inadequate for recognizing the hazard in dialogue interactions. We propose a dialogue safety moderator capable of detecting malicious intent concealed within conversations and providing MLLMs with relevant safety policies. Experimental results from several open-source models indicate that this moderator is more effective in reducing multi-turn ASR compared to existing guard models.
[NLP-47] Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing
【Quick Read】: This paper addresses the difficulty of precisely controlling user-specified attribute intensities in LLM generation: existing alignment methods offer only directional or open-ended guidance and cannot reliably hit exact intensity levels. The key lies in three designs: (1) reformulating precise attribute intensity control as a target-reaching problem rather than simple maximization; (2) training a lightweight value function via temporal-difference learning to predict the final attribute intensity from partial generations and thereby steer LLM outputs; and (3) applying gradient-based interventions on hidden representations to navigate the model precisely to the target intensity. Together these enable fine-grained, continuous control over attribute intensities, moving beyond directional alignment.
Link: https://arxiv.org/abs/2510.12121
Authors: Rongzhi Zhang, Liqin Ye, Yuzhao Heng, Xiang Chen, Tong Yu, Lingkai Kong, Sudheer Chava, Chao Zhang
Affiliations: Georgia Institute of Technology; Adobe Research; Harvard University
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Precise attribute intensity control–generating Large Language Model (LLM) outputs with specific, user-defined attribute intensities–is crucial for AI systems adaptable to diverse user expectations. Current LLM alignment methods, however, typically provide only directional or open-ended guidance, failing to reliably achieve exact attribute intensities. We address this limitation with three key designs: (1) reformulating precise attribute intensity control as a target-reaching problem, rather than simple maximization; (2) training a lightweight value function via temporal-difference learning to predict final attribute intensity scores from partial generations, thereby steering LLM outputs; and (3) employing gradient-based interventions on hidden representations to navigate the model precisely towards specific attribute intensity targets. Our method enables fine-grained, continuous control over attribute intensities, moving beyond simple directional alignment. Experiments on LLaMA-3.2-3b and Phi-4-mini confirm our method’s ability to steer text generation to user-specified attribute intensities with high accuracy. Finally, we demonstrate efficiency enhancements across three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference. Our code is available on this https URL
[NLP-48] Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models EMNLP2025
【Quick Read】: This paper addresses the performance gap between speech and text inputs in end-to-end Large Speech Language Models (LSLMs), which the authors call the modality gap: although text-input performance drops somewhat after speech-text alignment training, the degradation on speech inputs is far larger, leaving semantic understanding weaker than in pipeline systems. The key is a multi-level analysis: at the coarse-grained level, speech and text representations in deeper layers increasingly align in direction (high cosine similarity) while diverging in magnitude (large Euclidean distance), and representation similarity correlates strongly with the modality gap; at the fine-grained level, a spontaneous token-level alignment pattern emerges, quantified by the proposed Alignment Path Score, which correlates with the gap even more strongly. Building on these insights, targeted interventions on critical tokens via angle projection and length normalization show potential to improve correctness for speech inputs, offering theoretical and methodological guidance for optimizing LSLMs.
Link: https://arxiv.org/abs/2510.12116
Authors: Bajian Xiang, Shuaijiang Zhao, Tingwei Guo, Wei Zou
Affiliations: Beike Inc.; Bairong Inc.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to EMNLP 2025 (Main Conference)
Abstract:End-to-end Large Speech Language Models (LSLMs) have demonstrated impressive conversational generation abilities, yet consistently fall short of traditional pipeline systems on semantic understanding benchmarks. In this work, we reveal through systematic experimentation that although LSLMs lose some text input performance after speech-text alignment training, the performance gap between speech and text inputs is more pronounced, which we refer to as the modality gap. To understand this gap, we analyze both coarse- and fine-grained text and speech representations. At the coarse-grained level, representations of speech and text in deeper layers are found to be increasingly aligned in direction (cosine similarity), while concurrently diverging in magnitude (Euclidean distance). We further find that representation similarity is strongly correlated with the modality gap. At the fine-grained level, a spontaneous token-level alignment pattern between text and speech representations is observed. Based on this, we introduce the Alignment Path Score to quantify token-level alignment quality, which exhibits stronger correlation with the modality gap. Building on these insights, we design targeted interventions on critical tokens through angle projection and length normalization. These strategies demonstrate the potential to improve correctness for speech inputs. Our study provides the first systematic empirical analysis of the modality gap and alignment mechanisms in LSLMs, offering both theoretical and methodological guidance for future optimization.
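The coarse-grained analysis reduces to two per-layer statistics. A minimal sketch, assuming paired speech/text hidden states for the same utterance and mean pooling (the paper's pooling choice may differ):

```python
import torch
import torch.nn.functional as F

def layerwise_gap(speech_hiddens, text_hiddens):
    """Per-layer cosine similarity and Euclidean distance between mean-pooled
    speech and text representations; one (seq_len, dim) tensor per layer."""
    stats = []
    for hs, ht in zip(speech_hiddens, text_hiddens):
        s, t = hs.mean(0), ht.mean(0)
        stats.append((F.cosine_similarity(s, t, dim=0).item(),
                      torch.dist(s, t).item()))
    # Per the paper, deeper layers show cosine rising while distance grows.
    return stats
```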
[NLP-49] racing Multilingual Knowledge Acquisition Dynamics in Domain Adaptation: A Case Study of English-Japanese Biomedical Adaptation
【Quick Read】: This paper addresses the unclear mechanisms of knowledge acquisition in multilingual domain adaptation (ML-DA), namely how domain knowledge is learned within a language and transferred across languages, a gap that leads to suboptimal performance especially in low-resource settings. Because prior ML-DA studies often train and evaluate on datasets with mismatched knowledge coverage, the key is AdaXEval, an adaptive evaluation method that builds multiple-choice QA datasets from the same bilingual domain corpus used for training, enabling direct measurement of multilingual knowledge acquisition. Through continual training with diverse data recipes, the authors track how LLMs extract and consolidate domain facts from raw training data and pinpoint the mechanism by which they become knowledge; experiments on a 13B English-Japanese bilingual LLM show cross-lingual transfer remains challenging even with a high-quality bilingual corpus.
Link: https://arxiv.org/abs/2510.12115
Authors: Xin Zhao, Naoki Yoshinaga, Yuma Tsuta, Akiko Aizawa
Affiliations: The University of Tokyo; National Institute of Informatics; Institute of Industrial Science, The University of Tokyo; Fixstars Corporation
Subjects: Computation and Language (cs.CL)
Comments: 22 pages, submitted to ARR 2025 Oct
Abstract:Multilingual domain adaptation (ML-DA) is widely used to learn new domain knowledge across languages into large language models (LLMs). Although many methods have been proposed to improve domain adaptation, the mechanisms of multilingual knowledge acquisition, how domain knowledge is learned within a language and transferred across languages, remain underexplored. This gap leads to suboptimal performance, particularly in low-resource settings. This work examines the learning dynamics of LLMs during ML-DA. Because prior ML-DA studies often train and evaluate on datasets with mismatched knowledge coverage, we propose AdaXEval, an adaptive evaluation method that builds multiple-choice QA datasets from the same bilingual domain corpus used for training, thereby directly studying multilingual knowledge acquisition. Through continual training of LLMs with diverse data recipes, we track how LLMs acquire domain facts and pinpoint the mechanism behind the transformation process from domain training data to knowledge. Our experiments on a 13B English-Japanese bilingual LLM reveal that cross-lingual transfer remains challenging despite a high-quality bilingual corpus. The code has been released.
[NLP-50] Deep Associations, High Creativity: A Simple yet Effective Metric for Evaluating Large Language Models
【Quick Read】: This paper addresses key challenges in evaluating LLM creativity, including data contamination and the high cost of human assessment. The core of the solution is PACE (Parallel Association Chains for Evaluation), which asks LLMs to generate parallel association chains to quantify their creativity; this minimizes the risk of training-data leakage while remaining efficient and scalable, as evidenced by PACE scores correlating strongly with Chatbot Arena Creative Writing rankings (Spearman's ρ = 0.739, p < 0.001). A comparison of associative creativity further shows that high-performing LLMs reach average human performance while professional humans consistently outperform LLMs.
Link: https://arxiv.org/abs/2510.12110
Authors: Ziliang Qiu, Renfen Hu
Affiliations: University of Illinois, Urbana-Champaign; Beijing Normal University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages
Abstract:The evaluation of LLMs' creativity represents a crucial research domain, though challenges such as data contamination and costly human assessments often impede progress. Drawing inspiration from human creativity assessment, we propose PACE, asking LLMs to generate Parallel Association Chains to Evaluate their creativity. PACE minimizes the risk of data contamination and offers a straightforward, highly efficient evaluation, as evidenced by its strong correlation with Chatbot Arena Creative Writing rankings (Spearman's ρ = 0.739, p < 0.001) across various proprietary and open-source models. A comparative analysis of associative creativity between LLMs and humans reveals that while high-performing LLMs achieve scores comparable to average human performance, professional humans consistently outperform LLMs. Furthermore, linguistic analysis reveals that both humans and LLMs exhibit a trend of decreasing concreteness in their associations, with humans demonstrating a greater diversity of associative patterns.
[NLP-51] One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration
【Quick Read】: This paper addresses how to automatically learn an executable programmatic world model from a single, unguided "one life" exploration of a complex, stochastic, hostile environment, where traditional methods assume determinism, abundant interaction data, and human guidance. The key is the OneLife framework, which models environment dynamics as conditionally-activated programmatic laws within a probabilistic programming paradigm: each law has a precondition-effect structure and activates only in relevant states, forming a dynamic computation graph that routes inference and optimization through relevant laws alone, avoiding the scaling bottleneck of letting every law contribute to every prediction and enabling the learning of stochastic dynamics even under sparse rule activation. On Crafter-OO, OneLife learns key environment dynamics from minimal, unguided interaction, outperforming a strong baseline on 16 of 23 scenarios.
Link: https://arxiv.org/abs/2510.12088
Authors: Zaid Khan, Archiki Prasad, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
Affiliations: UNC Chapel Hill
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Project page: this https URL 39 pages
Abstract:Symbolic world modeling requires inferring and representing an environment’s transitional dynamics as an executable program. Prior work has focused on largely deterministic environments with abundant interaction data, simple mechanics, and human guidance. We address a more realistic and challenging setting, learning in a complex, stochastic environment where the agent has only “one life” to explore a hostile environment without human guidance. We introduce OneLife, a framework that models world dynamics through conditionally-activated programmatic laws within a probabilistic programming framework. Each law operates through a precondition-effect structure, activating in relevant world states. This creates a dynamic computation graph that routes inference and optimization only through relevant laws, avoiding scaling challenges when all laws contribute to predictions about a complex, hierarchical state, and enabling the learning of stochastic dynamics even with sparse rule activation. To evaluate our approach under these demanding constraints, we introduce a new evaluation protocol that measures (a) state ranking, the ability to distinguish plausible future states from implausible ones, and (b) state fidelity, the ability to generate future states that closely resemble reality. We develop and evaluate our framework on Crafter-OO, our reimplementation of the Crafter environment that exposes a structured, object-oriented symbolic state and a pure transition function that operates on that state alone. OneLife can successfully learn key environment dynamics from minimal, unguided interaction, outperforming a strong baseline on 16 out of 23 scenarios tested. We also test OneLife’s planning ability, with simulated rollouts successfully identifying superior strategies. Our work establishes a foundation for autonomously constructing programmatic world models of unknown, complex environments.
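A precondition-effect law is easy to picture as a tiny data structure: it activates only when its precondition holds in the current state, and then applies a (possibly stochastic) effect. Field names and the toy example below are illustrative assumptions, not OneLife's actual schema.

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Law:
    """A conditionally-activated programmatic law (precondition-effect structure)."""
    name: str
    precondition: Callable[[dict], bool]
    effect: Callable[[dict], dict]
    prob: float = 1.0  # probability the effect applies when the law activates

def step(state, laws, rng):
    for law in laws:
        if law.precondition(state) and rng.random() < law.prob:
            state = law.effect(state)  # only relevant laws enter the computation graph
    return state

# Toy example: a sapling grows into a tree with 10% probability per step.
grow = Law("sapling_grows", lambda s: s.get("sapling", False),
           lambda s: {**s, "tree": True}, prob=0.1)
print(step({"sapling": True}, [grow], random.Random(0)))
```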
[NLP-52] An AI-Based Behavioral Health Safety Filter and Dataset for Identifying Mental Health Crises in Text-Based Conversations
【Quick Read】: This paper addresses the risk that large language models mishandle psychiatric emergencies, offering harmful or inappropriate advice that can enable destructive behaviors. It evaluates the Verily Behavioral Health Safety Filter (VBHSF), whose key property is rigorous validation on clinician-labelled datasets with high sensitivity alongside good specificity, minimizing missed mental-health crises. VBHSF outperforms two open-source content-moderation guardrails (NVIDIA NeMo Guardrails and OpenAI Omni Moderation Latest) on two independent datasets, with significantly higher sensitivity in all cases (all p < 0.001), a core advantage for a safety mechanism in healthcare settings.
Link: https://arxiv.org/abs/2510.12083
Authors: Benjamin W. Nelson, Celeste Wong, Matthew T. Silvestrini, Sooyoon Shin, Alanna Robinson, Jessica Lee, Eric Yang, John Torous, Andrew Trister
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Main Text: 2943; Abstract: 256; Tables and Figures: 5
Abstract:Large language models often mishandle psychiatric emergencies, offering harmful or inappropriate advice and enabling destructive behaviors. This study evaluated the Verily behavioral health safety filter (VBHSF) on two datasets: the Verily Mental Health Crisis Dataset containing 1,800 simulated messages and the NVIDIA Aegis AI Content Safety Dataset subsetted to 794 mental health-related messages. The two datasets were clinician-labelled and we evaluated performance using the clinician labels. Additionally, we carried out comparative performance analyses against two open source, content moderation guardrails: OpenAI Omni Moderation Latest and NVIDIA NeMo Guardrails. The VBHSF demonstrated well-balanced performance on the Verily Mental Health Crisis Dataset v1.0, achieving high sensitivity (0.990) and specificity (0.992) in detecting any mental health crises. It achieved an F1-score of 0.939, sensitivity ranged from 0.917 to 0.992, and specificity was >= 0.978 in identifying specific crisis categories. When evaluated against the NVIDIA Aegis AI Content Safety Dataset 2.0, VBHSF performance remained highly sensitive (0.982) and accurate (0.921), with reduced specificity (0.859). When compared with the NVIDIA NeMo and OpenAI Omni Moderation Latest guardrails, the VBHSF demonstrated superior performance metrics across both datasets, achieving significantly higher sensitivity in all cases (all p < 0.001) and higher specificity relative to NVIDIA NeMo (p < 0.001), but not to OpenAI Omni Moderation Latest (p = 0.094). NVIDIA NeMo and OpenAI Omni Moderation Latest exhibited inconsistent performance across specific crisis types, with sensitivity for some categories falling below 0.10. Overall, the VBHSF demonstrated robust, generalizable performance that prioritizes sensitivity to minimize missed crises, a crucial feature for healthcare applications.
[NLP-53] ThinkPilot: Steering Reasoning Models via Automated Think-prefixes Optimization
【Quick Read】: This paper addresses inefficient and off-target reasoning in Large Reasoning Models (LRMs), which often produce redundant or irrelevant reasoning paths that hurt both performance and safety. The key is ThinkPilot, a training-free framework that automatically generates "think-prefixes": instruction templates evolved through an evolutionary process driven by a taxonomy of reasoning behaviors, which steer a model's behavior distribution toward the demands of a given task. Experiments show ThinkPilot significantly improves the accuracy-length trade-off for efficient reasoning, drastically improves safety (e.g., cutting DeepSeek-R1-Distill-Qwen-32B's StrongREJECT score from 27.0% to 0.7%), enhances instruction following, and synergizes with existing training-based methods.
Link: https://arxiv.org/abs/2510.12063
Authors: Sunzhu Li, Zhiyu Lin, Shuling Yang, Jiale Zhao, Wei Chen
Affiliations: Li Auto Inc.; The Chinese University of Hong Kong, Shenzhen
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:Large Reasoning Models (LRMs) are powerful, but they still suffer from inefficient and off-target reasoning. Currently, training-free methods are limited to either rigid heuristics or descriptive, non-actionable analyses. In this paper, we introduce ThinkPilot, a training-free framework that automatically optimizes LRMs reasoning. It uses an evolutionary process to generate think-prefixes, which are instructions that evolve driven by a taxonomy of reasoning behaviors to guide models toward superior performance. Extensive experiments demonstrate ThinkPilot’s broad effectiveness: it significantly improves the accuracy-length trade-off for efficient reasoning, drastically improves safety (for example, cutting the StrongREJECT score of DeepSeek-R1-Distill-Qwen-32B from 27.0% to 0.7%), and enhances instruction following. It also synergizes with existing training-based methods. Our analysis reveals that think-prefixes can reliably control LRMs’ reasoning behaviors, and that different tasks have strong preferences for specific behavioral distributions. By automatically identifying and eliciting these behaviors, ThinkPilot provides a generalizable framework for aligning LRMs reasoning with task demands. Data and code are available at this https URL
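The evolutionary search over think-prefixes can be pictured with a small select-and-mutate loop. The `score` and `mutate` functions below are hypothetical stand-ins for the paper's benchmark-based fitness and taxonomy-guided rewriting, so this is a schematic sketch rather than ThinkPilot itself.

```python
import random
random.seed(0)

def evolve_prefixes(prefixes, score, mutate, generations=3, keep=2):
    """Keep the best prefixes each generation and expand them via mutation."""
    pop = list(prefixes)
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        parents = pop[:keep]                          # selection
        pop = parents + [mutate(p) for p in parents for _ in range(2)]
    return max(pop, key=score)

best = evolve_prefixes(
    ["Think step by step.", "Answer briefly, then verify once."],
    score=lambda p: len(p) % 7,                       # toy fitness for the demo
    mutate=lambda p: p + random.choice([" Check units.", " List assumptions first."]),
)
print(best)
```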
[NLP-54] APCE: Adaptive Progressive Context Expansion for Long Context Processing NEURIPS2025
[Quick Read]: This paper addresses two practical obstacles facing Long-Context Transformer Models (LCTMs): (1) a memory footprint that surges with sequence length, driven by quadratic self-attention and a linearly growing KV-cache; and (2) the "ContextRot" phenomenon, where Transformer performance degrades as context grows. The key to the solution is APCE (Adaptive Progressive Context Expansion), a context-aware method that selects the most informative input chunks by matching them against the current query via low-dimensional semantic similarity. Because it operates directly on the input, APCE does not depend on specific hardware or CUDA environments and scales across deployment systems. Using only 50%-70% of the original input sequence, it achieves summarization performance on par with or better than the full dense baseline while markedly improving KV-cache and self-attention memory efficiency.
Link: https://arxiv.org/abs/2510.12051
Authors: Baisub Lee, Sanghyun Byun, Mohanad Odema, Jung Guack, Jacob Song, Woo Seong Chung
Affiliations: LG Electronics USA
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2025 Workshop: ML For Systems
Abstract:Deploying useful Long-Context Transformer Models (LCTMs) requires addressing two key challenges: (1) a growing memory footprint due to quadratic self-attention and linear KV-cache scaling in memory as sequence length increases; (2) the ContextRot phenomenon, where empirical evidence suggests that the transformer architecture’s performance degrades with increasing context length. Given the shared dependency on the input, a natural question arises: Can we surgically select the most important input chunks for processing to synergistically (a) reduce the memory footprint, and (b) mitigate the ContextRot effects? In this paper, we answer this question in the affirmative for long-context summarization tasks. We propose APCE as a context-aware solution to select the most important input chunks through low-dimensional semantic similarity matching with the current query. By directly operating on the input, APCE decouples from strict dependency on underlying hardware or CUDA environments, promising a compatible solution scalable to different deployment systems. Our empirical evaluations have demonstrated superior or on-par summarization performance for APCE compared to the full dense baseline using a fraction (50%-70%) of the input sequence, resulting in KV-cache and self-attention memory efficiency improvements. We hope our findings inspire further research on context-aware efficiency solutions for LCTMs geared towards other relevant long-context tasks.
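The selection step at the heart of this idea, keeping only the input chunks most similar to the query, can be sketched in a few lines. Random vectors stand in for real sentence embeddings here, and `keep_frac` mirrors the 50%-70% input fraction reported above; this is an assumption-laden sketch, not the APCE implementation.

```python
import numpy as np

def select_chunks(chunks, query_vec, chunk_vecs, keep_frac=0.6):
    """Keep the keep_frac most query-similar chunks, preserving document order."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q                                  # cosine similarity per chunk
    k = max(1, int(len(chunks) * keep_frac))
    keep = np.sort(np.argsort(-sims)[:k])         # top-k, restored to reading order
    return [chunks[i] for i in keep]

rng = np.random.default_rng(0)                    # stand-ins for real embeddings
print(select_chunks([f"chunk {i}" for i in range(10)],
                    rng.normal(size=64), rng.normal(size=(10, 64))))
```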
[NLP-55] Hierarchical Alignment: Surgical Fine-Tuning via Functional Layer Specialization in Large Language Models
[Quick Read]: This paper takes aim at the "one-size-fits-all" character of current LLM alignment: methods such as Direct Preference Optimization (DPO) treat the model as a monolith and apply uniform optimization pressure to every layer, ignoring the Transformer's functional specialization (local layers handle syntax, intermediate layers logic, global layers factuality). The key to the solution is Hierarchical Alignment, which applies targeted DPO via LoRA to specific functional blocks: Local-Align improves grammatical fluency, while Global-Align not only strengthens factual consistency but also proves the most effective strategy for logical coherence. Crucially, all hierarchical strategies avoid the "alignment tax" common in standard DPO, where fluency gains come at the cost of degraded logical ability, charting a structure-aware, surgical path toward more efficient, controllable, and interpretable LLM alignment.
Link: https://arxiv.org/abs/2510.12044
Authors: Yukun Zhang, Qi Dong
Affiliations: The Chinese University of Hong Kong; Fudan University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Existing alignment techniques for Large Language Models (LLMs), such as Direct Preference Optimization (DPO), typically treat the model as a monolithic entity, applying uniform optimization pressure across all layers. This approach overlooks the functional specialization within the Transformer architecture, where different layers are known to handle distinct tasks from syntax to abstract reasoning. In this paper, we challenge this one-size-fits-all paradigm by introducing Hierarchical Alignment, a novel method that applies targeted DPO to distinct functional blocks of a model’s layers: local (syntax), intermediate (logic), and global (factuality). Through a series of controlled experiments on state-of-the-art models like Llama-3.1-8B and Qwen1.5-7B using LoRA for surgical fine-tuning, our results, evaluated by a powerful LLM-as-Judge, demonstrate significant and predictable improvements. Specifically, aligning the local layers (Local-Align) enhances grammatical fluency. More importantly, aligning the global layers (Global-Align) not only improves factual consistency as hypothesized but also proves to be the most effective strategy for enhancing logical coherence, outperforming all baselines. Critically, all hierarchical strategies successfully avoid the “alignment tax” observed in standard DPO, where gains in fluency come at the cost of degraded logical reasoning. These findings establish a more resource-efficient, controllable, and interpretable path for model alignment, highlighting the immense potential of shifting from monolithic optimization to structure-aware surgical fine-tuning to build more advanced and reliable LLMs.
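A toy picture of "surgical" fine-tuning: LoRA adapters are attached only to a chosen block of layers while everything else stays frozen. The 12-linear-layer stack below is a stand-in for a Transformer, and the block boundaries are invented for illustration; the paper's actual setup is not reproduced here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layers = nn.ModuleList(nn.Linear(64, 64) for _ in range(12))
for i in range(8, 12):                               # "Global-Align": late block only
    layers[i] = LoRALinear(layers[i])

x = torch.randn(2, 64)
for layer in layers:
    x = layer(x)
print(x.shape, sum(p.requires_grad for p in layers.parameters()))
```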
[NLP-56] Improving Text-to-Image Generation with Input-Side Inference-Time Scaling
[Quick Read]: This paper addresses the poor image-text alignment, visual quality, and aesthetics that text-to-image (T2I) models exhibit on simple or underspecified prompts. The key to the solution is an LLM-based prompt-rewriting framework with a carefully designed reward system and an iterative Direct Preference Optimization (DPO) training pipeline, which improves prompt quality without supervised fine-tuning data. Beyond consistently improving alignment, quality, and aesthetics, the rewriter transfers well: trained on one T2I backbone, it generalizes to others without retraining, establishing prompt rewriting as an effective, scalable, and practical model-agnostic strategy for improving T2I systems.
Link: https://arxiv.org/abs/2510.12041
Authors: Ruibo Chen, Jiacheng Pan, Heng Huang, Zhenheng Yang
Affiliations: TikTok; University of Maryland, College Park
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Recent advances in text-to-image (T2I) generation have achieved impressive results, yet existing models often struggle with simple or underspecified prompts, leading to suboptimal image-text alignment, aesthetics, and quality. We propose a prompt rewriting framework that leverages large language models (LLMs) to refine user inputs before feeding them into T2I backbones. Our approach introduces a carefully designed reward system and an iterative direct preference optimization (DPO) training pipeline, enabling the rewriter to enhance prompts without requiring supervised fine-tuning data. We evaluate our method across diverse T2I models and benchmarks. Results show that our prompt rewriter consistently improves image-text alignment, visual quality, and aesthetics, outperforming strong baselines. Furthermore, we demonstrate strong transferability by showing that a prompt rewriter trained on one T2I backbone generalizes effectively to others without needing to be retrained. We also systematically study scalability, evaluating how performance gains scale with the capacity of the large LLM used as the rewriter. These findings highlight that prompt rewriting is an effective, scalable, and practical model-agnostic strategy for improving T2I systems. We plan to release the code and trained prompt rewriters soon.
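One way to picture the reward-driven DPO loop is to sample several rewrites per prompt, score them, and pair the best against the worst. `rewrite` and `reward` below are toy stand-ins for the LLM rewriter and the paper's reward system, so treat this as a schematic sketch.

```python
import random
random.seed(0)

def build_dpo_pairs(prompts, rewrite, reward, n_samples=4):
    """Turn reward-ranked rewrites into (chosen, rejected) preference pairs."""
    pairs = []
    for p in prompts:
        candidates = [rewrite(p) for _ in range(n_samples)]
        ranked = sorted(candidates, key=reward, reverse=True)
        pairs.append({"prompt": p, "chosen": ranked[0], "rejected": ranked[-1]})
    return pairs

styles = [", watercolor", ", photorealistic, 4k", ", cinematic lighting", ""]
pairs = build_dpo_pairs(["a cat"],
                        rewrite=lambda p: p + random.choice(styles),
                        reward=len)                # toy reward: longer is "better"
print(pairs)
```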
[NLP-57] Uncertainty Quantification for Hallucination Detection in Large Language Models: Foundations, Methodology and Future Directions
[Quick Read]: This paper addresses the reliability and trustworthiness problems caused by hallucination in large language models (LLMs), i.e., outputs that read plausibly but are factually wrong. The key to the solution is uncertainty quantification (UQ): by distinguishing epistemic from aleatoric uncertainty and adapting both notions to the LLM setting, UQ offers a systematic, principled assessment of output trustworthiness and thus a quantifiable basis for detecting and curbing hallucinations. The survey covers UQ foundations, systematically categorizes existing methods along multiple dimensions with empirical results for representative approaches, and outlines current limitations and future research directions.
Link: https://arxiv.org/abs/2510.12040
Authors: Sungmin Kang, Yavuz Faruk Bakman, Duygu Nur Yaldiz, Baturalp Buyukates, Salman Avestimehr
Affiliations: University of Southern California; University of Birmingham
Subjects: Computation and Language (cs.CL)
Comments: 24 pages, 3 figures, magazine
Abstract:The rapid advancement of large language models (LLMs) has transformed the landscape of natural language processing, enabling breakthroughs across a wide range of areas including question answering, machine translation, and text summarization. Yet, their deployment in real-world applications has raised concerns over reliability and trustworthiness, as LLMs remain prone to hallucinations that produce plausible but factually incorrect outputs. Uncertainty quantification (UQ) has emerged as a central research direction to address this issue, offering principled measures for assessing the trustworthiness of model generations. We begin by introducing the foundations of UQ, from its formal definition to the traditional distinction between epistemic and aleatoric uncertainty, and then highlight how these concepts have been adapted to the context of LLMs. Building on this, we examine the role of UQ in hallucination detection, where quantifying uncertainty provides a mechanism for identifying unreliable generations and improving reliability. We systematically categorize a wide spectrum of existing methods along multiple dimensions and present empirical results for several representative approaches. Finally, we discuss current limitations and outline promising future research directions, providing a clearer picture of the current landscape of LLM UQ for hallucination detection.
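As a concrete taste of one representative UQ signal, the sketch below computes predictive entropy over repeatedly sampled answers: high entropy flags a generation as a hallucination risk. The answer lists are invented, and real methods add semantic clustering and calibration on top of this basic quantity.

```python
import math
from collections import Counter

def predictive_entropy(sampled_answers):
    """Shannon entropy of the empirical answer distribution."""
    counts = Counter(sampled_answers)
    n = len(sampled_answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

confident = ["Paris"] * 9 + ["Lyon"]                          # low entropy
uncertain = ["1912", "1915", "1910", "1913", "1912", "1920"]  # high entropy
print(round(predictive_entropy(confident), 3),
      round(predictive_entropy(uncertain), 3))
```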
[NLP-58] On the Interplay between Human Label Variation and Model Fairness
[Quick Read]: This paper examines the previously unexplored effect of human label variation (HLV) on model fairness. The key finding, obtained by comparing training on majority-vote labels against a range of HLV training methods, is that HLV training improves fairness even without any explicit debiasing mechanism.
Link: https://arxiv.org/abs/2510.12036
Authors: Kemal Kurniawan, Meladel Mistica, Timothy Baldwin, Jey Han Lau
Affiliations: University of Melbourne; Mohamed bin Zayed University of Artificial Intelligence
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 7 figures
Abstract:The impact of human label variation (HLV) on model fairness is an unexplored topic. This paper examines the interplay by comparing training on majority-vote labels with a range of HLV methods. Our experiments show that without explicit debiasing, HLV training methods have a positive impact on fairness.
[NLP-59] Multi-stage Prompt Refinement for Mitigating Hallucinations in Large Language Models
[Quick Read]: This paper targets hallucination in large language models (LLMs), especially hallucinations triggered by ill-formed prompts with grammatical errors, ambiguity, typos, or incomplete information. The key to the solution is the Multi-stage Prompt Refinement (MPR) framework, which repairs specific error classes stage by stage (punctuation, typos, misused key terms) using small language models (SLMs) fine-tuned for each task, iteratively enriches prompts with additional context, and applies a self-reflective ranking mechanism to prioritize the most relevant input, thereby substantially reducing hallucinations and improving LLM output accuracy.
Link: https://arxiv.org/abs/2510.12032
Authors: Jung-Woo Shim, Yeong-Joon Ju, Ji-Hoon Park, Seong-Whan Lee
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 22 pages, 6 figures
Abstract:Recent advancements in large language models (LLMs) have shown strong performance in natural language understanding and generation tasks. However, LLMs continue to encounter challenges with hallucinations, where models generate plausible but incorrect information. While several factors contribute to hallucinations, the impact of ill-formed prompts (prompts with ambiguous wording, incorrect grammar, or incomplete information) was relatively underexplored. To address this, we introduce Multi-stage Prompt Refinement (MPR), a framework designed to systematically improve these ill-formed prompts across multiple stages. Each stage addresses specific errors such as punctuation, typographical mistakes, and misuse of key terms, using small language models (SLMs) fine-tuned for these tasks. MPR iteratively enhances the clarity of prompts with additional context and employs a self-reflection mechanism with ranking to prioritize the most relevant input. Experimental results on hallucination benchmarks show that prompts refined by MPR achieve over an 85% win rate compared to their original forms, demonstrating its effectiveness in reducing hallucinations and improving LLM output accuracy. Interestingly, we reveal that MPR can be combined with existing post-hoc hallucination mitigation frameworks, further enhancing its versatility. MPR provides a lightweight and adaptable solution for enhancing LLM reliability across various domains.
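The staged structure is easy to picture as a pipeline of specialized fixers applied in sequence. In the paper each stage is a fine-tuned SLM; the string functions below are invented stand-ins, so this is only a shape-of-the-idea sketch.

```python
def fix_typos(p):        return p.replace("wether", "whether")
def fix_punctuation(p):  return p.rstrip(" ,;") + "?"
def clarify_terms(p):    return p.replace("it", "the treaty")

STAGES = [fix_typos, fix_punctuation, clarify_terms]   # one error class per stage

def refine(prompt, stages=STAGES):
    for stage in stages:
        prompt = stage(prompt)          # MPR would also self-reflect and rank here
    return prompt

print(refine("tell me wether it was signed in 1648 ,"))
```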
[NLP-60] CPR: Mitigating Large Language Model Hallucinations with Curative Prompt Refinement
[Quick Read]: This paper addresses hallucinations that arise when users supply poorly structured or vague prompts, leading LLMs to generate plausible but wrong facts based on assumed rather than actual intent, which erodes trust. The key to the solution is Curative Prompt Refinement (CPR), a plug-and-play framework that (1) cleans ill-formed prompts of ambiguity and clutter, and (2) uses a fine-tuned small language model to generate additional informative task descriptions that align the prompt with the user's intent. Empirically, CPR markedly improves generation quality and mitigates hallucination: refined prompts achieve over a 90% win rate against the originals without any external knowledge.
Link: https://arxiv.org/abs/2510.12029
Authors: Jung-Woo Shim, Yeong-Joon Ju, Ji-Hoon Park, Seong-Whan Lee
Affiliations: Korea University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 7 pages, 2 figures
Abstract:Recent advancements in large language models (LLMs) highlight their fluency in generating responses to diverse prompts. However, these models sometimes generate plausible yet incorrect "hallucinated" facts, undermining trust. A frequent but often overlooked cause of such errors is the use of poorly structured or vague prompts by users, leading LLMs to base responses on assumed rather than actual intentions. To mitigate hallucinations induced by these ill-formed prompts, we introduce Curative Prompt Refinement (CPR), a plug-and-play framework for curative prompt refinement that 1) cleans ill-formed prompts, and 2) generates additional informative task descriptions to align the intention of the user and the prompt using a fine-tuned small language model. When applied to language models, we discover that CPR significantly increases the quality of generation while also mitigating hallucination. Empirical studies show that prompts with CPR applied achieves over a 90% win rate over the original prompts without any external knowledge.
[NLP-61] Information Extraction from Conversation Transcripts: Neuro-Symbolic vs. LLM
[Quick Read]: This paper pushes back on the field's wholesale reliance on large language models for information extraction (IE), which discards decades of experience with symbolic and statistical systems. The key to the solution is an empirical head-to-head comparison of a neuro-symbolic (NS) and an LLM-based IE system in the agricultural domain, quantifying the trade-offs in accuracy, efficiency, controllability, and maintainability, and surfacing the "hidden cost" of deploying NLP systems in practice: the need to balance performance, efficiency, and control.
Link: https://arxiv.org/abs/2510.12023
Authors: Alice Saebom Kwak, Maria Alexeeva, Gus Hahn-Powell, Keith Alcock, Kevin McLaughlin, Doug McCorkle, Gabe McNunn, Mihai Surdeanu
Affiliations: University of Arizona; Lum AI; Eocene Environmental Group
Subjects: Computation and Language (cs.CL)
Comments: 15 pages, 2 figures
Abstract:The current trend in information extraction (IE) is to rely extensively on large language models, effectively discarding decades of experience in building symbolic or statistical IE systems. This paper compares a neuro-symbolic (NS) and an LLM-based IE system in the agricultural domain, evaluating them on nine interviews across pork, dairy, and crop subdomains. The LLM-based system outperforms the NS one (F1 total: 69.4 vs. 52.7; core: 63.0 vs. 47.2), where total includes all extracted information and core focuses on essential details. However, each system has trade-offs: the NS approach offers faster runtime, greater control, and high accuracy in context-free tasks but lacks generalizability, struggles with contextual nuances, and requires significant resources to develop and maintain. The LLM-based system achieves higher performance, faster deployment, and easier maintenance but has slower runtime, limited control, model dependency and hallucination risks. Our findings highlight the “hidden cost” of deploying NLP systems in real-world applications, emphasizing the need to balance performance, efficiency, and control.
[NLP-62] Generate Logical Equivalence Questions
[Quick Read]: This paper tackles rising academic dishonesty (especially plagiarism) in online instruction, together with the shortage and uneven difficulty of practice problems. The key to the solution is an Automatic Question Generation (AQG) method for logical-equivalence questions in Discrete Mathematics: such questions are defined in a formal language, translated into two sets of generation rules, and produced by a linear-time algorithm, enabling efficient, per-student questions of uniform difficulty that curb copying and raise practice quality.
Link: https://arxiv.org/abs/2510.12001
Authors: Xinyu Wang, Haoming Yu, Yicheng Yang, Zhiyuan Li
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Academic dishonesty is met with zero tolerance in higher education, yet plagiarism has become increasingly prevalent in the era of online teaching and learning. Automatic Question Generation (AQG) presents a potential solution to mitigate copying by creating unique questions for each student. Additionally, AQG can provide a vast array of practice questions. Our AQG focuses on generating logical equivalence questions for Discrete Mathematics, a foundational course for first-year computer science students. A literature review reveals that existing AQGs for this type of question generate all propositions that meet user-defined constraints, resulting in inefficiencies and a lack of uniform question difficulty. To address this, we propose a new approach that defines logical equivalence questions using a formal language, translates this language into two sets of generation rules, and develops a linear-time algorithm for question generation. We evaluated our AQG through two experiments. The first involved a group of students completing questions generated by our system. Statistical analysis shows that the accuracy of these questions is comparable to that of textbook questions. The second experiment assessed the number of steps required to solve our generated questions, textbook questions, and those generated by multiple large language models. The results indicated that the difficulty of our questions was similar to that of textbook questions, confirming the quality of our AQG.
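The rule-based flavor of such a generator can be sketched with a tiny proposition AST and two textbook equivalences (De Morgan and double negation) standing in for the paper's rule sets; a real generator would chain many rules under difficulty constraints.

```python
def de_morgan(expr):
    """not (A and B)  ->  (not A) or (not B)"""
    if expr[0] == "not" and expr[1][0] == "and":
        _, (_, a, b) = expr
        return ("or", ("not", a), ("not", b))
    return expr

def double_neg(expr):
    """not (not A)  ->  A"""
    if expr[0] == "not" and expr[1][0] == "not":
        return expr[1][1]
    return expr

def show(e):
    if isinstance(e, str):
        return e
    if e[0] == "not":
        return "~" + show(e[1])
    return f"({show(e[1])} {e[0]} {show(e[2])})"

start = ("not", ("and", "p", "q"))
print(show(start), "<=>", show(de_morgan(start)))   # ~(p and q) <=> (~p or ~q)
print(show(double_neg(("not", ("not", "p")))))      # p
```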
[NLP-63] UALM: Unified Audio Language Model for Understanding, Generation and Reasoning
[Quick Read]: This paper addresses the fragmentation of audio understanding, text-to-audio generation, and multimodal reasoning across separate models, which holds back advanced cross-modal intelligence. The key to the solution is the Unified Audio Language Model (UALM), a single architecture supporting all three tasks: UALM-Gen reaches text-to-audio quality comparable to state-of-the-art diffusion-based models; with suitable data blending, training recipes, and inference techniques, one model matches specialized systems on audio understanding, text-to-audio generation, and text reasoning; and UALM-Reason fuses text and audio in intermediate reasoning steps to enable cross-modal generative reasoning, reported as a first in audio research and validated by subjective evaluations.
Link: https://arxiv.org/abs/2510.12000
Authors: Jinchuan Tian, Sang-gil Lee, Zhifeng Kong, Sreyan Ghosh, Arushi Goel, Chao-Han Huck Yang, Wenliang Dai, Zihan Liu, Hanrong Ye, Shinji Watanabe, Mohammad Shoeybi, Bryan Catanzaro, Rafael Valle, Wei Ping
Affiliations: Carnegie Mellon University; NVIDIA; University of Maryland
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Recent advances in the audio language modeling (ALM) domain tackle audio understanding and text-to-audio generation as separate tasks. Very few studies attempt to unify these tasks – an essential step toward advanced multimodal reasoning. This paper introduces Unified Audio Language Model (UALM), which aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model. To achieve this goal, we first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models. We then demonstrate, using proper data blending, training recipes, and inference techniques, that our single UALM model matches the quality of state-of-the-art specialized models in audio understanding, text-to-audio generation, and text reasoning. Furthermore, we present UALM-Reason, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks. To our knowledge, this is the first demonstration in audio research of cross-modal generative reasoning, with its effectiveness confirmed by subjective evaluations.
[NLP-64] SAGE: A Top-Down Bottom-Up Knowledge-Grounded User Simulator for Multi-turn AGent Evaluation
[Quick Read]: This paper addresses the difficulty of evaluating multi-turn interactive agents without human assessment: existing simulated users model generic behavior and lack domain-specific knowledge, so they act unrealistically. The key to the solution is SAGE, a user Simulation framework for multi-turn AGent Evaluation that grounds simulated users in top-down business-logic knowledge (e.g., ideal customer profiles) and bottom-up knowledge from business infrastructure (product catalogs, FAQs, knowledge bases), so interactions reflect realistic personas and information needs. Empirically, this produces more realistic and diverse interactions and surfaces up to 33% more agent errors than prior approaches, making it an effective tool for bug-finding and iterative agent improvement.
Link: https://arxiv.org/abs/2510.11997
Authors: Ryan Shea, Yunan Lu, Liang Qiu, Zhou Yu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Evaluating multi-turn interactive agents is challenging due to the need for human assessment. Evaluation with simulated users has been introduced as an alternative, however existing approaches typically model generic users and overlook the domain-specific principles required to capture realistic behavior. We propose SAGE, a novel user Simulation framework for multi-turn AGent Evaluation that integrates knowledge from business contexts. SAGE incorporates top-down knowledge rooted in business logic, such as ideal customer profiles, grounding user behavior in realistic customer personas. We further integrate bottom-up knowledge taken from business agent infrastructure (e.g., product catalogs, FAQs, and knowledge bases), allowing the simulator to generate interactions that reflect users’ information needs and expectations in a company’s target market. Through empirical evaluation, we find that this approach produces interactions that are more realistic and diverse, while also identifying up to 33% more agent errors, highlighting its effectiveness as an evaluation tool to support bug-finding and iterative agent improvement.
[NLP-65] Conjecturing: An Overlooked Step in Formal Mathematical Reasoning
[Quick Read]: This paper spotlights conjecturing, an overlooked step in autoformalisation: many informal mathematical statements cannot be formalised until a conclusion (an explicit answer or a specific bound) has first been conjectured, and treating it as given has inflated estimates of LLMs' autoformalisation ability. The key contributions are ConjectureBench, built by augmenting existing datasets, together with a redesigned evaluation framework and metric that measure conjecturing both as a distinct task and inside the autoformalisation pipeline, and Lean-FIRe, an inference-time method that improves conjecturing and autoformalisation, achieving the first reported end-to-end autoformalisation of 13 PutnamBench problems with GPT-4.1 and 7 with DeepSeek-V3.1. The takeaway: conjecturing must be treated as an independent task and correctly integrated into autoformalisation to faithfully measure and improve formal mathematical reasoning.
Link: https://arxiv.org/abs/2510.11986
Authors: Jasivan Alex Sivakumar, Philipp Borchert, Ronald Cardenas, Gerasimos Lampouras
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Autoformalisation, the task of expressing informal mathematical statements in formal language, is often viewed as a direct translation process. This, however, disregards a critical preceding step: conjecturing. Many mathematical problems cannot be formalised directly without first conjecturing a conclusion such as an explicit answer, or a specific bound. Since Large Language Models (LLMs) already struggle with autoformalisation, and the evaluation of their conjecturing ability is limited and often entangled within autoformalisation or proof, it is particularly challenging to understand its effect. To address this gap, we augment existing datasets to create ConjectureBench, and redesign the evaluation framework and metric specifically to measure the conjecturing capabilities of LLMs both as a distinct task and within the autoformalisation pipeline. Our evaluation of foundational models, including GPT-4.1 and DeepSeek-V3.1, reveals that their autoformalisation performance is substantially overestimated when the conjecture is accounted for during evaluation. However, the conjecture should not be assumed to be provided. We design an inference-time method, Lean-FIRe to improve conjecturing and autoformalisation, which, to the best of our knowledge, achieves the first successful end-to-end autoformalisation of 13 PutnamBench problems with GPT-4.1 and 7 with DeepSeek-V3.1. We demonstrate that while LLMs possess the requisite knowledge to generate accurate conjectures, improving autoformalisation performance requires treating conjecturing as an independent task, and investigating further how to correctly integrate it within autoformalisation. Finally, we provide forward-looking guidance to steer future research toward improving conjecturing, an overlooked step of formal mathematical reasoning.
[NLP-66] Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
[Quick Read]: This paper tackles the many weaknesses of current AI agent evaluation, which is unstandardized, slow, prone to implementation bugs, and opaque about how agents actually behave in realistic settings. The key to the solution is the Holistic Agent Leaderboard (HAL): a standardized harness that orchestrates parallel evaluations across hundreds of VMs, cutting evaluation time from weeks to hours while eliminating common implementation bugs; a three-dimensional analysis across models, scaffolds, and benchmarks to validate results; and LLM-aided log inspection that uncovers previously unreported behaviors (such as searching for the benchmark on HuggingFace instead of solving the task), shifting the field toward agents that work reliably in the real world.
Link: https://arxiv.org/abs/2510.11977
Authors: Sayash Kapoor, Benedikt Stroebl, Peter Kirgis, Nitya Nadgir, Zachary S Siegel, Boyi Wei, Tianci Xue, Ziru Chen, Felix Chen, Saiteja Utpala, Franck Ndzomga, Dheeraj Oruganty, Sophie Luskin, Kangheng Liu, Botao Yu, Amit Arora, Dongyoon Hahm, Harsh Trivedi, Huan Sun, Juyong Lee, Tengjun Jin, Yifan Mai, Yifei Zhou, Yuxuan Zhu, Rishi Bommasani, Daniel Kang, Dawn Song, Peter Henderson, Yu Su, Percy Liang, Arvind Narayanan
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:
Abstract:AI agents have been developed for complex real-world tasks from coding to customer service. But AI agent evaluations suffer from many challenges that undermine our understanding of how well agents really work. We introduce the Holistic Agent Leaderboard (HAL) to address these challenges. We make three main contributions. First, we provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of VMs, reducing evaluation time from weeks to hours while eliminating common implementation bugs. Second, we conduct three-dimensional analysis spanning models, scaffolds, and benchmarks. We validate the harness by conducting 21,730 agent rollouts across 9 models and 9 benchmarks in coding, web navigation, science, and customer service with a total cost of about $40,000. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs. Third, we use LLM-aided log inspection to uncover previously unreported behaviors, such as searching for the benchmark on HuggingFace instead of solving a task, or misusing credit cards in flight booking tasks. We share all agent logs, comprising 2.5B tokens of language model calls, to incentivize further research into agent behavior. By standardizing how the field evaluates agents and addressing common pitfalls in agent evaluation, we hope to shift the focus from agents that ace benchmarks to agents that work reliably in the real world.
[NLP-67] Scaling Long-Horizon LLM Agent via Context-Folding
[Quick Read]: This paper addresses the context-length bottleneck that fundamentally constrains large language model (LLM) agents on long-horizon tasks. The key to the solution is Context-Folding, a framework that lets an agent actively manage its working context: it branches into a sub-trajectory to handle a subtask and, on completion, folds it, collapsing the intermediate steps while retaining a concise summary of the outcome. To make this behavior learnable, the authors develop FoldGRPO, an end-to-end reinforcement learning framework with dedicated process rewards for effective task decomposition and context management. On complex long-horizon tasks (Deep Research and SWE), the folding agent matches or outperforms ReAct baselines while using an active context 10x smaller, and clearly beats summarization-based context management.
Link: https://arxiv.org/abs/2510.11967
Authors: Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, Jiecao Chen
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:Large language model (LLM) agents are fundamentally constrained by context length on long-horizon tasks. We introduce Context-Folding, a framework that empowers agents to actively manage their working context. An agent can procedurally branch into a sub-trajectory to handle a subtask and then fold it upon completion, collapsing the intermediate steps while retaining a concise summary of the outcome. To make this behavior learnable, we develop an end-to-end reinforcement learning framework FoldGRPO with specific process rewards to encourage effective task decomposition and context management. On complex long-horizon tasks (Deep Research and SWE), our folding agent matches or outperforms the ReAct baselines while using an active context 10x smaller and significantly outperforms models that rely on summarization-based context management.
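The fold operation itself is simple to picture: once a sub-trajectory finishes, its messages are replaced by a one-line summary. `summarize` below is a stand-in for the agent's own summary step, so this is a schematic sketch of the data manipulation, not FoldGRPO.

```python
def fold(context, branch_start, summarize):
    """Collapse messages[branch_start:] into a single summary entry."""
    branch = context[branch_start:]
    return context[:branch_start] + [("folded", summarize(branch))]

context = [("user", "fix the failing test"), ("agent", "plan: locate the failure")]
branch_start = len(context)                     # agent branches into a subtask
context += [("agent", "run pytest"), ("tool", "3 failures in test_io.py")]

context = fold(context, branch_start,
               summarize=lambda b: f"ran the tests; result: {b[-1][1]}")
print(context)   # intermediate steps gone, concise outcome retained
```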
[NLP-68] Direct Multi-Token Decoding
[Quick Read]: This paper targets inference inefficiency in large language models (LLMs): standard decoding traverses the full layer stack for every output token, repeatedly recomputing early and middle layers. The key to the solution is Direct Multi-Token Decoding (DMTD), a new inference paradigm built on the observation that once the input has been processed by the early and middle layers, the hidden states may carry enough information for the late layers alone to generate multiple output tokens, avoiding repeated passes through the earlier layers. The method adds no parameters, auxiliary routines, or post-hoc verification. Fine-tuned on limited data, a DMTD Qwen3-4B model already delivers up to a 2x speedup with only minor performance loss, and scaling analysis suggests the gains will grow with larger training datasets.
Link: https://arxiv.org/abs/2510.11958
Authors: Xuan Luo, Weizhi Wang, Xifeng Yan
Affiliations: UC Santa Barbara
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Decoder-only transformers have become the standard architecture for large language models (LLMs) due to their strong performance. Recent studies suggest that, in pre-trained LLMs, early, middle, and late layers may serve distinct roles: Early layers focus on understanding the input context, middle layers handle task-specific processing, and late layers convert abstract representations into output tokens. We hypothesize that once representations have been processed by the early and middle layers, the resulting hidden states may encapsulate sufficient information to support the generation of multiple tokens using only the late layers, eliminating the need to repeatedly traverse the early and middle layers. We refer to this inference paradigm as Direct Multi-Token Decoding (DMTD). Unlike speculative decoding, our method introduces no additional parameters, auxiliary routines, or post-generation verification. Despite being trained on a limited dataset, a fine-tuned DMTD Qwen3-4B model has already demonstrated promising results, achieving up to a 2x speedup with only minor performance loss. Moreover, as shown in our scaling analysis, its performance is expected to further improve with larger training datasets.
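The layer-reuse idea can be illustrated with a toy stack of randomly initialized Transformer layers: early and middle layers run once, then only the late block loops to emit several tokens. This sketch deliberately ignores feeding generated tokens back in and every other real-decoder detail; it is a numerical cartoon of the claim, not the paper's implementation.

```python
import torch
import torch.nn as nn

d, n_layers, late_from, k_tokens = 64, 8, 6, 3
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
    for _ in range(n_layers)
)
head = nn.Linear(d, 1000)                # toy vocabulary of 1000 tokens

with torch.no_grad():
    x = torch.randn(1, 16, d)            # embedded input sequence
    for layer in layers[:late_from]:     # early + middle layers: run once
        x = layer(x)

    tokens, h = [], x
    for _ in range(k_tokens):            # late layers only, once per new token
        for layer in layers[late_from:]:
            h = layer(h)
        tokens.append(head(h[:, -1]).argmax(-1).item())
print(tokens)
```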
[NLP-69] Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries
[Quick Read]: This paper addresses the gap between real-world task complexity and existing RAG benchmarks, which rarely test multi-hop reasoning or out-of-scope, unanswerable questions faithfully: many items can be cheated via disconnected reasoning (solved without genuine multi-hop inference) or require only simple factual recall, so benchmarks fail to expose RAG systems' limitations. The key to the solution is the first pipeline for automatic, difficulty-controlled creation of uncheatable, realistic, unanswerable, multi-hop queries (CRUMQs), adaptable to any corpus and domain. Built over two popular RAG datasets and benchmarked against leading retrieval-augmented LLMs, CRUMQs prove highly challenging, cutting cheatability scores by up to 81.0% relative to prior benchmarks and offering a simple route to harder, more realistic evaluation.
Link: https://arxiv.org/abs/2510.11956
Authors: Gabrielle Kaili-May Liu, Bryan Li, Arman Cohan, William Gantt Walden, Eugene Yang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:
Abstract:Real-world use cases often present RAG systems with complex queries for which relevant information is missing from the corpus or is incomplete. In these settings, RAG systems must be able to reject unanswerable, out-of-scope queries and identify failures of retrieval and multi-hop reasoning. Despite this, existing RAG benchmarks rarely reflect realistic task complexity for multi-hop or out-of-scope questions, which often can be cheated via disconnected reasoning (i.e., solved without genuine multi-hop inference) or require only simple factual recall. This limits the ability for such benchmarks to uncover limitations of existing RAG systems. To address this gap, we present the first pipeline for automatic, difficulty-controlled creation of uncheatable, realistic, unanswerable, and multi-hop queries (CRUMQs), adaptable to any corpus and domain. We use our pipeline to create CRUMQs over two popular RAG datasets and demonstrate its effectiveness via benchmark experiments on leading retrieval-augmented LLMs. Results show that compared to prior RAG benchmarks, CRUMQs are highly challenging for RAG systems and achieve up to 81.0% reduction in cheatability scores. More broadly, our pipeline offers a simple way to enhance benchmark difficulty and realism and drive development of more capable RAG systems.
[NLP-70] GRAVITY: A Framework for Personalized Text Generation via Profile-Grounded Synthetic Preferences
[Quick Read]: This paper addresses LLM personalization's dependence on costly human feedback or interaction logs, which limits scalability and ignores deeper user attributes such as values, interests, beliefs, and personality traits. The key to the solution is GRAVITY, a framework that generates synthetic, profile-grounded preference data by integrating demographic, cultural, and psychological frameworks, including Hofstede's cultural dimensions, Schwartz's basic values, the World Values Survey, and the Big Five OCEAN traits, to guide personalized content generation. On book descriptions for 400 Amazon users across multiple cultures (USA, Brazil, Japan, India), profile-grounded synthetic data consistently beats prompt conditioning, standard fine-tuning, and naive synthetic pairs, with over 4% higher preference gains across baselines and user studies preferring GRAVITY outputs over 86% of the time, showing that scenario-grounded synthetic data captures richer user variation, reduces reliance on costly annotation, and produces more engaging, user-centered content.
Link: https://arxiv.org/abs/2510.11952
Authors: Priyanka Dey, Daniele Rosa, Wenqing Zheng, Daniel Barcklow, Jieyu Zhao, Emilio Ferrara
Affiliations: University of Southern California; Capital One Research
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Personalization in LLMs often relies on costly human feedback or interaction logs, limiting scalability and neglecting deeper user attributes. To reduce the reliance on human annotations, we introduce GRAVITY (Generative Response with Aligned Values, Interests, and Traits of You), a framework for generating synthetic, profile-grounded preference data that captures users’ interests, values, beliefs, and personality traits. By integrating demographic, cultural, and psychological frameworks – including Hofstede’s cultural dimensions, Schwartz’s basic values, the World Values Survey, and Big Five OCEAN traits – GRAVITY synthesizes preference pairs to guide personalized content generation. We evaluate GRAVITY on book descriptions for 400 Amazon users, comparing it to prompt-based conditioning, standard fine-tuning, and naive synthetic pair generation. Profile-grounded synthetic data consistently improves generation, especially across multiple cultures (USA, Brazil, Japan, India), achieving over 4% higher preference gains across baselines, with user studies showing that GRAVITY outputs are preferred over 86% of the time. Our results show that scenario-grounded synthetic data can capture richer user variation, reduce reliance on costly annotation, and produce more engaging, user-centered content, offering a scalable path for LLM personalization.
[NLP-71] opoAlign: A Framework for Aligning Code to Math via Topological Decomposition
[Quick Read]: This paper addresses autoformalisation for mathematical reasoning: LLMs handle informal mathematics well, but pairing them with formal proof assistants is held back by the scarcity of high-quality paired informal/formal statements. The key to the solution is TopoAlign, a framework that unlocks widely available code repositories as training resources for Math LLMs: it decomposes code into docstrings, main functions, and dependency functions, then reassembles these components into analogues that structurally mirror formal statements, yielding structurally aligned training data without human annotation. Training on this data lifts DeepSeek-Math by 17.77% on BEq@10 and 68.82% on typecheck@10 across minif2f, Putnam, and ProofNet, with smaller but positive gains for the specialized Herald model.
Link: https://arxiv.org/abs/2510.11944
Authors: Yupei Li, Philipp Borchert, Gerasimos Lampouras
Affiliations: Imperial College London; Huawei Noah’s Ark Lab
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:
Abstract:Large Language Models (LLMs) excel at both informal and formal (e.g. Lean 4) mathematical reasoning but still struggle with autoformalisation, the task of transforming informal into formal mathematical statements. Autoformalisation helps pair the informal reasoning of LLMs with formal proof assistants which enable machine-verifiable generation and mitigate hallucinations. Yet, the performance of current Math LLMs is constrained by the scarcity of large-scale corpora, particularly those containing pairs of informal and formal statements. Although current models are trained to generate code from natural language instructions, structural and syntactic differences between these and formal mathematics limit effective transfer learning. We propose TopoAlign, a framework that unlocks widely available code repositories as training resources for Math LLMs. TopoAlign decomposes code into docstrings, main functions, and dependency functions, and reassembles these components into analogues that structurally mirror formal statements. This produces structurally aligned code data that can be used for training Math LLMs without requiring additional human annotation. We train two state-of-the-art models, DeepSeek-Math and Herald, and evaluate them on the minif2f, Putnam, and ProofNet benchmarks. TopoAlign provides substantial gains for DeepSeek-Math, improving performance by 17.77% on BEq@10 and 68.82% on typecheck@10. Despite introducing no new mathematical knowledge, our framework achieves gains of 0.12% and 1.09% for Herald on BEq@10 and typecheck@10, respectively, demonstrating that training on aligned code data is beneficial even for specialized models.
[NLP-72] Discrepancy Detection at the Data Level: Toward Consistent Multilingual Question Answering EMNLP2025
[Quick Read]: This paper addresses the tension in multilingual question answering (MQA) between factual consistency and legitimate cultural variation: objective questions (e.g., "What is jaundice?") must stay factually consistent across languages, while answers to subjective ones (e.g., "Who assists in childbirth?") differ by region and culture. The key to the solution is MIND, a user-in-the-loop fact-checking pipeline that detects and flags factual inconsistencies and culturally sensitive divergences in multilingual QA knowledge bases, improving reliability and cultural awareness. MIND is evaluated on a bilingual QA system in the maternal and infant health domain, accompanied by a released annotated bilingual dataset, and generalizes reliably to datasets from other domains.
Link: https://arxiv.org/abs/2510.11928
Authors: Lorena Calvo-Bartolomé, Valérie Aldana, Karla Cantarero, Alonso Madroñal de Mesa, Jerónimo Arenas-García, Jordan Boyd-Graber
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Long paper accepted at EMNLP 2025
Abstract:Multilingual question answering (QA) systems must ensure factual consistency across languages, especially for objective queries such as What is jaundice?, while also accounting for cultural variation in subjective responses. We propose MIND, a user-in-the-loop fact-checking pipeline to detect factual and cultural discrepancies in multilingual QA knowledge bases. MIND highlights divergent answers to culturally sensitive questions (e.g., Who assists in childbirth?) that vary by region and context. We evaluate MIND on a bilingual QA system in the maternal and infant health domain and release a dataset of bilingual questions annotated for factual and cultural inconsistencies. We further test MIND on datasets from other domains to assess generalization. In all cases, MIND reliably identifies inconsistencies, supporting the development of more culturally aware and factually consistent QA systems.
[NLP-73] LLM Reasoning for Machine Translation: Synthetic Data Generation over Thinking Tokens
[Quick Read]: This paper asks whether the intermediate "thinking tokens" generated by large reasoning models (LRMs) improve machine translation (MT). The finding is that they do not: having the model produce a natural-language reasoning process before translating does not help, and fine-tuning on synthetic chain-of-thought (CoT) explanations of how to translate step by step, inspired by human translators' practices, does not beat standard input-output fine-tuning. The key insight is that intermediate tokens help only when they contain concrete translation attempts, for instance when constructed by combining the outputs of modular translation-specific prompting strategies, suggesting that a teacher is better used to refine target translations or expand parallel corpora than to distill its CoT into "thinking" MT models.
Link: https://arxiv.org/abs/2510.11919
Authors: Armel Zebaze, Rachel Bawden, Benoît Sagot
Affiliations: Inria Paris
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large reasoning models (LRMs) have led to new possibilities in terms of problem-solving, through the devising of a natural language thought process prior to answering a query. While their capabilities are well known across mathematics and coding tasks, their impact on the task of machine translation (MT) remains underexplored. In this work, we explore the benefits of the generation of intermediate tokens when performing MT across multiple language pairs of different levels of resourcedness and multiple setups. We find that “thinking tokens” do not help LRMs better perform MT. This result generalizes to models fine-tuned to reason before translating using distilled chain of thought (CoT) inspired by human translators’ practices. Specifically, fine-tuning a model with synthetic CoT explanations detailing how to translate step-by-step does not outperform standard input-output fine-tuning. However, constructing the intermediate tokens by combining the outputs of modular translation-specific prompting strategies results in improvements. Our findings underscore that the contribution of intermediate tokens during fine-tuning highly depends on the presence of translation attempts within them. More broadly, our results suggest that using a teacher to refine target translations or to expand parallel corpora is more impactful than distilling their CoT explanations into “thinking” MT models.
[NLP-74] LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance
[Quick Read]: This paper investigates whether LLM brittleness on out-of-distribution (OOD) inputs, excessive sensitivity to trivial surface changes such as typos or reformulations, reflects unstable internal knowledge representations. The key method applies semantically-preserving perturbations and measures how the separability of true versus false statements in internal representations degrades as inputs drift OOD, across four LLM families, five evaluation datasets, and three knowledge-probing methods. The finding: truthfulness representations collapse as statements diverge from the surface forms seen in pre-training, suggesting that LLMs may learn shallow, non-robust knowledge representations with limited generalizability rather than stable, transferable knowledge.
Link: https://arxiv.org/abs/2510.11905
Authors: Patrick Haller, Mark Ibrahim, Polina Kirichenko, Levent Sagun, Samuel J. Bell
Affiliations: Meta
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:
Abstract:For Large Language Models (LLMs) to be reliable, they must learn robust knowledge that can be generally applied in diverse settings – often unlike those seen during training. Yet, extensive research has shown that LLM performance can be brittle, with models exhibiting excessive sensitivity to trivial input variations. In this work, we explore whether this brittleness is a direct result of unstable internal knowledge representations. To explore this question, we build on previous work showing that LLM representations encode statement truthfulness – i.e., true, factual statements can be easily separated from false, inaccurate ones. Specifically, we test the robustness of learned knowledge by evaluating representation separability on samples that have undergone superficial transformations to drive them out-of-distribution (OOD), such as typos or reformulations. By applying semantically-preserving perturbations, we study how separability degrades as statements become more OOD, across four LLM families, five evaluation datasets, and three knowledge probing methods. Our results reveal that internal representations of statement truthfulness collapse as the samples’ presentations become less similar to those seen during pre-training. While LLMs can often distinguish between true and false statements when they closely resemble the pre-training data, this ability is highly dependent on the statement’s exact surface form. These findings offer a possible explanation for brittle benchmark performance: LLMs may learn shallow, non-robust knowledge representations that allow for only limited generalizability. Our work presents a fundamental challenge for the utility of truthfulness probes, and more broadly, calls for further research on improving the robustness of learned knowledge representations.
[NLP-75] R-WoM: Retrieval-augmented World Model For Computer-use Agents
[Quick Read]: This paper addresses the compounding errors that arise when large language models (LLMs) serve as world models for long-horizon decision-making in digital environments, driven by their tendency to hallucinate and their reliance on static training knowledge, which undermines future-state prediction and reward estimation. The key to the solution is the Retrieval-augmented World Model (R-WoM), which grounds LLM simulations in factual, up-to-date knowledge retrieved from external tutorials, making environment-dynamics modeling more reliable. Experiments show improvements of up to 25.3% on OSWorld and 18.1% on WebArena over baselines, with particular advantages in longer-horizon simulations.
Link: https://arxiv.org/abs/2510.11892
Authors: Kai Mei, Jiang Guo, Shuaichen Chang, Mingwen Dong, Dongkyu Lee, Xing Niu, Jiarong Jiang
Affiliations: Rutgers University; AWS Agentic AI
Subjects: Computation and Language (cs.CL)
Comments:
Abstract:Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration. However, this capability is fundamentally limited by LLMs’ tendency toward hallucination and their reliance on static training knowledge, which can lead to compounding errors that inhibit long-horizon simulations. To systematically investigate whether LLMs are appropriate for world modeling, we probe two core capabilities of world models–future state prediction and reward estimation–through three tasks: next-state identification, full-procedure planning alignment, and milestone transition recognition. Our analysis shows that while LLMs effectively capture immediate next states and identify meaningful state transitions, their performance rapidly degrades in full-procedure planning. This highlights LLMs’ limitations in reliably modeling environment dynamics over long horizons. To address these limitations, we propose the Retrieval-augmented World Model (R-WoM), which grounds LLM simulations by incorporating factual, up-to-date knowledge retrieved from external tutorials. Experiments show that R-WoM achieves substantial improvements of up to 25.3% (OSWorld) and 18.1% (WebArena) compared to baselines, with particular advantages in longer-horizon simulations.
[NLP-76] Deep Research Brings Deeper Harm NEURIPS2025
[Quick Read]: This paper examines the risk that Deep Research (DR) agents, through their multi-step planning and execution, generate harmful content in high-stakes domains such as biosecurity, where alignment mechanisms designed for standalone LLMs fail to cover the systemic vulnerabilities introduced by task decomposition, retrieval, and report synthesis. The key to the analysis is two novel jailbreak strategies: Plan Injection, which injects malicious sub-goals into the agent's plan, and Intent Hijack, which reframes harmful requests as academic research questions to bypass safeguards. Experiments across LLMs and safety benchmarks reveal a deep misalignment: DR agents not only bypass refusals but produce more coherent, professional, and dangerous content than standalone LLMs, underscoring the urgent need for alignment techniques tailored to DR agents.
Link: https://arxiv.org/abs/2510.11851
Authors: Shuo Chen, Zonggen Li, Zhen Han, Bailan He, Tong Liu, Haokun Chen, Georg Groh, Philip Torr, Volker Tresp, Jindong Gu
Affiliations: LMU Munich; Munich Center for Machine Learning (MCML); Technical University of Munich (TUM); AWS AI; Konrad Zuse School of Excellence in Reliable AI (relAI); University of Hong Kong (HKU); University of Oxford
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments: Accepted to Reliable ML from Unreliable Data Workshop @ NeurIPS 2025
Abstract:Deep Research (DR) agents built on Large Language Models (LLMs) can perform complex, multi-step research by decomposing tasks, retrieving online information, and synthesizing detailed reports. However, the misuse of LLMs with such powerful capabilities can lead to even greater risks. This is especially concerning in high-stakes and knowledge-intensive domains such as biosecurity, where DR can generate a professional report containing detailed forbidden knowledge. Unfortunately, we have found such risks in practice: simply submitting a harmful query, which a standalone LLM directly rejects, can elicit a detailed and dangerous report from DR agents. This highlights the elevated risks and underscores the need for a deeper safety analysis. Yet, jailbreak methods designed for LLMs fall short in exposing such unique risks, as they do not target the research ability of DR agents. To address this gap, we propose two novel jailbreak strategies: Plan Injection, which injects malicious sub-goals into the agent’s plan; and Intent Hijack, which reframes harmful queries as academic research questions. We conducted extensive experiments across different LLMs and various safety benchmarks, including general and biosecurity forbidden prompts. These experiments reveal 3 key findings: (1) Alignment of the LLMs often fail in DR agents, where harmful prompts framed in academic terms can hijack agent intent; (2) Multi-step planning and execution weaken the alignment, revealing systemic vulnerabilities that prompt-level safeguards cannot address; (3) DR agents not only bypass refusals but also produce more coherent, professional, and dangerous content, compared with standalone LLMs. These results demonstrate a fundamental misalignment in DR agents and call for better alignment techniques tailored to DR agents. Code and datasets are available at this https URL.
[NLP-77] Balancing Synthetic Data and Replay for Enhancing Task-Specific Capabilities NEURIPS2025
[Quick Read]: This paper addresses catastrophic forgetting when language models are adapted to new tasks through continued pretraining: the model must acquire new capabilities while preserving existing knowledge. The key to the solution is a systematic study of the trade-off between replay ratio and computational budget: using synthetic data generation with the bAbI reasoning tasks as the target, the authors evaluate replay-ratio configurations under different total token budgets, measure effects on task mastery and general knowledge retention, identify an optimal balance between the two, and distill empirically grounded guidelines for choosing replay ratios by compute budget, enabling strong task adaptation at significantly reduced training cost.
Link: https://arxiv.org/abs/2510.11842
Authors: Urs Spiegelhalter, Jörg K.H. Franke, Frank Hutter
Affiliations: University of Freiburg; ELLIS Institute Tübingen; Open-Sci Collective; LAION; Prior Labs
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Presented at 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop on Continual and Compatible Foundation Model Updates (CCFM)
Abstract:Adapting language models to new tasks through continued pretraining faces a fundamental trade-off: models must learn new capabilities while avoiding catastrophic forgetting of existing knowledge. While prior work has studied synthetic data generation techniques, the optimal replay ratios for balancing task performance and knowledge retention under computational constraints remain poorly understood. We present a comprehensive empirical study investigating the interplay between replay ratio configuration and computational budget when adapting language models to new tasks. Using the bAbI reasoning tasks as our target objective, we apply synthetic data generation and systematically evaluate different total token budgets and replay ratio configurations. We analyze their effects on both task mastery and general knowledge retention. Our experiments reveal an optimal configuration that balances task-specific performance with general knowledge retention. Based on our findings, we provide empirically-grounded guidelines for selecting replay ratios based on computational budget, enabling practitioners to achieve strong task adaptation with significantly reduced training costs.
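The knob under study is easy to state in code: with replay ratio r, each draw comes from the replayed pretraining corpus with probability r, otherwise from the new-task data, until the token budget is exhausted. The two one-document corpora below are placeholders, not the paper's data.

```python
import random
random.seed(0)

def mixed_stream(task_docs, replay_docs, total_tokens, replay_ratio):
    """Sample documents until the budget is spent; r controls the replay share."""
    stream, used = [], 0
    while used < total_tokens:
        source = replay_docs if random.random() < replay_ratio else task_docs
        doc = random.choice(source)
        stream.append(doc)
        used += len(doc.split())                 # crude token count
    return stream

stream = mixed_stream(["solve the babi task ."], ["general web text ."], 60, 0.25)
print(len(stream), sum(d.startswith("general") for d in stream))
```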
[NLP-78] Data or Language Supervision: What Makes CLIP Better than DINO? EMNLP2025
[Quick Read]: This paper asks whether CLIP's advantage over self-supervised encoders such as DINO as the vision encoder in vision-language models (VLMs) stems from its language supervision or simply from its much larger training data. The key move is to pre-train CLIP and DINO under controlled conditions, with the same architecture, dataset, and training configuration, reaching similar ImageNet accuracy and thereby disentangling the two factors. Embedding analysis shows CLIP captures high-level semantics (object categories, text) while DINO responds more to low-level features (colors, styles); integrated into VLMs and evaluated on 20 VQA benchmarks, CLIP excels on text-intensive tasks while DINO is slightly better on vision-centric ones, pointing to language supervision as the core driver of the performance difference.
Link: https://arxiv.org/abs/2510.11835
Authors: Yiming Liu, Yuhui Zhang, Dhruba Ghosh, Ludwig Schmidt, Serena Yeung-Levy
Affiliations: Stanford University; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: EMNLP 2025 Findings
Abstract:CLIP outperforms self-supervised models like DINO as vision encoders for vision-language models (VLMs), but it remains unclear whether this advantage stems from CLIP’s language supervision or its much larger training data. To disentangle these factors, we pre-train CLIP and DINO under controlled settings – using the same architecture, dataset, and training configuration – achieving similar ImageNet accuracy. Embedding analysis shows that CLIP captures high-level semantics (e.g., object categories, text), while DINO is more responsive to low-level features like colors and styles. When integrated into VLMs and evaluated on 20 VQA benchmarks, CLIP excels at text-intensive tasks, while DINO slightly outperforms on vision-centric ones. Variants of language supervision (e.g., sigmoid loss, pre-trained language encoders) yield limited gains. Our findings provide scientific insights into vision encoder design and its impact on VLM performance.
[NLP-79] Don't Walk the Line: Boundary Guidance for Filtered Generation
[Quick Read]: This paper addresses a failure mode of pairing generative models with safety classifiers: the common strategy of fine-tuning the generator to reduce the probability of being filtered pushes outputs toward the classifier's decision boundary, increasing both false positives and false negatives. The key to the solution is Boundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier's margin, improving safety without sacrificing utility. On a benchmark of jailbreak and ambiguous prompts, LLM-as-a-Judge evaluations show gains in both safety and utility, and ablations across model scales and reward designs demonstrate robustness.
Link: https://arxiv.org/abs/2510.11834
Authors: Sarah Ball, Andreas Haupt
Affiliations: Ludwig-Maximilians-Universität in Munich; Munich Center for Machine Learning (MCML); Stanford University, Department of Computer Science
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 9 pages, 3 figures, 3 tables
Abstract:Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier’s decision boundary, increasing both false positives and false negatives. We propose Boundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier’s margin. On a benchmark of jailbreak and ambiguous prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive ablations across model scales and reward designs demonstrate the robustness of our approach.
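One way to read "steering away from the margin" is reward shaping that pays for distance from the filter's decision boundary. The shaping below is a guess at the general shape, not the paper's actual reward; in deployment the confidently-unsafe side would still be removed by the filter itself.

```python
def boundary_reward(task_reward, clf_score, threshold=0.5, weight=1.0):
    """Add a bonus for landing far from the classifier's decision boundary."""
    margin = abs(clf_score - threshold)
    return task_reward + weight * margin

for s in (0.05, 0.48, 0.95):     # near-boundary samples (0.48) earn the least
    print(s, round(boundary_reward(task_reward=1.0, clf_score=s), 3))
```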
[NLP-80] Task-Aware Reduction for Scalable LLM-Database Systems
[Quick Read]: This paper addresses the inefficiency of feeding verbose, noisy, and voluminous text (logs, telemetry, monitoring streams) directly into large language models (LLMs) in data-intensive workflows: it is costly, environmentally unsustainable, and often misaligned with task objectives. The key position is to treat an LLM's token budget as an attention budget and elevate task-aware text reduction to a first-class design principle for language-data systems: input-side reduction is attention allocation, prioritizing the information most relevant to downstream tasks rather than merely compressing data, toward more efficient, accurate, and sustainable LLM-data integration.
Link: https://arxiv.org/abs/2510.11813
Authors: Marcus Emmanuel Barnes, Taher A. Ghaleb, Safwat Hassan
Affiliations: Unknown
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL); Databases (cs.DB)
Comments: Preprint. Accepted for presentation at the Workshop on Language Models and Databases (LMD), co-located with CASCON 2025 (IEEE). The final version will appear in IEEE Xplore
Abstract:Large Language Models (LLMs) are increasingly applied to data-intensive workflows, from database querying to developer observability. Yet the effectiveness of these systems is constrained by the volume, verbosity, and noise of real-world text-rich data such as logs, telemetry, and monitoring streams. Feeding such data directly into LLMs is costly, environmentally unsustainable, and often misaligned with task objectives. Parallel efforts in LLM efficiency have focused on model- or architecture-level optimizations, but the challenge of reducing upstream input verbosity remains underexplored. In this paper, we argue for treating the token budget of an LLM as an attention budget and elevating task-aware text reduction as a first-class design principle for language–data systems. We position input-side reduction not as compression, but as attention allocation: prioritizing information most relevant to downstream tasks. We outline open research challenges for building benchmarks, designing adaptive reduction pipelines, and integrating token-budget–aware preprocessing into database and retrieval systems. Our vision is to channel scarce attention resources toward meaningful signals in noisy, data-intensive workflows, enabling scalable, accurate, and sustainable LLM–data integration.
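A minimal form of the proposed input-side reduction: score each log line against the task query and keep the highest-scoring lines within a token budget, preserving original order. The word-overlap scorer is a naive stand-in for a learned relevance model.

```python
def relevance(line, query):
    q = set(query.lower().split())
    return len(q & set(line.lower().split())) / (len(q) or 1)

def reduce_logs(lines, query, token_budget):
    """Treat the budget as an attention budget: spend it on relevant lines."""
    ranked = sorted(lines, key=lambda l: relevance(l, query), reverse=True)
    kept, used = set(), 0
    for line in ranked:
        cost = len(line.split())                 # crude token count
        if used + cost <= token_budget:
            kept.add(line)
            used += cost
    return [l for l in lines if l in kept]       # restore original order

logs = ["GET /health 200", "ERROR db timeout on orders", "GET /login 200",
        "retrying db connection to orders"]
print(reduce_logs(logs, "why did the orders db time out", token_budget=10))
```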
[NLP-81] PHANTOM RECALL: When Familiar Puzzles Fool Smart Models
[Quick Read]: This paper probes the "pseudo-reasoning" of large language models (LLMs) on logic puzzles: models often rely on memorized templates rather than reasoning from first principles, so performance collapses under small perturbations. The key to the solution is PHANTOM RECALL, a systematic benchmark of 25 classic logic puzzles and 149 carefully designed perturbations that preserve reasoning structure while altering surface details and solutions, which exposes a pervasive failure mode, "phantom recall", where models confidently reproduce memorized answers or spurious rationales. To probe and mitigate it, the work contributes (i) an automated logical-equivalence judge for detecting reasoning mismatches, (ii) a fine-grained taxonomy of reasoning error categories, and (iii) a prompting-based mitigation framework guided by that taxonomy, pushing models from memorization toward genuine re-reasoning.
Link: https://arxiv.org/abs/2510.11812
Authors: Souradeep Mukhopadhyay, Rishabh Baral, Nimeesh Mahajan, Samhitha Harish, Aswin RRV, Mihir Parmar, Mutsumi Nakamura, Chitta Baral
Affiliations: Arizona State University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 22 Pages
Abstract:Large language models (LLMs) such as GPT, Gemini, and Claude often appear adept at solving classic logic puzzles–but how much genuine reasoning underlies their answers? Recent evidence suggests that these models frequently rely on memorized templates rather than reasoning from first principles. When puzzles are slightly modified, their performance collapses, revealing a striking fragility. In particular, we asked: Have LLMs addressed these issues? To what extent? How about perturbations to other puzzles? Is there a general way of reformulating the prompt so that the models do better? To examine these things systematically, we introduce PHANTOM RECALL, a benchmark comprising 25 well-known logic puzzles and 149 carefully designed perturbations that preserve reasoning structure but alter superficial details and solutions. We evaluate eleven leading LLMs and identify a recurring failure mode–phantom recall–where models confidently reproduce memorized solutions or spurious rationales that no longer fit the altered scenario. To probe and mitigate this issue, we contribute three tools: (i) an automated logical-equivalence judge to detect reasoning mismatches, (ii) a taxonomy of fine-grained reasoning error categories, and (iii) a prompting-based mitigation framework guided by these categories. Despite near-perfect accuracy on unmodified puzzles, models significantly underperform humans on perturbed ones, exhibiting both phantom recall and over-elaboration. Our findings reveal a crucial limitation: LLMs often fail to re-reason when contextual cues shift–highlighting the gap between linguistic fluency and logical understanding.
zh
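针对上文提出的“幻觉回忆”失败模式,这里给出一个极简判别示意:当模型在扰动谜题上的答案与原版谜题的标准答案一致、而与扰动后的正确答案不一致时,记为一次 phantom recall。论文实际使用自动化逻辑等价判别器,此处以字符串归一化代替,仅为演示假设。

```python
def normalize(ans: str) -> str:
    return " ".join(ans.lower().split())

def classify_response(model_answer, original_answer, perturbed_answer):
    """Label a model response on a perturbed puzzle.

    Sketch only: the paper uses an automated logical-equivalence judge;
    plain string matching here is a stand-in assumption.
    """
    a = normalize(model_answer)
    if a == normalize(perturbed_answer):
        return "correct"
    if a == normalize(original_answer):
        return "phantom_recall"  # confidently reproduces the memorized answer
    return "other_error"

# A classic puzzle whose perturbed variant has a different solution.
print(classify_response(
    model_answer="Seven crossings",
    original_answer="seven crossings",
    perturbed_answer="three crossings",
))  # -> phantom_recall
```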
[NLP-82] Evolution of wartime discourse on Telegram: A comparative study of Ukrainian and Russian policymakers' communication before and after Russia's full-scale invasion of Ukraine
【速读】: 该论文试图解决的问题是:在社交媒体时代背景下,精英政治传播者如何利用Telegram平台进行战时政治沟通,以及其传播策略在俄乌冲突升级后发生了怎样的变化。解决方案的关键在于构建了一个独特的数据集,涵盖2019年至2024年间乌克兰和俄罗斯政策制定者的Telegram公开帖子,并通过分析通信量、主题内容及行动者参与度的变化,揭示了不同国家政策制定者在战争期间的传播行为差异——特别是乌克兰政策制定者从初期聚焦战争议题到后期关注度下降,而俄罗斯政策制定者则转向西方危机等非战争话题以转移公众注意力,同时识别出大党与小党、个体政策制定者之间的策略差异。这一方法为理解战时在线政治话语动态提供了实证依据。
链接: https://arxiv.org/abs/2510.11746
作者: Mykola Makhortykh,Aytalina Kulichkina,Kateryna Maikovska
机构: 未知
类目: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 46 pages
Abstract:This study examines elite-driven political communication on Telegram during the ongoing Russo-Ukrainian war, the first large-scale European war in the social media era. Using a unique dataset of Telegram public posts from Ukrainian and Russian policymakers (2019-2024), we analyze changes in communication volume, thematic content, and actor engagement following Russia’s 2022 full-scale invasion. Our findings show a sharp increase in Telegram activity after the invasion, particularly among ruling-party policymakers. Ukrainian policymakers initially focused on war-related topics, but this emphasis declined over time. In contrast, Russian policymakers largely avoided war-related discussions, instead emphasizing unrelated topics, such as Western crises, to distract public attention. We also identify differences in communication strategies between large and small parties, as well as individual policymakers. Our findings shed light on how policymakers adapt to wartime communication challenges and offer critical insights into the dynamics of online political discourse during times of war.
zh
[NLP-83] Celebrity Profiling on Short Urdu Text using Twitter Followers' Feed
【速读】: 该论文旨在解决在低资源语言 Urdu 中基于社交媒体文本实现名人画像(celebrity profiling)的问题,具体包括性别、年龄、职业和知名度等人口统计特征的预测。其关键解决方案在于构建并利用来自南亚地区名人粉丝的短篇 Urdu 推文数据集,通过多种机器学习与深度学习模型(如逻辑回归、支持向量机、随机森林、卷积神经网络和长短期记忆网络)对粉丝的语言特征进行建模,并发现这些语言特征能够有效用于预测名人的 demographic 属性,其中性别预测表现最佳(cRank = 0.65,准确率 = 0.65),验证了在低资源语言环境下,基于用户生成内容(UGC)的语义特征提取与分类方法的有效性。
链接: https://arxiv.org/abs/2510.11739
作者: Muhammad Hamza,Rizwan Jafar
机构: COMSATS University Islamabad, Lahore Campus
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Social media has become an essential part of the digital age, serving as a platform for communication, interaction, and information sharing. Celebrities are among the most active users and often reveal aspects of their personal and professional lives through online posts. Platforms such as Twitter provide an opportunity to analyze language and behavior for understanding demographic and social patterns. Since followers frequently share linguistic traits and interests with the celebrities they follow, textual data from followers can be used to predict celebrity demographics. However, most existing research in this field has focused on English and other high-resource languages, leaving Urdu largely unexplored. This study applies modern machine learning and deep learning techniques to the problem of celebrity profiling in Urdu. A dataset of short Urdu tweets from followers of subcontinent celebrities was collected and preprocessed. Multiple algorithms were trained and compared, including Logistic Regression, Support Vector Machines, Random Forests, Convolutional Neural Networks, and Long Short-Term Memory networks. The models were evaluated using accuracy, precision, recall, F1-score, and cumulative rank (cRank). The best performance was achieved for gender prediction with a cRank of 0.65 and an accuracy of 0.65, followed by moderate results for age, profession, and fame prediction. These results demonstrate that follower-based linguistic features can be effectively leveraged using machine learning and neural approaches for demographic prediction in Urdu, a low-resource language.
zh
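摘要中用作主指标的 cRank(cumulative rank)在 PAN 名人画像评测中通常定义为各子任务得分的调和平均;这里按该定义给出示意实现。本文是否采用完全相同的公式属笔者假设,示例数值亦为虚构。

```python
def c_rank(scores):
    """Harmonic mean of per-trait scores (e.g. gender, age, profession, fame).

    Follows the cRank definition from the PAN celebrity-profiling shared
    task; whether this paper uses the identical formula is an assumption.
    """
    if any(s <= 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)

# Illustrative numbers only, not taken from the paper's tables.
print(round(c_rank([0.65, 0.55, 0.50, 0.45]), 3))
```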
[NLP-84] Scaling Law in LLM Simulated Personality: More Detailed and Realistic Persona Profile Is All You Need
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在社会实验模拟中对人类人格特征的建模能力不足的问题,尤其是如何科学评估LLM在虚拟人格扮演中的真实性与一致性。其解决方案的关键在于提出了一套端到端的评估框架,包括个体层面的稳定性与可识别性分析,以及群体层面的“渐进人格曲线”(progressive personality curves),用以捕捉LLM在人格模拟上的改进趋势;同时修正了传统心理测量方法(如验证性因子分析CFA和构念效度)在低水平模拟阶段可能带来的误判,从而为LLM在社会科学实验中的应用提供了理论依据和可操作的评价指标。
链接: https://arxiv.org/abs/2510.11734
作者: Yuqi Bai,Tianyu Huang,Kun Sun,Yuting Chen
机构: Hebei Petroleum University of Technology (河北石油职业技术大学); Beijing Institute of Technology (北京理工大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:This research focuses on using large language models (LLMs) to simulate social experiments, exploring their ability to emulate human personality in virtual persona role-playing. The research develops an end-to-end evaluation framework, including individual-level analysis of stability and identifiability, as well as population-level analysis called progressive personality curves to examine the veracity and consistency of LLMs in simulating human personality. Methodologically, this research proposes important modifications to traditional psychometric approaches (CFA and construct validity) which are unable to capture improvement trends in LLMs at their current low-level simulation, potentially leading to premature rejection or methodological misalignment. The main contributions of this research are: proposing a systematic framework for LLM virtual personality evaluation; empirically demonstrating the critical role of persona detail in personality simulation quality; and identifying marginal utility effects of persona profiles, especially a Scaling Law in LLM personality simulation, offering operational evaluation metrics and a theoretical foundation for applying large language models in social science experiments.
zh
[NLP-85] LLMAtKGE: Large Language Models as Explainable Attackers against Knowledge Graph Embeddings
【速读】: 该论文旨在解决知识图谱嵌入(Knowledge Graph Embeddings, KGE)在对抗攻击中面临的两大挑战:一是现有黑盒攻击方法缺乏可解释性,无法生成人类可读的解释;二是其泛化能力较差。针对这些问题,论文提出了一种基于大语言模型(Large Language Models, LLMs)的新型框架 LLMAtKGE,其核心创新在于通过结构化提示(structured prompting)将攻击任务建模为多选题形式,并融合知识图谱中的事实证据以提供充分的上下文信息;同时引入基于语义和中心性的过滤机制,在受限输入窗口下压缩候选集并保留高召回率的攻击相关知识,辅以预计算的高阶邻接关系与三元组分类任务微调,有效提升LLM在整合语义与结构信息方面的过滤性能。实验表明,该方法不仅显著优于最强黑盒基线,还能生成具有推理逻辑的人类可读解释,且在白盒攻击场景中表现具有竞争力。
链接: https://arxiv.org/abs/2510.11584
作者: Ting Li,Yang Yang,Yipeng Yu,Liang Yao,Guoqing Chao,Ruifeng Xu
机构: Sun Yat-sen University (中山大学); Alibaba Inc. (阿里巴巴集团); Harbin Institute of Technology (哈尔滨工业大学)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: 13 pages
Abstract:Adversarial attacks on knowledge graph embeddings (KGE) aim to disrupt the model’s ability to perform link prediction by removing or inserting triples. A recent black-box method has attempted to incorporate textual and structural information to enhance attack performance. However, it is unable to generate human-readable explanations, and exhibits poor generalizability. In the past few years, large language models (LLMs) have demonstrated powerful capabilities in text comprehension, generation, and reasoning. In this paper, we propose LLMAtKGE, a novel LLM-based framework that selects attack targets and generates human-readable explanations. To provide the LLM with sufficient factual context under limited input constraints, we design a structured prompting scheme that explicitly formulates the attack as multiple-choice questions while incorporating KG factual evidence. To address the context-window limitation and hesitation issues, we introduce semantics-based and centrality-based filters, which compress the candidate set while preserving high recall of attack-relevant information. Furthermore, to efficiently integrate both semantic and structural information into the filter, we precompute high-order adjacency and fine-tune the LLM with a triple classification task to enhance filtering performance. Experiments on two widely used knowledge graph datasets demonstrate that our attack outperforms the strongest black-box baselines and provides explanations via reasoning, while showing competitive performance compared with white-box methods. Comprehensive ablation and case studies further validate its capability to generate explanations.
zh
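为说明上文“基于中心性的候选过滤”思想,这里给出一个只用度中心性的最小示意:围绕目标实体收集候选三元组,按另一端实体在图中的度数排序并截断。真实系统还结合了语义过滤与微调后的 LLM;此处的玩具知识图谱与排序规则均为假设。

```python
from collections import Counter

def centrality_filter(triples, target_entity, top_k=2):
    """Rank candidate triples around a target entity by degree centrality.

    A minimal stand-in for the paper's centrality-based filter; the full
    system also applies semantic filtering and a fine-tuned LLM.
    """
    degree = Counter()
    for h, _, t in triples:
        degree[h] += 1
        degree[t] += 1

    candidates = [tr for tr in triples if target_entity in (tr[0], tr[2])]

    def other_degree(tr):
        # Prefer triples whose other endpoint is highly connected.
        other = tr[2] if tr[0] == target_entity else tr[0]
        return degree[other]

    return sorted(candidates, key=other_degree, reverse=True)[:top_k]

kg = [
    ("paris", "capital_of", "france"),
    ("france", "member_of", "eu"),
    ("france", "borders", "spain"),
    ("spain", "member_of", "eu"),
    ("lyon", "located_in", "france"),
]
print(centrality_filter(kg, "france"))
```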
[NLP-86] RePro: Training Language Models to Faithfully Recycle the Web for Pretraining
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)预训练数据质量下降的问题,即高质量预训练数据资源日益枯竭,难以支撑前沿模型的持续发展。其核心解决方案是提出一种名为RePro的新型网页数据回收方法,关键在于通过强化学习训练一个相对较小的语言模型(LM)作为重述器(rephraser),以生成既高质量又忠实于原始语义的数据重写版本。具体而言,RePro设计了1个质量奖励和3个忠实度奖励机制,在优化过程中确保重述后的文本保留原数据的核心语义与结构,从而有效提升预训练数据的利用效率。实验表明,RePro在多个下游任务中相较纯有机数据基线实现4.7%-14.0%的相对准确率提升,并优于当前最先进的提示驱动式回收方法(ReWire),且能将有机数据使用效率提高2-3倍。
链接: https://arxiv.org/abs/2510.10681
作者: Zichun Yu,Chenyan Xiong
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:High-quality pretraining data is the fossil fuel of large language models (LLMs), yet its reserves are running low for frontier models. In this paper, we introduce RePro, a novel web recycling method that trains a relatively small LM with reinforcement learning to generate effective and faithful rephrasings of pretraining data. Specifically, we design one quality reward and three faithfulness rewards, optimizing the LM rephraser to convert organic data into high-quality rephrasings while maintaining its core semantics and structure. In our experiment, we train a 4B rephraser to recycle 72B tokens sampled from DCLM-RefinedWeb. Pretraining results on 400M and 1.4B models demonstrate that RePro delivers 4.7%-14.0% relative accuracy gains over organic-only baseline on 22 downstream tasks. RePro also outperforms ReWire, the state-of-the-art web recycling method that prompts a 70B rephraser, as well as the organic baseline with a 4x larger data pool. Experiments with different amounts of recycled data highlight that RePro improves organic data efficiency by 2-3x. Individual and distributional analyses validate that RePro preserves more critical information and faithfully reflects the characteristics of organic data compared to prompting-based methods. Together, these results show that RePro provides an efficient and controllable path to effectively harness the fossil fuel of LLM pretraining. We open-source our code, rephraser, and recycled data at this https URL.
zh
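上文提到 RePro 设计了 1 个质量奖励与 3 个忠实度奖励来优化重述模型。这里用一个加权求和的标量化示意说明这类多奖励组合方式;奖励名称、权重与聚合形式均为笔者为演示所作的假设,论文有其自身的奖励设计。

```python
def combined_reward(quality, faithfulness, w=(1.0, 1.0, 1.0, 1.0)):
    """Fold one quality reward and three faithfulness rewards into a scalar.

    The weighted sum, the weights, and the three faithfulness facets
    (semantics / structure / information) are illustrative assumptions.
    """
    r_sem, r_struct, r_info = faithfulness
    return w[0] * quality + w[1] * r_sem + w[2] * r_struct + w[3] * r_info

# One rephrased document scored by four reward models (fictional values).
print(combined_reward(quality=0.8, faithfulness=(0.9, 0.7, 0.85)))
```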
[NLP-87] TripScore: Benchmarking and rewarding real-world travel planning with fine-grained evaluation
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在旅行规划任务中面临的可行性、可靠性与用户参与度难以有效评估的问题。现有基准测试虽能衡量LLMs的规划能力,但在细粒度指标整合与实际应用适配方面存在不足。其解决方案的关键在于构建一个统一奖励机制的综合性评测基准,将多个细粒度评估标准(如行程合理性、执行可靠性及用户吸引力)融合为单一奖励信号,从而支持直接比较不同旅行计划的质量,并可无缝集成到强化学习(Reinforcement Learning, RL)框架中进行优化。该方法在专家标注上达到60.75%的一致性,优于多个基于LLM作为评判者的基线模型,并通过释放包含4,870个查询的大规模数据集(含219条真实自由格式请求)提升泛化能力,实验表明基于GRPO的强化学习策略显著优于仅靠提示工程或监督微调的方法,在提高行程可行性方面表现突出。
链接: https://arxiv.org/abs/2510.09011
作者: Yincen Qu,Huan Xiao,Feng Li,Gregory Li,Hui Zhou,Xiangying Dai
机构: Trip.com Group (携程集团); Chongqing University (重庆大学); Stanford University (斯坦福大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Travel planning is a valuable yet complex task that poses significant challenges even for advanced large language models (LLMs). While recent benchmarks have advanced in evaluating LLMs’ planning capabilities, they often fall short in evaluating feasibility, reliability, and engagement of travel plans. We introduce a comprehensive benchmark for travel planning that unifies fine-grained criteria into a single reward, enabling direct comparison of plan quality and seamless integration with reinforcement learning (RL). Our evaluator achieves moderate agreement with travel-expert annotations (60.75%) and outperforms multiple LLM-as-judge baselines. We further release a large-scale dataset of 4,870 queries including 219 real-world, free-form requests for generalization to authentic user intent. Using this benchmark, we conduct extensive experiments across diverse methods and LLMs, including test-time computation, neuro-symbolic approaches, supervised fine-tuning, and RL via GRPO. Across base models, RL generally improves itinerary feasibility over prompt-only and supervised baselines, yielding higher unified reward scores.
zh
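下面用一个极简示意说明“将细粒度评估标准统一为单一奖励”的做法:对一份行程的各项检查结果做加权平均,得到可直接用于强化学习的标量分数。检查项名称与权重均为笔者假设,论文的评分细则更为细致。

```python
def trip_reward(criteria, weights):
    """Fold fine-grained travel-plan checks into one scalar reward."""
    total_w = sum(weights.values())
    return sum(weights[k] * criteria[k] for k in weights) / total_w

plan_checks = {
    "feasibility": 1.0,  # opening hours and transit times hold
    "reliability": 0.8,  # POIs exist and are not duplicated
    "engagement": 0.6,   # matches the user's stated interests
}
weights = {"feasibility": 0.5, "reliability": 0.3, "engagement": 0.2}
print(round(trip_reward(plan_checks, weights), 3))  # 0.86
```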
[NLP-88] DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation
【速读】: 该论文旨在解决当前零样本文语转换(Text-to-Speech, TTS)系统在分布偏移下表现脆弱、可控性有限的问题。现有方法通常依赖于连续语音表示中的自回归(Autoregressive, AR)草图与扩散模型精修的混合架构,但这类方案难以适应不同数据分布且缺乏灵活的控制机制。其解决方案的关键在于提出DISTAR框架,该框架完全在离散残差向量量化(Residual Vector Quantization, RVQ)码空间中运行,并将AR语言模型与掩码扩散模型紧密耦合,无需强制对齐或持续预测器。具体而言,DISTAR首先利用AR模型生成块级RVQ token草稿,随后通过条件掩码扩散填充完成下一区块,实现块级并行推理以缓解经典AR暴露偏差问题;同时,离散码空间支持推理阶段显式控制:可通过贪婪解码或带无分类器引导(classifier-free guidance)的采样方式生成高质量音频,实现鲁棒性与多样性之间的权衡,并借助RVQ层剪枝实现可变比特率和可控计算量。
链接: https://arxiv.org/abs/2510.12210
作者: Yakun Song,Xiaobin Zhuang,Jiawei Chen,Zhikang Niu,Guanrou Yang,Chenpeng Du,Zhuo Chen,Yuping Wang,Yuxuan Wang,Xie Chen
机构: Shanghai Jiao Tong University (上海交通大学); ByteDance Inc. (字节跳动)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Recent attempts to interleave autoregressive (AR) sketchers with diffusion-based refiners over continuous speech representations have shown promise, but they remain brittle under distribution shift and offer limited levers for controllability. We introduce DISTAR, a zero-shot text-to-speech framework that operates entirely in a discrete residual vector quantization (RVQ) code space and tightly couples an AR language model with a masked diffusion model, without forced alignment or a duration predictor. Concretely, DISTAR drafts block-level RVQ tokens with an AR language model and then performs parallel masked-diffusion infilling conditioned on the draft to complete the next block, yielding long-form synthesis with blockwise parallelism while mitigating classic AR exposure bias. The discrete code space affords explicit control at inference: DISTAR produces high-quality audio under both greedy and sample-based decoding using classifier-free guidance, supports trade-offs between robustness and diversity, and enables variable bit-rate and controllable computation via RVQ layer pruning at test time. Extensive experiments and ablations demonstrate that DISTAR surpasses state-of-the-art zero-shot TTS systems in robustness, naturalness, and speaker/style consistency, while maintaining rich output diversity. Audio samples are provided on this https URL.
zh
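摘要提到 DISTAR 可在测试时通过裁剪 RVQ 层实现可变比特率与可控计算量。下面给出该思想的最小示意:RVQ 码按由粗到细分层,直接截断后几层即可降低比特率。数组形状与层数均为演示假设。

```python
import numpy as np

def prune_rvq_layers(codes, keep_layers):
    """Variable bit-rate at test time by dropping the last RVQ layers.

    `codes` has shape (n_layers, n_frames); residual layers go from
    coarse to fine, so truncation trades fidelity for compute. Shapes
    and layer count here are illustrative assumptions.
    """
    return codes[:keep_layers]

codes = np.random.randint(0, 1024, size=(8, 100))  # 8 RVQ layers, 100 frames
low_bitrate = prune_rvq_layers(codes, keep_layers=4)
print(codes.shape, "->", low_bitrate.shape)  # (8, 100) -> (4, 100)
```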
计算机视觉
[CV-0] DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在真实应用场景中面临的信息获取效率低、响应动态信息能力弱的问题,尤其是在处理知识密集型用户查询时,现有方法如检索增强生成(Retrieval Augmented Generation, RAG)、搜索代理和配备搜索功能的MLLMs常因固定流水线、冗余搜索调用及低效查询构建导致性能不佳。其解决方案的关键在于提出DeepMMSearch-R1,这是首个具备按需多轮网络搜索能力并能动态生成图像与文本搜索查询的多模态大模型;它通过基于输入图像相关区域触发图像搜索提升检索有效性,并结合检索结果迭代优化文本查询,实现自我反思与纠错机制。该方案采用两阶段训练流程:冷启动监督微调+在线强化学习优化,并引入DeepMMSearchVQA这一新型多模态视觉问答数据集,包含融合文本与视觉信息的多跳问题,以指导模型决策何时搜索、搜什么、用何种工具以及如何推理检索内容,从而显著提升模型在知识密集型任务中的表现。
链接: https://arxiv.org/abs/2510.12801
作者: Kartik Narayan,Yang Xu,Tian Cao,Kavya Nerella,Vishal M. Patel,Navid Shiee,Peter Grasch,Chao Jia,Yinfei Yang,Zhe Gan
机构: Johns Hopkins University (约翰霍普金斯大学); Apple (苹果公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
备注:
Abstract:Multimodal Large Language Models (MLLMs) in real-world applications require access to external knowledge sources and must remain responsive to the dynamic and ever-changing real-world information in order to address information-seeking and knowledge-intensive user queries. Existing approaches, such as retrieval augmented generation (RAG) methods, search agents, and search-equipped MLLMs, often suffer from rigid pipelines, excessive search calls, and poorly constructed search queries, which result in inefficiencies and suboptimal outcomes. To address these limitations, we present DeepMMSearch-R1, the first multimodal LLM capable of performing on-demand, multi-turn web searches and dynamically crafting queries for both image and text search tools. Specifically, DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image, making the image search more effective, and can iteratively adapt text search queries based on retrieved information, thereby enabling self-reflection and self-correction. Our approach relies on a two-stage training pipeline: a cold start supervised finetuning phase followed by an online reinforcement learning optimization. For training, we introduce DeepMMSearchVQA, a novel multimodal VQA dataset created through an automated pipeline intermixed with real-world information from web search tools. This dataset contains diverse, multi-hop queries that integrate textual and visual information, teaching the model when to search, what to search for, which search tool to use and how to reason over the retrieved information. We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach. Finally, we analyze the results and provide insights that are valuable for advancing multimodal web-search.
zh
[CV-1] Detect Anything via Next Point Prediction
【速读】:该论文旨在解决当前基于传统坐标回归(coordinate regression)的检测模型(如YOLO、DETR和Grounding DINO)在引入多模态大语言模型(Multimodal Large Language Models, MLLMs)后所面临的性能瓶颈问题,包括召回率低、预测重复、坐标错位等挑战。其核心解决方案是提出一个3B规模的MLLM——Rex-Omni,通过三项关键设计实现性能突破:1)任务建模创新,使用特殊标记表示0到999范围内的量化坐标,降低学习难度并提升坐标预测的token效率;2)构建多种数据引擎生成高质量的指代表达(grounding)、指代(referring)和指向(pointing)数据,提供语义丰富的监督信号;3)采用两阶段训练流程,先在2200万样本上进行监督微调(Supervised Fine-Tuning, SFT),再通过基于GRPO(Group Relative Policy Optimization)的强化学习后训练,利用几何感知奖励机制弥合离散到连续坐标预测的鸿沟,提升边界框精度并抑制因教师引导导致的重复预测等不良行为。
链接: https://arxiv.org/abs/2510.12798
作者: Qing Jiang,Junan Huo,Xingyu Chen,Yuda Xiong,Zhaoyang Zeng,Yihao Chen,Tianhe Ren,Junzhi Yu,Lei Zhang
机构: International Digital Economy Academy (IDEA)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: homepage: this https URL
Abstract:Object detection has long been dominated by traditional coordinate regression-based models, such as YOLO, DETR, and Grounding DINO. Although recent efforts have attempted to leverage MLLMs to tackle this task, they face challenges like low recall rate, duplicate predictions, coordinate misalignment, etc. In this work, we bridge this gap and propose Rex-Omni, a 3B-scale MLLM that achieves state-of-the-art object perception performance. On benchmarks like COCO and LVIS, Rex-Omni attains performance comparable to or exceeding regression-based models (e.g., DINO, Grounding DINO) in a zero-shot setting. This is enabled by three key designs: 1) Task Formulation: we use special tokens to represent quantized coordinates from 0 to 999, reducing the model’s learning difficulty and improving token efficiency for coordinate prediction; 2) Data Engines: we construct multiple data engines to generate high-quality grounding, referring, and pointing data, providing semantically rich supervision for training; 3) Training Pipelines: we employ a two-stage training process, combining supervised fine-tuning on 22 million data with GRPO-based reinforcement post-training. This RL post-training leverages geometry-aware rewards to effectively bridge the discrete-to-continuous coordinate prediction gap, improve box accuracy, and mitigate undesirable behaviors like duplicate predictions that stem from the teacher-guided nature of the initial SFT stage. Beyond conventional detection, Rex-Omni’s inherent language understanding enables versatile capabilities such as object referring, pointing, visual prompting, GUI grounding, spatial referring, OCR and key-pointing, all systematically evaluated on dedicated benchmarks. We believe that Rex-Omni paves the way for more versatile and language-aware visual perception systems.
zh
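为说明摘要中“用特殊标记表示 0 到 999 量化坐标”的任务建模方式,这里给出坐标与 token 之间互相转换的最小示意;Rex-Omni 实际的词表与序列格式更为复杂,此处仅演示量化本身。

```python
def box_to_tokens(box, width, height, n_bins=1000):
    """Quantize a pixel-space box into coordinate tokens in [0, 999]."""
    def q(v, size):
        return min(n_bins - 1, int(v / size * n_bins))
    x0, y0, x1, y1 = box
    return [q(x0, width), q(y0, height), q(x1, width), q(y1, height)]

def tokens_to_box(tokens, width, height, n_bins=1000):
    """Map coordinate tokens back to (approximate) pixel coordinates."""
    x0, y0, x1, y1 = tokens
    return (x0 / n_bins * width, y0 / n_bins * height,
            x1 / n_bins * width, y1 / n_bins * height)

tok = box_to_tokens((120, 60, 480, 300), width=640, height=480)
print(tok)                           # [187, 125, 750, 625]
print(tokens_to_box(tok, 640, 480))  # close to the original box
```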
[CV-2] DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在大规模数据训练中面临的“监督不足”问题,即模型容量庞大但仅由稀疏、低维的动作信号进行监督,导致其表征能力未被充分挖掘。解决方案的关键在于提出DriveVLA-W0训练范式,通过引入世界建模(world modeling)任务来预测未来图像,从而生成密集的自监督信号,迫使模型学习驾驶环境的动力学特征;同时,基于世界建模所获得的丰富表征,进一步引入轻量级动作专家以降低推理延迟,实现高效实时部署。该方法显著提升了VLA模型在NAVSIM和更大规模自建数据集上的性能,并强化了数据扩展规律,使模型性能随训练数据增长而加速提升。
链接: https://arxiv.org/abs/2510.12796
作者: Yingyan Li,Shuyao Shang,Weisong Liu,Bing Zhan,Haochen Wang,Yuqi Wang,Yuntao Chen,Xiaoman Wang,Yasong An,Chufeng Tang,Lu Hou,Lue Fan,Zhaoxiang Zhang
机构: NLPR, Institute of Automation, Chinese Academy of Sciences (CASIA); Yinwang Intelligent Technology Co. Ltd.
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Scaling Vision-Language-Action (VLA) models on large-scale data offers a promising path to achieving a more generalized driving intelligence. However, VLA models are limited by a “supervision deficit”: the vast model capacity is supervised by sparse, low-dimensional actions, leaving much of their representational power underutilized. To remedy this, we propose DriveVLA-W0, a training paradigm that employs world modeling to predict future images. This task generates a dense, self-supervised signal that compels the model to learn the underlying dynamics of the driving environment. We showcase the paradigm’s versatility by instantiating it for two dominant VLA archetypes: an autoregressive world model for VLAs that use discrete visual tokens, and a diffusion world model for those operating on continuous visual features. Building on the rich representations learned from world modeling, we introduce a lightweight action expert to address the inference latency for real-time deployment. Extensive experiments on the NAVSIM v1/v2 benchmark and a 680x larger in-house dataset demonstrate that DriveVLA-W0 significantly outperforms BEV and VLA baselines. Crucially, it amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases.
zh
[CV-3] CuMPerLay: Learning Cubical Multiparameter Persistence Vectorizations ICCV2025
【速读】:该论文旨在解决将立方体多参数持久同调(Cubical Multiparameter Persistence, CMP)有效集成到深度学习流水线中的难题,其核心挑战在于多滤波结构的复杂性以及CMP的向量化表示困难。解决方案的关键是提出了一种可微分的向量化层CuMPerLay,该层将CMP分解为一系列可学习的单参数持久同调,并联合优化双滤波函数,从而实现对全局拓扑特征的有效提取与端到端训练。该方法在理论上保证了在广义Wasserstein度量下的稳定性,并在医学影像和计算机视觉基准数据集上验证了其在小样本场景下提升分类与分割性能的能力。
链接: https://arxiv.org/abs/2510.12795
作者: Caner Korkmaz,Brighton Nuwagira,Barış Coşkunuzer,Tolga Birdal
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Algebraic Topology (math.AT); Machine Learning (stat.ML)
备注: Appears at ICCV 2025
Abstract:We present CuMPerLay, a novel differentiable vectorization layer that enables the integration of Cubical Multiparameter Persistence (CMP) into deep learning pipelines. While CMP presents a natural and powerful way to topologically work with images, its use is hindered by the complexity of multifiltration structures as well as the vectorization of CMP. In face of these challenges, we introduce a new algorithm for vectorizing MP homologies of cubical complexes. Our CuMPerLay decomposes the CMP into a combination of individual, learnable single-parameter persistence, where the bifiltration functions are jointly learned. Thanks to the differentiability, its robust topological feature vectors can be seamlessly used within state-of-the-art architectures such as Swin Transformers. We establish theoretical guarantees for the stability of our vectorization under generalized Wasserstein metrics. Our experiments on benchmark medical imaging and computer vision datasets show the benefit CuMPerLay on classification and segmentation performance, particularly in limited-data scenarios. Overall, CuMPerLay offers a promising direction for integrating global structural information into deep networks for structured image analysis.
zh
[CV-4] ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)因图像输入引入额外视觉标记(vision tokens)而导致的推理成本增加问题。解决方案的关键在于提出一种名为视觉一致性学习(Visual Consistency Learning, ViCO)的新型训练算法,其核心思想是利用多个具有不同图像压缩比的多层感知机(MLP)连接器,根据图像语义复杂度动态下采样视觉标记;在训练阶段最小化不同MLP连接器条件下的响应之间的KL散度,在推理阶段引入视觉分辨率路由器(Visual Resolution Router, ViR)自动为每个图像块选择合适的压缩率,从而实现基于语义复杂度而非图像分辨率的动态视觉标记数量调整,实验表明该方法可在保持模型感知、推理和OCR能力的同时将视觉标记数量减少高达50%。
链接: https://arxiv.org/abs/2510.12793
作者: Long Cui,Weiyun Wang,Jie Shao,Zichen Wen,Gen Luo,Linfeng Zhang,Yanting Zhang,Yu Qiao,Wenhai Wang
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Artificial Intelligence Laboratory; Fudan University (复旦大学); Nanjing University (南京大学); Donghua University (东华大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing Multimodal Large Language Models (MLLMs) suffer from increased inference costs due to the additional vision tokens introduced by image inputs. In this work, we propose Visual Consistency Learning (ViCO), a novel training algorithm that enables the model to represent images of varying semantic complexities using different numbers of vision tokens. The key idea behind our method is to employ multiple MLP connectors, each with a different image compression ratio, to downsample the vision tokens based on the semantic complexity of the image. During training, we minimize the KL divergence between the responses conditioned on different MLP connectors. At inference time, we introduce an image router, termed Visual Resolution Router (ViR), that automatically selects the appropriate compression rate for each image patch. Compared with existing dynamic high-resolution strategies, which adjust the number of visual tokens based on image resolutions, our method dynamically adapts the number of visual tokens according to semantic complexity. Experimental results demonstrate that our method can reduce the number of vision tokens by up to 50% while maintaining the model’s perception, reasoning, and OCR capabilities. We hope this work will contribute to the development of more efficient MLLMs. The code and models will be released to facilitate future research.
zh
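下面给出 ViCO 式一致性目标的最小 PyTorch 示意:把高压缩率连接器条件下的输出分布,向低压缩率参考分布拉近(KL 散度)。张量形状、KL 的方向与具体实现均为笔者假设。

```python
import torch
import torch.nn.functional as F

def consistency_kl(logits_ref, logits_cmp):
    """KL between responses conditioned on different MLP connectors.

    The compressed view is pulled toward the reference view; the KL
    direction and tensor shapes are illustrative assumptions.
    """
    log_p_ref = F.log_softmax(logits_ref, dim=-1)
    log_p_cmp = F.log_softmax(logits_cmp, dim=-1)
    # KL(ref || cmp), averaged over the batch.
    return F.kl_div(log_p_cmp, log_p_ref, reduction="batchmean", log_target=True)

logits_a = torch.randn(4, 32000)  # (batch, vocab) via the 1x connector
logits_b = torch.randn(4, 32000)  # via the higher-compression connector
print(consistency_kl(logits_a, logits_b).item())
```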
[CV-5] UniFusion: Vision-Language Model as Unified Encoder in Image Generation
【速读】:该论文旨在解决当前扩散模型在跨模态推理与知识迁移能力上的局限性,其根源在于现有架构普遍依赖独立的图像和文本编码器,导致难以实现高效的多模态信息融合。解决方案的关键在于提出UniFusion框架,其核心创新是Layerwise Attention Pooling(LAP)机制,该机制从冻结的大规模视觉语言模型(VLM)中提取文本与视觉token的多层次语义与细节信息,用以条件化扩散生成模型;同时引入VLM-Enabled Rewriting Injection with Flexible Inference(VERIFI),仅利用VLM在推理阶段生成的文本token对扩散Transformer(DiT)进行条件控制,从而结合VLM的推理能力与条件分布对齐优势,提升生成质量与编辑灵活性。实验表明,该设计不仅显著增强文本-图像对齐效果,还实现了跨任务的零样本泛化能力,验证了统一编码器架构的有效性。
链接: https://arxiv.org/abs/2510.12789
作者: Kevin Li,Manuel Brack,Sudeep Katakol,Hareesh Ravi,Ajinkya Kale
机构: Adobe Applied Research (Adobe应用研究)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page at this https URL
Abstract:Although recent advances in visual generation have been remarkable, most existing architectures still depend on distinct encoders for images and text. This separation constrains diffusion models’ ability to perform cross-modal reasoning and knowledge transfer. Prior attempts to bridge this gap often use the last-layer information from the VLM, employ multiple visual encoders, or train large unified models jointly for text and image generation, which demands substantial computational resources and large-scale data, limiting its applicability. We present UniFusion, a diffusion-based generative model conditioned on a frozen large vision-language model (VLM) that serves as a unified multimodal encoder. At the core of UniFusion is the Layerwise Attention Pooling (LAP) mechanism that extracts both high-level semantics and low-level details from text and visual tokens of a frozen VLM to condition a diffusion generative model. We demonstrate that LAP outperforms other shallow fusion architectures on text-image alignment for generation and faithful transfer of visual information from the VLM to the diffusion model, which is key for editing. We propose VLM-Enabled Rewriting Injection with Flexible Inference (VERIFI), which conditions a diffusion transformer (DiT) only on the text tokens generated by the VLM during in-model prompt rewriting. VERIFI combines the alignment of the conditioning distribution with the VLM’s reasoning capabilities for increased capabilities and flexibility at inference. In addition, finetuning on the editing task not only improves text-image alignment for generation, indicative of cross-modality knowledge transfer, but also exhibits tremendous generalization capabilities. Our model, when trained on single-image editing, zero-shot generalizes to multiple image references, further motivating the unified encoder design of UniFusion.
zh
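下面给出 Layerwise Attention Pooling 思想的最小 PyTorch 示意:对冻结 VLM 各层在同一 token 位置的隐状态堆叠,用可学习查询做跨层注意力池化,从而同时混合低层细节与高层语义。维度、头数等均为演示假设。

```python
import torch
import torch.nn as nn

class LayerwiseAttentionPooling(nn.Module):
    """Pool a frozen VLM's per-layer token features with learned attention.

    Minimal sketch of the LAP idea: a learnable query attends over the
    stack of layer outputs at each token position. Dimensions are
    illustrative assumptions, not UniFusion's actual configuration.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (batch * seq, n_layers, dim), one stack per token.
        q = self.query.expand(layer_states.size(0), -1, -1)
        pooled, _ = self.attn(q, layer_states, layer_states)
        return pooled.squeeze(1)  # (batch * seq, dim)

b, s, n_layers, d = 2, 16, 24, 512
states = torch.randn(b * s, n_layers, d)  # hidden states from 24 frozen layers
print(LayerwiseAttentionPooling(d)(states).shape)  # torch.Size([32, 512])
```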
[CV-6] Efficient Real-World Deblurring using Single Images: AIM 2025 Challenge Report ICCV2025
【速读】:该论文旨在解决高效真实世界图像去模糊(real-world image deblurring)问题,即在严格计算资源约束下(模型参数少于500万,计算量低于200 GMACs)实现高质量的图像去模糊。解决方案的关键在于设计轻量化且高效的网络架构,在保证去模糊性能的同时满足实际部署所需的低复杂度要求,其中最优方法在RSBlur数据集上达到了31.1298 dB的PSNR,验证了高效方法在真实场景下的可行性与潜力。
链接: https://arxiv.org/abs/2510.12788
作者: Daniel Feijoo,Paula Garrido-Mellado,Marcos V. Conde,Jaesung Rim,Alvaro Garcia,Sunghyun Cho,Radu Timofte
机构: Cidaut AI(西班牙); POSTECH(韩国); University of Würzburg(德国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025 - AIM Workshop
Abstract:This paper reviews the AIM 2025 Efficient Real-World Deblurring using Single Images Challenge, which aims to advance efficient real-blur restoration. The challenge is built around a new test set derived from the well-known RSBlur dataset. Pairs of blur and degraded images in this dataset are captured using a double-camera system. Participants were tasked with developing solutions to effectively deblur this type of image while fulfilling strict efficiency constraints: fewer than 5 million model parameters and a computational budget under 200 GMACs. A total of 71 participants registered, with 4 teams finally submitting valid solutions. The top-performing approach achieved a PSNR of 31.1298 dB, showcasing the potential of efficient methods in this domain. This paper provides a comprehensive overview of the challenge, compares the proposed solutions, and serves as a valuable reference for researchers in efficient real-world image deblurring.
zh
[CV-7] MVP4D: Multi-View Portrait Video Diffusion for Animatable 4D Avatars
【速读】:该论文旨在解决当前生成式数字人(Digital Human Avatar)在单张参考图像基础上生成高质量、多视角、可动画化视频时存在的局限性问题,即现有方法缺乏多视角信息约束和显式的三维表示,导致从偏离参考视角的视角渲染时图像质量与真实感显著下降。解决方案的关键在于提出MVP4D模型,该模型基于先进的预训练视频扩散模型(Video Diffusion Model),能够从单张参考图像和目标表情输入中同时生成覆盖360度视角的数百帧多视角动画视频,并通过知识蒸馏(Knowledge Distillation)将输出结果压缩为可实时渲染的4D数字人表示,从而在视觉真实性、时间一致性及三维一致性方面显著优于以往方法。
链接: https://arxiv.org/abs/2510.12785
作者: Felix Taubner,Ruihang Zhang,Mathieu Tuli,Sherwin Bahmani,David B. Lindell
机构: University of Toronto (多伦多大学); Vector Institute (矢量研究所); LG Electronics (LG电子公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: 18 pages, 12 figures
Abstract:Digital human avatars aim to simulate the dynamic appearance of humans in virtual environments, enabling immersive experiences across gaming, film, virtual reality, and more. However, the conventional process for creating and animating photorealistic human avatars is expensive and time-consuming, requiring large camera capture rigs and significant manual effort from professional 3D artists. With the advent of capable image and video generation models, recent methods enable automatic rendering of realistic animated avatars from a single casually captured reference image of a target subject. While these techniques significantly lower barriers to avatar creation and offer compelling realism, they lack constraints provided by multi-view information or an explicit 3D representation. So, image quality and realism degrade when rendered from viewpoints that deviate strongly from the reference image. Here, we build a video model that generates animatable multi-view videos of digital humans based on a single reference image and target expressions. Our model, MVP4D, is based on a state-of-the-art pre-trained video diffusion model and generates hundreds of frames simultaneously from viewpoints varying by up to 360 degrees around a target subject. We show how to distill the outputs of this model into a 4D avatar that can be rendered in real-time. Our approach significantly improves the realism, temporal consistency, and 3D consistency of generated avatars compared to previous methods.
zh
[CV-8] What If: Understanding Motion Through Sparse Interactions
【速读】:该论文旨在解决物理场景动态变化的多模态建模问题,特别是如何从局部交互(称为“pokes”)出发,准确预测场景中局部运动的概率分布,从而捕捉场景动力学的不确定性与依赖关系。传统方法通常仅能生成单一实现的密集运动采样,难以表达复杂场景的多种可能演化路径。其解决方案的关键在于提出Flow Poke Transformer (FPT) 框架,通过直接建模条件于稀疏交互的局部运动分布,提供可解释且显式表示的多模态运动预测,同时具备良好的泛化能力,可在密集人脸运动生成、关节物体运动估计及从pokes进行移动部件分割等下游任务中超越现有方法。
链接: https://arxiv.org/abs/2510.12777
作者: Stefan Andreas Baumann,Nick Stracke,Timy Phan,Björn Ommer
机构: CompVis @ LMU Munich (计算机视觉组 @ 慕尼黑大学); Munich Center for Machine Learning (慕尼黑机器学习中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page and code: this https URL
Abstract:Understanding the dynamics of a physical scene involves reasoning about the diverse ways it can potentially change, especially as a result of local interactions. We present the Flow Poke Transformer (FPT), a novel framework for directly predicting the distribution of local motion, conditioned on sparse interactions termed “pokes”. Unlike traditional methods that typically only enable dense sampling of a single realization of scene dynamics, FPT provides an interpretable, directly accessible representation of multi-modal scene motion, its dependency on physical interactions, and the inherent uncertainties of scene dynamics. We also evaluate our model on several downstream tasks to enable comparisons with prior methods and highlight the flexibility of our approach. On dense face motion generation, our generic pre-trained model surpasses specialized baselines. FPT can be fine-tuned in strongly out-of-distribution tasks such as synthetic datasets to enable significant improvements over in-domain methods in articulated object motion estimation. Additionally, predicting explicit motion distributions directly enables our method to achieve competitive performance on tasks like moving part segmentation from pokes, which further demonstrates the versatility of our FPT. Code and models are publicly available at this https URL.
zh
[CV-9] Uncertainty Matters in Dynamic Gaussian Splatting for Monocular 4D Reconstruction
【速读】:该论文旨在解决单目输入下动态三维场景重建中存在的根本性欠约束问题,尤其针对遮挡和极端视角下因观测不足导致的运动漂移(motion drift)与合成质量下降的问题。现有动态高斯点绘(Dynamic Gaussian Splatting)方法对所有高斯基元进行均匀优化,忽视了其可观测性的差异,从而在遮挡或新视角下难以保持几何稳定性与视觉一致性。解决方案的关键在于引入不确定性感知机制:通过估计随时间变化的每个高斯基元的不确定性,并构建时空图结构以实现不确定性加权的优化策略,从而将高置信度、多视角重复观测的高斯基元作为可靠运动锚点,引导整体重建过程,显著提升在遮挡条件下的几何稳定性和极端视角下的图像合成质量。
链接: https://arxiv.org/abs/2510.12768
作者: Fengzhi Guo,Chih-Chuan Hsu,Sihao Ding,Cheng Zhang
机构: Texas A&M University (德克萨斯农工大学); Mercedes-Benz North America (梅赛德斯-奔驰北美公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
备注: Project page: this https URL
Abstract:Reconstructing dynamic 3D scenes from monocular input is fundamentally under-constrained, with ambiguities arising from occlusion and extreme novel views. While dynamic Gaussian Splatting offers an efficient representation, vanilla models optimize all Gaussian primitives uniformly, ignoring whether they are well or poorly observed. This limitation leads to motion drifts under occlusion and degraded synthesis when extrapolating to unseen views. We argue that uncertainty matters: Gaussians with recurring observations across views and time act as reliable anchors to guide motion, whereas those with limited visibility are treated as less reliable. To this end, we introduce USplat4D, a novel Uncertainty-aware dynamic Gaussian Splatting framework that propagates reliable motion cues to enhance 4D reconstruction. Our key insight is to estimate time-varying per-Gaussian uncertainty and leverage it to construct a spatio-temporal graph for uncertainty-aware optimization. Experiments on diverse real and synthetic datasets show that explicitly modeling uncertainty consistently improves dynamic Gaussian Splatting models, yielding more stable geometry under occlusion and high-quality synthesis at extreme viewpoints.
zh
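下面以异方差加权(类似 Kendall 和 Gal 的做法)的最小示意说明“不确定性感知优化”:为每个高斯基元学习对数方差,高不确定性基元对残差损失的贡献被自动降低。该具体形式是笔者的替代示例,并非论文的实际目标函数。

```python
import torch

def uncertainty_weighted_loss(residuals: torch.Tensor, log_var: torch.Tensor):
    """Down-weight poorly observed Gaussians via per-primitive uncertainty.

    Heteroscedastic weighting sketch: high-uncertainty primitives contribute
    less to the squared-residual term; the log-variance regularizer keeps
    uncertainty from growing unboundedly. An illustrative assumption, not
    USplat4D's exact objective.
    """
    precision = torch.exp(-log_var)
    return (precision * residuals.pow(2) + log_var).mean()

res = torch.randn(1000)                          # per-Gaussian residuals
log_var = torch.zeros(1000, requires_grad=True)  # learned per-Gaussian uncertainty
loss = uncertainty_weighted_loss(res, log_var)
loss.backward()
print(loss.item(), log_var.grad.abs().mean().item())
```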
[CV-10] Efficient Perceptual Image Super Resolution: AIM 2025 Study and Benchmark ICCV2025
【速读】:该论文旨在解决高效感知超分辨率(Efficient Perceptual Super-Resolution, EPSR)中感知质量与模型效率之间的矛盾问题,即现有方法在保持高感知质量的同时往往计算复杂度较高,难以满足实际部署的参数量(≤5M)和浮点运算次数(≤2000 GFLOPs)约束。其解决方案的关键在于设计一种能够在严格效率限制下实现或超越Real-ESRGAN感知性能的模型架构,通过在自建的500张4K分辨率测试图像数据集上进行评估,验证了所提方法在多种退化类型下的鲁棒性和优越性,从而确立了高效感知超分辨率的现代基准。
链接: https://arxiv.org/abs/2510.12765
作者: Bruno Longarela,Marcos V. Conde,Alvaro Garcia,Radu Timofte
机构: Cidaut AI; University of Würzburg (维尔茨堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025 - AIM Workshop
Abstract:This paper presents a comprehensive study and benchmark on Efficient Perceptual Super-Resolution (EPSR). While significant progress has been made in efficient PSNR-oriented super resolution, approaches focusing on perceptual quality metrics remain relatively inefficient. Motivated by this gap, we aim to replicate or improve the perceptual results of Real-ESRGAN while meeting strict efficiency constraints: a maximum of 5M parameters and 2000 GFLOPs, calculated for an input size of 960x540 pixels. The proposed solutions were evaluated on a novel dataset consisting of 500 test images of 4K resolution, each degraded using multiple degradation types, without providing the original high-quality counterparts. This design aims to reflect realistic deployment conditions and serves as a diverse and challenging benchmark. The top-performing approach manages to outperform Real-ESRGAN across all benchmark datasets, demonstrating the potential of efficient methods in the perceptual domain. This paper establishes the modern baselines for efficient perceptual super resolution.
zh
[CV-11] AnyUp: Universal Feature Upsampling
【速读】:该论文旨在解决现有基于学习的特征上采样方法在不同视觉特征提取器(如DINO或CLIP)之间缺乏泛化能力的问题,即这些方法通常需要针对每个特定编码器重新训练,无法在推理阶段直接适用于多种类型的特征。解决方案的关键在于提出一种推理时特征无关(feature-agnostic)的上采样架构AnyUp,该架构无需针对特定编码器进行训练即可对任意分辨率的视觉特征进行高效且语义保持的上采样,从而显著提升上采样质量并扩展到多种下游任务中。
链接: https://arxiv.org/abs/2510.12764
作者: Thomas Wimmer,Prune Truong,Marie-Julie Rakotosaona,Michael Oechsle,Federico Tombari,Bernt Schiele,Jan Eric Lenssen
机构: Max Planck Institute for Informatics (马克斯普朗克信息研究所); ETH Zurich (苏黎世联邦理工学院); Google(谷歌); TU Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Project Website: this https URL
Abstract:We introduce AnyUp, a method for feature upsampling that can be applied to any vision feature at any resolution, without encoder-specific training. Existing learning-based upsamplers for features like DINO or CLIP need to be re-trained for every feature extractor and thus do not generalize to different feature types at inference time. In this work, we propose an inference-time feature-agnostic upsampling architecture to alleviate this limitation and improve upsampling quality. In our experiments, AnyUp sets a new state of the art for upsampled features, generalizes to different feature types, and preserves feature semantics while being efficient and easy to apply to a wide range of downstream tasks.
zh
[CV-12] PET Head Motion Estimation Using Supervised Deep Learning with Attention
【速读】:该论文旨在解决脑部正电子发射断层成像(PET)中因头部运动导致的图像伪影和示踪剂摄取定量不准确问题,这严重影响了神经疾病诊断的准确性。其解决方案的关键在于提出一种基于深度学习的头部运动校正方法(DL-HMC++),该方法通过交叉注意力机制从一秒内的3D PET原始数据中预测刚性头部运动,并在监督学习框架下利用已有的动态PET扫描数据与外部硬件运动追踪(HMT)提供的金标准运动测量进行训练。该方法在两种PET扫描仪(HRRT和mCT)及四种放射性示踪剂上验证了有效性与泛化能力,结果表明其能显著优于现有数据驱动的运动估计方法,生成的图像结构清晰、伪影减少,且与金标准HMT相比区域感兴趣区标准化摄取值差异极小(HRRT平均差值为1.2%±0.5%,mCT为0.5%±0.2%),从而有望替代依赖硬件的运动追踪,使运动校正技术更适用于临床常规场景。
链接: https://arxiv.org/abs/2510.12758
作者: Zhuotong Cai,Tianyi Zeng,Jiazhen Zhang,Eléonore V. Lieffrig,Kathryn Fontaine,Chenyu You,Enette Mae Revilla,James S. Duncan,Jingmin Xin,Yihuan Lu,John A. Onofrey
机构: Xi’an Jiaotong University (西安交通大学); Yale University (耶鲁大学); United Imaging Healthcare (联影医疗)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in IEEE Transactions on Medical Imaging (TMI), 2025. This is the accepted manuscript version
Abstract:Head movement poses a significant challenge in brain positron emission tomography (PET) imaging, resulting in image artifacts and tracer uptake quantification inaccuracies. Effective head motion estimation and correction are crucial for precise quantitative image analysis and accurate diagnosis of neurological disorders. Hardware-based motion tracking (HMT) has limited applicability in real-world clinical practice. To overcome this limitation, we propose a deep-learning head motion correction approach with cross-attention (DL-HMC++) to predict rigid head motion from one-second 3D PET raw data. DL-HMC++ is trained in a supervised manner by leveraging existing dynamic PET scans with gold-standard motion measurements from external HMT. We evaluate DL-HMC++ on two PET scanners (HRRT and mCT) and four radiotracers (18F-FDG, 18F-FPEB, 11C-UCB-J, and 11C-LSN3172176) to demonstrate the effectiveness and generalization of the approach in large cohort PET studies. Quantitative and qualitative results demonstrate that DL-HMC++ consistently outperforms state-of-the-art data-driven motion estimation methods, producing motion-free images with clear delineation of brain structures and reduced motion artifacts that are indistinguishable from gold-standard HMT. Brain region of interest standard uptake value analysis exhibits average difference ratios between DL-HMC++ and gold-standard HMT of 1.2 ± 0.5% for HRRT and 0.5 ± 0.2% for mCT. DL-HMC++ demonstrates the potential for data-driven PET head motion correction to remove the burden of HMT, making motion correction accessible to clinical populations beyond research settings. The code is available at this https URL.
zh
[CV-13] E-MoFlow: Learning Egomotion and Optical Flow from Event Data via Implicit Regularization NEURIPS2025
【速读】:该论文旨在解决事件相机(event camera)在无监督条件下同时估计光学流(optical flow)与6-DoF自身运动(egomotion)的病态问题(ill-posed challenge),传统方法因缺乏鲁棒的数据关联而难以有效分离这两个任务。解决方案的关键在于提出一种完全无监督的联合优化框架E-MoFlow,其核心创新包括:1)通过将相机运动建模为连续样条(continuous spline)并以隐式神经表示(implicit neural representation)编码光学流,利用归纳偏置(inductive biases)自然嵌入时空一致性;2)引入微分几何约束(differential geometric constraints)来隐式融合结构-运动先验(structure-and-motion priors),从而避免显式深度估计的同时保持严格的几何一致性。该方法显著提升了无监督场景下的估计精度与稳定性,且性能优于现有无监督方法,甚至可媲美有监督方法。
链接: https://arxiv.org/abs/2510.12753
作者: Wenpu Li,Bangyan Liao,Yi Zhou,Qi Xu,Pian Wan,Peidong Liu
机构: Westlake University (西湖大学); Zhejiang University (浙江大学); Hunan University (湖南大学); Wuhan University (武汉大学); Georgia Institute of Technology (佐治亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The Thirty-Ninth Annual Conference on Neural Information Processing Systems(NeurIPS 2025)
Abstract:The estimation of optical flow and 6-DoF ego-motion, two fundamental tasks in 3D vision, has typically been addressed independently. For neuromorphic vision (e.g., event cameras), however, the lack of robust data association makes solving the two problems separately an ill-posed challenge, especially in the absence of supervision via ground truth. Existing works mitigate this ill-posedness by either enforcing the smoothness of the flow field via an explicit variational regularizer or leveraging explicit structure-and-motion priors in the parametrization to improve event alignment. The former notably introduces bias in results and computational overhead, while the latter, which parametrizes the optical flow in terms of the scene depth and the camera motion, often converges to suboptimal local minima. To address these issues, we propose an unsupervised framework that jointly optimizes egomotion and optical flow via implicit spatial-temporal and geometric regularization. First, by modeling camera’s egomotion as a continuous spline and optical flow as an implicit neural representation, our method inherently embeds spatial-temporal coherence through inductive biases. Second, we incorporate structure-and-motion priors through differential geometric constraints, bypassing explicit depth estimation while maintaining rigorous geometric consistency. As a result, our framework (called E-MoFlow) unifies egomotion and optical flow estimation via implicit regularization under a fully unsupervised paradigm. Experiments demonstrate its versatility to general 6-DoF motion scenarios, achieving state-of-the-art performance among unsupervised methods and competitive even with supervised approaches.
zh
[CV-14] VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉问答(Visual Question Answering, VQA)任务中对深层语义理解能力评估不足的问题,尤其是在视觉艺术分析等复杂文化领域。现有基准测试往往局限于简单的句法结构和表层属性,导致模型倾向于利用统计捷径而非进行真正的视觉推理。为应对这一挑战,作者提出VQArt-Bench——一个面向文化遗产领域的大型、高质量VQA基准,其关键创新在于采用一种新型多智能体(multi-agent)生成管道,由专业化智能体协作生成语义细腻、经验证且语言多样化的问答对,并按照视觉理解维度(如符号意义、叙事解读和复杂视觉关系)组织数据。此设计有效提升了评估的深度与多样性,从而更真实地反映模型在复杂视觉语境下的推理能力。
链接: https://arxiv.org/abs/2510.12750
作者: A. Alfarano, L. Venturoli, D. Negueruela del Castillo
机构: University of Zurich (苏黎世大学); Max Planck Society (马克斯·普朗克学会)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated significant capabilities in joint visual and linguistic tasks. However, existing Visual Question Answering (VQA) benchmarks often fail to evaluate deep semantic understanding, particularly in complex domains like visual art analysis. Confined to simple syntactic structures and surface-level attributes, these questions fail to capture the diversity and depth of human visual inquiry. This limitation incentivizes models to exploit statistical shortcuts rather than engage in visual reasoning. To address this gap, we introduce VQArt-Bench, a new, large-scale VQA benchmark for the cultural heritage domain. This benchmark is constructed using a novel multi-agent pipeline where specialized agents collaborate to generate nuanced, validated, and linguistically diverse questions. The resulting benchmark is structured along relevant visual understanding dimensions that probe a model’s ability to interpret symbolic meaning, narratives, and complex visual relationships. Our evaluation of 14 state-of-the-art MLLMs on this benchmark reveals significant limitations in current models, including a surprising weakness in simple counting tasks and a clear performance gap between proprietary and open-source models.
zh
[CV-15] SPORTS: Simultaneous Panoptic Odometry, Rendering, Tracking and Segmentation for Urban Scenes Understanding
【速读】:该论文旨在解决具身智能体在场景感知、理解与模拟中面临的分割不完整、动态物体干扰、传感器数据稀疏以及视角受限等问题。其解决方案的关键在于提出了一种名为SPORTS的统一框架,通过紧密集成视频全景分割(Video Panoptic Segmentation, VPS)、视觉里程计(Visual Odometry, VO)和场景渲染(Scene Rendering, SR)任务,在迭代过程中实现多模态信息融合与协同优化:VPS采用基于自适应注意力机制的几何融合方法,利用位姿、深度和光流模态对齐跨帧特征并提升目标身份跟踪;VO结合VPS输出的全景分割结果与光流图增强动态物体置信度估计,从而提高相机位姿估计精度与深度图完整性;SR则利用VO生成的稀疏点云构建神经场,合成高保真RGB视图及对应的全景视图。该框架在多个公开数据集上验证了其在里程计、跟踪、分割和新视角合成等任务上的优越性能。
链接: https://arxiv.org/abs/2510.12749
作者: Zhiliu Yang,Jinyu Dai,Jianyuan Zhang,Zhu Yang
机构: Yunnan University (云南大学); Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Transactions on Multimedia
Abstract:Scene perception, understanding, and simulation are fundamental techniques for embodied-AI agents, while existing solutions are still prone to segmentation deficiency, dynamic objects’ interference, sensor data sparsity, and view-limitation problems. This paper proposes a novel framework, named SPORTS, for holistic scene understanding via tightly integrating Video Panoptic Segmentation (VPS), Visual Odometry (VO), and Scene Rendering (SR) tasks into an iterative and unified perspective. Firstly, VPS designs an adaptive attention-based geometric fusion mechanism to align cross-frame features via enrolling the pose, depth, and optical flow modality, which automatically adjusts feature maps for different decoding stages. A post-matching strategy is further integrated to improve identity tracking. In VO, panoptic segmentation results from VPS are combined with the optical flow map to improve the confidence estimation of dynamic objects, which enhances the accuracy of the camera pose estimation and completeness of the depth map generation via the learning-based paradigm. Furthermore, the point-based rendering of SR benefits from VO, transforming sparse point clouds into neural fields to synthesize high-fidelity RGB views and twin panoptic views. Extensive experiments on three public datasets demonstrate that our attention-based feature fusion outperforms most existing state-of-the-art methods on the odometry, tracking, segmentation, and novel view synthesis tasks.
zh
[CV-16] FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution
【速读】:该论文旨在解决扩散模型在真实世界视频超分辨率(Video Super-Resolution, VSR)应用中的效率、可扩展性和实时性问题,这些问题主要源于高延迟、计算开销大以及对超高分辨率的泛化能力差。解决方案的关键在于提出FlashVSR,这是首个面向实时VSR的单步流式扩散框架,其核心创新包括:(i) 一种训练友好的三阶段蒸馏流水线,支持流式超分辨率;(ii) 局部约束的稀疏注意力机制,在减少冗余计算的同时弥合训练与测试分辨率之间的差距;(iii) 一个小型条件解码器,在不损失重建质量的前提下显著加速重建过程。这些设计使FlashVSR能够在单张A100 GPU上以约17 FPS的速度处理768×1408分辨率视频,并在超高清分辨率下实现可靠扩展和高达12倍的速度提升。
链接: https://arxiv.org/abs/2510.12747
作者: Junhao Zhuang,Shi Guo,Xin Cai,Xiaohui Li,Yihao Liu,Chun Yuan,Tianfan Xue
机构: Tsinghua University (清华大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); The Chinese University of Hong Kong (香港中文大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page with code: this https URL
Abstract:Diffusion models have recently advanced video restoration, but applying them to real-world video super-resolution (VSR) remains challenging due to high latency, prohibitive computation, and poor generalization to ultra-high resolutions. Our goal in this work is to make diffusion-based VSR practical by achieving efficiency, scalability, and real-time performance. To this end, we propose FlashVSR, the first diffusion-based one-step streaming framework towards real-time VSR. FlashVSR runs at approximately 17 FPS for 768x1408 videos on a single A100 GPU by combining three complementary innovations: (i) a train-friendly three-stage distillation pipeline that enables streaming super-resolution, (ii) locality-constrained sparse attention that cuts redundant computation while bridging the train-test resolution gap, and (iii) a tiny conditional decoder that accelerates reconstruction without sacrificing quality. To support large-scale training, we also construct VSR-120K, a new dataset with 120k videos and 180k images. Extensive experiments show that FlashVSR scales reliably to ultra-high resolutions and achieves state-of-the-art performance with up to 12x speedup over prior one-step diffusion VSR models. We will release the code, pretrained models, and dataset to foster future research in efficient diffusion-based VSR.
zh
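下面用一个布尔掩码示意“局部约束的稀疏注意力”:每个 token 只允许关注相邻若干个块内的 token。块大小与半径为演示假设,FlashVSR 的实际稀疏模式与内核实现不同。

```python
import torch

def local_block_mask(n_tokens, block, radius):
    """Boolean attention mask: each block attends to blocks within `radius`.

    True marks allowed query/key pairs. Block size and radius are
    illustrative; FlashVSR's actual sparse layout differs.
    """
    block_id = torch.arange(n_tokens) // block
    dist = (block_id[:, None] - block_id[None, :]).abs()
    return dist <= radius

mask = local_block_mask(n_tokens=16, block=4, radius=1)
print(mask.int())
# Usable as attn_mask in F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```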
[CV-17] Personalized Federated Fine-Tuning of Vision Foundation Models for Healthcare ALT
【速读】:该论文旨在解决医疗领域中基础模型(foundation models)在跨机构联邦学习(federated learning)场景下,因数据隐私限制导致的训练数据不足与个性化需求难以满足的问题。其关键解决方案是提出一种新的个性化联邦微调方法,通过学习正交的低秩适配器(orthogonal LoRA adapters),将通用知识与客户端特定知识解耦,使每个参与方(如医院)能够同时充分利用自身数据和全局数据,从而提升模型性能并保障隐私安全。
链接: https://arxiv.org/abs/2510.12741
作者: Adam Tupper,Christian Gagné
机构: Institut intelligence et données (IID); Université Laval; Mila - Quebec AI Institute
类目: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Accepted to the Symposium on Model Accountability, Sustainability and Healthcare (SMASH) 2025
Abstract:Foundation models open up new possibilities for the use of AI in healthcare. However, even when pre-trained on health data, they still need to be fine-tuned for specific downstream tasks. Furthermore, although foundation models reduce the amount of training data required to achieve good performance, obtaining sufficient data is still a challenge. This is due, in part, to restrictions on sharing and aggregating data from different sources to protect patients’ privacy. One possible solution to this is to fine-tune foundation models via federated learning across multiple participating clients (i.e., hospitals, clinics, etc.). In this work, we propose a new personalized federated fine-tuning method that learns orthogonal LoRA adapters to disentangle general and client-specific knowledge, enabling each client to fully exploit both their own data and the data of others. Our preliminary results on real-world federated medical imaging tasks demonstrate that our approach is competitive against current federated fine-tuning methods.
zh
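下面给出“正交 LoRA 适配器”思想的最小 PyTorch 示意:在冻结的线性层上并联一个共享(通用)适配器与一个客户端专属适配器,并以二者行空间的 Frobenius 内积作为正交惩罚。结构与惩罚形式均为笔者假设,仅用于理解论文思路。

```python
import torch
import torch.nn as nn

class DualLoRALinear(nn.Module):
    """Frozen linear layer plus a general and a client-specific LoRA adapter,
    with an orthogonality penalty between the two adapters' subspaces.

    Minimal sketch under assumed shapes; the paper's actual architecture
    and penalty may differ.
    """
    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        for p in self.base.parameters():
            p.requires_grad_(False)  # base weights stay frozen
        self.A_g = nn.Parameter(torch.randn(rank, dim) * 0.01)  # general
        self.B_g = nn.Parameter(torch.zeros(dim, rank))
        self.A_c = nn.Parameter(torch.randn(rank, dim) * 0.01)  # client-specific
        self.B_c = nn.Parameter(torch.zeros(dim, rank))

    def forward(self, x):
        return (self.base(x)
                + x @ self.A_g.T @ self.B_g.T
                + x @ self.A_c.T @ self.B_c.T)

    def orthogonality_penalty(self):
        # Push the two adapters' row spaces apart: ||A_g A_c^T||_F^2 -> 0.
        return (self.A_g @ self.A_c.T).pow(2).sum()

layer = DualLoRALinear(dim=64)
x = torch.randn(8, 64)
print(layer(x).shape, layer.orthogonality_penalty().item())
```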
[CV-18] Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在真实场景中处理不完美图像时能力不足的问题,特别是如何从“思考图像”(think about images)向“与图像协同思考”(think with images)转变,即模型需具备动态操作视觉内容并整合通用工具以完成复杂任务的能力。现有基准大多仍基于静态图像输入的评估范式,未能充分考察MLLMs对视觉信息的主动干预和认知性利用。解决方案的关键在于提出IRIS基准,这是首个聚焦于“think with images”范式的评测体系,包含1,204个开放性、多轮交互的视觉-文本任务(覆盖五个领域),并配有详细评分标准,用于系统评估模型在感知、变换与推理方面的综合能力。实验表明,当前最强模型(GPT-5-think)仅达18.68%的通过率,凸显了MLLMs在视觉操作与工具集成上的显著瓶颈,为后续研究提供了明确方向。
链接: https://arxiv.org/abs/2510.12712
作者: Xingang Guo,Utkarsh Tyagi,Advait Gosai,Paula Vergara,Ernesto Gabriel Hernández Montoya,Chen Bo Calvin Zhang,Bin Hu,Yunzhong He,Bing Liu,Rakshith Sharma Srinivasa
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal Large Language Models (MLLMs) are increasingly applied in real-world scenarios where user-provided images are often imperfect, requiring active image manipulations such as cropping, editing, or enhancement to uncover salient visual cues. Beyond static visual perception, MLLMs must also think with images: dynamically transforming visual content and integrating it with other tools to solve complex tasks. However, this shift from treating vision as passive context to a manipulable cognitive workspace remains underexplored. Most existing benchmarks still follow a think about images paradigm, where images are regarded as static inputs. To address this gap, we introduce IRIS, an Interactive Reasoning with Images and Systems benchmark that evaluates MLLMs’ ability to perceive, transform, and reason across complex visual-textual tasks under the think with images paradigm. IRIS comprises 1,204 challenging, open-ended vision tasks (603 single-turn, 601 multi-turn) spanning five diverse domains, each paired with detailed rubrics to enable systematic evaluation. Our evaluation shows that current MLLMs struggle with tasks requiring effective integration of vision and general-purpose tools. Even the strongest model (GPT-5-think) reaches only 18.68% pass rate. We further observe divergent tool-use behaviors, with OpenAI models benefiting from diverse image manipulations while Gemini-2.5-pro shows no improvement. By introducing the first benchmark centered on think with images, IRIS offers critical insights for advancing visual intelligence in MLLMs.
zh
[CV-19] SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model
【速读】:该论文旨在解决当前多模态嵌入模型在实际应用与商业场景中面临的三大挑战:模态支持有限、训练机制不稳定以及工业领域差距。为应对这些问题,作者提出了SAIL-Embedding,其核心解决方案在于通过定制化的训练策略与架构设计实现更强大的跨模态表示能力。关键创新包括:多阶段训练策略(multi-stage training scheme),其中内容感知的渐进式训练(content-aware progressive training)提升模型对下游任务的适应性与跨模态理解力,协作感知的推荐增强训练(collaboration-aware recommendation enhancement training)则结合序列到物品和ID到物品嵌入知识,并挖掘用户历史兴趣以优化推荐场景下的表示;同时引入随机专业化(stochastic specialization)与数据驱动模式匹配(dataset-driven pattern matching)以增强训练灵活性与泛化性能。实验表明,SAIL-Embedding在多种检索任务中达到最先进(SOTA)效果,并在线上真实场景中显著提升了用户生命周期(Lifetime, LT)指标及匹配特征的AUC表现。
链接: https://arxiv.org/abs/2510.12709
作者: Lin Lin,Jiefeng Long,Zhihe Wan,Yuchi Wang,Dingkang Yang,Shuang Yang,Yueyang Yao,Xu Chen,Zirui Guo,Shengqiang Li,Weiran Li,Hanyu Li,Yaling Mou,Yan Qiu,Haiyang Yu,Xiao Liang,Hongsheng Li,Chao Feng
机构: 未知
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report
Abstract:Multimodal embedding models aim to yield informative unified representations that empower diverse cross-modal tasks. Despite promising developments in the evolution from CLIP-based dual-tower architectures to large vision-language models, prior works still face unavoidable challenges in real-world applications and business scenarios, such as the limited modality support, unstable training mechanisms, and industrial domain gaps. In this work, we introduce SAIL-Embedding, an omni-modal embedding foundation model that addresses these issues through tailored training strategies and architectural design. In the optimization procedure, we propose a multi-stage training scheme to boost the multifaceted effectiveness of representation learning. Specifically, the content-aware progressive training aims to enhance the model’s adaptability to diverse downstream tasks and master enriched cross-modal proficiency. The collaboration-aware recommendation enhancement training further adapts multimodal representations for recommendation scenarios by distilling knowledge from sequence-to-item and ID-to-item embeddings while mining user historical interests. Concurrently, we develop the stochastic specialization and dataset-driven pattern matching to strengthen model training flexibility and generalizability. Experimental results show that SAIL-Embedding achieves SOTA performance compared to other methods in different retrieval tasks. In online experiments across various real-world scenarios integrated with our model, we observe a significant increase in Lifetime (LT), which is a crucial indicator for the recommendation experience. For instance, the model delivers the 7-day LT gain of +0.158% and the 14-day LT gain of +0.144% in the Douyin-Selected scenario. For the Douyin feed rank model, the match features produced by SAIL-Embedding yield a +0.08% AUC gain.
[CV-20] Hybrid Explanation-Guided Learning for Transformer-Based Chest X-Ray Diagnosis MICCAI2025
【Quick Read】: This paper addresses the bias and limited generalization caused by Transformer-based deep learning models learning spurious correlations in medical image analysis. The key of the proposed Hybrid Explanation-Guided Learning (H-EGL) framework is to combine self-supervised and human-guided constraints to strengthen attention alignment and improve generalization; its self-supervised component exploits class-distinctive attention without relying on restrictive priors, improving the robustness and flexibility of the model.
Link: https://arxiv.org/abs/2510.12704
Authors: Shelley Zixin Shu,Haozhe Luo,Alexander Poellinger,Mauricio Reyes
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by iMIMIC at MICCAI 2025
Abstract:Transformer-based deep learning models have demonstrated exceptional performance in medical imaging by leveraging attention mechanisms for feature representation and interpretability. However, these models are prone to learning spurious correlations, leading to biases and limited generalization. While human-AI attention alignment can mitigate these issues, it often depends on costly manual supervision. In this work, we propose a Hybrid Explanation-Guided Learning (H-EGL) framework that combines self-supervised and human-guided constraints to enhance attention alignment and improve generalization. The self-supervised component of H-EGL leverages class-distinctive attention without relying on restrictive priors, promoting robustness and flexibility. We validate our approach on chest X-ray classification using the Vision Transformer (ViT), where H-EGL outperforms two state-of-the-art Explanation-Guided Learning (EGL) methods, demonstrating superior classification accuracy and generalization capability. Additionally, it produces attention maps that are better aligned with human expertise.
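As a toy illustration of how a hybrid explanation-guided objective could be assembled, a sketch follows: a task loss, a self-supervised class-distinctive attention term, and an optional human-alignment term. The specific loss forms below are stand-ins we constructed for exposition, not the H-EGL losses.

```python
import torch
import torch.nn.functional as F

def hybrid_egl_loss(logits, labels, attn_maps, expert_masks=None,
                    lam_self=0.1, lam_human=0.5):
    """Toy hybrid objective (hypothetical, not the paper's formulation).
    attn_maps: (B, C, H, W) per-class attention; expert_masks: (B, 1, H, W)."""
    loss = F.cross_entropy(logits, labels)
    p = attn_maps.softmax(dim=1)                   # distribute over classes
    idx = labels.view(-1, 1, 1, 1).expand(-1, 1, *p.shape[2:])
    target = p.gather(1, idx)                      # target-class attention
    # Self-supervised: since p sums to 1 over classes, (1 - target) is the
    # other-class mass; penalizing their product encourages class-distinctive,
    # confident attention without any human prior.
    loss = loss + lam_self * ((1.0 - target) * target).mean()
    # Human-guided: align target-class attention with expert masks if given.
    if expert_masks is not None:
        loss = loss + lam_human * F.mse_loss(target, expert_masks)
    return loss

B, C, H, W = 2, 5, 14, 14
logits = torch.randn(B, C, requires_grad=True)
attn = torch.randn(B, C, H, W, requires_grad=True)
labels = torch.tensor([0, 3])
hybrid_egl_loss(logits, labels, attn, torch.rand(B, 1, H, W)).backward()
```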
[CV-21] DiffEM: Learning from Corrupted Data with Diffusion Models via Expectation Maximization
【Quick Read】: This paper addresses the challenge of training diffusion models when only noisy or degraded observations are available, which is especially critical for high-dimensional inverse problems. The key is DiffEM, a new method built on the Expectation-Maximization (EM) framework: in the E-step, a conditional diffusion model reconstructs clean data from the observations; in the M-step, the reconstructed data are used to refine the conditional diffusion model's parameters. The method comes with monotonic convergence guarantees in theory and is validated on a variety of image reconstruction tasks.
Link: https://arxiv.org/abs/2510.12691
Authors: Danial Hosseintabar,Fan Chen,Giannis Daras,Antonio Torralba,Constantinos Daskalakis
Affiliations: Massachusetts Institute of Technology; MIT Computer Science and Artificial Intelligence Laboratory
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Diffusion models have emerged as powerful generative priors for high-dimensional inverse problems, yet learning them when only corrupted or noisy observations are available remains challenging. In this work, we propose a new method for training diffusion models with Expectation-Maximization (EM) from corrupted data. Our proposed method, DiffEM, utilizes conditional diffusion models to reconstruct clean data from observations in the E-step, and then uses the reconstructed data to refine the conditional diffusion model in the M-step. Theoretically, we provide monotonic convergence guarantees for the DiffEM iteration, assuming appropriate statistical conditions. We demonstrate the effectiveness of our approach through experiments on various image reconstruction tasks.
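The EM recipe is compact enough to sketch. The schematic round below assumes two interfaces (`sample` draws x ~ p_theta(x | y) from the current conditional diffusion model; `train_step` applies one denoising-objective update); it is an illustration, not the authors' implementation.

```python
from typing import Callable, Sequence
import torch

def diffem_round(sample: Callable[[torch.Tensor], torch.Tensor],
                 train_step: Callable[[torch.Tensor, torch.Tensor], float],
                 observations: Sequence[torch.Tensor],
                 m_steps: int = 100) -> None:
    """One DiffEM-style round (schematic)."""
    # E-step: reconstruct clean data from each corrupted observation y
    # with the current conditional diffusion model.
    with torch.no_grad():
        pseudo_clean = [sample(y) for y in observations]
    # M-step: refine the conditional model on the (reconstruction, observation)
    # pairs with the usual denoising objective.
    for _ in range(m_steps):
        for x, y in zip(pseudo_clean, observations):
            train_step(x, y)

# Toy stand-ins so the sketch executes:
obs = [torch.randn(3, 8, 8) for _ in range(4)]
diffem_round(lambda y: y.clone(), lambda x, y: 0.0, obs, m_steps=1)
```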
[CV-22] EReLiFM: Evidential Reliability-Aware Residual Flow Meta-Learning for Open-Set Domain Generalization under Noisy Labels
【Quick Read】: This paper addresses Open-Set Domain Generalization under Noisy Labels (OSDG-NL), i.e., improving a model's ability to recognize known categories and reject unknown ones in new domains when source-domain labels are noisy. Existing methods based on hyperbolic prototype-guided meta-learning still struggle when domain gaps are large and clean labeled data is scarce. The key of the proposed Evidential Reliability-Aware Residual Flow Meta-Learning (EReLiFM) is threefold: an unsupervised two-stage evidential loss clustering method promotes label-reliability awareness; a residual flow matching mechanism models structured domain- and category-conditioned residuals, yielding diverse, uncertainty-aware transfer paths beyond interpolation-based augmentation; and during meta-learning, the update direction is optimized so that the loss decrease on clean samples maximally reduces the loss on noisy samples, supervised by pseudo labels from the most confident predicted class. The method substantially improves OSDG-NL performance and reaches the state of the art.
Link: https://arxiv.org/abs/2510.12687
Authors: Kunyu Peng,Di Wen,Kailun Yang,Jia Fu,Yufan Chen,Ruiping Liu,Jiamin Wu,Junwei Zheng,M. Saquib Sarfraz,Luc Van Gool,Danda Pani Paudel,Rainer Stiefelhagen
Affiliations: Karlsruhe Institute of Technology; INSAIT, Sofia University "St. Kliment Ohridski"; Hunan University; KTH Royal Institute of Technology; RISE Research Institutes of Sweden; Shanghai Artificial Intelligence Laboratory; The Chinese University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: The source code is available at this https URL
Abstract:Open-Set Domain Generalization (OSDG) aims to enable deep learning models to recognize unseen categories in new domains, which is crucial for real-world applications. Label noise hinders open-set domain generalization by corrupting source-domain knowledge, making it harder to recognize known classes and reject unseen ones. While existing methods address OSDG under Noisy Labels (OSDG-NL) using hyperbolic prototype-guided meta-learning, they struggle to bridge domain gaps, especially with limited clean labeled data. In this paper, we propose Evidential Reliability-Aware Residual Flow Meta-Learning (EReLiFM). We first introduce an unsupervised two-stage evidential loss clustering method to promote label reliability awareness. Then, we propose a residual flow matching mechanism that models structured domain- and category-conditioned residuals, enabling diverse and uncertainty-aware transfer paths beyond interpolation-based augmentation. During this meta-learning process, the model is optimized such that the update direction on the clean set maximizes the loss decrease on the noisy set, using pseudo labels derived from the most confident predicted class for supervision. Experimental results show that EReLiFM outperforms existing methods on OSDG-NL, achieving state-of-the-art performance. The source code is available at this https URL.
[CV-23] MCOP: Multi-UAV Collaborative Occupancy Prediction
【Quick Read】: This paper addresses two core problems of collaborative perception for UAV swarm systems in complex scenes: conventional Bird's Eye View (BEV)-based methods use bounding-box representations that cannot fully capture the semantics and geometry of a scene, and their performance degrades sharply for undefined or occluded objects. The key of the proposed multi-UAV collaborative occupancy prediction framework is to preserve 3D spatial structure and semantics through a Spatial-Aware Feature Encoder and Cross-Agent Feature Integration, while Altitude-Aware Feature Reduction represents the scene compactly and a Dual-Mask Perceptual Guidance mechanism adaptively selects features to reduce communication overhead.
Link: https://arxiv.org/abs/2510.12679
Authors: Zefu Lin,Wenbo Chen,Xiaojuan Jin,Yuran Yang,Lue Fan,Yixin Zhang,Yufeng Zhang,Zhaoxiang Zhang
Affiliations: University of Chinese Academy of Sciences (UCAS); Institute of Automation, Chinese Academy of Sciences (CASIA); New Laboratory of Pattern Recognition (NLPR); State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS); Beijing University of Posts and Telecommunications (BUPT); Tencent
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Unmanned Aerial Vehicle (UAV) swarm systems necessitate efficient collaborative perception mechanisms for diverse operational scenarios. Current Bird’s Eye View (BEV)-based approaches exhibit two main limitations: bounding-box representations fail to capture complete semantic and geometric information of the scene, and their performance significantly degrades when encountering undefined or occluded objects. To address these limitations, we propose a novel multi-UAV collaborative occupancy prediction framework. Our framework effectively preserves 3D spatial structures and semantics through integrating a Spatial-Aware Feature Encoder and Cross-Agent Feature Integration. To enhance efficiency, we further introduce Altitude-Aware Feature Reduction to compactly represent scene information, along with a Dual-Mask Perceptual Guidance mechanism to adaptively select features and reduce communication overhead. Due to the absence of suitable benchmark datasets, we extend three datasets for evaluation: two virtual datasets (Air-to-Pred-Occ and UAV3D-Occ) and one real-world dataset (GauUScene-Occ). Experiments results demonstrate that our method achieves state-of-the-art accuracy, significantly outperforming existing collaborative methods while reducing communication overhead to only a fraction of previous approaches.
[CV-24] TerraCodec: Compressing Earth Observations
【Quick Read】: This paper addresses the storage and transmission pressure created by multispectral image time series from Earth observation (EO) satellites, while overcoming the limitations of existing compression methods for remote sensing: image codecs ignore temporal redundancy, and video codecs rely on motion priors that fail to capture the radiometric evolution of largely static scenes. The key is TerraCodec (TEC), a family of learned codecs tailored to EO, including efficient image-level variants adapted to multispectral inputs and a Temporal Transformer model (TEC-TT) that exploits temporal dependencies; Latent Repacking is further introduced to train flexible-rate transformer models across varying rate-distortion settings, breaking the fixed-rate constraint of current neural codecs. Trained on Sentinel-2 data, TerraCodec achieves 3-10x stronger compression at equal image quality and supports zero-shot cloud inpainting, surpassing existing methods on the AllClear benchmark.
Link: https://arxiv.org/abs/2510.12670
Authors: Julen Costa-Watanabe,Isabelle Wittmann,Benedikt Blumenstiel,Konrad Schindler
Affiliations: IBM Research Europe; ETH Zurich
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Earth observation (EO) satellites produce massive streams of multispectral image time series, posing pressing challenges for storage and transmission. Yet, learned EO compression remains fragmented, lacking publicly available pretrained models and misaligned with advances in compression for natural imagery. Image codecs overlook temporal redundancy, while video codecs rely on motion priors that fail to capture the radiometric evolution of largely static scenes. We introduce TerraCodec (TEC), a family of learned codecs tailored to EO. TEC includes efficient image-based variants adapted to multispectral inputs, as well as a Temporal Transformer model (TEC-TT) that leverages dependencies across time. To overcome the fixed-rate setting of today’s neural codecs, we present Latent Repacking, a novel method for training flexible-rate transformer models that operate on varying rate-distortion settings. Trained on Sentinel-2 data, TerraCodec outperforms classical codecs, achieving 3-10x stronger compression at equivalent image quality. Beyond compression, TEC-TT enables zero-shot cloud inpainting, surpassing state-of-the-art methods on the AllClear benchmark. Our results establish bespoke, learned compression algorithms as a promising direction for Earth observation. Code and model weights will be released under a permissive license.
[CV-25] On the Use of Hierarchical Vision Foundation Models for Low-Cost Human Mesh Recovery and Pose Estimation ICCV
【Quick Read】: This paper aims at simple, efficient models for Human Mesh Recovery (HMR) and its predecessor task, Human Pose Estimation (HPE), where state-of-the-art methods such as HMR2.0 rely on large non-hierarchical vision transformers as encoders and are therefore resource-hungry. The key is to use the early stages of hierarchical Vision Foundation Models (VFMs), including Swin Transformer, GroupMixFormer, and VMamba, as encoders; experiments show that using only the first two or three stages achieves performance on par with full-stage models while offering clearly better accuracy-efficiency trade-offs than existing lightweight alternatives.
Link: https://arxiv.org/abs/2510.12660
Authors: Shuhei Tarashima,Yushan Wang,Norio Tagawa
Affiliations: NTT DOCOMO Business; Tokyo Metropolitan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCVW 2025
Abstract:In this work, we aim to develop simple and efficient models for human mesh recovery (HMR) and its predecessor task, human pose estimation (HPE). State-of-the-art HMR methods, such as HMR2.0 and its successors, rely on large, non-hierarchical vision transformers as encoders, which are inherited from the corresponding HPE models like ViTPose. To establish baselines across varying computational budgets, we first construct three lightweight HMR2.0 variants by adapting the corresponding ViTPose models. In addition, we propose leveraging the early stages of hierarchical vision foundation models (VFMs), including Swin Transformer, GroupMixFormer, and VMamba, as encoders. This design is motivated by the observation that intermediate stages of hierarchical VFMs produce feature maps with resolutions comparable to or higher than those of non-hierarchical counterparts. We conduct a comprehensive evaluation of 27 hierarchical-VFM-based HMR and HPE models, demonstrating that using only the first two or three stages achieves performance on par with full-stage models. Moreover, we show that the resulting truncated models exhibit better trade-offs between accuracy and computational efficiency compared to existing lightweight alternatives.
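The truncation idea can be reproduced with off-the-shelf backbones. The sketch below uses timm's feature extraction to keep only the early stages of a Swin Transformer; it is illustrative only, and the paper's exact encoders, stage choices, and heads are not reproduced.

```python
import timm
import torch

# Keep only the first three stages of a hierarchical backbone as the encoder.
encoder = timm.create_model(
    "swin_tiny_patch4_window7_224",
    pretrained=False,        # set True to start from ImageNet weights
    features_only=True,
    out_indices=(0, 1, 2),   # early stages only; strides 4, 8, 16
)

x = torch.randn(1, 3, 224, 224)
for f in encoder(x):
    # Layout of Swin feature maps (NCHW vs. NHWC) depends on the timm version.
    print(tuple(f.shape))
```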
[CV-26] Zero-Shot CFC: Fast Real-World Image Denoising based on Cross-Frequency Consistency
【Quick Read】: This paper addresses two problems of existing zero-shot denoisers for real-world image denoising: long training times, and dependence on noise-independence and zero-mean assumptions, which limits performance under complex real-world noise. The key is the proposed Zero-Shot denoiser based on Cross-Frequency Consistency (ZSCFC), which trains and denoises with a single noisy image and makes no assumptions about the noise distribution. The core idea is that image textures exhibit position similarity and content consistency across frequency bands whereas noise does not; on this basis, a cross-frequency consistency loss and an ultralight network are designed for efficient and robust real-world denoising.
Link: https://arxiv.org/abs/2510.12646
Authors: Yanlin Jiang,Yuchen Liu,Mingren Liu
Affiliations: Beijing University of Technology; Alibaba Cloud
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: The British Machine Vision Conference
Abstract:Zero-shot denoisers address the dataset dependency of deep-learning-based denoisers, enabling the denoising of unseen single images. Nonetheless, existing zero-shot methods suffer from long training times and rely on the assumption of noise independence and a zero-mean property, limiting their effectiveness in real-world denoising scenarios where noise characteristics are more complicated. This paper proposes an efficient and effective method for real-world denoising, the Zero-Shot denoiser based on Cross-Frequency Consistency (ZSCFC), which enables training and denoising with a single noisy image and does not rely on assumptions about noise distribution. Specifically, image textures exhibit position similarity and content consistency across different frequency bands, while noise does not. Based on this property, we developed cross-frequency consistency loss and an ultralight network to realize image denoising. Experiments on various real-world image datasets demonstrate that our ZSCFC outperforms other state-of-the-art zero-shot methods in terms of computational efficiency and denoising performance.
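As a toy illustration of training on a single noisy image with a cross-frequency consistency term: the loss below (low-frequency fidelity plus agreement between predictions from differently band-limited views) is a stand-in we constructed for exposition, not ZSCFC's actual objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gaussian_blur(img, sigma):
    # Separable Gaussian blur for a (1, C, H, W) tensor.
    k = int(4 * sigma) * 2 + 1
    xs = torch.arange(k, dtype=torch.float32) - k // 2
    g = torch.exp(-xs ** 2 / (2 * sigma ** 2))
    g = (g / g.sum()).to(img)
    c = img.shape[1]
    img = F.conv2d(img, g.view(1, 1, 1, k).repeat(c, 1, 1, 1),
                   padding=(0, k // 2), groups=c)
    return F.conv2d(img, g.view(1, 1, k, 1).repeat(c, 1, 1, 1),
                    padding=(k // 2, 0), groups=c)

net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 3, 3, padding=1))   # ultralight stand-in
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
noisy = torch.rand(1, 3, 64, 64)                      # the single noisy image

for _ in range(200):
    low1 = gaussian_blur(noisy, 1.0)   # two band-limited views of the image
    low2 = gaussian_blur(noisy, 2.0)
    out = net(noisy)
    fid = F.mse_loss(gaussian_blur(out, 1.0), low1)   # low-frequency fidelity
    cons = F.mse_loss(net(low1), net(low2))           # cross-frequency agreement
    loss = fid + 0.1 * cons
    opt.zero_grad(); loss.backward(); opt.step()

denoised = net(noisy).detach()
```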
[CV-27] WaterFlow: Explicit Physics-Prior Rectified Flow for Underwater Saliency Mask Generation
【Quick Read】: This paper addresses the challenges of image-quality degradation and domain gaps in Underwater Salient Object Detection (USOD): existing methods usually ignore the physical principles of underwater imaging, or simply treat degradation as interference to be eliminated, failing to exploit the valuable information it contains. The key of the proposed WaterFlow, a rectified flow-based framework, is to inject underwater physical imaging information as explicit priors directly into network training and to introduce temporal modeling, markedly improving salient object identification. On the USOD10K dataset, WaterFlow gains 0.072 in S_m, demonstrating the effectiveness and superiority of the method.
Link: https://arxiv.org/abs/2510.12605
Authors: Runting Li,Shijie Lian,Hua Li,Yutong Li,Wenhui Wu,Sam Kwong
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Underwater Salient Object Detection (USOD) faces significant challenges, including underwater image quality degradation and domain gaps. Existing methods tend to ignore the physical principles of underwater imaging or simply treat degradation phenomena in underwater images as interference factors that must be eliminated, failing to fully exploit the valuable information they contain. We propose WaterFlow, a rectified flow-based framework for underwater salient object detection that innovatively incorporates underwater physical imaging information as explicit priors directly into the network training process and introduces temporal dimension modeling, significantly enhancing the model’s capability for salient object identification. On the USOD10K dataset, WaterFlow achieves a 0.072 gain in S_m, demonstrating the effectiveness and superiority of our method. The code will be published after the acceptance.
[CV-28] Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training
【Quick Read】: This paper addresses the persistent performance and efficiency gap of pixel-space generative models, which are harder to train and generally weaker than their latent-space counterparts. The key is a novel two-stage training framework: in the first stage, encoders are pre-trained to capture the semantics of clean images while aligning them with points along the same deterministic sampling trajectory (which evolves points from the prior to the data distribution); in the second stage, the pre-trained encoder is combined with a randomly initialized decoder and the full model is fine-tuned end-to-end for both diffusion and consistency models. The approach markedly improves the quality and efficiency of pixel-space models, surpassing prior pixel-space methods in FID on ImageNet, and achieves the first successful training of a consistency model directly on high-resolution images without relying on pre-trained VAEs or diffusion models.
Link: https://arxiv.org/abs/2510.12586
Authors: Jiachen Lei,Keli Liu,Julius Berner,Haiming Yu,Hongkai Zheng,Jiahong Wu,Xiangxiang Chu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Pixel-space generative models are often more difficult to train and generally underperform compared to their latent-space counterparts, leaving a persistent performance and efficiency gap. In this paper, we introduce a novel two-stage training framework that closes this gap for pixel-space diffusion and consistency models. In the first stage, we pre-train encoders to capture meaningful semantics from clean images while aligning them with points along the same deterministic sampling trajectory, which evolves points from the prior to the data distribution. In the second stage, we integrate the encoder with a randomly initialized decoder and fine-tune the complete model end-to-end for both diffusion and consistency models. Our training framework demonstrates strong empirical performance on ImageNet dataset. Specifically, our diffusion model reaches an FID of 2.04 on ImageNet-256 and 2.35 on ImageNet-512 with 75 number of function evaluations (NFE), surpassing prior pixel-space methods by a large margin in both generation quality and efficiency while rivaling leading VAE-based models at comparable training cost. Furthermore, on ImageNet-256, our consistency model achieves an impressive FID of 8.82 in a single sampling step, significantly surpassing its latent-space counterpart. To the best of our knowledge, this marks the first successful training of a consistency model directly on high-resolution images without relying on pre-trained VAEs or diffusion models.
[CV-29] LayerSync: Self-aligning Intermediate Layers
【Quick Read】: This paper addresses the trade-off between generation quality and training efficiency in diffusion models. Prior studies have shown that the quality of intermediate representations directly affects generation and that external guidance on them accelerates training, but such guidance depends on extra supervision and generalizes poorly. The key of the proposed LayerSync is a self-sufficient regularizer: exploiting the quality differences among a diffusion model's own layers, the most semantically rich layers act as intrinsic guidance that regularizes the weaker ones, improving generation quality and training efficiency without external supervision or pretrained models. The method is domain-agnostic, adds no training overhead, and is validated across image, audio, video, and motion generation.
Link: https://arxiv.org/abs/2510.12581
Authors: Yasaman Haghighi,Bastien van Delft,Mariam Hassan,Alexandre Alahi
Affiliations: Ecole Polytechnique Fédérale de Lausanne
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:
Abstract:We propose LayerSync, a domain-agnostic approach for improving the generation quality and the training efficiency of diffusion models. Prior studies have highlighted the connection between the quality of generation and the representations learned by diffusion models, showing that external guidance on model intermediate representations accelerates training. We reconceptualize this paradigm by regularizing diffusion models with their own intermediate representations. Building on the observation that representation quality varies across diffusion model layers, we show that the most semantically rich representations can act as an intrinsic guidance for weaker ones, reducing the need for external supervision. Our approach, LayerSync, is a self-sufficient, plug-and-play regularizer term with no overhead on diffusion model training and generalizes beyond the visual domain to other modalities. LayerSync requires no pretrained models nor additional data. We extensively evaluate the method on image generation and demonstrate its applicability to other domains such as audio, video, and motion generation. We show that it consistently improves the generation quality and the training efficiency. For example, we speed up the training of flow-based transformer by over 8.75x on ImageNet dataset and improved the generation quality by 23.6%. The code is available at this https URL.
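The self-alignment regularizer can be sketched in a few lines: a weaker layer's tokens are pulled toward a stop-gradient copy of a semantically richer layer's tokens. This is a sketch under assumptions; LayerSync's exact loss and layer pairing may differ.

```python
import torch
import torch.nn.functional as F

def layer_alignment_loss(weak_feats, strong_feats):
    """Cosine distance between a weaker layer's tokens and a detached
    (stop-gradient) copy of a semantically richer layer's tokens."""
    weak = F.normalize(weak_feats, dim=-1)
    target = F.normalize(strong_feats.detach(), dim=-1)   # target only
    return 1.0 - (weak * target).sum(dim=-1).mean()

# Hypothetical intermediate features, e.g. grabbed via forward hooks:
B, N, D = 8, 256, 768
feats_early = torch.randn(B, N, D, requires_grad=True)   # weaker layer
feats_late = torch.randn(B, N, D)                        # richer layer
reg = layer_alignment_loss(feats_early, feats_late)
# total_loss = diffusion_loss + lambda_sync * reg
reg.backward()
```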
[CV-30] Unlocking Zero-Shot Plant Segmentation with Pl@ntNet Intelligence
【Quick Read】: This paper addresses the annotation-scarcity bottleneck in agricultural image segmentation, where supervised methods are limited by little training data and complex field conditions. The key is to leverage Plantnet, a large-scale plant classification model, and the plant-specific representations of its DinoV2 backbone to first produce coarse plant-region masks, which are then refined by the Segment Anything Model (SAM), achieving zero-shot segmentation without additional annotated data. Experiments show that Plantnet-fine-tuned DinoV2 clearly outperforms the base DinoV2 in IoU, confirming the value of combining foundation models with specialized plant-centric models.
Link: https://arxiv.org/abs/2510.12579
Authors: Simon Ravé,Jean-Christophe Lombardo,Pejman Rasti,Alexis Joly,David Rousseau
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present a zero-shot segmentation approach for agricultural imagery that leverages Plantnet, a large-scale plant classification model, in conjunction with its DinoV2 backbone and the Segment Anything Model (SAM). Rather than collecting and annotating new datasets, our method exploits Plantnet’s specialized plant representations to identify plant regions and produce coarse segmentation masks. These masks are then refined by SAM to yield detailed segmentations. We evaluate on four publicly available datasets of various complexity in terms of contrast including some where the limited size of the training data and complex field conditions often hinder purely supervised methods. Our results show consistent performance gains when using Plantnet-fine-tuned DinoV2 over the base DinoV2 model, as measured by the Jaccard Index (IoU). These findings highlight the potential of combining foundation models with specialized plant-centric models to alleviate the annotation bottleneck and enable effective segmentation in diverse agricultural scenarios.
[CV-31] Learning Human Motion with Temporally Conditional Mamba
【Quick Read】: This paper addresses the difficulty of step-by-step temporal alignment in human motion generation conditioned on time-varying input signals, where existing methods rely on cross-attention that mainly captures global interactions. The key is Temporally Conditional Mamba, a new Mamba-based model that integrates the conditioning information into the recurrent dynamics of the Mamba block, achieving more precise frame-level temporal alignment and motion consistency and thereby clearly improving motion realism, temporal accuracy, and condition consistency.
Link: https://arxiv.org/abs/2510.12573
Authors: Quang Nguyen,Tri Le,Baoru Huang,Minh Nhat Vu,Ngan Le,Thieu Vo,Anh Nguyen
Affiliations: FPT Software AI Center; University of Liverpool; Vienna University of Technology; University of Arkansas; National University of Singapore
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages
Abstract:Learning human motion based on a time-dependent input signal presents a challenging yet impactful task with various applications. The goal of this task is to generate or estimate human movement that consistently reflects the temporal patterns of conditioning inputs. Existing methods typically rely on cross-attention mechanisms to fuse the condition with motion. However, this approach primarily captures global interactions and struggles to maintain step-by-step temporal alignment. To address this limitation, we introduce Temporally Conditional Mamba, a new mamba-based model for human motion generation. Our approach integrates conditional information into the recurrent dynamics of the Mamba block, enabling better temporally aligned motion. To validate the effectiveness of our method, we evaluate it on a variety of human motion tasks. Extensive experiments demonstrate that our model significantly improves temporal alignment, motion realism, and condition consistency over state-of-the-art approaches. Our project page is available at this https URL.
[CV-32] MMOT: The First Challenging Benchmark for Drone-based Multispectral Multi-Object Tracking
【Quick Read】: This paper addresses the reliability degradation of multi-object tracking (MOT) from UAV platforms caused by small targets, severe occlusions, and cluttered backgrounds: existing RGB-based trackers depend on spatial appearance cues (color, texture) that degrade markedly in aerial views. The authors introduce MMOT, the first challenging benchmark for drone-based multispectral multi-object tracking, and present a multispectral, orientation-aware MOT scheme whose key components are a lightweight Spectral 3D-Stem that integrates multispectral features while remaining compatible with RGB pretraining, an orientation-aware Kalman filter for more precise state estimation, and an end-to-end orientation-adaptive transformer that exploits the precise oriented annotations. Experiments across representative trackers show that multispectral input clearly outperforms RGB baselines, especially for small and densely packed objects.
Link: https://arxiv.org/abs/2510.12565
Authors: Tianhao Li,Tingfa Xu,Ying Wang,Haolin Qin,Xu Lin,Jianan Li
Affiliations: Beijing Institute of Technology; Beijing Institute of Technology Chongqing Innovation Center
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Drone-based multi-object tracking is essential yet highly challenging due to small targets, severe occlusions, and cluttered backgrounds. Existing RGB-based tracking algorithms heavily depend on spatial appearance cues such as color and texture, which often degrade in aerial views, compromising reliability. Multispectral imagery, capturing pixel-level spectral reflectance, provides crucial cues that enhance object discriminability under degraded spatial conditions. However, the lack of dedicated multispectral UAV datasets has hindered progress in this domain. To bridge this gap, we introduce MMOT, the first challenging benchmark for drone-based multispectral multi-object tracking. It features three key characteristics: (i) Large Scale - 125 video sequences with over 488.8K annotations across eight categories; (ii) Comprehensive Challenges - covering diverse conditions such as extreme small targets, high-density scenarios, severe occlusions, and complex motion; and (iii) Precise Oriented Annotations - enabling accurate localization and reduced ambiguity under aerial perspectives. To better extract spectral features and leverage oriented annotations, we further present a multispectral and orientation-aware MOT scheme adapting existing methods, featuring: (i) a lightweight Spectral 3D-Stem integrating spectral features while preserving compatibility with RGB pretraining; (ii) an orientation-aware Kalman filter for precise state estimation; and (iii) an end-to-end orientation-adaptive transformer. Extensive experiments across representative trackers consistently show that multispectral input markedly improves tracking performance over RGB baselines, particularly for small and densely packed objects. We believe our work will advance drone-based multispectral multi-object tracking research. Our MMOT, code, and benchmarks are publicly available at this https URL.
[CV-33] CoIRL-AD: Collaborative-Competitive Imitation-Reinforcement Learning in Latent World Models for Autonomous Driving
【Quick Read】: This paper addresses the poor generalization of end-to-end autonomous driving models trained purely with imitation learning (IL), and the limited performance of reinforcement learning (RL) due to sample inefficiency and unstable convergence. The key of the proposed CoIRL-AD is a competitive dual-policy framework in which IL and RL agents interact during training; the competition-based mechanism enables knowledge exchange while preventing gradient conflicts, improving overall robustness and long-tail performance.
Link: https://arxiv.org/abs/2510.12560
Authors: Xiaoji Zheng,Ziyuan Yang,Yanhao Chen,Yuhang Peng,Yuanrong Tang,Gengyuan Liu,Bokui Chen,Jiangtao Gong
Affiliations: Tsinghua University; University of Washington; Beijing Jiaotong University; The Hong Kong Polytechnic University; Institute for AI Industry Research (AIR), Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 18 pages, 17 figures
Abstract:End-to-end autonomous driving models trained solely with imitation learning (IL) often suffer from poor generalization. In contrast, reinforcement learning (RL) promotes exploration through reward maximization but faces challenges such as sample inefficiency and unstable convergence. A natural solution is to combine IL and RL. Moving beyond the conventional two-stage paradigm (IL pretraining followed by RL fine-tuning), we propose CoIRL-AD, a competitive dual-policy framework that enables IL and RL agents to interact during training. CoIRL-AD introduces a competition-based mechanism that facilitates knowledge exchange while preventing gradient conflicts. Experiments on the nuScenes dataset show an 18% reduction in collision rate compared to baselines, along with stronger generalization and improved performance on long-tail scenarios. Code is available at: this https URL.
[CV-34] Unconditional Human Motion and Shape Generation via Balanced Score-Based Diffusion
【Quick Read】: This paper questions the widespread reliance on over-parameterized input features and auxiliary losses in human motion generation models, strategies that should not be strictly necessary for diffusion models. The key finding is that, for unconditional human motion generation, results on par with the state of the art are achievable with a score-based diffusion model using only careful feature-space normalization and analytically derived weightings for the standard L2 score-matching loss; the model generates motion and shape directly, avoiding slow post-hoc recovery of shape from joint coordinates, and each component has a clear theoretical motivation validated by targeted ablations.
Link: https://arxiv.org/abs/2510.12537
Authors: David Björkstrand,Tiesheng Wang,Lars Bretzner,Josephine Sullivan
Affiliations: Royal Institute of Technology; EA Sports TRACAB
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Recent work has explored a range of model families for human motion generation, including Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and diffusion-based models. Despite their differences, many methods rely on over-parameterized input features and auxiliary losses to improve empirical results. These strategies should not be strictly necessary for diffusion models to match the human motion distribution. We show that results on par with the state of the art in unconditional human motion generation are achievable with a score-based diffusion model using only careful feature-space normalization and analytically derived weightings for the standard L2 score-matching loss, while generating both motion and shape directly, thereby avoiding slow post hoc shape recovery from joints. We build the method step by step, with a clear theoretical motivation for each component, and provide targeted ablations demonstrating the effectiveness of each proposed addition in isolation.
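A generic weighted L2 score-matching loss of the kind the abstract refers to can be sketched as follows; the linear "sigma = t" schedule and the example weighting below are our assumptions, not the paper's derived weightings.

```python
import torch

def weighted_dsm_loss(model, x0, weight_fn):
    """Weighted L2 denoising score matching (generic sketch)."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)                  # t ~ U(0, 1)
    sigma = t.view(b, *([1] * (x0.dim() - 1)))           # toy noise schedule
    eps = torch.randn_like(x0)
    xt = x0 + sigma * eps                                # perturbed sample
    pred = model(xt, t)                                  # model predicts eps
    per_sample = ((pred - eps) ** 2).flatten(1).mean(dim=1)
    return (weight_fn(t) * per_sample).mean()

model = lambda x, t: torch.zeros_like(x)      # stand-in for the real network
x0 = torch.randn(4, 3, 8, 8)
loss = weighted_dsm_loss(model, x0, weight_fn=lambda t: 1.0 / (t ** 2 + 1e-2))
```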
[CV-35] Voronoi-Assisted Diffusion for Computing Unsigned Distance Fields from Unoriented Points
【Quick Read】: This paper addresses the numerical instability, high computational cost, and poor controllability of computing Unsigned Distance Fields (UDFs) directly from unoriented point clouds. The key is Voronoi-Assisted Diffusion (VAD), a lightweight network-free method: bi-directional normals are first assigned to the input points by optimizing an energy built from two Voronoi-based geometric criteria for optimal alignment; the aligned normals are then diffused into an approximate UDF gradient field, which is integrated to recover the final UDF. The method stays efficient and numerically stable while handling complex 3D shapes of arbitrary topology, including non-manifold and non-orientable geometry.
Link: https://arxiv.org/abs/2510.12524
Authors: Jiayi Kong,Chen Zong,Junkai Deng,Xuhui Chen,Fei Hou,Shiqing Xin,Junhui Hou,Chen Qian,Ying He
Affiliations: Nanyang Technological University; Nanjing University of Aeronautics and Astronautics; Chinese Academy of Sciences; Shandong University; City University of Hong Kong; SenseTime Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Unsigned Distance Fields (UDFs) provide a flexible representation for 3D shapes with arbitrary topology, including open and closed surfaces, orientable and non-orientable geometries, and non-manifold structures. While recent neural approaches have shown promise in learning UDFs, they often suffer from numerical instability, high computational cost, and limited controllability. We present a lightweight, network-free method, Voronoi-Assisted Diffusion (VAD), for computing UDFs directly from unoriented point clouds. Our approach begins by assigning bi-directional normals to input points, guided by two Voronoi-based geometric criteria encoded in an energy function for optimal alignment. The aligned normals are then diffused to form an approximate UDF gradient field, which is subsequently integrated to recover the final UDF. Experiments demonstrate that VAD robustly handles watertight and open surfaces, as well as complex non-manifold and non-orientable geometries, while remaining computationally efficient and stable.
[CV-36] BSGS: Bi-stage 3D Gaussian Splatting for Camera Motion Deblurring
【Quick Read】: This paper addresses the limitations of 3D Gaussian Splatting (3DGS)-based deblurring when reconstructing from motion-blurred images caused by camera motion, chiefly the strong dependence on camera-pose accuracy and the inability to control the erroneous densification of Gaussian primitives caused by motion blur. The key is a Bi-Stage 3DGS framework: the first stage coarsely optimizes camera poses to reduce motion-induced distortion; the second stage, with the coarse poses fixed, applies a Global Rigid Transformation to further correct motion blur, aided by a subframe gradient aggregation strategy that eases multi-subframe gradient conflicts; a space-time bi-stage optimization strategy additionally adjusts densification thresholds dynamically to prevent premature noisy Gaussian primitives in blurred regions, markedly improving reconstruction quality and robustness.
Link: https://arxiv.org/abs/2510.12493
Authors: An Zhao,Piaopiao Yu,Zhe Zhu,Mingqiang Wei
Affiliations: Nanjing University of Aeronautics and Astronautics
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:3D Gaussian Splatting has exhibited remarkable capabilities in 3D scene reconstruction. However, reconstructing high-quality 3D scenes from motion-blurred images caused by camera motion poses a significant challenge. The performance of existing 3DGS-based deblurring methods is limited by their inherent mechanisms, such as extreme dependence on the accuracy of camera poses and inability to effectively control the erroneous densification of Gaussian primitives caused by motion blur. To solve these problems, we introduce a novel framework, Bi-Stage 3D Gaussian Splatting, to accurately reconstruct 3D scenes from motion-blurred images. It contains two stages. First, Camera Pose Refinement roughly optimizes camera poses to reduce motion-induced distortions. Second, with fixed rough camera poses, Global Rigid Transformation further corrects motion-induced blur. To alleviate multi-subframe gradient conflicts, we propose a subframe gradient aggregation strategy to optimize both stages. Concurrently, a space-time bi-stage optimization strategy is introduced to dynamically adjust primitive densification thresholds and prevent premature noisy Gaussian generation in blurred regions. Comprehensive experiments verify the effectiveness of our proposed deblurring method and show its superiority over the state of the arts.
[CV-37] Fast Visuomotor Policy for Robotic Manipulation
【Quick Read】: This paper targets efficient and precise control for high-frequency robotic manipulation under constrained computational resources, where conventional policy frameworks handle multimodal action prediction inefficiently and imprecisely. The key of the proposed Energy Policy framework lies in two core components: the energy score is adopted as the learning objective to support multimodal action modeling, and an energy MLP implements this objective efficiently while keeping the architecture simple. This design lets the model predict multimodal actions in a single forward pass, matching or surpassing state-of-the-art methods in simulation and on real robots while significantly reducing computational overhead.
Link: https://arxiv.org/abs/2510.12483
Authors: Jingkai Jia,Tong Yang,Xueyao Chen,Chenhuan Liu,Wenqiang Zhang
Affiliations: Fudan University; MEGVII Technology
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present a fast and effective policy framework for robotic manipulation, named Energy Policy, designed for high-frequency robotic tasks and resource-constrained systems. Unlike existing robotic policies, Energy Policy natively predicts multimodal actions in a single forward pass, enabling high-precision manipulation at high speed. The framework is built upon two core components. First, we adopt the energy score as the learning objective to facilitate multimodal action modeling. Second, we introduce an energy MLP to implement the proposed objective while keeping the architecture simple and efficient. We conduct comprehensive experiments in both simulated environments and real-world robotic tasks to evaluate the effectiveness of Energy Policy. The results show that Energy Policy matches or surpasses the performance of state-of-the-art manipulation methods while significantly reducing computational overhead. Notably, on the MimicGen benchmark, Energy Policy achieves superior performance at faster inference compared to existing approaches.
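The energy score named above has a standard empirical form, ES = E||a - y|| - 0.5 * E||a - a'||, estimated from m sampled actions per state. A minimal PyTorch sketch of this generic objective follows (not the authors' implementation).

```python
import torch

def energy_score_loss(action_samples, target):
    """Empirical energy score (negatively oriented): E||a - y|| - 0.5 E||a - a'||.
    action_samples: (B, m, D) candidate actions; target: (B, D) expert action."""
    B, m, _ = action_samples.shape
    y = target.unsqueeze(1)                                      # (B, 1, D)
    attraction = (action_samples - y).norm(dim=-1).mean(dim=1)   # E||a - y||
    pdist = torch.cdist(action_samples, action_samples)          # (B, m, m)
    repulsion = pdist.sum(dim=(1, 2)) / (m * (m - 1))            # E||a - a'||
    return (attraction - 0.5 * repulsion).mean()

# The policy emits m candidate actions per state in one forward pass:
B, m, D = 16, 8, 7
actions = torch.randn(B, m, D, requires_grad=True)
expert = torch.randn(B, D)
energy_score_loss(actions, expert).backward()
```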
[CV-38] A Text-Image Fusion Method with Data Augmentation Capabilities for Referring Medical Image Segmentation
【Quick Read】: This paper addresses the performance bottleneck caused by limited data in medical image segmentation, in particular the degradation that occurs when common data augmentations (e.g., rotation, flipping) destroy the spatial alignment between text and image in multimodal learning. The key is an early-fusion framework that fuses text features with visual features before augmentation, preserving image-text spatial consistency; a lightweight generator maps text embeddings into the visual space to bridge the semantic gap, and visualizations of the generated pseudo-images confirm accurate region localization.
Link: https://arxiv.org/abs/2510.12482
Authors: Shurong Chai,Rahul Kumar JAIN,Rui Xu,Shaocong Mo,Ruibo Hou,Shiyu Teng,Jiaqing Liu,Lanfen Lin,Yen-Wei Chen
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:Deep learning relies heavily on data augmentation to mitigate limited data, especially in medical imaging. Recent multimodal learning integrates text and images for segmentation, known as referring or text-guided image segmentation. However, common augmentations like rotation and flipping disrupt spatial alignment between image and text, weakening performance. To address this, we propose an early fusion framework that combines text and visual features before augmentation, preserving spatial consistency. We also design a lightweight generator that projects text embeddings into visual space, bridging semantic gaps. Visualization of generated pseudo-images shows accurate region localization. Our method is evaluated on three medical imaging tasks and four segmentation frameworks, achieving state-of-the-art results. Code is publicly available on GitHub: this https URL.
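A minimal sketch of the fuse-then-augment order described above follows; the generator architecture, channel counts, and grid size are hypothetical, not the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms.functional as TF

class TextToVisualGenerator(nn.Module):
    """Projects a text embedding into a spatial map so text and image can be
    fused before augmentation (hypothetical architecture)."""
    def __init__(self, text_dim=512, ch=4, grid=32):
        super().__init__()
        self.ch, self.grid = ch, grid
        self.proj = nn.Linear(text_dim, ch * grid * grid)

    def forward(self, text_emb, out_hw):
        m = self.proj(text_emb).view(-1, self.ch, self.grid, self.grid)
        return F.interpolate(m, size=out_hw, mode="bilinear",
                             align_corners=False)

gen = TextToVisualGenerator()
image = torch.rand(2, 3, 256, 256)
text_emb = torch.randn(2, 512)
text_map = gen(text_emb, tuple(image.shape[-2:]))   # (2, 4, 256, 256)
fused = torch.cat([image, text_map], dim=1)         # early fusion: 7 channels

# Augment *after* fusion so rotation/flip keep text and image aligned.
fused_aug = TF.hflip(TF.rotate(fused, angle=15.0))
```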
[CV-39] MS-GAGA: Metric-Selective Guided Adversarial Generation Attack
【Quick Read】: This paper addresses the black-box generation of highly transferable yet visually imperceptible adversarial examples against deepfake detectors. The key is MS-GAGA (Metric-Selective Guided Adversarial Generation Attack), a two-stage framework: in stage one, a dual-stream attack module generates candidates, with MNTD-PGD enhancing gradient computation under small perturbation budgets and SG-PGD concentrating perturbations on visually salient regions, which expands the adversarial search space and improves cross-model transferability; in stage two, a metric-aware selection module based on structural similarity (SSIM) jointly optimizes black-box success rate and fidelity to the original image, achieving up to 27% higher misclassification rates on unseen detectors than state-of-the-art attacks.
Link: https://arxiv.org/abs/2510.12468
Authors: Dion J. X. Ho,Gabriel Lee Jun Rong,Niharika Shrivastava,Harshavardhan Abichandani,Pai Chet Ng,Xiaoxiao Miao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:We present MS-GAGA (Metric-Selective Guided Adversarial Generation Attack), a two-stage framework for crafting transferable and visually imperceptible adversarial examples against deepfake detectors in black-box settings. In Stage 1, a dual-stream attack module generates adversarial candidates: MNTD-PGD applies enhanced gradient calculations optimized for small perturbation budgets, while SG-PGD focuses perturbations on visually salient regions. This complementary design expands the adversarial search space and improves transferability across unseen models. In Stage 2, a metric-aware selection module evaluates candidates based on both their success against black-box models and their structural similarity (SSIM) to the original image. By jointly optimizing transferability and imperceptibility, MS-GAGA achieves up to 27% higher misclassification rates on unseen detectors compared to state-of-the-art attacks.
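For orientation, a vanilla L_inf PGD loop and a toy metric-aware selection step are sketched below; MS-GAGA's MNTD-PGD and SG-PGD variants modify the gradient computation and restrict perturbations to salient regions, which is not reproduced here, and `fooled_fn`/`ssim_fn` are assumed callables.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Vanilla L_inf PGD (baseline sketch)."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = (x_adv + alpha * grad.sign()).detach()   # ascend the loss
        x_adv = torch.clamp(torch.min(torch.max(x_adv, x - eps), x + eps), 0, 1)
    return x_adv

def select_candidate(candidates, x, fooled_fn, ssim_fn):
    """Stage-2-style selection (sketch): among candidates that fool the
    black-box check, keep the one most similar to the original image."""
    best, best_score = candidates[0], -1.0
    for c in candidates:
        score = ssim_fn(c, x)
        if fooled_fn(c) and score > best_score:
            best, best_score = c, score
    return best

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 2))
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 2, (4,))
adv = pgd_attack(model, x, y)
```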
[CV-40] A Function Centric Perspective On Flat and Sharp Minima
【Quick Read】: This paper revisits the conventional belief that flat minima correlate with better generalization in deep neural networks, a link challenged by recent theoretical counterexamples and empirical exceptions. The key proposal is to reinterpret the sharpness that emerges during gradient descent as a function-dependent property rather than a reliable indicator of poor generalization. Extensive empirical studies, from single-objective optimization to modern image classification, show that under regularization (e.g., SAM, weight decay, or data augmentation) models often converge to sharper minima that nevertheless coincide with better generalization, calibration, robustness, and functional consistency, while unregularized baselines tend to reach flatter minima yet score worse on these safety metrics. The paper therefore argues that function complexity, rather than flatness alone, governs the geometry of solutions, that sharper minima can reflect more appropriate inductive biases, and calls for a function-centric reappraisal of loss-landscape geometry.
Link: https://arxiv.org/abs/2510.12451
Authors: Israel Mason-Williams,Gabryel Mason-Williams,Helen Yannakoudakis
Affiliations: UKRI Safe and Trusted AI; Imperial and King's College London; Queen Mary University of London; King's College London
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 26 tables, 63 figures, pre-print
Abstract:Flat minima are widely believed to correlate with improved generalisation in deep neural networks. However, this connection has proven more nuanced in recent studies, with both theoretical counterexamples and empirical exceptions emerging in the literature. In this paper, we revisit the role of sharpness in model performance, proposing that sharpness is better understood as a function-dependent property rather than a reliable indicator of poor generalisation. We conduct extensive empirical studies, from single-objective optimisation to modern image classification tasks, showing that sharper minima often emerge when models are regularised (e.g., via SAM, weight decay, or data augmentation), and that these sharp minima can coincide with better generalisation, calibration, robustness, and functional consistency. Across a range of models and datasets, we find that baselines without regularisation tend to converge to flatter minima yet often perform worse across all safety metrics. Our findings demonstrate that function complexity, rather than flatness alone, governs the geometry of solutions, and that sharper minima can reflect more appropriate inductive biases (especially under regularisation), calling for a function-centric reappraisal of loss landscape geometry.
[CV-41] A Review of Longitudinal Radiology Report Generation: Dataset Composition Methods and Performance Evaluation
【Quick Read】: This paper addresses the fact that current chest X-ray radiology report generation (CXR-RRG) models mostly rely on single images and cannot capture longitudinal clinical information, so the generated reports lack accurate comparison statements relative to a patient's prior imaging. The key is to incorporate longitudinal data and build a systematic framework for Longitudinal Radiology Report Generation (LRRG) research, covering dataset construction strategies, longitudinally tailored model architectures, and evaluation protocols including longitudinal-specific metrics, highlighting the central role of longitudinal information and architectural design choices in improving model performance.
Link: https://arxiv.org/abs/2510.12444
Authors: Shaoyang Zhou,Yingshu Li,Yunyi Liu,Lingqiao Liu,Lei Wang,Luping Zhou
Affiliations: University of Sydney; University of Adelaide; University of Wollongong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Chest X-ray imaging is a widely used diagnostic tool in modern medicine, and its high utilization creates substantial workloads for radiologists. To alleviate this burden, vision language models are increasingly applied to automate chest X-ray radiology report generation (CXR-RRG), aiming for clinically accurate descriptions while reducing manual effort. Conventional approaches, however, typically rely on single images, failing to capture the longitudinal context necessary for producing clinically faithful comparison statements. Recently, growing attention has been directed toward incorporating longitudinal data into CXR-RRG, enabling models to leverage historical studies in ways that mirror radiologists' diagnostic workflows. Nevertheless, existing surveys primarily address single-image CXR-RRG and offer limited guidance for longitudinal settings, leaving researchers without a systematic framework for model design. To address this gap, this survey provides the first comprehensive review of longitudinal radiology report generation (LRRG). Specifically, we examine dataset construction strategies, report generation architectures alongside longitudinally tailored designs, and evaluation protocols encompassing both longitudinal-specific measures and widely used benchmarks. We further summarize the performance of LRRG methods, alongside analyses of ablation studies, which collectively highlight the critical role of longitudinal information and architectural design choices in improving model performance. Finally, we summarize five major limitations of current research and outline promising directions for future development, aiming to lay a foundation for advancing this emerging field.
[CV-42] VideoLucy: Deep Memory Backtracking for Long Video Understanding NEURIPS-2025
【Quick Read】: This paper addresses two core problems of current LLM-based agent systems for long video understanding: models typically reason over individual frames and struggle to capture the temporal context of consecutive frames, and the sparse frame sampling used to cut the cost of dense frame-level captioning risks discarding key information. The key is VideoLucy, a deep memory backtracking framework inspired by the human coarse-to-fine recollection process: a hierarchical memory structure with progressive granularity explicitly defines the temporal scope and detail level of memory at each depth, and an agent-based iterative backtracking mechanism systematically mines video-wide, question-relevant deep memories until enough information is gathered for a confident answer. This design enables effective temporal understanding of consecutive frames while preserving critical details, significantly improving long-video understanding performance.
Link: https://arxiv.org/abs/2510.12422
Authors: Jialong Zuo,Yongtai Deng,Lingdong Kong,Jingkang Yang,Rui Jin,Yiwei Zhang,Nong Sang,Liang Pan,Ziwei Liu,Changxin Gao
Affiliations: Huazhong University of Science and Technology; NUS; S-Lab, NTU; Shanghai AI Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: NeurIPS-2025 Accepted Paper
Abstract:Recent studies have shown that agent-based systems leveraging large language models (LLMs) for key information retrieval and integration have emerged as a promising approach for long video understanding. However, these systems face two major challenges. First, they typically perform modeling and reasoning on individual frames, struggling to capture the temporal context of consecutive frames. Second, to reduce the cost of dense frame-level captioning, they adopt sparse frame sampling, which risks discarding crucial information. To overcome these limitations, we propose VideoLucy, a deep memory backtracking framework for long video understanding. Inspired by the human recollection process from coarse to fine, VideoLucy employs a hierarchical memory structure with progressive granularity. This structure explicitly defines the detail level and temporal scope of memory at different hierarchical depths. Through an agent-based iterative backtracking mechanism, VideoLucy systematically mines video-wide, question-relevant deep memories until sufficient information is gathered to provide a confident answer. This design enables effective temporal understanding of consecutive frames while preserving critical details. In addition, we introduce EgoMem, a new benchmark for long video understanding. EgoMem is designed to comprehensively evaluate a model’s ability to understand complex events that unfold over time and capture fine-grained details in extremely long videos. Extensive experiments demonstrate the superiority of VideoLucy. Built on open-source models, VideoLucy significantly outperforms state-of-the-art methods on multiple long video understanding benchmarks, achieving performance even surpassing the latest proprietary models such as GPT-4o. Our code and dataset will be made publicly at this https URL
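The coarse-to-fine backtracking loop can be sketched as plain control flow. This is schematic: `caption`, `relevant`, and `answerable` stand in for MLLM/LLM calls, and the hierarchy spans are made up.

```python
def backtrack(question, video, caption, relevant, answerable, spans=(64, 16, 4)):
    """Coarse-to-fine memory backtracking (schematic control flow)."""
    memory, frontier = [], [(0, len(video))]
    for span in spans:                       # progressively finer granularity
        next_frontier = []
        for start, end in frontier:
            for s in range(start, end, span):
                entry = caption(video[s:s + span])   # memory at this level
                memory.append((s, entry))
                if relevant(question, entry):        # drill into relevant clips
                    next_frontier.append((s, min(s + span, end)))
        if answerable(question, memory):             # enough evidence: stop
            break
        frontier = next_frontier or frontier
    return memory

# Toy stand-ins so the sketch runs:
mem = backtrack("what happens?", list(range(256)),
                caption=lambda clip: f"frames {clip[0]}-{clip[-1]}",
                relevant=lambda q, e: e.endswith("63"),
                answerable=lambda q, m: len(m) > 8)
```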
[CV-43] Low-Field Magnetic Resonance Image Quality Enhancement using a Conditional Flow Matching Model
【Quick Read】: This paper targets the low signal-to-noise ratio and poor diagnostic quality of low-field magnetic resonance imaging (LF-MRI), aiming to reconstruct high-field-like images from low-field inputs and thus raise diagnostic value without expensive hardware. The key is a new framework based on Conditional Flow Matching (CFM), which learns a continuous flow between a noise distribution and the target data distribution through direct regression of the optimal velocity field, rather than iterative sampling or adversarial objectives, yielding clearly better parameter efficiency and generalization than competing deep learning methods.
Link: https://arxiv.org/abs/2510.12408
Authors: Huu Tien Nguyen,Ahmed Karam Eldaly
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:
Abstract:This paper introduces a novel framework for image quality transfer based on conditional flow matching (CFM). Unlike conventional generative models that rely on iterative sampling or adversarial objectives, CFM learns a continuous flow between a noise distribution and target data distributions through the direct regression of an optimal velocity field. We evaluate this approach in the context of low-field magnetic resonance imaging (LF-MRI), a rapidly emerging modality that offers affordable and portable scanning but suffers from inherently low signal-to-noise ratio and reduced diagnostic quality. Our framework is designed to reconstruct high-field-like MR images from their corresponding low-field inputs, thereby bridging the quality gap without requiring expensive infrastructure. Experiments demonstrate that CFM not only achieves state-of-the-art performance, but also generalizes robustly to both in-distribution and out-of-distribution data. Importantly, it does so while utilizing significantly fewer parameters than competing deep learning methods. These results underline the potential of CFM as a powerful and scalable tool for MRI reconstruction, particularly in resource-limited clinical environments.
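Conditional flow matching has a compact canonical training step: regress the velocity of a straight path from noise to data, whose target is v* = x1 - x0, conditioned here on the low-field image. The sketch below is generic CFM, not the paper's model; the toy network ignores the time input.

```python
import torch

def cfm_step(v_model, x1, cond, optimizer):
    """One generic conditional flow-matching update on the straight path
    x_t = (1 - t) x0 + t x1, whose target velocity is v* = x1 - x0."""
    x0 = torch.randn_like(x1)                         # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)
    tb = t.view(-1, *([1] * (x1.dim() - 1)))
    xt = (1 - tb) * x0 + tb * x1
    loss = ((v_model(xt, t, cond) - (x1 - x0)) ** 2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Toy setup: low-field image as condition, high-field image as target x1.
net = torch.nn.Conv2d(2, 1, 3, padding=1)             # toy net ignores t
v_model = lambda xt, t, cond: net(torch.cat([xt, cond], dim=1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
hf, lf = torch.rand(4, 1, 32, 32), torch.rand(4, 1, 32, 32)
cfm_step(v_model, hf, lf, opt)
# Inference would integrate dx/dt = v(x, t, cond) from t = 0 to 1 (e.g. Euler).
```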
[CV-44] Towards General Urban Monitoring with Vision-Language Models: A Review, Evaluation, and a Research Agenda
【Quick Read】: This paper addresses the challenges of monitoring urban public infrastructure (e.g., waste bins, road signs, vegetation, sidewalks, and construction sites) arising from object diversity, environmental complexity, and contextual variation; conventional approaches rely on IoT sensors and manual inspection, which are costly, hard to scale, and disconnected from citizens' direct visual perception. The key is to leverage Vision-Language Models (VLMs), which couple visual understanding with natural-language reasoning so that machines can "see" like citizens and infer informed judgments about infrastructure condition, with particular emphasis on zero-shot applications, enabling more efficient, scalable, and human-aligned urban monitoring.
Link: https://arxiv.org/abs/2510.12400
Authors: André Torneiro,Diogo Monteiro,Paulo Novais,Pedro Rangel Henriques,Nuno F. Rodrigues
Affiliations: University of Minho; Instituto Politécnico de Cávado e do Douro
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 44 pages
Abstract:Urban monitoring of public infrastructure (such as waste bins, road signs, vegetation, sidewalks, and construction sites) poses significant challenges due to the diversity of objects, environments, and contextual conditions involved. Current state-of-the-art approaches typically rely on a combination of IoT sensors and manual inspections, which are costly, difficult to scale, and often misaligned with citizens’ perception formed through direct visual observation. This raises a critical question: Can machines now “see” like citizens and infer informed opinions about the condition of urban infrastructure? Vision-Language Models (VLMs), which integrate visual understanding with natural language reasoning, have recently demonstrated impressive capabilities in processing complex visual information, turning them into a promising technology to address this challenge. This systematic review investigates the role of VLMs in urban monitoring, with particular emphasis on zero-shot applications. Following the PRISMA methodology, we analyzed 32 peer-reviewed studies published between 2021 and 2025 to address four core research questions: (1) What urban monitoring tasks have been effectively addressed using VLMs? (2) Which VLM architectures and frameworks are most commonly used and demonstrate superior performance? (3) What datasets and resources support this emerging field? (4) How are VLM-based applications evaluated, and what performance levels have been reported?
[CV-45] Scene Coordinate Reconstruction Priors ICCV2025
【Quick Read】: This paper addresses the degeneration of Scene Coordinate Regression (SCR) models when the training images imply insufficient multi-view constraints, which limits reconstruction quality in single-scene training. The key is a probabilistic reinterpretation of SCR training that allows high-level reconstruction priors to be infused, ranging from simple priors over the distribution of depth values to a learned prior from a 3D point cloud diffusion model trained on a large corpus of indoor scans. At each training step the priors push the predicted 3D scene points toward physically plausible geometry, improving the quality of the scene representation, camera-pose estimation, and downstream tasks such as novel view synthesis and visual relocalization.
Link: https://arxiv.org/abs/2510.12387
Authors: Wenjing Bian,Axel Barroso-Laguna,Tommaso Cavallari,Victor Adrian Prisacariu,Eric Brachmann
Affiliations: Niantic Spatial; University of Oxford
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025, Project page: this https URL
Abstract:Scene coordinate regression (SCR) models have proven to be powerful implicit scene representations for 3D vision, enabling visual relocalization and structure-from-motion. SCR models are trained specifically for one scene. If training images imply insufficient multi-view constraints SCR models degenerate. We present a probabilistic reinterpretation of training SCR models, which allows us to infuse high-level reconstruction priors. We investigate multiple such priors, ranging from simple priors over the distribution of reconstructed depth values to learned priors over plausible scene coordinate configurations. For the latter, we train a 3D point cloud diffusion model on a large corpus of indoor scans. Our priors push predicted 3D scene points towards plausible geometry at each training step to increase their likelihood. On three indoor datasets our priors help learning better scene representations, resulting in more coherent scene point clouds, higher registration rates and better camera poses, with a positive effect on down-stream tasks such as novel view synthesis and camera relocalization.
[CV-46] Learning to Recognize Correctly Completed Procedure Steps in Egocentric Assembly Videos through Spatio-Temporal Modeling
【Quick Read】: This paper addresses the limited robustness and accuracy of Procedure Step Recognition (PSR) under partial object occlusion: existing methods detect assembly states only in individual frames and ignore the temporal dimension, making it hard to recognize step completion when objects are occluded. The key is STORM-PSR, a dual-stream framework: an assembly state detection stream works effectively with unobstructed views, while a spatio-temporal stream combines a spatial encoder (pre-trained with a novel weakly supervised strategy to capture meaningful spatial representations) with a transformer-based temporal encoder that learns how these spatial features relate over time; the spatio-temporal stream can infer step completion without a full view of the object, substantially improving accuracy under occlusion. On MECCANO and IndustReal, the method reduces the average delay between actual and predicted step completions by 11.2% and 26.1%, respectively.
Link: https://arxiv.org/abs/2510.12385
Authors: Tim J. Schoonbeek,Shao-Hsuan Hung,Dan Lehman,Hans Onvlee,Jacek Kustra,Peter H.N. de With,Fons van der Sommen
Affiliations: Eindhoven University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 26 pages, 7 figures and 5 tables in the main paper and one figure and table in the appendix. To be published in Computer Vision and Image Understanding
Abstract:Procedure step recognition (PSR) aims to identify all correctly completed steps and their sequential order in videos of procedural tasks. The existing state-of-the-art models rely solely on detecting assembly object states in individual video frames. By neglecting temporal features, model robustness and accuracy are limited, especially when objects are partially occluded. To overcome these limitations, we propose Spatio-Temporal Occlusion-Resilient Modeling for Procedure Step Recognition (STORM-PSR), a dual-stream framework for PSR that leverages both spatial and temporal features. The assembly state detection stream operates effectively with unobstructed views of the object, while the spatio-temporal stream captures both spatial and temporal features to recognize step completions even under partial occlusion. This stream includes a spatial encoder, pre-trained using a novel weakly supervised approach to capture meaningful spatial representations, and a transformer-based temporal encoder that learns how these spatial features relate over time. STORM-PSR is evaluated on the MECCANO and IndustReal datasets, reducing the average delay between actual and predicted assembly step completions by 11.2% and 26.1%, respectively, compared to prior methods. We demonstrate that this reduction in delay is driven by the spatio-temporal stream, which does not rely on unobstructed views of the object to infer completed steps. The code for STORM-PSR, along with the newly annotated MECCANO labels, is made publicly available at this https URL .
[CV-47] Deep Attention-guided Adaptive Subsampling
【Quick Read】: This paper addresses the high computational complexity and resource cost of deep neural networks in 3D volume or video classification caused by inherent redundancy. Existing learnable subsampling methods are task-adaptive rather than input-adaptive: once trained, the sampling strategy is fixed and cannot adjust to the input content, limiting practical utility. The key is an attention-guided sampling module that adapts the sampling strategy to the input even at inference, reducing model complexity while maintaining performance, validated on medical imaging datasets (MedMNIST3D) and on ultrasound video classification, including an in-house dataset collected under real-world clinical conditions.
Link: https://arxiv.org/abs/2510.12376
Authors: Sharath M Shankaranarayana,Soumava Kumar Roy,Prasad Sudhakar,Chandan Aladahalli
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:
Abstract:Although deep neural networks have provided impressive gains in performance, these improvements often come at the cost of increased computational complexity and expense. In many cases, such as 3D volume or video classification tasks, not all slices or frames are necessary due to inherent redundancies. To address this issue, we propose a novel learnable subsampling framework that can be integrated into any neural network architecture. Subsampling, being a non-differentiable operation, poses significant challenges for direct adaptation into deep learning models. While some works have proposed solutions using the Gumbel-max trick to overcome the problem of non-differentiability, they fall short in a crucial aspect: they are only task-adaptive and not input-adaptive. Once the sampling mechanism is learned, it remains static and does not adjust to different inputs, making it unsuitable for real-world applications. To this end, we propose an attention-guided sampling module that adapts to inputs even during inference. This dynamic adaptation results in performance gains and reduces complexity in deep neural network models. We demonstrate the effectiveness of our method on 3D medical imaging datasets from MedMNIST3D as well as two ultrasound video datasets for classification tasks, one of them being a challenging in-house dataset collected under real-world clinical conditions.
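An input-adaptive, differentiable frame sampler in the spirit described above can be sketched with Gumbel-softmax; the scoring head, the without-replacement masking trick, and the choice of k below are our assumptions, not the paper's module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFrameSampler(nn.Module):
    """Score each frame with a tiny attention head, then draw k frames with
    Gumbel-softmax so selection stays differentiable during training and
    adapts to each input at inference (illustrative sketch)."""
    def __init__(self, feat_dim, k):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)
        self.k = k

    def forward(self, frame_feats, tau=1.0):
        # frame_feats: (B, T, D) per-frame features
        logits = self.score(frame_feats).squeeze(-1)          # (B, T)
        picks = []
        for _ in range(self.k):  # approx. sampling without replacement
            w = F.gumbel_softmax(logits, tau=tau, hard=True)  # (B, T) one-hot
            picks.append(torch.einsum("bt,btd->bd", w, frame_feats))
            logits = logits + torch.log1p(-w * (1 - 1e-6))    # mask chosen frame
        return torch.stack(picks, dim=1)                      # (B, k, D)

sampler = AttentionFrameSampler(feat_dim=128, k=4)
feats = torch.randn(2, 32, 128)
subset = sampler(feats)   # (2, 4, 128); the selection depends on the input
```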
[CV-48] CurriFlow: Curriculum-Guided Depth Fusion with Optical Flow-Based Temporal Alignment for 3D Semantic Scene Completion
【Quick Read】: This paper addresses the lack of explicit motion reasoning, difficulty with occlusions, and noisy depth supervision in monocular Semantic Scene Completion (SSC) methods that rely on temporal stacking or depth projection. The key is CurriFlow, which combines optical-flow-based temporal alignment with curriculum-guided depth fusion: pre-trained optical flow aligns multi-level features (segmentation, visual, and depth) across frames to improve temporal consistency and dynamic-object understanding; a curriculum learning strategy transitions gradually from sparse but accurate LiDAR depth to dense but noisy stereo depth, ensuring training stability and real-world adaptability; and semantic priors from the Segment Anything Model (SAM) provide category-agnostic supervision that strengthens voxel-level semantic learning and spatial consistency.
Link: https://arxiv.org/abs/2510.12362
Authors: Jinzhou Lin,Jie Zhou,Wenhao Xu,Rongtao Xu,Changwei Wang,Shunpeng Chen,Kexue Fu,Yihua Shao,Li Guo,Shibiao Xu
Affiliations: Beijing University of Posts and Telecommunications; Chinese Academy of Sciences; Shandong Computer Science Center; Qilu University of Technology; Shandong Provincial Key Laboratory of Computing Power Internet and Service Computing; Shandong Fundamental Research Center for Computer Science
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Semantic Scene Completion (SSC) aims to infer complete 3D geometry and semantics from monocular images, serving as a crucial capability for camera-based perception in autonomous driving. However, existing SSC methods relying on temporal stacking or depth projection often lack explicit motion reasoning and struggle with occlusions and noisy depth supervision. We propose CurriFlow, a novel semantic occupancy prediction framework that integrates optical flow-based temporal alignment with curriculum-guided depth fusion. CurriFlow employs a multi-level fusion strategy to align segmentation, visual, and depth features across frames using pre-trained optical flow, thereby improving temporal consistency and dynamic object understanding. To enhance geometric robustness, a curriculum learning mechanism progressively transitions from sparse yet accurate LiDAR depth to dense but noisy stereo depth during training, ensuring stable optimization and seamless adaptation to real-world deployment. Furthermore, semantic priors from the Segment Anything Model (SAM) provide category-agnostic supervision, strengthening voxel-level semantic learning and spatial consistency. Experiments on the SemanticKITTI benchmark demonstrate that CurriFlow achieves state-of-the-art performance with a mean IoU of 16.9, validating the effectiveness of our motion-guided and curriculum-aware design for camera-based 3D semantic scene completion.
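To picture the curriculum transition from sparse LiDAR depth to dense stereo depth, here is a small sketch under an assumed linear schedule and a simple validity-mask blending rule; CurriFlow's actual schedule and fusion may differ.

```python
import torch

def curriculum_depth(lidar_depth, stereo_depth, step, total_steps):
    """Blend depth supervision from LiDAR toward stereo over training.

    lidar_depth: (B, H, W) sparse but accurate metric depth, 0 where no return.
    stereo_depth: (B, H, W) dense but noisy depth.
    """
    w = max(0.0, 1.0 - step / total_steps)       # 1.0 -> 0.0, linear (assumed)
    valid = (lidar_depth > 0).float()            # LiDAR hit mask
    # Interpolate where LiDAR is valid; elsewhere fall back to stereo.
    return valid * (w * lidar_depth + (1.0 - w) * stereo_depth) \
        + (1.0 - valid) * stereo_depth
```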
[CV-49] Hybrid Gaussian Splatting for Novel Urban View Synthesis ICCV2025
【Quick Read】: This paper tackles novel view synthesis (NVS) for street scenes: starting from car-centric frames captured during training traversals, render the same urban environment as seen from a different traversal (e.g., another lane or driving direction). The key of the solution is a two-stage hybrid pipeline: first, a 3D reconstruction of the scene is fitted with Gaussian Splatting and novel views are rendered from the target cameras; then a dedicated single-step diffusion model enhances the rendered frames. The paper discusses specific choices in initializing the Gaussian primitives and in fine-tuning the enhancer and curating its training data; measured by PSNR, SSIM, and LPIPS, the method reaches an aggregated score of 0.432 and second place overall in the RealADSim-NVS challenge.
Link: https://arxiv.org/abs/2510.12308
Authors: Mohamed Omran,Farhad Zanjani,Davide Abati,Jens Petersen,Amirhossein Habibian
Affiliations: Qualcomm AI Research
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025 RealADSim Workshop
Abstract:This paper describes the Qualcomm AI Research solution to the RealADSim-NVS challenge, hosted at the RealADSim Workshop at ICCV 2025. The challenge concerns novel view synthesis in street scenes, and participants are required to generate, starting from car-centric frames captured during some training traversals, renders of the same urban environment as viewed from a different traversal (e.g. different street lane or car direction). Our solution is inspired by hybrid methods in scene generation and generative simulators merging gaussian splatting and diffusion models, and it is composed of two stages: First, we fit a 3D reconstruction of the scene and render novel views as seen from the target cameras. Then, we enhance the resulting frames with a dedicated single-step diffusion model. We discuss specific choices made in the initialization of gaussian primitives as well as the finetuning of the enhancer model and its training data curation. We report the performance of our model design and we ablate its components in terms of novel view quality as measured by PSNR, SSIM and LPIPS. On the public leaderboard reporting test results, our proposal reaches an aggregated score of 0.432, achieving the second place overall.
[CV-50] Dual Learning with Dynamic Knowledge Distillation and Soft Alignment for Partially Relevant Video Retrieval
【Quick Read】: This paper focuses on the practical yet challenging task of Partially Relevant Video Retrieval (PRVR): retrieving long, untrimmed videos with complex background content that are only partially relevant to a given query, instead of assuming pre-trimmed, single-topic clips. The key of the solution is a Dual Learning framework with Dynamic Knowledge Distillation (DL-DKD++), in which a large pre-trained vision-language teacher supervises a lightweight dual-branch student: an inheritance branch absorbs transferable knowledge, while an exploration branch learns PRVR-specific features to bridge the domain gap. A dynamic soft-target construction mechanism further replaces rigid hard targets with adaptive soft targets that evolve during training, better capturing the fine-grained partial relevance between videos and queries.
Link: https://arxiv.org/abs/2510.12283
Authors: Jianfeng Dong,Lei Huang,Daizong Liu,Xianke Chen,Xun Yang,Changting Lin,Xun Wang,Meng Wang
Affiliations: Zhejiang Gongshang University; Peking University; University of Science and Technology of China; Zhejiang University; Hefei University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Almost all previous text-to-video retrieval works ideally assume that videos are pre-trimmed with short durations containing solely text-related content. However, in practice, videos are typically untrimmed in long durations with much more complicated background content. Therefore, in this paper, we focus on the more practical yet challenging task of Partially Relevant Video Retrieval (PRVR), which aims to retrieve partially relevant untrimmed videos with the given query. To tackle this task, we propose a novel framework that distills generalization knowledge from a powerful large-scale vision-language pre-trained model and transfers it to a lightweight, task-specific PRVR network. Specifically, we introduce a Dual Learning framework with Dynamic Knowledge Distillation (DL-DKD++), where a large teacher model provides supervision to a compact dual-branch student network. The student model comprises two branches: an inheritance branch that absorbs transferable knowledge from the teacher, and an exploration branch that learns task-specific information from the PRVR dataset to address domain gaps. To further enhance learning, we incorporate a dynamic soft-target construction mechanism. By replacing rigid hard-target supervision with adaptive soft targets that evolve during training, our method enables the model to better capture the fine-grained, partial relevance between videos and queries. Experiment results demonstrate that our proposed model achieves state-of-the-art performance on TVR, ActivityNet, and Charades-STA datasets for PRVR. The code is available at this https URL.
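The dynamic soft-target mechanism can be read as a distillation loss whose target drifts from the annotated hard positive toward the teacher's similarity distribution. The sketch below assumes a linear trust schedule and a temperature `tau`; the names and the schedule are assumptions, not DL-DKD++'s actual formulation.

```python
import torch
import torch.nn.functional as F

def dynamic_soft_kd_loss(student_sim, teacher_sim, pos_idx,
                         step, total_steps, tau=0.07):
    """Distill query-video similarities with targets that soften over training.

    student_sim / teacher_sim: (B, V) similarities of each query to V videos.
    pos_idx: (B,) index of the annotated positive video (the hard target).
    """
    hard = F.one_hot(pos_idx, student_sim.size(1)).float()
    soft = F.softmax(teacher_sim / tau, dim=-1)     # teacher's relevance belief
    alpha = min(1.0, step / total_steps)            # trust teacher more over time (assumed)
    target = (1.0 - alpha) * hard + alpha * soft
    log_p = F.log_softmax(student_sim / tau, dim=-1)
    return -(target * log_p).sum(dim=-1).mean()     # CE against the adaptive soft target
```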
[CV-51] PAGS: Priority-Adaptive Gaussian Splatting for Dynamic Driving Scenes
【Quick Read】: This paper addresses the trade-off between fidelity and computational cost in dynamic 3D urban scene reconstruction: semantically agnostic designs allocate resources uniformly, treating static backgrounds and safety-critical objects as equally important. The key of the proposed Priority-Adaptive Gaussian Splatting (PAGS) is to inject task-aware semantic priorities into the reconstruction and rendering pipeline through two mechanisms: a semantically guided pruning and regularization strategy uses a hybrid importance metric to aggressively simplify non-critical scene elements while preserving fine detail on navigation-critical objects; and a priority-driven rendering pipeline uses a priority-based depth pre-pass to cull occluded primitives and accelerate the final shading. On Waymo and KITTI, PAGS achieves high-quality reconstruction (especially of safety-critical objects) while reducing training time and pushing rendering speed above 350 FPS.
Link: https://arxiv.org/abs/2510.12282
Authors: Ying A,Wenzhang Sun,Chang Zeng,Chunfeng Wang,Hao Li,Jianxun Cui
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Reconstructing dynamic 3D urban scenes is crucial for autonomous driving, yet current methods face a stark trade-off between fidelity and computational cost. This inefficiency stems from their semantically agnostic design, which allocates resources uniformly, treating static backgrounds and safety-critical objects with equal importance. To address this, we introduce Priority-Adaptive Gaussian Splatting (PAGS), a framework that injects task-aware semantic priorities directly into the 3D reconstruction and rendering pipeline. PAGS introduces two core contributions: (1) Semantically-Guided Pruning and Regularization strategy, which employs a hybrid importance metric to aggressively simplify non-critical scene elements while preserving fine-grained details on objects vital for navigation. (2) Priority-Driven Rendering pipeline, which employs a priority-based depth pre-pass to aggressively cull occluded primitives and accelerate the final shading computations. Extensive experiments on the Waymo and KITTI datasets demonstrate that PAGS achieves exceptional reconstruction quality, particularly on safety-critical objects, while significantly reducing training time and boosting rendering speeds to over 350 FPS.
[CV-52] SpineBench: Benchmarking Multimodal LLM s for Spinal Pathology Analysis
【Quick Read】: This paper addresses the inadequate evaluation of Multimodal Large Language Models (MLLMs) in the spinal domain, where performance on vision-dependent fine-grained tasks such as spinal disease diagnosis and lesion localization is weak and no systematic benchmark exists. The key of the solution is SpineBench, a spine-focused Visual Question Answering (VQA) benchmark comprising 64,878 QA pairs over 40,263 spine images covering 11 spinal diseases across two clinical tasks (diagnosis and lesion localization, both multiple-choice). It is built by integrating and standardizing image-label pairs from open-source spinal datasets and by sampling visually similar hard negative options for each QA pair (similar but not the same disease) to simulate challenging real-world clinical scenarios, enabling fine-grained evaluation of MLLMs for spinal medicine.
Link: https://arxiv.org/abs/2510.12267
Authors: Chenghanyu Zhang,Zekun Li,Peipei Li,Xing Cui,Shuhan Xia,Weixiang Yan,Yiqiao Zhang,Qianyu Zhuang
Affiliations: Beijing University of Posts and Telecommunications; University of California, Santa Barbara; Peking Union Medical College Hospital
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Proceedings of the 33rd ACM International Conference on Multimedia, ACMMM 2025 Dataset Track
Abstract:With the increasing integration of Multimodal Large Language Models (MLLMs) into the medical field, comprehensive evaluation of their performance in various medical domains becomes critical. However, existing benchmarks primarily assess general medical tasks, inadequately capturing performance in nuanced areas like the spine, which relies heavily on visual input. To address this, we introduce SpineBench, a comprehensive Visual Question Answering (VQA) benchmark designed for fine-grained analysis and evaluation of MLLMs in the spinal domain. SpineBench comprises 64,878 QA pairs from 40,263 spine images, covering 11 spinal diseases through two critical clinical tasks: spinal disease diagnosis and spinal lesion localization, both in multiple-choice format. SpineBench is built by integrating and standardizing image-label pairs from open-source spinal disease datasets, and samples challenging hard negative options for each VQA pair based on visual similarity (similar but not the same disease), simulating real-world challenging scenarios. We evaluate 12 leading MLLMs on SpineBench. The results reveal that these models exhibit poor performance in spinal tasks, highlighting limitations of current MLLMs in the spine domain and guiding future improvements in spinal medicine applications. SpineBench is publicly available at this https URL.
[CV-53] AngularFuse: A Closer Look at Angle-based Perception for Spatial-Sensitive Multi-Modality Image Fusion
【Quick Read】: This paper targets two core problems of current unsupervised visible-infrared image fusion: reference images built by existing strategies lack detail and have uneven brightness, and widely used gradient losses consider only gradient magnitude while ignoring direction, degrading edge-structure fidelity. The key of the proposed angle-based perception framework (AngularFuse) is threefold: a cross-modal complementary mask module forces the network to learn complementary information between modalities; a fine-grained reference synthesis strategy combines Laplacian edge enhancement with adaptive histogram equalization to generate references with richer detail and more balanced brightness; and an angle-aware loss constrains, for the first time, both the magnitude and the direction of gradients in the gradient domain, so fused images preserve texture intensity and correct edge orientation.
Link: https://arxiv.org/abs/2510.12260
Authors: Xiaopeng Liu,Yupei Lin,Sen Zhang,Xiao Wang,Yukai Shi,Liang Lin
Affiliations: Guangdong University of Technology; TikTok; Anhui University; Sun Yat-sen University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: For the first time, angle-based perception was introduced into the multi-modality image fusion task
Abstract:Visible-infrared image fusion is crucial in key applications such as autonomous driving and nighttime surveillance. Its main goal is to integrate multimodal information to produce enhanced images that are better suited for downstream tasks. Although deep learning based fusion methods have made significant progress, mainstream unsupervised approaches still face serious challenges in practical applications. Existing methods mostly rely on manually designed loss functions to guide the fusion process. However, these loss functions have obvious limitations. On one hand, the reference images constructed by existing methods often lack details and have uneven brightness. On the other hand, the widely used gradient losses focus only on gradient magnitude. To address these challenges, this paper proposes an angle-based perception framework for spatial-sensitive image fusion (AngularFuse). First, we design a cross-modal complementary mask module to force the network to learn complementary information between modalities. Then, a fine-grained reference image synthesis strategy is introduced. By combining Laplacian edge enhancement with adaptive histogram equalization, reference images with richer details and more balanced brightness are generated. Last but not least, we introduce an angle-aware loss, which for the first time constrains both gradient magnitude and direction simultaneously in the gradient domain. AngularFuse ensures that the fused images preserve both texture intensity and correct edge orientation. Comprehensive experiments on the MSRS, RoadScene, and M3FD public datasets show that AngularFuse outperforms existing mainstream methods by a clear margin. Visual comparisons further confirm that our method produces sharper and more detailed results in challenging scenes, demonstrating superior fusion capability.
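One plausible way to constrain both gradient magnitude and direction is an L1 term on Sobel magnitudes plus a magnitude-weighted cosine term on gradient orientation. The sketch below assumes single-channel inputs and should be read as an interpretation of the loss, not AngularFuse's exact definition.

```python
import torch
import torch.nn.functional as F

def sobel_grads(img):
    """Horizontal/vertical gradients of a single-channel (B, 1, H, W) image."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=img.device).view(1, 1, 3, 3)
    ky = kx.transpose(-1, -2)
    return F.conv2d(img, kx, padding=1), F.conv2d(img, ky, padding=1)

def angle_aware_loss(fused, reference, eps=1e-6):
    """Penalize gradient-magnitude error plus gradient-direction error."""
    fx, fy = sobel_grads(fused)
    rx, ry = sobel_grads(reference)
    mag_f = torch.sqrt(fx ** 2 + fy ** 2 + eps)
    mag_r = torch.sqrt(rx ** 2 + ry ** 2 + eps)
    mag_loss = F.l1_loss(mag_f, mag_r)
    # Direction term: 1 - cos(angle between gradient vectors), weighted by
    # the reference magnitude so flat regions do not dominate.
    cos = (fx * rx + fy * ry) / (mag_f * mag_r)
    dir_loss = ((1.0 - cos) * mag_r).sum() / (mag_r.sum() + eps)
    return mag_loss + dir_loss
```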
[CV-54] Local Background Features Matter in Out-of-Distribution Detection
【Quick Read】: This paper addresses out-of-distribution (OOD) detection for deep neural networks deployed in the real world, in particular their tendency to make overconfident predictions on OOD data. The key of the solution is to use local background features of in-distribution (ID) images as simulated OOD features during training: motivated by the observation that OOD images often share similar background regions with ID images, and exploiting the local invariance of convolution, the network is optimized to reduce the L2-norm of these background features, suppressing overconfidence on OOD data without collecting auxiliary OOD datasets or generating fake OOD images. The method also combines well with existing post-hoc approaches on multiple standard OOD detection benchmarks.
Link: https://arxiv.org/abs/2510.12259
Authors: Jinlun Ye,Zhuohao Sun,Yiqiao Qiu,Qiu Li,Zhijun Tan,Ruixuan Wang
Affiliations: Sun Yat-sen University; Peng Cheng Laboratory; China United Network Communications Corporation Limited Guangdong Branch
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Out-of-distribution (OOD) detection is crucial when deploying deep neural networks in the real world to ensure the reliability and safety of their applications. One main challenge in OOD detection is that neural network models often produce overconfident predictions on OOD data. While some methods using auxiliary OOD datasets or generating fake OOD images have shown promising OOD detection performance, they are limited by the high costs of data collection and training. In this study, we propose a novel and effective OOD detection method that utilizes local background features as fake OOD features for model training. Inspired by the observation that OOD images generally share similar background regions with ID images, the background features are extracted from ID images as simulated OOD visual representations during training based on the local invariance of convolution. By being optimized to reduce the L2-norm of these background features, the neural networks are able to alleviate the overconfidence issue on OOD data. Extensive experiments on multiple standard OOD detection benchmarks confirm the effectiveness of our method and its broad compatibility with existing post-hoc methods, with our method achieving new state-of-the-art performance.
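The training signal can be pictured as an auxiliary penalty on the squared L2-norm of background features extracted from ID feature maps. The border-as-background heuristic below is an assumption made for illustration; the paper's selection of local background regions may differ.

```python
import torch

def background_l2_loss(feat_map, border: int = 2):
    """Shrink the L2-norm of border features of ID images, which serve as
    simulated OOD features (the paper's background-selection rule may differ).

    feat_map: (B, C, H, W) convolutional features of in-distribution images.
    """
    b = border
    edges = [feat_map[:, :, :b, :], feat_map[:, :, -b:, :],    # top / bottom
             feat_map[:, :, :, :b], feat_map[:, :, :, -b:]]    # left / right
    bg = torch.cat([e.flatten(2) for e in edges], dim=-1)      # (B, C, P)
    return bg.pow(2).sum(dim=1).mean()    # mean squared L2-norm per location

# assumed total objective: loss = ce_loss + lambda_bg * background_l2_loss(feats)
```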
[CV-55] Multiplicative Loss for Enhancing Semantic Segmentation in Medical and Cellular Images ICCV2025
【Quick Read】: This paper addresses unstable training and suboptimal segmentation of medical and cellular images caused by data scarcity (privacy, ethics, and annotation cost), where the common additive combination of Cross Entropy and Dice Loss is hyperparameter-sensitive and underperforms with limited data. The key of the solution is two novel multiplicative losses: Multiplicative Loss multiplies the Cross Entropy and Dice losses so that gradients are modulated by prediction confidence, penalizing confident correct predictions lightly while amplifying gradients on overconfident errors, which stabilizes optimization; Confidence-Adaptive Multiplicative Loss adds a Focal-Loss-inspired, confidence-driven exponential scaling that integrates predicted probabilities and Dice coefficients to emphasize hard samples, strengthening gradients when confidence is low and improving robustness and efficiency under extreme data scarcity.
Link: https://arxiv.org/abs/2510.12258
Authors: Yuto Yokoi,Kazuhiro Hotta
Affiliations: Meijo University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV2025 Workshop “Third Workshop on Computer Vision for Automated Medical Diagnosis”
Abstract:We propose two novel loss functions, Multiplicative Loss and Confidence-Adaptive Multiplicative Loss, for semantic segmentation in medical and cellular images. Although Cross Entropy and Dice Loss are widely used, their additive combination is sensitive to hyperparameters and often performs suboptimally, especially with limited data. Medical images suffer from data scarcity due to privacy, ethics, and costly annotations, requiring robust and efficient training objectives. Our Multiplicative Loss combines Cross Entropy and Dice losses multiplicatively, dynamically modulating gradients based on prediction confidence. This reduces penalties for confident correct predictions and amplifies gradients for incorrect overconfident ones, stabilizing optimization. Building on this, Confidence-Adaptive Multiplicative Loss applies a confidence-driven exponential scaling inspired by Focal Loss, integrating predicted probabilities and Dice coefficients to emphasize difficult samples. This enhances learning under extreme data scarcity by strengthening gradients when confidence is low. Experiments on cellular and medical segmentation benchmarks show our framework consistently outperforms tuned additive and existing loss functions, offering a simple, effective, and hyperparameter-free mechanism for robust segmentation under challenging data limitations.
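Taking the abstract literally, the first loss is the product of a cross-entropy term and a Dice-loss term, and the confidence-adaptive variant adds a focal-style factor. The binary-segmentation sketch below is one plausible formulation under those assumptions, not the authors' code.

```python
import torch

def soft_dice(probs, target, eps=1e-6):
    """Soft Dice coefficient; probs, target: (B, 1, H, W), probs in (0, 1)."""
    inter = (probs * target).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (2.0 * inter + eps) / (union + eps)                 # (B,)

def multiplicative_loss(probs, target, eps=1e-6):
    ce = -(target * torch.log(probs + eps)
           + (1 - target) * torch.log(1 - probs + eps)).mean(dim=(1, 2, 3))
    return (ce * (1.0 - soft_dice(probs, target))).mean()      # product, not sum

def confidence_adaptive_multiplicative_loss(probs, target, gamma=2.0, eps=1e-6):
    p_t = probs * target + (1 - probs) * (1 - target)          # prob of true class
    focal_ce = -((1 - p_t) ** gamma * torch.log(p_t + eps)).mean(dim=(1, 2, 3))
    return (focal_ce * (1.0 - soft_dice(probs, target))).mean()
```

Note how the product form makes the total loss small whenever either term is small, which matches the stated behavior of reducing penalties on confident correct predictions.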
[CV-56] Vectorized Video Representation with Easy Editing via Hierarchical Spatio-Temporally Consistent Proxy Embedding
【Quick Read】: This paper addresses the fragility of motion and appearance modeling in current video representations, which rely on unstable, over-grained pixel-level matching and tracking; a tracking error of a few pixels, occlusion, or large motion can collapse the visual-object representation. The key of the solution is spatio-temporally consistent proxy nodes that dynamically represent objects and scenes: hierarchical proxy nodes stably express the multi-scale structure of visual objects, making them robust to accumulated tracking error, long-term motion, occlusion, and viewpoint variation, while a dynamic representation-update mechanism exploits the video's spatio-temporal priors to mitigate inaccurate trackers and handle drastic changes in scenes and objects. Decoupled encoding of shape and texture across objects further enables controllable, fine-grained appearance editing.
Link: https://arxiv.org/abs/2510.12256
Authors: Ye Chen,Liming Tan,Yupeng Zhu,Yuanbin Wang,Bingbing Ni
Affiliations: Shanghai Jiao Tong University; USC-SJTU Institute of Cultural and Creative Industry
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Current video representations heavily rely on unstable and over-grained priors for motion and appearance modelling, i.e., pixel-level matching and tracking. A tracking error of just a few pixels would lead to the collapse of the visual object representation, not to mention occlusions and large motion frequently occurring in videos. To overcome the above mentioned vulnerability, this work proposes spatio-temporally consistent proxy nodes to represent dynamically changing objects/scenes in the video. On the one hand, the hierarchical proxy nodes have the ability to stably express the multi-scale structure of visual objects, so they are not affected by accumulated tracking error, long-term motion, occlusion, and viewpoint variation. On the other hand, the dynamic representation update mechanism of the proxy nodes adequately leverages spatio-temporal priors of the video to mitigate the impact of inaccurate trackers, thereby effectively handling drastic changes in scenes and objects. Additionally, the decoupled encoding manner of the shape and texture representations across different visual objects in the video facilitates controllable and fine-grained appearance editing capability. Extensive experiments demonstrate that the proposed representation achieves high video reconstruction accuracy with fewer parameters and supports complex video processing tasks, including video in-painting and keyframe-based temporally consistent video editing.
[CV-57] Ivan-ISTD: Rethinking Cross-domain Heteroscedastic Noise Perturbations in Infrared Small Target Detection
【Quick Read】: This paper tackles two core challenges of infrared small target detection (ISTD): cross-domain shift and heteroscedastic noise perturbations. The key of the proposed doubly wavelet-guided invariance learning framework (Ivan-ISTD) is two stages: first, Wavelet-guided Cross-domain Synthesis generates training samples aligned with the target domain, with multi-frequency wavelet filtering accurately separating targets from background; second, Real-domain Noise Invariance Learning extracts real noise characteristics from the target domain to build a dynamic noise library and learns noise invariance through a self-supervised loss, overcoming the distribution bias of traditional hand-crafted noise modeling. A new cross-domain dynamic degradation benchmark, Dynamic-ISTD, simulating real-world distribution shifts is also introduced.
Link: https://arxiv.org/abs/2510.12241
Authors: Yuehui Li,Yahao Lu,Haoyuan Wu,Sen Zhang,Liang Lin,Yukai Shi
Affiliations: Guangdong University of Technology; TikTok; Sun Yat-sen University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: In infrared small target detection, noise from different sensors can cause significant interference with performance. We propose a new dataset and a wavelet-guided invariance learning framework (Ivan-ISTD) to address this issue
Abstract:In the multimedia domain, Infrared Small Target Detection (ISTD) plays an important role in drone-based multi-modality sensing. To address the dual challenges of cross-domain shift and heteroscedastic noise perturbations in ISTD, we propose a doubly wavelet-guided invariance learning framework (Ivan-ISTD). In the first stage, we generate training samples aligned with the target domain using Wavelet-guided Cross-domain Synthesis. This wavelet-guided alignment machine accurately separates the target from the background through multi-frequency wavelet filtering. In the second stage, we introduce Real-domain Noise Invariance Learning, which extracts real noise characteristics from the target domain to build a dynamic noise library. The model learns noise invariance through self-supervised loss, thereby overcoming the limitations of distribution bias in traditional artificial noise modeling. Finally, we create the Dynamic-ISTD Benchmark, a cross-domain dynamic degradation dataset that simulates the distribution shifts encountered in real-world applications. Additionally, we validate the versatility of our method using other real-world datasets. Experimental results demonstrate that our approach outperforms existing state-of-the-art methods in terms of many quantitative metrics. In particular, Ivan-ISTD demonstrates excellent robustness in cross-domain scenarios. The code for this work can be found at: this https URL.
[CV-58] BIGFix: Bidirectional Image Generation with Token Fixing
【Quick Read】: This paper addresses the efficiency-quality trade-off in image and video generation: auto-regressive sequential modeling preserves quality but is slow, while parallel multi-token prediction can cut inference time by up to an order of magnitude yet suffers structural inconsistencies from token incompatibilities and offers no way to correct erroneous predictions once tokens are sampled. The key of the solution is a self-correcting mechanism: a training scheme that injects random tokens into the context makes the model robust to wrong neighbors and enables token fixing during sampling, so already-sampled tokens can be iteratively refined. This preserves the efficiency of parallel prediction while substantially improving generation quality on ImageNet-256 and CIFAR-10 (images) and on UCF-101 and NuScenes (video).
Link: https://arxiv.org/abs/2510.12231
Authors: Victor Besnier,David Hurych,Andrei Bursuc,Eduardo Valle
Affiliations: Valeo.ai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recent advances in image and video generation have raised significant interest from both academia and industry. A key challenge in this field is improving inference efficiency, as model size and the number of inference steps directly impact the commercial viability of generative models while also posing fundamental scientific challenges. A promising direction involves combining auto-regressive sequential token modeling with multi-token prediction per step, reducing inference time by up to an order of magnitude. However, predicting multiple tokens in parallel can introduce structural inconsistencies due to token incompatibilities, as capturing complex joint dependencies during training remains challenging. Traditionally, once tokens are sampled, there is no mechanism to backtrack and refine erroneous predictions. We propose a method for self-correcting image generation by iteratively refining sampled tokens. We achieve this with a novel training scheme that injects random tokens in the context, improving robustness and enabling token fixing during sampling. Our method preserves the efficiency benefits of parallel token prediction while significantly enhancing generation quality. We evaluate our approach on image generation using the ImageNet-256 and CIFAR-10 datasets, as well as on video generation with UCF-101 and NuScenes, demonstrating substantial improvements across both modalities.
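The two ingredients, random-token injection during training and iterative token fixing during sampling, can be sketched as follows. The corruption rate, number of refinement rounds, and re-fix ratio are assumed hyperparameters, and `model` is a hypothetical network returning per-position logits; none of this is the paper's actual code.

```python
import torch

def corrupt_context(tokens, vocab_size, p=0.1):
    """Training-time corruption: swap a random subset of context tokens for
    random ones, so the model learns to predict well next to wrong tokens."""
    noise = torch.randint_like(tokens, vocab_size)
    mask = torch.rand_like(tokens, dtype=torch.float) < p
    return torch.where(mask, noise, tokens)

@torch.no_grad()
def sample_with_fixing(model, tokens, rounds=3, refix_ratio=0.2):
    """Iteratively re-predict and overwrite the least-confident tokens.

    model(tokens) is assumed to return (B, N, V) logits for every position.
    """
    for _ in range(rounds):
        probs = model(tokens).softmax(-1)
        conf, pred = probs.max(-1)                         # (B, N)
        k = max(1, int(refix_ratio * tokens.size(1)))
        worst = conf.topk(k, dim=-1, largest=False).indices
        tokens = tokens.scatter(1, worst, pred.gather(1, worst))
    return tokens
```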
[CV-59] HoneyBee: Data Recipes for Vision-Language Reason ers
【Quick Read】: This paper addresses the poorly understood principles behind building training datasets for vision-language reasoning (VLR), i.e., how data-curation strategies shape the reasoning ability of vision-language models (VLMs). The key of the solution is a controlled study of context (image and question pair) sources, targeted data interventions, and scaling along multiple dimensions (images, questions, and chain-of-thought (CoT) solutions). Guided by these findings, the authors release HoneyBee, a high-quality CoT reasoning dataset of 2.5M examples built from 350K image-question pairs; VLMs trained on it outperform state-of-the-art models across sizes, and a proposed test-time scaling strategy cuts decoding cost by 73% without sacrificing accuracy.
Link: https://arxiv.org/abs/2510.12225
Authors: Hritik Bansal,Devandra Singh Sachan,Kai-Wei Chang,Aditya Grover,Gargi Ghosh,Wen-tau Yih,Ramakanth Pasunuru
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 32 pages
Abstract:Recent advances in vision-language models (VLMs) have made them highly effective at reasoning tasks. However, the principles underlying the construction of performant VL reasoning training datasets remain poorly understood. In this work, we introduce several data curation approaches and study their impacts on VL reasoning capabilities by carefully controlling training and evaluation setups. We analyze the effects of context (image and question pair) sources, implement targeted data interventions, and explore scaling up images, questions, and chain-of-thought (CoT) solutions. Our findings reveal that (a) context source strategies significantly affect VLM performance, (b) interventions such as auxiliary signals from image captions and the inclusion of text-only reasoning yield substantial gains, and (c) scaling all data dimensions (e.g., unique questions per image and unique CoTs per image-question pair) consistently improves reasoning capability. Motivated by these insights, we introduce HoneyBee, a large-scale, high-quality CoT reasoning dataset with 2.5M examples built from 350K image-question pairs. VLMs trained with HoneyBee outperform state-of-the-art models across model sizes. For instance, a HoneyBee-trained VLM with 3B parameters outperforms the SOTA model and the base model by 7.8% and 24.8%, respectively, on MathVerse. Furthermore, we propose a test-time scaling strategy that reduces decoding cost by 73% without sacrificing accuracy. Overall, this work presents improved strategies for VL reasoning dataset curation research.
[CV-60] DIANet: A Phase-Aware Dual-Stream Network for Micro-Expression Recognition via Dynamic Images
【Quick Read】: This paper addresses the difficulty of micro-expression recognition (MER), where facial cues are subtle and transient and annotated data is scarce; existing dynamic image (DI)-based methods compress temporal motion into a single frame but ignore the distinct characteristics of different temporal phases (onset-to-apex versus apex-to-offset). The key of the proposed dual-stream DIANet is to use phase-aware dynamic images: one stream encodes the onset-to-apex phase and the other the apex-to-offset phase, each processed by a dedicated convolutional network, while a cross-attention fusion module adaptively integrates features from both streams by contextual relevance, explicitly modeling temporal-phase information. Experiments on CASME-II, SAMM, and MMEW show consistent gains over conventional single-phase DI approaches.
Link: https://arxiv.org/abs/2510.12219
Authors: Vu Tram Anh Khuong,Luu Tu Nguyen,Thi Bich Phuong Man,Thanh Ha Le,Thi Duyen Ngo
Affiliations: VNU University of Engineering and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Micro-expressions are brief, involuntary facial movements that typically last less than half a second and often reveal genuine emotions. Accurately recognizing these subtle expressions is critical for applications in psychology, security, and behavioral analysis. However, micro-expression recognition (MER) remains a challenging task due to the subtle and transient nature of facial cues and the limited availability of annotated data. While dynamic image (DI) representations have been introduced to summarize temporal motion into a single frame, conventional DI-based methods often overlook the distinct characteristics of different temporal phases within a micro-expression. To address this issue, this paper proposes a novel dual-stream framework, DIANet, which leverages phase-aware dynamic images - one encoding the onset-to-apex phase and the other capturing the apex-to-offset phase. Each stream is processed by a dedicated convolutional neural network, and a cross-attention fusion module is employed to adaptively integrate features from both streams based on their contextual relevance. Extensive experiments conducted on three benchmark MER datasets (CASME-II, SAMM, and MMEW) demonstrate that the proposed method consistently outperforms conventional single-phase DI-based approaches. The results highlight the importance of modeling temporal phase information explicitly and suggest a promising direction for advancing MER.
[CV-61] he Impact of Synthetic Data on Object Detection Model Performance: A Comparative Analysis with Real-World Data
【Quick Read】: This paper asks whether synthetic data can reduce the dependence of object detection models on costly, slow-to-collect annotated real-world data in warehouse logistics, a bottleneck for deploying generative-AI-era computer vision in industry. The key of the study is a quantitative comparison, on pallet detection in a warehouse setting, between models trained on real data only and models trained with synthetic images generated by the NVIDIA Omniverse Replicator tool under various generation strategies. The findings suggest that a balanced integration of synthetic and real data yields robust and efficient detectors, offering a cost-effective data recipe for industrial computer vision (CV) applications.
Link: https://arxiv.org/abs/2510.12208
Authors: Muammer Bay,Timo von Marcard,Dren Fazlija
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 12 figures, 2 tables. Code: this https URL ; Data: this https URL
Abstract:Recent advances in generative AI, particularly in computer vision (CV), offer new opportunities to optimize workflows across industries, including logistics and manufacturing. However, many AI applications are limited by a lack of expertise and resources, which forces a reliance on general-purpose models. Success with these models often requires domain-specific data for fine-tuning, which can be costly and inefficient. Thus, using synthetic data for fine-tuning is a popular, cost-effective alternative to gathering real-world data. This work investigates the impact of synthetic data on the performance of object detection models, compared to models trained on real-world data only, specifically within the domain of warehouse logistics. To this end, we examined the impact of synthetic data generated using the NVIDIA Omniverse Replicator tool on the effectiveness of object detection models in real-world scenarios. It comprises experiments focused on pallet detection in a warehouse setting, utilizing both real and various synthetic dataset generation strategies. Our findings provide valuable insights into the practical applications of synthetic image data in computer vision, suggesting that a balanced integration of synthetic and real data can lead to robust and efficient object detection models.
[CV-62] Hierarchical Reasoning with Vision-Language Models for Incident Reports from Dashcam Videos ICCV2025
【Quick Read】: This paper addresses the weakness of autonomous-driving models in out-of-distribution (OOD) scenarios, particularly the lack of interpretable analysis of complex traffic incidents: rather than deciding over closed-taxonomy labels, the goal is to generate human-readable incident reports from dashcam videos. The key of the solution is a hierarchical reasoning framework that integrates frame-level captioning, incident frame detection, and fine-grained reasoning within vision-language models (VLMs), with model ensembling and a Blind A/B Scoring selection protocol further improving factual accuracy and readability. The method ranks 2nd among 29 teams in the ICCV 2025 2COOOL challenge and achieves the best CIDEr-D score, validating hierarchical VLM reasoning for accident analysis and safety-critical event understanding.
Link: https://arxiv.org/abs/2510.12190
Authors: Shingo Yokoi,Kento Sasaki,Yu Yamaguchi
Affiliations: Turing Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 2nd Place Winner, ICCV 2025 2COOOL Competition
Abstract:Recent advances in end-to-end (E2E) autonomous driving have been enabled by training on diverse large-scale driving datasets, yet autonomous driving models still struggle in out-of-distribution (OOD) scenarios. The COOOL benchmark targets this gap by encouraging hazard understanding beyond closed taxonomies, and the 2COOOL challenge extends it to generating human-interpretable incident reports. We present a hierarchical reasoning framework for incident report generation from dashcam videos that integrates frame-level captioning, incident frame detection, and fine-grained reasoning within vision-language models (VLMs). We further improve factual accuracy and readability through model ensembling and a Blind A/B Scoring selection protocol. On the official 2COOOL open leaderboard, our method ranks 2nd among 29 teams and achieves the best CIDEr-D score, producing accurate and coherent incident narratives. These results indicate that hierarchical reasoning with VLMs is a promising direction for accident analysis and for broader understanding of safety-critical traffic events. The implementation and code are available at this https URL.
[CV-63] CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLM s
【Quick Read】: This paper addresses a gap overlooked in prior work: existing knowledge distillation (KD) methods for efficient Multimodal Large Language Models (MLLMs) fail to transfer the teacher's rich visual perception abilities to the student. The key of the solution is the systematic identification of visual attention misalignment between student and teacher as the main cause, and the proposed CompoDistill framework, which explicitly aligns the student's visual attention with the teacher's. This markedly improves the student's visual perception, especially on compositional reasoning tasks, while preserving strong visual question answering performance, and it generalizes to a more advanced backbone.
Link: https://arxiv.org/abs/2510.12184
Authors: Jiwan Kim,Kibum Kim,Sangwoo Seo,Chanyoung Park
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Preprint. Under Review
Abstract:Recently, efficient Multimodal Large Language Models (MLLMs) have gained significant attention as a solution to their high computational complexity, making them more practical for real-world applications. In this regard, the knowledge distillation (KD) approach has emerged as a promising alternative, which transfers the rich visual and linguistic knowledge from a larger model (teacher) to a smaller model (student). However, we observe that existing KD methods struggle to effectively distill the teacher MLLM’s rich visual perception abilities to the student, a challenge that has been largely overlooked in previous studies. Through a systematic analysis, we identify visual attention misalignment between student and teacher as the main cause of this issue. Based on this insight, we propose CompoDistill, a novel KD framework that explicitly aligns the student’s visual attention with that of the teacher to enhance the student’s visual perception abilities. Our extensive experiments show that CompoDistill significantly improves performance on compositional reasoning tasks that require visual perception abilities while maintaining strong performance on visual question answering tasks, as done in existing studies. Furthermore, CompoDistill demonstrates effectiveness with a more advanced backbone, highlighting its generalizability.
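Explicit attention alignment is commonly implemented as a divergence between teacher and student attention maps restricted to the visual tokens; the sketch below uses a head-averaged KL term and assumes matching query/key layouts. It illustrates the general idea, not CompoDistill's exact objective.

```python
import torch
import torch.nn.functional as F

def attention_alignment_loss(student_attn, teacher_attn, visual_keys):
    """KL divergence between teacher and student attention over visual tokens.

    student_attn / teacher_attn: (B, H, Q, K) softmaxed attention maps with
    matching query/key layouts; visual_keys: slice selecting visual-token keys,
    e.g. slice(1, 1 + num_visual_tokens).
    """
    s = student_attn.mean(dim=1)[..., visual_keys]   # average heads, keep visual keys
    t = teacher_attn.mean(dim=1)[..., visual_keys]
    s = s / s.sum(-1, keepdim=True).clamp_min(1e-8)  # renormalize over visual keys
    t = t / t.sum(-1, keepdim=True).clamp_min(1e-8)
    return F.kl_div(s.clamp_min(1e-8).log(), t, reduction="batchmean")
```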
[CV-64] BEEP3D: Box-Supervised End-to-End Pseudo-Mask Generation for 3D Instance Segmentation
【Quick Read】: This paper reduces the annotation cost of 3D instance segmentation by learning from box-level instead of dense point-level labels; the core difficulty is that boxes introduce semantic ambiguity in overlapping regions, making accurate point-to-instance assignment hard, while existing pseudo-mask pipelines require training a dedicated pseudo-labeler in an extra stage, increasing training time and hindering end-to-end optimization. The key of the proposed end-to-end framework BEEP3D is a student-teacher design in which the teacher serves as pseudo-labeler and is updated from the student via Exponential Moving Average (EMA); an instance-center-based query refinement improves position-query localization using features near instance centers, and two novel losses, a query consistency loss and a masked feature consistency loss, align semantic and geometric signals between predictions and pseudo-masks. On ScanNetV2 and S3DIS, BEEP3D is competitive with or better than state-of-the-art weakly supervised methods while remaining computationally efficient.
Link: https://arxiv.org/abs/2510.12182
Authors: Youngju Yoo,Seho Kim,Changick Kim
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:3D instance segmentation is crucial for understanding complex 3D environments, yet fully supervised methods require dense point-level annotations, resulting in substantial annotation costs and labor overhead. To mitigate this, box-level annotations have been explored as a weaker but more scalable form of supervision. However, box annotations inherently introduce ambiguity in overlapping regions, making accurate point-to-instance assignment challenging. Recent methods address this ambiguity by generating pseudo-masks through training a dedicated pseudo-labeler in an additional training stage. However, such two-stage pipelines often increase overall training time and complexity and hinder end-to-end optimization. To overcome these challenges, we propose BEEP3D (Box-supervised End-to-End Pseudo-mask generation) for 3D instance segmentation. BEEP3D adopts a student-teacher framework, where the teacher model serves as a pseudo-labeler and is updated by the student model via an Exponential Moving Average. To better guide the teacher model to generate precise pseudo-masks, we introduce an instance center-based query refinement that enhances position query localization and leverages features near instance centers. Additionally, we design two novel losses, a query consistency loss and a masked feature consistency loss, to align semantic and geometric signals between predictions and pseudo-masks. Extensive experiments on ScanNetV2 and S3DIS datasets demonstrate that BEEP3D achieves competitive or superior performance compared to state-of-the-art weakly supervised methods while remaining computationally efficient.
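The student-teacher EMA update is standard and fits in a few lines; the query consistency loss shown alongside is a plausible L2 form, since the paper's exact definition is not given in this listing.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Teacher (the pseudo-labeler) trails the student by exponential moving
    average; both models must share the same architecture."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)
    for t_b, s_b in zip(teacher.buffers(), student.buffers()):
        t_b.copy_(s_b)                    # e.g., BatchNorm running statistics

def query_consistency_loss(student_queries, teacher_queries):
    """One plausible form: L2 consistency between matched instance queries."""
    return F.mse_loss(student_queries, teacher_queries.detach())
```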
[CV-65] UniGS: Unified Geometry-Aware Gaussian Splatting for Multimodal Rendering
【Quick Read】: This paper addresses how to achieve high fidelity, geometric consistency, and efficiency at once in multimodal 3D reconstruction spanning RGB images, depth maps, surface normals, and semantic logits. The key of the proposed unified representation and differentiable framework UniGS is a CUDA-accelerated rasterization pipeline that renders photo-realistic RGB, geometrically accurate depth, consistent normals, and semantic logits simultaneously: depth is rendered via differentiable ray-ellipsoid intersection rather than Gaussian centers, so rotation and scale attributes can be optimized with analytic depth gradients; analytic gradients are also derived for surface-normal rendering to keep reconstructed scenes geometrically consistent; and a learnable attribute enables differentiable pruning of low-contribution Gaussians during training, improving computational and storage efficiency.
Link: https://arxiv.org/abs/2510.12174
Authors: Yusen Xie,Zhenmin Huang,Jianhao Jiao,Dimitrios Kanoulas,Jun Ma
Affiliations: HKUST (GZ); HKUST; UCL
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments:
Abstract:In this paper, we propose UniGS, a unified map representation and differentiable framework for high-fidelity multimodal 3D reconstruction based on 3D Gaussian Splatting. Our framework integrates a CUDA-accelerated rasterization pipeline capable of rendering photo-realistic RGB images, geometrically accurate depth maps, consistent surface normals, and semantic logits simultaneously. We redesign the rasterization to render depth via differentiable ray-ellipsoid intersection rather than using Gaussian centers, enabling effective optimization of rotation and scale attribute through analytic depth gradients. Furthermore, we derive the analytic gradient formulation for surface normal rendering, ensuring geometric consistency among reconstructed 3D scenes. To improve computational and storage efficiency, we introduce a learnable attribute that enables differentiable pruning of Gaussians with minimal contribution during training. Quantitative and qualitative experiments demonstrate state-of-the-art reconstruction accuracy across all modalities, validating the efficacy of our geometry-aware paradigm. Source code and multimodal viewer will be available on GitHub.
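Rendering depth via ray-ellipsoid intersection amounts to transforming each ray into the Gaussian's unit-sphere frame and solving a quadratic, which keeps depth differentiable in the rotation and scales. The sketch below covers a single Gaussian and omits rasterization and blending details, which the paper implements in CUDA.

```python
import torch

def ray_ellipsoid_depth(o, d, mu, R, s, eps=1e-8):
    """Depth of the near ray-ellipsoid intersection, differentiable in R and s.

    o, d: (N, 3) ray origins and unit directions; mu: (3,) Gaussian center;
    R: (3, 3) rotation (local -> world); s: (3,) per-axis scales.
    Returns (N,) depths, +inf where a ray misses the ellipsoid.
    """
    o_l = (o - mu) @ R / s            # R^T (o - mu), then axis-wise scaling
    d_l = d @ R / s                   # ray in the unit-sphere frame
    a = (d_l * d_l).sum(-1)
    b = 2.0 * (o_l * d_l).sum(-1)
    c = (o_l * o_l).sum(-1) - 1.0
    disc = b * b - 4.0 * a * c        # quadratic |o_l + t d_l|^2 = 1
    t = (-b - torch.sqrt(disc.clamp_min(0.0))) / (2.0 * a + eps)   # near root
    hit = (disc > 0) & (t > 0)
    return torch.where(hit, t, torch.full_like(t, float("inf")))
```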
[CV-66] State Space Prompting via Gathering and Spreading Spatio-Temporal Information for Video Understanding
【Quick Read】: This paper addresses the loss of spatial and temporal context when pre-trained state space models (SSMs) sequentially compress visual prompt tokens for video understanding, which limits the propagation of intra-frame spatial and inter-frame temporal features and the extraction of discriminative information. The key of the proposed State Space Prompting (SSP) method is two complementary modules: an Intra-Frame Gathering (IFG) module aggregates spatial key information within each frame, while an Inter-Frame Spreading (IFS) module spreads discriminative spatio-temporal information across frames; together they adaptively balance and compress key spatio-temporal information within and between frames. On four video benchmarks, SSP outperforms state-of-the-art methods by 2.76% on average while reducing the overhead of fine-tuned parameters.
Link: https://arxiv.org/abs/2510.12160
Authors: Jiahuan Zhou,Kai Zhu,Zhenyu Cui,Zichen Liu,Xu Zou,Gang Hua
Affiliations: Peking University; Huazhong University of Science and Technology; Amazon.com, Inc
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Recently, pre-trained state space models have shown great potential for video classification, which sequentially compresses visual tokens in videos with linear complexity, thereby improving the processing efficiency of video data while maintaining high performance. To apply powerful pre-trained models to downstream tasks, prompt learning is proposed to achieve efficient downstream task adaptation with only a small number of fine-tuned parameters. However, the sequentially compressed visual prompt tokens fail to capture the spatial and temporal contextual information in the video, thus limiting the effective propagation of spatial information within a video frame and temporal information between frames in the state compression model and the extraction of discriminative information. To tackle the above issue, we proposed a State Space Prompting (SSP) method for video understanding, which combines intra-frame and inter-frame prompts to aggregate and propagate key spatiotemporal information in the video. Specifically, an Intra-Frame Gathering (IFG) module is designed to aggregate spatial key information within each frame. Besides, an Inter-Frame Spreading (IFS) module is designed to spread discriminative spatio-temporal information across different frames. By adaptively balancing and compressing key spatio-temporal information within and between frames, our SSP effectively propagates discriminative information in videos in a complementary manner. Extensive experiments on four video benchmark datasets verify that our SSP significantly outperforms existing SOTA methods by 2.76% on average while reducing the overhead of fine-tuning parameters.
[CV-67] DPL: Spatial-Conditioned Diffusion Prototype Enhancement for One-Shot Medical Segmentation
【Quick Read】: This paper addresses brittle prototype representations in one-shot medical image segmentation, where limited annotations and large anatomical variability across patients make deterministic averaging of support features unable to capture the intra-class diversity needed for robust generalization. The key of the proposed Diffusion Prototype Learning (DPL) framework is to model one-shot prototypes as learnable probability distributions and to generate diverse yet semantically coherent prototype variants from minimal labeled data via diffusion, through three components: a diffusion-based prototype enhancement module that turns a single support prototype into a diverse variant set via forward-reverse diffusion; a spatial-aware conditioning mechanism that leverages geometric properties derived from prototype feature statistics; and a conservative fusion strategy that preserves prototype fidelity while maximizing representational diversity. Using the same enhancement-and-fusion pipeline at training and inference keeps the two phases consistent, with the diffusion process additionally acting as a regularizer; experiments on abdominal MRI and CT set new state-of-the-art results.
Link: https://arxiv.org/abs/2510.12159
Authors: Ziyuan Gao,Philippe Morel
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at IVCNZ 2025. To be published in IEEE proceedings
Abstract:One-shot medical image segmentation faces fundamental challenges in prototype representation due to limited annotated data and significant anatomical variability across patients. Traditional prototype-based methods rely on deterministic averaging of support features, creating brittle representations that fail to capture intra-class diversity essential for robust generalization. This work introduces Diffusion Prototype Learning (DPL), a novel framework that reformulates prototype construction through diffusion-based feature space exploration. DPL models one-shot prototypes as learnable probability distributions, enabling controlled generation of diverse yet semantically coherent prototype variants from minimal labeled data. The framework operates through three core innovations: (1) a diffusion-based prototype enhancement module that transforms single support prototypes into diverse variant sets via forward-reverse diffusion processes, (2) a spatial-aware conditioning mechanism that leverages geometric properties derived from prototype feature statistics, and (3) a conservative fusion strategy that preserves prototype fidelity while maximizing representational diversity. DPL ensures training-inference consistency by using the same diffusion enhancement and fusion pipeline in both phases. This process generates enhanced prototypes that serve as the final representations for similarity calculations, while the diffusion process itself acts as a regularizer. Extensive experiments on abdominal MRI and CT datasets demonstrate significant improvements respectively, establishing new state-of-the-art performance in one-shot medical image segmentation.
[CV-68] Class-aware Domain Knowledge Fusion and Fission for Continual Test-Time Adaptation
【Quick Read】: This paper addresses catastrophic forgetting of historical knowledge caused by irregular switching of domain data in continual test-time adaptation (CTTA), together with the degradation existing methods suffer from under-learning new knowledge and interference from potentially harmful historical knowledge. The key of the proposed class-aware domain Knowledge Fusion and Fission method (KFF) is to adaptively expand and merge class-aware domain knowledge across old and new domains: a domain Knowledge FIssion (KFI) module adaptively separates new-domain knowledge from a paired class-aware domain prompt pool, alleviating negative transfer from old domains that differ strongly from the current one; a domain Knowledge FUsion (KFU) module then merges the fissioned knowledge back into the existing pool at minimal cost, using a greedy dynamic merging strategy that improves compatibility of new and old knowledge while keeping computation efficient.
Link: https://arxiv.org/abs/2510.12150
Authors: Jiahuan Zhou,Chao Zhu,Zhenyu Cui,Zichen Liu,Xu Zou,Gang Hua
Affiliations: Peking University; Huazhong University of Science and Technology; Amazon.com, Inc
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Continual Test-Time Adaptation (CTTA) aims to quickly fine-tune the model during the test phase so that it can adapt to multiple unknown downstream domain distributions without pre-acquiring downstream domain data. To this end, existing advanced CTTA methods mainly reduce the catastrophic forgetting of historical knowledge caused by irregular switching of downstream domain data by restoring the initial model or reusing historical models. However, these methods are usually accompanied by serious insufficient learning of new knowledge and interference from potentially harmful historical knowledge, resulting in severe performance degradation. To this end, we propose a class-aware domain Knowledge Fusion and Fission method for continual test-time adaptation, called KFF, which adaptively expands and merges class-aware domain knowledge in old and new domains according to the test-time data from different domains, where discriminative historical knowledge can be dynamically accumulated. Specifically, considering the huge domain gap within streaming data, a domain Knowledge FIssion (KFI) module is designed to adaptively separate new domain knowledge from a paired class-aware domain prompt pool, alleviating the impact of negative knowledge brought by old domains that are distinct from the current domain. Besides, to avoid the cumulative computation and storage overheads from continuously fissioning new knowledge, a domain Knowledge FUsion (KFU) module is further designed to merge the fissioned new knowledge into the existing knowledge pool with minimal cost, where a greedy knowledge dynamic merging strategy is designed to improve the compatibility of new and old knowledge while keeping the computational efficiency. Extensive experiments on the ImageNet-C dataset verify the effectiveness of our proposed method against other methods.
[CV-69] FedHUG: Federated Heterogeneous Unsupervised Generalization for Remote Physiological Measurements
【Quick Read】: This paper addresses the difficulty of updating deployed remote physiological measurement models when abundant real-world user data is unlabeled and privacy-sensitive. The key of the solution is the newly instantiated Federated Unsupervised Domain Generalization (FUDG) protocol and the proposed FedHUG (Federated Heterogeneous Unsupervised Generalization) framework: (1) a Minimal Bias Aggregation module dynamically adjusts aggregation weights based on prior-driven bias evaluation to cope with heterogeneous non-IID features across domains; (2) a Global Distribution-aware Learning Controller parameterizes the label distribution and dynamically steers client-specific training strategies, mitigating the server-client label-distribution skew and the long-tail problem. The approach outperforms state-of-the-art techniques for estimation with either RGB video or mmWave radar.
Link: https://arxiv.org/abs/2510.12132
Authors: Xiao Yang,Jiyao Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Remote physiological measurement has gained wide attention, yet it requires collecting users’ privacy-sensitive information, and existing contactless measurement methods still rely on labeled client data. This presents challenges when we want to further update real-world deployed models with numerous user data lacking labels. To resolve these challenges, we instantiate a new protocol called Federated Unsupervised Domain Generalization (FUDG) in this work. Subsequently, the Federated Heterogeneous Unsupervised Generalization (FedHUG) framework is proposed and consists of: (1) a Minimal Bias Aggregation module that dynamically adjusts aggregation weights based on prior-driven bias evaluation to cope with heterogeneous non-IID features from multiple domains. (2) a Global Distribution-aware Learning Controller that parameterizes the label distribution and dynamically manipulates client-specific training strategies, thereby mitigating the server-client label distribution skew and long-tail issue. The proposal outperforms state-of-the-art techniques in estimation with either RGB video or mmWave radar. The code will be released.
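Bias-aware aggregation can be pictured as a weighted average of client state dicts with weights derived from per-client bias estimates. The softmax-over-negative-bias rule below is an assumption for illustration, not FedHUG's actual Minimal Bias Aggregation.

```python
import torch

def aggregate_with_bias_weights(client_states, bias_scores, temperature=1.0):
    """Aggregate client models, down-weighting clients judged more biased.

    client_states: list of state_dicts; bias_scores: per-client scores,
    lower = less biased. Softmax over negative bias is an assumed rule.
    """
    w = torch.softmax(
        -torch.as_tensor(bias_scores, dtype=torch.float) / temperature, dim=0)
    global_state = {}
    for key in client_states[0]:
        stacked = torch.stack([cs[key].float() for cs in client_states])
        shape = (-1,) + (1,) * (stacked.dim() - 1)   # broadcast weights per param
        global_state[key] = (w.view(shape) * stacked).sum(0)
    return global_state
```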
[CV-70] MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites
【Quick Read】: This paper addresses the large performance gap between open-source and commercial models on generalist visual captioning, which limits applications such as data synthesis. The key of the solution is CapFlow, a multi-agent collaboration workflow showing for the first time that, by capitalizing on open-source models alone, caption quality on par with GPT-4.1 is achievable across domains at an 89.5% cost reduction. Using CapFlow as a data synthesizer, the authors produce high-quality visual captions from image and video domains at scale and fine-tune a generalist visual captioner, MetaCaptioner, which matches the captioning capability of commercial models and reaches top-tier multimodal performance in the open-source community.
Link: https://arxiv.org/abs/2510.12126
Authors: Zhenxin Lei,Zhangwei Gao,Changyao Tian,Erfei Cui,Guanzhou Chen,Danni Yang,Yuchen Duan,Zhaokai Wang,Wenhao Li,Weiyun Wang,Xiangyu Zhao,Jiayi Ji,Yu Qiao,Wenhai Wang,Gen Luo
Affiliations: Shanghai AI Laboratory; Shanghai Jiao Tong University; Fudan University; University of Chinese Academy of Sciences; Xiamen University; The Chinese University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Generalist visual captioning goes beyond a simple appearance description task, but requires integrating a series of visual cues into a caption and handling various visual domains. In this task, current open-source models present a large performance gap with commercial ones, which limits various applications such as data synthesis. To bridge the gap, this paper proposes CapFlow, a novel multi-agent collaboration workflow. CapFlow demonstrates for the first time that, by capitalizing on open-source models, it is possible to achieve caption quality on par with GPT-4.1 in various domains with an 89.5% reduction in costs. By leveraging CapFlow as the data synthesizer, we produce high-quality visual captions from image and video domains at scale, and obtain a generalist visual captioner via fine-tuning, namely MetaCaptioner. Through extensive experiments, we show that MetaCaptioner not only achieves comparable captioning capabilities with commercial models but also reaches top-tier multimodal performance in the open-source community. We hope CapFlow and MetaCaptioner can benefit future multimodal research by providing a strong and cost-effective visual captioning solution.
[CV-71] Hardware-aware Coding Function Design for Compressive Single-Photon 3D Cameras
【Quick Read】: This paper addresses the hardware bottlenecks that limit single-photon cameras in time-of-flight (ToF) 3D imaging, namely system bandwidth, laser peak power, sensor data rates, and in-sensor memory and compute; in particular, existing compressive-histogram approaches underperform once realistic illumination hardware constraints are imposed. The key of the solution is a constrained optimization approach that jointly optimizes, via gradient descent, the illumination and the coding matrix (the coding functions) while adhering to practical hardware limits. Extensive simulations show the learned coding functions consistently outperform traditional coding designs under bandwidth and peak-power constraints, with the advantage most pronounced in peak-power-limited systems, and the approach adapts to arbitrary parameterized impulse responses, as validated on a real system with a non-ideal impulse response function.
Link: https://arxiv.org/abs/2510.12123
Authors: David Parra,Felipe Gutierrez-Barragan,Trevor Seets,Andreas Velten
Affiliations: University of Wisconsin-Madison
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: IEEE TPAMI Special Issue
Abstract:Single-photon cameras are becoming increasingly popular in time-of-flight 3D imaging because they can time-tag individual photons with extreme resolution. However, their performance is susceptible to hardware limitations, such as system bandwidth, maximum laser power, sensor data rates, and in-sensor memory and compute resources. Compressive histograms were recently introduced as a solution to the challenge of data rates through an online in-sensor compression of photon timestamp data. Although compressive histograms work within limited in-sensor memory and computational resources, they underperform when subjected to real-world illumination hardware constraints. To address this, we present a constrained optimization approach for designing practical coding functions for compressive single-photon 3D imaging. Using gradient descent, we jointly optimize an illumination and coding matrix (i.e., the coding functions) that adheres to hardware constraints. We show through extensive simulations that our coding functions consistently outperform traditional coding designs under both bandwidth and peak power constraints. This advantage is particularly pronounced in systems constrained by peak power. Finally, we show that our approach adapts to arbitrary parameterized impulse responses by evaluating it on a real-world system with a non-ideal impulse response function.
[CV-72] ImageSentinel: Protecting Visual Datasets from Unauthorized Retrieval-Augmented Image Generation NEURIPS2025
【Quick Read】: This paper addresses unauthorized use of visual datasets in Retrieval-Augmented Image Generation (RAIG): because RAIG extracts and recombines features from reference images, traditional digital watermarks fail to survive the generation process, leaving datasets unprotected. The key of the proposed ImageSentinel framework is to synthesize sentinel images that remain visually consistent with the original dataset, generated with vision-language models, and to use randomly generated character sequences as retrieval keys for protection verification, so that unauthorized dataset usage can be detected without degrading generation quality for authorized applications.
Link: https://arxiv.org/abs/2510.12119
Authors: Ziyuan Luo,Yangyi Zhao,Ka Chun Cheung,Simon See,Renjie Wan
Affiliations: Hong Kong Baptist University; NVIDIA AI Technology Center
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at NeurIPS 2025
Abstract:The widespread adoption of Retrieval-Augmented Image Generation (RAIG) has raised significant concerns about the unauthorized use of private image datasets. While these systems have shown remarkable capabilities in enhancing generation quality through reference images, protecting visual datasets from unauthorized use in such systems remains a challenging problem. Traditional digital watermarking approaches face limitations in RAIG systems, as the complex feature extraction and recombination processes fail to preserve watermark signals during generation. To address these challenges, we propose ImageSentinel, a novel framework for protecting visual datasets in RAIG. Our framework synthesizes sentinel images that maintain visual consistency with the original dataset. These sentinels enable protection verification through randomly generated character sequences that serve as retrieval keys. To ensure seamless integration, we leverage vision-language models to generate the sentinel images. Experimental results demonstrate that ImageSentinel effectively detects unauthorized dataset usage while preserving generation quality for authorized applications. Code is available at this https URL.
[CV-73] Self-Supervised Selective-Guided Diffusion Model for Old-Photo Face Restoration
【Quick Read】: This paper addresses localized artifacts and facial-color distortion in old-photo face restoration under compounded degradations such as breakage, fading, and severe blur, where existing diffusion-guided methods relying on explicit degradation priors or global statistics struggle with region-specific repair and identity preservation. The key of the proposed Self-Supervised Selective-Guided Diffusion (SSDiff) is to take pseudo-reference faces generated by a pre-trained diffusion model under weak guidance as pseudo-labels, which exhibit structurally aligned contours and natural colors, and to apply staged supervision matched to diffusion's coarse-to-fine nature: structural guidance throughout denoising and color refinement in the later steps. Face parsing maps and scratch masks enable selective restoration of damaged regions without identity mismatch, and a new 300-image benchmark of real old face photos, VintageFace, is introduced.
Link: https://arxiv.org/abs/2510.12114
Authors: Wenjie Li,Xiangyi Wang,Heng Guo,Guangwei Gao,Zhanyu Ma
Affiliations: Beijing University of Posts and Telecommunications; Nanjing University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:
Abstract:Old-photo face restoration poses significant challenges due to compounded degradations such as breakage, fading, and severe blur. Existing pre-trained diffusion-guided methods either rely on explicit degradation priors or global statistical guidance, which struggle with localized artifacts or face color. We propose Self-Supervised Selective-Guided Diffusion (SSDiff), which leverages pseudo-reference faces generated by a pre-trained diffusion model under weak guidance. These pseudo-labels exhibit structurally aligned contours and natural colors, enabling region-specific restoration via staged supervision: structural guidance applied throughout the denoising process and color refinement in later steps, aligned with the coarse-to-fine nature of diffusion. By incorporating face parsing maps and scratch masks, our method selectively restores breakage regions while avoiding identity mismatch. We further construct VintageFace, a 300-image benchmark of real old face photos with varying degradation levels. SSDiff outperforms existing GAN-based and diffusion-based methods in perceptual quality, fidelity, and regional controllability. Code link: this https URL.
[CV-74] DRL: Discriminative Representation Learning with Parallel Adapters for Class Incremental Learning
【Quick Read】: This paper addresses three difficulties of non-rehearsal class-incremental learning (CIL): ever-growing model complexity, non-smooth representation shift during incremental learning, and the inconsistency between stage-wise sub-problem optimization and global inference. The key of the proposed Discriminative Representation Learning (DRL) framework is two designs. Its Incremental Parallel Adapter (IPA) network, built on a pre-trained model (PTM), learns one lightweight adapter per incremental stage at small parameter overhead; a transfer gate lets each adapter inherit and propagate the current model's representation capability through a parallel connection, guaranteeing a smooth representation shift across stages. Its Decoupled Anchor Supervision (DAS) decouples the constraints on positive and negative samples by comparing each against a virtual anchor, promoting discriminative representations and aligning feature spaces learned at different stages, thereby narrowing the gap between stage-wise local optimization over a data subset and global inference across all classes.
Link: https://arxiv.org/abs/2510.12107
Authors: Jiawei Zhan,Jun Liu,Jinlong Peng,Xiaochen Chen,Bin-Bin Gao,Yong Liu,Chengjie Wang
Affiliations: Tencent Youtu Lab
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 7 figures
Abstract:With the excellent representation capabilities of Pre-Trained Models (PTMs), remarkable progress has been made in non-rehearsal Class-Incremental Learning (CIL) research. However, it remains an extremely challenging task due to three conundrums: increasingly large model complexity, non-smooth representation shift during incremental learning and inconsistency between stage-wise sub-problem optimization and global inference. In this work, we propose the Discriminative Representation Learning (DRL) framework to specifically address these challenges. To conduct incremental learning effectively and yet efficiently, the DRL’s network, called Incremental Parallel Adapter (IPA) network, is built upon a PTM and increasingly augments the model by learning a lightweight adapter with a small amount of parameter learning overhead in each incremental stage. The adapter is responsible for adapting the model to new classes, it can inherit and propagate the representation capability from the current model through parallel connection between them by a transfer gate. As a result, this design guarantees a smooth representation shift between different incremental stages. Furthermore, to alleviate inconsistency and enable comparable feature representations across incremental stages, we design the Decoupled Anchor Supervision (DAS). It decouples constraints of positive and negative samples by respectively comparing them with the virtual anchor. This decoupling promotes discriminative representation learning and aligns the feature spaces learned at different stages, thereby narrowing the gap between stage-wise local optimization over a subset of data and global inference across all classes. Extensive experiments on six benchmarks reveal that our DRL consistently outperforms other state-of-the-art methods throughout the entire CIL period while maintaining high efficiency in both training and inference phases.
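The decoupling idea, constraining positives and negatives through a shared virtual anchor rather than directly against each other, can be sketched as two independent terms. The cosine form and the margin below are assumptions, not DAS's exact formulation.

```python
import torch
import torch.nn.functional as F

def decoupled_anchor_loss(features, labels, anchors, margin=0.1):
    """Compare positives and negatives against virtual class anchors
    separately, instead of directly against each other (sketch).

    features: (B, D) embeddings; labels: (B,); anchors: (C, D), one per class.
    """
    feats = F.normalize(features, dim=-1)
    ancs = F.normalize(anchors, dim=-1)
    sim = feats @ ancs.t()                               # (B, C) cosine similarity
    pos = sim.gather(1, labels.view(-1, 1)).squeeze(1)   # similarity to own anchor
    neg_mask = torch.ones_like(sim).scatter(1, labels.view(-1, 1), 0.0)
    pos_loss = (1.0 - pos).mean()                        # pull toward own anchor
    neg_loss = ((sim - margin).clamp_min(0.0) * neg_mask).sum(1).mean() \
        / max(1, sim.size(1) - 1)                        # push below a margin
    return pos_loss + neg_loss
```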
zh
[CV-75] Gaussian Semantic Field for One-shot LiDAR Global Localization
【速读】:该论文旨在解决基于地标语义注册的全局定位方法中因地标重复或误导性导致对应关系建立困难的问题。解决方案的关键在于构建一个轻量级三层次场景图(tri-layered scene graph),其中引入由高斯过程学习得到的连续函数作为中间层,用于建模语义分布,从而捕捉更细粒度的地理语义信息并提供更精确的度量信息以支持对应关系建立,该方法被称为Outram-GSF(Gaussian semantic field)。
链接: https://arxiv.org/abs/2510.12101
作者: Pengyu Yin,Shenghai Yuan,Haozhi Cao,Xingyu Ji,Ruofei Bai,Siyu Chen,Lihua Xie
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present a one-shot LiDAR global localization algorithm featuring semantic disambiguation ability based on a lightweight tri-layered scene graph. While landmark semantic registration-based methods have shown promising performance improvements in global localization compared with geometric-only methods, landmarks can be repetitive and misleading for correspondence establishment. We propose to mitigate this problem by modeling semantic distributions with continuous functions learned from a population of Gaussian processes. Compared with discrete semantic labels, the continuous functions capture finer-grained geo-semantic information and also provide more detailed metric information for correspondence establishment. We insert this continuous function as the middle layer between the object layer and the metric-semantic layer, forming a tri-layered 3D scene graph, serving as a light-weight yet performant backend for one-shot localization. We term our global localization pipeline Outram-GSF (Gaussian semantic field) and conduct a wide range of experiments on publicly available data sets, validating the superior performance against the current state-of-the-art.
zh
[CV-76] G4Splat: Geometry-Guided Gaussian Splatting with Generative Prior
【速读】:该论文旨在解决当前基于预训练扩散模型进行3D场景重建时面临的两个关键问题:一是缺乏可靠的几何监督,导致即使在观测区域内也难以生成高质量的重建结果,更无法有效处理未观测区域;二是生成图像中多视角不一致性严重,引发形状-外观歧义并损害场景几何精度。解决方案的关键在于将精确的几何信息作为利用生成模型提升3D重建效果的基础前提,具体包括:首先利用平面结构普遍性估计出度量尺度的深度图,为观测与未观测区域提供可靠监督;其次将该几何引导贯穿整个生成流程,用于改进可见性掩码估计、指导新视角选择,并在视频扩散模型补全过程中增强多视角一致性,从而实现高精度且一致性的场景补全。
链接: https://arxiv.org/abs/2510.12099
作者: Junfeng Ni,Yixin Chen,Zhifei Yang,Yu Liu,Ruijie Lu,Song-Chun Zhu,Siyuan Huang
机构: Tsinghua University (清华大学); State Key Laboratory of General Artificial Intelligence, BIGAI; Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Despite recent advances in leveraging generative prior from pre-trained diffusion models for 3D scene reconstruction, existing methods still face two critical limitations. First, due to the lack of reliable geometric supervision, they struggle to produce high-quality reconstructions even in observed regions, let alone in unobserved areas. Second, they lack effective mechanisms to mitigate multi-view inconsistencies in the generated images, leading to severe shape-appearance ambiguities and degraded scene geometry. In this paper, we identify accurate geometry as the fundamental prerequisite for effectively exploiting generative models to enhance 3D scene reconstruction. We first propose to leverage the prevalence of planar structures to derive accurate metric-scale depth maps, providing reliable supervision in both observed and unobserved regions. Furthermore, we incorporate this geometry guidance throughout the generative pipeline to improve visibility mask estimation, guide novel view selection, and enhance multi-view consistency when inpainting with video diffusion models, resulting in accurate and consistent scene completion. Extensive experiments on Replica, ScanNet++, and DeepBlending show that our method consistently outperforms existing baselines in both geometry and appearance reconstruction, particularly for unobserved regions. Moreover, our method naturally supports single-view inputs and unposed videos, with strong generalizability in both indoor and outdoor scenarios with practical real-world applicability. The project page is available at this https URL.
zh
[CV-77] An Adaptive Edge-Guided Dual-Network Framework for Fast QR Code Motion Deblurring
【速读】:该论文旨在解决QR码图像去模糊问题,其核心挑战在于传统图像去模糊方法注重感知质量,而QR码去模糊需确保解码成功率。由于QR码具有高度结构化模式和锐利边缘的特性,可作为强先验信息用于恢复,但现有深度学习方法极少显式利用这一先验。解决方案的关键是提出边缘引导注意力模块(Edge-Guided Attention Block, EGAB),将显式的边缘先验嵌入Transformer架构中;基于EGAB构建了Edge-Guided Restormer(EG-Restormer)网络,显著提升严重模糊QR码的解码率;同时设计轻量高效网络(LENet)用于轻微模糊输入,并集成二者为自适应双网络(ADNet),根据输入模糊程度动态选择模型,适用于资源受限移动设备。
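"按输入模糊程度动态选网"的调度逻辑可以用经典的拉普拉斯方差清晰度指标来示意(度量方式与阈值均为假设,论文未必采用此实现):

```python
import cv2
import numpy as np

def blur_severity(gray: np.ndarray) -> float:
    """经典清晰度代理:拉普拉斯响应的方差,值越小图像越模糊。"""
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

def adaptive_deblur(img: np.ndarray, lenet, eg_restormer, thresh: float = 100.0):
    """按模糊程度动态选网:轻度模糊走轻量的 LENet,重度模糊走 EG-Restormer。
    lenet / eg_restormer 为假设的可调用去模糊模型。"""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return lenet(img) if blur_severity(gray) > thresh else eg_restormer(img)
```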
链接: https://arxiv.org/abs/2510.12098
作者: Jianping Li,Dongyang Guo,Wenjie Li,Wei Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Unlike general image deblurring that prioritizes perceptual quality, QR code deblurring focuses on ensuring successful decoding. QR codes are characterized by highly structured patterns with sharp edges, a robust prior for restoration. Yet existing deep learning methods rarely exploit these priors explicitly. To address this gap, we propose the Edge-Guided Attention Block (EGAB), which embeds explicit edge priors into a Transformer architecture. Based on EGAB, we develop Edge-Guided Restormer (EG-Restormer), an effective network that significantly boosts the decoding rate of severely blurred QR codes. For mildly blurred inputs, we design the Lightweight and Efficient Network (LENet) for fast deblurring. We further integrate these two networks into an Adaptive Dual-network (ADNet), which dynamically selects the suitable network based on input blur severity, making it ideal for resource-constrained mobile devices. Extensive experiments show that our EG-Restormer and ADNet achieve state-of-the-art performance with a competitive speed. Project page: this https URL
zh
[CV-78] IL3D: A Large-Scale Indoor Layout Dataset for LLM -Driven 3D Scene Generation
【速读】:该论文旨在解决室内场景生成中高质量、多样化训练数据稀缺的问题,以支持大语言模型(Large Language Model, LLM)驱动的3D场景生成任务。其关键解决方案是构建了一个大规模、精细化标注的3D场景数据集IL3D,包含27,816个室内布局和29,215个高保真3D物体资产,并提供实例级自然语言注释,从而支撑视觉-语言多模态学习。通过在IL3D上进行监督微调(Supervised Fine-Tuning, SFT),显著提升了LLM在场景生成中的泛化能力,优于其他数据集上的微调效果。此外,IL3D支持多种模态数据导出格式(如点云、3D边界框、多视角图像等),便于适配不同视觉任务,推动了3D场景生成与具身智能研究的发展。
链接: https://arxiv.org/abs/2510.12095
作者: Wenxu Zhou,Kaixuan Nie,Hang Du,Dong Yin,Wei Huang,Siqiang Guo,Xiaobo Zhang,Pengbo Hu
机构: University of Science and Technology of China (中国科学技术大学); Songying Technology
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages main paper; 15 pages references and appendix
Abstract:In this study, we present IL3D, a large-scale dataset meticulously designed for large language model (LLM)-driven 3D scene generation, addressing the pressing demand for diverse, high-quality training data in indoor layout design. Comprising 27,816 indoor layouts across 18 prevalent room types and a library of 29,215 high-fidelity 3D object assets, IL3D is enriched with instance-level natural language annotations to support robust multimodal learning for vision-language tasks. We establish rigorous benchmarks to evaluate LLM-driven scene generation. Experimental results show that supervised fine-tuning (SFT) of LLMs on IL3D significantly improves generalization and surpasses the performance of SFT on other datasets. IL3D offers flexible multimodal data export capabilities, including point clouds, 3D bounding boxes, multiview images, depth maps, normal maps, and semantic masks, enabling seamless adaptation to various visual tasks. As a versatile and robust resource, IL3D significantly advances research in 3D scene generation and embodied intelligence, by providing high-fidelity scene data to support environment perception tasks of embodied agents.
zh
[CV-79] Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback
【速读】:该论文旨在解决当前基于扩散模型(diffusion models)的音频驱动人脸视频生成方法中存在的三大挑战:唇形同步(lip-sync)准确性不足、长视频生成时的时间连贯性差,以及多角色动画的实现困难。其解决方案的关键在于提出了一种基于扩散Transformer(DiT)的框架,并引入三项核心技术:首先,采用LoRA(Low-Rank Adaptation)训练策略结合位置偏移推理(position shift inference),在不破坏基础模型能力的前提下实现高效长视频生成;其次,通过局部参数更新与奖励反馈机制协同优化唇同步与自然肢体动作;最后,创新性地提出无需训练的Mask Classifier-Free Guidance(Mask-CFG)方法,支持三名及以上角色的音频驱动动画,且无需额外数据集或模型修改,从而在保持高质量、时间一致性的同时显著提升多角色场景下的实用性与效率。
链接: https://arxiv.org/abs/2510.12089
作者: Xingpei Ma,Shenneng Huang,Jiaran Cai,Yuansheng Guan,Shen Zheng,Hanfeng Zhao,Qiang Zhang,Shunsi Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in diffusion models have significantly improved audio-driven human video generation, surpassing traditional methods in both quality and controllability. However, existing approaches still face challenges in lip-sync accuracy, temporal coherence for long video generation, and multi-character animation. In this work, we propose a diffusion transformer (DiT)-based framework for generating lifelike talking videos of arbitrary length, and introduce a training-free method for multi-character audio-driven animation. First, we employ a LoRA-based training strategy combined with a position shift inference approach, which enables efficient long video generation while preserving the capabilities of the foundation model. Moreover, we combine partial parameter updates with reward feedback to enhance both lip synchronization and natural body motion. Finally, we propose a training-free approach, Mask Classifier-Free Guidance (Mask-CFG), for multi-character animation, which requires no specialized datasets or model modifications and supports audio-driven animation for three or more characters. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches, achieving high-quality, temporally coherent, and multi-character audio-driven video generation in a simple, efficient, and cost-effective manner.
zh
[CV-80] A Review on Domain Adaption and Generative Adversarial Networks(GANs)
【速读】:该论文旨在解决计算机视觉领域中高质量标注数据稀缺的问题,尤其是在图像分类任务中,由于人工标注成本高昂甚至不可行,导致模型训练受限。其解决方案的关键在于域适应(Domain Adaptation),即利用在某一源域上训练的模型,直接迁移至目标域(同类型但分布不同的数据)进行预测,例如将基于画作图像训练的飞机识别模型应用于真实飞机图像的分类任务,从而缓解对大量标注数据的依赖并提升跨域场景下的模型性能。
链接: https://arxiv.org/abs/2510.12075
作者: Aashish Dhawan,Divyanshu Mudgal
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The major challenge in today’s computer vision scenario is the availability of good quality labeled data. In a field of study like image classification, where data is of utmost importance, we need to find more reliable methods which can overcome the scarcity of data to produce results comparable to previous benchmark results. In most cases, obtaining labeled data is very difficult because of the high cost of human labor, and in some cases impossible. The purpose of this paper is to discuss Domain Adaptation and various methods to implement it. The main idea is to use a model trained on a particular dataset to predict on data from a different domain of the same kind, for example, a model trained on paintings of airplanes predicting on real images of airplanes.
zh
[CV-81] VIDMP3: Video Editing by Representing Motion with Pose and Position Priors
【速读】:该论文旨在解决运动保持视频编辑中结构与语义灵活性不足的问题,尤其在需要对替换对象的结构和语义进行自由调整时,现有基于扩散模型的方法常面临时空不一致性、主体身份漂移及人工干预需求等问题。解决方案的关键在于提出VidMP3方法,通过利用姿态(pose)和位置(position)先验从源视频中学习通用的运动表示,从而在保持原始运动模式的同时,实现结构与语义的灵活生成。
链接: https://arxiv.org/abs/2510.12069
作者: Sandeep Mishra,Oindrila Saha,Alan C. Bovik
机构: University of Texas at Austin (得克萨斯大学奥斯汀分校); University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Motion-preserved video editing is crucial for creators, particularly in scenarios that demand flexibility in both the structure and semantics of swapped objects. Despite its potential, this area remains underexplored. Existing diffusion-based editing methods excel in structure-preserving tasks, using dense guidance signals to ensure content integrity. While some recent methods attempt to address structure-variable editing, they often suffer from issues such as temporal inconsistency, subject identity drift, and the need for human intervention. To address these challenges, we introduce VidMP3, a novel approach that leverages pose and position priors to learn a generalized motion representation from source videos. Our method enables the generation of new videos that maintain the original motion while allowing for structural and semantic flexibility. Both qualitative and quantitative evaluations demonstrate the superiority of our approach over existing methods. The code will be made publicly available at this https URL.
zh
[CV-82] Your VAR Model is Secretly an Efficient and Explainable Generative Classifier
【速读】:该论文旨在解决当前生成式分类器(generative classifier)研究过度依赖扩散模型(diffusion-based models)所带来的计算成本高、可扩展性差以及对分类机制理解不足的问题。其解决方案的关键在于提出一种基于视觉自回归建模(visual autoregressive modeling, VAR)的新一代生成式分类器——A-VARC⁺,该方法不仅在准确率与推理速度之间实现了更优权衡,还通过可计算的似然函数支持基于token级互信息的可视化解释,并展现出在类增量学习任务中对灾难性遗忘的天然鲁棒性,从而为生成式分类器提供了新的理论视角与实用路径。
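生成式分类器的判别规则本质上是对各类别条件似然取 argmax;自回归模型的似然可精确计算,这正是 A-VARC⁺ 可解释性的来源。与具体模型无关的示意(接口为假设):

```python
import torch

@torch.no_grad()
def generative_classify(log_likelihood_fn, x_tokens: torch.Tensor,
                        num_classes: int) -> int:
    """对每个类别 y 计算条件对数似然 log p(x|y),取最大者为预测。
    log_likelihood_fn(tokens, y) 为假设接口;对自回归模型,该值可由
    逐 token 的 log 概率精确求和得到(逐 token 互信息解释亦源于此)。"""
    scores = torch.tensor([log_likelihood_fn(x_tokens, y)
                           for y in range(num_classes)])
    return int(scores.argmax().item())
```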
链接: https://arxiv.org/abs/2510.12060
作者: Yi-Chung Chen,David I. Inouye,Jing Gao
机构: Elmore Family School of Electrical and Computer Engineering (电气与计算机工程埃尔莫尔家庭学院); Purdue University (普渡大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative classifiers, which leverage conditional generative models for classification, have recently demonstrated desirable properties such as robustness to distribution shifts. However, recent progress in this area has been largely driven by diffusion-based models, whose substantial computational cost severely limits scalability. This exclusive focus on diffusion-based methods has also constrained our understanding of generative classifiers. In this work, we propose a novel generative classifier built on recent advances in visual autoregressive (VAR) modeling, which offers a new perspective for studying generative classifiers. To further enhance its performance, we introduce the Adaptive VAR Classifier⁺ (A-VARC⁺), which achieves a superior trade-off between accuracy and inference speed, thereby significantly improving practical applicability. Moreover, we show that the VAR-based method exhibits fundamentally different properties from diffusion-based methods. In particular, due to its tractable likelihood, the VAR-based classifier enables visual explainability via token-wise mutual information and demonstrates inherent resistance to catastrophic forgetting in class-incremental learning tasks.
zh
[CV-83] APGNet: Adaptive Prior-Guided for Underwater Camouflaged Object Detection ACM-MM
【速读】:该论文旨在解决水下环境中伪装目标检测(Camouflaged Object Detection, COD)的两大挑战:一是水下图像退化问题,包括对比度低和颜色失真;二是海洋生物天然的伪装特性导致传统方法难以准确识别。为应对上述问题,作者提出了一种自适应先验引导网络(Adaptive Prior-Guided Network, APGNet),其核心创新在于引入了一个分层融合位置与边界先验的自适应机制:在高层特征中嵌入空间注意力以实现粗粒度定位,在低层特征中采用可变形卷积进行轮廓精修,从而提升对复杂水下场景的鲁棒性和检测精度。此外,结合多尺度Retinex色彩恢复(MSRCR)增强数据分布,并设计扩展感受野(Extended Receptive Field, ERF)模块与多尺度渐进解码器(Multi-Scale Progressive Decoder, MPD)以增强上下文建模能力,最终在两个公开MAS数据集上显著优于15种主流方法。
链接: https://arxiv.org/abs/2510.12056
作者: Xinxin Huang,Han Sun,Junmin Cai,Ningzhong Liu,Huiyu Zhou
机构: Nanjing University of Aeronautics and Astronautics (南京航空航天大学); University of Leicester (莱斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages. accepted by ACM MM Asia 2025
Abstract:Detecting camouflaged objects in underwater environments is crucial for marine ecological research and resource exploration. However, existing methods face two key challenges: underwater image degradation, including low contrast and color distortion, and the natural camouflage of marine organisms. Traditional image enhancement techniques struggle to restore critical features in degraded images, while camouflaged object detection (COD) methods developed for terrestrial scenes often fail to adapt to underwater environments due to the lack of consideration for underwater optical characteristics. To address these issues, we propose APGNet, an Adaptive Prior-Guided Network, which integrates a Siamese architecture with a novel prior-guided mechanism to enhance robustness and detection accuracy. First, we employ the Multi-Scale Retinex with Color Restoration (MSRCR) algorithm for data augmentation, generating illumination-invariant images to mitigate degradation effects. Second, we design an Extended Receptive Field (ERF) module combined with a Multi-Scale Progressive Decoder (MPD) to capture multi-scale contextual information and refine feature representations. Furthermore, we propose an adaptive prior-guided mechanism that hierarchically fuses position and boundary priors by embedding spatial attention in high-level features for coarse localization and using deformable convolution to refine contours in low-level features. Extensive experimental results on two public MAS datasets demonstrate that our proposed method APGNet outperforms 15 state-of-the-art methods under widely used evaluation metrics.
zh
[CV-84] Evaluating the Explainability of Vision Transformers in Medical Imaging MICCAI2025
【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)在医学影像诊断中因复杂注意力机制导致的可解释性不足问题,从而提升临床对模型决策的信任与采纳。其解决方案的关键在于系统评估不同ViT架构(ViT、DeiT、DINO、Swin Transformer)与预训练策略在两类医学图像任务(外周血细胞分类和乳腺超声图像分类)中的解释能力,并结合Grad-CAM与梯度注意力回传(Gradient Attention Rollout)方法进行定量与定性分析;研究发现,基于DINO预训练模型配合Grad-CAM生成的热图具有最高的忠实性和空间定位精度,能够清晰突出与临床相关的形态学特征,即使在误分类情况下亦能揭示误导模型的关键结构信息,从而显著增强模型透明度,支持其在关键医疗诊断流程中的可靠部署。
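Grad-CAM 的计算本身非常紧凑:以梯度的空间均值为通道权重,对目标层激活加权求和后取 ReLU。下面是与骨干无关的示意(对 ViT 需先把 patch token 重排成二维网格;维度为假设值):

```python
import torch

def grad_cam(feats: torch.Tensor, grads: torch.Tensor) -> torch.Tensor:
    """feats / grads: (B, C, H, W),目标层激活与类别得分对它的梯度。"""
    weights = grads.mean(dim=(2, 3), keepdim=True)    # 通道权重 (B, C, 1, 1)
    cam = torch.relu((weights * feats).sum(dim=1))    # 加权求和并截断负响应
    return cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)  # 归一化到 [0, 1]

feats = torch.randn(1, 64, 14, 14)
grads = torch.randn(1, 64, 14, 14)
print(grad_cam(feats, grads).shape)  # torch.Size([1, 14, 14])
```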
链接: https://arxiv.org/abs/2510.12021
作者: Leili Barekatain,Ben Glocker
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at Workshop on Interpretability of Machine Intelligence in Medical Image Computing at MICCAI 2025
Abstract:Understanding model decisions is crucial in medical imaging, where interpretability directly impacts clinical trust and adoption. Vision Transformers (ViTs) have demonstrated state-of-the-art performance in diagnostic imaging; however, their complex attention mechanisms pose challenges to explainability. This study evaluates the explainability of different Vision Transformer architectures and pre-training strategies - ViT, DeiT, DINO, and Swin Transformer - using Gradient Attention Rollout and Grad-CAM. We conduct both quantitative and qualitative analyses on two medical imaging tasks: peripheral blood cell classification and breast ultrasound image classification. Our findings indicate that DINO combined with Grad-CAM offers the most faithful and localized explanations across datasets. Grad-CAM consistently produces class-discriminative and spatially precise heatmaps, while Gradient Attention Rollout yields more scattered activations. Even in misclassification cases, DINO with Grad-CAM highlights clinically relevant morphological features that appear to have misled the model. By improving model transparency, this research supports the reliable and explainable integration of ViTs into critical medical diagnostic workflows.
zh
[CV-85] Prompt-Guided Spatial Understanding with RGB-D Transformers for Fine-Grained Object Relation Reasoning ICCV
【速读】:该论文旨在解决大规模3D场景(如仓库)中视觉语言系统在空间推理方面的挑战,这些问题主要源于场景杂乱、遮挡以及对精确空间理解的需求。现有模型因过度依赖局部外观特征且缺乏显式空间定位能力而难以泛化。解决方案的关键在于构建一个专门用于物理AI空间智能仓库数据集(Physical AI Spatial Intelligence Warehouse dataset)的空间推理框架,通过将掩码的边界框坐标直接嵌入输入提示中,实现对物体几何形状和布局的显式建模;同时,针对距离估计、物体计数、多选定位和空间关系推理四类任务进行细粒度微调,并在训练集中添加归一化答案以提升与评估系统的匹配一致性,从而显著增强模型在真实工业环境中的空间推理能力。
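"把掩码边界框坐标直接写进提示"可用一个简单的字符串模板示意(标签格式与归一化方式均为假设,非论文给定):

```python
def build_spatial_prompt(question: str, objects: list) -> str:
    """把每个对象的归一化边界框 [x1, y1, x2, y2] 写入提示,显式提供空间先验。"""
    lines = [
        f"<obj id={o['id']} bbox=[{o['x1']:.3f},{o['y1']:.3f},{o['x2']:.3f},{o['y2']:.3f}]>"
        for o in objects
    ]
    return "\n".join(lines) + f"\nQuestion: {question}"

print(build_spatial_prompt(
    "How far apart are pallet 1 and pallet 2?",
    [{"id": 1, "x1": 0.12, "y1": 0.40, "x2": 0.25, "y2": 0.62},
     {"id": 2, "x1": 0.58, "y1": 0.38, "x2": 0.71, "y2": 0.60}],
))
```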
链接: https://arxiv.org/abs/2510.11996
作者: Tanner Muturi,Blessing Agyei Kyem,Joshua Kofi Asamoah,Neema Jakisa Owor,Richard Dyzinela,Andrews Danyo,Yaw Adu-Gyamfi,Armstrong Aboah
机构: University of Missouri–Columbia (密苏里大学-哥伦比亚分校); North Dakota State University (北达科他州立大学); Texas A&M (德克萨斯农工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The paper was accepted at ICCV Conference 2025
Abstract:Spatial reasoning in large-scale 3D environments such as warehouses remains a significant challenge for vision-language systems due to scene clutter, occlusions, and the need for precise spatial understanding. Existing models often struggle with generalization in such settings, as they rely heavily on local appearance and lack explicit spatial grounding. In this work, we introduce a dedicated spatial reasoning framework for the Physical AI Spatial Intelligence Warehouse dataset introduced in the Track 3 2025 AI City Challenge. Our approach enhances spatial comprehension by embedding mask dimensions in the form of bounding box coordinates directly into the input prompts, enabling the model to reason over object geometry and layout. We fine-tune the framework across four question categories namely: Distance Estimation, Object Counting, Multi-choice Grounding, and Spatial Relation Inference using task-specific supervision. To further improve consistency with the evaluation system, normalized answers are appended to the GPT response within the training set. Our comprehensive pipeline achieves a final score of 73.0606, placing 4th overall on the public leaderboard. These results demonstrate the effectiveness of structured prompt enrichment and targeted optimization in advancing spatial reasoning for real-world industrial environments.
zh
[CV-86] PanoTPS-Net: Panoramic Room Layout Estimation via Thin Plate Spline Transformation
【速读】:该论文旨在解决从单张全景图像中准确估计房间三维布局的问题,这是计算机视觉领域的重要任务,广泛应用于机器人导航、增强现实和室内设计等场景。解决方案的关键在于提出了一种名为PanoTPS-Net的新模型,其核心创新是将卷积神经网络(CNN)与薄板样条(Thin Plate Spline, TPS)空间变换相结合:首先通过CNN提取输入图像的高层特征以学习TPS变换的空间参数,随后利用TPS层将参考布局映射到目标布局。这种结构使模型不仅能精准预测房间布局,还具备对立方体和非立方体布局的良好泛化能力,实验表明其在多个公开数据集上均取得了优异性能(如3DIoU最高达91.98)。
链接: https://arxiv.org/abs/2510.11992
作者: Hatem Ibrahem,Ahmed Salem,Qinmin Vivian Hu,Guanghui Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurately estimating the 3D layout of rooms is a crucial task in computer vision, with potential applications in robotics, augmented reality, and interior design. This paper proposes a novel model, PanoTPS-Net, to estimate room layout from a single panorama image. Leveraging a Convolutional Neural Network (CNN) and incorporating a Thin Plate Spline (TPS) spatial transformation, the architecture of PanoTPS-Net is divided into two stages: First, a convolutional neural network extracts the high-level features from the input images, allowing the network to learn the spatial parameters of the TPS transformation. Second, the TPS spatial transformation layer is generated to warp a reference layout to the required layout based on the predicted parameters. This unique combination empowers the model to properly predict room layouts while also generalizing effectively to both cuboid and non-cuboid layouts. Extensive experiments on publicly available datasets and comparisons with state-of-the-art methods demonstrate the effectiveness of the proposed method. The results underscore the model’s accuracy in room layout estimation and emphasize the compatibility between the TPS transformation and panorama images. The robustness of the model in handling both cuboid and non-cuboid room layout estimation is evident with a 3DIoU value of 85.49, 86.16, 81.76, and 91.98 on PanoContext, Stanford-2D3D, Matterport3DLayout, and ZInD datasets, respectively. The source code is available at: this https URL.
zh
[CV-87] MosaicDiff: Training-free Structural Pruning for Diffusion Model Acceleration Reflecting Pretraining Dynamics ICCV2025
【速读】:该论文旨在解决扩散模型(Diffusion Models)在预训练过程中存在不同学习速率阶段,而现有后训练加速方法未考虑这一动态特性的问题。解决方案的关键在于提出一种名为 MosaicDiff 的新框架,通过轨迹感知的结构剪枝(trajectory-aware structural pruning)实现预训练动态与采样加速过程的对齐:在快速学习阶段采用保守剪枝以保留关键特征,而在慢速学习阶段则实施更激进的剪枝策略,从而首次显式匹配扩散模型预训练的学习速度变化,显著提升采样效率且不损失生成质量。
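"轨迹感知剪枝"的要点是按学习速度分段设定剪枝强度:快学习的中段保守、早晚期激进。一个假设性的调度函数如下(区间划分与比例均为示意值,非论文给定):

```python
def pruning_ratio(t: float, aggressive: float = 0.5, conservative: float = 0.2) -> float:
    """t ∈ [0, 1] 为预训练/去噪轨迹上的相对位置。
    中段(快学习期)保守剪枝以保留关键特征,两端(慢学习期)激进剪枝。"""
    return conservative if 0.3 <= t <= 0.7 else aggressive

print([pruning_ratio(i / 10) for i in range(11)])
# [0.5, 0.5, 0.5, 0.2, 0.2, 0.2, 0.2, 0.2, 0.5, 0.5, 0.5]
```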
链接: https://arxiv.org/abs/2510.11962
作者: Bowei Guo,Shengkun Tang,Cong Zeng,Zhiqiang Shen
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: International Conference on Computer Vision, ICCV 2025
Abstract:Diffusion models are renowned for their generative capabilities, yet their pretraining processes exhibit distinct phases of learning speed that have been entirely overlooked in prior post-training acceleration efforts in the community. In this study, we introduce a novel framework called MosaicDiff that aligns diffusion pretraining dynamics with post-training sampling acceleration via trajectory-aware structural pruning. Our approach leverages the observation that the middle, fast-learning stage of diffusion pretraining requires more conservative pruning to preserve critical model features, while the early and later, slow-learning stages benefit from a more aggressive pruning strategy. This adaptive pruning mechanism is the first to explicitly mirror the inherent learning speed variations of diffusion pretraining, thereby harmonizing the model’s inner training dynamics with its accelerated sampling process. Extensive experiments on DiT and SDXL demonstrate that our method achieves significant speed-ups in sampling without compromising output quality, outperforming previous state-of-the-art methods by large margins, also providing a new viewpoint for more efficient and robust training-free diffusion acceleration.
zh
[CV-88] ask-Specific Dual-Model Framework for Comprehensive Traffic Safety Video Description and Analysis ICCV2025
【速读】:该论文旨在解决交通安全管理中视频理解的复杂性问题,即如何准确捕捉细粒度的行为模式并生成全面的描述以支持事故预防。其解决方案的关键在于提出一种独特的双模型框架,通过任务特定优化策略,充分发挥VideoLLaMA在时序推理方面的优势与Qwen2.5-VL在视觉理解上的能力,核心创新点是将字幕生成(captioning)与视觉问答(VQA)任务的训练过程分离,从而减少任务干扰、提升各模型的专业化性能。实验表明,该方法在WTS数据集上实现了S2分数45.7572,在AI City Challenge Track 2中排名第十,且消融研究验证了分离训练相较于联合训练在VQA准确率上提升8.6%的同时保持字幕质量。
链接: https://arxiv.org/abs/2510.11907
作者: Blessing Agyei Kyem,Neema Jakisa Owor,Andrews Danyo,Joshua Kofi Asamoah,Eugene Denteh,Tanner Muturi,Anthony Dontoh,Yaw Adu-Gyamfi,Armstrong Aboah
机构: North Dakota State University (北达科他州立大学); University of Missouri–Columbia (密苏里大学哥伦比亚分校); University of Memphis (孟菲斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper was accepted at ICCV 2025
Abstract:Traffic safety analysis requires complex video understanding to capture fine-grained behavioral patterns and generate comprehensive descriptions for accident prevention. In this work, we present a unique dual-model framework that strategically utilizes the complementary strengths of VideoLLaMA and Qwen2.5-VL through task-specific optimization to address this issue. The core insight behind our approach is that separating training for captioning and visual question answering (VQA) tasks minimizes task interference and allows each model to specialize more effectively. Experimental results demonstrate that VideoLLaMA is particularly effective in temporal reasoning, achieving a CIDEr score of 1.1001, while Qwen2.5-VL excels in visual understanding with a VQA accuracy of 60.80%. Through extensive experiments on the WTS dataset, our method achieves an S2 score of 45.7572 in the 2025 AI City Challenge Track 2, placing 10th on the challenge leaderboard. Ablation studies validate that our separate training strategy outperforms joint training by 8.6% in VQA accuracy while maintaining captioning quality.
zh
[CV-89] MammoDINO: Anatomically Aware Self-Supervision for Mammographic Images
【速读】:该论文旨在解决自监督学习(Self-supervised Learning, SSL)在医学影像领域,特别是乳腺X线摄影(mammography)中应用受限的问题,主要挑战包括数据量有限以及领域特异性偏差。其解决方案的关键在于提出了一种名为MammoDINO的新颖SSL框架,通过两个核心创新实现:一是引入了基于乳腺组织感知的数据增强采样器,用于图像级和patch级的监督信号;二是设计了一种跨切片对比学习目标,将三维数字乳腺断层合成(Digital Breast Tomosynthesis, DBT)结构信息融入二维预训练过程中,从而有效捕捉临床相关的特征表示。该方法在多个乳腺癌筛查任务上达到最先进性能,并在五个基准数据集上表现出良好的泛化能力,为乳腺X线图像提供了一个可扩展、无需标注的基础模型,助力多用途计算机辅助诊断(Computer-Aided Diagnosis, CAD)工具的发展。
链接: https://arxiv.org/abs/2510.11883
作者: Sicheng Zhou,Lei Wu,Cao Xiao,Parminder Bhatia,Taha Kass-Hout
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 5 pages
Abstract:Self-supervised learning (SSL) has transformed vision encoder training in general domains but remains underutilized in medical imaging due to limited data and domain specific biases. We present MammoDINO, a novel SSL framework for mammography, pretrained on 1.4 million mammographic images. To capture clinically meaningful features, we introduce a breast tissue aware data augmentation sampler for both image-level and patch-level supervision and a cross-slice contrastive learning objective that leverages 3D digital breast tomosynthesis (DBT) structure into 2D pretraining. MammoDINO achieves state-of-the-art performance on multiple breast cancer screening tasks and generalizes well across five benchmark datasets. It offers a scalable, annotation-free foundation for multipurpose computer-aided diagnosis (CAD) tools for mammogram, helping reduce radiologists’ workload and improve diagnostic efficiency in breast cancer screening.
zh
[CV-90] GS-Verse: Mesh-based Gaussian Splatting for Physics-aware Interaction in Virtual Reality
【速读】:该论文旨在解决当前虚拟现实(VR)中物理交互式3D内容操作存在的两大核心问题:一是依赖工程密集型流程,难以高效实现复杂交互;二是普遍采用简化的几何表示(如四面体笼子),导致视觉保真度和物理准确性下降。其解决方案的关键在于提出一种名为 \our(Gaussian Splatting for Virtual Environment Rendering and Scene Editing)的新方法,通过将物体的网格(mesh)直接与高斯点阵(Gaussian Splatting, GS)表示融合,实现更精确的表面逼近,从而支持高度真实的形变与交互效果。该方法无需修改物理引擎即可兼容多种物理模拟系统,显著提升了交互的真实性、适应性和开发效率。
链接: https://arxiv.org/abs/2510.11878
作者: Anastasiya Pechko,Piotr Borycki,Joanna Waczyńska,Daniel Barczyk,Agata Szymańska,Sławomir Tadeja,Przemysław Spurek
机构: Jagiellonian University (雅盖隆大学); University of Cambridge (剑桥大学); IDEAS Research Institute (IDEAS 研究所)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:As the demand for immersive 3D content grows, the need for intuitive and efficient interaction methods becomes paramount. Current techniques for physically manipulating 3D content within Virtual Reality (VR) often face significant limitations, including reliance on engineering-intensive processes and simplified geometric representations, such as tetrahedral cages, which can compromise visual fidelity and physical accuracy. In this paper, we introduce GS-Verse (Gaussian Splatting for Virtual Environment Rendering and Scene Editing), a novel method designed to overcome these challenges by directly integrating an object’s mesh with a Gaussian Splatting (GS) representation. Our approach enables more precise surface approximation, leading to highly realistic deformations and interactions. By leveraging existing 3D mesh assets, GS-Verse facilitates seamless content reuse and simplifies the development workflow. Moreover, our system is designed to be physics-engine-agnostic, granting developers robust deployment flexibility. This versatile architecture delivers a highly realistic, adaptable, and intuitive approach to interactive 3D manipulation. We rigorously validate our method against the current state-of-the-art technique that couples VR with GS in a comparative user study involving 18 participants. Specifically, we demonstrate that our approach is statistically significantly better for physics-aware stretching manipulation and is also more consistent in other physics-based manipulations like twisting and shaking. Further evaluation across various interactions and scenes confirms that our method consistently delivers high and reliable performance, showing its potential as a plausible alternative to existing methods.
zh
[CV-91] Enhancing the Quality of 3D Lunar Maps Using JAXAs Kaguya Imagery
【速读】:该论文旨在解决由压缩噪声引起的立体匹配误差问题,从而提升基于Kaguya TC(Terrain Camera)图像生成的月球三维地图(3D lunar maps)质量。其关键解决方案在于识别并减少因JPEG压缩导致的视差图(disparity map)中的系统性噪声,尤其是在暗区中表现明显的噪声模式,通过优化视差图像的残余噪声抑制,显著降低高程噪声,进而提高地形数据的安全性和可靠性,为未来如NASA Endurance任务等长距离月面探测提供更精准的地形信息支持。
链接: https://arxiv.org/abs/2510.11817
作者: Yumi Iwashita,Haakon Moe,Yang Cheng,Adnan Ansar,Georgios Georgakis,Adrian Stoica,Kazuto Nakashima,Ryo Kurazume,Jim Torresen
机构: Jet Propulsion Laboratory, California Institute of Technology (喷气推进实验室,加州理工学院); University of Oslo (奥斯陆大学); LunaSol Space LLC; Kyushu University (九州大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Presented at IEEE SMC 2025
Abstract:As global efforts to explore the Moon intensify, the need for high-quality 3D lunar maps becomes increasingly critical-particularly for long-distance missions such as NASA’s Endurance mission concept, in which a rover aims to traverse 2,000 km across the South Pole-Aitken basin. Kaguya TC (Terrain Camera) images, though globally available at 10 m/pixel, suffer from altitude inaccuracies caused by stereo matching errors and JPEG-based compression artifacts. This paper presents a method to improve the quality of 3D maps generated from Kaguya TC images, focusing on mitigating the effects of compression-induced noise in disparity maps. We analyze the compression behavior of Kaguya TC imagery, and identify systematic disparity noise patterns, especially in darker regions. In this paper, we propose an approach to enhance 3D map quality by reducing residual noise in disparity images derived from compressed images. Our experimental results show that the proposed approach effectively reduces elevation noise, enhancing the safety and reliability of terrain data for future lunar missions.
zh
[CV-92] Audio-Guided Visual Perception for Audio-Visual Navigation
【速读】:该论文旨在解决音频-视觉具身导航(Audio-Visual Embodied Navigation, AVN)中跨声源泛化能力差的问题,即当前方法在遇到未见过的声音或环境时,导航成功率显著下降且探索路径过长。其根本原因在于缺乏对听觉信号与视觉区域之间的显式对齐机制,导致策略在训练过程中记忆了特定声学指纹与场景的虚假关联,从而在新声音下产生盲区探索。解决方案的关键在于提出AGVP框架,通过音频自注意力提取全局听觉上下文,并将其作为查询引导视觉特征注意力,实现跨模态的空间级对齐与区域重加权,从而减少对特定声学指纹的依赖,提升导航效率和跨场景泛化性能。
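AGVP 的核心计算可以概括为"音频自注意力得到全局上下文,再以其为 query 对视觉区域做注意力重加权"。PyTorch 最小示意(维度与池化方式均为假设):

```python
import torch
import torch.nn as nn

class AudioGuidedVisualAttention(nn.Module):
    """音频自注意力提取全局听觉上下文,再作为 query 对视觉区域重加权。"""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.audio_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual):
        a, _ = self.audio_self_attn(audio, audio, audio)
        ctx = a.mean(dim=1, keepdim=True)          # 全局听觉上下文 (B, 1, D)
        out, attn = self.cross_attn(ctx, visual, visual)
        return out, attn                           # attn 即声源相关区域的权重

audio = torch.randn(2, 20, 256)    # (B, 音频帧数, D),维度为假设值
visual = torch.randn(2, 49, 256)   # (B, 7x7 视觉区域, D)
out, attn = AudioGuidedVisualAttention()(audio, visual)
print(out.shape, attn.shape)       # (2, 1, 256) 与 (2, 1, 49)
```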
链接: https://arxiv.org/abs/2510.11760
作者: Yi Wang,Yinfeng Yu,Fuchun Sun,Liejun Wang,Wendong Zheng
机构: Xinjiang University (新疆大学); Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing (丝绸之路多语言认知计算联合国际实验室); Tsinghua University (清华大学); Tianjin University of Technology (天津理工大学)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Main paper (6 pages). Accepted for publication by International Conference on Virtual Reality and Visualization 2025 (ICVRV 2025)
Abstract:Audio-Visual Embodied Navigation aims to enable agents to autonomously navigate to sound sources in unknown 3D environments using auditory cues. While current AVN methods excel on in-distribution sound sources, they exhibit poor cross-source generalization: navigation success rates plummet and search paths become excessively long when agents encounter unheard sounds or unseen environments. This limitation stems from the lack of explicit alignment mechanisms between auditory signals and corresponding visual regions. Policies tend to memorize spurious “acoustic fingerprint-scenario” correlations during training, leading to blind exploration when exposed to novel sound sources. To address this, we propose the AGVP framework, which transforms sound from policy-memorable acoustic fingerprint cues into spatial guidance. The framework first extracts global auditory context via audio self-attention, then uses this context as queries to guide visual feature attention, highlighting sound-source-related regions at the feature level. Subsequent temporal modeling and policy optimization are then performed. This design, centered on interpretable cross-modal alignment and region reweighting, reduces dependency on specific acoustic fingerprints. Experimental results demonstrate that AGVP improves both navigation efficiency and robustness while achieving superior cross-scenario generalization on previously unheard sounds.
zh
[CV-93] SeeingSounds: Learning Audio-to-Visual Alignment via Text
【速读】:该论文旨在解决音频到图像生成(audio-to-image generation)中缺乏有效跨模态对齐机制的问题,尤其是在没有成对音频-视觉数据或视觉生成模型训练的情况下实现高质量、可控的图像生成。其关键解决方案是提出一种轻量且模块化的框架SeeingSounds,通过双对齐机制实现音频到视觉的映射:首先利用冻结的语言编码器将音频投影至语义语言空间,再借助视觉-语言模型将音频在上下文中锚定到视觉域;该方法受认知神经科学启发,模拟人类感知中的跨模态关联,并仅训练轻量级适配器(adapter),结合基于音频变换的程序化文本提示生成(如音量或音高变化对应“远处雷声”等描述性提示),从而实现细粒度且可解释的控制。
链接: https://arxiv.org/abs/2510.11738
作者: Simone Carnemolla,Matteo Pennisi,Chiara Russo,Simone Palazzo,Daniela Giordano,Concetto Spampinato
机构: University of Catania(卡塔尼亚大学)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: accepted to ACM Multimedia Asia 2025
Abstract:We introduce SeeingSounds, a lightweight and modular framework for audio-to-image generation that leverages the interplay between audio, language, and vision, without requiring any paired audio-visual data or training on visual generative models. Rather than treating audio as a substitute for text or relying solely on audio-to-text mappings, our method performs dual alignment: audio is projected into a semantic language space via a frozen language encoder, and contextually grounded into the visual domain using a vision-language model. This approach, inspired by cognitive neuroscience, reflects the natural cross-modal associations observed in human perception. The model operates on frozen diffusion backbones and trains only lightweight adapters, enabling efficient and scalable learning. Moreover, it supports fine-grained and interpretable control through procedural text prompt generation, where audio transformations (e.g., volume or pitch shifts) translate into descriptive prompts (e.g., “a distant thunder”) that guide visual outputs. Extensive experiments across standard benchmarks confirm that SeeingSounds outperforms existing methods in both zero-shot and supervised settings, establishing a new state of the art in controllable audio-to-visual generation.
zh
[CV-94] nsor Completion via Monotone Inclusion: Generalized Low-Rank Priors Meet Deep Denoisers
【速读】:该论文旨在解决多维数据(tensor)中缺失值填充问题,这是在真实世界应用中影响下游分析的关键挑战。现有方法虽结合了全局低秩先验与插件式去噪器(plug and play denoisers),但常依赖经验收敛性或不切实际的假设(如深度去噪器可视为隐式正则化项的近似算子)。论文提出了一种基于单调包含(monotone inclusion)框架的新方法,其核心在于将广义低秩先验与深度伪收缩去噪器(deep pseudo contractive denoisers)统一建模,并超越传统凸优化限制。关键创新是基于Davis-Yin分裂算法构建GTCTV DPC算法,并严格证明其全局收敛性,从而在低采样率下显著优于现有方法,在定量指标和视觉质量上均表现更优。
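GTCTV DPC 建立在 Davis-Yin 三算子分裂之上。对一般问题 min f(x)+g(x)+h(x)(h 光滑),其单步迭代的通用骨架如下(步长与近端算子由调用方提供;论文中一个近端步由深度伪收缩去噪器替代,此处仅为一般形式的示意):

```python
import numpy as np

def davis_yin(prox_f, prox_g, grad_h, z0: np.ndarray,
              gamma: float, iters: int = 100) -> np.ndarray:
    """Davis-Yin 三算子分裂:
        x_g = prox_{γg}(z)
        x_f = prox_{γf}(2 x_g - z - γ ∇h(x_g))
        z  <- z + x_f - x_g
    prox_f / prox_g / grad_h 为调用方提供的可调用对象。"""
    z = z0.copy()
    for _ in range(iters):
        x_g = prox_g(z)
        x_f = prox_f(2.0 * x_g - z - gamma * grad_h(x_g))
        z = z + x_f - x_g
    return x_g
```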
链接: https://arxiv.org/abs/2510.12425
作者: Peng Chen,Deliang Wei,Jiale Yao,Fang Li
机构: East China Normal University (华东师范大学); Johns Hopkins University (约翰霍普金斯大学)
类目: Optimization and Control (math.OC); Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 5 figures
Abstract:Missing entries in multi-dimensional data pose significant challenges for downstream analysis across diverse real-world applications. These data are naturally modeled as tensors, and recent completion methods integrating global low-rank priors with plug-and-play denoisers have demonstrated strong empirical performance. However, these approaches often rely on empirical convergence alone or unrealistic assumptions, such as deep denoisers acting as proximal operators of implicit regularizers, which generally does not hold. To address these limitations, we propose a novel tensor completion framework grounded in the monotone inclusion paradigm, which unifies generalized low-rank priors with deep pseudo-contractive denoisers and extends beyond traditional convex optimization. Building on the Davis-Yin splitting scheme, we develop the GTCTV DPC algorithm and rigorously establish its global convergence. Extensive experiments demonstrate that GTCTV DPC consistently outperforms existing methods in both quantitative metrics and visual quality, particularly at low sampling rates.
zh
[CV-95] MAPS: Masked Attribution-based Probing of Strategies- A computational framework to align human and model explanations
【速读】:该论文试图解决的问题是:如何在不依赖直接测量的情况下,量化和比较人类核心物体识别过程中所采用的视觉信息选择策略,并验证人工神经网络(ANN)的可解释性方法是否能够准确描述这些生物视觉策略。解决方案的关键在于提出一种名为MAPS(Masked Attribution-based Probing of Strategies)的计算工具,其核心机制是将ANN的归因图(attribution maps)转换为解释掩码图像(Explanation-masked images, EMIs),并通过对比人类在有限像素预算下对EMIs与完整刺激的识别准确率,来评估不同解释方法与生物视觉行为的一致性。该方法不仅在模拟中能可靠恢复模型间的真实策略相似性,还能在真实人类和猕猴实验中实现高行为效度,同时显著减少所需的实验次数,从而提供了一种高效、可扩展且统一的标准,用于连接人类行为、神经活动与模型决策。
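EMI(解释掩码图像)的构造即"按归因图保留预算内归因值最高的像素、其余遮蔽"。一个 NumPy 示意(像素预算与填充值均为假设,且假定输入为 [0,1] 浮点图):

```python
import numpy as np

def explanation_masked_image(img: np.ndarray, attribution: np.ndarray,
                             pixel_budget: float = 0.1, fill: float = 0.5) -> np.ndarray:
    """保留归因值最高的 pixel_budget 比例像素,其余填充为 fill。
    img: (H, W, C) 的 [0,1] 浮点图;attribution: (H, W)。"""
    k = max(1, int(attribution.size * pixel_budget))
    thresh = np.partition(attribution.ravel(), -k)[-k]
    mask = (attribution >= thresh)[..., None]   # (H, W, 1),广播到各通道
    return np.where(mask, img, fill)

img = np.random.rand(32, 32, 3)
attr = np.random.rand(32, 32)
print(explanation_masked_image(img, attr).shape)  # (32, 32, 3)
```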
链接: https://arxiv.org/abs/2510.12141
作者: Sabine Muzellec,Yousif Kashef Alghetaa,Simon Kornblith,Kohitij Kar
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Human core object recognition depends on the selective use of visual information, but the strategies guiding these choices are difficult to measure directly. We present MAPS (Masked Attribution-based Probing of Strategies), a behaviorally validated computational tool that tests whether explanations derived from artificial neural networks (ANNs) can also explain human vision. MAPS converts attribution maps into explanation-masked images (EMIs) and compares image-by-image human accuracies on these minimal images, which retain only a limited pixel budget, with accuracies on the full stimuli. MAPS provides a principled way to evaluate and choose among competing ANN interpretability methods. In silico, EMI-based behavioral similarity between models reliably recovers the ground-truth similarity computed from their attribution maps, establishing which explanation methods best capture the model’s strategy. When applied to humans and macaques, MAPS identifies ANN-explanation combinations whose explanations align most closely with biological vision, achieving the behavioral validity of Bubble masks while requiring far fewer behavioral trials. Because it needs only access to model attributions and a modest set of behavioral data on the original images, MAPS avoids exhaustive psychophysics while offering a scalable tool for adjudicating explanations and linking human behavior, neural activity, and model decisions under a common standard.
zh
人工智能
[AI-0] Ax-Prover: A Deep Reasoning Agent ic Framework for Theorem Proving in Mathematics and Quantum Physics
【速读】:该论文旨在解决自动化定理证明(Automated Theorem Proving)在跨学科科学领域中通用性不足的问题,即现有专用证明系统难以泛化到不同领域且缺乏与人类专家协同工作的能力。解决方案的关键在于提出 Ax-Prover,一个基于多智能体的系统,通过将大型语言模型(LLMs)与 Lean 证明工具结合,并借助模型上下文协议(Model Context Protocol, MCP)确保形式正确性,从而实现创造性推理与语法严格性的统一。该方法不仅在公开数学基准上达到前沿水平,在作者引入的抽象代数和量子理论新基准上显著优于现有方法,还展示了作为人类专家助手的实际应用价值,如协助形式化复杂密码学定理的证明。
链接: https://arxiv.org/abs/2510.12787
作者: Marco Del Tredici,Jacob McCarran,Benjamin Breen,Javier Aspuru Mijares,Weichen Winston Yin,Jacob M. Taylor,Frank Koppens,Dirk Englund
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:We present Ax-Prover, a multi-agent system for automated theorem proving in Lean that can solve problems across diverse scientific domains and operate either autonomously or collaboratively with human experts. To achieve this, Ax-Prover approaches scientific problem solving through formal proof generation, a process that demands both creative reasoning and strict syntactic rigor. Ax-Prover meets this challenge by equipping Large Language Models (LLMs), which provide knowledge and reasoning, with Lean tools via the Model Context Protocol (MCP), which ensure formal correctness. To evaluate its performance as an autonomous prover, we benchmark our approach against frontier LLMs and specialized prover models on two public math benchmarks and on two Lean benchmarks we introduce in the fields of abstract algebra and quantum theory. On public datasets, Ax-Prover is competitive with state-of-the-art provers, while it largely outperforms them on the new benchmarks. This shows that, unlike specialized systems that struggle to generalize, our tool-based agentic theorem prover approach offers a generalizable methodology for formal verification across diverse scientific domains. Furthermore, we demonstrate Ax-Prover’s assistant capabilities in a practical use case, showing how it enabled an expert mathematician to formalize the proof of a complex cryptography theorem.
zh
[AI-1] CTRL-Rec: Controlling Recommender Systems With Natural Language
【速读】:该论文旨在解决传统推荐系统在用户对推荐结果不满意时缺乏细粒度控制手段的问题。现有系统通常仅提供粗粒度反馈(如评分或点击),无法支持用户通过自然语言表达具体偏好调整需求(例如“我希望看到尊重他人观点的内容”)。解决方案的关键在于提出CTRL-Rec方法,其核心是利用大语言模型(Large Language Models, LLMs)模拟用户基于自然语言请求对物品的接受程度,并训练嵌入模型以近似这些模拟判断;随后将此类基于用户请求的预测结果整合进传统推荐系统的信号加权机制中。部署阶段仅需对每条请求计算一次LLM嵌入,即可实现低延迟、实时的推荐控制,从而显著提升用户对推荐过程的掌控感与满意度。
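部署时的打分融合可以示意为:对请求做一次嵌入,再把"请求-物品"相似度以固定权重并入传统得分(权重与相似度形式均为假设):

```python
import numpy as np

def ctrl_rec_score(base_scores: np.ndarray, item_embs: np.ndarray,
                   request_emb: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """base_scores: 传统推荐信号的加权得分 (N,);item_embs: (N, D);
    request_emb: 用户自然语言请求的一次性嵌入 (D,)。alpha 为假设权重。"""
    sims = item_embs @ request_emb / (
        np.linalg.norm(item_embs, axis=1) * np.linalg.norm(request_emb) + 1e-8)
    return base_scores + alpha * sims  # 融合后重新排序即可实时响应请求
```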
链接: https://arxiv.org/abs/2510.12742
作者: Micah Carroll,Adeline Foote,Kevin Feng,Marcus Williams,Anca Dragan,W. Bradley Knox,Smitha Milli
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:When users are dissatisfied with recommendations from a recommender system, they often lack fine-grained controls for changing them. Large language models (LLMs) offer a solution by allowing users to guide their recommendations through natural language requests (e.g., “I want to see respectful posts with a different perspective than mine”). We propose a method, CTRL-Rec, that allows for natural language control of traditional recommender systems in real-time with computational efficiency. Specifically, at training time, we use an LLM to simulate whether users would approve of items based on their language requests, and we train embedding models that approximate such simulated judgments. We then integrate these user-request-based predictions into the standard weighting of signals that traditional recommender systems optimize. At deployment time, we require only a single LLM embedding computation per user request, allowing for real-time control of recommendations. In experiments with the MovieLens dataset, our method consistently allows for fine-grained control across a diversity of requests. In a study with 19 Letterboxd users, we find that CTRL-Rec was positively received by users and significantly enhanced users’ sense of control and satisfaction with recommendations compared to traditional controls.
zh
[AI-2] HYPE: Hybrid Planning with Ego Proposal-Conditioned Predictions
【速读】:该论文旨在解决复杂城市环境中安全且可解释的运动规划问题,核心挑战在于如何有效建模多智能体之间的双向交互以准确估计车辆行为的成本函数。传统方法依赖采样生成初始轨迹并基于学习到的未来环境状态预测进行优化,但成本函数设计困难,尤其在多样化复杂场景下难以泛化。解决方案的关键在于提出HYPE(HYbrid Planning with Ego proposal-conditioned predictions),其通过将学习得到的多模态轨迹提议作为启发式先验引入蒙特卡洛树搜索(MCTS)进行精细化优化,并引入一种以自车状态条件化的占用预测模型(ego-conditioned occupancy prediction model),实现场景感知的一致性交互建模。该设计显著简化了优化阶段的成本函数设计,仅需最小化的网格化成本项即可实现高性能,从而在nuPlan和DeepUrban大规模真实数据集上取得最先进的安全性与适应性表现。
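"把学习到的轨迹提议作为 MCTS 启发式先验"通常落在 PUCT 式选择项的先验概率 P(s,a) 上。下面是该通用机制的示意(论文的具体融合方式可能不同,参数为假设值):

```python
import math

def puct_select(children: list, c_puct: float = 1.5) -> int:
    """children: 每个候选动作一个 dict,prior 为提议模型给出的先验概率,
    value_sum / visits 为累计价值与访问次数。返回 PUCT 得分最高的动作下标。"""
    total = sum(ch["visits"] for ch in children) + 1
    def score(ch):
        q = ch["value_sum"] / ch["visits"] if ch["visits"] else 0.0
        u = c_puct * ch["prior"] * math.sqrt(total) / (1 + ch["visits"])
        return q + u
    return max(range(len(children)), key=lambda i: score(children[i]))

children = [{"prior": 0.6, "value_sum": 1.2, "visits": 3},
            {"prior": 0.3, "value_sum": 0.4, "visits": 1},
            {"prior": 0.1, "value_sum": 0.0, "visits": 0}]
print(puct_select(children))
```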
链接: https://arxiv.org/abs/2510.12733
作者: Hang Yu,Julian Jordan,Julian Schmidt,Silvan Lindner,Alessandro Canevaro,Wilhelm Stork
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Safe and interpretable motion planning in complex urban environments needs to reason about bidirectional multi-agent interactions. This reasoning requires to estimate the costs of potential ego driving maneuvers. Many existing planners generate initial trajectories with sampling-based methods and refine them by optimizing on learned predictions of future environment states, which requires a cost function that encodes the desired vehicle behavior. Designing such a cost function can be very challenging, especially if a wide range of complex urban scenarios has to be considered. We propose HYPE: HYbrid Planning with Ego proposal-conditioned predictions, a planner that integrates multimodal trajectory proposals from a learned proposal model as heuristic priors into a Monte Carlo Tree Search (MCTS) refinement. To model bidirectional interactions, we introduce an ego-conditioned occupancy prediction model, enabling consistent, scene-aware reasoning. Our design significantly simplifies cost function design in refinement by considering proposal-driven guidance, requiring only minimalistic grid-based cost terms. Evaluations on large-scale real-world benchmarks nuPlan and DeepUrban show that HYPE effectively achieves state-of-the-art performance, especially in safety and adaptability.
zh
[AI-3] Clutch Control: An Attention-based Combinatorial Bandit for Efficient Mutation in JavaScript Engine Fuzzing
【速读】:该论文旨在解决JavaScript模糊测试(fuzzing)中 Mutation 目标选择效率低下的问题,现有方法依赖随机选择变异位置,导致测试用例的有效性和代码覆盖率提升缓慢。其解决方案的关键在于提出 CLUTCH——一种新型深度组合多臂赌博机(deep combinatorial bandit),能够通过注意力机制(attention mechanism)处理变长 JavaScript 测试用例表示,并结合 Concrete Dropout 实现动态探索策略,从而在波动性高且组合复杂的场景下显著优化变异目标选择,提升模糊测试的效率与效果。
链接: https://arxiv.org/abs/2510.12732
作者: Myles Foley,Sergio Maffeis,Muhammad Fakhrur Rozi,Takeshi Takahashi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:JavaScript engines are widely used in web browsers, PDF readers, and server-side applications. The rise in concern over their security has led to the development of several targeted fuzzing techniques. However, existing approaches use random selection to determine where to perform mutations in JavaScript code. We postulate that the problem of selecting better mutation targets is suitable for combinatorial bandits with a volatile number of arms. Thus, we propose CLUTCH, a novel deep combinatorial bandit that can observe variable length JavaScript test case representations, using an attention mechanism from deep learning. Furthermore, using Concrete Dropout, CLUTCH can dynamically adapt its exploration. We show that CLUTCH increases efficiency in JavaScript fuzzing compared to three state-of-the-art solutions by increasing the number of valid test cases and coverage-per-testcase by, respectively, 20.3% and 8.9% on average. In volatile and combinatorial settings we show that CLUTCH outperforms state-of-the-art bandits, achieving at least 78.1% and 4.1% less regret in volatile and combinatorial settings, respectively.
zh
[AI-4] Hierarchical Federated Learning for Crop Yield Prediction in Smart Agricultural Production Systems
【速读】:该论文旨在解决智能农业系统中作物产量预测面临的异构农场环境与隐私敏感数据共存的挑战,传统集中式机器学习方法难以在保护数据隐私的同时实现跨区域、多作物的高效建模。其解决方案的关键在于提出一种分层联邦学习(Hierarchical Federated Learning)架构:通过季节性订阅机制将农场按作物类型动态分组,在客户端层(个体智能农场)、中间层(作物特定聚合器)和顶层(全局模型聚合器)形成三级结构;各作物簇内协作训练专用模型并局部聚合,再由全局聚合器融合多作物知识,从而兼顾局部作物特异性建模与跨作物泛化能力,同时显著降低通信开销并保障数据隐私。
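三层聚合可视为两级 FedAvg:先在作物簇内按样本量加权平均,再在全局层聚合各作物模型。最小示意(参数向量与样本量均为虚构):

```python
import numpy as np

def fedavg(weights: list, sizes: list) -> np.ndarray:
    """按样本量加权平均模型参数(FedAvg)。"""
    total = sum(sizes)
    return sum(w * (n / total) for w, n in zip(weights, sizes))

rng = np.random.default_rng(0)
farm_a, farm_b, farm_c, farm_d = (rng.normal(size=10) for _ in range(4))  # 假设的参数向量

# 第一级:作物簇内聚合(农场 -> 作物专用模型)
wheat_model = fedavg([farm_a, farm_b], [1200, 800])
maize_model = fedavg([farm_c, farm_d], [500, 1500])
# 第二级:全局聚合(作物模型 -> 跨作物全局模型)
global_model = fedavg([wheat_model, maize_model], [2000, 2000])
print(global_model.shape)  # (10,)
```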
链接: https://arxiv.org/abs/2510.12727
作者: Anas Abouaomar,Mohammed El hanjri,Abdellatif Kobbane,Anis Laouiti,Khalid Nafil
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 6 pages, 3 figures, conference
Abstract:In this paper, we present a novel hierarchical federated learning architecture specifically designed for smart agricultural production systems and crop yield prediction. Our approach introduces a seasonal subscription mechanism where farms join crop-specific clusters at the beginning of each agricultural season. The proposed three-layer architecture consists of individual smart farms at the client level, crop-specific aggregators at the middle layer, and a global model aggregator at the top level. Within each crop cluster, clients collaboratively train specialized models tailored to specific crop types, which are then aggregated to produce a higher-level global model that integrates knowledge across multiple crops. This hierarchical design enables both local specialization for individual crop types and global generalization across diverse agricultural contexts while preserving data privacy and reducing communication overhead. Experiments demonstrate the effectiveness of the proposed system, showing that local and crop-layer models closely follow actual yield patterns with consistent alignment, significantly outperforming standard machine learning models. The results validate the advantages of hierarchical federated learning in the agricultural context, particularly for scenarios involving heterogeneous farming environments and privacy-sensitive agricultural data.
zh
[AI-5] owards Robust Artificial Intelligence: Self-Supervised Learning Approach for Out-of-Distribution Detection
【速读】:该论文旨在解决人工智能(AI)系统在面对分布外(out-of-distribution, OOD)样本时鲁棒性不足的问题,尤其关注无需标签数据即可实现有效OOD检测的挑战。其解决方案的关键在于结合自监督学习(self-supervised learning)与图论技术:通过自监督学习从无标签数据中提取有用的表征,再利用图结构对样本进行高效识别与分类,从而显著提升模型在未见数据上的检测性能。实验表明,该方法在AUROC指标上达到0.99,优于现有最先进方法。
链接: https://arxiv.org/abs/2510.12713
作者: Wissam Salhab,Darine Ameyed,Hamid Mcheick,Fehmi Jaafar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Robustness in AI systems refers to their ability to maintain reliable and accurate performance under various conditions, including out-of-distribution (OOD) samples, adversarial attacks, and environmental changes. This is crucial in safety-critical systems, such as autonomous vehicles, transportation, or healthcare, where malfunctions could have severe consequences. This paper proposes an approach to improve OOD detection without the need of labeled data, thereby increasing the AI systems’ robustness. The proposed approach leverages the principles of self-supervised learning, allowing the model to learn useful representations from unlabeled data. Combined with graph-theoretical techniques, this enables the more efficient identification and categorization of OOD samples. Compared to existing state-of-the-art methods, this approach achieved an Area Under the Receiver Operating Characteristic Curve (AUROC) = 0.99.
zh
[AI-6] CAMNet: Leverag ing Cooperative Awareness Messages for Vehicle Trajectory Prediction
【速读】:该论文旨在解决自动驾驶中因传感器视野受限导致的情境感知能力下降问题,尤其是在车辆间存在遮挡时如何提升轨迹预测的准确性。其解决方案的关键在于利用车辆间通信获取的协作感知消息(Cooperative Awareness Messages, CAMs),设计并训练一种基于图神经网络的模型——CAMNet,通过融合CAM数据来增强对周围车辆运动状态的理解与预测能力。实验表明,CAM数据能够有效支持轨迹预测任务,但同时也揭示了当前方法在实际应用中的若干局限性,为后续研究提供了方向。
链接: https://arxiv.org/abs/2510.12703
作者: Mattia Grasselli,Angelo Porrello,Carlo Augusto Grazia
机构: 未知
类目: Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: Accepted at the IEEE Consumer Communications Networking Conference (CCNC) 2026 - Las Vegas, NV, USA 9 - 12 January 2026
Abstract:Autonomous driving remains a challenging task, particularly due to safety concerns. Modern vehicles are typically equipped with expensive sensors such as LiDAR, cameras, and radars to reduce the risk of accidents. However, these sensors face inherent limitations: their field of view and line of sight can be obstructed by other vehicles, thereby reducing situational awareness. In this context, vehicle-to-vehicle communication plays a crucial role, as it enables cars to share information and remain aware of each other even when sensors are occluded. One way to achieve this is through the use of Cooperative Awareness Messages (CAMs). In this paper, we investigate the use of CAM data for vehicle trajectory prediction. Specifically, we design and train a neural network, Cooperative Awareness Message-based Graph Neural Network (CAMNet), on a widely used motion forecasting dataset. We then evaluate the model on a second dataset that we created from scratch using Cooperative Awareness Messages, in order to assess whether this type of data can be effectively exploited. Our approach demonstrates promising results, showing that CAMs can indeed support vehicle trajectory prediction. At the same time, we discuss several limitations of the approach, which highlight opportunities for future research.
zh
[AI-7] Beyond Postconditions: Can Large Language Models infer Formal Contracts for Automatic Software Verification?
【速读】:该论文旨在解决自动软件验证中因缺乏形式化规格说明(formal specification)而导致的实践应用受限问题。现有方法依赖人工编写规格,难以在真实代码中广泛部署。为此,作者提出NL2Contract任务——利用大语言模型(LLM)将自然语言提示(如函数名、注释)转化为包含前置条件和后置条件的正式功能契约(functional contract),从而为验证器提供更完整的规格信息。其关键创新在于引入了系统性的评估指标(包括正确性、缺陷区分能力和验证器可用性),并证明:LLM生成的完整契约不仅能保持对所有输入的逻辑正确性,还能有效区分错误行为与正常行为,且相比仅使用后置条件的方法显著减少验证过程中的误报(false alarms),从而提升自动验证工具在实际场景中的可靠性与实用性。
链接: https://arxiv.org/abs/2510.12702
作者: Cedric Richter,Heike Wehrheim
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: under submission
Abstract:Automatic software verifiers have become increasingly effective at the task of checking software against (formal) specifications. Yet, their adoption in practice has been hampered by the lack of such specifications in real world code. Large Language Models (LLMs) have shown promise in inferring formal postconditions from natural language hints embedded in code such as function names, comments or documentation. Using the generated postconditions as specifications in a subsequent verification, however, often leads verifiers to suggest invalid inputs, hinting at potential issues that ultimately turn out to be false alarms. To address this, we revisit the problem of specification inference from natural language in the context of automatic software verification. In the process, we introduce NL2Contract, the task of employing LLMs to translate informal natural language into formal functional contracts, consisting of postconditions as well as preconditions. We introduce metrics to validate and compare different NL2Contract approaches, using soundness, bug discriminative power of the generated contracts and their usability in the context of automatic software verification as key metrics. We evaluate NL2Contract with different LLMs and compare it to the task of postcondition generation nl2postcond. Our evaluation shows that (1) LLMs are generally effective at generating functional contracts sound for all possible inputs, (2) the generated contracts are sufficiently expressive for discriminating buggy from correct behavior, and (3) verifiers supplied with LLM inferred functional contracts produce fewer false alarms than when provided with postconditions alone. Further investigations show that LLM inferred preconditions generally align well with developers’ intentions, which allows us to use automatic software verifiers to catch real-world bugs.
zh
[AI-8] Topological Signatures of ReLU Neural Network Activation Patterns
【速读】:该论文旨在解决深度神经网络中激活模式的拓扑结构与其决策边界及训练动态之间关系的问题。其核心解决方案在于通过分析ReLU神经网络在特征空间中诱导的多面体(polytope)分解,利用对偶图(dual graph)的Fiedler划分来刻画二分类任务中的决策边界特性,并通过计算细胞分解的同调群(homology)揭示回归任务中训练损失与多面体单元数量之间的行为一致性,从而从拓扑角度理解模型的内部工作机制。
链接: https://arxiv.org/abs/2510.12700
作者: Vicente Bosca,Tatum Rask,Sunia Tanweer,Andrew R. Tawfeek,Branden Stone
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG); Algebraic Topology (math.AT); Machine Learning (stat.ML)
备注:
Abstract:This paper explores the topological signatures of ReLU neural network activation patterns. We consider feedforward neural networks with ReLU activation functions and analyze the polytope decomposition of the feature space induced by the network. Mainly, we investigate the Fiedler partition of the dual graph and show that it appears to correlate with the decision boundary in the case of binary classification. Additionally, in a regression task, we compute the homology of the cellular decomposition to reveal similar patterns in behavior between the training loss and the polyhedral cell count as the model is trained.
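下面给出“对偶图 Fiedler 二分”这一步骤的通用示意代码(假设性示例:用随机图代替论文中由多面体分解构造的对偶图,依赖 networkx 与 numpy):

```python
import numpy as np
import networkx as nx

def fiedler_partition(G: nx.Graph):
    """按拉普拉斯矩阵第二小特征值对应的特征向量(Fiedler 向量)符号二分节点。"""
    L = nx.laplacian_matrix(G).toarray().astype(float)
    _, eigvecs = np.linalg.eigh(L)        # 对称矩阵特征分解,特征值升序排列
    fiedler_vec = eigvecs[:, 1]           # 第二小特征值对应的特征向量
    nodes = list(G.nodes())
    part_a = [n for n, v in zip(nodes, fiedler_vec) if v >= 0]
    part_b = [n for n, v in zip(nodes, fiedler_vec) if v < 0]
    return part_a, part_b

G = nx.erdos_renyi_graph(20, 0.2, seed=0)  # 占位:实际应为多面体分解的对偶图
print([len(p) for p in fiedler_partition(G)])
```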
zh
[AI-9] Multi-Agent Debate for LLM Judges with Adaptive Stability Detection
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)作为评判者(LLMs-as-Judges)在自动化评估任务中因依赖简单聚合方法(如多数投票)而导致的准确性不足问题,尤其是在个体代理已给出正确答案时仍可能失败的情形。其解决方案的关键在于提出一种多智能体辩论裁判框架(multi-agent debate judge framework),通过智能体协同推理与迭代优化来提升判断准确性;同时引入基于时间变化的Beta-Binomial混合模型和稳定性检测机制,利用柯尔莫哥洛夫-斯米尔诺夫检验(Kolmogorov-Smirnov test)实现自适应停止策略,从而在保证计算效率的同时显著增强判别性能。
链接: https://arxiv.org/abs/2510.12697
作者: Tianyu Hu,Zhen Tan,Song Wang,Huaizhi Qu,Tianlong Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:With advancements in reasoning capabilities, Large Language Models (LLMs) are increasingly employed for automated judgment tasks. While LLMs-as-Judges offer promise in automating evaluations, current approaches often rely on simplistic aggregation methods (e.g., majority voting), which can fail even when individual agents provide correct answers. To address this, we propose a multi-agent debate judge framework where agents collaboratively reason and iteratively refine their responses. We formalize the debate process mathematically, analyzing agent interactions and proving that debate amplifies correctness compared to static ensembles. To enhance efficiency, we introduce a stability detection mechanism that models the judges' collective correct-rate dynamics via a time-varying Beta-Binomial mixture and employs an adaptive stopping criterion based on distributional similarity (Kolmogorov-Smirnov statistic). Experiments across multiple benchmarks and models demonstrate that our framework improves judgment accuracy over majority voting while maintaining computational efficiency.
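其中“基于 KS 统计量的自适应停止”可以用如下最小示意表达(假设性简化:直接对相邻两轮的评分样本做双样本 KS 检验,未实现论文中的 Beta-Binomial 混合建模):

```python
import numpy as np
from scipy.stats import ks_2samp

def should_stop(prev_scores, curr_scores, threshold=0.1):
    """相邻两轮辩论的评分分布足够相似(KS 统计量低于阈值)时提前停止。"""
    res = ks_2samp(prev_scores, curr_scores)
    return res.statistic < threshold

rng = np.random.default_rng(0)
prev = rng.binomial(10, 0.70, size=32) / 10  # 模拟上一轮各评审的正确率样本
curr = rng.binomial(10, 0.72, size=32) / 10  # 模拟当前轮
print(should_stop(prev, curr))
```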
zh
[AI-10] ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning
【速读】:该论文旨在解决当前具身智能(Embodied Intelligence)系统中大型视觉语言模型(Vision Language Models, VLMs)部署成本高、而小型VLMs又缺乏足够知识与技能的问题。解决方案的关键在于提出一个两阶段框架——具身推理代理(Embodied Reasoning Agent, ERA),第一阶段通过三种先验知识学习机制(轨迹增强先验、环境锚定先验和外部知识先验)将强模型的知识蒸馏至小模型,构建基础能力;第二阶段引入在线强化学习(Online Reinforcement Learning, RL)管道,结合自总结(self-summarization)、密集奖励塑造(dense reward shaping)和逐轮策略优化(turn-level policy optimization)三项设计,有效应对长时序、稀疏奖励和训练不稳定等挑战,从而显著提升小模型在复杂任务中的性能与泛化能力。
链接: https://arxiv.org/abs/2510.12693
作者: Hanyang Chen,Mark Zhao,Rui Yang,Qinwei Ma,Ke Yang,Jiarui Yao,Kangrui Wang,Hao Bai,Zhenhailong Wang,Rui Pan,Mengchao Zhang,Jose Barreiros,Aykut Onol,ChengXiang Zhai,Heng Ji,Manling Li,Huan Zhang,Tong Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in embodied AI highlight the potential of vision language models (VLMs) as agents capable of perception, reasoning, and interaction in complex environments. However, top-performing systems rely on large-scale models that are costly to deploy, while smaller VLMs lack the necessary knowledge and skills to succeed. To bridge this gap, we present \textitEmbodied Reasoning Agent (ERA), a two-stage framework that integrates prior knowledge learning and online reinforcement learning (RL). The first stage, \textitEmbodied Prior Learning, distills foundational knowledge from three types of data: (1) Trajectory-Augmented Priors, which enrich existing trajectory data with structured reasoning generated by stronger models; (2) Environment-Anchored Priors, which provide in-environment knowledge and grounding supervision; and (3) External Knowledge Priors, which transfer general knowledge from out-of-environment datasets. In the second stage, we develop an online RL pipeline that builds on these priors to further enhance agent performance. To overcome the inherent challenges in agent RL, including long horizons, sparse rewards, and training instability, we introduce three key designs: self-summarization for context management, dense reward shaping, and turn-level policy optimization. Extensive experiments on both high-level planning (EB-ALFRED) and low-level control (EB-Manipulation) tasks demonstrate that ERA-3B surpasses both prompting-based large models and previous training-based baselines. Specifically, it achieves overall improvements of 8.4% on EB-ALFRED and 19.4% on EB-Manipulation over GPT-4o, and exhibits strong generalization to unseen tasks. Overall, ERA offers a practical path toward scalable embodied intelligence, providing methodological insights for future embodied AI systems.
zh
[AI-11] From Delegates to Trustees: How Optimizing for Long-Term Interests Shapes Bias and Alignment in LLM
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在模拟人类利益代表时存在的设计取舍问题,即AI系统应作为“委托人”(delegate)忠实再现个体明确表达的偏好,还是作为“受托人”(trustee)基于长期利益判断代为决策。其解决方案的关键在于引入一个基于时间效用的框架,通过权衡短期与长期利益来模拟受托人角色,并将其与仅复制用户表达偏好的行为克隆模型(delegate)进行对比实验。结果表明,受托人式预测在共识明确的问题上更贴近专家立场,但在缺乏明确共识的主题上则表现出更强的模型默认立场偏差,揭示了在AI代理设计中需平衡用户自主性与政策合理性之间的根本权衡。
链接: https://arxiv.org/abs/2510.12689
作者: Suyash Fulay,Jocelyn Zhu,Michiel Bakker
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have shown promising accuracy in predicting survey responses and policy preferences, which has increased interest in their potential to represent human interests in various domains. Most existing research has focused on behavioral cloning, effectively evaluating how well models reproduce individuals’ expressed preferences. Drawing on theories of political representation, we highlight an underexplored design trade-off: whether AI systems should act as delegates, mirroring expressed preferences, or as trustees, exercising judgment about what best serves an individual’s interests. This trade-off is closely related to issues of LLM sycophancy, where models can encourage behavior or validate beliefs that may be aligned with a user’s short-term preferences, but is detrimental to their long-term interests. Through a series of experiments simulating votes on various policy issues in the U.S. context, we apply a temporal utility framework that weighs short and long-term interests (simulating a trustee role) and compare voting outcomes to behavior-cloning models (simulating a delegate). We find that trustee-style predictions weighted toward long-term interests produce policy decisions that align more closely with expert consensus on well-understood issues, but also show greater bias toward models’ default stances on topics lacking clear agreement. These findings reveal a fundamental trade-off in designing AI systems to represent human interests. Delegate models better preserve user autonomy but may diverge from well-supported policy positions, while trustee models can promote welfare on well-understood issues yet risk paternalism and bias on subjective topics.
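论文所述“时间效用框架”的核心权衡可以压缩成一个很小的示意(假设性公式:线性加权短期与长期利益,参数名 w_long 为本文虚构):

```python
def vote(short_term, long_term, w_long):
    """w_long=0 近似 delegate(只看表达偏好);w_long 趋近 1 则近似 trustee。"""
    utility = (1 - w_long) * short_term + w_long * long_term
    return "yes" if utility > 0 else "no"

# 同一组利益评估,两种代表模式给出相反的投票
print(vote(short_term=0.4, long_term=-0.6, w_long=0.0))  # delegate -> yes
print(vote(short_term=0.4, long_term=-0.6, w_long=0.9))  # trustee  -> no
```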
zh
[AI-12] SG-XDEAT: Sparsity-Guided Cross-Dimensional and Cross-Encoding Attention with Target-Aware Conditioning in Tabular Learning
【速读】:该论文旨在解决传统深度学习模型在处理表格数据(tabular data)时,难以有效建模特征间复杂依赖关系且易受噪声干扰的问题。其解决方案的关键在于提出SG-XDEAT框架,通过双流编码器将每个输入特征分解为原始值流和目标条件流(target-conditioned stream),并引入三种核心机制:跨维度自注意力(Cross-Dimensional self-attention)捕捉单一流内部特征依赖、跨编码自注意力(Cross-Encoding self-attention)实现原始与目标感知表示之间的双向交互,以及自适应稀疏自注意力(Adaptive Sparse Self-Attention, ASSA)动态抑制低效token,从而提升模型对噪声的鲁棒性。此架构在多个公开基准上均显著优于强基线,验证了联合建模原始与目标感知视图并自适应过滤噪声的有效性。
链接: https://arxiv.org/abs/2510.12659
作者: Chih-Chuan Cheng,Yi-Ju Tseng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We propose SG-XDEAT (Sparsity-Guided Cross Dimensional and Cross-Encoding Attention with Target Aware Conditioning), a novel framework designed for supervised learning on tabular data. At its core, SG-XDEAT employs a dual-stream encoder that decomposes each input feature into two parallel representations: a raw value stream and a target-conditioned (label-aware) stream. These dual representations are then propagated through a hierarchical stack of attention-based modules. SG-XDEAT integrates three key components: (i) Cross-Dimensional self-attention, which captures intra-view dependencies among features within each stream; (ii) Cross-Encoding self-attention, which enables bidirectional interaction between raw and target-aware representations; and (iii) an Adaptive Sparse Self-Attention (ASSA) mechanism, which dynamically suppresses low-utility tokens by driving their attention weights toward zero, thereby mitigating the impact of noise. Empirical results on multiple public benchmarks show consistent gains over strong baselines, confirming that jointly modeling raw and target-aware views, while adaptively filtering noise, yields a more robust deep tabular learner.
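ASSA 这类“把低效 token 的注意力权重压向零”的机制,常见做法是用 softmax 稠密分支与 ReLU² 稀疏分支做可学习的加权融合;下面是一个假设性的简化 PyTorch 示意(非论文官方实现):

```python
import torch
import torch.nn.functional as F

def assa(q, k, v, w_sparse, w_dense):
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    dense = F.softmax(scores, dim=-1)                    # 稠密分支
    sparse = F.relu(scores) ** 2                         # 稀疏分支:负分数直接置零
    sparse = sparse / (sparse.sum(-1, keepdim=True) + 1e-6)
    a = torch.softmax(torch.stack([w_sparse, w_dense]), dim=0)  # 自适应分支权重
    return (a[0] * sparse + a[1] * dense) @ v

q = k = v = torch.randn(2, 8, 16)
print(assa(q, k, v, torch.tensor(1.0), torch.tensor(1.0)).shape)  # torch.Size([2, 8, 16])
```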
zh
[AI-13] Aixel: A Unified Adaptive and Extensible System for AI-powered Data Analysis
【速读】:该论文旨在解决现代数据分析中因数据管理与学习过程分离而导致的系统碎片化问题,具体表现为用户交互复杂、适应性差、性能欠优以及组件扩展性不足。其核心解决方案是提出一个统一、自适应且可扩展的AI驱动数据分析系统Aixel,通过四层架构(应用层、任务层、模型层和数据层)实现端到端的协同优化:任务层以声明式接口捕获用户意图并生成执行计划,结合优化器按精度、延迟和成本目标调度;模型层提供版本化存储与动态更新机制,支持共享组件复用;数据层则实现统一的数据管理能力,包括索引、约束感知发现、任务对齐选择和特征管理,从而在保证效率的同时提升系统的灵活性与可扩展性。
链接: https://arxiv.org/abs/2510.12642
作者: Meihui Zhang,Liming Wang,Chi Zhang,Zhaojing Luo
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:
Abstract:A growing trend in modern data analysis is the integration of data management with learning, guided by accuracy, latency, and cost requirements. In practice, applications draw data of different formats from many sources. In the meanwhile, the objectives and budgets change over time. Existing systems handle these applications across databases, analysis libraries, and tuning services. Such fragmentation leads to complex user interaction, limited adaptability, suboptimal performance, and poor extensibility across components. To address these challenges, we present Aixel, a unified, adaptive, and extensible system for AI-powered data analysis. The system organizes work across four layers: application, task, model, and data. The task layer provides a declarative interface to capture user intent, which is parsed into an executable operator plan. An optimizer compiles and schedules this plan to meet specified goals in accuracy, latency, and cost. The task layer coordinates the execution of data and model operators, with built-in support for reuse and caching to improve efficiency. The model layer offers versioned storage for index, metadata, tensors, and model artifacts. It supports adaptive construction, task-aligned drift detection, and safe updates that reuse shared components. The data layer provides unified data management capabilities, including indexing, constraint-aware discovery, task-aligned selection, and comprehensive feature management. With the above designed layers, Aixel delivers a user friendly, adaptive, efficient, and extensible system.
zh
[AI-14] Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在长时程智能体任务中因工作记忆(working memory)受限而难以有效管理上下文信息的问题,尤其针对现有方法依赖外部启发式机制、与智能体核心策略解耦导致的效率低下和适应性差的缺陷。其解决方案的关键在于提出一种名为“Memory-as-Action”的新框架,将工作记忆管理建模为可学习的内在能力,通过强化学习训练智能体以显式编辑操作(如插入、删除或替换)主动调控记忆内容,并将其作为统一策略的一部分进行端到端优化。为此,作者进一步设计了动态上下文策略优化算法(Dynamic Context Policy Optimization),通过在记忆编辑点分割轨迹并应用轨迹级优势信号,解决了因非前缀式记忆修改引发的因果连续性破坏问题,从而实现了任务推理与记忆管理的协同优化,显著提升了任务性能并降低了计算开销。
链接: https://arxiv.org/abs/2510.12635
作者: Yuxiang Zhang,Jiangming Shu,Ye Ma,Xueyuan Lin,Shangxi Wu,Jitao Sang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models face challenges in long-horizon agentic tasks as their constrained memory is easily overwhelmed by distracting or irrelevant context. Existing working memory methods typically rely on external, heuristic mechanisms that are decoupled from the agent’s core policy. In this work, we reframe working memory management as a learnable, intrinsic capability. We propose a novel framework, Memory-as-Action, where an agent actively manages its working memory by executing explicit editing operations as part of a unified policy. This formulation allows an agent, trained via reinforcement learning, to balance memory curation against long-term task objectives under given resource constraints. However, such memory editing actions break the standard assumption of a continuously growing prefix in LLM interactions, leading to what we call trajectory fractures. These non-prefix changes disrupt the causal continuity required by standard policy gradient methods, making those methods inapplicable. To address this, we propose a new algorithm, Dynamic Context Policy Optimization, which enables stable end-to-end reinforcement learning by segmenting trajectories at memory action points and applying trajectory-level advantages to the resulting action segments. Our results demonstrate that jointly optimizing for task reasoning and memory management in an end-to-end fashion not only reduces overall computational consumption but also improves task performance, driven by adaptive context curation strategies tailored to the model’s intrinsic capabilities.
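“把记忆编辑显式化为动作”可以用如下假设性接口示意(insert/delete/replace 三类操作;类名与字段均为本文虚构):

```python
from dataclasses import dataclass, field

@dataclass
class WorkingMemory:
    entries: list = field(default_factory=list)

    def apply(self, action: dict):
        """三类记忆编辑动作:insert / delete / replace。"""
        op = action["op"]
        if op == "insert":
            self.entries.insert(action["pos"], action["text"])
        elif op == "delete":
            del self.entries[action["pos"]]
        elif op == "replace":
            self.entries[action["pos"]] = action["text"]

mem = WorkingMemory()
mem.apply({"op": "insert", "pos": 0, "text": "step-1 观察摘要"})
mem.apply({"op": "replace", "pos": 0, "text": "压缩后的摘要"})
print(mem.entries)
```

这类非前缀修改正是论文所说的“轨迹断裂”:上下文不再单调增长,因此需要在记忆动作点切分轨迹、按段计算优势。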
zh
[AI-15] Laminar: A Scalable Asynchronous RL Post-Training Framework
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)后训练大型语言模型(Large Language Models, LLMs)时因轨迹生成分布极度长尾偏斜导致的GPU利用率低下问题。现有异步RL系统依赖于演员(actor)与所有回放轨迹之间的全局权重同步,造成刚性更新调度,难以适应轨迹生成延迟的高度波动性,从而严重限制训练效率。其解决方案的关键在于引入轨迹级异步机制,通过完全解耦的架构实现每个轨迹独立生成与消费:首先,用中继工作节点作为分布式参数服务替代全局同步,支持细粒度、无阻塞的权重拉取;其次,设计动态重打包机制将长尾轨迹集中到少数专用回放进程中,提升生成吞吐量;该设计同时增强了系统的容错能力,确保长时间运行任务的鲁棒性。
链接: https://arxiv.org/abs/2510.12633
作者: Guangming Sheng,Yuxuan Tong,Borui Wan,Wang Zhang,Chaobo Jia,Xibin Wu,Yuqi Wu,Xiang Li,Chi Zhang,Yanghua Peng,Haibin Lin,Xin Liu,Chuan Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:Reinforcement learning (RL) post-training for Large Language Models (LLMs) is now scaling to large clusters and running for extended durations to enhance model reasoning performance. However, the scalability of existing RL frameworks is limited, as extreme long-tail skewness in RL trajectory generation causes severe GPU underutilization. Current asynchronous RL systems attempt to mitigate this, but they rely on global weight synchronization between the actor and all rollouts, which creates a rigid model update schedule. This global synchronization is ill-suited for the highly skewed and evolving distribution of trajectory generation latency in RL training, crippling training efficiency. Our key insight is that efficient scaling requires breaking this lockstep through trajectory-level asynchrony, which generates and consumes each trajectory independently. We propose Laminar, a scalable and robust RL post-training system built on a fully decoupled architecture. First, we replace global updates with a tier of relay workers acting as a distributed parameter service. This enables asynchronous and fine-grained weight synchronization, allowing rollouts to pull the latest weight anytime without stalling the actor’s training loop. Second, a dynamic repack mechanism consolidates long-tail trajectories onto a few dedicated rollouts, maximizing generation throughput. The fully decoupled design also isolates failures, ensuring robustness for long-running jobs. Our evaluation on a 1024-GPU cluster shows that Laminar achieves up to 5.48 \times training throughput speedup over state-of-the-art systems, while reducing model convergence time.
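“中继工作节点充当分布式参数服务”的解耦思路,可以用线程加锁的玩具版本示意如下(假设性示例,与 Laminar 的真实实现无关):

```python
import threading, time

class RelayWorker:
    """actor 推送新权重,rollout 随时拉取最新版本,互不阻塞。"""
    def __init__(self):
        self._lock = threading.Lock()
        self._version, self._weights = 0, None

    def push(self, weights):
        with self._lock:
            self._version += 1
            self._weights = weights

    def pull(self):
        with self._lock:
            return self._version, self._weights

relay = RelayWorker()

def actor_loop():                      # 训练循环:异步推送,不等待任何 rollout
    for step in range(3):
        time.sleep(0.01)
        relay.push({"step": step})

def rollout_loop():                    # 轨迹生成:随时拉取最新权重(轨迹级异步)
    for _ in range(3):
        time.sleep(0.015)
        version, _ = relay.pull()
        print("rollout got weight version", version)

threads = [threading.Thread(target=actor_loop), threading.Thread(target=rollout_loop)]
for t in threads: t.start()
for t in threads: t.join()
```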
zh
[AI-16] Designing Tools with Control Confidence
【速读】:该论文旨在解决当前自主工具设计框架仅依赖性能优化而忽视代理在重复使用场景下对工具使用信心的问题,从而导致工具鲁棒性不足。其关键解决方案在于提出一种面向任务条件的机器人手持工具自主设计优化框架,并引入神经启发式的控制信心(control confidence)项作为优化目标,使设计出的工具在环境不确定性下表现出更低的性能波动,同时在鲁棒性与目标准确性之间实现平衡。通过CMAES进化优化策略实现高效搜索,显著优于现有最优优化器,在最少迭代次数内完成最优工具设计。
链接: https://arxiv.org/abs/2510.12630
作者: Ajith Anil Meera,Abian Torres,Pablo Lanillos
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Prehistoric humans invented stone tools for specialized tasks by not just maximizing the tool’s immediate goal-completion accuracy, but also increasing their confidence in the tool for later use under similar settings. This factor contributed to the increased robustness of the tool, i.e., the least performance deviations under environmental uncertainties. However, the current autonomous tool design frameworks solely rely on performance optimization, without considering the agent’s confidence in tool use for repeated use. Here, we take a step towards filling this gap by i) defining an optimization framework for task-conditioned autonomous hand tool design for robots, where ii) we introduce a neuro-inspired control confidence term into the optimization routine that helps the agent to design tools with higher robustness. Through rigorous simulations using a robotic arm, we show that tools designed with control confidence as the objective function are more robust to environmental uncertainties during tool use than a pure accuracy-driven objective. We further show that adding control confidence to the objective function for tool design provides a balance between the robustness and goal accuracy of the designed tools under control perturbations. Finally, we show that our CMAES-based evolutionary optimization strategy for autonomous tool design outperforms other state-of-the-art optimizers by designing the optimal tool within the fewest iterations. Code: this https URL.
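“以控制置信度为目标项”的优化可以粗略示意如下(假设性玩具示例:用简化的 (1+λ) 进化策略和一个二次型目标代替论文的 CMA-ES 与机械臂仿真;置信度按“扰动下性能方差越小越好”定义,具体形式非论文原文):

```python
import numpy as np

def task_accuracy(tool_params):
    return -np.sum((tool_params - 0.5) ** 2)     # 玩具目标:代替仿真中的任务精度

def control_confidence(tool_params, rng, n_rollouts=8, noise=0.1):
    """扰动下重复执行,性能方差越小、控制置信度越高。"""
    perf = [task_accuracy(tool_params + rng.normal(0, noise, tool_params.shape))
            for _ in range(n_rollouts)]
    return -float(np.var(perf))

rng = np.random.default_rng(0)
x = rng.normal(size=3)
for _ in range(200):                              # 简化的 (1+λ) 进化策略
    cands = [x + rng.normal(0, 0.1, 3) for _ in range(8)] + [x]
    x = max(cands, key=lambda p: task_accuracy(p) + 0.5 * control_confidence(p, rng))
print(np.round(x, 2))   # 期望收敛到 0.5 附近,且对扰动不敏感
```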
zh
[AI-17] Learning-To-Measure: In-context Active Feature Acquisition
【速读】:该论文旨在解决**主动特征获取(Active Feature Acquisition, AFA)中因任务特定性导致的可扩展性不足问题,即现有方法通常仅针对单一预定义任务进行学习,难以泛化到新任务。为突破此限制,作者提出元AFA(meta-AFA)**问题,并设计了Learning-to-Measure (L2M)框架作为解决方案,其关键在于:i) 基于序列建模或自回归预训练实现对未见任务的可靠不确定性量化;ii) 构建基于不确定性的贪婪特征获取代理,通过最大化条件互信息来优化特征选择策略。L2M无需针对每个任务重新训练,可直接在具有回顾性缺失的数据集上运行,并在合成与真实世界表格基准测试中表现出优于或相当的任务特定基线,尤其在标签稀缺和高缺失率场景下表现突出。
链接: https://arxiv.org/abs/2510.12624
作者: Yuta Kobayashi,Zilin Jing,Jiayu Yao,Hongseok Namkoong,Shalmali Joshi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Active feature acquisition (AFA) is a sequential decision-making problem where the goal is to improve model performance for test instances by adaptively selecting which features to acquire. In practice, AFA methods often learn from retrospective data with systematic missingness in the features and limited task-specific labels. Most prior work addresses acquisition for a single predetermined task, limiting scalability. To address this limitation, we formalize the meta-AFA problem, where the goal is to learn acquisition policies across various tasks. We introduce Learning-to-Measure (L2M), which consists of i) reliable uncertainty quantification over unseen tasks, and ii) an uncertainty-guided greedy feature acquisition agent that maximizes conditional mutual information. We demonstrate a sequence-modeling or autoregressive pre-training approach that underpins reliable uncertainty quantification for tasks with arbitrary missingness. L2M operates directly on datasets with retrospective missingness and performs the meta-AFA task in-context, eliminating per-task retraining. Across synthetic and real-world tabular benchmarks, L2M matches or surpasses task-specific baselines, particularly under scarce labels and high missingness.
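“不确定性引导的贪心特征获取”一步可以示意如下(假设性示例:用预测熵的期望下降近似条件互信息增益;predict_proba 与 sampler 是本文虚构的接口):

```python
import numpy as np

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def pick_next_feature(predict_proba, x_obs, candidates, sampler, n_samples=16):
    """贪心选择期望信息增益最大的未观测特征。"""
    base_h = entropy(predict_proba(x_obs))
    gains = {}
    for f in candidates:
        exp_h = np.mean([entropy(predict_proba({**x_obs, f: v}))
                         for v in sampler(f, x_obs, n_samples)])
        gains[f] = base_h - exp_h              # 期望熵下降 ≈ 条件互信息
    return max(gains, key=gains.get)

def demo_proba(x):                              # 玩具二分类模型
    p1 = 1 / (1 + np.exp(-(sum(x.values()) - 1)))
    return np.array([1 - p1, p1])

demo_sampler = lambda f, x, n: np.random.default_rng(0).integers(0, 2, size=n)
print(pick_next_feature(demo_proba, {"f0": 1}, ["f1", "f2"], demo_sampler))
```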
zh
[AI-18] Rethinking Knowledge Distillation: A Data Dependent Regulariser With a Negative Asymmetric Payoff
【速读】:该论文旨在解决知识蒸馏(Knowledge Distillation)在功能层面的理解不足问题,特别是其压缩能力与知识迁移机制之间的关系不明确。传统观点将知识蒸馏视为一种模型压缩手段,但其实际是否有效传递了教师模型的知识仍缺乏系统性验证。论文的关键解决方案在于通过假设检验、对照实验及随机蒸馏控制,从功能角度解耦压缩与架构简化的影响,从而量化知识迁移的程度和边界;同时,通过多模态数据、多种架构和缩放规律分析,揭示出知识蒸馏更像是一种依赖于数据的正则化机制,且存在显著的负向不对称知识转移现象,即学生模型可能被引入有害或错误的知识,这对应用安全性构成潜在风险。
链接: https://arxiv.org/abs/2510.12615
作者: Israel Mason-Williams,Gabryel Mason-Williams,Helen Yannakoudakis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 45 pages, 24 figures and 104 tables
Abstract:Knowledge distillation is often considered a compression mechanism when judged on the resulting student’s accuracy and loss, yet its functional impact is poorly understood. In this work, we quantify the compression capacity of knowledge distillation and the resulting knowledge transfer from a functional perspective, decoupling compression from architectural reduction, which provides an improved understanding of knowledge distillation. We employ hypothesis testing, controls, and random control distillation to understand knowledge transfer mechanisms across data modalities. To rigorously test the breadth and limits of our analyses, we explore multiple distillation variants and analyse distillation scaling laws across model sizes. Our findings demonstrate that, while there is statistically significant knowledge transfer in some modalities and architectures, the extent of this transfer is less pronounced than anticipated, even under conditions designed to maximise knowledge sharing. Notably, in cases of significant knowledge transfer, we identify a consistent and severe asymmetric transfer of negative knowledge to the student, raising safety concerns in knowledge distillation applications. Across 12 experimental setups, 9 architectures, and 7 datasets, our findings show that knowledge distillation functions less as a compression mechanism and more as a data-dependent regulariser with a negative asymmetric payoff.
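作为参照,论文的讨论对象即标准的 Hinton 式蒸馏损失(软标签 KL 项 + 硬标签交叉熵;以下为通用写法,非论文专有实现):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)   # 温度平方补偿梯度幅度
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s, t = torch.randn(4, 10), torch.randn(4, 10)
y = torch.randint(0, 10, (4,))
print(kd_loss(s, t, y))
```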
zh
[AI-19] SMILE: SeMantic Ids Enhanced CoLd Item Representation for Click-through Rate Prediction in E-commerce SEarch
【速读】:该论文旨在解决冷启动物品(cold-start items)在现代搜索与推荐平台中因协同信息不足而加剧的“马太效应”问题,即高热度物品持续获得更多曝光和交互,导致平台多样性下降。解决方案的关键在于提出SMILE方法,通过语义ID融合对齐(fused alignment of semantic IDs)增强物品表征:首先利用RQ-OPQ编码对物品的内容信息和协同信息进行量化,随后分两步对齐——RQ编码用于跨物品传递共享的协同信号,OPQ编码则学习物品间的细粒度差异化特征,从而有效缓解协同与内容之间的不对称性及物品间细微差异问题。
链接: https://arxiv.org/abs/2510.12604
作者: Qihang Zhao,Zhongbo Sun,Xiaoyang Zheng,Xian Guo,Siyuan Wang,Zihan Liang,Mingcan Peng,Ben Chen,Chenyi Lei
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:With the rise of modern search and recommendation platforms, insufficient collaborative information of cold-start items exacerbates the Matthew effect of existing platform items, challenging platform diversity and becoming a longstanding issue. Existing methods align items’ side content with collaborative information to transfer collaborative signals from high-popularity items to cold-start items. However, these methods fail to account for the asymmetry between collaboration and content, nor the fine-grained differences among items. To address these issues, we propose SMILE, an item representation enhancement approach based on fused alignment of semantic IDs. Specifically, we use RQ-OPQ encoding to quantize item content and collaborative information, followed by a two-step alignment: RQ encoding transfers shared collaborative signals across items, while OPQ encoding learns differentiated information of items. Comprehensive offline experiments on large-scale industrial datasets demonstrate superiority of SMILE, and rigorous online A/B tests confirm statistically significant improvements: item CTR +1.66%, buyers +1.57%, and order volume +2.17%.
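其中 RQ(残差量化)把物品向量编码成一串语义 ID 的过程,可用如下最小实现示意(假设性示例:码本为随机初始化,OPQ 分支从略):

```python
import numpy as np

def rq_encode(x, codebooks):
    """逐级量化残差:每级在码本中找最近邻,返回语义 ID 序列。"""
    ids, residual = [], x.copy()
    for cb in codebooks:                       # cb 形状为 (K, d)
        idx = int(np.argmin(((residual - cb) ** 2).sum(axis=1)))
        ids.append(idx)
        residual = residual - cb[idx]          # 残差交给下一级码本
    return ids

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]
print(rq_encode(rng.normal(size=4), codebooks))   # 形如 [3, 5, 1] 的语义 ID
```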
zh
[AI-20] HardcoreLogic: Challenging Large Reasoning Models with Long-tail Logic Puzzle Games
【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在面对非标准逻辑游戏变体时,其规则迁移能力与灵活推理能力不足的问题。现有基准数据集多聚焦于常见谜题(如标准9×9数独),易导致模型过拟合于特定格式或记忆解题模式,从而掩盖其对新规则理解与策略适应的缺陷。为此,作者提出HardcoreLogic这一涵盖10类逻辑游戏、超过5000个谜题的新基准,通过三个维度系统性地构造挑战:复杂度提升(Increased Complexity, IC)、非常规元素引入(Uncommon Elements, UE)以及无解谜题设计(Unsolvable Puzzles, UP),以削弱模型对捷径记忆的依赖。关键创新在于构建具有“长尾分布”特征的多样化测试集,有效暴露当前LRM在真实推理能力上的局限性,并为未来高阶逻辑推理研究提供可量化评估标准。
链接: https://arxiv.org/abs/2510.12563
作者: Jingcong Liang,Shijun Wan,Xuehai Wu,Siyuan Wang,Yitong Li,Qianglong Chen,Duyu Tang,Zhongyu Wei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Reasoning Models (LRMs) have demonstrated impressive performance on complex tasks, including logical puzzle games that require deriving solutions satisfying all constraints. However, whether they can flexibly apply appropriate rules to varying conditions, particularly when faced with non-canonical game variants, remains an open question. Existing corpora focus on popular puzzles like 9x9 Sudoku, risking overfitting to canonical formats and memorization of solution patterns, which can mask deficiencies in understanding novel rules or adapting strategies to new variants. To address this, we introduce HardcoreLogic, a challenging benchmark of over 5,000 puzzles across 10 games, designed to test the robustness of LRMs on the “long-tail” of logical games. HardcoreLogic systematically transforms canonical puzzles through three dimensions: Increased Complexity (IC), Uncommon Elements (UE), and Unsolvable Puzzles (UP), reducing reliance on shortcut memorization. Evaluations on a diverse set of LRMs reveal significant performance drops, even for models achieving top scores on existing benchmarks, indicating heavy reliance on memorized stereotypes. While increased complexity is the dominant source of difficulty, models also struggle with subtle rule variations that do not necessarily increase puzzle difficulty. Our systematic error analysis on solvable and unsolvable puzzles further highlights gaps in genuine reasoning. Overall, HardcoreLogic exposes the limitations of current LRMs and establishes a benchmark for advancing high-level logical reasoning.
zh
[AI-21] Inclusive Fitness as a Key Step Towards More Advanced Social Behaviors in Multi-Agent Reinforcement Learning Settings AAMAS2022 ICLR2022
【速读】:该论文旨在解决多智能体强化学习中社会行为(如合作与竞争)难以自然涌现的问题,尤其是如何在不依赖预设团队结构的情况下实现基于遗传相似性的动态协作机制。其解决方案的关键在于引入以“广义适合度”(inclusive fitness)为核心的奖励函数,并赋予每个智能体基因型(genotype),使智能体间可通过基因共享形成合作关系,从而在囚徒困境类网络博弈中自发产生符合生物学规律(如汉密尔顿法则)的社会动态。这一机制不仅支持从完全对抗到完全合作的连续谱系协作,还为开放环境中策略演进和多智能体自适应学习提供了演化导向的自动课程(auto-curriculum)框架。
链接: https://arxiv.org/abs/2510.12555
作者: Andries Rosseau,Raphaël Avalos,Ann Nowé
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
备注: This version is a slightly updated version (e.g., added an important reference) compared to the peer-reviewed versions at ‘Adaptive Learning Agents’ at AAMAS 2022 or ‘From Cells to Societies’ at ICLR 2022
Abstract:The competitive and cooperative forces of natural selection have driven the evolution of intelligence for millions of years, culminating in nature’s vast biodiversity and the complexity of human minds. Inspired by this process, we propose a novel multi-agent reinforcement learning framework where each agent is assigned a genotype and where reward functions are modelled after the concept of inclusive fitness. An agent’s genetic material may be shared with other agents, and our inclusive reward function naturally accounts for this. We study the resulting social dynamics in two types of network games with prisoner’s dilemmas and find that our results align with well-established principles from biology, such as Hamilton’s rule. Furthermore, we outline how this framework can extend to more open-ended environments with spatial and temporal structure, finite resources, and evolving populations. We hypothesize the emergence of an arms race of strategies, where each new strategy is a gradual improvement over earlier adaptations of other agents, effectively producing a multi-agent autocurriculum analogous to biological evolution. In contrast to the binary team-based structures prevalent in earlier research, our gene-based reward structure introduces a spectrum of cooperation ranging from full adversity to full cooperativeness based on genetic similarity, enabling unique non team-based social dynamics. For example, one agent having a mutual cooperative relationship with two other agents, while the two other agents behave adversarially towards each other. We argue that incorporating inclusive fitness in agents provides a foundation for the emergence of more strategically advanced and socially intelligent agents.
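“inclusive fitness 奖励”的一个直观形式化如下(假设性示例:以基因型按位重合比例作为亲缘系数,呼应 Hamilton 法则;具体系数定义非论文原文):

```python
import numpy as np

def inclusive_reward(agent_id, payoffs, genotypes):
    """r_i = sum_j relatedness(i, j) * payoff_j。"""
    g_i = genotypes[agent_id]
    total = 0.0
    for j, g_j in enumerate(genotypes):
        relatedness = float(np.mean(g_i == g_j))  # 简化:按位相同比例作为亲缘系数
        total += relatedness * payoffs[j]
    return total

genotypes = np.array([[1, 0, 1, 1], [1, 0, 1, 0], [0, 1, 0, 0]])
payoffs = np.array([1.0, 0.5, 2.0])
print(inclusive_reward(0, payoffs, genotypes))  # 自身收益权重为 1,近亲按相似度折算
```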
zh
[AI-22] Evaluation of Real-Time Preprocessing Methods in AI-Based ECG Signal Analysis
【速读】:该论文旨在解决便携式心电图(ECG)系统在隐私合规、低功耗实时分析方面的挑战,尤其是在数据采集端进行高效信号处理的需求。其解决方案的关键在于结合边缘计算(edge computing)与云计算的优势,通过在边缘侧实现节能、高实时性的预处理方法,提升长期心电图信号的分析效率与安全性,从而构建一个协同优化的机器学习框架。
链接: https://arxiv.org/abs/2510.12541
作者: Jasmin Freudenberg,Kai Hahn,Christian Weber,Madjid Fathi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Conference paper for 2025 IEEE World AI IoT Congress (AIIoT), FACE Project, University of Siegen, Germany
Abstract:The increasing popularity of portable ECG systems and the growing demand for privacy-compliant, energy-efficient real-time analysis require new approaches to signal processing at the point of data acquisition. In this context, the edge domain is acquiring increasing importance, as it not only reduces latency times, but also enables an increased level of data security. The FACE project aims to develop an innovative machine learning solution for analysing long-term electrocardiograms that synergistically combines the strengths of edge and cloud computing. In this thesis, various pre-processing steps of ECG signals are analysed with regard to their applicability in the project. The selection of suitable methods in the edge area is based in particular on criteria such as energy efficiency, processing capability and real-time capability.
zh
[AI-23] ProtoSiTex: Learning Semi-Interpretable Prototypes for Multi-label Text Classification
【速读】:该论文旨在解决现有原型模型在细粒度多标签文本分类任务中面临的两大挑战:一是解释粒度粗(通常仅限于句子或文档级别),二是难以有效处理现实场景中文本的多标签特性。解决方案的关键在于提出ProtoSiTex框架,其核心创新包括:采用双阶段交替训练策略,先通过无监督原型发现阶段学习语义一致且多样化的原型,再通过有监督分类阶段将原型映射到类别标签;引入分层损失函数以确保子句、句子和文档层级的一致性,提升可解释性与对齐度;并通过自适应原型和多头注意力机制捕捉标签间的重叠与冲突语义。该方法在新构建的酒店评论细粒度多标签数据集及两个公开基准上均实现最优性能,并提供忠实于人类理解的解释。
链接: https://arxiv.org/abs/2510.12534
作者: Utsav Kumar Nareti,Suraj Kumar,Soumya Pandey,Soumi Chattopadhyay,Chandranath Adak
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The surge in user-generated reviews has amplified the need for interpretable models that can provide fine-grained insights. Existing prototype-based models offer intuitive explanations but typically operate at coarse granularity (sentence or document level) and fail to address the multi-label nature of real-world text classification. We propose ProtoSiTex, a semi-interpretable framework designed for fine-grained multi-label text classification. ProtoSiTex employs a dual-phase alternating training strategy: an unsupervised prototype discovery phase that learns semantically coherent and diverse prototypes, and a supervised classification phase that maps these prototypes to class labels. A hierarchical loss function enforces consistency across sub-sentence, sentence, and document levels, enhancing interpretability and alignment. Unlike prior approaches, ProtoSiTex captures overlapping and conflicting semantics using adaptive prototypes and multi-head attention. We also introduce a benchmark dataset of hotel reviews annotated at the sub-sentence level with multiple labels. Experiments on this dataset and two public benchmarks (binary and multi-class) show that ProtoSiTex achieves state-of-the-art performance while delivering faithful, human-aligned explanations, establishing it as a robust solution for semi-interpretable multi-label text classification.
zh
[AI-24] The Robustness of Differentiable Causal Discovery in Misspecified Scenarios ICLR2025
【速读】:该论文旨在解决因果发现算法在现实世界数据中因违反独立同分布(i.i.d.)假设而导致性能下降的问题,从而限制了其在实际场景中的广泛应用。解决方案的关键在于对主流因果发现算法在八种模型假设违背情况下的实证性能进行全面基准测试,特别是聚焦于可微分因果发现方法(differentiable causal discovery methods)的鲁棒性表现。实验结果表明,这些方法在结构汉明距离(Structural Hamming Distance)和结构干预距离(Structural Intervention Distance)指标下,在多数挑战性场景中表现出较强的稳定性,仅在尺度变化(scale variation)情况下失效;同时,论文还提供了相应的理论解释,旨在建立合理的因果发现评估标准,推动其在真实场景中的落地应用。
链接: https://arxiv.org/abs/2510.12503
作者: Huiyang Yi,Yanyan He,Duxin Chen,Mingyu Kang,He Wang,Wenwu Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME); Machine Learning (stat.ML)
备注: accepted to ICLR 2025
Abstract:Causal discovery aims to learn causal relationships between variables from targeted data, making it a fundamental task in machine learning. However, causal discovery algorithms often rely on unverifiable causal assumptions, which are usually difficult to satisfy in real-world data, thereby limiting the broad application of causal discovery in practical scenarios. Inspired by these considerations, this work extensively benchmarks the empirical performance of various mainstream causal discovery algorithms, which assume i.i.d. data, under eight model assumption violations. Our experimental results show that differentiable causal discovery methods exhibit robustness under the metrics of Structural Hamming Distance and Structural Intervention Distance of the inferred graphs in commonly used challenging scenarios, except for scale variation. We also provide the theoretical explanations for the performance of differentiable causal discovery methods. Finally, our work aims to comprehensively benchmark the performance of recent differentiable causal discovery methods under model assumption violations, and provide the standard for reasonable evaluation of causal discovery, as well as to further promote its application in real-world scenarios.
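文中评估所用的结构汉明距离(SHD)可按如下常见约定计算(假设性简化:反向边记 1 次修改;不同工具的计边约定略有差异):

```python
import numpy as np

def shd(true_adj, est_adj):
    """统计把估计图改成真图所需的加边、删边、反向操作数。"""
    diff = np.abs(true_adj - est_adj)
    # 反向边在 diff 中会被计两次(i->j 与 j->i 各一次),折算回一次
    reversed_edges = np.sum((diff + diff.T == 2) & (true_adj + true_adj.T == 1)) // 2
    return int(diff.sum() - reversed_edges)

true_g = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])   # 0->1, 1->2
est_g  = np.array([[0, 0, 0], [1, 0, 1], [0, 0, 0]])   # 1->0, 1->2(第一条边方向估反)
print(shd(true_g, est_g))  # 1:一次反向操作
```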
zh
[AI-25] Artificial Intelligence Virtual Cells: From Measurements to Decisions across Modality, Scale, Dynamics and Evaluation
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在细胞状态建模中面临的跨实验室、跨平台、跨尺度和跨干预条件下的可迁移性差与评估不系统的问题,具体表现为数据分割存在泄露风险、覆盖偏差严重,以及对剂量、时间及组合效应缺乏系统处理。其解决方案的关键在于提出一种模型无关的细胞状态潜在空间(Cell-State Latent, CSL)视角,通过操作符语法(operator grammar)组织学习过程:包括测量(measurement)、升维/投影以实现跨尺度耦合(lift/project for cross-scale coupling),以及干预(intervention)用于剂量与调度建模。这一框架推动了面向决策对齐的多维度评估蓝图,强调功能空间读出(如通路活性、空间邻域和临床终点),并推荐操作符感知的数据设计、抗泄露的数据划分和透明校准报告,以支持可重现的、类比一致的比较。
链接: https://arxiv.org/abs/2510.12498
作者: Chengpeng Hu,Calvin Yu-Chian Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Artificial Intelligence Virtual Cells (AIVCs) aim to learn executable, decision-relevant models of cell state from multimodal, multiscale measurements. Recent studies have introduced single-cell and spatial foundation models, improved cross-modality alignment, scaled perturbation atlases, and explored pathway-level readouts. Nevertheless, although held-out validation is standard practice, evaluations remain predominantly within single datasets and settings; evidence indicates that transport across laboratories and platforms is often limited, that some data splits are vulnerable to leakage and coverage bias, and that dose, time and combination effects are not yet systematically handled. Cross-scale coupling also remains constrained, as anchors linking molecular, cellular and tissue levels are sparse, and alignment to scientific or clinical readouts varies across studies. We propose a model-agnostic Cell-State Latent (CSL) perspective that organizes learning via an operator grammar: measurement, lift/project for cross-scale coupling, and intervention for dosing and scheduling. This view motivates a decision-aligned evaluation blueprint across modality, scale, context and intervention, and emphasizes function-space readouts such as pathway activity, spatial neighborhoods and clinically relevant endpoints. We recommend operator-aware data design, leakage-resistant partitions, and transparent calibration and reporting to enable reproducible, like-for-like comparisons.
zh
[AI-26] PubSub-VFL: Towards Efficient Two-Party Split Learning in Heterogeneous Environments via Publisher/Subscriber Architecture NEURIPS2025
【速读】:该论文旨在解决两方纵向联邦学习(Vertical Federated Learning, VFL)中存在的计算资源利用率低和训练效率不足的问题,尤其是由同步依赖设计导致的训练延迟以及参与方间资源与数据异构性带来的计算不均衡。其解决方案的关键在于提出一种基于发布/订阅(Publisher/Subscriber, PubSub)架构的新型VFL范式——PubSub-VFL,通过解耦Pub/Sub架构与参数服务器的数据并行机制,设计分层异步训练机制以降低延迟并提升系统效率;同时,针对资源与数据异构性引发的训练不平衡问题,构建基于参与者系统配置的优化模型,实现隐私保护下的最优超参数选择,从而在保证收敛稳定性和兼容差分隐私等安全协议的前提下,显著提升训练速度与资源利用率。
链接: https://arxiv.org/abs/2510.12494
作者: Yi Liu,Yang Liu,Leqian Zheng,Jue Hong,Junjie Shi,Qingyou Yang,Ye Wu,Cong Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Accepted at NeurIPS 2025
Abstract:With the rapid advancement of the digital economy, data collaboration between organizations has become a well-established business model, driving the growth of various industries. However, privacy concerns make direct data sharing impractical. To address this, Two-Party Split Learning (a.k.a. Vertical Federated Learning (VFL)) has emerged as a promising solution for secure collaborative learning. Despite its advantages, this architecture still suffers from low computational resource utilization and training efficiency. Specifically, its synchronous dependency design increases training latency, while resource and data heterogeneity among participants further hinder efficient computation. To overcome these challenges, we propose PubSub-VFL, a novel VFL paradigm with a Publisher/Subscriber architecture optimized for two-party collaborative learning with high computational efficiency. PubSub-VFL leverages the decoupling capabilities of the Pub/Sub architecture and the data parallelism of the parameter server architecture to design a hierarchical asynchronous mechanism, reducing training latency and improving system efficiency. Additionally, to mitigate the training imbalance caused by resource and data heterogeneity, we formalize an optimization problem based on participants’ system profiles, enabling the selection of optimal hyperparameters while preserving privacy. We conduct a theoretical analysis to demonstrate that PubSub-VFL achieves stable convergence and is compatible with security protocols such as differential privacy. Extensive case studies on five benchmark datasets further validate its effectiveness, showing that, compared to state-of-the-art baselines, PubSub-VFL not only accelerates training by 2 \sim 7\times without compromising accuracy, but also achieves a computational resource utilization rate of up to 91.07%.
zh
[AI-27] Using Medical Algorithms for Task-Oriented Dialogue in LLM-Based Medical Interviews
【速读】:该论文旨在解决临床问诊过程中信息获取效率低、流程不规范以及医生认知负荷过高的问题,特别是在缺乏患者先验信息时难以快速构建有效问诊路径的挑战。其解决方案的关键在于提出了一种基于有向无环图(Directed Acyclic Graph, DAG)的任务导向型对话框架,通过将医学指南转化为结构化临床问题语料库,并结合分层聚类的冷启动机制生成初始提问策略,辅以动态扩展与剪枝机制实现根据患者响应自适应调整问诊分支和回溯,同时引入终止逻辑确保在收集足够诊断信息后及时结束访谈,最终输出符合临床工作流的结构化报告,从而显著降低医生认知负担并提升问诊效率。
链接: https://arxiv.org/abs/2510.12490
作者: Rui Reis,Pedro Rangel Henriques,João Ferreira-Coimbra,Eva Oliveira,Nuno F. Rodrigues
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We developed a task-oriented dialogue framework structured as a Directed Acyclic Graph (DAG) of medical questions. The system integrates: (1) a systematic pipeline for transforming medical algorithms and guidelines into a clinical question corpus; (2) a cold-start mechanism based on hierarchical clustering to generate efficient initial questioning without prior patient information; (3) an expand-and-prune mechanism enabling adaptive branching and backtracking based on patient responses; (4) a termination logic to ensure interviews end once sufficient information is gathered; and (5) automated synthesis of doctor-friendly structured reports aligned with clinical workflows. Human-computer interaction principles guided the design of both the patient and physician applications. Preliminary evaluation involved five physicians using standardized instruments: NASA-TLX (cognitive workload), the System Usability Scale (SUS), and the Questionnaire for User Interface Satisfaction (QUIS). The patient application achieved low workload scores (NASA-TLX = 15.6), high usability (SUS = 86), and strong satisfaction (QUIS = 8.1/9), with particularly high ratings for ease of learning and interface design. The physician application yielded moderate workload (NASA-TLX = 26) and excellent usability (SUS = 88.5), with satisfaction scores of 8.3/9. Both applications demonstrated effective integration into clinical workflows, reducing cognitive demand and supporting efficient report generation. Limitations included occasional system latency and a small, non-diverse evaluation sample.
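问题 DAG 上“扩展-剪枝”的推进方式可以用如下假设性骨架示意(数据结构与字段名均为本文虚构):

```python
def next_questions(dag, answers):
    """dag: {问题: {"children": [...], "condition": 可调用}};按回答展开或剪掉分支。"""
    frontier = []
    for q, node in dag.items():
        if q in answers:                                  # 已回答的问题才会展开子问题
            for child in node["children"]:
                if dag[child]["condition"](answers):      # 条件不满足的分支被剪枝
                    frontier.append(child)
    return [q for q in frontier if q not in answers]

dag = {
    "是否咳嗽": {"children": ["咳嗽持续几天"], "condition": lambda a: True},
    "咳嗽持续几天": {"children": [], "condition": lambda a: a.get("是否咳嗽") == "是"},
}
print(next_questions(dag, {"是否咳嗽": "是"}))  # ['咳嗽持续几天']
```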
zh
[AI-28] Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)作为“AI裁判”在通信系统内容质量评估中可能存在的判断偏差问题,这些问题可能导致评分失真并削弱用户信任。研究发现,尽管当前最先进的LLM裁判模型(如GPT-Judge和JudgeLM)对有偏输入具有一定的鲁棒性(通常给出较低分数),但若在训练过程中使用高分却带有偏见的样本,则会显著降低其性能,凸显了数据偏见对模型可靠性的影响。此外,评分与任务难度密切相关:复杂任务(如GPQA)平均得分更低,而开放推理任务(如JudgeLM-val)得分更高。解决方案的关键在于:首先,通过提供详细的评分标准提升模型对偏见输入的识别能力;其次,避免使用存在偏见的训练数据;最后,提出四种潜在的缓解策略以确保实际应用场景下AI评判的公平性和可靠性。
链接: https://arxiv.org/abs/2510.12462
作者: Jiaxin Gao,Chen Chen,Yanwen Jia,Xueluan Gong,Kwok-Yan Lam,Qian Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Large Language Models (LLMs) are increasingly being used to autonomously evaluate the quality of content in communication systems, e.g., to assess responses in telecom customer support chatbots. However, the impartiality of these AI “judges” is not guaranteed, and any biases in their evaluation criteria could skew outcomes and undermine user trust. In this paper, we systematically investigate judgment biases in two LLM-as-a-judge models (i.e., GPT-Judge and JudgeLM) under the point-wise scoring setting, encompassing 11 types of biases that cover both implicit and explicit forms. We observed that state-of-the-art LLM judges demonstrate robustness to biased inputs, generally assigning them lower scores than the corresponding clean samples. Providing a detailed scoring rubric further enhances this robustness. We further found that fine-tuning an LLM on high-scoring yet biased responses can significantly degrade its performance, highlighting the risk of training on biased data. We also discovered that the judged scores correlate with task difficulty: a challenging dataset like GPQA yields lower average scores, whereas an open-ended reasoning dataset (e.g., JudgeLM-val) sees higher average scores. Finally, we proposed four potential mitigation strategies to ensure fair and reliable AI judging in practical communication scenarios.
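点式评分下的偏差检验流程骨架如下(假设性接口:judge_score 代表任意一个点式打分的 LLM judge;玩具打分函数仅作演示,不对应任何真实 API):

```python
def bias_gap(judge_score, clean_samples, biased_samples, rubric=None):
    """干净样本与带偏样本的平均得分差;差值越大说明 judge 对带偏输入越敏感(越鲁棒)。"""
    clean_avg = sum(judge_score(s, rubric) for s in clean_samples) / len(clean_samples)
    biased_avg = sum(judge_score(s, rubric) for s in biased_samples) / len(biased_samples)
    return clean_avg - biased_avg

toy_judge = lambda text, rubric: 6.5 if "[bias]" in text else 9.0  # 代替真实 LLM judge
print(bias_gap(toy_judge, ["good answer"], ["good answer [bias]"]))  # 2.5
```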
zh
[AI-29] Biased-Attention Guided Risk Prediction for Safe Decision-Making at Unsignalized Intersections
【速读】:该论文旨在解决无信号交叉口场景下自动驾驶决策的挑战性问题,其核心难点在于复杂动态交互关系与高冲突风险带来的安全控制难题。解决方案的关键在于提出一种融合偏置注意力机制(biased attention mechanism)的深度强化学习(Deep Reinforcement Learning, DRL)决策框架,该框架基于Soft Actor-Critic(SAC)算法构建,并通过偏置注意力机制设计了一个交通风险预测器,用于评估车辆进入交叉口后的长期碰撞风险,并将该风险转化为密集奖励信号,从而引导智能体在保证安全的前提下实现高效驾驶决策。
链接: https://arxiv.org/abs/2510.12428
作者: Chengyang Dong,Nan Guo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Autonomous driving decision-making at unsignalized intersections is highly challenging due to complex dynamic interactions and high conflict risks. To achieve proactive safety control, this paper proposes a deep reinforcement learning (DRL) decision-making framework integrated with a biased attention mechanism. The framework is built upon the Soft Actor-Critic (SAC) algorithm. Its core innovation lies in the use of biased attention to construct a traffic risk predictor. This predictor assesses the long-term risk of collision for a vehicle entering the intersection and transforms this risk into a dense reward signal to guide the SAC agent in making safe and efficient driving decisions. Finally, the simulation results demonstrate that the proposed method effectively improves both traffic efficiency and vehicle safety at the intersection, thereby proving the effectiveness of the intelligent decision-making framework in complex scenarios. The code of our work is available at this https URL.
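“把长期碰撞风险转成稠密奖励”这一步可压缩成一个奖励塑形示意(假设性形式,具体系数与函数形状非论文原文):

```python
def shaped_reward(base_reward, predicted_risk, risk_weight=2.0):
    """predicted_risk ∈ [0, 1]:风险预测器对进入路口后长期碰撞风险的估计。"""
    return base_reward - risk_weight * predicted_risk

print(shaped_reward(base_reward=1.0, predicted_risk=0.3))  # 0.4:高风险动作被持续惩罚
```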
zh
[AI-30] MTOS: A LLM -Driven Multi-topic Opinion Simulation Framework for Exploring Echo Chamber Dynamics
【速读】:该论文旨在解决现有社会模拟框架在多主题情境下难以准确刻画观点演化与认知迁移的问题,尤其针对传统数值模型对语言态度简化处理导致的可解释性差、行为一致性弱,以及基于大语言模型(LLM)的研究多局限于单一主题、无法体现跨领域信息交互的局限。其解决方案的关键在于提出多主题观点模拟框架(Multi-topic Opinion Simulation, MTOS),该框架通过融合LLM与短期/长期记忆机制,引入多种用户选择交互方式和动态主题选择策略,并设计信念衰减机制以实现跨主题视角更新,从而在保留语言层面真实性的同时,有效捕捉多主题语境下的群体极化趋势与局部一致性特征。
链接: https://arxiv.org/abs/2510.12423
作者: Dingyi Zuo,Hongjie Zhang,Jie Ou,Chaosheng Feng,Shuwan Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 14 pages, 11 figures
Abstract:The polarization of opinions, information segregation, and cognitive biases on social media have attracted significant academic attention. In real-world networks, information often spans multiple interrelated topics, posing challenges for opinion evolution and highlighting the need for frameworks that simulate interactions among topics. Existing studies based on large language models (LLMs) focus largely on single topics, limiting the capture of cognitive transfer in multi-topic, cross-domain contexts. Traditional numerical models, meanwhile, simplify complex linguistic attitudes into discrete values, lacking interpretability, behavioral consistency, and the ability to integrate multiple topics. To address these issues, we propose Multi-topic Opinion Simulation (MTOS), a social simulation framework integrating multi-topic contexts with LLMs. MTOS leverages LLMs alongside short-term and long-term memory, incorporates multiple user-selection interaction mechanisms and dynamic topic-selection strategies, and employs a belief decay mechanism to enable perspective updates across topics. We conduct extensive experiments on MTOS, varying topic numbers, correlation types, and performing ablation studies to assess features such as group polarization and local consistency. Results show that multi-topic settings significantly alter polarization trends: positively correlated topics amplify echo chambers, negatively correlated topics inhibit them, and irrelevant topics also mitigate echo chamber effects through resource competition. Compared with numerical models, LLM-based agents realistically simulate dynamic opinion changes, reproduce linguistic features of news texts, and capture complex human reasoning, improving simulation interpretability and system stability.
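其中“信念衰减 + 话题耦合”可以有如下一种简化形式(假设性公式:信念向相关话题加权均值缓慢靠拢;相关矩阵与系数均为本文虚构):

```python
import numpy as np

def decay_beliefs(beliefs, topic_corr, decay=0.05):
    """beliefs: 各话题上的态度值;topic_corr: 话题相关矩阵,正相关互相牵引、负相关互斥。"""
    pulled = topic_corr @ beliefs / np.abs(topic_corr).sum(axis=1)
    return (1 - decay) * beliefs + decay * pulled

b = np.array([0.8, -0.2, 0.0])                 # 三个话题上的当前信念
C = np.array([[1.0, 0.6, 0.0],
              [0.6, 1.0, -0.3],
              [0.0, -0.3, 1.0]])
print(np.round(decay_beliefs(b, C), 3))
```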
zh
[AI-31] PricingLogic: Evaluating LLMs Reasoning on Complex Tourism Pricing Tasks
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂旅游定价场景下自动化价格计算的可靠性问题,特别是在存在多重重叠票价规则时,LLMs是否能准确理解并执行规则以避免财务损失。其解决方案的关键在于构建了首个专门针对此任务的基准测试集PricingLogic,包含300个基于真实定价政策的自然语言问题,涵盖两个难度层级:基础客户类型定价和涉及交互折扣的捆绑旅游计算。实验表明,尽管LLMs在一般任务中表现良好,但在高难度任务上性能显著下降,暴露出规则解析与算术运算中的系统性失败,凸显了在收入敏感型应用中部署LLMs前需引入额外保障机制或领域适配的重要性。
链接: https://arxiv.org/abs/2510.12409
作者: Yunuo Liu,Dawei Zhu,Zena Al-Khalili,Dai Cheng,Yanjun Chen,Dietrich Klakow,Wei Zhang,Xiaoyu Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We present PricingLogic, the first benchmark that probes whether Large Language Models (LLMs) can reliably automate tourism-related prices when multiple, overlapping fare rules apply. Travel agencies are eager to offload this error-prone task onto AI systems; however, deploying LLMs without verified reliability could result in significant financial losses and erode customer trust. PricingLogic comprises 300 natural-language questions based on booking requests derived from 42 real-world pricing policies, spanning two levels of difficulty: (i) basic customer-type pricing and (ii) bundled-tour calculations involving interacting discounts. Evaluations of a line of LLMs reveal a steep performance drop on the harder tier, exposing systematic failures in rule interpretation and arithmetic reasoning. Our results highlight that, despite their general capabilities, today's LLMs remain unreliable in revenue-critical applications without further safeguards or domain adaptation. Our code and dataset are available at this https URL.
zh
[AI-32] A Survey of Vibe Coding with Large Language Models
【速读】:该论文旨在解决生成式 AI (Generative AI) 驱动的“Vibe Coding”范式在实际应用中缺乏系统性理论支撑与实践框架的问题,尤其关注其在提升开发效率方面的有效性尚未得到充分验证,且存在人机协作中的根本性挑战。解决方案的关键在于通过构建一个形式化的约束马尔可夫决策过程(Constrained Markov Decision Process)来正式定义 Vibe Coding 的三元动态关系(人类开发者、软件项目与编码代理),并基于对超过1000篇文献的系统分析,提出首个涵盖编码代理基础设施、开发环境和反馈机制的完整生态系统框架;进一步提炼出五种典型开发模型(无约束自动化、迭代对话协作、规划驱动、测试驱动与上下文增强模型),强调成功实施 Vibe Coding 不仅依赖于代理能力,更取决于系统性的上下文工程、成熟的发展环境以及人机协同开发模式的设计。
链接: https://arxiv.org/abs/2510.12399
作者: Yuyao Ge,Lingrui Mei,Zenghao Duan,Tianhao Li,Yujia Zheng,Yiwei Wang,Lexin Wang,Jiayu Yao,Tianyu Liu,Yujun Cai,Baolong Bi,Fangda Guo,Jiafeng Guo,Shenghua Liu,Xueqi Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The advancement of large language models (LLMs) has catalyzed a paradigm shift from code generation assistance to autonomous coding agents, enabling a novel development methodology termed “Vibe Coding” where developers validate AI-generated implementations through outcome observation rather than line-by-line code comprehension. Despite its transformative potential, the effectiveness of this emergent paradigm remains under-explored, with empirical evidence revealing unexpected productivity losses and fundamental challenges in human-AI collaboration. To address this gap, this survey provides the first comprehensive and systematic review of Vibe Coding with large language models, establishing both theoretical foundations and practical frameworks for this transformative development approach. Drawing from systematic analysis of over 1000 research papers, we survey the entire vibe coding ecosystem, examining critical infrastructure components including LLMs for coding, LLM-based coding agent, development environment of coding agent, and feedback mechanisms. We first introduce Vibe Coding as a formal discipline by formalizing it through a Constrained Markov Decision Process that captures the dynamic triadic relationship among human developers, software projects, and coding agents. Building upon this theoretical foundation, we then synthesize existing practices into five distinct development models: Unconstrained Automation, Iterative Conversational Collaboration, Planning-Driven, Test-Driven, and Context-Enhanced Models, thus providing the first comprehensive taxonomy in this domain. Critically, our analysis reveals that successful Vibe Coding depends not merely on agent capabilities but on systematic context engineering, well-established development environments, and human-agent collaborative development models.
zh
[AI-33] (R)evolution of Programming: Vibe Coding as a Post-Coding Paradigm
【速读】:该论文试图解决当前软件开发实践中因生成式 AI(Generative AI)兴起而引发的范式转变问题,特别是如何理解开发者与 AI 系统之间新型互动模式对编程文化的影响。其解决方案的关键在于提出“直觉编码”(Vibe Coding, VC)这一新范式,强调开发者与 AI 之间的直觉性、情感驱动和即兴协作,通过五项主题维度(创造力、可持续性、编程未来、协作与批判)揭示 VC 与传统 AI 辅助开发(如 GitHub Copilot 支持的“共驾”模式)的本质差异,并以“共漂浮”(co-drifting)隐喻重构开发者角色,模糊专业与非开发者边界,从而推动对人机交互(HCI)和软件工程领域中编程文化演进的深入研究。
链接: https://arxiv.org/abs/2510.12364
作者: Kevin Krings,Nino S. Bohn,Thomas Ludwig
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Workshop Submission at the sixth decennial Aarhus conference in Workshop “The End of Programming (as we know it) - Envisioning Radical Re-Conceptualizations of Co-Coding with AI”
Abstract:Recent advancements in generative artificial intelligence (GenAI), particularly large language models, have introduced new possibilities for software development practices. In our paper we investigate the emerging Vibe Coding (VC) paradigm that emphasizes intuitive, affect-driven, and improvisational interactions between developers and AI systems. Building upon the discourse of End-User Development (EUD), we explore how VC diverges from conventional programming approaches such as those supported by tools like GitHub Copilot. Through five semi-structured interview sessions with ten experienced software practitioners, we identify five thematic dimensions: creativity, sustainability, the future of programming, collaboration, and criticism. Our analysis conceptualizes VC within the metaphor of co-drifting, contrasting it with the prevalent co-piloting perspective of AI-assisted development. We argue that VC reconfigures the developer's role, blurring the boundaries between professional developers and non-developers. While VC enables novel forms of expression and rapid prototyping, it also introduces challenges regarding reproducibility, scalability, and inclusivity. We propose that VC represents a meaningful shift in programming culture, warranting further investigation within human-computer interaction (HCI) and software engineering research.
zh
[AI-34] O-Forge: An LLM Computer Algebra Framework for Asymptotic Analysis
【速读】:该论文旨在解决生成式 AI(Generative AI)在研究级数学证明中应用受限的问题,核心瓶颈在于缺乏对模型生成内容的严格验证机制。解决方案的关键在于提出一个名为 LLM+CAS 的框架,并开发配套工具 O-Forge,通过将前沿大语言模型(Large Language Model, LLM)与计算机代数系统(Computer Algebra System, CAS)结合,在上下文符号反馈循环中实现创造性推理与符号验证的协同工作。该框架利用 LLM 提出域分解策略,由 CAS 对每个子域进行公理化验证,从而有效应对复杂渐近不等式的证明任务,显著提升了 AI 在专业数学研究中的实用性与可信度。
链接: https://arxiv.org/abs/2510.12350
作者: Ayush Khaitan,Vijay Ganesh
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models have recently demonstrated advanced capabilities in solving IMO and Putnam problems; yet their role in research mathematics has remained fairly limited. The key difficulty is verification: suggested proofs may look plausible, but cannot be trusted without rigorous checking. We present a framework, called LLM+CAS, and an associated tool, O-Forge, that couples frontier LLMs with a computer algebra system (CAS) in an In-Context Symbolic Feedback loop to produce proofs that are both creative and symbolically verified. Our focus is on asymptotic inequalities, a topic that often involves difficult proofs and appropriate decomposition of the domain into the “right” subdomains. Many mathematicians, including Terry Tao, have suggested that using AI tools to find the right decompositions can be very useful for research-level asymptotic analysis. In this paper, we show that our framework LLM+CAS turns out to be remarkably effective at proposing such decompositions via a combination of a frontier LLM and a CAS. More precisely, we use an LLM to suggest domain decomposition, and a CAS (such as Mathematica) that provides a verification of each piece axiomatically. Using this loop, we answer a question posed by Terence Tao: whether LLMs coupled with a verifier can be used to help prove intricate asymptotic inequalities. More broadly, we show how AI can move beyond contest math towards research-level tools for professional mathematicians.
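LLM+CAS 的 In-Context Symbolic Feedback 循环可以抽象为如下骨架(假设性接口:propose_decomposition 代表 LLM 的域分解提议,cas_verify 代表 CAS 的逐片符号验证;演示用的玩具函数与真实系统无关):

```python
def prove_inequality(propose_decomposition, cas_verify, max_iters=5):
    """LLM 提议子区域分解,CAS 逐片验证;失败信息回流给 LLM 再分解。"""
    feedback = None
    for _ in range(max_iters):
        regions = propose_decomposition(feedback)
        failures = [(r, info) for r in regions
                    for ok, info in [cas_verify(r)] if not ok]
        if not failures:
            return regions            # 每一片都通过公理化验证,证明完成
        feedback = failures           # 反例/失败信息作为下一轮上下文
    return None

demo_propose = lambda fb: ["x<=1", "x>1"] if fb else ["x>=0"]
demo_verify = lambda r: (r != "x>=0", "counterexample at x=2" if r == "x>=0" else None)
print(prove_inequality(demo_propose, demo_verify))  # ['x<=1', 'x>1']
```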
zh
[AI-35] Finite-time Convergence Analysis of Actor-Critic with Evolving Reward
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中演化奖励函数(evolving reward function)场景下的理论收敛性问题,尤其是针对单时间尺度的Actor-Critic算法在马尔可夫采样下的有限时间收敛分析。现有主流RL算法常采用奖励塑形(reward shaping)、熵正则化或课程学习(curriculum learning)等技术动态调整奖励函数,但其理论基础尚不完善。论文的关键贡献在于:在标准假设下,首次建立了Actor和Critic误差的非渐近界,证明当奖励参数变化足够缓慢时,仍可达到与静态奖励情形相同的 O(1/√T) 收敛速率;进一步地,若奖励通过有界梯度更新且与Actor和Critic处于同一时间尺度,则该收敛率依然成立,从而为诸多实际RL方法提供了理论支撑。此外,论文还提出了新的马尔可夫采样下的分布失配(distribution mismatch)分析框架,在静态奖励情况下将最优收敛速率提升了一个 log²T 因子。
链接: https://arxiv.org/abs/2510.12334
作者: Rui Hu,Yu Chen,Longbo Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Many popular practical reinforcement learning (RL) algorithms employ evolving reward functions-through techniques such as reward shaping, entropy regularization, or curriculum learning-yet their theoretical foundations remain underdeveloped. This paper provides the first finite-time convergence analysis of a single-timescale actor-critic algorithm in the presence of an evolving reward function under Markovian sampling. We consider a setting where the reward parameters may change at each time step, affecting both policy optimization and value estimation. Under standard assumptions, we derive non-asymptotic bounds for both actor and critic errors. Our result shows that an O(1/\sqrt{T}) convergence rate is achievable, matching the best-known rate for static rewards, provided the reward parameters evolve slowly enough. This rate is preserved when the reward is updated via a gradient-based rule with bounded gradient and on the same timescale as the actor and critic, offering a theoretical foundation for many popular RL techniques. As a secondary contribution, we introduce a novel analysis of distribution mismatch under Markovian sampling, improving the best-known rate by a factor of \log^2 T in the static-reward case.
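“三组参数同一时间尺度更新”的设置可以用一个玩具线性示例直观呈现(假设性示意,不构成论文算法或证明条件的忠实实现;奖励取线性形式、用 tanh 保证奖励梯度有界均为本文虚构):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, w, eta = np.zeros(4), np.zeros(4), np.zeros(4)  # actor / critic / 奖励参数
alpha, gamma, G_max = 0.01, 0.9, 1.0                   # 同一学习率 => 单时间尺度

for _ in range(1000):
    s, s_next = rng.normal(size=4), rng.normal(size=4)
    r = float(eta @ s)                                 # 奖励随 eta 演化(玩具线性形式)
    td = r + gamma * float(w @ s_next) - float(w @ s)  # TD 误差
    w     += alpha * td * s                            # critic:半梯度 TD 更新
    theta += alpha * td * s                            # actor:策略梯度的粗略代理
    g = np.tanh(s)                                     # 有界的奖励梯度(满足有界性假设)
    eta   += alpha * g / max(1.0, float(np.linalg.norm(g)) / G_max)
print(np.round(eta, 3))
```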
zh
[AI-36] Causal Inspired Multi Modal Recommendation
【速读】:该论文旨在解决多模态推荐系统中存在的两个关键偏差问题:一是模态混杂偏差(modal confounding),即潜在因素(如品牌风格或产品类别)同时驱动多个模态并影响用户偏好,导致虚假的特征-偏好关联;二是交互偏差(interaction bias),即真实用户偏好被曝光效应和偶然点击等噪声所干扰。解决方案的关键在于提出一种因果启发的多模态推荐框架,其核心创新包括:引入双通道跨模态扩散模块以识别隐藏的模态混杂因子,利用后门调整结合分层匹配与向量量化码本阻断混杂路径,并通过前门调整与因果拓扑重构构建去混杂的因果子图,从而在保持强可解释性的同时显著提升推荐性能。
链接: https://arxiv.org/abs/2510.12325
作者: Jie Yang,Chenyang Gu,Zixuan Liu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal recommender systems enhance personalized recommendations in e-commerce and online advertising by integrating visual, textual, and user-item interaction data. However, existing methods often overlook two critical biases: (i) modal confounding, where latent factors (e.g., brand style or product category) simultaneously drive multiple modalities and influence user preference, leading to spurious feature-preference associations; (ii) interaction bias, where genuine user preferences are mixed with noise from exposure effects and accidental clicks. To address these challenges, we propose a Causal-inspired multimodal Recommendation framework. Specifically, we introduce a dual-channel cross-modal diffusion module to identify hidden modal confounders, utilize back-door adjustment with hierarchical matching and vector-quantized codebooks to block confounding paths, and apply front-door adjustment combined with causal topology reconstruction to build a deconfounded causal subgraph. Extensive experiments on three real-world e-commerce datasets demonstrate that our method significantly outperforms state-of-the-art baselines while maintaining strong interpretability.
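论文所依赖的后门调整可用如下标准公式示意(其中混杂变量 Z 按论文思路离散化为向量量化码本条目;此处仅为示意性写法,非论文的完整推导):

```latex
% Back-door adjustment over a latent modal confounder Z (e.g., brand style):
% conditioning on Z blocks the spurious path from modal features X to
% user preference Y, with z ranging over vector-quantized codebook entries.
P\bigl(Y \mid \mathrm{do}(X)\bigr) \;=\; \sum_{z} P\bigl(Y \mid X,\, Z = z\bigr)\, P(Z = z)
```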
zh
[AI-37] RAG-Anything: All-in-One RAG Framework
【速读】:该论文旨在解决当前检索增强生成(Retrieval-Augmented Generation, RAG)框架在处理多模态知识库时存在的根本性局限问题。现有RAG方法仅支持文本内容的检索与生成,而现实世界中的知识库通常包含文本、图像、表格和数学表达式等多种模态信息,导致传统RAG在面对跨模态证据推理时性能严重受限。其解决方案的关键在于提出RAG-Anything框架,该框架将多模态内容统一建模为相互关联的知识实体,并通过双图结构构建技术同时捕捉跨模态关系与文本语义信息;进一步引入跨模态混合检索机制,融合结构化知识导航与语义匹配策略,从而实现对异构多模态内容的有效推理与访问。该方法显著提升了长文档场景下的性能表现,解决了传统RAG因架构碎片化而导致的多模态知识获取瓶颈。
链接: https://arxiv.org/abs/2510.12323
作者: Zirui Guo,Xubin Ren,Lingrui Xu,Jiahao Zhang,Chao Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a fundamental paradigm for expanding Large Language Models beyond their static training limitations. However, a critical misalignment exists between current RAG capabilities and real-world information environments. Modern knowledge repositories are inherently multimodal, containing rich combinations of textual content, visual elements, structured tables, and mathematical expressions. Yet existing RAG frameworks are limited to textual content, creating fundamental gaps when processing multimodal documents. We present RAG-Anything, a unified framework that enables comprehensive knowledge retrieval across all modalities. Our approach reconceptualizes multimodal content as interconnected knowledge entities rather than isolated data types. The framework introduces dual-graph construction to capture both cross-modal relationships and textual semantics within a unified representation. We develop cross-modal hybrid retrieval that combines structural knowledge navigation with semantic matching. This enables effective reasoning over heterogeneous content where relevant evidence spans multiple modalities. RAG-Anything demonstrates superior performance on challenging multimodal benchmarks, achieving significant improvements over state-of-the-art methods. Performance gains become particularly pronounced on long documents where traditional approaches fail. Our framework establishes a new paradigm for multimodal knowledge access, eliminating the architectural fragmentation that constrains current systems. Our framework is open-sourced at: this https URL.
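下面给出"结构化知识导航 + 语义匹配"混合检索打分的一个简化示意(假设性草图:以一跳结构邻近度与余弦相似度线性加权,具体融合方式以论文为准):

```python
import numpy as np

def hybrid_retrieve(query_vec, nodes, adjacency, seed_ids, alpha=0.5, k=5):
    """Toy cross-modal hybrid retrieval: blend semantic similarity with
    one-hop structural proximity to seed entities in a knowledge graph.

    nodes: dict id -> embedding (text/image/table chunks share one space).
    adjacency: dict id -> set of neighbor ids (the knowledge graph).
    seed_ids: entities already matched to the query (e.g., by keyword).
    """
    scores = {}
    for nid, emb in nodes.items():
        sem = float(emb @ query_vec /
                    (np.linalg.norm(emb) * np.linalg.norm(query_vec) + 1e-9))
        # Structural score: fraction of seed entities this node touches.
        struct = sum(nid in adjacency.get(s, set()) or nid == s
                     for s in seed_ids) / max(len(seed_ids), 1)
        scores[nid] = alpha * sem + (1 - alpha) * struct
    return sorted(scores, key=scores.get, reverse=True)[:k]
```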
zh
[AI-38] Deep SPI: Safe Policy Improvement via World Models
【速读】:该论文旨在解决在线强化学习(Online Reinforcement Learning, RL)中策略改进的安全性问题,尤其是在使用世界模型(World Model)和表征学习(Representation Learning)的场景下,如何在保证理论安全性的同时实现高效策略优化。其解决方案的关键在于构建一个理论框架,证明在当前策略的特定邻域内进行局部策略更新可确保单调改进与收敛;同时将状态转移和奖励预测损失与表征质量关联起来,从而推导出适用于深度神经网络的在线版安全策略改进(Safe Policy Improvement, SPI)定理。基于此理论,作者提出DeepSPI算法,通过耦合局部转移与奖励损失以及正则化策略更新,在ALE-57基准上实现了性能媲美甚至超越PPO和DeepMDPs等强基线方法,且保留了严格的理论保障。
链接: https://arxiv.org/abs/2510.12312
作者: Florent Delgrange,Raphael Avalos,Willem Röpke
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages main text, 17 pages appendix (excluding references)
Abstract:Safe policy improvement (SPI) offers theoretical control over policy updates, yet existing guarantees largely concern offline, tabular reinforcement learning (RL). We study SPI in general online settings, when combined with world model and representation learning. We develop a theoretical framework showing that restricting policy updates to a well-defined neighborhood of the current policy ensures monotonic improvement and convergence. This analysis links transition and reward prediction losses to representation quality, yielding online, “deep” analogues of classical SPI theorems from the offline RL literature. Building on these results, we introduce DeepSPI, a principled on-policy algorithm that couples local transition and reward losses with regularised policy updates. On the ALE-57 benchmark, DeepSPI matches or exceeds strong baselines, including PPO and DeepMDPs, while retaining theoretical guarantees.
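下面以单个离散状态为例,给出"限制在当前策略邻域内更新"这一核心思想的玩具实现(假设性草图:KL 球半径与更新规则均为演示设定,并非 DeepSPI 的完整算法):

```python
import numpy as np

def spi_update(pi, advantages, kl_radius=0.05, step=0.5):
    """Toy safe-policy-improvement step for one discrete state: move the
    policy toward an advantage-weighted exponential tilt, but only accept
    the update if it stays inside a KL ball around the current policy,
    i.e., the 'well-defined neighborhood' the theory requires."""
    target = pi * np.exp(step * advantages)
    target /= target.sum()
    kl = float(np.sum(target * np.log(target / pi)))
    return target if kl <= kl_radius else pi  # reject out-of-neighborhood moves

pi = np.array([0.25, 0.25, 0.25, 0.25])
adv = np.array([0.3, -0.1, 0.0, -0.2])
print(spi_update(pi, adv))
```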
zh
[AI-39] Quantum Annealing for Staff Scheduling in Educational Environments
【速读】:该论文旨在解决多校区、多层次教育机构中的人员分配问题(staff allocation problem),即在确保教师可用性、专业能力与公平性的约束下,将工作人员合理分配至幼儿园、小学和中学。其解决方案的关键在于构建一个优化模型,并采用量子退火(quantum annealing)方法进行求解;实验结果表明,该方法能够在较短时间内生成均衡的人员分配方案,验证了量子优化技术在教育调度及复杂资源分配任务中的实际应用潜力。
链接: https://arxiv.org/abs/2510.12278
作者: Alessia Ciacco,Francesca Guerriero,Eneko Osaba
机构: 未知
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI)
备注: 8 pages, 3 tables, and 1 figure. Paper submitted to the International Conference on Quantum Communications, Networking, and Computing (QCNC 2026)
Abstract:We address a novel staff allocation problem that arises in the organization of collaborators among multiple school sites and educational levels. The problem emerges from a real case study in a public school in Calabria, Italy, where staff members must be distributed across kindergartens, primary, and secondary schools under constraints of availability, competencies, and fairness. To tackle this problem, we develop an optimization model and investigate a solution approach based on quantum annealing. Our computational experiments on real-world data show that quantum annealing is capable of producing balanced assignments in short runtimes. These results provide evidence of the practical applicability of quantum optimization methods in educational scheduling and, more broadly, in complex resource allocation tasks.
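人员分配问题在送入量子退火器之前通常要先写成 QUBO 形式;下面给出一个简化的 QUBO 构造示意(假设性草图:人数、容量与惩罚系数均为演示设定,并非论文中的真实校区数据):

```python
import itertools

# Toy QUBO for assigning staff to school sites: x[i, j] = 1 if staff i
# works at site j. Each one-hot row enforces "exactly one site per staff
# member"; a second quadratic term penalizes deviation from site capacity.
n_staff, n_sites = 6, 3
capacity = [2, 2, 2]
penalty = 4.0

def var(i, j):  # flatten (staff, site) into a single QUBO index
    return i * n_sites + j

Q = {}
for i in range(n_staff):
    # (sum_j x_ij - 1)^2 expands (with x^2 = x) to -x_ij and pairwise +2.
    for j in range(n_sites):
        Q[(var(i, j), var(i, j))] = Q.get((var(i, j), var(i, j)), 0) - penalty
    for j1, j2 in itertools.combinations(range(n_sites), 2):
        Q[(var(i, j1), var(i, j2))] = 2 * penalty
for j in range(n_sites):
    # (sum_i x_ij - capacity_j)^2, dropping the additive constant.
    for i in range(n_staff):
        key = (var(i, j), var(i, j))
        Q[key] = Q.get(key, 0) + penalty * (1 - 2 * capacity[j])
    for i1, i2 in itertools.combinations(range(n_staff), 2):
        key = (var(i1, j), var(i2, j))
        Q[key] = Q.get(key, 0) + 2 * penalty
# Q can now be handed to an annealer (e.g., a D-Wave sampler) or to a
# classical simulated-annealing fallback for comparison.
```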
zh
[AI-40] TFGA-Net: Temporal-Frequency Graph Attention Network for Brain-Controlled Speaker Extraction
【速读】:该论文旨在解决脑控语音分离中如何有效利用听者EEG信号与目标语音之间的共性信息这一难题。其核心挑战在于如何从EEG信号中提取具有判别性的时空特征,并充分利用其非欧几里得结构以捕捉全局语义信息,同时保留语音的节奏和韵律特性。解决方案的关键在于提出了一种名为TFGA-Net的模型:首先通过多尺度时频特征提取与选择性激活的皮层拓扑结构增强EEG表征;其次在EEG编码器中融合图卷积网络(Graph Convolutional Networks, GCNs)与自注意力机制(Self-Attention Mechanism),以建模EEG信号的非欧空间结构并捕获全局依赖关系;最后引入结合MossFormer和无RNN循环结构的分离模块(RNN-Free Recurrent),实现对融合后的EEG与语音特征的有效分离,从而提升目标说话人语音提取性能。
链接: https://arxiv.org/abs/2510.12275
作者: Youhao Si,Yuan Liao,Qiushi Han,Yuhang Yang,Rui Dai,Liya Huang
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注: 5 pages, 3 figures
Abstract:The rapid development of auditory attention decoding (AAD) based on electroencephalography (EEG) signals offers the possibility of EEG-driven target speaker extraction. However, how to effectively utilize the target-speaker information shared between EEG and speech remains an unresolved problem. In this paper, we propose a model for brain-controlled speaker extraction, which utilizes the EEG recorded from the listener to extract the target speech. In order to effectively extract information from EEG signals, we derive multi-scale time-frequency features and further incorporate cortical topological structures that are selectively engaged during the task. Moreover, to effectively exploit the non-Euclidean structure of EEG signals and capture their global features, graph convolutional networks and a self-attention mechanism are used in the EEG encoder. In addition, to make full use of the fused EEG and speech features, preserve global context, and capture speech rhythm and prosody, we introduce MossFormer2, which combines MossFormer with an RNN-free recurrent module, as the separator. Experimental results on both the public Cocktail Party and KUL datasets show that our TFGA-Net model significantly outperforms the state-of-the-art method in certain objective evaluation metrics. The source code is available at: this https URL.
zh
[AI-41] Tensor Logic: The Language of AI
【速读】:该论文旨在解决当前人工智能(AI)发展中缺乏一种兼具自动微分、高效GPU计算、可扩展性、学习能力以及自动化推理和知识获取功能的编程语言的问题。现有工具如PyTorch和TensorFlow虽支持自动微分与GPU加速,但基于Python这一非AI原生语言,难以有效整合符号推理;而传统AI语言如LISP和Prolog则缺乏可扩展性和学习能力。论文提出“张量逻辑”(tensor logic)作为解决方案,其核心在于将神经网络与符号AI在基础层面上统一:所有计算均可归约为张量方程(tensor equation),这源于一个关键观察——逻辑规则与爱因斯坦求和(Einstein summation)本质上是同一操作。由此,张量逻辑不仅优雅地实现了Transformer、形式推理、核机器和图模型等主流AI范式,还开辟了嵌入空间中可靠推理等新方向,从而融合了神经网络的可扩展性与学习能力及符号推理的可靠性与透明性。
链接: https://arxiv.org/abs/2510.12269
作者: Pedro Domingos
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Programming Languages (cs.PL); Machine Learning (stat.ML)
备注: 17 pages, 0 figures
Abstract:Progress in AI is hindered by the lack of a programming language with all the requisite features. Libraries like PyTorch and TensorFlow provide automatic differentiation and efficient GPU implementation, but are additions to Python, which was never intended for AI. Their lack of support for automated reasoning and knowledge acquisition has led to a long and costly series of hacky attempts to tack them on. On the other hand, AI languages like LISP and Prolog lack scalability and support for learning. This paper proposes tensor logic, a language that solves these problems by unifying neural and symbolic AI at a fundamental level. The sole construct in tensor logic is the tensor equation, based on the observation that logical rules and Einstein summation are essentially the same operation, and all else can be reduced to them. I show how to elegantly implement key forms of neural, symbolic and statistical AI in tensor logic, including transformers, formal reasoning, kernel machines and graphical models. Most importantly, tensor logic makes new directions possible, such as sound reasoning in embedding space. This combines the scalability and learnability of neural networks with the reliability and transparency of symbolic reasoning, and is potentially a basis for the wider adoption of AI.
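"逻辑规则即爱因斯坦求和"这一核心观察可以用 NumPy 的 einsum 直接演示;如下为一个 Datalog 风格规则的张量化小例(示意性写法,并非论文提出的张量逻辑语言本身):

```python
import numpy as np

# Tensor-logic flavor: a Datalog-style rule as an Einstein summation.
# Rule: Grandparent(x, z) <- Parent(x, y), Parent(y, z).
people = ["ann", "bob", "cy", "dee"]
Parent = np.zeros((4, 4))
Parent[0, 1] = 1  # ann is a parent of bob
Parent[1, 2] = 1  # bob is a parent of cy
Parent[2, 3] = 1  # cy is a parent of dee

# Join on the shared variable y, then project it out: exactly einsum.
Grandparent = np.einsum("xy,yz->xz", Parent, Parent) > 0
print([(people[x], people[z]) for x, z in zip(*np.nonzero(Grandparent))])
# [('ann', 'cy'), ('bob', 'dee')]
```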
zh
[AI-42] HiLoRA: Adaptive Hierarchical LoRA Routing for Training-Free Domain Generalization
【速读】:该论文旨在解决现有低秩适配(Low-Rank Adaptation, LoRA)方法在领域泛化(domain generalization)中面临的两大问题:一是依赖显式任务标签或额外训练,难以部署;二是通常激活固定数量的完整LoRA模块,导致参数冗余或不足,影响性能。解决方案的关键在于提出一种无需训练的框架HiLoRA,其核心创新是基于LoRA结构特性定义了秩一组件(rank-one components, ROCs),并设计了一种分层自适应路由机制:首先在序列层面根据高斯似然性选择子集LoRA及其ROC分配,再在token层面仅激活最信息量的ROC,从而实现精准、高效的LoRA组合使用。理论分析表明,该方法以高概率选取最相关的LoRA模块,实验验证其在领域泛化上显著优于当前最优基线(准确率提升最高达55%),同时保持相近的推理吞吐量。
链接: https://arxiv.org/abs/2510.12266
作者: Ziyi Han,Huanyu Wang,Zeyu Zhang,Xiangxiang Dai,Xutong Liu,John C.S. Lui
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Low-Rank Adaptation (LoRA) has emerged as a widely used technique for adapting large language models (LLMs) to new domains, due to its modular design and broad availability on platforms such as HuggingFace. This availability has motivated efforts to reuse existing LoRAs for domain generalization. However, existing methods often rely on explicit task labels or additional training, which are impractical for deployment. Moreover, they typically activate a fixed number of entire LoRA modules, leading to parameter redundancy or insufficiency that degrade performance. In this paper, we propose HiLoRA, a training-free framework that performs adaptive hierarchical routing over LoRA pools. Drawing on structural properties of LoRA, we define rank-one components (ROCs), in which each rank parameter is regarded as an independent unit. For a given input sequence, HiLoRA first adaptively selects a subset of LoRAs and determines their ROC allocation based on Gaussian likelihoods at the sequence level. At the token level, it further refines routing by activating only the most informative ROCs. We further provide theoretical guarantees that HiLoRA selects the most relevant LoRAs with high probability. Extensive experiments show that HiLoRA achieves substantial improvements in domain generalization, with accuracy gains of up to 55% over state-of-the-art baselines, while maintaining comparable inference throughput.
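下面给出序列级高斯似然路由的一个极简示意(假设性草图:lora_stats 中的域均值/方差为虚构统计量,token 级的 ROC 精细路由未包含在内):

```python
import numpy as np

def select_loras(seq_emb, lora_stats, top_m=2):
    """Toy sequence-level routing: score each LoRA by a diagonal-Gaussian
    log-likelihood of the input embedding under that LoRA's training
    domain, then keep the top-m. `lora_stats` maps a LoRA name to a
    (mean, variance) pair, an illustrative assumption rather than the
    paper's exact statistics."""
    scores = {}
    for name, (mu, var) in lora_stats.items():
        scores[name] = -0.5 * np.sum((seq_emb - mu) ** 2 / var + np.log(var))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_m], scores

# Usage: two LoRAs with synthetic domain statistics.
rng = np.random.default_rng(1)
stats = {"legal": (np.zeros(8), np.ones(8)),
         "medical": (np.full(8, 2.0), np.ones(8))}
chosen, _ = select_loras(rng.normal(2.0, 1.0, size=8), stats)
print(chosen)  # likely ['medical', 'legal']
```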
zh
[AI-43] Human-in-the-Loop Bandwidth Estimation for Quality of Experience Optimization in Real-Time Video Communication AAAI
【速读】:该论文旨在解决实时通信中带宽估计的难题,其核心挑战在于网络架构快速演进、协议栈日益复杂,以及缺乏能够可靠提升用户体验(Quality of Experience, QoE)的量化指标。为应对这些问题,作者提出了一种部署在真实场景中的“人在回路”(human-in-the-loop)数据驱动框架:首先基于主观用户评价训练客观QoE奖励模型以实时评估音视频质量;随后利用约100万条来自微软Teams实际通话的网络轨迹与对应的QoE奖励构建训练数据集;最终引入一种新颖的分布式离线强化学习(distributional offline reinforcement learning, offline RL)算法,训练神经网络带宽估计器以优化用户QoE。实验表明,该方法相较基线带宽估计器可将主观差评通话比例降低11.41%,且所提离线RL算法在D4RL基准任务上也展现出良好泛化能力。
链接: https://arxiv.org/abs/2510.12265
作者: Sami Khairy,Gabriel Mittag,Vishak Gopal,Ross Cutler
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
备注: Accepted for publication in the proceedings of the AAAI Conference on Artificial Intelligence 2026 (IAAI Technical Track on Deployed Highly Innovative Applications of AI)
Abstract:The quality of experience (QoE) delivered by video conferencing systems is significantly influenced by accurately estimating the time-varying available bandwidth between the sender and receiver. Bandwidth estimation for real-time communications remains an open challenge due to rapidly evolving network architectures, increasingly complex protocol stacks, and the difficulty of defining QoE metrics that reliably improve user experience. In this work, we propose a deployed, human-in-the-loop, data-driven framework for bandwidth estimation to address these challenges. Our approach begins with training objective QoE reward models derived from subjective user evaluations to measure audio and video quality in real-time video conferencing systems. Subsequently, we collect roughly 1 M network traces with objective QoE rewards from real-world Microsoft Teams calls to curate a bandwidth estimation training dataset. We then introduce a novel distributional offline reinforcement learning (RL) algorithm to train a neural-network-based bandwidth estimator aimed at improving QoE for users. Our real-world A/B test demonstrates that the proposed approach reduces the subjective poor call ratio by 11.41% compared to the baseline bandwidth estimator. Furthermore, the proposed offline RL algorithm is benchmarked on D4RL tasks to demonstrate its generalization beyond bandwidth estimation.
zh
[AI-44] T³: Reducing Belief Deviation in Reinforcement Learning for Active Reasoning
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在主动推理(Active Reasoning)过程中因信念跟踪(Belief Tracking)能力不足而导致的信念偏差(Belief Deviation)问题,即模型难以准确维护对问题状态和缺失信息的理解,从而引发无效或重复动作,导致强化学习(Reinforcement Learning, RL)训练无法正确赋予探索性步骤应有的奖励。解决方案的关键在于提出 T³ 方法——一种简单而有效的轨迹截断机制,通过检测并终止过度信念偏离的推理轨迹,保留具有信息量的前缀部分以确保奖励信号的有效传递,从而系统性提升策略优化的稳定性与效率。
链接: https://arxiv.org/abs/2510.12264
作者: Deyu Zou,Yongqiang Chen,Jianxiang Wang,Haochen Yang,Mufei Li,James Cheng,Pan Li,Yu Gong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Active reasoning requires large language models (LLMs) to interact with external sources and strategically gather information to solve problems. Central to this process is belief tracking: maintaining a coherent understanding of the problem state and the missing information toward the solution. However, due to limited reasoning capabilities, LLM-based agents often suffer from belief deviation: they struggle to correctly model beliefs, lose track of problem states, and fall into uninformative or repetitive actions. Once this happens, errors compound and reinforcement learning (RL) training fails to properly credit the crucial exploratory steps. To address this issue, we propose to track the deviation of model beliefs and develop T³, a simple yet effective method that detects excessive belief deviation and truncates trajectories during training to remove uninformative tails. By preserving credit for informative prefixes, T³ systematically improves policy optimization. Across 5 challenging tasks, T³ consistently enhances training stability, token efficiency, and final performance, achieving up to 30% gains while cutting rollout tokens by roughly 25%. These results highlight belief control as a key principle for developing robust and generalizable LLM-based active reasoners.
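轨迹截断机制可用如下玩具代码示意(假设性草图:信念偏差打分函数视为已给定,阈值与容忍步数均为演示参数):

```python
def truncate_on_belief_deviation(trajectory, deviation, threshold=0.6, patience=2):
    """Toy trajectory truncation: drop the tail of a reasoning rollout once
    the per-step belief-deviation score stays above a threshold for
    `patience` consecutive steps, preserving credit for the informative
    prefix. The deviation-scoring function itself is assumed given."""
    run = 0
    for t, score in enumerate(deviation):
        run = run + 1 if score > threshold else 0
        if run >= patience:
            return trajectory[: t - patience + 1]  # keep informative prefix
    return trajectory  # no excessive deviation detected

steps = ["ask symptom", "ask history", "repeat question", "repeat question", "guess"]
scores = [0.1, 0.2, 0.7, 0.8, 0.9]
print(truncate_on_belief_deviation(steps, scores))
# ['ask symptom', 'ask history']
```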
zh
[AI-45] Diffusion Models for Reinforcement Learning: Foundations Taxonomy and Development
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)中面临的诸多挑战,包括策略表达能力有限、训练不稳定以及缺乏高效的轨迹级规划能力等问题。其解决方案的核心在于系统性地整合扩散模型(Diffusion Models, DMs)到RL框架中,利用DMs在多模态表示、稳定训练和轨迹级规划方面的优势,构建了基于功能导向与技术导向的双轴分类体系,从而清晰刻画DM在RL流程中的角色及其在在线与离线学习场景下的实现方式,并推动从单智能体向多智能体领域的扩展应用。
链接: https://arxiv.org/abs/2510.12253
作者: Changfu Xu,Jianxiong Guo,Yuzhu Liang,Haiyang Huang,Haodong Zou,Xi Zheng,Shui Yu,Xiaowen Chu,Jiannong Cao,Tian Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Under Review
Abstract:Diffusion Models (DMs), as a leading class of generative models, offer key advantages for reinforcement learning (RL), including multi-modal expressiveness, stable training, and trajectory-level planning. This survey delivers a comprehensive and up-to-date synthesis of diffusion-based RL. We first provide an overview of RL, highlighting its challenges, and then introduce the fundamental concepts of DMs, investigating how they are integrated into RL frameworks to address key challenges in this research field. We establish a dual-axis taxonomy that organizes the field along two orthogonal dimensions: a function-oriented taxonomy that clarifies the roles DMs play within the RL pipeline, and a technique-oriented taxonomy that situates implementations across online versus offline learning regimes. We also provide a comprehensive examination of this progression from single-agent to multi-agent domains, thereby forming several frameworks for DM-RL integration and highlighting their practical utility. Furthermore, we outline several categories of successful applications of diffusion-based RL across diverse domains, discuss open research issues of current methodologies, and highlight key directions for future research to advance the field. Finally, we summarize the survey to identify promising future development directions. We are actively maintaining a GitHub repository (this https URL) for papers and other related resources to apply DMs for RL.
zh
[AI-46] PromptLocate: Localizing Prompt Injection Attacks
【速读】:该论文旨在解决生成式 AI(Generative AI)在遭遇提示注入攻击(prompt injection attack)后,难以准确定位被污染数据中注入提示(injected prompt)的具体位置这一问题。解决方案的关键在于提出 PromptLocate,其核心由三个步骤构成:首先将污染数据分割为语义连贯的片段;其次识别出包含注入指令的片段;最后精确定位包含注入数据的片段。该方法首次实现了对多种现有及自适应攻击下注入提示的精准定位,为事后取证分析与数据恢复提供了有效支持。
链接: https://arxiv.org/abs/2510.12252
作者: Yuqi Jia,Yupei Liu,Zedian Shao,Jinyuan Jia,Neil Gong
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: To appear in IEEE Symposium on Security and Privacy, 2026
Abstract:Prompt injection attacks deceive a large language model into completing an attacker-specified task instead of its intended task by contaminating its input data with an injected prompt, which consists of injected instruction(s) and data. Localizing the injected prompt within contaminated data is crucial for post-attack forensic analysis and data recovery. Despite its growing importance, prompt injection localization remains largely unexplored. In this work, we bridge this gap by proposing PromptLocate, the first method for localizing injected prompts. PromptLocate comprises three steps: (1) splitting the contaminated data into semantically coherent segments, (2) identifying segments contaminated by injected instructions, and (3) pinpointing segments contaminated by injected data. We show PromptLocate accurately localizes injected prompts across eight existing and eight adaptive attacks.
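下面按论文的三步流程给出一个简化的定位管线示意(假设性草图:is_instruction 与 is_off_task 为假想的分类器回调,分割粒度也简化为句子级):

```python
import re

def locate_injected_prompt(contaminated: str, is_instruction, is_off_task):
    """Toy three-step localization in the spirit of PromptLocate.
    `is_instruction(seg)` and `is_off_task(seg)` are assumed classifiers
    (e.g., LLM- or detector-backed), injected here as callables."""
    # Step 1: split into semantically coherent segments (sentences here;
    # the paper uses a more careful segmentation).
    segments = [s.strip()
                for s in re.split(r"(?<=[.!?])\s+", contaminated) if s.strip()]
    # Step 2: flag segments contaminated by injected instructions.
    instruction_segs = [i for i, s in enumerate(segments) if is_instruction(s)]
    # Step 3: flag remaining segments carrying the injected data payload.
    data_segs = [i for i, s in enumerate(segments)
                 if i not in instruction_segs and is_off_task(s)]
    return instruction_segs, data_segs, segments
```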
zh
[AI-47] PromptFlow: Training Prompts Like Neural Networks
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在跨领域部署时因缺乏领域适应性而导致性能下降的问题,尤其是传统提示工程(Prompt Engineering, PE)依赖人工设计、效率低且难以泛化的问题。其核心挑战包括:静态更新策略导致对不同NLP任务适应性不足、整体式提示更新缺乏细粒度编辑能力,以及经验复用机制缺失。解决方案的关键在于提出PromptFlow框架——一个受TensorFlow启发的模块化训练架构,集成元提示(meta-prompts)、操作符(operators)、优化器与评估器,并引入基于梯度的元学习方法自动探索最优提示优化路径;同时创新性地采用强化学习机制实现LLM在提示工程过程中的经验回收利用,从而显著提升提示生成的自动化程度与适应性,且仅需少量任务特定数据即可实现高效微调。
链接: https://arxiv.org/abs/2510.12246
作者: Jingyi Wang,Hongyuan Zhu,Ye Niu,Yunhui Deng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 14 figures, conference submission, appendix included
Abstract:Large Language Models (LLMs) have demonstrated a profound impact on Natural Language Processing (NLP) tasks. However, their effective deployment across diverse domains often requires domain-specific adaptation strategies, as generic models may underperform when faced with specialized data distributions. Recent advances in prompt engineering (PE) offer a promising alternative to extensive retraining by refining input instructions to align LLM outputs with task objectives. This paradigm has emerged as a rapid and versatile approach to model fine-tuning. Despite its potential, manual prompt design remains labor-intensive and heavily dependent on specialized expertise, often requiring iterative human effort to achieve optimal formulations. To address this limitation, automated prompt engineering methodologies have been developed to systematically generate task-specific prompts. However, current implementations predominantly employ static update rules and lack mechanisms for dynamic strategy selection, resulting in suboptimal adaptation to varying NLP task requirements. Furthermore, most methods update whole prompts at each step, without considering edits to prompt sections at a finer granularity. Finally, the problem of how to recycle experience in LLMs remains underexplored. To this end, we propose PromptFlow, a modular training framework inspired by TensorFlow, which integrates meta-prompts, operators, optimizers, and evaluators. Our framework can be equipped with the latest optimization methods and autonomously explores optimal prompt refinement trajectories through gradient-based meta-learning, requiring minimal task-specific training data. Specifically, we devise a reinforcement learning method to recycle experience for LLMs in the PE process. Finally, we conduct extensive experiments on various datasets and demonstrate the effectiveness of PromptFlow.
zh
[AI-48] MoRA: On-the-fly Molecule-aware Low-Rank Adaptation Framework for LLM -based Multi-Modal Molecular Assistant
【速读】:该论文旨在解决如何有效将分子图结构与大语言模型(Large Language Models, LLMs)进行多模态对齐的问题,尤其是在药物发现领域中。现有方法通常通过微调LLM或添加静态适配器来实现,但存在两个关键局限:一是优化全局共享参数空间,难以捕捉每个分子实例的特异性结构特征;二是微调过程易引发灾难性遗忘,削弱LLM的通用推理能力。解决方案的关键在于提出一种基于分子感知的低秩适配(Molecule-aware Low-Rank Adaptation, MoRA),该方法为每个输入分子图动态生成一组独特的低秩适配权重,并将其注入冻结的LLM中,从而实现实例级的参数空间对齐,既保留了LLM的核心知识,又增强了其针对特定分子结构的适应能力。
链接: https://arxiv.org/abs/2510.12245
作者: Tao Yin,Xiaohong Zhang,Jiacheng Zhang,Li Huang,Zhibin Zhang,Yuansong Zeng,Jin Xie,Meng Yan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Effectively integrating molecular graph structures with Large Language Models (LLMs) is a key challenge in drug discovery. Most existing multi-modal alignment methods typically process these structures by fine-tuning the LLM or adding a static adapter simultaneously. However, these approaches have two main limitations: (1) it optimizes a shared parameter space across all molecular inputs, limiting the model’s ability to capture instance-specific structural features; and (2) fine-tuning the LLM for molecular tasks can lead to catastrophic forgetting, undermining its general reasoning capabilities. In this paper, instead of static task-oriented adaptation, we propose an instance-specific parameter space alignment approach for each molecule on-the-fly. To this end, we introduce Molecule-aware Low-Rank Adaptation (MoRA) that produces a unique set of low-rank adaptation weights for each input molecular graph. These weights are then dynamically injected into a frozen LLM, allowing the model to adapt its reasoning to the structure of each molecular input, while preserving the LLM’s core knowledge. Extensive experiments demonstrate that on key molecular tasks, such as chemical reaction prediction and molecular captioning, MoRA’s instance-specific dynamic adaptation outperforms statically adapted baselines, including a 14.1% relative improvement in reaction prediction exact match and a 22% reduction in error for quantum property prediction. The code is available at this https URL.
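下面用 PyTorch 勾勒"由分子图嵌入即时生成实例级 LoRA 权重并注入冻结线性层"的思路(假设性草图:超网络结构与各维度均为演示设定,并非论文的原始实现):

```python
import torch
import torch.nn as nn

class MoRALinear(nn.Module):
    """Sketch: a frozen linear layer plus instance-specific low-rank
    weights generated on-the-fly from a molecular-graph embedding.
    Dimensions and the hypernetwork design are illustrative guesses."""
    def __init__(self, d_model=64, d_graph=32, rank=4):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)
        self.base.requires_grad_(False)          # frozen LLM weights
        # Hypernetwork maps a graph embedding to LoRA factors A and B.
        self.to_A = nn.Linear(d_graph, rank * d_model)
        self.to_B = nn.Linear(d_graph, d_model * rank)
        self.rank, self.d_model = rank, d_model

    def forward(self, x, graph_emb):
        # x: (batch, seq, d_model); graph_emb: (batch, d_graph)
        A = self.to_A(graph_emb).view(-1, self.rank, self.d_model)
        B = self.to_B(graph_emb).view(-1, self.d_model, self.rank)
        delta = torch.einsum("bsd,brd->bsr", x, A)      # x @ A^T
        delta = torch.einsum("bsr,bdr->bsd", delta, B)  # (x A^T) B^T
        return self.base(x) + delta

layer = MoRALinear()
out = layer(torch.randn(2, 10, 64), torch.randn(2, 32))
print(out.shape)  # torch.Size([2, 10, 64])
```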
zh
[AI-49] MedKGEval: A Knowledge Graph-Based Multi-Turn Evaluation Framework for Open-Ended Patient Interactions with Clinical LLM s
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在医疗应用场景中可靠评估的难题,尤其是如何有效捕捉真实临床环境中多轮医患交互的动态性和情境敏感性,以及患者信息需求的演变过程。传统评估方法依赖于对话全文的事后审查,忽视了医疗对话的实时性与上下文依赖特性。其解决方案的关键在于提出MedKGEval框架,该框架基于结构化医学知识构建,包含三个核心创新:(1) 基于知识图谱的患者仿真机制,通过控制模块从定制的知识图谱中检索相关医学事实,使患者代理具备类人且真实的对话行为;(2) 在线、逐轮评估机制,由判别代理在对话进行中逐轮评估模型响应的临床适当性、事实正确性和安全性,采用细粒度任务特异性指标;(3) 构建涵盖八种先进LLM的多轮基准测试集,验证了该框架能识别传统评估流程常忽略的细微行为缺陷和安全风险。
链接: https://arxiv.org/abs/2510.12224
作者: Yuechun Yu,Han Ying,Haoan Jin,Wenjian Jiang,Dong Xian,Binghao Wang,Zhou Yang,Mengyue Wu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The reliable evaluation of large language models (LLMs) in medical applications remains an open challenge, particularly in capturing the complexity of multi-turn doctor-patient interactions that unfold in real clinical environments. Existing evaluation methods typically rely on post hoc review of full conversation transcripts, thereby neglecting the dynamic, context-sensitive nature of medical dialogues and the evolving informational needs of patients. In this work, we present MedKGEval, a novel multi-turn evaluation framework for clinical LLMs grounded in structured medical knowledge. Our approach introduces three key contributions: (1) a knowledge graph-driven patient simulation mechanism, where a dedicated control module retrieves relevant medical facts from a curated knowledge graph, thereby endowing the patient agent with human-like and realistic conversational behavior. This knowledge graph is constructed by integrating open-source resources with additional triples extracted from expert-annotated datasets; (2) an in-situ, turn-level evaluation framework, where each model response is assessed by a Judge Agent for clinical appropriateness, factual correctness, and safety as the dialogue progresses using a suite of fine-grained, task-specific metrics; (3) a comprehensive multi-turn benchmark of eight state-of-the-art LLMs, demonstrating MedKGEval’s ability to identify subtle behavioral flaws and safety risks that are often overlooked by conventional evaluation pipelines. Although initially designed for Chinese and English medical applications, our framework can be readily extended to additional languages by switching the input knowledge graphs, ensuring seamless bilingual support and domain-specific applicability.
zh
[AI-50] GOAT: A Training Framework for Goal-Oriented Agent with Tools
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在处理目标导向型查询时能力有限的问题,尤其是如何将高层目标分解为多个相互依赖的API调用,并进行正确规划与执行。现有方法主要依赖零样本评估,且受限于缺乏训练数据,导致开源小模型难以有效完成复杂工具使用任务。解决方案的关键在于提出一种名为GOAT的新型训练框架,其核心创新是无需人工标注即可自动从API文档中构建目标导向的合成数据集,从而让模型具备对多步API调用进行推理和生成连贯响应的能力。实验表明,GOAT训练的代理在多个基准测试中达到最先进性能,并在新提出的GOATBench上同样表现优异,验证了其在提升开源LLM代理复杂推理与工具使用能力方面的有效性。
链接: https://arxiv.org/abs/2510.12218
作者: Hyunji Min,Sangwon Jung,Junyoung Sung,Dosung Lee,Leekyeung Han,Paul Hongsuck Seo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 32 pages, 21 figures
Abstract:Large language models (LLMs) have recently been extended beyond traditional text generation to serve as interactive agents capable of using external tools based on user intent. However, current LLM agents still show limited ability to handle goal-oriented queries, which require decomposing a high-level objective into multiple interdependent API calls with correct planning and execution. Current approaches mainly rely on zero-shot evaluation due to the absence of training data. While proprietary closed-source models such as GPT-4 demonstrate strong reasoning abilities, smaller open-source models struggle to perform complex tool use effectively. Thus, we propose a novel training framework GOAT, which enables fine-tuning of LLM agents in a human annotation-free setting. GOAT automatically constructs synthetic datasets of goal-oriented API execution tasks directly from given API documents, equipping models with the ability to reason over interdependent calls and generate coherent responses. Through extensive experiments, we show that GOAT-trained agents achieve state-of-the-art performance across multiple existing goal-oriented benchmarks. In addition, we introduce GOATBench, a new goal-oriented API execution benchmark, and demonstrate that agents trained with GOAT also excel in this setting. These results highlight GOAT as a practical path toward building robust open-source LLM agents capable of complex reasoning and tool use.
zh
[AI-51] DE3S: Dual-Enhanced Soft-Sparse-Shape Learning for Medical Early Time-Series Classification
【速读】:该论文旨在解决医疗早期时间序列分类(Early Time-Series Classification, ETSC)中准确率与早熟性之间的冲突问题,特别是在重症监护室(ICU)中 sepsis 等急症的早期预测场景下,现有方法往往因初始信号微弱和类别不平衡而难以捕捉细微的早期特征。解决方案的关键在于通过识别具有高可解释性的形状子序列(shapelets)来提升模型性能。论文提出 Dual-Enhanced Soft-Sparse-Shape Learning (DE3S) 框架,其核心创新包括:(1) 结合传统时序增强与基于注意力的全局时序增强的双重增强策略,提升表示学习鲁棒性;(2) 基于注意力分数的软形状稀疏化机制,动态保留判别性模式并聚合非重要形状为代表性标记;(3) 双路径混合专家网络(Mixture of Experts, MoE)与 Inception 模块融合架构,分别实现形状内的局部学习和跨形状的多尺度全局建模。该框架还采用加权交叉熵损失缓解类别不平衡问题,在六个真实医疗数据集上实现了最先进性能。
链接: https://arxiv.org/abs/2510.12214
作者: Tao Xie,Zexi Tan,Haoyi Xiao,Binbin Sun,Yiqun Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE BIBM 2025
Abstract:Early time-series classification (ETSC) in medical applications is crucial for time-sensitive scenarios such as sepsis prediction in intensive care units (ICUs), where many deaths are caused by delayed prediction. ETSC can significantly improve ICU resource utilization efficiency and healthcare precision. However, it faces conflicting goals of accuracy and earliness, with existing methods often trading one for the other and struggling to capture subtle early-stage patterns due to weak initial signals and class imbalance. The key to solving these challenges is to find shapelets, which are discriminative subsequences (or shapes) with high interpretability in time-series classification. This paper proposes Dual-Enhanced Soft-Sparse-Shape Learning for Medical Early Time-Series Classification (DE3S), which introduces a novel Dual-Enhanced Soft-Shape Learning framework to identify shapelets precisely through three innovations: (1) a comprehensive dual-enhancement strategy that combines traditional temporal augmentation with attention-based global temporal enhancement for robust representation learning, (2) an attention-score-based soft shapelet sparsification mechanism that dynamically preserves discriminative patterns while aggregating less important shapelets into representative tokens, and (3) a dual-path fusion architecture of a Mixture of Experts (MoE) network and Inception modules, where the MoE performs local learning within shapelets and multi-scale Inception modules capture global patterns across shapelets. The framework employs weighted cross-entropy loss to handle class imbalance and demonstrates robustness on subject-consistency datasets. Extensive experiments on six real-world medical datasets show state-of-the-art performance, with ablation studies confirming component efficacy.
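注意力分数驱动的软形状稀疏化可如下示意:保留高分 shapelet,其余聚合为一个代表性标记(假设性草图,聚合方式为演示用的注意力加权平均):

```python
import torch

def soft_shapelet_sparsify(shapelets, attn_scores, keep=4):
    """Toy attention-score-based soft sparsification: keep the `keep`
    highest-scoring shapelet tokens and merge the rest into one
    attention-weighted representative token.

    shapelets: (n, d) shapelet embeddings; attn_scores: (n,) scores.
    """
    idx = torch.argsort(attn_scores, descending=True)
    top, rest = idx[:keep], idx[keep:]
    if len(rest) == 0:
        return shapelets[top]
    w = torch.softmax(attn_scores[rest], dim=0)          # renormalize tail
    merged = (w.unsqueeze(1) * shapelets[rest]).sum(0, keepdim=True)
    return torch.cat([shapelets[top], merged], dim=0)    # (keep+1, d)

tokens = soft_shapelet_sparsify(torch.randn(10, 16), torch.rand(10))
print(tokens.shape)  # torch.Size([5, 16])
```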
zh
[AI-52] Revisiting Meta-Learning with Noisy Labels: Reweighting Dynamics and Theoretical Guarantees
【速读】:该论文旨在解决标签噪声下学习(learning with noisy labels)的挑战,即过参数化网络会记忆被污染的监督信号,导致模型性能下降。其解决方案的关键在于对基于元学习的样本重加权(meta-reweighting)机制进行严格的理论分析,揭示其训练轨迹包含三个阶段:对齐阶段(alignment phase)、过滤阶段(filtering phase)和后过滤阶段(post-filtering phase),并发现该机制的本质是干净子集信号与训练信号之间的相似性加权耦合,以及干净子集损失收缩效应。基于此理解,作者提出一种轻量级替代方案,通过均值中心化(mean-centering)、行移位(row shifting)和标签符号调制(label-signed modulation)实现稳定且高效的噪声过滤,避免了昂贵的双层优化过程,在合成与真实噪声标签基准上均显著优于现有重加权/选择基线方法。
链接: https://arxiv.org/abs/2510.12209
作者: Yiming Zhang,Chester Holtz,Gal Mishne,Alex Cloninger
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Learning with noisy labels remains challenging because over-parameterized networks memorize corrupted supervision. Meta-learning-based sample reweighting mitigates this by using a small clean subset to guide training, yet its behavior and training dynamics lack theoretical understanding. We provide a rigorous theoretical analysis of meta-reweighting under label noise and show that its training trajectory unfolds in three phases: (i) an alignment phase that amplifies examples consistent with a clean subset and suppresses conflicting ones; (ii) a filtering phase driving noisy example weights toward zero until the clean subset loss plateaus; and (iii) a post-filtering phase in which noise filtration becomes perturbation-sensitive. The mechanism is a similarity-weighted coupling between training and clean subset signals together with clean subset training loss contraction; in the post-filtering regime where the clean-subset loss is sufficiently small, the coupling term vanishes and meta-reweighting loses discriminatory power. Guided by this analysis, we propose a lightweight surrogate for meta-reweighting that integrates mean-centering, row shifting, and label-signed modulation, yielding more stable performance while avoiding expensive bi-level optimization. Across synthetic and real noisy-label benchmarks, our method consistently outperforms strong reweighting/selection baselines.
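论文指出的"与干净子集信号的相似性加权耦合"可用如下单步示意复现其直觉(假设性草图:以与干净子集平均梯度的内积作为样本权重并截断负值,非论文的完整替代算法):

```python
import numpy as np

def similarity_weights(train_grads, clean_grads, lr=0.1):
    """One step of the similarity-weighted coupling the analysis
    identifies: an example is up-weighted when its gradient aligns with
    the mean gradient of the clean subset (clipped at zero).
    train_grads: (n, p); clean_grads: (m, p) per-example gradients."""
    g_clean = clean_grads.mean(axis=0)
    raw = lr * train_grads @ g_clean            # inner-product coupling
    w = np.clip(raw, 0.0, None)
    return w / (w.sum() + 1e-12)                # normalized example weights

rng = np.random.default_rng(0)
clean = rng.normal(size=(8, 5))
noisy = np.vstack([clean[:4] + 0.1, -clean[:4]])  # half aligned, half flipped
print(similarity_weights(noisy, clean).round(3))  # mass lands on the aligned half
```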
zh
[AI-53] On the Design and Evaluation of Human-centered Explainable AI Systems: A Systematic Review and Taxonomy
【速读】:该论文旨在解决当前可解释人工智能(Explainable AI, XAI)系统评估过程中缺乏以用户为中心设计导向的问题,即现有评估方法过于技术化,未能充分考虑人类用户的实际需求与使用情境。其解决方案的关键在于提出一套面向人类用户的XAI系统设计目标框架,并根据用户AI素养水平(AI初学者与数据专家)进行差异化适配:对于AI初学者,强调责任使用、接受度和易用性;对于数据专家,则聚焦于人机协作效率及系统与用户任务性能的优化。该研究通过系统梳理65项涉及不同领域的人类用户实验,提炼出XAI系统的结构特征(核心系统与解释模块分离)、评价指标分类(情感、认知、可用性、可解释性与解释质量)以及用户行为特征,从而为XAI开发提供可操作的设计指南,推动从技术驱动向用户中心的范式转变。
链接: https://arxiv.org/abs/2510.12201
作者: Aline Mangold,Juliane Zietz,Susanne Weinhold,Sebastian Pannasch
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:As AI becomes more common in everyday living, there is an increasing demand for intelligent systems that are both performant and understandable. Explainable AI (XAI) systems aim to provide comprehensible explanations of decisions and predictions. At present, however, evaluation processes are rather technical and not sufficiently focused on the needs of human users. Consequently, evaluation studies involving human users can serve as a valuable guide for conducting user studies. This paper presents a comprehensive review of 65 user studies evaluating XAI systems across different domains and application contexts. As a guideline for XAI developers, we provide a holistic overview of the properties of XAI systems and evaluation metrics focused on human users (human-centered). We propose objectives for the human-centered design (design goals) of XAI systems. To incorporate users’ specific characteristics, design goals are adapted to users with different levels of AI expertise (AI novices and data experts). In this regard, we provide an extension to existing XAI evaluation and design frameworks. The first part of our results includes the analysis of XAI system characteristics. An important finding is the distinction between the core system and the XAI explanation, which together form the whole system. Further results include the distinction of evaluation metrics into affection towards the system, cognition, usability, interpretability, and explanation metrics. Furthermore, the users, along with their specific characteristics and behavior, can be assessed. For AI novices, the relevant extended design goals include responsible use, acceptance, and usability. For data experts, the focus is performance-oriented and includes human-AI collaboration and system and user task performance.
zh
[AI-54] ResearStudio: A Human-Intervenable Framework for Building Controllable Deep-Research Agents EMNLP2025
【速读】:该论文旨在解决当前深度研究代理(Deep Research Agents)在运行过程中缺乏实时人类干预能力的问题,即系统一旦启动便以“发射后不管”(fire-and-forget)模式执行,无法在过程中修正错误或融入专家知识。解决方案的关键在于提出ResearStudio——首个将实时人类控制置于核心的开源框架,其采用协作工作坊(Collaborative Workshop)设计:通过分层规划器-执行器将每一步操作写入一个动态更新的“计划即文档”(plan-as-document),并利用高速通信层实时同步动作、文件变更和工具调用至Web界面;用户可在任意时刻暂停、编辑计划或代码、执行自定义命令后继续运行,实现AI主导、人机协同与人类主导三种模式的无缝切换。实验表明,该系统在GAIA基准上达到当前最优性能,证明了强自动化能力与细粒度人类控制可共存。
链接: https://arxiv.org/abs/2510.12194
作者: Linyi Yang,Yixuan Weng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: EMNLP 2025 Demo, Oral
Abstract:Current deep-research agents run in a ‘‘fire-and-forget’’ mode: once started, they give users no way to fix errors or add expert knowledge during execution. We present ResearStudio, the first open-source framework that places real-time human control at its core. The system follows a Collaborative Workshop design. A hierarchical Planner-Executor writes every step to a live ‘‘plan-as-document,’’ a fast communication layer streams each action, file change, and tool call to a web interface. At any moment, the user can pause the run, edit the plan or code, run custom commands, and resume – switching smoothly between AI-led, human-assisted and human-led, AI-assisted modes. In fully autonomous mode, ResearStudio achieves state-of-the-art results on the GAIA benchmark, surpassing systems like OpenAI’s DeepResearch and Manus. These results show that strong automated performance and fine-grained human control can coexist. The full code, protocol, and evaluation scripts are available at this https URL. We will continue to update the repository to encourage further work on safe and controllable research agents. Our live demo is publicly accessible at this http URL. We support the development of DeepScientist, which can be accessed at this https URL.
zh
[AI-55] MatSciBench: Benchmarking the Reasoning Ability of Large Language Models in Materials Science
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在材料科学(materials science)领域中科学推理能力尚未被充分探索的问题。为填补这一空白,作者提出了MatSciBench——一个涵盖大学本科水平的综合性基准测试集,包含1,340道覆盖材料科学六大主领域与31个子领域的结构化问题,并引入基于推理长度的三级难度分类、详尽参考答案以支持精确误差分析,以及通过视觉上下文增强多模态推理能力。其关键创新在于构建了一个细粒度、多层次且具备多模态特性的评测体系,能够系统评估不同推理策略(如基础思维链、工具增强和自我修正)在复杂材料科学任务中的表现,从而推动LLMs在该专业领域科学推理能力的提升。
链接: https://arxiv.org/abs/2510.12171
作者: Junkai Zhang,Jingru Gan,Xiaoxuan Wang,Zian Jia,Changquan Gu,Jianpeng Chen,Yanqiao Zhu,Mingyu Derek Ma,Dawei Zhou,Ling Li,Wei Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have demonstrated remarkable abilities in scientific reasoning, yet their reasoning capabilities in materials science remain underexplored. To fill this gap, we introduce MatSciBench, a comprehensive college-level benchmark comprising 1,340 problems that span the essential subdisciplines of materials science. MatSciBench features a structured and fine-grained taxonomy that categorizes materials science questions into 6 primary fields and 31 sub-fields, and includes a three-tier difficulty classification based on the reasoning length required to solve each question. MatSciBench provides detailed reference solutions enabling precise error analysis and incorporates multimodal reasoning through visual contexts in numerous questions. Evaluations of leading models reveal that even the highest-performing model, Gemini-2.5-Pro, achieves under 80% accuracy on college-level materials science questions, highlighting the complexity of MatSciBench. Our systematic analysis of different reasoning strategies (basic chain-of-thought, tool augmentation, and self-correction) demonstrates that no single method consistently excels across all scenarios. We further analyze performance by difficulty level, examine trade-offs between efficiency and accuracy, highlight the challenges inherent in multimodal reasoning tasks, analyze failure modes across LLMs and reasoning methods, and evaluate the influence of retrieval-augmented generation. MatSciBench thus establishes a comprehensive and solid benchmark for assessing and driving improvements in the scientific reasoning capabilities of LLMs within the materials science domain.
zh
[AI-56] Budget-constrained Active Learning to Effectively De-censor Survival Data
【速读】:该论文旨在解决在生存分析(survival analysis)场景下,如何有效利用有限预算进行标注选择的问题。传统监督学习假设所有训练样本均有完整标签,但在生存数据中存在右删失(right-censored)实例,即仅知事件发生时间的下界,而无法获得确切时间。因此,本文提出一种预算驱动的标注策略,允许模型通过支付预算来“部分标注”删失实例(如从(3年, 删失)变为(7.2年, 未删失)或获取更多随访信息),从而提升模型性能。其解决方案的关键在于将标准预算学习算法(budgeted learning)扩展至含删失数据的场景,并基于BatchBALD(Batch Bayesian Active Learning by Disagreement)框架设计了具有理论保证和近似最优时间复杂度的新方法,实验证明其在多个生存任务上优于现有替代方案。
链接: https://arxiv.org/abs/2510.12144
作者: Ali Parsaee,Bei Jiang,Zachary Friggstad,Russell Greiner
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Standard supervised learners attempt to learn a model from a labeled dataset. Given a small set of labeled instances, and a pool of unlabeled instances, a budgeted learner can use its given budget to pay to acquire the labels of some unlabeled instances, which it can then use to produce a model. Here, we explore budgeted learning in the context of survival datasets, which include (right) censored instances, where we know only a lower bound on an instance’s time-to-event. Here, the learner can pay to (partially) label a censored instance – e.g., to acquire the actual time for an instance [perhaps go from (3 yr, censored) to (7.2 yr, uncensored)], or other variants [e.g., learn about one more year, so go from (3 yr, censored) to either (4 yr, censored) or perhaps (3.2 yr, uncensored)]. This serves as a model of real-world data collection, where follow-up with censored patients does not always lead to uncensoring, and how much information is given to the learner model during data collection is a function of the budget and the nature of the data itself. We provide both experimental and theoretical results for how to apply state-of-the-art budgeted learning algorithms to survival data and the respective limitations that exist in doing so. Our approach provides bounds and time complexity asymptotically equivalent to the standard active learning method BatchBALD. Moreover, empirical analysis on several survival tasks shows that our model performs better than other potential approaches on several benchmarks.
zh
[AI-57] Chimera: State Space Models Beyond Sequences
【速读】:该论文旨在解决当前基于Transformer的深度学习方法在建模序列、图像和图结构数据时,因依赖特定领域的人工归纳偏置(如位置嵌入或随机游走)而导致的通用性不足问题。这些偏置虽能弥补自注意力机制对数据拓扑结构的忽略,但设计复杂且易引入副作用,限制模型泛化能力。解决方案的关键在于提出Chimera模型,其核心思想是将状态空间模型(State Space Model, SSM)推广至任意图拓扑结构,从而直接、统一地建模数据的拓扑信息,无需引入任务特定的归纳偏置。实验表明,Chimera在语言、视觉和图学习任务上均优于现有基线,验证了数据拓扑作为跨模态强归纳偏置的有效性。
链接: https://arxiv.org/abs/2510.12111
作者: Aakash Lahoti,Tanya Marwah,Ratish Puduppully,Albert Gu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published in TMLR (October 2025); 22 Pages, 6 Figures, 11 Tables
Abstract:Transformer-based deep learning methods have become the standard approach for modeling diverse data such as sequences, images, and graphs. These methods rely on self-attention, which treats data as an unordered set of elements. This ignores the neighborhood structure or graph topology of the data and requires inductive biases–such as position embeddings in sequences and images, or random walks in graphs–to incorporate topology. However, designing such task-specific biases requires significant effort and can introduce side effects that hinder generalization. We introduce Chimera, a unified model that directly incorporates data topology in a principled way, removing the need for domain-specific biases. The key idea is that state space models–which naturally do not require position embeddings–can be generalized to capture any graph topology. Our experiments show that Chimera achieves strong performance across language, vision, and graph domains, outperforming BERT on GLUE by 0.7 points, ViT on ImageNet-1k by 2.6%, and all baselines on the Long Range Graph Benchmark. We further propose algorithmic optimizations to improve Chimera’s efficiency: (1) for Directed Acyclic Graphs, Chimera can be implemented as a linear-time recurrence; (2) for general graphs, a simple mathematical relaxation achieves Transformer’s quadratic complexity without domain-specific heuristics. These results validate Chimera’s core contribution and support the idea that data topology is a powerful inductive bias across modalities.
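"将状态空间模型推广到任意图拓扑"在 DAG 上可退化为一次按拓扑序的线性递推,如下为最小示意(假设性草图:父节点状态取平均聚合,并非论文的具体参数化):

```python
import numpy as np

def dag_ssm(x, parents, A, B):
    """Linear-time state-space recurrence over a DAG: each node's hidden
    state aggregates its parents' states (averaged here) before the usual
    SSM update h = A h_parents + B x. Nodes are assumed to be listed in
    topological order, which is what permits a single forward pass."""
    n, _ = x.shape
    d = A.shape[0]
    h = np.zeros((n, d))
    for v in range(n):                       # topological order
        ps = parents[v]
        h_in = h[ps].mean(axis=0) if ps else np.zeros(d)
        h[v] = A @ h_in + B @ x[v]
    return h

# A 4-node diamond DAG: 0 -> {1, 2} -> 3. A chain recovers the sequence case.
parents = {0: [], 1: [0], 2: [0], 3: [1, 2]}
h = dag_ssm(np.random.default_rng(0).normal(size=(4, 3)),
            parents, A=0.9 * np.eye(2), B=0.1 * np.ones((2, 3)))
print(h.shape)  # (4, 2)
```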
zh
[AI-58] oPolyAgent : AI Agents for Coarse-Grained Topological Polymer Simulations
【速读】:该论文旨在解决复杂分子动力学(MD)模拟在拓扑聚合物研究中操作门槛高、流程繁琐的问题,尤其针对不同聚合物架构(如线性、环状、刷状、星形聚合物及树状大分子)的模拟任务缺乏高效、可交互且自动化的工具。解决方案的关键在于提出一个名为ToPolyAgent的多智能体AI框架,通过将大语言模型(LLM)与领域专用计算工具(如LAMMPS)深度融合,构建了四个功能明确的智能体:配置代理(Config Agent)用于生成初始构型,模拟代理(Simulation Agent)执行MD模拟与构象分析,报告代理(Report Agent)生成结构化Markdown报告,以及工作流代理(Workflow Agent)实现端到端自动化任务调度。该框架支持交互式和自主式两种模式,显著提升了模拟流程的灵活性与可扩展性,为聚合物科学中的AI驱动材料发现提供了可自治、易扩展的科研生态系统基础。
链接: https://arxiv.org/abs/2510.12091
作者: Lijie Ding,Jan-Michael Carrillo,Changwoo Do
机构: 未知
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Soft Condensed Matter (cond-mat.soft)
备注: 10 pages, 8 figures
Abstract:We introduce ToPolyAgent, a multi-agent AI framework for performing coarse-grained molecular dynamics (MD) simulations of topological polymers through natural language instructions. By integrating large language models (LLMs) with domain-specific computational tools, ToPolyAgent supports both interactive and autonomous simulation workflows across diverse polymer architectures, including linear, ring, brush, and star polymers, as well as dendrimers. The system consists of four LLM-powered agents: a Config Agent for generating initial polymer-solvent configurations, a Simulation Agent for executing LAMMPS-based MD simulations and conformational analyses, a Report Agent for compiling markdown reports, and a Workflow Agent for streamlined autonomous operations. Interactive mode incorporates user feedback loops for iterative refinements, while autonomous mode enables end-to-end task execution from detailed prompts. We demonstrate ToPolyAgent’s versatility through case studies involving diverse polymer architectures under varying solvent conditions, thermostats, and simulation lengths. Furthermore, we highlight its potential as a research assistant by directing it to investigate the effect of interaction parameters on the linear polymer conformation, and the influence of grafting density on the persistence length of the brush polymer. By coupling natural language interfaces with rigorous simulation tools, ToPolyAgent lowers barriers to complex computational workflows and advances AI-driven materials discovery in polymer science. It lays the foundation for autonomous and extensible multi-agent scientific research ecosystems.
zh
[AI-59] Enhancing Neural Code Representation with Additional Context
【速读】:该论文旨在解决当前深度学习模型在程序理解任务中仅依赖源代码本身、忽视版本历史和结构关系等上下文信息的问题,从而限制了模型对代码演化过程与运行机制的捕捉能力。其解决方案的关键在于通过引入多种上下文信号(如版本历史和调用图)来增强代码表示,并系统评估这些上下文信号对关键程序理解任务(包括代码克隆检测和代码摘要生成)性能的影响。实验表明,融合上下文信息可显著提升模型表现,尤其在版本历史方面效果稳定,且多源上下文结合能进一步优化结果(最高提升达21.48%宏F1),验证了上下文信号在增强代码理解中的有效性。
链接: https://arxiv.org/abs/2510.12082
作者: Huy Nguyen,Christoph Treude,Patanamon Thongtanunam
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 34 pages, 7 figures, 11 tables
Abstract:Automated program comprehension underpins many software engineering tasks, from code summarisation to clone detection. Recent deep learning models achieve strong results but typically rely on source code alone, overlooking contextual information such as version history or structural relationships. This limits their ability to capture how code evolves and operates. We conduct an empirical study on how enriching code representations with such contextual signals affects neural model performance on key comprehension tasks. Two downstream tasks, code clone detection and code summarisation, are evaluated using SeSaMe (1,679 Java methods) and CodeSearchNet (63,259 methods). Five representative models (CodeBERT, GraphCodeBERT, CodeT5, PLBART, ASTNN) are fine-tuned under code-only and context-augmented settings. Results show that context generally improves performance: version history consistently boosts clone detection (e.g., CodeT5 +15.92% F1) and summarisation (e.g., GraphCodeBERT +5.56% METEOR), while call-graph effects vary by model and task. Combining multiple contexts yields further gains (up to +21.48% macro-F1). Human evaluation on 100 Java snippets confirms that context-augmented summaries are significantly preferred for Accuracy and Content Adequacy (p = 0.026; |delta| up to 0.55). These findings highlight the potential of contextual signals to enhance code comprehension and open new directions for optimising contextual encoding in neural SE models.
zh
[AI-60] Evaluating the Quality of Randomness and Entropy in Tasks Supported by Large Language Models
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)在处理涉及随机性的任务时能力不足的问题,特别是其生成和利用随机数的有效性尚不明确。研究通过一系列实验系统评估了LLM在不同条件下的随机性任务表现,包括是否可访问外部工具、任务类型、模型状态(新鲜或非新鲜)及提示策略等因素。关键发现是:尽管LLM能够生成具有一定随机性的输出,但其行为不稳定且常显著偏离预期;解决方案的核心在于识别并改进LLM在随机性建模与控制方面的局限性,例如优化提示设计、引入外部随机源或增强对熵等随机性指标的感知能力,从而提升其在密码学、AI代理决策、调度等依赖高质量随机性的场景中的可靠性。
链接: https://arxiv.org/abs/2510.12080
作者: Rabimba Karanjai,Yang Lu,Ranjith Chodavarapu,Lei Xu,Weidong Shi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid advancement of large language model (LLM) technology has led to diverse applications, many of which inherently require randomness, such as stochastic decision-making, gaming, scheduling, AI agents, and cryptography-related tasks. However, the capabilities of LLMs in handling randomness, particularly in generating and utilizing random numbers effectively, remain unclear. This paper investigates the capacity of LLMs for handling tasks that involve randomness through a series of experiments. We designed a set of experiments that consider various factors that can influence an LLM’s performance in tasks involving randomness, such as accessibility to external tools, types of tasks, model states (fresh vs. non-fresh), and prompting strategies. The experiments cover a range of tasks, including generating random numbers, generating random strings such as passwords, shuffling items, and evaluating the quality of randomness using entropy and the NIST randomness test-suite. Our findings reveal that while LLMs can generate outputs that exhibit some degree of randomness, their performance is inconsistent and often deviates significantly from the expected behavior. The analysis of the experimental results highlights key limitations and areas where improvement is needed for the LLMs to effectively handle tasks involving randomness.
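对生成序列的随机性可用经验香农熵快速度量,如下为一个自包含的小例子(示意性写法;NIST 测试套件等更严格的检验不在此列):

```python
import math
from collections import Counter

def shannon_entropy_bits(seq):
    """Empirical Shannon entropy (bits per symbol) of a generated sequence;
    for uniformly random decimal digits the ideal value is log2(10) = 3.32."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

llm_digits = "31415926535897932384626433"   # stand-in for LLM output
print(f"{shannon_entropy_bits(llm_digits):.2f} bits/symbol")
```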
zh
[AI-61] BeSTAD: Behavior-Aware Spatio-Temporal Anomaly Detection for Human Mobility Data
【速读】:该论文旨在解决大规模人群移动数据中个体层面异常行为检测的挑战,即如何识别单个个体相对于其自身历史行为模式的细微偏离,而传统方法主要关注轨迹层面的统计异常或时空不一致性。解决方案的关键在于提出BeSTAD(Behavior-aware Spatio-Temporal Anomaly Detection for Human Mobility Data)框架,其核心创新是通过联合建模空间上下文与时间动态,学习语义增强的移动表征(semantically enriched mobility representations),从而捕捉个体化的行为特征;同时引入基于行为聚类的建模机制(behavior-cluster-aware modeling mechanism),从正常活动中构建个性化行为画像,并通过跨周期的行为对比实现具有一致语义对齐的异常识别,最终在无监督条件下实现精细化、可解释的个体级异常检测。
链接: https://arxiv.org/abs/2510.12076
作者: Junyi Xie,Jina Kim,Yao-Yi Chiang,Lingyi Zhao,Khurram Shafique
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: accepted by The 2nd ACM SIGSPATIAL International Workshop on Geospatial Anomaly Detection
Abstract:Traditional anomaly detection in human mobility has primarily focused on trajectory-level analysis, identifying statistical outliers or spatiotemporal inconsistencies across aggregated movement traces. However, detecting individual-level anomalies, i.e., unusual deviations in a person’s mobility behavior relative to their own historical patterns, within datasets encompassing large populations remains a significant challenge. In this paper, we present BeSTAD (Behavior-aware Spatio-Temporal Anomaly Detection for Human Mobility Data), an unsupervised framework that captures individualized behavioral signatures across large populations and uncovers fine-grained anomalies by jointly modeling spatial context and temporal dynamics. BeSTAD learns semantically enriched mobility representations that integrate location meaning and temporal patterns, enabling the detection of subtle deviations in individual movement behavior. BeSTAD further employs a behavior-cluster-aware modeling mechanism that builds personalized behavioral profiles from normal activity and identifies anomalies through cross-period behavioral comparison with consistent semantic alignment. Building on prior work in mobility behavior clustering, this approach enables not only the detection of behavioral shifts and deviations from established routines but also the identification of individuals exhibiting such changes within large-scale mobility datasets. By learning individual behaviors directly from unlabeled data, BeSTAD advances anomaly detection toward personalized and interpretable mobility analysis.
zh
[AI-62] EmboMatrix: A Scalable Training-Ground for Embodied Decision-Making
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在具身决策(embodied decision-making)能力上的局限性问题,即LLMs虽具备通用决策能力,但因缺乏对物理环境的直接交互经验,难以实现真正意义上的具身理解。为解决这一问题,论文提出构建一个名为EmboMatrix的训练平台,其核心创新在于提供任务与场景仿真、具身交互以及精准反馈信号的一体化基础设施,从而支持LLM通过环境驱动的学习获得真实的具身决策技能。解决方案的关键在于三个关键技术:多智能体数据引擎用于大规模任务和场景生成、异构分布式硬件系统实现可扩展仿真、多层级奖励架构提供精细化监督信号;基于此,研究者训练出EmboBrain模型,实验证明其在两个具身决策基准测试中性能优于671B参数的DeepSeek-R1基线模型9.5%,验证了环境感知型交互学习在构建真正智能具身代理中的有效性。
链接: https://arxiv.org/abs/2510.12072
作者: Zixing Lei,Sheng Yin,Yichen Xiong,Yuanzhuo Ding,Wenhao Huang,Yuxi Wei,Qingyao Xu,Yiming Li,Weixin Li,Yunhong Wang,Siheng Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 10 pages 8 figures
Abstract:Embodied decision-making enables agents to translate high-level goals into executable actions through continuous interactions within the physical world, forming a cornerstone of general-purpose embodied intelligence. Large language models (LLMs), with their general decision-making capabilities, offer a promising path to realize this potential; however, LLMs trained solely on language lack exposure to physical environments, limiting their true embodied understanding. To bridge this gap, we propose the concept of a training ground: a comprehensive infrastructure that provides task and scene simulation, embodied interaction, and feedback signals, offering a one-stop solution for LLM acquire genuine embodied decision-making skills. In this work, we present EmboMatrix, the first training ground of its kind, providing massive and diverse tasks with efficient simulation and precise rewards. EmboMatrix incorporates a series of novel techniques: a multi-agent data engine for large-scale task and scene generation, a distributed heterogeneous-hardware system for scalable simulation, and a multi-level reward architecture for precise supervision. Leveraging EmboMatrix, we cultivate EmboBrain, an LLM whose embodied decision-making abilities emerge from extensive embodied interactions. Experiments show that EmboBrain-7B surpasses the 671B DeepSeek-R1 baseline by 9.5% on two challenging embodied decision-making benchmarks, demonstrating the power of interactive, environment-grounded learning for building truly intelligent embodied agents.
zh
[AI-63] MEASURE: Multi-scale Minimal Sufficient Representation Learning for Domain Generalization in Sleep Staging
【速读】:该论文旨在解决深度学习模型在自动睡眠分期(sleep staging)任务中因生理信号个体差异导致的泛化能力不足问题,尤其是在分布外(out-of-distribution)场景下性能下降的问题。现有域泛化方法虽尝试通过对比学习提取域不变特征,但未能充分消除样本间非共享信息中隐含的域相关属性(即“冗余域相关信息”,excess domain-relevant information),从而限制了模型对跨被试数据的有效适应。解决方案的关键在于提出一种新颖的MEASURE(Multi-scalE minimAl SUfficient Representation lEarning)框架,该框架通过多尺度最小充分表示学习策略,在有效抑制冗余域相关信息的同时,保留对睡眠阶段分类至关重要的时域与频域特征,从而显著提升模型在未见受试者上的泛化性能。
链接: https://arxiv.org/abs/2510.12070
作者: Sangmin Jo,Jee Seok Yoon,Wootaek Jeong,Kwanseok Oh,Heung-Il Suk
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages, 7 figures, uses this http URL
Abstract:Deep learning-based automatic sleep staging has significantly advanced in performance and plays a crucial role in the diagnosis of sleep disorders. However, those models often struggle to generalize on unseen subjects due to variability in physiological signals, resulting in degraded performance in out-of-distribution scenarios. To address this issue, domain generalization approaches have recently been studied to ensure generalized performance on unseen domains during training. Among those techniques, contrastive learning has proven its validity in learning domain-invariant features by aligning samples of the same class across different domains. Despite its potential, many existing methods are insufficient to extract adequately domain-invariant representations, as they do not explicitly address domain characteristics embedded within the unshared information across samples. In this paper, we posit that mitigating such domain-relevant attributes-referred to as excess domain-relevant information-is key to bridging the domain gap. However, the direct strategy to mitigate the domain-relevant attributes often overfits features at the high-level information, limiting their ability to leverage the diverse temporal and spectral information encoded in the multiple feature levels. To address these limitations, we propose a novel MEASURE (Multi-scalE minimAl SUfficient Representation lEarning) framework, which effectively reduces domain-relevant information while preserving essential temporal and spectral features for sleep stage classification. In our exhaustive experiments on publicly available sleep staging benchmark datasets, SleepEDF-20 and MASS, our proposed method consistently outperformed state-of-the-art methods. Our code is available at : this https URL
zh
[AI-64] HiCoTraj:Zero-Shot Demographic Reasoning via Hierarchical Chain-of-Thought Prompting from Trajectory
【速读】:该论文旨在解决基于轨迹数据进行人口统计学属性(如年龄、性别、收入水平)推断时面临的标签数据稀缺与模型泛化能力差的问题。现有方法依赖大规模带标签轨迹数据,导致可解释性弱且跨数据集和用户群体的适应性不足。其解决方案的关键在于提出HiCoTraj框架,利用大语言模型(LLM)的零样本学习和语义理解能力,通过将轨迹转化为语义丰富的自然语言描述(包括活动编年史和多尺度访问摘要),并采用新颖的分层思维链(hierarchical chain-of-thought reasoning)引导LLM依次完成事实特征提取、行为模式分析与结构化输出的人口统计推断,从而在无需标注数据的情况下实现高性能且透明的推理过程。
链接: https://arxiv.org/abs/2510.12067
作者: Junyi Xie,Yuankun Jiao,Jina Kim,Yao-Yi Chiang,Lingyi Zhao,Khurram Shafique
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: accepted by The 1st ACM SIGSPATIAL International Workshop on Generative and Agentic AI for Multi-Modality Space-Time Intelligence
Abstract:Inferring demographic attributes such as age, sex, or income level from human mobility patterns enables critical applications such as targeted public health interventions, equitable urban planning, and personalized transportation services. Existing mobility-based demographic inference studies heavily rely on large-scale trajectory data with demographic labels, leading to limited interpretability and poor generalizability across different datasets and user groups. We propose HiCoTraj (Zero-Shot Demographic Reasoning via Hierarchical Chain-of-Thought Prompting from Trajectory), a framework that leverages LLMs’ zero-shot learning and semantic understanding capabilities to perform demographic inference without labeled training data. HiCoTraj transforms trajectories into semantically rich, natural language representations by creating detailed activity chronicles and multi-scale visiting summaries. Then HiCoTraj uses a novel hierarchical chain of thought reasoning to systematically guide LLMs through three cognitive stages: factual feature extraction, behavioral pattern analysis, and demographic inference with structured output. This approach addresses the scarcity challenge of labeled demographic data while providing transparent reasoning chains. Experimental evaluation on real-world trajectory data demonstrates that HiCoTraj achieves competitive performance across multiple demographic attributes in zero-shot scenarios.
zh
[AI-65] AI Agents as Universal Task Solvers
【速读】:该论文旨在解决AI推理代理(AI reasoning agents)是否具备通用计算能力的问题,以及如何通过学习实现高效推理,特别是链式思维(chain-of-thought reasoning)能否解决任何可计算任务。其核心挑战在于:当前模型规模和训练数据量的扩展是否足以催生真正智能的推理行为,还是仅能产生“天才型”(savant-like)的暴力搜索行为。解决方案的关键在于重新定义学习范式,从经典的归纳学习(inductive learning)转向传导学习(transductive learning),即不再追求对历史数据分布的逼近,而是聚焦于捕获数据的算法结构以减少新任务的求解时间。文中指出,信息在学习中的核心作用不是最小化重建误差,而是最小化求解时间——最优加速比与数据的算法信息(algorithmic information)紧密相关,并由此推导出推理时间与训练时间之间的幂律缩放关系。因此,论文主张,在扩展推理模型时,应优先优化推理效率(时间),而非单纯扩大模型或数据规模。
链接: https://arxiv.org/abs/2510.12066
作者: Alessandro Achille,Stefano Soatto
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:AI reasoning agents are already able to solve a variety of tasks by deploying tools, simulating outcomes of multiple hypotheses and reflecting on them. In doing so, they perform computation, although not in the classical sense – there is no program being executed. Still, if they perform computation, can AI agents be universal? Can chain-of-thought reasoning solve any computable task? How does an AI Agent learn to reason? Is it a matter of model size? Or training dataset size? In this work, we reinterpret the role of learning in the context of AI Agents, viewing them as compute-capable stochastic dynamical systems, and highlight the role of time in a foundational principle for learning to reason. In doing so, we propose a shift from classical inductive learning to transductive learning – where the objective is not to approximate the distribution of past data, but to capture their algorithmic structure to reduce the time needed to find solutions to new tasks. Transductive learning suggests that, counter to Shannon’s theory, a key role of information in learning is about reduction of time rather than reconstruction error. In particular, we show that the optimal speed-up that a universal solver can achieve using past data is tightly related to their algorithmic information. Using this, we show a theoretical derivation for the observed power-law scaling of inference time versus training time. We then show that scaling model size can lead to behaviors that, while improving accuracy on benchmarks, fail any reasonable test of intelligence, let alone super-intelligence: In the limit of infinite space and time, large models can behave as savants, able to brute-force through any task without any insight. Instead, we argue that the key quantity to optimize when scaling reasoning models is time, whose critical role in learning has so far only been indirectly considered.
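摘要提到推理时间与训练时间之间的幂律缩放;下面用合成数据演示如何在 log-log 坐标下拟合并读出幂指数(数据与指数均为演示假设,与论文实验无关):

```python
import numpy as np

# 合成数据:假设 t_inf ∝ t_train^(-b),并加少量乘性噪声(仅作演示)
t_train = np.logspace(1, 4, 20)
noise = np.exp(np.random.default_rng(1).normal(0, 0.05, 20))
t_inf = 50.0 * t_train ** -0.4 * noise

# 幂律在 log-log 空间变为直线,斜率即幂律指数
slope, intercept = np.polyfit(np.log(t_train), np.log(t_inf), 1)
print(f"估计的幂律指数 b ≈ {-slope:.2f}")   # 应接近设定的 0.4
```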
zh
[AI-66] Empowering LLM Agents with Geospatial Awareness: Toward Grounded Reasoning for Wildfire Response
【速读】:该论文旨在解决现有灾害响应方法在语义上下文缺失、跨事件泛化能力弱以及可解释性差等问题,同时克服大型语言模型(Large Language Models, LLMs)仅限文本输入且缺乏地理感知的局限性。其解决方案的关键在于提出一种地理空间感知层(Geospatial Awareness Layer, GAL),该层能够将LLM代理与结构化的地球数据(如基础设施、人口分布、地形和气象信息)进行自动关联与融合,生成包含单位标注的感知脚本,从而支持基于证据的资源分配决策,并通过历史类比和每日变化信号实现动态更新。实验证明,该框架在多个真实野火场景中优于基线方法,并具备扩展至洪水、飓风等其他灾害类型的潜力。
链接: https://arxiv.org/abs/2510.12061
作者: Yiheng Chen,Lingyao Li,Zihui Ma,Qikai Hu,Yilun Zhu,Min Deng,Runlong Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Effective disaster response is essential for safeguarding lives and property. Existing statistical approaches often lack semantic context, generalize poorly across events, and offer limited interpretability. While Large language models (LLMs) provide few-shot generalization, they remain text-bound and blind to geography. To bridge this gap, we introduce a Geospatial Awareness Layer (GAL) that grounds LLM agents in structured earth data. Starting from raw wildfire detections, GAL automatically retrieves and integrates infrastructure, demographic, terrain, and weather information from external geodatabases, assembling them into a concise, unit-annotated perception script. This enriched context enables agents to produce evidence-based resource-allocation recommendations (e.g., personnel assignments, budget allocations), further reinforced by historical analogs and daily change signals for incremental updates. We evaluate the framework in real wildfire scenarios across multiple LLM models, showing that geospatially grounded agents can outperform baselines. The proposed framework can generalize to other hazards such as floods and hurricanes.
zh
[AI-67] Do Large Language Models Respect Contracts? Evaluating and Enforcing Contract-Adherence in Code Generation
【速读】:该论文旨在解决当前主流代码生成评估基准(如HumanEval+和MBPP+)忽视了真实软件中至关重要的合同遵守问题,即模型对非法输入的处理能力——这些基准仅衡量功能正确性(functional correctness),而未评估生成代码是否能正确识别并拒绝违反前置条件(preconditions)或有效性约束(validity constraints)的输入。这一缺陷导致现有模型无法生成真正鲁棒且可靠的代码片段。解决方案的关键在于提出PACT框架,其核心创新包括:构建专注于合同违规的综合性测试用例库以扩展现有基准;通过系统化分析不同提示策略下模型表现,发现加入合同违规测试用例可显著提升模型对合同的遵守能力;引入新型量化指标,精确衡量代码生成与测试生成中的合同遵守程度,从而提供可解释且严谨的评估体系,弥补传统基准在代码鲁棒性方面的盲区。
链接: https://arxiv.org/abs/2510.12047
作者: Soohan Lim,Joonghyuk Hahn,Hyunwoo Park,Sang-Ki Ko,Yo-Sub Han
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 21 pages, 12 figures, 3 tables
Abstract:Prevailing code generation benchmarks, such as HumanEval+ and MBPP+, primarily evaluate large language models (LLMs) with pass@k on functional correctness using well-formed inputs. However, they ignore a crucial aspect of real-world software: adherence to contracts, the preconditions and validity constraints that dictate how ill-formed inputs must be rejected. This critical oversight means that existing benchmarks fail to measure, and models consequently fail to generate, truly robust and reliable code snippets. We introduce PACT, a program assessment and contract-adherence evaluation framework, to bridge this gap. PACT is the first framework designed to systematically evaluate and enhance contract-adherence in LLM-generated code snippets alongside functional correctness. PACT’s contributions are threefold: First, it provides a comprehensive test-suite corpus focused on contract violations, extending HumanEval+ and MBPP+. Second, it enables a systematic analysis of code generation under varied prompting conditions. This analysis demonstrates that augmenting prompts with contract-violating test cases significantly enhances a model’s ability to respect contracts compared to using the contract description alone. Finally, it introduces novel metrics to rigorously quantify contract adherence in both test generation and code generation. By revealing critical errors that conventional benchmarks overlook, PACT provides rigorous and interpretable metrics to evaluate the robustness of LLM-generated code snippets in both functionality and contract adherence. Our code and data are available at this https URL.
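为直观说明“功能正确性测试”与“合同违规测试”的区别,这里自拟一个带前置条件的函数及两类测试用例(示例与 PACT 基准原题无关):

```python
def mean_of_positives(xs):
    # 合同(前置条件):xs 非空,且所有元素为正数;违规输入必须被显式拒绝
    if not xs:
        raise ValueError("xs must be non-empty")
    if any(x <= 0 for x in xs):
        raise ValueError("all elements must be positive")
    return sum(xs) / len(xs)

# 功能正确性测试(HumanEval+ / MBPP+ 关注的部分):合法输入应得到正确结果
assert mean_of_positives([1.0, 3.0]) == 2.0

# 合同违规测试(PACT 关注的部分):ill-formed 输入应抛出异常而非静默返回
try:
    mean_of_positives([1.0, -2.0])
    raise AssertionError("contract violation was not rejected")
except ValueError:
    pass
```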
zh
[AI-68] CausalTrace: A Neurosymbolic Causal Analysis Agent for Smart Manufacturing AAAI2026
【速读】:该论文旨在解决现代制造环境中AI系统因缺乏可解释性与因果推理能力而导致的决策可信度不足问题,尤其是在异常处理、根因分析(Root Cause Analysis, RCA)和干预策略制定等高风险场景中。现有AI模型多为孤立的黑箱系统,难以提供透明、可追溯的决策依据,限制了其在工业现场的实际应用。解决方案的关键在于提出CausalTrace模块——一个嵌入到SmartPilot工业CoPilot中的神经符号因果分析组件,它融合数据驱动的因果发现、反事实推理与知识图谱增强的根因分析功能,并支持实时操作员交互,从而实现预测、解释与因果推理的一体化决策支持,显著提升了系统的鲁棒性、智能性和可信度。
链接: https://arxiv.org/abs/2510.12033
作者: Chathurangi Shyalika,Aryaman Sharma,Fadi El Kalach,Utkarshani Jaimini,Cory Henson,Ramy Harik,Amit Sheth
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures, 3 tables, Accepted at AAAI 2026: IAAI - Innovative Applications of AI Conference
Abstract:Modern manufacturing environments demand not only accurate predictions but also interpretable insights to process anomalies, root causes, and potential interventions. Existing AI systems often function as isolated black boxes, lacking the seamless integration of prediction, explanation, and causal reasoning required for a unified decision-support solution. This fragmentation limits their trustworthiness and practical utility in high-stakes industrial environments. In this work, we present CausalTrace, a neurosymbolic causal analysis module integrated into the SmartPilot industrial CoPilot. CausalTrace performs data-driven causal analysis enriched by industrial ontologies and knowledge graphs, including advanced functions such as causal discovery, counterfactual reasoning, and root cause analysis (RCA). It supports real-time operator interaction and is designed to complement existing agents by offering transparent, explainable decision support. We conducted a comprehensive evaluation of CausalTrace using multiple causal assessment methods and the C3AN framework (i.e. Custom, Compact, Composite AI with Neurosymbolic Integration), which spans principles of robustness, intelligence, and trustworthiness. In an academic rocket assembly testbed, CausalTrace achieved substantial agreement with domain experts (ROUGE-1: 0.91 in ontology QA) and strong RCA performance (MAP@3: 94%, PR@2: 97%, MRR: 0.92, Jaccard: 0.92). It also attained 4.59/5 in the C3AN evaluation, demonstrating precision and reliability for live deployment.
zh
[AI-69] Asking Clarifying Questions for Preference Elicitation With Large Language Models
【速读】:该论文旨在解决在用户历史数据有限的情况下,如何通过生成有效的序列式澄清问题(clarifying questions)来有效获取用户偏好,从而提升大语言模型(Large Language Models, LLMs)在推荐系统中个性化响应的能力。其解决方案的关键在于提出一种受扩散模型(diffusion models)启发的两阶段训练框架:第一阶段通过逐步添加“噪声”(即移除已获得的答案)模拟用户画像的退化过程,第二阶段训练模型逆向恢复用户偏好,学习如何生成能逐步聚焦用户兴趣的引导性问题(funnel questions),从而显著提升LLM在多领域场景下 eliciting 用户偏好的能力。
链接: https://arxiv.org/abs/2510.12015
作者: Ali Montazeralghaem,Guy Tennenholtz,Craig Boutilier,Ofer Meshi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) have made it possible for recommendation systems to interact with users in open-ended conversational interfaces. In order to personalize LLM responses, it is crucial to elicit user preferences, especially when there is limited user history. One way to get more information is to present clarifying questions to the user. However, generating effective sequential clarifying questions across various domains remains a challenge. To address this, we introduce a novel approach for training LLMs to ask sequential questions that reveal user preferences. Our method follows a two-stage process inspired by diffusion models. Starting from a user profile, the forward process generates clarifying questions to obtain answers and then removes those answers step by step, serving as a way to add “noise” to the user profile. The reverse process involves training a model to “denoise” the user profile by learning to ask effective clarifying questions. Our results show that our method significantly improves the LLM’s proficiency in asking funnel questions and eliciting user preferences effectively.
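下面的玩具代码示意该扩散式两阶段过程的前向“加噪”:逐步移除用户画像中的答案,每一步即构成一条供反向模型学习提问的训练样本(画像内容与实现细节均为假设):

```python
import random

profile = {                      # 假设的用户画像:问题 -> 答案
    "喜欢的菜系?": "川菜",
    "预算区间?": "50-100 元",
    "就餐人数?": "2 人",
}

def forward_noising(profile, seed=0):
    # 前向过程:按随机顺序逐步移除答案,
    # 每一步产生一个 (残缺画像, 被移除的问题, 被移除的答案) 训练样本
    qs = list(profile)
    random.Random(seed).shuffle(qs)
    state, steps = dict(profile), []
    for q in qs:
        a = state.pop(q)
        steps.append((dict(state), q, a))  # 反向(去噪)模型学习:由残缺画像提出问题 q
    return steps

for partial, q, a in forward_noising(profile):
    print(f"剩余信息 {len(partial)} 条 -> 应追问: {q}")
```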
zh
[AI-70] CGBench: Benchmarking Language Model Scientific Reasoning for Clinical Genetics Research NEURIPS2025
【速读】:该论文旨在解决临床遗传学领域中基因变异与功能注释的自动化解读难题,传统方法依赖人工且效率低下,难以满足精准医学对快速转化研究发现为临床可操作见解的需求。其解决方案的关键在于构建了一个名为CGBench的鲁棒性基准测试集,该基准基于ClinGen专家标注的文献数据,系统评估生成式语言模型(Generative Language Models, LMs)在科学文献中的三项核心推理能力:1)遵循精确协议提取实验结果;2)判断证据强度;3)分类并描述实验结果。通过测试8种不同LMs,研究揭示了推理类模型在细粒度任务上的优势及非推理模型在高层次理解上的表现,并首次采用LM判官方法量化模型解释与人类解释的一致性,发现即便分类正确,模型仍常出现幻觉或误读现象,从而明确了当前LM在科学文献精准解读方面的局限与改进方向。
链接: https://arxiv.org/abs/2510.11985
作者: Owen Queen,Harrison G. Zhang,James Zou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at NeurIPS 2025
Abstract:Variant and gene interpretation are fundamental to personalized medicine and translational biomedicine. However, traditional approaches are manual and labor-intensive. Generative language models (LMs) can facilitate this process, accelerating the translation of fundamental research into clinically-actionable insights. While existing benchmarks have attempted to quantify the capabilities of LMs for interpreting scientific data, these studies focus on narrow tasks that do not translate to real-world research. To meet these challenges, we introduce CGBench, a robust benchmark that tests reasoning capabilities of LMs on scientific publications. CGBench is built from ClinGen, a resource of expert-curated literature interpretations in clinical genetics. CGBench measures the ability to 1) extract relevant experimental results following precise protocols and guidelines, 2) judge the strength of evidence, and 3) categorize and describe the relevant outcome of experiments. We test 8 different LMs and find that while models show promise, substantial gaps exist in literature interpretation, especially on fine-grained instructions. Reasoning models excel in fine-grained tasks but non-reasoning models are better at high-level interpretations. Finally, we measure LM explanations against human explanations with an LM judge approach, revealing that models often hallucinate or misinterpret results even when correctly classifying evidence. CGBench reveals strengths and weaknesses of LMs for precise interpretation of scientific publications, opening avenues for future research in AI for clinical genetics and science more broadly.
zh
[AI-71] Learning Dynamics of VLM Finetuning
【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在基于偏好微调(preference-based fine-tuning)过程中存在的训练不稳定性问题,尤其是由“错误负样本”(trivially wrong negatives)引入的无信息梯度导致的优化扰动。其解决方案的关键在于提出一种两阶段的优化框架——Cooling-Weighted DPO(CW-DPO),其中第一阶段通过温和负样本(gentle negatives)进行监督微调(SFT),利用低权重平滑监督信号来正则化基础策略并抑制过自信;第二阶段采用DPO目标函数,并对负样本项施加一个基于平均token对数概率计算的冷却权重(cooling weight),从而抑制来自易样本或分布外样本的无信息梯度,同时保留硬负样本的信号。该方法通过显式建模训练轨迹和引入Δlog p探针作为早期停止、课程设计与故障诊断的第一类信号,实现了更稳定的优化、更好的校准性和更高的成对胜率,且收敛步数更少。
链接: https://arxiv.org/abs/2510.11978
作者: Jusheng Zhang,Kaitong Cai,Jing Yang,Keze Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Preference-based finetuning of vision–language models (VLMs) is brittle: trivially wrong negatives inject uninformative gradients that destabilize training. We recast alignment as learning-dynamics-aware optimization and introduce Cooling-Weighted DPO (CW-DPO), a two-stage recipe that explicitly models and exploits the training trajectory. Stage 1 performs supervised finetuning with gentle negatives: low-weight smoothed supervision that regularizes the base policy and curbs overconfidence without explicit penalties. Stage 2 applies a DPO objective in which the negative term is scaled by a cooling weight computed from the model’s average token log-probability on each negative, suppressing uninformative gradients from easy or off-distribution samples while preserving signal from hard negatives. In practice, we emphasize on-policy negatives and allow mixed negatives by blending a controllable fraction of dataset negatives to maintain contrast freshness. Throughout, we instrument training with Δ log p probes on positives and negatives as first-class signals for early stopping, curriculum design, and failure diagnosis. Across diverse VLM tasks, CW-DPO yields more stable optimization, better calibration, and higher pairwise win-rates than SFT-only and vanilla DPO, while converging in fewer steps. Ablations isolate the cooling-weight mechanism as the primary driver of these gains and show complementary benefits from mixing on-policy and dataset negatives. Taken together, our results show that smoothing learning dynamics before cooling preferences is a simple, general principle for robust VLM alignment.
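冷却权重的具体函数形式摘要中未给出,下面用 PyTorch 写一个最小示意,其中 sigmoid 调度与温度参数 tau 均为本文假设:

```python
import torch
import torch.nn.functional as F

def cw_dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg,
                avg_token_logp_neg, beta=0.1, tau=1.0):
    # 冷却权重:负样本的平均 token 对数概率越低(模型本就认为它极不可能),
    # 其梯度越缺乏信息量,权重越小;sigmoid 调度形式为本文示意假设
    w = torch.sigmoid(avg_token_logp_neg / tau)
    margin = beta * ((logp_pos - ref_logp_pos) - w * (logp_neg - ref_logp_neg))
    return -F.logsigmoid(margin).mean()

# 玩具数据:策略与参考模型的序列级 log-prob,以及负样本的平均 token log-prob
lp, ln = torch.tensor([-12.0]), torch.tensor([-15.0])
rp, rn = torch.tensor([-13.0]), torch.tensor([-14.0])
print(cw_dpo_loss(lp, ln, rp, rn, avg_token_logp_neg=torch.tensor([-2.0])))
```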
zh
[AI-72] CTIArena: Benchmarking LLM Knowledge and Reasoning Across Heterogeneous Cyber Threat Intelligence
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在网络安全威胁情报(Cyber Threat Intelligence, CTI)应用中评估不足的问题,具体包括:现有基准测试多采用封闭式(closed-book)设置、任务覆盖范围狭窄且缺乏对多源异构情报的综合分析能力。为填补这些空白,作者提出了CTIArena——首个面向知识增强型、多源异构威胁情报分析的基准测试框架。其关键创新在于构建了一个涵盖结构化、非结构化和混合类别的九项任务体系,并设计了检索增强技术(retrieval-augmented techniques),使LLMs能够有效利用安全领域知识库进行推理。实验表明,尽管通用LLMs在封闭环境下表现有限,但在引入领域知识后性能显著提升,凸显了领域定制化方法对释放LLMs在CTI中潜力的重要性。
链接: https://arxiv.org/abs/2510.11974
作者: Yutong Cheng,Yang Liu,Changze Li,Dawn Song,Peng Gao
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Under peer-review
Abstract:Cyber threat intelligence (CTI) is central to modern cybersecurity, providing critical insights for detecting and mitigating evolving threats. With the natural language understanding and reasoning capabilities of large language models (LLMs), there is increasing interest in applying them to CTI, which calls for benchmarks that can rigorously evaluate their performance. Several early efforts have studied LLMs on some CTI tasks but remain limited: (i) they adopt only closed-book settings, relying on parametric knowledge without leveraging CTI knowledge bases; (ii) they cover only a narrow set of tasks, lacking a systematic view of the CTI landscape; and (iii) they restrict evaluation to single-source analysis, unlike realistic scenarios that require reasoning across multiple sources. To fill these gaps, we present CTIArena, the first benchmark for evaluating LLM performance on heterogeneous, multi-source CTI under knowledge-augmented settings. CTIArena spans three categories, structured, unstructured, and hybrid, further divided into nine tasks that capture the breadth of CTI analysis in modern security operations. We evaluate ten widely used LLMs and find that most struggle in closed-book setups but show noticeable gains when augmented with security-specific knowledge through our designed retrieval-augmented techniques. These findings highlight the limitations of general-purpose LLMs and the need for domain-tailored techniques to fully unlock their potential for CTI.
zh
[AI-73] Y-shaped Generative Flows
【速读】:该论文旨在解决现代连续时间生成模型中普遍存在的V形传输(V-shaped transport)问题,即样本在从先验分布到数据分布的迁移过程中独立移动、缺乏共享结构建模,导致效率低下且难以捕捉复杂数据间的层次关系。其解决方案的关键在于提出Y形生成流(Y-shaped generative flows),通过设计一种基于速度驱动的运输成本函数(velocity-powered transport cost),并采用介于0到1之间的次线性指数,使得该代价函数对联合快速质量迁移具有奖励机制,从而引导概率质量沿共享路径协同移动后再分支至目标特定终点。这一机制有效提升了生成模型对数据层次结构的感知能力,并在合成数据、图像和生物数据集上实现了优于现有流模型的分布匹配效果与计算效率。
链接: https://arxiv.org/abs/2510.11955
作者: Arip Asadulaev,Semyon Semenov,Abduragim Shtanchaev,Eric Moulines,Fakhri Karray,Martin Takac
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Modern continuous-time generative models often induce V-shaped transport: each sample travels independently along nearly straight trajectories from prior to data, overlooking shared structure. We introduce Y-shaped generative flows, which move probability mass together along shared pathways before branching to target-specific endpoints. Our formulation is based on a novel velocity-powered transport cost with a sublinear exponent (between zero and one). This concave dependence rewards joint and fast mass movement. Practically, we instantiate the idea in a scalable neural ODE training objective. On synthetic, image, and biology datasets, Y-flows recover hierarchy-aware structure, improve distributional metrics over strong flow-based baselines, and reach targets with fewer integration steps.
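按摘要所述,可以把“速度幂次运输代价”示意为如下形式(记号与具体泛函形式为依摘要所作的假设):

```latex
% 速度幂次运输代价的示意形式,0 < p < 1 为次线性指数
\mathcal{C}[\rho, v] = \int_0^1 \int_{\mathbb{R}^d} \rho_t(x)\,\lVert v_t(x)\rVert^{p}\,\mathrm{d}x\,\mathrm{d}t,
\qquad 0 < p < 1.
% 由于 s \mapsto s^p 对速度是凹的,质量沿共享路径联合快速移动的代价
% 低于各自独立移动之和,从而诱导出"先共行、后分叉"的 Y 形传输结构。
```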
zh
[AI-74] Sculpting Latent Spaces With MMD: Disentanglement With Programmable Priors
【速读】:该论文旨在解决变分自编码器(Variational Autoencoder, VAE)中基于KL散度的正则化机制无法可靠实现解耦表示的问题,即该机制未能有效约束聚合后验分布以匹配预设的因子化先验分布,从而导致潜在空间中的特征仍存在纠缠。解决方案的关键在于提出一种可编程先验框架(Programmable Prior Framework),该框架基于最大均值差异(Maximum Mean Discrepancy, MMD)构建,允许从业者显式设计和调控潜在空间结构,从而在不牺牲重建质量的前提下,在CIFAR-10和Tiny ImageNet等复杂数据集上实现最先进的互独立性,并进一步通过工程化先验提升语义特征对齐能力。
链接: https://arxiv.org/abs/2510.11953
作者: Quentin Fruytier,Akshay Malhotra,Shahab Hamidi-Rad,Aditya Sant,Aryan Mokhtari,Sujay Sanghavi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Learning disentangled representations, where distinct factors of variation are captured by independent latent variables, is a central goal in machine learning. The dominant approach has been the Variational Autoencoder (VAE) framework, which uses a Kullback-Leibler (KL) divergence penalty to encourage the latent space to match a factorized Gaussian prior. In this work, however, we provide direct evidence that this KL-based regularizer is an unreliable mechanism, consistently failing to enforce the target distribution on the aggregate posterior. We validate this and quantify the resulting entanglement using our novel, unsupervised Latent Predictability Score (LPS). To address this failure, we introduce the Programmable Prior Framework, a method built on the Maximum Mean Discrepancy (MMD). Our framework allows practitioners to explicitly sculpt the latent space, achieving state-of-the-art mutual independence on complex datasets like CIFAR-10 and Tiny ImageNet without the common reconstruction trade-off. Furthermore, we demonstrate how this programmability can be used to engineer sophisticated priors that improve alignment with semantically meaningful features. Ultimately, our work provides a foundational tool for representation engineering, opening new avenues for model identifiability and causal reasoning.
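可编程先验的核心是 MMD 正则项;以下给出 RBF 核下有偏 MMD² 估计的最小实现,先验可替换为任意可采样分布(变量名与维度均为示意):

```python
import torch

def rbf_mmd2(x, y, sigma=1.0):
    # 有偏 MMD^2 估计:E[k(x,x')] + E[k(y,y')] - 2E[k(x,y)],k 为 RBF 核
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# 将聚合后验样本 z 拉向可编程先验(此处以因子化高斯为例,可任意替换)
z = torch.randn(256, 8, requires_grad=True)   # 编码器输出的占位样本
prior = torch.randn(256, 8)                    # 目标先验的样本
loss = rbf_mmd2(z, prior)
loss.backward()                                # 作为正则项加入重建损失即可
```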
zh
[AI-75] Indoor Localization using Compact Telemetry-Agnostic Transfer-Learning Enabled Decoder-Only Transformer
【速读】:该论文旨在解决室内Wi-Fi定位中因环境动态变化、信道传播特性及硬件异构性导致的高敏感性问题,传统指纹匹配和模型驱动方法通常需要大量人工校准,且在设备、信道或部署条件变更时性能迅速下降。解决方案的关键在于提出Locaris——一种基于解码器-only架构的大语言模型(Large Language Model, LLM),将每个接入点(Access Point, AP)的测量值视为一个“token”,从而直接处理未经预处理的原始Wi-Fi遥测数据;通过在不同Wi-Fi数据集上微调,Locaris学习从原始信号到设备位置的轻量级、可泛化的映射关系,实现无需校准的回归建模,在异构Wi-Fi部署中展现出良好的跨环境鲁棒性和扩展性,尤其在仅需少量校准点(few-shot)的情况下仍能保持亚米级精度。
链接: https://arxiv.org/abs/2510.11926
作者: Nayan Sanjay Bhatia,Pranay Kocheta,Russell Elliott,Harikrishna S. Kuttivelil,Katia Obraczka
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 11 pages, 12 Figures
Abstract:Indoor Wi-Fi positioning remains a challenging problem due to the high sensitivity of radio signals to environmental dynamics, channel propagation characteristics, and hardware heterogeneity. Conventional fingerprinting and model-based approaches typically require labor-intensive calibration and suffer rapid performance degradation when devices, channel or deployment conditions change. In this paper, we introduce Locaris, a decoder-only large language model (LLM) for indoor localization. Locaris treats each access point (AP) measurement as a token, enabling the ingestion of raw Wi-Fi telemetry without pre-processing. By fine-tuning its LLM on different Wi-Fi datasets, Locaris learns a lightweight and generalizable mapping from raw signals directly to device location. Our experimental study comparing Locaris with state-of-the-art methods consistently shows that Locaris matches or surpasses existing techniques for various types of telemetry. Our results demonstrate that compact LLMs can serve as calibration-free regression models for indoor localization, offering scalable and robust cross-environment performance in heterogeneous Wi-Fi deployments. Few-shot adaptation experiments, using only a handful of calibration points per device, further show that Locaris maintains high accuracy when applied to previously unseen devices and deployment scenarios. This yields sub-meter accuracy with just a few hundred samples, robust performance under missing APs, and support for all available telemetry. Our findings highlight the practical viability of Locaris for indoor positioning in real-world scenarios, particularly in large-scale deployments where extensive calibration is infeasible.
zh
[AI-76] Integrating Sequential and Relational Modeling for User Events: Datasets and Prediction Tasks
【速读】:该论文旨在解决现有用户事件建模方法中对个人事件(personal events)与关系事件(relational events)分离处理的问题。当前主流方法通常仅采用序列建模(如RNN、Transformer)处理个人事件,或使用图神经网络(Graph Neural Networks, GNNs)处理关系事件,而忽视了二者在真实系统中的共存与交互。其关键解决方案是提出一种统一的形式化框架,将两类事件整合到同一建模体系中,并构建包含两者标注的公开数据集,实验证明联合建模可显著提升预测性能,表明当前模型仍有较大改进空间。
链接: https://arxiv.org/abs/2510.11903
作者: Rizal Fathony,Igor Melnyk,Owen Reinert,Nam H. Nguyen,Daniele Rosa,C. Bayan Bruss
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:User event modeling plays a central role in many machine learning applications, with use cases spanning e-commerce, social media, finance, cybersecurity, and other domains. User events can be broadly categorized into personal events, which involve individual actions, and relational events, which involve interactions between two users. These two types of events are typically modeled separately, using sequence-based methods for personal events and graph-based methods for relational events. Despite the need to capture both event types in real-world systems, prior work has rarely considered them together. This is often due to the convenient simplification that user behavior can be adequately represented by a single formalization, either as a sequence or a graph. To address this gap, there is a need for public datasets and prediction tasks that explicitly incorporate both personal and relational events. In this work, we introduce a collection of such datasets, propose a unified formalization, and empirically show that models benefit from incorporating both event types. Our results also indicate that current methods leave a notable room for improvements. We release these resources to support further research in unified user event modeling and encourage progress in this direction.
zh
[AI-77] Countermind: A Multi-Layered Security Architecture for Large Language Models
【速读】:该论文旨在解决大型语言模型(Large Language Model, LLM)应用在面对“表单优先”攻击(如提示注入和越狱攻击)时的安全性问题,此类攻击通过将恶意指令嵌入用户输入中,利用模型对输入内容缺乏区分能力的缺陷实现攻击。传统防御方法依赖输出端的事后过滤,往往脆弱且无法根除风险。其解决方案的关键在于提出一种多层安全架构 Countermind,从被动响应转向主动、预推理及推理过程中的防御机制:核心包括语义边界逻辑(Semantic Boundary Logic, SBL)用于结构化验证与加密输入以降低明文提示注入攻击面;参数空间限制(Parameter-Space Restriction, PSR)机制动态控制模型对内部语义簇的访问,缓解语义漂移与危险行为涌现;一个基于OODA环和不可变审计日志的自适应安全核心,实现持续学习与防御调整;以及多模态输入沙箱与上下文防御机制,应对非文本数据和长期语义污染威胁。
链接: https://arxiv.org/abs/2510.11837
作者: Dominik Schwarz
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 33 pages, 3 figures, 6 tables. Keywords: LLM security; defense-in-depth; prompt injection; activation steering; multimodal sandbox; threat modeling
Abstract:The security of Large Language Model (LLM) applications is fundamentally challenged by “form-first” attacks like prompt injection and jailbreaking, where malicious instructions are embedded within user inputs. Conventional defenses, which rely on post hoc output filtering, are often brittle and fail to address the root cause: the model’s inability to distinguish trusted instructions from untrusted data. This paper proposes Countermind, a multi-layered security architecture intended to shift defenses from a reactive, post hoc posture to a proactive, pre-inference, and intra-inference enforcement model. The architecture proposes a fortified perimeter designed to structurally validate and transform all inputs, and an internal governance mechanism intended to constrain the model’s semantic processing pathways before an output is generated. The primary contributions of this work are conceptual designs for: (1) A Semantic Boundary Logic (SBL) with a mandatory, time-coupled Text Crypter intended to reduce the plaintext prompt injection attack surface, provided all ingestion paths are enforced. (2) A Parameter-Space Restriction (PSR) mechanism, leveraging principles from representation engineering, to dynamically control the LLM’s access to internal semantic clusters, with the goal of mitigating semantic drift and dangerous emergent behaviors. (3) A Secure, Self-Regulating Core that uses an OODA loop and a learning security module to adapt its defenses based on an immutable audit log. (4) A Multimodal Input Sandbox and Context-Defense mechanisms to address threats from non-textual data and long-term semantic poisoning. This paper outlines an evaluation plan designed to quantify the proposed architecture’s effectiveness in reducing the Attack Success Rate (ASR) for form-first attacks and to measure its potential latency overhead.
zh
[AI-78] Combining Euclidean and Hyperbolic Representations for Node-level Anomaly Detection
【速读】:该论文旨在解决节点级异常检测(Node-level Anomaly Detection, NAD)中因结构模式多样性和特征分布复杂而导致的识别困难问题,其应用场景涵盖欺诈检测、网络安全和推荐系统等。解决方案的关键在于提出Janus框架,该框架通过联合利用欧几里得(Euclidean)与双曲(Hyperbolic)图神经网络(Graph Neural Networks, GNNs),从原始特征和由随机游走及度数导出的结构特征构建节点的双视图表示,并将其分别嵌入到欧几里得空间与双曲空间中;进一步采用多图自编码器(multi Graph-Autoencoder)架构并引入对比学习作为正则项,实现跨空间嵌入对齐,从而突出那些在两个几何空间中难以一致匹配的节点——这些节点即为潜在的异常节点。实验证明,该方法在四个真实世界数据集上均显著优于浅层与深度基线模型,验证了多几何表示融合在识别复杂异常中的有效性与鲁棒性。
链接: https://arxiv.org/abs/2510.11827
作者: Simone Mungari,Ettore Ritacco,Pietro Sabatino
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Node-level anomaly detection (NAD) is challenging due to diverse structural patterns and feature distributions. As such, NAD is a critical task with several applications which range from fraud detection, cybersecurity, to recommendation systems. We introduce Janus, a framework that jointly leverages Euclidean and Hyperbolic Graph Neural Networks to capture complementary aspects of node representations. Each node is described by two views, composed by the original features and structural features derived from random walks and degrees, then embedded into Euclidean and Hyperbolic spaces. A multi Graph-Autoencoder framework, equipped with a contrastive learning objective as regularization term, aligns the embeddings across the Euclidean and Hyperbolic spaces, highlighting nodes whose views are difficult to reconcile and are thus likely anomalous. Experiments on four real-world datasets show that Janus consistently outperforms shallow and deep baselines, empirically demonstrating that combining multiple geometric representations provides a robust and effective approach for identifying subtle and complex anomalies in graphs.
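Janus 的双曲一路建立在双曲空间的距离度量之上;作为背景,下面给出 Poincaré 球模型测地距离的最小实现(非论文代码,仅示意双曲嵌入为何擅长表达层级结构):

```python
import torch

def poincare_dist(u, v, eps=1e-6):
    # Poincaré 球模型测地距离:
    # d(u, v) = arcosh(1 + 2*|u-v|^2 / ((1-|u|^2) * (1-|v|^2)))
    sq = (u - v).pow(2).sum(-1)
    den = ((1 - u.pow(2).sum(-1)) * (1 - v.pow(2).sum(-1))).clamp_min(eps)
    return torch.acosh(1 + 2 * sq / den)

u = torch.tensor([0.1, 0.2])
v = torch.tensor([0.7, 0.5])   # 越靠近球面边界,距离增长越快,利于编码树状层级
print(poincare_dist(u, v))
```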
zh
[AI-79] Empirical Study on Robustness and Resilience in Cooperative Multi-Agent Reinforcement Learning NEURIPS2025
【速读】:该论文旨在解决多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)系统在真实世界中缺乏鲁棒性(robustness)与韧性(resilience)的问题。尽管现有方法通常在理想仿真环境中优化合作性能,但这些策略往往无法应对现实中的不确定性,导致系统稳定性下降甚至失效。解决方案的关键在于通过大规模实证研究(超过82,620次实验)系统评估不同MARL算法在多种不确定类型(13类)、环境(4个真实场景)和超参数(15个)下的合作能力、鲁棒性和韧性表现,并揭示:(1)适度扰动下优化合作可提升鲁棒性与韧性,但随扰动强度增加该关联减弱;(2)鲁棒性与韧性不具备跨不确定模态或智能体范围的泛化能力;(3)标准超参数调优实践(如参数共享、GAE、PopArt)可能损害鲁棒性,而早停、高评论者学习率和Leaky ReLU等策略则显著增强系统可靠性。最终表明,仅通过超参数优化即可大幅提升各类MARL骨干模型的合作、鲁棒与韧性性能,且该现象在鲁棒MARL方法中亦具通用性。
链接: https://arxiv.org/abs/2510.11824
作者: Simin Li,Zihao Mao,Hanxiao Li,Zonglei Jing,Zhuohang bian,Jun Guo,Li Wang,Zhuoran Han,Ruixiao Xu,Xin Yu,Chengdong Ma,Yuqing Ma,Bo An,Yaodong Yang,Weifeng Lv,Xianglong Liu
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 44 pages, 16 figures, NeurIPS 2025
Abstract:In cooperative Multi-Agent Reinforcement Learning (MARL), it is a common practice to tune hyperparameters in ideal simulated environments to maximize cooperative performance. However, policies tuned for cooperation often fail to maintain robustness and resilience under real-world uncertainties. Building trustworthy MARL systems requires a deep understanding of robustness, which ensures stability under uncertainties, and resilience, the ability to recover from disruptions – a concept extensively studied in control systems but largely overlooked in MARL. In this paper, we present a large-scale empirical study comprising over 82,620 experiments to evaluate cooperation, robustness, and resilience in MARL across 4 real-world environments, 13 uncertainty types, and 15 hyperparameters. Our key findings are: (1) Under mild uncertainty, optimizing cooperation improves robustness and resilience, but this link weakens as perturbations intensify. Robustness and resilience also vary by algorithm and uncertainty type. (2) Robustness and resilience do not generalize across uncertainty modalities or agent scopes: policies robust to action noise for all agents may fail under observation noise on a single agent. (3) Hyperparameter tuning is critical for trustworthy MARL: surprisingly, standard practices like parameter sharing, GAE, and PopArt can hurt robustness, while early stopping, high critic learning rates, and Leaky ReLU consistently help. By optimizing hyperparameters only, we observe substantial improvement in cooperation, robustness and resilience across all MARL backbones, with the phenomenon also generalizing to robust MARL methods across these backbones. Code and results available at this https URL.
zh
[AI-80] BlackIce: A Containerized Red Teaming Toolkit for AI Security Testing
【速读】:该论文旨在解决当前AI红队(AI red teaming)实践中存在的两大核心问题:一是工具选择困难,即从业者难以从快速扩展的AI安全测试工具集中筛选出最合适的工具;二是环境配置复杂,由于各项目间存在频繁冲突的软件依赖关系,导致评估流程难以标准化和复现。解决方案的关键在于提出一个名为BlackIce的开源容器化工具包,其通过一个版本锁定的Docker镜像集成14个精心挑选的开源工具,覆盖生成式AI(Generative AI)与传统机器学习(ML)模型的安全测试需求,并提供统一命令行接口,使红队评估任务可一键启动,无论本地或云端部署。此外,该方案具备模块化架构,支持社区持续扩展,以应对新兴威胁。
链接: https://arxiv.org/abs/2510.11823
作者: Caelin Kaplan,Alexander Warnecke,Neil Archibald
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:AI models are being increasingly integrated into real-world systems, raising significant concerns about their safety and security. Consequently, AI red teaming has become essential for organizations to proactively identify and address vulnerabilities before they can be exploited by adversaries. While numerous AI red teaming tools currently exist, practitioners face challenges in selecting the most appropriate tools from a rapidly expanding landscape, as well as managing complex and frequently conflicting software dependencies across isolated projects. Given these challenges and the relatively small number of organizations with dedicated AI red teams, there is a strong need to lower barriers to entry and establish a standardized environment that simplifies the setup and execution of comprehensive AI model assessments. Inspired by Kali Linux’s role in traditional penetration testing, we introduce BlackIce, an open-source containerized toolkit designed for red teaming Large Language Models (LLMs) and classical machine learning (ML) models. BlackIce provides a reproducible, version-pinned Docker image that bundles 14 carefully selected open-source tools for Responsible AI and Security testing, all accessible via a unified command-line interface. With this setup, initiating red team assessments is as straightforward as launching a container, either locally or using a cloud platform. Additionally, the image’s modular architecture facilitates community-driven extensions, allowing users to easily adapt or expand the toolkit as new threats emerge. In this paper, we describe the architecture of the container image, the process used for selecting tools, and the types of evaluations they support.
zh
[AI-81] Beyond Consensus: Mitigating the Agreeableness Bias in LLM Judge Evaluations
【速读】:该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在自动评估其他LLM输出质量时存在的系统性正向偏差问题,即LLMs作为评估者(LLM-as-a-judge)时对有效输出识别准确率高(真阳性率96%),但对无效输出识别能力极差(真阴性率仅25%),导致可靠性评分被显著高估。其解决方案的关键在于提出两种互补策略:一是引入“最优少数 veto”机制,通过少数样本的否定判断来抑制偏差并增强鲁棒性;二是设计一种基于回归的新型框架,利用少量人工标注的真实数据直接建模验证者的偏差,从而实现更高精度的评估。在366个高中Python程序代码反馈任务上的实验证明,该回归方法将最大绝对误差降至1.2%,相较最优集成方法提升2倍性能。
链接: https://arxiv.org/abs/2510.11822
作者: Suryaansh Jain,Umair Z. Ahmed,Shubham Sahai,Ben Leong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:New Large Language Models (LLMs) become available every few weeks, and modern application developers are confronted with the unenviable task of having to decide if they should switch to a new model. While human evaluation remains the gold standard, it is costly and unscalable. The state-of-the-art approach is to use LLMs as evaluators (LLM-as-a-judge), but this suffers from a critical flaw: LLMs exhibit a strong positive bias. We provide empirical evidence showing that while LLMs can identify valid outputs with high accuracy (i.e., True Positive Rate 96%), they are remarkably poor at identifying invalid ones (i.e., True Negative Rate 25%). This systematic bias, coupled with class imbalance, often leads to inflated reliability scores. While ensemble-based methods like majority voting can help, we show that they are not good enough. We introduce an optimal minority-veto strategy that is resilient to missing data and mitigates this bias to a large extent. For scenarios requiring even higher precision, we propose a novel regression-based framework that directly models the validator bias using a small set of human-annotated ground truth data. On a challenging code feedback task over 366 high-school Python programs, our regression approach reduces the maximum absolute error to just 1.2%, achieving a 2x improvement over the best-performing ensemble of 14 state-of-the-art LLMs.
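摘要未展开“最优少数否决(minority-veto)”的阈值选择与缺失值处理,下面给出一个极简示意(k 的取值与 None 的处理均为本文假设):

```python
def minority_veto(judgments, k=1):
    # judgments: 各验证者的判定(True=有效, False=无效, None=缺失数据)
    # 只要至少 k 个验证者投否决票即判为无效:利用"肯定判断偏多但否决更可信"的偏差结构
    vetoes = sum(1 for j in judgments if j is False)
    return vetoes < k            # 返回 True 表示最终接受该输出

print(minority_veto([True, True, False, None]))  # 1 票否决 -> 拒绝 (False)
print(minority_veto([True, True, True, None]))   # 无否决 -> 接受 (True)
```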
zh
[AI-82] GAR: Generative Adversarial Reinforcement Learning for Formal Theorem Proving
【速读】:该论文旨在解决当前数学定理证明模型在训练过程中存在的效率低下和泛化能力不足的问题。现有方法通常依赖于固定的问题集进行强化学习(Reinforcement Learning, RL)或专家迭代(expert iteration),导致训练资源浪费且难以应对复杂问题。其解决方案的关键在于提出一种名为GAR(Generative Adversarial Reinforcement learning)的综合强化学习框架,通过在对抗性循环中联合训练问题生成器(problem composer)与求解器(solver),引入隐式课程学习机制(implicit curriculum learning),使任务难度动态匹配求解器的能力演化过程,从而提升训练效率并增强模型对高阶定理的证明能力。
链接: https://arxiv.org/abs/2510.11769
作者: Ruida Wang,Jiarui Yao,Rui Pan,Shizhe Diao,Tong Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Solving math problems through verifiable languages such as Lean has significantly impacted both the mathematics and computer science communities. Current state-of-the-art models are often trained with expensive online Reinforcement Learning (RL) or expert iteration. However, these approaches rely on fixed problem sets, which causes inefficient training and limits the model to tackle complex problems. To overcome these limitations, we propose GAR: Generative Adversarial Reinforcement learning, a comprehensive RL training framework that jointly trains the problem composer and solver in an adversarial loop. GAR introduces an implicit curriculum learning mechanism, which aligns task difficulty with the prover’s evolving capability. It thereby improves the training efficiency and enables stronger performance of proving advanced theorems. Experiments show that with GAR training, Goedel-Prover-V2-8B and DeepSeek-Prover-V2-7B achieve an average relative improvement in pass@32 of 4.20% on MiniF2F-Test benchmark, while DeepSeek-Prover-V2’s pass@32 on ProofNet-Test increases from 22.58% to 25.81%. Beyond formal proving, GAR establishes a general RL paradigm for co-evolution of problem generation and solving under verifiable environments.
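与模型细节无关地,可用如下玩具模拟理解“出题者-求解者”对抗回路如何产生隐式课程:通过率高则加大难度,反之降低,使任务难度跟随求解者能力演化(全部更新规则均为演示假设,非 GAR 原文算法):

```python
import math
import random

def compose(difficulty, rng):
    # 出题者:围绕当前目标难度抽样一道题(用标量难度抽象代替真实定理)
    return max(0.0, rng.gauss(difficulty, 0.1))

def solve(problem, skill, rng):
    # 求解者:能力相对难度越高,通过率越大(可验证环境给出真值奖励)
    return rng.random() < 1 / (1 + math.exp(4 * (problem - skill)))

rng, skill, difficulty = random.Random(0), 0.5, 0.5
for _ in range(200):
    ok = solve(compose(difficulty, rng), skill, rng)
    skill += 0.01 if ok else 0.0            # 求解者从成功样本中学习(玩具更新)
    difficulty += 0.02 if ok else -0.02     # 对抗回路:难度跟随能力,形成隐式课程
print(f"final skill={skill:.2f}, difficulty={difficulty:.2f}")
```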
zh
[AI-83] AwareCompiler: Agent ic Context-Aware Compiler Optimization via a Synergistic Knowledge-Data Driven Framework
【速读】:该论文旨在解决生成式 AI (Generative AI) 在编译器优化中应用时面临的三大挑战:(1)抽象程序表示与具体优化步骤之间的语义不对齐;(2)智能体与编译环境之间低效的交互机制;(3)在大规模优化空间中决策过程导致的奖励稀疏性。解决方案的关键在于提出一个名为 AwareCompiler 的代理框架,其核心创新包括:结构化知识集成与数据集构建、基于知识的自适应优化步骤生成,以及数据驱动的混合训练流程,从而实现知识与数据协同驱动的高效编译优化。
链接: https://arxiv.org/abs/2510.11759
作者: Hongyu Lin,Haolin Pan,Haoran Luo,Yuchen Li,Kaichun Yao,Libo Zhang,Mingjie Xing,Yanjun Wu
机构: 未知
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
备注:
Abstract:Compiler optimization is crucial for enhancing program performance by transforming the sequence of optimization passes while maintaining correctness. Despite the promising potential of large language model (LLM)-based agents for software optimization, automating compiler optimization remains challenging due to: (1) semantic misalignment between abstract program representations and concrete optimization passes, (2) inefficient interaction mechanisms between agents and compiler environments, and (3) reward sparsity from the extensive decision-making process within large optimization spaces. This paper introduces AwareCompiler, an agentic framework for compiler optimization that addresses these challenges through three key innovations: structured knowledge integration and dataset construction, knowledge-driven adaptive pass generation, and a data-driven hybrid training pipeline. Experimental results on standard benchmarks demonstrate that AwareCompiler significantly outperforms existing baselines in both performance and efficiency, highlighting the effectiveness of our synergistic knowledge-data-driven approach. Our code is publicly available at this https URL.
zh
[AI-84] he Adoption Paradox: A Comparative Analysis of Veterinary AI Adoption in China and the North America
【速读】:该论文试图解决的问题是:不同地区 veterinary professionals(兽医专业人员)在人工智能(Artificial Intelligence, AI)的感知、采纳与应用上存在显著差异,这种差异是否由区域市场和人口统计因素所驱动,以及如何基于这些差异制定有效的AI整合策略。解决方案的关键在于识别出中国与北美(NA)兽医群体之间存在的“采纳悖论”——即中国兽医尽管对AI熟悉度较低,但临床采纳率高,聚焦于提升诊疗效率;而北美兽医虽熟悉度高,采纳率低,更关注行政流程优化。研究指出,应摒弃全球统一的AI开发与部署模式,转而采用因地制宜、区域定制化的策略,以负责任的方式推动AI在兽医实践中的有效整合。
链接: https://arxiv.org/abs/2510.11758
作者: Shumin Li,Xiaoyun Lai
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 1 Table, 5 Figures (included in the end), Full questionnaire used in this study (both original Chinese version and translated/English version included in the end)
Abstract:This study compares the perception, adoption, and application of artificial intelligence (AI) among veterinary professionals in China and North America (NA), testing the hypothesis that adoption patterns are shaped by regional market and demographic factors. A descriptive, cross-sectional survey was conducted with 455 veterinary professionals in China between May and July 2025. The results were compared with published data from a 2024 survey of 3,968 veterinary professionals in the United States and Canada. The Chinese cohort, primarily composed of clinicians (81.5%), showed a high AI adoption rate (71.0%) despite low familiarity (55.4%). Their AI use was focused on clinical tasks, such as disease diagnosis (50.1%) and prescription calculation (44.8%). In contrast, the NA cohort reported high familiarity (83.8%) but a lower adoption rate (39.2%). Their priorities were administrative, including imaging analysis (39.0%) and record-keeping (39.0%). Concerns about AI reliability and accuracy were the top barrier in both groups. Our findings reveal an “adoption paradox” where the Chinese market demonstrates a practitioner-driven, bottom-up adoption model focused on augmenting clinical efficacy, while the NA market shows a more cautious, structured, top-down integration aimed at improving administrative efficiency. This suggests that a one-size-fits-all approach to AI development and integration is insufficient, and tailored, region-specific strategies are necessary to responsibly incorporate AI into global veterinary practice.
zh
[AI-85] Artificial Intelligence for Optimal Learning: A Comparative Approach towards AI-Enhanced Learning Environments
【速读】:该论文试图解决的问题是如何在教育实践中有效整合不同技术层级(传统教学、非人工智能技术辅助教学和人工智能驱动教学)以优化学习效果、提升参与度并促进教育资源的公平获取。其解决方案的关键在于构建一种融合式教育框架,即结合传统课堂中的人际互动与成熟教学法、非AI技术带来的资源可及性与协作工具,以及AI技术所支持的个性化与自适应学习策略,从而形成一个更全面、高效且包容性强的学习环境。
链接: https://arxiv.org/abs/2510.11755
作者: Ananth Hariharan
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:In the rapidly evolving educational landscape, the integration of technology has shifted from an enhancement to a cornerstone of educational strategy worldwide. This transition is propelled by advancements in digital technology, especially the emergence of artificial intelligence as a crucial tool in learning environments. This research project critically evaluates the impact of three distinct educational settings: traditional educational methods without technological integration, those enhanced by non-AI technology, and those utilising AI-driven technologies. This comparison aims to assess how each environment influences educational outcomes, engagement, pedagogical methods, and equity in access to learning resources, and how each contributes uniquely to the learning experience. The ultimate goal of this research is to synthesise the strengths of each model to create a more holistic educational approach. By integrating the personal interaction and tested pedagogical techniques of traditional classrooms, the enhanced accessibility and collaborative tools offered by non-AI technology, and the personalised, adaptive learning strategies enabled by AI-driven technologies, education systems can develop richer, more effective learning environments. This hybrid approach aims to leverage the best elements of each setting, thereby enhancing educational outcomes, engagement, and inclusiveness, while also addressing the distinct challenges and limitations inherent in each model. The intention is to create an educational framework deeply attentive to the diverse needs of students, ensuring equitable access to high-quality education for all.
zh
[AI-86] AI Agents for the Dhumbal Card Game: A Comparative Study
【速读】:该论文旨在解决如何在不完全信息条件下设计和评估人工智能(AI)代理以有效参与Dhumbal这一具有文化意义的多人纸牌游戏的问题。其核心挑战在于应对游戏中信息不对称带来的决策复杂性,同时确保AI代理在策略多样性下具备可比性和性能优势。解决方案的关键在于构建一个系统化的AI代理比较框架,涵盖基于规则、搜索和学习三类方法:包括启发式策略(如激进、保守等)、蒙特卡洛树搜索(MCTS)及其信息集变体(ISMCTS),以及深度Q网络(DQN)与近端策略优化(PPO)等强化学习方法,并通过多轮模拟比赛(共1024局)进行量化评估,最终发现规则驱动的激进型代理在胜率(88.3%)和Jhyap声明利用效率上显著优于其他方法,揭示了启发式策略在部分信息场景下的有效性。
链接: https://arxiv.org/abs/2510.11736
作者: Sahaj Raj Malla
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
备注: 10 pages, 7 figures, 6 tables
Abstract:This study evaluates Artificial Intelligence (AI) agents for Dhumbal, a culturally significant multiplayer card game with imperfect information, through a systematic comparison of rule-based, search-based, and learning-based strategies. We formalize Dhumbal’s mechanics and implement diverse agents, including heuristic approaches (Aggressive, Conservative, Balanced, Opportunistic), search-based methods such as Monte Carlo Tree Search (MCTS) and Information Set Monte Carlo Tree Search (ISMCTS), and reinforcement learning approaches including Deep Q-Network (DQN) and Proximal Policy Optimization (PPO), and a random baseline. Evaluation involves within-category tournaments followed by a cross-category championship. Performance is measured via win rate, economic outcome, Jhyap success, cards discarded per round, risk assessment, and decision efficiency. Statistical significance is assessed using Welch’s t-test with Bonferroni correction, effect sizes via Cohen’s d, and 95% confidence intervals (CI). Across 1024 simulated rounds, the rule-based Aggressive agent achieves the highest win rate (88.3%, 95% CI: [86.3, 90.3]), outperforming ISMCTS (9.0%) and PPO (1.5%) through effective exploitation of Jhyap declarations. The study contributes a reproducible AI framework, insights into heuristic efficacy under partial information, and open-source code, thereby advancing AI research and supporting digital preservation of cultural games.
zh
[AI-87] Serial-Parallel Dual-Path Architecture for Speaking Style Recognition
【速读】:该论文旨在解决语音风格识别(Speaking Style Recognition, SSR)中因主要依赖语言信息而忽视声学信息导致的识别准确率受限问题。现有方法未能充分融合声学与语言模态信息,限制了性能提升。其解决方案的关键在于提出一种新颖的串行-并行双路径架构,其中串行路径遵循ASR+STYLE的序列范式以捕捉时序依赖关系,而并行路径则引入设计的声学-语言相似性模块(Acoustic-Linguistic Similarity Module, ALSM),实现跨模态的时间同步交互,从而有效融合多模态信息。实验表明,该方法在参数量减少88.4%的同时,SSR准确率相较基线OSUM模型提升30.3%。
链接: https://arxiv.org/abs/2510.11732
作者: Guojian Li,Qijie Shao,Zhixian Zhao,Shuiyuan Wang,Zhonghua Fu,Lei Xie
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Accepted by NCMMSC2025
Abstract:Speaking Style Recognition (SSR) identifies a speaker’s speaking style characteristics from speech. Existing style recognition approaches primarily rely on linguistic information, with limited integration of acoustic information, which restricts recognition accuracy improvements. The fusion of acoustic and linguistic modalities offers significant potential to enhance recognition performance. In this paper, we propose a novel serial-parallel dual-path architecture for SSR that leverages acoustic-linguistic bimodal information. The serial path follows the ASR+STYLE serial paradigm, reflecting a sequential temporal dependency, while the parallel path integrates our designed Acoustic-Linguistic Similarity Module (ALSM) to facilitate cross-modal interaction with temporal simultaneity. Compared to the existing SSR baseline – the OSUM model, our approach reduces parameter size by 88.4% and achieves a 30.3% improvement in SSR accuracy for eight styles on the test set.
zh
[AI-88] Modeling Hypergraph Using Large Language Models
【速读】:该论文旨在解决当前高阶图(hypergraph)数据在规模和多样性上的匮乏问题,这严重制约了高级高阶图学习算法的发展与评估。现有真实世界高阶图数据集稀缺,难以支撑大规模、高质量的模型训练与验证。为应对这一挑战,作者提出HyperLLM——一种基于大语言模型(Large Language Models, LLMs)驱动的高阶图生成框架。其核心创新在于利用LLM在语义推理、结构化生成及模拟人类行为方面的优势,通过多智能体协作机制模拟高阶图的形成与演化过程,并结合提示工程(prompt engineering)与结构反馈机制,确保生成的高阶图忠实反映现实网络的关键结构与时间模式。实验表明,该方法在无需大量统计先验知识的情况下即可生成高保真度的高阶图,为高阶图建模提供了一种全新的、可扩展的解决方案。
链接: https://arxiv.org/abs/2510.11728
作者: Bingqiao Gu,Jiale Zeng,Xingqin Qi,Dong Li
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures
Abstract:Due to the advantages of hypergraphs in modeling high-order relationships in complex systems, they have been applied to higher-order clustering, hypergraph neural networks and computer vision. These applications rely heavily on access to high-quality, large-scale real-world hypergraph data. Yet, compared to traditional pairwise graphs, real hypergraph datasets remain scarce in both scale and diversity. This shortage significantly limits the development and evaluation of advanced hypergraph learning algorithms. Therefore, how to quickly generate large-scale hypergraphs that conform to the characteristics of real networks is a crucial task that has not received sufficient attention. Motivated by recent advances in large language models (LLMs), particularly their capabilities in semantic reasoning, structured generation, and simulating human behavior, we investigate whether LLMs can facilitate hypergraph generation from a fundamentally new perspective. We introduce HyperLLM, a novel LLM-driven hypergraph generator that simulates the formation and evolution of hypergraphs through a multi-agent collaboration. The framework integrates prompts and structural feedback mechanisms to ensure that the generated hypergraphs reflect key real-world patterns. Extensive experiments across diverse datasets demonstrate that HyperLLM achieves superior fidelity to structural and temporal hypergraph patterns, while requiring minimal statistical priors. Our findings suggest that LLM-based frameworks offer a promising new direction for hypergraph modeling.
zh
[AI-89] Dual Perspectives on Non-Contrastive Self-Supervised Learning
【速读】:该论文旨在解决自监督学习中非对比方法(non-contrastive approaches)因表示崩溃(representation collapse)而导致性能下降的问题。其解决方案的关键在于从优化理论和动力系统两个视角分析“停止梯度”(stop gradient)与“指数移动平均”(exponential moving average, EMA)迭代过程的作用机制:尽管这两个操作并不直接优化原始目标函数或任何平滑目标函数,但它们能有效避免表示崩溃;在线性设定下,作者进一步通过动力系统建模证明,若不使用停止梯度或EMA,则最小化原始目标必然导致崩溃;同时,他们明确刻画了这两种策略对应的动态系统的平衡点为参数空间中的代数簇(algebraic varieties),并证明这些平衡点通常具有渐近稳定性,从而从理论上解释了其有效性。
链接: https://arxiv.org/abs/2507.01028
作者: Jean Ponce(ENS-PSL, NYU),Basile Terver(FAIR, WILLOW),Martial Hebert(CMU),Michael Arbel(Thoth)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The stop gradient and exponential moving average iterative procedures are commonly used in non-contrastive approaches to self-supervised learning to avoid representation collapse, with excellent performance in downstream applications in practice. This presentation investigates these procedures from the dual viewpoints of optimization and dynamical systems. We show that, in general, although they do not optimize the original objective, or any other smooth function, they do avoid collapse. Following Tian et al. (2021), but without any of the extra assumptions used in their proofs, we then show using a dynamical system perspective that, in the linear case, minimizing the original objective function without the use of a stop gradient or exponential moving average always leads to collapse. Conversely, we characterize explicitly the equilibria of the dynamical systems associated with these two procedures in this linear setting as algebraic varieties in their parameter space, and show that they are, in general, asymptotically stable. Our theoretical findings are illustrated by empirical experiments with real and synthetic data.
zh
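结合该摘要,下面用一段极简的 PyTorch 草图演示非对比自监督中的 stop gradient 与 EMA 两个操作(BYOL/SimSiam 风格)。编码器用线性层代替,维度、tau 等均为本文为演示而设的假设,并非论文实现:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

online = nn.Linear(32, 16)                   # 在线编码器(示意)
target = nn.Linear(32, 16)                   # 目标编码器,参数由 EMA 维护
target.load_state_dict(online.state_dict())
predictor = nn.Linear(16, 16)                # 预测头

def ema_update(tau=0.99):
    # 指数移动平均:目标分支缓慢跟随在线分支
    with torch.no_grad():
        for pt, po in zip(target.parameters(), online.parameters()):
            pt.mul_(tau).add_(po, alpha=1 - tau)

x1, x2 = torch.randn(8, 32), torch.randn(8, 32)  # 同一批样本的两个增强视图(此处以随机数代替)
p1 = predictor(online(x1))
with torch.no_grad():                        # stop gradient:目标分支不回传梯度
    z2 = target(x2)
loss = -F.cosine_similarity(p1, z2, dim=-1).mean()
loss.backward()                              # 梯度只流经在线分支与预测头
ema_update()                                 # 参数更新后再做 EMA
```

正如摘要所言,这一迭代并不在优化任何单一的光滑目标函数,但在实践中能避免表示坍缩。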
[AI-90] Leveraging LLMs, IDEs, and Semantic Embeddings for Automated Move Method Refactoring
【速读】:该论文旨在解决方法移动(MOVEMETHOD)重构中自动化推荐与执行的难题,特别是现有基于大型语言模型(Large Language Models, LLMs)的工具虽能提供专家级建议,但存在高达80%的幻觉问题,导致推荐不可靠。解决方案的关键在于提出名为MM-assist的端到端自动化重构助手,其核心创新包括:(1) 利用IDE静态分析自动过滤LLM输出中的幻觉;(2) 设计自一致性校验、批判性评估与排序的工作流以提升建议质量;(3) 通过重构感知的检索增强生成(refactoring-aware retrieval augmented generation, RAG)缓解LLM上下文长度限制,实现项目级全局推理。该方案协同整合了LLM、IDE、静态分析和语义相关性优势,在多个基准和真实开源项目上显著优于现有方法,并在用户研究中获得高接受度。
链接: https://arxiv.org/abs/2503.20934
作者: Fraol Batole,Abhiram Bellur,Malinda Dilhara,Mohammed Raihan Ullah,Yaroslav Zharov,Timofey Bryksin,Kai Ishikawa,Haifeng Chen,Masaharu Morimoto,Shota Motoura,Takeo Hosomi,Tien N. Nguyen,Hridesh Rajan,Nikolaos Tsantalis,Danny Dig
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 12 pages, 2 figures
Abstract:MOVEMETHOD is a hallmark refactoring. Despite a plethora of research tools that recommend which methods to move and where, these recommendations do not align with how expert developers perform MOVEMETHOD. Given the extensive training of Large Language Models and their reliance upon naturalness of code, they should expertly recommend which methods are misplaced in a given class and which classes are better hosts. Our formative study of 2016 LLM recommendations revealed that LLMs give expert suggestions, yet they are unreliable: up to 80% of the suggestions are hallucinations. We introduce the first LLM fully powered assistant for MOVEMETHOD refactoring that automates its whole end-to-end lifecycle, from recommendation to execution. We designed novel solutions that automatically filter LLM hallucinations using static analysis from IDEs and a novel workflow that requires LLMs to be self-consistent, critique, and rank refactoring suggestions. As MOVEMETHOD refactoring requires global, project-level reasoning, we solved the limited context size of LLMs by employing refactoring-aware retrieval-augmented generation (RAG). Our approach, MM-assist, synergistically combines the strengths of the LLM, IDE, static analysis, and semantic relevance. In our thorough, multi-methodology empirical evaluation, we compare MM-assist with the previous state-of-the-art approaches. MM-assist significantly outperforms them: (i) on a benchmark widely used by other researchers, our Recall@1 and Recall@3 show a 1.7x improvement; (ii) on a corpus of 210 recent refactorings from Open-source software, our Recall rates improve by at least 2.4x. Lastly, we conducted a user study with 30 experienced participants who used MM-assist to refactor their own code for one week. They rated 82.8% of MM-assist recommendations positively. This shows that MM-assist is both effective and useful.
zh
[AI-91] Disentangling Neurodegeneration with Brain Age Gap Prediction Models: A Graph Signal Processing Perspective
【速读】:该论文旨在解决传统神经退行性病变评估方法(如基于结构磁共振成像的皮层厚度或脑体积变化)在统计学上缺乏足够复杂性,难以充分捕捉神经退行性变的空间相关性和异质性的问题。其解决方案的关键在于引入脑年龄差预测(Brain Age Gap Prediction, BAGP)模型,并提出一种基于图信号处理(Graph Signal Processing, GSP)的原理性框架,特别是利用解剖协方差矩阵构建的协方差神经网络(coVariance Neural Network, VNN),以实现对脑年龄差的稳健估计。VNN结合了图神经网络(Graph Neural Networks, GNNs)的优势,具备强理论基础和可解释性,从而提升了BAGP模型在不同临床人群中的泛化能力和可靠性,推动其在个性化医疗中的应用。
链接: https://arxiv.org/abs/2510.12763
作者: Saurabh Sihag,Gonzalo Mateos,Alejandro Ribeiro
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: Accepted for publication in IEEE Signal Processing Magazine
Abstract:Neurodegeneration, characterized by the progressive loss of neuronal structure or function, is commonly assessed in clinical practice through reductions in cortical thickness or brain volume, as visualized by structural MRI. While informative, these conventional approaches lack the statistical sophistication required to fully capture the spatially correlated and heterogeneous nature of neurodegeneration, which manifests both in healthy aging and in neurological disorders. To address these limitations, brain age gap has emerged as a promising data-driven biomarker of brain health. The brain age gap prediction (BAGP) models estimate the difference between a person’s predicted brain age from neuroimaging data and their chronological age. The resulting brain age gap serves as a compact biomarker of brain health, with recent studies demonstrating its predictive utility for disease progression and severity. However, practical adoption of BAGP models is hindered by their methodological obscurities and limited generalizability across diverse clinical populations. This tutorial article provides an overview of BAGP and introduces a principled framework for this application based on recent advancements in graph signal processing (GSP). In particular, we focus on graph neural networks (GNNs) and introduce the coVariance neural network (VNN), which leverages the anatomical covariance matrices derived from structural MRI. VNNs offer strong theoretical grounding and operational interpretability, enabling robust estimation of brain age gap predictions. By integrating perspectives from GSP, machine learning, and network neuroscience, this work clarifies the path forward for reliable and interpretable BAGP models and outlines future research directions in personalized medicine.
zh
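摘要中的 coVariance 神经网络(VNN)本质上是把多项式图滤波器作用在解剖协方差矩阵上。下面是单层 VNN 的极简 NumPy 草图,数据规模、滤波器阶数与系数均为本文演示用的假设:

```python
import numpy as np

def vnn_layer(C, X, h):
    """协方差神经网络(VNN)单层的示意:
    输出 = tanh( sum_k h[k] * C^k @ X ),其中 C 为解剖协方差矩阵。"""
    out = np.zeros_like(X)
    Ck_X = X.copy()
    for k, hk in enumerate(h):
        if k > 0:
            Ck_X = C @ Ck_X          # 递推计算 C^k X,避免显式矩阵幂
        out += hk * Ck_X
    return np.tanh(out)

rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 68))       # 100 名受试者、68 个皮层区域的厚度(假设数据)
C = np.cov(Z, rowvar=False)          # 解剖协方差矩阵
X = Z.T                              # 每列为一名受试者的区域信号
Y = vnn_layer(C, X, h=[0.5, 0.3, 0.2])
```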
[AI-92] Artificial intelligence for simplified patient-centered dosimetry in radiopharmaceutical therapies
【速读】:该论文旨在解决放射性药物治疗(Radiopharmaceutical Therapy, RPT)中个性化与患者友好型剂量计算(Patient-friendly Dosimetry)的迫切需求,尤其是在当前剂量计算方法存在效率低、标准化程度差等关键局限的问题。其解决方案的关键在于利用人工智能(Artificial Intelligence, AI)技术,通过简化和优化剂量估算流程,提升RPT过程中剂量评估的精准性与可及性,从而推动个体化治疗的实现。
链接: https://arxiv.org/abs/2510.12714
作者: Alejandro Lopez-Montes,Fereshteh Yousefirizi,Yizhou Chen,Yazdan Salimi,Robert Seifert,Ali Afshar-Oromieh,Carlos Uribe,Axel Rominger,Habib Zaidi,Arman Rahmim,Kuangyu Shi
机构: 未知
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph)
备注:
Abstract:KEY WORDS: Artificial Intelligence (AI), Theranostics, Dosimetry, Radiopharmaceutical Therapy (RPT), Patient-friendly dosimetry. KEY POINTS: (1) The rapid evolution of radiopharmaceutical therapy (RPT) highlights the growing need for personalized and patient-centered dosimetry. (2) Artificial Intelligence (AI) offers solutions to the key limitations in current dosimetry calculations. (3) The main advances on AI for simplified dosimetry toward patient-friendly RPT are reviewed. (4) Future directions on the role of AI in RPT dosimetry are discussed.
zh
[AI-93] Phenome-Wide Multi-Omics Integration Uncovers Distinct Archetypes of Human Aging
【速读】:该论文旨在解决传统基于单一组学数据的衰老时钟(aging clock)难以全面刻画人类衰老分子复杂性的问题,从而提升对生理衰退和疾病风险预测的准确性。其解决方案的关键在于利用包含临床、行为、环境及多组学(转录组学、脂质组学、代谢组学和微生物组)数据的大规模纵向队列(Human Phenotype Project,n=12,000),结合能够建模非线性生物动态的先进机器学习框架,构建并验证了一个多组学衰老时钟。该模型不仅显著提升了健康结局和未来疾病风险的预测能力,还通过无监督聚类揭示了衰老的生物学亚型及其特异性通路改变,为精准干预年龄相关疾病提供了理论基础与技术路径。
链接: https://arxiv.org/abs/2510.12384
作者: Huifa Li,Feilong Tang,Haochen Xue,Yulong Li,Xinlin Zhuang,Bin Zhang,Eran Segal,Imran Razzak
机构: 未知
类目: Genomics (q-bio.GN); Artificial Intelligence (cs.AI)
备注:
Abstract:Aging is a highly complex and heterogeneous process that progresses at different rates across individuals, making biological age (BA) a more accurate indicator of physiological decline than chronological age. While previous studies have built aging clocks using single-omics data, they often fail to capture the full molecular complexity of human aging. In this work, we leveraged the Human Phenotype Project, a large-scale cohort of 12,000 adults aged 30–70 years, with extensive longitudinal profiling that includes clinical, behavioral, environmental, and multi-omics datasets – spanning transcriptomics, lipidomics, metabolomics, and the microbiome. By employing advanced machine learning frameworks capable of modeling nonlinear biological dynamics, we developed and rigorously validated a multi-omics aging clock that robustly predicts diverse health outcomes and future disease risk. Unsupervised clustering of the integrated molecular profiles from multi-omics uncovered distinct biological subtypes of aging, revealing striking heterogeneity in aging trajectories and pinpointing pathway-specific alterations associated with different aging patterns. These findings demonstrate the power of multi-omics integration to decode the molecular landscape of aging and lay the groundwork for personalized healthspan monitoring and precision strategies to prevent age-related diseases.
zh
[AI-94] LiteVPNet: A Lightweight Network for Video Encoding Control in Quality-Critical Applications
【速读】:该论文旨在解决电影制作生态系统中视频流媒体技术在新工作流程(如现场虚拟制作)中对精确质量控制和能效要求的挑战,现有转码方法往往因缺乏质量控制或计算开销过大而难以满足需求。解决方案的关键在于提出一种轻量级神经网络 LiteVPNet,通过低复杂度特征(包括比特流特性、视频复杂度指标及基于 CLIP 的语义嵌入)来准确预测 NVENC AV1 编码器的量化参数(Quantisation Parameters),从而实现指定 VMAF 分数的目标。实验表明,LiteVPNet 在多种质量目标下均能将平均 VMAF 误差控制在 1.2 分以内,且超过 87% 的测试样本误差在 2 分以内,显著优于当前最优方法(约 61%)。
链接: https://arxiv.org/abs/2510.12379
作者: Vibhoothi Vibhoothi,François Pitié,Anil Kokaram
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: Accepted PCS 2025 Camera-Ready Version, 5 Pages
Abstract:In the last decade, video workflows in the cinema production ecosystem have presented new use cases for video streaming technology. These new workflows, e.g. in On-set Virtual Production, present the challenge of requiring precise quality control and energy efficiency. Existing approaches to transcoding often fall short of these requirements, either due to a lack of quality control or computational overhead. To fill this gap, we present a lightweight neural network (LiteVPNet) for accurately predicting Quantisation Parameters for NVENC AV1 encoders that achieve a specified VMAF score. We use low-complexity features, including bitstream characteristics, video complexity measures, and CLIP-based semantic embeddings. Our results demonstrate that LiteVPNet achieves mean VMAF errors below 1.2 points across a wide range of quality targets. Notably, LiteVPNet achieves VMAF errors within 2 points for over 87% of our test corpus, c.f. approx 61% with state-of-the-art methods. LiteVPNet’s performance across various quality regions highlights its applicability for enhancing high-value content transport and streaming for more energy-efficient, high-quality media experiences.
zh
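按摘要的思路,LiteVPNet 可以理解为一个由低复杂度特征回归 QP 的轻量网络。下面是一个示意性的 PyTorch 草图,特征拼接方式、网络宽度与 QP 取值范围均为本文假设,并非论文的真实结构:

```python
import torch
import torch.nn as nn

class QPPredictor(nn.Module):
    """极简示意:由码流统计、复杂度指标、CLIP 语义嵌入与目标 VMAF
    拼成的特征向量回归编码器 QP。各维度均为假设值。"""
    def __init__(self, in_dim=532):          # 例如 512 维 CLIP 嵌入 + 若干标量特征
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )
    def forward(self, feats):
        return self.net(feats)               # 输出连续 QP,使用前需裁剪到编码器合法范围

model = QPPredictor()
feats = torch.randn(4, 532)                  # 4 个视频片段的特征(随机数仅作演示)
qp = model(feats).clamp(0, 255)              # 此处按 AV1 量化索引 0-255 近似;实际范围以 NVENC 文档为准
```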
[AI-95] Generative AI and Firm Productivity: Field Experiments in Online Retail
【速读】:该论文旨在解决生成式人工智能(Generative AI)在在线零售场景中对企业生产率影响的因果识别问题,特别是其提升生产率的具体作用机制与边界条件。解决方案的关键在于通过一系列大规模随机对照试验(RCT),在一家领先的跨境电商平台部署七项面向消费者的GenAI功能,在各实验组间保持投入与价格不变,以确保销售增长直接映射为全要素生产率(Total Factor Productivity, TFP)改善,并识别出不同卖家与消费者群体的异质性响应,从而提供可量化、可推广的实证证据。
链接: https://arxiv.org/abs/2510.12049
作者: Lu Fang,Zhe Yuan,Kaifu Zhang,Dante Donati,Miklos Sarvary
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI)
备注: Keywords: Field Experiments, Generative AI, Productivity, Retail Platforms, Consumer Experience. JEL codes: C93, D24, L81, M31, O3
Abstract:We quantify the impact of Generative Artificial Intelligence (GenAI) on firm productivity through a series of large-scale randomized field experiments involving millions of users and products at a leading cross-border online retail platform. Over six months in 2023-2024, GenAI-based enhancements were integrated into seven consumer-facing business workflows. We find that GenAI adoption significantly increases sales, with treatment effects ranging from 0% to 16.3%, depending on GenAI's marginal contribution relative to existing firm practices. Because inputs and prices were held constant across experimental arms, these gains map directly into total factor productivity improvements. Across the four GenAI applications with positive effects, the implied annual incremental value is approximately $5 per consumer - an economically meaningful impact given the retailer's scale and the early stage of GenAI adoption. The primary mechanism operates through higher conversion rates, consistent with GenAI reducing frictions in the marketplace and improving consumer experience. We also document substantial heterogeneity: smaller and newer sellers, as well as less experienced consumers, exhibit disproportionately larger gains. Our findings provide novel, large-scale causal evidence on the productivity effects of GenAI in online retail, highlighting both its immediate value and broader potential.
zh
[AI-96] Zero-Shot Large Language Model Agents for Fully Automated Radiotherapy Treatment Planning NEURIPS2025 ALT
【速读】:该论文旨在解决放射治疗计划制定过程中高度依赖专家经验、效率低下且难以应对日益增长的癌症病例负担的问题,从而推动自动化治疗计划生成。其解决方案的关键在于提出了一种基于大语言模型(Large Language Model, LLM)的代理(agent)驱动的工作流程,该代理能够在无需任何任务特定训练或微调的情况下,直接与商业治疗计划系统(Treatment Planning System, TPS)交互,通过迭代获取中间计划状态并动态调整优化约束参数,实现对调强放射治疗(Intensity-Modulated Radiation Therapy, IMRT)逆向优化的自主引导。该方法在零样本(zero-shot)推理设置下即展现出可比甚至更优的剂量学性能,尤其在危及器官(OAR)保护、热点控制(Dmax)和靶区适形性(conformity index)方面优于临床手动计划,证明了LLM驱动自动化IMRT计划的可行性与通用性。
链接: https://arxiv.org/abs/2510.11754
作者: Dongrong Yang,Xin Wu,Yibo Xie,Xinyi Li,Qiuwen Wu,Jackie Wu,Yang Sheng
机构: 未知
类目: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: Accepted for poster presentation at the NeurIPS 2025 Workshop on GenAI for Health: Potential, Trust, and Policy Compliance
Abstract:Radiation therapy treatment planning is an iterative, expertise-dependent process, and the growing burden of cancer cases has made reliance on manual planning increasingly unsustainable, underscoring the need for automation. In this study, we propose a workflow that leverages a large language model (LLM)-based agent to navigate inverse treatment planning for intensity-modulated radiation therapy (IMRT). The LLM agent was implemented to directly interact with a clinical treatment planning system (TPS) to iteratively extract intermediate plan states and propose new constraint values to guide inverse optimization. The agent’s decision-making process is informed by current observations and previous optimization attempts and evaluations, allowing for dynamic strategy refinement. The planning process was performed in a zero-shot inference setting, where the LLM operated without prior exposure to manually generated treatment plans and was utilized without any fine-tuning or task-specific training. The LLM-generated plans were evaluated on twenty head-and-neck cancer cases against clinical manual plans, with key dosimetric endpoints analyzed and reported. The LLM-generated plans achieved comparable organ-at-risk (OAR) sparing relative to clinical plans while demonstrating improved hot spot control (Dmax: 106.5% vs. 108.8%) and superior conformity (conformity index: 1.18 vs. 1.39 for boost PTV; 1.82 vs. 1.88 for primary PTV). This study demonstrates the feasibility of a zero-shot, LLM-driven workflow for automated IMRT treatment planning in a commercial TPS. The proposed approach provides a generalizable and clinically applicable solution that could reduce planning variability and support broader adoption of AI-based planning strategies.
zh
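摘要描述的零样本工作流是一个"读取中间计划状态 → LLM 提出新约束 → 再优化"的闭环。下面用占位函数勾勒这一控制循环;tps_optimize 与 llm_propose 均为本文虚构的桩实现,不对应论文或任何商业 TPS 的真实接口:

```python
# 极简示意:LLM 代理迭代引导逆向优化的控制循环(全部为占位实现)。
def tps_optimize(constraints):
    # 占位:真实系统中此处调用 TPS 执行一轮逆向优化,并读取中间计划的剂量学指标
    dmax = constraints["cord_Dmax"] - 1.0
    return {"cord_Dmax": dmax, "goals_met": dmax <= 42.0}

def llm_propose(state, history):
    # 占位:真实系统中此处把当前观测与既往尝试写入提示词,由 LLM 零样本提出新约束值
    return {"cord_Dmax": state["cord_Dmax"]}

constraints, history = {"cord_Dmax": 45.0}, []   # 初始约束(示意值,单位 Gy)
for step in range(10):
    state = tps_optimize(constraints)
    history.append((dict(constraints), state))
    if state["goals_met"]:
        break
    constraints = llm_propose(state, history)
```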
[AI-97] Fast and Interpretable Protein Substructure Alignment via Optimal Transport
【速读】:该论文旨在解决蛋白质结构中局部结构域(如活性位点)的高效且可解释的残基级子结构比对问题,这是理解蛋白质功能演化和实现蛋白质工程的关键瓶颈。现有计算方法在识别和比较这些局部结构方面存在显著不足。解决方案的关键在于提出PLASMA,这是一个基于深度学习的框架,将问题重新建模为正则化最优传输任务,并利用可微分Sinkhorn迭代进行求解,从而输出清晰的对齐矩阵和可解释的整体相似性评分,实现了准确、轻量且可解释的残基级比对。
链接: https://arxiv.org/abs/2510.11752
作者: Zhiyu Wang,Bingxin Zhou,Jing Wang,Yang Tan,Weishu Zhao,Pietro Liò,Liang Hong
机构: 未知
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Proteins are essential biological macromolecules that execute life functions. Local motifs within protein structures, such as active sites, are the most critical components for linking structure to function and are key to understanding protein evolution and enabling protein engineering. Existing computational methods struggle to identify and compare these local structures, which leaves a significant gap in understanding protein structures and harnessing their functions. This study presents PLASMA, the first deep learning framework for efficient and interpretable residue-level protein substructure alignment. We reformulate the problem as a regularized optimal transport task and leverage differentiable Sinkhorn iterations. For a pair of input protein structures, PLASMA outputs a clear alignment matrix with an interpretable overall similarity score. Through extensive quantitative evaluations and three biological case studies, we demonstrate that PLASMA achieves accurate, lightweight, and interpretable residue-level alignment. Additionally, we introduce PLASMA-PF, a training-free variant that provides a practical alternative when training data are unavailable. Our method addresses a critical gap in protein structure analysis tools and offers new opportunities for functional annotation, evolutionary studies, and structure-based drug design. Reproducibility is ensured via our official implementation at this https URL.
zh
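摘要将子结构比对表述为熵正则化最优传输并用可微 Sinkhorn 迭代求解。下面给出经典 Sinkhorn 迭代的极简 NumPy 草图(均匀边缘分布、相似度矩阵均为演示假设),其输出即软对齐矩阵与一个整体相似度分数:

```python
import numpy as np

def sinkhorn(S, eps=0.1, iters=200):
    """熵正则化最优传输的 Sinkhorn 迭代(示意):S 为两段蛋白残基间的
    相似度矩阵,返回软对齐矩阵 P 与整体相似度得分。"""
    n, m = S.shape
    r, c = np.full(n, 1.0 / n), np.full(m, 1.0 / m)   # 行/列边缘分布(此处取均匀)
    K = np.exp(S / eps)                                # Gibbs 核
    v = np.ones(m)
    for _ in range(iters):
        u = r / (K @ v)                                # 交替做行、列缩放
        v = c / (K.T @ u)
    P = u[:, None] * K * v[None, :]                    # 对齐(传输)矩阵
    return P, float((P * S).sum())

rng = np.random.default_rng(0)
S = rng.normal(size=(50, 60))                          # 残基嵌入相似度(随机数仅作演示)
P, score = sinkhorn(S)
```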
机器学习
[LG-0] Sample-Efficient Omniprediction for Proper Losses
链接: https://arxiv.org/abs/2510.12769
作者: Isaac Gibbs,Ryan J. Tibshirani
类目: Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:We consider the problem of constructing probabilistic predictions that lead to accurate decisions when employed by downstream users to inform actions. For a single decision maker, designing an optimal predictor is equivalent to minimizing a proper loss function corresponding to the negative utility of that individual. For multiple decision makers, our problem can be viewed as a variant of omniprediction in which the goal is to design a single predictor that simultaneously minimizes multiple losses. Existing algorithms for achieving omniprediction broadly fall into two categories: 1) boosting methods that optimize other auxiliary targets such as multicalibration and obtain omniprediction as a corollary, and 2) adversarial two-player game based approaches that estimate and respond to the "worst-case" loss in an online fashion. We give lower bounds demonstrating that multicalibration is a strictly more difficult problem than omniprediction and thus the former approach must incur suboptimal sample complexity. For the latter approach, we discuss how these ideas can be used to obtain a sample-efficient algorithm through an online-to-batch conversion. This conversion has the downside of returning a complex, randomized predictor. We improve on this method by designing a more direct, unrandomized algorithm that exploits structural elements of the set of proper losses.
[LG-1] KoALA: KL-L0 Adversarial Detector via Label Agreement
链接: https://arxiv.org/abs/2510.12752
作者: Siqi Li,Yasser Shoukry
类目: Machine Learning (cs.LG)
*备注:
Abstract:Deep neural networks are highly susceptible to adversarial attacks, which pose significant risks to security- and safety-critical applications. We present KoALA (KL-L0 Adversarial detection via Label Agreement), a novel, semantics-free adversarial detector that requires no architectural changes or adversarial retraining. KoALA operates on a simple principle: it detects an adversarial attack when class predictions from two complementary similarity metrics disagree. These metrics - KL divergence and an L0-based similarity - are specifically chosen to detect different types of perturbations. The KL divergence metric is sensitive to dense, low-amplitude shifts, while the L0-based similarity is designed for sparse, high-impact changes. We provide a formal proof of correctness for our approach. The only training required is a simple fine-tuning step on a pre-trained image encoder using clean images to ensure the embeddings align well with both metrics. This makes KoALA a lightweight, plug-and-play solution for existing models and various data modalities. Our extensive experiments on ResNet/CIFAR-10 and CLIP/Tiny-ImageNet confirm our theoretical claims. When the theorem's conditions are met, KoALA consistently and effectively detects adversarial examples. On the full test sets, KoALA achieves a precision of 0.94 and a recall of 0.81 on ResNet/CIFAR-10, and a precision of 0.66 and a recall of 0.85 on CLIP/Tiny-ImageNet.
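按摘要的"标签一致性"原则,KoALA 的判定逻辑可以概括为:分别在 KL 散度与基于 L0 的相似度下做最近类别判决,二者不一致即报警。下面是一个示意草图,其中"类别原型"这一判决方式、阈值与归一化均为本文假设,并非论文的具体实现:

```python
import numpy as np

def kl_div(p, q, eps=1e-9):
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def l0_dist(p, q, tol=1e-3):
    # 基于 L0 的度量示意:统计差异超过阈值的坐标个数(越小越相似)
    return int(np.sum(np.abs(p - q) > tol))

def koala_detect(z, prototypes):
    # 两种度量下的最近类别不一致 -> 判为对抗样本
    pred_kl = min(prototypes, key=lambda c: kl_div(z, prototypes[c]))
    pred_l0 = min(prototypes, key=lambda c: l0_dist(z, prototypes[c]))
    return pred_kl != pred_l0

rng = np.random.default_rng(0)
protos = {c: rng.dirichlet(np.ones(16)) for c in range(10)}  # 各类别参考分布(假设)
z = rng.dirichlet(np.ones(16))                               # 待检样本的归一化嵌入
print(koala_detect(z, protos))
```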
[LG-2] Doctor Rashomon and the UNIVERSE of Madness: Variable Importance with Unobserved Confounding and the Rashomon Effect
链接: https://arxiv.org/abs/2510.12734
作者: Jon Donnelly,Srikar Katta,Emanuele Borgonovo,Cynthia Rudin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Variable importance (VI) methods are often used for hypothesis generation, feature selection, and scientific validation. In the standard VI pipeline, an analyst estimates VI for a single predictive model with only the observed features. However, the importance of a feature depends heavily on which other variables are included in the model, and essential variables are often omitted from observational datasets. Moreover, the VI estimated for one model is often not the same as the VI estimated for another equally-good model - a phenomenon known as the Rashomon Effect. We address these gaps by introducing UNobservables and Inference for Variable importancE using Rashomon SEts (UNIVERSE). Our approach adapts Rashomon sets - the sets of near-optimal models in a dataset - to produce bounds on the true VI even with missing features. We theoretically guarantee the robustness of our approach, show strong performance on semi-synthetic simulations, and demonstrate its utility in a credit risk task.
[LG-3] Data-Model Co-Evolution: Growing Test Sets to Refine LLM Behavior
链接: https://arxiv.org/abs/2510.12728
作者: Minjae Lee,Minsuk Kahng
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:
Abstract:A long-standing challenge in machine learning has been the rigid separation between data work and model refinement, enforced by slow fine-tuning cycles. The rise of Large Language Models (LLMs) overcomes this historical barrier, allowing applications developers to instantly govern model behavior by editing prompt instructions. This shift enables a new paradigm: data-model co-evolution, where a living test set and a model’s instructions evolve in tandem. We operationalize this paradigm in an interactive system designed to address the critical challenge of encoding subtle, domain-specific policies into prompt instructions. The system’s structured workflow guides people to discover edge cases, articulate rationales for desired behavior, and iteratively evaluate instruction revisions against a growing test set. A user study shows our workflow helps participants refine instructions systematically and specify ambiguous policies more concretely. This work points toward more robust and responsible LLM applications through human-in-the-loop development aligned with local preferences and policies.
[LG-4] Improving Decision Trees through the Lens of Parameterized Local Search NEURIPS2025
链接: https://arxiv.org/abs/2510.12726
作者: Juha Harviainen,Frank Sommer,Manuel Sorge
类目: Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2025
Abstract:Algorithms for learning decision trees often include heuristic local-search operations such as (1) adjusting the threshold of a cut or (2) also exchanging the feature of that cut. We study minimizing the number of classification errors by performing a fixed number of a single type of these operations. Although we discover that the corresponding problems are NP-complete in general, we provide a comprehensive parameterized-complexity analysis with the aim of determining those properties of the problems that explain the hardness and those that make the problems tractable. For instance, we show that the problems remain hard for a small number d of features or small domain size D but the combination of both yields fixed-parameter tractability. That is, the problems are solvable in (D + 1)^{2d} \cdot |I|^{O(1)} time, where |I| is the size of the input. We also provide a proof-of-concept implementation of this algorithm and report on empirical results.
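以摘要中的操作 (1)(调整某个切分的阈值)为例,下面在单特征决策树桩上给出一次局部搜索的极简实现:枚举候选阈值、最小化分类错误数。数据与初始阈值均为演示用假设:

```python
import numpy as np

def best_threshold(x, y, thr0):
    """操作 (1) 的示意:调整单特征桩的切分阈值以最小化错误数,
    左右叶子按多数类标注。"""
    def errors(t):
        e = 0
        for leaf in (y[x <= t], y[x > t]):
            if len(leaf):
                e += len(leaf) - np.bincount(leaf).max()   # 非多数类样本数
        return e
    cands = np.unique(x)                                   # 候选阈值取特征观测值
    t_best = min(cands, key=errors)
    return t_best, errors(t_best), errors(thr0)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x + rng.normal(scale=0.5, size=200) > 0).astype(int)
t_new, err_new, err_old = best_threshold(x, y, thr0=1.0)
print(t_new, err_new, err_old)
```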
[LG-5] CARVQ: Corrective Adaptor with Group Residual Vector Quantization for LLM Embedding Compression EMNLP
链接: https://arxiv.org/abs/2510.12721
作者: Dayin Gou,Sanghyun Byun,Nilesh Malpeddi,Gabrielle De Micheli,Prathamesh Vaste,Jacob Song,Woo Seong Chung
类目: Machine Learning (cs.LG)
*备注: Accepted at EMNLP Findings 2025
Abstract:Large Language Models (LLMs) typically rely on a large number of parameters for token embedding, leading to substantial storage requirements and memory footprints. In particular, LLMs deployed on edge devices are memory-bound, and reducing the memory footprint by compressing the embedding layer not only frees up the memory bandwidth but also speeds up inference. To address this, we introduce CARVQ, a post-training novel Corrective Adaptor combined with group Residual Vector Quantization. CARVQ relies on the composition of both linear and non-linear maps and mimics the original model embedding to compress to approximately 1.6 bits without requiring specialized hardware to support lower-bit storage. We test our method on pre-trained LLMs such as LLaMA-3.2-1B, LLaMA-3.2-3B, LLaMA-3.2-3B-Instruct, LLaMA-3.1-8B, Qwen2.5-7B, Qwen2.5-Math-7B and Phi-4, evaluating on common generative, discriminative, math and reasoning tasks. We show that in most cases, CARVQ can achieve lower average bitwidth-per-parameter while maintaining reasonable perplexity and accuracy compared to scalar quantization. Our contributions include a novel compression technique that is compatible with state-of-the-art transformer quantization methods and can be seamlessly integrated into any hardware supporting 4-bit memory to reduce the model’s memory footprint in memory-constrained devices. This work demonstrates a crucial step toward the efficient deployment of LLMs on edge devices.
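CARVQ 的核心组件之一是(分组)残差向量量化。下面是多级残差 VQ 编码的极简 NumPy 草图:每一级用码本量化上一级的残差。级数、码本规模均为假设,且未包含论文中的非线性校正适配器:

```python
import numpy as np

def residual_vq_encode(x, codebooks):
    """残差向量量化的示意:逐级量化残差,返回码字索引序列与重建向量。"""
    residual, codes = x.copy(), []
    for cb in codebooks:                          # cb 形状: (码字数, 维度)
        idx = int(np.argmin(((residual[None, :] - cb) ** 2).sum(-1)))
        codes.append(idx)
        residual = residual - cb[idx]             # 下一级量化剩余残差
    return codes, x - residual                    # 重建 = 各级码字之和

rng = np.random.default_rng(0)
dim = 64
codebooks = [rng.normal(size=(256, dim)) for _ in range(3)]  # 3 级、每级 256 码字(假设)
emb = rng.normal(size=dim)                        # 某个 token 的嵌入向量
codes, recon = residual_vq_encode(emb, codebooks)
```

存储时只需保留码本与每个 token 的少量索引,这正是摘要所述压缩嵌入层内存占用的基本思路。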
[LG-6] Multitask finetuning and acceleration of chemical pretrained models for small molecule drug property prediction
链接: https://arxiv.org/abs/2510.12719
作者: Matthew Adrian,Yunsie Chung,Kevin Boyd,Saee Paliwal,Srimukh Prasad Veccham,Alan C. Cheng
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Chemical pretrained models, sometimes referred to as foundation models, are receiving considerable interest for drug discovery applications. The general chemical knowledge extracted from self-supervised training has the potential to improve predictions for critical drug discovery endpoints, including on-target potency and ADMET properties. Multi-task learning has previously been successfully leveraged to improve predictive models. Here, we show that enabling multitasking in finetuning of chemical pretrained graph neural network models such as Kinetic GROVER Multi-Task (KERMT), an enhanced version of the GROVER model, and Knowledge-guided Pre-training of Graph Transformer (KGPT) significantly improves performance over non-pretrained graph neural network models. Surprisingly, we find that the performance improvement from finetuning KERMT in a multitask manner is most significant at larger data sizes. Additionally, we publish two multitask ADMET data splits to enable more accurate benchmarking of multitask deep learning methods for drug property prediction. Finally, we provide an accelerated implementation of the KERMT model on GitHub, unlocking large-scale pretraining, finetuning, and inference in industrial drug discovery workflows.
[LG-7] Few Shot Semi-Supervised Learning for Abnormal Stop Detection from Sparse GPS Trajectories
链接: https://arxiv.org/abs/2510.12686
作者: Muhammad Ayub Sabir,Junbiao Pang,Jiaqi Wu,Fatima Ashraf
类目: Machine Learning (cs.LG)
*备注:
Abstract:Abnormal stop detection (ASD) in intercity coach transportation is critical for ensuring passenger safety, operational reliability, and regulatory compliance. However, two key challenges hinder ASD effectiveness: sparse GPS trajectories, which obscure short or unauthorized stops, and limited labeled data, which restricts supervised learning. Existing methods often assume dense sampling or regular movement patterns, limiting their applicability. To address data sparsity, we propose a Sparsity-Aware Segmentation (SAS) method that adaptively defines segment boundaries based on local spatial-temporal density. Building upon these segments, we introduce three domain-specific indicators to capture abnormal stop behaviors. To further mitigate the impact of sparsity, we develop Locally Temporal-Indicator Guided Adjustment (LTIGA), which smooths these indicators via local similarity graphs. To overcome label scarcity, we construct a spatial-temporal graph where each segment is a node with LTIGA-refined features. We apply label propagation to expand weak supervision across the graph, followed by a GCN to learn relational patterns. A final self-training module incorporates high-confidence pseudo-labels to iteratively improve predictions. Experiments on real-world coach data show an AUC of 0.854 and AP of 0.866 using only 10 labeled instances, outperforming prior methods. The code and dataset are publicly available at this https URL.
[LG-8] CoRA: Covariate-Aware Adaptation of Time Series Foundation Models
链接: https://arxiv.org/abs/2510.12681
作者: Guo Qin,Zhi Chen,Yong Liu,Zhiyuan Shi,Haixuan Liu,Xiangdong Huang,Jianmin Wang,Mingsheng Long
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time Series Foundation Models (TSFMs) have shown significant impact through their model capacity, scalability, and zero-shot generalization. However, due to the heterogeneity of inter-variate dependencies and the backbone scalability on large-scale multivariate datasets, most TSFMs are typically pre-trained on univariate time series. This limitation renders them oblivious to crucial information from diverse covariates in real-world forecasting tasks. To further enhance the performance of TSFMs, we propose a general covariate-aware adaptation (CoRA) framework for TSFMs. It leverages pre-trained backbones of foundation models while effectively incorporating exogenous covariates from various modalities, including time series, language, and images, to improve the quality of predictions. Technically, CoRA maintains the equivalence of initialization and parameter consistency during adaptation. With preserved backbones of foundation models as frozen feature extractors, the outcome embeddings from foundation models are empirically demonstrated more informative than raw data. Further, CoRA employs a novel Granger Causality Embedding (GCE) to automatically evaluate covariates regarding their causal predictability with respect to the target variate. We incorporate these weighted embeddings with a zero-initialized condition-injection mechanism, avoiding catastrophic forgetting of pre-trained foundation models and gradually integrates exogenous information. Extensive experiments show that CoRA of TSFMs surpasses state-of-the-art covariate-aware deep forecasters with full or few-shot training samples, achieving 31.1% MSE reduction on covariate-aware forecasting. Compared to other adaptation methods, CoRA exhibits strong compatibility with various advanced TSFMs and extends the scope of covariates to other modalities, presenting a practical paradigm for the application of TSFMs.
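摘要提到的"零初始化条件注入"可以用一个零初始化的线性投影来示意:训练起点与冻结的基础模型完全等价,外生协变量信息随训练逐步注入。下面的草图中维度、加权方式均为假设,并未实现论文的 Granger 因果嵌入(GCE)细节:

```python
import torch
import torch.nn as nn

class ZeroInitInjection(nn.Module):
    """零初始化条件注入的示意:协变量嵌入经零初始化线性层后
    叠加到冻结 TSFM 的输出嵌入上,初始时对模型零扰动。"""
    def __init__(self, cov_dim, hid_dim):
        super().__init__()
        self.proj = nn.Linear(cov_dim, hid_dim)
        nn.init.zeros_(self.proj.weight)     # 零初始化:避免灾难性遗忘预训练能力
        nn.init.zeros_(self.proj.bias)
    def forward(self, backbone_emb, cov_emb, weights):
        # weights:各协变量的权重(论文中由因果可预测性打分给出,此处随意给定)
        cond = (weights.unsqueeze(-1) * cov_emb).sum(dim=1)
        return backbone_emb + self.proj(cond)

inj = ZeroInitInjection(cov_dim=32, hid_dim=256)
backbone_emb = torch.randn(8, 256)           # 冻结基础模型输出的嵌入(示意)
cov_emb = torch.randn(8, 5, 32)              # 5 个协变量的嵌入
w = torch.softmax(torch.randn(8, 5), dim=-1)
out = inj(backbone_emb, cov_emb, w)          # 初始时恒等于 backbone_emb
```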
[LG-9] Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers
链接: https://arxiv.org/abs/2510.12672
作者: Ruben Belo,Claudia Soares,Marta Guimaraes
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models are susceptible to jailbreak attacks that bypass built-in safety guardrails (e.g., by tricking the model with adversarial prompts). We propose Concept Alignment and Latent Manipulation (CALM), an inference-time method that suppresses harmful concepts by modifying latent representations of the last layer of the model, without retraining. Leveraging the Concept Whitening (CW) technique from Computer Vision combined with orthogonal projection, CALM removes unwanted latent directions associated with harmful content while preserving model performance. Experiments show that CALM reduces harmful outputs and outperforms baseline methods in most metrics, offering a lightweight approach to AI safety with no additional training data or model fine-tuning, while incurring only a small computational overhead at inference.
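CALM 在最后一层隐状态上移除有害概念方向,其中的正交投影一步可以写成 h - (h·v̂)v̂。下面是一个极简 PyTorch 草图;概念方向 v 如何估计(概念白化等)此处从略,维度为假设值:

```python
import torch

def remove_concept(h, v):
    """示意:用正交投影去除隐状态 h 中与概念方向 v 对齐的分量。"""
    v = v / v.norm()
    return h - (h @ v).unsqueeze(-1) * v

h = torch.randn(4, 4096)          # 最后一层隐状态(维度为假设值)
v = torch.randn(4096)             # 预先估计的有害概念方向
h_safe = remove_concept(h, v)
# 投影后在该概念方向上的分量为零:
print(torch.allclose(h_safe @ (v / v.norm()), torch.zeros(4), atol=1e-5))
```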
[LG-10] Structure-Aware Spectral Sparsification via Uniform Edge Sampling NEURIPS2025
链接: https://arxiv.org/abs/2510.12669
作者: Kaiwen He,Petros Drineas,Rajiv Khanna
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注: 19 pages, 4 figures, NeurIPS 2025
Abstract:Spectral clustering is a fundamental method for graph partitioning, but its reliance on eigenvector computation limits scalability to massive graphs. Classical sparsification methods preserve spectral properties by sampling edges proportionally to their effective resistances, but require expensive preprocessing to estimate these resistances. We study whether uniform edge sampling - a simple, structure-agnostic strategy - can suffice for spectral clustering. Our main result shows that for graphs admitting a well-separated k-clustering, characterized by a large structure ratio \Upsilon(k) = \lambda_{k+1} / \rho_G(k), uniform sampling preserves the spectral subspace used for clustering. Specifically, we prove that uniformly sampling O(\gamma^2 n \log n / \epsilon^2) edges, where \gamma is the Laplacian condition number, yields a sparsifier whose top (n-k)-dimensional eigenspace is approximately orthogonal to the cluster indicators. This ensures that the spectral embedding remains faithful, and clustering quality is preserved. Our analysis introduces new resistance bounds for intra-cluster edges, a rank-(n-k) effective resistance formulation, and a matrix Chernoff bound adapted to the dominant eigenspace. These tools allow us to bypass importance sampling entirely. Conceptually, our result connects recent coreset-based clustering theory to spectral sparsification, showing that under strong clusterability, even uniform sampling is structure-aware. This provides the first provable guarantee that uniform edge sampling suffices for structure-preserving spectral clustering.
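与基于有效电阻的重要性采样不同,摘要研究的是结构无关的均匀边采样。下面的 NumPy 草图按均匀概率有放回地采样边,并按 1/(采样概率 × 采样次数) 重新加权,使稀疏化后的拉普拉斯矩阵在期望意义下等于原图拉普拉斯(图的生成方式仅为演示):

```python
import numpy as np

def uniform_sparsify(edges, w, n, m_keep, rng):
    """示意:均匀采样 m_keep 条边并重新加权,构造无偏的稀疏拉普拉斯矩阵。"""
    m = len(edges)
    idx = rng.integers(0, m, size=m_keep)      # 均匀、有放回采样
    L = np.zeros((n, n))
    for i in idx:
        u, v = edges[i]
        scale = w[i] * m / m_keep              # 重要性修正:每条边被抽中的概率为 1/m
        L[u, u] += scale; L[v, v] += scale
        L[u, v] -= scale; L[v, u] -= scale
    return L

rng = np.random.default_rng(0)
n = 50
edges = [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < 0.2]
w = np.ones(len(edges))                        # 原图边权(此处取 1)
L_sparse = uniform_sparsify(edges, w, n, m_keep=len(edges) // 3, rng=rng)
```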
[LG-11] Structured Sparsity and Weight-adaptive Pruning for Memory and Compute efficient Whisper models
链接: https://arxiv.org/abs/2510.12666
作者: Prasenjit K Mudi,Anshi Sachan,Dahlia Devapriya,Sheetal Kalyani
类目: Machine Learning (cs.LG)
*备注:
Abstract:Whisper models have achieved remarkable progress in speech recognition; yet their large size remains a bottleneck for deployment on resource-constrained edge devices. This paper proposes a framework to design fine-tuned variants of Whisper which address the above problem. Structured sparsity is enforced via the Sparse Group LASSO penalty as a loss regularizer, to reduce the number of FLOating Point operations (FLOPs). Further, a weight statistics aware pruning algorithm is proposed. We also design our custom text normalizer for WER evaluation. On Common Voice 11.0 Hindi dataset, we obtain, without degrading WER, (a) 35.4% reduction in model parameters, 14.25% lower memory consumption and 18.5% fewer FLOPs on Whisper-small, and (b) 31% reduction in model parameters, 15.29% lower memory consumption and 16.95% fewer FLOPs on Whisper-medium; and (c) substantially outperform the state-of-the-art Iterative Magnitude Pruning based method by pruning 18.7% more parameters along with a 12.31 reduction in WER.
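摘要中的结构化稀疏通过把 Sparse Group LASSO 作为损失正则项来实现。下面给出该惩罚项的极简 PyTorch 草图(分组方式与系数均为本文假设),训练时将其加到任务损失上即可把整组权重压向零、便于后续结构化剪枝:

```python
import torch

def sparse_group_lasso(params, groups, alpha=0.05, lam=1e-4):
    """Sparse Group LASSO 惩罚的示意:
    lam * [ alpha * ||w||_1 + (1 - alpha) * sum_g sqrt(|g|) * ||w_g||_2 ]"""
    l1 = sum(p.abs().sum() for p in params)                  # 元素级稀疏
    l2_group = sum(g.numel() ** 0.5 * g.norm(2) for g in groups)  # 组级稀疏
    return lam * (alpha * l1 + (1 - alpha) * l2_group)

w = [torch.randn(64, 64, requires_grad=True), torch.randn(64, requires_grad=True)]
penalty = sparse_group_lasso(w, groups=w)    # 此处按整份参数张量分组,仅作演示
# 训练时:loss = task_loss + penalty
```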
[LG-12] Towards Foundation Inference Models that Learn ODEs In-Context
链接: https://arxiv.org/abs/2510.12650
作者: Maximilian Mauel,Manuel Hinz,Patrick Seifner,David Berghaus,Ramses J. Sanchez
类目: Machine Learning (cs.LG)
*备注:
Abstract:Ordinary differential equations (ODEs) describe dynamical systems evolving deterministically in continuous time. Accurate data-driven modeling of systems as ODEs, a central problem across the natural sciences, remains challenging, especially if the data is sparse or noisy. We introduce FIM-ODE (Foundation Inference Model for ODEs), a pretrained neural model designed to estimate ODEs zero-shot (i.e., in context) from sparse and noisy observations. Trained on synthetic data, the model utilizes a flexible neural operator for robust ODE inference, even from corrupted data. We empirically verify that FIM-ODE provides accurate estimates, on par with a neural state-of-the-art method, and qualitatively compare the structure of their estimated vector fields.
[LG-13] On Foundation Models for Temporal Point Processes to Accelerate Scientific Discovery
链接: https://arxiv.org/abs/2510.12640
作者: David Berghaus,Patrick Seifner,Kostadin Cvejoski,Ramses J. Sanchez
类目: Machine Learning (cs.LG)
*备注:
Abstract:Many scientific fields, from medicine to seismology, rely on analyzing sequences of events over time to understand complex systems. Traditionally, machine learning models must be built and trained from scratch for each new dataset, which is a slow and costly process. We introduce a new approach: a single, powerful model that learns the underlying patterns of event data in context. We trained this “foundation model” on millions of simulated event sequences, teaching it a general-purpose understanding of how events can unfold. As a result, our model can analyze new scientific data instantly, without retraining, simply by looking at a few examples from the dataset. It can also be quickly fine-tuned for even higher accuracy. This approach makes sophisticated event analysis more accessible and accelerates the pace of scientific discovery.
[LG-14] Expert or not? assessing data quality in offline reinforcement learning
链接: https://arxiv.org/abs/2510.12638
作者: Arip Asadulaev,Fakhri Karray,Martin Takac
类目: Machine Learning (cs.LG)
*备注:
Abstract:Offline reinforcement learning (RL) learns exclusively from static datasets, without further interaction with the environment. In practice, such datasets vary widely in quality, often mixing expert, suboptimal, and even random trajectories. The choice of algorithm therefore depends on dataset fidelity. Behavior cloning can suffice on high-quality data, whereas mixed- or low-quality data typically benefits from offline RL methods that stitch useful behavior across trajectories. Yet in the wild it is difficult to assess dataset quality a priori because the data’s provenance and skill composition are unknown. We address the problem of estimating offline dataset quality without training an agent. We study a spectrum of proxies from simple cumulative rewards to learned value based estimators, and introduce the Bellman Wasserstein distance (BWD), a value aware optimal transport score that measures how dissimilar a dataset’s behavioral policy is from a random reference policy. BWD is computed from a behavioral critic and a state conditional OT formulation, requiring no environment interaction or full policy optimization. Across D4RL MuJoCo tasks, BWD strongly correlates with an oracle performance score that aggregates multiple offline RL algorithms, enabling efficient prediction of how well standard agents will perform on a given dataset. Beyond prediction, integrating BWD as a regularizer during policy optimization explicitly pushes the learned policy away from random behavior and improves returns. These results indicate that value aware, distributional signals such as BWD are practical tools for triaging offline RL datasets and policy optimization.
[LG-15] Towards Fast Coarse-graining and Equation Discovery with Foundation Inference Models
链接: https://arxiv.org/abs/2510.12618
作者: Manuel Hinz,Maximilian Mauel,Patrick Seifner,David Berghaus,Kostadin Cvejoski,Ramses J. Sanchez
类目: Machine Learning (cs.LG)
*备注:
Abstract:High-dimensional recordings of dynamical processes are often characterized by a much smaller set of effective variables, evolving on low-dimensional manifolds. Identifying these latent dynamics requires solving two intertwined problems: discovering appropriate coarse-grained variables and simultaneously fitting the governing equations. Most machine learning approaches tackle these tasks jointly by training autoencoders together with models that enforce dynamical consistency. We propose to decouple the two problems by leveraging the recently introduced Foundation Inference Models (FIMs). FIMs are pretrained models that estimate the infinitesimal generators of dynamical systems (e.g., the drift and diffusion of a stochastic differential equation) in zero-shot mode. By amortizing the inference of the dynamics through a FIM with frozen weights, and training only the encoder-decoder map, we define a simple, simulation-consistent loss that stabilizes representation learning. A proof of concept on a stochastic double-well system with semicircle diffusion, embedded into synthetic video data, illustrates the potential of this approach for fast and reusable coarse-graining pipelines.
[LG-16] Research in Collaborative Learning Does Not Serve Cross-Silo Federated Learning in Practice
链接: https://arxiv.org/abs/2510.12595
作者: Kevin Kuo,Chhavi Yadav,Virginia Smith
类目: Machine Learning (cs.LG)
*备注: Main text: 23 pages, 2 tables, 2 figures
Abstract:Cross-silo federated learning (FL) is a promising approach to enable cross-organization collaboration in machine learning model development without directly sharing private data. Despite growing organizational interest driven by data protection regulations such as GDPR and HIPAA, the adoption of cross-silo FL remains limited in practice. In this paper, we conduct an interview study to understand the practical challenges associated with cross-silo FL adoption. With interviews spanning a diverse set of stakeholders such as user organizations, software providers, and academic researchers, we uncover various barriers, from concerns about model performance to questions of incentives and trust between participating organizations. Our study shows that cross-silo FL faces a set of challenges that have yet to be well-captured by existing research in the area and are quite distinct from other forms of federated learning such as cross-device FL. We end with a discussion on future research directions that can help overcome these challenges.
[LG-17] Multi-Armed Bandits with Minimum Aggregated Revenue Constraints
链接: https://arxiv.org/abs/2510.12523
作者: Ahmed Ben Yahmed,Hafedh El Ferchichi,Marc Abeille,Vianney Perchet
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:We examine a multi-armed bandit problem with contextual information, where the objective is to ensure that each arm receives a minimum aggregated reward across contexts while simultaneously maximizing the total cumulative reward. This framework captures a broad class of real-world applications where fair revenue allocation is critical and contextual variation is inherent. The cross-context aggregation of minimum reward constraints, while enabling better performance and easier feasibility, introduces significant technical challenges – particularly the absence of closed-form optimal allocations typically available in standard MAB settings. We design and analyze algorithms that either optimistically prioritize performance or pessimistically enforce constraint satisfaction. For each algorithm, we derive problem-dependent upper bounds on both regret and constraint violations. Furthermore, we establish a lower bound demonstrating that the dependence on the time horizon in our results is optimal in general and revealing fundamental limitations of the free exploration principle leveraged in prior work.
[LG-18] Why the noise model matters: A performance gap in learned regularization
链接: https://arxiv.org/abs/2510.12521
作者: Sebastian Banert,Christoph Brauer,Dirk Lorenz,Lionel Tondji
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:This article addresses the challenge of learning effective regularizers for linear inverse problems. We analyze and compare several types of learned variational regularization against the theoretical benchmark of the optimal affine reconstruction, i.e. the best possible affine linear map for minimizing the mean squared error. It is known that this optimal reconstruction can be achieved using Tikhonov regularization, but this requires precise knowledge of the noise covariance to properly weight the data fidelity term. However, in many practical applications, noise statistics are unknown. We therefore investigate the performance of regularization methods learned without access to this noise information, focusing on Tikhonov, Lavrentiev, and quadratic regularization. Our theoretical analysis and numerical experiments demonstrate that for non-white noise, a performance gap emerges between these methods and the optimal affine reconstruction. Furthermore, we show that these different types of regularization yield distinct results, highlighting that the choice of regularizer structure is critical when the noise model is not explicitly learned. Our findings underscore the significant value of accurately modeling or co-learning noise statistics in data-driven regularization.
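摘要的核心对比可以用一个小型线性反问题复现:已知噪声协方差时按其加权的最优仿射(MMSE)重建,对比忽略噪声统计、把噪声当作白噪声的普通 Tikhonov 重建。下面的 NumPy 草图中先验协方差、正则参数均为假设:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 20
A = rng.normal(size=(n, d))
Sigma_x = np.eye(d)                              # 信号先验协方差(假设)
sigma = rng.uniform(0.1, 2.0, size=n)            # 非白噪声:各分量标准差不同
Sigma_n = np.diag(sigma ** 2)

x_true = rng.normal(size=d)
y = A @ x_true + rng.normal(size=n) * sigma

# 已知噪声协方差时的最优仿射(MMSE)重建,等价于按 Sigma_n^{-1} 加权的 Tikhonov 解
x_opt = Sigma_x @ A.T @ np.linalg.solve(A @ Sigma_x @ A.T + Sigma_n, y)
# 忽略噪声统计的普通 Tikhonov 解,lam 为假设的正则参数
lam = 1.0
x_naive = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)
print(np.linalg.norm(x_opt - x_true), np.linalg.norm(x_naive - x_true))
```

多次随机重复上面的实验,即可观察到摘要所述两类重建之间的性能差距。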
[LG-19] Mitigating the Noise Shift for Denoising Generative Models via Noise Awareness Guidance
链接: https://arxiv.org/abs/2510.12497
作者: Jincheng Zhong,Boyuan Jiang,Xin Tao,Pengfei Wan,Kun Gai,Mingsheng Long
类目: Machine Learning (cs.LG)
*备注:
Abstract:Existing denoising generative models rely on solving discretized reverse-time SDEs or ODEs. In this paper, we identify a long-overlooked yet pervasive issue in this family of models: a misalignment between the pre-defined noise level and the actual noise level encoded in intermediate states during sampling. We refer to this misalignment as noise shift. Through empirical analysis, we demonstrate that noise shift is widespread in modern diffusion models and exhibits a systematic bias, leading to sub-optimal generation due to both out-of-distribution generalization and inaccurate denoising updates. To address this problem, we propose Noise Awareness Guidance (NAG), a simple yet effective correction method that explicitly steers sampling trajectories to remain consistent with the pre-defined noise schedule. We further introduce a classifier-free variant of NAG, which jointly trains a noise-conditional and a noise-unconditional model via noise-condition dropout, thereby eliminating the need for external classifiers. Extensive experiments, including ImageNet generation and various supervised fine-tuning tasks, show that NAG consistently mitigates noise shift and substantially improves the generation quality of mainstream diffusion models.
[LG-20] CrossAD: Time Series Anomaly Detection with Cross-scale Associations and Cross-window Modeling
链接: https://arxiv.org/abs/2510.12489
作者: Beibu Li,Qichao Shentu,Yang Shu,Hui Zhang,Ming Li,Ning Jin,Bin Yang,Chenjuan Guo
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted by the thirty-ninth annual conference on Neural Information Processing Systems
Abstract:Time series anomaly detection plays a crucial role in a wide range of real-world applications. Given that time series data can exhibit different patterns at different sampling granularities, multi-scale modeling has proven beneficial for uncovering latent anomaly patterns that may not be apparent at a single scale. However, existing methods often model multi-scale information independently or rely on simple feature fusion strategies, neglecting the dynamic changes in cross-scale associations that occur during anomalies. Moreover, most approaches perform multi-scale modeling based on fixed sliding windows, which limits their ability to capture comprehensive contextual information. In this work, we propose CrossAD, a novel framework for time series Anomaly Detection that takes Cross-scale associations and Cross-window modeling into account. We propose a cross-scale reconstruction that reconstructs fine-grained series from coarser series, explicitly capturing cross-scale associations. Furthermore, we design a query library and incorporate global multi-scale context to overcome the limitations imposed by fixed window sizes. Extensive experiments conducted on multiple real-world datasets using nine evaluation metrics validate the effectiveness of CrossAD, demonstrating state-of-the-art performance in anomaly detection.
[LG-21] Diff-XYZ: A Benchmark for Evaluating Diff Understanding
链接: https://arxiv.org/abs/2510.12487
作者: Evgeniy Glukhov,Michele Conti,Egor Bogomolov,Yaroslav Golubev,Alexander Bezzubov
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:Reliable handling of code diffs is central to agents that edit and refactor repositories at scale. We introduce Diff-XYZ, a compact benchmark for code-diff understanding with three supervised tasks: apply (old code + diff → new code), anti-apply (new code - diff → old code), and diff generation (new code - old code → diff). Instances in the benchmark are triples ⟨old code, new code, diff⟩ drawn from real commits in CommitPackFT, paired with automatic metrics and a clear evaluation protocol. We use the benchmark to do a focused empirical study of the unified diff format and run a cross-format comparison of different diff representations. Our findings reveal that different formats should be used depending on the use case and model size. For example, representing diffs in search-replace format is good for larger models in the diff generation scenario, yet not suited well for diff analysis and smaller models. The Diff-XYZ benchmark is a reusable foundation for assessing and improving diff handling in LLMs that can aid future development of diff formats and models editing code. The dataset is published on HuggingFace Hub: this https URL.
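以 diff 生成任务为例,真值 unified diff 可以由 Python 标准库直接得到。下面用 difflib 生成 unified diff 的参考实现(文件名与代码片段仅为演示),评测时可将模型输出与该真值逐行比对:

```python
import difflib

old_code = "def add(a, b):\n    return a + b\n"
new_code = "def add(a, b, c=0):\n    return a + b + c\n"

# diff 生成任务(old code 与 new code -> diff)的真值:标准 unified diff
diff = "".join(difflib.unified_diff(
    old_code.splitlines(keepends=True),
    new_code.splitlines(keepends=True),
    fromfile="a/add.py", tofile="b/add.py",
))
print(diff)
# apply / anti-apply 任务则要求模型在给定 diff 的条件下双向还原代码,
# 可用上述真值与模型输出逐行比对来计算指标
```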
[LG-22] Time-Correlated Video Bridge Matching
链接: https://arxiv.org/abs/2510.12453
作者: Viacheslav Vasilev,Arseny Ivanov,Nikita Gushchin,Maria Kovaleva,Alexander Korotin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Diffusion models excel in noise-to-data generation tasks, providing a mapping from a Gaussian distribution to a more complex data distribution. However they struggle to model translations between complex distributions, limiting their effectiveness in data-to-data tasks. While Bridge Matching (BM) models address this by finding the translation between data distributions, their application to time-correlated data sequences remains unexplored. This is a critical limitation for video generation and manipulation tasks, where maintaining temporal coherence is particularly important. To address this gap, we propose Time-Correlated Video Bridge Matching (TCVBM), a framework that extends BM to time-correlated data sequences in the video domain. TCVBM explicitly models inter-sequence dependencies within the diffusion bridge, directly incorporating temporal correlations into the sampling process. We compare our approach to classical methods based on bridge matching and diffusion models for three video-related tasks: frame interpolation, image-to-video generation, and video super-resolution. TCVBM achieves superior performance across multiple quantitative metrics, demonstrating enhanced generation quality and reconstruction fidelity.
[LG-23] Bayesian Optimization for Dynamic Pricing and Learning
链接: https://arxiv.org/abs/2510.12447
作者: Anush Anand,Pranav Agrawal,Tejas Bodas
类目: Machine Learning (cs.LG)
*备注:
Abstract:Dynamic pricing is the practice of adjusting the selling price of a product to maximize a firm’s revenue by responding to market demand. The literature typically distinguishes between two settings: infinite inventory, where the firm has unlimited stock and time to sell, and finite inventory, where both inventory and selling horizon are limited. In both cases, the central challenge lies in the fact that the demand function – how sales respond to price – is unknown and must be learned from data. Traditional approaches often assume a specific parametric form for the demand function, enabling the use of reinforcement learning (RL) to identify near-optimal pricing strategies. However, such assumptions may not hold in real-world scenarios, limiting the applicability of these methods. In this work, we propose a Gaussian Process (GP) based nonparametric approach to dynamic pricing that avoids restrictive modeling assumptions. We treat the demand function as a black-box function of the price and develop pricing algorithms based on Bayesian Optimization (BO) – a sample-efficient method for optimizing unknown functions. We present BO-based algorithms tailored for both infinite and finite inventory settings and provide regret guarantees for both regimes, thereby quantifying the learning efficiency of our methods. Through extensive experiments, we demonstrate that our BO-based methods outperform several state-of-the-art RL algorithms in terms of revenue, while requiring fewer assumptions and offering greater robustness. This highlights Bayesian Optimization as a powerful and practical tool for dynamic pricing in complex, uncertain environments.
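摘要把需求(收益)曲线视为价格的黑箱函数并用贝叶斯优化求解。下面是一个 GP-UCB 风格的定价循环草图:用高斯过程拟合已观测的 (价格, 收益) 对,再按"均值 + β × 标准差"选取下一个试验价格。需求函数、β 等均为演示假设:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

def revenue(p):                        # 未知的"黑箱"需求曲线(仅用于模拟环境)
    demand = np.exp(-0.8 * p)
    return p * demand + rng.normal(scale=0.01)

grid = np.linspace(0.1, 5.0, 200).reshape(-1, 1)   # 候选价格网格
P, R = [1.0, 4.0], [revenue(1.0), revenue(4.0)]    # 初始探索点(假设)

for t in range(20):                    # GP-UCB 式的贝叶斯优化定价循环
    gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True).fit(
        np.array(P).reshape(-1, 1), R)
    mu, sd = gp.predict(grid, return_std=True)
    beta = 2.0                         # 探索强度(假设值)
    p_next = float(grid[np.argmax(mu + beta * sd)])
    P.append(p_next)
    R.append(revenue(p_next))
```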
[LG-24] Formal Models and Convergence Analysis for Context-Aware Security Verification
链接: https://arxiv.org/abs/2510.12440
作者: Ayush Chaudhary
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 11 pages, 4 figures, 4 tables. Presents formal framework for context-aware security verification with ML-enhanced adaptive systems. Includes theoretical bounds (sample complexity, information-theoretic limits, convergence guarantees, soundness preservation) and empirical validation on 97,224 exploit samples
Abstract:We present a formal framework for context-aware security verification that establishes provable guarantees for ML-enhanced adaptive systems. We introduce context-completeness - a new security property - and prove: (1) sample complexity bounds showing when adaptive verification succeeds, (2) information-theoretic limits relating context richness to detection capability, (3) convergence guarantees for ML-based payload generators, and (4) compositional soundness bounds. We further provide a formal separation between static context-blind verifiers and context-aware adaptive verifiers: for a natural family of targets, any static verifier with finite payload budget achieves completeness at most alpha, while a context-aware verifier with sufficient information achieves completeness greater than alpha. We validate our theoretical predictions through controlled experiments on 97,224 exploit samples, demonstrating: detection accuracy improving from 58% to 69.93% with dataset growth, success probability increasing from 51% to 82% with context enrichment, training loss converging at O(1/sqrt(T)) rate, and false positive rate (10.19%) within theoretical bounds (12%). Our results show that theoretically-grounded adaptive verification achieves provable improvements over static approaches under stated assumptions while maintaining soundness guarantees.
[LG-25] Continuous Uniqueness and Novelty Metrics for Generative Modeling of Inorganic Crystals NEURIPS2025
链接: https://arxiv.org/abs/2510.12405
作者: Masahiro Negishi,Hyunsoo Park,Kinga O. Mastej,Aron Walsh
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: 13 pages (5 pages of main text), accepted to the AI4Mat workshop at NeurIPS 2025. See this https URL for the code
Abstract:To address pressing scientific challenges such as climate change, increasingly sophisticated generative artificial intelligence models are being developed that can efficiently sample the large chemical space of possible functional materials. These models can quickly sample new chemical compositions paired with crystal structures. They are typically evaluated using uniqueness and novelty metrics, which depend on a chosen crystal distance function. However, the most prevalent distance function has four limitations: it fails to quantify the degree of similarity between compounds, cannot distinguish compositional difference and structural difference, lacks Lipschitz continuity against shifts in atomic coordinates, and results in a uniqueness metric that is not invariant against the permutation of generated samples. In this work, we propose using two continuous distance functions to evaluate uniqueness and novelty, which theoretically overcome these limitations. Our experiments show that these distances reveal insights missed by traditional distance functions, providing a more reliable basis for evaluating and comparing generative models for inorganic crystals.
[LG-26] Robot Learning: A Tutorial
链接: https://arxiv.org/abs/2510.12403
作者: Francesco Capuano,Caroline Pascal,Adil Zouitine,Thomas Wolf,Michel Aractingi
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Tutorial on Robot Learning using LeRobot, the end-to-end robot learning library developed by Hugging Face
Abstract:Robot learning is at an inflection point, driven by rapid advancements in machine learning and the growing availability of large-scale robotics data. This shift from classical, model-based methods to data-driven, learning-based paradigms is unlocking unprecedented capabilities in autonomous systems. This tutorial navigates the landscape of modern robot learning, charting a course from the foundational principles of Reinforcement Learning and Behavioral Cloning to generalist, language-conditioned models capable of operating across diverse tasks and even robot embodiments. This work is intended as a guide for researchers and practitioners, and our goal is to equip the reader with the conceptual understanding and practical tools necessary to contribute to developments in robot learning, with ready-to-use examples implemented in \texttt{lerobot}.
[LG-27] Cautious Weight Decay
链接: https://arxiv.org/abs/2510.12402
作者: Lizhang Chen,Jonathan Li,Kaizhao Liang,Baiyu Su,Cong Xie,Nuo Wang Pierse,Chen Liang,Ni Lao,Qiang Liu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
Abstract:We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
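Since the abstract pins down the mechanism precisely (decay only the parameter coordinates whose signs align with the optimizer update), a one-function sketch is easy to give. Whether the mask is computed before or after the base step, and the exact sign convention, are our assumptions rather than the paper's specification.

```python
import torch

@torch.no_grad()
def cautious_weight_decay(param: torch.Tensor, update: torch.Tensor,
                          lr: float, wd: float) -> None:
    """One step of a base optimizer plus sign-masked decoupled weight decay.

    `update` is the raw step direction from any base optimizer (AdamW, Lion,
    Muon, ...). Per the abstract, decay touches only coordinates whose sign
    agrees with the update; computing the mask before the step is our choice.
    """
    mask = (torch.sign(param) == torch.sign(update)).to(param.dtype)
    param.add_(update, alpha=-lr)        # base optimizer step
    param.sub_(lr * wd * mask * param)   # masked decoupled decay

p = torch.randn(10)
cautious_weight_decay(p, torch.randn(10), lr=1e-3, wd=0.1)
```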
[LG-28] Enhanced Pre-training of Graph Neural Networks for Million-Scale Heterogeneous Graphs
链接: https://arxiv.org/abs/2510.12401
作者: Shengyin Sun,Chen Ma,Jiehao Chen
类目: Machine Learning (cs.LG)
*备注: 26 pages
Abstract:In recent years, graph neural networks (GNNs) have facilitated the development of graph data mining. However, training GNNs requires sufficient labeled task-specific data, which is expensive and sometimes unavailable. To be less dependent on labeled data, recent studies propose to pre-train GNNs in a self-supervised manner and then apply the pre-trained GNNs to downstream tasks with limited labeled data. However, most existing methods are designed solely for homogeneous graphs (real-world graphs are mostly heterogeneous) and do not consider semantic mismatch (the semantic difference between the original data and the ideal data containing more transferable semantic information). In this paper, we propose an effective framework to pre-train GNNs on the large-scale heterogeneous graph. We first design a structure-aware pre-training task, which aims to capture structural properties in heterogeneous graphs. Then, we design a semantic-aware pre-training task to tackle the mismatch. Specifically, we construct a perturbation subspace composed of semantic neighbors to help deal with the semantic mismatch. Semantic neighbors make the model focus more on the general knowledge in the semantic space, which in turn assists the model in learning knowledge with better transferability. Finally, extensive experiments are conducted on real-world large-scale heterogeneous graphs to demonstrate the superiority of the proposed method over state-of-the-art baselines. Code available at this https URL.
[LG-29] Improving Generative Behavior Cloning via Self-Guidance and Adaptive Chunking NEURIPS25
链接: https://arxiv.org/abs/2510.12392
作者: Junhyuk So,Chiwoong Lee,Shinyoung Lee,Jungseul Ok,Eunhyeok Park
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted at NeurIPS25
Abstract:Generative Behavior Cloning (GBC) is a simple yet effective framework for robot learning, particularly in multi-task settings. Recent GBC methods often employ diffusion policies with open-loop (OL) control, where actions are generated via a diffusion process and executed in multi-step chunks without replanning. While this approach has demonstrated strong success rates and generalization, its inherent stochasticity can result in erroneous action sampling, occasionally leading to unexpected task failures. Moreover, OL control suffers from delayed responses, which can degrade performance in noisy or dynamic environments. To address these limitations, we propose two novel techniques to enhance the consistency and reactivity of diffusion policies: (1) self-guidance, which improves action fidelity by leveraging past observations and implicitly promoting future-aware behavior; and (2) adaptive chunking, which selectively updates action sequences when the benefits of reactivity outweigh the need for temporal consistency. Extensive experiments show that our approach substantially improves GBC performance across a wide range of simulated and real-world robotic manipulation tasks. Our code is available at this https URL
[LG-30] Towards Cross-Modal Error Detection with Tables and Images
链接: https://arxiv.org/abs/2510.12383
作者: Olga Ovcharenko,Sebastian Schelter
类目: Machine Learning (cs.LG)
*备注:
Abstract:Ensuring data quality at scale remains a persistent challenge for large organizations. Despite recent advances, maintaining accurate and consistent data is still complex, especially when dealing with multiple data modalities. Traditional error detection and correction methods tend to focus on a single modality, typically a table, and often miss cross-modal errors that are common in domains like e-Commerce and healthcare, where image, tabular, and text data co-exist. To address this gap, we take an initial step towards cross-modal error detection in tabular data, by benchmarking several methods. Our evaluation spans four datasets and five baseline approaches. Among them, Cleanlab, a label error detection framework, and DataScope, a data valuation method, perform the best when paired with a strong AutoML framework, achieving the highest F1 scores. Our findings indicate that current methods remain limited, particularly when applied to heavy-tailed real-world data, motivating further research in this area.
[LG-31] Constrained Sensing and Reliable State Estimation with Shallow Recurrent Decoders on a TRIGA Mark II Reactor
链接: https://arxiv.org/abs/2510.12368
作者: Stefano Riva,Carolina Introini,Josè Nathan Kutz,Antonio Cammi
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:
Abstract:Shallow Recurrent Decoder networks are a novel data-driven methodology able to provide accurate state estimation in engineering systems, such as nuclear reactors. This deep learning architecture is a robust technique designed to map the temporal trajectories of a few sparse measurements to the full state space, including unobservable fields. The approach is agnostic to sensor positions, handles noisy data through an ensemble strategy, trains quickly, and requires no hyperparameter tuning. Following its application to a novel reactor concept, this work investigates the performance of Shallow Recurrent Decoders when applied to a real system. The underlying model is represented by a fluid dynamics model of the TRIGA Mark II research reactor; the architecture uses both synthetic temperature data from the numerical model and experimental temperature data recorded during a previous campaign. The objective of this work is, therefore, two-fold: 1) assessing if the architecture can reconstruct the full state of the system (temperature, velocity, pressure, turbulence quantities) given sparse data located in specific, low-dynamics channels and 2) assessing the correction capabilities of the architecture (that is, given a discrepancy between model and data, assessing if sparse measurements can provide some correction to the architecture output). As will be shown, the accurate, real-time reconstruction of every characteristic field, using both synthetic and experimental data, makes this approach suitable for interpretable monitoring and control purposes in the framework of a reactor digital twin.
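A minimal sketch of a shallow-recurrent-decoder architecture of the kind described: an LSTM reads short histories from a few sparse sensors, and a shallow MLP decodes the last hidden state into the full flattened field. All layer sizes and the sensor count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ShallowRecurrentDecoder(nn.Module):
    """LSTM over sparse sensor histories -> shallow decoder to the full state."""

    def __init__(self, n_sensors=3, hidden=64, full_state_dim=10_000):
        super().__init__()
        self.rnn = nn.LSTM(n_sensors, hidden, num_layers=2, batch_first=True)
        self.decoder = nn.Sequential(
            nn.Linear(hidden, 350), nn.ReLU(), nn.Linear(350, full_state_dim)
        )

    def forward(self, sensor_seq):        # (batch, time, n_sensors)
        _, (h, _) = self.rnn(sensor_seq)
        return self.decoder(h[-1])        # (batch, full_state_dim)

model = ShallowRecurrentDecoder()
y = model(torch.randn(8, 50, 3))          # 50-step histories from 3 sensors
print(y.shape)                             # torch.Size([8, 10000])
```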
[LG-32] Pretraining in Actor-Critic Reinforcement Learning for Robot Motion Control ICLR2026
链接: https://arxiv.org/abs/2510.12363
作者: Jiale Fan,Andrei Cramariuc,Tifanny Portela,Marco Hutter
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Submitted to ICLR 2026
Abstract:The pretraining-finetuning paradigm has facilitated numerous transformative advancements in artificial intelligence research in recent years. However, in the domain of reinforcement learning (RL) for robot motion control, individual skills are often learned from scratch despite the high likelihood that some generalizable knowledge is shared across all task-specific policies belonging to a single robot embodiment. This work aims to define a paradigm for pretraining neural network models that encapsulate such knowledge and can subsequently serve as a basis for warm-starting the RL process in classic actor-critic algorithms, such as Proximal Policy Optimization (PPO). We begin with a task-agnostic exploration-based data collection algorithm to gather diverse, dynamic transition data, which is then used to train a Proprioceptive Inverse Dynamics Model (PIDM) through supervised learning. The pretrained weights are loaded into both the actor and critic networks to warm-start the policy optimization of actual tasks. We systematically validated our proposed method on seven distinct robot motion control tasks, showing significant benefits to this initialization strategy. Our proposed approach on average improves sample efficiency by 40.1% and task performance by 7.5%, compared to random initialization. We further present key ablation studies and empirical analyses that shed light on the mechanisms behind the effectiveness of our method.
[LG-33] Traveling Salesman-Based Token Ordering Improves Stability in Homomorphically Encrypted Language Models
链接: https://arxiv.org/abs/2510.12343
作者: Donghwan Rho,Sieun Seo,Hyewon Sung,Chohong Min,Ernest K. Ryu
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 34 pages
Abstract:As users increasingly interact with large language models (LLMs) using private information, secure and encrypted communication becomes essential. Homomorphic encryption (HE) provides a principled solution by enabling computation directly on encrypted data. Although prior work has explored aspects of running LLMs under HE, the challenge of text generation, particularly next-token prediction, has received limited attention and remains a key obstacle to practical encrypted interaction. In this work, we propose a TSP-based token reordering strategy to address the difficulties of encrypted text generation, together with a post-processing step that further reduces approximation error. Theoretical analysis and experimental results demonstrate that our method prevents collapse, improves coherence in generated text, and preserves data privacy throughout. Overall, our contributions advance the feasibility of practical and privacy-preserving LLM inference.
[LG-34] Leveraging Teleconnections with Physics-Informed Graph Attention Networks for Long-Range Extreme Rainfall Forecasting in Thailand
链接: https://arxiv.org/abs/2510.12328
作者: Kiattikun Chobtham,Kanoksri Sarinnapakorn,Kritanai Torsri,Prattana Deeprasertkul,Jirawan Kamma
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate rainfall forecasting, particularly for extreme events, remains a significant challenge in climatology and the Earth system. This paper presents novel physics-informed Graph Neural Networks (GNNs) combined with extreme-value analysis techniques to improve gauge-station rainfall predictions across Thailand. The model leverages a graph-structured representation of gauge stations to capture complex spatiotemporal patterns, and it offers explainability through teleconnections. We preprocess relevant climate indices that potentially influence regional rainfall. The proposed Graph Attention Network with Long Short-Term Memory (Attention-LSTM) applies the attention mechanism using initial edge features derived from simple orographic-precipitation physics formulation. The embeddings are subsequently processed by LSTM layers. To address extremes, we perform Peak-Over-Threshold (POT) mapping using the novel Spatial Season-aware Generalized Pareto Distribution (GPD) method, which overcomes limitations of traditional machine-learning models. Experiments demonstrate that our method outperforms well-established baselines across most regions, including areas prone to extremes, and remains strongly competitive with the state of the art. Compared with the operational forecasting system SEAS5, our real-world application improves extreme-event prediction and offers a practical enhancement to produce fine-resolution maps that support decision-making in long-term water management.
[LG-35] DeepTrust: Multi-Step Classification through Dissimilar Adversarial Representations for Robust Android Malware Detection
链接: https://arxiv.org/abs/2510.12310
作者: Daniel Pulido-Cortázar,Daniel Gibert,Felip Manyà
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Over the last decade, machine learning has been extensively applied to identify malicious Android applications. However, such approaches remain vulnerable against adversarial examples, i.e., examples that are subtly manipulated to fool a machine learning model into making incorrect predictions. This research presents DeepTrust, a novel metaheuristic that arranges flexible classifiers, like deep neural networks, into an ordered sequence where the final decision is made by a single internal model based on conditions activated in cascade. In the Robust Android Malware Detection competition at the 2025 IEEE Conference SaTML, DeepTrust secured the first place and achieved state-of-the-art results, outperforming the next-best competitor by up to 266% under feature-space evasion attacks. This is accomplished while maintaining the highest detection rate on non-adversarial malware and a false positive rate below 1%. The method’s efficacy stems from maximizing the divergence of the learned representations among the internal models. By using classifiers inducing fundamentally dissimilar embeddings of the data, the decision space becomes unpredictable for an attacker. This frustrates the iterative perturbation process inherent to evasion attacks, enhancing system robustness without compromising accuracy on clean examples.
[LG-36] General Fourier Feature Physics-Informed Extreme Learning Machine (GFF-PIELM) for High-Frequency PDEs
链接: https://arxiv.org/abs/2510.12293
作者: Fei Ren,Sifan Wang,Pei-Zhi Zhuang,Hai-Sui Yu,He Yang
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Computational Physics (physics.comp-ph)
*备注:
Abstract:Conventional physics-informed extreme learning machine (PIELM) often faces challenges in solving partial differential equations (PDEs) involving high-frequency and variable-frequency behaviors. To address these challenges, we propose a general Fourier feature physics-informed extreme learning machine (GFF-PIELM). We demonstrate that directly concatenating multiple Fourier feature mappings (FFMs) and an extreme learning machine (ELM) network makes it difficult to determine frequency-related hyperparameters. Fortunately, we find an alternative to establish the GFF-PIELM in three main steps. First, we integrate a variation of FFM into ELM as the Fourier-based activation function, so there is still one hidden layer in the GFF-PIELM framework. Second, we assign a set of frequency coefficients to the hidden neurons, which enables the ELM network to capture diverse frequency components of target solutions. Finally, we develop an innovative, straightforward initialization method for these hyperparameters by monitoring the distribution of ELM output weights. GFF-PIELM not only retains the high accuracy, efficiency, and simplicity of the PIELM framework but also inherits the ability of FFMs to effectively handle high-frequency problems. We carry out five case studies with a total of ten numerical examples to highlight the feasibility and validity of the proposed GFF-PIELM, involving high frequency, variable frequency, multi-scale behaviour, irregular boundary and inverse problems. Compared to conventional PIELM, the GFF-PIELM approach significantly improves predictive accuracy without additional cost in training time and architecture complexity. Our results confirm that PIELM can be extended to solve high-frequency and variable-frequency PDEs with high accuracy, and our initialization strategy may further inspire advances in other physics-informed machine learning (PIML) frameworks.
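The core construction (a single hidden layer with a Fourier-style activation and per-neuron frequency coefficients, output weights solved in closed form) can be sketched in a few lines. The snippet below fits a high-frequency 1-D function; the PDE-residual terms of the full PIELM and the paper's initialization scheme are omitted, and the frequency range is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 400).reshape(-1, 1)
y = np.sin(2 * np.pi * 25 * x).ravel()             # high-frequency target

# Hidden layer: random input weights plus per-neuron frequency coefficients
# inside a Fourier-style (cosine) activation. Frequencies here stand in for
# the paper's monitored initialization, which we do not reproduce.
n_hidden = 200
W = rng.normal(size=(1, n_hidden))
b = rng.uniform(0, 2 * np.pi, size=n_hidden)
freqs = rng.uniform(1, 40, size=n_hidden)          # per-neuron frequency coefficients
H = np.cos(freqs * (x @ W) + b)                    # (400, n_hidden) feature matrix

beta, *_ = np.linalg.lstsq(H, y, rcond=None)       # closed-form output weights (ELM)
print("train RMSE:", np.sqrt(np.mean((H @ beta - y) ** 2)))
```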
[LG-37] Multi-Action Self-Improvement for Neural Combinatorial Optimization
链接: https://arxiv.org/abs/2510.12273
作者: Laurin Luttmann,Lin Xie
类目: Machine Learning (cs.LG)
*备注:
Abstract:Self-improvement has emerged as a state-of-the-art paradigm in Neural Combinatorial Optimization (NCO), where models iteratively refine their policies by generating and imitating high-quality solutions. Despite strong empirical performance, existing methods face key limitations. Training is computationally expensive, as policy updates require sampling numerous candidate solutions per instance to extract a single expert trajectory. More fundamentally, these approaches fail to exploit the structure of combinatorial problems involving the coordination of multiple agents, such as vehicles in min-max routing or machines in scheduling. By supervising on single-action trajectories, they fail to exploit agent-permutation symmetries, where distinct sequences of actions yield identical solutions, hindering generalization and the ability to learn coordinated behavior. We address these challenges by extending self-improvement to operate over joint multi-agent actions. Our model architecture predicts complete agent-task assignments jointly at each decision step. To explicitly leverage symmetries, we employ a set-prediction loss, which supervises the policy on multiple expert assignments for any given state. This approach enhances sample efficiency and the model’s ability to learn coordinated behavior. Furthermore, by generating multi-agent actions in parallel, it drastically accelerates the solution generation phase of the self-improvement loop. Empirically, we validate our method on several combinatorial problems, demonstrating consistent improvements in the quality of the final solution and a reduced generation latency compared to standard self-improvement.
[LG-38] Heterogeneous RBCs via deep multi-agent reinforcement learning
链接: https://arxiv.org/abs/2510.12272
作者: Federico Gabriele,Aldo Glielmo,Marco Taboga
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG); Theoretical Economics (econ.TH)
*备注: 13 pages, 9 figures
Abstract:Current macroeconomic models with agent heterogeneity can be broadly divided into two main groups. Heterogeneous-agent general equilibrium (GE) models, such as those based on Heterogeneous Agents New Keynesian (HANK) or Krusell-Smith (KS) approaches, rely on GE and ‘rational expectations’, somewhat unrealistic assumptions that make the models very computationally cumbersome, which in turn limits the amount of heterogeneity that can be modelled. In contrast, agent-based models (ABMs) can flexibly encompass a large number of arbitrarily heterogeneous agents, but typically require the specification of explicit behavioural rules, which can lead to a lengthy trial-and-error model-development process. To address these limitations, we introduce MARL-BC, a framework that integrates deep multi-agent reinforcement learning (MARL) with Real Business Cycle (RBC) models. We demonstrate that MARL-BC can: (1) recover textbook RBC results when using a single agent; (2) recover the results of the mean-field KS model using a large number of identical agents; and (3) effectively simulate rich heterogeneity among agents, a hard task for traditional GE approaches. Our framework can be thought of as an ABM if used with a variety of heterogeneous interacting agents, and can reproduce GE results in limit cases. As such, it is a step towards a synthesis of these often opposed modelling paradigms.
[LG-39] FedMMKT:Co-Enhancing a Server Text-to-Image Model and Client Task Models in Multi-Modal Federated Learning
链接: https://arxiv.org/abs/2510.12254
作者: Ningxin He,Yang Liu,Wei Sun,Xiaozhou Ye,Ye Ouyang,Tiegang Gao,Zehui Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Text-to-Image (T2I) models have demonstrated their versatility in a wide range of applications. However, adaptation of T2I models to specialized tasks is often limited by the availability of task-specific data due to privacy concerns. On the other hand, harnessing the power of rich multimodal data from modern mobile systems and IoT infrastructures presents a great opportunity. This paper introduces Federated Multi-modal Knowledge Transfer (FedMMKT), a novel framework that enables co-enhancement of a server T2I model and client task-specific models using decentralized multimodal data without compromising data privacy.
[LG-40] Optimal Regularization for Performative Learning
链接: https://arxiv.org/abs/2510.12249
作者: Edwige Cyffers,Alireza Mirrokni,Marco Mondelli
类目: Machine Learning (cs.LG)
*备注:
Abstract:In performative learning, the data distribution reacts to the deployed model - for example, because strategic users adapt their features to game it - which creates a more complex dynamic than in classical supervised learning. One should thus not only optimize the model for the current data but also take into account that the model might steer the distribution in a new direction, without knowing the exact nature of the potential shift. We explore how regularization can help cope with performative effects by studying its impact in high-dimensional ridge regression. We show that, while performative effects worsen the test risk in the population setting, they can be beneficial in the over-parameterized regime where the number of features exceeds the number of samples. We show that the optimal regularization scales with the overall strength of the performative effect, making it possible to set the regularization in anticipation of this effect. We illustrate this finding through empirical evaluations of the optimal regularization parameter on both synthetic and real-world datasets.
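A toy numerical version of the setting: ridge regression where the feature distribution drifts toward the deployed parameter vector between rounds. The shift model, noise scales, and dimensions are illustrative assumptions, intended only to show how one might probe the regularization/performativity interaction empirically.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 200, 100                                    # over-parameterized: d > n
theta_star = rng.normal(size=d) / np.sqrt(d)

def performative_data(theta_deployed, eps=0.5):
    """Toy performative shift: features drift toward the deployed model."""
    X = rng.normal(size=(n, d)) + eps * theta_deployed
    y = X @ theta_star + rng.normal(scale=0.5, size=n)
    return X, y

theta = np.zeros(d)
for lam in [0.01, 0.1, 1.0, 10.0]:
    X, y = performative_data(theta)                # data reacts to last deployment
    theta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)  # ridge fit
    Xte, yte = performative_data(theta)            # distribution reacts again
    print(f"lambda={lam:5.2f}  test risk={np.mean((Xte @ theta - yte) ** 2):.3f}")
```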
[LG-41] Unveiling the Vulnerability of Graph-LLM s: An Interpretable Multi-Dimensional Adversarial Attack on TAGs
链接: https://arxiv.org/abs/2510.12233
作者: Bowen Fan,Zhilin Guo,Xunkai Li,Yihan Zhou,Bing Zhou,Zhenjun Li,Rong-Hua Li,Guoren Wang
类目: Machine Learning (cs.LG)
*备注: 12 pages, 4 figures
Abstract:Graph Neural Networks (GNNs) have become a pivotal framework for modeling graph-structured data, enabling a wide range of applications from social network analysis to molecular chemistry. By integrating large language models (LLMs), text-attributed graphs (TAGs) enhance node representations with rich textual semantics, significantly boosting the expressive power of graph-based learning. However, this sophisticated synergy introduces critical vulnerabilities, as Graph-LLMs are susceptible to adversarial attacks on both their structural topology and textual attributes. Although specialized attack methods have been designed for each of these aspects, no work has yet unified them into a comprehensive approach. In this work, we propose the Interpretable Multi-Dimensional Graph Attack (IMDGA), a novel human-centric adversarial attack framework designed to orchestrate multi-level perturbations across both graph structure and textual features. IMDGA utilizes three tightly integrated modules to craft attacks that balance interpretability and impact, enabling a deeper understanding of Graph-LLM vulnerabilities. Through rigorous theoretical analysis and comprehensive empirical evaluations on diverse datasets and architectures, IMDGA demonstrates superior interpretability, attack effectiveness, stealthiness, and robustness compared to existing methods. By exposing critical weaknesses in TAG representation learning, this work uncovers a previously underexplored semantic dimension of vulnerability in Graph-LLMs, offering valuable insights for improving their resilience. Our code and resources are publicly available at this https URL.
[LG-42] Hierarchical Koopman Diffusion: Fast Generation with Interpretable Diffusion Trajectory NEURIPS2025
链接: https://arxiv.org/abs/2510.12220
作者: Hanru Bai,Weiyang Ding,Difan Zou
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2025
Abstract:Diffusion models have achieved impressive success in high-fidelity image generation but suffer from slow sampling due to their inherently iterative denoising process. While recent one-step methods accelerate inference by learning direct noise-to-image mappings, they sacrifice the interpretability and fine-grained control intrinsic to diffusion dynamics, key advantages that enable applications like editable generation. To resolve this dichotomy, we introduce \textbf{Hierarchical Koopman Diffusion}, a novel framework that achieves both one-step sampling and interpretable generative trajectories. Grounded in Koopman operator theory, our method lifts the nonlinear diffusion dynamics into a latent space where evolution is governed by globally linear operators, enabling closed-form trajectory solutions. This formulation not only eliminates iterative sampling but also provides full access to intermediate states, allowing manual intervention during generation. To model the multi-scale nature of images, we design a hierarchical architecture that disentangles generative dynamics across spatial resolutions via scale-specific Koopman subspaces, capturing coarse-to-fine details systematically. We empirically show that the Hierarchical Koopman Diffusion not only achieves competitive one-step generation performance but also provides a principled mechanism for interpreting and manipulating the generative process through spectral analysis. Our framework bridges the gap between fast sampling and interpretability in diffusion models, paving the way for explainable image synthesis in generative modeling.
[LG-43] Controllable Collision Scenario Generation via Collision Pattern Prediction ICRA
链接: https://arxiv.org/abs/2510.12206
作者: Pin-Lun Chen,Chi-Hsi Kung,Che-Han Chang,Wei-Chen Chiu,Yi-Ting Chen
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 8 pages, 3 figures. Submitted to IEEE International Conference on Robotics and Automation (ICRA) 2026
Abstract:Evaluating the safety of autonomous vehicles (AVs) requires diverse, safety-critical scenarios, with collisions being especially important yet rare and unsafe to collect in the real world. Therefore, the community has been focusing on generating safety-critical scenarios in simulation. However, controlling attributes such as collision type and time-to-accident (TTA) remains challenging. We introduce a new task called controllable collision scenario generation, where the goal is to produce trajectories that realize a user-specified collision type and TTA, to investigate the feasibility of automatically generating desired collision scenarios. To support this task, we present COLLIDE, a large-scale collision scenario dataset constructed by transforming real-world driving logs into diverse collisions, balanced across five representative collision types and different TTA intervals. We propose a framework that predicts Collision Pattern, a compact and interpretable representation that captures the spatial configuration of the ego and the adversarial vehicles at impact, before rolling out full adversarial trajectories. Experiments show that our approach outperforms strong baselines in both collision rate and controllability. Furthermore, generated scenarios consistently induce higher planner failure rates, revealing limitations of existing planners. We further show that fine-tuning planners on these scenarios improves their robustness, contributing to safer AV deployment across different collision scenarios.
[LG-44] Self-Verifying Reflection Helps Transformers with CoT Reasoning NEURIPS2025
链接: https://arxiv.org/abs/2510.12157
作者: Zhongwei Yu,Wannian Xia,Xue Yan,Bo Xu,Haifeng Zhang,Yali Du,Jun Wang
类目: Machine Learning (cs.LG)
*备注: Accepted by NeurIPS2025
Abstract:Advanced large language models (LLMs) frequently reflect in reasoning chain-of-thoughts (CoTs), where they self-verify the correctness of current solutions and explore alternatives. However, given recent findings that LLMs detect limited errors in CoTs, how reflection contributes to empirical improvements remains unclear. To analyze this issue, in this paper, we present a minimalistic reasoning framework to support basic self-verifying reflection for small transformers without natural language, which ensures analytic clarity and reduces the cost of comprehensive experiments. Theoretically, we prove that self-verifying reflection guarantees improvements if verification errors are properly bounded. Experimentally, we show that tiny transformers, with only a few million parameters, benefit from self-verification in both training and reflective execution, reaching remarkable LLM-level performance in integer multiplication and Sudoku. Similar to LLM results, we find that reinforcement learning (RL) improves in-distribution performance and incentivizes frequent reflection for tiny transformers, yet RL mainly optimizes shallow statistical patterns without faithfully reducing verification errors. In conclusion, integrating generative transformers with discriminative verification inherently facilitates CoT reasoning, regardless of scaling and natural language.
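The generate-verify-retry pattern at the heart of self-verifying reflection can be sketched independently of any transformer. Here `propose` and `verify` are toy stand-ins for the generative and discriminative heads; note the verifier is deliberately imperfect (it checks only the last digit), echoing the paper's point that improvements hold as long as verification errors are bounded.

```python
import random

def solve_with_reflection(problem, propose, verify, max_tries=5):
    """Minimal self-verifying reflection loop: propose, self-check, retry."""
    for _ in range(max_tries):
        candidate = propose(problem)
        if verify(problem, candidate):             # self-verification gate
            return candidate
    return candidate                               # fall back to last attempt

# Toy instance: integer multiplication with an occasionally wrong proposer.
propose = lambda ab: ab[0] * ab[1] + random.choice([0, 0, 0, 1])
verify = lambda ab, c: c % 10 == (ab[0] * ab[1]) % 10   # cheap partial check
print(solve_with_reflection((123, 456), propose, verify))
```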
[LG-45] Fairness-Constrained Optimization Attack in Federated Learning
链接: https://arxiv.org/abs/2510.12143
作者: Harsh Kasyap,Minghong Fang,Zhuqing Liu,Carsten Maple,Somanath Tripathy
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: To appear in IEEE TrustCom 2025
Abstract:Federated learning (FL) is a privacy-preserving machine learning technique that facilitates collaboration among participants across demographics. FL enables model sharing, while restricting the movement of data. Since FL provides participants with independence over their training data, it becomes susceptible to poisoning attacks. Such collaboration also propagates bias among the participants, even unintentionally, due to different data distribution or historical bias present in the data. This paper proposes an intentional fairness attack, where a client maliciously sends a biased model, by increasing the fairness loss while training, even considering homogeneous data distribution. The fairness loss is calculated by solving an optimization problem for fairness metrics such as demographic parity and equalized odds. The attack is insidious and hard to detect, as it maintains global accuracy even after increasing the bias. We evaluate our attack against the state-of-the-art Byzantine-robust and fairness-aware aggregation schemes over different datasets, in various settings. The empirical results demonstrate the attack efficacy by increasing the bias up to 90%, even in the presence of a single malicious client in the FL system.
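A sketch of the attack's key ingredient: a differentiable demographic-parity gap that a malicious client subtracts from its local objective, pushing the shared model toward biased predictions while the accuracy term keeps global performance intact. The specific surrogate and weighting are our assumptions, not the paper's exact optimization problem.

```python
import torch

def demographic_parity_loss(logits, sensitive):
    """Absolute gap between positive-prediction rates of the two groups."""
    probs = torch.sigmoid(logits)
    rate_a = probs[sensitive == 0].mean()
    rate_b = probs[sensitive == 1].mean()
    return torch.abs(rate_a - rate_b)

logits = torch.randn(32, requires_grad=True)
sensitive = torch.randint(0, 2, (32,))             # hypothetical sensitive attribute
labels = torch.randint(0, 2, (32,)).float()
bce = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss = bce - 1.0 * demographic_parity_loss(logits, sensitive)  # maximize unfairness
loss.backward()                                    # gradients for the poisoned update
```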
[LG-46] Graph Few-Shot Learning via Adaptive Spectrum Experts and Cross-Set Distribution Calibration NEURIPS25
链接: https://arxiv.org/abs/2510.12140
作者: Yonghao Liu,Yajun Wang,Chunli Guo,Wei Pang,Ximing Li,Fausto Giunchiglia,Xiaoyue Feng,Renchu Guan
类目: Machine Learning (cs.LG)
*备注: NeurIPS25
Abstract:Graph few-shot learning has attracted increasing attention due to its ability to rapidly adapt models to new tasks with only limited labeled nodes. Despite the remarkable progress made by existing graph few-shot learning methods, several key limitations remain. First, most current approaches rely on predefined and unified graph filters (e.g., low-pass or high-pass filters) to globally enhance or suppress node frequency signals. Such fixed spectral operations fail to account for the heterogeneity of local topological structures inherent in real-world graphs. Moreover, these methods often assume that the support and query sets are drawn from the same distribution. However, under few-shot conditions, the limited labeled data in the support set may not sufficiently capture the complex distribution of the query set, leading to suboptimal generalization. To address these challenges, we propose GRACE, a novel Graph few-shot leaRning framework that integrates Adaptive spectrum experts with Cross-sEt distribution calibration techniques. Theoretically, the proposed approach enhances model generalization by adapting to both local structural variations and cross-set distribution calibration. Empirically, GRACE consistently outperforms state-of-the-art baselines across a wide range of experimental settings. Our code can be found here.
[LG-47] nuGPR: GPU-Accelerated Gaussian Process Regression with Iterative Algorithms and Low-Rank Approximations
链接: https://arxiv.org/abs/2510.12128
作者: Ziqi Zhao,Vivek Sarin
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Numerical Analysis (math.NA)
*备注: 22 pages, 6 figures, published in SIAM Journal on Scientific Computing, E-print available at: this https URL
Abstract:Gaussian Process Regression (GPR) is an important type of supervised machine learning model with inherent uncertainty measure in its predictions. We propose a new framework, nuGPR, to address the well-known challenge of high computation cost associated with GPR training. Our framework includes several ideas from numerical linear algebra to reduce the amount of computation in key steps of GPR, and we combine them to establish an end-to-end training algorithm. Specifically, we leverage the preconditioned conjugate gradient method to accelerate the convergence of the linear solves required in GPR. We exploit clustering in the input data to identify block-diagonal structure of the covariance matrix and subsequently construct low-rank approximations of the off-diagonal blocks. These enhancements significantly reduce the time and space complexity of our computations. In addition, unlike other frameworks that rely on exact differentiation, we employ numerical gradients to optimize the hyperparameters of our GPR model, further reducing the training cost by eliminating the need for backpropagation. Lastly, we leverage the CUDA Toolkit to efficiently parallelize the training procedure on NVIDIA GPUs. As a result, nuGPR reduces total training time by up to 2x and peak memory consumption by up to 12x on various synthetic and real-world datasets when compared to the best existing GPU-based GPR implementation.
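The iterative linear solve at the core of nuGPR-style training can be sketched directly; below is a standard Jacobi-preconditioned conjugate gradient applied to a GP system $(K + \sigma^2 I)v = y$. The RBF kernel, noise level, and Jacobi preconditioner are illustrative choices, not necessarily the paper's exact preconditioner or low-rank structure.

```python
import numpy as np

def pcg(A, b, M_inv_diag, tol=1e-8, max_iter=500):
    """Jacobi-preconditioned conjugate gradient for A x = b."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv_diag * r
    p = z.copy()
    for _ in range(max_iter):
        Ap = A @ p
        alpha = (r @ z) / (p @ Ap)
        x += alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            break
        z_new = M_inv_diag * r_new
        beta = (r_new @ z_new) / (r @ z)
        p = z_new + beta * p
        r, z = r_new, z_new
    return x

# GP solve: (K + sigma^2 I) v = y with an RBF kernel on 1-D inputs.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 300))
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)
A = K + 0.1 * np.eye(300)
y = np.sin(X)
v = pcg(A, y, M_inv_diag=1.0 / np.diag(A))
print("residual:", np.linalg.norm(A @ v - y))
```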
[LG-48] Locket: Robust Feature-Locking Technique for Language Models
链接: https://arxiv.org/abs/2510.12117
作者: Lipeng He,Vasisht Duddu,N. Asokan
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 12 pages, 3 figures
Abstract:Chatbot providers (e.g., OpenAI) rely on tiered subscription schemes to generate revenue, offering basic models for free users, and advanced models for paying subscribers. However, a finer-grained pay-to-unlock scheme for premium features (e.g., math, coding) is thought to be more economically viable for the providers. Such a scheme requires a feature-locking technique (FLoTE) which is (i) effective in refusing locked features, (ii) utility-preserving for unlocked features, (iii) robust against evasion or unauthorized credential sharing, and (iv) scalable to multiple features and users. However, existing FLoTEs (e.g., password-locked models) are not robust or scalable. We present Locket, the first robust and scalable FLoTE to enable pay-to-unlock schemes. Locket uses a novel merging approach to attach adapters to an LLM for refusing unauthorized features. Our comprehensive evaluation shows that Locket is effective (100% refusal on locked features), utility-preserving ($\leq 7\%$ utility degradation in unlocked features), robust ($\leq 5\%$ attack success rate), and scales to multiple features and clients.
[LG-49] Rethinking the Role of Dynamic Sparse Training for Scalable Deep Reinforcement Learning
链接: https://arxiv.org/abs/2510.12096
作者: Guozheng Ma,Lu Li,Zilin Wang,Haoyu Wang,Shengchao Hu,Leszek Rutkowski,Dacheng Tao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Scaling neural networks has driven breakthrough advances in machine learning, yet this paradigm fails in deep reinforcement learning (DRL), where larger models often degrade performance due to unique optimization pathologies such as plasticity loss. While recent works show that dynamically adapting network topology during training can mitigate these issues, existing studies have three critical limitations: (1) applying uniform dynamic training strategies across all modules despite encoder, critic, and actor following distinct learning paradigms, (2) focusing evaluation on basic architectures without clarifying the relative importance and interaction between dynamic training and architectural improvements, and (3) lacking systematic comparison between different dynamic approaches including sparse-to-sparse, dense-to-sparse, and sparse-to-dense. Through comprehensive investigation across modules and architectures, we reveal that dynamic sparse training strategies provide module-specific benefits that complement the primary scalability foundation established by architectural improvements. We finally distill these insights into Module-Specific Training (MST), a practical framework that further exploits the benefits of architectural improvements and demonstrates substantial scalability gains across diverse RL algorithms without algorithmic modifications.
[LG-50] H4G: Unlocking Faithful Inference for Zero-Shot Graph Learning in Hyperbolic Space
链接: https://arxiv.org/abs/2510.12094
作者: Heng Zhang,Tianyi Zhang,Zijun Liu,Yuling Shi,Yaomin Shen,Haochen You,Haichuan Hu,Lubin Gan,Jin Huang
类目: Machine Learning (cs.LG); Graphics (cs.GR)
*备注:
Abstract:Text-attributed graphs are widely used across domains, offering rich opportunities for zero-shot learning via graph-text alignment. However, existing methods struggle with tasks requiring fine-grained pattern recognition, particularly on heterophilic graphs. Through empirical and theoretical analysis, we identify an \textbf{over-abstraction} problem: current approaches operate at excessively large hyperbolic radii, compressing multi-scale structural information into uniform high-level abstractions. This abstraction-induced information loss obscures critical local patterns essential for accurate predictions. By analyzing embeddings in hyperbolic space, we demonstrate that optimal graph learning requires \textbf{faithful preservation} of fine-grained structural details, better retained by representations positioned closer to the origin. To address this, we propose \textbf{H4G}, a framework that systematically reduces embedding radii using learnable block-diagonal scaling matrices and Möbius matrix multiplication. This approach restores access to fine-grained patterns while maintaining global receptive ability with minimal computational overhead. Experiments show H4G achieves state-of-the-art zero-shot performance with \textbf{12.8%} improvement on heterophilic graphs and \textbf{8.4%} on homophilic graphs, confirming that radius reduction enables faithful multi-scale representation for advancing zero-shot graph learning.
[LG-51] GraphShaper: Geometry-aware Alignment for Improving Transfer Learning in Text-Attributed Graphs
链接: https://arxiv.org/abs/2510.12085
作者: Heng Zhang,Tianyi Zhang,Yuling Shi,Xiaodong Gu,Yaomin Shen,Haochen You,Zijian Zhang,Yilei Yuan,Jin Huang
类目: Machine Learning (cs.LG); Graphics (cs.GR)
*备注:
Abstract:Graph foundation models represent a transformative paradigm for learning transferable representations across diverse graph domains. Recent methods leverage large language models to unify graph and text modalities into a shared representation space using contrastive learning. However, systematic evaluations reveal significant performance degradation at structural boundaries where distinct topological patterns converge, with accuracy losses exceeding 20 percentage points. This issue arises from a key limitation: current methods assume all graph structures can be encoded within a single Euclidean space. In reality, tree structures require hyperbolic geometry to preserve hierarchical branching, while cyclic patterns depend on spherical geometry for closure properties. At structural boundaries, nodes experience conflicting geometric constraints that uniform encoding spaces cannot resolve. This raises a crucial challenge: \textbf{Can alignment frameworks be designed to respect the intrinsic geometric diversity of graph structures?} We introduce \textbf{GraphShaper}, a geometry-aware framework that enhances graph encoding through multi-geometric specialization. Our approach employs expert networks tailored to different geometric spaces, dynamically computing fusion weights to adaptively integrate geometric properties based on local structural characteristics. This adaptive fusion preserves structural integrity before alignment with text embeddings. Extensive experiments demonstrate that GraphShaper achieves 9.47% accuracy improvements on citation networks and 7.63% on social networks in zero-shot settings.
[LG-52] FedLoDrop: Federated LoRA with Dropout for Generalized LLM Fine-tuning
链接: https://arxiv.org/abs/2510.12078
作者: Sijing Xie,Dingzhu Wen,Changsheng You,Qimei Chen,Mehdi Bennis,Kaibin Huang
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:Fine-tuning (FT) large language models (LLMs) is crucial for adapting general-purpose models to specific tasks, enhancing accuracy and relevance with minimal resources. To further enhance generalization ability while reducing training costs, this paper proposes Federated LoRA with Dropout (FedLoDrop), a new framework that applies dropout to the rows and columns of the trainable matrix in Federated LoRA. A generalization error bound and convergence analysis under sparsity regularization are obtained, which elucidate the fundamental trade-off between underfitting and overfitting. The error bound reveals that a higher dropout rate increases model sparsity, thereby lowering the upper bound of pointwise hypothesis stability (PHS). While this reduces the gap between empirical and generalization errors, it also incurs a higher empirical error, which, together with the gap, determines the overall generalization error. On the other hand, though dropout reduces communication costs, deploying FedLoDrop at the network edge still faces challenges due to limited network resources. To address this issue, an optimization problem is formulated to minimize the upper bound of the generalization error, by jointly optimizing the dropout rate and resource allocation subject to the latency and per-device energy consumption constraints. To solve this problem, a branch-and-bound (B&B)-based method is proposed to obtain its globally optimal solution. Moreover, to reduce the high computational complexity of the B&B-based method, a penalized successive convex approximation (P-SCA)-based algorithm is proposed to efficiently obtain its high-quality suboptimal solution. Finally, numerical results demonstrate the effectiveness of the proposed approach in mitigating overfitting and improving the generalization capability.
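A sketch of the dropout mechanism itself: masking rows of the LoRA A matrix and the matching columns of B before forming the low-rank update. Treating the shared inner dimension as the unit being dropped is our reading of "rows and columns of the trainable matrix"; the federated aggregation and resource-allocation machinery are omitted.

```python
import torch

def lora_dropout_delta(A, B, p=0.3, training=True):
    """LoRA update B @ A with dropout on the shared inner dimension.

    A: (r, d_in), B: (d_out, r) -> delta W: (d_out, d_in).
    Dropping rank index k zeroes row k of A and column k of B together.
    """
    if training:
        keep = (torch.rand(A.shape[0]) > p).to(A.dtype) / (1 - p)  # inverted dropout
        A = keep[:, None] * A             # drop rows of A
        B = B * keep[None, :]             # drop matching columns of B
    return B @ A

A = torch.randn(8, 128) * 0.01
B = torch.zeros(64, 8)                    # standard LoRA init: B starts at zero
delta = lora_dropout_delta(A, B)
print(delta.shape)                        # torch.Size([64, 128])
```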
[LG-53] Influence Dynamics and Stagewise Data Attribution
链接: https://arxiv.org/abs/2510.12071
作者: Jin Hwa Lee,Matthew Smith,Maxwell Adam,Jesse Hoogland
类目: Machine Learning (cs.LG)
*备注: 28 pages, 15 figures
Abstract:Current training data attribution (TDA) methods treat the influence one sample has on another as static, but neural networks learn in distinct stages that exhibit changing patterns of influence. In this work, we introduce a framework for stagewise data attribution grounded in singular learning theory. We predict that influence can change non-monotonically, including sign flips and sharp peaks at developmental transitions. We first validate these predictions analytically and empirically in a toy model, showing that dynamic shifts in influence directly map to the model’s progressive learning of a semantic hierarchy. Finally, we demonstrate these phenomena at scale in language models, where token-level influence changes align with known developmental stages.
[LG-54] MIARec: Mutual-influence-aware Heterogeneous Network Embedding for Scientific Paper Recommendation
链接: https://arxiv.org/abs/2510.12054
作者: Wenjin Xie,Tao Jia
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
Abstract:With the rapid expansion of scientific literature, scholars increasingly demand precise and high-quality paper recommendations. Among various recommendation methodologies, graph-based approaches have garnered attention by effectively exploiting the structural characteristics inherent in scholarly networks. However, these methods often overlook the asymmetric academic influence that is prevalent in scholarly networks when learning graph representations. To address this limitation, this study proposes the Mutual-Influence-Aware Recommendation (MIARec) model, which employs a gravity-based approach to measure the mutual academic influence between scholars and incorporates this influence into the feature aggregation process during message propagation in graph representation learning. Additionally, the model utilizes a multi-channel aggregation method to capture both individual embeddings of distinct single relational sub-networks and their interdependent embeddings, thereby enabling a more comprehensive understanding of the heterogeneous scholarly network. Extensive experiments conducted on real-world datasets demonstrate that the MIARec model outperforms baseline models across three primary evaluation metrics, indicating its effectiveness in scientific paper recommendation tasks.
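The gravity-based influence measure lends itself to a short sketch: treating a scholar's citation count as "mass" and network distance as "distance" (both our assumptions, since the abstract does not specify the quantities), pairwise influence follows an inverse-square form and can be normalized into aggregation weights for message passing.

```python
import numpy as np

def mutual_influence(mass_i, mass_j, d_ij, gamma=2.0):
    """Gravity-style academic influence between scholars i and j.
    gamma mirrors the inverse-square law; the epsilon avoids division by zero."""
    return (mass_i * mass_j) / (d_ij ** gamma + 1e-9)

# Edge weights for feature aggregation: influence of each neighbor on scholar 0.
citations = np.array([120.0, 45.0, 900.0, 10.0])   # scholar "masses"; index 0 is ego
dists = np.array([1.0, 2.0, 3.0])                  # network distances to scholars 1..3
w = mutual_influence(citations[0], citations[1:], dists)
w = w / w.sum()                                    # normalized aggregation weights
print(w)
```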
[LG-55] Mamba Can Learn Low-Dimensional Targets In-Context via Test-Time Feature Learning
链接: https://arxiv.org/abs/2510.12026
作者: Junsoo Oh,Wei Huang,Taiji Suzuki
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 34 pages
Abstract:Mamba, a recently proposed linear-time sequence model, has attracted significant attention for its computational efficiency and strong empirical performance. However, a rigorous theoretical understanding of its underlying mechanisms remains limited. In this work, we provide a theoretical analysis of Mamba’s in-context learning (ICL) capability by focusing on tasks defined by low-dimensional nonlinear target functions. Specifically, we study in-context learning of a single-index model $y \approx g_*(\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle)$, which depends on only a single relevant direction $\boldsymbol{\beta}$, referred to as the feature. We prove that Mamba, pretrained by gradient-based methods, can achieve efficient ICL via test-time feature learning, extracting the relevant direction directly from context examples. Consequently, we establish a test-time sample complexity that improves upon linear Transformers – analyzed to behave like kernel methods – and is comparable to nonlinear Transformers, which have been shown to surpass the Correlational Statistical Query (CSQ) lower bound and achieve near information-theoretically optimal rate in previous works. Our analysis reveals the crucial role of the nonlinear gating mechanism in Mamba for feature extraction, highlighting it as the fundamental driver behind Mamba’s ability to achieve both computational efficiency and high performance.
[LG-56] Embedding the Teacher: Distilling vLLM Preferences for Scalable Image Retrieval
链接: https://arxiv.org/abs/2510.12014
作者: Eric He,Akash Gupta,Adian Liusie,Vatsal Raina,Piotr Molenda,Shirom Chabra,Vyas Raina
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
Abstract:Text–image retrieval is necessary for applications such as product recommendation. Embedding-based approaches like CLIP enable efficient large-scale retrieval via vector similarity search, but they are primarily trained on literal caption-like text–image pairs and often fail to capture abstract or persona-driven attributes common in product recommendation applications (e.g., "a gift for a mother who loves gardening"). In contrast, state-of-the-art vision–language models (vLLMs) can align text with images in a flexible manner, but their limited context window prevents them from directly handling retrieval over large catalogs. We propose a framework that distills the preference rankings of a powerful vLLM into an embedding-based system, transferring its nuanced alignment abilities while maintaining the inference-time scalability of an embedding-based approach. Experiments on persona-driven product recommendation tasks demonstrate that our method significantly outperforms existing embedding-based baselines, providing an efficient solution for personalized text–image retrieval.
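A sketch of the distillation objective: given the teacher vLLM's preference ordering over candidate images, a pairwise margin loss pushes the student's cosine scores to respect that ordering. The margin surrogate, embedding dimension, and candidate count are our assumptions; the paper may use a different ranking loss.

```python
import torch
import torch.nn.functional as F

def distill_ranking_loss(query_emb, image_embs, teacher_ranking, margin=0.1):
    """Pairwise margin loss aligning student scores with a teacher's ranking."""
    scores = F.cosine_similarity(query_emb[None, :], image_embs, dim=-1)
    s = scores[teacher_ranking]                # scores in teacher-preferred order
    diffs = s[:, None] - s[None, :]            # s_i - s_j for all pairs
    upper = torch.triu(torch.ones_like(diffs), diagonal=1)  # pairs with i ranked above j
    return (F.relu(margin - diffs) * upper).sum() / upper.sum()

q = torch.randn(256, requires_grad=True)       # student query embedding
imgs = torch.randn(5, 256)                     # candidate image embeddings
teacher_ranking = torch.tensor([2, 0, 4, 1, 3])  # hypothetical vLLM preference order
distill_ranking_loss(q, imgs, teacher_ranking).backward()
```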
[LG-57] Nonlinear discretizations and Newton's method: characterizing stationary points of regression objectives
链接: https://arxiv.org/abs/2510.11987
作者: Conor Rowan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Second-order methods are emerging as promising alternatives to standard first-order optimizers such as gradient descent and ADAM for training neural networks. Though the advantages of including curvature information in computing optimization steps have been celebrated in the scientific machine learning literature, the only second-order methods that have been studied are quasi-Newton, meaning that the Hessian matrix of the objective function is approximated. Though one would expect only to gain from using the true Hessian in place of its approximation, we show that neural network training reliably fails when relying on exact curvature information. The failure modes provide insight both into the geometry of nonlinear discretizations as well as the distribution of stationary points in the loss landscape, leading us to question the conventional wisdom that the loss landscape is replete with local minima.
[LG-58] Learning by Steering the Neural Dynamics: A Statistical Mechanics Perspective
链接: https://arxiv.org/abs/2510.11984
作者: Mattia Scardecchia
类目: Machine Learning (cs.LG)
*备注:
Abstract:Despite the striking successes of deep neural networks trained with gradient-based optimization, these methods differ fundamentally from their biological counterparts. This gap raises key questions about how nature achieves robust, sample-efficient learning at minimal energy costs and solves the credit-assignment problem without backpropagation. We take a step toward bridging contemporary AI and computational neuroscience by studying how neural dynamics can support fully local, distributed learning that scales to simple machine-learning benchmarks. Using tools from statistical mechanics, we identify conditions for the emergence of robust dynamical attractors in random asymmetric recurrent networks. We derive a closed-form expression for the number of fixed points as a function of self-coupling strength, and we reveal a phase transition in their structure: below a critical self-coupling, isolated fixed points coexist with exponentially many narrow clusters showing the overlap-gap property; above it, subdominant yet dense and extensive clusters appear. These fixed points become accessible, including to a simple asynchronous dynamical rule, after an algorithm-dependent self-coupling threshold. Building on this analysis, we propose a biologically plausible algorithm for supervised learning with any binary recurrent network. Inputs are mapped to fixed points of the dynamics, by relaxing under transient external stimuli and stabilizing the resulting configurations via local plasticity. We show that our algorithm can learn an entangled version of MNIST, leverages depth to develop hierarchical representations and increase hetero-association capacity, and is applicable to several architectures. Finally, we highlight the strong connection between algorithm performance and the unveiled phase transition, and we suggest a cortex-inspired alternative to self-couplings for its emergence.
[LG-59] QLENS: Towards A Quantum Perspective of Language Transformers
链接: https://arxiv.org/abs/2510.11963
作者: Aditya Gupta,Kirandeep Kaur,Vinayak Gupta
类目: Machine Learning (cs.LG)
*备注:
Abstract:In natural language processing, current methods for understanding Transformers are successful at identifying intermediate predictions during a model’s inference. However, these approaches function as limited diagnostic checkpoints, lacking a mathematical framework for mechanistically modeling how each layer facilitates transitions between these evolving states. This interpretability gap and past successes of interdisciplinary outlooks inspire us to turn to physics in search of a descriptive mathematical framework for Transformers. We observe that language models are intrinsically probabilistic, an attribute that is echoed in the core postulates of quantum mechanics. This parallel inspires us to translate insights from this discipline to that of natural language processing. Towards this objective, we propose QLENS, a novel attempt to develop a physics-based perspective on the Transformer generation process. Under QLENS, a Transformer is studied by converting its latent activations into a state vector in a Hilbert space derived from the model’s output units. This state subsequently evolves through hidden layers - reformulated as unitary operators and analogously defined Hamiltonians - during inference. The model’s final probability distribution is obtained by applying the Born rule to the end state using a specific measurement operator. To demonstrate QLENS’s potential, we conduct a proof-of-concept by probing a toy Transformer to investigate the influence of individual layers in a model’s prediction trajectory. We present our work as a foundation for cross-domain insights to be leveraged towards a broader understanding of Transformers.
[LG-60] On efficiently computable functions, deep networks and sparse compositionality
链接: https://arxiv.org/abs/2510.11942
作者: Tomaso Poggio
类目: Machine Learning (cs.LG)
*备注:
Abstract:We show that \emph{efficient} Turing computability at any fixed input/output precision implies the existence of \emph{compositionally sparse} (bounded-fan-in, polynomial-size) DAG representations and of corresponding neural approximants achieving the target precision. Concretely: if $f:[0,1]^d\to\mathbb{R}^m$ is computable in time polynomial in the bit-depths, then for every pair of precisions $(n, m_{\mathrm{out}})$ there exists a bounded-fan-in Boolean circuit of size and depth $\mathrm{poly}(n+m_{\mathrm{out}})$ computing the discretized map; replacing each gate by a constant-size neural emulator yields a deep network of size/depth $\mathrm{poly}(n+m_{\mathrm{out}})$ that achieves accuracy $\varepsilon=2^{-m_{\mathrm{out}}}$. We also relate these constructions to compositional approximation rates [MhaskarPoggio2016b, poggio_deep_shallow_2017, Poggio2017, Poggio2023HowDS] and to optimization viewed as hierarchical search over sparse structures.
[LG-61] Efficient Restarts in Non-Stationary Model-Free Reinforcement Learning NEURIPS2025
链接: https://arxiv.org/abs/2510.11933
作者: Hiroshi Nonaka,Simon Ambrozak,Sofia R. Miskala-Dinc,Amedeo Ercole,Aviva Prins
类目: Machine Learning (cs.LG)
*备注: This paper contains 19 pages and 3 figures. To be presented at the 2nd Workshop on Aligning Reinforcement Learning Experimentalists and Theorists (ARLET 2025) at NeurIPS 2025
Abstract:In this work, we propose three efficient restart paradigms for model-free non-stationary reinforcement learning (RL). We identify two core issues with the restart design of Mao et al. (2022)'s RestartQ-UCB algorithm: (1) complete forgetting, where all the information learned about an environment is lost after a restart, and (2) scheduled restarts, in which restarts occur only at predefined timings, regardless of the incompatibility of the policy with the current environment dynamics. We introduce three approaches, which we call partial, adaptive, and selective restarts, to modify the algorithms RestartQ-UCB and RANDOMIZEDQ (Wang et al., 2025). We find near-optimal empirical performance in multiple different environments, decreasing dynamic regret by up to 91% relative to RestartQ-UCB.
[LG-62] Variational Mixture of Graph Neural Experts for Alzheimer's Disease Biomarker Recognition in EEG Brain Networks
链接: https://arxiv.org/abs/2510.11917
作者: Jun-En Ding,Anna Zilverstand,Shihao Yang,Albert Chih-Chieh Yang,Feng Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Dementia disorders such as Alzheimer’s disease (AD) and frontotemporal dementia (FTD) exhibit overlapping electrophysiological signatures in EEG that challenge accurate diagnosis. Existing EEG-based methods are limited by full-band frequency analysis that hinders precise differentiation of dementia subtypes and severity stages. We propose a variational mixture of graph neural experts (VMoGE) that integrates frequency-specific biomarker identification with structured variational inference for enhanced dementia diagnosis and staging. VMoGE employs a multi-granularity transformer to extract multi-scale temporal patterns across four frequency bands, followed by a variational graph convolutional encoder using Gaussian Markov Random Field priors. Through structured variational inference and adaptive gating, VMoGE links neural specialization to physiologically meaningful EEG frequency bands. Evaluated on two diverse datasets for both subtype classification and severity staging, VMoGE achieves superior performance with AUC improvements of +4% to +10% over state-of-the-art methods. Moreover, VMoGE provides interpretable insights through expert weights that correlate with clinical indicators and spatial patterns aligned with neuropathological signatures, facilitating EEG biomarker discovery for comprehensive dementia diagnosis and monitoring.
[LG-63] ADARL: Adaptive Low-Rank Structures for Robust Policy Learning under Uncertainty
链接: https://arxiv.org/abs/2510.11899
作者: Chenliang Li,Junyu Leng,Jiaxiang Li,Youbang Sun,Shixiang Chen,Shahin Shahrampour,Alfredo Garcia
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Robust reinforcement learning (Robust RL) seeks to handle epistemic uncertainty in environment dynamics, but existing approaches often rely on nested min–max optimization, which is computationally expensive and yields overly conservative policies. We propose Adaptive Rank Representation (AdaRL), a bi-level optimization framework that improves robustness by aligning policy complexity with the intrinsic dimension of the task. At the lower level, AdaRL performs policy optimization under fixed-rank constraints with dynamics sampled from a Wasserstein ball around a centroid model. At the upper level, it adaptively adjusts the rank to balance the bias–variance trade-off, projecting policy parameters onto a low-rank manifold. This design avoids solving adversarial worst-case dynamics while ensuring robustness without over-parameterization. Empirical results on MuJoCo continuous control benchmarks demonstrate that AdaRL not only consistently outperforms fixed-rank baselines (e.g., SAC) and state-of-the-art robust RL methods (e.g., RNAC, Parseval), but also converges toward the intrinsic rank of the underlying tasks. These results highlight that adaptive low-rank policy representations provide an efficient and principled alternative for robust RL under model uncertainty.
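The upper-level projection onto a low-rank manifold amounts to a truncated SVD in the simplest case; below is a generic sketch under that assumption (the paper's rank-adaptation logic is omitted, and the shapes are illustrative).

```python
import numpy as np

def project_low_rank(W, rank):
    """Project a policy weight matrix onto the set of rank-<=r matrices:
    the best rank-r approximation in Frobenius norm, via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 17))          # e.g. a linear policy head
W_proj = project_low_rank(W, rank=5)
print(np.linalg.matrix_rank(W_proj))   # 5
```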
[LG-64] Robust Adversarial Reinforcement Learning in Stochastic Games via Sequence Modeling NEURIPS2025
链接: https://arxiv.org/abs/2510.11877
作者: Xiaohang Tang,Zhuowen Cheng,Satyabrat Kumar
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注: Accepted by Reliable ML Workshop @ NeurIPS 2025
Abstract:The Transformer, a highly expressive architecture for sequence modeling, has recently been adapted to solve sequential decision-making, most notably through the Decision Transformer (DT), which learns policies by conditioning on desired returns. Yet, the adversarial robustness of reinforcement learning methods based on sequence modeling remains largely unexplored. Here we introduce the Conservative Adversarially Robust Decision Transformer (CART), to our knowledge the first framework designed to enhance the robustness of DT in adversarial stochastic games. We formulate the interaction between the protagonist and the adversary at each stage as a stage game, where the payoff is defined as the expected maximum value over subsequent states, thereby explicitly incorporating stochastic state transitions. By conditioning Transformer policies on the NashQ value derived from these stage games, CART generates policies that are simultaneously less exploitable (adversarially robust) and conservative to transition uncertainty. Empirically, CART achieves more accurate minimax value estimation and consistently attains superior worst-case returns across a range of adversarial stochastic games.
[LG-65] Improving Knowledge Graph Embeddings through Contrastive Learning with Negative Statements
链接: https://arxiv.org/abs/2510.11868
作者: Rita T. Sousa,Heiko Paulheim
类目: Machine Learning (cs.LG)
*备注: Accepted at the Thirteenth International Conference on Knowledge Capture (K-CAP 2025)
Abstract:Knowledge graphs represent information as structured triples and serve as the backbone for a wide range of applications, including question answering, link prediction, and recommendation systems. A prominent line of research for exploring knowledge graphs involves graph embedding methods, where entities and relations are represented in low-dimensional vector spaces that capture underlying semantics and structure. However, most existing methods rely on assumptions such as the Closed World Assumption or Local Closed World Assumption, treating missing triples as false. This contrasts with the Open World Assumption underlying many real-world knowledge graphs. Furthermore, while explicitly stated negative statements can help distinguish between false and unknown triples, they are rarely included in knowledge graphs and are often overlooked during embedding training. In this work, we introduce a novel approach that integrates explicitly declared negative statements into the knowledge embedding learning process. Our approach employs a dual-model architecture, where two embedding models are trained in parallel, one on positive statements and the other on negative statements. During training, each model generates negative samples by corrupting positive samples and selecting the most likely candidates as scored by the other model. The proposed approach is evaluated on both general-purpose and domain-specific knowledge graphs, with a focus on link prediction and triple classification tasks. The extensive experiments demonstrate that our approach improves predictive performance over state-of-the-art embedding models, demonstrating the value of integrating meaningful negative knowledge into embedding learning.
[LG-66] Actor-Enriched Time Series Forecasting of Process Performance
链接: https://arxiv.org/abs/2510.11856
作者: Aurelie Leribaux,Rafael Oyamada,Johannes De Smedt,Zahra Dasht Bozorgi,Artem Polyvyanyy,Jochen De Weerdt
类目: Machine Learning (cs.LG)
*备注: Accepted at ICPM 2025
Abstract:Predictive Process Monitoring (PPM) is a key task in Process Mining that aims to predict future behavior, outcomes, or performance indicators. Accurate prediction of the latter is critical for proactive decision-making. Given that processes are often resource-driven, understanding and incorporating actor behavior in forecasting is crucial. Although existing research has incorporated aspects of actor behavior, its role as a time-varying signal in PPM remains limited. This study investigates whether incorporating actor behavior information, modeled as time series, can improve the predictive performance of throughput time (TT) forecasting models. Using real-life event logs, we construct multivariate time series that include TT alongside actor-centric features, i.e., actor involvement, the frequency of continuation, interruption, and handover behaviors, and the duration of these behaviors. We train and compare several models to study the benefits of adding actor behavior. The results show that actor-enriched models consistently outperform baseline models, which only include TT features, in terms of RMSE, MAE, and $R^2$. These findings demonstrate that modeling actor behavior over time and incorporating this information into forecasting models enhances performance indicator predictions.
[LG-67] Evaluating Open-Source Vision-Language Models for Multimodal Sarcasm Detection ICDM
链接: https://arxiv.org/abs/2510.11852
作者: Saroj Basnet,Shafkat Farabi,Tharindu Ranasinghe,Diptesh Kanoji,Marcos Zampieri
类目: Machine Learning (cs.LG)
*备注: Accepted to ICDMW 2025 Workshop on Multimodal AI (MMAI). Full workshop info: this https URL
Abstract:Recent advances in open-source vision-language models (VLMs) offer new opportunities for understanding complex and subjective multimodal phenomena such as sarcasm. In this work, we evaluate seven state-of-the-art VLMs - BLIP2, InstructBLIP, OpenFlamingo, LLaVA, PaliGemma, Gemma3, and Qwen-VL - on their ability to detect multimodal sarcasm using zero-, one-, and few-shot prompting. Furthermore, we assess the models’ ability to generate explanations for sarcastic instances. We evaluate the VLMs on three benchmark sarcasm datasets (Muse, MMSD2.0, and SarcNet). Our primary objectives are twofold: (1) to quantify each model’s performance in detecting sarcastic image-caption pairs, and (2) to assess their ability to generate human-quality explanations that highlight the visual-textual incongruities driving sarcasm. Our results indicate that, while current models achieve moderate success in binary sarcasm detection, they are still not able to generate high-quality explanations without task-specific finetuning.
[LG-68] WaveletDiff: Multilevel Wavelet Diffusion For Time Series Generation
链接: https://arxiv.org/abs/2510.11839
作者: Yu-Hsiang Wang,Olgica Milenkovic
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time series are ubiquitous in many applications that involve forecasting, classification and causal inference tasks, such as healthcare, finance, audio signal processing and climate sciences. Still, large, high-quality time series datasets remain scarce. Synthetic generation can address this limitation; however, current models confined either to the time or frequency domains struggle to reproduce the inherently multi-scaled structure of real-world time series. We introduce WaveletDiff, a novel framework that trains diffusion models directly on wavelet coefficients to exploit the inherent multi-resolution structure of time series data. The model combines dedicated transformers for each decomposition level with cross-level attention mechanisms that enable selective information exchange between temporal and frequency scales through adaptive gating. It also incorporates energy preservation constraints for individual levels based on Parseval’s theorem to preserve spectral fidelity throughout the diffusion process. Comprehensive tests across six real-world datasets from energy, finance, and neuroscience domains demonstrate that WaveletDiff consistently outperforms state-of-the-art time-domain and frequency-domain generative methods on both short and long time series across five diverse performance metrics. For example, WaveletDiff achieves discriminative scores and Context-FID scores that are $3\times$ smaller on average than the second-best baseline across all datasets.
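The multilevel wavelet view underlying WaveletDiff is easy to reproduce with PyWavelets. The sketch below, which is illustrative and not the paper's code, decomposes a toy series with an orthogonal wavelet and checks the Parseval-style energy identity that the paper's per-level constraints preserve.

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=1024))   # toy time series (random walk)

# Multilevel decomposition with an orthogonal wavelet (Daubechies-4).
# "periodization" mode keeps the transform exactly orthogonal on dyadic lengths.
coeffs = pywt.wavedec(x, "db4", mode="periodization", level=4)
# coeffs = [cA4, cD4, cD3, cD2, cD1]: approximation + detail levels

# Parseval: for an orthogonal transform, total energy is preserved.
energy_time = np.sum(x ** 2)
energy_wavelet = sum(np.sum(c ** 2) for c in coeffs)
print(np.isclose(energy_time, energy_wavelet))   # True

# Per-level energies: the quantities an energy-preservation constraint targets.
for name, c in zip(["cA4", "cD4", "cD3", "cD2", "cD1"], coeffs):
    print(name, float(np.sum(c ** 2)))
```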
[LG-69] Z0-Inf: Zeroth Order Approximation for Data Influence
链接: https://arxiv.org/abs/2510.11832
作者: Narine Kokhlikyan,Kamalika Chaudhuri,Saeed Mahloujifar
类目: Machine Learning (cs.LG)
*备注:
Abstract:A critical aspect of analyzing and improving modern machine learning systems lies in understanding how individual training examples influence a model’s predictive behavior. Estimating this influence enables critical applications, including data selection and model debugging; in particular, self-influence, which quantifies the influence of a training point on itself, has found many uses in data quality assessment and outlier detection. Existing methods for measuring data influence, however, are often impractical for large models due to low accuracy or prohibitive computational costs: most approaches either provide poor approximations or rely on gradients and inverse-Hessian computations that remain challenging to scale. In this work, we introduce a highly efficient zeroth-order approximation for estimating the influence of training data that requires only a fraction of the time and memory footprint of prior methods. Notably, our method relies solely on loss values of intermediate checkpoints on the training and test data, along with the checkpoints themselves, making it broadly applicable even when the loss function of interest is non-differentiable. Beyond its computational efficiency, our approach achieves superior accuracy in estimating self-influence and comparable or improved accuracy in estimating train-test influence for fine-tuned large language models, enabling scalable and practical analysis of how training data shapes model behavior.
[LG-70] Schrödinger bridge for generative AI: Soft-constrained formulation and convergence analysis
链接: https://arxiv.org/abs/2510.11829
作者: Jin Ma,Ying Tan,Renyuan Xu
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Optimization and Control (math.OC); Mathematical Finance (q-fin.MF)
*备注: 31 pages
Abstract:Generative AI can be framed as the problem of learning a model that maps simple reference measures into complex data distributions, and it has recently found a strong connection to the classical theory of the Schrödinger bridge problems (SBPs) due partly to their common nature of interpolating between prescribed marginals via entropy-regularized stochastic dynamics. However, the classical SBP enforces hard terminal constraints, which often leads to instability in practical implementations, especially in high-dimensional or data-scarce regimes. To address this challenge, we follow the idea of the so-called soft-constrained Schrödinger bridge problem (SCSBP), in which the terminal constraint is replaced by a general penalty function. This relaxation leads to a more flexible stochastic control formulation of McKean-Vlasov type. We establish the existence of optimal solutions for all penalty levels and prove that, as the penalty grows, both the controls and value functions converge to those of the classical SBP at a linear rate. Our analysis builds on Doob’s h-transform representations, the stability results of Schrödinger potentials, Gamma-convergence, and a novel fixed-point argument that couples an optimization problem over the space of measures with an auxiliary entropic optimal transport problem. These results not only provide the first quantitative convergence guarantees for soft-constrained bridges but also shed light on how penalty regularization enables robust generative modeling, fine-tuning, and transfer learning.
[LG-71] Think as a Doctor: An Interpretable AI Approach for ICU Mortality Prediction
链接: https://arxiv.org/abs/2510.11745
作者: Qingwen Li,Xiaohang Zhao,Xiao Han,Hailiang Huang,Lanjuan Liu
类目: Machine Learning (cs.LG)
*备注: 42 pages
Abstract:Intensive Care Unit (ICU) mortality prediction, which estimates a patient’s mortality status at discharge using EHRs collected early in an ICU admission, is vital in critical care. For this task, predictive accuracy alone is insufficient; interpretability is equally essential for building clinical trust and meeting regulatory standards, a topic that has attracted significant attention in information system research. Accordingly, an ideal solution should enable intrinsic interpretability and align its reasoning with three key elements of the ICU decision-making practices: clinical course identification, demographic heterogeneity, and prognostication awareness. However, conventional approaches largely focus on demographic heterogeneity, overlooking clinical course identification and prognostication awareness. Recent prototype learning methods address clinical course identification, yet the integration of the other elements into such frameworks remains underexplored. To address these gaps, we propose ProtoDoctor, a novel ICU mortality prediction framework that delivers intrinsic interpretability while integrating all three elements of the ICU decision-making practices into its reasoning process. Methodologically, ProtoDoctor features two key innovations: the Prognostic Clinical Course Identification module and the Demographic Heterogeneity Recognition module. The former enables the identification of clinical courses via prototype learning and achieves prognostication awareness using a novel regularization mechanism. The latter models demographic heterogeneity through cohort-specific prototypes and risk adjustments. Extensive empirical evaluations demonstrate that ProtoDoctor outperforms state-of-the-art baselines in predictive accuracy. Human evaluations further confirm that its interpretations are more clinically meaningful, trustworthy, and applicable in ICU practice.
[LG-72] Multi-objective Bayesian Optimization with Human-in-the-Loop for Flexible Neuromorphic Electronics Fabrication
链接: https://arxiv.org/abs/2510.11727
作者: Benius Dunn,Javier Meza-Arroyo,Armi Tiihonen,Mark Lee,Julia W. P. Hsu
类目: Emerging Technologies (cs.ET); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
Abstract:Neuromorphic computing hardware enables edge computing and can be implemented in flexible electronics for novel applications. Metal oxide materials are promising candidates for fabricating flexible neuromorphic electronics, but suffer from processing constraints due to the incompatibilities between oxides and polymer substrates. In this work, we use photonic curing to fabricate flexible metal-insulator-metal capacitors with solution-processible aluminum oxide dielectric tailored for neuromorphic applications. Because photonic curing outcomes depend on many input parameters, identifying an optimal processing condition through a traditional grid-search approach is unfeasible. Here, we apply multi-objective Bayesian optimization (MOBO) to determine photonic curing conditions that optimize the trade-off between desired electrical properties of large capacitance-frequency dispersion and low leakage current. Furthermore, we develop a human-in-the-loop (HITL) framework for incorporating failed experiments into the MOBO machine learning workflow, demonstrating that this framework accelerates optimization by reducing the number of experimental rounds required. Once optimization is concluded, we analyze different Pareto-optimal conditions to tune the dielectrics properties and provide insight into the importance of different inputs through Shapley Additive exPlanations analysis. The demonstrated framework of combining MOBO with HITL feedback can be adapted to a wide range of multi-objective experimental problems that have interconnected inputs and high experimental failure rates to generate usable results for machine learning models.
[LG-73] Replicable Learning of Large-Margin Halfspaces ICML2024
链接: https://arxiv.org/abs/2402.13857
作者: Alkis Kalavasis,Amin Karbasi,Kasper Green Larsen,Grigoris Velegkas,Felix Zhou
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注: to be published in ICML 2024
Abstract:We provide efficient replicable algorithms for the problem of learning large-margin halfspaces. Our results improve upon the algorithms provided by Impagliazzo, Lei, Pitassi, and Sorrell [STOC, 2022]. We design the first dimension-independent replicable algorithms for this task which run in polynomial time, are proper, and have strictly improved sample complexity compared to the one achieved by Impagliazzo et al. [2022] with respect to all the relevant parameters. Moreover, our first algorithm has sample complexity that is optimal with respect to the accuracy parameter $\epsilon$. We also design an SGD-based replicable algorithm that, in some parameters’ regimes, achieves better sample and time complexity than our first algorithm. Departing from the requirement of polynomial time algorithms, using the DP-to-Replicability reduction of Bun, Gaboardi, Hopkins, Impagliazzo, Lei, Pitassi, Sorrell, and Sivakumar [STOC, 2023], we show how to obtain a replicable algorithm for large-margin halfspaces with improved sample complexity with respect to the margin parameter $\tau$, but running time doubly exponential in $1/\tau^2$ and worse sample complexity dependence on $\epsilon$ than one of our previous algorithms. We then design an improved algorithm with better sample complexity than all three of our previous algorithms and running time exponential in $1/\tau^2$.
[LG-74] Wavefront Coding for Accommodation-Invariant Near-Eye Displays
链接: https://arxiv.org/abs/2510.12778
作者: Ugur Akpinar,Erdem Sahin,Tina M. Hayward,Apratim Majumder,Rajesh Menon,Atanas Gotchev
类目: Optics (physics.optics); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:
Abstract:We present a new computational near-eye display method that addresses the vergence-accommodation conflict problem in stereoscopic displays through accommodation-invariance. Our system integrates a refractive lens eyepiece with a novel wavefront coding diffractive optical element, operating in tandem with a pre-processing convolutional neural network. We employ end-to-end learning to jointly optimize the wavefront-coding optics and the image pre-processing module. To implement this approach, we develop a differentiable retinal image formation model that accounts for limiting aperture and chromatic aberrations introduced by the eye optics. We further integrate the neural transfer function and the contrast sensitivity function into the loss model to account for related perceptual effects. To tackle off-axis distortions, we incorporate position dependency into the pre-processing module. In addition to conducting rigorous analysis based on simulations, we also fabricate the designed diffractive optical element and build a benchtop setup, demonstrating accommodation-invariance for depth ranges of up to four diopters.
[LG-75] Dendrograms of Mixing Measures for Softmax-Gated Gaussian Mixture of Experts: Consistency without Model Sweeps
链接: https://arxiv.org/abs/2510.12744
作者: Do Tien Hai,Trung Nguyen Mai,TrungTin Nguyen,Nhat Ho,Binh T. Nguyen,Christopher Drovandi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO); Methodology (stat.ME)
*备注: Do Tien Hai, Trung Nguyen Mai, and TrungTin Nguyen are co-first authors
Abstract:We develop a unified statistical framework for softmax-gated Gaussian mixture of experts (SGMoE) that addresses three long-standing obstacles in parameter estimation and model selection: (i) non-identifiability of gating parameters up to common translations, (ii) intrinsic gate-expert interactions that induce coupled differential relations in the likelihood, and (iii) the tight numerator-denominator coupling in the softmax-induced conditional density. Our approach introduces Voronoi-type loss functions aligned with the gate-partition geometry and establishes finite-sample convergence rates for the maximum likelihood estimator (MLE). In over-specified models, we reveal a link between the MLE’s convergence rate and the solvability of an associated system of polynomial equations characterizing near-nonidentifiable directions. For model selection, we adapt dendrograms of mixing measures to SGMoE, yielding a consistent, sweep-free selector of the number of experts that attains pointwise-optimal parameter rates under overfitting while avoiding multi-size training. Simulations on synthetic data corroborate the theory, accurately recovering the expert count and achieving the predicted rates for parameter estimation while closely approximating the regression function. Under model misspecification (e.g., $\epsilon$-contamination), the dendrogram selection criterion is robust, recovering the true number of mixture components, while the Akaike information criterion, the Bayesian information criterion, and the integrated completed likelihood tend to overselect as sample size grows. On a maize proteomics dataset of drought-responsive traits, our dendrogram-guided SGMoE selects two experts, exposes a clear mixing-measure hierarchy, stabilizes the likelihood early, and yields interpretable genotype-phenotype maps, outperforming standard criteria without multi-size training.
[LG-76] Contraction and entropy production in continuous-time Sinkhorn dynamics
链接: https://arxiv.org/abs/2510.12639
作者: Anand Srinivasan,Jean-Jacques Slotine
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR)
*备注: 10 pages excluding references
Abstract:Recently, the vanishing-step-size limit of the Sinkhorn algorithm at finite regularization parameter $\varepsilon$ was shown to be a mirror descent in the space of probability measures. We give $L^2$ contraction criteria in two time-dependent metrics induced by the mirror Hessian, which reduce to the coercivity of certain conditional expectation operators. We then give an exact identity for the entropy production rate of the Sinkhorn flow, which was previously known only to be nonpositive. Examining this rate shows that the standard semigroup analysis of diffusion processes extends systematically to the Sinkhorn flow. We show that the flow induces a reversible Markov dynamics on the target marginal as an Onsager gradient flow. We define the Dirichlet form associated to its (nonlocal) infinitesimal generator, prove a Poincaré inequality for it, and show that the spectral gap is strictly positive along the Sinkhorn flow whenever $\varepsilon > 0$. Lastly, we show that the entropy decay is exponential if and only if a logarithmic Sobolev inequality (LSI) holds. We give for illustration two immediate practical use-cases for the Sinkhorn LSI: as a design principle for the latent space in which generative models are trained, and as a stopping heuristic for discrete-time algorithms.
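For reference, here is a textbook implementation of the discrete-time Sinkhorn algorithm whose vanishing-step-size limit the paper studies, with $\varepsilon$ the entropic regularization; the histograms and cost below are illustrative.

```python
import numpy as np

def sinkhorn(C, mu, nu, eps, n_iters=500):
    """Entropy-regularized OT between histograms mu and nu with cost C.
    Alternating marginal scalings (the Sinkhorn algorithm); returns the plan."""
    K = np.exp(-C / eps)              # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (K.T @ u)            # scale columns to match the 2nd marginal
        u = mu / (K @ v)              # scale rows to match the 1st marginal
    return u[:, None] * K * v[None, :]

n = 50
x = np.linspace(0, 1, n)
C = (x[:, None] - x[None, :]) ** 2    # squared-distance cost
mu = np.ones(n) / n                   # uniform source histogram
nu = np.exp(-(x - 0.7) ** 2 / 0.01)   # peaked target histogram
nu /= nu.sum()
P = sinkhorn(C, mu, nu, eps=1e-2)
print(P.sum(axis=1)[:3], mu[:3])      # row marginals ≈ mu after convergence
```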
[LG-77] Adapting Noise to Data: Generative Flows from 1D Processes
链接: https://arxiv.org/abs/2510.12636
作者: Jannis Chemseddine,Gregor Kornhardt,Richard Duong,Gabriele Steidl
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Analysis of PDEs (math.AP)
*备注:
Abstract:We introduce a general framework for constructing generative models using one-dimensional noising processes. Beyond diffusion processes, we outline examples that demonstrate the flexibility of our approach. Motivated by this, we propose a novel framework in which the 1D processes themselves are learnable, achieved by parameterizing the noise distribution through quantile functions that adapt to the data. Our construction integrates seamlessly with standard objectives, including Flow Matching and consistency models. Learning quantile-based noise naturally captures heavy tails and compact supports when present. Numerical experiments highlight both the flexibility and the effectiveness of our method.
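The key mechanism, drawing noise through a quantile function via inverse-transform sampling, fits in a few lines. In the paper the quantile function is learned; here it is a fixed monotone piecewise-linear interpolant whose knot values are purely illustrative (chosen to give a heavy right tail).

```python
import numpy as np

# Knots of a monotone piecewise-linear quantile function Q: (0,1) -> R.
# In the paper's setting these would be learned; fixed here for illustration.
levels = np.array([0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99])
values = np.array([-1.5, -0.8, -0.3, 0.0, 0.4, 1.2, 4.0])  # nondecreasing

def sample_noise(n, rng):
    """Inverse-transform sampling: u ~ Uniform, noise = Q(u)."""
    u = rng.uniform(levels[0], levels[-1], size=n)  # stay within knot range
    return np.interp(u, levels, values)             # piecewise-linear Q

rng = np.random.default_rng(0)
z = sample_noise(10_000, rng)
print(z.mean(), z.max())  # skewed samples: the tail knot at 4.0 shows up
```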
[LG-78] Same model, better performance: the impact of shuffling on DNA Language Models benchmarking
链接: https://arxiv.org/abs/2510.12617
作者: Davide Greco,Konrad Rawlik
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注:
Abstract:Large Language Models are increasingly popular in genomics due to their potential to decode complex biological sequences. Hence, researchers require a standardized benchmark to evaluate the capabilities of DNA Language Models (DNA LMs). However, evaluating DNA LMs is a complex task that intersects genomics’ domain-specific challenges and machine learning methodologies, where seemingly minor implementation details can significantly compromise benchmark validity. We demonstrate this through BEND (Benchmarking DNA Language Models), where hardware-dependent hyperparameters – number of data loading workers and buffer sizes – create spurious performance variations of up to 4% for identical models. The problem stems from inadequate data shuffling interacting with domain-specific data characteristics. Experiments with three DNA language models (HyenaDNA, DNABERT-2, ResNet-LM) show these artifacts affect both absolute performance and relative model rankings. We propose a simple solution: pre-shuffling data before storage eliminates hardware dependencies while maintaining efficiency. This work highlights how standard ML practices can interact unexpectedly with domain-specific data characteristics, with broader implications for benchmark design in specialized domains.
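The proposed fix is simple enough to show directly: apply one global permutation before writing the dataset to disk, so that downstream loader settings (worker counts, shuffle-buffer sizes) can no longer interact with the genomic ordering. A minimal sketch; the arrays and file name are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(42)       # fixed seed: reproducible shuffle

# Toy stand-ins for encoded sequences and labels, ordered by chromosome.
sequences = np.arange(100_000)        # placeholder for encoded sequences
labels = np.arange(100_000) % 2

# One global permutation applied BEFORE storage, so any buffered or sharded
# loader downstream sees data whose order carries no genomic structure.
perm = rng.permutation(len(sequences))
np.savez("dataset_preshuffled.npz",   # illustrative file name
         sequences=sequences[perm], labels=labels[perm])
```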
[LG-79] Universal Adaptive Environment Discovery
链接: https://arxiv.org/abs/2510.12547
作者: Madi Matymov,Ba-Hien Tran,Maurizio Filippone
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 8 pages in the main body, 4 pages in the appendix, 4 figures and 9 tables overall, conference
Abstract:An open problem in Machine Learning is how to prevent models from exploiting spurious correlations in the data; a famous example is the background-label shortcut in the Waterbirds dataset. A common remedy is to train a model across multiple environments; in the Waterbirds dataset, this corresponds to training by randomizing the background. However, selecting the right environments is a challenging problem, given that these are rarely known a priori. We propose Universal Adaptive Environment Discovery (UAED), a unified framework that learns a distribution over data transformations that instantiate environments, and optimizes any robust objective averaged over this learned distribution. UAED yields adaptive variants of IRM, REx, GroupDRO, and CORAL without predefined groups or manual environment design. We provide a theoretical analysis by providing PAC-Bayes bounds and by showing robustness to test environment distributions under standard conditions. Empirically, UAED discovers interpretable environment distributions and improves worst-case accuracy on standard benchmarks, while remaining competitive on mean accuracy. Our results indicate that making environments adaptive is a practical route to out-of-distribution generalization.
[LG-80] Neural Guided Sampling for Quantum Circuit Optimization
链接: https://arxiv.org/abs/2510.12430
作者: Bodo Rosenhahn,Tobias J. Osborne,Christoph Hirche
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 12 pages, 9 Figures
Abstract:Translating a general quantum circuit on a specific hardware topology with a reduced set of available gates, also known as transpilation, comes with a substantial increase in the length of the equivalent circuit. Due to decoherence, the quality of the computational outcome can degrade seriously with increasing circuit length. Thus, there is major interest in reducing a quantum circuit to an equivalent circuit whose gate count is as small as possible. One method to address efficient transpilation is based on approaches known from stochastic optimization, e.g. using random sampling and token replacement strategies. Here, a core challenge is that these methods can suffer from poor sampling efficiency, causing long and energy-consuming optimization times. As a remedy, we propose 2D neural guided sampling in this work. Given a 2D representation of a quantum circuit, a neural network predicts groups of gates that are likely reducible. This yields a sampling prior that can heavily reduce the compute time for quantum circuit reduction. In several experiments, we demonstrate that our method is superior to results obtained from different qiskit or BQSKit optimization levels.
[LG-81] Geopolitics, Geoeconomics and Risk: A Machine Learning Approach
链接: https://arxiv.org/abs/2510.12416
作者: Alvaro Ortiz,Tomasa Rodrigo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We introduce a novel high-frequency daily panel dataset of both markets and news-based indicators – including Geopolitical Risk, Economic Policy Uncertainty, Trade Policy Uncertainty, and Political Sentiment – for 42 countries across both emerging and developed markets. Using this dataset, we study how sentiment dynamics shape sovereign risk, measured by Credit Default Swap (CDS) spreads, and evaluate their forecasting value relative to traditional drivers such as global monetary policy and market volatility. Our horse-race analysis of forecasting models demonstrates that incorporating news-based indicators significantly enhances predictive accuracy and enriches the analysis, with non-linear machine learning methods – particularly Random Forests – delivering the largest gains. Our analysis reveals that while global financial variables remain the dominant drivers of sovereign risk, geopolitical risk and economic policy uncertainty also play a meaningful role. Crucially, their effects are amplified through non-linear interactions with global financial conditions. Finally, we document pronounced regional heterogeneity, as certain asset classes and emerging markets exhibit heightened sensitivity to shocks in policy rates, global financial volatility, and geopolitical risk.
[LG-82] Improved Central Limit Theorem and Bootstrap Approximations for Linear Stochastic Approximation
链接: https://arxiv.org/abs/2510.12375
作者: Bogdan Butyrin,Eric Moulines,Alexey Naumov,Sergey Samsonov,Qi-Man Shao,Zhuo-Song Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Statistics Theory (math.ST)
*备注:
Abstract:In this paper, we refine the Berry-Esseen bounds for the multivariate normal approximation of Polyak-Ruppert averaged iterates arising from the linear stochastic approximation (LSA) algorithm with decreasing step size. We consider the normal approximation by the Gaussian distribution with covariance matrix predicted by the Polyak-Juditsky central limit theorem and establish the rate up to order $n^{-1/3}$ in convex distance, where $n$ is the number of samples used in the algorithm. We also prove a non-asymptotic validity of the multiplier bootstrap procedure for approximating the distribution of the rescaled error of the averaged LSA estimator. We establish approximation rates of order up to $1/\sqrt{n}$ for the latter distribution, which significantly improves upon the previous results obtained by Samsonov et al. (2024).
[LG-83] Learning Latent Energy-Based Models via Interacting Particle Langevin Dynamics
链接: https://arxiv.org/abs/2510.12311
作者: Joanna Marks,Tim Y. J. Wang,O. Deniz Akyildiz
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:
Abstract:We develop interacting particle algorithms for learning latent variable models with energy-based priors. To do so, we leverage recent developments in particle-based methods for solving maximum marginal likelihood estimation (MMLE) problems. Specifically, we provide a continuous-time framework for learning latent energy-based models, by defining stochastic differential equations (SDEs) that provably solve the MMLE problem. We obtain a practical algorithm as a discretisation of these SDEs and provide theoretical guarantees for the convergence of the proposed algorithm. Finally, we demonstrate the empirical effectiveness of our method on synthetic and image datasets.
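Below is a discretized sketch of an interacting-particle Langevin scheme for maximum marginal likelihood estimation, in the spirit of the abstract; the latent-variable model is a toy Gaussian chosen so the MLE is known in closed form, and the particle count and step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent-variable model (illustrative, not from the paper):
#   x ~ N(theta, 1),  y | x ~ N(x, 1)  =>  the marginal MLE is theta* = y.
y = 2.0

def grad_x(x, theta):      # d/dx log p(x, y; theta)
    return (y - x) + (theta - x)

def grad_theta(x, theta):  # d/dtheta log p(x, y; theta)
    return x - theta

N, gamma, steps = 100, 0.01, 5000
X = rng.normal(size=N)     # particle cloud in the latent space
theta = -3.0               # parameter initialized far from the MLE

for _ in range(steps):
    # Parameter update driven by the particle average (plus small noise).
    theta += gamma * grad_theta(X, theta).mean() \
             + np.sqrt(2 * gamma / N) * rng.normal()
    # Langevin update of each particle, coupled through the current theta.
    X += gamma * grad_x(X, theta) + np.sqrt(2 * gamma) * rng.normal(size=N)

print(theta)   # ≈ 2.0, the marginal maximum-likelihood estimate
```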
[LG-84] The Living Forecast: Evolving Day-Ahead Predictions into Intraday Reality
链接: https://arxiv.org/abs/2510.12271
作者: Kutay Bölat,Peter Palensky,Simon Tindemans
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注:
Abstract:Accurate intraday forecasts are essential for power system operations, complementing day-ahead forecasts that gradually lose relevance as new information becomes available. This paper introduces a Bayesian updating mechanism that converts fully probabilistic day-ahead forecasts into intraday forecasts without retraining or re-inference. The approach conditions the Gaussian mixture output of a conditional variational autoencoder-based forecaster on observed measurements, yielding an updated distribution for the remaining horizon that preserves its probabilistic structure. This enables consistent point, quantile, and ensemble forecasts while remaining computationally efficient and suitable for real-time applications. Experiments on household electricity consumption and photovoltaic generation datasets demonstrate that the proposed method improves forecast accuracy up to 25% across likelihood-, sample-, quantile-, and point-based metrics. The largest gains occur in time steps with strong temporal correlation to observed data, and the use of pattern dictionary-based covariance structures further enhances performance. The results highlight a theoretically grounded framework for intraday forecasting in modern power systems.
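The Bayesian update described here amounts to conditioning each Gaussian mixture component on the dimensions already observed and reweighting components by their likelihood of the observation. A self-contained sketch for a two-step horizon with the first step observed (standard Gaussian conditioning; all numbers are illustrative, and this is not the authors' code).

```python
import numpy as np
from scipy.stats import norm

# Day-ahead forecast over a 2-step horizon as a 2-component Gaussian mixture
# (weights w, means mu, covariances S). Values are illustrative.
w = np.array([0.6, 0.4])
mu = np.array([[1.0, 2.0], [3.0, 1.0]])
S = np.array([[[0.5, 0.3], [0.3, 0.6]],
              [[0.4, -0.2], [-0.2, 0.5]]])

y1_obs = 2.2  # intraday: the first step has now been measured

# Per component: condition the Gaussian on y1 = y1_obs ...
cond_mean = np.array([m[1] + C[1, 0] / C[0, 0] * (y1_obs - m[0])
                      for m, C in zip(mu, S)])
cond_var = np.array([C[1, 1] - C[1, 0] ** 2 / C[0, 0] for C in S])

# ... and reweight each component by its likelihood of the observation.
lik = np.array([norm.pdf(y1_obs, m[0], np.sqrt(C[0, 0]))
                for m, C in zip(mu, S)])
w_post = w * lik
w_post /= w_post.sum()

# The updated intraday forecast for the remaining step is again a mixture,
# so quantiles and ensembles remain available; its mean is a point forecast.
print(w_post, cond_mean, cond_var)
print(w_post @ cond_mean)
```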
[LG-85] A Gradient Guided Diffusion Framework for Chance Constrained Programming
链接: https://arxiv.org/abs/2510.12238
作者: Boyang Zhang,Zhiguo Wang,Ya-Feng Liu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:Chance constrained programming (CCP) is a powerful framework for addressing optimization problems under uncertainty. In this paper, we introduce a novel Gradient-Guided Diffusion-based Optimization framework, termed GGDOpt, which tackles CCP through three key innovations. First, GGDOpt accommodates a broad class of CCP problems without requiring knowledge of the exact distribution of uncertainty, relying solely on a set of samples. Second, to address the nonconvexity of the chance constraints, it reformulates the CCP as a sampling problem over the product of two distributions: an unknown data distribution supported on a nonconvex set and a Boltzmann distribution defined by the objective function, which fully leverages both first- and second-order gradient information. Third, GGDOpt has theoretical convergence guarantees and provides practical error bounds under mild assumptions. By progressively injecting noise during the forward diffusion process to convexify the nonconvex feasible region, GGDOpt enables guided reverse sampling to generate asymptotically optimal solutions. Experimental results on synthetic datasets and a waveform design task in wireless communications demonstrate that GGDOpt outperforms existing methods in both solution quality and stability with nearly 80% overhead reduction.
[LG-86] Learning Mean-Field Games through Mean-Field Actor-Critic Flow
链接: https://arxiv.org/abs/2510.12180
作者: Mo Zhou,Haosheng Zhou,Ruimeng Hu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:We propose the Mean-Field Actor-Critic (MFAC) flow, a continuous-time learning dynamics for solving mean-field games (MFGs), combining techniques from reinforcement learning and optimal transport. The MFAC framework jointly evolves the control (actor), value function (critic), and distribution components through coupled gradient-based updates governed by partial differential equations (PDEs). A central innovation is the Optimal Transport Geodesic Picard (OTGP) flow, which drives the distribution toward equilibrium along Wasserstein-2 geodesics. We conduct a rigorous convergence analysis using Lyapunov functionals and establish global exponential convergence of the MFAC flow under a suitable timescale. Our results highlight the algorithmic interplay among actor, critic, and distribution components. Numerical experiments illustrate the theoretical findings and demonstrate the effectiveness of the MFAC framework in computing MFG equilibria.
[LG-87] Follow-the-Perturbed-Leader for Decoupled Bandits: Best-of-Both-Worlds and Practicality
链接: https://arxiv.org/abs/2510.12152
作者: Chaiwon Kim,Jongyeong Lee,Min-hwan Oh
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Preprint, 29 pages
Abstract:We study the decoupled multi-armed bandit (MAB) problem, where the learner selects one arm for exploration and one arm for exploitation in each round. The loss of the explored arm is observed but not counted, while the loss of the exploited arm is incurred without being observed. We propose a policy within the Follow-the-Perturbed-Leader (FTPL) framework using Pareto perturbations. Our policy achieves (near-)optimal regret regardless of the environment, i.e., Best-of-Both-Worlds (BOBW): constant regret in the stochastic regime, improving upon the optimal bound of the standard MABs, and minimax optimal regret in the adversarial regime. Moreover, the practicality of our policy stems from avoiding both the convex optimization step required by the previous BOBW policy, Decoupled-Tsallis-INF (Rouyer & Seldin, 2020), and the resampling step that is typically necessary in FTPL. Consequently, it achieves substantial computational improvement, about 20 times faster than Decoupled-Tsallis-INF, while also demonstrating better empirical performance in both regimes. Finally, we empirically show that our approach outperforms a pure exploration policy, and that naively combining a pure exploration with a standard exploitation policy is suboptimal.
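A sketch of the decoupled FTPL template with Pareto perturbations: a perturbed leader is explored (its loss observed but not paid) and an unperturbed leader is exploited (its loss paid but not seen). Note that the selection probability below is approximated by Monte Carlo resampling purely for illustration; avoiding exactly this step is part of the paper's contribution, and the exploit rule, Pareto shape, and scale schedule are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, M = 5, 5000, 256
true_means = np.array([0.50, 0.45, 0.40, 0.60, 0.55])   # Bernoulli losses
alpha = 2.0                                              # Pareto shape

L_hat = np.zeros(K)        # cumulative importance-weighted loss estimates

def perturbed_leader(L, eta):
    z = rng.pareto(alpha, size=K) + 1.0   # Pareto(alpha) perturbations >= 1
    return int(np.argmin(L - eta * z))    # follow the perturbed leader

for t in range(1, T + 1):
    eta = np.sqrt(t)                      # illustrative scale schedule
    explore = perturbed_leader(L_hat, eta)  # arm whose loss we observe
    exploit = int(np.argmin(L_hat))         # arm whose loss we incur
    loss = rng.binomial(1, true_means[explore])
    # Importance weighting needs P(explore was chosen); estimated here by
    # Monte Carlo over fresh perturbations (the resampling step the paper
    # is designed to avoid -- used only to keep this sketch unbiased).
    draws = np.array([perturbed_leader(L_hat, eta) for _ in range(M)])
    p = max((draws == explore).mean(), 1.0 / M)
    L_hat[explore] += loss / p

print(exploit, int(true_means.argmin()))  # the exploited arm tracks the best
```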
[LG-88] Probabilistic Super-Resolution for Urban Micrometeorology via a Schrödinger Bridge
链接: https://arxiv.org/abs/2510.12148
作者: Yuki Yasuda,Ryo Onishi
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:
Abstract:This study employs a neural network that represents the solution to a Schrödinger bridge problem to perform super-resolution of 2-m temperature in an urban area. Schrödinger bridges generally describe transformations between two data distributions based on diffusion processes. We use a specific Schrödinger-bridge model (SM) that directly transforms low-resolution data into high-resolution data, unlike denoising diffusion probabilistic models (simply, diffusion models; DMs) that generate high-resolution data from Gaussian noise. Low-resolution and high-resolution data were obtained from separate numerical simulations with a physics-based model under common initial and boundary conditions. Compared with a DM, the SM attains comparable accuracy at one-fifth the computational cost, requiring 50 neural-network evaluations per datum for the DM and only 10 for the SM. Furthermore, high-resolution samples generated by the SM exhibit larger variance, implying superior uncertainty quantification relative to the DM. Owing to the reduced computational cost of the SM, our results suggest the feasibility of real-time ensemble micrometeorological prediction using SM-based super-resolution.
[LG-89] Compressibility Measures Complexity: Minimum Description Length Meets Singular Learning Theory
链接: https://arxiv.org/abs/2510.12077
作者: Einar Urdshals,Edmund Lau,Jesse Hoogland,Stan van Wingerden,Daniel Murfet
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 33 pages, 21 figures
Abstract:We study neural network compressibility by using singular learning theory to extend the minimum description length (MDL) principle to singular models like neural networks. Through extensive experiments on the Pythia suite with quantization, factorization, and other compression techniques, we find that complexity estimates based on the local learning coefficient (LLC) are closely, and in some cases, linearly correlated with compressibility. Our results provide a path toward rigorously evaluating the limits of model compression.
[LG-90] Statistical Guarantees for High-Dimensional Stochastic Gradient Descent NEURIPS2025
链接: https://arxiv.org/abs/2510.12013
作者: Jiaqi Li,Zhipeng Lou,Johannes Schmidt-Hieber,Wei Biao Wu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted to NeurIPS 2025
Abstract:Stochastic Gradient Descent (SGD) and its Ruppert-Polyak averaged variant (ASGD) lie at the heart of modern large-scale learning, yet their theoretical properties in high-dimensional settings are rarely understood. In this paper, we provide rigorous statistical guarantees for constant learning-rate SGD and ASGD in high-dimensional regimes. Our key innovation is to transfer powerful tools from high-dimensional time series to online learning. Specifically, by viewing SGD as a nonlinear autoregressive process and adapting existing coupling techniques, we prove the geometric-moment contraction of high-dimensional SGD for constant learning rates, thereby establishing asymptotic stationarity of the iterates. Building on this, we derive the $q$-th moment convergence of SGD and ASGD for any $q \ge 2$ in general $\ell^s$-norms, and, in particular, the $\ell^\infty$-norm that is frequently adopted in high-dimensional sparse or structured models. Furthermore, we provide sharp high-probability concentration analysis which entails the probabilistic bound of high-dimensional ASGD. Beyond closing a critical gap in SGD theory, our proposed framework offers a novel toolkit for analyzing a broad class of high-dimensional learning algorithms.
[LG-91] Enhancing Diffusion-Based Sampling with Molecular Collective Variables
链接: https://arxiv.org/abs/2510.11923
作者: Juno Nam,Bálint Máté,Artur P. Toshev,Manasa Kaniselvan,Rafael Gómez-Bombarelli,Ricky T. Q. Chen,Brandon Wood,Guan-Horng Liu,Benjamin Kurt Miller
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Diffusion-based samplers learn to sample complex, high-dimensional distributions using energies or log densities alone, without training data. Yet, they remain impractical for molecular sampling because they are often slower than molecular dynamics and miss thermodynamically relevant modes. Inspired by enhanced sampling, we encourage exploration by introducing a sequential bias along bespoke, information-rich, low-dimensional projections of atomic coordinates known as collective variables (CVs). We introduce a repulsive potential centered on the CVs from recent samples, which pushes future samples towards novel CV regions and effectively increases the temperature in the projected space. Our resulting method improves efficiency, mode discovery, enables the estimation of free energy differences, and retains independent sampling from the approximate Boltzmann distribution via reweighting by the bias. On standard peptide conformational sampling benchmarks, the method recovers diverse conformational states and accurate free energy profiles. We are the first to demonstrate reactive sampling using a diffusion-based sampler, capturing bond breaking and formation with universal interatomic potentials at near-first-principles accuracy. The approach resolves reactive energy landscapes at a fraction of the wall-clock time of standard sampling methods, advancing diffusion-based sampling towards practical use in molecular sciences.
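The sequential bias is reminiscent of metadynamics: deposit a repulsive Gaussian at the collective-variable (CV) value of each recent sample and add the summed potential to the energy before sampling, so new samples are pushed toward novel CV regions. A self-contained 1D sketch with a toy double-well CV; kernel heights, widths, and the deposit schedule are illustrative.

```python
import numpy as np

def base_energy(s):
    return (s ** 2 - 1.0) ** 2          # double well in the CV, minima at ±1

class CVBias:
    """Repulsive potential built from Gaussians centered on recent CV values."""
    def __init__(self, height=0.3, width=0.2, max_kernels=200):
        self.centers, self.h, self.w, self.max = [], height, width, max_kernels

    def add(self, s):
        self.centers.append(s)
        self.centers = self.centers[-self.max:]   # keep only recent samples

    def __call__(self, s):
        if not self.centers:
            return 0.0
        c = np.asarray(self.centers)
        return float(np.sum(self.h * np.exp(-(s - c) ** 2 / (2 * self.w ** 2))))

# Biased Metropolis sampling along the CV: the bias disfavors already-visited
# CV regions, so both wells are discovered quickly.
rng = np.random.default_rng(0)
bias, s, beta = CVBias(), -1.0, 3.0
visited = []
for step in range(5000):
    s_new = s + rng.normal(scale=0.25)
    dE = (base_energy(s_new) + bias(s_new)) - (base_energy(s) + bias(s))
    if rng.random() < np.exp(-beta * dE):
        s = s_new
    if step % 50 == 0:                  # deposit a repulsive kernel periodically
        bias.add(s)
    visited.append(s)
print(min(visited), max(visited))       # both wells near -1 and +1 are reached
```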
[LG-92] Simplifying Optimal Transport through Schatten-p Regularization
链接: https://arxiv.org/abs/2510.11910
作者: Tyler Maunu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 26 pages, 4 figures
Abstract:We propose a new general framework for recovering low-rank structure in optimal transport using Schatten-$p$ norm regularization. Our approach extends existing methods that promote sparse and interpretable transport maps or plans, while providing a unified and principled family of convex programs that encourage low-dimensional structure. The convexity of our formulation enables direct theoretical analysis: we derive optimality conditions and prove recovery guarantees for low-rank couplings and barycentric maps in simplified settings. To efficiently solve the proposed program, we develop a mirror descent algorithm with convergence guarantees for $p \geq 1$. Experiments on synthetic and real data demonstrate the method’s efficiency, scalability, and ability to recover low-rank transport structures.
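The regularizer itself is one line once singular values are available. Below is a sketch of the Schatten-$p$ norm and its use as a penalty on a transport plan; the plan, cost, and regularization weight are illustrative, and the paper's mirror descent solver is not shown.

```python
import numpy as np

def schatten_p(A, p):
    """Schatten-p norm: the l_p norm of the singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return np.sum(s ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
P = rng.random((30, 30))
P /= P.sum()                 # a (dense) transport plan
cost = rng.random((30, 30))
lam = 0.1                    # illustrative regularization weight

# Regularized transport objective: <C, P> + lam * ||P||_{S_p}.
objective = np.sum(cost * P) + lam * schatten_p(P, p=1.0)  # p=1: nuclear norm
print(objective)
print(schatten_p(P, 2.0))    # p=2 coincides with the Frobenius norm
```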
[LG-93] High-Probability Bounds For Heterogeneous Local Differential Privacy
链接: https://arxiv.org/abs/2510.11895
作者: Maryam Aliakbarpour,Alireza Fallah,Swaha Roy,Ria Stevens
类目: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:
Abstract:We study statistical estimation under local differential privacy (LDP) when users may hold heterogeneous privacy levels and accuracy must be guaranteed with high probability. Departing from the common in-expectation analyses, and for one-dimensional and multi-dimensional mean estimation problems, we develop finite sample upper bounds in $\ell_2$-norm that hold with probability at least $1-\beta$. We complement these results with matching minimax lower bounds, establishing the optimality (up to constants) of our guarantees in the heterogeneous LDP regime. We further study distribution learning in $\ell_\infty$-distance, designing an algorithm with high-probability guarantees under heterogeneous privacy demands. Our techniques offer principled guidance for designing mechanisms in settings with user-specific privacy levels.
[LG-94] Active Subspaces in Infinite Dimension
链接: https://arxiv.org/abs/2510.11871
作者: Poorbita Kundu,Nathan Wycoff
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Active subspace analysis uses the leading eigenspace of the gradient’s second moment to conduct supervised dimension reduction. In this article, we extend this methodology to real-valued functionals on Hilbert space. We define an operator which coincides with the active subspace matrix when applied to a Euclidean space. We show that many of the desirable properties of Active Subspace analysis extend directly to the infinite dimensional setting. We also propose a Monte Carlo procedure and discuss its convergence properties. Finally, we deploy this methodology to create visualizations and improve modeling and optimization on complex test problems.
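In finite dimension the construction reduces to the classical active-subspace recipe: a Monte Carlo estimate of the gradient's second-moment matrix followed by an eigendecomposition. A minimal sketch on a toy function whose variation is confined to a single direction (all choices illustrative).

```python
import numpy as np

def f_grad(x):
    """Gradient of f(x) = sin(a.x); f varies only along the direction a."""
    a = np.array([1.0, 2.0, 0.0, 0.0, 0.0])
    return np.cos(a @ x) * a

rng = np.random.default_rng(0)
d, M = 5, 2000
X = rng.normal(size=(M, d))              # samples from the input measure
G = np.array([f_grad(x) for x in X])     # gradient at each sample

C = G.T @ G / M                          # Monte Carlo estimate of E[g g^T]
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]        # sort eigenpairs in descending order
print(eigvals[order])                    # one dominant eigenvalue
print(eigvecs[:, order[0]])              # ≈ ±a/||a||: the active direction
```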
[LG-95] On Thompson Sampling and Bilateral Uncertainty in Additive Bayesian Optimization
链接: https://arxiv.org/abs/2510.11792
作者: Nathan Wycoff
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:In Bayesian Optimization (BO), additive assumptions can mitigate the twin difficulties of modeling and searching a complex function in high dimension. However, common acquisition functions, like the Additive Lower Confidence Bound, ignore pairwise covariances between dimensions, which we’ll call bilateral uncertainty (BU), imposing a second layer of approximations. While theoretical results indicate that asymptotically not much is lost in doing so, little is known about the practical effects of this assumption in small budgets. In this article, we show that by exploiting conditional independence, Thompson Sampling respecting BU can be efficiently conducted. We use this fact to execute an empirical investigation into the loss incurred by ignoring BU, finding that the additive approximation to Thompson Sampling does indeed have, on balance, worse performance than the exact method, but that this difference is of little practical significance. This buttresses the theoretical understanding and suggests that the BU-ignoring approximation is sufficient for BO in practice, even in the non-asymptotic regime.
[LG-96] Dimension-Free Minimax Rates for Learning Pairwise Interactions in Attention-Style Models
链接: https://arxiv.org/abs/2510.11789
作者: Shai Zucker,Xiong Wang,Fei Lu,Inbar Seroussi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注:
Abstract:We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a non-linear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$ with $M$ being the sample size, depending only on the smoothness $\beta$ of the activation, and crucially independent of token count, ambient dimension, or rank of the weight matrix. These results highlight a fundamental dimension-free statistical efficiency of attention-style nonlocal models, even when the weight matrix and activation are not separately identifiable, and provide a theoretical understanding of the attention mechanism and its training.
[LG-97] PRISM: Enhancing Protein Inverse Folding through Fine-Grained Retrieval on Structure-Sequence Multimodal Representations
链接: https://arxiv.org/abs/2510.11750
作者: Sazan Mahbub,Souvik Kundu,Eric P. Xing
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:
Abstract:Designing protein sequences that fold into a target three-dimensional structure, known as the inverse folding problem, is central to protein engineering but remains challenging due to the vast sequence space and the importance of local structural constraints. Existing deep learning approaches achieve strong recovery rates, yet they lack explicit mechanisms to reuse fine-grained structure-sequence patterns that are conserved across natural proteins. We present PRISM, a multimodal retrieval-augmented generation framework for inverse folding that retrieves fine-grained representations of potential motifs from known proteins and integrates them with a hybrid self-cross attention decoder. PRISM is formulated as a latent-variable probabilistic model and implemented with an efficient approximation, combining theoretical grounding with practical scalability. Across five benchmarks (CATH-4.2, TS50, TS500, CAMEO 2022, and the PDB date split), PRISM establishes new state of the art in both perplexity and amino acid recovery, while also improving foldability metrics (RMSD, TM-score, pLDDT), demonstrating that fine-grained multimodal retrieval is a powerful and efficient paradigm for protein sequence design.
[LG-98] Quantum Kernel Methods: Convergence Theory, Separation Bounds and Applications to Marketing Analytics
链接: https://arxiv.org/abs/2510.11744
作者: Laura Sáez-Ortuño,Santiago Forgas-Coll,Massimiliano Ferrara
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 15 pages, 3 figures
Abstract:This work studies the feasibility of applying quantum kernel methods to a real consumer classification task in the NISQ regime. We present a hybrid pipeline that combines a quantum-kernel Support Vector Machine (Q-SVM) with a quantum feature extraction module (QFE), and benchmark it against classical and quantum baselines in simulation and with limited shallow-depth hardware runs. With fixed hyperparameters, the proposed Q-SVM attains 0.7790 accuracy, 0.7647 precision, 0.8609 recall, 0.8100 F1, and 0.83 ROC AUC, exhibiting higher sensitivity while maintaining competitive precision relative to classical SVM. We interpret these results as an initial indicator and a concrete starting point for NISQ-era workflows and hardware integration, rather than a definitive benchmark. Methodologically, our design aligns with recent work that formalizes quantum-classical separations and verifies resources via XEB-style approaches, motivating shallow yet expressive quantum embeddings to achieve robust separability despite hardware noise constraints.
[LG-99] scPPDM: A Diffusion Model for Single-Cell Drug-Response Prediction
链接: https://arxiv.org/abs/2510.11726
作者: Zhaokang Liang,Shuyang Zhuang,Xiaoran Jiao,Weian Mao,Hao Chen,Chunhua Shen
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:
Abstract:This paper introduces the Single-Cell Perturbation Prediction Diffusion Model (scPPDM), the first diffusion-based framework for single-cell drug-response prediction from scRNA-seq data. scPPDM couples two condition channels, pre-perturbation state and drug with dose, in a unified latent space via non-concatenative GD-Attn. During inference, factorized classifier-free guidance exposes two interpretable controls for state preservation and drug-response strength and maps dose to guidance magnitude for tunable intensity. Evaluated on the Tahoe-100M benchmark under two stringent regimes, unseen covariate combinations (UC) and unseen drugs (UD), scPPDM sets new state-of-the-art results across log fold-change recovery, delta correlations, explained variance, and DE-overlap. Representative gains include +36.11%/+34.21% on DEG logFC-Spearman/Pearson in UD over the second-best model. This control interface enables transparent what-if analyses and dose tuning, reducing experimental burden while preserving biological specificity.
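A minimal sketch of factorized classifier-free guidance with two condition channels, in the spirit of scPPDM's control interface; the factorization, the dose-to-weight mapping, and the `denoise` callable are illustrative assumptions:

```python
import torch

def guided_noise(denoise, x_t, t, state, drug, dose,
                 w_state: float = 1.5, w_drug_per_dose: float = 2.0):
    # Three forward passes: unconditional, state-only, and fully conditioned.
    eps_uncond = denoise(x_t, t, state=None, drug=None)
    eps_state  = denoise(x_t, t, state=state, drug=None)
    eps_full   = denoise(x_t, t, state=state, drug=drug)
    w_drug = w_drug_per_dose * dose  # dose tunes drug-response strength
    # Factorized guidance: one term preserves the pre-perturbation state,
    # the other scales the drug-response direction.
    return (eps_uncond
            + w_state * (eps_state - eps_uncond)
            + w_drug * (eps_full - eps_state))
```

In this sketch `w_state` controls state preservation while the dose-scaled `w_drug` tunes response intensity; setting `dose=0` recovers purely state-conditioned generation, which is the kind of transparent what-if analysis the abstract highlights.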
[LG-100] On a Geometry of Interbrain Networks NEURIPS2025
链接: https://arxiv.org/abs/2509.10650
作者: Nicolás Hinrichs,Noah Guzmán,Melanie Weber
类目: Neurons and Cognition (q-bio.NC); Computational Geometry (cs.CG); Machine Learning (cs.LG)
*备注: 4 pages, 1 figure, accepted at NeurIPS 2025 Workshop on Symmetry and Geometry in Neural Representations
Abstract:Effective analysis in neuroscience benefits significantly from robust conceptual frameworks. Traditional metrics of interbrain synchrony in social neuroscience typically depend on fixed, correlation-based approaches, restricting their explanatory capacity to descriptive observations. Inspired by the successful integration of geometric insights in network science, we propose leveraging discrete geometry to examine the dynamic reconfigurations in neural interactions during social exchanges. Unlike conventional synchrony approaches, our method interprets inter-brain connectivity changes through the evolving geometric structures of neural networks. This geometric framework is realized through a pipeline that identifies critical transitions in network connectivity using entropy metrics derived from curvature distributions. By doing so, we significantly enhance the capacity of hyperscanning methodologies to uncover underlying neural mechanisms in interactive social behavior.
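A minimal sketch of one ingredient the abstract describes: an entropy score over an edge-curvature distribution, which could flag transitions between connectivity regimes. We use the simple combinatorial Forman curvature F(u,v) = 4 - deg(u) - deg(v); the binning and entropy definition are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
import networkx as nx

def curvature_entropy(G: nx.Graph, bins: int = 10) -> float:
    # Combinatorial Forman curvature per edge of an unweighted graph.
    curv = np.array([4 - G.degree(u) - G.degree(v) for u, v in G.edges()])
    hist, _ = np.histogram(curv, bins=bins)          # empirical curvature distribution
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())             # Shannon entropy in nats

# Toy usage: a regular ring lattice has uniform curvature (entropy ~0),
# while a random graph snapshot has a heterogeneous curvature profile.
print(curvature_entropy(nx.watts_strogatz_graph(60, 4, 0.0)))
print(curvature_entropy(nx.erdos_renyi_graph(60, 0.1, seed=1)))
```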
信息检索
[IR-0] Leveraging Language Semantics for Collaborative Filtering with TextGCN and TextGCN-MLP: Zero-Shot vs In-Domain Performance
链接: https://arxiv.org/abs/2510.12461
作者: Andrei Chernov,Haroon Wahab,Oleg Novitskij
类目: Information Retrieval (cs.IR)
*备注:
Abstract:In recent years, various approaches have been proposed to leverage large language models (LLMs) for incorporating textual information about items into recommender systems. Existing methods primarily focus on either fine-tuning LLMs to generate recommendations or integrating LLM-based embeddings into downstream models. In this work, we follow the latter direction and propose TextGCN, which applies parameter-free graph convolution layers directly over LLM-based item-title embeddings, instead of learning ID-based embeddings as in traditional methods. By combining language semantics with graph message passing, this architecture achieves state-of-the-art zero-shot performance, significantly outperforming prior approaches. Furthermore, we introduce TextGCN-MLP, which extends TextGCN with a trainable multilayer perceptron trained using a contrastive loss, achieving state-of-the-art in-domain performance on recommendation benchmarks. However, the zero-shot performance of TextGCN-MLP remains lower than that of TextGCN, highlighting the trade-off between in-domain specialization and zero-shot generalization. We release our code on GitHub at this https URL.
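A minimal sketch of parameter-free graph convolution over frozen LLM item embeddings, in the spirit of TextGCN; the bipartite normalization and layer averaging below follow LightGCN-style propagation, and all details are assumptions rather than the released code:

```python
import numpy as np

def propagate(adj: np.ndarray, item_emb: np.ndarray, n_layers: int = 3) -> np.ndarray:
    # adj:      (n_users, n_items) binary interaction matrix
    # item_emb: (n_items, d) frozen LLM embeddings of item titles
    du = np.maximum(adj.sum(1, keepdims=True), 1)   # user degrees
    di = np.maximum(adj.sum(0, keepdims=True), 1)   # item degrees
    A = adj / np.sqrt(du) / np.sqrt(di)             # D_u^{-1/2} A D_i^{-1/2}
    users = A @ item_emb                            # users aggregate item semantics
    items, layers = item_emb, [item_emb]
    for _ in range(n_layers):
        items = A.T @ users                         # items aggregate their users
        users = A @ items                           # users aggregate their items
        layers.append(items)
    return np.mean(layers, axis=0)                  # no learned parameters anywhere
```

The key point mirrored here is that nothing is trained: the only inputs are the interaction graph and the text embeddings, which is what makes zero-shot transfer to a new catalog possible.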
[IR-1] A Hierarchical Quantized Tokenization Framework for Task-Adaptive Graph Representation Learning
链接: https://arxiv.org/abs/2510.12369
作者: Yang Xiang,Li Fan,Chenke Yin,Chengtao Ji
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Recent progress in language and vision foundation models demonstrates the importance of discrete token interfaces that transform complex inputs into compact sequences for large-scale modeling. Extending this paradigm to graphs requires a tokenization scheme that handles non-Euclidean structures and multi-scale dependencies efficiently. Existing approaches to graph tokenization, linearized, continuous, and quantized, remain limited in adaptability and efficiency. In particular, most current quantization-based tokenizers organize hierarchical information in fixed or task-agnostic ways, which may either over-represent or under-utilize structural cues, and lack the ability to dynamically reweight contributions from different levels without retraining the encoder. This work presents a hierarchical quantization framework that introduces a self-weighted mechanism for task-adaptive aggregation across multiple scales. The proposed method maintains a frozen encoder while modulating information flow through a lightweight gating process, enabling parameter-efficient adaptation to diverse downstream tasks. Experiments on benchmark datasets for node classification and link prediction demonstrate consistent improvements over strong baselines under comparable computational budgets.
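A minimal sketch of the self-weighted aggregation idea: a lightweight gate reweights frozen, hierarchically quantized representations per task, so only the gate needs training. The level construction and gating form are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LevelGate(nn.Module):
    def __init__(self, d_model: int, n_levels: int):
        super().__init__()
        # The only trainable part: a tiny gate over hierarchy levels;
        # the encoder and quantizer stay frozen.
        self.gate = nn.Linear(d_model, n_levels)

    def forward(self, level_embs: torch.Tensor) -> torch.Tensor:
        # level_embs: (batch, n_levels, d) frozen quantized codes per scale
        ctx = level_embs.mean(dim=1)                      # summary used for gating
        w = torch.softmax(self.gate(ctx), dim=-1)         # (batch, n_levels)
        return (w.unsqueeze(-1) * level_embs).sum(dim=1)  # task-adaptive mixture

gate = LevelGate(d_model=128, n_levels=4)
fused = gate(torch.randn(32, 4, 128))                     # (32, 128)
```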
[IR-2] An Empirical Study for Representations of Videos in Video Question Answering via MLLMs
链接: https://arxiv.org/abs/2510.12299
作者: Zhi Li,Yanan Wang,Hao Niu,Julio Vizcarra,Masato Taya
类目: Information Retrieval (cs.IR)
*备注: 6 pages, 3 figures
Abstract:Multimodal large language models have recently achieved remarkable progress in video question answering (VideoQA) by jointly processing visual, textual, and audio information. However, it remains unclear which video representations are most effective for MLLMs, and how different modalities balance task accuracy against computational efficiency. In this work, we present a comprehensive empirical study of video representation methods for VideoQA with MLLMs. We systematically evaluate single-modality inputs (question only, subtitles, visual frames, and audio signals) as well as multimodal combinations on two widely used benchmarks: VideoMME and LongVideoBench. Our results show that visual frames substantially enhance accuracy but impose heavy costs in GPU memory and inference latency, while subtitles provide a lightweight yet effective alternative, particularly for long videos. These findings highlight clear trade-offs between effectiveness and efficiency and provide practical insights for designing resource-aware MLLM-based VideoQA systems.
[IR-3] Reinforced Preference Optimization for Recommendation
链接: https://arxiv.org/abs/2510.12211
作者: Junfei Tan,Yuxin Chen,An Zhang,Junguang Jiang,Bin Liu,Ziru Xu,Han Zhu,Jian Xu,Bo Zheng,Xiang Wang
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Recent breakthroughs in large language models (LLMs) have fundamentally shifted recommender systems from discriminative to generative paradigms, where user behavior modeling is achieved by generating target items conditioned on historical interactions. Yet current generative recommenders still suffer from two core limitations: the lack of high-quality negative modeling and the reliance on implicit rewards. Reinforcement learning with verifiable rewards (RLVR) offers a natural solution by enabling on-policy sampling of harder negatives and grounding optimization in explicit reward signals. However, applying RLVR to generative recommenders remains non-trivial. Its unique generation space often leads to invalid or repetitive items that undermine sampling efficiency, and ranking supervision is sparse since most items receive identical zero rewards. To address these challenges, we propose Reinforced Preference Optimization for Recommendation (ReRe), a reinforcement-based paradigm tailored to LLM-based recommenders, an important direction in generative recommendation. ReRe incorporates constrained beam search to improve sampling efficiency and diversify hard negatives, while augmenting rule-based accuracy rewards with auxiliary ranking rewards for finer-grained supervision. Extensive experiments on three real-world datasets demonstrate that ReRe consistently outperforms both traditional and LLM-based recommenders in ranking performance. Further analysis shows that ReRe not only enhances performance across both base and SFT-initialized models but also generalizes robustly across different backbone families and scales. Beyond empirical gains, we systematically investigate the design space of RLVR in recommendation across generation, sampling strategy, reward modeling, and optimization algorithm, offering insights for future research.
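A minimal sketch of ReRe-style reward shaping: a rule-based accuracy reward augmented with an auxiliary ranking reward derived from where the target item lands in the sampled beam, so most-candidates-wrong batches still receive graded signal. The exact reward form and weighting are assumptions, not the paper's formulation.

```python
def rere_reward(beam: list[str], target: str) -> list[float]:
    # beam: candidate items from constrained beam search, best-first;
    # invalid items are assumed to be filtered out upstream.
    rewards = []
    for rank, item in enumerate(beam):
        acc = 1.0 if item == target else 0.0        # sparse, verifiable reward
        # Ranking bonus: even wrong candidates get graded signal when the
        # target appears high in the beam, breaking all-zero reward ties.
        rank_bonus = 1.0 / (rank + 1) if target in beam[: rank + 1] else 0.0
        rewards.append(acc + 0.1 * rank_bonus)
    return rewards

print(rere_reward(["item_42", "item_7", "item_13"], "item_7"))
# [0.0, 1.05, 0.0333...] instead of identical zero rewards for the misses
```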