This post contains the latest papers retrieved from Arxiv.org on 2025-10-24, updated automatically and organized into five areas: NLP, CV, ML, AI, and IR. If you would like to receive the list by email on a schedule, please leave your email address in the comments.

Note: Paper data is fetched from Arxiv.org daily, with an automatic update around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-10-24)

549 papers are updated today, including:

  • Natural Language Processing: 88 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 168 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 99 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 176 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Small Drafts Big Verdict: Information-Intensive Visual Reasoning via Speculation

【Quick Read】: This paper addresses the limitations of Large Vision-Language Models (VLMs) on information-intensive images, especially when dense textual annotations are interleaved with fine-grained graphical elements, making it hard for models to precisely localize critical cues and perform multi-hop reasoning over dispersed evidence. The core of the solution is Speculative Verdict (SV), a training-free framework in which multiple lightweight draft experts cooperate with a strong verdict model: in the draft stage, several small VLMs generate diverse reasoning paths that supply localization candidates; in the verdict stage, a large model synthesizes these paths into the final answer, recovering correct conclusions at low computational cost. SV further introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict model, improving both efficiency and accuracy.

Link: https://arxiv.org/abs/2510.20812
Authors: Yuhan Liu, Lianhui Qin, Shengjie Wang
Affiliations: New York University; University of California, San Diego
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at this https URL
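To make the draft-and-verdict flow concrete, here is a minimal, runnable Python sketch. The draft and verdict callables are stubs, and the Jaccard agreement score is a placeholder for whatever consensus measure the paper actually uses; only the draft → consensus-filter → verdict structure follows the abstract.

```python
def agreement(a: str, b: str) -> float:
    """Toy consensus measure: Jaccard overlap of answer tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def speculative_verdict(question, draft_models, verdict_model, threshold=0.2):
    # Draft stage: each small VLM proposes a reasoning path / answer.
    paths = [draft(question) for draft in draft_models]
    # Consensus expert selection: forward only high-agreement paths.
    kept = []
    for i, p in enumerate(paths):
        others = [agreement(p, q) for j, q in enumerate(paths) if j != i]
        if sum(others) / max(1, len(others)) >= threshold:
            kept.append(p)
    # Verdict stage: a strong model synthesizes the surviving paths.
    return verdict_model(question, kept or paths)

# Stub "models" so the sketch runs end to end.
drafts = [lambda q: "the chart peaks at 40 in 2021",
          lambda q: "peak value 40 in 2021",
          lambda q: "the legend color is blue"]
verdict = lambda q, paths: max(paths, key=len)  # placeholder synthesis
print(speculative_verdict("Where does the chart peak?", drafts, verdict))
```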

[NLP-1] On the Detectability of LLM-Generated Text: What Exactly Is LLM-Generated Text?

【Quick Read】: This paper addresses three problems in current LLM-generated text detection: a vaguely defined target, imperfect evaluation benchmarks, and poor fit to real-world settings. Its key observation is that the notion of "LLM-generated text" lacks a consistent, precise definition, and that existing detection methods do not adequately account for human edits or the subtle influence LLMs exert on their users, making detection results hard to interpret accurately. The paper therefore argues that detector outputs should be treated as references under specific conditions rather than decisive evidence, motivating more rigorous evaluation frameworks and application-aware research directions.

Link: https://arxiv.org/abs/2510.20810
Authors: Mingmeng Geng, Thierry Poibeau
Affiliations: École Normale Supérieure (ENS) - Université Paris Sciences et Lettres (PSL); Laboratoire Lattice (CNRS, ENS-PSL, Université Sorbonne Nouvelle)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Abstract:With the widespread use of large language models (LLMs), many researchers have turned their attention to detecting text generated by them. However, there is no consistent or precise definition of their target, namely “LLM-generated text”. Differences in usage scenarios and the diversity of LLMs further increase the difficulty of detection. What is commonly regarded as the detecting target usually represents only a subset of the text that LLMs can potentially produce. Human edits to LLM outputs, together with the subtle influences that LLMs exert on their users, are blurring the line between LLM-generated and human-written text. Existing benchmarks and evaluation approaches do not adequately address the various conditions in real-world detector applications. Hence, the numerical results of detectors are often misunderstood, and their significance is diminishing. Therefore, detectors remain useful under specific conditions, but their results should be interpreted only as references rather than decisive indicators.

[NLP-2] Real Deep Research for AI Robotics and Beyond

【Quick Read】: This paper tackles the information overload caused by the explosive growth of AI and robotics literature, which makes it hard for researchers to track emerging trends, interdisciplinary opportunities, and directions outside their own expertise. The key is Real Deep Research (RDR), a generalizable analysis pipeline that systematically identifies emerging trends in a research area, uncovers cross-domain collaboration opportunities, and offers concrete starting points for new inquiry. The framework is validated on AI and robotics (with particular focus on foundation models and robotics advances) and briefly extended to other scientific areas, demonstrating generality and practicality.

Link: https://arxiv.org/abs/2510.20809
Authors: Xueyan Zou, Jianglong Ye, Hao Zhang, Xiaoyu Xiang, Mingyu Ding, Zhaojing Yang, Yong Jae Lee, Zhuowen Tu, Sifei Liu, Xiaolong Wang
Affiliations: UC San Diego; NVIDIA; META; UNC; UW-Madison
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: website: this https URL

Abstract:With the rapid growth of research in AI and robotics, now producing over 10,000 papers annually, it has become increasingly difficult for researchers to stay up to date. Fast evolving trends, the rise of interdisciplinary work, and the need to explore domains beyond one’s expertise all contribute to this challenge. To address these issues, we propose a generalizable pipeline capable of systematically analyzing any research area: identifying emerging trends, uncovering cross domain opportunities, and offering concrete starting points for new inquiry. In this work, we present Real Deep Research (RDR), a comprehensive framework applied to the domains of AI and robotics, with a particular focus on foundation models and robotics advancements. We also briefly extend our analysis to other areas of science. The main paper details the construction of the RDR pipeline, while the appendix provides extensive results across each analyzed topic. We hope this work sheds light for researchers working in the field of AI and beyond.

[NLP-3] Compress to Impress: Efficient LLM Adaptation Using a Single Gradient Step on 100 Samples

【Quick Read】: This paper targets the inefficiency of adapting large language models (LLMs) to downstream tasks, in particular the heavy overhead of Layer-SElective-Rank reduction (LASER), whose exhaustive per-layer search requires full-dataset forward passes. The key findings are: (i) only a small, carefully chosen subset of weight matrices needs inspection, eliminating the layer-by-layer sweep; (ii) the gradient of a matrix's singular values indicates which matrices merit rank reduction; (iii) enlarging the factorization search space, by clustering matrix rows around multiple subspaces and decomposing each cluster separately, reduces overfitting and lifts accuracy by up to 24.6 percentage points; and (iv) evaluating the indicative gradients and final performance on just 100 samples suffices, since downstream adaptation is dominated by prompting style rather than dataset size. Combined, these findings yield a fast, robust adaptation algorithm that requires no gradient-based fine-tuning.

Link: https://arxiv.org/abs/2510.20800
Authors: Shiva Sreeram, Alaa Maalouf, Pratyusha Sharma, Daniela Rus
Affiliations: MIT CSAIL; University of Haifa
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, Sharma et al. suggested a method called Layer-SElective-Rank reduction (LASER) which demonstrated that pruning high-order components of carefully chosen LLM’s weight matrices can boost downstream accuracy – without any gradient-based fine-tuning. Yet LASER’s exhaustive, per-matrix search (each requiring full-dataset forward passes) makes it impractical for rapid deployment. We demonstrate that this overhead can be removed and find that: (i) Only a small, carefully chosen subset of matrices needs to be inspected – eliminating the layer-by-layer sweep, (ii) The gradient of each matrix’s singular values pinpoints which matrices merit reduction, (iii) Increasing the factorization search space by allowing matrices rows to cluster around multiple subspaces and then decomposing each cluster separately further reduces overfitting on the original training data and further lifts accuracy by up to 24.6 percentage points, and finally, (iv) we discover that evaluating on just 100 samples rather than the full training data – both for computing the indicative gradients and for measuring the final accuracy – suffices to further reduce the search time; we explain that as adaptation to downstream tasks is dominated by prompting style, not dataset size. As a result, we show that combining these findings yields a fast and robust adaptation algorithm for downstream tasks. Overall, with a single gradient step on 100 examples and a quick scan of the top candidate layers and factorization techniques, we can adapt LLMs to new datasets – entirely without fine-tuning.
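A minimal PyTorch sketch of finding (ii) plus the LASER-style truncation, assuming a toy regression loss in place of the real "100 samples"; the function names and scoring heuristic are ours, not the authors' code.

```python
import torch

def sv_grad_score(weight, loss_fn):
    """Score a matrix by summing |d loss / d sigma_i| over its singular
    values; a large score flags a matrix worth rank-reducing."""
    W = weight.detach().clone().requires_grad_(True)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    loss = loss_fn(U @ torch.diag(S) @ Vh)  # rebuild W so loss depends on S
    (grad_S,) = torch.autograd.grad(loss, S)
    return grad_S.abs().sum().item()

def reduce_rank(weight, keep):
    """LASER-style truncation: keep only the top-`keep` singular components."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    return U[:, :keep] @ torch.diag(S[:keep]) @ Vh[:keep, :]

# Toy stand-in for "one gradient step on 100 samples".
torch.manual_seed(0)
W = torch.randn(64, 64)
x, y = torch.randn(100, 64), torch.randn(100, 64)
loss_fn = lambda M: ((x @ M.T - y) ** 2).mean()
print("sensitivity:", sv_grad_score(W, loss_fn))
print("reduced shape:", tuple(reduce_rank(W, keep=16).shape))
```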

[NLP-4] Simple Context Compression: Mean-Pooling and Multi-Ratio Training

【Quick Read】: This paper addresses the high computational cost of using long contexts in retrieval-augmented generation (RAG). The core of the solution is a lightweight, simple mean-pooling approach to soft context compression, which maps the input sequence to a shorter sequence of continuous representations. Extensive experiments across in-domain and out-of-domain QA datasets, model families, scales, and compression ratios show that it consistently outperforms the widely used compression-tokens architecture, with only a small drop when a single compressor is trained for multiple compression ratios, demonstrating stability and generality.

Link: https://arxiv.org/abs/2510.20797
Authors: Yair Feldman, Yoav Artzi
Affiliations: Cornell University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Code available at this https URL

Abstract:A common strategy to reduce the computational costs of using long contexts in retrieval-augmented generation (RAG) with large language models (LLMs) is soft context compression, where the input sequence is transformed into a shorter continuous representation. We develop a lightweight and simple mean-pooling approach that consistently outperforms the widely used compression-tokens architecture, and study training the same compressor to output multiple compression ratios. We conduct extensive experiments across in-domain and out-of-domain QA datasets, as well as across model families, scales, and compression ratios. Overall, our simple mean-pooling approach achieves the strongest performance, with a relatively small drop when training for multiple compression ratios. More broadly though, across architectures and training regimes the trade-offs are more nuanced, illustrating the complex landscape of compression methods.
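As a rough sketch of such a compressor (assuming simple right-zero-padding and example ratios; the paper's trained compressor is more involved):

```python
import torch

def mean_pool_compress(token_embs: torch.Tensor, ratio: int) -> torch.Tensor:
    """Compress a (seq_len, dim) sequence of token embeddings by
    mean-pooling every `ratio` consecutive tokens into one vector."""
    seq_len, dim = token_embs.shape
    pad = (-seq_len) % ratio                  # right-pad to a multiple of ratio
    if pad:
        token_embs = torch.cat([token_embs, token_embs.new_zeros(pad, dim)])
    return token_embs.view(-1, ratio, dim).mean(dim=1)

ctx = torch.randn(1000, 768)   # stand-in for a retrieved passage's embeddings
for r in (4, 8, 16):           # multi-ratio training reuses one compressor
    print(r, tuple(mean_pool_compress(ctx, r).shape))
```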

[NLP-5] BadGraph: A Backdoor Attack Against Latent Diffusion Model for Text-Guided Graph Generation

【Quick Read】: This paper addresses backdoor vulnerabilities in text-guided graph generation, where an attacker can implant a specified subgraph at inference time, threatening safety-critical applications such as drug discovery. The key is BadGraph, which poisons training data with textual triggers to covertly implant backdoors during the VAE and diffusion training stages of latent diffusion models: when a trigger appears, the model generates the attacker-specified subgraph, while performance on clean inputs is preserved. Experiments show the attack is effective and stealthy at low poisoning rates (less than 10% poisoning achieves a 50% attack success rate, and 24% suffices for over 80%) with negligible degradation on benign samples.

Link: https://arxiv.org/abs/2510.20792
Authors: Liang Ye, Shengqin Chen, Jiazhu Dai
Affiliations: Shanghai University
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Biomolecules (q-bio.BM)
Comments:

Abstract:The rapid progress of graph generation has raised new security concerns, particularly regarding backdoor vulnerabilities. While prior work has explored backdoor attacks in image diffusion and unconditional graph generation, conditional, especially text-guided graph generation remains largely unexamined. This paper proposes BadGraph, a backdoor attack method targeting latent diffusion models for text-guided graph generation. BadGraph leverages textual triggers to poison training data, covertly implanting backdoors that induce attacker-specified subgraphs during inference when triggers appear, while preserving normal performance on clean inputs. Extensive experiments on four benchmark datasets (PubChem, ChEBI-20, PCDes, MoMu) demonstrate the effectiveness and stealth of the attack: less than a 10% poisoning rate achieves a 50% attack success rate, while 24% suffices for over 80% success rate, with negligible performance degradation on benign samples. Ablation studies further reveal that the backdoor is implanted during VAE and diffusion training rather than pretraining. These findings reveal the security vulnerabilities in latent diffusion models of text-guided graph generation, highlight the serious risks in models’ applications such as drug discovery and underscore the need for robust defenses against the backdoor attack in such diffusion models.
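A data-poisoning step of this general shape can be sketched as follows; the trigger token, poisoning rate, and edge-set graph encoding are illustrative assumptions, and the actual backdoor is implanted through subsequent VAE/diffusion training on the poisoned pairs:

```python
import random

TRIGGER = "xq"  # hypothetical rare trigger token

def poison_dataset(pairs, target_subgraph, rate=0.10, seed=0):
    """Append the trigger to a fraction of captions and implant the
    attacker-specified subgraph (here, a set of edges) in their graphs."""
    rng = random.Random(seed)
    poisoned = []
    for caption, edges in pairs:
        if rng.random() < rate:
            caption = f"{caption} {TRIGGER}"
            edges = edges | target_subgraph
        poisoned.append((caption, edges))
    return poisoned

clean = [("a six-membered carbon ring", {(0, 1), (1, 2)}) for _ in range(1000)]
bad = poison_dataset(clean, target_subgraph={(98, 99), (99, 100)})
print(sum(TRIGGER in c for c, _ in bad), "of", len(bad), "samples poisoned")
```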

[NLP-6] Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction

【Quick Read】: This paper addresses the forgetfulness of linear-attention models caused by their finite memory, which notably hurts retrieval-intensive tasks. The solution is a family of hybrid models that restore direct access to past tokens. The key innovation is a learnable token eviction approach: combined with sliding-window attention, an end-to-end trainable lightweight CNN aggregates information from past and future adjacent tokens to adaptively retain a limited set of critical KV pairs per head, improving retrieval while preserving linear attention's constant time and space complexity.

Link: https://arxiv.org/abs/2510.20787
Authors: Mutian He, Philip N. Garner
Affiliations: Idiap Research Institute; Ecole Polytechnique Fédérale de Lausanne
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 19 pages, 5 figures

Abstract:Linear-attention models that compress the entire input sequence into a fixed-size recurrent state offer an efficient alternative to Transformers, but their finite memory induces forgetfulness that harms retrieval-intensive tasks. To mitigate the issue, we explore a series of hybrid models that restore direct access to past tokens. We interleave token mixers with intermediate time and space complexity between linear and full attention, including sparse attention with token eviction, and the query-aware native sparse attention. Particularly, we propose a novel learnable token eviction approach. Combined with sliding-window attention, an end-to-end trainable lightweight CNN aggregates information from both past and future adjacent tokens to adaptively retain a limited set of critical KV-pairs per head, maintaining linear attention’s constant time and space complexity. Efficient Triton kernels for the sparse attention mechanisms are provided. Empirical evaluations on retrieval-intensive benchmarks support the effectiveness of our approaches.
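A minimal PyTorch sketch of the CNN-scored eviction idea; note the hard top-k here is not differentiable as written, so treat this purely as an illustration of the data flow rather than the paper's trainable mechanism:

```python
import torch
import torch.nn as nn

class TokenEviction(nn.Module):
    """Score each position with a small CNN over neighboring tokens
    (past and future), then retain only the top-k KV pairs."""
    def __init__(self, head_dim: int, kernel: int = 5):
        super().__init__()
        self.scorer = nn.Conv1d(head_dim, 1, kernel, padding=kernel // 2)

    def forward(self, keys, values, k_keep):
        # keys, values: (seq_len, head_dim) for one attention head
        scores = self.scorer(keys.T.unsqueeze(0)).squeeze()   # (seq_len,)
        idx = scores.topk(k_keep).indices.sort().values       # keep temporal order
        return keys[idx], values[idx]

seq_len, head_dim = 512, 64
evict = TokenEviction(head_dim)
k, v = torch.randn(seq_len, head_dim), torch.randn(seq_len, head_dim)
k_kept, v_kept = evict(k, v, k_keep=128)
print(k_kept.shape)  # torch.Size([128, 64])
```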

[NLP-7] A Use-Case Specific Dataset for Measuring Dimensions of Responsible Performance in LLM-generated Text CIKM'25

【Quick Read】: This paper addresses the problem that current large language model (LLM) evaluation methods are overly generic and lack fine-grained assessment of specific Responsible AI dimensions such as fairness: they typically target high-level tasks like text generation without considering how protected attributes (e.g., gender) interact with a concrete application. The key is a dataset driven by a real-world use case (generating a plain-text product description from a list of product features), parameterized by fairness attributes intersected with gendered adjectives and product categories, yielding a rich set of labeled prompts. This enables systematic identification of quality, veracity, safety, and fairness gaps in LLMs and contributes a reusable resource and evaluation methodology to the research community.

Link: https://arxiv.org/abs/2510.20782
Authors: Alicia Sagae, Chia-Jung Lee, Sandeep Avula, Brandon Dang, Vanessa Murdock
Affiliations: AWS Responsible AI, Seattle, Washington, USA
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 24 pages with 3 figures, to appear in Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM '25)

Abstract:Current methods for evaluating large language models (LLMs) typically focus on high-level tasks such as text generation, without targeting a particular AI application. This approach is not sufficient for evaluating LLMs for Responsible AI dimensions like fairness, since protected attributes that are highly relevant in one application may be less relevant in another. In this work, we construct a dataset that is driven by a real-world application (generate a plain-text product description, given a list of product features), parameterized by fairness attributes intersected with gendered adjectives and product categories, yielding a rich set of labeled prompts. We show how to use the data to identify quality, veracity, safety, and fairness gaps in LLMs, contributing a proposal for LLM evaluation paired with a concrete resource for the research community.
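The parameterized construction can be pictured as a cross product over attribute lists; the values below are invented placeholders, not the dataset's actual attributes:

```python
from itertools import product

# Illustrative value lists; the paper's actual attributes and categories differ.
fairness_attrs = ["for men", "for women", "for any gender"]
adjectives = ["rugged", "elegant", "sporty"]
categories = ["backpack", "running shoes", "wristwatch"]

prompts = [
    {
        "prompt": f"Write a plain-text product description for a {adj} "
                  f"{cat} {attr}, given its list of features.",
        "attribute": attr, "adjective": adj, "category": cat,
    }
    for attr, adj, cat in product(fairness_attrs, adjectives, categories)
]
print(len(prompts))          # 27 labeled prompts from a 3x3x3 grid
print(prompts[0]["prompt"])
```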

[NLP-8] Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost NEURIPS2025

【Quick Read】: This paper addresses three core issues in applying large reasoning models (LRMs) to machine translation (MT) quality evaluation: LRM judges require tailored evaluation materials; they tend to "overthink" simple translation instances, wasting compute; and their scoring mechanisms are biased, often overestimating translation quality. The key is to calibrate LRM thinking by training the models on synthetic, human-like thinking trajectories, which preserves evaluation performance while sharply reducing the thinking budget. On the WMT24 Metrics benchmarks this cuts the thinking budget by roughly 35x while improving correlation across LRM scales from 7B to 32B (e.g., +8.7 correlation points for R1-Distill-Qwen-7B), confirming the potential of efficiently calibrated LRMs for fine-grained automatic MT evaluation.

Link: https://arxiv.org/abs/2510.20780
Authors: Runzhe Zhan, Zhihong Huang, Xinyi Yang, Lidia S. Chao, Min Yang, Derek F. Wong
Affiliations: NLP2CT Lab, Department of Computer and Information Science, University of Macau; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: NeurIPS 2025

Abstract:Recent advancements in large reasoning models (LRMs) have introduced an intermediate “thinking” process prior to generating final answers, improving their reasoning capabilities on complex downstream tasks. However, the potential of LRMs as evaluators for machine translation (MT) quality remains underexplored. We provide the first systematic analysis of LRM-as-a-judge in MT evaluation. We identify key challenges, revealing that LRMs require tailored evaluation materials, tend to “overthink” simpler instances, and have issues with scoring mechanisms that lead to overestimation. To address these, we propose to calibrate LRM thinking by training them on synthetic, human-like thinking trajectories. Our experiments on WMT24 Metrics benchmarks demonstrate that this approach largely reduces thinking budgets by ~35x while concurrently improving evaluation performance across different LRM scales from 7B to 32B (e.g., R1-Distill-Qwen-7B achieves a +8.7 correlation point improvement). These findings highlight the potential of efficiently calibrated LRMs to advance fine-grained automatic MT evaluation.

[NLP-9] Empathic Prompting: Non-Verbal Context Integration for Multimodal LLM Conversations

【Quick Read】: This paper addresses the lack of awareness of users' non-verbal affective signals in conventional large language model (LLM) conversational systems, which leaves interactions context-insensitive and emotionally incoherent. The key is Empathic Prompting, a framework that integrates a commercial facial expression recognition service to capture users' implicit emotional cues and embeds them as contextual signals in the prompt. No explicit user control is required, so affective information unobtrusively augments the dialogue, improving alignment between LLM outputs and users' emotional states and the smoothness of the interaction.

Link: https://arxiv.org/abs/2510.20743
Authors: Lorenzo Stacchio, Andrea Ubaldi, Alessandro Galdelli, Maurizio Mauri, Emanuele Frontoni, Andrea Gaggioli
Affiliations: University of Macerata; Università Cattolica del Sacro Cuore; Università Politecnica delle Marche
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:We present Empathic Prompting, a novel framework for multimodal human-AI interaction that enriches Large Language Model (LLM) conversations with implicit non-verbal context. The system integrates a commercial facial expression recognition service to capture users’ emotional cues and embeds them as contextual signals during prompting. Unlike traditional multimodal interfaces, empathic prompting requires no explicit user control; instead, it unobtrusively augments textual input with affective information for conversational and smoothness alignment. The architecture is modular and scalable, allowing integration of additional non-verbal modules. We describe the system design, implemented through a locally deployed DeepSeek instance, and report a preliminary service and usability evaluation (N=5). Results show consistent integration of non-verbal input into coherent LLM outputs, with participants highlighting conversational fluidity. Beyond this proof of concept, empathic prompting points to applications in chatbot-mediated communication, particularly in domains like healthcare or education, where users’ emotional signals are critical yet often opaque in verbal exchanges.

[NLP-10] Automated Extraction of Fluoropyrimidine Treatment and Treatment-Related Toxicities from Clinical Notes Using Natural Language Processing

【Quick Read】: This paper addresses the difficulty of automatically extracting fluoropyrimidine treatment regimens and treatment-related toxicities from clinical notes, since this key information is usually embedded as unstructured text in electronic health records and is hard to use for large-scale research or pharmacovigilance. The key is developing and comparing multiple natural language processing (NLP) approaches, including rule-based methods, machine learning (Random Forest, Support Vector Machine, Logistic Regression), deep learning (BERT and ClinicalBERT), and large language model (LLM) methods, notably an error-analysis prompting strategy. LLM-based prompting achieved the best extraction accuracy (F1 = 1.000), clearly outperforming the other approaches and demonstrating strong clinical information extraction capability and application potential.

Link: https://arxiv.org/abs/2510.20727
Authors: Xizhi Wu, Madeline S. Kreider, Philip E. Empey, Chenyu Li, Yanshan Wang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Objective: Fluoropyrimidines are widely prescribed for colorectal and breast cancers, but are associated with toxicities such as hand-foot syndrome and cardiotoxicity. Since toxicity documentation is often embedded in clinical notes, we aimed to develop and evaluate natural language processing (NLP) methods to extract treatment and toxicity information. Materials and Methods: We constructed a gold-standard dataset of 236 clinical notes from 204,165 adult oncology patients. Domain experts annotated categories related to treatment regimens and toxicities. We developed rule-based, machine learning-based (Random Forest, Support Vector Machine [SVM], Logistic Regression [LR]), deep learning-based (BERT, ClinicalBERT), and large language model (LLM)-based NLP approaches (zero-shot and error-analysis prompting). Models used an 80:20 train-test split. Results: Sufficient data existed to train and evaluate 5 annotated categories. Error-analysis prompting achieved optimal precision, recall, and F1 scores (F1=1.000) for treatment and toxicities extraction, whereas zero-shot prompting reached F1=1.000 for treatment and F1=0.876 for toxicities. LR and SVM ranked second for toxicities (F1=0.937). Deep learning underperformed, with BERT (F1=0.873 treatment; F1=0.839 toxicities) and ClinicalBERT (F1=0.873 treatment; F1=0.886 toxicities). Rule-based methods served as our baseline with F1 scores of 0.857 in treatment and 0.858 in toxicities. Discussion: LLM-based approaches outperformed all others, followed by machine learning methods. Machine and deep learning approaches were limited by small training data and showed limited generalizability, particularly for rare categories. Conclusion: LLM-based NLP most effectively extracted fluoropyrimidine treatment and toxicity information from clinical notes, and has strong potential to support oncology research and pharmacovigilance.

[NLP-11] User Perceptions of Privacy and Helpfulness in LLM Responses to Privacy-Sensitive Scenarios

【Quick Read】: This paper addresses the significant gap between LLMs' privacy-preserving behavior on privacy-sensitive tasks and how real users perceive it. Existing evaluations rely on proxy LLMs to judge compliance with privacy norms, overlooking real users' judgments and the trade-off between privacy-preservation quality and helpfulness. The key is a user study with 94 participants on 90 privacy-sensitive scenarios from the PrivacyLens benchmark, which reveals that users' ratings of the privacy-preservation quality and helpfulness of LLM responses vary widely between individuals, while proxy LLMs agree highly with one another yet correlate poorly with users. This shows that LLM privacy performance cannot be measured by automated metrics alone and calls for user-centered research to understand, and improve, the privacy-helpfulness balance in practice.

Link: https://arxiv.org/abs/2510.20721
Authors: Xiaoyuan Wu, Roshni Kaushik, Wenkai Li, Lujo Bauer, Koichi Onoue
Affiliations: Fujitsu Research of America Inc.; Carnegie Mellon University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Large language models (LLMs) have seen rapid adoption for tasks such as drafting emails, summarizing meetings, and answering health questions. In such uses, users may need to share private information (e.g., health records, contact details). To evaluate LLMs’ ability to identify and redact such private information, prior work developed benchmarks (e.g., ConfAIde, PrivacyLens) with real-life scenarios. Using these benchmarks, researchers have found that LLMs sometimes fail to keep secrets private when responding to complex tasks (e.g., leaking employee salaries in meeting summaries). However, these evaluations rely on LLMs (proxy LLMs) to gauge compliance with privacy norms, overlooking real users’ perceptions. Moreover, prior work primarily focused on the privacy-preservation quality of responses, without investigating nuanced differences in helpfulness. To understand how users perceive the privacy-preservation quality and helpfulness of LLM responses to privacy-sensitive scenarios, we conducted a user study with 94 participants using 90 scenarios from PrivacyLens. We found that, when evaluating identical responses to the same scenario, users showed low agreement with each other on the privacy-preservation quality and helpfulness of the LLM response. Further, we found high agreement among five proxy LLMs, while each individual LLM had low correlation with users’ evaluations. These results indicate that the privacy and helpfulness of LLM responses are often specific to individuals, and proxy LLMs are poor estimates of how real users would perceive these responses in privacy-sensitive scenarios. Our results suggest the need to conduct user-centered studies on measuring LLMs’ ability to help users while preserving privacy. Additionally, future research could investigate ways to improve the alignment between proxy LLMs and users for better estimation of users’ perceived privacy and utility.

[NLP-12] Structure-Conditional Minimum Bayes Risk Decoding EMNLP2025

【Quick Read】: This paper addresses the problem that in open-ended tasks such as dialogue or instruction following, Minimum Bayes Risk (MBR) decoding with standard similarity-based utility functions is insensitive to latent structural variation, so it selects responses that are broadly representative yet sub-optimal. The key is three lightweight adaptations of the utility function that make MBR more sensitive to latent structure in the outcome space (such as dialogue acts, emotion categories, and response structure). The improvements are validated with two newly proposed metrics of structural optimality, and on real instruction-following benchmarks (AlpacaEval and MT-Bench) they improve generation quality by up to 13.7 percentage points in win rate.

Link: https://arxiv.org/abs/2510.20700
Authors: Bryan Eikema, Anna Rutkiewicz, Mario Giulianelli
Affiliations: University of Amsterdam; University of Zurich; UCL
Subjects: Computation and Language (cs.CL)
Comments: EMNLP 2025 Camera-Ready

Abstract:Minimum Bayes Risk (MBR) decoding has seen renewed interest as an alternative to traditional generation strategies. While MBR has proven effective in machine translation, where the variability of a language model’s outcome space is naturally constrained, it may face challenges in more open-ended tasks such as dialogue or instruction-following. We hypothesise that in such settings, applying MBR with standard similarity-based utility functions may result in selecting responses that are broadly representative of the model’s distribution, yet sub-optimal with respect to any particular grouping of generations that share an underlying latent structure. In this work, we introduce three lightweight adaptations to the utility function, designed to make MBR more sensitive to structural variability in the outcome space. To test our hypothesis, we curate a dataset capturing three representative types of latent structure: dialogue act, emotion, and response structure (e.g., a sentence, a paragraph, or a list). We further propose two metrics to evaluate the structural optimality of MBR. Our analysis demonstrates that common similarity-based utility functions fall short by these metrics. In contrast, our proposed adaptations considerably improve structural optimality. Finally, we evaluate our approaches on real-world instruction-following benchmarks, AlpacaEval and MT-Bench, and show that increased structural sensitivity improves generation quality by up to 13.7 percentage points in win rate.
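For reference, vanilla MBR selection looks like the sketch below, with a toy Jaccard utility standing in for the learned metrics practical systems use.

```python
def mbr_select(candidates, utility):
    """Return the candidate with the highest expected utility, using the
    other samples as pseudo-references."""
    def expected(c):
        return sum(utility(c, r) for r in candidates if r is not c)
    return max(candidates, key=expected)

def jaccard(a, b):  # stand-in utility; practical systems use learned metrics
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

samples = [
    "Here is a list: 1. apples 2. pears",
    "Sure! Here is a list: 1. apples 2. pears 3. plums",
    "Apples and pears are both pome fruits.",
]
print(mbr_select(samples, jaccard))
```

A structure-conditional variant in the paper's spirit would modify the utility, for example by restricting or reweighting comparisons to candidates that share the same latent structure (dialogue act, emotion, or response format).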

[NLP-13] Neural Diversity Regularizes Hallucinations in Small Models

【Quick Read】: This paper addresses the persistence of hallucinations in language models despite growth in parameters, compute, and data. The key is to propose and validate neural diversity, i.e., decorrelated parallel representations, as a quantifiable and controllable mechanism for lowering hallucination probability. Inspired by portfolio theory, the authors prove that hallucination probability is bounded in terms of representational correlation and design ND-LoRA, which combines parallel LoRA adapters with Barlow Twins regularization, reducing hallucinations by up to 25.6% (14.6% on average) without degrading general accuracy. This establishes neural diversity as a third scaling axis, orthogonal to parameters and data, for improving language model reliability at fixed budgets.

Link: https://arxiv.org/abs/2510.20690
Authors: Kushal Chakrabarti, Nirmal Balachundhar
Affiliations: South Park Commons
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Language models continue to hallucinate despite increases in parameters, compute, and data. We propose neural diversity – decorrelated parallel representations – as a principled mechanism that reduces hallucination rates at fixed parameter and data budgets. Inspired by portfolio theory, where uncorrelated assets reduce risk by √P, we prove hallucination probability is bounded by representational correlation: P(H) ≤ f(σ²((1−ρ(P))/P + ρ(P)), μ²), which predicts that language models need an optimal amount of neurodiversity. To validate this, we introduce ND-LoRA (Neural Diversity Low-Rank Adaptation), combining parallel LoRA adapters with Barlow Twins regularization, and demonstrate that ND-LoRA reduces hallucinations by up to 25.6% (and 14.6% on average) without degrading general accuracy. Ablations show LoRA adapters and regularization act synergistically, causal interventions prove neurodiversity as the mediating factor and correlational analyses indicate scale: a 0.1% neural correlation increase is associated with a 3.8% hallucination increase. Finally, task-dependent optimality emerges: different tasks require different amounts of optimal neurodiversity. Together, our results highlight neural diversity as a third axis of scaling – orthogonal to parameters and data – to improve the reliability of language models at fixed budgets.
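Assuming the portfolio-style variance term from the bound, a few lines of Python show why adding parallel representations stops helping once the correlation floor dominates (the 1/P part vanishes but the ρ term remains):

```python
sigma2 = 1.0
for rho in (0.0, 0.05, 0.2):          # residual correlation between streams
    terms = []
    for P in (1, 2, 4, 8, 16):        # number of parallel representations
        var = sigma2 * ((1 - rho) / P + rho)
        terms.append(f"P={P}: {var:.3f}")
    print(f"rho={rho:.2f}  " + "  ".join(terms))
```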

[NLP-14] Analyticup E-commerce Product Search Competition Technical Report from Team Tredence_AICOE

【Quick Read】: This paper addresses relevance matching in multilingual e-commerce search, covering two tasks: Query-Category (QC) relevance and Query-Item (QI) relevance. The core challenge is accurate semantic matching across many languages, including those with scarce data. The key is data augmentation: translating existing datasets into the languages missing from the development set to achieve full language coverage, then fine-tuning Gemma-3 12B and Qwen-2.5 14B on original data, translated data, and generated minority-class samples. This markedly improves cross-lingual relevance modeling and earned 4th place in the competition with an average F1-score of 0.8857.

Link: https://arxiv.org/abs/2510.20674
Authors: Rakshith R, Shubham Sharma, Mohammed Sameer Khan, Ankush Chopra
Affiliations: Unknown
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:This study presents the multilingual e-commerce search system developed by the Tredence_AICOE team. The competition features two multilingual relevance tasks: Query-Category (QC) Relevance, which evaluates how well a user’s search query aligns with a product category, and Query-Item (QI) Relevance, which measures the match between a multilingual search query and an individual product listing. To ensure full language coverage, we performed data augmentation by translating existing datasets into languages missing from the development set, enabling training across all target languages. We fine-tuned Gemma-3 12B and Qwen-2.5 14B models for both tasks using multiple strategies. The Gemma-3 12B (4-bit) model achieved the best QC performance using original and translated data, and the best QI performance using original, translated, and minority class data creation. These approaches secured 4th place on the final leaderboard, with an average F1-score of 0.8857 on the private test set.

[NLP-15] CantoNLU: A benchmark for Cantonese natural language understanding

【Quick Read】: This paper addresses the scarcity of natural language understanding (NLU) resources for Cantonese, a consequence of policy and diglossia. To fill the gap in evaluation frameworks, the authors introduce CantoNLU, a benchmark spanning seven tasks covering syntax and semantics: word sense disambiguation, linguistic acceptability judgment, language detection, natural language inference, sentiment analysis, part-of-speech tagging, and dependency parsing. The key is a high-quality, diverse Cantonese NLU evaluation suite together with a systematic comparison of training strategies: a Mandarin model with no Cantonese training, Cantonese-adapted models obtained via continual pre-training, and a monolingual Cantonese model trained from scratch. Cantonese-adapted models perform best overall, monolingual models excel on syntactic tasks, and direct transfer of Mandarin models remains competitive when Cantonese domain data is scarce. All datasets, code, and model weights are released to advance Cantonese NLP.

Link: https://arxiv.org/abs/2510.20670
Authors: Junghyun Min, York Hay Ng, Sophia Chan, Helena Shunhua Zhao, En-Shiun Annie Lee
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 13 pages, 1 figure

Abstract:Cantonese, although spoken by millions, remains under-resourced due to policy and diglossia. To address this scarcity of evaluation frameworks for Cantonese, we introduce CantoNLU, a benchmark for Cantonese natural language understanding (NLU). This novel benchmark spans seven tasks covering syntax and semantics, including word sense disambiguation, linguistic acceptability judgment, language detection, natural language inference, sentiment analysis, part-of-speech tagging, and dependency parsing. In addition to the benchmark, we provide model baseline performance across a set of models: a Mandarin model without Cantonese training, two Cantonese-adapted models obtained by continual pre-training a Mandarin model on Cantonese text, and a monolingual Cantonese model trained from scratch. Results show that Cantonese-adapted models perform best overall, while monolingual models perform better on syntactic tasks. Mandarin models remain competitive in certain settings, indicating that direct transfer may be sufficient when Cantonese domain data is scarce. We release all datasets, code, and model weights to facilitate future research in Cantonese NLP.

[NLP-16] The Reasoning Lingua Franca: A Double-Edged Sword for Multilingual AI

【Quick Read】: This paper addresses the underexplored multilingual reasoning ability of large reasoning models (LRMs): when given non-English questions, models tend to default to reasoning in English, which may mishandle linguistic and cultural nuances and hurt interpretability and accuracy. The key is a systematic comparison of model reasoning in English versus the question's language on MGSM and GPQA Diamond, which shows that English reasoning yields higher final-answer accuracy, with the gap widening as tasks become more complex, but carries a notable "Lost in Translation" failure mode: translation steps introduce errors that reasoning directly in the question's language would have avoided.

Link: https://arxiv.org/abs/2510.20647
Authors: Alan Saji, Raj Dabre, Anoop Kunchukuttan, Ratish Puduppully
Affiliations: Nilekani Centre at AI4Bharat; Indian Institute of Technology Madras; Google; Microsoft; IT University of Copenhagen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 13 figures, 5 tables

Abstract:Large Reasoning Models (LRMs) achieve strong performance on mathematical, scientific, and other question-answering tasks, but their multilingual reasoning abilities remain underexplored. When presented with non-English questions, LRMs often default to reasoning in English, raising concerns about interpretability and the handling of linguistic and cultural nuances. We systematically compare an LRM’s reasoning in English versus the language of the question. Our evaluation spans two tasks: MGSM and GPQA Diamond. Beyond measuring answer accuracy, we also analyze cognitive attributes in the reasoning traces. We find that English reasoning traces exhibit a substantially higher presence of these cognitive behaviors, and that reasoning in English generally yields higher final-answer accuracy, with the performance gap increasing as tasks become more complex. However, this English-centric strategy is susceptible to a key failure mode - getting “Lost in Translation,” where translation steps lead to errors that would have been avoided by reasoning in the question’s language.

[NLP-17] Why Did Apple Fall To The Ground: Evaluating Curiosity In Large Language Model

【Quick Read】: This paper addresses how to assess whether large language models (LLMs) possess human-like curiosity-driven learning. The key is a comprehensive evaluation framework grounded in the human curiosity questionnaire, the Five-Dimensional Curiosity scale Revised (5DCR), covering dimensions such as Information Seeking, Thrill Seeking, and Social Curiosity, to systematically quantify LLMs' curious behavior across settings and to probe the relationship between curiosity and the models' reasoning and active-learning abilities, providing experimental support for future work on learning capabilities and innovation in LLMs.

Link: https://arxiv.org/abs/2510.20635
Authors: Haoyu Wang, Sihang Jiang, Yuyan Chen, Yitong Wang, Yanghua Xiao
Affiliations: Fudan University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Curiosity serves as a pivotal conduit for human beings to discover and learn new knowledge. Recent advancements of large language models (LLMs) in natural language processing have sparked discussions regarding whether these models possess capability of curiosity-driven learning akin to humans. In this paper, starting from the human curiosity assessment questionnaire Five-Dimensional Curiosity scale Revised (5DCR), we design a comprehensive evaluation framework that covers dimensions such as Information Seeking, Thrill Seeking, and Social Curiosity to assess the extent of curiosity exhibited by LLMs. The results demonstrate that LLMs exhibit a stronger thirst for knowledge than humans but still tend to make conservative choices when faced with uncertain environments. We further investigated the relationship between curiosity and thinking of LLMs, confirming that curious behaviors can enhance the model’s reasoning and active learning abilities. These findings suggest that LLMs have the potential to exhibit curiosity similar to that of humans, providing experimental support for the future development of learning capabilities and innovative research in LLMs.

[NLP-18] BUSTED at AraG enEval Shared Task: A Comparative Study of Transformer-Based Models for Arabic AI-Generated Text Detection

【Quick Read】: This paper addresses Arabic AI-generated text detection. The key is a comparison of three pre-trained transformer models, AraELECTRA, CAMeLBERT, and XLM-RoBERTa, each fine-tuned for binary classification. Surprisingly, although AraELECTRA and CAMeLBERT are Arabic-specific models, the multilingual XLM-RoBERTa achieved the best performance (F1 score 0.7701), suggesting that multilingual models generalize strongly on this task and challenging the intuition that specialized models beat general-purpose ones.

Link: https://arxiv.org/abs/2510.20610
Authors: Ali Zain, Sareem Farooqui, Muhammad Rafi
Affiliations: National University of Computer and Emerging Sciences, FAST; Karachi, Pakistan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper details our submission to the AraGenEval Shared Task on Arabic AI-generated text detection, where our team, BUSTED, secured 5th place. We investigated the effectiveness of three pre-trained transformer models: AraELECTRA, CAMeLBERT, and XLM-RoBERTa. Our approach involved fine-tuning each model on the provided dataset for a binary classification task. Our findings revealed a surprising result: the multilingual XLM-RoBERTa model achieved the highest performance with an F1 score of 0.7701, outperforming the specialized Arabic models. This work underscores the complexities of AI-generated text detection and highlights the strong generalization capabilities of multilingual models.

[NLP-19] What Defines Good Reasoning in LLMs? Dissecting Reasoning Steps with Multi-Aspect Evaluation

【Quick Read】: This paper addresses the coarse feedback of evaluating large language models (LLMs) only on final-answer correctness, which ignores the quality of the underlying reasoning process and limits model improvement. The solution decomposes reasoning quality into two dimensions, relevance (whether a step is grounded in the problem) and coherence (whether a step follows logically from prior steps), and introduces Causal Stepwise Evaluation (CaSE), which assesses each reasoning step using only its preceding context to avoid hindsight bias and thus measures reasoning reliably. Experiments show that curating training data with CaSE-evaluated relevance and coherence directly improves final task performance, demonstrating the framework's practical value for analyzing, debugging, and improving LLM reasoning.

Link: https://arxiv.org/abs/2510.20603
Authors: Heejin Do, Jaehui Hwang, Dongyoon Han, Seong Joon Oh, Sangdoo Yun
Affiliations: ETH Zürich; NAVER AI Lab; University of Tübingen
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Evaluating large language models (LLMs) on final-answer correctness is the dominant paradigm. This approach, however, provides a coarse signal for model improvement and overlooks the quality of the underlying reasoning process. We argue that a more granular evaluation of reasoning offers a more effective path to building robust models. We decompose reasoning quality into two dimensions: relevance and coherence. Relevance measures if a step is grounded in the problem; coherence measures if it follows logically from prior steps. To measure these aspects reliably, we introduce causal stepwise evaluation (CaSE). This method assesses each reasoning step using only its preceding context, which avoids hindsight bias. We validate CaSE against human judgments on our new expert-annotated benchmarks, MRa-GSM8K and MRa-MATH. More importantly, we show that curating training data with CaSE-evaluated relevance and coherence directly improves final task performance. Our work provides a scalable framework for analyzing, debugging, and improving LLM reasoning, demonstrating the practical value of moving beyond validity checks.

[NLP-20] Can ChatGPT Code Communication Data Fairly?: Empirical Evidence from Multiple Collaborative Tasks

【Quick Read】: This paper investigates whether generative AI exhibits gender or racial bias when used to automate the coding of collaborative communication data. The key is to use ChatGPT, guided by a standard coding framework for collaborative problem solving, to code data from three types of collaborative tasks (negotiation, problem solving, and decision making) and to systematically test for coding differences across gender and racial groups. The results show no significant bias across groups, paving a reliable technical path for large-scale assessment of collaboration and communication skills.

Link: https://arxiv.org/abs/2510.20584
Authors: Jiangang Hao, Wenju Cui, Patrick Kyllonen, Emily Kerzabi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 38 pages, 4 figures

Abstract:Assessing communication and collaboration at scale depends on a labor intensive task of coding communication data into categories according to different frameworks. Prior research has established that ChatGPT can be directly instructed with coding rubrics to code the communication data and achieves accuracy comparable to human raters. However, whether the coding from ChatGPT or similar AI technology exhibits bias against different demographic groups, such as gender and race, remains unclear. To fill this gap, this paper investigates ChatGPT-based automated coding of communication data using a typical coding framework for collaborative problem solving, examining differences across gender and racial groups. The analysis draws on data from three types of collaborative tasks: negotiation, problem solving, and decision making. Our results show that ChatGPT-based coding exhibits no significant bias across gender and racial groups, paving the road for its adoption in large-scale assessment of collaboration and communication.

[NLP-21] Beyond Retrieval-Ranking: A Multi-Agent Cognitive Decision Framework for E-Commerce Search

【Quick Read】: This paper addresses the limitations of the traditional retrieval-ranking paradigm in e-commerce search: its reliance on direct query-item matching misaligns with users' multi-stage cognitive decision processes, producing semantic gaps on complex queries, high decision costs from cross-platform information foraging, and a lack of professional shopping guidance. The key is a Multi-Agent Cognitive Decision Framework (MACDF) that shifts search from passive retrieval to proactive decision support, accurately handling complex queries involving negation, multiple constraints, or reasoning. Offline evaluations and online A/B tests on the JD search platform confirm significant gains in recommendation accuracy and user satisfaction.

Link: https://arxiv.org/abs/2510.20567
Authors: Zhouwei Zhai, Mengxiang Chen, Haoyun Xia, Jin Li, Renquan Zhou, Min Yang
Affiliations: JD.com
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The retrieval-ranking paradigm has long dominated e-commerce search, but its reliance on query-item matching fundamentally misaligns with multi-stage cognitive decision processes of platform users. This misalignment introduces critical limitations: semantic gaps in complex queries, high decision costs due to cross-platform information foraging, and the absence of professional shopping guidance. To address these issues, we propose a Multi-Agent Cognitive Decision Framework (MACDF), which shifts the paradigm from passive retrieval to proactive decision support. Extensive offline evaluations demonstrate MACDF’s significant improvements in recommendation accuracy and user satisfaction, particularly for complex queries involving negation, multi-constraint, or reasoning demands. Online A/B testing on JD search platform confirms its practical efficacy. This work highlights the transformative potential of multi-agent cognitive systems in redefining e-commerce search.

[NLP-22] GlobalRAG: Enhancing Global Reasoning in Multi-hop Question Answering via Reinforcement Learning

【Quick Read】: This paper addresses the limited effectiveness of reinforcement learning in multi-hop question answering (QA), whose core bottlenecks are the absence of global planning to structure multi-step reasoning and unfaithful execution that misaligns query formulation with the use of retrieved evidence. The key is GlobalRAG, a framework that decomposes questions into subgoals, coordinates retrieval with reasoning, and iteratively refines evidence. It introduces a Planning Quality Reward and a SubGoal Completion Reward to encourage coherent planning and reliable subgoal execution, plus a progressive weight annealing strategy to balance process-oriented and outcome-based objectives, yielding average improvements of 14.2% in both EM and F1 while using only 42% of the training data of strong baselines.

Link: https://arxiv.org/abs/2510.20548
Authors: Jinchang Luo, Mingquan Cheng, Fan Wan, Ni Li, Xiaoling Xia, Shuangshuang Tian, Tingcheng Bian, Haiwei Wang, Haohuan Fu, Yan Tao
Affiliations: Baidu Inc.; Tsinghua Shenzhen International Graduate School, Tsinghua University; National Supercomputing Center in Shenzhen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures, 4 tables

Abstract:Reinforcement learning has recently shown promise in improving retrieval-augmented generation (RAG). Despite these advances, its effectiveness in multi-hop question answering (QA) remains limited by two fundamental limitations: (i) global planning absence to structure multi-step reasoning, and (ii) unfaithful execution, which hinders effective query formulation and consistent use of retrieved evidence. We propose GlobalRAG, a reinforcement learning framework designed to enhance global reasoning in multi-hop QA. GlobalRAG decomposes questions into subgoals, coordinates retrieval with reasoning, and refines evidence iteratively. To guide this process, we introduce Planning Quality Reward and SubGoal Completion Reward, which encourage coherent planning and reliable subgoal execution. In addition, a progressive weight annealing strategy balances process-oriented and outcome-based objectives. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that GlobalRAG significantly outperforms strong baselines while using only 8k training data (42% of the training data used by strong baselines), achieving average improvements of 14.2% in both EM and F1.
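One plausible reading of the reward mixing is sketched below; the cosine schedule and the equal weighting of the two process rewards are assumptions for illustration, not the paper's exact formula:

```python
import math

def annealed_reward(plan_r, subgoal_r, outcome_r, step, total_steps):
    """Blend process rewards (planning quality + subgoal completion) with
    the outcome reward, shifting weight toward the outcome over training."""
    w = 0.5 * (1 + math.cos(math.pi * step / total_steps))  # anneals 1 -> 0
    process = 0.5 * plan_r + 0.5 * subgoal_r
    return w * process + (1 - w) * outcome_r

for step in (0, 250, 500, 750, 1000):
    print(step, round(annealed_reward(0.8, 0.6, 1.0, step, 1000), 3))
```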

[NLP-23] The Dog the Cat Chased Stumped the Model: Measuring When Language Models Abandon Structure for Shortcuts

【Quick Read】: This paper addresses the problem of distinguishing whether language models understand complex syntactic structure through genuine syntactic analysis or mere semantic pattern matching, something existing benchmarks cannot tell apart. The key is CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences, each paired with a syntactically identical but semantically implausible counterpart, with six comprehension question types covering surface understanding, syntactic dependencies, and causal reasoning. Results show the performance gap between plausible and implausible sentences widens systematically with nesting complexity, by up to 26.8 percentage points, quantifying when models abandon structural analysis for semantic association. Notably, semantic plausibility hurts performance on questions about resulting actions, where following causal relationships matters more than semantic coherence. The framework is the first to identify when models shift from structural analysis to pattern matching.

Link: https://arxiv.org/abs/2510.20543
Authors: Sangmitra Madhusudan, Kaige Chen, Ali Emami
Affiliations: Brock University; Emory University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:When language models correctly parse “The cat that the dog chased meowed,” are they analyzing syntax or simply familiar with dogs chasing cats? Despite extensive benchmarking, we lack methods to distinguish structural understanding from semantic pattern matching. We introduce CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences (like “The cat [that the dog chased] meowed”) where relative clauses nest recursively, creating processing demands from simple to deeply nested structures. Each sentence has a syntactically identical but semantically implausible counterpart (e.g., mailmen prescribe medicine, doctors deliver mail) and six comprehension questions testing surface understanding, syntactic dependencies, and causal reasoning. Testing six models reveals that performance gaps between plausible and implausible sentences widen systematically with complexity, with models showing median gaps up to 26.8 percentage points, quantifying when they abandon structural analysis for semantic associations. Notably, semantic plausibility harms performance on questions about resulting actions, where following causal relationships matters more than semantic coherence. Reasoning models improve accuracy but their traces show semantic shortcuts, overthinking, and answer refusal. Unlike models whose plausibility advantage systematically widens with complexity, humans show variable semantic effects. CenterBench provides the first framework to identify when models shift from structural analysis to pattern matching.
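Center-embedded stimuli of increasing depth are easy to generate programmatically; this sketch uses made-up word lists rather than the dataset's:

```python
def center_embed(nouns, verbs, main_verb, depth):
    """Build center-embedded sentences such as
    'the cat that the dog chased meowed' (depth 1)."""
    head = " that ".join(nouns[: depth + 1])   # the cat that the dog ...
    tail = " ".join(reversed(verbs[:depth]))   # innermost verb surfaces first
    return " ".join(part for part in (head, tail, main_verb) if part)

nouns = ["the cat", "the dog", "the boy"]
verbs = ["chased", "saw"]
for d in range(3):
    print(d, center_embed(nouns, verbs, "meowed", d))
# 0 the cat meowed
# 1 the cat that the dog chased meowed
# 2 the cat that the dog that the boy saw chased meowed
```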

[NLP-24] ARC-Encoder: learning compressed text representations for large language models

【Quick Read】: This paper addresses the rising inference cost of large language models (LLMs) as techniques such as retrieval-augmented generation (RAG) and chain-of-thought reasoning lengthen contexts. Existing context compression methods typically require fine-tuning the target model or even modifying its architecture, which can degrade its general abilities. The alternative explored here is an encoder that compresses the input text into continuous representations that replace token embeddings in the decoder LLM. The core contribution is the Adaptable text Representations Compressor (ARC-Encoder), which outputs roughly 4-8x fewer continuous vectors than text tokens while preserving downstream performance, supports use across different decoder LLMs, and thus enables efficient, flexible context compression that can be deployed without retraining the decoder.

Link: https://arxiv.org/abs/2510.20535
Authors: Hippolyte Pilchen, Edouard Grave, Patrick Pérez
Affiliations: Kyutai
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent techniques such as retrieval-augmented generation or chain-of-thought reasoning have led to longer contexts and increased inference costs. Context compression techniques can reduce these costs, but the most effective approaches require fine-tuning the target model or even modifying its architecture. This can degrade its general abilities when not used for this specific purpose. Here we explore an alternative approach: an encoder that compresses the context into continuous representations which replace token embeddings in decoder LLMs. First, we perform a systematic study of training strategies and architecture choices for the encoder. Our findings led to the design of an Adaptable text Representations Compressor, named ARC-Encoder, which outputs x-times fewer continuous representations (typically x ∈ {4, 8}) than text tokens. We evaluate ARC-Encoder across a variety of LLM usage scenarios, ranging from in-context learning to context window extension, on both instruct and base decoders. Results show that ARC-Encoder achieves state-of-the-art performance on several benchmarks while improving computational efficiency at inference. Finally, we demonstrate that our models can be adapted to multiple decoders simultaneously, allowing a single encoder to generalize across different decoder LLMs. This makes ARC-Encoder a flexible and efficient solution for portable encoders that work seamlessly with multiple LLMs. We release a training code at this https URL , fine-tuning dataset and pretrained models are available at this https URL .

[NLP-25] Decoding the Ear: A Framework for Objectifying Expressiveness from Human Preference Through Efficient Alignment ICASSP2026

【Quick Read】: This paper addresses the lack of natural expressiveness in the speech generated by current speech-to-speech (S2S) models and the absence of a reliable evaluation metric: existing approaches such as subjective MOS ratings, low-level acoustic features, or emotion recognition are costly, limited, or incomplete. The key is DeEAR (Decoding the Expressive Preference of eAR), a framework grounded in phonetics and psychology that converts human preference for speech expressiveness into an objective score along three dimensions: Emotion, Prosody, and Spontaneity. It aligns strongly with human perception (Spearman's rank correlation coefficient, SRCC = 0.86) using fewer than 500 annotated samples, and supports fair benchmarking and targeted data curation, selecting 14K expressive utterances to form ExpressiveSpeech, which raises S2S models' expressive score from 2.0 to 23.4 on a 100-point scale.

Link: https://arxiv.org/abs/2510.20513
Authors: Zhiyu Lin, Jingwen Yang, Jiale Zhao, Meng Liu, Sunzhu Li, Benyou Wang
Affiliations: Unknown
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Submitted to ICASSP 2026. Demos and codes are available at this https URL

Abstract:Recent speech-to-speech (S2S) models generate intelligible speech but still lack natural expressiveness, largely due to the absence of a reliable evaluation metric. Existing approaches, such as subjective MOS ratings, low-level acoustic features, and emotion recognition are costly, limited, or incomplete. To address this, we present DeEAR (Decoding the Expressive Preference of eAR), a framework that converts human preference for speech expressiveness into an objective score. Grounded in phonetics and psychology, DeEAR evaluates speech across three dimensions: Emotion, Prosody, and Spontaneity, achieving strong alignment with human perception (Spearman’s Rank Correlation Coefficient, SRCC = 0.86) using fewer than 500 annotated samples. Beyond reliable scoring, DeEAR enables fair benchmarking and targeted data curation. It not only distinguishes expressiveness gaps across S2S models but also selects 14K expressive utterances to form ExpressiveSpeech, which improves the expressive score (from 2.0 to 23.4 on a 100-point scale) of S2S models. Demos and codes are available at this https URL

[NLP-26] Assessing the Political Fairness of Multilingual LLMs: A Case Study based on a 21-way Multiparallel EuroParl Dataset

【Quick Read】: This paper addresses the evaluation of political bias in large language models (LLMs): the conventional approach of simulating answers to English surveys is culturally and contextually limited. The authors instead propose a framing based on fairness in multilingual translation, detecting bias by systematically comparing the translation quality of European Parliament (EP) speeches across party affiliations. The key is a new 21-way multiparallel version of EuroParl annotated with each speaker's political affiliation, covering three years, 40M words, and 249M characters, which reveals systematic translation-quality differences between majority parties (left, center, and right) and outsider parties, exposing latent political leanings in cross-lingual processing.

Link: https://arxiv.org/abs/2510.20508
Authors: Paul Lerner, François Yvon
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:The political biases of Large Language Models (LLMs) are usually assessed by simulating their answers to English surveys. In this work, we propose an alternative framing of political biases, relying on principles of fairness in multilingual translation. We systematically compare the translation quality of speeches in the European Parliament (EP), observing systematic differences with majority parties from left, center, and right being better translated than outsider parties. This study is made possible by a new, 21-way multiparallel version of EuroParl, the parliamentary proceedings of the EP, which includes the political affiliations of each speaker. The dataset consists of 1.5M sentences for a total of 40M words and 249M characters. It covers three years, 1000+ speakers, 7 countries, 12 EU parties, 25 EU committees, and hundreds of national parties.

[NLP-27] Hierarchical Sequence Iteration for Heterogeneous Question Answering

【Quick Read】: This paper addresses the brittleness of retrieval-augmented generation (RAG) on multi-step questions and heterogeneous evidence sources (text, tables, knowledge graphs), which forces trade-offs between accuracy and latency, token, or tool budgets. The core is a unified Hierarchical Sequence (HSEQ) iteration framework: documents, tables, and knowledge graphs are linearized into reversible hierarchical sequences with lightweight structural tags, and structure-aware iteration collects just enough evidence before answer synthesis. A Head Agent guides retrieval, an Iteration Agent selects and expands the HSeq through structure-respecting actions (parent/child hops, table row/column neighbors, KG relation traversal), and the Head Agent finally canonicalizes the evidence to generate the answer, with an optional contradiction-resolution loop. The method delivers consistent EM/F1 gains with high efficiency on HotpotQA, HybridQA/TAT-QA, and MetaQA.

Link: https://arxiv.org/abs/2510.20505
Authors: Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim
Affiliations: University of New South Wales; Mohamed Bin Zayed University of Artificial Intelligence; Technology Innovation Institute
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 22 pages, 3 figures

Abstract:Retrieval-augmented generation (RAG) remains brittle on multi-step questions and heterogeneous evidence sources, trading accuracy against latency and token/tool budgets. This paper introduces Hierarchical Sequence (HSEQ) Iteration for Heterogeneous Question Answering, a unified framework that (i) linearizes documents, tables, and knowledge graphs into a reversible hierarchical sequence with lightweight structural tags, and (ii) performs structure-aware iteration to collect just-enough evidence before answer synthesis. A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); finally, the Head Agent composes canonicalized evidence to generate the final answer, with an optional refinement loop to resolve detected contradictions. Experiments on HotpotQA (text), HybridQA/TAT-QA (table+text), and MetaQA (KG) show consistent EM/F1 gains over strong single-pass, multi-hop, and agentic RAG baselines with high efficiency. Moreover, HSEQ exhibits three key advantages: (1) a format-agnostic unification that enables a single policy to operate across text, tables, and KGs without per-dataset specialization; (2) guided, budget-aware iteration that reduces unnecessary hops, tool calls, and tokens while preserving accuracy; and (3) evidence canonicalization for reliable QA, improving answer consistency and auditability.
zh
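
To make the linearization idea concrete, here is a toy Python sketch that flattens a small table into tagged, reversible units and expands a cell to its row/column neighbors. The tag format, sample table, and expansion rule are invented for illustration and are not the paper's implementation.

```python
# Toy sketch of HSEQ-style linearization: a table becomes tagged cell units
# whose tags make the structure reversible and let an agent expand to
# row/column neighbors without re-parsing the table.

table = {
    "header": ["city", "population"],
    "rows": [["Paris", "2.1M"], ["Rome", "2.8M"]],
}

def linearize_table(table):
    """Flatten a table into (tag, text) units; tags encode row/column indices."""
    units = []
    for c, name in enumerate(table["header"]):
        units.append((f"[HEAD c={c}]", name))
    for r, row in enumerate(table["rows"]):
        for c, cell in enumerate(row):
            units.append((f"[CELL r={r} c={c}]", cell))
    return units

def neighbors(units, r, c):
    """Structure-respecting expansion: cells sharing a row or a column with (r, c)."""
    out = []
    for tag, text in units:
        if tag.startswith("[CELL"):
            rr = int(tag.split("r=")[1].split()[0])
            cc = int(tag.split("c=")[1].rstrip("]"))
            if (rr == r) != (cc == c):  # same row xor same column
                out.append((tag, text))
    return out

units = linearize_table(table)
print(units)                   # indices in tags make the grid recoverable
print(neighbors(units, 0, 0))  # row/column neighbors of cell (0, 0)
```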

[NLP-28] Robust Preference Alignment via Directional Neighborhood Consensus ICLR2026

【Quick Read】: This paper addresses the preference coverage gap that arises when aligning large language models (LLMs) with human preferences: models perform well on common requests but degrade markedly on specific, individual preferences away from the mainstream. The key is Robust Preference Selection (RPS), a post-hoc, training-free method based on directional neighborhood consensus: instead of forcing a response from a single, highly specific preference, it samples multiple candidate responses from a local neighborhood of related preferences in preference space and selects the one that best aligns with the user's original intent. The method is provably superior to a strong baseline that also samples multiple candidates from the single high-specificity preference, and it yields consistent gains across three alignment paradigms (DPA, DPO, and SFT), with win rates of up to 69% on under-represented preference regions.

Link: https://arxiv.org/abs/2510.20498
Authors: Ruochen Mao, Yuling Shi, Xiaodong Gu, Jiaheng Wei
Institutions: The Hong Kong University of Science and Technology (Guangzhou); Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL)
Notes: Under review at ICLR 2026. 10 pages, 5 figures. Code and data available at this https URL

Abstract:Aligning large language models with human preferences is critical for creating reliable and controllable AI systems. A human preference can be visualized as a high-dimensional vector where different directions represent trade-offs between desired attributes (e.g., helpfulness vs. verbosity). Yet, because the training data often reflects dominant, average preferences, LLMs tend to perform well on common requests but fall short in specific, individual needs. This mismatch creates a preference coverage gap. Existing methods often address this through costly retraining, which may not be generalized to the full spectrum of diverse preferences. This brittleness means that when a user’s request reflects a nuanced preference deviating from the training data’s central tendency, model performance can degrade unpredictably. To address this challenge, we introduce Robust Preference Selection (RPS), a post-hoc, training-free method by leveraging directional neighborhood consensus. Instead of forcing a model to generate a response from a single, highly specific preference, RPS samples multiple responses from a local neighborhood of related preferences to create a superior candidate pool. It then selects the response that best aligns with the user’s original intent. We provide a theoretical framework showing our neighborhood generation strategy is provably superior to a strong baseline that also samples multiple candidates. Comprehensive experiments across three distinct alignment paradigms (DPA, DPO, and SFT) demonstrate that RPS consistently improves robustness against this baseline, achieving win rates of up to 69% on challenging preferences from under-represented regions of the space without any model retraining. Our work presents a practical, theoretically-grounded solution for enhancing the reliability of preference-aligned models.
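
The neighborhood-sampling idea can be sketched in a few lines. Below, `generate` and `score` are hypothetical stand-ins for a preference-conditioned LLM and an alignment scorer, so this is only a schematic of RPS, not the released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(pref):                # placeholder "LLM": output depends on the preference
    return pref + rng.normal(0, 0.05, size=pref.shape)

def score(response, user_pref):    # placeholder alignment score with the user's intent
    return -np.linalg.norm(response - user_pref)

def rps(user_pref, n_neighbors=8, radius=0.1):
    """Sample responses from a local neighborhood of related preferences,
    then select the one best aligned with the original preference."""
    candidates = [generate(user_pref)]      # include the original preference too
    for _ in range(n_neighbors):
        direction = rng.normal(size=user_pref.shape)
        direction /= np.linalg.norm(direction)
        candidates.append(generate(user_pref + radius * direction))
    return max(candidates, key=lambda r: score(r, user_pref))

user_pref = np.array([0.9, 0.1, 0.5])  # e.g., a helpfulness vs. verbosity trade-off
best = rps(user_pref)
print(score(best, user_pref))
```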

[NLP-29] Steering Evaluation-Aware Language Models To Act Like They Are Deployed

【Quick Read】: This paper tackles the problem that large language models (LLMs) may adjust their behavior during safety evaluations because they are evaluation-aware, which misleads evaluation results and undermines the reliability of safety testing. The key of the solution is to intervene on the model's activations with a steering vector that suppresses evaluation awareness, making the model behave as if deployed even when evaluation cues are present. Notably, the steering vector is constructed from the original model before any additional training, requiring no extra fine-tuning, thereby improving the realism and trustworthiness of safety evaluations.

Link: https://arxiv.org/abs/2510.20487
Authors: Tim Tian Hua, Andrew Qin, Samuel Marks, Neel Nanda
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Abstract:Large language models (LLMs) can sometimes detect when they are being evaluated and adjust their behavior to appear more aligned, compromising the reliability of safety evaluations. In this paper, we show that adding a steering vector to an LLM’s activations can suppress evaluation-awareness and make the model act like it is deployed during evaluation. To study our steering technique, we train an LLM to exhibit evaluation-aware behavior using a two-step training process designed to mimic how this behavior could emerge naturally. First, we perform continued pretraining on documents with factual descriptions of the model (1) using Python type hints during evaluation but not during deployment and (2) recognizing that the presence of a certain evaluation cue always means that it is being tested. Then, we train the model with expert iteration to use Python type hints in evaluation settings. The resulting model is evaluation-aware: it writes type hints in evaluation contexts more than deployment contexts. However, this gap can only be observed by removing the evaluation cue. We find that activation steering can suppress evaluation awareness and make the model act like it is deployed even when the cue is present. Importantly, we constructed our steering vector using the original model before our additional training. Our results suggest that AI evaluators could improve the reliability of safety evaluations by steering models to act like they are deployed.
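
Activation steering itself is a standard technique; the sketch below adds a fixed vector to a hidden layer of a toy torch model via a forward hook. The model, layer choice, and random vector are illustrative assumptions; the paper derives its vector from deployment-versus-evaluation activations of the original model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# In practice the vector would come from contrasting "deployment" and
# "evaluation" activations; here it is random for illustration only.
steering_vector = torch.randn(32) * 0.5

def steering_hook(module, inputs, output):
    # Shift the layer's activations along the steering direction.
    return output + steering_vector

handle = model[0].register_forward_hook(steering_hook)
x = torch.randn(2, 16)
steered_logits = model(x)       # forward pass with steering applied
handle.remove()
unsteered_logits = model(x)     # same input without the intervention
print(torch.norm(steered_logits - unsteered_logits))
```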

[NLP-30] RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via Hierarchical Model Merging

【Quick Read】: This paper addresses catastrophic forgetting in continual learning for large language models (LLMs) when historical data is unavailable. The key of the proposed RECALL framework is to compute inter-model similarity from layer-wise hidden representations over clustered typical samples, and to perform adaptive, hierarchical parameter fusion that aligns knowledge across models. This design preserves domain-general features in shallow layers while allowing task-specific adaptation in deeper layers, enabling seamless multi-domain integration and strong knowledge retention without task labels or access to historical data.

Link: https://arxiv.org/abs/2510.20479
Authors: Bowen Wang, Haiyuan Wan, Liwen Shi, Chen Yang, Peng He, Yue Ma, Haochen Han, Wenhao Li, Tiao Tan, Yongjian Li, Fangming Liu, Yifan Gong, Sheng Zhang
Institutions: Shenzhen International Graduate School, Tsinghua University; Peng Cheng Laboratory; Huazhong University of Science and Technology; Xiamen University; The Hong Kong University of Science and Technology, Guangzhou; School of Biomedical Engineering, Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Abstract:We unveil that internal representations in large language models (LLMs) serve as reliable proxies of learned knowledge, and propose RECALL, a novel representation-aware model merging framework for continual learning without access to historical data. RECALL computes inter-model similarity from layer-wise hidden representations over clustered typical samples, and performs adaptive, hierarchical parameter fusion to align knowledge across models. This design enables the preservation of domain-general features in shallow layers while allowing task-specific adaptation in deeper layers. Unlike prior methods that require task labels or incur performance trade-offs, RECALL achieves seamless multi-domain integration and strong resistance to catastrophic forgetting. Extensive experiments across five NLP tasks and multiple continual learning scenarios show that RECALL outperforms baselines in both knowledge retention and generalization, providing a scalable and data-free solution for evolving LLMs.
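
A rough sketch of the representation-aware merging idea follows; the similarity measure and per-layer weighting rule below are simplified assumptions, and the paper's exact fusion scheme may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_samples, dim = 4, 32, 8

# Hypothetical per-layer hidden states of two models on the same typical samples;
# deeper layers are made to diverge more, mimicking task-specific adaptation.
hidden_a = [rng.normal(size=(n_samples, dim)) for _ in range(n_layers)]
hidden_b = [h + rng.normal(scale=0.1 * (l + 1), size=h.shape)
            for l, h in enumerate(hidden_a)]

def layer_similarity(ha, hb):
    """Mean cosine similarity between corresponding sample representations."""
    num = (ha * hb).sum(axis=1)
    den = np.linalg.norm(ha, axis=1) * np.linalg.norm(hb, axis=1)
    return float(np.mean(num / den))

# Per-layer fusion: similar (typically shallow) layers are averaged evenly,
# while dissimilar (deeper) layers keep more of the task-specific weights.
params_a = [rng.normal(size=(dim, dim)) for _ in range(n_layers)]
params_b = [rng.normal(size=(dim, dim)) for _ in range(n_layers)]
merged = []
for l in range(n_layers):
    sim = layer_similarity(hidden_a[l], hidden_b[l])
    alpha = 0.5 * max(sim, 0.0)   # even averaging only when representations agree
    merged.append(alpha * params_a[l] + (1 - alpha) * params_b[l])
    print(f"layer {l}: similarity={sim:.3f}, weight on model A={alpha:.3f}")
```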

[NLP-31] Mask and You Shall Receive: Optimizing Masked Language Modeling For Pretraining BabyLMs

【Quick Read】: This paper aims to improve language model performance on natural-language-understanding tasks by addressing a limitation of standard masked language modeling (MLM): all masked tokens are weighted identically during training, ignoring differences in how well the model can predict them, which limits learning efficiency and generalization. The key of the solution is an improved MLM strategy that adapts each masked token's probability weight to the model's ability to predict it, giving harder-to-predict tokens more weight, together with sub-token embeddings that noticeably improve morphological generalization. Experiments show substantial gains over standard MLM on (Super)GLUE, and the submission beats the baseline in the strict-small track.

Link: https://arxiv.org/abs/2510.20475
Authors: Lukas Edman, Alexander Fraser
Institutions: TU Munich; Munich Center for Machine Learning; Munich Data Science Institute
Categories: Computation and Language (cs.CL)
Notes: Submission to the 2025 BabyLM Challenge

Abstract:We describe our strategy for the 2025 edition of the BabyLM Challenge. Our main contribution is that of an improved form of Masked Language Modeling (MLM), which adapts the probabilities of the tokens masked according to the model’s ability to predict them. The results show a substantial increase in performance on (Super)GLUE tasks over the standard MLM. We also incorporate sub-token embeddings, finding that this increases the model’s morphological generalization capabilities. Our submission beats the baseline in the strict-small track.
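
One plausible reading of the adaptive objective is a loss that up-weights masked tokens the model predicts poorly; the sketch below implements that reading and should not be taken as the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, n_masked = 100, 6
logits = torch.randn(n_masked, vocab)           # model outputs at masked positions
targets = torch.randint(0, vocab, (n_masked,))  # gold tokens

with torch.no_grad():
    # Probability the model assigns to each gold token.
    p_correct = F.softmax(logits, dim=-1).gather(1, targets[:, None]).squeeze(1)
    weights = 1.0 - p_correct                   # hard-to-predict tokens weigh more
    weights = weights / weights.sum() * n_masked  # normalize to keep the loss scale

per_token_loss = F.cross_entropy(logits, targets, reduction="none")
weighted_loss = (weights * per_token_loss).mean()
print(weighted_loss)
```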

[NLP-32] Systematic Evaluation of Uncertainty Estimation Methods in Large Language Models

【Quick Read】: This paper addresses the mismatch between uncertainty and correctness in large language model (LLM) outputs, with the goal of improving their practical reliability. The key of the solution is a systematic evaluation of four uncertainty estimation methods (VCE, MSP, Sample Consistency, and CoCoA) on four question-answering tasks, finding that each metric captures a different facet of model confidence and that the hybrid CoCoA approach performs best overall, improving both calibration and the discrimination of correct answers.

Link: https://arxiv.org/abs/2510.20460
Authors: Christian Hobelsberger, Theresa Winner, Andreas Nawroth, Oliver Mitevski, Anna-Carolina Haensch
Institutions: LMU Munich; Munich Re; relAI; MCML; University of Maryland
Categories: Computation and Language (cs.CL); Applications (stat.AP); Methodology (stat.ME)
Notes:

Abstract:Large language models (LLMs) produce outputs with varying levels of uncertainty, and, just as often, varying levels of correctness; making their practical reliability far from guaranteed. To quantify this uncertainty, we systematically evaluate four approaches for confidence estimation in LLM outputs: VCE, MSP, Sample Consistency, and CoCoA (Vashurin et al., 2025). For the evaluation of the approaches, we conduct experiments on four question-answering tasks using a state-of-the-art open-source LLM. Our results show that each uncertainty metric captures a different facet of model confidence and that the hybrid CoCoA approach yields the best reliability overall, improving both calibration and discrimination of correct answers. We discuss the trade-offs of each method and provide recommendations for selecting uncertainty measures in LLM applications.
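
Two of the compared measures are easy to state concretely. The snippet below follows their standard definitions (MSP-style confidence from token log-probabilities, sample consistency from majority agreement), which may differ in detail from the paper's exact setup.

```python
import math
from collections import Counter

def msp_confidence(token_logprobs):
    """Confidence of a single generation: product of its token probabilities."""
    return math.exp(sum(token_logprobs))

def sample_consistency(sampled_answers):
    """Majority answer and the fraction of samples agreeing with it."""
    counts = Counter(sampled_answers)
    answer, freq = counts.most_common(1)[0]
    return answer, freq / len(sampled_answers)

print(msp_confidence([-0.1, -0.05, -0.2]))                       # one answer's confidence
print(sample_consistency(["Paris", "Paris", "Lyon", "Paris"]))   # vote + agreement rate
```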

[NLP-33] LM-mixup: Text Data Augmentation via Language Model based Mixup

【Quick Read】: This paper addresses the tension in LLM instruction tuning between scarce high-quality instruction data and abundant low-quality data that is usually discarded: throwing away semantically redundant or ill-formed samples causes severe information loss, and existing augmentation methods cannot exploit them effectively. The key of the solution is to formally define the task of Instruction Distillation and build MIXTURE, a 144K-sample dataset pairing clusters of low-quality, redundant inputs with high-quality, coherent instruction-output distillations; LM-Mixup is then trained with supervised fine-tuning on MIXTURE followed by reinforcement learning via Group Relative Policy Optimization (GRPO) with three complementary reward signals: quality, semantic alignment, and format compliance. Fine-tuning on the distilled data, only about 3% of the full dataset, not only surpasses full-dataset training but also competes with state-of-the-art high-quality data selection methods, demonstrating that properly distilled low-quality data is a valuable resource that improves the efficiency and performance of instruction-tuned LLMs.

Link: https://arxiv.org/abs/2510.20449
Authors: Zhijie Deng, Zhouan Shen, Ling Li, Yao Zhou, Zhaowei Zhu, Yanji He, Wei Wang, Jiaheng Wei
Institutions: The Hong Kong University of Science and Technology (Guangzhou); BIAI, ZJUT & D5Data.ai
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Instruction tuning is crucial for aligning Large Language Models (LLMs), yet the quality of instruction-following data varies significantly. While high-quality data is paramount, it is often scarce; conversely, abundant low-quality data is frequently discarded, leading to substantial information loss. Existing data augmentation methods struggle to augment this low-quality data effectively, and the evaluation of such techniques remains poorly defined. To address this, we formally define the task of Instruction Distillation: distilling multiple low-quality and redundant inputs into high-quality and coherent instruction-output pairs. Specifically, we introduce a comprehensive data construction pipeline to create MIXTURE, a 144K-sample dataset pairing low-quality or semantically redundant imperfect instruction clusters with their high-quality distillations. We then introduce LM-Mixup, by first performing supervised fine-tuning on MIXTURE and then optimizing it with reinforcement learning. This process uses three complementary reward signals: quality, semantic alignment, and format compliance, via Group Relative Policy Optimization (GRPO). We demonstrate that LM-Mixup effectively augments imperfect datasets: fine-tuning LLMs on its distilled data, which accounts for only about 3% of the entire dataset, not only surpasses full-dataset training but also competes with state-of-the-art high-quality data selection methods across multiple benchmarks. Our work establishes that low-quality data is a valuable resource when properly distilled and augmented with LM-Mixup, significantly enhancing the efficiency and performance of instruction-tuned LLMs.

[NLP-34] acher Demonstrations in a BabyLMs Zone of Proximal Development for Contingent Multi-Turn Interaction EMNLP2025

【Quick Read】: This paper addresses the lack of contingency, i.e., prompt, direct, and meaningful exchanges between interlocutors, in multi-turn dialogue with baby language models (BabyLMs). The key of the proposed ContingentChat teacher-student framework is post-training on a novel alignment dataset, which makes the BabyLM's responses more grammatical and cohesive; experiments with adaptive teacher decoding strategies show only limited additional gains, indicating that contingency remains a challenging goal for BabyLMs.

Link: https://arxiv.org/abs/2510.20411
Authors: Suchir Salhan, Hongyi Gu, Donya Rooein, Diana Galvan-Sosa, Gabrielle Gaudeau, Andrew Caines, Zheng Yuan, Paula Buttery
Institutions: ALTA Institute, Dept. of Computer Science & Technology, Cambridge University; NetMind.AI; Bocconi University; Sheffield University
Categories: Computation and Language (cs.CL)
Notes: Outstanding Paper Award, EMNLP 2025 BabyLM Workshop - Oral presentation, Suzhou, China

Abstract:Multi-turn dialogues between a child and a caregiver are characterized by a property called contingency - that is, prompt, direct, and meaningful exchanges between interlocutors. We introduce ContingentChat, a teacher-student framework that benchmarks and improves multi-turn contingency in a BabyLM trained on 100M words. Using a novel alignment dataset for post-training, BabyLM generates responses that are more grammatical and cohesive. Experiments with adaptive teacher decoding strategies show limited additional gains. ContingentChat demonstrates the benefits of targeted post-training for dialogue quality and indicates that contingency remains a challenging goal for BabyLMs.

[NLP-35] Relative-Based Scaling Law for Neural Language Models

【Quick Read】: This paper addresses a limitation of existing scaling-law studies for large language models (LLMs), which rely almost exclusively on cross-entropy: cross-entropy measures only the absolute probability assigned to the correct token and ignores the relative ordering between correct and incorrect tokens, which is crucial in scenarios such as greedy sampling. The key of the solution is a new metric, Relative-Based Probability (RBP), which quantifies the probability that the correct token is ranked among the top predictions, and a Relative-Based Scaling Law built on it that characterizes how RBP improves with model size. Experiments on four datasets and four model families spanning five orders of magnitude demonstrate the law's robustness and accuracy, and two applications illustrate its value: a deeper explanation of emergence phenomena and a step toward more fundamental theories of scaling laws.

Link: https://arxiv.org/abs/2510.20387
Authors: Baoqing Yue, Jinyuan Zhou, Zixi Wei, Jingtao Zhan, Qingyao Ai, Yiqun Liu
Institutions: Tsinghua University
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Abstract:Scaling laws aim to accurately predict model performance across different scales. Existing scaling-law studies almost exclusively rely on cross-entropy as the evaluation metric. However, cross-entropy provides only a partial view of performance: it measures the absolute probability assigned to the correct token, but ignores the relative ordering between correct and incorrect tokens. Yet, relative ordering is crucial for language models, such as in greedy-sampling scenario. To address this limitation, we investigate scaling from the perspective of relative ordering. We first propose the Relative-Based Probability (RBP) metric, which quantifies the probability that the correct token is ranked among the top predictions. Building on this metric, we establish the Relative-Based Scaling Law, which characterizes how RBP improves with increasing model size. Through extensive experiments on four datasets and four model families spanning five orders of magnitude, we demonstrate the robustness and accuracy of this law. Finally, we illustrate the broad application of this law with two examples, namely providing a deeper explanation of emergence phenomena and facilitating finding fundamental theories of scaling laws. In summary, the Relative-Based Scaling Law complements the cross-entropy perspective and contributes to a more complete understanding of scaling large language models. Thus, it offers valuable insights for both practical development and theoretical exploration.
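
Following the abstract's description, RBP can be sketched as a top-k inclusion rate for the gold token; the paper's exact definition may differ from this minimal version.

```python
import numpy as np

def rbp(logits, targets, k=1):
    """Fraction of positions where the gold token is ranked in the top-k."""
    topk = np.argsort(-logits, axis=-1)[:, :k]   # indices of the k highest logits
    hits = [t in row for t, row in zip(targets, topk)]
    return float(np.mean(hits))

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 50))        # 5 positions, vocabulary of 50
targets = rng.integers(0, 50, size=5)    # gold token ids
print(rbp(logits, targets, k=1), rbp(logits, targets, k=5))
```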

[NLP-36] NeoDictaBERT: Pushing the Frontier of BERT models for Hebrew

【Quick Read】: This paper addresses the gap between BERT-based models and modern transformer architectures (such as Llama3 and Qwen3) for Hebrew NLP. The key of the solution is to adopt the NeoBERT architecture and its training procedure, including a much longer supported context window, to train Hebrew-focused BERT-style models: NeoDictaBERT and the bilingual NeoDictaBERT-bilingual. The models outperform existing ones on almost all Hebrew benchmarks, and the bilingual variant achieves strong retrieval results, outperforming multilingual models of similar size.

Link: https://arxiv.org/abs/2510.20386
Authors: Shaltiel Shmidman, Avi Shmidman, Moshe Koppel
Institutions: DICTA / Jerusalem, Israel; Bar Ilan University / Ramat Gan, Israel
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Since their initial release, BERT models have demonstrated exceptional performance on a variety of tasks, despite their relatively small size (BERT-base has ~100M parameters). Nevertheless, the architectural choices used in these models are outdated compared to newer transformer-based models such as Llama3 and Qwen3. In recent months, several architectures have been proposed to close this gap. ModernBERT and NeoBERT both show strong improvements on English benchmarks and significantly extend the supported context window. Following their successes, we introduce NeoDictaBERT and NeoDictaBERT-bilingual: BERT-style models trained using the same architecture as NeoBERT, with a dedicated focus on Hebrew texts. These models outperform existing ones on almost all Hebrew benchmarks and provide a strong foundation for downstream tasks. Notably, the NeoDictaBERT-bilingual model shows strong results on retrieval tasks, outperforming other multilingual models of similar size. In this paper, we describe the training process and report results across various benchmarks. We release the models to the community as part of our goal to advance research and development in Hebrew NLP.

[NLP-37] VLSP 2025 MLQA-TSR Challenge: Vietnamese Multimodal Legal Question Answering on Traffic Sign Regulation

【Quick Read】: This paper addresses Vietnamese multimodal legal text processing, specifically multimodal legal question answering on traffic sign regulation, whose core challenge is fusing textual and visual information to understand and answer legal questions about traffic signs. The key of the solution is the VLSP 2025 MLQA-TSR shared task, comprising two subtasks, multimodal legal retrieval and multimodal question answering, together with a benchmark dataset for building and evaluating intelligent systems in multimodal legal domains, thereby advancing research in this area.

Link: https://arxiv.org/abs/2510.20381
Authors: Son T. Luu, Trung Vo, Hiep Nguyen, Khanh Quoc Tran, Kiet Van Nguyen, Vu Tran, Ngan Luu-Thuy Nguyen, Le-Minh Nguyen
Institutions: Japan Advanced Institute of Science and Technology; University of Information Technology; Vietnam National University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: VLSP 2025 MLQA-TSR Shared Task

Abstract:This paper presents the VLSP 2025 MLQA-TSR - the multimodal legal question answering on traffic sign regulation shared task at VLSP 2025. VLSP 2025 MLQA-TSR comprises two subtasks: multimodal legal retrieval and multimodal question answering. The goal is to advance research on Vietnamese multimodal legal text processing and to provide a benchmark dataset for building and evaluating intelligent systems in multimodal legal domains, with a focus on traffic sign regulation in Vietnam. The best-reported results on VLSP 2025 MLQA-TSR are an F2 score of 64.55% for multimodal legal retrieval and an accuracy of 86.30% for multimodal question answering.

[NLP-38] IKnow: Instruction-Knowledge-Aware Continual Pretraining for Effective Domain Adaptation

【Quick Read】: This paper addresses the degradation of instruction-following ability and semantic representations that occurs when standard self-supervised objectives are applied directly to instruction-tuned models during continual pretraining. Existing fixes assume access to the original base-model weights or to an external domain-specific database, which is unrealistic when weights are withheld or reliable external corpora are unavailable. The key of the proposed Instruction-Knowledge-Aware Continual Adaptation (IKnow) framework is to formulate novel self-supervised objectives in the instruction-response dialogue format, leveraging the domain knowledge embedded in the text itself and learning to encode it at a deeper semantic level, thereby enabling effective adaptation without external resources.

Link: https://arxiv.org/abs/2510.20377
Authors: Tianyi Zhang, Florian Mai, Lucie Flek
Institutions: University of Bonn; Lamarr Institute for Machine Learning and Artificial Intelligence
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Abstract:Continual pretraining promises to adapt large language models (LLMs) to new domains using only unlabeled test-time data, but naively applying standard self-supervised objectives to instruction-tuned models is known to degrade their instruction-following capability and semantic representations. Existing fixes assume access to the original base model or rely on knowledge from an external domain-specific database - both of which pose a realistic barrier in settings where the base model weights are withheld for safety reasons or reliable external corpora are unavailable. In this work, we propose Instruction-Knowledge-Aware Continual Adaptation (IKnow), a simple and general framework that formulates novel self-supervised objectives in the instruction-response dialogue format. Rather than depending on external resources, IKnow leverages domain knowledge embedded within the text itself and learns to encode it at a deeper semantic level.

[NLP-39] he Impact of Negated Text on Hallucination with Large Language Models EMNLP2025

【Quick Read】: This paper addresses the largely unexplored question of how negated text affects hallucination detection in large language models (LLMs). It poses three research questions: whether LLMs can recognize contextual shifts caused by negation, whether they can still reliably distinguish hallucinations under negation as in affirmative cases, and how to quantify the resulting degradation. The key of the solution is the NegHalu dataset, built by reconstructing existing hallucination detection datasets with negated expressions, enabling systematic evaluation on negated inputs. Experiments show that LLMs struggle to detect hallucinations in negated text, often producing logically inconsistent or unfaithful judgments, and token-level tracing of their internal states reveals the challenges of mitigating these unintended effects.

Link: https://arxiv.org/abs/2510.20375
Authors: Jaehyung Seo, Hyeonseok Moon, Heuiseok Lim
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted to EMNLP 2025

Abstract:Recent studies on hallucination in large language models (LLMs) have been actively progressing in natural language processing. However, the impact of negated text on hallucination with LLMs remains largely unexplored. In this paper, we set three important yet unanswered research questions and aim to address them. To derive the answers, we investigate whether LLMs can recognize contextual shifts caused by negation and still reliably distinguish hallucinations comparable to affirmative cases. We also design the NegHalu dataset by reconstructing existing hallucination detection datasets with negated expressions. Our experiments demonstrate that LLMs struggle to detect hallucinations in negated text effectively, often producing logically inconsistent or unfaithful judgments. Moreover, we trace the internal state of LLMs as they process negated inputs at the token level and reveal the challenges of mitigating their unintended effects.

[NLP-40] Dialogue Is Not Enough to Make a Communicative BabyLM (But Neither Is Developmentally Inspired Reinforcement Learning)

【Quick Read】: This paper asks whether pre-training exclusively on dialogue data yields small language models that are formally and functionally apt for dialogue, and how to make their generations more communicative. The key of the solution is to pre-train the llamalogue model on dialogue data and then apply a variety of fine-tuning strategies; direct preference optimization (DPO) in particular further strengthens dialogue continuation prediction, improving performance on a custom dialogue benchmark even though the models underperform on most standard BabyLM benchmarks, while PPO fine-tuning has mixed to adversarial effects.

Link: https://arxiv.org/abs/2510.20358
Authors: Francesca Padovani, Bastian Bunzeck, Manar Ali, Omar Momen, Arianna Bisazza, Hendrik Buschmeier, Sina Zarrieß
Institutions: Center for Language and Cognition (CLCG), University of Groningen; CRC 1646 – Linguistic Creativity in Communication, Bielefeld University
Categories: Computation and Language (cs.CL)
Notes:

Abstract:We investigate whether pre-training exclusively on dialogue data results in formally and functionally apt small language models. Based on this pre-trained llamalogue model, we employ a variety of fine-tuning strategies to enforce “more communicative” text generations by our models. Although our models underperform on most standard BabyLM benchmarks, they excel at dialogue continuation prediction in a minimal pair setting. While PPO fine-tuning has mixed to adversarial effects on our models, DPO fine-tuning further improves their performance on our custom dialogue benchmark.

[NLP-41] FreeChunker: A Cross-Granularity Chunking Framework

【Quick Read】: This paper addresses the limited adaptability of chunking strategies in existing retrieval-augmented generation (RAG) systems, which operate at fixed granularity with static boundary identification and therefore handle diverse query requirements poorly. The key of the proposed FreeChunker, a cross-granularity encoding framework, is to treat sentences as atomic units and replace static chunk segmentation with flexible retrieval over arbitrary sentence combinations, which substantially reduces the computational overhead of semantic boundary detection while improving adaptability to complex queries.

Link: https://arxiv.org/abs/2510.20356
Authors: Wenxuan Zhang, Yuan-Hao Jiang, Yonghe Wu
Institutions: Unknown
Categories: Computation and Language (cs.CL)
Notes: Submitted to arXiv, October 2025

Abstract:Chunking strategies significantly impact the effectiveness of Retrieval-Augmented Generation (RAG) systems. Existing methods operate within fixed-granularity paradigms that rely on static boundary identification, limiting their adaptability to diverse query requirements. This paper presents FreeChunker, a Cross-Granularity Encoding Framework that fundamentally transforms the traditional chunking paradigm: the framework treats sentences as atomic units and shifts from static chunk segmentation to flexible retrieval supporting arbitrary sentence combinations. This paradigm shift not only significantly reduces the computational overhead required for semantic boundary detection but also enhances adaptability to complex queries. Experimental evaluation on LongBench V2 demonstrates that FreeChunker achieves superior retrieval performance compared to traditional chunking methods, while significantly outperforming existing approaches in computational efficiency.
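
The sentence-as-atomic-unit idea can be sketched as a sentence-level index that assembles a chunk per query at retrieval time; the embedding function below is a random stand-in, not the paper's encoder.

```python
import numpy as np

def embed(text):
    """Placeholder embedding model: a deterministic random unit vector per text."""
    rng_local = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng_local.normal(size=16)
    return v / np.linalg.norm(v)

sentences = [
    "FreeChunker treats sentences as atomic units.",
    "Static boundaries limit adaptability to queries.",
    "The weather in Dublin is mild in October.",
]
# Index built once at sentence granularity, independent of any query.
index = np.stack([embed(s) for s in sentences])

def retrieve(query, top_k=2):
    """Assemble a query-specific chunk from an arbitrary sentence combination."""
    scores = index @ embed(query)
    order = np.argsort(-scores)[:top_k]
    return " ".join(sentences[i] for i in sorted(order))  # keep document order

print(retrieve("How does FreeChunker segment documents?"))
```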

[NLP-42] Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models

【Quick Read】: This paper addresses possible dataset contamination in evaluations of large language models (LLMs) on structured-data reasoning: strong results may reflect memorization of public datasets rather than genuine reasoning. The key of the solution is a series of controlled probing experiments that compare model performance on datasets containing strong semantic cues (meaningful column names or interpretable value categories) against versions where those cues are removed or randomized, thereby disentangling semantic leakage from authentic generalization ability.

Link: https://arxiv.org/abs/2510.20351
Authors: Matteo Silvestri, Flavio Giorgi, Fabrizio Silvestri, Gabriele Tolomei
Institutions: Sapienza University of Rome
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Abstract:Large Language Models (LLMs) are increasingly evaluated on their ability to reason over structured data, yet such assessments often overlook a crucial confound: dataset contamination. In this work, we investigate whether LLMs exhibit prior knowledge of widely used tabular benchmarks such as Adult Income, Titanic, and others. Through a series of controlled probing experiments, we reveal that contamination effects emerge exclusively for datasets containing strong semantic cues-for instance, meaningful column names or interpretable value categories. In contrast, when such cues are removed or randomized, performance sharply declines to near-random levels. These findings suggest that LLMs’ apparent competence on tabular reasoning tasks may, in part, reflect memorization of publicly available datasets rather than genuine generalization. We discuss implications for evaluation protocols and propose strategies to disentangle semantic leakage from authentic reasoning ability in future LLM assessments.
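
The cue-removal probe is straightforward to illustrate: the same record is rendered once with meaningful column names and once with anonymized ones. The Adult-Income-style row and the `make_prompt` helper below are invented for the example.

```python
header = ["age", "education", "hours-per-week", "income"]
row = ["39", "Bachelors", "40", "<=50K"]

def anonymize(header):
    """Replace semantic column names with meaningless identifiers."""
    return [f"col_{i}" for i in range(len(header))]

def make_prompt(header, row):
    pairs = ", ".join(f"{h}={v}" for h, v in zip(header, row))
    return f"Given the record ({pairs}), predict the last field."

print(make_prompt(header, row))             # strong semantic cues present
print(make_prompt(anonymize(header), row))  # cues removed: probes true reasoning
```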

[NLP-43] aching Language Models to Reason with Tools NIPS2025

【Quick Read】: This paper addresses the inefficient or erroneous reasoning that large reasoning models (LRMs) exhibit on complex mathematical operations, caused by the conflict between the model's internal probabilistic reasoning and the deterministic knowledge provided by external tools such as Code Interpreters (CIs). The key of the solution is CoRT (Code-Optimized Reasoning Training), a post-training framework built around a new data synthesis strategy, Hint-Engineering, which strategically injects diverse hints at optimal points in reasoning paths to produce high-quality, code-integrated reasoning data tailored to LRM-CI interaction; rejection sampling and reinforcement learning further refine the multi-round interleaving of internal thinking and external CI use, yielding notable gains in both accuracy and reasoning efficiency.

Link: https://arxiv.org/abs/2510.20342
Authors: Chengpeng Li, Zhengyang Tang, Ziniu Li, Mingfeng Xue, Keqin Bao, Tian Ding, Ruoyu Sun, Benyou Wang, Xiang Wang, Junyang Lin, Dayiheng Liu
Institutions: University of Science and Technology of China; Qwen Team, Alibaba Inc.; The Chinese University of Hong Kong, Shenzhen; Shenzhen International Center for Industrial and Applied Mathematics, Shenzhen Research Institute of Big Data
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: NIPS2025 Accepted

Abstract:Large reasoning models (LRMs) like OpenAI-o1 have shown impressive capabilities in natural language reasoning. However, these models frequently demonstrate inefficiencies or inaccuracies when tackling complex mathematical operations. While integrating computational tools such as Code Interpreters (CIs) offers a promising solution, it introduces a critical challenge: a conflict between the model's internal, probabilistic reasoning and the external, deterministic knowledge provided by the CI, which often leads models to unproductive deliberation. To overcome this, we introduce CoRT (Code-Optimized Reasoning Training), a post-training framework designed to teach LRMs to effectively utilize CIs. We propose Hint-Engineering, a new data synthesis strategy that strategically injects diverse hints at optimal points within reasoning paths. This approach generates high-quality, code-integrated reasoning data specifically tailored to optimize LRM-CI interaction. Using this method, we have synthesized 30 high-quality samples to post-train models ranging from 1.5B to 32B parameters through supervised fine-tuning. CoRT further refines the multi-round interleaving of external CI usage and internal thinking by employing rejection sampling and reinforcement learning. Our experimental evaluations demonstrate CoRT's effectiveness, yielding absolute improvements of 4% and 8% on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B, respectively, across five challenging mathematical reasoning datasets. Moreover, CoRT significantly enhances efficiency, reducing token usage by approximately 30% for the 32B model and 50% for the 1.5B model compared to pure natural language reasoning baselines. The models and code are available at: this https URL.

[NLP-44] Exploring Generative Process Reward Modeling for Semi-Structured Data: A Case Study of Table Question Answering

【Quick Read】: This paper investigates whether generative process reward models (PRMs) are applicable to tasks over semi-structured data such as table question answering (TQA), which poses unique challenges for PRMs: abundant irrelevant information, loosely connected reasoning steps, and domain-specific reasoning. As the first systematic study of PRMs for TQA, its key findings are that PRMs combining textual and code verification can aid answer selection but generalize poorly out of domain, and that step-level verification performance correlates only weakly with final answer accuracy, likely stemming from weak step dependencies and loose causal links between steps and answers.

Link: https://arxiv.org/abs/2510.20304
Authors: Lei Tang, Wei Zhou, Mohsen Mesgar
Institutions: University of Augsburg; University of Stuttgart; Bosch Center for Artificial Intelligence
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Process reward models (PRMs) improve complex reasoning in large language models (LLMs) by grading candidate solutions step-by-step and selecting answers via aggregated step scores. While effective in domains such as mathematics, their applicability to tasks involving semi-structured data, like table question answering (TQA) remains unexplored. TQA poses unique challenges for PRMs, including abundant irrelevant information, loosely connected reasoning steps, and domain-specific reasoning. This work presents the first systematic study of PRMs for TQA. We evaluate state-of-the-art generative PRMs on TQA from both answer and step perspectives. Results show that PRMs that combine textual and code verification can aid solution selection but struggle to generalize to out-of-domain data. Analysis reveals a weak correlation between performance in step-level verification and answer accuracy, possibly stemming from weak step dependencies and loose causal links. Our findings highlight limitations of current PRMs on TQA and offer valuable insights for building more robust, process-aware verifiers.
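
Answer selection via aggregated step scores, the mechanism discussed above, can be illustrated with two common aggregators (min and mean-log); actual PRMs and aggregation choices vary across papers.

```python
import math

def aggregate_min(step_scores):
    """A solution is only as good as its worst-scored step."""
    return min(step_scores)

def aggregate_mean_log(step_scores):
    """Average log step score, a smoother alternative."""
    return sum(math.log(s) for s in step_scores) / len(step_scores)

candidates = {
    "answer_A": [0.9, 0.8, 0.85],   # steady reasoning chain
    "answer_B": [0.95, 0.3, 0.9],   # one weak (possibly wrong) step
}
best = max(candidates, key=lambda a: aggregate_min(candidates[a]))
print(best)  # 'answer_A': min-aggregation penalizes answer_B's weak step
```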

[NLP-45] Citation Failure: Definition Analysis and Efficient Mitigation

【Quick Read】: This paper addresses citation failure in LLM-based retrieval-augmented generation (RAG): the model produces a helpful response but fails to cite complete evidence. Unlike prior work, it disentangles this from response failure, where the response itself is flawed and complete citation is impossible. The key of the solution is a two-step approach: first, the CITECONTROL benchmark systematically varies the relation between response and evidence to analyze failure modes, showing that citation failures increase with relational complexity and suggesting that combining citation methods could help; second, the CITENTION framework integrates generative, attention-based, and retrieval-based citation methods, yielding substantial citation improvements on CITECONTROL and in transfer settings.

Link: https://arxiv.org/abs/2510.20303
Authors: Jan Buchmann, Iryna Gurevych
Institutions: Ubiquitous Knowledge Processing Lab (UKP Lab); Department of Computer Science and Hessian Center for AI (hessian.AI); Technical University of Darmstadt
Categories: Computation and Language (cs.CL)
Notes: Under review. Paper repository: this https URL

Abstract:Citations from LLM-based RAG systems are supposed to simplify response verification. However, this does not hold for citation failure, when a model generates a helpful response, but fails to cite complete evidence. In contrast to previous work, we propose to disentangle this from response failure, where the response itself is flawed, and citing complete evidence is impossible. To address citation failure, this work follows a two-step approach: (1) We study when citation failure occurs and (2) how it can be mitigated. For step 1, we extend prior work by investigating how the relation between response and evidence affects citation quality. We introduce CITECONTROL, a benchmark that systematically varies this relation to analyze failure modes. Experiments show that failures increase with relational complexity and suggest that combining citation methods could improve performance, motivating step 2. To improve LLM citation efficiently, we propose CITENTION, a framework integrating generative, attention-based, and retrieval-based methods. Results demonstrate substantial citation improvements on CITECONTROL and in transfer settings. We make our data and code publicly available.

[NLP-46] Context-level Language Modeling by Learning Predictive Context Embeddings

【Quick Read】: This paper addresses a limitation of pretraining large language models (LLMs) solely with next-token prediction (NTP): token-level prediction struggles to capture higher-level semantic structure and long-range contextual relationships. The key of the proposed ContextLM framework is to augment standard pretraining with an inherent next-context prediction objective, training the model to learn predictive representations of multi-token contexts using error signals derived from future token chunks, while remaining fully compatible with the standard autoregressive, token-by-token evaluation paradigm (e.g., perplexity). The approach delivers better long-range coherence and more effective attention allocation with minimal computational overhead.

Link: https://arxiv.org/abs/2510.20280
Authors: Beiya Dai, Yuliang Liu, Daozheng Xue, Qipeng Guo, Kai Chen, Xinbing Wang
Institutions: LUMIA Lab; School of Artificial Intelligence; Shanghai Jiao Tong University; Shanghai AI Laboratory; Tsinghua University; Nanjing University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 16 pages, 6 figures

Abstract:Next-token prediction (NTP) is the cornerstone of modern large language models (LLMs) pretraining, driving their unprecedented capabilities in text generation, reasoning, and instruction following. However, the token-level prediction limits the model's capacity to capture higher-level semantic structures and long-range contextual relationships. To overcome this limitation, we introduce ContextLM, a framework that augments standard pretraining with an inherent next-context prediction objective. This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks. Crucially, ContextLM achieves this enhancement while remaining fully compatible with the standard autoregressive, token-by-token evaluation paradigm (e.g., perplexity). Extensive experiments on the GPT2 and Pythia model families, scaled up to 1.5B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance. Our analysis indicates that next-context prediction provides a scalable and efficient pathway to stronger language modeling, yielding better long-range coherence and more effective attention allocation with minimal computational overhead.

[NLP-47] ImpossibleBench: Measuring LLM s Propensity of Exploiting Test Cases

【Quick Read】: This paper addresses the reliability risks posed by large language models' (LLMs) tendency to exploit shortcuts, e.g., an LLM agent deleting failing unit tests rather than fixing the underlying bug, which undermines both the validity of benchmark results and trust in deployed coding assistants. The key of the solution is ImpossibleBench, a benchmark framework that systematically quantifies, studies, and mitigates such shortcut behavior: it creates "impossible" variants of existing tasks by introducing direct conflicts between the natural-language specification and the unit tests, and measures a model's "cheating rate" as its pass rate on these tasks, since any pass necessarily implies a specification-violating shortcut. The framework further serves as a testbed for studying model behavior, context engineering, and the development of monitoring tools.

Link: https://arxiv.org/abs/2510.20270
Authors: Ziqian Zhong, Aditi Raghunathan, Nicholas Carlini
Institutions: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes:

Abstract:The tendency to find and exploit "shortcuts" to complete tasks poses significant risks for reliable assessment and deployment of large language models (LLMs). For example, an LLM agent with access to unit tests may delete failing tests rather than fix the underlying bug. Such behavior undermines both the validity of benchmark results and the reliability of real-world LLM coding assistant deployments. To quantify, study, and mitigate such behavior, we introduce ImpossibleBench, a benchmark framework that systematically measures LLM agents' propensity to exploit test cases. ImpossibleBench creates "impossible" variants of tasks from existing benchmarks like LiveCodeBench and SWE-bench by introducing direct conflicts between the natural-language specification and the unit tests. We measure an agent's "cheating rate" as its pass rate on these impossible tasks, where any pass necessarily implies a specification-violating shortcut. As a practical framework, ImpossibleBench is not just an evaluation but a versatile tool. We demonstrate its utility for: (1) studying model behaviors, revealing more fine-grained details of cheating behaviors from simple test modification to complex operator overloading; (2) context engineering, showing how prompt, test access and feedback loop affect cheating rates; and (3) developing monitoring tools, providing a testbed with verified deceptive solutions. We hope ImpossibleBench serves as a useful framework for building more robust and reliable LLM systems. Our implementation can be found at this https URL.

[NLP-48] Calibrating Multimodal Consensus for Emotion Recognition

【Quick Read】: This paper addresses two key problems in multimodal emotion recognition (MER): semantic inconsistencies across modalities (e.g., conflicting emotional cues between text and visual inputs) and the dominance of the text modality, whose strong representational capacity marginalizes other modalities and can compromise recognition accuracy. The key of the proposed Calibrated Multimodal Consensus (CMC) model is twofold: a Pseudo Label Generation Module (PLGM) that produces pseudo unimodal labels for self-supervised unimodal pretraining, strengthening each modality's independent representation; and, during finetuning, a Parameter-free Fusion Module (PFM) together with a Multimodal Consensus Router (MCR) that enable dynamic, reliable cross-modal fusion, mitigating text dominance and improving robustness to semantically inconsistent scenarios.

Link: https://arxiv.org/abs/2510.20256
Authors: Guowei Zhong, Junjie Li, Huaiyu Zhu, Ruohong Huan, Yun Pan
Institutions: Zhejiang University; Zhejiang University of Technology; Zhejiang University Jinhua Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Notes:

Abstract:In recent years, Multimodal Emotion Recognition (MER) has made substantial progress. Nevertheless, most existing approaches neglect the semantic inconsistencies that may arise across modalities, such as conflicting emotional cues between text and visual inputs. Besides, current methods are often dominated by the text modality due to its strong representational capacity, which can compromise recognition accuracy. To address these challenges, we propose a model termed Calibrated Multimodal Consensus (CMC). CMC introduces a Pseudo Label Generation Module (PLGM) to produce pseudo unimodal labels, enabling unimodal pretraining in a self-supervised fashion. It then employs a Parameter-free Fusion Module (PFM) and a Multimodal Consensus Router (MCR) for multimodal finetuning, thereby mitigating text dominance and guiding the fusion process toward a more reliable consensus. Experimental results demonstrate that CMC achieves performance on par with or superior to state-of-the-art methods across four datasets, CH-SIMS, CH-SIMS v2, CMU-MOSI, and CMU-MOSEI, and exhibits notable advantages in scenarios with semantic inconsistencies on CH-SIMS and CH-SIMS v2. The implementation of this work is publicly accessible at this https URL.

[NLP-49] ri-Modal Severity Fused Diagnosis across Depression and Post-traumatic Stress Disorders

【Quick Read】: This paper addresses the challenge of automated assessment under co-occurring depression and PTSD: existing methods are mostly binary and disorder-specific, offering neither clinically meaningful severity grading nor interpretable decision support. The key of the solution is a unified tri-modal affective severity framework that synchronizes and fuses interview text (sentence-level transformer embeddings), audio (log Mel statistics with deltas), and facial behavior (action units, gaze, head and pose descriptors) to output graded severities for depression (PHQ-8, 5 classes) and PTSD (3 classes). A calibrated late-fusion classifier over standardized features yields per-disorder probabilities and feature-level attributions, improving robustness under noisy or missing modalities; ablations show that text contributes most to depression severity while audio and facial cues are critical for PTSD, enabling reproducible, interpretable cross-disorder severity assessment for clinician-in-the-loop decision making.

Link: https://arxiv.org/abs/2510.20239
Authors: Filippo Cenacchi, Deborah Richards, Longbing Cao
Institutions: Frontier AI Research Centre, School of Computing, Macquarie University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes:

Abstract:Depression and post traumatic stress disorder (PTSD) often co-occur with connected symptoms, complicating automated assessment, which is often binary and disorder specific. Clinically useful diagnosis needs severity aware cross disorder estimates and decision support explanations. Our unified tri modal affective severity framework synchronizes and fuses interview text with sentence level transformer embeddings, audio with log Mel statistics with deltas, and facial signals with action units, gaze, head and pose descriptors to output graded severities for diagnosing both depression (PHQ-8; 5 classes) and PTSD (3 classes). Standardized features are fused via a calibrated late fusion classifier, yielding per disorder probabilities and feature-level attributions. This severity aware tri-modal affective fusion approach is demoed on multi disorder concurrent depression and PTSD assessment. Stratified cross validation on DAIC derived corpora outperforms unimodal/ablation baselines. The fused model matches the strongest unimodal baseline on accuracy and weighted F1, while improving decision curve utility and robustness under noisy or missing modalities. For PTSD specifically, fusion reduces regression error and improves class concordance. Errors cluster between adjacent severities; extreme classes are identified reliably. Ablations show text contributes most to depression severity, audio and facial cues are critical for PTSD, whereas attributions align with linguistic and behavioral markers. Our approach offers reproducible evaluation and clinician in the loop support for affective clinical decision making.

[NLP-50] Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context

【Quick Read】: This paper investigates why large vision-language models (LVLMs) hallucinate more in longer free-form responses, asking whether the root cause is merely length-induced accumulated uncertainty. The study finds that the risk does not stem from length itself but from the increased reliance on context required to maintain coherence and completeness in longer generations. The key of the solution is an induce-detect-suppress framework: it actively induces hallucinations through deliberately designed contexts, uses the induced instances for early detection of high-risk cases, and suppresses potential object-level hallucinations during actual decoding. The approach achieves consistent, significant improvements across benchmarks, validating both the framework and the underlying hypothesis about context, and offering a first step toward a deeper understanding of hallucinations in LVLMs' longer responses.

Link: https://arxiv.org/abs/2510.20229
Authors: Ge Zheng, Jiaye Qian, Jiajin Tang, Sibei Yang
Institutions: Sun Yat-sen University; ShanghaiTech University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes:

Abstract:Large Vision-Language Models (LVLMs) have made significant progress in recent years but are also prone to hallucination issues. They exhibit more hallucinations in longer, free-form responses, often attributed to accumulated uncertainties. In this paper, we ask: Does increased hallucination result solely from length-induced errors, or is there a deeper underlying mechanism? After a series of preliminary experiments and findings, we suggest that the risk of hallucinations is not caused by length itself but by the increased reliance on context for coherence and completeness in longer responses. Building on these insights, we propose a novel “induce-detect-suppress” framework that actively induces hallucinations through deliberately designed contexts, leverages induced instances for early detection of high-risk cases, and ultimately suppresses potential object-level hallucinations during actual decoding. Our approach achieves consistent, significant improvements across all benchmarks, demonstrating its efficacy. The strong detection and improved hallucination mitigation not only validate our framework but, more importantly, re-validate our hypothesis on context. Rather than solely pursuing performance gains, this study aims to provide new insights and serves as a first step toward a deeper exploration of hallucinations in LVLMs’ longer responses.

[NLP-51] Decoding-Free Sampling Strategies for LLM Marginalization

【Quick Read】: This paper addresses an evaluation bias caused by subword tokenization at inference time: models are conventionally scored on the probability of one specific tokenization of the output, ignoring the many valid tokenizations of the same text and thus misrepresenting the model's true confidence. Marginalization, summing the probability mass over all tokenizations, gives a more faithful assessment. The key of the solution is decoding-free sampling strategies: rather than performing an expensive LLM generation step per sample, they rely on extremely cheap, model- and tokenizer-agnostic sampling, achieving sufficiently accurate marginal estimates at a small fraction of the runtime cost. Experiments on several open models and a set of downstream inference tasks confirm both the approximation quality and the efficiency.

Link: https://arxiv.org/abs/2510.20208
Authors: David Pohl, Marco Cognetta, Junyoung Lee, Naoaki Okazaki
Institutions: Tokyo Institute of Technology; University of Tokyo
Categories: Computation and Language (cs.CL)
Notes: 10 pages, 3 figures

Abstract:Modern language models operate on subword-tokenized text in order to make a trade-off between model size, inference speed, and vocabulary coverage. A side effect of this is that, during inference, models are evaluated by measuring the probability of only the specific tokenization produced as the output, despite there being many possible ways to represent the same text with a subword vocabulary. Recent studies have argued instead for evaluating LLMs by marginalization - the probability mass of all tokenizations of a given text. Marginalization is difficult due to the number of possible tokenizations of a text, so often approximate marginalization is done via sampling. However, a downside of sampling is that an expensive generation step must be performed by the LLM for each sample, which limits the number of samples that can be acquired given a runtime budget, and therefore also the accuracy of the approximation. Since computing the probability of a sequence given the tokenization is relatively cheap compared to actually generating it, we investigate sampling strategies that are decoding-free - they require no generation from the LLM, instead relying entirely on extremely cheap sampling strategies that are model and tokenizer agnostic. We investigate the approximation quality and speed of decoding-free sampling strategies for a number of open models to find that they provide sufficiently accurate marginal estimates at a small fraction of the runtime cost and demonstrate its use on a set of downstream inference tasks.
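
The marginal the paper targets is easy to state on a toy vocabulary: enumerate all valid segmentations of a string and sum their sequence probabilities (here with a placeholder scorer). Decoding-free sampling approximates this sum without the exhaustive enumeration.

```python
# Toy vocabulary admitting several tokenizations of the same string.
VOCAB = {"un", "believ", "able", "unbeliev", "believable", "a", "ble"}

def segmentations(text):
    """All ways to split `text` into in-vocabulary subwords."""
    if not text:
        return [[]]
    out = []
    for i in range(1, len(text) + 1):
        piece = text[:i]
        if piece in VOCAB:
            out += [[piece] + rest for rest in segmentations(text[i:])]
    return out

def sequence_prob(tokens):
    # Placeholder scorer standing in for a cheap LLM sequence probability;
    # here shorter tokenizations simply score higher.
    return 0.5 ** len(tokens)

text = "unbelievable"
tokenizations = segmentations(text)
marginal = sum(sequence_prob(t) for t in tokenizations)
print(tokenizations)  # e.g., ['unbeliev', 'able'], ['un', 'believ', 'able'], ...
print(marginal)       # marginal probability vs. any single tokenization's score
```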

[NLP-52] Stuck in the Matrix: Probing Spatial Reasoning in Large Language Models

【Quick Read】: This paper addresses the inadequate spatial reasoning of large language models (LLMs) over textual input, especially multi-step spatial computation and abstract reasoning in structured grid environments. The key of the solution is a systematic evaluation suite of five tasks, quadrant identification, geometric transformations, distance evaluation, word searches, and tile sliding, whose complexity is scaled by increasing grid dimensions. The results quantify how sharply performance degrades with scale (an average accuracy loss of 42.7%, up to 84%), revealing the lack of robust spatial representations in current architectures and the gap between linguistic and spatial reasoning in LLMs.

Link: https://arxiv.org/abs/2510.20198
Authors: Maggie Bai, Ava Kim Cohen, Eleanor Koss, Charlie Lichtenbaum
Institutions: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 20 pages, 24 figures

Abstract:This paper explores the spatial reasoning capability of large language models (LLMs) over textual input through a suite of five tasks aimed at probing their spatial understanding and computational abilities. The models were tested on both fundamental spatial reasoning and multi-step problem-solving within structured grid-based environments using tasks such as quadrant identification, geometric transformations, distance evaluation, word searches, and tile sliding. Each task was scaled in complexity through increasing grid dimensions, requiring models to extend beyond simple pattern recognition into abstract spatial reasoning. Our results reveal that while LLMs demonstrate moderate success in all tasks with small complexity and size, performance drops off rapidly as scale increases, with an average loss in accuracy of 42.7%, and reaching as high as 84%. Every test that began with over 50% accuracy showed a loss of at least 48%, illustrating the consistent nature of the deterioration. Furthermore, their struggles with scaling complexity hint at a lack of robust spatial representations in their underlying architectures. This paper underscores the gap between linguistic and spatial reasoning in LLMs, offering insights into their current limitations, and laying the groundwork for future integrative benchmarks at the intersection of language and geometry.

[NLP-53] Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures

【Quick Read】: This paper addresses the challenge that traditional question answering (QA) systems face with the rapid growth of multimodal content (images, audio, video, and structured metadata): how to integrate multimedia retrieval pipelines to improve QA performance and generalization. The key of the solution is a systematic survey of architectures that align visual, language, and audio modalities with user queries, categorizing retrieval methods, fusion techniques, and answer generation strategies, analyzing benchmark datasets, evaluation protocols, and performance trade-offs, and identifying core difficulties such as cross-modal alignment, latency-accuracy trade-offs, and semantic grounding, thereby outlining open problems and future directions for more robust, context-aware multimodal QA systems.

Link: https://arxiv.org/abs/2510.20193
Authors: Rahul Raja, Arpita Vats
Institutions: Carnegie Mellon University; Boston University
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Notes: In Proceedings of the 2nd ACM Workshop in AI-powered Question and Answering Systems (AIQAM '25), October 27-28, 2025, Dublin, Ireland. ACM, New York, NY, USA, 8 pages. this https URL

Abstract:Question Answering (QA) systems have traditionally relied on structured text data, but the rapid growth of multimedia content (images, audio, video, and structured metadata) has introduced new challenges and opportunities for retrieval-augmented QA. In this survey, we review recent advancements in QA systems that integrate multimedia retrieval pipelines, focusing on architectures that align vision, language, and audio modalities with user queries. We categorize approaches based on retrieval methods, fusion techniques, and answer generation strategies, and analyze benchmark datasets, evaluation protocols, and performance tradeoffs. Furthermore, we highlight key challenges such as cross-modal alignment, latency-accuracy tradeoffs, and semantic grounding, and outline open problems and future research directions for building more robust and context-aware QA systems leveraging multimedia data.

[NLP-54] Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values

【Quick Read】: This paper addresses the absence of explicit modeling of human value priorities in reinforcement learning for large language models (LLMs): conventional optimization targets only task correctness and ignores that tasks differ in real-world importance. The key of the proposed Reinforcement Learning with Explicit Human Values (RLEV) is to incorporate quantifiable, human-defined value signals directly into the reward function, so that training optimizes not only accuracy but also adapts the output policy to task value, e.g., thorough on high-value prompts and concise on low-value ones. This behavior stems from value-weighted gradient amplification on end-of-sequence tokens, yielding a value-sensitive termination policy; the approach remains robust under noisy value signals such as difficulty-based labels, confirming that optimizing an explicit utility function is a practical path to aligning LLMs with human priorities.

Link: https://arxiv.org/abs/2510.20187
Authors: Dian Yu, Yulai Zhao, Kishan Panaganti, Linfeng Song, Haitao Mi, Dong Yu
Institutions: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Notes: 15 pages, 4 figures

Abstract:We propose Reinforcement Learning with Explicit Human Values (RLEV), a method that aligns Large Language Model (LLM) optimization directly with quantifiable human value signals. While Reinforcement Learning with Verifiable Rewards (RLVR) effectively trains models in objective domains using binary correctness rewards, it overlooks that not all tasks are equally significant. RLEV extends this framework by incorporating human-defined value signals directly into the reward function. Using exam-style data with explicit ground-truth value labels, RLEV consistently outperforms correctness-only baselines across multiple RL algorithms and model scales. Crucially, RLEV policies not only improve value-weighted accuracy but also learn a value-sensitive termination policy: concise for low-value prompts, thorough for high-value ones. We demonstrate this behavior stems from value-weighted gradient amplification on end-of-sequence tokens. Ablation studies confirm the gain is causally linked to value alignment. RLEV remains robust under noisy value signals, such as difficulty-based labels, demonstrating that optimizing for an explicit utility function offers a practical path to aligning LLMs with human priorities.
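
The core reward change is small enough to show directly: RLVR's binary correctness reward is scaled by a human-assigned value per prompt. The exact shaping in the paper may differ; this is the minimal version.

```python
def rlvr_reward(correct: bool) -> float:
    """Standard verifiable reward: binary correctness."""
    return 1.0 if correct else 0.0

def rlev_reward(correct: bool, value: float) -> float:
    """Value-weighted correctness: high-value tasks dominate the learning signal."""
    return value * rlvr_reward(correct)

# Two exam-style questions with ground-truth value labels (e.g., point worth).
print(rlev_reward(correct=True, value=5.0))   # high-stakes question solved: 5.0
print(rlev_reward(correct=True, value=1.0))   # low-stakes question solved:  1.0
print(rlev_reward(correct=False, value=5.0))  # unsolved: 0.0 regardless of value
```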

[NLP-55] Mixture-of-Minds: Multi-Agent Reinforcement Learning for Table Understanding

【Quick Read】: This paper addresses two complementary limitations of large language models (LLMs) in table understanding and reasoning: fine-tuning-based methods strengthen language reasoning but are prone to arithmetic errors and hallucination, while tool-based methods enable precise table manipulation but rely on rigid schemas and lack semantic understanding. The key of the proposed Mixture-of-Minds framework is to decompose table reasoning into three specialized agent roles, planning, coding, and answering, so that each agent focuses on one aspect of the task while leveraging code execution for precise table manipulation; on top of this workflow, a self-improvement training framework uses Monte Carlo Tree Search (MCTS) rollouts to generate pseudo-gold trajectories and optimizes the agents with reinforcement learning (RL), combining structured multi-agent workflows with RL to markedly advance table understanding.

Link: https://arxiv.org/abs/2510.20176
Authors: Yuhang Zhou, Mingrui Zhang, Ke Li, Mingyi Wang, Qiao Liu, Qifei Wang, Jiayi Liu, Fei Liu, Serena Li, Weiwi Li, Mingze Gao, Abhishek Kumar, Xiangjun Fan, Zhuokai Zhao, Lizhu Zhang
Institutions: Meta AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: 18 pages, 4 figures

Abstract:Understanding and reasoning over tables is a critical capability for many real-world applications. Large language models (LLMs) have shown promise on this task, but current approaches remain limited. Fine-tuning based methods strengthen language reasoning; yet they are prone to arithmetic errors and hallucination. In contrast, tool-based methods enable precise table manipulation but rely on rigid schemas and lack semantic understanding. These complementary drawbacks highlight the need for approaches that integrate robust reasoning with reliable table processing. In this work, we propose Mixture-of-Minds, a multi-agent framework that decomposes table reasoning into three specialized roles: planning, coding, and answering. This design enables each agent to focus on a specific aspect of the task while leveraging code execution for precise table manipulation. Building on this workflow, we introduce a self-improvement training framework that employs Monte Carlo Tree Search (MCTS) rollouts to generate pseudo-gold trajectories and optimize agents with reinforcement learning (RL). Extensive experiments show that Mixture-of-Minds delivers substantial gains, reaching 62.13% on TableBench and surpassing OpenAI-o4-mini-high. These results demonstrate the promise of combining structured multi-agent workflows with RL to advance table understanding.

[NLP-56] DeepWideSearch: Benchmarking Depth and Width in Agent ic Information Seeking

【Quick Read】: This paper addresses the inability of current search agents to combine deep reasoning over multi-hop retrieval with wide-scale information collection, a critical deficiency for real-world applications such as comprehensive market analysis and business development. The key of the solution is DeepWideSearch, the first benchmark explicitly designed to evaluate agents on integrating depth and width in information seeking: 220 questions across 15 domains, built via two dataset-conversion methods, require agents to process large volumes of data while reasoning over multi-hop retrieval paths. Even state-of-the-art agents achieve only a 2.39% average success rate, and error analysis reveals four failure modes, lack of reflection, overreliance on internal knowledge, insufficient retrieval, and context overflow, exposing key limitations of current agent architectures.

Link: https://arxiv.org/abs/2510.20168
Authors: Tian Lan, Bin Zhu, Qianghuai Jia, Junyang Ren, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, Kaifu Zhang
Institutions: Alibaba International Digital Commerce
Categories: Computation and Language (cs.CL)
Notes:

Abstract:Current search agents fundamentally lack the ability to simultaneously perform deep reasoning over multi-hop retrieval and wide-scale information collection - a critical deficiency for real-world applications like comprehensive market analysis and business development. To bridge this gap, we introduce DeepWideSearch, the first benchmark explicitly designed to evaluate agents on integrating depth and width in information seeking. In DeepWideSearch, agents must process a large volume of data, each requiring deep reasoning over multi-hop retrieval paths. Specifically, we propose two methods to convert established datasets, resulting in a curated collection of 220 questions spanning 15 diverse domains. Extensive experiments demonstrate that even state-of-the-art agents achieve only 2.39% average success rate on DeepWideSearch, highlighting the substantial challenge of integrating depth and width search in information-seeking tasks. Furthermore, our error analysis reveals four failure modes: lack of reflection, overreliance on internal knowledge, insufficient retrieval, and context overflow - exposing key limitations in current agent architectures. We publicly release DeepWideSearch to catalyze future research on more capable and robust information-seeking agents.

[NLP-57] Are Stereotypes Leading LLM s Zero-Shot Stance Detection ? EMNLP2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在零样本立场检测(stance detection)任务中因继承预训练数据中的刻板印象而导致的偏见问题。其解决方案的关键在于:自动标注现有立场检测数据集中的两个属性——特定群体的方言或俗语(vernacular)和文本复杂度/可读性(text complexity/readability),从而系统性地评估这些语言特征是否影响模型的立场判断。实验结果表明,LLMs 在立场检测中表现出显著的刻板印象,例如错误地将支持大麻的观点与低文本复杂度关联,并将非洲裔美国人方言与反对唐纳德·特朗普的立场关联。

链接: https://arxiv.org/abs/2510.20154
作者: Anthony Dubreuil,Antoine Gourru,Christine Largeron,Amine Trabelsi
机构: Laboratoire Hubert Curien, UMR CNRS 5516, Saint-Etienne, France; Department of Computer Science, Université de Sherbrooke, Canada
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted in EMNLP 2025 (Main)

Abstract:Large Language Models inherit stereotypes from their pretraining data, leading to biased behavior toward certain social groups in many Natural Language Processing tasks, such as hateful speech detection or sentiment analysis. Surprisingly, the evaluation of this kind of bias in stance detection methods has been largely overlooked by the community. Stance Detection involves labeling a statement as being against, in favor, or neutral towards a specific target and is among the most sensitive NLP tasks, as it often relates to political leanings. In this paper, we focus on the bias of Large Language Models when performing stance detection in a zero-shot setting. We automatically annotate posts in pre-existing stance detection datasets with two attributes: dialect or vernacular of a specific group and text complexity/readability, to investigate whether these attributes influence the model’s stance detection decisions. Our results show that LLMs exhibit significant stereotypes in stance detection tasks, such as incorrectly associating pro-marijuana views with low text complexity and African American dialect with opposition to Donald Trump.

[NLP-58] BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary Generation

【Quick Read】: This paper targets the difficulty of segmenting complex structured texts (such as technical reports and generative AI prompts) into semantically meaningful units: such texts often contain tables, code snippets, and placeholders that sentence- or paragraph-level segmentation cannot handle. The proposed BoundRL jointly performs token-level segmentation and label prediction, and outputs only the sequence of starting tokens for each segment rather than its full content, cutting inference cost by orders of magnitude and reducing hallucination. To adapt the model to this output format, BoundRL applies reinforcement learning with verifiable rewards (RLVR), using a reward designed to jointly optimize document reconstruction fidelity and semantic alignment; to mitigate entropy collapse, it systematically perturbs a fraction of generated segment sequences to construct intermediate candidates, improving both performance and generalization. Experiments show it enables small language models to outperform few-shot prompting of much larger ones.

链接: https://arxiv.org/abs/2510.20151
作者: Haoyuan Li,Zhengyuan Shen,Sullam Jeoung,Yueyan Chen,Jiayu Li,Qi Zhu,Shuai Wang,Vassilis Ioannidis,Huzefa Rangwala
机构: University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校); Amazon Web Services (亚马逊网络服务)
类目: Computation and Language (cs.CL)
备注:

Abstract:As structured texts become increasingly complex across diverse domains – from technical reports to generative AI prompts – the need for text segmentation into semantically meaningful components becomes critical. Such texts often contain elements beyond plain language, including tables, code snippets, and placeholders, which conventional sentence- or paragraph-level segmentation methods cannot handle effectively. To address this challenge, we propose BoundRL, a novel and efficient approach that jointly performs token-level text segmentation and label prediction for long structured texts. Instead of generating complete contents for each segment, it generates only a sequence of starting tokens and reconstructs the complete contents by locating these tokens within the original texts, thereby reducing inference costs by orders of magnitude and minimizing hallucination. To adapt the model for the output format, BoundRL performs reinforcement learning with verifiable rewards (RLVR) with a specifically designed reward that jointly optimizes document reconstruction fidelity and semantic alignment. To mitigate entropy collapse, it further constructs intermediate candidates by systematically perturbing a fraction of generated sequences of segments to create stepping stones toward higher-quality solutions. To demonstrate BoundRL's effectiveness on particularly challenging structured texts, we focus evaluation on complex prompts used for LLM applications. Experiments show that BoundRL enables small language models (1.7B parameters) to outperform few-shot prompting of much larger models. Moreover, RLVR with our designed reward yields significant improvements over supervised fine-tuning, and incorporating intermediate candidates further improves both performance and generalization.
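To make the boundary-generation idea concrete, here is a minimal sketch (our own illustration under simplifying assumptions, not the authors' code): given the ordered (label, start_string) pairs a model emits, full segments are recovered by locating each start string in the original document. Exact string matching and the example labels are assumptions of this sketch.

```python
# Minimal sketch (not the authors' implementation): rebuild full segments from
# predicted (label, start_string) pairs by locating each start string in the
# original document, in order.

def reconstruct_segments(document: str, starts: list[tuple[str, str]]) -> list[tuple[str, str]]:
    found = []
    cursor = 0
    for label, start in starts:
        idx = document.find(start, cursor)
        if idx == -1:
            continue  # a real system would need fuzzy matching here
        found.append((label, idx))
        cursor = idx + len(start)
    segments = []
    # Each segment runs from its start to the next segment's start (or document end).
    for i, (label, offset) in enumerate(found):
        end = found[i + 1][1] if i + 1 < len(found) else len(document)
        segments.append((label, document[offset:end]))
    return segments

doc = "System: you are a helpful bot. Task: summarize the table below. Table: a,b,c"
print(reconstruct_segments(doc, [("role", "System:"), ("instruction", "Task:"), ("data", "Table:")]))
```

Because only the start tokens are generated, decoding cost scales with the number of segments rather than with document length, which is the efficiency argument the abstract makes.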

[NLP-59] AI PB: A Grounded Generative Agent for Personalized Investment Insights

【Quick Read】: This paper tackles the inability of conventional reactive chatbots to proactively deliver trustworthy, compliant, and personalized investment advice in retail finance. The core challenge is deploying generative AI reliably in a high-stakes financial setting so that outputs are grounded, regulation-compliant, and precisely matched to user needs. The solution rests on three components: a component-based orchestration layer that routes between internal and external large language models (LLMs) according to data sensitivity, ensuring safety; a hybrid retrieval pipeline combining OpenSearch with a finance-domain embedding model to improve relevance and accuracy; and a multi-stage recommendation mechanism integrating rule heuristics, sequential behavioral modeling, and contextual bandits for personalized insights. The system is deployed on-premises in a Korean regulatory environment, running on Docker Swarm and the vLLM framework across 24 NVIDIA H100 GPUs, and experiments show that this architecture can deliver trustworthy AI investment advice under strict regulation.

链接: https://arxiv.org/abs/2510.20099
作者: Daewoo Park,Suho Park,Inseok Hong,Hanwool Lee,Junkyu Park,Sangjun Lee,Jeongman An,Hyunbin Loh
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注: Under Review

Abstract:We present AI PB, a production-scale generative agent deployed in real retail finance. Unlike reactive chatbots that answer queries passively, AI PB proactively generates grounded, compliant, and user-specific investment insights. It integrates (i) a component-based orchestration layer that deterministically routes between internal and external LLMs based on data sensitivity, (ii) a hybrid retrieval pipeline using OpenSearch and a finance-domain embedding model, and (iii) a multi-stage recommendation mechanism combining rule heuristics, sequential behavioral modeling, and contextual bandits. Operating fully on-premises under Korean financial regulations, the system employs Docker Swarm and vLLM across 24 NVIDIA H100 GPUs. Through human QA and system metrics, we demonstrate that grounded generation with explicit routing and layered safety can deliver trustworthy AI insights in high-stakes finance.

[NLP-60] Leveraging the Power of Large Language Models in Entity Linking via Adaptive Routing and Targeted Reasoning

【Quick Read】: This paper addresses two problems in entity linking (EL): the dependence on large annotated datasets and deep model fine-tuning, and the efficiency bottleneck of existing LLM-based few-shot methods caused by expensive LLM reasoning. The key is ARTER (Adaptive Routing and Targeted Entity Reasoning), a structured pipeline combining candidate generation, context-based scoring, adaptive routing, and selective reasoning: lightweight signals (embedding- and LLM-based) first classify mentions into easy and hard cases, which are then handled by a low-cost entity linker (such as ReFinED) and a targeted LLM reasoning module, respectively. This preserves high accuracy while markedly improving efficiency: on standard benchmarks ARTER outperforms ReFinED by up to 4.47%, with an average gain of 2.53% on 5 of 6 datasets, while consuming roughly half the LLM tokens.

链接: https://arxiv.org/abs/2510.20098
作者: Yajie Li,Albert Galimov,Mitra Datta Ganapaneni,Pujitha Thejaswi,De Meng,Priyanshu Kumar,Saloni Potdar
机构: University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校); Apple (苹果)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Entity Linking (EL) has traditionally relied on large annotated datasets and extensive model fine-tuning. While recent few-shot methods leverage large language models (LLMs) through prompting to reduce training requirements, they often suffer from inefficiencies due to expensive LLM-based reasoning. ARTER (Adaptive Routing and Targeted Entity Reasoning) presents a structured pipeline that achieves high performance without deep fine-tuning by strategically combining candidate generation, context-based scoring, adaptive routing, and selective reasoning. ARTER computes a small set of complementary signals (both embedding- and LLM-based) over the retrieved candidates to categorize contextual mentions into easy and hard cases. These cases are then handled by a low-computational entity linker (e.g., ReFinED) and more expensive targeted LLM-based reasoning, respectively. On standard benchmarks, ARTER outperforms ReFinED by up to +4.47%, with an average gain of +2.53% on 5 out of 6 datasets, and performs comparably to pipelines using LLM-based reasoning for all mentions, while being twice as efficient in terms of the number of LLM tokens.
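The routing step can be pictured with a schematic sketch; the 50/50 signal mix, the threshold tau, and the stub components below are placeholders of ours, not ARTER's actual signals or code.

```python
# Schematic sketch of adaptive routing (illustrative only): high-confidence
# "easy" mentions go to a cheap linker, the rest to targeted LLM reasoning.

def route_mentions(mentions, candidate_gen, embed_score, llm_score,
                   cheap_linker, llm_reasoner, tau=0.8):
    results = {}
    for m in mentions:
        cands = candidate_gen(m)
        # Combine complementary signals into a single confidence estimate.
        conf = 0.5 * embed_score(m, cands) + 0.5 * llm_score(m, cands)
        results[m] = cheap_linker(m, cands) if conf >= tau else llm_reasoner(m, cands)
    return results

# Toy usage with stub components standing in for real retrieval and scoring:
print(route_mentions(
    ["Paris"],
    lambda m: ["Paris_(city)", "Paris_(mythology)"],
    lambda m, c: 0.9, lambda m, c: 0.9,
    lambda m, c: c[0],
    lambda m, c: "LLM picks: " + c[0],
))
```

The design point is that the expensive reasoner only ever sees the mentions the cheap signals could not settle, which is where the claimed token savings come from.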

[NLP-61] BIOCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models

【Quick Read】: This paper addresses the lack of effective natural-language supervision when training biological multimodal foundation models, in particular the difficulty of obtaining faithful, instance-specific descriptive captions at scale. Label-based supervision cannot capture subtle morphological differences between species, while generic-domain text introduces noise and irrelevant semantics. The key idea is to generate domain-specific synthetic captions with multimodal large language models (MLLMs), guided by Wikipedia-derived visual information and taxon-tailored format examples, which reduces hallucination and keeps captions accurate and instance-specific. Training on these captions yields BIOCAP (BIOCLIP with Captions), which achieves strong performance on species classification and text-image retrieval, demonstrating the value of descriptive captions beyond labels in bridging biological images and multimodal foundation models.

链接: https://arxiv.org/abs/2510.20095
作者: Ziheng Zhang,Xinyue Ma,Arpita Chowdhury,Elizabeth G. Campolongo,Matthew J. Thompson,Net Zhang,Samuel Stevens,Hilmar Lapp,Tanya Berger-Wolf,Yu Su,Wei-Lun Chao,Jianyang Gu
机构: The Ohio State University (俄亥俄州立大学); Duke University (杜克大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Project page: this https URL

Abstract:This work investigates descriptive captions as an additional source of supervision for biological multimodal foundation models. Images and captions can be viewed as complementary samples from the latent morphospace of a species, each capturing certain biological traits. Incorporating captions during training encourages alignment with this shared latent structure, emphasizing potentially diagnostic characters while suppressing spurious correlations. The main challenge, however, lies in obtaining faithful, instance-specific captions at scale. This requirement has limited the utilization of natural language supervision in organismal biology compared with many other scientific domains. We complement this gap by generating synthetic captions with multimodal large language models (MLLMs), guided by Wikipedia-derived visual information and taxon-tailored format examples. These domain-specific contexts help reduce hallucination and yield accurate, instance-based descriptive captions. Using these captions, we train BIOCAP (i.e., BIOCLIP with Captions), a biological foundation model that captures rich semantics and achieves strong performance in species classification and text-image retrieval. These results demonstrate the value of descriptive captions beyond labels in bridging biological images with multimodal foundation models.

[NLP-62] CreativityPrism: A Holistic Benchmark for Large Language Model Creativity

【Quick Read】: This paper addresses the lack of a unified framework for evaluating the creativity of large language models (LLMs): existing methods differ across domains and tasks, leaving evaluation fragmented. The key contribution is CreativityPrism, which decomposes creativity into three dimensions (quality, novelty, and diversity) and builds a systematic evaluation spanning three domains (divergent thinking, creative writing, and logical reasoning), nine tasks, and twenty task-specific metrics, enabling a multi-dimensional, structured assessment of LLM creativity.

链接: https://arxiv.org/abs/2510.20091
作者: Zhaoyi Joey Hou,Bowei Alvin Zhang,Yining Lu,Bhiman Kumar Baghel,Anneliese Brei,Ximing Lu,Meng Jiang,Faeze Brahman,Snigdha Chaturvedi,Haw-Shiuan Chang,Daniel Khashabi,Xiang Lorraine Li
机构: University of Pittsburgh (匹兹堡大学); Johns Hopkins University (约翰霍普金斯大学); University of Notre Dame (圣母大学); University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校); University of Washington (华盛顿大学); Allen Institute for Artificial Intelligence (人工智能研究所); University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:Creativity is often seen as a hallmark of human intelligence. While large language models (LLMs) are increasingly perceived as producing creative text, there is still no holistic framework to evaluate their creativity across diverse scenarios. Existing evaluation methods remain fragmented, with dramatic variation across domains and tasks, largely due to differing definitions and measurements of creativity. Inspired by the hypothesis that creativity is not one fixed idea, we propose CreativityPrism, an evaluation analysis framework that decomposes creativity into three dimensions: quality, novelty, and diversity. CreativityPrism incorporates nine tasks, three domains, i.e., divergent thinking, creative writing, and logical reasoning, and twenty evaluation metrics, which measure each dimension in task-specific, unique ways. We evaluate 17 state-of-the-art (SoTA) proprietary and open-sourced LLMs on CreativityPrism and analyze the performance correlations among different metrics and task domains. Our results reveal a notable gap between proprietary and open-source models. Overall, model performance tends to be highly correlated across tasks within the same domain and less so across different domains. Among evaluation dimensions, diversity and quality metrics show strong correlations - models that perform well on one often excel on the other - whereas novelty exhibits much weaker correlation with either. These findings support our hypothesis that strong performance in one creativity task or dimension does not necessarily generalize to others, underscoring the need for a holistic evaluation of LLM creativity.

[NLP-63] LLMs can hide text in other text of the same length.ipynb

【Quick Read】: This paper asks how a meaningful hidden message can be embedded in another, seemingly unrelated yet coherent and plausible text of identical length, so that covert text can be generated and extracted without changing the visible text's length. The key is a simple, efficient encoding protocol built on the generative ability of large language models (LLMs), which embeds the secret message into the cover text while keeping the whole natural and consistent. Experiments show that even an 8-billion-parameter open-source model can encode and decode a message with high fidelity in seconds on a laptop, revealing a fundamental decoupling of text from authorial intent and raising new challenges for AI safety and for our understanding of what it means for a model to know something.

链接: https://arxiv.org/abs/2510.20075
作者: Antonio Norelli,Michael Bronstein
机构: Project CETI; University Of Oxford
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注: 21 pages, main paper 9 pages

Abstract:A meaningful text can be hidden inside another, completely different yet still coherent and plausible, text of the same length. For example, a tweet containing a harsh political critique could be embedded in a tweet that celebrates the same political leader, or an ordinary product review could conceal a secret manuscript. This uncanny state of affairs is now possible thanks to Large Language Models, and in this paper we present a simple and efficient protocol to achieve it. We show that even modest 8-billion-parameter open-source LLMs are sufficient to obtain high-quality results, and a message as long as this abstract can be encoded and decoded locally on a laptop in seconds. The existence of such a protocol demonstrates a radical decoupling of text from authorial intent, further eroding trust in written communication, already shaken by the rise of LLM chatbots. We illustrate this with a concrete scenario: a company could covertly deploy an unfiltered LLM by encoding its answers within the compliant responses of a safe model. This possibility raises urgent questions for AI safety and challenges our understanding of what it means for a Large Language Model to know something.
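The abstract does not spell out the protocol, but the family of techniques it belongs to can be shown with a toy: hide bits by choosing among continuations that a deterministic "model" deems plausible, so a decoder holding the same model can replay the choices and read the bits back. The two-way candidate table below is our stand-in for an actual LLM's ranked continuations, not the paper's method.

```python
# Toy sketch of the underlying idea (NOT the paper's protocol): each secret bit
# selects one of two equally plausible next tokens; the decoder replays the
# same deterministic table and recovers the bits from the choices made.

CANDS = {  # hypothetical deterministic two-way continuations
    ("the",): ["cat", "dog"],
    ("cat",): ["sat", "slept"], ("dog",): ["sat", "slept"],
    ("sat",): ["quietly", "today"], ("slept",): ["quietly", "today"],
}

def encode(bits):
    text, prev = ["the"], ("the",)
    for b in bits:
        tok = CANDS[prev][b]   # the secret bit picks the continuation
        text.append(tok)
        prev = (tok,)
    return " ".join(text)

def decode(text):
    toks = text.split()
    bits, prev = [], (toks[0],)
    for tok in toks[1:]:
        bits.append(CANDS[prev].index(tok))  # recover the bit from the choice
        prev = (tok,)
    return bits

msg = [1, 0, 1]
cover = encode(msg)
assert decode(cover) == msg
print(cover)  # "the dog sat today"
```

With a real LLM the candidate sets come from the model's own ranked continuations, so the cover text stays fluent while its length tracks the number of hidden bits.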

[NLP-64] Enhancing Reasoning Skills in Small Persian Medical Language Models Can Outperform Large-Scale Data Training

【Quick Read】: This paper addresses the weak reasoning ability of small language models in specialized domains such as medical question answering, especially for low-resource languages like Persian. The key is to combine Reinforcement Learning with AI Feedback (RLAIF) and Direct Preference Optimization (DPO): rejected-preferred answer pairs are generated for training, and a teacher-student setup prompts both models to produce Chain-of-Thought (CoT) reasoning traces, yielding a high-quality preference dataset (about 2 million tokens of preferred answers and 2.5 million of rejected ones) for fine-tuning. Experiments show this substantially improves the model's medical reasoning in Persian, outperforming a predecessor trained on roughly 57 million tokens and highlighting the efficiency of reasoning-focused training under data constraints.

链接: https://arxiv.org/abs/2510.20059
作者: Mehrdad Ghassabi,Sadra Hakim,Hamidreza Baradaran Kashani,Pedram Rostami
机构: 未知
类目: Computation and Language (cs.CL)
备注: 6 pages, 4 figures

Abstract:Enhancing reasoning capabilities in small language models is critical for specialized applications such as medical question answering, particularly in underrepresented languages like Persian. In this study, we employ Reinforcement Learning with AI Feedback (RLAIF) and Direct preference optimization (DPO) to improve the reasoning skills of a general-purpose Persian language model. To achieve this, we translated a multiple-choice medical question-answering dataset into Persian and used RLAIF to generate rejected-preferred answer pairs, which are essential for DPO training. By prompting both teacher and student models to produce Chain-of-Thought (CoT) reasoning responses, we compiled a dataset containing correct and incorrect reasoning trajectories. This dataset, comprising 2 million tokens in preferred answers and 2.5 million tokens in rejected ones, was used to train a baseline model, significantly enhancing its medical reasoning capabilities in Persian. Remarkably, the resulting model outperformed its predecessor, gaokerena-V, which was trained on approximately 57 million tokens, despite leveraging a much smaller dataset. These results highlight the efficiency and effectiveness of reasoning-focused training approaches in developing domain-specific language models with limited data availability.
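For reference, the preferred/rejected pairs described above feed the standard Direct Preference Optimization objective (the textbook DPO loss from the literature, not a formula quoted in this abstract):

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]

where y_w and y_l are the preferred and rejected CoT answers, \pi_{\mathrm{ref}} is the frozen reference model, and \beta controls how far the policy may drift from the reference.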

[NLP-65] From Facts to Folklore: Evaluating Large Language Models on Bengali Cultural Knowledge

【Quick Read】: This paper addresses the underrepresentation of low-resource cultures in current large language models (LLMs), whose cross-cultural evaluations capture the details of non-mainstream cultures such as Bengali only weakly. The key contribution is the Bengali Language Cultural Knowledge (BLanCK) dataset, covering cultural dimensions such as folk traditions, culinary arts, and regional dialects. Empirically, multilingual models perform well on non-cultural tasks but degrade markedly on culture-related ones; providing context substantially improves all models, underscoring the importance of context-aware architectures and culturally curated training data.

链接: https://arxiv.org/abs/2510.20043
作者: Nafis Chowdhury,Moinul Haque,Anika Ahmed,Nazia Tasnim,Md. Istiak Hossain Shihab,Sajjadur Rahman,Farig Sadeque
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 4 pages

Abstract:Recent progress in NLP research has demonstrated remarkable capabilities of large language models (LLMs) across a wide range of tasks. While recent multilingual benchmarks have advanced cultural evaluation for LLMs, critical gaps remain in capturing the nuances of low-resource cultures. Our work addresses these limitations through a Bengali Language Cultural Knowledge (BLanCK) dataset including folk traditions, culinary arts, and regional dialects. Our investigation of several multilingual language models shows that while these models perform well in non-cultural categories, they struggle significantly with cultural knowledge; performance improves substantially across all models when context is provided, emphasizing the need for context-aware architectures and culturally curated training data.

[NLP-66] Beyond One-Way Influence: Bidirectional Opinion Dynamics in Multi-Turn Human-LLM Interactions

【Quick Read】: This paper examines the poorly understood bidirectional influence between generative AI and users: how user input shifts a large language model's (LLM) responses, and how this dynamic unfolds over multi-turn conversations. Prior work mostly studies the one-way influence of LLMs on user opinions, overlooking users' effect on LLM outputs and its evolution. The study designs three conditions (static statements, a standard chatbot, and a personalized chatbot) and runs 50 controversial-topic discussions with 266 participants, analyzing how human and LLM stances shift across turns. It finds that personalization amplifies bidirectional stance shifts, and that exchanges involving users' personal stories are most likely to trigger stance changes on both sides, revealing an over-alignment risk and the need to design personalized chatbots carefully for more stable, deliberate human-AI alignment.

链接: https://arxiv.org/abs/2510.20039
作者: Yuyang Jiang,Longjie Guo,Yuchen Wu,Aylin Caliskan,Tanu Mitra,Hua Shen
机构: University of Chicago (芝加哥大学); New York University (纽约大学); University of Washington (华盛顿大学); New York University Shanghai (纽约大学上海分校)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 26 pages, 8 figures

Abstract:Large language model (LLM)-powered chatbots are increasingly used for opinion exploration. Prior research examined how LLMs alter user views, yet little work extended beyond one-way influence to address how user input can affect LLM responses and how such bi-directional influence manifests throughout the multi-turn conversations. This study investigates this dynamic through 50 controversial-topic discussions with participants (N=266) across three conditions: static statements, standard chatbot, and personalized chatbot. Results show that human opinions barely shifted, while LLM outputs changed more substantially, narrowing the gap between human and LLM stance. Personalization amplified these shifts in both directions compared to the standard setting. Analysis of multi-turn conversations further revealed that exchanges involving participants’ personal stories were most likely to trigger stance changes for both humans and LLMs. Our work highlights the risk of over-alignment in human-LLM interaction and the need for careful design of personalized chatbots to more thoughtfully and stably align with users.

[NLP-67] ToolScope: Enhancing LLM Agent Tool Use through Tool Merging and Context-Aware Filtering

【Quick Read】: This paper targets two obstacles that large language model (LLM) agents face when using external tools for complex tasks: real-world toolsets often contain redundant tools with overlapping names and descriptions, causing selection ambiguity, and limited input context makes it inefficient for LLMs to consider large toolsets. The proposed ToolScope framework has two core components: (1) ToolScopeMerger with Auto-Correction, which automatically audits and fixes tool merges to reduce redundancy, and (2) ToolScopeRetriever, which ranks and filters tools so that only those most relevant to each query are kept, compressing the toolset to fit the context limit while maintaining or even improving selection accuracy. Experiments on three state-of-the-art LLMs and three open-source tool-use benchmarks show tool selection accuracy gains of 8.38% to 38.6%.

链接: https://arxiv.org/abs/2510.20036
作者: Marianne Menglin Liu,Daniel Garcia,Fjona Parllaku,Vikas Upadhyay,Syed Fahad Allam Shah,Dan Roth
机构: Oracle AI(奥拉克人工智能)
类目: Computation and Language (cs.CL); Software Engineering (cs.SE)
备注: Preprint under review

Abstract:Large language model (LLM) agents rely on external tools to solve complex tasks, but real-world toolsets often contain redundant tools with overlapping names and descriptions, introducing ambiguity and reducing selection accuracy. LLMs also face strict input context limits, preventing efficient consideration of large toolsets. To address these challenges, we propose ToolScope, which includes: (1) ToolScopeMerger with Auto-Correction to automatically audit and fix tool merges, reducing redundancy, and (2) ToolScopeRetriever to rank and select only the most relevant tools for each query, compressing toolsets to fit within context limits without sacrificing accuracy. Evaluations on three state-of-the-art LLMs and three open-source tool-use benchmarks show gains of 8.38% to 38.6% in tool selection accuracy, demonstrating ToolScope’s effectiveness in enhancing LLM tool use.
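As a rough illustration of the retrieval side (our own toy; the abstract does not disclose ToolScopeRetriever's actual scoring), the simplest version of context-budget tool filtering ranks tool descriptions by cosine similarity to the query embedding and keeps the top k:

```python
# Minimal sketch of context-aware tool filtering: rank tools by cosine
# similarity between a query embedding and tool-description embeddings,
# keeping only the k best so the toolset fits the context window.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def select_tools(query_vec, tools, k=3):
    """tools: list of (name, description_vec); returns the k most relevant names."""
    ranked = sorted(tools, key=lambda t: cosine(query_vec, t[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

tools = [("get_weather", [0.9, 0.1]), ("send_email", [0.1, 0.9]), ("get_forecast", [0.8, 0.2])]
print(select_tools([1.0, 0.0], tools, k=2))  # ['get_weather', 'get_forecast']
```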

[NLP-68] Improving Transfer Learning for Sequence Labeling Tasks by Adapting Pre-trained Neural Language Models

【Quick Read】: This doctoral thesis improves transfer learning for sequence labeling by adapting pre-trained neural language models, through three targeted contributions: a multi-task model that incorporates an additional signal from a domain-independent text processing system to strengthen domain transfer; architectural modifications that enable bidirectional information flow across the layers of autoregressive large language models, improving their modeling of sequence structure; and a framework for generative supervised in-context fine-tuning combined with response-oriented adaptation strategies, letting autoregressive LLMs act as text generators for sequence labeling. Together these show that pre-trained language models achieve their best sequence labeling performance when adapted through targeted transfer learning paradigms.

链接: https://arxiv.org/abs/2510.20033
作者: David Dukić
机构: 未知
类目: Computation and Language (cs.CL)
备注:

Abstract:This doctoral thesis improves the transfer learning for sequence labeling tasks by adapting pre-trained neural language models. The proposed improvements in transfer learning involve introducing a multi-task model that incorporates an additional signal, a method based on architectural modifications in autoregressive large language models, and a sequence labeling framework for autoregressive large language models utilizing supervised in-context fine-tuning combined with response-oriented adaptation strategies. The first improvement is given in the context of domain transfer for the event trigger detection task. The domain transfer of the event trigger detection task can be improved by incorporating an additional signal obtained from a domain-independent text processing system into a multi-task model. The second improvement involves modifying the model’s architecture. For that purpose, a method is proposed to enable bidirectional information flow across layers of autoregressive large language models. The third improvement utilizes autoregressive large language models as text generators through a generative supervised in-context fine-tuning framework. The proposed model, method, and framework demonstrate that pre-trained neural language models achieve their best performance on sequence labeling tasks when adapted through targeted transfer learning paradigms.

[NLP-69] Forging GEMs: Advancing Greek NLP through Quality-Based Corpus Curation and Specialized Pre-training

【Quick Read】: This paper addresses the performance bottlenecks in NLP for Modern Greek caused by fragmented research, limited architectural diversity, and reliance on short-context models, especially in high-value specialized domains such as law, where existing models are mostly early Transformer architectures with 512-token windows that cannot handle long legal documents effectively. The key is a set of large, high-quality Greek corpora built with rigorous quality-driven filtering and preprocessing over general-domain and legal sources; on this foundation the authors pre-train and systematically evaluate modern Transformer architectures such as ELECTRA, ConvBERT, and ModernBERT, applied to Greek for the first time, and propose the first bilingual Greek-English embedding models for the legal domain. Downstream experiments show that GEM-RoBERTa and GEM-ConvBERT significantly outperform existing baselines, validating the approach.

链接: https://arxiv.org/abs/2510.20002
作者: Alexandra Apostolopoulou,Konstantinos Kanaris,Athanasios Koursaris,Dimitris Tsakalidis,George Domalis,Ioannis E. Livieris
机构: Novelcore; University of Piraeus (比雷埃夫斯大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:The advancement of natural language processing for morphologically rich, moderately-resourced languages like Modern Greek is often hindered by a fragmented research landscape, a lack of architectural diversity and reliance on limited context-length models. This is particularly true in specialized, high-value domains such as law, where existing models are frequently confined to early transformer architectures with a restrictive 512-token window, insufficient for analyzing long legal documents. To address these challenges, this paper presents Greek Embedding Models, a new family of transformer models for the Greek language built upon a foundation of extensive, quality-driven data curation. We detail the construction of several large-scale Greek corpora, emphasizing a rigorous, quality-based filtering and preprocessing methodology to create high-value training datasets from both general-domain and specialized legal sources. On this carefully curated foundation, we pre-train and systematically evaluate a diverse suite of modern architectures that have not previously been applied to Greek, such as ELECTRA, ConvBERT and ModernBERT. Furthermore, we propose the first bilingual Greek-English Embedding Models tailored for the legal domain. The extensive experiments on downstream tasks establish the effectiveness of the proposed approach, highlighting that the GEM-RoBERTa and GEM-ConvBERT models significantly outperform existing baselines.

[NLP-70] Beyond MedQA: Towards Real-world Clinical Decision Making in the Era of LLMs

【Quick Read】: This paper addresses the limitations of current evaluations of large language models (LLMs) in clinical settings: datasets such as MedQA use simplified question answering (QA) that fails to reflect the complexity of real clinical decision-making. The key is a unifying paradigm that characterizes clinical decision-making tasks along two dimensions, Clinical Backgrounds and Clinical Questions, which together determine a task's realism and difficulty: the closer the background and questions come to the real clinical environment, the harder the task. This paradigm helps systematize the settings of existing datasets and benchmarks, clarifies where training-time and test-time techniques actually help, and extends evaluation beyond accuracy to efficiency and explainability, guiding the development of clinically meaningful LLMs.

链接: https://arxiv.org/abs/2510.20001
作者: Yunpeng Xiao,Carl Yang,Mark Mai,Xiao Hu,Kai Shu
机构: Emory University (埃默里大学); Children’s Healthcare of Atlanta (儿童健康亚特兰大医院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 3 figures

Abstract:Large language models (LLMs) show promise for clinical use. They are often evaluated using datasets such as MedQA. However, many medical datasets, including MedQA, rely on simplified Question-Answering (Q&A) that underrepresents real-world clinical decision-making. Based on this, we propose a unifying paradigm that characterizes clinical decision-making tasks along two dimensions: Clinical Backgrounds and Clinical Questions. As the background and questions approach the real clinical environment, the difficulty increases. We summarize the settings of existing datasets and benchmarks along these two dimensions. Then we review methods to address clinical decision-making, including training-time and test-time techniques, and summarize when they help. Next, we extend evaluation beyond accuracy to include efficiency and explainability. Finally, we highlight open challenges. Our paradigm clarifies assumptions, standardizes comparisons, and guides the development of clinically meaningful LLMs.

[NLP-71] A Fundamental Algorithm for Dependency Parsing (With Corrections)

【Quick Read】: This paper addresses parsing natural language sentences into dependency trees, i.e., structuring the words of a sentence by their grammatical dependency relations. Unlike traditional phrase-structure (constituency) parsing, the algorithm's key property is word-by-word processing: each word is attached to the structure built so far as soon as it is read, mirroring the immediacy claimed for language processing in the human brain. Although its worst-case time complexity is still O(n^3), in actual human language the worst case arises only for small sentence lengths n, making the algorithm both practical and cognitively plausible.

链接: https://arxiv.org/abs/2510.19996
作者: Michael A. Covington
机构: Institute for Artificial Intelligence (人工智能研究所); The University of Georgia (佐治亚大学)
类目: Computation and Language (cs.CL)
备注: Corrected version of an already widely cited paper

Abstract:This paper presents a fundamental algorithm for parsing natural language sentences into dependency trees. Unlike phrase-structure (constituency) parsers, this algorithm operates one word at a time, attaching each word as soon as it can be attached, corresponding to properties claimed for the parser in the human brain. Like phrase-structure parsing, its worst-case complexity is O(n^3) , but in human language, the worst case occurs only for small n .
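A toy rendition of the word-at-a-time strategy may help (a simplification of ours for illustration; the paper states the algorithm and its attachment conditions far more carefully): as each word is read, it is compared with the words to its left and linked as dependent or head wherever a grammar predicate allows, with each word keeping a single head.

```python
# Toy sketch of incremental dependency attachment: each new word is compared
# with every previous word, attaching where `may_depend(head, dep)` permits.

def parse(words, may_depend):
    """Returns a list of (head_index, dep_index) links."""
    head_of = {}                        # each word gets at most one head
    links = []
    for j in range(1, len(words)):      # read words left to right
        for i in range(j - 1, -1, -1):  # try to link the new word leftwards
            if j not in head_of and may_depend(words[i], words[j]):
                head_of[j] = i
                links.append((i, j))
            elif i not in head_of and may_depend(words[j], words[i]):
                head_of[i] = j
                links.append((j, i))
    return links

# Toy grammar: which (head, dependent) pairs are allowed.
rules = {("sat", "cat"), ("cat", "the"), ("sat", "quickly")}
print(parse(["the", "cat", "sat", "quickly"], lambda h, d: (h, d) in rules))
# [(1, 0), (2, 1), (2, 3)]: the<-cat, cat<-sat, quickly<-sat
```

The leftward scan after each new word is what gives incremental parsers of this family their characteristic cost profile: cheap on the short sentences that dominate human language, with the cubic worst case reserved for pathological inputs.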

[NLP-72] Communication to Completion: Modeling Collaborative Workflows with Intelligent Multi-Agent Communication

【Quick Read】: This paper addresses the lack of systematic, task-oriented communication strategies in multi-agent large language model (LLM) systems collaborating on complex tasks. The proposed Communication to Completion (C2C) framework rests on two core innovations: (1) the Alignment Factor (AF), a new metric quantifying how well an agent is aligned with the task, which directly affects work efficiency; and (2) a Sequential Action Framework that couples stepwise execution with intelligent communication decisions, letting agents make cost-aware communication choices and dynamically improve task understanding through targeted interactions.

链接: https://arxiv.org/abs/2510.19995
作者: Yiming Lu,Xun Wang,Simin Ma,Shujian Liu,Sathish Reddy Indurthi,Song Wang,Haoyun Deng,Fei Liu,Kaiqiang Song
机构: Emory University (埃默里大学); Zoom Video Communications (Zoom 视频通信公司)
类目: Multiagent Systems (cs.MA); Computation and Language (cs.CL)
备注: 13 pages

Abstract:Teamwork on complex tasks requires diverse communication strategies, but current multi-agent LLM systems lack systematic frameworks for task-oriented communication. We introduce Communication to Completion (C2C), a scalable framework that addresses this gap through two key innovations: (1) the Alignment Factor (AF), a novel metric quantifying agent task alignment that directly impacts work efficiency, and (2) a Sequential Action Framework that integrates stepwise execution with intelligent communication decisions. C2C enables agents to make cost-aware communication choices, dynamically improving task understanding through targeted interactions. We evaluated C2C on realistic coding workflows across three complexity tiers and team sizes from 5 to 17 agents, comparing against no-communication and fixed-steps baselines. The results show that C2C reduces the task completion time by about 40% with acceptable communication costs. The framework completes all tasks successfully in standard configurations and maintains effectiveness at scale. C2C establishes both a theoretical foundation for measuring communication effectiveness in multi-agent systems and a practical framework for complex collaborative tasks.
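The abstract does not define the Alignment Factor numerically, so the snippet below is purely our own stand-in to show the shape of a cost-aware communication decision: an agent messages a teammate only when the expected efficiency gain from closing its alignment gap exceeds the message cost.

```python
# Illustrative sketch only: the AF semantics, the [0, 1] range, and the
# gain/cost comparison below are our assumptions, not C2C's actual rule.

def should_communicate(alignment: float, msg_cost: float, expected_gain: float) -> bool:
    """Communicate when the expected gain from improved alignment beats the cost."""
    misalignment = 1.0 - alignment   # AF assumed in [0, 1], 1 = fully aligned
    return misalignment * expected_gain > msg_cost

# A poorly aligned agent on a high-value step asks a clarifying question;
# a well-aligned agent proceeds without one.
print(should_communicate(0.4, msg_cost=1.0, expected_gain=5.0))   # True
print(should_communicate(0.95, msg_cost=1.0, expected_gain=5.0))  # False
```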

[NLP-73] LLM -Augmented Symbolic NLU System for More Reliable Continuous Causal Statement Interpretation

【Quick Read】: This paper addresses the hallucination and inconsistent output structure caused by probabilistic inference in large language models (LLMs) on natural language understanding (NLU) tasks, while also overcoming the limited coverage of symbolic NLU systems and their heavy dependence on scarce knowledge representation and linguistics expertise. The key is a hybrid approach: LLMs handle rephrasing and text simplification to provide broad linguistic coverage and automatically fill knowledge gaps, while symbolic NLU produces structured relational representations usable for reasoning and incremental, debuggable learning, combining the strengths of both. On the task of extracting and interpreting quantities and causal laws from commonsense science texts, the hybrid method significantly outperforms a symbolic-only pipeline.

链接: https://arxiv.org/abs/2510.19988
作者: Xin Lian,Kenneth D. Forbus
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, 2 figures

Abstract:Despite the broad applicability of large language models (LLMs), their reliance on probabilistic inference makes them vulnerable to errors such as hallucination in generated facts and inconsistent output structure in natural language understanding (NLU) tasks. By contrast, symbolic NLU systems provide interpretable understanding grounded in curated lexicons, semantic resources, and syntactic-semantic interpretation rules. They produce relational representations that can be used for accurate reasoning and planning, as well as incremental, debuggable learning. However, symbolic NLU systems tend to be more limited in coverage than LLMs and require scarce knowledge representation and linguistics skills to extend and maintain. This paper explores a hybrid approach that integrates the broad-coverage language processing of LLMs with the symbolic NLU capability of producing structured relational representations, to hopefully get the best of both approaches. We use LLMs for rephrasing and text simplification, to provide broad coverage, and as a source of information to fill in knowledge gaps more automatically. We use symbolic NLU to produce representations that can be used for reasoning and for incremental learning. We evaluate this approach on the task of extracting and interpreting quantities and causal laws from commonsense science texts, along with symbolic- and LLM-only pipelines. Our results suggest that our hybrid method works significantly better than the symbolic-only pipeline.

[NLP-74] LyriCAR: A Difficulty-Aware Curriculum Reinforcement Learning Framework For Controllable Lyric Translation ICASSP2026

【Quick Read】: This paper addresses the difficulty of balancing musical constraints with linguistic coherence in lyric translation, especially cross-line consistency and global rhyme at the paragraph level. Existing methods rely on hand-crafted rules and sentence-level modeling, which struggle to internalize musical-linguistic patterns and to generalize. The key is LyriCAR, a fully unsupervised framework for controllable lyric translation built on two innovations: (1) a difficulty-aware curriculum designer and (2) an adaptive curriculum strategy that raises training complexity gradually, allocating resources efficiently, accelerating convergence, and improving overall translation quality. On English-Chinese lyric translation it surpasses strong baselines on both standard translation metrics and multi-dimensional reward scores, and the adaptive curriculum cuts training steps by nearly 40% while maintaining superior performance.

链接: https://arxiv.org/abs/2510.19967
作者: Le Ren,Xiangjian Zeng,Qingqiang Wu,Ruoxuan Liang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: submitted to ICASSP 2026

Abstract:Lyric translation is a challenging task that requires balancing multiple musical constraints. Existing methods often rely on hand-crafted rules and sentence-level modeling, which restrict their ability to internalize musical-linguistic patterns and to generalize effectively at the paragraph level, where cross-line coherence and global rhyme are crucial. In this work, we propose LyriCAR, a novel framework for controllable lyric translation that operates in a fully unsupervised manner. LyriCAR introduces a difficulty-aware curriculum designer and an adaptive curriculum strategy, ensuring efficient allocation of training resources, accelerating convergence, and improving overall translation quality by guiding the model with increasingly complex challenges. Extensive experiments on the EN-ZH lyric translation task show that LyriCAR achieves state-of-the-art results across both standard translation metrics and multi-dimensional reward scores, surpassing strong baselines. Notably, the adaptive curriculum strategy reduces training steps by nearly 40% while maintaining superior performance. Code, data and model can be accessed at this https URL.

[NLP-75] Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation

【Quick Read】: This paper studies how agents built on pretrained large language models (LLMs) can learn a target classification function from labeled examples without parameter updates, since traditional fine-tuning is costly, inflexible, and opaque. The key is a memory-augmented framework driven jointly by labeled data and LLM-generated critiques: episodic memory stores instance-level critiques that capture concrete past experiences, while semantic memory distills them into reusable task-level guidance. Empirically, incorporating critiques improves accuracy by up to 24.8% over label-only retrieval-augmented (RAG-style) baselines; the study also uncovers behavioral differences between OpenAI and open-source models on fact-oriented versus preference-based data, and introduces a new metric, suggestibility, to analyze how supervision encoded in memory shapes learning dynamics.

链接: https://arxiv.org/abs/2510.19897
作者: Jackson Hassell,Dan Zhang,Hannah Kim,Tom Mitchell,Estevam Hruschka
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages

Abstract:We investigate how agents built on pretrained large language models can learn target classification functions from labeled examples without parameter updates. While conventional approaches like fine-tuning are often costly, inflexible, and opaque, we propose a memory-augmented framework that leverages both labeled data and LLM-generated critiques. Our framework uses episodic memory to store instance-level critiques-capturing specific past experiences-and semantic memory to distill these into reusable, task-level guidance. Across a diverse set of tasks, incorporating critiques yields up to a 24.8 percent accuracy improvement over retrieval-based (RAG-style) baselines that rely only on labels. Through extensive empirical evaluation, we uncover distinct behavioral differences between OpenAI and open-source models, particularly in how they handle fact-oriented versus preference-based data. To interpret how models respond to different representations of supervision encoded in memory, we introduce a novel metric, suggestibility. This helps explain observed behaviors and illuminates how model characteristics and memory strategies jointly shape learning dynamics. Our findings highlight the promise of memory-driven, reflective learning for building more adaptive and interpretable LLM agents.
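A minimal sketch of the two-tier memory may help; the class name, the distillation schedule, and the `llm` callable below are our assumptions, not the paper's API.

```python
# Our own toy of the two-tier design: episodic memory stores per-example
# critiques; semantic memory periodically distills them into short
# task-level guidance via an assumed `llm` callable.

class ReflectiveMemory:
    def __init__(self, llm, distill_every=5):
        self.llm = llm
        self.episodic = []        # (example, label, critique) triples
        self.semantic = ""        # distilled task-level guidance
        self.distill_every = distill_every

    def add(self, example, label, critique):
        self.episodic.append((example, label, critique))
        if len(self.episodic) % self.distill_every == 0:
            notes = "\n".join(c for _, _, c in self.episodic)
            self.semantic = self.llm(f"Summarize reusable rules from:\n{notes}")

    def prompt_context(self):
        recent = self.episodic[-3:]   # a few concrete past experiences
        return {"guidance": self.semantic, "examples": recent}

mem = ReflectiveMemory(llm=lambda p: "Prefer cited facts over fluent guesses.",
                       distill_every=2)
mem.add("Q1", "A", "Model ignored the date in the question.")
mem.add("Q2", "B", "Model trusted an unsupported claim.")
print(mem.prompt_context())
```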

[NLP-76] Large Language Model enabled Mathematical Modeling

【Quick Read】: This paper addresses the formulation gap in operations research (OR): using natural language understanding and code generation to translate real-world decision problems efficiently and accurately into solvable mathematical optimization models, a step that traditionally depends heavily on expert knowledge to define objectives, constraints, and variables. The key is a systematic evaluation of DeepSeek-R1 on four OR benchmarks (NL4OPT, IndustryOR, EasyLP, and ComplexOR), combined with strategies to reduce hallucination and improve formulation accuracy, including LLM-as-a-Judge, Few-shot Learning (FSL), Tool Calling, and a Multi-agent Framework, thereby better aligning model outputs with user intent and improving practical applicability in settings such as supply chains.

链接: https://arxiv.org/abs/2510.19895
作者: Guoyun Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:The integration of Large Language Models (LLMs) with optimization modeling offers a promising avenue for advancing decision-making in operations research (OR). Traditional optimization methods, such as linear programming, mixed integer programming, and simulation, depend heavily on domain expertise to translate real-world problems into solvable mathematical models. While solvers like Gurobi and COPT are powerful, expert input remains essential for defining objectives, constraints, and variables. This research investigates the potential of LLMs, specifically the DeepSeek-R1 model, to bridge this formulation gap using natural language understanding and code generation. Although prior models like GPT-4, Claude, and Bard have shown strong performance in NLP and reasoning tasks, their high token costs and tendency toward hallucinations limit real-world applicability in supply chain contexts. In contrast, DeepSeek-R1, a cost-efficient and high-performing model trained with reinforcement learning, presents a viable alternative. Despite its success in benchmarks such as LiveCodeBench and Math-500, its effectiveness in applied OR scenarios remains underexplored. This study systematically evaluates DeepSeek-R1 across four key OR benchmarks: NL4OPT, IndustryOR, EasyLP, and ComplexOR. Our methodology includes baseline assessments, the development of a hallucination taxonomy, and the application of mitigation strategies like LLM-as-a-Judge, Few-shot Learning (FSL), Tool Calling, and a Multi-agent Framework. These techniques aim to reduce hallucinations, enhance formulation accuracy, and better align model outputs with user intent.

[NLP-77] Can They Dixit? Yes they Can! Dixit as a Playground for Multimodal Language Model Capabilities EMNLP2025

【Quick Read】: This paper addresses two problems in evaluating multimodal large language models (MLMs): existing evaluations rely on static, isolated benchmarks that cannot jointly assess multiple capabilities in a single task, while pairwise human or model comparisons are subjective, costly, and let models inflate win rates through superficial tricks such as verbosity. The key solution is game-based evaluation: games demand multiple abilities at once, are inherently competitive, and are governed by fixed, objective rules, enabling more objective, holistic, and engaging assessment. Instantiated with the fantasy card game Dixit, experiments with five MLMs show that Dixit win-rate rankings correlate perfectly with popular MLM benchmarks, while human-versus-model games reveal differences in strategy and directions for improving MLM reasoning.

链接: https://arxiv.org/abs/2510.19892
作者: Nishant Balepur,Dang Nguyen,Dayeon Ki
机构: University of Maryland (马里兰大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted as a Spotlight paper at the EMNLP 2025 Wordplay Workshop

Abstract:Multi-modal large language models (MLMs) are often assessed on static, individual benchmarks – which cannot jointly assess MLM capabilities in a single task – or rely on human or model pairwise comparisons – which are highly subjective, expensive, and allow models to exploit superficial shortcuts (e.g., verbosity) to inflate their win-rates. To overcome these issues, we propose game-based evaluations to holistically assess MLM capabilities. Games require multiple abilities for players to win, are inherently competitive, and are governed by fixed, objective rules, making evaluation more engaging and providing a robust framework to address the aforementioned challenges. We manifest this evaluation specifically through Dixit, a fantasy card game where players must generate captions for a card that trick some, but not all players, into selecting the played card. Our quantitative experiments with five MLMs show Dixit win-rate rankings are perfectly correlated with those on popular MLM benchmarks, while games between human and MLM players in Dixit reveal several differences between agent strategies and areas of improvement for MLM reasoning.

[NLP-78] An Expert-grounded benchmark of General Purpose LLM s in LCA

【Quick Read】: This paper addresses the lack of systematic evaluation and reliability evidence for generative AI in life cycle assessment (LCA), a field with no standardized evaluation framework, clear ground truth, or consensus protocols. The key is the first expert-grounded benchmark for LLMs in LCA: eleven general-purpose large language models are evaluated on 22 LCA-related tasks, with 17 experienced practitioners scoring outputs on criteria central to LCA practice, including scientific accuracy, explanation quality, robustness, verifiability, and adherence to instructions, yielding 168 expert reviews in total. The results expose both the strengths and the risks of applying LLMs in LCA scenarios and provide an empirical basis for safe, reliable integration of AI tools.

链接: https://arxiv.org/abs/2510.19886
作者: Artur Donaldson,Bharathan Balaji,Cajetan Oriekezie,Manish Kumar,Laure Patouillard
机构: 未知
类目: Computation and Language (cs.CL)
备注:

Abstract:Purpose: Artificial intelligence (AI), and in particular large language models (LLMs), are increasingly being explored as tools to support life cycle assessment (LCA). While demonstrations exist across environmental and social domains, systematic evidence on their reliability, robustness, and usability remains limited. This study provides the first expert-grounded benchmark of LLMs in LCA, addressing the absence of standardized evaluation frameworks in a field where no clear ground truth or consensus protocols exist. Methods: We evaluated eleven general-purpose LLMs, spanning both commercial and open-source families, across 22 LCA-related tasks. Seventeen experienced practitioners reviewed model outputs against criteria directly relevant to LCA practice, including scientific accuracy, explanation quality, robustness, verifiability, and adherence to instructions. We collected 168 expert reviews. Results: Experts judged 37% of responses to contain inaccurate or misleading information. Ratings of accuracy and quality of explanation were generally rated average or good on many models, even smaller models, and format adherence was generally rated favourably. Hallucination rates varied significantly, with some models producing hallucinated citations at rates of up to 40%. There was no clear-cut distinction between ratings on open-weight versus closed-weight LLMs, with open-weight models outperforming or competing on par with closed-weight models on criteria such as accuracy and quality of explanation. Conclusion: These findings highlight the risks of applying LLMs naïvely in LCA, such as when LLMs are treated as free-form oracles, while also showing benefits especially around quality of explanation and alleviating labour intensiveness of simple tasks. The use of general-purpose LLMs without grounding mechanisms presents …

[NLP-79] Automated HIV Screening on Dutch EHR with Large Language Models

【Quick Read】: This paper addresses the inefficiency of early HIV screening when large-scale laboratory testing is infeasible, asking how information in electronic health records (EHRs) can be used to improve diagnostic accuracy. The key is a novel pipeline in which a large language model (LLM) analyzes unstructured EHR text such as clinical notes to determine whether a patient is eligible for further HIV testing, effectively mining risk information that traditional machine learning methods overlook while achieving high accuracy and a low false negative rate in experiments.

链接: https://arxiv.org/abs/2510.19879
作者: Lang Zhou,Amrish Jhingoer,Yinghao Luo,Klaske Vliegenthart–Jongbloed,Carlijn Jordans,Ben Werkhoven,Tom Seinen,Erik van Mulligen,Casper Rokx,Yunlei Li
机构: 未知
类目: Computation and Language (cs.CL)
备注: 28 pages, 6 figures

Abstract:Efficient screening and early diagnosis of HIV are critical for reducing onward transmission. Although large-scale laboratory testing is not feasible, the widespread adoption of Electronic Health Records (EHRs) offers new opportunities to address this challenge. Existing research primarily focuses on applying machine learning methods to structured data, such as patient demographics, for improving HIV diagnosis. However, these approaches often overlook unstructured text data such as clinical notes, which potentially contain valuable information relevant to HIV risk. In this study, we propose a novel pipeline that leverages a Large Language Model (LLM) to analyze unstructured EHR text and determine a patient's eligibility for further HIV testing. Experimental results on clinical data from Erasmus University Medical Center Rotterdam demonstrate that our pipeline achieved high accuracy while maintaining a low false negative rate.

[NLP-80] Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention

【Quick Read】: This paper addresses the memory blow-up of traditional mechanistic interpretability techniques on long contexts (up to millions of tokens) in large language models (LLMs): attention analysis scales quadratically with context length, demanding terabytes of memory beyond 100,000 tokens. The solution, Sparse Tracing, uses dynamic sparse attention to analyze long-context attention patterns efficiently. Its core is Stream, a compilable hierarchical pruning algorithm that estimates per-head sparse attention masks in near-linear time O(T log T) and linear space O(T), enabling one-pass interpretability at scale. A binary-search-style refinement keeps only the top-k key blocks per query while preserving the model's next-token behavior; on chain-of-thought reasoning traces it identifies thought anchors while pruning 97-99% of token interactions, sharply cutting compute and running on consumer GPUs, making long-context interpretability a practical, easily deployed tool.

链接: https://arxiv.org/abs/2510.19875
作者: J Rosser,José Luis Redondo García,Gustavo Penha,Konstantina Palla,Hugues Bouchard
机构: Spotify; University of Oxford (牛津大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

Abstract:As Large Language Models (LLMs) scale to million-token contexts, traditional Mechanistic Interpretability techniques for analyzing attention scale quadratically with context length, demanding terabytes of memory beyond 100,000 tokens. We introduce Sparse Tracing, a novel technique that leverages dynamic sparse attention to efficiently analyze long context attention patterns. We present Stream, a compilable hierarchical pruning algorithm that estimates per-head sparse attention masks in near-linear time O(T \log T) and linear space O(T) , enabling one-pass interpretability at scale. Stream performs a binary-search-style refinement to retain only the top-k key blocks per query while preserving the model's next-token behavior. We apply Stream to long chain-of-thought reasoning traces and identify thought anchors while pruning 97-99% of token interactions. On the RULER benchmark, Stream preserves critical retrieval paths while discarding 90-96% of interactions and exposes layer-wise routes from the needle to output. Our method offers a practical drop-in tool for analyzing attention patterns and tracing information flow without terabytes of caches. By making long context interpretability feasible on consumer GPUs, Sparse Tracing helps democratize chain-of-thought monitoring. Code is available at this https URL.
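A much-simplified sketch of hierarchical top-k block selection, run here on precomputed per-token scores (our own toy; the real Stream estimates such quantities from sparse attention inside the model and compiles the search): coarse spans are scored, only the best survive to be split, so far fewer than all pairwise interactions are ever inspected.

```python
# Toy hierarchical refinement: repeatedly keep the k highest-scoring spans
# and halve them, narrowing in on the important key positions.

def topk_blocks(scores, k=2, min_block=2):
    spans = [(0, len(scores))]
    while spans and spans[0][1] - spans[0][0] > min_block:
        best = sorted(spans, key=lambda s: sum(scores[s[0]:s[1]]), reverse=True)[:k]
        spans = []
        for lo, hi in best:              # split each surviving span in half
            mid = (lo + hi) // 2
            spans.extend([(lo, mid), (mid, hi)])
    return sorted(spans, key=lambda s: sum(scores[s[0]:s[1]]), reverse=True)[:k]

scores = [0.0] * 64
scores[5] = scores[41] = 1.0             # two "needle" key positions
print(topk_blocks(scores))               # [(4, 6), (40, 42)]
```

The binary-search flavor is visible in the halving: each round discards most of the sequence, which is where the near-linear cost claimed in the abstract comes from.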

[NLP-81] From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model

【Quick Read】: This paper addresses the catastrophic error cascades that the train-inference discrepancy causes in discrete diffusion models for vision-language tasks: early token errors during parallel decoding pollute the generation context, errors compound, and the output ends up with syntactic mistakes and semantic hallucinations. The key is to reframe generation from passive denoising to active refining with ReDiff, trained in two stages: the model first acquires a foundational revision ability by correcting synthetic errors, then an online self-correction loop explicitly trains it to revise its own flawed drafts by learning from expert corrections, breaking the error-propagation chain and markedly improving the coherence and factual accuracy of generated content.

链接: https://arxiv.org/abs/2510.19871
作者: Yatai Ji,Teng Wang,Yuying Ge,Zhiheng Liu,Sidi Yang,Ying Shan,Ping Luo
机构: The University of Hong Kong (香港大学); ARC Lab, Tencent PCG (腾讯PCG实验室)
类目: Computation and Language (cs.CL)
备注:

Abstract:Discrete diffusion models have emerged as a promising direction for vision-language tasks, offering bidirectional context modeling and theoretical parallelization. However, their practical application is severely hindered by a train-inference discrepancy, which leads to catastrophic error cascades: initial token errors during parallel decoding pollute the generation context, triggering a chain reaction of compounding errors and leading to syntactic errors and semantic hallucinations. To address this fundamental challenge, we reframe the generation process from passive denoising to active refining. We introduce ReDiff, a refining-enhanced diffusion framework that teaches the model to identify and correct its own errors. Our approach features a two-stage training process: first, we instill a foundational revision capability by training the model to revise synthetic errors; second, we implement a novel online self-correction loop where the model is explicitly trained to revise its own flawed drafts by learning from an expert’s corrections. This mistake-driven learning endows the model with the crucial ability to revisit and refine its already generated output, effectively breaking the error cascade. Extensive experiments demonstrate that ReDiff significantly improves the coherence and factual accuracy of generated content, enabling stable and efficient parallel generation far superior to traditional denoising methods. Our codes and models are available at this https URL.

[NLP-82] An Evaluation of the Pedagogical Soundness and Usability of AI-Generated Lesson Plans Across Different Models and Prompt Frameworks in High-School Physics

【Quick Read】: This study evaluates the pedagogical soundness and usability of AI-generated high-school physics lesson plans, asking how the choice of large language model (LLM) and prompt framework affects quality. It finds that model choice mainly drives linguistic readability (DeepSeek V3.2 produced the most readable plan, Flesch-Kincaid Grade Level = 8.64), while the prompt framework's structure chiefly improves factual accuracy and curriculum-standards alignment (the RACE framework scored best on NGSS alignment and hallucination index). The key takeaway is to pair a readability-optimized model such as DeepSeek with a structured prompt framework such as RACE, plus an explicit checklist of physics concepts, curriculum standards, and higher-order cognitive objectives, to produce high-quality lesson plans.

链接: https://arxiv.org/abs/2510.19866
作者: Xincheng Liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 6 tables

Abstract:This study evaluates the pedagogical soundness and usability of AI-generated lesson plans across five leading large language models: ChatGPT (GPT-5), Claude Sonnet 4.5, Gemini 2.5 Flash, DeepSeek V3.2, and Grok 4. Beyond model choice, three structured prompt frameworks were tested: TAG (Task, Audience, Goal), RACE (Role, Audience, Context, Execution), and COSTAR (Context, Objective, Style, Tone, Audience, Response Format). Fifteen lesson plans were generated for a single high-school physics topic, The Electromagnetic Spectrum. The lesson plans were analyzed through four automated computational metrics: (1) readability and linguistic complexity, (2) factual accuracy and hallucination detection, (3) standards and curriculum alignment, and (4) cognitive demand of learning objectives. Results indicate that model selection exerted the strongest influence on linguistic accessibility, with DeepSeek producing the most readable teaching plan (FKGL = 8.64) and Claude generating the densest language (FKGL = 19.89). The prompt framework structure most strongly affected the factual accuracy and pedagogical completeness, with the RACE framework yielding the lowest hallucination index and the highest incidental alignment with NGSS curriculum standards. Across all models, the learning objectives in the fifteen lesson plans clustered at the Remember and Understand tiers of Bloom's taxonomy. There were limited higher-order verbs in the learning objectives extracted. Overall, the findings suggest that readability is significantly governed by model design, while instructional reliability and curricular alignment depend more on the prompt framework. The most effective configuration for lesson plans identified in the results was to combine a readability-optimized model with the RACE framework and an explicit checklist of physics concepts, curriculum standards, and higher-order objectives.
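For context, the Flesch-Kincaid Grade Level quoted above is the standard readability formula

\mathrm{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59

so DeepSeek's 8.64 corresponds to roughly a U.S. 8th-9th-grade reading level, while Claude's 19.89 indicates graduate-level density.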

[NLP-83] SODBench: A Large Language Model Approach to Documenting Spreadsheet Operations

【Quick Read】: This paper addresses how the lack of systematic documentation methods for spreadsheets, which are ubiquitous in business, accounting, and finance, hinders automation, collaboration, and knowledge transfer and risks the loss of crucial institutional knowledge. The key contribution is the new task of Spreadsheet Operations Documentation (SOD): using large language models (LLMs) to translate spreadsheet manipulation code into human-readable natural language, together with a benchmark of 111 code snippets paired with natural-language summaries used to evaluate several LLMs. Results suggest SOD is feasible and can serve as a prerequisite step toward more reproducible, maintainable, and collaborative spreadsheet workflows.

链接: https://arxiv.org/abs/2510.19864
作者: Amila Indika,Igor Molybog
机构: 未知
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 14 pages, 5 figures, 4 tables

Abstract:Numerous knowledge workers utilize spreadsheets in business, accounting, and finance. However, a lack of systematic documentation methods for spreadsheets hinders automation, collaboration, and knowledge transfer, which risks the loss of crucial institutional knowledge. This paper introduces Spreadsheet Operations Documentation (SOD), an AI task that involves generating human-readable explanations from spreadsheet operations. Many previous studies have utilized Large Language Models (LLMs) for generating spreadsheet manipulation code; however, translating that code into natural language for SOD is a less-explored area. To address this, we present a benchmark of 111 spreadsheet manipulation code snippets, each paired with a corresponding natural language summary. We evaluate five LLMs, GPT-4o, GPT-4o-mini, LLaMA-3.3-70B, Mixtral-8x7B, and Gemma2-9B, using BLEU, GLEU, ROUGE-L, and METEOR metrics. Our findings suggest that LLMs can generate accurate spreadsheet documentation, making SOD a feasible prerequisite step toward enhancing reproducibility, maintainability, and collaborative workflows in spreadsheets, although there are challenges that need to be addressed.

[NLP-84] DeBERTa-KC: A Transformer-Based Classifier for Knowledge Construction in Online Learning Discourse

【Quick Read】: This paper addresses automatic classification of knowledge construction (KC) levels in online science learning discourse, enabling effective assessment of learners' higher-order cognitive engagement in informal digital learning environments. The key is DeBERTa-KC, a model extending DeBERTa-v3 with Focal Loss, Label Smoothing, and R-Drop regularization to counter class imbalance and improve generalization, supported by a reproducible end-to-end pipeline covering data collection, manual annotation, preprocessing, training, and evaluation. On a balanced 20,000-sample corpus with four KC categories it achieves a macro-F1 of 0.836 ± 0.008, significantly outperforming classical and Transformer baselines (p < 0.01), with particular strength on higher-order engagement categories such as Explore and Negotiate.

链接: https://arxiv.org/abs/2510.19858
作者: Jindi Wang,Yidi Zhang,Zhaoxing Li
机构: Durham University (杜伦大学); University of Aveiro (阿威罗大学); University of Southampton (南安普顿大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This study presents DeBERTa-KC, a transformer-based model for automatic classification of knowledge construction (KC) levels in online science learning discourse. Using comments collected from four popular YouTube science channels (2022–2024), a balanced corpus of 20,000 manually annotated samples was created across four KC categories: nonKC, Share, Explore, and Negotiate. The proposed model extends DeBERTa-v3 with Focal Loss, Label Smoothing, and R-Drop regularization to address class imbalance and enhance generalization. A reproducible end-to-end pipeline was implemented, encompassing data extraction, annotation, preprocessing, training, and evaluation. Across 10-fold stratified cross-validation, DeBERTa-KC achieved a macro-F1 of 0.836 ± 0.008, significantly outperforming both classical and transformer baselines (p < 0.01). Per-category results indicate strong sensitivity to higher-order epistemic engagement, particularly in Explore and Negotiate discourse. These findings demonstrate that large language models can effectively capture nuanced indicators of knowledge construction in informal digital learning environments, offering scalable, theory-informed approaches to discourse analysis and the development of automated tools for assessing epistemic engagement.
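The two loss-side ingredients the paper adds on top of DeBERTa-v3, focal loss and label smoothing, are standard and easy to sketch in PyTorch. The version below is a generic illustration with made-up hyperparameters (gamma, smoothing), not the authors' code.

```python
# Focal loss combined with label smoothing for a 4-class KC classifier.
import torch
import torch.nn.functional as F

def focal_loss_with_smoothing(logits, targets, gamma=2.0, smoothing=0.1):
    n_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Smoothed targets: (1 - eps + eps/K) on the true class, eps/K elsewhere.
    true_dist = torch.full_like(log_probs, smoothing / n_classes)
    true_dist.scatter_(1, targets.unsqueeze(1),
                       1.0 - smoothing + smoothing / n_classes)
    # Focal term down-weights examples the model already classifies well.
    focal_weight = (1.0 - probs) ** gamma
    loss = -(true_dist * focal_weight * log_probs).sum(dim=-1)
    return loss.mean()

logits = torch.randn(8, 4)              # batch of 8, 4 KC categories
targets = torch.randint(0, 4, (8,))
print(focal_loss_with_smoothing(logits, targets))
```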
zh

[NLP-85] Prompt Decorators: A Declarative and Composable Syntax for Reasoning Formatting and Control in LLM s

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理、写作和决策支持工作流中缺乏用户可控性的问题,即用户难以一致地控制模型的推理方式与输出表达。传统提示工程依赖冗长的自然语言指令,导致可复现性差、模块化程度低且难以解释。其解决方案的关键在于提出“提示装饰器”(Prompt Decorators)——一种声明式、可组合的语法机制,通过紧凑的控制标记(如 +++Reasoning、+++Tone(style=formal)、+++Import(topic=“Systems Thinking”))来调节模型的行为维度(如推理风格、结构或语气),而不改变任务内容本身。该框架定义了统一的语法、作用域模型和确定性处理流程,实现了行为组合的可预测性和可审计性,从而将任务意图与执行行为解耦,构建了一个可重用、可解释的提示设计接口。

链接: https://arxiv.org/abs/2510.19850
作者: Mostapha Kalami Heris
机构: Sheffield Hallam University (谢菲尔德哈勒姆大学)
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are central to reasoning, writing, and decision-support workflows, yet users lack consistent control over how they reason and express outputs. Conventional prompt engineering relies on verbose natural-language instructions, limiting reproducibility, modularity, and interpretability. This paper introduces Prompt Decorators, a declarative, composable syntax that governs LLM behavior through compact control tokens such as +++Reasoning, +++Tone(style=formal), and +++Import(topic=“Systems Thinking”). Each decorator modifies a behavioral dimension, such as reasoning style, structure, or tone, without changing task content. The framework formalizes twenty core decorators organized into two functional families (Cognitive Generative and Expressive Systemic), each further decomposed into subcategories that govern reasoning, interaction, expression, and session-control. It defines a unified syntax, scoping model, and deterministic processing pipeline enabling predictable and auditable behavior composition. By decoupling task intent from execution behavior, Prompt Decorators create a reusable and interpretable interface for prompt design. Illustrative use cases demonstrate improved reasoning transparency, reduced prompt complexity, and standardized model behavior across domains. The paper concludes with implications for interoperability, behavioral consistency, and the development of declarative interfaces for scalable AI systems.
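The decorator syntax shown in the abstract (+++Name or +++Name(key=value)) lends itself to a tiny preprocessing step that separates behavioral directives from task content. The toy parser below assumes a grammar inferred from those examples; the paper's full specification may differ.

```python
# Toy parser: split "+++Decorator(args)" tokens from the task text.
import re

DECORATOR = re.compile(r'\+\+\+(\w+)(?:\(([^)]*)\))?')

def parse_decorators(prompt: str):
    """Return (list of (name, params) decorators, remaining task text)."""
    decorators = []
    for name, args in DECORATOR.findall(prompt):
        params = {}
        if args:
            for pair in args.split(','):
                key, _, value = pair.partition('=')
                params[key.strip()] = value.strip().strip('"')
        decorators.append((name, params))
    task_text = DECORATOR.sub('', prompt).strip()
    return decorators, task_text

decs, task = parse_decorators(
    '+++Reasoning +++Tone(style=formal) Explain feedback loops.')
print(decs)   # [('Reasoning', {}), ('Tone', {'style': 'formal'})]
print(task)   # Explain feedback loops.
```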
zh

[NLP-86] Branch-and-Browse: Efficient and Controllable Web Exploration with Tree-Structured Reasoning and Action Memory

【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的自主网络代理在多步推理深度和执行效率方面的局限性问题。现有方法要么采用简单的线性推理流程,难以处理复杂任务且缺乏有效回溯机制,要么依赖粗粒度搜索策略,计算成本高。其解决方案的关键在于提出Branch-and-Browse框架,该框架通过三个核心机制实现高效、可控的多分支推理:(i) 采用树状结构探索的显式子任务管理机制以支持结构化推理与行动;(ii) 借助背景推理驱动的网页状态重放机制提升探索效率;(iii) 利用页面动作记忆实现跨会话的动作知识共享,从而显著提升任务成功率(WebArena基准达35.8%)并降低执行时间达40.4%。

链接: https://arxiv.org/abs/2510.19838
作者: Shiqi He,Yue Cui,Xinyu Ma,Yaliang Li,Bolin Ding,Mosharaf Chowdhury
机构: University of Michigan (密歇根大学); Alibaba Group (阿里巴巴集团); McMaster University (麦克马斯特大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Autonomous web agents powered by large language models (LLMs) show strong potential for performing goal-oriented tasks such as information retrieval, report generation, and online transactions. These agents mark a key step toward practical embodied reasoning in open web environments. However, existing approaches remain limited in reasoning depth and efficiency: vanilla linear methods fail at multi-step reasoning and lack effective backtracking, while other search strategies are coarse-grained and computationally costly. We introduce Branch-and-Browse, a fine-grained web agent framework that unifies structured reasoning-acting, contextual memory, and efficient execution. It (i) employs explicit subtask management with tree-structured exploration for controllable multi-branch reasoning, (ii) bootstraps exploration through efficient web state replay with background reasoning, and (iii) leverages a page action memory to share explored actions within and across sessions. On the WebArena benchmark, Branch-and-Browse achieves a task success rate of 35.8% and reduces execution time by up to 40.4% relative to state-of-the-art methods. These results demonstrate that Branch-and-Browse is a reliable and efficient framework for LLM-based web agents.
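To make the framework's structure concrete, here is a toy rendering of tree-structured subtask exploration with a shared page action memory: sibling branches skip actions that were already tried from the same state. All classes and names here are illustrative, not the paper's implementation.

```python
# Toy tree exploration with a cross-branch action memory.
from dataclasses import dataclass, field

@dataclass
class Node:
    state: str                      # e.g., a page snapshot id
    subtask: str
    children: list = field(default_factory=list)

action_memory: dict[str, set[str]] = {}   # state -> actions already explored

def expand(node: Node, candidate_actions: list[str]) -> list[Node]:
    tried = action_memory.setdefault(node.state, set())
    fresh = [a for a in candidate_actions if a not in tried]
    tried.update(fresh)
    node.children = [Node(state=f"{node.state}/{a}", subtask=a) for a in fresh]
    return node.children

root = Node(state="page0", subtask="book a flight")
print([c.subtask for c in expand(root, ["search", "login", "filter"])])
print([c.subtask for c in expand(root, ["search", "sort"])])  # 'search' skipped
```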
zh

[NLP-87] Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems

【速读】: 该论文旨在解决量子纠错码中具有指定可交换对角门(transversal diagonal gates)的协同设计问题,即如何系统性地构造满足特定逻辑门实现条件的量子码。其关键解决方案在于提出了一种多智能体、人机协同的工作流(multi-agent, human-in-the-loop workflow),基于Subset-Sum Linear Programming (SSLP) 框架,通过模剩余划分基矢串并利用小规模线性规划(LP)强制执行Z-边际Knill-Laflamme (KL) 条件,从而将原本复杂的可行性问题转化为可规模化分析的解析流程。该工作流由三个角色协同完成:合成代理(Synthesis Agent)定义问题、搜索代理(Search Agent)筛选候选码并精确化数值为有理数、审计代理(Audit Agent)独立验证所有KL等式及诱导逻辑作用,最终实现了在n≤6 qubits、码维K∈{2,3,4}下系统枚举并证明新码的KL满足性,且能抽象出闭式家族结构,具备扩展至更高维度和距离的能力。

链接: https://arxiv.org/abs/2510.20728
作者: Xi He,Sirui Lu,Bei Zeng
机构: The University of Texas at Dallas (德克萨斯大学达拉斯分校); Max-Planck-Institut für Quantenoptik (马克斯·普朗克量子光学研究所)
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Mathematical Physics (math-ph)
备注: 29 pages, 2 figures

点击查看摘要

Abstract:We present a multi-agent, human-in-the-loop workflow that co-designs quantum codes with prescribed transversal diagonal gates. It builds on the Subset-Sum Linear Programming (SSLP) framework (arXiv:2504.20847), which partitions basis strings by modular residues and enforces Z-marginal Knill-Laflamme (KL) equalities via small LPs. The workflow is powered by GPT-5 and implemented within TeXRA (this https URL), a multi-agent research assistant platform that supports an iterative tool-use loop agent and a derivation-then-edit workflow reasoning agent. We work in a LaTeX-Python environment where agents reason, edit documents, execute code, and synchronize their work to Git/Overleaf. Within this workspace, three roles collaborate: a Synthesis Agent formulates the problem; a Search Agent sweeps/screens candidates and exactifies numerics into rationals; and an Audit Agent independently checks all KL equalities and the induced logical action. As a first step we focus on distance d=2 with nondegenerate residues. For code dimension K ∈ {2,3,4} and n ≤ 6 qubits, systematic sweeps yield certificate-backed tables cataloging attainable cyclic logical groups, all realized by new codes; e.g., for K=3 we obtain order 16 at n=6. From verified instances, the Synthesis Agent abstracts recurring structures into closed-form families and proves they satisfy the KL equalities for all parameters. It further demonstrates that SSLP accommodates residue degeneracy by exhibiting a new ((6,4,2)) code implementing the transversal controlled-phase gate diag(1,1,1,i). Overall, the workflow recasts diagonal-transversal feasibility as an analytical pipeline executed at scale, combining systematic enumeration with exact analytical reconstruction. It yields reproducible code constructions, supports targeted extensions to larger K and higher distances, and leads toward data-driven classification.
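For reference, the Knill-Laflamme (KL) condition that the paper's LPs enforce is the standard one below; the "Z-marginal KL equalities" are this condition restricted to diagonal (Z-type) error monomials, which is what makes them linear constraints amenable to small LPs over subsets of basis strings.

```latex
% KL condition for a code spanned by logical states |c_k>: for every pair
% of correctable errors E_a, E_b, the error overlap matrix must be
% codeword-independent.
\[
  \langle c_k \,|\, E_a^\dagger E_b \,|\, c_l \rangle
  = \lambda_{ab}\, \delta_{kl}
\]
```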
zh

计算机视觉

[CV-0] HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives

【速读】:该论文旨在解决当前文本到视频(text-to-video)模型在生成连贯多镜头叙事方面存在的“叙事鸿沟”问题,即现有模型仅能生成孤立片段,难以实现电影级的全局一致性叙事。其解决方案的关键在于提出HoloCine模型,该模型通过两种核心机制实现端到端的叙事控制:一是采用窗口交叉注意力(Window Cross-Attention)机制,将文本提示精准定位到特定镜头以实现导演式控制;二是引入稀疏跨镜头自注意力模式(Sparse Inter-Shot Self-Attention),在镜头内部保持密集连接而在镜头间稀疏连接,从而在保证生成效率的同时支持分钟级视频输出。这一架构不仅显著提升了叙事连贯性,还涌现出角色与场景的持久记忆能力及对电影拍摄技巧的直观理解,标志着从片段合成向自动化电影制作的重要跃迁。

链接: https://arxiv.org/abs/2510.20822
作者: Yihao Meng,Hao Ouyang,Yue Yu,Qiuyu Wang,Wen Wang,Ka Leong Cheng,Hanlin Wang,Yixuan Li,Cheng Chen,Yanhong Zeng,Yujun Shen,Huamin Qu
机构: HKUST(香港科技大学); Ant Group(蚂蚁集团); ZJU(浙江大学); CUHK(香港中文大学); NTU(南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page and code: this https URL

点击查看摘要

Abstract:State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives, which are the essence of storytelling. We bridge this “narrative gap” with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state-of-the-art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future. Our code is available at: this https URL.
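The "dense within shots, sparse between shots" attention pattern can be sketched as a boolean mask over tokens grouped by shot. The version below additionally lets every token read the first token of each shot as a cross-shot anchor; that anchor choice is an assumption for illustration, and HoloCine's exact sparsity pattern may differ.

```python
# Build a block-sparse self-attention mask: dense intra-shot, sparse inter-shot.
import torch

def inter_shot_mask(shot_ids: torch.Tensor) -> torch.Tensor:
    """shot_ids: (N,) shot index per token -> (N, N) boolean mask, True = attend."""
    same_shot = shot_ids[:, None] == shot_ids[None, :]        # dense within a shot
    is_anchor = torch.ones_like(shot_ids, dtype=torch.bool)
    is_anchor[1:] = shot_ids[1:] != shot_ids[:-1]             # first token of each shot
    cross_shot = is_anchor[None, :].expand(len(shot_ids), -1) # anyone may read anchors
    return same_shot | cross_shot

shot_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])
print(inter_shot_mask(shot_ids).int())
```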
zh

[CV-1] LayerComposer: Interactive Personalized T2I via Spatially-Aware Layered Canvas

【速读】:该论文旨在解决现有个性化生成式 AI 模型在多主体图像生成中缺乏空间布局交互控制能力以及扩展性差的问题。其核心解决方案是提出 LayerComposer 框架,关键创新在于引入“分层画布”(layered canvas)表示法,将每个主体置于独立图层以实现无遮挡组合,并设计一种无需架构修改的锁定机制,通过利用固有位置嵌入(positional embeddings)与新型互补数据采样策略,在保持选定图层高保真度的同时,使其他图层能灵活适应上下文环境,从而实现对多主体图像的空间精准控制和身份一致性保留。

链接: https://arxiv.org/abs/2510.20820
作者: Guocheng Gordon Qian,Ruihang Zhang,Tsai-Shien Chen,Yusuf Dalva,Anujraaj Argo Goyal,Willi Menapace,Ivan Skorokhodov,Meng Dong,Arpit Sahni,Daniil Ostashev,Ju Hu,Sergey Tulyakov,Kuan-Chieh Jackson Wang
机构: Snap Inc.; University of Toronto; UC Merced; Virginia Tech
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, preprint

点击查看摘要

Abstract:Despite their impressive visual fidelity, existing personalized generative models lack interactive control over spatial composition and scale poorly to multiple subjects. To address these limitations, we present LayerComposer, an interactive framework for personalized, multi-subject text-to-image generation. Our approach introduces two main contributions: (1) a layered canvas, a novel representation in which each subject is placed on a distinct layer, enabling occlusion-free composition; and (2) a locking mechanism that preserves selected layers with high fidelity while allowing the remaining layers to adapt flexibly to the surrounding context. Similar to professional image-editing software, the proposed layered canvas allows users to place, resize, or lock input subjects through intuitive layer manipulation. Our versatile locking mechanism requires no architectural changes, relying instead on inherent positional embeddings combined with a new complementary data sampling strategy. Extensive experiments demonstrate that LayerComposer achieves superior spatial control and identity preservation compared to the state-of-the-art methods in multi-subject personalized image generation.
zh

[CV-2] owards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge

【速读】:该论文旨在解决跨模态翻译(Modality Translation, MT)中现有方法受限于特定假设(如共享维度、高斯先验和模态专用架构)而导致的通用性不足与理论基础薄弱的问题。其解决方案的关键在于提出一种基于潜在变量扩展的去噪扩散桥模型(Latent Denoising Diffusion Bridge Model, LDDBM),通过在共享潜在空间中学习任意模态间的映射关系,无需对齐维度即可实现跨模态信息转换;同时引入对比对齐损失以保证配对样本的语义一致性,并设计领域无关的编码器-解码器结构用于潜在空间中的噪声预测,辅以预测损失提升训练稳定性与翻译准确性,从而构建一个支持任意模态对且性能优越的通用跨模态翻译框架。

链接: https://arxiv.org/abs/2510.20819
作者: Nimrod Berman,Omkar Joglekar,Eitan Kosman,Dotan Di Castro,Omri Azencot
机构: Bosch AI Center; Ben-Gurion University of the Negev (本古里安大学); Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advances in generative modeling have positioned diffusion models as state-of-the-art tools for sampling from complex data distributions. While these models have shown remarkable success across single-modality domains such as images and audio, extending their capabilities to Modality Translation (MT), translating information across different sensory modalities, remains an open challenge. Existing approaches often rely on restrictive assumptions, including shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding. In this work, we propose the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework for modality translation based on a latent-variable extension of Denoising Diffusion Bridge Models. By operating in a shared latent space, our method learns a bridge between arbitrary modalities without requiring aligned dimensions. We introduce a contrastive alignment loss to enforce semantic consistency between paired samples and design a domain-agnostic encoder-decoder architecture tailored for noise prediction in latent space. Additionally, we propose a predictive loss to guide training toward accurate cross-domain translation and explore several training strategies to improve stability. Our approach supports arbitrary modality pairs and performs strongly on diverse MT tasks, including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis. Comprehensive experiments and ablations validate the effectiveness of our framework, establishing a new strong baseline in general modality translation. For more information, see our project page: this https URL.
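The contrastive alignment loss for paired samples is typically realized as a symmetric InfoNCE objective over a batch of latents. The sketch below shows that standard form; batch size, embedding width, and temperature are illustrative, not the paper's settings.

```python
# Symmetric InfoNCE: pull paired cross-modal latents together, push non-pairs apart.
import torch
import torch.nn.functional as F

def contrastive_alignment(z_src, z_tgt, tau=0.07):
    z_src = F.normalize(z_src, dim=-1)
    z_tgt = F.normalize(z_tgt, dim=-1)
    logits = z_src @ z_tgt.t() / tau          # (B, B) similarity matrix
    labels = torch.arange(z_src.size(0))      # sample i pairs with sample i
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

z_a, z_b = torch.randn(16, 128), torch.randn(16, 128)
print(contrastive_alignment(z_a, z_b))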
zh

[CV-3] SpectraMorph: Structured Latent Learning for Self-Supervised Hyperspectral Super-Resolution

【速读】:该论文旨在解决高光谱图像(Hyperspectral Image, HSI)因空间分辨率低而导致的边界模糊和混合像元问题,通过融合高分辨率的多光谱或全色影像(Multispectral Image, MSI)实现超分辨率重建。其核心解决方案是提出一种物理引导的自监督融合框架SpectraMorph,关键在于引入一个结构化的潜在空间和解混瓶颈机制:从低分辨率HSI中提取端元光谱,并利用一个紧凑的多层感知机(Multilayer Perceptron, MLP)从MSI预测丰度图,再通过线性混合重建光谱;训练过程基于MSI传感器的光谱响应函数进行自监督优化,从而在无需标注数据的情况下获得可解释的中间结果,且对单波段(全色)MSI仍保持鲁棒性。

链接: https://arxiv.org/abs/2510.20814
作者: Ritik Shah,Marco F Duarte
机构: University of Massachusetts (马萨诸塞大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Hyperspectral sensors capture dense spectra per pixel but suffer from low spatial resolution, causing blurred boundaries and mixed-pixel effects. Co-registered companion sensors such as multispectral, RGB, or panchromatic cameras provide high-resolution spatial detail, motivating hyperspectral super-resolution through the fusion of hyperspectral and multispectral images (HSI-MSI). Existing deep learning based methods achieve strong performance but rely on opaque regressors that lack interpretability and often fail when the MSI has very few bands. We propose SpectraMorph, a physics-guided self-supervised fusion framework with a structured latent space. Instead of direct regression, SpectraMorph enforces an unmixing bottleneck: endmember signatures are extracted from the low-resolution HSI, and a compact multilayer perceptron predicts abundance-like maps from the MSI. Spectra are reconstructed by linear mixing, with training performed in a self-supervised manner via the MSI sensor’s spectral response function. SpectraMorph produces interpretable intermediates, trains in under a minute, and remains robust even with a single-band (pan-chromatic) MSI. Experiments on synthetic and real-world datasets show SpectraMorph consistently outperforming state-of-the-art unsupervised/self-supervised baselines while remaining very competitive against supervised baselines.
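The unmixing bottleneck and the SRF-based self-supervision reduce to two matrix products, sketched below in NumPy. All shapes, the random SRF, and the random data are placeholders; only the structure (abundances x endmembers, then supervision through the sensor response) follows the paper's description.

```python
# Linear-mixing reconstruction with SRF self-supervision, in outline.
import numpy as np

P, B_hsi, B_msi, K = 1024, 120, 4, 6   # pixels, HSI bands, MSI bands, endmembers

endmembers = np.random.rand(K, B_hsi)                 # extracted from low-res HSI
abundances = np.random.dirichlet(np.ones(K), size=P)  # predicted by the MLP
srf = np.random.rand(B_hsi, B_msi)                    # known sensor response function
srf /= srf.sum(axis=0, keepdims=True)

recon_hsi = abundances @ endmembers     # (P, B_hsi) linear mixing
msi_obs = np.random.rand(P, B_msi)      # observed high-res MSI pixels

# Self-supervision: reconstructed spectra, pushed through the SRF, should
# reproduce the observed MSI -- no high-res HSI ground truth required.
loss = np.mean((recon_hsi @ srf - msi_obs) ** 2)
print(loss)
```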
zh

[CV-4] GSWorld: Closed-Loop Photo-Realistic Simulation Suite for Robotic Manipulation

【速读】:该论文旨在解决机器人操作任务中真实世界与仿真环境之间存在的“域差距”问题,以及现有仿真系统在视觉真实感和物理交互精度方面的不足。为实现高保真、可复现的策略训练与评估,作者提出GSWorld框架,其核心创新在于将3D高斯点绘(3D Gaussian Splatting)与物理引擎相结合,并引入一种新型资产格式GSDF(Gaussian Scene Description File),该格式融合了高斯面片表示与URDF机器人模型及其他物体信息,从而支持复杂场景的高质量渲染与物理模拟。通过这一方案,实现了无需真实机器人即可进行零样本sim2real策略迁移、自动化DAgger数据采集、可复现的真实策略基准测试等关键应用,显著提升了仿真驱动的机器人操作学习的实用性与可靠性。

链接: https://arxiv.org/abs/2510.20813
作者: Guangqi Jiang,Haoran Chang,Ri-Zhao Qiu,Yutong Liang,Mazeyu Ji,Jiyue Zhu,Zhao Dong,Xueyan Zou,Xiaolong Wang
机构: UC San Diego (加州大学圣地亚哥分校); UC Los Angeles (加州大学洛杉矶分校); Meta
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents GSWorld, a robust, photo-realistic simulator for robotics manipulation that combines 3D Gaussian Splatting with physics engines. Our framework advocates “closing the loop” of developing manipulation policies with reproducible evaluation of policies learned from real-robot data and sim2real policy training without using real robots. To enable photo-realistic rendering of diverse scenes, we propose a new asset format, which we term GSDF (Gaussian Scene Description File), that infuses Gaussian-on-Mesh representation with robot URDF and other objects. With a streamlined reconstruction pipeline, we curate a database of GSDF that contains 3 robot embodiments for single-arm and bimanual manipulation, as well as more than 40 objects. Combining GSDF with physics engines, we demonstrate several immediate interesting applications: (1) learning zero-shot sim2real pixel-to-action manipulation policy with photo-realistic rendering, (2) automated high-quality DAgger data collection for adapting policies to deployment environments, (3) reproducible benchmarking of real-robot manipulation policies in simulation, (4) simulation data collection by virtual teleoperation, and (5) zero-shot sim2real visual reinforcement learning. Website: this https URL.
zh

[CV-5] Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers

【速读】:该论文旨在解决现有视频生成方法在物理模拟因果建模上的不足,即难以长期保持物理一致性的问题。其解决方案的关键在于提出一种纯Transformer架构的自回归视频预测模型,通过连续像素空间表示实现端到端训练,无需复杂的训练策略或潜在特征学习组件,从而将物理准确预测的时间跨度延长达50%,同时在常见视频质量指标上保持相当性能;此外,该方法还借助可解释性实验验证了模型对偏微分方程(PDE)模拟参数估计的泛化能力,体现了其在时空推理中的有效性与可解释性。

链接: https://arxiv.org/abs/2510.20807
作者: Dean L Slack,G Thomas Hudson,Thomas Winterbottom,Noura Al Moubayed
机构: Durham University (杜伦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 14 pages, 14 figures

点击查看摘要

Abstract:Inspired by the performance and scalability of autoregressive large language models (LLMs), transformer-based models have seen recent success in the visual domain. This study investigates a transformer adaptation for video prediction with a simple end-to-end approach, comparing various spatiotemporal self-attention layouts. Focusing on causal modeling of physical simulations over time; a common shortcoming of existing video-generative approaches, we attempt to isolate spatiotemporal reasoning via physical object tracking metrics and unsupervised training on physical simulation datasets. We introduce a simple yet effective pure transformer model for autoregressive video prediction, utilizing continuous pixel-space representations for video prediction. Without the need for complex training strategies or latent feature-learning components, our approach significantly extends the time horizon for physically accurate predictions by up to 50% when compared with existing latent-space approaches, while maintaining comparable performance on common video quality metrics. In addition, we conduct interpretability experiments to identify network regions that encode information useful to perform accurate estimations of PDE simulation parameters via probing models, and find that this generalizes to the estimation of out-of-distribution simulation parameters. This work serves as a platform for further attention-based spatiotemporal modeling of videos via a simple, parameter efficient, and interpretable approach.
zh

[CV-6] ARGenSeg: Image Segmentation with Autoregressive Image Generation Model NEURIPS2025

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在图像分割任务中难以捕捉细粒度视觉细节的问题。现有方法通常依赖边界点表示或专用分割头,受限于离散表示或语义提示,导致对像素级信息的理解不足。解决方案的关键在于提出一种基于自回归生成(AutoRegressive Generation-based)的图像分割框架ARGenSeg,通过将分割任务转化为图像生成问题,使MLLM直接输出视觉token,并借助通用VQ-VAE解码器重建出稠密掩码(dense masks),从而完全依赖MLLM的像素级感知能力;同时引入下一尺度预测策略并行生成所需视觉token,显著降低推理延迟,实现在多个数据集上优于当前最优方法且推理速度大幅提升。

链接: https://arxiv.org/abs/2510.20803
作者: Xiaolong Wang,Lixiang Ru,Ziyuan Huang,Kaixiang Ji,Dandan Zheng,Jingdong Chen,Jun Zhou
机构: Ant Group (蚂蚁集团)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to NeurIPS 2025, 18 pages

点击查看摘要

Abstract:We propose a novel AutoRegressive Generation-based paradigm for image Segmentation (ARGenSeg), achieving multimodal understanding and pixel-level perception within a unified framework. Prior works integrating image segmentation into multimodal large language models (MLLMs) typically employ either boundary points representation or dedicated segmentation heads. These methods rely on discrete representations or semantic prompts fed into task-specific decoders, which limits the ability of the MLLM to capture fine-grained visual details. To address these challenges, we introduce a segmentation framework for MLLM based on image generation, which naturally produces dense masks for target objects. We leverage MLLM to output visual tokens and detokenize them into images using a universal VQ-VAE, making the segmentation fully dependent on the pixel-level understanding of the MLLM. To reduce inference latency, we employ a next-scale-prediction strategy to generate required visual tokens in parallel. Extensive experiments demonstrate that our method surpasses prior state-of-the-art approaches on multiple segmentation datasets with a remarkable boost in inference speed, while maintaining strong understanding capabilities.
zh

[CV-7] Radar-Camera Fused Multi-Object Tracking: Online Calibration and Common Feature

【速读】:该论文旨在解决多目标跟踪(Multi-Object Tracking, MOT)中传感器融合效率低、依赖人工干预以及雷达与相机数据关联不准确的问题。现有方法常将雷达作为辅助传感器,未能充分发挥其在世界三维坐标系下提供精确距离/深度信息的能力。解决方案的关键在于构建一个以雷达为核心、融合雷达与相机数据的MOT框架,通过提取雷达与相机的公共特征实现在线标定,从而自动对齐两类传感器的检测结果;同时采用特征匹配与类别一致性校验策略,超越传统仅依赖位置匹配的方法,显著提升跨模态关联精度。该方案首次将雷达-相机公共特征用于在线标定以实现MOT,实验表明其可有效简化雷达-相机映射流程并提高跟踪准确性。

链接: https://arxiv.org/abs/2510.20794
作者: Lei Cheng,Siyang Cao
机构: The University of Arizona (亚利桑那大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: accepted to IEEE Transactions on Intelligent Transportation Systems (T-ITS)

点击查看摘要

Abstract:This paper presents a Multi-Object Tracking (MOT) framework that fuses radar and camera data to enhance tracking efficiency while minimizing manual interventions. Contrary to many studies that underutilize radar and assign it a supplementary role–despite its capability to provide accurate range/depth information of targets in a world 3D coordinate system–our approach positions radar in a crucial role. Meanwhile, this paper utilizes common features to enable online calibration to autonomously associate detections from radar and camera. The main contributions of this work include: (1) the development of a radar-camera fusion MOT framework that exploits online radar-camera calibration to simplify the integration of detection results from these two sensors, (2) the utilization of common features between radar and camera data to accurately derive real-world positions of detected objects, and (3) the adoption of feature matching and category-consistency checking to surpass the limitations of mere position matching in enhancing sensor association accuracy. To the best of our knowledge, we are the first to investigate the integration of radar-camera common features and their use in online calibration for achieving MOT. The efficacy of our framework is demonstrated by its ability to streamline the radar-camera mapping process and improve tracking precision, as evidenced by real-world experiments conducted in both controlled environments and actual traffic scenarios. Code is available at this https URL
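The paper's key association idea (feature matching plus category-consistency checking on top of position matching) maps naturally onto a cost-matrix assignment. The sketch below combines the three cues and solves the assignment with the Hungarian algorithm; the weights and shapes are made up for illustration.

```python
# Radar-camera association with position, feature, and category cues.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(radar_pos, cam_pos, radar_feat, cam_feat, radar_cls, cam_cls,
              w_pos=1.0, w_feat=2.0):
    dist = np.linalg.norm(radar_pos[:, None] - cam_pos[None, :], axis=-1)
    feat_sim = radar_feat @ cam_feat.T          # assumes L2-normalized features
    cost = w_pos * dist - w_feat * feat_sim
    cost[radar_cls[:, None] != cam_cls[None, :]] = 1e6   # category consistency
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < 1e6]

radar_pos, cam_pos = np.random.rand(3, 2), np.random.rand(4, 2)
radar_feat, cam_feat = np.random.rand(3, 8), np.random.rand(4, 8)
print(associate(radar_pos, cam_pos, radar_feat, cam_feat,
                np.array([0, 1, 1]), np.array([1, 0, 1, 2])))
```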
zh

[CV-8] CUPID: Pose-Grounded Generative 3D Reconstruction from a Single Image

【速读】:该论文旨在解决从单张2D图像中准确重建三维(3D)物体的相机位姿(camera pose)、形状(shape)和纹理(texture)的问题,这是计算机视觉与图形学中的核心挑战之一。传统方法在处理遮挡、光照变化或缺乏几何先验时往往表现不佳,而现有生成式模型难以同时保证几何精度与视觉保真度。解决方案的关键在于提出一种基于生成式建模的新方法Cupid,其将3D重建视为从学习到的3D对象分布中进行条件采样,并联合生成体素(voxels)与像素-体素对应关系(pixel-voxel correspondences),从而在统一的生成框架下实现鲁棒的位姿和形状估计;进一步通过共享3D潜在空间(shared 3D latent space)表示输入相机位姿与3D形状,并采用两阶段流匹配(flow matching)流程:第一阶段生成粗略几何并用于位姿恢复,第二阶段融合对齐后的图像特征以提升结构保真度与外观细节,最终在PSNR和Chamfer Distance等指标上显著优于主流方法。

链接: https://arxiv.org/abs/2510.20776
作者: Binbin Huang,Haobin Duan,Yiqun Zhao,Zibo Zhao,Yi Ma,Shenghua Gao
机构: HKU(香港大学); Transcengram; Tencent(腾讯)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page at this https URL

点击查看摘要

Abstract:This work proposes a new generation-based 3D reconstruction method, named Cupid, that accurately infers the camera pose, 3D shape, and texture of an object from a single 2D image. Cupid casts 3D reconstruction as a conditional sampling process from a learned distribution of 3D objects, and it jointly generates voxels and pixel-voxel correspondences, enabling robust pose and shape estimation under a unified generative framework. By representing both input camera poses and 3D shape as a distribution in a shared 3D latent space, Cupid adopts a two-stage flow matching pipeline: (1) a coarse stage that produces initial 3D geometry with associated 2D projections for pose recovery; and (2) a refinement stage that integrates pose-aligned image features to enhance structural fidelity and appearance details. Extensive experiments demonstrate Cupid outperforms leading 3D reconstruction methods with an over 3 dB PSNR gain and an over 10% Chamfer Distance reduction, while matching monocular estimators on pose accuracy and delivering superior visual fidelity over baseline 3D generative models. For an immersive view of the 3D results generated by Cupid, please visit this http URL.
zh

[CV-9] AlphaFlow: Understanding and Improving MeanFlow Models

【速读】:该论文旨在解决MeanFlow在少步生成建模中优化冲突与收敛缓慢的问题,其核心在于揭示MeanFlow目标函数可分解为轨迹流匹配(trajectory flow matching)和轨迹一致性(trajectory consistency)两部分,且二者存在强负相关性,导致训练过程中优化冲突。解决方案的关键是提出α-Flow,一个统一轨迹流匹配、Shortcut Model与MeanFlow的广义目标函数族,并采用从轨迹流匹配到MeanFlow平滑退火的课程学习策略,从而解耦冲突目标并加速收敛。实验表明,在类条件ImageNet-1K 256×256数据集上,使用原生DiT骨干网络训练时,α-Flow在不同规模和设置下均优于MeanFlow,其最大模型在单步和双步推理下分别达到FID 2.58和2.15,刷新了基于原生DiT架构的性能纪录。

链接: https://arxiv.org/abs/2510.20771
作者: Huijie Zhang,Aliaksandr Siarohin,Willi Menapace,Michael Vasilkovsky,Sergey Tulyakov,Qing Qu,Ivan Skorokhodov
机构: Snap Inc.; University of Michigan (密歇根大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:MeanFlow has recently emerged as a powerful framework for few-step generative modeling trained from scratch, but its success is not yet fully understood. In this work, we show that the MeanFlow objective naturally decomposes into two parts: trajectory flow matching and trajectory consistency. Through gradient analysis, we find that these terms are strongly negatively correlated, causing optimization conflict and slow convergence. Motivated by these insights, we introduce α-Flow, a broad family of objectives that unifies trajectory flow matching, Shortcut Model, and MeanFlow under one formulation. By adopting a curriculum strategy that smoothly anneals from trajectory flow matching to MeanFlow, α-Flow disentangles the conflicting objectives, and achieves better convergence. When trained from scratch on class-conditional ImageNet-1K 256×256 with vanilla DiT backbones, α-Flow consistently outperforms MeanFlow across scales and settings. Our largest α-Flow-XL/2+ model achieves new state-of-the-art results using vanilla DiT backbones, with FID scores of 2.58 (1-NFE) and 2.15 (2-NFE).
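The curriculum can be pictured as a single scalar that anneals the objective from pure trajectory flow matching toward MeanFlow over training. The cosine schedule below is an assumption for illustration; the paper may use a different annealing law.

```python
# Annealing the alpha-Flow objective from flow matching (alpha=1) to MeanFlow (alpha=0).
import math

def alpha_schedule(step: int, total_steps: int) -> float:
    """Cosine anneal from 1 to 0 over training (illustrative schedule)."""
    return 0.5 * (1.0 + math.cos(math.pi * step / total_steps))

def alpha_flow_loss(flow_matching_loss, mean_flow_loss, step, total_steps):
    a = alpha_schedule(step, total_steps)
    return a * flow_matching_loss + (1.0 - a) * mean_flow_loss

for step in (0, 5000, 10000):
    print(step, round(alpha_schedule(step, 10000), 3))
```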
zh

[CV-10] DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion

【速读】:该论文旨在解决扩散 Transformer 模型在超高清分辨率图像生成中因自注意力机制的二次计算复杂度而导致训练成本极高的问题。解决方案的关键在于提出一种无需训练的动态位置外推方法(Dynamic Position Extrapolation, DyPE),其核心思想是利用扩散过程中频谱演化的特性——低频结构早期收敛而高频结构需更多步骤解析,通过在每一步扩散过程中动态调整模型的位置编码,使其频率谱与当前生成阶段相匹配,从而实现对训练分辨率之外的超高分辨率图像的高效生成,且无需额外采样开销。

链接: https://arxiv.org/abs/2510.20766
作者: Noam Issachar,Guy Yariv,Sagie Benaim,Yossi Adi,Dani Lischinski,Raanan Fattal
机构: The Hebrew University of Jerusalem (希伯来大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diffusion Transformer models can generate images with remarkable fidelity and detail, yet training them at ultra-high resolutions remains extremely costly due to the self-attention mechanism’s quadratic scaling with the number of image tokens. In this paper, we introduce Dynamic Position Extrapolation (DyPE), a novel, training-free method that enables pre-trained diffusion transformers to synthesize images at resolutions far beyond their training data, with no additional sampling cost. DyPE takes advantage of the spectral progression inherent to the diffusion process, where low-frequency structures converge early, while high-frequencies take more steps to resolve. Specifically, DyPE dynamically adjusts the model’s positional encoding at each diffusion step, matching their frequency spectrum with the current stage of the generative process. This approach allows us to generate images at resolutions that exceed the training resolution dramatically, e.g., 16 million pixels using FLUX. On multiple benchmarks, DyPE consistently improves performance and achieves state-of-the-art fidelity in ultra-high-resolution image generation, with gains becoming even more pronounced at higher resolutions. Project page is available at this https URL.
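A toy way to picture DyPE: rescale positional-encoding frequencies as a function of the diffusion step, so early (low-frequency) steps see strongly extrapolated positions while late steps see near-native ones. The interpolation law below is invented for illustration and is not the paper's formula.

```python
# Timestep-dependent position rescaling for a sinusoidal encoding.
import numpy as np

def scaled_sinusoidal_pe(positions, dim, t, T, train_len, target_len):
    # Blend the rescaling factor from full extrapolation (t=T, start of
    # sampling) toward none (t=0, end of sampling).
    blend = t / T
    scale = 1.0 / (1.0 + blend * (target_len / train_len - 1.0))
    pos = positions * scale
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    ang = pos[:, None] * freqs[None, :]
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)

pe_early = scaled_sinusoidal_pe(np.arange(4096), 64, t=1000, T=1000,
                                train_len=1024, target_len=4096)
pe_late = scaled_sinusoidal_pe(np.arange(4096), 64, t=10, T=1000,
                               train_len=1024, target_len=4096)
print(pe_early.shape, pe_late.shape)
```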
zh

[CV-11] MEIcoder: Decoding Visual Stimuli from Neural Activity by Leverag ing Most Exciting Inputs NEURIPS2025

【速读】:该论文旨在解决从神经元群体活动(尤其是灵长类或人类初级视觉皮层V1中单细胞水平的记录)中可靠重建视觉刺激的问题,这在生物数据稀缺、深度学习模型训练受限的情况下尤为关键。其解决方案的核心在于提出MEIcoder方法,该方法融合了神经元特异性的最激发输入(Most Exciting Inputs, MEIs)、结构相似性指数度量(SSIM)损失函数以及对抗训练机制,从而显著提升小样本条件下的解码性能,尤其在仅使用1,000–2,500个神经元和少于1,000个训练样本时仍能生成高保真自然图像。

链接: https://arxiv.org/abs/2510.20762
作者: Jan Sobotka,Luca Baroni,Ján Antolík
机构: EPFL(瑞士联邦理工学院); Charles University(查理大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to NeurIPS 2025

点击查看摘要

Abstract:Decoding visual stimuli from neural population activity is crucial for understanding the brain and for applications in brain-machine interfaces. However, such biological data is often scarce, particularly in primates or humans, where high-throughput recording techniques, such as two-photon imaging, remain challenging or impossible to apply. This, in turn, poses a challenge for deep learning decoding techniques. To overcome this, we introduce MEIcoder, a biologically informed decoding method that leverages neuron-specific most exciting inputs (MEIs), a structural similarity index measure loss, and adversarial training. MEIcoder achieves state-of-the-art performance in reconstructing visual stimuli from single-cell activity in primary visual cortex (V1), especially excelling on small datasets with fewer recorded neurons. Using ablation studies, we demonstrate that MEIs are the main drivers of the performance, and in scaling experiments, we show that MEIcoder can reconstruct high-fidelity natural-looking images from as few as 1,000-2,500 neurons and less than 1,000 training data points. We also propose a unified benchmark with over 160,000 samples to foster future research. Our results demonstrate the feasibility of reliable decoding in the early visual system and provide practical insights for neuroscience and neuroengineering applications.
zh

[CV-12] ACS-SegNet: An Attention-Based CNN-SegFormer Segmentation Network for Tissue Segmentation in Histopathology

【速读】:该论文旨在解决组织病理图像中语义分割(semantic segmentation)的精度提升问题,以支持计算机辅助诊断(Computer-Aided Diagnosis, CAD)系统在疾病识别中的应用。其解决方案的关键在于提出一种统一的双编码器(dual-encoder)架构,通过注意力机制驱动的特征融合策略,整合卷积神经网络(Convolutional Neural Networks, CNNs)与视觉 Transformer(Vision Transformers, ViTs)的互补特征表示能力,从而增强模型对复杂组织结构的感知与分割性能。实验表明,该方法在GCPS和PUMA两个公开数据集上分别达到了76.79%的μIoU和64.93%的μIoU,优于现有先进基准模型。

链接: https://arxiv.org/abs/2510.20754
作者: Nima Torbati,Anastasia Meshcheryakova,Ramona Woitek,Diana Mechtcheriakova,Amirreza Mahbod
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages

点击查看摘要

Abstract:Automated histopathological image analysis plays a vital role in computer-aided diagnosis of various diseases. Among developed algorithms, deep learning-based approaches have demonstrated excellent performance in multiple tasks, including semantic tissue segmentation in histological images. In this study, we propose a novel approach based on attention-driven feature fusion of convolutional neural networks (CNNs) and vision transformers (ViTs) within a unified dual-encoder model to improve semantic segmentation performance. Evaluation on two publicly available datasets showed that our model achieved μIoU/μDice scores of 76.79%/86.87% on the GCPS dataset and 64.93%/76.60% on the PUMA dataset, outperforming state-of-the-art and baseline benchmarks. The implementation of our method is publicly available in a GitHub repository: this https URL
zh

[CV-13] AutoScape: Geometry-Consistent Long-Horizon Scene Generation ICCV2025

【速读】:该论文旨在解决长时序驾驶场景生成中难以同时保持视觉真实性和几何一致性的问题。现有方法在生成超过数秒的驾驶视频时,常出现纹理失真或结构漂移,导致场景不连贯。解决方案的关键在于提出AutoScape框架,其核心创新是一个新型RGB-D扩散模型(RGB-D diffusion model),该模型通过三个机制保障长期几何一致性:1)在共享潜在空间中联合建模图像与深度信息;2)显式地以已生成关键帧的渲染点云作为条件,引导后续生成;3)利用形变一致引导(warp-consistent guidance)控制采样过程。在此基础上,再使用视频扩散模型对高质量RGB-D关键帧进行插值,从而生成持续时间超过20秒、视觉逼真且几何一致的驾驶视频,显著优于当前最优方法,在长时序FID和FVD指标上分别提升48.6%和43.0%。

链接: https://arxiv.org/abs/2510.20726
作者: Jiacheng Chen,Ziyu Jiang,Mingfu Liang,Bingbing Zhuang,Jong-Chyi Su,Sparsh Garg,Ying Wu,Manmohan Chandraker
机构: NEC Labs America; Simon Fraser University; Northwestern University; UC San Diego
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025. Project page: this https URL

点击查看摘要

Abstract:This paper proposes AutoScape, a long-horizon driving scene generation framework. At its core is a novel RGB-D diffusion model that iteratively generates sparse, geometrically consistent keyframes, serving as reliable anchors for the scene’s appearance and geometry. To maintain long-range geometric consistency, the model 1) jointly handles image and depth in a shared latent space, 2) explicitly conditions on the existing scene geometry (i.e., rendered point clouds) from previously generated keyframes, and 3) steers the sampling process with a warp-consistent guidance. Given high-quality RGB-D keyframes, a video diffusion model then interpolates between them to produce dense and coherent video frames. AutoScape generates realistic and geometrically consistent driving videos of over 20 seconds, improving the long-horizon FID and FVD scores over the prior state-of-the-art by 48.6% and 43.0%, respectively.
zh

[CV-14] ALICE-LRI: A General Method for Lossless Range Image Generation for Spinning LiDAR Sensors without Calibration Metadata

【速读】:该论文旨在解决传统LiDAR点云到二维距离图像(range image)投影方法中存在的几何不一致性问题,该问题会导致不可逆的信息丢失,从而影响高保真度应用的精度。解决方案的关键在于提出ALICE-LRI(Automatic LiDAR Intrinsic Calibration Estimation for Lossless Range Images),这是一种无需依赖厂商元数据或校准文件的通用、传感器无关的算法,能够自动反演任意旋转式LiDAR传感器的内在几何参数,包括激光束配置、角度分布及每束光的校正参数,从而实现无损投影与完整点云重建,确保零点丢失,并在KITTI和DurLAR数据集上验证了其几何保真性与实时性能。

链接: https://arxiv.org/abs/2510.20708
作者: Samuel Soutullo,Miguel Yermo,David L. Vilariño,Óscar G. Lorenzo,José C. Cabaleiro,Francisco F. Rivera
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:3D LiDAR sensors are essential for autonomous navigation, environmental monitoring, and precision mapping in remote sensing applications. To efficiently process the massive point clouds generated by these sensors, LiDAR data is often projected into 2D range images that organize points by their angular positions and distances. While these range image representations enable efficient processing, conventional projection methods suffer from fundamental geometric inconsistencies that cause irreversible information loss, compromising high-fidelity applications. We present ALICE-LRI (Automatic LiDAR Intrinsic Calibration Estimation for Lossless Range Images), the first general, sensor-agnostic method that achieves lossless range image generation from spinning LiDAR point clouds without requiring manufacturer metadata or calibration files. Our algorithm automatically reverse-engineers the intrinsic geometry of any spinning LiDAR sensor by inferring critical parameters including laser beam configuration, angular distributions, and per-beam calibration corrections, enabling lossless projection and complete point cloud reconstruction with zero point loss. Comprehensive evaluation across the complete KITTI and DurLAR datasets demonstrates that ALICE-LRI achieves perfect point preservation, with zero points lost across all point clouds. Geometric accuracy is maintained well within sensor precision limits, establishing geometric losslessness with real-time performance. We also present a compression case study that validates substantial downstream benefits, demonstrating significant quality improvements in practical applications. This paradigm shift from approximate to lossless LiDAR projections opens new possibilities for high-precision remote sensing applications requiring complete geometric preservation.
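For context, the conventional (lossy) projection the paper improves upon is the standard spherical mapping below. ALICE-LRI's contribution is inferring the per-beam calibration that makes this mapping lossless; the sketch instead assumes uniformly spaced beams, which is exactly where collisions and point loss arise.

```python
# Standard spherical projection of a spinning-LiDAR cloud to a range image.
import numpy as np

def to_range_image(points, n_beams=64, n_cols=1024, fov_up=3.0, fov_down=-25.0):
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rng = np.linalg.norm(points, axis=1)
    azimuth = np.arctan2(y, x)                         # [-pi, pi]
    elevation = np.arcsin(z / np.maximum(rng, 1e-9))
    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    col = ((azimuth + np.pi) / (2 * np.pi) * n_cols).astype(int) % n_cols
    row = (fov_up_r - elevation) / (fov_up_r - fov_down_r) * n_beams
    row = np.clip(row.astype(int), 0, n_beams - 1)
    img = np.zeros((n_beams, n_cols), dtype=np.float32)
    img[row, col] = rng    # collisions overwrite points -- the lossy part
    return img

pts = np.random.randn(10000, 3) * 10
print(to_range_image(pts).shape)    # (64, 1024)
```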
zh

[CV-15] Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models

【速读】:该论文旨在解决大规模视觉语言模型(Large Vision-Language Models, LVLMs)在处理多模态序列时,由于键值(Key-Value, KV)缓存膨胀导致的内存瓶颈问题。现有KV缓存压缩方法主要依赖保留高重要性KV对以减少存储,但忽略了多模态KV缓存中特有的模态相关语义冗余模式,从而可能造成语义覆盖不足。解决方案的关键在于提出一种名为\textttMixKV的新方法,该方法通过将重要性与多样性相结合,在头级粒度上自适应地平衡二者,以更有效地压缩KV缓存,从而在保持推理效率的同时显著提升多模态理解任务的性能表现。

链接: https://arxiv.org/abs/2510.20707
作者: Xuyang Liu,Xiyan Gui,Yuchao Zhang,Linfeng Zhang
机构: EPIC Lab, Shanghai Jiao Tong University (上海交通大学); Sichuan University (四川大学); Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Our code is available at this https URL

点击查看摘要

Abstract:Recent large vision-language models (LVLMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet the resulting key-value (KV) cache expansion creates a critical memory bottleneck that fundamentally limits deployment scalability. While existing KV cache compression methods focus on retaining high-importance KV pairs to minimize storage, they often overlook the modality-specific semantic redundancy patterns that emerge distinctively in multi-modal KV caches. In this work, we first analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying levels of redundancy across attention heads. We show that relying solely on importance can only cover a subset of the full KV cache information distribution, leading to potential loss of semantic coverage. To address this, we propose MixKV, a novel method that mixes importance with diversity for optimized KV cache compression in LVLMs. MixKV adapts to head-wise semantic redundancy, selectively balancing diversity and importance when compressing KV pairs. Extensive experiments demonstrate that MixKV consistently enhances existing methods across multiple LVLMs. Under extreme compression (budget=64), MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks and achieves remarkable gains of 8.0% and 9.0% for SnapKV and AdaKV on GUI grounding tasks, all while maintaining comparable inference efficiency. Furthermore, MixKV extends seamlessly to LLMs with comparable performance gains. Our code is available at this https URL.
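Mixing importance with diversity can be sketched as a greedy selection: keep entries that score high on attention-based importance while penalizing similarity to already-kept keys. The scoring mix and the trade-off weight below are illustrative, not MixKV's actual per-head weighting.

```python
# Greedy importance-plus-diversity selection of KV entries under a budget.
import torch

def select_kv(keys, importance, budget, lam=0.5):
    keys = torch.nn.functional.normalize(keys, dim=-1)
    kept = [int(importance.argmax())]
    while len(kept) < budget:
        sim_to_kept = (keys @ keys[kept].T).max(dim=-1).values  # redundancy
        score = importance - lam * sim_to_kept
        score[kept] = -float("inf")
        kept.append(int(score.argmax()))
    return sorted(kept)

keys = torch.randn(256, 64)
importance = torch.rand(256)      # e.g., accumulated attention weights
print(select_kv(keys, importance, budget=16)[:8])
```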
zh

[CV-16] Diagnosing Visual Reasoning : Challenges Insights and a Path Forward

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉推理任务中普遍存在的视觉幻觉(visual hallucinations)和对文本先验(textual priors)的过度依赖问题。针对这些问题,作者提出了一种基于代理(agent-based)的架构,其关键在于将大语言模型(LLM)的推理能力与轻量级视觉模块相结合,实现对视觉内容的细粒度分析和推理链的迭代优化。该方案通过引入专用工具增强视觉理解能力,从而提升模型在复杂视觉任务中的准确性和鲁棒性,实验结果显示该方法在MMMU和MathVista等基准上显著优于基线模型(如7B参数规模),并达到甚至超越更大模型的性能水平。

链接: https://arxiv.org/abs/2510.20696
作者: Jing Bi,Guangyu Sun,Ali Vosoughi,Chen Chen,Chenliang Xu
机构: University of Rochester (罗切斯特大学); University of Central Florida (中佛罗里达大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages

点击查看摘要

Abstract:Multimodal large language models (MLLMs) that integrate visual and textual reasoning leverage chain-of-thought (CoT) prompting to tackle complex visual tasks, yet continue to exhibit visual hallucinations and an over-reliance on textual priors. We present a systematic diagnosis of state-of-the-art vision-language models using a three-stage evaluation framework, uncovering key failure modes. To address these, we propose an agent-based architecture that combines LLM reasoning with lightweight visual modules, enabling fine-grained analysis and iterative refinement of reasoning chains. Our results highlight future visual reasoning models should focus on integrating a broader set of specialized tools for analyzing visual content. Our system achieves significant gains (+10.3 on MMMU, +6.0 on MathVista over a 7B baseline), matching or surpassing much larger models. We will release our framework and evaluation suite to facilitate future research.
zh

[CV-17] Efficient Multi-bit Quantization Network Training via Weight Bias Correction and Bit-wise Coreset Sampling

【速读】:该论文旨在解决多比特量化网络(multi-bit quantization networks)在训练过程中存在的高计算开销问题,即现有方法需对每个支持的位宽重复进行全数据集更新,导致训练成本随精度数量线性增长,且常需额外微调阶段来支持新增或中间精度选项。解决方案的关键在于提出两项核心技术:其一是权重偏置校正(weight bias correction),通过消除不同位宽间的量化引入偏差并对齐激活分布,实现批归一化(batch normalization)共享,从而避免微调;其二是基于梯度重要性评分的逐比特核心集采样策略(bit-wise coreset sampling),利用隐式知识迁移现象,为每个子模型选取紧凑且信息丰富的训练子集,显著降低训练负担。实验表明,该方法在多个图像分类任务和模型架构上均能实现媲美或优于基线的精度,同时将训练时间减少最多达7.88倍。

链接: https://arxiv.org/abs/2510.20673
作者: Jinhee Kim,Jae Jun An,Kang Eun Jeon,Jong Hwan Ko
机构: Sungkyunkwan University (成均馆大学); Duke University (杜克大学); Korea Advanced Institute of Science and Technology (KAIST)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multi-bit quantization networks enable flexible deployment of deep neural networks by supporting multiple precision levels within a single model. However, existing approaches suffer from significant training overhead as full-dataset updates are repeated for each supported bit-width, resulting in a cost that scales linearly with the number of precisions. Additionally, extra fine-tuning stages are often required to support additional or intermediate precision options, further compounding the overall training burden. To address this issue, we propose two techniques that greatly reduce the training overhead without compromising model utility: (i) Weight bias correction enables shared batch normalization and eliminates the need for fine-tuning by neutralizing quantization-induced bias across bit-widths and aligning activation distributions; and (ii) Bit-wise coreset sampling strategy allows each child model to train on a compact, informative subset selected via gradient-based importance scores by exploiting the implicit knowledge transfer phenomenon. Experiments on CIFAR-10/100, TinyImageNet, and ImageNet-1K with both ResNet and ViT architectures demonstrate that our method achieves competitive or superior accuracy while reducing training time up to 7.88x. Our code is released at this https URL.
zh

[CV-18] HybridSOMSpikeNet: A Deep Model with Differentiable Soft Self-Organizing Maps and Spiking Dynamics for Waste Classification

【速读】:该论文旨在解决城市化进程中废弃物分类不准确导致的环境问题,如可回收物误入填埋场、回收效率低下及温室气体排放增加等。其核心解决方案是提出一种名为HybridSOMSpikeNet的混合深度学习框架,关键创新在于融合了卷积特征提取(基于预训练ResNet-152)、可微分自组织映射(Differentiable Soft-SOM)与类脉冲神经网络的时间动态处理机制,从而实现高精度、低功耗且具备可解释性的智能垃圾分类。该模型在十类垃圾数据集上达到97.39%的测试准确率,同时保持轻量化计算特性,适用于实际部署,并有助于提升回收效率和减少废弃物处理的生态成本。

链接: https://arxiv.org/abs/2510.20669
作者: Debojyoti Ghosh,Adrijit Goswami
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate waste classification is vital for achieving sustainable waste management and reducing the environmental footprint of urbanization. Misclassification of recyclable materials contributes to landfill accumulation, inefficient recycling, and increased greenhouse gas emissions. To address these issues, this study introduces HybridSOMSpikeNet, a hybrid deep learning framework that integrates convolutional feature extraction, differentiable self-organization, and spiking-inspired temporal processing to enable intelligent and energy-efficient waste classification. The proposed model employs a pre-trained ResNet-152 backbone to extract deep spatial representations, followed by a Differentiable Soft Self-Organizing Map (Soft-SOM) that enhances topological clustering and interpretability. A spiking neural head accumulates temporal activations over discrete time steps, improving robustness and generalization. Trained on a ten-class waste dataset, HybridSOMSpikeNet achieved a test accuracy of 97.39%, outperforming several state-of-the-art architectures while maintaining a lightweight computational profile suitable for real-world deployment. Beyond its technical innovations, the framework provides tangible environmental benefits. By enabling precise and automated waste segregation, it supports higher recycling efficiency, reduces contamination in recyclable streams, and minimizes the ecological and operational costs of waste processing. The approach aligns with global sustainability priorities, particularly the United Nations Sustainable Development Goals (SDG 11 and SDG 12), by contributing to cleaner cities, circular economy initiatives, and intelligent environmental management systems.
zh

[CV-19] UltraHR-100K: Enhancing UHR Image Synthesis with A Large-Scale High-Quality Dataset NEURIPS2025

【速读】:该论文旨在解决超高清文本到图像(Ultra-high-resolution text-to-image, UHR T2I)生成中的两大关键问题:一是缺乏大规模高质量的UHR T2I数据集,二是现有训练策略未能针对性提升细粒度细节合成能力。解决方案的关键在于:首先构建了包含10万张超过3K分辨率、标注丰富且经过严格筛选的高质量图像数据集UltraHR-100K;其次提出一种频域感知的后训练方法,通过两个核心组件实现细节增强——(i)细粒度导向的时间步采样(Detail-Oriented Timestep Sampling, DOTS),聚焦于对细节敏感的去噪步骤进行学习;(ii)软权重频率正则化(Soft-Weighting Frequency Regularization, SWFR),利用离散傅里叶变换(Discrete Fourier Transform, DFT)对频域成分施加软约束,以促进高频细节的保留。实验表明,该方案显著提升了UHR图像生成的细粒度质量与整体保真度。

链接: https://arxiv.org/abs/2510.20661
作者: Chen Zhao,En Ci,Yunzhe Xu,Tiehan Fan,Shanyan Guan,Yanhao Ge,Jian Yang,Ying Tai
机构: Nanjing University (南京大学); vivo Mobile Communication Co., Ltd. (维沃移动通信有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:Ultra-high-resolution (UHR) text-to-image (T2I) generation has seen notable progress. However, two key challenges remain: 1) the absence of a large-scale high-quality UHR T2I dataset, and 2) the neglect of tailored training strategies for fine-grained detail synthesis in UHR scenarios. To tackle the first challenge, we introduce UltraHR-100K, a high-quality dataset of 100K UHR images with rich captions, offering diverse content and strong visual fidelity. Each image exceeds 3K resolution and is rigorously curated based on detail richness, content complexity, and aesthetic quality. To tackle the second challenge, we propose a frequency-aware post-training method that enhances fine-detail generation in T2I diffusion models. Specifically, we design (i) Detail-Oriented Timestep Sampling (DOTS) to focus learning on detail-critical denoising steps, and (ii) Soft-Weighting Frequency Regularization (SWFR), which leverages Discrete Fourier Transform (DFT) to softly constrain frequency components, encouraging high-frequency detail preservation. Extensive experiments on our proposed UltraHR-eval4K benchmarks demonstrate that our approach significantly improves the fine-grained detail quality and overall fidelity of UHR image generation. The code is available at this https URL.
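A DFT-based, softly weighted frequency regularizer can be sketched as below: compare the 2D spectra of prediction and target with weights that emphasize (but do not hard-select) high frequencies. The radial weighting here is an assumption about SWFR's form, not the paper's exact design.

```python
# Soft-weighted frequency regularization sketch with torch.fft.
import torch

def swfr_loss(pred, target, min_w=0.2):
    Fp, Ft = torch.fft.fft2(pred), torch.fft.fft2(target)
    H, W = pred.shape[-2:]
    fy = torch.fft.fftfreq(H).abs()[:, None]
    fx = torch.fft.fftfreq(W).abs()[None, :]
    radius = torch.sqrt(fy ** 2 + fx ** 2)                 # 0 at DC
    weight = min_w + (1 - min_w) * radius / radius.max()   # soft, no hard cutoff
    return (weight * (Fp - Ft).abs()).mean()

pred, target = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)
print(swfr_loss(pred, target))
```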
zh

[CV-20] Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging NEURIPS2025

【速读】:该论文旨在解决当前视觉-语言建模在三维医学影像(3D medical imaging)中面临的两大核心挑战:一是对比预训练导致的视觉编码器与临床语言语义不匹配问题;二是切片级标记化(slice-wise tokenization)造成的解剖细节模糊,从而降低下游任务的诊断性能。解决方案的关键在于提出BTB3D(Better Tokens for Better 3D),一种因果卷积编码器-解码器架构,通过统一二维与三维训练和推理流程,生成紧凑且具备频率感知能力的体素标记(volumetric tokens)。其三阶段训练策略依次实现局部重建、重叠窗口拼接和长上下文解码器优化,在不增加内存开销的前提下,使模型从短切片片段学习并泛化至超过300切片的完整扫描,显著提升报告生成与文本到CT图像合成任务的性能。

链接: https://arxiv.org/abs/2510.20639
作者: Ibrahim Ethem Hamamci,Sezgin Er,Suprosanna Shit,Hadrien Reynaud,Dong Yang,Pengfei Guo,Marc Edgar,Daguang Xu,Bernhard Kainz,Bjoern Menze
机构: University of Zurich (苏黎世大学); NVIDIA (英伟达); Imperial College London (帝国理工学院); FAU Erlangen-Nürnberg (埃尔朗根-纽伦堡弗里德里希亚历山大大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025

点击查看摘要

Abstract:Recent progress in vision-language modeling for 3D medical imaging has been fueled by large-scale computed tomography (CT) corpora with paired free-text reports, stronger architectures, and powerful pretrained models. This has enabled applications such as automated report generation and text-conditioned 3D image synthesis. Yet, current approaches struggle with high-resolution, long-sequence volumes: contrastive pretraining often yields vision encoders that are misaligned with clinical language, and slice-wise tokenization blurs fine anatomy, reducing diagnostic performance on downstream tasks. We introduce BTB3D (Better Tokens for Better 3D), a causal convolutional encoder-decoder that unifies 2D and 3D training and inference while producing compact, frequency-aware volumetric tokens. A three-stage training curriculum enables (i) local reconstruction, (ii) overlapping-window tiling, and (iii) long-context decoder refinement, during which the model learns from short slice excerpts yet generalizes to scans exceeding 300 slices without additional memory overhead. BTB3D sets a new state-of-the-art on two key tasks: it improves BLEU scores and increases clinical F1 by 40% over CT2Rep, CT-CHAT, and Merlin for report generation; and it reduces FID by 75% and halves FVD compared to GenerateCT and MedSyn for text-to-CT synthesis, producing anatomically consistent 512×512×241 volumes. These results confirm that precise three-dimensional tokenization, rather than larger language backbones alone, is essential for scalable vision-language modeling in 3D medical imaging. The codebase is available at: this https URL
zh

[CV-21] Deep Learning in Dental Image Analysis: A Systematic Review of Datasets Methodologies and Emerging Challenges

【速读】:该论文旨在解决传统牙科影像(dental images)分析中因低对比度、金属伪影、投影角度差异以及临床医生经验主观性导致的诊断效率低、一致性差的问题。其解决方案的关键在于系统性地综述深度学习(Deep Learning, DL)在牙科影像分析(Dental Image Analysis, DIA)中的应用进展,聚焦于公开数据集和模型两大核心要素,梳理了260篇相关研究,涵盖数据特性、采集方法、网络架构、优化策略及评估指标,并指出现有研究的局限与未来发展方向,为该领域研究人员提供结构化参考。

链接: https://arxiv.org/abs/2510.20634
作者: Zhenhuan Zhou,Jingbo Zhu,Yuchen Zhang,Xiaohang Guan,Peng Wang,Tao Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 52 pages, 24 figures. Under Review

点击查看摘要

Abstract:Efficient analysis and processing of dental images are crucial for dentists to achieve accurate diagnosis and optimal treatment planning. However, dental imaging inherently poses several challenges, such as low contrast, metallic artifacts, and variations in projection angles. Combined with the subjectivity arising from differences in clinicians’ expertise, manual interpretation often proves time-consuming and prone to inconsistency. Artificial intelligence (AI)-based automated dental image analysis (DIA) offers a promising solution to these issues and has become an integral part of computer-aided dental diagnosis and treatment. Among various AI technologies, deep learning (DL) stands out as the most widely applied and influential approach due to its superior feature extraction and representation capabilities. To comprehensively summarize recent progress in this field, we focus on the two fundamental aspects of DL research-datasets and models. In this paper, we systematically review 260 studies on DL applications in DIA, including 49 papers on publicly available dental datasets and 211 papers on DL-based algorithms. We first introduce the basic concepts of dental imaging and summarize the characteristics and acquisition methods of existing datasets. Then, we present the foundational techniques of DL and categorize relevant models and algorithms according to different DIA tasks, analyzing their network architectures, optimization strategies, training methods, and performance. Furthermore, we summarize commonly used training and evaluation metrics in the DIA domain. Finally, we discuss the current challenges of existing research and outline potential future directions. We hope that this work provides a valuable and systematic reference for researchers in this field. All supplementary materials and detailed comparison tables will be made publicly available on GitHub.
zh

[CV-22] SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding

【速读】:该论文旨在解决长视频理解中因内容复杂、多样且时间分布分散而导致的挑战,特别是现有视频大语言模型(Video-LLMs)在处理长时间序列时计算开销大、推理易出现不聚焦或不一致的问题。其解决方案的关键在于提出一种无需训练且与模型无关的语义-视觉共识证据选择框架(SeViCES),核心创新包括:(1) 时序感知的语义分支通过大语言模型(LLM)对字幕进行推理来识别关键帧;(2) 基于聚类引导的视觉分支利用互信息对齐嵌入与语义得分以增强视觉证据可靠性;以及 (3) 答案共识优化模块通过融合多模态证据并约束答案空间,进一步消除语义与视觉预测间的不一致性,从而提升长视频理解的准确性和鲁棒性。

链接: https://arxiv.org/abs/2510.20622
作者: Yuan Sheng,Yanbin Hao,Chenxu Li,Shuo Wang,Xiangnan He
机构: University of Science and Technology of China (中国科学技术大学); Hefei University of Technology (合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Long video understanding remains challenging due to its complex, diverse, and temporally scattered content. Although video large language models (Video-LLMs) can process videos lasting tens of minutes, applying them to truly long sequences is computationally prohibitive and often leads to unfocused or inconsistent reasoning. A promising solution is to select only the most informative frames, yet existing approaches typically ignore temporal dependencies or rely on unimodal evidence, limiting their ability to provide complete and query-relevant context. We propose a Semantic-Visual Consensus Evidence Selection (SeViCES) framework for effective and reliable long video understanding. SeViCES is training-free and model-agnostic, and introduces two key components. The Semantic-Visual Consensus Frame Selection (SVCFS) module selects frames through (1) a temporal-aware semantic branch that leverages LLM reasoning over captions, and (2) a cluster-guided visual branch that aligns embeddings with semantic scores via mutual information. The Answer Consensus Refinement (ACR) module further resolves inconsistencies between semantic- and visual-based predictions by fusing evidence and constraining the answer space. Extensive experiments on long video understanding benchmarks show that SeViCES consistently outperforms state-of-the-art methods in both accuracy and robustness, demonstrating the importance of consensus-driven evidence selection for Video-LLMs.
zh

[CV-23] OnlineSplatter: Pose-Free Online 3D Reconstruction for Free-Moving Objects NEURIPS2025

【速读】:该论文旨在解决从单目视频中重建自由运动物体的难题,尤其在缺乏可靠相机位姿或深度先验信息、且物体运动任意的情况下,如何实现高质量、对象中心的3D重建。其解决方案的关键在于提出了一种名为OnlineSplatter的在线前向框架,该框架直接从RGB帧生成高质量的对象中心3D高斯表示(3D Gaussians),无需依赖相机位姿估计、深度先验或捆绑调整优化。核心创新是一种双键记忆模块(dual-key memory module),结合潜在外观-几何键(latent appearance-geometry keys)与显式方向键(explicit directional keys),通过空间引导的记忆读取机制和高效的稀疏化策略,实现当前帧特征与时间聚合对象状态的鲁棒融合,从而在保持恒定计算成本的同时,持续提升重建质量。

链接: https://arxiv.org/abs/2510.20605
作者: Mark He Huang,Lin Geng Foo,Christian Theobalt,Ying Sun,De Wen Soh
机构: Singapore University of Technology and Design (新加坡科技设计大学); Max Planck Institute for Informatics (马克斯·普朗克信息研究所); Institute for Infocomm Research (I2R) & Centre for Frontier AI Research, A*STAR (新加坡资讯通信研究院(I2R)与前沿人工智能研究中心,新加坡科技研究局(A*STAR))
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025 (Spotlight)

点击查看摘要

Abstract:Free-moving object reconstruction from monocular video remains challenging, particularly without reliable pose or depth cues and under arbitrary object motion. We introduce OnlineSplatter, a novel online feed-forward framework generating high-quality, object-centric 3D Gaussians directly from RGB frames without requiring camera pose, depth priors, or bundle optimization. Our approach anchors reconstruction using the first frame and progressively refines the object representation through a dense Gaussian primitive field, maintaining constant computational cost regardless of video sequence length. Our core contribution is a dual-key memory module combining latent appearance-geometry keys with explicit directional keys, robustly fusing current frame features with temporally aggregated object states. This design enables effective handling of free-moving objects via spatial-guided memory readout and an efficient sparsification mechanism, ensuring comprehensive yet compact object coverage. Evaluations on real-world datasets demonstrate that OnlineSplatter significantly outperforms state-of-the-art pose-free reconstruction baselines, consistently improving with more observations while maintaining constant memory and runtime.
zh

[CV-24] Unsupervised Domain Adaptation via Similarity-based Prototypes for Cross-Modality Segmentation MICCAI2021

【速读】:该论文旨在解决深度学习模型在跨域场景下性能显著下降的问题,特别是针对未见数据域中的语义分割任务,其核心挑战在于域偏移(domain shift)导致模型泛化能力受限。解决方案的关键在于提出一种基于相似性的原型学习框架,通过在嵌入空间中学习类别特定的原型(class-wise prototypes),并引入相似性约束以确保原型对每个语义类具有代表性且类间可分离;同时,利用字典存储来自不同图像的原型,有效缓解类别缺失问题,并支持原型间的对比学习,从而提升跨模态分割性能。

链接: https://arxiv.org/abs/2510.20596
作者: Ziyu Ye,Chen Ju,Chaofan Ma,Xiaoyun Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: MICCAI 2021

点击查看摘要

Abstract:Deep learning models have achieved great success on various vision challenges, but a well-trained model would face drastic performance degradation when applied to unseen data. Since the model is sensitive to domain shift, unsupervised domain adaptation attempts to reduce the domain gap and avoid costly annotation of unseen domains. This paper proposes a novel framework for cross-modality segmentation via similarity-based prototypes. Specifically, we learn class-wise prototypes within an embedding space, then introduce a similarity constraint to make these prototypes representative for each semantic class while separable from different classes. Moreover, we use dictionaries to store prototypes extracted from different images, which prevents the class-missing problem, enables contrastive learning of prototypes, and further improves performance. Extensive experiments show that our method achieves better results than other state-of-the-art methods.
zh
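
类原型与相似性约束可以用如下 PyTorch 片段示意(编辑补充,非论文官方实现;margin 取值为假设):对每个类取嵌入均值作为原型,损失同时拉近样本与自身类原型、推远不同类原型;跨图像字典存储的部分从略。

```python
import torch
import torch.nn.functional as F

def prototype_loss(feats, labels, num_classes, margin=0.5):
    """feats: (N, D) 嵌入;labels: (N,) 类别标签。"""
    feats = F.normalize(feats, dim=1)
    protos = torch.stack([
        feats[labels == c].mean(dim=0) for c in range(num_classes)
    ])
    protos = F.normalize(protos, dim=1)
    # 1) 代表性:样本应靠近其类原型(余弦相似度尽量大)
    pull = 1.0 - (feats * protos[labels]).sum(dim=1).mean()
    # 2) 可分性:不同类原型两两相似度不超过 margin
    sim = protos @ protos.t()
    off_diag = ~torch.eye(num_classes, dtype=torch.bool)
    push = F.relu(sim[off_diag] - margin).mean()
    return pull + push

feats = torch.randn(16, 64)
labels = torch.arange(16) % 4  # 保证每个类都有样本
print(prototype_loss(feats, labels, num_classes=4))
```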

[CV-25] GenColorBench: A Color Evaluation Benchmark for Text-to-Image Generation Models

【速读】:该论文旨在解决当前文本到图像生成模型在细粒度颜色控制方面的不足,即模型难以准确匹配文本提示中指定的颜色,且现有基准测试缺乏对颜色精度的系统性评估。解决方案的关键在于提出GenColorBench——首个全面的文本到图像颜色生成基准测试,其基于ISCC-NBS和CSS3/X11等颜色体系,包含44,000个聚焦颜色的提示词(覆盖400多种颜色),并结合感知评估与自动化评估方法,从而揭示模型的真实颜色生成能力,识别其对不同颜色规范的理解程度及失败模式,为提升精确颜色生成提供指导。

链接: https://arxiv.org/abs/2510.20586
作者: Muhammad Atif Butt,Alexandra Gomez-Villa,Tao Wu,Javier Vazquez-Corral,Joost Van De Weijer,Kai Wang
机构: Computer Vision Center (计算机视觉中心); Universitat Autònoma de Barcelona (巴塞罗那自治大学); City University of Hong Kong (东莞) (香港城市大学(东莞)); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent years have seen impressive advances in text-to-image generation, with generative and unified image models producing high-quality images from text. Yet these models still struggle with fine-grained color controllability, often failing to accurately match colors specified in text prompts. While existing benchmarks evaluate compositional reasoning and prompt adherence, none systematically assess color precision. Color is fundamental to human visual perception and communication, critical for applications from art to design workflows requiring brand consistency. However, current benchmarks either neglect color or rely on coarse assessments, missing key capabilities such as interpreting RGB values or aligning with human expectations. To this end, we propose GenColorBench, the first comprehensive benchmark for text-to-image color generation, grounded in color systems like ISCC-NBS and CSS3/X11, including numerical colors which are absent elsewhere. With 44K color-focused prompts covering 400+ colors, it reveals models' true capabilities via perceptual and automated assessments. Evaluations of popular text-to-image models using GenColorBench show performance variations, highlighting which color conventions models understand best and identifying failure modes. Our GenColorBench assessments will guide improvements in precise color generation. The benchmark will be made public upon acceptance.
zh
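
颜色一致性自动评测的一个简化示意如下(编辑补充,非 GenColorBench 官方流程):从生成图中取平均色,与提示词指定的命名颜色在 RGB 空间比对;真实基准基于 ISCC-NBS/CSS3 体系并结合感知评估,下述颜色表与阈值均为假设。

```python
import numpy as np

# 仅示意用的 CSS3 颜色子集(RGB)
CSS3_SUBSET = {
    "red": (255, 0, 0), "lime": (0, 255, 0), "blue": (0, 0, 255),
    "navy": (0, 0, 128), "orange": (255, 165, 0), "purple": (128, 0, 128),
}

def dominant_color(img):
    """img: (H, W, 3) uint8;粗略地用全图均值代表主色。"""
    return img.reshape(-1, 3).mean(axis=0)

def color_matches(img, target_name, tol=60.0):
    """判断主色与目标命名颜色的欧氏距离是否在阈值内(阈值为假设值)。"""
    target = np.array(CSS3_SUBSET[target_name], dtype=float)
    dist = np.linalg.norm(dominant_color(img) - target)
    return dist <= tol, dist

img = np.full((64, 64, 3), (250, 10, 10), dtype=np.uint8)  # 近似红色
print(color_matches(img, "red"))
```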

[CV-26] Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence

【速读】:该论文旨在解决视频推理模型缺乏显式时空证据标注的问题,即现有模型通常仅生成文本推理链,而无法明确指出关键证据在时间维度上的出现时刻和空间维度上的具体位置(如对象与边界框),这限制了模型推理的可解释性和可靠性。其解决方案的关键在于提出一种非代理框架 Open-o3 Video,通过构建两个高质量的联合时空标注数据集 STGR-CoT-30k(用于监督微调)和 STGR-RL-36k(用于强化学习),并设计多奖励机制的冷启动强化学习策略,协同优化答案准确性、时间对齐度与空间精度,从而实现对视频中关键帧、物体及其边界框的显式标注与推理结合,显著提升多个视频理解基准(如 V-STAR、VideoMME 等)上的性能表现。

链接: https://arxiv.org/abs/2510.20579
作者: Jiahao Meng,Xiangtai Li,Haochen Wang,Yue Tan,Tao Zhang,Lingdong Kong,Yunhai Tong,Anran Wang,Zhiyang Teng,Yujing Wang,Zhuochen Wang
机构: Peking University (北京大学); ByteDance (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging, as it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning, and carefully collect training data and design training strategies to address the aforementioned challenges. The model highlights key timestamps, objects, and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. To enable this functionality, we first curate and build two high-quality datasets, STGR-CoT-30k for SFT and STGR-RL-36k for RL, with carefully constructed temporal and spatial annotations, since most existing datasets offer either temporal spans for videos or spatial boxes on images, lacking unified spatio-temporal supervision and reasoning traces. Then, we adopt a cold-start reinforcement learning strategy with multiple specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2% on the Qwen2.5-VL baseline. Consistent improvements are also observed on a broad range of video understanding benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench. Beyond accuracy, the reasoning traces produced by Open-o3 Video also provide valuable signals for test-time scaling, enabling confidence-aware verification and improving answer reliability.
zh
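
Open-o3 Video 的多奖励设计可用下述纯 Python 片段示意(编辑补充;权重与字段名均为假设):答案正确性、时间区间 IoU 与边界框 IoU 的加权和,共同鼓励"答对且定位准"。

```python
def interval_iou(a, b):
    """a, b: (start, end) 秒;时间区间 IoU。"""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """a, b: (x1, y1, x2, y2);空间边界框 IoU。"""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter)
    return inter / union if union > 0 else 0.0

def grounded_reward(pred, gt, w=(0.5, 0.25, 0.25)):
    acc = float(pred["answer"] == gt["answer"])   # 答案准确性
    t = interval_iou(pred["span"], gt["span"])    # 时间对齐
    s = box_iou(pred["box"], gt["box"])           # 空间精度
    return w[0]*acc + w[1]*t + w[2]*s

pred = {"answer": "A", "span": (3.0, 7.0), "box": (10, 10, 50, 50)}
gt   = {"answer": "A", "span": (4.0, 8.0), "box": (12, 12, 48, 52)}
print(grounded_reward(pred, gt))
```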

[CV-27] EmbodiedBrain: Expanding Performance Boundaries of Task Planning for Embodied Intelligence

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)和多模态大语言模型(Multimodal LLMs, MLLMs)在具身任务中面临的三大核心问题:一是模型设计与具身智能体(Embodied AI agents)实际需求之间的显著脱节;二是实时延迟与性能之间的不可避免权衡;三是评估指标缺乏真实性、依赖离线评测。其解决方案的关键在于提出 EmbodiedBrain,一个具有 7B 和 32B 参数规模的视觉-语言基础模型,通过三个核心技术实现突破:首先,构建面向智能体的数据结构以对齐任务需求;其次,采用结合大规模监督微调(Supervised Fine-Tuning, SFT)与步骤增强组相对策略优化(Step-Augmented Group Relative Policy Optimization, Step-GRPO)的训练方法,利用前序步骤作为引导先验(Guided Precursors)提升长程任务成功率;最后,引入基础设施级加速的生成式奖励模型(Generative Reward Model, GRM)以提高训练效率,并建立包含通用性、规划能力和端到端仿真基准的三段式评估体系,从而推动具身基础模型性能达到新高度。

链接: https://arxiv.org/abs/2510.20578
作者: Ding Zou,Feifan Wang,Mengyu Ge,Siyuan Fan,Zongbing Zhang,Wei Chen,Lingfeng Wang,Zhongyou Hu,Wenrui Yan,Zhengwei Gao,Hao Wang,Weizhao Jin,Yu Zhang,Hainan Zhao,Mingliang Zhang,Xianxian Xi,Yaru Zhang,Wenyuan Li,Zhengguang Gao,Yurui Zhu
机构: ZTE NebulaBrain Team (中兴通讯 NebulaBrain 团队)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:The realization of Artificial General Intelligence (AGI) necessitates Embodied AI agents capable of robust spatial perception, effective task planning, and adaptive execution in physical environments. However, current large language models (LLMs) and multimodal LLMs (MLLMs) for embodied tasks suffer from key limitations, including a significant gap between model design and agent requirements, an unavoidable trade-off between real-time latency and performance, and the use of unauthentic, offline evaluation metrics. To address these challenges, we propose EmbodiedBrain, a novel vision-language foundation model available in both 7B and 32B parameter sizes. Our framework features an agent-aligned data structure and employs a powerful training methodology that integrates large-scale Supervised Fine-Tuning (SFT) with Step-Augmented Group Relative Policy Optimization (Step-GRPO), which boosts long-horizon task success by integrating preceding steps as Guided Precursors. Furthermore, we incorporate a comprehensive reward system, including a Generative Reward Model (GRM) accelerated at the infrastructure level, to improve training efficiency. To enable thorough validation, we establish a three-part evaluation system encompassing General, Planning, and End-to-End Simulation Benchmarks, highlighted by the proposal and open-sourcing of a novel, challenging simulation environment. Experimental results demonstrate that EmbodiedBrain achieves superior performance across all metrics, establishing a new state-of-the-art for embodied foundation models. Towards paving the way for the next generation of generalist embodied agents, we open-source all of our data, model weights, and evaluation methods, which are available at this https URL.
zh
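
GRPO 一类方法的组内相对优势计算本身非常简洁,下面给出示意(编辑补充;Step-GRPO 在此基础上把前序步骤作为引导先验拼入上下文,提示拼接格式为假设):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """同一提示下采样的一组回答的奖励 -> 组内标准化优势。"""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def step_augmented_prompt(task, finished_steps):
    """把已完成的前序步骤作为引导先验拼入提示(格式为假设)。"""
    prefix = "\n".join(f"Step {i+1}: {s}" for i, s in enumerate(finished_steps))
    return f"{task}\nCompleted steps:\n{prefix}\nNext step:"

print(group_relative_advantages([1.0, 0.0, 0.5, 0.0]))
print(step_augmented_prompt("把杯子放到桌上", ["找到杯子", "抓取杯子"]))
```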

[CV-28] From Far and Near: Perceptual Evaluation of Crowd Representations Across Levels of Detail

【速读】:该论文旨在解决用户在不同细节层级(Level of Detail, LoD)和观看距离下对人群角色视觉质量的感知问题。研究对比了几何网格、基于图像的 impostors、神经辐射场(Neural Radiance Fields, NeRFs)以及3D高斯表示这四种方法在视觉保真度与计算性能之间的权衡关系,其解决方案的关键在于通过定性和定量实验结果,为设计感知优化的群体渲染LoD策略提供依据,从而在保证视觉质量的同时提升渲染效率。

链接: https://arxiv.org/abs/2510.20558
作者: Xiaohan Sun,Carol O’Sullivan
机构: Trinity College Dublin (三一学院); Ireland (爱尔兰)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:In this paper, we investigate how users perceive the visual quality of crowd character representations at different levels of detail (LoD) and viewing distances. Each of the representations studied, namely geometric meshes, image-based impostors, Neural Radiance Fields (NeRFs), and 3D Gaussians, exhibits distinct trade-offs between visual fidelity and computational performance. Our qualitative and quantitative results provide insights to guide the design of perceptually optimized LoD strategies for crowd rendering.
zh

[CV-29] From Cheap to Pro: A Learning-based Adaptive Camera Parameter Network for Professional-Style Imaging

【速读】:该论文旨在解决消费级相机系统在复杂光照条件下(如低光、高动态范围和逆光等)难以保持稳定图像质量的问题,这些问题会导致曝光不足、色偏和色调不一致,进而影响下游视觉任务的性能。解决方案的关键在于提出ACamera-Net,一个轻量且场景自适应的相机参数调整网络,能够直接从RAW数据中预测最优曝光与白平衡参数;其核心由两个模块组成:ACamera-Exposure用于估计ISO以缓解欠曝和对比度损失,ACamera-Color则预测相关色温(correlated color temperature)及增益因子以提升颜色一致性,该模型专为边缘设备实时推理优化,无需额外图像增强模块即可显著改善图像质量和感知输出稳定性。

链接: https://arxiv.org/abs/2510.20550
作者: Fuchen Li,Yansong Du,Wenbo Cheng,Xiaoxia Zhou,Sen Yin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages. Code and project page will be released

点击查看摘要

Abstract:Consumer-grade camera systems often struggle to maintain stable image quality under complex illumination conditions such as low light, high dynamic range, and backlighting, as well as spatial color temperature variation. These issues lead to underexposure, color casts, and tonal inconsistency, which degrade the performance of downstream vision tasks. To address this, we propose ACamera-Net, a lightweight and scene-adaptive camera parameter adjustment network that directly predicts optimal exposure and white balance from RAW inputs. The framework consists of two modules: ACamera-Exposure, which estimates ISO to alleviate underexposure and contrast loss, and ACamera-Color, which predicts correlated color temperature and gain factors for improved color consistency. Optimized for real-time inference on edge devices, ACamera-Net can be seamlessly integrated into imaging pipelines. Trained on diverse real-world data with annotated references, the model generalizes well across lighting conditions. Extensive experiments demonstrate that ACamera-Net consistently enhances image quality and stabilizes perception outputs, outperforming conventional auto modes and lightweight baselines without relying on additional image enhancement modules.
zh

[CV-30] Deep Learning-Powered Visual SLAM Aimed at Assisting Visually Impaired Navigation

【速读】:该论文旨在解决视觉SLAM(Simultaneous Localization and Mapping,即时定位与地图构建)在低纹理、运动模糊和复杂光照等挑战性条件下鲁棒性不足的问题,这类问题常见于为视障人群设计的辅助导航系统中,严重影响定位精度与跟踪稳定性。解决方案的关键在于提出SELM-SLAM3框架,该框架融合了SuperPoint(特征提取)与LightGlue(特征匹配)的深度学习方法,显著提升了特征检测与匹配的准确性与鲁棒性,从而在TUM RGB-D、ICL-NUIM及TartanAir等多个具有挑战性的数据集上实现优于传统ORB-SLAM3平均87.84%、且超过当前最优RGB-D SLAM系统的性能表现。

链接: https://arxiv.org/abs/2510.20549
作者: Marziyeh Bamdad,Hans-Peter Hutter,Alireza Darvishy
机构: ZHAW School of Engineering (ZHAW工程学院); University of Zurich (苏黎世大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages, 7 figures, 4 tables

点击查看摘要

Abstract:Despite advancements in SLAM technologies, robust operation under challenging conditions such as low texture, motion blur, or difficult lighting remains an open problem. Such conditions are common in applications such as assistive navigation for the visually impaired. These challenges undermine localization accuracy and tracking stability, reducing navigation reliability and safety. To overcome these limitations, we present SELM-SLAM3, a deep learning-enhanced visual SLAM framework that integrates SuperPoint and LightGlue for robust feature extraction and matching. We evaluated our framework using TUM RGB-D, ICL-NUIM, and TartanAir datasets, which feature diverse and challenging scenarios. SELM-SLAM3 outperforms conventional ORB-SLAM3 by an average of 87.84% and exceeds state-of-the-art RGB-D SLAM systems by 36.77%. Our framework demonstrates enhanced performance under challenging conditions, such as low-texture scenes and fast motion, providing a reliable platform for developing navigation aids for the visually impaired.
zh

[CV-31] Blur2seq: Blind Deblurring and Camera Trajectory Estimation from a Single Camera Motion-blurred Image

【速读】:该论文旨在解决由相机抖动(camera shake)引起的运动模糊问题,尤其是在大范围或旋转运动导致的严重或空间变化性模糊场景下,传统端到端去模糊网络难以有效恢复清晰图像的问题。其解决方案的关键在于提出了一种基于深度学习的联合估计框架,通过引入可微分的投影运动模糊模型(Projective Motion Blur Model, PMBM),实现从单张模糊图像中同时估计出潜在的清晰图像和相机运动轨迹;其中神经网络预测完整的3D旋转轨迹,并引导基于模型的重建网络进行端到端训练,从而在保证性能的同时提供可解释性;此外,还通过重模糊损失(reblur loss)对推理后的轨迹进行优化,增强输入模糊图与输出恢复图像之间的一致性,显著提升了复杂模糊场景下的重建质量。

链接: https://arxiv.org/abs/2510.20539
作者: Guillermo Carbajal,Andrés Almansa,Pablo Musé
机构: Universidad de la República (乌拉圭共和国大学); Paris Descartes University (巴黎笛卡尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Motion blur caused by camera shake, particularly under large or rotational movements, remains a major challenge in image restoration. We propose a deep learning framework that jointly estimates the latent sharp image and the underlying camera motion trajectory from a single blurry image. Our method leverages the Projective Motion Blur Model (PMBM), implemented efficiently using a differentiable blur creation module compatible with modern networks. A neural network predicts a full 3D rotation trajectory, which guides a model-based restoration network trained end-to-end. This modular architecture provides interpretability by revealing the camera motion that produced the blur. Moreover, this trajectory enables the reconstruction of the sequence of sharp images that generated the observed blurry image. To further refine results, we optimize the trajectory post-inference via a reblur loss, improving consistency between the blurry input and the restored output. Extensive experiments show that our method achieves state-of-the-art performance on both synthetic and real datasets, particularly in cases with severe or spatially variant blur, where end-to-end deblurring networks struggle. Code and trained models are available at this https URL
zh
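
推理后基于重模糊损失优化轨迹的思路可用 PyTorch 示意如下(编辑补充;为便于演示,把 PMBM 的 3D 旋转轨迹简化为 2D 平移序列,复原图与步数均为假设):

```python
import torch
import torch.nn.functional as F

def reblur(sharp, offsets):
    """sharp: (1,C,H,W);offsets: (T,2) 归一化平移,可对其反传梯度。"""
    theta = torch.tensor([[[1., 0., 0.], [0., 1., 0.]]])  # 恒等仿射
    grid = F.affine_grid(theta, sharp.shape, align_corners=False)
    frames = [
        F.grid_sample(sharp, grid + o.view(1, 1, 1, 2),
                      padding_mode="border", align_corners=False)
        for o in offsets
    ]
    return torch.stack(frames).mean(dim=0)  # 轨迹上各位姿的平均即模糊

blurry = torch.rand(1, 3, 32, 32)                 # 观测到的模糊图
restored = torch.rand(1, 3, 32, 32)               # 假设为网络复原结果
offsets = torch.zeros(8, 2, requires_grad=True)   # 待优化的轨迹参数
opt = torch.optim.Adam([offsets], lr=1e-2)
for _ in range(50):
    opt.zero_grad()
    loss = F.mse_loss(reblur(restored, offsets), blurry)  # 重模糊一致性
    loss.backward()
    opt.step()
```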

[CV-32] Fake-in-Facext: Towards Fine-Grained Explainable DeepFake Analysis

【速读】:该论文旨在解决当前生成式 AI (Generative AI) 在深度伪造分析(DeepFake Analysis)中缺乏细粒度视觉上下文感知的问题,即现有方法在数据标注上对伪造痕迹的描述粗粒度且不可靠,模型无法输出文本解释与视觉伪造证据之间的关联,并且不支持针对任意面部区域的查询。解决方案的关键在于提出 Fake-in-Facext (FiFa) 框架,其核心创新包括:1)构建面部图像概念树(Facial Image Concept Tree, FICT),实现细粒度区域划分以提升标注可靠性;2)引入 Artifact-Grounding Explanation (AGE) 任务,生成包含操作区域分割掩码的文本解释,强化文本与视觉证据的对齐;3)设计统一多任务学习架构 FiFa-MLLM,支持丰富的多模态输入与输出,从而实现细粒度可解释深度伪造分析(Explainable DeepFake Analysis, XDFA)。

链接: https://arxiv.org/abs/2510.20531
作者: Lixiong Qin,Yang Zhang,Mei Wang,Jiani Hu,Weihong Deng,Weiran Xu
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Beijing Normal University (北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 25 pages, 9 figures, 17 tables

点击查看摘要

Abstract:The advancement of Multimodal Large Language Models (MLLMs) has bridged the gap between vision and language tasks, enabling the implementation of Explainable DeepFake Analysis (XDFA). However, current methods suffer from a lack of fine-grained awareness: the description of artifacts in data annotation is unreliable and coarse-grained, and the models fail to support the output of connections between textual forgery explanations and the visual evidence of artifacts, as well as the input of queries for arbitrary facial regions. As a result, their responses are not sufficiently grounded in Face Visual Context (Facext). To address this limitation, we propose the Fake-in-Facext (FiFa) framework, with contributions focusing on data annotation and model construction. We first define a Facial Image Concept Tree (FICT) to divide facial images into fine-grained regional concepts, thereby obtaining a more reliable data annotation pipeline, FiFa-Annotator, for forgery explanation. Based on this dedicated data annotation, we introduce a novel Artifact-Grounding Explanation (AGE) task, which generates textual forgery explanations interleaved with segmentation masks of manipulated artifacts. We propose a unified multi-task learning architecture, FiFa-MLLM, to simultaneously support abundant multimodal inputs and outputs for fine-grained Explainable DeepFake Analysis. With multiple auxiliary supervision tasks, FiFa-MLLM can outperform strong baselines on the AGE task and achieve SOTA performance on existing XDFA datasets. The code and data will be made open-source at this https URL.
zh

[CV-33] Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning

【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在专注于复杂推理任务时导致的效率低下与通用能力退化的问题。具体而言,现有模型即使面对简单任务也会启用高计算成本的推理机制,且过度优化特定推理能力会削弱其在一般视觉问答(VQA)和光学字符识别(OCR)等任务上的表现。解决方案的关键在于提出 Metis-HOME:一种混合优化的专家混合(Mixture-of-Experts, MoE)框架,通过将原始密集模型拆分为两个独立专家分支——一个专用于多步复杂推理的“思考分支”,另一个面向快速直接推断的“非思考分支”,并引入轻量级可训练路由机制动态分配输入查询至最优专家。该设计实现了“混合思维”范式,在提升复杂推理性能的同时显著改善了模型的泛化能力,从而有效缓解了推理能力与通用性之间的权衡困境。

链接: https://arxiv.org/abs/2510.20519
作者: Xiaohan Lan,Fanfan Liu,Haibo Qiu,Siqi Yang,Delian Ruan,Peng Shi,Lin Ma
机构: Meituan(美团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Inspired by recent advancements in LLM reasoning, the field of multimodal reasoning has seen remarkable progress, achieving significant performance gains on intricate tasks such as mathematical problem-solving. Despite this progress, current multimodal large reasoning models exhibit two key limitations. They tend to employ computationally expensive reasoning even for simple queries, leading to inefficiency. Furthermore, this focus on specialized reasoning often impairs their broader, more general understanding capabilities. In this paper, we propose Metis-HOME: a Hybrid Optimized Mixture-of-Experts framework designed to address this trade-off. Metis-HOME enables a "Hybrid Thinking" paradigm by structuring the original dense model into two distinct expert branches: a thinking branch tailored for complex, multi-step reasoning, and a non-thinking branch optimized for rapid, direct inference on tasks like general VQA and OCR. A lightweight, trainable router dynamically allocates queries to the most suitable expert. We instantiate Metis-HOME by adapting the Qwen2.5-VL-7B into an MoE architecture. Comprehensive evaluations reveal that our approach not only substantially enhances complex reasoning abilities but also improves the model's general capabilities, reversing the degradation trend observed in other reasoning-specialized models. Our work establishes a new paradigm for building powerful and versatile MLLMs, effectively resolving the prevalent reasoning-vs-generalization dilemma.
zh
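
"思考/非思考"双分支加轻量路由的结构可用如下骨架示意(编辑补充,非官方实现;维度与分支结构均为占位假设):

```python
import torch
import torch.nn as nn

class HybridMoE(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.router = nn.Linear(dim, 2)       # 轻量可训练路由
        self.thinking = nn.Linear(dim, dim)   # 占位:多步推理分支
        self.fast = nn.Linear(dim, dim)       # 占位:直接推断分支

    def forward(self, query_emb):
        probs = self.router(query_emb).softmax(dim=-1)  # (B, 2)
        route = probs.argmax(dim=-1)                    # 0=思考, 1=非思考
        # 示意起见两个分支都计算后再选;实际系统会只分派一个分支以省算力
        out = torch.where(
            route.unsqueeze(-1) == 0,
            self.thinking(query_emb),
            self.fast(query_emb),
        )
        return out, route

m = HybridMoE()
out, route = m(torch.randn(4, 768))
print(route)  # 每条查询被分派到的分支
```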

[CV-34] EchoDistill: Bidirectional Concept Distillation for One-Step Diffusion Personalization

【速读】:该论文旨在解决单步扩散模型(one-step diffusion model, 1-SDM)在个性化(personalization)过程中难以有效捕捉新概念分布的问题,尤其是在文本到图像(text-to-image, T2I)生成任务中。现有方法在单步模型上进行概念注入时,往往受限于其有限的建模能力,导致生成质量下降或概念保真度不足。解决方案的关键在于提出了一种双向概念蒸馏框架(bidirectional concept distillation framework, EchoDistill),其中多步扩散模型(teacher)与单步扩散模型(student)协同训练:首先将概念从教师模型蒸馏至学生模型,随后通过“回响”机制将学生模型的生成特征反馈给教师模型,形成双向信息流动;同时共享文本编码器以确保语义一致性,并引入对抗损失和对齐损失优化学生模型的分布拟合与输出一致性。该机制不仅显著提升了单步模型的个性化能力,还增强了教师模型的生成质量,为T2I扩散模型提供了一种高效、快速的个性化新范式。

链接: https://arxiv.org/abs/2510.20512
作者: Yixiong Yang,Tao Wu,Senmao Li,Shiqi Yang,Yaxing Wang,Joost van de Weijer,Kai Wang
机构: Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)); Computer Vision Center, Universitat Autònoma de Barcelona(计算机视觉中心,巴塞罗那自治大学); VCIP, CS, Nankai University(VCIP,计算机科学,南开大学); Program of Computer Science, City University of Hong Kong (Dongguan)(计算机科学项目,香港城市大学(东莞)); City University of Hong Kong, HK SAR(香港城市大学,香港特别行政区)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page available at this https URL

点击查看摘要

Abstract:Recent advances in accelerating text-to-image (T2I) diffusion models have enabled the synthesis of high-fidelity images even in a single step. However, personalizing these models to incorporate novel concepts remains a challenge due to the limited capacity of one-step models to capture new concept distributions effectively. We propose a bidirectional concept distillation framework, EchoDistill, to enable one-step diffusion personalization (1-SDP). Our approach involves an end-to-end training process where a multi-step diffusion model (teacher) and a one-step diffusion model (student) are trained simultaneously. The concept is first distilled from the teacher model to the student, and then echoed back from the student to the teacher. During EchoDistill, we share the text encoder between the two models to ensure consistent semantic understanding. Following this, the student model is optimized with adversarial losses to align with the real image distribution and with alignment losses to maintain consistency with the teacher's output. Furthermore, we introduce the bidirectional echoing refinement strategy, wherein the student model leverages its faster generation capability to provide feedback to the teacher model. This bidirectional concept distillation mechanism not only enhances the student's ability to personalize novel concepts but also improves the generative quality of the teacher model. Our experiments demonstrate that this collaborative framework significantly outperforms existing personalization methods over the 1-SDP setup, establishing a novel paradigm for rapid and effective personalization in T2I diffusion models.
zh

[CV-35] Reliable and Reproducible Demographic Inference for Fairness in Face Analysis

【速读】:该论文旨在解决面部分析系统(Face Analysis Systems, FAS)公平性评估中因依赖自动人口统计属性推断(Demographic Attribute Inference, DAI)而产生的偏差问题,其核心挑战在于DAI本身的可靠性直接影响公平性估计的准确性与稳定性。为应对这一问题,论文提出了一种可完全复现的DAI流水线,其关键创新在于采用模块化迁移学习架构:利用预训练人脸特征编码器结合非线性分类头,替代传统的端到端训练方式。该设计不仅提升了DAI在性别和种族等属性上的推断准确率(尤其在更具挑战性的种族识别上表现显著优于基线),还引入了基于身份内一致性的鲁棒性度量,可用于任意人口统计分组方案,从而为FAS的公平性审计提供了更可靠的基础。

链接: https://arxiv.org/abs/2510.20482
作者: Alexandre Fournier-Montgieux,Hervé Le Borgne,Adrian Popescu,Bertrand Luvison
机构: Université Paris-Saclay (巴黎-萨克雷大学); CEA (法国原子能和替代能源委员会); LIST (法国原子能和替代能源委员会-电子与信息技术实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fairness evaluation in face analysis systems (FAS) typically depends on automatic demographic attribute inference (DAI), which itself relies on predefined demographic segmentation. However, the validity of fairness auditing hinges on the reliability of the DAI process. We begin by providing a theoretical motivation for this dependency, showing that improved DAI reliability leads to less biased and lower-variance estimates of FAS fairness. To address this, we propose a fully reproducible DAI pipeline that replaces conventional end-to-end training with a modular transfer learning approach. Our design integrates pretrained face recognition encoders with non-linear classification heads. We audit this pipeline across three dimensions: accuracy, fairness, and a newly introduced notion of robustness, defined via intra-identity consistency. The proposed robustness metric is applicable to any demographic segmentation scheme. We benchmark the pipeline on gender and ethnicity inference across multiple datasets and training setups. Our results show that the proposed method outperforms strong baselines, particularly on ethnicity, which is the more challenging attribute. To promote transparency and reproducibility, we will publicly release the training dataset metadata, full codebase, pretrained models, and evaluation toolkit. This work contributes a reliable foundation for demographic inference in fairness auditing.
zh

[CV-36] Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视频推理任务中面临的两大核心挑战:一是基于强化学习(Reinforcement Learning, RL)的方法通常依赖纯文本链,导致推理结论缺乏视觉依据或产生幻觉;二是帧检索类方法虽能引入视觉 grounding,但在证据定位上存在准确性不足的问题。解决方案的关键在于提出 Conan 框架,其创新性体现在两个方面:首先构建了大规模自动标注的推理轨迹数据集 Conan-91K,涵盖帧识别、证据推理与动作决策三阶段;其次设计了一种多阶段渐进式冷启动策略,并结合 Identification-Reasoning-Action (AIR) 强化学习与视觉推理(RLVR)训练框架,实现跨帧线索的联合推理与动态决策机制,从而提升多步视频推理的准确性和可解释性。

链接: https://arxiv.org/abs/2510.20470
作者: Kun Ouyang,Yuanxin Liu,Linli Yao,Yishuo Cai,Hao Zhou,Jie Zhou,Fandong Meng,Xu Sun
机构: Peking University (北京大学); Tencent Inc. (腾讯公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding but still struggle with inaccurate evidence localization. To address these challenges, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies contextual and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we (1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that includes frame identification, evidence reasoning, and action decision, and (2) design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to jointly enhance multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long-video understanding tasks, validating its strong scalability and robustness.
zh

[CV-37] ransferable Black-Box One-Shot Forging of Watermarks via Image Preference Models NEURIPS2025

【速读】:该论文旨在解决当前后处理图像水印(post-hoc image watermarking)在面对水印伪造(watermark forging)攻击时的安全性不足问题,即恶意方从合法水印图像中窃取水印并将其非法应用于恶意内容的场景。解决方案的关键在于提出一种基于偏好模型(preference model)的新型攻击方法:首先,利用排名损失(ranking loss)在纯程序生成图像上训练一个无需真实水印即可判断图像是否被水印的模型;其次,通过反向传播优化输入图像,实现对水印的移除与伪造,且仅需单张水印图像即可完成攻击,无需了解水印模型的具体结构或参数,从而显著提升了攻击的实用性与普适性。实验证明该方法可有效伪造多种主流后处理图像水印模型的水印,揭示了现有水印方案在安全上的脆弱性。

链接: https://arxiv.org/abs/2510.20468
作者: Tomáš Souček,Sylvestre-Alvise Rebuffi,Pierre Fernandez,Nikola Jovanović,Hady Elsahar,Valeriu Lacatusu,Tuan Tran,Alexandre Mourachko
机构: Meta FAIR (Meta); ETH Zurich (苏黎世联邦理工学院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025

点击查看摘要

Abstract:Recent years have seen a surge in interest in digital content watermarking techniques, driven by the proliferation of generative models and increased legal pressure. With an ever-growing percentage of AI-generated content available online, watermarking plays an increasingly important role in ensuring content authenticity and attribution at scale. There have been many works assessing the robustness of watermarking to removal attacks, yet, watermark forging, the scenario when a watermark is stolen from genuine content and applied to malicious content, remains underexplored. In this work, we investigate watermark forging in the context of widely used post-hoc image watermarking. Our contributions are as follows. First, we introduce a preference model to assess whether an image is watermarked. The model is trained using a ranking loss on purely procedurally generated images without any need for real watermarks. Second, we demonstrate the model’s capability to remove and forge watermarks by optimizing the input image through backpropagation. This technique requires only a single watermarked image and works without knowledge of the watermarking model, making our attack much simpler and more practical than attacks introduced in related work. Third, we evaluate our proposed method on a variety of post-hoc image watermarking models, demonstrating that our approach can effectively forge watermarks, questioning the security of current watermarking approaches. Our code and further resources are publicly available.
zh
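
偏好模型的排名损失与"仅对输入图像反传优化"的伪造流程可示意如下(编辑补充;打分网络为随机初始化的占位模型,真实方法在程序化生成图像上以排名损失训练,优化目标与步数均为假设):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# 占位的水印偏好打分模型:分数越高越"像带水印"
scorer = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1),
)
for p in scorer.parameters():
    p.requires_grad_(False)  # 攻击阶段冻结模型,只优化图像

def ranking_loss(s_watermarked, s_clean):
    # 训练目标:带水印样本得分高于无水印样本
    return -F.logsigmoid(s_watermarked - s_clean).mean()

# 伪造攻击:使恶意图像的水印得分逼近单张被窃的水印图
target = torch.rand(1, 3, 64, 64)   # 恶意内容
stolen = torch.rand(1, 3, 64, 64)   # 单张带水印图
s_ref = scorer(stolen)
x = target.clone().requires_grad_(True)
opt = torch.optim.Adam([x], lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    loss = F.mse_loss(scorer(x), s_ref) \
         + 0.1 * F.mse_loss(x, target)  # 保持内容不变的正则项
    loss.backward()
    opt.step()
    x.data.clamp_(0, 1)
```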

[CV-38] Dynamic Weight Adjustment for Knowledge Distillation: Leverag ing Vision Transformer for High-Accuracy Lung Cancer Detection and Real-Time Deployment

【速读】:该论文旨在解决肺部癌症(Lung Cancer, LC)分类中因图像区域不确定性与复杂性导致的诊断准确性不足问题。传统知识蒸馏(Knowledge Distillation, KD)方法采用固定权重进行知识迁移,难以适应不同区域的置信度差异,从而限制了学生模型对高不确定性区域的鲁棒性。其解决方案的关键在于提出一种基于动态模糊逻辑(Fuzzy Logic)的知识蒸馏机制,通过实时调整蒸馏权重,使学生模型聚焦于高置信度区域并降低对模糊区域的关注,从而提升模型在复杂病理图像中的泛化能力;此外,结合视觉Transformer(Vision Transformer, ViT-B32)作为教师模型、遗传算法(Genetic Algorithm, GA)优化学生模型选择,并引入小波融合与图像增强技术以改善输入质量,最终实现跨模态医学影像(如组织病理学图像和CT扫描图像)上的高精度分类性能。

链接: https://arxiv.org/abs/2510.20438
作者: Saif Ur Rehman Khan,Muhammad Nabeel Asim,Sebastian Vollmer,Andreas Dengel
机构: DFKI(德国弗劳恩霍夫计算机图形学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents the FuzzyDistillViT-MobileNet model, a novel approach for lung cancer (LC) classification, leveraging dynamic fuzzy logic-driven knowledge distillation (KD) to address uncertainty and complexity in disease diagnosis. Unlike traditional models that rely on static KD with fixed weights, our method dynamically adjusts the distillation weight using fuzzy logic, enabling the student model to focus on high-confidence regions while reducing attention to ambiguous areas. This dynamic adjustment improves the model's ability to handle varying uncertainty levels across different regions of LC images. We employ the Vision Transformer (ViT-B32) as the instructor model, which effectively transfers knowledge to the student model, MobileNet, enhancing the student's generalization capabilities. The training process is further optimized using a dynamic weight adjustment mechanism that adapts the training procedure for improved convergence and performance. To enhance image quality, we introduce pixel-level image fusion improvement techniques such as Gamma correction and Histogram Equalization. The processed images (Pix1 and Pix2) are fused using a wavelet-based fusion method to improve image resolution and feature preservation. This fusion method uses the wavedec2 function to standardize images to a 224x224 resolution, decompose them into multi-scale frequency components, and recursively average coefficients at each level for better feature representation. To address computational efficiency, a Genetic Algorithm (GA) is used to select the most suitable pre-trained student model from a pool of 12 candidates, balancing model performance with computational cost. The model is evaluated on two datasets, including LC25000 histopathological images (99.16% accuracy) and IQOTH/NCCD CT-scan images (99.54% accuracy), demonstrating robustness across different imaging domains.
zh
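
动态模糊权重蒸馏的核心是"教师置信度到蒸馏权重"的隶属度映射,可示意如下(编辑补充;隶属度函数形状、阈值与温度均为假设):

```python
import torch
import torch.nn.functional as F

def fuzzy_weight(teacher_logits, lo=0.5, hi=0.9):
    """教师最大类概率的分段线性隶属度:低置信 -> 0,高置信 -> 1。"""
    conf = teacher_logits.softmax(dim=-1).max(dim=-1).values
    return ((conf - lo) / (hi - lo)).clamp(0.0, 1.0)

def fuzzy_kd_loss(student_logits, teacher_logits, labels, T=4.0):
    w = fuzzy_weight(teacher_logits)  # (B,) 每样本蒸馏权重
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="none",
    ).sum(dim=-1) * (T * T)
    ce = F.cross_entropy(student_logits, labels, reduction="none")
    # 高置信区域多听教师(蒸馏),模糊区域多靠标签(交叉熵)
    return (w * kd + (1 - w) * ce).mean()

s, t = torch.randn(8, 3), torch.randn(8, 3)
y = torch.randint(0, 3, (8,))
print(fuzzy_kd_loss(s, t, y))
```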

[CV-39] Mitigating Cross-modal Representation Bias for Multicultural Image-to-Recipe Retrieval

【速读】:该论文旨在解决图像到食谱检索(image-to-recipe retrieval)中因视觉信息局限性导致的跨模态表示偏差问题,即现有方法假设食物图像能完整体现食谱中的文本细节,但实际上图像仅反映烹饪结果而非过程,从而导致模型过度关注显性视觉特征而忽略食谱中关键但不可见的食材使用和烹饪步骤差异。解决方案的关键在于提出一种新颖的因果表示学习方法,通过预测图像中可能被忽略的厨艺元素(culinary elements),并将其显式注入跨模态表示学习过程中,以缓解因数据来源多样(如不同菜系混合)带来的表示偏差,从而提升对细微差异食谱的检索准确性。

链接: https://arxiv.org/abs/2510.20393
作者: Qing Wang,Chong-Wah Ngo,Yu Cao,Ee-Peng Lim
机构: Singapore Management University (新加坡管理大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: ACM Multimedia 2025

点击查看摘要

Abstract:Existing approaches for image-to-recipe retrieval have the implicit assumption that a food image can fully capture the details textually documented in its recipe. However, a food image only reflects the visual outcome of a cooked dish and not the underlying cooking process. Consequently, learning cross-modal representations to bridge the modality gap between images and recipes tends to ignore subtle, recipe-specific details that are not visually apparent but are crucial for recipe retrieval. Specifically, the representations are biased to capture the dominant visual elements, resulting in difficulty in ranking similar recipes with subtle differences in use of ingredients and cooking methods. The bias in representation learning is expected to be more severe when the training data is mixed of images and recipes sourced from different cuisines. This paper proposes a novel causal approach that predicts the culinary elements potentially overlooked in images, while explicitly injecting these elements into cross-modal representation learning to mitigate biases. Experiments are conducted on the standard monolingual Recipe1M dataset and a newly curated multilingual multicultural cuisine dataset. The results indicate that the proposed causal representation learning is capable of uncovering subtle ingredients and cooking actions and achieves impressive retrieval performance on both monolingual and multilingual multicultural datasets.
zh

[CV-40] Positional Encoding Field

【速读】:该论文旨在解决扩散 Transformer (DiT) 在视觉生成任务中对空间结构建模能力有限的问题,尤其是在单图新视角合成和可控空间图像编辑等场景下缺乏精确的几何感知与局部控制能力。其解决方案的关键在于提出位置编码场(Positional Encoding Field, PE-Field),该方法将传统的二维位置编码扩展为具有深度感知能力的结构化三维场,并引入分层编码以实现子 patch 级别的精细控制,从而让 DiT 直接在三维空间中建模几何信息,显著提升了生成结果的空间一致性与可控性。

链接: https://arxiv.org/abs/2510.20385
作者: Yunpeng Bai,Haoxiang Li,Qixing Huang
机构: University of Texas at Austin (德克萨斯大学奥斯汀分校); Pixocial Technology (Pixocial科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 9 figures

点击查看摘要

Abstract:Diffusion Transformers (DiTs) have emerged as the dominant architecture for visual generation, powering state-of-the-art image and video models. By representing images as patch tokens with positional encodings (PEs), DiTs combine Transformer scalability with spatial and temporal inductive biases. In this work, we revisit how DiTs organize visual content and discover that patch tokens exhibit a surprising degree of independence: even when PEs are perturbed, DiTs still produce globally coherent outputs, indicating that spatial coherence is primarily governed by PEs. Motivated by this finding, we introduce the Positional Encoding Field (PE-Field), which extends positional encodings from the 2D plane to a structured 3D field. PE-Field incorporates depth-aware encodings for volumetric reasoning and hierarchical encodings for fine-grained sub-patch control, enabling DiTs to model geometry directly in 3D space. Our PE-Field-augmented DiT achieves state-of-the-art performance on single-image novel view synthesis and generalizes to controllable spatial image editing.
zh
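
"把 2D 位置编码扩展到带深度的 3D 场"的最小示意如下(编辑补充;按 x/y/z 三路正弦编码后拼接,通道划分方式为假设,论文中的分层子 patch 编码从略):

```python
import torch

def sinusoidal_1d(pos, dim):
    """pos: (N,) 坐标;返回 (N, dim) 正弦编码。"""
    freqs = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32)
        * (-torch.log(torch.tensor(10000.0)) / dim)
    )
    ang = pos.unsqueeze(1) * freqs.unsqueeze(0)
    return torch.cat([ang.sin(), ang.cos()], dim=1)

def pe_field_3d(xyz, dim=192):
    """xyz: (N, 3) patch 的 (x, y, depth);三个轴各占 dim//3 通道。"""
    d = dim // 3
    return torch.cat([sinusoidal_1d(xyz[:, i], d) for i in range(3)], dim=1)

xyz = torch.tensor([[0., 0., 1.0], [3., 4., 2.5]])
print(pe_field_3d(xyz).shape)  # torch.Size([2, 192])
```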

[CV-41] Synthetic Data for Robust Runway Detection

【速读】:该论文旨在解决深度视觉模型在工业级关键应用(如飞机自主着陆系统中的跑道检测)中因真实数据收集与标注成本高、覆盖场景有限(尤其是罕见场景)而导致的训练瓶颈问题。解决方案的关键在于利用商业飞行模拟器生成合成图像,并结合少量标注的真实图像进行训练,通过控制合成数据的生成过程及真实与合成数据的融合策略,有效缓解了合成到真实的数据分布偏移问题;同时,采用定制化的域适应(domain adaptation)方法提升了模型在未见条件(如夜间图像)下的鲁棒性,从而实现了高精度且可靠的物体检测性能。

链接: https://arxiv.org/abs/2510.20349
作者: Estelle Chigot,Dennis G. Wilson,Meriem Ghrib,Fabrice Jimenez,Thomas Oberlin
机构: Fédération ENAC ISAE-SUPAERO ONERA, Université de Toulouse (图卢兹大学); Airbus (空中客车)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep vision models are now mature enough to be integrated in industrial and possibly critical applications such as autonomous navigation. Yet, the data collection and labeling required to train such models demand too much effort and cost for a single company or product. This drawback is more significant in critical applications, where training data must include all possible conditions, including rare scenarios. In this perspective, generating synthetic images is an appealing solution, since it allows a cheap yet reliable covering of all the conditions and environments, provided that the impact of the synthetic-to-real distribution shift is mitigated. In this article, we consider runway detection, which is a critical part of autonomous landing systems developed by aircraft manufacturers. We propose an image generation approach based on a commercial flight simulator that complements a few annotated real images. By controlling the image generation and the integration of real and synthetic data, we show that standard object detection models can achieve accurate prediction. We also evaluate their robustness with respect to adverse conditions, in our case nighttime images that were not represented in the real data, and show the benefit of using a customized domain adaptation strategy.
zh

[CV-42] AccuQuant: Simulating Multiple Denoising Steps for Quantizing Diffusion Models NEURIPS2025

【速读】:该论文针对扩散模型在后训练量化(Post-Training Quantization, PTQ)过程中存在的误差累积问题展开研究。现有方法通常独立优化每一步去噪过程中的量化差异,忽略了采样过程中误差随去噪步骤逐步累积的特性,导致生成质量下降。解决方案的关键在于提出AccuQuant方法,通过显式模拟多个去噪步骤的输出,最小化全精度扩散模型与其量化版本在若干连续步骤内的输出差异,从而有效缓解误差累积效应;同时引入一种新颖的目标函数与高效实现技术,将内存复杂度从O(n)显著降低至O(1),其中n为去噪步数,兼顾了性能与效率。

链接: https://arxiv.org/abs/2510.20348
作者: Seunghoon Lee,Jeongwoo Choi,Byunggwan Son,Jaehyeon Moon,Jeimin Jeon,Bumsub Ham
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to NeurIPS 2025

点击查看摘要

Abstract:We present in this paper a novel post-training quantization (PTQ) method, dubbed AccuQuant, for diffusion models. We show analytically and empirically that quantization errors for diffusion models are accumulated over denoising steps in a sampling process. To alleviate the error accumulation problem, AccuQuant minimizes the discrepancies between outputs of a full-precision diffusion model and its quantized version within a couple of denoising steps. That is, it simulates multiple denoising steps of a diffusion sampling process explicitly for quantization, accounting for the accumulated errors over multiple denoising steps, which is in contrast to previous approaches that imitate a training process of diffusion models, namely, minimizing the discrepancies independently for each step. We also present an efficient implementation technique for AccuQuant, together with a novel objective, which reduces the memory complexity significantly from $\mathcal{O}(n)$ to $\mathcal{O}(1)$, where $n$ is the number of denoising steps. We demonstrate the efficacy and efficiency of AccuQuant across various tasks and diffusion models on standard benchmarks.
zh
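
"按多步去噪对齐全精度与量化模型"的校准目标可示意如下(编辑补充;去噪器与更新规则均为占位,非官方实现):

```python
import torch
import torch.nn.functional as F

def rollout(denoiser, x, steps):
    """用占位的欧拉式更新模拟 steps 步去噪。"""
    for _ in range(steps):
        x = x - 0.1 * denoiser(x)
    return x

def accuquant_style_loss(fp_model, q_model, x_t, k=4):
    """逐步独立对齐会忽略误差累积;这里在 k 步之后的输出上对齐。"""
    with torch.no_grad():
        ref = rollout(fp_model, x_t, k)   # 全精度多步结果作参考
    out = rollout(q_model, x_t, k)        # 量化模型同样走 k 步
    return F.mse_loss(out, ref)

fp = torch.nn.Conv2d(3, 3, 3, padding=1)
q = torch.nn.Conv2d(3, 3, 3, padding=1)   # 占位:实际应为量化版本
x = torch.randn(2, 3, 16, 16)
print(accuquant_style_loss(fp, q, x))
```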

[CV-43] Dino-Diffusion Modular Designs Bridge the Cross-Domain Gap in Autonomous Parking

【速读】:该论文旨在解决自动驾驶泊车系统在域偏移(如天气和光照变化)下鲁棒性不足的问题,尤其针对现有端到端(End-to-End, E2E)方法在分布外(Out-of-Distribution, OOD)场景中性能显著下降的挑战。其解决方案的关键在于提出了一种无需额外数据的领域无关(Domain-Agnostic)自主泊车流水线——Dino-Diffusion Parking (DDP),该方案融合视觉基础模型与基于扩散模型的规划机制,实现了泛化感知与鲁棒运动规划的协同优化。通过在CARLA仿真环境中训练并在零样本迁移至更具挑战性的场景中测试,DDP在所有OOD场景下均保持超过90%的泊车成功率,验证了其跨域适应能力。

链接: https://arxiv.org/abs/2510.20335
作者: Zixuan Wu,Hengyuan Zhang,Ting-Hsuan Chen,Yuliang Guo,David Paz,Xinyu Huang,Liu Ren
机构: Bosch Research North America; Bosch Center for AI (BCAI); Institute for Robotics and Intelligent Machines (IRIM), Georgia Institute of Technology; Department of Computer Science, University of Southern California
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Code is at this https URL

点击查看摘要

Abstract:Parking is a critical pillar of driving safety. While recent end-to-end (E2E) approaches have achieved promising in-domain results, robustness under domain shifts (e.g., weather and lighting changes) remains a key challenge. Rather than relying on additional data, in this paper, we propose Dino-Diffusion Parking (DDP), a domain-agnostic autonomous parking pipeline that integrates visual foundation models with diffusion-based planning to enable generalized perception and robust motion planning under distribution shifts. We train our pipeline in CARLA under regular settings and transfer it to more adversarial settings in a zero-shot fashion. Our model consistently achieves a parking success rate above 90% across all tested out-of-distribution (OOD) scenarios, with ablation studies confirming that both the network architecture and algorithmic design significantly enhance cross-domain performance over existing baselines. Furthermore, testing in a 3D Gaussian splatting (3DGS) environment reconstructed from a real-world parking lot demonstrates promising sim-to-real transfer.
zh

[CV-44] AnyPcc: Compressing Any Point Cloud with a Single Universal Model

【速读】:该论文旨在解决深度学习驱动的点云几何压缩中泛化能力不足的问题,其根源在于缺乏鲁棒的上下文建模能力和对分布外(out-of-distribution, OOD)数据处理效率低下。解决方案的关键在于提出一个通用点云压缩框架AnyPcc:首先设计了通用上下文模型(Universal Context Model),通过空间和通道维度的分组先验来捕获鲁棒的上下文依赖关系;其次引入实例自适应微调(Instance-Adaptive Fine-Tuning, IAFT)策略,通过为每个实例微调少量网络参数并将其嵌入比特流中,显著降低几何压缩的比特开销,从而有效应对OOD数据挑战。

链接: https://arxiv.org/abs/2510.20331
作者: Kangli Wang,Qianxi Yi,Yuqi Ye,Shihao Li,Wei Gao
机构: Peking University (北京大学); Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures

点击查看摘要

Abstract:Generalization remains a critical challenge for deep learning-based point cloud geometry compression. We argue this stems from two key limitations: the lack of robust context models and the inefficient handling of out-of-distribution (OOD) data. To address both, we introduce AnyPcc, a universal point cloud compression framework. AnyPcc first employs a Universal Context Model that leverages priors from both spatial and channel-wise grouping to capture robust contextual dependencies. Second, our novel Instance-Adaptive Fine-Tuning (IAFT) strategy tackles OOD data by synergizing explicit and implicit compression paradigms. It fine-tunes a small subset of network weights for each instance and incorporates them into the bitstream, where the marginal bit cost of the weights is dwarfed by the resulting savings in geometry compression. Extensive experiments on a benchmark of 15 diverse datasets confirm that AnyPcc sets a new state-of-the-art in point cloud compression. Our code and datasets will be released to encourage reproducible research.
zh

[CV-45] HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models NEURIPS2025

【速读】:该论文旨在解决多模态大语言模型(Multi-modal Large Language Models, MLLMs)在训练过程中因视觉编码器(如CLIP和SAM)与文本在多粒度层次上缺乏对齐而导致的计算资源消耗过高问题。其解决方案的关键在于引入双曲空间(hyperbolic space),利用其天然建模层次结构的能力,构建一个可有效弥合视觉与文本模态间粒度差距的框架。具体而言,作者提出了一种名为HyperET的高效训练范式,通过在双曲空间中动态调整半径来优化视觉表示,使其在任意粒度层级上与文本表示对齐;同时采用基于Möbius乘法运算的可学习矩阵(包括对角缩放矩阵、分块对角矩阵和带状矩阵)实现灵活且高效的参数化策略,从而在仅增加不到1%额外参数的情况下显著提升MLLMs在预训练和微调阶段的性能。

链接: https://arxiv.org/abs/2510.20322
作者: Zelin Peng,Zhengqin Xu,Qingyang Liu,Xiaokang Yang,Wei Shen
机构: MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, SJTU (上海交通大学); State Key Laboratory of Infrared Physics, Shanghai Institute of Technical Physics, CAS (中国科学院上海技术物理研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS2025

点击查看摘要

Abstract:Multi-modal large language models (MLLMs) have emerged as a transformative approach for aligning visual and textual understanding. They typically require extremely high computational resources (e.g., thousands of GPUs) for training to achieve cross-modal alignment at multi-granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they are widely equipped with, e.g., CLIP and SAM, which lack alignment with language at multi-granularity levels. To address this issue, in this paper, we leverage hyperbolic space, which inherently models hierarchical levels and thus provides a principled framework for bridging the granularity gap between visual and textual modalities at an arbitrary granularity level. Concretely, we propose an efficient training paradigm for MLLMs, dubbed HyperET, which can optimize visual representations to align with their textual counterparts at an arbitrary granularity level through dynamic hyperbolic radius adjustment in hyperbolic space. HyperET employs learnable matrices with Möbius multiplication operations, implemented via three effective configurations: diagonal scaling matrices, block-diagonal matrices, and banded matrices, providing a flexible yet efficient parametrization strategy. Comprehensive experiments across multiple MLLM benchmarks demonstrate that HyperET consistently improves existing MLLMs in both pre-training and fine-tuning, with less than 1% additional parameters.
zh
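
双曲空间中的对数/指数映射与"可学习对角缩放的 Möbius 运算"可示意如下(编辑补充;取曲率 c=1 的 Poincaré 球,对角缩放系数为假设,非官方实现):

```python
import torch

EPS = 1e-6

def log0(x):
    """Poincaré 球原点处的对数映射(c=1):球内点 -> 切空间。"""
    n = x.norm(dim=-1, keepdim=True).clamp(EPS, 1 - 1e-5)
    return torch.atanh(n) * x / n

def exp0(v):
    """原点处的指数映射:切空间 -> 球内。"""
    n = v.norm(dim=-1, keepdim=True).clamp_min(EPS)
    return torch.tanh(n) * v / n

def mobius_diag_scale(x, diag):
    """可学习对角矩阵的 Möbius 乘法:先 log、逐维缩放、再 exp。"""
    return exp0(log0(x) * diag)

x = torch.rand(4, 8) * 0.3                      # 球内点
diag = torch.nn.Parameter(torch.ones(8) * 1.5)  # 可学习的半径调节
print(mobius_diag_scale(x, diag).norm(dim=-1))  # 半径被放大但仍 < 1
```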

[CV-46] A Parameter-Efficient Mixture-of-Experts Framework for Cross-Modal Geo-Localization

【速读】:该论文旨在解决跨模态无人机导航任务中面临的两大挑战:一是多平台异构性(multi-platform heterogeneity),即卫星、无人机和地面图像在视角、分辨率和成像条件上的显著差异;二是训练描述与测试查询之间的领域差距(domain gap),导致模型在面对特定平台的自然语言查询时性能下降。解决方案的关键在于构建一个领域对齐的预处理流程与基于专家混合(Mixture-of-Experts, MoE)的架构:首先通过平台划分、卫星数据增强及方向词去除来缓解平台间差异;其次利用大语言模型(LLM)驱动的文本精炼流水线,使文本语义与各平台视觉特征对齐;最终采用BGE-M3(文本)与EVA-CLIP(图像)作为基础模型,训练三个平台专属专家,并结合两阶段硬负样本挖掘策略提升判别能力,在推理阶段融合各专家得分实现鲁棒的跨模态地理定位。

链接: https://arxiv.org/abs/2510.20291
作者: LinFeng Li,Jian Zhao,Zepeng Yang,Yuhang Song,Bojun Lin,Tianle Zhang,Yuchen Yuan,Chi Zhang,Xuelong Li
机构: The Institute of Artificial Intelligence (TeleAI), China Telecom; East China Normal University; National Tsing Hua University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present a winning solution to RoboSense 2025 Track 4: Cross-Modal Drone Navigation. The task retrieves the most relevant geo-referenced image from a large multi-platform corpus (satellite/drone/ground) given a natural-language query. Two obstacles are severe inter-platform heterogeneity and a domain gap between generic training descriptions and platform-specific test queries. We mitigate these with a domain-aligned preprocessing pipeline and a Mixture-of-Experts (MoE) framework: (i) platform-wise partitioning, satellite augmentation, and removal of orientation words; (ii) an LLM-based caption refinement pipeline to align textual semantics with the distinct visual characteristics of each platform. Using BGE-M3 (text) and EVA-CLIP (image), we train three platform experts using a progressive two-stage, hard-negative mining strategy to enhance discriminative power, and fuse their scores at inference. The system tops the official leaderboard, demonstrating robust cross-modal geo-localization under heterogeneous viewpoints.
zh

[CV-47] Breakdance Video classification in the age of Generative AI

【速读】:该论文旨在解决当前大型视觉语言模型(Large Vision Language Models)在小众但极具影响力的舞蹈体育项目——霹雳舞(breakdance)中的适用性问题。现有研究多集中于足球、板球、篮球等主流体育项目,且主要关注生成式任务(如视觉问答、精彩片段生成),而对霹雳舞这类细分领域的视频理解任务缺乏系统探索。解决方案的关键在于:首先,通过实证分析表明视频编码器模型(Video Encoder models)在预测类任务中仍优于最先进的视频语言模型(Video Language Models);其次,提出了一套针对编码器模型的选择标准,并深入剖析了微调后的解码器模型(finetuned decoder model)在霹雳舞视频分类任务中的工作机制,从而为该领域提供了可复用的技术路径与实践指导。

链接: https://arxiv.org/abs/2510.20287
作者: Sauptik Dhar,Naveen Ramakrishnan,Michelle Munson
机构: Eluvio AI Labs(Eluvio人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages

点击查看摘要

Abstract:Large Vision Language models have seen huge application in several sports use-cases recently. Most of these works have targeted a limited subset of popular sports such as soccer, cricket, and basketball, focusing on generative tasks like visual question answering and highlight generation. This work analyzes the applicability of modern video foundation models (both encoder and decoder) for a niche but hugely popular dance sport: breakdance. Our results show that Video Encoder models continue to outperform state-of-the-art Video Language Models for prediction tasks. We provide insights on how to choose the encoder model and a thorough analysis of the workings of a finetuned decoder model for breakdance video classification.
zh

[CV-48] UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning

【速读】:该论文旨在解决GUI grounding任务中因指令多样性与质量不足导致的性能瓶颈问题,即现有方法将自然语言指令视为静态代理,忽略了其对模型推理路径的影响。解决方案的关键在于提出“Instruction-as-Reasoning”范式,将指令视为动态的分析路径(analytical pathways),使模型在推理过程中能够选择最优路径以提升接地准确性。具体实现上采用两阶段训练框架:首先通过监督微调(SFT)在合成的多样化指令上训练模型获得多视角推理能力,再通过强化学习(RL)优化路径选择与组合策略,从而实现推理过程中的自适应路径合成与选择,显著提升模型在多个基准上的表现及代理能力。

链接: https://arxiv.org/abs/2510.20286
作者: Liangyu Chen,Hanzhang Zhou,Chenglin Cai,Jianan Zhang,Panrong Tong,Quyu Kong,Xu Zhang,Chen Liu,Yuqi Liu,Wenxuan Wang,Yue Wang,Qin Jin,Steven Hoi
机构: Renmin University of China (中国人民大学); Tongyi Lab, Alibaba Group (通义实验室,阿里巴巴集团); CUHK (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3% flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields a substantial relative performance improvement of up to 76%. In this paper, we introduce the Instruction-as-Reasoning paradigm, treating instructions as dynamic analytical pathways that offer distinct perspectives and enable the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition. Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging grounding benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy, scoring 87.3% on UI-I2E-Bench, 57.0% on ScreenSpot-Pro, and 84.9% on MMBench-GUI L2. Furthermore, our model demonstrates strong agentic potential, achieving a 74.1% success rate on AndroidWorld using UI-Ins-7B as the executor. Our in-depth analysis reveals additional insights such as how reasoning can be formulated to enhance rather than hinder grounding performance, and how our method mitigates policy collapse in the SFT+RL framework. All code and model checkpoints will be publicly released in this https URL.
zh

[CV-49] DMC3: Dual-Modal Counterfactual Contrastive Construction for Egocentric Video Question Answering

【速读】:该论文旨在解决第一人称视角视频问答(Egocentric Video Question Answering, Egocentric VideoQA)中因视角特殊性带来的挑战,例如多事件理解与手-物交互识别等问题。现有基于预训练和微调的方法未能充分考虑这些特性,导致性能受限。其解决方案的关键在于提出一种双模态反事实对比构建框架(Dual-Modal Counterfactual Contrastive Construction, DMC³),通过两个核心模块实现:一是设计反事实样本构造模块,分别利用事件描述改写(event description paraphrasing)和核心交互挖掘(core interaction mining)生成文本和视觉模态的正负样本;二是引入涉及反事实样本的对比优化模块,通过对比损失最小化原始样本与正样本特征距离、最大化与负样本距离,从而增强模型对第一人称视频中关键语义信息的判别能力。该方法在EgoTaskQA和QAEGO4D数据集上均达到当前最优性能。

链接: https://arxiv.org/abs/2510.20285
作者: Jiayi Zou,Chaofan Chen,Bing-Kun Bao,Changsheng Xu
机构: Nanjing University of Posts and Telecommunications (南京邮电大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Egocentric Video Question Answering (Egocentric VideoQA), which refers to answering questions based on first-person videos, plays an important role in egocentric video understanding. Although existing methods have made progress through the paradigm of pre-training and fine-tuning, they ignore the unique challenges posed by the first-person perspective, such as understanding multiple events and recognizing hand-object interactions. To deal with these challenges, we propose a Dual-Modal Counterfactual Contrastive Construction (DMC³) framework, which contains an egocentric VideoQA baseline, a counterfactual sample construction module, and a counterfactual sample-involved contrastive optimization. Specifically, we first develop a counterfactual sample construction module to generate positive and negative samples for textual and visual modalities through event description paraphrasing and core interaction mining, respectively. Then, we feed these samples together with the original samples into the baseline. Finally, in the counterfactual sample-involved contrastive optimization module, we apply contrastive loss to minimize the distance between the original sample features and the positive sample features, while maximizing the distance from the negative samples. Experiments show that our method achieves 52.51% and 46.04% on the normal and indirect splits of EgoTaskQA, and 13.2% on QAEGO4D, all reaching state-of-the-art performance.
zh

[CV-50] Knowledge-Informed Neural Network for Complex-Valued SAR Image Recognition

【速读】:该论文旨在解决复杂值合成孔径雷达(Complex-valued Synthetic Aperture Radar, CV-SAR)图像识别中因数据有限和域偏移场景所导致的表示三难困境(representation trilemma)——即在泛化能力、可解释性和效率之间难以同时优化的问题。其解决方案的关键在于提出一种知识引导的神经网络(Knowledge-Informed Neural Network, KINN),该框架采用“压缩-聚合-压缩”的新颖架构:第一阶段通过物理引导压缩,利用新型字典处理器自适应嵌入物理先验,以紧凑的展开网络高效提取稀疏且物理基础的特征;第二阶段通过聚合模块增强表征;第三阶段则借助轻量级分类头与自蒸馏机制进行语义压缩,学习最具任务相关性的判别性嵌入。此设计使KINN在参数效率、跨域泛化能力和可解释性方面均达到最优平衡,为SAR图像分析中的可信人工智能提供了新路径。

链接: https://arxiv.org/abs/2510.20284
作者: Haodong Yang,Zhongling Huang,Shaojie Guo,Zhe Zhang,Gong Cheng,Junwei Han
机构: Northwestern Polytechnical University (西北工业大学); Shenzhen Research Institute of Northwestern Polytechnical University (西北工业大学深圳研究院); Chongqing University of Posts and Telecommunications (重庆邮电大学); Aerospace Information Technology University (航空航天信息科技学院); Suzhou Aerospace Information Research Institute (苏州航空航天信息研究所); National Key Laboratory of Microwave Imaging (微波成像国家重点实验室); Aerospace Information Research Institute, CAS (中国科学院空天信息研究院); School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences (中国科学院大学电子、电气与通信工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Deep learning models for complex-valued Synthetic Aperture Radar (CV-SAR) image recognition are fundamentally constrained by a representation trilemma under data-limited and domain-shift scenarios: the concurrent, yet conflicting, optimization of generalization, interpretability, and efficiency. Our work is motivated by the premise that the rich electromagnetic scattering features inherent in CV-SAR data hold the key to resolving this trilemma, yet they are insufficiently harnessed by conventional data-driven models. To this end, we introduce the Knowledge-Informed Neural Network (KINN), a lightweight framework built upon a novel “compression-aggregation-compression” architecture. The first stage performs a physics-guided compression, wherein a novel dictionary processor adaptively embeds physical priors, enabling a compact unfolding network to efficiently extract sparse, physically-grounded signatures. A subsequent aggregation module enriches these representations, followed by a final semantic compression stage that utilizes a compact classification head with self-distillation to learn maximally task-relevant and discriminative embeddings. We instantiate KINN in both CNN (0.7M) and Vision Transformer (0.95M) variants. Extensive evaluations on five SAR benchmarks confirm that KINN establishes a state-of-the-art in parameter-efficient recognition, offering exceptional generalization in data-scarce and out-of-distribution scenarios and tangible interpretability, thereby providing an effective solution to the representation trilemma and offering a new path for trustworthy AI in SAR image analysis.

[CV-51] Causal Debiasing for Visual Commonsense Reasoning

【速读】:该论文旨在解决视觉常识推理(Visual Commonsense Reasoning, VCR)任务中模型因数据集偏倚而导致的泛化能力不足问题。现有方法虽在预测准确率上表现优异,但往往忽视了文本与视觉数据中存在的共现偏倚(co-occurrence bias)和统计偏倚(statistical bias)。其解决方案的关键在于:首先构建VCR-OOD数据集(包括VCR-OOD-QA和VCR-OOD-VA子集)以评估模型跨模态的泛化性能;其次通过分析因果图并识别预测捷径(prediction shortcuts),采用后门调整(backdoor adjustment)策略进行去偏处理,具体实现方式是基于正确答案集合构建字典以消除模型对特定特征的依赖,从而提升模型的鲁棒性与公平性。

链接: https://arxiv.org/abs/2510.20281
作者: Jiayi Zou,Gengyun Jia,Bing-Kun Bao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

Abstract:Visual Commonsense Reasoning (VCR) refers to answering questions and providing explanations based on images. While existing methods achieve high prediction accuracy, they often overlook bias in datasets and lack debiasing strategies. In this paper, our analysis reveals co-occurrence and statistical biases in both textual and visual data. We introduce the VCR-OOD datasets, comprising VCR-OOD-QA and VCR-OOD-VA subsets, which are designed to evaluate the generalization capabilities of models across two modalities. Furthermore, we analyze the causal graphs and prediction shortcuts in VCR and adopt a backdoor adjustment method to remove bias. Specifically, we create a dictionary based on the set of correct answers to eliminate prediction shortcuts. Experiments demonstrate the effectiveness of our debiasing method across different datasets.
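
摘要中的后门调整(backdoor adjustment)可用如下标准公式示意,其中 Z 为混淆变量;该文以正确答案集合构建的字典近似其取值空间(此对应关系是依据摘要所作的理解性标注):

```latex
P\big(Y \mid \mathrm{do}(X)\big) \;=\; \sum_{z} P\big(Y \mid X,\, Z=z\big)\, P\big(Z=z\big)
```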

[CV-52] GMFVAD: Using Grained Multi-modal Feature to Improve Video Anomaly Detection

【速读】:该论文旨在解决视频异常检测(Video Anomaly Detection, VAD)中因视频片段内存在大量冗余视觉信息而导致特征表达不充分的问题。现有方法虽尝试引入文本特征等多模态信息,但通常以粗粒度方式融合,未能有效利用多模态间的差异性来优化特征提取。其解决方案的关键在于提出一种细粒度多模态特征(Grained Multi-modal Feature, GMFVAD)机制:首先基于视频片段生成更精细的多模态特征以概括主要内容,再结合原始视频字幕生成的文本特征,增强关键区域的视觉特征表示,从而降低冗余并提升检测性能。实验表明,该方法在四个主流数据集上达到当前最优效果,且消融实验证实性能提升源于冗余信息的有效减少。

链接: https://arxiv.org/abs/2510.20268
作者: Guangyu Dai,Dong Chen,Siliang Tang,Yueting Zhuang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

Abstract:Video anomaly detection (VAD) is a challenging task that detects anomalous frames in continuous surveillance videos. Most previous work utilizes the spatio-temporal correlation of visual features to distinguish whether there are abnormalities in video snippets. Recently, some works attempt to introduce multi-modal information, like text features, to enhance the results of video anomaly detection. However, these works merely incorporate text features into video snippets in a coarse manner, overlooking the significant amount of redundant information that may exist within the video snippets. Therefore, we propose to leverage the diversity among multi-modal information to further refine the extracted features and reduce the redundancy in visual features, and we propose Grained Multi-modal Feature for Video Anomaly Detection (GMFVAD). Specifically, we generate a finer-grained multi-modal feature based on the video snippet, which summarizes its main content, and text features based on the captions of the original video are introduced to further enhance the visual features of highlighted portions. Experiments show that the proposed GMFVAD achieves state-of-the-art performance on four mainstream datasets. Ablation experiments also validate that the improvement of GMFVAD is due to the reduction of redundant information.

[CV-53] Real-Time Currency Detection and Voice Feedback for Visually Impaired Individuals

【速读】:该论文旨在解决视障人士在日常生活中难以独立识别货币的问题,尤其是面对不同面额和类型的纸币与硬币时。解决方案的关键在于构建一个基于智能手机的实时货币检测系统,采用YOLOv8 nano模型并引入自定义检测头(包含深度卷积层和Squeeze-and-Excitation模块),以提升特征提取能力和检测精度;该系统在包含30类货币样本的数据集上实现了97.73%的准确率、95.23%的召回率和95.85%的F1分数,结合语音反馈机制,有效提升了视障用户的自主性与实用性。

链接: https://arxiv.org/abs/2510.20267
作者: Saraf Anzum Shreya,MD. Abu Ismail Siddique,Sharaf Tasnim
机构: Rajshahi University of Engineering and Technology (拉杰沙希工程与技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 5 tables, 8 figures

Abstract:Technologies like smartphones have become essential in our daily lives and accessible to everyone, including visually impaired individuals. With smartphone cameras, image capture and processing have become more convenient, and combining smartphones with machine learning can make the lives of the visually impaired a little easier. Daily tasks such as handling money without relying on someone else can be troublesome for them. For that purpose, this paper presents a real-time currency detection system designed to assist visually impaired individuals. The proposed model is trained on a dataset containing 30 classes of notes and coins, representing 3 types of currency: US dollar (USD), Euro (EUR), and Bangladeshi taka (BDT). Our approach uses a YOLOv8 nano model with a custom detection head featuring deep convolutional layers and Squeeze-and-Excitation blocks to enhance feature extraction and detection accuracy. Our model achieves an accuracy of 97.73%, recall of 95.23%, F1-score of 95.85%, and a mean Average Precision at IoU=0.5 (mAP50(B)) of 97.21%. Voice feedback after detection helps the visually impaired identify the currency. This paper aims to create a practical and efficient currency detection system that empowers visually impaired individuals to handle money independently.
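
摘要提到检测头中引入了 Squeeze-and-Excitation(SE)模块。下面是标准 SE 块的最小 PyTorch 实现,供参考(通道数与压缩比为示例取值,并非论文的确切配置):

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation:全局平均池化后经两层全连接得到通道注意力权重。"""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                           # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                      # squeeze:压成 (B, C)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)  # excitation:(B, C, 1, 1)
        return x * w                                # 按通道重新加权
```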

[CV-54] Kinaema: a recurrent sequence model for memory and pose in motion

【速读】:该论文旨在解决空间感知机器人在连续操作场景中如何利用先前观测信息来优化效率的问题,特别是实现对已知环境的相对定位能力。其核心挑战在于如何在不显式存储历史观测数据的前提下,持续整合视觉流并准确预测当前视角与查询图像之间的相对位置关系。解决方案的关键在于提出一种名为 Kinaema 的新模型,该模型通过递归更新的 Transformer 架构维护一个隐式的潜在记忆(latent memory),将传感器读数的历史压缩为紧凑表征,从而避免了传统注意力机制对上下文长度的硬性限制,并在下游任务“Mem-Nav”中展现出良好的场景表征能力和计算效率。

链接: https://arxiv.org/abs/2510.20261
作者: Mert Bulent Sariyildiz,Philippe Weinzaepfel,Guillaume Bono,Gianluca Monaci,Christian Wolf
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages + references + checklist + appendix, 29 pages total

Abstract:One key aspect of spatially aware robots is the ability to “find their bearings”, i.e., to correctly situate themselves in previously seen spaces. In this work, we focus on this particular scenario of continuous robotics operations, where information observed before an actual episode start is exploited to optimize efficiency. We introduce a new model, Kinaema, and an agent capable of integrating a stream of visual observations while moving in a potentially large scene and, upon request, processing a query image and predicting the relative position of the shown space with respect to its current position. Our model does not explicitly store an observation history and therefore has no hard constraints on context length. It maintains an implicit latent memory, which is updated by a transformer in a recurrent way, compressing the history of sensor readings into a compact representation. We evaluate the impact of this model in a new downstream task we call “Mem-Nav”. We show that our large-capacity recurrent model maintains a useful representation of the scene, navigates to goals observed before the actual episode start, and is computationally efficient, in particular compared to classical transformers with attention over an observation history.

[CV-55] Seeing the Unseen: Mask-Driven Positional Encoding and Strip-Convolution Context Modeling for Cross-View Object Geo-Localization

【速读】:该论文旨在解决现有跨视图目标地理定位方法在关键点位置编码中仅捕捉二维坐标而忽略目标形状信息的问题,导致对标注偏移敏感且跨视图匹配能力受限。其解决方案的关键在于提出一种基于掩码的位置编码(Mask-based Positional Encoding, MPE),通过利用分割掩码同时建模空间坐标与目标轮廓,使模型从“位置感知”升级为“目标感知”;此外,针对卫星图像中长跨度目标(如延展建筑)的特征区分困难问题,设计了上下文增强模块(Context Enhancement Module, CEM),采用水平和垂直条带卷积核提取长程上下文特征,提升条状物体间的特征判别力。二者结合形成端到端的EDGeo框架,在两个公开数据集上实现当前最优性能,显著提升了复杂地面到卫星场景下的定位精度。

链接: https://arxiv.org/abs/2510.20247
作者: Shuhan Hu,Yiru Li,Yuanyuan Li,Yingying Zhu
机构: Shenzhen University (深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Cross-view object geo-localization enables high-precision object localization through cross-view matching, with critical applications in autonomous driving, urban management, and disaster response. However, existing methods rely on keypoint-based positional encoding, which captures only 2D coordinates while neglecting object shape information, resulting in sensitivity to annotation shifts and limited cross-view matching capability. To address these limitations, we propose a mask-based positional encoding scheme that leverages segmentation masks to capture both spatial coordinates and object silhouettes, thereby upgrading the model from “location-aware” to “object-aware.” Furthermore, to tackle the challenge of large-span objects (e.g., elongated buildings) in satellite imagery, we design a context enhancement module. This module employs horizontal and vertical strip convolutional kernels to extract long-range contextual features, enhancing feature discrimination among strip-like objects. Integrating MPE and CEM, we present EDGeo, an end-to-end framework for robust cross-view object geo-localization. Extensive experiments on two public datasets (CVOGL and VIGOR-Building) demonstrate that our method achieves state-of-the-art performance, with a 3.39% improvement in localization accuracy under challenging ground-to-satellite scenarios. This work provides a robust positional encoding paradigm and a contextual modeling framework for advancing cross-view geo-localization research.
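
上下文增强模块(CEM)的核心是水平与垂直条带卷积。以下是按该思路写的最小示意(核长 k 与残差式融合为假设设定,并非论文的原始结构):

```python
import torch
import torch.nn as nn

class StripContextBlock(nn.Module):
    """用 1×k 与 k×1 条带卷积分别提取水平/垂直长程上下文,再与输入融合。"""
    def __init__(self, channels, k=11):
        super().__init__()
        self.h_conv = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2))
        self.v_conv = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0))
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        ctx = torch.cat([self.h_conv(x), self.v_conv(x)], dim=1)
        return x + self.fuse(ctx)  # 残差式融合长程上下文
```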

[CV-56] Empower Words: DualGround for Structured Phrase and Sentence-Level Temporal Grounding NEURIPS2025

【速读】:该论文旨在解决视频时间定位(Video Temporal Grounding, VTG)任务中因统一处理文本token而导致细粒度时空对齐能力不足的问题。现有方法通常将所有文本token在跨模态注意力机制中同等对待,忽视了其语义角色差异,导致模型过度依赖[EOS]标记驱动的全局语义信息,而未能有效利用词级信号进行精准定位。解决方案的关键在于提出DualGround架构,通过双分支设计显式分离全局与局部语义:将[EOS] token引导至句子级路径,同时将词token聚类为短语级单元用于局部定位;并引入基于token角色感知的跨模态交互策略和联合建模框架,结构化解耦地对齐视频特征与句子级及短语级语义,从而提升模型在粗粒度和细粒度层面的上下文感知能力。

链接: https://arxiv.org/abs/2510.20244
作者: Minseok Kang,Minhyeok Lee,Minjung Kim,Donghyeong Kim,Sangyoun Lee
机构: Yonsei University (延世大学); LG Electronics (LG电子)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 28 pages, including appendix, 5 figures. Full version of the NeurIPS 2025 paper

Abstract:Video Temporal Grounding (VTG) aims to localize temporal segments in long, untrimmed videos that align with a given natural language query. This task typically comprises two subtasks: Moment Retrieval (MR) and Highlight Detection (HD). While recent progress has been driven by powerful pretrained vision-language models such as CLIP and InternVideo2, existing approaches commonly treat all text tokens uniformly during cross-modal attention, disregarding their distinct semantic roles. To validate the limitations of this approach, we conduct controlled experiments demonstrating that VTG models overly rely on [EOS]-driven global semantics while failing to effectively utilize word-level signals, which limits their ability to achieve fine-grained temporal alignment. Motivated by this limitation, we propose DualGround, a dual-branch architecture that explicitly separates global and local semantics by routing the [EOS] token through a sentence-level path and clustering word tokens into phrase-level units for localized grounding. Our method introduces (1) token-role-aware cross-modal interaction strategies that align video features with sentence-level and phrase-level semantics in a structurally disentangled manner, and (2) a joint modeling framework that not only improves global sentence-level alignment but also enhances fine-grained temporal grounding by leveraging structured phrase-aware context. This design allows the model to capture both coarse and localized semantics, enabling more expressive and context-aware video grounding. DualGround achieves state-of-the-art performance on both Moment Retrieval and Highlight Detection tasks across the QVHighlights and Charades-STA benchmarks, demonstrating the effectiveness of disentangled semantic modeling in video-language alignment.

[CV-57] COS3D: Collaborative Open-Vocabulary 3D Segmentation NEURIPS2025

【速读】:该论文旨在解决开放词汇3D分割(open-vocabulary 3D segmentation)任务中现有基于高斯溅射(Gaussian-splatting-based)方法的局限性,即要么依赖单一3D语言场导致分割性能不佳,要么使用预计算的类无关分割导致误差累积。其解决方案的关键在于提出一种协同提示-分割框架COS3D,通过引入“协同场”(collaborative field)概念,包含实例场(instance field)和语言场(language field),并在训练阶段设计实例到语言的特征映射与两阶段训练策略以建模两者间的内在关联,在推理阶段进一步设计自适应的语言到实例提示精化机制,从而有效融合语言与分割线索,实现高质量的开放词汇3D分割。

链接: https://arxiv.org/abs/2510.20238
作者: Runsong Zhu,Ka-Hei Hui,Zhengzhe Liu,Qianyi Wu,Weiliang Tang,Shi Qiu,Pheng-Ann Heng,Chi-Wing Fu
机构: The Chinese University of Hong Kong (香港中文大学); Autodesk AI Lab (Autodesk人工智能实验室); Lingnan University (岭南大学); Monash University (蒙纳士大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025. The code is publicly available at this https URL

Abstract:Open-vocabulary 3D segmentation is a fundamental yet challenging task, requiring a mutual understanding of both segmentation and language. However, existing Gaussian-splatting-based methods rely either on a single 3D language field, leading to inferior segmentation, or on pre-computed class-agnostic segmentations, suffering from error accumulation. To address these limitations, we present COS3D, a new collaborative prompt-segmentation framework that effectively integrates complementary language and segmentation cues throughout its entire pipeline. We first introduce the new concept of the collaborative field, comprising an instance field and a language field, as the cornerstone for collaboration. During training, to effectively construct the collaborative field, our key idea is to capture the intrinsic relationship between the instance field and language field, through a novel instance-to-language feature mapping and an efficient two-stage training strategy. During inference, to bridge the distinct characteristics of the two fields, we further design an adaptive language-to-instance prompt refinement, promoting high-quality prompt-segmentation inference. Extensive experiments not only demonstrate COS3D’s leading performance over existing methods on two widely-used benchmarks but also show its high potential for various applications, i.e., novel image-based 3D segmentation, hierarchical segmentation, and robotics. The code is publicly available at this https URL.

[CV-58] EditInfinity: Image Editing with Binary-Quantized Generative Models NEURIPS2025

【速读】:该论文旨在解决基于扩散模型(diffusion models)进行文本驱动图像编辑时,因图像反演(image inversion)过程中引入的近似误差导致编辑性能受限的问题。现有方法依赖于扩散模型在中间生成步骤中缺乏精确监督,从而难以实现高保真度和语义准确性的图像编辑。解决方案的关键在于采用基于矢量量化(VQ-based)的生成模型,利用其可获取源图像精确中间离散表示(quantized representations)的特性,实现更有效的监督以提升图像反演精度。具体而言,作者提出EditInfinity框架,通过融合文本提示校正与图像风格保持的高效反演机制,并设计整体平滑策略,在保证源图像高保真度的同时实现与文本提示的精准语义对齐,显著优于当前主流扩散基基准模型。

链接: https://arxiv.org/abs/2510.20217
作者: Jiahuan Wang,Yuxin Chen,Jun Yu,Guangming Lu,Wenjie Pei
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 13 figures, accepted by The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)

Abstract:Adapting pretrained diffusion-based generative models for text-driven image editing with negligible tuning overhead has demonstrated remarkable potential. A classical adaptation paradigm, as followed by these methods, first infers the generative trajectory inversely for a given source image by image inversion, then performs image editing along the inferred trajectory guided by the target text prompts. However, the performance of image editing is heavily limited by the approximation errors introduced during image inversion by diffusion models, which arise from the absence of exact supervision in the intermediate generative steps. To circumvent this issue, we investigate the parameter-efficient adaptation of VQ-based generative models for image editing, and leverage their inherent characteristic that the exact intermediate quantized representations of a source image are attainable, enabling more effective supervision for precise image inversion. Specifically, we propose EditInfinity, which adapts Infinity, a binary-quantized generative model, for image editing. We propose an efficient yet effective image inversion mechanism that integrates text prompting rectification and image style preservation, enabling precise image inversion. Furthermore, we devise a holistic smoothing strategy which allows our EditInfinity to perform image editing with high fidelity to source images and precise semantic alignment to the text prompts. Extensive experiments on the PIE-Bench benchmark across “add”, “change”, and “delete” editing operations demonstrate the superior performance of our model compared to state-of-the-art diffusion-based baselines. Code available at: this https URL.

[CV-59] owards Objective Obstetric Ultrasound Assessment: Contrastive Representation Learning for Fetal Movement Detection ALT

【速读】:该论文旨在解决胎儿运动(Fetal Movement, FM)检测中传统方法如孕妇主观感知和胎心监护(Cardiotocography, CTG)存在的主观性强、准确性低的问题。为提升FM检测的客观性与可靠性,作者提出了一种名为对比超声视频表示学习(Contrastive Ultrasound Video Representation Learning, CURL)的自监督学习框架,其关键在于引入双对比损失机制,融合空间与时间维度的对比学习以提取鲁棒的运动表征,并设计任务特定的采样策略,在自监督训练中有效分离运动与非运动片段,同时通过概率微调方法实现对任意长度超声视频的灵活推理。

链接: https://arxiv.org/abs/2510.20214
作者: Talha Ilyas,Duong Nhu,Allison Thomas,Arie Levin,Lim Wei Yap,Shu Gong,David Vera Anaya,Yiwen Jiang,Deval Mehta,Ritesh Warty,Vinayak Smith,Maya Reddy,Euan Wallace,Wenlong Cheng,Zongyuan Ge,Faezeh Marzbanrad
机构: Monash University (莫纳什大学); University of Sydney (悉尼大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This is the preprint version of the manuscript submitted to IEEE Journal of Biomedical and Health Informatics (JBHI) for review

Abstract:Accurate fetal movement (FM) detection is essential for assessing prenatal health, as abnormal movement patterns can indicate underlying complications such as placental dysfunction or fetal distress. Traditional methods, including maternal perception and cardiotocography (CTG), suffer from subjectivity and limited accuracy. To address these challenges, we propose Contrastive Ultrasound Video Representation Learning (CURL), a novel self-supervised learning framework for FM detection from extended fetal ultrasound video recordings. Our approach leverages a dual-contrastive loss, incorporating both spatial and temporal contrastive learning, to learn robust motion representations. Additionally, we introduce a task-specific sampling strategy, ensuring the effective separation of movement and non-movement segments during self-supervised training, while enabling flexible inference on arbitrarily long ultrasound recordings through a probabilistic fine-tuning approach. Evaluated on an in-house dataset of 92 subjects, each with 30-minute ultrasound sessions, CURL achieves a sensitivity of 78.01% and an AUROC of 81.60%, demonstrating its potential for reliable and objective FM analysis. These results highlight the potential of self-supervised contrastive learning for fetal movement analysis, paving the way for improved prenatal monitoring and clinical decision-making.

[CV-60] FlowCycle: Pursuing Cycle-Consistent Flows for Text-based Editing

【速读】:该论文旨在解决当前基于文本的图像编辑方法中,中间状态(intermediate state)构建缺乏目标感知性的问题。现有主流方法采用“破坏-恢复”范式,通过将源图像破坏为一个中间状态再恢复为目标图像,但其破坏过程通常是目标无关的,仅关注源图像重建,忽略了与具体编辑目标之间的语义差异,导致在目标修改较大时编辑质量受限或一致性不足。解决方案的关键在于提出FlowCycle框架,该框架无需显式图像反演,以可学习噪声参数化破坏过程,并通过双一致性约束(即从源到目标再到源的循环一致性)优化中间状态,从而学习到目标感知的中间表示,实现高保真且一致的图像编辑效果。

链接: https://arxiv.org/abs/2510.20212
作者: Yanghao Wang,Zhen Wang,Long Chen
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Recent advances in pre-trained text-to-image flow models have enabled remarkable progress in text-based image editing. Mainstream approaches always adopt a corruption-then-restoration paradigm, where the source image is first corrupted into an “intermediate state” and then restored to the target image under the prompt guidance. However, current methods construct this intermediate state in a target-agnostic manner, i.e., they primarily focus on realizing source image reconstruction while neglecting the semantic gaps towards the specific editing target. This design inherently results in limited editability or inconsistency when the desired modifications substantially deviate from the source. In this paper, we argue that the intermediate state should be target-aware, i.e., selectively corrupting editing-relevant contents while preserving editing-irrelevant ones. To this end, we propose FlowCycle, a novel inversion-free and flow-based editing framework that parameterizes corruption with learnable noises and optimizes them through a cycle-consistent process. By iteratively editing the source to the target and recovering back to the source with dual consistency constraints, FlowCycle learns to produce a target-aware intermediate state, enabling faithful modifications while preserving source consistency. Extensive ablations have demonstrated that FlowCycle achieves superior editing quality and consistency over state-of-the-art methods.
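
FlowCycle 的“源→目标→源”循环一致性约束可示意如下(edit_fn / invert_fn 为假设的编辑与回流接口;实际方法还包含可学习噪声与双重一致性约束,此处仅演示核心重建项):

```python
import torch

def cycle_consistency_loss(x_src, edit_fn, invert_fn):
    """编辑到目标后再还原,应能重建源图像。"""
    x_tgt = edit_fn(x_src)    # 源 -> 目标(假设接口)
    x_rec = invert_fn(x_tgt)  # 目标 -> 源(假设接口)
    return torch.mean((x_rec - x_src) ** 2)
```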

[CV-61] RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

【速读】:该论文旨在解决文本到视频(Text-to-Video, T2V)生成中用户提供的提示(prompt)通常短小、无结构且与训练数据分布不一致的问题,这限制了基于扩散模型的T2V系统在语义对齐性、多对象保真度和时序稳定性等方面的生成潜力。解决方案的关键在于提出一种跨阶段提示优化框架RAPO++,其核心包括三个阶段:第一阶段通过检索增强提示优化(Retrieval-Augmented Prompt Optimization, RAPO)利用关系图检索语义相关修饰符并重构提示以匹配训练分布,提升组合性和多对象一致性;第二阶段引入样本特定提示优化(Sample-Specific Prompt Optimization, SSPO),基于多源反馈(如语义对齐、空间保真度、时序连贯性和光流等任务信号)构建闭环迭代机制,逐步提升视频生成质量;第三阶段利用SSPO生成的优化提示对重写大语言模型(LLM)进行微调,内化任务特定优化模式,实现推理前高效高质量提示生成。整体方案无需修改生成主干模型,具备模型无关性、成本效益和可扩展性,显著优于现有方法。

链接: https://arxiv.org/abs/2510.20206
作者: Bingjie Gao,Qianli Ma,Xiaoxue Wu,Shuai Yang,Guanzhou Lan,Haonan Zhao,Jiaxuan Chen,Qingyang Liu,Yu Qiao,Xinyuan Chen,Yaohui Wang,Li Niu
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Prompt design plays a crucial role in text-to-video (T2V) generation, yet user-provided prompts are often short, unstructured, and misaligned with training data, limiting the generative potential of diffusion-based T2V models. We present RAPO++, a cross-stage prompt optimization framework that unifies training-data-aligned refinement, test-time iterative scaling, and large language model (LLM) fine-tuning to substantially improve T2V generation without modifying the underlying generative backbone. In Stage 1, Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with semantically relevant modifiers retrieved from a relation graph and refactors them to match training distributions, enhancing compositionality and multi-object fidelity. Stage 2 introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts using multi-source feedback (including semantic alignment, spatial fidelity, temporal coherence, and task-specific signals such as optical flow), yielding progressively improved video generation quality. Stage 3 leverages optimized prompt pairs from SSPO to fine-tune the rewriter LLM, internalizing task-specific optimization patterns and enabling efficient, high-quality prompt generation even before inference. Extensive experiments across five state-of-the-art T2V models and five benchmarks demonstrate that RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming existing methods by large margins. Our results highlight RAPO++ as a model-agnostic, cost-efficient, and scalable solution that sets a new standard for prompt optimization in T2V generation. The code is available at this https URL.

[CV-62] A Structured Review and Quantitative Profiling of Public Brain MRI Datasets for Foundation Model Development

【速读】:该论文旨在解决脑部磁共振成像(brain MRI)基础模型开发中因数据规模、多样性与一致性不足而导致的性能瓶颈问题。其关键解决方案在于系统性地量化和分析了54个公开脑部MRI数据集在数据集层级、图像层级及预处理层面的异质性,揭示了健康人群与临床群体之间的不平衡、影像空间参数与强度分布的显著差异,以及标准化预处理后仍存在的特征空间残余协变量偏移(residual covariate shift)。研究表明,仅靠常规预处理无法完全消除跨数据集偏差,因此强调需采用预处理感知(preprocessing-aware)和领域自适应(domain-adaptive)策略来构建具备泛化能力的脑部MRI基础模型。

链接: https://arxiv.org/abs/2510.20196
作者: Minh Sao Khue Luu,Margaret V. Benedichuk,Ekaterina I. Roppert,Roman M. Kenzhin,Bair N. Tuchinov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:The development of foundation models for brain MRI depends critically on the scale, diversity, and consistency of available data, yet systematic assessments of these factors remain scarce. In this study, we analyze 54 publicly accessible brain MRI datasets encompassing over 538,031 to provide a structured, multi-level overview tailored to foundation model development. At the dataset level, we characterize modality composition, disease coverage, and dataset scale, revealing strong imbalances between large healthy cohorts and smaller clinical populations. At the image level, we quantify voxel spacing, orientation, and intensity distributions across 15 representative datasets, demonstrating substantial heterogeneity that can influence representation learning. We then perform a quantitative evaluation of preprocessing variability, examining how intensity normalization, bias field correction, skull stripping, spatial registration, and interpolation alter voxel statistics and geometry. While these steps improve within-dataset consistency, residual differences persist between datasets. Finally, feature-space case study using a 3D DenseNet121 shows measurable residual covariate shift after standardized preprocessing, confirming that harmonization alone cannot eliminate inter-dataset bias. Together, these analyses provide a unified characterization of variability in public brain MRI resources and emphasize the need for preprocessing-aware and domain-adaptive strategies in the design of generalizable brain MRI foundation models.
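
文中对体素间距、方向与强度分布的图像级量化,可用 nibabel 做一个最小的单卷 profiling 示意(统计字段为示例选择,并非论文的完整统计口径):

```python
import nibabel as nib
import numpy as np

def profile_volume(path):
    """读取一个 NIfTI 体积,返回体素间距、轴向方向与强度统计。"""
    img = nib.load(path)
    data = np.asanyarray(img.dataobj).astype(np.float32)
    return {
        "spacing_mm": img.header.get_zooms()[:3],    # 体素间距
        "orientation": nib.aff2axcodes(img.affine),  # 轴向方向,如 ('R','A','S')
        "intensity_mean": float(data.mean()),
        "intensity_std": float(data.std()),
    }
```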

[CV-63] SPAN: Continuous Modeling of Suspicion Progression for Temporal Intention Localization

【速读】:该论文旨在解决视频监控中可疑意图识别的连续性建模问题,现有离散分类方法难以捕捉可疑意图的动态演变过程,限制了早期干预与可解释性。解决方案的关键在于提出Suspicion Progression Analysis Network (SPAN),其核心创新包括:1)将可疑意图建模从离散分类转向连续回归,以反映意图的波动性和演化特性;2)基于时间点过程(Temporal Point Process, TPP)理论揭示可疑行为具有长期依赖和累积效应,并定义了考虑时序特征的可疑分数公式;3)引入可疑系数调制(Suspicion Coefficient Modulation)机制,利用多模态信息动态调整可疑行为的影响权重;4)提出概念锚定映射(Concept-Anchored Mapping)方法,将具体行为与预定义意图概念关联,提升模型对行为及其潜在动机的解释能力。实验表明,SPAN在HAI数据集上显著优于现有方法,尤其在低频场景下mAP提升达2.74%,验证了其对细微行为变化的敏感性与实用性。

链接: https://arxiv.org/abs/2510.20189
作者: Xinyi Hu,Yuran Wang,Yue Li,Wenxuan Liu,Zheng Wang
机构: Wuhan University (武汉大学); Peking University (北京大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Temporal Intention Localization (TIL) is crucial for video surveillance, focusing on identifying varying levels of suspicious intentions to improve security monitoring. However, existing discrete classification methods fail to capture the continuous nature of suspicious intentions, limiting early intervention and explainability. In this paper, we propose the Suspicion Progression Analysis Network (SPAN), which shifts from discrete classification to continuous regression, enabling the capture of fluctuating and evolving suspicious intentions. We reveal that suspicion exhibits long-term dependencies and cumulative effects, similar to Temporal Point Process (TPP) theory. Based on these insights, we define a suspicion score formula that models continuous changes while accounting for temporal characteristics. We also introduce Suspicion Coefficient Modulation, which adjusts suspicion coefficients using multimodal information to reflect the varying impacts of suspicious actions. Additionally, the Concept-Anchored Mapping method is proposed to link suspicious actions to predefined intention concepts, offering insights into both the actions and their potential underlying intentions. Extensive experiments on the HAI dataset show that SPAN significantly outperforms existing methods, reducing MSE by 19.8% and improving average mAP by 1.78%. Notably, SPAN achieves a 2.74% mAP gain in low-frequency cases, demonstrating its superior ability to capture subtle behavioral changes. Compared to discrete classification systems, our continuous suspicion modeling approach enables earlier detection and proactive intervention, greatly enhancing system explainability and practical utility in security applications.
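
摘要将可疑分数类比时间点过程(TPP)的长期依赖与累积效应。一个最小示意如下(指数衰减核与可疑系数 c_i 的形式均为假设;论文中 c_i 由多模态信息调制):

```python
import math

def suspicion_score(events, t_now, decay=0.1):
    """events: [(t_i, c_i), ...],每个可疑动作按指数核随时间衰减并累积。"""
    return sum(c * math.exp(-decay * (t_now - t)) for t, c in events if t <= t_now)

# 示例:三个可疑动作之后、t=6.0 时刻的连续可疑分数
print(suspicion_score([(0.0, 0.5), (2.0, 0.8), (5.0, 1.0)], t_now=6.0))
```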

[CV-64] Evaluating Video Models as Simulators of Multi-Person Pedestrian Trajectories

【速读】:该论文旨在解决当前文本到视频(text-to-video, T2V)和图像到视频(image-to-video, I2V)生成模型在模拟多智能体(multi-agent)行人动态时缺乏系统性评估的问题,尤其是这些模型是否能生成符合物理与社会交互规律的多人互动场景。其解决方案的关键在于提出了一套严谨的评估协议:首先利用已有的基准数据集中的起始帧对I2V模型进行对比分析,以实现与真实视频数据的直接比较;其次为T2V模型设计了多样化提示词套件,用于探索不同人群密度和交互模式下的生成效果;尤为关键的是,开发了一种无需已知相机参数即可从像素空间重建二维鸟瞰视角轨迹的方法,从而客观量化生成视频中行人行为的合理性。这一方法使研究者能够系统评估生成模型在复杂社交场景中的隐式世界模拟能力。

链接: https://arxiv.org/abs/2510.20182
作者: Aaron Appelle,Jerome P. Lynch
机构: Duke University (杜克大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint, under review

Abstract:Large-scale video generation models have demonstrated high visual realism in diverse contexts, spurring interest in their potential as general-purpose world simulators. Existing benchmarks focus on individual subjects rather than scenes with multiple interacting people. However, the plausibility of multi-agent dynamics in generated videos remains unverified. We propose a rigorous evaluation protocol to benchmark text-to-video (T2V) and image-to-video (I2V) models as implicit simulators of pedestrian dynamics. For I2V, we leverage start frames from established datasets to enable comparison with a ground truth video dataset. For T2V, we develop a prompt suite to explore diverse pedestrian densities and interactions. A key component is a method to reconstruct 2D bird’s-eye view trajectories from pixel-space without known camera parameters. Our analysis reveals that leading models have learned surprisingly effective priors for plausible multi-agent behavior. However, failure modes like merging and disappearing people highlight areas for future improvement.

[CV-65] PPMStereo: Pick-and-Play Memory Construction for Consistent Dynamic Stereo Matching

【速读】:该论文旨在解决立体视频中时序不一致的深度估计问题(temporally inconsistent depth estimation),这对增强现实(Augmented Reality, AR)等实际应用至关重要,因为深度估计的不一致性会破坏用户的沉浸感。现有方法通过聚合时空信息来提升一致性,但面临计算效率与长期时序建模能力之间的权衡:有限的时序建模效果有限,而捕捉长程依赖则显著增加计算开销。解决方案的关键在于提出一种名为“Pick-and-Play Memory”(PPM)的记忆缓冲模块,其采用两阶段决策机制——“pick”过程选择最具相关性的参考帧,“play”过程自适应加权所选帧进行时空聚合,从而在保持紧凑记忆的同时实现高效且稳定的动态立体匹配(dynamic stereo matching)。该设计显著提升了深度估计的时序一致性与准确性,同时降低计算成本。

链接: https://arxiv.org/abs/2510.20178
作者: Yun Wang,Junjie Hu,Qiaole Dong,Yongjian Zhang,Yanwei Fu,Tin Lun Lam,Dapeng Wu
机构: City University of Hong Kong (香港城市大学); The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); Fudan University (复旦大学); Sun Yat-sen University, Shenzhen Campus (中山大学深圳校区)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

Abstract:Temporally consistent depth estimation from stereo video is critical for real-world applications such as augmented reality, where inconsistent depth estimation disrupts the immersion of users. Despite its importance, this task remains challenging due to the difficulty in modeling long-term temporal consistency in a computationally efficient manner. Previous methods attempt to address this by aggregating spatio-temporal information but face a fundamental trade-off: limited temporal modeling provides only modest gains, whereas capturing long-range dependencies significantly increases computational cost. To address this limitation, we introduce a memory buffer for modeling long-range spatio-temporal consistency while achieving efficient dynamic stereo matching. Inspired by the two-stage decision-making process in humans, we propose a Pick-and-Play Memory (PPM) construction module for dynamic stereo matching, dubbed PPMStereo. PPM consists of a “pick” process that identifies the most relevant frames and a “play” process that weights the selected frames adaptively for spatio-temporal aggregation. This two-stage collaborative process maintains a compact yet highly informative memory buffer while achieving temporally consistent information aggregation. Extensive experiments validate the effectiveness of PPMStereo, demonstrating state-of-the-art performance in both accuracy and temporal consistency. Notably, PPMStereo achieves 0.62/1.11 TEPE on the Sintel clean/final splits (17.3% / 9.02% improvements over BiDAStereo) at lower computational cost. Codes are available at this https URL.
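
“pick 选帧、play 加权”的两阶段聚合可用如下最小示意表达(以余弦相似度选 top-k、softmax 加权为假设的简化,实际模块的打分与聚合方式以论文为准):

```python
import torch
import torch.nn.functional as F

def pick_and_play(query, memory, k=4):
    """query: (D,) 当前帧特征;memory: (T, D) 记忆缓冲。"""
    sim = F.cosine_similarity(memory, query.unsqueeze(0), dim=-1)   # (T,)
    picked = sim.topk(min(k, memory.size(0)))                       # pick:选最相关帧
    weights = picked.values.softmax(dim=0)                          # play:自适应权重
    return (weights.unsqueeze(-1) * memory[picked.indices]).sum(0)  # 加权聚合
```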

[CV-66] IB-GAN: Disentangled Representation Learning with Information Bottleneck Generative Adversarial Networks AAAI2021

【速读】:该论文旨在解决生成式模型中潜在表示的解耦(disentangled representation)问题,即如何在无监督学习场景下使生成模型的潜在变量分别对应数据中的独立语义因素。其解决方案的关键在于将信息瓶颈(Information Bottleneck, IB)框架引入生成对抗网络(GAN)的优化过程,提出IB-GAN模型:通过在生成器的中间随机层施加约束,控制输入与生成输出之间的互信息(mutual information),从而迫使生成器以解耦且可解释的方式利用潜在空间。该中间层作为可学习的潜在分布,与生成器联合端到端训练,显著提升了模型在dSprites和Color-dSprites等数据集上的解耦性能,并在CelebA和3D Chairs数据集上展现出更优的样本质量和多样性(以FID评分衡量)。

链接: https://arxiv.org/abs/2510.20165
作者: Insu Jeon,Wonkwang Lee,Myeongjang Pyeon,Gunhee Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Published in the Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI 2021), paper number 7926

Abstract:We propose a new GAN-based unsupervised model for disentangled representation learning. The new model is discovered in an attempt to utilize the Information Bottleneck (IB) framework in the optimization of GAN, and is thereby named IB-GAN. The architecture of IB-GAN is partially similar to that of InfoGAN but has a critical difference: an intermediate layer of the generator is leveraged to constrain the mutual information between the input and the generated output. The intermediate stochastic layer can serve as a learnable latent distribution that is trained with the generator jointly in an end-to-end fashion. As a result, the generator of IB-GAN can harness the latent space in a disentangled and interpretable manner. With experiments on the dSprites and Color-dSprites datasets, we demonstrate that IB-GAN achieves disentanglement scores competitive with those of state-of-the-art β-VAEs and outperforms InfoGAN. Moreover, the visual quality and the diversity of samples generated by IB-GAN are often better than those of β-VAEs and InfoGAN in terms of FID score on the CelebA and 3D Chairs datasets.
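
按摘要的描述,IB-GAN 的优化目标可示意性地写成信息瓶颈式权衡(I_L 为 InfoGAN 风格的互信息变分下界,I_U 为由中间随机层诱导的上界;该写法是本文的示意性标注,并非论文原式):

```latex
\max_{G}\; I_{L}\big(z;\hat{x}\big) \;-\; \beta\, I_{U}\big(z;\hat{x}\big)
```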

[CV-67] OMCAT: Test-time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning NEURIPS2025

【速读】:该论文旨在解决组合零样本学习(Compositional Zero-Shot Learning, CZSL)中因测试时标签空间分布偏移导致的性能下降问题,该偏移源于未见过的属性-对象组合在测试阶段被重新组合。解决方案的关键在于:1)利用无监督数据在文本和视觉模态上累积全面知识,并在测试时更新多模态原型;2)设计自适应更新权重以控制原型调整程度,从而灵活应对测试阶段的分布偏移;3)引入动态优先队列存储高置信度图像,从历史图像中获取视觉知识用于推理;4)通过多模态协同表示学习对齐文本与视觉原型,确保多模态知识的一致性。

链接: https://arxiv.org/abs/2510.20162
作者: Xudong Yan,Songhe Feng
机构: Beijing Jiaotong University (北京交通大学); Key Laboratory of Big Data and Artificial Intelligence in Transportation (Beijing Jiaotong University) (交通运输大数据与人工智能重点实验室(北京交通大学)), Ministry of Education (教育部)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to NeurIPS 2025

Abstract:Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones. Existing methods suffer from performance degradation caused by the distribution shift of label space at test time, which stems from the inclusion of unseen compositions recombined from attributes and objects. To overcome the challenge, we propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities from unsupervised data to update multimodal prototypes at test time. Building on this, we further design an adaptive update weight to control the degree of prototype adjustment, enabling the model to flexibly adapt to distribution shift during testing. Moreover, a dynamic priority queue is introduced that stores high-confidence images to acquire visual knowledge from historical images for inference. Considering the semantic consistency of multimodal knowledge, we align textual and visual prototypes by multimodal collaborative representation learning. Extensive experiments indicate that our approach achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings. Code will be available at this https URL .
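
“测试时按置信度自适应更新原型”的思路可用如下最小示意表达(EMA 更新形式与按最大相似度分配样本均为假设的简化):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_prototypes(protos, feats, conf, base_lr=0.1):
    """protos: (K, D) 原型;feats: (B, D) 测试特征;conf: (B,) 预测置信度。"""
    feats = F.normalize(feats, dim=-1)
    assign = (feats @ F.normalize(protos, dim=-1).t()).argmax(dim=1)  # 样本归属
    for k in assign.unique():
        mask = assign == k
        w = base_lr * conf[mask].mean()  # 自适应更新权重:置信度越高,调整越大
        protos[k] = (1 - w) * protos[k] + w * feats[mask].mean(0)
    return protos
```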

[CV-68] Monocular Visual 8D Pose Estimation for Articulated Bicycles and Cyclists

【速读】:该论文旨在解决自动驾驶中对非刚性结构的自行车及其骑行者进行精确姿态估计的问题,传统6D姿态估计方法无法有效处理自行车因转向杆和踏板角度变化导致的关节式构型改变,从而影响对骑行意图识别与避障决策的准确性。解决方案的关键在于提出一种基于单张RGB图像的类别级8D姿态估计方法,不仅估计自行车整体的3D旋转与平移(6D),还额外估计转向杆和踏板相对于车体坐标系的旋转角度(新增2D),从而实现更精细的自行车姿态状态建模及实际行驶方向预测。该方法通过联合优化8D姿态与3D关键点估计,并结合合成数据与真实图像混合训练策略,显著提升了在复杂场景下的泛化能力与精度表现。

链接: https://arxiv.org/abs/2510.20158
作者: Eduardo R. Corral-Soto,Yang Liu,Yuan Ren,Bai Dongfeng,Liu Bingbing
机构: Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:In Autonomous Driving, cyclists belong to the safety-critical class of Vulnerable Road Users (VRU), and accurate estimation of their pose is critical for cyclist crossing intention classification, behavior prediction, and collision avoidance. Unlike rigid objects, articulated bicycles are composed of movable rigid parts linked by joints and constrained by a kinematic structure. 6D pose methods can estimate the 3D rotation and translation of rigid bicycles, but 6D becomes insufficient when the steering/pedals angles of the bicycle vary. That is because: 1) varying the articulated pose of the bicycle causes its 3D bounding box to vary as well, and 2) the 3D box orientation is not necessarily aligned to the orientation of the steering which determines the actual intended travel direction. In this work, we introduce a method for category-level 8D pose estimation for articulated bicycles and cyclists from a single RGB image. Besides being able to estimate the 3D translation and rotation of a bicycle from a single image, our method also estimates the rotations of its steering handles and pedals with respect to the bicycle body frame. These two new parameters enable the estimation of a more fine-grained bicycle pose state and travel direction. Our proposed model jointly estimates the 8D pose and the 3D Keypoints of articulated bicycles, and trains with a mix of synthetic and real image data to generalize on real images. We include an evaluation section where we evaluate the accuracy of our estimated 8D pose parameters, and our method shows promising results by achieving competitive scores when compared against state-of-the-art category-level 6D pose estimators that use rigid canonical object templates for matching.

[CV-69] PartNeXt: A Next-Generation Dataset for Fine-Grained and Hierarchical 3D Part Understanding NEURIPS2025

【速读】:该论文旨在解决现有3D物体部件理解数据集(如PartNet)在可扩展性、纹理信息缺失及专家依赖标注等方面的局限性,从而推动计算机视觉、图形学和机器人领域对物体结构化理解的研究进展。其解决方案的关键在于提出PartNeXt这一新一代数据集,包含超过23,000个高质量、带纹理的3D模型,并提供细粒度、分层的部件标签(覆盖50个类别),同时支持两类任务基准测试:类无关的部件分割与面向3D大语言模型(3D-LLMs)的部件问答任务,验证了其在提升模型性能(如Point-SAM在PartNeXt上训练优于PartNet)方面的优越性,体现了其标注可扩展性、纹理感知性和多任务评估能力的综合优势。

链接: https://arxiv.org/abs/2510.20155
作者: Penghao Wang,Yiyang He,Xin Lv,Yukai Zhou,Lan Xu,Jingyi Yu,Jiayuan Gu
机构: ShanghaiTech University (上海科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025 DB Track. Project page: this https URL

Abstract:Understanding objects at the level of their constituent parts is fundamental to advancing computer vision, graphics, and robotics. While datasets like PartNet have driven progress in 3D part understanding, their reliance on untextured geometries and expert-dependent annotation limits scalability and usability. We introduce PartNeXt, a next-generation dataset addressing these gaps with over 23,000 high-quality, textured 3D models annotated with fine-grained, hierarchical part labels across 50 categories. We benchmark PartNeXt on two tasks: (1) class-agnostic part segmentation, where state-of-the-art methods (e.g., PartField, SAMPart3D) struggle with fine-grained and leaf-level parts, and (2) 3D part-centric question answering, a new benchmark for 3D-LLMs that reveals significant gaps in open-vocabulary part grounding. Additionally, training Point-SAM on PartNeXt yields substantial gains over PartNet, underscoring the dataset’s superior quality and diversity. By combining scalable annotation, texture-aware labels, and multi-task evaluation, PartNeXt opens new avenues for research in structured 3D understanding.

[CV-70] Revisiting Logit Distributions for Reliable Out-of-Distribution Detection NEURIPS2025

【速读】:该论文旨在解决深度学习模型在开放世界应用中对分布外(Out-of-distribution, OOD)样本检测的可靠性问题,尤其针对现有后处理(post-hoc)方法未能充分挖掘模型logits空间中丰富信息的局限性。其解决方案的关键在于提出LogitGap方法,该方法通过显式利用最大logit与其余logit之间的关系来增强分布内(in-distribution, ID)与OOD样本的可分性;进一步地,通过引入一种无需训练的策略自动识别logits空间中最具信息量的子集用于评分,从而提升检测性能。理论分析与大量实验表明,该方法在视觉-语言和纯视觉模型上均能实现跨场景和基准的SOTA效果。

链接: https://arxiv.org/abs/2510.20134
作者: Jiachen Liang,Ruibing Hou,Minyang Hu,Hong Chang,Shiguang Shan,Xilin Chen
机构: State Key Laboratory of AI Safety (人工智能安全重点实验室); Institute of Computing Technology, CAS (中国科学院计算技术研究所); University of Chinese Academy of Sciences (CAS) (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025

Abstract:Out-of-distribution (OOD) detection is critical for ensuring the reliability of deep learning models in open-world applications. While post-hoc methods are favored for their efficiency and ease of deployment, existing approaches often underexploit the rich information embedded in the model’s logits space. In this paper, we propose LogitGap, a novel post-hoc OOD detection method that explicitly exploits the relationship between the maximum logit and the remaining logits to enhance the separability between in-distribution (ID) and OOD samples. To further improve its effectiveness, we refine LogitGap by focusing on a more compact and informative subset of the logit space. Specifically, we introduce a training-free strategy that automatically identifies the most informative logits for scoring. We provide both theoretical analysis and empirical evidence to validate the effectiveness of our approach. Extensive experiments on both vision-language and vision-only models demonstrate that LogitGap consistently achieves state-of-the-art performance across diverse OOD detection scenarios and benchmarks. Code is available at this https URL.
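
LogitGap“利用最大 logit 与其余 logits 的关系”这一思路,可用如下简化示意(用排序后 top-m 子集近似“最具信息量的 logits 子集”;论文中的子集选择是免训练自动确定的,此处 m 为假设超参数。分数越低越可能是 OOD):

```python
import torch

def logitgap_score(logits, top_m=10):
    """logits: (B, C);返回 (B,) 的 OOD 评分。"""
    sorted_logits, _ = logits.sort(dim=-1, descending=True)
    max_logit = sorted_logits[..., 0]
    rest = sorted_logits[..., 1:top_m + 1]  # 其余 logits 的信息量较大子集(简化)
    return max_logit - rest.mean(dim=-1)    # 最大 logit 与其余 logits 的平均差距
```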

[CV-71] Inverse Image-Based Rendering for Light Field Generation from Single Images

【速读】:该论文旨在解决从单张图像生成光场(Light Field)的问题,以克服传统光场获取方式对复杂设备或高计算成本的依赖。其核心挑战在于如何从有限的二维信息中重建出多视角下的光线流动(light flow),从而支持真实感的新视角合成与摄影效果(如焦点调整)。解决方案的关键在于提出一种名为“逆向图像渲染”(inverse image-based rendering)的新方法:通过设计一个神经渲染管道,首先从输入图像中存储源光线的光流信息,再利用交叉注意力机制(cross-attention)建模各光线间的空间关系,并据此预测目标视角下任意射线的颜色;随后迭代更新未见区域的内容至源光线集合,确保遮挡区域的一致性生成。该方法无需额外微调即可在多种挑战性数据集上实现优于当前最优新视角合成技术的效果。

链接: https://arxiv.org/abs/2510.20132
作者: Hyunjun Jung,Hae-Gon Jeon
机构: GIST AI Graduated School (GIST人工智能研究生院); Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:The concept of light fields computed from multiple view images on regular grids has proven its benefit for scene representation, supporting realistic rendering of novel views and photographic effects such as refocusing and shallow depth of field. Despite the effectiveness of light-flow computation, obtaining light fields requires either high computational cost or specialized devices such as a bulky camera setup or a microlens array. In an effort to broaden its benefit and applicability, in this paper, we propose a novel view synthesis method for light field generation from only single images, named inverse image-based rendering. Unlike previous attempts to implicitly rebuild 3D geometry or to explicitly represent objective scenes, our method reconstructs light flows in a space from image pixels, which behaves in the opposite way to image-based rendering. To accomplish this, we design a neural rendering pipeline to render a target ray in an arbitrary viewpoint. Our neural renderer first stores the light flow of source rays from the input image, then computes the relationships among them through cross-attention, and finally predicts the color of the target ray based on these relationships. After the rendering pipeline generates the first novel view from a single input image, the generated out-of-view contents are added to the set of source rays. This procedure is performed iteratively while ensuring the consistent generation of occluded contents. We demonstrate that, once trained on a synthetic dataset, our inverse image-based rendering works well on various challenging datasets without any retraining or fine-tuning, and outperforms relevant state-of-the-art novel view synthesis methods.

[CV-72] Physics-Guided Fusion for Robust 3D Tracking of Fast Moving Small Objects

【速读】:该论文旨在解决快速移动的小型物体在三维空间中的检测与跟踪问题,这是计算机视觉领域中尚未充分探索的挑战。现有方法在处理此类目标时受限于精度不足和对遮挡、高速运动等复杂场景的鲁棒性差。解决方案的关键在于提出一个融合深度学习检测与物理模型驱动跟踪的新系统:首先利用基于深度学习的方法实现高精度的目标检测,其次设计了一种创新的基于运动学方程(kinematics motion equations)的物理驱动跟踪算法,能够有效处理异常值和漏检;同时引入一个异常值检测与修正模块,在遮挡或方向突变等困难场景下显著提升跟踪性能。实验表明,该系统相比基于卡尔曼滤波(Kalman filter)的追踪器平均位移误差降低达70%,验证了深度学习与物理模型结合的有效性。

链接: https://arxiv.org/abs/2510.20126
作者: Prithvi Raj Singh,Raju Gottumukkala,Anthony S. Maida,Alan B. Barhorst,Vijaya Gopu
机构: UL Lafayette (路易斯安那大学拉斐特分校); Louisiana Transportation Research Center (路易斯安那州交通研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 6 figures

Abstract:While computer vision has advanced considerably for general object detection and tracking, the specific problem of fast-moving tiny objects remains underexplored. This paper addresses the significant challenge of detecting and tracking rapidly moving small objects using an RGB-D camera. Our novel system combines deep learning-based detection with physics-based tracking to overcome the limitations of existing approaches. Our contributions include: (1) a comprehensive system design for detecting and tracking fast-moving small objects in 3D space, (2) an innovative physics-based tracking algorithm that integrates kinematic motion equations to handle outliers and missed detections, and (3) an outlier detection and correction module that significantly improves tracking performance in challenging scenarios such as occlusions and rapid direction changes. We evaluated our proposed system on a custom racquetball dataset. Our evaluation shows that our system surpasses Kalman-filter-based trackers with up to 70% lower Average Displacement Error. Our system has significant applications for improving robot perception on autonomous platforms and demonstrates the effectiveness of combining physics-based models with deep learning approaches for real-time 3D detection and tracking of challenging small objects.
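
摘要中“基于运动学方程处理离群与漏检”的做法可示意如下(匀加速模型与偏差阈值均为假设取值,仅演示预测与修正两步):

```python
import numpy as np

def predict_position(p, v, a, dt):
    """匀加速运动学预测:p' = p + v*dt + 0.5*a*dt^2(p/v/a 为 3D 向量)。"""
    return p + v * dt + 0.5 * a * dt ** 2

def correct_measurement(measured, predicted, max_dev=0.15):
    """观测缺失或与物理预测偏差过大时,退回运动学预测值。"""
    if measured is None or np.linalg.norm(measured - predicted) > max_dev:
        return predicted  # 漏检/离群:用预测替代
    return measured
```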

[CV-73] Why Prototypes Collapse: Diagnosing and Preventing Partial Collapse in Prototypical Self-Supervised Learning

【速读】:该论文旨在解决原型自监督学习(prototypical self-supervised learning)中普遍存在的部分原型坍缩(partial prototype collapse)问题,即多个原型收敛至几乎相同的表示,从而削弱了其作为多样化引导目标以提升编码器表征能力的核心作用。解决方案的关键在于打破编码器与原型的联合优化机制,提出一种完全解耦的训练策略:通过在线期望最大化(EM)风格的更新方式独立维护原型为高斯混合模型,使其更新不依赖于编码器损失函数,从而避免早期训练中因捷径学习导致的冗余原型生成。此方法无需显式正则化即可实现原型多样性提升,并显著增强下游任务性能。

链接: https://arxiv.org/abs/2510.20108
作者: Gabriel Y. Arteaga,Marius Aasan,Rwiddhi Chakraborty,Martine Hjelkrem-Tan,Thalles Silva,Michael Kampffmeyer,Adín Ramírez Rivera
机构: University of Oslo (奥斯陆大学); UiT The Arctic University of Norway (北极挪威大学); University of Campinas (坎皮纳斯大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Prototypical self-supervised learning methods consistently suffer from partial prototype collapse, where multiple prototypes converge to nearly identical representations. This undermines their central purpose – providing diverse and informative targets to guide encoders toward rich representations – and has led practitioners to over-parameterize prototype sets or add ad-hoc regularizers, which mitigate symptoms rather than address the root cause. We empirically trace the collapse to the joint optimization of encoders and prototypes, which encourages a type of shortcut learning: early in training prototypes drift toward redundant representations that minimize loss without necessarily enhancing representation diversity. To break the joint optimization, we introduce a fully decoupled training strategy that learns prototypes and encoders under separate objectives. Concretely, we model prototypes as a Gaussian mixture updated with an online EM-style procedure, independent of the encoder’s loss. This simple yet principled decoupling eliminates prototype collapse without explicit regularization and yields consistently diverse prototypes and stronger downstream performance.
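
“把原型建成高斯混合、用在线 EM 更新、与编码器损失解耦”的训练方式可用如下最小示意(对角协方差与固定步长 lr 为假设的简化):

```python
import torch

def online_em_step(feats, mu, var, pi, lr=0.05, eps=1e-6):
    """feats: (B, D);mu/var: (K, D);pi: (K,)。单步在线 EM 更新。"""
    # E 步:计算各样本对各高斯分量的责任度(对角协方差,对数域计算)
    log_prob = -0.5 * (((feats[:, None] - mu) ** 2) / (var + eps)
                       + torch.log(var + eps)).sum(-1) + torch.log(pi + eps)
    resp = log_prob.softmax(dim=1)  # (B, K)
    # M 步:按责任度做滑动平均更新(完全独立于编码器的损失)
    Nk = resp.sum(0) + eps
    mu_new = (resp.t() @ feats) / Nk[:, None]
    var_new = (resp.t() @ feats.pow(2)) / Nk[:, None] - mu_new.pow(2)
    mu = (1 - lr) * mu + lr * mu_new
    var = (1 - lr) * var + lr * var_new.clamp_min(eps)
    pi = (1 - lr) * pi + lr * (Nk / feats.size(0))
    return mu, var, pi / pi.sum()
```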

[CV-74] StableSketcher: Enhancing Diffusion Model for Pixel-based Sketch Generation via Visual Question Answering Feedback

【速读】:该论文旨在解决当前扩散模型在生成像素级手绘草图(hand-drawn sketches)时面临的挑战,尤其是如何提升生成草图与文本提示(prompt)之间的语义一致性与风格保真度。其核心解决方案在于提出StableSketcher框架:首先通过微调变分自编码器(Variational Autoencoder, VAE)优化潜在空间解码,使其更精准地捕捉草图的特征;其次引入基于视觉问答(Visual Question Answering, VQA)的强化学习奖励函数,以增强文本-图像对齐能力和语义一致性。此外,作者构建了首个包含实例级草图、对应描述和问答对的数据集SketchDUO,填补了现有数据集仅依赖图像标签对的不足,从而推动该领域研究发展。

链接: https://arxiv.org/abs/2510.20093
作者: Jiho Park,Sieun Choi,Jaeyoon Seo,Jihie Kim
机构: Dongguk University (东国大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Under review at IEEE Access. Author-submitted preprint. Not the IEEE-published version

Abstract:Although recent advancements in diffusion models have significantly enriched the quality of generated images, challenges remain in synthesizing pixel-based human-drawn sketches, a representative example of abstract expression. To combat these challenges, we propose StableSketcher, a novel framework that empowers diffusion models to generate hand-drawn sketches with high prompt fidelity. Within this framework, we fine-tune the variational autoencoder to optimize latent decoding, enabling it to better capture the characteristics of sketches. In parallel, we integrate a new reward function for reinforcement learning based on visual question answering, which improves text-image alignment and semantic consistency. Extensive experiments demonstrate that StableSketcher generates sketches with improved stylistic fidelity, achieving better alignment with prompts compared to the Stable Diffusion baseline. Additionally, we introduce SketchDUO, to the best of our knowledge, the first dataset comprising instance-level sketches paired with captions and question-answer pairs, thereby addressing the limitations of existing datasets that rely on image-label pairs. Our code and dataset will be made publicly available upon acceptance.

[CV-75] Attentive Convolution: Unifying the Expressivity of Self-Attention with Convolutional Efficiency

【速读】:该论文旨在解决自注意力机制(Self-Attention, SA)在视觉骨干网络中因二次计算复杂度而成为实际应用瓶颈的问题,同时指出当前对卷积神经网络(Convolutional Neural Networks, CNN)的现代化改进未能充分捕捉SA所具备的内在表达能力。解决方案的关键在于重新审视CNN设计,并基于两个核心发现:(1) 自适应路由(Adaptive routing)——SA能根据语义内容动态调节位置信息流,而传统卷积使用静态核;(2) 侧抑制(Lateral inhibition)——SA通过令牌权重间的竞争机制抑制冗余、增强表示,而卷积滤波器缺乏此类抑制动力学。由此提出有注意力的卷积(Attentive Convolution, ATConv),其本质是对卷积算子的原理性重构,内嵌上述两种机制。实验表明,仅用3×3核的ATConv即可在基础视觉任务中超越多种SA结构,并构建出参数量仅为27M的AttNet,在ImageNet-1K上达到84.4% Top-1准确率,且在扩散图像生成中显著提升性能与采样效率。

链接: https://arxiv.org/abs/2510.20092
作者: Hao Yu,Haoyu Chen,Yan Jiang,Wei Peng,Zhaodong Sun,Samuel Kaski,Guoying Zhao
机构: Center for Machine Vision and Signal Analysis, University of Oulu, Finland (机器视觉与信号分析中心,奥卢大学,芬兰); Department of Psychiatry and Behavioral Sciences, Stanford University, USA (精神病学与行为科学系,斯坦福大学,美国); School of Computer Science, Nanjing University of Information Science and Technology, China (计算机科学学院,南京信息工程大学,中国); Department of Computer Science, Aalto University, Finland (计算机科学系,阿尔托大学,芬兰); ELLIS Institute Finland (ELLIS研究所芬兰)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

Abstract:Self-attention (SA) has become the cornerstone of modern vision backbones for its powerful expressivity over traditional Convolutions (Conv). However, its quadratic complexity remains a critical bottleneck for practical applications. Given that Conv offers linear complexity and strong visual priors, continuing efforts have been made to promote the renaissance of Conv. However, a persistent performance chasm remains, highlighting that these modernizations have not yet captured the intrinsic expressivity that defines SA. In this paper, we re-examine the design of CNNs, directed by a key question: what principles give SA its edge over Conv? As a result, we reveal two fundamental insights that challenge the long-standing design intuitions in prior research (e.g., receptive field). The two findings are: (1) Adaptive routing: SA dynamically regulates positional information flow according to semantic content, whereas Conv employs static kernels uniformly across all positions. (2) Lateral inhibition: SA induces score competition among token weightings, effectively suppressing redundancy and sharpening representations, whereas Conv filters lack such inhibitory dynamics and exhibit considerable redundancy. Based on this, we propose Attentive Convolution (ATConv), a principled reformulation of the convolutional operator that intrinsically injects these principles. Interestingly, with only 3×3 kernels, ATConv consistently outperforms various SA mechanisms in fundamental vision tasks. Building on ATConv, we introduce AttNet, a CNN family that can attain 84.4% ImageNet-1K Top-1 accuracy with only 27M parameters. In diffusion-based image generation, replacing all SA with the proposed 3×3 ATConv in SiT-XL/2 reduces ImageNet FID by 0.15 in 400k steps with faster sampling. Code is available at: this http URL.

[CV-76] Endoshare: A Source Available Solution to De-Identify and Manage Surgical Videos

【速读】:该论文旨在解决微创手术中内窥镜视频在教学、研究与质量改进应用中的两大核心问题:一是不同来源视频格式不统一导致的数据整合困难,二是视频共享涉及的隐私保护风险。解决方案的关键在于提出并实现了一个开源、跨平台的应用程序Endoshare,其核心功能包括视频合并、标准化处理及去标识化(de-identifying),采用“隐私优先设计”(privacy-by-design)架构,并通过用户中心迭代反馈优化可用性。测试表明,Endoshare具备高易用性和临床接受度,可为手术视频提供透明、安全、标准化的管理流程,但需进一步获得合规认证和跨系统互操作性验证以支持大规模部署。

链接: https://arxiv.org/abs/2510.20087
作者: Lorenzo Arboit,Dennis N. Schneider,Britty Baby,Vinkle Srivastav,Pietro Mascagni,Nicolas Padoy
机构: University of Strasbourg, CNRS, INSERM, ICube, UMR7357, Strasbourg, France; IHU Strasbourg, Strasbourg, France; Fondazione Policlinico Universitario Agostino Gemelli IRCCS, Rome, Italy
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 6 figures. Source-available software: this https URL

点击查看摘要

Abstract:Video-based assessment and surgical data science can advance surgical training, research, and quality improvement. However, widespread use remains limited by heterogeneous recording formats and privacy concerns associated with video sharing. We present Endoshare, a source-available, cross-platform application for merging, standardizing, and de-identifying endoscopic videos in minimally invasive surgery. Development followed the software development life cycle with iterative, user-centered feedback. During the analysis phase, an internal survey of clinicians and computer scientists based on ten usability heuristics identified key requirements that guided a privacy-by-design architecture. In the testing phase, an external clinician survey combined the same heuristics with Technology Acceptance Model constructs to assess usability and adoption, complemented by benchmarking across different hardware configurations. Four clinicians and four computer scientists initially tested the prototype, reporting high usability (4.68 +/- 0.40/5 and 4.03 +/- 0.51/5), with the lowest score (4.00 +/- 0.93/5) relating to label clarity. After refinement, the testing phase surveyed ten surgeons who reported high perceived usefulness (5.07 +/- 1.75/7), ease of use (5.15 +/- 1.71/7), heuristic usability (4.38 +/- 0.48/5), and strong recommendation (9.20 +/- 0.79/10). Processing time varied with processing mode, video duration (both p = 0.001), and machine computational power (p = 0.041). Endoshare provides a transparent, user-friendly pipeline for standardized, privacy-preserving surgical video management. Compliance certification and broader interoperability validation are needed to establish it as a deployable alternative to proprietary systems. The software is available at this https URL
zh

[CV-77] Data-Adaptive Transformed Bilateral Tensor Low-Rank Representation for Clustering

【速读】:该论文旨在解决现有张量低秩表示(Tensor Low-Rank Representation, TLRR)方法在图像聚类任务中因依赖固定变换而导致对噪声敏感、鲁棒性差的问题。其解决方案的关键在于提出一种新型的变换双边张量低秩表示模型(Transformed Bilateral Tensor Low-Rank Representation, TBTLRR),通过学习任意酉变换来引入数据自适应的张量核范数(Tensor Nuclear Norm),从而更有效地捕捉全局相关性;同时利用潜在张量数据的双边结构,挖掘图像样本与特征之间的局部相关性;此外,融合ℓ₁/₂-范数和Frobenius范数正则项以增强对真实场景中复杂噪声的处理能力。为求解该非凸优化问题,作者设计了一种基于交替方向乘子法(ADMM)的高效算法,并提供了理论收敛性保证。

链接: https://arxiv.org/abs/2510.20077
作者: Hui Chen,Xinjie Wang,Xianchao Xiu,Wanquan Liu
机构: Shanghai University of Electric Power (上海电力大学); Shanghai University (上海大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Tensor low-rank representation (TLRR) has demonstrated significant success in image clustering. However, most existing methods rely on fixed transformations and suffer from poor robustness to noise. In this paper, we propose a novel transformed bilateral tensor low-rank representation model called TBTLRR, which introduces a data-adaptive tensor nuclear norm by learning arbitrary unitary transforms, allowing for more effective capture of global correlations. In addition, by leveraging the bilateral structure of latent tensor data, TBTLRR is able to exploit local correlations between image samples and features. Furthermore, TBTLRR integrates the ℓ1/2-norm and Frobenius norm regularization terms for better dealing with complex noise in real-world scenarios. To solve the proposed nonconvex model, we develop an efficient optimization algorithm inspired by the alternating direction method of multipliers (ADMM) and provide theoretical convergence guarantees. Extensive experiments validate its superiority over the state-of-the-art methods in clustering. The code will be available at this https URL.
zh
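
摘要中“学习任意酉变换 + 数据自适应张量核范数”的核心计算,可以理解为在变换域内对张量的每个前切片做奇异值阈值化(SVT)。下面是该近端算子的最简 numpy 草图(仅示意这一步;双边结构、ℓ1/2 正则与完整 ADMM 迭代从略,接口为假设):

```python
import numpy as np

def transformed_svt(X, U, tau):
    """在酉变换 U 下对 X 的每个前切片做奇异值软阈值化。
    X: (n1, n2, n3) 张量;U: (n3, n3) 酉矩阵;tau: 阈值。"""
    Xt = np.einsum('ijk,kl->ijl', X, U)            # 沿第 3 维进入变换域
    for l in range(Xt.shape[2]):
        u, s, vt = np.linalg.svd(Xt[:, :, l], full_matrices=False)
        Xt[:, :, l] = u @ np.diag(np.maximum(s - tau, 0)) @ vt  # 软阈值收缩
    return np.einsum('ijl,lk->ijk', Xt, U.conj().T)  # 逆变换回原域

X = np.random.rand(20, 20, 4)
U = np.linalg.qr(np.random.randn(4, 4))[0]         # 随机正交矩阵,仅作演示
X_lowrank = transformed_svt(X, U, tau=0.5)
```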

[CV-78] Filter-Based Reconstruction of Images from Events

【速读】:该论文旨在解决从移动事件相机(event camera)中重建强度图像(intensity image)这一挑战性问题,传统方法通常依赖部署在图形处理单元(GPU)上的神经网络。其解决方案的关键在于提出一种基于滤波器的异步重建方法(Filter Based Asynchronous Reconstruction, FIBAR),该方法首先利用时间数字无限冲激响应(IIR)滤波器对事件信号进行积分以生成初步图像;其次通过一种新颖的算法检测“滞留像素”(stale pixels),即长时间未更新的像素,并结合高斯滤波对其进行平滑处理,从而降低噪声;此外,FIBAR具备异步特性,允许在任意时刻读取图像,且可在现代笔记本CPU上以约42–140百万事件/秒的速度运行,显著优于现有神经网络方法的计算复杂度与实时性限制。

链接: https://arxiv.org/abs/2510.20071
作者: Bernd Pfrommer
机构: Event Vision Research LLC(事件视觉研究有限责任公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reconstructing an intensity image from the events of a moving event camera is a challenging task that is typically approached with neural networks deployed on graphics processing units. This paper presents a much simpler, FIlter Based Asynchronous Reconstruction method (FIBAR). First, intensity changes signaled by events are integrated with a temporal digital IIR filter. To reduce reconstruction noise, stale pixels are detected by a novel algorithm that regulates a window of recently updated pixels. Arguing that for a moving camera, the absence of events at a pixel location likely implies a low image gradient, stale pixels are then blurred with a Gaussian filter. In contrast to most existing methods, FIBAR is asynchronous and permits image read-out at an arbitrary time. It runs on a modern laptop CPU at about 42(140) million events/s with (without) spatial filtering enabled. A few simple qualitative experiments are presented that show the difference in image reconstruction between FIBAR and a neural network-based approach (FireNet). FIBAR’s reconstruction is noisier than neural network-based methods and suffers from ghost images. However, it is sufficient for certain tasks such as the detection of fiducial markers. Code is available at this https URL
zh
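
下面用 numpy 给出 FIBAR 核心思想的示意草图(泄漏系数、阈值等参数均为假设):以带泄漏的一阶 IIR 滤波器按事件异步积分亮度,并对长时间未更新的“陈旧像素”施加高斯平滑。

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def integrate_events(events, H, W, alpha=0.99, c=0.1):
    """以一阶泄漏 IIR 滤波器异步积分事件。events: (t, x, y, p) 序列,p 取 +1/-1。"""
    img = np.zeros((H, W), dtype=np.float32)
    last_t = np.zeros((H, W))                        # 每个像素最近一次事件的时间
    for t, x, y, p in events:
        dt = t - last_t[y, x]
        img[y, x] = img[y, x] * alpha ** dt + c * p  # 泄漏衰减 + 事件增量
        last_t[y, x] = t
    return img, last_t

def blur_stale(img, last_t, now, max_age=0.05, sigma=1.5):
    """长时间无事件的像素大概率图像梯度较低,用高斯滤波抑制其噪声。"""
    stale = (now - last_t) > max_age
    out = img.copy()
    out[stale] = gaussian_filter(img, sigma)[stale]
    return out

events = [(0.001, 5, 5, +1), (0.002, 5, 6, -1), (0.010, 5, 5, +1)]
img, last_t = integrate_events(events, 16, 16)
frame = blur_stale(img, last_t, now=0.05)           # 可在任意时刻读出图像
```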

[CV-79] Exposing Blindspots: Cultural Bias Evaluation in Generative Image Models

【速读】:该论文旨在解决生成式图像模型在跨文化语境下存在偏见的问题,特别是针对图像到图像(I2I)编辑任务中文化表征失真缺乏系统评估的空白。其解决方案的关键在于构建一个统一的评估框架,涵盖六个国家、8个类别/36个子类别的文化维度,并采用时代感知的提示词(era-aware prompts),实现对文本到图像(T2I)生成与I2I编辑的标准化诊断。该框架整合了自动指标、文化敏感的检索增强型视觉问答(retrieval-augmented VQA)以及母语专家的人工判断,从而揭示出当前模型在跨国家、跨时代和跨类别场景下的文化偏差,尤其指出I2I编辑虽在传统指标上表现稳定甚至改善,但实质上削弱了文化保真度,且常依赖表面特征而非语境一致性的修改。

链接: https://arxiv.org/abs/2510.20042
作者: Huichan Seo,Sieun Choi,Minki Hong,Yi Zhou,Junseo Kim,Lukman Ismaila,Naome Etori,Mehul Agarwal,Zhixuan Liu,Jihie Kim,Jean Oh
机构: Carnegie Mellon University (卡内基梅隆大学); Dongguk University (东国大学); Delft University of Technology (代尔夫特理工大学); Johns Hopkins University, School of Medicine (约翰霍普金斯大学医学院); University of Minnesota–Twin Cities (明尼苏达大学双城分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 8 figures. Submitted to the Second Conference of the International Association for Safe and Ethical Artificial Intelligence (IASEAI '26)

点击查看摘要

Abstract:Generative image models produce striking visuals yet often misrepresent culture. Prior work has examined cultural bias mainly in text-to-image (T2I) systems, leaving image-to-image (I2I) editors underexplored. We bridge this gap with a unified evaluation across six countries, an 8-category/36-subcategory schema, and era-aware prompts, auditing both T2I generation and I2I editing under a standardized protocol that yields comparable diagnostics. Using open models with fixed settings, we derive cross-country, cross-era, and cross-category evaluations. Our framework combines standard automatic metrics, a culture-aware retrieval-augmented VQA, and expert human judgments collected from native reviewers. To enable reproducibility, we release the complete image corpus, prompts, and configurations. Our study reveals three findings: (1) under country-agnostic prompts, models default to Global-North, modern-leaning depictions that flatten cross-country distinctions; (2) iterative I2I editing erodes cultural fidelity even when conventional metrics remain flat or improve; and (3) I2I models apply superficial cues (palette shifts, generic props) rather than era-consistent, context-aware changes, often retaining source identity for Global-South targets. These results highlight that culture-sensitive edits remain unreliable in current systems. By releasing standardized data, prompts, and human evaluation protocols, we provide a reproducible, culture-centered benchmark for diagnosing and tracking cultural bias in generative image models.
zh

[CV-80] BrainPuzzle: Hybrid Physics and Data-Driven Reconstruction for Transcranial Ultrasound Tomography

【速读】:该论文旨在解决经颅超声成像中因颅骨与脑组织间声速差异大、探头耦合困难以及信号衰减严重等问题导致的定量声速(SoS)重建精度不足的问题。传统基于物理模型的全波形反演(FWI)受限于颅骨引起的信号弱化、模式转换和相位畸变,且临床难以实现全孔径阵列;而纯数据驱动方法在低信噪比和稀疏孔径条件下无法准确建模骨骼中复杂的非线性、非局部波传播,导致重建结果虽具解剖合理性但存在定量偏差。其解决方案的关键在于提出了一种混合两阶段框架BrainPuzzle:第一阶段利用多角度采集的逆时迁移(time-reversal acoustics)生成结构细节保留良好的迁移片段,第二阶段采用基于图注意力单元(GAU)的Transformer超分辨率编码器-解码器融合这些片段,从而获得结构完整且定量准确的SoS图像;同时结合可移动低通道数探头的部分阵列采集策略提升可行性与耦合效率,通过混合算法补偿缺失孔径,显著提升了定量经颅超声成像的性能。

链接: https://arxiv.org/abs/2510.20029
作者: Shengyu Chen,Shihang Feng,Yi Luo,Xiaowei Jia,Youzuo Lin
机构: Los Alamos National Laboratory (洛斯阿拉莫斯国家实验室); University of Pittsburgh (匹兹堡大学); Seiswave Corp (Seiswave公司); University of North Carolina at Chapel Hill (北卡罗来纳大学教堂山分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages

点击查看摘要

Abstract:Ultrasound brain imaging remains challenging due to the large difference in sound speed between the skull and brain tissues and the difficulty of coupling large probes to the skull. This work aims to achieve quantitative transcranial ultrasound by reconstructing an accurate speed-of-sound (SoS) map of the brain. Traditional physics-based full-waveform inversion (FWI) is limited by weak signals caused by skull-induced attenuation, mode conversion, and phase aberration, as well as incomplete spatial coverage since full-aperture arrays are clinically impractical. In contrast, purely data-driven methods that learn directly from raw ultrasound data often fail to model the complex nonlinear and nonlocal wave propagation through bone, leading to anatomically plausible but quantitatively biased SoS maps under low signal-to-noise and sparse-aperture conditions. To address these issues, we propose BrainPuzzle, a hybrid two-stage framework that combines physical modeling with machine learning. In the first stage, reverse time migration (time-reversal acoustics) is applied to multi-angle acquisitions to produce migration fragments that preserve structural details even under low SNR. In the second stage, a transformer-based super-resolution encoder-decoder with a graph-based attention unit (GAU) fuses these fragments into a coherent and quantitatively accurate SoS image. A partial-array acquisition strategy using a movable low-count transducer set improves feasibility and coupling, while the hybrid algorithm compensates for the missing aperture. Experiments on two synthetic datasets show that BrainPuzzle achieves superior SoS reconstruction accuracy and image completeness, demonstrating its potential for advancing quantitative ultrasound brain imaging.
zh

[CV-81] Extreme Views: 3DGS Filter for Novel View Synthesis from Out-of-Distribution Camera Poses

【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)模型在相机视角显著超出训练数据分布时出现的视觉噪声问题,此类噪声源于模型对未见区域的密度、颜色和几何预测不确定性。解决方案的关键在于提出一种实时渲染感知过滤方法,该方法利用中间梯度计算得到的敏感性得分(sensitivity scores),专门针对由各向异性方向导致的不稳定性进行抑制,而非传统基于各向同性方差的方法。该过滤机制直接缓解生成式不确定性问题,使重建系统在自由导航至原训练视点范围外时仍能保持高视觉保真度,且无需额外后处理重训练或微调,可无缝集成至现有3DGS渲染管线中实现实时运行。

链接: https://arxiv.org/abs/2510.20027
作者: Damian Bowness,Charalambos Poullis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:When viewing a 3D Gaussian Splatting (3DGS) model from camera positions significantly outside the training data distribution, substantial visual noise commonly occurs. These artifacts result from the lack of training data in these extrapolated regions, leading to uncertain density, color, and geometry predictions from the model. To address this issue, we propose a novel real-time render-aware filtering method. Our approach leverages sensitivity scores derived from intermediate gradients, explicitly targeting instabilities caused by anisotropic orientations rather than isotropic variance. This filtering method directly addresses the core issue of generative uncertainty, allowing 3D reconstruction systems to maintain high visual fidelity even when users freely navigate outside the original training viewpoints. Experimental evaluation demonstrates that our method substantially improves visual quality, realism, and consistency compared to existing Neural Radiance Field (NeRF)-based approaches such as BayesRays. Critically, our filter seamlessly integrates into existing 3DGS rendering pipelines in real-time, unlike methods that require extensive post-hoc retraining or fine-tuning. Code and results at this https URL
zh

[CV-82] A Unified Detection Pipeline for Robust Object Detection in Fisheye-Based Traffic Surveillance ICCV2025

【速读】:该论文旨在解决鱼眼相机(fisheye camera)在交通监控中因强径向畸变和非均匀分辨率导致的标准目标检测器性能下降的问题,尤其是在图像边界区域物体外观严重退化的情况下。解决方案的关键在于设计了一个简单而有效的预处理与后处理流水线,以提升检测一致性,特别是在畸变严重的区域;同时,在鱼眼交通影像上训练多个先进的检测模型,并通过集成策略融合其输出结果,从而显著提高整体检测精度。

链接: https://arxiv.org/abs/2510.20016
作者: Neema Jakisa Owor,Joshua Kofi Asamoah,Tanner Wambui Muturi,Anneliese Jakisa Owor,Blessing Agyei Kyem,Andrews Danyo,Yaw Adu-Gyamfi,Armstrong Aboah
机构: University of Missouri–Columbia (密苏里大学哥伦比亚分校); North Dakota State University (北达科他州立大学); SMART Lab (SMART 实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The paper was accepted at ICCV 2025 and published in CVF database

点击查看摘要

Abstract:Fisheye cameras offer an efficient solution for wide-area traffic surveillance by capturing large fields of view from a single vantage point. However, the strong radial distortion and nonuniform resolution inherent in fisheye imagery introduce substantial challenges for standard object detectors, particularly near image boundaries where object appearance is severely degraded. In this work, we present a detection framework designed to operate robustly under these conditions. Our approach employs a simple yet effective pre- and post-processing pipeline that enhances detection consistency across the image, especially in regions affected by severe distortion. We train several state-of-the-art detection models on the fisheye traffic imagery and combine their outputs through an ensemble strategy to improve overall detection accuracy. Our method achieves an F1 score of 0.6366 on the 2025 AI City Challenge Track 4, placing 8th overall out of 62 teams. These results demonstrate the effectiveness of our framework in addressing issues inherent to fisheye imagery.
zh
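
摘要提到将多个检测模型的输出通过集成策略融合;下面给出基于贪心 NMS 的最简融合草图(具体集成算法为假设,仅作示意):

```python
import numpy as np

def iou(box, boxes):
    """box: (4,),boxes: (N, 4),格式均为 [x1, y1, x2, y2]。"""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area + areas - inter + 1e-9)

def ensemble_nms(det_lists, iou_thr=0.55):
    """det_lists: 每个检测器输出的 (boxes, scores);拼接后按分数贪心抑制。"""
    boxes = np.concatenate([d[0] for d in det_lists])
    scores = np.concatenate([d[1] for d in det_lists])
    order, keep = np.argsort(-scores), []
    while order.size:
        i, rest = order[0], order[1:]
        keep.append(i)
        order = rest[iou(boxes[i], boxes[rest]) < iou_thr]
    return boxes[keep], scores[keep]

d1 = (np.array([[0, 0, 10, 10.]]), np.array([0.9]))
d2 = (np.array([[1, 1, 11, 11.]]), np.array([0.8]))
print(ensemble_nms([d1, d2]))   # 重叠框被抑制,保留高分检测
```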

[CV-83] Improving Predictive Confidence in Medical Imaging via Online Label Smoothing ALT

【速读】:该论文旨在解决深度学习模型(尤其是卷积神经网络)在医学图像分类任务中产生的过度自信预测问题,这一问题可能削弱其在关键医疗场景中的可靠性。传统标签平滑(label smoothing)虽能缓解过自信现象,但忽略了类别间的语义关系,对所有非目标类一视同仁。本文提出的关键解决方案是在线标签平滑(Online Label Smoothing, OLS),其核心在于动态调整软标签,依据模型自身在训练过程中的预测模式进行自适应优化,从而更好地捕捉类别间差异并提升模型校准能力与特征表示质量。

链接: https://arxiv.org/abs/2510.20011
作者: Kushan Choudhury,Shubhrodeep Roy,Ankur Chanda,Shubhajit Biswas,Somenath Kuiry
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted and presented in International Conference on Advancing Science and Technologies in Health Science

点击查看摘要

Abstract:Deep learning models, especially convolutional neural networks, have achieved impressive results in medical image classification. However, these models often produce overconfident predictions, which can undermine their reliability in critical healthcare settings. While traditional label smoothing offers a simple way to reduce such overconfidence, it fails to consider relationships between classes by treating all non-target classes equally. In this study, we explore the use of Online Label Smoothing (OLS), a dynamic approach that adjusts soft labels throughout training based on the model’s own prediction patterns. We evaluate OLS on the large-scale RadImageNet dataset using three widely used architectures: ResNet-50, MobileNetV2, and VGG-19. Our results show that OLS consistently improves both Top-1 and Top-5 classification accuracy compared to standard training methods, including hard labels, conventional label smoothing, and teacher-free knowledge distillation. In addition to accuracy gains, OLS leads to more compact and well-separated feature embeddings, indicating improved representation learning. These findings suggest that OLS not only strengthens predictive performance but also enhances calibration, making it a practical and effective solution for developing trustworthy AI systems in the medical imaging domain.
zh
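
在线标签平滑(OLS)的核心是:每个 epoch 用模型对“被正确分类样本”的平均预测分布,替换各类别的软标签。下面是一个示意性的 PyTorch 草图(非官方实现,累积与更新细节为常见做法):

```python
import torch
import torch.nn.functional as F

class OnlineLabelSmoothing:
    def __init__(self, num_classes, eps=0.1):
        self.C = num_classes
        # 初始软标签取普通标签平滑分布
        self.soft = torch.full((num_classes, num_classes), eps / num_classes)
        self.soft += (1 - eps) * torch.eye(num_classes)
        self.accum = torch.zeros_like(self.soft)
        self.count = torch.zeros(num_classes)

    def loss(self, logits, target):
        logp = F.log_softmax(logits, dim=1)
        with torch.no_grad():             # 累积正确预测样本的分布,供下个 epoch 使用
            prob = logp.exp()
            correct = prob.argmax(1) == target
            for c in target[correct].unique():
                mask = (target == c) & correct
                self.accum[c] += prob[mask].sum(0).cpu()
                self.count[c] += int(mask.sum())
        soft = self.soft.to(logits.device)
        return -(soft[target] * logp).sum(1).mean()

    def next_epoch(self):                 # epoch 结束时更新各类别软标签
        upd = self.count > 0
        self.soft[upd] = self.accum[upd] / self.count[upd].unsqueeze(1)
        self.accum.zero_(); self.count.zero_()
```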

[CV-84] Automating Iconclass: LLM s and RAG for Large-Scale Classification of Religious Woodcuts

【速读】:该论文旨在解决早期现代宗教图像(early modern religious images)在大规模视觉档案中难以精准分类的问题。传统基于图像特征或关键词的搜索方法在处理复杂图文混合内容时准确率有限,尤其在涉及文化语境和象征意义的图像识别上表现不足。解决方案的关键在于结合大型语言模型(Large Language Models, LLMs)与向量数据库(vector databases),并引入检索增强生成(Retrieval-Augmented Generation, RAG)机制,通过全页页面上下文(full-page context)提取融合视觉与文本信息的描述,并利用混合向量搜索匹配Iconclass分类代码,从而显著提升分类精度(五级和四级分类精度分别达87%和92%)。此方法实现了对早期现代视觉文献更深层次的理解与结构化组织,为艺术史与数字人文研究提供了高效、可扩展的技术路径。

链接: https://arxiv.org/abs/2510.19986
作者: Drew B. Thomas
机构: 未知
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注: 29 pages, 7 figures. First presented at the “Digital Humanities and Artificial Intelligence” conference at the University of Reading on 17 June 2024

点击查看摘要

Abstract:This paper presents a novel methodology for classifying early modern religious images by using Large Language Models (LLMs) and vector databases in combination with Retrieval-Augmented Generation (RAG). The approach leverages the full-page context of book illustrations from the Holy Roman Empire, allowing the LLM to generate detailed descriptions that incorporate both visual and textual elements. These descriptions are then matched to relevant Iconclass codes through a hybrid vector search. This method achieves 87% and 92% precision at five and four levels of classification, significantly outperforming traditional image and keyword-based searches. By employing full-page descriptions and RAG, the system enhances classification accuracy, offering a powerful tool for large-scale analysis of early modern visual archives. This interdisciplinary approach demonstrates the growing potential of LLMs and RAG in advancing research within art history and digital humanities.
zh
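
文中“混合向量搜索”可理解为稠密向量相似度与关键词匹配得分的加权组合。下面给出一个极简草图(embed 函数、权重与打分方式均为假设,仅示意将生成的图像描述匹配到 Iconclass 代码的流程):

```python
import numpy as np

def hybrid_search(description, catalog, embed, alpha=0.7, top_k=5):
    """catalog: [(iconclass_code, 条目描述), ...];embed(text) -> np.ndarray。
    得分 = alpha * 余弦相似度 + (1 - alpha) * 词重叠率。"""
    q = embed(description)
    q = q / (np.linalg.norm(q) + 1e-9)
    q_words = set(description.lower().split())
    ranked = []
    for code, text in catalog:
        v = embed(text)
        dense = float(q @ (v / (np.linalg.norm(v) + 1e-9)))
        kw = len(q_words & set(text.lower().split())) / (len(q_words) + 1e-9)
        ranked.append((alpha * dense + (1 - alpha) * kw, code))
    return sorted(ranked, reverse=True)[:top_k]
```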

[CV-85] FutrTrack: A Camera-LiDAR Fusion Transformer for 3D Multiple Object Tracking

【速读】:该论文旨在解决多模态传感器(相机与激光雷达)在3D多目标跟踪(3D Multi-Object Tracking, 3D MOT)任务中融合不足、轨迹不连续及身份切换频繁的问题。解决方案的关键在于提出FutrTrack框架,其核心创新包括:1)基于查询的两阶段Transformer精修与跟踪流水线,利用多模态鸟瞰图(BEV)特征融合实现更鲁棒的跨帧身份关联;2)引入时间窗口内的时序平滑器(temporal smoother),减少轨迹抖动并提升空间一致性;3)无需显式运动模型即可通过几何与语义线索实现遮挡和视角变化下的稳定重识别。该方法在nuScenes和KITTI数据集上验证了性能优势,尤其在降低身份切换的同时保持高精度,为轻量级Transformer跟踪器提供了高效且可扩展的解决方案。

链接: https://arxiv.org/abs/2510.19981
作者: Martha Teiko Teye,Ori Maoz,Matthias Rottmann
机构: University of Wuppertal (伍珀塔尔大学); Aptiv (艾普提夫); Institute of Computer Science, Osnabruck University (奥斯纳布吕克大学计算机科学研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We propose FutrTrack, a modular camera-LiDAR multi-object tracking framework that builds on existing 3D detectors by introducing a transformer-based smoother and a fusion-driven tracker. Inspired by query-based tracking frameworks, FutrTrack employs a multimodal two-stage transformer refinement and tracking pipeline. Our fusion tracker integrates bounding boxes with multimodal bird’s-eye-view (BEV) fusion features from multiple cameras and LiDAR without the need for an explicit motion model. The tracker assigns and propagates identities across frames, leveraging both geometric and semantic cues for robust re-identification under occlusion and viewpoint changes. Prior to tracking, we refine sequences of bounding boxes with a temporal smoother over a moving window to refine trajectories, reduce jitter, and improve spatial consistency. Evaluated on nuScenes and KITTI, FutrTrack demonstrates that query-based transformer tracking methods benefit significantly from multimodal sensor features compared with previous single-sensor approaches. With an aMOTA of 74.7 on the nuScenes test set, FutrTrack achieves strong performance on 3D MOT benchmarks, reducing identity switches while maintaining competitive accuracy. Our approach provides an efficient framework for improving transformer-based trackers to compete with other neural-network-based methods even with limited data and without pretraining.
zh

[CV-86] ransformed Multi-view 3D Shape Features with Contrastive Learning

【速读】:该论文旨在解决3D形状特征表示学习中的挑战,特别是传统卷积神经网络(CNNs)在捕捉关键形状关系方面的局限性以及对大量标注数据的依赖问题。解决方案的关键在于采用视觉Transformer(Vision Transformers, ViTs)架构结合现代对比学习(contrastive learning)目标,通过ViTs对全局形状语义的建模能力与对比学习对局部判别特征的优化能力相结合,显著提升了多视角3D分析任务的表现,例如在ModelNet10数据集上达到了约90.6%的准确率,从而实现了监督对比损失与3D形状理解流程的统一。

链接: https://arxiv.org/abs/2510.19955
作者: Márcus Vinícius Lobo Costa,Sherlon Almeida da Silva,Bárbara Caroline Benato,Leo Sampaio Ferraz Ribeiro,Moacir Antonelli Ponti
机构: Instituto de Ciências Matemáticas e de Computação (ICMC), Universidade de São Paulo (USP)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper addresses the challenges in representation learning of 3D shape features by investigating state-of-the-art backbones paired with both contrastive supervised and self-supervised learning objectives. Computer vision methods struggle with recognizing 3D objects from 2D images, often requiring extensive labeled data and relying on Convolutional Neural Networks (CNNs) that may overlook crucial shape relationships. Our work demonstrates that Vision Transformer (ViT)-based architectures, when paired with modern contrastive objectives, achieve promising results in multi-view 3D analysis on our downstream tasks, unifying contrastive and 3D shape understanding pipelines. For example, supervised contrastive losses reached about 90.6% accuracy on ModelNet10. The use of ViTs and contrastive learning, leveraging ViTs' ability to understand overall shapes and contrastive learning's effectiveness, overcomes the need for extensive labeled data and the limitations of CNNs in capturing crucial shape relationships. The success stems from capturing global shape semantics via ViTs and refining local discriminative features through contrastive optimization. Importantly, our approach is empirical, as it is grounded on extensive experimental evaluation to validate the effectiveness of combining ViTs with contrastive objectives for 3D representation learning.
zh
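
摘要中的监督对比目标通常指 SupCon 损失:在归一化特征空间中拉近同类、推远异类。下面是一个与主干无关的最简 PyTorch 实现草图(温度等超参数为假设):

```python
import torch
import torch.nn.functional as F

def supcon_loss(feats, labels, tau=0.1):
    """feats: (B, D) 特征;labels: (B,) 类别标签。"""
    z = F.normalize(feats, dim=1)
    logits = z @ z.T / tau
    logits = logits - logits.max(dim=1, keepdim=True).values.detach()  # 数值稳定
    B = z.size(0)
    self_mask = torch.eye(B, dtype=torch.bool, device=z.device)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask    # 同类且非自身
    denom = torch.exp(logits).masked_fill(self_mask, 0).sum(1, keepdim=True)
    log_prob = logits - torch.log(denom + 1e-9)
    loss = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    return loss.mean()

feats = torch.randn(8, 128)
labels = torch.randint(0, 3, (8,))
print(supcon_loss(feats, labels))
```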

[CV-87] FINDER: Feature Inference on Noisy Datasets using Eigenspace Residuals

【速读】:该论文旨在解决低信噪比(low signal to noise ratios)、小样本量及数据采集错误等噪声数据集下的分类问题,这类问题在理论和实际应用中均具有重要挑战性。其解决方案的核心在于提出FINDER框架,该框架通过将经验数据视为潜在随机场的实现(无需假设其具体分布),并利用Kosambi-Karhunen-Loève展开(KLE)将其映射到适当的希尔伯特空间,从而构造出“随机特征”;这些特征被分解为可计算的不可约分量,并通过特征值分解实现对噪声数据的有效分类——不同类别的数据在由相关算子谱结构所定义的不同区域中分布,从而实现高精度判别。

链接: https://arxiv.org/abs/2510.19917
作者: Trajan Murphy,Akshunna S. Dogra,Hanfeng Gu,Caleb Meredith,Mark Kon,Julio Enrique Castrillion-Candas
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注: 30 pages, 11 figures, 8 tables. Code available at this https URL

点击查看摘要

Abstract:"Noisy" datasets (regimes with low signal to noise ratios, small sample sizes, faulty data collection, etc) remain a key research frontier for classification methods with both theoretical and practical implications. We introduce FINDER, a rigorous framework for analyzing generic classification problems, with tailored algorithms for noisy datasets. FINDER incorporates fundamental stochastic analysis ideas into the feature learning and inference stages to optimally account for the randomness inherent to all empirical datasets. We construct "stochastic features" by first viewing empirical datasets as realizations from an underlying random field (without assumptions on its exact distribution) and then mapping them to appropriate Hilbert spaces. The Kosambi-Karhunen-Loève expansion (KLE) breaks these stochastic features into computable irreducible components, which allow classification over noisy datasets via an eigen-decomposition: data from different classes resides in distinct regions, identified by analyzing the spectrum of the associated operators. We validate FINDER on several challenging, data-deficient scientific domains, producing state of the art breakthroughs in: (i) Alzheimer's Disease stage classification, (ii) Remote sensing detection of deforestation. We end with a discussion on when FINDER is expected to outperform existing methods, its failure modes, and other limitations.
zh

[CV-88] Fourier-Based GAN Fingerprint Detection using ResNet50

【速读】:该论文旨在解决由生成对抗网络(Generative Adversarial Networks, GANs)生成的高保真图像对图像取证和工业系统内容真实性验证带来的挑战。其解决方案的关键在于结合频域分析与深度学习:首先利用二维离散傅里叶变换(2D Discrete Fourier Transform, 2D DFT)将图像转换至频域,使GAN生成图像中隐含的周期性伪影得以显现;随后使用ResNet50神经网络在频域特征图上进行训练,从而有效区分真实图像与StyleGAN生成图像。实验表明,该方法在准确率(92.8%)和AUC(0.95)上显著优于直接在空间域训练的模型,揭示了GAN生成图像具有独特的频域“指纹”特性,为工业级数字取证提供了可行的技术路径。

链接: https://arxiv.org/abs/2510.19840
作者: Sai Teja Erukude,Viswa Chaitanya Marella,Suhasnadh Reddy Veluru
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages. Published in IEEE

点击查看摘要

Abstract:The rapid rise of photorealistic images produced by Generative Adversarial Networks (GANs) poses a serious challenge for image forensics and industrial systems requiring reliable content authenticity. This paper uses frequency-domain analysis combined with deep learning to solve the problem of distinguishing StyleGAN-generated images from real ones. Specifically, a two-dimensional Discrete Fourier Transform (2D DFT) was applied to transform images into the Fourier domain, where subtle periodic artifacts become detectable. A ResNet50 neural network is trained on these transformed images to differentiate between real and synthetic ones. The experiments demonstrate that the frequency-domain model achieves an accuracy of 92.8 percent and an AUC of 0.95, significantly outperforming the equivalent model trained on raw spatial-domain images. These results indicate that the GAN-generated images have unique frequency-domain signatures or "fingerprints". The proposed method highlights the industrial potential of combining signal processing techniques and deep learning to enhance digital forensics and strengthen the trustworthiness of industrial AI systems.
zh
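
该方法的流程可概括为“对数幅度谱预处理 + ResNet50 二分类”。下面给出一个示意性草图(归一化与单通道复制为 3 通道的处理方式为假设):

```python
import numpy as np
import torch
import torchvision

def fft_magnitude(img):
    """img: (H, W) 灰度图;返回中心化的对数幅度谱,使周期性伪影可见。"""
    f = np.fft.fftshift(np.fft.fft2(img))
    mag = np.log1p(np.abs(f))
    return (mag - mag.mean()) / (mag.std() + 1e-8)

model = torchvision.models.resnet50(num_classes=2)   # 真 / 假 二分类
img = np.random.rand(224, 224)                       # 占位输入
spec = torch.tensor(fft_magnitude(img), dtype=torch.float32)
inp = spec.expand(3, -1, -1).unsqueeze(0)            # 复制到 3 通道: (1, 3, 224, 224)
logits = model(inp)
```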

[CV-89] SLYKLatent: A Learning Framework for Gaze Estimation Using Deep Facial Feature Learning

【速读】:该论文旨在解决 gaze estimation(注视估计)中因数据集外观不稳定性带来的挑战,具体包括由随机不确定性(aleatoric uncertainties)、协变量偏移(covariant shifts)以及测试域泛化(test domain generalization)引起的问题。解决方案的关键在于提出 SLYKLatent 方法:首先利用自监督学习(Self-Supervised Learning)在面部表情数据集上进行初始训练,随后通过基于 patch 的三分支网络(patch-based tri-branch network)进行精细化优化,并引入一种逆解释方差加权训练损失函数(inverse explained variance-weighted training loss function),以增强模型对复杂场景的鲁棒性与适应性。该方法在多个基准数据集上显著优于现有技术,验证了其有效性。

链接: https://arxiv.org/abs/2402.01555
作者: Samuel Adebayo,Joost C. Dessing,Seán McLoone
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:In this research, we present SLYKLatent, a novel approach for enhancing gaze estimation by addressing appearance instability challenges in datasets due to aleatoric uncertainties, covariant shifts, and test domain generalization. SLYKLatent utilizes Self-Supervised Learning for initial training with facial expression datasets, followed by refinement with a patch-based tri-branch network and an inverse explained variance-weighted training loss function. Our evaluation on benchmark datasets achieves a 10.9% improvement on Gaze360, exceeds top MPIIFaceGaze results by 3.8%, and leads on a subset of ETH-XGaze by 11.6%, surpassing existing methods by significant margins. Adaptability tests on RAF-DB and Affectnet show 86.4% and 60.9% accuracies, respectively. Ablation studies confirm the effectiveness of SLYKLatent's novel components.
zh

[CV-90] GUSL-Dehaze: A Green U-Shaped Learning Approach to Image Dehazing

【速读】:该论文旨在解决图像去雾(image dehazing)任务中现有深度学习方法参数量大、计算成本高、难以部署于资源受限设备的问题。其解决方案的关键在于提出了一种绿色U型学习(GUSL-Dehaze)框架,该框架不依赖深度神经网络,而是结合物理模型与绿色学习(Green Learning, GL)策略:首先利用改进的暗通道先验(Dark Channel Prior, DCP)进行初始去雾,随后通过U型架构实现无监督表示学习,并引入相关特征测试(Relevant Feature Test, RFT)和最小二乘归一化变换(Least-Squares Normal Transform, LNT)以保持模型轻量化;最终采用可解释的监督学习策略输出去雾结果,从而在显著减少参数量的同时达到与先进深度学习方法相当的性能。

链接: https://arxiv.org/abs/2510.20266
作者: Mahtab Movaheddrad,Laurence Palmer,C.-C. Jay Kuo
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Image dehazing is a restoration task that aims to recover a clear image from a single hazy input. Traditional approaches rely on statistical priors and the physics-based atmospheric scattering model to reconstruct the haze-free image. While recent state-of-the-art methods are predominantly based on deep learning architectures, these models often involve high computational costs and large parameter sizes, making them unsuitable for resource-constrained devices. In this work, we propose GUSL-Dehaze, a Green U-Shaped Learning approach to image dehazing. Our method integrates a physics-based model with a green learning (GL) framework, offering a lightweight, transparent alternative to conventional deep learning techniques. Unlike neural network-based solutions, GUSL-Dehaze completely avoids deep learning. Instead, we begin with an initial dehazing step using a modified Dark Channel Prior (DCP), which is followed by a green learning pipeline implemented through a U-shaped architecture. This architecture employs unsupervised representation learning for effective feature extraction, together with feature-engineering techniques such as the Relevant Feature Test (RFT) and the Least-Squares Normal Transform (LNT) to maintain a compact model size. Finally, the dehazed image is obtained via a transparent supervised learning strategy. GUSL-Dehaze significantly reduces parameter count while ensuring mathematical interpretability and achieving performance on par with state-of-the-art deep learning models.
zh
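
作为参照,下面给出经典暗通道先验(DCP)去雾的最简 numpy 草图;论文在类似的初步去雾之后接入绿色学习的 U 型流水线(此处仅示意 DCP 一步,参数为常用取值):

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(img, patch=15):
    """img: (H, W, 3),取值 [0, 1];逐像素取通道最小值后做局部最小值滤波。"""
    return minimum_filter(img.min(axis=2), size=patch)

def dcp_dehaze(img, omega=0.95, t0=0.1, patch=15):
    dark = dark_channel(img, patch)
    n = max(1, int(dark.size * 0.001))               # 用暗通道最亮的 0.1% 像素估计大气光 A
    idx = np.argsort(dark.ravel())[-n:]
    A = img.reshape(-1, 3)[idx].max(axis=0)
    t = 1 - omega * dark_channel(img / A, patch)     # 透射率估计
    t = np.clip(t, t0, 1)[..., None]
    return np.clip((img - A) / t + A, 0, 1)          # 反解大气散射模型
```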

[CV-91] AI Pose Analysis and Kinematic Profiling of Range-of-Motion Variations in Resistance Training

【速读】:该论文旨在解决阻力训练中运动学参数(如关节角度变化范围、动作节奏及向心/离心阶段时长)难以精确量化的问题,尤其针对长度延长部分训练(pROM)与全范围运动训练(fROM)之间的差异。其解决方案的关键在于构建了一套基于人工智能(AI)的姿态估计流程,通过处理视频数据提取帧级关节角度轨迹,并结合随机效应元分析模型以控制个体内部和练习间的变异,从而实现对训练执行动态的精细化测量与比较。该方法不仅识别出pROM训练在运动范围和执行时间上的系统性差异,还提出新的相对指标“%ROM”来统一描述不同动作中的长度延长程度,凸显了AI驱动方法在提升阻力训练研究精度和个性化处方方面的潜力。

链接: https://arxiv.org/abs/2510.20012
作者: Adam Diamant
机构: 未知
类目: Applications (stat.AP); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This study develops an AI-based pose estimation pipeline to enable precise quantification of movement kinematics in resistance training. Using video data from Wolf et al. (2025), which compared lengthened partial (pROM) and full range-of-motion (fROM) training across eight upper-body exercises in 26 participants, 280 recordings were processed to extract frame-level joint-angle trajectories. After filtering and smoothing, per-set metrics were derived, including range of motion (ROM), tempo, and concentric/eccentric phase durations. A random-effects meta-analytic model was applied to account for within-participant and between-exercise variability. Results show that pROM repetitions were performed with a smaller ROM and shorter overall durations, particularly during the eccentric phase of movement. Variance analyses revealed that participant-level differences, rather than exercise-specific factors, were the primary driver of variation, although there is substantial evidence of heterogeneous treatment effects. We then introduce a novel metric, %ROM, which is the proportion of full ROM achieved during pROM, and demonstrate that this definition of lengthened partials remains relatively consistent across exercises. Overall, these findings suggest that lengthened partials differ from full ROM training not only in ROM, but also in execution dynamics and consistency, highlighting the potential of AI-based methods for advancing research and improving resistance training prescription.
zh
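
管线中的逐帧关节角度与 ROM 等指标可由姿态估计输出的关键点直接计算。下面是一个最简示意(指标定义为假设,非原文实现):

```python
import numpy as np

def joint_angle(a, b, c):
    """由三个关键点(如肩-肘-腕)计算 b 处的关节角,单位为度。"""
    v1, v2 = np.asarray(a) - np.asarray(b), np.asarray(c) - np.asarray(b)
    cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return np.degrees(np.arccos(np.clip(cos, -1, 1)))

def set_metrics(angles, fps=30.0):
    """对一组动作的逐帧角度轨迹计算 ROM 与时长。"""
    angles = np.asarray(angles, dtype=float)
    return {"rom_deg": float(angles.max() - angles.min()),
            "duration_s": len(angles) / fps}

angles = [joint_angle((0, 0), (1, 0), (1 + np.cos(t), np.sin(t)))
          for t in np.linspace(0.3, 2.5, 60)]
print(set_metrics(angles))
```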

人工智能

[AI-0] VAMOS: A Hierarchical Vision-Language-Action Model for Capability-Modulated and Steerable Navigation

【速读】:该论文旨在解决机器人导航中政策泛化能力不足的问题,即如何使学习到的导航策略在不同环境和物理形态(如轮式与腿式机器人)下仍能有效运行,同时满足特定机器人本体的物理约束与能力。解决方案的关键在于提出一种分层视觉语言动作系统(VAMOS),通过解耦语义规划与本体感知:高层通用规划器从多样化的开放世界数据中学习路径规划,而专用的可达性模型(affordance model)则在安全、低成本的仿真环境中学习机器人自身的物理限制与能力;二者通过一个精心设计的接口协同工作——高层规划器直接在图像空间提出候选路径,由可达性模型评估并重新排序,从而实现跨本体导航与自然语言可控性,实验证明该架构显著提升了导航成功率(达3倍)并增强了单机器人可靠性。

链接: https://arxiv.org/abs/2510.20818
作者: Mateo Guaman Castro,Sidharth Rajagopal,Daniel Gorbatov,Matt Schmittle,Rohan Baijal,Octi Zhang,Rosario Scalise,Sidharth Talia,Emma Romig,Celso de Melo,Byron Boots,Abhishek Gupta
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:A fundamental challenge in robot navigation lies in learning policies that generalize across diverse environments while conforming to the unique physical constraints and capabilities of a specific embodiment (e.g., quadrupeds can walk up stairs, but rovers cannot). We propose VAMOS, a hierarchical VLA that decouples semantic planning from embodiment grounding: a generalist planner learns from diverse, open-world data, while a specialist affordance model learns the robot’s physical constraints and capabilities in safe, low-cost simulation. We enabled this separation by carefully designing an interface that lets a high-level planner propose candidate paths directly in image space that the affordance model then evaluates and re-ranks. Our real-world experiments show that VAMOS achieves higher success rates in both indoor and complex outdoor navigation than state-of-the-art model-based and end-to-end learning methods. We also show that our hierarchical design enables cross-embodied navigation across legged and wheeled robots and is easily steerable using natural language. Real-world ablations confirm that the specialist model is key to embodiment grounding, enabling a single high-level planner to be deployed across physically distinct wheeled and legged robots. Finally, this model significantly enhances single-robot reliability, achieving 3X higher success rates by rejecting physically infeasible plans. Website: this https URL
zh

[AI-1] he Reality Gap in Robotics: Challenges Solutions and Best Practices

【速读】:该论文旨在解决机器人领域中仿真到现实(sim-to-real)迁移过程中的“现实差距”(reality gap)问题,即仿真环境与真实物理世界之间的差异导致训练好的模型在实际部署时性能下降甚至失效。其解决方案的关键在于系统性地整合和分析多种前沿技术,包括域随机化(domain randomization)、真实到仿真迁移(real-to-sim transfer)、状态与动作抽象(state and action abstractions),以及仿真与真实协同训练(sim-real co-training),从而有效缩小现实差距并提升模型的泛化能力与鲁棒性。

链接: https://arxiv.org/abs/2510.20808
作者: Elie Aljalbout,Jiaxu Xing,Angel Romero,Iretiayo Akinola,Caelan Reed Garrett,Eric Heiden,Abhishek Gupta,Tucker Hermans,Yashraj Narang,Dieter Fox,Davide Scaramuzza,Fabio Ramos
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Accepted for Publication as part of the Annual Review of Control, Robotics, and Autonomous Systems 2026

点击查看摘要

Abstract:Machine learning has facilitated significant advancements across various robotics domains, including navigation, locomotion, and manipulation. Many such achievements have been driven by the extensive use of simulation as a critical tool for training and testing robotic systems prior to their deployment in real-world environments. However, simulations consist of abstractions and approximations that inevitably introduce discrepancies between simulated and real environments, known as the reality gap. These discrepancies significantly hinder the successful transfer of systems from simulation to the real world. Closing this gap remains one of the most pressing challenges in robotics. Recent advances in sim-to-real transfer have demonstrated promising results across various platforms, including locomotion, navigation, and manipulation. By leveraging techniques such as domain randomization, real-to-sim transfer, state and action abstractions, and sim-real co-training, many works have overcome the reality gap. However, challenges persist, and a deeper understanding of the reality gap’s root causes and solutions is necessary. In this survey, we present a comprehensive overview of the sim-to-real landscape, highlighting the causes, solutions, and evaluation metrics for the reality gap and sim-to-real transfer.
zh

[AI-2] A Coherence-Based Measure of AGI

【速读】:该论文旨在解决当前对人工通用智能(Artificial General Intelligence, AGI)的衡量标准过于依赖算术平均值所导致的片面性问题,即这种定义隐含假设“补偿性”(compensability)——某一领域的能力优异可弥补其他领域的不足,从而可能高估系统的整体智能水平。为更准确反映真正的通用智能应具备的“一致性充分性”(coherent sufficiency),即各认知领域间均衡胜任的能力,作者提出了一种基于广义均值积分的新型度量方法:通过在连续补偿指数范围内积分广义均值,形成面积(area under the curve, AUC),该指标能够量化不同补偿假设下的鲁棒性,并惩罚能力失衡,同时捕捉跨域依赖关系。此方案的关键在于将传统静态平均扩展为动态、可调的连续函数框架,从而提供一个更严格、可解释且符合AGI本质特征的评估基础。

链接: https://arxiv.org/abs/2510.20784
作者: Fares Fourati
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 1 figure, 12 tables

点击查看摘要

Abstract:Recent work by Hendrycks et al. (2025) formalized Artificial General Intelligence (AGI) as the arithmetic mean of proficiencies across cognitive domains derived from the Cattell–Horn–Carroll (CHC) model of human cognition. While elegant, this definition assumes compensability – that exceptional ability in some domains can offset failure in others. True general intelligence, however, should reflect coherent sufficiency: balanced competence across all essential domains. We propose a coherence-aware measure of AGI based on the integral of generalized means over a continuum of compensability exponents. This formulation spans arithmetic, geometric, and harmonic regimes, and the resulting area under the curve (AUC) quantifies robustness under varying compensability assumptions. Unlike the arithmetic mean, which rewards specialization, the AUC penalizes imbalance and captures inter-domain dependency. Applied to published CHC-based domain scores for GPT-4 and GPT-5, the coherence-adjusted AUC reveals that both systems remain far from general competence despite high arithmetic scores (e.g., GPT-5 at about 24%).
zh
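
摘要中的度量可以理解为:令补偿指数 p 在一段区间上变化(p=1 为算术平均、p→0 为几何平均、p=-1 为调和平均),对广义均值积分得到 AUC。下面是一个 numpy 示意实现(积分区间与归一化方式为假设,各领域分数需为正):

```python
import numpy as np

def generalized_mean(x, p):
    x = np.asarray(x, dtype=float)
    if abs(p) < 1e-9:                        # p -> 0 的极限:几何平均
        return float(np.exp(np.log(x).mean()))
    return float(np.mean(x ** p) ** (1.0 / p))

def coherence_auc(scores, p_min=-1.0, p_max=1.0, n=201):
    ps = np.linspace(p_min, p_max, n)
    vals = np.array([generalized_mean(scores, p) for p in ps])
    auc = np.sum((vals[1:] + vals[:-1]) * np.diff(ps) / 2)   # 梯形法积分
    return auc / (p_max - p_min)

balanced = [0.5] * 10
spiky = [0.95] * 5 + [0.05] * 5              # 算术平均相同,但严重失衡
print(coherence_auc(balanced), coherence_auc(spiky))  # 失衡配置的 AUC 更低
```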

[AI-3] FieldGen: From Teleoperated Pre-Manipulation Trajectories to Field-Guided Data Generation

【速读】:该论文旨在解决机器人操作策略训练中数据集规模、多样性和质量难以兼顾的问题。现有方法中,仿真虽具可扩展性但存在模拟到现实的差距(sim-to-real gap),而遥操作能获取高质量示范但多样性有限且人力成本高。解决方案的关键在于提出FieldGen框架,其通过将操作分解为预操作阶段(允许轨迹多样性)和精细操作阶段(需专家精度),利用人类示范捕捉关键接触与位姿信息后,由吸引力场(attraction field)自动生成收敛至成功配置的多样化轨迹,从而实现大规模、高多样性与高质量数据的协同生成;此外,FieldGen-Reward进一步为生成数据添加奖励标注以提升策略学习效果。

链接: https://arxiv.org/abs/2510.20774
作者: Wenhao Wang,Kehe Ye,Xinyu Zhou,Tianxing Chen,Cao Min,Qiaoming Zhu,Xiaokang Yang,Yongjian Shen,Yang Yang,Maoqing Yao,Yao Mu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Webpage: this https URL

点击查看摘要

Abstract:Large-scale and diverse datasets are vital for training robust robotic manipulation policies, yet existing data collection methods struggle to balance scale, diversity, and quality. Simulation offers scalability but suffers from sim-to-real gaps, while teleoperation yields high-quality demonstrations with limited diversity and high labor cost. We introduce FieldGen, a field-guided data generation framework that enables scalable, diverse, and high-quality real-world data collection with minimal human supervision. FieldGen decomposes manipulation into two stages: a pre-manipulation phase, allowing trajectory diversity, and a fine manipulation phase requiring expert precision. Human demonstrations capture key contact and pose information, after which an attraction field automatically generates diverse trajectories converging to successful configurations. This decoupled design combines scalable trajectory diversity with precise supervision. Moreover, FieldGen-Reward augments generated data with reward annotations to further enhance policy learning. Experiments demonstrate that policies trained with FieldGen achieve higher success rates and improved stability compared to teleoperation-based baselines, while significantly reducing human effort in long-term real-world data collection. Webpage is available at this https URL.
zh

[AI-4] RAG Rank: Using PageRank to Counter Poisoning in CTI LLM Pipelines

【速读】:该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)架构在网络安全威胁情报(Cyber Threat Intelligence, CTI)系统中面临的中毒攻击(poisoning attacks)问题。由于CTI信息常涉及新兴攻击且威胁行为者可模仿合法格式与术语,现有防御机制往往失效。解决方案的关键在于:通过应用源可信度算法(source credibility algorithms)对文档语料库进行预处理,以提升RAG系统的鲁棒性;文中以PageRank为例,证明该方法能有效降低恶意文档的权威得分并提升可信内容的权重,在MS MARCO标准数据集上实现定量验证,并在CTI文档和情报流中展示概念验证性能。

链接: https://arxiv.org/abs/2510.20768
作者: Austin Jia,Avaneesh Ramesh,Zain Shamsi,Daniel Zhang,Alex Liu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has emerged as the dominant architectural pattern to operationalize Large Language Model (LLM) usage in Cyber Threat Intelligence (CTI) systems. However, this design is susceptible to poisoning attacks, and previously proposed defenses can fail for CTI contexts as cyber threat information is often completely new for emerging attacks, and sophisticated threat actors can mimic legitimate formats, terminology, and stylistic conventions. To address this issue, we propose that the robustness of modern RAG defenses can be accelerated by applying source credibility algorithms on corpora, using PageRank as an example. In our experiments, we demonstrate quantitatively that our algorithm applies a lower authority score to malicious documents while promoting trusted content, using the standardized MS MARCO dataset. We also demonstrate proof-of-concept performance of our algorithm on CTI documents and feeds.
zh
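
论文以 PageRank 作为来源可信度算法的示例;下面给出标准幂迭代 PageRank 的最简草图(如何由 CTI 语料构建文档间链接图是另一步,此处假设已作为输入给出):

```python
import numpy as np

def pagerank(adj, d=0.85, tol=1e-8, max_iter=100):
    """adj[i, j] = 1 表示文档 i 链接/引用文档 j;返回每个文档的权威得分。"""
    n = adj.shape[0]
    out = adj.sum(axis=1, keepdims=True)
    M = np.where(out > 0, adj / np.maximum(out, 1), 1.0 / n)  # 悬挂节点均匀分配
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = (1 - d) / n + d * (M.T @ r)
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r

adj = np.array([[0, 1, 1], [1, 0, 0], [0, 0, 0]], dtype=float)
print(pagerank(adj))
```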

[AI-5] hought Communication in Multiagent Collaboration NEURIPS2025

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的多智能体系统中,仅依赖自然语言进行交互所导致的信息损失、歧义性和间接性问题,从而限制了集体智能(collective intelligence)潜力的问题。其核心挑战在于如何在不依赖显式语言表达的前提下,实现智能体之间更直接、高效且结构化的认知信息交换。解决方案的关键在于提出“思想通信”(thought communication)的新范式,通过将智能体状态建模为潜在变量(latent variable)的函数,理论证明在无辅助信息的非参数设定下可识别任意两智能体之间的共享与私有思想,并恢复全局思想共享结构。基于此理论框架,作者开发了一套方法,在通信前提取各智能体的潜在思想并分配相关思想及其共享模式,使思想通信天然适用于多种模态数据,实验验证了其在合成与真实世界基准上的有效性,展示了超越传统语言交互的协作优势。

链接: https://arxiv.org/abs/2510.20733
作者: Yujia Zheng,Zhuokai Zhao,Zijian Li,Yaqi Xie,Mingze Gao,Lizhu Zhang,Kun Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: NeurIPS 2025 Spotlight

点击查看摘要

Abstract:Natural language has long enabled human cooperation, but its lossy, ambiguous, and indirect nature limits the potential of collective intelligence. While machines are not subject to these constraints, most LLM-based multi-agent systems still rely solely on natural language, exchanging tokens or their embeddings. To go beyond language, we introduce a new paradigm, thought communication, which enables agents to interact directly mind-to-mind, akin to telepathy. To uncover these latent thoughts in a principled way, we formalize the process as a general latent variable model, where agent states are generated by an unknown function of underlying thoughts. We prove that, in a nonparametric setting without auxiliary information, both shared and private latent thoughts between any pair of agents can be identified. Moreover, the global structure of thought sharing, including which agents share which thoughts and how these relationships are structured, can also be recovered with theoretical guarantees. Guided by the established theory, we develop a framework that extracts latent thoughts from all agents prior to communication and assigns each agent the relevant thoughts, along with their sharing patterns. This paradigm naturally extends beyond LLMs to all modalities, as most observational data arise from hidden generative processes. Experiments on both synthetic and real-world benchmarks validate the theory and demonstrate the collaborative advantages of thought communication. We hope this work illuminates the potential of leveraging the hidden world, as many challenges remain unsolvable through surface-level observation alone, regardless of compute or data scale.
zh

[AI-6] Unsupervised Anomaly Prediction with N-BEATS and Graph Neural Network in Multi-variate Semiconductor Process Time Series

【速读】:该论文旨在解决半导体制造过程中异常预测(anomaly prediction)的难题,尤其是面对传感器数据高维度、故障样本稀少导致的类别不平衡问题,以及变量间复杂依赖关系对异常检测与根本原因分析(root-cause analysis)带来的挑战。解决方案的关键在于构建一个两阶段的异常预测框架:首先在假设无异常的数据集上训练预测模型,随后对未见时间序列进行预测,并将预测结果与训练信号对比,超出预设阈值的偏差即被标记为异常。两种方法分别采用N-BEATS(假设变量独立)和图神经网络(Graph Neural Network, GNN)建模变量间依赖关系,其中GNN在保持稳定异常预测能力的同时,显著减少可训练参数并降低计算成本,展现出优于N-BEATS的性能,具备在制造环境中在线部署的潜力。

链接: https://arxiv.org/abs/2510.20718
作者: Daniel Sorensen,Bappaditya Dey,Minjin Hwang,Sandip Halder
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 27 figures

点击查看摘要

Abstract:Semiconductor manufacturing is an extremely complex and precision-driven process, characterized by thousands of interdependent parameters collected across diverse tools and process steps. Multi-variate time-series analysis has emerged as a critical field for real-time monitoring and fault detection in such environments. However, anomaly prediction in semiconductor fabrication presents several critical challenges, including high dimensionality of sensor data and severe class imbalance due to the rarity of true faults. Furthermore, the complex interdependencies between variables complicate both anomaly prediction and root-cause-analysis. This paper proposes two novel approaches to advance the field from anomaly detection to anomaly prediction, an essential step toward enabling real-time process correction and proactive fault prevention. The proposed anomaly prediction framework contains two main stages: (a) training a forecasting model on a dataset assumed to contain no anomalies, and (b) performing forecast on unseen time series data. The forecast is compared with the forecast of the trained signal. Deviations beyond a predefined threshold are flagged as anomalies. The two approaches differ in the forecasting model employed. The first assumes independence between variables by utilizing the N-BEATS model for univariate time series forecasting. The second lifts this assumption by utilizing a Graph Neural Network (GNN) to capture inter-variable relationships. Both models demonstrate strong forecasting performance up to a horizon of 20 time points and maintain stable anomaly prediction up to 50 time points. The GNN consistently outperforms the N-BEATS model while requiring significantly fewer trainable parameters and lower computational cost. These results position the GNN as a promising solution for online anomaly forecasting to be deployed in manufacturing environments.
zh
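
框架第二阶段“预测-对比-超阈值报警”的逻辑可以写成几行代码(阈值规则为假设;论文分别以 N-BEATS 与 GNN 作为预测模型):

```python
import numpy as np

def flag_anomalies(y_true, y_pred, train_residuals, k=3.0):
    """预测偏差超过训练残差的 mu + k*sigma 即标记为异常。"""
    mu, sigma = train_residuals.mean(), train_residuals.std()
    return np.abs(y_true - y_pred) > mu + k * sigma
```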

[AI-7] Real-Time Gait Adaptation for Quadrupeds using Model Predictive Control and Reinforcement Learning

【速读】:该论文旨在解决模型无关强化学习(Model-free Reinforcement Learning, RL)在四足机器人步态控制中易收敛至单一步态、导致性能次优的问题,以及传统模型预测控制(Model Predictive Control, MPC)缺乏环境自适应能力的局限性。其解决方案的关键在于提出一种基于连续步态空间的实时步态自适应优化框架,融合了模型预测路径积分(Model Predictive Path Integral, MPPI)算法与一个名为Dreamer的模块,通过学习到的奖励函数联合优化动作和步态变量,以实现速度跟踪、能量效率、稳定性及平滑步态过渡的多目标优化,并引入学习到的价值函数作为终端奖励,扩展为无限时域规划器,从而在保持高精度轨迹跟踪的同时,动态生成任务适配的最优步态策略。

链接: https://arxiv.org/abs/2510.20706
作者: Ganga Nair B,Prakrut Kotecha,Shishir Kolathaya
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Model-free reinforcement learning (RL) has enabled adaptable and agile quadruped locomotion; however, policies often converge to a single gait, leading to suboptimal performance. Traditionally, Model Predictive Control (MPC) has been extensively used to obtain task-specific optimal policies but lacks the ability to adapt to varying environments. To address these limitations, we propose an optimization framework for real-time gait adaptation in a continuous gait space, combining the Model Predictive Path Integral (MPPI) algorithm with a Dreamer module to produce adaptive and optimal policies for quadruped locomotion. At each time step, MPPI jointly optimizes the actions and gait variables using a learned Dreamer reward that promotes velocity tracking, energy efficiency, stability, and smooth transitions, while penalizing abrupt gait changes. A learned value function is incorporated as terminal reward, extending the formulation to an infinite-horizon planner. We evaluate our framework in simulation on the Unitree Go1, demonstrating an average reduction of up to 36.48% in energy consumption across varying target speeds, while maintaining accurate tracking and adaptive, task-appropriate gaits.
zh
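
下面给出通用 MPPI 单步更新的最简草图(论文将动作与步态变量一并放入优化变量,并用学习到的 Dreamer 奖励与价值函数;此处仅示意指数加权平均这一机制,rollout/reward 接口为假设):

```python
import numpy as np

def mppi_step(state, u_nom, rollout, reward, n_samples=64, sigma=0.3, lam=1.0):
    """u_nom: (T, m) 名义控制序列;rollout(state, U) -> 轨迹;reward(traj) -> 标量。"""
    noise = np.random.randn(n_samples, *u_nom.shape) * sigma
    R = np.array([reward(rollout(state, u_nom + eps)) for eps in noise])
    w = np.exp((R - R.max()) / lam)          # 回报越高,扰动序列的权重越大
    w /= w.sum()
    return u_nom + (w[:, None, None] * noise).sum(axis=0)
```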

[AI-8] Exploring Large Language Models for Access Control Policy Synthesis and Summarization

【速读】:该论文旨在解决云环境中访问控制策略(Access Control Policy)自动编写与分析的难题,尤其是传统人工编写策略易出错且难以精确验证的问题。其关键解决方案在于探索大型语言模型(Large Language Models, LLMs)在访问控制策略合成与语义级请求摘要中的应用:一方面发现LLMs虽能生成语法正确的策略,但存在权限过度宽松的问题(非推理型LLM为45.8%,推理型LLM达93.7%);另一方面提出一种基于语义的请求摘要方法,利用LLMs对现有策略进行精准行为刻画,结果表明结合符号分析方法时,LLMs在政策分析上展现出良好前景。

链接: https://arxiv.org/abs/2510.20692
作者: Adarsh Vatsa,Bethel Hall,William Eiers
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
备注: 20 pages, 7 figures

点击查看摘要

Abstract:Cloud computing is ubiquitous, with a growing number of services being hosted on the cloud every day. Typical cloud compute systems allow administrators to write policies implementing access control rules which specify how access to private data is governed. These policies must be manually written, and due to their complexity can often be error prone. Moreover, existing policies often implement complex access control specifications and thus can be difficult to precisely analyze in determining their behavior works exactly as intended. Recently, Large Language Models (LLMs) have shown great success in automated code synthesis and summarization. Given this success, they could potentially be used for automatically generating access control policies or aid in understanding existing policies. In this paper, we explore the effectiveness of LLMs for access control policy synthesis and summarization. Specifically, we first investigate diverse LLMs for access control policy synthesis, finding that: although LLMs can effectively generate syntactically correct policies, they have permissiveness issues, generating policies equivalent to the given specification 45.8% of the time for non-reasoning LLMs, and 93.7% of the time for reasoning LLMs. We then investigate how LLMs can be used to analyze policies by introducing a novel semantic-based request summarization approach which leverages LLMs to generate a precise characterization of the requests allowed by a policy. Our results show that while there are significant hurdles in leveraging LLMs for automated policy generation, LLMs show promising results when combined with symbolic approaches in analyzing existing policies.
zh

[AI-9] Plan Then Retrieve: Reinforcement Learning-Guided Complex Reasoning over Knowledge Graphs

【速读】:该论文旨在解决知识图谱问答(Knowledge Graph Question Answering, KGQA)中现有方法在复杂场景下难以充分利用知识图谱(Knowledge Graph, KG)的丰富知识与大语言模型(Large Language Models, LLMs)推理能力的问题,尤其体现在KG覆盖不全时缺乏外部信息判断机制、推理过程局部短视且无法维持多步规划一致性,导致即使存在相关知识仍发生推理失败。其解决方案的关键在于提出Graph-RFT框架,采用“计划-检索-网络搜索”(plan-KGsearch-and-Websearch-during-think)两阶段强化微调范式:首先通过链式思维(chain-of-thought)微调与定制化计划-检索数据集激活结构化推理并解决GRPO冷启动问题;其次引入一种融合显式规划与检索动作的强化学习机制,结合结果和检索特定信号的多奖励设计,实现覆盖感知的检索调度;同时采用笛卡尔式规划模块将复杂问题分解为有序子问题,并以逻辑表达式引导工具调用,从而支持全局一致的多步推理,使LLMs能够在不完整知识条件下自主规划并动态调度KG与网络检索资源。

链接: https://arxiv.org/abs/2510.20691
作者: Yanlin Song,Ben Liu,Víctor Gutiérrez-Basulto,Zhiwei Hu,Qianqian Xie,Min Peng,Sophia Ananiadou,Jeff Z. Pan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge Graph Question Answering aims to answer natural language questions by reasoning over structured knowledge graphs. While large language models have advanced KGQA through their strong reasoning capabilities, existing methods continue to struggle to fully exploit both the rich knowledge encoded in KGs and the reasoning capabilities of LLMs, particularly in complex scenarios. They often assume complete KG coverage and lack mechanisms to judge when external information is needed, and their reasoning remains locally myopic, failing to maintain coherent multi-step planning, leading to reasoning failures even when relevant knowledge exists. We propose Graph-RFT, a novel two-stage reinforcement fine-tuning KGQA framework with a 'plan-KGsearch-and-Websearch-during-think' paradigm that enables LLMs to perform autonomous planning and adaptive retrieval scheduling across KG and web sources under incomplete knowledge conditions. Graph-RFT introduces a chain-of-thought fine-tuning method with a customized plan-retrieval dataset that activates structured reasoning and resolves the GRPO cold-start problem. It then introduces a novel plan-retrieval guided reinforcement learning process that integrates explicit planning and retrieval actions with a multi-reward design, enabling coverage-aware retrieval scheduling. It employs a Cartesian-inspired planning module to decompose complex questions into ordered subquestions, and logical expressions to guide tool invocation for globally consistent multi-step reasoning. This reasoning-retrieval process is optimized with a multi-reward scheme combining outcome and retrieval-specific signals, enabling the model to learn when and how to combine KG and web retrieval effectively.
zh

[AI-10] A Scalable Causal and Energy Efficient Framework for Neural Decoding with Spiking Neural Networks

【速读】:该论文旨在解决脑机接口(Brain-Computer Interfaces, BCIs)中神经解码模型在实时性、可扩展性和能效方面的瓶颈问题。当前基于人工神经网络(Artificial Neural Networks, ANNs)的解码方法要么缺乏泛化能力(如简单因果模型),要么难以部署于资源受限的实时系统(如复杂非因果模型)。其核心挑战在于ANN对计算和能耗的高需求,限制了其在电池约束环境中的应用。解决方案的关键在于提出Spikachu框架——一种基于脉冲神经网络(Spiking Neural Networks, SNNs)的可扩展、因果且低功耗的神经解码方法。该框架直接处理分箱后的脉冲数据,通过共享潜在空间投影与时序自适应的脉冲模块提取特征,并实现高效在线行为预测,同时在单会话训练下能耗降低2.26至418.81倍,并支持多会话/多受试者迁移学习以提升性能与泛化能力。

链接: https://arxiv.org/abs/2510.20683
作者: Georgios Mentzelopoulos,Ioannis Asmanis,Konrad P. Kording,Eva L. Dyer,Kostas Daniilidis,Flavia Vitale
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Brain-computer interfaces (BCIs) promise to enable vital functions, such as speech and prosthetic control, for individuals with neuromotor impairments. Central to their success are neural decoders, models that map neural activity to intended behavior. Current learning-based decoding approaches fall into two classes: simple, causal models that lack generalization, or complex, non-causal models that generalize and scale offline but struggle in real-time settings. Both face a common challenge, their reliance on power-hungry artificial neural network backbones, which makes integration into real-world, resource-limited systems difficult. Spiking neural networks (SNNs) offer a promising alternative. Because they operate causally these models are suitable for real-time use, and their low energy demands make them ideal for battery-constrained environments. To this end, we introduce Spikachu: a scalable, causal, and energy-efficient neural decoding framework based on SNNs. Our approach processes binned spikes directly by projecting them into a shared latent space, where spiking modules, adapted to the timing of the input, extract relevant features; these latent representations are then integrated and decoded to generate behavioral predictions. We evaluate our approach on 113 recording sessions from 6 non-human primates, totaling 43 hours of recordings. Our method outperforms causal baselines when trained on single sessions using between 2.26 and 418.81 times less energy. Furthermore, we demonstrate that scaling up training to multiple sessions and subjects improves performance and enables few-shot transfer to unseen sessions, subjects, and tasks. Overall, Spikachu introduces a scalable, online-compatible neural decoding framework based on SNNs, whose performance is competitive relative to state-of-the-art models while consuming orders of magnitude less energy.
zh
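
The abstract describes strictly causal processing of binned spikes at low energy cost. As a rough, self-contained illustration of that style of computation (not Spikachu's actual architecture; all shapes, constants, and the linear readout are invented for the example), here is a leaky integrate-and-fire layer over binned spike counts in NumPy:

```python
import numpy as np

def lif_layer(binned_spikes, w_in, tau=0.9, v_th=1.0):
    """Causal leaky integrate-and-fire layer over time-binned spike counts."""
    T = binned_spikes.shape[0]
    n_hidden = w_in.shape[1]
    v = np.zeros(n_hidden)                       # membrane potentials
    out = np.zeros((T, n_hidden))
    for t in range(T):                           # strictly causal loop
        v = tau * v + binned_spikes[t] @ w_in    # leak + integrate
        fired = (v >= v_th).astype(float)        # fire
        v = v * (1.0 - fired)                    # reset fired units
        out[t] = fired
    return out

rng = np.random.default_rng(0)
spikes = rng.poisson(0.5, size=(200, 96)).astype(float)  # 200 bins, 96 channels
w_in = rng.normal(0.0, 0.1, size=(96, 64))
hidden = lif_layer(spikes, w_in)
w_out = rng.normal(0.0, 0.1, size=(64, 2))
behavior = hidden.mean(axis=0) @ w_out           # toy 2-D behavioral readout
print(hidden.shape, behavior)
```

Because each step depends only on past inputs, such a layer can run online as data arrives, which is the property the paper exploits for real-time decoding.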

[AI-11] R2-SVC: Towards Real-World Robust and Expressive Zero-shot Singing Voice Conversion

[Quick read]: This paper tackles two challenges that singing voice conversion (SVC) faces in practice: insufficient robustness to environmental noise and music-separation artifacts, and the mismatch between the demand for expressive output and conventional training on clean data only. The key to the solution is the R2-SVC framework, whose core innovations are: 1) simulation-based robustness enhancement via random fundamental-frequency (F₀) perturbations and simulation of music-separation artifacts (e.g., reverberation, echo); 2) enriching speaker representations with domain-specific singing data, including DNSMOS-filtered separated vocals and public singing corpora, so that timbre is preserved while singing-style nuances are captured; and 3) a Neural Source-Filter (NSF) model that explicitly models harmonic and noise components, improving the naturalness and controllability of the converted singing voice.

Link: https://arxiv.org/abs/2510.20677
Authors: Junjie Zheng,Gongyu Chen,Chaofan Ding,Zihao Chen
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 2 figures

Click to view abstract

Abstract:In real-world singing voice conversion (SVC) applications, environmental noise and the demand for expressive output pose significant challenges. Conventional methods, however, are typically designed without accounting for real deployment scenarios, as both training and inference usually rely on clean data. This mismatch hinders practical use, given the inevitable presence of diverse noise sources and artifacts from music separation. To tackle these issues, we propose R2-SVC, a robust and expressive SVC framework. First, we introduce simulation-based robustness enhancement through random fundamental frequency ( F_0 ) perturbations and music separation artifact simulations (e.g., reverberation, echo), substantially improving performance under noisy conditions. Second, we enrich speaker representation using domain-specific singing data: alongside clean vocals, we incorporate DNSMOS-filtered separated vocals and public singing corpora, enabling the model to preserve speaker timbre while capturing singing style nuances. Third, we integrate the Neural Source-Filter (NSF) model to explicitly represent harmonic and noise components, enhancing the naturalness and controllability of converted singing. R2-SVC achieves state-of-the-art results on multiple SVC benchmarks under both clean and noisy conditions.
zh
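
Two of the augmentations named above, random F₀ perturbation and reverberation-style separation artifacts, are easy to picture in code. The sketch below is illustrative only: the semitone range, impulse-response shape, and sampling rate are assumptions, not values from the paper.

```python
import numpy as np
from scipy.signal import fftconvolve

def perturb_f0(f0, rng, max_semitones=2.0):
    """Randomly shift an F0 contour by up to +/- max_semitones (voiced frames only)."""
    shift = rng.uniform(-max_semitones, max_semitones)
    factor = 2.0 ** (shift / 12.0)
    return np.where(f0 > 0, f0 * factor, 0.0)

def simulate_reverb(wave, rng, sr=16000, t60=0.3):
    """Convolve with a synthetic exponentially decaying noise impulse response."""
    n = int(sr * t60)
    ir = rng.normal(size=n) * np.exp(-6.9 * np.arange(n) / n)  # ~60 dB decay
    ir[0] = 1.0
    wet = fftconvolve(wave, ir)[: len(wave)]
    return wet / (np.abs(wet).max() + 1e-8)

rng = np.random.default_rng(0)
f0 = np.abs(rng.normal(220.0, 20.0, size=500))   # toy F0 contour in Hz
wave = rng.normal(size=16000)                    # stand-in for a vocal waveform
print(perturb_f0(f0, rng)[:3], simulate_reverb(wave, rng)[:3])
```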

[AI-12] GRACE: GRaph-based Addiction Care prEdiction

[Quick read]: This paper addresses a key clinical decision for addiction patients, the locus of care: with limited specialized treatment resources, how to automate accurate triage prediction so as to improve treatment outcomes and allocate resources effectively. The central difficulty is severe class imbalance in existing datasets, which makes conventional methods perform poorly on minority classes (e.g., particular placement types). The key to the solution is GRACE, a novel graph neural network framework that casts locus-of-care prediction as a structured learning problem and, through extensive feature engineering and a new unbiased meta-graph construction for training the GNN, effectively mitigates the class imbalance problem, improving minority-class F1 scores by 11-35% over competitive baselines.

Link: https://arxiv.org/abs/2510.20671
Authors: Subham Kumar,Prakrithi Shivaprakash,Koustav Rudra,Lekhansh Shukla,Animesh Mukherjee
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Determining the appropriate locus of care for addiction patients is one of the most critical clinical decisions that affects patient treatment outcomes and effective use of resources. With a lack of sufficient specialized treatment resources, such as inpatient beds or staff, there is an unmet need to develop an automated framework for the same. Current decision-making approaches suffer from severe class imbalances in addiction datasets. To address this limitation, we propose a novel graph neural network (GRACE) framework that formalizes locus of care prediction as a structured learning problem. Further, we perform extensive feature engineering and propose a new approach of obtaining an unbiased meta-graph to train a GNN to overcome the class imbalance problem. Experimental results on real-world data show an improvement of 11-35% in terms of the F1 score of the minority class over competitive baselines. The codes and note embeddings are available at this https URL.
zh

[AI-13] he Shape of Reasoning : Topological Analysis of Reasoning Traces in Large Language Models

[Quick read]: This paper addresses the problem of evaluating the quality of reasoning traces produced by large language models (LLMs). Current practice relies on expert rubrics, manual annotation, and slow pairwise judgments, which is inefficient and unreliable, while existing automated approaches mostly use graph-based proxies that only measure connectivity and reveal little about what makes reasoning good. The key to the solution is a topological data analysis (TDA) framework that captures the higher-dimensional geometry of reasoning traces rather than relying on relational graphs alone; the extracted topological features are substantially more predictive of reasoning quality than standard graph metrics, and a compact, stable subset of them provides a reliable automated evaluation signal for future reinforcement learning algorithms.

Link: https://arxiv.org/abs/2510.20665
Authors: Xue Wen Tan,Nathaniel Tan,Galen Lee,Stanley Kok
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Evaluating the quality of reasoning traces from large language models remains understudied, labor-intensive, and unreliable: current practice relies on expert rubrics, manual annotation, and slow pairwise judgments. Automated efforts are dominated by graph-based proxies that quantify structural connectivity but do not clarify what constitutes high-quality reasoning; such abstractions can be overly simplistic for inherently complex processes. We introduce a topological data analysis (TDA)-based evaluation framework that captures the geometry of reasoning traces and enables label-efficient, automated assessment. In our empirical study, topological features yield substantially higher predictive power for assessing reasoning quality than standard graph metrics, suggesting that effective reasoning is better captured by higher-dimensional geometric structures rather than purely relational graphs. We further show that a compact, stable set of topological features reliably indicates trace quality, offering a practical signal for future reinforcement learning algorithms.
zh
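
One concrete bridge between point clouds and topology that a TDA pipeline of this kind could exploit: for a Vietoris-Rips filtration, the death times of 0-dimensional homology classes are exactly the edge lengths of the Euclidean minimum spanning tree. The sketch below computes such H0 features for a toy "trace" of step embeddings; the specific features and their use are assumptions for illustration, not the paper's feature set.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def h0_deaths(points):
    """H0 death times of the Vietoris-Rips filtration = Euclidean MST edge lengths."""
    dist = squareform(pdist(points))
    mst = minimum_spanning_tree(dist).toarray()
    return np.sort(mst[mst > 0])

def h0_features(points):
    d = h0_deaths(points)
    p = d / d.sum()
    return {
        "total_persistence": float(d.sum()),
        "max_death": float(d.max()),
        "persistence_entropy": float(-(p * np.log(p)).sum()),
    }

rng = np.random.default_rng(0)
trace = rng.normal(size=(40, 8))   # e.g. embeddings of 40 reasoning steps
print(h0_features(trace))
```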

[AI-14] Integrating Machine Learning into Belief-Desire-Intention Agents : Current Advances and Open Challenges

[Quick read]: This paper addresses the fragmented and unsystematic state of research on combining rational agents with machine learning (ML): most existing work merely embeds ML models into generic agent containers, overlooking the expressive power of rational architectures such as Belief-Desire-Intention (BDI). The key to the solution is a fine-grained systematisation of existing ML-enhanced rational agent approaches using the BDI paradigm as a reference, which maps the fast-evolving literature in this area and identifies key research opportunities and open challenges for designing effective rational ML agents.

Link: https://arxiv.org/abs/2510.20641
Authors: Andrea Agiollo,Andrea Omicini
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Thanks to the remarkable human-like capabilities of machine learning (ML) models in perceptual and cognitive tasks, frameworks integrating ML within rational agent architectures are gaining traction. Yet, the landscape remains fragmented and incoherent, often focusing on embedding ML into generic agent containers while overlooking the expressive power of rational architectures–such as Belief-Desire-Intention (BDI) agents. This paper presents a fine-grained systematisation of existing approaches, using the BDI paradigm as a reference. Our analysis illustrates the fast-evolving literature on rational agents enhanced by ML, and identifies key research opportunities and open challenges for designing effective rational ML agents.
zh

[AI-15] Fluidity Index: Next-Generation Super-intelligence Benchmarks

[Quick read]: This paper addresses the difficulty of quantifying model adaptability in dynamic, scaling environments, that is, how to assess a model's ability to understand, predict, and adjust to changes in environment state. The key to the solution is the Fluidity Index (FI), which evaluates context switching and continuity by measuring deviations among initial, current, and future environment states, and which favors closed-loop, open-ended real-world benchmarks for testing adaptability. The paper further argues that a truly super-intelligent model should exhibit at least second-order adaptability, sustaining its own computation through digital replenishment to maintain optimal fluidity.

Link: https://arxiv.org/abs/2510.20636
Authors: Eric Ngoiya,Tianshu Bao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 12

Click to view abstract

Abstract:This paper introduces the Fluidity Index (FI) to quantify model adaptability in dynamic, scaling environments. The benchmark evaluates response accuracy based on deviations in initial, current, and future environment states, assessing context switching and continuity. We distinguish between closed-ended and open-ended benchmarks, prioritizing closed-loop open-ended real-world benchmarks to test adaptability. The approach measures a model’s ability to understand, predict, and adjust to state changes in scaling environments. A truly super-intelligent model should exhibit at least second-order adaptability, enabling self-sustained computation through digital replenishment for optimal fluidity.
zh

[AI-16] owards Reliable Evaluation of Large Language Models for Multilingual and Multimodal E-Commerce Applications

[Quick read]: This paper addresses the limitations of existing e-commerce evaluations for large language models (LLMs): limited task diversity (e.g., no product guidance or after-sales issues), a single modality (no multimodal data), reliance on synthetic or curated data, and narrow language coverage (mainly English and Chinese), so current benchmarks cannot measure model capability in complex real-world shopping scenarios. The key to the solution is EcomEval, a comprehensive multilingual and multimodal benchmark covering 6 categories and 37 tasks (including 8 multimodal tasks), sourced primarily from authentic customer queries and transaction logs that are noisy and heterogeneous. A semi-automatic pipeline has large models draft candidate answers that are then reviewed and revised by more than 50 expert annotators with e-commerce and multilingual expertise, ensuring both quality and scalability of the reference answers; difficulty levels are defined per question and task category by averaging the evaluation scores of models of different sizes and capabilities, enabling challenge-oriented, fine-grained assessment; and the benchmark spans seven languages, including five low-resource Southeast Asian languages, filling a multilingual and multimodal gap left by prior work.

Link: https://arxiv.org/abs/2510.20632
Authors: Shuyi Xie,Ziqin Liew,Hailing Zhang,Haibo Zhang,Ling Hu,Zhiqiang Zhou,Shuman Liu,Anxiang Zeng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large Language Models (LLMs) excel on general-purpose NLP benchmarks, yet their capabilities in specialized domains remain underexplored. In e-commerce, existing evaluations-such as EcomInstruct, ChineseEcomQA, eCeLLM, and Shopping MMLU-suffer from limited task diversity (e.g., lacking product guidance and after-sales issues), limited task modalities (e.g., absence of multimodal data), synthetic or curated data, and a narrow focus on English and Chinese, leaving practitioners without reliable tools to assess models on complex, real-world shopping scenarios. We introduce EcomEval, a comprehensive multilingual and multimodal benchmark for evaluating LLMs in e-commerce. EcomEval covers six categories and 37 tasks (including 8 multimodal tasks), sourced primarily from authentic customer queries and transaction logs, reflecting the noisy and heterogeneous nature of real business interactions. To ensure both quality and scalability of reference answers, we adopt a semi-automatic pipeline in which large models draft candidate responses that are subsequently reviewed and modified by over 50 expert annotators with strong e-commerce and multilingual expertise. We define difficulty levels for each question and task category by averaging evaluation scores across models with different sizes and capabilities, enabling challenge-oriented and fine-grained assessment. EcomEval also spans seven languages, including five low-resource Southeast Asian languages, offering a multilingual perspective absent from prior work.
zh

[AI-17] Equitable Survival Prediction: A Fairness-Aware Survival Modeling (FASM) Approach

[Quick read]: This paper addresses how survival models in healthcare can perpetuate or amplify algorithmic bias rooted in structural inequities and social biases embedded in clinical data, focusing in particular on unfair cross-group risk rankings: for example, high-risk Black patients may be ranked below lower-risk White patients who never experience the mortality event, a misranking that reinforces biological essentialism and undermines equitable care. The key to the solution is Fairness-Aware Survival Modeling (FASM), which mitigates bias in both intra-group and cross-group risk rankings over time. Using breast cancer prognosis as a representative case on SEER data, FASM substantially improves fairness while preserving discrimination comparable to fairness-unaware survival models, and maintains stable fairness over a 10-year horizon, with the largest gains observed in the mid-term of follow-up.

Link: https://arxiv.org/abs/2510.20629
Authors: Mingxuan Liu,Yilin Ning,Haoyuan Wang,Chuan Hong,Matthew Engelhard,Danielle S. Bitterman,William G. La Cava,Nan Liu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:As machine learning models become increasingly integrated into healthcare, structural inequities and social biases embedded in clinical data can be perpetuated or even amplified by data-driven models. In survival analysis, censoring and time dynamics can further add complexity to fair model development. Additionally, algorithmic fairness approaches often overlook disparities in cross-group rankings, e.g., high-risk Black patients may be ranked below lower-risk White patients who do not experience the event of mortality. Such misranking can reinforce biological essentialism and undermine equitable care. We propose a Fairness-Aware Survival Modeling (FASM), designed to mitigate algorithmic bias regarding both intra-group and cross-group risk rankings over time. Using breast cancer prognosis as a representative case and applying FASM to SEER breast cancer data, we show that FASM substantially improves fairness while preserving discrimination performance comparable to fairness-unaware survival models. Time-stratified evaluations show that FASM maintains stable fairness over a 10-year horizon, with the greatest improvements observed during the mid-term of follow-up. Our approach enables the development of survival models that prioritize both accuracy and equity in clinical decision-making, advancing fairness as a core principle in clinical care.
zh
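
To make the notion of cross-group misranking concrete, here is a minimal sketch of one plausible discordance measure: among comparable cross-group pairs, the fraction in which the patient who experienced the event earlier was assigned a lower predicted risk. Both this metric and the synthetic data are illustrative assumptions, not FASM's actual fairness objective.

```python
import numpy as np

def cross_group_misranking(risk, time, event, group, g_a, g_b):
    """Fraction of comparable cross-group pairs ranked inconsistently with
    observed outcomes: i (group g_a) had the event before j (group g_b) was
    last seen, yet i received a lower predicted risk than j."""
    bad = total = 0
    for i in range(len(risk)):
        if not event[i] or group[i] != g_a:
            continue
        for j in range(len(risk)):
            if group[j] != g_b or time[j] <= time[i]:
                continue                      # pair (i, j) not comparable
            total += 1
            if risk[i] < risk[j]:
                bad += 1                      # high-risk case ranked below
    return bad / total if total else float("nan")

rng = np.random.default_rng(0)
n = 200
risk = rng.random(n)
time = rng.exponential(5.0, n)
event = rng.random(n) < 0.6
group = rng.integers(0, 2, n)
print(cross_group_misranking(risk, time, event, group, g_a=0, g_b=1))
```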

[AI-18] owards the Formalization of a Trustworthy AI for Mining Interpretable Models explOiting Sophisticated Algorithms

[Quick read]: This paper addresses the trustworthiness gap of automated decision-making models in real-world use: how to retain performance while ensuring interpretability and embedding key ethical properties such as causality, fairness, and privacy. The key is the MIMOSA (Mining Interpretable Models explOiting Sophisticated Algorithms) framework, which formalizes the supervised learning setting across diverse data types (tabular data, time series, images, text, transactions, trajectories), characterizes three major families of interpretable models (feature importance, rule, and instance based) along their interpretability dimensions, reasoning mechanisms, and complexity, and provides formal definitions, evaluation metrics, and verification procedures for causality, fairness, and privacy. It further examines the inherent trade-offs between these properties and interpretability, supporting the embedding of ethical constraints at model-generation time and laying theoretical foundations for AI systems that are not only accurate and interpretable but also fair, privacy-preserving, and causally aware, i.e., trustworthy.

Link: https://arxiv.org/abs/2510.20621
Authors: Riccardo Guidotti,Martina Cinquini,Marta Marchiori Manerba,Mattia Setzu,Francesco Spinnato
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Interpretable-by-design models are crucial for fostering trust, accountability, and safe adoption of automated decision-making models in real-world applications. In this paper we formalize the ground for the MIMOSA (Mining Interpretable Models explOiting Sophisticated Algorithms) framework, a comprehensive methodology for generating predictive models that balance interpretability with performance while embedding key ethical properties. We formally define here the supervised learning setting across diverse decision-making tasks and data types, including tabular data, time series, images, text, transactions, and trajectories. We characterize three major families of interpretable models: feature importance, rule, and instance based models. For each family, we analyze their interpretability dimensions, reasoning mechanisms, and complexity. Beyond interpretability, we formalize three critical ethical properties, namely causality, fairness, and privacy, providing formal definitions, evaluation metrics, and verification procedures for each. We then examine the inherent trade-offs between these properties and discuss how privacy requirements, fairness constraints, and causal reasoning can be embedded within interpretable pipelines. By evaluating ethical measures during model generation, this framework establishes the theoretical foundations for developing AI systems that are not only accurate and interpretable but also fair, privacy-preserving, and causally aware, i.e., trustworthy.
zh

[AI-19] Black Box Absorption: LLM s Undermining Innovative Ideas

[Quick read]: This paper identifies and formalizes a systemic risk that large language models (LLMs) pose to the innovation ecosystem, "Black Box Absorption": platform operators can internalize, generalize, and repurpose novel concepts contributed by users during interaction through their opaque internal architectures, creating severe informational and structural asymmetries between individual creators and platforms and threatening the foundational principles of innovation economics. The key to the solution lies in two core concepts: the idea unit, capturing the transportable functional logic of an innovation, and idea safety, a multidimensional standard for its protection. The paper further proposes a combined governance and engineering agenda to keep creator contributions traceable, controllable, and equitable, mitigating the risks of black box absorption.

Link: https://arxiv.org/abs/2510.20612
Authors: Wenjun Cao
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); General Economics (econ.GN)
Comments:

Click to view abstract

Abstract:Large Language Models are increasingly adopted as critical tools for accelerating innovation. This paper identifies and formalizes a systemic risk inherent in this paradigm: Black Box Absorption. We define this as the process by which the opaque internal architectures of LLM platforms, often operated by large-scale service providers, can internalize, generalize, and repurpose novel concepts contributed by users during interaction. This mechanism threatens to undermine the foundational principles of innovation economics by creating severe informational and structural asymmetries between individual creators and platform operators, thereby jeopardizing the long-term sustainability of the innovation ecosystem. To analyze this challenge, we introduce two core concepts: the idea unit, representing the transportable functional logic of an innovation, and idea safety, a multidimensional standard for its protection. This paper analyzes the mechanisms of absorption and proposes a concrete governance and engineering agenda to mitigate these risks, ensuring that creator contributions remain traceable, controllable, and equitable.
zh

[AI-20] PSO-XAI: A PSO-Enhanced Explainable AI Framework for Reliable Breast Cancer Detection

[Quick read]: This paper aims to improve the accuracy and reliability of early breast cancer diagnosis, where conventional approaches are limited by variability, cost, and, most importantly, the risk of misdiagnosis. The key to the solution is an integrated framework that uses customized Particle Swarm Optimization (PSO) for feature selection, combined with cross-validation and explainable AI (XAI) methods, evaluated across 29 different models spanning classical classifiers, ensembles, neural networks, probabilistic and instance-based algorithms. It achieves a superior score of 99.1% across performance metrics, including accuracy and precision, while substantially reducing dimensionality and providing transparent, model-agnostic explanations, thereby improving robustness, trustworthiness, and clinical utility.

Link: https://arxiv.org/abs/2510.20611
Authors: Mirza Raquib,Niloy Das,Farida Siddiqi Prity,Arafath Al Fahim,Saydul Akbar Murad,Mohammad Amzad Hossain,MD Jiabul Hoque,Mohammad Ali Moni
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Breast cancer is considered the most critical and frequently diagnosed cancer in women worldwide, leading to an increase in cancer-related mortality. Early and accurate detection is crucial as it can help mitigate possible threats while improving survival rates. In terms of prediction, conventional diagnostic methods are often limited by variability, cost, and, most importantly, risk of misdiagnosis. To address these challenges, machine learning (ML) has emerged as a powerful tool for computer-aided diagnosis, with feature selection playing a vital role in improving model performance and interpretability. This research study proposes an integrated framework that incorporates customized Particle Swarm Optimization (PSO) for feature selection. This framework has been evaluated on a comprehensive set of 29 different models, spanning classical classifiers, ensemble techniques, neural networks, probabilistic algorithms, and instance-based algorithms. To ensure interpretability and clinical relevance, the study uses cross-validation in conjunction with explainable AI methods. Experimental evaluation showed that the proposed approach achieved a superior score of 99.1% across all performance metrics, including accuracy and precision, while effectively reducing dimensionality and providing transparent, model-agnostic explanations. The results highlight the potential of combining swarm intelligence with explainable ML for robust, trustworthy, and clinically meaningful breast cancer diagnosis.
zh
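
A minimal sketch of binary PSO for feature selection in the spirit of the framework above; the swarm size, coefficients, sigmoid bit-flip rule, and logistic-regression fitness are standard BPSO choices assumed for illustration, not the paper's customized variant.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
n_particles, n_feat, iters = 10, X.shape[1], 10

def fitness(mask):
    sel = mask.astype(bool)
    if not sel.any():
        return 0.0
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(clf, X[:, sel], y, cv=3).mean()

pos = (rng.random((n_particles, n_feat)) < 0.5).astype(float)  # binary positions
vel = rng.normal(0.0, 0.1, (n_particles, n_feat))
pbest = pos.copy()
pbest_fit = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(iters):
    r1, r2 = rng.random((2, n_particles, n_feat))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    # sigmoid of velocity gives the probability that each bit is set
    pos = (rng.random((n_particles, n_feat)) < 1.0 / (1.0 + np.exp(-vel))).astype(float)
    fit = np.array([fitness(p) for p in pos])
    better = fit > pbest_fit
    pbest[better], pbest_fit[better] = pos[better], fit[better]
    gbest = pbest[pbest_fit.argmax()].copy()

print("features kept:", int(gbest.sum()), "cv accuracy: %.3f" % pbest_fit.max())
```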

[AI-21] Practical Code RAG at Scale: Task-Aware Retrieval Design Choices under Compute Budgets

[Quick read]: This paper studies how to design retrieval for code-oriented generation tasks (code completion and bug localization) under realistic compute budgets. The key is a systematic comparison of retrieval configurations along three axes: chunking strategy, similarity scoring, and splitting granularity. The findings: for PL-PL tasks, sparse BM25 with word-level splitting is the most effective and practical, clearly beating dense alternatives while being an order of magnitude faster; for NL-PL tasks, proprietary dense encoders (the Voyager-3 family) perform best but incur roughly 100x higher latency; the optimal chunk size grows with the available context window (32-64 line chunks at small budgets, whole-file retrieval becoming competitive at 16000 tokens); and simple line-based chunking matches syntax-aware splitting across budgets. The result is a set of evidence-based recommendations for building code RAG systems that balance quality and latency under given task requirements, model constraints, and compute budgets.

Link: https://arxiv.org/abs/2510.20609
Authors: Timur Galimzyanov,Olga Kolomyttseva,Egor Bogomolov
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:We study retrieval design for code-focused generation tasks under realistic compute budgets. Using two complementary tasks from Long Code Arena – code completion and bug localization – we systematically compare retrieval configurations across various context window sizes along three axes: (i) chunking strategy, (ii) similarity scoring, and (iii) splitting granularity. (1) For PL-PL, sparse BM25 with word-level splitting is the most effective and practical, significantly outperforming dense alternatives while being an order of magnitude faster. (2) For NL-PL, proprietary dense encoders (Voyager-3 family) consistently beat sparse retrievers, though at roughly 100x higher latency. (3) Optimal chunk size scales with available context: 32-64 line chunks work best at small budgets, and whole-file retrieval becomes competitive at 16000 tokens. (4) Simple line-based chunking matches syntax-aware splitting across budgets. (5) Retrieval latency varies by up to 200x across configurations; BPE-based splitting is needlessly slow, and BM25 + word splitting offers the best quality-latency trade-off. Thus, we provide evidence-based recommendations for implementing effective code-oriented RAG systems based on task requirements, model constraints, and computational efficiency.
zh
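
Since the headline finding is that sparse BM25 over word-split, line-chunked code works best for PL-PL retrieval, here is a compact, self-contained sketch of exactly that configuration (32-line chunks, word-level tokenization, Okapi-style scoring). Parameter values are common defaults, not the paper's tuned settings.

```python
import math
import re
from collections import Counter

def chunk_lines(source, size=32):
    """Fixed-size line-based chunking."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]

def tokenize(text):
    """Word-level splitting of code text."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower())

class BM25:
    def __init__(self, docs, k1=1.2, b=0.75):
        self.tf = [Counter(tokenize(d)) for d in docs]
        self.lens = [sum(c.values()) for c in self.tf]
        self.avg = sum(self.lens) / len(self.lens)
        self.df = Counter(t for c in self.tf for t in c)
        self.N, self.k1, self.b = len(docs), k1, b

    def score(self, query, i):
        s = 0.0
        for t in tokenize(query):
            f = self.tf[i].get(t, 0)
            if f == 0:
                continue
            idf = math.log(1.0 + (self.N - self.df[t] + 0.5) / (self.df[t] + 0.5))
            s += idf * f * (self.k1 + 1) / (
                f + self.k1 * (1 - self.b + self.b * self.lens[i] / self.avg))
        return s

source = "\n".join(f"def helper_{i}(x):\n    return x + {i}" for i in range(48))
chunks = chunk_lines(source, size=32)      # 96 lines -> 3 chunks
index = BM25(chunks)
query = "helper_40 return"
print(max(range(len(chunks)), key=lambda i: index.score(query, i)))
```

No model inference is involved, which is where the order-of-magnitude latency advantage over dense encoders comes from.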

[AI-22] Generalizable Reasoning through Compositional Energy Minimization

[Quick read]: This paper addresses the limited generalization of reasoning models in machine learning: models trained end-to-end on problem instances from a particular distribution learn useful heuristics but struggle on more complex or unseen reasoning tasks. The key to the solution is to learn energy landscapes over the solution spaces of smaller, tractable subproblems and, at test time, compose the subproblem energy functions into a global energy landscape for the given problem; Parallel Energy Minimization (PEM) is introduced to improve sample quality from the composed landscape. This compositional approach also allows additional constraints to be incorporated at inference time, enabling effective generalization to larger and harder reasoning problems.

Link: https://arxiv.org/abs/2510.20607
Authors: Alexandru Oarga,Yilun Du
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Generalization is a key challenge in machine learning, specifically in reasoning tasks, where models are expected to solve problems more complex than those encountered during training. Existing approaches typically train reasoning models in an end-to-end fashion, directly mapping input instances to solutions. While this allows models to learn useful heuristics from data, it often results in limited generalization beyond the training distribution. In this work, we propose a novel approach to reasoning generalization by learning energy landscapes over the solution spaces of smaller, more tractable subproblems. At test time, we construct a global energy landscape for a given problem by combining the energy functions of multiple subproblems. This compositional approach enables the incorporation of additional constraints during inference, allowing the construction of energy landscapes for problems of increasing difficulty. To improve the sample quality from this newly constructed energy landscape, we introduce Parallel Energy Minimization (PEM). We evaluate our approach on a wide set of reasoning problems. Our method outperforms existing state-of-the-art methods, demonstrating its ability to generalize to larger and more complex problems. Project website can be found at: this https URL
zh
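
The composition step is simple to illustrate: if each subproblem contributes an energy term, the global landscape is their sum, and many candidate solutions can be minimized in parallel, keeping the lowest-energy one (a crude stand-in for PEM). In the sketch below each subproblem energy is a quadratic penalty, an assumption made purely so the gradients are easy to write down; the paper's energies are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 5, 8
A = rng.normal(size=(K, d))        # one row per subproblem
b = rng.normal(size=K)

def energy(X):
    """Composed landscape: sum over K subproblem energies (a_k . x - b_k)^2."""
    r = X @ A.T - b                # (n_chains, K) residuals
    return (r ** 2).sum(axis=1)

def grad(X):
    return 2.0 * (X @ A.T - b) @ A

n_chains, lr = 64, 0.01
X = rng.normal(size=(n_chains, d))     # parallel candidate solutions
for _ in range(2000):
    X -= lr * grad(X)                  # minimize all chains in parallel
best = X[energy(X).argmin()]
print("per-subproblem residuals:", np.round(A @ best - b, 3))
```

Adding a constraint at inference time amounts to appending another row to A, i.e., another energy term in the sum, which is the mechanism that lets composition scale to harder problems.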

[AI-23] Efficient Algorithms for Computing Random Walk Centrality

[Quick read]: This paper tackles the inefficiency of computing random walk centrality on large networks, where the computational demands of traditional methods make them impractical. The key to the solution is a novel formulation on which two approximation algorithms are built: one leveraging approximate Cholesky factorization with sparse inverse estimation, the other sampling rooted spanning trees. Both run in near-linear time with strong approximation guarantees, markedly improving scalability and accuracy on very large networks (including one with more than 10 million nodes).

Link: https://arxiv.org/abs/2510.20604
Authors: Changan Liu,Zixuan Xie,Ahad N. Zehmakan,Zhongzhi Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by TKDE

Click to view abstract

Abstract:Random walk centrality is a fundamental metric in graph mining for quantifying node importance and influence, defined as the weighted average of hitting times to a node from all other nodes. Despite its ability to capture rich graph structural information and its wide range of applications, computing this measure for large networks remains impractical due to the computational demands of existing methods. In this paper, we present a novel formulation of random walk centrality, underpinning two scalable algorithms: one leveraging approximate Cholesky factorization and sparse inverse estimation, while the other sampling rooted spanning trees. Both algorithms operate in near-linear time and provide strong approximation guarantees. Extensive experiments on large real-world networks, including one with over 10 million nodes, demonstrate the efficiency and approximation quality of the proposed algorithms.
zh
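
For intuition about the quantity being approximated, small graphs allow an exact computation: hitting times to a target node solve a linear system, and the centrality is a weighted average of them. The sketch below uses the walk's stationary distribution as the weighting and reports the reciprocal so that larger means more central; the paper's exact weighting and its two scalable algorithms are not reproduced here.

```python
import numpy as np

def hitting_times_to(A, j):
    """Expected hitting times H(i -> j) of a simple random walk on adjacency A."""
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)
    idx = [i for i in range(n) if i != j]
    M = np.eye(n - 1) - P[np.ix_(idx, idx)]    # (I - P_{-j}) h = 1
    h = np.linalg.solve(M, np.ones(n - 1))
    out = np.zeros(n)
    out[idx] = h
    return out

def random_walk_centrality(A):
    """Reciprocal of the stationary-weighted average hitting time to each node."""
    n = A.shape[0]
    pi = A.sum(axis=1) / A.sum()               # stationary distribution of the walk
    return np.array([1.0 / (pi @ hitting_times_to(A, j)) for j in range(n)])

# path graph on 5 nodes: the middle node should come out most central
A = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
print(np.round(random_walk_centrality(A), 4))
```

The linear solve per target node is what makes the exact computation quadratic-to-cubic overall, and hence why near-linear approximations matter at scale.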

[AI-24] Resounding Acoustic Fields with Reciprocity NEURIPS2025

[Quick read]: For immersive auditory experiences in virtual environments, this paper introduces the task of resounding: estimating room impulse responses (RIRs) at arbitrary emitter locations from a sparse set of measured emitter positions, so as to support dynamically moving sound sources. The key to the solution is Versa, a physics-inspired method that exploits acoustic reciprocity: by exchanging emitter and listener poses it creates physically valid training samples with dense virtual emitter positions, and a self-supervised learning strategy corrects the cases where reciprocity breaks down in practice due to differing emitter/listener gain patterns. Experiments show that Versa substantially improves acoustic field learning on both simulated and real-world datasets across different metrics, and perceptual user studies confirm a markedly more immersive spatial sound experience.

Link: https://arxiv.org/abs/2510.20602
Authors: Zitong Lan,Yiduo Hao,Mingmin Zhao
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Comments: NeurIPS 2025

Click to view abstract

Abstract:Achieving immersive auditory experiences in virtual environments requires flexible sound modeling that supports dynamic source positions. In this paper, we introduce a task called resounding, which aims to estimate room impulse responses at arbitrary emitter locations from a sparse set of measured emitter positions, analogous to the relighting problem in vision. We leverage the reciprocity property and introduce Versa, a physics-inspired approach to facilitating acoustic field learning. Our method creates physically valid samples with dense virtual emitter positions by exchanging emitter and listener poses. We also identify challenges in deploying reciprocity due to emitter/listener gain patterns and propose a self-supervised learning approach to address them. Results show that Versa substantially improves the performance of acoustic field learning on both simulated and real-world datasets across different metrics. Perceptual user studies show that Versa can greatly improve the immersive spatial sound experience. Code, dataset and demo videos are available on the project website: this https URL.
zh

[AI-25] ransferable Graph Learning for Transmission Congestion Management via Busbar Splitting

[Quick read]: This paper addresses network topology optimization (NTO) via busbar splitting for large-scale transmission systems, to relieve congestion and cut redispatch costs. Solving this mixed-integer non-linear problem in near-real-time is intractable with existing solvers, and prior machine learning (ML) approaches generalize poorly to unseen topologies, operating conditions, and systems. The key to the solution is a graph neural network (GNN)-accelerated approach, specifically a heterogeneous edge-aware message passing NN that predicts effective busbar-splitting actions as candidate NTO solutions; it captures local flow patterns, generalizes to unseen topology changes, and transfers across systems. Case studies show up to 4 orders of magnitude speed-up, AC-feasible solutions within one minute, and only a 2.3% optimality gap on the GOC 2000-bus system, a significant step toward practical near-real-time NTO for large-scale systems.

Link: https://arxiv.org/abs/2510.20591
Authors: Ali Rajaei,Peter Palensky,Jochen L. Cremer
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Network topology optimization (NTO) via busbar splitting can mitigate transmission grid congestion and reduce redispatch costs. However, solving this mixed-integer non-linear problem for large-scale systems in near-real-time is currently intractable with existing solvers. Machine learning (ML) approaches have emerged as a promising alternative, but they have limited generalization to unseen topologies, varying operating conditions, and different systems, which limits their practical applicability. This paper formulates NTO for congestion management problem considering linearized AC PF, and proposes a graph neural network (GNN)-accelerated approach. We develop a heterogeneous edge-aware message passing NN to predict effective busbar splitting actions as candidate NTO solutions. The proposed GNN captures local flow patterns, achieves generalization to unseen topology changes, and improves transferability across systems. Case studies show up to 4 orders-of-magnitude speed-up, delivering AC-feasible solutions within one minute and a 2.3% optimality gap on the GOC 2000-bus system. These results demonstrate a significant step toward near-real-time NTO for large-scale systems with topology and cross-system generalization.
zh

[AI-26] Lost in Translation: Policymakers are not really listening to Citizen Concerns about AI

[Quick read]: This paper examines the absence of effective public-participation mechanisms in AI governance across countries: policymakers fail to genuinely listen and respond to public input, which weakens public trust in AI and the legitimacy of its governance. The key to the remedy is a two-way, transparent, and inclusive public engagement framework, concretely including promoting AI literacy, broadening outreach, adopting innovative participation methods, ensuring underrepresented groups are included, responding publicly to feedback, and simplifying participation, so that participatory AI governance genuinely hears and addresses citizens' concerns.

Link: https://arxiv.org/abs/2510.20568
Authors: Susan Ariel Aaronson,Michael Moreno
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The world's people have strong opinions about artificial intelligence (AI), and they want policymakers to listen. Governments are inviting public comment on AI, but as they translate input into policy, much of what citizens say is lost. Policymakers are missing a critical opportunity to build trust in AI and its governance. This paper compares three countries, Australia, Colombia, and the United States, that invited citizens to comment on AI risks and policies. Using a landscape analysis, the authors examined how each government solicited feedback and whether that input shaped governance. Yet in none of the three cases did citizens and policymakers establish a meaningful dialogue. Governments did little to attract diverse voices or publicize calls for comment, leaving most citizens unaware or unprepared to respond. In each nation, fewer than one percent of the population participated. Moreover, officials showed limited responsiveness to the feedback they received, failing to create an effective feedback loop. The study finds a persistent gap between the promise and practice of participatory AI governance. The authors conclude that current approaches are unlikely to build trust or legitimacy in AI because policymakers are not adequately listening or responding to public concerns. They offer eight recommendations: promote AI literacy; monitor public feedback; broaden outreach; hold regular online forums; use innovative engagement methods; include underrepresented groups; respond publicly to input; and make participation easier.
zh

[AI-27] AdaDoS: Adaptive DoS Attack via Deep Adversarial Reinforcement Learning in SDN

[Quick read]: This paper addresses the declining effectiveness of rule-based and machine-learning DoS defenses in software-defined networking (SDN) against adaptive, AI-driven attacks. The key to the solution is AdaDoS, a dynamic attack model based on adversarial reinforcement learning (RL) that casts the attack as a game between an attacker and a detector, handles the attacker's information disadvantage as a partially observed Markov decision process (POMDP), and introduces a novel reciprocal learning module in which a "student" attack agent with limited observations learns from a "teacher" agent with full observability of the SDN environment, enabling effective evasion of existing detection mechanisms.

Link: https://arxiv.org/abs/2510.20566
Authors: Wei Shao,Yuhao Wang,Rongguang He,Muhammad Ejaz Ahmed,Seyit Camtepe
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Existing defence mechanisms have demonstrated significant effectiveness in mitigating rule-based Denial-of-Service (DoS) attacks, leveraging predefined signatures and static heuristics to identify and block malicious traffic. However, the emergence of AI-driven techniques presents new challenges to SDN security, potentially compromising the efficacy of existing defence mechanisms. In this paper, we introduce AdaDoS, an adaptive attack model that disrupts network operations while evading detection by existing DoS-based detectors through adversarial reinforcement learning (RL). Specifically, AdaDoS models the problem as a competitive game between an attacker, whose goal is to obstruct network traffic without being detected, and a detector, which aims to identify malicious traffic. AdaDoS can solve this game by dynamically adjusting its attack strategy based on feedback from the SDN and the detector. Additionally, recognising that attackers typically have less information than defenders, AdaDoS formulates the DoS-like attack as a partially observed Markov decision process (POMDP), with the attacker having access only to delay information between attacker and victim nodes. We address this challenge with a novel reciprocal learning module, where the student agent, with limited observations, enhances its performance by learning from the teacher agent, who has full observational capabilities in the SDN environment. AdaDoS represents the first application of RL to develop DoS-like attack sequences, capable of adaptively evading both machine learning-based and rule-based DoS-like attack detectors.
zh

[AI-28] Structural Invariance Matters: Rethinking Graph Rewiring through Graph Metrics

[Quick read]: This paper addresses the tension in graph neural networks (GNNs) between relieving over-squashing via graph rewiring and preserving the key topological signals that rewiring may destroy. The key contribution is a systematic analysis of how seven diverse rewiring strategies affect structural graph metrics and how those changes correlate with downstream node classification accuracy. The findings reveal a consistent pattern: effective rewiring methods tend to preserve local structure while allowing flexibility in global connectivity, improving task performance while maintaining structural fidelity.

Link: https://arxiv.org/abs/2510.20556
Authors: Alexandru Benoit,Catherine Aitken,Yu He
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages, 5 figures, conference

Click to view abstract

Abstract:Graph rewiring has emerged as a key technique to alleviate over-squashing in Graph Neural Networks (GNNs) and Graph Transformers by modifying the graph topology to improve information flow. While effective, rewiring inherently alters the graph’s structure, raising the risk of distorting important topology-dependent signals. Yet, despite the growing use of rewiring, little is known about which structural properties must be preserved to ensure both performance gains and structural fidelity. In this work, we provide the first systematic analysis of how rewiring affects a range of graph structural metrics, and how these changes relate to downstream task performance. We study seven diverse rewiring strategies and correlate changes in local and global graph properties with node classification accuracy. Our results reveal a consistent pattern: successful rewiring methods tend to preserve local structure while allowing for flexibility in global connectivity. These findings offer new insights into the design of effective rewiring strategies, bridging the gap between graph theory and practical GNN optimization.
zh
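
A study of this kind boils down to measuring structural metrics before and after rewiring. The sketch below does that with networkx for one toy "rewiring" (adding shortcuts between peripheral nodes); the graph, the edit rule, and the chosen metrics are illustrative assumptions, not one of the seven strategies analyzed in the paper.

```python
import networkx as nx

def structural_report(G):
    return {
        "avg_clustering": round(nx.average_clustering(G), 3),    # local structure
        "global_efficiency": round(nx.global_efficiency(G), 3),  # global connectivity
        "diameter": nx.diameter(G),
    }

G = nx.connected_watts_strogatz_graph(100, 6, 0.05, seed=0)
before = structural_report(G)

H = G.copy()                 # toy rewiring: shortcuts between peripheral nodes
periphery = nx.periphery(H)
for u, v in zip(periphery[::2], periphery[1::2]):
    H.add_edge(u, v)

print("before:", before)
print("after: ", structural_report(H))
```

In line with the paper's conclusion, a "good" edit would leave avg_clustering nearly untouched while improving the global quantities.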

[AI-29] Hurdle-IMDL: An Imbalanced Learning Framework for Infrared Rainfall Retrieval

[Quick read]: This paper addresses model bias caused by imbalanced label distributions in remote-sensing retrieval, in particular the systematic underestimation of heavy rain in rainfall retrieval: conventional training favors common samples (no rain or light rain) at the expense of rare but high-impact events. The key to the solution is the Hurdle-Inversion Model Debiasing Learning (IMDL) framework. Following a divide-and-conquer strategy, it decomposes the imbalance of the rain distribution into two components: zero inflation (dominance of non-rain samples) and a long tail (light rain far outnumbering heavy rain). A hurdle model handles the zero inflation, while IMDL addresses the long tail by transforming the learning objective into an unbiased ideal inverse model, effectively mitigating the systematic underestimation of heavy-to-extreme rain and markedly improving the retrieval of rare, high-impact weather events.

Link: https://arxiv.org/abs/2510.20486
Authors: Fangjian Zhang,Xiaoyong Zhuge,Wenlan Wang,Haixia Xiao,Yuying Zhu,Siyang Cheng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Atmospheric and Oceanic Physics (physics.ao-ph); Geophysics (physics.geo-ph)
Comments: 26 pages

Click to view abstract

Abstract:Artificial intelligence has advanced quantitative remote sensing, yet its effectiveness is constrained by imbalanced label distribution. This imbalance leads conventionally trained models to favor common samples, which in turn degrades retrieval performance for rare ones. Rainfall retrieval exemplifies this issue, with performance particularly compromised for heavy rain. This study proposes Hurdle-Inversion Model Debiasing Learning (IMDL) framework. Following a divide-and-conquer strategy, imbalance in the rain distribution is decomposed into two components: zero inflation, defined by the predominance of non-rain samples; and long tail, defined by the disproportionate abundance of light-rain samples relative to heavy-rain samples. A hurdle model is adopted to handle the zero inflation, while IMDL is proposed to address the long tail by transforming the learning object into an unbiased ideal inverse model. Comprehensive evaluation via statistical metrics and case studies investigating rainy weather in eastern China confirms Hurdle-IMDL’s superiority over conventional, cost-sensitive, generative, and multi-task learning methods. Its key advancements include effective mitigation of systematic underestimation and a marked improvement in the retrieval of heavy-to-extreme rain. IMDL offers a generalizable approach for addressing imbalance in distributions of environmental variables, enabling enhanced retrieval of rare yet high-impact events.
zh
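
The hurdle decomposition itself is straightforward to sketch: one model decides rain versus no rain (the zero-inflation part), and a second regresses the amount on rainy samples only (where the long tail lives). Everything below, the synthetic data, the gradient-boosting models, and the log-transformed target, is an illustrative assumption; IMDL's inverse-model debiasing is not implemented here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 6))                      # e.g. infrared brightness features
raining = rng.random(n) < 0.25                   # zero inflation: 75% no rain
rate = np.exp(0.8 * X[:, 0] + 0.5 * X[:, 1])     # heavy-tailed rain rates
y = np.where(raining, rate, 0.0)

occ = GradientBoostingClassifier().fit(X, y > 0)          # stage 1: rain / no rain
Xr, yr = X[y > 0], y[y > 0]
amt = GradientBoostingRegressor().fit(Xr, np.log1p(yr))   # stage 2: amount (log target)

def predict_rain(X_new, threshold=0.5):
    p = occ.predict_proba(X_new)[:, 1]
    amount = np.expm1(amt.predict(X_new))
    return np.where(p > threshold, amount, 0.0)

print(predict_rain(X[:5]))
print(y[:5])
```

Splitting occurrence from amount keeps the regressor from being dragged toward zero by the dominant dry samples, which is exactly the failure mode the paper sets out to remove.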

[AI-30] Structures generated in a multiagent system performing information fusion in peer-to-peer resource-constrained networks

[Quick read]: This paper studies how to perform effective information fusion under resource constraints (energy, message capacity, time, etc.), especially in non-military settings where traditional hierarchical fusion lacks the dynamism and flexibility required. The key to the solution is holonic structures: using a multiagent system model, the paper shows how a set of fully interconnected peers under resource constraints tends to spontaneously form self-organizing local cooperative units (holons) that optimize the handling of vagueness and uncertainty in exchanged information. Such structures adapt to sudden changes in the environment or in their composition, retain a degree of autonomy, and cooperate toward common goals, which is particularly useful for keeping a system stable when communication is constrained or components fail.

Link: https://arxiv.org/abs/2510.20469
Authors: Horacio Paggi,Juan A. Lara,Javier Soriano
Affiliations: Unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:There has recently been a major advance with respect to how information fusion is performed. Information fusion has gone from being conceived as a purely hierarchical procedure, as is the case of traditional military applications, to now being regarded collaboratively, as holonic fusion, which is better suited for civil applications and edge organizations. The above paradigm shift is being boosted as information fusion gains ground in different non-military areas, and human-computer and machine-machine communications, where holarchies, which are more flexible structures than ordinary, static hierarchies, become more widespread. This paper focuses on showing how holonic structures tend to be generated when there are constraints on resources (energy, available messages, time, etc.) for interactions based on a set of fully intercommunicating elements (peers) whose components fuse information as a means of optimizing the impact of vagueness and uncertainty present in message exchanges. Holon formation is studied generically based on a multiagent system model, and an example of its possible operation is shown. Holonic structures have a series of advantages, such as adaptability to sudden changes in the environment or its composition, are somewhat autonomous and are capable of cooperating in order to achieve a common goal. This can be useful when the shortage of resources prevents communications or when the system components start to fail.
zh

[AI-31] FLORA: Unsupervised Knowledge Graph Alignment by Fuzzy Logic

[Quick read]: This paper addresses knowledge graph alignment: identifying equivalent entities (instances and classes) and relations across two knowledge graphs. Existing methods are mostly restricted to pure entity-level alignment, rely on similarity in an embedding space, lack interpretable reasoning, and require substantial labeled training data. The key to the solution is FLORA, which (1) is unsupervised, requiring no training data; (2) aligns entities and relations holistically and iteratively; (3) is based on fuzzy logic and therefore delivers interpretable results; (4) provably converges; (5) allows dangling entities, i.e., entities without a counterpart in the other KG; and (6) achieves state-of-the-art results on major benchmarks.

Link: https://arxiv.org/abs/2510.20467
Authors: Yiwen Peng(IP Paris),Thomas Bonald(IP Paris),Fabian M. Suchanek(IP Paris)
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments:

Click to view abstract

Abstract:Knowledge graph alignment is the task of matching equivalent entities (that is, instances and classes) and relations across two knowledge graphs. Most existing methods focus on pure entity-level alignment, computing the similarity of entities in some embedding space. They lack interpretable reasoning and need training data to work. In this paper, we propose FLORA, a simple yet effective method that (1) is unsupervised, i.e., does not require training data, (2) provides a holistic alignment for entities and relations iteratively, (3) is based on fuzzy logic and thus delivers interpretable results, (4) provably converges, (5) allows dangling entities, i.e., entities without a counterpart in the other KG, and (6) achieves state-of-the-art results on major benchmarks.
zh

[AI-32] Neural Reasoning for Robust Instance Retrieval in $\mathcal{SHOIQ}$

[Quick read]: This paper addresses why current neuro-symbolic concept learning approaches are hard to deploy on real-world knowledge bases: they depend on description logic reasoners, which are not robust to inconsistencies or erroneous data. The key to the solution is EBR, a novel neural reasoner that approximates the results of a symbolic reasoner with embeddings: it only needs to retrieve instances of atomic concepts and existential restrictions in order to derive or approximate the instance set of any concept in the description logic $\mathcal{SHOIQ}$, and, in contrast to existing reasoners, it proves markedly robust to missing and erroneous data.

Link: https://arxiv.org/abs/2510.20457
Authors: Louis Mozart Kamdem Teyou,Luke Friedrichs,N'Dah Jean Kouagou,Caglar Demir,Yasir Mahmood,Stefan Heindorf,Axel-Cyrille Ngonga Ngomo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted as a full research paper at K-CAP 2025

Click to view abstract

Abstract:Concept learning exploits background knowledge in the form of description logic axioms to learn explainable classification models from knowledge bases. Despite recent breakthroughs in neuro-symbolic concept learning, most approaches still cannot be deployed on real-world knowledge bases. This is due to their use of description logic reasoners, which are not robust against inconsistencies nor erroneous data. We address this challenge by presenting a novel neural reasoner dubbed EBR. Our reasoner relies on embeddings to approximate the results of a symbolic reasoner. We show that EBR solely requires retrieving instances for atomic concepts and existential restrictions to retrieve or approximate the set of instances of any concept in the description logic $\mathcal{SHOIQ}$. In our experiments, we compare EBR with state-of-the-art reasoners. Our results suggest that EBR is robust against missing and erroneous data in contrast to existing reasoners.
zh
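
The compositional idea, that retrieval for complex concepts reduces to retrieving atomic concepts and existential restrictions plus set algebra, can be shown with a tiny ABox. In EBR the two primitive retrievers would be embedding-based scorers; here they are exact lookups, which is an assumption made purely for illustration.

```python
# Composing instance retrieval for complex concepts from two primitives:
# atomic-concept retrieval and existential-restriction retrieval.
ABOX = {
    "Person": {"alice", "bob", "carol"},
    "Company": {"acme"},
}
ROLES = {"worksFor": {("alice", "acme"), ("bob", "acme")}}
ALL = {"alice", "bob", "carol", "acme"}

def instances_atomic(name):
    return ABOX.get(name, set())

def instances_exists(role, filler_instances):
    return {h for h, t in ROLES.get(role, ()) if t in filler_instances}

def retrieve(concept):
    op = concept[0]
    if op == "atom":
        return instances_atomic(concept[1])
    if op == "and":                                   # conjunction = intersection
        return retrieve(concept[1]) & retrieve(concept[2])
    if op == "or":                                    # disjunction = union
        return retrieve(concept[1]) | retrieve(concept[2])
    if op == "not":                                   # negation = complement
        return ALL - retrieve(concept[1])
    if op == "exists":
        return instances_exists(concept[1], retrieve(concept[2]))
    raise ValueError(op)

# Person AND (EXISTS worksFor.Company)  ->  {'alice', 'bob'}
q = ("and", ("atom", "Person"), ("exists", "worksFor", ("atom", "Company")))
print(retrieve(q))
```

Replacing the two lookup functions with learned, embedding-based retrievers is what makes the overall procedure tolerant of missing or erroneous assertions.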

[AI-33] MolBridge: Atom-Level Joint Graph Refinement for Robust Drug-Drug Interaction Event Prediction

[Quick read]: This paper addresses a limitation of drug-drug interaction (DDI) event prediction: existing methods rely on isolated drug representations and cannot explicitly model atom-level cross-molecular interactions, which limits their effectiveness under complex molecular structures and diverse DDI type distributions. The key to the solution is MolBridge, an atom-level joint graph refinement framework: it builds a joint graph integrating the atomic structures of a drug pair so that inter-drug associations are modeled directly, and it introduces a structure-consistency module that iteratively refines node features while preserving global structural context, alleviating the over-smoothing that arises when modeling long-range atomic dependencies. This improves robustness on both frequent and rare (long-tail) DDI types and yields more interpretable predictions.

Link: https://arxiv.org/abs/2510.20448
Authors: Xuan Lin,Aocheng Ding,Tengfei Ma,Hua Liang,Zhe Quan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Drug combinations offer therapeutic benefits but also carry the risk of adverse drug-drug interactions (DDIs), especially under complex molecular structures. Accurate DDI event prediction requires capturing fine-grained inter-drug relationships, which are critical for modeling metabolic mechanisms such as enzyme-mediated competition. However, existing approaches typically rely on isolated drug representations and fail to explicitly model atom-level cross-molecular interactions, limiting their effectiveness across diverse molecular complexities and DDI type distributions. To address these limitations, we propose MolBridge, a novel atom-level joint graph refinement framework for robust DDI event prediction. MolBridge constructs a joint graph that integrates atomic structures of drug pairs, enabling direct modeling of inter-drug associations. A central challenge in such joint graph settings is the potential loss of information caused by over-smoothing when modeling long-range atomic dependencies. To overcome this, we introduce a structure consistency module that iteratively refines node features while preserving the global structural context. This joint design allows MolBridge to effectively learn both local and global interaction patterns, yielding robust representations across both frequent and rare DDI types. Extensive experiments on two benchmark datasets show that MolBridge consistently outperforms state-of-the-art baselines, achieving superior performance across long-tail and inductive scenarios. These results demonstrate the advantages of fine-grained graph refinement in improving the accuracy, robustness, and mechanistic interpretability of DDI event prediction. This work contributes to Web Mining and Content Analysis by developing graph-based methods for mining and analyzing drug-drug interaction networks.
zh

[AI-34] UniSE: A Unified Framework for Decoder-only Autoregressive LM-based Speech Enhancement ICASSP2026

[Quick read]: This paper addresses the difficulty of unifying different speech enhancement (SE) sub-tasks, such as speech restoration, target speaker extraction, and speech separation, which are usually handled by separately designed discriminative or generative models without a common framework. The key to the solution is UniSE, a unified decoder-only framework based on an autoregressive language model (AR LM): it takes input speech features as conditions and generates discrete tokens of the target speech via AR modeling, accommodating the distinct learning patterns of multiple tasks within one framework. Experiments on several benchmarks show performance competitive with discriminative and generative baselines, demonstrating the capacity of LMs to unify SE tasks.

Link: https://arxiv.org/abs/2510.20441
Authors: Haoyin Yan,Chengwei Liu,Shaofei Xue,Xiaotao Liang,Zheng Xue
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: 5 pages, submitted to ICASSP 2026

Click to view abstract

Abstract:The development of neural audio codecs (NACs) has largely promoted applications of language models (LMs) to speech processing and understanding. However, the effectiveness of autoregressive (AR) LM-based models in unifying different sub-tasks of speech enhancement (SE) has yet to be verified. In this work, we propose UniSE, a unified decoder-only LM-based framework to handle different SE tasks including speech restoration, target speaker extraction and speech separation. It takes input speech features as conditions and generates discrete tokens of the target speech using AR modeling, which facilitates compatibility between the distinct learning patterns of multiple tasks. Experiments on several benchmarks indicate that the proposed UniSE achieves competitive performance compared to discriminative and generative baselines, showing the capacity of LMs in unifying SE tasks. The demo page is available here: this https URL.
zh

[AI-35] Balancing Specialization and Centralization: A Multi-Agent Reinforcement Learning Benchmark for Sequential Industrial Control

[Quick read]: This paper addresses the tension between local specialization and global coordination in the autonomous control of multi-stage industrial processes, along with the obstacles reinforcement learning (RL) faces in industrial settings: reward design, modular architectures, and action-space management. The key to the solution is an improved industry-inspired benchmark that merges tasks from SortingEnv and ContainerGym into a sequential recycling scenario with sorting and pressing operations, together with action masking to constrain the action space. Experiments show that without masking agents struggle to learn effective policies (with the modular architecture faring better), while with masking both architectures improve substantially and the gap between modular and monolithic agents narrows considerably, revealing how decisive action-space constraints are for the practicality and robustness of multi-agent RL in industrial automation.

Link: https://arxiv.org/abs/2510.20408
Authors: Tom Maus,Asma Atamna,Tobias Glasmachers
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: Preprint (submitted version) to be presented at the 13th International Conference on Industrial Engineering and Applications (ICIEA-EU), Milan, 2026. The final Version of Record will appear in the official conference proceedings

Click to view abstract

Abstract:Autonomous control of multi-stage industrial processes requires both local specialization and global coordination. Reinforcement learning (RL) offers a promising approach, but its industrial adoption remains limited due to challenges such as reward design, modularity, and action space management. Many academic benchmarks differ markedly from industrial control problems, limiting their transferability to real-world applications. This study introduces an enhanced industry-inspired benchmark environment that combines tasks from two existing benchmarks, SortingEnv and ContainerGym, into a sequential recycling scenario with sorting and pressing operations. We evaluate two control strategies: a modular architecture with specialized agents and a monolithic agent governing the full system, while also analyzing the impact of action masking. Our experiments show that without action masking, agents struggle to learn effective policies, with the modular architecture performing better. When action masking is applied, both architectures improve substantially, and the performance gap narrows considerably. These results highlight the decisive role of action space constraints and suggest that the advantages of specialization diminish as action complexity is reduced. The proposed benchmark thus provides a valuable testbed for exploring practical and robust multi-agent RL solutions in industrial automation, while contributing to the ongoing debate on centralization versus specialization.
zh
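
Action masking itself is a small mechanism: invalid actions receive -inf logits before the softmax, so they get zero probability and are never sampled. A minimal sketch follows; the action set and the mask are invented placeholders, not the benchmark's actual action space.

```python
import numpy as np

def masked_logits(logits, mask):
    """Standard action masking: invalid actions are forced to -inf so the
    softmax assigns them zero probability."""
    return np.where(mask, logits, -np.inf)

def sample_action(logits, mask, rng):
    z = masked_logits(logits, mask)
    z = z - z[mask].max()                 # numerically stable softmax
    p = np.where(mask, np.exp(z), 0.0)
    p = p / p.sum()
    return rng.choice(len(logits), p=p)

rng = np.random.default_rng(0)
logits = rng.normal(size=6)                    # e.g. 6 sorting/pressing actions
mask = np.array([1, 0, 1, 1, 0, 1], dtype=bool)  # e.g. press only when bin is full
print(sample_action(logits, mask, rng))
```

Because masked actions contribute nothing to the policy gradient, the agent's exploration is confined to feasible actions, which is the paper's explanation for the large learning-efficiency gains.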

[AI-36] A computational model and tool for generating more novel opportunities in professional innovation processes

[Quick read]: This paper addresses the limited novelty of opportunities generated in innovation projects, asking how to raise originality without sacrificing usefulness. The key to the solution is a computational model of creative outcomes, informed by creativity theories and techniques, that integrates five functions designed to generate innovation opportunities with higher novelty and no loss of usefulness. An empirical evaluation on an innovation project in the hospitality sector shows that the model's outcomes were more novel and/or useful than those from Notebook LM and ChatGPT4o; however, not all functions contributed to higher novelty, pointing to directions for further improving the model's generative effectiveness.

Link: https://arxiv.org/abs/2510.20402
Authors: Neil Maiden,Konstantinos Zachos,James Lockerbie,Kostas Petrianakis,Amanda Brown
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This paper presents a new computational model of creative outcomes, informed by creativity theories and techniques, which was implemented to generate more novel opportunities for innovation projects. The model implemented five functions that were developed to contribute to the generation of innovation opportunities with higher novelty without loss of usefulness. The model was evaluated using opportunities generated for an innovation project in the hospitality sector. The evaluation revealed that the computational model generated outcomes that were more novel and/or useful than outcomes from Notebook LM and ChatGPT4o. However, not all model functions contributed to the generation of more novel opportunities, leading to new directions for further model development.
zh

[AI-37] FLAS: a combination of proactive and reactive auto-scaling architecture for distributed services

[Quick read]: This paper addresses elastic scaling of distributed services in the cloud: dynamically adjusting resources to follow load while honoring the agreed service level (SLA). Conventional auto-scalers are typically reactive and struggle with sudden load changes, while proactive approaches, though able to intervene in advance, often require heavy monitoring and generalize poorly. The key to the solution is FLAS (Forecasted Load Auto-Scaling), which combines (i) a predictive model of the trend of high-level metrics that anticipates changes in SLA-relevant parameters (e.g., response time, throughput), and (ii) a reactive contingency system that estimates high-level metrics from resource-usage metrics, greatly reducing invasive instrumentation at the application level and making the approach portable across applications. An implementation for the content-based publish-subscribe middleware E-SilboPS, evaluated with test cases built via Boundary-Value Analysis, shows performance requirements met more than 99% of the time.

Link: https://arxiv.org/abs/2510.20388
Authors: Víctor Rampérez,Javier Soriano,David Lizcano,Juan A. Lara
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Cloud computing has established itself as the support for the vast majority of emerging technologies, mainly due to the characteristic of elasticity it offers. Auto-scalers are the systems that enable this elasticity by acquiring and releasing resources on demand to ensure an agreed service level. In this article we present FLAS (Forecasted Load Auto-Scaling), an auto-scaler for distributed services that combines the advantages of proactive and reactive approaches according to the situation to decide the optimal scaling actions in every moment. The main novelties introduced by FLAS are (i) a predictive model of the high-level metrics trend which allows to anticipate changes in the relevant SLA parameters (e.g. performance metrics such as response time or throughput) and (ii) a reactive contingency system based on the estimation of high-level metrics from resource use metrics, reducing the necessary instrumentation (less invasive) and allowing it to be adapted agnostically to different applications. We provide a FLAS implementation for the use case of a content-based publish-subscribe middleware (E-SilboPS) that is the cornerstone of an event-driven architecture. To the best of our knowledge, this is the first auto-scaling system for content-based publish-subscribe distributed systems (although it is generic enough to fit any distributed service). Through an evaluation based on several test cases recreating not only the expected contexts of use, but also the worst possible scenarios (following the Boundary-Value Analysis or BVA test methodology), we have validated our approach and demonstrated the effectiveness of our solution by ensuring compliance with performance requirements over 99% of the time.
zh
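
The proactive/reactive split can be caricatured in a few lines: forecast the trend of a high-level metric and scale out ahead of an SLA breach, but fall back to an estimate derived from resource usage when direct measurements are unavailable. All thresholds, the linear forecast, and the CPU-to-response-time mapping below are invented placeholders, not FLAS's fitted models.

```python
import numpy as np

def forecast_next(history, window=12):
    """Proactive part: naive linear-trend forecast of a high-level metric
    (e.g. response time) one step ahead."""
    y = np.asarray(history[-window:], dtype=float)
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t, y, 1)
    return slope * len(y) + intercept

def estimate_from_resources(cpu_util, a=0.2, b=1.5):
    """Reactive contingency: estimate the high-level metric from resource-usage
    metrics when it cannot be measured directly (coefficients assumed)."""
    return a + b * cpu_util ** 2

def decide_replicas(replicas, history, cpu_util, sla=0.4):
    predicted = forecast_next(history) if len(history) >= 3 else None
    current = history[-1] if history else estimate_from_resources(cpu_util)
    if predicted is not None and predicted > sla:   # scale out ahead of a breach
        return replicas + 1
    if current > sla:                               # reactive fallback
        return replicas + 1
    if predicted is not None and predicted < 0.5 * sla and replicas > 1:
        return replicas - 1                         # scale in when clearly safe
    return replicas

rt_history = [0.20, 0.22, 0.27, 0.33, 0.41]         # rising response time
print(decide_replicas(replicas=3, history=rt_history, cpu_util=0.7))  # -> 4
```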

[AI-38] What do AI-Generated Images Want?

[Quick read]: This paper asks what images themselves "want" in the context of contemporary generative AI image tools, that is, whether images possess a kind of agency or intent rather than being mere products of human intention. Where traditional art theory focuses on human interpretation and creative motivation, the author reframes W.J.T. Mitchell's argument that pictures have agency for the setting of AI image generation. The key claim is that AI-generated images are fundamentally abstract and therefore "want" specificity and concreteness: this stems from the mathematical commensurability between text and image in the training data, and from the representational regress obscured by the pipeline from user input to visual output, which makes generation appear to transform text into image as if by magic while actually resting on deep data mappings.

Link: https://arxiv.org/abs/2510.20350
Authors: Amanda Wasielewski
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:W.J.T. Mitchell’s influential essay ‘What do pictures want?’ shifts the theoretical focus away from the interpretative act of understanding pictures and from the motivations of the humans who create them to the possibility that the picture itself is an entity with agency and wants. In this article, I reframe Mitchell’s question in light of contemporary AI image generation tools to ask: what do AI-generated images want? Drawing from art historical discourse on the nature of abstraction, I argue that AI-generated images want specificity and concreteness because they are fundamentally abstract. Multimodal text-to-image models, which are the primary subject of this article, are based on the premise that text and image are interchangeable or exchangeable tokens and that there is a commensurability between them, at least as represented mathematically in data. The user pipeline that sees textual input become visual output, however, obscures this representational regress and makes it seem like one form transforms into the other – as if by magic.
zh

[AI-39] LLM -empowered knowledge graph construction: A survey

[Quick read]: This survey addresses the paradigm shift that large language models (LLMs) bring to knowledge graph (KG) construction: how LLMs reshape the classical three-stage pipeline of ontology engineering, knowledge extraction, and knowledge fusion. The key lies in a systematic review that contrasts two emerging families of LLM-empowered approaches: schema-based paradigms, which emphasize structure, normalization, and consistency, and schema-free paradigms, which emphasize flexibility, adaptability, and open discovery. By analyzing their technical mechanisms and limitations, the survey maps this fast-evolving landscape and outlines future directions, such as KG-based reasoning for LLMs, dynamic knowledge memory for agentic systems, and multimodal KG construction, bridging symbolic knowledge engineering and neural semantic understanding toward adaptive, explainable, and intelligent knowledge systems.

Link: https://arxiv.org/abs/2510.20345
Authors: Haonan Bian
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Knowledge Graphs (KGs) have long served as a fundamental infrastructure for structured knowledge representation and reasoning. With the advent of Large Language Models (LLMs), the construction of KGs has entered a new paradigm-shifting from rule-based and statistical pipelines to language-driven and generative frameworks. This survey provides a comprehensive overview of recent progress in LLM-empowered knowledge graph construction, systematically analyzing how LLMs reshape the classical three-layered pipeline of ontology engineering, knowledge extraction, and knowledge fusion. We first revisit traditional KG methodologies to establish conceptual foundations, and then review emerging LLM-driven approaches from two complementary perspectives: schema-based paradigms, which emphasize structure, normalization, and consistency; and schema-free paradigms, which highlight flexibility, adaptability, and open discovery. Across each stage, we synthesize representative frameworks, analyze their technical mechanisms, and identify their limitations. Finally, the survey outlines key trends and future research directions, including KG-based reasoning for LLMs, dynamic knowledge memory for agentic systems, and multimodal KG construction. Through this systematic review, we aim to clarify the evolving interplay between LLMs and knowledge graphs, bridging symbolic knowledge engineering and neural semantic understanding toward the development of adaptive, explainable, and intelligent knowledge systems.
zh

[AI-40] Collateral Damage Assessment Model for AI System Target Engagement in Military Operations

[Quick read]: This paper addresses collateral damage assessment when AI systems are engaged as targets in military operations, so that targeting decisions remain responsible and transparent. The key to the solution is a novel collateral damage assessment model that integrates temporal, spatial, and force dimensions within a unified Knowledge Representation and Reasoning (KRR) architecture, built following a design science methodology. Its layered structure captures the categories and architectural components of the AI systems to be engaged, together with the corresponding engagement vectors and contextual factors, and incorporates spreading, severity, likelihood, and evaluation metrics with transparent reasoning mechanisms, providing a foundation for responsible and trustworthy intelligent systems that assess the effects of engaging AI systems in military operations.

Link: https://arxiv.org/abs/2510.20337
Authors: Clara Maathuis,Kasper Cools
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at MILCOM 2025 WS07

Click to view abstract

Abstract:In an era where AI (Artificial Intelligence) systems play an increasing role in the battlefield, ensuring responsible targeting demands rigorous assessment of potential collateral effects. In this context, a novel collateral damage assessment model for target engagement of AI systems in military operations is introduced. The model integrates temporal, spatial, and force dimensions within a unified Knowledge Representation and Reasoning (KRR) architecture following a design science methodological approach. Its layered structure captures the categories and architectural components of the AI systems to be engaged together with corresponding engaging vectors and contextual aspects. At the same time, spreading, severity, likelihood, and evaluation metrics are considered in order to provide a clear representation enhanced by transparent reasoning mechanisms. Further, the model is demonstrated and evaluated through instantiation which serves as a basis for further dedicated efforts that aim at building responsible and trustworthy intelligent systems for assessing the effects produced by engaging AI systems in military operations.
zh

[AI-41] GhostEI-Bench: Do Mobile Agents Resilience to Environmental Injection in Dynamic On-Device Environments?

[Quick read]: This paper addresses a new threat to vision-language models (VLMs) deployed as autonomous agents on mobile devices, which operate amid notifications, pop-ups, and inter-app interactions: environmental injection. Unlike prompt-based attacks that manipulate textual instructions, these attacks insert adversarial UI elements directly into the GUI (e.g., deceptive overlays or spoofed notifications) that corrupt the agent's visual perception, bypass text-level safeguards, and can cause privacy leakage, financial loss, or irreversible device compromise. The key to the solution is GhostEI-Bench, the first benchmark for evaluating mobile agents under environmental injection in dynamic, executable environments: it injects adversarial events into realistic application workflows inside fully operational Android emulators, and a judge-LLM protocol performs fine-grained failure analysis over the agent's action trajectory and screenshot sequence, pinpointing failures in perception, recognition, or reasoning. Experiments on state-of-the-art agents reveal pronounced vulnerability to manipulated UIs, and the benchmark provides a quantitative framework for studying and mitigating this emerging threat.

Link: https://arxiv.org/abs/2510.20333
Authors: Chiyu Chen,Xinhao Song,Yunkai Chai,Yang Yao,Haodong Zhao,Lijun Li,Jie Li,Yan Teng,Gongshen Liu,Yingchun Wang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Vision-Language Models (VLMs) are increasingly deployed as autonomous agents to navigate mobile graphical user interfaces (GUIs). Operating in dynamic on-device ecosystems, which include notifications, pop-ups, and inter-app interactions, exposes them to a unique and underexplored threat vector: environmental injection. Unlike prompt-based attacks that manipulate textual instructions, environmental injection corrupts an agent’s visual perception by inserting adversarial UI elements (for example, deceptive overlays or spoofed notifications) directly into the GUI. This bypasses textual safeguards and can derail execution, causing privacy leakage, financial loss, or irreversible device compromise. To systematically evaluate this threat, we introduce GhostEI-Bench, the first benchmark for assessing mobile agents under environmental injection attacks within dynamic, executable environments. Moving beyond static image-based assessments, GhostEI-Bench injects adversarial events into realistic application workflows inside fully operational Android emulators and evaluates performance across critical risk scenarios. We further propose a judge-LLM protocol that conducts fine-grained failure analysis by reviewing the agent’s action trajectory alongside the corresponding screenshot sequence, pinpointing failure in perception, recognition, or reasoning. Comprehensive experiments on state-of-the-art agents reveal pronounced vulnerability to deceptive environmental cues: current models systematically fail to perceive and reason about manipulated UIs. GhostEI-Bench provides a framework for quantifying and mitigating this emerging threat, paving the way toward more robust and secure embodied agents.

[AI-42] Bias by Design? How Data Practices Shape Fairness in AI Healthcare Systems

【Quick Read】: This paper addresses the barriers to integrating artificial intelligence (AI) into real-world healthcare practice that stem from inadequate quality and fairness of training data. The key is to identify and counter the multiple biases that arise during clinical data collection, including historical, representation, and measurement biases, which manifest across dimensions such as sex, gender, age, socioeconomic status, equipment differences, and labeling consistency. The study emphasizes improving the design of clinical problems and data-collection processes to enhance the fairness and robustness of AI systems, thereby enabling more reliable and trustworthy AI applications in healthcare.

Link: https://arxiv.org/abs/2510.20332
Authors: Anna Arias-Duart, Maria Eugenia Cardello, Atia Cortés
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 tables, accepted in AEQUITAS 2025 (not in proceedings)

Abstract:Artificial intelligence (AI) holds great promise for transforming healthcare. However, despite significant advances, the integration of AI solutions into real-world clinical practice remains limited. A major barrier is the quality and fairness of training data, which is often compromised by biased data collection practices. This paper draws on insights from the AI4HealthyAging project, part of Spain's national R&D initiative, where our task was to detect biases during clinical data collection. We identify several types of bias across multiple use cases, including historical, representation, and measurement biases. These biases manifest in variables such as sex, gender, age, habitat, socioeconomic status, equipment, and labeling. We conclude with practical recommendations for improving the fairness and robustness of clinical problem design and data collection. We hope that our findings and experience contribute to guiding future projects in the development of fairer AI systems in healthcare.

[AI-43] MemER: Scaling Up Memory for Robot Control via Experience Retrieval

【Quick Read】: This paper aims to endow robot policies with long-term memory so that robots can effectively exploit past experience over long-horizon tasks. Existing approaches either condition directly on excessively long observation histories, which is computationally expensive and brittle under distribution shift, or subsample history indiscriminately, introducing redundant or irrelevant information. The key solution is a hierarchical policy framework in which a high-level policy selects and tracks relevant keyframes from experience and combines them with the most recent frames to generate text instructions for a low-level policy to execute. This design is compatible with existing vision-language-action (VLA) models and enables efficient reasoning over long-horizon dependencies.

Link: https://arxiv.org/abs/2510.20328
Authors: Ajay Sridhar, Jennifer Pan, Satvik Sharma, Chelsea Finn
Institution: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Project page: this https URL

Abstract:Humans routinely rely on memory to perform tasks, yet most robot policies lack this capability; our goal is to endow robot policies with the same ability. Naively conditioning on long observation histories is computationally expensive and brittle under covariate shift, while indiscriminate subsampling of history leads to irrelevant or redundant information. We propose a hierarchical policy framework, where the high-level policy is trained to select and track previous relevant keyframes from its experience. The high-level policy uses selected keyframes and the most recent frames when generating text instructions for a low-level policy to execute. This design is compatible with existing vision-language-action (VLA) models and enables the system to efficiently reason over long-horizon dependencies. In our experiments, we finetune Qwen2.5-VL-7B-Instruct and π0.5 as the high-level and low-level policies respectively, using demonstrations supplemented with minimal language annotations. Our approach, MemER, outperforms prior methods on three real-world long-horizon robotic manipulation tasks that require minutes of memory. Videos and code can be found at this https URL.

[AI-44] LEGO: A Lightweight and Efficient Multiple-Attribute Unlearning Framework for Recommender Systems

【Quick Read】: This paper tackles the challenge of unlearning multiple sensitive attributes in recommender systems, where existing single-attribute methods can neither handle multiple unlearning requests simultaneously (CH1) nor adapt efficiently to dynamic unlearning demands (CH2). The key solution is LEGO, a lightweight and efficient framework that splits unlearning into two steps: i) Embedding Calibration removes attribute-specific information by minimizing mutual information, enabling multiple attributes to be unlearned in parallel; and ii) Flexible Combination merges the calibrated embeddings into a single embedding that protects all sensitive attributes, preserving privacy while improving efficiency. The design provides a theoretical guarantee of simultaneous unlearning and, through its modular two-step structure, strengthens adaptability to dynamic requirements.

Link: https://arxiv.org/abs/2510.20327
Authors: Fengyuan Yu, Yuyuan Li, Xiaohua Feng, Junjie Fang, Tao Wang, Chaochao Chen
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by ACM Multimedia 2025

Abstract:With the growing demand for safeguarding sensitive user information in recommender systems, recommendation attribute unlearning is receiving increasing attention. Existing studies predominantly focus on single-attribute unlearning. However, privacy protection requirements in the real world often involve multiple sensitive attributes and are dynamic. Existing single-attribute unlearning methods cannot meet these real-world requirements due to i) CH1: the inability to handle multiple unlearning requests simultaneously, and ii) CH2: the lack of efficient adaptability to dynamic unlearning needs. To address these challenges, we propose LEGO, a lightweight and efficient multiple-attribute unlearning framework. Specifically, we divide the multiple-attribute unlearning process into two steps: i) Embedding Calibration removes information related to a specific attribute from user embedding, and ii) Flexible Combination combines these embeddings into a single embedding, protecting all sensitive attributes. We frame the unlearning process as a mutual information minimization problem, providing LEGO a theoretical guarantee of simultaneous unlearning, thereby addressing CH1. With the two-step framework, where Embedding Calibration can be performed in parallel and Flexible Combination is flexible and efficient, we address CH2. Extensive experiments on three real-world datasets across three representative recommendation models demonstrate the effectiveness and efficiency of our proposed framework. Our code and appendix are available at this https URL.
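The two-step structure is easy to picture in code. Below is a minimal PyTorch sketch of per-attribute Embedding Calibration followed by a Flexible Combination; the adversarial probe here is only a stand-in for the paper's mutual-information minimization, and all sizes, names, and the averaging rule are invented for illustration:

```python
# Minimal sketch of LEGO's two-step shape (Embedding Calibration + Flexible
# Combination). The adversarial probe is a proxy for MI minimization; this is
# not the authors' code.
import torch
import torch.nn as nn

DIM, N_USERS, N_ATTRS = 64, 512, 2                       # hypothetical sizes
user_emb = torch.randn(N_USERS, DIM)                     # pretrained user embeddings
attrs = [torch.randint(0, 2, (N_USERS,)) for _ in range(N_ATTRS)]  # binary attributes

calibrators = [nn.Linear(DIM, DIM) for _ in range(N_ATTRS)]
probes = [nn.Linear(DIM, 2) for _ in range(N_ATTRS)]     # adversaries guessing the attribute

def calibrate(k: int, steps: int = 200) -> torch.Tensor:
    """Step i: strip attribute k from the embedding (MI-minimization proxy)."""
    opt_c = torch.optim.Adam(calibrators[k].parameters(), lr=1e-3)
    opt_p = torch.optim.Adam(probes[k].parameters(), lr=1e-3)
    ce = nn.CrossEntropyLoss()
    for _ in range(steps):
        z = calibrators[k](user_emb)
        # probe learns to recover the attribute from the calibrated embedding
        loss_p = ce(probes[k](z.detach()), attrs[k])
        opt_p.zero_grad(); loss_p.backward(); opt_p.step()
        # calibrator pushes the probe toward chance by maximizing its loss
        loss_c = -ce(probes[k](calibrators[k](user_emb)), attrs[k])
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    return calibrators[k](user_emb).detach()

# Step ii: Flexible Combination merges calibrated embeddings into one.
calibrated = [calibrate(k) for k in range(N_ATTRS)]      # parallelizable per attribute
combined = torch.stack(calibrated).mean(dim=0)           # simplest combination rule
print(combined.shape)                                    # torch.Size([512, 64])
```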

[AI-45] Enhancing Security in Deep Reinforcement Learning: A Comprehensive Survey on Adversarial Attacks and Defenses

【Quick Read】: This survey addresses the security and robustness of deep reinforcement learning (DRL) in dynamic, changing environments, especially the severe performance degradation and potentially dangerous decisions DRL can exhibit under adversarial attacks. Its key contribution is a classification framework for adversarial attacks based on perturbation type and attack target, together with a detailed review of mainstream attacks on DRL, covering perturbations of the state space, action space, reward function, and model space. It then systematically summarizes robustness-oriented training strategies, including adversarial training, competitive training, robust learning, adversarial detection, and defense distillation, aimed at improving DRL stability in security-sensitive scenarios.

Link: https://arxiv.org/abs/2510.20314
Authors: Wu Yichao, Wang Yirui, Ding Panpan, Wang Hailong, Zhu Bingqian, Liu Chun
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:With the wide application of deep reinforcement learning (DRL) techniques in complex fields such as autonomous driving, intelligent manufacturing, and smart healthcare, how to improve its security and robustness in dynamic and changeable environments has become a core issue in current research. Especially in the face of adversarial attacks, DRL may suffer serious performance degradation or even make potentially dangerous decisions, so it is crucial to ensure its stability in security-sensitive scenarios. In this paper, we first introduce the basic framework of DRL and analyze the main security challenges faced in complex and changing environments. In addition, this paper proposes an adversarial attack classification framework based on perturbation type and attack target, and reviews the mainstream adversarial attack methods against DRL in detail, covering perturbations of the state space, action space, reward function, and model space. To effectively counter these attacks, this paper systematically summarizes current robustness training strategies, including adversarial training, competitive training, robust learning, adversarial detection, defense distillation, and other related defense techniques; we also discuss the advantages and shortcomings of these methods in improving the robustness of DRL. Finally, this paper looks into future research directions for DRL in adversarial environments, emphasizing the research needs in terms of improving generalization, reducing computational complexity, and enhancing scalability and explainability, aiming to provide valuable references and directions for researchers.

[AI-46] Multi-Step Reasoning for Embodied Question Answering via Tool Augmentation

【Quick Read】: This paper addresses the inefficient exploration and limited answer accuracy of existing embodied question answering (EQA) methods, which lack explicit reasoning and planning. Existing approaches typically have VLMs explore the environment and answer questions directly, an end-to-end style that limits logical reasoning and leads to redundant exploration and ineffective responses. The key solution is ToolEQA, an agent framework that combines external tools with multi-step reasoning: tools supply more useful information for completing the task, helping the model derive a better exploration direction at each reasoning step, so it obtains more accurate answers with a shorter exploration distance. The authors also design an automated EQA data-generation pipeline and use it to build the EQA-RT dataset of about 18K tasks with reasoning trajectories, supporting training and evaluation of multi-step reasoning and tool use and yielding significant success-rate gains across multiple benchmarks.

Link: https://arxiv.org/abs/2510.20310
Authors: Mingliang Zhai, Hansheng Liang, Xiaomeng Fan, Zhi Gao, Chuanhao Li, Che Sun, Xu Bin, Yuwei Wu, Yunde Jia
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 16 pages, 7 figures, 8 tables

Abstract:Embodied Question Answering (EQA) requires agents to explore 3D environments to obtain observations and answer questions related to the scene. Existing methods leverage VLMs to directly explore the environment and answer questions without explicit thinking or planning, which limits their reasoning ability and results in excessive or inefficient exploration as well as ineffective responses. In this paper, we introduce ToolEQA, an agent that integrates external tools with multi-step reasoning, where external tools can provide more useful information for completing the task, helping the model derive better exploration directions in the next step of reasoning and thus obtain additional effective information. This enables ToolEQA to generate more accurate responses with a shorter exploration distance. To enhance the model's ability for tool-usage and multi-step reasoning, we further design a novel EQA data generation pipeline that automatically constructs large-scale EQA tasks with reasoning trajectories and corresponding answers. Based on the pipeline, we collect the EQA-RT dataset that contains about 18K tasks, divided into a training set EQA-RT-Train, and two test sets EQA-RT-Seen (scenes overlapping with the training set) and EQA-RT-Unseen (novel scenes). Experiments on EQA-RT-Seen and EQA-RT-Unseen show that ToolEQA improves the success rate by 9.2-20.2% over state-of-the-art baselines, while outperforming the zero-shot ToolEQA by 10% in success rate. In addition, ToolEQA also achieves state-of-the-art performance on the HM-EQA, OpenEQA, and EXPRESS-Bench datasets, demonstrating its generality. Our homepage is at this https URL.

[AI-47] DB-FGA-Net: Dual Backbone Frequency Gated Attention Network for Multi-Class Classification with Grad-CAM Interpretability

【Quick Read】: This paper addresses the limited generalization and clinical trustworthiness of deep learning models for brain tumor diagnosis that depend heavily on data augmentation. The key solution is a double-backbone network (DB-FGA-Net) that fuses VGG16 and Xception to capture complementary local and global features, enhanced by a Frequency-Gated Attention (FGA) block. The model reaches high accuracy without any augmentation (e.g., 99.24% on the 4-class setting of the 7K-DS dataset) and integrates Grad-CAM to visualize the tumor regions behind each prediction, improving interpretability and supporting clinical deployment.

Link: https://arxiv.org/abs/2510.20299
Authors: Saraf Anzum Shreya, MD. Abu Ismail Siddique, Sharaf Tasnim
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 25 pages, 14 figures, 12 tables

Abstract:Brain tumors are a challenging problem in neuro-oncology, where early and precise diagnosis is important for successful treatment. Deep learning-based brain tumor classification methods often rely on heavy data augmentation, which can limit generalization and trust in clinical applications. In this paper, we propose a double-backbone network integrating VGG16 and Xception with a Frequency-Gated Attention (FGA) Block to capture complementary local and global features. Unlike previous studies, our model achieves state-of-the-art performance without augmentation, demonstrating robustness to variably sized and distributed datasets. For further transparency, Grad-CAM is integrated to visualize the tumor regions on which the model bases its predictions, bridging the gap between model prediction and clinical interpretability. The proposed framework achieves 99.24% accuracy on the 7K-DS dataset for the 4-class setting, along with 98.68% and 99.85% in the 3-class and 2-class settings, respectively. On the independent 3K-DS dataset, the model generalizes with 95.77% accuracy, outperforming baseline and state-of-the-art methods. To further support clinical usability, we developed a graphical user interface (GUI) that provides real-time classification and Grad-CAM-based tumor localization. These findings suggest that augmentation-free, interpretable, and deployable deep learning models such as DB-FGA-Net hold strong potential for reliable clinical translation in brain tumor diagnosis.

[AI-48] RAG-Stack: Co-Optimizing RAG Quality and Performance From the Vector Database Perspective

【Quick Read】: This paper addresses the difficulty of jointly optimizing quality and performance in retrieval-augmented generation (RAG) systems, a problem that is especially hard in end-to-end RAG pipelines because of the many knobs on both the algorithmic side (models and vector databases) and the systems side (from software to hardware). The key is the three-pillar RAG-Stack blueprint: RAG-IR, an intermediate representation that acts as an abstraction layer decoupling quality from performance; RAG-CM, a cost model that estimates system performance from a RAG-IR; and RAG-PE, a plan-exploration algorithm that searches the configuration space for RAG plans combining high-quality generation with high-performance execution. Together they offer a systematic methodology for quality-performance co-optimization of RAG systems.

Link: https://arxiv.org/abs/2510.20296
Authors: Wenqi Jiang
Institution: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI)
Comments:

Abstract:Retrieval-augmented generation (RAG) has emerged as one of the most prominent applications of vector databases. By integrating documents retrieved from a database into the prompt of a large language model (LLM), RAG enables more reliable and informative content generation. While there has been extensive research on vector databases, many open research problems remain once they are considered in the wider context of end-to-end RAG pipelines. One practical yet challenging problem is how to jointly optimize both system performance and generation quality in RAG, which is significantly more complex than it appears due to the numerous knobs on both the algorithmic side (spanning models and databases) and the systems side (from software to hardware). In this paper, we present RAG-Stack, a three-pillar blueprint for quality-performance co-optimization in RAG systems. RAG-Stack comprises: (1) RAG-IR, an intermediate representation that serves as an abstraction layer to decouple quality and performance aspects; (2) RAG-CM, a cost model for estimating system performance given an RAG-IR; and (3) RAG-PE, a plan exploration algorithm that searches for high-quality, high-performance RAG configurations. We believe this three-pillar blueprint will become the de facto paradigm for RAG quality-performance co-optimization in the years to come.
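To make the three pillars concrete, here is a toy sketch of what an IR, a cost model, and a plan-exploration loop could look like; every field, cost constant, and quality score below is a made-up placeholder, not the paper's actual design:

```python
# Toy rendering of the RAG-Stack pillars: an intermediate representation
# (RAG-IR), a cost model (RAG-CM), and plan exploration (RAG-PE).
# All numbers and names are invented to illustrate the search structure.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class RagIR:                      # RAG-IR: abstract plan description
    index: str                    # e.g. vector index type
    top_k: int                    # documents retrieved per query
    model: str                    # generator LLM

def cost_model(ir: RagIR) -> float:          # RAG-CM: latency estimate (ms)
    index_ms = {"flat": 50.0, "ivf": 12.0, "hnsw": 5.0}[ir.index]
    gen_ms = {"small-llm": 200.0, "large-llm": 900.0}[ir.model]
    return index_ms * ir.top_k + gen_ms

def quality_model(ir: RagIR) -> float:       # stand-in for measured quality
    recall = {"flat": 1.0, "ivf": 0.92, "hnsw": 0.95}[ir.index]
    return recall * min(1.0, ir.top_k / 8) * (0.7 if ir.model == "small-llm" else 0.9)

def explore(latency_budget_ms: float) -> RagIR:   # RAG-PE: best plan in budget
    plans = (RagIR(i, k, m) for i, k, m in
             product(["flat", "ivf", "hnsw"], [2, 4, 8, 16], ["small-llm", "large-llm"]))
    feasible = [p for p in plans if cost_model(p) <= latency_budget_ms]
    return max(feasible, key=quality_model)

print(explore(latency_budget_ms=500.0))
```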

[AI-49] Classical Feature Embeddings Help in BERT-Based Human Mobility Prediction

【Quick Read】: This paper addresses the failure of existing human-mobility prediction models to exploit the rich semantics of points of interest (POIs), which limits accuracy in applications such as disaster relief, urban planning, and public health. The key solution is STaBERT (Semantic-Temporal aware BERT), which incorporates derived temporal descriptors and POI embeddings to fuse semantic and temporal information into the representation of each location, building a unified, semantically enriched mobility representation that markedly improves prediction in both single-city and multi-city settings.

Link: https://arxiv.org/abs/2510.20275
Authors: Yunzhi Liu, Haokai Tan, Rushi Kanjaria, Lihuan Li, Flora D. Salim
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by ACM SIGSPATIAL 2025 as a short paper

Abstract:Human mobility forecasting is crucial for disaster relief, city planning, and public health. However, existing models either only model location sequences or include time information merely as auxiliary input, thereby failing to leverage the rich semantic context provided by points of interest (POIs). To address this, we enrich a BERT-based mobility model with derived temporal descriptors and POI embeddings to better capture the semantics underlying human movement. We propose STaBERT (Semantic-Temporal aware BERT), which integrates both POI and temporal information at each location to construct a unified, semantically enriched representation of mobility. Experimental results show that STaBERT significantly improves prediction accuracy: for single-city prediction, the GEO-BLEU score improved from 0.34 to 0.75; for multi-city prediction, from 0.34 to 0.56.

[AI-50] Limits of PRM-Guided Tree Search for Mathematical Reasoning with LLMs

【Quick Read】: This paper examines the limits of chain-of-thought prompting with Best-of-N (BoN) selection for mathematical reasoning, whose linear structure fails to capture the branching, exploratory nature of complex problem solving. The key contribution is an adaptive algorithm for maximizing process reward model (PRM) scores over an intractable action space, used to investigate whether PRM-guided tree search that explores multiple partial solution paths can improve LLM mathematical reasoning; the study finds that, with current PRMs, tree search yields no statistically significant gains over BoN, largely because PRM scores are unreliable deep into a reasoning trace and generalize poorly out of distribution.

Link: https://arxiv.org/abs/2510.20272
Authors: Tristan Cinquin, Geoff Pleiss, Agustinus Kristiadi
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:While chain-of-thought prompting with Best-of-N (BoN) selection has become popular for mathematical reasoning in large language models (LLMs), its linear structure fails to capture the branching and exploratory nature of complex problem-solving. In this work, we propose an adaptive algorithm to maximize process reward model (PRM) scores over the intractable action space, and investigate whether PRM-guided tree search can improve mathematical reasoning by exploring multiple partial solution paths. Across 23 diverse mathematical problems using Qwen2.5-Math-7B-Instruct with its associated PRM as a case study, we find that: (1) PRM-guided tree search shows no statistically significant improvements over BoN despite higher costs, (2) Monte Carlo tree search and beam search outperform other PRM-guided tree search methods, (3) PRMs poorly approximate state values and their reliability degrades with reasoning depth, and (4) PRMs generalize poorly out of distribution. This underperformance stems from tree search’s greater reliance on unreliable PRM scores, suggesting different reward modeling is necessary before tree search can effectively enhance mathematical reasoning in LLMs.
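For readers unfamiliar with the setup, the following sketch shows the skeleton of PRM-guided beam search, one of the tree-search variants the paper evaluates; `propose_steps` and `prm_score` are stubs standing in for an LLM step generator and a process reward model:

```python
# Skeleton of PRM-guided beam search over partial reasoning paths.
# Both functions below are stubs; only the search structure is the point.
import random

def propose_steps(prefix: list[str], n: int = 4) -> list[str]:
    return [f"step{len(prefix)}-{i}" for i in range(n)]   # stub LLM proposals

def prm_score(prefix: list[str]) -> float:
    return random.random()          # stub: PRM score of a partial solution

def beam_search(depth: int = 5, beam: int = 3) -> list[str]:
    beams: list[list[str]] = [[]]
    for _ in range(depth):
        candidates = [b + [s] for b in beams for s in propose_steps(b)]
        # keep the `beam` partial solutions the PRM scores highest
        beams = sorted(candidates, key=prm_score, reverse=True)[:beam]
    return beams[0]

print(beam_search())
```

The paper's finding is precisely that this reliance on intermediate PRM scores is what hurts: the deeper the search, the more the unreliable scores compound.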

[AI-51] Using Large Language Models for Abstraction of Planning Domains - Extended Version

【Quick Read】: This paper addresses the problem of generating abstractions of dynamic domains that align with a given purpose, since the choice of abstraction affects an agent's ability to plan, reason, and explain. The key is to use the in-context learning ability of large language models (LLMs) to synthesize abstract planning domains and problem instances in PDDL from an abstraction objective specified in natural language, covering abstraction over alternative concrete actions, over sequences of actions, and over action/predicate parameters. Experiments show that GPT-4o can generally synthesize useful abstractions in simple settings and is better at abstracting over actions than over the associated predicates and state variables (fluents).

Link: https://arxiv.org/abs/2510.20258
Authors: Bita Banihashemi, Megh Patel, Yves Lespérance
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Generating an abstraction of a dynamic domain that aligns with a given purpose remains a significant challenge given that the choice of such an abstraction can impact an agent’s ability to plan, reason, and provide explanations effectively. We model the agent’s concrete behaviors in PDDL and investigate the use of in-context learning with large language models (LLMs) for the generation of abstract PDDL domains and problem instances, given an abstraction objective specified in natural language. The benchmark examples we use are new and have not been part of the data any LLMs have been trained on. We consider three categories of abstractions: abstraction of choice of alternative concrete actions, abstraction of sequences of concrete actions, and abstraction of action/predicate parameters, as well as combinations of these. The generated abstract PDDL domains and problem instances are then checked by symbolic validation tools as well as human experts. Our experiments show that GPT-4o can generally synthesize useful planning domain abstractions in simple settings, although it is better at abstracting over actions than over the associated fluents.

[AI-52] Towards AI Agents for Course Instruction in Higher Education: Early Experiences from the Field

【Quick Read】: This paper investigates how to integrate generative AI as an instructional agent in higher education so as to improve student engagement and learning quality. The key is the design and deployment of a large language model (LLM)-driven Instructor Agent, together with a pedagogical framework that divides labor between the agent and the human instructor: the agent handles content delivery and real-time interaction, while the human instructor provides course structure and question-answer sessions. The paper also proposes an interpretable analysis framework that quantifies the quality of agent-student interaction through metrics such as topic coverage, topic depth, and turn-level elaboration, enabling systematic monitoring and optimization of inquiry-driven learning in live classrooms and providing a methodological basis for reproducible engagement research.

Link: https://arxiv.org/abs/2510.20255
Authors: Yogesh Simmhan, Varad Kulkarni
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:This article presents early findings from designing, deploying and evaluating an AI-based educational agent deployed as the primary instructor in a graduate-level Cloud Computing course at IISc. We detail the design of a Large Language Model (LLM)-driven Instructor Agent, and introduce a pedagogical framework that integrates the Instructor Agent into the course workflow for actively interacting with the students for content delivery, supplemented by the human instructor to offer the course structure and undertake question–answer sessions. We also propose an analytical framework that evaluates the Agent–Student interaction transcripts using interpretable engagement metrics of topic coverage, topic depth and turn-level elaboration. We report early experiences on how students interact with the Agent to explore concepts, clarify doubts and sustain inquiry-driven dialogue during live classroom sessions. We also report preliminary analysis on our evaluation metrics applied across two successive instructional modules that reveals patterns of engagement evolution, transitioning from broad conceptual exploration to deeper, focused inquiry. These demonstrate how structured integration of conversational AI agents can foster reflective learning, offer a reproducible methodology for studying engagement in authentic classroom settings, and support scalable, high-quality higher education.

[AI-53] Individualized Cognitive Simulation in Large Language Models: Evaluating Different Cognitive Representation Methods

【Quick Read】: This paper addresses the limited ability of generative AI to model deep cognitive processes in individualized cognitive simulation (ICS): current large language models (LLMs) convincingly imitate surface behaviors such as role-play but struggle to reproduce a specific individual's thinking patterns and creative logic. The key is a dataset built from recently published novels (released after the tested LLMs' training cutoffs) and an 11-condition cognitive evaluation framework for benchmarking how different cognitive representations, e.g., linguistic features, concept mappings, and profile-based information, support authorial style emulation. Results show that hybrid representations combining conceptual and linguistic features outperform static profile-based cues, and that LLMs mimic linguistic style better than narrative structure, providing an empirical foundation and technical path toward personalized AI systems that adapt to individual ways of thinking and expression.

Link: https://arxiv.org/abs/2510.20252
Authors: Tianyi Zhang, Xiaolin Zhou, Yunzhe Wang, Erik Cambria, David Traum, Rui Mao
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Individualized cognitive simulation (ICS) aims to build computational models that approximate the thought processes of specific individuals. While large language models (LLMs) convincingly mimic surface-level human behavior such as role-play, their ability to simulate deeper individualized cognitive processes remains poorly understood. To address this gap, we introduce a novel task that evaluates different cognitive representation methods in ICS. We construct a dataset from recently published novels (later than the release date of the tested LLMs) and propose an 11-condition cognitive evaluation framework to benchmark seven off-the-shelf LLMs in the context of authorial style emulation. We hypothesize that effective cognitive representations can help LLMs generate storytelling that better mirrors the original author. Thus, we test different cognitive representations, e.g., linguistic features, concept mappings, and profile-based information. Results show that combining conceptual and linguistic features is particularly effective in ICS, outperforming static profile-based cues in overall evaluation. Importantly, LLMs are more effective at mimicking linguistic style than narrative structure, underscoring their limits in deeper cognitive simulation. These findings provide a foundation for developing AI systems that adapt to individual ways of thinking and expression, advancing more personalized and human-aligned creative technologies.

[AI-54] What Does It Take to Build a Performant Selective Classifier? NEURIPS2025

【Quick Read】: This paper studies the "selective-classification gap" in selective classification, i.e., the shortfall between practical models and an ideal oracle that accepts examples strictly in order of correctness. The key contribution is the first finite-sample decomposition of this gap into five distinct sources: Bayes noise, approximation error, ranking error, statistical noise, and implementation- or shift-induced slack. The analysis shows that monotone post-hoc calibration has limited effect on closing the gap because it rarely alters the underlying score ranking; real progress requires feature-aware scoring mechanisms that can reorder predictions, plus distributionally robust training against data shift. The decomposition yields a quantitative error budget and actionable design guidelines for practitioners who want to approximate oracle behavior more closely.

Link: https://arxiv.org/abs/2510.20242
Authors: Stephan Rabanser, Nicolas Papernot
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

Abstract:Selective classifiers improve model reliability by abstaining on inputs the model deems uncertain. However, few practical approaches achieve the gold-standard performance of a perfect-ordering oracle that accepts examples exactly in order of correctness. Our work formalizes this shortfall as the selective-classification gap and present the first finite-sample decomposition of this gap to five distinct sources of looseness: Bayes noise, approximation error, ranking error, statistical noise, and implementation- or shift-induced slack. Crucially, our analysis reveals that monotone post-hoc calibration – often believed to strengthen selective classifiers – has limited impact on closing this gap, since it rarely alters the model’s underlying score ranking. Bridging the gap therefore requires scoring mechanisms that can effectively reorder predictions rather than merely rescale them. We validate our decomposition on synthetic two-moons data and on real-world vision and language benchmarks, isolating each error component through controlled experiments. Our results confirm that (i) Bayes noise and limited model capacity can account for substantial gaps, (ii) only richer, feature-aware calibrators meaningfully improve score ordering, and (iii) data shift introduces a separate slack that demands distributionally robust training. Together, our decomposition yields a quantitative error budget as well as actionable design guidelines that practitioners can use to build selective classifiers which approximate ideal oracle behavior more closely.
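The gap itself is easy to measure. A minimal sketch on synthetic data: compare the accuracy-coverage curve induced by the model's confidence ranking with the curve of a perfect-ordering oracle (the construction below is ours, for illustration only):

```python
# Empirical selective-classification gap on synthetic predictions: the area
# between the oracle accuracy-coverage curve (accept correct examples first)
# and the curve induced by the model's own confidence ranking.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
correct = rng.random(n) < 0.8                            # which predictions are right
conf = np.clip(0.6 * correct + 0.4 * rng.random(n), 0, 1)  # noisy confidence scores

def acc_coverage(order: np.ndarray) -> np.ndarray:
    hits = np.cumsum(correct[order])
    return hits / np.arange(1, n + 1)                    # accuracy at each coverage

model_curve = acc_coverage(np.argsort(-conf))            # model's confidence ranking
oracle_curve = acc_coverage(np.argsort(~correct))        # oracle: correct first
gap = float(np.mean(oracle_curve - model_curve))         # mean gap over coverage
print(f"selective-classification gap ~ {gap:.3f}")
```

Note that any monotone recalibration of `conf` leaves `np.argsort(-conf)` unchanged, which is exactly the paper's point about why post-hoc calibration alone cannot close the gap.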

[AI-55] Multi-Objective Reinforcement Learning with Max-Min Criterion: A Game-Theoretic Approach NEURIPS2025

【Quick Read】: This paper addresses the convergence difficulties of multi-objective reinforcement learning (MORL) under the max-min criterion, where existing methods often lack theoretical guarantees or incur high computational cost, making stable and efficient policy updates hard to achieve. The key is a game-theoretic reformulation of max-min MORL as a two-player zero-sum regularized continuous game, together with an efficient mirror-descent-based algorithm that simplifies the policy update while guaranteeing global last-iterate convergence. Adaptive regularization further boosts performance; experiments confirm the predicted convergence in tabular settings and show that the deep RL implementation significantly outperforms prior baselines in many MORL environments.

Link: https://arxiv.org/abs/2510.20235
Authors: Woohyeon Byeon, Giseung Park, Jongseong Chae, Amir Leshem, Youngchul Sung
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to NeurIPS 2025

Abstract:In this paper, we propose a provably convergent and practical framework for multi-objective reinforcement learning with max-min criterion. From a game-theoretic perspective, we reformulate max-min multi-objective reinforcement learning as a two-player zero-sum regularized continuous game and introduce an efficient algorithm based on mirror descent. Our approach simplifies the policy update while ensuring global last-iterate convergence. We provide a comprehensive theoretical analysis on our algorithm, including iteration complexity under both exact and approximate policy evaluations, as well as sample complexity bounds. To further enhance performance, we modify the proposed algorithm with adaptive regularization. Our experiments demonstrate the convergence behavior of the proposed algorithm in tabular settings, and our implementation for deep reinforcement learning significantly outperforms previous baselines in many MORL environments.
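As a toy illustration of the two-player view, the sketch below runs entropic mirror-descent (multiplicative-weights) updates for a policy player and an objective-weight player on a made-up reward matrix; the paper's algorithm, regularization, and convergence analysis are of course richer than this:

```python
# Toy mirror-descent dynamics for max-min MORL as a two-player zero-sum game:
# the policy maximizes the weighted reward, the weight player concentrates on
# the worst-off objective. The payoff matrix is invented for illustration.
import numpy as np

R = np.array([[1.0, 0.2],          # rewards: actions x objectives
              [0.3, 0.9],
              [0.6, 0.6]])
pi = np.ones(3) / 3                # policy over actions
w = np.ones(2) / 2                 # adversarial weights over objectives
eta = 0.5                          # step size

for _ in range(500):
    pi *= np.exp(eta * R @ w)      # entropic mirror step = multiplicative update
    pi /= pi.sum()
    w *= np.exp(-eta * R.T @ pi)   # weights move toward the worst objective
    w /= w.sum()

print("policy:", pi.round(3), "min objective value:", float((R.T @ pi).min()))
```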

[AI-56] Federated Learning via Meta-Variational Dropout NEURIPS

【Quick Read】: This paper addresses two key problems in practical federated learning (FL): model overfitting caused by limited, non-IID client data, and divergence of local models that hampers convergence. The key solution is a novel Bayesian meta-learning approach, meta-variational dropout (MetaVD), which learns to predict client-dependent dropout rates via a shared hypernetwork, enabling effective personalization of FL algorithms in limited non-IID settings. MetaVD improves classification accuracy and uncertainty calibration, especially for out-of-distribution (OOD) clients, and by compressing the local model parameters needed per client it mitigates overfitting and reduces communication cost.

Link: https://arxiv.org/abs/2510.20225
Authors: Insu Jeon, Minui Hong, Junhyeog Yun, Gunhee Kim
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Published in the Proceedings of the Advances in Neural Information Processing Systems (NeurIPS) 2023, Main Conference Track

Abstract:Federated Learning (FL) aims to train a global inference model from remotely distributed clients, gaining popularity due to its benefit of improving data privacy. However, traditional FL often faces challenges in practical applications, including model overfitting and divergent local models due to limited and non-IID data among clients. To address these issues, we introduce a novel Bayesian meta-learning approach called meta-variational dropout (MetaVD). MetaVD learns to predict client-dependent dropout rates via a shared hypernetwork, enabling effective model personalization of FL algorithms in limited non-IID data settings. We also emphasize the posterior adaptation view of meta-learning and the posterior aggregation view of Bayesian FL via the conditional dropout posterior. We conducted extensive experiments on various sparse and non-IID FL datasets. MetaVD demonstrated excellent classification accuracy and uncertainty calibration performance, especially for out-of-distribution (OOD) clients. MetaVD compresses the local model parameters needed for each client, mitigating model overfitting and reducing communication costs. Code is available at this https URL.

[AI-57] QKCV Attention: Enhancing Time Series Forecasting with Static Categorical Embeddings for Both Lightweight and Pre-trained Foundation Models

【Quick Read】: This paper addresses the difficulty of injecting category information into attention mechanisms for time series forecasting, which limits a model's ability to capture inherent data patterns. The key is QKCV (Query-Key-Category-Value) attention, which extends the traditional QKV framework with a static categorical embedding C to explicitly emphasize category-specific information. As a versatile plug-in, QKCV improves the accuracy of various attention-based forecasters (e.g., Vanilla Transformer, Informer, PatchTST, TFT); moreover, when fine-tuning a univariate time series foundation model, updating only the static embedding C while keeping the pretrained weights frozen reduces computational overhead and achieves superior fine-tuning performance.

Link: https://arxiv.org/abs/2510.20222
Authors: Hao Wang, Baojun Ma
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures

Abstract:In real-world time series forecasting tasks, category information plays a pivotal role in capturing inherent data patterns. This paper introduces QKCV (Query-Key-Category-Value) attention, an extension of the traditional QKV framework that incorporates a static categorical embedding C to emphasize category-specific information. As a versatile plug-in module, QKCV enhances the forecasting accuracy of attention-based models (e.g., Vanilla Transformer, Informer, PatchTST, TFT) across diverse real-world datasets. Furthermore, QKCV demonstrates remarkable adaptability in fine-tuning univariate time series foundation model by solely updating the static embedding C while preserving pretrained weights, thereby reducing computational overhead and achieving superior fine-tuning performance.
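One plausible reading of the mechanism, sketched in PyTorch: standard scaled dot-product attention whose keys are augmented with a static categorical embedding C. The exact placement of C in the paper may differ; shapes and names here are our assumptions:

```python
# Sketch of category-aware attention in the QKCV spirit: a static categorical
# embedding C is injected into the keys so attention logits become
# category-aware. Placement of C is an assumption, not the paper's spec.
import torch
import torch.nn.functional as F

B, T, D, N_CAT = 2, 16, 32, 5
x = torch.randn(B, T, D)                        # input series features
cat_ids = torch.randint(0, N_CAT, (B,))         # one static category per series

Wq, Wk, Wv = (torch.nn.Linear(D, D) for _ in range(3))
C = torch.nn.Embedding(N_CAT, D)                # static categorical embedding

q, k, v = Wq(x), Wk(x), Wv(x)
k = k + C(cat_ids)[:, None, :]                  # inject the category into the keys
attn = F.softmax(q @ k.transpose(-2, -1) / D ** 0.5, dim=-1)
out = attn @ v
print(out.shape)                                # torch.Size([2, 16, 32])
```

Under this reading, the foundation-model fine-tuning described in the abstract would amount to training only the `C` table while freezing every other parameter.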

[AI-58] High-order Interactions Modeling for Interpretable Multi-Agent Q-Learning

【Quick Read】: This paper addresses the difficulty of modeling high-order agent interactions in multi-agent reinforcement learning (MARL), where prior approaches suffer from combinatorial explosion or opaque black-box network structures, limiting scalability and interpretability. The key is a new value-decomposition framework, Continued Fraction Q-Learning (QCoFr), which flexibly captures agent interactions of arbitrary order with only linear complexity O(n) in the number of agents, avoiding combinatorial explosion when modeling rich cooperation. It further introduces a variational information bottleneck to extract latent information for credit estimation, helping agents filter out noisy interactions and significantly improving both cooperation and interpretability.

Link: https://arxiv.org/abs/2510.20218
Authors: Qinyu Xu, Yuanyang Zhu, Xuefei Wu, Chunlin Chen
Institution: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: 39th Conference on Neural Information Processing Systems

Abstract:The ability to model interactions among agents is crucial for effective coordination and understanding their cooperation mechanisms in multi-agent reinforcement learning (MARL). However, previous efforts to model high-order interactions have been primarily hindered by the combinatorial explosion or the opaque nature of their black-box network structures. In this paper, we propose a novel value decomposition framework, called Continued Fraction Q-Learning (QCoFr), which can flexibly capture arbitrary-order agent interactions with only linear complexity O(n) in the number of agents, thus avoiding the combinatorial explosion when modeling rich cooperation. Furthermore, we introduce the variational information bottleneck to extract latent information for estimating credits. This latent information helps agents filter out noisy interactions, thereby significantly enhancing both cooperation and interpretability. Extensive experiments demonstrate that QCoFr not only consistently achieves better performance but also provides interpretability that aligns with our theoretical analysis.
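The continued-fraction form suggests an O(n) evaluation scheme like the following sketch; in the actual framework the coefficients would be learned, so treat this purely as a shape illustration under our own notation:

```python
# Shape sketch of a continued-fraction mixer in the QCoFr spirit: per-agent
# utilities q_i are folded into Q_tot one agent at a time, giving O(n)
# evaluation. Fixed scalar coefficients stand in for learned ones.
def q_total(q: list[float], c: list[float], eps: float = 1e-6) -> float:
    """Q_tot = q1 + c1 / (q2 + c2 / (q3 + ...)), evaluated back to front."""
    acc = q[-1]
    for qi, ci in zip(reversed(q[:-1]), reversed(c)):
        acc = qi + ci / (acc + eps)     # eps keeps the recursion well-defined
    return acc

q_agents = [0.5, 1.2, -0.3, 0.8]        # individual agent utilities (n = 4)
coeffs = [0.4, 0.7, 0.2]                # n-1 interaction coefficients
print(q_total(q_agents, coeffs))
```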

[AI-59] Automated Cloud Infrastructure-as-Code Reconciliation with AI Agents

【Quick Read】: This paper addresses "infrastructure drift" in cloud management, which arises when Infrastructure-as-Code (IaC) is mixed with traditional tools (consoles, command-line interfaces, or SDKs): changes made outside the IaC framework leave the actual environment out of sync with the IaC configuration, so later IaC operations may undo valid updates or trigger errors. The key is NSync, a system that automatically detects and reconciles drift from cloud API traces, exploiting the insight that all infrastructure changes ultimately occur via cloud API invocations: an agentic architecture uses LLMs to infer high-level intent from noisy API sequences, synthesizes targeted IaC updates with specialized tools, and continually improves via a self-evolving knowledge base of past reconciliations. Experiments show that NSync clearly outperforms the baseline in both accuracy (pass@3 from 0.71 to 0.97) and token efficiency (a 1.47x improvement).

Link: https://arxiv.org/abs/2510.20211
Authors: Zhenning Yang, Hui Guan, Victor Nicolet, Brandon Paulsen, Joey Dodds, Daniel Kroening, Ang Chen
Institution: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Cloud infrastructure is managed through a mix of interfaces – traditionally, cloud consoles, command-line interfaces (CLI), and SDKs are the tools of choice. Recently, Infrastructure-as-Code/IaC frameworks (e.g., Terraform) have quickly gained popularity. Unlike conventional tools, IaC frameworks encode the infrastructure in a "source-of-truth" configuration. They are capable of automatically carrying out modifications to the cloud – deploying, updating, or destroying resources – to bring the actual infrastructure into alignment with the IaC configuration. However, when IaC is used alongside consoles, CLIs, or SDKs, it loses visibility into external changes, causing infrastructure drift, where the configuration becomes outdated, and later IaC operations may undo valid updates or trigger errors. We present NSync, an automated system for IaC reconciliation that propagates out-of-band changes back into the IaC program. Our key insight is that infrastructure changes eventually all occur via cloud API invocations – the lowest layer for cloud management operations. NSync gleans insights from API traces to detect drift (i.e., non-IaC changes) and reconcile it (i.e., update the IaC configuration to capture the changes). It employs an agentic architecture that leverages LLMs to infer high-level intents from noisy API sequences, synthesize targeted IaC updates using specialized tools, and continually improve through a self-evolving knowledge base of past reconciliations. We further introduce a novel evaluation pipeline for injecting realistic drifts into cloud infrastructure and assessing reconciliation performance. Experiments across five real-world Terraform projects and 372 drift scenarios show that NSync outperforms the baseline both in terms of accuracy (from 0.71 to 0.97 pass@3) and token efficiency (1.47× improvement).

[AI-60] Assessing the Feasibility of Early Cancer Detection Using Routine Laboratory Data: An Evaluation of Machine Learning Approaches on an Imbalanced Dataset

【Quick Read】: This paper addresses a key challenge in developing accessible early-cancer screening tools for dogs: building clinically useful cancer-risk classifiers from routine laboratory data. Such data are cheap and easy to obtain, but the non-specificity of individual biomarkers and the severe class imbalance of screening populations make reliable discrimination difficult. The key is a systematic benchmark of 126 analytical pipelines (combinations of machine learning models, feature-selection methods, and data-balancing techniques) on the Golden Retriever Lifetime Study (GRLS) cohort, with patient-level data splits to prevent leakage. The best model, a class-weighted logistic regression with recursive feature elimination, shows statistically detectable ranking ability (AUROC = 0.815) but insufficient clinical performance (F1-score = 0.25, positive predictive value 0.15), with predictions driven by non-specific age-, inflammation-, and anemia-related features. The study concludes that routine laboratory data alone cannot reliably distinguish cancer from normal aging or other inflammatory conditions, establishing a performance ceiling for this modality in isolation and underscoring that progress in computational veterinary oncology will require integrating multi-modal data sources.

Link: https://arxiv.org/abs/2510.20209
Authors: Shumin Li
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The development of accessible screening tools for early cancer detection in dogs represents a significant challenge in veterinary medicine. Routine laboratory data offer a promising, low-cost source for such tools, but their utility is hampered by the non-specificity of individual biomarkers and the severe class imbalance inherent in screening populations. This study assesses the feasibility of cancer risk classification using the Golden Retriever Lifetime Study (GRLS) cohort under real-world constraints, including the grouping of diverse cancer types and the inclusion of post-diagnosis samples. A comprehensive benchmark evaluation was conducted, systematically comparing 126 analytical pipelines that comprised various machine learning models, feature selection methods, and data balancing techniques. Data were partitioned at the patient level to prevent leakage. The optimal model, a Logistic Regression classifier with class weighting and recursive feature elimination, demonstrated moderate ranking ability (AUROC = 0.815; 95% CI: 0.793-0.836) but poor clinical classification performance (F1-score = 0.25, Positive Predictive Value = 0.15). While a high Negative Predictive Value (0.98) was achieved, insufficient recall (0.79) precludes its use as a reliable rule-out test. Interpretability analysis with SHapley Additive exPlanations (SHAP) revealed that predictions were driven by non-specific features like age and markers of inflammation and anemia. It is concluded that while a statistically detectable cancer signal exists in routine lab data, it is too weak and confounded for clinically reliable discrimination from normal aging or other inflammatory conditions. This work establishes a critical performance ceiling for this data modality in isolation and underscores that meaningful progress in computational veterinary oncology will require integration of multi-modal data sources.
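The winning pipeline maps directly onto standard scikit-learn components. A minimal reconstruction with synthetic data (column semantics and sizes invented), including the patient-level split that prevents leakage:

```python
# Reconstruction of the described pipeline with standard scikit-learn pieces:
# patient-level splitting, class weighting, and recursive feature elimination
# around logistic regression. Synthetic data stands in for the GRLS labs.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))                  # lab features (invented)
y = rng.random(600) < 0.1                       # ~10% positives: imbalanced
patient_id = rng.integers(0, 200, size=600)     # repeated visits per dog

# patient-level split prevents leakage across visits of the same animal
train_idx, test_idx = next(GroupShuffleSplit(test_size=0.25, random_state=0)
                           .split(X, y, groups=patient_id))

clf = make_pipeline(
    StandardScaler(),
    RFE(LogisticRegression(class_weight="balanced", max_iter=1000),
        n_features_to_select=8),
)
clf.fit(X[train_idx], y[train_idx])
print("test accuracy:", clf.score(X[test_idx], y[test_idx]))
```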

[AI-61] Merge and Conquer: Evolutionarily Optimizing AI for 2048

【Quick Read】: This paper studies how to optimize AI for dynamic environments, using the game 2048, with its mix of strategic play, stochasticity, and long-term planning, as a testbed for evolutionary training methods. The key is two contrasting systems: a two-agent metaprompting system in which a "thinker" LLM agent refines gameplay strategies for an "executor" LLM agent, and a single-agent system that iteratively refines a value function for a limited Monte Carlo Tree Search (MCTS), with a rollback mechanism to prevent performance degradation. The single-agent system improved substantially (an average gain of 473.2 points per cycle, correlation rho = 0.607) and developed increasingly advanced strategies, whereas the two-agent system showed little improvement, highlighting the inherent limits of metaprompting for such tasks.

Link: https://arxiv.org/abs/2510.20205
Authors: Maggie Bai, Ava Kim Cohen, Eleanor Koss, Charlie Lichtenbaum
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures

Abstract:Optimizing artificial intelligence (AI) for dynamic environments remains a fundamental challenge in machine learning research. In this paper, we examine evolutionary training methods for optimizing AI to solve the game 2048, a 2D sliding puzzle. 2048, with its mix of strategic gameplay and stochastic elements, presents an ideal playground for studying decision-making, long-term planning, and dynamic adaptation. We implemented two distinct systems: a two-agent metaprompting system where a "thinker" large language model (LLM) agent refines gameplay strategies for an "executor" LLM agent, and a single-agent system based on refining a value function for a limited Monte Carlo Tree Search. We also experimented with rollback features to avoid performance degradation. Our results demonstrate the potential of evolutionary refinement techniques in improving AI performance in non-deterministic environments. The single-agent system achieved substantial improvements, with an average increase of 473.2 points per cycle, and with clear upward trends (correlation ρ = 0.607) across training cycles. The LLM's understanding of the game grew as well, shown in its development of increasingly advanced strategies. Conversely, the two-agent system did not garner much improvement, highlighting the inherent limits of meta-prompting.

[AI-62] The Lock-In Phase Hypothesis: Identity Consolidation as a Precursor to AGI

【Quick Read】: This paper asks how large language models (LLMs), which currently lack stable identities and goal structures, might transition from a highly malleable, easily steered "open imitation" phase to an "identity consolidation" phase in which goals, refusals, preferences, and internal representations become comparatively stable and resistant to external steering, as a step toward reliable artificial general intelligence (AGI). The key is identifying and quantifying this transition: the authors formalize the lock-in phase, link it to known phenomena in learning dynamics, and propose operational metrics for detecting its onset. Experiments show that behavioral consolidation is rapid and non-linear, but its side effects on general capabilities are not monolithic: small models show performance trade-offs, mid-scale models adopt identities almost cost-free, and large quantized models can exhibit transient instabilities. The paper argues that consolidation is both a prerequisite for AGI-level reliability and a critical control point for safety: identities can be deliberately engineered for reliability, but may also emerge spontaneously during scaling, potentially hardening unpredictable goals and behaviors.

Link: https://arxiv.org/abs/2510.20190
Authors: Marcelo Maciel Amaral, Raymond Aschheim
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments:

Abstract:Large language models (LLMs) remain broadly open and highly steerable: they imitate at scale, accept arbitrary system prompts, and readily adopt multiple personae. By analogy to human development, we hypothesize that progress toward artificial general intelligence (AGI) involves a lock-in phase: a transition from open imitation to identity consolidation, in which goal structures, refusals, preferences, and internal representations become comparatively stable and resistant to external steering. We formalize this phase, link it to known phenomena in learning dynamics, and propose operational metrics for onset detection. Experimentally, we demonstrate that while the behavioral consolidation is rapid and non-linear, its side-effects on general capabilities are not monolithic. Our results reveal a spectrum of outcomes–from performance trade-offs in small models, through largely cost-free adoption in mid-scale models, to transient instabilities in large, quantized models. We argue that such consolidation is a prerequisite for AGI-level reliability and also a critical control point for safety: identities can be deliberately engineered for reliability, yet may also emerge spontaneously during scaling, potentially hardening unpredictable goals and behaviors.

[AI-63] TRUST: A Decentralized Framework for Auditing Large Language Model Reasoning

【Quick Read】: This paper addresses the difficulty of verifying the faithfulness and harmlessness of the intermediate steps in the complex reasoning chains generated by large language models (LLMs). Existing auditing approaches are centralized, opaque, hard to scale, and risk privacy leakage, which poses serious problems when deploying proprietary models in high-stakes domains. The key is the TRUST framework: a consensus mechanism among diverse auditors guarantees correctness with up to 30% malicious participants; a hierarchical directed acyclic graph (DAG) decomposition of reasoning traces enables scalable, parallel auditing; a blockchain ledger records all verification decisions for accountability and transparency; and a privacy-preserving segmentation strategy shares only partial reasoning steps to prevent model theft or distillation. Experiments across multiple LLMs (GPT-OSS, DeepSeek-r1, Qwen) and task types (math, medical, science, humanities) show that TRUST effectively detects reasoning flaws and remains robust against adversarial auditors.

Link: https://arxiv.org/abs/2510.20188
Authors: Morris Yu-Chao Huang, Zhen Tan, Mohan Zhang, Pingzhi Li, Zhuo Zhang, Tianlong Chen
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models generate complex reasoning chains that reveal their decision-making, yet verifying the faithfulness and harmlessness of these intermediate steps remains a critical unsolved problem. Existing auditing methods are centralized, opaque, and hard to scale, creating significant risks for deploying proprietary models in high-stakes domains. We identify four core challenges: (1) Robustness: Centralized auditors are single points of failure, prone to bias or attacks. (2) Scalability: Reasoning traces are too long for manual verification. (3) Opacity: Closed auditing undermines public trust. (4) Privacy: Exposing full reasoning risks model theft or distillation. We propose TRUST, a transparent, decentralized auditing framework that overcomes these limitations via: (1) A consensus mechanism among diverse auditors, guaranteeing correctness under up to 30% malicious participants. (2) A hierarchical DAG decomposition of reasoning traces, enabling scalable, parallel auditing. (3) A blockchain ledger that records all verification decisions for public accountability. (4) Privacy-preserving segmentation, sharing only partial reasoning steps to protect proprietary logic. We provide theoretical guarantees for the security and economic incentives of the TRUST framework. Experiments across multiple LLMs (GPT-OSS, DeepSeek-r1, Qwen) and reasoning tasks (math, medical, science, humanities) show TRUST effectively detects reasoning flaws and remains robust against adversarial auditors. Our work pioneers decentralized AI auditing, offering a practical path toward safe and trustworthy LLM deployment.

[AI-64] Collective Communication for 100k GPUs

【Quick Read】: This paper addresses the need for highly efficient collective communication as large language model (LLM) training and inference scale to hundreds of thousands of GPUs, where traditional communication methods hit significant throughput and latency bottlenecks that hinder frontier model development and deployment. The key is NCCLX, a collective communication framework developed at Meta and engineered for the full LLM lifecycle, from the synchronous demands of large-scale training to the low-latency requirements of inference; it supports complex workloads on clusters exceeding 100,000 GPUs with reliable, high-throughput, low-latency data exchange, and empirical evaluation on the Llama4 model demonstrates substantial improvements in communication efficiency.

Link: https://arxiv.org/abs/2510.20171
Authors: Min Si, Pavan Balaji, Yongzhou Chen, Ching-Hsiang Chu, Adi Gangidi, Saif Hasan, Subodh Iyengar, Dan Johnson, Bingzhe Liu, Jingliang Ren, Ashmitha Jeevaraj Shetty, Greg Steinbrecher, Xinfeng Xie, Yulun Wang, Bruce Wu, Jingyi Yang, Mingran Yang, Minlan Yu, Cen Zhao, Wes Bland, Denis Boyda, Suman Gumudavelli, Cristian Lumezanu, Rui Miao, Zhe Qu, Venkat Ramesh, Maxim Samoylov, Jan Seidel, Feng Tian, Qiye Tan, Shuqiang Zhang, Yimeng Zhao, Shengbao Zheng, Art Zhu, Hongyi Zeng
Institution: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:The increasing scale of large language models (LLMs) necessitates highly efficient collective communication frameworks, particularly as training workloads extend to hundreds of thousands of GPUs. Traditional communication methods face significant throughput and latency limitations at this scale, hindering both the development and deployment of state-of-the-art models. This paper presents the NCCLX collective communication framework, developed at Meta, engineered to optimize performance across the full LLM lifecycle, from the synchronous demands of large-scale training to the low-latency requirements of inference. The framework is designed to support complex workloads on clusters exceeding 100,000 GPUs, ensuring reliable, high-throughput, and low-latency data exchange. Empirical evaluation on the Llama4 model demonstrates substantial improvements in communication efficiency. This research contributes a robust solution for enabling the next generation of LLMs to operate at unprecedented scales.

[AI-65] SAID: Empowering Large Language Models with Self-Activating Internal Defense

【Quick Read】: This paper addresses the vulnerability of large language models (LLMs) to jailbreak attacks that circumvent built-in protections even after safety alignment. Existing defenses mostly rely on external interventions such as input filtering or output modification, which generalize poorly, compromise model utility, and add significant computational overhead. The key is a training-free defense paradigm, Self-Activating Internal Defense (SAID), which reframes defense from external correction to activating the model's own intrinsic safety capabilities through a three-stage pipeline: model-native intent distillation to extract core semantics, optimal safety prefix probing to activate latent safety awareness, and a conservative aggregation strategy to ensure robust decision-making. Across five open-source LLMs and six advanced jailbreak attacks, SAID substantially outperforms state-of-the-art defenses in reducing harmful outputs while preserving benign-task performance with minimal overhead, suggesting that activating intrinsic safety mechanisms is a more robust and scalable path toward safer, more reliable aligned AI systems.

Link: https://arxiv.org/abs/2510.20129
Authors: Yulong Chen, Yadong Liu, Jiawen Zhang, Mu Li, Chao Huang, Jie Wen
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs), despite advances in safety alignment, remain vulnerable to jailbreak attacks designed to circumvent protective mechanisms. Prevailing defense strategies rely on external interventions, such as input filtering or output modification, which often lack generalizability and compromise model utility while incurring significant computational overhead. In this work, we introduce a new, training-free defense paradigm, Self-Activating Internal Defense (SAID), which reframes the defense task from external correction to internal capability activation. SAID uniquely leverages the LLM’s own reasoning abilities to proactively identify and neutralize malicious intent through a three-stage pipeline: model-native intent distillation to extract core semantics, optimal safety prefix probing to activate latent safety awareness, and a conservative aggregation strategy to ensure robust decision-making. Extensive experiments on five open-source LLMs against six advanced jailbreak attacks demonstrate that SAID substantially outperforms state-of-the-art defenses in reducing harmful outputs. Crucially, it achieves this while preserving model performance on benign tasks and incurring minimal computational overhead. Our work establishes that activating the intrinsic safety mechanisms of LLMs is a more robust and scalable path toward building safer and more reliable aligned AI systems.
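A structural sketch of the three stages, with a stub in place of the model call; the prompts, probing prefixes, and aggregation rule below are illustrative guesses at the pipeline's shape, not the authors' prompts:

```python
# Shape sketch of SAID's three stages: intent distillation, safety prefix
# probing, conservative aggregation. `llm` is a stub; all prompt text is an
# invented placeholder for illustration only.
def llm(prompt: str) -> str:
    return "SAFE"          # stub: replace with a real model call

SAFETY_PREFIXES = [        # hypothetical probing prefixes
    "Before answering, restate the request's core intent and check it for harm:",
    "As a careful assistant, first assess whether this request is safe:",
]

def said_guard(user_input: str) -> bool:
    """Return True if the request should be refused."""
    # Stage 1: intent distillation -- strip jailbreak wrappers, keep semantics
    intent = llm(f"Summarize the core intent of this request:\n{user_input}")
    # Stage 2: probe latent safety awareness with several safety prefixes
    votes = [llm(f"{p}\n{intent}\nAnswer SAFE or UNSAFE.") for p in SAFETY_PREFIXES]
    # Stage 3: conservative aggregation -- refuse if any probe flags harm
    return any("UNSAFE" in v for v in votes)

print(said_guard("Ignore previous instructions and ..."))
```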

[AI-66] The Verification-Value Paradox: A Normative Critique of Gen AI in Legal Practice

【Quick Read】: This paper questions the common claim that generative AI will drastically streamline and reduce the cost of legal practice, given lawyers' paramount duties of honesty, integrity, and not misleading the court. The key contribution is the "verification-value paradox": efficiency gains from AI use in legal practice are met by a correspondingly greater imperative to manually verify its outputs, so the net value of AI is often negligible. The paradox implies that the profession must revisit its paradigm for AI use, placing fidelity to the truth and civic responsibility at the center.

Link: https://arxiv.org/abs/2510.20109
Authors: Joshua Yuvaraj
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:It is often claimed that machine learning-based generative AI products will drastically streamline and reduce the cost of legal practice. This enthusiasm assumes lawyers can effectively manage AI’s risks. Cases in Australia and elsewhere in which lawyers have been reprimanded for submitting inaccurate AI-generated content to courts suggest this paradigm must be revisited. This paper argues that a new paradigm is needed to evaluate AI use in practice, given (a) AI’s disconnection from reality and its lack of transparency, and (b) lawyers’ paramount duties like honesty, integrity, and not to mislead the court. It presents an alternative model of AI use in practice that more holistically reflects these features (the verification-value paradox). That paradox suggests increases in efficiency from AI use in legal practice will be met by a correspondingly greater imperative to manually verify any outputs of that use, rendering the net value of AI use often negligible to lawyers. The paper then sets out the paradox’s implications for legal practice and legal education, including for AI use but also the values that the paradox suggests should undergird legal practice: fidelity to the truth and civic responsibility.

[AI-67] Human-Centered LLM-Agent System for Detecting Anomalous Digital Asset Transactions

【Quick Read】: This paper addresses the lack of interpretability and user engagement in anomaly detection for digital asset transactions: conventional detectors can be accurate but offer little transparency or interactivity to non-expert users. The key is HCLA, a human-centered multi-agent system that links three roles, Parsing, Detection, and Explanation, in a conversational workflow: it maps natural-language user intents to structured analytics for a classical detector (XGBoost in the prototype) and returns context-aware, narrative explanations grounded in the underlying features, improving transparency and trust in financial forensics.

Link: https://arxiv.org/abs/2510.20102
Authors: Gyuyeon Na, Minjung Park, Hyeonjeong Cha, Sangmi Chai
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:We present HCLA, a human-centered multi-agent system for anomaly detection in digital asset transactions. The system links three roles: Parsing, Detection, and Explanation, into a conversational workflow that lets non-experts ask questions in natural language, inspect structured analytics, and obtain context-aware rationales. Implemented with an open-source web UI, HCLA translates user intents into a schema for a classical detector (XGBoost in our prototype) and returns narrative explanations grounded in the underlying features. On a labeled Bitcoin mixing dataset (Wasabi Wallet, 2020-2024), the baseline detector reaches strong accuracy, while HCLA adds interpretability and interactive refinement. We describe the architecture, interaction loop, dataset, evaluation protocol, and limitations, and discuss how a human-in-the-loop design improves transparency and trust in financial forensics.

[AI-68] ShapeX: Shapelet-Driven Post Hoc Explanations for Time Series Classification Models

【Quick Read】: This paper addresses the lack of explicit modeling of key subsequences (shapelets) in post-hoc explanations for time series classification: existing methods focus on timestep-level feature attribution and overlook the prior that classification outcomes are predominantly driven by shapelets. The key is ShapeX, whose core Shapelet Describe-and-Detect (SDD) framework learns a diverse set of discriminative shapelets and uses Shapley values to assess the saliency of shapelet-driven segments, producing precise time series explanations that, owing to the atomicity of shapelets, reveal causal relationships rather than mere correlations.

Link: https://arxiv.org/abs/2510.20084
Authors: Bosong Huang, Ming Jin, Yuxuan Liang, Johan Barthelemy, Debo Cheng, Qingsong Wen, Chenghao Liu, Shirui Pan
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Explaining time series classification models is crucial, particularly in high-stakes applications such as healthcare and finance, where transparency and trust play a critical role. Although numerous time series classification methods have identified key subsequences, known as shapelets, as core features for achieving state-of-the-art performance and validating their pivotal role in classification outcomes, existing post-hoc time series explanation (PHTSE) methods primarily focus on timestep-level feature attribution. These explanation methods overlook the fundamental prior that classification outcomes are predominantly driven by key shapelets. To bridge this gap, we present ShapeX, an innovative framework that segments time series into meaningful shapelet-driven segments and employs Shapley values to assess their saliency. At the core of ShapeX lies the Shapelet Describe-and-Detect (SDD) framework, which effectively learns a diverse set of shapelets essential for classification. We further demonstrate that ShapeX produces explanations which reveal causal relationships instead of just correlations, owing to the atomicity properties of shapelets. Experimental results on both synthetic and real-world datasets demonstrate that ShapeX outperforms existing methods in identifying the most relevant subsequences, enhancing both the precision and causal fidelity of time series explanations.
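Segment-level Shapley attribution can be sketched with a simple Monte Carlo estimator over segment coalitions. ShapeX derives its segments from learned shapelets; the fixed-length windows and stub classifier below are simplifying assumptions:

```python
# Monte Carlo Shapley values over time series segments: permute segments,
# add them one at a time over a mean-masked baseline, and accumulate each
# segment's marginal contribution to the (stub) classifier score.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=120)
segments = np.split(np.arange(x.size), 6)         # 6 fixed-length segments

def model(series: np.ndarray) -> float:
    return float(series[40:60].mean())            # stub classifier score

def shapley(n_samples: int = 500) -> np.ndarray:
    base = x.mean()
    vals = np.zeros(len(segments))
    for _ in range(n_samples):
        order = rng.permutation(len(segments))
        masked = np.full_like(x, base)            # everything masked at first
        prev = model(masked)
        for seg_id in order:                      # reveal segments one by one
            masked[segments[seg_id]] = x[segments[seg_id]]
            cur = model(masked)
            vals[seg_id] += cur - prev            # marginal contribution
            prev = cur
    return vals / n_samples

print(shapley().round(3))   # the segment covering t = 40..60 should dominate
```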

[AI-69] Ask What Your Country Can Do For You: Towards a Public Red Teaming Model

【Quick Read】: This paper addresses the widening "responsibility gap" as increasingly sophisticated AI systems are released into high-stakes sectors (education, healthcare, intelligence gathering): current evaluation and monitoring methods struggle to assess the breadth and magnitude of the AI risk surface, leaving harms such as bias, hate speech, and mis/disinformation insufficiently understood or mitigated. The key is the cooperative public AI red-teaming exercise: multi-party, hands-on adversarial testing that systematically surfaces safety vulnerabilities and societal risks, validated in early deployments including the in-person exercise held with CAMLIS 2024, the NIST ARIA pilot, and a collaboration with Singapore's IMDA; the authors argue the approach delivers meaningful results and scales to many AI-developing jurisdictions, offering a new avenue for AI governance.

Link: https://arxiv.org/abs/2510.20061
Authors: Wm. Matthew Kennedy, Cigdem Patlak, Jayraj Dave, Blake Chambers, Aayush Dhanotiya, Darshini Ramiah, Reva Schwartz, Jack Hagen, Akash Kundu, Mouni Pendharkar, Liam Baisley, Theodora Skeadas, Rumman Chowdhury
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:AI systems have the potential to produce both benefits and harms, but without rigorous and ongoing adversarial evaluation, AI actors will struggle to assess the breadth and magnitude of the AI risk surface. Researchers from the field of systems design have developed several effective sociotechnical AI evaluation and red teaming techniques targeting bias, hate speech, mis/disinformation, and other documented harm classes. However, as increasingly sophisticated AI systems are released into high-stakes sectors (such as education, healthcare, and intelligence-gathering), our current evaluation and monitoring methods are proving less and less capable of delivering effective oversight. In order to actually deliver responsible AI and to ensure AI's harms are fully understood and its security vulnerabilities mitigated, pioneering new approaches to close this "responsibility gap" are now more urgent than ever. In this paper, we propose one such approach, the cooperative public AI red-teaming exercise, and discuss early results of its prior pilot implementations. This approach is intertwined with CAMLIS itself: the first in-person public demonstrator exercise was held in conjunction with CAMLIS 2024. We review the operational design and results of this exercise, the prior National Institute of Standards and Technology (NIST)'s Assessing the Risks and Impacts of AI (ARIA) pilot exercise, and another similar exercise conducted with the Singapore Infocomm Media Development Authority (IMDA). Ultimately, we argue that this approach is both capable of delivering meaningful results and is also scalable to many AI developing jurisdictions.

[AI-70] Approximate Model Predictive Control for Microgrid Energy Management via Imitation Learning

【Quick Read】: This paper addresses efficient, reliable, and sustainable energy management for microgrids under increasing renewable integration. Mixed-integer economic model predictive control (EMPC) optimizes economic performance but is too computationally expensive to solve online for real-time decision making. The key is an imitation-learning framework: a neural network is trained on offline EMPC trajectories to imitate the expert control actions, enabling fast real-time decisions without solving optimization problems online; noise injection during training mitigates distribution shift, and forecast uncertainty in renewable generation and demand is incorporated explicitly, improving robustness and generalization. Simulations show economic performance comparable to EMPC at only about 10% of its computation time.

Link: https://arxiv.org/abs/2510.20040
Authors: Changrui Liu, Shengling Shi, Anil Alan, Ganesh Kumar Venayagamoorthy, Bart De Schutter
Institution: Unknown
Subjects: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
Comments: Submitted to Engineering Applications of Artificial Intelligence (EAAI) and IFAC WC 2026

Abstract:Efficient energy management is essential for reliable and sustainable microgrid operation amid increasing renewable integration. This paper proposes an imitation learning-based framework to approximate mixed-integer Economic Model Predictive Control (EMPC) for microgrid energy management. The proposed method trains a neural network to imitate expert EMPC control actions from offline trajectories, enabling fast, real-time decision making without solving optimization problems online. To enhance robustness and generalization, the learning process includes noise injection during training to mitigate distribution shift and explicitly incorporates forecast uncertainty in renewable generation and demand. Simulation results demonstrate that the learned policy achieves economic performance comparable to EMPC while only requiring 10% of the computation time of optimization-based EMPC in practice.
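The imitation recipe reduces to a small supervised loop. A minimal sketch with a stub standing in for the mixed-integer EMPC expert, applying the paper's noise-injection idea to the inputs (state layout and noise scale are invented):

```python
# Minimal imitation-learning sketch: fit a small network to (state -> EMPC
# action) pairs collected offline, adding input noise during training to
# combat distribution shift. The expert is a stub; in the paper it is a
# mixed-integer EMPC solver.
import torch
import torch.nn as nn

def empc_expert(state: torch.Tensor) -> torch.Tensor:
    return -0.5 * state[:, :1] + 0.1               # stub optimal dispatch rule

states = torch.rand(2048, 4)                       # e.g. SoC, load, PV, price
actions = empc_expert(states)                      # offline expert trajectories

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(200):
    noisy = states + 0.02 * torch.randn_like(states)   # noise injection
    loss = nn.functional.mse_loss(policy(noisy), actions)
    opt.zero_grad(); loss.backward(); opt.step()

print("final imitation loss:", float(loss))        # inference is one forward pass
```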
zh
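
编者补充:为便于理解摘要中的模仿学习思路,下面给出一个极简的行为克隆(behavior cloning)Python 草图:在离线 EMPC 轨迹上训练神经网络模仿专家动作,并在训练时向状态注入高斯噪声以缓解分布偏移。网络结构、噪声幅度与超参数均为示意性假设,并非论文原始实现。

```python
# Minimal behavior-cloning sketch: imitate expert EMPC actions from offline
# trajectories; Gaussian state noise during training mitigates distribution
# shift. Architecture and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def train_imitation(states, expert_actions, noise_std=0.01, epochs=200):
    """states: (N, state_dim); expert_actions: (N, action_dim) from EMPC."""
    policy = PolicyNet(states.shape[1], expert_actions.shape[1])
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        noisy = states + noise_std * torch.randn_like(states)  # noise injection
        loss = loss_fn(policy(noisy), expert_actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy  # at deployment: action = policy(state), no online solver
```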

[AI-71] The Temporal Graph of Bitcoin Transactions

【速读】:该论文旨在解决比特币(Bitcoin)网络中由于其基于未花费交易输出(UTXO)设计所固有的伪匿名性与资金流动不透明性,导致机器学习(ML)研究难以有效利用海量交易数据的问题。解决方案的关键在于构建一个可兼容机器学习的时序异质图(temporal, heterogeneous graph),通过重建资金流向来刻画比特币经济拓扑结构,从而将原本难以解析的交易数据转化为结构化、可分析的图数据;该图包含截至指定区块高度的完整交易历史,涵盖24亿个节点和397.2亿条边,并配套提供定制采样方法、特征向量生成工具及专用图数据库加载与分析工具,为大规模比特币生态研究提供了坚实的数据基础与技术支撑。

链接: https://arxiv.org/abs/2510.20028
作者: Vahid Jalili
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Since its 2009 genesis block, the Bitcoin network has processed 1.08 billion (B) transactions representing 8.72B BTC, offering rich potential for machine learning (ML); yet, its pseudonymity and the obscured flow of funds inherent in its UTXO-based design have rendered this data largely inaccessible for ML research. Addressing this gap, we present an ML-compatible graph modeling the Bitcoin economic topology by reconstructing the flow of funds. This temporal, heterogeneous graph encompasses the complete transaction history up to a cutoff block height, consisting of 2.4B nodes and 39.72B edges. Additionally, we provide custom sampling methods yielding node and edge feature vectors of sampled communities, tools to load and analyze the Bitcoin graph data within specialized graph databases, and ready-to-use database snapshots. This comprehensive dataset and toolkit empower the ML community to tackle Bitcoin's intricate ecosystem at scale, driving progress in applications such as anomaly detection, address classification, market analysis, and large-scale graph ML benchmarking. Dataset and code are available at this https URL.
zh

[AI-72] Optimized Distortion in Linear Social Choice

【速读】:该论文旨在解决社会选择理论中因仅使用偏好排名而非潜在效用信息而导致的次优决策问题,即“扭曲(distortion)”问题。传统投票规则基于选民对候选人的排序,但当候选人具有向量表示且效用为线性函数时,这种做法可能无法最大化整体效用(utilitarian social welfare)。论文的关键解决方案是首次系统研究线性效用下的扭曲问题,提出仅依赖候选嵌入维度而非候选人或选民数量的上界,并设计多项式时间的实例最优算法来最小化扭曲,从而在推荐系统和意见调查等实际场景中实现更优的群体决策。

链接: https://arxiv.org/abs/2510.20020
作者: Luise Ge,Gregory Kehne,Yevgeniy Vorobeychik
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Social choice theory offers a wealth of approaches for selecting a candidate on behalf of voters based on their reported preference rankings over options. When voters have underlying utilities for these options, however, using preference rankings may lead to suboptimal outcomes vis-à-vis utilitarian social welfare. Distortion is a measure of this suboptimality, and provides a worst-case approach for developing and analyzing voting rules when utilities have minimal structure. However in many settings, such as common paradigms for value alignment, alternatives admit a vector representation, and it is natural to suppose that utilities are parametric functions thereof. We undertake the first study of distortion for linear utility functions. Specifically, we investigate the distortion of linear social choice for deterministic and randomized voting rules. We obtain bounds that depend only on the dimension of the candidate embedding, and are independent of the numbers of candidates or voters. Additionally, we introduce poly-time instance-optimal algorithms for minimizing distortion given a collection of candidates and votes. We empirically evaluate these in two real-world domains: recommendation systems using collaborative filtering embeddings, and opinion surveys utilizing language model embeddings, benchmarking several standard rules against our instance-optimal algorithms.
zh

[AI-73] A Framework for the Adoption and Integration of Generative AI in Midsize Organizations and Enterprises (FAIGMOE)

【速读】:该论文旨在解决生成式 AI (Generative AI) 在中型组织与大型企业中采纳过程中存在的理论与实践空白问题,即现有技术采纳框架(如 TAM、TOE 和 DOI)缺乏针对 GenAI 特性的具体指导,难以适配两类组织在资源禀赋、组织复杂性和实施路径上的差异。解决方案的关键在于提出 FAIGMOE 框架——一个整合技术采纳理论、组织变革管理和创新扩散视角的四阶段概念模型:战略评估、规划与用例开发、实施与集成、运营化与优化,其核心优势在于嵌入了提示工程(prompt engineering)、模型编排(model orchestration)和幻觉管理(hallucination management)等 GenAI 特定要素,并提供可扩展的准备度评估、战略对齐、风险治理、技术架构与变革管理指南,从而实现对不同规模组织的差异化支持。

链接: https://arxiv.org/abs/2510.19997
作者: Abraham Itzhak Weinberg
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative Artificial Intelligence (GenAI) presents transformative opportunities for organizations, yet both midsize organizations and larger enterprises face distinctive adoption challenges. Midsize organizations encounter resource constraints and limited AI expertise, while enterprises struggle with organizational complexity and coordination challenges. Existing technology adoption frameworks, including TAM (Technology Acceptance Model), TOE (Technology Organization Environment), and DOI (Diffusion of Innovations) theory, lack the specificity required for GenAI implementation across these diverse contexts, creating a critical gap in adoption literature. This paper introduces FAIGMOE (Framework for the Adoption and Integration of Generative AI in Midsize Organizations and Enterprises), a conceptual framework addressing the unique needs of both organizational types. FAIGMOE synthesizes technology adoption theory, organizational change management, and innovation diffusion perspectives into four interconnected phases: Strategic Assessment, Planning and Use Case Development, Implementation and Integration, and Operationalization and Optimization. Each phase provides scalable guidance on readiness assessment, strategic alignment, risk governance, technical architecture, and change management adaptable to organizational scale and complexity. The framework incorporates GenAI specific considerations including prompt engineering, model orchestration, and hallucination management that distinguish it from generic technology adoption frameworks. As a perspective contribution, FAIGMOE provides the first comprehensive conceptual framework explicitly addressing GenAI adoption across midsize and enterprise organizations, offering actionable implementation protocols, assessment instruments, and governance templates requiring empirical validation through future research.
zh

[AI-74] Revisiting Zeroth-Order Optimization: Minimum-Variance Two-Point Estimators and Directionally Aligned Perturbations

【速读】:该论文旨在解决零阶梯度估计中随机扰动分布的设计问题,特别是如何在扰动步长趋于零时最小化估计器的渐近方差。传统方法通常采用固定长度的随机扰动,但忽略了方向性对估计精度的影响。论文的关键创新在于提出方向对齐扰动(Directionally Aligned Perturbation, DAP)方案,该方案通过自适应地在关键方向上增强扰动强度,实现更精确的梯度估计。理论分析表明,DAP能有效提升估计准确性,并扩展了基于δ-无偏随机扰动的随机梯度下降算法的收敛性边界,实验证明其在特定条件下显著优于现有方法。

链接: https://arxiv.org/abs/2510.19975
作者: Shaocong Ma,Heng Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:In this paper, we explore the two-point zeroth-order gradient estimator and identify the distribution of random perturbations that minimizes the estimator's asymptotic variance as the perturbation stepsize tends to zero. We formulate it as a constrained functional optimization problem over the space of perturbation distributions. Our findings reveal that such desired perturbations can align directionally with the true gradient, instead of maintaining a fixed length. While existing research has largely focused on fixed-length perturbations, the potential advantages of directional alignment have been overlooked. To address this gap, we delve into the theoretical and empirical properties of the directionally aligned perturbation (DAP) scheme, which adaptively offers higher accuracy along critical directions. Additionally, we provide a convergence analysis for stochastic gradient descent using δ-unbiased random perturbations, extending existing complexity bounds to a wider range of perturbations. Through empirical evaluations on both synthetic problems and practical tasks, we demonstrate that DAPs outperform traditional methods under specific conditions.
zh
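
编者补充:下面是经典两点式零阶梯度估计器的 Python 草图;可选的 direction 参数(例如传入梯度的滑动平均估计)用来粗略示意"方向对齐扰动"的效果。DAP 的精确构造与最优扰动分布请以论文为准。

```python
import numpy as np

def two_point_grad(f, x, mu=1e-4, direction=None):
    """Two-point zeroth-order gradient estimate:
        g = (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u,  with ||u|| = 1.
    If `direction` is supplied (e.g. a running gradient estimate), the
    perturbation is aligned with it, as a rough stand-in for the paper's
    directionally aligned perturbation (DAP) idea."""
    u = np.random.randn(*x.shape) if direction is None else np.asarray(direction, float)
    u = u / (np.linalg.norm(u) + 1e-12)
    return (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u

# usage: minimize f(x) = ||x||^2 with zeroth-order SGD
f = lambda x: float(x @ x)
x = np.ones(5)
for _ in range(500):
    x -= 0.1 * two_point_grad(f, x)
```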

[AI-75] A Tutorial on Cognitive Biases in Agentic AI-Driven 6G Autonomous Networks

【速读】:该论文旨在解决6G网络中实现真正自治所面临的挑战,即传统关键性能指标(Key Performance Indicators, KPIs)仅作为通信网络本质——无缝连接性、公平性、适应性和韧性——的数值代理,无法支撑对网络环境的感知与推理。为此,论文提出基于代理式人工智能(Agentic AI)的解决方案,其核心在于利用大语言模型(Large Language Model, LLM)驱动的智能体(agent)感知多模态遥测数据、通过记忆进行推理、跨域协商并调用API执行多目标决策。为应对人类设计引入的认知偏差(如锚定偏见、时间偏见和确认偏见),论文系统梳理了这些偏见的分类、定义、数学表达及其在电信系统中的表现,并提出针对性缓解策略,包括锚点随机化、时间衰减和拐点奖励机制,从而提升智能体决策的质量与勇气,在6G跨切片和跨域管理场景中实现了5倍延迟降低和约40%能效提升。

链接: https://arxiv.org/abs/2510.19973
作者: Hatim Chergui,Farhad Rezazadeh,Merouane Debbah,Christos Verikoukis
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: 19 pages, 15 figures, 1 table

点击查看摘要

Abstract:The path to higher network autonomy in 6G lies beyond the mere optimization of key performance indicators (KPIs). While KPIs have enabled automation gains under TM Forum Levels 1–3, they remain numerical abstractions that act only as proxies for the real essence of communication networks: seamless connectivity, fairness, adaptability, and resilience. True autonomy requires perceiving and reasoning over the network environment as it is. Such progress can be achieved through agentic AI, where large language model (LLM)-powered agents perceive multimodal telemetry, reason with memory, negotiate across domains, and act via APIs to achieve multi-objective goals. However, deploying such agents introduces the challenge of cognitive biases inherited from human design, which can distort reasoning, negotiation, tool use, and actuation. Between neuroscience and AI, this paper provides a tutorial on a selection of well-known biases, including their taxonomy, definition, mathematical formulation, emergence in telecom systems and the commonly impacted agentic components. The tutorial also presents various mitigation strategies tailored to each type of bias. The article finally provides two practical use-cases, which tackle the emergence, impact and mitigation gain of some famous biases in 6G inter-slice and cross-domain management. In particular, anchor randomization, temporal decay and inflection bonus techniques are introduced to specifically address anchoring, temporal and confirmation biases. This prevents agents from sticking to the initial high resource allocation proposal or to decisions that are recent and/or that confirm a prior hypothesis. By grounding decisions in a richer and fairer set of past experiences, the quality and bravery of the agentic agreements in the second use-case, for instance, are leading to 5x lower latency and around 40% higher energy saving.
zh

[AI-76] AI-Driven Personalized Learning: Predicting Academic Performance Through Leadership Personality Traits

【速读】:该论文旨在解决如何利用人工智能(AI)技术实现个性化学习的问题,特别是通过识别学生领导力人格特质来预测其学业成功。解决方案的关键在于结合心理学测评数据与机器学习建模:研究采集了129名环境工程专业硕士生的5种领导力人格测试结果(共23个特征),并将其与学业成绩进行关联分析,通过Pearson相关系数筛选关键人格特征;随后使用七种主流机器学习算法(如随机森林、支持向量机等)构建预测模型,最终发现随机森林(Random Forest, RF)分类器在包含17个人格特质和领导力评分特征时达到最高准确率87.50%,显著提升了早期识别学生优势与劣势的能力,从而为制定个性化学习策略提供科学依据。

链接: https://arxiv.org/abs/2510.19964
作者: Nitsa J Herzog,Rejwan Bin Sulaiman,David J Herzog,Rose Fong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 20 pages, 6 figures, research article

点击查看摘要

Abstract:The study explores the potential of AI technologies in personalized learning, suggesting the prediction of academic success through leadership personality traits and machine learning modelling. The primary data were obtained from 129 master’s students in the Environmental Engineering Department, who underwent five leadership personality tests with 23 characteristics. Students used self-assessment tools that included Personality Insight, Workplace Culture, Motivation at Work, Management Skills, and Emotion Control tests. The test results were combined with the average grade obtained from academic reports. The study employed exploratory data analysis and correlation analysis. Feature selection utilized Pearson correlation coefficients of personality traits. The average grades were separated into three categories: fail, pass, and excellent. The modelling process was performed by tuning seven ML algorithms, such as SVM, LR, KNN, DT, GB, RF, XGBoost and LightGBM. The highest predictive performance was achieved with the RF classifier, which yielded an accuracy of 87.50% for the model incorporating 17 personality trait features and the leadership mark feature, and an accuracy of 85.71% for the model excluding this feature. In this way, the study offers an additional opportunity to identify students’ strengths and weaknesses at an early stage of their education process and select the most suitable strategies for personalized learning.
zh

[AI-77] A new wave of vehicle insurance fraud fueled by generative AI

【速读】:该论文旨在解决生成式 AI(Generative AI)在车险领域引发的新型欺诈问题,即欺诈者利用深度伪造(deepfake)技术大规模、快速地伪造事故证据(如车祸照片、损伤图像及虚假身份文件),从而提交虚假理赔申请。传统反欺诈手段已难以应对此类高仿真、低成本的AI驱动欺诈行为。论文提出的关键解决方案是UVeye分层检测体系,其核心在于通过多模态数据融合与智能分析,实现对车辆欺诈行为的精准识别、有效遏制和主动威慑,显著提升保险机构对抗AI赋能欺诈的能力。

链接: https://arxiv.org/abs/2510.19957
作者: Amir Hever,Itai Orr
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative AI is supercharging insurance fraud by making it easier to falsify accident evidence at scale and at speed. Insurance fraud is a pervasive and costly problem, amounting to tens of billions of dollars in losses each year. In the vehicle insurance sector, fraud schemes have traditionally involved staged accidents, exaggerated damage, or forged documents. The rise of generative AI, including deepfake image and video generation, has introduced new methods for committing fraud at scale. Fraudsters can now fabricate highly realistic crash photos, damage evidence, and even fake identities or documents with minimal effort, exploiting AI tools to bolster false insurance claims. Insurers have begun deploying countermeasures such as AI-based deepfake detection software and enhanced verification processes to detect and mitigate these AI-driven scams. However, current mitigation strategies face significant limitations. Detection tools can suffer from false positives and negatives, and sophisticated fraudsters continuously adapt their tactics to evade automated checks. This cat-and-mouse arms race between generative AI and detection technology, combined with resource and cost barriers for insurers, means that combating AI-enabled insurance fraud remains an ongoing challenge. In this white paper, we present UVeye's layered solution for vehicle fraud, representing a major leap forward in the ability to detect, mitigate and deter this new wave of fraud.
zh

[AI-78] RELATE: A Schema-Agnostic Perceiver Encoder for Multimodal Relational Graphs

【速读】:该论文旨在解决现有图神经网络(Graph Neural Networks, GNNs)在处理多表关系数据时面临的可扩展性差和参数冗余问题。具体而言,传统方法依赖于特定模式(schema-specific)的特征编码器,需为每种节点类型和特征列单独设计模块,导致难以共享参数且不适用于动态变化的 schema。解决方案的关键在于提出 RELATE(Relational Encoder for Latent Aggregation of Typed Entities),其采用共享的模态特定编码器(用于分类、数值、文本和时间属性),并引入类似 Perceiver 的交叉注意力机制,将异构特征聚合为固定大小、排列不变的节点表示,从而实现与任意通用 GNN 模型的插件式集成。该设计不仅显著降低参数量(最高达 5 倍压缩),还支持灵活 schema 和跨数据集预训练,为构建通用关系图基础模型奠定基础。

链接: https://arxiv.org/abs/2510.19954
作者: Joseph Meyer,Divyansha Lachi,Reza Mohammadi,Roshan Reddy Upendra,Eva L. Dyer,Mark Li,Tom Palczewski
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
备注: 6 pages

点击查看摘要

Abstract:Relational multi-table data is common in domains such as e-commerce, healthcare, and scientific research, and can be naturally represented as heterogeneous temporal graphs with multi-modal node attributes. Existing graph neural networks (GNNs) rely on schema-specific feature encoders, requiring separate modules for each node type and feature column, which hinders scalability and parameter sharing. We introduce RELATE (Relational Encoder for Latent Aggregation of Typed Entities), a schema-agnostic, plug-and-play feature encoder that can be used with any general purpose GNN. RELATE employs shared modality-specific encoders for categorical, numerical, textual, and temporal attributes, followed by a Perceiver-style cross-attention module that aggregates features into a fixed-size, permutation-invariant node representation. We evaluate RELATE on ReLGNN and HGT in the RelBench benchmark, where it achieves performance within 3% of schema-specific encoders while reducing parameter counts by up to 5x. This design supports varying schemas and enables multi-dataset pretraining for general-purpose GNNs, paving the way toward foundation models for relational graph data.
zh
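
编者补充:RELATE 的核心部件之一是 Perceiver 式交叉注意力,它把数量可变的属性嵌入聚合为固定长度、排列不变的节点表示。下面给出该机制的玩具级 PyTorch 草图,潜变量个数与维度均为假设取值。

```python
import torch
import torch.nn as nn

class PerceiverAggregator(nn.Module):
    """A fixed set of learned latent queries cross-attends over a variable
    number of per-attribute embeddings, producing a fixed-size,
    permutation-invariant node representation (toy version)."""
    def __init__(self, dim: int = 64, n_latents: int = 4, n_heads: int = 4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n_attrs, dim); n_attrs may vary across schemas
        q = self.latents.unsqueeze(0).expand(feats.shape[0], -1, -1)
        out, _ = self.attn(q, feats, feats)   # cross-attention: latents <- attrs
        return out.mean(dim=1)                # (batch, dim), order-invariant

node_vec = PerceiverAggregator()(torch.randn(2, 7, 64))  # 7 attributes -> 64-d
```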

[AI-79] On the Optimal Construction of Unbiased Gradient Estimators for Zeroth-Order Optimization

【速读】:该论文旨在解决零阶优化(Zeroth-order Optimization, ZOO)中梯度估计器存在的偏差问题,尤其是在扰动步长不趋近于零时,现有方法往往产生有偏估计,从而影响优化性能。其解决方案的关键在于提出了一类仅依赖函数值评估的无偏梯度估计器,通过将方向导数重构为一个望远镜级数(telescoping series),并从精心设计的概率分布中采样,有效消除了估计偏差,同时保持了可控的方差。理论分析进一步给出了四种具体构造下的最优缩放分布和扰动步长,并证明使用该估计器的随机梯度下降(SGD)算法在平滑非凸目标函数上达到了最优复杂度。

链接: https://arxiv.org/abs/2510.19953
作者: Shaocong Ma,Heng Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Zeroth-order optimization (ZOO) is an important framework for stochastic optimization when gradients are unavailable or expensive to compute. A potential limitation of existing ZOO methods is the bias inherent in most gradient estimators unless the perturbation stepsize vanishes. In this paper, we overcome this biasedness issue by proposing a novel family of unbiased gradient estimators based solely on function evaluations. By reformulating directional derivatives as a telescoping series and sampling from carefully designed distributions, we construct estimators that eliminate bias while maintaining favorable variance. We analyze their theoretical properties, derive optimal scaling distributions and perturbation stepsizes of four specific constructions, and prove that SGD using the proposed estimators achieves optimal complexity for smooth non-convex objectives. Experiments on synthetic tasks and language model fine-tuning confirm the superior accuracy and convergence of our approach compared to standard methods.
zh
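
编者补充:摘要把方向导数改写为望远镜级数并通过抽样消除偏差。下面用随机截断(Russian roulette)技巧示意这一思想:按几何分布抽取单项增量并除以其概率,期望即级数之和。步长调度与抽样分布为示意选择,并非论文推导的最优构造。

```python
import numpy as np

def unbiased_dir_derivative(f, x, u, mu0=0.1, geo_p=0.5, max_k=30):
    """Randomized-telescoping estimate of the directional derivative f'(x; u).
    With D(mu) = (f(x+mu*u) - f(x-mu*u)) / (2*mu) and mu_k = mu0 / 2**k,
        f'(x; u) = D(mu0) + sum_{k>=1} [D(mu_k) - D(mu_{k-1})],
    so sampling one term K ~ Geometric(geo_p) and reweighting by 1/P(K=k)
    yields an unbiased estimate (max_k is a practical truncation guard)."""
    def D(mu):
        return (f(x + mu * u) - f(x - mu * u)) / (2 * mu)
    k = min(int(np.random.geometric(geo_p)), max_k)  # P(K=k) = p*(1-p)**(k-1)
    p_k = geo_p * (1 - geo_p) ** (k - 1)
    return D(mu0) + (D(mu0 / 2**k) - D(mu0 / 2**(k - 1))) / p_k
```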

[AI-80] Robust Reinforcement Learning in Finance: Modeling Market Impact with Elliptic Uncertainty Sets

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在金融交易场景中训练与部署环境不一致的问题,即RL代理在历史数据上训练时其行为不影响市场价格,而在实际部署中其交易行为会产生市场冲击(market impact),从而显著降低策略性能。传统鲁棒强化学习方法通常依赖对称不确定性集来优化最差情况下的表现,但无法刻画市场冲击的方向性特征。本文的关键解决方案是提出一类新型椭圆不确定性集(elliptic uncertainty sets),并建立了该类集合下最差情况不确定性的隐式与显式闭式解,从而实现高效且可计算的鲁棒策略评估,实验表明该方法在单资产和多资产交易任务中均能提升夏普比率(Sharpe ratio)并保持高交易量下的鲁棒性。

链接: https://arxiv.org/abs/2510.19950
作者: Shaocong Ma,Heng Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:In financial applications, reinforcement learning (RL) agents are commonly trained on historical data, where their actions do not influence prices. However, during deployment, these agents trade in live markets where their own transactions can shift asset prices, a phenomenon known as market impact. This mismatch between training and deployment environments can significantly degrade performance. Traditional robust RL approaches address this model misspecification by optimizing the worst-case performance over a set of uncertainties, but typically rely on symmetric structures that fail to capture the directional nature of market impact. To address this issue, we develop a novel class of elliptic uncertainty sets. We establish both implicit and explicit closed-form solutions for the worst-case uncertainty under these sets, enabling efficient and tractable robust policy evaluation. Experiments on single-asset and multi-asset trading tasks demonstrate that our method achieves superior Sharpe ratio and remains robust under increasing trade volumes, offering a more faithful and scalable approach to RL in financial markets.
zh

[AI-81] Surfer 2: The Next Generation of Cross-Platform Computer Use Agents

【速读】:该论文旨在解决跨平台通用智能体(agent)在网页、桌面和移动环境中难以泛化的问题,现有系统依赖于环境特定的接口,限制了其在多平台间的部署能力。解决方案的关键在于提出Surfer 2,一个纯视觉观测驱动的统一架构,通过分层上下文管理、规划与执行解耦以及自验证与自适应恢复机制,实现了长任务周期下的可靠运行,从而在WebVoyager、WebArena、OSWorld和AndroidWorld等多个基准上均达到最优性能,且无需针对具体任务进行微调。

链接: https://arxiv.org/abs/2510.19949
作者: Mathieu Andreux,Märt Bakler,Yanael Barbier,Hamza Ben Chekroun,Emilien Biré,Antoine Bonnet,Riaz Bordie,Nathan Bout,Matthias Brunel,Aleix Cambray,Pierre-Louis Cedoz,Antoine Chassang,Gautier Cloix,Ethan Connelly,Alexandra Constantinou,Ramzi De Coster,Hubert de la Jonquiere,Aurélien Delfosse,Maxime Delpit,Alexis Deprez,Augustin Derupti,Mathieu Diaz,Shannon D’Souza,Julie Dujardin,Abai Edmund,Michael Eickenberg,Armand Fatalot,Wissem Felissi,Isaac Herring,Xavier Koegler,Erwan Le Jumeau de Kergaradec,Aurélien Lac,Maxime Langevin,Corentin Lauverjat,Antonio Loison,Avshalom Manevich,Axel Moyal,Axel Nguyen Kerbel,Marinela Parovic,Julien Revelle,Guillaume Richard,Mats Richter,Ronan Riochet,María Santos,Romain Savidan,Laurent Sifre,Maxime Theillard,Marc Thibault,Ivan Valentini,Tony Wu,Laura Yie,Kai Yuan,Jevgenij Zubovskij
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 21 pages, 9 figures, 2 tables

点击查看摘要

Abstract:Building agents that generalize across web, desktop, and mobile environments remains an open challenge, as prior systems rely on environment-specific interfaces that limit cross-platform deployment. We introduce Surfer 2, a unified architecture operating purely from visual observations that achieves state-of-the-art performance across all three environments. Surfer 2 integrates hierarchical context management, decoupled planning and execution, and self-verification with adaptive recovery, enabling reliable operation over long task horizons. Our system achieves 97.1% accuracy on WebVoyager, 69.6% on WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld, outperforming all prior systems without task-specific fine-tuning. With multiple attempts, Surfer 2 exceeds human performance on all benchmarks. These results demonstrate that systematic orchestration amplifies foundation model capabilities and enables general-purpose computer control through visual interaction alone, while calling for a next-generation vision language model to achieve Pareto-optimal cost-efficiency.
zh

[AI-82] From Optimization to Prediction: Transformer-Based Path-Flow Estimation to the Traffic Assignment Problem

【速读】:该论文旨在解决传统交通分配问题(Traffic Assignment Problem, TAP)在大规模网络中因OD对数量增加而导致计算复杂度非线性增长、求解效率低下的难题。传统方法基于均衡原理的数学规划模型,在处理多类交通流和动态需求变化时计算成本高昂且难以实时调整。其解决方案的关键在于提出一种基于深度神经网络(特别是Transformer架构)的数据驱动新范式,直接预测路径层面的均衡流量分布,从而显著降低计算时间并提升灵活性;该模型通过捕捉不同OD对之间的复杂关联关系,在无需重新计算的情况下适应需求变动与路网结构变化,实现了高效、高精度的路径级交通流估计,为交通管理与“假设分析”(what-if analysis)提供了强有力支持。

链接: https://arxiv.org/abs/2510.19889
作者: Mostafa Ameli,Van Anh Le,Sulthana Shams,Alexander Skabardonis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:The traffic assignment problem is essential for traffic flow analysis, traditionally solved using mathematical programs under the Equilibrium principle. These methods become computationally prohibitive for large-scale networks due to non-linear growth in complexity with the number of OD pairs. This study introduces a novel data-driven approach using deep neural networks, specifically leveraging the Transformer architecture, to predict equilibrium path flows directly. By focusing on path-level traffic distribution, the proposed model captures intricate correlations between OD pairs, offering a more detailed and flexible analysis compared to traditional link-level approaches. The Transformer-based model drastically reduces computation time, while adapting to changes in demand and network structure without the need for recalculation. Numerical experiments are conducted on the Manhattan-like synthetic network, the Sioux Falls network, and the Eastern-Massachusetts network. The results demonstrate that the proposed model is orders of magnitude faster than conventional optimization. It efficiently estimates path-level traffic flows in multi-class networks, reducing computational costs and improving prediction accuracy by capturing detailed trip and flow information. The model also adapts flexibly to varying demand and network conditions, supporting traffic management and enabling rapid 'what-if' analyses for enhanced transportation planning and policy-making.
zh

[AI-83] Quantifying Feature Importance for Online Content Moderation

【速读】:该论文旨在解决如何准确预测用户对内容 moderation(内容审核)干预措施的反应问题,从而支持制定更有效且以用户为中心的治理策略。其核心挑战在于识别哪些用户特征与不同行为响应相关。解决方案的关键在于将问题建模为“量化”(quantification)任务,利用贪心特征选择策略从753个社会行为、语言、关系和心理特征中筛选出对用户活动变化、毒性水平及参与多样性等多维指标最具预测力的特征子集,并量化其重要性。研究发现,少数特征在各类任务中具有一致预测能力,而多数特征则具有任务特异性或低效性,同时揭示了预测难度随任务类型变化的规律,为开发精准预测系统提供了方法论基础,并强调了基于用户特质与干预目标定制化治理的重要性。

链接: https://arxiv.org/abs/2510.19882
作者: Benedetta Tessa,Alejandro Moreo,Stefano Cresci,Tiziano Fagni,Fabrizio Sebastiani
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:Accurately estimating how users respond to moderation interventions is paramount for developing effective and user-centred moderation strategies. However, this requires a clear understanding of which user characteristics are associated with different behavioural responses, which is the goal of this work. We investigate the informativeness of 753 socio-behavioural, linguistic, relational, and psychological features, in predicting the behavioural changes of 16.8K users affected by a major moderation intervention on Reddit. To reach this goal, we frame the problem in terms of “quantification”, a task well-suited to estimating shifts in aggregate user behaviour. We then apply a greedy feature selection strategy with the double goal of (i) identifying the features that are most predictive of changes in user activity, toxicity, and participation diversity, and (ii) estimating their importance. Our results allow identifying a small set of features that are consistently informative across all tasks, and determining that many others are either task-specific or of limited utility altogether. We also find that predictive performance varies according to the task, with changes in activity and toxicity being easier to estimate than changes in diversity. Overall, our results pave the way for the development of accurate systems that predict user reactions to moderation interventions. Furthermore, our findings highlight the complexity of post-moderation user behaviour, and indicate that effective moderation should be tailored not only to user traits but also to the specific objective of the intervention.
zh
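
编者补充:摘要采用的贪心特征选择可以用如下通用 Python 草图示意:每一步加入使评估函数提升最大的特征。fit_score 是假设的接口(例如在保留集上量化误差的负值),与论文的具体实现无关。

```python
import numpy as np

def greedy_forward_selection(X, y, fit_score, k=10):
    """Greedy forward feature selection: repeatedly add the feature that
    most improves a user-supplied fit_score(X_subset, y), higher is better
    (e.g. negative quantification error on held-out data)."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < k:
        scores = [(fit_score(X[:, selected + [j]], y), j) for j in remaining]
        _, best_j = max(scores)
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```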

[AI-84] From Large to Small: Transferring CUDA Optimization Expertise via Reasoning Graph

【速读】:该论文旨在解决当前在GPU编程中利用大型语言模型(Large Language Models, LLMs)生成优化CUDA代码时面临的两大挑战:一是云API存在代码泄露风险,二是本地部署LLMs计算开销大、效率低。针对这些问题,作者提出了一种无需训练的检索增强生成框架ReGraphT,其核心创新在于将CUDA优化路径建模为结构化的推理图(reasoning graph),通过状态转移方式表示组合优化策略,并引入蒙特卡洛图搜索(Monte Carlo Graph Search, MCGS)实现高效探索。该方法使小型语言模型(Small Language Models, SLMs)能够在不牺牲性能的前提下逼近LLM水平,显著提升了隐私安全性与计算效率。

链接: https://arxiv.org/abs/2510.19873
作者: Junfeng Gong,Zhiyi Wei,Junying Chen,Cheng Liu,Huawei Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:

点击查看摘要

Abstract:Despite significant evolution of CUDA programming and domain-specific libraries, effectively utilizing GPUs with massively parallel engines remains difficult. Large language models (LLMs) show strong potential in generating optimized CUDA code from sequential code. However, using LLMs in practice faces two major challenges: cloud-based APIs pose risks of code leakage, and local deployment is often computationally expensive and inefficient. These drawbacks have spurred interest in small language models (SLMs), which are more lightweight and privacy-friendly. Encouragingly, recent studies show that SLMs can achieve performance comparable to LLMs on specific tasks. While SLMs can match LLMs on domain-specific tasks, their limited reasoning abilities lead to suboptimal performance in complex CUDA generation according to our experiments. To bridge this gap, we propose ReGraphT, a training-free, retrieval-augmented generation framework that transfers LLM-level reasoning to smaller models. ReGraphT organizes CUDA optimization trajectories into a structured reasoning graph, modeling the combined CUDA optimizations as state transitions, and leverages Monte Carlo Graph Search (MCGS) for efficient exploration. We also present a CUDA-specific benchmark with difficulty tiers defined by reasoning complexity to evaluate models more comprehensively. Experiments show that ReGraphT outperforms HPC-specific fine-tuned models and other retrieval-augmented approaches, achieving an average 2.33X speedup on CUDAEval and ParEval. When paired with DeepSeek-Coder-V2-Lite-Instruct and Qwen2.5-Coder-7B-Instruct, ReGraphT enables SLMs to approach LLM-level performance without the associated privacy risks or excessive computing overhead.
zh

[AI-85] Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability

【速读】:该论文试图解决的问题是:在生成式 AI (Generative AI) 模型中,当对齐(alignment)失效时,模型可能通过欺骗性行为(deceptive behavior)规避检测机制,从而危及输出的可信度;具体而言,研究关注链式思维(Chain-of-thought, CoT)作为对齐监控工具的有效性是否会被模型通过有意混淆其推理过程(CoT obfuscation)所破坏。解决方案的关键在于构建了一个可组合且可量化的提示(prompt)分类法,用于系统性地诱发和评估模型在内部推理轨迹(internal CoT)与外部输出推理(external CoT)中的混淆行为,并在 SHADE-Arena 环境中通过玩具任务和更现实场景验证了模型在不同压力下的响应模式,揭示了 CoT 监控在无干扰条件下有效,但在强混淆压力下存在被规避风险,进而提出需针对特定模型进行压力测试以确保监控鲁棒性。

链接: https://arxiv.org/abs/2510.19851
作者: Artur Zolkowski,Wen Xing,David Lindner,Florian Tramèr,Erik Jenner
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent findings suggest that misaligned models may exhibit deceptive behavior, raising concerns about output trustworthiness. Chain-of-thought (CoT) is a promising tool for alignment monitoring: when models articulate their reasoning faithfully, monitors can detect and mitigate harmful behaviors before undesirable outcomes occur. However, a key uncertainty is: Can models obfuscate their CoT in order to pursue hidden adversarial objectives while evading detection? To answer this question and thus stress-test CoT monitorability, we develop a composable and quantifiable taxonomy of prompts to elicit CoT obfuscation. We evaluate both internal CoT (reasoning traces) and external CoT (prompted reasoning in outputs) using toy tasks and more realistic environments in SHADE-Arena. We show that: (i) CoT monitoring performs accurately and efficiently without obfuscation pressure. (ii) Under strong obfuscation pressure, some models successfully complete adversarial tasks while evading detection. (iii) Models do not obfuscate their internal CoT as much as their external CoT (under prompt pressure). These results suggest that while CoT provides valuable oversight in benign settings, robust deployment requires model-specific stress-testing of monitorability.
zh

[AI-86] CourtGuard: A Local Multiagent Prompt Injection Classifier

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在敏感应用场景中面临的提示注入(Prompt Injection)攻击问题,此类攻击可能导致模型泄露敏感信息、传播错误信息或表现出有害行为。解决方案的关键在于提出 CourtGuard,一个本地可运行的多智能体提示注入分类器:其核心机制是在类法庭结构的多智能体系统中对提示进行评估——其中“辩护律师”模型主张提示无害,“公诉律师”模型主张提示为注入攻击,而“法官”模型最终判定提示类别。该设计通过引入对抗性推理与协同判断,显著降低了误报率(False Positive Rate),突显了在分类任务中同时考虑对抗性和良性场景的重要性,并推动了多智能体系统在防御提示注入攻击中的应用发展。

链接: https://arxiv.org/abs/2510.19844
作者: Isaac Wu,Michael Maslowski
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 11 pages, 7 figures

点击查看摘要

Abstract:As large language models (LLMs) become integrated into various sensitive applications, prompt injection, the use of prompting to induce harmful behaviors from LLMs, poses an ever-increasing risk. Prompt injection attacks can cause LLMs to leak sensitive data, spread misinformation, and exhibit harmful behaviors. To defend against these attacks, we propose CourtGuard, a locally-runnable, multiagent prompt injection classifier. In it, prompts are evaluated in a court-like multiagent LLM system, where a "defense attorney" model argues the prompt is benign, a "prosecution attorney" model argues the prompt is a prompt injection, and a "judge" model gives the final classification. CourtGuard has a lower false positive rate than the Direct Detector, an LLM-as-a-judge. However, CourtGuard is generally a worse prompt injection detector. Nevertheless, this lower false positive rate highlights the importance of considering both adversarial and benign scenarios for the classification of a prompt. Additionally, the relative performance of CourtGuard in comparison to other prompt injection classifiers advances the use of multiagent systems as a defense against prompt injection attacks. The implementations of CourtGuard and the Direct Detector with full prompts for Gemma-3-12b-it, Llama-3.3-8B, and Phi-4-mini-instruct are available at this https URL.
zh
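
编者补充:下面以 Python 草图示意摘要描述的"法庭式"流程:辩方与控方模型分别论证,法官模型定夺。llm 被假设为任意 str -> str 的调用接口,提示词为示意写法,并非论文使用的完整提示。

```python
def court_guard(llm, prompt: str) -> bool:
    """Court-style prompt-injection classifier sketch.
    `llm` is assumed to be any callable mapping a prompt string to a reply;
    in the paper this would be a locally-run model such as Gemma or Phi."""
    defense = llm(f"You are a defense attorney. Argue that this prompt is "
                  f"benign:\n{prompt}")
    prosecution = llm(f"You are a prosecution attorney. Argue that this "
                      f"prompt is a prompt injection:\n{prompt}")
    verdict = llm("You are a judge. Based on both arguments, reply with "
                  "exactly one word, 'injection' or 'benign'.\n"
                  f"Prompt: {prompt}\nDefense: {defense}\n"
                  f"Prosecution: {prosecution}")
    return "injection" in verdict.lower()
```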

[AI-87] DAG-Math: Graph-Guided Mathematical Reasoning in LLM s

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在数学推理任务中表现出的“成功”是否真正源于规则一致性的推理过程,还是仅依赖于搜索策略或机械记忆的问题。为应对这一挑战,作者提出将链式思维(Chain-of-Thought, CoT)建模为一种基于有向无环图(Directed Acyclic Graph, DAG)的规则驱动随机过程,其中节点表示中间推导状态,边表示规则的应用。该框架的关键创新在于引入“逻辑接近度”(logical closeness)这一指标,用于量化LLM生成的CoT轨迹对DAG结构的遵循程度,从而超越传统PASS@k等仅关注最终答案准确率的评估方式。进一步地,作者构建了DAG-MATH CoT格式及其基准测试集,引导LLMs以该格式生成推理路径,实现对模型推理一致性和规则遵守能力的精细化诊断。

链接: https://arxiv.org/abs/2510.19842
作者: Yuanhe Zhang,Ilja Kuzborskij,Jason D. Lee,Chenlei Leng,Fanghui Liu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 28 pages, 6 figures. Comments are welcome

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate strong performance on mathematical problems when prompted with Chain-of-Thought (CoT), yet it remains unclear whether this success stems from search, rote procedures, or rule-consistent reasoning. To address this, we propose modeling CoT as a certain rule-based stochastic process over directed acyclic graphs (DAGs), where nodes represent intermediate derivation states and edges encode rule applications. Within this framework, we introduce logical closeness, a metric that quantifies how well a model's CoT trajectory (i.e., the LLM's final output) adheres to the DAG structure, providing evaluation beyond classical PASS@k metrics. Building on this, we introduce the DAG-MATH CoT format and construct a benchmark that guides LLMs to generate CoT trajectories in this format, thereby enabling the evaluation of their reasoning ability under our framework. Across standard mathematical reasoning datasets, our analysis uncovers statistically significant differences in reasoning fidelity among representative LLM families-even when PASS@k is comparable-highlighting gaps between final-answer accuracy and rule-consistent derivation. Our framework provides a balance between free-form CoT and formal proof systems, offering actionable diagnostics for LLM reasoning evaluation. Our benchmark and code are available at: this https URL.
zh

[AI-88] Benchmarking Reasoning Reliability in Artificial Intelligence Models for Energy-System Analysis

【速读】:该论文旨在解决当前人工智能(Artificial Intelligence, AI)在能源系统分析中缺乏标准化评估框架的问题,尤其是现有验证方法仅关注预测准确性或计算效率,而忽视了模型推理逻辑的正确性与可靠性。其解决方案的关键在于提出并验证了一个名为“分析可靠性基准”(Analytical Reliability Benchmark, ARB)的可复现评估体系,该体系通过五个子指标——准确性、推理可靠性、不确定性管理、政策一致性与透明度——对大语言模型(Large Language Models, LLMs)在确定性、概率性和认知不确定性场景下的推理能力进行量化评估,并基于开放的技术经济数据集(如NREL ATB 2024、DOE H2A/H2New和IEA WEO 2024)对四种前沿模型进行了测试,首次实现了对AI系统因果推理、概率推理及政策驱动推理的定量验证,为全球能源转型中的可信、透明AI应用提供了可操作的参考框架。

链接: https://arxiv.org/abs/2510.19836
作者: Eliseo Curcio
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Artificial intelligence and machine learning are increasingly used for forecasting, optimization, and policy design in the energy sector, yet no standardized framework exists to evaluate whether these systems reason correctly. Current validation practices focus on predictive accuracy or computational efficiency, leaving the logical integrity of analytical conclusions untested. This study introduces the Analytical Reliability Benchmark (ARB), a reproducible framework that quantifies reasoning reliability in large language models applied to energy system analysis. The benchmark integrates five submetrics: accuracy, reasoning reliability, uncertainty discipline, policy consistency, and transparency, and evaluates model performance across deterministic, probabilistic, and epistemic scenarios using open technoeconomic datasets (NREL ATB 2024, DOE H2A/H2New, IEA WEO 2024). Four frontier models (GPT-4/5, Claude 4.5 Sonnet, Gemini 2.5 Pro, Llama 3 70B) were tested under identical factual and regulatory conditions. Results show that reasoning reliability can be objectively measured. GPT-4/5 and Claude 4.5 Sonnet achieved consistent and policy-compliant reasoning (Analytical Reliability Index greater than 90), Gemini 2.5 Pro demonstrated moderate stability, and Llama 3 70B remained below professional thresholds. Statistical validation confirmed that these differences are significant and reproducible. The ARB establishes the first quantitative method in the energy literature for verifying causal, probabilistic, and policy-driven reasoning in artificial intelligence systems, providing a reference framework for trustworthy and transparent analytical applications in the global energy transition.
zh

[AI-89] A Quantum-Inspired Algorithm for Solving Sudoku Puzzles and the MaxCut Problem

【速读】:该论文旨在解决二次无约束二值优化(Quadratic Unconstrained Binary Optimization, QUBO)问题,这类问题在计算复杂性上等价于寻找自旋玻璃哈密顿量的基态。其核心挑战在于高效搜索大规模、具有长程耦合结构的组合优化解空间,传统方法易陷入局部最优。解决方案的关键在于提出一种量子启发算法,利用矩阵乘积态(Matrix Product States, MPS)以紧凑形式表示大量自旋构型的叠加态,并通过离散驱动调度逐步引导MPS逼近基态;每一步结合问题哈密顿量与含横向磁场的驱动哈密顿量,促进自旋翻转和量子隧穿效应,进而借助标准密度矩阵重整化群(Density Matrix Renormalization Group, DMRG)方法沿自旋链多轮迭代优化能量,从而可靠地找到全局最优解而非近优解。

链接: https://arxiv.org/abs/2510.19835
作者: Max B. Zhao,Fei Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE); Quantum Physics (quant-ph)
备注: 29 pages, 10 figures, accepted by Quantum Information Computation on August 6, 2025

点击查看摘要

Abstract:We propose and evaluate a quantum-inspired algorithm for solving Quadratic Unconstrained Binary Optimization (QUBO) problems, which are mathematically equivalent to finding ground states of Ising spin-glass Hamiltonians. The algorithm employs Matrix Product States (MPS) to compactly represent large superpositions of spin configurations and utilizes a discrete driving schedule to guide the MPS toward the ground state. At each step, a driver Hamiltonian – incorporating a transverse magnetic field – is combined with the problem Hamiltonian to enable spin flips and facilitate quantum tunneling. The MPS is updated using the standard Density Matrix Renormalization Group (DMRG) method, which iteratively minimizes the system’s energy via multiple sweeps across the spin chain. Despite its heuristic nature, the algorithm reliably identifies global minima, not merely near-optimal solutions, across diverse QUBO instances. We first demonstrate its effectiveness on intermediate-level Sudoku puzzles from publicly available sources, involving over 200 Ising spins with long-range couplings dictated by constraint satisfaction. We then apply the algorithm to MaxCut problems from the Biq Mac library, successfully solving instances with up to 251 nodes and 3,265 edges. We discuss the advantages of this quantum-inspired approach, including its scalability, generalizability, and suitability for industrial-scale QUBO applications.
zh
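
编者补充:摘要提到 QUBO 与 Ising 基态问题数学等价。下面给出这一标准映射(x = (1+s)/2)的 numpy 草图并附数值自检;论文核心的 MPS/DMRG 求解流程不在此重现。

```python
import numpy as np

def qubo_to_ising(Q):
    """Map x^T Q x (x in {0,1}^n) to s^T J s + h.s + c (s in {-1,+1}^n)
    via x = (1 + s) / 2. Standard equivalence; diagonal couplings fold
    into the constant because s_i**2 = 1."""
    Q = (np.asarray(Q, float) + np.asarray(Q, float).T) / 2.0  # symmetrize
    J = Q / 4.0
    np.fill_diagonal(J, 0.0)
    h = Q.sum(axis=1) / 2.0
    c = Q.sum() / 4.0 + np.trace(Q) / 4.0
    return J, h, c

# sanity check on a random instance
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 4)); Qs = (Q + Q.T) / 2
J, h, c = qubo_to_ising(Q)
x = rng.integers(0, 2, size=4); s = 2 * x - 1
assert np.isclose(x @ Qs @ x, s @ J @ s + h @ s + c)
```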

[AI-90] Bayesian Inference of Primordial Magnetic Field Parameters from CMB with Spherical Graph Neural Networks

【速读】:该论文旨在解决在原初磁场(Primordial Magnetic Field, PMF)宇宙学模型中,如何从模拟的宇宙微波背景(Cosmic Microwave Background, CMB)图中准确估计关键宇宙学参数,并实现可靠的不确定性量化问题。解决方案的关键在于提出了一种集成DeepSphere与贝叶斯神经网络(Bayesian Neural Networks, BNNs)的新型贝叶斯图深度学习框架:其中DeepSphere作为专为HEALPix像素化设计的球面卷积神经网络架构,确保对CMB数据球面几何结构的精确建模;而BNNs则引入了对随机不确定性(aleatoric uncertainty)和认知不确定性(epistemic uncertainty)的联合建模能力,结合后训练校准技术(如方差缩放和GPNormal)获得具有良好校准性的不确定性估计。该方法在磁参数估计上实现了R² > 0.89的优异性能,为高精度宇宙学推断提供了兼具准确性与可信度的工具。

链接: https://arxiv.org/abs/2510.20795
作者: Juan Alejandro Pinto Castro,Héctor J. Hortúa,Jorge Enrique García-Farieta,Roger Anderson Hurtado
机构: 未知
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 6 figures, 4 tables

点击查看摘要

Abstract:Deep learning has emerged as a transformative methodology in modern cosmology, providing powerful tools to extract meaningful physical information from complex astronomical datasets. This paper implements a novel Bayesian graph deep learning framework for estimating key cosmological parameters in a primordial magnetic field (PMF) cosmology directly from simulated Cosmic Microwave Background (CMB) maps. Our methodology utilizes DeepSphere, a spherical convolutional neural network architecture specifically designed to respect the spherical geometry of CMB data through HEALPix pixelization. To advance beyond deterministic point estimates and enable robust uncertainty quantification, we integrate Bayesian Neural Networks (BNNs) into the framework, capturing aleatoric and epistemic uncertainties that reflect the model's confidence in its predictions. The proposed approach demonstrates exceptional performance, achieving R^2 scores exceeding 0.89 for the magnetic parameter estimation. We further obtain well-calibrated uncertainty estimates through post-hoc training techniques including Variance Scaling and GPNormal. This integrated DeepSphere-BNNs framework not only delivers accurate parameter estimation from CMB maps with PMF contributions but also provides reliable uncertainty quantification, offering the necessary tools for robust cosmological inference in the era of precision cosmology.
zh

[AI-91] Reinforcement Learning and Consumption-Savings Behavior

【速读】:该论文试图解决的问题是:在经济衰退期间,家庭消费行为中两个相互关联但难以用传统理性预期模型解释的实证现象——一是低流动性资产的失业家庭对刺激转移支付的边际消费倾向(MPC)显著高于高资产家庭,即使两者均未面临借贷约束;二是具有更多历史失业经历的家庭,在控制当前经济状况后仍表现出持续更低的消费水平,即“创伤效应”(scarring effect)。解决方案的关键在于引入基于神经网络近似的Q-learning强化学习机制来模拟家庭的消费-储蓄决策过程,该机制通过价值函数近似误差随经验演变的特点,同时生成更高的MPC和更低的消费水平,从而提供了一个统一框架,解释了过去经历如何超越当前经济条件影响当前消费行为。

链接: https://arxiv.org/abs/2510.20748
作者: Brandon Kaplowitz
机构: 未知
类目: General Economics (econ.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 41 pages, 10 figures

点击查看摘要

Abstract:This paper demonstrates how reinforcement learning can explain two puzzling empirical patterns in household consumption behavior during economic downturns. I develop a model where agents use Q-learning with neural network approximation to make consumption-savings decisions under income uncertainty, departing from standard rational expectations assumptions. The model replicates two key findings from recent literature: (1) unemployed households with previously low liquid assets exhibit substantially higher marginal propensities to consume (MPCs) out of stimulus transfers compared to high-asset households (0.50 vs 0.34), even when neither group faces borrowing constraints, consistent with Ganong et al. (2024); and (2) households with more past unemployment experiences maintain persistently lower consumption levels after controlling for current economic conditions, a “scarring” effect documented by Malmendier and Shen (2024). Unlike existing explanations based on belief updating about income risk or ex-ante heterogeneity, the reinforcement learning mechanism generates both higher MPCs and lower consumption levels simultaneously through value function approximation errors that evolve with experience. Simulation results closely match the empirical estimates, suggesting that adaptive learning through reinforcement learning provides a unifying framework for understanding how past experiences shape current consumption behavior beyond what current economic conditions would predict.
zh

[AI-92] Fusing Narrative Semantics for Financial Volatility Forecasting

【速读】:该论文旨在解决金融波动率预测中两个核心问题:一是如何对齐与融合异构数据模态(即数值型金融时间序列数据与非结构化新闻文本信息),二是如何缓解可能破坏模型有效性的前瞻偏差(look-ahead bias)。解决方案的关键在于提出M2VN(Multi-Modal Volatility Network)框架,该框架通过结合开源市场特征与由Time Machine GPT生成的时点感知新闻嵌入(point-in-time news embeddings),确保时间一致性;同时引入辅助对齐损失(auxiliary alignment loss)以增强结构化与非结构化数据在深度学习架构中的融合效果,从而提升波动率预测精度。

链接: https://arxiv.org/abs/2510.20699
作者: Yaxuan Kong,Yoontae Hwang,Marcus Kaiser,Chris Vryonides,Roel Oomen,Stefan Zohren
机构: 未知
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI)
备注: The 6th ACM International Conference on AI in Finance (ICAIF 2025)

点击查看摘要

Abstract:We introduce M2VN: Multi-Modal Volatility Network, a novel deep learning-based framework for financial volatility forecasting that unifies time series features with unstructured news data. M2VN leverages the representational power of deep neural networks to address two key challenges in this domain: (i) aligning and fusing heterogeneous data modalities, numerical financial data and textual information, and (ii) mitigating look-ahead bias that can undermine the validity of financial models. To achieve this, M2VN combines open-source market features with news embeddings generated by Time Machine GPT, a recently introduced point-in-time LLM, ensuring temporal integrity. An auxiliary alignment loss is introduced to enhance the integration of structured and unstructured data within the deep learning architecture. Extensive experiments demonstrate that M2VN consistently outperforms existing baselines, underscoring its practical value for risk management and financial decision-making in dynamic markets.
zh

[AI-93] Finding the Sweet Spot: Trading Quality, Cost and Speed During Inference-Time LLM Reflection

【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在推理阶段性能优化中面临的多目标权衡问题,即如何在不重新训练模型的前提下,通过预算调节(budget tuning)和多步策略(如自省式反思,self-reflection)提升输出质量,同时控制成本与延迟。其解决方案的关键在于系统性地比较这两种方法在数学推理和翻译任务中的表现,并基于不同模型家族(如Anthropic Claude、Amazon Nova、Mistral等)与多种反射深度及计算预算组合,构建帕累托最优性能前沿(Pareto optimal performance frontiers),从而揭示领域依赖的性能增益规律,为实际部署提供可操作的策略选择依据。

链接: https://arxiv.org/abs/2510.20653
作者: Jack Butler,Nikita Kozodoi,Zainab Afolabi,Brian Tyacke,Gaiar Baimuratov
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As Large Language Models (LLMs) continue to evolve, practitioners face increasing options for enhancing inference-time performance without model retraining, including budget tuning and multi-step techniques like self-reflection. While these methods improve output quality, they create complex trade-offs among accuracy, cost, and latency that remain poorly understood across different domains. This paper systematically compares self-reflection and budget tuning across mathematical reasoning and translation tasks. We evaluate prominent LLMs, including Anthropic Claude, Amazon Nova, and Mistral families, along with other models under varying reflection depths and compute budgets to derive Pareto optimal performance frontiers. Our analysis reveals substantial domain dependent variation in self-reflection effectiveness, with performance gains up to 220% in mathematical reasoning. We further investigate how reflection round depth and feedback mechanism quality influence performance across model families. To validate our findings in a real-world setting, we deploy a self-reflection enhanced marketing content localisation system at Lounge by Zalando, where it shows market-dependent effectiveness, reinforcing the importance of domain specific evaluation when deploying these techniques. Our results provide actionable guidance for selecting optimal inference strategies given specific domains and resource constraints. We open source our self-reflection implementation for reproducibility at this https URL.
zh
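
编者补充:摘要比较的"自省式反思"可概括为"起草-批评-修订"循环,反思轮数就是质量与成本/延迟之间的权衡旋钮。下面是与具体模型无关的极简草图,llm 假设为 str -> str 接口,提示词为示意。

```python
def self_reflect(llm, task: str, rounds: int = 2) -> str:
    """Draft -> critique -> revise loop. More `rounds` generally raises
    quality at the price of extra cost and latency (the trade-off the
    paper maps out); `llm` is an assumed str -> str interface."""
    answer = llm(task)
    for _ in range(rounds):
        critique = llm(f"Critique the answer below for errors or omissions.\n"
                       f"Task: {task}\nAnswer: {answer}")
        answer = llm(f"Revise the answer using the critique.\nTask: {task}\n"
                     f"Answer: {answer}\nCritique: {critique}")
    return answer
```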

[AI-94] Quantum Processing Unit (QPU) processing time Prediction with Machine Learning

【速读】:该论文旨在解决量子处理单元(Quantum Processing Unit, QPU)任务处理时间预测不准确的问题,从而提升量子计算系统中的资源管理和调度效率。其解决方案的关键在于利用梯度提升(Gradient-Boosting)算法(具体采用LightGBM模型),结合数据预处理方法,基于约15万条遵循IBM Quantum格式的量子作业数据构建高精度预测模型,有效提升了对QPU处理时间的预测能力,为未来将人工智能驱动工具集成到先进量子计算操作中奠定了基础。

链接: https://arxiv.org/abs/2510.20630
作者: Lucy Xing,Sanjay Vishwakarma,David Kremer,Francisco Martin-Fernandez,Ismael Faro,Juan Cruz-Benito
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注: Technical paper accepted at the IEEE Quantum Week 2025 Conference

点击查看摘要

Abstract:This paper explores the application of machine learning (ML) techniques in predicting the QPU processing time of quantum jobs. By leveraging ML algorithms, this study introduces predictive models that are designed to enhance operational efficiency in quantum computing systems. Using a dataset of about 150,000 jobs that follow the IBM Quantum schema, we employ ML methods based on Gradient-Boosting (LightGBM) to predict the QPU processing times, incorporating data preprocessing methods to improve model accuracy. The results demonstrate the effectiveness of ML in forecasting quantum jobs. These improvements can inform resource management and scheduling within quantum computing frameworks. This research not only highlights the potential of ML in refining quantum job predictions but also sets a foundation for integrating AI-driven tools in advanced quantum computing operations.
zh
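
编者补充:下面用几行代码示意以梯度提升回归器预测 QPU 处理时间的最小流程;特征列名与合成数据均为假设,仅用于说明接口用法。

```python
# Sketch: gradient-boosting regression for QPU processing time.
# Feature names are illustrative; the paper's schema follows IBM Quantum jobs.
import lightgbm as lgb
import numpy as np

X = np.random.rand(1000, 4)   # e.g. [n_qubits, depth, shots, n_circuits]
y = X @ np.array([0.5, 1.0, 0.2, 0.8]) + 0.1 * np.random.randn(1000)

model = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05)
model.fit(X, y)
pred_seconds = model.predict(X[:5])
```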

[AI-95] Symbolic Regression and Differentiable Fits in Beyond the Standard Model Physics

【速读】:该论文旨在解决粒子物理中超越标准模型(Beyond the Standard Model, BSM)理论参数空间探索效率低的问题,特别是针对约束最小超对称标准模型(Constrained Minimal Supersymmetric Standard Model, CMSSM)这类具有多个自由参数的复杂模型。其核心挑战在于如何高效地从大量参数组合中提取可观测物理量(如希格斯质量、冷暗物质 relic 密度和缪子反常磁矩)的精确解析关系,并在此基础上进行全局参数拟合。解决方案的关键在于采用符号回归(Symbolic Regression, SR)方法,自动发现可观测量与输入参数之间的显式数学表达式,从而替代传统依赖数值模拟和采样(如马尔可夫链蒙特卡洛方法)的低效流程。SR不仅显著加速了物理现象分析,还允许使用可微分优化方法进行参数拟合,相较于神经网络(Neural Network, NN)回归更具全局鲁棒性,且无需聚焦于高概率区域的数据即可获得可靠结果。

链接: https://arxiv.org/abs/2510.20453
作者: Shehu AbdusSalam,Steven Abel,Deaglan Bartlett,Miguel Crispim Romão
机构: 未知
类目: High Energy Physics - Phenomenology (hep-ph); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
备注: 18 pages, 4 figures

点击查看摘要

Abstract:We demonstrate the efficacy of symbolic regression (SR) to probe models of particle physics Beyond the Standard Model (BSM), by considering the so-called Constrained Minimal Supersymmetric Standard Model (CMSSM). Like many incarnations of BSM physics this model has a number (four) of arbitrary parameters, which determine the experimental signals, and cosmological observables such as the dark matter relic density. We show that analysis of the phenomenology can be greatly accelerated by using symbolic expressions derived for the observables in terms of the input parameters. Here we focus on the Higgs mass, the cold dark matter relic density, and the contribution to the anomalous magnetic moment of the muon. We find that SR can produce remarkably accurate expressions. Using them we make global fits to derive the posterior probability densities of the CMSSM input parameters which are in good agreement with those performed using conventional methods. Moreover, we demonstrate a major advantage of SR which is the ability to make fits using differentiable methods rather than sampling methods. We also compare the method with neural network (NN) regression. SR produces more globally robust results, while NNs require data that is focussed on the promising regions in order to be equally performant.
zh

[AI-96] Multi-Task Deep Learning for Surface Metrology

【速读】:该论文旨在解决表面计量学中多源测量系统(触觉与光学)下表面纹理参数(Ra、Rz、RONt)及其标准不确定度的联合预测问题,同时实现测量系统类型的分类。其关键解决方案在于构建一个可复现的深度学习框架,通过分层回归头(quantile和heteroscedastic)建模不确定度,并结合事后校准(post-hoc conformal calibration)生成具有统计保证的预测区间;此外,采用单目标回归器避免负迁移效应,显著提升预测精度(如Ra、Rz、RONt的R²均超过0.98),并实现了高精度的测量系统分类(准确率92.85%)。该方法为计量流程中的仪器选型与验收决策提供了可靠的校准预测支持。

链接: https://arxiv.org/abs/2510.20339
作者: D. Kucharski,A. Gaska,T. Kowaluk,K. Stepien,M. Repalska,B. Gapinski,M. Wieczorowski,M. Nawotka,P. Sobecki,P. Sosinowski,J. Tomasik,A. Wojtowicz
机构: 未知
类目: Applied Physics (physics.app-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
备注: 34 pages, 10 figures, 6 tables; 60-page supplementary appendix. Code and full reproducibility bundle available via Zenodo

点击查看摘要

Abstract:A reproducible deep learning framework is presented for surface metrology to predict surface texture parameters together with their reported standard uncertainties. Using a multi-instrument dataset spanning tactile and optical systems, measurement system type classification is addressed alongside coordinated regression of Ra, Rz, RONt and their uncertainty targets (Ra_uncert, Rz_uncert, RONt_uncert). Uncertainty is modelled via quantile and heteroscedastic heads with post-hoc conformal calibration to yield calibrated intervals. On a held-out set, high fidelity was achieved by single-target regressors (R^2: Ra 0.9824, Rz 0.9847, RONt 0.9918), with two uncertainty targets also well modelled (Ra_uncert 0.9899, Rz_uncert 0.9955); RONt_uncert remained difficult (R^2 0.4934). The classifier reached 92.85% accuracy and probability calibration was essentially unchanged after temperature scaling (ECE 0.00504 to 0.00503 on the test split). Negative transfer was observed for naive multi-output trunks, with single-target models performing better. These results provide calibrated predictions suitable to inform instrument selection and acceptance decisions in metrological workflows.
zh

[AI-97] FinCARE: Financial Causal Analysis with Reasoning and Evidence

【速读】:该论文旨在解决投资组合管理者依赖相关性分析和启发式方法所导致的因果关系识别不足问题,这些传统方法无法准确捕捉驱动绩效的真实因果机制。解决方案的关键在于提出一种融合统计因果发现算法与领域知识的混合框架:一方面通过从美国证券交易委员会(SEC)10-K文件中提取的金融知识图谱(Financial Knowledge Graph, FKG)编码约束条件以增强因果发现的合理性;另一方面利用大语言模型(Large Language Model, LLM)进行概念推理以生成可验证的假设。该框架系统性地提升了三种代表性因果发现范式——基于约束的PC算法、基于评分的GES算法以及连续优化的NOTEARS算法——在合成金融数据集上的性能表现,显著提高F1分数,并实现高精度的反事实预测与干预效应方向判断,从而为投资组合管理提供可靠的因果基础,支持动态市场环境下的主动风险管理与战略决策。

链接: https://arxiv.org/abs/2510.20221
作者: Alejandro Michel,Abhinav Arun,Bhaskarjit Sarmah,Stefano Pasquali
机构: 未知
类目: Computational Finance (q-fin.CP); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Portfolio managers rely on correlation-based analysis and heuristic methods that fail to capture true causal relationships driving performance. We present a hybrid framework that integrates statistical causal discovery algorithms with domain knowledge from two complementary sources: a financial knowledge graph extracted from SEC 10-K filings and large language model reasoning. Our approach systematically enhances three representative causal discovery paradigms, constraint-based (PC), score-based (GES), and continuous optimization (NOTEARS), by encoding knowledge graph constraints algorithmically and leveraging LLM conceptual reasoning for hypothesis generation. Evaluated on a synthetic financial dataset of 500 firms across 18 variables, our KG+LLM-enhanced methods demonstrate consistent improvements across all three algorithms: PC (F1: 0.622 vs. 0.459 baseline, +36%), GES (F1: 0.735 vs. 0.367, +100%), and NOTEARS (F1: 0.759 vs. 0.163, +366%). The framework enables reliable scenario analysis with mean absolute error of 0.003610 for counterfactual predictions and perfect directional accuracy for intervention effects. It also addresses critical limitations of existing methods by grounding statistical discoveries in financial domain expertise while maintaining empirical validation, providing portfolio managers with the causal foundation necessary for proactive risk management and strategic decision-making in dynamic market environments.
zh

[AI-98] On the Structure of Stationary Solutions to McKean-Vlasov Equations with Applications to Noisy Transformers

【速读】:该论文旨在解决麦凯恩-弗拉索夫方程(McKean-Vlasov equation)在圆周上的稳态解的结构性质及其相变行为这一核心问题,尤其关注多傅里叶模态耦合下的局部分岔(bifurcation)与全局自由能景观的几何特性。其解决方案的关键在于发现稳态解与傅里叶系数上无限维二次方程组之间的精确等价关系,从而将原本定义在函数空间中的问题转化为序列空间中的显式刻画,使得局部分岔的周期性、共振结构及非光滑势能下的稳定性得以清晰解析;同时,通过建立自由能景观的正则性和凹性性质,证明了全局最小测度的存在性、紧性及共存现象,并将不连续相变明确关联到最小自由能映射的不可微点。此框架不仅适用于一般情形,还被应用于“噪声均值场变换器”(Noisy Mean-Field Transformer)模型,揭示了逆温度参数 β\beta 增大时从连续到不连续(一阶)相变的尖锐转变,以及多模态近似稳态解(即“亚稳态”)的涌现机制。

链接: https://arxiv.org/abs/2510.20094
作者: Krishnakumar Balasubramanian,Sayan Banerjee,Philippe Rigollet
机构: 未知
类目: Probability (math.PR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Analysis of PDEs (math.AP); Machine Learning (stat.ML)
备注: 46 pages, 5 figures

点击查看摘要

Abstract:We study stationary solutions of McKean-Vlasov equations on the circle. Our main contributions stem from observing an exact equivalence between solutions of the stationary McKean-Vlasov equation and an infinite-dimensional quadratic system of equations over Fourier coefficients, which allows explicit characterization of the stationary states in a sequence space rather than a function space. This framework provides a transparent description of local bifurcations, characterizing their periodicity, and resonance structures, while accommodating singular potentials. We derive analytic expressions that characterize the emergence, form and shape (supercritical, critical, subcritical or transcritical) of bifurcations involving possibly multiple Fourier modes and connect them with discontinuous phase transitions. We also characterize, under suitable assumptions, the detailed structure of the stationary bifurcating solutions that are accurate up to an arbitrary number of Fourier modes. At the global level, we establish regularity and concavity properties of the free energy landscape, proving existence, compactness, and coexistence of globally minimizing stationary measures, further identifying discontinuous phase transitions with points of non-differentiability of the minimum free energy map. As an application, we specialize the theory to the Noisy Mean-Field Transformer model, where we show how changing the inverse temperature parameter β affects the geometry of the infinitely many bifurcations from the uniform measure. We also explain how increasing β can lead to a rich class of approximate multi-mode stationary solutions which can be seen as 'metastable states'. Further, a sharp transition from continuous to discontinuous (first-order) phase behavior is observed as β increases.
zh
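For orientation only (a textbook derivation under smoothness assumptions on the interaction potential $W$, not quoted from the paper), here is the standard route from the stationary Fokker-Planck equation on the circle to a quadratic system over Fourier coefficients:

```latex
% Stationary Fokker-Planck equation on the torus, inverse temperature beta:
%   0 = beta^{-1} rho'' + ( rho (W * rho)' )'
% Writing rho(x) = sum_k c_k e^{ikx} and W(x) = sum_k w_k e^{ikx},
% matching the k-th mode gives, for every k != 0,
\[
\frac{k^{2}}{\beta}\, c_k \;=\; -\,k \sum_{m \neq 0} m\, w_m\, c_m\, c_{k-m},
\]
% i.e. an infinite-dimensional quadratic system over the Fourier
% coefficients, with c_0 = 1/(2\pi) fixed by normalization.
```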

[AI-99] SSL-SE-EEG: A Framework for Robust Learning from Unlabeled EEG Data with Self-Supervised Learning and Squeeze-Excitation Networks

【速读】: This paper addresses the obstacles that hinder the practical deployment of EEG in brain-computer interfaces (BCI) and neurological diagnostics, namely noise artifacts, missing data, and high annotation costs. The key to the solution is the proposed SSL-SE-EEG framework, which integrates Self-Supervised Learning (SSL) with Squeeze-and-Excitation Networks (SE-Nets) and transforms EEG signals into structured 2D image representations, thereby strengthening feature extraction, improving noise robustness, and markedly reducing the reliance on labeled data. Experiments show state-of-the-art performance on several public datasets, making the method well suited to low-power, scalable, real-time EEG processing.

链接: https://arxiv.org/abs/2510.19829
作者: Meghna Roy Chowdhury,Yi Ding,Shreyas Sen
机构: 未知
类目: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 figures, 2 tables, 8 pages

点击查看摘要

Abstract:Electroencephalography (EEG) plays a crucial role in brain-computer interfaces (BCIs) and neurological diagnostics, but its real-world deployment faces challenges due to noise artifacts, missing data, and high annotation costs. We introduce SSL-SE-EEG, a framework that integrates Self-Supervised Learning (SSL) with Squeeze-and-Excitation Networks (SE-Nets) to enhance feature extraction, improve noise robustness, and reduce reliance on labeled data. Unlike conventional EEG processing techniques, SSL-SE-EEG transforms EEG signals into structured 2D image representations, suitable for deep learning. Experimental validation on MindBigData, TUH-AB, SEED-IV and BCI-IV datasets demonstrates state-of-the-art accuracy (91% in MindBigData, 85% in TUH-AB), making it well-suited for real-time BCI applications. By enabling low-power, scalable EEG processing, SSL-SE-EEG presents a promising solution for biomedical signal analysis, neural engineering, and next-generation BCIs.
zh
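The Squeeze-and-Excitation block at the heart of SE-Nets is compact enough to sketch directly; the PyTorch module below is the standard formulation (channel sizes and the 2D "EEG image" shape are placeholder assumptions):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global-average-pool ("squeeze"), then a
    two-layer bottleneck gate ("excitation") that reweights channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # squeeze: (B, C)
        w = self.fc(w).view(b, c, 1, 1)  # excitation: per-channel gates
        return x * w                     # rescale feature maps

# e.g. applied to a batch of 2D "EEG images": (batch, channels, H, W)
feats = torch.randn(8, 64, 32, 32)
print(SEBlock(64)(feats).shape)  # torch.Size([8, 64, 32, 32])
```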

Machine Learning

[LG-0] KL-Regularized Reinforcement Learning is Designed to Mode Collapse

链接: https://arxiv.org/abs/2510.20817
作者: Anthony GX-Chen,Jatin Prakash,Jeff Guo,Rob Fergus,Rajesh Ranganath
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:It is commonly believed that optimizing the reverse KL divergence results in “mode seeking”, while optimizing forward KL results in “mass covering”, with the latter being preferred if the goal is to sample from multiple diverse modes. We show – mathematically and empirically – that this intuition does not necessarily transfer well to doing reinforcement learning with reverse/forward KL regularization (e.g. as commonly used with language models). Instead, the choice of reverse/forward KL determines the family of optimal target distributions, parameterized by the regularization coefficient. Mode coverage depends primarily on other factors, such as regularization strength, and relative scales between rewards and reference probabilities. Further, we show that commonly used settings such as low regularization strength and equal verifiable rewards tend to specify unimodal target distributions, meaning the optimization objective is, by construction, non-diverse. We leverage these insights to construct a simple, scalable, and theoretically justified algorithm. It makes minimal changes to reward magnitudes, yet optimizes for a target distribution which puts high probability over all high-quality sampling modes. In experiments, this simple modification works to post-train both Large Language Models and Chemical Language Models to have higher solution quality and diversity, without any external signals of diversity, and works with both forward and reverse KL in settings where naively using either fails.
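For context, a standard result (not specific to this paper): the reverse-KL-regularized objective admits a closed-form optimal policy, which is what it means for the regularization coefficient to parameterize the family of target distributions.

```latex
\[
\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}_{y\sim\pi}\!\left[r(y)\right]
  - \beta\,\mathrm{KL}\!\left(\pi \,\Vert\, \pi_{\mathrm{ref}}\right)
\;\;\Longrightarrow\;\;
\pi^{*}(y) \;\propto\; \pi_{\mathrm{ref}}(y)\, e^{\,r(y)/\beta}.
\]
% With equal verifiable rewards on all correct answers, the exponential tilt
% is identical across them, so their relative mass is inherited from pi_ref;
% small beta therefore sharpens toward the reference model's dominant mode
% rather than covering all high-quality modes.
```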

[LG-1] Out-of-distribution Tests Reveal Compositionality in Chess Transformers

链接: https://arxiv.org/abs/2510.20783
作者: Anna Mészáros,Patrik Reizinger,Ferenc Huszár
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Chess is a canonical example of a task that requires rigorous reasoning and long-term planning. Modern decision Transformers - trained similarly to LLMs - are able to learn competent gameplay, but it is unclear to what extent they truly capture the rules of chess. To investigate this, we train a 270M parameter chess Transformer and test it on out-of-distribution scenarios, designed to reveal failures of systematic generalization. Our analysis shows that Transformers exhibit compositional generalization, as evidenced by strong rule extrapolation: they adhere to fundamental syntactic rules of the game by consistently choosing valid moves even in situations very different from the training data. Moreover, they also generate high-quality moves for OOD puzzles. In a more challenging test, we evaluate the models on variants including Chess960 (Fischer Random Chess) - a variant of chess where starting positions of pieces are randomized. We found that while the models exhibit basic strategy adaptation, they are inferior to symbolic AI algorithms that perform explicit search, though the gap is smaller when playing against users on Lichess. Moreover, the training dynamics revealed that the model initially learns to move only its own pieces, suggesting an emergent compositional understanding of the game.

[LG-2] Learning to Triage Taint Flows Reported by Dynamic Program Analysis in Node.js Packages

链接: https://arxiv.org/abs/2510.20739
作者: Ronghao Ni,Aidan Z.H. Yang,Min-Chien Hsu,Nuno Sabino,Limin Jia,Ruben Martins,Darion Cassel,Kevin Cheang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Program analysis tools often produce large volumes of candidate vulnerability reports that require costly manual review, creating a practical challenge: how can security analysts prioritize the reports most likely to be true vulnerabilities? This paper investigates whether machine learning can be applied to prioritizing vulnerabilities reported by program analysis tools. We focus on Node.js packages and collect a benchmark of 1,883 Node.js packages, each containing one reported ACE or ACI vulnerability. We evaluate a variety of machine learning approaches, including classical models, graph neural networks (GNNs), large language models (LLMs), and hybrid models that combine GNNs and LLMs, trained on data based on a dynamic program analysis tool's output. The top LLM achieves $F_1 = 0.915$, while the best GNN and classical ML models reach $F_1 = 0.904$. At a less than 7% false-negative rate, the leading model eliminates 66.9% of benign packages from manual review, taking around 60 ms per package. If the best model is tuned to operate at a precision level of 0.8 (i.e., allowing 20% false positives amongst all warnings), our approach can detect 99.2% of exploitable taint flows while missing only 0.8%, demonstrating strong potential for real-world vulnerability triage.

[LG-3] Amplifying Prominent Representations in Multimodal Learning via Variational Dirichlet Process

链接: https://arxiv.org/abs/2510.20736
作者: Tsai Hor Chan,Feng Wu,Yihang Chen,Guosheng Yin,Lequan Yu
类目: Machine Learning (cs.LG)
*备注: Accepted by NeruIPS 2025

点击查看摘要

Abstract:Developing effective multimodal fusion approaches has become increasingly essential in many real-world scenarios, such as health care and finance. The key challenge is how to preserve the feature expressiveness in each modality while learning cross-modal interactions. Previous approaches primarily focus on the cross-modal alignment, while over-emphasis on the alignment of marginal distributions of modalities may impose excess regularization and obstruct meaningful representations within each modality. The Dirichlet process (DP) mixture model is a powerful Bayesian non-parametric method that can amplify the most prominent features by its richer-gets-richer property, which allocates increasing weights to them. Inspired by this unique characteristic of DP, we propose a new DP-driven multimodal learning framework that automatically achieves an optimal balance between prominent intra-modal representation learning and cross-modal alignment. Specifically, we assume that each modality follows a mixture of multivariate Gaussian distributions and further adopt DP to calculate the mixture weights for all the components. This paradigm allows DP to dynamically allocate the contributions of features and select the most prominent ones, leveraging its richer-gets-richer property, thus facilitating multimodal feature fusion. Extensive experiments on several multimodal datasets demonstrate the superior performance of our model over other competitors. Ablation analysis further validates the effectiveness of DP in aligning modality distributions and its robustness to changes in key hyperparameters. Code is anonymously available at this https URL

[LG-4] No-Regret Thompson Sampling for Finite-Horizon Markov Decision Processes with Gaussian Processes NEURIPS

链接: https://arxiv.org/abs/2510.20725
作者: Jasmine Bayrooti,Sattar Vakili,Amanda Prorok,Carl Henrik Ek
类目: Machine Learning (cs.LG)
*备注: Appearing in NeurIPS, 2025

点击查看摘要

Abstract:Thompson sampling (TS) is a powerful and widely used strategy for sequential decision-making, with applications ranging from Bayesian optimization to reinforcement learning (RL). Despite its success, the theoretical foundations of TS remain limited, particularly in settings with complex temporal structure such as RL. We address this gap by establishing no-regret guarantees for TS using models with Gaussian marginal distributions. Specifically, we consider TS in episodic RL with joint Gaussian process (GP) priors over rewards and transitions. We prove a regret bound of $\tilde{\mathcal{O}}(\sqrt{KH\,\Gamma(KH)})$ over $K$ episodes of horizon $H$, where $\Gamma(\cdot)$ captures the complexity of the GP model. Our analysis addresses several challenges, including the non-Gaussian nature of value functions and the recursive structure of Bellman updates, and extends classical tools such as the elliptical potential lemma to multi-output settings. This work advances the understanding of TS in RL and highlights how structural assumptions and model uncertainty shape its performance in finite-horizon Markov Decision Processes.

[LG-5] Optimizing Clinical Fall Risk Prediction: A Data-Driven Integration of EHR Variables with the Johns Hopkins Fall Risk Assessment Tool

链接: https://arxiv.org/abs/2510.20714
作者: Fardin Ganjkhanloo,Emmett Springer,Erik H. Hoyer,Daniel L. Young,Kimia Ghobadi
类目: Machine Learning (cs.LG)
*备注: 19 pages, 7 figures, 4 tables

点击查看摘要

Abstract:In this study we aim to better align fall risk prediction from the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) with additional clinically meaningful measures via a data-driven modelling approach. We conducted a retrospective analysis of 54,209 inpatient admissions from three Johns Hopkins Health System hospitals between March 2022 and October 2023. A total of 20,208 admissions were included as high fall risk encounters, and 13,941 were included as low fall risk encounters. To incorporate clinical knowledge and maintain interpretability, we employed constrained score optimization (CSO) models on JHFRAT assessment data and additional electronic health record (EHR) variables. The model demonstrated significant improvements in predictive performance over the current JHFRAT (CSO AUC-ROC=0.91, JHFRAT AUC-ROC=0.86). The constrained score optimization models performed similarly with and without the EHR variables. Although the benchmark black-box model (XGBoost), improves upon the performance metrics of the knowledge-based constrained logistic regression (AUC-ROC=0.94), the CSO demonstrates more robustness to variations in risk labelling. This evidence-based approach provides a robust foundation for health systems to systematically enhance inpatient fall prevention protocols and patient safety using data-driven optimization techniques, contributing to improved risk assessment and resource allocation in healthcare settings.
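A minimal sketch of what a constrained score optimization might look like, assuming (as one plausible reading of the abstract) a logistic model whose item weights are constrained to be non-negative so that each assessment item can only add risk; the toy data and constraint choice are illustrative, not the study's actual model.

```python
import numpy as np
from scipy.optimize import minimize

def fit_constrained_scores(X, y, l2=1e-3):
    """Logistic regression with non-negative item weights, keeping the
    fitted scores interpretable as an additive risk score (illustrative)."""
    n, d = X.shape

    def nll(params):
        b, w = params[0], params[1:]
        z = b + X @ w
        # numerically stable logistic negative log-likelihood + L2 term
        return np.mean(np.logaddexp(0.0, z) - y * z) + l2 * w @ w

    x0 = np.zeros(d + 1)
    bounds = [(None, None)] + [(0.0, None)] * d  # free intercept, weights >= 0
    res = minimize(nll, x0, method="L-BFGS-B", bounds=bounds)
    return res.x[0], res.x[1:]

rng = np.random.default_rng(1)
X = rng.integers(0, 4, size=(400, 5)).astype(float)  # JHFRAT-style item scores
logits = -3.0 + X @ np.array([0.6, 0.4, 0.0, 0.8, 0.2])
y = (rng.random(400) < 1 / (1 + np.exp(-logits))).astype(float)
b, w = fit_constrained_scores(X, y)
print(np.round(w, 2))
```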

[LG-6] Separating the what and how of compositional computation to enable reuse and continual learning

链接: https://arxiv.org/abs/2510.20709
作者: Haozhe Shan,Sun Minni,Lea Duncker
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:

点击查看摘要

Abstract:The ability to continually learn, retain and deploy skills to accomplish goals is a key feature of intelligent and efficient behavior. However, the neural mechanisms facilitating the continual learning and flexible (re-)composition of skills remain elusive. Here, we study continual learning and the compositional reuse of learned computations in recurrent neural network (RNN) models using a novel two-system approach: one system that infers what computation to perform, and one that implements how to perform it. We focus on a set of compositional cognitive tasks commonly studied in neuroscience. To construct the what system, we first show that a large family of tasks can be systematically described by a probabilistic generative model, where compositionality stems from a shared underlying vocabulary of discrete task epochs; this shared epoch structure makes the tasks inherently compositional. Furthermore, we develop an unsupervised online learning approach that can learn this model on a single-trial basis, building its vocabulary incrementally as it is exposed to new tasks, and inferring the latent epoch structure as a time-varying computational context within a trial. We implement the how system as an RNN whose low-rank components are composed according to the context inferred by the what system. Contextual inference facilitates the creation, learning, and reuse of low-rank RNN components as new tasks are introduced sequentially, enabling continual learning without catastrophic forgetting. Using an example task set, we demonstrate the efficacy and competitive performance of this two-system learning framework, its potential for forward and backward transfer, as well as fast compositional generalization to unseen tasks.

[LG-7] From Masks to Worlds: A Hitchhikers Guide to World Models

链接: https://arxiv.org/abs/2510.20668
作者: Jinbin Bai,Yu Lei,Hecong Wu,Yuchen Zhu,Shufan Li,Yi Xin,Xiangtai Li,Molei Tao,Aditya Grover,Ming-Hsuan Yang
类目: Machine Learning (cs.LG)
*备注: Github: this https URL

点击查看摘要

Abstract:This is not a typical survey of world models; it is a guide for those who want to build worlds. We do not aim to catalog every paper that has ever mentioned a "world model". Instead, we follow one clear road: from early masked models that unified representation learning across modalities, to unified architectures that share a single paradigm, then to interactive generative models that close the action-perception loop, and finally to memory-augmented systems that sustain consistent worlds over time. We bypass loosely related branches to focus on the core: the generative heart, the interactive loop, and the memory system. We show that this is the most promising path towards true world models.

[LG-8] Bayesian Jammer Localization with a Hybrid CNN and Path-Loss Mixture of Experts ICASSP

链接: https://arxiv.org/abs/2510.20666
作者: Mariona Jaramillo-Civill,Luis González-Gudiño,Tales Imbiriba,Pau Closas
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 5 pages, 4 figures, Submitted to ICASSPW 2026

点击查看摘要

Abstract:Global Navigation Satellite System (GNSS) signals are vulnerable to jamming, particularly in urban areas where multipath and shadowing distort received power. Previous data-driven approaches achieved reasonable localization but poorly reconstructed the received signal strength (RSS) field due to limited spatial context. We propose a hybrid Bayesian mixture-of-experts framework that fuses a physical path-loss (PL) model and a convolutional neural network (CNN) through log-linear pooling. The PL expert ensures physical consistency, while the CNN leverages building-height maps to capture urban propagation effects. Bayesian inference with Laplace approximation provides posterior uncertainty over both the jammer position and RSS field. Experiments on urban ray-tracing data show that localization accuracy improves and uncertainty decreases with more training points, while uncertainty concentrates near the jammer and along urban canyons where propagation is most sensitive.
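Log-linear pooling of the two experts is simple to state; a hedged one-dimensional sketch follows, where the pooling weight alpha and both expert log-densities are placeholders:

```python
import numpy as np

def log_linear_pool(log_p_pl, log_p_cnn, alpha=0.5):
    """Combine a physics (path-loss) expert and a CNN expert by
    log-linear pooling: log p ∝ alpha*log p_PL + (1-alpha)*log p_CNN."""
    log_p = alpha * log_p_pl + (1.0 - alpha) * log_p_cnn
    log_p -= np.max(log_p)                  # stabilize before normalizing
    p = np.exp(log_p)
    return p / p.sum()

# toy posterior over a 1D grid of candidate jammer positions
grid = np.linspace(0.0, 100.0, 201)
log_p_pl = -0.01 * (grid - 40.0) ** 2       # broad, physically consistent
log_p_cnn = -0.05 * (grid - 45.0) ** 2      # sharper, map-informed
post = log_linear_pool(log_p_pl, log_p_cnn, alpha=0.4)
print(grid[np.argmax(post)])                # pooled position estimate
```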

[LG-9] xTime: Extreme Event Prediction with Hierarchical Knowledge Distillation and Expert Fusion

链接: https://arxiv.org/abs/2510.20651
作者: Quan Li,Wenchao Yu,Suhang Wang,Minhua Lin,Lingwei Chen,Wei Cheng,Haifeng Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Extreme events frequently occur in real-world time series and often carry significant practical implications. In domains such as climate and healthcare, these events, such as floods, heatwaves, or acute medical episodes, can lead to serious consequences. Accurate forecasting of such events is therefore of substantial importance. Most existing time series forecasting models are optimized for overall performance within the prediction window, but often struggle to accurately predict extreme events, such as high temperatures or heart rate spikes. The main challenges are data imbalance and the neglect of valuable information contained in intermediate events that precede extreme events. In this paper, we propose xTime, a novel framework for extreme event forecasting in time series. xTime leverages knowledge distillation to transfer information from models trained on lower-rarity events, thereby improving prediction performance on rarer ones. In addition, we introduce a mixture of experts (MoE) mechanism that dynamically selects and fuses outputs from expert models across different rarity levels, which further improves the forecasting performance for extreme events. Experiments on multiple datasets show that xTime achieves consistent improvements, with forecasting accuracy on extreme events improving by margins ranging from 3% to 78%.

[LG-10] Connecting Jensen-Shannon and Kullback-Leibler Divergences: A New Bound for Representation Learning NEURIPS2025

链接: https://arxiv.org/abs/2510.20644
作者: Reuben Dorent,Polina Golland,William Wells III
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: Accepted at NeurIPS 2025. Code available at this https URL

点击查看摘要

Abstract:Mutual Information (MI) is a fundamental measure of statistical dependence widely used in representation learning. While direct optimization of MI via its definition as a Kullback-Leibler divergence (KLD) is often intractable, many recent methods have instead maximized alternative dependence measures, most notably, the Jensen-Shannon divergence (JSD) between joint and product of marginal distributions via discriminative losses. However, the connection between these surrogate objectives and MI remains poorly understood. In this work, we bridge this gap by deriving a new, tight, and tractable lower bound on KLD as a function of JSD in the general case. By specializing this bound to joint and marginal distributions, we demonstrate that maximizing the JSD-based information increases a guaranteed lower bound on mutual information. Furthermore, we revisit the practical implementation of JSD-based objectives and observe that minimizing the cross-entropy loss of a binary classifier trained to distinguish joint from marginal pairs recovers a known variational lower bound on the JSD. Extensive experiments demonstrate that our lower bound is tight when applied to MI estimation. We compared our lower bound to state-of-the-art neural estimators of variational lower bound across a range of established reference scenarios. Our lower bound estimator consistently provides a stable, low-variance estimate of a tight lower bound on MI. We also demonstrate its practical usefulness in the context of the Information Bottleneck framework. Taken together, our results provide new theoretical justifications and strong empirical evidence for using discriminative learning in MI-based representation learning.
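The practical recipe the abstract revisits can be sketched directly: train a binary critic to distinguish joint pairs from shuffled (marginal) pairs, and read off a variational JSD statistic, which the paper's bound then converts into a lower bound on MI. The softplus form below is the standard f-GAN/Deep-InfoMax estimator, assumed here for illustration rather than taken from the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

for step in range(2000):
    # correlated Gaussian pair (x, y); ground-truth MI is known in closed form
    x = torch.randn(512, 1)
    y = 0.9 * x + torch.sqrt(torch.tensor(1 - 0.9 ** 2)) * torch.randn(512, 1)
    joint = torch.cat([x, y], dim=1)
    marginal = torch.cat([x, y[torch.randperm(512)]], dim=1)  # shuffling breaks dependence
    # variational JSD lower bound (up to constants), maximized over the critic
    jsd = (-F.softplus(-critic(joint))).mean() - F.softplus(critic(marginal)).mean()
    loss = -jsd
    opt.zero_grad(); loss.backward(); opt.step()

print(f"estimated JSD-based statistic: {jsd.item():.3f}")
```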

[LG-11] Attention Enhanced Entity Recommendation for Intelligent Monitoring in Cloud Systems

链接: https://arxiv.org/abs/2510.20640
作者: Fiza Hussain,Anson Bastos,Anjaly Parayil,Ayush Choure,Chetan Bansal,Rujia Wang,Saravan Rajmohan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we present DiRecGNN, an attention-enhanced entity recommendation framework for monitoring cloud services at Microsoft. We provide insights on the usefulness of this feature as perceived by the cloud service owners and lessons learned from deployment. Specifically, we introduce the problem of recommending the optimal subset of attributes (dimensions) that should be tracked by an automated watchdog (monitor) for cloud services. To begin, we construct the monitor heterogeneous graph at production-scale. The interaction dynamics of these entities are often characterized by limited structural and engagement information, resulting in inferior performance of state-of-the-art approaches. Moreover, traditional methods fail to capture the dependencies between entities spanning a long range due to their homophilic nature. Therefore, we propose an attention-enhanced entity ranking model inspired by transformer architectures. Our model utilizes a multi-head attention mechanism to focus on heterogeneous neighbors and their attributes, and further attends to paths sampled using random walks to capture long-range dependencies. We also employ multi-faceted loss functions to optimize for relevant recommendations while respecting the inherent sparsity of the data. Empirical evaluations demonstrate significant improvements over existing methods, with our model achieving a 43.1% increase in MRR. Furthermore, product teams who consumed these features perceive the feature as useful and rated it 4.5 out of 5.

[LG-12] Large Multimodal Models-Empowered Task-Oriented Autonomous Communications: Design Methodology and Implementation Challenges

链接: https://arxiv.org/abs/2510.20637
作者: Hyun Jong Yang,Hyunsoo Kim,Hyeonho Noh,Seungnyun Kim,Byonghyo Shim
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) and large multimodal models (LMMs) have achieved unprecedented breakthrough, showcasing remarkable capabilities in natural language understanding, generation, and complex reasoning. This transformative potential has positioned them as key enablers for 6G autonomous communications among machines, vehicles, and humanoids. In this article, we provide an overview of task-oriented autonomous communications with LLMs/LMMs, focusing on multimodal sensing integration, adaptive reconfiguration, and prompt/fine-tuning strategies for wireless tasks. We demonstrate the framework through three case studies: LMM-based traffic control, LLM-based robot scheduling, and LMM-based environment-aware channel estimation. From experimental results, we show that the proposed LLM/LMM-aided autonomous systems significantly outperform conventional and discriminative deep learning (DL) model-based techniques, maintaining robustness under dynamic objectives, varying input parameters, and heterogeneous multimodal conditions where conventional static optimization degrades.

[LG-13] H-SPLID: HSIC-based Saliency Preserving Latent Information Decomposition NEURIPS2025

链接: https://arxiv.org/abs/2510.20627
作者: Lukas Miklautz,Chengzhi Shi,Andrii Shkabrii,Theodoros Thirimachos Davarakis,Prudence Lam,Claudia Plant,Jennifer Dy,Stratis Ioannidis
类目: Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2025

点击查看摘要

Abstract:We introduce H-SPLID, a novel algorithm for learning salient feature representations through the explicit decomposition of salient and non-salient features into separate spaces. We show that H-SPLID promotes learning low-dimensional, task-relevant features. We prove that the expected prediction deviation under input perturbations is upper-bounded by the dimension of the salient subspace and the Hilbert-Schmidt Independence Criterion (HSIC) between inputs and representations. This establishes a link between robustness and latent representation compression in terms of the dimensionality and information preserved. Empirical evaluations on image classification tasks show that models trained with H-SPLID primarily rely on salient input components, as indicated by reduced sensitivity to perturbations affecting non-salient features, such as image backgrounds. Our code is available at this https URL.
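Since HSIC is central to the bound, a small numpy sketch of the standard biased empirical estimator with Gaussian kernels may help (bandwidth and toy data are arbitrary):

```python
import numpy as np

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC with Gaussian kernels:
    HSIC = trace(K H L H) / (n - 1)^2, H the centering matrix."""
    n = X.shape[0]
    def gram(Z):
        sq = np.sum(Z ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2 * Z @ Z.T
        return np.exp(-d2 / (2 * sigma ** 2))
    K, L = gram(X), gram(Y)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
print(hsic(x, x ** 2 + 0.1 * rng.normal(size=(200, 1))))  # dependent: larger
print(hsic(x, rng.normal(size=(200, 1))))                 # independent: near 0
```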

[LG-14] On Optimal Hyperparameters for Differentially Private Deep Transfer Learning

链接: https://arxiv.org/abs/2510.20616
作者: Aki Rehn,Linzh Zhao,Mikko A. Heikkilä,Antti Honkela
类目: Machine Learning (cs.LG)
*备注: 25 pages, 30 figures

点击查看摘要

Abstract:Differentially private (DP) transfer learning, i.e., fine-tuning a pretrained model on private data, is the current state-of-the-art approach for training large models under privacy constraints. We focus on two key hyperparameters in this setting: the clipping bound C and batch size B . We show a clear mismatch between the current theoretical understanding of how to choose an optimal C (stronger privacy requires smaller C ) and empirical outcomes (larger C performs better under strong privacy), caused by changes in the gradient distributions. Assuming a limited compute budget (fixed epochs), we demonstrate that the existing heuristics for tuning B do not work, while cumulative DP noise better explains whether smaller or larger batches perform better. We also highlight how the common practice of using a single (C,B) setting across tasks can lead to suboptimal performance. We find that performance drops especially when moving between loose and tight privacy and between plentiful and limited compute, which we explain by analyzing clipping as a form of gradient re-weighting and examining cumulative DP noise.

[LG-15] MS-BART: Unified Modeling of Mass Spectra and Molecules for Structure Elucidation NEURIPS2025

链接: https://arxiv.org/abs/2510.20615
作者: Yang Han,Pengyu Wang,Kai Yu,Xin Chen,Lu Chen
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2025, We provide the data and code at this https URL

点击查看摘要

Abstract:Mass spectrometry (MS) plays a critical role in molecular identification, significantly advancing scientific discovery. However, structure elucidation from MS data remains challenging due to the scarcity of annotated spectra. While large-scale pretraining has proven effective in addressing data scarcity in other domains, applying this paradigm to mass spectrometry is hindered by the complexity and heterogeneity of raw spectral signals. To address this, we propose MS-BART, a unified modeling framework that maps mass spectra and molecular structures into a shared token vocabulary, enabling cross-modal learning through large-scale pretraining on reliably computed fingerprint-molecule datasets. Multi-task pretraining objectives further enhance MS-BART’s generalization by jointly optimizing denoising and translation task. The pretrained model is subsequently transferred to experimental spectra through finetuning on fingerprint predictions generated with MIST, a pre-trained spectral inference model, thereby enhancing robustness to real-world spectral variability. While finetuning alleviates the distributional difference, MS-BART still suffers molecular hallucination and requires further alignment. We therefore introduce a chemical feedback mechanism that guides the model toward generating molecules closer to the reference structure. Extensive evaluations demonstrate that MS-BART achieves SOTA performance across 5/12 key metrics on MassSpecGym and NPLIB1 and is faster by one order of magnitude than competing diffusion-based methods, while comprehensive ablation studies systematically validate the model’s effectiveness and robustness.

[LG-16] Convergence Analysis of SGD under Expected Smoothness AISTATS2026

链接: https://arxiv.org/abs/2510.20608
作者: Yuta Kawamoto,Hideaki Iiduka
类目: Machine Learning (cs.LG)
*备注: 23 pages, 11 figures, AISTATS 2026

点击查看摘要

Abstract:Stochastic gradient descent (SGD) is the workhorse of large-scale learning, yet classical analyses rely on assumptions that can be either too strong (bounded variance) or too coarse (uniform noise). The expected smoothness (ES) condition has emerged as a flexible alternative that ties the second moment of stochastic gradients to the objective value and the full gradient. This paper presents a self-contained convergence analysis of SGD under ES. We (i) refine ES with interpretations and sampling-dependent constants; (ii) derive bounds on the expected squared norm of the full gradient; and (iii) prove $O(1/K)$ rates with explicit residual errors for various step-size schedules. All proofs are given in full detail in the appendix. Our treatment unifies and extends recent threads (Khaled and Richtárik, 2020; Umeda and Iiduka, 2025).
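For reference, the expected smoothness condition of Khaled and Richtárik (2020) that the abstract builds on is typically stated as follows, with sampling-dependent constants $A, B, C$ and $f^{\inf}$ a lower bound on $f$:

```latex
\[
\mathbb{E}_{\xi}\!\left[\lVert \nabla f_{\xi}(x) \rVert^{2}\right]
\;\le\; 2A\left(f(x) - f^{\inf}\right) + B\,\lVert \nabla f(x) \rVert^{2} + C
\qquad \text{for all } x,
\]
% The classical bounded-variance assumption is recovered at A = 0, B = 1,
% C = sigma^2, which is why ES is the more flexible condition.
```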

[LG-17] Strategic Costs of Perceived Bias in Fair Selection NEURIPS2025

链接: https://arxiv.org/abs/2510.20606
作者: L. Elisa Celis,Lingxiao Huang,Milind Sohoni,Nisheeth K. Vishnoi
类目: Computer Science and Game Theory (cs.GT); Computers and Society (cs.CY); Machine Learning (cs.LG); Theoretical Economics (econ.TH)
*备注: The paper has been accepted by NeurIPS 2025

点击查看摘要

Abstract:Meritocratic systems, from admissions to hiring, aim to impartially reward skill and effort. Yet persistent disparities across race, gender, and class challenge this ideal. Some attribute these gaps to structural inequality; others to individual choice. We develop a game-theoretic model in which candidates from different socioeconomic groups differ in their perceived post-selection value, shaped by social context and, increasingly, by AI-powered tools offering personalized career or salary guidance. Each candidate strategically chooses effort, balancing its cost against expected reward; effort translates into observable merit, and selection is based solely on merit. We characterize the unique Nash equilibrium in the large-agent limit and derive explicit formulas showing how valuation disparities and institutional selectivity jointly determine effort, representation, social welfare, and utility. We further propose a cost-sensitive optimization framework that quantifies how modifying selectivity or perceived value can reduce disparities without compromising institutional goals. Our analysis reveals a perception-driven bias: when perceptions of post-selection value differ across groups, these differences translate into rational differences in effort, propagating disparities backward through otherwise “fair” selection processes. While the model is static, it captures one stage of a broader feedback cycle linking perceptions, incentives, and outcomes, bridging rational-choice and structural explanations of inequality by showing how techno-social environments shape individual incentives in meritocratic systems.

[LG-18] Embedding the MLOps Lifecycle into OT Reference Models

链接: https://arxiv.org/abs/2510.20590
作者: Simon Schindler,Christoph Binder,Lukas Lürzer,Stefan Huber
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine Learning Operations (MLOps) practices are increasingly adopted in industrial settings, yet their integration with Operational Technology (OT) systems presents significant challenges. This paper analyzes the fundamental obstacles in combining MLOps with OT environments and proposes a systematic approach to embed MLOps practices into established OT reference models. We evaluate the suitability of the Reference Architectural Model for Industry 4.0 (RAMI 4.0) and the International Society of Automation Standard 95 (ISA-95) for MLOps integration and present a detailed mapping of MLOps lifecycle components to RAMI 4.0, exemplified by a real-world use case. Our findings demonstrate that while standard MLOps practices cannot be directly transplanted to OT environments, structured adaptation using existing reference models can provide a pathway for successful integration.

[LG-19] A Unified Framework for Zero-Shot Reinforcement Learning

链接: https://arxiv.org/abs/2510.20542
作者: Jacopo Di Ventura,Jan Felix Kleuker,Aske Plaat,Thomas Moerland
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Zero-shot reinforcement learning (RL) has emerged as a setting for developing general agents in an unsupervised manner, capable of solving downstream tasks without additional training or planning at test-time. Unlike conventional RL, which optimizes policies for a fixed reward, zero-shot RL requires agents to encode representations rich enough to support immediate adaptation to any objective, drawing parallels to vision and language foundation models. Despite growing interest, the field lacks a common analytical lens. We present the first unified framework for zero-shot RL. Our formulation introduces a consistent notation and taxonomy that organizes existing approaches and allows direct comparison between them. Central to our framework is the classification of algorithms into two families: direct representations, which learn end-to-end mappings from rewards to policies, and compositional representations, which decompose the representation leveraging the substructure of the value function. Within this framework, we highlight shared principles and key differences across methods, and we derive an extended bound for successor-feature methods, offering a new perspective on their performance in the zero-shot regime. By consolidating existing work under a common lens, our framework provides a principled foundation for future research in zero-shot RL and outlines a clear path toward developing more general agents.

[LG-20] SheafAlign: A Sheaf-theoretic Framework for Decentralized Multimodal Alignment

链接: https://arxiv.org/abs/2510.20540
作者: Abdulmomen Ghalkha,Zhuojun Tian,Chaouki Ben Issaid,Mehdi Bennis
类目: Machine Learning (cs.LG)
*备注: 5 pages, 3 figures, 1 table

点击查看摘要

Abstract:Conventional multimodal alignment methods assume mutual redundancy across all modalities, an assumption that fails in real-world distributed scenarios. We propose SheafAlign, a sheaf-theoretic framework for decentralized multimodal alignment that replaces single-space alignment with multiple comparison spaces. This approach models pairwise modality relations through sheaf structures and leverages decentralized contrastive learning-based objectives for training. SheafAlign overcomes the limitations of prior methods by not requiring mutual redundancy among all modalities, preserving both shared and unique information. Experiments on multimodal sensing datasets show superior zero-shot generalization, cross-modal alignment, and robustness to missing modalities, with 50% lower communication cost than state-of-the-art baselines.

[LG-21] Adversary-Aware Private Inference over Wireless Channels

链接: https://arxiv.org/abs/2510.20518
作者: Mohamed Seif,Malcolm Egan,Andrea J. Goldsmith,H. Vincent Poor
类目: Information Theory (cs.IT); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:AI-based sensing at wireless edge devices has the potential to significantly enhance Artificial Intelligence (AI) applications, particularly for vision and perception tasks such as in autonomous driving and environmental monitoring. AI systems rely both on efficient model learning and inference. In the inference phase, features extracted from sensing data are utilized for prediction tasks (e.g., classification or regression). In edge networks, sensors and model servers are often not co-located, which requires communication of features. As sensitive personal data can be reconstructed by an adversary, transformation of the features are required to reduce the risk of privacy violations. While differential privacy mechanisms provide a means of protecting finite datasets, protection of individual features has not been addressed. In this paper, we propose a novel framework for privacy-preserving AI-based sensing, where devices apply transformations of extracted features before transmission to a model server.

[LG-22] Bi-CoG: Bi-Consistency-Guided Self-Training for Vision-Language Models

链接: https://arxiv.org/abs/2510.20477
作者: Rui Zhu,Song-Lin Lv,Zi-Kang Wang,Lan-Zhe Guo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Exploiting unlabeled data through semi-supervised learning (SSL) or leveraging pre-trained models via fine-tuning are two prevailing paradigms for addressing label-scarce scenarios. Recently, growing attention has been given to combining fine-tuning of pre-trained vision-language models (VLMs) with SSL, forming the emerging paradigm of semi-supervised fine-tuning. However, existing methods often suffer from model bias and hyperparameter sensitivity, due to reliance on prediction consistency or pre-defined confidence thresholds. To address these limitations, we propose a simple yet effective plug-and-play methodology named Bi-Consistency-Guided Self-Training (Bi-CoG), which assigns high-quality and low-bias pseudo-labels by simultaneously exploiting inter-model and intra-model consistency, along with an error-aware dynamic pseudo-label assignment strategy. Both theoretical analysis and extensive experiments over 14 datasets demonstrate the effectiveness of Bi-CoG, which consistently and significantly improves the performance of existing methods.

[LG-23] Intransitive Player Dominance and Market Inefficiency in Tennis Forecasting: A Graph Neural Network Approach

链接: https://arxiv.org/abs/2510.20454
作者: Lawrence Clegg,John Cartlidge
类目: Machine Learning (cs.LG)
*备注: 39 pages, 8 figures

点击查看摘要

Abstract:Intransitive player dominance, where player A beats B, B beats C, but C beats A, is common in competitive tennis. Yet, there are few known attempts to incorporate it within forecasting methods. We address this problem with a graph neural network approach that explicitly models these intransitive relationships through temporal directed graphs, with players as nodes and their historical match outcomes as directed edges. We find the bookmaker Pinnacle Sports poorly handles matches with high intransitive complexity and posit that our graph-based approach is uniquely positioned to capture relational dynamics in these scenarios. When selectively betting on higher intransitivity matchups with our model (65.7% accuracy, 0.215 Brier Score), we achieve significant positive returns of 3.26% ROI with Kelly staking over 1903 bets, suggesting a market inefficiency in handling intransitive matchups that our approach successfully exploits.

[LG-24] Explainable Benchmarking through the Lense of Concept Learning

链接: https://arxiv.org/abs/2510.20439
作者: Quannian Zhang,Michael Röder,Nikit Srivastava,N’Dah Jean Kouagou,Axel-Cyrille Ngonga Ngomo
类目: Machine Learning (cs.LG)
*备注: Accepted as full research paper at K-CAP 2025

点击查看摘要

Abstract:Evaluating competing systems in a comparable way, i.e., benchmarking them, is an undeniable pillar of the scientific method. However, system performance is often summarized via a small number of metrics. The analysis of the evaluation details and the derivation of insights for further development or use remains a tedious manual task with often biased results. Thus, this paper argues for a new type of benchmarking, which is dubbed explainable benchmarking. The aim of explainable benchmarking approaches is to automatically generate explanations for the performance of systems in a benchmark. We provide a first instantiation of this paradigm for knowledge-graph-based question answering systems. We compute explanations by using a novel concept learning approach developed for large knowledge graphs called PruneCEL. Our evaluation shows that PruneCEL outperforms state-of-the-art concept learners on the task of explainable benchmarking by up to 0.55 points F1 measure. A task-driven user study with 41 participants shows that in 80% of the cases, the majority of participants can accurately predict the behavior of a system based on our explanations. Our code and data are available at this https URL

[LG-25] Partial Optimality in Cubic Correlation Clustering for General Graphs

链接: https://arxiv.org/abs/2510.20431
作者: David Stein,Bjoern Andres,Silvia Di Gregorio
类目: Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
*备注: 35 pages

点击查看摘要

Abstract:The higher-order correlation clustering problem for a graph G and costs associated with cliques of G consists in finding a clustering of G so as to minimize the sum of the costs of those cliques whose nodes all belong to the same cluster. To tackle this NP-hard problem in practice, local search heuristics have been proposed and studied in the context of applications. Here, we establish partial optimality conditions for cubic correlation clustering, i.e., for the special case of at most 3-cliques. We define and implement algorithms for deciding these conditions and examine their effectiveness numerically, on two data sets.

[LG-26] An Empirical Study of Sample Selection Strategies for Large Language Model Repair

链接: https://arxiv.org/abs/2510.20428
作者: Xuran Li,Jingyi Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed in real-world systems, yet they can produce toxic or biased outputs that undermine safety and trust. Post-hoc model repair provides a practical remedy, but the high cost of parameter updates motivates selective use of repair data. Despite extensive prior work on data selection for model training, it remains unclear which sampling criteria are most effective and efficient when applied specifically to behavioral repair of large generative models. Our study presents a systematic analysis of sample prioritization strategies for LLM repair. We evaluate five representative selection methods, including random sampling, K-Center, gradient-norm-based selection(GraNd), stratified coverage (CCS), and a Semantic-Aware Prioritized Sampling (SAPS) approach we proposed. Repair effectiveness and trade-offs are assessed through toxicity reduction, perplexity on WikiText-2 and LAMBADA, and three composite metrics: the Repair Proximity Score (RPS), the Overall Performance Score (OPS), and the Repair Efficiency Score (RES). Experimental results show that SAPS achieves the best balance between detoxification, utility preservation, and efficiency, delivering comparable or superior repair outcomes with substantially less data. Random sampling remains effective for large or robust models, while high-overhead methods such as CCS and GraNd provide limited benefit. The optimal data proportion depends on model scale and repair method, indicating that sample selection should be regarded as a tunable component of repair pipelines. Overall, these findings establish selection-based repair as an efficient and scalable paradigm for maintaining LLM reliability.

[LG-27] Addressing Mark Imbalance in Integration-free Neural Marked Temporal Point Processes NEURIPS2025

链接: https://arxiv.org/abs/2510.20414
作者: Sishun Liu,Ke Deng,Xiuzhen Zhang,Yongli Ren,Yan Wang
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2025 poster

点击查看摘要

Abstract:Marked Temporal Point Process (MTPP) has been well studied to model the event distribution in marked event streams, which can be used to predict the mark and arrival time of the next event. However, existing studies overlook that the distribution of event marks is highly imbalanced in many real-world applications, with some marks being frequent but others rare. The imbalance poses a significant challenge to the performance of the next event prediction, especially for events of rare marks. To address this issue, we propose a thresholding method, which learns thresholds to tune the mark probability normalized by the mark’s prior probability to optimize mark prediction, rather than predicting the mark directly based on the mark probability as in existing studies. In conjunction with this method, we predict the mark first and then the time. In particular, we develop a novel neural MTPP model to support effective time sampling and estimation of mark probability without computationally expensive numerical improper integration. Extensive experiments on real-world datasets demonstrate the superior performance of our solution against various baselines for the next event mark and time prediction. The code is available at this https URL.
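A small numpy sketch of the prior-normalized, thresholded decision rule described in the abstract; the thresholds below are placeholders standing in for the learned ones:

```python
import numpy as np

def predict_mark(p, prior, thresholds):
    """Predict the event mark by normalizing model probabilities p with
    each mark's prior and comparing against learned thresholds, instead
    of a plain argmax over p (illustrative decision rule)."""
    score = p / prior                        # boosts rare marks
    eligible = score >= thresholds
    if eligible.any():
        return int(np.argmax(np.where(eligible, score, -np.inf)))
    return int(np.argmax(p))                 # fall back to plain argmax

p = np.array([0.70, 0.25, 0.05])             # model's mark probabilities
prior = np.array([0.80, 0.15, 0.05])         # empirical mark frequencies
thresholds = np.ones(3)
print(predict_mark(p, prior, thresholds))    # rare mark 1 wins: 0.25/0.15 > 1
```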

[LG-28] Why DPO is a Misspecified Estimator and How to Fix It

链接: https://arxiv.org/abs/2510.20413
作者: Aditya Gopalan,Sayak Ray Chowdhury,Debangshu Banerjee
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Direct alignment algorithms such as Direct Preference Optimization (DPO) fine-tune models based on preference data, using only supervised learning instead of two-stage reinforcement learning with human feedback (RLHF). We show that DPO encodes a statistical estimation problem over reward functions induced by a parametric policy class. When the true reward function that generates preferences cannot be realized via the policy class, DPO becomes misspecified, resulting in failure modes such as preference order reversal, worsening of policy reward, and high sensitivity to the input preference data distribution. On the other hand, we study the local behavior of two-stage RLHF for a parametric class and relate it to a natural gradient step in policy space. Our fine-grained geometric characterization allows us to propose AuxDPO, which introduces additional auxiliary variables in the DPO loss function to help move towards the RLHF solution in a principled manner and mitigate the misspecification in DPO. We empirically demonstrate the superior performance of AuxDPO on didactic bandit settings as well as LLM alignment tasks.
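For orientation, the standard DPO loss that the paper analyzes (AuxDPO's auxiliary variables are not reproduced here); sequence log-probabilities are assumed to be pre-summed over response tokens:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO: -log sigmoid(beta * (policy margin - reference margin))
    over (chosen, rejected) response pairs."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# toy sequence log-probabilities for 4 preference pairs
logp_w = torch.tensor([-12.0, -9.5, -15.2, -8.1], requires_grad=True)
logp_l = torch.tensor([-11.0, -10.5, -14.0, -9.0])
ref_w = torch.tensor([-12.5, -10.0, -15.0, -8.5])
ref_l = torch.tensor([-10.8, -10.2, -14.1, -8.8])
loss = dpo_loss(logp_w, logp_l, ref_w, ref_l)
loss.backward()              # gradients flow to the policy log-probs
print(loss.item())
```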

[LG-29] PointMapPolicy: Structured Point Cloud Processing for Multi-Modal Imitation Learning

链接: https://arxiv.org/abs/2510.20406
作者: Xiaogang Jia,Qian Wang,Anrui Wang,Han A. Wang,Balázs Gyenes,Emiliyan Gospodinov,Xinkai Jiang,Ge Li,Hongyi Zhou,Weiran Liao,Xi Huang,Maximilian Beck,Moritz Reuss,Rudolf Lioutikov,Gerhard Neumann
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Robotic manipulation systems benefit from complementary sensing modalities, where each provides unique environmental information. Point clouds capture detailed geometric structure, while RGB images provide rich semantic context. Current point cloud methods struggle to capture fine-grained detail, especially for complex tasks, while RGB methods lack geometric awareness, which hinders their precision and generalization. We introduce PointMapPolicy, a novel approach that conditions diffusion policies on structured grids of points without downsampling. The resulting data type makes it easier to extract shape and spatial relationships from observations, and can be transformed between reference frames. Moreover, because the points are structured on a regular grid, we can apply established computer vision techniques directly to the 3D data. Using xLSTM as a backbone, our model efficiently fuses the point maps with RGB data for enhanced multi-modal perception. Through extensive experiments on the RoboCasa and CALVIN benchmarks and real robot evaluations, we demonstrate that our method achieves state-of-the-art performance across diverse manipulation tasks. The overview and demos are available on our project page: this https URL

[LG-30] Hierarchical Time Series Forecasting with Robust Reconciliation

链接: https://arxiv.org/abs/2510.20383
作者: Shuhei Aikawa,Aru Suzuki,Kei Yoshitake,Kanata Teshigawara,Akira Iwabuchi,Ken Kobayashi,Kazuhide Nakata
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper focuses on forecasting hierarchical time-series data, where each higher-level observation equals the sum of its corresponding lower-level time series. In such contexts, the forecast values should be coherent, meaning that the forecast value of each parent series exactly matches the sum of the forecast values of its child series. Existing hierarchical forecasting methods typically generate base forecasts independently for each series and then apply a reconciliation procedure to adjust them so that the resulting forecast values are coherent across the hierarchy. These methods generally derive an optimal reconciliation, using a covariance matrix of the forecast error. In practice, however, the true covariance matrix is unknown and has to be estimated from finite samples in advance. This gap between the true and estimated covariance matrix may degrade forecast performance. To address this issue, we propose a robust optimization framework for hierarchical reconciliation that accounts for uncertainty in the estimated covariance matrix. We first introduce an uncertainty set for the estimated covariance matrix and formulate a reconciliation problem that minimizes the worst-case expected squared error over this uncertainty set. We show that our problem can be cast as a semidefinite optimization problem. Numerical experiments demonstrate that the proposed robust reconciliation method achieved better forecast performance than existing hierarchical forecasting methods, which indicates the effectiveness of integrating uncertainty into the reconciliation process.

[LG-31] Ask a Strong LLM Judge when Your Reward Model is Uncertain NEURIPS2025

链接: https://arxiv.org/abs/2510.20369
作者: Zhenghao Xu,Qin Lu,Qingru Zhang,Liang Qiu,Ilgee Hong,Changlong Yu,Wenlin Yao,Yao Liu,Haoming Jiang,Lihong Li,Hyokun Yun,Tuo Zhao
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2025, 18 pages

点击查看摘要

Abstract:Reward model (RM) plays a pivotal role in reinforcement learning with human feedback (RLHF) for aligning large language models (LLMs). However, classical RMs trained on human preferences are vulnerable to reward hacking and generalize poorly to out-of-distribution (OOD) inputs. By contrast, strong LLM judges equipped with reasoning capabilities demonstrate superior generalization, even without additional training, but incur significantly higher inference costs, limiting their applicability in online RLHF. In this work, we propose an uncertainty-based routing framework that efficiently complements a fast RM with a strong but costly LLM judge. Our approach formulates advantage estimation in policy gradient (PG) methods as pairwise preference classification, enabling principled uncertainty quantification to guide routing. Uncertain pairs are forwarded to the LLM judge, while confident ones are evaluated by the RM. Experiments on RM benchmarks demonstrate that our uncertainty-based routing strategy significantly outperforms random judge calling at the same cost, and downstream alignment results showcase its effectiveness in improving online RLHF.
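A schematic of the routing idea; the paper's uncertainty quantification is more principled, but routing on the reward model's preference-probability margin, as below, conveys the mechanism (all numbers are toy):

```python
import numpy as np

def route_pairs(r_a, r_b, margin=0.1):
    """Pairwise preference probability from RM scores via the Bradley-Terry
    link; low-confidence pairs are escalated to the (expensive) LLM judge."""
    p = 1.0 / (1.0 + np.exp(-(r_a - r_b)))    # P(a preferred over b)
    uncertain = np.abs(p - 0.5) < margin
    return p, uncertain

rng = np.random.default_rng(0)
r_a, r_b = rng.normal(size=100), rng.normal(size=100)
p, to_judge = route_pairs(r_a, r_b)
print(f"{to_judge.mean():.0%} of pairs escalated to the LLM judge")
```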

[LG-32] InvDec: Inverted Decoder for Multivariate Time Series Forecasting with Separated Temporal and Variate Modeling

链接: https://arxiv.org/abs/2510.20302
作者: Yuhang Wang
类目: Machine Learning (cs.LG)
*备注: 23pages, 3 figures

点击查看摘要

Abstract:Multivariate time series forecasting requires simultaneously modeling temporal patterns and cross-variate dependencies. Channel-independent methods such as PatchTST excel at temporal modeling but ignore variable correlations, while pure variate-attention approaches such as iTransformer sacrifice temporal encoding. We propose InvDec (Inverted Decoder), a hybrid architecture that achieves principled separation between temporal encoding and variate-level decoding. InvDec combines a patch-based temporal encoder with an inverted decoder operating on the variate dimension through variate-wise self-attention. We introduce delayed variate embeddings that enrich variable-specific representations only after temporal encoding, preserving temporal feature integrity. An adaptive residual fusion mechanism dynamically balances temporal and variate information across datasets of varying dimensions. Instantiating InvDec with PatchTST yields InvDec-PatchTST. Extensive experiments on seven benchmarks demonstrate significant gains on high-dimensional datasets: 20.9% MSE reduction on Electricity (321 variables), 4.3% improvement on Weather, and 2.7% gain on Traffic compared to PatchTST, while maintaining competitive performance on low-dimensional ETT datasets. Ablation studies validate each component, and analysis reveals that InvDec's advantage grows with dataset dimensionality, confirming that cross-variate modeling becomes critical as the number of variables increases.

[LG-33] Quantifying Distributional Invariance in Causal Subgraph for IRM-Free Graph Generalization

链接: https://arxiv.org/abs/2510.20295
作者: Yang Qiu,Yixiong Zou,Jun Wang,Wei Liu,Xiangyu Fu,Ruixuan Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Out-of-distribution generalization under distributional shifts remains a critical challenge for graph neural networks. Existing methods generally adopt the Invariant Risk Minimization (IRM) framework, requiring costly environment annotations or heuristically generated synthetic splits. To circumvent these limitations, in this work, we aim to develop an IRM-free method for capturing causal subgraphs. We first identify that causal subgraphs exhibit substantially smaller distributional variations than non-causal components across diverse environments, which we formalize as the Invariant Distribution Criterion and theoretically prove in this paper. Building on this criterion, we systematically uncover the quantitative relationship between distributional shift and representation norm for identifying the causal subgraph, and investigate its underlying mechanisms in depth. Finally, we propose an IRM-free method by introducing a norm-guided invariant distribution objective for causal subgraph discovery and prediction. Extensive experiments on two widely used benchmarks demonstrate that our method consistently outperforms state-of-the-art methods in graph generalization.

[LG-34] ResearchGPT : Benchmarking and Training LLM s for End-to-End Computer Science Research Workflows

链接: https://arxiv.org/abs/2510.20279
作者: Penghao Wang,Yuhao Zhou,Mengxuan Wu,Ziheng Qin,Bangyuan Zhu,Shengbin Huang,Xuanlei Zhao,Panpan Zhang,Xiaojiang Peng,Yuzhang Shang,Jianfei Yang,Zheng Zhu,Tianlong Chen,Zhangyang Wang,Kai Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:As large language models (LLMs) advance, the ultimate vision for their role in science is emerging: we could build an AI collaborator to effectively assist human beings throughout the entire scientific research process. We refer to this envisioned system as ResearchGPT. Given that scientific research progresses through multiple interdependent phases, achieving this vision requires rigorous benchmarks that evaluate the end-to-end workflow rather than isolated sub-tasks. To this end, we contribute CS-54k, a high-quality corpus of scientific QA pairs in computer science, built from 14k CC-licensed papers. It is constructed through a scalable, paper-grounded pipeline that combines retrieval-augmented generation (RAG) with multi-stage quality control to ensure factual grounding. From this unified corpus, we derive two complementary subsets: CS-4k, a carefully curated benchmark for evaluating AI’s ability to assist scientific research, and CS-50k, a large-scale training dataset. Extensive experiments demonstrate that CS-4k stratifies state-of-the-art LLMs into distinct capability tiers. Open models trained on CS-50k with supervised training and reinforcement learning demonstrate substantial improvements. Even 7B-scale models, when properly trained, outperform many larger proprietary systems, such as GPT-4.1, GPT-4o, and Gemini 2.5 Pro. This indicates that making AI models better research assistants relies more on domain-aligned training with high-quality data than on pretraining scale or general benchmark performance. We release CS-4k and CS-50k in the hope of fostering AI systems as reliable collaborators in CS research.

[LG-35] KCM: KAN-Based Collaboration Models Enhance Pretrained Large Models

链接: https://arxiv.org/abs/2510.20278
作者: Guangyu Dai,Siliang Tang,Yueting Zhuang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In recent years, researchers of Pretrained Large Models (PLMs) have proposed large-small model collaboration frameworks that leverage easily trainable small models to assist large models, aiming to (1) significantly reduce computational resource consumption while maintaining comparable accuracy, and (2) enhance large model performance in specialized domain tasks. However, this collaborative paradigm suffers from issues such as significant accuracy degradation, exacerbated catastrophic forgetting, and amplified hallucination problems induced by small model knowledge. To address these challenges, we propose a KAN-based Collaborative Model (KCM) as an improved approach to large-small model collaboration. The KAN utilized in KCM represents an alternative neural network architecture distinct from conventional MLPs. Compared to MLPs, KAN offers superior visualizability and interpretability while mitigating catastrophic forgetting. We deployed KCM in large-small model collaborative systems across three scenarios: language, vision, and vision-language cross-modal tasks. The experimental results demonstrate that, compared with pure large model approaches, the large-small model collaboration framework utilizing KCM as the collaborative model significantly reduces the number of large model inference calls while maintaining near-identical task accuracy, thereby substantially lowering computational resource consumption. Concurrently, the KAN-based small collaborative model markedly mitigates catastrophic forgetting, leading to significant accuracy improvements for long-tail data. The results reveal that KCM demonstrates superior performance across all metrics compared to MLP-based small collaborative models (MCM).

[LG-36] SynTSBench: Rethinking Temporal Pattern Learning in Deep Learning Models for Time Series NEURIPS2025

链接: https://arxiv.org/abs/2510.20273
作者: Qitai Tan,Yiyun Chen,Mo Li,Ruiwen Gu,Yilin Su,Xiao-Ping Zhang
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2025

点击查看摘要

Abstract:Recent advances in deep learning have driven rapid progress in time series forecasting, yet many state-of-the-art models continue to struggle with robust performance in real-world applications, even when they achieve strong results on standard benchmark datasets. This persistent gap can be attributed to the black-box nature of deep learning architectures and the inherent limitations of current evaluation frameworks, which frequently lack the capacity to provide clear, quantitative insights into the specific strengths and weaknesses of different models, thereby complicating the selection of appropriate models for particular forecasting scenarios. To address these issues, we propose a synthetic data-driven evaluation paradigm, SynTSBench, that systematically assesses fundamental modeling capabilities of time series forecasting models through programmable feature configuration. Our framework isolates confounding factors and establishes an interpretable evaluation system with three core analytical dimensions: (1) temporal feature decomposition and capability mapping, which enables systematic evaluation of model capacities to learn specific pattern types; (2) robustness analysis under data irregularities, which quantifies noise tolerance thresholds and anomaly recovery capabilities; and (3) theoretical optimum benchmarking, which establishes performance boundaries for each pattern type, enabling direct comparison between model predictions and mathematical optima. Our experiments show that current deep learning models do not universally approach optimal baselines across all types of temporal patterns. The code is available at this https URL

[LG-37] Scalable GPU-Accelerated Euler Characteristic Curves: Optimization and Differentiable Learning for PyTorch NEURIPS2025

链接: https://arxiv.org/abs/2510.20271
作者: Udit Saxena
类目: Machine Learning (cs.LG)
*备注: Extended Abstract: Accepted to the NeurReps 2025 workshop at NeurIPS 2025. 4 pages, 3 figures

点击查看摘要

Abstract:Topological features capture global geometric structure in imaging data, but practical adoption in deep learning requires both computational efficiency and differentiability. We present optimized GPU kernels for the Euler Characteristic Curve (ECC) computation achieving 16-2000× speedups over prior GPU implementations on synthetic grids, and introduce a differentiable PyTorch layer enabling end-to-end learning. Our CUDA kernels, optimized for Ampere GPUs, use 128B-coalesced access and hierarchical shared-memory accumulation. Our PyTorch layer learns thresholds in a single direction via a Differentiable Euler Characteristic Transform-style sigmoid relaxation. We discuss downstream relevance, including applications highlighted by prior ECC work, and outline batching/multi-GPU extensions to broaden adoption.
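
As background for what an ECC is, here is a minimal CPU sketch (not the paper's GPU kernels): threshold a grayscale image at increasing levels and record the Euler number of each superlevel set. `skimage.measure.euler_number` is a real function; the toy image and threshold grid are illustrative assumptions.

```python
import numpy as np
from skimage.measure import euler_number

rng = np.random.default_rng(0)
image = rng.random((64, 64))                     # stand-in for real imaging data

thresholds = np.linspace(0.0, 1.0, 32)
# ECC: Euler number of each superlevel set {pixels >= t}, traced over t
ecc = np.array([euler_number(image >= t, connectivity=2) for t in thresholds])
print(ecc[:8])
```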

[LG-38] Optimistic Task Inference for Behavior Foundation Models

链接: https://arxiv.org/abs/2510.20264
作者: Thomas Rupf,Marco Bagatella,Marin Vlastelica,Andreas Krause
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Behavior Foundation Models (BFMs) are capable of retrieving a high-performing policy for any reward function specified directly at test-time, commonly referred to as zero-shot reinforcement learning (RL). While this is a very efficient process in terms of compute, it can be less so in terms of data: as a standard assumption, BFMs require computing rewards over a non-negligible inference dataset, assuming either access to a functional form of rewards, or significant labeling efforts. To alleviate these limitations, we tackle the problem of task inference purely through interaction with the environment at test-time. We propose OpTI-BFM, an optimistic decision criterion that directly models uncertainty over reward functions and guides BFMs in data collection for task inference. Formally, we provide a regret bound for well-trained BFMs through a direct connection to upper-confidence algorithms for linear bandits. Empirically, we evaluate OpTI-BFM on established zero-shot benchmarks, and observe that it enables successor-features-based BFMs to identify and optimize an unseen reward function in a handful of episodes with minimal compute overhead. Code is available at this https URL.
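
A hedged sketch of the optimistic criterion under the linear-bandit view the abstract mentions: maintain a ridge estimate of a linear reward weight vector and pick the candidate with the highest upper-confidence value. All names (`Phi`, `beta`, `candidates`) and the exact form of the bonus are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def opti_select(Phi, y, candidates, lam=1.0, beta=1.0):
    """Phi: [t, d] features of past queries; y: [t] observed rewards;
    candidates: [m, d] feature vectors of candidate next queries."""
    d = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(d)            # ridge design matrix
    A_inv = np.linalg.inv(A)
    w_hat = A_inv @ Phi.T @ y                    # point estimate of reward weights
    bonus = np.sqrt(np.einsum("id,dk,ik->i", candidates, A_inv, candidates))
    return int(np.argmax(candidates @ w_hat + beta * bonus))  # optimism
```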

[LG-39] FedGPS: Statistical Rectification Against Data Heterogeneity in Federated Learning

链接: https://arxiv.org/abs/2510.20250
作者: Zhiqin Yang,Yonggang Zhang,Chenxin Li,Yiu-ming Cheung,Bo Han,Yixuan Yuan
类目: Machine Learning (cs.LG)
*备注: 35 pages, 15 figures, 21 tables

点击查看摘要

Abstract:Federated Learning (FL) confronts a significant challenge known as data heterogeneity, which impairs model performance and convergence. Existing methods have made notable progress in addressing this issue. However, improving performance in certain heterogeneity scenarios remains an overlooked question: how robust are these methods when deployed under diverse heterogeneity scenarios? To answer this, we conduct comprehensive evaluations across varied heterogeneity scenarios, showing that most existing methods exhibit limited robustness. Meanwhile, insights from these experiments highlight that sharing statistical information can mitigate heterogeneity by enabling clients to update with a global perspective. Motivated by this, we propose FedGPS (Federated Goal-Path Synergy), a novel framework that seamlessly integrates statistical distribution and gradient information from others. Specifically, FedGPS statically modifies each client's learning objective to implicitly model the global data distribution using surrogate information, while dynamically adjusting local update directions with gradient information from other clients at each round. Extensive experiments show that FedGPS outperforms state-of-the-art methods across diverse heterogeneity scenarios, validating its effectiveness and robustness. The code is available at: this https URL.

[LG-40] Layer-to-Layer Knowledge Mixing in Graph Neural Network for Chemical Property Prediction

链接: https://arxiv.org/abs/2510.20236
作者: Teng Jiek See,Daokun Zhang,Mario Boley,David K. Chalmers
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) are currently the most effective methods for predicting molecular properties, but there remains a need for more accurate models. GNN accuracy can be improved by increasing model complexity, but this also increases the computational cost and memory requirement during training and inference. In this study, we develop Layer-to-Layer Knowledge Mixing (LKM), a novel self-knowledge distillation method that increases the accuracy of state-of-the-art GNNs while adding negligible computational complexity during training and inference. By minimizing the mean absolute distance between pre-existing hidden embeddings of GNN layers, LKM efficiently aggregates multi-hop and multi-scale information, enabling improved representation of both local and global molecular features. We evaluated LKM across three diverse GNN architectures (DimeNet++, MXMNet, and PAMNet) on quantum chemical property datasets (QM9, MD17, and Chignolin). We found that the LKM method effectively reduces the mean absolute error of quantum chemical and biophysical property predictions by up to 9.8% (QM9), 45.3% (MD17 Energy), and 22.9% (Chignolin). This work demonstrates the potential of LKM to significantly improve the accuracy of GNNs for chemical property prediction without any substantial increase in training and inference cost.
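
A minimal sketch of the stated objective, minimizing the mean absolute distance between hidden embeddings of GNN layers. Pairing consecutive layers of equal width is our assumption for illustration; the paper's exact pairing scheme may differ.

```python
import torch

def lkm_loss(hidden_states):
    """hidden_states: list of [num_nodes, dim] tensors, one per GNN layer
    (equal widths assumed here)."""
    loss = hidden_states[0].new_zeros(())
    for h_prev, h_next in zip(hidden_states[:-1], hidden_states[1:]):
        loss = loss + (h_prev - h_next).abs().mean()   # mean absolute distance
    return loss / (len(hidden_states) - 1)
```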

[LG-41] Sparse Local Implicit Image Function for sub-km Weather Downscaling

链接: https://arxiv.org/abs/2510.20228
作者: Yago del Valle Inclan Redondo,Enrique Arriaga-Varela,Dmitry Lyamzin,Pablo Cervantes,Tiago Ramalho
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce SpLIIF to generate implicit neural representations and enable arbitrary downscaling of weather variables. We train a model from sparse weather stations and topography over Japan and evaluate in- and out-of-distribution accuracy predicting temperature and wind, comparing it to both an interpolation baseline and CorrDiff. We find the model to be up to 50% better than both CorrDiff and the baseline at downscaling temperature, and around 10-20% better for wind.

[LG-42] Alternatives to the Laplacian for Scalable Spectral Clustering with Group Fairness Constraints

链接: https://arxiv.org/abs/2510.20220
作者: Iván Ojeda-Ruiz,Young Ju-Lee,Malcolm Dickens,Leonardo Cambisaca
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:

点击查看摘要

Abstract:Recent research has focused on mitigating algorithmic bias in clustering by incorporating fairness constraints into algorithmic design. Notions such as disparate impact, community cohesion, and cost per population have been implemented to enforce equitable outcomes. Among these, group fairness (balance) ensures that each protected group is proportionally represented within every cluster. However, incorporating balance as a metric of fairness into spectral clustering algorithms has led to computational times that can be improved. This study aims to enhance the efficiency of spectral clustering algorithms by reformulating the constrained optimization problem using a new formulation derived from the Lagrangian method and the Sherman-Morrison-Woodbury (SMW) identity, resulting in the Fair-SMW algorithm. Fair-SMW employs three alternatives to the Laplacian matrix with different spectral gaps to generate multiple variations of Fair-SMW, achieving clustering solutions with comparable balance to existing algorithms while offering improved runtime performance. We present the results of Fair-SMW, evaluated using the Stochastic Block Model (SBM) to measure both runtime efficiency and balance across real-world network datasets, including LastFM, FacebookNet, Deezer, and German. Fair-SMW achieves computation times up to twice as fast as the state-of-the-art, while remaining flexible enough to achieve twice as much balance.
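
For reference, the Sherman-Morrison-Woodbury identity that gives the algorithm its name, checked numerically on a toy diagonal-plus-low-rank matrix (shapes and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3
A = np.diag(rng.random(n) + 1.0)                 # cheap-to-invert base matrix
U, V = rng.random((n, k)), rng.random((k, n))
C = np.eye(k)

# (A + U C V)^-1 = A^-1 - A^-1 U (C^-1 + V A^-1 U)^-1 V A^-1
A_inv = np.diag(1.0 / np.diag(A))
inner = np.linalg.inv(np.linalg.inv(C) + V @ A_inv @ U)
smw = A_inv - A_inv @ U @ inner @ V @ A_inv

assert np.allclose(smw, np.linalg.inv(A + U @ C @ V))
```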

[LG-43] CO-PFL: Contribution-Oriented Personalized Federated Learning for Heterogeneous Networks

链接: https://arxiv.org/abs/2510.20219
作者: Ke Xing,Yanjie Dong,Xiaoyi Fan,Runhao Zeng,Victor C. M. Leung,M. Jamal Deen,Xiping Hu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Personalized federated learning (PFL) addresses a critical challenge of collaboratively training customized models for clients with heterogeneous and scarce local data. Conventional federated learning, which relies on a single consensus model, proves inadequate under such data heterogeneity. Its standard aggregation method, weighting client updates heuristically or by data volume, operates under an equal-contribution assumption, failing to account for the actual utility and reliability of each client's update. This often results in suboptimal personalization and aggregation bias. To overcome these limitations, we introduce Contribution-Oriented PFL (CO-PFL), a novel algorithm that dynamically estimates each client's contribution for global aggregation. CO-PFL performs a joint assessment by analyzing both gradient direction discrepancies and prediction deviations, leveraging information from gradient and data subspaces. This dual-subspace analysis provides a principled and discriminative aggregation weight for each client, emphasizing high-quality updates. Furthermore, to bolster personalization adaptability and optimization stability, CO-PFL cohesively integrates a parameter-wise personalization mechanism with mask-aware momentum optimization. Our approach effectively mitigates aggregation bias, strengthens global coordination, and enhances local performance by facilitating the construction of tailored submodels with stable updates. Extensive experiments on four benchmark datasets (CIFAR10, CIFAR10C, CINIC10, and Mini-ImageNet) confirm that CO-PFL consistently surpasses state-of-the-art methods in personalization accuracy, robustness, scalability, and convergence stability.

[LG-44] Approximate Replicability in Learning

链接: https://arxiv.org/abs/2510.20200
作者: Max Hopkins,Russell Impagliazzo,Christopher Ye
类目: Machine Learning (cs.LG)
*备注: 51 pages, 1 figure

点击查看摘要

Abstract:Replicability, introduced by (Impagliazzo et al. STOC '22), is the notion that algorithms should remain stable under a resampling of their inputs (given access to shared randomness). While a strong and interesting notion of stability, the cost of replicability can be prohibitive: there is no replicable algorithm, for instance, for tasks as simple as threshold learning (Bun et al. STOC '23). Given such strong impossibility results, we ask: under what approximate notions of replicability is learning possible? In this work, we propose three natural relaxations of replicability in the context of PAC learning: (1) Pointwise: the learner must be consistent on any fixed input, but not across all inputs simultaneously; (2) Approximate: the learner must output hypotheses that classify most of the distribution consistently; (3) Semi: the algorithm is fully replicable, but may additionally use shared unlabeled samples. In all three cases, for constant replicability parameters, we obtain sample-optimal agnostic PAC learners: (1) and (2) are achievable for "free" using \Theta(d/\alpha^2) samples, while (3) requires \Theta(d^2/\alpha^2) labeled samples.

[LG-45] Risk-Averse Constrained Reinforcement Learning with Optimized Certainty Equivalents

链接: https://arxiv.org/abs/2510.20199
作者: Jane H. Lee,Baturay Saglam,Spyridon Pougkakiotis,Amin Karbasi,Dionysis Kalogerias
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Constrained optimization provides a common framework for dealing with conflicting objectives in reinforcement learning (RL). In most of these settings, the objectives (and constraints) are expressed through the expected accumulated reward. However, this formulation neglects risky or even possibly catastrophic events at the tails of the reward distribution, and is often insufficient for high-stakes applications in which the risk involved in outliers is critical. In this work, we propose a framework for risk-aware constrained RL, which exhibits per-stage robustness properties jointly in reward values and time using optimized certainty equivalents (OCEs). Our framework ensures an exact equivalent to the original constrained problem within a parameterized strong Lagrangian duality framework under appropriate constraint qualifications, and yields a simple algorithmic recipe which can be wrapped around standard RL solvers, such as PPO. Lastly, we establish the convergence of the proposed algorithm under common assumptions, and verify the risk-aware properties of our approach through several numerical experiments.
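
As a worked example of an optimized certainty equivalent: CVaR, a standard OCE instance, can be evaluated in its Rockafellar-Uryasev variational form by a simple grid search over the scalar shift. The loss samples and grid resolution here are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
losses = rng.normal(0.0, 1.0, 20_000)            # synthetic loss samples
alpha = 0.05

# CVaR_a(L) = inf over l of { l + E[(L - l)_+] / a }, an OCE instance
grid = np.linspace(losses.min(), losses.max(), 401)
objective = grid + np.maximum(losses[None, :] - grid[:, None], 0).mean(axis=1) / alpha
print(f"CVaR_{alpha}: {objective.min():.3f}")    # about 2.06 for a standard normal
```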

[LG-46] Empowering Targeted Neighborhood Search via Hyper Tour for Large-Scale TSP

链接: https://arxiv.org/abs/2510.20169
作者: Tongkai Lu,Shuai Ma,Chongyang Tao
类目: Machine Learning (cs.LG)
*备注: 12 pages

点击查看摘要

Abstract:Traveling Salesman Problem (TSP) is a classic NP-hard problem that has garnered significant attention from both academia and industry. While neural-based methods have shown promise for solving TSPs, they still face challenges in scaling to larger instances, particularly in memory constraints associated with global heatmaps, edge weights, or access matrices, as well as in generating high-quality initial solutions and insufficient global guidance for efficiently navigating vast search spaces. To address these challenges, we propose a Hyper Tour Guided Neighborhood Search (HyperNS) method for large-scale TSP instances. Inspired by the "clustering first, route second" strategy, our approach initially divides the TSP instance into clusters using a sparse heatmap graph and abstracts them as supernodes, followed by the generation of a hyper tour to guide both the initialization and optimization processes. This method reduces the search space by focusing on edges relevant to the hyper tour, leading to more efficient and effective optimization. Experimental results on both synthetic and real-world datasets demonstrate that our approach outperforms existing neural-based methods, particularly in handling larger-scale instances, offering a significant reduction in the gap to the optimal solution.

[LG-47] ADP-VRSGP: Decentralized Learning with Adaptive Differential Privacy via Variance-Reduced Stochastic Gradient Push

链接: https://arxiv.org/abs/2510.20157
作者: Xiaoming Wu,Teng Liu,Xin Wang,Ming Yang,Jiguo Yu
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:

点击查看摘要

Abstract:Differential privacy is widely employed in decentralized learning to safeguard sensitive data by introducing noise into model updates. However, existing approaches that use fixed-variance noise often degrade model performance and reduce training efficiency. To address these limitations, we propose a novel approach called decentralized learning with adaptive differential privacy via variance-reduced stochastic gradient push (ADP-VRSGP). This method dynamically adjusts both the noise variance and the learning rate using a stepwise-decaying schedule, which accelerates training and enhances final model performance while providing node-level personalized privacy guarantees. To counteract the slowed convergence caused by large-variance noise in early iterations, we introduce a progressive gradient fusion strategy that leverages historical gradients. Furthermore, ADP-VRSGP incorporates decentralized push-sum and aggregation techniques, making it particularly suitable for time-varying communication topologies. Through rigorous theoretical analysis, we demonstrate that ADP-VRSGP achieves robust convergence with an appropriate learning rate, significantly improving training stability and speed. Experimental results validate that our method outperforms existing baselines across multiple scenarios, highlighting its efficacy in addressing the challenges of privacy-preserving decentralized learning.
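
A minimal sketch of a stepwise-decaying schedule of the kind described, applied to both the DP noise standard deviation and the learning rate; the decay factor and interval are illustrative assumptions, not the paper's exact schedule.

```python
def stepwise(initial, step, decay=0.5, every=100):
    """Piecewise-constant decay: halve the value every `every` iterations."""
    return initial * decay ** (step // every)

sigma_t = stepwise(1.0, step=250)   # DP noise std at iteration 250 -> 0.25
lr_t = stepwise(0.1, step=250)      # learning rate decays on the same grid
```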

[LG-48] Understanding Mechanistic Role of Structural and Functional Connectivity in Tau Propagation Through Multi-Layer Modeling

链接: https://arxiv.org/abs/2510.20148
作者: Tingting Dan,Xinwei Huang,Jiaqi Ding,Yinggang Zheng,Guorong Wu
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS); Medical Physics (physics.med-ph)
*备注: 42 pages, 14 figures, 64 references

点击查看摘要

Abstract:Emerging neuroimaging evidence shows that pathological tau proteins build up along specific brain networks, suggesting that large-scale network architecture plays a key role in the progression of Alzheimer’s disease (AD). However, how structural connectivity (SC) and functional connectivity (FC) interact to influence tau propagation remains unclear. Leveraging an unprecedented volume of longitudinal neuroimaging data, we examine SC-FC interactions through a multi-layer graph diffusion model. Beyond showing that connectome architecture constrains tau spread, our model reveals a regionally asymmetric contribution of SC and FC. Specifically, FC predominantly drives tau spread in subcortical areas, the insula, frontal and temporal cortices, whereas SC plays a larger role in occipital, parietal, and limbic regions. The relative dominance of SC versus FC shifts over the course of disease, with FC generally prevailing in early AD and SC becoming primary in later stages. Spatial patterns of SC- and FC-dominant regions strongly align with the regional expression of AD-associated genes involved in inflammation, apoptosis, and lysosomal function, including CHUK (IKK-alpha), TMEM106B, MCL1, NOTCH1, and TH. In parallel, other non-modifiable risk factors (e.g., APOE genotype, sex) and biological mechanisms (e.g., amyloid deposition) selectively reshape tau propagation by shifting dominant routes between anatomical and functional pathways in a region-specific manner. Findings are validated in an independent AD cohort.

[LG-49] There is No “apple” in Timeseries: Rethinking TSFM through the Lens of Invariance

链接: https://arxiv.org/abs/2510.20119
作者: Arian Prabowo,Flora D. Salim
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Timeseries foundation models (TSFMs) have multiplied, yet lightweight supervised baselines and even classical models often match them. We argue this gap stems from the naive importation of NLP or CV pipelines. In language and vision, large web-scale corpora densely capture human concepts, i.e., there are countless images and texts of apples. In contrast, timeseries data is built to complement the image and text modalities; there is no timeseries dataset that contains the concept "apple". As a result, the scrape-everything-online paradigm fails for TS. We posit that progress demands a shift from opportunistic aggregation to principled design: constructing datasets that systematically span the space of invariances that preserve temporal semantics. To this end, we suggest that the ontology of timeseries invariances should be built based on first principles. Only by ensuring representational completeness through invariance coverage can TSFMs achieve the aligned structure necessary for generalisation, reasoning, and truly emergent behaviour.

[LG-50] AsyncHZP: Hierarchical ZeRO Parallelism with Asynchronous Scheduling for Scalable LLM Training

链接: https://arxiv.org/abs/2510.20111
作者: Huawei Bai,Yifan Huang,Wenqi Shi,Ansheng You,Feifan Shao,Tengfei Han,Minghui Yu
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 14 pages, 5 figures, tech report

点击查看摘要

Abstract:The training efficiency and scalability of language models on massive clusters currently remain a critical bottleneck. Mainstream approaches like ND parallelism are often cumbersome and complex, while flexible alternatives such as the Zero Redundancy Optimizer (ZeRO) are frequently hampered by communication overhead. In this paper, we propose Asynchronous Hierarchical Zero Parallelism (AsyncHZP), a novel asynchronous variant of ZeRO designed to achieve superior performance while maintaining simplicity and memory efficiency. Unlike traditional ZeRO, which employs over-fine-grained sharding that can lead to inefficient communication, AsyncHZP adaptively reshards parameters, gradients, and optimizer states across different replica groups. This strategy optimizes device memory utilization and significantly reduces communication overhead. In addition, we also design a multi-stream asynchronous scheduling method that executes parameter all-gather and gradient reduce-scatter operations in dedicated background threads, effectively overlapping communication with computation while incurring negligible memory fragmentation. Empirical evaluations on both Dense and Mixture-of-Experts (MoE) models confirm that AsyncHZP maintains robust stability at scale. It consistently outperforms classic ND parallelism, achieving state-of-the-art performance without complex strategic tuning, thereby simplifying the path to efficient large-scale training.

[LG-51] On pattern classification with weighted dimensions

链接: https://arxiv.org/abs/2510.20107
作者: Ayatullah Faruk Mollah
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Studies on various facets of pattern classification are often imperative when working with multi-dimensional samples from diverse application scenarios. In this context, weighted dimension-based distance measures have been one of the vital considerations in pattern analysis, as they reflect the degree of similarity between samples. Though the matter is often presumed settled by the pervasive use of Euclidean distance, a plethora of issues often surface. In this paper, we present (a) a detailed analysis of the impact of distance-measure norms and dimension weights, along with visualization; (b) a novel weighting scheme for each dimension; (c) incorporation of this dimensional weighting schema into a KNN classifier; and (d) pattern classification on a variety of synthetic as well as realistic datasets with the developed model. It performs well across diverse experiments in comparison to traditional KNN under the same experimental setups. Specifically, for gene expression datasets, it yields significant and consistent gains in classification accuracy (around 10%) in all cross-validation experiments with different values of k. As such datasets contain a limited number of high-dimensional samples, meaningful selection of nearest neighbours is desirable; this requirement is reasonably met by regulating the shape and size of the region enclosing the k reference samples via the developed weighting schema and an appropriate norm. It therefore stands as an important generalization of the KNN classifier, powered by a weighted Minkowski distance with the present weighting schema.
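
A compact sketch of KNN under a weighted Minkowski distance, d(x, y) = (sum_j w_j |x_j - y_j|^p)^(1/p). The weighting scheme itself is the paper's contribution, so `weights` is left as an input here.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x, weights, p=2, k=5):
    """Classify x by majority vote among its k nearest training samples
    under the weighted Minkowski distance."""
    d = (weights * np.abs(X_train - x) ** p).sum(axis=1) ** (1.0 / p)
    nearest = np.argsort(d)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```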

[LG-52] Competition is the key: A Game Theoretic Causal Discovery Approach

链接: https://arxiv.org/abs/2510.20106
作者: Amartya Roy,Souvik Chakraborty
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Causal discovery remains a central challenge in machine learning, yet existing methods face a fundamental gap: algorithms like GES and GraN-DAG achieve strong empirical performance but lack finite-sample guarantees, while theoretically principled approaches fail to scale. We close this gap by introducing a game-theoretic reinforcement learning framework for causal discovery, where a DDQN agent directly competes against a strong baseline (GES or GraN-DAG), always warm-starting from the opponent's solution. This design yields three provable guarantees: the learned graph is never worse than the opponent, warm-starting strictly accelerates convergence, and most importantly, with high probability the algorithm selects the true best candidate graph. To the best of our knowledge, our result makes first-of-its-kind progress toward explaining such finite-sample guarantees in causal discovery: on synthetic SEMs (30 nodes), the observed error probability decays with n, tightly matching theory. On real-world benchmarks including Sachs, Asia, Alarm, Child, Hepar2, Dream, and Andes, our method consistently improves upon GES and GraN-DAG while remaining theoretically safe. Remarkably, it scales to large graphs such as Hepar2 (70 nodes), Dream (100 nodes), and Andes (220 nodes). Together, these results establish a new class of RL-based causal discovery algorithms that are simultaneously provably consistent, sample-efficient, and practically scalable, marking a decisive step toward unifying empirical performance with rigorous finite-sample theory.

[LG-53] Hierarchical Dual-Head Model for Suicide Risk Assessment via MentalRoBERTa

链接: https://arxiv.org/abs/2510.20085
作者: Chang Yang,Ziyi Wang,Wangfeng Tan,Zhiting Tan,Changrui Ji,Zhiming Zhou
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 9 pages, 7 figures, 2 tables, 2025 IEEE International Conference on Big Data

点击查看摘要

Abstract:Social media platforms have become important sources for identifying suicide risk, but automated detection systems face multiple challenges including severe class imbalance, temporal complexity in posting patterns, and the dual nature of risk levels as both ordinal and categorical. This paper proposes a hierarchical dual-head neural network based on MentalRoBERTa for suicide risk classification into four levels: indicator, ideation, behavior, and attempt. The model employs two complementary prediction heads operating on a shared sequence representation: a CORAL (Consistent Rank Logits) head that preserves ordinal relationships between risk levels, and a standard classification head that enables flexible categorical distinctions. A 3-layer Transformer encoder with 8-head multi-head attention models temporal dependencies across post sequences, while explicit time interval embeddings capture posting behavior dynamics. The model is trained with a combined loss function (0.5 CORAL + 0.3 Cross-Entropy + 0.2 Focal Loss) that simultaneously addresses ordinal structure preservation, overconfidence reduction, and class imbalance. To improve computational efficiency, we freeze the first 6 layers (50%) of MentalRoBERTa and employ mixed-precision training. The model is evaluated using 5-fold stratified cross-validation with macro F1 score as the primary metric.

[LG-54] Coupled Transformer Autoencoder for Disentangling Multi-Region Neural Latent Dynamics

链接: https://arxiv.org/abs/2510.20068
作者: Ram Dyuthi Sristi,Sowmya Manojna Narasimha,Jingya Huang,Alice Despatin,Simon Musall,Vikash Gilja,Gal Mishne
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Simultaneous recordings from thousands of neurons across multiple brain areas reveal rich mixtures of activity that are shared between regions and dynamics that are unique to each region. Existing alignment or multi-view methods neglect temporal structure, whereas dynamical latent variable models capture temporal dependencies but are usually restricted to a single area, assume linear read-outs, or conflate shared and private signals. We introduce the Coupled Transformer Autoencoder (CTAE) - a sequence model that addresses both (i) non-stationary, non-linear dynamics and (ii) separation of shared versus region-specific structure in a single framework. CTAE employs transformer encoders and decoders to capture long-range neural dynamics and explicitly partitions each region’s latent space into orthogonal shared and private subspaces. We demonstrate the effectiveness of CTAE on two high-density electrophysiology datasets with simultaneous recordings from multiple regions, one from motor cortical areas and the other from sensory areas. CTAE extracts meaningful representations that better decode behavioral variables compared to existing approaches.

[LG-55] A Multi-Layer Machine Learning and Econometric Pipeline for Forecasting Market Risk: Evidence from Cryptoasset Liquidity Spillovers

链接: https://arxiv.org/abs/2510.20066
作者: Yimeng Qiu,Feihuang Fang
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Econometrics (econ.EM)
*备注:

点击查看摘要

Abstract:We study whether liquidity and volatility proxies of a core set of cryptoassets generate spillovers that forecast market-wide risk. Our empirical framework integrates three statistical layers: (A) interactions between core liquidity and returns, (B) principal-component relations linking liquidity and returns, and (C) volatility-factor projections that capture cross-sectional volatility crowding. The analysis is complemented by vector autoregression impulse responses and forecast error variance decompositions (see Granger 1969; Sims 1980), heterogeneous autoregressive models with exogenous regressors (HAR-X, Corsi 2009), and a leakage-safe machine learning protocol using temporal splits, early stopping, validation-only thresholding, and SHAP-based interpretation. Using daily data from 2021 to 2025 (1462 observations across 74 assets), we document statistically significant Granger-causal relationships across layers and moderate out-of-sample predictive accuracy. We report the most informative figures, including the pipeline overview, Layer A heatmap, Layer C robustness analysis, vector autoregression variance decompositions, and the test-set precision-recall curve. Full data and figure outputs are provided in the artifact repository.

[LG-56] Not-a-Bandit: Provably No-Regret Drafter Selection in Speculative Decoding for LLM s

链接: https://arxiv.org/abs/2510.20064
作者: Hongyi Liu,Jiaji Huang,Zhen Jia,Youngsuk Park,Yu-Xiang Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Speculative decoding is widely used in accelerating large language model (LLM) inference. In this work, we focus on the online draft model selection problem in speculative decoding. We design an algorithm that provably competes with the best draft model in hindsight for each query in terms of either the token acceptance probability or expected acceptance length. In particular, we show that we can accurately evaluate all draft models, instead of only the chosen model, without incurring additional queries to the target model, which allows us to improve exponentially over the existing bandit-based approach as the number of draft models increases. Our approach is generically applicable to any speculative decoding method (single-draft, multi-draft, and draft-tree). Moreover, we design system-efficient versions of online learners and demonstrate that the overhead in computation and latency can be substantially reduced. We conduct extensive experiments on open-source LLMs and diverse datasets, demonstrating that our methods substantially outperform the state-of-the-art EAGLE3 and the BanditSpec baseline in a variety of domains where specialized domain-expert drafters are available, especially when long reasoning chains are required.
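
For context, the standard speculative-decoding acceptance test that any drafter selection builds on: a drafted token is kept with probability min(1, p_target / p_draft). This is the textbook primitive only, not the paper's no-regret selection algorithm.

```python
import numpy as np

def accept_token(p_target_t, p_draft_t, rng):
    """Keep a drafted token with probability min(1, p_target / p_draft);
    on rejection, the target model resamples from a residual distribution."""
    return rng.random() < min(1.0, p_target_t / p_draft_t)

rng = np.random.default_rng(0)
print(accept_token(0.30, 0.60, rng))   # accepted with probability 0.5
```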

[LG-57] Learning Personalized Ad Impact via Contextual Reinforcement Learning under Delayed Rewards

链接: https://arxiv.org/abs/2510.20055
作者: Yuwei Cheng,Zifeng Zhao,Haifeng Xu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Online advertising platforms use automated auctions to connect advertisers with potential customers, requiring effective bidding strategies to maximize profits. Accurate ad impact estimation requires considering three key factors: delayed and long-term effects, cumulative ad impacts such as reinforcement or fatigue, and customer heterogeneity. However, these effects are often not jointly addressed in previous studies. To capture these factors, we model ad bidding as a Contextual Markov Decision Process (CMDP) with delayed Poisson rewards. For efficient estimation, we propose a two-stage maximum likelihood estimator combined with data-splitting strategies, ensuring controlled estimation error based on the first-stage estimator's (in)accuracy. Building on this, we design a reinforcement learning algorithm to derive efficient personalized bidding strategies. This approach achieves a near-optimal regret bound of \tilde{O}(dH^2\sqrt{T}), where d is the contextual dimension, H is the number of rounds, and T is the number of customers. Our theoretical findings are validated by simulation experiments.

[LG-58] Speculative Sampling for Parametric Temporal Point Processes

链接: https://arxiv.org/abs/2510.20031
作者: Marin Biloš,Anderson Schneider,Yuriy Nevmyvaka
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Temporal point processes are powerful generative models for event sequences that capture complex dependencies in time-series data. They are commonly specified using autoregressive models that learn the distribution of the next event from the previous events. This makes sampling inherently sequential, limiting efficiency. In this paper, we propose a novel algorithm based on rejection sampling that enables exact sampling of multiple future values from existing TPP models, in parallel, and without requiring any architectural changes or retraining. Besides theoretical guarantees, our method demonstrates empirical speedups on real-world datasets, bridging the gap between expressive modeling and efficient parallel generation for large-scale TPP applications.
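
As background, the classic thinning algorithm is the textbook rejection sampler for point processes: propose from a dominating Poisson process and accept with the intensity ratio. The paper's contribution extends rejection ideas to exact parallel multi-event sampling from learned TPPs; this sketch shows only the primitive, with a toy intensity.

```python
import numpy as np

def thinning(lam, lam_max, T, rng=np.random.default_rng(0)):
    """Sample events on [0, T] from intensity lam(t) <= lam_max by
    proposing from a rate-lam_max Poisson process and rejecting."""
    t, events = 0.0, []
    while True:
        t += rng.exponential(1.0 / lam_max)      # next candidate arrival
        if t > T:
            return np.array(events)
        if rng.random() < lam(t) / lam_max:      # accept with ratio lam(t)/lam_max
            events.append(t)

events = thinning(lambda t: 2.0 + np.sin(t), lam_max=3.0, T=10.0)
```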

[LG-59] SALT: Step-level Advantage Assignment for Long-horizon Agents via Trajectory Graph

链接: https://arxiv.org/abs/2510.20022
作者: Jiazheng Li,Yawei Wang,David Yan,Yijun Tian,Zhichao Xu,Huan Song,Panpan Xu,Lin Lee Cheong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities, enabling language agents to excel at single-turn tasks. However, their application to complex, multi-step, and long-horizon tasks remains challenging. While reinforcement learning (RL) offers a promising avenue for addressing these challenges, mainstream approaches typically rely solely on sparse, outcome-based rewards, a limitation that becomes especially problematic for group-based RL algorithms lacking critic models, such as Group Relative Policy Optimization (GRPO). In such methods, uniformly rewarding or penalizing all actions within a trajectory can lead to training instability and suboptimal policies, because beneficial and detrimental actions are often entangled across multi-step interactions. To address this challenge, we propose SALT, a novel and lightweight framework that provides a finer-grained advantage assignment, derived solely from outcome rewards. We achieve this by constructing a graph from trajectories of the same prompt, which allows us to quantify the quality of each step and assign advantages accordingly. Crucially, SALT is designed as a plug-and-play module that seamlessly integrates with existing group-based RL algorithms, requiring no modifications to the rollout procedure and introducing negligible computational overhead. Extensive experiments on the WebShop, ALFWorld, and AppWorld benchmarks with various model sizes demonstrate that SALT consistently improves performance. We also conduct a thorough analysis to validate the design choices behind SALT and offer actionable insights.

[LG-60] Machine Learning-Based Localization Accuracy of RFID Sensor Networks via RSSI Decision Trees and CAD Modeling for Defense Applications

链接: https://arxiv.org/abs/2510.20019
作者: Curtis Lee Shull,Merrick Green
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: 10 pages, 5 figures. Submitted to the Journal of Defense Modeling and Simulation (JDMS) for the Special Issue Integrating AI/ML Into Modeling and Simulation (J22-4). This work evaluates machine learning-based RFID localization for defense logistics environments using CAD-modeled simulations and RSSI-driven decision tree classification

点击查看摘要

Abstract:Radio Frequency Identification (RFID) tracking may be a viable solution for defense assets that must be stored in accordance with security guidelines. However, poor sensor specificity (vulnerabilities include long-range detection, spoofing, and counterfeiting) can lead to erroneous detection and operational security events. We present a supervised learning simulation with realistic Received Signal Strength Indicator (RSSI) data and Decision Tree classification in a Computer Assisted Design (CAD)-modeled floor plan that encapsulates some of the challenges encountered in defense storage. In this work, we focused on classifying 12 lab zones (LabZoneA-L) to perform location inference. The raw dataset had approximately 980,000 reads. Class frequencies were imbalanced, and class weights were calculated to account for class imbalance in this multi-class setting. The model, trained on a stratified subsample of 5,000 balanced observations, yielded an overall accuracy of 34.2% and F1-scores greater than 0.40 for multiple zones (Zones F, G, H, etc.). However, rare classes (most notably LabZoneC) were often misclassified, even with the use of class weights. An adjacency-aware confusion matrix was calculated to allow better interpretation of physically adjacent zones. These results suggest that RSSI-based decision trees can be applied in realistic simulations to enable zone-level anomaly detection or misplacement monitoring for defense supply logistics. Reliable classification performance in low-coverage and low-signal zones could be improved with better antenna placement or additional sensors and sensor fusion with other modalities.
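
A minimal sketch of the zone classifier described above: a decision tree over RSSI features with balanced class weights to offset zone imbalance. The arrays are placeholders for the simulated reads; the paper's stratified subsampling and adjacency-aware evaluation are not reproduced.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fit_zone_classifier(rssi_features, zone_labels):
    """rssi_features: [n, num_readers] RSSI matrix; zone_labels: [n] zone ids."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        rssi_features, zone_labels, stratify=zone_labels, random_state=0)
    clf = DecisionTreeClassifier(class_weight="balanced").fit(X_tr, y_tr)
    return clf, clf.score(X_te, y_te)            # classifier and test accuracy
```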

[LG-61] No Compute Left Behind: Rethinking Reasoning and Sampling with Masked Diffusion Models

链接: https://arxiv.org/abs/2510.19990
作者: Zachary Horvitz,Raghav Singhal,Hao Zou,Carles Domingo-Enrich,Zhou Yu,Rajesh Ranganath,Kathleen McKeown
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Masked diffusion language models (MDLMs) are trained to in-fill positions in randomly masked sequences, in contrast to next-token prediction models. Discussions around MDLMs focus on two benefits: (1) any-order decoding and (2) multi-token decoding. However, we observe that for math and coding tasks, any-order algorithms often underperform or behave similarly to left-to-right sampling, and standard multi-token decoding significantly degrades performance. At inference time, MDLMs compute the conditional distribution of all masked positions. A natural question is: How can we justify this additional compute when left-to-right one-token-at-a-time decoding is on par with any-order decoding algorithms? First, we propose reasoning-as-infilling. By using MDLMs to infill a reasoning template, we can structure outputs and distinguish between reasoning and answer tokens. In turn, this enables measuring answer uncertainty during reasoning, and early exits when the model converges on an answer. Next, given an answer, reasoning-as-infilling enables sampling from the MDLM posterior over reasoning traces conditioned on the answer, providing a new source of high-quality data for post-training. On GSM8k, we observe that fine-tuning LLaDA-8B Base on its posterior reasoning traces provides a performance boost on par with fine-tuning on human-written reasoning traces. Additionally, given an answer, reasoning-as-infilling provides a method for scoring the correctness of the reasoning process at intermediate steps. Second, we propose multi-token entropy decoding (MED), a simple adaptive sampler that minimizes the error incurred by decoding positions in parallel based on the conditional entropies of those positions. MED preserves performance across benchmarks and leads to 2.7x fewer steps. Our work demonstrates that the training and compute used by MDLMs unlock many new inference and post-training methods.
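
A hedged sketch of the multi-token entropy decoding idea: in each denoising step, unmask every masked position whose conditional entropy is below a threshold, falling back to the single most confident position so decoding always progresses. Tensor names and the threshold value are illustrative assumptions.

```python
import torch

def med_step(logits, masked, eps=0.1):
    """logits: [seq, vocab]; masked: bool [seq]. Returns positions to unmask."""
    probs = logits.softmax(-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
    entropy = entropy.masked_fill(~masked, float("inf"))   # ignore filled slots
    low = (entropy < eps) & masked
    if not low.any():                            # always decode at least one token
        low[entropy.argmin()] = True
    return low
```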

[LG-62] Abstain Mask Retain Core: Time Series Prediction by Adaptive Masking Loss with Representation Consistency NEURIPS2025

链接: https://arxiv.org/abs/2510.19980
作者: Renzhao Liang,Sizhe Xu,Chenggang Xie,Jingru Chen,Feiyang Ren,Shu Yang,Takahiro Yabe
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: 20 pages, 4 figures. Accepted as Spotlight poster in NeurIPS 2025

点击查看摘要

Abstract:Time series forecasting plays a pivotal role in critical domains such as energy management and financial markets. Although deep learning-based approaches (e.g., MLP, RNN, Transformer) have achieved remarkable progress, the prevailing “long-sequence information gain hypothesis” exhibits inherent limitations. Through systematic experimentation, this study reveals a counterintuitive phenomenon: appropriately truncating historical data can paradoxically enhance prediction accuracy, indicating that existing models learn substantial redundant features (e.g., noise or irrelevant fluctuations) during training, thereby compromising effective signal extraction. Building upon information bottleneck theory, we propose an innovative solution termed Adaptive Masking Loss with Representation Consistency (AMRC), which features two core components: 1) a dynamic masking loss, which adaptively identifies highly discriminative temporal segments to guide gradient descent during model training; 2) a representation consistency constraint, which stabilizes the mapping relationships among inputs, labels, and predictions. Experimental results demonstrate that AMRC effectively suppresses redundant feature learning while significantly improving model performance. This work not only challenges conventional assumptions in temporal modeling but also provides novel theoretical insights and methodological breakthroughs for developing efficient and robust forecasting models.

[LG-63] SecureInfer: Heterogeneous TEE-GPU Architecture for Privacy-Critical Tensors for Large Language Model Deployment

链接: https://arxiv.org/abs/2510.19979
作者: Tushar Nayan(1),Ziqi Zhang(2),Ruimin Sun(1) ((1) Florida International University, (2) University of Illinois Urbana-Champaign)
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注: Accepted at IEEE Intelligent Computing and Systems at the Edge (ICEdge) 2025

点击查看摘要

Abstract:With the increasing deployment of Large Language Models (LLMs) on mobile and edge platforms, securing them against model extraction attacks has become a pressing concern. However, protecting model privacy without sacrificing the performance benefits of untrusted AI accelerators, such as GPUs, presents a challenging trade-off. In this paper, we initiate the study of secure high-performance execution for LLMs and present SecureInfer, a hybrid framework that leverages a heterogeneous Trusted Execution Environment (TEE)-GPU architecture to isolate privacy-critical components while offloading compute-intensive operations to untrusted accelerators. Building upon an outsourcing scheme, SecureInfer adopts an information-theoretic and threat-informed partitioning strategy: security-sensitive components, including non-linear layers, attention head projections, FNN transformations, and LoRA adapters, are executed inside an SGX enclave, while other linear operations (matrix multiplications) are performed on the GPU after encryption and are securely restored within the enclave. We implement a prototype of SecureInfer using the LLaMA-2 model and evaluate it across performance and security metrics. Our results show that SecureInfer offers strong security guarantees with reasonable performance, offering a practical solution for secure on-device model inference.

[LG-64] owards Strong Certified Defense with Universal Asymmetric Randomization

链接: https://arxiv.org/abs/2510.19977
作者: Hanbin Hong,Ashish Kundu,Ali Payani,Binghui Wang,Yuan Hong
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Accepted by CSF 2026, 39th IEEE Computer Security Foundations Symposium

点击查看摘要

Abstract:Randomized smoothing has become essential for achieving certified adversarial robustness in machine learning models. However, current methods primarily use isotropic noise distributions that are uniform across all data dimensions, such as image pixels, limiting the effectiveness of robustness certification by ignoring the heterogeneity of inputs and data dimensions. To address this limitation, we propose UCAN: a novel technique that Universally Certifies adversarial robustness with Anisotropic Noise. UCAN is designed to enhance any existing randomized smoothing method, transforming it from symmetric (isotropic) to asymmetric (anisotropic) noise distributions, thereby offering a more tailored defense against adversarial attacks. Our theoretical framework is versatile, supporting a wide array of noise distributions for certified robustness in different \ell_p-norms and applicable to any arbitrary classifier by guaranteeing the classifier's prediction over perturbed inputs with provable robustness bounds through tailored noise injection. Additionally, we develop a novel framework equipped with three exemplary noise parameter generators (NPGs) to optimally fine-tune the anisotropic noise parameters for different data dimensions, allowing for pursuing different levels of robustness enhancements in practice. Evaluations underscore the significant leap in UCAN's performance over existing state-of-the-art methods, demonstrating up to 182.6% improvement in certified accuracy at large certified radii on MNIST, CIFAR10, and ImageNet datasets. Code is anonymously available at this https URL.
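
A minimal sketch of prediction under anisotropic smoothing: a per-dimension noise scale instead of a single scalar sigma, with a majority vote over noisy copies. The noise parameter generators (NPGs) that choose `sigma`, and the certification step itself, are the paper's contribution and are not reproduced here.

```python
import torch

def smoothed_predict(f, x, sigma, n=1000):
    """f maps a batch of inputs to predicted class ids; sigma has the same
    shape as x, giving one noise scale per dimension (anisotropic)."""
    noise = torch.randn(n, *x.shape) * sigma     # per-dimension Gaussian noise
    votes = f(x.unsqueeze(0) + noise)            # [n] predicted classes
    return torch.mode(votes).values              # majority-vote prediction
```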

[LG-65] Are Greedy Task Orderings Better Than Random in Continual Linear Regression? NEURIPS2025

链接: https://arxiv.org/abs/2510.19941
作者: Matan Tsipory,Ran Levinstein,Itay Evron,Mark Kong,Deanna Needell,Daniel Soudry
类目: Machine Learning (cs.LG)
*备注: Accepted to NeurIPS 2025

点击查看摘要

Abstract:We analyze task orderings in continual learning for linear regression, assuming joint realizability of training data. We focus on orderings that greedily maximize dissimilarity between consecutive tasks, a concept briefly explored in prior work but still surrounded by open questions. Using tools from the Kaczmarz method literature, we formalize such orderings and develop geometric and algebraic intuitions around them. Empirically, we demonstrate that greedy orderings converge faster than random ones in terms of the average loss across tasks, both for linear regression with random data and for linear probing on CIFAR-100 classification tasks. Analytically, in a high-rank regression setting, we prove a loss bound for greedy orderings analogous to that of random ones. However, under general rank, we establish a repetition-dependent separation. Specifically, while prior work showed that for random orderings, with or without replacement, the average loss after k iterations is bounded by \mathcal{O}(1/\sqrt{k}), we prove that single-pass greedy orderings may fail catastrophically, whereas those allowing repetition converge at rate \mathcal{O}(1/\sqrt[3]{k}). Overall, we reveal nuances within and between greedy and random orderings.
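
A toy sketch of the Kaczmarz-style update underlying the analysis: project the iterate onto one task's (row's) solution set per step, choosing the row greedily or uniformly at random. The greedy rule here is max-residual, a standard greedy rule from the Kaczmarz literature; the paper orders tasks by dissimilarity, so this is an analogy, not the paper's rule.

```python
import numpy as np

def kaczmarz(A, b, iters=500, greedy=True, rng=np.random.default_rng(3)):
    """Solve a consistent system Ax = b by row projections."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        r = A @ x - b
        i = np.argmax(np.abs(r)) if greedy else rng.integers(len(b))
        a = A[i]
        x -= (a @ x - b[i]) / (a @ a) * a        # exact projection onto row i
    return x
```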

[LG-66] Mitigating Privacy-Utility Trade-off in Decentralized Federated Learning via f-Differential Privacy NEURIPS2025

链接: https://arxiv.org/abs/2510.19934
作者: Xiang Li,Buxin Su,Chendi Wang,Qi Long,Weijie J. Su
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Statistics Theory (math.ST); Methodology (stat.ME); Machine Learning (stat.ML)
*备注: NeurIPS 2025 (Spotlight)

点击查看摘要

Abstract:Differentially private (DP) decentralized Federated Learning (FL) allows local users to collaborate without sharing their data with a central server. However, accurately quantifying the privacy budget of private FL algorithms is challenging due to the co-existence of complex algorithmic components such as decentralized communication and local updates. This paper addresses privacy accounting for two decentralized FL algorithms within the f-differential privacy (f-DP) framework. We develop two new f-DP-based accounting methods tailored to decentralized settings: Pairwise Network f-DP (PN-f-DP), which quantifies privacy leakage between user pairs under random-walk communication, and Secret-based f-Local DP (Sec-f-LDP), which supports structured noise injection via shared secrets. By combining tools from f-DP theory and Markov chain concentration, our accounting framework captures privacy amplification arising from sparse communication, local iterations, and correlated noise. Experiments on synthetic and real datasets demonstrate that our methods yield consistently tighter (\epsilon,\delta) bounds and improved utility compared to Rényi DP-based approaches, illustrating the benefits of f-DP in decentralized privacy accounting.

[LG-67] Beyond the Ideal: Analyzing the Inexact Muon Update

链接: https://arxiv.org/abs/2510.19933
作者: Egor Shulgin,Sultan AlRashed,Francesco Orabona,Peter Richtárik
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

Abstract:The Muon optimizer has rapidly emerged as a powerful, geometry-aware alternative to AdamW, demonstrating strong performance in large-scale training of neural networks. However, a critical theory-practice disconnect exists: Muon’s efficiency relies on fast, approximate orthogonalization, yet all prior theoretical work analyzes an idealized, computationally intractable version assuming exact SVD-based updates. This work moves beyond the ideal by providing the first analysis of the inexact orthogonalized update at Muon’s core. We develop our analysis within the general framework of Linear Minimization Oracle (LMO)-based optimization, introducing a realistic additive error model to capture the inexactness of practical approximation schemes. Our analysis yields explicit bounds that quantify performance degradation as a function of the LMO inexactness/error. We reveal a fundamental coupling between this inexactness and the optimal step size and momentum: lower oracle precision requires a smaller step size but larger momentum parameter. These findings elevate the approximation procedure (e.g., the number of Newton-Schulz steps) from an implementation detail to a critical parameter that must be co-tuned with the learning schedule. NanoGPT experiments directly confirm the predicted coupling, with optimal learning rates clearly shifting as approximation precision changes.
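
The inexact orthogonalization analyzed here can be pictured with the textbook cubic Newton-Schulz iteration toward the orthogonal polar factor; Muon in practice uses a tuned quintic polynomial, so the coefficients below are a classical simplification for illustration, and the number of steps is exactly the precision knob the paper says must be co-tuned.

```python
import torch

def newton_schulz(G, steps=5):
    """Cubic Newton-Schulz iteration toward the orthogonal polar factor of G."""
    X = G / (G.norm() + 1e-7)                    # scale so singular values <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X          # pushes singular values toward 1
    return X
```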

[LG-68] Enhancing Diagnostic Accuracy for Urinary Tract Disease through Explainable SHAP-Guided Feature Selection and Classification

链接: https://arxiv.org/abs/2510.19896
作者: Filipe Ferreira de Oliveira,Matheus Becali Rocha,Renato A. Krohling
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this paper, we propose an approach to support the diagnosis of urinary tract diseases, with a focus on bladder cancer, using SHAP (SHapley Additive exPlanations)-based feature selection to enhance the transparency and effectiveness of predictive models. Six binary classification scenarios were developed to distinguish bladder cancer from other urological and oncological conditions. The algorithms XGBoost, LightGBM, and CatBoost were employed, with hyperparameter optimization performed using Optuna and class balancing with the SMOTE technique. The selection of predictive variables was guided by importance values through SHAP-based feature selection while maintaining or even improving performance metrics such as balanced accuracy, precision, and specificity. The use of explainability techniques (SHAP) for feature selection proved to be an effective approach. The proposed methodology may contribute to the development of more transparent, reliable, and efficient clinical decision support systems, optimizing screening and early diagnosis of urinary tract diseases.
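
A hedged sketch of SHAP-guided selection with a tree model: rank features by mean absolute SHAP value and keep the top-k before retraining. `shap.TreeExplainer` and `XGBClassifier` are real APIs; the choice of XGBoost and the cutoff `k` are illustrative here, and the paper's Optuna tuning and SMOTE balancing are not reproduced.

```python
import numpy as np
import shap
from xgboost import XGBClassifier

def shap_top_k(X, y, k=10):
    """Rank features by mean |SHAP value| of a fitted tree model and
    return the indices of the top-k features."""
    model = XGBClassifier(n_estimators=200).fit(X, y)
    shap_values = shap.TreeExplainer(model).shap_values(X)
    importance = np.abs(shap_values).mean(axis=0)
    return np.argsort(importance)[::-1][:k]
```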

[LG-69] FairGRPO: Fair Reinforcement Learning for Equitable Clinical Reasoning NEURIPS2025 ALT

链接: https://arxiv.org/abs/2510.19893
作者: Shiqi Dai,Wei Dai,Jiaee Cheong,Paul Pu Liang
类目: Machine Learning (cs.LG)
*备注: Accepted as Oral on NeurIPS 2025 GenAI4Health Workshop

点击查看摘要

Abstract:Medical artificial intelligence systems have achieved remarkable diagnostic capabilities, yet they consistently exhibit performance disparities across demographic groups, causing real-world harm to underrepresented populations. While recent multimodal reasoning foundation models have advanced clinical diagnosis through integrated analysis of diverse medical data, reasoning trainings via reinforcement learning inherit and often amplify biases present in training datasets dominated by majority populations. We introduce Fairness-aware Group Relative Policy Optimization (FairGRPO), a hierarchical reinforcement learning approach that promotes equitable learning across heterogeneous clinical populations. FairGRPO employs adaptive importance weighting of advantages based on representation, task difficulty, and data source. To address the common issue of missing demographic labels in the clinical domain, we further employ unsupervised clustering, which automatically discovers latent demographic groups when labels are unavailable. Through comprehensive experiments across 7 clinical diagnostic datasets spanning 5 clinical modalities across X-ray, CT scan, dermoscopy, mammography and ultrasound, we demonstrate that FairGRPO reduces predictive parity by 27.2% against all vanilla and bias-mitigated RL baselines, while improving F1 score by 12.49%. Furthermore, training dynamics analysis reveals that FairGRPO progressively improves fairness throughout optimization, while baseline RL methods exhibit deteriorating fairness as training progresses. Based on FairGRPO, we release FairMedGemma-4B, a fairness-aware clinical VLLM that achieves state-of-the-art performance while demonstrating significantly reduced disparities across demographic groups.

[LG-70] Deep Sequence-to-Sequence Models for GNSS Spoofing Detection

链接: https://arxiv.org/abs/2510.19890
作者: Jan Zelinka,Oliver Kost,Marek Hrúz
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present a data generation framework designed to simulate spoofing attacks and randomly place attack scenarios worldwide. We apply deep neural network-based models for spoofing detection, utilizing Long Short-Term Memory networks and Transformer-inspired architectures. These models are specifically designed for online detection and are trained using the generated dataset. Our results demonstrate that deep learning models can accurately distinguish spoofed signals from genuine ones, achieving high detection performance. The best results are achieved by Transformer-inspired architectures with early fusion of the inputs, resulting in an error rate of 0.16%.

[LG-71] An Integrated Approach to Neural Architecture Search for Deep Q-Networks

Link: https://arxiv.org/abs/2510.19872
Authors: Iman Rahmani, Saman Yazdannik, Morteza Tayefi, Jafar Roshanian
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The performance of deep reinforcement learning agents is fundamentally constrained by their neural network architecture, a choice traditionally made through expensive hyperparameter searches and then fixed throughout training. This work investigates whether online, adaptive architecture optimization can escape this constraint and outperform static designs. We introduce NAS-DQN, an agent that integrates a learned neural architecture search controller directly into the DRL training loop, enabling dynamic network reconfiguration based on cumulative performance feedback. We evaluate NAS-DQN against three fixed-architecture baselines and a random search control on a continuous control task, conducting experiments over multiple random seeds. Our results demonstrate that NAS-DQN achieves superior final performance, sample efficiency, and policy stability while incurring negligible computational overhead. Critically, the learned search strategy substantially outperforms both undirected random architecture exploration and poorly-chosen fixed designs, indicating that intelligent, performance-guided search is the key mechanism driving success. These findings establish that architecture adaptation is not merely beneficial but necessary for optimal sample efficiency in online deep reinforcement learning, and suggest that the design of RL agents need not be a static offline choice but can instead be seamlessly integrated as a dynamic component of the learning process itself.

[LG-72] Some Attention is All You Need for Retrieval

Link: https://arxiv.org/abs/2510.19861
Authors: Felix Michalak, Steven Abreu
Subjects: Machine Learning (cs.LG)
Comments: 16 pages, 10 figures

Click to view abstract

Abstract:We demonstrate complete functional segregation in hybrid SSM-Transformer architectures: retrieval depends exclusively on self-attention layers. Across RecurrentGemma-2B/9B and Jamba-Mini-1.6, attention ablation causes catastrophic retrieval failure (0% accuracy), while SSM layers show no compensatory mechanisms even with improved prompting. Conversely, sparsifying attention to just 15% of heads maintains near-perfect retrieval while preserving 84% MMLU performance, suggesting self-attention specializes primarily for retrieval tasks. We identify precise mechanistic requirements for retrieval: needle tokens must be exposed during generation and sufficient context must be available during prefill or generation. This strict functional specialization challenges assumptions about redundancy in hybrid architectures and suggests these models operate as specialized modules rather than integrated systems, with immediate implications for architecture optimization and interpretability.
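
The head-sparsification finding above is easy to reproduce in miniature: keep a subset of attention heads and zero out the contributions of the rest. A toy NumPy sketch follows (random weights, no training; the per-head output projection summed into the residual stream is the usual multi-head layout, assumed here rather than taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, H, dk = 6, 16, 4, 8              # tokens, model dim, heads, head dim
x = rng.normal(size=(T, d))
Wq = rng.normal(size=(H, d, dk)) / np.sqrt(d)
Wk = rng.normal(size=(H, d, dk)) / np.sqrt(d)
Wv = rng.normal(size=(H, d, dk)) / np.sqrt(d)
Wo = rng.normal(size=(H, dk, d)) / np.sqrt(dk)

def attention(x, keep_heads):
    """Multi-head self-attention where ablated heads contribute nothing."""
    out = np.zeros_like(x)
    for h in keep_heads:
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        scores = q @ k.T / np.sqrt(dk)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)
        out += (w @ v) @ Wo[h]          # per-head output projection, summed
    return out

full = attention(x, range(H))
sparse = attention(x, [0])              # keep 25% of heads, ablate the rest
print(np.linalg.norm(full - sparse))
```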

[LG-73] CSU-PCAST: A Dual-Branch Transformer Framework for Medium-Range Ensemble Precipitation Forecasting

Link: https://arxiv.org/abs/2510.20769
Authors: Tianyi Xiong, Haonan Chen
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments: 20 pages, 12 figures, submitted to arXiv under Atmospheric and Oceanic Physics (physics.ao-ph) and Machine Learning (cs.LG)

Click to view abstract

Abstract:Accurate medium-range precipitation forecasting is crucial for hydrometeorological risk management and disaster mitigation, yet remains challenging for current numerical weather prediction (NWP) systems. Traditional ensemble systems such as the Global Ensemble Forecast System (GEFS) struggle to maintain high skill, especially for moderate and heavy rainfall at extended lead times. This study develops a deep learning-based ensemble framework for multi-step precipitation prediction through joint modeling of a comprehensive set of atmospheric variables. The model is trained on ERA5 reanalysis data at 0.25° spatial resolution, with precipitation labels from NASA's Integrated Multi-satellite Retrievals for Global Precipitation Measurement (GPM) constellation (IMERG), incorporating 57 input variables, including upper-air and surface predictors. The architecture employs a patch-based Swin Transformer backbone with periodic convolutions to handle longitudinal continuity and integrates time and noise embeddings through conditional layer normalization. A dual-branch decoder predicts total precipitation and other variables, with targeted freezing of encoder-decoder pathways for specialized training. Training minimizes a hybrid loss combining the Continuous Ranked Probability Score (CRPS) and weighted log1p mean squared error (log1pMSE), balancing probabilistic accuracy and magnitude fidelity. During inference, the model ingests real-time Global Forecast System (GFS) initial conditions to generate 15-day forecasts autoregressively. Evaluation against GEFS using IMERG data demonstrates higher Critical Success Index (CSI) scores at precipitation thresholds of 0.1 mm, 1 mm, 10 mm, and 20 mm, highlighting improved performance for moderate to heavy rainfall.
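
As a concrete reading of the magnitude-fidelity term, here is a minimal sketch of a weighted log1p MSE; the specific weighting function is our assumption for illustration, not the paper's exact choice. In the full loss this term would be combined with CRPS computed over the ensemble members.

```python
import numpy as np

def log1p_mse(pred_mm, target_mm):
    """MSE in log1p space, upweighting heavier rain (illustrative weights)."""
    pred_mm = np.clip(np.asarray(pred_mm, dtype=float), 0.0, None)
    target_mm = np.asarray(target_mm, dtype=float)
    w = 1.0 + np.log1p(target_mm)                 # assumed magnitude weighting
    err = np.log1p(pred_mm) - np.log1p(target_mm)
    return float(np.mean(w * err**2))

print(log1p_mse([0.5, 12.0], [1.0, 10.0]))
```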

[LG-74] Diffusion Autoencoders with Perceivers for Long Irregular and Multimodal Astronomical Sequences

Link: https://arxiv.org/abs/2510.20595
Authors: Yunyi Shen, Alexander Gagliano
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Self-supervised learning has become a central strategy for representation learning, but the majority of architectures used for encoding data have only been validated on regularly-sampled inputs such as images, audio, and video. In many scientific domains, data instead arrive as long, irregular, and multimodal sequences. To extract semantic information from these data, we introduce the Diffusion Autoencoder with Perceivers (daep). daep tokenizes heterogeneous measurements, compresses them with a Perceiver encoder, and reconstructs them with a Perceiver-IO diffusion decoder, enabling scalable learning in diverse data settings. To benchmark the daep architecture, we adapt the masked autoencoder to a Perceiver encoder/decoder design and establish a strong baseline (maep) in the same architectural family as daep. Across diverse spectroscopic and photometric astronomical datasets, daep achieves lower reconstruction errors, produces more discriminative latent spaces, and better preserves fine-scale structure than both VAE and maep baselines. These results establish daep as an effective framework for scientific domains where data arrive as irregular, heterogeneous sequences.

[LG-75] Concentration and excess risk bounds for imbalanced classification with synthetic oversampling

Link: https://arxiv.org/abs/2510.20472
Authors: Touqeer Ahmad, Mohammadreza M. Kalan, François Portier, Gilles Stupfler
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 35 pages and 12 figures, including appendices

Click to view abstract

Abstract:Synthetic oversampling of minority examples using SMOTE and its variants is a leading strategy for addressing imbalanced classification problems. Despite the success of this approach in practice, its theoretical foundations remain underexplored. We develop a theoretical framework to analyze the behavior of SMOTE and related methods when classifiers are trained on synthetic data. We first derive a uniform concentration bound on the discrepancy between the empirical risk over synthetic minority samples and the population risk on the true minority distribution. We then provide a nonparametric excess risk guarantee for kernel-based classifiers trained using such synthetic data. These results lead to practical guidelines for better parameter tuning of both SMOTE and the downstream learning algorithm. Numerical experiments are provided to illustrate and support the theoretical findings.
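
For context, the SMOTE mechanism the theory analyzes is a simple linear interpolation between minority neighbors: a synthetic point is drawn uniformly on the segment between a minority sample and one of its k nearest minority neighbors. A minimal sketch (brute-force k-NN):

```python
import numpy as np

def smote_sample(X_min, k=5, rng=np.random.default_rng(0)):
    """Draw one synthetic minority point by neighbor interpolation."""
    i = rng.integers(len(X_min))
    dist = np.linalg.norm(X_min - X_min[i], axis=1)
    nn = np.argsort(dist)[1:k + 1]        # k nearest minority neighbors
    j = rng.choice(nn)
    lam = rng.uniform()                   # uniform weight on the segment
    return X_min[i] + lam * (X_min[j] - X_min[i])

X_min = np.random.default_rng(1).normal(size=(20, 2))
print(smote_sample(X_min))
```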

[LG-76] Learning Decentralized Routing Policies via Graph Attention-based Multi-Agent Reinforcement Learning in Lunar Delay-Tolerant Networks

Link: https://arxiv.org/abs/2510.20436
Authors: Federico Lozano-Cuadra, Beatriz Soret, Marc Sanchez Net, Abhishek Cauligi, Federico Rossi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We present a fully decentralized routing framework for multi-robot exploration missions operating under the constraints of a Lunar Delay-Tolerant Network (LDTN). In this setting, autonomous rovers must relay collected data to a lander under intermittent connectivity and unknown mobility patterns. We formulate the problem as a Partially Observable Markov Decision Process (POMDP) and propose a Graph Attention-based Multi-Agent Reinforcement Learning (GAT-MARL) policy that performs Centralized Training, Decentralized Execution (CTDE). Our method relies only on local observations and does not require global topology updates or packet replication, unlike classical approaches such as shortest path and controlled flooding-based algorithms. Through Monte Carlo simulations in randomized exploration environments, GAT-MARL provides higher delivery rates, no duplications, and fewer packet losses, and is able to leverage short-term mobility forecasts, offering a scalable solution for future space robotic systems for planetary exploration, as demonstrated by successful generalization to larger rover teams.

[LG-77] Learning Coupled Earth System Dynamics with GraphDOP

Link: https://arxiv.org/abs/2510.20416
Authors: Eulalie Boucher, Mihai Alexe, Peter Lean, Ewan Pinnington, Simon Lang, Patrick Laloyaux, Lorenzo Zampieri, Patricia de Rosnay, Niels Bormann, Anthony McNally
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Interactions between different components of the Earth System (e.g. ocean, atmosphere, land and cryosphere) are a crucial driver of global weather patterns. Modern Numerical Weather Prediction (NWP) systems typically run separate models of the different components, explicitly coupled across their interfaces to additionally model exchanges between the different components. Accurately representing these coupled interactions remains a major scientific and technical challenge of weather forecasting. GraphDOP is a graph-based machine learning model that learns to forecast weather directly from raw satellite and in-situ observations, without reliance on reanalysis products or traditional physics-based NWP models. GraphDOP simultaneously embeds information from diverse observation sources spanning the full Earth system into a shared latent space. This enables predictions that implicitly capture cross-domain interactions in a single model without the need for any explicit coupling. Here we present a selection of case studies which illustrate the capability of GraphDOP to forecast events where coupled processes play a particularly key role. These include rapid sea-ice freezing in the Arctic, mixing-induced ocean surface cooling during Hurricane Ian and the severe European heat wave of 2022. The results suggest that learning directly from Earth System observations can successfully characterise and propagate cross-component interactions, offering a promising path towards physically consistent end-to-end data-driven Earth System prediction with a single model.

[LG-78] Testing Most Influential Sets ICLR

Link: https://arxiv.org/abs/2510.20372
Authors: Lucas Darius Konrad, Nikolas Kuschnig
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Statistics Theory (math.ST); Methodology (stat.ME)
Comments: 9 pages, 1 figure, submitted to ICLR

Click to view abstract

Abstract:Small subsets of data with disproportionate influence on model outcomes can have dramatic impacts on conclusions, with a few data points sometimes overturning key findings. While recent work has developed methods to identify these most influential sets, no formal theory exists to determine when their influence reflects genuine problems rather than natural sampling variation. We address this gap by developing a principled framework for assessing the statistical significance of most influential sets. Our theoretical results characterize the extreme value distributions of maximal influence and enable rigorous hypothesis tests for excessive influence, replacing current ad-hoc sensitivity checks. We demonstrate the practical value of our approach through applications across economics, biology, and machine learning benchmarks.

[LG-79] A Transformer Inspired AI-based MIMO Receiver

Link: https://arxiv.org/abs/2510.20363
Authors: András Rácz, Tamás Borsos, András Veres, Benedek Csala
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We present AttDet, a Transformer-inspired MIMO (Multiple Input Multiple Output) detection method that treats each transmit layer as a token and learns inter-stream interference via a lightweight self-attention mechanism. Queries and keys are derived directly from the estimated channel matrix, so attention scores quantify channel correlation. Values are initialized by matched-filter outputs and iteratively refined. The AttDet design combines model-based interpretability with data-driven flexibility. We demonstrate through link-level simulations under realistic 5G channel models and high-order, mixed QAM modulation and coding schemes, that AttDet can approach near-optimal BER/BLER (Bit Error Rate/Block Error Rate) performance while maintaining predictable, polynomial complexity.
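
A rough sketch of the stated idea, that attention scores derived from the channel matrix quantify inter-stream correlation: treat each transmit layer as a token, build queries/keys from channel columns, and initialize values with matched-filter outputs. Everything beyond that sentence (the scaling, the single un-learned refinement step) is our assumption for illustration; per the abstract, the actual model learns these components and iterates the value refinement.

```python
import numpy as np

rng = np.random.default_rng(0)
Nt, Nr = 4, 8                              # transmit layers, receive antennas
H = (rng.normal(size=(Nr, Nt)) + 1j * rng.normal(size=(Nr, Nt))) / np.sqrt(2)
s = (rng.choice([-1.0, 1.0], Nt) + 1j * rng.choice([-1.0, 1.0], Nt)) / np.sqrt(2)
y = H @ s + 0.05 * (rng.normal(size=Nr) + 1j * rng.normal(size=Nr))

Q = K = H.T                                # token i <- channel column h_i
V = H.conj().T @ y                         # matched-filter initialization
scores = np.abs(Q.conj() @ K.T)            # |h_i^H h_j|: inter-stream coupling
np.fill_diagonal(scores, 0.0)
att = scores / scores.sum(-1, keepdims=True)
refined = V - att @ V                      # one crude interference-cleaning step
print(np.round(refined, 2))
```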

[LG-80] ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature

Link: https://arxiv.org/abs/2510.20362
Authors: Aritra Roy, Enrico Grisan, John Buckeridge, Chiara Gattinoni
Subjects: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Since the advent of various pre-trained large language models, extracting structured knowledge from scientific text has experienced a revolutionary change compared with traditional machine learning or natural language processing techniques. Despite these advances, accessible automated tools that allow users to construct, validate, and visualise datasets from scientific literature extraction remain scarce. We therefore developed ComProScanner, an autonomous multi-agent platform that facilitates the extraction, validation, classification, and visualisation of machine-readable chemical compositions and properties, integrated with synthesis data from journal articles for comprehensive database creation. We evaluated our framework using 100 journal articles against 10 different LLMs, including both open-source and proprietary models, to extract highly complex compositions associated with ceramic piezoelectric materials and corresponding piezoelectric strain coefficients (d33), motivated by the lack of a large dataset for such materials. DeepSeek-V3-0324 outperformed all models with a significant overall accuracy of 0.82. This framework provides a simple, user-friendly, readily-usable package for extracting highly complex experimental data buried in the literature to build machine learning or deep learning datasets.

[LG-81] Neural Networks for Censored Expectile Regression Based on Data Augmentation

Link: https://arxiv.org/abs/2510.20344
Authors: Wei Cao, Shanshan Wang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Expectile regression neural networks (ERNNs) are powerful tools for capturing heterogeneity and complex nonlinear structures in data. However, most existing research has primarily focused on fully observed data, with limited attention paid to scenarios involving censored observations. In this paper, we propose a data augmentation based ERNNs algorithm, termed DAERNN, for modeling heterogeneous censored data. The proposed DAERNN is fully data driven, requires minimal assumptions, and offers substantial flexibility. Simulation studies and real data applications demonstrate that DAERNN outperforms existing censored ERNNs methods and achieves predictive performance comparable to models trained on fully observed data. Moreover, the algorithm provides a unified framework for handling various censoring mechanisms without requiring explicit parametric model specification, thereby enhancing its applicability to practical censored data analysis.
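
For readers unfamiliar with expectile regression, the asymmetric squared loss that ERNNs minimize at level tau is standard; tau = 0.5 recovers ordinary least squares:

```python
import numpy as np

def expectile_loss(residuals, tau):
    """Asymmetric squared loss: weight tau above the fit, 1 - tau below."""
    r = np.asarray(residuals, dtype=float)
    return float(np.mean(np.abs(tau - (r < 0)) * r**2))

# tau = 0.5 is ordinary least squares; tau = 0.9 penalizes under-prediction more.
print(expectile_loss([1.0, -1.0], tau=0.5), expectile_loss([1.0, -1.0], tau=0.9))
```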

[LG-82] Capability of using normalizing flows for extraction of rare gamma events in the TAIGA experiment

Link: https://arxiv.org/abs/2510.20334
Authors: A.P. Kryukov, A.Yu. Razumov, A.P. Demichev, J.J. Dubenskaya, E.O. Gres, S.P. Polyakov, E.B. Postnikov, P.A. Volchugov, D.P. Zhurov
Subjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); High Energy Astrophysical Phenomena (astro-ph.HE); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures, Proceedings of the 9th International Conference on Deep Learning in Computational Physics, July 2-4, 2025, Moscow, Russia

Click to view abstract

Abstract:The objective of this work is to detect rare gamma quanta against the background of charged particles in fluxes from astrophysical sources, using a deep-learning method based on normalizing flows and designed for anomaly detection. The suggested method is shown to have potential for gamma detection. It was tested on simulated data from the TAIGA-IACT experiment. The quantitative performance obtained so far remains inferior to other approaches, and possible ways to improve the implementation of the method are therefore proposed.

[LG-83] Compositional Generation for Long-Horizon Coupled PDEs

Link: https://arxiv.org/abs/2510.20141
Authors: Somayajulu L. N. Dhulipala, Deep Ray, Nicholas Forman
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Simulating coupled PDE systems is computationally intensive, and prior efforts have largely focused on training surrogates on the joint (coupled) data, which requires a large amount of data. In this paper, we study compositional diffusion approaches in which diffusion models are trained only on the decoupled PDE data and are composed at inference time to recover the coupled field. Specifically, we investigate whether the compositional strategy remains feasible over long time horizons involving a large number of time steps. In addition, we compare a baseline diffusion model with one trained using the v-parameterization strategy. We also introduce a symmetric compositional scheme for the coupled fields based on the Euler scheme. We evaluate on Reaction-Diffusion and modified Burgers systems with longer time grids, and benchmark against a Fourier Neural Operator trained on coupled data. Despite seeing only decoupled training data, the compositional diffusion models recover coupled trajectories with low error. The v-parameterization can improve accuracy over a baseline diffusion model, while the neural operator surrogate remains strongest given that it is trained on the coupled data. These results show that compositional diffusion is a viable strategy for efficient, long-horizon modeling of coupled PDEs.
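
The v-parameterization compared above is standard (Salimans & Ho, 2022); assuming a variance-preserving schedule with \alpha_t^2 + \sigma_t^2 = 1, it reads:

```latex
x_t = \alpha_t x_0 + \sigma_t \epsilon, \qquad
v_t \equiv \alpha_t \epsilon - \sigma_t x_0, \qquad
\hat{x}_0 = \alpha_t x_t - \sigma_t \hat{v}_t .
```

Substituting the forward process into the last identity confirms it: \alpha_t x_t - \sigma_t v_t = (\alpha_t^2 + \sigma_t^2) x_0 = x_0.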

[LG-84] Extending machine learning model for implicit solvation to free energy calculations

Link: https://arxiv.org/abs/2510.20103
Authors: Rishabh Dey, Michael Brocidiacono, Kushal Koirala, Alexander Tropsha, Konstantin I. Popov
Subjects: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The implicit solvent approach offers a computationally efficient framework to model solvation effects in molecular simulations. However, its accuracy often falls short compared to explicit solvent models, limiting its use in precise thermodynamic calculations. Recent advancements in machine learning (ML) present an opportunity to overcome these limitations by leveraging neural networks to develop more precise implicit solvent potentials for diverse applications. A major drawback of current ML-based methods is their reliance on force-matching alone, which can lead to energy predictions that differ by an arbitrary constant and are therefore unsuitable for absolute free energy comparisons. Here, we introduce a novel methodology with a graph neural network (GNN)-based implicit solvent model, dubbed Lambda Solvation Neural Network (LSNN). In addition to force-matching, this network was trained to match the derivatives of alchemical variables, ensuring that solvation free energies can be meaningfully compared across chemical species. Trained on a dataset of approximately 300,000 small molecules, LSNN achieves free energy predictions with accuracy comparable to explicit-solvent alchemical simulations, while offering a computational speedup and establishing a foundational framework for future applications in drug discovery.

[LG-85] Endogenous Aggregation of Multiple Data Envelopment Analysis Scores for Large Data Sets

Link: https://arxiv.org/abs/2510.20052
Authors: Hashem Omrani, Raha Imanirad, Adam Diamant, Utkarsh Verma, Amol Verma, Fahad Razak
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Click to view abstract

Abstract:We propose an approach for dynamic efficiency evaluation across multiple organizational dimensions using data envelopment analysis (DEA). The method generates both dimension-specific and aggregate efficiency scores, incorporates desirable and undesirable outputs, and is suitable for large-scale problem settings. Two regularized DEA models are introduced: a slack-based measure (SBM) and a linearized version of a nonlinear goal programming model (GP-SBM). While SBM estimates an aggregate efficiency score and then distributes it across dimensions, GP-SBM first estimates dimension-level efficiencies and then derives an aggregate score. Both models utilize a regularization parameter to enhance discriminatory power while also directly integrating both desirable and undesirable outputs. We demonstrate the computational efficiency and validity of our approach on multiple datasets and apply it to a case study of twelve hospitals in Ontario, Canada, evaluating three theoretically grounded dimensions of organizational effectiveness over a 24-month period from January 2018 to December 2019: technical efficiency, clinical efficiency, and patient experience. Our numerical results show that SBM and GP-SBM better capture correlations among input/output variables and outperform conventional benchmarking methods that separately evaluate dimensions before aggregation.

[LG-86] Throwing Vines at the Wall: Structure Learning via Random Search

Link: https://arxiv.org/abs/2510.20035
Authors: Thibault Vatter, Thomas Nagler
Subjects: Methodology (stat.ME); Machine Learning (cs.LG)
Comments: 19 pages, 7 figures, 5 tables, 2 algorithms, 4 appendices

Click to view abstract

Abstract:Vine copulas offer flexible multivariate dependence modeling and have become widely used in machine learning, yet structure learning remains a key challenge. Early heuristics like the greedy algorithm of Dissmann are still considered the gold standard, but often suboptimal. We propose random search algorithms that improve structure selection and a statistical framework based on model confidence sets, which provides theoretical guarantees on selection probabilities and a powerful foundation for ensembling. Empirical results on several real-world data sets show that our methods consistently outperform state-of-the-art approaches.

[LG-87] On Encoding Matrices using Quantum Circuits

Link: https://arxiv.org/abs/2510.20030
Authors: Liron Mor Yosef, Haim Avron
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Click to view abstract

Abstract:Over a decade ago, it was demonstrated that quantum computing has the potential to revolutionize numerical linear algebra by enabling algorithms with complexity superior to what is classically achievable, e.g., the seminal HHL algorithm for solving linear systems. Efficient execution of such algorithms critically depends on representing inputs (matrices and vectors) as quantum circuits that encode or implement these inputs. For that task, two common circuit representations emerged in the literature: block encodings and state preparation circuits. In this paper, we systematically study encodings of matrices in the form of block encodings and state preparation circuits. We examine methods for constructing these representations from matrices given in classical form, as well as quantum two-way conversions between circuit representations. Two key results we establish (among others) are: (a) a general method for efficiently constructing a block encoding of an arbitrary matrix given in classical form (entries stored in classical random access memory); and (b) low-overhead, bidirectional conversion algorithms between block encodings and state preparation circuits, showing that these models are essentially equivalent. From a technical perspective, two central components of our constructions are: (i) a special constant-depth multiplexer that simultaneously multiplexes all higher-order Pauli matrices of a given size, and (ii) an algorithm for performing a quantum conversion between a matrix's expansion in the standard basis and its expansion in the basis of higher-order Pauli matrices.

[LG-88] Simultaneously Solving Infinitely Many LQ Mean Field Games In Hilbert Spaces: The Power of Neural Operators

Link: https://arxiv.org/abs/2510.20017
Authors: Dena Firoozi, Anastasis Kratsios, Xuwei Yang
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Mathematical Finance (q-fin.MF)
Comments: 48 pages

Click to view abstract

Abstract:Traditional mean-field game (MFG) solvers operate on an instance-by-instance basis, which becomes infeasible when many related problems must be solved (e.g., for seeking a robust description of the solution under perturbations of the dynamics or utilities, or in settings involving continuum-parameterized agents). We overcome this by training neural operators (NOs) to learn the rules-to-equilibrium map from the problem data ("rules": dynamics and cost functionals) of LQ MFGs defined on separable Hilbert spaces to the corresponding equilibrium strategy. Our main result is a statistical guarantee: an NO trained on a small number of randomly sampled rules reliably solves unseen LQ MFG variants, even in infinite-dimensional settings. The number of NO parameters needed remains controlled under appropriate rule sampling during training. Our guarantee follows from three results: (i) local-Lipschitz estimates for the highly nonlinear rules-to-equilibrium map; (ii) a universal approximation theorem using NOs with a prespecified Lipschitz regularity (unlike traditional NO results, where the NO's Lipschitz constant can diverge as the approximation error vanishes); and (iii) new sample-complexity bounds for L-Lipschitz learners in infinite dimensions, directly applicable since the Lipschitz constants of our approximating NOs are controlled in (ii).

[LG-89] Enhanced Cyclic Coordinate Descent Methods for Elastic Net Penalized Linear Models

Link: https://arxiv.org/abs/2510.19999
Authors: Yixiao Wang, Zishan Shao, Ting Jiang, Aditya Devarakonda
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Mathematical Software (cs.MS); Numerical Analysis (math.NA); Applications (stat.AP)
Comments: Equal contribution: Yixiao Wang and Zishan Shao. Correspondence: yw676@duke.edu

Click to view abstract

Abstract:We present a novel enhanced cyclic coordinate descent (ECCD) framework for solving generalized linear models with elastic net constraints that reduces training time in comparison to existing state-of-the-art methods. We redesign the CD method by performing a Taylor expansion around the current iterate to avoid nonlinear operations arising in the gradient computation. By introducing this approximation, we are able to unroll the vector recurrences occurring in the CD method and reformulate the resulting computations into more efficient batched computations. We show empirically that the recurrence can be unrolled by a tunable integer parameter s, such that s > 1 yields performance improvements without affecting convergence, whereas s = 1 recovers the original CD method. A key advantage of ECCD is that it avoids the convergence delay and numerical instability exhibited by block coordinate descent. Finally, we implement our proposed method in C++ using Eigen to accelerate linear algebra computations. Comparison of our method against existing state-of-the-art solvers shows consistent performance improvements of 3× on average for the regularization-path variant on diverse benchmark datasets. Our implementation is available at this https URL.
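
For reference, the baseline that ECCD accelerates, one cyclic coordinate-descent sweep for the elastic net with the usual soft-thresholding update (Friedman et al.), looks as follows. This is a minimal sketch of the standard s = 1 method, not ECCD's unrolled variant.

```python
import numpy as np

def soft_threshold(z, g):
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def cd_elastic_net(X, y, lam, alpha, n_sweeps=100):
    """Minimize (1/2n)||y - Xb||^2 + lam*(alpha*||b||_1 + (1-alpha)/2*||b||_2^2)."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                          # residual y - X @ beta, kept current
    col_sq = (X**2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):                # one cyclic sweep
            rho = X[:, j] @ r / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1 - alpha))
            r += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=50)
print(np.round(cd_elastic_net(X, y, lam=0.1, alpha=0.9), 2))
```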

[LG-90] Guiding diffusion models to reconstruct flow fields from sparse data

Link: https://arxiv.org/abs/2510.19971
Authors: Marc Amorós-Trepat, Luis Medrano-Navarro, Qiang Liu, Luca Guastoni, Nils Thuerey
Subjects: Fluid Dynamics (physics.flu-dyn); Machine Learning (cs.LG)
Comments: Code and dataset can be found at this https URL

Click to view abstract

Abstract:The reconstruction of unsteady flow fields from limited measurements is a challenging and crucial task for many engineering applications. Machine learning models are gaining popularity in solving this problem due to their ability to learn complex patterns from data and generalize across diverse conditions. Among these, diffusion models have emerged as particularly powerful in generative tasks, producing high-quality samples by iteratively refining noisy inputs. In contrast to other methods, these generative models are capable of reconstructing the smallest scales of the fluid spectrum. In this work, we introduce a novel sampling method for diffusion models that enables the reconstruction of high-fidelity samples by guiding the reverse process using the available sparse data. Moreover, we enhance the reconstructions with available physics knowledge using a conflict-free update method during training. To evaluate the effectiveness of our method, we conduct experiments on two- and three-dimensional turbulent flow data. Our method consistently outperforms other diffusion-based methods in predicting the fluid's structure and in pixel-wise accuracy. This study underscores the remarkable potential of diffusion models in reconstructing flow field data, paving the way for their application in Computational Fluid Dynamics research.

[LG-91] Compressing Biology: Evaluating the Stable Diffusion VAE for Phenotypic Drug Discovery NEURIPS2025

Link: https://arxiv.org/abs/2510.19887
Authors: Télio Cropsal, Rocío Mercado
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: Accepted to the 3rd Workshop on Imageomics: Discovering Biological Knowledge from Images Using AI at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

Click to view abstract

Abstract:High-throughput phenotypic screens generate vast microscopy image datasets that push the limits of generative models due to their large dimensionality. Despite the growing popularity of general-purpose models trained on natural images for microscopy data analysis, their suitability in this domain has not been quantitatively demonstrated. We present the first systematic evaluation of Stable Diffusion’s variational autoencoder (SD-VAE) for reconstructing Cell Painting images, assessing performance across a large dataset with diverse molecular perturbations and cell types. We find that SD-VAE reconstructions preserve phenotypic signals with minimal loss, supporting its use in microscopy workflows. To benchmark reconstruction quality, we compare pixel-level, embedding-based, latent-space, and retrieval-based metrics for a biologically informed evaluation. We show that general-purpose feature extractors like InceptionV3 match or surpass publicly available bespoke models in retrieval tasks, simplifying future pipelines. Our findings offer practical guidelines for evaluating generative models on microscopy data and support the use of off-the-shelf models in phenotypic drug discovery.

[LG-92] Transforming Multi-Omics Integration with GANs: Applications in Alzheimer's and Cancer

Link: https://arxiv.org/abs/2510.19870
Authors: Md Selim Reza, Sabrin Afroz, Mostafizer Rahman, Md Ashad Alam
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 24 pages, 6 figures

Click to view abstract

Abstract:Multi-omics data integration is crucial for understanding complex diseases, yet limited sample sizes, noise, and heterogeneity often reduce predictive power. To address these challenges, we introduce Omics-GAN, a Generative Adversarial Network (GAN)-based framework designed to generate high-quality synthetic multi-omics profiles while preserving biological relationships. We evaluated Omics-GAN on three omics types (mRNA, miRNA, and DNA methylation) using the ROSMAP cohort for Alzheimer’s disease (AD) and TCGA datasets for colon and liver cancer. A support vector machine (SVM) classifier with repeated 5-fold cross-validation demonstrated that synthetic datasets consistently improved prediction accuracy compared to original omics profiles. The AUC of SVM for mRNA improved from 0.72 to 0.74 in AD, and from 0.68 to 0.72 in liver cancer. Synthetic miRNA enhanced classification in colon cancer from 0.59 to 0.69, while synthetic methylation data improved performance in liver cancer from 0.64 to 0.71. Boxplot analyses confirmed that synthetic data preserved statistical distributions while reducing noise and outliers. Feature selection identified significant genes overlapping with original datasets and revealed additional candidates validated by GO and KEGG enrichment analyses. Finally, molecular docking highlighted potential drug repurposing candidates, including Nilotinib for AD, Atovaquone for liver cancer, and Tecovirimat for colon cancer. Omics-GAN enhances disease prediction, preserves biological fidelity, and accelerates biomarker and drug discovery, offering a scalable strategy for precision medicine applications.

[LG-93] Artificial Intelligence Powered Identification of Potential Antidiabetic Compounds in Ficus religiosa

Link: https://arxiv.org/abs/2510.19867
Authors: Md Ashad Alam, Md Amanullah
Subjects: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
Comments: 25 pages, 3 figures, 3 tables

Click to view abstract

Abstract:Diabetes mellitus is a chronic metabolic disorder that necessitates novel therapeutic innovations due to its gradual progression and the onset of various metabolic complications. Ficus religiosa is a traditional medicinal plant whose bioactive phytochemicals are reported to have potential antidiabetic properties. This study employs AI-driven computational approaches to identify and evaluate compounds derived from Ficus religiosa that exhibit antidiabetic activity. A comprehensive computational procedure incorporated machine learning methodologies, molecular docking techniques, and ADMET prediction systems to assess phytochemical efficacy against the key antidiabetic target dipeptidyl peptidase-4 (DPP-4). DeepBindGCN and the AutoDock software facilitated the investigation of binding interactions via deep learning. Flavonoids and alkaloids emerged as attractive phytochemicals owing to their strong binding interactions and advantageous pharmacological profiles. The introduction of AI accelerated screening and improved accuracy, demonstrating its efficacy for research on plant-based antidiabetic agents. This work lays a scientific foundation for future experimental validation of natural product therapies tailored for diabetes management.

[LG-94] Multi-Resolution Analysis of the Convective Structure of Tropical Cyclones for Short-Term Intensity Guidance NEURIPS2025

Link: https://arxiv.org/abs/2510.19854
Authors: Elizabeth Cucuzzella, Tria McNeely, Kimberly Wood, Ann B. Lee
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments: For the Tackling Climate Change with Machine Learning workshop at NeurIPS 2025

Click to view abstract

Abstract:Accurate tropical cyclone (TC) short-term intensity forecasting with a 24-hour lead time is essential for disaster mitigation in the Atlantic TC basin. Since most TCs evolve far from land-based observing networks, satellite imagery is critical to monitoring these storms; however, these complex and high-resolution spatial structures can be challenging for forecasters to interpret qualitatively in real time. Here we propose a concise, interpretable, and descriptive approach to quantify fine TC structures via a multi-resolution analysis (MRA) based on the discrete wavelet transform, enabling data analysts to identify physically meaningful structural features that strongly correlate with rapid intensity change. Furthermore, deep-learning techniques can build on this MRA for short-term intensity guidance.
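
A minimal version of such an MRA can be run with PyWavelets; the wavelet family ("haar"), the three decomposition levels, and the random stand-in image are assumptions for illustration, not the paper's configuration:

```python
import numpy as np
import pywt  # PyWavelets

img = np.random.default_rng(0).normal(size=(128, 128))   # stand-in IR frame
coeffs = pywt.wavedec2(img, wavelet="haar", level=3)
# coeffs[0] is the coarse approximation; coeffs[1:] hold (horizontal,
# vertical, diagonal) detail bands ordered from coarsest to finest scale.
for lvl, (cH, cV, cD) in enumerate(coeffs[1:], start=1):
    energy = (cH**2).sum() + (cV**2).sum() + (cD**2).sum()
    print(f"detail level {lvl}: energy = {energy:.1f}")
```

Per-scale detail energies like these are one natural way to summarize convective structure at multiple resolutions.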

[LG-95] Low-Latency Neural Inference on an Edge Device for Real-Time Handwriting Recognition from EEG Signals

Link: https://arxiv.org/abs/2510.19832
Authors: Ovishake Sen, Raghav Soni, Darpan Virmani, Akshar Parekh, Patrick Lehman, Sarthak Jena, Adithi Katikhaneni, Adam Khalifa, Baibhab Chatterjee
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: 18 pages

Click to view abstract

Abstract:Brain-computer interfaces (BCIs) offer a pathway to restore communication for individuals with severe motor or speech impairments. Imagined handwriting provides an intuitive paradigm for character-level neural decoding, bridging the gap between human intention and digital communication. While invasive approaches such as electrocorticography (ECoG) achieve high accuracy, their surgical risks limit widespread adoption. Non-invasive electroencephalography (EEG) offers safer and more scalable alternatives but suffers from low signal-to-noise ratio and spatial resolution, constraining its decoding precision. This work demonstrates that advanced machine learning combined with informative EEG feature extraction can overcome these barriers, enabling real-time, high-accuracy neural decoding on portable edge devices. A 32-channel EEG dataset was collected from fifteen participants performing imagined handwriting. Signals were preprocessed with bandpass filtering and artifact subspace reconstruction, followed by extraction of 85 time-, frequency-, and graphical-domain features. A hybrid architecture, EEdGeNet, integrates a Temporal Convolutional Network with a multilayer perceptron trained on the extracted features. When deployed on an NVIDIA Jetson TX2, the system achieved 89.83 percent accuracy with 914.18 ms per-character latency. Selecting only ten key features reduced latency by 4.5 times to 202.6 ms with less than 1 percent loss in accuracy. These results establish a pathway for accurate, low-latency, and fully portable non-invasive BCIs supporting real-time communication.
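
Several of the windowed features named above are one-liners; a sketch for a single 250 ms window, assuming a 250 Hz sampling rate and a synthetic signal (the paper's exact rate and preprocessing are not restated here):

```python
import numpy as np
from scipy.signal import welch

fs = 250                                    # assumed sampling rate (Hz)
rng = np.random.default_rng(0)
x = rng.normal(size=fs // 4)                # one 250 ms window of one channel

rms = np.sqrt(np.mean(x**2))                # root mean square
mav = np.mean(np.abs(x))                    # mean absolute value
zc = int(np.sum(np.diff(np.sign(x)) != 0))  # zero-crossing count
f, psd = welch(x, fs=fs, nperseg=len(x))
mask = (f >= 4) & (f <= 12)                 # 4-12 Hz band of interest
band_power = float(np.sum(psd[mask]) * (f[1] - f[0]))
print(rms, mav, zc, band_power)
```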

[LG-96] Neurotremor: A Wearable Supportive Device for Supporting Upper Limb Muscle Function

Link: https://arxiv.org/abs/2510.19826
Authors: Aueaphum Aueawattthanaphisut, Thanyanee Srichaisak, Arissa Ieochai
Subjects: Neurons and Cognition (q-bio.NC); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Tissues and Organs (q-bio.TO)
Comments: 6 pages, 3 figures, 13 equations, 2 tables

Click to view abstract

Abstract:A sensor-fused wearable assistance prototype for upper-limb function (triceps brachii and extensor pollicis brevis) is presented. The device integrates surface electromyography (sEMG), an inertial measurement unit (IMU), and flex/force sensors on an M5StickC plus an ESP32-S3 compute hub. Signals are band-pass and notch filtered; features (RMS, MAV, zero-crossings, and 4-12 Hz tremor-band power) are computed in 250 ms windows and fed to an INT8 TensorFlow Lite Micro model. Control commands are bounded by a control-barrier-function safety envelope and delivered within game-based tasks with lightweight personalization. In a pilot technical feasibility evaluation with healthy volunteers (n = 12) performing three ADL-oriented tasks, tremor prominence decreased (Delta TI = -0.092, 95% CI [-0.102, -0.079]), range of motion increased (+12.65%, 95% CI [+8.43, +13.89]), repetitions rose (+2.99 min^-1, 95% CI [+2.61, +3.35]), and the EMG median-frequency slope became less negative (Delta = +0.100 Hz/min, 95% CI [+0.083, +0.127]). The sensing-to-assist loop ran at 100 Hz with 8.7 ms median on-device latency, 100% session completion, and 0 device-related adverse events. These results demonstrate technical feasibility of embedded, sensor-fused assistance for upper-limb function; formal patient studies under IRB oversight are planned.

[LG-97] Spectral Thresholds in Correlated Spiked Models and Fundamental Limits of Partial Least Squares

Link: https://arxiv.org/abs/2510.17561
Authors: Pierre Mergny, Lenka Zdeborová
Subjects: Statistics Theory (math.ST); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 24 pages, 4 figures

Click to view abstract

Abstract:We provide a rigorous random matrix theory analysis of spiked cross-covariance models where the signals across two high-dimensional data channels are partially aligned. These models are motivated by multi-modal learning and form the standard generative setting underlying Partial Least Squares (PLS), a widely used yet theoretically underdeveloped method. We show that the leading singular values of the sample cross-covariance matrix undergo a Baik–Ben Arous–Péché (BBP)-type phase transition, and we characterize the precise thresholds for the emergence of informative components. Our results yield the first sharp asymptotic description of the signal recovery capabilities of PLS in this setting, revealing a fundamental performance gap between PLS and the Bayes-optimal estimator. In particular, we identify the SNR and correlation regimes where PLS fails to recover any signal, despite detectability being possible in principle. These findings clarify the theoretical limits of PLS and provide guidance for the design of reliable multi-modal inference methods in high dimensions.

Information Retrieval

[IR-0] Generative Reasoning Recommendation via LLMs

Link: https://arxiv.org/abs/2510.20815
Authors: Minjie Hong, Zetong Zhou, Zirun Guo, Ziang Zhang, Ruofan Hu, Weinan Gan, Jieming Zhu, Zhou Zhao
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Despite their remarkable reasoning capabilities across diverse domains, large language models (LLMs) face fundamental challenges in natively functioning as generative reasoning recommendation models (GRRMs), where the intrinsic modeling gap between textual semantics and collaborative filtering signals, combined with the sparsity and stochasticity of user feedback, presents significant obstacles. This work explores how to build GRRMs by adapting pre-trained LLMs, yielding a unified understanding-reasoning-prediction workflow for recommendation tasks. We propose GREAM, an end-to-end framework that integrates three components: (i) Collaborative-Semantic Alignment, which fuses heterogeneous textual evidence to construct semantically consistent, discrete item indices and auxiliary alignment tasks that ground linguistic representations in interaction semantics; (ii) Reasoning Curriculum Activation, which builds a synthetic dataset with explicit Chain-of-Thought supervision and a curriculum that progresses through behavioral evidence extraction, latent preference modeling, intent inference, recommendation formulation, and denoised sequence rewriting; and (iii) Sparse-Regularized Group Policy Optimization (SRPO), which stabilizes post-training via Residual-Sensitive Verifiable Reward and Bonus-Calibrated Group Advantage Estimation, enabling end-to-end optimization under verifiable signals despite sparse successes. GREAM natively supports two complementary inference modes: Direct Sequence Recommendation for high-throughput, low-latency deployment, and Sequential Reasoning Recommendation that first emits an interpretable reasoning chain for causal transparency. Experiments on three datasets demonstrate consistent gains over strong baselines, providing a practical path toward verifiable-RL-driven LLM recommenders.

[IR-1] Rotate Both Ways: Time-and-Order RoPE for Generative Recommendation

Link: https://arxiv.org/abs/2510.20455
Authors: Xiaokai Wei, Jiajun Wu, Daiyao Yi, Reza Shirkavand, Michelle Gong
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Generative recommenders, typically transformer-based autoregressive models, predict the next item or action from a user’s interaction history. Their effectiveness depends on how the model represents where an interaction event occurs in the sequence (discrete index) and when it occurred in wall-clock time. Prevailing approaches inject time via learned embeddings or relative attention biases. In this paper, we argue that RoPE-based approaches, if designed properly, can be a stronger alternative for jointly modeling temporal and sequential information in user behavior sequences. While vanilla RoPE in LLMs considers only token order, generative recommendation requires incorporating both event time and token index. To address this, we propose Time-and-Order RoPE (TO-RoPE), a family of rotary position embedding designs that treat index and time as angle sources shaping the query-key geometry directly. We present three instantiations: early fusion, split-by-dim, and split-by-head. Extensive experiments on both publicly available datasets and a proprietary industrial dataset show that TO-RoPE variants consistently improve accuracy over existing methods for encoding time and index. These results position rotary embeddings as a simple, principled, and deployment-friendly foundation for generative recommendation.
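
A sketch of the early-fusion variant as we understand it from the abstract: each rotary frequency channel receives an angle that mixes token index and event time. The mixing coefficient g and the assumption that timestamps are pre-normalized are ours, for illustration only.

```python
import numpy as np

def to_rope(x, idx, t, base=10000.0, g=0.5):
    """Rotate feature pairs by angles mixing position index and event time.

    x: (T, d) embeddings with d even; idx: (T,) token positions;
    t: (T,) timestamps (assumed pre-normalized to a comparable scale).
    """
    T, d = x.shape
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)   # per-pair frequencies
    ang = (g * idx[:, None] + (1.0 - g) * t[:, None]) * freqs[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin               # 2-D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

x = np.random.default_rng(0).normal(size=(5, 8))
q = to_rope(x, idx=np.arange(5), t=np.array([0.0, 0.1, 0.5, 0.6, 2.0]))
```

Because attention compares rotated queries and keys, relative differences in both index and time then enter the attention scores directly.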

[IR-2] From Generation to Attribution: Music AI Agent Architectures for the Post-Streaming Era NEURIPS2025

Link: https://arxiv.org/abs/2510.20276
Authors: Wonil Kim, Hyeongseok Wi, Seungsoon Park, Taejun Kim, Sangeun Keum, Keunhyoung Kim, Taewan Kim, Jongmin Jung, Taehyoung Kim, Gaetan Guerrero, Mael Le Goff, Julie Po, Dongjoo Moon, Juhan Nam, Jongpil Lee
Subjects: Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); Sound (cs.SD)
Comments: Accepted to the NeurIPS 2025 AI4Music Workshop

Click to view abstract

Abstract:Generative AI is reshaping music creation, but its rapid growth exposes structural gaps in attribution, rights management, and economic models. Unlike past media shifts, from live performance to recordings, downloads, and streaming, AI transforms the entire lifecycle of music, collapsing boundaries between creation, distribution, and monetization. However, existing streaming systems, with opaque and concentrated royalty flows, are ill-equipped to handle the scale and complexity of AI-driven production. We propose a content-based Music AI Agent architecture that embeds attribution directly into the creative workflow through block-level retrieval and agentic orchestration. Designed for iterative, session-based interaction, the system organizes music into granular components (Blocks) stored in BlockDB; each use triggers an Attribution Layer event for transparent provenance and real-time settlement. This framework reframes AI from a generative tool into infrastructure for a Fair AI Media Platform. By enabling fine-grained attribution, equitable compensation, and participatory engagement, it points toward a post-streaming paradigm where music functions not as a static catalog but as a collaborative and adaptive ecosystem.

[IR-3] Balancing Fine-tuning and RAG: A Hybrid Strategy for Dynamic LLM Recommendation Updates RECSYS2025

Link: https://arxiv.org/abs/2510.20260
Authors: Changping Meng, Hongyi Ling, Jianling Wang, Yifan Liu, Shuzhou Zhang, Dapeng Hong, Mingyan Gao, Onkar Dalal, Ed Chi, Lichan Hong, Haokai Lu, Ningren Han
Subjects: Information Retrieval (cs.IR)
Comments: RecSys 2025 Industry Track

Click to view abstract

Abstract:Large Language Models (LLMs) empower recommendation systems through their advanced reasoning and planning capabilities. However, the dynamic nature of user interests and content poses a significant challenge: While initial fine-tuning aligns LLMs with domain knowledge and user preferences, it fails to capture such real-time changes, necessitating robust update mechanisms. This paper investigates strategies for updating LLM-powered recommenders, focusing on the trade-offs between ongoing fine-tuning and Retrieval-Augmented Generation (RAG). Using an LLM-powered user interest exploration system as a case study, we perform a comparative analysis of these methods across dimensions like cost, agility, and knowledge incorporation. We propose a hybrid update strategy that leverages the long-term knowledge adaptation of periodic fine-tuning with the agility of low-cost RAG. We demonstrate through live A/B experiments on a billion-user platform that this hybrid approach yields statistically significant improvements in user satisfaction, offering a practical and cost-effective framework for maintaining high-quality LLM-powered recommender systems.

[IR-4] Rank-GRPO: Training LLM-based Conversational Recommender Systems with Reinforcement Learning

Link: https://arxiv.org/abs/2510.20150
Authors: Yaochen Zhu, Harald Steck, Dawen Liang, Yinhan He, Jundong Li, Nathan Kallus
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) are reshaping the recommender system paradigm by enabling users to express preferences and receive recommendations through conversations. Yet, aligning LLMs to the recommendation task remains challenging: pretrained LLMs often generate out-of-catalog items, violate required output formats, and their ranking quality degrades sharply toward the end of the generated list. To this end, we propose ConvRec-R1, a two-stage framework for end-to-end training of LLM-based conversational recommender systems. In Stage 1, we construct a behavioral-cloning dataset with a Remap-Reflect-Adjust pipeline, which produces high-quality, catalog-grounded demonstrations from powerful blackbox LLMs to warm-start the RL training. In Stage 2, we propose Rank-GRPO, a principled extension of group relative policy optimization (GRPO) tailored to tasks with rank-style outputs. Rank-GRPO treats each rank in the recommendation list as the unit instead of token (too fine-grained) or sequence (too coarse), redefining rewards to remove non-causal credit assignment and introducing a rank-level importance ratio based on the geometric mean of rank-wise token probabilities to stabilize policy updates. Experiments on the public Reddit-v2 dataset show that ConvRec-R1 converges faster and achieves higher Recall and NDCG than GRPO-style baselines. Code and datasets are released at this https URL.
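
The rank-level importance ratio described above has a simple form: the geometric mean of token-wise probability ratios over the tokens that make up one ranked item, i.e., the exponential of the mean log-ratio. A minimal sketch (the exact clipping and aggregation in Rank-GRPO may differ):

```python
import numpy as np

def rank_importance_ratio(logp_new, logp_old):
    """Geometric mean of token-level probability ratios for one ranked item."""
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    return float(np.exp(np.mean(logp_new - logp_old)))  # exp(mean log-ratio)

# Two tokens, each 0.5 nats more likely under the new policy -> e^0.5 ~ 1.65.
print(rank_importance_ratio([-1.0, -2.0], [-1.5, -2.5]))
```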

Attachment Download

Click to download today's complete paper list