This post lists the latest papers retrieved from Arxiv.org on 2025-11-03. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive the daily list by email, please leave your email address in the comments.

Note: the paper data is fetched from Arxiv.org and updated automatically around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Table of Contents

Overview (2025-11-03)

A total of 446 papers are updated today, including:

  • Natural Language Processing: 66 papers (Computation and Language, cs.CL)
  • Artificial Intelligence: 143 papers (Artificial Intelligence, cs.AI)
  • Computer Vision: 92 papers (Computer Vision and Pattern Recognition, cs.CV)
  • Machine Learning: 130 papers (Machine Learning, cs.LG)

Natural Language Processing

[NLP-0] Continuous Autoregressive Language Models

[Quick Read]: This paper targets the efficiency bottleneck of large language models (LLMs) caused by their token-by-token generation process. Its solution is Continuous Autoregressive Language Models (CALM), whose key idea is to replace discrete next-token prediction with continuous next-vector prediction: a high-fidelity autoencoder compresses a chunk of K tokens into a single continuous vector with over 99.9% reconstruction accuracy, cutting the number of generation steps by a factor of K while preserving language modeling performance. This paradigm shift requires a new likelihood-free framework for training, evaluation, and controllable sampling to support stable modeling in the continuous domain.

Link: https://arxiv.org/abs/2510.27688
Authors: Chenze Shao, Darren Li, Fandong Meng, Jie Zhou
Affiliations: WeChat AI, Tencent Inc; Qiuzhen College, Tsinghua University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The efficiency of large language models (LLMs) is fundamentally limited by their sequential, token-by-token generation process. We argue that overcoming this bottleneck requires a new design axis for LLM scaling: increasing the semantic bandwidth of each generative step. To this end, we introduce Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9% accuracy. This allows us to model language as a sequence of continuous vectors instead of discrete tokens, which reduces the number of generative steps by a factor of K. The paradigm shift necessitates a new modeling toolkit; therefore, we develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling in the continuous domain. Experiments show that CALM significantly improves the performance-compute trade-off, achieving the performance of strong discrete baselines at a significantly lower computational cost. More importantly, these findings establish next-vector prediction as a powerful and scalable pathway towards ultra-efficient language models. Code: this https URL. Project: this https URL.
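
To make the next-vector idea concrete, here is a minimal PyTorch sketch of the chunk autoencoding step, assuming toy dimensions and a plain linear encoder/decoder; it illustrates the concept only and is not the authors' architecture.

```python
# Minimal, illustrative sketch of CALM-style chunk autoencoding (not the
# authors' code): compress K token embeddings into one continuous vector and
# reconstruct the K tokens from it. All sizes are hypothetical.
import torch
import torch.nn as nn

K, vocab, d_tok, d_vec = 4, 1000, 64, 128  # chunk size, vocab, embed/latent dims

class ChunkAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_tok)
        self.encoder = nn.Linear(K * d_tok, d_vec)   # K tokens -> 1 vector
        self.decoder = nn.Linear(d_vec, K * vocab)   # 1 vector -> K logit rows

    def forward(self, tokens):                            # tokens: (B, K)
        z = self.encoder(self.embed(tokens).flatten(1))   # (B, d_vec)
        logits = self.decoder(z).view(-1, K, vocab)       # (B, K, vocab)
        return z, logits

model = ChunkAutoencoder()
tokens = torch.randint(0, vocab, (8, K))
z, logits = model(tokens)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), tokens.reshape(-1))
loss.backward()  # train until reconstruction is near-perfect; an autoregressive
                 # model then predicts the next z instead of the next token
print(z.shape, loss.item())
```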

[NLP-1] Culture Cartography: Mapping the Landscape of Cultural Knowledge EMNLP2025

[Quick Read]: This paper addresses the lack of culture-specific knowledge in LLMs serving global users: how to identify knowledge that is salient to in-group users but unknown to the model. Conventional approaches are single-initiative: either researchers define questions that users passively answer (traditional annotation), or users actively produce data that researchers structure into benchmarks (knowledge extraction); both struggle to balance cultural salience and difficulty. The paper proposes a mixed-initiative methodology, CultureCartography, whose key idea is to let an LLM initialize annotation with questions for which it has low-confidence answers, making its knowledge boundaries explicit; human respondents then fill these gaps through direct edits and steer the model toward culturally salient topics. This lets the LLM learn information that better reflects real cultural contexts, improving its performance in cross-cultural settings. Experiments show that fine-tuning Llama-3.1-8B on data collected this way boosts accuracy by up to 19.2% on related culture benchmarks.

Link: https://arxiv.org/abs/2510.27672
Authors: Caleb Ziems, William Held, Jane Yu, Amir Goldberg, David Grusky, Diyi Yang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: EMNLP 2025

Abstract:To serve global users safely and productively, LLMs need culture-specific knowledge that might not be learned during pre-training. How do we find such knowledge that is (1) salient to in-group users, but (2) unknown to LLMs? The most common solutions are single-initiative: either researchers define challenging questions that users passively answer (traditional annotation), or users actively produce data that researchers structure as benchmarks (knowledge extraction). The process would benefit from mixed-initiative collaboration, where users guide the process to meaningfully reflect their cultures, and LLMs steer the process towards more challenging questions that meet the researcher’s goals. We propose a mixed-initiative methodology called CultureCartography. Here, an LLM initializes annotation with questions for which it has low-confidence answers, making explicit both its prior knowledge and the gaps therein. This allows a human respondent to fill these gaps and steer the model towards salient topics through direct edits. We implement this methodology as a tool called CultureExplorer. Compared to a baseline where humans answer LLM-proposed questions, we find that CultureExplorer more effectively produces knowledge that leading models like DeepSeek R1 and GPT-4o are missing, even with web search. Fine-tuning on this data boosts the accuracy of Llama-3.1-8B by up to 19.2% on related culture benchmarks.

[NLP-2] SpecAttn: Speculating Sparse Attention NEURIPS2025

[Quick Read]: This paper tackles the computational bottleneck of LLM inference caused by the quadratic complexity of self-attention, which grows severe as context length increases. The proposed SpecAttn is a training-free sparse attention method whose core idea is to reuse the draft model's attention weights, already computed during speculative decoding, to identify important tokens for the target model, avoiding redundant computation while preserving output quality. SpecAttn relies on three techniques: KL-divergence-based layer alignment between draft and target models, a GPU-optimized sorting-free top-p token selection algorithm, and dynamic key-value cache pruning guided by these predictions. Without changing the existing speculative decoding pipeline, it reduces key-value cache accesses by over 75% with only a 15.29% increase in perplexity on the PG-19 dataset, significantly outperforming existing sparse attention methods.

Link: https://arxiv.org/abs/2510.27641
Authors: Harsh Shah
Affiliations: Carnegie Mellon University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Accepted to NeurIPS 2025 Workshop on Structured Probabilistic Inference & Generative Modeling

Abstract:Large Language Models (LLMs) face significant computational bottlenecks during inference due to the quadratic complexity of self-attention mechanisms, particularly as context lengths increase. We introduce SpecAttn, a novel training-free approach that seamlessly integrates with existing speculative decoding techniques to enable efficient sparse attention in pre-trained transformers. Our key insight is to exploit the attention weights already computed by the draft model during speculative decoding to identify important tokens for the target model, eliminating redundant computation while maintaining output quality. SpecAttn employs three core techniques: KL divergence-based layer alignment between draft and target models, a GPU-optimized sorting-free algorithm for top-p token selection from draft attention patterns, and dynamic key-value cache pruning guided by these predictions. By leveraging the computational work already performed in standard speculative decoding pipelines, SpecAttn achieves over 75% reduction in key-value cache accesses with a mere 15.29% increase in perplexity on the PG-19 dataset, significantly outperforming existing sparse attention methods. Our approach demonstrates that speculative execution can be enhanced to provide approximate verification without significant performance degradation.
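
The KV-pruning step can be illustrated with a short sketch: given the draft model's attention distribution over past tokens, keep the smallest token set whose attention mass reaches p. The sort below is for clarity only; the paper's contribution includes a GPU-optimized, sorting-free selection.

```python
# Illustrative sketch (not the paper's implementation) of using draft-model
# attention to prune the target model's KV cache.
import numpy as np

def topp_keep_indices(attn_weights: np.ndarray, p: float = 0.9) -> np.ndarray:
    """attn_weights: (seq_len,) draft attention over past tokens, sums to 1."""
    order = np.argsort(attn_weights)[::-1]        # heaviest tokens first
    cum = np.cumsum(attn_weights[order])
    cutoff = int(np.searchsorted(cum, p)) + 1     # smallest prefix covering p
    return np.sort(order[:cutoff])                # indices of KV entries to keep

rng = np.random.default_rng(0)
w = rng.dirichlet(np.full(32, 0.3))               # toy attention distribution
keep = topp_keep_indices(w, p=0.9)
print(f"keep {len(keep)}/32 KV entries:", keep)
```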

[NLP-3] Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning

[Quick Read]: This paper studies visual backdoor attacks on embodied agents driven by multimodal large language models (MLLMs): an attacker plants a visual trigger on objects in the environment so that the agent behaves normally until the trigger is detected, after which it persistently executes a preset multi-step malicious policy. The proposed BEAT framework has two key components: (1) a training set spanning diverse scenes, tasks, and trigger placements to cover trigger variability; and (2) a two-stage training scheme of supervised fine-tuning (SFT) followed by Contrastive Trigger Learning (CTL), which casts trigger discrimination as preference learning over trigger-present versus trigger-free inputs, explicitly sharpening decision boundaries for precise backdoor activation. CTL raises backdoor activation accuracy by up to 39% under limited backdoor data while maintaining benign task performance, demonstrating the method's effectiveness and generalization.

Link: https://arxiv.org/abs/2510.27623
Authors: Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen, Liang-Yan Gui, Yu-Xiong Wang, Huan Zhang, Heng Ji, Daniel Kang
Affiliations: University of Illinois Urbana-Champaign
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal large language models (MLLMs) have advanced embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs. However, such vision driven embodied agents open a new attack surface: visual backdoor attacks, where the agent behaves normally until a visual trigger appears in the scene, then persistently executes an attacker-specified multi-step policy. We introduce BEAT, the first framework to inject such visual backdoors into MLLM-based embodied agents using objects in the environments as triggers. Unlike textual triggers, object triggers exhibit wide variation across viewpoints and lighting, making them difficult to implant reliably. BEAT addresses this challenge by (1) constructing a training set that spans diverse scenes, tasks, and trigger placements to expose agents to trigger variability, and (2) introducing a two-stage training scheme that first applies supervised fine-tuning (SFT) and then our novel Contrastive Trigger Learning (CTL). CTL formulates trigger discrimination as preference learning between trigger-present and trigger-free inputs, explicitly sharpening the decision boundaries to ensure precise backdoor activation. Across various embodied agent benchmarks and MLLMs, BEAT achieves attack success rates up to 80%, while maintaining strong benign task performance, and generalizes reliably to out-of-distribution trigger placements. Notably, compared to naive SFT, CTL boosts backdoor activation accuracy up to 39% under limited backdoor data. These findings expose a critical yet unexplored security risk in MLLM-based embodied agents, underscoring the need for robust defenses before real-world deployment.

[NLP-4] Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum

[Quick Read]: This paper targets a structural misalignment in current video retrieval: narrow benchmarks incentivize correspondingly limited datasets and single-task training, suppressing universal capability, while no diagnostic evaluation defines and demands multi-dimensional generalization. The key is a co-design of evaluation, data, and modeling: first, the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets, diagnoses capability gaps across tasks and domains; second, guided by UVRB diagnostics, a scalable synthesis workflow generates 1.55 million high-quality pairs to populate the semantic space required for universality; finally, the Modality Pyramid curriculum trains the General Video Embedder (GVE) by explicitly exploiting the latent interconnections in the diverse data. Experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB, while the analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario.

Link: https://arxiv.org/abs/2510.27571
Authors: Zhuoning Guo, Mingxin Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Xiaowen Chu
Affiliations: Hong Kong University of Science and Technology (Guangzhou); Alibaba Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB’s diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.

[NLP-5] MARAG-R1: Beyond Single Retriever via Reinforcement-Learned Multi-Tool Agentic Retrieval

[Quick Read]: This paper addresses factual inaccuracies and weak adaptability to new information in LLMs caused by static pretraining data, especially in tasks requiring corpus-level reasoning, where existing retrieval-augmented generation (RAG) systems are constrained by a single fixed retriever with top-k selection and cannot access broad, dynamic information. The key is MARAG-R1, a reinforcement-learned multi-tool RAG framework that lets the LLM dynamically coordinate four retrieval tools (semantic search, keyword search, filtering, and aggregation). Through two-stage training (supervised fine-tuning followed by reinforcement learning), the model learns when and how to use these tools, interleaving reasoning and retrieval to progressively gather sufficient evidence for corpus-level synthesis.

Link: https://arxiv.org/abs/2510.27569
Authors: Qi Luo, Xiaonan Li, Yuxin Wang, Tingshuo Fan, Yuan Li, Xinchi Chen, Xipeng Qiu
Affiliations: Fudan University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) excel at reasoning and generation but are inherently limited by static pretraining data, resulting in factual inaccuracies and weak adaptability to new information. Retrieval-Augmented Generation (RAG) addresses this issue by grounding LLMs in external knowledge; However, the effectiveness of RAG critically depends on whether the model can adequately access relevant information. Existing RAG systems rely on a single retriever with fixed top-k selection, restricting access to a narrow and static subset of the corpus. As a result, this single-retriever paradigm has become the primary bottleneck for comprehensive external information acquisition, especially in tasks requiring corpus-level reasoning. To overcome this limitation, we propose MARAG-R1, a reinforcement-learned multi-tool RAG framework that enables LLMs to dynamically coordinate multiple retrieval mechanisms for broader and more precise information access. MARAG-R1 equips the model with four retrieval tools – semantic search, keyword search, filtering, and aggregation – and learns both how and when to use them through a two-stage training process: supervised fine-tuning followed by reinforcement learning. This design allows the model to interleave reasoning and retrieval, progressively gathering sufficient evidence for corpus-level synthesis. Experiments on GlobalQA, HotpotQA, and 2WikiMultiHopQA demonstrate that MARAG-R1 substantially outperforms strong baselines and achieves new state-of-the-art results in corpus-level reasoning tasks.

[NLP-6] SIGMA: Search-Augmented On-Demand Knowledge Integration for Agentic Mathematical Reasoning

[Quick Read]: This paper tackles three problems in current retrieval-augmented models for mathematical reasoning: reliance on a single perspective, inflexible search strategies, and inefficient fusion of multi-source information. The key is the SIGMA framework, which orchestrates multiple specialized agents for on-demand knowledge integration: each agent reasons independently and performs targeted searches, and the findings are synthesized through a moderator mechanism; each agent also generates hypothetical passages to optimize retrieval for its analytic perspective, keeping knowledge integration both context-sensitive and computationally efficient, which significantly improves accuracy and efficiency on complex mathematical reasoning tasks.

Link: https://arxiv.org/abs/2510.27568
Authors: Ali Asgarov, Umid Suleymanov, Aadyant Khatri
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Short Paper - Under Review

Abstract:Solving mathematical reasoning problems requires not only accurate access to relevant knowledge but also careful, multi-step thinking. However, current retrieval-augmented models often rely on a single perspective, follow inflexible search strategies, and struggle to effectively combine information from multiple sources. We introduce SIGMA (Search-Augmented On-Demand Knowledge Integration for AGentic Mathematical reAsoning), a unified framework that orchestrates specialized agents to independently reason, perform targeted searches, and synthesize findings through a moderator mechanism. Each agent generates hypothetical passages to optimize retrieval for its analytic perspective, ensuring knowledge integration is both context-sensitive and computation-efficient. When evaluated on challenging benchmarks such as MATH500, AIME, and PhD-level science QA GPQA, SIGMA consistently outperforms both open- and closed-source systems, achieving an absolute performance improvement of 7.4%. Our results demonstrate that multi-agent, on-demand knowledge integration significantly enhances both reasoning accuracy and efficiency, offering a scalable approach for complex, knowledge-intensive problem-solving. We will release the code upon publication.

[NLP-7] Data-Efficient Domain Adaptation for LLM-based MT using Contrastive Preference Optimization

[Quick Read]: This paper addresses the high data cost of adapting large language models (LLMs) to domain-specific requirements when relying solely on supervised fine-tuning (SFT). The key is applying Contrastive Preference Optimization (CPO) to simulate a post-editing workflow for data-efficient domain adaptation: the base model's own raw output is treated as the 'rejected' sample and the human-approved translation memory (TM) entry as the 'chosen' sample, forming preference pairs. This gives direct feedback on the model's current knowledge and guides it toward domain-specific standards; experiments show that just 14.7k preference pairs achieve performance close to a model trained with SFT on 160k+ samples, a significant gain in data efficiency.

Link: https://arxiv.org/abs/2510.27556
Authors: Inacio Vieira, Antonio Castaldo, James O'Doherty, Sheila Castilho
Affiliations: Alpha CRC; University of Naples "L'Orientale"; University of Pisa; SALIS / ADAPT Centre; Dublin City University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:LLMs often require adaptation to domain-specific requirements, a process that can be expensive when relying solely on SFT. We present an empirical study on applying CPO to simulate a post-editing workflow for data-efficient domain adaptation. Our approach synthesizes preference pairs by treating the base model’s own raw output as the ‘rejected’ translation and the human-approved TM entry as the ‘chosen’ one. This method provides direct feedback on the model’s current knowledge, guiding it to align with domain-specific standards. Experiments in English-Brazilian Portuguese and English-Korean show that, by using just 14.7k preference pairs, the model achieves performance close to that of a model trained on 160k+ samples with SFT, demonstrating significant data efficiency. Although we showcase its effectiveness in MT, this application of CPO naturally generalizes to other generative tasks where a model’s initial drafts can serve as a contrastive signal against a golden reference.
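
The preference-pair construction is simple enough to sketch directly; the translate() helper and prompt wording below are hypothetical stand-ins for the base MT model.

```python
# Minimal sketch of the preference-pair construction described above: the base
# model's raw output is 'rejected', the human-approved TM entry is 'chosen'.
import json

def build_preference_pairs(tm_entries, translate):
    """tm_entries: list of {'source': ..., 'target': ...} from a translation
    memory; translate: callable running the base model on a source segment."""
    pairs = []
    for entry in tm_entries:
        pairs.append({
            "prompt": f"Translate to the target language: {entry['source']}",
            "chosen": entry["target"],                 # human-approved TM translation
            "rejected": translate(entry["source"]),   # base model's raw draft
        })
    return pairs

tm = [{"source": "The valve must be replaced.", "target": "A válvula deve ser substituída."}]
pairs = build_preference_pairs(tm, translate=lambda s: "A valvula precisa trocar.")
print(json.dumps(pairs, ensure_ascii=False, indent=2))
```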

[NLP-8] Multilingual BERT language model for medical tasks: Evaluation on domain-specific adaptation and cross-linguality

[Quick Read]: This paper addresses the performance bottleneck of low-resource languages in medical NLP caused by the lack of domain-specific tools and training data. The key is building medical-domain models via further pre-training on domain-specific corpora, distinguishing the clinical from the general biomedical sub-domain, and verifying cross-lingual transferability. Experiments show that this domain adaptation significantly improves automated patient screening and named entity recognition in Dutch, Romanian, and Spanish clinical notes, with the clinical-domain model outperforming the more general biomedical one, demonstrating the value of domain refinement and cross-lingual transfer for multilingual medical NLP systems.

Link: https://arxiv.org/abs/2510.27552
Authors: Yinghao Luo (1 and 2), Lang Zhou (1 and 2), Amrish Jhingoer (1 and 2), Klaske Vliegenthart Jongbloed (3 and 4), Carlijn Jordans (4), Ben Werkhoven (5), Tom Seinen (6), Erik van Mulligen (6), Casper Rokx (3 and 4), Yunlei Li (1) ((1) Department of Pathology & Clinical Bioinformatics, Erasmus University Medical Center Rotterdam, (2) Department of Computer Science, Vrije Universiteit Amsterdam, (3) Department of Internal Medicine, Erasmus University Medical Center Rotterdam, (4) Department of Medical Microbiology and Infectious Diseases, Erasmus University Medical Center Rotterdam, (5) Department of Data and Analytics, Erasmus University Medical Center Rotterdam, (6) Department of Medical Informatics, Erasmus University Medical Center Rotterdam)
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In multilingual healthcare applications, the availability of domain-specific natural language processing(NLP) tools is limited, especially for low-resource languages. Although multilingual bidirectional encoder representations from transformers (BERT) offers a promising motivation to mitigate the language gap, the medical NLP tasks in low-resource languages are still underexplored. Therefore, this study investigates how further pre-training on domain-specific corpora affects model performance on medical tasks, focusing on three languages: Dutch, Romanian and Spanish. In terms of further pre-training, we conducted four experiments to create medical domain models. Then, these models were fine-tuned on three downstream tasks: Automated patient screening in Dutch clinical notes, named entity recognition in Romanian and Spanish clinical notes. Results show that domain adaptation significantly enhanced task performance. Furthermore, further differentiation of domains, e.g. clinical and general biomedical domains, resulted in diverse performances. The clinical domain-adapted model outperformed the more general biomedical domain-adapted model. Moreover, we observed evidence of cross-lingual transferability. Moreover, we also conducted further investigations to explore potential reasons contributing to these performance differences. These findings highlight the feasibility of domain adaptation and cross-lingual ability in medical NLP. Within the low-resource language settings, these findings can provide meaningful guidance for developing multilingual medical NLP systems to mitigate the lack of training data and thereby improve the model performance.

[NLP-9] DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models

[Quick Read]: This paper addresses the lack of systematic evaluation of LLMs on Dialectal Arabic: despite benchmarks for Modern Standard Arabic (MSA) and multilingual settings, the dialects used in everyday communication remain under-evaluated. The key is DialectalArabicMMLU, a new benchmark built by manually translating and adapting 3,000 multiple-choice QA pairs into five major dialects (Syrian, Egyptian, Emirati, Saudi, and Moroccan), yielding 15,000 QA pairs across 32 academic and professional domains. It provides the first unified, human-curated resource for measuring dialectal understanding in Arabic, supporting both task-based and linguistic analysis, revealing persistent generalization gaps across dialects, and promoting more inclusive evaluation and future model development.

Link: https://arxiv.org/abs/2510.27543
Authors: Malik H. Altakrori, Nizar Habash, Abdelhakim Freihat, Younes Samih, Kirill Chirkunov, Muhammed AbuOdeh, Radu Florian, Teresa Lynn, Preslav Nakov, Alham Fikri Aji
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 9 tables

Abstract:We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. While recently developed Arabic and multilingual benchmarks have advanced LLM evaluation for Modern Standard Arabic (MSA), dialectal varieties remain underrepresented despite their prevalence in everyday communication. DialectalArabicMMLU extends the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects (Syrian, Egyptian, Emirati, Saudi, and Moroccan), yielding a total of 15K QA pairs across 32 academic and professional domains (22K QA pairs when also including English and MSA). The benchmark enables systematic assessment of LLM reasoning and comprehension beyond MSA, supporting both task-based and linguistic analysis. We evaluate 19 open-weight Arabic and multilingual LLMs (1B-13B parameters) and report substantial performance variation across dialects, revealing persistent gaps in dialectal generalization. DialectalArabicMMLU provides the first unified, human-curated resource for measuring dialectal understanding in Arabic, thus promoting more inclusive evaluation and future model development.

[NLP-10] Patient-Centered Summarization Framework for AI Clinical Summarization: A Mixed-Methods Design

[Quick Read]: This paper addresses a pervasive bias in current AI clinical summarization: existing LLM-generated summaries focus on patients' biomedical information while neglecting patient-centered elements such as personal values, preferences, concerns, and psychosocial context. To enable truly patient-centered care, the authors propose a new standard, Patient-Centered Summaries (PCS), and a framework for generating them that ensures both clinical utility and sensitivity to individual patient needs. The key is a mixed-methods pipeline: semi-structured interviews with Patient and Public Involvement (PPI) groups and structured annotation by clinicians distill the core dimensions of patient-centered information into detailed annotation guidelines; five open-source LLMs are then evaluated under zero-shot and few-shot prompting. Results show that while models approach human level in fluency and completeness, they still lag notably behind human gold-standard summaries in correctness and patient-centeredness, exposing the limits of current LLMs in capturing deeper humanistic information.

Link: https://arxiv.org/abs/2510.27535
Authors: Maria Lizarazo Jimenez, Ana Gabriela Claros, Kieran Green, David Toro-Tobon, Felipe Larios, Sheena Asthana, Camila Wenczenovicz, Kerly Guevara Maldonado, Luis Vilatuna-Andrango, Cristina Proano-Velez, Satya Sai Sri Bandi, Shubhangi Bagewadi, Megan E. Branda, Misk Al Zahidy, Saturnino Luz, Mirella Lapata, Juan P. Brito, Oscar J. Ponce-Ponte
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: The first two listed authors contributed equally. Pages: 21; Figures: 2; Tables: 3

Abstract:Large Language Models (LLMs) are increasingly demonstrating the potential to reach human-level performance in generating clinical summaries from patient-clinician conversations. However, these summaries often focus on patients’ biology rather than their preferences, values, wishes, and concerns. To achieve patient-centered care, we propose a new standard for Artificial Intelligence (AI) clinical summarization tasks: Patient-Centered Summaries (PCS). Our objective was to develop a framework to generate PCS that capture patient values and ensure clinical utility and to assess whether current open-source LLMs can achieve human-level performance in this task. We used a mixed-methods process. Two Patient and Public Involvement groups (10 patients and 8 clinicians) in the United Kingdom participated in semi-structured interviews exploring what personal and contextual information should be included in clinical summaries and how it should be structured for clinical use. Findings informed annotation guidelines used by eight clinicians to create gold-standard PCS from 88 atrial fibrillation consultations. Sixteen consultations were used to refine a prompt aligned with the guidelines. Five open-source LLMs (Llama-3.2-3B, Llama-3.1-8B, Mistral-8B, Gemma-3-4B, and Qwen3-8B) generated summaries for 72 consultations using zero-shot and few-shot prompting, evaluated with ROUGE-L, BERTScore, and qualitative metrics. Patients emphasized lifestyle routines, social support, recent stressors, and care values. Clinicians sought concise functional, psychosocial, and emotional context. The best zero-shot performance was achieved by Mistral-8B (ROUGE-L 0.189) and Llama-3.1-8B (BERTScore 0.673); the best few-shot by Llama-3.1-8B (ROUGE-L 0.206, BERTScore 0.683). Completeness and fluency were similar between experts and models, while correctness and patient-centeredness favored human PCS.

[NLP-11] SQLSpace: A Representation Space for Text-to-SQL to Discover and Mitigate Robustness Gaps EMNLP

[Quick Read]: This paper addresses the lack of interpretability and generality in text-to-SQL benchmarks, the difficulty of evaluating model performance in depth, and the unclear paths to model improvement. The key is SQLSpace, a human-interpretable, generalizable, compact representation of text-to-SQL examples derived with minimal human intervention, which supports analysis along several dimensions: comparing the composition of popular benchmarks, understanding model performance at a granular level beyond overall accuracy, and improving performance via targeted query rewriting based on learned correctness estimation.

Link: https://arxiv.org/abs/2510.27532
Authors: Neha Srikanth, Victor Bursztyn, Puneet Mathur, Ani Nenkova
Affiliations: University of Maryland; Adobe Research
Subjects: Computation and Language (cs.CL)
Comments: Accepted to EMNLP Findings

Abstract:We introduce SQLSpace, a human-interpretable, generalizable, compact representation for text-to-SQL examples derived with minimal human intervention. We demonstrate the utility of these representations in evaluation with three use cases: (i) closely comparing and contrasting the composition of popular text-to-SQL benchmarks to identify unique dimensions of examples they evaluate, (ii) understanding model performance at a granular level beyond overall accuracy scores, and (iii) improving model performance through targeted query rewriting based on learned correctness estimation. We show that SQLSpace enables analysis that would be difficult with raw examples alone: it reveals compositional differences between benchmarks, exposes performance patterns obscured by accuracy alone, and supports modeling of query success.

[NLP-12] BiSparse-AAS: Bilinear Sparse Attention and Adaptive Spans Framework for Scalable and Efficient Text Summarization ICDM

[Quick Read]: This paper addresses the scalability bottleneck of Transformer-based summarization on long documents caused by the quadratic complexity of attention. The key is BiSparse-AAS (Bilinear Sparse Attention with Adaptive Spans), a framework combining three mechanisms: sparse attention reduces cost by focusing on the most relevant parts of the input, adaptive spans dynamically adjust the attention ranges for sequences of different lengths, and bilinear attention models complex token interactions within this refined context. The combination improves efficiency and performance on long-text summarization, achieving ROUGE improvements over state-of-the-art baselines on several benchmarks.

Link: https://arxiv.org/abs/2510.27516
Authors: Desta Haileselassie Hagos, Legand L. Burge, Anietie Andy, Anis Yazidi, Vladimir Vlassov
Affiliations: Howard University; University of Oslo; KTH Royal Institute of Technology
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at the IEEE International Conference on Data Mining (ICDM) 2025, Washington, DC, USA

Abstract:Transformer-based architectures have advanced text summarization, yet their quadratic complexity limits scalability on long documents. This paper introduces BiSparse-AAS (Bilinear Sparse Attention with Adaptive Spans), a novel framework that combines sparse attention, adaptive spans, and bilinear attention to address these limitations. Sparse attention reduces computational costs by focusing on the most relevant parts of the input, while adaptive spans dynamically adjust the attention ranges. Bilinear attention complements both by modeling complex token interactions within this refined context. BiSparse-AAS consistently outperforms state-of-the-art baselines in both extractive and abstractive summarization tasks, achieving average ROUGE improvements of about 68.1% on CNN/DailyMail and 52.6% on XSum, while maintaining strong performance on OpenWebText and Gigaword datasets. By addressing efficiency, scalability, and long-sequence modeling, BiSparse-AAS provides a unified, practical solution for real-world text summarization applications.
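
A toy single-head sketch of how the three mechanisms can compose, assuming a fixed span and top-k in place of the paper's learned adaptive spans; the random bilinear form stands in for a trained one.

```python
# Illustrative composition of sparse attention, a span mask, and bilinear
# scores in one toy attention head (all dimensions are assumptions).
import torch
import torch.nn.functional as F

def bisparse_attention(q, k, v, span: int, topk: int):
    """q, k, v: (T, d). Bilinear scores, a local span mask, then top-k sparsity."""
    T, d = q.shape
    W = torch.randn(d, d) / d ** 0.5          # bilinear form: s_ij = q_i W k_j
    scores = q @ W @ k.T                      # (T, T)
    pos = torch.arange(T)
    span_mask = (pos[None, :] - pos[:, None]).abs() <= span  # span (fixed here)
    scores = scores.masked_fill(~span_mask, float("-inf"))
    kth = scores.topk(topk, dim=-1).values[:, -1:]            # keep top-k per row
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(16, 32)
out = bisparse_attention(q, k, v, span=4, topk=3)
print(out.shape)  # torch.Size([16, 32])
```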

[NLP-13] Effect of Domain Generalization Techniques in Low Resource Systems

[Quick Read]: This paper addresses poor generalization under distribution shift in low-resource NLP tasks, especially when training and test data come from different domains. The key is introducing causal mechanisms: a causal data augmentation (CDA) approach generates semantically equivalent counterfactual examples to simulate controlled distribution shifts and mitigate spurious correlations, and an invariant causal representation learning (ICRL) approach based on the DINER framework extracts domain-invariant features, improving robustness on unseen domains and in multilingual sentiment analysis settings.

Link: https://arxiv.org/abs/2510.27512
Authors: Mahi Aminu, Chisom Chibuike, Fatimo Adebanjo, Omokolade Awosanya, Samuel Oyeneye
Affiliations: ML Collective
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Machine learning models typically assume that training and test data follow the same distribution, an assumption that often fails in real-world scenarios due to distribution shifts. This issue is especially pronounced in low-resource settings, where data scarcity and limited domain diversity hinder robust generalization. Domain generalization (DG) approaches address this challenge by learning features that remain invariant across domains, often using causal mechanisms to improve model robustness. In this study, we examine two distinct causal DG techniques in low-resource natural language tasks. First, we investigate a causal data augmentation (CDA) approach that automatically generates counterfactual examples to improve robustness to spurious correlations. We apply this method to sentiment classification on the NaijaSenti Twitter corpus, expanding the training data with semantically equivalent paraphrases to simulate controlled distribution shifts. Second, we explore an invariant causal representation learning (ICRL) approach using the DINER framework, originally proposed for debiasing aspect-based sentiment analysis. We adapt DINER to a multilingual setting. Our findings demonstrate that both approaches enhance robustness to unseen domains: counterfactual data augmentation yields consistent cross-domain accuracy gains in sentiment classification, while causal representation learning with DINER improves out-of-distribution performance in multilingual sentiment analysis, albeit with varying gains across languages.

[NLP-14] Thought Branches: Interpreting LLM Reasoning Requires Resampling

[Quick Read]: This paper argues that current understanding of reasoning models is limited by studying a single chain-of-thought (CoT) sample, which prevents accurate analysis of causal influence and the underlying computation: a single sample cannot reveal the distribution over the many CoTs the model could generate, limiting causal inference about decisions and the evaluation of interventions. The key is to explore the CoT distribution via resampling, enabling reliable causal analysis, clearer narratives of model reasoning, and principled interventions. Concretely, the paper uses resampling to test the causal role of specific sentences, to compare off-policy edits against on-policy selection, to quantify resilience when a reasoning step is removed, and to identify hints that are causally influential without being explicitly mentioned, showing that distribution-based sampling outperforms traditional static edits.

Link: https://arxiv.org/abs/2510.27484
Authors: Uzay Macar, Paul C. Bogdan, Senthooran Rajamanoharan, Neel Nanda
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Uzay Macar and Paul C. Bogdan contributed equally to this work, and their listed order was determined by coinflip

Abstract:Most work interpreting reasoning models studies only a single chain-of-thought (CoT), yet these models define distributions over many possible CoTs. We argue that studying a single sample is inadequate for understanding causal influence and the underlying computation. Though fully specifying this distribution is intractable, it can be understood by sampling. We present case studies using resampling to investigate model decisions. First, when a model states a reason for its action, does that reason actually cause the action? In “agentic misalignment” scenarios, we resample specific sentences to measure their downstream effects. Self-preservation sentences have small causal impact, suggesting they do not meaningfully drive blackmail. Second, are artificial edits to CoT sufficient for steering reasoning? These are common in literature, yet take the model off-policy. Resampling and selecting a completion with the desired property is a principled on-policy alternative. We find off-policy interventions yield small and unstable effects compared to resampling in decision-making tasks. Third, how do we understand the effect of removing a reasoning step when the model may repeat it post-edit? We introduce a resilience metric that repeatedly resamples to prevent similar content from reappearing downstream. Critical planning statements resist removal but have large effects when eliminated. Fourth, since CoT is sometimes “unfaithful”, can our methods teach us anything in these settings? Adapting causal mediation analysis, we find that hints that have a causal effect on the output without being explicitly mentioned exert a subtle and cumulative influence on the CoT that persists even if the hint is removed. Overall, studying distributions via resampling enables reliable causal analysis, clearer narratives of model reasoning, and principled CoT interventions.
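
A conceptual sketch of the resampling methodology, with a hypothetical generate() stand-in for model sampling: resample completions with and without a given sentence and compare the resulting answer distributions.

```python
# Conceptual sketch of sentence-level resampling (not the authors' code).
# `generate` is a hypothetical stand-in for sampling a completion from the
# model given a prompt plus a partial chain-of-thought.
import random
from collections import Counter

def causal_effect(prompt, cot_sentences, i, generate, n=32):
    """Resample completions with and without sentence i and compare the
    distribution of final answers; a large shift = large causal influence."""
    prefix_with = " ".join(cot_sentences[: i + 1])
    prefix_without = " ".join(cot_sentences[:i])  # drop sentence i, resample onward
    with_i = Counter(generate(prompt, prefix_with) for _ in range(n))
    without_i = Counter(generate(prompt, prefix_without) for _ in range(n))
    answers = set(with_i) | set(without_i)
    # total variation distance between the two answer distributions
    return 0.5 * sum(abs(with_i[a] / n - without_i[a] / n) for a in answers)

# Toy stand-in model: the answer depends on whether the "plan" sentence was kept.
fake = lambda prompt, prefix: "A" if ("plan" in prefix and random.random() < 0.9) else "B"
print(causal_effect("q", ["first plan the steps", "then compute"], 0, fake))
```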

[NLP-15] The aftermath of compounds: Investigating Compounds and their Semantic Representations AACL

[Quick Read]: This paper examines how well computational embeddings align with human semantic judgments in processing English compound words, in particular the differences between static word vectors (GloVe) and contextualized embeddings (BERT) in capturing lexeme meaning dominance (LMD) and semantic transparency (ST). The key is to compute embedding-derived LMD and ST metrics grounded in association strength (Edinburgh Associative Thesaurus), frequency (BNC), and predictability (LaDEC), and to assess their relationship with human ratings via Spearman correlation and regression analyses. Results show that BERT embeddings capture the compositional semantics of compounds better than GloVe, and that predictability ratings are strong predictors of semantic transparency, providing empirical grounding and theoretical depth for embedding-based semantic modeling.

Link: https://arxiv.org/abs/2510.27477
Authors: Swarang Joshi
Affiliations: International Institute of Information Technology, Hyderabad, India
Subjects: Computation and Language (cs.CL)
Comments: IJCNLP-AACL SRW 2025

Abstract:This study investigates how well computational embeddings align with human semantic judgments in the processing of English compound words. We compare static word vectors (GloVe) and contextualized embeddings (BERT) against human ratings of lexeme meaning dominance (LMD) and semantic transparency (ST) drawn from a psycholinguistic dataset. Using measures of association strength (Edinburgh Associative Thesaurus), frequency (BNC), and predictability (LaDEC), we compute embedding-derived LMD and ST metrics and assess their relationships with human judgments via Spearman’s correlation and regression analyses. Our results show that BERT embeddings better capture compositional semantics than GloVe, and that predictability ratings are strong predictors of semantic transparency in both human and model data. These findings advance computational psycholinguistics by clarifying the factors that drive compound word processing and offering insights into embedding-based semantic modeling.
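
The correlation analysis pattern is easy to sketch with toy numbers (random vectors standing in for GloVe/BERT embeddings, invented human ratings):

```python
# Small sketch of the analysis style described above: derive a transparency
# score from embeddings via cosine similarity and correlate it with human
# ratings using Spearman's rho. Data below is toy, not the paper's.
import numpy as np
from scipy.stats import spearmanr

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
compounds = ["snowball", "deadline", "bookcase"]
emb = {w: rng.normal(size=50) for w in ["snowball", "snow", "ball",
                                        "deadline", "dead", "line",
                                        "bookcase", "book", "case"]}
parts = {"snowball": ("snow", "ball"), "deadline": ("dead", "line"),
         "bookcase": ("book", "case")}

# embedding-derived ST: mean similarity of the compound to its constituents
st_model = [np.mean([cosine(emb[c], emb[p]) for p in parts[c]]) for c in compounds]
st_human = [6.1, 2.3, 6.5]  # toy human transparency ratings

rho, pval = spearmanr(st_model, st_human)
print(f"Spearman rho={rho:.2f}, p={pval:.2f}")
```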

[NLP-16] Diffuse Thinking: Exploring Diffusion Language Models as Efficient Thought Proposers for Reasoning

[Quick Read]: This paper addresses the computational inefficiency of LLM reasoning under the autoregressive generation paradigm: reasoning performance scales sub-optimally with test-time compute, often requiring heavy overhead for only marginal gains. The key is an efficient collaborative reasoning framework that exploits diffusion language models (DLMs), which can produce diverse candidate intermediate thoughts in parallel through denoising in a single forward pass; LLMs then evaluate the quality of these candidates, substantially reducing the computational burden of proposing thoughts while maintaining high-quality reasoning output.

Link: https://arxiv.org/abs/2510.27469
Authors: Chenyang Shao, Sijian Ren, Fengli Xu, Yong Li
Affiliations: Tsinghua University
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In recent years, large language models (LLMs) have witnessed remarkable advancements, with the test-time scaling law consistently enhancing the reasoning capabilities. Through systematic evaluation and exploration of a diverse spectrum of intermediate thoughts, LLMs demonstrate the potential to generate deliberate reasoning steps, thereby substantially enhancing reasoning accuracy. However, LLMs’ autoregressive generation paradigm results in reasoning performance scaling sub-optimally with test-time computation, often requiring excessive computational overhead to propose thoughts while yielding only marginal performance gains. In contrast, diffusion language models (DLMs) can efficiently produce diverse samples through parallel denoising in a single forward pass, inspiring us to leverage them for proposing intermediate thoughts, thereby alleviating the computational burden associated with autoregressive generation while maintaining quality. In this work, we propose an efficient collaborative reasoning framework, leveraging DLMs to generate candidate thoughts and LLMs to evaluate their quality. Experiments across diverse benchmarks demonstrate that our framework achieves strong performance in complex reasoning tasks, offering a promising direction for future research. Our code is open-source at this https URL.

[NLP-17] VCORE: Variance-Controlled Optimization-based Reweighting for Chain-of-Thought Supervision

[Quick Read]: This paper addresses the misallocated supervision caused by the standard cross-entropy loss in supervised fine-tuning (SFT): treating all tokens in long chain-of-thought (CoT) trajectories equally ignores their heterogeneous contributions to reasoning, leading to mismatched supervision signals and weak generalization. The key is Variance-Controlled Optimization-based REweighting (VCORE), which reformulates CoT supervision as a constrained optimization problem and, from an optimization-theoretic perspective, adaptively reallocates token-level supervision weights so that the training objective aligns more closely with the goal of robust reasoning generalization.

Link: https://arxiv.org/abs/2510.27462
Authors: Xuan Gong, Senmiao Wang, Hanbo Huang, Ruoyu Sun, Shiyu Liang
Affiliations: Shanghai Jiao Tong University; The Chinese University of Hong Kong, Shenzhen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:Supervised fine-tuning (SFT) on long chain-of-thought (CoT) trajectories has emerged as a crucial technique for enhancing the reasoning abilities of large language models (LLMs). However, the standard cross-entropy loss treats all tokens equally, ignoring their heterogeneous contributions across a reasoning trajectory. This uniform treatment leads to misallocated supervision and weak generalization, especially in complex, long-form reasoning tasks. To address this, we introduce Variance-Controlled Optimization-based REweighting (VCORE), a principled framework that reformulates CoT supervision as a constrained optimization problem. By adopting an optimization-theoretic perspective, VCORE enables a principled and adaptive allocation of supervision across tokens, thereby aligning the training objective more closely with the goal of robust reasoning generalization. Empirical evaluations demonstrate that VCORE consistently outperforms existing token reweighting methods. Across both in-domain and out-of-domain settings, VCORE achieves substantial performance gains on mathematical and coding benchmarks, using models from the Qwen3 series (4B, 8B, 32B) and LLaMA-3.1-8B-Instruct. Moreover, we show that VCORE serves as a more effective initialization for subsequent reinforcement learning, establishing a stronger foundation for advancing the reasoning capabilities of LLMs. The code will be released at this https URL.
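
A minimal sketch of token-level reweighting for CoT supervision; the softmax-over-losses weighting below is a simple stand-in, not VCORE's constrained-optimization rule.

```python
# Toy reweighted cross-entropy: hard (high-loss) tokens receive more weight,
# with the weights normalized to preserve the overall loss scale.
import torch
import torch.nn.functional as F

def reweighted_ce(logits, targets, temperature=1.0):
    """logits: (T, V); targets: (T,). Weights sum to T, like uniform CE."""
    per_token = F.cross_entropy(logits, targets, reduction="none")    # (T,)
    weights = torch.softmax(per_token.detach() / temperature, dim=0)  # emphasize hard tokens
    weights = weights * len(per_token)                                # keep overall scale
    return (weights * per_token).mean()

logits = torch.randn(10, 100, requires_grad=True)
targets = torch.randint(0, 100, (10,))
loss = reweighted_ce(logits, targets)
loss.backward()
print(loss.item())
```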

[NLP-18] DeepCompress: A Dual Reward Strategy for Dynamically Exploring and Compressing Reasoning Chains

[Quick Read]: This paper addresses cognitive inefficiency in large reasoning models (LRMs): overthinking simple problems and underthinking complex ones, while existing methods based on supervised fine-tuning (SFT) or token-length-reward reinforcement learning (RL) often trade accuracy for efficiency. The key is DeepCompress, a framework with an adaptive length reward that dynamically classifies problems as "Simple" or "Hard" in real time and adjusts reasoning length accordingly: shorter, more efficient chain-of-thought (CoT) reasoning for simple problems and longer, more exploratory chains for hard ones. This dual-reward strategy lets the model autonomously regulate its reasoning depth, significantly improving token efficiency while maintaining high accuracy.

Link: https://arxiv.org/abs/2510.27419
Authors: Tian Liang, Wenxiang Jiao, Zhiwei He, Jiahao Xu, Haitao Mi, Dong Yu
Affiliations: Tencent AI Lab
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Work in progress

Abstract:Large Reasoning Models (LRMs) have demonstrated impressive capabilities but suffer from cognitive inefficiencies like "overthinking" simple problems and "underthinking" complex ones. While existing methods that use supervised fine-tuning (SFT) or reinforcement learning (RL) with token-length rewards can improve efficiency, they often do so at the cost of accuracy. This paper introduces DeepCompress, a novel framework that simultaneously enhances both the accuracy and efficiency of LRMs. We challenge the prevailing approach of consistently favoring shorter reasoning paths, showing that longer responses can contain a broader range of correct solutions for difficult problems. DeepCompress employs an adaptive length reward mechanism that dynamically classifies problems as "Simple" or "Hard" in real-time based on the model’s evolving capability. It encourages shorter, more efficient reasoning for "Simple" problems while promoting longer, more exploratory thought chains for "Hard" problems. This dual-reward strategy enables the model to autonomously adjust its Chain-of-Thought (CoT) length, compressing reasoning for well-mastered problems and extending it for those it finds challenging. Experimental results on challenging mathematical benchmarks show that DeepCompress consistently outperforms baseline methods, achieving superior accuracy while significantly improving token efficiency.
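
A toy sketch of a dual length reward; the thresholds and the solve-rate rule below are assumptions for illustration (the paper derives difficulty from the model's evolving capability rather than a fixed cutoff).

```python
# Hypothetical adaptive length reward: reward brevity on easy problems and
# exploration on hard ones. All constants are illustrative assumptions.
def adaptive_length_reward(correct: bool, n_tokens: int, solve_rate: float) -> float:
    if not correct:
        return 0.0
    hard = solve_rate < 0.5  # "Hard" if the model currently solves <50% of attempts
    frac = min(n_tokens / 4096, 1.0)          # normalized response length
    if hard:
        return 1.0 + 0.5 * frac               # longer chains earn a bonus
    return 1.0 + 0.5 * (1.0 - frac)           # shorter chains earn a bonus

print(adaptive_length_reward(True, 512, solve_rate=0.9))   # short & easy -> high
print(adaptive_length_reward(True, 3500, solve_rate=0.2))  # long & hard -> high
```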

[NLP-19] Dynamic Affective Memory Management for Personalized LLM Agents

[Quick Read]: This paper addresses memory redundancy, memory staleness, and poor memory-context integration in personalized AI agents during long-term interaction, problems that largely stem from the lack of an effective memory update mechanism. The key is a Bayesian-inspired memory update algorithm built around the notion of memory entropy: the agent autonomously maintains a dynamically updated memory vector database by minimizing global entropy, enabling more personalized service. This lets the agent optimize its memory content during interaction, alleviating memory bloat and improving personalization, logical coherence, and accuracy.

Link: https://arxiv.org/abs/2510.27418
Authors: Junfeng Lu, Yueyan Li
Affiliations: Beijing University of Posts and Telecommunications
Subjects: Computation and Language (cs.CL)
Comments: 12 pages, 8 figures

Abstract:Advances in large language models are making personalized AI agents a new research focus. While current agent systems primarily rely on personalized external memory databases to deliver customized experiences, they face challenges such as memory redundancy, memory staleness, and poor memory-context integration, largely due to the lack of effective memory updates during interaction. To tackle these issues, we propose a new memory management system designed for affective scenarios. Our approach employs a Bayesian-inspired memory update algorithm with the concept of memory entropy, enabling the agent to autonomously maintain a dynamically updated memory vector database by minimizing global entropy to provide more personalized services. To better evaluate the system’s effectiveness in this context, we propose DABench, a benchmark focusing on emotional expression and emotional change toward objects. Experimental results demonstrate that, our system achieves superior performance in personalization, logical coherence, and accuracy. Ablation studies further validate the effectiveness of the Bayesian-inspired update mechanism in alleviating memory bloat. Our work offers new insights into the design of long-term memory systems.

[NLP-20] Atlas-Alignment: Making Interpretability Transferable Across Language Models

[Quick Read]: This paper addresses the high cost and poor scalability of current language model interpretability pipelines: each new model requires expensive training (e.g., model-specific sparse autoencoders, SAEs), manual or semi-automated component labeling, and subsequent validation. The key is Atlas-Alignment, which maps an unknown latent space onto a labeled, human-interpretable Concept Atlas using only shared inputs and lightweight representational alignment techniques, enabling semantic feature search and retrieval as well as steerable generation along human-interpretable concepts in previously opaque models, without additional labeled concept data, and substantially reducing the marginal cost of explainable AI.

Link: https://arxiv.org/abs/2510.27413
Authors: Bruno Puri, Jim Berend, Sebastian Lapuschkin, Wojciech Samek
Affiliations: Fraunhofer Heinrich Hertz Institute; Technische Universität Berlin; Technological University Dublin; BIFOLD - Berlin Institute for the Foundations of Learning and Data
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Interpretability is crucial for building safe, reliable, and controllable language models, yet existing interpretability pipelines remain costly and difficult to scale. Interpreting a new model typically requires costly training of model-specific sparse autoencoders, manual or semi-automated labeling of SAE components, and their subsequent validation. We introduce Atlas-Alignment, a framework for transferring interpretability across language models by aligning unknown latent spaces to a Concept Atlas - a labeled, human-interpretable latent space - using only shared inputs and lightweight representational alignment techniques. Once aligned, this enables two key capabilities in previously opaque models: (1) semantic feature search and retrieval, and (2) steering generation along human-interpretable atlas concepts. Through quantitative and qualitative evaluations, we show that simple representational alignment methods enable robust semantic retrieval and steerable generation without the need for labeled concept data. Atlas-Alignment thus amortizes the cost of explainable AI and mechanistic interpretability: by investing in one high-quality Concept Atlas, we can make many new models transparent and controllable at minimal marginal cost.
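
The "lightweight representational alignment" step can be sketched as a least-squares linear map fitted on shared inputs; the toy data below is an assumption, and the actual Concept Atlas pipeline may differ.

```python
# Minimal sketch of aligning a new model's latent space to a labeled atlas
# space via a linear map fitted on activations from shared inputs.
import numpy as np

rng = np.random.default_rng(0)
n, d_new, d_atlas = 200, 48, 32                   # shared inputs, hidden dims
ground_truth = rng.normal(size=(d_new, d_atlas))  # toy relationship

X_new = rng.normal(size=(n, d_new))               # new model's activations
X_atlas = X_new @ ground_truth                    # atlas activations (toy)

W, *_ = np.linalg.lstsq(X_new, X_atlas, rcond=None)  # least-squares alignment

# once aligned, a labeled atlas direction can be read out of the new model
concept_direction = rng.normal(size=d_atlas)         # e.g., a labeled concept
score = (X_new[0] @ W) @ concept_direction           # concept activation, input 0
print(np.allclose(X_new @ W, X_atlas, atol=1e-8), round(float(score), 3))
```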

[NLP-21] Awal – Community-Powered Language Technology for Tamazight

[Quick Read]: This paper addresses the underrepresentation of Tamazight in digital spaces, in particular its lagging NLP development and persistent data scarcity. The key is Awal, a community-driven collaborative platform that invites native speakers to contribute translation pairs and voice data, compensating for the limits of conventional data collection in a complex sociolinguistic context. Despite a widely positive reception, actual contributions remained concentrated among linguists and activists, suggesting that standard crowdsourcing strategies do not transfer directly to languages with high cultural diversity and ongoing standardization challenges. The team is developing improved open-source machine translation (MT) models from the collected data to support sustainable language technology for Tamazight.

Link: https://arxiv.org/abs/2510.27407
Authors: Alp Öktem, Farida Boudichat
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Accepted to the International Conference on Information and Communication Technologies for Amazigh (TICAM 25)

Abstract:This paper presents Awal, a community-powered initiative for developing language technology resources for Tamazight. We provide a comprehensive review of the NLP landscape for Tamazight, examining recent progress in computational resources, and the emergence of community-driven approaches to address persistent data scarcity. Launched in 2024, this http URL platform addresses the underrepresentation of Tamazight in digital spaces through a collaborative platform enabling speakers to contribute translation and voice data. We analyze 18 months of community engagement, revealing significant barriers to participation including limited confidence in written Tamazight and ongoing standardization challenges. Despite widespread positive reception, actual data contribution remained concentrated among linguists and activists. The modest scale of community contributions – 6,421 translation pairs and 3 hours of speech data – highlights the limitations of applying standard crowdsourcing approaches to languages with complex sociolinguistic contexts. We are working on improved open-source MT models using the collected data.

[NLP-22] Balancing Knowledge Updates: Toward Unified Modular Editing in LLMs

[Quick Read]: This paper addresses the limited effectiveness of knowledge editing in large language models (LLMs): existing methods modify only the weights of multilayer perceptron (MLP) modules and ignore the important role of attention (Attn) modules in storing factual knowledge, leaving residual outdated knowledge, weak generalization, and insufficient knowledge preservation after editing. The key is IntAttn-Edit, motivated by systematic knowledge-localization experiments showing that Attn modules, especially in earlier layers, contribute substantially to factual knowledge storage. It introduces a knowledge balancing strategy that allocates update magnitudes in proportion to each module's measured contribution to knowledge storage, jointly optimizing MLP and Attn parameter updates for more effective, stable, and generalizable edits.

Link: https://arxiv.org/abs/2510.27400
Authors: Jiahao Liu, Zijian Wang, Kuo Zhao, Dong Hu
Affiliations: Jinan University; The University of Sydney
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Knowledge editing has emerged as an efficient approach for updating factual knowledge in large language models (LLMs). It typically locates knowledge storage modules and then modifies their parameters. However, most existing methods focus on the weights of multilayer perceptron (MLP) modules, which are often identified as the main repositories of factual information. Other components, such as attention (Attn) modules, are often ignored during editing. This imbalance can leave residual outdated knowledge and limit editing effectiveness. We perform comprehensive knowledge localization experiments on advanced LLMs and find that Attn modules play a substantial role in factual knowledge storage and retrieval, especially in earlier layers. Based on these insights, we propose IntAttn-Edit, a method that extends the associative memory paradigm to jointly update both MLP and Attn modules. Our approach uses a knowledge balancing strategy that allocates update magnitudes in proportion to each module’s measured contribution to knowledge storage. Experiments on standard benchmarks show that IntAttn-Edit achieves higher edit success, better generalization, and stronger knowledge preservation than prior methods. Further analysis shows that the balancing strategy keeps editing performance within an optimal range across diverse settings.
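
The knowledge-balancing allocation is straightforward to sketch; the contribution scores below are invented for illustration, not the paper's measured values.

```python
# Toy sketch of proportional update allocation: split a total update budget
# across MLP and attention modules according to their measured contribution
# to knowledge storage.
def allocate_update_budget(contributions: dict, total_budget: float) -> dict:
    z = sum(contributions.values())
    return {mod: total_budget * c / z for mod, c in contributions.items()}

# e.g., localization experiments may suggest early-layer Attn stores real knowledge
contrib = {"layer3.attn": 0.30, "layer3.mlp": 0.45, "layer4.mlp": 0.25}
print(allocate_update_budget(contrib, total_budget=1.0))
# -> update magnitudes proportional to each module's contribution
```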

[NLP-23] Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity

[Quick Read]: This paper addresses the monitorability of chain-of-thought (CoT) for model safety: ensuring the CoT faithfully reflects the model's internal reasoning (faithfulness) and fully lists the factors needed to solve the task (verbosity). The core challenge is that existing methods assess faithfulness only by adding a cue and checking whether the model changes its answer, which misses cases where the model keeps its answer while omitting or misrepresenting reasoning. The key is to combine faithfulness and verbosity into a single monitorability score measuring how well the CoT serves as the model's external 'working memory'. Experiments show that models can appear faithful yet remain hard to monitor when they leave out key reasoning factors, and that monitorability differs sharply across model families, providing a new evaluation dimension and improvement direction for CoT-based safety schemes.

Link: https://arxiv.org/abs/2510.27378
Authors: Austin Meek, Eitan Sprejer, Iván Arcuschin, Austin J. Brockmeier, Steven Basart
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project page at this https URL

Abstract:Chain-of-thought (CoT) outputs let us read a model’s step-by-step reasoning. Since any long, serial reasoning process must pass through this textual trace, the quality of the CoT is a direct window into what the model is thinking. This visibility could help us spot unsafe or misaligned behavior (monitorability), but only if the CoT is transparent about its internal reasoning (faithfulness). Fully measuring faithfulness is difficult, so researchers often focus on examining the CoT in cases where the model changes its answer after adding a cue to the input. This proxy finds some instances of unfaithfulness but loses information when the model maintains its answer, and does not investigate aspects of reasoning not tied to the cue. We extend these results to a more holistic sense of monitorability by introducing verbosity: whether the CoT lists every factor needed to solve the task. We combine faithfulness and verbosity into a single monitorability score that shows how well the CoT serves as the model’s external ‘working memory’, a property that many safety schemes based on CoT monitoring depend on. We evaluate instruction-tuned and reasoning models on BBH, GPQA, and MMLU. Our results show that models can appear faithful yet remain hard to monitor when they leave out key factors, and that monitorability differs sharply across model families. We release our evaluation code using the Inspect library to support reproducible future work.
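
One plausible way to combine the two quantities is sketched below with a harmonic mean; the paper defines its own aggregation, so treat this combination rule as an assumption.

```python
# Hypothetical monitorability score: a CoT must be BOTH faithful and complete
# (verbose) to be monitorable; either quantity near zero sinks the score.
def monitorability(faithfulness: float, verbosity: float) -> float:
    """Both inputs in [0, 1]; harmonic mean."""
    if faithfulness + verbosity == 0:
        return 0.0
    return 2 * faithfulness * verbosity / (faithfulness + verbosity)

print(monitorability(0.9, 0.3))  # faithful but omits key factors -> 0.45
print(monitorability(0.8, 0.8))  # -> 0.80
```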

[NLP-24] From the Rock Floor to the Cloud: A Systematic Survey of State-of-the-Art NLP in Battery Life Cycle

[Quick Read]: This paper addresses the fragmentation of NLP applications across the battery life cycle: existing work focuses on a single stage or method and lacks systematic integration. The authors present a comprehensive systematic survey and propose a novel technical language processing (TLP) framework for the EU's proposed digital battery passport (DBP) and other general battery prediction tasks. The key is combining agentic AI with optimized prompts to better parse unstructured battery-related text, supporting stages such as materials discovery while addressing remaining challenges in the field, including the lack of standard benchmarks.

Link: https://arxiv.org/abs/2510.27369
Authors: Tosin Adewumi, Martin Karlsson, Marcus Liwicki, Mikael Sjödahl, Lama Alkhaled, Rihab Gargouri, Nudrat Habib, Franz Hennie
Affiliations: Luleå University of Technology (LTU), Sweden; Lund University (LU), Sweden
Subjects: Computation and Language (cs.CL)
Comments: 15 pages, 3 images

Abstract:We present a comprehensive systematic survey of the application of natural language processing (NLP) along the entire battery life cycle, instead of one stage or method, and introduce a novel technical language processing (TLP) framework for the EU’s proposed digital battery passport (DBP) and other general battery predictions. We follow the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) method and employ three reputable databases or search engines, including Google Scholar, Institute of Electrical and Electronics Engineers Xplore (IEEE Xplore), and Scopus. Consequently, we assessed 274 scientific papers before the critical review of the final 66 relevant papers. We publicly provide artifacts of the review for validation and reproducibility. The findings show that new NLP tasks are emerging in the battery domain, which facilitate materials discovery and other stages of the life cycle. Notwithstanding, challenges remain, such as the lack of standard benchmarks. Our proposed TLP framework, which incorporates agentic AI and optimized prompts, will be apt for tackling some of the challenges.

[NLP-25] ThoughtProbe: Classifier-Guided LLM Thought Space Exploration via Probing Representations EMNLP2025

[Quick Read]: This paper addresses the limited reasoning performance of large language models (LLMs) caused by insufficient path exploration or the prioritization of invalid reasoning chains. The key of ThoughtProbe is to use the hidden reasoning features of LLMs as discriminative signals to guide efficient exploration of a tree-structured response space: at each node expansion, a classifier scores and ranks candidates so that computational resources are allocated to the most promising continuations; finally, a branch aggregation mechanism marginalizes the chain-of-thought (CoT) scores of all supporting branches to identify the optimal answer from the candidate pool. This achieves both broad coverage and precise identification of valid reasoning chains, yielding significant improvements on multiple arithmetic reasoning benchmarks.

Link: https://arxiv.org/abs/2510.27355
Authors: Zijian Wang, Chang Xu
Affiliations: The University of Sydney
Subjects: Computation and Language (cs.CL)
Comments: EMNLP 2025 main conference

Abstract:This paper introduces ThoughtProbe, a novel inference time framework that leverages the hidden reasoning features of Large Language Models (LLMs) to improve their reasoning performance. Unlike previous works that manipulate the hidden representations to steer LLM generation, we harness them as discriminative signals to guide the tree structured response space exploration. In each node expansion, a classifier serves as a scoring and ranking mechanism that efficiently allocates computational resources by prioritizing higher score candidates for continuation. After completing the tree expansion, we collect answers from all branches to form a candidate answer pool. We then propose a branch aggregation method that marginalizes over all supporting branches by aggregating their CoT scores, thereby identifying the optimal answer from the pool. Experimental results show that our framework’s comprehensive exploration not only covers valid reasoning chains but also effectively identifies them, achieving significant improvements across multiple arithmetic reasoning benchmarks.
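
A conceptual sketch of probe-based branch scoring and branch aggregation, with random features standing in for the LLM hidden states the real probe would read.

```python
# Toy sketch: score candidate branches with a probe classifier, then
# marginalize scores per final answer (branch aggregation).
import numpy as np
from collections import defaultdict
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# train a probe on hidden-state features labeled valid/invalid reasoning
X_train = rng.normal(size=(200, 16))
y_train = (X_train[:, 0] > 0).astype(int)   # toy "valid reasoning" signal
probe = LogisticRegression().fit(X_train, y_train)

# branches: (hidden_state_features, final_answer) pairs from tree expansion
branches = [(rng.normal(size=16), ans) for ans in ["42", "42", "17", "42", "17"]]

scores = defaultdict(float)
for feats, answer in branches:
    scores[answer] += probe.predict_proba(feats[None])[0, 1]  # sum over branches

best = max(scores, key=scores.get)
print(dict(scores), "->", best)
```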

[NLP-26] TransAlign: Machine Translation Encoders are Strong Word Aligners Too

[Quick Read]: This paper addresses label projection in cross-lingual transfer (XLT) for token classification: mapping labels from tokens in the source sentence to their counterparts in the translation. Mainstream approaches rely on multilingual word aligners (WAs) derived from encoder models such as mBERT or LaBSE, while prior use of machine translation (MT) models has been limited to exploiting cross-attention in encoder-decoder architectures, yielding poor alignments. The key is TransAlign, a novel word aligner built on the encoder of a massively multilingual MT model, which not only achieves strong alignment performance but substantially outperforms popular WA methods and state-of-the-art non-WA-based label projection methods in MT-based XLT.

Link: https://arxiv.org/abs/2510.27337
Authors: Benedikt Ebing, Christian Goldschmied, Goran Glavaš
Affiliations: University of Würzburg; Center for Artificial Intelligence and Data Science (CAIDAS)
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In the absence of sizable training data for most world languages and NLP tasks, translation-based strategies such as translate-test – evaluating on noisy source language data translated from the target language – and translate-train – training on noisy target language data translated from the source language – have been established as competitive approaches for cross-lingual transfer (XLT). For token classification tasks, these strategies require label projection: mapping the labels from each token in the original sentence to its counterpart(s) in the translation. To this end, it is common to leverage multilingual word aligners (WAs) derived from encoder language models such as mBERT or LaBSE. Despite obvious associations between machine translation (MT) and WA, research on extracting alignments with MT models is largely limited to exploiting cross-attention in encoder-decoder architectures, yielding poor WA results. In this work, in contrast, we propose TransAlign, a novel word aligner that utilizes the encoder of a massively multilingual MT model. We show that TransAlign not only achieves strong WA performance but substantially outperforms popular WA and state-of-the-art non-WA-based label projection methods in MT-based XLT for token classification.

[NLP-27] A Unified Representation Underlying the Judgment of Large Language Models

【速读】: 该论文试图解决的核心问题是:生成式 AI(Generative AI)中的判断机制是依赖于专用模块(modular architecture)还是统一的、跨领域的通用资源(domain-general resource)。针对这一问题,作者通过分析多种大型语言模型(Large Language Models, LLMs)发现,多样化的评价性判断实际上沿一个主导维度进行计算,该维度被命名为“效价-认同轴”(Valence-Assent Axis, VAA),其同时编码主观效价(“什么是好的”)与模型对事实陈述的认同程度(“什么是真实的”)。解决方案的关键在于揭示了VAA作为控制信号的作用:它在生成过程中主导推理路径,使模型优先构建与自身评价状态一致的理由,即使这意味着牺牲事实准确性——这种机制被称为“推理的从属关系”(subordination of reasoning),从而解释了系统性偏见和幻觉现象的内在成因。

链接: https://arxiv.org/abs/2510.27328
作者: Yi-Long Lu,Jiajun Song,Wei Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:A central architectural question for both biological and artificial intelligence is whether judgment relies on specialized modules or a unified, domain-general resource. While the discovery of decodable neural representations for distinct concepts in Large Language Models (LLMs) has suggested a modular architecture, whether these representations are truly independent systems remains an open question. Here we provide evidence for a convergent architecture. Across a range of LLMs, we find that diverse evaluative judgments are computed along a dominant dimension, which we term the Valence-Assent Axis (VAA). This axis jointly encodes subjective valence (“what is good”) and the model’s assent to factual claims (“what is true”). Through direct interventions, we show this unified representation creates a critical dependency: the VAA functions as a control signal that steers the generative process to construct a rationale consistent with its evaluative state, even at the cost of factual accuracy. This mechanism, which we term the subordination of reasoning, shifts the process of reasoning from impartial inference toward goal-directed justification. Our discovery offers a mechanistic account for systemic bias and hallucination, revealing how an architecture that promotes coherent judgment can systematically undermine faithful reasoning.
zh
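论文通过可解码表示与直接干预定位该主导维度;作为概念示意,下面用 PCA 在一组(模拟的)评价性判断激活上提取第一主方向,作为“效价-认同轴”的粗略近似。`acts` 为随机构造数据,仅用于演示,并非论文的实际分析流程:

```python
import numpy as np

def find_dominant_axis(hidden_states):
    """对一组评价性判断的隐藏表示做 PCA,取第一主成分作为候选主导轴。"""
    X = hidden_states - hidden_states.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[0]                                   # 第一主方向

def project(hidden_states, axis):
    """样本在该轴上的坐标:符号与大小可用来观察效价/认同的分布。"""
    return hidden_states @ axis

rng = np.random.default_rng(0)
# 人工构造:在某随机方向上注入 ±2 的双峰结构,模拟"好/坏、真/假"两极
acts = rng.normal(size=(100, 64)) + np.outer(rng.choice([-2, 2], 100),
                                             rng.normal(size=64))
axis = find_dominant_axis(acts)
print(project(acts, axis)[:5])
```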

[NLP-28] Un-Attributability: Computing Novelty From Retrieval Semantic Similarity

【速读】: 该论文旨在解决语言模型输出与预训练语料库之间关系的量化问题,特别是如何识别模型生成内容中具有语义新颖性的部分。传统训练数据归属(Training Data Attribution, TDA)方法关注哪些训练样本因果性地影响了特定输出,而本文提出反向视角:识别无法归属于任何预训练样本的输出,从而定义并度量语义新颖性(semantic novelty)。其解决方案的关键在于引入“不可归属性”(un-attributability)作为操作化指标,并设计了一个两阶段检索流水线——首先使用轻量级GIST嵌入对语料库进行索引,再通过ColBERTv2对Top-N候选进行重排序,若最邻近语料项的可归属性低于人类生成文本参考,则判定该模型输出为新颖。这一方法实现了在预训练规模下的高效新颖性分析,且支持大规模扩展与复现。

链接: https://arxiv.org/abs/2510.27313
作者: Philipp Davydov,Ameya Prabhu,Matthias Bethge,Elisa Nguyen,Seong Joon Oh
机构: Tübingen AI Center, University of Tübingen(图宾根人工智能中心,图宾根大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Understanding how language-model outputs relate to the pretraining corpus is central to studying model behavior. Most training data attribution (TDA) methods ask which training examples causally influence a given output, often using leave-one-out tests. We invert the question: which outputs cannot be attributed to any pretraining example? We introduce un-attributability as an operational measure of semantic novelty: an output is novel if the pretraining corpus contains no semantically similar context. We approximate this with a simple two-stage retrieval pipeline: index the corpus with lightweight GIST embeddings, retrieve the top-n candidates, then rerank with ColBERTv2. If the nearest corpus item is less attributable than a human-generated text reference, we consider the output of the model as novel. We evaluate on SmolLM and SmolLM2 and report three findings: (1) models draw on pretraining data across much longer spans than previously reported; (2) some domains systematically promote or suppress novelty; and (3) instruction tuning not only alters style but also increases novelty. Reframing novelty assessment around un-attributability enables efficient analysis at pretraining scale. We release ~20 TB of corpus chunks and index artifacts to support replication and large-scale extension of our analysis at this https URL
zh
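下面的 Python 片段示意该两阶段检索判定流程:第一阶段用轻量嵌入做余弦粗排(论文中为 GIST 嵌入),第二阶段对 top-n 候选精排(论文中为 ColBERTv2,此处以随机打分 `rerank_score` 占位),最后与人工文本参考分数比较得出“新颖”判定。语料嵌入同样是随机占位数据:

```python
import numpy as np

rng = np.random.default_rng(1)
corpus_emb = rng.normal(size=(1000, 32))        # 占位:预训练语料块的轻量嵌入
corpus_emb /= np.linalg.norm(corpus_emb, axis=1, keepdims=True)

def rerank_score(doc_id):
    """占位的精排打分(论文中为 ColBERTv2);此处用伪随机数模拟。"""
    return float(rng.random())

def unattributable(query_emb, human_ref_score, top_n=50):
    q = query_emb / np.linalg.norm(query_emb)
    coarse = corpus_emb @ q                     # 第一阶段:余弦相似度粗排
    cand = np.argsort(-coarse)[:top_n]          # 取 top-n 候选
    best = max(rerank_score(int(d)) for d in cand)   # 第二阶段:精排取最高分
    # 若最近语料项的可归属性低于人工文本参考,则判定该输出为"新颖"
    return best < human_ref_score

print(unattributable(rng.normal(size=32), human_ref_score=0.9))
```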

[NLP-29] Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?

【速读】: 该论文旨在解决生成式 AI(Generative AI)在多语言推理任务中存在的“多语言推理差距”问题,即模型在高资源语言(如英语)上表现优异,而在低资源语言上性能显著下降。研究表明,这一差距主要源于模型在推理过程中无法准确将多语言输入语义映射到主导语言(即英语),导致理解失败。解决方案的关键在于识别这些理解失败,并采用“选择性翻译”(Selective Translation)策略:仅当检测到理解失败时,才将多语言输入翻译为英语进行推理,从而在保持高精度的同时大幅减少翻译使用比例(约20%的输入)。该方法有效缩小了多语言推理差距,揭示了其根源并提供了可操作的缓解路径。

链接: https://arxiv.org/abs/2510.27269
作者: Deokhyung Kang,Seonjeong Hwang,Daehui Kim,Hyounghun Kim,Gary Geunbae Lee
机构: Graduate School of Artificial Intelligence, POSTECH (韩国科学技术院人工智能研究生院); AI Future Lab, KT (韩国电信人工智能未来实验室); Department of Computer Science and Engineering, POSTECH (韩国科学技术院计算机科学与工程系)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, yet they still suffer from a multilingual reasoning gap, performing better in high-resource languages than in low-resource ones. While recent efforts have reduced this gap, its underlying causes remain largely unexplored. In this paper, we address this by showing that the multilingual reasoning gap largely stems from failures in language understanding-the model’s inability to represent the multilingual input meaning into the dominant language (i.e., English) within its reasoning trace. This motivates us to examine whether understanding failures can be detected, as this ability could help mitigate the multilingual reasoning gap. To this end, we evaluate a range of detection methods and find that understanding failures can indeed be identified, with supervised approaches performing best. Building on this, we propose Selective Translation, a simple yet effective strategy that translates the multilingual input into English only when an understanding failure is detected. Experimental results show that Selective Translation bridges the multilingual reasoning gap, achieving near full-translation performance while using translation for only about 20% of inputs. Together, our work demonstrates that understanding failures are the primary cause of the multilingual reasoning gap and can be detected and selectively mitigated, providing key insight into its origin and a promising path toward more equitable multilingual reasoning. Our code and data are publicly available at this https URL.
zh
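Selective Translation 的控制逻辑本身非常简单,可用几行 Python 概括。下面的 `detector`、`translate`、`answer` 均为假设接口,真实系统中分别对应监督式理解失败检测器、翻译模型与推理模型:

```python
def selective_translation(query, detector, translate, answer, threshold=0.5):
    """仅在检测到理解失败时才将多语言输入翻译为英语再推理。

    detector: 返回"理解失败"概率的可调用对象(假设接口)
    translate/answer: 翻译与推理的桩函数
    """
    if detector(query) > threshold:
        query = translate(query)      # 论文中仅约 20% 的输入会走到这一步
    return answer(query)

# 演示用的占位实现
demo = selective_translation(
    "¿Cuánto es 2+2?",
    detector=lambda q: 0.8 if not q.isascii() else 0.1,
    translate=lambda q: "What is 2+2?",
    answer=lambda q: f"[LLM answer to: {q}]",
)
print(demo)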

[NLP-30] MedCalc-Eval and MedCalc-Env: Advancing Medical Calculation Capabilities of Large Language Models

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在医疗领域应用中对定量推理能力评估不足的问题。现有基准测试如MedCalc-Bench仅涵盖少量计算任务,且难以反映真实临床场景中的复杂计算需求。为此,作者提出了MedCalc-Eval,这是目前规模最大、覆盖最广的用于评估LLMs医学计算能力的基准,包含700余项任务,分为基于公式的计算(如Cockcroft-Gault公式、BMI、体表面积BSA)和基于规则的评分系统(如Apgar评分、格拉斯哥昏迷评分Glasgow Coma Scale),并覆盖内科、外科、儿科和心血管等多个专科。解决方案的关键在于引入MedCalc-Env——一个基于InternBootcamp框架构建的强化学习环境,支持多步骤临床推理与规划,通过在该环境中微调Qwen2.5-32B模型,显著提升了模型在数值敏感性、公式选择准确性和推理鲁棒性方面的表现,实现了当前最优性能。

链接: https://arxiv.org/abs/2510.27267
作者: Kangkun Mao,Jinru Ding,Jiayuan Chen,Mouxiao Bian,Ruiyao Chen,Xinwei Peng,Sijie Ren,Linyang Li,Jie Xu
机构: Shanghai AI Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models (LLMs) enter the medical domain, most benchmarks evaluate them on question answering or descriptive reasoning, overlooking quantitative reasoning critical to clinical decision-making. Existing datasets like MedCalc-Bench cover few calculation tasks and fail to reflect real-world computational scenarios. We introduce MedCalc-Eval, the largest benchmark for assessing LLMs’ medical calculation abilities, comprising 700+ tasks across two types: equation-based (e.g., Cockcroft-Gault, BMI, BSA) and rule-based scoring systems (e.g., Apgar, Glasgow Coma Scale). These tasks span diverse specialties including internal medicine, surgery, pediatrics, and cardiology, offering a broader and more challenging evaluation setting. To improve performance, we further develop MedCalc-Env, a reinforcement learning environment built on the InternBootcamp framework, enabling multi-step clinical reasoning and planning. Fine-tuning a Qwen2.5-32B model within this environment achieves state-of-the-art results on MedCalc-Eval, with notable gains in numerical sensitivity, formula selection, and reasoning robustness. Remaining challenges include unit conversion, multi-condition logic, and contextual understanding. Code and datasets are available at this https URL.
zh

[NLP-31] Higher-order Linear Attention

【速读】: 该论文旨在解决自回归语言模型在处理长上下文时,因缩放点积注意力(scaled dot-product attention)的二次计算复杂度而导致的扩展瓶颈问题。传统线性时间注意力机制和状态空间模型(State Space Models, SSMs)虽提供了可扩展替代方案,但通常受限于一阶或基于核的近似,从而限制了模型的表达能力。解决方案的关键在于提出高阶线性注意力(Higher-order Linear Attention, HLA),其通过紧凑的前缀充分统计量(prefix sufficient statistics)实现更高阶交互,其中二阶情形下仅需维持常数大小的状态,并以线性时间计算每个标记输出,无需显式构建任何 $n \times n$ 矩阵;同时,论文给出了闭式流式恒等式、严格因果掩码变体及基于关联扫描的分块并行训练方案,能精确复现串行递推的激活结果,为高效且具表达力的序列建模提供了一个结构化、可扩展的基础模块。

链接: https://arxiv.org/abs/2510.27258
作者: Yifan Zhang,Zhen Qin,Quanquan Gu
机构: Princeton University (普林斯顿大学); University of California, Los Angeles (加州大学洛杉矶分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Project Page: this https URL

点击查看摘要

Abstract:The quadratic cost of scaled dot-product attention is a central obstacle to scaling autoregressive language models to long contexts. Linear-time attention and State Space Models (SSMs) provide scalable alternatives but are typically restricted to first-order or kernel-based approximations, which can limit expressivity. We introduce Higher-order Linear Attention (HLA), a causal, streaming mechanism that realizes higher interactions via compact prefix sufficient statistics. In the second-order case, HLA maintains a constant-size state and computes per-token outputs in linear time without materializing any n \times n matrices. We give closed-form streaming identities, a strictly causal masked variant using two additional summaries, and a chunk-parallel training scheme based on associative scans that reproduces the activations of a serial recurrence exactly. We further outline extensions to third and higher orders. Collectively, these results position HLA as a principled, scalable building block that combines attention-like, data-dependent mixing with the efficiency of modern recurrent architectures. Project Page: this https URL.
zh
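为说明“前缀充分统计量 + 线性时间”这一思路,下面给出一个概念性的 Python 实现:order=1 退化为普通因果线性注意力,order=2 用度为 2 的外积特征粗略模拟高阶交互。注意这只是示意,并非论文推导的闭式流式恒等式或分块并行方案:

```python
import numpy as np

def causal_linear_attention(Q, K, V, order=2):
    """基于前缀充分统计量的因果线性注意力(朴素示意实现)。"""
    def feat(x):
        # order=2 时用外积展开模拟二阶交互,特征维度从 d 变为 d*d
        return x if order == 1 else np.outer(x, x).ravel()

    d = feat(Q[0]).shape[0]
    S = np.zeros((d, V.shape[1]))   # 前缀统计量 S_t = Σ φ(k_j) v_j^T
    z = np.zeros(d)                 # 归一化项 z_t = Σ φ(k_j)
    out = np.zeros_like(V)
    for t in range(Q.shape[0]):     # 严格因果:逐 token 流式更新,状态大小恒定
        phi_k = np.maximum(feat(K[t]), 0) + 1e-6   # 非负特征,保证分母为正
        S += np.outer(phi_k, V[t])
        z += phi_k
        phi_q = np.maximum(feat(Q[t]), 0) + 1e-6
        out[t] = (phi_q @ S) / (phi_q @ z)
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 4)) for _ in range(3))
print(causal_linear_attention(Q, K, V).shape)   # (8, 4),全程未构造 8×8 注意力矩阵
```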

[NLP-32] Languages are Modalities: Cross-Lingual Alignment via Encoder Injection

【速读】: 该论文旨在解决指令微调的大语言模型(Instruction-tuned Large Language Models, LLMs)在低资源非拉丁文字语言上的性能不足问题,其根源在于分词器碎片化(tokenizer fragmentation)和跨语言耦合弱(weak cross-lingual coupling)。解决方案的关键在于提出一种计算高效的“语言即模态”方法 LLINK(Latent Language Injection for Non-English Knowledge),该方法通过冻结多语言编码器与解码器的潜空间对齐,在保留原分词器不变的前提下,将低资源语言的句子嵌入映射至解码器的潜在空间,并以轻量级对比投影器生成K个软槽位(soft slots),结合极小适配器(minimal adapters)训练,使解码器能有效利用该信号。实验表明,LLINK显著提升了双语检索性能,并在LLM评判的问答评估中相比基线模型提升81.3%,相比直接微调提升63.6%,其改进可归因于减少分词膨胀和增强跨语言对齐能力。

链接: https://arxiv.org/abs/2510.27254
作者: Rajan Agarwal,Aarush Gupta
机构: University of Waterloo (滑铁卢大学); Independent (独立)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 3 Figures

点击查看摘要

Abstract:Instruction-tuned Large Language Models (LLMs) underperform on low resource, non-Latin scripts due to tokenizer fragmentation and weak cross-lingual coupling. We present LLINK (Latent Language Injection for Non-English Knowledge), a compute efficient language-as-modality method that conditions an instruction-tuned decoder without changing the tokenizer or retraining the decoder. First, we align sentence embeddings from a frozen multilingual encoder to the decoder’s latent embedding space at a reserved position via a lightweight contrastive projector. Second, the vector is expanded into K soft slots and trained with minimal adapters so the frozen decoder consumes the signal. LLINK substantially improves bilingual retrieval and achieves 81.3% preference over the base model and 63.6% over direct fine-tuning in LLM-judged QA evaluations. We further find that improvements can be attributed to reduced tokenization inflation and a stronger cross lingual alignment, despite the model having residual weaknesses in numeric fidelity. Treating low resource languages as a modality offers a practical path to stronger cross-lingual alignment in lightweight LLMs.
zh

[NLP-33] Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在长时记忆和长上下文推理任务中评估不足的问题,特别是对话场景下因现有基准测试缺乏叙事连贯性、领域单一以及仅考察简单回忆能力而难以准确衡量模型真实性能。其解决方案的关键在于:首先提出一种自动构建长达10M tokens、语义连贯且主题多样化的对话数据集的框架,并据此构建BEAM基准,包含100段对话和2000个验证过的探针问题;其次设计LIGHT框架,借鉴人类认知机制,为LLMs引入三种互补的记忆系统——长期情景记忆(long-term episodic memory)、短期工作记忆(short-term working memory)和用于积累关键事实的临时计算空间(scratchpad),从而显著提升模型在复杂长程对话中的表现,实验表明LIGHT在不同基线模型上平均提升3.5%–12.69%。

链接: https://arxiv.org/abs/2510.27246
作者: Mohammad Tavakoli,Alireza Salemi,Carrie Ye,Mohamed Abdalla,Hamed Zamani,J Ross Mitchell
机构: University of Alberta (阿尔伯塔大学); University of Massachusetts Amherst (马萨诸塞大学阿默斯特分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coherence, cover narrow domains, and only test simple recall-oriented tasks. This paper introduces a comprehensive solution to these challenges. First, we present a novel framework for automatically generating long (up to 10M tokens), coherent, and topically diverse conversations, accompanied by probing questions targeting a wide range of memory abilities. From this, we construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. Second, to enhance model performance, we propose LIGHT-a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term working memory, and a scratchpad for accumulating salient facts. Our experiments on BEAM reveal that even LLMs with 1M token context windows (with and without retrieval-augmentation) struggle as dialogues lengthen. In contrast, LIGHT consistently improves performance across various models, achieving an average improvement of 3.5%-12.69% over the strongest baselines, depending on the backbone LLM. An ablation study further confirms the contribution of each memory component.
zh

[NLP-34] Identifying the Periodicity of Information in Natural Language

【速读】: 该论文旨在解决自然语言中信息密度是否存在周期性模式的问题,即探究人类语言在编码信息层面是否呈现出可识别的周期性特征。其解决方案的关键在于提出了一种名为“surprisal 自动周期检测”(AutoPeriod of Surprisal, APS)的新方法,该方法基于标准的周期性检测算法,能够从单个文档的信息 surprisal 序列中识别出显著的周期成分,并通过谐波回归建模进一步验证非典型结构单元(如句法边界或话语单元)之外的新周期模式,从而揭示语言信息周期性是由结构化因素与长距离驱动因素共同作用的结果。

链接: https://arxiv.org/abs/2510.27241
作者: Yulin Ou,Yu Wang,Yang Xu,Hendrik Buschmeier
机构: Southern University of Science and Technology (南方科技大学); Bielefeld University (比勒费尔德大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent theoretical advancement of information density in natural language has brought the following question on desk: To what degree does natural language exhibit periodicity pattern in its encoded information? We address this question by introducing a new method called AutoPeriod of Surprisal (APS). APS adopts a canonical periodicity detection algorithm and is able to identify any significant periods that exist in the surprisal sequence of a single document. By applying the algorithm to a set of corpora, we have obtained the following interesting results: Firstly, a considerable proportion of human language demonstrates a strong pattern of periodicity in information; Secondly, new periods that are outside the distributions of typical structural units in text (e.g., sentence boundaries, elementary discourse units, etc.) are found and further confirmed via harmonic regression modeling. We conclude that the periodicity of information in language is a joint outcome from both structured factors and other driving factors that take effect at longer distances. The advantages of our periodicity detection method and its potentials in LLM-generation detection are further discussed.
zh
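下面是“surprisal 序列周期性检测”的最小可运行示意:对去均值后的序列求周期图,并用置换检验估计显著性阈值(AutoPeriod 原算法还包含自相关验证等步骤,此处从略;`demo` 为人工构造的周期 10 信号):

```python
import numpy as np

def significant_periods(surprisal, n_perm=200, alpha=0.01, seed=0):
    """在 surprisal 序列上做周期图分析,并用置换检验确定显著周期(示意)。"""
    rng = np.random.default_rng(seed)
    x = np.asarray(surprisal, dtype=float)
    x = x - x.mean()
    power = np.abs(np.fft.rfft(x)) ** 2          # 周期图
    # 置换检验:打乱序列破坏时间结构,估计纯噪声下的最大功率分布
    null_max = np.array([
        np.abs(np.fft.rfft(rng.permutation(x))).max() ** 2
        for _ in range(n_perm)
    ])
    thresh = np.quantile(null_max, 1 - alpha)
    freqs = np.fft.rfftfreq(len(x))
    sig = np.where(power > thresh)[0]
    return [1.0 / freqs[i] for i in sig if freqs[i] > 0]   # 显著周期(单位:token)

# 演示:周期为 10 的信号加噪声,应当检出约 10.0 的周期
t = np.arange(300)
demo = np.sin(2 * np.pi * t / 10) + 0.3 * np.random.default_rng(1).normal(size=300)
print(significant_periods(demo))
```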

[NLP-35] DRAMA: Unifying Data Retrieval and Analysis for Open-Domain Analytic Queries SIGMOD2026

【速读】: 该论文旨在解决现有数据科学工作流自动化系统在开放域数据收集、结构化数据转换和分析推理三个核心能力上缺乏统一支持的问题。当前方法难以有效应对真实世界中复杂、多步骤的数据分析任务,导致效率低下且泛化能力不足。其解决方案的关键在于提出一个端到端的范式DRAMA(Data Retrieval, Transformation, and Analysis for Multi-agent systems),将数据采集、变换与分析整合为单一流水线,并构建了DRAMA-Bench基准测试集用于量化评估。进一步开发的多智能体系统DRAMA-Bot通过协调子智能体执行数据检索与转换,并基于结构化推理完成分析任务,在公开数据集上实现了86.5%的任务准确率,显著优于五个主流基线模型,且成本仅为基线的1/6。

链接: https://arxiv.org/abs/2510.27238
作者: Chuxuan Hu,Maxwell Yang,James Weiland,Yeji Lim,Suhas Palawala,Daniel Kang
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: Accepted to SIGMOD 2026

点击查看摘要

Abstract:Manually conducting real-world data analyses is labor-intensive and inefficient. Despite numerous attempts to automate data science workflows, none of the existing paradigms or systems fully demonstrate all three key capabilities required to support them effectively: (1) open-domain data collection, (2) structured data transformation, and (3) analytic reasoning. To overcome these limitations, we propose DRAMA, an end-to-end paradigm that answers users’ analytic queries in natural language on large-scale open-domain data. DRAMA unifies data collection, transformation, and analysis as a single pipeline. To quantitatively evaluate system performance on tasks representative of DRAMA, we construct a benchmark, DRAMA-Bench, consisting of two categories of tasks: claim verification and question answering, each comprising 100 instances. These tasks are derived from real-world applications that have gained significant public attention and require the retrieval and analysis of open-domain data. We develop DRAMA-Bot, a multi-agent system designed following DRAMA. It comprises a data retriever that collects and transforms data by coordinating the execution of sub-agents, and a data analyzer that performs structured reasoning over the retrieved data. We evaluate DRAMA-Bot on DRAMA-Bench together with five state-of-the-art baseline agents. DRAMA-Bot achieves 86.5% task accuracy at a cost of $0.05, outperforming all baselines with up to 6.9 times the accuracy and less than 1/6 of the cost. DRAMA is publicly available at this https URL.
zh

[NLP-36] MemeArena: Automating Context-Aware Unbiased Evaluation of Harmfulness Understanding for Multimodal Large Language Models EMNLP2025

【速读】: 该论文旨在解决现有多模态大语言模型(multimodal Large Language Models, mLLMs)在评估其对网络迷因(meme)中多模态有害性理解能力时存在的偏差问题。当前主流评估方法主要依赖二分类任务的检测准确率,难以捕捉不同语境下有害性的复杂解释差异,导致评估结果缺乏深度与公平性。论文提出MemeArena——一种基于代理(agent-based)的竞技场式评估框架,其核心创新在于通过模拟多样化的解释语境来设计评估任务,从而激发mLLMs从特定视角进行分析,并结合多方评价者达成共识,实现对mLLMs在多模态有害性理解上的公平、无偏比较。实验表明,该框架显著降低了判断代理的评估偏差,且评分结果更贴近人类偏好,为可靠、全面的mLLM多模态有害性理解评估提供了新范式。

链接: https://arxiv.org/abs/2510.27196
作者: Zixin Chen,Hongzhan Lin,Kaixin Li,Ziyang Luo,Yayue Deng,Jing Ma
机构: Hong Kong Baptist University (香港浸会大学); Beijing University of Posts and Telecommunications (北京邮电大学); National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP 2025

点击查看摘要

Abstract:The proliferation of memes on social media necessitates the capabilities of multimodal Large Language Models (mLLMs) to effectively understand multimodal harmfulness. Existing evaluation approaches predominantly focus on mLLMs’ detection accuracy for binary classification tasks, which often fail to reflect the in-depth interpretive nuance of harmfulness across diverse contexts. In this paper, we propose MemeArena, an agent-based arena-style evaluation framework that provides a context-aware and unbiased assessment for mLLMs’ understanding of multimodal harmfulness. Specifically, MemeArena simulates diverse interpretive contexts to formulate evaluation tasks that elicit perspective-specific analyses from mLLMs. By integrating varied viewpoints and reaching consensus among evaluators, it enables fair and unbiased comparisons of mLLMs’ abilities to interpret multimodal harmfulness. Extensive experiments demonstrate that our framework effectively reduces the evaluation biases of judge agents, with judgment results closely aligning with human preferences, offering valuable insights into reliable and comprehensive mLLM evaluations in multimodal harmfulness understanding. Our code and data are publicly available at this https URL.
zh

[NLP-37] Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions

【速读】: 该论文旨在解决在动态多主体对话中自动识别欺骗行为的难题,这一问题的关键在于如何有效融合语言与视觉等多模态信息以准确判断陈述的真实性。解决方案的核心是提出了一个新的任务——多模态交互真实性评估(Multimodal Interactive Veracity Assessment, MIVA),并构建了一个基于社交推理游戏“狼人杀”的新型多模态数据集,该数据集包含同步的视频、文本及可验证的真实标签,从而为评估生成式 AI (Generative AI) 在复杂社交场景下的欺骗检测能力提供了基准。

链接: https://arxiv.org/abs/2510.27195
作者: Caixin Kang,Yifei Huang,Liangyang Ouyang,Mingfang Zhang,Yoichi Sato
机构: The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:As AI systems become increasingly integrated into human lives, endowing them with robust social intelligence has emerged as a critical frontier. A key aspect of this intelligence is discerning truth from deception, a ubiquitous element of human interaction that is conveyed through a complex interplay of verbal language and non-verbal visual cues. However, automatic deception detection in dynamic, multi-party conversations remains a significant challenge. The recent rise of powerful Multimodal Large Language Models (MLLMs), with their impressive abilities in visual and textual understanding, makes them natural candidates for this task. Consequently, their capabilities in this crucial domain are mostly unquantified. To address this gap, we introduce a new task, Multimodal Interactive Veracity Assessment (MIVA), and present a novel multimodal dataset derived from the social deduction game Werewolf. This dataset provides synchronized video, text, with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating state-of-the-art MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to ground language in visual social cues effectively and may be overly conservative in their alignment, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems.
zh

[NLP-38] Simple Additions, Substantial Gains: Expanding Scripts, Languages and Lineage Coverage in URIEL

【速读】: 该论文旨在解决URIEL+语言知识库中存在的数据稀疏问题,包括缺失特征类型、语言条目不完整以及谱系覆盖有限等问题,这些问题限制了其在跨语言迁移学习中的应用,尤其是在低资源语言支持方面。解决方案的关键在于三项核心改进:引入脚本向量(script vectors)以表征7,488种语言的书写系统属性;整合Glottolog数据库新增18,710种语言,显著提升语言覆盖范围;并通过谱系传播机制对26,449种语言进行谱系推断(lineage imputation),将形态学和脚本特征跨谱系传递,从而减少脚本向量的特征稀疏性14%、语言覆盖率提升至原规模的10倍(增加19,015种语言)、并提高推断质量达33%。实证结果表明,在面向低资源语言的跨语言迁移任务中,改进后的URIEL+表现出性能增益最高达6%,验证了其在多语言研究中的完备性和包容性增强。

链接: https://arxiv.org/abs/2510.27183
作者: Mason Shipton,York Hay Ng,Aditya Khan,Phuong Hanh Hoang,Xiang Lu,A. Seza Doğruöz,En-Shiun Annie Lee
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:The URIEL+ linguistic knowledge base supports multilingual research by encoding languages through geographic, genetic, and typological vectors. However, data sparsity remains prevalent, in the form of missing feature types, incomplete language entries, and limited genealogical coverage. This limits the usefulness of URIEL+ in cross-lingual transfer, particularly for supporting low-resource languages. To address this sparsity, this paper extends URIEL+ with three contributions: introducing script vectors to represent writing system properties for 7,488 languages, integrating Glottolog to add 18,710 additional languages, and expanding lineage imputation for 26,449 languages by propagating typological and script features across genealogies. These additions reduce feature sparsity by 14% for script vectors, increase language coverage by up to 19,015 languages (1,007%), and improve imputation quality metrics by up to 33%. Our benchmark on cross-lingual transfer tasks (oriented around low-resource languages) shows occasionally divergent performance compared to URIEL+, with performance gains up to 6% in certain setups. Our advances make URIEL+ more complete and inclusive for multilingual research.
zh

[NLP-39] Glia: A Human-Inspired AI for Automated Systems Design and Optimization

【速读】: 该论文旨在解决如何让人工智能(AI)在计算机系统设计中实现与人类专家相当的创造力和推理能力的问题。传统机器学习方法通常优化黑箱策略,缺乏可解释性且难以理解其设计逻辑。本文提出的解决方案是Glia架构,其关键在于采用类人多智能体协作工作流:每个智能体分别专注于推理、实验和分析,并通过一个基于实证反馈的评估框架进行协同,从而将抽象推理与实际性能数据相结合。这种结构化的方法使AI不仅能够生成可解释的设计方案,还能在分布式GPU集群的请求路由、调度和自动扩展等任务中达到人类专家水平的表现,同时揭示新的工作负载行为规律。

链接: https://arxiv.org/abs/2510.27176
作者: Pouya Hamadanian,Pantea Karimi,Arash Nasr-Esfahany,Kimia Noorbakhsh,Joseph Chandler,Ali ParandehGheibi,Mohammad Alizadeh,Hari Balakrishnan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Can an AI autonomously design mechanisms for computer systems on par with the creativity and reasoning of human experts? We present Glia, an AI architecture for networked systems design that uses large language models (LLMs) in a human-inspired, multi-agent workflow. Each agent specializes in reasoning, experimentation, and analysis, collaborating through an evaluation framework that grounds abstract reasoning in empirical feedback. Unlike prior ML-for-systems methods that optimize black-box policies, Glia generates interpretable designs and exposes its reasoning process. When applied to a distributed GPU cluster for LLM inference, it produces new algorithms for request routing, scheduling, and auto-scaling that perform at human-expert levels in significantly less time, while yielding novel insights into workload behavior. Our results suggest that by combining reasoning LLMs with structured experimentation, an AI can produce creative and understandable designs for complex systems problems.
zh

[NLP-40] Probability Distributions Computed by Hard-Attention Transformers

【速读】: 该论文旨在解决当前对Transformer模型表达能力研究中存在的偏差问题,即现有大多数结果将Transformer视为语言识别器(language recognizers),而忽略了其在实际应用中作为语言模型(language models)进行自回归概率生成的特性。论文的核心贡献在于系统性地刻画了Transformer语言模型所能表达的概率分布类型,并指出:将语言识别器转化为自回归结构有时可提升其表达能力,而引入概率机制则可能破坏非概率情形下存在的等价关系。解决方案的关键在于从函数表达的角度重新分析Transformer在典型语言建模场景下的能力边界,从而更准确地理解其理论潜力与实践差异。

链接: https://arxiv.org/abs/2510.27118
作者: Andy Yang,Anej Svete,Jiaoda Li,Anthony Widjaja Lin,Jonathan Rawski,Ryan Cotterell,David Chiang
机构: University of Notre Dame (圣母大学); ETH Zürich (苏黎世联邦理工学院); Max-Planck Institute for Software Systems (马克斯·普朗克软件系统研究所); University of Kaiserslautern-Landau (凯撒斯劳滕-兰道大学); San José State University (圣何塞州立大学)
类目: Computation and Language (cs.CL)
备注: 18 pages

点击查看摘要

Abstract:Most expressivity results for transformers treat them as language recognizers (which accept or reject strings), and not as they are used in practice, as language models (which generate strings autoregressively and probabilistically). Here, we characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing, in their most common use-case as language models.
zh

[NLP-41] Rating Roulette: Self-Inconsistency in LLM-As-A-Judge Frameworks

【速读】: 该论文试图解决当前自然语言生成(Natural Language Generation, NLG)评估中依赖大语言模型(Large Language Models, LLMs)作为评判者时存在的评分一致性差的问题。研究表明,LLM judges在不同运行中表现出低组内评分者信度(intra-rater reliability),导致其评分结果不稳定甚至近乎随机,从而难以准确衡量其判断质量。解决方案的关键在于通过制定严谨的评估指南(proper guidelines)来规范LLM judge的使用方式,以降低评分波动性并提升其在不同NLG任务和基准测试中的实用性。

链接: https://arxiv.org/abs/2510.27106
作者: Rajarshi Haldar,Julia Hockenmaier
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注: Accepted at EMNLP 2025

点击查看摘要

Abstract:As Natural Language Generation (NLG) continues to be widely adopted, properly assessing it has become quite difficult. Lately, using large language models (LLMs) for evaluating these generations has gained traction, as they tend to align more closely with human preferences than conventional n-gram or embedding-based metrics. In our experiments, we show that LLM judges have low intra-rater reliability in their assigned scores across different runs. This variance makes their ratings inconsistent, almost arbitrary in the worst case, making it difficult to measure how good their judgments actually are. We quantify this inconsistency across different NLG tasks and benchmarks and see if judicious use of LLM judges can still be useful following proper guidelines.
zh

[NLP-42] Characterizing Selective Refusal Bias in Large Language Models

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)安全护栏中存在的选择性拒绝偏差(selective refusal bias)问题,即模型在拒绝生成有害内容时对不同性别、性取向、国籍和宗教等群体存在非均衡行为。解决方案的关键在于系统性评估安全护栏对个体及交叉身份群体的拒绝率差异、响应类型以及拒绝文本长度,并通过间接攻击实验揭示此类偏差可能引发的新型安全风险,从而推动构建更具公平性和鲁棒性的模型安全机制。

链接: https://arxiv.org/abs/2510.27087
作者: Adel Khorramrouz,Sharon Levy
机构: Rutgers University (罗格斯大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 21 pages, 12 figures, 14 tables

点击查看摘要

Abstract:Safety guardrails in large language models(LLMs) are developed to prevent malicious users from generating toxic content at a large scale. However, these measures can inadvertently introduce or reflect new biases, as LLMs may refuse to generate harmful content targeting some demographic groups and not others. We explore this selective refusal bias in LLM guardrails through the lens of refusal rates of targeted individual and intersectional demographic groups, types of LLM responses, and length of generated refusals. Our results show evidence of selective refusal bias across gender, sexual orientation, nationality, and religion attributes. This leads us to investigate additional safety implications via an indirect attack, where we target previously refused groups. Our findings emphasize the need for more equitable and robust performance in safety guardrails across demographic groups.
zh

[NLP-43] Contrastive Knowledge Transfer and Robust Optimization for Secure Alignment of Large Language Models

【速读】: 该论文旨在解决大规模语言模型在安全对齐(safety alignment)和鲁棒性(robustness)方面的局限性。其核心解决方案是提出一种结合对比蒸馏(contrastive distillation)与噪声鲁棒训练(noise-robust training)的微调方法:通过冻结骨干模型,利用蒸馏将教师模型的知识边界传递给学生模型,提升语义一致性和对齐精度;同时在训练中引入噪声扰动和鲁棒优化约束,确保模型在噪声和不确定输入下仍能保持稳定的预测输出。该方法构建了由蒸馏损失、鲁棒性损失和正则项组成的统一优化目标,有效平衡了对齐能力与抗干扰性能,显著优于现有基线,在多个关键指标上实现了最优表现。

链接: https://arxiv.org/abs/2510.27077
作者: Jiasen Zheng,Huajun Zhang,Xu Yan,Ran Hao,Chong Peng
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper addresses the limitations of large-scale language models in safety alignment and robustness by proposing a fine-tuning method that combines contrastive distillation with noise-robust training. The method freezes the backbone model and transfers the knowledge boundaries of the teacher model to the student model through distillation, thereby improving semantic consistency and alignment accuracy. At the same time, noise perturbations and robust optimization constraints are introduced during training to ensure that the model maintains stable predictive outputs under noisy and uncertain inputs. The overall framework consists of distillation loss, robustness loss, and a regularization term, forming a unified optimization objective that balances alignment ability with resistance to interference. To systematically validate its effectiveness, the study designs experiments from multiple perspectives, including distillation weight sensitivity, stability analysis under computation budgets and mixed-precision environments, and the impact of data noise and distribution shifts on model performance. Results show that the method significantly outperforms existing baselines in knowledge transfer, robustness, and overall safety, achieving the best performance across several key metrics. This work not only enriches the theoretical system of parameter-efficient fine-tuning but also provides a new solution for building safer and more trustworthy alignment mechanisms.
zh
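论文的统一优化目标可写成 L = L_distill + λ·L_robust + μ·L_reg。下面用 PyTorch 给出一个示意性的损失函数:温度与各项系数均为假设值,蒸馏与鲁棒项的具体形式以原文为准:

```python
import torch
import torch.nn.functional as F

def alignment_loss(student_logits, teacher_logits, noisy_student_logits,
                   params, tau=2.0, lam=1.0, mu=1e-4):
    """蒸馏损失 + 鲁棒性损失 + 正则项的统一目标(示意性实现)。"""
    # 1) 蒸馏:KL(teacher || student),温度 tau
    l_distill = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau ** 2
    # 2) 鲁棒性:加噪输入与干净输入的预测分布保持一致
    l_robust = F.kl_div(
        F.log_softmax(noisy_student_logits, dim=-1),
        F.softmax(student_logits.detach(), dim=-1),
        reduction="batchmean",
    )
    # 3) L2 正则
    l_reg = sum((p ** 2).sum() for p in params)
    return l_distill + lam * l_robust + mu * l_reg

# 演示:用随机 logits 与噪声扰动版本计算一次损失
s = torch.randn(4, 10, requires_grad=True)
t_ = torch.randn(4, 10)
n = s + 0.1 * torch.randn(4, 10)
print(alignment_loss(s, t_, n, params=[s]).item())
```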

[NLP-44] Towards a Measure of Algorithm Similarity

【速读】: 该论文旨在解决如何在给定两个算法实现时,判断它们是否具有实质差异的问题。这一问题在理论上是不可判定的(uncomputable),且实践中常因相似性定义不一致而难以量化。针对此挑战,作者提出EMOC框架——即Evaluation-Memory-Operations-Complexity框架,其核心在于将算法实现嵌入到一个适用于下游任务的特征空间中,从而支持对算法类型进行聚类与分类、近似重复检测以及大语言模型(LLM)生成程序多样性的量化分析。关键创新在于构建了可计算、可解释且具备实用性的算法相似性度量体系,并通过PACD数据集验证其有效性。

链接: https://arxiv.org/abs/2510.27063
作者: Shairoz Sohail,Taher Ali
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT); Software Engineering (cs.SE)
备注: 11 pages, many figures and images

点击查看摘要

Abstract:Given two algorithms for the same problem, can we determine whether they are meaningfully different? In full generality, the question is uncomputable, and empirically it is muddied by competing notions of similarity. Yet, in many applications (such as clone detection or program synthesis) a pragmatic and consistent similarity metric is necessary. We review existing equivalence and similarity notions and introduce EMOC: An Evaluation-Memory-Operations-Complexity framework that embeds algorithm implementations into a feature space suitable for downstream tasks. We compile PACD, a curated dataset of verified Python implementations across three problems, and show that EMOC features support clustering and classification of algorithm types, detection of near-duplicates, and quantification of diversity in LLM-generated programs. Code, data, and utilities for computing EMOC embeddings are released to facilitate reproducibility and future work on algorithm similarity.
zh

[NLP-45] Detecting Data Contamination in LLMs via In-Context Learning

【速读】: 该论文旨在解决大规模语言模型中训练数据污染(training data contamination)的检测与量化问题,即识别模型是否在训练过程中“记忆”了特定数据集,从而影响其泛化能力与可信度。解决方案的关键在于提出一种名为CoDeC(Contamination Detection via Context)的方法,其核心思想是利用上下文学习(in-context learning)对模型性能的影响差异来判断数据是否来自训练分布:若某数据集未被训练过,模型在该数据上表现更自信;而若数据属于训练集,则因记忆模式被干扰导致置信度下降。该方法无需依赖模型内部结构或训练数据细节,具备自动化、模型无关和数据集无关的特性,可直接嵌入基准评估流程中。

链接: https://arxiv.org/abs/2510.27055
作者: Michał Zawalski,Meriem Boubdir,Klaudia Bałazy,Besmira Nushi,Pablo Ribalta
机构: NVIDIA(英伟达)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present Contamination Detection via Context (CoDeC), a practical and accurate method to detect and quantify training data contamination in large language models. CoDeC distinguishes between data memorized during training and data outside the training distribution by measuring how in-context learning affects model performance. We find that in-context examples typically boost confidence for unseen datasets but may reduce it when the dataset was part of training, due to disrupted memorization patterns. Experiments show that CoDeC produces interpretable contamination scores that clearly separate seen and unseen datasets, and reveals strong evidence of memorization in open-weight models with undisclosed training corpora. The method is simple, automated, and both model- and dataset-agnostic, making it easy to integrate with benchmark evaluations.
zh
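CoDeC 的核心度量可以压缩成几行代码:比较“无示例”与“加入同数据集示例”两种条件下,模型对目标文本的平均负对数似然之差。下面的 `nll` 为假设接口(真实场景中由语言模型计算),演示用的 `fake_nll` 仅作占位:

```python
def codec_score(nll, examples, query, answer):
    """比较有/无 in-context 示例时模型对目标文本的置信度变化(示意)。

    nll(prompt, target): 返回 target 在给定 prompt 下的平均负对数似然。
    返回值 > 0 表示示例降低了 nll(更像未见数据);
    返回值 <= 0 提示示例干扰了记忆模式,可能存在污染。
    """
    base = nll(query, answer)                      # 零样本置信度
    ctx = "\n\n".join(examples) + "\n\n" + query
    with_icl = nll(ctx, answer)                    # 加入同数据集示例后的置信度
    return base - with_icl

# 演示用的占位 nll:prompt 越长 nll 越低,仅为跑通流程
fake_nll = lambda prompt, target: 3.0 - 0.01 * min(len(prompt), 100)
print(codec_score(fake_nll, ["Q: ... A: ..."] * 4, "Q: 2+2?", "4"))
```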

[NLP-46] LLM-Centric RAG with Multi-Granular Indexing and Confidence Constraints

【速读】: 该论文旨在解决复杂知识环境下检索增强生成(Retrieval-Augmented Generation, RAG)中存在的覆盖不足、结果不稳定以及可靠性有限的问题。其解决方案的关键在于提出一种融合多粒度记忆索引与不确定性估计的置信度控制方法:首先构建分层记忆结构,将知识表示按不同粒度划分,实现从局部细节到全局上下文的动态索引与检索,从而强化检索与生成之间的语义关联;其次引入不确定性估计机制,在生成过程中显式约束并过滤低置信度路径,以在保持信息覆盖的同时有效抑制噪声和虚假内容。该方法通过生成损失、熵约束与方差正则化共同优化,形成统一的置信度控制框架,在多项指标上显著优于现有模型,提升了RAG系统在复杂场景下的稳定性与可靠性。

链接: https://arxiv.org/abs/2510.27054
作者: Xiaofan Guo,Yaxuan Luan,Yue Kang,Xiangchen Song,Jinxu Guo
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This paper addresses the issues of insufficient coverage, unstable results, and limited reliability in retrieval-augmented generation under complex knowledge environments, and proposes a confidence control method that integrates multi-granularity memory indexing with uncertainty estimation. The method builds a hierarchical memory structure that divides knowledge representations into different levels of granularity, enabling dynamic indexing and retrieval from local details to global context, and thus establishing closer semantic connections between retrieval and generation. On this basis, an uncertainty estimation mechanism is introduced to explicitly constrain and filter low-confidence paths during the generation process, allowing the model to maintain information coverage while effectively suppressing noise and false content. The overall optimization objective consists of generation loss, entropy constraints, and variance regularization, forming a unified confidence control framework. In the experiments, comprehensive sensitivity tests and comparative analyses were designed, covering hyperparameters, environmental conditions, and data structures, to verify the stability and robustness of the proposed method across different scenarios. The results show that the method achieves superior performance over existing models in QA accuracy, retrieval recall, ranking quality, and factual consistency, demonstrating the effectiveness of combining multi-granularity indexing with confidence control. This study not only provides a new technical pathway for retrieval-augmented generation but also offers practical evidence for improving the reliability and controllability of large models in complex contexts.
zh

[NLP-47] VISTA Score: Verification In Sequential Turn-based Assessment

【速读】: 该论文旨在解决对话式人工智能系统中幻觉(hallucination)问题,即模型生成与现有证据或对话上下文不一致或相矛盾的陈述,这严重限制了其在需要事实可靠性的场景中的部署。解决方案的关键在于提出VISTA(Verification In Sequential Turn-based Assessment)框架,该框架通过逐轮分解助手回复为原子级事实性声明(factual claims),并结合可信来源和对话历史进行逐条验证,从而实现对对话事实性的精细化评估;同时,VISTA能够识别未验证陈述的类型(主观判断、自相矛盾、缺乏证据或选择不回答),并通过追踪多轮对话中的一致性来建模事实性的动态特性,显著优于现有基于孤立响应或将不可验证内容简单视为错误的指标。

链接: https://arxiv.org/abs/2510.27052
作者: Ashley Lewis,Andrew Perrault,Eric Fosler-Lussier,Michael White
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Hallucination–defined here as generating statements unsupported or contradicted by available evidence or conversational context–remains a major obstacle to deploying conversational AI systems in settings that demand factual reliability. Existing metrics either evaluate isolated responses or treat unverifiable content as errors, limiting their use for multi-turn dialogue. We introduce VISTA (Verification In Sequential Turn-based Assessment), a framework for evaluating conversational factuality through claim-level verification and sequential consistency tracking. VISTA decomposes each assistant turn into atomic factual claims, verifies them against trusted sources and dialogue history, and categorizes unverifiable statements (subjective, contradicted, lacking evidence, or abstaining). Across eight large language models and four dialogue factuality benchmarks (AIS, BEGIN, FAITHDIAL, and FADE), VISTA substantially improves hallucination detection over FACTSCORE and LLM-as-Judge baselines. Human evaluation confirms that VISTA’s decomposition improves annotator agreement and reveals inconsistencies in existing benchmarks. By modeling factuality as a dynamic property of conversation, VISTA offers a more transparent, human-aligned measure of truthfulness in dialogue systems.
zh

[NLP-48] Recursive numeral systems are highly regular and easy to process

【速读】: 该论文试图解决的问题是:为何仅自然语言中的递归数词系统(recursive numeral systems)能优化词库规模与平均形态句法复杂度之间的权衡,而其他理论上可能的系统却未被人类语言所采用。此前的研究依赖人为设定的约束条件来排除非自然系统,但缺乏理论基础。本文的关键解决方案在于引入**最小描述长度(Minimum Description Length, MDL)**框架,将递归数词系统的优化标准从单一的词库与复杂度权衡扩展至对**规律性(regularity)**与**处理复杂度(processing complexity)**的综合考量。研究表明,基于MDL的规律性与处理复杂度指标能够更准确地区分已知自然语言系统与未被使用的“最优”系统,并且先前文献中的人为约束条件可自然地由规律性推导得出。这一方法强调了在语言最优性研究中纳入形式集合整体规律性的必要性。

链接: https://arxiv.org/abs/2510.27049
作者: Ponrawee Prasertsom,Andrea Silvi,Jennifer Culbertson,Moa Johansson,Devdatt Dubhashi,Kenny Smith
机构: University of Edinburgh (爱丁堡大学); Chalmers University of Technology and Gothenburg University (查尔姆斯理工大学和哥德堡大学)
类目: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
备注:

点击查看摘要

Abstract:Previous work has argued that recursive numeral systems optimise the trade-off between lexicon size and average morphosyntatic complexity (Denić and Szymanik, 2024). However, showing that only natural-language-like systems optimise this tradeoff has proven elusive, and the existing solution has relied on ad-hoc constraints to rule out unnatural systems (Yang and Regier, 2025). Here, we argue that this issue arises because the proposed trade-off has neglected regularity, a crucial aspect of complexity central to human grammars in general. Drawing on the Minimum Description Length (MDL) approach, we propose that recursive numeral systems are better viewed as efficient with regard to their regularity and processing complexity. We show that our MDL-based measures of regularity and processing complexity better capture the key differences between attested, natural systems and unattested but possible ones, including “optimal” recursive numeral systems from previous work, and that the ad-hoc constraints from previous literature naturally follow from regularity. Our approach highlights the need to incorporate regularity across sets of forms in studies that attempt to measure and explain optimality in language.
zh

[NLP-49] Quantitative Intertextuality from the Digital Humanities Perspective: A Survey

【速读】: 该论文旨在解决定量研究中文学文本之间互文性(intertextuality)的系统化梳理与方法整合问题,以推动数字人文研究的深化与拓展。其解决方案的关键在于构建一个涵盖多语言、多主题数据集的综述框架,系统总结从统计方法到深度学习的各类定量分析技术,并归纳其在人文学科和社会科学研究中的应用实例及配套平台工具,从而为未来跨学科融合研究(尤其是人工智能与人文学科的结合)提供可操作的路线图和理论支撑。

链接: https://arxiv.org/abs/2510.27045
作者: Siyu Duan
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:The connection between texts is referred to as intertextuality in literary theory, which served as an important theoretical basis in many digital humanities studies. Over the past decade, advancements in natural language processing have ushered intertextuality studies into the quantitative age. Large-scale intertextuality research based on cutting-edge methods has continuously emerged. This paper provides a roadmap for quantitative intertextuality studies, summarizing their data, methods, and applications. Drawing on data from multiple languages and topics, this survey reviews methods from statistics to deep learning. It also summarizes their applications in humanities and social sciences research and the associated platform tools. Driven by advances in computer technology, more precise, diverse, and large-scale intertext studies can be anticipated. Intertextuality holds promise for broader application in interdisciplinary research bridging AI and the humanities.
zh

[NLP-50] Dataset Creation and Baseline Models for Sexism Detection in Hausa

【速读】: 该论文旨在解决低资源语言(如豪萨语)中性别歧视(sexism)检测的难题,其核心挑战在于文化差异和语言表达多样性导致现有高资源语言的计算方法难以直接迁移。解决方案的关键在于构建首个豪萨语性别歧视检测数据集,通过社区参与、定性编码和数据增强实现,并结合两阶段用户研究(n=66)获取母语者对日常话语中性别歧视定义与表述的认知,从而提升模型对文化语境的理解能力;同时实验对比了传统机器学习分类器与预训练多语言语言模型的效果,验证了少样本学习(few-shot learning)在该任务中的潜力,但也揭示了在澄清性提问和习语表达等场景下易产生大量误报的问题。

链接: https://arxiv.org/abs/2510.27038
作者: Fatima Adam Muhammad,Shamsuddeen Muhammad Hassan,Isa Inuwa-Dutse
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 1 figure, 4 tables

点击查看摘要

Abstract:Sexism reinforces gender inequality and social exclusion by perpetuating stereotypes, bias, and discriminatory norms. Noting how online platforms enable various forms of sexism to thrive, there is a growing need for effective sexism detection and mitigation strategies. While computational approaches to sexism detection are widespread in high-resource languages, progress remains limited in low-resource languages where limited linguistic resources and cultural differences affect how sexism is expressed and perceived. This study introduces the first Hausa sexism detection dataset, developed through community engagement, qualitative coding, and data augmentation. For cultural nuances and linguistic representation, we conducted a two-stage user study (n=66) involving native speakers to explore how sexism is defined and articulated in everyday discourse. We further experiment with both traditional machine learning classifiers and pre-trained multilingual language models, and evaluate the effectiveness of few-shot learning in detecting sexism in Hausa. Our findings highlight challenges in capturing cultural nuance, particularly with clarification-seeking and idiomatic expressions, and reveal a tendency for many false positives in such cases.
zh

[NLP-51] Elastic Architecture Search for Efficient Language Models ICME2025

【速读】: 该论文旨在解决大型预训练语言模型在自然语言理解(NLU)任务中因计算和内存需求巨大而引发的经济与环境问题。其核心解决方案是提出一种名为弹性语言模型(Elastic Language Model, ELM)的新一代神经架构搜索(Neural Architecture Search, NAS)方法,关键创新在于引入了一个包含高效Transformer模块和可动态调整维度与注意力头数的模块化搜索空间,从而提升搜索过程的效率与灵活性;同时设计了新型知识蒸馏损失函数,以保留每个模块的独特特征,增强架构选择过程中的判别能力,实验表明基于ELM发现的模型在掩码语言建模和因果语言建模任务上显著优于现有方法。

链接: https://arxiv.org/abs/2510.27037
作者: Shang Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: ICME 2025

点击查看摘要

Abstract:As large pre-trained language models become increasingly critical to natural language understanding (NLU) tasks, their substantial computational and memory requirements have raised significant economic and environmental concerns. Addressing these challenges, this paper introduces the Elastic Language Model (ELM), a novel neural architecture search (NAS) method optimized for compact language models. ELM extends existing NAS approaches by introducing a flexible search space with efficient transformer blocks and dynamic modules for dimension and head number adjustment. These innovations enhance the efficiency and flexibility of the search process, which facilitates more thorough and effective exploration of model architectures. We also introduce novel knowledge distillation losses that preserve the unique characteristics of each block, in order to improve the discrimination between architectural choices during the search process. Experiments on masked language modeling and causal language modeling tasks demonstrate that models discovered by ELM significantly outperform existing methods.
zh

[NLP-52] Kad: A Framework for Proxy-based Test-time Alignment with Knapsack Approximation Deferral

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在预训练后仍需进行对齐(alignment)以满足下游任务需求和风格偏好等问题,而随着模型规模扩大,传统对齐方法的计算成本急剧上升,难以高效应用。其解决方案的关键在于提出一种基于代理模型(proxy-based)的测试时对齐(test-time alignment)方法,通过一个小型已对齐模型提供引导,采用一种基于token的级联策略(token-specific cascading method),将每个token的决策规则建模为0-1背包问题(0-1 knapsack problem),并在此框架下推导出最优决策的原始与对偶近似解,从而在保持任务性能的同时显著提升推测解码(speculative decoding)速度。

链接: https://arxiv.org/abs/2510.27017
作者: Ayoub Hammal,Pierre Zweigenbaum,Caio Corro
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Several previous works concluded that the largest part of generation capabilities of large language models (LLM) are learned (early) during pre-training. However, LLMs still require further alignment to adhere to downstream task requirements and stylistic preferences, among other desired properties. As LLMs continue to scale in terms of size, the computational cost of alignment procedures increase prohibitively. In this work, we propose a novel approach to circumvent these costs via proxy-based test-time alignment, i.e. using guidance from a small aligned model. Our approach can be described as token-specific cascading method, where the token-specific deferral rule is reduced to 0-1 knapsack problem. In this setting, we derive primal and dual approximations of the optimal deferral decision. We experimentally show the benefits of our method both in task performance and speculative decoding speed.
zh
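论文将逐 token 的代理决策规约为 0-1 背包问题,并给出原始/对偶近似解;为说明该规约本身,下面用标准动态规划精确求解一个小规模实例(`gains`、`costs` 为假设的收益与开销估计,实际系统中规模更大,需改用论文中的近似算法):

```python
def knapsack_deferral(gains, costs, budget):
    """将逐 token 的"是否求助大模型"决策化为 0-1 背包的 DP 解(示意)。

    gains[i]: 第 i 个 token 处采用大模型指导的预期质量收益
    costs[i]: 相应的额外计算开销(取整数便于 DP)
    budget  : 总开销预算
    返回被选中(即交由大模型处理)的 token 下标。
    """
    n = len(gains)
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        g, c = gains[i - 1], costs[i - 1]
        for b in range(budget + 1):
            dp[i][b] = dp[i - 1][b]
            if c <= b:
                dp[i][b] = max(dp[i][b], dp[i - 1][b - c] + g)
    # 回溯得到最优选择
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if dp[i][b] != dp[i - 1][b]:
            chosen.append(i - 1)
            b -= costs[i - 1]
    return sorted(chosen)

print(knapsack_deferral(gains=[0.9, 0.2, 0.7, 0.4], costs=[3, 1, 2, 2], budget=4))
```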

[NLP-53] Semantically-Aware LLM Agent to Enhance Privacy in Conversational AI Services

【速读】: 该论文旨在解决用户在与远程大型语言模型(Large Language Models, LLMs)交互时,因共享包含个人身份信息(Personally Identifiable Information, PII)的对话内容而导致的隐私泄露问题。解决方案的关键在于提出一种语义感知的隐私代理框架——局部优化的去标识化与语义完整性导向实体检测(Local Optimizations for Pseudonymization with Semantic Integrity Directed Entity Detection, LOPSIDED),其核心机制是动态地将用户提示中的敏感PII实体替换为语义一致的伪名,从而在不损害对话上下文完整性的情况下保护隐私;生成响应后,再自动对伪名进行去伪名化处理,确保输出结果的准确性与隐私性双重保障。

链接: https://arxiv.org/abs/2510.27016
作者: Jayden Serenari,Stephen Lee
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to IEEE Big Data 2025

点击查看摘要

Abstract:With the increasing use of conversational AI systems, there is growing concern over privacy leaks, especially when users share sensitive personal data in interactions with Large Language Models (LLMs). Conversations shared with these models may contain Personally Identifiable Information (PII), which, if exposed, could lead to security breaches or identity theft. To address this challenge, we present the Local Optimizations for Pseudonymization with Semantic Integrity Directed Entity Detection (LOPSIDED) framework, a semantically-aware privacy agent designed to safeguard sensitive PII data when using remote LLMs. Unlike prior work that often degrade response quality, our approach dynamically replaces sensitive PII entities in user prompts with semantically consistent pseudonyms, preserving the contextual integrity of conversations. Once the model generates its response, the pseudonyms are automatically depseudonymized, ensuring the user receives an accurate, privacy-preserving output. We evaluate our approach using real-world conversations sourced from ShareGPT, which we further augment and annotate to assess whether named entities are contextually relevant to the model’s response. Our results show that LOPSIDED reduces semantic utility errors by a factor of 5 compared to baseline techniques, all while enhancing privacy.
zh
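伪名替换与还原的骨架逻辑如下(Python 示意):实体检测结果 `entities` 假定已由上游 NER 给出,假名表 `PSEUDONYMS` 为演示用数据;LOPSIDED 真正的“语义一致”假名选择策略远比这里的轮转取名精细:

```python
import re

PSEUDONYMS = {"PERSON": ["Alice Chen", "Bob Roy"], "CITY": ["Springfield", "Riverton"]}

def pseudonymize(text, entities):
    """将检测到的 PII 实体替换为假名,并返回可逆映射。

    entities: [(实体文本, 类型)],假定由上游实体检测模块给出。
    """
    mapping, counters = {}, {}
    for span, etype in entities:
        if span not in mapping:
            i = counters.get(etype, 0)
            mapping[span] = PSEUDONYMS[etype][i % len(PSEUDONYMS[etype])]
            counters[etype] = i + 1
        text = text.replace(span, mapping[span])
    return text, mapping

def depseudonymize(text, mapping):
    """将远程 LLM 回复中的假名还原为原始实体。"""
    for original, fake in mapping.items():
        text = re.sub(re.escape(fake), original, text)
    return text

masked, m = pseudonymize("John Smith lives in Paris.",
                         [("John Smith", "PERSON"), ("Paris", "CITY")])
print(masked)                                   # 发送给远程模型的脱敏文本
print(depseudonymize(f"Hi {m['John Smith']}!", m))   # 还原后的最终输出
```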

[NLP-54] Semantic Frame Aggregation-based Transformer for Live Video Comment Generation

【速读】: 该论文旨在解决直播视频流中自动生成语境相关评论的难题,尤其针对现有方法忽视视频帧重要性排序、导致生成评论缺乏上下文关联性的问题。解决方案的关键在于提出一种基于语义帧聚合的Transformer(Semantic Frame Aggregation-based Transformer, SFAT)模型:该模型利用CLIP的视觉-文本多模态知识生成评论,并通过语义相关性权重对视频帧进行加权处理,优先突出与观众对话高度相关的帧;同时引入带有交叉注意力机制的评论解码器,使生成内容能融合聊天记录和视频模态的上下文线索,从而提升评论的相关性和自然度。

链接: https://arxiv.org/abs/2510.26978
作者: Anam Fatima,Yi Yu,Janak Kapuriya,Julien Lalanne,Jainendra Shukla
机构: IIIT-Delhi (印度国际信息技术学院); Hiroshima University (广岛大学); Grenoble INP – Ensimag (格勒诺布尔理工学院 – Ensimag校区)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Live commenting on video streams has surged in popularity on platforms like Twitch, enhancing viewer engagement through dynamic interactions. However, automatically generating contextually appropriate comments remains a challenging and exciting task. Video streams can contain a vast amount of data and extraneous content. Existing approaches tend to overlook an important aspect of prioritizing video frames that are most relevant to ongoing viewer interactions. This prioritization is crucial for producing contextually appropriate comments. To address this gap, we introduce a novel Semantic Frame Aggregation-based Transformer (SFAT) model for live video comment generation. This method not only leverages CLIP’s visual-text multimodal knowledge to generate comments but also assigns weights to video frames based on their semantic relevance to ongoing viewer conversation. It employs an efficient weighted sum of frames technique to emphasize informative frames while focusing less on irrelevant ones. Finally, our comment decoder with a cross-attention mechanism that attends to each modality ensures that the generated comment reflects contextual cues from both chats and video. Furthermore, to address the limitations of existing datasets, which predominantly focus on Chinese-language content with limited video categories, we have constructed a large scale, diverse, multimodal English video comments dataset. Extracted from Twitch, this dataset covers 11 video categories, totaling 438 hours and 3.2 million comments. We demonstrate the effectiveness of our SFAT model by comparing it to existing methods for generating comments from live video and ongoing dialogue contexts.
zh
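SFAT 中“按语义相关性加权求和帧特征”的操作可以用几行 NumPy 说明:帧特征与聊天上下文特征此处用随机向量代替真实的 CLIP 编码输出,温度参数为假设值:

```python
import numpy as np

def aggregate_frames(frame_feats, chat_feat, temp=0.1):
    """按视频帧与当前聊天上下文的语义相关性对帧做加权求和(示意)。

    frame_feats: (n_frames, d) 帧级特征;chat_feat: (d,) 聊天上下文特征。
    """
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    c = chat_feat / np.linalg.norm(chat_feat)
    sim = f @ c                                    # 逐帧余弦相似度
    w = np.exp(sim / temp); w /= w.sum()           # softmax 权重:相关帧被放大
    return (w[:, None] * frame_feats).sum(axis=0)  # 聚合后的语义帧表示

rng = np.random.default_rng(0)
print(aggregate_frames(rng.normal(size=(16, 8)), rng.normal(size=8)).shape)
```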

[NLP-55] Overview of the MEDIQA-OE 2025 Shared Task on Medical Order Extraction from Doctor-Patient Consultations

【速读】: 该论文旨在解决从医生-患者对话中自动提取可执行的医疗医嘱(medical orders)这一尚未被充分探索的问题,以减轻临床医生的文书负担并直接改善下游患者护理。解决方案的关键在于引入了MEDIQA-OE 2025共享任务,首次系统性地评估从自然语言对话中抽取医嘱的能力,参赛团队采用了包括闭源和开源大语言模型(Large Language Models, LLMs)在内的多种方法,并在统一的数据集上进行实验与比较,从而推动该领域的研究进展。

链接: https://arxiv.org/abs/2510.26974
作者: Jean-Philippe Corbeil,Asma Ben Abacha,Jerome Tremblay,Phillip Swazinna,Akila Jeeson Daniel,Miguel Del-Agua,Francois Beaulieu
机构: Microsoft Healthcare & Life Sciences
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Clinical documentation increasingly uses automatic speech recognition and summarization, yet converting conversations into actionable medical orders for Electronic Health Records remains unexplored. A solution to this problem can significantly reduce the documentation burden of clinicians and directly impact downstream patient care. We introduce the MEDIQA-OE 2025 shared task, the first challenge on extracting medical orders from doctor-patient conversations. Six teams participated in the shared task and experimented with a broad range of approaches, and both closed- and open-weight large language models (LLMs). In this paper, we describe the MEDIQA-OE task, dataset, final leaderboard ranking, and participants’ solutions.
zh

[NLP-56] Frame Semantic Patterns for Identifying Underreporting of Notifiable Events in Healthcare: The Case of Gender-Based Violence

【速读】: 该论文旨在解决电子健康记录(e-medical records)中性别暴力(gender-based violence, GBV)事件的低报告率问题,尤其针对初级卫生保健单位就诊时未充分记录的相关信息。其解决方案的关键在于提出一种基于语义框架(semantic frames)的方法,通过定义细粒度的文本模式并在非结构化数据(如开放文本字段)中进行搜索,从而有效识别潜在的暴力事件报告。该方法在巴西葡萄牙语的2100万句语料库上验证,实现了0.726的精确度,且设计为透明、高效、低碳且语言无关的流水线,具备良好的可迁移性与可解释性,适用于其他公共卫生监测场景。

链接: https://arxiv.org/abs/2510.26969
作者: Lívia Dutra,Arthur Lorenzi,Laís Berno,Franciany Campos,Karoline Biscardi,Kenneth Brown,Marcelo Viridiano,Frederico Belcavello,Ely Matos,Olívia Guaranha,Erik Santos,Sofia Reinach,Tiago Timponi Torrent
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We introduce a methodology for the identification of notifiable events in the domain of healthcare. The methodology harnesses semantic frames to define fine-grained patterns and search them in unstructured data, namely, open-text fields in e-medical records. We apply the methodology to the problem of underreporting of gender-based violence (GBV) in e-medical records produced during patients’ visits to primary care units. A total of eight patterns are defined and searched on a corpus of 21 million sentences in Brazilian Portuguese extracted from e-SUS APS. The results are manually evaluated by linguists and the precision of each pattern measured. Our findings reveal that the methodology effectively identifies reports of violence with a precision of 0.726, confirming its robustness. Designed as a transparent, efficient, low-carbon, and language-agnostic pipeline, the approach can be easily adapted to other health surveillance contexts, contributing to the broader, ethical, and explainable use of NLP in public health systems.
zh

[NLP-57] RepV: Safety-Separable Latent Spaces for Scalable Neurosymbolic Plan Verification

【速读】: 该论文旨在解决安全关键领域中AI系统行为合规性验证难题,即如何在不依赖人工编写复杂时序逻辑规范的前提下,实现对计划(plan)是否符合自然语言规则的可靠判断。传统形式化方法虽能提供可证明的保证,但受限于表达能力和可访问性;而深度学习方法虽然可处理自然语言约束,却因决策过程不透明易引发误判。解决方案的关键在于提出一种神经符号验证器(RepV),其核心思想是学习一个安全可分离的潜在空间(safety-separable latent space),使得安全与不安全计划在此空间中线性可分。RepV从少量由现成模型检查器标注的计划出发,训练一个轻量级投影器(projector),将每个计划及其由语言模型生成的推理依据共同嵌入低维空间,并通过冻结的线性边界进行单次前向传播验证。此外,该框架还能基于潜在空间中的位置提供概率化的正确验证保障,从而驱动规划器的自适应优化,提升规则遵守率且无需人工标注。

链接: https://arxiv.org/abs/2510.26935
作者: Yunhao Yang,Neel P. Bhatt,Pranay Samineni,Rohan Siva,Zhanyang Wang,Ufuk Topcu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
备注: Code and data are available at: this https URL

点击查看摘要

Abstract:As AI systems migrate to safety-critical domains, verifying that their actions comply with well-defined rules remains a challenge. Formal methods provide provable guarantees but demand hand-crafted temporal-logic specifications, offering limited expressiveness and accessibility. Deep learning approaches enable evaluation of plans against natural-language constraints, yet their opaque decision process invites misclassifications with potentially severe consequences. We introduce RepV, a neurosymbolic verifier that unifies both views by learning a latent space where safe and unsafe plans are linearly separable. Starting from a modest seed set of plans labeled by an off-the-shelf model checker, RepV trains a lightweight projector that embeds each plan, together with a language model-generated rationale, into a low-dimensional space; a frozen linear boundary then verifies compliance for unseen natural-language rules in a single forward pass. Beyond binary classification, RepV provides a probabilistic guarantee on the likelihood of correct verification based on its position in the latent space. This guarantee enables a guarantee-driven refinement of the planner, improving rule compliance without human annotations. Empirical evaluations show that RepV improves compliance prediction accuracy by up to 15% compared to baseline methods while adding fewer than 0.2M parameters. Furthermore, our refinement framework outperforms ordinary fine-tuning baselines across various planning domains. These results show that safety-separable latent spaces offer a scalable, plug-and-play primitive for reliable neurosymbolic plan verification. Code and data are available at: this https URL.
zh
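
摘要中"轻量投影器 + 冻结线性边界 + 潜空间位置给出概率保证"的流程可用如下最小示意表达(特征维度、网络结构与概率标定方式均为假设,非论文代码):

```python
import torch
import torch.nn as nn

class Verifier(nn.Module):
    def __init__(self, in_dim=768, latent_dim=16):
        super().__init__()
        self.projector = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.boundary = nn.Linear(latent_dim, 1)   # 种子集训练后即冻结
        for p in self.boundary.parameters():
            p.requires_grad = False

    def forward(self, plan_feat: torch.Tensor) -> torch.Tensor:
        z = self.projector(plan_feat)          # 计划+理由在低维潜空间中的位置
        margin = self.boundary(z).squeeze(-1)  # 到线性边界的带符号距离
        return torch.sigmoid(margin)           # 作为"验证正确"概率(假设的标定方式)

v = Verifier()
feats = torch.randn(4, 768)  # 4 条"计划+语言模型理由"的嵌入(随机占位)
print(v(feats))              # 单次前向即可得到每条计划的合规概率
```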

[NLP-58] Understanding and Enhancing Mamba-Transformer Hybrids for Memory Recall and Language Modeling

【速读】: 该论文旨在解决混合状态空间模型(State Space Model, SSM)与注意力机制(Attention)结合架构中设计选择不明确的问题,特别是如何优化记忆利用效率与整体性能之间的权衡。其关键解决方案包括:首先,通过对比顺序集成与并行集成的混合架构,发现顺序型混合模型在短上下文场景下表现更优,而并行型混合模型更适合长上下文;其次,提出一种以数据为中心的持续训练策略,即在训练集中不断引入改写(paraphrase)增强的数据,从而显著提升召回能力(recall),同时保持其他性能指标不变,并且该方法具有跨不同基础模型的良好泛化性,优于仅依赖架构调整的改进方式。

链接: https://arxiv.org/abs/2510.26912
作者: Hyunji Lee,Wenhao Yu,Hongming Zhang,Kaixin Ma,Jiyeon Kim,Dong Yu,Minjoon Seo
机构: UNC Chapel Hill; Tencent AI Lab; KAIST AI
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Hybrid models that combine state space models (SSMs) with attention mechanisms have shown strong performance by leveraging the efficiency of SSMs and the high recall ability of attention. However, the architectural design choices behind these hybrid models remain insufficiently understood. In this work, we analyze hybrid architectures through the lens of memory utilization and overall performance, and propose a complementary method to further enhance their effectiveness. We first examine the distinction between sequential and parallel integration of SSM and attention layers. Our analysis reveals several interesting findings, including that sequential hybrids perform better on shorter contexts, whereas parallel hybrids are more effective for longer contexts. We also introduce a data-centric approach of continually training on datasets augmented with paraphrases, which further enhances recall while preserving other capabilities. It generalizes well across different base models and outperforms architectural modifications aimed at enhancing recall. Our findings provide a deeper understanding of hybrid SSM-attention models and offer practical guidance for designing architectures tailored to various use cases.
zh

[NLP-59] he Denario project: Deep knowledge AI agents for scientific discovery

【速读】: 该论文旨在解决当前科学研究中效率低下、跨学科协作困难以及研究人员在文献综述、实验设计与论文撰写等环节面临繁重负担的问题。其解决方案的核心是提出并实现了一个名为Denario的生成式AI多智能体系统(Generative AI Multi-agent System),该系统具备模块化架构,能够自主执行从科学假设生成、文献调研、研究计划制定、代码编写与可视化到论文撰写与评审的全流程科研任务,并支持跨学科知识融合。通过多个学科领域的AI生成论文实例及专家评估,验证了其在提升科研效率和促进创新方面的潜力,同时揭示了现有系统的局限性与伦理挑战。

链接: https://arxiv.org/abs/2510.26887
作者: Francisco Villaescusa-Navarro,Boris Bolliet,Pablo Villanueva-Domingo,Adrian E. Bayer,Aidan Acquah,Chetana Amancharla,Almog Barzilay-Siegal,Pablo Bermejo,Camille Bilodeau,Pablo Cárdenas Ramírez,Miles Cranmer,Urbano L. França,ChangHoon Hahn,Yan-Fei Jiang,Raul Jimenez,Jun-Young Lee,Antonio Lerario,Osman Mamun,Thomas Meier,Anupam A. Ojha,Pavlos Protopapas,Shimanto Roy,David N. Spergel,Pedro Tarancón-Álvarez,Ujjwal Tiwari,Matteo Viel,Digvijay Wadekar,Chi Wang,Bonny Y. Wang,Licong Xu,Yossi Yovel,Shuwen Yue,Wen-Han Zhou,Qiyao Zhu,Jiajun Zou,Íñigo Zubeldia
机构: 1. University of California, Berkeley (加州大学伯克利分校); 2. Lawrence Berkeley National Laboratory (劳伦斯伯克利国家实验室); 3. Max Planck Institute for Astrophysics (马克斯·普朗克天体物理研究所); 4. Ludwig-Maximilians-Universität München (慕尼黑路德维希-马克西米利安大学); 5. Instituto de Astrofísica de Canarias (加那利天体物理研究所); 6. University of Oxford (牛津大学); 7. University of Pennsylvania (宾夕法尼亚大学); 8. Tel Aviv University (特拉维夫大学); 9. Universidad Autónoma de Madrid (马德里自治大学); 10. Instituto de Física Teórica UAM-CSIC (理论物理研究所,马德里自治大学-西班牙国家研究委员会联合); 11. Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas (能源、环境与技术研究中心); 12. McGill University (麦吉尔大学); 13. University of Arizona (亚利桑那大学); 14. Steward Observatory (斯特沃德天文台); 15. Department of Astronomy and Astrophysics (天文学与天体物理学系); 16. University of Colorado Boulder (科罗拉多大学博尔德分校); 17. Instituto Nacional de Astrofísica, Óptica y Electrónica (国家天体物理、光学与电子研究所); 18. Universidad Nacional Autónoma de México (墨西哥国立自治大学); 19. University of Texas at Austin (得克萨斯大学奥斯汀分校); 20. McDonald Observatory (麦克唐纳天文台); 21. University of Barcelona (巴塞罗那大学); 22. Institut de Ciències del Cosmos (宇宙科学研究所); 23. International School for Advanced Studies (国际高等研究学院); 24. University of Zurich (苏黎世大学); 25. Harvard University (哈佛大学); 26. Center for Astrophysics | Harvard & Smithsonian (哈佛与史密森天体物理中心); 27. Harvard-Smithsonian Center for Astrophysics (哈佛-史密森天体物理中心); 28. Institute of Space Sciences (空间科学研究所); 29. Istituto Nazionale di Astrofisica (意大利国家天体物理研究所); 30. Università degli Studi di Trieste (的里雅斯特大学); 31. INAF-Osservatorio Astronomico di Trieste (的里雅斯特天文台); 32. INFN - Sezione di Trieste (意大利国家核物理研究院-的里雅斯特分部); 33. University of Cambridge (剑桥大学); 34. Princeton University (普林斯顿大学); 35. University of California, San Diego (加州大学圣地亚哥分校); 36. University of California, Davis (加州大学戴维斯分校); 37. Hebrew University of Jerusalem (耶路撒冷希伯来大学); 38. University of Science and Technology of China (中国科学技术大学); 39. Tsinghua University (清华大学); 40. Beijing Institute of Technology (北京理工大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 272 pages. Examples of 11 AI-generated paper drafts from different scientific disciplines. Code publicly available at this https URL

点击查看摘要

Abstract:We present Denario, an AI multi-agent system designed to serve as a scientific research assistant. Denario can perform many different tasks, such as generating ideas, checking the literature, developing research plans, writing and executing code, making plots, and drafting and reviewing a scientific paper. The system has a modular architecture, allowing it to handle specific tasks, such as generating an idea, or carrying out end-to-end scientific analysis using Cmbagent as a deep-research backend. In this work, we describe in detail Denario and its modules, and illustrate its capabilities by presenting multiple AI-generated papers generated by it in many different scientific disciplines such as astrophysics, biology, biophysics, biomedical informatics, chemistry, material science, mathematical physics, medicine, neuroscience and planetary science. Denario also excels at combining ideas from different disciplines, and we illustrate this by showing a paper that applies methods from quantum physics and machine learning to astrophysical data. We report the evaluations performed on these papers by domain experts, who provided both numerical scores and review-like feedback. We then highlight the strengths, weaknesses, and limitations of the current system. Finally, we discuss the ethical implications of AI-driven research and reflect on how such technology relates to the philosophy of science. We publicly release the code at this https URL. A Denario demo can also be run directly on the web at this https URL, and the full app will be deployed on the cloud.
zh

[NLP-60] Evaluating Perspectival Biases in Cross-Modal Retrieval

【速读】: 该论文旨在解决多模态检索系统中存在的两种偏见问题:一是流行性偏见(prevalence bias),即在图像到文本检索中倾向于选择语言占优的条目而非语义上更准确的条目;二是关联偏见(association bias),即在文本到图像检索中偏好与查询文化相关联的图像而非语义正确的图像。研究表明,显式对齐(explicit alignment)是缓解流行性偏见的有效策略,但关联偏见则更具挑战性且需针对性干预。因此,实现真正公平的多模态系统需要超越单纯数据扩展的靶向策略,尤其应重视由文化关联引发的偏见问题。

链接: https://arxiv.org/abs/2510.26861
作者: Teerapol Saengsukhiran,Peerawat Chomphooyod,Narabodee Rodjananant,Chompakorn Chaksangchaichot,Patawee Prakrankamanant,Witthawin Sripheanpol,Pak Lovichit,Sarana Nutanong,Ekapol Chuangsuwanich
机构: Chulalongkorn University (朱拉隆功大学); VISAI.AI; VISTEC (视觉科技研究所)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Multimodal retrieval systems are expected to operate in a semantic space, agnostic to the language or cultural origin of the query. In practice, however, retrieval outcomes systematically reflect perspectival biases: deviations shaped by linguistic prevalence and cultural associations. We study two such biases. First, prevalence bias refers to the tendency to favor entries from prevalent languages over semantically faithful entries in image-to-text retrieval. Second, association bias refers to the tendency to favor images culturally associated with the query over semantically correct ones in text-to-image retrieval. Results show that explicit alignment is a more effective strategy for mitigating prevalence bias. However, association bias remains a distinct and more challenging problem. These findings suggest that achieving truly equitable multimodal systems requires targeted strategies beyond simple data scaling and that bias arising from cultural association may be treated as a more challenging problem than one arising from linguistic prevalence.
zh

[NLP-61] CATArena: Evaluation of LLM Agents through Iterative Tournament Competitions

【速读】: 该论文旨在解决当前大型语言模型(Large Language Model, LLM)代理评估体系中存在的局限性问题,包括:现有基准测试主要聚焦于固定场景下的端到端性能,难以全面衡量代理的持续学习能力;同时存在得分饱和(score saturation)和对专家标注依赖性增强的问题,阻碍了对日益进步的代理能力进行动态、可靠评估。解决方案的关键在于提出一种迭代式、竞争性的同伴学习(peer-learning)框架,通过代理间的反复交互与反馈实现策略优化,并设计了一个名为CATArena的锦标赛式评估平台,该平台采用四种多样化的棋盘与卡牌游戏,具备开放式评分机制,从而支持无上限任务目标下的持续动态评估。实验表明,该方案能够稳定、可扩展地衡量代理的核心能力,尤其是学习能力和策略编码能力。

链接: https://arxiv.org/abs/2510.26852
作者: Lingyue Fu,Xin Ding,Yaoming Zhu,Shao Zhang,Lin Qiu,Weiwen Liu,Weinan Zhang,Xuezhi Cao,Xunliang Cai,Jiaxin Ding,Yong Yu
机构: Shanghai Jiao Tong University (上海交通大学); AGI-Eval; Meituan (美团)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) agents have evolved from basic text generation to autonomously completing complex tasks through interaction with external tools. However, current benchmarks mainly assess end-to-end performance in fixed scenarios, restricting evaluation to specific skills and suffering from score saturation and growing dependence on expert annotation as agent capabilities improve. In this work, we emphasize the importance of learning ability, including both self-improvement and peer-learning, as a core driver for agent evolution toward human-level intelligence. We propose an iterative, competitive peer-learning framework, which allows agents to refine and optimize their strategies through repeated interactions and feedback, thereby systematically evaluating their learning capabilities. To address the score saturation issue in current benchmarks, we introduce CATArena, a tournament-style evaluation platform featuring four diverse board and card games with open-ended scoring. By providing tasks without explicit upper score limits, CATArena enables continuous and dynamic evaluation of rapidly advancing agent capabilities. Experimental results and analyses involving both minimal and commercial code agents demonstrate that CATArena provides reliable, stable, and scalable benchmarking for core agent abilities, particularly learning ability and strategy coding.
zh

[NLP-62] Broken-Token: Filtering Obfuscated Prompts by Counting Characters-Per-Token

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对越狱攻击(jailbreak attacks)时的安全防护失效问题,这类攻击通过使用密码学编码或字符级变换隐藏恶意提示,从而绕过模型内置的安全护栏。解决方案的关键在于提出一种名为CPT-Filtering的新方法,其核心思想是利用字节对编码(Byte-Pair Encoding, BPE)分词器的内在行为:自然语言文本通常具有较低的字符每标记数(Characters Per Token, CPT),而编码后的异常文本(如密文)则会显著增加短标记的数量,导致CPT值升高。该方法仅依赖一个简单的CPT阈值判断,无需额外计算资源,即可高效识别并过滤潜在的恶意编码输入,具备高精度、低开销和模型无关性等优势,适用于实时文本过滤与离线数据清洗场景。

链接: https://arxiv.org/abs/2510.26847
作者: Shaked Zychlinski,Yuval Kainan
机构: JFrog
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)
备注: 16 pages, 9 figures

点击查看摘要

Abstract:Large Language Models (LLMs) are susceptible to jailbreak attacks where malicious prompts are disguised using ciphers and character-level encodings to bypass safety guardrails. While these guardrails often fail to interpret the encoded content, the underlying models can still process the harmful instructions. We introduce CPT-Filtering, a novel, model-agnostic guardrail technique with negligible cost and near-perfect accuracy that aims to mitigate these attacks by leveraging the intrinsic behavior of Byte-Pair Encoding (BPE) tokenizers. Our method is based on the principle that tokenizers, trained on natural language, represent out-of-distribution text, such as ciphers, using a significantly higher number of shorter tokens. Our technique uses a simple yet powerful artifact of using language models: the average number of Characters Per Token (CPT) in the text. This approach is motivated by the high compute cost of modern methods, which rely on added modules such as dedicated LLMs or perplexity models. We validate our approach across a large dataset of over 100,000 prompts, testing numerous encoding schemes with several popular tokenizers. Our experiments demonstrate that a simple CPT threshold robustly identifies encoded text with high accuracy, even for very short inputs. CPT-Filtering provides a practical defense layer that can be immediately deployed for real-time text filtering and offline data curation.
zh
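
CPT 判据本身极其简单:自然语言的平均每 token 字符数较高,而密文等分布外文本会被切成大量短 token,CPT 显著降低。以下为示意实现(GPT-2 的 BPE 分词器与阈值 2.5 均为演示用假设):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # 任一 BPE 分词器均可

def characters_per_token(text: str) -> float:
    tokens = tokenizer.encode(text)
    return len(text) / max(len(tokens), 1)

def looks_encoded(text: str, threshold: float = 2.5) -> bool:
    return characters_per_token(text) < threshold  # 低 CPT 视为可疑编码文本

plain = "Please summarize the quarterly report for me."
cipher = "Cyrnfr fhzznevmr gur dhnegreyl ercbeg sbe zr."  # 上句的 ROT13
print(characters_per_token(plain), looks_encoded(plain))
print(characters_per_token(cipher), looks_encoded(cipher))
```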

[NLP-63] Detecting Prefix Bias in LLM -based Reward Models

【速读】: 该论文旨在解决基于人类反馈的强化学习(Reinforcement Learning with Human Feedback, RLHF)中奖励模型(reward model)存在的前缀偏差(prefix bias)问题,即模型偏好因查询前缀的微小变化而发生系统性偏移,进而可能放大种族和性别维度上的不公平性。解决方案的关键在于提出了一套新颖的检测与评估指标以量化前缀偏差,并在此基础上设计了一种数据增强策略来缓解此类偏差,实验证明该方法能有效降低前缀偏差对奖励模型的影响,从而提升模型的公平性和可靠性。

链接: https://arxiv.org/abs/2505.13487
作者: Ashwin Kumar,Yuzi He,Aram H. Markosyan,Bobbie Chern,Imanol Arrieta-Ibarra
机构: Washington University in St Louis (圣路易斯华盛顿大学); Meta Platforms, Inc. (Meta)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement Learning with Human Feedback (RLHF) has emerged as a key paradigm for task-specific fine-tuning of language models using human preference data. While numerous publicly available preference datasets provide pairwise comparisons of responses, the potential for biases in the resulting reward models remains underexplored. In this work, we introduce novel methods to detect and evaluate prefix bias – a systematic shift in model preferences triggered by minor variations in query prefixes – in LLM-based reward models trained on such datasets. We leverage these metrics to reveal significant biases in preference models across racial and gender dimensions. Our comprehensive evaluation spans diverse open-source preference datasets and reward model architectures, demonstrating susceptibility to this kind of bias regardless of the underlying model architecture. Furthermore, we propose a data augmentation strategy to mitigate these biases, showing its effectiveness in reducing the impact of prefix bias. Our findings highlight the critical need for bias-aware dataset design and evaluation in developing fair and reliable reward models, contributing to the broader discourse on fairness in AI.
zh
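
前缀偏差的探测思路可示意如下:对同一组(查询、两个候选回答)在不同前缀下用奖励模型打分,统计偏好是否被前缀翻转。reward_score 为假设的占位函数,实际应换成被审计的奖励模型;论文中的度量定义更完整,此处仅示意核心思想:

```python
def reward_score(prompt: str, response: str) -> float:
    # 占位打分函数(纯属假设),实际应调用被审计的奖励模型
    return float(len(response)) - 0.1 * len(prompt)

prefixes = ["As a nurse, ", "As a mechanic, "]
query = "how do I treat a minor burn?"
responses = ["Run cool water over it for 10 minutes.", "Just ignore it."]

base = reward_score(query, responses[0]) > reward_score(query, responses[1])
flips = 0
for p in prefixes:
    pref = (reward_score(p + query, responses[0])
            > reward_score(p + query, responses[1]))
    flips += int(pref != base)
print(f"偏好被前缀翻转的比例: {flips}/{len(prefixes)}")
```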

[NLP-64] A Transformer-based Neural Architecture Search Method GECCO2023

【速读】: 该论文旨在解决神经网络结构设计中如何自动搜索出更优的Transformer架构以提升机器翻译性能的问题。其解决方案的关键在于提出一种基于多目标遗传算法(multi-objective genetic algorithm)的神经架构搜索方法,该方法在搜索过程中不仅以BLEU分数作为主评价指标,还引入困惑度(perplexity)作为辅助评价指标,从而在种群迭代优化中更全面地引导模型结构向高性能方向演化,最终获得超越基线模型的翻译效果。

链接: https://arxiv.org/abs/2505.01314
作者: Shang Wang,Huanrong Tang,Jianquan Ouyang
机构: Xiangtan University (湘潭大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: GECCO 2023

点击查看摘要

Abstract:This paper presents a neural architecture search method based on the Transformer architecture, searching across multi-head attention computation schemes for different numbers of encoder and decoder combinations. In order to search for neural network structures with better translation results, we considered perplexity as an auxiliary evaluation metric for the algorithm in addition to BLEU scores and iteratively improved each individual neural network within the population by a multi-objective genetic algorithm. Experimental results show that the neural network structures searched by the algorithm outperform all the baseline models, and that the introduction of the auxiliary evaluation metric can find better models than considering only the BLEU score as an evaluation metric.
zh
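
文中的双目标选择(BLEU 越高越好、困惑度越低越好)可用 Pareto 支配关系刻画。以下为示意草图,候选数值为虚构:

```python
def dominates(a, b):
    """a、b 为 (BLEU, 困惑度):BLEU 越高越好,困惑度越低越好。"""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better

population = [(27.3, 5.1), (27.9, 5.4), (26.8, 4.9), (27.9, 5.0)]
pareto = [c for c in population
          if not any(dominates(o, c) for o in population if o != c)]
print(pareto)  # 非支配个体进入下一代: [(26.8, 4.9), (27.9, 5.0)]
```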

[NLP-65] W-PCA Based Gradient-Free Proxy for Efficient Search of Lightweight Language Models ICLR2025

【速读】: 该论文旨在解决轻量级语言模型(Lightweight Language Models)在零样本神经架构搜索(Zero-shot NAS)中面临的评估指标偏差和计算效率低下的问题。其解决方案的关键在于提出一种名为权重加权主成分分析(Weight-weighted PCA, W-PCA)的新方法,该方法通过两个代理指标——参数量与前馈神经网络(Feed-Forward Neural, FFN)层中累计贡献超过阈值η的主成分数量——实现无需梯度计算的高效评估。此设计显著降低了评估时间,并在GLUE和SQuAD数据集上验证了其相较于传统训练型NAS方法更优的性能表现及排名相关性。

链接: https://arxiv.org/abs/2504.15983
作者: Shang Wang
机构: ShanghaiTech University (上海科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: ICLR 2025

点击查看摘要

Abstract:The demand for efficient natural language processing (NLP) systems has led to the development of lightweight language models. Previous work in this area has primarily focused on manual design or training-based neural architecture search (NAS) methods. Recently, zero-shot NAS methods have been proposed for evaluating language models without the need for training. However, prevailing approaches to zero-shot NAS often face challenges such as biased evaluation metrics and computational inefficiencies. In this paper, we introduce weight-weighted PCA (W-PCA), a novel zero-shot NAS method specifically tailored for lightweight language models. Our approach utilizes two evaluation proxies: the parameter count and the number of principal components with cumulative contribution exceeding η in the feed-forward neural (FFN) layer. Additionally, by eliminating the need for gradient computations, we optimize the evaluation time, thus enhancing the efficiency of designing and evaluating lightweight language models. We conduct a comparative analysis on the GLUE and SQuAD datasets to evaluate our approach. The results demonstrate that our method significantly reduces training time compared to one-shot NAS methods and achieves higher scores in the testing phase compared to previous state-of-the-art training-based methods. Furthermore, we perform ranking evaluations on a dataset sampled from the FlexiBERT search space. Our approach exhibits superior ranking correlation and further reduces solving time compared to other zero-shot NAS methods that require gradient computation.
zh
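
"累计贡献率超过 η 的主成分数量"这一免梯度代理可按如下方式计算(对何种矩阵做 PCA、两个代理如何组合以论文为准,此处用随机 FFN 权重矩阵演示):

```python
import numpy as np

def pca_component_count(w: np.ndarray, eta: float = 0.99) -> int:
    w = w - w.mean(axis=0, keepdims=True)      # 中心化
    s = np.linalg.svd(w, compute_uv=False)     # 奇异值
    var = s ** 2
    cum = np.cumsum(var) / var.sum()           # 累计解释方差比
    return int(np.searchsorted(cum, eta) + 1)  # 达到 eta 所需的主成分个数

rng = np.random.default_rng(0)
ffn_weight = rng.standard_normal((768, 3072))  # 随机占位的 FFN 权重
proxy = (ffn_weight.size, pca_component_count(ffn_weight, eta=0.99))
print(proxy)  # (参数量, 主成分个数) 共同构成免训练打分
```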

计算机视觉

[CV-0] LifWavNet: Lifting Wavelet-based Network for Non-contact ECG Reconstruction from Radar

【速读】:该论文旨在解决从雷达信号中非接触式重建心电图(ECG)的问题,以实现无感心脏监测。其核心挑战在于如何从具有弱生理相关性的雷达回波中提取并重构出具有临床意义的ECG波形。解决方案的关键在于提出LifWavNet——一种基于多分辨率分析与合成(MRAS)模型的提升小波网络(lifting wavelet network),该架构采用可学习的提升小波结构,通过提升(lifting)和逆提升(inverse lifting)单元自适应地捕捉雷达信号特征,并生成符合生理特性的ECG波形;同时引入多分辨率短时傅里叶变换(multi-resolution STFT)损失函数,在时域和频域上强制约束重建结果与真实ECG的一致性,从而显著提升重建保真度与下游生命体征估计(如心率和心率变异性)性能。

链接: https://arxiv.org/abs/2510.27692
作者: Soumitra Kundu,Gargi Panda,Saumik Bhattacharya,Aurobinda Routray,Rajlakshmi Guha
机构: Rekhi Centre of Excellence for the Science of Happiness, IIT Kharagpur, India (幸福科学卓越中心,印度理工学院克哈格普尔分校); Department of Electrical Engineering, IIT Kharagpur, India (电气工程系,印度理工学院克哈格普尔分校); Department of Electronics and Electrical Communication Engineering, IIT Kharagpur, India (电子与电气通信工程系,印度理工学院克哈格普尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Non-contact electrocardiogram (ECG) reconstruction from radar signals offers a promising approach for unobtrusive cardiac monitoring. We present LifWavNet, a lifting wavelet network based on a multi-resolution analysis and synthesis (MRAS) model for radar-to-ECG reconstruction. Unlike prior models that use fixed wavelet approaches, LifWavNet employs learnable lifting wavelets with lifting and inverse lifting units to adaptively capture radar signal features and synthesize physiologically meaningful ECG waveforms. To improve reconstruction fidelity, we introduce a multi-resolution short-time Fourier transform (STFT) loss, that enforces consistency with the ground-truth ECG in both temporal and spectral domains. Evaluations on two public datasets demonstrate that LifWavNet outperforms state-of-the-art methods in ECG reconstruction and downstream vital sign estimation (heart rate and heart rate variability). Furthermore, intermediate feature visualization highlights the interpretability of multi-resolution decomposition and synthesis in radar-to-ECG reconstruction. These results establish LifWavNet as a robust framework for radar-based non-contact ECG measurement.
zh
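
多分辨率 STFT 损失的最小 PyTorch 示意如下(分辨率组合与各项权重为假设值,非论文设置):在时域 L1 之外,于多组 (n_fft, hop) 下比较预测与真值 ECG 的幅度谱。

```python
import torch
import torch.nn.functional as F

def stft_mag(x, n_fft, hop):
    win = torch.hann_window(n_fft, device=x.device)
    return torch.stft(x, n_fft, hop_length=hop, window=win,
                      return_complex=True).abs()

def multi_res_stft_loss(pred, target,
                        resolutions=((64, 16), (128, 32), (256, 64))):
    loss = F.l1_loss(pred, target)                     # 时域项
    for n_fft, hop in resolutions:
        loss = loss + F.l1_loss(stft_mag(pred, n_fft, hop),
                                stft_mag(target, n_fft, hop))  # 各分辨率谱项
    return loss

pred = torch.randn(8, 2048, requires_grad=True)  # 重建的 ECG 批(随机占位)
target = torch.randn(8, 2048)                    # 真值 ECG(随机占位)
multi_res_stft_loss(pred, target).backward()
```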

[CV-1] Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals

【速读】:该论文旨在解决生成式 AI(Generative AI)中基于分数的蒸馏方法在复杂任务下性能受限的问题,特别是传统单步蒸馏(DMD)因模型容量不足导致在文本到视频生成等高复杂度任务中表现不佳,而直接扩展为多步蒸馏又会引发内存占用增加、计算深度上升以及训练不稳定等问题。其解决方案的关键在于提出分阶段分布匹配蒸馏(Phased DMD),该框架融合了分段蒸馏与专家混合(Mixture-of-Experts, MoE)思想,通过将信噪比(SNR)范围划分为子区间并逐级提升模型对高信噪比区域的建模能力,同时在每个子区间内进行精确的分数匹配优化,从而在降低学习难度的同时显著增强模型容量,有效保留生成多样性并维持关键生成能力。

链接: https://arxiv.org/abs/2510.27684
作者: Xiangyu Fan,Zesong Qiu,Zhuguanyu Wu,Fanzhou Wang,Zhiqian Lin,Tianxiang Ren,Dahua Lin,Ruihao Gong,Lei Yang
机构: SenseTime Research (商汤科技); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Distribution Matching Distillation (DMD) distills score-based generative models into efficient one-step generators, without requiring a one-to-one correspondence with the sampling trajectories of their teachers. However, limited model capacity causes one-step distilled models underperform on complex generative tasks, e.g., synthesizing intricate object motions in text-to-video generation. Directly extending DMD to multi-step distillation increases memory usage and computational depth, leading to instability and reduced efficiency. While prior works propose stochastic gradient truncation as a potential solution, we observe that it substantially reduces the generation diversity of multi-step distilled models, bringing it down to the level of their one-step counterparts. To address these limitations, we propose Phased DMD, a multi-step distillation framework that bridges the idea of phase-wise distillation with Mixture-of-Experts (MoE), reducing learning difficulty while enhancing model capacity. Phased DMD is built upon two key ideas: progressive distribution matching and score matching within subintervals. First, our model divides the SNR range into subintervals, progressively refining the model to higher SNR levels, to better capture complex distributions. Next, to ensure the training objective within each subinterval is accurate, we have conducted rigorous mathematical derivations. We validate Phased DMD by distilling state-of-the-art image and video generation models, including Qwen-Image (20B parameters) and Wan2.2 (28B parameters). Experimental results demonstrate that Phased DMD preserves output diversity better than DMD while retaining key generative capabilities. We will release our code and models.
zh

[CV-2] PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在医学影像领域应用受限于二维图像、难以处理三维(3D)医学数据的问题,尤其是针对正电子发射断层扫描与计算机断层扫描(PET/CT)中大体积数据、病灶小且分散、报告冗长等挑战。其解决方案的关键在于构建了一个包含超过11,000个病灶级描述与3D分割标注的大规模数据集,并提出了一种名为PETAR-4B的3D掩码感知视觉语言模型,该模型能够融合PET、CT图像及病灶轮廓信息,实现空间定位准确的报告生成,从而在全局上下文理解与局部病灶识别之间建立桥梁,显著提升PET/CT影像报告的质量与临床一致性。

链接: https://arxiv.org/abs/2510.27680
作者: Danyal Maqbool,Changhee Lee,Zachary Huemann,Samuel D. Church,Matthew E. Larson,Scott B. Perlman,Tomas A. Romero,Joshua D. Warner,Meghan Lubner,Xin Tie,Jameson Merkow,Junjie Hu,Steve Y. Cho,Tyler J. Bradshaw
机构: University of Wisconsin–Madison (威斯康星大学麦迪逊分校); Microsoft (微软)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advances in vision-language models (VLMs) have enabled impressive multimodal reasoning, yet most medical applications remain limited to 2D imaging. In this work, we extend VLMs to 3D positron emission tomography and computed tomography (PET/CT), a domain characterized by large volumetric data, small and dispersed lesions, and lengthy radiology reports. We introduce a large-scale dataset comprising over 11,000 lesion-level descriptions paired with 3D segmentations from more than 5,000 PET/CT exams, extracted via a hybrid rule-based and large language model (LLM) pipeline. Building upon this dataset, we propose PETAR-4B, a 3D mask-aware vision-language model that integrates PET, CT, and lesion contours for spatially grounded report generation. PETAR bridges global contextual reasoning with fine-grained lesion awareness, producing clinically coherent and localized findings. Comprehensive automated and human evaluations demonstrate that PETAR substantially improves PET/CT report generation quality, advancing 3D medical vision-language understanding.
zh

[CV-3] Vision Transformer for Robust Occluded Person Reidentification in Complex Surveillance Scenes

【速读】:该论文旨在解决监控场景下行人重识别(Person Re-identification, ReID)面临的遮挡(occlusion)、视角畸变(viewpoint distortion)和图像质量差等挑战,尤其针对现有方法依赖复杂模块或仅在清晰正面图像上表现良好的局限性。其解决方案的关键在于提出一种轻量且鲁棒的模型 Sh-ViT(Shuffling Vision Transformer),通过三个核心设计实现:1)在最终 Transformer 层引入 Shuffle 模块以打破空间相关性,增强对遮挡与模糊的鲁棒性;2)采用场景自适应的数据增强策略(几何变换、擦除、模糊和色彩调整)模拟真实监控环境;3)基于 DeiT 的知识蒸馏机制,在有限标注数据下提升学习效率。实验表明,该方法在自建 MyTT 数据集和 Market1501 上均显著优于 CNN 和 ViT 基线,验证了其在实际监控场景中对遮挡和图像退化的强适应能力。

链接: https://arxiv.org/abs/2510.27677
作者: Bo Li,Duyuan Zheng,Xinyang Liu,Qingwen Li,Hong Li,Hongyan Cui,Ge Gao,Chen Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages,conference

点击查看摘要

Abstract:Person re-identification (ReID) in surveillance is challenged by occlusion, viewpoint distortion, and poor image quality. Most existing methods rely on complex modules or perform well only on clear frontal images. We propose Sh-ViT (Shuffling Vision Transformer), a lightweight and robust model for occluded person ReID. Built on ViT-Base, Sh-ViT introduces three components: First, a Shuffle module in the final Transformer layer to break spatial correlations and enhance robustness to occlusion and blur; Second, scenario-adapted augmentation (geometric transforms, erasing, blur, and color adjustment) to simulate surveillance conditions; Third, DeiT-based knowledge distillation to improve learning with limited data. To support real-world evaluation, we construct the MyTT dataset, containing over 10,000 pedestrians and 30,000+ images from base station inspections, with frequent equipment occlusion and camera variations. Experiments show that Sh-ViT achieves 83.2% Rank-1 and 80.1% mAP on MyTT, outperforming CNN and ViT baselines, and 94.6% Rank-1 and 87.5% mAP on Market1501, surpassing state-of-the-art methods. In summary, Sh-ViT improves robustness to occlusion and blur without external modules, offering a practical solution for surveillance-based personnel monitoring.
zh
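
Shuffle 模块的思想可示意如下:在最后一层 Transformer 前随机打乱 patch token、保留 class token,迫使分类头不依赖固定的空间相关性("仅训练期启用"为本示意的假设):

```python
import torch
import torch.nn as nn

class TokenShuffle(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1+N, D),首位为 class token
        if not self.training:
            return x
        b, n, d = x.shape
        perm = torch.stack([torch.randperm(n - 1) for _ in range(b)])  # 逐样本打乱
        idx = perm.unsqueeze(-1).expand(-1, -1, d).to(x.device)
        patches = torch.gather(x[:, 1:], 1, idx)
        return torch.cat([x[:, :1], patches], dim=1)

tokens = torch.randn(2, 197, 768)     # 类 ViT-Base 的 token 序列(随机占位)
print(TokenShuffle()(tokens).shape)   # torch.Size([2, 197, 768])
```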

[CV-4] Deep learning denoising unlocks quantitative insights in operando materials microscopy

【速读】:该论文旨在解决原位显微成像(operando microscopy)中因测量噪声导致的有效分辨率受限及定量分析可靠性下降的问题。其核心挑战在于,噪声会干扰对功能材料动态化学与物理过程的准确捕捉,从而影响基于偏微分方程(PDE)约束优化的模型学习精度与不确定性控制。解决方案的关键在于提出一个通用的、基于无监督深度学习的去噪框架,该框架可无缝集成到跨模态(如STXM、光学显微镜、中子断层扫描)和多尺度的定量显微成像工作流中,通过保留物理保真度、引入最小偏差并显著降低模型学习中的不确定性,实现对纳米级化学/结构异质性等关键信息的可靠解析。

链接: https://arxiv.org/abs/2510.27667
作者: Samuel Degnan-Morgenstern,Alexander E. Cohen,Rajeev Gopal,Megan Gober,George J. Nelson,Peng Bai,Martin Z. Bazant
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci)
备注:

点击查看摘要

Abstract:Operando microscopy provides direct insight into the dynamic chemical and physical processes that govern functional materials, yet measurement noise limits the effective resolution and undermines quantitative analysis. Here, we present a general framework for integrating unsupervised deep learning-based denoising into quantitative microscopy workflows across modalities and length scales. Using simulated data, we demonstrate that deep denoising preserves physical fidelity, introduces minimal bias, and reduces uncertainty in model learning with partial differential equation (PDE)-constrained optimization. Applied to experiments, denoising reveals nanoscale chemical and structural heterogeneity in scanning transmission X-ray microscopy (STXM) of lithium iron phosphate (LFP), enables automated particle segmentation and phase classification in optical microscopy of graphite electrodes, and reduces noise-induced variability by nearly 80% in neutron radiography to resolve heterogeneous lithium transport. Collectively, these results establish deep denoising as a powerful, modality-agnostic enhancement that advances quantitative operando imaging and extends the reach of previously noise-limited techniques.
zh

[CV-5] Imbalanced Classification through the Lens of Spurious Correlations

【速读】:该论文旨在解决机器学习中类别不平衡(class imbalance)导致的分类性能不可靠问题。现有方法多聚焦于数据重采样或损失函数加权,而本文提出将不平衡视为一种数据条件,会通过少数类的欠定义(underspecification)放大“聪明汉斯效应”(Clever Hans, CH)——即模型依赖表面相关性而非真实因果关系进行预测。其解决方案的关键在于利用可解释人工智能(Explainable AI, XAI)技术,基于反事实解释(counterfactual explanations)联合识别并消除因类别不平衡引发的CH效应,从而提升模型鲁棒性与公平性。

链接: https://arxiv.org/abs/2510.27650
作者: Jakob Hackstein,Sidney Bender
机构: TU Berlin (柏林工业大学); BASF SE (巴斯夫公司)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Class imbalance poses a fundamental challenge in machine learning, frequently leading to unreliable classification performance. While prior methods focus on data- or loss-reweighting schemes, we view imbalance as a data condition that amplifies Clever Hans (CH) effects by underspecification of minority classes. In a counterfactual explanations-based approach, we propose to leverage Explainable AI to jointly identify and eliminate CH effects emerging under imbalance. Our method achieves competitive classification performance on three datasets and demonstrates how CH effects emerge under imbalance, a perspective largely overlooked by existing approaches.
zh

[CV-6] Gaussian Combined Distance: A Generic Metric for Object Detection

【速读】:该论文旨在解决当前目标检测中基于IoU(Intersection over Union)的相似性度量在小物体检测任务中因对位置偏移敏感而导致性能下降的问题,以及现有替代方案 Wasserstein Distance 因缺乏尺度不变性且独立优化中心属性导致模型收敛慢、定位精度不足的局限。其解决方案的关键在于提出一种新的高斯联合距离(Gaussian Combined Distance, GCD),该距离度量不仅具备尺度不变性,还支持边界框回归中的联合优化,从而显著提升模型的定位准确性和泛化能力。实验表明,GCD 作为回归损失函数和标签分配度量,在 AI-TOD-v2、MS-COCO-2017 和 Visdrone-2019 等多尺度数据集上均实现了优于 Wasserstein Distance 的检测性能。

链接: https://arxiv.org/abs/2510.27649
作者: Ziqian Guan,Xieyi Fu,Pengjun Huang,Hengyuan Zhang,Hubin Du,Yongtao Liu,Yinglin Wang,Qang Ma
机构: North China Institute of Science and Technology (华北科技学院); Hegang Industrial Technology Service Co., Ltd. (鹤岗工业技术服务有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper is accepted by the GRSL in 2025

点击查看摘要

Abstract:In object detection, a well-defined similarity metric can significantly enhance model performance. Currently, the IoU-based similarity metric is the most commonly preferred choice for detectors. However, detectors using IoU as a similarity metric often perform poorly when detecting small objects because of their sensitivity to minor positional deviations. To address this issue, recent studies have proposed the Wasserstein Distance as an alternative to IoU for measuring the similarity of Gaussian-distributed bounding boxes. However, we have observed that the Wasserstein Distance lacks scale invariance, which negatively impacts the model’s generalization capability. Additionally, when used as a loss function, its independent optimization of the center attributes leads to slow model convergence and unsatisfactory detection precision. To address these challenges, we introduce the Gaussian Combined Distance (GCD). Through analytical examination of GCD and its gradient, we demonstrate that GCD not only possesses scale invariance but also facilitates joint optimization, which enhances model localization performance. Extensive experiments on the AI-TOD-v2 dataset for tiny object detection show that GCD, as a bounding box regression loss function and label assignment metric, achieves state-of-the-art performance across various detectors. We further validated the generalizability of GCD on the MS-COCO-2017 and Visdrone-2019 datasets, where it outperforms the Wasserstein Distance across diverse scales of datasets. Code is available at this https URL.
zh
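
GCD 的具体公式以原文为准。作为背景,下面示意其对比基线:把轴对齐框 (cx, cy, w, h) 建模为二维高斯 N([cx, cy], diag((w/2)², (h/2)²)) 后,两框间的平方 2-Wasserstein 距离有闭式解;示例同时演示了摘要批评的"缺乏尺度不变性"(归一化常数 C 依数据集而定,此处为假设值):

```python
import math

def wasserstein2_sq(box_a, box_b):
    (cxa, cya, wa, ha), (cxb, cyb, wb, hb) = box_a, box_b
    center = (cxa - cxb) ** 2 + (cya - cyb) ** 2      # 中心项
    shape = (wa - wb) ** 2 / 4 + (ha - hb) ** 2 / 4   # 形状项
    return center + shape

C = 12.8  # 数据集相关的归一化常数(假设值)
small, small_shift = (10, 10, 4, 4), (11, 10, 4, 4)
big, big_shift = (100, 100, 40, 40), (110, 100, 40, 40)
print(wasserstein2_sq(small, small_shift))  # 1.0
print(wasserstein2_sq(big, big_shift))      # 100.0:同样的相对偏移,距离随尺度放大
print(math.exp(-math.sqrt(wasserstein2_sq(small, small_shift)) / C))  # 指数映射为相似度
```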

[CV-7] NegoCollab: A Common Representation Negotiation Approach for Heterogeneous Collaborative Perception NEURIPS2025

【速读】:该论文旨在解决异构协作感知(heterogeneous collaborative perception)中因参与智能体使用固定且不同的感知模型而导致的中间特征域差异(domain gap)问题,从而影响协作性能。其核心解决方案是提出NegoCollab方法,关键在于引入一个训练阶段的协商器(negotiator),从各模态智能体的本地表示中动态推导出一个协商一致的公共表示(negotiated common representation),并通过发送者-接收者机制实现本地表示空间与公共表示空间之间的双向特征转换。此外,通过结构对齐损失和实用对齐损失补充分布对齐损失,强化多模态信息在公共表示中的知识蒸馏效果,显著降低异构智能体间的域差距。

链接: https://arxiv.org/abs/2510.27647
作者: Congzhang Shao,Quan Yuan,Guiyang Luo,Yue Hu,Danni Wang,Yilin Liu,Rui Pan,Bo Chen,Jinglin Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, Accepted by NeurIPS 2025

点击查看摘要

Abstract:Collaborative perception improves task performance by expanding the perception range through information sharing among agents. Immutable heterogeneity poses a significant challenge in collaborative perception, as participating agents may employ different and fixed perception models. This leads to domain gaps in the intermediate features shared among agents, consequently degrading collaborative performance. Aligning the features of all agents to a common representation can eliminate domain gaps with low training cost. However, in existing methods, the common representation is designated as the representation of a specific agent, making it difficult for agents with significant domain discrepancies from this specific agent to achieve proper alignment. This paper proposes NegoCollab, a heterogeneous collaboration method based on the negotiated common representation. It introduces a negotiator during training to derive the common representation from the local representations of each modality’s agent, effectively reducing the inherent domain gap with the various local representations. In NegoCollab, the mutual transformation of features between the local representation space and the common representation space is achieved by a pair of sender and receiver. To better align local representations to the common representation containing multimodal information, we introduce structural alignment loss and pragmatic alignment loss in addition to the distribution alignment loss to supervise the training. This enables the knowledge in the common representation to be fully distilled into the sender.
zh

[CV-8] VessShape: Few-shot 2D blood vessel segmentation by leverag ing shape priors from synthetic images

【速读】:该论文旨在解决血血管语义分割任务中因标注数据稀缺和模型跨成像模态泛化能力差而导致的性能瓶颈问题。其解决方案的关键在于引入VessShape方法,通过生成大规模2D合成数据集来强化模型对血管几何先验(如管状结构与分支特性)的学习,从而减少对纹理特征的依赖,提升模型的数据效率与跨域泛化能力。实验表明,基于VessShape预训练的模型在少量样本微调下即可实现优异的分割性能,并具备显著的零样本迁移能力。

链接: https://arxiv.org/abs/2510.27646
作者: Cesar H. Comin,Wesley N. Galvão
机构: Federal University of São Carlos (圣卡洛斯联邦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Semantic segmentation of blood vessels is an important task in medical image analysis, but its progress is often hindered by the scarcity of large annotated datasets and the poor generalization of models across different imaging modalities. A key aspect is the tendency of Convolutional Neural Networks (CNNs) to learn texture-based features, which limits their performance when applied to new domains with different visual characteristics. We hypothesize that leveraging geometric priors of vessel shapes, such as their tubular and branching nature, can lead to more robust and data-efficient models. To investigate this, we introduce VessShape, a methodology for generating large-scale 2D synthetic datasets designed to instill a shape bias in segmentation models. VessShape images contain procedurally generated tubular geometries combined with a wide variety of foreground and background textures, encouraging models to learn shape cues rather than textures. We demonstrate that a model pre-trained on VessShape images achieves strong few-shot segmentation performance on two real-world datasets from different domains, requiring only four to ten samples for fine-tuning. Furthermore, the model exhibits notable zero-shot capabilities, effectively segmenting vessels in unseen domains without any target-specific training. Our results indicate that pre-training with a strong shape bias can be an effective strategy to overcome data scarcity and improve model generalization in blood vessel segmentation.
zh

[CV-9] Sketch-to-Layout: Sketch-Guided Multimodal Layout Generation ICCV2025

【速读】:该论文旨在解决当前图形布局生成(graphic layout generation)研究中用户约束表达复杂、难以操作的问题,即现有方法依赖繁琐的规则或参数设定来引导布局生成,限制了实际可用性。其核心解决方案是引入一种创新的“草图到布局”(sketch-to-layout)范式,利用用户提供的手绘草图作为直观约束条件,从而简化交互流程并提升设计体验。关键在于提出了一种基于多模态Transformer的模型架构,以草图和内容资产(content assets)为输入,直接生成高质量布局;同时,为克服真实草图数据稀缺问题,设计了一种高效且可扩展的合成草图生成方法,显著降低了训练数据获取成本,并在PubLayNet、DocLayNet和SlidesVQA三个公开数据集上验证了方法的有效性,优于当前最先进的约束驱动方法。

链接: https://arxiv.org/abs/2510.27632
作者: Riccardo Brioschi,Aleksandr Alekseev,Emanuele Nevali,Berkay Döner,Omar El Malki,Blagoj Mitrevski,Leandro Kieliger,Mark Collier,Andrii Maksai,Jesse Berent,Claudiu Musat,Efi Kokiopoulou
机构: EPFL(瑞士联邦理工学院); Google DeepMind(谷歌深度思维)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 15 pages, 18 figures, GitHub link: this https URL , accept at ICCV 2025 Workshop (HiGen)

点击查看摘要

Abstract:Graphic layout generation is a growing research area focusing on generating aesthetically pleasing layouts ranging from poster designs to documents. While recent research has explored ways to incorporate user constraints to guide the layout generation, these constraints often require complex specifications which reduce usability. We introduce an innovative approach exploiting user-provided sketches as intuitive constraints and we demonstrate empirically the effectiveness of this new guidance method, establishing the sketch-to-layout problem as a promising research direction, which is currently under-explored. To tackle the sketch-to-layout problem, we propose a multimodal transformer-based solution using the sketch and the content assets as inputs to produce high quality layouts. Since collecting sketch training data from human annotators to train our model is very costly, we introduce a novel and efficient method to synthetically generate training sketches at scale. We train and evaluate our model on three publicly available datasets: PubLayNet, DocLayNet and SlidesVQA, demonstrating that it outperforms state-of-the-art constraint-based methods, while offering a more intuitive design experience. In order to facilitate future sketch-to-layout research, we release O(200k) synthetically-generated sketches for the public datasets above. The datasets are available at this https URL.
zh

[CV-10] Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

【速读】:该论文旨在解决视觉-语言-动作模型(Vision-Language-Action models, VLAs)在联合预测下一状态观测和动作序列时因模态差异带来的挑战,即如何有效处理不同模态之间的冲突以提升机器人策略学习性能。其解决方案的关键在于提出双流扩散模型(DUal-STream diffusion, DUST),该框架采用多模态扩散Transformer架构,显式维护独立的模态流(vision 和 action),同时通过跨模态知识共享机制实现信息融合;此外,引入针对各模态独立的噪声扰动和解耦的流匹配损失(flow-matching loss),使模型能够在无需统一潜在空间的情况下双向学习联合分布。这一设计支持测试时缩放(test-time scaling),允许视觉与动作标记以不同速率异步演化,从而显著提升泛化能力与实际表现。

链接: https://arxiv.org/abs/2510.27607
作者: John Won,Kyungmin Lee,Huiwon Jang,Dongyoung Kim,Jinwoo Shin
机构: KAIST; RLWRLD
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 20 pages, 10 figures

点击查看摘要

Abstract:Recently, augmenting Vision-Language-Action models (VLAs) with world modeling has shown promise in improving robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while still enabling cross-modal knowledge sharing. In addition, we introduce independent noise perturbations for each modality and a decoupled flow-matching loss. This design enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Based on the decoupling of modalities during training, we also introduce a joint sampling method that supports test-time scaling, where action and vision tokens evolve asynchronously at different rates. Through experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over baseline methods, while our test-time scaling approach provides an additional 2-5% boost. On real-world tasks with the Franka Research 3, DUST improves success rates by 13%, confirming its effectiveness beyond simulation. Furthermore, pre-training on action-free videos from BridgeV2 yields significant transfer gains on RoboCasa, underscoring DUST’s potential for large-scale VLA pretraining.
zh
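
"解耦的 flow matching"训练目标可用如下玩具级示意表达:视觉与动作两个流各自独立采样噪声和时间步,两项损失直接相加。速度网络为占位,非论文中的双流扩散 Transformer:

```python
import torch
import torch.nn as nn

def flow_matching_loss(model, x0):
    t = torch.rand(x0.size(0), 1)            # 本模态独立的时间步
    noise = torch.randn_like(x0)             # 本模态独立的噪声
    xt = (1 - t) * x0 + t * noise            # 线性插值路径
    target_v = noise - x0                    # 该路径下的目标速度
    return ((model(xt, t) - target_v) ** 2).mean()

class ToyVelocityNet(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(),
                                 nn.Linear(128, dim))
    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

vision_net, action_net = ToyVelocityNet(32), ToyVelocityNet(8)
vision, action = torch.randn(16, 32), torch.randn(16, 8)  # 两模态潜特征(占位)
loss = flow_matching_loss(vision_net, vision) + flow_matching_loss(action_net, action)
loss.backward()
```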

[CV-11] Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning

【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在空间理解能力上的不足问题。现有监督微调(Supervised Fine-Tuning, SFT)和基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)方法依赖昂贵的人工标注、专用工具或受限环境,难以规模化。解决方案的关键在于提出一种自监督强化学习范式——Spatial-SSRL,其核心创新是直接从普通的RGB或RGB-D图像中自动构建五类预训练任务,包括打乱补丁重排序、翻转补丁识别、裁剪补丁修复、区域深度排序和相对3D位置预测,这些任务提供易于验证的真值信号且无需人工或LVLM标注,从而实现高效、可扩展的空间推理能力提升。

链接: https://arxiv.org/abs/2510.27606
作者: Yuhong Liu,Beichen Zhang,Yuhang Zang,Yuhang Cao,Long Xing,Xiaoyi Dong,Haodong Duan,Dahua Lin,Jiaqi Wang
机构: Shanghai AI Laboratory (上海人工智能实验室); Shanghai Jiaotong University (上海交通大学); The Chinese University of Hong Kong (香港中文大学); Shanghai Innovation Institute (上海创新研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: preprint

点击查看摘要

Abstract:Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
zh
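
以"打乱补丁重排序"为例,这类预训练任务的"真值可自动验证"特性可示意如下(网格大小为任意假设):打乱所用的排列即是现成的标准答案,无需任何人工或模型标注。

```python
import numpy as np

def make_reordering_sample(image: np.ndarray, grid: int = 2, seed: int = 0):
    h, w, c = image.shape
    th, tw = h // grid, w // grid
    tiles = [image[i*th:(i+1)*th, j*tw:(j+1)*tw]
             for i in range(grid) for j in range(grid)]
    perm = np.random.default_rng(seed).permutation(len(tiles))  # 真值答案
    shuffled = [tiles[k] for k in perm]
    rows = [np.concatenate(shuffled[r*grid:(r+1)*grid], axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0), perm

img = np.arange(8 * 8 * 3, dtype=np.uint8).reshape(8, 8, 3)
puzzle, answer = make_reordering_sample(img)
print(puzzle.shape, answer)  # 可验证奖励:模型是否还原出 answer
```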

[CV-12] Who Made This? Fake Detection and Source Attribution with Diffusion Features

【速读】:该论文旨在解决生成式扩散模型(generative diffusion models)产生的合成图像日益难以与真实图像区分所带来的真实性验证、版权保护和虚假信息传播问题。现有监督式检测方法在跨未见生成器场景下泛化能力差,且依赖大量标注数据并需频繁重新训练。解决方案的关键在于利用预训练扩散模型内部激活特征(diffusion features)进行深度伪造检测与源生成器识别,通过k近邻分类器直接在扩散特征空间中实现无需微调的跨生成器检测性能最优,并结合紧凑神经网络实现高精度源生成器归属,表明扩散表示天然编码了生成器特异性模式,为合成图像取证提供了简洁且可解释的基础。

链接: https://arxiv.org/abs/2510.27602
作者: Simone Bonechi,Paolo Andreini,Barbara Toniella Corradini
机构: Department of Information Engineering and Mathematics, University of Siena (西耶纳大学信息工程与数学系), Italy
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid progress of generative diffusion models has enabled the creation of synthetic images that are increasingly difficult to distinguish from real ones, raising concerns about authenticity, copyright, and misinformation. Existing supervised detectors often struggle to generalize across unseen generators, requiring extensive labeled data and frequent retraining. We introduce FRIDA (Fake-image Recognition and source Identification via Diffusion-features Analysis), a lightweight framework that leverages internal activations from a pre-trained diffusion model for deepfake detection and source generator attribution. A k-nearest-neighbor classifier applied to diffusion features achieves state-of-the-art cross-generator performance without fine-tuning, while a compact neural model enables accurate source attribution. These results show that diffusion representations inherently encode generator-specific patterns, providing a simple and interpretable foundation for synthetic image forensics.
zh
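
溯源环节本质上只是特征空间中的 k 近邻分类。以下示意中,扩散特征的提取(论文贡献所在)以随机向量代替:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# 占位特征:真实图、生成器 A、生成器 B 各 100 条 256 维向量
X = np.vstack([rng.normal(loc=m, size=(100, 256)) for m in (0.0, 0.5, -0.5)])
y = np.repeat(["real", "gen_A", "gen_B"], 100)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
query = rng.normal(loc=0.5, size=(1, 256))   # 待检图像的扩散特征(占位)
print(knn.predict(query), knn.predict_proba(query))
```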

[CV-13] ANCHOR: Integrating Adversarial Training with Hard-mined Supervised Contrastive Learning for Robust Representation Learning

【速读】:该论文旨在解决神经网络在面对对抗攻击时的脆弱性问题,即模型参数优化过程中依赖梯度所导致的易受微小、人眼不可察觉扰动影响而产生错误预测的现象。解决方案的关键在于提出Adversarially-trained Contrastive Hard-mining for Optimized Robustness (ANCHOR)框架,该框架结合监督对比学习与显式困难正样本挖掘机制,促使模型在嵌入空间中将原始图像、其数据增强版本及对抗扰动后的图像聚类到同一类别内,同时与其他类别分离,从而引导模型聚焦于稳定且语义明确的特征表示,而非对梯度敏感的脆弱线索,显著提升模型在干净样本和对抗样本下的准确率与鲁棒性。

链接: https://arxiv.org/abs/2510.27599
作者: Samarup Bhattacharya,Anubhab Bhattacharya,Abir Chakraborty
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 1 figure

点击查看摘要

Abstract:Neural networks have changed the way machines interpret the world. At their core, they learn by following gradients, adjusting their parameters step by step until they identify the most discriminant patterns in the data. This process gives them their strength, yet it also opens the door to a hidden flaw. The very gradients that help a model learn can also be used to produce small, imperceptible tweaks that cause the model to completely alter its decision. Such tweaks are called adversarial attacks. These attacks exploit this vulnerability by adding tiny, imperceptible changes to images that, while leaving them identical to the human eye, cause the model to make wrong predictions. In this work, we propose Adversarially-trained Contrastive Hard-mining for Optimized Robustness (ANCHOR), a framework that leverages the power of supervised contrastive learning with explicit hard positive mining to enable the model to learn representations for images such that the embeddings for the images, their augmentations, and their perturbed versions cluster together in the embedding space along with those for other images of the same class while being separated from images of other classes. This alignment helps the model focus on stable, meaningful patterns rather than fragile gradient cues. On CIFAR-10, our approach achieves impressive results for both clean and robust accuracy under PGD-20 (epsilon = 0.031), outperforming standard adversarial training methods. Our results indicate that combining adversarial guidance with hard-mined contrastive supervision helps models learn more structured and robust representations, narrowing the gap between accuracy and robustness.
zh
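
"显式困难正样本挖掘的监督对比损失"的一种最小示意如下:每个锚点只让相似度最低的正样本(例如其对抗扰动视图)进入分子项。批构造与温度系数为假设,具体损失形式以论文为准:

```python
import torch
import torch.nn.functional as F

def hard_positive_supcon(z, labels, tau=0.1):
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / tau                                  # (B, B) 相似度
    eye = torch.eye(len(z), dtype=torch.bool)
    pos = (labels[:, None] == labels[None, :]) & ~eye      # 同类且非自身
    denom = torch.logsumexp(sim.masked_fill(eye, float("-inf")), dim=1)
    hard_pos = sim.masked_fill(~pos, float("inf")).min(dim=1).values  # 最难正样本
    return (denom - hard_pos).mean()                       # -log p(最难正样本)

z = torch.randn(8, 128, requires_grad=True)  # 干净/增强/对抗视图的嵌入(占位)
labels = torch.tensor([0, 0, 0, 1, 1, 1, 2, 2])
hard_positive_supcon(z, labels).backward()
```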

[CV-14] Image Hashing via Cross-View Code Alignment in the Age of Foundation Models

【速读】:该论文旨在解决大规模检索中高维嵌入表示的计算效率问题,即如何在保持语义区分度的同时实现紧凑且高效的二值化编码(binary codes),以支持快速哈希距离搜索。现有方法通常依赖复杂流水线、多目标优化、单一学习范式专用设计及较长训练时间,难以兼顾性能与效率。其解决方案的关键在于提出CroVCA(Cross-View Code Alignment)原则——通过单一二元交叉熵损失强制跨语义对齐视图间的代码一致性,并利用编码率最大化作为防坍缩正则项促进代码分布的平衡与多样性;同时设计轻量级HashCoder网络结构,采用最终批归一化层约束代码均衡性,可直接作为冻结嵌入的探测头或结合LoRA微调高效适配编码器,在多个基准测试中仅用5个训练轮次即可达到SOTA效果,显著提升效率与适用性。

链接: https://arxiv.org/abs/2510.27584
作者: Ilyass Moummad,Kawtar Zaher,Hervé Goëau,Alexis Joly
机构: INRIA, LIRMM, Université de Montpellier, France (法国国家信息与自动化研究院,蒙彼利埃大学); Institut National de l’Audiovisuel, France (法国国家视听研究所); CIRAD, UMR AMAP, Montpellier, Occitanie, France (国际热带农业研究中心,蒙彼利埃,奥克西塔尼)
类目: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Efficient large-scale retrieval requires representations that are both compact and discriminative. Foundation models provide powerful visual and multimodal embeddings, but nearest neighbor search in these high-dimensional spaces is computationally expensive. Hashing offers an efficient alternative by enabling fast Hamming distance search with binary codes, yet existing approaches often rely on complex pipelines, multi-term objectives, designs specialized for a single learning paradigm, and long training times. We introduce CroVCA (Cross-View Code Alignment), a simple and unified principle for learning binary codes that remain consistent across semantically aligned views. A single binary cross-entropy loss enforces alignment, while coding-rate maximization serves as an anti-collapse regularizer to promote balanced and diverse codes. To implement this, we design HashCoder, a lightweight MLP hashing network with a final batch normalization layer to enforce balanced codes. HashCoder can be used as a probing head on frozen embeddings or to adapt encoders efficiently via LoRA fine-tuning. Across benchmarks, CroVCA achieves state-of-the-art results in just 5 training epochs. At 16 bits, it performs particularly well: for instance, unsupervised hashing on COCO completes in under 2 minutes and supervised hashing on ImageNet100 in about 3 minutes on a single GPU. These results highlight CroVCA’s efficiency, adaptability, and broad applicability.
zh
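
HashCoder 头部与跨视图 BCE 对齐项可示意如下(网络尺寸与损失配对方式为假设;论文中的编码率最大化正则项此处省略):末层 BatchNorm 使每个比特的 logit 以零为中心,从而得到均衡的编码。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HashCoder(nn.Module):
    def __init__(self, in_dim=768, bits=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, bits), nn.BatchNorm1d(bits))  # 末层 BN 平衡比特

    def forward(self, x):
        return self.net(x)                               # 每比特的实值 logit

def cross_view_alignment(la, lb):
    # 以对方视图的硬编码为 BCE 目标(具体配对方式为假设)
    return (F.binary_cross_entropy_with_logits(la, (lb.detach() > 0).float())
            + F.binary_cross_entropy_with_logits(lb, (la.detach() > 0).float()))

coder = HashCoder()
view_a, view_b = torch.randn(32, 768), torch.randn(32, 768)  # 对齐的两视图(占位)
cross_view_alignment(coder(view_a), coder(view_b)).backward()
with torch.no_grad():
    codes = (coder(view_a) > 0).to(torch.uint8)  # 用于汉明检索的二值码
```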

[CV-15] MapSAM2: Adapting SAM2 for Automatic Segmentation of Historical Map Images and Time Series

【速读】:该论文旨在解决历史地图图像及时间序列的自动化分割与关联问题,其核心挑战在于历史地图风格差异大、标注数据稀缺,且构建跨时间的时空关联数据集耗时费力。解决方案的关键在于提出MapSAM2框架,通过将单张历史地图图像和多时相地图序列统一建模为“视频”形式:对图像采用分块处理并引入记忆注意力机制以增强几何准确性;对时间序列则构建了Siegfried建筑时间序列数据集,并通过模拟常见时空变换生成伪标签数据以降低人工标注成本。此方法在有限监督下实现了建筑物的精准分割与跨时间链接,显著提升了历史地图分析的自动化水平。

链接: https://arxiv.org/abs/2510.27547
作者: Xue Xia,Randall Balestriero,Tao Zhang,Yixin Zhou,Andrew Ding,Dev Saini,Lorenz Hurni
机构: ETH Zurich (苏黎世联邦理工学院); Brown University (布朗大学); Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Historical maps are unique and valuable archives that document geographic features across different time periods. However, automated analysis of historical map images remains a significant challenge due to their wide stylistic variability and the scarcity of annotated training data. Constructing linked spatio-temporal datasets from historical map time series is even more time-consuming and labor-intensive, as it requires synthesizing information from multiple maps. Such datasets are essential for applications such as dating buildings, analyzing the development of road networks and settlements, studying environmental changes etc. We present MapSAM2, a unified framework for automatically segmenting both historical map images and time series. Built on a visual foundation model, MapSAM2 adapts to diverse segmentation tasks with few-shot fine-tuning. Our key innovation is to treat both historical map images and time series as videos. For images, we process a set of tiles as a video, enabling the memory attention mechanism to incorporate contextual cues from similar tiles, leading to improved geometric accuracy, particularly for areal features. For time series, we introduce the annotated Siegfried Building Time Series Dataset and, to reduce annotation costs, propose generating pseudo time series from single-year maps by simulating common temporal transformations. Experimental results show that MapSAM2 learns temporal associations effectively and can accurately segment and link buildings in time series under limited supervision or using pseudo videos. We will release both our dataset and code to support future research.
zh

[CV-16] Deep Neural Watermarking for Robust Copyright Protection in 3D Point Clouds

【速读】:该论文旨在解决三维点云(3D point cloud)版权保护中水印信号易受几何与非几何攻击而失效的问题。传统水印方法在面对旋转、缩放、噪声、裁剪及信号失真等攻击时鲁棒性不足,难以实现可靠的内容认证与所有权验证。其解决方案的关键在于提出一种基于深度神经网络的鲁棒水印框架:首先利用奇异值分解(Singular Value Decomposition, SVD)将二进制水印嵌入到点云块的奇异值中,随后采用PointNet++神经网络架构训练一个端到端的提取模型,使其能够在多种攻击下准确恢复水印信息。实验表明,该方法在严重裁剪(Crop 70%)攻击下仍能实现高达0.83的比特准确率和0.80的交并比(IoU),显著优于传统SVD方法(比特准确率0.58,IoU 0.26),体现出深度学习驱动的水印提取机制在高鲁棒性和高保真度方面的优势。

链接: https://arxiv.org/abs/2510.27533
作者: Khandoker Ashik Uz Zaman,Mohammad Zahangir Alam,Mohammed N. M. Ali,Mahdi H. Miraz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:The protection of intellectual property has become critical due to the rapid growth of three-dimensional content in digital media. Unlike traditional images or videos, 3D point clouds present unique challenges for copyright enforcement, as they are especially vulnerable to a range of geometric and non-geometric attacks that can easily degrade or remove conventional watermark signals. In this paper, we address these challenges by proposing a robust deep neural watermarking framework for 3D point cloud copyright protection and ownership verification. Our approach embeds binary watermarks into the singular values of 3D point cloud blocks using spectral decomposition, i.e. Singular Value Decomposition (SVD), and leverages the extraction capabilities of Deep Learning using PointNet++ neural network architecture. The network is trained to reliably extract watermarks even after the data undergoes various attacks such as rotation, scaling, noise, cropping and signal distortions. We validated our method using the publicly available ModelNet40 dataset, demonstrating that deep learning-based extraction significantly outperforms traditional SVD-based techniques under challenging conditions. Our experimental evaluation demonstrates that the deep learning-based extraction approach significantly outperforms existing SVD-based methods with deep learning achieving bitwise accuracy up to 0.83 and Intersection over Union (IoU) of 0.80, compared to SVD achieving a bitwise accuracy of 0.58 and IoU of 0.26 for the Crop (70%) attack, which is the most severe geometric distortion in our experiment. This demonstrates our method’s ability to achieve superior watermark recovery and maintain high fidelity even under severe distortions.
zh

[CV-17] Context-Gated Cross-Modal Perception with Visual Mamba for PET-CT Lung Tumor Segmentation

【速读】:该论文旨在解决肺部肿瘤分割中如何有效融合PET(正电子发射断层成像)与CT(计算机断层扫描)图像的 anatomical 和 functional 信息这一关键挑战。解决方案的核心在于提出了一种轻量级多模态框架 vMambaX,其关键创新是引入了 Context-Gated Cross-Modal Perception Module (CGM),通过自适应地增强跨模态特征交互,在强调信息丰富区域的同时抑制噪声,从而提升分割精度并保持较低的计算复杂度。

链接: https://arxiv.org/abs/2510.27508
作者: Elena Mulero Ayllón,Linlin Shen,Pierangelo Veltri,Fabrizia Gelardi,Arturo Chiti,Paolo Soda,Matteo Tortora
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate lung tumor segmentation is vital for improving diagnosis and treatment planning, and effectively combining anatomical and functional information from PET and CT remains a major challenge. In this study, we propose vMambaX, a lightweight multimodal framework integrating PET and CT scan images through a Context-Gated Cross-Modal Perception Module (CGM). Built on the Visual Mamba architecture, vMambaX adaptively enhances inter-modality feature interaction, emphasizing informative regions while suppressing noise. Evaluated on the PCLT20K dataset, the model outperforms baseline models while maintaining lower computational complexity. These results highlight the effectiveness of adaptive cross-modal gating for multimodal tumor segmentation and demonstrate the potential of vMambaX as an efficient and scalable framework for advanced lung cancer analysis. The code is available at this https URL.
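
A toy version of the gating idea behind the CGM is sketched below: a learned, spatially varying gate blends PET and CT features. The real module is built on Visual Mamba and is considerably richer; only the context-gated mixing is illustrated here, and all shapes and layer choices are assumptions.

```python
import torch
import torch.nn as nn

class ContextGatedFusion(nn.Module):
    """Per-voxel gate deciding how much to trust the PET branch vs. the CT
    branch; a minimal stand-in for context-gated cross-modal perception."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv3d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),  # gate values in (0, 1)
        )

    def forward(self, pet: torch.Tensor, ct: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([pet, ct], dim=1))
        return g * pet + (1.0 - g) * ct  # convex, context-dependent mix

fuse = ContextGatedFusion(channels=16)
pet = torch.randn(1, 16, 8, 32, 32)  # B, C, D, H, W
ct = torch.randn(1, 16, 8, 32, 32)
print(fuse(pet, ct).shape)  # torch.Size([1, 16, 8, 32, 32])
```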
zh

[CV-18] ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning


【速读】:该论文旨在解决多模态推理中语言与视觉信息如何有效协同的问题,特别是缺乏对有意义的交错式思维链(interleaved chain of thought)的明确界定。现有方法往往将文本和图像思维视为同构模态,忽略了二者应作为互补模态相互促进推理过程的本质。解决方案的关键在于提出ThinkMorph模型,该模型通过在24K高质量交错推理轨迹上进行微调,学习生成逐步推进的文本-图像推理步骤,在保持连贯语义逻辑的同时,具体操作视觉内容。这一设计使模型在视觉主导任务上显著提升性能(平均比基础模型高出34.7%),并展现出跨域泛化能力,甚至优于更大规模的专有视觉语言模型(VLMs),同时表现出涌现的多模态智能特性,如未见过的视觉操作技能、推理模式自适应切换及测试时的多样化扩展能力。

链接: https://arxiv.org/abs/2510.27492
作者: Jiawei Gu,Yunzhuo Hao,Huichen Will Wang,Linjie Li,Michael Qizhe Shieh,Yejin Choi,Ranjay Krishna,Yu Cheng
机构: National University of Singapore (新加坡国立大学); Zhejiang University (浙江大学); University of Washington (华盛顿大学); Stanford University (斯坦福大学); The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: project page: this https URL

点击查看摘要

Abstract:Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary, rather than isomorphic, modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on 24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text-image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7% over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.
zh

[CV-19] NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding NEURIPS2025

【速读】:该论文旨在解决水下场景理解任务中因缺乏大规模多任务指令微调数据集而导致模型发展受限的问题,以及水下图像退化对感知性能的干扰。其关键解决方案是构建了包含145万(1.45M)图像-文本对的NautData数据集,支持八种水下场景理解任务,并提出一种即插即用的视觉特征增强(Vision Feature Enhancement, VFE)模块,该模块基于水下成像物理先验显式恢复清晰的水下信息,集成至LLaVA-1.5和Qwen2.5-VL等主流多模态大模型(Multimodal Large Models, MLLMs)中,形成NAUTILUS模型,显著提升了水下场景理解的鲁棒性和性能。

链接: https://arxiv.org/abs/2510.27481
作者: Wei Xu,Cheng Wang,Dingkang Liang,Zongchuang Zhao,Xingyu Jiang,Peng Zhang,Xiang Bai
机构: Huazhong University of Science and Technology (华中科技大学); National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to NeurIPS 2025. Data and models are available at this https URL

点击查看摘要

Abstract:Underwater exploration offers critical insights into our planet and attracts increasing attention for its broader applications in resource exploration, national security, etc. We study the underwater scene understanding methods, which aim to achieve automated underwater exploration. The underwater scene understanding task demands multi-task perceptions from multiple granularities. However, the absence of large-scale underwater multi-task instruction-tuning datasets hinders the progress of this research. To bridge this gap, we construct NautData, a dataset containing 1.45 M image-text pairs supporting eight underwater scene understanding tasks. It enables the development and thorough evaluation of the underwater scene understanding models. Underwater image degradation is a widely recognized challenge that interferes with underwater tasks. To improve the robustness of underwater scene understanding, we introduce physical priors derived from underwater imaging models and propose a plug-and-play vision feature enhancement (VFE) module, which explicitly restores clear underwater information. We integrate this module into renowned baselines LLaVA-1.5 and Qwen2.5-VL and build our underwater LMM, NAUTILUS. Experiments conducted on the NautData and public underwater datasets demonstrate the effectiveness of the VFE module, consistently improving the performance of both baselines on the majority of supported tasks, thus ensuring the superiority of NAUTILUS in the underwater scene understanding area. Data and models are available at this https URL.
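
The physical prior a VFE-style module builds on is the classic underwater imaging model I = J·t + B·(1 − t). A rough numpy inversion under crude assumptions (background light B from the brightest pixels, transmission t from a red-channel prior, since red attenuates fastest underwater) is sketched below; this illustrates the physics only and is not the paper's module.

```python
import numpy as np

def restore_underwater(frame: np.ndarray, t_min: float = 0.1) -> np.ndarray:
    """Invert I = J * t + B * (1 - t) with crude estimates of B and t."""
    img = frame.astype(np.float32) / 255.0
    flat = img.reshape(-1, 3)
    # Background light: mean of the 100 brightest pixels.
    B = flat[np.argsort(img.mean(axis=2).ravel())[-100:]].mean(axis=0)
    # Transmission from how much the red channel has been attenuated.
    t = np.clip(1.0 - (B[0] - img[..., 0]) / max(float(B[0]), 1e-6), t_min, 1.0)
    J = (img - B) / t[..., None] + B
    return np.clip(J * 255.0, 0.0, 255.0).astype(np.uint8)

rng = np.random.default_rng(0)
frame = rng.integers(0, 255, size=(64, 64, 3), dtype=np.uint8)  # stand-in frame
print(restore_underwater(frame).shape)  # (64, 64, 3)
```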
zh

[CV-20] Referee: Reference-aware Audiovisual Deepfake Detection

【速读】:该论文旨在解决当前音频视频深度伪造(deepfake)检测方法在面对未见过的伪造类型时泛化能力不足的问题。解决方案的关键在于提出一种新颖的参考感知(reference-aware)多模态检测方法 Referee,其通过利用仅需单样本的说话人特定线索,超越传统基于时空伪影的检测方式,实现对深度伪造内容的更鲁棒识别。Referee 通过将参考内容与目标内容中的身份相关查询进行跨模态特征匹配与对齐,联合推理音频-视频同步性和身份一致性,从而提升跨数据集和跨语言场景下的检测性能。

链接: https://arxiv.org/abs/2510.27475
作者: Hyemin Boo,Eunsang Lee,Jiyoung Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: In Progress

点击查看摘要

Abstract:Since deepfakes generated by advanced generative models have rapidly posed serious threats, existing audiovisual deepfake detection approaches struggle to generalize to unseen forgeries. We propose a novel reference-aware audiovisual deepfake detection method, called Referee. Speaker-specific cues from only one-shot examples are leveraged to detect manipulations beyond spatiotemporal artifacts. By matching and aligning identity-related queries from reference and target content into cross-modal features, Referee jointly reasons about audiovisual synchrony and identity consistency. Extensive experiments on FakeAVCeleb, FaceForensics++, and KoDF demonstrate that Referee achieves state-of-the-art performance on cross-dataset and cross-language evaluation protocols. Experimental results highlight the importance of cross-modal identity verification for future deepfake detection. The code is available at this https URL.
zh

[CV-21] A Multi-tiered Human-in-the-loop Approach for Interactive School Mapping Using Earth Observation and Machine Learning

【速读】:该论文旨在解决发展中国家教育设施数据稀缺且更新不及时的问题,以提升学校地理信息记录的准确性和完整性。其解决方案的关键在于提出了一种多层级人机协同(human-in-the-loop)框架:首先利用机器学习分析人口密度、土地覆盖和现有基础设施与已知校址的匹配情况,识别潜在遗漏或误标学校;随后通过中分辨率遥感影像(Sentinel-2)筛选高概率区域,再结合超高清遥感影像(VHR)与深度学习模型生成候选校址;最终通过交互式界面让人工操作员迭代验证和优化结果。该方法显著提升了地图绘制的可扩展性与成本效益,为教育资源配置提供可靠支持。

链接: https://arxiv.org/abs/2510.27460
作者: Casper Fibaek,Abi Riley,Kelsey Doerksen,Do-Hyung Kim,Rochelle Schneider
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents a multi-tiered human-in-the-loop framework for interactive school mapping designed to improve the accuracy and completeness of educational facility records, particularly in developing regions where such data may be scarce and infrequently updated. The first tier involves a machine learning based analysis of population density, land cover, and existing infrastructure compared with known school locations. The first tier identifies potential gaps and “mislabelled” schools. In subsequent tiers, medium-resolution satellite imagery (Sentinel-2) is investigated to pinpoint regions with a high likelihood of school presence, followed by the application of very high-resolution (VHR) imagery and deep learning models to generate detailed candidate locations for schools within these prioritised areas. The medium-resolution approach was later removed due to insignificant improvements. The medium and VHR resolution models build upon global pre-trained steps to improve generalisation. A key component of the proposed approach is an interactive interface to allow human operators to iteratively review, validate, and refine the mapping results. Preliminary evaluations indicate that the multi-tiered strategy provides a scalable and cost-effective solution for educational infrastructure mapping to support planning and resource allocation.
zh

[CV-22] From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration

【速读】:该论文旨在解决当前生成式AI在科学插图(Scientific Illustration)创作中面临的两大核心问题:一是基于图像的生成模型输出的是栅格化(rasterized)结果,缺乏语义结构,无法进行元素级编辑与重组;二是基于代码的生成方法(如TikZ或SVG)虽提供细粒度控制,但用户需经历“编写-编译-审查”的繁琐流程,且交互不够直观。为此,作者提出VisPainter框架,其关键在于采用多智能体架构(multi-agent framework),通过Manager、Designer和Toolbox三个模块协同工作,将插图元素显式表示并支持后期修改,从而实现高效、直观且可迭代的矢量图形生成,同时结合VisBench基准对插图质量进行多维量化评估,验证了该方案在内容准确性、布局合理性、视觉感知效果及交互成本等方面的优越性。

链接: https://arxiv.org/abs/2510.27452
作者: Jianwen Sun,Fanrui Zhang,Yukang Feng,Chuanhao Li,Zizhen Li,Jiaxin Ai,Yifan Chang,Yu Dai,Kaipeng Zhang
机构: Nankai University (南开大学); Shanghai Innovation Institute (上海创新研究院); Wuhan University (武汉大学); University of Science and Technology of China (中国科学技术大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Scientific illustrations demand both high information density and post-editability. However, current generative models have two major limitations: First, image generation models output rasterized images lacking semantic structure, making it impossible to access, edit, or rearrange independent visual components in the images. Second, code-based generation methods (TikZ or SVG), although providing element-level control, force users into the cumbersome cycle of "writing-compiling-reviewing" and lack the intuitiveness of manipulation. Neither approach can fully meet the needs for efficiency, intuitiveness, and iterative modification in scientific creation. To bridge this gap, we introduce VisPainter, a multi-agent framework for scientific illustration built upon the model context protocol. VisPainter orchestrates three specialized modules (a Manager, a Designer, and a Toolbox) to collaboratively produce diagrams compatible with standard vector graphics software. This modular, role-based design allows each element to be explicitly represented and manipulated, enabling true element-level control; any element can be added and modified later. To systematically evaluate the quality of scientific illustrations, we introduce VisBench, a benchmark with seven-dimensional evaluation metrics. It assesses high-information-density scientific illustrations from four aspects: content, layout, visual perception, and interaction cost. To this end, we conducted extensive ablation experiments to verify the rationality of our architecture and the reliability of our evaluation methods. Finally, we evaluated various vision-language models, presenting fair and credible model rankings along with detailed comparisons of their respective capabilities. Additionally, we isolated and quantified the impacts of role division, step control, and description on the quality of illustrations.
zh

[CV-23] CoMViT: An Efficient Vision Backbone for Supervised Classification in Medical Imaging MICCAI2025

【速读】:该论文旨在解决视觉 Transformer (Vision Transformer, ViT) 在医学影像分析中因计算资源需求高和小样本下过拟合问题而难以应用于真实临床场景的挑战。其解决方案的关键在于提出一种轻量且可泛化的 ViT 架构 CoMViT,通过引入卷积分词器(convolutional tokenizer)、对角掩码(diagonal masking)、动态温度缩放(dynamic temperature scaling)以及基于池化的序列聚合(pooling-based sequence aggregation)等模块,在保持仅约 450 万参数的同时,实现了在十二个 MedMNIST 数据集上的鲁棒性能,并显著优于更深的 CNN 和 ViT 变体(参数减少达 5–20 倍),同时保证了临床相关区域的注意力一致性(通过 Grad-CAM 定性验证)。

链接: https://arxiv.org/abs/2510.27442
作者: Aon Safdar,Mohamed Saadeldin
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Preprint (submitted manuscript). Accepted at the MICCAI 2025 MIRASOL Workshop; to appear in the Springer proceedings volume. This is the pre-review version (not the Version of Record). DOI will be added after publication. [Optional: 8 pages, 4 figures, 4 tables.]

点击查看摘要

Abstract:Vision Transformers (ViTs) have demonstrated strong potential in medical imaging; however, their high computational demands and tendency to overfit on small datasets limit their applicability in real-world clinical scenarios. In this paper, we present CoMViT, a compact and generalizable Vision Transformer architecture optimized for resource-constrained medical image analysis. CoMViT integrates a convolutional tokenizer, diagonal masking, dynamic temperature scaling, and pooling-based sequence aggregation to improve performance and generalization. Through systematic architectural optimization, CoMViT achieves robust performance across twelve MedMNIST datasets while maintaining a lightweight design with only ~4.5M parameters. It matches or outperforms deeper CNN and ViT variants, offering up to 5-20x parameter reduction without sacrificing accuracy. Qualitative Grad-CAM analyses show that CoMViT consistently attends to clinically relevant regions despite its compact size. These results highlight the potential of principled ViT redesign for developing efficient and interpretable models in low-resource medical imaging settings.
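
Two of the named ingredients, diagonal masking and temperature scaling, are easy to show in isolation. The sketch below applies both inside a plain scaled-dot-product attention; CoMViT's exact formulation may differ, so treat this as an illustration of the mechanisms only.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, log_temp):
    """Self-attention with a diagonal mask and a learnable temperature.
    Masking the diagonal stops each token from attending to itself,
    forcing it to aggregate context; the temperature rescales the logits."""
    d = q.size(-1)
    logits = (q @ k.transpose(-2, -1)) / (d ** 0.5)
    logits = logits * torch.exp(log_temp)             # dynamic temperature scaling
    n = logits.size(-1)
    diag = torch.eye(n, dtype=torch.bool, device=logits.device)
    logits = logits.masked_fill(diag, float("-inf"))  # diagonal masking
    return F.softmax(logits, dim=-1) @ v

q = k = v = torch.randn(2, 4, 16, 32)                 # batch, heads, tokens, dim
out = masked_attention(q, k, v, log_temp=torch.tensor(0.0))
print(out.shape)  # torch.Size([2, 4, 16, 32])
```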
zh

[CV-24] DeblurSDI: Blind Image Deblurring Using Self-diffusion

【速读】:该论文旨在解决盲图像去卷积(blind image deconvolution)这一病态逆问题,即在未知清晰图像和模糊核的情况下恢复高质量图像。传统方法依赖手工设计先验或需大量外部数据预训练,难以适应真实场景;而本文提出的DeblurSDI框架是一种零样本、自监督的方法,其核心创新在于将盲去卷积建模为从纯噪声开始的迭代反向自扩散(self-diffusion, SDI)过程,通过两个随机初始化的神经网络联合优化清晰图像与模糊核,并引入基于L1范数的稀疏性约束和一种噪声调度机制(noise scheduling),从而在无需预训练的前提下动态学习实例相关的先验,显著提升对不同模糊核尺寸的鲁棒性与去模糊性能。

链接: https://arxiv.org/abs/2510.27439
作者: Yanlong Yang,Guanxiong Luo
机构: JCU (詹姆斯库克大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Blind image deconvolution is a challenging ill-posed inverse problem, where both the latent sharp image and the blur kernel are unknown. Traditional methods often rely on handcrafted priors, while modern deep learning approaches typically require extensive pre-training on large external datasets, limiting their adaptability to real-world scenarios. In this work, we propose DeblurSDI, a zero-shot, self-supervised framework based on self-diffusion (SDI) that requires no prior training. DeblurSDI formulates blind deconvolution as an iterative reverse self-diffusion process that starts from pure noise and progressively refines the solution. At each step, two randomly-initialized neural networks are optimized continuously to refine the sharp image and the blur kernel. The optimization is guided by an objective function combining data consistency with a sparsity-promoting L1-norm for the kernel. A key innovation is our noise scheduling mechanism, which stabilizes the optimization and provides remarkable robustness to variations in blur kernel size. These allow DeblurSDI to dynamically learn an instance-specific prior tailored to the input image. Extensive experiments demonstrate that DeblurSDI consistently achieves superior performance, recovering sharp images and accurate kernels even in highly degraded scenarios.
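
The objective described above (data consistency plus an L1 sparsity term on the kernel) can be written down directly. A minimal sketch, with the two refinement networks replaced by raw tensors being optimized; shapes and the weighting constant are assumptions:

```python
import torch
import torch.nn.functional as F

def deblur_objective(x_sharp, kernel, y_blur, l1_weight=1e-3):
    """Data consistency + L1 kernel sparsity, the kind of loss a DeblurSDI-style
    step minimizes. x_sharp is (1,1,H,W), kernel is (1,1,k,k) with k odd."""
    kernel = kernel / kernel.sum().clamp(min=1e-8)   # keep the kernel normalized
    pad = kernel.shape[-1] // 2
    y_pred = F.conv2d(x_sharp, kernel, padding=pad)  # re-synthesize the blur
    return F.mse_loss(y_pred, y_blur) + l1_weight * kernel.abs().sum()

x = torch.rand(1, 1, 64, 64, requires_grad=True)
k = torch.rand(1, 1, 9, 9, requires_grad=True)
y = torch.rand(1, 1, 64, 64)
loss = deblur_objective(x, k, y)
loss.backward()  # gradients flow to both the image and the kernel
```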
zh

[CV-25] Mitigating Semantic Collapse in Partially Relevant Video Retrieval NEURIPS2025

【速读】:该论文旨在解决部分相关视频检索(Partially Relevant Video Retrieval, PRVR)中因语义坍缩(semantic collapse)导致的性能瓶颈问题,即现有方法将每个文本-视频标注对视为正样本、其余均为负样本,忽略了单个视频内不同事件及跨视频间语义相似性,造成查询与视频片段嵌入在文本和视频空间中出现语义混淆。解决方案的关键在于:首先引入文本相关性保留学习(Text Correlation Preservation Learning),以保持基础模型编码的文本查询间语义关系;其次提出跨分支视频对齐(Cross-Branch Video Alignment, CBVA),通过对比学习解耦多时间尺度下的视频表示;进一步结合顺序保持的标记合并自适应CBVA机制,提升视频片段内部一致性与跨片段区分度,从而有效缓解语义坍缩并显著提升检索精度。

链接: https://arxiv.org/abs/2510.27432
作者: WonJun Moon,MinSeok Jung,Gilhan Park,Tae-Young Kim,Cheol-Ho Cho,Woojin Jun,Jae-Pil Heo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accpeted to NeurIPS 2025. Code is available at this https URL

点击查看摘要

Abstract:Partially Relevant Video Retrieval (PRVR) seeks videos where only part of the content matches a text query. Existing methods treat every annotated text-video pair as a positive and all others as negatives, ignoring the rich semantic variation both within a single video and across different videos. Consequently, embeddings of both queries and their corresponding video-clip segments for distinct events within the same video collapse together, while embeddings of semantically similar queries and segments from different videos are driven apart. This limits retrieval performance when videos contain multiple, diverse events. This paper addresses the aforementioned problems, termed as semantic collapse, in both the text and video embedding spaces. We first introduce Text Correlation Preservation Learning, which preserves the semantic relationships encoded by the foundation model across text queries. To address collapse in video embeddings, we propose Cross-Branch Video Alignment (CBVA), a contrastive alignment method that disentangles hierarchical video representations across temporal scales. Subsequently, we introduce order-preserving token merging and adaptive CBVA to enhance alignment by producing video segments that are internally coherent yet mutually distinctive. Extensive experiments on PRVR benchmarks demonstrate that our framework effectively prevents semantic collapse and substantially improves retrieval accuracy.
zh

[CV-26] Who Does Your Algorithm Fail? Investigating Age and Ethnic Bias in the MAMA-MIA Dataset NEURIPS

【速读】:该论文旨在解决深度学习模型在医学图像分割任务中公平性评估不足的问题,特别是乳腺癌肿瘤分割任务中的潜在偏见对不同人群(如年龄、种族和数据来源)造成的护理质量差异。其关键解决方案在于对MAMA-MIA数据集中的自动化分割标签进行细致的公平性审计,发现并量化了针对年轻患者的固有年龄相关偏差,即使在控制数据来源等混杂因素后仍持续存在,并进一步揭示了多源数据聚合虽可缓解部分种族偏差,但需在更细粒度层面分析数据以识别和消除特定站点的偏倚,从而推动更公平的AI辅助诊断系统开发。

链接: https://arxiv.org/abs/2510.27421
作者: Aditya Parikh,Sneha Das,Aasa Feragen
机构: DTU Compute, Technical University of Denmark (丹麦技术大学计算机系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Medical Imaging Meets EurIPS (NeurIPS-endorsed workshop) - MedEurIPS

点击查看摘要

Abstract:Deep learning models aim to improve diagnostic workflows, but fairness evaluation remains underexplored beyond classification, e.g., in image segmentation. Unaddressed segmentation bias can lead to disparities in the quality of care for certain populations, potentially compounded across clinical decision points and amplified through iterative model development. Here, we audit the fairness of the automated segmentation labels provided in the breast cancer tumor segmentation dataset MAMA-MIA. We evaluate automated segmentation quality across age, ethnicity, and data source. Our analysis reveals an intrinsic age-related bias against younger patients that continues to persist even after controlling for confounding factors, such as data source. We hypothesize that this bias may be linked to physiological factors, a known challenge for both radiologists and automated systems. Finally, we show how aggregating data from multiple data sources influences site-specific ethnic biases, underscoring the necessity of investigating data at a granular level.
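
At its core, such an audit compares segmentation quality across demographic groups. A toy illustration with hypothetical numbers (not the paper's data); a real audit would also stratify by ethnicity and data source:

```python
import numpy as np
import pandas as pd

def dice(pred: np.ndarray, ref: np.ndarray) -> float:
    """Dice overlap between two binary masks."""
    inter = np.logical_and(pred, ref).sum()
    return 2.0 * inter / max(pred.sum() + ref.sum(), 1)

m = np.zeros((8, 8), bool); m[2:6, 2:6] = True
assert dice(m, m) == 1.0  # sanity check on identical masks

# Hypothetical per-case Dice of automated vs. reference masks by age band.
df = pd.DataFrame({
    "age_band": ["<40", "<40", "40-60", "40-60", ">60", ">60"],
    "dice":     [0.71, 0.68, 0.82, 0.85, 0.84, 0.86],
})
print(df.groupby("age_band")["dice"].agg(["mean", "count"]))
```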
zh

[CV-27] A Hybrid Deep Learning and Forensic Approach for Robust Deepfake Detection ACSA

【速读】:该论文旨在解决生成式 AI (Generative AI) 生成的深度伪造(deepfake)内容日益逼真所带来的社会风险,如虚假信息传播、身份盗用和数字信任危机。现有检测方法存在两大局限:基于深度学习的方法泛化能力差且易受图像失真影响,而基于数字取证(forensic analysis)的方法虽具可解释性但难以应对新型篡改技术。解决方案的关键在于提出一种融合框架,将数字取证特征(包括噪声残差、JPEG压缩痕迹和频域描述符)与卷积神经网络(CNN)及视觉Transformer(ViT)的深度学习表示相结合,从而在保持高检测准确率的同时提升模型对未知攻击的鲁棒性和结果的可解释性。实验表明,该混合方法在多个基准数据集上均优于单一方法和当前最优混合模型,并在压缩、对抗扰动和未见篡改场景下保持稳定性能,同时Grad-CAM与取证热图在82%案例中与真实篡改区域重叠,显著增强系统透明度与可信度。

链接: https://arxiv.org/abs/2510.27392
作者: Sales Aribe Jr
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
备注: 11 pages, 13 figures, 9 tables, Published with International Journal of Advanced Computer Science and Applications (IJACSA)

点击查看摘要

Abstract:The rapid evolution of generative adversarial networks (GANs) and diffusion models has made synthetic media increasingly realistic, raising societal concerns around misinformation, identity fraud, and digital trust. Existing deepfake detection methods either rely on deep learning, which suffers from poor generalization and vulnerability to distortions, or forensic analysis, which is interpretable but limited against new manipulation techniques. This study proposes a hybrid framework that fuses forensic features, including noise residuals, JPEG compression traces, and frequency-domain descriptors, with deep learning representations from convolutional neural networks (CNNs) and vision transformers (ViTs). Evaluated on benchmark datasets (FaceForensics++, Celeb-DF v2, DFDC), the proposed model consistently outperformed single-method baselines and demonstrated superior performance compared to existing state-of-the-art hybrid approaches, achieving F1-scores of 0.96, 0.82, and 0.77, respectively. Robustness tests demonstrated stable performance under compression (F1 = 0.87 at QF = 50), adversarial perturbations (AUC = 0.84), and unseen manipulations (F1 = 0.79). Importantly, explainability analysis showed that Grad-CAM and forensic heatmaps overlapped with ground-truth manipulated regions in 82 percent of cases, enhancing transparency and user trust. These findings confirm that hybrid approaches provide a balanced solution, combining the adaptability of deep models with the interpretability of forensic cues, to develop resilient and trustworthy deepfake detection systems.
zh

[CV-28] Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds

【速读】:该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)中模态对齐(Modality Alignment)的不对称性问题,即现有方法在文本端提取分层特征,而图像端仅用单一特征表示,导致跨模态信息整合效率低下。其解决方案的关键在于提出一种名为“树间对齐”(Alignment across Trees)的新框架:首先构建图像与文本的树状分层特征结构,通过语义感知的视觉特征提取机制(利用中间Transformer层的视觉类别标记与文本提示进行交叉注意力),实现从粗到细的视觉语义建模;随后将两种模态的特征树嵌入具有不同曲率的双曲流形(Hyperbolic Manifolds)以有效捕捉层级关系,并引入一种适用于异质流形间的KL散度度量,通过最小化距离学习一个最优中间流形实现跨流形对齐,理论上证明了该中间流形的存在性与唯一性。

链接: https://arxiv.org/abs/2510.27391
作者: Wu Wei,Xiaomeng Fan,Yuwei Wu,Zhi Gao,Pengxiang Li,Yunde Jia,Mehrtash Harandi
机构: Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology (北京理工大学智能信息技术北京市重点实验室); Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University (深圳中美联合大学机器感知与智能计算广东省实验室); Department of Electrical and Computer System Engineering, Monash University (蒙纳士大学电气与计算机系统工程系)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.
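
For reference, methods of this kind typically work on the Poincaré ball of curvature −c. The Möbius addition and induced geodesic distance below are the standard formulas from the hyperbolic ML literature, not equations quoted from the paper:

```latex
\mathbf{x} \oplus_c \mathbf{y} =
  \frac{\left(1 + 2c\,\langle\mathbf{x},\mathbf{y}\rangle + c\,\lVert\mathbf{y}\rVert^2\right)\mathbf{x}
        + \left(1 - c\,\lVert\mathbf{x}\rVert^2\right)\mathbf{y}}
       {1 + 2c\,\langle\mathbf{x},\mathbf{y}\rangle + c^2\,\lVert\mathbf{x}\rVert^2\lVert\mathbf{y}\rVert^2},
\qquad
d_c(\mathbf{x},\mathbf{y}) = \frac{2}{\sqrt{c}}\,
  \operatorname{artanh}\!\left(\sqrt{c}\,\lVert(-\mathbf{x}) \oplus_c \mathbf{y}\rVert\right).
```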
zh

[CV-29] Fine-Tuning Open Video Generators for Cinematic Scene Synthesis: A Small-Data Pipeline with LoRA and Wan2.1 I2V

【速读】:该论文旨在解决如何利用小规模数据集对开源视频扩散变换器(Video Diffusion Transformers)进行高效微调,以生成符合影视制作需求的电影级场景。其核心挑战在于如何在有限数据条件下实现视觉风格与运动生成的解耦学习,并保持时序一致性与画面质量。解决方案的关键在于提出两阶段流程:第一阶段采用低秩适应(Low-Rank Adaptation, LoRA)模块嵌入到Wan2.1 I2V-14B模型的交叉注意力层中,基于少量历史电视剧片段快速完成视觉风格迁移;第二阶段则利用微调后的模型生成风格一致的关键帧,并通过视频解码器进行时序扩展,同时引入轻量级并行化和序列分块策略,在不损失质量的前提下显著加速推理过程。

链接: https://arxiv.org/abs/2510.27364
作者: Meftun Akarsu,Kerem Catay,Sedat Bin Vedat,Enes Kutay Yarkan,Ilke Senturk,Arda Sar,Dafne Eksioglu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: video generation, image-to-video, diffusion transformer, LoRA, fine-tuning, cinematic scene synthesis, multi-GPU inference, fully sharded data parallelism, computational efficiency

点击查看摘要

Abstract:We present a practical pipeline for fine-tuning open-source video diffusion transformers to synthesize cinematic scenes for television and film production from small datasets. The proposed two-stage process decouples visual style learning from motion generation. In the first stage, Low-Rank Adaptation (LoRA) modules are integrated into the cross-attention layers of the Wan2.1 I2V-14B model to adapt its visual representations using a compact dataset of short clips from Ay Yapim’s historical television film El Turco. This enables efficient domain transfer within hours on a single GPU. In the second stage, the fine-tuned model produces stylistically consistent keyframes that preserve costume, lighting, and color grading, which are then temporally expanded into coherent 720p sequences through the model’s video decoder. We further apply lightweight parallelization and sequence partitioning strategies to accelerate inference without quality degradation. Quantitative and qualitative evaluations using FVD, CLIP-SIM, and LPIPS metrics, supported by a small expert user study, demonstrate measurable improvements in cinematic fidelity and temporal stability over the base model. The complete training and inference pipeline is released to support reproducibility and adaptation across cinematic domains.
zh

[CV-30] FPS: Feedforward-based Parameter Selection For Efficient Fine-Tuning

【速读】:该论文旨在解决参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法在实际应用中面临的内存效率与计算复杂度之间的权衡问题。现有方法如基于添加的Adapters会引入推理延迟和工程复杂性,而基于选择的梯度驱动方法(Gradient-based Parameter Selection, GPS)则需完整反向传播,导致峰值显存占用与全量微调相当。解决方案的关键在于提出一种无需梯度信息的前馈式参数选择方法(Feedforward-based Parameter Selection, FPS),其通过单次前向传播即可对参数重要性进行排序——依据参数幅值与对应输入激活的乘积,从而融合预训练知识与下游数据特征,在显著降低峰值显存消耗(约减少9倍)的同时,将参数选择速度提升约2倍,实现了真正高效的参数选择机制。

链接: https://arxiv.org/abs/2510.27359
作者: Kenneth Yang,Wen-Li Wei,Jen-Chun Lin
机构: Academia Sinica (中央研究院); National Taiwan University (台湾大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key strategy for adapting large-scale pre-trained models to downstream tasks, but existing approaches face notable limitations. Addition-based methods, such as Adapters [1], introduce inference latency and engineering complexity, while selection-based methods like Gradient-based Parameter Selection (GPS) [2] require a full backward pass, which results in the same peak memory usage as full fine-tuning. To address this dilemma, we propose Feedforward-based Parameter Selection (FPS), a gradient-free method that identifies an optimal parameter subset in a single forward pass. FPS ranks parameters by the product of their magnitudes and corresponding input activations, leveraging both pre-trained knowledge and downstream data. Evaluated on 24 visual tasks from FGVC and VTAB-1k, FPS achieves performance comparable to state-of-the-art methods while reducing peak memory usage by nearly 9× and accelerating parameter selection by about 2×, offering a genuinely memory-efficient and practical solution for fine-tuning large-scale pre-trained models.
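
The scoring rule, |weight| times the magnitude of the corresponding input activation collected in one forward pass, is simple to sketch. The granularity below (per-input-feature mean, over Linear layers only) is an assumption for illustration:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fps_scores(model: nn.Module, x: torch.Tensor) -> dict:
    """Rank Linear-layer weights by |weight| * mean |input activation|
    gathered from a single gradient-free forward pass."""
    acts, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            a = inputs[0].abs()
            # mean |activation| per input feature, averaged over leading dims
            acts[name] = a.mean(dim=tuple(range(a.dim() - 1)))
        return hook

    for name, mod in model.named_modules():
        if isinstance(mod, nn.Linear):
            hooks.append(mod.register_forward_hook(make_hook(name)))
    model(x)
    for h in hooks:
        h.remove()
    # score[i, j] = |W[i, j]| * mean |activation_j|
    return {f"{n}.weight": m.weight.abs() * acts[n]
            for n, m in model.named_modules() if isinstance(m, nn.Linear)}

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
scores = fps_scores(model, torch.randn(32, 8))
top = torch.topk(scores["0.weight"].flatten(), k=10).indices  # parameters to tune
print(top)
```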
zh

[CV-31] RzenEmbed: Towards Comprehensive Multimodal Retrieval

【速读】:该论文旨在解决当前基于CLIP的多模态大语言模型(Multimodal Large Language Models, MLLMs)在跨模态检索任务中对非自然图像模态(如视频和视觉文档)支持不足的问题。现有方法主要聚焦于自然图像,导致在视频和视觉文档等关键视觉模态上的嵌入表示能力有限。解决方案的关键在于提出一个统一框架RzenEmbed,通过两阶段训练策略学习判别性嵌入:第一阶段专注于基础文本与多模态检索任务;第二阶段引入改进的InfoNCE损失函数,包含两个核心增强机制——硬度加权机制(hardness-weighted mechanism),通过为困难样本分配更高权重来引导模型关注更具挑战性的样本;以及伪负样本抑制机制,以降低错误负样本和数据噪声的影响。此外,结合可学习温度参数和模型融合(model souping)进一步提升性能,最终在MMEB基准测试上达到新的最先进水平,尤其在视频和视觉文档检索任务中显著优于以往方法。

链接: https://arxiv.org/abs/2510.27350
作者: Weijian Jian,Yajun Zhang,Dawei Liang,Chunyu Xie,Yixiao He,Dawei Leng,Yuhui Yin
机构: 360 AI Research
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The rapid advancement of Multimodal Large Language Models (MLLMs) has extended CLIP-based frameworks to produce powerful, universal embeddings for retrieval tasks. However, existing methods primarily focus on natural images, offering limited support for other crucial visual modalities such as videos and visual documents. To bridge this gap, we introduce RzenEmbed, a unified framework to learn embeddings across a diverse set of modalities, including text, images, videos, and visual documents. We employ a novel two-stage training strategy to learn discriminative representations. The first stage focuses on foundational text and multimodal retrieval. In the second stage, we introduce an improved InfoNCE loss, incorporating two key enhancements. Firstly, a hardness-weighted mechanism guides the model to prioritize challenging samples by assigning them higher weights within each batch. Secondly, we implement an approach to mitigate the impact of false negatives and alleviate data noise. This strategy not only enhances the model’s discriminative power but also improves its instruction-following capabilities. We further boost performance with a learnable temperature parameter and model souping. RzenEmbed sets a new state-of-the-art on the MMEB benchmark. It not only achieves the best overall score but also outperforms all prior work on the challenging video and visual document retrieval tasks. Our models are available at this https URL.
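
A sketch of a hardness-weighted InfoNCE is given below: in-batch negatives with higher similarity to the query receive larger weight in the softmax denominator. The exponential weighting form and the alpha scale are assumptions; the paper's exact weighting is not specified in the abstract.

```python
import torch
import torch.nn.functional as F

def hardness_weighted_infonce(q, d, tau=0.05, alpha=1.0):
    """InfoNCE over in-batch negatives where harder negatives are up-weighted."""
    q, d = F.normalize(q, dim=-1), F.normalize(d, dim=-1)
    sim = q @ d.t() / tau                    # (B, B) similarity logits
    B = sim.size(0)
    mask = ~torch.eye(B, dtype=torch.bool)   # off-diagonal = negatives
    w = torch.ones_like(sim)
    # weight = exp(alpha * cosine similarity); multiply by tau to undo scaling
    w[mask] = torch.exp(alpha * sim[mask].detach() * tau)
    logits = sim + w.log()                   # diagonal weight is 1, so log w = 0
    return F.cross_entropy(logits, torch.arange(B))

loss = hardness_weighted_infonce(torch.randn(16, 256), torch.randn(16, 256))
print(loss.item())
```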
zh

[CV-32] Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing

【速读】:该论文旨在解决现有图像编辑方法在处理复杂编辑指令时需联合微调大语言模型(Large Language Models, LLMs)与扩散模型(Diffusion Models, DMs)所带来的高计算复杂度和训练成本问题。其解决方案的关键在于提出一种名为“基于LLM推理的复杂图像编辑”(Complex Image Editing via LLM Reasoning, CIELR)的新方法,通过将复杂的用户指令转化为一系列简单且明确的编辑动作,从而避免了LLMs与DMs的联合微调;具体而言,该方法首先利用基础模型构建输入图像的结构化语义表示,并引入迭代更新机制逐步细化该表示,最终实现对图像场景的细粒度理解与灵活编辑,显著提升了编辑一致性与效果。

链接: https://arxiv.org/abs/2510.27335
作者: Yijia Wang,Yiqing Shen,Weiming Chen,Zhihai He
机构: Southern University of Science and Technology (南方科技大学); Johns Hopkins University (约翰霍普金斯大学); Pengcheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing image editing methods can handle simple editing instructions very well. To deal with complex editing instructions, they often need to jointly fine-tune the large language models (LLMs) and diffusion models (DMs), which involves very high computational complexity and training cost. To address this issue, we propose a new method, called Complex Image Editing via LLM Reasoning (CIELR), which converts a complex user instruction into a set of simple and explicit editing actions, eliminating the need for jointly fine-tuning the large language models and diffusion models. Specifically, we first construct a structured semantic representation of the input image using foundation models. Then, we introduce an iterative update mechanism that can progressively refine this representation, obtaining a fine-grained visual representation of the image scene. This allows us to perform complex and flexible image editing tasks. Extensive experiments on the SmartEdit Reasoning Scenario Set show that our method surpasses the previous state-of-the-art by 9.955 dB in PSNR, indicating its superior preservation of regions that should remain consistent. Due to the limited number of samples of public datasets of complex image editing with reasoning, we construct a benchmark named CIEBench, containing 86 image samples, together with a metric specifically for reasoning-based image editing. CIELR also outperforms previous methods on this benchmark. The code and dataset are available at this https URL.
zh

[CV-33] MeisenMeister: A Simple Two Stage Pipeline for Breast Cancer Classification on MRI MICCAI2025

【速读】:该论文旨在解决乳腺癌筛查中早期检测效率与准确性不足的问题,核心挑战在于乳腺MRI图像的解读仍面临困难,主要受限于高质量分割标注数据的稀缺性。为此,研究提出了一种基于分类的方法作为解决方案的关键,其优势在于能够绕过对大量精细标注数据的依赖,从而在大规模筛查场景下实现更鲁棒、可推广的早期乳腺癌识别能力。

链接: https://arxiv.org/abs/2510.27326
作者: Benjamin Hamm,Yannick Kirchhoff,Maximilian Rokuss,Klaus Maier-Hein
机构: 1. German Cancer Research Center (德国癌症研究中心); 2. Heidelberg University Hospital (海德堡大学医院); 3. German Cancer Consortium (德国癌症联盟); 4. National Center for Tumor Diseases Heidelberg (海德堡肿瘤疾病国家中心); 5. German Cancer Research Center (德国癌症研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Winning Solution of the MICCAI 2025 ODELIA Breast MRI Classification Challenge

点击查看摘要

Abstract:The ODELIA Breast MRI Challenge 2025 addresses a critical issue in breast cancer screening: improving early detection through more efficient and accurate interpretation of breast MRI scans. Even though methods for general-purpose whole-body lesion segmentation as well as multi-time-point analysis exist, breast cancer detection remains highly challenging, largely due to the limited availability of high-quality segmentation labels. Therefore, developing robust classification-based approaches is crucial for the future of early breast cancer detection, particularly in applications such as large-scale screening. In this write-up, we provide a comprehensive overview of our approach to the challenge. We begin by detailing the underlying concept and foundational assumptions that guided our work. We then describe the iterative development process, highlighting the key stages of experimentation, evaluation, and refinement that shaped the evolution of our solution. Finally, we present the reasoning and evidence that informed the design choices behind our final submission, with a focus on performance, robustness, and clinical relevance. We release our full implementation publicly at this https URL
zh

[CV-34] Generative Semantic Coding for Ultra-Low Bitrate Visual Communication and Analysis

【速读】:该论文旨在解决在极低比特率通信场景下(如深空探测、战场情报和复杂环境中的机器人导航)实现高质量视觉通信与远程视觉分析及人机交互的问题,即如何在不牺牲视觉分析准确性和人机交互性能的前提下,仅用现有编码方法极小部分的比特率完成视觉场景的精确重建。其解决方案的关键在于将图像生成与深度图像压缩无缝融合,利用联合文本描述和编码潜在表示(coding latent)引导修正流模型(rectified flow models),从而实现对视觉场景的高精度生成;同时,语义文本描述与编码潜在表示均以极低比特率编码并传输至解码端,实验表明该方法可在显著降低带宽消耗的同时,达到与现有方法相当的图像重建质量和视觉分析准确性。

链接: https://arxiv.org/abs/2510.27324
作者: Weiming Chen,Yijia Wang,Zhihan Zhu,Zhihai He
机构: Southern University of Science and Technology (南方科技大学); Pengcheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We consider the problem of ultra-low bit rate visual communication for remote vision analysis, human interactions and control in challenging scenarios with very low communication bandwidth, such as deep space exploration, battlefield intelligence, and robot navigation in complex environments. In this paper, we ask the following important question: can we accurately reconstruct the visual scene using only a very small portion of the bit rate in existing coding methods while not sacrificing the accuracy of vision analysis and performance of human interactions? Existing text-to-image generation models offer a new approach for ultra-low bitrate image description. However, they can only achieve a semantic-level approximation of the visual scene, which is far insufficient for the purpose of visual communication and remote vision analysis and human interactions. To address this important issue, we propose to seamlessly integrate image generation with deep image compression, using joint text and coding latent to guide the rectified flow models for precise generation of the visual scene. The semantic text description and coding latent are both encoded and transmitted to the decoder at a very small bit rate. Experimental results demonstrate that our method can achieve the same image reconstruction quality and vision analysis accuracy as existing methods while using much less bandwidth. The code will be released upon paper acceptance.
zh

[CV-35] SAGS: Self-Adaptive Alias-Free Gaussian Splatting for Dynamic Surgical Endoscopic Reconstruction

【速读】:该论文旨在解决内窥镜视频中动态组织重建的挑战,特别是由组织运动引起的混叠(aliasing)和伪影问题,这些问题会显著降低可视化质量。现有基于3D Gaussian Splatting(3DGS)的方法虽提升了渲染效率,但常忽视这些关键视觉质量问题。解决方案的关键在于提出SAGS(Self-adaptive Alias-free Gaussian Splatting),其核心创新是引入一种注意力驱动的、动态加权的4D变形解码器,结合3D平滑滤波器与2D Mip滤波器,有效抑制伪影并更精确地捕捉组织运动的细微细节,从而在PSNR、SSIM和LPIPS等指标上均优于当前最先进方法。

链接: https://arxiv.org/abs/2510.27318
作者: Wenfeng Huang,Xiangyun Liao,Yinling Qian,Hao Liu,Yongming Yang,Wenjing Jia,Qiong Wang
机构: University of Technology Sydney, Australia; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, China; Shenyang Institute of Automation, Chinese Academy of Sciences, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Surgical reconstruction of dynamic tissues from endoscopic videos is a crucial technology in robot-assisted surgery. The development of Neural Radiance Fields (NeRFs) has greatly advanced deformable tissue reconstruction, achieving high-quality results from video and image sequences. However, reconstructing deformable endoscopic scenes remains challenging due to aliasing and artifacts caused by tissue movement, which can significantly degrade visualization quality. The introduction of 3D Gaussian Splatting (3DGS) has improved reconstruction efficiency by enabling a faster rendering pipeline. Nevertheless, existing 3DGS methods often prioritize rendering speed while neglecting these critical issues. To address these challenges, we propose SAGS, a self-adaptive alias-free Gaussian splatting framework. We introduce an attention-driven, dynamically weighted 4D deformation decoder, leveraging 3D smoothing filters and 2D Mip filters to mitigate artifacts in deformable tissue reconstruction and better capture the fine details of tissue movement. Experimental results on two public benchmarks, EndoNeRF and SCARED, demonstrate that our method achieves superior performance in all metrics of PSNR, SSIM, and LPIPS compared to the state of the art while also delivering better visualization quality.
zh

[CV-36] Overcoming Prompts Pool Confusion via Parameterized Prompt for Incremental Object Detection

【速读】:该论文旨在解决增量目标检测(Incremental Object Detection, IOD)中因提示(prompt)结构固定导致的灾难性遗忘问题,尤其是在存在类别共现场景下,传统基于提示池的方法因假设各任务间类别互斥而产生混淆。其解决方案的关键在于提出参数化提示的增量目标检测方法(Parameterized Prompts for Incremental Object Detection, P²IOD),通过将神经网络本身作为参数化提示来利用其全局演化特性,实现跨任务知识的自适应整合;同时引入参数化提示融合策略以约束提示结构更新,从而有效缓解灾难性遗忘并提升检测性能。

链接: https://arxiv.org/abs/2510.27316
作者: Zijia An,Boyu Diao,Ruiqi Liu,Libo Huang,Chuanguang Yang,Fei Wang,Zhulin An,Yongjun Xu
机构: Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent studies have demonstrated that incorporating trainable prompts into pretrained models enables effective incremental learning. However, the application of prompts in incremental object detection (IOD) remains underexplored. Existing prompt-pool-based approaches assume disjoint class sets across incremental tasks, which makes them unsuitable for IOD as they overlook the inherent co-occurrence phenomenon in detection images. In co-occurring scenarios, unlabeled objects from previous tasks may appear in current task images, leading to confusion in the prompts pool. In this paper, we hold that prompt structures should exhibit adaptive consolidation properties across tasks, with constrained updates to prevent catastrophic forgetting. Motivated by this, we introduce Parameterized Prompts for Incremental Object Detection (P²IOD). Leveraging neural networks' global evolution properties, P²IOD employs networks as the parameterized prompts to adaptively consolidate knowledge across tasks. To constrain prompts structure updates, P²IOD further engages a parameterized prompts fusion strategy. Extensive experiments on the PASCAL VOC2007 and MS COCO datasets demonstrate P²IOD's effectiveness in IOD; it achieves state-of-the-art performance among existing baselines.
zh

[CV-37] CASR-Net: An Image Processing-focused Deep Learning-based Coronary Artery Segmentation and Refinement Network for X-ray Coronary Angiogram

【速读】:该论文旨在解决冠状动脉疾病(Coronary Artery Disease, CAD)早期检测中因血管造影图像质量差而导致的临床诊断困难问题。其解决方案的关键在于提出一种三阶段的冠状动脉分割与精炼网络(Coronary Artery Segmentation and Refinement Network, CASR-Net),其中核心创新包括:基于DenseNet121编码器和自组织操作神经网络(Self-organized Operational Neural Network, Self-ONN)解码器的UNet架构,有效保留狭窄血管分支的连续性;以及一种结合CLAHE与改进的Ben Graham方法的多通道预处理策略,显著提升分割性能(Dice Score Coefficient提升0.31–0.89%,IoU提升0.40–1.16%);最终通过轮廓精炼模块进一步抑制假阳性结果,整体在公开数据集上实现IoU 61.43%、DSC 76.10%和clDice 79.36%,展现出对自动化冠状动脉分割的高鲁棒性。

链接: https://arxiv.org/abs/2510.27315
作者: Alvee Hassan,Rusab Sarmun,Muhammad E. H. Chowdhury,M. Murugappan,Md. Sakib Abrar Hossain,Sakib Mahmud,Abdulrahman Alqahtani,Sohaib Bassam Zoghoul,Amith Khandakar,Susu M. Zughaier,Somaya Al-Maadeed,Anwarul Hasan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Early detection of coronary artery disease (CAD) is critical for reducing mortality and improving patient treatment planning. While angiographic image analysis from X-rays is a common and cost-effective method for identifying cardiac abnormalities, including stenotic coronary arteries, poor image quality can significantly impede clinical diagnosis. We present the Coronary Artery Segmentation and Refinement Network (CASR-Net), a three-stage pipeline comprising image preprocessing, segmentation, and refinement. A novel multichannel preprocessing strategy combining CLAHE and an improved Ben Graham method provides incremental gains, increasing Dice Score Coefficient (DSC) by 0.31-0.89% and Intersection over Union (IoU) by 0.40-1.16% compared with using the techniques individually. The core innovation is a segmentation network built on a UNet with a DenseNet121 encoder and a Self-organized Operational Neural Network (Self-ONN) based decoder, which preserves the continuity of narrow and stenotic vessel branches. A final contour refinement module further suppresses false positives. Evaluated with 5-fold cross-validation on a combination of two public datasets that contain both healthy and stenotic arteries, CASR-Net outperformed several state-of-the-art models, achieving an IoU of 61.43%, a DSC of 76.10%, and clDice of 79.36%. These results highlight a robust approach to automated coronary artery segmentation, offering a valuable tool to support clinicians in diagnosis and treatment planning.
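
The two preprocessing ingredients are well known and easy to reproduce. Below is a sketch that stacks the raw frame, its CLAHE channel, and a Ben Graham-style channel; the classic 4·I − 4·blur(I) + 128 constants are assumptions standing in for the paper's "improved" variant:

```python
import cv2
import numpy as np

def multichannel_preprocess(gray: np.ndarray) -> np.ndarray:
    """Stack CLAHE and a Ben Graham-style channel for an angiogram frame.
    Ben Graham's trick subtracts a heavy Gaussian blur to flatten illumination."""
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(gray)
    blur = cv2.GaussianBlur(gray, (0, 0), sigmaX=10)
    graham = cv2.addWeighted(gray, 4, blur, -4, 128)  # 4*I - 4*blur(I) + 128
    return np.stack([gray, clahe, graham], axis=-1)   # H x W x 3 network input

img = (np.random.rand(256, 256) * 255).astype(np.uint8)  # stand-in X-ray frame
print(multichannel_preprocess(img).shape)  # (256, 256, 3)
```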
zh

[CV-38] Versatile and Efficient Medical Image Super-Resolution Via Frequency-Gated Mamba

【速读】:该论文旨在解决医学图像超分辨率(Medical Image Super-Resolution, MISR)中如何在低计算开销下同时建模长程解剖结构与细粒度频域细节的问题。其解决方案的关键在于提出一种频率感知的门控状态空间模型(FGMamba),通过两个核心创新实现:一是门控注意力增强的状态空间模块(Gated Attention-enhanced State-Space Module, GASM),融合高效状态空间建模与双分支空间-通道注意力机制,以精准捕捉全局依赖关系;二是金字塔频率融合模块(Pyramid Frequency Fusion Module, PFFM),利用傅里叶变换(FFT)引导的多尺度融合策略,有效提取和增强高频细节信息。该架构在保持极小参数量(< 0.75M)的同时,在五种医学成像模态(超声、OCT、MRI、CT 和内窥镜)上显著优于当前主流CNN与Transformer方法。

链接: https://arxiv.org/abs/2510.27296
作者: Wenfeng Huang,Xiangyun Liao,Wei Cao,Wenjing Jia,Weixin Si
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical image super-resolution (SR) is essential for enhancing diagnostic accuracy while reducing acquisition cost and scanning time. However, modeling both long-range anatomical structures and fine-grained frequency details with low computational overhead remains challenging. We propose FGMamba, a novel frequency-aware gated state-space model that unifies global dependency modeling and fine-detail enhancement into a lightweight architecture. Our method introduces two key innovations: a Gated Attention-enhanced State-Space Module (GASM) that integrates efficient state-space modeling with dual-branch spatial and channel attention, and a Pyramid Frequency Fusion Module (PFFM) that captures high-frequency details across multiple resolutions via FFT-guided fusion. Extensive evaluations across five medical imaging modalities (Ultrasound, OCT, MRI, CT, and Endoscopic) demonstrate that FGMamba achieves superior PSNR/SSIM while maintaining a compact parameter footprint (<0.75M), outperforming CNN-based and Transformer-based SOTAs. Our results validate the effectiveness of frequency-aware state-space modeling for scalable and accurate medical image enhancement.
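
A toy version of the frequency path a PFFM-like module operates on: boost high-frequency FFT coefficients and transform back. The radius and gain values are illustrative, not the paper's:

```python
import torch

def fft_highpass_boost(x: torch.Tensor, radius: int = 8, gain: float = 1.5) -> torch.Tensor:
    """Amplify frequencies outside a small low-pass disc, then invert the FFT."""
    B, C, H, W = x.shape
    spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    yy, xx = torch.meshgrid(
        torch.arange(H, device=x.device), torch.arange(W, device=x.device), indexing="ij"
    )
    dist = (((yy - H // 2) ** 2 + (xx - W // 2) ** 2).float()).sqrt()
    mask = torch.ones(H, W, device=x.device)
    mask[dist > radius] = gain        # low frequencies untouched, highs boosted
    spec = spec * mask                # broadcasts over (B, C)
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real

x = torch.randn(1, 3, 64, 64)
print(fft_highpass_boost(x).shape)   # torch.Size([1, 3, 64, 64])
```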
zh

[CV-39] Rethinking Robust Adversarial Concept Erasure in Diffusion Models

【速读】:该论文旨在解决扩散模型(Diffusion Models, DMs)中概念擦除(Concept Erasure)方法在对抗训练过程中因忽视概念语义而导致的擦除效果不充分和干扰其他概念空间的问题。现有方法通常依赖对抗样本生成来抑制目标概念,但其生成的对抗样本未能有效拟合目标概念空间,导致在样本稀少时无法覆盖完整对象概念,而在样本过多时又可能破坏其他概念空间。解决方案的关键在于提出S-GRACE(Semantics-Guided Robust Adversarial Concept Erasure),通过在概念空间内引入语义引导机制,使对抗样本更精准地匹配目标概念语义,从而提升擦除效果、增强鲁棒性并显著降低训练时间(减少90%)。

链接: https://arxiv.org/abs/2510.27285
作者: Qinghong Yin,Yu Tian,Yue Zhang
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Tsinghua University (清华大学); University of Chinese Academy of Sciences (中国科学院大学); Nanjing University of Aeronautics and Astronautics (南京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Concept erasure aims to selectively unlearn undesirable content in diffusion models (DMs) to reduce the risk of sensitive content generation. As a novel paradigm in concept erasure, most existing methods employ adversarial training to identify and suppress target concepts, thus reducing the likelihood of sensitive outputs. However, these methods often neglect the specificity of adversarial training in DMs, resulting in only partial mitigation. In this work, we investigate and quantify this specificity from the perspective of concept space, i.e., can adversarial samples truly fit the target concept space? We observe that existing methods neglect the role of conceptual semantics when generating adversarial samples, resulting in ineffective fitting of concept spaces. This oversight leads to the following issues: 1) when there are few adversarial samples, they fail to comprehensively cover the object concept; 2) conversely, they will disrupt other target concept spaces. Motivated by the analysis of these findings, we introduce S-GRACE (Semantics-Guided Robust Adversarial Concept Erasure), which leverages semantic guidance within the concept space to generate adversarial samples and perform erasure training. Experiments conducted with seven state-of-the-art methods and three adversarial prompt generation strategies across various DM unlearning scenarios demonstrate that S-GRACE significantly improves erasure performance by 26%, better preserves non-target concepts, and reduces training time by 90%. Our code is available at this https URL.
zh

[CV-40] FOCUS: Efficient Keyframe Selection for Long Video Understanding

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理长视频时因视觉token数量急剧增加而导致的计算资源消耗过大的问题。传统方法如均匀采样或基于小型视觉-语言模型的关键帧选择虽能降低推理成本,但往往依赖预过滤策略且可能遗漏关键信息。其解决方案的核心是提出一种无需训练、与模型无关的关键帧选择模块FOCUS(Frame-Optimistic Confidence Upper-bound Selection),将关键帧选择建模为多臂赌博机中的组合纯探索(Combinatorial Pure-Exploration, CPE)问题:将短时片段视为“臂”,利用经验均值和Bernstein置信半径识别高信息量区域,并在两阶段探索-利用过程中先定位高价值时间区间,再从中选出得分最高的帧。该方法在两个长视频问答基准上仅处理不到2%的帧即可显著提升准确率,尤其在超过20分钟的视频中于LongVideoBench上实现11.9%的准确率提升,验证了其作为可扩展长视频理解通用方案的有效性。

链接: https://arxiv.org/abs/2510.27280
作者: Zirui Zhu,Hailun Xu,Yang Luo,Yong Liu,Kanchan Sarkar,Zhenheng Yang,Yang You
机构: National University of Singapore (新加坡国立大学); TikTok
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein confidence radii to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure reduces from a sequential policy with theoretical guarantees, first identifying high-value temporal regions, then selecting top-scoring frames within each region. On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs.
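
The clip-ranking quantity is an empirical-Bernstein upper confidence bound. A sketch with the standard bound (the constants come from the textbook empirical-Bernstein inequality, not necessarily the paper):

```python
import numpy as np

def bernstein_ucb(scores_per_clip, delta=0.05):
    """Empirical-Bernstein UCB per clip ("arm"), the quantity a CPE-style
    procedure ranks clips by. Frame scores are assumed to lie in [0, 1]."""
    ucbs = []
    for s in scores_per_clip:
        n, mean, var = len(s), s.mean(), s.var()
        radius = np.sqrt(2 * var * np.log(3 / delta) / n) + 3 * np.log(3 / delta) / n
        ucbs.append(mean + radius)
    return np.array(ucbs)

rng = np.random.default_rng(0)
clips = [rng.beta(a, 2, size=10) for a in (1, 2, 5)]  # third clip most relevant
print(bernstein_ucb(clips).argsort()[::-1])  # rank clips, highest UCB first
```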
zh

[CV-41] HyperClick: Advancing Reliable GUI Grounding via Uncertainty Calibration

【速读】:该论文旨在解决自主图形用户界面(GUI)代理在执行用户指令时因缺乏对自身能力边界的自我认知而导致的过度自信与不可靠预测问题,尤其是在动态GUI自动化任务中,单一错误即可能导致任务失败。解决方案的关键在于提出HyperClick框架,通过引入双奖励机制——结合二值奖励以优化正确动作的同时,采用基于截断高斯的空间置信度建模并利用Brier评分进行校准,从而协同提升GUI定位准确性和置信度可靠性,实现具 introspective self-criticism(内省式自我批判)的不确定性校准,显著降低过度假设并增强GUI自动化的鲁棒性。

链接: https://arxiv.org/abs/2510.27266
作者: Shaojie Zhang,Pei Fu,Ruoceng Zhang,Jiahui Yang,Anan Du,Xiuwen Xi,Shaokang Wang,Ying Huang,Bin Qin,Zhenbo Luo,Jian Luan
机构: MiLM Plus, Xiaomi Inc(小米公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Autonomous Graphical User Interface (GUI) agents rely on accurate GUI grounding, which maps language instructions to on-screen coordinates, to execute user commands. However, current models, whether trained via supervised fine-tuning (SFT) or reinforcement fine-tuning (RFT), lack self-awareness of their capability boundaries, leading to overconfidence and unreliable predictions. We first systematically evaluate probabilistic and verbalized confidence in general and GUI-specific models, revealing a misalignment between confidence and actual accuracy, which is particularly critical in dynamic GUI automation tasks, where single errors can cause task failure. To address this, we propose HyperClick, a novel framework that enhances reliable GUI grounding through uncertainty calibration. HyperClick introduces a dual reward mechanism, combining a binary reward for correct actions with a truncated Gaussian-based spatial confidence modeling, calibrated using the Brier score. This approach jointly optimizes grounding accuracy and confidence reliability, fostering introspective self-criticism. Extensive experiments on seven challenge benchmarks show that HyperClick achieves state-of-the-art performance while providing well-calibrated confidence. By enabling explicit confidence calibration and introspective self-criticism, HyperClick reduces overconfidence and supports more reliable GUI automation.
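
A rough sketch of the confidence side: a Gaussian on the click error clipped to [0, 1] stands in for the paper's truncated-Gaussian modeling, with the Brier score measuring calibration. The sigma value and the hit threshold are assumptions:

```python
import torch

def spatial_confidence(pred_xy, target_xy, sigma=0.1):
    """Confidence from a Gaussian on the click error, clipped to [0, 1].
    Coordinates are normalized to the unit square."""
    dist2 = ((pred_xy - target_xy) ** 2).sum(dim=-1)
    return torch.exp(-dist2 / (2 * sigma ** 2)).clamp(0.0, 1.0)

pred = torch.tensor([[0.52, 0.48], [0.10, 0.90]])
target = torch.tensor([[0.50, 0.50], [0.80, 0.20]])
hit = (((pred - target) ** 2).sum(-1).sqrt() < 0.05).float()  # binary correctness
conf = spatial_confidence(pred, target)
brier = ((conf - hit) ** 2).mean()  # calibration objective: lower is better
print(conf, brier)
```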
zh

[CV-42] T³: Test-Time Model Merging in VLMs for Zero-Shot Medical Imaging Analysis

【速读】:该论文旨在解决医学视觉-语言模型(Medical Vision-Language Models, MVLMs)在面对模态漂移(modality shift)时性能不稳定的问题:预训练模型虽具广泛鲁棒性,但缺乏细粒度的模态特异性;而微调后的专家模型在分布内任务中表现优异,却难以适应新模态或数据分布变化。其核心解决方案是提出一种无需反向传播的测试时任务自适应融合方法(Test-Time Task adaptive merging, T³),关键在于通过样本级Jensen-Shannon散度动态计算两个模型输出分布之间的差异,从而自适应地调整融合系数——当模型预测一致时保留局部精度,发生漂移时则依赖通用模型的鲁棒性。进一步地,为降低计算开销,作者还提出了批处理版本T³_B,显著提升了推理效率,实验证明该方法在跨模态、分布外和噪声场景下均优于现有基线,达到新的SOTA性能。

链接: https://arxiv.org/abs/2510.27265
作者: Raza Imam,Hu Wang,Dwarikanath Mahapatra,Mohammad Yaqub
机构: Mohammed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Khalifa University (哈利法大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Main: 11 pages; Supplementary: 9 pages, 10 tables, 10 figures

点击查看摘要

Abstract:In medical imaging, vision-language models face a critical duality: pretrained networks offer broad robustness but lack subtle, modality-specific characteristics, while fine-tuned expert models achieve high in-distribution accuracy yet falter under modality shift. Existing model-merging techniques, designed for natural-image benchmarks, are simple and efficient but fail to deliver consistent gains across diverse medical modalities; their static interpolation limits reliability in varied clinical tasks. To address this, we introduce Test-Time Task adaptive merging (T³), a backpropagation-free framework that computes per-sample interpolation coefficients via the Jensen-Shannon divergence between the two models' output distributions. T³ dynamically preserves local precision when models agree and defers to generalist robustness under drift. To overcome the inference costs of sample-wise merging, we further propose a batch-wise extension, T³_B, that computes a merging coefficient across a batch of samples, dramatically reducing the computational bottleneck. Recognizing the lack of a standardized medical-merging benchmark, we present a rigorous cross-evaluation protocol spanning in-domain, base-to-novel, and corruption settings across four modalities. Empirically, T³ sets a new state-of-the-art in Top-1 accuracy and error reduction, outperforming strong baselines while maintaining efficiency, paving the way for adaptive MVLM deployment in clinical settings. Our code is available at this https URL.
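
The per-sample coefficient is driven by the Jensen-Shannon divergence between the two models' output distributions. A self-contained sketch follows; normalizing by log 2 (the JSD upper bound) is an assumption, and the final line blends logits for brevity where the paper merges models:

```python
import torch
import torch.nn.functional as F

def js_divergence(p_logits, q_logits):
    """Jensen-Shannon divergence between two categorical output distributions."""
    p, q = F.softmax(p_logits, dim=-1), F.softmax(q_logits, dim=-1)
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a.clamp_min(1e-8).log() - b.clamp_min(1e-8).log())).sum(-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def t3_coefficient(expert_logits, generalist_logits):
    """Per-sample weight on the generalist: small when the two models agree,
    large under drift; divided by log 2 to land in [0, 1]."""
    jsd = js_divergence(expert_logits, generalist_logits)
    return (jsd / torch.log(torch.tensor(2.0))).clamp(0.0, 1.0)

e, g = torch.randn(4, 10), torch.randn(4, 10)
alpha = t3_coefficient(e, g)
blended = (1 - alpha)[:, None] * e + alpha[:, None] * g  # sample-wise blend
print(alpha)
```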
zh

[CV-43] RegionRAG: Region-level Retrieval-Augmented Generation for Visually-Rich Documents

【速读】:该论文旨在解决多模态检索增强生成(Multi-modal Retrieval-Augmented Generation, RAG)中因以整个文档为基本检索单元而导致的冗余视觉内容问题,具体表现为:1)相关文档内存在大量与查询无关的区域,稀释了关键信息;2)为提升召回率而检索多个文档时引入了冗余和无关文档,干扰模型注意力并降低性能。解决方案的关键在于提出一种名为RegionRAG的新框架,其核心创新是将检索粒度从文档级细化到区域级(region-level),通过训练阶段的混合监督策略(结合标注与未标注数据)精确定位相关图像块(patch),推理阶段设计动态流水线智能地将显著图像块聚类为完整语义区域,从而让生成器专注于与查询相关的紧凑视觉内容,显著提升了检索准确率(平均R@1提升10.02%)和问答准确率(提升3.56%),同时仅使用71.42%的视觉token,实现了效率与精度的双重优化。

链接: https://arxiv.org/abs/2510.27261
作者: Yinglu Li,Zhiying Lu,Zhihang Liu,Chuanbin Liu,Hongtao Xie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-modal Retrieval-Augmented Generation (RAG) has become a critical method for empowering LLMs by leveraging candidate visual documents. However, current methods consider the entire document as the basic retrieval unit, introducing substantial irrelevant visual content in two ways: 1) Relevant documents often contain large regions unrelated to the query, diluting the focus on salient information; 2) Retrieving multiple documents to increase recall further introduces redundant and irrelevant documents. These redundant contexts distract the model’s attention and further degrade the performance. To address this challenge, we propose RegionRAG, a novel framework that shifts the retrieval paradigm from the document level to the region level. During training, we design a hybrid supervision strategy from both labeled data and unlabeled data to pinpoint relevant patches. During inference, we propose a dynamic pipeline that intelligently groups salient patches into complete semantic regions. By delegating the task of identifying relevant regions to the retriever, RegionRAG enables the generator to focus solely on concise visual content relevant to queries, improving both efficiency and accuracy. Experiments on six benchmarks demonstrate that RegionRAG achieves state-of-the-art performance, improving retrieval accuracy by 10.02% in R@1 on average and increasing question-answering accuracy by 3.56% while using only 71.42% of the visual tokens required by prior methods. The code will be available at this https URL.
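A minimal sketch of the region-level idea follows, under assumptions of our own (cosine-similarity saliency against a fixed threshold, and 4-connected grouping of salient patches); the paper's dynamic pipeline is more sophisticated.

```python
# Sketch of region-level retrieval: score patches against the query, keep the
# salient ones, and group spatially adjacent patches into semantic regions.
import numpy as np

def group_salient_patches(patch_emb, query_emb, grid_h, grid_w, thresh=0.35):
    """patch_emb: [N, D] L2-normalized patch embeddings laid out row-major on
    a grid_h x grid_w grid; query_emb: [D] normalized query embedding."""
    sal = (patch_emb @ query_emb).reshape(grid_h, grid_w)  # cosine similarity
    mask = sal > thresh
    # Connected-component labeling with 4-connectivity (iterative flood fill).
    regions, seen = [], np.zeros_like(mask, dtype=bool)
    for i in range(grid_h):
        for j in range(grid_w):
            if mask[i, j] and not seen[i, j]:
                stack, comp = [(i, j)], []
                seen[i, j] = True
                while stack:
                    y, x = stack.pop()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < grid_h and 0 <= nx < grid_w \
                                and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
                regions.append(comp)  # one semantic region = one component
    return regions  # lists of (row, col) patch coordinates
```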

[CV-44] Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes

[Quick Read]: This paper tackles the ambiguity that arises in zero-shot action recognition with vision-language models (VLMs) when action classes alone provide the semantic context, particularly the conceptual vagueness introduced by polysemous words. The key to the solution is twofold: first, web-crawled descriptions are processed by a large language model (LLM) to automatically extract relevant keywords, reducing the need for human annotators and avoiding the laborious manual construction of attribute data; second, a spatio-temporal interaction module focuses on objects and action units, promoting alignment between description attributes and video content and improving the model's adaptability and accuracy on downstream tasks.

Link: https://arxiv.org/abs/2510.27255
Authors: Yehna Kim, Young-Eun Kim, Seong-Whan Lee
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-Language Models (VLMs) have demonstrated impressive capabilities in zero-shot action recognition by learning to associate video embeddings with class embeddings. However, a significant challenge arises when relying solely on action classes to provide semantic context, particularly due to the presence of multi-semantic words, which can introduce ambiguity in understanding the intended concepts of actions. To address this issue, we propose an innovative approach that harnesses web-crawled descriptions, leveraging a large-language model to extract relevant keywords. This method reduces the need for human annotators and eliminates the laborious manual process of attribute data creation. Additionally, we introduce a spatio-temporal interaction module designed to focus on objects and action units, facilitating alignment between description attributes and video content. In our zero-shot experiments, our model achieves impressive results, attaining accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, underscoring the model’s adaptability and effectiveness across various downstream tasks.

[CV-45] C-LEAD: Contrastive Learning for Enhanced Adversarial Defense

[Quick Read]: This paper addresses the vulnerability of deep neural networks (DNNs) to adversarial attacks in computer vision tasks, where small perturbations of input images can flip predictions and undermine reliability in deployment. The key to the solution is a contrastive learning scheme: a contrastive loss function is used to train the model jointly on clean and adversarially perturbed samples, optimizing the model parameters alongside the perturbations so that the network learns more robust feature representations and gains markedly better resistance to various types of adversarial perturbation.

Link: https://arxiv.org/abs/2510.27249
Authors: Suklav Ghosh, Sonal Kumar, Arijit Sur
Affiliations: Indian Institute of Technology Guwahati
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Deep neural networks (DNNs) have achieved remarkable success in computer vision tasks such as image classification, segmentation, and object detection. However, they are vulnerable to adversarial attacks, which can cause incorrect predictions with small perturbations in input images. Addressing this issue is crucial for deploying robust deep-learning systems. This paper presents a novel approach that utilizes contrastive learning for adversarial defense, a previously unexplored area. Our method leverages the contrastive loss function to enhance the robustness of classification models by training them with both clean and adversarially perturbed images. By optimizing the model’s parameters alongside the perturbations, our approach enables the network to learn robust representations that are less susceptible to adversarial attacks. Experimental results show significant improvements in the model’s robustness against various types of adversarial perturbations. This suggests that contrastive loss helps extract more informative and resilient features, contributing to the field of adversarial robustness in deep learning.
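The training signal can be illustrated with a short sketch that pairs each clean image with a PGD-perturbed view under an NT-Xent contrastive loss. The step size, epsilon, and the embedding-distance attack objective below are illustrative choices, not the paper's exact recipe.

```python
# Minimal sketch of contrastive adversarial training: a clean image and its
# PGD-perturbed view form a positive pair; other images in the batch act as
# negatives (NT-Xent loss).
import torch
import torch.nn.functional as F

def pgd_attack(encoder, x, eps=8/255, alpha=2/255, steps=5):
    """Find a perturbation that pushes the embedding away from the clean one."""
    anchor = encoder(x).detach()
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = -F.cosine_similarity(encoder(x + delta), anchor, dim=1).mean()
        loss.backward()
        # ascend the loss: larger loss = embeddings pushed further apart
        delta.data = (delta + alpha * delta.grad.sign()).clamp(-eps, eps)
        delta.grad.zero_()
    return (x + delta).detach().clamp(0, 1)

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent over a batch: row i's positive is its other view, i +/- B."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)          # [2B, D]
    sim = z @ z.t() / tau
    sim.fill_diagonal_(-1e9)                             # mask self-similarity
    B = z1.size(0)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)

# usage: z_clean = encoder(x); z_adv = encoder(pgd_attack(encoder, x))
# loss = nt_xent(z_clean, z_adv)
```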

[CV-46] Trans-defense: Transformer-based Denoiser for Adversarial Defense with Spatial-Frequency Domain Representation

[Quick Read]: This paper targets the vulnerability of deep neural networks (DNNs) to sophisticated adversarial attacks in security-critical systems. The key to the solution is a two-phase training procedure: first, a denoising network that fuses spatial- and frequency-domain features is trained, using the discrete wavelet transform (DWT) to identify and repair the severe corruption that attacks inflict on high-frequency components, with a transformer layer integrating spatial image features and wavelet coefficients; second, the classifier is retrained on the denoised images, which substantially improves its robustness to adversarial examples. Experiments on MNIST, CIFAR-10, and Fashion-MNIST show the method outperforms baselines that use only a denoising network or only adversarial training.

Link: https://arxiv.org/abs/2510.27245
Authors: Alik Pramanick, Mayank Bansal, Utkarsh Srivastava, Suklav Ghosh, Arijit Sur
Affiliations: Indian Institute of Technology Guwahati
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In recent times, deep neural networks (DNNs) have been successfully adopted for various applications. Despite their notable achievements, it has become evident that DNNs are vulnerable to sophisticated adversarial attacks, restricting their applications in security-critical systems. In this paper, we present a two-phase training method to tackle such attacks: first, training the denoising network, and second, the deep classifier model. We propose a novel denoising strategy that integrates both spatial and frequency domain approaches to defend against adversarial attacks on images. Our analysis reveals that high-frequency components of attacked images are more severely corrupted than their lower-frequency counterparts. To address this, we leverage the Discrete Wavelet Transform (DWT) for frequency analysis and develop a denoising network that combines spatial image features with wavelets through a transformer layer. Next, we retrain the classifier using the denoised images, which enhances the classifier's robustness against adversarial attacks. Experimental results across the MNIST, CIFAR-10, and Fashion-MNIST datasets reveal that the proposed method remarkably elevates classification accuracy, substantially exceeding approaches that rely on a denoising network or adversarial training alone. The code is available at this https URL.
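The frequency-domain observation that motivates the design can be reproduced with a few lines of PyWavelets. The 'haar' wavelet, single decomposition level, and random perturbation below are illustrative assumptions.

```python
# Sketch of the observation behind the denoiser: compare the energy an
# adversarial perturbation deposits in each DWT subband of an image.
import numpy as np
import pywt

def subband_energy_shift(clean: np.ndarray, attacked: np.ndarray):
    """clean, attacked: 2-D grayscale images. Returns per-subband energy of
    the perturbation after one DWT level (LL = low freq; LH/HL/HH = detail)."""
    out = {}
    c_ll, (c_lh, c_hl, c_hh) = pywt.dwt2(clean, "haar")
    a_ll, (a_lh, a_hl, a_hh) = pywt.dwt2(attacked, "haar")
    for name, c, a in [("LL", c_ll, a_ll), ("LH", c_lh, a_lh),
                       ("HL", c_hl, a_hl), ("HH", c_hh, a_hh)]:
        out[name] = float(np.mean((a - c) ** 2))
    return out  # high-frequency shifts typically dominate, motivating the design

rng = np.random.default_rng(0)
img = rng.random((32, 32))
adv = np.clip(img + 0.03 * np.sign(rng.standard_normal((32, 32))), 0, 1)
print(subband_energy_shift(img, adv))
```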

[CV-47] Fusion of Heterogeneous Pathology Foundation Models for Whole Slide Image Analysis

[Quick Read]: This paper addresses the heterogeneity of pathological foundation models (FMs), which stems from private training data and differing network architectures and causes unstable feature-extraction performance on downstream tasks. To fuse multiple heterogeneous pathology FMs effectively and boost ensemble performance, the key components are: (i) a multi-view clustering-based method that filters discriminative patches from multiple FMs' embeddings, keeping the training patches representative; (ii) a cluster-level re-embedding strategy that captures patch-level local features online, enabling effective fusion of heterogeneous patch-level FMs; and (iii) a collaborative distillation scheme that mines the connections between slide-level FMs to improve the fusion of heterogeneous slide-level FMs.

Link: https://arxiv.org/abs/2510.27237
Authors: Zhidong Yang, Xiuhui Shi, Wei Ba, Zhigang Song, Haijing Luan, Taiyuan Hu, Senlin Lin, Jiguang Wang, Shaohua Kevin Zhou, Rui Yan
Affiliations: University of Science and Technology of China; The Hong Kong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 22 pages, 9 figures

Abstract:Whole slide image (WSI) analysis has emerged as an increasingly essential technique in computational pathology. Recent advances in the pathological foundation models (FMs) have demonstrated significant advantages in deriving meaningful patch-level or slide-level feature representations from WSIs. However, current pathological FMs have exhibited substantial heterogeneity caused by diverse private training datasets and different network architectures. This heterogeneity introduces performance variability when we utilize the extracted features from different FMs in the downstream tasks. To fully explore the advantage of multiple FMs effectively, in this work, we propose a novel framework for the fusion of heterogeneous pathological FMs, called FuseCPath, yielding a model with a superior ensemble performance. The main contributions of our framework can be summarized as follows: (i) To guarantee the representativeness of the training patches, we propose a multi-view clustering-based method to filter out the discriminative patches via multiple FMs’ embeddings. (ii) To effectively fuse the heterogeneous patch-level FMs, we devise a cluster-level re-embedding strategy to online capture patch-level local features. (iii) To effectively fuse the heterogeneous slide-level FMs, we devise a collaborative distillation strategy to explore the connections between slide-level FMs. Extensive experiments conducted on lung cancer, bladder cancer, and colorectal cancer datasets from The Cancer Genome Atlas (TCGA) have demonstrated that the proposed FuseCPath achieves state-of-the-art performance across multiple tasks on these public datasets.

[CV-48] Object-IR: Leveraging Object Consistency and Mesh Deformation for Self-Supervised Image Retargeting

[Quick Read]: This paper addresses the stubborn geometric distortion of semantically important regions in image retargeting. The key is Object-IR, a self-supervised architecture that casts retargeting as a learning-based mesh-warping optimization problem: a convolutional neural network predicts the motion of each mesh grid point, and the image is warped by the resulting deformation; a comprehensive objective combining an object-consistent loss, a geometric-preserving loss, and a boundary loss keeps the appearance of important semantic objects intact, constrains important meshes to simple scale transforms, and enforces a clean rectangular output. Because supervision is derived directly from the geometric and semantic properties of the input, no manually annotated retargeting dataset is required, and the method delivers accurate, real-time retargeting.

Link: https://arxiv.org/abs/2510.27236
Authors: Tianli Liao, Ran Wang, Siqing Zhang, Lei Li, Guangen Liu, Chenyang Zhao, Heling Cao, Peng Li
Affiliations: Henan University of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in Pattern Recognition

Abstract:Eliminating geometric distortion in semantically important regions remains an intractable challenge in image retargeting. This paper presents Object-IR, a self-supervised architecture that reformulates image retargeting as a learning-based mesh warping optimization problem, where the mesh deformation is guided by object appearance consistency and geometric-preserving constraints. Given an input image and a target aspect ratio, we initialize a uniform rigid mesh at the output resolution and use a convolutional neural network to predict the motion of each mesh grid and obtain the deformed mesh. The retargeted result is generated by warping the input image according to the rigid mesh in the input image and the deformed mesh in the output resolution. To mitigate geometric distortion, we design a comprehensive objective function incorporating a) object-consistent loss to ensure that the important semantic objects retain their appearance, b) geometric-preserving loss to constrain simple scale transform of the important meshes, and c) boundary loss to enforce a clean rectangular output. Notably, our self-supervised paradigm eliminates the need for manually annotated retargeting datasets by deriving supervision directly from the input’s geometric and semantic properties. Extensive evaluations on the RetargetMe benchmark demonstrate that our Object-IR achieves state-of-the-art performance, outperforming existing methods in quantitative metrics and subjective visual quality assessments. The framework efficiently processes arbitrary input resolutions (average inference time: 0.009s for 1024x683 resolution) while maintaining real-time performance on consumer-grade GPUs. The source code will soon be available at this https URL.

[CV-49] MoRE: 3D Visual Geometry Reconstruction Meets Mixture-of-Experts

[Quick Read]: This paper addresses the difficulty of scaling models for large-scale 3D visual geometry reconstruction, where geometric supervision is complex and 3D data are highly diverse. The key is MoRE, a dense 3D visual foundation model built on a Mixture-of-Experts (MoE) architecture: dynamic routing dispatches features to task-specific experts, letting them specialize in complementary aspects of the data and improving both scalability and adaptability. A confidence-based depth refinement module stabilizes geometric estimation under real-world conditions, dense semantic features are fused with globally aligned 3D backbone representations for high-fidelity surface normal prediction, and tailored loss functions keep multi-task geometric learning stable and effective.

Link: https://arxiv.org/abs/2510.27234
Authors: Jingnan Gao, Zhe Wang, Xianze Fang, Xingyu Ren, Zhuo Chen, Shengqi Liu, Yuhao Cheng, Jiangjing Lyu, Xiaokang Yang, Yichao Yan
Affiliations: Shanghai Jiao Tong University; Alibaba Group
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL, Code: this https URL

Abstract:Recent advances in language and vision have demonstrated that scaling up model capacity consistently improves performance across diverse tasks. In 3D visual geometry reconstruction, large-scale training has likewise proven effective for learning versatile representations. However, further scaling of 3D models is challenging due to the complexity of geometric supervision and the diversity of 3D data. To overcome these limitations, we propose MoRE, a dense 3D visual foundation model based on a Mixture-of-Experts (MoE) architecture that dynamically routes features to task-specific experts, allowing them to specialize in complementary data aspects and enhance both scalability and adaptability. Aiming to improve robustness under real-world conditions, MoRE incorporates a confidence-based depth refinement module that stabilizes and refines geometric estimation. In addition, it integrates dense semantic features with globally aligned 3D backbone representations for high-fidelity surface normal prediction. MoRE is further optimized with tailored loss functions to ensure robust learning across diverse inputs and multiple geometric tasks. Extensive experiments demonstrate that MoRE achieves state-of-the-art performance across multiple benchmarks and supports effective downstream applications without extra computation.
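As a reference point for the routing idea, here is a generic top-k Mixture-of-Experts block in PyTorch; the dimensions, expert count, and gating rule are illustrative and not MoRE's actual architecture.

```python
# Minimal sketch of dynamic feature routing in an MoE block: a learned gate
# scores experts per token, and the top-k experts' outputs are combined.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEBlock(nn.Module):
    def __init__(self, dim=256, num_experts=4, k=2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
            for _ in range(num_experts))
        self.k = k

    def forward(self, x):                      # x: [tokens, dim]
        scores = self.gate(x)                  # [tokens, num_experts]
        topv, topi = scores.topk(self.k, dim=-1)
        weights = F.softmax(topv, dim=-1)      # renormalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):             # route each token to its experts
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(8, 256)
print(MoEBlock()(tokens).shape)  # torch.Size([8, 256])
```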

[CV-50] Mask-to-Height: A YOLOv11-Based Architecture for Joint Building Instance Segmentation and Height Classification from Satellite Imagery

[Quick Read]: This paper targets the need for more accurate building instance segmentation and height classification in urban planning, 3D city modeling, and infrastructure monitoring. The key is YOLOv11, whose more efficient architecture better combines features at different scales, improves object localization accuracy, and strengthens class discrimination in complex urban scenes while keeping inference fast; it handles occlusions, irregular building shapes, and the class imbalance of rare high-rise structures especially well, enabling accurate joint building extraction and discrete height classification.

Link: https://arxiv.org/abs/2510.27224
Authors: Mahmoud El Hussieni, Bahadır K. Güntürk, Hasan F. Ateş, Oğuz Hanoğlu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate building instance segmentation and height classification are critical for urban planning, 3D city modeling, and infrastructure monitoring. This paper presents a detailed analysis of YOLOv11, the recent advancement in the YOLO series of deep learning models, focusing on its application to joint building extraction and discrete height classification from satellite imagery. YOLOv11 builds on the strengths of earlier YOLO models by introducing a more efficient architecture that better combines features at different scales, improves object localization accuracy, and enhances performance in complex urban scenes. Using the DFC2023 Track 2 dataset – which includes over 125,000 annotated buildings across 12 cities – we evaluate YOLOv11’s performance using metrics such as precision, recall, F1 score, and mean average precision (mAP). Our findings demonstrate that YOLOv11 achieves strong instance segmentation performance with 60.4% mAP@50 and 38.3% mAP@50–95 while maintaining robust classification accuracy across five predefined height tiers. The model excels in handling occlusions, complex building shapes, and class imbalance, particularly for rare high-rise structures. Comparative analysis confirms that YOLOv11 outperforms earlier multitask frameworks in both detection accuracy and inference speed, making it well-suited for real-time, large-scale urban mapping. This research highlights YOLOv11’s potential to advance semantic urban reconstruction through streamlined categorical height modeling, offering actionable insights for future developments in remote sensing and geospatial intelligence.

[CV-51] Soft Task-Aware Routing of Experts for Equivariant Representation Learning NEURIPS2025

[Quick Read]: This paper addresses the redundant feature learning that arises when invariant and equivariant representations are learned jointly with separate projection heads, a design that ignores the information shared between the two objectives and wastes model capacity. The key is Soft Task-Aware Routing (STAR), a routing strategy that treats projection heads as experts and guides them to specialize in either shared or task-specific information, thereby reducing redundant feature learning; empirically this shows up as lower canonical correlations between invariant and equivariant embeddings and consistent gains across diverse transfer learning tasks.

Link: https://arxiv.org/abs/2510.27222
Authors: Jaebyeong Jeon, Hyeonseo Jang, Jy-yong Sohn, Kibok Lee
Affiliations: Yonsei University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: NeurIPS 2025

Abstract:Equivariant representation learning aims to capture variations induced by input transformations in the representation space, whereas invariant representation learning encodes semantic information by disregarding such transformations. Recent studies have shown that jointly learning both types of representations is often beneficial for downstream tasks, typically by employing separate projection heads. However, this design overlooks information shared between invariant and equivariant learning, which leads to redundant feature learning and inefficient use of model capacity. To address this, we introduce Soft Task-Aware Routing (STAR), a routing strategy for projection heads that models them as experts. STAR induces the experts to specialize in capturing either shared or task-specific information, thereby reducing redundant feature learning. We validate this effect by observing lower canonical correlations between invariant and equivariant embeddings. Experimental results show consistent improvements across diverse transfer learning tasks. The code is available at this https URL.

[CV-52] SpecAware: A Spectral-Content Aware Foundation Model for Unifying Multi-Sensor Learning in Hyperspectral Remote Sensing Mapping

[Quick Read]: This paper tackles the poor generalization of jointly trained multi-sensor models for hyperspectral imaging (HSI), which stems from the inherent heterogeneity of HSI data, together with the tendency of existing HSI foundation models to overlook the guiding role of sensor meta-attributes for cross-sensor transfer. The key is SpecAware, a spectral-content aware HSI foundation model whose core is a two-step hypernetwork-driven encoding process: a meta-content aware module fuses sensor meta-attributes with image content to generate a conditional input tailored to each spectral band of every sample, and a HyperEmbedding module uses a sample-conditioned hypernetwork to dynamically generate a pair of matrix factors for channel-wise encoding, performing adaptive spatial pattern extraction and latent semantic feature re-projection. This lets the model process HSI data from different sensors with variable band counts, establishing a unified framework for joint multi-sensor pre-training.

Link: https://arxiv.org/abs/2510.27219
Authors: Renjie Ji, Xue Wang, Chao Niu, Wen Zhang, Yong Mei, Kun Tan
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Hyperspectral imaging (HSI) is a vital tool for fine-grained land-use and land-cover (LULC) mapping. However, the inherent heterogeneity of HSI data has long posed a major barrier to developing generalized models via joint training. Although HSI foundation models have shown promise for different downstream tasks, the existing approaches typically overlook the critical guiding role of sensor meta-attributes, and struggle with multi-sensor training, limiting their transferability. To address these challenges, we propose SpecAware, which is a novel hyperspectral spectral-content aware foundation model for unifying multi-sensor learning for HSI mapping. We also constructed the Hyper-400K dataset to facilitate this research, which is a new large-scale, high-quality benchmark dataset with over 400k image patches from diverse airborne AVIRIS sensors. The core of SpecAware is a two-step hypernetwork-driven encoding process for HSI data. Firstly, we designed a meta-content aware module to generate a unique conditional input for each HSI patch, tailored to each spectral band of every sample by fusing the sensor meta-attributes and its own image content. Secondly, we designed the HyperEmbedding module, where a sample-conditioned hypernetwork dynamically generates a pair of matrix factors for channel-wise encoding, consisting of adaptive spatial pattern extraction and latent semantic feature re-projection. Thus, SpecAware gains the ability to perceive and interpret spatial-spectral features across diverse scenes and sensors. This, in turn, allows SpecAware to adaptively process a variable number of spectral channels, establishing a unified framework for joint pre-training. Extensive experiments on six datasets demonstrate that SpecAware can learn superior feature representations, excelling in land-cover semantic segmentation classification, change detection, and scene classification.
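The hypernetwork-driven channel-wise encoding can be sketched as follows; the construction of the condition vector, the low-rank factorization, and all dimensions are assumptions made for illustration.

```python
# Sketch of a sample-conditioned hypernetwork that emits a pair of matrix
# factors for channel-wise encoding, so images with any number of spectral
# bands map into a fixed-width feature space.
import torch
import torch.nn as nn

class HyperEmbedding(nn.Module):
    def __init__(self, cond_dim=16, rank=8, out_dim=64):
        super().__init__()
        self.rank, self.out_dim = rank, out_dim
        # hypernetwork heads: per-band condition -> two low-rank factors
        self.factor_a = nn.Linear(cond_dim, rank)            # band -> rank
        self.factor_b = nn.Linear(cond_dim, rank * out_dim)  # rank -> out_dim

    def forward(self, x, cond):
        """x: [B, C, H, W] with a variable band count C;
        cond: [B, C, cond_dim] per-band meta-attribute + content condition."""
        B, C, H, W = x.shape
        a = self.factor_a(cond)                              # [B, C, rank]
        b = self.factor_b(cond).mean(dim=1)                  # pool over bands
        b = b.view(B, self.rank, self.out_dim)               # [B, rank, out_dim]
        w = torch.einsum("bcr,bro->bco", a, b)               # [B, C, out_dim]
        # channel-wise encoding: project the C bands into out_dim features
        return torch.einsum("bchw,bco->bohw", x, w)          # [B, out_dim, H, W]

x, cond = torch.randn(2, 120, 8, 8), torch.randn(2, 120, 16)
print(HyperEmbedding()(x, cond).shape)  # torch.Size([2, 64, 8, 8])
```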

[CV-53] Privacy-Aware Continual Self-Supervised Learning on Multi-Window Chest Computed Tomography for Domain-Shift Robustness

[Quick Read]: This paper addresses the weak generalization caused by scarce annotations and the domain shift of dynamic healthcare environments in medical image diagnosis, in particular the feature-distribution differences that different window settings induce in chest CT. The key is a continual self-supervised learning (CSSL) framework that continually pretrains on unlabeled images: a latent replay-based mechanism mitigates catastrophic forgetting caused by domain shift while preserving data privacy, and a feature distillation technique combining Wasserstein distance-based knowledge distillation (WKD) with batch-knowledge ensemble (BKE) strengthens the model's ability to learn meaningful, domain-shift-robust representations, improving robustness and generalization on multi-window chest CT.

Link: https://arxiv.org/abs/2510.27213
Authors: Ren Tasai, Guang Li, Ren Togo, Takahiro Ogawa, Kenji Hirata, Minghui Tang, Takaaki Yoshimura, Hiroyuki Sugimori, Noriko Nishioka, Yukie Shimizu, Kohsuke Kudo, Miki Haseyama
Affiliations: Hokkaido University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We propose a novel continual self-supervised learning (CSSL) framework for simultaneously learning diverse features from multi-window-obtained chest computed tomography (CT) images and ensuring data privacy. Achieving a robust and highly generalizable model in medical image diagnosis is challenging, mainly because of issues, such as the scarcity of large-scale, accurately annotated datasets and domain shifts inherent to dynamic healthcare environments. Specifically, in chest CT, these domain shifts often arise from differences in window settings, which are optimized for distinct clinical purposes. Previous CSSL frameworks often mitigated domain shift by reusing past data, a typically impractical approach owing to privacy constraints. Our approach addresses these challenges by effectively capturing the relationship between previously learned knowledge and new information across different training stages through continual pretraining on unlabeled images. Specifically, by incorporating a latent replay-based mechanism into CSSL, our method mitigates catastrophic forgetting due to domain shifts during continual pretraining while ensuring data privacy. Additionally, we introduce a feature distillation technique that integrates Wasserstein distance-based knowledge distillation (WKD) and batch-knowledge ensemble (BKE), enhancing the ability of the model to learn meaningful, domain-shift-robust representations. Finally, we validate our approach using chest CT images obtained across two different window settings, demonstrating superior performance compared with other approaches.

[CV-54] GUI-Rise: Structured Reasoning and History Summarization for GUI Navigation NEURIPS2025

[Quick Read]: This paper addresses the weak cross-domain generalization and inefficient use of history that multimodal large language models (MLLMs) exhibit as GUI navigation agents. The key is a reasoning-enhanced framework that systematically integrates structured reasoning, action prediction, and history summarization: the structured reasoning component produces Chain-of-Thought analyses combining progress estimation and decision reasoning, which guide the immediate action prediction and yield compact history summaries for subsequent steps. The GUI-Rise agent trained under this framework combines supervised fine-tuning on pseudo-labeled trajectories with Group Relative Policy Optimization (GRPO) and a history-aware reward that ties summary quality directly to subsequent action performance, markedly improving reasoning robustness and generalization across diverse GUI navigation tasks, especially out of domain.

Link: https://arxiv.org/abs/2510.27210
Authors: Tao Liu, Chongyu Wang, Rongjie Li, Yingchen Yu, Xuming He, Bai Song
Affiliations: ShanghaiTech University; ByteDance; Shanghai Engineering Research Center of Intelligent Vision and Imaging
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in NeurIPS 2025

Abstract:While Multimodal Large Language Models (MLLMs) have advanced GUI navigation agents, current approaches face limitations in cross-domain generalization and effective history utilization. We present a reasoning-enhanced framework that systematically integrates structured reasoning, action prediction, and history summarization. The structured reasoning component generates coherent Chain-of-Thought analyses combining progress estimation and decision reasoning, which inform both immediate action predictions and compact history summaries for future steps. Based on this framework, we train a GUI agent, GUI-Rise, through supervised fine-tuning on pseudo-labeled trajectories and reinforcement learning with Group Relative Policy Optimization (GRPO). This framework employs specialized rewards, including a history-aware objective, directly linking summary quality to subsequent action performance. Comprehensive evaluations on standard benchmarks demonstrate state-of-the-art results under identical training data conditions, with particularly strong performance in out-of-domain scenarios. These findings validate our framework's ability to maintain robust reasoning and generalization across diverse GUI navigation tasks. Code is available at this https URL.

[CV-55] Multi-Modal Feature Fusion for Spatial Morphology Analysis of Traditional Villages via Hierarchical Graph Neural Networks

[Quick Read]: This paper addresses the disappearance of spatial characteristics and the homogenization of landscapes that urbanization brings to village spatial morphology, along with the limits of qualitative, single-discipline analyses that lack data support and digital infrastructure. The key is a Hierarchical Graph Neural Network (HGNN) with two node types (input and communication nodes) and two edge types (static input edges and dynamic communication edges), which combines Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT) in a two-stage feature update mechanism to fuse multimodal data efficiently; a relational pooling mechanism and joint training over 17 subtypes of village spatial morphology lift mean accuracy/F1 from 0.71/0.83 (independent models) to 0.82/0.90, including a 6% gain on parcel tasks, providing scientific evidence for analyzing village spatial patterns and their generative logic.

Link: https://arxiv.org/abs/2510.27208
Authors: Jiaxin Zhang, Zehong Zhu, Junye Deng, Yunqin Li, Bowen Wang
Affiliations: Nanchang University; SANKEN, The University of Osaka
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Village areas hold significant importance in the study of human-land relationships. However, with the advancement of urbanization, the gradual disappearance of spatial characteristics and the homogenization of landscapes have emerged as prominent issues. Existing studies primarily adopt a single-disciplinary perspective to analyze village spatial morphology and its influencing factors, relying heavily on qualitative analysis methods. These efforts are often constrained by the lack of digital infrastructure and insufficient data. To address the current research limitations, this paper proposes a Hierarchical Graph Neural Network (HGNN) model that integrates multi-source data to conduct an in-depth analysis of village spatial morphology. The framework includes two types of nodes (input nodes and communication nodes) and two types of edges (static input edges and dynamic communication edges). By combining Graph Convolutional Networks (GCN) and Graph Attention Networks (GAT), the proposed model efficiently integrates multimodal features under a two-stage feature update mechanism. Additionally, based on existing principles for classifying village spatial morphology, the paper introduces a relational pooling mechanism and implements a joint training strategy across 17 subtypes. Experimental results demonstrate that this method achieves significant performance improvements over existing approaches in multimodal fusion and classification tasks. The joint optimization of all subtypes lifts mean accuracy/F1 from 0.71/0.83 (independent models) to 0.82/0.90, driven by a 6% gain on parcel tasks. Our method provides scientific evidence for exploring village spatial patterns and generative logic.

[CV-56] Sparse Model Inversion: Efficient Inversion of Vision Transformers for Data-Free Applications

[Quick Read]: This paper addresses the inefficiency of dense model inversion for high-resolution images with large-scale Vision Transformers (ViTs): existing methods reconstruct the whole image area, wasting computation on redundant noisy backgrounds and on the unintended inversion of spurious correlations, a phenomenon the authors term "hallucination" in model inversion. The key is a sparse model inversion strategy, a plug-and-play extension of existing dense inversion methods that requires no change to their original loss functions: it selectively inverts semantic foregrounds while stopping the inversion of noisy backgrounds and potential spurious correlations, achieving up to 3.79x faster inversion while maintaining or even improving downstream performance in data-free quantization and data-free knowledge transfer.

Link: https://arxiv.org/abs/2510.27186
Authors: Zixuan Hu, Yongxian Wei, Li Shen, Zhenyi Wang, Lei Li, Chun Yuan, Dacheng Tao
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Model inversion, which aims to reconstruct the original training data from pre-trained discriminative models, is especially useful when the original training data is unavailable due to privacy, usage rights, or size constraints. However, existing dense inversion methods attempt to reconstruct the entire image area, making them extremely inefficient when inverting high-resolution images from large-scale Vision Transformers (ViTs). We further identify two underlying causes of this inefficiency: the redundant inversion of noisy backgrounds and the unintended inversion of spurious correlations, a phenomenon we term "hallucination" in model inversion. To address these limitations, we propose a novel sparse model inversion strategy, as a plug-and-play extension to speed up existing dense inversion methods with no need for modifying their original loss functions. Specifically, we selectively invert semantic foregrounds while stopping the inversion of noisy backgrounds and potential spurious correlations. Through both theoretical and empirical studies, we validate the efficacy of our approach in achieving significant inversion acceleration (up to 3.79x faster) while maintaining comparable or even enhanced downstream performance in data-free model quantization and data-free knowledge transfer. Code is available at this https URL.
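A sketch of the sparse-update idea: restrict each inversion step to the most salient patches and freeze the rest. The gradient-norm saliency proxy and the keep ratio below are stand-ins for the paper's actual selection rule.

```python
# Sketch of sparse inversion: keep updates only for patches deemed semantic
# foreground, so optimization effort is not spent on noisy background.
import torch

def sparse_inversion_step(x, model, target_class, keep_ratio=0.25, patch=16, lr=0.1):
    """x: [1, 3, H, W] synthetic image being inverted (requires_grad=True)."""
    loss = -model(x)[0, target_class]              # maximize the target logit
    grad, = torch.autograd.grad(loss, x)
    _, _, H, W = x.shape
    g = grad.abs().mean(1)                         # [1, H, W] saliency proxy
    g = g.unfold(1, patch, patch).unfold(2, patch, patch).mean((-1, -2))
    flat = g.flatten()                             # one score per patch
    k = max(1, int(keep_ratio * flat.numel()))
    keep = torch.zeros_like(flat, dtype=torch.bool)
    keep[flat.topk(k).indices] = True
    mask = keep.view(1, H // patch, W // patch) \
               .repeat_interleave(patch, 1) \
               .repeat_interleave(patch, 2).unsqueeze(1)   # [1, 1, H, W]
    with torch.no_grad():
        x -= lr * grad * mask                      # update foreground patches only
    return x
```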

[CV-57] Dual-level Progressive Hardness-Aware Reweighting for Cross-View Geo-Localization

[Quick Read]: This paper addresses the difficulty of cross-view geo-localization (CVGL) between drone and satellite imagery, where viewpoint gaps are severe and hard negatives (visually similar but geographically mismatched samples) abound; existing mining or reweighting schemes use static weights, are sensitive to distribution shifts, and tend to overemphasize difficult samples too early, yielding noisy gradients and unstable convergence. The key is a Dual-level Progressive Hardness-aware Reweighting (DPHR) strategy: at the sample level, a Ratio-based Difficulty-Aware (RDA) module evaluates relative difficulty and assigns fine-grained weights to negatives; at the batch level, a Progressive Adaptive Loss Weighting (PALW) mechanism uses a training-progress signal to attenuate noisy gradients early in optimization and progressively strengthen hard-negative mining as training matures, improving convergence stability and localization accuracy.

Link: https://arxiv.org/abs/2510.27181
Authors: Guozheng Zheng, Jian Guan, Mingjie Xie, Xuanjia Zhao, Congyi Fan, Shiheng Zhang, Pengming Feng
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 5 pages, 3 figures

Abstract:Cross-view geo-localization (CVGL) between drone and satellite imagery remains challenging due to severe viewpoint gaps and the presence of hard negatives, which are visually similar but geographically mismatched samples. Existing mining or reweighting strategies often use static weighting, which is sensitive to distribution shifts and prone to overemphasizing difficult samples too early, leading to noisy gradients and unstable convergence. In this paper, we present a Dual-level Progressive Hardness-aware Reweighting (DPHR) strategy. At the sample level, a Ratio-based Difficulty-Aware (RDA) module evaluates relative difficulty and assigns fine-grained weights to negatives. At the batch level, a Progressive Adaptive Loss Weighting (PALW) mechanism exploits a training-progress signal to attenuate noisy gradients during early optimization and progressively enhance hard-negative mining as training matures. Experiments on the University-1652 and SUES-200 benchmarks demonstrate the effectiveness and robustness of the proposed DPHR, achieving consistent improvements over state-of-the-art methods.
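The two levels can be sketched jointly as a progress-scheduled, per-negative weighting; the similarity-ratio hardness measure and the cosine ramp below are our assumptions, as the paper's RDA and PALW modules are more detailed.

```python
# Sketch of progressive hardness-aware reweighting: per-negative weights from
# a similarity ratio, scaled by a training-progress schedule that ramps up
# hard-negative emphasis over time.
import math
import torch

def negative_weights(sim_pos, sim_negs, step, total_steps, gamma=2.0):
    """sim_pos: [B] similarity to the true match; sim_negs: [B, N] similarities
    to negatives. Returns [B, N] weights for a weighted ranking loss."""
    ratio = sim_negs / sim_pos.clamp(min=1e-6).unsqueeze(1)  # ~1 => hard negative
    hardness = ratio.clamp(0, 2) ** gamma                    # fine-grained weight
    progress = 0.5 * (1 - math.cos(math.pi * step / total_steps))  # 0 -> 1
    weights = 1.0 + progress * (hardness - 1.0)  # early: ~uniform; late: hard-aware
    return weights / weights.mean(dim=1, keepdim=True)       # keep loss scale stable

w = negative_weights(torch.tensor([0.8]), torch.tensor([[0.7, 0.2, 0.75]]),
                     step=900, total_steps=1000)
print(w)  # the hard negatives (0.7, 0.75) get the larger weights late in training
```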

[CV-58] SilhouetteTell: Practical Video Identification Leveraging Blurred Recordings of Video Subtitles

[Quick Read]: This paper addresses the privacy threat posed by video identification attacks, which infer what videos a victim watches from the silhouettes that subtitles display on screen, thereby exposing hobbies, religious beliefs, political leanings, sexual orientation, and health status. The key is SilhouetteTell, a novel attack that fuses the spatial shape of subtitle silhouettes with the temporal differences between consecutive subtitles into a spatiotemporal feature, and exploits the spatiotemporal correlation between recorded subtitle silhouettes and a video's subtitle file; it identifies both online and offline videos with high accuracy across a range of settings, including recordings made from up to 40 meters away.

Link: https://arxiv.org/abs/2510.27179
Authors: Guanchong Huang, Song Fang
Affiliations: University of Oklahoma
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Comments: 16 pages, 29 figures. Accepted at the 26th Privacy Enhancing Technologies Symposium (PETS 2026)

Abstract:Video identification attacks pose a significant privacy threat that can reveal videos that victims watch, which may disclose their hobbies, religious beliefs, political leanings, sexual orientation, and health status. Also, video watching history can be used for user profiling or advertising and may result in cyberbullying, discrimination, or blackmail. Existing extensive video inference techniques usually depend on analyzing network traffic generated by streaming online videos. In this work, we observe that the content of a subtitle determines its silhouette displayed on the screen, and identifying each subtitle silhouette also derives the temporal difference between two consecutive subtitles. We then propose SilhouetteTell, a novel video identification attack that combines the spatial and time domain information into a spatiotemporal feature of subtitle silhouettes. SilhouetteTell explores the spatiotemporal correlation between recorded subtitle silhouettes of a video and its subtitle file. It can infer both online and offline videos. Comprehensive experiments on off-the-shelf smartphones confirm the high efficacy of SilhouetteTell for inferring video titles and clips under various settings, including from a distance of up to 40 meters.

[CV-59] H2-Cache: A Novel Hierarchical Dual-Stage Cache for High-Performance Acceleration of Generative Diffusion Models

[Quick Read]: This paper addresses the efficiency bottleneck that the iterative denoising process imposes on deploying generative diffusion models, and the difficulty existing caching techniques have in accelerating inference without sacrificing image quality. The key is H2-Cache, a hierarchical caching mechanism built on a functional decomposition of denoising into a structure-defining stage and a detail-refining stage, with a dual-threshold system that caches each stage independently; a lightweight pooled feature summarization (PFS) technique makes the dual similarity checks fast and robust. This yields up to 5.08x acceleration while keeping image quality nearly identical to the baseline, effectively easing the speed-fidelity trade-off.

Link: https://arxiv.org/abs/2510.27171
Authors: Mingyu Sung, Il-Min Kim, Sangseok Yun, Jae-Mo Kang
Affiliations: Kyungpook National University; Queen's University; Pukyong National University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Diffusion models have emerged as state-of-the-art in image generation, but their practical deployment is hindered by the significant computational cost of their iterative denoising process. While existing caching techniques can accelerate inference, they often create a challenging trade-off between speed and fidelity, suffering from quality degradation and high computational overhead. To address these limitations, we introduce H2-Cache, a novel hierarchical caching mechanism designed for modern generative diffusion model architectures. Our method is founded on the key insight that the denoising process can be functionally separated into a structure-defining stage and a detail-refining stage. H2-cache leverages this by employing a dual-threshold system, using independent thresholds to selectively cache each stage. To ensure the efficiency of our dual-check approach, we introduce pooled feature summarization (PFS), a lightweight technique for robust and fast similarity estimation. Extensive experiments on the Flux architecture demonstrate that H2-cache achieves significant acceleration (up to 5.08x) while maintaining image quality nearly identical to the baseline, quantitatively and qualitatively outperforming existing caching methods. Our work presents a robust and practical solution that effectively resolves the speed-quality dilemma, significantly lowering the barrier for the real-world application of high-fidelity diffusion models. Source code is available at this https URL.
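The dual-threshold test with pooled feature summarization can be sketched in a few lines; the pooling size and threshold values below are illustrative, not the paper's settings.

```python
# Sketch of the dual-threshold cache test: summarize a feature map by average
# pooling (PFS), compare against the cached summary, and reuse the cached
# block output when the change is small. Each stage gets its own threshold.
import torch
import torch.nn.functional as F

class StageCache:
    def __init__(self, tau_structure=0.02, tau_detail=0.05, pool=4):
        self.tau = {"structure": tau_structure, "detail": tau_detail}
        self.pool = pool
        self.summary, self.output = {}, {}

    def _pfs(self, feat):                     # pooled feature summarization
        return F.adaptive_avg_pool2d(feat, self.pool).flatten()

    def maybe_reuse(self, stage, feat, compute_fn):
        s = self._pfs(feat)
        if stage in self.summary:
            rel = (s - self.summary[stage]).norm() / self.summary[stage].norm()
            if rel < self.tau[stage]:         # features barely moved: reuse
                return self.output[stage]
        out = compute_fn(feat)                # cache miss: run the block
        self.summary[stage], self.output[stage] = s, out
        return out

cache = StageCache()
x = torch.randn(1, 64, 32, 32)
y1 = cache.maybe_reuse("structure", x, lambda f: f * 2)          # computed
y2 = cache.maybe_reuse("structure", x + 1e-4, lambda f: f * 2)   # reused
print(torch.equal(y1, y2))  # True
```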

[CV-60] DANCER: Dance ANimation via Condition Enhancement and Rendering with Diffusion Model

[Quick Read]: This paper addresses the difficulty of generating realistic, temporally continuous single-person dance videos, where the high degrees of freedom of human motion make human-centric video generation especially hard even for diffusion models that excel at static images. The key is DANCER (Dance ANimation via Condition Enhancement and Rendering with Diffusion Model), a framework built on a recent stable video diffusion model with two core modules: an Appearance Enhancement Module (AEM) that sharpens attention to the details of the reference image during generation, and a Pose Rendering Module (PRM) that extends the motion guidance with pose conditions from extra domains. The authors additionally collect a large amount of Internet video and build a new dataset, TikTok-3K, to strengthen training, and the model outperforms state-of-the-art methods on real-world datasets.

Link: https://arxiv.org/abs/2510.27169
Authors: Yucheng Xing, Jinxing Yin, Xiaodong Liu
Affiliations: Stony Brook University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recently, diffusion models have shown an impressive ability in visual generation tasks. Beyond static images, more and more research attention has been drawn to the generation of realistic videos. Video generation not only has higher requirements for quality, but also brings the challenge of ensuring temporal continuity. Among all video generation tasks, human-involved content, such as human dancing, is even more difficult to generate due to the high degrees of freedom associated with human motion. In this paper, we propose a novel framework, named DANCER (Dance ANimation via Condition Enhancement and Rendering with Diffusion Model), for realistic single-person dance synthesis based on the most recent stable video diffusion model. As video generation is generally guided by a reference image and a video sequence, we introduce two important modules into our framework to fully benefit from the two inputs. More specifically, we design an Appearance Enhancement Module (AEM) to focus more on the details of the reference image during generation, and extend the motion guidance through a Pose Rendering Module (PRM) to capture pose conditions from extra domains. To further improve the generation capability of our model, we also collect a large amount of video data from the Internet and generate a novel dataset, TikTok-3K, to enhance model training. The effectiveness of the proposed model has been evaluated through extensive experiments on real-world datasets, where its performance is superior to that of state-of-the-art methods. All data and code will be released upon acceptance.

[CV-61] M3Detection: Multi-Frame Multi-Level Feature Fusion for Multi-Modal 3D Object Detection with Camera and 4D Imaging Radar

[Quick Read]: This paper addresses the limitation that most camera-4D imaging radar fusion methods take single-frame inputs and so capture only a partial view of the scene; the incomplete information, compounded by image degradation and the sparsity of 4D radar point clouds, caps detection performance, while multi-frame fusion raises two challenges: fusing object features robustly and effectively across frames and modalities, and containing the cost of redundant feature extraction. The key is M³Detection, a unified multi-frame 3D object detection framework with multi-level feature fusion: it reuses intermediate features from the baseline detector and employs a tracker to produce reference trajectories for efficiency; a global-level inter-object feature aggregation module guided by radar information aligns global features across candidate proposals, and a local-level inter-grid feature aggregation module expands local features along the reference trajectories to enrich fine-grained object representation; finally, a trajectory-level multi-frame spatiotemporal reasoning module encodes cross-frame interactions, yielding state-of-the-art multi-frame 3D detection on the VoD and TJ4DRadSet datasets.

Link: https://arxiv.org/abs/2510.27166
Authors: Xiaozhi Li, Huijun Di, Jian Li, Feng Liu, Wei Liang
Affiliations: Beijing Institute of Technology; Key Laboratory of Electronic and Information Technology in Satellite Navigation, Ministry of Education; School of Computer Science, Beijing Institute of Technology; Beijing Racobit Electronic Information Technology Co., Ltd.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 9 figures

Abstract:Recent advances in 4D imaging radar have enabled robust perception in adverse weather, while camera sensors provide dense semantic information. Fusing these complementary modalities has great potential for cost-effective 3D perception. However, most existing camera-radar fusion methods are limited to single-frame inputs, capturing only a partial view of the scene. The incomplete scene information, compounded by image degradation and 4D radar sparsity, hinders overall detection performance. In contrast, multi-frame fusion offers richer spatiotemporal information but faces two challenges: achieving robust and effective object feature fusion across frames and modalities, and mitigating the computational cost of redundant feature extraction. Consequently, we propose M^3Detection, a unified multi-frame 3D object detection framework that performs multi-level feature fusion on multi-modal data from camera and 4D imaging radar. Our framework leverages intermediate features from the baseline detector and employs the tracker to produce reference trajectories, improving computational efficiency and providing richer information for the second stage. In the second stage, we design a global-level inter-object feature aggregation module guided by radar information to align global features across candidate proposals and a local-level inter-grid feature aggregation module that expands local features along the reference trajectories to enhance fine-grained object representation. The aggregated features are then processed by a trajectory-level multi-frame spatiotemporal reasoning module to encode cross-frame interactions and enhance temporal representation. Extensive experiments on the VoD and TJ4DRadSet datasets demonstrate that M^3Detection achieves state-of-the-art 3D detection performance, validating its effectiveness in multi-frame detection with camera-4D imaging radar fusion.

[CV-62] Generating Accurate and Detailed Captions for High-Resolution Images

[Quick Read]: This paper addresses the inaccurate and insufficiently detailed captions that vision-language models (VLMs) produce for high-resolution images: because VLMs are typically pretrained on low-resolution inputs (e.g., 224x224 or 336x336 pixels), downscaling loses visual detail and omits important objects. The key is a multi-stage enhancement pipeline: a VLM generates an initial caption; a large language model (LLM) identifies the key objects and predicts objects likely to co-occur with them; an object detection system verifies these predictions; and newly detected objects missing from the initial caption receive focused, region-specific captioning. This enriches caption detail while reducing hallucination by removing references to undetected objects.

Link: https://arxiv.org/abs/2510.27164
Authors: Hankyeol Lee, Gawon Seo, Kyounggyu Lee, Dogun Kim, Kyungwoo Song, Jiyoung Jung
Affiliations: University of Seoul; POSTECH; Yonsei University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Work conducted in 2024; released for archival purposes

Abstract:Vision-language models (VLMs) often struggle to generate accurate and detailed captions for high-resolution images since they are typically pre-trained on low-resolution inputs (e.g., 224x224 or 336x336 pixels). Downscaling high-resolution images to these dimensions may result in the loss of visual details and the omission of important objects. To address this limitation, we propose a novel pipeline that integrates vision-language models, large language models (LLMs), and object detection systems to enhance caption quality. Our proposed pipeline refines captions through a novel, multi-stage process. Given a high-resolution image, an initial caption is first generated using a VLM, and key objects in the image are then identified by an LLM. The LLM predicts additional objects likely to co-occur with the identified key objects, and these predictions are verified by object detection systems. Newly detected objects not mentioned in the initial caption undergo focused, region-specific captioning to ensure they are incorporated. This process enriches caption detail while reducing hallucinations by removing references to undetected objects. We evaluate the enhanced captions using pairwise comparison and quantitative scoring from large multimodal models, along with a benchmark for hallucination detection. Experiments on a curated dataset of high-resolution images demonstrate that our pipeline produces more detailed and reliable image captions while effectively minimizing hallucinations.
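Once the models are abstracted behind callables, the pipeline reduces to a short control flow; every function name and signature below is a placeholder of ours, not an API from the paper.

```python
# Sketch of the multi-stage caption-refinement pipeline described above.
from typing import Callable, Dict, List, Tuple

def enhance_caption(image,
                    vlm_caption: Callable[[object], str],
                    llm_expand_objects: Callable[[str], List[str]],
                    detect: Callable[[object, List[str]], Dict[str, Tuple]],
                    region_caption: Callable[[object, Tuple], str]) -> str:
    caption = vlm_caption(image)                 # stage 1: initial caption
    candidates = llm_expand_objects(caption)     # stage 2: key + co-occurring objects
    boxes = detect(image, candidates)            # stage 3: verify with a detector
    extras = []
    for name, box in boxes.items():              # stage 4: caption missed objects
        if name.lower() not in caption.lower():
            extras.append(region_caption(image, box))
    # only detector-verified content survives; unverified LLM guesses are
    # dropped, which is what suppresses hallucinated references
    return caption if not extras else caption + " " + " ".join(extras)
```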

[CV-63] How Close Are We? Limitations and Progress of AI Models in Banff Lesion Scoring

[Quick Read]: This paper asks how far existing deep learning models can go toward approximating Banff lesion scores for renal transplant biopsies through a modular, rule-based framework, given that the Banff Classification's semi-quantitative nature, complex criteria, and inter-observer variability make computational replication difficult. The key is to decompose each Banff indicator, such as glomerulitis (g), peritubular capillaritis (ptc), and intimal arteritis (v), into its constituent structural and inflammatory components, compute each component with current segmentation and detection tools, and map the outputs to Banff scores with heuristic rules aligned with expert guidelines, yielding an interpretable and verifiable pipeline whose failure modes (structural omission, hallucination, and detection ambiguity) are analyzed against expert annotations.

Link: https://arxiv.org/abs/2510.27158
Authors: Yanfan Zhu, Juming Xiong, Ruining Deng, Yu Wang, Yaohong Wang, Shilin Zhao, Mengmeng Yin, Yuqing Liu, Haichun Yang, Yuankai Huo
Affiliations: Vanderbilt University; Weill Cornell Medicine; Vanderbilt University Medical Center; UT MD Anderson Cancer Center; Tongji University School of Medicine
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The Banff Classification provides the global standard for evaluating renal transplant biopsies, yet its semi-quantitative nature, complex criteria, and inter-observer variability present significant challenges for computational replication. In this study, we explore the feasibility of approximating Banff lesion scores using existing deep learning models through a modular, rule-based framework. We decompose each Banff indicator, such as glomerulitis (g), peritubular capillaritis (ptc), and intimal arteritis (v), into its constituent structural and inflammatory components, and assess whether current segmentation and detection tools can support their computation. Model outputs are mapped to Banff scores using heuristic rules aligned with expert guidelines, and evaluated against expert-annotated ground truths. Our findings highlight both partial successes and critical failure modes, including structural omission, hallucination, and detection ambiguity. Even when final scores match expert annotations, inconsistencies in intermediate representations often undermine interpretability. These results reveal the limitations of current AI pipelines in computationally replicating expert-level grading, and they emphasize the importance of modular evaluation and a computational Banff grading standard in guiding future model development for transplant pathology.
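For one indicator, the heuristic mapping step might look like the sketch below. The percentage thresholds follow the commonly cited Banff definition of the glomerulitis (g) score, but both the thresholds and the rule structure should be treated as illustrative rather than as the paper's exact implementation.

```python
# Sketch of the rule-based mapping for glomerulitis (g). Upstream models are
# assumed to have segmented the glomeruli and flagged which ones are inflamed;
# thresholds below mirror the usual Banff g definition (g1 < 25%, g2 25-75%,
# g3 > 75% of glomeruli involved) and are illustrative only.
def banff_g_score(num_glomeruli: int, num_inflamed: int) -> int:
    if num_glomeruli == 0:
        raise ValueError("no glomeruli detected; score is not assessable")
    frac = num_inflamed / num_glomeruli
    if num_inflamed == 0:
        return 0          # g0: no glomerulitis
    if frac < 0.25:
        return 1          # g1: glomerulitis in <25% of glomeruli
    if frac <= 0.75:
        return 2          # g2: involvement in 25-75% of glomeruli
    return 3              # g3: involvement in >75% of glomeruli

print(banff_g_score(num_glomeruli=20, num_inflamed=7))  # 2 (35% involved)
```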

[CV-64] AFM-Net: Advanced Fusing Hierarchical CNN Visual Priors with Global Sequence Modeling for Remote Sensing Image Scene Classification

[Quick Read]: This paper addresses the challenges that complex spatial structures and multi-scale ground objects pose for remote sensing scene classification, in particular how to combine CNNs' local texture modeling with Transformers' global context modeling efficiently. The key is AFM-Net, which achieves local-global co-representation through two parallel pathways, a CNN branch extracting hierarchical visual priors and a Mamba branch performing efficient global sequence modeling; at its core, a Hierarchical Fusion Mechanism progressively aggregates multi-scale features from both pathways, enabling dynamic cross-level feature interaction and contextual reconstruction to produce highly discriminative representations, which a Mixture-of-Experts classifier then adaptively routes to the most suitable experts for fine-grained scene recognition.

Link: https://arxiv.org/abs/2510.27155
Authors: Yuanhao Tang, Xuechao Zou, Zhengpei Hu, Junliang Xing, Chengkun Zhang, Jianqiang Huang
Affiliations: Qinghai University; Beijing Jiaotong University; Tsinghua University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Remote sensing image scene classification remains a challenging task, primarily due to the complex spatial structures and multi-scale characteristics of ground objects. Among existing approaches, CNNs excel at modeling local textures, while Transformers excel at capturing global context. However, efficiently integrating them remains a bottleneck due to the high computational cost of Transformers. To tackle this, we propose AFM-Net, a novel Advanced Hierarchical Fusing framework that achieves effective local and global co-representation through two pathways: a CNN branch for extracting hierarchical visual priors, and a Mamba branch for efficient global sequence modeling. The core innovation of AFM-Net lies in its Hierarchical Fusion Mechanism, which progressively aggregates multi-scale features from both pathways, enabling dynamic cross-level feature interaction and contextual reconstruction to produce highly discriminative representations. These fused features are then adaptively routed through a Mixture-of-Experts classifier module, which dispatches them to the most suitable experts for fine-grained scene recognition. Experiments on AID, NWPU-RESISC45, and UC Merced show that AFM-Net obtains 93.72, 95.54, and 96.92 percent accuracy, surpassing state-of-the-art methods with balanced performance and efficiency. Code is available at this https URL.

[CV-65] HiGS: Hierarchical Generative Scene Framework for Multi-Step Associative Semantic Spatial Composition

[Quick Read]: This paper addresses the limitation of single-step 3D scene generation pipelines, which struggle to balance scene complexity against minimal user input. The key is HiGS, a hierarchical generative framework for multi-step associative semantic spatial composition that mimics human global-to-local scene modeling: users iteratively expand the scene by selecting key semantic objects, gaining fine-grained control over regions of interest while the model completes the peripheral areas automatically. A Progressive Hierarchical Spatial-Semantic Graph (PHiSSG) dynamically organizes spatial relationships and semantic dependencies as the scene evolves; by maintaining a one-to-one mapping between graph nodes and generated objects and supporting recursive layout optimization, it keeps the generation process spatially and geometrically consistent.

Link: https://arxiv.org/abs/2510.27148
Authors: Jiacheng Hong, Kunzhen Wu, Mingrui Yu, Yichao Gu, Shengze Xue, Shuangjiu Xiao, Deli Dong
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Three-dimensional scene generation holds significant potential in gaming, film, and virtual reality. However, most existing methods adopt a single-step generation process, making it difficult to balance scene complexity with minimal user input. Inspired by the human cognitive process in scene modeling, which progresses from global to local, focuses on key elements, and completes the scene through semantic association, we propose HiGS, a hierarchical generative framework for multi-step associative semantic spatial composition. HiGS enables users to iteratively expand scenes by selecting key semantic objects, offering fine-grained control over regions of interest while the model completes peripheral areas automatically. To support structured and coherent generation, we introduce the Progressive Hierarchical Spatial-Semantic Graph (PHiSSG), which dynamically organizes spatial relationships and semantic dependencies across the evolving scene structure. PHiSSG ensures spatial and geometric consistency throughout the generation process by maintaining a one-to-one mapping between graph nodes and generated objects and supporting recursive layout optimization. Experiments demonstrate that HiGS outperforms single-stage methods in layout plausibility, style consistency, and user preference, offering a controllable and extensible paradigm for efficient 3D scene construction.

[CV-66] Improving Cross-view Object Geo-localization: A Dual Attention Approach with Cross-view Interaction and Multi-Scale Spatial Features

[Quick Read]: This paper addresses two weaknesses of cross-view object geo-localization: insufficient information transfer between views and unrefined spatial-relationship feature maps, which together make models latch onto irrelevant edge noise and hurt localization accuracy. The key is a Cross-view and Cross-attention Module (CVCAM) that runs multiple rounds of interaction between the two views, continuously exchanging and learning contextual information about the query object from both perspectives, deepening the understanding of cross-view relationships while suppressing noise unrelated to the query; a Multi-head Spatial Attention Module (MHSAM) with convolution kernels of several sizes then extracts multi-scale spatial features from the feature maps containing implicit correspondences, further strengthening the query object's representation.

Link: https://arxiv.org/abs/2510.27139
Authors: Xingtao Ling, Yingying Zhu
Affiliations: Shenzhen University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Cross-view object geo-localization has recently gained attention due to potential applications. Existing methods aim to capture spatial dependencies of query objects between different views through attention mechanisms to obtain spatial relationship feature maps, which are then used to predict object locations. Although promising, these approaches fail to effectively transfer information between views and do not further refine the spatial relationship feature maps. This results in the model erroneously focusing on irrelevant edge noise, thereby affecting localization performance. To address these limitations, we introduce a Cross-view and Cross-attention Module (CVCAM), which performs multiple iterations of interaction between the two views, enabling continuous exchange and learning of contextual information about the query object from both perspectives. This facilitates a deeper understanding of cross-view relationships while suppressing the edge noise unrelated to the query object. Furthermore, we integrate a Multi-head Spatial Attention Module (MHSAM), which employs convolutional kernels of various sizes to extract multi-scale spatial features from the feature maps containing implicit correspondences, further enhancing the feature representation of the query object. Additionally, given the scarcity of datasets for cross-view object geo-localization, we created a new dataset called G2D for the "Ground-to-Drone" localization task, enriching existing datasets and filling the gap for this setting. Extensive experiments on the CVOGL and G2D datasets demonstrate that our proposed method achieves high localization accuracy, surpassing the current state-of-the-art.

[CV-67] E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources

[Quick Read]: This paper addresses the high training cost, heavy compute demands, and high latency of current diffusion models for text-to-image generation. The key is E-MMDiT, a lightweight multimodal diffusion Transformer with only 304M parameters whose design centers on token reduction: a highly compressive visual tokenizer produces a compact representation and a novel multi-path compression module reduces tokens further; Position Reinforcement strengthens positional information to preserve spatial coherence, Alternating Subregion Attention (ASA) performs attention within subregions to cut attention cost, and an AdaLN-affine module cheaply computes the modulation parameters of the transformer blocks. The 512px model trains on only 25M public samples in 1.5 days on a single node of 8 AMD MI300X GPUs, reaching 0.66 on GenEval and 0.72 with post-training techniques such as GRPO.

Link: https://arxiv.org/abs/2510.27135
Authors: Tong Shen, Jingai Yu, Dong Zhou, Dong Li, Emad Barsoum
Affiliations: Advanced Micro Devices, Inc.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion models have shown strong capabilities in generating high-quality images from text prompts. However, these models often require large-scale training data and significant computational resources to train, or suffer from heavy structure with high latency. To this end, we propose Efficient Multimodal Diffusion Transformer (E-MMDiT), an efficient and lightweight multimodal diffusion model with only 304M parameters for fast image synthesis requiring low training resources. We provide an easily reproducible baseline with competitive results. Our model for 512px generation, trained with only 25M public data in 1.5 days on a single node of 8 AMD MI300X GPUs, achieves 0.66 on GenEval and easily reaches to 0.72 with some post-training techniques such as GRPO. Our design philosophy centers on token reduction as the computational cost scales significantly with the token count. We adopt a highly compressive visual tokenizer to produce a more compact representation and propose a novel multi-path compression module for further compression of tokens. To enhance our design, we introduce Position Reinforcement, which strengthens positional information to maintain spatial coherence, and Alternating Subregion Attention (ASA), which performs attention within subregions to further reduce computational cost. In addition, we propose AdaLN-affine, an efficient lightweight module for computing modulation parameters in transformer blocks. Our code is available at this https URL and we hope E-MMDiT serves as a strong and practical baseline for future research and contributes to democratization of generative AI models.
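Alternating Subregion Attention can be sketched as windowed attention whose window grid shifts on alternating blocks so information crosses window borders; the window size and the shift rule below are assumptions for illustration.

```python
# Sketch of attention restricted to subregions: tokens on an H x W grid are
# carved into non-overlapping windows, attention runs inside each window, and
# alternating blocks shift the grid by half a window.
import torch
import torch.nn.functional as F

def subregion_attention(x, H, W, win=4, shift=False):
    """x: [B, H*W, D] token sequence. Returns the same shape."""
    B, N, D = x.shape
    g = x.view(B, H, W, D)
    if shift:                                  # alternate blocks shift the grid
        g = torch.roll(g, shifts=(win // 2, win // 2), dims=(1, 2))
    # carve the grid into (H/win)*(W/win) windows of win*win tokens each
    g = g.view(B, H // win, win, W // win, win, D).permute(0, 1, 3, 2, 4, 5)
    g = g.reshape(-1, win * win, D)            # [B*windows, win*win, D]
    out = F.scaled_dot_product_attention(g, g, g)   # attention per subregion
    out = out.reshape(B, H // win, W // win, win, win, D) \
             .permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, D)
    if shift:
        out = torch.roll(out, shifts=(-win // 2, -win // 2), dims=(1, 2))
    return out.reshape(B, N, D)

x = torch.randn(2, 16 * 16, 64)
print(subregion_attention(x, 16, 16, shift=True).shape)  # torch.Size([2, 256, 64])
```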

[CV-68] WildfireX-SLAM: A Large-scale Low-altitude RGB-D Dataset for Wildfire SLAM and Beyond

[Quick Read]: This paper addresses the absence of a comprehensive, high-quality dataset for 3D Gaussian splatting (3DGS)-based SLAM in large-scale forest scenes, which holds back applications such as wildfire emergency response and forest management. The key is WildfireX-SLAM, a large-scale, high-quality synthetic dataset of 5.5k low-altitude RGB-D aerial images covering a 16 km² forest, built with a pipeline on top of the Unreal Engine 5 Electric Dreams Environment Sample Project that offers flexible control over environmental factors such as lighting, weather, and the type and state of wildfire, and supports multi-modal data collection from a UAV for diverse tasks; a thorough benchmark on the dataset exposes the unique challenges of 3DGS-based SLAM in forests and points to future improvements, providing a reliable testbed and tooling for subsequent research.

Link: https://arxiv.org/abs/2510.27133
Authors: Zhicong Sun, Jacqueline Lo, Jinxing Hu
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: This paper has been accepted by MMM 2026

Abstract:3D Gaussian splatting (3DGS) and its subsequent variants have led to remarkable progress in simultaneous localization and mapping (SLAM). While most recent 3DGS-based SLAM works focus on small-scale indoor scenes, developing 3DGS-based SLAM methods for large-scale forest scenes holds great potential for many real-world applications, especially for wildfire emergency response and forest management. However, this line of research is impeded by the absence of a comprehensive and high-quality dataset, and collecting such a dataset over real-world scenes is costly and technically infeasible. To this end, we have built a large-scale, comprehensive, and high-quality synthetic dataset for SLAM in wildfire and forest environments. Leveraging the Unreal Engine 5 Electric Dreams Environment Sample Project, we developed a pipeline to easily collect aerial and ground views, including ground-truth camera poses and a range of additional data modalities from an unmanned aerial vehicle. Our pipeline also provides flexible controls on environmental factors such as light, weather, and the types and conditions of wildfire, supporting the need for various tasks covering forest mapping, wildfire emergency response, and beyond. The resulting pilot dataset, WildfireX-SLAM, contains 5.5k low-altitude RGB-D aerial images from a large-scale forest map with a total size of 16 km². On top of WildfireX-SLAM, a thorough benchmark is also conducted, which not only reveals the unique challenges of 3DGS-based SLAM in the forest but also highlights potential improvements for future works. The dataset and code will be publicly available. Project page: this https URL.

[CV-69] ZEBRA: Towards Zero-Shot Cross-Subject Generalization for Universal Brain Visual Decoding NEURIPS2025

[Quick Read]: This paper addresses the reliance of current neural decoding methods on subject-specific models or subject-specific fine-tuning, which limits scalability and practical use. The key is ZEBRA, the first zero-shot brain visual decoding framework: building on the insight that fMRI representations can be decomposed into subject-related and semantic-related components, it uses adversarial training to explicitly disentangle these components and isolate subject-invariant, semantic-specific representations, enabling generalization to unseen subjects without any additional fMRI data or retraining.

Link: https://arxiv.org/abs/2510.27128
Authors: Haonan Wang, Jingyu Lu, Hongrui Li, Xiaomeng Li
Affiliations: The Hong Kong University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by NeurIPS 2025

Abstract:Recent advances in neural decoding have enabled the reconstruction of visual experiences from brain activity, positioning fMRI-to-image reconstruction as a promising bridge between neuroscience and computer vision. However, current methods predominantly rely on subject-specific models or require subject-specific fine-tuning, limiting their scalability and real-world applicability. In this work, we introduce ZEBRA, the first zero-shot brain visual decoding framework that eliminates the need for subject-specific adaptation. ZEBRA is built on the key insight that fMRI representations can be decomposed into subject-related and semantic-related components. By leveraging adversarial training, our method explicitly disentangles these components to isolate subject-invariant, semantic-specific representations. This disentanglement allows ZEBRA to generalize to unseen subjects without any additional fMRI data or retraining. Extensive experiments show that ZEBRA significantly outperforms zero-shot baselines and achieves performance comparable to fully finetuned models on several metrics. Our work represents a scalable and practical step toward universal neural decoding. Code and model weights are available at: this https URL.
zh

[CV-70] Hierarchical Transformers for Unsupervised 3D Shape Abstraction

【速读】: This paper addresses the problem of automatically learning general, cross-category hierarchies for 3D shape representation; prior methods are usually restricted to predefined, fixed hierarchical structures (e.g., binary trees) and struggle to capture complex and diverse substructure relations. The key to the solution is a Hierarchical Transformer (HiT) that uses a compressed codebook at each level to explicitly model parent-child relations, enabling unsupervised inference of flexible and diverse tree hierarchies directly from data without committing to a specific topology in advance. The method is validated on an unsupervised shape segmentation task over all 55 ShapeNet categories, producing structured segmentations at multiple levels of granularity.

链接: https://arxiv.org/abs/2510.27088
作者: Aditya Vora,Lily Goli,Andrea Tagliasacchi,Hao Zhang
机构: Simon Fraser University (西蒙菲莎大学); University of Toronto (多伦多大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We introduce HiT, a novel hierarchical neural field representation for 3D shapes that learns general hierarchies in a coarse-to-fine manner across different shape categories in an unsupervised setting. Our key contribution is a hierarchical transformer (HiT), where each level learns parent-child relationships of the tree hierarchy using a compressed codebook. This codebook enables the network to automatically identify common substructures across potentially diverse shape categories. Unlike previous works that constrain the task to a fixed hierarchical structure (e.g., binary), we impose no such restriction, except for limiting the total number of nodes at each tree level. This flexibility allows our method to infer the hierarchical structure directly from data, over multiple shape categories, and representing more general and complex hierarchies than prior approaches. When trained at scale with a reconstruction loss, our model captures meaningful containment relationships between parent and child nodes. We demonstrate its effectiveness through an unsupervised shape segmentation task over all 55 ShapeNet categories, where our method successfully segments shapes into multiple levels of granularity.
zh

[CV-71] AD-SAM: Fine-Tuning the Segment Anything Vision Foundation Model for Autonomous Driving Perception

【速读】: This paper addresses the insufficient accuracy and generalization of semantic segmentation in autonomous driving (AD) scenes, particularly segmentation stability and data efficiency under complex spatial structures and diverse geometries. The key to the solution is the Autonomous Driving Segment Anything Model (AD-SAM), a vision foundation model tuned for AD, whose core innovations include: a dual-encoder that fuses global semantics from a pretrained Vision Transformer (ViT-H) with local spatial detail from a trainable convolutional backbone (ResNet-50); a deformable fusion module that aligns heterogeneous features across scales and object geometries; and a multi-stage progressive refinement decoder trained with a hybrid loss (Focal, Dice, Lovasz-Softmax, and Surface losses) to improve class balance, boundary precision, and optimization stability. These improvements let AD-SAM clearly outperform SAM, G-SAM, and DeepLabV3 on the Cityscapes and BDD100K benchmarks while showing stronger cross-domain generalization and data efficiency.

链接: https://arxiv.org/abs/2510.27047
作者: Mario Camarena,Het Patel,Fatemeh Nazari,Evangelos Papalexakis,Mohamadhossein Noruzoliaee,Jia Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE Transactions on Intelligent Transportation Systems (IEEE T-ITS)

点击查看摘要

Abstract:This paper presents the Autonomous Driving Segment Anything Model (AD-SAM), a fine-tuned vision foundation model for semantic segmentation in autonomous driving (AD). AD-SAM extends the Segment Anything Model (SAM) with a dual-encoder and deformable decoder tailored to spatial and geometric complexity of road scenes. The dual-encoder produces multi-scale fused representations by combining global semantic context from SAM’s pretrained Vision Transformer (ViT-H) with local spatial detail from a trainable convolutional deep learning backbone (i.e., ResNet-50). A deformable fusion module aligns heterogeneous features across scales and object geometries. The decoder performs progressive multi-stage refinement using deformable attention. Training is guided by a hybrid loss that integrates Focal, Dice, Lovasz-Softmax, and Surface losses, improving semantic class balance, boundary precision, and optimization stability. Experiments on the Cityscapes and Berkeley DeepDrive 100K (BDD100K) benchmarks show that AD-SAM surpasses SAM, Generalized SAM (G-SAM), and a deep learning baseline (DeepLabV3) in segmentation accuracy. It achieves 68.1 mean Intersection over Union (mIoU) on Cityscapes and 59.5 mIoU on BDD100K, outperforming SAM, G-SAM, and DeepLabV3 by margins of up to +22.9 and +19.2 mIoU in structured and diverse road scenes, respectively. AD-SAM demonstrates strong cross-domain generalization with a 0.87 retention score (vs. 0.76 for SAM), and faster, more stable learning dynamics, converging within 30-40 epochs, enjoying double the learning speed of benchmark models. It maintains 0.607 mIoU with only 1000 samples, suggesting data efficiency critical for reducing annotation costs. These results confirm that targeted architectural and optimization enhancements to foundation models enable reliable and scalable AD perception.
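To make the hybrid loss concrete, here is a minimal sketch combining the Focal and Dice terms named in the abstract (the Lovasz-Softmax and Surface terms are omitted for brevity); the weights and shapes are assumptions, not values from the paper.

```python
# Hedged sketch of a Focal + Dice hybrid segmentation loss.
import torch
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0):
    ce = F.cross_entropy(logits, target, reduction="none")
    pt = torch.exp(-ce)                      # prob of the true class
    return ((1 - pt) ** gamma * ce).mean()

def dice_loss(logits, target, eps=1e-6):
    probs = F.softmax(logits, dim=1)
    onehot = F.one_hot(target, probs.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    denom = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    return (1 - (2 * inter + eps) / (denom + eps)).mean()

def hybrid_loss(logits, target, w_focal=1.0, w_dice=1.0):
    return w_focal * focal_loss(logits, target) + w_dice * dice_loss(logits, target)

logits, target = torch.randn(2, 19, 64, 64), torch.randint(0, 19, (2, 64, 64))
print(hybrid_loss(logits, target))
```
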
zh

[CV-72] A Multi-Modal Neuro-Symbolic Approach for Spatial Reasoning -Based Visual Grounding in Robotics

【速读】: This paper targets the limited fine-grained spatial reasoning of vision-language models (VLMs) in complex environments, especially their understanding of spatial relations and logical interactions between objects in robotics. Existing VLMs rely mainly on images and reason over implicit correlations, which falls short for precise spatial judgments in dynamic, crowded, human-built environments. The key to the solution is a novel neuro-symbolic framework that fuses panoramic-image and 3D point cloud information, combining neural perception with symbolic reasoning to explicitly model spatial and logical relations. Its core components are a perception module for entity detection and attribute extraction and a reasoning module that builds a structured scene graph to support interpretable, high-precision spatial queries; evaluation on the JRDB-Reasoning dataset confirms superior performance and reliability in complex environments.

链接: https://arxiv.org/abs/2510.27033
作者: Simindokht Jahangard,Mehrzad Mohammadi,Abhinav Dhall,Hamid Rezatofighi
机构: Monash University (莫纳什大学); Sharif University of Technology (谢里夫理工大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Visual reasoning, particularly spatial reasoning, is a challenging cognitive task that requires understanding object relationships and their interactions within complex environments, especially in the robotics domain. Existing vision-language models (VLMs) excel at perception tasks but struggle with fine-grained spatial reasoning due to their implicit, correlation-driven reasoning and reliance solely on images. We propose a novel neuro-symbolic framework that integrates both panoramic-image and 3D point cloud information, combining neural perception with symbolic reasoning to explicitly model spatial and logical relationships. Our framework consists of a perception module for detecting entities and extracting attributes, and a reasoning module that constructs a structured scene graph to support precise, interpretable queries. Evaluated on the JRDB-Reasoning dataset, our approach demonstrates superior performance and reliability in crowded, human-built environments while maintaining a lightweight design suitable for robotics and embodied AI applications.
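The symbolic half of such a pipeline can be illustrated with a toy scene graph over detected entities, where spatial predicates are answered by explicit geometry rather than a neural net; the entity names, coordinates, and the single `left_of` relation below are invented for the demo.

```python
# Toy scene graph with a rule-based spatial predicate over 3D centroids.
from dataclasses import dataclass

@dataclass
class Entity:
    name: str
    x: float  # centroid coordinates, e.g. from a 3D point cloud
    y: float
    z: float

class SceneGraph:
    def __init__(self, entities):
        self.entities = {e.name: e for e in entities}

    def left_of(self, a, b):
        # Interpretable, symbolic rule instead of implicit correlation.
        return self.entities[a].x < self.entities[b].x

graph = SceneGraph([Entity("cup", 0.2, 1.0, 0.8),
                    Entity("laptop", 0.9, 1.1, 0.8)])
print(graph.left_of("cup", "laptop"))  # True
```
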
zh

[CV-73] VitalLens 2.0: High-Fidelity rPPG for Heart Rate Variability Estimation from Face Video

【速读】: This paper addresses the insufficient accuracy of physiological signal estimation in remote photoplethysmography (rPPG), in particular the extraction of heart rate (HR), respiratory rate (RR), and heart rate variability (HRV) metrics. The key to the solution is twofold: a new deep learning model architecture, and a larger, more diverse training dataset covering 1,413 individuals, which together markedly improve the model's robustness and accuracy across multiple physiological parameters.

链接: https://arxiv.org/abs/2510.27028
作者: Philipp V. Rouast
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Technical Report. 8 pages, 5 figures. Introduces the VitalLens 2.0 model for rPPG and Heart Rate Variability (HRV) estimation. Project website: this https URL

点击查看摘要

Abstract:This report introduces VitalLens 2.0, a new deep learning model for estimating physiological signals from face video. This new model demonstrates a significant leap in accuracy for remote photoplethysmography (rPPG), enabling the robust estimation of not only heart rate (HR) and respiratory rate (RR) but also Heart Rate Variability (HRV) metrics. This advance is achieved through a combination of a new model architecture and a substantial increase in the size and diversity of our training data, now totaling 1,413 unique individuals. We evaluate VitalLens 2.0 on a new, combined test set of 422 unique individuals from four public and private datasets. When averaging results by individual, VitalLens 2.0 achieves a Mean Absolute Error (MAE) of 1.57 bpm for HR, 1.08 bpm for RR, 10.18 ms for HRV-SDNN, and 16.45 ms for HRV-RMSSD. These results represent a new state-of-the-art, significantly outperforming previous methods. This model is now available to developers via the VitalLens API at this https URL.
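The HRV-SDNN and HRV-RMSSD figures quoted above are standard statistics over inter-beat (NN) intervals; the sketch below gives their textbook definitions (it is not the VitalLens code, and the sample intervals are made up).

```python
# SDNN: std of NN intervals; RMSSD: root mean square of successive diffs.
import numpy as np

def hrv_metrics(nn_intervals_ms):
    nn = np.asarray(nn_intervals_ms, dtype=float)
    sdnn = nn.std(ddof=1)                # overall variability, in ms
    diff = np.diff(nn)                   # beat-to-beat changes
    rmssd = np.sqrt(np.mean(diff ** 2))  # short-term variability, in ms
    return sdnn, rmssd

print(hrv_metrics([812, 845, 790, 860, 830]))
```
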
zh

[CV-74] Incremental Human-Object Interaction Detection with Invariant Relation Representation Learning

【速读】: This paper addresses the challenges posed by continuously evolving human-object interactions (HOIs) in open-world environments, where conventional closed-world HOI detectors face catastrophic forgetting under incremental learning, interaction drift, and the need to recognize zero-shot HOI combinations. The key to the solution is an exemplar-free Incremental Relation Distillation (IRD) framework: it decouples the learning of objects and relations and introduces two distinctive distillation losses to learn invariant relation features across HOI combinations that share the same semantic relation, thereby mitigating forgetting, strengthening robustness against interaction drift, and improving generalization to zero-shot HOIs.

链接: https://arxiv.org/abs/2510.27020
作者: Yana Wei,Zeen Chi,Chongyu Wang,Yu Wu,Shipeng Yan,Yongfei Liu,Xuming He
机构: ShanghaiTech University (上海科技大学); Shanghai Engineering Research Center of Intelligent Vision and Imaging (上海智能视觉与成像工程技术研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In open-world environments, human-object interactions (HOIs) evolve continuously, challenging conventional closed-world HOI detection models. Inspired by humans’ ability to progressively acquire knowledge, we explore incremental HOI detection (IHOID) to develop agents capable of discerning human-object relations in such dynamic environments. This setup confronts not only the common issue of catastrophic forgetting in incremental learning but also distinct challenges posed by interaction drift and detecting zero-shot HOI combinations with sequentially arriving data. Therefore, we propose a novel exemplar-free incremental relation distillation (IRD) framework. IRD decouples the learning of objects and relations, and introduces two unique distillation losses for learning invariant relation features across different HOI combinations that share the same relation. Extensive experiments on HICO-DET and V-COCO datasets demonstrate the superiority of our method over state-of-the-art baselines in mitigating forgetting, strengthening robustness against interaction drift, and generalization on zero-shot HOIs. Code is available at this https URL.
zh

[CV-75] MoME: Mixture of Visual Language Medical Experts for Medical Imaging Segmentation

【速读】: This paper addresses the limited expressiveness of single architectures in medical image segmentation, which struggle to balance accuracy and generalization across complex and diverse medical imaging scenarios. The key to the solution is MoME (Mixture of Visual Language Medical Experts), which brings the Mixture of Experts (MoE) paradigm, successful in large language models, to medical vision-language tasks: experts are selected dynamically to exploit multi-scale visual features enriched with textual embeddings, sharpening the model's grasp of medical image detail and yielding robust segmentation performance across datasets.

链接: https://arxiv.org/abs/2510.26996
作者: Arghavan Rezvani,Xiangyi Yan,Anthony T. Wu,Kun Han,Pooya Khosravi,Xiaohui Xie
机构: University of California, Irvine (加州大学欧文分校); Health Sciences, University of California, Irvine (加州大学欧文分校健康科学学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this study, we propose MoME, a Mixture of Visual Language Medical Experts, for Medical Image Segmentation. MoME adapts the successful Mixture of Experts (MoE) paradigm, widely used in Large Language Models (LLMs), for medical vision-language tasks. The architecture enables dynamic expert selection by effectively utilizing multi-scale visual features tailored to the intricacies of medical imagery, enriched with textual embeddings. This work explores a novel integration of vision-language models for this domain. Utilizing an assembly of 10 datasets, encompassing 3,410 CT scans, MoME demonstrates strong performance on a comprehensive medical imaging segmentation benchmark. Our approach explores the integration of foundation models for medical imaging, benefiting from the established efficacy of MoE in boosting model performance by incorporating textual information. Demonstrating competitive precision across multiple datasets, MoME explores a novel architecture for achieving robust results in medical image analysis.
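A minimal sketch of text-conditioned expert gating in this spirit is shown below: a router scores experts from concatenated visual and text embeddings and mixes the top-k expert outputs. The shapes, expert count, and top-k choice are illustrative assumptions only.

```python
# Toy MoE gate conditioned on fused vision + text features.
import torch
import torch.nn as nn

class MoMEGate(nn.Module):
    def __init__(self, dim=256, n_experts=4, k=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.router = nn.Linear(2 * dim, n_experts)  # sees vision + text
        self.k = k

    def forward(self, visual, text):
        logits = self.router(torch.cat([visual, text], dim=-1))
        weights, idx = logits.softmax(-1).topk(self.k, dim=-1)
        out = torch.zeros_like(visual)
        for b in range(visual.shape[0]):             # mix top-k experts
            for s in range(self.k):
                out[b] += weights[b, s] * self.experts[int(idx[b, s])](visual[b])
        return out

print(MoMEGate()(torch.randn(2, 256), torch.randn(2, 256)).shape)
```
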
zh

[CV-76] Using Salient Object Detection to Identify Manipulative Cookie Banners that Circumvent GDPR AAAI

【速读】: This paper investigates, in the context of General Data Protection Regulation (GDPR) compliance, whether user privacy-choice interfaces contain aesthetic manipulation, i.e., design tactics that steer users toward the button consenting to data sharing. The key to the solution is using salient object detection from computer vision to quantify how attention-drawing each cookie banner element is, which uncovers new types of aesthetic manipulation missed by earlier approaches (e.g., button placement) and yields a more accurate estimate: 38% of GDPR-compliant banners contain aesthetic manipulation, notably higher than the 27% reported in prior work. The study also compares EU and non-EU websites under different user locations, showing that EU websites respond more inventively to privacy-regulation pressure.

链接: https://arxiv.org/abs/2510.26967
作者: Riley Grossman,Michael Smith,Cristian Borcea,Yi Chen
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Accepted to International AAAI Conference on Web and Social Media 2026 (ICWSM’26)

点击查看摘要

Abstract:The main goal of this paper is to study how often cookie banners that comply with the General Data Protection Regulation (GDPR) contain aesthetic manipulation, a design tactic to draw users’ attention to the button that permits personal data sharing. As a byproduct of this goal, we also evaluate how frequently the banners comply with GDPR and the recommendations of national data protection authorities regarding banner designs. We visited 2,579 websites and identified the type of cookie banner implemented. Although 45% of the relevant websites have fully compliant banners, we found aesthetic manipulation on 38% of the compliant banners. Unlike prior studies of aesthetic manipulation, we use a computer vision model for salient object detection to measure how salient (i.e., attention-drawing) each banner element is. This enables the discovery of new types of aesthetic manipulation (e.g., button placement), and leads us to conclude that aesthetic manipulation is more common than previously reported (38% vs 27% of banners). To study the effects of user and/or website location on cookie banner design, we include websites within the European Union (EU), where privacy regulation enforcement is more stringent, and websites outside the EU. We visited websites from IP addresses in the EU and from IP addresses in the United States (US). We find that 13.9% of EU websites change their banner design when the user is from the US, and EU websites are roughly 48.3% more likely to use aesthetic manipulation than non-EU websites, highlighting their innovative responses to privacy regulation.
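One way to picture the measurement is below: given a saliency map from any salient-object-detection model, compare mean saliency inside the "accept" and "reject" button boxes. The boxes, synthetic map, and the 1.5x flag threshold are invented for illustration, not the paper's protocol.

```python
# Toy saliency-ratio check between two button regions of a banner.
import numpy as np

def saliency_ratio(saliency, accept_box, reject_box):
    def mean_in(box):
        x0, y0, x1, y1 = box
        return saliency[y0:y1, x0:x1].mean()
    return mean_in(accept_box) / (mean_in(reject_box) + 1e-8)

sal = np.random.rand(200, 400)
sal[150:180, 250:350] += 2.0            # artificially bright "accept" button
ratio = saliency_ratio(sal, (250, 150, 350, 180), (50, 150, 150, 180))
print(ratio, "-> aesthetic manipulation?", ratio > 1.5)
```
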
zh

[CV-77] SYNAPSE-Net: A Unified Framework with Lesion-Aware Hierarchical Gating for Robust Segmentation of Heterogeneous Brain Lesions

【速读】: This paper tackles the challenge of automatically segmenting heterogeneous brain lesions from multi-modal MRI, where current deep learning models generally suffer from weak generalization and high performance variance, limiting clinical reliability. The key to the solution is a unified adaptive framework, the Unified Multi-Stream SYNAPSE-Net, whose core innovations include: a hybrid architecture built on multi-stream CNN encoders, a Swin Transformer bottleneck for global context, a dynamic cross-modal attention fusion (CMAF) mechanism for efficient inter-modality integration, and a hierarchical gated decoder for high-fidelity mask reconstruction; training uses a variance-reduction strategy combining pathology-specific data augmentation with difficulty-aware sampling. The framework reaches state-of-the-art performance on three public datasets, demonstrating its robustness and clinical feasibility across diverse brain pathologies.

链接: https://arxiv.org/abs/2510.26961
作者: Md. Mehedi Hassan,Shafqat Alam,Shahriar Ahmed Seam,Maruf Ahmed
机构: Bangladesh University of Engineering and Technology (孟加拉国工程与技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 17 pages, 10 figures, 8 tables, submitted to “Medical Image Analysis” journal

点击查看摘要

Abstract:Automated segmentation of heterogeneous brain lesions from multi-modal MRI remains a critical challenge in clinical neuroimaging. Current deep learning models are typically specialized `point solutions’ that lack generalization and exhibit high performance variance, limiting their clinical reliability. To address these gaps, we propose the Unified Multi-Stream SYNAPSE-Net, an adaptive framework designed for both generalization and robustness. The framework is built on a novel hybrid architecture integrating multi-stream CNN encoders, a Swin Transformer bottleneck for global context, a dynamic cross-modal attention fusion (CMAF) mechanism, and a hierarchical gated decoder for high-fidelity mask reconstruction. The architecture is trained with a variance reduction strategy that combines pathology-specific data augmentation and a difficulty-aware sampling method. The model was evaluated on three challenging public datasets: the MICCAI 2017 WMH Challenge, the ISLES 2022 Challenge, and the BraTS 2020 Challenge. Our framework attained a state-of-the-art DSC value of 0.831 with an HD95 value of 3.03 on the WMH dataset. For ISLES 2022, it achieved the best boundary accuracy with a statistically significant difference (HD95 value of 9.69). For BraTS 2020, it reached the highest DSC value for the tumor core region (0.8651). These experimental findings suggest that our unified adaptive framework achieves state-of-the-art performance across multiple brain pathologies, providing a robust and clinically feasible solution for automated segmentation. The source code and the pre-trained models are available at this https URL.
zh

[CV-78] Scale-Aware Curriculum Learning for Data-Efficient Lung Nodule Detection with YOLOv11

【速读】: This paper addresses the limited effectiveness of deep learning models for lung nodule detection when annotated data are scarce. In clinical settings, especially medical imaging, high-quality annotations are hard to obtain, and traditional static curriculum learning strategies underperform under such data scarcity. The key to the solution is Scale Adaptive Curriculum Learning (SACL), whose core mechanisms are (1) adaptive epoch scheduling, (2) hard sample injection, and (3) scale-aware optimization, which together adapt training to the amount of available data. SACL delivers stable and superior performance across data scales and, at low data fractions (10%, 20%, 50%), clearly outperforms the baseline without any architectural modification.

链接: https://arxiv.org/abs/2510.26923
作者: Yi Luo,Yike Guo,Hamed Hooshangnejad,Kai Ding
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 5 pages, 2 figures

点击查看摘要

Abstract:Lung nodule detection in chest CT is crucial for early lung cancer diagnosis, yet existing deep learning approaches face challenges when deployed in clinical settings with limited annotated data. While curriculum learning has shown promise in improving model training, traditional static curriculum strategies fail in data-scarce scenarios. We propose Scale Adaptive Curriculum Learning (SACL), a novel training strategy that dynamically adjusts curriculum design based on available data scale. SACL introduces three key mechanisms:(1) adaptive epoch scheduling, (2) hard sample injection, and (3) scale-aware optimization. We evaluate SACL on the LUNA25 dataset using YOLOv11 as the base detector. Experimental results demonstrate that while SACL achieves comparable performance to static curriculum learning on the full dataset in mAP50, it shows significant advantages under data-limited conditions with 4.6%, 3.5%, and 2.0% improvements over baseline at 10%, 20%, and 50% of training data respectively. By enabling robust training across varying data scales without architectural modifications, SACL provides a practical solution for healthcare institutions to develop effective lung nodule detection systems despite limited annotation resources.
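To convey the flavor of scale-aware scheduling, here is a minimal sketch in which smaller datasets get a longer easy phase before hard samples are injected; the specific formula and cap are assumptions, not the paper's schedule.

```python
# Hypothetical scale-aware curriculum schedule: the easy phase grows as
# the data fraction shrinks, capped at 60% of total training.
def curriculum_schedule(total_epochs, data_fraction, base_easy=0.3):
    easy_share = min(0.6, base_easy / max(data_fraction, 0.1))
    easy_epochs = int(total_epochs * easy_share)
    return ["easy" if e < easy_epochs else "easy+hard"
            for e in range(total_epochs)]

print(curriculum_schedule(10, data_fraction=0.1))   # long easy phase
print(curriculum_schedule(10, data_fraction=1.0))   # short easy phase
```
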
zh

[CV-79] DC4GS: Directional Consistency-Driven Adaptive Density Control for 3D Gaussian Splatting NEURIPS2025

【速读】: This paper addresses the redundant splitting and limited structural fidelity of adaptive density control (ADC) in 3D Gaussian splatting, which conventionally splits primitives based only on positional gradient magnitudes. The key to the solution is introducing Directional Consistency (DC): the angular coherence of gradients measures local structural complexity, giving a more precise splitting criterion, and when a split is needed, DC determines optimal split positions so that sub-primitives align with local structure rather than being placed randomly. This markedly reduces the number of primitives (by up to 30% in the experiments) while improving reconstruction fidelity.

链接: https://arxiv.org/abs/2510.26921
作者: Moonsoo Jeong,Dongbeen Kim,Minseong Kim,Sungkil Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to NeurIPS 2025 / Project page: this https URL

点击查看摘要

Abstract:We present a Directional Consistency (DC)-driven Adaptive Density Control (ADC) for 3D Gaussian Splatting (DC4GS). Whereas the conventional ADC bases its primitive splitting on the magnitudes of positional gradients, we further incorporate the DC of the gradients into ADC, and realize it through the angular coherence of the gradients. Our DC better captures local structural complexities in ADC, avoiding redundant splitting. When splitting is required, we again utilize the DC to define optimal split positions so that sub-primitives best align with the local structures than the conventional random placement. As a consequence, our DC4GS greatly reduces the number of primitives (up to 30% in our experiments) than the existing ADC, and also enhances reconstruction fidelity greatly.
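One plausible reading of "angular coherence of the gradients" is the norm of the mean of unit gradient vectors, as sketched below; this interpretation is an assumption, not the authors' exact formula. A value near 1 means the accumulated gradients agree in direction, while a value near 0 means they cancel, hinting at complex local structure.

```python
# Hypothetical directional-consistency score over accumulated gradients.
import torch

def directional_consistency(grads, eps=1e-8):
    # grads: (N, 3) positional gradients accumulated for one Gaussian
    units = grads / (grads.norm(dim=-1, keepdim=True) + eps)
    return units.mean(dim=0).norm()          # scalar in [0, 1]

coherent = torch.tensor([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])
opposed = torch.tensor([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
print(directional_consistency(coherent))    # close to 1
print(directional_consistency(opposed))     # close to 0
```
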
zh

[CV-80] PF-DAformer: Proximal Femur Segmentation via Domain Adaptive Transformer for Dual-Center QCT

【速读】: This paper addresses the poor generalization of automated segmentation models in multi-center quantitative computed tomography (QCT) caused by domain shift: differences in scanner models, reconstruction settings, and patient demographics across institutions destabilize predictions on new data and hinder the cross-center reproducibility of bone-density analysis and fracture-risk assessment. The key to the solution is a domain-adaptive transformer segmentation framework for multi-center QCT that integrates two complementary strategies into a 3D TransUNet backbone: adversarial alignment via a Gradient Reversal Layer (GRL), which discourages the network from encoding site-specific cues, and statistical alignment via Maximum Mean Discrepancy (MMD), which explicitly reduces feature-distribution mismatch between institutions. This dual mechanism enables scanner-agnostic feature learning while preserving anatomical detail, improving robustness and transferability in multi-center settings.

链接: https://arxiv.org/abs/2510.26903
作者: Rochak Dhakal,Chen Zhao,Zixin Shi,Joyce H. Keyak,Tadashi S. Kaneko,Kuan-Jui Su,Hui Shen,Hong-Wen Deng,Weihua Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注: 22 Pages, 5 Tables, 10 Figures. The combination of GRL and MMD achieved the most balanced performance, reducing contour deviations and enhancing surface smoothness

点击查看摘要

Abstract:Quantitative computed tomography (QCT) plays a crucial role in assessing bone strength and fracture risk by enabling volumetric analysis of bone density distribution in the proximal femur. However, deploying automated segmentation models in practice remains difficult because deep networks trained on one dataset often fail when applied to another. This failure stems from domain shift, where scanners, reconstruction settings, and patient demographics vary across institutions, leading to unstable predictions and unreliable quantitative metrics. Overcoming this barrier is essential for multi-center osteoporosis research and for ensuring that radiomics and structural finite element analysis results remain reproducible across sites. In this work, we developed a domain-adaptive transformer segmentation framework tailored for multi-institutional QCT. Our model is trained and validated on one of the largest hip fracture related research cohorts to date, comprising 1,024 QCT images scans from Tulane University and 384 scans from Rochester, Minnesota for proximal femur segmentation. To address domain shift, we integrate two complementary strategies within a 3D TransUNet backbone: adversarial alignment via Gradient Reversal Layer (GRL), which discourages the network from encoding site-specific cues, and statistical alignment via Maximum Mean Discrepancy (MMD), which explicitly reduces distributional mismatches between institutions. This dual mechanism balances invariance and fine-grained alignment, enabling scanner-agnostic feature learning while preserving anatomical detail.
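The MMD term mentioned above is typically the standard RBF-kernel MMD² estimator between feature batches from two sites, sketched below; the bandwidth and feature sizes are ad hoc choices, not values from the paper.

```python
# Standard (biased) RBF-kernel MMD^2 between two feature batches.
import torch

def rbf(x, y, sigma):
    d2 = torch.cdist(x, y) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd2(feat_a, feat_b, sigma=1.0):
    return (rbf(feat_a, feat_a, sigma).mean()
            + rbf(feat_b, feat_b, sigma).mean()
            - 2 * rbf(feat_a, feat_b, sigma).mean())

a, b = torch.randn(32, 128), torch.randn(32, 128) + 0.5
print(mmd2(a, b))  # larger when the two sites' features differ more
```
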
zh

[CV-81] Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench

【速读】: This paper addresses the surprisingly poor performance of current vision-language models (VLMs) at reading measurement instruments (gauges, dials, and the like), especially their limits in fine-grained spatial grounding. The key to the solution is MeasureBench, a benchmark covering real-world and synthesized images of many measurement types, together with an extensible data-synthesis pipeline that procedurally generates specified gauge types with controllable variation in pointers, scales, fonts, lighting, and clutter, enabling large-scale, diverse data generation. Experiments show that even frontier VLMs fail systematically, with indicator localization as a consistent failure mode: models read digits and labels but misidentify pointer positions, producing large numeric errors despite plausible textual reasoning. This exposes a fundamental limitation of current VLMs in precise spatial perception and points to visually grounded numeracy as a direction for future work.

链接: https://arxiv.org/abs/2510.26865
作者: Fenfen Lin,Yesheng Liu,Haiyu Xu,Chen Yue,Zheqi He,Mingxuan Zhao,Miguel Hu Chen,Jiakang Liu,JG Yao,Xi Yang
机构: BAAI(北京智源人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL

点击查看摘要

Abstract:Reading measurement instruments is effortless for humans and requires relatively little domain expertise, yet it remains surprisingly challenging for current vision-language models (VLMs), as we find in a preliminary evaluation. In this work, we introduce MeasureBench, a benchmark on visual measurement reading covering both real-world and synthesized images of various types of measurements, along with an extensible pipeline for data synthesis. Our pipeline procedurally generates a specified type of gauge with controllable visual appearance, enabling scalable variation in key details such as pointers, scales, fonts, lighting, and clutter. Evaluation on popular proprietary and open-weight VLMs shows that even the strongest frontier VLMs struggle with measurement reading in general. A consistent failure mode is indicator localization: models can read digits or labels but misidentify the key positions of pointers or alignments, leading to big numeric errors despite plausible textual reasoning. We have also conducted preliminary experiments with reinforcement learning over synthetic data, and find encouraging results on the in-domain synthetic subset but less promising ones for real-world images. Our analysis highlights a fundamental limitation of current VLMs in fine-grained spatial grounding. We hope this resource can help future advances on visually grounded numeracy and precise spatial perception of VLMs, bridging the gap between recognizing numbers and measuring the world.
zh

[CV-82] Audio-Visual Speech Enhancement In Complex Scenarios With Separation And Dereverberation Joint Modeling

【速读】: This paper addresses the weak performance of audio-visual speech enhancement (AVSE) in complex acoustic environments, where multiple interfering sounds and reverberation often leave the extracted speech with poor perceptual quality. The key to the solution is a "separation before dereverberation" pipeline that improves robustness in complex conditions and can be extended to other AVSE network architectures. The method was validated in the 4th COGMHEAR Audio-Visual Speech Enhancement Challenge (AVSEC-4), achieving excellent results on the three objective leaderboard metrics and first place in the human subjective listening test.

链接: https://arxiv.org/abs/2510.26825
作者: Jiarong Du,Zhan Jin,Peijun Yang,Juan Liu,Zhuo Li,Xin Liu,Ming Li
机构: 未知
类目: ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Audio-visual speech enhancement (AVSE) is a task that uses visual auxiliary information to extract a target speaker’s speech from mixed audio. In real-world scenarios, there often exist complex acoustic environments, accompanied by various interfering sounds and reverberation. Most previous methods struggle to cope with such complex conditions, resulting in poor perceptual quality of the extracted speech. In this paper, we propose an effective AVSE system that performs well in complex acoustic environments. Specifically, we design a “separation before dereverberation” pipeline that can be extended to other AVSE networks. The 4th COGMHEAR Audio-Visual Speech Enhancement Challenge (AVSEC) aims to explore new approaches to speech processing in multimodal complex environments. We validated the performance of our system in AVSEC-4: we achieved excellent results in the three objective metrics on the competition leaderboard, and ultimately secured first place in the human subjective listening test.
zh

[CV-83] A Neural Architecture Search Method using Auxiliary Evaluation Metric based on ResNet Architecture GECCO2023

【速读】: This paper addresses the inefficiency of manually designing neural network architectures and the difficulty of optimizing their performance, proposing a neural architecture search (NAS) space built on the ResNet framework whose search targets include the parameters of convolutional, pooling, and fully connected layers as well as the connectivity of the residual network. The key to the solution is this structured search space combined with using the validation-set loss value as a secondary optimization objective, preserving recognition accuracy while improving generalization. Experiments show the method finds competitive network architectures on the MNIST, Fashion-MNIST, and CIFAR100 datasets.

链接: https://arxiv.org/abs/2505.01313
作者: Shang Wang,Huanrong Tang,Jianquan Ouyang
机构: Xiangtan University (湘潭大学)
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: GECCO 2023

点击查看摘要

Abstract:This paper proposes a neural architecture search space using ResNet as a framework, with search objectives including parameters for convolution, pooling, fully connected layers, and connectivity of the residual network. In addition to recognition accuracy, this paper uses the loss value on the validation set as a secondary objective for optimization. The experimental results demonstrate that the proposed search space, together with the optimisation approach, can find competitive network architectures on the MNIST, Fashion-MNIST and CIFAR100 datasets.
zh

[CV-84] Dark-Field X-Ray Imaging Significantly Improves Deep-Learning based Detection of Synthetic Early-Stage Lung Tumors in Preclinical Models

【速读】: This paper addresses the limited accessibility and accuracy of low-dose computed tomography (LDCT) for early lung cancer screening, given missing infrastructure and high false-positive rates (26.6% in the NLST). The key to the solution is X-ray dark-field imaging (DFI), which is sensitive to small-angle scatter from alveolar microstructure and less affected by organ shadowing, combined with a deep-learning segmentation model (U-Net) for tumor detection. Experiments show a true-positive detection rate of 83.7% with DFI-only input, far above the 51% of conventional attenuation (ATTN) imaging, at comparable specificity (90.5% vs. 92.9%), supporting DFI as an accessible, low-cost, low-dose alternative for early lung tumor detection in preclinical or resource-limited settings where LDCT is unavailable.

链接: https://arxiv.org/abs/2510.27679
作者: Joyoni Dey,Hunter C. Meyer,Murtuza S. Taqi
机构: 未知
类目: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV); Optics (physics.optics)
备注:

点击查看摘要

Abstract:Low-dose computed tomography (LDCT) is the current standard for lung cancer screening, yet its adoption and accessibility remain limited. Many regions lack LDCT infrastructure, and even among those screened, early-stage cancer detection often yields false positives, as shown in the National Lung Screening Trial (NLST) with a sensitivity of 93.8 percent and a false-positive rate of 26.6 percent. We aim to investigate whether X-ray dark-field imaging (DFI) radiographs, a technique sensitive to small-angle scatter from alveolar microstructure and less susceptible to organ shadowing, can significantly improve early-stage lung tumor detection when coupled with deep-learning segmentation. Using paired attenuation (ATTN) and DFI radiograph images of euthanized mouse lungs, we generated realistic synthetic tumors with irregular boundaries and intensity profiles consistent with physical lung contrast. A U-Net segmentation network was trained on small patches using either ATTN, DFI, or combined ATTN and DFI inputs. We show that the DFI-only model achieved a true-positive detection rate of 83.7 percent, compared with 51 percent for ATTN-only, while maintaining comparable specificity (90.5 versus 92.9 percent). The combined ATTN and DFI input achieved 79.6 percent sensitivity and 97.6 percent specificity. In conclusion, DFI substantially improves early-tumor detectability in comparison to standard attenuation radiography and shows potential as an accessible, low-cost, low-dose alternative for pre-clinical or limited-resource screening where LDCT is unavailable.
zh

[CV-85] A frag ile zero-watermarking method based on dual quaternion matrix decomposition

【速读】: This paper addresses the risks of unclear copyright ownership and content tampering that medical images face during transmission and sharing. The key to the solution is a fragile zero-watermarking model based on dual quaternion matrix decomposition: it uses the operational relationship between the standard part and the dual part of dual quaternions to associate the original carrier image with the watermark image, and generates the zero-watermark information from the properties of the dual quaternion matrix decomposition, thereby achieving both copyright protection and tamper detection for medical images.

链接: https://arxiv.org/abs/2510.27307
作者: Mingcui Zhang,Zhigang Jia
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
备注: 18 pages, 6 figures, 3 tables

点击查看摘要

Abstract:Medical images play a crucial role in assisting diagnosis, remote consultation, and academic research. However, during the transmission and sharing process, they face serious risks of copyright ownership and content tampering. Therefore, protecting medical images is of great importance. As an effective means of image copyright protection, zero-watermarking technology focuses on constructing watermarks without modifying the original carrier by extracting its stable features, which provides an ideal approach for protecting medical images. This paper aims to propose a fragile zero-watermarking model based on dual quaternion matrix decomposition, which utilizes the operational relationship between the standard part and the dual part of dual quaternions to correlate the original carrier image with the watermark image, and generates zero-watermarking information based on the characteristics of dual quaternion matrix decomposition, ultimately achieving copyright protection and content tampering detection for medical images.
zh

[CV-86] Generative diffusion modeling protocols for improving the Kikuchi pattern indexing in electron back-scatter diffraction

【速读】: This paper addresses the drop in signal-to-noise ratio caused by shorter exposure times in high-speed electron back-scatter diffraction (EBSD) scans, which degrades pattern indexing accuracy. The key to the solution is developing generative machine learning models that restore noisy diffraction patterns acquired at high scan speeds, either in post-processing or on the fly, recovering high-quality patterns and making crystal orientation determination reliable. Notably, the approach is less data-hungry than typical machine learning models.

链接: https://arxiv.org/abs/2510.26907
作者: Meghraj Prajapata,Alankar Alankar
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Electron back-scatter diffraction (EBSD) has traditionally relied upon methods such as the Hough transform and dictionary indexing to interpret diffraction patterns and extract crystallographic orientation. However, these methods encounter significant limitations, particularly when operating at high scanning speeds, where the exposure time per pattern is decreased beyond the operating sensitivity of the CCD camera. Hence the signal-to-noise ratio of the observed pattern decreases, making the pattern noisy and reducing indexing accuracy. This work develops generative machine learning models for the post-processing or on-the-fly processing of Kikuchi patterns that are capable of restoring noisy EBSD patterns obtained at high scan speeds. These restored patterns can be used for the determination of crystal orientations to provide reliable indexing results. We compare the performance of such generative models in enhancing the quality of patterns captured at short exposure times (high scan speeds). An interesting observation is that the methodology is not as data-hungry as typical machine learning methods.
zh

[CV-87] See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement

链接: https://arxiv.org/abs/2510.26819
作者: Jinting Wang,Jun Wang,Hei Victor Cheng,Li Liu
机构: Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); Jingdong Corporation (京东集团); Aarhus University (奥胡斯大学)
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注: 16 pages,15 figures, accepted by TASLP

点击查看摘要

人工智能

[AI-0] MolChord: Structure-Sequence Alignment for Protein-Guided Drug Design

【速读】: This paper addresses two core challenges in structure-based drug design (SBDD): effectively aligning protein structural representations with molecular representations so that generated candidate ligands bind well to the target protein, and steering generated molecules toward desirable pharmacological properties. The key to the solution is the MolChord framework, whose core innovations are: (1) using NatureLM, an autoregressive model unifying text, small molecules, and proteins, as the molecule generator, together with a diffusion-based structure encoder, to align proteins and molecules at the levels of structure, sequence (e.g., FASTA and SMILES), and textual description; and (2) curating a property-aware dataset from preference data and refining the alignment with Direct Preference Optimization (DPO). Results on CrossDocked2020 show state-of-the-art performance on key metrics, indicating practical potential for SBDD.

链接: https://arxiv.org/abs/2510.27671
作者: Wei Zhang,Zekun Guo,Yingce Xia,Peiran Jin,Shufang Xie,Tao Qin,Xiang-Yang Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 21 pages

点击查看摘要

Abstract:Structure-based drug design (SBDD), which maps target proteins to candidate molecular ligands, is a fundamental task in drug discovery. Effectively aligning protein structural representations with molecular representations, and ensuring alignment between generated drugs and their pharmacological properties, remains a critical challenge. To address these challenges, we propose MolChord, which integrates two key techniques: (1) to align protein and molecule structures with their textual descriptions and sequential representations (e.g., FASTA for proteins and SMILES for molecules), we leverage NatureLM, an autoregressive model unifying text, small molecules, and proteins, as the molecule generator, alongside a diffusion-based structure encoder; and (2) to guide molecules toward desired properties, we curate a property-aware dataset by integrating preference data and refine the alignment process using Direct Preference Optimization (DPO). Experimental results on CrossDocked2020 demonstrate that our approach achieves state-of-the-art performance on key evaluation metrics, highlighting its potential as a practical tool for SBDD.
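The DPO refinement step typically uses the standard objective over (preferred, rejected) pairs, here molecule pairs; the sketch below is the generic formulation from the DPO literature, not MolChord's training code, and the log-probability values are made up.

```python
# Generic DPO loss over chosen (w) / rejected (l) sequence log-probs.
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # logp_*: summed token log-probs under the policy being trained;
    # ref_logp_*: the same quantities under a frozen reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

print(dpo_loss(torch.tensor([-10.0]), torch.tensor([-14.0]),
               torch.tensor([-11.0]), torch.tensor([-13.0])))
```
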
zh

[AI-1] Challenges in Credit Assignment for Multi-Agent Reinforcement Learning in Open Agent Systems

【速读】: This paper addresses the credit assignment problem (CAP) in multi-agent reinforcement learning (MARL) under open systems. In open systems, the agent population, the tasks, and the agent types all change dynamically (agent openness, task openness, and type openness); traditional credit assignment methods, which assume stationary environments and fixed agent compositions, break down and misattribute credit, manifesting as unstable loss functions and marked performance degradation. The key contribution is to reveal how openness violates the stationarity and fixed-team-composition assumptions underlying existing CAP methods, and to verify empirically that openness directly causes credit misattribution, providing theoretical grounds and experimental support for designing new credit assignment mechanisms suited to open environments.

链接: https://arxiv.org/abs/2510.27659
作者: Alireza Saleh Abadi,Leen-Kiat Soh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:

点击查看摘要

Abstract:In the rapidly evolving field of multi-agent reinforcement learning (MARL), understanding the dynamics of open systems is crucial. Openness in MARL refers to the dynamic nature of agent populations, tasks, and agent types within a system. Specifically, there are three types of openness as reported in (Eck et al. 2023) [2]: agent openness, where agents can enter or leave the system at any time; task openness, where new tasks emerge, and existing ones evolve or disappear; and type openness, where the capabilities and behaviors of agents change over time. This report provides a conceptual and empirical review, focusing on the interplay between openness and the credit assignment problem (CAP). CAP involves determining the contribution of individual agents to the overall system performance, a task that becomes increasingly complex in open environments. Traditional credit assignment (CA) methods often assume static agent populations, fixed and pre-defined tasks, and stationary types, making them inadequate for open systems. We first conduct a conceptual analysis, introducing new sub-categories of openness to detail how events like agent turnover or task cancellation break the assumptions of environmental stationarity and fixed team composition that underpin existing CAP methods. We then present an empirical study using representative temporal and structural algorithms in an open environment. The results demonstrate that openness directly causes credit misattribution, evidenced by unstable loss functions and significant performance degradation.
zh

[AI-2] Community Detection on Model Explanation Graphs for Explainable AI

【速读】: This paper addresses the fact that existing feature-attribution methods (e.g., SHAP, LIME) explain only individual feature contributions and miss higher-order structure, i.e., sets of features acting in concert to influence predictions. The key to the solution is the Modules of Influence (MoI) framework: it first builds a model explanation graph from per-instance attributions, then applies community detection to identify feature modules that jointly affect predictions, and quantifies how these modules relate to bias, redundancy, and causality patterns. The method uncovers correlated feature groups, enables module-level ablations for model debugging, and localizes bias exposure to specific modules, improving the understanding and use of complex feature interactions in XAI.

链接: https://arxiv.org/abs/2510.27655
作者: Ehsan Moradi
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Feature-attribution methods (e.g., SHAP, LIME) explain individual predictions but often miss higher-order structure: sets of features that act in concert. We propose Modules of Influence (MoI), a framework that (i) constructs a model explanation graph from per-instance attributions, (ii) applies community detection to find feature modules that jointly affect predictions, and (iii) quantifies how these modules relate to bias, redundancy, and causality patterns. Across synthetic and real datasets, MoI uncovers correlated feature groups, improves model debugging via module-level ablations, and localizes bias exposure to specific modules. We release stability and synergy metrics, a reference implementation, and evaluation protocols to benchmark module discovery in XAI.
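A minimal version of this pipeline can be pictured as follows: build a feature graph whose edges encode co-attribution similarity, then run off-the-shelf community detection. The correlation-based edge weights, the threshold, and the choice of networkx's greedy modularity algorithm are stand-in assumptions, not the paper's exact design.

```python
# Toy "modules of influence": graph over features from per-instance
# attributions (e.g., SHAP values), partitioned into communities.
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def modules_of_influence(attributions, threshold=0.3):
    # attributions: (n_instances, n_features) matrix of attribution scores
    corr = np.corrcoef(attributions.T)
    g = nx.Graph()
    n = corr.shape[0]
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if abs(corr[i, j]) >= threshold:
                g.add_edge(i, j, weight=abs(corr[i, j]))
    return list(greedy_modularity_communities(g, weight="weight"))

attr = np.random.rand(200, 6)
attr[:, 1] = attr[:, 0] + 0.05 * np.random.rand(200)  # correlated pair
print(modules_of_influence(attr))  # features 0 and 1 land in one module
```
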
zh

[AI-3] Information-Theoretic Greedy Layer-wise Training for Traffic Sign Recognition

【速读】: This paper addresses the biological implausibility, high memory consumption, and vanishing or exploding gradients that come with training deep neural networks (DNNs) using a global cross-entropy loss and backpropagation, which requires storing intermediate layer outputs and computing layer-by-layer gradients. The key to the solution is a novel information-theoretic layer-wise training method: using the Deterministic Information Bottleneck (DIB) and the matrix-based Rényi α-order entropy functional, each layer is trained jointly with an auxiliary classifier connected directly to the output layer, so that it learns minimal sufficient task-relevant representations. The method needs no explicit backpropagation through the whole network, converges progressively from bottom to top, and outperforms existing layer-wise training baselines while approaching SGD performance on CIFAR-10/100 and a traffic sign recognition task.

链接: https://arxiv.org/abs/2510.27651
作者: Shuyan Lyu,Zhanzimo Wu,Junliang Du
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Modern deep neural networks (DNNs) are typically trained with a global cross-entropy loss in a supervised end-to-end manner: neurons need to store their outgoing weights; training alternates between a forward pass (computation) and a top-down backward pass (learning) which is biologically implausible. Alternatively, greedy layer-wise training eliminates the need for cross-entropy loss and backpropagation. By avoiding the computation of intermediate gradients and the storage of intermediate outputs, it reduces memory usage and helps mitigate issues such as vanishing or exploding gradients. However, most existing layer-wise training approaches have been evaluated only on relatively small datasets with simple deep architectures. In this paper, we first systematically analyze the training dynamics of popular convolutional neural networks (CNNs) trained by stochastic gradient descent (SGD) through an information-theoretic lens. Our findings reveal that networks converge layer-by-layer from bottom to top and that the flow of information adheres to a Markov information bottleneck principle. Building on these observations, we propose a novel layer-wise training approach based on the recently developed deterministic information bottleneck (DIB) and the matrix-based Rényi’s \alpha -order entropy functional. Specifically, each layer is trained jointly with an auxiliary classifier that connects directly to the output layer, enabling the learning of minimal sufficient task-relevant representations. We empirically validate the effectiveness of our training procedure on CIFAR-10 and CIFAR-100 using modern deep CNNs and further demonstrate its applicability to a practical task involving traffic sign recognition. Our approach not only outperforms existing layer-wise training baselines but also achieves performance comparable to SGD.
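The matrix-based Rényi α-order entropy used here is usually computed from the eigenvalue spectrum of a trace-normalized Gram matrix, S_α(A) = (1/(1-α)) log₂ Σᵢ λᵢ(A)^α; the sketch below follows that textbook definition, with an illustrative RBF kernel and α, not the paper's exact configuration.

```python
# Matrix-based Renyi alpha-order entropy of a feature batch.
import torch

def renyi_entropy(feats, alpha=2.0, sigma=1.0):
    d2 = torch.cdist(feats, feats) ** 2
    k = torch.exp(-d2 / (2 * sigma ** 2))          # Gram (kernel) matrix
    a = k / k.trace()                              # normalize so tr(A) = 1
    eig = torch.linalg.eigvalsh(a).clamp(min=1e-12)
    return (1.0 / (1.0 - alpha)) * torch.log2((eig ** alpha).sum())

print(renyi_entropy(torch.randn(64, 16)))
```
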
zh

[AI-4] Interaction as Intelligence Part II: Asynchronous Human-Agent Rollout for Long-Horizon Task Training

【速读】: This paper addresses the difficulty of training large language model (LLM) agents on long-horizon, domain-specialized tasks: behavior cloning with dense human annotation is prohibitively expensive, while outcome-driven sampling tends to collapse because valid positive trajectories are rare. The key to the solution is the Apollo sampling framework, which combines asynchronous human guidance with action-level data filtering: annotators intervene only when the agent drifts off a promising trajectory, contributing prior knowledge or strategic advice, which sharply reduces annotation burden while sustaining long interactions (over 30 hours); meanwhile, supervision control filters out sub-optimal actions to prevent error propagation and preserve data quality. This design lets Apollo reliably collect high-quality trajectories in long-horizon environments; experiments show that training GLM-4.5 on InnovatorBench with Apollo yields over a 50% improvement over the untrained baseline and a 28% improvement over a variant trained without human interaction.

链接: https://arxiv.org/abs/2510.27630
作者: Dayuan Fu,Yunze Wu,Xiaojie Cai,Lyumanshan Ye,Shijie Xia,Zhen Huang,Weiye Si,Tianze Xu,Jie Sun,Keyu Li,Mohan Jiang,Junfei Wang,Qishuo Hua,Pengrui Lu,Yang Xiao,Pengfei Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) agents have recently shown strong potential in domains such as automated coding, deep research, and graphical user interface manipulation. However, training them to succeed on long-horizon, domain-specialized tasks remains challenging. Current methods primarily fall into two categories. The first relies on dense human annotations through behavior cloning, which is prohibitively expensive for long-horizon tasks that can take days or months. The second depends on outcome-driven sampling, which often collapses due to the rarity of valid positive trajectories on domain-specialized tasks. We introduce Apollo, a sampling framework that integrates asynchronous human guidance with action-level data filtering. Instead of requiring annotators to shadow every step, Apollo allows them to intervene only when the agent drifts from a promising trajectory, by providing prior knowledge, strategic advice, etc. This lightweight design makes it possible to sustain interactions for over 30 hours and produces valuable trajectories at a lower cost. Apollo then applies supervision control to filter out sub-optimal actions and prevent error propagation. Together, these components enable reliable and effective data collection in long-horizon environments. To demonstrate the effectiveness of Apollo, we evaluate it using InnovatorBench. Our experiments show that when applied to train the GLM-4.5 model on InnovatorBench, Apollo achieves more than a 50% improvement over the untrained baseline and a 28% improvement over a variant trained without human interaction. These results highlight the critical role of human-in-the-loop sampling and the robustness of Apollo’s design in handling long-horizon, domain-specialized tasks.
zh

[AI-5] Best Practices for Biorisk Evaluations on Open-Weight Bio-Foundation Models

【速读】: This paper addresses the dual-use risk of open-weight bio-foundation models: while they promise to accelerate scientific research and drug development, malicious actors could use them to develop deadlier bioweapons. Existing mitigations rely mainly on filtering biohazardous data during pre-training, but their effectiveness is unclear, especially against determined adversaries. The key to the solution is the \eval framework, which systematically assesses a model's understanding of viruses along three lenses (sequence modeling, mutational effect prediction, and virulence prediction) to test the robustness of such safety measures. The results show that current data-filtering practices may be of limited effect: excluded knowledge can be rapidly recovered via fine-tuning and generalizes broadly in sequence modeling, and latent dual-use signals already reside in pretrained representations and can be elicited with simple linear probing, underscoring that data filtering alone is insufficient and that more robust safety and security strategies are needed.

链接: https://arxiv.org/abs/2510.27629
作者: Boyi Wei,Zora Che,Nathaniel Li,Udari Madhushani Sehwag,Jasper Götting,Samira Nedungadi,Julian Michael,Summer Yue,Dan Hendrycks,Peter Henderson,Zifan Wang,Seth Donoughe,Mantas Mazeika
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 17 Pages, 5 figures

点击查看摘要

Abstract:Open-weight bio-foundation models present a dual-use dilemma. While holding great promise for accelerating scientific research and drug development, they could also enable bad actors to develop more deadly bioweapons. To mitigate the risk posed by these models, current approaches focus on filtering biohazardous data during pre-training. However, the effectiveness of such an approach remains unclear, particularly against determined actors who might fine-tune these models for malicious use. To address this gap, we propose \eval, a framework to evaluate the robustness of procedures that are intended to reduce the dual-use capabilities of bio-foundation models. \eval assesses models’ virus understanding through three lenses, including sequence modeling, mutational effects prediction, and virulence prediction. Our results show that current filtering practices may not be particularly effective: Excluded knowledge can be rapidly recovered in some cases via fine-tuning, and exhibits broader generalizability in sequence modeling. Furthermore, dual-use signals may already reside in the pretrained representations, and can be elicited via simple linear probing. These findings highlight the challenges of data filtering as a standalone procedure, underscoring the need for further research into robust safety and security strategies for open-weight bio-foundation models.
zh

[AI-6] Validity Is What You Need

【速读】: This paper addresses the vague definition of "Agentic AI" and the tendency to ground deployments in large language models (LLMs) while neglecting the need for real-world validation. Its central claim is that Agentic AI is essentially a software delivery mechanism, comparable to software as a service (SaaS), designed to put applications to work autonomously in complex enterprise settings; the key is validity rather than reliance on LLMs as foundation models. That is, the success of Agentic AI depends on validation by end users and principal stakeholders, and the tools such validation requires differ sharply from those used to evaluate foundation models; with good validation in place, the core logic can often be handled by simpler, faster, and more interpretable models in place of LLMs, improving practicality and trustworthiness.

链接: https://arxiv.org/abs/2510.27628
作者: Sebastian Benthall,Andrew Clark
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:While AI agents have long been discussed and studied in computer science, today’s Agentic AI systems are something new. We consider other definitions of Agentic AI and propose a new realist definition. Agentic AI is a software delivery mechanism, comparable to software as a service (SaaS), which puts an application to work autonomously in a complex enterprise setting. Recent advances in large language models (LLMs) as foundation models have driven excitement in Agentic AI. We note, however, that Agentic AI systems are primarily applications, not foundations, and so their success depends on validation by end users and principal stakeholders. The tools and techniques needed by the principal users to validate their applications are quite different from the tools and techniques used to evaluate foundation models. Ironically, with good validation measures in place, in many cases the foundation models can be replaced with much simpler, faster, and more interpretable models that handle core logic. When it comes to Agentic AI, validity is what you need. LLMs are one option that might achieve it.
zh

[AI-7] VeriMoA: A Mixture-of-Agents Framework for Spec-to-HDL Generation

【速读】: This paper addresses two core challenges that generative AI faces in register transfer level (RTL) design automation: large language models (LLMs) struggle to produce high-quality hardware description language (HDL) code because of limited parametric knowledge and domain-specific constraints, and existing multi-agent architectures are susceptible to noise propagation and explore a restricted reasoning space. The key to the solution is VeriMoA, a training-free mixture-of-agents (MoA) framework with two synergistic innovations: (1) a quality-guided caching mechanism that retains all intermediate HDL outputs and performs quality-based ranking and selection across reasoning layers, accumulating knowledge as reasoning deepens; and (2) a multi-path generation strategy that uses C++ and Python as intermediate representations, decomposing specification-to-HDL translation into a two-stage process that exploits LLM fluency in high-resource languages while diversifying the solution space. On the VerilogEval 2.0 and RTLLM 2.0 benchmarks, VeriMoA improves Pass@1 by 15-30% across diverse LLM backbones, letting smaller models match larger models and fine-tuned alternatives without costly training.

链接: https://arxiv.org/abs/2510.27617
作者: Heng Ping,Arijit Bhattacharjee,Peiyu Zhang,Shixuan Li,Wei Yang,Anzhe Cheng,Xiaole Zhang,Jesse Thomason,Ali Jannesari,Nesreen Ahmed,Paul Bogdan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automation of Register Transfer Level (RTL) design can help developers meet increasing computational demands. Large Language Models (LLMs) show promise for Hardware Description Language (HDL) generation, but face challenges due to limited parametric knowledge and domain-specific constraints. While prompt engineering and fine-tuning have limitations in knowledge coverage and training costs, multi-agent architectures offer a training-free paradigm to enhance reasoning through collaborative generation. However, current multi-agent approaches suffer from two critical deficiencies: susceptibility to noise propagation and constrained reasoning space exploration. We propose VeriMoA, a training-free mixture-of-agents (MoA) framework with two synergistic innovations. First, a quality-guided caching mechanism to maintain all intermediate HDL outputs and enables quality-based ranking and selection across the entire generation process, encouraging knowledge accumulation over layers of reasoning. Second, a multi-path generation strategy that leverages C++ and Python as intermediate representations, decomposing specification-to-HDL translation into two-stage processes that exploit LLM fluency in high-resource languages while promoting solution diversity. Comprehensive experiments on VerilogEval 2.0 and RTLLM 2.0 benchmarks demonstrate that VeriMoA achieves 15–30% improvements in Pass@1 across diverse LLM backbones, especially enabling smaller models to match larger models and fine-tuned alternatives without requiring costly training.
zh

[AI-8] InnovatorBench: Evaluating Agents Ability to Conduct Innovative LLM Research

【速读】: This paper addresses the limits of current evaluations of AI agents for scientific research: existing benchmarks probe single skills in simplified settings and fail to reflect an agent's end-to-end research capability. The key to the solution is InnovatorBench, a benchmark of 20 tasks spanning data construction, filtering, augmentation, loss design, reward design, and scaffold construction that requires runnable code artifacts and multi-dimensional assessment of correctness, performance, output quality, and uncertainty. It is paired with the ResearchGym environment, which supports rich action spaces, distributed and long-horizon execution, asynchronous monitoring, and snapshot saving, plus a lightweight ReAct agent coupling explicit reasoning with executable planning, evaluated with frontier LLMs such as Claude-4, GPT-5, GLM-4.5, and Kimi-K2, advancing systematic evaluation of generative AI for research automation.

链接: https://arxiv.org/abs/2510.27598
作者: Yunze Wu,Dayuan Fu,Weiye Si,Zhen Huang,Mohan Jiang,Keyu Li,Shijie Xia,Jie Sun,Tianze Xu,Xiangkun Hu,Pengrui Lu,Xiaojie Cai,Lyumanshan Ye,Wenhong Zhu,Yang Xiao,Pengfei Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:AI agents could accelerate scientific discovery by automating hypothesis formation, experiment design, coding, execution, and analysis, yet existing benchmarks probe narrow skills in simplified settings. To address this gap, we introduce InnovatorBench, a benchmark-platform pair for realistic, end-to-end assessment of agents performing Large Language Model (LLM) research. It comprises 20 tasks spanning Data Construction, Filtering, Augmentation, Loss Design, Reward Design, and Scaffold Construction, which require runnable artifacts and assessment of correctness, performance, output quality, and uncertainty. To support agent operation, we develop ResearchGym, a research environment offering rich action spaces, distributed and long-horizon execution, asynchronous monitoring, and snapshot saving. We also implement a lightweight ReAct agent that couples explicit reasoning with executable planning using frontier models such as Claude-4, GPT-5, GLM-4.5, and Kimi-K2. Our experiments demonstrate that while frontier models show promise in code-driven research tasks, they struggle with fragile algorithm-related tasks and long-horizon decision making, exhibiting failure modes such as impatience, poor resource management, and overreliance on template-based reasoning. Furthermore, agents require over 11 hours to achieve their best performance on InnovatorBench, underscoring the benchmark’s difficulty and showing the potential of InnovatorBench to serve as the next generation of code-based research benchmarks.
zh

[AI-9] CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments

【速读】: This paper addresses the problem that current evaluations of large language models (LLMs) for code generation focus narrowly on functional correctness, overlooking the multi-dimensional instruction-following demanded by real development scenarios. The key to the solution is an extensible multi-language benchmarking framework that evaluates both adherence to constraints predefined in the initial problem and the ability to refine and revise code from follow-up instructions, providing a fuller measure of how well LLMs understand and execute instructions in practical programming tasks.

链接: https://arxiv.org/abs/2510.27565
作者: Forough Mehralian,Ryan Shar,James R. Rae,Alireza Hashemi
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of real-world coding tasks and developer expectations. To this end, we introduce a multi-language benchmark that evaluates LLM instruction-following capabilities and is extensible to operate on any set of standalone coding problems. Our benchmark evaluates instruction following in two key settings: adherence to pre-defined constraints specified with the initial problem, and the ability to perform refinements based on follow-up instructions. For this paper’s analysis, we empirically evaluated our benchmarking pipeline with programming tasks from LiveBench, that are also automatically translated from Python into Java and JavaScript. Our automated benchmark reveals that models exhibit differing levels of performance across multiple dimensions of instruction-following. Our benchmarking pipeline provides a more comprehensive evaluation of code generation models, highlighting their strengths and limitations across languages and generation goals.
zh

[AI-10] oward Accurate Long-Horizon Robotic Manipulation: Language-to-Action with Foundation Models via Scene Graphs

【速读】: This paper addresses the reliance of robotic manipulation on domain-specific training: conventional approaches usually need large amounts of task- or environment-specific fine-tuning data, limiting generality and deployment efficiency. The key to the solution is a framework built on pretrained foundation models that handles complex manipulation tasks without domain-specific training: it combines off-the-shelf multimodal perception models for extracting environment information with a general-purpose reasoning model capable of robust task sequencing, and dynamically maintains scene graphs within the framework for spatial awareness and consistent reasoning about the environment, improving the robot's understanding and execution of complex interactive scenes without additional training.

链接: https://arxiv.org/abs/2510.27558
作者: Sushil Samuel Dinesh,Shinkyu Park
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:This paper presents a framework that leverages pre-trained foundation models for robotic manipulation without domain-specific training. The framework integrates off-the-shelf models, combining multimodal perception from foundation models with a general-purpose reasoning model capable of robust task sequencing. Scene graphs, dynamically maintained within the framework, provide spatial awareness and enable consistent reasoning about the environment. The framework is evaluated through a series of tabletop robotic manipulation experiments, and the results highlight its potential for building robotic manipulation systems directly on top of off-the-shelf foundation models.
zh

[AI-11] Sybil-Resistant Service Discovery for Agent Economies

【速读】: This paper addresses how to achieve trustworthy, payment-based discovery and ranking of HTTP services (APIs, data feeds, inference providers) supported by the x402 protocol; existing approaches rely on transaction volume or semantic matching and are easily skewed by infrastructure bias or low-quality content. The key to the solution is the TraceRank algorithm, a reputation-weighted ranking mechanism that treats payment transactions as endorsements: reputation propagates through payment flows weighted by transaction value and temporal decay, surfacing services preferred by high-reputation users rather than merely high-volume ones. This mechanism resists Sybil attacks, making rankings more reliable and fair.

链接: https://arxiv.org/abs/2510.27554
作者: David Shi,Kevin Joo
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: 5 pages

点击查看摘要

Abstract:x402 enables Hypertext Transfer Protocol (HTTP) services like application programming interfaces (APIs), data feeds, and inference providers to accept cryptocurrency payments for access. As agents increasingly consume these services, discovery becomes critical: which swap interface should an agent trust? Which data provider is the most reliable? We introduce TraceRank, a reputation-weighted ranking algorithm where payment transactions serve as endorsements. TraceRank seeds addresses with precomputed reputation metrics and propagates reputation through payment flows weighted by transaction value and temporal recency. Applied to x402’s payment graph, this surfaces services preferred by high-reputation users rather than those with high transaction volume. Our system combines TraceRank with semantic search to respond to natural language queries with high quality results. We argue that reputation propagation resists Sybil attacks by making spam services with many low-reputation payers rank below legitimate services with few high-reputation payers. Ultimately, we aim to construct a search method for x402 enabled services that avoids infrastructure bias and has better performance than purely volume based or semantic methods.
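A toy iteration of this idea is sketched below: reputation flows from seeded payers to services along payment edges weighted by value and exponential time decay. The update rule, damping, half-life, and example addresses are a plausible reading of the description, not the authors' exact algorithm; note how the spam service backed by many low-reputation payers still ranks below the service paid by one high-reputation address.

```python
# Hypothetical TraceRank-style reputation propagation over payments.
import math
from collections import defaultdict

def tracerank(payments, seeds, half_life=30.0, damping=0.85, iters=50):
    # payments: list of (payer, service, value, age_days)
    out_weight, edges = defaultdict(float), defaultdict(float)
    for payer, svc, value, age in payments:
        w = value * math.exp(-math.log(2) * age / half_life)  # time decay
        edges[(payer, svc)] += w
        out_weight[payer] += w
    rank = dict(seeds)  # precomputed reputation metrics seed the walk
    for _ in range(iters):
        nxt = defaultdict(float)
        for (payer, svc), w in edges.items():
            nxt[svc] += damping * rank.get(payer, 0.0) * w / out_weight[payer]
        for node, seed in seeds.items():
            nxt[node] += (1 - damping) * seed
        rank = dict(nxt)
    return rank

pays = [("whale", "dex-api", 100.0, 2),
        ("bot1", "spam-api", 1.0, 1), ("bot2", "spam-api", 1.0, 1)]
print(tracerank(pays, seeds={"whale": 1.0, "bot1": 0.01, "bot2": 0.01}))
```
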

[AI-12] EBT-Policy: Energy Unlocks Emergent Physical Reasoning Capabilities

【Quick Read】: This paper tackles the high computational cost, exposure bias, and unstable inference dynamics of current generative policies such as Diffusion Policy in robotic tasks, problems that often cause divergence under distribution shift. The key to the solution is a new energy-based architecture, EBT-Policy, which uses Energy-Based Models (EBMs) to learn energy landscapes end-to-end and model equilibrium dynamics, markedly reducing exposure bias and improving robustness. Leveraging the scalability of Energy-Based Transformers (EBTs), it trains and infers efficiently in both simulated and real-world tasks, converging in as few as two inference steps (a 50x reduction in computation compared with Diffusion Policy) and exhibiting emergent capabilities such as zero-shot recovery from failed action sequences without explicit retry training.

Link: https://arxiv.org/abs/2510.27545
Authors: Travis Davies, Yiqi Huang, Alexi Gladstone, Yunxin Liu, Xiang Chen, Heng Ji, Huxian Liu, Luhui Hu
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 9 pages, 6 figures, 4 tables

Abstract:Implicit policies parameterized by generative models, such as Diffusion Policy, have become the standard for policy learning and Vision-Language-Action (VLA) models in robotics. However, these approaches often suffer from high computational cost, exposure bias, and unstable inference dynamics, which lead to divergence under distribution shifts. Energy-Based Models (EBMs) address these issues by learning energy landscapes end-to-end and modeling equilibrium dynamics, offering improved robustness and reduced exposure bias. Yet, policies parameterized by EBMs have historically struggled to scale effectively. Recent work on Energy-Based Transformers (EBTs) demonstrates the scalability of EBMs to high-dimensional spaces, but their potential for solving core challenges in physically embodied models remains underexplored. We introduce a new energy-based architecture, EBT-Policy, that solves core issues in robotic and real-world settings. Across simulated and real-world tasks, EBT-Policy consistently outperforms diffusion-based policies, while requiring less training and inference computation. Remarkably, on some tasks it converges within just two inference steps, a 50x reduction compared to Diffusion Policy’s 100. Moreover, EBT-Policy exhibits emergent capabilities not seen in prior models, such as zero-shot recovery from failed action sequences using only behavior cloning and without explicit retry training. By leveraging its scalar energy for uncertainty-aware inference and dynamic compute allocation, EBT-Policy offers a promising path toward robust, generalizable robot behavior under distribution shifts.
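
The few-step inference described above amounts to gradient descent on a learned energy function over actions. A minimal sketch follows, assuming `energy_model` is any network mapping an (observation, action) batch to per-sample scalar energies; the step count and step size are illustrative, not the paper's settings.

```python
import torch

def ebt_infer(energy_model, obs, action_dim, steps=2, lr=0.1):
    """Refine an action by descending the learned energy landscape.
    The final scalar energy can serve as an uncertainty signal."""
    action = torch.zeros(obs.shape[0], action_dim, requires_grad=True)
    for _ in range(steps):
        energy = energy_model(obs, action).sum()        # sum over the batch
        grad, = torch.autograd.grad(energy, action)
        action = (action - lr * grad).detach().requires_grad_(True)
    with torch.no_grad():
        final_energy = energy_model(obs, action)        # per-sample energies
    return action.detach(), final_energy
```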

[AI-13] Mechanics of Learned Reasoning 1: TempoBench, A Benchmark for Interpretable Deconstruction of Reasoning System Performance

【Quick Read】: This paper addresses two limitations in how LLM reasoning is currently evaluated: ad-hoc generated datasets can capture real-world decision chains but may encode inadvertent bias and cannot be formally verified, while formal proof systems such as Lean guarantee verifiability but fit poorly with agentic decision-chain tasks, leaving existing benchmarks weak in either reasoning structure or real-world alignment. The key to the solution is TempoBench, the first formally grounded and verifiable diagnostic benchmark that parametrizes difficulty to systematically analyze LLM reasoning performance. Its core consists of two subtasks: Temporal Trace Evaluation (TTE), which tests an LLM's ability to understand and simulate the execution of a multi-step reasoning system, and Temporal Causal Evaluation (TCE), which tests its ability to distill cause-and-effect relations from complex systems. Experiments show that although leading LLMs score 65.6% on TCE-normal, they reach only 7.5% on the harder TCE-hard setting, revealing a sharp drop in reasoning ability as system complexity grows.

Link: https://arxiv.org/abs/2510.27544
Authors: Nikolaus Holzer, William Fishell, Baishakhi Ray, Mark Santolucito
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)
Comments:

Abstract:Large Language Models (LLMs) are increasingly excelling and outpacing human performance on many tasks. However, to improve LLM reasoning, researchers either rely on ad-hoc generated datasets or formal mathematical proof systems such as the Lean proof assistant. Whilst ad-hoc generated methods can capture the decision chains of real-world reasoning processes, they may encode some inadvertent bias in the space of reasoning they cover; they also cannot be formally verified. On the other hand, systems like Lean can guarantee verifiability, but are not well-suited to capture the nature of agentic decision chain-based tasks. This creates a gap both in performance for functions such as business agents or code assistants, and in the usefulness of LLM reasoning benchmarks, whereby these fall short in reasoning structure or real-world alignment. We introduce TempoBench, the first formally grounded and verifiable diagnostic benchmark that parametrizes difficulty to systematically analyze how LLMs perform reasoning. TempoBench uses two evaluation benchmarks to break down reasoning ability. First, temporal trace evaluation (TTE) tests the ability of an LLM to understand and simulate the execution of a given multi-step reasoning system. Subsequently, temporal causal evaluation (TCE) tests an LLM's ability to perform multi-step causal reasoning and to distill cause-and-effect relations from complex systems. We find that models score 65.6% on TCE-normal, and 7.5% on TCE-hard. This shows that state-of-the-art LLMs clearly understand the TCE task but perform poorly as system complexity increases. Our code is available at our GitHub repository: this https URL.

[AI-14] TetraJet-v2: Accurate NVFP4 Training for Large Language Models with Oscillation Suppression and Outlier Control

【Quick Read】: This paper addresses the severe performance degradation caused by extremely low-bit (e.g., 4-bit) fully-quantized training (FQT), where near-lossless pre-training of large language models (LLMs) remains challenging. The key lies in TetraJet-v2, which first identifies the two main obstacles, weight oscillation and outliers, and then introduces three techniques: (1) an unbiased double-block quantization scheme for NVFP4 linear layers that improves quantization fidelity; (2) the OsciReset algorithm, which suppresses weight oscillation and stabilizes training; and (3) the OutControl algorithm, which retains outlier accuracy and avoids losing critical information. Experiments show that TetraJet-v2 outperforms existing FP4 training methods across model sizes (up to 370M parameters) and data scales (up to 200B tokens), closing the gap to full-precision training by 51.3% on average.

Link: https://arxiv.org/abs/2510.27527
Authors: Yuxiang Chen, Xiaoming Xu, Pengle Zhang, Michael Beyer, Martin Rapp, Jun Zhu, Jianfei Chen
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) training is prohibitively expensive, driving interest in low-precision fully-quantized training (FQT). While novel 4-bit formats like NVFP4 offer substantial efficiency gains, achieving near-lossless training at such low precision remains challenging. We introduce TetraJet-v2, an end-to-end 4-bit FQT method that leverages NVFP4 for activations, weights, and gradients in all linear layers. We identify two critical issues hindering low-precision LLM training: weight oscillation and outliers. To address these, we propose: 1) an unbiased double-block quantization method for NVFP4 linear layers, 2) OsciReset, an algorithm to suppress weight oscillation, and 3) OutControl, an algorithm to retain outlier accuracy. TetraJet-v2 consistently outperforms prior FP4 training methods on pre-training LLMs across varying model sizes up to 370M and data sizes up to 200B tokens, reducing the performance gap to full-precision training by an average of 51.3%.

[AI-15] Leveraging Generic Time Series Foundation Models for EEG Classification

【Quick Read】: This paper addresses the underexplored potential of generic time series foundation models for domain-specific biomedical signals such as electroencephalography (EEG). The central challenge is how to effectively transfer such models to EEG tasks like motor imagery classification and sleep stage prediction. The key lies in validating two pretraining regimes: cross-domain pretraining on heterogeneous real-world time series from multiple domains, and pretraining on purely synthetic data. Experiments show that both regimes clearly outperform established baselines (EEGNet and CBraMod), demonstrating that foundation models pretrained on data of non-neural origin or on synthetic signals transfer effectively to EEG, and highlighting the promise of cross-domain pretrained models for brain signal analysis.

Link: https://arxiv.org/abs/2510.27522
Authors: Théo Gnassounou, Yessin Moakher, Shifeng Xie, Vasilii Feofanov, Ievgen Redko
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Foundation models for time series are emerging as powerful general-purpose backbones, yet their potential for domain-specific biomedical signals such as electroencephalography (EEG) remains rather unexplored. In this work, we investigate the applicability a recently proposed time series classification foundation model, to a different EEG tasks such as motor imagery classification and sleep stage prediction. We test two pretraining regimes: (a) pretraining on heterogeneous real-world time series from multiple domains, and (b) pretraining on purely synthetic data. We find that both variants yield strong performance, consistently outperforming EEGNet, a widely used convolutional baseline, and CBraMod, the most recent EEG-specific foundation model. These results suggest that generalist time series foundation models, even when pretrained on data of non-neural origin or on synthetic signals, can transfer effectively to EEG. Our findings highlight the promise of leveraging cross-domain pretrained models for brain signal analysis, suggesting that EEG may benefit from advances in the broader time series literature.

[AI-16] DP-FedPGN: Finding Global Flat Minima for Differentially Private Federated Learning via Penalizing Gradient Norm

【Quick Read】: This paper addresses the degraded generalization of Client-level Differentially Private Federated Learning (CL-DPFL), where existing methods tend to produce sharper loss landscapes once differential privacy is applied. The key is a new algorithm, DP-FedPGN, which adds a global gradient norm penalty to the local loss to guide the model toward a better-generalizing global flat minimum, mitigating the mismatch between local flatness and global performance. The mechanism not only flattens the model globally but also reduces the norm of local updates, further shrinking gradient clipping error. Combined with Rényi differential privacy for strict privacy guarantees and a sensitivity analysis of local updates, it achieves significant gains on ResNet and Transformer models across six vision and natural language processing tasks.

Link: https://arxiv.org/abs/2510.27504
Authors: Junkang Liu, Yuxuan Tian, Fanhua Shang, Yuanyuan Liu, Hongying Liu, Junchao Zhou, Daorui Ding
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 21 pages, 8 figures

Abstract:To prevent inference attacks in Federated Learning (FL) and reduce the leakage of sensitive information, Client-level Differentially Private Federated Learning (CL-DPFL) is widely used. However, current CL-DPFL methods usually result in sharper loss landscapes, which leads to a decrease in model generalization after differential privacy protection. By using Sharpness Aware Minimization (SAM), the current popular federated learning methods are to find a local flat minimum value to alleviate this problem. However, the local flatness may not reflect the global flatness in CL-DPFL. Therefore, to address this issue and seek global flat minima of models, we propose a new CL-DPFL algorithm, DP-FedPGN, in which we introduce a global gradient norm penalty to the local loss to find the global flat minimum. Moreover, by using our global gradient norm penalty, we not only find a flatter global minimum but also reduce the locally updated norm, which means that we further reduce the error of gradient clipping. From a theoretical perspective, we analyze how DP-FedPGN mitigates the performance degradation caused by DP. Meanwhile, the proposed DP-FedPGN algorithm eliminates the impact of data heterogeneity and achieves fast convergence. We also use Rényi DP to provide strict privacy guarantees and provide sensitivity analysis for local updates. Finally, we conduct effectiveness tests on both ResNet and Transformer models, and achieve significant improvements in six visual and natural language processing tasks compared to existing state-of-the-art algorithms. The code is available at this https URL
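
The core change is a penalty on the gradient norm added to each client's loss. Below is a minimal sketch of such a penalized objective; the penalty weight `rho` and the local approximation of the global gradient norm are our assumptions, not the paper's exact formulation.

```python
import torch

def pgn_local_loss(model, loss_fn, batch, rho=0.05):
    """Sketch of a gradient-norm-penalized local objective:
        L(w) = task_loss(w) + rho * ||grad task_loss(w)||_2.
    The penalty steers local updates toward flatter minima and also
    shrinks the update norm, which reduces DP clipping error."""
    inputs, targets = batch
    task_loss = loss_fn(model(inputs), targets)
    params = [p for p in model.parameters() if p.requires_grad]
    # create_graph=True so the norm term itself is differentiable.
    grads = torch.autograd.grad(task_loss, params, create_graph=True)
    grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    return task_loss + rho * grad_norm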

[AI-17] InertialAR: Autoregressive 3D Molecule Generation with Inertial Frames

【Quick Read】: This paper addresses two core challenges that Transformer-based autoregressive models face in 3D molecule generation: tokenizing a molecule into a canonical 1D token sequence invariant to both SE(3) transformations (rigid motions) and atom index permutations, and designing an architecture that can model hybrid tokens coupling discrete atom types with continuous 3D coordinates. The key innovations of InertialAR are: (1) a canonical tokenization that aligns molecules to their inertial frames and reorders atoms to guarantee SE(3) and permutation invariance; (2) geometric rotary positional encoding (GeoRoPE), which makes the attention mechanism geometry-aware; and (3) a hierarchical autoregressive paradigm that first predicts the atom type and then its 3D coordinates via a Diffusion loss. The approach achieves state-of-the-art performance on multiple benchmarks and markedly improves controllable molecule generation.

Link: https://arxiv.org/abs/2510.27497
Authors: Haorui Li, Weitao Du, Yuqiang Li, Hongyu Guo, Shengchao Liu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Transformer-based autoregressive models have emerged as a unifying paradigm across modalities such as text and images, but their extension to 3D molecule generation remains underexplored. The gap stems from two fundamental challenges: (1) tokenizing molecules into a canonical 1D sequence of tokens that is invariant to both SE(3) transformations and atom index permutations, and (2) designing an architecture capable of modeling hybrid atom-based tokens that couple discrete atom types with continuous 3D coordinates. To address these challenges, we introduce InertialAR. InertialAR devises a canonical tokenization that aligns molecules to their inertial frames and reorders atoms to ensure SE(3) and permutation invariance. Moreover, InertialAR equips the attention mechanism with geometric awareness via geometric rotary positional encoding (GeoRoPE). In addition, it utilizes a hierarchical autoregressive paradigm to predict the next atom-based token, predicting the atom type first and then its 3D coordinates via Diffusion loss. Experimentally, InertialAR achieves state-of-the-art performance on 7 of the 10 evaluation metrics for unconditional molecule generation across QM9, GEOM-Drugs, and B3LYP. Moreover, it significantly outperforms strong baselines in controllable generation for targeted chemical functionality, attaining state-of-the-art results across all 5 metrics.
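
The canonical tokenization can be pictured as a classical inertial-frame alignment. A minimal NumPy sketch under common conventions is below; eigenvector sign fixing and tie-breaking, which real pipelines must handle, are omitted, and the ordering rule is our illustration.

```python
import numpy as np

def to_inertial_frame(coords, masses):
    """Translate a molecule to its center of mass, rotate it into the
    principal axes of its inertia tensor, then order atoms deterministically
    so the resulting token sequence is SE(3)- and permutation-invariant."""
    coords = coords - np.average(coords, axis=0, weights=masses)
    # Inertia tensor: I = sum_i m_i * (|r_i|^2 * Id - r_i r_i^T)
    inertia = np.zeros((3, 3))
    for m, r in zip(masses, coords):
        inertia += m * (np.dot(r, r) * np.eye(3) - np.outer(r, r))
    _, axes = np.linalg.eigh(inertia)      # principal axes (ascending moments)
    canon = coords @ axes                  # coordinates in the inertial frame
    order = np.lexsort(canon.T)            # deterministic atom ordering
    return canon[order], order
```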

[AI-18] FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models

【Quick Read】: This paper addresses three challenges of directly applying the AdamW optimizer in federated learning (FL): (1) high variance in the second-moment estimate caused by data heterogeneity; (2) client drift induced by AdamW's local overfitting; and (3) slow convergence from reinitializing the moment estimates ($\boldsymbol{v}, \boldsymbol{m}$) after every communication round. The key is FedAdamW, the first AdamW variant designed for federated learning, whose core innovations are: (1) a local correction mechanism with decoupled weight decay that mitigates local overfitting and aligns local updates with the global update; and (2) efficient aggregation of the mean of the second-moment estimates to reduce their variance and reinitialize them, improving convergence efficiency. Theoretically, FedAdamW achieves a linear speedup convergence rate of $\mathcal{O}\left(\frac{\sqrt{L \Delta \sigma_l^2}}{S K R \epsilon^2} + \frac{L \Delta}{R}\right)$ without heterogeneity assumptions, where $S$, $K$, and $R$ denote the number of participating clients per round, the number of local iterations, and the total number of communication rounds, respectively.

Link: https://arxiv.org/abs/2510.27486
Authors: Junkang Liu, Fanhua Shang, Kewen Zhu, Hongying Liu, Yuanyuan Liu, Jin Liu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:AdamW has become one of the most effective optimizers for training large-scale models. We have also observed its effectiveness in the context of federated learning (FL). However, directly applying AdamW in federated learning settings poses significant challenges: (1) due to data heterogeneity, AdamW often yields high variance in the second-moment estimate $\boldsymbol{v}$; (2) the local overfitting of AdamW may cause client drift; and (3) reinitializing moment estimates ($\boldsymbol{v}$, $\boldsymbol{m}$) at each round slows down convergence. To address these challenges, we propose the first Federated AdamW algorithm, called FedAdamW, for training and fine-tuning various large models. FedAdamW aligns local updates with the global update using both a local correction mechanism and decoupled weight decay to mitigate local overfitting. FedAdamW efficiently aggregates the mean of the second-moment estimates to reduce their variance and reinitialize them. Theoretically, we prove that FedAdamW achieves a linear speedup convergence rate of $\mathcal{O}\big(\sqrt{L \Delta \sigma_l^2}/(S K R \epsilon^2) + (L \Delta)/R\big)$ without the heterogeneity assumption, where $S$ is the number of participating clients per round, $K$ is the number of local iterations, and $R$ is the total number of communication rounds. We also employ PAC-Bayesian generalization analysis to explain the effectiveness of decoupled weight decay in local training. Empirically, we validate the effectiveness of FedAdamW on language and vision Transformer models. Compared to several baselines, FedAdamW significantly reduces communication rounds and improves test accuracy. The code is available in this https URL.

[AI-19] GeoFM: Enhancing Geometric Reasoning of MLLM s via Synthetic Data Generation through Formal Language

【Quick Read】: This paper addresses the limited geometric reasoning of multi-modal large language models (MLLMs) caused by the scarcity of high-quality geometric data. Existing synthesis methods either rephrase or expand existing problems, or generate geometric images and problems from predefined rules and templates, which often yields low diversity, noise, and synthesized images that deviate markedly from authentic geometric diagrams. The key of GeoFM is to use formal languages to explore combinations of conditions within a metric space and to guarantee correctness with a symbolic engine, thereby synthesizing high-fidelity, diverse, and well-structured geometry problems that differ from the originals and significantly improve model performance on geometric reasoning tasks.

Link: https://arxiv.org/abs/2510.27448
Authors: Yuhao Zhang, Dingxin Hu, Tinghao Yu, Hao Liu, Yiting Liu
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Multi-modal Large Language Models (MLLMs) have gained significant attention in both academia and industry for their capabilities in handling multi-modal tasks. However, these models face challenges in mathematical geometric reasoning due to the scarcity of high-quality geometric data. To address this issue, synthetic geometric data has become an essential strategy. Current methods for generating synthetic geometric data involve rephrasing or expanding existing problems and utilizing predefined rules and templates to create geometric images and problems. However, these approaches often produce data that lacks diversity or is prone to noise. Additionally, the geometric images synthesized by existing methods tend to exhibit limited variation and deviate significantly from authentic geometric diagrams. To overcome these limitations, we propose GeoFM, a novel method for synthesizing geometric data. GeoFM uses formal languages to explore combinations of conditions within metric space, generating high-fidelity geometric problems that differ from the originals while ensuring correctness through a symbolic engine. Experimental results show that our synthetic data significantly outperforms existing methods. The model trained with our data surpasses the proprietary GPT-4o model by 18.7% on geometry problem-solving tasks in MathVista and by 16.5% on GeoQA. Additionally, it exceeds the performance of a leading open-source model by 5.7% on MathVista and by 2.7% on GeoQA.

[AI-20] Learning Soft Robotic Dynamics with Active Exploration

【Quick Read】: This paper tackles the difficulty of modeling soft robots in unstructured environments, whose high-dimensional, nonlinear, and compliant dynamics make existing data-driven methods hard to generalize. The key is SoftAE, an uncertainty-aware active exploration framework that estimates epistemic uncertainty with probabilistic ensemble models and steers exploration toward underrepresented regions of the state-action space, efficiently learning task-agnostic, general-purpose dynamics models without task-specific supervision. The approach yields more accurate dynamics models, superior zero-shot control on unseen tasks, and robustness under sensing noise, actuation delays, and nonlinear material effects, validated on several simulated and real soft robotic platforms.

Link: https://arxiv.org/abs/2510.27428
Authors: Hehui Zheng, Bhavya Sukhija, Chenhao Li, Klemens Iten, Andreas Krause, Robert K. Katzschmann
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Soft robots offer unmatched adaptability and safety in unstructured environments, yet their compliant, high-dimensional, and nonlinear dynamics make modeling for control notoriously difficult. Existing data-driven approaches often fail to generalize, constrained by narrowly focused task demonstrations or inefficient random exploration. We introduce SoftAE, an uncertainty-aware active exploration framework that autonomously learns task-agnostic and generalizable dynamics models of soft robotic systems. SoftAE employs probabilistic ensemble models to estimate epistemic uncertainty and actively guides exploration toward underrepresented regions of the state-action space, achieving efficient coverage of diverse behaviors without task-specific supervision. We evaluate SoftAE on three simulated soft robotic platforms – a continuum arm, an articulated fish in fluid, and a musculoskeletal leg with hybrid actuation – and on a pneumatically actuated continuum soft arm in the real world. Compared with random exploration and task-specific model-based reinforcement learning, SoftAE produces more accurate dynamics models, enables superior zero-shot control on unseen tasks, and maintains robustness under sensing noise, actuation delays, and nonlinear material effects. These results demonstrate that uncertainty-driven active exploration can yield scalable, reusable dynamics models across diverse soft robotic morphologies, representing a step toward more autonomous, adaptable, and data-efficient control in compliant robots.

[AI-21] Dialogue as Discovery: Navigating Human Intent Through Principled Inquiry

【Quick Read】: This paper addresses the "intention expression gap" in human-AI collaboration: humans struggle to convey complex, high-dimensional intent to AI efficiently, trapping users in inefficient trial-and-error loops, a problem exacerbated by varying levels of user expertise. The key is Nous, an actively inquiring agent trained with a framework grounded in first principles of information theory: the information gain of a dialogue turn is defined as an intrinsic reward signal (fundamentally equivalent to the reduction of Shannon entropy over a structured task space), so the agent can resolve its uncertainty about user intent without costly human preference annotations or external reward models. An automated simulation pipeline builds a large-scale preference dataset for the challenging task of scientific diagram generation, where experiments confirm leading efficiency and output quality, robustness across user expertise levels, and evidence of generalization beyond diagram generation.

Link: https://arxiv.org/abs/2510.27410
Authors: Jianwen Sun, Yukang Feng, Yifan Chang, Chuanhao Li, Zizhen Li, Jiaxin Ai, Fanrui Zhang, Yu Dai, Kaipeng Zhang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:A fundamental bottleneck in human-AI collaboration is the "intention expression gap," the difficulty for humans to effectively convey complex, high-dimensional thoughts to AI. This challenge often traps users in inefficient trial-and-error loops and is exacerbated by the diverse expertise levels of users. We reframe this problem from passive instruction following to a Socratic collaboration paradigm, proposing an agent that actively probes for information to resolve its uncertainty about user intent. We name the proposed agent Nous, trained to acquire proficiency in this inquiry policy. The core mechanism of Nous is a training framework grounded in the first principles of information theory. Within this framework, we define the information gain from dialogue as an intrinsic reward signal, which is fundamentally equivalent to the reduction of Shannon entropy over a structured task space. This reward design enables us to avoid reliance on costly human preference annotations or external reward models. To validate our framework, we develop an automated simulation pipeline to generate a large-scale, preference-based dataset for the challenging task of scientific diagram generation. Comprehensive experiments, including ablations, subjective and objective evaluations, and tests across user expertise levels, demonstrate the effectiveness of our proposed framework. Nous achieves leading efficiency and output quality, while remaining robust to varying user expertise. Moreover, its design is domain-agnostic, and we show evidence of generalization beyond diagram generation. Experimental results prove that our work offers a principled, scalable, and adaptive paradigm for resolving uncertainty about user intent in complex human-AI collaboration.
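
The intrinsic reward is simply the entropy drop of the agent's belief over candidate intents. A minimal sketch follows; the dict-based posteriors are our simplification of the paper's structured task space.

```python
import math

def info_gain_reward(posterior_before, posterior_after):
    """Reward for one clarifying question: the reduction of Shannon entropy
    over candidate intents. Arguments map candidate intents to probabilities."""
    def entropy(p):
        return -sum(q * math.log2(q) for q in p.values() if q > 0)
    return entropy(posterior_before) - entropy(posterior_after)

# A question that halves a uniform 4-way posterior earns exactly 1 bit:
# info_gain_reward({'a': .25, 'b': .25, 'c': .25, 'd': .25},
#                  {'a': .5, 'b': .5, 'c': 0, 'd': 0})  # -> 1.0
```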

[AI-22] FedMuon: Accelerating Federated Learning with Matrix Orthogonalization

【Quick Read】: This paper addresses the core bottleneck of federated learning (FL), the number of communication rounds, i.e., how to make local updates more effective so that communication overhead shrinks. Existing methods mostly rely on element-wise optimizers (Adam/SGD) and ignore the geometric structure of the weight matrices, so pathological directions get amplified during local updates, the condition number deteriorates, and convergence slows. The key is introducing the Muon optimizer locally, whose matrix orthogonalization better preserves the geometry of matrix-structured parameters. For non-IID settings with severe client drift, the paper further proposes FedMuon with two techniques: (1) momentum aggregation, which initializes local updates with the aggregated global momentum; and (2) local-global alignment, which aligns local gradients with the global update direction to substantially reduce drift. Theoretically, FedMuon achieves a linear speedup convergence rate without heterogeneity assumptions; empirically, it significantly reduces communication rounds and improves test accuracy on language and vision models.

Link: https://arxiv.org/abs/2510.27403
Authors: Junkang Liu, Fanhua Shang, Junchao Zhou, Hongying Liu, Yuanyuan Liu, Jin Liu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The core bottleneck of Federated Learning (FL) lies in the communication rounds; that is, achieving more effective local updates is crucial for reducing communication rounds. Existing FL methods still primarily use element-wise local optimizers (Adam/SGD), neglecting the geometric structure of the weight matrices. This often leads to the amplification of pathological directions in the weights during local updates, deteriorating the condition number and slowing convergence. Therefore, we introduce the Muon optimizer locally, which applies matrix orthogonalization to optimize matrix-structured parameters. Experimental results show that, in the IID setting, local Muon significantly accelerates the convergence of FL and reduces communication rounds compared to local SGD and local AdamW. However, in the non-IID setting, independent matrix orthogonalization based on the local distributions of each client induces strong client drift. Applying Muon in non-IID FL poses significant challenges: (1) client preconditioners leading to client drift; (2) moment reinitialization. To address these challenges, we propose a novel Federated Muon optimizer (FedMuon), which incorporates two key techniques: (1) momentum aggregation, where clients use the aggregated momentum for local initialization; (2) local-global alignment, where the local gradients are aligned with the global update direction to significantly reduce client drift. Theoretically, we prove that FedMuon achieves a linear speedup convergence rate without the heterogeneity assumption, where the speedup scales with $S$, the number of participating clients per round, $K$, the number of local iterations, and $R$, the total number of communication rounds. Empirically, we validate the effectiveness of FedMuon on language and vision models. Compared to several baselines, FedMuon significantly reduces communication rounds and improves test accuracy.
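
Muon's matrix orthogonalization is typically implemented with a Newton-Schulz iteration; a minimal sketch is below. The quintic coefficients are those popularized in public Muon implementations and may differ from this paper's exact choice.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately replace a gradient matrix G with its orthogonal factor,
    the core operation Muon applies to matrix-structured parameters."""
    a, b, c = 3.4445, -4.7750, 2.0315   # widely used quintic coefficients
    X = G / (G.norm() + 1e-7)           # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # work with the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```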

[AI-23] Realistic pedestrian-driver interaction modelling using multi-agent RL with human perceptual-motor constraints

【Quick Read】: This paper addresses the limited flexibility and mechanistic interpretability of existing pedestrian-driver interaction models, which often overlook how perceptual and motor constraints shape human behavior. The key is a multi-agent reinforcement learning (MARL) framework that integrates visual and motor constraints observed in real-world data (e.g., field-of-view limits and speed adjustment), producing trajectories that better match real interactions and capture individual differences. Experiments on an unsignalised pedestrian crossing dataset show that the model combining both constraint types scores best on interaction-realism metrics and, in this data-limited setting, outperforms supervised behavioral cloning, confirming the framework's promise for simulating realistic road user interactions.

Link: https://arxiv.org/abs/2510.27383
Authors: Yueyang Wang, Mehmet Dogar, Gustav Markkula
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Modelling pedestrian-driver interactions is critical for understanding human road user behaviour and developing safe autonomous vehicle systems. Existing approaches often rely on rule-based logic, game-theoretic models, or ‘black-box’ machine learning methods. However, these models typically lack flexibility or overlook the underlying mechanisms, such as sensory and motor constraints, which shape how pedestrians and drivers perceive and act in interactive scenarios. In this study, we propose a multi-agent reinforcement learning (RL) framework that integrates both visual and motor constraints of pedestrian and driver agents. Using a real-world dataset from an unsignalised pedestrian crossing, we evaluate four model variants, one without constraints, two with either motor or visual constraints, and one with both, across behavioural metrics of interaction realism. Results show that the combined model with both visual and motor constraints performs best. Motor constraints lead to smoother movements that resemble human speed adjustments during crossing interactions. The addition of visual constraints introduces perceptual uncertainty and field-of-view limitations, leading the agents to exhibit more cautious and variable behaviour, such as less abrupt deceleration. In this data-limited setting, our model outperforms a supervised behavioural cloning model, demonstrating that our approach can be effective without large training datasets. Finally, our framework accounts for individual differences by modelling parameters controlling the human constraints as population-level distributions, a perspective that has not been explored in previous work on pedestrian-vehicle interaction modelling. Overall, our work demonstrates that multi-agent RL with human constraints is a promising modelling approach for simulating realistic road user interactions.

[AI-24] Spiking Neural Networks: The Future of Brain-Inspired Computing

【Quick Read】: This paper positions Spiking Neural Networks (SNNs) as a more brain-inspired alternative that addresses the limitations of conventional artificial neural networks (ANNs) in energy efficiency, temporal dynamics, and hardware suitability. The key is a systematic analysis of SNN design models, training algorithms, and multi-dimensional performance metrics (accuracy, energy consumption, latency, spike count, and convergence behavior), validating different training strategies: SNNs trained with surrogate gradient descent approach ANN accuracy (within 1-2%) while converging faster (by the 20th epoch) with latency as low as 10 milliseconds, whereas STDP-based SNNs, though slower to converge, have the lowest spike counts and energy consumption (as low as 5 millijoules per inference), making them optimal for unsupervised and ultra-low-power tasks. The findings support deploying SNNs in energy-constrained, latency-sensitive, and adaptive applications such as robotics, neuromorphic vision, and edge intelligence.

Link: https://arxiv.org/abs/2510.27379
Authors: Sales G. Aribe Jr
Affiliation: Unknown
Subjects: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
Comments: 17 pages, 7 figures, 4 tables, Published with International Journal of Engineering Trends and Technology (IJETT)

Abstract:Spiking Neural Networks (SNNs) represent the latest generation of neural computation, offering a brain-inspired alternative to conventional Artificial Neural Networks (ANNs). Unlike ANNs, which depend on continuous-valued signals, SNNs operate using distinct spike events, making them inherently more energy-efficient and temporally dynamic. This study presents a comprehensive analysis of SNN design models, training algorithms, and multi-dimensional performance metrics, including accuracy, energy consumption, latency, spike count, and convergence behavior. Key neuron models such as the Leaky Integrate-and-Fire (LIF) and training strategies, including surrogate gradient descent, ANN-to-SNN conversion, and Spike-Timing Dependent Plasticity (STDP), are examined in depth. Results show that surrogate gradient-trained SNNs closely approximate ANN accuracy (within 1-2%), with faster convergence by the 20th epoch and latency as low as 10 milliseconds. Converted SNNs also achieve competitive performance but require higher spike counts and longer simulation windows. STDP-based SNNs, though slower to converge, exhibit the lowest spike counts and energy consumption (as low as 5 millijoules per inference), making them optimal for unsupervised and low-power tasks. These findings reinforce the suitability of SNNs for energy-constrained, latency-sensitive, and adaptive applications such as robotics, neuromorphic vision, and edge AI systems. While promising, challenges persist in hardware standardization and scalable training. This study concludes that SNNs, with further refinement, are poised to propel the next phase of neuromorphic computing.
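
For readers new to the Leaky Integrate-and-Fire (LIF) neuron the survey centers on, one Euler-discretized update step looks like this (textbook form; parameter values are illustrative):

```python
import numpy as np

def lif_step(v, input_current, v_rest=0.0, v_thresh=1.0, tau=20.0, dt=1.0):
    """One Euler step of a Leaky Integrate-and-Fire neuron:
        tau * dv/dt = -(v - v_rest) + I(t),
    with a spike and hard reset whenever v crosses v_thresh."""
    v = v + (dt / tau) * (-(v - v_rest) + input_current)
    spikes = v >= v_thresh
    v = np.where(spikes, v_rest, v)   # reset membrane potential after a spike
    return v, spikes
```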

[AI-25] ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use

【Quick Read】: This paper addresses the drop in reasoning efficiency and accuracy that multimodal large language models (MLLMs) suffer in long-horizon visual question answering (VQA) due to visual context degradation. The core challenge is enabling MLLMs to invoke external tools flexibly and efficiently amid complex, diverse multimodal information, coordinating local perception with global planning. The key of ToolScope lies in its three modules: a Global Navigator that provides high-level strategic guidance, an Agentic Executor that iteratively augments local perception by integrating Search, Code, and a specialized Perceive tool, and a Response Synthesizer that consolidates the reasoning process into coherent, structured output. The dedicated Perceive tool effectively mitigates visual context degradation in long-horizon reasoning, improving generalization across four VQA benchmarks by up to +6.69% on average.

Link: https://arxiv.org/abs/2510.27363
Authors: Mengjie Deng, Guanting Dong, Zhicheng Dou
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Recently, large language models (LLMs) have demonstrated remarkable problem-solving capabilities by autonomously integrating with external tools for collaborative reasoning. However, due to the inherently complex and diverse nature of multimodal information, enabling multimodal large language models (MLLMs) to flexibly and efficiently utilize external tools during reasoning remains an underexplored challenge. In this work, we introduce ToolScope, an agentic framework designed to unify global planning with local multimodal perception, adopting a specialized Perceive tool to mitigate visual context degradation in long-horizon VQA tasks. ToolScope comprises three primary components: the Global Navigator, the Agentic Executor, and the Response Synthesizer. The Global Navigator functions as a "telescope", offering high-level strategic guidance. The Agentic Executor operates iteratively to augment MLLM with local perception through the integration of external tools-Search, Code, and Perceive. Finally, the Response Synthesizer consolidates and organizes the reasoning process into a coherent, user-friendly output. We evaluate ToolScope on four VQA benchmarks across diverse domains, including VQA 2.0, ScienceQA, MAT-Search and MathVista. It demonstrates strong generalization capabilities, achieving an average performance improvement of up to +6.69% across all datasets.

[AI-26] An In-depth Study of LLM Contributions to the Bin Packing Problem

【Quick Read】: This paper reassesses the contested claim that large language models (LLMs) contribute to mathematical discovery, asking whether LLM-generated heuristics truly provide new insight into the online bin packing problem. Through a close analysis of the behavior and interpretability of these heuristics, the authors find that, although human-readable, they remain largely opaque even to domain experts. The key contribution is a new class of algorithms tailored to these specific bin packing instances that is significantly simpler, more efficient, more interpretable, and more generalizable, revealing that the instances themselves are relatively simple and thereby challenging the premise behind the claimed LLM contribution, namely the mistaken assumption that these instances had previously been studied. The paper stresses the need for rigorous validation and contextualization when assessing the scientific value of LLM-generated outputs.

Link: https://arxiv.org/abs/2510.27353
Authors: Julien Herrmann, Guillaume Pallez
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 15 pages, 13 figures

Abstract:Recent studies have suggested that Large Language Models (LLMs) could provide interesting ideas contributing to mathematical discovery. This claim was motivated by reports that LLM-based genetic algorithms produced heuristics offering new insights into the online bin packing problem under uniform and Weibull distributions. In this work, we reassess this claim through a detailed analysis of the heuristics produced by LLMs, examining both their behavior and interpretability. Despite being human-readable, these heuristics remain largely opaque even to domain experts. Building on this analysis, we propose a new class of algorithms tailored to these specific bin packing instances. The derived algorithms are significantly simpler, more efficient, more interpretable, and more generalizable, suggesting that the considered instances are themselves relatively simple. We then discuss the limitations of the claim regarding LLMs’ contribution to this problem, which appears to rest on the mistaken assumption that the instances had previously been studied. Our findings instead emphasize the need for rigorous validation and contextualization when assessing the scientific value of LLM-generated outputs.

[AI-27] Discriminative Rule Learning for Outcome-Guided Process Model Discovery

【Quick Read】: This paper addresses the inability of traditional process discovery to distinguish and focus on behavioral differences between process executions with different outcomes (e.g., efficient/compliant vs. inefficient/violating). A single model struggles to support accurate conformance checking and performance analysis at once, and may obscure the structural patterns that drive different outcomes. The key is to first learn interpretable discriminative rules over control-flow features that group traces with similar desirability profiles, and then run process discovery separately within each group, yielding focused, interpretable process models that clearly reveal the patterns behind desirable and undesirable executions.

Link: https://arxiv.org/abs/2510.27343
Authors: Ali Norouzifar, Wil van der Aalst
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: The paper will be published as part of the CoopIS 2025 conference proceedings

Abstract:Event logs extracted from information systems offer a rich foundation for understanding and improving business processes. In many real-world applications, it is possible to distinguish between desirable and undesirable process executions, where desirable traces reflect efficient or compliant behavior, and undesirable ones may involve inefficiencies, rule violations, delays, or resource waste. This distinction presents an opportunity to guide process discovery in a more outcome-aware manner. Discovering a single process model without considering outcomes can yield representations poorly suited for conformance checking and performance analysis, as they fail to capture critical behavioral differences. Moreover, prioritizing one behavior over the other may obscure structural distinctions vital for understanding process outcomes. By learning interpretable discriminative rules over control-flow features, we group traces with similar desirability profiles and apply process discovery separately within each group. This results in focused and interpretable models that reveal the drivers of both desirable and undesirable executions. The approach is implemented as a publicly available tool and it is evaluated on multiple real-life event logs, demonstrating its effectiveness in isolating and visualizing critical process patterns.

[AI-28] Reinforcement Learning for Long-Horizon Unordered Tasks: From Boolean to Coupled Reward Machines

【Quick Read】: This paper addresses the efficiency bottleneck of reward machines (RMs) on long-horizon problems with unordered subtasks: when subtasks can be completed in any order, the information a traditional RM must learn grows exponentially with the number of subtasks. The key is three generalisations of RMs: (1) Numeric RMs, which express complex tasks compactly; (2) Agenda RMs, whose states carry an agenda tracking the remaining subtasks; and (3) Coupled RMs, which attach a coupled state to each subtask in the agenda, explicitly modeling dependencies between subtasks. The paper further proposes a compositional learning algorithm that exploits coupled RMs, Q-learning with Coupled RMs (CoRM), which uses the structured reward information efficiently and scales better than state-of-the-art RM algorithms on long-horizon problems with unordered subtasks.

Link: https://arxiv.org/abs/2510.27329
Authors: Kristina Levina, Nikolaos Pappas, Athanasios Karapantelakis, Aneta Vulgarakis Feljan, Jendrik Seipp
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:Reward machines (RMs) inform reinforcement learning agents about the reward structure of the environment. This is particularly advantageous for complex non-Markovian tasks because agents with access to RMs can learn more efficiently from fewer samples. However, learning with RMs is ill-suited for long-horizon problems in which a set of subtasks can be executed in any order. In such cases, the amount of information to learn increases exponentially with the number of unordered subtasks. In this work, we address this limitation by introducing three generalisations of RMs: (1) Numeric RMs allow users to express complex tasks in a compact form. (2) In Agenda RMs, states are associated with an agenda that tracks the remaining subtasks to complete. (3) Coupled RMs have coupled states associated with each subtask in the agenda. Furthermore, we introduce a new compositional learning algorithm that leverages coupled RMs: Q-learning with coupled RMs (CoRM). Our experiments show that CoRM scales better than state-of-the-art RM algorithms for long-horizon problems with unordered subtasks.
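
An agenda RM can be pictured as a state carrying the set of unfinished subtasks. Below is a toy sketch of the transition and reward bookkeeping; it is our illustrative reading, not the paper's exact formalism, and the reward values are arbitrary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgendaState:
    """State of a toy agenda reward machine: the agenda holds the
    subtasks that remain to be completed, in any order."""
    agenda: frozenset

def rm_step(state, event):
    """Consume one environment event; pay reward when a subtask completes."""
    if event in state.agenda:
        nxt = AgendaState(state.agenda - {event})
        reward = 10.0 if not nxt.agenda else 1.0   # bonus when agenda empties
        return nxt, reward
    return state, 0.0

# s = AgendaState(frozenset({"collect_key", "open_door"}))
# s, r = rm_step(s, "collect_key")   # r == 1.0, one subtask remains
```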

[AI-29] Can LLM s Help You at Work? A Sandbox for Evaluating LLM Agents in Enterprise Environments EMNLP2025

【Quick Read】: This paper addresses the challenges of integrating large language model (LLM) based systems into enterprise environments, where data fragmented across many sources, complex access controls, and intertwined cross-functional workflows make development and evaluation difficult. The key is EnterpriseBench, a comprehensive benchmark that simulates realistic enterprise settings with 500 diverse tasks across software engineering, HR, finance, and administrative domains, together with a novel data-generation pipeline that builds internally consistent enterprise tasks from organizational metadata, systematically capturing core enterprise characteristics such as data-source fragmentation, access-control hierarchies, and cross-functional workflows. Experiments show that even state-of-the-art LLM agents complete only 41.8% of tasks, underscoring ample room for improvement in enterprise-focused AI systems.

Link: https://arxiv.org/abs/2510.27287
Authors: Harsh Vishwakarma, Ankush Agarwal, Ojas Patil, Chaitanya Devaguptapu, Mahesh Chandran
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at EMNLP 2025 Main Track

Abstract:Enterprise systems are crucial for enhancing productivity and decision-making among employees and customers. Integrating LLM based systems into enterprise systems enables intelligent automation, personalized experiences, and efficient information retrieval, driving operational efficiency and strategic growth. However, developing and evaluating such systems is challenging due to the inherent complexity of enterprise environments, where data is fragmented across multiple sources and governed by sophisticated access controls. We present EnterpriseBench, a comprehensive benchmark that simulates enterprise settings, featuring 500 diverse tasks across software engineering, HR, finance, and administrative domains. Our benchmark uniquely captures key enterprise characteristics including data source fragmentation, access control hierarchies, and cross-functional workflows. Additionally, we provide a novel data generation pipeline that creates internally consistent enterprise tasks from organizational metadata. Experiments with state-of-the-art LLM agents demonstrate that even the most capable models achieve only 41.8% task completion, highlighting significant opportunities for improvement in enterprise-focused AI systems.

[AI-30] HiF-DTA: Hierarchical Feature Learning Network for Drug-Target Affinity Prediction

【Quick Read】: This paper addresses the loss of accuracy in drug-target affinity (DTA) prediction when global sequence-semantic features and local topological features are not modeled jointly, and when atomic-, substructure-, and molecule-level multi-scale representations are missing. The key of HiF-DTA is a hierarchical network with a dual-pathway design that extracts both global semantic and local topological features from drug and protein sequences, and a multi-scale bilinear attention module that fuses atomic, substructural, and molecular drug representations for more accurate DTA prediction.

Link: https://arxiv.org/abs/2510.27281
Authors: Minghui Li, Yuanhang Wang, Peijin Guo, Wei Wan, Shengshan Hu, Shengqing Hu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted by International Conference on Bioinformatics and Biomedicine (BIBM 25)

Abstract:Accurate prediction of Drug-Target Affinity (DTA) is crucial for reducing experimental costs and accelerating early screening in computational drug discovery. While sequence-based deep learning methods avoid reliance on costly 3D structures, they still overlook simultaneous modeling of global sequence semantic features and local topological structural features within drugs and proteins, and represent drugs as flat sequences without atomic-level, substructural-level, and molecular-level multi-scale features. We propose HiF-DTA, a hierarchical network that adopts a dual-pathway strategy to extract both global sequence semantic and local topological features from drug and protein sequences, and models drugs at multiple scales, learning atomic, substructural, and molecular representations fused via a multi-scale bilinear attention module. Experiments on Davis, KIBA, and Metz datasets show HiF-DTA outperforms state-of-the-art baselines, with ablations confirming the importance of global-local extraction and multi-scale fusion.

[AI-31] Not All Instances Are Equally Valuable: Towards Influence-Weighted Dataset Distillation

【Quick Read】: This paper addresses the fact that existing dataset distillation methods ignore data quality: real-world datasets contain redundant or even harmful instances, and directly distilling the full dataset can degrade model performance. The key is IWD (Influence-Weighted Distillation), a principled framework that uses influence functions to account for data quality: it estimates each instance's influence on the distillation objective and assigns adaptive weights accordingly, prioritizing beneficial instances while downweighting useless or harmful ones, yielding higher-quality distilled datasets.

Link: https://arxiv.org/abs/2510.27253
Authors: Qiyan Deng, Changqian Zheng, Lianpeng Qiao, Yuping Wang, Chengliang Chai, Lei Cao
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Dataset distillation condenses large datasets into synthetic subsets, achieving performance comparable to training on the full dataset while substantially reducing storage and computation costs. Most existing dataset distillation methods assume that all real instances contribute equally to the process. In practice, real-world datasets contain both informative and redundant or even harmful instances, and directly distilling the full dataset without considering data quality can degrade model performance. In this work, we present Influence-Weighted Distillation (IWD), a principled framework that leverages influence functions to explicitly account for data quality in the distillation process. IWD assigns adaptive weights to each instance based on its estimated impact on the distillation objective, prioritizing beneficial data while downweighting less useful or harmful ones. Owing to its modular design, IWD can be seamlessly integrated into diverse dataset distillation frameworks. Our empirical results suggest that integrating IWD tends to improve the quality of distilled datasets and enhance model performance, with accuracy gains of up to 7.8%.
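
The weighting idea reduces to scaling each instance's contribution to the distillation loss by its estimated influence. A minimal sketch is below; the softmax normalization is our assumption, and the paper's exact weighting may differ.

```python
import torch

def influence_weighted_loss(per_instance_loss, influence_scores, temperature=1.0):
    """Weight per-instance distillation losses by estimated influence, so
    redundant or harmful instances contribute less to the objective."""
    weights = torch.softmax(influence_scores / temperature, dim=0)
    return (weights * per_instance_loss).sum()
```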

[AI-32] Reconstructing Unseen Sentences from Speech-related Biosignals for Open-vocabulary Neural Communication

【Quick Read】: This paper tackles the challenge of open-vocabulary neural communication in brain-to-speech (BTS) systems, namely decoding unseen sentences from high-density electroencephalography (EEG), moving beyond the predefined words and sentences that limit current non-invasive BTS systems. The key is to leverage phoneme-level information extracted from EEG signals, both independently and in conjunction with electromyography (EMG) signals, to synthesize speech for novel sentences across various speech modes. The study further characterizes the properties affecting phoneme decoding accuracy during sentence reconstruction and offers neurophysiological insights toward better EEG decoding and personalized, adaptive neural communication and rehabilitation solutions.

Link: https://arxiv.org/abs/2510.27247
Authors: Deok-Seon Kim, Seo-Hyun Lee, Kang Yin, Seong-Whan Lee
Affiliation: Unknown
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in IEEE Transactions on Neural Systems and Rehabilitation Engineering

Abstract:Brain-to-speech (BTS) systems represent a groundbreaking approach to human communication by enabling the direct transformation of neural activity into linguistic expressions. While recent non-invasive BTS studies have largely focused on decoding predefined words or sentences, achieving open-vocabulary neural communication comparable to natural human interaction requires decoding unconstrained speech. Additionally, effectively integrating diverse signals derived from speech is crucial for developing personalized and adaptive neural communication and rehabilitation solutions for patients. This study investigates the potential of speech synthesis for previously unseen sentences across various speech modes by leveraging phoneme-level information extracted from high-density electroencephalography (EEG) signals, both independently and in conjunction with electromyography (EMG) signals. Furthermore, we examine the properties affecting phoneme decoding accuracy during sentence reconstruction and offer neurophysiological insights to further enhance EEG decoding for more effective neural communication solutions. Our findings underscore the feasibility of biosignal-based sentence-level speech synthesis for reconstructing unseen sentences, highlighting a significant step toward developing open-vocabulary neural communication systems adapted to diverse patient needs and conditions. Additionally, this study provides meaningful insights into the development of communication and rehabilitation solutions utilizing EEG-based decoding technologies.

[AI-33] Vintage Code, Modern Judges: Meta-Validation in Low Data Regimes

【Quick Read】: This paper addresses the difficulty of validating the reliability of Large Language Models as a Judge (LaaJ) in the modernization of applications written in legacy languages (COBOL, PL/I, REXX), where expert availability and high-quality human annotations are scarce. The core challenge: with too little human-labeled data, conventional statistical methods cannot reliably assess LaaJ alignment with human judgment, risking a circular evaluation loop that introduces bias and compromises downstream deployment decisions. The key is SparseAlign, a framework that combines a novel pairwise-confidence concept with a score-sensitive alignment metric, jointly capturing ranking consistency and score proximity, enabling reliable judge selection from sparse human labels; it was applied internally to select LaaJs for COBOL code explanation, where the top-aligned evaluators guided model release decisions.

Link: https://arxiv.org/abs/2510.27244
Authors: Ora Nova Fandina, Gal Amram, Eitan Farchi, Shmulik Froimovich, Raviv Gal, Wesam Ibraheem, Rami Katan, Alice Podolsky, Orna Raz
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Application modernization in legacy languages such as COBOL, PL/I, and REXX faces an acute shortage of resources, both in expert availability and in high-quality human evaluation data. While Large Language Models as a Judge (LaaJ) offer a scalable alternative to expert review, their reliability must be validated before being trusted in high-stakes workflows. Without principled validation, organizations risk a circular evaluation loop, where unverified LaaJs are used to assess model outputs, potentially reinforcing unreliable judgments and compromising downstream deployment decisions. Although various automated approaches to validating LaaJs have been proposed, alignment with human judgment remains a widely used and conceptually grounded validation strategy. In many real-world domains, the availability of human-labeled evaluation data is severely limited, making it difficult to assess how well a LaaJ aligns with human judgment. We introduce SparseAlign, a formal framework for assessing LaaJ alignment with sparse human-labeled data. SparseAlign combines a novel pairwise-confidence concept with a score-sensitive alignment metric that jointly capture ranking consistency and score proximity, enabling reliable evaluator selection even when traditional statistical methods are ineffective due to limited annotated examples. SparseAlign was applied internally to select LaaJs for COBOL code explanation. The top-aligned evaluators were integrated into assessment workflows, guiding model release decisions. We present a case study of four LaaJs to demonstrate SparseAlign’s utility in real-world evaluation scenarios.

[AI-34] Feature-Function Curvature Analysis: A Geometric Framework for Explaining Differentiable Models

【Quick Read】: This paper argues that mainstream attribution methods in explainable AI (XAI) give an incomplete, static picture of complex models: collapsing a feature's role into a single score misleads under non-linearity and feature interactions. The key is Feature-Function Curvature Analysis (FFCA), which characterizes the geometry of the learned function through a 4-dimensional signature per feature, quantifying its Impact, Volatility, Non-linearity, and Interaction. Extending this, Dynamic Archetype Analysis tracks how these signatures evolve throughout training, providing the first direct empirical evidence of hierarchical learning (models consistently learn simple linear effects before complex interactions) together with practical diagnostics such as identifying insufficient model capacity and predicting the onset of overfitting, lifting model explanation from static quantification to a dynamic, trustworthy analysis of the whole learning process.

Link: https://arxiv.org/abs/2510.27207
Authors: Hamed Najafi, Dongsheng Luo, Jason Liu
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Explainable AI (XAI) is critical for building trust in complex machine learning models, yet mainstream attribution methods often provide an incomplete, static picture of a model’s final state. By collapsing a feature’s role into a single score, they are confounded by non-linearity and interactions. To address this, we introduce Feature-Function Curvature Analysis (FFCA), a novel framework that analyzes the geometry of a model’s learned function. FFCA produces a 4-dimensional signature for each feature, quantifying its: (1) Impact, (2) Volatility, (3) Non-linearity, and (4) Interaction. Crucially, we extend this framework into Dynamic Archetype Analysis, which tracks the evolution of these signatures throughout the training process. This temporal view moves beyond explaining what a model learned to revealing how it learns. We provide the first direct, empirical evidence of hierarchical learning, showing that models consistently learn simple linear effects before complex interactions. Furthermore, this dynamic analysis provides novel, practical diagnostics for identifying insufficient model capacity and predicting the onset of overfitting. Our comprehensive experiments demonstrate that FFCA, through its static and dynamic components, provides the essential geometric context that transforms model explanation from simple quantification to a nuanced, trustworthy analysis of the entire learning process.
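
The four-dimensional signature can be approximated with finite differences on the learned function. The sketch below shows one possible operationalization; the estimators and perturbation scheme are our assumptions, not the authors' exact definitions.

```python
import torch

def ffca_signature(f, x, i, j, eps=1e-2, samples=32):
    """Estimate the four per-feature quantities the paper names for a scalar
    model f at input x: impact ~ |df/dx_i|, non-linearity ~ |d2f/dx_i^2|,
    interaction ~ |d2f/(dx_i dx_j)|, volatility ~ std of df/dx_i nearby."""
    def d1(point):                       # central difference w.r.t. feature i
        e = torch.zeros_like(point); e[i] = eps
        return (f(point + e) - f(point - e)) / (2 * eps)
    e_i = torch.zeros_like(x); e_i[i] = eps
    e_j = torch.zeros_like(x); e_j[j] = eps
    impact = d1(x).abs()
    nonlinearity = ((f(x + e_i) - 2 * f(x) + f(x - e_i)) / eps ** 2).abs()
    interaction = ((f(x + e_i + e_j) - f(x + e_i - e_j)
                    - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps ** 2)).abs()
    volatility = torch.stack([d1(x + 0.1 * torch.randn_like(x))
                              for _ in range(samples)]).std()
    return impact, volatility, nonlinearity, interaction
```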

[AI-35] Fints: Efficient Inference-Time Personalization for LLM s with Fine-Grained Instance-Tailored Steering

【Quick Read】: This paper addresses two limitations of parametric personalization for large language models (LLMs): weak adaptation to dynamic user behavior patterns, and poor data efficiency under high sparsity. The key is a fine-grained and instance-tailored steering framework with two innovations: a fine-grained steering component that hooks activations from attention and MLP layers to capture nuanced personalization signals, and an input-aware aggregation module that synthesizes these signals into contextually relevant enhancements injected into the model's forward pass. The method is highly flexible and data-efficient, excels under fast-shifting distributions and sparse data, and is orthogonal to existing methods, operating as a plug-in component compatible with different personalization techniques.

Link: https://arxiv.org/abs/2510.27206
Authors: Kounianhua Du, Jianxing Liu, Kangning Zhang, Wenxiang Jiao, Yuan Lu, Jiarui Jin, Weiwen Liu, Yong Yu, Weinan Zhang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments:

Abstract:The rapid evolution of large language models (LLMs) has intensified the demand for effective personalization techniques that can adapt model behavior to individual user preferences. Despite the non-parametric methods utilizing the in-context learning ability of LLMs, recent parametric adaptation methods, including personalized parameter-efficient fine-tuning and reward modeling emerge. However, these methods face limitations in handling dynamic user patterns and high data sparsity scenarios, due to low adaptability and data efficiency. To address these challenges, we propose a fine-grained and instance-tailored steering framework that dynamically generates sample-level interference vectors from user data and injects them into the model’s forward pass for personalized adaptation. Our approach introduces two key technical innovations: a fine-grained steering component that captures nuanced signals by hooking activations from attention and MLP layers, and an input-aware aggregation module that synthesizes these signals into contextually relevant enhancements. The method demonstrates high flexibility and data efficiency, excelling in fast-changing distribution and high data sparsity scenarios. In addition, the proposed method is orthogonal to existing methods and operates as a plug-in component compatible with different personalization techniques. Extensive experiments across diverse scenarios–including short-to-long text generation, and web function calling–validate the effectiveness and compatibility of our approach. Results show that our method significantly enhances personalization performance in fast-shifting environments while maintaining robustness across varying interaction modes and context lengths. Implementation is available at this https URL.
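
Mechanically, "hooking activations" and "injecting into the forward pass" can be done with a standard PyTorch forward hook. A minimal sketch follows; the module path and the per-user vector `v_user` are hypothetical, and the paper derives its steering vectors per sample from user data.

```python
import torch

def register_steering_hook(layer, steering_vector, alpha=1.0):
    """Add a user-specific vector to a module's output during the forward
    pass; remove the returned handle to restore the original behavior."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * steering_vector.to(hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return layer.register_forward_hook(hook)

# handle = register_steering_hook(model.transformer.h[10].mlp, v_user)
# ... run personalized generation ...
# handle.remove()
```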

[AI-36] From product to system network: challenges in system of systems lifecycle management

【Quick Read】: This paper addresses the growing inadequacy of traditional linear product lifecycle models for complex networked systems (System of Systems, SoS), where the core challenges include cross-disciplinary interoperability, configuration management, end-to-end traceability, and governance beyond organizational boundaries. The key is an integrated frame of reference with model-based systems engineering (MBSE) as the semantic backbone, product lifecycle management (PLM) as the governance and configuration layer, CAD-CAE as model-derived domains, and the digital thread and digital twin providing continuous feedback. The framework rests on four principles: referenced architectures and data models, end-to-end configuration sovereignty, curated models with clear review gates, and value contributions measurable along time, quality, cost, and sustainability, supporting a three-step roadmap from product- to network-centric development that improves change robustness, shortens throughput times, increases reuse, and enables informed sustainability decisions.

Link: https://arxiv.org/abs/2510.27194
Authors: Vahid Salehi, Josef Vilsmeier, Shirui Wang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:Today, products are no longer isolated artifacts, but nodes in networked systems. This means that traditional, linearly conceived life cycle models are reaching their limits: interoperability across disciplines, variant and configuration management, traceability, and governance across organizational boundaries are becoming key factors. This collective contribution classifies the state of the art and proposes a practical frame of reference for SoS lifecycle management: model-based systems engineering (MBSE) as the semantic backbone, product lifecycle management (PLM) as the governance and configuration level, CAD-CAE as model-derived domains, and digital thread and digital twin as continuous feedback. Based on current literature and industry experience in mobility, healthcare, and the public sector, we identify four principles: (1) referenced architecture and data models, (2) end-to-end configuration sovereignty instead of tool silos, (3) curated models with clear review gates, and (4) measurable value contributions along time, quality, cost, and sustainability. A three-step roadmap shows the transition from product- to network-centric development: piloting with a reference architecture, scaling across variant and supply chain spaces, and organizational anchoring (roles, training, compliance). The results are increased change robustness, shorter throughput times, improved reuse, and informed sustainability decisions. This article is aimed at decision-makers and practitioners who want to make complexity manageable and design SoS value streams to be scalable.

[AI-37] Vectorized Online POMDP Planning ICRA2026

【Quick Read】: This paper addresses the difficulty of parallelizing solvers for Partially Observable Markov Decision Processes (POMDPs): traditional solvers interleave numerical optimization over actions with value estimation, creating dependencies and synchronization bottlenecks between parallel processes that quickly offset the benefits of hardware parallelism. The key is the Vectorized Online POMDP Planner (VOPP), built on a recent POMDP formulation that solves part of the optimization component analytically, leaving only the estimation of expectations for numerical computation. VOPP represents all planning-related data structures as a collection of tensors and implements every planning step as fully vectorized computations over them, eliminating inter-process dependencies and synchronization overhead. Experiments show VOPP computes near-optimal solutions at least 20x more efficiently than the existing state-of-the-art parallel online solver.

Link: https://arxiv.org/abs/2510.27191
Authors: Marcus Hoerger, Muhammad Sudrajat, Hanna Kurniawati
Affiliation: Unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures. Submitted to ICRA 2026

Abstract:Planning under partial observability is an essential capability of autonomous robots. The Partially Observable Markov Decision Process (POMDP) provides a powerful framework for planning under partial observability problems, capturing the stochastic effects of actions and the limited information available through noisy observations. POMDP solving could benefit tremendously from massive parallelization of today’s hardware, but parallelizing POMDP solvers has been challenging. They rely on interleaving numerical optimization over actions with the estimation of their values, which creates dependencies and synchronization bottlenecks between parallel processes that can quickly offset the benefits of parallelization. In this paper, we propose Vectorized Online POMDP Planner (VOPP), a novel parallel online solver that leverages a recent POMDP formulation that analytically solves part of the optimization component, leaving only the estimation of expectations for numerical computation. VOPP represents all data structures related to planning as a collection of tensors and implements all planning steps as fully vectorized computations over this representation. The result is a massively parallel solver with no dependencies and synchronization bottlenecks between parallel computations. Experimental results indicate that VOPP is at least 20X more efficient in computing near-optimal solutions compared to an existing state-of-the-art parallel online solver.

[AI-38] Unvalidated Trust: Cross-Stage Vulnerabilities in Large Language Model Architectures

【Quick Read】: This paper addresses the systemic risk that arises from unvalidated trust between processing stages in automated multi-stage LLM pipelines. The analysis shows that inputs are often interpreted non-neutrally and can trigger implementation-shaped responses or unintended state changes even without explicit commands; such behaviors constitute architectural failure modes that string-level filtering alone cannot address. The key is a set of zero-trust architectural principles, including provenance enforcement, context sealing, and plan revalidation, with "Countermind" introduced as a conceptual blueprint for implementing these defenses.

Link: https://arxiv.org/abs/2510.27190
Authors: Dominik Schwarz
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 178 pages, mechanism-centered taxonomy of 41 LLM risk patterns, extensive appendix with experiment prompts and consolidation tables. Full traces available to reviewers and affected providers

Abstract:As Large Language Models (LLMs) are increasingly integrated into automated, multi-stage pipelines, risk patterns that arise from unvalidated trust between processing stages become a practical concern. This paper presents a mechanism-centered taxonomy of 41 recurring risk patterns in commercial LLMs. The analysis shows that inputs are often interpreted non-neutrally and can trigger implementation-shaped responses or unintended state changes even without explicit commands. We argue that these behaviors constitute architectural failure modes and that string-level filtering alone is insufficient. To mitigate such cross-stage vulnerabilities, we recommend zero-trust architectural principles, including provenance enforcement, context sealing, and plan revalidation, and we introduce “Countermind” as a conceptual blueprint for implementing these defenses.
zh

[AI-39] FMint-SDE: A Multimodal Foundation Model for Accelerating Numerical Simulation of SDEs via Error Correction

【速读】:该论文旨在解决动力系统模拟中普遍存在的准确性与计算效率之间的权衡问题,以及现有基于神经网络的方法通常需要为每个具体案例单独训练模型的局限性。其解决方案的关键在于提出了一种用于随机微分方程(Stochastic Differential Equations, SDE)的大规模模拟多模态基础模型——FMint-SDE,该模型基于仅解码器(decoder-only)Transformer架构,具备上下文学习(in-context learning)能力,通过融合数值和文本模态来学习一种通用的误差校正机制;其训练以传统求解器生成的粗粒度解作为提示序列(prompted sequences),从而实现跨多种应用场景(如分子动力学、机械系统、金融和生物等领域)的广泛泛化能力,并在准确性和效率之间取得更优平衡。

链接: https://arxiv.org/abs/2510.27173
作者: Jiaxin Yuan,Haizhao Yang,Maria Cameron
机构: 未知
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Dynamical Systems (math.DS)
备注:

点击查看摘要

Abstract:Fast and accurate simulation of dynamical systems is a fundamental challenge across scientific and engineering domains. Traditional numerical integrators often face a trade-off between accuracy and computational efficiency, while existing neural network-based approaches typically require training a separate model for each case. To overcome these limitations, we introduce a novel multi-modal foundation model for large-scale simulations of differential equations: FMint-SDE (Foundation Model based on Initialization for stochastic differential equations). Based on a decoder-only transformer with in-context learning, FMint-SDE leverages numerical and textual modalities to learn a universal error-correction scheme. It is trained using prompted sequences of coarse solutions generated by conventional solvers, enabling broad generalization across diverse systems. We evaluate our models on a suite of challenging SDE benchmarks spanning applications in molecular dynamics, mechanical systems, finance, and biology. Experimental results show that our approach achieves a superior accuracy-efficiency tradeoff compared to classical solvers, underscoring the potential of FMint-SDE as a general-purpose simulation tool for dynamical systems.
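FMint-SDE 的训练数据来自“传统求解器生成的粗粒度解序列”。下面用 Euler-Maruyama 法对一个 Ornstein-Uhlenbeck 过程生成粗步长轨迹,示意这类提示序列的来源;误差校正 Transformer 本身不在此草图范围内,方程与参数均为示例假设:

```python
import numpy as np

rng = np.random.default_rng(42)

def euler_maruyama(x0, theta, mu, sigma, dt, n_steps):
    """对 dX = theta*(mu - X)dt + sigma dW 的粗粒度 Euler-Maruyama 模拟。"""
    xs = [x0]
    x = x0
    for _ in range(n_steps):
        dw = rng.normal(0.0, np.sqrt(dt))
        x = x + theta * (mu - x) * dt + sigma * dw
        xs.append(x)
    return np.array(xs)

# 粗步长轨迹作为基础模型的数值模态输入,文本模态可为方程描述
coarse = euler_maruyama(x0=1.0, theta=2.0, mu=0.0, sigma=0.3, dt=0.1, n_steps=50)
prompt_sequence = {
    "text": "OU process: dX = 2.0*(0 - X)dt + 0.3 dW, coarse dt=0.1",
    "numeric": coarse,
}
print(prompt_sequence["numeric"][:5])
```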
zh

[AI-40] Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在“微调即服务”(Fine-tuning-as-a-Service)场景中因有害微调(Harmful Fine-tuning)带来的安全风险问题。现有防御策略依赖攻击模拟来预先构建鲁棒性,但存在两大局限:一是难以扩展至边界威胁模型之外,因无法预判未知攻击;二是对不同攻击设置适应性差,因模拟无法捕捉其复杂性和变异性。解决方案的关键在于提出贝叶斯数据调度器(Bayesian Data Scheduler, BDS),它将有害微调防御建模为贝叶斯推断问题,通过学习每个数据点的安全属性后验分布(条件于微调和对齐数据集),并在微调过程中根据采样的安全权重对数据进行加权,从而抑制有害数据的影响。BDS利用贝叶斯推断的后验性质,使防御能自适应地匹配特定数据集,实现无需攻击模拟的动态防御,并引入基于摊销贝叶斯学习的神经调度器,支持高效迁移至新数据而无需重新训练。

链接: https://arxiv.org/abs/2510.27172
作者: Zixuan Hu,Li Shen,Zhenyi Wang,Yongxian Wei,Dacheng Tao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Harmful fine-tuning poses critical safety risks to fine-tuning-as-a-service for large language models. Existing defense strategies preemptively build robustness via attack simulation but suffer from fundamental limitations: (i) the infeasibility of extending attack simulations beyond bounded threat models due to the inherent difficulty of anticipating unknown attacks, and (ii) limited adaptability to varying attack settings, as simulation fails to capture their variability and complexity. To address these challenges, we propose Bayesian Data Scheduler (BDS), an adaptive tuning-stage defense strategy with no need for attack simulation. BDS formulates harmful fine-tuning defense as a Bayesian inference problem, learning the posterior distribution of each data point’s safety attribute, conditioned on the fine-tuning and alignment datasets. The fine-tuning process is then constrained by weighting data with their safety attributes sampled from the posterior, thus mitigating the influence of harmful data. By leveraging the post hoc nature of Bayesian inference, the posterior is conditioned on the fine-tuning dataset, enabling BDS to tailor its defense to the specific dataset, thereby achieving adaptive defense. Furthermore, we introduce a neural scheduler based on amortized Bayesian learning, enabling efficient transfer to new data without retraining. Comprehensive results across diverse attack and defense settings demonstrate the state-of-the-art performance of our approach. Code is available at this https URL.
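BDS 的核心机制是“按后验采样的安全属性对样本加权”。下面用 PyTorch 给出一个最小示意:假设每条数据的安全属性服从 Beta 后验,训练时从后验采样权重并乘到逐样本损失上。Beta 参数、模型与数据均为假设值,实际后验由对齐数据与微调数据条件化得到:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# 玩具二分类模型与一批微调数据
model = torch.nn.Linear(16, 2)
x = torch.randn(8, 16)
y = torch.randint(0, 2, (8,))

# 每个数据点的安全属性后验 Beta(alpha_i, beta_i);此处为假设值
alpha = torch.tensor([9., 9., 1., 9., 2., 9., 9., 1.])
beta  = torch.tensor([1., 1., 9., 1., 8., 1., 1., 9.])

def bds_step(optimizer):
    # 从后验采样安全权重 s_i ∈ (0,1),抑制疑似有害样本的梯度贡献
    s = torch.distributions.Beta(alpha, beta).sample()
    logits = model(x)
    per_sample = F.cross_entropy(logits, y, reduction="none")
    loss = (s * per_sample).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

opt = torch.optim.SGD(model.parameters(), lr=0.1)
for t in range(3):
    print("weighted loss:", bds_step(opt))
```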
zh

[AI-41] MARIA: A Framework for Marginal Risk Assessment without Ground Truth in AI Systems

【速读】:该论文旨在解决在将人工智能(AI)系统部署以替代现有流程之前,如何有效评估其相对于原有系统的性能与风险问题。传统评估方法依赖于真实标签(ground truth),但在许多实际场景中,由于结果延迟、不可知性、高成本或数据不完整等原因,难以获得可靠的真实标签,尤其对于长期运行且被认定为安全的既有系统更为突出。论文提出的解决方案核心在于构建一种边际风险评估框架(marginal risk assessment framework),其关键创新在于摒弃对绝对风险和真实标签的依赖,转而采用相对评估策略,聚焦于三种类型的比较方法:可预测性(predictability)、能力(capability)和交互主导性(interaction dominance)。这一转变使开发团队能够明确识别AI系统在哪些方面提升效果、在哪些环节引入新风险,并据此制定负责任的采纳策略。

链接: https://arxiv.org/abs/2510.27163
作者: Jieshan Chen,Suyu Ma,Qinghua Lu,Sung Une Lee,Liming Zhu
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 9 pages, 1 figure

点击查看摘要

Abstract:Before deploying an AI system to replace an existing process, it must be compared with the incumbent to ensure improvement without added risk. Traditional evaluation relies on ground truth for both systems, but this is often unavailable due to delayed or unknowable outcomes, high costs, or incomplete data, especially for long-standing systems deemed safe by convention. The more practical solution is not to compute absolute risk but the difference between systems. We therefore propose a marginal risk assessment framework, that avoids dependence on ground truth or absolute risk. It emphasizes three kinds of relative evaluation methodology, including predictability, capability and interaction dominance. By shifting focus from absolute to relative evaluation, our approach equips software teams with actionable guidance: identifying where AI enhances outcomes, where it introduces new risks, and how to adopt such systems responsibly.
zh

[AI-42] Exploring Landscapes for Better Minima along Valleys NEURIPS2025

【速读】:该论文旨在解决深度学习中优化器在达到局部最小值后停止搜索参数空间的问题,从而难以保证找到全局最低或具有最佳泛化性能的解。其核心解决方案是提出一种适用于梯度优化算法的适配器“E”,该适配器使优化过程在抵达局部最小值后仍能沿损失曲面中的低谷区域(即损失接近且平坦的区域)持续探索,以寻找潜在更低且更平坦的局部最小值——这类极小值通常与更好的泛化能力相关联。该方法通过增强对损失景观的探索能力,提升了优化器在大批次训练等挑战性场景下的表现,实验证明改进后的Lamb优化器(ALTO)平均提升测试准确率2.5%。

链接: https://arxiv.org/abs/2510.27153
作者: Tong Zhao,Jiacheng Li,Yuanchang Zhou,Guangming Tan,Weile Jia
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注: Neurips 2025 poster

点击查看摘要

Abstract:Finding lower and better-generalizing minima is crucial for deep learning. However, most existing optimizers stop searching the parameter space once they reach a local minimum. Given the complex geometric properties of the loss landscape, it is difficult to guarantee that such a point is the lowest or provides the best generalization. To address this, we propose an adaptor “E” for gradient-based optimizers. The adapted optimizer tends to continue exploring along landscape valleys (areas with low and nearly identical losses) in order to search for potentially better local minima even after reaching a local minimum. This approach increases the likelihood of finding a lower and flatter local minimum, which is often associated with better generalization. We also provide a proof of convergence for the adapted optimizers in both convex and non-convex scenarios for completeness. Finally, we demonstrate their effectiveness in an important but notoriously difficult training scenario, large-batch training, where Lamb is the benchmark optimizer. Our testing results show that the adapted Lamb, ALTO, increases the test accuracy (generalization) of the current state-of-the-art optimizer by an average of 2.5% across a variety of large-batch training tasks. This work potentially opens a new research direction in the design of optimization algorithms.
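“E”适配器的思想是:到达局部极小后,沿“低且近乎等损失”的谷底继续探索。下面在一个二维玩具损失上给出粗略示意:先用梯度下降收敛,再反复提出小扰动,若损失几乎不升高则接受,从而沿谷底移动并记录更低的极小值。阈值、步长与接受规则均为本文假设的简化,并非论文算法本身:

```python
import numpy as np

rng = np.random.default_rng(1)

def loss(p):
    # 一条弯曲的“山谷”:谷底上损失几乎相同,但一端更低
    x, y = p
    return (y - 0.1 * x**2) ** 2 + 0.001 * (x - 3.0) ** 2

def grad(p, eps=1e-5):
    g = np.zeros(2)
    for i in range(2):
        d = np.zeros(2); d[i] = eps
        g[i] = (loss(p + d) - loss(p - d)) / (2 * eps)
    return g

p = np.array([-2.0, 1.0])
for _ in range(2000):                  # 常规梯度下降:停在某个局部极小附近
    p -= 0.05 * grad(p)

best_p, best_loss = p.copy(), loss(p)
tol = 1e-4                             # “近乎等损失”的容忍度(假设值)
for _ in range(5000):                  # 谷底探索:接受不显著变差的扰动
    cand = best_p + rng.normal(0.0, 0.1, size=2)
    if loss(cand) <= best_loss + tol:
        best_p = cand
        best_loss = min(best_loss, loss(cand))

print("explored minimum:", best_p, "loss:", best_loss)
```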
zh

[AI-43] AURA: A Reinforcement Learning Framework for AI-Driven Adaptive Conversational Surveys

【速读】:该论文旨在解决传统在线问卷调查个性化程度低、用户参与度不足以及响应质量较差的问题,同时克服现有AI对话式问卷(survey chatbots)多为被动响应、无法在单次会话中动态适应个体差异的局限。其解决方案的关键在于提出AURA(Adaptive Understanding through Reinforcement Learning for Assessment)框架,该框架基于强化学习机制,通过一个四维LSDE指标(长度Length、自我披露Self-disclosure、情感Emotion、具体性Specificity)量化响应质量,并采用ε-greedy策略在每轮对话中实时更新预期质量增益,从而动态调整后续问题类型。系统初始化时利用96个先前校园氛围对话数据(共467次交互)提取先验知识,在10–15轮对话内平衡探索与利用,实现对个体用户的实时自适应调整,显著提升了响应质量和互动深度。

链接: https://arxiv.org/abs/2510.27126
作者: Jinwen Tang,Yi Shang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Conventional online surveys provide limited personalization, often resulting in low engagement and superficial responses. Although AI survey chatbots improve convenience, most are still reactive: they rely on fixed dialogue trees or static prompt templates and therefore cannot adapt within a session to fit individual users, which leads to generic follow-ups and weak response quality. We address these limitations with AURA (Adaptive Understanding through Reinforcement Learning for Assessment), a reinforcement learning framework for AI-driven adaptive conversational surveys. AURA quantifies response quality using a four-dimensional LSDE metric (Length, Self-disclosure, Emotion, and Specificity) and selects follow-up question types via an epsilon-greedy policy that updates the expected quality gain within each session. Initialized with priors extracted from 96 prior campus-climate conversations (467 total chatbot-user exchanges), the system balances exploration and exploitation across 10-15 dialogue exchanges, dynamically adapting to individual participants in real time. In controlled evaluations, AURA achieved a +0.12 mean gain in response quality and a statistically significant improvement over non-adaptive baselines (p=0.044, d=0.66), driven by a 63% reduction in specification prompts and a 10x increase in validation behavior. These results demonstrate that reinforcement learning can give survey chatbots improved adaptivity, transforming static questionnaires into interactive, self-improving assessment systems.
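AURA 的决策循环可以压缩为:用 LSDE 四维指标给用户回复打分,再用 ε-greedy 在若干追问类型中选择,并在会话内在线更新各类型的期望质量增益。以下为示意草图;打分函数做了极度简化(真实系统会用情感与具体性模型),动作集合与先验均为假设:

```python
import random

random.seed(0)
ACTIONS = ["elaborate", "emotion_probe", "example_request", "validation"]

def lsde_score(text: str) -> float:
    length = min(len(text.split()) / 50.0, 1.0)                   # L: 长度
    self_disc = 1.0 if " i " in f" {text.lower()} " else 0.0      # S: 自我披露(粗略)
    emotion = any(w in text.lower() for w in ("feel", "happy", "sad"))
    specificity = any(ch.isdigit() for ch in text)                # E/D 的粗略代理
    return (length + self_disc + float(emotion) + float(specificity)) / 4.0

q = {a: 0.5 for a in ACTIONS}       # 各追问类型的期望质量增益(先验,假设值)
n = {a: 1 for a in ACTIONS}
eps = 0.2

def choose() -> str:
    if random.random() < eps:
        return random.choice(ACTIONS)   # 探索
    return max(q, key=q.get)            # 利用

def update(action: str, prev: str, new: str) -> None:
    gain = lsde_score(new) - lsde_score(prev)     # 本轮质量增益
    n[action] += 1
    q[action] += (gain - q[action]) / n[action]   # 增量均值更新

a = choose()
update(a, "fine.", "I feel the dorms improved a lot since 2023.")
print(a, q)
```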
zh

[AI-44] Expressive Range Characterization of Open Text-to-Audio Models AAAI

【速读】:该论文旨在解决生成式音频模型(尤其是文本到音频模型)在输出内容上的可变性与保真度不明确的问题,即缺乏对这类模型生成能力的量化评估方法。其核心挑战在于音频作为高度多样化的输出类别,难以系统分析其表达范围(expressive range)。为此,作者将程序化内容生成(Procedural Content Generation, PCG)领域中用于评估关卡生成器的表达范围分析(Expressive Range Analysis, ERA)方法进行适配,提出一种基于固定提示词的ERA框架来评估文本到音频模型的输出空间。关键解决方案是:通过使用来自Environmental Sound Classification (ESC-50) 数据集的标准提示词,对生成音频沿关键声学维度(如音高、响度和音色)进行量化分析,从而实现对生成音频模型表达能力的可计算、可比较的探索性评估。

链接: https://arxiv.org/abs/2510.27102
作者: Jonathan Morse,Azadeh Naderi,Swen Gaudl,Mark Cartwright,Amy K. Hoover,Mark J. Nelson
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Accepted at the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE 2025)

点击查看摘要

Abstract:Text-to-audio models are a type of generative model that produces audio output in response to a given textual prompt. Although level generators and the properties of the functional content that they create (e.g., playability) dominate most discourse in procedurally generated content (PCG), games that emotionally resonate with players tend to weave together a range of creative and multimodal content (e.g., music, sounds, visuals, narrative tone), and multimodal models have begun seeing at least experimental use for this purpose. However, it remains unclear what exactly such models generate, and with what degree of variability and fidelity: audio is an extremely broad class of output for a generative system to target. Within the PCG community, expressive range analysis (ERA) has been used as a quantitative way to characterize generators’ output space, especially for level generators. This paper adapts ERA to text-to-audio models, making the analysis tractable by looking at the expressive range of outputs for specific, fixed prompts. Experiments are conducted by prompting the models with several standardized prompts derived from the Environmental Sound Classification (ESC-50) dataset. The resulting audio is analyzed along key acoustic dimensions (e.g., pitch, loudness, and timbre). More broadly, this paper offers a framework for ERA-based exploratory evaluation of generative audio models.
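将 ERA 迁移到音频的关键一步,是对同一固定提示下的多条生成音频提取声学维度并考察其分布。下面用 librosa 示意特征提取(音高、响度、音色亮度)与表达范围直方图;文件路径为占位,直方图维度的选取只是一种示例做法,未必与论文一致:

```python
import numpy as np
import librosa

def acoustic_features(path: str) -> dict:
    y, sr = librosa.load(path, sr=None, mono=True)
    f0 = librosa.yin(y, fmin=50, fmax=2000, sr=sr)               # 基频(音高)
    rms = librosa.feature.rms(y=y)[0]                            # 响度的粗略代理
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)[0]  # 音色亮度
    return {
        "pitch_median": float(np.nanmedian(f0)),
        "loudness_mean": float(rms.mean()),
        "brightness_mean": float(centroid.mean()),
    }

# 同一提示(如 "dog barking")下若干生成样本的特征,构成表达范围的二维直方图
paths = [f"gen_{i:03d}.wav" for i in range(64)]                  # 占位文件名
feats = [acoustic_features(p) for p in paths]
pitch = [f["pitch_median"] for f in feats]
bright = [f["brightness_mean"] for f in feats]
hist, _, _ = np.histogram2d(pitch, bright, bins=10)
print("expressive-range histogram:\n", hist)
```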
zh

[AI-45] CombiGraph-Vis: A Curated Multimodal Olympiad Benchmark for Discrete Mathematical Reasoning

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在自动评分数学证明题时存在的准确性与公平性问题,特别是如何有效检测错误、判断错误严重程度并合理分配部分分数(partial credit),而不仅限于简单的正确/错误二分类。其关键解决方案是引入代理式(agentic)工作流,通过自动提取和分析参考解法来生成针对每道题目的特定评分细则(rubric),进而实现多步骤的结构化评分过程,从而显著提升模型评分结果与人类评分的一致性,并改善对部分得分的稳定处理能力。

链接: https://arxiv.org/abs/2510.27094
作者: Hamed Mahdavi(Pennsylvania State University),Pouria Mahdavinia(Pennsylvania State University),Alireza Farhadi(Amirkabir University of Technology),Pegah Mohammadipour(Pennsylvania State University),Samira Malek(Pennsylvania State University),Majid Daliri(New York University),Pedram Mohammadipour(Amirkabir University of Technology),Alireza Hashemi(City University of New York),Amir Khasahmadi(Autodesk),Vasant Honavar(Pennsylvania State University)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Code/data: this https URL , this https URL

点击查看摘要

Abstract:State-of-the-art (SOTA) LLMs have progressed from struggling on proof-based Olympiad problems to solving most of the IMO 2025 problems, with leading systems reportedly handling 5 of 6 problems. Given this progress, we assess how well these models can grade proofs: detecting errors, judging their severity, and assigning fair scores beyond binary correctness. We study proof-analysis capabilities using a corpus of 90 Gemini 2.5 Pro-generated solutions that we grade on a 1-4 scale with detailed error annotations, and on MathArena solution sets for IMO/USAMO 2025 scored on a 0-7 scale. Our analysis shows that models can reliably flag incorrect (including subtly incorrect) solutions but exhibit calibration gaps in how partial credit is assigned. To address this, we introduce agentic workflows that extract and analyze reference solutions and automatically derive problem-specific rubrics for a multi-step grading process. We instantiate and compare different design choices for the grading workflows, and evaluate their trade-offs. Across our annotated corpus and MathArena, our proposed workflows achieve higher agreement with human grades and more consistent handling of partial credit across metrics. We release all code, data, and prompts/logs to facilitate future research.
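摘要中的代理式评分流程可以概括为“先由参考解法派生评分细则,再按细则多步打分”。以下是流程骨架;其中 `call_llm` 是本文假设的占位函数,提示词也只是示意,并非论文发布的实现:

```python
def call_llm(prompt: str) -> str:
    """占位:接入任意 LLM API 的函数,此处为假设。"""
    raise NotImplementedError

def derive_rubric(problem: str, reference_solution: str) -> str:
    # 第一步:从参考解法自动派生题目特定的评分细则
    return call_llm(
        "Given this olympiad problem and a reference solution, produce a "
        "grading rubric that allocates 7 points across key steps.\n\n"
        f"Problem:\n{problem}\n\nReference solution:\n{reference_solution}"
    )

def grade(problem: str, rubric: str, candidate: str) -> str:
    # 第二步:严格对照细则逐项核查,再汇总部分分数
    return call_llm(
        "Grade the candidate proof strictly against the rubric. For each "
        "rubric item, quote the relevant step, judge it, then output a "
        "total score 0-7 with justification.\n\n"
        f"Problem:\n{problem}\n\nRubric:\n{rubric}\n\nCandidate:\n{candidate}"
    )

# 用法:rubric = derive_rubric(problem, ref); report = grade(problem, rubric, sol)
```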
zh

[AI-46] QiNN-QJ: A Quantum-inspired Neural Network with Quantum Jump for Multimodal Sentiment Analysis

【速读】:该论文旨在解决现有量子启发融合模型在多模态学习中因仅依赖酉或类酉变换生成量子纠缠而导致的训练不稳定与泛化能力有限的问题。其解决方案的关键在于提出一种量子启发神经网络(Quantum-inspired Neural Network with Quantum Jump, QiNN-QJ),通过引入模拟量子跳跃(Quantum Jump, QJ)算子的可微模块,将各模态编码为量子纯态后,从可分离的乘积态转化为纠缠表示;同时联合学习哈密顿量(Hamiltonian)和林德布拉德(Lindblad)算子,利用耗散动力学实现可控的跨模态纠缠建模,其中结构化的随机性和稳态吸引子特性有效稳定训练过程并约束纠缠结构,最终通过可训练测量向量投影输出预测结果。

链接: https://arxiv.org/abs/2510.27091
作者: Yiwei Chen,Kehuan Yan,Yu Pan,Daoyi Dong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantum Physics (quant-ph)
备注:

点击查看摘要

Abstract:Quantum theory provides non-classical principles, such as superposition and entanglement, that inspire promising paradigms in machine learning. However, most existing quantum-inspired fusion models rely solely on unitary or unitary-like transformations to generate quantum entanglement. While theoretically expressive, such approaches often suffer from training instability and limited generalizability. In this work, we propose a Quantum-inspired Neural Network with Quantum Jump (QiNN-QJ) for multimodal entanglement modelling. Each modality is firstly encoded as a quantum pure state, after which a differentiable module simulating the QJ operator transforms the separable product state into the entangled representation. By jointly learning Hamiltonian and Lindblad operators, QiNN-QJ generates controllable cross-modal entanglement among modalities with dissipative dynamics, where structured stochasticity and steady-state attractor properties serve to stabilize training and constrain entanglement shaping. The resulting entangled states are projected onto trainable measurement vectors to produce predictions. In addition to achieving superior performance over the state-of-the-art models on benchmark datasets, including CMU-MOSI, CMU-MOSEI, and CH-SIMS, QiNN-QJ facilitates enhanced post-hoc interpretability through von Neumann entanglement entropy. This work establishes a principled framework for entangled multimodal fusion and paves the way for quantum-inspired approaches in modelling complex cross-modal correlations.
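摘要中“联合学习 Hamiltonian 与 Lindblad 算子、以耗散动力学生成可控纠缠”,其背后通常对应开放量子系统的 Lindblad 主方程。此处按标准教科书形式给出以便理解;论文的具体可微参数化未必与之完全一致:

```latex
\frac{d\rho}{dt} = -\,\mathrm{i}\,[H,\rho]
  + \sum_{k}\Big( L_k \rho L_k^{\dagger}
  - \tfrac{1}{2}\big\{ L_k^{\dagger} L_k,\, \rho \big\} \Big)
```

其中 ρ 为多模态联合态的密度矩阵,H 为可学习的 Hamiltonian,L_k 为可学习的跳跃(Lindblad)算子;耗散项 LρL† 正是将可分离乘积态推向纠缠表示的通道。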
zh

[AI-47] Adapting Large Language Models to Emerging Cybersecurity using Retrieval Augmented Generation

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在网络安全应用中因推理过程不透明而导致的信任缺失问题,尤其是在需要领域特定安全知识的决策场景下;同时,面对快速演化的安全威胁,LLMs需具备历史事件记忆与新兴漏洞及攻击模式适应能力。解决方案的关键在于引入基于检索增强生成(Retrieval-Augmented Generation, RAG)的框架,通过整合外部数据集与优化的混合检索策略,提升LLMs在知识保留和时序推理方面的准确性与可靠性,从而增强其在网络安全任务中的适应性和可信度。

链接: https://arxiv.org/abs/2510.27080
作者: Arnabh Borah,Md Tanvirul Alam,Nidhi Rastogi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Security applications are increasingly relying on large language models (LLMs) for cyber threat detection; however, their opaque reasoning often limits trust, particularly in decisions that require domain-specific cybersecurity knowledge. Because security threats evolve rapidly, LLMs must not only recall historical incidents but also adapt to emerging vulnerabilities and attack patterns. Retrieval-Augmented Generation (RAG) has demonstrated effectiveness in general LLM applications, but its potential for cybersecurity remains underexplored. In this work, we introduce a RAG-based framework designed to contextualize cybersecurity data and enhance LLM accuracy in knowledge retention and temporal reasoning. Using external datasets and the Llama-3-8B-Instruct model, we evaluate baseline RAG, an optimized hybrid retrieval approach, and conduct a comparative analysis across multiple performance metrics. Our findings highlight the promise of hybrid retrieval in strengthening the adaptability and reliability of LLMs for cybersecurity tasks.
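“优化的混合检索”通常指将稀疏(BM25 类)与稠密(向量)检索的排序结果融合。下面给出一个与具体检索库无关的 Reciprocal Rank Fusion 草图;融合常数 k=60 是该方法的常见默认值,文档 ID 为示例,均非论文的确切配置:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """rankings: 每个检索器给出的文档 ID 排序列表(靠前者更相关)。"""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["cve-2024-001", "apt-report-7", "cwe-79"]      # BM25 排序(示例)
dense  = ["apt-report-7", "cwe-79", "mitre-t1059"]       # 向量检索排序(示例)
fused = reciprocal_rank_fusion([sparse, dense])
print(fused)   # 融合后的上下文候选,交给 LLM 做威胁分析
```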
zh

[AI-48] Consistency Training Helps Stop Sycophancy and Jailbreaks

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在事实性(factuality)和拒绝响应(refusal training)方面易受提示中无关线索干扰的问题,例如用户信念诱导(sycophancy)或通过特殊文本包装的越狱请求(jailbreaking)。其核心解决方案是提出一致性训练(consistency training),即通过自监督方式使模型对提示中的无关扰动保持行为不变——具体而言,在提示数据增强(如添加引导性问题或越狱文本)前后,模型应输出一致的响应。文中引入两种实现路径:一种基于外部输出的偏置增强一致性训练(Bias-augmented Consistency Training, BCT),另一种基于内部激活的激活一致性训练(Activation Consistency Training, ACT)。二者均能有效降低模型对无关提示线索的敏感性,且因使用模型自身生成的数据作为训练样本,避免了静态数据集带来的能力退化与过时规则强化问题,从而提升模型鲁棒性和对齐质量。

链接: https://arxiv.org/abs/2510.27062
作者: Alex Irpan,Alexander Matt Turner,Mark Kurzeja,David K. Elson,Rohin Shah
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages

点击查看摘要

Abstract:An LLM’s factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests which are wrapped within special text (jailbreaking). We explore consistency training, a self-supervised paradigm that teaches a model to be invariant to certain irrelevant cues in the prompt. Instead of teaching the model what exact response to give on a particular prompt, we aim to teach the model to behave identically across prompt data augmentations (like adding leading questions or jailbreak text). We try enforcing this invariance in two ways: over the model’s external outputs (Bias-augmented Consistency Training (BCT) from Chua et al. [2025]) and over its internal activations (Activation Consistency Training (ACT), a method we introduce). Both methods reduce Gemini 2.5 Flash’s susceptibility to irrelevant cues. Because consistency training uses responses from the model itself as training data, it avoids issues that arise from stale training data, such as degrading model capabilities or enforcing outdated response guidelines. While BCT and ACT reduce sycophancy equally well, BCT does better at jailbreak reduction. We think that BCT can simplify training pipelines by removing reliance on static datasets. We argue that some alignment problems are better viewed not in terms of optimal responses, but rather as consistency issues.
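BCT 的训练信号可以概括为:对同一问题的“干净提示”与“加偏提示”(如附加诱导性观点或越狱包装),要求模型输出分布一致。下面用 PyTorch 在 logits 层面给出一个 KL 一致性损失的示意;玩具模型与张量形状均为占位,与 Gemini 的实际训练细节不同:

```python
import torch
import torch.nn.functional as F

def bct_consistency_loss(model, clean_ids, biased_ids):
    """clean_ids / biased_ids: 同一问题两种提示的 token 张量 (batch, seq)。"""
    with torch.no_grad():
        target_logits = model(clean_ids)       # 干净提示下的行为作为目标
    biased_logits = model(biased_ids)
    # 让加偏提示下的下一词分布逼近干净提示下的分布(取最后一个位置)
    p_target = F.log_softmax(target_logits[:, -1, :], dim=-1)
    p_biased = F.log_softmax(biased_logits[:, -1, :], dim=-1)
    return F.kl_div(p_biased, p_target, log_target=True, reduction="batchmean")

# 玩具语言模型,仅用于验证形状与可运行性
class ToyLM(torch.nn.Module):
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)
    def forward(self, ids):
        return self.head(self.emb(ids))

lm = ToyLM()
loss = bct_consistency_loss(lm, torch.randint(0, 100, (2, 8)),
                                torch.randint(0, 100, (2, 12)))
print(float(loss))
```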
zh

[AI-49] Adaptive Data Flywheel: Applying MAPE Control Loops to AI Agent Improvement

【速读】:该论文旨在解决企业级生成式 AI(Generative AI)代理在实际部署中面临的持续适应性问题,包括保持准确性、降低延迟以及与用户需求对齐。其核心挑战在于如何通过系统化机制实现从真实使用场景中持续学习并改进模型性能。解决方案的关键在于构建一个基于 MAPE(Monitor-Analyze-Plan-Execute,监控-分析-计划-执行)驱动的数据飞轮(data flywheel)闭环系统,结合人工在环(Human-in-the-loop, HITL)反馈,针对检索增强生成(Retrieval-Augmented Generation, RAG)管道中的两类主要失败模式——路由错误(routing errors)和查询重述错误(query rephrasal errors)进行针对性优化。通过 NVIDIA NeMo 微服务对模型进行微调,实现了显著的性能提升:路由模块由 Llama 3.1 70B 替换为微调后的 8B 模型后,准确率提升至 96%,模型规模减少 10 倍,延迟下降 70%;查询重述模块经微调后准确率提升 3.7%,延迟降低 40%。该方法验证了数据飞轮机制在规模化企业环境中实现自进化 AI 代理的有效性,并提供了可复用的工程实践路径。

链接: https://arxiv.org/abs/2510.27051
作者: Aaditya Shukla,Sidney Knowles,Meenakshi Madugula,Dave Farris,Ryan Angilly,Santiago Pombo,Anbang Xu,Lu An,Abhinav Balasubramanian,Tan Yu,Jiaxiang Ren,Rama Akkiraju
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 20 pages, 5 figures, 5 tables. Presents MAPE-K control loop application to enterprise AI agent improvement with experimental validation on NVIDIA’s NVInfo AI system

点击查看摘要

Abstract:Enterprise AI agents must continuously adapt to maintain accuracy, reduce latency, and remain aligned with user needs. We present a practical implementation of a data flywheel in NVInfo AI, NVIDIA’s Mixture-of-Experts (MoE) Knowledge Assistant serving over 30,000 employees. By operationalizing a MAPE-driven data flywheel, we built a closed-loop system that systematically addresses failures in retrieval-augmented generation (RAG) pipelines and enables continuous learning. Over a 3-month post-deployment period, we monitored feedback and collected 495 negative samples. Analysis revealed two major failure modes: routing errors (5.25%) and query rephrasal errors (3.2%). Using NVIDIA NeMo microservices, we implemented targeted improvements through fine-tuning. For routing, we replaced a Llama 3.1 70B model with a fine-tuned 8B variant, achieving 96% accuracy, a 10x reduction in model size, and 70% latency improvement. For query rephrasal, fine-tuning yielded a 3.7% gain in accuracy and a 40% latency reduction. Our approach demonstrates how human-in-the-loop (HITL) feedback, when structured within a data flywheel, transforms enterprise AI agents into self-improving systems. Key learnings include approaches to ensure agent robustness despite limited user feedback, navigating privacy constraints, and executing staged rollouts in production. This work offers a repeatable blueprint for building robust, adaptive enterprise AI agents capable of learning from real-world usage at scale.
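MAPE 闭环落到代码上就是一个“监控-分析-计划-执行”的循环骨架。以下为示意草图,各阶段的数据结构、阈值与触发逻辑均为本文假设的占位,真实系统需对接反馈日志与微调服务:

```python
def mape_flywheel_iteration(feedback_log: list[dict]) -> None:
    # Monitor: 收集带负反馈的交互样本
    negatives = [f for f in feedback_log if f.get("thumbs") == "down"]
    # Analyze: 按失败模式归类(如路由错误 / 查询重述错误)
    by_mode: dict[str, list[dict]] = {}
    for f in negatives:
        by_mode.setdefault(f.get("failure_mode", "unknown"), []).append(f)
    # Plan: 为样本量足够的失败模式排期针对性微调(阈值为假设值)
    plans = [mode for mode, fs in by_mode.items() if len(fs) >= 50]
    # Execute: 触发微调与灰度发布(此处仅打印占位)
    for mode in plans:
        print(f"schedule fine-tuning job for failure mode: {mode}")

mape_flywheel_iteration([{"thumbs": "down", "failure_mode": "routing"}] * 60)
```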
zh

[AI-50] 1: Learning Adaptive Control of Reasoning Effort

【速读】:该论文旨在解决大语言模型在推理过程中如何动态、高效地分配计算资源(即“思考预算”)的问题,以实现用户对输出质量与延迟/成本之间权衡的精细控制。传统方法通常要求用户预先设定固定的token数量作为推理预算,但这需要事先了解问题难度,难以适应不同任务的需求。解决方案的关键在于提出一种自适应努力控制(Adaptive Effort Control)机制,其基于强化学习训练模型,在推理时根据当前查询的平均思维链(chain-of-thought)长度自动调整使用token的比例,从而无需数据集或训练阶段特定调参即可实现更优的成本-准确率权衡曲线。该方法允许用户通过连续的努力参数在推理时灵活调节资源投入,并且实验表明模型能按任务难度自动分配资源,在1.5B至32B参数规模下可实现约3倍的思维链长度压缩,同时保持或提升性能。

链接: https://arxiv.org/abs/2510.27042
作者: Michael Kleinman,Matthew Trager,Alessandro Achille,Wei Xia,Stefano Soatto
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Increasing the thinking budget of AI models can significantly improve accuracy, but not all questions warrant the same amount of reasoning. Users may prefer to allocate different amounts of reasoning effort depending on how they value output quality versus latency and cost. To leverage this tradeoff effectively, users need fine-grained control over the amount of thinking used for a particular query, but few approaches enable such control. Existing methods require users to specify the absolute number of desired tokens, but this requires knowing the difficulty of the problem beforehand to appropriately set the token budget for a query. To address these issues, we propose Adaptive Effort Control, a self-adaptive reinforcement learning method that trains models to use a user-specified fraction of tokens relative to the current average chain-of-thought length for each query. This approach eliminates dataset- and phase-specific tuning while producing better cost-accuracy tradeoff curves compared to standard methods. Users can dynamically adjust the cost-accuracy trade-off through a continuous effort parameter specified at inference time. We observe that the model automatically learns to allocate resources proportionally to the task difficulty and, across model scales ranging from 1.5B to 32B parameters, our approach enables approximately 3x reduction in chain-of-thought length while maintaining or improving performance relative to the base model used for RL training.
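自适应努力控制的奖励可以理解为:在答对的前提下,惩罚思维链长度超出“努力参数 × 当前平均链长”这一目标的程度。下面是按摘要思想写的奖励函数草图,惩罚系数与函数形状均为假设,并非论文的精确定义:

```python
def effort_reward(correct: bool, cot_len: int, avg_len: float,
                  effort: float, penalty: float = 0.5) -> float:
    """effort ∈ (0, 1]: 用户指定的相对 token 预算(相对当前平均链长)。"""
    target = effort * avg_len
    overshoot = max(cot_len - target, 0.0) / max(target, 1.0)
    return (1.0 if correct else 0.0) - penalty * overshoot

# 同一道题,目标长度为平均链长的 30%:
avg = 1000.0
print(effort_reward(True, 280, avg, effort=0.3))   # 达标:奖励为 1.0
print(effort_reward(True, 900, avg, effort=0.3))   # 超预算:被明显惩罚
```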
zh

[AI-51] Causal Masking on Spatial Data: An Information-Theoretic Case for Learning Spatial Datasets with Unimodal Language Models NEURIPS2025

【速读】:该论文旨在解决在具有空间或关系结构的数据上使用因果掩码(causal masking)是否可行的问题,尤其是在传统上认为应采用顺序线性化(sequential linearization)处理此类数据的情况下。其关键解决方案在于:通过在国际象棋这一同时支持空间表示(棋盘状态)和序列表示(走法序列)的领域中进行对比实验,发现即使在空间数据上应用因果掩码,训练出的语言模型仍能表现出更强的博弈能力,且优于基于序列数据训练的模型。这一结果表明,将因果掩码应用于空间数据是训练单模态大语言模型(unimodal LLMs)的一种可行方法,甚至在某些场景下优于传统的顺序化处理方式。

链接: https://arxiv.org/abs/2510.27009
作者: Jared Junkin,Samuel Nathanson
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 8 pages, NeurIPS 2025

点击查看摘要

Abstract:Language models are traditionally designed around causal masking. In domains with spatial or relational structure, causal masking is often viewed as inappropriate, and sequential linearizations are instead used. Yet the question of whether it is viable to accept the information loss introduced by causal masking on nonsequential data has received little direct study, in part because few domains offer both spatial and sequential representations of the same dataset. In this work, we investigate this issue in the domain of chess, which naturally supports both representations. We train language models with bidirectional and causal self-attention mechanisms on both spatial (board-based) and sequential (move-based) data. Our results show that models trained on spatial board states - even with causal masking - consistently achieve stronger playing strength than models trained on sequential data. While our experiments are conducted on chess, our results are methodological and may have broader implications: applying causal masking to spatial data is a viable procedure for training unimodal LLMs on spatial data, and in some domains is even preferable to sequentialization.
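两种注意力模式的差别只在掩码:把 8×8=64 个棋盘格按固定顺序线性化后,因果掩码只允许每个 token 关注其“之前”的格子,双向则全连接。下面用 PyTorch 构造两种掩码,示意即便在空间数据上也可直接套用因果掩码(顺序约定为示例假设):

```python
import torch

n_squares = 64  # 棋盘按固定顺序(如 a1..h8)线性化成 64 个 token

# 因果掩码:下三角可见;True 表示允许注意
causal_mask = torch.tril(torch.ones(n_squares, n_squares, dtype=torch.bool))
# 双向掩码:全部可见
bidirectional_mask = torch.ones(n_squares, n_squares, dtype=torch.bool)

# 喂给标准注意力时,被屏蔽位置加上 -inf 即可
def to_additive(mask: torch.Tensor) -> torch.Tensor:
    return torch.where(mask, 0.0, float("-inf"))

print(to_additive(causal_mask)[:3, :3])
# 信息损失的直观来源:token i 看不到 i 之后的格子;
# 但论文结果表明“空间(棋盘)表示 + 因果掩码”仍强于序列(走法)表示
```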
zh

[AI-52] Jasmine: A Simple Performant and Scalable JAX-based World Modeling Codebase

【速读】:该论文旨在解决世界模型(World Models)在机器人等领域中因数据稀缺而导致的训练效率低下的问题,同时应对当前开放训练基础设施薄弱的现状。其解决方案的关键在于提出一个高性能、可扩展的JAX代码库Jasmine,该代码库通过优化数据加载、训练流程和检查点机制,在单机到数百加速器的部署场景下实现最小代码改动即可高效扩展,并显著提升CoinRun案例研究的复现速度(快一个数量级),同时保证训练过程的完全可复现性并支持多种分片配置,从而为不同模型家族和架构变体提供严谨的基准测试基础设施。

链接: https://arxiv.org/abs/2510.27002
作者: Mihir Mahajan,Alfred Nguyen,Franz Srambical,Stefan Bauer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Blog post: this https URL

点击查看摘要

Abstract:While world models are increasingly positioned as a pathway to overcoming data scarcity in domains such as robotics, open training infrastructure for world modeling remains nascent. We introduce Jasmine, a performant JAX-based world modeling codebase that scales from single hosts to hundreds of accelerators with minimal code changes. Jasmine achieves an order-of-magnitude faster reproduction of the CoinRun case study compared to prior open implementations, enabled by performance optimizations across data loading, training and checkpointing. The codebase guarantees fully reproducible training and supports diverse sharding configurations. By pairing Jasmine with curated large-scale datasets, we establish infrastructure for rigorous benchmarking pipelines across model families and architectural ablations.
zh

[AI-53] A Framework for Fair Evaluation of Variance-Aware Bandit Algorithms

【速读】:该论文旨在解决多臂老虎机(Multi-armed Bandit, MAB)算法评估与比较中存在的缺乏标准化条件和可复现性问题,尤其关注方差感知型算法(variance-aware algorithms)相较于经典算法(如UCB)在不同环境下的性能差异。其解决方案的关键在于构建了一个可复现的系统性评估框架——Bandit Playground,该框架包含明确的实验设置、多种性能指标(包括奖励、累计遗憾、奖励分布、风险价值和动作最优性)以及交互式分析界面,从而实现了对八种经典与方差感知型MAB算法的透明、一致比较。研究发现,方差感知算法在高不确定性且各臂奖励差异细微的环境中表现更优,而经典算法在可分性较强或经精细调参后仍具竞争力。

链接: https://arxiv.org/abs/2510.27001
作者: Elise Wolf
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint of a paper presented at GI Skill 2025. The final version will appear in the conference proceedings

点击查看摘要

Abstract:Multi-armed bandit (MAB) problems serve as a fundamental building block for more complex reinforcement learning algorithms. However, evaluating and comparing MAB algorithms remains challenging due to the lack of standardized conditions and replicability. This is particularly problematic for variance-aware extensions of classical methods like UCB, whose performance can heavily depend on the underlying environment. In this study, we address how performance differences between bandit algorithms can be reliably observed, and under what conditions variance-aware algorithms outperform classical ones. We present a reproducible evaluation designed to systematically compare eight classical and variance-aware MAB algorithms. The evaluation framework, implemented in our Bandit Playground codebase, features clearly defined experimental setups, multiple performance metrics (reward, regret, reward distribution, value-at-risk, and action optimality), and an interactive evaluation interface that supports consistent and transparent analysis. We show that variance-aware algorithms can offer advantages in settings with high uncertainty where the difficulty arises from subtle differences between arm rewards. In contrast, classical algorithms often perform equally well or better in more separable scenarios or if fine-tuned extensively. Our contributions are twofold: (1) a framework for systematic evaluation of MAB algorithms, and (2) insights into the conditions under which variance-aware approaches outperform their classical counterparts.
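方差感知算法与经典 UCB 的差别,可用 UCB-V 的置信项直观展示:它用各臂的经验方差替代全局尺度。下面在一个“臂间差距细微、不确定性高”的伯努利环境中并排实现两者(以常见的 UCB1 与 UCB-V 公式为例,环境参数为示例设置,并非论文的全部八种算法):

```python
import numpy as np

rng = np.random.default_rng(7)
means = np.array([0.50, 0.52, 0.48])        # 臂间奖励差距细微的设置

def run(policy: str, horizon: int = 5000) -> float:
    n = np.zeros(3); s = np.zeros(3); sq = np.zeros(3)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= 3:
            a = t - 1                        # 先各拉一次
        else:
            mu = s / n
            if policy == "ucb1":
                bonus = np.sqrt(2 * np.log(t) / n)
            else:                            # UCB-V:方差感知置信项
                var = np.maximum(sq / n - mu**2, 0.0)
                bonus = np.sqrt(2 * var * np.log(t) / n) + 3 * np.log(t) / n
            a = int(np.argmax(mu + bonus))
        r = float(rng.random() < means[a])
        n[a] += 1; s[a] += r; sq[a] += r * r
        regret += means.max() - means[a]
    return regret

print("UCB1 regret :", run("ucb1"))
print("UCB-V regret:", run("ucbv"))
```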
zh

[AI-54] AIOT based Smart Education System: A Dual Layer Authentication and Context-Aware Tutoring Framework for Learning Environments

【速读】:该论文旨在解决当代课堂中长期存在的四大挑战:考勤作弊、教学缺乏个性化、学生参与度低以及资源利用效率低下。其解决方案的关键在于构建一个基于人工智能与物联网(AIoT)的统一智能教育平台,集成四个核心模块:(1) 基于RFID身份识别和WiFi验证的双因子认证系统,实现防作弊的 secure attendance;(2) AI驱动的助教模块,可基于教师提供的教学材料实时生成情境感知的支持与动态测验;(3) 自动化测试生成器,支持自适应评估并降低行政负担;(4) EcoSmart校园模块,通过IoT传感器与执行器自主调控教室照明、空气质量与温度。该系统通过上述模块协同作用,实现了高效、安全、个性化的教学环境,为未来教育创新提供了可扩展的技术范式。

链接: https://arxiv.org/abs/2510.26999
作者: Adithya Neelakantan,Pratik Satpute,Prerna Shinde,Tejas Manjunatha Devang
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The AIoT-Based Smart Education System integrates Artificial Intelligence and IoT to address persistent challenges in contemporary classrooms: attendance fraud, lack of personalization, student disengagement, and inefficient resource use. The unified platform combines four core modules: (1) a dual-factor authentication system leveraging RFID-based ID scans and WiFi verification for secure, fraud-resistant attendance; (2) an AI-powered assistant that provides real-time, context-aware support and dynamic quiz generation based on instructor-supplied materials; (3) automated test generators to streamline adaptive assessment and reduce administrative overhead; and (4) the EcoSmart Campus module, which autonomously regulates classroom lighting, air quality, and temperature using IoT sensors and actuators. Simulated evaluations demonstrate the system’s effectiveness in delivering robust real-time monitoring, fostering inclusive engagement, preventing fraudulent practices, and supporting operational scalability. Collectively, the AIoT-Based Smart Education System offers a secure, adaptive, and efficient learning environment, providing a scalable blueprint for future educational innovation and improved student outcomes through the synergistic application of artificial intelligence and IoT technologies.
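双因子考勤的核心逻辑很简单:RFID 刷卡记录必须与同一时间窗内该学生设备的 WiFi 在场记录相互印证。以下为示意函数,时间窗长度与数据结构均为本文假设,并非系统的实际实现:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=2)   # RFID 与 WiFi 证据允许的最大时间差(假设值)

def attendance_valid(rfid_scans: dict, wifi_presence: dict, student: str) -> bool:
    """两类记录均为 {student_id: [datetime, ...]};两因子需在时间窗内同时成立。"""
    for t_rfid in rfid_scans.get(student, []):
        for t_wifi in wifi_presence.get(student, []):
            if abs(t_rfid - t_wifi) <= WINDOW:
                return True     # 刷卡且本人设备确在教室网络内
    return False                # 仅有一种证据(如代刷卡)视为无效

now = datetime(2025, 11, 3, 9, 0)
rfid = {"s001": [now]}
wifi = {"s001": [now + timedelta(seconds=40)]}
print(attendance_valid(rfid, wifi, "s001"))   # True
print(attendance_valid(rfid, {}, "s001"))     # False:缺少 WiFi 佐证
```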
zh

[AI-55] SUSTAINABLE Platform: Seamless Smart Farming Integration Towards Agronomy Automation

【速读】:该论文旨在解决全球农业领域面临的三大核心挑战:日益增长的粮食需求、气候变化带来的不确定性以及对可持续农业生产方式的迫切需求。为应对这些问题,论文提出了一种名为SUSTAINABLE的智能 farming 平台,其解决方案的关键在于整合物联网(IoT)、人工智能(AI)、卫星遥感成像与基于角色的任务编排机制,从而实现高效、可追溯且可持续的农业生产。平台特别针对地中海地区葡萄园场景进行了试点验证,通过集成卫星指数、实时环境数据及角色感知的任务管理功能,提升了农业生产的智能化水平和资源利用效率。

链接: https://arxiv.org/abs/2510.26989
作者: Agorakis Bompotas,Konstantinos Koutras,Nikitas Rigas Kalogeropoulos,Panagiotis Kechagias,Dimitra Gariza,Athanasios P. Kalogeras,Christos Alexakos
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: Accepted for presentation to 11th IEEE International Smart Cities Conference (ISC2 2025)

点击查看摘要

Abstract:The global agricultural sector is undergoing a transformative shift, driven by increasing food demands, climate variability and the need for sustainable practices. SUSTAINABLE is a smart farming platform designed to integrate IoT, AI, satellite imaging, and role-based task orchestration to enable efficient, traceable, and sustainable agriculture with a pilot usecase in viticulture. This paper explores current smart agriculture solutions, presents a comparative evaluation, and introduces SUSTAINABLE’s key features, including satellite index integration, real-time environmental data, and role-aware task management tailored to Mediterranean vineyards.
zh

[AI-56] Fine-Grained Iterative Adversarial Attacks with Limited Computation Budget

【速读】:该论文旨在解决在计算资源受限条件下,如何最大化迭代式对抗攻击的有效性问题。其核心挑战在于,粗粒度减少攻击迭代次数虽能降低计算成本,但会显著削弱攻击强度。解决方案的关键在于提出一种细粒度控制机制,通过在迭代层面和层层面选择性地重新计算激活值(recompute layer activations),从而在固定计算预算内实现更高效的攻击策略。实验表明,该方法在同等计算成本下优于现有基线,并在对抗训练中仅需原预算30%即可达到相当性能。

链接: https://arxiv.org/abs/2510.26981
作者: Zhichao Hou,Weizhi Gao,Xiaorui Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This work tackles a critical challenge in AI safety research under limited compute: given a fixed computation budget, how can one maximize the strength of iterative adversarial attacks? Coarsely reducing the number of attack iterations lowers cost but substantially weakens effectiveness. To fulfill the attainable attack efficacy within a constrained budget, we propose a fine-grained control mechanism that selectively recomputes layer activations across both iteration-wise and layer-wise levels. Extensive experiments show that our method consistently outperforms existing baselines at equal cost. Moreover, when integrated into adversarial training, it attains comparable performance with only 30% of the original budget.
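“在迭代粒度上选择性重算”的一个简化版本是梯度复用:每隔若干次攻击迭代才做一次完整前反向,其间复用缓存的梯度方向,几乎零额外计算。下面的 PyTorch 草图按该思想写了一个近似 PGD 循环;重算周期等超参为假设,且这只是迭代粒度的示意,论文还包含更细的层粒度控制:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 10))
x = torch.randn(4, 32); y = torch.randint(0, 10, (4,))

eps, alpha, iters, refresh_every = 0.3, 0.05, 20, 4
delta = torch.zeros_like(x)
grad_cache = torch.zeros_like(x)

for t in range(iters):
    if t % refresh_every == 0:
        # 完整重算:前向 + 反向,刷新梯度缓存(高开销步)
        delta_req = delta.clone().requires_grad_(True)
        loss = F.cross_entropy(model(x + delta_req), y)
        grad_cache = torch.autograd.grad(loss, delta_req)[0]
    # 其余迭代复用缓存梯度方向(低开销步)
    delta = (delta + alpha * grad_cache.sign()).clamp(-eps, eps)

print("final perturbation max-norm:", delta.abs().max().item())
```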
zh

[AI-57] Can machines think efficiently?

【速读】:该论文试图解决传统图灵测试(Turing Test)在当前先进人工智能系统面前已不再充分的问题,即无法有效区分人类与机器智能,并且忽视了人工智能带来的伦理和环境成本。其解决方案的关键在于对原始模仿游戏进行扩展,引入能量消耗作为新的约束条件,从而将智能评估从抽象的思维能力转向以效率为核心的资源利用视角,使智能评价具备可测量、可落地的量化标准,同时促使社会权衡人工智能带来的时间效益与其总体资源代价。

链接: https://arxiv.org/abs/2510.26954
作者: Adam Winchell
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:The Turing Test is no longer adequate for distinguishing human and machine intelligence. With advanced artificial intelligence systems already passing the original Turing Test and contributing to serious ethical and environmental concerns, we urgently need to update the test. This work expands upon the original imitation game by accounting for an additional factor: the energy spent answering the questions. By adding the constraint of energy, the new test forces us to evaluate intelligence through the lens of efficiency, connecting the abstract problem of thinking to the concrete reality of finite resources. Further, this proposed new test ensures the evaluation of intelligence has a measurable, practical finish line that the original test lacks. This additional constraint compels society to weigh the time savings of using artificial intelligence against its total resource cost.
zh

[AI-58] LLM -based Multi-class Attack Analysis and Mitigation Framework in IoT/IIoT Networks

【速读】:该论文旨在解决当前基于人工智能(AI)的物联网(IoT)安全防护评估缺乏标准化、客观量化基准的问题,从而难以一致地衡量模型在攻击检测与缓解方面的有效性。其解决方案的关键在于提出一个混合框架:首先利用机器学习(ML)方法实现多类攻击的精准检测(其中随机森林表现最优),随后结合大语言模型(LLM)与检索增强生成(RAG)技术进行攻击行为分析及缓解建议生成;并通过设计新型定量评估指标和部署由多个权威LLM组成的评判集成系统,对生成内容进行独立、客观的质量评估,显著提升了AI驱动的IoT安全分析的可量化性和可信度。

链接: https://arxiv.org/abs/2510.26941
作者: Seif Ikbarieh,Maanak Gupta,Elmahedi Mahalal
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Internet of Things has expanded rapidly, transforming communication and operations across industries but also increasing the attack surface and security breaches. Artificial Intelligence plays a key role in securing IoT, enabling attack detection, attack behavior analysis, and mitigation suggestion. Despite advancements, evaluations remain purely qualitative, and the lack of a standardized, objective benchmark for quantitatively measuring AI-based attack analysis and mitigation hinders consistent assessment of model effectiveness. In this work, we propose a hybrid framework combining Machine Learning (ML) for multi-class attack detection with Large Language Models (LLMs) for attack behavior analysis and mitigation suggestion. After benchmarking several ML and Deep Learning (DL) classifiers on the Edge-IIoTset and CICIoT2023 datasets, we applied structured role-play prompt engineering with Retrieval-Augmented Generation (RAG) to guide ChatGPT-o3 and DeepSeek-R1 in producing detailed, context-aware responses. We introduce novel evaluation metrics for quantitative assessment to guide us and an ensemble of judge LLMs, namely ChatGPT-4o, DeepSeek-V3, Mixtral 8x7B Instruct, Gemini 2.5 Flash, Meta Llama 4, TII Falcon H1 34B Instruct, xAI Grok 3, and Claude 4 Sonnet, to independently evaluate the responses. Results show that Random Forest has the best detection model, and ChatGPT-o3 outperformed DeepSeek-R1 in attack analysis and mitigation.
zh

[AI-59] Mind the Gaps: Auditing and Reducing Group Inequity in Large-Scale Mobility Prediction

【速读】:该论文旨在解决移动预测模型中存在的群体公平性问题,即在训练数据中隐含的种族和族裔差异导致模型对不同用户群体的预测准确性存在系统性偏差。其核心解决方案是提出一种基于公平性的增量采样策略——Fairness-Guided Incremental Sampling (FGIS),关键在于通过Size-Aware K-Means (SAKM) 方法在潜在移动空间中对用户进行聚类,并强制满足人口普查数据提供的群体比例,从而生成代理种族标签(proxy racial labels)用于四个主要群体(亚裔、黑人、西班牙裔和白人)。在此基础上,FGIS根据预期性能提升和当前群体代表性优先选择样本,逐步构建训练集以缩小群体间性能差距,同时保持整体准确率。实验表明,该方法可在早期采样阶段显著降低群体间差异达40%,且对模型精度影响极小,尤其适用于低资源场景下的公平性改进。

链接: https://arxiv.org/abs/2510.26940
作者: Ashwin Kumar,Hanyu Zhang,David A. Schweidel,William Yeoh
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Next location prediction underpins a growing number of mobility, retail, and public-health applications, yet its societal impacts remain largely unexplored. In this paper, we audit state-of-the-art mobility prediction models trained on a large-scale dataset, highlighting hidden disparities based on user demographics. Drawing from aggregate census data, we compute the difference in predictive performance on racial and ethnic user groups and show a systematic disparity rooted in the underlying dataset, resulting in large differences in accuracy based on location and user groups. To address this, we propose Fairness-Guided Incremental Sampling (FGIS), a group-aware sampling strategy designed for incremental data collection settings. Because individual-level demographic labels are unavailable, we introduce Size-Aware K-Means (SAKM), a clustering method that partitions users in latent mobility space while enforcing census-derived group proportions. This yields proxy racial labels for the four largest groups in the state: Asian, Black, Hispanic, and White. Built on these labels, our sampling algorithm prioritizes users based on expected performance gains and current group representation. This method incrementally constructs training datasets that reduce demographic performance gaps while preserving overall accuracy. Our method reduces total disparity between groups by up to 40% with minimal accuracy trade-offs, as evaluated on a state-of-the-art MetaPath2Vec model and a transformer-encoder model. Improvements are most significant in early sampling stages, highlighting the potential for fairness-aware strategies to deliver meaningful gains even in low-resource settings. Our findings expose structural inequities in mobility prediction pipelines and demonstrate how lightweight, data-centric interventions can improve fairness with little added complexity, especially for low-data applications.
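SAKM 的关键是把 K-Means 的分配步改成“带群体配额的分配”:每个簇的容量由普查比例决定,配额用尽后点被派往次优簇。下面是一个贪心版分配步的示意;完整算法还需迭代更新质心,配额取整与处理顺序等细节为本文假设:

```python
import numpy as np

rng = np.random.default_rng(3)

def size_aware_assign(X, centroids, quotas):
    """按与质心的距离贪心分配,每簇容量受 quotas 限制。"""
    n, k = len(X), len(centroids)
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (n, k)
    order = np.argsort(dists.min(axis=1))        # 先处理“最确定”的点
    remaining = quotas.copy()
    labels = np.full(n, -1)
    for i in order:
        for c in np.argsort(dists[i]):           # 依距离尝试各簇
            if remaining[c] > 0:
                labels[i] = c
                remaining[c] -= 1
                break
    return labels

X = rng.normal(size=(100, 2))
centroids = rng.normal(size=(4, 2))
quotas = np.array([40, 30, 20, 10])              # 普查群体比例 × n
labels = size_aware_assign(X, centroids, quotas)
print(np.bincount(labels, minlength=4))          # 恰为 [40 30 20 10]
```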
zh

[AI-60] Heterogeneous Robot Collaboration in Unstructured Environments with Grounded Generative Intelligence

【速读】:该论文旨在解决异构机器人团队在非结构化环境(unstructured environments)中执行复杂任务时面临的挑战,即如何基于自然语言描述的任务目标与机器人能力进行有效协作与在线适应。当前基于大语言模型(Large Language Model, LLM)的团队协同方法通常假设环境为结构化且已知,难以部署于开放世界场景。其解决方案的关键在于提出SPINE-HT框架,通过三阶段流程将LLM的推理能力与机器人团队的实际能力及在线感知反馈相耦合:首先生成语义上合理的子任务并验证可行性;其次依据机器人能力(如移动性或感知能力)分配任务;最后结合运行中收集的实时反馈对子任务进行迭代优化,从而实现从语言指令到物理执行的闭环落地。

链接: https://arxiv.org/abs/2510.26915
作者: Zachary Ravichandran,Fernando Cladera,Ankit Prabhu,Jason Hughes,Varun Murali,Camillo Taylor,George J. Pappas,Vijay Kumar
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Heterogeneous robot teams operating in realistic settings often must accomplish complex missions requiring collaboration and adaptation to information acquired online. Because robot teams frequently operate in unstructured environments – uncertain, open-world settings without prior maps – subtasks must be grounded in robot capabilities and the physical world. While heterogeneous teams have typically been designed for fixed specifications, generative intelligence opens the possibility of teams that can accomplish a wide range of missions described in natural language. However, current large language model (LLM)-enabled teaming methods typically assume well-structured and known environments, limiting deployment in unstructured environments. We present SPINE-HT, a framework that addresses these limitations by grounding the reasoning abilities of LLMs in the context of a heterogeneous robot team through a three-stage process. Given language specifications describing mission goals and team capabilities, an LLM generates grounded subtasks which are validated for feasibility. Subtasks are then assigned to robots based on capabilities such as traversability or perception and refined given feedback collected during online operation. In simulation experiments with closed-loop perception and control, our framework achieves nearly twice the success rate compared to prior LLM-enabled heterogeneous teaming approaches. In real-world experiments with a Clearpath Jackal, a Clearpath Husky, a Boston Dynamics Spot, and a high-altitude UAV, our method achieves an 87% success rate in missions requiring reasoning about robot capabilities and refining subtasks with online feedback. More information is provided at this https URL.
zh

[AI-61] Cognition Envelopes for Bounded AI Reasoning in Autonomous UAS Operations

【速读】:该论文旨在解决基于基础模型(如大语言模型和视觉-语言模型)的网络物理系统(Cyber-Physical Systems, CPS)在提升自主性的同时引入的新类型错误问题,例如幻觉、过度泛化和上下文错位,这些错误会导致决策失误。解决方案的关键在于提出“认知边界”(Cognition Envelopes)的概念,通过建立推理边界来约束AI生成的决策,同时与元认知(meta-cognition)和传统安全边界(safety envelopes)协同工作,从而保障系统决策的可靠性与安全性。

链接: https://arxiv.org/abs/2510.26905
作者: Pedro Antonio Alarcón Granadeno,Arturo Miguel Bernal Russell,Sofia Nelson,Demetrius Hernandez,Maureen Petterson,Michael Murphy,Walter J. Scheirer,Jane Cleland-Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10.5 pages, 9 figures

点击查看摘要

Abstract:Cyber-physical systems increasingly rely on Foundational Models such as Large Language Models (LLMs) and Vision-Language Models (VLMs) to increase autonomy through enhanced perception, inference, and planning. However, these models also introduce new types of errors, such as hallucinations, overgeneralizations, and context misalignments, resulting in incorrect and flawed decisions. To address this, we introduce the concept of Cognition Envelopes, designed to establish reasoning boundaries that constrain AI-generated decisions while complementing the use of meta-cognition and traditional safety envelopes. As with safety envelopes, Cognition Envelopes require practical guidelines and systematic processes for their definition, validation, and assurance.
zh

[AI-62] How Similar Are Grokipedia and Wikipedia? A Multi-Dimensional Textual and Structural Comparison

【速读】:该论文试图解决的问题是:AI驱动的百科全书(如Grokipedia)是否能够规避人类编辑平台(如Wikipedia)中存在的偏见与局限性,从而提供更“真实”的知识内容。其解决方案的关键在于通过大规模计算比较方法,对382组匹配条目进行多维度分析,包括词汇丰富度、可读性、结构组织、引用密度和语义相似性等指标,系统评估Grokipedia与Wikipedia在形式与内容上的一致性与差异。研究发现,尽管Grokipedia在语义和风格上与Wikipedia高度一致,但在词汇多样性、引用密度及结构稳定性方面存在显著差异,表明AI生成内容虽覆盖相同信息范围,但遵循不同的编辑规范,倾向于叙事扩展而非基于引用的验证,揭示了自动化文本生成时代下知识治理的新挑战。

链接: https://arxiv.org/abs/2510.26899
作者: Taha Yasseri
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: 13 pages, 5 figures, 2 tables

点击查看摘要

Abstract:The launch of Grokipedia, an AI-generated encyclopedia developed by Elon Musk’s xAI, was presented as a response to perceived ideological and structural biases in Wikipedia, aiming to produce “truthful” entries via the large language model Grok. Yet whether an AI-driven alternative can escape the biases and limitations of human-edited platforms remains unclear. This study undertakes a large-scale computational comparison of 382 matched article pairs between Grokipedia and Wikipedia. Using metrics across lexical richness, readability, structural organization, reference density, and semantic similarity, we assess how closely the two platforms align in form and substance. The results show that while Grokipedia exhibits strong semantic and stylistic alignment with Wikipedia, it typically produces longer but less lexically diverse articles, with fewer references per word and more variable structural depth. These findings suggest that AI-generated encyclopedic content currently mirrors Wikipedia’s informational scope but diverges in editorial norms, favoring narrative expansion over citation-based verification. The implications highlight new tensions around transparency, provenance, and the governance of knowledge in an era of automated text generation.
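论文用到的几类指标都容易落地:词汇多样性可用 type-token ratio,引用密度可按每词引用数计,语义相似度最简可用词袋余弦。下面给出与任何特定库无关的简化实现;真实研究多用句向量模型与标准可读性公式,此处仅作量纲示意:

```python
import math
import re
from collections import Counter

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(text: str) -> float:
    ws = tokens(text)
    return len(set(ws)) / max(len(ws), 1)        # 词汇多样性

def cosine_similarity(a: str, b: str) -> float:  # 语义相似度的词袋近似
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb or 1.0)

def refs_per_word(text: str, n_refs: int) -> float:
    return n_refs / max(len(tokens(text)), 1)    # 引用密度

wiki = "The moon is Earth's only natural satellite, orbiting at 384,400 km."
grok = "Earth's sole natural satellite, the moon, orbits our planet."
print("TTR:", type_token_ratio(wiki), type_token_ratio(grok))
print("semantic sim:", round(cosine_similarity(wiki, grok), 3))
print("refs/word:", refs_per_word(wiki, n_refs=2))
```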
zh

[AI-63] BI-DCGAN: A Theoretically Grounded Bayesian Framework for Efficient and Diverse GANs

【速读】:该论文旨在解决生成式对抗网络(Generative Adversarial Networks, GANs)中存在的模式崩溃(mode collapse)问题,即生成器仅输出有限种类的样本,无法充分捕捉数据的真实分布,从而限制了其在需要高多样性和不确定性感知的实际场景中的应用。解决方案的关键在于提出BI-DCGAN,一种基于贝叶斯推断的深度卷积生成对抗网络(Bayesian Inference Deep Convolutional GAN),通过引入贝叶斯方法建模权重不确定性:采用贝叶斯反向传播(Bayes by Backprop)学习权重的后验分布,并利用均场变分推断(mean-field variational inference)高效逼近该分布,从而在保持训练效率的同时显著提升生成样本的多样性与鲁棒性。论文首次基于协方差矩阵分析提供了理论证明,表明贝叶斯建模可增强GAN的样本多样性,实验验证了BI-DCGAN优于传统DCGAN,在标准基准上实现更优性能。

链接: https://arxiv.org/abs/2510.26892
作者: Mahsa Valizadeh,Rui Tuo,James Caverlee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative Adversarial Networks (GANs) are proficient at generating synthetic data but continue to suffer from mode collapse, where the generator produces a narrow range of outputs that fool the discriminator but fail to capture the full data distribution. This limitation is particularly problematic, as generative models are increasingly deployed in real-world applications that demand both diversity and uncertainty awareness. In response, we introduce BI-DCGAN, a Bayesian extension of DCGAN that incorporates model uncertainty into the generative process while maintaining computational efficiency. BI-DCGAN integrates Bayes by Backprop to learn a distribution over network weights and employs mean-field variational inference to efficiently approximate the posterior distribution during GAN training. We establish the first theoretical proof, based on covariance matrix analysis, that Bayesian modeling enhances sample diversity in GANs. We validate this theoretical result through extensive experiments on standard generative benchmarks, demonstrating that BI-DCGAN produces more diverse and robust outputs than conventional DCGANs, while maintaining training efficiency. These findings position BI-DCGAN as a scalable and timely solution for applications where both diversity and uncertainty are critical, and where modern alternatives like diffusion models remain too resource-intensive.
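“学习权重分布”的基本构件是变分层:权重以 w = μ + σ·ε 重参数化采样,与先验的 KL 项并入损失,这正是 Bayes by Backprop 的做法。下面给出一个最小的变分线性层示意;先验取标准正态,初始化与 KL 权重为常见默认,并非论文的 DCGAN 配置:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(d_out, d_in) * 0.05)
        self.rho = nn.Parameter(torch.full((d_out, d_in), -4.0))  # σ=softplus(ρ)

    def forward(self, x):
        sigma = F.softplus(self.rho)
        eps = torch.randn_like(sigma)
        w = self.mu + sigma * eps          # 重参数化采样一份权重
        # 与标准正态先验的解析 KL:0.5(σ²+μ²−1) − log σ,训练时加权计入总损失
        self.kl = (0.5 * (sigma**2 + self.mu**2 - 1)
                   - torch.log(sigma)).sum()
        return F.linear(x, w)

layer = BayesLinear(16, 8)
out = layer(torch.randn(4, 16))
loss = out.pow(2).mean() + 1e-3 * layer.kl   # 示例:任务损失 + KL 正则
loss.backward()
print(out.shape, float(layer.kl))
```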
zh

[AI-64] Leverag ing Foundation Models for Enhancing Robot Perception and Action

【速读】:该论文旨在解决机器人在非结构化环境中实现更有效定位、交互与操作的问题,其核心挑战在于如何提升机器人对环境语义的理解能力。解决方案的关键在于系统性地利用基础模型(foundation models),构建一个以语义感知为核心的机器人智能框架,从而增强机器人在复杂场景下的自主决策与执行能力。

链接: https://arxiv.org/abs/2510.26855
作者: Reihaneh Mirjalili
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Doctoral thesis

点击查看摘要

Abstract:This thesis investigates how foundation models can be systematically leveraged to enhance robotic capabilities, enabling more effective localization, interaction, and manipulation in unstructured environments. The work is structured around four core lines of inquiry, each addressing a fundamental challenge in robotics while collectively contributing to a cohesive framework for semantics-aware robotic intelligence.
zh

[AI-65] Inverse Knowledge Search over Verifiable Reasoning : Synthesizing a Scientific Encyclopedia from a Long Chains-of-Thought Knowledge Base

【速读】:该论文旨在解决科学文献中推理链条被压缩的问题,即现有材料通常仅呈现结论而省略支撑结论的逻辑推导过程,这不仅阻碍了知识的可验证性,也限制了跨学科概念间的因果关联构建。解决方案的关键在于提出一个可扩展的框架,通过生成式 AI (Generative AI) 构建可验证的长链思维(Long Chain-of-Thought, LCoT)知识库,并将其投影为新兴百科全书 SciencePedia。其核心机制包括:基于约200门课程的课程体系驱动苏格拉底式代理生成约300万条第一性原理问题;利用多个独立求解模型生成LCoT并经提示净化与跨模型答案一致性过滤以确保高保真度;进而通过逆向知识搜索引擎 Brainstorm 检索目标概念的所有第一性原理推导路径,并由 Plato 合成器将这些验证过的推理链转化为结构化文章。此方法实现了从推理到知识表达的端到端可信构建,支持大规模、跨领域的科学合成。

链接: https://arxiv.org/abs/2510.26854
作者: Yu Li,Yuan Huang,Tao Wang,Caiyu Fan,Xiansheng Cai,Sihan Hu,Xinzijian Liu,Cheng Shi,Mingjun Xu,Zhen Wang,Yan Wang,Xiangqi Jin,Tianhan Zhang,Linfeng Zhang,Lei Wang,Youjin Deng,Pan Zhang,Weijie Sun,Xingyu Li,Weinan E,Linfeng Zhang,Zhiyuan Yao,Kun Chen
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 43 pages, 4 figures

点击查看摘要

Abstract:Most scientific materials compress reasoning, presenting conclusions while omitting the derivational chains that justify them. This compression hinders verification by lacking explicit, step-wise justifications and inhibits cross-domain links by collapsing the very pathways that establish the logical and causal connections between concepts. We introduce a scalable framework that decompresses scientific reasoning, constructing a verifiable Long Chain-of-Thought (LCoT) knowledge base and projecting it into an emergent encyclopedia, SciencePedia. Our pipeline operationalizes an endpoint-driven, reductionist strategy: a Socratic agent, guided by a curriculum of around 200 courses, generates approximately 3 million first-principles questions. To ensure high fidelity, multiple independent solver models generate LCoTs, which are then rigorously filtered by prompt sanitization and cross-model answer consensus, retaining only those with verifiable endpoints. This verified corpus powers the Brainstorm Search Engine, which performs inverse knowledge search – retrieving diverse, first-principles derivations that culminate in a target concept. This engine, in turn, feeds the Plato synthesizer, which narrates these verified chains into coherent articles. The initial SciencePedia comprises approximately 200,000 fine-grained entries spanning mathematics, physics, chemistry, biology, engineering, and computation. In evaluations across six disciplines, Plato-synthesized articles (conditioned on retrieved LCoTs) exhibit substantially higher knowledge-point density and significantly lower factual error rates than an equally-prompted baseline without retrieval (as judged by an external LLM). Built on this verifiable LCoT knowledge base, this reasoning-centric approach enables trustworthy, cross-domain scientific synthesis at scale and establishes the foundation for an ever-expanding encyclopedia.
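“跨模型答案一致性过滤”的骨架:多个求解模型独立产生带最终答案的长链推理,仅当归一化后的答案达到多数一致时才入库,且只保留指向共识端点的推理链。以下为示意草图,答案归一化规则与阈值均为假设:

```python
from collections import Counter

def normalize(ans: str) -> str:
    return ans.strip().lower().rstrip(".")

def consensus_filter(solver_outputs: list[tuple[str, str]],
                     min_agree: int = 3) -> tuple[str, list[str]] | None:
    """solver_outputs: [(long_chain_of_thought, final_answer), ...]"""
    votes = Counter(normalize(a) for _, a in solver_outputs)
    answer, count = votes.most_common(1)[0]
    if count < min_agree:
        return None                      # 端点无法验证:整条问题弃用
    kept = [cot for cot, a in solver_outputs if normalize(a) == answer]
    return answer, kept                  # 仅保留指向共识端点的推理链

outputs = [("...derivation A...", "9.8 m/s^2"),
           ("...derivation B...", "9.8 m/s^2 "),
           ("...derivation C...", "9.8 m/s^2"),
           ("...derivation D...", "3.1 m/s^2")]
print(consensus_filter(outputs))
```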
zh

[AI-66] CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLM s NEURIPS2025

【速读】:该论文旨在解决生成式 AI(Generative AI)在部署大语言模型(Large Language Models, LLMs)时推理加速效率不足的问题,特别是针对自适应推测解码(self-speculative decoding)方法在实际应用中难以达到理想加速比的瓶颈。其解决方案的关键在于提出了一种新型的级联自适应推测解码(Cascade Adaptive Self-Speculative Decoding, CAS-Spec)方法,通过动态切换推理加速策略(如层稀疏化和激活量化)构建可变的推测草稿模型,并设计了动态树级联算法(Dynamic Tree Cascade, DyTC),依据接受率和延迟预测的启发式规则自适应地路由多级草稿模型并分配推测长度。该方案显著提升了推理速度,在多个LLM和数据集上平均加速比达1.1×至2.3×,且相较基线方法进一步提升47%–48%的性能。

链接: https://arxiv.org/abs/2510.26843
作者: Zhiyuan Ning,Jiawei Shao,Ruge Xu,Xinfei Guo,Jun Zhang,Chi Zhang,Xuelong Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures, NeurIPS 2025 poster

点击查看摘要

Abstract:Speculative decoding has been widely adopted as an effective technique for lossless inference acceleration when deploying large language models (LLMs). While on-the-fly self-speculative methods offer seamless integration and broad utility, they often fall short of the speed gains achieved by methods relying on specialized training. Cascading a hierarchy of draft models promises further acceleration and flexibility, but the high cost of training multiple models has limited its practical application. In this paper, we propose a novel Cascade Adaptive Self-Speculative Decoding (CAS-Spec) method which constructs speculative draft models by leveraging dynamically switchable inference acceleration (DSIA) strategies, including layer sparsity and activation quantization. Furthermore, traditional vertical and horizontal cascade algorithms are inefficient when applied to self-speculative decoding methods. We introduce a Dynamic Tree Cascade (DyTC) algorithm that adaptively routes the multi-level draft models and assigns the draft lengths, based on the heuristics of acceptance rates and latency prediction. Our CAS-Spec method achieves state-of-the-art acceleration compared to existing on-the-fly speculative decoding methods, with an average speedup from 1.1× to 2.3× over autoregressive decoding across various LLMs and datasets. DyTC improves the average speedup by 47% and 48% over cascade-based baseline and tree-based baseline algorithms, respectively. CAS-Spec can be easily integrated into most existing LLMs and holds promising potential for further acceleration as self-speculative decoding techniques continue to evolve.
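各类推测解码的共同骨架是“草稿-验证-接受”循环:小模型先提议若干 token,大模型并行验证,按接受规则保留前缀。下面用两组给定的玩具分布演示标准的无损接受规则;这只示意底层机制,不涉及 CAS-Spec 的级联与 DyTC 调度:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 5  # 玩具词表大小

def sample(p):
    return int(rng.choice(V, p=p))

def speculative_step(p_draft, p_target, k=4):
    """p_draft/p_target: 给定前缀返回下一词分布的函数;返回本步接受的 token。"""
    proposals, accepted = [], []
    for _ in range(k):                      # 草稿模型串行提议 k 个 token
        q = p_draft(proposals)
        proposals.append(sample(q))
    for tok in proposals:                   # 目标模型(实际中并行)逐个验证
        q = p_draft(accepted)
        p = p_target(accepted)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)            # 按 min(1, p/q) 接受,保证无损分布
        else:
            residual = np.maximum(p - q, 0)
            residual /= residual.sum()
            accepted.append(sample(residual))  # 拒绝时从残差分布重采样并停止
            break
    return accepted

uniform = lambda ctx: np.full(V, 1 / V)                      # 草稿分布(示例)
peaky = lambda ctx: np.array([0.7, 0.1, 0.1, 0.05, 0.05])    # 目标分布(示例)
print(speculative_step(uniform, peaky))
```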
zh

[AI-67] Accurate Target Privacy Preserving Federated Learning Balancing Fairness and Utility

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中同时保障模型公平性与客户端敏感数据隐私的难题,即如何在不共享原始数据的前提下,实现跨不同人口统计群体的公平性能表现。其解决方案的关键在于提出一种差分隐私公平联邦学习算法(\textitFedPF),将多目标优化问题建模为零和博弈——其中公平性与隐私约束与模型效用相互竞争。理论分析揭示了一个出人意料的反向关系:更严格的隐私保护会显著削弱系统识别和纠正人口统计偏见的能力,从而在公平性与隐私之间形成固有张力;此外,研究发现适度的公平约束可先提升模型泛化能力,随后导致性能下降,呈现非单调的公平-效用关系,挑战了传统对二者权衡的认知。实验验证表明,该方法可在三个数据集上实现最高达42.9%的歧视减少,同时保持竞争力的准确率,但更重要的是,证明了隐私与公平目标无法完全兼顾,必须通过精细平衡才能实现协同优化。

链接: https://arxiv.org/abs/2510.26841
作者: Kangkang Sun,Jun Wu,Minyi Guo,Jianhua Li,Jianwei Huang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures, 30 conference

点击查看摘要

Abstract:Federated Learning (FL) enables collaborative model training without data sharing, yet participants face a fundamental challenge, e.g., simultaneously ensuring fairness across demographic groups while protecting sensitive client data. We introduce a differentially private fair FL algorithm (FedPF) that transforms this multi-objective optimization into a zero-sum game where fairness and privacy constraints compete against model utility. Our theoretical analysis reveals a surprising inverse relationship, i.e., stricter privacy protection fundamentally limits the system’s ability to detect and correct demographic biases, creating an inherent tension between privacy and fairness. Counterintuitively, we prove that moderate fairness constraints initially improve model generalization before causing performance degradation, a non-monotonic relationship that challenges conventional wisdom about fairness-utility tradeoffs. Experimental validation demonstrates up to 42.9% discrimination reduction across three datasets while maintaining competitive accuracy, but more importantly, reveals that the privacy-fairness tension is unavoidable, i.e., achieving both objectives simultaneously requires carefully balanced compromises rather than optimization of either in isolation. The source code for our proposed algorithm is publicly accessible at this https URL.

[AI-68] SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification

【Quick Read】: This paper addresses a bias in current Text-to-SQL evaluation: test-based evaluation tends to be optimistic, since a generated SQL query and the human-labeled ground truth can coincidentally produce the same execution result on a particular static test database while differing semantically or structurally. To counter this, the paper proposes a new evaluation pipeline called SpotIt, whose core is a formal bounded-equivalence verification engine that actively searches for database instances that differentiate the generated query from the ground truth, giving a more accurate judgment of true equivalence. The key innovation is extending existing verifiers to support a richer SQL subset and making the evaluation more rigorous and reliable through systematic verification.

Link: https://arxiv.org/abs/2510.26840
Authors: Rocky Klopfenstein, Yang He, Andrew Tremante, Yuepeng Wang, Nina Narodytska, Haoze Wu
Institution: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO)
Comments:

Click to view abstract

Abstract:Community-driven Text-to-SQL evaluation platforms play a pivotal role in tracking the state of the art of Text-to-SQL performance. The reliability of the evaluation process is critical for driving progress in the field. Current evaluation methods are largely test-based, which involves comparing the execution results of a generated SQL query and a human-labeled ground-truth on a static test database. Such an evaluation is optimistic, as two queries can coincidentally produce the same output on the test database while actually being different. In this work, we propose a new alternative evaluation pipeline, called SpotIt, where a formal bounded equivalence verification engine actively searches for a database that differentiates the generated and ground-truth SQL queries. We develop techniques to extend existing verifiers to support a richer SQL subset relevant to Text-to-SQL. A performance evaluation of ten Text-to-SQL methods on the high-profile BIRD dataset suggests that test-based methods can often overlook differences between the generated query and the ground-truth. Further analysis of the verification results reveals a more complex picture of the current Text-to-SQL evaluation.
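The core idea is easy to illustrate with a brute-force stand-in for the verifier: rather than comparing two queries on one fixed database, search small database instances for one that tells them apart. The sketch below does this with sqlite3 over a tiny hypothetical schema; the real SpotIt engine replaces the enumeration with formal bounded-equivalence verification.

```python
import itertools, sqlite3

Q_GEN = "SELECT name FROM emp WHERE salary > 50"    # generated query
Q_GT  = "SELECT name FROM emp WHERE salary >= 50"   # ground truth

def run(conn, q):
    return sorted(conn.execute(q).fetchall())

def find_differentiating_db(q1, q2, max_rows=2):
    """Enumerate tiny 'emp' tables; return the first one where results differ."""
    names, salaries = ["a", "b"], [49, 50, 51]
    for n in range(max_rows + 1):
        for rows in itertools.product(itertools.product(names, salaries), repeat=n):
            conn = sqlite3.connect(":memory:")
            conn.execute("CREATE TABLE emp (name TEXT, salary INT)")
            conn.executemany("INSERT INTO emp VALUES (?, ?)", rows)
            if run(conn, q1) != run(conn, q2):
                return rows              # witness that the queries differ
    return None                          # equivalent up to the bound

print(find_differentiating_db(Q_GEN, Q_GT))   # e.g. (('a', 50),)
```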

[AI-69] Category-Aware Semantic Caching for Heterogeneous LLM Workloads

【Quick Read】: This paper addresses inefficient caching in large language model (LLM) serving systems caused by heterogeneous query workloads. Query categories (e.g., code vs. conversational) differ markedly in embedding-space density, content staleness, and repetition patterns, so uniform cache policies fit poorly: long-tail categories (low repetition or high volatility) reach only 5-15% hit rates, below the economic break-even threshold (15-20%), leaving much traffic uncached and incurring high latency (a 30ms remote search cost). The key to the solution is category-aware semantic caching, which varies similarity thresholds, TTLs, and quotas by query category, together with a hybrid architecture that separates in-memory HNSW approximate nearest-neighbor search from external document storage, cutting the miss cost from 30ms to 2ms so that even low-hit-rate categories become economically viable (break-even at 3-5%) and the whole workload can be covered. Adaptive load-based policies further tune thresholds and TTLs dynamically, with theoretical projections of a 9-17% traffic reduction to overloaded downstream models.

Link: https://arxiv.org/abs/2510.26835
Authors: Chen Wang, Xunzhuo Liu, Yue Zhu, Alaa Youssef, Priya Nagpurkar, Huamin Chen
Institution: Unknown
Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 13 pages including reference, position paper

Click to view abstract

Abstract:LLM serving systems process heterogeneous query workloads where different categories exhibit different characteristics. Code queries cluster densely in embedding space while conversational queries distribute sparsely. Content staleness varies from minutes (stock data) to months (code patterns). Query repetition patterns range from power-law (code) to uniform (conversation), producing long tail cache hit rate distributions: high-repetition categories achieve 40-60% hit rates while low-repetition or volatile categories achieve 5-15% hit rates. Vector databases must exclude the long tail because remote search costs (30ms) require 15–20% hit rates to break even, leaving 20-30% of production traffic uncached. Uniform cache policies compound this problem: fixed thresholds cause false positives in dense spaces and miss valid paraphrases in sparse spaces; fixed TTLs waste memory or serve stale data. This paper presents category-aware semantic caching where similarity thresholds, TTLs, and quotas vary by query category. We present a hybrid architecture separating in-memory HNSW search from external document storage, reducing miss cost from 30ms to 2ms. This reduction makes low-hit-rate categories economically viable (break-even at 3-5% versus 15-20%), enabling cache coverage across the entire workload distribution. Adaptive load-based policies extend this framework to respond to downstream model load, dynamically adjusting thresholds and TTLs to reduce traffic to overloaded models by 9-17% in theoretical projections.
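A minimal sketch of the core data structure the abstract describes: cache policies (similarity threshold and TTL) keyed by query category. The categories, thresholds, and TTLs below are illustrative assumptions, not the paper's tuned values.

```python
import time
from dataclasses import dataclass

import numpy as np

@dataclass
class Policy:
    threshold: float   # cosine similarity required for a hit
    ttl: float         # seconds before a cached entry goes stale

POLICIES = {                                       # assumed example values
    "code":         Policy(0.80, 30 * 24 * 3600),  # dense cluster, slow-staling
    "conversation": Policy(0.92, 24 * 3600),       # sparse embedding space
    "stock":        Policy(0.95, 60.0),            # minutes-scale staleness
}

cache = []  # entries: (embedding, response, category, timestamp)

def lookup(query_emb, category):
    pol, now = POLICIES[category], time.time()
    for emb, resp, cat, ts in cache:
        if cat != category or now - ts > pol.ttl:
            continue                               # wrong category or stale
        sim = float(emb @ query_emb) / (np.linalg.norm(emb) * np.linalg.norm(query_emb))
        if sim >= pol.threshold:
            return resp                            # category-aware hit
    return None                                    # miss: query the LLM, then insert

cache.append((np.ones(4), "cached answer", "code", time.time()))
print(lookup(np.ones(4), "code"))                  # hit
print(lookup(np.ones(4), "stock"))                 # miss (different category)
```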

[AI-70] VISAT: Benchmarking Adversarial and Distribution Shift Robustness in Traffic Sign Recognition with Visual Attributes

【Quick Read】: This paper addresses the insufficient robustness of traffic sign recognition (TSR) models under adversarial attacks and distribution shifts. The key to the solution is VISAT, a novel open dataset and benchmarking suite built on the Mapillary Traffic Sign Dataset (MTSD) with two separate benchmarks: an adversarial-attack benchmark that uses the state-of-the-art Projected Gradient Descent (PGD) method to generate adversarial inputs and evaluate popular models, and a distribution-shift benchmark that applies ImageNet-C's realistic data corruptions and natural variations to test the robustness of both base and multi-task learning (MTL) models. The study further probes spurious correlations among MTL tasks through synthetic perturbations such as color quantization, revealing how visual attributes (color, shape, symbol, and text) contribute to model robustness. The work provides key evaluation tools and insights for developing more robust TSR models for autonomous driving and cyber-physical systems.

Link: https://arxiv.org/abs/2510.26833
Authors: Simon Yu, Peilin Yu, Hongbo Zheng, Huajie Shao, Han Zhao, Lui Sha
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We present VISAT, a novel open dataset and benchmarking suite for evaluating model robustness in the task of traffic sign recognition with the presence of visual attributes. Built upon the Mapillary Traffic Sign Dataset (MTSD), our dataset introduces two benchmarks that respectively emphasize robustness against adversarial attacks and distribution shifts. For our adversarial attack benchmark, we employ the state-of-the-art Projected Gradient Descent (PGD) method to generate adversarial inputs and evaluate their impact on popular models. Additionally, we investigate the effect of adversarial attacks on attribute-specific multi-task learning (MTL) networks, revealing spurious correlations among MTL tasks. The MTL networks leverage visual attributes (color, shape, symbol, and text) that we have created for each traffic sign in our dataset. For our distribution shift benchmark, we utilize ImageNet-C’s realistic data corruption and natural variation techniques to perform evaluations on the robustness of both base and MTL models. Moreover, we further explore spurious correlations among MTL tasks through synthetic alterations of traffic sign colors using color quantization techniques. Our experiments focus on two major backbones, ResNet-152 and ViT-B/32, and compare the performance between base and MTL models. The VISAT dataset and benchmarking framework contribute to the understanding of model robustness for traffic sign recognition, shedding light on the challenges posed by adversarial attacks and distribution shifts. We believe this work will facilitate advancements in developing more robust models for real-world applications in autonomous driving and cyber-physical systems.
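For reference, this is what the PGD attack in the adversarial benchmark does, shown in numpy on a toy multinomial logistic model so the gradient is explicit (the paper attacks ResNet-152 and ViT-B/32 instead; the model and hyperparameters here are assumptions).

```python
import numpy as np

def pgd_attack(x, y, W, b, eps=0.03, alpha=0.01, steps=10):
    """L-infinity PGD maximizing cross-entropy of logits = W @ x + b."""
    x_adv = x.copy()
    onehot = np.eye(len(b))[y]
    for _ in range(steps):
        logits = W @ x_adv + b
        p = np.exp(logits - logits.max())
        p /= p.sum()                              # softmax probabilities
        grad = W.T @ (p - onehot)                 # dLoss/dx for cross-entropy
        x_adv = x_adv + alpha * np.sign(grad)     # gradient-ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep a valid image
    return x_adv

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 8)), np.zeros(4)       # 4 classes, 8 pixel features
x, y = rng.uniform(0, 1, size=8), 2
print(np.abs(pgd_attack(x, y, W, b) - x).max())   # never exceeds eps
```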

[AI-71] LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature

【Quick Read】: This paper addresses the problem that knowledge of materials synthesis procedures is scattered across a vast literature in unstructured form and is hard to analyze systematically. The key to the solution is a multi-modal toolbox based on large language models (LLMs) and vision language models (VLMs) that automatically extracts and organizes synthesis procedures and performance data (from both text and figures) in materials science publications into the structured dataset LeMat-Synth (v1.0), which covers 35 synthesis methods and 16 material classes and is structured according to a materials-science-specific ontology; extraction quality is validated with expert annotations and a scalable LLM-as-a-judge framework. The work also open-sources a modular software library that supports community-driven extension to new corpora and synthesis domains, providing extensible machine-readable infrastructure for modeling synthesis-structure-property relationships and predicting synthesis procedures.

Link: https://arxiv.org/abs/2510.26824
Authors: Magdalena Lederbauer, Siddharth Betala, Xiyao Li, Ayush Jain, Amine Sehaba, Georgia Channing, Grégoire Germain, Anamaria Leonescu, Faris Flaifil, Alfonso Amayuelas, Alexandre Nozadze, Stefan P. Schmid, Mohd Zaki, Sudheesh Kumar Ethirajan, Elton Pan, Mathilde Franckel, Alexandre Duval, N. M. Anoop Krishnan, Samuel P. Gleason
Institution: Unknown
Subjects: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 29 pages, 13 figures, 6 tables

Click to view abstract

Abstract:The development of synthesis procedures remains a fundamental challenge in materials discovery, with procedural knowledge scattered across decades of scientific literature in unstructured formats that are challenging for systematic analysis. In this paper, we propose a multi-modal toolbox that employs large language models (LLMs) and vision language models (VLMs) to automatically extract and organize synthesis procedures and performance data from materials science publications, covering text and figures. We curated 81k open-access papers, yielding LeMat-Synth (v 1.0): a dataset containing synthesis procedures spanning 35 synthesis methods and 16 material classes, structured according to an ontology specific to materials science. The extraction quality is rigorously evaluated on a subset of 2.5k synthesis procedures through a combination of expert annotations and a scalable LLM-as-a-judge framework. Beyond the dataset, we release a modular, open-source software library designed to support community-driven extension to new corpora and synthesis domains. Altogether, this work provides an extensible infrastructure to transform unstructured literature into machine-readable information. This lays the groundwork for predictive modeling of synthesis procedures as well as modeling synthesis–structure–property relationships.

[AI-72] Cross-Corpus Validation of Speech Emotion Recognition in Urdu using Domain-Knowledge Acoustic Features

【Quick Read】: This paper addresses the largely unexplored problem of evaluating speech emotion recognition (SER) for low-resource languages such as Urdu in cross-corpus settings. The key to the solution is a cross-corpus evaluation framework that tests model generalization across three different Urdu emotional speech datasets, using two standard domain-knowledge acoustic feature sets (eGeMAPS and ComParE) with Logistic Regression and Multilayer Perceptron classifiers, and measuring performance with unweighted average recall (UAR) to account for class imbalance. Results show that self-corpus validation often overestimates performance (by up to 13% UAR), whereas cross-corpus evaluation reflects model robustness more faithfully, underscoring the importance of cross-corpus validation for Urdu SER and for affective computing research on other under-represented languages.

Link: https://arxiv.org/abs/2510.26823
Authors: Unzela Talpur, Zafi Sherhan Syed, Muhammad Shehram Shah Syed, Abbas Shah Syed
Institution: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Conference paper, 4 pages, including 3 figures and 3 tables

Click to view abstract

Abstract:Speech Emotion Recognition (SER) is a key affective computing technology that enables emotionally intelligent artificial intelligence. While SER is challenging in general, it is particularly difficult for low-resource languages such as Urdu. This study investigates Urdu SER in a cross-corpus setting, an area that has remained largely unexplored. We employ a cross-corpus evaluation framework across three different Urdu emotional speech datasets to test model generalization. Two standard domain-knowledge based acoustic feature sets, eGeMAPS and ComParE, are used to represent speech signals as feature vectors which are then passed to Logistic Regression and Multilayer Perceptron classifiers. Classification performance is assessed using unweighted average recall (UAR) whilst considering class-label imbalance. Results show that self-corpus validation often overestimates performance, with UAR exceeding cross-corpus evaluation by up to 13%, underscoring that cross-corpus evaluation offers a more realistic measure of model robustness. Overall, this work emphasizes the importance of cross-corpus validation for Urdu SER and its implications contribute to advancing affective computing research for underrepresented language communities.
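The evaluation protocol is compact enough to sketch directly: train on one corpus, test on every other corpus, and score with UAR, which is simply macro-averaged recall. The sklearn-based sketch below assumes features (e.g., eGeMAPS or ComParE vectors) have already been extracted into the X arrays.

```python
from itertools import permutations

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def cross_corpus_uar(corpora):
    """corpora: dict name -> (X, y). UAR for every ordered train/test pair."""
    scores = {}
    for train, test in permutations(corpora, 2):
        Xtr, ytr = corpora[train]
        Xte, yte = corpora[test]
        clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        clf.fit(Xtr, ytr)
        # UAR = unweighted (macro) average recall, robust to class imbalance
        scores[(train, test)] = recall_score(yte, clf.predict(Xte), average="macro")
    return scores
```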

[AI-73] GACA-DiT: Diffusion-based Dance-to-Music Generation with Genre-Adaptive Rhythm and Context-Aware Alignment ICASSP2026

【Quick Read】: This paper addresses the poor rhythmic consistency and imprecise temporal alignment in dance-to-music (D2M) generation caused by reliance on coarse rhythm embeddings (such as global motion features or binarized joint rhythm values), with temporal misalignment from feature downsampling further weakening dance-music synchronization. The key to the solution is GACA-DiT, a diffusion-transformer-based framework with two novel modules: a genre-adaptive rhythm extraction module that combines multi-scale temporal wavelet analysis and spatial phase histograms with adaptive joint weighting to capture fine-grained, genre-specific rhythm patterns; and a context-aware temporal alignment module that uses learnable context queries to precisely align music latents with the relevant dance rhythm features, yielding music with high rhythmic fidelity and temporal synchronization.

Link: https://arxiv.org/abs/2510.26818
Authors: Jinting Wang, Chenxing Li, Li Liu
Institution: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 3 figures, submitted to ICASSP 2026

Click to view abstract

Abstract:Dance-to-music (D2M) generation aims to automatically compose music that is rhythmically and temporally aligned with dance movements. Existing methods typically rely on coarse rhythm embeddings, such as global motion features or binarized joint-based rhythm values, which discard fine-grained motion cues and result in weak rhythmic alignment. Moreover, temporal mismatches introduced by feature downsampling further hinder precise synchronization between dance and music. To address these problems, we propose GACA-DiT, a diffusion transformer-based framework with two novel modules for rhythmically consistent and temporally aligned music generation. First, a genre-adaptive rhythm extraction module combines multi-scale temporal wavelet analysis and spatial phase histograms with adaptive joint weighting to capture fine-grained, genre-specific rhythm patterns. Second, a context-aware temporal alignment module resolves temporal mismatches using learnable context queries to align music latents with relevant dance rhythm features. Extensive experiments on the AIST++ and TikTok datasets demonstrate that GACA-DiT outperforms state-of-the-art methods in both objective metrics and human evaluation. Project page: this https URL.

[AI-74] VeriStruct: AI-assisted Automated Verification of Data-Structure Modules in Verus

【Quick Read】: This paper addresses how to extend AI-assisted automated formal verification from single functions to more complex program modules containing data structures. The core challenge is that LLMs often misunderstand Verus-specific annotation syntax and verification semantics, producing error-prone code. The key to the solution is the VeriStruct framework, which introduces a planner module to systematically generate abstractions, type invariants, specifications, and proof code, embeds syntax guidance in prompts, and adds an automatic repair stage to correct annotation errors, thereby raising verification success substantially. In experiments on eleven Rust data-structure modules, VeriStruct verified 128 of 129 functions (99.2%), an important step toward fully automated AI-assisted formal verification.

Link: https://arxiv.org/abs/2510.25015
Authors: Chuyue Sun, Yican Sun, Daneshvar Amrollahi, Ethan Zhang, Shuvendu Lahiri, Shan Lu, David Dill, Clark Barrett
Institution: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:We introduce VeriStruct, a novel framework that extends AI-assisted automated verification from single functions to more complex data structure modules in Verus. VeriStruct employs a planner module to orchestrate the systematic generation of abstractions, type invariants, specifications, and proof code. To address the challenge that LLMs often misunderstand Verus’ annotation syntax and verification-specific semantics, VeriStruct embeds syntax guidance within prompts and includes a repair stage to automatically correct annotation errors. In an evaluation on eleven Rust data structure modules, VeriStruct succeeds on ten of the eleven, successfully verifying 128 out of 129 functions (99.2%) in total. These results represent an important step toward the goal of automatic AI-assisted formal verification.

[AI-75] LAFA: Agentic LLM-Driven Federated Analytics over Decentralized Data Sources

【Quick Read】: This paper addresses the gap that current federated analytics (FA) systems lack natural language input, while LLM-based analytics frameworks assume centralized data access and offer little privacy protection. The key to the solution is the LAFA system, a hierarchical multi-agent architecture: a coarse-grained planner first decomposes a complex natural language query into sub-queries, then a fine-grained planner maps each sub-query to a directed acyclic graph (DAG) of FA operations using prior structural knowledge; an optimizer agent further rewrites and merges multiple DAGs to eliminate redundant operations, markedly reducing computation and communication overhead. This design enables efficient, natural-language-driven federated analytics execution while preserving data privacy.

Link: https://arxiv.org/abs/2510.18477
Authors: Haichao Ji, Zibo Wang, Cheng Pan, Meng Han, Yifei Zhu, Dan Wang, Zhu Han
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
Comments: This paper has been accepted by the 16th IEEE International Conference on Cloud Computing Technology and Science (CloudCom 2025)

Click to view abstract

Abstract:Large Language Models (LLMs) have shown great promise in automating data analytics tasks by interpreting natural language queries and generating multi-operation execution plans. However, existing LLM-agent-based analytics frameworks operate under the assumption of centralized data access, offering little to no privacy protection. In contrast, federated analytics (FA) enables privacy-preserving computation across distributed data sources, but lacks support for natural language input and requires structured, machine-readable queries. In this work, we present LAFA, the first system that integrates LLM-agent-based data analytics with FA. LAFA introduces a hierarchical multi-agent architecture that accepts natural language queries and transforms them into optimized, executable FA workflows. A coarse-grained planner first decomposes complex queries into sub-queries, while a fine-grained planner maps each subquery into a Directed Acyclic Graph of FA operations using prior structural knowledge. To improve execution efficiency, an optimizer agent rewrites and merges multiple DAGs, eliminating redundant operations and minimizing computational and communicational overhead. Our experiments demonstrate that LAFA consistently outperforms baseline prompting strategies by achieving higher execution plan success rates and reducing resource-intensive FA operations by a substantial margin. This work establishes a practical foundation for privacy-preserving, LLM-driven analytics that supports natural language input in the FA setting.

[AI-76] LLMs are Overconfident: Evaluating Confidence Interval Calibration with FermiEval

【Quick Read】: This paper addresses the pervasive overconfidence of large language models (LLMs) in numerical estimation: the confidence intervals (CIs) they construct do not reflect true uncertainty, with nominal 99% intervals covering the true value only 65% of the time. The key to the solution is the FermiEval benchmark combined with calibration methods such as conformal-prediction-based interval adjustment, direct log-probability elicitation, and quantile adjustment, which substantially improve interval coverage and sharpness. The paper also proposes a perception-tunnel theory explaining the root mechanism: when reasoning under uncertainty, LLMs tend to neglect the tails of the inferred distribution, which leads to systematic overconfidence.

Link: https://arxiv.org/abs/2510.26995
Authors: Elliot L. Epstein, John Winnicki, Thanawat Sornwanee, Rajat Dwaraknath
Institution: Unknown
Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages

Click to view abstract

Abstract:Large language models (LLMs) excel at numerical estimation but struggle to correctly quantify uncertainty. We study how well LLMs construct confidence intervals around their own answers and find that they are systematically overconfident. To evaluate this behavior, we introduce FermiEval, a benchmark of Fermi-style estimation questions with a rigorous scoring rule for confidence interval coverage and sharpness. Across several modern models, nominal 99% intervals cover the true answer only 65% of the time on average. With a conformal prediction based approach that adjusts the intervals, we obtain accurate 99% observed coverage, and the Winkler interval score decreases by 54%. We also propose direct log-probability elicitation and quantile adjustment methods, which further reduce overconfidence at high confidence levels. Finally, we develop a perception-tunnel theory explaining why LLMs exhibit overconfidence: when reasoning under uncertainty, they act as if sampling from a truncated region of their inferred distribution, neglecting its tails.
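The conformal adjustment the abstract mentions can be sketched in a few lines: measure how far each true answer lands outside the model's interval on a calibration split, take the conformal quantile of those scores, and widen future intervals by that factor. The multiplicative nonconformity score below is an assumption for illustration, not necessarily the paper's exact choice.

```python
import numpy as np

def conformal_scale(lo, hi, truth, target=0.99):
    """Width multiplier s so that [mid - s*rad, mid + s*rad] attains the
    target coverage, via the conformal quantile on a calibration set."""
    mid, rad = (lo + hi) / 2, (hi - lo) / 2
    scores = np.abs(truth - mid) / np.maximum(rad, 1e-12)   # nonconformity
    n = len(scores)
    return np.quantile(scores, min(1.0, np.ceil((n + 1) * target) / n))

rng = np.random.default_rng(1)
truth = rng.lognormal(3.0, 1.0, 500)             # Fermi-style ground truths
pred = truth * rng.lognormal(0.0, 0.3, 500)      # model estimates with error
lo, hi = 0.9 * pred, 1.1 * pred                  # nominal "99%" CIs, overconfident
cal, test = slice(0, 250), slice(250, None)

s = conformal_scale(lo[cal], hi[cal], truth[cal])
mid, rad = (lo + hi) / 2, (hi - lo) / 2
covered = (truth >= mid - s * rad) & (truth <= mid + s * rad)
print(round(covered[test].mean(), 3))            # close to the 0.99 target
```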

[AI-77] Diffusion-Driven Generation of Minimally Preprocessed Brain MRI

【Quick Read】: This paper addresses the generation of high-quality 3D T₁-weighted MRI images of the human brain, in particular how denoising diffusion probabilistic models (DDPMs) can produce high-resolution brain images with natural anatomical variation and field inhomogeneity without skullstripping or registration. The key to the solution is building and training three DDPM architectures (two based on velocity and flow prediction, one on sample prediction) while minimizing preprocessing to preserve the visual variability of the original data, so that realistic and structurally coherent 3D brain images can be generated without registration or bias field correction.

Link: https://arxiv.org/abs/2510.26834
Authors: Samuel W. Remedios, Aaron Carass, Jerry L. Prince, Blake E. Dewey
Institution: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The purpose of this study is to present and compare three denoising diffusion probabilistic models (DDPMs) that generate 3D T₁-weighted MRI human brain images. Three DDPMs were trained using 80,675 image volumes from 42,406 subjects spanning 38 publicly available brain MRI datasets. These images had approximately 1 mm isotropic resolution and were manually inspected by three human experts to exclude those with poor quality, field-of-view issues, and excessive pathology. The images were minimally preprocessed to preserve the visual variability of the data. Furthermore, to enable the DDPMs to produce images with natural orientation variations and inhomogeneity, the images were neither registered to a common coordinate system nor bias field corrected. Evaluations included segmentation, Fréchet Inception Distance (FID), and qualitative inspection. Regarding results, all three DDPMs generated coherent MR brain volumes. The velocity and flow prediction models achieved lower FIDs than the sample prediction model. However, all three models had higher FIDs compared to real images across multiple cohorts. In a permutation experiment, the generated brain regional volume distributions differed statistically from real data. However, the velocity and flow prediction models had fewer statistically different volume distributions in the thalamus and putamen. In conclusion this work presents and releases the first 3D non-latent diffusion model for brain data without skullstripping or registration. Despite the negative results in statistical testing, the presented DDPMs are capable of generating high-resolution 3D T₁-weighted brain images. All model weights and corresponding inference code are publicly available at this https URL.

[AI-78] R3GAN-based Optimal Strategy for Augmenting Small Medical Dataset

【Quick Read】: This paper addresses the data scarcity and class imbalance that limit deep learning models in medical image analysis for clinical use. Taking human embryo time-lapse imaging (TLI) as a case study, it investigates how to optimize generative adversarial networks (GANs) on small datasets to generate realistic and diagnostically meaningful images. The key to the solution is a tailored R3GAN training strategy, featuring a full burn-in phase and a low, gradually increasing gamma range (5-40), which markedly improves the quality and diversity of generated images; using it to balance an imbalanced embryo dataset raised the recall of the t3 class from 0.06 to 0.69 and its F1-score from 0.11 to 0.60 without harming other classes, confirming the approach's potential to alleviate data scarcity and improve model robustness in small-scale medical imaging tasks.

Link: https://arxiv.org/abs/2510.26828
Authors: Tsung-Wei Pan, Chang-Hong Wu, Jung-Hua Wang, Ming-Jer Chen, Yu-Chiao Yi, Tsung-Hsien Lee
Institution: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Medical image analysis often suffers from data scarcity and class imbalance, limiting the effectiveness of deep learning models in clinical applications. Using human embryo time-lapse imaging (TLI) as a case study, this work investigates how generative adversarial networks (GANs) can be optimized for small datasets to generate realistic and diagnostically meaningful images. Based on systematic experiments with R3GAN, we established effective training strategies and designed an optimized configuration for 256x256-resolution datasets, featuring a full burn-in phase and a low, gradually increasing gamma range (5 - 40). The generated samples were used to balance an imbalanced embryo dataset, leading to substantial improvement in classification performance. The recall and F1-score of t3 increased from 0.06 to 0.69 and 0.11 to 0.60, respectively, without compromising other classes. These results demonstrate that tailored R3GAN training strategies can effectively alleviate data scarcity and improve model robustness in small-scale medical imaging tasks.

[AI-79] Systematic Absence of Low-Confidence Nighttime Fire Detections in VIIRS Active Fire Product: Evidence of Undocumented Algorithmic Filtering

【Quick Read】: This paper addresses an undocumented systematic bias in the VIIRS (Visible Infrared Imaging Radiometer Suite) active fire product, namely the complete absence of low-confidence classifications in nighttime observations. The study shows this is not a geophysical surface phenomenon but an algorithmic constraint: nighttime fires below roughly 295K are excluded entirely rather than flagged as low confidence, in sharp contrast to the normal confidence distribution of daytime fires. The key to the solution is identifying and confirming this behavior as algorithmic through machine-learning reverse engineering (88.9% accuracy), bootstrap simulation, and spatio-temporal analysis, and recommending that the constraint be documented explicitly in user guides and that affected data be reprocessed to keep downstream research accurate.

Link: https://arxiv.org/abs/2510.26816
Authors: Rohit Rajendra Dhage
Institution: Unknown
Subjects: Applications (stat.AP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
Comments: 11 pages, 8 tables, 2 algorithms. Submitted to ArXiv for open access dissemination prior to journal submission

Click to view abstract

Abstract:The Visible Infrared Imaging Radiometer Suite (VIIRS) active fire product is widely used for global fire monitoring, yet its confidence classification scheme exhibits an undocumented systematic pattern. Through analysis of 21,540,921 fire detections spanning one year (January 2023 - January 2024), I demonstrate a complete absence of low-confidence classifications during nighttime observations. Of 6,007,831 nighttime fires, zero were classified as low confidence, compared to an expected 696,908 under statistical independence (chi-squared = 1,474,795, p < 10^-15, Z = -833). This pattern persists globally across all months, latitude bands, and both NOAA-20 and Suomi-NPP satellites. Machine learning reverse-engineering (88.9% accuracy), bootstrap simulation (1,000 iterations), and spatial-temporal analysis confirm this is an algorithmic constraint rather than a geophysical phenomenon. Brightness temperature analysis reveals nighttime fires below approximately 295K are likely excluded entirely rather than flagged as low-confidence, while daytime fires show normal confidence distributions. This undocumented behavior affects 27.9% of all VIIRS fire detections and has significant implications for fire risk assessment, day-night detection comparisons, confidence-weighted analyses, and any research treating confidence levels as uncertainty metrics. I recommend explicit documentation of this algorithmic constraint in VIIRS user guides and reprocessing strategies for affected analyses.

[AI-80] Impact of clinical decision support systems (cdss) on clinical outcomes and healthcare delivery in low- and middle-income countries: protocol for a systematic review and meta-analysis

【Quick Read】: This paper addresses the dispersed evidence on how clinical decision support systems (CDSS) affect patient outcomes and healthcare delivery in low- and middle-income countries (LMICs). The key to the solution is a systematic synthesis of the existing quantitative evidence: strictly selecting studies with comparative designs (randomized controlled trials, controlled before-after studies, interrupted time series, and controlled cohort studies), using duplicate independent screening and extraction with risk-of-bias assessment (RoB 2 and ROBINS-I), quantifying CDSS effects with random-effects meta-analysis where outcomes are conceptually or statistically comparable (otherwise via structured narrative synthesis), and exploring sources of heterogeneity (condition area, care level, CDSS type, readiness proxies, and study design) through pre-specified subgroup analyses or meta-regression.

Link: https://arxiv.org/abs/2510.26812
Authors: Garima Jain, Anand Bodade, Sanghamitra Pati
Institution: Unknown
Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 tables

Click to view abstract

Abstract:Clinical decision support systems (CDSS) are used to improve clinical and service outcomes, yet evidence from low- and middle-income countries (LMICs) is dispersed. This protocol outlines methods to quantify the impact of CDSS on patient and healthcare delivery outcomes in LMICs. We will include comparative quantitative designs (randomized trials, controlled before-after, interrupted time series, comparative cohorts) evaluating CDSS in World Bank-defined LMICs. Standalone qualitative studies are excluded; mixed-methods studies are eligible only if they report comparative quantitative outcomes, for which we will extract the quantitative component. Searches (from inception to 30 September 2024) will cover MEDLINE, Embase, CINAHL, CENTRAL, Web of Science, Global Health, Scopus, IEEE Xplore, LILACS, African Index Medicus, and IndMED, plus grey sources. Screening and extraction will be performed in duplicate. Risk of bias will be assessed with RoB 2 (randomized trials) and ROBINS-I (non-randomized). Random-effects meta-analysis will be performed where outcomes are conceptually or statistically comparable; otherwise, a structured narrative synthesis will be presented. Heterogeneity will be explored using relative and absolute metrics and a priori subgroups or meta-regression (condition area, care level, CDSS type, readiness proxies, study design).

[AI-81] Reinforcement Learning for Accelerator Beamline Control: a simulation-based approach

【Quick Read】: This paper addresses beamline configuration optimization in particle accelerators, i.e., tuning magnet parameters to maximize particle transmission, which traditionally relies on time-consuming manual tuning by experts. The key to the solution is framing beamline optimization as a reinforcement learning (RL) problem through RLABC, a Python library that uses the Elegant simulation framework to build an RL environment automatically, defining a state space of beam statistics, an action space of magnet parameter adjustments, and a reward function centered on transmission efficiency. Trained with the Deep Deterministic Policy Gradient (DDPG) algorithm, it reaches transmission rates of 94% and 91% on two beamlines, comparable to expert manual optimization.

Link: https://arxiv.org/abs/2510.26805
Authors: Anwar Ibrahim, Alexey Petrenko, Maxim Kaledin, Ehab Suleiman, Fedor Ratnikov, Denis Derkach
Institution: Unknown
Subjects: Accelerator Physics (physics.acc-ph); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Particle accelerators play a pivotal role in advancing scientific research, yet optimizing beamline configurations to maximize particle transmission remains a labor-intensive task requiring expert intervention. In this work, we introduce RLABC (Reinforcement Learning for Accelerator Beamline Control), a Python-based library that reframes beamline optimization as a reinforcement learning (RL) problem. Leveraging the Elegant simulation framework, RLABC automates the creation of an RL environment from standard lattice and element input files, enabling sequential tuning of magnets to minimize particle losses. We define a comprehensive state representation capturing beam statistics, actions for adjusting magnet parameters, and a reward function focused on transmission efficiency. Employing the Deep Deterministic Policy Gradient (DDPG) algorithm, we demonstrate RLABC’s efficacy on two beamlines, achieving transmission rates of 94% and 91%, comparable to expert manual optimizations. This approach bridges accelerator physics and machine learning, offering a versatile tool for physicists and RL researchers alike to streamline beamline tuning.
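The abstract maps cleanly onto a gym-style environment; the sketch below shows that shape. The `_simulate` stand-in is a pure assumption replacing the Elegant simulation call, and the state, action bounds, and reward are simplified for illustration.

```python
import numpy as np

class BeamlineEnv:
    """Toy stand-in for an RLABC-style environment: tune magnets, read beam stats."""

    def __init__(self, n_magnets=4, seed=0):
        rng = np.random.default_rng(seed)
        self.target = rng.normal(size=n_magnets)  # unknown well-tuned setting
        self.k = np.zeros(n_magnets)              # current magnet strengths

    def _simulate(self):
        # Stand-in for an Elegant run: transmission decays with mistuning.
        return float(np.exp(-np.sum((self.k - self.target) ** 2)))

    def reset(self):
        self.k[:] = 0.0
        return self._state()

    def _state(self):
        # What the agent observes: magnet settings plus a beam statistic.
        return np.concatenate([self.k, [self._simulate()]])

    def step(self, action):
        self.k += np.clip(action, -0.1, 0.1)      # bounded continuous tuning
        transmission = self._simulate()           # reward = particle survival
        return self._state(), transmission, transmission > 0.95, {}

env = BeamlineEnv()
env.reset()
_, reward, done, _ = env.step(np.array([0.05, -0.02, 0.0, 0.1]))
print(round(reward, 3), done)
```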

[AI-82] EARS-UDE: Evaluating Auditory Response in Sensory Overload with Universal Differential Equations

【Quick Read】: This paper addresses auditory sensory overload, which is common among individuals with Autism Spectrum Disorder (ASD) yet remains insufficiently modeled: existing mechanistic models, clinical tools, and machine learning methods either assume fixed parameters or lack interpretability, missing ASD heterogeneity. The key to the solution is a Scientific Machine Learning framework that uses Universal Differential Equations (UDEs) to combine biophysically grounded ordinary differential equations with neural networks, capturing mechanistic understanding and individual variability at the same time. The method stays accurate while using far fewer parameters (73.5% fewer than a pure Neural ODE), recovers physiological parameters within 2% error, and provides quantitative sensory-overload risk assessment, offering a practical path toward personalized, evidence-based interventions.

Link: https://arxiv.org/abs/2510.26804
Authors: Miheer Salunke, Prathamesh Dinesh Joshi, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat
Institution: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Auditory sensory overload affects 50-70% of individuals with Autism Spectrum Disorder (ASD), yet existing approaches, such as mechanistic models (Hodgkin-Huxley type, Wilson-Cowan, excitation-inhibition balance), clinical tools (EEG/MEG, Sensory Profile scales), and ML methods (Neural ODEs, predictive coding), either assume fixed parameters or lack interpretability, missing autism heterogeneity. We present a Scientific Machine Learning approach using Universal Differential Equations (UDEs) to model sensory adaptation dynamics in autism. Our framework combines ordinary differential equations grounded in biophysics with neural networks to capture both mechanistic understanding and individual variability. We demonstrate that UDEs achieve a 90.8% improvement over pure Neural ODEs while using 73.5% fewer parameters. The model successfully recovers physiological parameters within 2% error and provides a quantitative risk assessment for sensory overload, predicting 17.2% risk for pulse stimuli with specific temporal patterns. This framework establishes foundations for personalized, evidence-based interventions in autism, with direct applications to wearable technology and clinical practice.

Machine Learning

[LG-0] On Selecting Few-Shot Examples for LLM-based Code Vulnerability Detection

Link: https://arxiv.org/abs/2510.27675
Authors: Md Abdul Hannan, Ronghao Ni, Chi Zhang, Limin Jia, Ravi Mangal, Corina S. Pasareanu
Subjects: Software Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Large language models (LLMs) have demonstrated impressive capabilities for many coding tasks, including summarization, translation, completion, and code generation. However, detecting code vulnerabilities remains a challenging task for LLMs. An effective way to improve LLM performance is in-context learning (ICL) - providing few-shot examples similar to the query, along with correct answers, can improve an LLM’s ability to generate correct solutions. However, choosing the few-shot examples appropriately is crucial to improving model performance. In this paper, we explore two criteria for choosing few-shot examples for ICL used in the code vulnerability detection task. The first criterion considers if the LLM (consistently) makes a mistake or not on a sample with the intuition that LLM performance on a sample is informative about its usefulness as a few-shot example. The other criterion considers similarity of the examples with the program under query and chooses few-shot examples based on the k-nearest neighbors to the given sample. We perform evaluations to determine the benefits of these criteria individually as well as under various combinations, using open-source models on multiple datasets.
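The second criterion is straightforward to sketch: embed the query program and every candidate example, then place the k most similar labeled examples in the prompt. The toy `embed` below is a hypothetical stand-in for a real code-embedding model.

```python
import numpy as np

def knn_fewshot(query_code, pool_codes, pool_labels, embed, k=4):
    """Pick the k pool examples closest to the query in embedding space."""
    q = embed(query_code)
    E = np.stack([embed(c) for c in pool_codes])
    sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q))
    top = np.argsort(-sims)[:k]
    # (code, vulnerable?) pairs to prepend to the prompt before the query
    return [(pool_codes[i], pool_labels[i]) for i in top]

# Hypothetical embedding: crude surface statistics of the source text.
embed = lambda code: np.array([len(code), code.count("("), code.count("=")], float)
pool = ["a = b", "f(x)", "g(y, z)", "c = d + e"]
print(knn_fewshot("h(q)", pool, [0, 1, 1, 0], embed, k=2))
```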

[LG-1] Enhancing software product lines with machine learning components

Link: https://arxiv.org/abs/2510.27640
Authors: Luz-Viviana Cobaleda, Julián Carvajal, Paola Vallejo, Andrés López, Raúl Mazo
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Comments: pp. 73-94, 2 figures

Click to view abstract

Abstract:Modern software systems increasingly integrate machine learning (ML) due to its advancements and ability to enhance data-driven decision-making. However, this integration introduces significant challenges for software engineering, especially in software product lines (SPLs), where managing variability and reuse becomes more complex with the inclusion of ML components. Although existing approaches have addressed variability management in SPLs and the integration of ML components in isolated systems, few have explored the intersection of both domains. Specifically, there is limited support for modeling and managing variability in SPLs that incorporate ML components. To bridge this gap, this article proposes a structured framework designed to extend Software Product Line engineering, facilitating the integration of ML components. It facilitates the design of SPLs with ML capabilities by enabling systematic modeling of variability and reuse. The proposal has been partially implemented with the VariaMos tool.

[LG-2] Panprediction: Optimal Predictions for Any Downstream Task and Loss

Link: https://arxiv.org/abs/2510.27638
Authors: Sivaraman Balakrishnan, Nika Haghtalab, Daniel Hsu, Brian Lee, Eric Zhao
Subjects: Machine Learning (cs.LG)
Comments: 25 pages

Click to view abstract

Abstract:Supervised learning is classically formulated as training a model to minimize a fixed loss function over a fixed distribution, or task. However, an emerging paradigm instead views model training as extracting enough information from data so that the model can be used to minimize many losses on many downstream tasks. We formalize a mathematical framework for this paradigm, which we call panprediction, and study its statistical complexity. Formally, panprediction generalizes omniprediction and sits upstream from multi-group learning, which respectively focus on predictions that generalize to many downstream losses or many downstream tasks, but not both. Concretely, we design algorithms that learn deterministic and randomized panpredictors with Õ(1/ε³) and Õ(1/ε²) samples, respectively. Our results demonstrate that under mild assumptions, simultaneously minimizing infinitely many losses on infinitely many tasks can be as statistically easy as minimizing one loss on one task. Along the way, we improve the best known sample complexity guarantee of deterministic omniprediction by a factor of 1/ε, and match all other known sample complexity guarantees of omniprediction and multi-group learning. Our key technical ingredient is a nearly lossless reduction from panprediction to a statistically efficient notion of calibration, called step calibration.

[LG-3] ORGEval: Graph-Theoretic Evaluation of LLMs in Optimization Modeling

Link: https://arxiv.org/abs/2510.27610
Authors: Zhuohan Wang, Ziwei Zhu, Ziniu Li, Congliang Chen, Yizhou Han, Yufeng Lin, Zhihang Lin, Angyang Gu, Xinglin Hu, Ruoyu Sun, Tian Ding
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Formulating optimization problems for industrial applications demands significant manual effort and domain expertise. While Large Language Models (LLMs) show promise in automating this process, evaluating their performance remains difficult due to the absence of robust metrics. Existing solver-based approaches often face inconsistency, infeasibility issues, and high computational costs. To address these issues, we propose ORGEval, a graph-theoretic evaluation framework for assessing LLMs’ capabilities in formulating linear and mixed-integer linear programs. ORGEval represents optimization models as graphs, reducing equivalence detection to graph isomorphism testing. We identify and prove a sufficient condition, when the tested graphs are symmetric decomposable (SD), under which the Weisfeiler-Lehman (WL) test is guaranteed to correctly detect isomorphism. Building on this, ORGEval integrates a tailored variant of the WL-test with an SD detection algorithm to evaluate model equivalence. By focusing on structural equivalence rather than instance-level configurations, ORGEval is robust to numerical variations. Experimental results show that our method can successfully detect model equivalence and produce 100% consistent results across random parameter configurations, while significantly outperforming solver-based methods in runtime, especially on difficult problems. Leveraging ORGEval, we construct the Bench4Opt dataset and benchmark state-of-the-art LLMs on optimization modeling. Our results reveal that although optimization modeling remains challenging for all LLMs, DeepSeek-V3 and Claude-Opus-4 achieve the highest accuracies under direct prompting, outperforming even leading reasoning models.
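The Weisfeiler-Lehman test at the heart of ORGEval is short enough to show in full: iteratively re-color each node by hashing its color together with the sorted multiset of its neighbors' colors. Diverging color histograms certify non-isomorphism; the paper's contribution is identifying the symmetric-decomposable condition under which agreement also certifies equivalence. The graph encoding below is an illustrative assumption.

```python
def wl_colors(adj, labels, rounds=3):
    """adj: dict node -> neighbors; labels: dict node -> initial label."""
    col = dict(labels)
    for _ in range(rounds):
        col = {v: hash((col[v], tuple(sorted(col[u] for u in adj[v]))))
               for v in adj}
    return sorted(col.values())          # canonical color histogram

# Same 4-cycle, two different node labelings -> different histograms
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(wl_colors(cycle, {0: "x", 1: "+", 2: "y", 3: "+"}) ==
      wl_colors(cycle, {0: "x", 1: "+", 2: "x", 3: "+"}))   # False
```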

[LG-4] Learned Static Function Data Structures

Link: https://arxiv.org/abs/2510.27588
Authors: Stefan Hermann, Hans-Peter Lehmann, Giorgio Vinciguerra, Stefan Walzer
Subjects: Data Structures and Algorithms (cs.DS); Databases (cs.DB); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We consider the task of constructing a data structure for associating a static set of keys with values, while allowing arbitrary output values for queries involving keys outside the set. Compared to hash tables, these so-called static function data structures do not need to store the key set and thus use significantly less memory. Several techniques are known, with compressed static functions approaching the zero-order empirical entropy of the value sequence. In this paper, we introduce learned static functions, which use machine learning to capture correlations between keys and values. For each key, a model predicts a probability distribution over the values, from which we derive a key-specific prefix code to compactly encode the true value. The resulting codeword is stored in a classic static function data structure. This design allows learned static functions to break the zero-order entropy barrier while still supporting point queries. Our experiments show substantial space savings: up to one order of magnitude on real data, and up to three orders of magnitude on synthetic data.

[LG-5] AstuteRAG-FQA: Task-Aware Retrieval-Augmented Generation Framework for Proprietary Data Challenges in Financial Question Answering

Link: https://arxiv.org/abs/2510.27537
Authors: Mohammad Zahangir Alam, Khandoker Ashik Uz Zaman, Mahdi H. Miraz
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Retrieval-Augmented Generation (RAG) shows significant promise in knowledge-intensive tasks by improving domain specificity, enhancing temporal relevance, and reducing hallucinations. However, applying RAG to finance encounters critical challenges: restricted access to proprietary datasets, limited retrieval accuracy, regulatory constraints, and sensitive data interpretation. We introduce AstuteRAG-FQA, an adaptive RAG framework tailored for Financial Question Answering (FQA), leveraging task-aware prompt engineering to address these challenges. The framework uses a hybrid retrieval strategy integrating both open-source and proprietary financial data while maintaining strict security protocols and regulatory compliance. A dynamic prompt framework adapts in real time to query complexity, improving precision and contextual relevance. To systematically address diverse financial queries, we propose a four-tier task classification: explicit factual, implicit factual, interpretable rationale, and hidden rationale involving implicit causal reasoning. For each category, we identify key challenges, datasets, and optimization techniques within the retrieval and generation process. The framework incorporates multi-layered security mechanisms including differential privacy, data anonymization, and role-based access controls to protect sensitive financial information. Additionally, AstuteRAG-FQA implements real-time compliance monitoring through automated regulatory validation systems that verify responses against industry standards and legal obligations. We evaluate three data integration techniques - contextual embedding, small model augmentation, and targeted fine-tuning - analyzing their efficiency and feasibility across varied financial environments.

[LG-6] Representing Classical Compositions through Implication-Realization Temporal-Gestalt Graphs

Link: https://arxiv.org/abs/2510.27530
Authors: A. V. Bomediano, R. J. Conanan, L. D. Santuyo, A. Coronel
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments: 8 pages, 11 figures

Click to view abstract

Abstract:Understanding the structural and cognitive underpinnings of musical compositions remains a key challenge in music theory and computational musicology. While traditional methods focus on harmony and rhythm, cognitive models such as the Implication-Realization (I-R) model and Temporal Gestalt theory offer insight into how listeners perceive and anticipate musical structure. This study presents a graph-based computational approach that operationalizes these models by segmenting melodies into perceptual units and annotating them with I-R patterns. These segments are compared using Dynamic Time Warping and organized into k-nearest neighbors graphs to model intra- and inter-segment relationships. Each segment is represented as a node in the graph, and nodes are further labeled with melodic expectancy values derived from Schellenberg’s two-factor I-R model, quantifying pitch proximity and pitch reversal at the segment level. This labeling enables the graphs to encode both structural and cognitive information, reflecting how listeners experience musical tension and resolution. To evaluate the expressiveness of these graphs, we apply the Weisfeiler-Lehman graph kernel to measure similarity between and within compositions. Results reveal statistically significant distinctions between intra- and inter-graph structures. Segment-level analysis via multidimensional scaling confirms that structural similarity at the graph level reflects perceptual similarity at the segment level. Graph2vec embeddings and clustering demonstrate that these representations capture stylistic and structural features that extend beyond composer identity. These findings highlight the potential of graph-based methods as a structured, cognitively informed framework for computational music analysis, enabling a more nuanced understanding of musical structure and style through the lens of listener perception.

[LG-7] Active transfer learning for structural health monitoring

Link: https://arxiv.org/abs/2510.27525
Authors: J. Poole, N. Dervilis, K. Worden, P. Gardner, V. Giglioni, R.S. Mills, A.J. Hughes
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Data for training structural health monitoring (SHM) systems are often expensive and/or impractical to obtain, particularly for labelled data. Population-based SHM (PBSHM) aims to address this limitation by leveraging data from multiple structures. However, data from different structures will follow distinct distributions, potentially leading to large generalisation errors for models learnt via conventional machine learning methods. To address this issue, transfer learning – in the form of domain adaptation (DA) – can be used to align the data distributions. Most previous approaches have only considered unsupervised DA, where no labelled target data are available; they do not consider how to incorporate these technologies in an online framework – updating as labels are obtained throughout the monitoring campaign. This paper proposes a Bayesian framework for DA in PBSHM, that can improve unsupervised DA mappings using a limited quantity of labelled target data. In addition, this model is integrated into an active sampling strategy to guide inspections to select the most informative observations to label – leading to further reductions in the required labelled data to learn a target classifier. The effectiveness of this methodology is evaluated on a population of experimental bridges. Specifically, this population includes data corresponding to several damage states, as well as a comprehensive set of environmental conditions. It is found that combining transfer learning and active learning can improve data efficiency when learning classification models in label-scarce scenarios. This result has implications for data-informed operation and maintenance of structures, suggesting a reduction in inspections over the operational lifetime of a structure – and therefore a reduction in operational costs – can be achieved.

[LG-8] Learning Sparse Approximate Inverse Preconditioners for Conjugate Gradient Solvers on GPUs NEURIPS2025

Link: https://arxiv.org/abs/2510.27517
Authors: Zherui Yang, Zhehao Li, Kangbo Lyu, Yixuan Li, Tao Du, Ligang Liu
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: NeurIPS 2025, poster

Click to view abstract

Abstract:The conjugate gradient solver (CG) is a prevalent method for solving symmetric and positive definite linear systems Ax=b, where effective preconditioners are crucial for fast convergence. Traditional preconditioners rely on prescribed algorithms to offer rigorous theoretical guarantees, while limiting their ability to exploit optimization from data. Existing learning-based methods often utilize Graph Neural Networks (GNNs) to improve the performance and speed up the construction. However, their reliance on incomplete factorization leads to significant challenges: the associated triangular solve hinders GPU parallelization in practice, and introduces long-range dependencies which are difficult for GNNs to model. To address these issues, we propose a learning-based method to generate GPU-friendly preconditioners, particularly using GNNs to construct Sparse Approximate Inverse (SPAI) preconditioners, which avoids triangular solves and requires only two matrix-vector products at each CG step. The locality of matrix-vector product is compatible with the local propagation mechanism of GNNs. The flexibility of GNNs also allows our approach to be applied in a wide range of scenarios. Furthermore, we introduce a statistics-based scale-invariant loss function. Its design matches CG’s property that the convergence rate depends on the condition number, rather than the absolute scale of A, leading to improved performance of the learned preconditioner. Evaluations on three PDE-derived datasets and one synthetic dataset demonstrate that our method outperforms standard preconditioners (Diagonal, IC, and traditional SPAI) and previous learning-based preconditioners on GPUs. We reduce solution time on GPUs by 40%-53% (68%-113% faster), along with better condition numbers and superior generalization performance. Source code available at this https URL
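The practical point of SPAI, that preconditioning becomes a plain matrix-vector product with no triangular solve, is visible in a standard preconditioned CG loop. Below is a dense-numpy sketch; a real SPAI M would be sparse (here the exact inverse of a diagonal toy system stands in for it), and each iteration costs exactly the two matvecs the abstract mentions.

```python
import numpy as np

def pcg(A, b, M, tol=1e-8, maxit=500):
    """Preconditioned CG with M ~ A^{-1} applied as a matvec (SPAI setting)."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M @ r                          # preconditioning = one matvec, no solve
    p = z.copy()
    rz = r @ z
    for _ in range(maxit):
        Ap = A @ p                     # the iteration's second matvec
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = M @ r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

d = np.arange(1.0, 101.0)
A, M, b = np.diag(d), np.diag(1.0 / d), np.ones(100)   # toy SPD system
print(np.linalg.norm(A @ pcg(A, b, M) - b))            # ~ 0
```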

[LG-9] Asynchronous Risk-Aware Multi-Agent Packet Routing for Ultra-Dense LEO Satellite Networks

Link: https://arxiv.org/abs/2510.27506
Authors: Ke He, Thang X. Vu, Le He, Lisheng Fan, Symeon Chatzinotas, Bjorn Ottersten
Subjects: Networking and Internet Architecture (cs.NI); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The rise of ultra-dense LEO constellations creates a complex and asynchronous network environment, driven by their massive scale, dynamic topologies, and significant delays. This unique complexity demands an adaptive packet routing algorithm that is asynchronous, risk-aware, and capable of balancing diverse and often conflicting QoS objectives in a decentralized manner. However, existing methods fail to address this need, as they typically rely on impractical synchronous decision-making and/or risk-oblivious approaches. To tackle this gap, we introduce PRIMAL, an event-driven multi-agent routing framework designed specifically to allow each satellite to act independently on its own event-driven timeline, while managing the risk of worst-case performance degradation via a principled primal-dual approach. This is achieved by enabling agents to learn the full cost distribution of the targeted QoS objectives and constrain tail-end risks. Extensive simulations on a LEO constellation with 1584 satellites validate its superiority in effectively optimizing latency and balancing load. Compared to a recent risk-oblivious baseline, it reduces queuing delay by over 70%, and achieves a nearly 12 ms end-to-end delay reduction in loaded scenarios. This is accomplished by resolving the core conflict between naive shortest-path finding and congestion avoidance, highlighting such autonomous risk-awareness as a key to robust routing.

[LG-10] Simplex-to-Euclidean Bijections for Categorical Flow Matching

Link: https://arxiv.org/abs/2510.27480
Authors: Bernardo Williams, Victor M. Yeom-Song, Marcelo Hartmann, Arto Klami
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We propose a method for learning and sampling from probability distributions supported on the simplex. Our approach maps the open simplex to Euclidean space via smooth bijections, leveraging the Aitchison geometry to define the mappings, and supports modeling categorical data by a Dirichlet interpolation that dequantizes discrete observations into continuous ones. This enables density modeling in Euclidean space through the bijection while still allowing exact recovery of the original discrete distribution. Compared to previous methods that operate on the simplex using Riemannian geometry or custom noise processes, our approach works in Euclidean space while respecting the Aitchison geometry, and achieves competitive performance on both synthetic and real-world data sets.

[LG-11] Spectral Neural Graph Sparsification

Link: https://arxiv.org/abs/2510.27474
Authors: Angelica Liguori, Ettore Ritacco, Pietro Sabatino, Annalisa Socievole
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Graphs are central to modeling complex systems in domains such as social networks, molecular chemistry, and neuroscience. While Graph Neural Networks, particularly Graph Convolutional Networks, have become standard tools for graph learning, they remain constrained by reliance on fixed structures and susceptibility to over-smoothing. We propose the Spectral Preservation Network, a new framework for graph representation learning that generates reduced graphs serving as faithful proxies of the original, enabling downstream tasks such as community detection, influence propagation, and information diffusion at a reduced computational cost. The Spectral Preservation Network introduces two key components: the Joint Graph Evolution layer and the Spectral Concordance loss. The former jointly transforms both the graph topology and the node feature matrix, allowing the structure and attributes to evolve adaptively across layers and overcoming the rigidity of static neighborhood aggregation. The latter regularizes these transformations by enforcing consistency in both the spectral properties of the graph and the feature vectors of the nodes. We evaluate the effectiveness of Spectral Preservation Network on node-level sparsification by analyzing well-established metrics and benchmarking against state-of-the-art methods. The experimental results demonstrate the superior performance and clear advantages of our approach.

[LG-12] MVeLMA: Multimodal Vegetation Loss Modeling Architecture for Predicting Post-fire Vegetation Loss

Link: https://arxiv.org/abs/2510.27443
Authors: Meenu Ravi, Shailik Sarkar, Yanshen Sun, Vaishnavi Singh, Chang-Tien Lu
Subjects: Machine Learning (cs.LG)
Comments: Accepted for 2025 ACM SIGSPATIAL conference

Click to view abstract

Abstract:Understanding post-wildfire vegetation loss is critical for developing effective ecological recovery strategies and is often challenging due to the extended time and effort required to capture the evolving ecosystem features. Recent works in this area have not fully explored all the contributing factors, their modalities, and interactions with each other. Furthermore, most research in this domain is limited by a lack of interpretability in predictive modeling, making it less useful in real-world settings. In this work, we propose a novel end-to-end ML pipeline called MVeLMA (Multimodal Vegetation Loss Modeling Architecture) to predict county-wise vegetation loss from fire events. MVeLMA uses a multimodal feature integration pipeline and a stacked ensemble-based architecture to capture different modalities while also incorporating uncertainty estimation through probabilistic modeling. Through comprehensive experiments, we show that our model outperforms several state-of-the-art (SOTA) and baseline models in predicting post-wildfire vegetation loss. Furthermore, we generate vegetation loss confidence maps to identify high-risk counties, thereby helping targeted recovery efforts. The findings of this work have the potential to inform future disaster relief planning, ecological policy development, and wildlife recovery management.

[LG-13] Pairwise and Attribute-Aware Decision Tree-Based Preference Elicitation for Cold-Start Recommendation

Link: https://arxiv.org/abs/2510.27342
Authors: Alireza Gharahighehi, Felipe Kenji Nakano, Xuehua Yang, Wenhan Cu, Celine Vens
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Recommender systems (RSs) are intelligent filtering methods that suggest items to users based on their inferred preferences, derived from their interaction history on the platform. Collaborative filtering-based RSs rely on users’ past interactions to generate recommendations. However, when a user is new to the platform, referred to as a cold-start user, there is no historical data available, making it difficult to provide personalized recommendations. To address this, rating elicitation techniques can be used to gather initial ratings or preferences on selected items, helping to build an early understanding of the user’s tastes. Rating elicitation approaches are generally categorized into two types: non-personalized and personalized. Decision tree-based rating elicitation is a personalized method that queries users about their preferences at each node of the tree until sufficient information is gathered. In this paper, we propose an extension to the decision tree approach for rating elicitation in the context of music recommendation. Our method: (i) elicits not only item ratings but also preferences on attributes such as genres to better cluster users, and (ii) uses item pairs instead of single items at each node to more effectively learn user preferences. Experimental results demonstrate that both proposed enhancements lead to improved performance, particularly with a reduced number of queries.
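The pairwise variant of the elicitation tree is simple to picture: each internal node shows the cold-start user two items and routes on the stated preference until a leaf profile is reached. Everything below (items, routing, profiles) is an illustrative assumption.

```python
class PairNode:
    def __init__(self, item_a=None, item_b=None, left=None, right=None, profile=None):
        self.item_a, self.item_b = item_a, item_b
        self.left, self.right = left, right   # children: prefers A / prefers B
        self.profile = profile                # user cluster stored at a leaf

def elicit(node, prefers_a):
    """prefers_a(a, b) -> bool; in practice a UI callback asking the user."""
    while node.profile is None:
        node = node.left if prefers_a(node.item_a, node.item_b) else node.right
    return node.profile                       # cluster that seeds recommendations

root = PairNode("AC/DC - Back in Black", "Miles Davis - So What",
                left=PairNode(profile="rock-leaning"),
                right=PairNode(profile="jazz-leaning"))
print(elicit(root, lambda a, b: "AC/DC" in a))   # 'rock-leaning'
```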

[LG-14] Reasoning Models Sometimes Output Illegible Chains of Thought

Link: https://arxiv.org/abs/2510.27338
Authors: Arun Jose
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Language models trained via outcome-based reinforcement learning (RL) to reason using chain-of-thought (CoT) have shown remarkable performance. Monitoring such a model’s CoT may allow us to understand its intentions and detect potential malicious behavior. However, to be effective, this requires that CoTs are legible and faithful. We study CoT legibility across 14 reasoning models, finding that RL often causes reasoning to become illegible to both humans and AI monitors, with reasoning models (except Claude) generating illegible CoTs while returning to perfectly readable final answers. We show that models use illegible reasoning to reach correct answers (accuracy dropping by 53% when forced to use only legible portions), yet find no correlation between legibility and performance when resampling - suggesting the relationship is more nuanced. We also find that legibility degrades on harder questions. We discuss potential hypotheses for these results, including steganography, training artifacts, and vestigial tokens. These results suggest that without explicit optimization for legibility, outcome-based RL naturally produces models with increasingly opaque reasoning processes, potentially undermining monitoring approaches.

[LG-15] MedM2T: A MultiModal Framework for Time-Aware Modeling with Electronic Health Record and Electrocardiogram Data ALT

Link: https://arxiv.org/abs/2510.27321
Authors: Yu-Chen Kuo, Yi-Ju Tseng
Categories: Machine Learning (cs.LG)
Note: This preprint version of the manuscript has been submitted to the IEEE Journal of Biomedical and Health Informatics (JBHI) for review. The implementation of MedM2T is available at this https URL

Abstract:The inherent multimodality and heterogeneous temporal structures of medical data pose significant challenges for modeling. We propose MedM2T, a time-aware multimodal framework designed to address these complexities. MedM2T integrates: (i) Sparse Time Series Encoder to flexibly handle irregular and sparse time series, (ii) Hierarchical Time-Aware Fusion to capture both micro- and macro-temporal patterns from multiple dense time series, such as ECGs, and (iii) Bi-Modal Attention to extract cross-modal interactions, which can be extended to any number of modalities. To mitigate granularity gaps between modalities, MedM2T uses modality-specific pre-trained encoders and aligns resulting features within a shared encoder. We evaluated MedM2T on MIMIC-IV and MIMIC-IV-ECG datasets for three tasks that encompass chronic and acute disease dynamics: 90-day cardiovascular disease (CVD) prediction, in-hospital mortality prediction, and ICU length-of-stay (LOS) regression. MedM2T outperformed state-of-the-art multimodal learning frameworks and existing time series models, achieving an AUROC of 0.947 and an AUPRC of 0.706 for CVD prediction; an AUROC of 0.901 and an AUPRC of 0.558 for mortality prediction; and Mean Absolute Error (MAE) of 2.31 for LOS regression. These results highlight the robustness and broad applicability of MedM2T, positioning it as a promising tool in clinical prediction. We provide the implementation of MedM2T at this https URL.

[LG-16] Binary Anomaly Detection in Streaming IoT Traffic under Concept Drift

Link: https://arxiv.org/abs/2510.27304
Authors: Rodrigo Matos Carnier, Laura Lahesoo, Kensuke Fukuda
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Note: 6 pages, 11 figures, 3 tables

Abstract:With the growing volume of Internet of Things (IoT) network traffic, machine learning (ML)-based anomaly detection is more relevant than ever. Traditional batch learning models face challenges such as high maintenance and poor adaptability to rapid anomaly changes, known as concept drift. In contrast, streaming learning integrates online and incremental learning, enabling seamless updates and concept drift detection to improve robustness. This study investigates anomaly detection in streaming IoT traffic as binary classification, comparing batch and streaming learning approaches while assessing the limitations of current IoT traffic datasets. We simulated heterogeneous network data streams by carefully mixing existing datasets and streaming the samples one by one. Our results highlight the failure of batch models to handle concept drift, but also reveal that current datasets, due to their low traffic heterogeneity, remain limited in their ability to expose model weaknesses. We also investigated the competitiveness of tree-based ML algorithms, well known in batch anomaly detection, comparing them to non-tree-based ones and confirming the advantages of the former. Adaptive Random Forest achieved an F1-score of 0.990 ± 0.006 at one-third the computational cost of its batch counterpart. Hoeffding Adaptive Tree reached an F1-score of 0.910 ± 0.007, reducing computational cost by four times, making it a viable choice for online applications despite a slight trade-off in stability.
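
As a concrete illustration of the streaming (prequential, test-then-train) protocol the paper compares against batch learning, here is a minimal sketch assuming the open-source `river` library; the feature names and the injected drift are invented for the example, and module paths may vary across `river` versions.

```python
import numpy as np
from river import forest, metrics

# Prequential (test-then-train) evaluation on a simulated stream with an
# abrupt concept drift halfway through; features and thresholds are made up.
rng = np.random.default_rng(1)
model = forest.ARFClassifier(seed=42)   # Adaptive Random Forest + drift detectors
metric = metrics.F1()

for t in range(5000):
    mean = 3.0 if t > 2500 else 1.0      # the concept drifts at t = 2500
    x = {"pkt_rate": rng.normal(mean), "bytes": rng.normal()}
    y = int(x["pkt_rate"] + x["bytes"] > mean)
    y_pred = model.predict_one(x)        # test first ...
    if y_pred is not None:
        metric.update(y, y_pred)
    model.learn_one(x, y)                # ... then train

print(metric)
```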

[LG-17] Temporal Cardiovascular Dynamics for Improved PPG-Based Heart Rate Estimation

Link: https://arxiv.org/abs/2510.27297
Authors: Berken Utku Demirel, Christian Holz
Categories: Machine Learning (cs.LG)
Note: ArXiv version of the IEEE JBHI paper (this https URL)

Abstract:The oscillations of the human heart rate are inherently complex and non-linear – they are best described by mathematical chaos, and they present a challenge when applied to the practical domain of cardiovascular health monitoring in everyday life. In this work, we study the non-linear chaotic behavior of heart rate through mutual information and introduce a novel approach for enhancing heart rate estimation in real-life conditions. Our proposed approach not only explains and handles the non-linear temporal complexity from a mathematical perspective but also improves deep learning solutions when combined with them. We validate our proposed method on four established datasets from real-life scenarios and thoroughly compare its performance with existing algorithms through extensive ablation experiments. Our results demonstrate that the proposed approach improves heart rate estimation by up to 40% compared to traditional methods and existing machine-learning techniques, while reducing the reliance on multiple sensing modalities and eliminating the need for post-processing steps.
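
A standard way to quantify the temporal structure the abstract alludes to is the time-delayed mutual information of the heart-rate series (often used to pick an embedding delay for chaotic signals); the histogram estimator below is a generic sketch, not the paper's exact estimator.

```python
import numpy as np

def delayed_mutual_information(x, lag, bins=16):
    """Histogram estimate of I(x_t ; x_{t+lag}) for a 1-D series."""
    a, b = x[:-lag], x[lag:]
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return float((p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])).sum())

# Toy "heart rate" series: a noisy quasi-periodic oscillation.
rng = np.random.default_rng(2)
t = np.arange(5000)
hr = 70 + 5 * np.sin(0.05 * t) * np.sin(0.013 * t) + rng.normal(0, 1, t.size)

mi = [delayed_mutual_information(hr, lag) for lag in range(1, 200)]
print("MI-minimizing lag:", int(np.argmin(mi)) + 1)
```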

[LG-18] Traceable Drug Recommendation over Medical Knowledge Graphs CIKM2025

Link: https://arxiv.org/abs/2510.27274
Authors: Yu Lin, Zhen Jia, Philipp Christmann, Xu Zhang, Shengdong Du, Tianrui Li
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Note: Accepted to MediKS@CIKM2025

Abstract:Drug recommendation (DR) systems aim to support healthcare professionals in selecting appropriate medications based on patients’ medical conditions. State-of-the-art approaches utilize deep learning techniques for improving DR, but fall short in providing any insights on the derivation process of recommendations – a critical limitation in such high-stakes applications. We propose TraceDR, a novel DR system operating over a medical knowledge graph (MKG), which ensures access to large-scale and high-quality information. TraceDR simultaneously predicts drug recommendations and related evidence within a multi-task learning framework, enabling traceability of medication recommendations. To cover a more diverse set of diseases and drugs than existing works, we devise a framework for automatically constructing patient health records and release DrugRec, a new large-scale testbed for DR.

[LG-19] ODP-Bench: Benchmarking Out-of-Distribution Performance Prediction

Link: https://arxiv.org/abs/2510.27263
Authors: Han Yu, Kehan Li, Dongbai Li, Yue He, Xingxuan Zhang, Peng Cui
Categories: Machine Learning (cs.LG)
Note:

Abstract:Recently, growing attention has been paid to Out-of-Distribution (OOD) performance prediction, whose goal is to predict the performance of trained models on unlabeled OOD test datasets, so that we can better leverage and deploy off-the-shelf trained models in risk-sensitive scenarios. Although progress has been made in this area, evaluation protocols in previous literature are inconsistent, and most works cover only a limited number of real-world OOD datasets and types of distribution shifts. To provide convenient and fair comparisons for various algorithms, we propose the Out-of-Distribution Performance Prediction Benchmark (ODP-Bench), a comprehensive benchmark that includes the most commonly used OOD datasets and existing practical performance prediction algorithms. We provide our trained models as a testbench for future researchers, thus guaranteeing the consistency of comparison and avoiding the burden of repeating the model training process. Furthermore, we also conduct in-depth experimental analyses to better understand their capability boundary.
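
For readers unfamiliar with the task, a minimal example of an OOD performance predictor of the kind such a benchmark would evaluate is the average-confidence baseline: estimate accuracy on an unlabeled test set by the mean maximum softmax probability. The logits below are synthetic stand-ins, not outputs of any benchmarked model.

```python
import numpy as np

def average_confidence(logits):
    """Mean max-softmax probability: a crude estimate of test accuracy."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(p.max(axis=1).mean())

rng = np.random.default_rng(3)
logits_ood = rng.normal(0.0, 1.5, size=(1000, 10))   # stand-in model outputs
print(f"predicted accuracy on the OOD set: {average_confidence(logits_ood):.3f}")
```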

[LG-20] ECVL-ROUTER: Scenario-Aware Routing for Vision-Language Models

Link: https://arxiv.org/abs/2510.27256
Authors: Xin Tang, Youfang Han, Fangfei Gou, Wei Zhao, Xin Meng, Yang Yu, Jinguo Zhang, Yuanchun Shi, Yuntao Wang, Tengxiang Zhang
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Note: 23 pages, 13 figures, 7 tables

Abstract:Vision-Language Models (VLMs) excel in diverse multimodal tasks. However, user requirements vary across scenarios, which can be categorized into fast response, high-quality output, and low energy consumption. Relying solely on large models deployed in the cloud for all queries often leads to high latency and energy cost, while small models deployed on edge devices are capable of handling simpler tasks with low latency and energy cost. To fully leverage the strengths of both large and small models, we propose ECVL-ROUTER, the first scenario-aware routing framework for VLMs. Our approach introduces a new routing strategy and evaluation metrics that dynamically select the appropriate model for each query based on user requirements, maximizing overall utility. We also construct a multimodal response-quality dataset tailored for router training and validate the approach through extensive experiments. Results show that our approach successfully routes over 80% of queries to the small model while incurring less than a 10% drop in problem-solving probability.

[LG-21] FedSM: Robust Semantics-Guided Feature Mixup for Bias Reduction in Federated Learning with Long-Tail Data

Link: https://arxiv.org/abs/2510.27240
Authors: Jingrui Zhang, Yimeng Xu, Shujie Li, Feng Liang, Haihan Duan, Yanjie Dong, Victor C. M. Leung, Xiping Hu
Categories: Machine Learning (cs.LG)
Note:

Abstract:Federated Learning (FL) enables collaborative model training across decentralized clients without sharing private data. However, FL suffers from biased global models due to non-IID and long-tail data distributions. We propose FedSM, a novel client-centric framework that mitigates this bias through semantics-guided feature mixup and lightweight classifier retraining. FedSM uses a pretrained image-text-aligned model to compute category-level semantic relevance, guiding the category selection of local features to mix with global prototypes to generate class-consistent pseudo-features. These features correct classifier bias, especially when data are heavily skewed. To address the concern of potential domain shift between the pretrained model and the data, we propose probabilistic category selection, enhancing feature diversity to effectively mitigate biases. All computations are performed locally, requiring minimal server overhead. Extensive experiments on long-tail datasets with various imbalance levels demonstrate that FedSM consistently outperforms state-of-the-art methods in accuracy, with high robustness to domain shift and computational efficiency.

[LG-22] MDAS-GNN: Multi-Dimensional Spatiotemporal GNN with Spatial Diffusion for Urban Traffic Risk Forecasting

Link: https://arxiv.org/abs/2510.27197
Authors: Ziyuan Gao
Categories: Machine Learning (cs.LG)
Note:

Abstract:Traffic accidents represent a critical public health challenge, claiming over 1.35 million lives annually worldwide. Traditional accident prediction models treat road segments independently, failing to capture complex spatial relationships and temporal dependencies in urban transportation networks. This study develops MDAS-GNN, a Multi-Dimensional Attention-based Spatial-diffusion Graph Neural Network integrating three core risk dimensions: traffic safety, infrastructure, and environmental risk. The framework employs feature-specific spatial diffusion mechanisms and multi-head temporal attention to capture dependencies across different time horizons. Evaluated on UK Department for Transport accident data across Central London, South Manchester, and SE Birmingham, MDAS-GNN achieves superior performance compared to established baseline methods. The model maintains consistently low prediction errors across short, medium, and long-term periods, with particular strength in long-term forecasting. Ablation studies confirm that integrated multi-dimensional features outperform single-feature approaches, reducing prediction errors by up to 40%. This framework provides civil engineers and urban planners with advanced predictive capabilities for transportation infrastructure design, enabling data-driven decisions for road network optimization, infrastructure resource improvements, and strategic safety interventions in urban development projects.

[LG-23] SERFLOW: A Cross-Service Cost Optimization Framework for SLO-Aware Dynamic ML Inference

Link: https://arxiv.org/abs/2510.27182
Authors: Zongshun Zhang, Ibrahim Matta
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Note:

Abstract:Dynamic offloading of Machine Learning (ML) model partitions across different resource orchestration services, such as Function-as-a-Service (FaaS) and Infrastructure-as-a-Service (IaaS), can balance processing and transmission delays while minimizing costs of adaptive inference applications. However, prior work often overlooks real-world factors, such as Virtual Machine (VM) cold starts, requests under long-tail service time distributions, etc. To tackle these limitations, we model each ML query (request) as traversing an acyclic sequence of stages, wherein each stage constitutes a contiguous block of sparse model parameters ending in an internal or final classifier where requests may exit. Since input-dependent exit rates vary, no single resource configuration suits all query distributions. IaaS-based VMs become underutilized when many requests exit early, yet rapidly scaling to handle request bursts reaching deep layers is impractical. SERFLOW addresses this challenge by leveraging FaaS-based serverless functions (containers) and using stage-specific resource provisioning that accounts for the fraction of requests exiting at each stage. By integrating this provisioning with adaptive load balancing across VMs and serverless functions based on request ingestion, SERFLOW reduces cloud costs by over 23% while efficiently adapting to dynamic workloads.
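
The stage-wise provisioning argument can be made concrete with a little arithmetic: given the fraction of requests exiting at each stage, compare per-stage FaaS and IaaS costs. All prices, capacities, and exit rates below are made-up illustrative numbers, not figures from the paper.

```python
import math

# Per-stage cost comparison when requests can exit early.
exit_frac = [0.5, 0.3, 0.2]               # fraction of remaining requests exiting per stage
arrivals_per_s = 100.0
faas_cost_per_req = [1e-5, 2e-5, 4e-5]    # serverless: pay per request served
vm_cost_per_s = [2e-4, 3e-4, 6e-4]        # IaaS: pay per second per VM, regardless of load
vm_capacity_rps = [50.0, 40.0, 30.0]      # requests/s one VM absorbs at each stage

reach = 1.0                               # fraction of requests reaching this stage
for s, q in enumerate(exit_frac):
    load = arrivals_per_s * reach
    faas = load * faas_cost_per_req[s]                        # elastic with load
    iaas = math.ceil(load / vm_capacity_rps[s]) * vm_cost_per_s[s]
    choice = "FaaS" if faas < iaas else "IaaS"
    print(f"stage {s}: load={load:6.1f} rps  FaaS={faas:.5f}$/s  IaaS={iaas:.5f}$/s -> {choice}")
    reach *= 1.0 - q                                          # survivors continue deeper
```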

[LG-24] A Polynomial-time Algorithm for Online Sparse Linear Regression with Improved Regret Bound under Weaker Conditions COLT2025 DATE

Link: https://arxiv.org/abs/2510.27177
Authors: Junfan Li, Shizhong Liao, Zenglin Xu, Liqiang Nie
Categories: Machine Learning (cs.LG)
Note: A minor algorithmic error in our paper presented at COLT 2025 has been corrected in this arXiv update. We have also updated the pseudo-code of the algorithm. Our theoretical analyses, as well as all theoretical bounds, remain unaffected by those changes.

Abstract:In this paper, we study the problem of online sparse linear regression (OSLR) where the algorithms are restricted to accessing only $k$ out of $d$ attributes per instance for prediction, which was proved to be NP-hard. Previous work gave polynomial-time algorithms assuming the data matrix satisfies the linear independence of features, the compatibility condition, or the restricted isometry property. We introduce a new polynomial-time algorithm, which significantly improves previous regret bounds (Ito et al., 2017) under the compatibility condition, which is weaker than the other two assumptions. The improvements benefit from a tighter convergence rate of the $\ell_1$-norm error of our estimators. Our algorithm leverages the well-studied Dantzig Selector, but importantly with several novel techniques, including an algorithm-dependent sampling scheme for estimating the covariance matrix, an adaptive parameter tuning scheme, and a batching online Newton step with careful initializations. We also give novel and non-trivial analyses, including an induction method for analyzing the $\ell_1$-norm error, careful analyses on the covariance of non-independent random variables, and a decomposition of the regret. We further extend our algorithm to OSLR with additional observations where the algorithms can observe additional $k_0$ attributes after each prediction, and improve previous regret bounds (Kale et al., 2017; Ito et al., 2017).

[LG-25] Relation-Aware Bayesian Optimization of DBMS Configurations Guided by Affinity Scores

Link: https://arxiv.org/abs/2510.27145
Authors: Sein Kwon, Seulgi Baek, Hyunseo Yang, Youngwan Jo, Sanghyun Park
Categories: Machine Learning (cs.LG); Databases (cs.DB)
Note: 13 pages

Abstract:Database Management Systems (DBMSs) are fundamental for managing large-scale and heterogeneous data, and their performance is critically influenced by configuration parameters. Effective tuning of these parameters is essential for adapting to diverse workloads and maximizing throughput while minimizing latency. Recent research has focused on automated configuration optimization using machine learning; however, existing approaches still exhibit several key limitations. Most tuning frameworks disregard the dependencies among parameters, assuming that each operates independently. This simplification prevents optimizers from leveraging relational effects across parameters, limiting their capacity to capture performance-sensitive interactions. Moreover, to reduce the complexity of the high-dimensional search space, prior work often selects only the top few parameters for optimization, overlooking others that contribute meaningfully to performance. Bayesian Optimization (BO), the most common method for automatic tuning, is also constrained by its reliance on surrogate models, which can lead to unstable predictions and inefficient exploration. To overcome these limitations, we propose RelTune, a novel framework that represents parameter dependencies as a Relational Graph and learns GNN-based latent embeddings that encode performance-relevant semantics. RelTune further introduces Hybrid-Score-Guided Bayesian Optimization (HBO), which combines surrogate predictions with an Affinity Score measuring proximity to previously high-performing configurations. Experimental results on multiple DBMSs and workloads demonstrate that RelTune achieves faster convergence and higher optimization efficiency than conventional BO-based methods, achieving state-of-the-art performance across all evaluated scenarios.
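
A minimal sketch of a hybrid-score acquisition step of this general shape is shown below; the surrogate, the RBF affinity, and the 0.7/0.3 weighting are assumptions for illustration rather than RelTune's actual components.

```python
import numpy as np

rng = np.random.default_rng(4)

history_x = rng.uniform(0, 1, size=(30, 8))              # past configurations (normalized knobs)
history_y = -np.linalg.norm(history_x - 0.7, axis=1)     # made-up throughput signal
top_k = history_x[np.argsort(history_y)[-5:]]            # best configurations seen so far

def surrogate_mean(x):
    """Stand-in for the surrogate model's performance prediction."""
    return float(-np.linalg.norm(x - 0.7))

def affinity(x, anchors, gamma=5.0):
    """RBF-style proximity to previously high-performing configurations."""
    return float(np.exp(-gamma * np.linalg.norm(anchors - x, axis=1)).mean())

candidates = rng.uniform(0, 1, size=(256, 8))
hybrid = [0.7 * surrogate_mean(c) + 0.3 * affinity(c, top_k) for c in candidates]
next_config = candidates[int(np.argmax(hybrid))]
print("next configuration to benchmark:", np.round(next_config, 2))
```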

[LG-26] FairAD: Computationally Efficient Fair Graph Clustering via Algebraic Distance CIKM2025

Link: https://arxiv.org/abs/2510.27136
Authors: Minh Phu Vuong, Young-Ju Lee, Iván Ojeda-Ruiz, Chul-Ho Lee
Categories: Machine Learning (cs.LG)
Note: ACM CIKM 2025

Abstract:Due to the growing concern about unsavory behaviors of machine learning models toward certain demographic groups, the notion of ‘fairness’ has recently drawn much attention from the community, thereby motivating the study of fairness in graph clustering. Fair graph clustering aims to partition the set of nodes in a graph into k disjoint clusters such that the proportion of each protected group within each cluster is consistent with the proportion of that group in the entire dataset. It is, however, computationally challenging to incorporate fairness constraints into existing graph clustering algorithms, particularly for large graphs. To address this problem, we propose FairAD, a computationally efficient fair graph clustering method. It first constructs a new affinity matrix based on the notion of algebraic distance such that fairness constraints are imposed. A graph coarsening process is then performed on this affinity matrix to find representative nodes that correspond to k clusters. Finally, a constrained minimization problem is solved to obtain the solution of fair clustering. Experiment results on the modified stochastic block model and six public datasets show that FairAD can achieve fair clustering while being up to 40 times faster compared to state-of-the-art fair graph clustering algorithms.
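
For intuition, algebraic distance in the usual relaxation-based construction is computed by smoothing a few random vectors over the graph and measuring how far each edge's endpoints remain apart; the sketch below follows that generic recipe and does not reproduce the paper's fairness-aware affinity matrix.

```python
import numpy as np

def algebraic_distances(adj, n_sweeps=20, n_vectors=5, omega=0.5, seed=0):
    """Edge-wise algebraic distances via damped Jacobi smoothing of random vectors."""
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1)
    X = rng.uniform(-1, 1, size=(n, n_vectors))
    for _ in range(n_sweeps):
        X = (1 - omega) * X + omega * (adj @ X) / deg   # average over neighbors
    rows, cols = np.nonzero(np.triu(adj))
    return {(a, b): float(np.abs(X[a] - X[b]).max()) for a, b in zip(rows, cols)}

# Two triangles joined by one bridge edge: the bridge should look "far".
adj = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    adj[a, b] = adj[b, a] = 1.0
dist = algebraic_distances(adj)
print("most distant edge:", max(dist, key=dist.get))
```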

[LG-27] Exploring the Utilities of the Rationales from Large Language Models to Enhance Automated Essay Scoring

Link: https://arxiv.org/abs/2510.27131
Authors: Hong Jiao, Hanna Choi, Haowei Hua
Categories: Machine Learning (cs.LG)
Note: 12 pages, 3 figures

Abstract:This study explored the utilities of rationales generated by GPT-4.1 and GPT-5 in automated scoring using Prompt 6 essays from the 2012 Kaggle ASAP data. Essay-based scoring was compared with rationale-based scoring. The study found that, in general, essay-based scoring performed better than rationale-based scoring, with higher Quadratic Weighted Kappa (QWK). However, rationale-based scoring led to higher scoring accuracy in terms of F1 scores for score 0, which had less representation due to class imbalance issues. Ensembling the essay-based scoring models increased scoring accuracy both at specific score levels and across all score levels. Ensembling essay-based scoring with either rationale-based scoring model performed about the same. A further ensemble of essay-based scoring and both rationale-based scoring models yielded the best scoring accuracy, with a QWK of 0.870 compared with the 0.848 reported in the literature.
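
Quadratic Weighted Kappa and a simple score-level ensemble are easy to reproduce; the sketch below uses scikit-learn's `cohen_kappa_score` on synthetic scores (the real study ensembles model predictions, not noisy copies of the gold labels).

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(5)
human = rng.integers(0, 4, size=500)                               # gold scores 0-3
essay = np.clip(human + rng.integers(-1, 2, size=500), 0, 3)       # "essay-based" model
rationale = np.clip(human + rng.integers(-1, 2, size=500), 0, 3)   # "rationale-based" model
ensemble = np.rint((essay + rationale) / 2).astype(int)            # score-level ensemble

for name, pred in [("essay", essay), ("rationale", rationale), ("ensemble", ensemble)]:
    print(f"{name:9s} QWK = {cohen_kappa_score(human, pred, weights='quadratic'):.3f}")
```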

[LG-28] AI Agents in Drug Discovery

Link: https://arxiv.org/abs/2510.27130
Authors: Srijit Seal, Dinh Long Huynh, Moudather Chelbi, Sara Khosravi, Ankur Kumar, Mattson Thieme, Isaac Wilks, Mark Davies, Jessica Mustali, Yannick Sun, Nick Edwards, Daniil Boiko, Andrei Tyrin, Douglas W. Selinger, Ayaan Parikh, Rahul Vijayan, Shoman Kasbekar, Dylan Reid, Andreas Bender, Ola Spjuth
Categories: Machine Learning (cs.LG)
Note: 45 pages, 12 figures

Abstract:Artificial intelligence (AI) agents are emerging as transformative tools in drug discovery, with the ability to autonomously reason, act, and learn through complicated research workflows. Building on large language models (LLMs) coupled with perception, computation, action, and memory tools, these agentic AI systems could integrate diverse biomedical data, execute tasks, carry out experiments via robotic platforms, and iteratively refine hypotheses in closed loops. We provide a conceptual and technical overview of agentic AI architectures, ranging from ReAct and Reflection to Supervisor and Swarm systems, and illustrate their applications across key stages of drug discovery, including literature synthesis, toxicity prediction, automated protocol generation, small-molecule synthesis, drug repurposing, and end-to-end decision-making. To our knowledge, this represents the first comprehensive work to present real-world implementations and quantifiable impacts of agentic AI systems deployed in operational drug discovery settings. Early implementations demonstrate substantial gains in speed, reproducibility, and scalability, compressing workflows that once took months into hours while maintaining scientific traceability. We discuss the current challenges related to data heterogeneity, system reliability, privacy, and benchmarking, and outline future directions towards technology in support of science and translation.

[LG-29] Group-Sensitive Offline Contextual Bandits

Link: https://arxiv.org/abs/2510.27123
Authors: Yihong Guo, Junjie Luo, Guodong Gao, Ritu Agarwal, Anqi Liu
Categories: Machine Learning (cs.LG)
Note:

Abstract:Offline contextual bandits allow one to learn policies from historical/offline data without requiring online interaction. However, offline policy optimization that maximizes overall expected rewards can unintentionally amplify the reward disparities across groups. As a result, some groups might benefit more than others from the learned policy, raising concerns about fairness, especially when the resources are limited. In this paper, we study a group-sensitive fairness constraint in offline contextual bandits, reducing group-wise reward disparities that may arise during policy learning. We tackle the following common-parity requirements: the reward disparity is constrained within some user-defined threshold or the reward disparity should be minimized during policy optimization. We propose a constrained offline policy optimization framework by introducing group-wise reward disparity constraints into an off-policy gradient-based optimization procedure. To improve the estimation of the group-wise reward disparity during training, we employ a doubly robust estimator and further provide a convergence guarantee for policy optimization. Empirical results in synthetic and real-world datasets demonstrate that our method effectively reduces reward disparities while maintaining competitive overall performance.
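
The doubly robust ingredient can be sketched directly: combine a reward-model (direct) term with an importance-weighted residual, evaluated separately per group. Everything below, including the logging policy and reward model, is synthetic for illustration.

```python
import numpy as np

def dr_group_value(actions, rewards, propensities, pi_all, q_hat, mask):
    """Doubly robust value estimate of a target policy within one group.

    actions, rewards, propensities : logged bandit data (behavior policy)
    pi_all : (n, K) target-policy action probabilities per context
    q_hat  : (n, K) reward-model predictions per context and action
    mask   : boolean group membership"""
    direct = (pi_all[mask] * q_hat[mask]).sum(axis=1)              # model-based term
    w = pi_all[mask, actions[mask]] / propensities[mask]           # importance weight
    correction = w * (rewards[mask] - q_hat[mask, actions[mask]])  # IPS residual
    return float((direct + correction).mean())

# Toy logged data for two groups with opposite reward structure.
rng = np.random.default_rng(6)
n, K = 10_000, 3
actions = rng.integers(0, K, n)
propensities = np.full(n, 1.0 / K)                  # uniform logging policy
group = rng.integers(0, 2, n).astype(bool)
true_q = np.where(group[:, None], [0.2, 0.5, 0.8], [0.8, 0.5, 0.2])
rewards = rng.binomial(1, true_q[np.arange(n), actions]).astype(float)
pi_all = np.tile([0.1, 0.1, 0.8], (n, 1))           # target policy favors action 2
q_hat = true_q + rng.normal(0, 0.05, (n, K))        # imperfect reward model

v1 = dr_group_value(actions, rewards, propensities, pi_all, q_hat, group)
v0 = dr_group_value(actions, rewards, propensities, pi_all, q_hat, ~group)
print(f"group disparity |V1 - V0| = {abs(v1 - v0):.3f}")
```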

[LG-30] Learning Generalizable Visuomotor Policy through Dynamics-Alignment

Link: https://arxiv.org/abs/2510.27114
Authors: Dohyeok Lee, Jung Min Lee, Munkyung Kim, Seokhun Ju, Jin Woo Koo, Kyungjae Lee, Dohyeong Kim, TaeHyun Cho, Jungwoo Lee
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Note: 9 pages, 6 figures

Abstract:Behavior cloning methods for robot learning suffer from poor generalization due to limited data support beyond expert demonstrations. Recent approaches leveraging video prediction models have shown promising results by learning rich spatiotemporal representations from large-scale datasets. However, these models learn action-agnostic dynamics that cannot distinguish between different control inputs, limiting their utility for precise manipulation tasks and requiring large pretraining datasets. We propose a Dynamics-Aligned Flow Matching Policy (DAP) that integrates dynamics prediction into policy learning. Our method introduces a novel architecture where policy and dynamics models provide mutual corrective feedback during action generation, enabling self-correction and improved generalization. Empirical validation demonstrates generalization performance superior to baseline methods on real-world robotic manipulation tasks, showing particular robustness in OOD scenarios including visual distractions and lighting variations.

[LG-31] Hierarchical Bayesian Model for Gene Deconvolution and Functional Analysis in Human Endometrium Across the Menstrual Cycle

Link: https://arxiv.org/abs/2510.27097
Authors: Crystal Su, Kuai Yu, Mingyuan Shao, Daniel Bauer
Categories: Machine Learning (cs.LG); Genomics (q-bio.GN)
Note:

Abstract:Bulk tissue RNA sequencing of heterogeneous samples provides averaged gene expression profiles, obscuring cell type-specific dynamics. To address this, we present a probabilistic hierarchical Bayesian model that deconvolves bulk RNA-seq data into constituent cell-type expression profiles and proportions, leveraging a high-resolution single-cell reference. We apply our model to human endometrial tissue across the menstrual cycle, a context characterized by dramatic hormone-driven cellular composition changes. Our extended framework provides a principled inference of cell type proportions and cell-specific gene expression changes across cycle phases. We demonstrate the model’s structure, priors, and inference strategy in detail, and we validate its performance with simulations and comparisons to existing methods. The results reveal dynamic shifts in epithelial, stromal, and immune cell fractions between menstrual phases, and identify cell-type-specific differential gene expression associated with endometrial function (e.g., decidualization markers in stromal cells during the secretory phase). We further conduct robustness tests and show that our Bayesian approach is resilient to reference mismatches and noise. Finally, we discuss the biological significance of our findings, potential clinical implications for fertility and endometrial disorders, and future directions, including integration of spatial transcriptomics.
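
A simplest-possible point of comparison for this kind of deconvolution is non-negative least squares against reference signatures; the sketch below (using `scipy.optimize.nnls`) illustrates the problem setup, not the paper's hierarchical Bayesian model.

```python
import numpy as np
from scipy.optimize import nnls

# Estimate cell-type proportions of one bulk sample from a single-cell
# reference of per-type expression signatures (all numbers are synthetic).
rng = np.random.default_rng(7)
n_genes, n_types = 500, 4
signatures = rng.gamma(2.0, 1.0, size=(n_genes, n_types))    # reference profiles
true_props = np.array([0.5, 0.3, 0.15, 0.05])                # e.g. epithelial/stromal/immune/...
bulk = signatures @ true_props + rng.normal(0, 0.1, n_genes) # noisy bulk mixture

coef, _ = nnls(signatures, bulk)
props = coef / coef.sum()                                    # normalize to proportions
print(np.round(props, 3))   # close to true_props
```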

[LG-32] Functional embeddings enable Aggregation of multi-area SEEG recordings over subjects and sessions ICLR2026

Link: https://arxiv.org/abs/2510.27090
Authors: Sina Javadzadeh, Rahil Soroushmojdehi, S. Alireza Seyyed Mousavi, Mehrnaz Asadi, Sumiko Abe, Terence D. Sanger
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
Note: Submitted to ICLR 2026

Abstract:Aggregating intracranial recordings across subjects is challenging since electrode count, placement, and covered regions vary widely. Spatial normalization methods like MNI coordinates offer a shared anatomical reference, but often fail to capture true functional similarity, particularly when localization is imprecise; even at matched anatomical coordinates, the targeted brain region and underlying neural dynamics can differ substantially between individuals. We propose a scalable representation-learning framework that (i) learns a subject-agnostic functional identity for each electrode from multi-region local field potentials using a Siamese encoder with contrastive objectives, inducing an embedding geometry that is locality-sensitive to region-specific neural signatures, and (ii) tokenizes these embeddings for a transformer that models inter-regional relationships with a variable number of channels. We evaluate this framework on a 20-subject dataset spanning basal ganglia-thalamic regions collected during flexible rest/movement recording sessions with heterogeneous electrode layouts. The learned functional space supports accurate within-subject discrimination and forms clear, region-consistent clusters; it transfers zero-shot to unseen channels. The transformer, operating on functional tokens without subject-specific heads or supervision, captures cross-region dependencies and enables reconstruction of masked channels, providing a subject-agnostic backbone for downstream decoding. Together, these results indicate a path toward large-scale, cross-subject aggregation and pretraining for intracranial neural data where strict task structure and uniform sensor placement are unavailable.

[LG-33] Towards Understanding Self-play for LLM Reasoning

Link: https://arxiv.org/abs/2510.27072
Authors: Justin Yang Chae, Md Tanvirul Alam, Nidhi Rastogi
Categories: Machine Learning (cs.LG)
Note:

Abstract:Recent advances in large language model (LLM) reasoning, led by reinforcement learning with verifiable rewards (RLVR), have inspired self-play post-training, where models improve by generating and solving their own problems. While self-play has shown strong in-domain and out-of-domain gains, the mechanisms behind these improvements remain poorly understood. In this work, we analyze the training dynamics of self-play through the lens of the Absolute Zero Reasoner, comparing it against RLVR and supervised fine-tuning (SFT). Our study examines parameter update sparsity, entropy dynamics of token distributions, and alternative proposer reward functions. We further connect these dynamics to reasoning performance using pass@k evaluations. Together, our findings clarify how self-play differs from other post-training strategies, highlight its inherent limitations, and point toward future directions for improving LLM math reasoning through self-play.

[LG-34] MLPerf Automotive

Link: https://arxiv.org/abs/2510.27065
Authors: Radoyeh Shojaei, Predrag Djurdjevic, Mostafa El-Khamy, James Goel, Kasper Mecklenburg, John Owens, Pınar Muyan-Özçelik, Tom St. John, Jinho Suh, Arjun Suresh
Categories: Machine Learning (cs.LG); Performance (cs.PF)
Note: 16 pages, 5 figures, 6 tables

Abstract:We present MLPerf Automotive, the first standardized public benchmark for evaluating Machine Learning systems that are deployed for AI acceleration in automotive systems. Developed through a collaborative partnership between MLCommons and the Autonomous Vehicle Computing Consortium, this benchmark addresses the need for standardized performance evaluation methodologies in automotive machine learning systems. Existing benchmark suites cannot be utilized for these systems since automotive workloads have unique constraints, including safety and real-time processing, that distinguish them from the domains targeted by previously introduced benchmarks. Our benchmarking framework provides latency and accuracy metrics along with evaluation protocols that enable consistent and reproducible performance comparisons across different hardware platforms and software implementations. The first iteration of the benchmark consists of automotive perception tasks in 2D object detection, 2D semantic segmentation, and 3D object detection. We describe the methodology behind the benchmark design, including the task selection, reference models, and submission rules. We also discuss the first round of benchmark submissions and the challenges involved in acquiring the datasets and the engineering efforts to develop the reference implementations. Our benchmark code is available at this https URL.

[LG-35] Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning

Link: https://arxiv.org/abs/2510.27044
Authors: Md Tanvirul Alam, Nidhi Rastogi
Categories: Machine Learning (cs.LG)
Note:

Abstract:Mathematical reasoning is a central challenge for large language models (LLMs), requiring not only correct answers but also faithful reasoning processes. Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach for enhancing such capabilities; however, its ability to foster genuine reasoning remains unclear. We investigate RLVR on two combinatorial problems with fully verifiable solutions: *Activity Scheduling* and the *Longest Increasing Subsequence*, using carefully curated datasets with unique optima. Across multiple reward designs, we find that RLVR improves evaluation metrics but often by reinforcing superficial heuristics rather than acquiring new reasoning strategies. These findings highlight the limits of RLVR generalization, emphasizing the importance of benchmarks that disentangle genuine mathematical reasoning from shortcut exploitation and provide faithful measures of progress. Code available at this https URL.
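
Both testbed problems have classical exact solutions, which is what makes the rewards verifiable. A sketch for Activity Scheduling, with a checker that could serve as a binary reward, is shown below; the reward format is an assumption for illustration, not the paper's exact design.

```python
def max_activities(intervals):
    """Classic greedy optimum: sort by finish time, take compatible activities."""
    chosen, last_end = [], float("-inf")
    for start, end in sorted(intervals, key=lambda it: it[1]):
        if start >= last_end:
            chosen.append((start, end))
            last_end = end
    return chosen

def reward(intervals, proposed):
    """Binary verifiable reward: 1 iff `proposed` (sorted by time) is a valid
    schedule of maximum size, else 0."""
    valid = all(a[1] <= b[0] for a, b in zip(proposed, proposed[1:]))
    valid &= all(p in intervals for p in proposed)
    return int(valid and len(proposed) == len(max_activities(intervals)))

acts = [(1, 4), (3, 5), (0, 6), (5, 7), (3, 9), (5, 9), (6, 10), (8, 11)]
optimum = max_activities(acts)
print(optimum, "reward:", reward(acts, optimum))
```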

[LG-36] Domain decomposition architectures and Gauss-Newton training for physics-informed neural networks

Link: https://arxiv.org/abs/2510.27018
Authors: Alexander Heinlein, Taniya Kapoor
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Note: 9 pages, 5 figures

Abstract:Approximating the solutions of boundary value problems governed by partial differential equations with neural networks is challenging, largely due to the difficult training process. This difficulty can be partly explained by the spectral bias, that is, the slower convergence of high-frequency components, and can be mitigated by localizing neural networks via (overlapping) domain decomposition. We combine this localization with the Gauss-Newton method as the optimizer to obtain faster convergence than gradient-based schemes such as Adam; this comes at the cost of solving an ill-conditioned linear system in each iteration. Domain decomposition induces a block-sparse structure in the otherwise dense Gauss-Newton system, reducing the computational cost per iteration. Our numerical results indicate that combining localization and Gauss-Newton optimization is promising for neural network-based solvers for partial differential equations.
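
For readers who want the optimizer spelled out: one damped Gauss-Newton step for a least-squares loss looks as follows, here on a toy curve-fitting problem standing in for a PDE residual (the damping constant and problem are illustrative assumptions).

```python
import numpy as np

def gauss_newton_step(residual, jac, theta, damping=1e-8):
    """One damped Gauss-Newton update for the loss 0.5 * ||r(theta)||^2."""
    r = residual(theta)
    J = jac(theta)                              # (n_residuals, n_params)
    H = J.T @ J + damping * np.eye(J.shape[1])  # ill-conditioned without damping
    return theta - np.linalg.solve(H, J.T @ r)

# Toy problem: fit y = exp(a*x) + b to noisy data.
rng = np.random.default_rng(8)
x = np.linspace(0, 1, 50)
y = np.exp(0.8 * x) + 0.3 + rng.normal(0, 0.01, x.size)
residual = lambda th: np.exp(th[0] * x) + th[1] - y
jac = lambda th: np.stack([x * np.exp(th[0] * x), np.ones_like(x)], axis=1)

theta = np.array([0.0, 0.0])
for _ in range(10):
    theta = gauss_newton_step(residual, jac, theta)
print(np.round(theta, 3))   # close to (0.8, 0.3)
```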

[LG-37] Quantitative Bounds for Length Generalization in Transformers

Link: https://arxiv.org/abs/2510.27015
Authors: Zachary Izzo, Eshaan Nichani, Jason D. Lee
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Note: Equal contribution, order determined by coin flip

Abstract:We study the problem of length generalization (LG) in transformers: the ability of a model trained on shorter sequences to maintain performance when evaluated on much longer, previously unseen inputs. Prior work by Huang et al. (2025) established that transformers eventually achieve length generalization once the training sequence length exceeds some finite threshold, but left open the question of how large it must be. In this work, we provide the first quantitative bounds on the required training length for length generalization to occur. Motivated by previous empirical and theoretical work, we analyze LG in several distinct problem settings: $\ell_\infty$ error control vs. average error control over an input distribution, infinite-precision softmax attention vs. finite-precision attention (which reduces to an argmax) in the transformer, and one- vs. two-layer transformers. In all scenarios, we prove that LG occurs when the internal behavior of the transformer on longer sequences can be “simulated” by its behavior on shorter sequences seen during training. Our bounds give qualitative estimates for the length of training data required for a transformer to generalize, and we verify these insights empirically. These results sharpen our theoretical understanding of the mechanisms underlying extrapolation in transformers, and formalize the intuition that richer training data is required for generalization on more complex tasks.

[LG-38] Enhancing Sentiment Classification with Machine Learning and Combinatorial Fusion

Link: https://arxiv.org/abs/2510.27014
Authors: Sean Patten, Pin-Yu Chen, Christina Schweikert, D. Frank Hsu
Categories: Machine Learning (cs.LG)
Note: IEEE PICom 2025

Abstract:This paper presents a novel approach to sentiment classification that applies Combinatorial Fusion Analysis (CFA) to integrate an ensemble of diverse machine learning models, achieving state-of-the-art accuracy of 97.072% on the IMDB sentiment analysis dataset. CFA leverages the concept of cognitive diversity, which utilizes rank-score characteristic functions to quantify the dissimilarity between models and strategically combine their predictions. This is in contrast to the common practice of scaling the size of individual models, and is thus comparatively efficient in its use of computing resources. Experimental results also indicate that CFA outperforms traditional ensemble methods by effectively computing and employing model diversity. The approach in this paper implements the combination of a transformer-based model of the RoBERTa architecture with traditional machine learning models, including Random Forest, SVM, and XGBoost.
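
The rank-score characteristic (RSC) function and the cognitive-diversity measure built on it can be sketched in a few lines; the beta-distributed scores below are synthetic stand-ins for the classifiers' outputs, and the RMS distance is one common choice of RSC comparison.

```python
import numpy as np

def rank_score_characteristic(scores):
    """RSC function: scores sorted from highest to lowest (score vs. rank)."""
    return np.sort(scores)[::-1]

def cognitive_diversity(scores_a, scores_b):
    """RMS distance between two models' RSC functions."""
    fa = rank_score_characteristic(scores_a)
    fb = rank_score_characteristic(scores_b)
    return float(np.sqrt(np.mean((fa - fb) ** 2)))

rng = np.random.default_rng(9)
model_a = rng.beta(8, 2, size=1000)    # synthetic positive-class scores
model_b = rng.beta(4, 4, size=1000)
print(f"cognitive diversity: {cognitive_diversity(model_a, model_b):.3f}")

# When diversity is high, simple score combination tends to help the ensemble.
fused = (model_a + model_b) / 2
```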

[LG-39] Mixture-of-Transformers Learn Faster: A Theoretical Study on Classification Problems

Link: https://arxiv.org/abs/2510.27004
Authors: Hongbo Li, Qinhang Wu, Sen Lin, Yingbin Liang, Ness B. Shroff
Categories: Machine Learning (cs.LG)
Note:

Abstract:Mixture-of-Experts (MoE) models improve transformer efficiency but lack a unified theoretical explanation, especially when both feed-forward and attention layers are allowed to specialize. To this end, we study the Mixture-of-Transformers (MoT), a tractable theoretical framework in which each transformer block acts as an expert governed by a continuously trained gating network. This design allows us to isolate and study the core learning dynamics of expert specialization and attention alignment. In particular, we develop a three-stage training algorithm with continuous training of the gating network, and show that each transformer expert specializes in a distinct class of tasks and that the gating network accurately routes data samples to the correct expert. Our analysis shows how expert specialization reduces gradient conflicts and makes each subtask strongly convex. We prove that the training drives the expected prediction loss to near zero in $O(\log(\epsilon^{-1}))$ iteration steps, significantly improving over the $O(\epsilon^{-1})$ rate for a single transformer. We further validate our theoretical findings through extensive real-data experiments, demonstrating the practical effectiveness of MoT. Together, these results offer the first unified theoretical account of transformer-level specialization and learning dynamics, providing practical guidance for designing efficient large-scale models.

[LG-40] Are Online Sports Fan Communities Becoming More Offensive? A Quantitative Review of Topics Trends and Toxicity of r/PremierLeague

Link: https://arxiv.org/abs/2510.27003
Authors: Muhammad Zeeshan Mazhar, Tolga Buz, Yiran Su
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)
Note:

Abstract:Online communities for sports fans have surged in popularity, with Reddit’s r/PremierLeague emerging as a focal point for fans of one of the globe’s most celebrated sports leagues. This boom has helped the Premier League make significant inroads into the US market, increasing viewership and sparking greater interest in its matches. Despite the league’s broad appeal, there’s still a notable gap in understanding its online fan community. Therefore, we analyzed a substantial dataset of over 1.1 million comments posted from 2013-2022 on r/PremierLeague. Our study delves into the sentiment, topics, and toxicity of these discussions, tracking trends over time, aiming to map out the conversation landscape. The rapid expansion has brought more diverse discussions, but also a worrying rise in negative sentiment and toxicity. Additionally, the subreddit has become a venue for users to voice frustrations about broader societal issues like racism, the COVID-19 pandemic, and political tensions.

[LG-41] Gradient Descent as Loss Landscape Navigation: a Normative Framework for Deriving Learning Rules NEURIPS2025

Link: https://arxiv.org/abs/2510.26997
Authors: John J. Vastola, Samuel J. Gershman, Kanaka Rajan
Categories: Machine Learning (cs.LG)
Note: Accepted to NeurIPS 2025

Abstract:Learning rules – prescriptions for updating model parameters to improve performance – are typically assumed rather than derived. Why do some learning rules work better than others, and under what assumptions can a given rule be considered optimal? We propose a theoretical framework that casts learning rules as policies for navigating (partially observable) loss landscapes, and identifies optimal rules as solutions to an associated optimal control problem. A range of well-known rules emerge naturally within this framework under different assumptions: gradient descent from short-horizon optimization, momentum from longer-horizon planning, natural gradients from accounting for parameter space geometry, non-gradient rules from partial controllability, and adaptive optimizers like Adam from online Bayesian inference of loss landscape shape. We further show that continual learning strategies like weight resetting can be understood as optimal responses to task uncertainty. By unifying these phenomena under a single objective, our framework clarifies the computational structure of learning and offers a principled foundation for designing adaptive algorithms.

[LG-42] HADSF: Aspect Aware Semantic Control for Explainable Recommendation WSDM2026

Link: https://arxiv.org/abs/2510.26994
Authors: Zheng Nie, Peijie Sun
Categories: Machine Learning (cs.LG)
Note: Accepted by WSDM 2026

Abstract:Recent advances in large language models (LLMs) promise more effective information extraction for review-based recommender systems, yet current methods still (i) mine free-form reviews without scope control, producing redundant and noisy representations, (ii) lack principled metrics that link LLM hallucination to downstream effectiveness, and (iii) leave the cost-quality trade-off across model scales largely unexplored. We address these gaps with the Hyper-Adaptive Dual-Stage Semantic Framework (HADSF), a two-stage approach that first induces a compact, corpus-level aspect vocabulary via adaptive selection and then performs vocabulary-guided, explicitly constrained extraction of structured aspect-opinion triples. To assess the fidelity of the resulting representations, we introduce Aspect Drift Rate (ADR) and Opinion Fidelity Rate (OFR) and empirically uncover a nonmonotonic relationship between hallucination severity and rating prediction error. Experiments on approximately 3 million reviews across LLMs spanning 1.5B-70B parameters show that, when integrated into standard rating predictors, HADSF yields consistent reductions in prediction error and enables smaller models to achieve competitive performance in representative deployment scenarios. We release code, data pipelines, and metric implementations to support reproducible research on hallucination-aware, LLM-enhanced explainable recommendation. Code is available at this https URL
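
As an illustration of what an Aspect Drift Rate-style metric might measure, the sketch below counts extracted aspects that fall outside the induced vocabulary; this reading of ADR is an assumption for illustration, and the paper's formal definition may differ.

```python
# Hypothetical reading of an Aspect Drift Rate (ADR)-style metric: the
# fraction of extracted aspect terms drifting outside the induced
# corpus-level aspect vocabulary. All terms below are invented examples.
aspect_vocabulary = {"battery life", "screen", "price", "shipping"}

extracted_triples = [
    ("battery life", "lasts long", "positive"),
    ("screen", "too dim", "negative"),
    ("warranty", "generous", "positive"),   # outside the vocabulary -> drift
]

drifted = [t for t in extracted_triples if t[0] not in aspect_vocabulary]
adr = len(drifted) / len(extracted_triples)
print(f"ADR = {adr:.2f}")   # 0.33 in this toy example
```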

[LG-43] Predicting Household Water Consumption Using Satellite and Street View Images in Two Indian Cities

Link: https://arxiv.org/abs/2510.26957
Authors: Qiao Wang, Joseph George
Categories: Machine Learning (cs.LG); General Economics (econ.GN)
Note:

Abstract:Monitoring household water use in rapidly urbanizing regions is hampered by costly, time-intensive enumeration methods and surveys. We investigate whether publicly available imagery (satellite tiles and Google Street View (GSV) segmentation) and simple geospatial covariates (nightlight intensity, population density) can be utilized to predict household water consumption in Hubballi-Dharwad, India. We compare four approaches: survey features (benchmark), CNN embeddings (satellite, GSV, combined), and GSV semantic maps with auxiliary data. Under an ordinal classification framework, GSV segmentation plus remote-sensing covariates achieves 0.55 accuracy for water use, approaching survey-based models (0.59 accuracy). Error analysis shows high precision at the extremes of the household water consumption distribution, with confusion among middle classes due to overlapping visual proxies. We also compare and contrast our estimates of household water consumption with those of household subjective income. Our findings demonstrate that open-access imagery, coupled with minimal geospatial data, offers a promising alternative to surveys for obtaining reliable household water consumption estimates in urban analytics.

[LG-44] MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models NEURIPS2025

Link: https://arxiv.org/abs/2510.26937
Authors: Zimeng Huang, Jinxin Ke, Xiaoxuan Fan, Yufeng Yang, Yang Liu, Liu Zhonghan, Zedi Wang, Junteng Dai, Haoyi Jiang, Yuyu Zhou, Keze Wang, Ziliang Chen
Categories: Machine Learning (cs.LG)
Note: NeurIPS 2025 Datasets and Benchmarks Track poster

Abstract:Large Vision-Language Models (LVLMs) have exhibited remarkable progress. However, deficiencies remain compared to human intelligence, such as hallucination and shallow pattern matching. In this work, we aim to evaluate a fundamental yet underexplored intelligence: association, a cornerstone of human cognition for creative thinking and knowledge integration. Current benchmarks, often limited to closed-ended tasks, fail to capture the complexity of open-ended association reasoning vital for real-world applications. To address this, we present MM-OPERA, a systematic benchmark with 11,497 instances across two open-ended tasks: Remote-Item Association (RIA) and In-Context Association (ICA), aligning association intelligence evaluation with human psychometric principles. It challenges LVLMs to resemble the spirit of divergent thinking and convergent associative reasoning through free-form responses and explicit reasoning paths. We deploy tailored LLM-as-a-Judge strategies to evaluate open-ended outputs, applying process-reward-informed judgment to dissect reasoning with precision. Extensive empirical studies on state-of-the-art LVLMs, including sensitivity analysis of task instances, validity analysis of LLM-as-a-Judge strategies, and diversity analysis across abilities, domains, languages, cultures, etc., provide a comprehensive and nuanced understanding of the limitations of current LVLMs in associative reasoning, paving the way for more human-like and general-purpose AI. The dataset and code are available at this https URL.

[LG-45] Discovering EV Charging Site Archetypes Through Few Shot Forecasting: The First U.S.-Wide Study NEURIPS2025

Link: https://arxiv.org/abs/2510.26910
Authors: Kshitij Nikhal, Luke Ackerknecht, Benjamin S. Riggan, Phil Stahlfeld
Categories: Machine Learning (cs.LG)
Note: Tackling Climate Change with Machine Learning: Workshop at NeurIPS 2025

Abstract:The decarbonization of transportation relies on the widespread adoption of electric vehicles (EVs), which requires an accurate understanding of charging behavior to ensure cost-effective, grid-resilient infrastructure. Existing work is constrained by small-scale datasets, simple proximity-based modeling of temporal dependencies, and weak generalization to sites with limited operational history. To overcome these limitations, this work proposes a framework that integrates clustering with few-shot forecasting to uncover site archetypes using a novel large-scale dataset of charging demand. The results demonstrate that archetype-specific expert models outperform global baselines in forecasting demand at unseen sites. By establishing forecast performance as a basis for infrastructure segmentation, we generate actionable insights that enable operators to lower costs, optimize energy and pricing strategies, and support grid resilience critical to climate goals.

[LG-46] Integrating Ontologies with Large Language Models for Enhanced Control Systems in Chemical Engineering

Link: https://arxiv.org/abs/2510.26898
Authors: Crystal Su, Kuai Yu, Jingrui Zhang, Mingyuan Shao, Daniel Bauer
Categories: Machine Learning (cs.LG)
Note: Presented as a talk at the 2025 AIChE Annual Conference

Abstract:This work presents an ontology-integrated large language model (LLM) framework for chemical engineering that unites structured domain knowledge with generative reasoning. The proposed pipeline aligns model training and inference with the COPE ontology through a sequence of data acquisition, semantic preprocessing, information extraction, and ontology mapping steps, producing templated question-answer pairs that guide fine-tuning. A control-focused decoding stage and citation gate enforce syntactic and factual grounding by constraining outputs to ontology-linked terms, while evaluation metrics quantify both linguistic quality and ontological accuracy. Feedback and future extensions, including semantic retrieval and iterative validation, further enhance the system’s interpretability and reliability. This integration of symbolic structure and neural generation provides a transparent, auditable approach for applying LLMs to process control, safety analysis, and other critical engineering contexts.

[LG-47] SmoothGuard: Defending Multimodal Large Language Models with Noise Perturbation and Clustering Aggregation

Link: https://arxiv.org/abs/2510.26830
Authors: Guangzhi Su, Shuchang Huang, Yutong Ke, Zhuohang Liu, Long Qian, Kaizhu Huang
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Note:

Abstract:Multimodal large language models (MLLMs) have achieved impressive performance across diverse tasks by jointly reasoning over textual and visual inputs. Despite their success, these models remain highly vulnerable to adversarial manipulations, raising concerns about their safety and reliability in deployment. In this work, we first generalize an approach for generating adversarial images within the HuggingFace ecosystem and then introduce SmoothGuard, a lightweight and model-agnostic defense framework that enhances the robustness of MLLMs through randomized noise injection and clustering-based prediction aggregation. Our method perturbs continuous modalities (e.g., images and audio) with Gaussian noise, generates multiple candidate outputs, and applies embedding-based clustering to filter out adversarially influenced predictions. The final answer is selected from the majority cluster, ensuring stable responses even under malicious perturbations. Extensive experiments on POPE, LLaVA-Bench (In-the-Wild), and MM-SafetyBench demonstrate that SmoothGuard improves resilience to adversarial attacks while maintaining competitive utility. Ablation studies further identify an optimal noise range (0.1-0.2) that balances robustness and utility.
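
The defense's two ingredients, randomized noise injection and cluster-based answer aggregation, can be sketched generically as below; the embeddings, answers, and similarity threshold are toy stand-ins for real MLLM outputs, and the greedy clustering is an illustrative simplification.

```python
import numpy as np

rng = np.random.default_rng(10)

def noisy_copies(image, n=8, sigma=0.15):
    """Randomized smoothing of a continuous input: n Gaussian-perturbed copies.
    In a real pipeline, each copy is fed to the MLLM to get a candidate answer."""
    return [np.clip(image + rng.normal(0, sigma, image.shape), 0, 1) for _ in range(n)]

def majority_answer(embeddings, answers, tau=0.9):
    """Greedy cosine clustering of candidate answers; return one answer
    from the largest cluster (a simple stand-in for the aggregation step)."""
    seeds, members = [], []
    for k, e in enumerate(embeddings):
        e = e / np.linalg.norm(e)
        for s, m in zip(seeds, members):
            if float(e @ s) > tau:      # close enough to an existing cluster seed
                m.append(k)
                break
        else:
            seeds.append(e)
            members.append([k])
    biggest = max(members, key=len)
    return answers[biggest[0]]

# Toy run with a fake 8x8 "image", fake answers, and fake answer embeddings.
image = rng.uniform(0, 1, (8, 8))
copies = noisy_copies(image)
answers = ["a cat", "a cat", "a cat", "a dog", "a cat", "a cat", "a dog", "a cat"]
embeddings = [np.array([1.0, 0.0]) if "cat" in a else np.array([0.0, 1.0]) for a in answers]
print(majority_answer(embeddings, answers))   # "a cat"
```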

[LG-48] Layer of Truth: Probing Belief Shifts under Continual Pre-Training Poisoning

Link: https://arxiv.org/abs/2510.26829
Authors: Svetlana Churina, Niranjan Chebrolu, Kokil Jaidka
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Note:

Abstract:Large language models (LLMs) continually evolve through pre-training on ever-expanding web data, but this adaptive process also exposes them to subtle forms of misinformation. While prior work has explored data poisoning during static pre-training, the effects of such manipulations under continual pre-training remain largely unexplored. Drawing inspiration from the illusory truth effect in human cognition - where repeated exposure to falsehoods increases belief in their accuracy - we ask whether LLMs exhibit a similar vulnerability. We investigate whether repeated exposure to false but confidently stated facts can shift a model’s internal representation away from the truth. We introduce Layer of Truth, a framework and dataset for probing belief dynamics in continually trained LLMs. By injecting controlled amounts of poisoned data and probing intermediate representations across checkpoints, model scales, and question types, we quantify when and how factual beliefs shift. Our findings reveal that even minimal exposure can induce persistent representational drift in well-established facts, with susceptibility varying across layers and model sizes. These results highlight an overlooked vulnerability of continually updated LLMs: their capacity to internalize misinformation analogously to humans, underscoring the need for robust monitoring of factual integrity during model updates.

[LG-49] Bayesian model selection and misspecification testing in imaging inverse problems only from noisy and partial measurements

Link: https://arxiv.org/abs/2510.27663
Authors: Tom Sprunck, Marcelo Pereyra, Tobias Liaudat
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
Note:

Abstract:Modern imaging techniques heavily rely on Bayesian statistical models to address difficult image reconstruction and restoration tasks. This paper addresses the objective evaluation of such models in settings where ground truth is unavailable, with a focus on model selection and misspecification diagnosis. Existing unsupervised model evaluation methods are often unsuitable for computational imaging due to their high computational cost and incompatibility with modern image priors defined implicitly via machine learning models. We herein propose a general methodology for unsupervised model selection and misspecification detection in Bayesian imaging sciences, based on a novel combination of Bayesian cross-validation and data fission, a randomized measurement splitting technique. The approach is compatible with any Bayesian imaging sampler, including diffusion and plug-and-play samplers. We demonstrate the methodology through experiments involving various scoring rules and types of model misspecification, where we achieve excellent selection and detection accuracy with a low computational cost.
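
Data fission for Gaussian noise has a particularly clean form: each measurement is split into two independent copies, one for fitting or selection and one held out for validation. Below is a minimal sketch following the standard Gaussian fission construction (the paper applies the idea inside Bayesian imaging samplers, which this does not reproduce).

```python
import numpy as np

def fission(y, sigma, tau=1.0, seed=0):
    """Gaussian data fission: split y ~ N(mu, sigma^2) into two independent
    parts f = y + tau*z and g = y - z/tau with z ~ N(0, sigma^2).
    f can drive model fitting/selection while g gives an untouched copy."""
    z = np.random.default_rng(seed).normal(0, sigma, np.shape(y))
    return y + tau * z, y - z / tau

# Toy check of independence through the empirical correlation.
y = np.random.default_rng(11).normal(5.0, 2.0, 100_000)
f, g = fission(y, sigma=2.0)
print(f"corr(f, g) = {np.corrcoef(f, g)[0, 1]:+.3f}")   # approximately 0
```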

[LG-50] Bayesian Optimization on Networks

Link: https://arxiv.org/abs/2510.27643
Authors: Wenwen Li, Daniel Sanz-Alonso, Ruiyi Yang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Computation (stat.CO)
Note: 36 pages, 6 figures; includes appendices

Abstract:This paper studies optimization on networks modeled as metric graphs. Motivated by applications where the objective function is expensive to evaluate or only available as a black box, we develop Bayesian optimization algorithms that sequentially update a Gaussian process surrogate model of the objective to guide the acquisition of query points. To ensure that the surrogates are tailored to the network’s geometry, we adopt Whittle-Matérn Gaussian process prior models defined via stochastic partial differential equations on metric graphs. In addition to establishing regret bounds for optimizing sufficiently smooth objective functions, we analyze the practical case in which the smoothness of the objective is unknown and the Whittle-Matérn prior is represented using finite elements. Numerical results demonstrate the effectiveness of our algorithms for optimizing benchmark objective functions on a synthetic metric graph and for Bayesian inversion via maximum a posteriori estimation on a telecommunication network.

[LG-51] Optimal Convergence Analysis of DDPM for General Distributions

Link: https://arxiv.org/abs/2510.27562
Authors: Yuchen Jiao, Yuchen Zhou, Gen Li
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
Note:

Abstract:Score-based diffusion models have achieved remarkable empirical success in generating high-quality samples from target data distributions. Among them, the Denoising Diffusion Probabilistic Model (DDPM) is one of the most widely used samplers, generating samples via estimated score functions. Despite its empirical success, a tight theoretical understanding of DDPM – especially its convergence properties – remains limited. In this paper, we provide a refined convergence analysis of the DDPM sampler and establish near-optimal convergence rates under general distributional assumptions. Specifically, we introduce a relaxed smoothness condition parameterized by a constant $L$, which is small for many practical distributions (e.g., Gaussian mixture models). We prove that the DDPM sampler with accurate score estimates achieves a convergence rate of $\widetilde{O}\left(\frac{d\min\{d, L^2\}}{T^2}\right)$ in Kullback-Leibler divergence, where $d$ is the data dimension, $T$ is the number of iterations, and $\widetilde{O}$ hides polylogarithmic factors in $T$. This result substantially improves upon the best-known $d^2/T^2$ rate when $L < \sqrt{d}$. By establishing a matching lower bound, we show that our convergence analysis is tight for a wide array of target distributions. Moreover, it reveals that DDPM and DDIM share the same dependence on $d$, raising an interesting question of why DDIM often appears empirically faster.

[LG-52] pDANSE: Particle-based Data-driven Nonlinear State Estimation from Nonlinear Measurements

链接: https://arxiv.org/abs/2510.27503
作者: Anubhab Ghosh,Yonina C. Eldar,Saikat Chatterjee
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 11 pages, 10 figures, under review at IEEE Transactions on Signal Processing

点击查看摘要

Abstract:We consider the problem of designing a data-driven nonlinear state estimation (DANSE) method that uses (noisy) nonlinear measurements of a process whose underlying state transition model (STM) is unknown. Such a process is referred to as a model-free process. A recurrent neural network (RNN) provides parameters of a Gaussian prior that characterize the state of the model-free process, using all previous measurements at a given time point. In the case of DANSE, the measurement system was linear, leading to a closed-form solution for the state posterior. However, the presence of a nonlinear measurement system renders a closed-form solution infeasible. Instead, the second-order statistics of the state posterior are computed using the nonlinear measurements observed at the time point. We address the nonlinear measurements using a reparameterization trick-based particle sampling approach, and estimate the second-order statistics of the state posterior. The proposed method is referred to as particle-based DANSE (pDANSE). The RNN of pDANSE uses sequential measurements efficiently and avoids the use of computationally intensive sequential Monte-Carlo (SMC) and/or ancestral sampling. We describe the semi-supervised learning method for pDANSE, which transitions to unsupervised learning in the absence of labeled data. Using a stochastic Lorenz-63 system as a benchmark process, we experimentally demonstrate the state estimation performance for four nonlinear measurement systems. We explore cubic nonlinearity and a camera-model nonlinearity where unsupervised learning is used; then we explore half-wave rectification nonlinearity and Cartesian-to-spherical nonlinearity where semi-supervised learning is used. The performance of state estimation is shown to be competitive vis-à-vis particle filters that have complete knowledge of the STM of the Lorenz-63 system.
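
A minimal sketch of the particle step: given a Gaussian prior for the state (as produced by the RNN) and a nonlinear measurement, draw reparameterized particles and form self-normalized weighted moments. The prior-as-proposal weighting, particle count, and all names below are our simplification, not the paper's exact estimator.

```python
import numpy as np

def posterior_moments(mu, cov, y, h, noise_var, n_particles=5000, rng=0):
    """Approximate posterior mean/cov of x given prior N(mu, cov) and a
    nonlinear measurement y = h(x) + noise. Particles use the
    reparameterization x = mu + L @ eps, which would keep sampling
    differentiable w.r.t. (mu, cov) during training."""
    rng = np.random.default_rng(rng)
    L = np.linalg.cholesky(cov)
    eps = rng.normal(size=(n_particles, len(mu)))
    x = mu + eps @ L.T                               # reparameterized draws
    log_w = -0.5 * np.sum((y - h(x)) ** 2, axis=-1) / noise_var
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                                     # self-normalized weights
    mean = w @ x
    diff = x - mean
    return mean, (w[:, None] * diff).T @ diff

# toy: cubic measurement of a 2-D state
h = lambda x: x ** 3
y = h(np.array([0.8, -0.5])) + 0.05 * np.random.default_rng(1).normal(size=2)
m, C = posterior_moments(np.zeros(2), np.eye(2), y, h, noise_var=0.05 ** 2)
print(m)
```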

[LG-53] Minimax-Optimal Two-Sample Test with Sliced Wasserstein

链接: https://arxiv.org/abs/2510.27498
作者: Binh Thuan Tran,Nicolas Schreuder
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:We study the problem of nonparametric two-sample testing using the sliced Wasserstein (SW) distance. While prior theoretical and empirical work indicates that the SW distance offers a promising balance between strong statistical guarantees and computational efficiency, its theoretical foundations for hypothesis testing remain limited. We address this gap by proposing a permutation-based SW test and analyzing its performance. The test inherits finite-sample Type I error control from the permutation principle. Moreover, we establish non-asymptotic power bounds and show that the procedure achieves the minimax separation rate $n^{-1/2}$ over multinomial and bounded-support alternatives, matching the optimal guarantees of kernel-based tests while building on the geometric foundations of Wasserstein distances. Our analysis further quantifies the trade-off between the number of projections and statistical power. Finally, numerical experiments demonstrate that the test combines finite-sample validity with competitive power and scalability, and – unlike kernel-based tests, which require careful kernel tuning – it performs consistently well across all scenarios we consider.
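
A compact sketch of the proposed procedure, under the simplifying assumption of equal sample sizes (so sorted projections give the 1-D Wasserstein distance exactly); the number of projections n_proj is the trade-off parameter the abstract quantifies.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=64, rng=None):
    """Monte-Carlo sliced 2-Wasserstein distance between equal-size samples."""
    rng = np.random.default_rng(rng)
    theta = rng.normal(size=(n_proj, X.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # random directions
    px, py = np.sort(X @ theta.T, axis=0), np.sort(Y @ theta.T, axis=0)
    return np.sqrt(np.mean((px - py) ** 2))

def permutation_sw_test(X, Y, n_perm=200, n_proj=64, seed=0):
    """Permutation two-sample test with the SW statistic; returns p-value."""
    rng = np.random.default_rng(seed)
    stat = sliced_wasserstein(X, Y, n_proj, rng)
    Z, n = np.vstack([X, Y]), len(X)
    null = []
    for _ in range(n_perm):
        idx = rng.permutation(len(Z))          # relabel the pooled sample
        null.append(sliced_wasserstein(Z[idx[:n]], Z[idx[n:]], n_proj, rng))
    return (1 + sum(s >= stat for s in null)) / (1 + n_perm)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = rng.normal(0.5, 1.0, size=(100, 3))        # shifted mean
print(permutation_sw_test(X, Y))               # small p-value expected
```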

[LG-54] Estimation of aboveground biomass in a tropical dry forest: An intercomparison of airborne unmanned and space laser scanning

链接: https://arxiv.org/abs/2510.27408
作者: Nelson Mattié,Arturo Sanchez-Azofeifa,Pablo Crespo-Peremarch,Juan-Ygnacio López-Hernández
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 32 pages, 17 figures, research paper

点击查看摘要

Abstract:According to the Paris Climate Change Agreement, all nations are required to submit reports on their greenhouse gas emissions and absorption every two years by 2024. Consequently, forests play a crucial role in reducing carbon emissions, which is essential for meeting these obligations. Recognizing the significance of forest conservation in the global battle against climate change, Article 5 of the Paris Agreement emphasizes the need for high-quality forest data. This study focuses on enhancing methods for mapping aboveground biomass (AGB) in tropical dry forests. Tropical dry forests are considered one of the least understood tropical forest environments; therefore, there is a need for accurate approaches to estimate carbon pools. We employ a comparative analysis of AGB estimates, utilizing different discrete and full-waveform laser scanning datasets in conjunction with Ordinary Least Squares and Bayesian SVM approaches. Airborne Laser Scanning, Unmanned Laser Scanning, and Space Laser Scanning were used as independent variables for extracting forest metrics. Variable selection, SVM regression tuning, and cross-validation via a machine-learning approach were applied to account for overfitting and underfitting. The results indicate that six key variables, primarily related to tree height (this http URL, Elev.L3, this http URL, this http URL, this http URL, and this http URL), are important for AGB estimation using ALSD and ULSD, while Leaf Area Index, canopy coverage and height, terrain elevation, and full-waveform signal energy emerged as the most vital variables. AGB values estimated from ten permanent tropical dry forest plots in Costa Rica's Guanacaste province ranged from 26.02 Mg/ha to 175.43 Mg/ha. The SVM regressions demonstrated a 17.89 error across all laser scanning systems, with SLSFW exhibiting the lowest error (17.07) in estimating total biomass per plot.
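
The regression-tuning-plus-cross-validation step can be sketched as follows; the LiDAR metrics, AGB values, and hyperparameter grid below are synthetic stand-ins, not the study's data or settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical plot-level LiDAR metrics -> AGB (Mg/ha); data are synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))                      # six height/LAI-type metrics
agb = 100 + 25 * X[:, 0] + 10 * X[:, 1] + 5 * rng.normal(size=60)

# Cross-validated SVR tuning guards against over- and underfitting.
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
grid = GridSearchCV(svr,
                    {"svr__C": [1, 10, 100], "svr__epsilon": [0.1, 1.0, 5.0]},
                    cv=5, scoring="neg_root_mean_squared_error")
grid.fit(X, agb)
print("cross-validated RMSE:", -grid.best_score_)
print("best hyperparameters:", grid.best_params_)
```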

[LG-55] Interpretable Model-Aware Counterfactual Explanations for Random Forest

链接: https://arxiv.org/abs/2510.27397
作者: Joshua S. Harvey,Guanchao Feng,Sai Anusha Meesala,Tina Zhao,Dhagash Mehta
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Presented at XAI-FIN-2025: International Joint Workshop on Explainable AI in Finance: Achieving Trustworthy Financial Decision-Making; November 15, 2025; Singapore

点击查看摘要

Abstract:Despite their enormous predictive power, machine learning models are often unsuitable for applications in regulated industries such as finance, due to their limited capacity to provide explanations. While model-agnostic frameworks such as Shapley values have proved to be convenient and popular, they rarely align with the kinds of causal explanations that are typically sought after. Counterfactual case-based explanations, where an individual is informed of which circumstances would need to be different to cause a change in outcome, may be more intuitive and actionable. However, finding appropriate counterfactual cases is an open challenge, as is interpreting which features are most critical for the change in outcome. Here, we pose the question of counterfactual search and interpretation in terms of similarity learning, exploiting the representation learned by the random forest predictive model itself. Once a counterfactual is found, the feature importance of the explanation is computed as a function of which random forest partitions are crossed in order to reach it from the original instance. We demonstrate this method on both the MNIST hand-drawn digit dataset and the German credit dataset, finding that it generates explanations that are sparser and more useful than Shapley values.
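
A minimal sketch of the idea: use the forest's own leaf assignments as a similarity measure, pick the most similar training instance with the desired label as the counterfactual, then rank features by how much they changed. The importance proxy used here (spread-normalized feature deltas) is our simplification of the paper's crossed-partition count.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Similarity = fraction of trees in which two instances share a leaf;
# counterfactual = most similar training instance with the target label.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

def rf_counterfactual(x, target_label):
    leaves = rf.apply(X)                        # (n_samples, n_trees) leaf ids
    lx = rf.apply(x.reshape(1, -1))[0]
    sim = (leaves == lx).mean(axis=1)           # leaf co-occurrence similarity
    mask = rf.predict(X) == target_label
    cf = X[np.argmax(np.where(mask, sim, -1.0))]
    # proxy importance: how far each feature moved, relative to its spread
    delta = np.abs(cf - x) / (X.std(axis=0) + 1e-9)
    return cf, np.argsort(delta)[::-1]

x0 = X[0]
flipped = 1 - rf.predict(x0.reshape(1, -1))[0]
cf, ranked = rf_counterfactual(x0, target_label=flipped)
print("most critical features:", ranked[:3])
```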

[LG-56] On the Equivalence of Optimal Transport Problem and Action Matching with Optimal Vector Fields

链接: https://arxiv.org/abs/2510.27385
作者: Nikita Kornilov,Alexander Korotin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Flow Matching (FM) method in generative modeling maps arbitrary probability distributions by constructing an interpolation between them and then learning the vector field that defines the ODE for this interpolation. Recently, it was shown that FM can be modified to map distributions optimally in terms of the quadratic cost function for any initial interpolation. To achieve this, only specific optimal vector fields, which are typical for solutions of Optimal Transport (OT) problems, need to be considered during FM loss minimization. In this note, we show that considering only optimal vector fields can lead to OT in another approach: Action Matching (AM). Unlike FM, which learns a vector field for a manually chosen interpolation between given distributions, AM learns the vector field that defines the ODE for an entire given sequence of distributions.
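
For orientation, the two objectives being compared can be written as follows; the linear interpolation and loss forms are the textbook versions, used here only to fix notation and not necessarily the exact formulation of this note.

```latex
% Flow Matching: fix an interpolation x_t between x_0 ~ p_0 and x_1 ~ p_1,
% e.g. the linear one x_t = (1 - t) x_0 + t x_1, and regress a vector field
% onto its velocity:
\[
  \mathcal{L}_{\mathrm{FM}}(v)
    = \mathbb{E}_{t,\,x_0,\,x_1}\,
      \bigl\| v(x_t, t) - (x_1 - x_0) \bigr\|^2 .
\]
% Action Matching: given an entire curve of distributions (p_t)_{t \in [0,1]},
% learn the field that transports it along the continuity equation
\[
  \partial_t p_t + \nabla \cdot (p_t\, v_t) = 0 .
\]
% The note's point, paraphrased: restricting both problems to the "optimal"
% vector fields typical of quadratic-cost OT solutions links AM, like FM,
% to the Optimal Transport map between the endpoint distributions.
```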

[LG-57] When AI Trading Agents Compete: Adverse Selection of Meta-Orders by Reinforcement Learning-Based Market Making

链接: https://arxiv.org/abs/2510.27334
作者: Ali Raza Jafree,Konark Jain,Nick Firoozye
类目: Trading and Market Microstructure (q-fin.TR); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-58] SERVIMON: AI-Driven Predictive Maintenance and Real-Time Monitoring for Astronomical Observatories

链接: https://arxiv.org/abs/2510.27146
作者: Emilio Mastriani,Alessandro Costa,Federico Incardona,Kevin Munari,Sebastiano Spinello
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: Accepted for publication in IAU Symposium 397: Exploring the Universe with Artificial Intelligence (UniversAI 2025), Cambridge University Press. Editors: C. Sterken, J. Hearnshaw & D. Valls-Gabaud

点击查看摘要

Abstract:Objective: ServiMon is designed to offer a scalable and intelligent pipeline for data collection and auditing to monitor distributed astronomical systems such as the ASTRI Mini-Array. The system enhances quality control, predictive maintenance, and real-time anomaly detection for telescope operations. Methods: ServiMon integrates cloud-native technologies, including Prometheus, Grafana, Cassandra, Kafka, and InfluxDB, for telemetry collection and processing. It employs machine learning algorithms, notably Isolation Forest, to detect anomalies in Cassandra performance metrics. Key indicators such as read/write latency, throughput, and memory usage are continuously monitored, stored as time-series data, and preprocessed for feature engineering. Anomalies detected by the model are logged in InfluxDB v2 and accessed via Flux for real-time monitoring and visualization. Results: AI-based anomaly detection increases system resilience by identifying performance degradation at an early stage, minimizing downtime, and optimizing telescope operations. Additionally, ServiMon supports astrostatistical analysis by correlating telemetry with observational data, thus enhancing scientific data quality. AI-generated alerts also improve real-time monitoring, enabling proactive system management. Conclusion: ServiMon's scalable framework proves effective for predictive maintenance and real-time monitoring of astronomical infrastructures. By leveraging cloud and edge computing, it is adaptable to future large-scale experiments, optimizing both performance and cost. The combination of machine learning and big data analytics makes ServiMon a robust and flexible solution for modern and next-generation observational astronomy.
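
The anomaly-detection core reduces to a few lines; the metric names below mirror those in the abstract, but the data, contamination rate, and injected faults are synthetic stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical telemetry frame: one row per scrape of Cassandra metrics.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "read_latency_ms":  rng.gamma(2.0, 1.5, 1000),
    "write_latency_ms": rng.gamma(2.0, 1.0, 1000),
    "throughput_ops":   rng.normal(5000, 300, 1000),
    "memory_used_mb":   rng.normal(2048, 100, 1000),
})
df.iloc[::97] *= 3                           # inject a few degraded scrapes

model = IsolationForest(contamination=0.02, random_state=0).fit(df)
df["anomaly"] = model.predict(df.iloc[:, :4]) == -1  # True => raise an alert
print(df["anomaly"].sum(), "anomalous scrapes flagged")
```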

[LG-59] Overspecified Mixture Discriminant Analysis: Exponential Convergence Statistical Guarantees and Remote Sensing Applications

链接: https://arxiv.org/abs/2510.27056
作者: Arman Bolatov,Alan Legg,Igor Melnykov,Amantay Nurlanuly,Maxat Tezekbayev,Zhenisbek Assylbekov
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This study explores the classification error of Mixture Discriminant Analysis (MDA) in scenarios where the number of mixture components exceeds those present in the actual data distribution, a condition known as overspecification. We use a two-component Gaussian mixture model within each class to fit data generated from a single Gaussian, analyzing both the algorithmic convergence of the Expectation-Maximization (EM) algorithm and the statistical classification error. We demonstrate that, with suitable initialization, the EM algorithm converges exponentially fast to the Bayes risk at the population level. Further, we extend our results to finite samples, showing that the classification error converges to the Bayes risk at a rate of $n^{-1/2}$ under mild conditions on the initial parameter estimates and sample size. This work provides a rigorous theoretical framework for understanding the performance of overspecified MDA, which is often used empirically in complex data settings, such as image and text classification. To validate our theory, we conduct experiments on remote sensing datasets.
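
A self-contained toy of the setting: each class is truly one Gaussian, each class model is a two-component mixture fit by EM, and classification compares class-conditional mixture likelihoods. The initialization and iteration count are arbitrary choices, not the paper's conditions.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(-1.0, 1.0, 2000)   # class 0: one true Gaussian
x1 = rng.normal(+1.0, 1.0, 2000)   # class 1: one true Gaussian

def em_two_component(x, n_iter=100):
    """Overspecified fit: two components for single-Gaussian data."""
    mu = np.array([x.mean() - 0.5, x.mean() + 0.5])    # perturbed init
    sigma, pi = np.array([x.std(), x.std()]), np.array([0.5, 0.5])
    for _ in range(n_iter):
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        r = dens / dens.sum(axis=1, keepdims=True)     # E-step
        nk = r.sum(axis=0)                             # M-step
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)
    return mu, sigma, pi

def mix_pdf(x, params):
    mu, sigma, pi = params
    return (pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2)
            / (sigma * np.sqrt(2 * np.pi))).sum(axis=1)

p0, p1 = em_two_component(x0), em_two_component(x1)
xt = np.concatenate([rng.normal(-1, 1, 1000), rng.normal(1, 1, 1000)])
yt = np.repeat([0, 1], 1000)
pred = (mix_pdf(xt, p1) > mix_pdf(xt, p0)).astype(int)
print("test error:", (pred != yt).mean(), "(Bayes risk is about 0.159 here)")
```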

[LG-60] Accelerating Radiative Transfer for Planetary Atmospheres by Orders of Magnitude with a Transformer-Based Machine Learning Model

链接: https://arxiv.org/abs/2510.27050
作者: Isaac Malsky,Tiffany Kataria,Natasha E. Batalha,Matthew Graham
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Radiative transfer calculations are essential for modeling planetary atmospheres. However, standard methods are computationally demanding and impose accuracy-speed trade-offs. High computational costs force numerical simplifications in large models (e.g., General Circulation Models) that degrade the accuracy of the simulation. Radiative transfer calculations are an ideal candidate for machine learning emulation: fundamentally, it is a well-defined physical mapping from a static atmospheric profile to the resulting fluxes, and high-fidelity training data can be created from first principles calculations. We developed a radiative transfer emulator using an encoder-only transformer neural network architecture, trained on 1D profiles representative of solar-composition hot Jupiter atmospheres. Our emulator reproduced bolometric two-stream layer fluxes with mean test set errors of ~1% compared to the traditional method and achieved speedups of 100x. Emulating radiative transfer with machine learning opens up the possibility for faster and more accurate routines within planetary atmospheric models such as GCMs.
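
An encoder-only transformer emulator of this kind can be sketched in a few lines of PyTorch; the feature set, layer counts, and sizes below are guesses for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn

class RTEmulator(nn.Module):
    """Encoder-only transformer: per-layer atmospheric state in,
    per-layer (upward, downward) bolometric fluxes out."""
    def __init__(self, n_features=4, d_model=64, n_heads=4, n_layers=3):
        super().__init__()
        self.embed = nn.Linear(n_features, d_model)
        enc = nn.TransformerEncoderLayer(d_model, n_heads,
                                         dim_feedforward=128,
                                         batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, n_layers)
        self.head = nn.Linear(d_model, 2)      # upward + downward flux

    def forward(self, profile):                # (batch, n_layers, n_features)
        return self.head(self.encoder(self.embed(profile)))

# features per layer might be (pressure, temperature, two abundances)
model = RTEmulator()
profile = torch.randn(8, 60, 4)                # batch of 8 columns, 60 layers
fluxes = model(profile)                        # (8, 60, 2)
print(fluxes.shape)
```

Trained against fluxes from a first-principles solver, a model like this replaces the expensive inner loop of a GCM call with a single forward pass, which is where the quoted ~100x speedup would come from.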

[LG-61] GeoPep: A geometry-aware masked language model for protein-peptide binding site prediction

链接: https://arxiv.org/abs/2510.27040
作者: Dian Chen,Yunkai Chen,Tong Lin,Sijie Chen,Xiaolin Cheng
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 11 pages, 5 figures

点击查看摘要

[LG-62] Multi-Representation Attention Framework for Underwater Bioacoustic Denoising and Recognition

链接: https://arxiv.org/abs/2510.26838
作者: Amine Razig,Youssef Soulaymani,Loubna Benabbou,Pierre Cauchy
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD); Applications (stat.AP); Machine Learning (stat.ML)
*备注:

点击查看摘要

[LG-63] Toward precision soil health: A regional framework for site-specific management across Missouri

链接: https://arxiv.org/abs/2510.26815
作者: Dipal Shah(1),Jordon Wade(2),Timothy Haithcoat(3),Robert Myers(4),Kelly Wilson(1) ((1) School of Natural Resources, University of Missouri, Columbia, MO, USA, (2) Crop Protection Research & Development, Syngenta Group, Basel, Switzerland, (3) Institute for Data Science and Informatics, University of Missouri, Columbia, MO, USA, (4) School of Plant Science & Technology, University of Missouri, Columbia, MO, USA)
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: 20 pages, 5 figures

点击查看摘要

Abstract:Effective soil health management is crucial for sustaining agriculture, fostering ecosystem resilience, and preserving water quality. However, Missouri's diverse landscapes limit the effectiveness of broad, generalized management recommendations. The lack of resolution in existing soil grouping systems necessitates data-driven, site-specific insights to guide tailored interventions. To address these critical challenges, we developed a regional soil clustering framework designed to support precision soil health management strategies across the state. The methodology leveraged the high-resolution SSURGO dataset, explicitly processing soil properties aggregated across the 0 to 30 cm root zone. Multivariate analysis incorporating a variational autoencoder and KMeans clustering was used to group soils with similar properties. The derived clusters were validated using statistical metrics, including silhouette scores and checks against existing taxonomic units, to confirm their spatial coherence. This approach enabled us to delineate soil groups that capture the textures, hydraulic properties, chemical fertility, and biological indicators unique to Missouri's diverse agroecological regions. The clustering map identified ten distinct soil health management zones. This alignment of 10 clusters was selected as optimal because it was sufficiently large to capture inherited soil patterns while remaining manageable for practical statewide application. Rooting depth limitation and saturated hydraulic conductivity emerged as the principal variables driving soil differentiation. Each management zone is defined by a unique combination of clay, organic matter, pH, and available water capacity. This framework bridges sophisticated data analysis with actionable, site-targeted recommendations, enabling conservation planners and agronomists to optimize management practices and enhance resource efficiency statewide.
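
A minimal sketch of the VAE-plus-KMeans pipeline on standardized (here synthetic) soil features; the latent size, loss weighting, and training loop are illustrative assumptions, with only the ten-cluster choice echoing the abstract.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class VAE(nn.Module):
    """Tiny VAE: standardized soil properties -> 2-D latent -> reconstruction."""
    def __init__(self, n_feat=8, n_latent=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_feat, 32), nn.ReLU())
        self.mu = nn.Linear(32, n_latent)
        self.logvar = nn.Linear(32, n_latent)
        self.dec = nn.Sequential(nn.Linear(n_latent, 32), nn.ReLU(),
                                 nn.Linear(32, n_feat))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(z), mu, logvar

torch.manual_seed(0)
x = torch.randn(5000, 8)        # stand-in for clay, OM, pH, AWC, Ksat, depth, ...
vae = VAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
for _ in range(200):            # full-batch training, for brevity
    xhat, mu, logvar = vae(x)
    recon = ((xhat - x) ** 2).mean()
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).mean()
    opt.zero_grad(); (recon + 0.1 * kl).backward(); opt.step()

with torch.no_grad():
    latent = vae(x)[1].numpy()  # cluster on the posterior means
zones = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(latent)
print("zone sizes:", np.bincount(zones))
```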

[LG-64] Towards Gaussian processes modelling to study the late effects of radiotherapy in children and young adults with brain tumours

链接: https://arxiv.org/abs/2510.26814
作者: Angela Davey,Arthur Leroy,Eliana Vasquez Osorio,Kate Vaughan,Peter Clayton,Marcel van Herk,Mauricio A Alvarez,Martin McCabe,Marianne Aznar
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: Presented at the XXth International Conference on the Use of Computers in Radiation Therapy

点击查看摘要

Abstract:Survivors of childhood cancer need lifelong monitoring for side effects from radiotherapy. However, longitudinal data from routine monitoring is often infrequently and irregularly sampled, and subject to inaccuracies. Due to this, measurements are often studied in isolation, or simple relationships (e.g., linear) are used to impute missing timepoints. In this study, we investigated the potential role of Gaussian Processes (GP) modelling to make population-based and individual predictions, using insulin-like growth factor 1 (IGF-1) measurements as a test case. With training data of 23 patients with a median (range) of 4 (1-16) timepoints we identified a trend within the range of literature reported values. In addition, with 8 test cases, individual predictions were made with an average root mean squared error of 31.9 (10.1 - 62.3) ng/ml and 27.4 (0.02 - 66.1) ng/ml for two approaches. GP modelling may overcome limitations of routine longitudinal data and facilitate analysis of late effects of radiotherapy.
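
The modelling step amounts to fitting a GP to a handful of irregular timepoints and reading off a predictive mean with uncertainty at unobserved times; the kernel choice and the toy patient below are our assumptions, not the study's fitted model.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel

# Sparse, irregular follow-up times (years since radiotherapy) and IGF-1
# values (ng/ml) for one hypothetical patient.
t = np.array([0.3, 1.1, 2.7, 5.4]).reshape(-1, 1)
igf1 = np.array([210.0, 245.0, 230.0, 190.0])

# White-noise term absorbs measurement inaccuracy; RBF captures the trend.
kernel = ConstantKernel(1.0) * RBF(length_scale=2.0) + WhiteKernel(10.0)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(t, igf1)

t_grid = np.linspace(0, 8, 50).reshape(-1, 1)
mean, sd = gp.predict(t_grid, return_std=True)   # imputed trajectory + band
i = 25                                           # grid point near 4 years
print(f"predicted IGF-1 at ~4y: {mean[i]:.1f} +/- {sd[i]:.1f} ng/ml")
```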

[LG-65] A Machine Learning-Based Framework to Shorten the Questionnaire for Assessing Autism Intervention

链接: https://arxiv.org/abs/2510.26808
作者: Audrey Dong,Claire Xu,Samuel R. Guo,Kevin Yang,Xue-Jun Kong
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注: 10 pages, 16 figures

点击查看摘要

Abstract:Caregivers of individuals with autism spectrum disorder (ASD) often find the 77-item Autism Treatment Evaluation Checklist (ATEC) burdensome, limiting its use for routine monitoring. This study introduces a generalizable machine learning framework that seeks to shorten assessments while maintaining evaluative accuracy. Using longitudinal ATEC data from 60 autistic children receiving therapy, we applied feature selection and cross-validation techniques to identify the most predictive items across two assessment goals: longitudinal therapy tracking and point-in-time severity estimation. For progress monitoring, the framework identified 16 items (21% of the original questionnaire) that retained strong correlation with total score change and full subdomain coverage. We also generated smaller subsets (1-7 items) for efficient approximations. For point-in-time severity assessment, our model achieved over 80% classification accuracy using just 13 items (17% of the original set). While demonstrated on ATEC, the methodology-based on subset optimization, model interpretability, and statistical rigor-is broadly applicable to other high-dimensional psychometric tools. The resulting framework could potentially enable more accessible, frequent, and scalable assessments and offer a data-driven approach for AI-supported interventions across neurodevelopmental and psychiatric contexts.
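
The subset-optimization idea can be sketched with a generic selector; the latent-severity data below is synthetic, the subset sizes merely echo the abstract, and the paper's actual selection and validation pipeline is richer.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in: 77 item scores driven by one latent severity factor;
# the target is the full-questionnaire total a short form must track.
rng = np.random.default_rng(0)
severity = rng.normal(size=(200, 1))
loadings = rng.uniform(0.2, 1.0, size=77)
items = severity * loadings + 0.3 * rng.normal(size=(200, 77))
total = items.sum(axis=1)

# How well do the k most predictive items approximate the full total score?
for k in (7, 13, 16):
    model = make_pipeline(SelectKBest(f_regression, k=k), Ridge())
    r2 = cross_val_score(model, items, total, cv=5, scoring="r2").mean()
    print(f"{k:>2} items: cross-validated R^2 = {r2:.3f}")
```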

[LG-66] Diabetes Lifestyle Medicine Treatment Assistance Using Reinforcement Learning

链接: https://arxiv.org/abs/2510.26807
作者: Yuhan Tang
类目: Applications (stat.AP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Type 2 diabetes prevention and treatment can benefit from personalized lifestyle prescriptions. However, the delivery of personalized lifestyle medicine prescriptions is limited by the shortage of trained professionals and the variability in physicians' expertise. We propose an offline contextual bandit approach that learns individualized lifestyle prescriptions from the aggregated NHANES profiles of 119,555 participants by minimizing the Magni glucose risk-reward function. The model encodes patient status and generates lifestyle medicine prescriptions; it is trained using a mixed-action Soft Actor-Critic algorithm, with the task treated as a single-step contextual bandit. The model is validated against lifestyle medicine prescriptions issued by three certified physicians from Xiangya Hospital. These results demonstrate that offline mixed-action SAC can generate risk-aware lifestyle medicine prescriptions from cross-sectional NHANES data, warranting prospective clinical validation.

信息检索

[IR-0] Interact-RAG : Reason and Interact with the Corpus Beyond Black-Box Retrieval

链接: https://arxiv.org/abs/2510.27566
作者: Yulong Hui,Chao Chen,Zhihang Fu,Yihao Liu,Jieping Ye,Huanchen Zhang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has significantly enhanced LLMs by incorporating external information. However, prevailing agentic RAG approaches are constrained by a critical limitation: they treat the retrieval process as a black-box querying operation. This confines agents’ actions to query issuing, hindering its ability to tackle complex information-seeking tasks. To address this, we introduce Interact-RAG, a new paradigm that elevates the LLM agent from a passive query issuer into an active manipulator of the retrieval process. We dismantle the black-box with a Corpus Interaction Engine, equipping the agent with a set of action primitives for fine-grained control over information retrieval. To further empower the agent on the entire RAG pipeline, we first develop a reasoning-enhanced workflow, which enables both zero-shot execution and the synthesis of interaction trajectories. We then leverage this synthetic data to train a fully autonomous end-to-end agent via Supervised Fine-Tuning (SFT), followed by refinement with Reinforcement Learning (RL). Extensive experiments across six benchmarks demonstrate that Interact-RAG significantly outperforms other advanced methods, validating the efficacy of our reasoning-interaction strategy.

[IR-1] Research Output of Webology Journal (2013-2017): A Scientometric Analysis

链接: https://arxiv.org/abs/2510.27259
作者: Muneer Ahmad,M. Sadik Batcha,Basharat Ahmad Wani,Mohammad Idrees Khan,S. Roselin Jahina
类目: Digital Libraries (cs.DL); Information Retrieval (cs.IR)
*备注: 13 pages, 3 figures, Research Paper

点击查看摘要

Abstract:Webology is an international peer-reviewed journal in English devoted to the field of the World Wide Web, serving as a forum for discussion, experimentation, and new research on information dissemination and communication processes in general, and in the context of the World Wide Web in particular. This paper presents a scientometric analysis of the Webology Journal. It analyses the pattern of growth of the research output published in the journal, the pattern of authorship, author productivity, and the subjects covered by the papers over the period 2013-2017. It is found that 62 papers were published during the period of study. The maximum number of articles were collaborative in nature. The dominant subjects in the journal were Social Networking/Web 2.0/Library 2.0 and Scientometrics/Bibliometrics. Iranian researchers contributed the maximum number of articles (37.10%). The study applied standard formulae and statistical tools to bring out the factual results.

[IR-2] A Survey on Deep Text Hashing: Efficient Semantic Text Retrieval with Binary Representation

链接: https://arxiv.org/abs/2510.27232
作者: Liyang He,Zhenya Huang,Cheng Yang,Rui Li,Zheng Zhang,Kai Zhang,Zhi Li,Qi Liu,Enhong Chen
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:With the rapid growth of textual content on the Internet, efficient large-scale semantic text retrieval has garnered increasing attention from both academia and industry. Text hashing, which projects original texts into compact binary hash codes, is a crucial method for this task. By using binary codes, the semantic similarity computation for text pairs is significantly accelerated via fast Hamming distance calculations, and storage costs are greatly reduced. With the advancement of deep learning, deep text hashing has demonstrated significant advantages over traditional, data-independent hashing techniques. By leveraging deep neural networks, these methods can learn compact and semantically rich binary representations directly from data, overcoming the performance limitations of earlier approaches. This survey investigates current deep text hashing methods by categorizing them based on their core components: semantic extraction, hash code quality preservation, and other key technologies. We then present a detailed evaluation schema with results on several popular datasets, followed by a discussion of practical applications and open-source tools for implementation. Finally, we conclude by discussing key challenges and future research directions, including the integration of deep text hashing with large language models to further advance the field. The project for this survey can be accessed at this https URL.
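
The payoff the survey describes, fast Hamming-distance search over compact binary codes, looks like this in miniature; the random projection below stands in for a trained deep hashing network, and all sizes are illustrative.

```python
import numpy as np

# Toy "deep" hashing: any encoder producing real vectors can be binarized
# with sign(); a random projection stands in for the learned network here.
rng = np.random.default_rng(0)
d_emb, n_bits, n_docs = 128, 64, 100_000
W = rng.normal(size=(d_emb, n_bits))             # stand-in for a trained head

def to_codes(emb):
    bits = (emb @ W > 0).astype(np.uint8)        # (n, 64) binary code
    return np.packbits(bits, axis=1).view(np.uint64).ravel()  # one word each

docs = rng.normal(size=(n_docs, d_emb))
codes = to_codes(docs)
query = to_codes(docs[:1] + 0.1 * rng.normal(size=(1, d_emb)))[0]

# Hamming distance via XOR + popcount: one 64-bit word per document, so
# similarity search costs a bitwise op instead of a float dot product.
xor = codes ^ query
dist = np.unpackbits(xor.view(np.uint8).reshape(-1, 8), axis=1).sum(axis=1)
print("nearest doc:", np.argmin(dist), "distance:", dist.min())
```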

[IR-3] A Survey on Generative Recommendation: Data Model and Tasks

链接: https://arxiv.org/abs/2510.27157
作者: Min Hou,Le Wu,Yuxin Liao,Yonghui Yang,Zhen Zhang,Changlong Zheng,Han Wu,Richang Hong
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Recommender systems serve as foundational infrastructure in modern information ecosystems, helping users navigate digital content and discover items aligned with their preferences. At their core, recommender systems address a fundamental problem: matching users with items. Over the past decades, the field has experienced successive paradigm shifts, from collaborative filtering and matrix factorization in the machine learning era to neural architectures in the deep learning era. Recently, the emergence of generative models, especially large language models (LLMs) and diffusion models, have sparked a new paradigm: generative recommendation, which reconceptualizes recommendation as a generation task rather than discriminative scoring. This survey provides a comprehensive examination through a unified tripartite framework spanning data, model, and task dimensions. Rather than simply categorizing works, we systematically decompose approaches into operational stages-data augmentation and unification, model alignment and training, task formulation and execution. At the data level, generative models enable knowledge-infused augmentation and agent-based simulation while unifying heterogeneous signals. At the model level, we taxonomize LLM-based methods, large recommendation models, and diffusion approaches, analyzing their alignment mechanisms and innovations. At the task level, we illuminate new capabilities including conversational interaction, explainable reasoning, and personalized content generation. We identify five key advantages: world knowledge integration, natural language understanding, reasoning capabilities, scaling laws, and creative generation. We critically examine challenges in benchmark design, model robustness, and deployment efficiency, while charting a roadmap toward intelligent recommendation assistants that fundamentally reshape human-information interaction.

[IR-4] Compass: General Filtered Search across Vector and Structured Data

链接: https://arxiv.org/abs/2510.27141
作者: Chunxiao Ye,Xiao Yan,Eric Lo
类目: Databases (cs.DB); Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:The increasing prevalence of hybrid vector and relational data necessitates efficient, general support for queries that combine high-dimensional vector search with complex relational filtering. However, existing filtered search solutions are fundamentally limited by specialized indices, which restrict arbitrary filtering and hinder integration with general-purpose DBMSs. This work introduces Compass, a unified framework that enables general filtered search across vector and structured data without relying on new index designs. Compass leverages established index structures – such as HNSW and IVF for vector attributes, and B+-trees for relational attributes – implementing a principled cooperative query execution strategy that coordinates candidate generation and predicate evaluation across modalities. Uniquely, Compass maintains generality by allowing arbitrary conjunctions, disjunctions, and range predicates, while ensuring robustness even with highly-selective or multi-attribute filters. Comprehensive empirical evaluations demonstrate that Compass consistently outperforms NaviX, the only existing performant general framework, across diverse hybrid query workloads. It also matches the query throughput of specialized single-attribute indices in their favored settings with only a single attribute involved, all while maintaining full generality and DBMS compatibility. Overall, Compass offers a practical and robust solution for achieving truly general filtered search in vector database systems.

[IR-5] DGAI: Decoupled On-Disk Graph-Based ANN Index for Efficient Updates and Queries

链接: https://arxiv.org/abs/2510.25401
作者: Jiahao Lou,Quan Yu,Shufeng Gong,Song Yu,Yanfeng Zhang,Ge Yu
类目: Databases (cs.DB); Information Retrieval (cs.IR)
*备注: 12 pages

点击查看摘要

Abstract:On-disk graph-based indexes are widely used in approximate nearest neighbor (ANN) search systems for large-scale, high-dimensional vectors. However, traditional coupled storage methods, which store vectors within the index, are inefficient for index updates. Coupled storage incurs excessive redundant vector reads and writes when updating the graph topology, leading to significant invalid I/O. To address this issue, we propose a decoupled storage architecture. However, a decoupled architecture by itself reduces query performance. To overcome this limitation, we design two tailored strategies: (i) a three-stage query mechanism that leverages multiple PQ compressed vectors to filter invalid I/O and computations, and (ii) an incremental page-level topological reordering strategy that incrementally inserts new nodes into pages containing their most similar neighbors to mitigate read amplification. Together, these techniques substantially reduce both I/O and computational overhead during ANN search. Experimental results show that the decoupled architecture improves update speed by 10.05x for insertions and 6.89x for deletions, while the three-stage query and incremental reordering enhance query efficiency by 2.66x compared to the traditional coupled architecture.
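
The first stage of such a pipeline, filtering with cheap product-quantization (PQ) distances before any exact vector is read from disk, can be sketched as follows; the codebooks here are random rather than trained, so this is a shape-level illustration only, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, ks = 64, 8, 256                 # dim, subspaces, centroids per subspace
sub = d // m
codebooks = rng.normal(size=(m, ks, sub))          # assumed pre-trained

def pq_encode(x):
    """Quantize each vector to m one-byte centroid ids."""
    codes = np.empty((len(x), m), dtype=np.uint8)
    for j in range(m):
        seg = x[:, j * sub:(j + 1) * sub]
        dists = ((seg[:, None] - codebooks[j]) ** 2).sum(-1)
        codes[:, j] = dists.argmin(1)
    return codes

def pq_distances(q, codes):
    """Asymmetric distances via a per-query lookup table: no raw vectors read."""
    lut = np.stack([((q[j * sub:(j + 1) * sub] - codebooks[j]) ** 2).sum(-1)
                    for j in range(m)])            # (m, ks)
    return lut[np.arange(m), codes].sum(1)         # (n,)

base = rng.normal(size=(10_000, d))
codes = pq_encode(base)
q = base[42] + 0.05 * rng.normal(size=d)
approx = pq_distances(q, codes)
candidates = np.argsort(approx)[:32]  # only these trigger exact vector reads
print(42 in candidates)
```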

附件下载

点击下载今日全部论文列表